html – Python fetch Title and pdf link from a URL-Exceptionshub

Posted by: admin February 24, 2020

Questions:

I'm trying to fetch each book's title and its embedded URL link from a web page. The HTML source of the page looks like the excerpt below; I have taken only a small portion of it to show the structure.

A small portion of the page's HTML source follows:

<section>
  <div class="book row" isbn-data="1601982941">
    <div class="col-lg-3">
      <div class="book-cats">Artificial Intelligence</div>
      <div style="width:100%;">
        <img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
      </div>
    </div>
    <div class="col-lg-6">
      <div class="star-ratings"></div>
      <h2>Learning Deep Architectures for AI</h2>
      <span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
      <div class="meta-auth-ttl"></div>
      <p>Foundations and Trends(r) in Machine Learning.</p>
      <div>
        <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
        <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
      </div>
    </div>
  </div>
</section>
<section>
  <div class="book row" isbn-data="1496034023">
    <div class="col-lg-3">
      <div class="book-cats">Artificial Intelligence</div>
      <div style="width:100%;">
        <img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
      </div>
    </div>
    <div class="col-lg-6">
      <div class="star-ratings"></div>
      <h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
      <span class="meta-auth"><b>Roberto Battiti &amp; Mauro Brunato, 2013</b></span>
      <div class="meta-auth-ttl"></div>
      <p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
      <div>
        <a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
        <a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
      </div>
    </div>
  </div>
</section>

I have tried the code below:

This code fetches only the book titles, and the output still includes the <h2> tags. I would like to print each book's title together with its PDF link.

#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq


web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()

soup = bs(web_res, 'html.parser')

headers = soup.find_all(['h2'])
print(*headers, sep='\n')

#divs = soup.find_all('div')
#print(*divs, sep="\n\n")

header_1 = soup.find_all('h2', class_='book-container')
print(header_1)

output:

<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>

Desired Output:

Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf

Please help me understand how to achieve this. I have googled around, but due to lack of knowledge I'm unable to work it out. When I look at the HTML source there are a lot of divs and classes, so I'm a little confused about which class to use to fetch the href and the h2.

Answers:

The HTML is very nicely structured and you can make use of that here. The site evidently uses Bootstrap as a style scaffolding (with row and col-[size]-[gridcount] classes), which you can mostly ignore.

You essentially have:

  • a <div class="book"> per book
    • a column with
      • <div class="book-cats"> category and
      • image
    • a second column with
      • <div class="star-ratings"> ratings block
      • <h2> book title
      • <span class="meta-auth"> author line
      • <p> book description
      • two links with <a class="btn" ...>

Most of those can be ignored. Both the title and your desired link are the first element of their type within the book div, so you could just use element.nested_element attribute access (for example book.h2 or book.a) to grab either.
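As a minimal sketch of that attribute-style navigation, the snippet below parses a trimmed copy of one book block from the question (the SAMPLE string is an abbreviated stand-in, not the full page) and reads the first h2 and the first a inside it. It assumes Beautiful Soup 4 is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one book block from the question's HTML (assumption:
# the real page follows the same layout for every book).
SAMPLE = """
<div class="book row" isbn-data="1601982941">
  <div class="col-lg-6">
    <h2>Learning Deep Architectures for AI</h2>
    <div>
      <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
      <a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
book = soup.find("div", class_="book")

# book.h2 is shorthand for book.find("h2"): the first <h2> descendant.
title = book.h2.get_text(strip=True)
# book.a is the first <a> descendant, here the "View Free Book" link.
link = book.a["href"]

print("Title:", title)
print("Link:", link)
```

Attribute access returns None when no such descendant exists, so on real pages it is worth checking the result before indexing into it.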

So all you have to do is

  • loop over all the book divs.
  • for every such div, take the h2 and the first a element.
  • for the title, take the contained text of the h2.
  • for the link, take the href attribute of the a element.

like this:

for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)

I used .select_one() with a CSS selector to be a bit more specific about which link element to accept; .btn specifies the class and [href] requires that an href attribute be present.

I also enhanced the book search by limiting it to divs that have both a title and at least one link; the :has(...) selector limits matches to elements that contain the specified descendant elements.

The above produces:

Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
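To see what the :has(...) filter contributes on its own, here is a small sketch using made-up markup: the second and third div.book entries are hypothetical and exist only to show incomplete blocks being skipped. It assumes a Beautiful Soup version whose .select() supports :has (4.7 or later, via the soupsieve package).

```python
from bs4 import BeautifulSoup

# Three book divs: one complete, one without a link, one without a title.
# The last two are hypothetical entries added for illustration.
SAMPLE = """
<div class="book row">
  <h2>Learning Deep Architectures for AI</h2>
  <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf">View Free Book</a>
</div>
<div class="book row">
  <h2>Hypothetical entry without any link</h2>
</div>
<div class="book row">
  <a class="btn" href="http://example.com/untitled.pdf">View Free Book</a>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Every div with class "book", complete or not.
all_books = soup.select("div.book")
# Only divs that contain both an <h2> and an <a class="btn"> with an href.
complete_books = soup.select("div.book:has(h2):has(a.btn[href])")

print(len(all_books), "book divs,", len(complete_books), "with title and link")
for book in complete_books:
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    print(title, "->", link)
```

Filtering in the selector keeps the loop body free of None checks, since every matched div is known to have both elements.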

Answer:

You can get the main idea from this code:

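# Each book has two a.btn links ("View Free Book" and "See Reviews"), so
# matching on the class alone would misalign titles and links. This assumes
# every free-book link is labelled "View Free Book", as in the excerpt.
titles = soup.find_all('h2')
links = soup.find_all('a', class_='btn', string='View Free Book')
for h2, a in zip(titles, links):
    print('Title:', h2.text)
    print('Link:', a.get('href'))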
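Pairing two flat lists with zip relies on both lists lining up one-to-one across the whole page. A minimal sketch of an alternative, assuming the markup from the question: start from each h2 and use find_next to take the first a.btn that follows it in document order. The SAMPLE string is a trimmed copy of the two book blocks shown in the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two book blocks from the question.
SAMPLE = """
<section><div class="book row">
  <h2>Learning Deep Architectures for AI</h2>
  <div>
    <a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf">View Free Book</a>
    <a class="btn" href="http://amzn.to/1WePh0N">See Reviews</a>
  </div>
</div></section>
<section><div class="book row">
  <h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
  <div>
    <a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575">View Free Book</a>
    <a class="btn" href="http://amzn.to/1FcalRp">See Reviews</a>
  </div>
</div></section>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

pairs = []
for h2 in soup.find_all("h2"):
    # find_next searches forward in document order from this h2.
    a = h2.find_next("a", class_="btn")
    if a is not None and a.has_attr("href"):
        pairs.append((h2.get_text(strip=True), a["href"]))

for title, link in pairs:
    print("Title:", title)
    print("Link:", link)
```

One caveat: if a book had a title but no link at all, find_next would pick up the next book's link, so scoping the search to each div.book is the safer choice when the page may contain incomplete entries.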