html – How does table parsing work in Python? Is there an easy way other than Beautiful Soup?

Posted by: admin April 23, 2020

Question:

I am trying to understand how one can use Beautiful Soup to extract the href links under a particular column of a table on a web page. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.

On this page, the table with class wikitable has a column Title. I need to extract the href link behind each value in that column and put the links in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup documentation on table parsing.

Answers:

You don’t really have to navigate the tree literally; you can simply look for what identifies the lines you want.

In this example, the URLs you are looking for reside in a table with class="wikitable", and within that table they sit in td tags with align=center. That gives us a reasonably unique identification for the links, so we can start extracting them.

Keep in mind, however, that multiple tables with class="wikitable" and td tags with align=center may exist on a page. If you only want the first or second table, you will have to add extra filters, as in the sketch below.
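For instance, a minimal sketch of such a filter, assuming soup is the parsed page (as in the full code below) and that the table you want happens to be the first one:

first_table = soup.find_all("table", class_="wikitable")[0]  # index 0 is an assumption
centered_cells = first_table.find_all("td", align="center")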

The code should look something like this for extracting all links from those tables:

from urllib.request import urlopen  # Python 3; the original used Python 2's urllib2

from bs4 import BeautifulSoup, SoupStrainer


# Fetch the page
content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()

# Parse only the wikitable tables and skip the rest of the document
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)

links = []
for sp in soup.find_all(align="center"):
    a_tag = sp("a")  # any <a> tags inside the centered cell
    if a_tag:
        links.append(a_tag[0].get("href"))
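
The question also asks to put the links in an Excel sheet. A minimal sketch of that last step, assuming the openpyxl package (any xlsx writer would do) and the links list built above; Wikipedia hrefs are relative, so they are joined to the site root first:

from urllib.parse import urljoin
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["href"])  # header row
for link in links:
    # hrefs come back relative (e.g. /wiki/...), so make them absolute
    ws.append([urljoin("http://en.wikipedia.org", link)])
wb.save("telugu_films_2015.xlsx")  # the file name is just an example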

There’s one more thing to note here: the use of SoupStrainer. It specifies a filter so that Beautiful Soup only processes the content you actually want, which helps speed up parsing. Try removing the parse_only argument from this line:
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
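
If you want to measure the difference yourself, here is a rough sketch using time.perf_counter, reusing content and filter_tag from the code above; the exact numbers will vary from machine to machine:

import time

start = time.perf_counter()
BeautifulSoup(content, "html.parser", parse_only=filter_tag)
print("with SoupStrainer:", time.perf_counter() - start)

start = time.perf_counter()
BeautifulSoup(content, "html.parser")  # parses the whole page
print("without SoupStrainer:", time.perf_counter() - start)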