python – Use BeautifulSoup to extract text under a specific header

Posted by: admin, February 24, 2020


How do I extract all the text below a specific header? In this case, I need to extract the text under Topic 2. EDIT: On other webpages, “Topic 2” sometimes appears as the third heading, or the first. “Topic 2” isn’t always in the same place, and it doesn’t always have the same id number.

# import library
from bs4 import BeautifulSoup

# dummy webpage text
body = '''
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<p> This is the second sentence.</p>
<p> This is the third sentence.</p>

<h2 id="2">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>

<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
<p> This is the seventh sentence.</p>
<p> This is the eighth sentence.</p>
'''

# convert text to soup 
soup = BeautifulSoup(body, 'lxml')

If I extract text only under “Topic 2”, this is what my output would be:

This is the fourth sentence. This is the fifth sentence.

My attempts to solve this problem:

I tried soup.select('h2 + p'), but this only got me the first sentences under each header.

[<p> This is the first sentence.</p>,
 <p> This is the fourth sentence.</p>,
 <p> This is the sixth sentence.</p>]

I also tried this, but it gave me all the text, when I only need text under Topic 2:

import pandas as pd 

lst = []
for row in soup.find_all('p'):
    text_dict = {}
    text_dict['text'] = row.text
    lst.append(text_dict)

df = pd.DataFrame(lst)


|   | text                          |
| 0 | This is the first sentence.   |
| 1 | This is the second sentence.  |
| 2 | This is the third sentence.   |
| 3 | This is the fourth sentence.  |
| 4 | This is the fifth sentence.   |
| 5 | This is the sixth sentence.   |
| 6 | This is the seventh sentence. |
| 7 | This is the eighth sentence.  |

Answers:


target = soup.find('h2', string='Topic 2')
for sib in target.find_next_siblings():
    if sib.name == "h2":
        break
    else:
        print(sib.text)

Output (from your HTML above):

 This is the fourth sentence.
 This is the fifth sentence.


The problem is that you think the text is under the header. Technically, the <p> elements are siblings of the headers, not children, so the only way to get them is the more sequential process of iterating through siblings:

  1. find a header
  2. find everything not a header & extract text
  3. find another header (or EOF) and stop.
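The three steps above can be sketched as a small helper. This is only an illustration: the function name text_under_header, its parameters, and the shortened sample HTML are made up here, and it assumes the header text matches exactly. It looks the header up by its text rather than its id, since the question says the id and position vary, and it uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

def text_under_header(soup, title, tag="h2"):
    """Collect the text of every element between one header and the next.

    Hypothetical helper: `title` must equal the header's text exactly.
    """
    # Step 1: find the header by its text, wherever it is in the page.
    header = soup.find(tag, string=title)
    if header is None:
        return []
    texts = []
    # find_next_siblings() yields only tags, so whitespace strings are skipped.
    for sib in header.find_next_siblings():
        # Step 3: stop at the next header of the same level.
        if sib.name == tag:
            break
        # Step 2: everything before that is content; keep its text.
        texts.append(sib.get_text(strip=True))
    return texts

# Shortened sample; note the id of "Topic 2" does not matter here.
html = """
<h2 id="1">Topic 1</h2>
<p> This is the first sentence.</p>
<h2 id="7">Topic 2</h2>
<p> This is the fourth sentence.</p>
<p> This is the fifth sentence.</p>
<h2 id="3">Topic 3</h2>
<p> This is the sixth sentence.</p>
"""
soup = BeautifulSoup(html, "html.parser")
result = " ".join(text_under_header(soup, "Topic 2"))
print(result)  # This is the fourth sentence. This is the fifth sentence.
```

If the loop reaches the end of the document without meeting another header, it simply stops, which covers the "EOF" case in step 3.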

Something like this, looking the header up by its id:

h2 = soup.find('h2', id='2')
for sibling in h2.next_siblings:
    if sibling.name == 'h2':
        break  # reached the next header, stop
    if sibling.name == 'p':
        print(sibling.get_text())  # ... do what you like with the <p> node

(Note that the BeautifulSoup sibling immediately after an <h2> is often a string element, usually a newline, with name == None, so make sure you handle or ignore it properly.)
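
To see those string siblings for yourself, here is a minimal sketch that compares .next_siblings with find_next_siblings(). The one-line sample HTML and the variable names are made up for illustration, and 'html.parser' is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup, NavigableString

# Tiny sample with newlines between the tags, as in a real page.
html = '<h2 id="2">Topic 2</h2>\n<p> This is the fourth sentence.</p>\n<h2 id="3">Topic 3</h2>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", id="2")

# .next_siblings yields every following node, including whitespace strings.
kinds = ["string" if isinstance(s, NavigableString) else s.name
         for s in h2.next_siblings]

# find_next_siblings() with no arguments yields only tags.
tags_only = [s.name for s in h2.find_next_siblings()]

print(kinds)      # ['string', 'p', 'string', 'h2']
print(tags_only)  # ['p', 'h2']
```

So with find_next_siblings() there is nothing to filter out, while with .next_siblings you need a check such as isinstance(s, NavigableString) before treating a node as a tag.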