Can script tags and all of their contents be removed from HTML with BeautifulSoup, or do I have to use Regular Expressions or something else?
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'lxml')
>>> [s.extract() for s in soup('script')]
>>> soup
baba
As stated in the official documentation, you can use the
extract method to remove every subtree that matches the search.
import BeautifulSoup  # BeautifulSoup 3 module (Python 2)
a = BeautifulSoup.BeautifulSoup("<html><body><script>aaa</script></body></html>")
[x.extract() for x in a.findAll('script')]
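For readers on Python 3, the same idea with the bs4 package might look like the sketch below. It assumes the built-in html.parser rather than any particular parser from the answers, and uses find_all, the bs4 spelling of findAll. The sample markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration, not from the answers above)
html = "<html><body><script>aaa</script><p>kept</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches each matching tag from the tree and returns it,
# so the removed elements remain available if they are needed later
removed = [tag.extract() for tag in soup.find_all("script")]

result = str(soup)
print(result)
```

Because extract returns the detached tags, this form is handy when the removed scripts should still be inspected afterwards.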
Updated answer for future reference:
the correct answer is decompose.
There are several ways to do this, but
decompose removes the tag from the tree in place and destroys it, rather than returning it as extract does.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
soup.i.decompose()  # remove the <i> tag and its contents in place
print(str(soup))  # prints '<p>This is a slimy text and </p>'
Pretty useful for getting rid of detritus such as 'script', 'img', and so forth.
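As a rough sketch of that cleanup, the snippet below removes several tag types in one pass with decompose. The helper name strip_tags, its default tag list, and the sample markup are hypothetical, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup


def strip_tags(html, names=("script", "img")):
    """Hypothetical helper: return html with the named tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and matches any of them
    for tag in soup.find_all(list(names)):
        tag.decompose()  # remove the tag and its contents from the tree
    return str(soup)


cleaned = strip_tags(
    '<div><script>var x = 1;</script><p>Hello</p><img src="a.png"/></div>'
)
print(cleaned)
```

Passing a different tuple for names, for example ("script", "style"), would target other kinds of clutter the same way.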