
php – strip all classes from p tags

Posted by: admin July 12, 2020

Questions:

I was just wondering if anyone knows a function to remove ALL classes from a string in PHP. Basically, I only want <p> tags rather than <p class="...">, if that makes sense 🙂

Answer:

A fairly naive regex will probably work for you:

$html=preg_replace('/class=".*?"/', '', $html);

I say naive because it would also strip class="something" if that text happened to appear in your body text for some reason! It could be made a little more robust by only looking for class="..." inside angle-bracketed tags, if need be.
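A minimal sketch of that more robust variant: first match each opening tag, then remove the class attribute only inside the matched tag. The function name stripClassAttributes is made up for illustration, and the sketch assumes attribute values never contain a ">" character.

```php
<?php
// Hypothetical helper: removes class attributes, but only inside opening tags.
function stripClassAttributes(string $html): string
{
    return preg_replace_callback(
        '/<[a-z][^>]*>/i', // an opening tag such as <p ...> (assumes no ">" inside attribute values)
        function (array $match): string {
            // Within the tag, drop class="...", class='...' or an unquoted class=value.
            return preg_replace(
                '/\s+class\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                '',
                $match[0]
            );
        },
        $html
    );
}

echo stripClassAttributes('<p class="a">Use class="b" here</p>');
// prints: <p>Use class="b" here</p>
```

Body text that merely mentions class="b" is left alone because it never falls inside a matched tag.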

Answer:

Maybe it’s a bit overkill for your needs, but the best tool I know for parsing, validating, and cleaning HTML data is HTML Purifier.

It allows you to define which tags and attributes are OK and which are not, and it gives valid, clean (X)HTML as output.

(Using regexes to “parse” HTML seems OK at the beginning… and then, when you want to handle more specific cases, it generally becomes hell to understand and maintain.)

Answer:

You load the HTML into a DOMDocument object, then import that into SimpleXML. Then you run an XPath query for all p elements and loop through them. On each iteration, you set the class attribute to a placeholder value such as “killmeplease”.

When that’s done, output the SimpleXML object as XML again (which, by the way, may change the HTML, but usually only for the better), and you will have an HTML string where each p has a class of “killmeplease”. Use str_replace to actually remove them.

Example:

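$html_file = "somehtmlfile.html";

// Parse the HTML file and hand the DOM tree to SimpleXML.
$dom = new DOMDocument();
$dom->loadHTMLFile($html_file);

$xml = simplexml_import_dom($dom);

// Select every p element in the document.
$paragraphs = $xml->xpath("//p");

// Overwrite each paragraph's class with a placeholder value.
foreach ($paragraphs as $paragraph) {
    $paragraph['class'] = "killmeplease";
}

$new_html = $xml->asXML();

// Strip the placeholder attribute, including the space in front of it.
$better_html = str_replace(' class="killmeplease"', "", $new_html);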

Or, if you want simpler code and don’t mind tangling with preg_replace, you could go with:

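$html_file = "somehtmlfile.html";
$html_string = file_get_contents($html_file);

// $1 = everything in the <p ...> tag before the class attribute,
// $3 = everything after it; [^>] also matches line breaks inside the tag.
$bad_p_class = '/(<p\b[^>]*?)\s+class=("[^"]*"|\'[^\']*\')([^>]*>)/i';

$better_html = preg_replace($bad_p_class, '$1$3', $html_string);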

The tricky part with regular expressions is that quantifiers are greedy by default, and a p tag that contains a line break can trip up patterns that rely on the dot, because the dot does not match a newline unless the s modifier is set. But give either of those a shot.
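As a variation on the DOM approach above, the placeholder step can be skipped by removing the attribute directly with DOMXPath and removeAttribute. This is only a sketch: the function name removeParagraphClasses and the wrapper div are illustrative choices, and it assumes an ASCII or otherwise correctly declared encoding, since loadHTML does not assume UTF-8 by default.

```php
<?php
// Hypothetical helper: strips the class attribute from every p element in an HTML fragment.
function removeParagraphClasses(string $html): string
{
    // Collect parser warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Wrap the fragment in a div so there is a single root element, and
    // tell libxml not to add <html>/<body> or a doctype around it.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only the p elements that actually carry a class attribute.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//p[@class]') as $paragraph) {
        $paragraph->removeAttribute('class');
    }

    // Serialize the children of the wrapper div, leaving the wrapper itself out.
    $result = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

echo removeParagraphClasses('<p class="a b" id="x">Hi</p><span class="keep">y</span>');
```

Other attributes on the p element, and class attributes on other elements, are left untouched. The serializer may normalize whitespace or quoting, so the output is not guaranteed to be byte-identical to the input.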

Answer:

HTML Purifier

HTML can be very tricky to process with regexes because of the hundreds of different ways the code can be written or formatted.

HTML Purifier is a mature open source library for cleaning up HTML. I would advise using it in this case.

HTML Purifier’s configuration documentation describes how to specify which classes and attributes should be allowed and what the purifier should do if it finds them.

http://htmlpurifier.org/docs/
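For the question at hand, a configuration along these lines might be enough. This is a sketch under assumptions: it presumes HTML Purifier is installed (for example through Composer) and its autoloader has already been loaded, and the wrapper function name is made up for illustration. HTML.ForbiddenAttributes is the directive that takes tag@attr entries.

```php
<?php
// Assumption: HTML Purifier is installed and its autoloader (for example
// vendor/autoload.php from Composer) has been required before calling this.

// Hypothetical helper name; it is not part of HTML Purifier itself.
function purifyWithoutParagraphClasses(string $dirtyHtml): string
{
    $config = HTMLPurifier_Config::createDefault();

    // Forbid the class attribute on p elements only.
    // Using 'class' instead of 'p@class' would forbid it on every element.
    $config->set('HTML.ForbiddenAttributes', 'p@class');

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirtyHtml);
}
```

Note that HTML Purifier also applies its default cleaning rules, so the result may differ from the input in more ways than the missing class attributes.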

Answer:

$html = "<p id='fine' class='r3e1 b4d 1' style='widows: inherit;'>";
// preg_replace returns the new string, so the result has to be assigned.
$html = preg_replace('/\sclass=[\'"][^\'"]+[\'"]/', '', $html);

If you are up against Microsoft Office-exported HTML, you’ll need more than class removal, but HTML Tidy has a config flag just for Microsoft Office (the word-2000 option)!

Otherwise, this should be safer than some of the other answers, since they are a little greedy and you don’t know which kind of quoting will be used (' or ").

Note: The pattern is actually /\sclass=['"][^'"]+['"]/ but, as it contains both double quotes (") and apostrophes ('), I had to escape all occurrences of one (\') to put the pattern inside a PHP string.
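One limitation of the pattern above is that it stops at the first quote character of either kind, so a value such as class="it's" would not be matched. A possible refinement, sketched here with a hypothetical function name, captures the opening quote and uses a backreference so that only the same character can close the value.

```php
<?php
// Hypothetical helper: removes class attributes whose value is wrapped in matching quotes.
function stripQuotedClass(string $html): string
{
    // (["']) captures the opening quote; \1 requires the same character to close the value.
    // .*? is lazy, and the s modifier lets it span line breaks inside the value.
    return preg_replace('/\sclass=(["\']).*?\1/s', '', $html);
}

echo stripQuotedClass("<p id='fine' class='r3e1 b4d 1' style='widows: inherit;'>");
// prints: <p id='fine' style='widows: inherit;'>
```

Like the other regex answers, this still does not check that the match sits inside a tag, so it remains a quick fix rather than a parser.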

Answer:

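I would do something like this with jQuery. Place this in your page header:

$(document).ready(function () {
    $("p").each(function () {
        // Remove the whole class attribute from this paragraph...
        $(this).removeAttr("class");
        // ...or remove a single class only: $(this).removeClass("className");
    });
});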