Match unclosed html tags using regex and php

Posted by: admin July 12, 2020

Questions:

I am using PHP and a regex to find unclosed HTML tags in a string.

This is my string:

$s="<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";

As you can see, none of the tags here are closed.

I want to find all unclosed tags, but the problem is that my regex also matches the opening tags.

Here is my regex so far:

/<[^>]+>/i

And this is my preg_match_all() call:

preg_match_all("/<[^>]+>/i",$s,$v);

print_r($v);

What do I need to change in my regex to match only the unclosed tags?

 <h2>
 <p>
 <div>
Answers:

You might be unaware of this, but DOMDocument can help you fix the HTML.

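$html = "<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Wrap the fragment in a temporary <root> element; the flags suppress the doctype and the implied <html>/<body>.
$dom->loadHTML('<root>' . $html . '</root>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Remove every element that has no child nodes at all.
foreach( $xpath->query('//*[not(node())]') as $node ) {
    $node->parentNode->removeChild($node);
}
// Strip the leading "<root>" (6 characters) and the trailing "</root>" plus newline (8 characters).
echo substr($dom->saveHTML(), 6, -8);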

See IDEONE demo

Result: <div><h2>Hello world</h2><p>It's 7Am where I live</p></div>

Note that the XPath-based cleanup of empty nodes is necessary because, after the HTML is loaded, the DOM contains empty <h2></h2>, <p></p> and <div></div> elements.

The <root> element is wrapped around the input so that the fragment has a single top-level element. It is stripped from the serialized output afterwards with substr.

The LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD flags are necessary so that no implied <html>/<body> elements and no default DTD are added to the DOM.
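
To see what the two flags change, here is a minimal sketch that serializes the same fragment with and without them. The sample fragment and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
// Illustrative fragment (an assumption, not taken from the answer above).
$fragment = '<p>Hi</p>';

// Default behaviour: a doctype and implied <html>/<body> wrappers are added.
$plain = new DOMDocument();
$plain->loadHTML($fragment);
$withoutFlags = $plain->saveHTML();

// With both flags, the fragment is serialized without doctype or wrappers.
$bare = new DOMDocument();
$bare->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$withFlags = $bare->saveHTML();

echo $withoutFlags, "\n", $withFlags;
```

The first output should start with a doctype and wrap the paragraph in <html><body>, while the second should contain only the paragraph itself.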

Answer:

Finding unmatched tags seems fundamentally too hard to do with a single regex. You basically need to push each opening tag you see onto a stack and then pop it off the stack when you see the matching closing tag. Whatever is left on the stack at the end is unclosed.

I recommend using a library that does HTML validation. See these questions:
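
The push-and-pop idea can be sketched roughly as follows. This is a minimal illustration, not a validator: findUnclosedTags() is a hypothetical helper name, the tag regex is a simplification that ignores comments, scripts and attribute values containing ">", and the list of void elements is abbreviated.

```php
<?php
// Hypothetical helper: returns the names of opening tags that are never closed.
function findUnclosedTags(string $html): array
{
    // Void elements never take a closing tag, so they are skipped (abbreviated list).
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'source', 'wbr'];
    $stack = [];

    // Capture an optional leading slash, the tag name, and the rest of the tag.
    preg_match_all('/<(\/?)([a-z][a-z0-9]*)\b([^>]*)>/i', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as [, $slash, $name, $rest]) {
        $name = strtolower($name);
        if (in_array($name, $void, true) || substr($rest, -1) === '/') {
            continue; // void or self-closing tag: nothing to track
        }
        if ($slash === '') {
            $stack[] = $name; // opening tag: push
            continue;
        }
        // Closing tag: remove the most recent matching opening tag, if any.
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i] === $name) {
                array_splice($stack, $i, 1);
                break;
            }
        }
    }

    return $stack; // whatever is left was never closed
}

$s = "<div><h2>Hello world<h2><p>It's 7Am where I live<p><div>";
echo implode(', ', findUnclosedTags($s)), "\n"; // div, h2, h2, p, p, div
echo implode(', ', findUnclosedTags('<div><p>text</div>')), "\n"; // p
```

Note that for the string in the question every tag is an opening tag, so all six are reported. Telling which of them were meant to be closing tags is a guess that a stack alone cannot make.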

Remove unmatched HTML tags in a string

How to find the unclosed div tag

PHP get all unclosed HTML tags in string