
PHP Regex to match HTML code using capturing group

Posted by: admin February 25, 2020

Questions:

I’m stuck trying to write a regular expression in PHP that matches A HREF tags using capturing groups.

My current code looks like this:

$content = preg_replace_callback(
  '/<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>([^<]*)<\/a>/i',
    function($m) {
...

The code works perfectly fine for anything like this:

<a href="/go/bla" rel="sponsored noopener" target="_blank">Test link</a>

But I have some URLs that look like this – note the nested <span></span>:

<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>

My second capturing group matches [^<]*, which is why the link containing the <span> doesn’t match. I was trying to change the group to match anything BUT </a>. That’s where I failed, thanks to my lack of regex experience 🙂

Could any regex expert please point me in the right direction?

Answers:

This should be sufficient for your example:

<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)?([^<]*)(?:<[^>]+>)?<\/a>

Adding (?:<[^>]+>)? on either side of the text group matches one extra tag there, if it exists.

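As a quick check, the pattern can be run against both links from the question. This is only a sketch: the function name `extract_link` is made up for illustration, and the character classes are written as `["\']` because a `|` inside a character class is a literal pipe rather than an alternation.

```php
<?php
// Hypothetical helper: returns [href, text] for the first link, or null if nothing matches.
function extract_link(string $html): ?array
{
    // Same pattern as above, wrapped in / delimiters with the i flag.
    $pattern = '/<a[^>]*href=["\']([^"\']*)["\'][^>]*>(?:<[^>]+>)?([^<]*)(?:<[^>]+>)?<\/a>/i';

    if (preg_match($pattern, $html, $m) === 1) {
        return [$m[1], $m[2]]; // group 1 = href value, group 2 = link text
    }
    return null;
}

$samples = [
    '<a href="/go/bla" rel="sponsored noopener" target="_blank">Test link</a>',
    '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>',
];

foreach ($samples as $html) {
    $link = extract_link($html);
    echo $link === null ? "no match\n" : $link[0] . ' - ' . $link[1] . "\n";
}
```

Both samples should print `/go/bla - Test link`.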

Answer:

The following regex should help you:

<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)*([^<]*)(?:<\/[^>]+>)*<\/a>

This will match your example as well as this example:

<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test link</h1></span></a>
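The difference between the `?` and `*` quantifiers shows up on this doubly wrapped link. The sketch below compares the two patterns from the answers; the variable names are illustrative only.

```php
<?php
// Both patterns come from the answers above. Only the quantifier on the
// tag-skipping groups differs: ? allows at most one tag, * allows any number.
$optionalOnce = '/<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)?([^<]*)(?:<[^>]+>)?<\/a>/i';
$repeated     = '/<a[^>]*href=["|\']([^"|\']*)["|\'][^>]*>(?:<[^>]+>)*([^<]*)(?:<\/[^>]+>)*<\/a>/i';

$twoLevels = '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test link</h1></span></a>';

$matchedOnce     = preg_match($optionalOnce, $twoLevels);
$matchedRepeated = preg_match($repeated, $twoLevels, $m);

var_dump($matchedOnce);     // int(0): one optional tag is not enough for <span><h1>
var_dump($matchedRepeated); // int(1)
echo $m[1], ' - ', $m[2], "\n"; // /go/bla - Test link
```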

However, what about this?

<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test <span>link</span></h1></span></a>

Nope! This breaks. To match text that is mixed in with tags nested inside other tags, we would have to break the pattern up even further. At this stage it is better to simply fetch all the a tags first, and then extract the data you need from each match afterwards.

$content = preg_replace_callback('/<a[^>]*?href="(.*?)"[^>]*?>(.*?)<\/a>/i', function($m) {
  ... more regexes
}, $content);
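One possible way to do the "extract afterwards" step is to capture everything between the opening and closing a tag lazily and then drop the inner markup with `strip_tags()`. This is a sketch under assumptions, not the answer's own code: the function name `extract_links` and the returned array shape are invented here, and the pattern still assumes quoted href values and no a tags nested inside each other.

```php
<?php
// Hypothetical helper: returns a list of ['href' => ..., 'text' => ...] arrays.
function extract_links(string $html): array
{
    $links = [];
    // (["\']) remembers the opening quote and \1 requires the same closing quote.
    // (.*?) lazily captures the inner HTML up to the first </a>; the s flag lets . span newlines.
    $pattern = '/<a\b[^>]*?\bhref=(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';

    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'href' => $m[2],
                // Remove any nested tags, keeping only the visible text.
                'text' => trim(strip_tags($m[3])),
            ];
        }
    }
    return $links;
}

$html = '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test <span>link</span></h1></span></a>';
foreach (extract_links($html) as $link) {
    echo $link['href'], ' - ', $link['text'], "\n";
}
```

Because the inner markup is removed in a second step, the regex no longer has to describe how the tags are nested.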

It may be better to use a library that loads HTML content as objects (much like a browser would) and lets you query the results using something like XPath.

In PHP you can use the DOM extension and XPath to load and query HTML. Below is an example.

$doc = new DOMDocument();
$html = <<<EOD
<html>
<body>
<a href="/go/bla" rel="sponsored noopener" target="_blank">Test link</a>
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>
<a href="/go/bla" rel="sponsored noopener" target="_blank"><span><h1>Test <span>link</span></h1></span></a>
</body>
</html>
EOD;

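// Parse the markup into a DOM tree
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Select every <a> element in the document
$query = $xpath->query('//a');

// query() returns false (not null) when the expression is invalid
if ($query !== false) {
    foreach ($query as $q) {
        print $q->getAttribute('href') . ' - ';
        // nodeValue is the text of the element and all of its descendants
        print $q->nodeValue . "\n";
    }
}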
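Since the question uses preg_replace_callback, the same DOM approach can also rewrite links rather than just print them. The following is a minimal sketch: `rewrite_link_hrefs` and its callback signature are hypothetical, the actual rewrite rule is not given in the question, and `saveHTML()` returns a full document (doctype, html and body wrappers included), not just the fragment that was passed in.

```php
<?php
// Hypothetical helper: calls $rewrite(href, text) for every <a href> and stores the result.
function rewrite_link_hrefs(string $html, callable $rewrite): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            // textContent is the link text with all nested tags flattened.
            $a->setAttribute('href', $rewrite($a->getAttribute('href'), $a->textContent));
        }
    }
    return $doc->saveHTML();
}

// Example rule (an assumption for illustration): prefix every href with /out.
$result = rewrite_link_hrefs(
    '<a href="/go/bla" rel="sponsored noopener" target="_blank"><span>Test link</span></a>',
    function (string $href, string $text): string {
        return '/out' . $href;
    }
);
echo $result;
```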