
php – How modify regex to exclude some text?

Posted by: admin, February 25, 2020

Questions:

I have the regex '@(?:<script type="text/javascript"|<script)(.*)</script>@msU'. I need to modify this expression to exclude <script> tags that contain a custom no-defer attribute.

Example: include (<script type="text/javascript"></script>, <script></script>), exclude (<script no-defer type="text/javascript"></script>)

How can I modify my regex?

Answers:

I totally agree with the comment from @JayBlanchard, who mentioned that it would be far safer to use a PHP DOM parser. You can then easily remove the script tags if they don’t have the no-defer attribute.
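A minimal sketch of that DOM approach, assuming a small sample document based on the question's examples (the $html sample and the $removed array are illustrative names, not part of the original answer). It collects and removes every script element that lacks the no-defer attribute:

```php
<?php
// Hypothetical sample input, based on the examples in the question.
$html = '<html><body><p>foo</p>'
      . '<script type="text/javascript">var a = 1;</script>'
      . '<script no-defer type="text/javascript">var b = 2;</script>'
      . '</body></html>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$removed = [];
// Copy the live node list into an array so removals do not disturb the loop.
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    if (!$script->hasAttribute('no-defer')) {
        // Keep the serialized tag so it can be re-inserted elsewhere later.
        $removed[] = $doc->saveHTML($script);
        $script->parentNode->removeChild($script);
    }
}

$result = $doc->saveHTML();
echo $result;
```

The serialized tags in $removed could then be appended wherever the deferred scripts should end up, for example just before the closing body tag.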

But well… if you really want to do it with a regular expression, I would first look for all the <script> tags and capture their attributes in a named capturing group, as in the pattern passed to preg_replace_callback() in the code below.
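As a rough illustration of that first pass, the sketch below reuses the pattern from the code further down with preg_match_all() and only lists the captured attribute strings. The sample $html and the $attributes array are assumptions made for the example:

```php
<?php
// Hypothetical sample input taken from the examples in the question.
$html = '<script type="text/javascript">a();</script>'
      . '<script>b();</script>'
      . '<script no-defer type="text/javascript">c();</script>';

// First pass: match every script element and capture its raw attribute string.
$pattern = '/<\s*script(?<attributes>[^>]*)>.*?<\s*\/\s*script\s*>/si';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $match) {
    // Trim the whitespace that separates the tag name from the attributes.
    $attributes[] = trim($match['attributes']);
}

print_r($attributes);
```

Each captured string can then be inspected on its own in a second step, which is less fragile than trying to encode the exclusion in one large pattern.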

The idea is to do the job in two passes. This can be done with PHP’s preg_replace_callback() function, which runs some PHP code for each match. In the callback you can parse the attributes a bit more safely, check whether there is a no-defer attribute, and then either keep the tag in place or push it into your array of scripts to move to the bottom of the page.

You could also use preg_match_all() and loop over the results to decide what to do. But I would personally go for the DOM parser solution first, and then for the preg_replace_callback() solution with a callback function that can access an array storing the items that have been removed. This can be done with an anonymous function (closure) and the use (&$script_tags_to_move) construct. See here: https://www.php.net/manual/en/functions.anonymous.php

This would become something like this:


$script_tags_to_move = [];

// Find all script tags and store and then remove them if they don't have the
// no-defer attribute.
$html = preg_replace_callback(
    '/<\s*script(?<attributes>[^>]*)>.*?<\s*\/\s*script\s*>/si',
    function ($matches) use (&$script_tags_to_move) {
        // If the attributes contain no-defer, either bare or written as no-defer="..."
        // (the search is still not very safe -> to improve).
        if (preg_match('/(^|\s)no-defer(\s|=|\/|$)/i', $matches['attributes'])) {
            // Keep the script tag in the HTML.
            return $matches[0];
        } else {
            // Store the script tag.
            $script_tags_to_move[] = $matches[0];
            // And remove it from the HTML.
            return '';
        }
    },
    $html
);

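// Inject the script tags at the end, before the closing body tag.
// A callback is used so that "$1" or "\1" inside the scripts is not
// interpreted as a backreference in the replacement string.
$html = preg_replace_callback(
    '~<\s*/\s*body\s*>~i',
    function () use ($script_tags_to_move) {
        return implode("\n", $script_tags_to_move) . '</body>';
    },
    $html
);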

Try it out here: http://sandbox.onlinephpfunctions.com/code/21a938482e883a1d470e61f312764c112c73bb85

Answer:

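This would do it. The negative lookahead is limited to the opening tag with [^>]*, so a no-defer attribute on a later tag cannot affect the match; i and s are used as modifiers because g is not a valid modifier in PHP:

@<script\b(?![^>]*\sno-defer[\s=/>])[^>]*>.*?</script>@is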

https://regex101.com/r/NWoKj8/1
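To check the lookahead idea from PHP, here is a small sketch run against the three examples from the question. It assumes a variant of the pattern whose lookahead only scans the opening tag ([^>]*) and which uses the PHP modifiers i and s; the sample $html is made up for the test:

```php
<?php
// Hypothetical sample input: the three tags from the question, one per line.
$html = '<script type="text/javascript"></script>' . "\n"
      . '<script></script>' . "\n"
      . '<script no-defer type="text/javascript"></script>';

// The negative lookahead rejects an opening tag that contains a no-defer attribute.
$pattern = '@<script\b(?![^>]*\sno-defer[\s=/>])[^>]*>.*?</script>@is';
preg_match_all($pattern, $html, $matches);

// Only the two tags without no-defer are expected to match.
print_r($matches[0]);
```

This stays a text-level check: a no-defer string inside a quoted attribute value would also trigger the exclusion, which is one more reason to prefer the DOM approach where possible.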

Answer:

Here is an alternative using DOMDocument. It’s easier to use, and you can check for specific tags and/or attributes to remove.

<?php

$html = '<html><body>foo</body><script type="text/javascript"></script><script></script><script no-defer type="text/javascript"></script><script src="" no-defer type="text/javascript"></script></html>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$scripts = $doc->getElementsByTagName('script');
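// Iterate backwards because the node list is live and shrinks on removal.
for ($i = $scripts->length; --$i >= 0; ) {
    $item = $scripts->item($i);
    // This removes the scripts that carry no-defer; negate the test to
    // remove the scripts without it instead.
    if ($item->hasAttribute('no-defer')) {
        $item->parentNode->removeChild($item);
    }
}

$newHtml = $doc->saveHTML();

echo $newHtml;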