Home » Php » php – Why does this regex take so long to find email addresses in certain files?

php – Why does this regex take so long to find email addresses in certain files?

Posted by: admin July 12, 2020 Leave a comment

Questions:

I have a regular expression that looks for email addresses ( this was taken from another SO post that I can’t find and has been tested on all kinds of email configurations … changing this is not exactly my question … but understand if that is the root cause ):

/[a-z0-9_\-\+][email protected][a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

I’m using preg_match_all() in PHP.

This works great for 99.99…% of files I’m looking in and takes around 5ms, but occasionally takes a couple minutes. These files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive “random” alphanumeric characters like this:

wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...

Here are two pages causing the problem. View source to see the long strings.

Any thoughts on what is causing this?

–FINAL SOLUTION–

I tested various regexes suggested in the answers. @FailedDev’s answer helped and dropped processing time from a few minutes to a few seconds. @hakre’s answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It’s @hakre’s second suggestion.

/[a-z0-9_\-\+]{1,256}[email protected][a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
How to&Answers:

You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?

For example, you’re using + to match one or more chars. Let’s say you have a string of 10 000 chars. The regex must look 10 000 combinations to find the largest match. Then you combine it with similar ones. Let’s say you have a string with 20 000 chars and two + groups. How could they match in the file. Probably 10 000 x 10 000 possibilities. And so on and so forth.

If you can limit the number of characters (this looks a bit like you’re looking for email patterns), probably limit the email address domain name to 256 and the address itself to 256 characters. Then this would be 256 x 256 possibilities to test “only”:

/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i

That’s probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:

/[a-z0-9_\-\+]{1,256}[email protected][a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Which should speed it up again.

Answer:

My best guess would be to try using possesive quantifiers :

[a-z0-9_\-\+]+

to

[a-z0-9_\-\+]++

This should fail the regex faster so it may improve performance in these situations.

Edit:

Maybe atomic grouping could also help :

/(?>[a-z0-9_\-+]++)@(?>[a-z0-9\-]++\.)(?>[a-z]{2,3})(?:\.[a-z]{2})?/

You should first go with option one. It would be interesting to see if there is any difference by also using option two.