I have a regular expression that looks for email addresses. It was taken from another SO post that I can’t find, and it has been tested on all kinds of email configurations. Changing it is not exactly my question, but I understand if it is the root cause:
I’m using preg_match_all() in PHP.
This works great for 99.99…% of the files I’m searching and takes around 5ms, but occasionally it takes a couple of minutes. The slow files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive “random” alphanumeric characters like this:
Here are two pages causing the problem. View source to see the long strings.
Any thoughts on what is causing this?
I tested various regexes suggested in the answers. @FailedDev’s answer helped and dropped processing time from a few minutes to a few seconds. @hakre’s answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It’s @hakre’s second suggestion.
You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?
For example, you’re using + to match one or more chars. Let’s say you have a string of 10 000 chars. The regex must look at up to 10 000 combinations to find the longest match. Then you combine it with similar quantifiers. Let’s say you have a string of 20 000 chars and two + groups. How many ways could they match in the file? Probably 10 000 x 10 000 possibilities. And so on and so forth.
If you can limit the number of characters (this looks a bit like you’re looking for email patterns), you could limit the email address domain name to 256 characters and the address itself to 256 characters. Then there would be “only” 256 x 256 possibilities to test:
That’s probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:
Which should speed it up again.
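To make the two steps concrete, here is a minimal sketch. The patterns are hypothetical, simplified stand-ins and not the regex from the question; only the 256-character cap comes from the suggestion above. It assumes PHP 7 or later.

```php
<?php
// Hypothetical, simplified patterns for illustration only;
// they are NOT the regex from the question.

// Unbounded: each + may run across thousands of characters.
$unbounded = '/[a-z0-9._-]+@(?:[a-z0-9-]+\.)+[a-z]{2,6}/i';

// Bounded: each part is capped at 256 characters.
$bounded = '/[a-z0-9._-]{1,256}@(?:[a-z0-9-]{1,256}\.)+[a-z]{2,6}/i';

// Bounded and possessive: the extra + after {1,256} tells PCRE never to
// give back characters that the quantifier has already consumed.
$possessive = '/[a-z0-9._-]{1,256}+@(?:[a-z0-9-]{1,256}+\.)+[a-z]{2,6}/i';

// Returns all full matches, or an empty array when nothing matches.
function findEmails(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches);
    // preg_match_all() returns false on error (for example when a PCRE limit is hit).
    return $count === false ? [] : $matches[0];
}

$text = 'Contact alice@example.com or bob.smith@mail.example.org today.';
foreach ([$unbounded, $bounded, $possessive] as $pattern) {
    print_r(findEmails($pattern, $text));
}

// A long alphanumeric run with no "@" in it, similar to the problem strings.
$noise = str_repeat('a1b2c3', 1000);
var_dump(count(findEmails($possessive, $noise)));
```

One thing to watch: a possessive quantifier is only safe when the token that follows it cannot be matched by the quantified class. In this sketch "@" is not in the local-part class and "." is not in the domain-label class, so the possessive version should find the same addresses as the others. If the class before the final dot also contained ".", the possessive version would swallow the whole domain and never match.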
My best guess would be to try using possessive quantifiers:
This should make the regex fail faster, so it may improve performance in these situations.
Maybe atomic grouping could also help:
You should first go with option one. It would be interesting to see if there is any difference by also using option two.
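If you want to compare the two options yourself, a rough sketch could look like the following. The patterns are hypothetical, simplified stand-ins rather than the regexes above, the helper timeMatches() is made up for this example, and it assumes PHP 7.1 or later. Timings will vary with the PCRE version and whether JIT is enabled, so treat the numbers as indicative only.

```php
<?php
// Hypothetical, simplified patterns for illustration only;
// they are NOT the regexes from this answer or the question.

// Option one: possessive quantifiers (++).
$possessivePattern = '/[a-z0-9._-]++@(?:[a-z0-9-]++\.)+[a-z]{2,6}/i';

// Option two: atomic groups (?>...) around the same sub-patterns.
$atomicPattern = '/(?>[a-z0-9._-]+)@(?:(?>[a-z0-9-]+)\.)+[a-z]{2,6}/i';

// Returns [matches, elapsed milliseconds] for one pattern and subject.
function timeMatches(string $pattern, string $subject): array
{
    $start = microtime(true);
    $count = preg_match_all($pattern, $subject, $matches);
    $elapsedMs = (microtime(true) - $start) * 1000;
    return [$count === false ? [] : $matches[0], $elapsedMs];
}

// A long alphanumeric run followed by one address at the end of a sentence.
$sample = str_repeat('x9', 2000) . ' write to carol@example.org.';

$patterns = ['possessive' => $possessivePattern, 'atomic' => $atomicPattern];
foreach ($patterns as $label => $pattern) {
    [$found, $ms] = timeMatches($pattern, $sample);
    printf("%-10s %d match(es) in %.3f ms\n", $label, count($found), $ms);
}
```

Note that the atomic group wraps a single domain label, not the whole repeated group. Wrapping the whole repetition would stop the regex from giving back a trailing "org." at the end of a sentence, and the address would then be missed.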