1

I have a script that searches through a log file using a regex filtering statement and puts the matching lines into another file. The regex is fairly straightforward something like this:

(en|es|fr|zh|ar|)/?(news|publications|about|key-issues|contact-us)

(with a few more matching keywords, etc).

I have a pretty good idea which of the groups of matching keywords gets most matches. Would it improve the performance of the script if I put the keywords most likely to match first in the list (for example, 'news' is most likely to get matched, followed by 'publications', etc)? Or does it not matter which is the order? When the script does the parsing, does it go through the line trying to match with the first element, then if no match the second, and so on, until it finds a match? Would there be a way to make the script more efficient, if we know about the likelihood of each keyword being a match?

1 Answer 1

1

Yes, regex engines will match ORs left-to-right, so you can optimize your search by making en the left-most one, if English is most common. Most important is that you compile your regex beforehand though, so that it can turn it into a state machine. The performance difference after compilation will probably be negligible (Unless you are already having performance problems - beware the premature optimization).

For proof, use https://regex101.com/, and click "regex debugger". When en is first, it takes 22 steps to recognize "en/publications". However, when en is last, it will take 27 steps to recognize the same string.

Sign up to request clarification or add additional context in comments.

4 Comments

The only time you'll realistically have a problem with regex performance is in the case of catastrophic backtracking.
Thank you for this! I reordered the strings to put the most common ones first (after having them further in the back), and the script is about 25% faster (a 1GB log file is parsed and filtered in about 15 mins, when it took over 20 mins before). I will review my search strings and sort them out to optimise this. The improvement is quite meaningful.
Oh wow, that's a massive file to try to regex. If you know a bit of C, try hand writing the search. You'll probably be able to get it down in under a minute.
The file is a web traffic log, and I need to split it into 11 separate smaller ones, divided by the specific sub-sections of that web site (i.e. by specific URL paths). I have 11 regex patterns in a hash table, and I loop through it, parsing the file for each of these. Each matched line is written into the appropriate file. idon't really know any C, but am curious if it could save me much time. Perhaps if it could be written so that I don't have to loop 11 times through the same file in order to filter for each individual regex line and write out the matches to their respective files....

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.