Improving processing speed with a regex filtering in PowerShell

Question

I have a script that searches through a log file using a regex filtering statement and puts the matching lines into another file. The regex is fairly straightforward something like this:

(en|es|fr|zh|ar|)/?(news|publications|about|key-issues|contact-us)

(with a few more matching keywords, etc).

I have a pretty good idea which of the groups of matching keywords gets most matches. Would it improve the performance of the script if I put the keywords most likely to match first in the list (for example, 'news' is most likely to get matched, followed by 'publications', etc)? Or does it not matter which is the order? When the script does the parsing, does it go through the line trying to match with the first element, then if no match the second, and so on, until it finds a match? Would there be a way to make the script more efficient, if we know about the likelihood of each keyword being a match?

Nicholas Pipitone · Accepted Answer · 2018-08-02 19:36:31Z

1

Yes, regex engines will match ORs left-to-right, so you can optimize your search by making en the left-most one, if English is most common. Most important is that you compile your regex beforehand though, so that it can turn it into a state machine. The performance difference after compilation will probably be negligible (Unless you are already having performance problems - beware the premature optimization).

For proof, use https://regex101.com/, and click "regex debugger". When en is first, it takes 22 steps to recognize "en/publications". However, when en is last, it will take 27 steps to recognize the same string.

edited Aug 2, 2018 at 19:36

answered Aug 2, 2018 at 19:21

Nicholas Pipitone

4,2424 gold badges26 silver badges41 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Nicholas Pipitone Over a year ago

The only time you'll realistically have a problem with regex performance is in the case of catastrophic backtracking.

Predrag Vasić Over a year ago

Thank you for this! I reordered the strings to put the most common ones first (after having them further in the back), and the script is about 25% faster (a 1GB log file is parsed and filtered in about 15 mins, when it took over 20 mins before). I will review my search strings and sort them out to optimise this. The improvement is quite meaningful.

Nicholas Pipitone Over a year ago

Oh wow, that's a massive file to try to regex. If you know a bit of C, try hand writing the search. You'll probably be able to get it down in under a minute.

Predrag Vasić Over a year ago

The file is a web traffic log, and I need to split it into 11 separate smaller ones, divided by the specific sub-sections of that web site (i.e. by specific URL paths). I have 11 regex patterns in a hash table, and I loop through it, parsing the file for each of these. Each matched line is written into the appropriate file. idon't really know any C, but am curious if it could save me much time. Perhaps if it could be written so that I don't have to loop 11 times through the same file in order to filter for each individual regex line and write out the matches to their respective files....

Collectives™ on Stack Overflow

Improving processing speed with a regex filtering in PowerShell

1 Answer 1

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related