Optimize RegEx matching in PowerShell

Question

I am filtering a large web access log file and creating about a dozen smaller ones depending on the regex match. Since I have little experience with regex, I would like to figure out how to optimise the patterns for better performance.

The source is formatted as follows:

2015-06-14 00:00:06 38.75.53.205 - HTTP 10.250.35.69 80 GET /en/process/dsa/policy/wbss/wbss_current/wbss2013/Wbss13.pdf - 206 299 16722 0 HTTP (etc)

or

2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200 396 23040 0 HTTP/1.1 Mozilla (etc)

The following are a few of my regex patterns. They all look in the same area of each line, after GET. This is how I have them now:

dsq = "( /esd/sddev/| /creative/)"
dpq = "/dsa/policy/"
pop = "(^((?! /popq/ /caster/(dsa/(policy|qsc|qlation))|(esd/(fed|cdq|qaccount|sddev|creative|forums/rdev))).)*$)"

The first two are looking to match specified patterns, while "pop" is supposed to match everything BUT the specified patterns.

This works as it is, but since my log files tend to be rather large (1GB and bigger), and I have some 12 different patterns to match, I was hoping there may be a way to improve the performance of these patterns.

As for the usage, I have the following code, where $profile is one of those listed above (they are in a hash table, and I loop through them separately):

 Get-Content $sourcefile -ReadCount 5000 | 
 ForEach { $_ -match $profile | Add-Content targetfile }

Thank you all for any insight!

I think what StegMan is suggesting is to take each line and convert it into a PowerShell object so that you can make more targeted searches on the data.
– Matt, Commented Aug 25, 2015 at 18:30
How are you executing those regexes against the files, especially the last one? That is very important here. I would like you to break down how you want the last regex to work. The lookahead and quantifier could probably be improved.
– Matt, Commented Aug 25, 2015 at 18:46
Thank you all for the feedback. @StegMan, I'm not sure how to use ConvertFrom-String...
– Predrag Vasić, Commented Aug 25, 2015 at 19:40

Matt · Accepted Answer · 2015-08-26 00:30:40Z

This is not an improvement on the regex itself, but if you are running a pass over the $sourcefile for every profile you have, I can offer a small solution for that.

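Get-Content $sourcefile -ReadCount 5000 | ForEach {
    # -ReadCount 5000 sends arrays of up to 5000 lines, so handle each line in turn.
    ForEach ($line in $_) {
        # Reset the target so a match from a previous line does not carry over.
        $chosenPath = $null
        switch -regex ($line) {
            $dsq {$chosenPath = "file1"; continue}
            $dpq {$chosenPath = "file2"; continue}
            $pop {$chosenPath = "file3"; continue}
            default {}
        }

        # If no path is set then we skip this step.
        If ($chosenPath) {$line | Add-Content $chosenPath}
    }
}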

Use the -regex parameter of switch. You can reference every element of your hashtable for the matches. If a match is found, we set the output file for that pass and stop processing the switch, so any later patterns are not tested. The order of the patterns matters this way, but since you stated that the matches are mutually exclusive, this should not be an issue.

You could rewrite this with an Add-Content call in every match clause, but I was trying to avoid repeating similar code. If you did put an Add-Content into each clause, you could remove the check for an unset $chosenPath that I added.
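As a rough sketch of that per-clause variant, the following runs entirely in memory so it can be tried without a log file. The sample lines and the $routed hashtable are made up for illustration; in the real script each clause would call Add-Content on its own target file instead of appending to a list.

```powershell
# Hypothetical sample lines, shortened from the log format in the question.
$lines = @(
    'GET /esd/sddev/enable/compl.htm - 200',
    'GET /en/process/dsa/policy/Wbss13.pdf - 206',
    'GET /other/page.htm - 200'
)
$dsq = '( /esd/sddev/| /creative/)'
$dpq = '/dsa/policy/'

# Stand-in for the output files: one in-memory list per target (assumption).
$routed = @{ file1 = @(); file2 = @() }

# switch walks the array one element at a time; $_ is the current line.
switch -regex ($lines) {
    # Each clause does its own "write", then continue skips the remaining patterns.
    $dsq { $routed['file1'] += $_; continue }
    $dpq { $routed['file2'] += $_; continue }
}
```

Because each clause handles its own output, no shared $chosenPath variable is needed, and a line that matches nothing is simply dropped.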

Regex efficiency

With that last one, if you are just trying to match everything other than the listed paths for pop, why not remove the lookahead, greedy quantifier and anchors and just use -notmatch?

$pop = "/popq/ /caster/(dsa/(policy|qsc|qlation))|esd/(fed|cdq|qaccount|sddev|creative|forums/rdev)"
Get-Content $sourcefile -ReadCount 5000 | 
    ForEach { $_ -notmatch $pop | Add-Content targetfile }
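To illustrate why the two forms should select the same lines, here is a small sketch that compares them on in-memory data. The $exclude pattern is a simplified, hypothetical stand-in, not the full pattern from the question, and the sample lines are made up.

```powershell
# Hypothetical sample lines, shortened from the log format in the question.
$lines = @(
    'GET /esd/sddev/enable/compl.htm - 200',
    'GET /en/process/dsa/policy/Wbss13.pdf - 206',
    'GET /other/page.htm - 200'
)

# Simplified exclusion list (assumption, for illustration only).
$exclude = '/dsa/policy/|/esd/sddev/'

# Lookahead form: the whole line must contain no position where an excluded path starts.
$lookahead = "^((?!$exclude).)*$"

# With an array on the left, both operators return the elements that pass the test.
$viaLookahead = @($lines -match $lookahead)
$viaNotMatch  = @($lines -notmatch $exclude)
```

Both variables end up holding only the /other/page.htm line. The -notmatch form lets the regex engine stop at the first excluded path it finds rather than testing a lookahead at every character, which is why it is likely (though not measured here) to be the cheaper of the two.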

As a side note, I would have expected that you would need a second loop in there to break out the array of 5000 items:

Get-Content $sourcefile -ReadCount 5000 | 
 ForEach { $_ | ForEach{ $_ -match $profile | Add-Content targetfile }}

I wonder if the regex is being performed on 5000 lines at once instead of the one line you expect it to be... or maybe it's a typo... or maybe I'm nuts.
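For reference, PowerShell's comparison operators act as filters when the left-hand operand is a collection, which would explain why the single-loop version works on a 5000-line chunk. A minimal sketch with made-up data:

```powershell
# A small stand-in for one chunk emitted by -ReadCount (hypothetical data).
$chunk = @(
    'GET /esd/sddev/enable/compl.htm - 200',
    'GET /other/page.htm - 200'
)

# Array on the left: -match returns the elements that match, not a Boolean.
$filtered = $chunk -match '/esd/sddev/'

# Single string on the left: -match returns a Boolean.
$single = $chunk[0] -match '/esd/sddev/'
```

Here $filtered is a collection holding the first line only, and $single is a Boolean. Piping the filtered result to Add-Content therefore writes just the matching lines of the chunk.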

edited Aug 26, 2015 at 0:30

answered Aug 26, 2015 at 0:22


7 Comments

Predrag Vasić Over a year ago

The offered answer looks like a significant improvement, especially for eliminating the need to read through $sourcefile separately for each profile. The way I understand it, the code looks at each line in the log and matches it against each pattern. As soon as it finds a match, it outputs that line to the output file and moves on to the next line in the log, starting back from the first pattern. I am just not sure whether it will then add more matched lines to the existing output files or replace them.

Matt Over a year ago

@PredragVasić I follow your comment until the last sentence. It will append to the output files as determined by the match during each pass.

Predrag Vasić Over a year ago

As for the -ReadCount 5000, based on my results, the line performs the regex on 5,000 lines at a time, line by line. The output from this code seems correct, with output files containing over 400k matched lines from a source with over 700k lines. When I first had the code without -ReadCount 5000, it took forever to run. Someone else suggested adding it, and apparently it loads 5,000 lines into memory at a time, which makes it much faster than reading each line from the file every time it performs the match. This is how it was explained to me, and the results seem to confirm it.

Predrag Vasić Over a year ago

That sounds just perfect! I'm currently working on rewriting the script and will test it. I'm curious to see the difference in performance. I'll report back later. Thank you again so much for the advice and solutions!

Predrag Vasić Over a year ago

One more note, regarding the last one that is essentially a -notmatch. If I use a hash table to store pairs of profile names and regex patterns, then I can't use -notmatch without some additional logic (checking for that last pattern). However, what might be possible is the following. All other patterns match specific lines in the log; the last one actually grabs all lines that don't match any other pattern and sends them to its own output file. Since your code checks every line against each pattern, we could simply send the line to that last target file if nothing matches.
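A minimal sketch of that catch-all idea, assuming the patterns live in an ordered hash table keyed by profile name. The sample lines and the in-memory $routed lists are hypothetical stand-ins for the real log and output files.

```powershell
# Hypothetical sample lines, shortened from the log format in the question.
$lines = @(
    'GET /esd/sddev/enable/compl.htm - 200',
    'GET /en/process/dsa/policy/Wbss13.pdf - 206',
    'GET /other/page.htm - 200'
)

# Positive patterns only; [ordered] keeps the test order predictable.
$profiles = [ordered]@{
    dsq = '( /esd/sddev/| /creative/)'
    dpq = '/dsa/policy/'
}

# Stand-in for the output files, including the catch-all "pop" target (assumption).
$routed = @{ dsq = @(); dpq = @(); pop = @() }

foreach ($line in $lines) {
    # Assume the catch-all target until one of the patterns matches.
    $target = 'pop'
    foreach ($name in $profiles.Keys) {
        if ($line -match $profiles[$name]) { $target = $name; break }
    }
    $routed[$target] += $line
}
```

With this layout the long negative pattern is not needed at all: a line lands in pop only because none of the positive patterns claimed it.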
