I am filtering a large web access log and splitting it into about a dozen smaller files, depending on which regex pattern each line matches. Since I have little experience with regex, I would like to learn how to optimise the patterns for better performance.
The source is formatted as follows:
2015-06-14 00:00:06 38.75.53.205 - HTTP 10.250.35.69 80 GET /en/process/dsa/policy/wbss/wbss_current/wbss2013/Wbss13.pdf - 206 299 16722 0 HTTP (etc)
or
2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200 396 23040 0 HTTP/1.1 Mozilla (etc)
The following are a few of my regex patterns. They all look in the same area of each line, after GET. This is how I have them now:
dsq = "( /esd/sddev/| /creative/)"
dpq = "/dsa/policy/"
pop = "(^((?! /popq/ /caster/(dsa/(policy|qsc|qlation))|(esd/(fed|cdq|qaccount|sddev|creative|forums/rdev))).)*$)"
The first two match the specified paths, while "pop" is supposed to match everything BUT the specified paths.
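If the intent of "pop" really is "everything except the listed paths", it is usually cheaper to keep the plain positive alternation and invert the test with -notmatch, instead of running a character-by-character negative lookahead. A sketch with a simplified pattern (the alternation below is illustrative, not your full exclusion list):

```powershell
# Illustrative positive pattern (not the full exclusion list from the question)
$popPositive = ' /(popq|caster)/| /esd/(fed|cdq|qaccount|sddev|creative|forums/rdev)/'

$line = '2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200'

# -notmatch selects lines that do NOT hit any excluded path;
# it is $false for this sample line because /esd/sddev/ is in the excluded set
$line -notmatch $popPositive
```

The engine then only has to find (or fail to find) one of the alternatives once per line, rather than re-checking the lookahead at every character position.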
This works as it is, but since my log files tend to be rather large (1 GB and bigger) and I have some 12 different patterns to match, I was hoping there may be a way to improve the performance of these patterns.
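One speedup that is independent of the pattern wording: precompile each pattern once as a .NET Regex object instead of passing a string to -match for every batch. A sketch (requires PowerShell 5+ for [regex]::new; the compressed dsq alternation is equivalent to the original):

```powershell
# RegexOptions.Compiled makes .NET emit IL for the pattern, which pays off
# when the same regex is applied millions of times.
$opts = [System.Text.RegularExpressions.RegexOptions]::Compiled
$dsq  = [regex]::new(' /(esd/sddev|creative)/', $opts)

$sample = '2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200 396 23040 0'
$dsq.IsMatch($sample)   # $true for this line
```

IsMatch also skips the overhead of PowerShell's comparison operators and automatic $Matches population.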
As for the usage, I have the following code, where $profile is one of the patterns listed above (they are kept in a hash table, and I loop through them one by one):

Get-Content $sourcefile -ReadCount 5000 |
    ForEach-Object { $_ -match $profile | Add-Content $targetfile }
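Since you loop over the hash table, a 1 GB file gets read 12 times. A single pass that tests every profile per line should beat any per-pattern tuning. A hypothetical sketch (it builds a tiny sample log so it runs standalone; in your script $sourcefile would be the real log and the output paths would come from your hash table):

```powershell
# Create a tiny sample log so the sketch runs standalone
# (in the real script $sourcefile is the 1 GB access log).
$sourcefile = Join-Path ([System.IO.Path]::GetTempPath()) 'sample_access.log'
@(
  '2015-06-14 00:00:06 38.75.53.205 - HTTP 10.250.35.69 80 GET /en/process/dsa/policy/wbss/wbss_current/wbss2013/Wbss13.pdf - 206 299 16722 0',
  '2015-06-13 00:00:31 1.22.55.170 - HTTP 157.150.186.68 80 GET /esd/sddev/enable/compl.htm - 200 396 23040 0'
) | Set-Content $sourcefile

$opts     = [System.Text.RegularExpressions.RegexOptions]::Compiled
$profiles = @{                      # the question's hash of name -> pattern (two shown)
    dsq = ' /(esd/sddev|creative)/'
    dpq = '/dsa/policy/'
}

# Precompile one regex and open one output writer per profile
$regexes = @{}
$writers = @{}
$outDir  = [System.IO.Path]::GetTempPath()
foreach ($name in $profiles.Keys) {
    $regexes[$name] = [regex]::new($profiles[$name], $opts)
    $writers[$name] = [System.IO.StreamWriter]::new((Join-Path $outDir "$name.log"))
}

# One read of the big file; every profile is tested against each line
$reader = [System.IO.StreamReader]::new($sourcefile)
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        foreach ($name in $regexes.Keys) {
            if ($regexes[$name].IsMatch($line)) { $writers[$name].WriteLine($line) }
        }
    }
}
finally {
    $reader.Close()
    foreach ($w in $writers.Values) { $w.Close() }
}
```

StreamReader/StreamWriter also avoid the per-object pipeline overhead of Get-Content and Add-Content, and keeping the writers open for the whole run avoids reopening each target file for every batch.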
Thank you all for any insight!