Extracting number field from logfile using regex

Question

I have a log file with the following output:

Time = 1

smoothSolver:  Solving for Ux, Initial residual = 0.230812, Final residual = 0.0134171, No Iterations 2
smoothSolver:  Solving for Uy, Initial residual = 0.283614, Final residual = 0.0158797, No Iterations 3
smoothSolver:  Solving for Uz, Initial residual = 0.190444, Final residual = 0.016567, No Iterations 2
GAMG:  Solving for p, Initial residual = 0.0850116, Final residual = 0.00375608, No Iterations 3
time step continuity errors : sum local = 0.00999678, global = 0.00142109, cumulative = 0.00142109
smoothSolver:  Solving for omega, Initial residual = 0.00267604, Final residual = 0.000166675, No Iterations 3
bounding omega, min: -26.6597 max: 18468.7 average: 219.43
smoothSolver:  Solving for k, Initial residual = 1, Final residual = 0.0862096, No Iterations 2
ExecutionTime = 4.84 s  ClockTime = 5 s

I need to extract cumulative = 0.00142109 (which is in line 5 of the above output) using Python's regular expressions. More precisely, I need to extract only the value 0.00142109 that corresponds to cumulative and write it to another file.

Currently, this is what I have:

contCumulative_0_out = open('contCumulative_0', 'w+')

with open(logFile, 'r') as logfile_read:
    for line in logfile_read:
        line = line.rstrip()
        if re.findall('cumulative = ([+-]?\d+)(?:\.\d+)?(?:[eE][+-]?\d+)?', line):
            print line
            contCumulative_0_out.write(line)

However, the output with the above code is:

time step continuity errors : sum local = 0.00999678, global = 0.00142109, cumulative = 0.00142109

I am basically getting the entire line that matches cumulative, not just the value.

Please let me know how to extract only the value corresponding to cumulative.

Andrew_Lvov · Accepted Answer · 2015-01-18 23:12:46Z

0

If the number is in the format you specified, I'd use simpler regex pattern:

for line in logfile_read:
    res = re.search(r'cumulative = ((\d|.)+)', line)
    if res:
        contCumulative_0_out.write(res.group(1))

Otherwise, just keep your pattern, but use re.search and write the value of res.group(n), where res is the result of re.search and n is the number of the parenthesized subexpression in your regex that you want. (re.match would not work here, since it only matches at the start of the line.)
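A minimal sketch of that second suggestion, assuming Python 3 and assuming the value may carry a sign, a fractional part, or an exponent. The names FLOAT_RE and extract_cumulative are illustrative and not from the original post. The whole number sits inside a single capture group, so group(1) returns it intact:

```python
import re

# Hypothetical names; the pattern wraps the ENTIRE number in one capture group.
FLOAT_RE = re.compile(r'cumulative = ([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)')

def extract_cumulative(line):
    """Return the cumulative value as a string, or None if the line has none."""
    res = FLOAT_RE.search(line)  # search() scans the whole line, unlike match()
    return res.group(1) if res else None

sample = ("time step continuity errors : sum local = 0.00999678, "
          "global = 0.00142109, cumulative = 0.00142109")
print(extract_cumulative(sample))
```

Returning a string rather than a float keeps the digits exactly as they appear in the log, which may matter if the value is only being copied to another file.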

answered Jan 18, 2015 at 23:12

Andrew_Lvov


2 Comments

smci Over a year ago

Minor point: ((\d|.)+) is better written as ([\d\.]+) and more efficient to parse.

hypersonics Over a year ago

Thanks @smci. However, when I use ([\d\.]+), all negative values are omitted. My original post showed only one set of output, but subsequent outputs do contain negative values.
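To illustrate the point about negative values: a character class of digits and dots cannot match a leading minus sign, so the search fails on those lines entirely. A small sketch, assuming Python 3; the negative sample value is invented for illustration, and an optional sign in front of the class is only one possible way to keep such lines:

```python
import re

lines = [
    "global = 0.00142109, cumulative = 0.00142109",
    "global = -0.00031, cumulative = -0.00031",  # invented sample value
]

unsigned = re.compile(r'cumulative = ([\d.]+)')
signed = re.compile(r'cumulative = ([+-]?[\d.]+)')  # optional leading sign

def values(pattern, rows):
    """Collect group(1) from every row where the pattern is found."""
    found = []
    for row in rows:
        m = pattern.search(row)
        if m:
            found.append(m.group(1))
    return found

unsigned_hits = values(unsigned, lines)
signed_hits = values(signed, lines)
print(unsigned_hits)
print(signed_hits)
```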

smci · Accepted Answer · 2015-01-19 03:13:49Z

0

That's because re.findall() returns a list of strings, whereas re.search() returns a match object (or None if nothing matches). Either way, you're throwing away the return value of your re.findall()/re.search() call, and your code then just writes line.

# Wrong
if re.findall(<regex>, line):
    print line
    contCumulative_0_out.write(line)

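# Right
mat = re.search(<regex>, line) # but your regex needs changing, see below
if mat:
    cumvalue = mat.group(1) # the captured number as a string; mat.groups() would be a tuple
    print cumvalue
    contCumulative_0_out.write(cumvalue + '\n') # write() needs a string; one value per line
    #break # if you know you only have at most one match per file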

However, as @Andrew_Lvov points out, your regex is more complicated than necessary, and its only capture group, ([+-]?\d+), covers just the integer part of the number. So now you need to fix that. Andrew's regex is faster and good enough: we know the number won't be malformed, and this log won't contain anything with multiple periods, like an IP address, after 'cumulative = '.

(Btw, for efficiency, if you're guaranteed at most one instance of 'cumulative' line per file, you have no reason not to break from the for-loop after you process your match. Also, for efficiency, the line = line.rstrip() is unnecessary.)

Anyway, skim the doc on the difference between match, search and findall/finditer. Essential to know which is which. Pattern-matching fns and their variants are a pain in the ass in pretty much every language. Or type help(re) inside a Python shell.
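As a quick reference for that difference, here is a short sketch, assuming Python 3, that runs the same pattern through all four functions on a shortened copy of the log line. The variable names are illustrative only:

```python
import re

line = "sum local = 0.00999678, global = 0.00142109, cumulative = 0.00142109"
pattern = r'cumulative = ([\d.]+)'

at_start = re.match(pattern, line)      # only tries at position 0 of the string
anywhere = re.search(pattern, line)     # first match anywhere in the string
all_groups = re.findall(pattern, line)  # list of strings (the single group's text)
spans = [m.span(1) for m in re.finditer(pattern, line)]  # iterates match objects

print(at_start)
print(anywhere.group(1))
print(all_groups)
print(spans)
```

Here match() finds nothing because the line does not begin with "cumulative", while search() and findall() both locate the value; findall() returns plain strings, so there is no match object to call group() on.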

edited Jan 19, 2015 at 3:13

answered Jan 18, 2015 at 23:09

smci


6 Comments

hypersonics Over a year ago

Thanks @Andrew_Lvov for your corrections. I have now used your regex syntax.

hypersonics Over a year ago

With @Andrew_Lvov regex syntax, my output is ('0.00142109', '9'). How to eliminate '9'?

smci Over a year ago

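Then you want mat.group(1). mat.groups() returns every capture group as a tuple, and mat.group(0) would be the whole match, including 'cumulative = '. But anyway, the extra '9' goes away if you write the regex using ([\d\.]+) rather than ((\d|.)+).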

hypersonics Over a year ago

Is there a way to extract multiple strings? For example my final aim is to extract Time = <> (present in the first line of logfile) and the cumulative value corresponding to that. The output should be 1 <space> 0.00142109.

hypersonics Over a year ago

Hi @smci have a look at stackoverflow.com/questions/28017121/…. Much appreciated.
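For the follow-up about pairing each Time value with its cumulative value, one possible sketch, assuming Python 3 and assuming every "Time = ..." line precedes the cumulative line that belongs to it. TIME_RE, CUM_RE and pair_time_and_cumulative are hypothetical names, not taken from the linked question:

```python
import re

TIME_RE = re.compile(r'^Time = (\S+)')
CUM_RE = re.compile(r'cumulative = ([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)')

def pair_time_and_cumulative(lines):
    """Yield 'time value' strings, one per cumulative line seen after a Time line."""
    current_time = None
    for line in lines:
        t = TIME_RE.match(line)  # match() is right here: the line must START with Time
        if t:
            current_time = t.group(1)
            continue
        c = CUM_RE.search(line)
        if c and current_time is not None:
            yield current_time + ' ' + c.group(1)

log = [
    "Time = 1",
    "",
    "GAMG:  Solving for p, Initial residual = 0.0850116, Final residual = 0.00375608, No Iterations 3",
    "time step continuity errors : sum local = 0.00999678, global = 0.00142109, cumulative = 0.00142109",
    "ExecutionTime = 4.84 s  ClockTime = 5 s",
]
rows = list(pair_time_and_cumulative(log))
print(rows)
```

Anchoring the Time pattern at the start of the line keeps it from firing on the ExecutionTime and ClockTime entries.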
