
I've been searching for a solution to this question for a while without any luck. I want to use Python to read a text file and build some lists (or arrays) from the data in the file. An example will best illustrate my goal.

Consider the following text:

NODE
1.0, 2.0
2.0, 2.0
3.0, 2.0
4.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
1, 2, 3, 4
1, 2, 3, 4
1, 2, 3, 4
5, 6, 7, 8
5, 6, 7, 8
5, 6, 7, 8

I would like to read through the file (ideally once, as the files can be large) and, once I find "NODE", take each line between "NODE" and "ELEMENT" and put it into a list. Then, once I reach "ELEMENT", take each line between "ELEMENT" and some other break (maybe another "ELEMENT", end of file, etc.) and put that into a list. For this example, it would result in two lists.

I've tried various things but they all require knowing information about the file beforehand. I'd like to be able to automate it. Thank you very much!

4 Comments

  • What have you done so far? Please post the code you have written. Commented Aug 15, 2014 at 5:07
  • If you don't want to require any information about the file beforehand, what's the rule that tells you that you've hit a new section? Commented Aug 15, 2014 at 5:19
  • @abarnert I misspoke in my initial post, I know what sections I'm looking for (i.e. NODE or ELEMENT), just not the number of lines between each section. Commented Aug 15, 2014 at 19:40
  • Thank you all for the different options. Dawg's solution looks like it will be most likely to do what I need to do in the big picture. Commented Aug 15, 2014 at 19:51

4 Answers


With that example data, and assuming that the labels follow the pattern in your example, you can use a regex:

import re, mmap

def conv(s):
    try:
        return float(s)
    except ValueError:
        return s

data_dict = {}
with open(fn, 'rb') as fin:
    # map the whole file (length 0 means "the entire file")
    # without reading it into memory
    data = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(rb'^(\w+)$([\d\s,.]+)', data, re.M):
        data_dict[m.group(1).decode()] = [
            [conv(e) for e in line.decode().split(',')]
            for line in m.group(2).splitlines() if line.strip()]

print(data_dict)

Prints:

{'NODE': [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]], 
 'ELEMENT': [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]]}

So, how does this work:

  1. We use mmap so the regex can scan the whole file without reading it into memory
  2. We assume that the labels are of the form ^\w+$ (i.e., a label made up of word characters alone on a line)
  3. Then capture all the digits, separators, and whitespace that follow
  4. Create a dict with the label as the key and the parsed rows, as lists of floats, as the value.

Done!
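To see the regex by itself, here is a minimal sketch that applies the same pattern to the question's sample held in a plain string, skipping mmap (which only matters for large files):

```python
import re

def conv(s):
    try:
        return float(s)
    except ValueError:
        return s

sample = """NODE
1.0, 2.0
2.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
"""

data_dict = {}
# ^(\w+)$ grabs a label line; ([\d\s,.]+) grabs the data lines under it
for m in re.finditer(r'^(\w+)$([\d\s,.]+)', sample, re.M):
    data_dict[m.group(1)] = [[conv(e) for e in line.split(',')]
                             for line in m.group(2).splitlines() if line.strip()]
print(data_dict)
```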


5 Comments

@dawg, I really like this solution as I think it will provide future flexibility. That said, I have a lot to learn about mmap and regex. One immediate question regarding regex. I also have section headers that look like "*Section, Name = Section1". Mind providing more information on how to handle this with regex? Thanks!
For that exact example, use ^(\w+,\s+\w+\s*=\s*\w+)$ for the section part of the regex (not tested...)
Thanks again @dawg! One last question, I promise (this is getting too far from the original question). I can't seem to get your example regex to work. I think it has to do with the asterisk at the beginning of the string "*Section". How do I capture that asterisk? Thanks again!
Try ^(\*\w+,\s+\w+\s*=\s*\w+)$
The only problem with this solution (which is often not a problem at all, as long as you make sure it isn't relevant) is that mmap can't handle huge files on 32-bit platforms. It doesn't have to read the whole file into memory, but it does have to allocate page space for the whole file, and in 32-bit-land, there's only 2-4GB of page space.

If you want this to be fully general and automated, you need to come up with the rule that distinguishes section headers from rows. I'll invent one, but it's probably not the one you want, in which case my invented code won't work for you… but hopefully it will show you what you need to do, and how to get started.

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()
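A quick sanity check of that rule (restated here so the snippet stands alone) on rows like the ones csv.reader would produce:

```python
def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

print(new_section(['NODE']))        # a bare uppercase word: a section header
print(new_section(['1.0', '2.0']))  # a data row: more than one field
```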

Now, we can just group the rows by whether or not they're section headers by using itertools.groupby. If you printed out each group, you'd get something like this:

True, [['NODE']]
False, [['1.0', '2.0'], ['2.0', '2.0'], …, ]
True, [['ELEMENT']]
False, [['1.0', '2.0', '3.0', '4.0'], …, ]

We don't care about the first value in each of those, so drop it.

And we want to batch up each pair of adjacent groups into a (header, rows) pair, which we can do by zipping our iterator with itself.
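The zip-with-itself trick is easiest to see on a plain iterator with stand-in group values (zip's guaranteed left-to-right evaluation is what pairs consecutive items):

```python
# stand-in values: alternating header groups and data groups
groups = iter([[['NODE']], [['1.0', '2.0']],
               [['ELEMENT']], [['1', '2', '3', '4']]])

# each zip step pulls two consecutive items from the same iterator
pairs = list(zip(groups, groups))
print(pairs)
```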

And then just put it in a dict, which will look something like this:

{'NODE': [['1.0', '2.0'], ['2.0', '2.0'], …],
 'ELEMENT': [['1.0', '2.0', '3.0', '4.0'], …]}

Here's the whole thing:

import csv
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

with open(path) as f:
    rows = csv.reader(f)
    grouped = itertools.groupby(rows, new_section)
    # materialize each group: a groupby group is invalidated
    # as soon as the next group is pulled from the iterator
    groups = (list(group) for key, group in grouped)
    pairs = zip(groups, groups)
    lists = {header[0][0]: body for header, body in pairs}
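For a self-contained check, the same approach can be run against the sample data with io.StringIO standing in for the real file (skipinitialspace=True is added here, beyond the answer's code, so the spaces after the commas are dropped):

```python
import csv
import io
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

sample = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"

rows = csv.reader(io.StringIO(sample), skipinitialspace=True)
grouped = itertools.groupby(rows, new_section)
# materialize each group before the next one is pulled
groups = (list(group) for key, group in grouped)
pairs = zip(groups, groups)
lists = {header[0][0]: body for header, body in pairs}
print(lists)
```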

import re

def getBlocks(fname):
    state = 0
    node = []
    ele = []
    with open(fname) as f:
        for line in f:
            if "NODE" in line:
                if state == 2:
                    yield (node, ele)
                    node, ele = [], []
                state = 1
            elif state == 1 and "ELEMENT" in line:
                state = 2
            elif state == 1:
                node.append(list(map(float, line.split(","))))
            elif state == 2 and re.match("[a-zA-Z]+", line):
                yield (node, ele)
                node, ele = [], []
                state = 0
            elif state == 2:
                ele.append(list(map(int, line.split(","))))
    yield (node, ele)

for node, ele in getBlocks("somefile.txt"):
    print("N:", node)
    print("E:", ele)

This might be about what you're looking for. It's a bit rough; I'm sure you can do it better.
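Here is a self-contained run of the generator against the question's sample data, written to a temporary file (the temp-file scaffolding is only for the demonstration):

```python
import os
import re
import tempfile

def getBlocks(fname):
    state = 0
    node, ele = [], []
    with open(fname) as f:
        for line in f:
            if "NODE" in line:
                if state == 2:           # a new NODE section: emit the last block
                    yield (node, ele)
                    node, ele = [], []
                state = 1
            elif state == 1 and "ELEMENT" in line:
                state = 2
            elif state == 1:
                node.append(list(map(float, line.split(","))))
            elif state == 2 and re.match("[a-zA-Z]+", line):
                yield (node, ele)        # some other labelled section ends the block
                node, ele = [], []
                state = 0
            elif state == 2:
                ele.append(list(map(int, line.split(","))))
    yield (node, ele)                    # emit whatever is left at end of file

sample = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
blocks = list(getBlocks(tmp.name))
os.unlink(tmp.name)
print(blocks)
```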


For the simpler problem in the updated question, you really don't need regexps, or groupby, or a complex state machine, or anything beyond what a novice should be able to understand easily.

All you need to do is accumulate rows into one list until you find the row 'ELEMENT', then start accumulating rows into the other one. Like this:

import csv
result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']
with open(path) as f:
    for row in csv.reader(f):
        if row == ['NODE']:
            pass
        elif row == ['ELEMENT']:
            current = result['ELEMENTS']
        else:
            current.append(row)
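Run against the question's sample data (with io.StringIO standing in for the file, and skipinitialspace=True added beyond the answer's code so the spaces after the commas are dropped):

```python
import csv
import io

sample = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"

result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']          # accumulate here until ELEMENT is seen
for row in csv.reader(io.StringIO(sample), skipinitialspace=True):
    if row == ['NODE']:
        pass                       # header of the first section: nothing to do
    elif row == ['ELEMENT']:
        current = result['ELEMENTS']
    else:
        current.append(row)
print(result)
```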

1 Comment

Thanks @abarnert! This is a very simple, clean approach. This appears to be a viable option as well.
