
I've been searching for a solution to this question for a while without any luck. I want to use Python to read a text file and build some lists (or arrays) from the data in the file. An example will best illustrate my goal.

Consider the following text:

NODE
1.0, 2.0
2.0, 2.0
3.0, 2.0
4.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
1, 2, 3, 4
1, 2, 3, 4
1, 2, 3, 4
5, 6, 7, 8
5, 6, 7, 8
5, 6, 7, 8

I would like to read through the file (ideally once, as the files can be large) and, once I find "NODE", take each line between "NODE" and "ELEMENT" and put it into a list. Then, once I reach "ELEMENT", take each line between "ELEMENT" and some other break (maybe another "ELEMENT", end of file, etc.) and put that into a list. For this example, it would result in two lists.

I've tried various things but they all require knowing information about the file beforehand. I'd like to be able to automate it. Thank you very much!

4 Comments

  • What have you done so far? Please post the code you have written. Commented Aug 15, 2014 at 5:07
  • If you don't want to require any information about the file beforehand, what's the rule that tells you that you've hit a new section? Commented Aug 15, 2014 at 5:19
  • @abarnert I misspoke in my initial post, I know what sections I'm looking for (i.e. NODE or ELEMENT), just not the number of lines between each section. Commented Aug 15, 2014 at 19:40
  • Thank you all for the different options. Dawg's solution looks like it will be most likely to do what I need to do in the big picture. Commented Aug 15, 2014 at 19:51

4 Answers


With that example data, and assuming that the labels follow the pattern in your example, you can use a regex:

import re, mmap

def conv(s):
    try:
        return float(s)
    except ValueError:
        return s

data_dict = {}
with open(fn, 'rb') as fin:
    # map the whole file (length 0 means "the entire file")
    # without reading it into memory
    data = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for m in re.finditer(rb'^(\w+)$([\d\s,.]+)', data, re.M):
        data_dict[m.group(1).decode()] = [
            [conv(e) for e in line.decode().split(',')]
            for line in m.group(2).splitlines() if line.strip()]

print(data_dict)

Prints:

{'NODE': [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [4.0, 2.0]], 
 'ELEMENT': [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]]}

So, how does this work:

  1. We use mmap so the regex can scan the whole file without reading it into memory
  2. We assume that the labels are of the form ^\w+$ (i.e., a label made up of word characters alone on a line)
  3. Then capture all the digits, separators, and whitespace that follow
  4. Create a dict with the label as the key and the parsed rows, as lists of floats, as the value.

Done!
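To see the regex by itself, here is a minimal sketch that applies the same pattern to the question's sample held in a plain string, skipping mmap (which only matters for large files):

```python
import re

def conv(s):
    try:
        return float(s)
    except ValueError:
        return s

sample = """NODE
1.0, 2.0
2.0, 2.0
ELEMENT
1, 2, 3, 4
5, 6, 7, 8
"""

data_dict = {}
# ^(\w+)$ grabs a label line; ([\d\s,.]+) grabs the data lines under it
for m in re.finditer(r'^(\w+)$([\d\s,.]+)', sample, re.M):
    data_dict[m.group(1)] = [[conv(e) for e in line.split(',')]
                             for line in m.group(2).splitlines() if line.strip()]
print(data_dict)
```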


5 Comments

@dawg, I really like this solution as I think it will provide future flexibility. That said, I have a lot to learn about mmap and regex. One immediate question regarding regex. I also have section headers that look like "*Section, Name = Section1". Mind providing more information on how to handle this with regex? Thanks!
For that exact example, use ^(\w+,\s+\w+\s*=\s*\w+)$ for the section part of the regex (not tested...)
Thanks again @dawg! One last question, I promise (this is getting too far from the original question). I can't seem to get your example regex to work. I think it has to do with the asterisk at the beginning of the string "*Section". How do I capture that asterisk? Thanks again!
Try ^(\*\w+,\s+\w+\s*=\s*\w+)$
The only problem with this solution (which is often not a problem at all, as long as you make sure it isn't relevant) is that mmap can't handle huge files on 32-bit platforms. It doesn't have to read the whole file into memory, but it does have to allocate page space for the whole file, and in 32-bit-land, there's only 2-4GB of page space.

If you want this to be fully general and automated, you need to come up with the rule that distinguishes section headers from rows. I'll invent one, but it's probably not the one you want, in which case my invented code won't work for you… but hopefully it will show you what you need to do, and how to get started.

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()
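A quick sanity check of that rule (restated here so the snippet stands alone) on rows like the ones csv.reader would produce:

```python
def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

print(new_section(['NODE']))        # a bare uppercase word: a section header
print(new_section(['1.0', '2.0']))  # a data row: more than one field
```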

Now, we can just group the rows by whether or not they're section headers by using itertools.groupby. If you printed out each group, you'd get something like this:

True, [['NODE']]
False, [['1.0', '2.0'], ['2.0', '2.0'], …, ]
True, [['ELEMENT']]
False, [['1.0', '2.0', '3.0', '4.0'], …, ]

We don't care about the first value in each of those, so drop it.

And we want to batch up each pair of adjacent groups into a (header, rows) pair, which we can do by zipping our iterator with itself.
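The zip-with-itself trick is easiest to see on a plain iterator with stand-in group values (zip's guaranteed left-to-right evaluation is what pairs consecutive items):

```python
# stand-in values: alternating header groups and data groups
groups = iter([[['NODE']], [['1.0', '2.0']],
               [['ELEMENT']], [['1', '2', '3', '4']]])

# each zip step pulls two consecutive items from the same iterator
pairs = list(zip(groups, groups))
print(pairs)
```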

And then just put it in a dict, which will look something like this:

{'NODE': [['1.0', '2.0'], ['2.0', '2.0'], …],
 'ELEMENT': [['1.0', '2.0', '3.0', '4.0'], …]}

Here's the whole thing:

import csv
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

with open(path) as f:
    rows = csv.reader(f)
    grouped = itertools.groupby(rows, new_section)
    # materialize each group: a groupby group is invalidated
    # as soon as the next group is pulled from the iterator
    groups = (list(group) for key, group in grouped)
    pairs = zip(groups, groups)
    lists = {header[0][0]: body for header, body in pairs}
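For a self-contained check, the same approach can be run against the sample data with io.StringIO standing in for the real file (skipinitialspace=True is added here, beyond the answer's code, so the spaces after the commas are dropped):

```python
import csv
import io
import itertools

def new_section(row):
    return len(row) == 1 and row[0].isalpha() and row[0].isupper()

sample = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"

rows = csv.reader(io.StringIO(sample), skipinitialspace=True)
grouped = itertools.groupby(rows, new_section)
# materialize each group before the next one is pulled
groups = (list(group) for key, group in grouped)
pairs = zip(groups, groups)
lists = {header[0][0]: body for header, body in pairs}
print(lists)
```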

import re

def getBlocks(fname):
    state = 0
    node = []
    ele = []
    with open(fname) as f:
        for line in f:
            if "NODE" in line:
                if state == 2:
                    yield (node, ele)
                    node, ele = [], []
                state = 1
            elif state == 1 and "ELEMENT" in line:
                state = 2
            elif state == 1:
                node.append(list(map(float, line.split(","))))
            elif state == 2 and re.match("[a-zA-Z]+", line):
                yield (node, ele)
                node, ele = [], []
                state = 0
            elif state == 2:
                ele.append(list(map(int, line.split(","))))
    yield (node, ele)

for node, ele in getBlocks("somefile.txt"):
    print("N:", node)
    print("E:", ele)

This might be about what you're looking for. It's a bit rough; I'm sure you can do it better.
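Here is a self-contained run of the generator against the question's sample data, written to a temporary file (the temp-file scaffolding is only for the demonstration):

```python
import os
import re
import tempfile

def getBlocks(fname):
    state = 0
    node, ele = [], []
    with open(fname) as f:
        for line in f:
            if "NODE" in line:
                if state == 2:           # a new NODE section: emit the last block
                    yield (node, ele)
                    node, ele = [], []
                state = 1
            elif state == 1 and "ELEMENT" in line:
                state = 2
            elif state == 1:
                node.append(list(map(float, line.split(","))))
            elif state == 2 and re.match("[a-zA-Z]+", line):
                yield (node, ele)        # some other labelled section ends the block
                node, ele = [], []
                state = 0
            elif state == 2:
                ele.append(list(map(int, line.split(","))))
    yield (node, ele)                    # emit whatever is left at end of file

sample = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
blocks = list(getBlocks(tmp.name))
os.unlink(tmp.name)
print(blocks)
```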


For the simpler problem in the updated question, you really don't need regexps, or groupby, or a complex state machine, or anything beyond what a novice should be able to understand easily.

All you need to do is accumulate rows into one list until you find the row 'ELEMENT', then start accumulating rows into the other one. Like this:

import csv
result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']
with open(path) as f:
    for row in csv.reader(f):
        if row == ['NODE']:
            pass
        elif row == ['ELEMENT']:
            current = result['ELEMENTS']
        else:
            current.append(row)
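Run against the question's sample data (with io.StringIO standing in for the file, and skipinitialspace=True added beyond the answer's code so the spaces after the commas are dropped):

```python
import csv
import io

sample = "NODE\n1.0, 2.0\n2.0, 2.0\nELEMENT\n1, 2, 3, 4\n5, 6, 7, 8\n"

result = {'NODES': [], 'ELEMENTS': []}
current = result['NODES']          # accumulate here until ELEMENT is seen
for row in csv.reader(io.StringIO(sample), skipinitialspace=True):
    if row == ['NODE']:
        pass                       # header of the first section: nothing to do
    elif row == ['ELEMENT']:
        current = result['ELEMENTS']
    else:
        current.append(row)
print(result)
```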

1 Comment

Thanks @abarnert! This is a very simple, clean approach. This appears to be a viable option as well.
