Need more efficient way to parse out csv file in Python

Question

Here's a sample csv file

id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601

This is the output I'm looking for (a list of serial_no values within a list of ids):

[2, [500,501,502]]
[3, [600, 601]]

I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.

import csv

file = 'test.csv'

data = csv.reader(open(file))
fields = data.next()

# copy the first two columns of every row
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)
# collect the unique ids
for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])
# gather the serial numbers for each id
for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []

Note: I've omitted variable initialization to keep the code short.

The print tmp line gives me the output I mentioned above. I know there's a better, more Pythonic way to do this. It's just too messy! Any suggestions would be great!

"Pythonic", not "Pythonian". :)

– Karl Knechtel, Commented Jun 29, 2011 at 2:26
You'll never be a "pythonista" with remarks like that ;-D

– pavium, Commented Jun 29, 2011 at 2:30
Just to be clear, short code does not imply efficiency.

– tjm, Commented Jun 29, 2011 at 2:31
you have 600,600 but may mean 600,601

– ninjagecko, Commented Jun 29, 2011 at 2:42
yes i did, it's a typo. thanks!

– t0x13, Commented Jun 29, 2011 at 2:46

Philip Southam · Accepted Answer · 2011-06-29 03:06:25Z

13

import csv
from collections import defaultdict

records = defaultdict(list)

file = 'test.csv'

data = csv.reader(open(file))
fields = data.next()

for row in data:
    records[row[0]].append(row[1])

#sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results

If the serial_nos need to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
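For readers on Python 3, the same defaultdict idea can be sketched as below. This is an illustration under stated assumptions, not the answerer's code: the function name group_serials, its unique flag, and the in-memory sample standing in for test.csv are all made up here, and skipinitialspace=True is used only because the sample data has a space after each comma.

```python
import csv
import io
from collections import defaultdict


def group_serials(lines, unique=False):
    """Group serial_no values by id from an iterable of CSV lines."""
    # skipinitialspace drops the blank that follows each comma in the sample
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # skip the header row (data.next() in Python 2)
    records = defaultdict(set if unique else list)
    for row_id, serial_no in reader:
        if unique:
            records[row_id].add(serial_no)
        else:
            records[row_id].append(serial_no)
    return records


# Hypothetical in-memory stand-in for test.csv
sample = io.StringIO("id, serial_no\n2, 500\n2, 501\n2, 502\n3, 600\n3, 601\n")
records = group_serials(sample)
for row_id, serials in sorted(records.items()):
    print([row_id, serials])
```

The values stay strings, because the csv module does not convert types; wrap them in int() if numbers are needed.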

edited Jun 29, 2011 at 3:06

answered Jun 29, 2011 at 2:32

Philip Southam


5 Comments

t0x13 Over a year ago

This is beautiful! I have so much to learn about Python. I think this does exactly what I need. One question: when it prints the records, they are not in order (the ids come out as 3, 2, 4). Why does this happen, and what can I do to fix it?

Karl Knechtel Over a year ago

A dictionary does not keep its keys in order. You can convert the dictionary to a list of key-value pairs using .items(), and sort that list.

Philip Southam Over a year ago

@tasha I added sorting for you.

John La Rooy Over a year ago

You could use operator.itemgetter(0) instead of the lambda function for the key, but when sorting dictionary items the key is redundant: sorted will return the correct result without a key argument.

Philip Southam Over a year ago

@gnibbler +1 for the tip that now seems so obvious but had never thought about.
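The point about the redundant key can be checked with a short Python 3 sketch. The records dictionary below is made-up sample data in the shape the answer produces. One related caveat, added here as an observation rather than something the thread states: ids read from a CSV file are strings, so they sort lexicographically unless the key converts them.

```python
from operator import itemgetter

# Made-up sample data in the shape produced by the defaultdict answer
records = {'3': ['600', '601'], '10': ['700'], '2': ['500', '501', '502']}

# Dictionary keys are unique, so comparisons never fall through to the
# list values and the explicit key changes nothing.
by_lambda = sorted(records.items(), key=lambda x: x[0])
by_itemgetter = sorted(records.items(), key=itemgetter(0))
by_default = sorted(records.items())

# String keys sort lexicographically ('10' < '2' < '3'); converting the
# key to int gives numeric order instead.
by_number = sorted(records.items(), key=lambda item: int(item[0]))
```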

Ignacio Vazquez-Abrams · Answer · 2011-06-29 02:33:22Z

5

Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.

import collections

result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])

answered Jun 29, 2011 at 2:33

Ignacio Vazquez-Abrams


Mister_Tom · Answer · 2011-06-29 03:15:44Z

2

Here's a version I wrote, though it looks like there are plenty of answers for this one already.

You might like csv.DictReader, which gives you easy access to each column by field name (taken from the header, i.e. the first line).

#!/usr/bin/python
import csv

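myFile = open('sample.csv','rb')
# first row will be used for field names (by default);
# skipinitialspace strips the blank after each comma in the sample data,
# so the second field is named 'serial_no' rather than ' serial_no'
csvFile = csv.DictReader(myFile, skipinitialspace=True)

myData = {}

for myRow in csvFile:
    myId = myRow['id']
    if myId not in myData:
        myData[myId] = []
    myData[myId].append(myRow['serial_no'])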

for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])

myFile.close()

answered Jun 29, 2011 at 3:15

Mister_Tom


2 Comments

t0x13 Over a year ago

Thank you for another solution. I was wondering what is the more appropriate way to parse csv, by the column name or column position?

Mister_Tom Over a year ago

By column name is more readable (if you or someone else are trying to figure out what the code is doing). By column index (position) might be a little more efficient (should be negligible in this case of reading small csv files).
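The two access styles can be compared side by side in a short Python 3 sketch. The text variable is made-up sample data, and skipinitialspace=True is an assumption that the input has a space after each comma, as in the question.

```python
import csv
import io

text = "id, serial_no\n2, 500\n3, 600\n"  # made-up sample

# By position: skip the header, then index into each row.
reader = csv.reader(io.StringIO(text), skipinitialspace=True)
next(reader)
by_position = [(row[0], row[1]) for row in reader]

# By name: DictReader takes the field names from the header row.
dict_reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
by_name = [(row['id'], row['serial_no']) for row in dict_reader]
```

Both lists hold the same pairs. The by-name version keeps working if columns are reordered in the file, while the by-position version keeps working if a header is renamed.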

Karl Knechtel · Answer · 2011-06-29 02:38:42Z

1

Some observations:

0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...

1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.

2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.

3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.

4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.

5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.

6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).

Applying these ideas, we get:

import csv

filename = 'test.csv'

with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next() # ignore the field labels
    rows = list(data) # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]

We can probably do even better than this by using the groupby function from the itertools module.
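A rough Python 3 rendering of the set and list-comprehension ideas is sketched below. It is an adaptation with labeled assumptions, not the answer's exact code: the in-memory sample stands in for test.csv, the set of ids is built with a set comprehension because zip() returns an iterator in Python 3 (so zip(*rows)[0] raises TypeError there), and each id is paired with its list to match the output requested in the question.

```python
import csv
import io

# Hypothetical in-memory stand-in for test.csv
sample = io.StringIO("id, serial_no\n2, 500\n2, 501\n3, 600\n")

data = csv.reader(sample, skipinitialspace=True)
next(data)  # ignore the field labels
rows = list(data)  # read the rest of the rows from the iterator

# Unique ids from the first column
ids = {row[0] for row in rows}

result = [
    # Pair each id with all serial numbers from rows carrying that id
    [row_id, [serial_no for rid, serial_no in rows if rid == row_id]]
    for row_id in sorted(ids)
]
print(result)
```

This rescans every row once per id, so it suits small files; the dictionary-based answers make a single pass.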

answered Jun 29, 2011 at 2:38

Karl Knechtel


4 Comments

t0x13 Over a year ago

Thank you for all the suggestions.. I didn't know many of these features.

JBernardo Over a year ago

Actually, file is a deprecated alias for open. It's widely used as a variable name, and it was removed in Python 3.

John La Rooy Over a year ago

groupby will only work if rows with the same id are already grouped. They may be, but it is not specified.

John La Rooy Over a year ago

I added an answer using groupby for completeness

John La Rooy · Answer · 2011-06-29 03:45:27Z

0

An example using itertools.groupby. This only works if the rows are already grouped by id.

from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'

# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:

    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))

    # loop through the groups printing a list for each one
    for i,j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]

Note the space in front of ' serial_no'. This is because of the space after the comma in the input file.
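If the rows are not guaranteed to arrive grouped, one common workaround is to sort by id before calling groupby. The Python 3 sketch below assumes the whole file fits in memory and uses a made-up, deliberately interleaved sample; skipinitialspace=True is used here as an alternative to writing the leading space in the field name.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Made-up input in which rows for the same id are not adjacent
sample = io.StringIO("id, serial_no\n3, 600\n2, 500\n3, 601\n2, 501\n")

rows = list(csv.DictReader(sample, skipinitialspace=True))
# list.sort is stable, so file order is kept within each id
rows.sort(key=itemgetter('id'))

grouped = [
    [row_id, [row['serial_no'] for row in group]]
    for row_id, group in groupby(rows, key=itemgetter('id'))
]
print(grouped)
```

The sort costs O(n log n), so when the input is unsorted the defaultdict approach is usually the simpler choice.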

edited Jun 29, 2011 at 3:45

answered Jun 29, 2011 at 3:40

John La Rooy
