1

Here's a sample CSV file:

id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601

This is the output I'm looking for (a list of serial_no values within a list for each id):

[2, [500,501,502]]
[3, [600, 601]]

I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.

import csv

file = 'test.csv'

data = csv.reader(open(file))
fields = data.next()

for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)
for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])
for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []

Note: I've omitted the variable initialization to keep the code short.

print tmp gives me the output I mentioned above. I know there's a better, more Pythonic way to do this. It's just too messy! Any suggestions would be great!

5
  • "Pythonic", not "Pythonian". :) Commented Jun 29, 2011 at 2:26
  • You'll never be a "pythonista" with remarks like that ;-D Commented Jun 29, 2011 at 2:30
  • 3
    Just to be clear, short code does not imply efficiency. Commented Jun 29, 2011 at 2:31
  • you have 600,600 but may mean 600,601 Commented Jun 29, 2011 at 2:42
  • yes i did, it's a typo. thanks! Commented Jun 29, 2011 at 2:46

5 Answers 5

13
import csv
from collections import defaultdict

records = defaultdict(list)

file = 'test.csv'

data = csv.reader(open(file))
fields = data.next()

for row in data:
    records[row[0]].append(row[1])

#sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results

If the list of serial_nos needs to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
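
As a rough Python 3 sketch of the same idea (the snippet above targets Python 2): the helper name group_serials, the unique flag, and the in-memory SAMPLE string are assumptions made for illustration, not part of the original answer. skipinitialspace=True is used so the space after each comma in the sample file does not end up in the values.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for test.csv (assumed to match the sample in the question).
SAMPLE = "id, serial_no\n2, 500\n2, 501\n2, 502\n3, 600\n3, 601\n"


def group_serials(lines, unique=False):
    """Group serial numbers by id; `lines` is any iterable of CSV lines."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # skip the header row
    records = defaultdict(set if unique else list)
    for row_id, serial_no in reader:
        if unique:
            records[row_id].add(serial_no)
        else:
            records[row_id].append(serial_no)
    # Sort by id; sets are sorted as well so the result is deterministic.
    return [
        [row_id, sorted(serials) if unique else serials]
        for row_id, serials in sorted(records.items())
    ]


print(group_serials(io.StringIO(SAMPLE)))
```

Note that the csv module returns every field as a string, so the ids and serial numbers here are '2' and '500' rather than integers; convert with int() if numbers are needed.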


5 Comments

This is beautiful! I have so much to learn about Python. I think this does exactly what I need. One question, though: when it prints out the records, they are not in order (the ids come out as 3, 2, 4). Why does this happen, and what can I do to fix it?
A dictionary does not keep its keys in order. You can convert the dictionary to a list of key-value pairs using .items(), and sort that list.
@tasha I added sorting for you.
You could use operator.itemgetter(0) instead of the lambda function for the key, but when sorting dictionary items it is redundant: sorted will return the correct result without a key argument.
@gnibbler +1 for the tip that now seems so obvious but had never thought about.
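
To illustrate the point about the key argument, here is a small Python 3 sketch. The records dictionary is made-up sample data, not taken from the thread. Because dictionary keys are unique, comparing the (key, value) tuples never gets past the key, so all three spellings agree. The sketch also shows one thing to watch for: ids read from a CSV file are strings, so they sort as text unless converted.

```python
from operator import itemgetter

# Hypothetical grouped data; keys are strings, as the csv module would return them.
records = {"3": ["600", "601"], "2": ["500", "501", "502"], "10": ["700"]}

by_lambda = sorted(records.items(), key=lambda item: item[0])
by_itemgetter = sorted(records.items(), key=itemgetter(0))
by_default = sorted(records.items())  # tuples compare on the key first

# String keys sort as text, so "10" comes before "2".
# Convert in the key function when numeric order is wanted.
by_number = sorted(records.items(), key=lambda item: int(item[0]))

print(by_default)
print(by_number)
```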
5

Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.

import collections

# data is the csv.reader from the question, with the header row already skipped
result = collections.defaultdict(list)
for row in data:
  result[row[0]].append(row[1])


2

Here's a version I wrote, looks like there are plenty of answers for this one already though.

You might like using csv.DictReader, which gives you easy access to each column by field name (taken from the header / first line).

#!/usr/bin/python
import csv

myFile = open('sample.csv','rb')
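csvFile = csv.DictReader(myFile, skipinitialspace=True)
# first row will be used for field names (by default);
# skipinitialspace drops the space after each comma in the question's sample,
# so the second field name is 'serial_no' rather than ' serial_no'

myData = {}

for myRow in csvFile:
    myId = myRow['id']
    if myId not in myData: myData[myId] = []
    myData[myId].append(myRow['serial_no'])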

for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])

myFile.close()
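
For readers on Python 3, a minimal sketch of the same DictReader approach might look like this. The function name group_by_name and the in-memory SAMPLE string are assumptions for illustration. dict.setdefault replaces the has_key check, and skipinitialspace=True assumes the file has a space after each comma, as in the question.

```python
import csv
import io

# In-memory stand-in for sample.csv (assumed to match the sample in the question).
SAMPLE = "id, serial_no\n2, 500\n2, 501\n2, 502\n3, 600\n3, 601\n"


def group_by_name(lines):
    """Group serial_no values by id, addressing columns by header name."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    grouped = {}
    for row in reader:
        # setdefault creates the empty list the first time an id is seen.
        grouped.setdefault(row["id"], []).append(row["serial_no"])
    return grouped


for row_id, serials in sorted(group_by_name(io.StringIO(SAMPLE)).items()):
    print(row_id, serials)
```

When reading from a real file in Python 3, the csv documentation recommends opening it in text mode with newline='' instead of 'rb'.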

2 Comments

Thank you for another solution. I was wondering what is the more appropriate way to parse csv, by the column name or column position?
By column name is more readable (if you or someone else are trying to figure out what the code is doing). By column index (position) might be a little more efficient (should be negligible in this case of reading small csv files).
1

Some observations:

0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...

1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.

2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.

3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.

4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.

5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.

6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).

Applying these ideas, we get:

import csv

filename = 'test.csv'

with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next() # ignore the field labels
    rows = list(data) # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]
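
The snippet above relies on Python 2 behaviour (zip returning a list, print as a statement) and prints only the serial numbers. A hedged Python 3 sketch of the same set-plus-comprehension idea, which also keeps each id next to its serial numbers as the question asked, could look like this. The function name serials_by_id and the SAMPLE string are assumptions for illustration.

```python
import csv
import io

# In-memory stand-in for test.csv (assumed to match the sample in the question).
SAMPLE = "id, serial_no\n2, 500\n2, 501\n2, 502\n3, 600\n3, 601\n"


def serials_by_id(lines):
    """Return [[id, [serial_no, ...]], ...] using a set of ids and a comprehension."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # ignore the field labels
    rows = list(reader)  # read the rest of the rows from the iterator
    ids = {row[0] for row in rows}  # unique ids from the first column
    return [
        [wanted, [serial_no for row_id, serial_no in rows if row_id == wanted]]
        for wanted in sorted(ids)
    ]


print(serials_by_id(io.StringIO(SAMPLE)))
```

This scans all rows once per distinct id, so it is fine for small files but does more work than the dictionary-based answers when there are many ids.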

We can probably do even better than this by using the groupby function from the itertools module.

4 Comments

Thank you for all the suggestions.. I didn't know many of these features.
Actually, file is a deprecated reference to open. It's widely used as a variable name, and it was removed in Python 3.
groupby will only work if rows with the same id are already grouped. They may be, but it is not specified.
I added an answer using groupby for completeness
0

Example using itertools.groupby. This only works if the rows are already grouped by id.

from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'

# the context manager ensures that infile is closed when the with block exits
with open(filename) as infile:

    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))

    # loop through the groups printing a list for each one
    for i,j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]

Note the space in front of ' serial_no'. This is because of the space after the comma in the input file.
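
If the rows are not guaranteed to be grouped, one option is to sort them by id before calling groupby. The following is a Python 3 sketch under that assumption; the function name group_sorted and the deliberately shuffled SAMPLE string are made up for illustration. It also passes skipinitialspace=True, so the field name is 'serial_no' without the leading space.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Shuffled stand-in for test.csv, to show that the input order does not matter.
SAMPLE = "id, serial_no\n3, 600\n2, 500\n3, 601\n2, 501\n2, 502\n"


def group_sorted(lines):
    """Sort rows by id, then group consecutive rows that share an id."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    # sorted() is stable, so serial numbers keep their file order within each id.
    rows = sorted(reader, key=itemgetter("id"))
    return [
        [row_id, [row["serial_no"] for row in group]]
        for row_id, group in groupby(rows, key=itemgetter("id"))
    ]


print(group_sorted(io.StringIO(SAMPLE)))
```

Sorting loads the whole file into memory, which gives up the streaming behaviour of the version above; for pre-grouped input the original form is enough.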
