1

I have written a piece of code that compares data from two CSV files and writes the final output to a new CSV file. The problem is that nothing except the header is being written to the output file. Below is my code:

import csv


data_3B = open('3B_processed.csv', 'r') 
reader_3B = csv.DictReader(data_3B)

data_2A = open('2A_processed.csv', 'r') 
reader_2A = csv.DictReader(data_2A)

l_3B_2A = [["taxable_entity_id", "return_period", "3B", "2A"]]

for row_3B in reader_3B:
    for row_2A in reader_2A:
        if row_3B["taxable_entity_id"] == row_2A["taxable_entity_id"] and row_3B["return_period"] == row_2A["return_period"]:
            l_3B_2A.append([row_3B["taxable_entity_id"], row_3B["return_period"], row_3B["total"], row_2A["total"]])


with open("3Bvs2A_new.csv", "w") as csv_file:
    writer = csv.writer(csv_file)

    writer.writerows(l_3B_2A)

csv_file.close()

How do I solve this?

Edit: 2A_processed.csv sample:

taxable_entity_id,return_period,total
2d9cc638-5ed0-410f-9a76-422e32f34779,072019,0
2d9cc638-5ed0-410f-9a76-422e32f34779,062019,0
2d9cc638-5ed0-410f-9a76-422e32f34779,082019,0
e5091f99-e725-44bc-b018-0843953a8771,082019,0
e5091f99-e725-44bc-b018-0843953a8771,052019,41711.5
920da7ba-19c7-45ce-ba59-3aa19a6cb7f0,032019,2862.94
410ecd0f-ea0f-4a36-8fa6-9488ba3c095b,082018,48253.9

3B_processed sample:

taxable_entity_id,return_period,total
1e5ccfbc-a03e-429e-b79a-68041b69dfb0,072017,0.0
1e5ccfbc-a03e-429e-b79a-68041b69dfb0,082017,0.0
1e5ccfbc-a03e-429e-b79a-68041b69dfb0,092017,0.0
f7d52d1f-00a5-440d-9e76-cb7fbf1afde3,122017,0.0
1b9afebb-495d-4516-96bd-1e21138268b7,072017,146500.0
1b9afebb-495d-4516-96bd-1e21138268b7,082017,251710.0
6
  • have you tried to do this with pandas data frame? Commented Sep 1, 2019 at 14:01
  • are you sure that your list l_3B_2A contains all the data you mean to collect? That if condition looks suspicious to me. Wild guess, one of the keys taxable_entity_id or return_period contains a newline at the end, which is why the equality comparison of the strings never works. Commented Sep 1, 2019 at 14:06
  • @BillyBonaros How do I do this using pandas? I'm not very good at it.. Commented Sep 1, 2019 at 14:13
  • @Arne, I'm not sure.. there is no newline at the end of those two keys.. the two csv's that I'm comparing were generated using python. I'm adding a sample. Commented Sep 1, 2019 at 14:16
  • @MohnishM if you write an else branch after your if and print row_3B and row_2A in there, you might get an idea why you never enter the if branch even though you think it should. Commented Sep 1, 2019 at 14:20

3 Answers

2

The csv.DictReader objects in your code can only read through the file once, because they are reading from file objects (created with open). Therefore, the second and subsequent times through the outer loop, the inner loop does not run, because there are no more row_2A values in reader_2A - the reader is at the end of the file after the first time.

The simplest fix is to read each file into a list first. We can make a helper function to handle this, and also ensure the files are closed properly:

def lines_of_csv(filename):
    with open(filename) as source:
        return list(csv.DictReader(source))

reader_3B = lines_of_csv('3B_processed.csv')
reader_2A = lines_of_csv('2A_processed.csv')
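The one-pass behaviour described above can be reproduced without any files. The following is only a minimal sketch: the in-memory strings and the helper `count_inner_iterations` are made up for illustration and are not part of the original script. It counts how often the inner loop body runs for each outer row, first with readers over file-like objects and then with lists.

```python
import csv
import io

# Hypothetical in-memory data standing in for the two CSV files.
text_3B = "taxable_entity_id,return_period,total\na,072019,1.0\nb,082019,2.0\n"
text_2A = "taxable_entity_id,return_period,total\nb,082019,5.0\na,072019,7.0\n"


def count_inner_iterations(outer, inner):
    """Return how many times the inner loop body runs for each outer row."""
    counts = []
    for _ in outer:
        n = 0
        for _ in inner:
            n += 1
        counts.append(n)
    return counts


# Readers over file-like objects: the inner reader is used up after one pass.
streaming = count_inner_iterations(
    csv.DictReader(io.StringIO(text_3B)),
    csv.DictReader(io.StringIO(text_2A)),
)

# Lists can be iterated as many times as needed.
materialized = count_inner_iterations(
    list(csv.DictReader(io.StringIO(text_3B))),
    list(csv.DictReader(io.StringIO(text_2A))),
)

print(streaming)     # [2, 0]
print(materialized)  # [2, 2]
```

With the readers, only the first outer row ever sees any rows from the second file, which is why the original output contains a match only if it involves the first row of 3B_processed.csv.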

5 Comments

In one of my early attempts at writing this script, I did that. But it takes a hell of a long time since there are 1 million lines in one CSV and half-a-million in the other.
Yeah, then you need a better algorithm. I can show how to do that, but it would be much simpler to use an actual database.
I can't connect directly to the database since there is a 5min time out and ssh tunnel forwarding in place. I've asked about that here.. stackoverflow.com/questions/57737400/… have not received any helpful responses yet..
Since you've presumably already downloaded the database data locally as csv, maybe you could rebuild a local copy of the database?
Hey, I've solved this problem and am not working on this anymore. Maybe I could've done it that way.. thanks for your suggestion anyway!
1

I put your code into a file test.py and created test files to simulate your csvs.

$ python3 ./test.py
$ cat ./3Bvs2A_new.csv 
taxable_entity_id,return_period,3B,2A
1,2,3,2
$ cat ./3B_processed.csv 
total,taxable_entity_id,return_period,3B,2A
3,1,2,3,4
3,4,3,2,1

$ cat ./2A_processed.csv 
taxable_entity_id,return_period,2A,3B,total
1,2,3,4,2
4,3,2,1,2

So as you can see, the order of the columns doesn't matter, because they are accessed correctly through the DictReader. If the first row is a match, your code works, but there are no rows left in the second CSV file after processing the first row from the first file. I suggest building a dictionary keyed by (taxable_entity_id, return_period) tuples: process the first CSV file by adding its totals to the dict, then run through the second one and look them up.

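# first_csv and second_csv are the rows of the two files (e.g. csv.DictReader objects)
row_lookup = {}
for row in first_csv:
    # key each total by its (entity id, return period) pair
    row_lookup[(row['taxable_entity_id'], row['return_period'])] = row['total']

for row in second_csv:
    key = (row['taxable_entity_id'], row['return_period'])
    if key in row_lookup:
        # total from the second file, then the matching total from the first
        new_row = [row['taxable_entity_id'], row['return_period'], row['total'], row_lookup[key]]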

Of course that only works if pairs of taxable_entity_ids and return_periods are always unique... Hard to say exactly what you should do without knowing the exact nature of your task and full format of your csvs.
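Put together, the lookup approach might look like the sketch below. This is an illustration under assumptions, not the asker's final script: the helper names `build_lookup` and `compare` and the `ent-…` sample rows are invented, and in-memory buffers stand in for the real files. The 2A rows go into the dict and the 3B rows are streamed past it, so each reader is consumed exactly once.

```python
import csv
import io


def build_lookup(rows):
    """Map (taxable_entity_id, return_period) -> total.

    If a key occurs more than once, the last row wins.
    """
    return {(r["taxable_entity_id"], r["return_period"]): r["total"] for r in rows}


def compare(rows_3B, rows_2A):
    """Return the header plus one row per 3B row whose key also exists in 2A."""
    lookup_2A = build_lookup(rows_2A)
    out = [["taxable_entity_id", "return_period", "3B", "2A"]]
    for row in rows_3B:
        key = (row["taxable_entity_id"], row["return_period"])
        if key in lookup_2A:
            out.append([key[0], key[1], row["total"], lookup_2A[key]])
    return out


# Hypothetical sample data; real code would open the two CSV files instead.
sample_3B = io.StringIO(
    "taxable_entity_id,return_period,total\nent-1,072019,10.0\nent-2,082019,20.0\n"
)
sample_2A = io.StringIO(
    "taxable_entity_id,return_period,total\nent-2,082019,5.5\nent-3,092019,1.0\n"
)

result = compare(csv.DictReader(sample_3B), csv.DictReader(sample_2A))

# Write to a buffer here; with a real file, open it with newline="".
out_buffer = io.StringIO()
csv.writer(out_buffer, lineterminator="\n").writerows(result)
print(out_buffer.getvalue())
```

Each row costs one dictionary lookup instead of a scan of the whole other file, so the work grows with the sum of the two file sizes rather than their product.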

4 Comments

This doesn't address the issue with reading the files, and it also misunderstands how the loops work in the original code. However, the suggestion for a lookup dict is still a good one, and key to making it work efficiently.
Wait I see what you're saying. I missed the nesting of the loops. So his first row in the second csv doesn't match any in the first and so he gets no output. I was concentrating on trying to figure out what he wanted to do and don't have the reputation to ask with a comment...
@David.. Bro, you saved my life. Your method reduced my execution time from days to seconds. Thanks!
Glad to help. Just make sure that this is the correct output. If each pair of entity id and return period occurs only once in each file, then it should be fine. If a pair can occur more than once (I'm not sure what kind of entity the entity ids refer to), then the dictionary will only hold the last record from the first CSV for that pair, and you'll get one row per pair in the second file instead of rows for all combinations, which is what iterating through lists would give you. I do get the impression that this is correct though.
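If duplicate pairs can occur and every combination is wanted, one possible variation is to store a list of totals per key instead of a single value. The sketch below assumes exactly that requirement; the helper `group_totals` and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_totals(rows):
    """Map (taxable_entity_id, return_period) -> list of every total seen for that key."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["taxable_entity_id"], r["return_period"])].append(r["total"])
    return groups


# Hypothetical rows: the same key appears twice in the 3B data.
rows_3B = [
    {"taxable_entity_id": "ent-1", "return_period": "072019", "total": "10.0"},
    {"taxable_entity_id": "ent-1", "return_period": "072019", "total": "15.0"},
]
rows_2A = [
    {"taxable_entity_id": "ent-1", "return_period": "072019", "total": "3.0"},
]

totals_3B = group_totals(rows_3B)

pairs = []
for row in rows_2A:
    key = (row["taxable_entity_id"], row["return_period"])
    # .get avoids creating empty entries in the defaultdict for missing keys
    for total_3B in totals_3B.get(key, []):
        pairs.append([key[0], key[1], total_3B, row["total"]])

print(pairs)
```

This yields one output row for each combination of matching records, the same rows the nested loops over lists would produce, while still avoiding a full scan of the other file for every row.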
0

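You can do this with pandas if the data frames are equal-sized, like this. Note that the comparison is done row by row, so it only finds matches that sit at the same position in both files:

import pandas as pd

reader_3B = pd.read_csv('3B_processed.csv')
reader_2A = pd.read_csv('2A_processed.csv')

# keep the 3B rows whose id and return period equal those of the 2A row at the same index
l_3B_2A = reader_3B[(reader_3B["taxable_entity_id"] == reader_2A["taxable_entity_id"]) & (reader_3B["return_period"] == reader_2A["return_period"])]

l_3B_2A.to_csv('3Bvs2A_new.csv')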

1 Comment

One of the CSV's has around a million lines while the other has half-a-million.
