
I have a dataset in a CSV file consisting of 2500 lines. The file is structured in the following (simplified) way:

id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value

Each receptor in the file is already stored in a table and has a unique id.

I need to upload each line to a table with this format:

id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3; 2.1
[...]
2500; 1; 2.4
2500; 2; 3.0
2500; 3; 1.1

Currently, I'm writing all the data I need to upload to a .txt file, and I'm using PostgreSQL's COPY command to transfer the file to the destination table.

For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates ~110000 records in the text file to be uploaded.

I'm dropping the foreign keys of the destination table and restoring them after the upload.
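
Under the hood that is just a pair of ALTER TABLE statements; a simplified sketch via psycopg2 (the constraint and referenced table/column names here are placeholders, not my real schema):

cur = conn.cursor()

# drop the FK before the bulk load (constraint name is a placeholder)
cur.execute("ALTER TABLE receptor_results "
            "DROP CONSTRAINT receptor_results_rec_uid_fkey")

# ... COPY the data here ...

# restore the FK afterwards; PostgreSQL re-validates the existing rows
cur.execute("ALTER TABLE receptor_results "
            "ADD CONSTRAINT receptor_results_rec_uid_fkey "
            "FOREIGN KEY (rec_uid) REFERENCES receptors (id)")
conn.commit()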

Using this method, it currently takes ~8 seconds to generate the text file and 1 second to copy the file to the table.

Is there a way, method, library, or anything else I could use to accelerate the preparation of the data for the upload, so that 90% of the time isn't spent writing the text file?

Edit:

Here is my (updated) code. I'm now writing to the text file in bulk. It looks like it's faster (110 000 lines uploaded in 3.8 seconds).

# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
    # the run_specs variable consists of the attributes defining a run 
    # (id_run, run_name, etc.). So basically a line in the CSV file without the 
    # receptors data
    sc_uid = get_uid(db, table_name) # function to get the unique ID of the run
    for rec_i, rec in enumerate(rec_uids):
        # the rec_uids variable is the unique IDs in the database for the 
        # receptors in the CSV file
        line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
        lines.append(line_to_write)

# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
    tmp_data.writelines(lines)

# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results') # function to get foreign keys

# drop the foreign keys
for key in rr_fks:
    DB.drop_fk(conn, 'receptor_results', key[0]) # function to drop FKs

# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],\
                                                                    " ", False)

# restore foreign keys
for key in rr_fks:
    DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])

# commit to database
conn.commit()

Edit #2:

Using the cStringIO library, I replaced the creation of a temporary text file with a file-like object, but the speed gain is very small.

Code changed:

import cStringIO

outf = cStringIO.StringIO()
for rec_i, rec in enumerate(rec_uids):
    outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))

outf.seek(0)  # rewind the buffer so copy_from reads from the start
cur.copy_from(outf, 'receptor_results', sep=' ',
              columns=('sc_uid', 'rec_uid', 'value'))
Comments

  • What code do you currently use? Please provide a minimal reproducible example. Commented Jun 3, 2016 at 16:41
  • I edited the main post with my code! Commented Jun 3, 2016 at 16:59
  • If your question is about how to prepare your text data quickly, then it has nothing to do with the database server. This is confusing. Commented Jun 3, 2016 at 17:03
  • I'm looking for ways to make my process faster, either by using other commands to prepare my text file or by uploading to Postgres differently. Commented Jun 3, 2016 at 17:05
  • @kaycee I have answered that below ;) Commented Jun 3, 2016 at 17:11

2 Answers


Yes, there is something you can do to speed up writing the data to the file in advance: don't bother!

Your data already fits into memory, so that isn't an issue. Instead of writing the lines to a list of strings, write them to a slightly different object: a StringIO instance. Then the data can stay in memory and serve as the parameter to psycopg2's copy_from function.

filelike = StringIO.StringIO('\n'.join(['1\tA', '2\tB', '3\tC']))
cursor.copy_from(filelike, 'your-table-name')

Notice that the StringIO must contain the newlines, the field separators and so on - just as the file would have.
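
Applied to the loop in the question, that could look roughly like this (a sketch only: the variable, table, and column names are the ones from the question, the tab separator is copy_from's default, and seek(0) rewinds the buffer before the copy):

import StringIO  # cStringIO.StringIO() works the same way and is a bit faster

outf = StringIO.StringIO()
for line_i, line in enumerate(run_specs):
    sc_uid = get_uid(db, table_name)
    for rec_i, rec in enumerate(rec_uids):
        # tab-separated, one record per line, exactly as a COPY file would be
        outf.write('%s\t%s\t%s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))

outf.seek(0)  # rewind so copy_from reads from the beginning of the buffer
cur.copy_from(outf, 'receptor_results',
              columns=('sc_uid', 'rec_uid', 'value'))
conn.commit()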


1 Comment

I used the cStringIO library, but the speed gain is very minimal. See Edit #2 in the main post.

I'm writing all the data I need to upload to a .txt file and I'm using PostgreSQL's COPY command to transfer the file to the destination table.

It is a heavy and unnecessary round-trip for all your data. Since you already have it in memory, you should just translate it into a multi-row insert directly:

INSERT INTO table(col1, col2) VALUES (val1, val2), (val3, val4), ...

i.e. concatenate your data into such a query and execute it as is.

In your case you would probably generate and execute 50 such inserts, with 2500 rows in each, according to your requirements.

It will be the best-performing solution ;)
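
A rough sketch of that idea with psycopg2 (Python 2, to match the question; the rows list and chunk size are illustrative assumptions, while the table and column names come from the question):

cur = conn.cursor()

# `rows` is assumed to already hold the (sc_uid, rec_uid, value) tuples
# built in memory by the question's loop.
chunk_size = 2500  # one INSERT per 2500 rows, as suggested above
for i in range(0, len(rows), chunk_size):
    chunk = rows[i:i + chunk_size]
    # mogrify renders each tuple as a properly escaped "(v1, v2, v3)" literal
    values_sql = ','.join(cur.mogrify("(%s, %s, %s)", row) for row in chunk)
    cur.execute("INSERT INTO receptor_results (sc_uid, rec_uid, value) "
                "VALUES " + values_sql)

conn.commit()

(On Python 3, mogrify returns bytes, so each piece would need a .decode() before joining.)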
