
I have written a Python script that takes a 1.5 GB XML file, parses out data, and feeds it to a database using copy_from. It invokes the following function every 1,000 parsed nodes. There are about 170k nodes in all, which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?

Here is the function where I feed the data to the db.

import psycopg2
import cStringIO

def db_update(val_str, tbl, cols):
    # NOTE: a new connection is opened on every call (every 1,000 nodes)
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()

I haven't included the XML parsing because I don't think that's the issue; without the database step, the parser executes in under 2 minutes.

  • Have you tried caching the connection? You can also simplify the function a bit by passing val_str directly to StringIO(), thus eliminating both the write and the subsequent seek. Commented Aug 25, 2012 at 3:42
  • I am a StringIO() novice. Could you give an example of what you mean? Dumb probably, but if I don't ask... Commented Aug 25, 2012 at 4:20
  • output = cStringIO.StringIO(val_str) is all you need to do to get a read-only file-like object containing val_str (suitable for copy_from). Basically, if you don't give it any arguments you get a read/write file, but if you give it a string argument cStringIO.StringIO gives you a read-only file with the specified contents. See cStringIO.StringIO. Commented Aug 25, 2012 at 4:23
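A minimal sketch of the simplification suggested in the last comment, with the cursor passed in as a parameter (that part is an assumption; the other names come from the question):

import cStringIO

def db_update(cur, val_str, tbl, cols):
    # the string passed to the constructor becomes a read-only file-like
    # object already positioned at offset 0, so no write()/seek() is needed
    cur.copy_from(cStringIO.StringIO(val_str), tbl, sep='\t', columns=cols)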

3 Answers


There are several things that can slow inserts as tables grow:

  • Triggers that have to do more work as the DB grows
  • Indexes, which get more expensive to update as they grow

Disable any non-critical triggers, or, if that isn't possible, re-design them to run in constant time.

Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs, you'll need to keep them and wear the cost.

If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than allocate new space from the file system, which is more expensive, and will help avoid table bloat.
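A hedged sketch, via psycopg2, of both suggestions: dropping a non-essential index for the load and re-creating it afterwards, and making per-table autovacuum more aggressive. The table name, index name, column, and thresholds below are hypothetical placeholders, not values from the question:

import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

# drop a non-essential index before the bulk load
# (my_table, my_table_col_idx and some_col are placeholders)
cur.execute("DROP INDEX IF EXISTS my_table_col_idx;")

# let autovacuum fire after far fewer dead rows than the default,
# which helps space re-use when the load performs many UPDATEs
cur.execute("""
    ALTER TABLE my_table SET (
        autovacuum_vacuum_scale_factor = 0.01,
        autovacuum_vacuum_threshold = 1000
    );
""")
conn.commit()

# ... run all of the COPY / UPDATE batches here ...

# re-create the index once the data has been loaded
cur.execute("CREATE INDEX my_table_col_idx ON my_table (some_col);")
conn.commit()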

You'll also save time by not re-connecting for each block of work. Maintain a connection.
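As a concrete illustration of that last point, a minimal sketch of the question's db_update reworked to reuse one connection (the module-level connection and the single-argument StringIO are the only changes; everything else mirrors the original function):

import psycopg2
import cStringIO

# connect once, not once per batch
conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

def db_update(val_str, tbl, cols):
    # reuse the module-level connection instead of reconnecting every call
    cur.copy_from(cStringIO.StringIO(val_str), tbl, sep='\t', columns=cols)
    conn.commit()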


3 Comments

This is all very helpful. A lot of these tips seem aimed at general optimization. I am curious what you think the most likely culprit is for the steep decline in performance.
I guess you said so right at the top. I don't have any triggers. One index on the primary key.
@MikeGirard I wouldn't expect such a drop, but it's hard to say more without data like iostat and vmstat measurements, hard timings, the Pg logs, etc. The fact that COPY doesn't offer any useful EXPLAIN ANALYZE data doesn't help.

From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do that later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>"); cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).

Also, it may seem stupid, but it has happened to me a lot of times: make sure you reset your val_str after you call db_update. For me, when the copy_from/inserts start to go slower it's because I'm inserting the same rows plus more rows.
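A rough sketch of the calling loop this answer seems to have in mind, with the buffer reset after every batch so old rows are never re-sent. parse_nodes() and node_to_tsv_line() are hypothetical stand-ins for the question's XML parsing code, which was not shown, and the table and column names are placeholders:

BATCH_NODES = 1000                      # flush every 1,000 parsed nodes
rows = []

for i, node in enumerate(parse_nodes('data.xml'), 1):
    rows.append(node_to_tsv_line(node))
    if i % BATCH_NODES == 0:
        db_update('\n'.join(rows) + '\n', 'my_table', ('col_a', 'col_b'))
        rows = []                       # reset, so old rows are never re-sent

if rows:                                # flush whatever is left over
    db_update('\n'.join(rows) + '\n', 'my_table', ('col_a', 'col_b'))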



I am using the following and I don't get any hit on performance as far as I have seen:

import psycopg2
import psycopg2.extras

local_conn_string = """
    host='localhost'
    port='5432'
    dbname='backupdata'
    user='postgres'
    password='123'"""
local_conn = psycopg2.connect(local_conn_string)

# giving the cursor a name makes it a server-side (named) cursor,
# so rows are streamed from the server instead of loaded all at once
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)

I have added the following output to my code to test the run-time (and I am parsing a LOT of rows, more than 30,000,000).

Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11

I have to mention I don't "copy" anything. Instead, I am moving my rows from a remote PostgreSQL database to a local one, and in the process I create a few more tables to index my data better than before, as 30,000,000+ rows are a bit too much to handle with regular queries.

NB: The time counts upwards and is cumulative; it is not per query.

I believe it has to do with the way my cursor is created.

EDIT1:

I am using the following to run my query:

local_cursor.execute("""SELECT * FROM data;""")

row_count = 0
for row in local_cursor:
    # report progress every 100,000 rows
    if row_count % 100000 == 0 and row_count != 0:
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1

print("Finished running script!")
print("Parsed %s rows" % row_count)

The my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB, and eventually deletes it from the remote DB once the data is verified as having been moved to my local DB.
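The my_timer class isn't shown in the answer; as a rough idea of what it might look like (this implementation is an assumption, and only get_time_hhmmss() is used above):

import time

class Timer(object):
    """Hypothetical stand-in for the answer's my_timer object."""
    def __init__(self):
        self._start = time.time()

    def get_time_hhmmss(self):
        # elapsed wall-clock time since construction, formatted HH:MM:SS
        elapsed = int(time.time() - self._start)
        hours, remainder = divmod(elapsed, 3600)
        minutes, seconds = divmod(remainder, 60)
        return "%02d:%02d:%02d" % (hours, minutes, seconds)

my_timer = Timer()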

EDIT2:

It takes roughly 1 minute to parse every 100,000 rows in my DB, even after parsing around 4,000,000 rows:

Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50
