So, I'm working on updating thousands of rows in a Postgres DB with Python (v3.6). After cleaning and preparing the data, I'm having trouble with how long the row updates take. I've already indexed the columns used in the query.

I'm using psycopg2 to run an "execute_batch" update on the table after having created the column, but the timings just don't make sense. It takes 40 seconds to update 10k rows, and, what is really breaking my mind, changing the "page_size" parameter of the function doesn't seem to change the speed of the updates at all.

For example, these two calls take the same amount of time:

psycopg2.extras.execute_batch(self.cursor, query, field_list, page_size=1000)

psycopg2.extras.execute_batch(self.cursor, query, field_list, page_size=10)
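
For context, here's a self-contained version of what I'm running (the connection string, table, and column names are illustrative, not my real schema):

import psycopg2
import psycopg2.extras

# illustrative schema: my_table(id integer primary key, value text);
# the primary key gives an index on the column used in the WHERE clause
conn = psycopg2.connect("dbname=test")  # placeholder connection string
cursor = conn.cursor()

query = "UPDATE my_table SET value = %(value)s WHERE id = %(id)s"
field_list = [{"id": i, "value": f"row {i}"} for i in range(10000)]

psycopg2.extras.execute_batch(cursor, query, field_list, page_size=1000)
conn.commit()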

With all this, am I doing something wrong? Do I need to change anything in the database configuration for the page_size argument to have any effect?

So far I've found a post that reports improvements from this method, but I cannot reproduce its results:

https://hakibenita.com/fast-load-data-python-postgresql#measuring-time

Any light on this would be awesome.

Many thanks!

  • Have a look at this benchmark: aaronolszewski.com/psycopg2-execution-time - especially execute_mogrify_method() Commented Oct 22, 2019 at 14:15
  • what query are you executing? what indexes/triggers do you have on the table? is Python or PG using lots of CPU/IO? Commented Oct 22, 2019 at 14:50
  • @MauriceMeyer I had a look at that post too. It turned out the problem was the type of the variables I was using in the query: since I was using pandas to generate the dictionaries, it defaulted to float for an identifier that is an integer in the database. Changing that variable's type hugely improved the update performance (roughly 180k updates in 20s). See the sketch just below. Thanks for the responses though! Commented Oct 23, 2019 at 15:15
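
For anyone hitting the same problem, here is a rough sketch of that fix; the frame and column names are made up:

import pandas as pd

# hypothetical DataFrame whose integer key came out of pandas as float64
df = pd.DataFrame({"id": [1.0, 2.0, 3.0], "value": ["a", "b", "c"]})

# cast the identifier back to int so the query compares integer = integer;
# passing floats can make Postgres compare integer = numeric, which can
# keep it from using the index on id
df["id"] = df["id"].astype(int)

field_list = df.to_dict(orient="records")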

1 Answer


Unless the bottleneck which execute_batch removes is the bottleneck you actually face, there is no reason to expect a performance improvement.

If the time to do the update is dominated by index maintenance (which is likely, if your table is indexed), then nothing else is going to matter.
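
One way to test that hypothesis (the index and table names here are hypothetical, and conn, query, and field_list are assumed to be set up as in the question): drop an index that covers the column being SET, keep the one the WHERE clause relies on, and re-time the batch:

import psycopg2.extras

with conn.cursor() as cursor:
    # my_table_value_idx is a hypothetical index on the column being
    # updated, not the one used by the WHERE clause
    cursor.execute("DROP INDEX IF EXISTS my_table_value_idx")
    psycopg2.extras.execute_batch(cursor, query, field_list, page_size=1000)
    # recreate it afterwards
    cursor.execute("CREATE INDEX my_table_value_idx ON my_table (value)")
conn.commit()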

If Python is running on the same server as your database, or they are on a reasonably fast LAN, reducing network round trips matters little until every other bottleneck has been removed.
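
To check whether round trips matter at all in your setup, a quick timing sketch (connection string, table, and data are placeholders):

import time
import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=test")  # placeholder connection string
query = "UPDATE my_table SET value = %(value)s WHERE id = %(id)s"
field_list = [{"id": i, "value": str(i)} for i in range(10000)]

# if these timings come out roughly equal, round trips were never the
# bottleneck, and tuning page_size cannot help
for page_size in (10, 100, 1000):
    with conn.cursor() as cursor:
        start = time.perf_counter()
        psycopg2.extras.execute_batch(cursor, query, field_list, page_size=page_size)
        conn.commit()
        print(f"page_size={page_size}: {time.perf_counter() - start:.2f}s")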

