
I am trying to read a table from my Postgres database into Python. The table has around 8 million rows and 17 columns, and is about 622 MB in the database.

I can export the entire table to CSV using psql and then read it in with pd.read_csv(). That works perfectly fine: the Python process only uses around 1 GB of memory and everything is good.

Now, the task needs to be automated, so I thought I could read the table directly from the DB with pd.read_sql_table(), using the following code:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")
the_frame = pd.read_sql_table(table_name='table_name', con=engine, schema='schemaname')

This approach starts using a lot of memory. Watching it in Task Manager, I can see the Python process's memory usage climb and climb until it hits 16 GB and freezes the computer.

Any ideas on why this might be happening would be appreciated.

  • See if there is a chunksize argument and read the dataframe in chunks. Commented Dec 21, 2016 at 3:32

2 Answers


You need to set the chunksize argument so that pandas iterates over the data in smaller chunks. See this post: https://stackoverflow.com/a/31839639/3707607
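A minimal sketch of the chunked read, reusing the engine and names from the question (process() is a hypothetical placeholder for whatever you do with each chunk):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")

# With chunksize set, read_sql_table returns an iterator of DataFrames
# instead of materializing the whole table at once.
chunks = pd.read_sql_table(
    table_name='table_name',
    con=engine,
    schema='schemaname',
    chunksize=50_000,  # rows per chunk; tune to your memory budget
)

for chunk in chunks:
    process(chunk)  # hypothetical per-chunk processing step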


4 Comments

Interesting. This works for sure, I'll mark the question as answered. I am still unsure why bringing in the entire dataframe from the database would require 10x more memory than reading it from a CSV, though.
Note: pandas can still potentially bring the entire huge dataset into memory even when chunksize is used. See this comment: stackoverflow.com/questions/18107953/…
chunksize still loads all the data into memory; stream_results=True is the answer. It uses a server-side cursor that fetches rows in the given chunk size and saves memory (see the sketch after these comments). I use it in many pipelines, and it can also help when loading historical data.
This doesn't work. A true chunked read can only be achieved with a server-side cursor (SSCursor), where you tell the database to send you chunksize rows at a time. It cannot be achieved with pandas.read_sql(chunksize=n) alone, because that calls cursor.execute(), which loads all the data into memory at once.
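A minimal sketch of the server-side cursor approach described in these comments, assuming SQLAlchemy with the psycopg2 driver and the connection details from the question (process() is again a hypothetical placeholder):

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")

# stream_results=True asks the driver for a server-side cursor, so rows
# are fetched from Postgres in batches instead of all at once.
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql("SELECT * FROM schemaname.table_name", conn, chunksize=50_000):
        process(chunk)  # hypothetical per-chunk processing step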

Under the hood, pandas calls cursor.execute(your_query), which loads the binary data received from the database via socket.recv() (over TCP). This loads all the data into memory, no matter what you do with pd.read_sql(chunksize=n).

Okay, so what is the solution? pandas.read_sql() on its own leaves you without much flexibility. Here are some ideas:

  1. Use pyarrow as the dtype_backend in pd.read_sql(). However, in my testing it actually takes more memory because of multiple copies: the query result is first loaded into a pandas DataFrame and then converted to a pyarrow Table. What a joke!

  2. Define dtype={}; it will certainly save you some memory, but not enough (both options are sketched below).
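For reference, a rough sketch of the two options above; the dtype and dtype_backend arguments of pd.read_sql require pandas 2.0+, and the column names here are made up:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://username:password@hostname:5432/db")

# Option 2: narrow the dtypes explicitly so pandas does not default to
# 64-bit numbers and object-dtype strings for every column.
frame = pd.read_sql(
    "SELECT * FROM schemaname.table_name",
    engine,
    dtype={"some_id": "int32", "some_flag": "bool"},  # hypothetical columns
)

# Option 1: pyarrow-backed dtypes (pandas 2.0+).
frame = pd.read_sql(
    "SELECT * FROM schemaname.table_name",
    engine,
    dtype_backend="pyarrow",
)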

As you can see, read_sql() alone doesn't solve it. I would suggest you check out the connectorx library. It is super memory efficient and fast, and like pandas.read_sql() it gives you a pandas DataFrame in return. Try it.
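A minimal sketch of the connectorx route, assuming the package is installed and reusing the connection string from the question:

import connectorx as cx

# connectorx reads the result set directly into a DataFrame without going
# through a Python DB-API cursor, which keeps peak memory low.
df = cx.read_sql(
    "postgresql://username:password@hostname:5432/db",
    "SELECT * FROM schemaname.table_name",
    return_type="pandas",  # the default; Arrow and Polars are also supported
)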
