
I am running a very simple query with Python against a table in a Snowflake database, using the package snowflake-connector-python==2.3.3 installed with the additional [pandas] extra. I containerized my Python app using the python:3.7.0-slim image. My script is extremely simple:

import os

from snowflake import connector

# Connection parameters come from environment variables.
ctx = connector.connect(
    user=os.environ['USER'],
    password=os.environ['PASSWORD'],
    account=os.environ['ACCOUNT'],
    warehouse=os.environ['WAREHOUSE'],
    database=os.environ['DATABASE'],
    schema=os.environ['SCHEMA'])

cur = ctx.cursor()

# Execute a statement that will generate a result set.
sql = "SELECT * FROM MY_TABLE ORDER BY MY_COLUMN"
print("executing query: " + sql)
cur.execute(sql)

# Pull the entire result set into a single pandas DataFrame.
df = cur.fetch_pandas_all()

Snowflake reports the table size as 3.3 GB. However, when I run this app it crashes after consuming over 9 GB of RAM. I know this because I'm running it in a Kubernetes cluster and the pod is evicted with a reported usage of 9535336Ki of memory. Is there something I'm missing here? How can the memory usage be roughly 3x the table size?
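For what it's worth, a large part of a gap like this can come from pandas itself: string columns are stored as object dtype, where every cell is a full Python str object with per-object overhead, so the in-memory size can be several times the raw character data. The sketch below uses hypothetical data (the column name and values are made up, not from the table in question) and pandas' own memory_usage(deep=True) to show the effect:

```python
import pandas as pd

# Hypothetical data: one million 10-character strings,
# i.e. about 10 MB of raw character data.
df = pd.DataFrame({"my_column": [f"row{i:07d}" for i in range(1_000_000)]})

raw_bytes = int(df["my_column"].str.len().sum())   # raw character data
in_memory = int(df.memory_usage(deep=True).sum())  # actual RAM used by pandas

print(f"raw character data: {raw_bytes / 1e6:.1f} MB")
print(f"pandas in-memory:   {in_memory / 1e6:.1f} MB")
# Each Python str carries roughly 49 bytes of object overhead on top of its
# characters, plus an 8-byte pointer in the object array, so the in-memory
# footprint is several times the raw data size.
```

On top of that, compressed storage (which is what Snowflake reports) is usually much smaller than the uncompressed result set, so 3.3 GB on disk ballooning past 9 GB in RAM is plausible.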

  • This may not be as simple as it looks: SELECT * FROM. See why. Try selecting exactly the columns your app needs. Commented Oct 19, 2020 at 19:38
  • One of the top comments there reads It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved. And I need every column. Also, it still doesn't answer the question: the table itself is 3.3 GB, but my container's memory grows beyond 9 GB, so why is that? Commented Oct 19, 2020 at 19:42
  • Understood, but for maintainable code it's always a good idea to specify the columns explicitly and in a consistent order, in case the underlying table or code changes. For debugging, try selecting one, two, three, etc. columns at a time and check RAM usage. Watch for large data types (object, arrays, geospatial). I also wonder about cur.fetch_pandas_all(). Try removing it for debugging to isolate the problematic line. The docs indicate it's faster than pandas' read_sql, but I wonder. Commented Oct 19, 2020 at 19:53
  • Cool, thank you for the callout - I should be a bit more specific. When I put the pipeline in production I will switch to naming all columns. I actually found a similar issue in this post stackoverflow.com/questions/41253326/… . I'm thinking it might be a pandas-related thing Commented Oct 19, 2020 at 19:56
