
I am running a very simple query with Python against a table in a Snowflake database, using the package snowflake-connector-python==2.3.3 installed with the additional [pandas] extra. I containerized my Python app using the python:3.7.0-slim image. My script is extremely simple:

import os

from snowflake import connector

# Connection parameters come from environment variables.
ctx = connector.connect(
    user=os.environ['USER'],
    password=os.environ['PASSWORD'],
    account=os.environ['ACCOUNT'],
    warehouse=os.environ['WAREHOUSE'],
    database=os.environ['DATABASE'],
    schema=os.environ['SCHEMA'])

cur = ctx.cursor()

# Execute a statement that will generate a result set.
sql = "SELECT * FROM MY_TABLE ORDER BY MY_COLUMN"
print("executing query: " + sql)
cur.execute(sql)

# Pull the entire result set into a single pandas DataFrame.
df = cur.fetch_pandas_all()

Snowflake reports the table size as 3.3 GB. However, when I run this app it crashes after consuming over 9 GB of RAM. I know this because I'm running it in a Kubernetes cluster and the pod is evicted with a reported usage of 9535336Ki of memory. Is there something I'm missing here? How can the memory usage be roughly 3x the table size?
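For what it's worth, a large part of a gap like this can come from pandas itself: string columns are stored as object dtype, where every cell is a full Python str object with per-object overhead, so the in-memory size can be several times the raw character data. The sketch below uses hypothetical data (the column name and values are made up, not from the table in question) and pandas' own memory_usage(deep=True) to show the effect:

```python
import pandas as pd

# Hypothetical data: one million 10-character strings,
# i.e. about 10 MB of raw character data.
df = pd.DataFrame({"my_column": [f"row{i:07d}" for i in range(1_000_000)]})

raw_bytes = int(df["my_column"].str.len().sum())   # raw character data
in_memory = int(df.memory_usage(deep=True).sum())  # actual RAM used by pandas

print(f"raw character data: {raw_bytes / 1e6:.1f} MB")
print(f"pandas in-memory:   {in_memory / 1e6:.1f} MB")
# Each Python str carries roughly 49 bytes of object overhead on top of its
# characters, plus an 8-byte pointer in the object array, so the in-memory
# footprint is several times the raw data size.
```

On top of that, compressed storage (which is what Snowflake reports) is usually much smaller than the uncompressed result set, so 3.3 GB on disk ballooning past 9 GB in RAM is plausible.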

  • This may not be as simple as it looks: SELECT * FROM. See why. Try selecting exactly the columns your app needs. Commented Oct 19, 2020 at 19:38
  • One of the top comments there reads It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved. And I need every column. Also, it still doesn't answer the question: the table itself is 3.3 GB, but my container's memory grows beyond 9 GB, so why is that? Commented Oct 19, 2020 at 19:42
  • Understood, but for maintainable code it's always a good idea to specify the columns explicitly and in a consistent order, in case the underlying table or code changes. For debugging, try selecting one, two, three, etc. columns at a time and check RAM usage. Watch for large data types (object, arrays, geospatial). I also wonder about cur.fetch_pandas_all(). Try removing it for debugging to isolate the problematic line. The docs indicate it's faster than pandas' read_sql, but I wonder. Commented Oct 19, 2020 at 19:53
  • Cool, thank you for the callout - I should be a bit more specific. When I put the pipeline in production I will switch to naming all columns. I actually found a similar issue in this post stackoverflow.com/questions/41253326/… . I'm thinking it might be a pandas-related thing Commented Oct 19, 2020 at 19:56
