
I am trying to apply changes from one dataframe (source file is a 7 MB CSV) to a larger dataframe (source file is roughly a 3 GB CSV), i.e. update existing rows with matching IDs while also adding new rows whose IDs do not yet exist in the larger dataframe. I believe the correct way to do this is the Polars update() method with the "how" strategy set to "full".

Unfortunately, while this works fine when testing on my local machine, it silently fails in a Cloud Function environment, even with the container configured for 8 GB of RAM.

I am using scan_csv() with infer_schema=False to get LazyFrames (all strings) of the two datasets before calling update(). I tried logging intermediate results using describe(), which logs the dataframe stats just fine for each of the source datasets, but execution never gets past the update() to log describe() for the resulting dataframe:

import logging

import polars as pl

large_df = pl.scan_csv(large_file_path, infer_schema=False)
small_df = pl.scan_csv(small_file_path, infer_schema=False)

logging.info(f'LARGE: {large_df.describe()}')  # Logs are visible for this
logging.info(f'SMALL: {small_df.describe()}')  # Logs are visible for this
merged_df = large_df.update(small_df, how='full', on='id')  # results in OOM in the Cloud Function log
logging.info(f'MERGED: {merged_df.describe()}')  # Never reaches this line

Am I doing anything wrong or inefficient here?

  • Have you checked the memory overhead of the container itself vs the memory footprint of the above function, e.g., w/ something like podman stats? And have you checked the output of large_df.update(...).explain(streaming=True) or large_df.update(...).show_graph(plan_stage="physical", engine="streaming")? Commented Dec 1 at 14:32
  • You can set os.environ["POLARS_VERBOSE"] = "1" (also POLARS_TRACK_METRICS on latest) to get logging/info from Polars. The code for update() github.com/pola-rs/polars/blob/… shows it is implemented in Python - so you could try each step manually to see where it fails, e.g. by starting with large_df.join(small_df, how='full', on='id') Commented Dec 2 at 10:46
  • Something seems to be missing, because you're not doing collect anywhere in your snippet, so if it's going OOM without a collect, something else is going on. Commented Dec 2 at 16:27
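
For anyone following the second comment's suggestion, a minimal sketch of that manual decomposition might look like the following. Note that explain(streaming=True) is the older spelling (newer Polars uses engine="streaming"), and the file path variables are the same free names as in the question:

import os
os.environ['POLARS_VERBOSE'] = '1'  # per the comment above; enables Polars' verbose logging

import polars as pl

large_df = pl.scan_csv(large_file_path, infer_schema=False)
small_df = pl.scan_csv(small_file_path, infer_schema=False)

# update() is a thin wrapper around join(), so run the join step directly
# to see whether that is the part that fails.
joined = large_df.join(small_df, how='full', on='id')
print(joined.explain(streaming=True))  # shows whether the plan can run in the streaming engine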

1 Answer


Thanks to some consultation with the Polars community, the fix turned out to be a combination of several factors. Here are the changes I made to get the merge/update/join step to execute successfully:

1. Increased the Cloud Function's memory from 8 GB to 16 GB
2. Removed the .describe() calls, as these trigger a collect under the hood and can be costly (see the footnote at the end of this answer for a cheaper way to log frame stats)
3. Pre-sorted the LazyFrames under operation by the key column ("id")
4. Removed infer_schema=False from the scan_csv() calls, letting Polars infer proper dtypes
5. Replaced the syntactic-sugar update() method with a plain join() (a fuller sketch follows the snippet below):

large_df.join(small_df, how='full', on='id')
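
For reference, a fuller sketch of how these pieces fit together. The suffix='_new', coalesce=True, pl.coalesce() column-merging, output_path name, and sink_csv() call are illustrative assumptions about reproducing update()-like semantics and writing the result without collecting it, not verbatim from my production code; it also assumes both files share the same columns:

import polars as pl

# Schema inference left on (point 4) and both frames pre-sorted by the key (point 3).
large_df = pl.scan_csv(large_file_path).sort('id')
small_df = pl.scan_csv(small_file_path).sort('id')

# Full join keeps unmatched rows from both sides; coalesce=True merges the two 'id' columns.
merged_df = large_df.join(small_df, how='full', on='id', suffix='_new', coalesce=True)

# To mimic update() semantics, prefer the small frame's value wherever one exists.
value_cols = [c for c in large_df.collect_schema().names() if c != 'id']
merged_df = merged_df.with_columns(
    [pl.coalesce(pl.col(f'{c}_new'), pl.col(c)).alias(c) for c in value_cols]
).drop([f'{c}_new' for c in value_cols])

# Stream the result to disk rather than collecting the whole frame into memory;
# if the plan isn't streamable in your Polars version, fall back to
# merged_df.collect().write_csv(output_path).
merged_df.sink_csv(output_path)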

Also importantly, I realized the Cloud infrastructure I was using defaulted to Python 3.8, so the latest version of Polars could not be installed (Polars was stuck at 1.8.2). The issue above was resolved even with the outdated Python and Polars versions, so the old version was not a direct cause; however, other Polars calls outside this code block began to fail, which I will address in a separate question, so it is definitely a red flag.
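
As a footnote to point 2: if you still want some visibility into the LazyFrames without triggering the full collect that describe() performs, logging the schema and a row count is much cheaper. A sketch (the exact log labels are illustrative):

import logging

import polars as pl

# Resolving the schema only reads headers/metadata, and pl.len() counts rows
# without loading every column into memory.
logging.info(f'LARGE schema: {large_df.collect_schema()}')
logging.info(f'LARGE rows: {large_df.select(pl.len()).collect().item()}')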
