I am trying to apply changes from one dataframe (source file is a 7 MB CSV) to a larger dataframe (source file is an approx. 3 GB CSV), i.e. update existing rows with matching IDs while also adding rows whose IDs do not yet exist in the larger dataframe. I believe the correct way to do this is to use the Polars update() method with the "how" strategy set to "full".
Unfortunately, this works fine when testing on my local machine but silently fails in a Cloud Function environment, even with the container configured for 8 GB of RAM.
I am using scan_csv() with infer_schema=False to get LazyFrames (containing only strings) of the two datasets before calling update(). I tried logging intermediate results using describe(), which logs the dataframe stats just fine for each of the source datasets, but the code never gets past the update() call to log describe() for the resulting dataframe:
import logging
import polars as pl
large_df = pl.scan_csv(large_file_path, infer_schema=False)
small_df = pl.scan_csv(small_file_path, infer_schema=False)
logging.info(f'LARGE: {large_df.describe()}') # Logs are visible for this
logging.info(f'SMALL: {small_df.describe()}') # Logs are visible for this
merged_df = large_df.update(small_df, how='full', on='id') # results in OOM in the Cloud Function log
logging.info(f'MERGED: {merged_df.describe()}') # Never reaches this line
Am I doing anything wrong or inefficient here?
podman stats? And have you checked the output of large_df.update(...).explain(streaming=True) or large_df.update(...).show_graph(plan_stage="physical", engine="streaming")?
Set os.environ["POLARS_VERBOSE"] = "1" (also POLARS_TRACK_METRICS on latest) to get logging/info from Polars.
The code for update() github.com/pola-rs/polars/blob/… shows it is implemented in Python - so you could try each step manually to see where it fails, e.g. by starting with large_df.join(small_df, how='full', on='id')
There is no collect anywhere in your snippet, so if it's going OOM without a collect, something else is going on.