I have a CSV file compressed as csv.gz which I want to run some processing on. I generally go with Polars because it is more memory-efficient and faster. Here is the code I am using to lazily read and filter it before running some other processing on it:
df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect()
)
On running this, I seem to run out of memory: I just get a Killed message with no other output. On the other hand, when I try to read and print the same dataframe with Pandas:
df = pd.read_csv(
    underlying_file_path,
    usecols=["bin_endtime", "strike_price", "opt_type", "expiry_date", "cp_fut"],
    parse_dates=True,
    low_memory=True,
)
This works, and I am able to print and process the dataframe without issues. This is surprising because, up till now, I've always found that Polars can handle larger data than Pandas and is faster while doing so. Why could this be happening?
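As an aside, here is a rough sketch of how I could quantify the Pandas-side footprint for comparison; the `peak_rss_mb` helper is just illustrative and not part of my actual pipeline, and it assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
# Illustrative only: measure the in-memory size of the Pandas dataframe and the
# process's peak RSS after the read (Linux reports ru_maxrss in kilobytes).
import resource

import pandas as pd

def peak_rss_mb() -> float:
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

df = pd.read_csv(
    underlying_file_path,  # same csv.gz path as above
    usecols=["bin_endtime", "strike_price", "opt_type", "expiry_date", "cp_fut"],
    parse_dates=True,
    low_memory=True,
)
print(f"dataframe size: {df.memory_usage(deep=True).sum() / 1024**2:.0f} MB")
print(f"peak RSS: {peak_rss_mb():.0f} MB")
```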
Details
- OS: Ubuntu 22.04.5 LTS
- Pandas Version: 2.3.3
- Polars version: 1.35.2
- Python version: 3.10.12
- File size: 2.1G
- Number of rows in the CSV file: 42.39M
I wish to debug what is happening here and, in case it is a genuine limitation of Polars, report it to the devs. How do I see where things are falling apart?
Please let me know if any other details are required.
Comments:

- Did you check with htop to verify it's indeed an OOM? You can set `os.environ["POLARS_VERBOSE"] = "1"` (and `POLARS_TRACK_METRICS`) to get logging/info. I would try `.collect(engine="streaming")` to see if that also errors. You can also try the 1.36.0b2 prerelease (github.com/pola-rs/polars/releases/tag/py-1.36.0-beta.2) if you're going to report a bug. There seem to be some existing issues stating that Polars doesn't have proper streaming decompression yet, which may be related: github.com/pola-rs/polars/issues/18724
- (Asker) When I convert the Pandas dataframe with `df = pl.from_pandas(df_pd)`, it works fine, and when I print the estimated size using `print(df.estimated_size(unit='mb'))`, it comes to about 900 MB. It's weird how a dataframe that takes 900 MB in memory exhausts 94% of the system's memory (16 GB). Finally, I checked dmesg and I can indeed see many messages related to the OOM killer terminating the process.
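Putting the suggestions from the comments together, here is a rough sketch of what I plan to try next. The decompress-to-disk step is my own assumption, based on the linked issue about Polars not yet streaming gzip decompression; `decompressed_path` is a hypothetical target path, and I am assuming `underlying_file_path` is a plain string ending in `.gz`.

```python
# Sketch: enable verbose logging, decompress the gzip once to disk, then run the
# same lazy query against the plain CSV with the streaming engine.
import gzip
import os
import shutil

import polars as pl

os.environ["POLARS_VERBOSE"] = "1"  # query/optimizer logging, as suggested above

# Assumption: decompressing up front avoids Polars having to hold the whole
# decompressed file in memory (see the linked issue #18724).
decompressed_path = underlying_file_path.removesuffix(".gz")  # hypothetical path
with gzip.open(underlying_file_path, "rb") as src, open(decompressed_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

df = (
    pl.scan_csv(decompressed_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect(engine="streaming")  # streaming engine, as suggested in the comments
)
```

If the streaming collect on the uncompressed file is still killed, that would suggest the problem is not the gzip decompression, which seems worth noting in the bug report.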