I have a CSV file compressed as .csv.gz that I want to run some processing on. I generally go with Polars because it is more memory-efficient and faster. Here is the code I am using to lazily read and filter it before running some other processing on it.

df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect()
)
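
For reference, here is a minimal sketch (not part of my original script) of how the optimized plan for this query can be printed without collecting anything, using the standard LazyFrame.explain() method:

# Hypothetical debugging step: build the same lazy query, but print the
# optimized plan instead of materializing the result.
lazy_df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
)
print(lazy_df.explain())

The plan is just a string, so it can be logged before the collect() call that ends up getting killed.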

On running this, I seem to run out of memory: I just get a Killed message with no other output. On the other hand, when I try to read and print the same dataframe with Pandas:

df = pd.read_csv(underlying_file_path, usecols=["bin_endtime", "strike_price", "opt_type", "expiry_date", "cp_fut"], parse_dates=True, low_memory=True)

This works, and I am able to print and process the file without issue. This is surprising because, until now, I've always found that Polars handles larger data than Pandas and is faster while doing so. Why could this be happening?

Details
  • OS: Ubuntu 22.04.5 LTS
  • Pandas Version: 2.3.3
  • Polars version: 1.35.2
  • Python version: 3.10.12
  • File size: 2.1G
  • Number of rows in the CSV file: 42.39M

I want to debug what is happening here and, in case it is a genuine limitation of Polars, report it to the devs. How do I see where things are falling apart?
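
For reference, this is the kind of instrumentation I can wrap around the collect() call (a minimal sketch; the resource module is from the Python standard library, and ru_maxrss is reported in KiB on Linux):

import resource

import polars as pl

def log_peak_rss(label):
    # Peak resident set size of this process so far (KiB on Linux).
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: peak RSS = {peak_kib / 1024:.0f} MiB", flush=True)

log_peak_rss("before collect")
df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect()
)
log_peak_rss("after collect")  # never reached if the process is OOM-killed inside collect()

If "before collect" reports a small number and the process is then killed, the blow-up is happening inside collect() itself rather than anywhere upstream.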

Please let me know if any other details are required.

Comments
  • Have you checked the memory footprint with, e.g., htop to verify it's indeed an OOM? Commented Dec 3 at 12:20
  • You can set os.environ["POLARS_VERBOSE"] = "1" (and POLARS_TRACK_METRICS) to get logging/info. I would try .collect(engine="streaming") to see if that also errors. You can also try the 1.36.0b2 prerelease github.com/pola-rs/polars/releases/tag/py-1.36.0-beta.2 if you're going to report a bug. There seem to be existing issues stating that Polars doesn't have proper streaming decompression yet, which may be related: github.com/pola-rs/polars/issues/18724 (a minimal sketch combining these suggestions follows the comments). Commented Dec 3 at 13:07
  • @usdn Yes, I have verified that it is indeed OOM. When I monitor the script using htop, I can see that the process uses at peak 93.7% of system memory. This is unexpected because when I read the CSV using Pandas and convert it to a Polars dataframe using df = pl.from_pandas(df_pd), it works fine, and when I print the estimated size using print(df.estimated_size(unit='mb')), it comes to about 900 MB. It's weird how a dataframe that takes 900 MB in memory when read exhausts 94% of the system's memory (16 GB). Finally, I checked dmesg and I can indeed see many messages related to OOM killing the process. Commented Dec 4 at 9:03
  • @jqurious Thanks for referencing the GitHub issue. Just as mentioned there, when I uncompress the CSV file (it uncompresses to a 17 GB CSV file) and read that instead, Polars manages to do so pretty fast without running out of memory. The flags to print verbose logs don't help much. As for trying a different version, I am unable to do so at the moment, and I believe it won't help anyway, since the changelog doesn't mention any improvements related to this particular issue. Commented Dec 4 at 9:27
  • You can try this plugin to stream the compressed CSV: github.com/ghuls/polars_streaming_csv_decompression Commented Dec 8 at 18:22
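
Following up on the suggestions in the comments, here is a minimal sketch that combines verbose logging with the streaming engine (underlying_file_path and the column names are taken from the question; the assumption is that Polars picks up POLARS_VERBOSE at query time, as the comment implies):

import os

import polars as pl

# Enable verbose logging before the query runs, then collect with the
# streaming engine instead of the default in-memory engine.
os.environ["POLARS_VERBOSE"] = "1"

df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect(engine="streaming")
)

If this still gets killed, the dmesg output together with the verbose log would be the useful details to attach to a bug report.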
