I have a CSV file compressed as .csv.gz that I want to run some processing on. I generally go with Polars because it is more memory-efficient and faster. Here is the code I am using to lazily read and filter it before running some other processing on it.

df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect()
)
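
For reference, here is a minimal sketch (not part of my original script) of how the optimized plan for this query can be printed without collecting anything, using the standard LazyFrame.explain() method:

# Hypothetical debugging step: build the same lazy query, but print the
# optimized plan instead of materializing the result.
lazy_df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
)
print(lazy_df.explain())

The plan is just a string, so it can be logged before the collect() call that ends up getting killed.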

On running this, I seem to run out of memory: I just get a Killed message with no other output. On the other hand, when I try to read and print the same dataframe with Pandas:

df = pd.read_csv(underlying_file_path, usecols=["bin_endtime", "strike_price", "opt_type", "expiry_date", "cp_fut"], parse_dates=True, low_memory=True)

This works, and I am able to print and process the file without issue. This is surprising because, until now, I've always found that Polars handles larger data than Pandas and is faster while doing so. Why could this be happening?

Details
  • OS: Ubuntu 22.04.5 LTS
  • Pandas Version: 2.3.3
  • Polars version: 1.35.2
  • Python version: 3.10.12
  • File size: 2.1G
  • Number of rows in the CSV file: 42.39M

I want to debug what is happening here and, in case it is a genuine limitation of Polars, report it to the devs. How do I see where things are falling apart?
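
For reference, this is the kind of instrumentation I can wrap around the collect() call (a minimal sketch; the resource module is from the Python standard library, and ru_maxrss is reported in KiB on Linux):

import resource

import polars as pl

def log_peak_rss(label):
    # Peak resident set size of this process so far (KiB on Linux).
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: peak RSS = {peak_kib / 1024:.0f} MiB", flush=True)

log_peak_rss("before collect")
df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect()
)
log_peak_rss("after collect")  # never reached if the process is OOM-killed inside collect()

If "before collect" reports a small number and the process is then killed, the blow-up is happening inside collect() itself rather than anywhere upstream.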

Please let me know if any other details are required.

Comments
  • Have you checked the memory footprint with, e.g., htop to verify it's indeed an OOM? Commented Dec 3 at 12:20
  • You can set os.environ["POLARS_VERBOSE"] = "1" (and POLARS_TRACK_METRICS) to get logging/info. I would try .collect(engine="streaming") to see if that also errors. You can also try the 1.36.0b2 prerelease github.com/pola-rs/polars/releases/tag/py-1.36.0-beta.2 if you're going to report a bug. There seem to be existing issues stating that Polars doesn't have proper streaming decompression yet, which may be related: github.com/pola-rs/polars/issues/18724 (a minimal sketch combining these suggestions follows the comments). Commented Dec 3 at 13:07
  • @usdn Yes, I have verified that it is indeed OOM. When I monitor the script using htop, I can see that the process uses at peak 93.7% of system memory. This is unexpected because when I read the CSV using Pandas and convert it to a Polars dataframe using df = pl.from_pandas(df_pd), it works fine, and when I print the estimated size using print(df.estimated_size(unit='mb')), it comes to about 900 MB. It's weird how a dataframe that takes 900 MB in memory when read exhausts 94% of the system's memory (16 GB). Finally, I checked dmesg and I can indeed see many messages related to OOM killing the process. Commented Dec 4 at 9:03
  • @jqurious Thanks for referencing the GitHub issue. Just as mentioned there, when I uncompress the CSV file (it uncompresses to a 17 GB CSV file) and read that instead, Polars manages to do so pretty fast without running out of memory. The flags to print verbose logs don't help much. As for trying a different version, I am unable to do so at the moment, and I believe it won't help anyway, since the changelog doesn't mention any improvements related to this particular issue. Commented Dec 4 at 9:27
  • You can try this plugin to stream the compressed CSV: github.com/ghuls/polars_streaming_csv_decompression Commented Dec 8 at 18:22
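
Following up on the suggestions in the comments, here is a minimal sketch that combines verbose logging with the streaming engine (underlying_file_path and the column names are taken from the question; the assumption is that Polars picks up POLARS_VERBOSE at query time, as the comment implies):

import os

import polars as pl

# Enable verbose logging before the query runs, then collect with the
# streaming engine instead of the default in-memory engine.
os.environ["POLARS_VERBOSE"] = "1"

df = (
    pl.scan_csv(underlying_file_path, try_parse_dates=True, low_memory=True)
    .select(pl.col("bin", "price", "type", "date", "fut"))
    .filter(pl.col("date") == pl.col("date").min())
    .collect(engine="streaming")
)

If this still gets killed, the dmesg output together with the verbose log would be the useful details to attach to a bug report.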
