Newest 'python-polars' Questions

0 votes

0 answers

45 views

Polars: out-of-memory problem of groupby-max

I have several ndjson files that are nearly 800GB. They come from parsing the Wikipedia dump. I would like to remove duplicate html. As such, I group by "html" and pick the json with the ...

Akira

2,853

asked 4 hours ago

1 vote

2 answers

75 views

Find the most recent article in a group and stream the result to disk

I have from pathlib import Path import polars as pl inDir = r"E:\Personal Projects\tmp\tarFiles\result2" outDir = r"C:\Users\Akira\Documents" inDir = Path(inDir) outDir = ...

Akira

2,853

asked 14 hours ago

1 vote

0 answers

72 views

Performing cumulative sum in a Window with different columns ordering and null last configuration in Polars

Polars version: 1.25.2 I have a dataframe: from datetime import date test_df = pl.DataFrame([ ("A", None, date(2009, 1, 24), 1), ("A", date(2010, 3, 24), date(2013, 1, 24),...

Jonathan

2,343

asked yesterday

1 vote

0 answers

67 views

Warning and performance issues when scanning delta tables

Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning (pl.scan_delta(temp_path)) a delta table that ...

gaut

6,048

asked Dec 6 at 1:45

3 votes

0 answers

117 views

Python 3.14t free-thread compatibility with Polars

I started using Python3.14t (free-threaded build) recently and had a blast. However, when I install Polars with python3.14t -m pip install polars, the process gets stuck at the last line of the following ...

user2961927

1,810

asked Dec 5 at 17:57

1 vote

1 answer

63 views

Make a custom rust class input to polars expression plugin

I want to make a custom polars plugin which takes a class defined in rust using pyo3. I have managed to create a class which can roundtrip pickle defined as: #[pyclass(module = "mylib._internal"...

thoooooooomas

103

asked Dec 5 at 8:35

Advice

0 votes

0 replies

24 views

Using the query plan (lazy frame) as a cache key

Consider the following situation: There is a complex (and time-consuming) query, which has some "slowly changing" parameters, i.e. the same query gets executed on the same data(-source) over ...

Benjamin Trendelkamp-Schroer

121

asked Dec 4 at 20:56

Advice

0 votes

2 replies

48 views

How can I use polars to convert all, or most, columns from one type to another?

I see tons of examples of how to convert or operate on specific columns, where the column name is known and simple, like 'a' or 'b'. I have hundreds, maybe thousands of columns in thousands of ...

shrykullgod

43

asked Dec 3 at 22:04

0 votes

0 answers

73 views

Why does Polars run OOM while trying to read a compressed CSV file while Pandas is able to do it?

I have a CSV file compressed as csv.gz that I want to run some processing on. I generally go with Polars because it is more memory-efficient and faster. Here is the code which I am using ...

kaddy

11

asked Dec 3 at 9:17

0 votes

1 answer

93 views

Polars lazyframe update() silently failing in a serverless Cloud Function (OOM error)

I am trying to apply changes from one dataframe (source file is a 7 MB .CSV) to a larger dataframe (source file approx. 3GB .CSV), e.g. update existing rows with matching IDs, while at the same time ...

starmandeluxe

2,607

asked Dec 1 at 10:47

1 vote

0 answers

56 views

How to show the streaming parts of a polars query using explain()?

I am trying to explain() a Polars query to see which operations can be executed using the streaming engine. Currently, I am only able to do this using show_graph(). From sources on the web, I see that ...

gaut

6,048

asked Nov 28 at 11:34

1 vote

1 answer

76 views

Polars parse multiple datetime format [duplicate]

I have a string column in a polars dataframe with multiple datetime formats, and I am using the following code to convert the column's datatype from string to datetime. import polars as pl df = pl.from_dict({'...

dikesh

3,135

asked Nov 26 at 12:27

0 votes

0 answers

78 views

polars.LazyFrame.sink_csv does not give CRLF line termination [duplicate]

I have a Python file import polars as pl import requests from pathlib import Path url = "https://raw.githubusercontent.com/leanhdung1994/files/main/processedStep1_enwiktionary_namespace_0_43....

Akira

2,853

asked Nov 25 at 19:19

1 vote

3 answers

181 views

Polars: how to write a column of strings into a txt file without escaping?

I have .ndjson files with millions of rows. Each row has a field html which contains html strings. I would like to write all such html into a .txt file, with each html on one line of the .txt file. I ...

Akira

2,853

asked Nov 25 at 0:08

2 votes

1 answer

141 views

Why does a nearest join_asof() return exact matches despite allow_exact_matches=False?

I am looking for the nearest non exact match on the dates column: import polars as pl df = pl.from_repr(""" ┌─────┬────────────┐ │ uid ┆ dates │ │ --- ┆ --- │ │ i64 ┆ date ...

rainerpf

21

asked Nov 21 at 20:58

-2 votes

1 answer

99 views

polars.exceptions.DuplicateError: column with name 'name_ID' has more than one occurrence [closed]

I have a dictionary of polars.DataFrames called data_dict. All dataframes in the dict values have an extra index column ''. I want to drop that column and set a new column named 'name_ID' ...

Tudi72

31

asked Nov 19 at 16:08

2 votes

1 answer

84 views

Change color of single line in altair line chart based on other indicator column

Imagine having the following polars dataframe "df" that contains the temperature of a machine that is either "active" or "inactive": import polars as pl from datetime ...

the_economist

579

asked Nov 17 at 9:32

1 vote

0 answers

78 views

Is it possible to drop/select columns where col.n_unique > 1 with native polars syntax [duplicate]

I have a table that looks like this import polars as pl df = pl.DataFrame( { "col1": [1, 2, 3, 4, 5], "col2": [10, 20, 30, 40, 50], "col3": [...

Lethnis

31

asked Nov 17 at 2:07

Advice

0 votes

7 replies

117 views

High volume URL parsing in Python

I use the polars, urllib and tldextract packages in python to parse 2 columns of URL strings in zstd-compressed parquet files (averaging 8GB, 40 million rows). The parsed output include the scheme, ...

norcalpedaler

132

asked Nov 16 at 18:34

12 votes

0 answers

372 views

Not displaying DataFrame's name in Data Wrangler extension of VSCode, displaying "Data grid"

I have been using the Data Wrangler extension in VS Code for a while; it is very useful for analyzing datasets and filtering some columns to see the features. When I opened a dataframe in it, it used to ...

Javad Faraji

41

asked Nov 16 at 8:02

1 vote

1 answer

113 views

Altair stacked bar chart in custom order

I've built a dataset in Polars (python) and am attempting to plot it as a stacked horizontal bar chart using Polars' built-in Altair plot function. However, trying to specify a custom sort order for the ...

ExactaBox

3,425

asked Nov 14 at 20:46

1 vote

1 answer

117 views

Polars print changed values between 2 dataframes

Given two polars dataframes of the same shape, I would like to print the number of values different between the two, including missing values that are not missing in the other dataframe. I came up ...

robertspierre

5,383

asked Nov 13 at 16:52

2 votes

2 answers

93 views

Seeking more efficient method in Python & Polars to perform monthly comparison within each year

I have a CSV of energy consumption data over time (each month for several years). I want to determine the percentage (decimal portion) for each month across that year; e.g., August was 12.3% of the ...

Buckley

151

asked Nov 13 at 16:26

1 vote

3 answers

102 views

Show matched rows in polars join

When you join two tables, STATA prints the number of rows merged and unmerged. For instance, take Example 1 at page 13 of the STATA merge doc: use https://www.stata-press.com/data/r19/autosize merge 1:...

robertspierre

5,383

asked Nov 11 at 15:20

3 votes

0 answers

154 views

Why polars join function performance deteriorates so much from version 1.30.0 to 1.31.0?

I noticed a significant performance deterioration when using polars dataframe join function after upgrading polars from 1.30.0 to 1.31.0. The code snippet is below: import polars as pl import time ...

Y. Gao

1,049

asked Nov 7 at 13:14

1 vote

2 answers

162 views

Replace value by condition across entire polars df

I'd like to replace any value greater than some condition with zero for any column except the date column in a df. The closest I've found is df.with_columns( pl.when(pl.any_horizontal(pl.col(pl....

thefrollickingnerd

401

asked Nov 5 at 0:26

2 votes

1 answer

135 views

Find differing rows between two Polars DataFrames based on ID and multiple columns

I have two Polars DataFrames (df1 and df2) with the same columns. I want to compare them by ID and Iname, and get the rows where any of the other columns (X, Y, Z) differ between the two. import ...

Simon

1,209

asked Nov 4 at 19:06

0 votes

0 answers

167 views

How to efficiently get the last row of a rolling aggregation group without .last()?

I'm working with a large Polars LazyFrame and computing rolling aggregations grouped by customer (Cusid). I need to find the "front" of the rolling window (last Tts_date) for each group to ...

Liisjak

37

asked Nov 4 at 16:13

6 votes

1 answer

112 views

Polars streaming: How to compute a nested window aggregation while avoiding in-memory-maps?

I want to calculate the mean over some group column 'a' but include only one value per second group column 'b'. Constraints: I want to preserve all original records in the result. (if possible) avoid ...

gogodigi

95

asked Oct 31 at 11:16

4 votes

3 answers

107 views

Extending polars DataFrame while maintaining variables between calls

I would like to code a logger for polars using the Custom Namespace API. For instance, starting from: import logging import polars as pl penguins_pl = pl.read_csv("https://raw.githubusercontent....

robertspierre

5,383

asked Oct 31 at 9:19

0 votes

1 answer

76 views

Python tempfile TemporaryDirectory path changes multiple times after initialization

I am using tempfile with Polars for the first time and getting some surprising behavior when running it in a serverless Cloud Function-like environment. Here is my simple test code: try: with ...

starmandeluxe

2,607

asked Oct 31 at 4:42

4 votes

4 answers

188 views

Reference column named "*" in Polars

I have a Polars DataFrame with a column named "*" and would like to reference just that column. When I try to use pl.col("*") it is interpreted as a wildcard for "all columns."...

Sam

359

asked Oct 29 at 21:56

1 vote

2 answers

89 views

Adding an Object column to a polars DataFrame with broadcasting

If I have a DataFrame, I can create a column with a single value like this: df = pl.DataFrame([[1, 2, 3]]) df.with_columns(pl.lit("ok").alias("metadata")) shape: (3, 2) ┌──────────...

Ilya V. Schurov

8,197

asked Oct 28 at 13:07

1 vote

0 answers

78 views

Polars LazyFrame sink_parquet + PartitionByKey slower to S3 than local disk

I'm wondering why I'm seeing such poor performance when writing a LazyFrame using PartitionByKey to S3 when compared to other methods. Here is a simple test script that writes out some random data to ...

Stephen

276

asked Oct 24 at 22:21

1 vote

2 answers

113 views

python typing distinctions between inline created parameters and variables

Preamble I'm using polars's write_excel method which has a parameter column_formats which wants a ColumnFormatDict that is defined here and below ColumnFormatDict: TypeAlias = Mapping[ # dict of ...

Dean MacGregor

20.1k

asked Oct 24 at 15:52

2 votes

0 answers

181 views

Speeding up Polars rust plugin branching and aggregating

I'm following polars plugins tutorial - branch mispredictions and it says that there's a faster way to implement the following code: #[polars_expr(output_type=Int64)] fn sum_i64(inputs: &[Series]) -...

Ariana

29

asked Oct 23 at 10:38

-1 votes

1 answer

123 views

Compare 2 columns in Polars and rearrange them when they match and unmatch?

I have a Polars DataFrame with 2 columns [Col01 & Col02]. They hold the same values though not the same number of times [e.g. Col01 can have say 5 rows of '00000' while Col02 may have 20 rows of '00000' ...

Mohan Prasath

1

asked Oct 17 at 13:57

8 votes

1 answer

265 views

How to write a pandas-compatible, non-elementary expression in narwhals

I'm working with the narwhals package and I'm trying to write an expression that is: applied over groups using .over() Non-elementary/chained (longer than a single operation) Works when the native df ...

Slash

581

asked Oct 14 at 19:07

-2 votes

1 answer

132 views

Polars scan_ndjson Out of memory

Description Trying to read 32GB of data split across 16 .jsonl files. I use the scan_ndjson function of Polars but the execution stops with error 137 (Out of memory). Here is the code: # Count infobox ...

codug

27

asked Oct 13 at 11:08

3 votes

3 answers

159 views

Calculating monthly revenue given start and end date for each ID using Polars

I have a dataframe using this format import polars as pl df = pl.from_repr(""" ┌─────┬────────────┬────────────┬──────────┐ │ ID ┆ DATE_PREV ┆ DATE ┆ REV_DIFF │ │ --- ┆ --- ...

Philipp

65

asked Oct 8 at 14:48

2 votes

1 answer

94 views

polars-u64-idx not available for latest version

While the standard Polars package is available in version 1.34.0 the polars-u64-idx package is missing the latest versions. Does anyone know if this package is discontinued?

Stefan Herrmann

81

asked Oct 7 at 10:03

2 votes

2 answers

268 views

How do I get polars.Expr.str.json_decode to decode simple map to List(Struct({'key': String, 'value': Int32}))?

json_decode requires that we specify the dtype. Polars represents maps with arbitrary keys as a List<struct<2>> (see here). EDIT: Suppose I don't know the keys in my JSON ahead of time, ...

user31639176

23

asked Oct 6 at 18:10

2 votes

1 answer

128 views

How to perform sinking lazyframes with diverging queries to different partitions

I have a very big parquet file which I'm attempting to read from and split into partitioned folders on a column "token". Currently I'm using pl.scan_parquet on the big parquet file followed ...

WillowOfTheBorder

45

asked Oct 6 at 12:44

2 votes

3 answers

121 views

Forward fill using values from rows that match a condition in Polars

I have this dataframe: import polars as pl df = pl.DataFrame({'value': [1,2,3,4,5,None,None], 'flag': [0,1,1,1,0,0,0]}) ┌───────┬──────┐ │ value ┆ flag │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═══════╪══...

Phil-ZXX

3,601

asked Oct 2 at 10:25

2 votes

1 answer

68 views

How to select joined columns with structure like namespaces (a.col1, b.col2)?

I am working to migrate from PySpark to Polars. In PySpark I often use aliases on dataframes so I can clearly see which columns come from which side of a join. I'd like to get similarly readable code ...

Arend-Jan Tissing

376

asked Oct 2 at 10:07

0 votes

0 answers

120 views

Enabling Delta Table checkpointing when using polars write_delta()

I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric python notebook. Having had a production process up ...

Stuart J Cuthbertson

438

asked Sep 30 at 14:21

1 vote

1 answer

99 views

Converting a Rust `futures::TryStream` to a `polars::LazyFrame`

I have an application where I have a futures::TryStream. Still in a streaming fashion, I want to convert this into a polars::LazyFrame. It is important to note that the TryStream comes from the ...

bmitc

908

asked Sep 30 at 4:00

0 votes

1 answer

121 views

PyCharm "view as DataFrame" shows nothing for polars DataFrames

Basically the title. Using PyCharm 2023.3.3 I'm not able to see the data of polars DataFrames. As an example, I've a simple DataFrame like this: print(ids_df) shape: (1, 4) ┌───────────────────────────...

Nauel

522

asked Sep 29 at 9:56

3 votes

3 answers

93 views

Dynamically index a column in Polars

I have a simple dataframe that looks like this: import polars as pl df = pl.DataFrame({ 'ref': ['a', 'b', 'c', 'd', 'e', 'f'], 'idx': [4, 3, 1, 6, 2, 5], }) How can I obtain the result as ...

Baffin Chu

217

asked Sep 27 at 22:07

2 votes

1 answer

108 views

Find nearest / closest value to subset of values in a Polars dataframe

I have this dataframe import polars as pl df = pl.from_repr(""" ┌────────────┬──────┐ │ date ┆ ME │ │ --- ┆ --- │ │ date ┆ i64 │ ╞════════════╪══════╡ │ 2027-11-...

Phil-ZXX

3,601

asked Sep 26 at 15:47
