2,827 questions
0
votes
0
answers
45
views
Polars: out-of-memory problem of groupby-max
I have several ndjson files that total nearly 800GB. They come from parsing the Wikipedia dump. I would like to remove duplicate html. As such, I group by "html" and pick the json with the ...
1
vote
2
answers
75
views
Find the most recent article in a group and stream the result to disk
I have
from pathlib import Path
import polars as pl
inDir = r"E:\Personal Projects\tmp\tarFiles\result2"
outDir = r"C:\Users\Akira\Documents"
inDir = Path(inDir)
outDir = ...
1
vote
0
answers
72
views
Performing cumulative sum in a Window with different columns ordering and null last configuration in Polars
Polars version: 1.25.2
I have a dataframe:
from datetime import date
test_df = pl.DataFrame([
("A", None, date(2009, 1, 24), 1),
("A", date(2010, 3, 24), date(2013, 1, 24),...
1
vote
0
answers
67
views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning (pl.scan_delta(temp_path)) a delta table that ...
3
votes
0
answers
117
views
Python 3.14t free-thread compatibility with Polars
I started using Python 3.14t (free-threaded build) recently and had a blast. However, when I install Polars
python3.14t -m pip install polars
The process gets stuck at the last line of the following
...
1
vote
1
answer
63
views
Make a custom rust class input to polars expression plugin
I want to make a custom polars plugin which takes a class defined in rust using pyo3. I have managed to create a class which can roundtrip pickle defined as:
#[pyclass(module = "mylib._internal...
Advice
0
votes
0
replies
24
views
Using the query plan (lazy frame) as a cache key
Consider the following situation: There is a complex (and time-consuming) query, which has some "slowly changing" parameters, i.e. the same query gets executed on the same data(-source) over ...
Advice
0
votes
2
replies
48
views
How can I use polars to convert all, or most, columns from one type to another?
I see tons of examples of how to convert or operate on specific columns, where the column name is known and simple, like 'a' or 'b'.
I have hundreds, maybe thousands of columns in thousands of ...
0
votes
0
answers
73
views
Why does Polars run OOM while trying to read a compressed CSV file while Pandas is able to do it?
I have a CSV file compressed as csv.gz which I want to run some processing on. I generally go with Polars because it is more memory-efficient and faster. Here is the code which I am using ...
0
votes
1
answer
93
views
Polars lazyframe update() silently failing in a serverless Cloud Function (OOM error)
I am trying to apply changes from one dataframe (source file is a 7 MB .CSV) to a larger dataframe (source file approx. 3GB .CSV), e.g. update existing rows with matching IDs, while at the same time ...
1
vote
0
answers
56
views
How to show the streaming parts of a polars query using explain()?
I am trying to explain() a Polars query to see which operations can be executed using the streaming engine. Currently, I am only able to do this using show_graph().
From sources on the web, I see that ...
1
vote
1
answer
76
views
Polars parse multiple datetime format [duplicate]
I have a string column in a polars dataframe with multiple datetime formats, and I am using the following code to convert the column's datatype from string to datetime.
import polars as pl
df = pl.from_dict({'...
0
votes
0
answers
78
views
polars.LazyFrame.sink_csv does not give CRLF line termination [duplicate]
I have a Python file
import polars as pl
import requests
from pathlib import Path
url = "https://raw.githubusercontent.com/leanhdung1994/files/main/processedStep1_enwiktionary_namespace_0_43....
1
vote
3
answers
181
views
Polars: how to write a column of strings into a txt file without escaping?
I have .ndjson files with millions of rows. Each row has a field html which contains html strings. I would like to write all such html into a .txt file, with one html string per line of the .txt file. I ...
2
votes
1
answer
141
views
Why does a nearest join_asof() return exact matches despite allow_exact_matches=False?
I am looking for the nearest non-exact match on the dates column:
import polars as pl
df = pl.from_repr("""
┌─────┬────────────┐
│ uid ┆ dates │
│ --- ┆ --- │
│ i64 ┆ date ...
-2
votes
1
answer
99
views
polars.exceptions.DuplicateError: column with name 'name_ID' has more than one occurrence [closed]
I have a dictionary of polars.DataFrames called data_dict.
All dataframes in the dict values have an extra index column ''.
I want to drop that column and set a new column named 'name_ID'
...
2
votes
1
answer
84
views
Change color of single line in altair line chart based on other indicator column
Imagine having the following polars dataframe "df" that contains the temperature of a machine that is either "active" or "inactive":
import polars as pl
from datetime ...
1
vote
0
answers
78
views
Is it possible to drop/select columns where col.n_unique > 1 with native polars syntax [duplicate]
I have a table that looks like this
import polars as pl
df = pl.DataFrame(
{
"col1": [1, 2, 3, 4, 5],
"col2": [10, 20, 30, 40, 50],
"col3": [...
Advice
0
votes
7
replies
117
views
High volume URL parsing in Python
I use the polars, urllib and tldextract packages in python to parse 2 columns of URL strings in zstd-compressed parquet files (averaging 8GB, 40 million rows). The parsed output includes the scheme, ...
12
votes
0
answers
372
views
Not displaying DataFrame's name in Data Wrangler extension of VSCode, displaying "Data grid"
I have been using the Data Wrangler extension in VS Code for a while; it is very useful for analyzing datasets and filtering some columns to see the features. When I opened a dataframe in it, it used to ...
1
vote
1
answer
113
views
Altair stacked bar chart in custom order
I've built a dataset in Polars (Python) and am attempting to plot it as a stacked horizontal bar chart using Polars' built-in Altair plot function. However, trying to specify a custom sort order for the ...
1
vote
1
answer
117
views
Polars print changed values between 2 dataframes
Given two polars dataframes of the same shape, I would like to print the number of values that differ between the two, including values that are missing in one dataframe but not in the other.
I came up ...
2
votes
2
answers
93
views
Seeking more efficient method in Python & Polars to perform monthly comparison within each year
I have a CSV of energy consumption data over time (each month for several years).
I want to determine the percentage (decimal portion) for each month across that year; e.g., August was 12.3% of the ...
1
vote
3
answers
102
views
Show matched rows in polars join
When you join two tables, STATA prints the number of rows merged and unmerged.
For instance, take Example 1 at page 13 of the STATA merge doc:
use https://www.stata-press.com/data/r19/autosize
merge 1:...
3
votes
0
answers
154
views
Why does polars join function performance deteriorate so much from version 1.30.0 to 1.31.0?
I noticed a significant performance deterioration when using polars dataframe join function after upgrading polars from 1.30.0 to 1.31.0. The code snippet is below:
import polars as pl
import time
...
1
vote
2
answers
162
views
Replace value by condition across entire polars df
I'd like to replace any value greater than some condition with zero for any column except the date column in a df.
The closest I've found is
df.with_columns(
pl.when(pl.any_horizontal(pl.col(pl....
2
votes
1
answer
135
views
Find differing rows between two Polars DataFrames based on ID and multiple columns
I have two Polars DataFrames (df1 and df2) with the same columns.
I want to compare them by ID and Iname, and get the rows where any of the other columns (X, Y, Z) differ between the two.
import ...
0
votes
0
answers
167
views
How to efficiently get the last row of a rolling aggregation group without .last()?
I'm working with a large Polars LazyFrame and computing rolling aggregations grouped by customer (Cusid). I need to find the "front" of the rolling window (last Tts_date) for each group to ...
6
votes
1
answer
112
views
Polars streaming: How to compute a nested window aggregation while avoiding in-memory-maps?
I want to calculate the mean over some group column 'a' but include only one value per second group column 'b'.
Constraints:
I want to preserve all original records in the result.
(if possible) avoid ...
4
votes
3
answers
107
views
Extending polars DataFrame while maintaining variables between calls
I would like to code a logger for polars using the Custom Namespace API.
For instance, starting from:
import logging
import polars as pl
penguins_pl = pl.read_csv("https://raw.githubusercontent....
0
votes
1
answer
76
views
Python tempfile TemporaryDirectory path changes multiple times after initialization
I am using tempfile with Polars for the first time and getting some surprising behavior when running it in a serverless Cloud Function-like environment. Here is my simple test code:
try:
with ...
4
votes
4
answers
188
views
Reference column named "*" in Polars
I have a Polars DataFrame with a column named "*" and would like to reference just that column. When I try to use pl.col("*") it is interpreted as a wildcard for "all columns. ...
1
vote
2
answers
89
views
Adding an Object column to a polars DataFrame with broadcasting
If I have a DataFrame, I can create a column with a single value like this:
df = pl.DataFrame([[1, 2, 3]])
df.with_columns(pl.lit("ok").alias("metadata"))
shape: (3, 2)
┌──────────...
1
vote
0
answers
78
views
Polars LazyFrame sink_parquet + PartitionByKey slower to S3 than local disk
I'm wondering why I'm seeing such poor performance when writing a LazyFrame using PartitionByKey to S3 when compared to other methods. Here is a simple test script that writes out some random data to ...
1
vote
2
answers
113
views
python typing distinctions between inline created parameters and variables
Preamble
I'm using polars's write_excel method which has a parameter column_formats which wants a ColumnFormatDict that is defined here and below
ColumnFormatDict: TypeAlias = Mapping[
# dict of ...
2
votes
0
answers
181
views
Speeding up Polars rust plugin branching and aggregating
I'm following the polars plugins tutorial - branch mispredictions and it says that there's a faster way to implement the following code:
#[polars_expr(output_type=Int64)]
fn sum_i64(inputs: &[Series]) -...
-1
votes
1
answer
123
views
Compare 2 columns in Polars and rearrange them when they match and unmatch?
A Polars DataFrame has 2 columns [Col01 & Col02]. They hold the same values, though not the same number of times [e.g. Col01 can have, say, 5 rows of '00000' while Col02 may have 20 rows of '00000' ...
8
votes
1
answer
265
views
How to write a pandas-compatible, non-elementary expression in narwhals
I'm working with the narwhals package and I'm trying to write an expression that is:
applied over groups using .over()
Non-elementary/chained (longer than a single operation)
Works when the native df ...
-2
votes
1
answer
132
views
Polars scan_ndjson Out of memory
Description
Trying to read 32GB of data split across 16 .jsonl files.
I use the function scan_ndjson of Polars but the execution stops with error 137 (Out of memory).
Here is the code:
# Count infobox ...
3
votes
3
answers
159
views
Calculating monthly revenue given start and end date for each ID using Polars
I have a dataframe using this format
import polars as pl
df = pl.from_repr("""
┌─────┬────────────┬────────────┬──────────┐
│ ID ┆ DATE_PREV ┆ DATE ┆ REV_DIFF │
│ --- ┆ --- ...
2
votes
1
answer
94
views
polars-u64-idx not available for latest version
While the standard Polars package is available in version 1.34.0, the polars-u64-idx package is missing the latest versions.
Does anyone know if this package is discontinued?
2
votes
2
answers
268
views
How do I get polars.Expr.str.json_decode to decode simple map to List(Struct({'key': String, 'value': Int32}))?
json_decode requires that we specify the dtype.
Polars represents maps with arbitrary keys as a List<struct<2>> (see here).
EDIT: Suppose I don't know the keys in my JSON ahead of time, ...
2
votes
1
answer
128
views
How to perform sinking lazyframes with diverging queries to different partitions
I have a very big parquet file which I'm attempting to read from and split into partitioned folders on a column "token".
Currently I'm using pl.scan_parquet on the big parquet file followed ...
2
votes
3
answers
121
views
Forward fill using values from rows that match a condition in Polars
I have this dataframe:
import polars as pl
df = pl.DataFrame({'value': [1,2,3,4,5,None,None], 'flag': [0,1,1,1,0,0,0]})
┌───────┬──────┐
│ value ┆ flag │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪══...
2
votes
1
answer
68
views
How to select joined columns with structure like namespaces (a.col1, b.col2)?
I am working to migrate from PySpark to Polars. In PySpark I often use aliases on dataframes so I can clearly see which columns come from which side of a join. I'd like to get similarly readable code ...
0
votes
0
answers
120
views
Enabling Delta Table checkpointing when using polars write_delta()
I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric python notebook.
Having had a production process up ...
1
vote
1
answer
99
views
Converting a Rust `futures::TryStream` to a `polars::LazyFrame`
I have an application where I have a futures::TryStream. Still in a streaming fashion, I want to convert this into a polars::LazyFrame. It is important to note that the TryStream comes from the ...
0
votes
1
answer
121
views
PyCharm "view as DataFrame" shows nothing for polars DataFrames
Basically the title. Using PyCharm 2023.3.3 I'm not able to see the data of polars DataFrames.
As an example, I've a simple DataFrame like this:
print(ids_df)
shape: (1, 4)
┌───────────────────────────...
3
votes
3
answers
93
views
Dynamically index a column in Polars
I have a simple dataframe that looks like this:
import polars as pl
df = pl.DataFrame({
'ref': ['a', 'b', 'c', 'd', 'e', 'f'],
'idx': [4, 3, 1, 6, 2, 5],
})
How can I obtain the result as ...
2
votes
1
answer
108
views
Find nearest / closest value to subset of values in a Polars dataframe
I have this dataframe
import polars as pl
df = pl.from_repr("""
┌────────────┬──────┐
│ date ┆ ME │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪══════╡
│ 2027-11-...