0 votes
0 answers
45 views

I have several ndjson files that are nearly 800GB. They come from parsing the Wikipedia dump. I would like to remove duplicate html. As such, I group by "html" and pick the json with the ...
Akira's user avatar
  • 2,853
1 vote
2 answers
75 views

I have from pathlib import Path import polars as pl inDir = r"E:\Personal Projects\tmp\tarFiles\result2" outDir = r"C:\Users\Akira\Documents" inDir = Path(inDir) outDir = ...
Akira's user avatar
  • 2,853
1 vote
0 answers
72 views

Polars version: 1.25.2 I have a dataframe: from datetime import date test_df = pl.DataFrame([ ("A", None, date(2009, 1, 24), 1), ("A", date(2010, 3, 24), date(2013, 1, 24),...
Jonathan's user avatar
  • 2,343
1 vote
0 answers
67 views

Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning (pl.scan_delta(temp_path)) a delta table that ...
gaut's user avatar
  • 6,048
3 votes
0 answers
117 views

I started using Python3.14t (free-threaded build) recently and had a blast. However, when I install Polars with python3.14t -m pip install polars, the process gets stuck at the last line of the following ...
user2961927's user avatar
  • 1,810
1 vote
1 answer
63 views

I want to make a custom polars plugin which takes a class defined in rust using pyo3. I have managed to create a class which can roundtrip pickle defined as: #[pyclass(module = "mylib._internal"...
thoooooooomas's user avatar
Advice
0 votes
0 replies
24 views

Consider the following situation: There is a complex (and time-consuming) query, which has some "slowly changing" parameters, i.e. the same query gets executed on the same data(-source) over ...
Benjamin Trendelkamp-Schroer's user avatar
Advice
0 votes
2 replies
48 views

I see tons of examples of how to convert or operate on specific columns, where the column name is known and simple, like 'a' or 'b'. I have hundreds, maybe thousands of columns in thousands of ...
shrykullgod's user avatar
0 votes
0 answers
73 views

I have a CSV file compressed as csv.gz on which I want to run some processing. I generally go with Polars because it is more memory-efficient and faster. Here is the code which I am using ...
kaddy's user avatar
  • 11
0 votes
1 answer
93 views

I am trying to apply changes from one dataframe (source file is a 7 MB .CSV) to a larger dataframe (source file approx. 3GB .CSV), e.g. update existing rows with matching IDs, while at the same time ...
starmandeluxe's user avatar
1 vote
0 answers
56 views

I am trying to explain() a Polars query to see which operations can be executed using the streaming engine. Currently, I am only able to do this using show_graph(). From sources on the web, I see that ...
gaut's user avatar
  • 6,048
1 vote
1 answer
76 views

I have a string column in a polars dataframe with multiple datetime formats, and I am using the following code to convert the column's datatype from string to datetime. import polars as pl df = pl.from_dict({'...
dikesh's user avatar
  • 3,135
0 votes
0 answers
78 views

I have a Python file import polars as pl import requests from pathlib import Path url = "https://raw.githubusercontent.com/leanhdung1994/files/main/processedStep1_enwiktionary_namespace_0_43....
Akira's user avatar
  • 2,853
1 vote
3 answers
181 views

I have an .ndjson file with millions of rows. Each row has a field html which contains html strings. I would like to write all such html into a .txt file, with each html string on one line of the .txt file. I ...
Akira's user avatar
  • 2,853
2 votes
1 answer
141 views

I am looking for the nearest non exact match on the dates column: import polars as pl df = pl.from_repr(""" ┌─────┬────────────┐ │ uid ┆ dates │ │ --- ┆ --- │ │ i64 ┆ date ...
rainerpf's user avatar
-2 votes
1 answer
99 views

I have a dictionary of polars.DataFrames called data_dict. All the dataframes in the dict values have an extra index column ''. I want to drop that column and set a new column named 'name_ID' ...
Tudi72's user avatar
  • 31
2 votes
1 answer
84 views

Imagine having the following polars dataframe "df" that contains the temperature of a machine that is either "active" or "inactive": import polars as pl from datetime ...
the_economist's user avatar
1 vote
0 answers
78 views

I have a table that looks like this import polars as pl df = pl.DataFrame( { "col1": [1, 2, 3, 4, 5], "col2": [10, 20, 30, 40, 50], "col3": [...
Lethnis's user avatar
  • 31
Advice
0 votes
7 replies
117 views

I use the polars, urllib and tldextract packages in python to parse 2 columns of URL strings in zstd-compressed parquet files (averaging 8GB, 40 million rows). The parsed output include the scheme, ...
norcalpedaler's user avatar
12 votes
0 answers
372 views

I have been using the Data Wrangler extension in VS Code for a while; it is very useful for analyzing datasets and filtering some columns to see the features. When I opened a dataframe in it, it used to ...
Javad Faraji's user avatar
1 vote
1 answer
113 views

I've built a dataset in Polars (python) and am attempting to plot it as a stacked horizontal bar chart using Polars' built-in Altair plot function; however, trying to specify a custom sort order for the ...
ExactaBox's user avatar
  • 3,425
1 vote
1 answer
117 views

Given two polars dataframes of the same shape, I would like to print the number of values different between the two, including missing values that are not missing in the other dataframe. I came up ...
robertspierre's user avatar
2 votes
2 answers
93 views

I have a CSV of energy consumption data over time (each month for several years). I want to determine the percentage (decimal portion) for each month across that year; e.g., August was 12.3% of the ...
Buckley's user avatar
  • 151
1 vote
3 answers
102 views

When you join two tables, STATA prints the number of rows merged and unmerged. For instance, take Example 1 on page 13 of the STATA merge doc: use https://www.stata-press.com/data/r19/autosize merge 1:...
robertspierre's user avatar
3 votes
0 answers
154 views

I noticed a significant performance deterioration when using polars dataframe join function after upgrading polars from 1.30.0 to 1.31.0. The code snippet is below: import polars as pl import time ...
Y. Gao's user avatar
  • 1,049
1 vote
2 answers
162 views

I'd like to replace any value greater than some condition with zero for any column except the date column in a df. The closest I've found is df.with_columns( pl.when(pl.any_horizontal(pl.col(pl....
thefrollickingnerd's user avatar
2 votes
1 answer
135 views

I have two Polars DataFrames (df1 and df2) with the same columns. I want to compare them by ID and Iname, and get the rows where any of the other columns (X, Y, Z) differ between the two. import ...
Simon's user avatar
  • 1,209
0 votes
0 answers
167 views

I'm working with a large Polars LazyFrame and computing rolling aggregations grouped by customer (Cusid). I need to find the "front" of the rolling window (last Tts_date) for each group to ...
Liisjak's user avatar
  • 37
6 votes
1 answer
112 views

I want to calculate the mean over some group column 'a' but include only one value per second group column 'b'. Constraints: I want to preserve all original records in the result. (if possible) avoid ...
gogodigi's user avatar
4 votes
3 answers
107 views

I would like to code a logger for polars using the Custom Namespace API. For instance, starting from: import logging import polars as pl penguins_pl = pl.read_csv("https://raw.githubusercontent....
robertspierre's user avatar
0 votes
1 answer
76 views

I am using tempfile with Polars for the first time and getting some surprising behavior when running it in a serverless Cloud Function-like environment. Here is my simple test code: try: with ...
starmandeluxe's user avatar
4 votes
4 answers
188 views

I have a Polars DataFrame with a column named "*" and would like to reference just that column. When I try to use pl.col("*") it is interpreted as a wildcard for "all columns."...
Sam's user avatar
  • 359
1 vote
2 answers
89 views

If I have a DataFrame, I can create a column with a single value like this: df = pl.DataFrame([[1, 2, 3]]) df.with_columns(pl.lit("ok").alias("metadata")) shape: (3, 2) ┌──────────...
Ilya V. Schurov's user avatar
1 vote
0 answers
78 views

I'm wondering why I'm seeing such poor performance when writing a LazyFrame using PartitionByKey to S3 when compared to other methods. Here is a simple test script that writes out some random data to ...
Stephen's user avatar
  • 276
1 vote
2 answers
113 views

Preamble I'm using polars's write_excel method which has a parameter column_formats which wants a ColumnFormatDict that is defined here and below ColumnFormatDict: TypeAlias = Mapping[ # dict of ...
Dean MacGregor's user avatar
2 votes
0 answers
181 views

I'm following the polars plugins tutorial - branch mispredictions and it says that there's a faster way to implement the following code: #[polars_expr(output_type=Int64)] fn sum_i64(inputs: &[Series]) -...
Ariana's user avatar
  • 29
-1 votes
1 answer
123 views

I have a Polars DataFrame with 2 columns [Col01 & Col02]. They hold the same values, though not the same number of times [e.g. Col01 can have, say, 5 rows of '00000' while Col02 may have 20 rows of '00000' ...
Mohan Prasath's user avatar
8 votes
1 answer
265 views

I'm working with the narwhals package and I'm trying to write an expression that is: applied over groups using .over() Non-elementary/chained (longer than a single operation) Works when the native df ...
Slash's user avatar
  • 581
-2 votes
1 answer
132 views

Description Trying to read 32GB of data split across 16 .jsonl files. I use the scan_ndjson function of Polars but the execution stops with error 137 (Out of memory). Here is the code: # Count infobox ...
codug's user avatar
  • 27
3 votes
3 answers
159 views

I have a dataframe using this format import polars as pl df = pl.from_repr(""" ┌─────┬────────────┬────────────┬──────────┐ │ ID ┆ DATE_PREV ┆ DATE ┆ REV_DIFF │ │ --- ┆ --- ...
Philipp's user avatar
  • 65
2 votes
1 answer
94 views

While the standard Polars package is available in version 1.34.0, the polars-u64-idx package is missing the latest versions. Does anyone know if this package is discontinued?
Stefan Herrmann's user avatar
2 votes
2 answers
268 views

json_decode requires that we specify the dtype. Polars represents maps with arbitrary keys as a List<struct<2>> (see here). EDIT: Suppose I don't know the keys in my JSON ahead of time, ...
user31639176's user avatar
2 votes
1 answer
128 views

I have a very big parquet file which I'm attempting to read from and split into partitioned folders on a column "token". Currently I'm using pl.scan_parquet on the big parquet file followed ...
WillowOfTheBorder's user avatar
2 votes
3 answers
121 views

I have this dataframe: import polars as pl df = pl.DataFrame({'value': [1,2,3,4,5,None,None], 'flag': [0,1,1,1,0,0,0]}) ┌───────┬──────┐ │ value ┆ flag │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═══════╪══...
Phil-ZXX's user avatar
  • 3,601
2 votes
1 answer
68 views

I am working to migrate from PySpark to Polars. In PySpark I often use aliases on dataframes so I can clearly see which columns come from which side of a join. I'd like to get similarly readable code ...
Arend-Jan Tissing's user avatar
0 votes
0 answers
120 views

I am using polars.df.write_delta() to initially create, and subsequently append to, Delta Tables in Microsoft Fabric OneLake storage, via a Fabric python notebook. Having had a production process up ...
Stuart J Cuthbertson's user avatar
1 vote
1 answer
99 views

I have an application where I have a futures::TryStream. Still in a streaming fashion, I want to convert this into a polars::LazyFrame. It is important to note that the TryStream comes from the ...
bmitc's user avatar
  • 908
0 votes
1 answer
121 views

Basically the title. Using PyCharm 2023.3.3, I'm not able to see the data of polars DataFrames. As an example, I have a simple DataFrame like this: print(ids_df) shape: (1, 4) ┌───────────────────────────...
Nauel's user avatar
  • 522
3 votes
3 answers
93 views

I have a simple dataframe that looks like this: import polars as pl df = pl.DataFrame({ 'ref': ['a', 'b', 'c', 'd', 'e', 'f'], 'idx': [4, 3, 1, 6, 2, 5], }) How can I obtain the result as ...
Baffin Chu's user avatar
2 votes
1 answer
108 views

I have this dataframe import polars as pl df = pl.from_repr(""" ┌────────────┬──────┐ │ date ┆ ME │ │ --- ┆ --- │ │ date ┆ i64 │ ╞════════════╪══════╡ │ 2027-11-...
Phil-ZXX's user avatar
  • 3,601
