I'm working with a Polars DataFrame and want to perform a series of operations using the .select() method. However, I run into problems when I apply value_counts() followed by unnest() to get separate columns instead of a single struct column.

If I use it as the only expression in the select, I don't have any issues:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "State"]).first().over("CustomerID"),
    )
    .unnest("Country")
    .unique(maintain_order=True)
)

But when I combine it with other operations, like this:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "Count"]).first().over("CustomerID").unnest("Country"),
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unique(maintain_order=True)
)

I'm facing the following error:

AttributeError: 'Expr' object has no attribute 'unnest'

Example data:

import polars as pl

df = pl.read_csv(b"""
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Transaction_Status
541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18T10:01:00.000000,1.0399999618530273,12346,United Kingdom,Completed
C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18T10:17:00.000000,1.0399999618530273,12346,United Kingdom,Cancelled
537626,84997D,PINK 3 PIECE POLKADOT CUTLERY SET,6,2010-12-07T14:57:00.000000,3.75,12347,Iceland,Completed
537626,22729,ALARM CLOCK BAKELIKE ORANGE,4,2010-12-07T14:57:00.000000,3.75,12347,Iceland,Completed
537626,22492,MINI PAINT SET VINTAGE ,36,2010-12-07T14:57:00.000000,0.6499999761581421,12347,Iceland,Completed
537626,22727,ALARM CLOCK BAKELIKE RED ,4,2010-12-07T14:57:00.000000,3.75,12347,Iceland,Completed
537626,22774,RED DRAWER KNOB ACRYLIC EDWARDIAN,12,2010-12-07T14:57:00.000000,1.25,12347,Iceland,Completed
537626,22195,LARGE HEART MEASURING SPOONS,12,2010-12-07T14:57:00.000000,1.649999976158142,12347,Iceland,Completed
537626,22805,BLUE DRAWER KNOB ACRYLIC EDWARDIAN,12,2010-12-07T14:57:00.000000,1.25,12347,Iceland,Completed
537626,22771,CLEAR DRAWER KNOB ACRYLIC EDWARDIAN,12,2010-12-07T14:57:00.000000,1.25,12347,Iceland,Completed
""", try_parse_dates=True, schema_overrides={"CustomerID": pl.String})

3 Answers

1

Note that in your first example, you didn't call .unnest() directly on the value_counts() expression, but on the DataFrame returned by the select context.

This also works when the select context contains multiple expressions:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "State"]).first().over("CustomerID"),
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unnest("Country")
    .unique(maintain_order=True)
)
shape: (2, 4)
┌────────────┬────────────────┬───────┬──────────────────────────┐
│ CustomerID ┆ Country        ┆ State ┆ Days_Since_Last_Purchase │
│ ---        ┆ ---            ┆ ---   ┆ ---                      │
│ str        ┆ str            ┆ u32   ┆ duration[μs]             │
╞════════════╪════════════════╪═══════╪══════════════════════════╡
│ 12346      ┆ United Kingdom ┆ 2     ┆ 0µs                      │
│ 12347      ┆ Iceland        ┆ 8     ┆ 41d 19h 20m              │
└────────────┴────────────────┴───────┴──────────────────────────┘
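As an aside, depending on your Polars version there may also be an expression-level route: newer releases expose Expr.struct.unnest() (an alias for struct.field("*")), which expands the struct fields directly inside the select. A sketch, assuming a version that has it:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True)
            .struct.rename_fields(["Country", "State"])
            .first().over("CustomerID")
            .struct.unnest(),  # expression-level unnest; requires a recent Polars
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unique(maintain_order=True)
)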

1

.value_counts() does a .group_by().len() internally.
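For intuition, here is a minimal sketch of that equivalence on the example df (in recent Polars the count field is named "count"; older versions used "counts"):

# one struct column of {Country, count} pairs, one row per unique value
df.select(pl.col("Country").value_counts())

# the same counts as flat columns, no struct involved
df.group_by("Country").agg(pl.len())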

I've found it easier (and more efficient) to just avoid using value_counts altogether.

(df
  .with_columns(
     pl.len().over("CustomerID", "Country"),          # count per (customer, country), i.e. "value_counts"
     pl.col("InvoiceDate").max().alias("InvoiceMax")  # overall latest invoice date
  )
  .group_by("CustomerID")
  .agg(
     pl.col("Country", "len").sort_by("len").last(),  # country with the highest count per customer
     (pl.col("InvoiceMax").first() - pl.col("InvoiceDate").max()).alias("Days_Since_Last_Purchase")
  )
)
shape: (2, 4)
┌────────────┬────────────────┬─────┬──────────────────────────┐
│ CustomerID ┆ Country        ┆ len ┆ Days_Since_Last_Purchase │
│ ---        ┆ ---            ┆ --- ┆ ---                      │
│ str        ┆ str            ┆ u32 ┆ duration[μs]             │
╞════════════╪════════════════╪═════╪══════════════════════════╡
│ 12347      ┆ Iceland        ┆ 8   ┆ 41d 19h 20m              │
│ 12346      ┆ United Kingdom ┆ 2   ┆ 0µs                      │
└────────────┴────────────────┴─────┴──────────────────────────┘
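If you would rather report Days_Since_Last_Purchase as an integer number of days instead of a duration, one option is to append a with_columns step. A sketch, where out is assumed to hold the result of the query above:

out.with_columns(
    pl.col("Days_Since_Last_Purchase").dt.total_days()  # duration[μs] -> whole days as integers
)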

-2

You can try:

df_transformed = (
    df
    .select([
        pl.col("CustomerID"),
        pl.struct([
            pl.col("Country").value_counts(sort=True).alias("Country"),
            pl.lit(None).alias("State")  
        ]).alias("location"),
        (pl.col("InvoiceDate").max().over("CustomerID") - pl.col("InvoiceDate")).alias("Days_Since_Last_Purchase")
    ])
    .with_columns(
        pl.col("location").struct.field("Country").alias("Country"),
        pl.col("location").struct.field("State").alias("State")
    )
    .unique(maintain_order=True)
)

print(df_transformed)

1 Comment

Have you actually tried this? It does not work. ShapeError: Series length 2 doesn't match the DataFrame height of 10
