I'm working with a Polars DataFrame and want to perform a series of operations using the .select() method. However, I run into problems when I apply value_counts() followed by unnest() to get separate columns instead of a single struct column.

If I use it as the only expression in the select, I don't have any issues:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "State"]).first().over("CustomerID"),
    )
    .unnest("Country")
    .unique(maintain_order=True)
)

But when I combine it with other operations, like this:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "Count"]).first().over("CustomerID").unnest("Country"),
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unique(maintain_order=True)
)

I'm facing the following error:

AttributeError: 'Expr' object has no attribute 'unnest'

Example data:

import polars as pl

df = pl.read_csv(b"""
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Transaction_Status
541431,23166,MEDIUM CERAMIC TOP STORAGE JAR,74215,2011-01-18T10:01:00.000000,1.0399999618530273,12346,United Kingdom,Completed
C541433,23166,MEDIUM CERAMIC TOP STORAGE JAR,-74215,2011-01-18T10:17:00.000000,1.0399999618530273,12346,United Kingdom,Cancelled
537626,84997D,PINK 3 PIECE POLKADOT CUTLERY SET,6,2010-12-07T14:57:00.000000,3.75,12347,Iceland,Completed
537626,22729,ALARM CLOCK BAKELIKE ORANGE,4,2010-12-07T14:57:00.000000,3.75,12347,Iceland,Completed
537626,22492,MINI PAINT SET VINTAGE ,36,2010-12-07T14:57:00.000000,0.6499999761581421,12347,Iceland,Completed
537626,22727,ALARM CLOCK BAKELIKE RED ,4,2010-12-07T14:57:00.000000,3.75,12347,Iceland,Completed
537626,22774,RED DRAWER KNOB ACRYLIC EDWARDIAN,12,2010-12-07T14:57:00.000000,1.25,12347,Iceland,Completed
537626,22195,LARGE HEART MEASURING SPOONS,12,2010-12-07T14:57:00.000000,1.649999976158142,12347,Iceland,Completed
537626,22805,BLUE DRAWER KNOB ACRYLIC EDWARDIAN,12,2010-12-07T14:57:00.000000,1.25,12347,Iceland,Completed
537626,22771,CLEAR DRAWER KNOB ACRYLIC EDWARDIAN,12,2010-12-07T14:57:00.000000,1.25,12347,Iceland,Completed
""", try_parse_dates=True, schema_overrides={"CustomerID": pl.String})

3 Answers

1

Note that in your first example, you didn't call .unnest() directly on the value_counts() expression, but on the DataFrame returned by the select context.

This also works when the select context contains multiple expressions:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True).struct.rename_fields(["Country", "State"]).first().over("CustomerID"),
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unnest("Country")
    .unique(maintain_order=True)
)
shape: (2, 4)
┌────────────┬────────────────┬───────┬──────────────────────────┐
│ CustomerID ┆ Country        ┆ State ┆ Days_Since_Last_Purchase │
│ ---        ┆ ---            ┆ ---   ┆ ---                      │
│ str        ┆ str            ┆ u32   ┆ duration[μs]             │
╞════════════╪════════════════╪═══════╪══════════════════════════╡
│ 12346      ┆ United Kingdom ┆ 2     ┆ 0µs                      │
│ 12347      ┆ Iceland        ┆ 8     ┆ 41d 19h 20m              │
└────────────┴────────────────┴───────┴──────────────────────────┘
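As an aside, depending on your Polars version there may also be an expression-level route: newer releases expose Expr.struct.unnest() (an alias for struct.field("*")), which expands the struct fields directly inside the select. A sketch, assuming a version that has it:

(
    df
    .select(
        pl.col("CustomerID"),
        pl.col("Country").value_counts(sort=True)
            .struct.rename_fields(["Country", "State"])
            .first().over("CustomerID")
            .struct.unnest(),  # expression-level unnest; requires a recent Polars
        Days_Since_Last_Purchase = pl.col("InvoiceDate").max() - pl.col("InvoiceDate").max().over("CustomerID"),
    )
    .unique(maintain_order=True)
)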

1

.value_counts() does a .group_by().len() internally.
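For intuition, here is a minimal sketch of that equivalence on the example df (in recent Polars the count field is named "count"; older versions used "counts"):

# one struct column of {Country, count} pairs, one row per unique value
df.select(pl.col("Country").value_counts())

# the same counts as flat columns, no struct involved
df.group_by("Country").agg(pl.len())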

I've found it easier (and more efficient) to just avoid using value_counts altogether.

(df
  .with_columns(
     pl.len().over("CustomerID", "Country"),          # count per (customer, country), i.e. "value_counts"
     pl.col("InvoiceDate").max().alias("InvoiceMax")  # overall latest invoice date
  )
  .group_by("CustomerID")
  .agg(
     pl.col("Country", "len").sort_by("len").last(),  # country with the highest count per customer
     (pl.col("InvoiceMax").first() - pl.col("InvoiceDate").max()).alias("Days_Since_Last_Purchase")
  )
)
shape: (2, 4)
┌────────────┬────────────────┬─────┬──────────────────────────┐
│ CustomerID ┆ Country        ┆ len ┆ Days_Since_Last_Purchase │
│ ---        ┆ ---            ┆ --- ┆ ---                      │
│ str        ┆ str            ┆ u32 ┆ duration[μs]             │
╞════════════╪════════════════╪═════╪══════════════════════════╡
│ 12347      ┆ Iceland        ┆ 8   ┆ 41d 19h 20m              │
│ 12346      ┆ United Kingdom ┆ 2   ┆ 0µs                      │
└────────────┴────────────────┴─────┴──────────────────────────┘
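If you would rather report Days_Since_Last_Purchase as an integer number of days instead of a duration, one option is to append a with_columns step. A sketch, where out is assumed to hold the result of the query above:

out.with_columns(
    pl.col("Days_Since_Last_Purchase").dt.total_days()  # duration[μs] -> whole days as integers
)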

-2

You can try:

df_transformed = (
    df
    .select([
        pl.col("CustomerID"),
        pl.struct([
            pl.col("Country").value_counts(sort=True).alias("Country"),
            pl.lit(None).alias("State")  
        ]).alias("location"),
        (pl.col("InvoiceDate").max().over("CustomerID") - pl.col("InvoiceDate")).alias("Days_Since_Last_Purchase")
    ])
    .with_columns(
        pl.col("location").struct.field("Country").alias("Country"),
        pl.col("location").struct.field("State").alias("State")
    )
    .unique(maintain_order=True)
)

print(df_transformed)

1 Comment

Have you actually tried this? It does not work. ShapeError: Series length 2 doesn't match the DataFrame height of 10
