I have a column that is an arbitrary-length array of key/value structs:
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, FloatType, DoubleType
)

StructType([
    StructField("key", StringType(), False),
    StructField("value", StructType([
        StructField("string_value", StringType(), True),
        StructField("int_value", IntegerType(), True),
        StructField("float_value", FloatType(), True),
        StructField("double_value", DoubleType(), True)
    ]), True)
])
There are only a few possible key names, and I know the data type of each: for example, first_name is always a string, birth_year is always an integer, and so on. Not every attribute is present in every row, so the predefined target struct has to make every field nullable, e.g.:
StructType([
    StructField("first_name", StringType(), True),
    StructField("middle_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("birth_year", IntegerType(), True),
    StructField("ssn", IntegerType(), True),
    StructField("zipcode", IntegerType(), True),
])
The incoming column values will look something like this:
[
    (key: "first_name", value: (string_value: "John")),
    (key: "ssn", value: (int_value: 123456789)),
    (key: "last_name", value: (string_value: "Doe")),
]
------------------------------------------------------
[
    (key: "ssn", value: (int_value: 987654321)),
    (key: "last_name", value: (string_value: "Jones")),
]
------------------------------------------------------
[
    (key: "zipcode", value: (int_value: 13579)),
    (key: "first_name", value: (string_value: "Bob")),
    (key: "birth_year", value: (int_value: 1985)),
    (key: "last_name", value: (string_value: "Smith")),
]
and I want them to become a column of the person struct like this:
{
    first_name: "John",
    last_name: "Doe",
    ssn: 123456789
}
------------------------------------------------------
{
    last_name: "Jones",
    ssn: 987654321
}
------------------------------------------------------
{
    first_name: "Bob",
    last_name: "Smith",
    birth_year: 1985,
    zipcode: 13579
}
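For reference, the three sample inputs can be reconstructed as a small test DataFrame like this (the attributes column name and the Spark session setup are placeholders of my own, not from the real pipeline):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DDL schema for the array-of-key/value-structs column; "attributes" is a
# placeholder name for the real column.
schema = (
    "attributes array<struct<key:string,"
    "value:struct<string_value:string,int_value:int,"
    "float_value:float,double_value:double>>>"
)

# Each array element is (key, (string_value, int_value, float_value, double_value)).
rows = [
    ([("first_name", ("John", None, None, None)),
      ("ssn", (None, 123456789, None, None)),
      ("last_name", ("Doe", None, None, None))],),
    ([("ssn", (None, 987654321, None, None)),
      ("last_name", ("Jones", None, None, None))],),
    ([("zipcode", (None, 13579, None, None)),
      ("first_name", ("Bob", None, None, None)),
      ("birth_year", (None, 1985, None, None)),
      ("last_name", ("Smith", None, None, None))],),
]

df = spark.createDataFrame(rows, schema)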
This is a playground example, but the real data has several billion rows, so performance matters: the solution should not use Python UDFs, only functions from pyspark.sql.functions.
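For concreteness, the closest thing I can sketch with pyspark.sql.functions alone is something like the following (untested; it assumes Spark 2.4+ for map_from_entries and uses the placeholder attributes column and df from the snippet above). I don't know whether this is a reasonable approach at this scale or whether there is something better:

from pyspark.sql import functions as F

# Collapse the key/value array into a map keyed by `key`, then pull each known
# attribute out of the matching typed value field. Keys absent from a row come
# back as null, which matches the all-nullable person struct.
kv = F.map_from_entries(F.col("attributes"))

person = F.struct(
    kv["first_name"]["string_value"].alias("first_name"),
    kv["middle_name"]["string_value"].alias("middle_name"),
    kv["last_name"]["string_value"].alias("last_name"),
    kv["birth_year"]["int_value"].alias("birth_year"),
    kv["ssn"]["int_value"].alias("ssn"),
    kv["zipcode"]["int_value"].alias("zipcode"),
)

df.withColumn("person", person).select("person").show(truncate=False)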