I have a column that holds an arbitrary-length array of key/value structs:

ArrayType(StructType([
    StructField("key", StringType(), False),
    StructField("value", StructType([
        StructField("string_type", StringType(),  True),
        StructField("int_type",    IntegerType(), True),
        StructField("float_type",  FloatType(),   True),
        StructField("double_type", DoubleType(),  True)
    ]))
]))

I know in advance the small set of possible key names and the data type of each. For example, first_name is always a string, birth_year is always an integer, etc. Not every attribute is present in every row, so every field of the predefined struct must be nullable, e.g.:

StructType([
    StructField("first_name",  StringType(),  True),
    StructField("middle_name", StringType(),  True),
    StructField("last_name",   StringType(),  True),
    StructField("birth_year",  IntegerType(), True),
    StructField("ssn",         IntegerType(), True),
    StructField("zipcode",     IntegerType(), True),
])

The incoming column values (one array per row) will look something like this:

[
    (key: "first_name", value: (string_type: "John")),
    (key: "ssn",        value: (int_type:    123456789)),
    (key: "last_name",  value: (string_type: "Doe")),
]
------------------------------------------------------
[
    (key: "ssn",        value: (int_type:    987654321)),
    (key: "last_name",  value: (string_type: "Jones")),
]
------------------------------------------------------
[
    (key: "zipcode",    value: (int_type:    13579)),
    (key: "first_name", value: (string_type: "Bob")),
    (key: "birth_year", value: (int_type:    1985)),
    (key: "last_name",  value: (string_type: "Smith")),
]

and I want them to become a column of the person struct like this:

{
    first_name: "John",
    last_name:  "Doe",
    ssn:        123456789
}
------------------------------------------------------
{
    last_name:  "Jones",
    ssn:        987654321
}
------------------------------------------------------
{
    first_name: "Bob",
    last_name:  "Smith",
    birth_year: 1985,
    zipcode:    13579
}

This is a playground example, but the real data will have several billion rows, so performance matters: the solution should avoid Python UDFs and use only functions from pyspark.sql.functions.

1 Answer

For each field of the target struct, F.filter (available in pyspark.sql.functions since Spark 3.1) can be used to extract the matching entry from the array:

from pyspark.sql import functions as F

df = ...  # input data: a DataFrame with the array column 'person'

# a list of all possible struct entries in the input data
cfgs = [
    ("first_name", "string_type"),
    ("middle_name", "string_type"),
    ("last_name", "string_type"),
    ("birth_year", "int_type"),
    ("ssn", "int_type"),
    ("zipcode", "int_type")
]

cols = [
    (F.filter(F.col('person'), lambda c: c['key'] == key)  # keep only the array entries with this key
      [0]           # take the first match (null if there is none, with ANSI mode off)
      ['value']     # take the value struct
      [field])      # take the matching field of the value struct
    .alias(key)     # name the column after the key
  for key, field in cfgs]  # one column per entry of the cfgs list

# combine the columns into a new struct
new_df = df.select(F.struct(cols).alias('person'))
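
The same lookup can probably also be written as a Spark SQL expression string, which may help on Spark versions older than 3.1, where F.filter is not available in Python but the SQL higher-order function filter is (since Spark 2.4). The sketch below only builds the expression text. build_person_expr is a hypothetical helper name, and it assumes that keys and field names contain no quote characters:

```python
# Hypothetical helper (not part of PySpark): builds a Spark SQL expression
# that mirrors the F.filter(...) lookup above.
# Assumption: keys and field names contain no quote characters.
cfgs = [
    ("first_name", "string_type"),
    ("middle_name", "string_type"),
    ("last_name", "string_type"),
    ("birth_year", "int_type"),
    ("ssn", "int_type"),
    ("zipcode", "int_type"),
]


def build_person_expr(cfgs, array_col="person"):
    parts = []
    for key, field in cfgs:
        # first array entry with this key, then the typed field of its value struct
        lookup = f"filter({array_col}, x -> x.key = '{key}')[0].value.{field}"
        # named_struct takes alternating name/value arguments
        parts.append(f"'{key}', {lookup}")
    return "named_struct(" + ", ".join(parts) + f") AS {array_col}"


person_expr = build_person_expr(cfgs)

# Usage (needs a DataFrame df with the array column 'person'):
# new_df = df.selectExpr(person_expr)
```

Both variants should compile to the same kind of native expression, so neither involves a Python UDF.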

Result:

+------------------------------------------+
|person                                    |
+------------------------------------------+
|{John, null, Doe, null, 123456789, null}  |
|{null, null, Jones, null, 987654321, null}|
|{Bob, null, Smith, 1985, null, 13579}     |
+------------------------------------------+

root
 |-- person: struct (nullable = false)
 |    |-- first_name: string (nullable = true)
 |    |-- middle_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |    |-- birth_year: long (nullable = true)
 |    |-- ssn: long (nullable = true)
 |    |-- zipcode: long (nullable = true)
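
The field types follow the types in the input value struct; here the integer fields came out as long. If the target struct has to use IntegerType as in the question, one option is to cast the finished struct to a DDL type string. A minimal sketch, with target_fields and person_ddl as hypothetical names:

```python
# Hypothetical: desired field types, taken from the target struct in the question.
target_fields = [
    ("first_name", "string"),
    ("middle_name", "string"),
    ("last_name", "string"),
    ("birth_year", "int"),
    ("ssn", "int"),
    ("zipcode", "int"),
]


def person_ddl(fields):
    """Build a DDL type string such as 'struct<a:string,b:int>'."""
    return "struct<" + ",".join(f"{name}:{dtype}" for name, dtype in fields) + ">"


ddl = person_ddl(target_fields)

# Usage (needs new_df from above and `from pyspark.sql import functions as F`):
# typed_df = new_df.select(F.col("person").cast(ddl).alias("person"))
```

Note that a long to int cast can overflow for values above 2147483647, so this is only safe if the data is known to fit.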