I have a column that is an arbitrary-length array of key/value structs:
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, FloatType, DoubleType
)

StructType([
    StructField("key", StringType(), False),
    StructField("value", StructType([
        StructField("string_value", StringType(), True),
        StructField("int_value", IntegerType(), True),
        StructField("float_value", FloatType(), True),
        StructField("double_value", DoubleType(), True)
    ]), True)
])
There are only a few possible key names, and I know the data type of each: for example, first_name is always a string, birth_year is always an integer, and so on. Not every attribute is present in every row, so the predefined target struct has to make every field nullable, e.g.:
StructType([
    StructField("first_name", StringType(), True),
    StructField("middle_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("birth_year", IntegerType(), True),
    StructField("ssn", IntegerType(), True),
    StructField("zipcode", IntegerType(), True),
])
The incoming column values will look something like this:
[
    (key: "first_name", value: (string_value: "John")),
    (key: "ssn", value: (int_value: 123456789)),
    (key: "last_name", value: (string_value: "Doe")),
]
------------------------------------------------------
[
    (key: "ssn", value: (int_value: 987654321)),
    (key: "last_name", value: (string_value: "Jones")),
]
------------------------------------------------------
[
    (key: "zipcode", value: (int_value: 13579)),
    (key: "first_name", value: (string_value: "Bob")),
    (key: "birth_year", value: (int_value: 1985)),
    (key: "last_name", value: (string_value: "Smith")),
]
and I want them to become a column of the person struct like this:
{
    first_name: "John",
    last_name: "Doe",
    ssn: 123456789
}
------------------------------------------------------
{
    last_name: "Jones",
    ssn: 987654321
}
------------------------------------------------------
{
    first_name: "Bob",
    last_name: "Smith",
    birth_year: 1985,
    zipcode: 13579
}
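For reference, the three sample inputs can be reconstructed as a small test DataFrame like this (the attributes column name and the Spark session setup are placeholders of my own, not from the real pipeline):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# DDL schema for the array-of-key/value-structs column; "attributes" is a
# placeholder name for the real column.
schema = (
    "attributes array<struct<key:string,"
    "value:struct<string_value:string,int_value:int,"
    "float_value:float,double_value:double>>>"
)

# Each array element is (key, (string_value, int_value, float_value, double_value)).
rows = [
    ([("first_name", ("John", None, None, None)),
      ("ssn", (None, 123456789, None, None)),
      ("last_name", ("Doe", None, None, None))],),
    ([("ssn", (None, 987654321, None, None)),
      ("last_name", ("Jones", None, None, None))],),
    ([("zipcode", (None, 13579, None, None)),
      ("first_name", ("Bob", None, None, None)),
      ("birth_year", (None, 1985, None, None)),
      ("last_name", ("Smith", None, None, None))],),
]

df = spark.createDataFrame(rows, schema)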
This is a playground example, but the real data has several billion rows, so performance matters: the solution should not use Python UDFs, only functions from pyspark.sql.functions.
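For concreteness, the closest thing I can sketch with pyspark.sql.functions alone is something like the following (untested; it assumes Spark 2.4+ for map_from_entries and uses the placeholder attributes column and df from the snippet above). I don't know whether this is a reasonable approach at this scale or whether there is something better:

from pyspark.sql import functions as F

# Collapse the key/value array into a map keyed by `key`, then pull each known
# attribute out of the matching typed value field. Keys absent from a row come
# back as null, which matches the all-nullable person struct.
kv = F.map_from_entries(F.col("attributes"))

person = F.struct(
    kv["first_name"]["string_value"].alias("first_name"),
    kv["middle_name"]["string_value"].alias("middle_name"),
    kv["last_name"]["string_value"].alias("last_name"),
    kv["birth_year"]["int_value"].alias("birth_year"),
    kv["ssn"]["int_value"].alias("ssn"),
    kv["zipcode"]["int_value"].alias("zipcode"),
)

df.withColumn("person", person).select("person").show(truncate=False)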