I have a dataframe with a nested array field (events).
|-- id: long (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- timestamp: long (nullable = true)
| | |-- value: string (nullable = true)
I want to flatten the data and get a dataframe with a schema similar to this:
-- id: long (nullable = true)
-- key: string (nullable = true)
-- timestamp: long (nullable = true)
-- value: string (nullable = true)
example input:
+-----+-------------------------------------------------------+
|id | events |
+-----+-------------------------------------------------------+
| 1 | [[john , 1547758879, 1], [bob, 1547759154, 1]] |
| 2 | [[samantha , 1547758879, 1], [eric, 1547759154, 1]] |
+-----+-------------------------------------------------------+
example output:
+-----+---------+----------+-----+
|id |key |timestamp |value|
+-----+---------+----------+-----+
| 1 |john |1547758879| 1|
| 1 |bob |1547759154| 1|
| 2 |samantha |1547758879| 1|
| 2 |eric |1547759154| 1|
+-----+---------+----------+-----+
I tried the following, but it raises an error:
df.select("id", df.events.value, fn.explode(df.events.key).alias("keys")).\
    withColumn("values", fn.explode(df.events.value)).\
    select("id", "keys", "values").show(truncate=False)