
I have a dataframe with a nested array field (events).

root
 |-- id: long (nullable = true)
 |-- events: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to flatten the data and get a dataframe with a schema similar to this:

root
 |-- id: long (nullable = true)
 |-- key: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- value: string (nullable = true)

example input:

+-----+-------------------------------------------------------+
|id   |             events                                    |
+-----+-------------------------------------------------------+
|  1  | [[john, 1547758879, 1], [bob, 1547759154, 1]]         |
|  2  | [[samantha, 1547758879, 1], [eric, 1547759154, 1]]    |
+-----+-------------------------------------------------------+

example output:

+-----+---------+----------+-----+
|id   |key      |timestamp |value|
+-----+---------+----------+-----+
|  1  |john     |1547758879|    1|
|  1  |bob      |1547759154|    1|
|  2  |samantha |1547758879|    1|
|  2  |eric     |1547759154|    1|
+-----+---------+----------+-----+
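
For reproducibility, a minimal sketch that builds this sample in PySpark (my addition; it assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the schema above
df = spark.createDataFrame(
    [(1, [("john", 1547758879, "1"), ("bob", 1547759154, "1")]),
     (2, [("samantha", 1547758879, "1"), ("eric", 1547759154, "1")])],
    "id bigint, events array<struct<key:string,timestamp:bigint,value:string>>")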
  • What have you tried so far? Commented Jul 24, 2019 at 2:11
  • I am a newbie so I am not sure. But I tried: df.select("id", df.events.value, fn.explode(df.events.key).alias("keys")).withColumn("values", fn.explode(df.events.value)).select("id", "keys", "values").show(truncate=False). But it raises an error. Commented Jul 24, 2019 at 2:55
  • Add your code and the full error message to your question; it is not readable in the comments. Commented Jul 24, 2019 at 3:17

3 Answers


You can use explode to split each element of the array into its own row, then select the individual fields of the struct.

import org.apache.spark.sql.functions.explode
import spark.implicits._  // for toDF and the $"col" syntax

case class Event(key: String, timestamp: Long, value: String)

val df = List((1, Seq(Event("john", 1547758879, "1"),
                      Event("bob", 1547759154, "1"))),
              (2, Seq(Event("samantha", 1547758879, "1"),
                      Event("eric", 1547759154, "1")))
             ).toDF("id", "events")

df.show(false)
/*
+---+--------------------------------------------------+
|id |events                                            |
+---+--------------------------------------------------+
|1  |[[john, 1547758879, 1], [bob, 1547759154, 1]]     |
|2  |[[samantha, 1547758879, 1], [eric, 1547759154, 1]]|
+---+--------------------------------------------------+
*/

val exploded = df.withColumn("events", explode($"events"))
exploded.show(false)
/*
+---+-------------------------+
|id |events                   |
+---+-------------------------+
|1  |[john, 1547758879, 1]    |
|1  |[bob, 1547759154, 1]     |
|2  |[samantha, 1547758879, 1]|
|2  |[eric, 1547759154, 1]    |
+---+-------------------------+
*/

val unstructured = exploded.select($"id", $"events.key", $"events.timestamp", $"events.value")
unstructured.show
/*
+---+--------+----------+-----+
| id|     key| timestamp|value|
+---+--------+----------+-----+
|  1|    john|1547758879|    1|
|  1|     bob|1547759154|    1|
|  2|samantha|1547758879|    1|
|  2|    eric|1547759154|    1|
+---+--------+----------+-----+
*/
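
If you need the same steps in PySpark (the question's comments suggest that is the goal), a minimal equivalent sketch, assuming df has the schema from the question:

from pyspark.sql import functions as fn

exploded = df.withColumn("events", fn.explode("events"))  # one row per array element
flat = exploded.select("id", "events.key", "events.timestamp", "events.value")
flat.show()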


from pyspark.sql import functions as fn

# explode each array element into its own row,
# then pull the individual struct fields out of it
df.select("id", fn.explode(df.events).alias("events")). \
    select("id", fn.col("events").getItem("key").alias("key"),
           fn.col("events").getItem("value").alias("value"),
           fn.col("events").getItem("timestamp").alias("timestamp"))
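
As an aside (my addition, not part of the original answer): since events is a struct column after the explode, the same result can be had more compactly by expanding all struct fields at once with .*:

df.select("id", fn.explode(df.events).alias("events")).select("id", "events.*")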



You can try the following approach:

  1. Add a count of how many elements are in each events row:
import pandas as pd

## recreate the dataframe sample
df = pd.DataFrame(
    [
        [1, [['john', 1547758879, 1], ['bob', 1547759154, 1]]],
        [2, [['samantha', 1547758879, 1], ['eric', 1547759154, 1]]]
    ], columns=['id', 'events']
)

df['elements'] = df['events'].apply(len)

Out[36]:
   id                                               events  elements
0   1       [[john, 1547758879, 1], [bob, 1547759154, 1]]          2
1   2  [[samantha, 1547758879, 1], [eric, 1547759154, 1]]          2
  2. Flatten the nested results into one list of lists:
values = df['events'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]

>> flat_results
Out[38]: 
[['john', 1547758879, 1],
 ['bob', 1547759154, 1],
 ['samantha', 1547758879, 1],
 ['eric', 1547759154, 1]]
  3. Create a new DataFrame from the flattened list:
new_df = pd.DataFrame(flat_results, columns=['key','timestamp','value'])
  4. Use the element count to repeat the id from the original source:
new_df['id'] = df['id'].repeat(df['elements'].values).values

>> new_df
Out[40]: 
        key   timestamp  value  id
0      john  1547758879      1   1
1       bob  1547759154      1   1
2  samantha  1547758879      1   2
3      eric  1547759154      1   2
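
As a side note (my addition, not from the original answer): pandas 0.25+ has DataFrame.explode, which collapses steps 1, 2 and 4 into a single call:

exploded = df.explode('events')                  # one row per inner list, id repeated
new_df = pd.DataFrame(exploded['events'].tolist(),
                      columns=['key', 'timestamp', 'value'])
new_df['id'] = exploded['id'].values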

2 Comments

Thanks, but this is in pyspark; I am not sure how to do that without transforming to a pandas dataframe.
Oh I took it as entirely a pandas problem. My apologies @user3192082
