
I have a dataframe with a nested array field (events).

root
 |-- id: long (nullable = true)
 |-- events: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- value: string (nullable = true)

I want to flatten the data and get a dataframe with a schema similar to this:

root
 |-- id: long (nullable = true)
 |-- key: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- value: string (nullable = true)

example input:

+-----+-------------------------------------------------------+
|id   |             events                                    |
+-----+-------------------------------------------------------+
|  1  | [[john, 1547758879, 1], [bob, 1547759154, 1]]         |
|  2  | [[samantha, 1547758879, 1], [eric, 1547759154, 1]]    |
+-----+-------------------------------------------------------+

example output:

+-----+---------+----------+-----+
|id   |key      |timestamp |value|
+-----+---------+----------+-----+
|  1  |john     |1547758879|    1|
|  1  |bob      |1547759154|    1|
|  2  |samantha |1547758879|    1|
|  2  |eric     |1547759154|    1|
+-----+---------+----------+-----+
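
For reproducibility, a minimal sketch that builds this sample in PySpark (my addition; it assumes an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the schema above
df = spark.createDataFrame(
    [(1, [("john", 1547758879, "1"), ("bob", 1547759154, "1")]),
     (2, [("samantha", 1547758879, "1"), ("eric", 1547759154, "1")])],
    "id bigint, events array<struct<key:string,timestamp:bigint,value:string>>")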
  • What have you tried so far? Commented Jul 24, 2019 at 2:11
  • I am a newbie so I am not sure. But I tried: df.select("id", df.events.value, fn.explode(df.events.key).alias("keys")).withColumn("values", fn.explode(df.events.value)).select("id", "keys", "values").show(truncate=False). But it raises an error. Commented Jul 24, 2019 at 2:55
  • Add your code and the full error message to your question; it is not readable in the comments. Commented Jul 24, 2019 at 3:17

3 Answers


You can use explode to split each element of the array into its own row, then select the individual fields of the struct.

import org.apache.spark.sql.functions.explode
import spark.implicits._  // for toDF and the $"col" syntax

case class Event(key: String, timestamp: Long, value: String)

val df = List((1, Seq(Event("john", 1547758879, "1"),
                      Event("bob", 1547759154, "1"))),
              (2, Seq(Event("samantha", 1547758879, "1"),
                      Event("eric", 1547759154, "1")))
             ).toDF("id", "events")

df.show(false)
/*
+---+--------------------------------------------------+
|id |events                                            |
+---+--------------------------------------------------+
|1  |[[john, 1547758879, 1], [bob, 1547759154, 1]]     |
|2  |[[samantha, 1547758879, 1], [eric, 1547759154, 1]]|
+---+--------------------------------------------------+
*/

val exploded = df.withColumn("events", explode($"events"))
exploded.show(false)
/*
+---+-------------------------+
|id |events                   |
+---+-------------------------+
|1  |[john, 1547758879, 1]    |
|1  |[bob, 1547759154, 1]     |
|2  |[samantha, 1547758879, 1]|
|2  |[eric, 1547759154, 1]    |
+---+-------------------------+
*/

val unstructured = exploded.select($"id", $"events.key", $"events.timestamp", $"events.value")
unstructured.show
/*
+---+--------+----------+-----+
| id|     key| timestamp|value|
+---+--------+----------+-----+
|  1|    john|1547758879|    1|
|  1|     bob|1547759154|    1|
|  2|samantha|1547758879|    1|
|  2|    eric|1547759154|    1|
+---+--------+----------+-----+
*/
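
If you need the same steps in PySpark (the question's comments suggest that is the goal), a minimal equivalent sketch, assuming df has the schema from the question:

from pyspark.sql import functions as fn

exploded = df.withColumn("events", fn.explode("events"))  # one row per array element
flat = exploded.select("id", "events.key", "events.timestamp", "events.value")
flat.show()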


from pyspark.sql import functions as fn

# explode each array element into its own row,
# then pull the individual struct fields out of it
df.select("id", fn.explode(df.events).alias("events")). \
    select("id", fn.col("events").getItem("key").alias("key"),
           fn.col("events").getItem("value").alias("value"),
           fn.col("events").getItem("timestamp").alias("timestamp"))
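
As an aside (my addition, not part of the original answer): since events is a struct column after the explode, the same result can be had more compactly by expanding all struct fields at once with .*:

df.select("id", fn.explode(df.events).alias("events")).select("id", "events.*")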



You can try the following approach:

  1. Add a count of how many elements are in each events row:
import pandas as pd

## recreate the dataframe sample
df = pd.DataFrame(
    [
        [1, [['john', 1547758879, 1], ['bob', 1547759154, 1]]],
        [2, [['samantha', 1547758879, 1], ['eric', 1547759154, 1]]]
    ], columns=['id', 'events']
)

df['elements'] = df['events'].apply(len)

Out[36]:
   id                                               events  elements
0   1       [[john, 1547758879, 1], [bob, 1547759154, 1]]          2
1   2  [[samantha, 1547758879, 1], [eric, 1547759154, 1]]          2
  2. Flatten the nested results into one list of lists:
values = df['events'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]

>> flat_results
Out[38]: 
[['john', 1547758879, 1],
 ['bob', 1547759154, 1],
 ['samantha', 1547758879, 1],
 ['eric', 1547759154, 1]]
  3. Create a new DataFrame from the flattened list:
new_df = pd.DataFrame(flat_results, columns=['key','timestamp','value'])
  4. Use the element count to repeat the id from the original source:
new_df['id'] = df['id'].repeat(df['elements'].values).values

>> new_df
Out[40]: 
        key   timestamp  value  id
0      john  1547758879      1   1
1       bob  1547759154      1   1
2  samantha  1547758879      1   2
3      eric  1547759154      1   2
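
As a side note (my addition, not from the original answer): pandas 0.25+ has DataFrame.explode, which collapses steps 1, 2 and 4 into a single call:

exploded = df.explode('events')                  # one row per inner list, id repeated
new_df = pd.DataFrame(exploded['events'].tolist(),
                      columns=['key', 'timestamp', 'value'])
new_df['id'] = exploded['id'].values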

2 Comments

Thanks, but this is in pyspark; I am not sure how to do that without transforming to a pandas dataframe.
Oh I took it as entirely a pandas problem. My apologies @user3192082
