Flatten array of arrays (different dimensions) of a sql.dataframe.DataFrame in pyspark

Question

I have a pyspark.sql.dataframe.DataFrame which is something like this:

+---------------------------+--------------------+--------------------+
|collect_list(results)      |        userid      |         page       |
+---------------------------+--------------------+--------------------+
|       [[[roundtrip, fal...|13482f06-9185-47f...|1429d15b-91d0-44b...|
+---------------------------+--------------------+--------------------+

Inside the collect_list(results) column there is an array with len = 2, and the elements are also arrays (the first one has a len = 1, and the second one a len = 9).

Is there a way to flatten this array of arrays into a single array with len = 10 using pyspark?

Thanks!

Perhaps it is easier to rework this by changing how you arrived at this DataFrame. Can you show us?
– Oliver W., Commented Dec 9, 2019 at 19:52
@OliverW. The query is pretty simple: query1 = spark.sql(""" select collect_list(results), userid, page from table group by 2,3 """)
– Fede Blanco, Commented Dec 9, 2019 at 20:21

Seb · Accepted Answer · 2019-12-09 19:55:03Z

You can flatten an array of arrays using pyspark.sql.functions.flatten, which is available from Spark 2.4 onward (see the PySpark documentation). For example, assuming your DataFrame variable is called df, this creates a new column called results containing the flattened arrays:

import pyspark.sql.functions as F
...
# flatten removes one level of nesting from the array column (Spark 2.4+)
df = df.withColumn('results', F.flatten('collect_list(results)'))

I've just realized I have version 2.3.1, and according to the documentation "flatten" was added in version 2.4. Thanks for your answer!
– Fede Blanco

Oliver W. · Accepted Answer · 2019-12-09 21:03:12Z

For a version that works before Spark 2.4 (but not before 1.3), you could explode the array column before grouping, which removes one level of nesting, and then call groupBy and collect_list. Like this:

from pyspark.sql.functions import collect_list, explode

df = spark.createDataFrame([("foo", [1,]), ("foo", [2, 3])], schema=("foo", "bar"))
df.show()
# +---+------+
# |foo|   bar|
# +---+------+
# |foo|   [1]|
# |foo|[2, 3]|
# +---+------+
(df.select(
    df.foo,
    explode(df.bar))
 .groupBy("foo")
 .agg(collect_list("col"))
 .show())
# +---+-----------------+
# |foo|collect_list(col)|
# +---+-----------------+
# |foo|        [1, 2, 3]|
# +---+-----------------+
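
If rerunning the aggregation is not convenient, another option on Spark versions before 2.4 is a Python UDF that flattens the already collected column. This is only a sketch and not part of the answer above: flatten_one_level and add_flattened_column are hypothetical helper names, the choice to return None when any inner array is None is an assumption, and a Python UDF will generally be slower than a built-in function.

```python
from itertools import chain


def flatten_one_level(nested):
    """Flatten a list of lists by one level.

    Returns None if the outer list or any inner list is None
    (an assumption made for this sketch).
    """
    if nested is None or any(inner is None for inner in nested):
        return None
    return list(chain.from_iterable(nested))


def add_flattened_column(df, src, dst):
    """Hypothetical helper: add column `dst` holding the flattened `src` array."""
    # Imported here so flatten_one_level can be tried without a Spark install.
    from pyspark.sql.functions import udf

    # `src` has type array<array<T>>, so its element type array<T>
    # is also the type of the flattened result.
    flattened_type = df.schema[src].dataType.elementType
    flatten_udf = udf(flatten_one_level, flattened_type)
    return df.withColumn(dst, flatten_udf(df[src]))
```

With the DataFrame from the question, the call would look like df = add_flattened_column(df, 'collect_list(results)', 'results').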
