Split large array columns into multiple columns - Pyspark

Question

I have:

+---+-------+-------+
| id|   var1|   var2|
+---+-------+-------+
|  a|[1,2,3]|[1,2,3]|
|  b|[2,3,4]|[2,3,4]|
+---+-------+-------+

I want:

+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
|  a|      1|      2|      3|      1|      2|      3|
|  b|      2|      3|      4|      2|      3|      4|
+---+-------+-------+-------+-------+-------+-------+

The solution from "How to split a list to multiple columns in Pyspark?",

df1.select('id', df1.var1[0], df1.var1[1], ...).show()

works, but some of my arrays are very long (up to 332 elements).

How can I write this so that it handles arrays of any length?

In the link you shared, there is the line df2.select('key', 'value.*').show(); does that help?
– Cleb, Commented Aug 2, 2018 at 7:54
Unfortunately that's for StructType variables, and I'd like to keep this as an array.
– Microsim, Commented Aug 2, 2018 at 8:01

Pierre Gourseaud · Accepted Answer · 2018-08-02 09:44:10Z

7

This solution works regardless of the number of array columns or the size of the arrays. Moreover, if the arrays in a column have different lengths (e.g. [1,2] and [3,4,5]), the result has as many columns as the longest array, with null values filling the gaps.

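from pyspark.sql import functions as F

# `spark` and `sc` are the session and context predefined in the pyspark shell.
df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])

# Every column except the key is treated as an array column.
columns = df.drop('id').columns
# Length of each array, then the maximum length per column.
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()

# One output column per (array column, index) pair.
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
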
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
|  a|      1|      2|      3|      1|      2|      3|
|  b|      2|      3|      4|      2|      3|      4|
+---+-------+-------+-------+-------+-------+-------+

edited Aug 2, 2018 at 9:44

answered Aug 2, 2018 at 9:18


1 Comment

caring-goat-913 Over a year ago

FYI, you can use a generator expression instead of building a list in memory with a list comprehension, e.g. df.select(*(F.size(col).alias(col) for col in columns))
