
Is there a way to flatten a column that contains an array of arrays without using UDFs (in the case of DataFrames)?

For example:

+---------------------------------------------------------------------------------------------------------+
|vector                                                                                                   |
+---------------------------------------------------------------------------------------------------------+
|[[106.0,1006.0,26.0], [107.0,1007.0,27.0], [108.0,1008.0,28.0]]                                          |
|[[100.0,1000.0,20.0]]                                                                                    |
|[[101.0,1001.0,21.0], [102.0,1002.0,22.0], [103.0,1003.0,23.0], [104.0,1004.0,24.0], [105.0,1005.0,25.0]]|
+---------------------------------------------------------------------------------------------------------+

should be converted to

+---------------------------------------------------------------------------------------------------------+
|vector                                                                                                   |
+---------------------------------------------------------------------------------------------------------+
|[106.0,1006.0,26.0,107.0,1007.0,27.0,108.0,1008.0,28.0]                                                  |
|[100.0,1000.0,20.0]                                                                                      |
|[101.0,1001.0,21.0,102.0,1002.0,22.0,103.0,1003.0,23.0,104.0,1004.0,24.0,105.0,1005.0,25.0]              |
+---------------------------------------------------------------------------------------------------------+

2 Answers


You can use the built-in flatten function described in the official documentation. It was introduced in Spark 2.4. Check out this equivalent question:

Is there a Spark built-in that flattens nested arrays?
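
For reference, a minimal sketch with the built-in function (assuming Spark 2.4+ and the vector column from the question):

from pyspark.sql.functions import flatten

# flatten concatenates the inner arrays of an array<array<double>> column
df.select(flatten("vector").alias("vector")).show(truncate=False)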




Here is a way to do it with the RDD API:

from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

# sqlcx is an existing SQLContext/SparkSession
df = sqlcx.createDataFrame(
    [
        ("A", [[106.0, 1006.0, 26.0], [107.0, 1007.0, 27.0], [108.0, 1008.0, 28.0]])
    ],
    ("Col1", "Col2")
)

# concatenate the inner lists of Col2 by folding them with list addition
df.rdd.map(lambda row: (row['Col1'], reduce(add, row['Col2'])))\
    .toDF(['Col1', 'Col2'])\
    .show(truncate=False)
#+----+---------------------------------------------------------------+
#|Col1|Col2                                                           |
#+----+---------------------------------------------------------------+
#|A   |[106.0, 1006.0, 26.0, 107.0, 1007.0, 27.0, 108.0, 1008.0, 28.0]|
#+----+---------------------------------------------------------------+

However, serializing to an RDD and back is costly in terms of performance. I would personally recommend using a UDF to accomplish this. As far as I know, there isn't a way to do this using only Spark DataFrame functions (without a UDF).
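
For reference, a minimal sketch of the UDF approach this answer recommends (assuming the inner elements are doubles; the flatten_list name is just illustrative):

from itertools import chain

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# chain.from_iterable concatenates the inner lists into one flat list
flatten_list = udf(lambda arrs: list(chain.from_iterable(arrs)), ArrayType(DoubleType()))

df.select("Col1", flatten_list("Col2").alias("Col2")).show(truncate=False)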

2 Comments

What if two or more rows have the same value in the Col1 column?
@RameshMaharjan I updated the code so that it doesn't use explode. This works better.
