
Is there a way to flatten a column that contains an array of arrays without using UDFs (in the case of DataFrames)?

For example:

+---------------------------------------------------------------------------------------------------------+
|vector                                                                                                   |
+---------------------------------------------------------------------------------------------------------+
|[[106.0,1006.0,26.0], [107.0,1007.0,27.0], [108.0,1008.0,28.0]]                                          |
|[[100.0,1000.0,20.0]]                                                                                    |
|[[101.0,1001.0,21.0], [102.0,1002.0,22.0], [103.0,1003.0,23.0], [104.0,1004.0,24.0], [105.0,1005.0,25.0]]|
+---------------------------------------------------------------------------------------------------------+

should be converted to

+---------------------------------------------------------------------------------------------------------+
|vector                                                                                                   |
+---------------------------------------------------------------------------------------------------------+
|[106.0,1006.0,26.0,107.0,1007.0,27.0,108.0,1008.0,28.0]                                                  |
|[100.0,1000.0,20.0]                                                                                      |
|[101.0,1001.0,21.0,102.0,1002.0,22.0,103.0,1003.0,23.0,104.0,1004.0,24.0,105.0,1005.0,25.0]              |
+---------------------------------------------------------------------------------------------------------+

2 Answers


You can use the built-in flatten function described in the official documentation. It was introduced in Spark 2.4. Check out this equivalent question:

Is there a Spark built-in that flattens nested arrays?
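
For reference, a minimal sketch with the built-in function (assuming Spark 2.4+ and the vector column from the question):

from pyspark.sql.functions import flatten

# flatten concatenates the inner arrays of an array<array<double>> column
df.select(flatten("vector").alias("vector")).show(truncate=False)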




Here is a way to do it with the RDD API:

from functools import reduce  # reduce is not a builtin in Python 3
from operator import add

# sqlcx is an existing SQLContext/SparkSession
df = sqlcx.createDataFrame(
    [
        ("A", [[106.0, 1006.0, 26.0], [107.0, 1007.0, 27.0], [108.0, 1008.0, 28.0]])
    ],
    ("Col1", "Col2")
)

# concatenate the inner lists of Col2 by folding them with list addition
df.rdd.map(lambda row: (row['Col1'], reduce(add, row['Col2'])))\
    .toDF(['Col1', 'Col2'])\
    .show(truncate=False)
#+----+---------------------------------------------------------------+
#|Col1|Col2                                                           |
#+----+---------------------------------------------------------------+
#|A   |[106.0, 1006.0, 26.0, 107.0, 1007.0, 27.0, 108.0, 1008.0, 28.0]|
#+----+---------------------------------------------------------------+

However, serializing to an RDD and back is costly in terms of performance. I would personally recommend using a UDF to accomplish this. As far as I know, there isn't a way to do this using only Spark DataFrame functions (without a UDF).
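
For reference, a minimal sketch of the UDF approach this answer recommends (assuming the inner elements are doubles; the flatten_list name is just illustrative):

from itertools import chain

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# chain.from_iterable concatenates the inner lists into one flat list
flatten_list = udf(lambda arrs: list(chain.from_iterable(arrs)), ArrayType(DoubleType()))

df.select("Col1", flatten_list("Col2").alias("Col2")).show(truncate=False)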

2 Comments

What if two or more rows have the same value in the Col1 column?
@RameshMaharjan I updated the code so that it doesn't use explode. This works better.
