
Given a DataFrame with an array of structs:

Schema:

root
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- quantity: string (nullable = true)

+-------------------------------+
|items                          |
+-------------------------------+
|[[A, 1], [B, 1], [C, 2]]       |
+-------------------------------+

How do I get a string like this:

+-------------------------------+
|items                          |
+-------------------------------+
|A, 1, B, 1, C, 2               |
+-------------------------------+

I tried:

from pyspark.sql.functions import col, concat_ws

df.withColumn('item_str', concat_ws(" ", col("items"))).select("item_str").show(truncate=False)

Error:

: org.apache.spark.sql.AnalysisException: cannot resolve 'concat_ws(' ', `items`)' due to data type mismatch: argument 2 requires (array<string> or string) type, however, '`items`' is of array<struct<name:string,quantity:string>> type.;;
  • The error tells you that you must first transform the items array into an array<string> and then call concat on it (see the sketch just after these comments). Commented Mar 10, 2020 at 8:30
  • How can I convert the sub-element (quantity) to a string? Commented Mar 10, 2020 at 8:33
  • Try using pyspark.sql.functions.flatten. Commented Mar 10, 2020 at 9:37
  • @Bitswazsky I tried flatten: df.withColumn("items_flat", flatten("items")).show(False) and got the error: The argument should be an array of arrays, but 'items' is of array<struct<name:string,quantity:string>> type.;; Commented Mar 10, 2020 at 10:38
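
A minimal sketch of what the first comment suggests, assuming Spark 2.4+ (so transform is available) and the column names from the question's schema: first turn the struct array into an array<string>, which concat_ws does accept.

from pyspark.sql.functions import concat_ws, expr

df.withColumn(
    "item_str",
    # transform maps array<struct<name,quantity>> to array<string>, which concat_ws can join
    concat_ws(", ", expr("transform(items, i -> concat_ws(', ', i.name, i.quantity))")),
).select("item_str").show(truncate=False)
# expected output for the sample row: A, 1, B, 1, C, 2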

2 Answers


You can achieve that using a combination of the transform and array_join built-in functions:

from pyspark.sql.functions import expr

df.withColumn("items", expr("array_join(transform(items, \
                                i -> concat_ws(',', i.name, i.quantity)), ',')"))

We use transform to iterate over the items and turn each struct into a name,quantity string. Then we use array_join to concatenate all the strings returned by transform, separated by commas.
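
For reference, a self-contained run of this approach (the sample row and schema are taken from the question) should print the output shown in the trailing comments:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Rebuild the question's DataFrame: one row with an array of (name, quantity) structs
df = spark.createDataFrame(
    [([("A", "1"), ("B", "1"), ("C", "2")],)],
    "items array<struct<name:string,quantity:string>>",
)

df.withColumn(
    "items",
    expr("array_join(transform(items, i -> concat_ws(',', i.name, i.quantity)), ',')"),
).show(truncate=False)
# +-----------+
# |items      |
# +-----------+
# |A,1,B,1,C,2|
# +-----------+

If you want the exact spacing shown in the question (A, 1, B, 1, C, 2), use ', ' as the separator in both concat_ws and array_join.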


Explode might be useful here:

import org.apache.spark.sql.functions._
df.select(explode(col("items"))).select("col.*")
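
Note that explode on its own gives one row per struct (with name and quantity columns), not the single string the question asks for. A rough PySpark sketch of getting back to one string per original row; the row_id column is an assumption added here purely so there is something to group back on:

from pyspark.sql.functions import (
    col, collect_list, concat_ws, explode, monotonically_increasing_id
)

exploded = (
    df.withColumn("row_id", monotonically_increasing_id())  # assumed surrogate key per row
      .withColumn("item", explode(col("items")))
)

result = (
    exploded.groupBy("row_id")
    .agg(
        concat_ws(
            ", ",
            collect_list(concat_ws(", ", col("item.name"), col("item.quantity"))),
        ).alias("items")
    )
)

result.select("items").show(truncate=False)
# Note: collect_list does not guarantee element order after the shuffle.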
