You can use concat_ws to convert each array column into a single comma-separated string:
>>> from pyspark.sql.functions import col, concat_ws
>>> data_df.show()
+---+---------+---------+---+---+---------------+
| _1| _2| _3| _4| _5| _6|
+---+---------+---------+---+---+---------------+
| 1|[a, b, c]|[c, d, e]| 10| 20| [a, b]|
| 2|[d, f, h]| [s, c]| 11| 21|[f, g, h, j, k]|
| 3|[a, f, g]|[r, t, y]| 12| 22| [g, h]|
+---+---------+---------+---+---+---------------+
>>> df2 = data_df.withColumn("_2", concat_ws(",", col("_2"))).withColumn("_3", concat_ws(",", col("_3"))).withColumn("_6", concat_ws(",", col("_6")))
>>> df2.show()
+---+-----+-----+---+---+---------+
| _1| _2| _3| _4| _5| _6|
+---+-----+-----+---+---+---------+
| 1|a,b,c|c,d,e| 10| 20| a,b|
| 2|d,f,h| s,c| 11| 21|f,g,h,j,k|
| 3|a,f,g|r,t,y| 12| 22| g,h|
+---+-----+-----+---+---+---------+
>>> df2.printSchema()
root
|-- _1: long (nullable = true)
|-- _2: string (nullable = false)
|-- _3: string (nullable = false)
|-- _4: long (nullable = true)
|-- _5: long (nullable = true)
|-- _6: string (nullable = false)
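Note that the converted columns come back as nullable = false: concat_ws never returns null (a null or empty array produces an empty string), so Spark marks the result as non-nullable.

If you have many array columns, you can derive the list from the schema instead of naming each column by hand. A minimal sketch against the same data_df as above; array_cols and df3 are just illustrative names:

>>> from pyspark.sql.types import ArrayType
>>> # pick out every column whose type is an array
>>> array_cols = [f.name for f in data_df.schema.fields if isinstance(f.dataType, ArrayType)]
>>> df3 = data_df
>>> for c in array_cols:
...     df3 = df3.withColumn(c, concat_ws(",", col(c)))
...
>>> df3.show()  # same output as df2 above

The first argument to concat_ws is the separator, so you can swap "," for any other delimiter you need.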