Split Array of Strings in a DataFrame into their own columns

Question

I have a dataframe like this:

df.show()

+-----+ 
|col1 | 
+-----+ 
|[a,b]| 
|[c,d]|   
+-----+

How can I convert it into a dataframe like the one below?

+----+----+ 
|col1|col2| 
+----+----+ 
|   a|   b| 
|   c|   d|  
+----+----+

Will the list be of fixed length?

– pissall, Commented Oct 20, 2019 at 4:17

pissall · Accepted Answer · 2019-10-20 04:26:33Z

It depends on the type of your "list":

If it is of type ArrayType():

# Assumes an active SparkSession named `spark` (as in the pyspark shell)
df = spark.createDataFrame(
    spark.sparkContext.parallelize([['a', ["a", "b", "c"]], ['b', ["d", "e", "f"]]]),
    ["key", "col"],
)
df.printSchema()
df.show()

root
 |-- key: string (nullable = true)
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)
+---+---------+
|key|      col|
+---+---------+
|  a|[a, b, c]|
|  b|[d, e, f]|
+---+---------+

You can access the values by index with [], as you would in Python:

df.select("key", df.col[0], df.col[1], df.col[2]).show()

+---+------+------+------+
|key|col[0]|col[1]|col[2]|
+---+------+------+------+
|  a|     a|     b|     c|
|  b|     d|     e|     f|
+---+------+------+------+
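When the array has a known, fixed length, the index expressions can be generated instead of typed out. The following is a minimal sketch, not a definitive implementation: `array_item_exprs` and `split_array_column` are hypothetical helper names, and the sketch builds SQL expression strings for `DataFrame.selectExpr` rather than using the attribute access shown above.

```python
def array_item_exprs(array_col, n, prefix="col"):
    """Build SQL expressions such as 'col1[0] AS col1' for the first n array items."""
    return [f"{array_col}[{i}] AS {prefix}{i + 1}" for i in range(n)]


def split_array_column(df, array_col, n, prefix="col"):
    """Select one column per array position 0 .. n-1 from an array column.

    `df` is expected to be a Spark DataFrame; `n` is the assumed fixed length.
    """
    return df.selectExpr(*array_item_exprs(array_col, n, prefix))


# Usage with the dataframe from the question (needs an active SparkSession):
# split_array_column(df, "col1", 2).show()
```

For the dataframe in the question this should give the requested `col1` and `col2` layout. Note that the helper selects only the generated columns, so a column like `key` would have to be added to the expression list explicitly.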

If it is of type StructType() (for example, because you built your dataframe by reading JSON):

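from pyspark.sql import functions as F

# Pack the three array elements into a single struct column named "col"
df2 = df.select("key", F.struct(
        df.col[0].alias("col1"),
        df.col[1].alias("col2"),
        df.col[2].alias("col3")
    ).alias("col"))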
df2.printSchema()
df2.show()

root
 |-- key: string (nullable = true)
 |-- col: struct (nullable = false)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)
 |    |-- col3: string (nullable = true)
+---+---------+
|key|      col|
+---+---------+
|  a|[a, b, c]|
|  b|[d, e, f]|
+---+---------+

You can expand the struct's fields directly into columns using *:

df2.select('key', 'col.*').show()

+---+----+----+----+
|key|col1|col2|col3|
+---+----+----+----+
|  a|   a|   b|   c|
|  b|   d|   e|   f|
+---+----+----+----+


2 Comments

Gadam Over a year ago

Thank you, it was indeed built from JSON and it was an ArrayType. And the first part of your answer helped. The StructType in your answer though, wouldn't the contents of 'col' column be like [(col1=a),(col2:b),(col3:c)] ?

pissall Over a year ago

@Gadam I’m creating it out of the existing data frame. As you can see, that’s how I’m accessing those elements above.
