I have a dataframe like this:

df.show()

+-----+ 
|col1 | 
+-----+ 
|[a,b]| 
|[c,d]|   
+-----+ 

How can I convert it into a dataframe like the one below?

+----+----+ 
|col1|col2| 
+----+----+ 
|   a|   b| 
|   c|   d|  
+----+----+ 
  • Will the list be of fixed length? Commented Oct 20, 2019 at 4:17

1 Answer

It depends on the type of your "list":

If it is of type ArrayType():

df = spark.createDataFrame(spark.sparkContext.parallelize([['a', ["a","b","c"]], ['b', ["d","e","f"]]]), ["key", "col"])
df.printSchema()
df.show()
root
 |-- key: string (nullable = true)
 |-- col: array (nullable = true)
 |    |-- element: string (containsNull = true)
+---+---------+
|key|      col|
+---+---------+
|  a|[a, b, c]|
|  b|[d, e, f]|
+---+---------+
You can access the elements by index using [], as you would with a Python list:
df.select("key", df.col[0], df.col[1], df.col[2]).show()
+---+------+------+------+
|key|col[0]|col[1]|col[2]|
+---+------+------+------+
|  a|     a|     b|     c|
|  b|     d|     e|     f|
+---+------+------+------+
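If it is of type StructType() (for example, because you built your dataframe by reading JSON):
from pyspark.sql import functions as F

# Build a struct column from the array elements to illustrate the StructType case.
df2 = df.select("key", F.struct(
        df.col[0].alias("col1"),
        df.col[1].alias("col2"),
        df.col[2].alias("col3")
    ).alias("col"))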
df2.printSchema()
df2.show()

root
 |-- key: string (nullable = true)
 |-- col: struct (nullable = false)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)
 |    |-- col3: string (nullable = true)
+---+---------+
|key|      col|
+---+---------+
|  a|[a, b, c]|
|  b|[d, e, f]|
+---+---------+
You can expand the struct's fields directly into top-level columns using *:
df2.select('key', 'col.*').show()

+---+----+----+----+
|key|col1|col2|col3|
+---+----+----+----+
|  a|   a|   b|   c|
|  b|   d|   e|   f|
+---+----+----+----+
2 Comments

Thank you, it was indeed built from JSON and it was an ArrayType. And the first part of your answer helped. The StructType in your answer though, wouldn't the contents of 'col' column be like [(col1=a),(col2:b),(col3:c)] ?
@Gadam I'm creating it from the existing dataframe. As you can see above, that is how I access those elements.
