I want to reorganise the following JSON so that the array elements under docs are promoted to the root.
Example input
{
"response":{"docs":
[{
"column1":"dataA",
"column2":"dataB"
},
{
"column1":"dataC",
"column2":"dataD"
}]
}
}
Example PySpark script
from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pyspark")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# multiLine=True is required because each JSON object spans several lines
df = sqlContext.read.json("file:///.../input.json", multiLine=True)

# pull the nested docs array up one level
new = df.select("response.docs")
new.printSchema()
new.write.mode("overwrite").format("json").save("file:///.../output.json")
This script already produces the following schema:
root
|-- docs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column1: string (nullable = true)
| | |-- column2: string (nullable = true)
However, the final JSON should look like this:
[
{"column1":"dataA","column2":"dataB"},
{"column1":"dataC","column2":"dataD"}
]
How can this be done using Spark?