
I want to create a DataFrame from a complex JSON string using Spark with Scala.

Spark version is 3.1.2. Scala version is 2.12.14.

The source data is as shown below:

{
  "info": [
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "fields": true,
        "stat": "not sure"
      }
    },
    {
      "done": "time",
      "id": 14,
      "type": "normal",
      "pid": 764310,
      "add": {
        "fields": true,
        "stat": "sure"
      }
    },
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "note": {
          "id": 922,
          "score": 0
        }
      }
    }
  ],
  "more": {
    "a": "ok",
    "b": "fine",
    "c": 3
  }
}

I have tried the following, but it is not working.

val schema = new StructType().add("info", ArrayType(StringType)).add("more", StringType)

val rdd = ss.sparkContext.parallelize(Seq(Row(data))) // data is the JSON string shown above

val df = ss.createDataFrame(rdd, schema)

df.printSchema()

The schema gets printed as below:

root
 |-- info: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- more: string (nullable = true)

print(df.head())

The above line throws an exception: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array<string>
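For context on the error: the schema declares info as array<string>, so each Row must carry a Seq of strings in that slot, while Row(data) puts the entire document into it as a single String. A Row that would actually satisfy that schema looks like this minimal sketch (with made-up values):

val row = Row(Seq("""{"id": 9}""", """{"id": 14}"""), """{"a": "ok"}""") // Seq[String] for array<string>, plain String for more

That encodes without error, but it still leaves the JSON unparsed, which is what the answers below address.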

Please help me to do this.


2 Answers


If the data resides in files on HDFS/S3 etc., you can easily read them using the spark.read.json function.

Something like this should work on HDFS:

val df = spark.read.option("multiline","true").json("hdfs:///home/vikas/sample/*.json")

On S3 it would be:

val df = spark.read.option("multiline","true").json("s3a://vikas/sample/*.json")

Please ensure that you have read access to the path.

As mentioned in your comment, you are reading data from an API. In that case, the following should work for Spark 2.2 and above:

import spark.implicits._
val jsonStr = """{ "metadata": { "key": 84896, "value": 54 }}"""
val df = spark.read.json(Seq(jsonStr).toDS)
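Applied to the JSON from the question, a sketch along the same lines (assuming the string is held in a variable named data) could be:

import spark.implicits._
import org.apache.spark.sql.functions._

// data is assumed to hold the JSON string from the question
val df = spark.read.json(Seq(data).toDS)

// Spark infers the nested schema; explode yields one row per "info" element
val flat = df.select(explode($"info").as("info"), $"more")
flat.select($"info.id", $"info.pid", $"info.add.stat", $"more.a").show(false)

Note that schema inference merges the differing shapes of add across the array elements into one struct containing fields, stat, and note.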

5 Comments

Thank you for replying. That's the restriction: we receive the data in string form from an API. Storing it in a file and reading it back would not be an efficient use of Spark's in-memory processing.
@Arjun, please let me know if this works out for you!
I have tried that, but Seq(data) does not show a method to convert it to a Dataset using toDS, so it seems we need to create the DataFrame directly using the createDataFrame method. Please check my Spark version.
@Arjun, you need to import spark.implicits._ to convert a Sequence to a DataFrame or Dataset.
Thank you for guiding me through this in depth. Your way is working, but it causes an issue while getting the data. Anyway, I found a successful way to deal with the String data itself. I will write it down here as well. Thank you once again.

I found a solution by doing the following, which worked for me:

val schema = new StructType().add("data", StringType)

val rdd = ss.sparkContext.parallelize(Seq(Row(data)))

val df = ss.createDataFrame(rdd, schema)

df.printSchema()

println(df.head().getAs("data").toString)
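If you later need the nested fields rather than the raw string, one possible follow-up is to parse the data column with from_json. This is only a sketch; the schema is hand-written from the sample JSON, and the types are guesses you may need to adjust:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hand-written schema matching the sample document (adjust types as needed)
val addSchema = new StructType()
  .add("fields", BooleanType)
  .add("stat", StringType)
  .add("note", new StructType().add("id", LongType).add("score", LongType))

val fullSchema = new StructType()
  .add("info", ArrayType(new StructType()
    .add("done", StringType)
    .add("id", LongType)
    .add("type", StringType)
    .add("pid", LongType)
    .add("add", addSchema)))
  .add("more", new StructType()
    .add("a", StringType)
    .add("b", StringType)
    .add("c", LongType))

// Parse the raw string column, then flatten one row per "info" element
val parsed = df.withColumn("json", from_json(col("data"), fullSchema))
parsed.select(explode(col("json.info")).as("info"), col("json.more").as("more")).printSchema()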

2 Comments

Please add further details to expand on your answer, such as working code or documentation citations.
The code mentioned is enough to run the solution successfully.
