
I want to create a DataFrame from a complex JSON string using Spark with Scala.

Spark version is 3.1.2. Scala version is 2.12.14.

The source data is as shown below:

{
  "info": [
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "fields": true,
        "stat": "not sure"
      }
    },
    {
      "done": "time",
      "id": 14,
      "type": "normal",
      "pid": 764310,
      "add": {
        "fields": true,
        "stat": "sure"
      }
    },
    {
      "done": "time",
      "id": 9,
      "type": "normal",
      "pid": 202020,
      "add": {
        "note": {
          "id": 922,
          "score": 0
        }
      }
    }
  ],
  "more": {
    "a": "ok",
    "b": "fine",
    "c": 3
  }
}

I have tried the following, but it is not working.

val schema = new StructType().add("info", ArrayType(StringType)).add("more", StringType)

val rdd = ss.sparkContext.parallelize(Seq(Row(data))) // data is the JSON string shown above

val df = ss.createDataFrame(rdd, schema)

df.printSchema()

The schema gets printed as below:

root
 |-- info: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- more: string (nullable = true)

print(df.head())

The above line throws an exception: java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of array<string>
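For context on the error: the schema declares info as array<string>, so each Row must carry a Seq of strings in that slot, while Row(data) puts the entire document into it as a single String. A Row that would actually satisfy that schema looks like this minimal sketch (with made-up values):

val row = Row(Seq("""{"id": 9}""", """{"id": 14}"""), """{"a": "ok"}""") // Seq[String] for array<string>, plain String for more

That encodes without error, but it still leaves the JSON unparsed, which is what the answers below address.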

Please help me to do this.


2 Answers


If the data resides in files on HDFS/S3 etc., you can easily read them using the spark.read.json function.

Something like this should work on HDFS:

val df = spark.read.option("multiline","true").json("hdfs:///home/vikas/sample/*.json")

On S3 it would be:

val df = spark.read.option("multiline","true").json("s3a://vikas/sample/*.json")

Please ensure that you have read access to the path.

As mentioned in your comment, you are reading data from an API. In that case, the following should work for Spark 2.2 and above:

import spark.implicits._
val jsonStr = """{ "metadata": { "key": 84896, "value": 54 }}"""
val df = spark.read.json(Seq(jsonStr).toDS)
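Applied to the JSON from the question, a sketch along the same lines (assuming the string is held in a variable named data) could be:

import spark.implicits._
import org.apache.spark.sql.functions._

// data is assumed to hold the JSON string from the question
val df = spark.read.json(Seq(data).toDS)

// Spark infers the nested schema; explode yields one row per "info" element
val flat = df.select(explode($"info").as("info"), $"more")
flat.select($"info.id", $"info.pid", $"info.add.stat", $"more.a").show(false)

Note that schema inference merges the differing shapes of add across the array elements into one struct containing fields, stat, and note.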

5 Comments

Thank you for replying. That's the restriction: we receive the data in string form from an API. Storing it in a file and reading it back would not be an efficient use of Spark's in-memory processing.
@Arjun, please let me know if this works out for you!
I have tried that, but Seq(data) does not show a method to convert it to a Dataset using toDS, so it seems we need to create the DataFrame directly using the createDataFrame method. Please check my Spark version.
@Arjun, you need to import spark.implicits._ to convert a Sequence to a DataFrame or Dataset.
Thank you for guiding me through this in depth. Your way is working, but it causes an issue while getting the data. Anyway, I found a successful way to deal with the String data itself. I will write it down here as well. Thank you once again.

I found a solution by doing the following, which worked for me:

val schema = new StructType().add("data", StringType)

val rdd = ss.sparkContext.parallelize(Seq(Row(data)))

val df = ss.createDataFrame(rdd, schema)

df.printSchema()

println(df.head().getAs("data").toString)
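If you later need the nested fields rather than the raw string, one possible follow-up is to parse the data column with from_json. This is only a sketch; the schema is hand-written from the sample JSON, and the types are guesses you may need to adjust:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hand-written schema matching the sample document (adjust types as needed)
val addSchema = new StructType()
  .add("fields", BooleanType)
  .add("stat", StringType)
  .add("note", new StructType().add("id", LongType).add("score", LongType))

val fullSchema = new StructType()
  .add("info", ArrayType(new StructType()
    .add("done", StringType)
    .add("id", LongType)
    .add("type", StringType)
    .add("pid", LongType)
    .add("add", addSchema)))
  .add("more", new StructType()
    .add("a", StringType)
    .add("b", StringType)
    .add("c", LongType))

// Parse the raw string column, then flatten one row per "info" element
val parsed = df.withColumn("json", from_json(col("data"), fullSchema))
parsed.select(explode(col("json.info")).as("info"), col("json.more").as("more")).printSchema()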

2 Comments

Please add further details to expand on your answer, such as working code or documentation citations.
The code mentioned is enough to run the solution successfully.
