
Convert nested JSON row value (JSON) to a new DataFrame

val rd1 = spark.read.option("multiLine", "true").option("mode", "PERMISSIVE").json("data.json")

import org.apache.spark.sql.functions._

val ds1 = rd1.select("alpha._id", "alpha.Description", "alpha.Sub-Tower", "alpha.Tower", "alpha.input_data")

ds1.show() // this gives only a single row with an array in each column; I need a table of 4 rows instead

My approach 1:

val ds2=ds1
  .withColumn("Description", explode(col("Description")))
  .withColumn("Tower",data explode(col("Tower")))
  .withColumn("input_data", explode(col("input_data")))
  .withColumn("Sub-Tower", explode(col("Sub-Tower")))
  .withColumn("_id", explode(col("_id"))) 

println(ds2.count()) // the JSON array length is 4, but this prints 1025, which is incorrect

Input:

{
  "name": "raxvsdbsd",
  "stack": "raw",
  "threshold": "50",
  "alpha": [
    {
      "_id": "27",
      "input_data": "alpha beta gamma",
      "Tower": "A B C",
      "Description": "a b,c",
      "Sub-Tower": "crt"
    },
    {
      "_id": "91",
      "input_data": "alpha beta gamma",
      "Tower": "A B C",
      "Description": "a b,c",
      "Sub-Tower": "crt"
    },
    {
      "_id": "21",
      "input_data": "alpha beta gamma",
      "Tower": "A B C",
      "Description": "a b,c",
      "Sub-Tower": "crt"
    },
    {
      "_id": "29",
      "input_data": "alpha beta gamma",
      "Tower": "A B C",
      "Description": "a b,c",
      "Sub-Tower": "crt"
    }
  ]
}

Expected output:

Table for alpha as below:

+-----------+---------+-----+---+----------------+
|Description|Sub-Tower|Tower|_id|      input_data|
+-----------+---------+-----+---+----------------+
|      a b,c|      crt|A B C| 27|alpha beta gamma|
|      a b,c|      crt|A B C| 91|alpha beta gamma|
|      a b,c|      crt|A B C| 21|alpha beta gamma|
|      a b,c|      crt|A B C| 29|alpha beta gamma|
+-----------+---------+-----+---+----------------+

1 Answer

The following Scala code explodes the content of the alpha column:

val df = spark.read.option("multiLine", "true").json("data.json") // read the input file, as in the question

import org.apache.spark.sql.functions._
import spark.implicits._

val result = df.select(explode($"alpha").as("alpha")).select("alpha.*")

result.printSchema()
result.show()

and the result is as follows:

root
 |-- Description: string (nullable = true)
 |-- Sub-Tower: string (nullable = true)
 |-- Tower: string (nullable = true)
 |-- _id: string (nullable = true)
 |-- input_data: string (nullable = true)

+-----------+---------+-----+---+----------------+
|Description|Sub-Tower|Tower|_id|      input_data|
+-----------+---------+-----+---+----------------+
|      a b,c|      crt|A B C| 27|alpha beta gamma|
|      a b,c|      crt|A B C| 91|alpha beta gamma|
|      a b,c|      crt|A B C| 21|alpha beta gamma|
|      a b,c|      crt|A B C| 29|alpha beta gamma|
+-----------+---------+-----+---+----------------+
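
For context on why the approach in the question over-counts: each explode call multiplies the row count by the length of the array it explodes, so five successive explodes over length-4 array columns produce a cross product on the order of 4^5 = 1024 rows instead of 4. Exploding the alpha array of structs once, as above, keeps the fields of each element together in one row.

As a minimal alternative sketch (assuming Spark 2.0+, where the inline SQL generator function is available), the same result can be produced in a single step:

val resultAlt = df.selectExpr("inline(alpha)") // one row per element of alpha, one column per struct field
resultAlt.show()

Both forms return one row per element of alpha, so count() gives 4 as expected.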