Scala Spark - Split JSON column to multiple columns

Question

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a udf that creates a JSON String column:

val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))

It outputs as follows:

+----------------+---------------------------------------+
| encrypted_data | decrypted_json                        |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"}        |
+----------------+---------------------------------------+

The UDF is external code that I can't change. I would like to split the decrypted_json column into individual columns so that the output DataFrame looks like this:

+----------------+--------+-------------+
| encrypted_data | a      | b           |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+

Have you tried from_json as described at spark.apache.org/docs/latest/api/java/org/apache/spark/sql/…
– Salim, Commented Jan 6, 2020 at 13:50
Does this answer your question? How to query JSON data column using Spark DataFrames?
– blackbishop, Commented Jan 6, 2020 at 17:31

venus · Accepted Answer · 2020-01-06 20:27:12Z

2

The solution below is inspired by one of the solutions given by @Jacek Laskowski:

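// Assumes a SparkSession named spark is in scope, as in spark-shell
import spark.implicits._ // enables the $"..." syntax and toDF
import org.apache.spark.sql.types._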
val JsonSchema = new StructType()
  .add($"a".string)
  .add($"b".string)
val schema = new StructType()
  .add($"encrypted_data".string)
  .add($"decrypted_json".array(JsonSchema))

val schemaAsJson = schema.json

import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)

import org.apache.spark.sql.functions._

val rawJsons = Seq("""
  {
    "encrypted_data" : "eyJleHAiOjE1",
    "decrypted_json" : [
      {
        "a" : "547.65",
        "b" : "Some Data"
      }
    ]
  }
""").toDF("rawjson")

val flattened = rawJsons
  .select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
  .select("json.*") // <-- flatten the struct field
  .withColumn("decrypted", explode($"decrypted_json")) // <-- explode the array field
  .drop("decrypted_json")  // <-- no longer needed
  .select("encrypted_data", "decrypted.*") // <-- flatten the struct field

Please see the linked original solution for the full explanation.
I hope that helps.
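
For the DataFrame in the question, where decrypted_json already holds a flat JSON object as a string, it may be simpler to parse that column directly rather than wrapping it in an outer document. The following is a minimal sketch, not the original author's code. It assumes a local SparkSession named spark, a hand-built stand-in for the UDF output, and field types guessed from the sample row (a as a double, b as a string).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

// Assumption: a local session for experimentation; reuse your own session in practice
val spark = SparkSession.builder()
  .appName("json-split-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in for the UDF output from the question (values are illustrative)
val result = Seq(("eyJleHAiOjE1", """{"a":547.65 , "b":"Some Data"}"""))
  .toDF("encrypted_data", "decrypted_json")

// Schema of the JSON object; field types are assumed from the sample row
val jsonSchema = new StructType()
  .add("a", DoubleType)
  .add("b", StringType)

val flattened = result
  .withColumn("parsed", from_json($"decrypted_json", jsonSchema)) // JSON string -> struct
  .select($"encrypted_data", $"parsed.*")                         // struct fields -> top-level columns

flattened.show(false)
```

Rows whose JSON cannot be parsed against the schema come back with null fields instead of failing the job, so it can be worth checking for nulls afterwards.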

answered Jan 6, 2020 at 20:27


1 Comment

Hang Wu Over a year ago

How to make the header be something like "encrypted_data, json_a, json_b", i.e., adding "json_" prefix to the fields of json?
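
One possible way to get prefixed column names is to alias each field of the parsed struct while selecting it. This is only a sketch: the session setup, the sample row, the intermediate column name json, and the all-string schema are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Assumption: a local session for experimentation
val spark = SparkSession.builder()
  .appName("json-prefix-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative input with the same column names as the question
val df = Seq(("eyJleHAiOjE1", """{"a":"547.65","b":"Some Data"}"""))
  .toDF("encrypted_data", "decrypted_json")

val jsonSchema = new StructType()
  .add("a", StringType)
  .add("b", StringType)

// Parse the JSON string into a struct column named "json"
val parsed = df.withColumn("json", from_json($"decrypted_json", jsonSchema))

// One aliased column per struct field: json.a -> json_a, json.b -> json_b
val prefixed = jsonSchema.fieldNames.map(f => col(s"json.$f").as(s"json_$f"))

val out = parsed.select(col("encrypted_data") +: prefixed: _*)
out.show(false)
```

Driving the aliases from jsonSchema.fieldNames keeps the select in step with the schema if fields are added later.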

Salim · Accepted Answer · 2020-01-21 00:33:54Z

0

Using from_json you can parse the JSON into a struct type and then select columns from that DataFrame. You will need to know the schema of the JSON. Here is how:

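    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

    // Create (or reuse) a Spark session; the local master is only for trying this out
    val sparkSession = SparkSession.builder().appName("split-json").master("local[*]").getOrCreate()
    import sparkSession.implicits._

    val jsonData = """{"a":547.65 , "b":"Some Data"}"""
    // Schema of the JSON string: one StructField per key
    val schema = StructType(
      List(
        StructField("a", DoubleType, nullable = false),
        StructField("b", StringType, nullable = false)
      ))

    val df = sparkSession.createDataset(Seq(("dummy data", jsonData))).toDF("string_column", "json_column")
    // Parse the JSON string column into a struct column
    val dfWithParsedJson = df.withColumn("json_data", from_json($"json_column", schema))

    // Pull the struct fields out as top-level columns
    dfWithParsedJson.select($"string_column", $"json_column", $"json_data.a", $"json_data.b").show(false)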

Result

+-------------+------------------------------+------+---------+
|string_column|json_column                   |a     |b        |
+-------------+------------------------------+------+---------+
|dummy data   |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+
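
When the schema is not known ahead of time, one option is to let Spark infer it from the JSON strings themselves and pass the inferred schema to from_json. The sketch below assumes a local session and the same illustrative row as above. Inference needs an extra pass over the data, so an explicit schema is usually preferable once the shape is known.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

// Assumption: a local session for experimentation
val spark = SparkSession.builder()
  .appName("json-infer-schema-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("dummy data", """{"a":547.65 , "b":"Some Data"}"""))
  .toDF("string_column", "json_column")

// Read the JSON strings as a Dataset[String] and let Spark infer the schema
val inferredSchema = spark.read.json(df.select($"json_column").as[String]).schema

// Reuse the inferred schema to parse the column, then flatten the struct
val parsed = df
  .withColumn("json_data", from_json($"json_column", inferredSchema))
  .select($"string_column", $"json_data.*")

inferredSchema.printTreeString()
parsed.show(false)
```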

edited Jan 21, 2020 at 0:33

answered Jan 6, 2020 at 14:02


2 Comments

Roni Gadot Over a year ago

Thank you for your reply, what exactly should I pass as the schema?

Salim Over a year ago

The schema of the JSON needs to be passed. I think my code has an extra quote. I will fix it when I get time.
