
I'm loading the JSON string below into a DataFrame column.

{
    "title": {
        "titleid": "222",
        "titlename": "ABCD"
    },
    "customer": {
        "customerDetail": {
            "customerid": 878378743,
            "customerstatus": "ACTIVE",
            "customersystems": {
                "customersystem1": "SYS01",
                "customersystem2": null
            },
            "sysid": null
        },
        "persons": [{
            "personid": "123",
            "personname": "IIISKDJKJSD"
        },
        {
            "personid": "456",
            "personname": "IUDFIDIKJK"
        }]
    }
}

import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Infer the schema from the sample file, then use it to parse the JSON string column
val js = spark.read.json("./src/main/resources/json/customer.txt")
println(js.schema)
val newDF = df.select(from_json($"value", js.schema).as("parsed_value"))
newDF.selectExpr("parsed_value.customer.*").show(false)
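
For context, df here is a single-column DataFrame holding the raw JSON string in a column named value. A minimal way to construct it, reusing the same sample file (illustrative only; the actual source may differ):

// Hypothetical setup for df: one row, one string column named "value"
val jsonString = scala.io.Source.fromFile("./src/main/resources/json/customer.txt").mkString
val df = Seq(jsonString).toDF("value")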

//Schema:

StructType(StructField(customer,StructType(StructField(customerDetail,StructType(StructField(customerid,LongType,true), StructField(customerstatus,StringType,true), StructField(customersystems,StructType(StructField(customersystem1,StringType,true), StructField(customersystem2,StringType,true)),true), StructField(sysid,StringType,true)),true), StructField(persons,ArrayType(StructType(StructField(personid,StringType,true), StructField(personname,StringType,true)),true),true)),true), StructField(title,StructType(StructField(titleid,StringType,true), StructField(titlename,StringType,true)),true))

//Output:

+------------------------------+---------------------------------------+
|customerDetail                |persons                                |
+------------------------------+---------------------------------------+
|[878378743, ACTIVE, [SYS01,],]|[[123, IIISKDJKJSD], [456, IUDFIDIKJK]]|
+------------------------------+---------------------------------------+

My question: Is there a way to split the keys out into separate DataFrame columns, as below, while keeping the array columns as they are? I need exactly one record per JSON string.

Example for customer column:

customer.customerDetail.customerid,customer.customerDetail.customerstatus,customer.customerDetail.customersystems.customersystem1,customer.customerDetail.customersystems.customersystem2,customer.customerDetail.sysid,customer.persons
878378743,ACTIVE,SYS01,null,null,{"persons": [ { "personid": "123", "personname": "IIISKDJKJSD" }, { "personid": "456", "personname": "IUDFIDIKJK" } ] }

2 Answers


Edited post:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

val df = spark.read.json("your/path/data.json")

// Recursively walk the schema, collecting the dotted path of every leaf field.
// Arrays are not descended into, so they stay intact as single columns.
def collectFields(field: String, sc: DataType): Seq[String] = {
  sc match {
    case sf: StructType => sf.fields.flatMap(f => collectFields(field + "." + f.name, f.dataType))
    case _ => Seq(field)
  }
}

// Drop the leading "." left over from the empty root prefix
val fields = collectFields("", df.schema).map(_.tail)

df.select(fields.map(col): _*).show(false)

Output :

+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|customerid|customerstatus|customersystem1|customersystem2|sysid|persons                              |titleid|titlename|
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
|878378743 |ACTIVE        |SYS01          |null           |null |[[123,IIISKDJKJSD], [456,IUDFIDIKJK]]|222    |ABCD     |
+----------+--------------+---------------+---------------+-----+-------------------------------------+-------+---------+
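
If the persons array should additionally be rendered as a JSON string, as in the desired output in the question, the flattened array column can be wrapped with to_json. A sketch, assuming Spark 2.2+ where to_json accepts arrays of structs:

import org.apache.spark.sql.functions.{col, to_json}

// Replace the persons array-of-structs column with its JSON-string representation
val flattened = df.select(fields.map(col): _*)
flattened.withColumn("persons", to_json(col("persons"))).show(false)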

3 Comments

Thanks, but I need only one record, keeping the arrays as they are, so no explode. And I don't want to list all the column names explicitly in the select, since there could be hundreds.
@Leibnitz thanks for your feedback. I've just updated my answer to fit your needs: I created a basic recursive function to find nested fields in StructTypes.
Thank you for your time! This works well, @baitmbarek.

You can also try this with the help of RDDs: define the column names up front, read the JSON into an RDD, convert it to a DataFrame with .toDF(), and fill those columns from the parsed records. A rough sketch of that idea is below.
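
A minimal sketch of that approach, assuming each JSON record sits on a single line of the input file (the pretty-printed sample above would need multiline parsing instead) and that the target column names are chosen up front:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read each line as a raw JSON string into an RDD, then let Spark parse it
val rdd = spark.sparkContext.textFile("your/path/data.json")
val parsed = spark.read.json(rdd.toDS())

// Pick the leaf fields, then rename the resulting columns with .toDF()
val result = parsed
  .select($"customer.customerDetail.customerid",
    $"customer.customerDetail.customerstatus",
    $"customer.persons")
  .toDF("customerid", "customerstatus", "persons")

result.show(false)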

