
It seems like there should be a function for this in Spark SQL, similar to pivoting, but I haven't found any solution for transforming a JSON key into a value. Suppose I have a badly formed JSON (the format of which I cannot change):

{"A long string containing serverA": {"x": 1, "y": 2}}

how can I process it to

{"server": "A", "x": 1, "y": 2}

?

I read the JSONs into a sql.DataFrame and would then like to process them as described above:

val cs = spark.read.json("sample.json")
  .???
  • A direct transformation of the key name into a key-value pair, as shown in the example, would be the neatest way, but I would also accept something like "newkey": "A long string containing serverA". – Commented Nov 8, 2021 at 14:01

1 Answer


If you want to use only Spark functions and no UDFs, you can use from_json to parse the JSON into a map (we need to specify a schema for the values). Then you just need to extract the information with Spark functions. One way to do it is as follows:

import org.apache.spark.sql.types._

val schema = MapType(
    StringType,
    StructType(Array(
        StructField("x", IntegerType),
        StructField("y", IntegerType)
    ))
)

import org.apache.spark.sql.functions._
import spark.implicits._

spark.read.text("...")
    .withColumn("json", from_json('value, schema))
    .withColumn("key", map_keys('json).getItem(0))
    .withColumn("value", map_values('json).getItem(0))
    .withColumn("server",
        // Extracting the server name with a regex
        regexp_replace(regexp_extract('key, "server[^ ]*", 0), "server", ""))
    .select("server", "value.*")
    .show(false)

which yields:

+------+---+---+
|server|x  |y  |
+------+---+---+
|A     |1  |2  |
+------+---+---+
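For reference, the two regex steps can be checked outside Spark. This is a minimal Python sketch (a deliberate language switch, just for a standalone demo) of what regexp_extract('key, "server[^ ]*", 0) followed by regexp_replace(..., "server", "") computes on the example key:

```python
import re

key = "A long string containing serverA"

# regexp_extract(key, "server[^ ]*", 0): first match of the pattern
token = re.search(r"server[^ ]*", key).group(0)   # "serverA"

# regexp_replace(token, "server", ""): strip the literal prefix
server = token.replace("server", "")              # "A"

print(server)  # A
```

The pattern server[^ ]* grabs "server" plus everything up to the next space, so this only works if the server name contains no spaces.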

Comments

Thanks, Oli! This would work perfectly with my toy dataset. In my real dataset I have many more key-value mappings in the inner JSON, and I wonder if I can do the transformation without having to specify a schema.
With from_json you need to specify the schema but there are other ways to go at it. What's your json like exactly?
It is from a client and I am not sure if I can post the entire structure. It goes like { "Statistics for client XXX": { "ipaddress": "XX.XXX.XXX.XXX", "totalAccessRequests": 243, "totalDupAccessRequests": 0, "totalAccessAccepts": 51, ... } } and so on, with 17 numeric values. I can of course write a schema like you did here, but at that point I wonder if I should go a completely different route (parsing into a case class in Scala).
In any case I will accept your answer since it works perfectly for the toy dataset I had originally posted.
Thanks, but that does not solve your problem :)
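For the schema-free case described in the comments (one dynamic top-level key wrapping many inner fields), one alternative is to flatten the record with a plain JSON parser before handing it to Spark. A minimal Python sketch, using a hypothetical sample shaped like the structure quoted above (the field names and values here are illustrative, not the real client data):

```python
import json
import re

# Hypothetical record matching the structure described in the comment
raw = '{"Statistics for client XXX": {"ipaddress": "1.2.3.4", "totalAccessRequests": 243}}'

parsed = json.loads(raw)
(key, inner), = parsed.items()  # assumes exactly one top-level key

# Pull the client id out of the key and merge it into the inner record;
# no schema is needed because the inner fields are carried over as-is
client = re.sub(r"^Statistics for client\s*", "", key)
flat = {"client": client, **inner}

print(flat)
```

This avoids writing out all 17 fields by hand; the trade-off is that the parsing happens outside Spark's optimizer, so for large datasets you would wrap the same logic in a UDF or a map over the raw text lines.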
