
I have a dataframe with a json string column.

[screenshot: dataframe with id and json-string columns]

I am trying to turn this JSON string column into a proper STRUCT column, but as you can see my schema is dynamic and can differ for each row. Basically, in some instances I have a JSON object, in others it's a JSON array of objects, and the number of objects in that array cannot be known in advance.
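For reference, here is a minimal sketch of this kind of dataframe (the column names and sample values follow the output shown in the answer below; it is illustrative, not my exact data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows 1 and 3 hold JSON arrays of objects, row 2 holds a single JSON object.
data = [
    (1, '[{"code": 1, "label": "1"}]'),
    (2, '{"code": 2, "label": "2"}'),
    (3, '[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]'),
]
df = spark.createDataFrame(data, ["id", "json-string"])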

I tried the solution below, but it can only generate a schema for a single object, not for an array of objects.

from pyspark.sql import functions as F

json_schema = spark.read.json(df.rdd.map(lambda row: row['json-string'])).schema
df = df.withColumn('new-struct-column', F.from_json(F.col('json-string'), json_schema))

[screenshot: resulting struct column produced by this attempt]

Also, this method generates an extra key called text, and I don't know where it is coming from.

1 Comment

Honestly, better to fix the source rather than trying to use this file as it is. (Commented Jan 24, 2023 at 17:10)

1 Answer


If the JSON does not contain any nested JSON, this should help you:

>>> from pyspark.sql.functions import col, concat, lit, regexp_extract
>>> df.withColumn("correct-json-string", concat(lit("["), regexp_extract(col("json-string"), r"\{.*\}", 0), lit("]"))).show(5, False)

+---+------------------------------------------------------+------------------------------------------------------+
|id |json-string                                           |correct-json-string                                   |
+---+------------------------------------------------------+------------------------------------------------------+
|1  |[{"code": 1, "label": "1"}]                           |[{"code": 1, "label": "1"}]                           |
|2  |{"code": 2, "label":"2"}                              |[{"code": 2, "label":"2"}]                            |
|3  |[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]|[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]|
+---+------------------------------------------------------+------------------------------------------------------+
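If you also want the normalized string parsed into a real ArrayType column, one possible follow-up is the sketch below. It assumes every element fits the flat code/label structure from the sample data and passes an explicit ArrayType schema to from_json (the column names correct-json-string and json-struct are just illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Explicit element schema; from_json with an ArrayType schema can parse the
# normalized "[{...}, ...]" strings directly into an array of structs.
element_schema = StructType([
    StructField("code", LongType()),
    StructField("label", StringType()),
])

parsed = (
    df.withColumn(
        "correct-json-string",
        F.concat(F.lit("["), F.regexp_extract(F.col("json-string"), r"\{.*\}", 0), F.lit("]")),
    )
    .withColumn("json-struct", F.from_json(F.col("correct-json-string"), ArrayType(element_schema)))
)

parsed.select("id", "json-struct").show(truncate=False)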

1 Comment

Thank you for this, but this just creates a column with a uniform structure. PySpark still can't infer this as an ArrayType even though it's a valid JSON value. Only JSON string values that start at an object level are recognized. So I might just need to do { "data": [{"code": 1, "label": "1"}] } to have a quick fix.
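A sketch of that quick fix, reusing the regexp_extract normalization from the answer (the wrapper key data comes from the comment above; the column name wrapped-json is just illustrative):

from pyspark.sql import functions as F

# Wrap the normalized array in an object so spark.read.json sees a top-level
# object and infers a struct containing an array field.
wrapped = df.withColumn(
    "wrapped-json",
    F.concat(
        F.lit('{"data": ['),
        F.regexp_extract(F.col("json-string"), r"\{.*\}", 0),
        F.lit("]}"),
    ),
)

wrapped_schema = spark.read.json(wrapped.rdd.map(lambda row: row["wrapped-json"])).schema
wrapped = wrapped.withColumn("json-struct", F.from_json(F.col("wrapped-json"), wrapped_schema))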
