
I have a dataframe with a json string column.

[screenshot: dataframe with id and json-string columns]

I am trying to turn this JSON string column into a proper STRUCT column, but as you can see my schema is dynamic and can differ for each row. Basically, in some instances I have a JSON object, in others it's a JSON array of objects, and the number of objects in that array cannot be known in advance.
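For reference, here is a minimal sketch of this kind of dataframe (the column names and sample values follow the output shown in the answer below; it is illustrative, not my exact data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows 1 and 3 hold JSON arrays of objects, row 2 holds a single JSON object.
data = [
    (1, '[{"code": 1, "label": "1"}]'),
    (2, '{"code": 2, "label": "2"}'),
    (3, '[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]'),
]
df = spark.createDataFrame(data, ["id", "json-string"])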

I tried the solution below, but it can only generate a schema for a single object, not for an array of objects.

from pyspark.sql import functions as F

json_schema = spark.read.json(df.rdd.map(lambda row: row['json-string'])).schema
df = df.withColumn('new-struct-column', F.from_json(F.col('json-string'), json_schema))

[screenshot: resulting struct column produced by this attempt]

Also, this method generates an extra key called text, and I don't know where it is coming from.

1 Comment

Honestly, better to fix the source rather than trying to use this file as it is. (Commented Jan 24, 2023 at 17:10)

1 Answer


If the JSON does not contain any nested JSON, this should help you:

>>> from pyspark.sql.functions import col, concat, lit, regexp_extract
>>> df.withColumn("correct-json-string", concat(lit("["), regexp_extract(col("json-string"), r"\{.*\}", 0), lit("]"))).show(5, False)

+---+------------------------------------------------------+------------------------------------------------------+
|id |json-string                                           |correct-json-string                                   |
+---+------------------------------------------------------+------------------------------------------------------+
|1  |[{"code": 1, "label": "1"}]                           |[{"code": 1, "label": "1"}]                           |
|2  |{"code": 2, "label":"2"}                              |[{"code": 2, "label":"2"}]                            |
|3  |[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]|[{"code": 3, "label": "3"}, {"code": 4, "label": "4"}]|
+---+------------------------------------------------------+------------------------------------------------------+
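If you also want the normalized string parsed into a real ArrayType column, one possible follow-up is the sketch below. It assumes every element fits the flat code/label structure from the sample data and passes an explicit ArrayType schema to from_json (the column names correct-json-string and json-struct are just illustrative):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Explicit element schema; from_json with an ArrayType schema can parse the
# normalized "[{...}, ...]" strings directly into an array of structs.
element_schema = StructType([
    StructField("code", LongType()),
    StructField("label", StringType()),
])

parsed = (
    df.withColumn(
        "correct-json-string",
        F.concat(F.lit("["), F.regexp_extract(F.col("json-string"), r"\{.*\}", 0), F.lit("]")),
    )
    .withColumn("json-struct", F.from_json(F.col("correct-json-string"), ArrayType(element_schema)))
)

parsed.select("id", "json-struct").show(truncate=False)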

1 Comment

Thank you for this, but this just creates a column with a uniform structure. PySpark still can't infer this as an ArrayType even though it's a valid JSON value. Only JSON string values that start at an object level are recognized. So I might just need to do { "data": [{"code": 1, "label": "1"}] } to have a quick fix.
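A sketch of that quick fix, reusing the regexp_extract normalization from the answer (the wrapper key data comes from the comment above; the column name wrapped-json is just illustrative):

from pyspark.sql import functions as F

# Wrap the normalized array in an object so spark.read.json sees a top-level
# object and infers a struct containing an array field.
wrapped = df.withColumn(
    "wrapped-json",
    F.concat(
        F.lit('{"data": ['),
        F.regexp_extract(F.col("json-string"), r"\{.*\}", 0),
        F.lit("]}"),
    ),
)

wrapped_schema = spark.read.json(wrapped.rdd.map(lambda row: row["wrapped-json"])).schema
wrapped = wrapped.withColumn("json-struct", F.from_json(F.col("wrapped-json"), wrapped_schema))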
