
In theory, this solution works perfectly for what I need: creating a new copied version of a dataframe while excluding certain nested StructFields. Here is a minimal reproducible example of my issue:

>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)

which you can instantiate like so:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

My goal is to convert the dataframe (along with the values in the columns I want to keep) into one that excludes certain nested fields, such as delete:

root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)

According to the solution I linked, which leverages pyspark.sql's to_json and from_json functions, it should be achievable with something like this:

from pyspark.sql.functions import col, from_json, to_json

new_schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType())
])))])

test_df = df.withColumn("big", to_json(col("big"))) \
            .withColumn("big", from_json(col("big"), new_schema))

>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)

>>> test_df.show()
+----+
| big|
+----+
|null|
+----+

So either I'm not following the directions right, or it doesn't work. How do you do this without a UDF?

PySpark to_json documentation
PySpark from_json documentation

1 Answer


It should work; you just need to adjust your new_schema so that it describes the data type of the column big only, not the schema of the whole dataframe:

# schema for the column itself, not wrapped in an outer StructType
new_schema = ArrayType(StructType([StructField("keep", StringType())]))

test_df = df.withColumn("big", from_json(to_json("big"), new_schema))
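
With that change, the round trip drops the delete field as intended. A quick check (a sketch, assuming the empty dataframe df from the question):

>>> test_df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)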

2 Comments

Good catch, I never thought to try that because the PySpark documentation made it seem like I needed to pass in a StructType: "schema – a StructType to use when parsing the json column"
@ark0n, it's basically the data type of a particular column, so it can be any complex data type in PySpark. You can also use df.select('big').schema.simpleString() to retrieve and modify this information. (Just make sure to use back-ticks to enclose field names if some of them contain special characters like spaces or dots.)
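
For illustration, a sketch of that simpleString() approach (the trimmed schema string below is hand-edited; from_json also accepts such a DDL-style type string in recent Spark versions):

from pyspark.sql.functions import from_json, to_json

# simpleString() on the selected column yields something like
# 'struct<big:array<struct<keep:string,delete:string>>>'
print(df.select("big").schema.simpleString())

# hand-edited copy with the unwanted field removed; from_json
# accepts this DDL-style type string directly
trimmed = "array<struct<keep:string>>"
test_df = df.withColumn("big", from_json(to_json("big"), trimmed))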
