
I would like to replace a null value in a PySpark DataFrame array column with another string column converted to an array.

import pyspark.sql.functions as F
import pyspark.sql.types as T

new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.printSchema()
new_customers.show(5, False)

new_customers = new_customers.withColumn("new_val", F.coalesce(F.col("val"), F.array(F.col("name"))))

new_customers.show(10, truncate=False)

But the result is:

 root
 |-- name: string (nullable = true)
 |-- val: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------+---+
|name  |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John  |[] |
|Cosimo|[d]|
+------+---+

+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|[a]    |
|Penny |[b]|[b]    |
|John  |[] |[]     |
|Cosimo|[d]|[d]    |
+------+---+-------+

what I expect:

+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|[a]    |
|Penny |[b]|[b]    |
|John  |[] |[John] |
|Cosimo|[d]|[d]    |
+------+---+-------+

Did I miss something? Thanks.

1 Answer
The problem is that your array contains a null element; the array itself is not null, so it will not test positive for an isNull check, and coalesce never replaces it.

First, remove the null elements from the arrays:

import pyspark.sql.functions as F
import pyspark.sql.types as T

new_customers = spark.createDataFrame(data=[["Karen", ["a"]], ["Penny", ["b"]], ["John", [None]], ["Cosimo", ["d"]]], schema=["name", "val"])
new_customers.show(truncate=False)
+------+------+
|name  |val   |
+------+------+
|Karen |[a]   |
|Penny |[b]   |
|John  |[null]|
|Cosimo|[d]   |
+------+------+


new_customers = new_customers.withColumn("val", F.filter(F.col("val"), lambda x: x.isNotNull()))
new_customers.show(truncate=False)
+------+---+
|name  |val|
+------+---+
|Karen |[a]|
|Penny |[b]|
|John  |[] |
|Cosimo|[d]|
+------+---+

Then, change your expression to check for an empty array instead of a null value:

new_customers = new_customers.withColumn("new_val", F.when(F.size("val") > 0, F.col("val")).otherwise(F.array(F.col("name"))))
new_customers.show(truncate=False)
+------+---+-------+
|name  |val|new_val|
+------+---+-------+
|Karen |[a]|[a]    |
|Penny |[b]|[b]    |
|John  |[] |[John] |
|Cosimo|[d]|[d]    |
+------+---+-------+
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for your help, but I still get "John -> []"; my PySpark version is 2.3.2.
My Spark has no filter().
