
Spark 2.4 introduced useful new Spark SQL functions for arrays, but I was a bit puzzled to find that the result of select array_remove(array(1, 2, 3, null, 3), null) is null and not [1, 2, 3, 3].

Is this the expected behavior? Is it possible to remove nulls using array_remove?

As a side note, for now the alternative I am using is a higher-order function in Databricks:
select filter(array(1, 2, 3, null, 3), x -> x is not null)

1 Comment

The alternative is the way to go. array_remove depends on a notion of equality, and equality with NULL is undefined. (Jan 12, 2019)

6 Answers


To answer your first question, "Is this the expected behavior?": yes. The official notebook (https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) points out that array_remove will "Remove all elements that equal to the given element from the given array." NULL corresponds to an undefined value, so any comparison against it is also undefined, and the result is NULL.

So, I think NULLs are out of the purview of this function.
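You can see the undefined comparison directly (a minimal sketch, assuming a PySpark shell with an active spark session):

spark.sql("SELECT null = null AS eq").show()
# +----+
# |  eq|
# +----+
# |null|
# +----+

spark.sql("SELECT array_remove(array(1, 2, 3, null, 3), null) AS r").show()
# +----+
# |   r|
# +----+
# |null|
# +----+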

It's good that you found a way to overcome this. You can also use spark.sql("""SELECT array_except(array(1, 2, 3, 3, null, 3, 3, 3, 4, 5), array(null))""").show(), but the downside is that the result will be without duplicates.
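For example (a sketch; note how the duplicate 3s collapse into one element):

spark.sql("SELECT array_except(array(1, 2, 3, 3, null, 3), array(null)) AS r").show()
# +---------+
# |        r|
# +---------+
# |[1, 2, 3]|
# +---------+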




Spark 3.4+

array_compact("col_name")

Full PySpark example:

from pyspark.sql import functions as F
df = spark.createDataFrame([([3, None, 3],)], ["c"])
df.show()
# +------------+
# |           c|
# +------------+
# |[3, null, 3]|
# +------------+

df = df.withColumn("c", F.array_compact("c"))

df.show()
# +------+
# |     c|
# +------+
# |[3, 3]|
# +------+
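The same function is also available from SQL (a minimal sketch, again assuming Spark 3.4+):

spark.sql("SELECT array_compact(array(1, 2, 3, null, 3)) AS c").show()
# +------------+
# |           c|
# +------------+
# |[1, 2, 3, 3]|
# +------------+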

2 Comments

This is the simplest and easiest solution I found among the 10+ answers I tried. Thanks!
Stack Overflow wins over Copilot, thanks!

You can do something like this in Spark 2:

import org.apache.spark.sql.functions._
import org.apache.spark.sql._

/**
  * Array without nulls
  * For complex types, you are responsible for passing in a nullPlaceholder of the same type as elements in the array
  */
def non_null_array(columns: Seq[Column], nullPlaceholder: Any = "רכוב כל יום"): Column =
  array_remove(array(columns.map(c => coalesce(c, lit(nullPlaceholder))): _*), nullPlaceholder)
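A rough PySpark port of the same placeholder trick (a sketch; the function name and the __null__ placeholder are illustrative, and as above you are responsible for matching the placeholder's type to the array's element type):

from pyspark.sql import functions as F

def non_null_array(columns, null_placeholder="__null__"):
    # Swap nulls for a placeholder, build the array, then strip the placeholder out.
    return F.array_remove(
        F.array(*[F.coalesce(c, F.lit(null_placeholder)) for c in columns]),
        null_placeholder,
    )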

In Spark 3, there is a new array filter function, and you can do:

df.select(filter(col("array_column"), x => x.isNotNull))
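In PySpark the equivalent would be (a sketch; pyspark.sql.functions.filter was added in 3.1):

from pyspark.sql import functions as F

df.select(F.filter(F.col("array_column"), lambda x: x.isNotNull()))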



https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html

array_remove(array, T): array
Remove all elements that equal to the given element from the given array.

Note: I only referred to the documentation, and they use the same data. null can never be equal to null.



I don't think you can use array_remove() or array_except() for your problem. However, though it's not a very good solution, it may help.

from pyspark.sql import functions as F

@F.udf("array<string>")
def udf_remove_nulls(arr):
    # Keep only the non-null elements
    return [i for i in arr if i is not None]

df = df.withColumn("col_wo_nulls", udf_remove_nulls(df["array_column"]))



If you also want to get rid of duplicates, returning each distinct non-NULL value exactly once, you can use array_except:

from pyspark.sql import functions as f

f.array_except(f.col("array_column_with_nulls"), f.array(f.lit(None)))

or, equivalently, in SQL:

array_except(your_array_with_NULLs, array(null))

1 Comment

I downvoted as this removes duplicates, and the asker wants to retain duplicates
