
Spark 2.4 introduced useful new Spark SQL functions for arrays, but I was a bit puzzled to find that the result of select array_remove(array(1, 2, 3, null, 3), null) is null and not [1, 2, 3, 3].

Is this the expected behavior? Is it possible to remove nulls using array_remove?

As a side note, for now the alternative I am using is a higher-order function in Databricks:
select filter(array(1, 2, 3, null, 3), x -> x is not null)

1 Comment

The alternative is the way to go. array_remove depends on a notion of equality, and equality with NULL is undefined. (Jan 12, 2019)

6 Answers


To answer your first question, "Is this the expected behavior?": yes. The official notebook (https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) points out that array_remove will "Remove all elements that equal to the given element from the given array." NULL corresponds to an undefined value, so any comparison against it is also undefined, and the result is NULL.

So, I think NULLs are out of the purview of this function.
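You can see the undefined comparison directly (a minimal sketch, assuming a PySpark shell with an active spark session):

spark.sql("SELECT null = null AS eq").show()
# +----+
# |  eq|
# +----+
# |null|
# +----+

spark.sql("SELECT array_remove(array(1, 2, 3, null, 3), null) AS r").show()
# +----+
# |   r|
# +----+
# |null|
# +----+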

It's good that you found a way to overcome this. You can also use spark.sql("""SELECT array_except(array(1, 2, 3, 3, null, 3, 3, 3, 4, 5), array(null))""").show(), but the downside is that the result will be without duplicates.
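For example (a sketch; note how the duplicate 3s collapse into one element):

spark.sql("SELECT array_except(array(1, 2, 3, 3, null, 3), array(null)) AS r").show()
# +---------+
# |        r|
# +---------+
# |[1, 2, 3]|
# +---------+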




Spark 3.4+

array_compact("col_name")

Full PySpark example:

from pyspark.sql import functions as F
df = spark.createDataFrame([([3, None, 3],)], ["c"])
df.show()
# +------------+
# |           c|
# +------------+
# |[3, null, 3]|
# +------------+

df = df.withColumn("c", F.array_compact("c"))

df.show()
# +------+
# |     c|
# +------+
# |[3, 3]|
# +------+
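The same function is also available from SQL (a minimal sketch, again assuming Spark 3.4+):

spark.sql("SELECT array_compact(array(1, 2, 3, null, 3)) AS c").show()
# +------------+
# |           c|
# +------------+
# |[1, 2, 3, 3]|
# +------------+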

2 Comments

This is the simplest and easiest solution I found among the 10+ answers I tried. Thanks!
Stack Overflow wins over Copilot, thanks!

You can do something like this in Spark 2:

import org.apache.spark.sql.functions._
import org.apache.spark.sql._

/**
  * Array without nulls
  * For complex types, you are responsible for passing in a nullPlaceholder of the same type as elements in the array
  */
def non_null_array(columns: Seq[Column], nullPlaceholder: Any = "רכוב כל יום"): Column =
  array_remove(array(columns.map(c => coalesce(c, lit(nullPlaceholder))): _*), nullPlaceholder)
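A rough PySpark port of the same placeholder trick (a sketch; the function name and the __null__ placeholder are illustrative, and as above you are responsible for matching the placeholder's type to the array's element type):

from pyspark.sql import functions as F

def non_null_array(columns, null_placeholder="__null__"):
    # Swap nulls for a placeholder, build the array, then strip the placeholder out.
    return F.array_remove(
        F.array(*[F.coalesce(c, F.lit(null_placeholder)) for c in columns]),
        null_placeholder,
    )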

In Spark 3, there is a new array filter function, and you can do:

df.select(filter(col("array_column"), x => x.isNotNull))
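In PySpark the equivalent would be (a sketch; pyspark.sql.functions.filter was added in 3.1):

from pyspark.sql import functions as F

df.select(F.filter(F.col("array_column"), lambda x: x.isNotNull()))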



https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html

array_remove(array, T): array
Remove all elements that equal to the given element from the given array.

Note: I only referred to the documentation, and they use the same data. null can never be equal to null.



I don't think you can use array_remove() or array_except() for your problem. However, though it's not a very good solution, it may help.

from pyspark.sql import functions as F

@F.udf("array<string>")
def udf_remove_nulls(arr):
    # Keep only the non-null elements
    return [i for i in arr if i is not None]

df = df.withColumn("col_wo_nulls", udf_remove_nulls(df["array_column"]))



If you also want to get rid of duplicates, returning each distinct non-NULL value exactly once, you can use array_except:

from pyspark.sql import functions as f

f.array_except(f.col("array_column_with_nulls"), f.array(f.lit(None)))

or, equivalently, in SQL:

array_except(your_array_with_NULLs, array(null))

1 Comment

I downvoted as this removes duplicates, and the asker wants to retain duplicates
