
First, because of three-valued logic, this isn't just the negation of any valid implementation of a null-or-empty check.

I want to write a function isNotNullish, which is as close as possible to isNotNull but also filters out empty strings. I'm running into some oddities involving how Column types work, as well as three-valued logic. So far the best I have is:

def isNotNullish(questionable: Column) : Column = {
  val oddish = questionable === lit("")
  questionable.isNotNull && (!oddish || oddish.isNull)
}

If that looks strange, it's because it is. As far as I can tell, questionable === lit("") will return null for a given row if that row holds either null or (some!) non-string types. In three-valued logic, true && null = null, which would cause questionable.isNotNull && (questionable =!= lit("")) to return null in some cases where I'd want it to return true. The questionable.isNotNull && (!oddish || oddish.isNull) code should never yield null, always true or false.
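The true && null = null behavior is easiest to see outside of Spark. Here is a plain-Scala sketch of SQL's three-valued AND, using Option[Boolean] with None standing in for NULL (and3 is a hypothetical name, not a Spark API):

```scala
// SQL three-valued AND modeled in plain Scala.
// None plays the role of NULL.
def and3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // false AND anything = false
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None        // true AND NULL = NULL
  }
```

Note that false short-circuits NULL but true does not, which is exactly why isNotNull && (comparison that can go null) can itself go null.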

This almost-but-not-quite works: for some reason I don't understand, the === comparison is happy to return null for numeric types, but fails with an error for complex types. (Even if what I'm trying to do is impossible or inadvisable, I'd love an explanation for that.)

It might be more responsible to check the schema and simply run a different test for string columns than for other types, but as far as I can tell that requires passing the DataFrame as a parameter. I'm trying to avoid that, especially in long sequences of transformations where said parameter might be anonymous.
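The type-dispatch idea can be sketched without Spark at all. Here is a plain-Scala analogue (isMeaningful is a hypothetical helper, not Spark API) that treats a cell as "an actual value" unless it is null, or is a string and empty; only strings are ever candidates for the empty-string test, so non-string types never hit the comparison that misbehaves:

```scala
// Plain-Scala analogue of the desired per-cell predicate:
// null is never meaningful; "" is not meaningful; anything else is.
def isMeaningful(cell: Any): Boolean = cell match {
  case null      => false
  case s: String => s.nonEmpty
  case _         => true // non-string types can't be "", so non-null suffices
}
```

In Spark terms, this is the predicate you'd want the Column expression to compute rowwise; the difficulty in the question is expressing the type dispatch without access to the schema.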

(Before anyone asks: I know that dealing with null and types this sloppily would be horrible in general Scala, but I think it's different in the context of Spark/SQL/huge dataframes of varying schema. The specific use case is automatic exploration of sketchy data, so being able to answer questions like "I don't know what the columns are, but tell me how often they hold actual values" is useful.)

  • Doesn't this solve the problem? questionable.isNotNull && questionable !== lit("") Commented Feb 13, 2020 at 17:56
  • @LuisMiguelMejíaSuárez I think that for any numeric value, that evaluates to true && null = null. Commented Feb 13, 2020 at 18:36

1 Answer 1


Does it work for you to use <=>?

    def notNullAndEmpty(c: Column): Column =
      c.isNotNull and !(c <=> lit(""))

    val employees = spark.createDataFrame(Seq(
      ("E1", "100.0"), ("E2", "200.0"), ("E3", null), ("E4", "")
    )).toDF("employee", "salary")
    employees.show()
    employees.filter(notNullAndEmpty($"salary")).show()

Data -

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
|      E3|  null|
|      E4|      |
+--------+------+

Result

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
+--------+------+
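Why this works: <=> is null-safe equality (SQL's IS NOT DISTINCT FROM), which always returns true or false, never null, so negating it is safe under three-valued logic. A plain-Scala sketch of its semantics, with Option and None standing in for NULL (nullSafeEq is a hypothetical name for illustration):

```scala
// Null-safe equality: two NULLs compare equal, NULL vs. value is false,
// and the result is always a definite Boolean, never NULL.
def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = (a, b) match {
  case (None, None)       => true
  case (Some(x), Some(y)) => x == y
  case _                  => false
}
```

Because nullSafeEq never yields NULL, c.isNotNull and !(c <=> lit("")) can't be poisoned by the true && null = null problem from the question.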

Comments

Thanks! I didn't know about the <=> operator, but that seems to solve it perfectly.
On further checking, I think that has the same problem mine does - type errors if you compare to structured data. (The <=> is more elegant, though)
by structured data you mean a Long, Array, a Struct etc?
Arrays, specifically. (I suspect structs would cause problems too, but I flatten them out at an earlier point in my code so I'm not sure.) Are Longs structs? Did you mean Map?
@Edward I see your point. The <=> fails when the data types mismatch. We can get the data type of the column, pass it into the function, and compare only where the data type matches. Another way is to use a Dataset and plain Scala code.
