
First, because of three-valued logic, this isn't just the negation of any valid implementation of a null-or-empty check.

I want to write a function isNotNullish, which is as close as possible to isNotNull but also filters out empty strings. I'm running into some oddities involving how Column types work, as well as three-valued logic. So far the best I have is:

def isNotNullish(questionable: Column) : Column = {
  val oddish = questionable === lit("")
  questionable.isNotNull && (!oddish || oddish.isNull)
}

If that looks strange, it's because it is. As far as I can tell, questionable === lit("") will return null for a given row if that row holds either null or (some!) non-string types. In three-valued logic, true && null = null, which would cause questionable.isNotNull && (questionable =!= lit("")) to return null in some cases where I'd want it to return true. The questionable.isNotNull && (!oddish || oddish.isNull) code should never yield null, always true or false.
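The true && null = null behavior is easiest to see outside of Spark. Here is a plain-Scala sketch of SQL's three-valued AND, using Option[Boolean] with None standing in for NULL (and3 is a hypothetical name, not a Spark API):

```scala
// SQL three-valued AND modeled in plain Scala.
// None plays the role of NULL.
def and3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // false AND anything = false
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None        // true AND NULL = NULL
  }
```

Note that false short-circuits NULL but true does not, which is exactly why isNotNull && (comparison that can go null) can itself go null.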

This almost-but-not-quite works: for some reason I don't understand, the === comparison is happy to return null for numeric types, but fails with an error for complex types. (Even if what I'm trying to do is impossible or inadvisable, I'd love an explanation for that.)

It might be more responsible to check the schema and simply run a different test for string columns than for other types, but as far as I can tell that requires passing the DataFrame as a parameter. I'm trying to avoid that, especially in long sequences of transformations where said parameter might be anonymous.
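The type-dispatch idea can be sketched without Spark at all. Here is a plain-Scala analogue (isMeaningful is a hypothetical helper, not Spark API) that treats a cell as "an actual value" unless it is null, or is a string and empty; only strings are ever candidates for the empty-string test, so non-string types never hit the comparison that misbehaves:

```scala
// Plain-Scala analogue of the desired per-cell predicate:
// null is never meaningful; "" is not meaningful; anything else is.
def isMeaningful(cell: Any): Boolean = cell match {
  case null      => false
  case s: String => s.nonEmpty
  case _         => true // non-string types can't be "", so non-null suffices
}
```

In Spark terms, this is the predicate you'd want the Column expression to compute rowwise; the difficulty in the question is expressing the type dispatch without access to the schema.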

(Before anyone asks: I know that dealing with null and types this sloppily would be horrible in general Scala, but I think it's different in the context of Spark/SQL/huge dataframes of varying schema. The specific use case is automatic exploration of sketchy data, so being able to answer questions like "I don't know what the columns are, but tell me how often they hold actual values" is useful.)

  • Doesn't this solve the problem? questionable.isNotNull && questionable !== lit("") Commented Feb 13, 2020 at 17:56
  • @LuisMiguelMejíaSuárez I think that for any numeric value, that evaluates to true && null = null. Commented Feb 13, 2020 at 18:36

1 Answer 1


Does it work for you to use <=>?

    def notNullAndEmpty(c: Column): Column =
      c.isNotNull and !(c <=> lit(""))

    val employees = spark.createDataFrame(Seq(
      ("E1", "100.0"), ("E2", "200.0"), ("E3", null), ("E4", "")
    )).toDF("employee", "salary")
    employees.show()
    employees.filter(notNullAndEmpty($"salary")).show()

Data -

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
|      E3|  null|
|      E4|      |
+--------+------+

Result

+--------+------+
|employee|salary|
+--------+------+
|      E1| 100.0|
|      E2| 200.0|
+--------+------+
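Why this works: <=> is null-safe equality (SQL's IS NOT DISTINCT FROM), which always returns true or false, never null, so negating it is safe under three-valued logic. A plain-Scala sketch of its semantics, with Option and None standing in for NULL (nullSafeEq is a hypothetical name for illustration):

```scala
// Null-safe equality: two NULLs compare equal, NULL vs. value is false,
// and the result is always a definite Boolean, never NULL.
def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = (a, b) match {
  case (None, None)       => true
  case (Some(x), Some(y)) => x == y
  case _                  => false
}
```

Because nullSafeEq never yields NULL, c.isNotNull and !(c <=> lit("")) can't be poisoned by the true && null = null problem from the question.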

Comments

Thanks! I didn't know about the <=> operator, but that seems to solve it perfectly.
On further checking, I think that has the same problem mine does - type errors if you compare to structured data. (The <=> is more elegant, though)
by structured data you mean a Long, Array, a Struct etc?
Arrays, specifically. (I suspect structs would cause problems too, but I flatten them out at an earlier point in my code so I'm not sure.) Are Longs structs? Did you mean Map?
@Edward I see your point. The <=> fails when the data types mismatch. We can get the data type of the column, pass it into the function, and compare only where the data type matches. Another way is to use a Dataset and plain Scala code.
