1

I have a DataFrame as below.

Value1 Value2 Value3
30000  40000  50000
null   20000  10000

Also, I have a UDF created as

val testUDF=udf((a: Double, b: Double, c: Double) => {
   if(a==null && b!=null && c!=null)
       b+c
   else
       a+b+c
})

I have a code as below.

input.withColumn("checkNull", testUDF(col("value1"),col("value2"),col("value3"))).show

Resulting dataframe is as

Value1 Value2 Value3 checkNull
30000  40000  50000  120000
null   20000  10000   null

Here, instead of displaying 3000 for second row for the column "checkNUll", it displayed, null. Any thing I am doing wrong in my code? I dont want to replace it with 0. Because, if I want to do multiplication instead of addition above, it will fail.

8
  • did you try "is null" and "is not null" instead? Commented Aug 4, 2017 at 6:29
  • 1
    Replace nulls with 0s and then add. Commented Aug 4, 2017 at 6:40
  • Yes. a.isNull throws an error saying isNull is not a member of Double Commented Aug 4, 2017 at 6:40
  • 1
    You can't multiply with nulls there so that's a bummer. You need to find out in which condition you'd be multiplying and then replace accordingly. Commented Aug 4, 2017 at 6:46
  • 1
    Here are some explanations why you can't have null value in scala.Double stackoverflow.com/questions/45285253/… Commented Aug 4, 2017 at 7:00

3 Answers 3

2

There is so many ways to do what you are trying to do and with the current information, I'd suggest using coalesce :

df.withColumn("x4", 
                   coalesce(
                        $"x1".cast("long") * $"x2" * $"x3", 
                        $"x2".cast("long") * $"x3", 
                        lit(0)
                   )
  ).show
+-----+-----+-----+--------------+
|   x1|   x2|   x3|            x4|
+-----+-----+-----+--------------+
|30000|40000|50000|60000000000000|
| null|20000|10000|     200000000|
+-----+-----+-----+--------------+

I'm casting to long due to this issue Why do these two multiplication operations give different results?

Here is another solution filling the column x1 with 1 to multiply :

df.na.fill(1, Seq("x1")).withColumn("x4", $"x1".cast("long") * $"x2" * $"x3").show

It yields the same results.

One other option is to do the following :

df.withColumn("x4", 
                when($"x1".isNull and $"x2".isNotNull and $"x3".isNotNull, $"x2".cast("long") * $"x3")
                .otherwise($"x1".cast("long") * $"x2"* $"x3")
  ).show

I advice you to read the following entry about Dealing with Null in Apache Spark.

Sign up to request clarification or add additional context in comments.

5 Comments

how would the first part result in 200000000? Casting a null to long doesn't do anything, does it? Wouldn't you get a 0 in the second row because the first two parameters to coalesce would result in a null?
casting a null into long will yield a null thus the first expression is null. I'm sure of that.
Oh, I thought it was $"x1" in the second parameter as well. My bad.
No worries @philantrovert
Spark has a special NaN semantics which makes your UDF not working spark.apache.org/docs/latest/…
0

You have to use filter in such condition because when value1 is null it wont call udf.

val testStr = """[{"Val1":1000, "Val2":2000, "Val3":4000},{"Val2":2000, "Val3":4000}]"""
    val rdd = sc.parallelize(Seq(testStr))
          val df = sqlContext.read.json(rdd)

          val newdf = df.filter(df.col("Val1").isNotNull).withColumn("checkNull",df.col("Val1")+df.col("Val2")+df.col("Val3"))
            val newdfw2 = df.filter(df.col("Val1").isNull).withColumn("checkNull",df.col("Val2")+df.col("Val3"))
            val alldf = newdf.unionAll(newdfw2)
          alldf.show()

Comments

0

You can simply use na.fill(0.0) to replace all null values as 0.0 and just use + notation to add the column values as

df.select($"Value1".cast(DoubleType), $"Value2".cast(DoubleType), $"Value3".cast(DoubleType))
    .na.fill(0.0)
    .withColumn("checkNull", $"Value1"+$"Value2"+$"Value3")
    .show(false)

Given dataframe as

+------+------+------+
|Value1|Value2|Value3|
+------+------+------+
|30000 |40000 |50000 |
|null  |20000 |10000 |
+------+------+------+

you should have output as

+-------+-------+-------+---------+
|Value1 |Value2 |Value3 |checkNull|
+-------+-------+-------+---------+
|30000.0|40000.0|50000.0|120000.0 |
|0.0    |20000.0|10000.0|30000.0  |
+-------+-------+-------+---------+

I hope the answer is helpful

2 Comments

There is no way to modify my udf is it?
I guess you don't want better solutions, do you? :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.