I am working with a huge dataset in Apache Spark with Scala: around 10M records with 332 fields, of which all but one can be null. I would like to replace null with a blank string (""). What would be the best way to achieve this, given the large number of fields? I want to handle nulls while importing this dataset so I will be safe while performing transformations or exporting to a DataFrame. I have created a case class with 332 fields; what would be the best way to handle these nulls? I could use Option(field).getOrElse(""), but I guess that's not the best approach with so many fields. Thank you!


2 Answers


You should look at DataFrameNaFunctions. It provides functions to replace null values in fields of different types with a default value.

val naFunctions = explodeDF.na
val nonNullDF = naFunctions.fill("")

This will replace all the null values in the string fields to "".

If your dataset has fields of different data types, repeat the same call with the default value for each particular type; for example, Int fields can be given the default value 0.
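Mixed types can also be handled in a single call, since fill accepts a Map from column name to default value. A minimal standalone sketch (the column names name and count are hypothetical, not from the question's schema):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-fill").getOrCreate()
import spark.implicits._

// Two hypothetical columns of different types, each with a null.
val df = Seq((Some("a"), Some(1)), (None: Option[String], None: Option[Int]))
  .toDF("name", "count")

// fill takes a Map of column name -> default, so string and numeric
// columns can be filled in one pass: "" for strings, 0 for the Int column.
val filled = df.na.fill(Map("name" -> "", "count" -> 0))
filled.show()
```

With 332 fields you would build that Map programmatically from df.schema, keying on each field's data type.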


1 Comment

This should be the accepted answer!

We can use a UDF to build a null-safe column like this:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // for toDF and $-notation

val df = Seq((1, "Hello"), (2, "world"), (3, null)).toDF("id", "name")

// Map null to "" at the Scala level, then wrap as a UDF.
val safeString: String => String = s => if (s == null) "" else s
val udfSafeString = udf(safeString)

val dfSafe = df.select($"id", udfSafeString($"name").alias("name"))

dfSafe.show

If you have lots of columns and one of them is the key column, we can do it like this:

val safeCols = df.columns.map { colName =>
  if (colName == "id") col(colName)
  else udfSafeString(col(colName)).alias(colName)
}

val dfSafe = df.select(safeCols: _*)
dfSafe.show
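A UDF-free variant of the same idea, as a standalone sketch (assuming the non-key columns are all strings): the built-in coalesce returns its first non-null argument, and Catalyst can optimize built-in functions where it cannot see inside a UDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

val spark = SparkSession.builder().master("local[*]").appName("coalesce-fill").getOrCreate()
import spark.implicits._

val df = Seq((1, "Hello"), (2, "world"), (3, null)).toDF("id", "name")

// Replace null with "" in every column except the key column, without a UDF:
// coalesce(c, lit("")) yields c when it is non-null, otherwise "".
val coalescedCols = df.columns.map { colName =>
  if (colName == "id") col(colName)
  else coalesce(col(colName), lit("")).alias(colName)
}

val dfCoalesced = df.select(coalescedCols: _*)
dfCoalesced.show()
```

For this simple case it is equivalent to df.na.fill("") from the other answer; the column-mapping form is useful when the key column must be left untouched.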
