I am working with a huge dataset in Apache Spark with Scala: around 10M records with 332 fields, of which all but one can be null. I would like to replace null with a blank string (""). What would be the best way to achieve this, given the large number of fields? I want to handle nulls while importing this dataset so I will be safe while performing transformations or exporting to a DataFrame. I have created a case class with 332 fields; what would be the best way to handle these nulls? I could use Option(field).getOrElse(""), but I guess that's not the best approach with so many fields. Thank you!


2 Answers


You should look at DataFrameNaFunctions. It provides functions to replace null values in fields of different types with a default value.

val naFunctions = explodeDF.na
val nonNullDF = naFunctions.fill("")

This will replace all the null values in the string fields to "".

If your dataset has fields of different data types, repeat the same call with the default value for each particular type; for example, Int fields can be given the default value 0.
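Mixed types can also be handled in a single call, since fill accepts a Map from column name to default value. A minimal standalone sketch (the column names name and count are hypothetical, not from the question's schema):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-fill").getOrCreate()
import spark.implicits._

// Two hypothetical columns of different types, each with a null.
val df = Seq((Some("a"), Some(1)), (None: Option[String], None: Option[Int]))
  .toDF("name", "count")

// fill takes a Map of column name -> default, so string and numeric
// columns can be filled in one pass: "" for strings, 0 for the Int column.
val filled = df.na.fill(Map("name" -> "", "count" -> 0))
filled.show()
```

With 332 fields you would build that Map programmatically from df.schema, keying on each field's data type.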


1 Comment

This should be the accepted answer!

We can use a UDF to build a null-safe column like this:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // for toDF and $-notation

val df = Seq((1, "Hello"), (2, "world"), (3, null)).toDF("id", "name")

// Map null to "" at the Scala level, then wrap as a UDF.
val safeString: String => String = s => if (s == null) "" else s
val udfSafeString = udf(safeString)

val dfSafe = df.select($"id", udfSafeString($"name").alias("name"))

dfSafe.show

If you have lots of columns and one of them is the key column, we can do it like this:

val safeCols = df.columns.map { colName =>
  if (colName == "id") col(colName)
  else udfSafeString(col(colName)).alias(colName)
}

val dfSafe = df.select(safeCols: _*)
dfSafe.show
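A UDF-free variant of the same idea, as a standalone sketch (assuming the non-key columns are all strings): the built-in coalesce returns its first non-null argument, and Catalyst can optimize built-in functions where it cannot see inside a UDF.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

val spark = SparkSession.builder().master("local[*]").appName("coalesce-fill").getOrCreate()
import spark.implicits._

val df = Seq((1, "Hello"), (2, "world"), (3, null)).toDF("id", "name")

// Replace null with "" in every column except the key column, without a UDF:
// coalesce(c, lit("")) yields c when it is non-null, otherwise "".
val coalescedCols = df.columns.map { colName =>
  if (colName == "id") col(colName)
  else coalesce(col(colName), lit("")).alias(colName)
}

val dfCoalesced = df.select(coalescedCols: _*)
dfCoalesced.show()
```

For this simple case it is equivalent to df.na.fill("") from the other answer; the column-mapping form is useful when the key column must be left untouched.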
