I am working with a huge dataset in Apache Spark with Scala: around 10M records with 332 fields, all of which except one can be null. I would like to replace null with a blank string (""). What would be the best way to achieve this, given the large number of fields? I want to handle the nulls while importing the dataset, so that I am safe when performing transformations or exporting to a DataFrame. I have created a case class with 332 fields; what would be the best way to handle the nulls there? I could use Option(field).getOrElse("") on each field, but I guess that's not the best way with so many fields. Thank you!
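For reference, here is a minimal sketch of the Option(field).getOrElse("") approach I mean (the case class Record, its field names, and the input path are placeholders; the real case class has 332 fields):
// Hypothetical, truncated version of the 332-field case class.
case class Record(id: String, f1: String, f2: String)
import spark.implicits._ // encoder for Record
val raw = spark.read.parquet("input-path") // placeholder source; fields may be null
val records = raw.map(r => Record(
  r.getAs[String]("id"),
  Option(r.getAs[String]("f1")).getOrElse(""), // repeated per nullable field
  Option(r.getAs[String]("f2")).getOrElse("")
))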
@mtoto Shouldn't it be marked as a duplicate of this question: stackoverflow.com/questions/33376571/… (philantrovert, Aug 21, 2017 at 8:54)
@eliasah and mtoto: I agree with the above comment. Nice find, philantrovert. (Brad Cupit, Jun 21, 2018 at 13:06)
2 Answers
You should look at DataFrameNaFunctions. It has functions to replace null values in fields of different types with a default value.
val naFunctions = explodeDF.na       // DataFrameNaFunctions for this DataFrame
val nonNullDF = naFunctions.fill("") // fills nulls in all string columns
This will replace all the null values in the string fields with "". If your dataset has fields of other datatypes, repeat the same call with the default value for that particular type; for example, Int fields can be given the default value 0, as sketched below.
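A minimal sketch of combining per-type fills (the column names below are hypothetical); na.fill also accepts a Map of per-column defaults:
// fill("") targets string columns and fill(0) targets numeric columns,
// so the two calls can be chained on the same DataFrame.
val filledDF = explodeDF.na.fill("").na.fill(0)
// Alternatively, give per-column defaults in a single call
// ("name" and "count" are hypothetical column names).
val filledDF2 = explodeDF.na.fill(Map("name" -> "", "count" -> 0))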
redsk: This should be the accepted answer!
We can use a udf to get a null-safe column, like this:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // for toDF and the $ column syntax
val df = Seq((1, "Hello"), (2, "world"), (3, null)).toDF("id", "name")
// Map null strings to "" and wrap the function as a UDF.
val safeString: String => String = s => if (s == null) "" else s
val udfSafeString = udf(safeString)
val dfSafe = df.select($"id", udfSafeString($"name").alias("name"))
dfSafe.show
If you have lots of columns and one of them is the key column, we can do it like this:
// Apply the UDF to every column except the key column "id".
val safeCols = df.columns.map(colName =>
  if (colName == "id") col(colName)
  else udfSafeString(col(colName)).alias(colName))
val dfSafe = df.select(safeCols: _*)
dfSafe.show
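As a side note (not part of the original answer), the same projection can be built without a udf by using coalesce with a literal default from org.apache.spark.sql.functions, which avoids calling into Scala code per row; a minimal sketch on the same toy DataFrame, assuming the non-key columns are strings:
import org.apache.spark.sql.functions.{coalesce, lit}
// coalesce returns its first non-null argument, so nulls become "".
val safeCols2 = df.columns.map(colName =>
  if (colName == "id") col(colName)
  else coalesce(col(colName), lit("")).alias(colName))
val dfSafeNoUdf = df.select(safeCols2: _*)
dfSafeNoUdf.show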