1

I have the following dataframe schema:

root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- cities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- postcode: string (nullable = true

And my dataframe looks like this:

+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |null    |[[Paris,null]]                     |
+---------+--------+-----------------------------------+

I want to replace all of null values with string "unknown". When I use na.fill function I get the following dataframe:

df.na.fill("unknown").show()

+---------+--------+-----------------------------------+
|firstname|lastname|cities                             |
+---------+--------+-----------------------------------+
|John     |Doe     |[[New York,A000000], [Warsaw,null]]|
|John     |Smith   |[[Berlin,null]]                    |
|John     |unknown |[[Paris,null]]                     |
+---------+--------+-----------------------------------+

How can I replace ALL of the null values in dataframe (Including nested arrays)?

0

1 Answer 1

3

na.fill doesn't fill null elements within a struct field in an array column. One approach would be to use a UDF as shown below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

case class City(name: String, postcode: String)

val df = Seq(
  ("John", "Doe", Seq(City("New York", "A000000"), City("Warsaw", null))),
  ("John", "Smith", Seq(City("Berlin", null))),
  ("John", null, Seq(City("Paris", null)))
).toDF("firstname", "lastname", "cities")

val defaultStr = "unknown"

def patchNull(default: String) = udf( (s: Seq[Row]) =>
  s.map( r => (r.getAs[String]("name"), r.getAs[String]("postcode")) match {
      case (null, null) => (default, default)
      case (c, null) => (c, default)
      case (null, p) => (default, p)
      case e => e
    }
  ) )

df.
  withColumn( "cities", patchNull(defaultStr)($"cities") ).
  na.fill(defaultStr).
  show(false)
// +---------+--------+--------------------------------------+
// |firstname|lastname|cities                                |
// +---------+--------+--------------------------------------+
// |John     |Doe     |[[New York,A000000], [Warsaw,unknown]]|
// |John     |Smith   |[[Berlin,unknown]]                    |
// |John     |unknown |[[Paris,unknown]]                     |
// +---------+--------+--------------------------------------+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.