
In a Coursera course about Spark + Scala, I had a lesson on the fillna and replace functions.
I tried to reproduce it to see how it works in practice, but I have a problem creating a DataFrame with values that are meant to be replaced.
I tried using a JSON input file and a sequence of tuples; in both cases I got exceptions.
Could you please advise how to create a DataFrame that contains null / NaN / None (ideally all of them, which would be best for learning purposes)?

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object HowToCreateDfWithNullsOrNaNs
{
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("HowToCreateDfWithNullsOrNaNs")
    .getOrCreate()

  def main(args: Array[String]): Unit =
  {
    fromFile()
  }

  def fromFile(): Unit =
  {
    // input_file.json: { "name": "Tom", "surname": null, "age": 10}
    val rddFromJson: RDD[String] = spark.sparkContext.textFile("src/main/resources/input_file.json")
    import spark.implicits._
    /*
      Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
      Old column names (1): value
      New column names (3): name, surname, age
     */
    rddFromJson.toDF("name", "surname", "age")
  }

  def fromSeq(): Unit =
  {
    val tupleSeq: Seq[(String, Any, Int)] = Seq(("Tom", null , 10))
    val rdd = spark.sparkContext.parallelize(tupleSeq)
    /*
      Exception in thread "main" java.lang.ClassNotFoundException: scala.Any
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
     */
    import spark.implicits._
    rdd.toDF("name", "surname", "age")
  }
}

1 Answer

Since you want to create DataFrames, I suggest not struggling with RDD-to-DataFrame conversions and instead creating, well, DataFrames directly.

For example, modifying your two examples:

// input_file.json: { "name": "Tom", "surname": null, "age": 10}
val df = spark.read.json("src/main/resources/input_file.json")
+---+----+-------+
|age|name|surname|
+---+----+-------+
| 10| Tom|   null|
+---+----+-------+
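To tie this back to the lesson's fill/replace functions, here is a sketch of what you could run on the df above (assuming the same spark session; the replacement values "unknown" and "Thomas" are just illustrative):

```scala
// na.fill replaces nulls in the listed string columns with a default value
df.na.fill("unknown", Seq("surname")).show()

// na.replace swaps specific non-null values, e.g. in the name column
df.na.replace("name", Map("Tom" -> "Thomas")).show()
```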

val columns = Seq("name","surname","age")
val tupleSeq: Seq[(String, String, Int)] = Seq(("Tom", null , 10))
spark.createDataFrame(tupleSeq).toDF(columns:_*)
+----+-------+---+
|name|surname|age|
+----+-------+---+
| Tom|   null| 10|
+----+-------+---+
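If you want both null and NaN in one DataFrame (as the question asks), one option is to pass Rows with an explicit schema. This is a sketch assuming the same spark session; note that NaN only exists for floating-point types, so the age column is declared as DoubleType:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Explicit schema: every column is nullable, and age is a Double so it can hold NaN
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("surname", StringType, nullable = true),
  StructField("age", DoubleType, nullable = true)
))

val rows = Seq(
  Row("Tom", null, 10.0),         // null surname
  Row("Ann", "Smith", Double.NaN) // NaN age
)

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.show()
```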

Note that in the 2nd example the type shouldn't be Any, for the reasons given in this answer:

All fields / columns in a Dataset have to be of known, homogeneous type for which there is an implicit Encoder in the scope. There is simply no place for Any there
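Relatedly, if you prefer to keep the tuple approach, Option[String] does have an implicit Encoder while Any does not, and None is encoded as a SQL null. A sketch (again assuming the same spark session):

```scala
import spark.implicits._

// None becomes a SQL null in the resulting DataFrame
val df = Seq(("Tom", Option.empty[String], 10)).toDF("name", "surname", "age")
df.show()
```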


2 Comments

Your answer is what I needed. After using it I had some issues with the fill and replace functions: they did not work for me with null, and with None I got an exception, but they work with NaN as expected. Anyway, the question was about a DF with null/NaN/None, and your suggestion plus my fill/replace code works with NaN. Thank you :)
@bridgemnc, BTW, regarding Any and nulls in dataframes, I think this would also be of interest to you: stackoverflow.com/questions/52959980/…, stackoverflow.com/questions/44725222/…
