
I am trying to improve the accuracy of the logistic regression algorithm implemented in Spark using Java. For this I'm trying to replace null or invalid values present in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I have only been able to extract the most frequent value in a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?
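For context, here is a minimal sketch of how that first step can look in Java (the names df and mostFrequentName are placeholders, not my actual code):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.desc;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Count each non-null value and take the most frequent one.
String mostFrequentName = df.filter(col("Name").isNotNull())
        .groupBy("Name")
        .count()                  // adds a "count" column
        .orderBy(desc("count"))
        .first()
        .getString(0);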

4 Answers


You can use the .na.fill function (it is a function in org.apache.spark.sql.DataFrameNaFunctions).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You choose the columns, and the value you want to use to replace the null or NaN values.

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))

4 Comments

Is it available in Java? I couldn't find a similar fill function.
Sorry, I didn't use it in Java, but you can find the latest Spark API documentation here, and you can see DataFrameNaFunctions there: spark.apache.org/docs/latest/api/java/index.html. Probably try fill without .na.
@PirateJack can you please accept the answer if it solved your problem?
Have you tried using it with null? It says it cannot be applied to (Null, Int). It hasn't solved the purpose for me, so I was wondering whether there might be some solution now, after 2 years :)

You'll want to use the fill(String value, String[] columns) method from your dataframe's na() functions, which automatically replaces null values in a given list of columns with the value you specify.

So if you already know the value that you want to replace null with:

String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.

3 Comments

My dataframes are of type Dataset<Row>. It says fill is not defined for the type Dataset<Row>.
I have updated my answer to include the .na part. You could also try: df.na().fill(ImmutableMap.<String, Object>of("ColumnName", "replacementValue", "egName", "egA"));
Thanks a lot for help. I was able to implement it using the scala Sequence libraries. I'll update the same in my answer.

You can use DataFrame.na.fill() to replace the nulls with some value. To update several columns at once you can do:

val map = Map("Name" -> "a", "Place" -> "a2")

df.na.fill(map).show()

But if you want to replace bad records too, then you need to identify the bad records first. You can do this with a regular expression, using the rlike function.
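Since the comments below ask for a Java example, here is a hedged sketch of both ideas: values of "Place" failing an illustrative pattern check are nulled out with when (without otherwise, non-matching rows become null), and then all nulls are filled from a map in one pass. The pattern and replacement values are assumptions for illustration.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import com.google.common.collect.ImmutableMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Null out "Place" values that fail the (illustrative) regex check.
Dataset<Row> cleaned = df.withColumn("Place",
        when(col("Place").rlike("^[a-z][0-9]+$"), col("Place")));
// Fill every remaining null from the per-column map in one pass.
Dataset<Row> filled = cleaned.na().fill(
        ImmutableMap.<String, Object>of("Name", "a", "Place", "a2"));

The same when(...) gating also covers the conditional-fill question in the comments: make the condition anything you like, e.g. col("col1").isNotNull().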

2 Comments

I need to do this for each column separately instead of for the whole dataframe at once. Can you please share an example of how I would replace any value? Also, I'll create a regular expression for the bad records. Please share a Java example if you have one. Thank you.
Can we do this based on a condition, e.g. fill column2 only if col1 is not null?

In order to replace the NULL values with a given string, I've used the fill function present in Spark for Java. It accepts the replacement string and a sequence of column names. Here is how I have implemented it:

List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data=data.na().fill(word, colSeq);
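For what it's worth, the Scala conversion can probably be skipped here: the Java API also has an overload taking a plain String[], so the following one-liner should be equivalent (a sketch, not tested against the code above):

data = data.na().fill(word, new String[] { cols[i] });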

1 Comment

My Dataset in Spark has null values, which I'm trying to save to Redshift; it arrives as the string "null", which is not what we want. We want null as null in Redshift too. Any idea how to implement that?
