
I am trying to improve the accuracy of the logistic regression algorithm implemented in Spark using Java. For this I'm trying to replace null or invalid values present in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I have only been able to extract the most frequent value in a particular column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?
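For context, here is a minimal sketch of how that first step can look in Java (the names df and mostFrequentName are placeholders, not my actual code):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.desc;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Count each non-null value and take the most frequent one.
String mostFrequentName = df.filter(col("Name").isNotNull())
        .groupBy("Name")
        .count()                  // adds a "count" column
        .orderBy(desc("count"))
        .first()
        .getString(0);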

4 Answers


You can use the .na.fill function (it is a function in org.apache.spark.sql.DataFrameNaFunctions).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You choose the columns, and the value you want to use to replace the null or NaN values.

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))

4 Comments

Is it available in Java? I couldn't find a similar fill function.
Sorry, I didn't use it in Java, but you can find the latest Spark API documentation here, and you can see DataFrameNaFunctions there: spark.apache.org/docs/latest/api/java/index.html. Probably try fill without .na.
@PirateJack can you please accept the answer if it solved your problem?
Have you tried using it with null? It says it cannot be applied to (Null, Int). It hasn't solved the purpose for me, so I was wondering whether there might be some solution now, after 2 years :)

You'll want to use the fill(String value, String[] columns) method from your dataframe's na() functions, which automatically replaces null values in a given list of columns with the value you specify.

So if you already know the value that you want to replace null with:

String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.

3 Comments

My dataframes are of type Dataset<Row>. It says fill is not defined for the type Dataset<Row>.
I have updated my answer to include the .na part. You could also try: df.na().fill(ImmutableMap.<String, Object>of("ColumnName", "replacementValue", "egName", "egA"));
Thanks a lot for help. I was able to implement it using the scala Sequence libraries. I'll update the same in my answer.

You can use DataFrame.na.fill() to replace the nulls with some value. To update several columns at once you can do:

val map = Map("Name" -> "a", "Place" -> "a2")

df.na.fill(map).show()

But if you want to replace bad records too, then you need to identify the bad records first. You can do this with a regular expression, using the rlike function.
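Since the comments below ask for a Java example, here is a hedged sketch of both ideas: values of "Place" failing an illustrative pattern check are nulled out with when (without otherwise, non-matching rows become null), and then all nulls are filled from a map in one pass. The pattern and replacement values are assumptions for illustration.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

import com.google.common.collect.ImmutableMap;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Null out "Place" values that fail the (illustrative) regex check.
Dataset<Row> cleaned = df.withColumn("Place",
        when(col("Place").rlike("^[a-z][0-9]+$"), col("Place")));
// Fill every remaining null from the per-column map in one pass.
Dataset<Row> filled = cleaned.na().fill(
        ImmutableMap.<String, Object>of("Name", "a", "Place", "a2"));

The same when(...) gating also covers the conditional-fill question in the comments: make the condition anything you like, e.g. col("col1").isNotNull().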

2 Comments

I need to do this for each column separately instead of for the whole dataframe at once. Can you please share an example of how I would replace any value? Also, I'll create a regular expression for the bad records. Please share a Java example if you have one. Thank you.
Can we do this based on a condition, e.g. fill column2 only if col1 is not null?

In order to replace the NULL values with a given string, I've used the fill function present in Spark for Java. It accepts the replacement string and a sequence of column names. Here is how I have implemented it:

List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data=data.na().fill(word, colSeq);
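For what it's worth, the Scala conversion can probably be skipped here: the Java API also has an overload taking a plain String[], so the following one-liner should be equivalent (a sketch, not tested against the code above):

data = data.na().fill(word, new String[] { cols[i] });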

1 Comment

My Dataset in Spark has null values, which I'm trying to save to Redshift; it arrives as the string "null", which is not what we want. We want null as null in Redshift too. Any idea how to implement that?
