I have a Spark DataFrame with the schema below:

root
 |-- cluster_info: struct (nullable = true)
 |    |-- cluster_id: string (nullable = true)
 |    |-- influencers: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- screenName: string (nullable = true)

I need to get the unique list of screenName values, and I am doing it with the code below. But collect is a very heavy operation. Is there a better way to do it?

var namesDF = df.select(concat_ws(",", $"cluster_info.influencers.screenName").as("screenName"))
val influencerNameList: List[String] = namesDF.map(r => r(0).asInstanceOf[String]).collect().toList.mkString(",").split(",").toList.distinct

Please suggest. Thanks in advance.

1 Answer

You can select the nested screenName field as an array, explode it, and take the distinct values, as below:

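import org.apache.spark.sql.functions.explode

// Selecting screenName through the array of structs yields an array<string> column
val namesDF = df.select($"cluster_info.influencers.screenName".as("screenName"))
  // explode turns each array element into its own row; the column is now a plain string
  .withColumn("screenName", explode($"screenName"))
  .distinct()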

namesDF now holds the distinct screenName values. To get them as a list, you can use:

namesDF.rdd.map(_.getString(0)).collect()
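
For anyone who wants to try this end to end, here is a minimal sketch on a tiny in-memory DataFrame. The case classes (Influencer, ClusterInfo, Record), the object name, and the sample names are made up for illustration and only mirror the schema from the question. It also uses the typed .as[String] route instead of going through .rdd, which is an alternative and not what the answer above does.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical case classes that mirror the schema in the question.
case class Influencer(screenName: String)
case class ClusterInfo(cluster_id: String, influencers: Seq[Influencer])
case class Record(cluster_info: ClusterInfo)

object DistinctScreenNames {
  def distinctNames(spark: SparkSession): List[String] = {
    // Brings the $"..." syntax and the String encoder into scope.
    import spark.implicits._

    // Made-up sample rows; "bob" appears in both clusters.
    val df = Seq(
      Record(ClusterInfo("c1", Seq(Influencer("alice"), Influencer("bob")))),
      Record(ClusterInfo("c2", Seq(Influencer("bob"), Influencer("carol"))))
    ).toDF()

    // array<string> -> one row per name -> duplicates dropped on the executors.
    val namesDF = df
      .select(explode($"cluster_info.influencers.screenName").as("screenName"))
      .distinct()

    // Collecting is fine here only because the sample data is tiny.
    namesDF.as[String].collect().toList.sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-screen-names")
      .getOrCreate()
    try println(distinctNames(spark)) finally spark.stop()
  }
}
```

The point of doing distinct() before collect() is that only the deduplicated names travel to the driver, rather than every row joined into one big string.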

But I don't recommend collecting the result if you have a big dataset, since collect pulls every row to the driver.
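
If the goal is to avoid collect altogether, two options that may fit are sketched below: writing the names out from the executors, or iterating over them on the driver one partition at a time. This assumes namesDF has a single string column as above. The object name, the method names, and outputPath are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

object AvoidCollect {
  // Writes one screenName per line under outputPath (a hypothetical location).
  // The text writer requires exactly one string column.
  def saveNames(namesDF: DataFrame, outputPath: String): Unit =
    namesDF.write.mode("overwrite").text(outputPath)

  // Streams rows to the driver partition by partition, so the driver never
  // has to hold the full result in memory at once.
  def foreachNameOnDriver(namesDF: DataFrame)(f: String => Unit): Unit = {
    val it = namesDF.toLocalIterator()
    while (it.hasNext) f(it.next().getString(0))
  }
}
```

Which one is better depends on what happens with the names afterwards. If they feed another Spark job, keeping them as a DataFrame (or writing them out) is usually enough and the driver never needs the list.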

Hope this helps!

4 Comments

Thanks for your response. But when I apply the map function on it, it gives: Error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. namesDF.map( influencerName => { Error: not enough arguments for method map
Still the same. Maybe I am doing something wrong in map.
I am trying to avoid collect.
collect fails with: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost, executor driver): com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
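
Regarding the encoder error quoted in the comments: Dataset.map takes an implicit Encoder for the result type, and "not enough arguments for method map" is what the compiler reports when none is in scope. A minimal sketch of a map that compiles, assuming the mapped result is a String (the object and method names here are hypothetical):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object EncoderSketch {
  def upperCaseNames(spark: SparkSession, names: Seq[String]): List[String] = {
    // Must be imported from an existing SparkSession value;
    // this is what provides Encoder[String] for as[String] and map.
    import spark.implicits._

    val ds: Dataset[String] = names.toDF("screenName").as[String]

    // The result type is String, so the imported encoder satisfies map.
    // The final collect is only there to show the result on tiny input.
    ds.map(_.toUpperCase).collect().toList
  }
}
```

If the function passed to map returns something that is neither a primitive nor a case class (a Row, for example), the import alone will not help and the error stays the same, which may be what happened above.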
