
I'm trying to create a Dataset from 4 arrays. I have arrays like this:

// Array 1
import org.apache.spark.sql.functions.collect_list
import scala.collection.JavaConverters._
val rawValues = rawData.select(collect_list("rawValue")).first().getList[Double](0).asScala.toArray

// Array 2 
var trendValues = Array[Double]()

// Array 3 
var seasonalValues = Array[Double]()

// Array 4     
var remainder = Array[Double]()

I populate the last 3 arrays with some computations (not included here) on the first array. All 4 arrays are of equal size, and the first array is built by converting another dataset's rawValue column into an array, as shown above.

After doing all the computations, I want to create a Dataset with 4 separate columns, one for each of the 4 arrays above.

So basically, how can I create a Dataset from arrays? I'm struggling with this.

Please help.

  • Why are you converting the Dataset to an array? Can you convert the arrays to Datasets and join them all together? Commented May 8, 2020 at 2:15
  • To create array 1, I am using a specific column of a dataset, to which I need to apply a variety of operations to compute the other 3 arrays. So in the end I will have 4 arrays and need to turn them into 4 columns of a Dataset. Commented May 8, 2020 at 2:19
  • OK, can you add sample data for those and the final output you want? Commented May 8, 2020 at 2:20

1 Answer


You just need to combine them into a Sequence:

case class ArrayMap(rawValues: Double, trendValues: Double, seasonalValues: Double, remainder: Double)

import spark.implicits._
val data = for (i <- arr1.indices) yield ArrayMap(arr1(i), arr2(i), arr3(i), arr4(i))
data.toDF()

//or else, but takes more steps
arr1.zip(arr2).zip(arr3).zip(arr4)
  .map(a => ArrayMap(a._1._1._1, a._1._1._2, a._1._2, a._2))
  .toSeq.toDF()
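
Since you want a Dataset rather than a DataFrame, the same implicits also give you toDS() (a small sketch, reusing the data value from above):

// Produces a typed Dataset[ArrayMap] instead of an untyped DataFrame
val ds = data.toDS()
ds.show()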

Use zipAll if Arrays are of different sizes.
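For example (a sketch, using 0.0 as an assumed padding value for whichever array is shorter):

// zipAll pads the shorter collection with the given defaults instead of dropping elements
val padded = arr1.zipAll(arr2, 0.0, 0.0)   // Array[(Double, Double)]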

EDIT:

I am not sure how the data flows downstream, but if you are creating all 4 arrays from a DataFrame, I would suggest transforming it within the DataFrame instead of taking this approach (especially if the data size is large).
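
A rough sketch of that idea, where lit(0.0) just stands in for whatever actually computes the trend and seasonal values, and the column names follow the question:

import org.apache.spark.sql.functions._

// Stay inside the DataFrame: derive the extra columns instead of collecting to arrays
val result = rawData
  .withColumn("trendValue", lit(0.0))      // placeholder for the real trend computation
  .withColumn("seasonalValue", lit(0.0))   // placeholder for the real seasonal computation
  .withColumn("remainder", col("rawValue") - col("trendValue") - col("seasonalValue"))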


Comments

Can you suggest a way in which I can define the schema first and then put all the arrays into the dataset according to that schema?
Map them to a case class instead of tuples: case class ArrayCol(first: Double, second: String, ...). It would look like Seq(ArrayCol).toDF()
You can modify your answer according to your latest comment. If it works, I will accept it. Thanks
In the "or else" option you have given, toDF() doesn't work, i.e. after toSeq, I am not getting toDF(). I am looking to convert into a Dataset
Sorry, I was doing something wrong. It works. I think in your answer you should also compare the difference between the two options you gave and indicate which one should be used for better efficiency and speed.