
I'm trying to create a Dataset from 4 arrays. I have arrays like this:

// Array 1
import org.apache.spark.sql.functions.collect_list
import scala.collection.JavaConverters._
val rawValues = rawData.select(collect_list("rawValue")).first().getList[Double](0).asScala.toArray

// Array 2 
var trendValues = Array[Double]()

// Array 3 
var seasonalValues = Array[Double]()

// Array 4     
var remainder = Array[Double]()

I populate the last 3 arrays with some computations (not included here) on the first array. All 4 arrays are of equal size, and the first array is built by converting another dataset's rawValue column into an array, as shown above.

After doing all the computations, I want to create a Dataset with 4 separate columns, one for each of the 4 arrays above.

So basically, how can I create a Dataset from arrays? I'm struggling with this.

Please help.

  • Why are you converting the Dataset to an array? Can you convert the arrays to Datasets and join them all together? Commented May 8, 2020 at 2:15
  • To create array 1, I am using a specific column of a dataset, to which I need to apply a variety of operations to compute the other 3 arrays. So in the end I will have 4 arrays and need to turn them into 4 columns of a Dataset. Commented May 8, 2020 at 2:19
  • OK, can you add sample data for those and the final output you want? Commented May 8, 2020 at 2:20

1 Answer


You just need to combine them into a Sequence:

case class ArrayMap(rawValues: Double, trendValues: Double, seasonalValues: Double, remainder: Double)

import spark.implicits._
val data = for (i <- arr1.indices) yield ArrayMap(arr1(i), arr2(i), arr3(i), arr4(i))
data.toDF()

//or else, but takes more steps
arr1.zip(arr2).zip(arr3).zip(arr4)
  .map(a => ArrayMap(a._1._1._1, a._1._1._2, a._1._2, a._2))
  .toSeq.toDF()
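
Since you want a Dataset rather than a DataFrame, the same implicits also give you toDS() (a small sketch, reusing the data value from above):

// Produces a typed Dataset[ArrayMap] instead of an untyped DataFrame
val ds = data.toDS()
ds.show()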

Use zipAll if Arrays are of different sizes.
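For example (a sketch, using 0.0 as an assumed padding value for whichever array is shorter):

// zipAll pads the shorter collection with the given defaults instead of dropping elements
val padded = arr1.zipAll(arr2, 0.0, 0.0)   // Array[(Double, Double)]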

EDIT:

I am not sure how the data flows downstream, but if you are creating all 4 arrays from a DataFrame, I would suggest transforming it within the DataFrame instead of taking this approach (especially if the data size is large).
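
A rough sketch of that idea, where lit(0.0) just stands in for whatever actually computes the trend and seasonal values, and the column names follow the question:

import org.apache.spark.sql.functions._

// Stay inside the DataFrame: derive the extra columns instead of collecting to arrays
val result = rawData
  .withColumn("trendValue", lit(0.0))      // placeholder for the real trend computation
  .withColumn("seasonalValue", lit(0.0))   // placeholder for the real seasonal computation
  .withColumn("remainder", col("rawValue") - col("trendValue") - col("seasonalValue"))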


Comments

Can you suggest a way in which I can define the schema first and then put all the arrays into the dataset according to that schema?
Map them to a case class instead of tuples: case class ArrayCol(first: Double, second: String, ...). It would look like Seq(ArrayCol).toDF()
You can modify your answer according to your latest comment. If it works, I will accept it. Thanks
In the "or else" option you have given, toDF() doesn't work, i.e. after toSeq, I am not getting toDF(). I am looking to convert into a Dataset
Sorry, I was doing something wrong. It works. I think in your answer you should also compare the difference between the two options you gave and indicate which one should be used for better efficiency and speed.