I am trying to convert a CSV file to a DataFrame in Spark 1.5.2 with Scala, without using the Databricks spark-csv library, as I am working on a community project where this library is not available. My approach was the following:
var inputPath = "input.csv"
var text = sc.textFile(inputPath)
var rows = text.map(line => line.split(",").map(_.trim))
var header = rows.first()
var data = rows.filter(_(0) != header(0))
var df = sc.makeRDD(1 to data.count().toInt)
  .map(i => (data.take(i).drop(i - 1)(0)(0),
             data.take(i).drop(i - 1)(0)(1),
             data.take(i).drop(i - 1)(0)(2),
             data.take(i).drop(i - 1)(0)(3),
             data.take(i).drop(i - 1)(0)(4)))
  .toDF(header(0), header(1), header(2), header(3), header(4))
This code, even though it is quite a mess, runs without returning any error messages. The problem comes when I try to display the data inside df in order to verify the correctness of this method and later run some queries on df. The error I get after executing df.show() references SPARK-5063. My questions are:
1) Why is it not possible to print the content of df?
2) Is there a more straightforward method to convert a CSV to a DataFrame in Spark 1.5.2 without using the Databricks spark-csv library?
The spark-csv plug-in has been merged into the Spark 2.x core libraries; for 1.x, have you tried adding its JAR at runtime with --jars, along with its commons-csv dependency? It worked pretty well for me with the Spark bundled in the CDH distro (note that with an Apache build, --jars did not work well with CDH; I had to go for the spark.driver.extraClassPath prop and an explicit sc.addJar() as a workaround).
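As for question 1: the map closure calls data.take(i), i.e. an RDD operation nested inside another RDD's transformation, which Spark forbids; that is exactly the situation SPARK-5063 describes (RDD transformations and actions can only be invoked by the driver). For question 2, here is a minimal sketch of a plain-Spark alternative that avoids nested RDD operations, assuming an existing SQLContext named sqlContext (as in spark-shell) and a simple CSV with no quoted or embedded commas; every column is typed as a string:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read and naively split each line (no support for quoted fields)
val lines = sc.textFile("input.csv")
val rows = lines.map(_.split(",").map(_.trim))

// Grab the header and drop it from the data by comparing whole lines
val header = rows.first()
val data = rows.filter(!_.sameElements(header))

// Build a schema from the header, typing every column as a nullable string
val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))

// Convert each Array[String] into a Row and create the DataFrame on the driver
val df = sqlContext.createDataFrame(data.map(Row.fromSeq(_)), schema)

df.show()

From there you can query it with the Spark 1.x API, e.g. df.registerTempTable("csv") followed by sqlContext.sql("SELECT * FROM csv").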