I'd like to plot 200 GB of the NYC taxi dataset. I managed to plot/visualize a pandas dataframe using datashader, but I didn't manage to do the same with a PySpark dataframe (using a 4-node cluster with 8 GB of RAM on each node). What I can do is use the .toPandas() method to convert the PySpark dataframe into a pandas dataframe. But this loads the entire dataframe into RAM on the driver node (which does not have enough RAM to fit the entire dataset), and therefore doesn't make use of the distributed power of Spark.
I also know that fetching only the pickup and dropoff longitudes and latitudes would bring the dataframe down to around ~30 GB. But that doesn't change the problem.
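For reference, here's the back-of-envelope arithmetic behind that ~30 GB figure (the row count is my rough estimate for the dataset, not an exact number):

```python
# Rough size estimate for keeping only the four coordinate columns.
rows = 1.1e9            # my rough estimate of the number of taxi trips
cols = 4                # pickup/dropoff longitude + latitude
bytes_per_float = 8     # float64
gb = rows * cols * bytes_per_float / 1e9
print(f"{gb:.0f} GB")   # → 35 GB, the same ballpark as my ~30 GB figure
```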
I've opened an issue about this on the datashader GitHub repository: Datashader issue opened
I have looked at Dask as an alternative, but it seems that converting a PySpark dataframe to a Dask dataframe is not supported yet.
Thank you for your suggestions!