I'd like to plot 200 GB of the NYC taxi dataset. I managed to plot/visualize a pandas dataframe using datashader, but I didn't manage to do the same with a PySpark dataframe (using a 4-node cluster with 8 GB of RAM on each node). What I can do is use the .toPandas() method to convert the PySpark dataframe into a pandas dataframe. But this loads the entire dataframe into RAM on the driver node (which does not have enough RAM to fit the entire dataset), and therefore doesn't make use of the distributed power of Spark.
I also know that fetching only the pickup and dropoff longitudes and latitudes would bring the dataframe down to around ~30 GB. But that doesn't change the problem.
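For reference, here's the back-of-envelope arithmetic behind that ~30 GB figure (the row count is my rough estimate for the dataset, not an exact number):

```python
# Rough size estimate for keeping only the four coordinate columns.
rows = 1.1e9            # my rough estimate of the number of taxi trips
cols = 4                # pickup/dropoff longitude + latitude
bytes_per_float = 8     # float64
gb = rows * cols * bytes_per_float / 1e9
print(f"{gb:.0f} GB")   # → 35 GB, the same ballpark as my ~30 GB figure
```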
I've opened an issue about this on the datashader GitHub repository: Datashader issue opened
I have looked at Dask as an alternative, but it seems that converting a PySpark dataframe to a Dask dataframe is not supported yet.
Thank you for your suggestions!