
I want to convert a Dask DataFrame to a Spark DataFrame.

Let's consider this example:

import dask.dataframe as dd
dask_df = dd.read_csv("file_name.csv")

# convert dask df to spark df
spark_df = spark_session.createDataFrame(dask_df)

But this is not working. Is there an alternative way to do this? Thanks in advance.

2 Answers


Writing the Dask DataFrame to disk with Dask and then reading it back with Spark is the best approach for bigger datasets.
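
Here's a minimal sketch of that round trip, assuming Parquet as the on-disk format, that both Dask and Spark can see the same filesystem, and that the "dask_output" directory name is hypothetical:

import dask.dataframe as dd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dask_df = dd.read_csv("file_name.csv")
# write the Dask DataFrame as Parquet; each partition becomes its own file
dask_df.to_parquet("dask_output", write_index=False)

# Spark reads the whole Parquet directory back as a Spark DataFrame
spark_df = spark.read.parquet("dask_output")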

Here's how you can convert smaller datasets.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pandas_df = dask_df.compute()  # materialize the whole Dask DataFrame in memory as pandas
pyspark_df = spark.createDataFrame(pandas_df)

I don't know of an in-memory way to convert a Dask DataFrame to a Spark DataFrame without a massive shuffle, but that'd certainly be a cool feature.



Your best alternative is to save the dataframe to files, e.g., in the Parquet format: dask_df.to_parquet(...). If your data is small enough, you could load it entirely into the client and feed the resulting pandas dataframe to Spark.
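
A minimal sketch of both options, assuming an existing SparkSession named spark and a hypothetical "output_dir" path that Spark can also read:

# big data: write Parquet with Dask, then read the directory with Spark
dask_df.to_parquet("output_dir")
spark_df = spark.read.parquet("output_dir")

# small data: pull everything into the client as pandas, then hand it to Spark
spark_df = spark.createDataFrame(dask_df.compute())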

Although it's possible to co-locate Spark and Dask workers on the same nodes, they will not be in direct communication with each other, and streaming large data via the client is not a good idea.

