A common task in my daily data wrangling is converting tab-delimited text files to xarray datasets, continuing the analysis on the dataset, and finally saving it to zarr or netCDF format.
I have developed a data pipeline that reads the data into a dask dataframe, converts it to a dask array, and assigns it to an xarray dataset.
The data is two-dimensional along dimensions "time" and "subid".
import dask.dataframe as dd
import xarray as xr

ddf = dd.read_table(some_text_file, ...)
data = ddf.to_dask_array(lengths=True)
attrs = {"about": "some attribute"}
ds = xr.Dataset({"parameter": (["time", "subid"], data, attrs)})
Now, this is working as intended. However, we have recently been doing many computationally heavy operations along the time dimension, so we often want to rechunk the data along that dimension. We usually follow the above up with:
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})
and then either save to disk directly or run some analysis first and then save to disk.
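For completeness, the save step at the end of the pipeline looks something like the following (the output paths are just placeholders):

ds_chunked.to_zarr("output.zarr", mode="w")
# or, for netCDF output:
ds_chunked.to_netcdf("output.nc")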
This is, however, causing quite a bottleneck, as the rechunking adds significant time to the pipeline that generates the datasets. I am aware that there is no magic way to speed up a rechunk and that one should avoid rechunking altogether when looking for performance improvements. So my question really is: can anyone think of a smart way to avoid the rechunk, or to make it faster? I've looked into reading the data into partitions by column, but I haven't found any information on dask dataframe reading data in partitions along columns. If I could read the data "column-wise" (i.e. so that each chunk is contiguous along the time dimension), I wouldn't have to rechunk at all.
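To illustrate what I mean by reading column-wise, here is a rough sketch of the kind of thing I have in mind, using dask.delayed instead of a dask dataframe. The path and dtype are placeholders, and it assumes a single input file whose row count can be read up front:

import dask
import dask.array as da
import pandas as pd
import xarray as xr

path = "some_text_file.tsv"  # placeholder

# Header only: the column names become the "subid" coordinate.
cols = list(pd.read_table(path, nrows=0).columns)
# Read a single column once just to learn the number of time steps.
n_time = len(pd.read_table(path, usecols=[cols[0]]))

@dask.delayed
def read_column(col):
    # Each task reads one column, i.e. one full slice along "time".
    return pd.read_table(path, usecols=[col])[col].to_numpy()

parts = [
    da.from_delayed(read_column(c), shape=(n_time,), dtype="float64")
    for c in cols
]
# Stacking along axis 1 gives chunks of shape (n_time, 1): already
# contiguous along "time", so no rechunk along "time" is needed afterwards.
data = da.stack(parts, axis=1)
ds = xr.Dataset({"parameter": (["time", "subid"], data)})

If one chunk per column is too fine-grained, merging chunks along "subid" only (with "time" already whole) should be much cheaper than the full rechunk, since it just concatenates neighbouring columns.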
To generate some data for this example, you can replace the read_table call with something like ddf = dask.datasets.timeseries().
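With that, a minimal reproducible version of the current pipeline could look like this (keeping only the numeric columns of the generated frame):

import dask.datasets
import xarray as xr

ddf = dask.datasets.timeseries()  # demo dataframe; x and y are float columns
data = ddf[["x", "y"]].to_dask_array(lengths=True)
ds = xr.Dataset({"parameter": (["time", "subid"], data)})
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})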
I have tried reading the data into a dask dataframe, but I have to rechunk every time, so I am looking for a way to read the data column-wise (along one dimension) from the start.