A common task in my daily data wrangling is converting tab-delimited text files to xarray datasets, continuing the analysis on the dataset, and saving the result to zarr or netCDF format.

I have developed a data pipeline that reads the data into a dask dataframe, converts it to a dask array, and assigns it to an xarray dataset:

The data is two-dimensional, with dimensions "time" and "subid".

import dask.dataframe as dd
import xarray as xr

ddf = dd.read_table(some_text_file, ...)
data = ddf.to_dask_array(lengths=True)  # lengths=True computes the chunk sizes up front
attrs = {"about": "some attribute"}

ds = xr.Dataset({"parameter": (["time", "subid"], data, attrs)})

Now, this works fine as intended. However, we have recently been doing many computationally heavy operations along the time dimension, so we often want to rechunk the data along that dimension. We usually follow the above with:

ds_chunked = ds.chunk({"time": -1, "subid": "auto"})

and then either save to disk directly, or do some analysis first and then save.
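
The save step itself is just the standard xarray writers; the output paths here are placeholders:

# Persist the rechunked dataset; output paths are placeholders.
ds_chunked.to_zarr("some_output.zarr", mode="w")
# or: ds_chunked.to_netcdf("some_output.nc")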

This, however, is quite a bottleneck: the rechunking adds significant time to the pipeline that generates the datasets. I am aware that there is no general work-around for the cost of rechunking, and that one should avoid it when looking for performance improvements. So my question really is: can anyone think of a smart way to avoid the rechunk, or to speed it up? I have looked into reading the data into partitions by column, but I haven't found any info on dask dataframe reading data in partitions along columns. If I could "read the data column-wise" (i.e. along the time dimension), I wouldn't have to rechunk; a rough sketch of what I have in mind is shown below.
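
For illustration only, something like this hypothetical per-column reader is what I am imagining. The column names, row count, and some_text_file are placeholders, and this parses the file once per column, so I doubt it is actually faster; it just shows the chunk layout I am after:

import dask
import dask.array as da
import pandas as pd

@dask.delayed
def read_column(path, col):
    # Hypothetical helper: parse only one column of the tab-delimited file.
    return pd.read_table(path, usecols=[col])[col].to_numpy()

columns = ["sub_0", "sub_1", "sub_2"]  # placeholder subid column names
n_time = 1_000                         # number of rows, assumed known up front

arrays = [
    da.from_delayed(read_column(some_text_file, col), shape=(n_time,), dtype=float)
    for col in columns
]
# Stacking along axis 1 gives one chunk per column, each spanning the
# full "time" dimension, so no rechunk is needed afterwards.
data = da.stack(arrays, axis=1)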

To generate some data for this example, you can replace the read_table call with something like ddf = dask.datasets.timeseries().
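
For instance, a minimal reproducible version of the pipeline might look like this (timeseries() returns mixed-dtype demo columns, so I keep only the numeric x and y; treating them as two "subid" columns is just for illustration):

import dask.datasets
import xarray as xr

ddf = dask.datasets.timeseries()                    # demo dataframe indexed by timestamps
data = ddf[["x", "y"]].to_dask_array(lengths=True)  # keep only the float columns

ds = xr.Dataset({"parameter": (["time", "subid"], data)})
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})  # the expensive rechunking step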

I tried reading the data into a dask dataframe, but I have to rechunk every time, so I am looking for a way to read the data column-wise (along one dimension) from the start.

  • Please remember that Stack Overflow is not your favourite Python forum, but rather a question and answer site for all programming related questions. Thus, please always include the tag of the language you are programming in, that way other users familiar with that language can more easily find your question. Take the tour and read up on How to Ask to get more information on how this site works, then edit the question with the relevant tags. Commented Dec 20, 2022 at 8:21
  • interesting question! can you tell us more about the structure of the data? I think the order of iteration in the columns will end up mattering a lot. could you take your guidance to use dask.datasets.timeseries() (great idea btw) and turn it into a full minimal reproducible example? also, more info about the number/sizes of files and your available cluster workers/cpus/memory would be great. Commented Dec 20, 2022 at 16:47
