A common task in my daily data wrangling is converting tab-delimited text files to xarray datasets, continuing the analysis on the dataset, and finally saving it to zarr or netCDF format.
I have developed a data pipeline that reads the data into a dask dataframe, converts it to a dask array, and assigns it to an xarray dataset.
The data is two-dimensional along dimensions "time" and "subid".
import dask.dataframe as dd
import xarray as xr

ddf = dd.read_table(some_text_file, ...)
data = ddf.to_dask_array(lengths=True)
attrs = {"about": "some attribute"}
ds = xr.Dataset({"parameter": (["time", "subid"], data, attrs)})
Now, this is working as intended. However, we have recently been doing many computationally heavy operations along the time dimension, so we often want to rechunk the data along that dimension. We usually follow the above up with:
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})
and then either save to disk directly or run some analysis first and then save to disk.
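For completeness, the save step at the end of the pipeline looks something like the following (the output paths are just placeholders):

ds_chunked.to_zarr("output.zarr", mode="w")
# or, for netCDF output:
ds_chunked.to_netcdf("output.nc")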
This is, however, causing quite a bottleneck, as the rechunking adds significant time to the pipeline that generates the datasets. I am aware that there is no magic way to speed up a rechunk and that one should avoid rechunking altogether when looking for performance improvements. So my question really is: can anyone think of a smart way to avoid the rechunk, or to make it faster? I've looked into reading the data into partitions by column, but I haven't found any information on dask dataframe reading data in partitions along columns. If I could read the data "column-wise" (i.e. so that each chunk is contiguous along the time dimension), I wouldn't have to rechunk at all.
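To illustrate what I mean by reading column-wise, here is a rough sketch of the kind of thing I have in mind, using dask.delayed instead of a dask dataframe. The path and dtype are placeholders, and it assumes a single input file whose row count can be read up front:

import dask
import dask.array as da
import pandas as pd
import xarray as xr

path = "some_text_file.tsv"  # placeholder

# Header only: the column names become the "subid" coordinate.
cols = list(pd.read_table(path, nrows=0).columns)
# Read a single column once just to learn the number of time steps.
n_time = len(pd.read_table(path, usecols=[cols[0]]))

@dask.delayed
def read_column(col):
    # Each task reads one column, i.e. one full slice along "time".
    return pd.read_table(path, usecols=[col])[col].to_numpy()

parts = [
    da.from_delayed(read_column(c), shape=(n_time,), dtype="float64")
    for c in cols
]
# Stacking along axis 1 gives chunks of shape (n_time, 1): already
# contiguous along "time", so no rechunk along "time" is needed afterwards.
data = da.stack(parts, axis=1)
ds = xr.Dataset({"parameter": (["time", "subid"], data)})

If one chunk per column is too fine-grained, merging chunks along "subid" only (with "time" already whole) should be much cheaper than the full rechunk, since it just concatenates neighbouring columns.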
To generate some data for this example, you can replace the read_table call with something like ddf = dask.datasets.timeseries().
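With that, a minimal reproducible version of the current pipeline could look like this (keeping only the numeric columns of the generated frame):

import dask.datasets
import xarray as xr

ddf = dask.datasets.timeseries()  # demo dataframe; x and y are float columns
data = ddf[["x", "y"]].to_dask_array(lengths=True)
ds = xr.Dataset({"parameter": (["time", "subid"], data)})
ds_chunked = ds.chunk({"time": -1, "subid": "auto"})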
I have tried reading the data into a dask dataframe, but I have to rechunk every time, so I am looking for a way to read the data column-wise (along one dimension) from the start.