Any recommended ways to make PyTorch DataLoader (torch.utils.data.DataLoader) work in a distributed environment, both single machine and multiple machines? Can it be done without DistributedDataParallel?
1 Answer
You may want to clarify your question. DistributedDataParallel (abbreviated DDP) is what you use to train a model in a distributed environment; this question seems to be asking how to arrange dataset loading for distributed training.
First of all, torch.utils.data.DataLoader works for both distributed and non-distributed training; usually there is nothing special you need to do to the DataLoader itself.
However, the sampling strategy differs between these two modes. For distributed training, you need to pass a sampler to the DataLoader (the sampler argument), and using torch.utils.data.distributed.DistributedSampler is the simplest way.
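A minimal sketch of how the pieces fit together, assuming the process group has already been initialized (e.g. launched with torchrun); the dataset here is just a stand-in:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Stand-in dataset; replace with your own.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# DistributedSampler splits the dataset across processes so each rank
# sees a disjoint shard of the data. It infers world size and rank from
# the default process group.
sampler = DistributedSampler(dataset, shuffle=True)

loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,   # do not also pass shuffle=True to the DataLoader
    num_workers=2,
    pin_memory=True,
)

for epoch in range(10):
    # set_epoch reseeds the shuffle so each epoch produces different shards.
    sampler.set_epoch(epoch)
    for inputs, labels in loader:
        ...  # forward/backward pass, typically with a DDP-wrapped model
```

Because DistributedSampler gets the world size and rank from the process group, the same code works on a single multi-GPU machine or across several machines. In non-distributed runs you simply drop the sampler and use shuffle=True on the DataLoader instead.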