Shuffler inside a Zipper only shuffles some elements
🐛 Describe the bug
d1 and d2 have different lengths; d3 is a Zipper that contains them.
import torchdata.datapipes as dp
d1 = dp.map.SequenceWrapper(['0', '1', '2', '3'])
d1 = dp.map.Shuffler(d1)
d2 = dp.map.SequenceWrapper(['a', 'b', 'c', 'd', 'e', 'f'])
d2 = dp.map.Shuffler(d2)
d3 = dp.map.Zipper(d2, d1)
from torch.utils.data import DataLoader
dl = DataLoader(d3, batch_size=None, num_workers=1, shuffle=True)
for i in range(10):
    o = []
    for x in dl:
        o.append(x)
    print(o)
The results:
[['f', '2'], ['a', '3'], ['e', '0'], ['c', '1']]
[['e', '0'], ['c', '1'], ['f', '2'], ['a', '3']]
[['c', '1'], ['a', '3'], ['f', '2'], ['e', '0']]
[['e', '0'], ['a', '3'], ['c', '1'], ['f', '2']]
[['e', '0'], ['c', '1'], ['f', '2'], ['a', '3']]
[['c', '1'], ['e', '0'], ['f', '2'], ['a', '3']]
[['a', '3'], ['e', '0'], ['f', '2'], ['c', '1']]
[['a', '3'], ['c', '1'], ['e', '0'], ['f', '2']]
[['c', '1'], ['e', '0'], ['f', '2'], ['a', '3']]
[['e', '0'], ['c', '1'], ['a', '3'], ['f', '2']]
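As we can see, all 10 runs contain only a subset of d2's elements: 'b' and 'd' never appear, and each d2 element is always paired with the same d1 element. Only the order of the pairs changes.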
Versions
Collecting environment information...
PyTorch version: 1.12.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-6ubuntu2) 7.5.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-124-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB
Nvidia driver version: 510.85.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy==0.971
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.7.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==1.12.1
[pip3] torch-complex==0.4.3
[pip3] torch-optimizer==0.3.0
[pip3] torch-stoi==0.1.2
[pip3] torchaudio==0.12.1
[pip3] torchdata==0.4.1
[pip3] torchmetrics==0.9.3
[pip3] torchvision==0.13.1
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.6.0 hecad31d_10 conda-forge
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7e14d7c_0 conda-forge
[conda] mkl_fft 1.3.1 py39h0c7bc48_1 conda-forge
[conda] mkl_random 1.2.2 py39hde0f152_0 conda-forge
[conda] numpy 1.22.4 pypi_0 pypi
[conda] pytorch 1.12.1 py3.9_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-lightning 1.7.3 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-complex 0.4.3 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torch-stoi 0.1.2 pypi_0 pypi
[conda] torchaudio 0.12.1 py39_cu116 pytorch
[conda] torchdata 0.4.1 pypi_0 pypi
[conda] torchmetrics 0.9.3 pypi_0 pypi
[conda] torchvision 0.13.1 py39_cu116 pytorch
The iterable Shuffler with Zipper works correctly for datapipes of different lengths. The problem seems to be limited to the map-style Shuffler and Zipper.
Thank you for reporting this. I am currently working on a PR to enable proper shuffling for MapDataPipe. The behavior above occurs because the map-style Shuffler is not reshuffled per epoch.
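To illustrate why a shuffler that is not reshuffled per epoch produces the output in the report, here is a minimal pure-Python sketch. It is a simplified model, not torchdata's actual implementation: `FixedShuffler`, `MinLenZipper`, and `run_epochs` are hypothetical names, and it assumes the zipper's length is that of its shortest input and that the outer `shuffle=True` only reorders the indices visited each epoch.

```python
import random


class FixedShuffler:
    """Hypothetical stand-in for a map-style shuffler that permutes once."""

    def __init__(self, data, seed=None):
        self.data = list(data)
        # The permutation is drawn once here and never refreshed.
        self.perm = list(range(len(self.data)))
        random.Random(seed).shuffle(self.perm)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[self.perm[index]]


class MinLenZipper:
    """Hypothetical zipper whose length is that of its shortest input."""

    def __init__(self, *pipes):
        self.pipes = pipes

    def __len__(self):
        return min(len(p) for p in self.pipes)

    def __getitem__(self, index):
        return tuple(p[index] for p in self.pipes)


def run_epochs(zipped, num_epochs, seed=0):
    """Visit indices 0..len-1 in a fresh random order each epoch."""
    rng = random.Random(seed)
    epochs = []
    for _ in range(num_epochs):
        order = list(range(len(zipped)))
        rng.shuffle(order)  # models an outer per-epoch index shuffle
        epochs.append([zipped[i] for i in order])
    return epochs


d1 = FixedShuffler(["0", "1", "2", "3"], seed=1)
d2 = FixedShuffler(["a", "b", "c", "d", "e", "f"], seed=2)
d3 = MinLenZipper(d2, d1)

epochs = run_epochs(d3, num_epochs=10)
for epoch in epochs:
    print(epoch)

# Collect every d2 element that was ever produced across all epochs.
seen_d2 = {pair[0] for epoch in epochs for pair in epoch}
print(sorted(seen_d2))
```

In this model only indices 0 to 3 are ever requested, and each index maps to a fixed pair, so every epoch yields the same four pairs in a different order and two elements of the longer input are never seen. Redrawing the permutation at the start of each epoch would let all six elements appear over time.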
https://github.com/pytorch/pytorch/pull/83202 has landed to make sure Shuffler is properly reshuffled per epoch. I am still working on https://github.com/pytorch/pytorch/pull/82975 so that MapDataPipe is seeded properly by DataLoader.
I will post here once that PR has landed so you can test it with the nightly releases.
Great! Thank you.