Error in Streaming Dataset Decompression in Distributed Setting
Environment
Enroot image built off the nvcr.io/nvidia/pytorch:24.11-py3 Docker image.
- OS: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Ubuntu 22.04) 20241208
- Hardware (GPU, or instance type): Two nodes with 8xH100 each
Issue
In the os.rename(tmp_filename, raw_filename) line here, inside the _decompress_shard_part function of the Stream class, I'm getting the error:
FileNotFoundError: [Errno 2] No such file or directory: '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp' -> '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds'
Further Details
My data is stored on FSx and loaded into the streaming dataset via the local option. When I check, the file /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds exists and /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp does not.
The issue is non-deterministic and only occurs sometimes (e.g., on a recent run it happened 3x at the start, for different .mds files, and then disappeared).
Attempted Fix
I tried increasing the retry count here from 7 to 20, but that didn't solve it.
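Another stopgap I've considered (just a sketch of a wrapper I'd write myself, not anything from the streaming library): since the destination .mds always exists when the rename fails, the error could be tolerated in that specific case.

```python
import os


def rename_tolerant(tmp_filename: str, raw_filename: str) -> None:
    """Hypothetical helper: rename, but treat a missing source as success
    when the destination already exists (i.e., another worker apparently
    finished the same decompression first)."""
    try:
        os.rename(tmp_filename, raw_filename)
    except FileNotFoundError:
        # If the destination is missing too, something is genuinely wrong;
        # re-raise rather than hide the failure.
        if not os.path.exists(raw_filename):
            raise
```

I'm not suggesting this as the real fix, since it would only paper over whatever is deleting the .tmp file.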
To reproduce
Working on a repro script; it may take a bit, since my setup is fairly involved.
Expected behavior
Data should be decompressed without any error.
Ideas on cause
Initially, I thought the error was due to a race condition, but looking into StreamingDataset I see there are file locks meant to prevent that. So now I'm stumped about what's causing the problem.
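For reference, the interleaving I initially had in mind (two workers sharing one .tmp path, with the loser's rename failing) reproduces the exact state I observe. This is a deterministic single-process replay using plain os calls, not the library's actual code:

```python
import os
import tempfile

# Replay the suspected interleaving: worker A renames the shared tmp file
# first; worker B's rename of the now-missing tmp file then fails.
with tempfile.TemporaryDirectory() as d:
    raw = os.path.join(d, 'shard.00000.mds')
    tmp = raw + '.tmp'

    # Both workers "finish writing" the shared tmp file.
    with open(tmp, 'wb') as f:
        f.write(b'decompressed shard bytes')

    os.rename(tmp, raw)       # worker A wins the race
    try:
        os.rename(tmp, raw)   # worker B: tmp is already gone
    except FileNotFoundError as e:
        print('B lost the race:', e)

    # Matches the reported state: .mds exists, .tmp does not.
    print('dest exists:', os.path.exists(raw),
          '| tmp exists:', os.path.exists(tmp))
```

Given the file locks, I don't see how this interleaving could actually happen here, but the symptom matches it exactly.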
@snarayan21 - hoping you may have a suggestion here on how to fix!
FYI, I think this might be related to the same issue as https://github.com/mosaicml/streaming/issues/824.