
Error in Streaming Dataset Decompression in Distributed Setting

Open jasonkrone opened this issue 1 year ago • 2 comments

Environment

Enroot image built off the nvcr.io/nvidia/pytorch:24.11-py3 docker image.

  • OS: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Ubuntu 22.04) 20241208
  • Hardware (GPU, or instance type): Two nodes with 8xH100 each

Issue

I'm getting the following error at the os.rename(tmp_filename, raw_filename) line inside the _decompress_shard_part function in the Stream class:

FileNotFoundError: [Errno 2] No such file or directory: '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp' -> '/data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds'

Further Details

My data is stored on FSx and then loaded into the streaming dataset via the local option. When I check, the file /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds exists and /data/open-web-math/dev/shard-1-of-1-part-2-of-3/shard.00000.mds.tmp does not.

The issue appears to be intermittent (e.g., on a recent run it happened 3 times at the start, each time for a different .mds file, and then disappeared).
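For anyone hitting the same symptom, here is a minimal sketch of a more tolerant rename. It is not the library's code: `rename_if_needed` is a hypothetical helper, and it rests on the assumption that a missing .tmp file alongside an existing target means another process already completed the same rename.

```python
import os


def rename_if_needed(tmp_filename: str, raw_filename: str) -> bool:
    """Move tmp_filename to raw_filename, tolerating a rename done by someone else.

    Hypothetical helper, not part of the streaming library.
    Returns True if this call performed the rename, False if the target
    was already in place and the temp file was gone.
    """
    try:
        os.rename(tmp_filename, raw_filename)
        return True
    except FileNotFoundError:
        # The temp file is missing. If the final file exists, assume another
        # process finished the same decompression and renamed it first.
        if os.path.exists(raw_filename):
            return False
        # Neither file exists: this is a real failure, so re-raise it.
        raise
```

This only hides the symptom. It does not explain why two writers would reach the rename for the same shard, and it does not check that the existing target file is complete.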

Attempted Fix

I tried increasing retry from 7 to 20, but that didn't solve it.

To reproduce

Working on a repro script; it may take a while since my setup is pretty involved.

Expected behavior

Data should be decompressed without any error.

Ideas on cause

Initially, I thought the error was due to a race condition, but looking into StreamingDataset I see there are file locks to prevent that issue. So now I'm totally stumped on what's causing the problem.
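For reference, the observed state (target present, .tmp missing, FileNotFoundError on rename) is what an unsynchronized double rename would leave behind. The sketch below replays that sequence in a single process using a temporary directory. The function name and file contents are made up for illustration, and it does not show that the locks in StreamingDataset actually fail in this setup.

```python
import os
import tempfile


def simulate_unsynchronized_rename() -> str:
    """Replay two workers finishing the same shard with no lock between them."""
    with tempfile.TemporaryDirectory() as root:
        raw_filename = os.path.join(root, "shard.00000.mds")
        tmp_filename = raw_filename + ".tmp"

        # Workers A and B both decompress to the same temp path.
        for _worker in ("A", "B"):
            with open(tmp_filename, "wb") as out:
                out.write(b"decompressed bytes")

        # Worker A renames first, which removes the temp path.
        os.rename(tmp_filename, raw_filename)

        try:
            # Worker B arrives late and finds no temp file to rename.
            os.rename(tmp_filename, raw_filename)
        except FileNotFoundError:
            tmp_exists = os.path.exists(tmp_filename)
            raw_exists = os.path.exists(raw_filename)
            return f"tmp exists: {tmp_exists}, raw exists: {raw_exists}"
    return "no error"


print(simulate_unsynchronized_rename())
```

If the locks do hold across both nodes, this sequence cannot occur and the cause lies elsewhere.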

jasonkrone, Jan 15 '25 21:01

@snarayan21 - hoping you may have a suggestion here on how to fix!

jasonkrone, Jan 15 '25 21:01

FYI, I think this might be related to https://github.com/mosaicml/streaming/issues/824.

ethantang-db, Jan 15 '25 21:01