Integer overflow and data corruption (uncompressed MDS file size is larger than 2^32 bytes)
Environment
- OS: Ubuntu 22.04
- Python: 3.10.13
- mosaicml-streaming: 0.7.4
To reproduce
from streaming import StreamingDataset

data = StreamingDataset(
    remote="gs://<redacted>",
    local="/tmp/data",
    split="validation",
    batch_size=1024,
    allow_unsafe_types=True,
)
for sample in data:
    pass
This will print:
...
/env/lib/python3.10/site-packages/streaming/base/format/mds/reader.py:143: RuntimeWarning: overflow encountered in scalar subtract
data = fp.read(end - begin)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/env/lib/python3.10/site-packages/streaming/base/dataset.py", line 1454, in __iter__
yield from map(self.__getitem__, self._each_sample_id(it))
File "/env/lib/python3.10/site-packages/streaming/base/array.py", line 90, in __getitem__
return self.get_item(at)
File "/env/lib/python3.10/site-packages/streaming/base/dataset.py", line 1210, in get_item
sample = shard[shard_sample_id]
File "/env/lib/python3.10/site-packages/streaming/base/array.py", line 90, in __getitem__
return self.get_item(at)
File "/env/lib/python3.10/site-packages/streaming/base/format/base/reader.py", line 319, in get_item
data = self.get_sample_data(idx)
File "/env/lib/python3.10/site-packages/streaming/base/format/mds/reader.py", line 145, in get_sample_data
raise IndexError(
IndexError: Relative sample index 90057 is not present in the 0000/shard.00000.mds file.
Expected behavior
No overflows or other errors.
Additional context
Unfortunately I can't share the data file, but the index.json looks like this:
{"shards": [{"column_encodings": ["int64", "str", "int32", "pkl", "int32", "int32", "int32", "str", "int32", "int32", "int16", "ndarray:int64", "ndarray:int32", "ndarray:int32", "ndarray:int32", "ndarray:int16", "float32", "str", "int32"], "column_names": ["battle_start_time", "battle_uuid", "event_slot", "frame", "game_mode", "game_mode_variation", "game_type", "game_version", "location_id", "map_type", "player_count", "players_avatar_id", "players_hero_id", "players_hero_score", "players_score", "results", "sample_weight", "simulator_version", "vum_tick"], "column_sizes": [8, null, 4, null, 4, 4, 4, null, 4, 4, 2, null, null, null, null, null, 4, null, 4], "compression": "zstd:3", "format": "mds", "hashes": ["xxh64"], "raw_data": {"basename": "shard.00000.mds", "bytes": 14783060602, "hashes": {"xxh64": "0b13b87eb91c3dde"}}, "samples": 538141, "size_limit": null, "version": 2, "zip_data": {"basename": "shard.00000.mds.zstd", "bytes": 246573839, "hashes": {"xxh64": "2c7c813c41f0820a"}}}], "version": 2}
Notably, the uncompressed size is 14783060602 bytes, which doesn't fit in a 32-bit unsigned integer.
When looking at the overflowing line data = fp.read(end - begin) with a debugger, I get these values:
ipdb> print(end)
1868767867
ipdb> print(begin)
2473939046
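These two values come from uint32 offsets that have wrapped past 2^32. As a minimal sketch (not the library's code) of the same wraparound, reusing the values from the debugger output above:

import numpy as np

# Per-sample byte offsets are stored as uint32 in the MDS shard, so any offset
# past 2**32 wraps around; the values below are taken from the debugger output.
begin = np.uint32(2473939046)
end = np.uint32(1868767867)  # the true end offset exceeds 2**32 and has wrapped

length = end - begin  # same "RuntimeWarning: overflow encountered in scalar subtract"
print(int(length))    # 3689796117, a bogus length instead of the real sample size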
If the 32-bit size constraint can't be lifted, it would be nice if it crashed/asserted already when writing the file, instead of when trying to read it.
Hello from the author of index.json and MDS, and of that particular decision to standardize all the ints MDS itself needs as u32 rather than u64. I had been wondering for the past couple of years if or when this would ever become an issue, and, fittingly enough, the answer turns out to be yes.
First, can you reduce your shard size? Our rule of thumb is very approximately 32MB, depending on a small number of mostly network-related factors like max concurrent connections and typical available bandwidth per connection.
All other arguments held the same, as you increase shard size from tens of MB upward, your training job will eventually start to sputter and finally hang, depending on your samples per second per replica and subject to factors like how shuffling distributes sample IDs, the shard size distribution, and your luck. There is also the Streaming Simulator for estimating these effects ahead of time.
While it is true that it's just as valid to use StreamingDataset without a remote (so that it does not stream from anywhere and expects all its shards to have been downloaded to local already), please still size your shards vaguely toward 32MB (designed for smooth, incrementally-downloading, lazy-loaded global random access) as the specifics of your situation warrant, and in any case keep each MDS shard file safely below the 4GB limit. If you aim for two targets, you will miss both. We aim for shards sized to eventually be streamed from, which is more practical than multi-GB monster shards in a number of arcane ways.
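As a sketch of that (the output path and columns below are placeholders loosely based on your index.json, not a drop-in script), capping size_limit when re-writing keeps every shard comfortably under the limit:

from streaming import MDSWriter

# Placeholder schema: a few columns borrowed from the index.json above.
columns = {
    'battle_start_time': 'int64',
    'battle_uuid': 'str',
    'frame': 'pkl',
}

# size_limit caps the uncompressed size of each shard, so the writer rolls over
# to a new shard file long before any uint32 offset could overflow.
with MDSWriter(out='gs://<redacted>/validation',
               columns=columns,
               compression='zstd:3',
               size_limit=32 * 1024 * 1024) as writer:
    for sample in samples:  # `samples` stands in for your existing iterable of dicts
        writer.write(sample)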
Yeah, confirmed that re-writing the data with more partitions (smaller shard size) fixes the problem.
I guess it's a valid design to limit the uncompressed partition size to 4GiB, though the UX is a bit rough, since it crashes with a scary exception during reading. It's also hard to know beforehand how many partitions are needed to stay under the limit.
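For what it's worth, a back-of-the-envelope estimate from the sizes in the index.json above:

# Roughly how many ~32MB shards the 14783060602-byte dataset above would need.
total_bytes = 14_783_060_602
target_shard_bytes = 32 * 2**20
print(total_bytes / target_shard_bytes)  # ~440.6, i.e. about 441 shards

(With size_limit set on the writer, this rollover happens automatically, as in the sketch above.)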
Thanks for a wonderful library in any case!
I ran into the same issue (filed and then closed #671).
My suggestion: can MDS somehow warn if the given size_limit is None or exceeds 2**32? At least this would prevent other people from accidentally hitting the same problem.
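Something like this hypothetical check at write time (illustration only, not the library's actual code) would turn the confusing read-time crash into an immediate, clear error:

# Hypothetical guard: reject size limits that the uint32 offsets inside an MDS
# shard cannot represent.
UINT32_MAX = 2**32 - 1

def check_size_limit(size_limit):
    if size_limit is None or size_limit > UINT32_MAX:
        raise ValueError(f'size_limit must be set and at most {UINT32_MAX} bytes, '
                         f'got {size_limit!r}')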
Hey @smspillaz and @jarnoseppanen-sc, thanks for raising this issue with us! The PR above (#672) addresses this bug by ensuring that the shard size limit can never go above 2**32 - 1. Hope this resolves your concerns.