Replication changes sample order
Environment
- mosaicml-streaming==0.7.5
To reproduce
Steps to reproduce the behavior:
- Use `StreamingDataset` in distributed training with the same seed and set `replication` either to None or an integer > 1
- Print out samples across all devices and ignore duplicated samples
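A minimal sketch of the comparison described in the steps above, assuming each rank logs the sample IDs it sees in iteration order. The helper names `global_order` and `first_divergence` and the toy ID lists are hypothetical and are not part of the streaming API. The step-by-step interleaving across ranks is also an assumption about how a "global" order is defined here.

```python
from itertools import zip_longest
from typing import Hashable, List, Optional, Sequence

_MISSING = object()  # sentinel for ranks that ran out of samples early


def global_order(per_rank: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Merge per-rank sample IDs step by step, keeping the first occurrence of each ID.

    per_rank[r] is the list of sample IDs that rank r saw, in iteration order.
    Duplicates (e.g. from replicated ranks) are dropped.
    """
    seen = set()
    order = []
    for step in zip_longest(*per_rank, fillvalue=_MISSING):
        for sample_id in step:
            if sample_id is _MISSING or sample_id in seen:
                continue
            seen.add(sample_id)
            order.append(sample_id)
    return order


def first_divergence(a: Sequence[Hashable], b: Sequence[Hashable]) -> Optional[int]:
    """Return the first index where two orders differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Toy data (made up for illustration, not real output from streaming).
no_replication = [[0, 2, 4, 6], [1, 3, 5, 7]]           # 2 ranks, replication=None
replication_2_same = [[0, 2, 4, 6], [0, 2, 4, 6],        # 4 ranks, pairs see the same samples
                      [1, 3, 5, 7], [1, 3, 5, 7]]
replication_2_diff = [[3, 1], [3, 1], [0, 2], [0, 2]]    # a differently shuffled run

baseline = global_order(no_replication)
print(first_divergence(baseline, global_order(replication_2_same)))
print(first_divergence(baseline[:4], global_order(replication_2_diff)))
```

In a real run, the per-rank lists would come from whatever each device prints or logs while iterating the dataset. The expected behavior in this report corresponds to `first_divergence` returning `None` when only `replication` changes.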
Expected behavior
The overall order of the samples should be the same regardless of the replication setting, but enabling replication seems to produce a different random shuffle of the data.
Hey! We don't currently guarantee deterministic sample order if replication changes, but I see how that would be useful. Will take note of this request. Thanks!
@CodeCreator do you see this even when going from replication 2 -> replication 4, for example?
@snarayan21 yeah I'm also seeing changes to the sample order when changing the replication factor