
Replication changes sample order

Open CodeCreator opened this issue 1 year ago • 3 comments

Environment

  • mosaicml-streaming==0.7.5

To reproduce

Steps to reproduce the behavior:

  1. Use StreamingDataset in distributed training with the same seed, setting replication either to None or to an integer > 1
  2. Print the samples seen across all devices, ignoring duplicated samples
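
For step 2, one way to compare runs is sketched below. This is only a minimal illustration, not code from the streaming library: the helper name `global_order`, the nested-list input layout, and the sample ids are all hypothetical. It assumes each rank has logged the sample ids it saw at every step, and it merges them step by step while dropping duplicates.

```python
from typing import Hashable, List, Sequence


def global_order(
    per_rank_batches: Sequence[Sequence[Sequence[Hashable]]],
) -> List[Hashable]:
    """Merge per-rank logs into one deduplicated global sample order.

    per_rank_batches[rank][step] holds the sample ids that `rank` saw at
    `step`. This layout is an assumption made for illustration only.
    """
    seen = set()
    order: List[Hashable] = []
    # Only compare the steps that every rank has logged.
    num_steps = min(len(batches) for batches in per_rank_batches)
    for step in range(num_steps):
        for batches in per_rank_batches:
            for sample_id in batches[step]:
                # Replicated ranks repeat ids, so keep the first sighting only.
                if sample_id not in seen:
                    seen.add(sample_id)
                    order.append(sample_id)
    return order


# Made-up logs for 4 ranks and a single step (not real library output).
no_replication = [[[0, 1]], [[2, 3]], [[4, 5]], [[6, 7]]]
# With replication=2, ranks 0/1 and ranks 2/3 would see identical samples.
replication_2 = [
    [[0, 1, 2, 3]],
    [[0, 1, 2, 3]],
    [[4, 5, 6, 7]],
    [[4, 5, 6, 7]],
]

same_order = global_order(no_replication) == global_order(replication_2)
print("same global order:", same_order)
```

With the made-up logs above the two orders match, which is the expected behavior described below. In the runs reported here, the equivalent comparison on real logs does not match.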

Expected behavior

The overall sample order should be the same regardless of the replication setting, but using replication seems to lead to a different random shuffling of the data.

CodeCreator avatar Jul 15 '24 16:07 CodeCreator

Hey! We don't currently guarantee deterministic sample order if replication changes, but I see how that would be useful. Will take note of this request. Thanks!

snarayan21 avatar Jul 23 '24 09:07 snarayan21

@CodeCreator do you see this even when going from replication 2 -> replication 4, for example?

snarayan21 avatar Jul 23 '24 09:07 snarayan21

@snarayan21 Yeah, I'm also seeing changes to the sample order when changing the replication factor.

CodeCreator avatar Jul 30 '24 18:07 CodeCreator