[DataPipe] Add RandomSplitter (without buffer)
Stack from ghstack:
- -> #724
This PR adds `RandomSplitter` without a buffer. The upside is that it uses less memory (good for memory-bound cases). The downsides are that 1) only one group can be iterated through at a time and 2) it skips over all the groups that do not match the target (which is potentially wasteful).
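The trade-off above can be illustrated with a minimal pure-Python sketch. It is not the code in this PR: `iter_split` and its parameters are hypothetical names, and per-item weighted draws are an assumption made for illustration. Nothing is buffered, so every pass walks the whole source and discards the items that belong to other groups.

```python
import random
from typing import Dict, Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_split(
    source: Iterable[T], weights: Dict[str, float], target: str, seed: int
) -> Iterator[T]:
    """Yield only the items assigned to `target`, holding no buffer (hypothetical helper)."""
    rng = random.Random(seed)
    keys = list(weights)
    probs = [weights[k] for k in keys]
    for item in source:
        # Every item consumes one random draw, even when it is skipped.
        # That is the "potentially wasteful" part of the buffer-less design.
        if rng.choices(keys, probs)[0] == target:
            yield item


weights = {"train": 0.8, "valid": 0.2}
train = list(iter_split(range(100), weights, "train", seed=0))
valid = list(iter_split(range(100), weights, "valid", seed=0))
```

Because both passes use the same seed, `train` and `valid` are disjoint and together cover the source, at the cost of reading the source twice.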
Implementation note:
- I decided against reusing `_ChildDataPipe` since its features are overly complicated for this use case.
- I also decided against having an option to change the seed automatically after each iteration, because there are situations where the first iteration is for `test` and the second iteration is for `valid`. Changing the seed would be confusing and cause inconsistency.
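To make the seed concern concrete, here is a small sketch under the same assumption of per-item weighted draws (`assign_groups` is a hypothetical helper, not part of this PR). With one seed for both passes, `test` and `valid` cannot overlap. If the seed changed between passes, nothing would prevent an item from landing in both.

```python
import random
from typing import Dict, List


def assign_groups(n_items: int, weights: Dict[str, float], seed: int) -> List[str]:
    """Return the group label drawn for each item position (hypothetical helper)."""
    rng = random.Random(seed)
    keys = list(weights)
    probs = [weights[k] for k in keys]
    return [rng.choices(keys, probs)[0] for _ in range(n_items)]


weights = {"test": 0.5, "valid": 0.5}
n_items = 1000

# Same seed on both passes: the label sequences agree, so the splits partition the data.
first_pass = assign_groups(n_items, weights, seed=0)
second_pass = assign_groups(n_items, weights, seed=0)
test_idx = {i for i, g in enumerate(first_pass) if g == "test"}
valid_idx = {i for i, g in enumerate(second_pass) if g == "valid"}
assert test_idx.isdisjoint(valid_idx)

# Different seed on the second pass: disjointness is no longer guaranteed.
reseeded = assign_groups(n_items, weights, seed=1)
leaked = test_idx & {i for i, g in enumerate(reseeded) if g == "valid"}
print(f"items in both test and valid after reseeding: {len(leaked)}")
```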
See #712 for related discussion. See #723 for the version with buffer.
Differential Revision: D38675266
Offline Discussion:
- This buffer-less version is likely better, but we need a clearer error message.
- Let's support both syntaxes: if "target" is provided, return only one DataPipe. Otherwise, return a list of DataPipes. Look at the first commit.
- We definitely want `set_seed` to allow changing the `seed`.
- The default behavior should be the same seed every epoch. We can have an argument to allow automatically changing the seed between epochs.
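The points above could look roughly like the following pure-Python sketch. The names `random_split`, `SplitView`, and `_Shared` are hypothetical and the real DataPipe signature may differ. With a `target` it returns one view, without one it returns a view per group, and `set_seed` changes the seed shared by all views.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence, Union


class _Shared:
    """State shared by all views of one split (hypothetical)."""

    def __init__(self, data: Sequence, weights: Dict[str, float], seed: int) -> None:
        self.data = data
        self.keys = list(weights)
        self.probs = [weights[k] for k in self.keys]
        self.seed = seed


class SplitView:
    """Iterable over the items of a single group."""

    def __init__(self, shared: _Shared, target: str) -> None:
        self._shared = shared
        self._target = target

    def set_seed(self, seed: int) -> "SplitView":
        # Changing the seed affects every view built from the same split.
        self._shared.seed = seed
        return self

    def __iter__(self) -> Iterator:
        s = self._shared
        # Re-seeding on every pass gives the same split every epoch by default.
        rng = random.Random(s.seed)
        for item in s.data:
            if rng.choices(s.keys, s.probs)[0] == self._target:
                yield item


def random_split(
    data: Sequence, weights: Dict[str, float], seed: int, target: Optional[str] = None
) -> Union[SplitView, List[SplitView]]:
    shared = _Shared(data, weights, seed)
    if target is not None:
        return SplitView(shared, target)  # single-group syntax
    return [SplitView(shared, key) for key in shared.keys]  # one view per group


data = list(range(50))
weights = {"train": 0.7, "valid": 0.3}
train, valid = random_split(data, weights, seed=1)
only_train = random_split(data, weights, seed=1, target="train")
```

Keeping the seed in shared state is one way to make sure a call to `set_seed` on any view keeps the groups mutually consistent.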
@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Can we derive total_length from source Datapipe if possible?
> Can we derive `total_length` from source Datapipe if possible?
Updated the implementation to do that with an exception when it cannot infer length from the source.
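A minimal sketch of that fallback, assuming the length is taken from `len()` on the source (`resolve_total_length` is a hypothetical helper, not the function in this PR):

```python
from typing import Any, Optional


def resolve_total_length(source: Any, total_length: Optional[int] = None) -> int:
    """Prefer an explicit length, otherwise try to infer it from the source."""
    if total_length is not None:
        return total_length
    try:
        return len(source)
    except TypeError as exc:
        # Sources such as generators do not define __len__.
        raise TypeError(
            "Cannot infer the length of the source; please pass `total_length` explicitly."
        ) from exc


known = resolve_total_length([10, 20, 30])
explicit = resolve_total_length((x for x in range(3)), total_length=3)
```

Raising with an actionable message also speaks to the request for a clearer error message.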
Thanks for the helpful comments. It is simpler than before now!