Dynamic bucket creation customization options
What does this PR do? Please describe: Add a new optional argument, bucket_creation_fn, to dynamic_bucket to allow the user to customize how buckets are created.
bucket_creation_fn allows the user to customize what dynamic_bucket yields as a bucket once the cost threshold is met as well as what remains in the subsequent bucket.
Does your PR introduce any breaking changes? If yes, please list them: N/A
Check list:
- [ ] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
- [ ] Did you read the contributor guideline?
- [ ] Did you make sure that your PR does only one thing instead of bundling different changes together?
- [ ] Did you make sure to update the documentation with your changes? (if necessary)
- [ ] Did you write any new necessary tests?
- [ ] Did you verify new and existing tests pass locally with your changes?
- [ ] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)
Do we have a specific use case where these two new parameters are required? If I recall correctly, we had a discussion about whether to include the example that exceeds the threshold within the returned bucket or to leave it in the buffer for the next bucket. It makes sense to expose it as an option. Just curious what other use cases we have to require this level of flexibility.
I took a closer look at the use case in question and it seems like we need more flexibility than just including/excluding the last example in a threshold-exceeding bucket. We're trying to allow the user to replicate fairseq1's batch_by_size: https://github.com/facebookresearch/fairseq/blob/main/fairseq/data/data_utils_fast.pyx#L20-L103
One capability the user needs in order to replicate batch_by_size is the ability to extract the highest multiple of a specified batch size from a bucket that causes the cost threshold to be exceeded, and to leave the rest to be included in the subsequent bucket. So it's not necessarily just a matter of excluding the last example, but of dynamically deciding how much of the bucket tail should be excluded. In other words, the user needs a mechanism to, for example, yield only part of the bucket that exceeds the threshold and keep the rest for the subsequent bucket.
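To make the "highest multiple of a batch size" idea concrete, here is a rough, standalone sketch in plain Python. It does not use the real dynamic_bucket API: `bucket_by_cost`, `make_multiple_of_splitter`, and the `split_fn` signature (bucket in, `(bucket_to_yield, remainder)` out) are hypothetical names chosen for illustration, and the actual bucket_creation_fn signature in this PR may differ.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical splitter type: takes the over-threshold bucket and returns
# (bucket_to_yield, remainder_kept_for_next_bucket).
SplitFn = Callable[[List[T]], Tuple[List[T], List[T]]]


def make_multiple_of_splitter(batch_size: int) -> SplitFn:
    """Build a splitter that yields the largest multiple of batch_size."""

    def split(bucket: List[T]) -> Tuple[List[T], List[T]]:
        keep = (len(bucket) // batch_size) * batch_size
        if keep == 0:
            # Assumption: a bucket smaller than batch_size is yielded whole
            # so that the buffer cannot grow without bound.
            return list(bucket), []
        return list(bucket[:keep]), list(bucket[keep:])

    return split


def bucket_by_cost(
    items: Iterable[T],
    threshold: float,
    cost_fn: Callable[[T], float],
    split_fn: SplitFn,
) -> Iterator[List[T]]:
    """Toy stand-in for cost-based bucketing with a customizable split."""
    buffer: List[T] = []
    cost = 0.0
    for item in items:
        buffer.append(item)
        cost += cost_fn(item)
        if cost >= threshold:
            # Let the user decide what is yielded and what carries over.
            bucket, buffer = split_fn(buffer)
            yield bucket
            cost = sum(cost_fn(x) for x in buffer)
    if buffer:
        # Flush whatever is left at the end of the stream.
        yield buffer


buckets = list(
    bucket_by_cost(
        range(10),
        threshold=5,
        cost_fn=lambda _: 1.0,
        split_fn=make_multiple_of_splitter(2),
    )
)
print(buckets)
```

With a unit cost per example, a threshold of 5, and a batch size of 2, the first buffer reaches the threshold at five examples; four are yielded and the fifth is carried into the next bucket, which is the tail-splitting behavior described above.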
Please close this in favor of #753