Immutability for data collators
What does this PR do?
Introduces new tests that check if a data collator might introduce side effects, i.e. the given input changes after the call to the collator. Motivated by #30556
Furthermore, fixes the seq2seq collator to not introduce side effects on the given input's labels. This is done by:
- Passing only relevant features to the tokenizer to pad.
- Manually crafting the labels afterwards.
- Reintroducing tokenizer behaviour by converting labels to the respective datatype (pt, tf, np).
As a side note, added some more checks on
Nonelabels, especially when given"labels": Nonein the dictionary.
Last remarks:
- The test handles most of the cases that are introduced in the base tests but if I missed something give me a heads up :D
- I'm not sure how to handle it when the user introduces
Nonelabels. For now, I return them byNoneagain.
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@amyeroberts @Rocketknight1
Oh yea, one last thing. I've created separate classes for the immutability tests. Thought it got too convoluted otherwise.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@Rocketknight1 With code duplications, do you mean across the different classes between pt/tf/np? I agree with that, it's more so a dependency check to see that it suddenly doesn't branch out to something unwanted and that different input datatypes are handled correctly. For example, when the default collator is used, it branches into separate np, tf, and pt calls (i.e. numpy_default_data_collator, tf_default_data_collator, torch_default_data_collator). I haven't deep-dived where else something like this might happen.
Yes, that's what I was referring to - and I think it's fine to keep it as-is!