transformers icon indicating copy to clipboard operation
transformers copied to clipboard

Immutability for data collators

Open vasqu opened this issue 1 year ago • 4 comments

What does this PR do?

Introduces new tests that check if a data collator might introduce side effects, i.e. the given input changes after the call to the collator. Motivated by #30556

Furthermore, fixes the seq2seq collator to not introduce side effects on the given input's labels. This is done by:

  • Passing only relevant features to the tokenizer to pad.
  • Manually crafting the labels afterwards.
  • Reintroducing tokenizer behaviour by converting labels to the respective datatype (pt, tf, np). As a side note, added some more checks on None labels, especially when given "labels": None in the dictionary.

Last remarks:

  • The test handles most of the cases that are introduced in the base tests but if I missed something give me a heads up :D
  • I'm not sure how to handle it when the user introduces None labels. For now, I return them by None again.

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [x] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@amyeroberts @Rocketknight1

vasqu avatar May 01 '24 21:05 vasqu

Oh yea, one last thing. I've created separate classes for the immutability tests. Thought it got too convoluted otherwise.

vasqu avatar May 01 '24 21:05 vasqu

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 With code duplications, do you mean across the different classes between pt/tf/np? I agree with that, it's more so a dependency check to see that it suddenly doesn't branch out to something unwanted and that different input datatypes are handled correctly. For example, when the default collator is used, it branches into separate np, tf, and pt calls (i.e. numpy_default_data_collator, tf_default_data_collator, torch_default_data_collator). I haven't deep-dived where else something like this might happen.

vasqu avatar May 03 '24 12:05 vasqu

Yes, that's what I was referring to - and I think it's fine to keep it as-is!

Rocketknight1 avatar May 07 '24 14:05 Rocketknight1