[SigLIP] Add fast tokenizer
What does this PR do?
Fixes #29925.
🔴 Breaking change! Updates SiglipTokenizer, specifically strip behaviour in tokenize
To do:
- [ ] fix remaining tests
- [ ] add slow integration test
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@ArthurZucker I'm down to these 3 tests failing:
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_added_tokens_do_lower_case - AssertionError: 'aaaaa bbbbbb ' == 'aaaaa bbbbbb '
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_special_tokens_initialization - AssertionError: Lists differ: [342, 322, 291, 269, 262, 266, 32100, 507, 4290, 1] != [342, 322, 291, 269, 262, 266, 32100, 12936, 1]
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_tokenization_python_rust_equals - AssertionError: Sequences differ: [291,[64 chars]62, 232, 141, 158, 232, 141, 163, 232, 142, 16[5335 chars]3, 1] != [291,[64 chars]62, 2, 16577, 266, 2, 1443, 412, 282, 1791, 13[517...
but I don't really know how to fix these. Are you able to look into these?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
is this getting merged anytime soon?
Comments need to be addressed. cc @itazap if you want to take this over!
Hi @itazap do you have bandwidth to work on this? There are only 2 tests remaining to be fixed (for which I'd need some guidance).
@NielsRogge Thank you for your patience! I'm looking into the failing tests now
@NielsRogge Upon further inspection of the failing tests, the rust tokenizer is not equal to the python tokenizer. There are some key issues/differences, including:
- The `SiglipConverter.normalizer` is dropping all punctuation, including `["<", ">"]`, which need to be kept for the `<special>` tokens you are adding (see the sketch right after this list):
list_normalizers.append(normalizers.Replace(Regex(r"[" + re.escape(string.punctuation) + "]"), ""))
- Also, in `SiglipConverter` I believe a `WhitespaceSplit()` should be added to the pre-tokenizer, like below (see T5):
def pre_tokenizer(self, replacement, add_prefix_space):
prepend_scheme = _get_prepend_scheme(add_prefix_space, self.original_tokenizer)
return pre_tokenizers.Sequence(
[
pre_tokenizers.WhitespaceSplit(),
pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme),
]
)
- The rust (fast) and python tokenizers tokenize differently. For example, if you `.tokenize('<special> token')`, the fast output will include `['token']`, while the slow will include `['to', 'ken']`. I'm not sure which is expected as I am not too familiar with this model. Do you know which is expected?
- [EDIT] `do_lower_case` should be True for the failing `test_added_tokens_do_lower_case`
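For the first point above, a minimal sketch of one way to keep `<` and `>` while still dropping the rest of the punctuation; this assumes the converter keeps using a single `normalizers.Replace` over `string.punctuation` as in the snippet quoted above, and it has not been tested against the actual `SiglipConverter`:

import re
import string

from tokenizers import Regex, normalizers

# Hypothetical tweak: drop punctuation except the angle brackets that the
# <special> added tokens rely on.
punctuation_to_drop = "".join(ch for ch in string.punctuation if ch not in "<>")
keep_angle_brackets = normalizers.Replace(
    Regex(r"[" + re.escape(punctuation_to_drop) + r"]"), ""
)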
This should be a good starting point for where to debug. Let me know if I can answer any questions about the above!
Also, it may be helpful to merge the latest main code into this branch!
Thanks for looking into it! `SiglipTokenizer` is exactly the same as `T5Tokenizer`, except that it includes this function before calling `_tokenize` (as seen here). I was mainly wondering how this functionality could be incorporated into the fast tokenizer.
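For context, that extra step boils down to roughly the following; this is a paraphrased sketch based on the description above, not the exact source:

import re
import string


def canonicalize_text(text: str) -> str:
    # Paraphrased sketch of the slow tokenizer's extra step:
    # remove punctuation, collapse whitespace runs, strip outer whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text)
    return text.strip()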
@NielsRogge In that function, the `text = text.strip()` is causing the discrepancy. In `PreTrainedTokenizer.tokenize()`, the input string gets split on special tokens. Then your `canonicalize_text` may be called on a substring (such as a single token) in cases including special tokens, or on the whole string/sentence when there are no special tokens. I'm not sure what `canonicalize_text` is trying to achieve with the `.strip()`, but the link you provided mentions only punctuation and lowercasing. Without the `.strip()` it should be more consistent, but I'm not sure that will give the same results.
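A toy illustration of that discrepancy, using a stripped-down stand-in for the canonicalization (hypothetical inputs, loosely echoing the strings in the failing test above):

import re


def canon(text: str) -> str:
    # Only the whitespace collapsing and strip from the canonicalize step.
    return re.sub(r"\s+", " ", text).strip()


# With a special token, tokenize() splits first, so canon() runs per chunk
# and eats the spaces next to the special token:
print([canon(chunk) for chunk in ["aaaaa ", " bbbbbb "]])  # ['aaaaa', 'bbbbbb']

# Without special tokens, canon() runs once on the whole sentence and only
# the outer whitespace is removed:
print(canon("aaaaa bbbbbb "))  # 'aaaaa bbbbbb'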
Hi @itazap would you be interested to work on this?
@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀
@NielsRogge are you able to please rebase when you have a chance? I don't have permission to push a rebase on this!
Summary of changes:
- `test_chat_template_return_assistant_tokens_mask` skipped, because siglip strips the punctuation used in chat templates and this test is too specific about matching punctuation chars like the pipe
- `self.assertNotEqual(sp_tokens, tokens)` removed from SigLip and T5. Each of these vars is explicitly tested to be Equal to an expected value, and the two being NotEqual is not something we actually want to enforce. We already check what each of the two is Equal to!
- `canonicalize_text` function in `tokenization_siglip` should not strip, based on the paper source. This may affect the current SigLip slow/python tokenizer (see the sketch below)
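For the last point, a minimal sketch of what dropping the strip could look like on the fast side, assuming the converter expresses the canonicalization as normalizers; this is hypothetical and not the code actually merged here:

import re
import string

from tokenizers import Regex, normalizers

# Hypothetical fast-side normalizer: punctuation removal plus whitespace
# collapsing, with no Strip step, mirroring the "should not strip" change.
no_strip_normalizer = normalizers.Sequence(
    [
        normalizers.Replace(Regex(r"[" + re.escape(string.punctuation) + r"]"), ""),
        normalizers.Replace(Regex(r"\s+"), " "),
    ]
)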
@ArthurZucker re-requested for review, thanks!
Sounds good!
Hey! 🤗 Thanks for your contribution to the transformers library!
Before merging this pull request, slow tests CI should be triggered. To enable this:
- Add the `run-slow` label to the PR
- When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command `[run-slow]` followed by a comma-separated list of all the models to be tested, i.e. `[run_slow] model_to_test_1, model_to_test_2`
- If the pull request affects a lot of models, put at most 10 models in the commit message
- A `transformers` maintainer will then approve the workflow to start the tests
(For maintainers) The documentation for slow tests CI on PRs is here.
@ArthurZucker @NielsRogge We cannot merge until we merge a feature to support loading a Fast tokenizer without a Fast tokenizer class specified in the tokenizer_config file! We need either https://github.com/huggingface/transformers/pull/34212 or https://github.com/huggingface/transformers/pull/33751 merged first! (That is why this PR is failing; the test fails because the feature is not yet supported.)
No worries, let's wait a bit to refactor 🤗