
[SigLIP] Add fast tokenizer

Open NielsRogge opened this issue 1 year ago • 18 comments

What does this PR do?

Fixes #29925.

🔴 Breaking change! Updates SiglipTokenizer, specifically the strip behaviour in tokenize

To do:

  • [ ] fix remaining tests
  • [ ] add slow integration test

NielsRogge avatar Mar 30 '24 20:03 NielsRogge

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker I'm down to these 3 tests failing:

FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_added_tokens_do_lower_case - AssertionError: 'aaaaa bbbbbb ' == 'aaaaa bbbbbb '
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_special_tokens_initialization - AssertionError: Lists differ: [342, 322, 291, 269, 262, 266, 32100, 507, 4290, 1] != [342, 322, 291, 269, 262, 266, 32100, 12936, 1]
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_tokenization_python_rust_equals - AssertionError: Sequences differ: [291,[64 chars]62, 232, 141, 158, 232, 141, 163, 232, 142, 16[5335 chars]3, 1] != [291,[64 chars]62, 2, 16577, 266, 2, 1443, 412, 282, 1791, 13[517...

but I don't really know how to fix these. Are you able to look into these?

NielsRogge avatar Apr 22 '24 18:04 NielsRogge

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 25 '24 08:05 github-actions[bot]

is this getting merged anytime soon?

yxchng avatar May 29 '24 04:05 yxchng

Comments need to be addressed. cc @itazap if you want to take this over!

ArthurZucker avatar Jun 05 '24 12:06 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 30 '24 08:06 github-actions[bot]

Hi @itazap, do you have bandwidth to work on this? There are only 2 tests remaining to be fixed (for which I'd need some guidance).

NielsRogge avatar Jul 08 '24 08:07 NielsRogge

@NielsRogge Thank you for your patience! I'm looking into the failing tests now

itazap avatar Jul 09 '24 07:07 itazap

@NielsRogge Upon further inspection of the failing tests, the rust tokenizer is not equal to the python tokenizer. There are some key issues/differences, including:

  1. The SiglipConverter.normalizer is dropping all punctuation, including ["<", ">"], which need to be kept for the <special> tokens you are adding (see the sketch after this list):

    list_normalizers.append(normalizers.Replace(Regex(r"[" + re.escape(string.punctuation) + "]"), ""))

  2. Also in SiglipConverter, I believe a WhitespaceSplit() pre-tokenizer should be added like below (see T5):

    def pre_tokenizer(self, replacement, add_prefix_space):
        prepend_scheme = _get_prepend_scheme(add_prefix_space, self.original_tokenizer)
        return pre_tokenizers.Sequence(
            [
                pre_tokenizers.WhitespaceSplit(),
                pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme),
            ]
        )

  3. The rust (fast) and python tokenizers tokenize differently: for example, on .tokenize('<special> token'), the fast output will include ['token'], while the slow output will include ['to', 'ken']. I'm not too familiar with this model, so do you know which is expected?

  4. [EDIT] do_lower_case should be True for the failing test_added_tokens_do_lower_case.
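As a minimal sketch of point 1 (not the actual converter code, and assuming that simply excluding "<" and ">" from the stripped set is acceptable), the normalizer could keep the special-token brackets like this:

    import re
    import string

    from tokenizers import Regex, normalizers

    # keep "<" and ">" so added <special> tokens survive normalization
    kept = "<>"
    punct = "".join(ch for ch in string.punctuation if ch not in kept)
    normalizer = normalizers.Sequence(
        [
            normalizers.Replace(Regex(r"[" + re.escape(punct) + r"]"), ""),
            normalizers.Lowercase(),  # assumption: mirrors do_lower_case=True
        ]
    )

    print(normalizer.normalize_str("Keep <special>, drop this!"))  # keep <special> drop this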

This should be a good starting point for debugging. Let me know if I can answer any questions about the above!

Also, it may be helpful to merge the latest main code into this branch!

itazap avatar Jul 09 '24 09:07 itazap

Thanks for looking into it! SiglipTokenizer is exactly the same as T5Tokenizer, except that it includes this function before calling _tokenize (as seen here). I was mainly wondering how this functionality could be incorporated into the fast tokenizer.

NielsRogge avatar Jul 09 '24 09:07 NielsRogge

@NielsRogge In that function, the text = text.strip() is causing the discrepancy. In PreTrainedTokenizer.tokenize(), the input string gets split on special tokens. Your canonicalize_text may therefore be called on a substring (such as a single token) when special tokens are present, or on the whole string/sentence when there are none (example: ' token' vs 'hey token'). I'm not exactly sure what canonicalize_text is trying to achieve with the .strip(); the link you provided mentions only punctuation and lowercasing. Without the .strip() it should be more consistent, but I'm not sure it will give the same results.
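To make the discrepancy concrete, here is a toy illustration (a hypothetical simplification of canonicalize_text that keeps only the punctuation removal, whitespace collapsing, lowercasing, and the strip in question):

    import re
    import string

    def canonicalize_text(text: str) -> str:
        # hypothetical simplification, not the actual implementation
        text = re.sub(r"[" + re.escape(string.punctuation) + r"]", "", text)
        text = re.sub(r"\s+", " ", text)
        return text.strip().lower()  # <- the strip in question

    # no special token: the whole sentence is canonicalized at once
    print(canonicalize_text("hey token"))  # 'hey token'
    # with a special token, tokenize() splits the input first, so the
    # remainder ' token' is canonicalized alone and loses its leading
    # space, which changes how SentencePiece later segments it
    print(canonicalize_text(" token"))  # 'token'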

itazap avatar Jul 09 '24 10:07 itazap

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 03 '24 08:08 github-actions[bot]

Hi @itazap would you be interested to work on this?

NielsRogge avatar Aug 12 '24 09:08 NielsRogge

@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀

itazap avatar Aug 19 '24 10:08 itazap

@NielsRogge are you able to please rebase when you have a chance? I don't have permission to push a rebase on this!

itazap avatar Aug 22 '24 13:08 itazap

Summary of changes:

  • test_chat_template_return_assistant_tokens_mask is skipped because siglip strips the punctuation used in chat templates, and this test is too specific about matching punctuation chars like the pipe
  • self.assertNotEqual(sp_tokens, tokens) removed from SigLip and T5. Each of these vars is already explicitly tested to Equal an expected value, and the two being NotEqual is not something we actually want to enforce. We already check what each of the two is Equal to!
  • The canonicalize_text function in tokenization_siglip should not strip, based on the paper source. This may affect the current SigLip slow/python tokenizer. (See the consistency check after this list.)
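As a quick sanity check once this lands (hypothetical usage; the checkpoint name is just an example), the slow and fast tokenizers should agree:

    from transformers import AutoTokenizer

    # assumption: this PR's fast tokenizer is available for the checkpoint
    slow = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=False)
    fast = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)

    text = "a photo of 2 cats"
    assert slow.tokenize(text) == fast.tokenize(text)
    assert slow(text).input_ids == fast(text).input_ids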

@ArthurZucker re-requested a review, thanks!

itazap avatar Aug 26 '24 12:08 itazap

Sounds good!

ArthurZucker avatar Sep 27 '24 15:09 ArthurZucker

Hey! 🤗 Thanks for your contribution to the transformers library!

Before merging this pull request, slow tests CI should be triggered. To enable this:

  • Add the run-slow label to the PR
  • When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command [run-slow] followed by a comma-separated list of all the models to be tested, i.e. [run-slow] model_to_test_1, model_to_test_2 (see the example after this list)
    • If the pull request affects a lot of models, put at most 10 models in the commit message
  • A transformers maintainer will then approve the workflow to start the tests
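For example, the empty commit could look like this (the model list is just an illustration):

    git commit --allow-empty -m "[run-slow] siglip"
    git push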

(For maintainers) The documentation for slow tests CI on PRs is here.

@ArthurZucker @NielsRogge we cannot merge until we merge a feature to support loading a Fast tokenizer without a Fast tokenizer class specified in the tokenizer_config file! We need either https://github.com/huggingface/transformers/pull/34212 or https://github.com/huggingface/transformers/pull/33751 merged first! (That is why this PR is failing: the test fails because the feature is not yet supported.)

itazap avatar Oct 24 '24 07:10 itazap

No worries, let's wait a bit to refactor 🤗

ArthurZucker avatar Oct 29 '24 09:10 ArthurZucker