transformers fix LayoutLMv3TokenizerFast subword label after 'Ġ' token

LayoutLMv3TokenizerFast produces an empty 'Ġ' token with offset_mapping = (0, 0). The next token is then wrongly assumed to also be the beginning of a word and is not assigned pad_token_label as it should be. This may lead to misalignment between words and token representations. Other BPE tokenizers might be affected.

Add a check for whether the previous token had an empty offset_mapping (not including special tokens). Remove the copy check from LayoutLMv2TokenizerFast for _batch_encode_plus because it is not affected (it uses WordPiece instead of BPE). Modify the test to use text that produces a 'Ġ' token.
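
For illustration, here is a minimal, standalone sketch of the label-alignment logic with the previous-token check. It is not the code from this PR: the function name `align_labels`, its parameters, and the example offsets are hypothetical, and -100 is assumed as the value of pad_token_label.

```python
from typing import List, Optional, Sequence, Tuple

PAD_TOKEN_LABEL = -100  # assumption: the usual "ignore" label value


def align_labels(
    offsets: Sequence[Tuple[int, int]],
    word_ids: Sequence[Optional[int]],
    word_labels: Sequence[int],
    pad_token_label: int = PAD_TOKEN_LABEL,
    check_previous_empty: bool = True,
) -> List[int]:
    """Label only the first token of each word; pad every other token.

    `offsets` holds per-token (start, end) offsets relative to the word,
    and `word_ids` holds the word index of each token (None for special
    tokens). Both names are hypothetical stand-ins for tokenizer output.
    """
    labels = []
    previous_token_empty = False
    for offset, word_id in zip(offsets, word_ids):
        if word_id is None:
            # Special tokens always get the pad label and never count
            # as an "empty previous token".
            labels.append(pad_token_label)
            previous_token_empty = False
            continue
        starts_word = offset[0] == 0
        if check_previous_empty and previous_token_empty:
            # The previous token was empty (e.g. a lone 'Ġ'), so this
            # token continues the same word despite starting at 0.
            starts_word = False
        labels.append(word_labels[word_id] if starts_word else pad_token_label)
        previous_token_empty = tuple(offset) == (0, 0)
    return labels


# Hypothetical encoding: <s>, "hello", empty 'Ġ', "world", </s>
offsets = [(0, 0), (0, 5), (0, 0), (0, 5), (0, 0)]
word_ids = [None, 0, 1, 1, None]
word_labels = [1, 2]

fixed = align_labels(offsets, word_ids, word_labels)
buggy = align_labels(offsets, word_ids, word_labels, check_previous_empty=False)
```

With the check disabled, the token after the empty 'Ġ' receives the word label a second time, which is the misalignment described above.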

Fixes issue: #19978

@NielsRogge @ArthurZucker

Feb 19 '23 20:02 thibaultdouzon

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Feb 19 '23 20:02 HuggingFaceDocBuilderDev

Also cc @amyeroberts

Mar 27 '23 19:03 sgugger

Hi @ArthurZucker, thanks for your investigations.

This PR fixes the problem for LayoutLMv3, but I expect the same problem to exist in other models that use fast BPE tokenization. I will take a look when I can and list all impacted models that need a fix.

Mar 28 '23 15:03 thibaultdouzon

Thanks a lot for this fix. Would you be able to take my comment into account so that we can merge it? 🙏

Thanks!

Btw, the same fix could then be applied to LayoutLMv2 and LayoutXLM.

Apr 03 '23 13:04 NielsRogge

LayoutLMv2 uses WordPiece, not BPE. From what I saw, its vocabulary does not contain an empty token and thus cannot produce a (0, 0) offset_mapping when encoding.
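
One way to check a given encoding for this condition is sketched below. It only assumes per-token offsets and word ids are available as plain lists; the helper name `find_empty_word_tokens` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def find_empty_word_tokens(
    offsets: Sequence[Tuple[int, int]],
    word_ids: Sequence[Optional[int]],
) -> List[int]:
    """Return positions of non-special tokens whose offset is (0, 0).

    Special tokens (word id None) also report (0, 0), so they are
    skipped; only empty tokens that belong to a word are returned.
    """
    return [
        index
        for index, (offset, word_id) in enumerate(zip(offsets, word_ids))
        if word_id is not None and tuple(offset) == (0, 0)
    ]


# Hypothetical BPE-style encoding with an empty 'Ġ' token at position 2.
bpe_hits = find_empty_word_tokens(
    [(0, 0), (0, 5), (0, 0), (0, 5), (0, 0)],
    [None, 0, 1, 1, None],
)

# Hypothetical WordPiece-style encoding: no empty token inside a word.
wordpiece_hits = find_empty_word_tokens(
    [(0, 0), (0, 5), (0, 3), (3, 5), (0, 0)],
    [None, 0, 1, 1, None],
)
```

An empty result for every test input would be consistent with a tokenizer not being affected, though it does not prove it.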

Apr 03 '23 14:04 thibaultdouzon