Thibault Douzon
From what I understood (sadly, I couldn't make the debugger enter `render_prepare`), the plot contains duplicate scales for x and y, the second (incorrect, without labels) erasing the...
Hi @ArthurZucker, thanks for your investigations. This PR fixes the problem for LayoutLMv3, but I expect the problem to exist in other models using Fast BPE tokenization; I will take...
LayoutLMv2 uses WordPiece, not BPE. From what I saw, its vocabulary does not contain an empty token and thus cannot produce a (0, 0) offset_mapping when encoding.
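A quick way to verify that claim, assuming that checking the vocabulary for an empty-string entry is sufficient (my sketch, not part of the original comment):

```
from transformers import LayoutLMv2TokenizerFast

tokenizer = LayoutLMv2TokenizerFast.from_pretrained(
    "microsoft/layoutlmv2-base-uncased"
)

# WordPiece maps token strings to ids; without an empty-string entry,
# no encoded token can ever cover a zero-length (0, 0) span.
print("" in tokenizer.get_vocab())  # expected: False
```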
The same problem arises with all BPE-based tokenizers. Example with LayoutXLM:

```
import numpy as np
from transformers import LayoutXLMTokenizerFast

processor = LayoutXLMTokenizerFast.from_pretrained(
    "microsoft/layoutxlm-base", apply_ocr=False
)
words = ["pencil", ...
```
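The snippet above is cut off in the original comment; a minimal runnable reconstruction of the kind of check it describes could look like the following. The word list, the bounding boxes, and the final print are my assumptions for illustration, not the original code:

```
import numpy as np
from transformers import LayoutXLMTokenizerFast

tokenizer = LayoutXLMTokenizerFast.from_pretrained("microsoft/layoutxlm-base")

# Hypothetical word-level input with bounding boxes, as a document AI
# pipeline would provide them (values assumed, not from the original).
words = ["pencil", "eraser"]
boxes = [[10, 10, 50, 20], [60, 10, 120, 20]]

encoding = tokenizer(text=words, boxes=boxes, return_offsets_mapping=True)
offsets = np.array(encoding["offset_mapping"])

# A (0, 0) offset on a non-special token indicates an empty token emitted
# by the tokenizer, which breaks token-to-word (and box) alignment downstream.
print(offsets)
```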