
LayoutLM reading order for token classification

AleRosae opened this issue on Jul 29, 2022 · 0 comments

Hi, thank you very much for all the work that you have done, it is a huge help :) I have noticed that the way text is preprocessed for LayoutLMv1 (and, I assume, for the later versions as well) does not take the reading order into account. For instance, for the example image shown at the beginning of this notebook, the output in train.txt is:

R&D	O
:	S-QUESTION
Suggestion:	S-QUESTION
Date:	S-QUESTION
Licensee	S-ANSWER
Yes	S-QUESTION
No	S-QUESTION
597005708	O
R&D	B-HEADER
QUALITY	I-HEADER
IMPROVEMENT	I-HEADER
SUGGESTION/	I-HEADER
SOLUTION	I-HEADER
FORM	E-HEADER
[..etc]

but what I assume to be the correct reading order would be something like:

R&D	B-HEADER
QUALITY	I-HEADER
IMPROVEMENT	I-HEADER
SUGGESTION/	I-HEADER
SOLUTION	I-HEADER
FORM	E-HEADER
NAME B-QUESTION
PHONE I-QUESTION
EXT E-QUESTION
M. B-ANSWER
HAMANN E-ANSWER
[...etc]

Is this irrelevant when training LayoutLM for token classification tasks? When we create a custom dataset, should we insert the text-label pairs following the reading order provided by the OCR, or does it not matter?
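
To make the question concrete, here is a minimal sketch of the kind of reordering I have in mind: sorting the OCR words top-to-bottom and then left-to-right based on their bounding boxes. The `words`, `boxes`, `labels` names and the `line_tolerance` parameter are just hypothetical placeholders, not anything from the preprocessing scripts in this repo:

```python
def sort_reading_order(words, boxes, labels, line_tolerance=10):
    """Sort (word, box, label) triples roughly into reading order.

    Assumes each box is [x0, y0, x1, y1]. Words whose top coordinates differ
    by less than `line_tolerance` pixels are treated as being on the same
    line and are ordered left-to-right within that line.
    """
    if not words:
        return words, boxes, labels

    # First sort globally by top coordinate, then by left coordinate.
    items = sorted(zip(words, boxes, labels), key=lambda t: (t[1][1], t[1][0]))

    # Group consecutive items into lines, then re-sort each line by x.
    lines, current = [], [items[0]]
    for item in items[1:]:
        if abs(item[1][1] - current[-1][1][1]) <= line_tolerance:
            current.append(item)
        else:
            lines.append(sorted(current, key=lambda t: t[1][0]))
            current = [item]
    lines.append(sorted(current, key=lambda t: t[1][0]))

    ordered = [item for line in lines for item in line]
    words, boxes, labels = map(list, zip(*ordered))
    return words, boxes, labels
```

With something like this, the header tokens in the example above would come out first, followed by the question/answer pairs line by line, which is what I would expect the model to see.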

AleRosae · Jul 29 '22, 14:07