
Leveraging segment position embeddings at inference time in LayoutLMv3 token classification

arunpurohit3799 opened this issue · 3 comments

@NielsRogge Can you help me understand how we can leverage segment position embeddings at inference time, given that predictions are still made per token? Do we need to do external text segmentation (region detection of segments) if the OCR engine only gives word-level output?

It would be great if you could provide some pointers on how to leverage segment position embeddings to improve accuracy at inference time.

Thank You

arunpurohit3799 avatar Jun 21 '22 07:06 arunpurohit3799

Hi,

That's a great question! I think that recognizing these segments can be seen as "layout analysis", see https://paperswithcode.com/task/document-layout-analysis. This is often framed as an object detection problem.

Now, the state-of-the-art for this is... wait for it... yes, LayoutLMv3 :D The authors show in their paper that LayoutLMv3 gets state-of-the-art results on PubLayNet, a benchmark for detecting things like tables, headers and text in PDFs. Note that for this task they only feed the image to the model, so segment position embeddings aren't required here. I created a Space for it here: https://huggingface.co/spaces/nielsr/dit-document-layout-analysis (this is based on the Document Image Transformer, but LayoutLMv3 is used in the same way).

Details here: https://github.com/microsoft/unilm/tree/master/layoutlmv3#document-layout-detection-on-publaynet

Concretely, they use LayoutLMv3 as the backbone, with Mask R-CNN as the object detection framework on top, similar to the Document Image Transformer (DiT). For now, one has to use the original code for this (which is based on Detectron2), as HuggingFace Transformers only includes the backbones (LayoutLMv3 and DiT).
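
For illustration, inference with that detector looks roughly like the sketch below. This is not the exact unilm code: the config and weights paths are placeholders for the files in the unilm repo, and the repo's own code must be importable, since it registers the LayoutLMv3 backbone and its custom config keys with Detectron2.

```python
# Rough sketch of layout detection with the Detectron2-based LayoutLMv3.
# Paths are placeholders for the config/weights shipped in microsoft/unilm;
# the unilm repo code must be on the path (it adds custom config keys).
import cv2
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file("path/to/cascade_layoutlmv3.yaml")   # config from unilm
cfg.MODEL.WEIGHTS = "path/to/layoutlmv3_publaynet.pth"   # fine-tuned weights
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.6              # detection threshold

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("page.png"))

# Each detected instance is a layout segment; PubLayNet classes are
# text, title, list, table and figure.
instances = outputs["instances"].to("cpu")
segment_boxes = instances.pred_boxes.tensor.numpy()  # (N, 4) xyxy pixel coords
segment_classes = instances.pred_classes.numpy()
```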

NielsRogge avatar Jun 21 '22 09:06 NielsRogge

Hi @NielsRogge, thank you very much for your detailed explanation and all your contributions! Is there any notebook example of how to fine-tune LayoutLMv3 on this task? I am doing layout analysis in languages other than English; since LayoutLMv3 is only available in English, can we use LayoutXLM models for this task?

alejandrojcastaneira avatar Jul 04 '22 13:07 alejandrojcastaneira

So the pipeline for inference with segment-level positional features would be: run the document through LayoutLMv3 fine-tuned on PubLayNet, modify the OCR results with the text segment positions received from the model's output, and feed the modified OCR results with the new position embeddings into LayoutLMv3 fine-tuned on the FUNSD dataset?

wandering-walrus avatar Jul 25 '22 17:07 wandering-walrus

Yes, that's correct. Note that the StructuralLM paper, which introduced segment position embeddings, simply treats the bounding boxes that an OCR engine outputs as "cells" (= "segments").
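
As a rough illustration of that second step, here's a minimal sketch. `image` (a PIL image), `words`, `word_boxes` and `segment_boxes` are assumed to come from the OCR engine and the layout detector, and the FUNSD checkpoint path is a placeholder:

```python
# Sketch: token classification with segment-level boxes. Each word gets
# the box of the segment that contains it, instead of its own word box.
import torch
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification

def assign_segment_box(word_box, segment_boxes):
    """Give each word the box of the segment that contains its center."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    for x0, y0, x1, y1 in segment_boxes:
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return [x0, y0, x1, y1]
    return list(word_box)  # fall back to the word's own box

def normalize(box, width, height):
    """The processor expects boxes on a 0-1000 scale."""
    x0, y0, x1, y1 = box
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]

width, height = image.size
boxes = [normalize(assign_segment_box(b, segment_boxes), width, height)
         for b in word_boxes]

# apply_ocr=False since we provide words and boxes ourselves
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("path/to/layoutlmv3-finetuned-funsd")

encoding = processor(image, words, boxes=boxes, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits
predictions = logits.argmax(-1).squeeze().tolist()  # one label id per token
```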

NielsRogge avatar Oct 12 '22 17:10 NielsRogge

Hi @wandering-walrus,

This thread is quite interesting regarding obtaining segment position embeddings for LayoutLM-like models. Basically, there's no need for a separate layout analysis model; the OCR engine itself can be used to identify segments.
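
For example, a minimal sketch using Tesseract's line-level grouping (assuming pytesseract is installed), where each OCR line plays the role of a segment:

```python
# Sketch: derive segment boxes straight from Tesseract's line grouping,
# without a separate layout analysis model.
from collections import defaultdict
from PIL import Image
import pytesseract
from pytesseract import Output

image = Image.open("page.png")
data = pytesseract.image_to_data(image, output_type=Output.DICT)

# Group word boxes by (block, paragraph, line): each line acts as a "cell"
lines = defaultdict(list)
words, word_keys = [], []
for i, text in enumerate(data["text"]):
    if not text.strip():
        continue
    key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
    box = (data["left"][i], data["top"][i],
           data["left"][i] + data["width"][i],
           data["top"][i] + data["height"][i])
    lines[key].append(box)
    words.append(text)
    word_keys.append(key)

# A segment box is the union of the word boxes on its line
segment_box = {
    key: (min(b[0] for b in bs), min(b[1] for b in bs),
          max(b[2] for b in bs), max(b[3] for b in bs))
    for key, bs in lines.items()
}

# Every word gets its line's box instead of its own word-level box
boxes = [segment_box[key] for key in word_keys]
```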

NielsRogge avatar Oct 20 '22 13:10 NielsRogge