
Finetuning description for DocVQA using LayoutLMv3

Open allanj opened this issue 3 years ago • 4 comments

LayoutLMv3

The fine-tuning details for DocVQA seem not to be available yet.

allanj avatar Jul 09 '22 00:07 allanj

Please see section 3.3 Fine-tuning on Multimodal Tasks, Task III Document Visual Question Answering in the paper.

HYPJUDY avatar Jul 11 '22 07:07 HYPJUDY

Thanks for that. I have actually read that section, but some implementation details seem to be missing, and there is no appendix where I can learn more.

  1. How do you find the start and end positions? (If the answer appears multiple times, do you use the first occurrence or all of them?)
  2. The original test set does not provide gold answers. How do you obtain the answers? Or are the reported results on the validation set?
  3. How does the OCR engine affect performance? What happens if we use the officially provided OCR?
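Regarding question 1: the usual extractive-QA setup locates the answer string in the OCR word sequence and labels the matching word indices as start/end positions. Below is a minimal sketch of that matching step; this is my own illustration of the common approach, not necessarily what the LayoutLMv3 authors did (they may also handle punctuation, partial matches, or multiple occurrences differently).

```python
def find_answer_span(words, answer):
    """Locate the first occurrence of `answer` in the OCR word list.

    Returns (start, end) word indices (inclusive), or None if no match.
    Matching here is case-insensitive; real pipelines often also strip
    punctuation and whitespace before comparing.
    """
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    words_lower = [w.lower() for w in words]
    for i in range(len(words_lower) - n + 1):
        if words_lower[i:i + n] == answer_tokens:
            return i, i + n - 1
    return None

# Example: OCR words from a document page and a gold answer.
words = ["Invoice", "Date:", "March", "3,", "1998"]
print(find_answer_span(words, "march 3, 1998"))  # -> (2, 4)
```

Note this only returns the first match; whether to train on the first occurrence or all occurrences is exactly the ambiguity raised above.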

allanj avatar Jul 12 '22 02:07 allanj

@allanj You may refer to this https://github.com/anisha2102/docvqa for data pre-processing.
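For reference, that repository converts DocVQA annotations into an extractive-QA format. A rough sketch of the loading step is below; the field names (`data`, `question`, `answers`, `image`) are my reading of the public DocVQA JSON and should be checked against the actual files.

```python
import json

def load_docvqa(path):
    """Read DocVQA-style annotations into simple example dicts.

    Assumed schema: top-level "data" list, each entry carrying
    "question", "answers" (absent on the test split), and "image".
    """
    with open(path) as f:
        raw = json.load(f)
    examples = []
    for item in raw["data"]:
        examples.append({
            "question": item["question"],
            # Lowercasing answers here mirrors the linked script's behavior.
            "answers": [a.lower() for a in item.get("answers", [])],
            "image": item["image"],
        })
    return examples
```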

wolfshow avatar Jul 12 '22 02:07 wolfshow

@allanj You may refer to this https://github.com/anisha2102/docvqa for data pre-processing.

Hi @wolfshow, this script lowercases all tokens, while LayoutLMv3 actually uses the RoBERTa tokenizer, which does not require lowercased input. I'm wondering whether you followed that script exactly, or whether you made some modifications during fine-tuning that are not fully described in the paper.
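To make the concern concrete: a cased (case-sensitive) vocabulary assigns different IDs to "March" and "march", so lowercasing during preprocessing changes the input IDs the model sees. The toy vocabulary below is purely illustrative (the real RoBERTa tokenizer uses byte-level BPE, but it is likewise case-sensitive):

```python
# Toy case-sensitive vocabulary, standing in for a cased tokenizer.
# In a real cased vocab, "March" and "march" map to different tokens.
vocab = {"March": 0, "march": 1, "<unk>": 2}

def toy_tokenize(word):
    """Map a word to its ID, falling back to the unknown token."""
    return vocab.get(word, vocab["<unk>"])

print(toy_tokenize("March"))          # -> 0
print(toy_tokenize("March".lower()))  # -> 1: lowercasing changed the ID
```

So if the preprocessing script lowercases text but the checkpoint was trained on cased input, the tokenization (and any character-to-token offset mapping) will differ from what the model expects.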

allanj avatar Aug 03 '22 06:08 allanj