
Finetuning description for DocVQA using LayoutLMv3

Open allanj opened this issue 3 years ago • 4 comments

LayoutLMv3

The fine-tuning details for DocVQA seem not to be available yet.

allanj avatar Jul 09 '22 00:07 allanj

Please see section 3.3 Fine-tuning on Multimodal Tasks, Task III Document Visual Question Answering in the paper.

HYPJUDY avatar Jul 11 '22 07:07 HYPJUDY

Thanks for that. I have actually read that section, but some implementation details seem to be missing, and there is no appendix where I can learn more.

  1. How do you find the start and end positions? (If the answer appears multiple times, do you use the first occurrence or all of them?)
  2. The original test set does not provide gold answers. How do you obtain the answers? Or are the reported results on the validation set?
  3. How does the OCR engine affect performance? What happens if we use the officially provided OCR?
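Regarding question 1: the usual extractive-QA setup locates the answer string in the OCR word sequence and labels the matching word indices as start/end positions. Below is a minimal sketch of that matching step; this is my own illustration of the common approach, not necessarily what the LayoutLMv3 authors did (they may also handle punctuation, partial matches, or multiple occurrences differently).

```python
def find_answer_span(words, answer):
    """Locate the first occurrence of `answer` in the OCR word list.

    Returns (start, end) word indices (inclusive), or None if no match.
    Matching here is case-insensitive; real pipelines often also strip
    punctuation and whitespace before comparing.
    """
    answer_tokens = answer.lower().split()
    n = len(answer_tokens)
    words_lower = [w.lower() for w in words]
    for i in range(len(words_lower) - n + 1):
        if words_lower[i:i + n] == answer_tokens:
            return i, i + n - 1
    return None

# Example: OCR words from a document page and a gold answer.
words = ["Invoice", "Date:", "March", "3,", "1998"]
print(find_answer_span(words, "march 3, 1998"))  # -> (2, 4)
```

Note this only returns the first match; whether to train on the first occurrence or all occurrences is exactly the ambiguity raised above.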

allanj avatar Jul 12 '22 02:07 allanj

@allanj You may refer to this https://github.com/anisha2102/docvqa for data pre-processing.
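For reference, that repository converts DocVQA annotations into an extractive-QA format. A rough sketch of the loading step is below; the field names (`data`, `question`, `answers`, `image`) are my reading of the public DocVQA JSON and should be checked against the actual files.

```python
import json

def load_docvqa(path):
    """Read DocVQA-style annotations into simple example dicts.

    Assumed schema: top-level "data" list, each entry carrying
    "question", "answers" (absent on the test split), and "image".
    """
    with open(path) as f:
        raw = json.load(f)
    examples = []
    for item in raw["data"]:
        examples.append({
            "question": item["question"],
            # Lowercasing answers here mirrors the linked script's behavior.
            "answers": [a.lower() for a in item.get("answers", [])],
            "image": item["image"],
        })
    return examples
```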

wolfshow avatar Jul 12 '22 02:07 wolfshow

@allanj You may refer to this https://github.com/anisha2102/docvqa for data pre-processing.

Hi @wolfshow, this script lowercases all tokens, while LayoutLMv3 actually uses the RoBERTa tokenizer, which does not require lowercased input. I'm wondering whether you followed that script exactly, or whether you made some modifications during fine-tuning that are not fully described in the paper.
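To make the concern concrete: a cased (case-sensitive) vocabulary assigns different IDs to "March" and "march", so lowercasing during preprocessing changes the input IDs the model sees. The toy vocabulary below is purely illustrative (the real RoBERTa tokenizer uses byte-level BPE, but it is likewise case-sensitive):

```python
# Toy case-sensitive vocabulary, standing in for a cased tokenizer.
# In a real cased vocab, "March" and "march" map to different tokens.
vocab = {"March": 0, "march": 1, "<unk>": 2}

def toy_tokenize(word):
    """Map a word to its ID, falling back to the unknown token."""
    return vocab.get(word, vocab["<unk>"])

print(toy_tokenize("March"))          # -> 0
print(toy_tokenize("March".lower()))  # -> 1: lowercasing changed the ID
```

So if the preprocessing script lowercases text but the checkpoint was trained on cased input, the tokenization (and any character-to-token offset mapping) will differ from what the model expects.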

allanj avatar Aug 03 '22 06:08 allanj