Finetuning description for DocVQA using LayoutLMv3
LayoutLMv3
Seems not available yet.
Please see section 3.3 Fine-tuning on Multimodal Tasks, Task III Document Visual Question Answering in the paper.
Thanks for that. I have actually read that section, but I think some implementation details are missing, and there is no appendix where I can learn more:
- How do you find the start and end positions? (If the answer appears multiple times in the document, do you use the first occurrence or all of them?)
- The original test set does not provide the gold answers. How did you obtain the answers? Or are the reported results on the validation set?
- How does the OCR affect the performance? What if the officially provided OCR is used?
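Regarding the first question, a minimal sketch of what such span extraction might look like, assuming the common preprocessing of matching the normalized answer string against normalized OCR words (the function names and normalization here are illustrative, not the authors' actual implementation):

```python
# Hypothetical sketch: locate every (start, end) word index pair whose
# normalized words match the normalized answer. Real pipelines would then
# map word indices to subword token indices.

def normalize(text):
    """Lowercase and strip non-alphanumeric characters, then split into words."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return cleaned.split()

def find_answer_spans(words, answer):
    """Return all (start, end) word-index spans matching `answer` (inclusive)."""
    target = normalize(answer)
    # Normalize each OCR word individually; keep "" for words that vanish.
    flat = [(normalize(w) or [""])[0] for w in words]
    n = len(target)
    spans = []
    for i in range(len(flat) - n + 1):
        if flat[i:i + n] == target:
            spans.append((i, i + n - 1))
    return spans

words = ["The", "invoice", "total", "is", "$12.50", "due", "May", "3"]
print(find_answer_spans(words, "invoice total"))  # [(1, 2)]
```

Whether one keeps only the first span or trains on all of them is exactly the ambiguity raised above.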
@allanj You may refer to this https://github.com/anisha2102/docvqa for data pre-processing.
Hi @wolfshow, that script lowercases all tokens, while LayoutLMv3 actually uses the RoBERTa tokenizer, which is case-sensitive and does not require lowercased input. I'm wondering whether you followed that script exactly, or whether there are modifications during fine-tuning that are not fully described in the paper.
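To make the concern concrete, here is a toy illustration (using a made-up vocabulary, not the real RoBERTa BPE vocab) of why lowercasing matters for a case-sensitive tokenizer: cased and lowercased forms map to different ids, so lowercasing the OCR text changes every input the model sees:

```python
# Toy case-sensitive vocabulary; the real RoBERTa vocab behaves analogously
# in that "Total" and "total" produce different token ids.
toy_vocab = {"Total": 0, "total": 1, "Amount": 2, "amount": 3, "<unk>": 4}

def toy_tokenize(words, vocab=toy_vocab):
    """Map each word to its id, falling back to <unk> for unknown words."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]

cased = toy_tokenize(["Total", "Amount"])
lowered = toy_tokenize([w.lower() for w in ["Total", "Amount"]])
print(cased, lowered, cased == lowered)  # [0, 2] [1, 3] False
```

So if the preprocessing script lowercases but the fine-tuning pipeline does not (or vice versa), the resulting inputs are systematically different, which is why this detail matters for reproduction.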