LayoutLMv2 documents with multiple pages
Hi Niels,
I am attempting to build off of your LayoutLMv2 example, except I am using my own Amazon Textract results instead of the internal OCR engine in LayoutLMv2Processor. The LayoutLMv2 docs explain that case nicely. The problem, however, is that they don't appear to cover documents with multiple pages (I believe your tutorial only handles the single-page case as well).
When you create the Dataset object using from_pandas, each row of the pandas DataFrame represents a single document. When that document is passed to Dataset.map, how do you then handle the case where it spans multiple pages? My instinct is that words becomes a List[List[str]], where each inner list is its own page, and images becomes a list with one image per page. Do you happen to know if this case is documented anywhere?
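In case it helps frame the question, here is a minimal sketch of how I'm shaping one page of Textract output into the (words, boxes) lists that LayoutLMv2Processor expects when apply_ocr=False. The textract_page_to_inputs helper is my own, not part of either library; only the Textract Block fields (BlockType, Text, Geometry.BoundingBox) and the 0-1000 box range come from the respective docs.

```python
from typing import Dict, List, Tuple


def textract_page_to_inputs(blocks: List[Dict]) -> Tuple[List[str], List[List[int]]]:
    """Convert one page of Textract blocks into (words, boxes) for
    LayoutLMv2Processor with apply_ocr=False.

    Textract bounding boxes are ratios in [0, 1]; LayoutLMv2 expects
    integer coordinates normalized to the 0-1000 range.
    """
    words: List[str] = []
    boxes: List[List[int]] = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE/LINE/etc. blocks
        bb = block["Geometry"]["BoundingBox"]
        x0 = int(1000 * bb["Left"])
        y0 = int(1000 * bb["Top"])
        x1 = int(1000 * (bb["Left"] + bb["Width"]))
        y1 = int(1000 * (bb["Top"] + bb["Height"]))
        words.append(block["Text"])
        boxes.append([x0, y0, x1, y1])
    return words, boxes


# For a multi-page document, calling this once per page yields exactly the
# nested shape guessed above: words -> List[List[str]], boxes ->
# List[List[List[int]]], plus one page image per entry.
```

Whether those per-page lists should then be encoded as separate batch elements or somehow merged into a single training instance is exactly the open question.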
For anyone who comes across this, I wanted to link the bug report I wrote, which might save them some time:
https://github.com/huggingface/datasets/issues/4352
I'm still not 100% sure whether I am generating model inputs correctly from paginated data, but I will update this thread when I find out.
Again, for anyone who finds this, here is a better-formulated version of this question:
https://stackoverflow.com/questions/72260549/how-to-represent-paginated-documents-as-a-single-instance-of-training-data-for-w
https://discuss.huggingface.co/t/how-to-represent-paginated-documents-as-a-single-instance-of-training-data-for-whole-document-classification/18009