
DONUT: Reading order for pseudo-OCR pre-training task

mustaszewski opened this issue 1 year ago • 2 comments

I would like to train the Donut base model for a few more epochs on the pre-training pseudo-OCR task using a custom dataset. In what reading order should the individual words of the document image be passed to the model? The Donut paper states:

The model is trained to read all texts in the image in reading order (from top-left to bottom-right, basically). [...] This task can be interpreted as a pseudo-OCR task.

What does "top-left to bottom-right" mean for multi-column text? For instance, consider the attached dummy document (`000a_readingorder`), which has one heading and two text columns. Should the document be transcribed as:

  • Word1 Col1w1 Col1w2 Col2w1 Col2w2, or
  • Word1 Col1w1 Col2w1 Col1w2 Col2w2 ?

I imagine that any dataset used for the pre-training pseudo-OCR task should adopt the same reading-order policy as the pre-trained Donut base model. Unfortunately, I am not able to find any information on the exact implementation of "top-left to bottom-right" in the paper, the paper supplement, or the source code.

mustaszewski avatar Jan 16 '25 11:01 mustaszewski

Hi,

It would be best to contact the Donut author regarding this. @gwkrsrch

NielsRogge avatar Jan 16 '25 13:01 NielsRogge

Hi @mustaszewski (cc @NielsRogge ),

Apologies for the delayed response. Let me address your question.

In general, for documents, a raster scanning order is typically used. However, during the training of Donut, we employed a more mixed approach. You might find it useful to refer to our published code and datasets here:

For the synthetic samples, since we can create and control the layout entirely, we designed them with a reading order that considers column groupings, as exemplified in your first scenario. However, it's important to note that adhering strictly to this order wasn't a necessity for achieving our training objectives.
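To make the first scenario concrete, here is a minimal sketch (not the authors' actual generator) of a column-aware reading order: full-width elements such as headings are read first, then each column is read top to bottom, columns left to right. The `(x, y, w, h)` box format, the midpoint column split, and the `full_width_ratio` threshold are all assumptions for illustration.

```python
def column_reading_order(words, page_width, full_width_ratio=0.6):
    """Order (text, (x, y, w, h)) pairs column by column.

    Words wider than full_width_ratio * page_width (e.g. a heading)
    are emitted first, top to bottom; the remaining words are assigned
    to the left or right column by their horizontal midpoint.
    """
    full, left, right = [], [], []
    for text, (x, y, w, h) in words:
        if w >= full_width_ratio * page_width:
            full.append((y, text))          # full-width element (heading)
        elif x + w / 2 < page_width / 2:
            left.append((y, text))          # left column
        else:
            right.append((y, text))         # right column
    # Heading(s) first, then left column top-to-bottom, then right column.
    return ([t for _, t in sorted(full)]
            + [t for _, t in sorted(left)]
            + [t for _, t in sorted(right)])

# Dummy page from the question: one wide heading, two columns.
words = [
    ("Word1",  (10,  10, 200, 20)),
    ("Col1w1", (10,  50,  60, 20)),
    ("Col2w1", (150, 50,  60, 20)),
    ("Col1w2", (10,  80,  60, 20)),
    ("Col2w2", (150, 80,  60, 20)),
]
print(column_reading_order(words, page_width=220))
# -> ['Word1', 'Col1w1', 'Col1w2', 'Col2w1', 'Col2w2']  (first scenario)
```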

During the pretraining of Donut, in addition to synthetic samples, we also used real document data, employing OCR to generate pseudo-labels for the reading task. Given the complexity of accurately capturing layouts such as paragraphs and columns in real documents, it was infeasible to enforce the sophisticated reading order seen in your first scenario. Instead, we opted for a more straightforward approach, enforcing a simple top-left to bottom-right order, akin to your second scenario.
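The simple top-left to bottom-right pseudo-label ordering can be sketched as follows (again, not the Donut authors' code): group OCR word boxes into visual lines by vertical position, then read each line left to right. The `(x, y, w, h)` box format and the `line_tol` line-grouping threshold are assumptions for illustration.

```python
def raster_reading_order(words, line_tol=10):
    """Sort (text, (x, y, w, h)) pairs into a simple raster reading order.

    words: list of (text, box) tuples from an OCR engine.
    line_tol: max vertical distance (px) between word tops that still
              counts as the same text line (hypothetical threshold).
    """
    if not words:
        return []
    # Sweep lines from top to bottom by each word's top edge.
    items = sorted(words, key=lambda w: w[1][1])
    lines, current = [], [items[0]]
    for item in items[1:]:
        if abs(item[1][1] - current[-1][1][1]) <= line_tol:
            current.append(item)   # same visual line
        else:
            lines.append(current)  # start a new line
            current = [item]
    lines.append(current)
    # Within each line, read left to right by the x coordinate.
    return [text for line in lines
            for text, box in sorted(line, key=lambda w: w[1][0])]

# Same dummy two-column page: raster order ignores the columns.
words = [
    ("Word1",  (100, 10, 60, 20)),
    ("Col1w1", (10,  50, 60, 20)),
    ("Col2w1", (150, 50, 60, 20)),
    ("Col1w2", (10,  80, 60, 20)),
    ("Col2w2", (150, 80, 60, 20)),
]
print(raster_reading_order(words))
# -> ['Word1', 'Col1w1', 'Col2w1', 'Col1w2', 'Col2w2']  (second scenario)
```

Note how the same page yields the second scenario here: rows win over columns, which is exactly the ambiguity between the two orders described above.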

To summarize, during Donut's training, both types of reading orders you've described were encountered to an extent, potentially introducing a degree of ambiguity to the model. This mix worked well in our experiments, as our aim was to demonstrate Donut's effectiveness even without detailed layout adherence. Nonetheless, improving reading order alignment could likely enhance performance, although it wasn't our primary focus at the time.

I hope this clarifies things. Please feel free to reach out if you have further questions.

Best,
Geewook Kim

gwkrsrch avatar Jan 29 '25 11:01 gwkrsrch