
Generating bounding boxes with UDOP

Open AleRosae opened this issue 2 years ago • 7 comments

Hi,

By reading the UDOP paper, my understanding is that during pre-training the model is taught to predict the layout of a target (textual) sequence using special layout tokens. I was wondering whether it is possible to exploit this capability during finetuning as well, e.g. by finetuning the model on target sequences such as: <key> Name <loc_100> <loc_200> <loc_150> <loc_250> </key> <value> Jane Doe <loc_110> <loc_210> <loc_160> <loc_260> </value>

Ideally, could this approach provide a correspondence between the generated text (e.g. the name) and its position within the document page?

AleRosae avatar Jul 17 '23 12:07 AleRosae

We have some similar objectives. For example, in question answering, the answer will be followed by its bounding box. So this is indeed possible, as long as the format follows "[text sequence] "

zinengtang avatar Jul 17 '23 23:07 zinengtang

Thank you for your answer @zinengtang! So, if I'm not mistaken, to do this we should first normalize the original bounding boxes to the range [0, 1000] based on the width and height of the original image; then normalize them to [0, 1]; and then convert them into layout tokens by multiplying them by the layout vocabulary size (500). Am I getting it right?
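In other words, something like this is what I have in mind (a minimal sketch assuming a layout vocabulary of 500, <loc_k> tokens and an (x0, y0, x1, y1) box order; not taken from the official code):

def bbox_to_layout_tokens(bbox, image_width, image_height, vocab_size=500):
    # bbox is (x0, y0, x1, y1) in pixel coordinates of the original image
    x0, y0, x1, y1 = bbox
    # normalize each coordinate to [0, 1] using the image size
    normalized = [x0 / image_width, y0 / image_height,
                  x1 / image_width, y1 / image_height]
    # quantize into one of `vocab_size` bins and map to a layout token
    return " ".join(
        f"<loc_{min(int(c * vocab_size), vocab_size - 1)}>" for c in normalized
    )

# e.g. bbox_to_layout_tokens((100, 200, 150, 250), 1000, 1000)
# -> "<loc_50> <loc_100> <loc_75> <loc_125>"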

Btw, I'm using the (not yet merged) code from the HuggingFace PR that is porting UDOP into Transformers. Works like a charm, but there might be some differences with your code.

AleRosae avatar Jul 18 '23 07:07 AleRosae

@AleRosae can you share any snippets of your use of the PR? I got stuck on an early step.

Thanks in advance.

sromoam avatar Jul 18 '23 16:07 sromoam

Hi @sromoam, for inference you can use the standard generate() method:

model = UdopForConditionalGeneration.from_pretrained("udop_model")
outputs = model.generate(
    input_ids=input_ids,
    bbox=bbox,
    attention_mask=attention_mask,
    pixel_values=pixel_values,
    max_length=512,
    use_cache=False,
    num_beams=1,
    return_dict_in_generate=True,
)

You can obtain input_ids, bbox, attention_mask and pixel_values using the UdopProcessor:

processor = UdopProcessor.from_pretrained("udop_model", apply_ocr=True)
encoding = processor(images=image, return_tensors="pt").to(device)
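The keys of encoding map directly onto the generate() arguments above, and you can decode the output back to text, e.g. (my own addition, not from the PR; processor.tokenizer is the underlying UdopTokenizer):

outputs = model.generate(
    **encoding,  # input_ids, bbox, attention_mask, pixel_values
    max_length=512,
    return_dict_in_generate=True,
)
# decode the generated ids (set skip_special_tokens=True to drop special tokens)
generated_text = processor.tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
print(generated_text)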

For finetuning, you can follow the Pix2Struct tutorial. Just be sure to also include words and bboxes in your dataloader, as Pix2Struct only takes images as input.
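For clarity, here is roughly what I mean for a single training example (the field names and the apply_ocr=False setup are my own assumptions, following the LayoutLMv3-style processor API):

# processor created with apply_ocr=False, since we provide our own words/boxes
def encode_example(example, processor, max_length=512):
    # unlike Pix2Struct, UDOP needs the OCR words and their boxes, not just the image
    encoding = processor(
        images=example["image"],
        text=example["words"],    # list of OCR words
        boxes=example["boxes"],   # one box per word
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )
    # the target sequence (text plus layout tokens) still has to be tokenized
    # separately into `labels`, as in the Pix2Struct tutorial
    return encoding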

AleRosae avatar Jul 19 '23 10:07 AleRosae

Can you please share which libraries you imported for UdopForConditionalGeneration? I am getting this error: ImportError: cannot import name 'UdopForConditionalGeneration' from 'transformers'

jainamhdoshi avatar Mar 19 '24 11:03 jainamhdoshi

Solved the issue: we need transformers version "4.39.0.dev0", which can be found here: https://github.com/huggingface/transformers/blob/main/src/transformers/__init__.py (the commit on Mar 18, 2024).
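In practice, installing transformers from source, e.g. with pip install git+https://github.com/huggingface/transformers.git, should give you a development build that includes UdopForConditionalGeneration.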

jainamhdoshi avatar Mar 19 '24 12:03 jainamhdoshi

@zinengtang I want to use the processor with my own OCR. What should be the format of the bounding boxes?
1. Normalized by height and width?
2. Normalized by height and width, then multiplied by 1000?
3. Some other option?

Joao-M-Silva avatar Jun 07 '24 22:06 Joao-M-Silva