10 comments by Arnaud Stiegler

Happy to work on TrOCR (PyTorch and TF)

Thank you! Let me know if there's anything I can help with :)

Oh yeah, you're right! Completely missed it, and it basically solves the generation issue after 50 steps. `step: 0 train_loss: 8.3875150680542 prediction: ['`...

Good catch, just tried without label smoothing and the losses now look much more normal: `step: 0 train_loss: 7.458827972412109 prediction: ['`...
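For context, a self-contained sketch of why label smoothing inflates the reported loss even when the model predicts the right class. This is plain Python reproducing the usual formulation (true class gets `1 - s + s/n` target probability, the rest get `s/n`), not the training script from this thread:

```python
import math

def cross_entropy(logits, target, smoothing=0.0):
    """Cross-entropy over raw logits, with optional label smoothing."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [(x - m) - log_z for x in logits]

    # Smoothed target distribution: spread `smoothing` uniformly,
    # keep the remaining (1 - smoothing) mass on the true class.
    n = len(logits)
    q = [smoothing / n] * n
    q[target] += 1.0 - smoothing
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

logits = [2.0, 0.5, -1.0]  # model is confident in the correct class 0
plain = cross_entropy(logits, 0)
smoothed = cross_entropy(logits, 0, smoothing=0.1)
# `smoothed` is strictly larger: some target mass sits on low-probability
# classes, so the loss floor is above zero even for perfect predictions.
```

This is why the loss values with smoothing enabled look "off" compared to the plain cross-entropy numbers above, without the model necessarily being worse.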

Trying it right now! Will keep you updated once I get the results back :)

From my experiment, the training loss on larger datasets is indeed much lower (expected), but it doesn't seem to solve the issue.

Losses overall look okay (with and without label smoothing), but there seems to be some disconnect between the loss values I'm getting (both training and validation) and the actual...

Yeah, the model seems to be learning well on a >3k-image dataset with the change to the decoder config. This seems to be the root cause. Really good catch @gbarello-uipath...

Thanks for the answer! I didn't know about the predefined pipelines; not sure whether I missed them in the documentation. Are those just "random" pipelines, or is there a specific...

One solution that works is:

- `processor.tokenizer._tokenizer.pre_tokenizer.add_prefix_space = False` to prevent the model from using tokens preceded by a blank space
- Add the token '1' to the tokenizer

It's...
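A minimal sketch of those two fixes using the standalone `tokenizers` library. The thread applies them to an existing `processor.tokenizer._tokenizer` from a TrOCR checkpoint; here a byte-level tokenizer is built in memory so the effect is visible without downloading anything:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel

# Stand-in for processor.tokenizer._tokenizer (a byte-level BPE backend).
tok = Tokenizer(BPE())

# Fix 1: disable the prefix space, so the first word is not tokenized
# as if it were preceded by a blank space. On a loaded processor the
# thread flips this flag in place on the existing pre-tokenizer.
tok.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Fix 2: register '1' as an explicit token so it is always kept whole.
num_added = tok.add_tokens(["1"])  # returns how many tokens were new
```

After resizing the decoder's token embeddings to the new vocabulary size, the model can emit '1' as a single, space-free token.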