Kosmos-2.5 - Image-to-markdown generation for images outside the sample-set provided is almost entirely garbled - output markdown is completely unusable.
Describe the bug
Model I am using: Kosmos-2.5
The problem arises when using:
- [x] the official example scripts: Using the exact required custom libraries and dependencies to run the supplied inference.py script. The same results are obtained when run bare-metal or when extended via a simple Flask API in a containerized environment: https://github.com/abgulati/kosmos-2_5-containerized
Description: Image-to-markdown generation for images outside the sample-set provided is almost entirely garbled - output markdown is completely unusable.
Elaborating in the examples below:
Example 1 - Using the in.png sample image provided with the model:
- On running the inference.py script with --do_md for image-to-markdown generation:
- Isolating the results:
- Cleaning the results:
- Perfect markdown output as rendered via https://markdownlivepreview.com/:
This confirms the model is working correctly!
Example 2 - Table from a Boeing manual:
- Output of the inference.py script with --do_md for image-to-markdown generation:
- Copying, cleaning and generating a markdown preview of the results - completely garbled & unusable output:
Example 3 - Table of network connectors from my notes for the CompTIA Network+ exam:
- Output of the inference.py script with --do_md for image-to-markdown generation:
- Copying, cleaning and generating a markdown preview of the results - completely garbled & unusable output:
Example 4 - Table of common ports and services from my notes for the CompTIA Network+ & Security+ exams:
- Output of the inference.py script with --do_md for image-to-markdown generation:
- Copying, cleaning and generating a markdown preview of the results - completely garbled & unusable output:
As demonstrated by these examples, markdown-generation for images outside the sample (training?) set is completely garbled and unusable. The first example establishes the model itself is working correctly.
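One rough way to put a number on "garbled" for table output is to check whether every row of a generated table has the same number of cells. The sketch below is only a heuristic under that assumption: the function names are hypothetical, it ignores escaped pipes, and a table can pass this check and still contain wrong text.

```python
def table_row_widths(markdown: str) -> list[int]:
    """Return the cell count of every line that looks like a markdown table row."""
    widths = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not a table row in the leading-pipe style
        # Drop exactly one leading and one trailing pipe so empty edge cells survive.
        line = line[1:]
        if line.endswith("|"):
            line = line[:-1]
        widths.append(len(line.split("|")))
    return widths


def looks_consistent(markdown: str) -> bool:
    """True if there are at least two table rows and all have the same width."""
    widths = table_row_widths(markdown)
    return len(widths) >= 2 and len(set(widths)) == 1
```

Running `looks_consistent` on the cleaned output of each example would give a quick pass/fail signal to attach alongside the rendered previews.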
Further, --do_ocr works perfectly and outputs high-accuracy, high-quality data.
To Reproduce
Steps to reproduce the behavior:
- Run model for markdown generation:
python3 inference.py --do_md --image path/image.png --ckpt ckpt.pt
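To compare --do_md against --do_ocr on the same images, the command above can be wrapped in a small driver. This is a minimal sketch, not part of the repository: `build_command` and `run_all` are hypothetical helpers, and it assumes the image flag is `--image` and the checkpoint flag is `--ckpt`.

```python
import shlex
import subprocess


def build_command(image, ckpt, mode="md", script="inference.py"):
    """Build the argv list for one run of the inference script.

    mode is "md" (--do_md) or "ocr" (--do_ocr), the two flags used in this report.
    The --image and --ckpt flag names are assumptions about the script's CLI.
    """
    if mode not in ("md", "ocr"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return ["python3", script, f"--do_{mode}", "--image", str(image), "--ckpt", str(ckpt)]


def run_all(images, ckpt, mode="md"):
    """Run the script once per image, echoing each command before it starts."""
    for image in images:
        cmd = build_command(image, ckpt, mode)
        print("$", shlex.join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

For example, `run_all(["path/image.png"], "ckpt.pt", mode="ocr")` would repeat the run in OCR mode for a side-by-side comparison.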
Expected behavior
Respectably accurate markdown generation
- Platform: WSL Ubuntu 22.04
- Python version: v3.10.12
- PyTorch version (GPU?): 2.5.0.dev20240705+cu124 for RTX 3090
- Detailed system specs:
Intel Core i9 13900KF
Nvidia RTX 3090FE
32GB DDR5 5600MT/s (16x2)
Windows 11 - OS Build 22631.3737
CUDA 12.4
Flash-Attention-2 (v2.5.9.post1)
tiktoken 0.7.0
tqdm 4.66.4
omegaconf 2.0.6 (hydra-core 1.0.7)
boto3 1.34.140
iopath 0.1.10
fairscale 0.4.0
scipy 1.10.0
triton 2.3.1
https://github.com/facebookresearch/xformers.git@04de99bb28aa6de8d48fab3cdbbc9e3874c994b8
https://github.com/Dod-o/kosmos2.5_tools.git@fairseq
https://github.com/Dod-o/kosmos2.5_tools.git@infinibatch
https://github.com/Dod-o/kosmos2.5_tools.git@torchscale
https://github.com/Dod-o/kosmos2.5_tools.git@transformers
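For anyone trying to match this environment, the installed versions can be dumped with the standard library. A minimal sketch: `collect_versions` is a hypothetical helper, and the distribution names in the usage note are assumed to match the packages listed above.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions
```

Calling `collect_versions(["torch", "tiktoken", "tqdm", "omegaconf", "hydra-core", "boto3", "iopath", "fairscale", "scipy", "triton"])` and printing the result gives a listing that can be compared directly against the one above.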
Following