experimental: introduce img understand pipeline
This new feature introduces an ImgUnderstand pipeline which uses vision LLMs to describe the pictures contained in documents.
The pipeline supports:
- a local LLM, via vLLM
- an LLM as a service, e.g. on watsonx.ai or OpenAI-compatible APIs
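As a rough illustration of the LLM-as-a-service path, a request to an OpenAI-compatible vision endpoint could be assembled as below. This is a sketch, not the pipeline's actual code: the model name and prompt are placeholders.

```python
import base64

def build_vision_request(image_bytes: bytes, prompt: str, model: str) -> dict:
    """Assemble a chat-completions payload for an OpenAI-compatible vision API."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # Images travel inline as a base64 data URL.
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
    }

# Hypothetical usage: one payload per picture extracted from the document.
payload = build_vision_request(b"\x89PNG...", "Describe this picture.", "some-vision-model")
```

The same payload shape works against vLLM's OpenAI-compatible server and hosted services, which is what lets the two backends share one code path.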
Checklist:
- [x] Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
- [ ] Documentation has been updated, if necessary.
- [ ] Examples have been added, if necessary.
- [ ] Tests have been added, if necessary.
Offline LLM
vLLM
Pros:
- runs vision models efficiently offline, see the vLLM docs page.
- supports different models without further specialization
- already used by InstructLab and part of RHEL AI
Cons:
- no support for macOS (any architecture)
- vLLM has an exact pinning of `torch`, which creates issues with `poetry`:
  - `vllm==0.5.x` depends on `torch==2.3.0`
  - `vllm==0.6.x` depends on `torch==2.4.0`
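A hypothetical `pyproject.toml` fragment illustrates the conflict: with an exact pin, poetry cannot resolve vllm together with any other constraint on `torch` (versions below are for illustration only).

```toml
[tool.poetry.dependencies]
python = "^3.10"
vllm = "0.5.4"     # hard-pins torch==2.3.0 internally
torch = "2.4.0"    # unresolvable: clashes with vllm's exact pin
```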
HF transformers
Pros:
- no strong pinning of `torch`
Cons:
- more code needed
- different models require different implementations, e.g. `llava-next` is different than `phi-3-v`.
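The per-model divergence can be sketched as a small dispatch table: each model family gets its own prompt builder, since HF vision models do not share a single chat format. The templates below are illustrative placeholders, not guaranteed to match each model card exactly.

```python
# Sketch: per-model prompt builders. Real code would also pair each family
# with its own processor/model classes from transformers.

def llava_next_prompt(user_text: str) -> str:
    # llava-next (mistral variant) style, image placeholder inline.
    return f"[INST] <image>\n{user_text} [/INST]"

def phi3_vision_prompt(user_text: str) -> str:
    # phi-3-vision style, numbered image tokens and role markers.
    return f"<|user|>\n<|image_1|>\n{user_text}<|end|>\n<|assistant|>\n"

PROMPT_BUILDERS = {
    "llava-next": llava_next_prompt,
    "phi-3-v": phi3_vision_prompt,
}

def build_prompt(model_family: str, user_text: str) -> str:
    try:
        return PROMPT_BUILDERS[model_family](user_text)
    except KeyError:
        raise ValueError(f"no prompt builder registered for {model_family!r}")
```

This is the extra code the vLLM backend avoids, since vLLM applies the model's chat template itself.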
@dolfim-ibm could we not use some standard HF models (e.g. Florence and OneChart)?
superseded by #259