
experimental: introduce img understand pipeline

Open dolfim-ibm opened this issue 1 year ago • 2 comments

This new feature introduces an ImgUnderstand pipeline that uses vision LLMs to describe the pictures contained in documents.

The pipeline supports:

  1. a local LLM, via vLLM
  2. an LLM as a service, e.g. on watsonx.ai or via OpenAI-compatible APIs
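For option 2, a minimal sketch of what a request could look like, assuming an OpenAI-compatible `/v1/chat/completions` endpoint. The model name, prompt, and `max_tokens` value below are illustrative placeholders, not the pipeline's actual configuration:

```python
import base64
import json

def build_describe_request(image_bytes: bytes,
                           model: str = "some-vision-model") -> dict:
    """Build an OpenAI-compatible chat payload asking a vision LLM
    to describe a picture extracted from a document."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # placeholder name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this picture in detail."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }
        ],
        "max_tokens": 256,
    }

# The same payload shape works against a local vLLM server
# (`vllm serve <model>`) or a hosted OpenAI-compatible API.
payload = build_describe_request(b"\x89PNG...")
print(json.dumps(payload)[:60])
```

Because the wire format is the same, switching between a local vLLM deployment and a hosted service only changes the base URL and credentials, not the request-building code.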

Checklist:

  • [x] Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
  • [ ] Documentation has been updated, if necessary.
  • [ ] Examples have been added, if necessary.
  • [ ] Tests have been added, if necessary.

dolfim-ibm avatar Sep 22 '24 18:09 dolfim-ibm

Offline LLM

vLLM

Pros:

  • efficiently runs vision models offline; see the docs page.
  • supports different models without further specialization
  • already used by InstructLab and part of RHEL AI

Cons:

  • no macOS support (any architecture)
  • vLLM pins torch to an exact version, which causes dependency-resolution conflicts with Poetry:
    • vllm==0.5.x depends on torch==2.3.0
    • vllm==0.6.x depends on torch==2.4.0
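The pinning conflict looks roughly like this in a Poetry project (version numbers per the notes above; the surrounding constraints are illustrative):

```toml
# pyproject.toml (illustrative fragment)
[tool.poetry.dependencies]
python = "^3.10"
# vllm 0.5.x hard-pins torch:
vllm = "0.5.4"    # requires torch==2.3.0 exactly
# any other dependency wanting a newer torch now fails to resolve:
torch = ">=2.4"   # SolverProblemError: incompatible with vllm's torch==2.3.0
```

Because Poetry resolves the whole dependency graph up front, a single exact pin like this blocks every other package that needs a different torch release.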

HF Transformers

Pros:

  • no strong pinning of torch

Cons:

  • more code needed
  • different models require different implementations, e.g. llava-next differs from phi-3-v.
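A sketch of why that last point means extra code: each model family expects its own image placeholder and prompt template, so the pipeline needs per-family glue. The template strings and family keys below are illustrative approximations, not verified exact formats:

```python
# Per-model-family prompt builders. The templates are rough
# approximations of each family's expected format, for illustration only.
PROMPT_BUILDERS = {
    # LLaVA-NeXT style: instruction markers plus an <image> token
    "llava-next": lambda q: f"[INST] <image>\n{q} [/INST]",
    # Phi-3-vision style: numbered image placeholder plus chat markers
    "phi-3-v": lambda q: f"<|user|>\n<|image_1|>\n{q}<|end|>\n<|assistant|>\n",
}

def build_prompt(model_family: str, question: str) -> str:
    """Dispatch to the right prompt template for the chosen model family."""
    try:
        return PROMPT_BUILDERS[model_family](question)
    except KeyError:
        raise ValueError(f"no prompt template for {model_family!r}")

print(build_prompt("llava-next", "Describe this picture."))
```

With vLLM's OpenAI-compatible server this dispatch is handled server-side; going through HF Transformers directly, every new model family adds another entry like these.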

dolfim-ibm avatar Sep 22 '24 18:09 dolfim-ibm

@dolfim-ibm could we not use some standard HF models (e.g. Florence and OneChart)?

PeterStaar-IBM avatar Sep 24 '24 04:09 PeterStaar-IBM

superseded by #259

dolfim-ibm avatar Nov 06 '24 10:11 dolfim-ibm