
Add support for Google's Gemini and Anthropic's Claude models

Open · Eigilak opened this issue 10 months ago · 6 comments

Description

Currently, markitdown only supports OpenAI models for image captioning and content extraction. It would be valuable to add support for other leading multimodal LLMs, specifically:

  1. Google's Gemini models (Pro and Ultra)
  2. Anthropic's Claude models (Opus, Sonnet, and Haiku)

This would give users more flexibility to choose based on their preferences, API access, pricing, or specific model strengths.

Motivation

  • Different users have access to different AI provider APIs
  • Some users may prefer the strengths of a particular model family
  • Pricing and rate limits vary between providers
  • Organizations may have existing enterprise agreements with Google or Anthropic

Current implementation

At present, _image_converter.py hardcodes OpenAI-specific client calls:

# Prepare the OpenAI API request
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": data_uri,
                },
            },
        ],
    }
]

# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content
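
For comparison, the equivalent request against Anthropic's Messages API uses a differently shaped image block, requires max_tokens, and exposes the reply through a different accessor. A rough sketch (base64_data and prompt stand in for values the converter already computes; note that Anthropic expects raw base64, not a data: URI):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,  # required by Anthropic, unlike OpenAI's API
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": base64_data,  # raw base64, no "data:" prefix
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }
    ],
)
caption = response.content[0].text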

Proposed solution

Create an abstraction layer for LLM providers that would:

  1. Detect the client type (OpenAI, Google, or Anthropic)
  2. Use the appropriate API format for each provider
  3. Extract the response content consistently

This could be implemented either:

  • As provider-specific adapter classes
  • Through a simple detection mechanism based on client type (see the sketch after this list)
  • Or potentially by leveraging [Semantic Kernel](https://github.com/microsoft/markitdown/issues/232) as suggested in a related issue
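
A minimal sketch of what the detection mechanism could look like (the function name and the module-based dispatch rule here are illustrative, not a settled design):

def _caption_image(client, model: str, prompt: str, data_uri: str) -> str:
    """Route the request to whichever provider SDK `client` belongs to."""
    provider = type(client).__module__.split(".")[0]

    if provider == "openai":
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_uri}},
                ],
            }],
        )
        return response.choices[0].message.content

    if provider == "anthropic":
        # Would translate data_uri into Anthropic's base64 image block
        raise NotImplementedError("Anthropic adapter sketched above")

    if provider == "google":
        # Would translate into the google-genai SDK's content types
        raise NotImplementedError("Gemini adapter not shown")

    raise ValueError(f"Unsupported LLM client: {provider!r}")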

Eigilak · Mar 15 '25

Thanks for the issue. 100% agree we need to abstract away the LLM client -- or perhaps even the idea of using an LLM.

As far as this library goes, we basically need an image captioner, and it doesn't really matter how that happens (though being able to customize the prompt is a nice feature).

I'm going to think on this for a bit, and sort out what design we are comfortable with -- but for sure this is a feature we need to implement.
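
To make that concrete, the captioner hook could be as small as a callable from image bytes (plus an optional prompt) to text. A sketch only -- every name below is hypothetical, not a committed markitdown API:

from typing import Optional, Protocol

class ImageCaptioner(Protocol):
    """Anything that can turn image bytes into a text caption."""

    def __call__(self, image: bytes, *, prompt: Optional[str] = None) -> str: ...

# Any backend satisfies this -- an OpenAI call, a Gemini call, or a stub:
def stub_captioner(image: bytes, *, prompt: Optional[str] = None) -> str:
    return f"An image ({len(image)} bytes); no model was consulted."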

afourney · Mar 16 '25

Related #1135

afourney · Mar 16 '25

We're 100% powered by Gemini; we really need this supported.

deathemperor · Apr 25 '25

Hi, I noticed your discussion about abstracting the LLM client. It looks like the pydantic-ai library (https://ai.pydantic.dev) already provides a nice abstraction layer, with support for Gemini, OpenAI, Ollama, and more.

Perhaps the API integration within markitdown could look something like this:

from pydantic_ai import Agent, BinaryContent
from markitdown import MarkItDown

# markitdown could internally utilize a function like this
def describe_image(agent: Agent, data: bytes):
    result = agent.run_sync(
        [BinaryContent(data=data, media_type="image/jpeg")],
    )
    return result.output


# Proposed usage: build the agent once, then hand it to MarkItDown
agent = Agent(model="gemini-2.0-flash", system_prompt="describe the image")
md = MarkItDown(agent=agent)  # `agent=` would be a new MarkItDown parameter
md.convert("example.jpg")

Stanley5249 · May 10 '25

Hi, has this been officially implemented? Thanks!

ansemin · Jun 27 '25

bump

raphael2692 · Oct 28 '25