Add support for Google's Gemini and Anthropic's Claude models
Description
Currently, markitdown only supports OpenAI models for image captioning and content extraction. It would be valuable to add support for other leading multimodal LLMs, specifically:
- Google's Gemini models (Pro and Ultra)
- Anthropic's Claude models (Opus, Sonnet, and Haiku)
This would give users more flexibility to choose a provider based on their preferences, API access, pricing, or specific model strengths.
Motivation
- Different users have access to different AI provider APIs
- Some users may prefer the strengths of a particular model family
- Pricing and rate limits vary between providers
- Organizations may have existing enterprise agreements with Google or Anthropic
Current implementation
`_image_converter.py` currently hardcodes OpenAI-specific API calls:
```python
# Prepare the OpenAI API request
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": data_uri,
                },
            },
        ],
    }
]

# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content
```
Proposed solution
Create an abstraction layer for LLM providers that would:
- Detect the client type (OpenAI, Google, or Anthropic)
- Use the appropriate API format for each provider
- Extract the response content consistently
This could be implemented in several ways:
- As provider-specific adapter classes
- Through a simple detection mechanism based on client type (sketched below)
- Or by leveraging [Semantic Kernel](https://github.com/microsoft/markitdown/issues/232) as suggested in a related issue
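As a rough illustration of the detection-based option, a helper like the following could live in `_image_converter.py`. This is only a sketch: the `_caption_image` name and the detect-by-module trick are assumptions for this example. The OpenAI and Anthropic request shapes follow their documented SDKs, and the Google branch assumes the caller passes a configured `google.generativeai` `GenerativeModel`.

```python
import base64


def _caption_image(client, model: str, prompt: str, data_uri: str) -> str:
    # Hypothetical helper: dispatch on the package the client object came from.
    provider = type(client).__module__.split(".")[0]

    if provider == "openai":
        # OpenAI chat completions accept image_url parts with a data URI.
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": data_uri}},
                    ],
                }
            ],
        )
        return response.choices[0].message.content

    # Split "data:image/jpeg;base64,<payload>" for the other providers.
    header, b64_data = data_uri.split(",", 1)
    media_type = header.removeprefix("data:").removesuffix(";base64")

    if provider == "anthropic":
        # Anthropic's Messages API takes base64 image source blocks
        # and requires max_tokens on every request.
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": media_type,
                                "data": b64_data,
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        )
        return response.content[0].text

    if provider == "google":
        # Assumed here: the client is a google.generativeai GenerativeModel,
        # whose generate_content accepts text parts and blob dicts.
        response = client.generate_content(
            [prompt, {"mime_type": media_type, "data": base64.b64decode(b64_data)}]
        )
        return response.text

    raise ValueError(f"Unsupported LLM client type: {type(client)!r}")
```

Wrapping each branch in a provider adapter class would keep this dispatch out of the converter itself and make new providers pluggable.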
Related issues
- [Issue #232: Suggestion to use Semantic Kernel for different LLM providers](https://github.com/microsoft/markitdown/issues/232)
- [Issue #12: LLM Integration for image understanding](https://github.com/microsoft/markitdown/issues/12)
Thanks for the issue. 100% agree we need to abstract away the LLM client -- or perhaps even the idea of using an LLM.
As far as this library goes, we basically need an image captioner, and it doesn't really matter how that happens (though being able to customize the prompt is a nice feature).
I'm going to think on this for a bit, and sort out what design we are comfortable with -- but for sure this is a feature we need to implement.
Related #1135
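To make that point concrete: markitdown could depend only on an image-captioner callable, with providers supplied as thin wrappers around it. A minimal sketch, where `ImageCaptioner` and `openai_captioner` are hypothetical names rather than existing markitdown API:

```python
import base64
from typing import Callable

# Hypothetical shape markitdown could depend on: (image_bytes, prompt) -> caption.
ImageCaptioner = Callable[[bytes, str], str]


def openai_captioner(client, model: str) -> ImageCaptioner:
    """Wrap an OpenAI client into the captioner shape; any other provider,
    or even a local vision model, could supply the same signature."""

    def caption(image_bytes: bytes, prompt: str) -> str:
        data_uri = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": data_uri}},
                    ],
                }
            ],
        )
        return response.choices[0].message.content

    return caption
```

Prompt customization then stays with the converter, which passes the prompt through on every call.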
We're 100% powered by Gemini, so we really need this supported.
Hi, I noticed your discussion about abstracting the LLM client. It looks like the pydantic-ai library (https://ai.pydantic.dev) already provides a nice abstraction layer, with support for Gemini, OpenAI, Ollama, and more.
Perhaps the API integration within markitdown could look something like this:
```python
from pydantic_ai import Agent, BinaryContent

from markitdown import MarkItDown


# markitdown could internally utilize a function like this
def describe_image(agent: Agent, data: bytes):
    result = agent.run_sync(
        [BinaryContent(data=data, media_type="image/jpeg")],
    )
    return result.output


agent = Agent(model="gemini-2.0-flash", system_prompt="describe the image")
md = MarkItDown(agent=agent)
md.convert("example.jpg")
```
Hi, has this been officially implemented? Thanks!
bump