Add support for Google's Gemini and Anthropic's Claude models
Description
Currently, markitdown only supports OpenAI models for image captioning and content extraction. It would be valuable to add support for other leading multimodal LLMs, specifically:
- Google's Gemini models (Pro and Ultra)
- Anthropic's Claude models (Opus, Sonnet, and Haiku)
This would give users more flexibility to choose a provider based on their preferences, API access, pricing, or specific model strengths.
Motivation
- Different users have access to different AI provider APIs
- Some users may prefer the strengths of a particular model family
- Pricing and rate limits vary between providers
- Organizations may have existing enterprise agreements with Google or Anthropic
Current implementation
`_image_converter.py` currently hardcodes OpenAI-specific API calls:
```python
# Prepare the OpenAI API request
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": data_uri,
                },
            },
        ],
    }
]

# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content
```
Proposed solution
Create an abstraction layer for LLM providers that would:
- Detect the client type (OpenAI, Google, or Anthropic)
- Use the appropriate API format for each provider
- Extract the response content consistently
This could be implemented in several ways:
- As provider-specific adapter classes
- Through a simple detection mechanism based on client type (sketched below)
- Or by leveraging [Semantic Kernel](https://github.com/microsoft/markitdown/issues/232) as suggested in a related issue
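As a rough illustration of the detection-based option, a helper like the following could live in `_image_converter.py`. This is only a sketch: the `_caption_image` name and the detect-by-module trick are assumptions for this example. The OpenAI and Anthropic request shapes follow their documented SDKs, and the Google branch assumes the caller passes a configured `google.generativeai` `GenerativeModel`.

```python
import base64


def _caption_image(client, model: str, prompt: str, data_uri: str) -> str:
    # Hypothetical helper: dispatch on the package the client object came from.
    provider = type(client).__module__.split(".")[0]

    if provider == "openai":
        # OpenAI chat completions accept image_url parts with a data URI.
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": data_uri}},
                    ],
                }
            ],
        )
        return response.choices[0].message.content

    # Split "data:image/jpeg;base64,<payload>" for the other providers.
    header, b64_data = data_uri.split(",", 1)
    media_type = header.removeprefix("data:").removesuffix(";base64")

    if provider == "anthropic":
        # Anthropic's Messages API takes base64 image source blocks
        # and requires max_tokens on every request.
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": media_type,
                                "data": b64_data,
                            },
                        },
                        {"type": "text", "text": prompt},
                    ],
                }
            ],
        )
        return response.content[0].text

    if provider == "google":
        # Assumed here: the client is a google.generativeai GenerativeModel,
        # whose generate_content accepts text parts and blob dicts.
        response = client.generate_content(
            [prompt, {"mime_type": media_type, "data": base64.b64decode(b64_data)}]
        )
        return response.text

    raise ValueError(f"Unsupported LLM client type: {type(client)!r}")
```

Wrapping each branch in a provider adapter class would keep this dispatch out of the converter itself and make new providers pluggable.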
Related issues
- [Issue #232: Suggestion to use Semantic Kernel for different LLM providers](https://github.com/microsoft/markitdown/issues/232)
- [Issue #12: LLM Integration for image understanding](https://github.com/microsoft/markitdown/issues/12)
Thanks for the issue. 100% agree we need to abstract away the LLM client -- or perhaps even the idea of using an LLM.
As far as this library goes, we basically need an image captioner, and it doesn't really matter how that happens (though being able to customize the prompt is a nice feature).
I'm going to think on this for a bit, and sort out what design we are comfortable with -- but for sure this is a feature we need to implement.
Related #1135
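To make that point concrete: markitdown could depend only on an image-captioner callable, with providers supplied as thin wrappers around it. A minimal sketch, where `ImageCaptioner` and `openai_captioner` are hypothetical names rather than existing markitdown API:

```python
import base64
from typing import Callable

# Hypothetical shape markitdown could depend on: (image_bytes, prompt) -> caption.
ImageCaptioner = Callable[[bytes, str], str]


def openai_captioner(client, model: str) -> ImageCaptioner:
    """Wrap an OpenAI client into the captioner shape; any other provider,
    or even a local vision model, could supply the same signature."""

    def caption(image_bytes: bytes, prompt: str) -> str:
        data_uri = "data:image/jpeg;base64," + base64.b64encode(image_bytes).decode()
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": data_uri}},
                    ],
                }
            ],
        )
        return response.choices[0].message.content

    return caption
```

Prompt customization then stays with the converter, which passes the prompt through on every call.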
We're 100% powered by Gemini, so we really need this supported.
Hi, I noticed your discussion about abstracting the LLM client. It looks like the pydantic-ai library (https://ai.pydantic.dev) already provides a nice abstraction layer, with support for Gemini, OpenAI, Ollama, and more.
Perhaps the API integration within markitdown could look something like this:
```python
from pydantic_ai import Agent, BinaryContent

from markitdown import MarkItDown


# markitdown could internally utilize a function like this
def describe_image(agent: Agent, data: bytes):
    result = agent.run_sync(
        [BinaryContent(data=data, media_type="image/jpeg")],
    )
    return result.output


agent = Agent(model="gemini-2.0-flash", system_prompt="describe the image")
md = MarkItDown(agent=agent)
md.convert("example.jpg")
```
Hi, has this been officially implemented? Thanks!
bump