Investigate support for images in `ToolCallResult`
Currently, providers that expose a tool role for messages (OpenAI, Gemini, Ollama) do not support images in the tool result.
Conversely, Anthropic has no tool role and requires tool results to be included in a user message, which can contain images. Since multimodal support in Amazon Bedrock is mostly based on Anthropic, Bedrock supports this use case as well. For a Bedrock example, see https://github.com/deepset-ai/haystack-experimental/pull/307#issuecomment-2903865141.
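For reference, this is roughly what the Anthropic pattern looks like: the tool result travels back in a user message as a `tool_result` content block, which can itself contain image blocks (a sketch with placeholder ids and image data):

```python
import base64

# Read the image produced by the tool (placeholder file name).
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Anthropic Messages API: the tool result is sent as a *user* message whose
# content is a "tool_result" block; that block's content can mix text and images.
tool_result_message = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_123",  # placeholder: id of the tool_use block from the assistant message
            "content": [
                {"type": "text", "text": "Here is the chart you asked for."},
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
                },
            ],
        }
    ],
}
# This dict is appended to the `messages` list of the next request,
# e.g. anthropic.Anthropic().messages.create(..., messages=[...]).
```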
Supporting this use case would require refactoring our `ToolCallResult` dataclass.
If more model providers begin to allow this pattern, we should investigate and evaluate this refactoring.
This doesn't currently seem like a high priority, because a simple workaround is to create a user message with the image returned by the tool.
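To make the workaround concrete, here is a rough sketch of what it could look like on the Haystack side. The `ImageContent` part and the `content_parts` argument of `ChatMessage.from_user` are assumptions borrowed from the ongoing multimodality work, not a settled API:

```python
from haystack.dataclasses import ChatMessage, ToolCall
# Assumption: a multimodal content part like this is available once multimodality lands.
from haystack.dataclasses import ImageContent

tool_call = ToolCall(tool_name="generate_chart", arguments={"query": "monthly revenue"})
image_b64 = "iVBORw0KGgo..."  # placeholder: base64 PNG returned by the tool

# The ToolCallResult / tool message stays text-only...
tool_message = ChatMessage.from_tool(
    tool_result="Chart generated, see the attached image.", origin=tool_call
)

# ...and the image itself reaches the model via a separate user message.
# Assumption: from_user accepts a list of content parts mixing text and images.
user_message = ChatMessage.from_user(
    content_parts=["Here is the image returned by the tool:", ImageContent(base64_image=image_b64)]
)

messages = [tool_message, user_message]
```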
> a simple workaround is to create a user message with the image returned by the tool
I am afraid this might not be that simple: Anthropic uses special content types for tool results, different from those of regular user messages. So we would still need to track whether a message comes from a tool or from the user.
Moreover, if my understanding is correct, Gemini models should accept any JSON-serializable value, not only strings. (reference: https://cloud.google.com/vertex-ai/generative-ai/docs/reference/rest/v1beta1/Content#FunctionResponse)
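For example, the REST shape described in that reference allows an arbitrary JSON object in `response` (function name and payload below are made up):

```python
# A Gemini functionResponse part, following the FunctionResponse reference above.
function_response_part = {
    "functionResponse": {
        "name": "generate_chart",
        "response": {
            # any JSON-serializable structure, not just a string
            "status": "ok",
            "points": [{"month": "Jan", "value": 42}, {"month": "Feb", "value": 57}],
        },
    }
}
# This part goes into the Content that is sent back to the model after the function call.
```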
IMO, there is currently no clear consensus on the tool return format and types. So ideally we should stay a little flexible, so that it is ultimately up to the model provider to decide how the tool result can be used. I don't have a clear plan at the moment, though.
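Just to make "flexible" a bit more concrete, one rough shape the refactoring could take (field and class names below are hypothetical, this is not a proposal):

```python
from dataclasses import dataclass, field
from typing import List, Union

from haystack.dataclasses import ToolCall  # existing dataclass


@dataclass
class ImagePart:
    """Hypothetical image content part."""
    base64_image: str
    mime_type: str = "image/png"


@dataclass
class ToolCallResult:
    """Hypothetical relaxed version; the current dataclass carries a plain string result."""
    origin: ToolCall
    error: bool = False
    # Mixed text/image parts: each chat generator forwards what its provider
    # supports and serializes or drops the rest.
    content: List[Union[str, ImagePart]] = field(default_factory=list)
```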
This topic re-emerged when experimenting with MCP tools that can return images.
When multimodality lands in Haystack, we should find a convenient way to make the LLM understand these results. We might also explore the user-message idea mentioned above.