tools returning a ContentImageUrl break openai
As per documentation:
... Your function can return standard objects to cast into strings, ..., or even content parts for multi-modal generation (ContentImageUrl)
However using a tool that returns an image via rg.ContentImageUrl creates a tool response that OpenAI doesn't like:
litellm.BadRequestError: OpenAIException - Error code: 400 - {
'error': {
'message': "Invalid 'messages[3]'. Image URLs are only allowed for messages with role 'user', but this message with role 'tool' contains an image URL.",
'type': 'invalid_request_error',
'param': 'messages[3]',
'code': 'invalid_value'
}
}
Possible solution: rigging should check if a tool is returning an image, in which case it should force the message role to user.
Example tool:
import cv2
import os
import rigging as rg
def read_webcam_image() -> rg.ContentImageUrl:
"""Reads an image from the webcam."""
webcam_url = os.getenv("WEBCAM_URL")
if webcam_url is None:
raise Exception("WEBCAM_URL environment variable is not set")
cap = cv2.VideoCapture(webcam_url)
try:
if cap.isOpened():
ret, frame = cap.read()
if ret:
# save the image to a file in the same directory as the script
script_path = os.path.abspath(__file__)
script_dir = os.path.dirname(script_path)
screenshot_path = os.path.join(script_dir, "webcam.jpg")
cv2.imwrite(screenshot_path, frame)
return rg.ContentImageUrl.from_file(
screenshot_path, mimetype="image/jpeg"
)
else:
raise Exception("Failed to read frame from RTSP stream")
else:
raise Exception("Could not open RTSP stream")
finally:
cap.release()
This is an interesting one, and part of a larger pattern regarding special handling for different providers. A few early thoughts:
- Because we delegate a huge number of providers into the "LiteLLM" blackbox - I haven't build a clean system for overriding inference behaviors based on the underlying provider. It might be time to do that, but I worry about hidden behaviors that are opaque to the user and difficult to configure.
- In terms of overloading the role for the message, does that break any expectations on the provider side for tool calls? If it makes a tool call, then receives a response from a user role, could that cause more issues than it's solving?
I think I can expand a few different COAs:
- Check if a tool is returning an image, in which case it should force the message role to user - should this happen for all providers? Do we need a support map?
- Don't allow tools to return content outside of text - is it more common for this to be supported? or unsupported?
- Just let the behavior stand, add a note to documentation that not all providers will support multi-modal content from tools.
Seems like a missing piece of information is the common standards for this behavior. I can try to gather some references so we can make an informed call here.
- Probably provider specific? Will do some testing ...
- It's usually supported and became more relevant with operators / browser-use
- idk ... i'd like to be able to use vision with rigging tools, with nerve the workaround is simply https://github.com/dreadnode/nerve/blob/main/src/agent/generator/openai.rs#L259 - for openai all tool outputs are returned with role=user