rigging tools returning a ContentImageUrl break openai

As per documentation:

... Your function can return standard objects to cast into strings, ..., or even content parts for multi-modal generation (ContentImageUrl)

However using a tool that returns an image via rg.ContentImageUrl creates a tool response that OpenAI doesn't like:

litellm.BadRequestError: OpenAIException - Error code: 400 - {
    'error': {
        'message': "Invalid 'messages[3]'. Image URLs are only allowed for messages with role 'user', but this message with role 'tool' contains an image URL.",
        'type': 'invalid_request_error',
        'param': 'messages[3]',
        'code': 'invalid_value'
    }
}

Possible solution: rigging should check if a tool is returning an image, in which case it should force the message role to user.

Example tool:

import cv2
import os
import rigging as rg


def read_webcam_image() -> rg.ContentImageUrl:
    """Reads an image from the webcam."""

    webcam_url = os.getenv("WEBCAM_URL")
    if webcam_url is None:
        raise Exception("WEBCAM_URL environment variable is not set")

    cap = cv2.VideoCapture(webcam_url)
    try:
        if cap.isOpened():
            ret, frame = cap.read()
            if ret:
                # save the image to a file in the same directory as the script
                script_path = os.path.abspath(__file__)
                script_dir = os.path.dirname(script_path)
                screenshot_path = os.path.join(script_dir, "webcam.jpg")
                cv2.imwrite(screenshot_path, frame)

                return rg.ContentImageUrl.from_file(
                    screenshot_path, mimetype="image/jpeg"
                )
            else:
                raise Exception("Failed to read frame from RTSP stream")
        else:
            raise Exception("Could not open RTSP stream")
    finally:
        cap.release()

Feb 18 '25 16:02 evilsocket

This is an interesting one, and part of a larger pattern regarding special handling for different providers. A few early thoughts:

Because we delegate a huge number of providers into the "LiteLLM" blackbox - I haven't build a clean system for overriding inference behaviors based on the underlying provider. It might be time to do that, but I worry about hidden behaviors that are opaque to the user and difficult to configure.
In terms of overloading the role for the message, does that break any expectations on the provider side for tool calls? If it makes a tool call, then receives a response from a user role, could that cause more issues than it's solving?

I think I can expand a few different COAs:

Check if a tool is returning an image, in which case it should force the message role to user - should this happen for all providers? Do we need a support map?
Don't allow tools to return content outside of text - is it more common for this to be supported? or unsupported?
Just let the behavior stand, add a note to documentation that not all providers will support multi-modal content from tools.

Seems like a missing piece of information is the common standards for this behavior. I can try to gather some references so we can make an informed call here.

Feb 18 '25 18:02 monoxgas

Probably provider specific? Will do some testing ...
It's usually supported and became more relevant with operators / browser-use
idk ... i'd like to be able to use vision with rigging tools, with nerve the workaround is simply https://github.com/dreadnode/nerve/blob/main/src/agent/generator/openai.rs#L259 - for openai all tool outputs are returned with role=user

Feb 18 '25 18:02 evilsocket