tools returning a ContentImageUrl break openai

Open evilsocket opened this issue 11 months ago • 2 comments

As per documentation:

... Your function can return standard objects to cast into strings, ..., or even content parts for multi-modal generation (ContentImageUrl)

However, using a tool that returns an image via rg.ContentImageUrl creates a tool response that OpenAI rejects:

litellm.BadRequestError: OpenAIException - Error code: 400 - {
    'error': {
        'message': "Invalid 'messages[3]'. Image URLs are only allowed for messages with role 'user', but this message with role 'tool' contains an image URL.",
        'type': 'invalid_request_error',
        'param': 'messages[3]',
        'code': 'invalid_value'
    }
}

Possible solution: rigging should check if a tool is returning an image, in which case it should force the message role to user.
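
A minimal sketch of that check, assuming OpenAI-style chat message dicts rather than rigging's own message types. The helper names (`contains_image`, `force_user_role_for_images`) are hypothetical and not part of rigging or LiteLLM, and this sketch does not verify whether a provider accepts a tool call that is answered from a user role.

```python
from typing import Any

# Assumption: messages are plain OpenAI-style dicts, not rigging Message objects.
Message = dict[str, Any]


def contains_image(message: Message) -> bool:
    """Return True if the message content has at least one image_url part."""
    content = message.get("content")
    if not isinstance(content, list):
        # Plain string (or missing) content cannot carry an image part.
        return False
    return any(
        isinstance(part, dict) and part.get("type") == "image_url"
        for part in content
    )


def force_user_role_for_images(messages: list[Message]) -> list[Message]:
    """Return a new list where image-bearing tool messages use role 'user'."""
    fixed: list[Message] = []
    for message in messages:
        if message.get("role") == "tool" and contains_image(message):
            # tool_call_id only belongs on tool messages, so drop it here.
            rewritten = {k: v for k, v in message.items() if k != "tool_call_id"}
            rewritten["role"] = "user"
            fixed.append(rewritten)
        else:
            fixed.append(message)
    return fixed
```

Text-only tool responses are left alone, so only the messages that would trigger the 400 above are rewritten.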

Example tool:

import cv2
import os
import rigging as rg


def read_webcam_image() -> rg.ContentImageUrl:
    """Reads an image from the webcam."""

    webcam_url = os.getenv("WEBCAM_URL")
    if webcam_url is None:
        raise Exception("WEBCAM_URL environment variable is not set")

    cap = cv2.VideoCapture(webcam_url)
    try:
        if cap.isOpened():
            ret, frame = cap.read()
            if ret:
                # save the image to a file in the same directory as the script
                script_path = os.path.abspath(__file__)
                script_dir = os.path.dirname(script_path)
                screenshot_path = os.path.join(script_dir, "webcam.jpg")
                cv2.imwrite(screenshot_path, frame)

                return rg.ContentImageUrl.from_file(
                    screenshot_path, mimetype="image/jpeg"
                )
            else:
                raise Exception("Failed to read frame from RTSP stream")
        else:
            raise Exception("Could not open RTSP stream")
    finally:
        cap.release()

evilsocket, Feb 18 '25 16:02

This is an interesting one, and part of a larger pattern regarding special handling for different providers. A few early thoughts:

  • Because we delegate a huge number of providers into the "LiteLLM" black box - I haven't built a clean system for overriding inference behaviors based on the underlying provider. It might be time to do that, but I worry about hidden behaviors that are opaque to the user and difficult to configure.
  • In terms of overloading the role for the message, does that break any expectations on the provider side for tool calls? If it makes a tool call, then receives a response from a user role, could that cause more issues than it's solving?

I think I can expand on a few different courses of action (COAs):

  1. Check if a tool is returning an image, in which case it should force the message role to user - should this happen for all providers? Do we need a support map?
  2. Don't allow tools to return content outside of text - is it more common for this to be supported or unsupported?
  3. Just let the behavior stand, add a note to documentation that not all providers will support multi-modal content from tools.
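
One way to picture the three options is as per-provider policies selected from a support map. This is only a sketch: `ToolImagePolicy`, `POLICY_BY_PROVIDER`, and `policy_for` are hypothetical names, the provider is assumed to be the prefix of a `provider/model` identifier, and the single `openai` entry reflects only the error reported in this issue. Anything not listed falls back to leaving the behavior untouched.

```python
from enum import Enum


class ToolImagePolicy(Enum):
    FORCE_USER_ROLE = "force_user_role"  # option 1: rewrite the role to user
    TEXT_ONLY = "text_only"              # option 2: reject non-text tool output
    PASSTHROUGH = "passthrough"          # option 3: leave the message as-is


# Hypothetical support map. Only "openai" is grounded in the reported 400 error;
# other providers would need testing before being added.
POLICY_BY_PROVIDER: dict[str, ToolImagePolicy] = {
    "openai": ToolImagePolicy.FORCE_USER_ROLE,
}


def policy_for(
    model: str,
    default: ToolImagePolicy = ToolImagePolicy.PASSTHROUGH,
) -> ToolImagePolicy:
    """Pick a policy from a 'provider/model' identifier (assumed format)."""
    provider, separator, _ = model.partition("/")
    if not separator:
        # No explicit provider prefix: do not guess, use the default.
        return default
    return POLICY_BY_PROVIDER.get(provider.lower(), default)
```

Keeping the map explicit and the default as passthrough is one way to address the concern about hidden behavior: nothing changes for a provider unless it is listed.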

Seems like a missing piece of information is the common standards for this behavior. I can try to gather some references so we can make an informed call here.

monoxgas, Feb 18 '25 18:02

  1. Probably provider specific? Will do some testing ...
  2. It's usually supported and became more relevant with operators / browser-use
  3. idk ... i'd like to be able to use vision with rigging tools, with nerve the workaround is simply https://github.com/dreadnode/nerve/blob/main/src/agent/generator/openai.rs#L259 - for openai all tool outputs are returned with role=user

evilsocket, Feb 18 '25 18:02