RFC: add `Tool.outputSchema` and `CallToolResult.structuredContent`

Open bhosmer-ant opened this issue 9 months ago • 10 comments

Adds support for strict validation of structured tool results.

  • A Tool can now optionally provide an outputSchema property, containing a JSON schema that defines the structure of its output.
  • CallToolResult adds a new structuredContent property, mutually exclusive with the CallToolResult.content property:
    • for Tools that do not declare an outputSchema, result.structuredContent will be absent, and result.content will be returned as before.
    • for Tools that declare an outputSchema, result.structuredContent will contain a string whose contents must validate against the schema, and result.content will be absent.
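
As a rough illustration of the shapes described above, the following TypeScript sketch models the two additions. The property names come from this RFC, but everything else is an assumption made for illustration: `JsonSchema` as a loose record, the simplified `TextContent` entry, `structuredContent` typed as a string (as in this revision of the proposal), and the hypothetical `get_weather` tool. It is not the SDK's actual type definition.

```typescript
// Loose stand-in for a JSON Schema document (assumption: the real type is richer).
type JsonSchema = Record<string, unknown>;

interface Tool {
  name: string;
  inputSchema: JsonSchema;
  // New and optional: describes the structure of the tool's output.
  outputSchema?: JsonSchema;
}

// Simplified content entry; the real content union has more variants.
interface TextContent {
  type: "text";
  text: string;
}

// Exactly one of `content` / `structuredContent` is present in this revision.
type CallToolResult =
  | { content: TextContent[]; structuredContent?: never; isError?: boolean }
  | { structuredContent: string; content?: never; isError?: boolean };

// Hypothetical tool that declares an output schema.
const weatherTool: Tool = {
  name: "get_weather",
  inputSchema: { type: "object", properties: { city: { type: "string" } } },
  outputSchema: {
    type: "object",
    properties: { temperature: { type: "number" } },
    required: ["temperature"],
  },
};

// Result for a tool WITH an outputSchema: structuredContent only.
const structuredResult: CallToolResult = {
  structuredContent: JSON.stringify({ temperature: 21.5 }),
};

// Result for a tool WITHOUT an outputSchema: content only, as before.
const plainResult: CallToolResult = {
  content: [{ type: "text", text: "21.5 degrees" }],
};
```

Modeling the result as a union with `never` on the opposite field is one way to have the compiler reject a result that sets both properties.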

Prototype for typescript SDK support in https://github.com/modelcontextprotocol/typescript-sdk/pull/454.

Design notes

This PR aims to provide simple, lightweight support for strict validation of tool result data whose structure can be entirely described by a single JSON schema. The approach here pairs a new Tool.outputSchema property with a new CallToolResult.structuredContent property, avoiding use of the CallToolResult.content array.

This approach leaves the path open for adding schematic validation support to the much richer and more complex space of tools that make use of the full expressiveness of the CallToolResult.content array, via an additional Tool property. Support for these use cases has been proposed in #356, and is under active discussion there.

(After exploring possible ways of providing integrated support for both kinds of use cases with one set of protocol additions, it's clear that both will be better served by a disjoint approach: strict validation of statically typed data results can be accomplished with the simple additions provided here, and the subtleties arising from supporting the full space of CallToolResult.content shapes - see e.g. #415, in addition to #356 - can be addressed more naturally absent the need to support the use cases addressed here.)

Motivation and Context

For tools that return structured output, having a description of that structure available is useful for various tasks, including:

  • Validating the structure of tool results (and performing a more informed examination of the values they contain, post-validation). Especially useful when interacting with untrusted servers.
  • Considering outputSchemas (or their absence) when making decisions about which tools to expose to the model.
  • Transforming tool results before forwarding content to the model (e.g. formatting, projecting).
  • Making tool results available as structured data in coding environments.
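
To make the validation use case concrete, here is a minimal sketch of a client checking a tool result against a declared schema. The `MiniSchema` type and `validate` function are hypothetical and cover only a small subset of JSON Schema (`type`, `properties`, `required`); a real client would presumably use a complete JSON Schema validator instead.

```typescript
// Toy validator covering only a small subset of JSON Schema:
// `type` (object | string | number | boolean), `properties`, `required`.
// Assumption: a production client would use a complete JSON Schema validator.
interface MiniSchema {
  type: "object" | "string" | "number" | "boolean";
  properties?: Record<string, MiniSchema>;
  required?: string[];
}

function validate(value: unknown, schema: MiniSchema, path = "$"): string[] {
  if (schema.type !== "object") {
    // Primitive check: typeof must match the declared type.
    return typeof value === schema.type ? [] : [`${path}: expected ${schema.type}`];
  }
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return [`${path}: expected object`];
  }
  const obj = value as Record<string, unknown>;
  const errors: string[] = [];
  // Required properties must be present.
  for (const key of schema.required ?? []) {
    if (!(key in obj)) errors.push(`${path}.${key}: missing required property`);
  }
  // Properties that are present must match their sub-schema.
  for (const [key, sub] of Object.entries(schema.properties ?? {})) {
    if (key in obj) errors.push(...validate(obj[key], sub, `${path}.${key}`));
  }
  return errors;
}

// Hypothetical schema and payloads, as an untrusted server might supply them.
const outputSchema: MiniSchema = {
  type: "object",
  properties: { name: { type: "string" }, age: { type: "number" } },
  required: ["name"],
};

const good = validate({ name: "John Doe", age: 42 }, outputSchema); // no errors
const bad = validate({ age: "forty-two" }, outputSchema); // two errors
```

Returning a list of error strings rather than a boolean lets the caller decide whether to reject the result outright or only log the mismatch.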

How Has This Been Tested?

No tests yet.

Breaking Changes

Optional new property that introduces a new behavior, not a breaking change.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [X] Documentation update

Checklist

  • [X] I have read the MCP Documentation
  • [X] My code follows the repository's style guidelines
  • [ ] New and existing tests pass locally
  • [ ] I have added appropriate error handling
  • [X] I have added or updated documentation as needed

Additional context

bhosmer-ant avatar Apr 19 '25 05:04 bhosmer-ant

A couple of comments on this as I prepare to undraft #223 for RFC:

  • I've updated #223 to indicate that Servers that support structured output should advertise generates: ["application/json"]
  • Is there any consideration for the Server returning a TextResourceContents with a mimeType of application/json? I think this would be a more deliberate action by the Server in this scenario.

[update] The specific proposal would be to return a CallToolResult as follows:

{
  "jsonrpc": "2.0",
  "id": "abc123",
  "result": {
    "content": [
      {
        "type": "resource",
        "resource": {
          "uri": "file:///example/data.json",
          "mimeType": "application/json",
          "text": "{\"name\":\"John Doe\",..... and so on }}"
        }
      }
    ],
    "isError": false
  }
}

With guidance that Servers returning a Structured Response MUST return a CallToolResult containing one EmbeddedResource of type application/json.
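
For comparison, a client-side sketch of consuming that shape might look like the following. The type definitions are simplified stand-ins rather than the SDK's types, `extractStructured` is a hypothetical helper, and the payload is shortened to a single illustrative field.

```typescript
// Minimal shapes for this sketch (assumption: real SDK types carry more fields).
interface ResourceContents {
  uri: string;
  mimeType?: string;
  text?: string;
}

type ContentEntry =
  | { type: "text"; text: string }
  | { type: "resource"; resource: ResourceContents };

interface CallToolResult {
  content: ContentEntry[];
  isError?: boolean;
}

// Returns the parsed JSON payload when the result consists of exactly one
// embedded resource with mimeType application/json, otherwise undefined.
function extractStructured(result: CallToolResult): unknown {
  if (result.isError || result.content.length !== 1) return undefined;
  const entry = result.content[0];
  if (entry === undefined || entry.type !== "resource") return undefined;
  const { mimeType, text } = entry.resource;
  if (mimeType !== "application/json" || text === undefined) return undefined;
  return JSON.parse(text); // throws if the server sent malformed JSON
}

// Illustrative result with a shortened, hypothetical payload.
const result: CallToolResult = {
  content: [
    {
      type: "resource",
      resource: {
        uri: "file:///example/data.json",
        mimeType: "application/json",
        text: '{"name":"John Doe"}',
      },
    },
  ],
  isError: false,
};

const data = extractStructured(result) as { name: string };
```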

evalstate avatar Apr 19 '25 08:04 evalstate

Having outputSchema restrict you to returning a single text content entry whose text validates against the schema feels oddly restrictive. That approach makes annotations largely pointless, and I can think of plenty of cases where one would want to have multiple content entries that would be possible with #356:

  • Document processing:
    • Have an outputSchema that is treated as definitions for multiple document types
    • Return content entries for:
      • ImageContent - Generated thumbnail for the document
      • TextContent - Plain text contents of the document
      • DataContent - Structured data, with schema referencing one of the definitions in outputSchema
  • Multi-entity type search
    • Have an outputSchema that is treated as definitions for multiple entity types
    • Return multiple content entries with schema refs and annotations for relevance/importance

Another drawback I see is the lack of ability to dynamically define the structure/schema for response content. There are certainly cases where the output schema may not be known ahead of time, but would still be useful for the client or LLM consuming the content; it also enriches the capability of sampling and prompt messages.

Lastly, going with this approach, extending its functionality in the future would likely represent a breaking change since it largely goes against the implied design pattern of CallToolResult having an arbitrary number of content entries.

lukaswelinder avatar Apr 19 '25 20:04 lukaswelinder

@lukaswelinder first of all, my apologies for putting up this PR without first participating in the discussion on #356 - I only saw it as I was writing the PR comments for this, but hadn't had a chance to look properly yet. Definitely didn't mean to step on your ongoing work.

I'll respond to your comment a bit later (not at keyboard right now) and also make comments on #356. Meanwhile I'll move this to draft, pending further discussion.

bhosmer-ant avatar Apr 20 '25 13:04 bhosmer-ant

@evalstate thanks for the heads up - will come back to this after we see what comes out of the discussion on #356, per my previous comment

bhosmer-ant avatar Apr 20 '25 13:04 bhosmer-ant

@bhosmer-ant No offense taken, just glad to see there is motivation here. Input and feedback on #356 would be great.

lukaswelinder avatar Apr 21 '25 19:04 lukaswelinder

LGTM, thanks for shepherding this!

sambhav avatar May 07 '25 00:05 sambhav

It's still not clear to me how this is better than the additional 3-4 lines of code required to achieve this with the current spec here: https://gist.github.com/evalstate/e49cb163297c1ab940fb8a98e31947ed - the motivation and context isn't clear to me as a Host application developer.

I'm also not at all keen on the branching logic of the CallToolResults changing based on whether a field was present on the Tool description. Is that a necessary change?

At this point, if we want to do this shouldn't we just introduce a new type of tool instead (e.g. StructuredTool)?

evalstate avatar May 07 '25 11:05 evalstate

It's still not clear to me how this is better than the additional 3-4 lines of code required to achieve this with the current spec here: https://gist.github.com/evalstate/e49cb163297c1ab940fb8a98e31947ed - the motivation and context isn't clear to me as a Host application developer.

Recapping the previous discussion, so apologies for brevity, but: it's better because it facilitates the structured data use case in a simple, direct way, without the need for the extra ceremony and hops involved in routing the result through an EmbeddedResource.

I'm also not at all keen on the branching logic of the CallToolResults changing based on whether a field was present on the Tool description. Is that a necessary change?

Yeah, the triggering of validation based on the presence of an outputSchema in the tool definition is a key feature.

At this point, if we want to do this shouldn't we just introduce a new type of tool instead (e.g. StructuredTool)?

I think that would introduce fragmentation at the top level of the concept hierarchy that we don't want.
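
The branching under discussion, where validation is triggered by the presence of an outputSchema, could be sketched on the client side roughly as follows. All names here (`checkResult`, `Checked`, `Validator`, the demo tools) are hypothetical. Treating a stray structuredContent on a schema-less tool as an error is an assumption about policy rather than something this PR specifies, and the validator is passed in so that any JSON Schema implementation could be plugged in.

```typescript
type JsonSchema = Record<string, unknown>;
// Validator supplied by the caller; returns a list of error messages.
type Validator = (value: unknown, schema: JsonSchema) => string[];

interface Tool {
  name: string;
  outputSchema?: JsonSchema;
}

interface CallToolResult {
  content?: unknown[];
  structuredContent?: unknown;
}

type Checked =
  | { kind: "unstructured"; content: unknown[] }
  | { kind: "structured"; value: unknown };

function checkResult(tool: Tool, result: CallToolResult, validate: Validator): Checked {
  if (tool.outputSchema === undefined) {
    // No schema declared: nothing to validate, pass content through as before.
    if (result.structuredContent !== undefined) {
      throw new Error(`${tool.name}: structuredContent without an outputSchema`);
    }
    return { kind: "unstructured", content: result.content ?? [] };
  }
  // Schema declared: structured output is required and must validate.
  if (result.structuredContent === undefined) {
    throw new Error(`${tool.name}: missing structuredContent`);
  }
  const errors = validate(result.structuredContent, tool.outputSchema);
  if (errors.length > 0) {
    throw new Error(`${tool.name}: ${errors.join("; ")}`);
  }
  return { kind: "structured", value: result.structuredContent };
}

// Stand-in validator for the demo: only checks that the value is a non-null object.
const objectOnly: Validator = (value) =>
  typeof value === "object" && value !== null ? [] : ["expected object"];

const structured = checkResult(
  { name: "get_user", outputSchema: { type: "object" } },
  { structuredContent: { name: "John Doe" } },
  objectOnly,
);
const unstructured = checkResult({ name: "echo" }, { content: ["hi"] }, objectOnly);
```

Returning a tagged union keeps the two result kinds explicit for the Host, which is the special-casing that the polymorphism concern in this thread is about.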

bhosmer-ant avatar May 07 '25 13:05 bhosmer-ant

@ihrpr fyi new rev makes structuredContent an object, and updates the docs w/compatibility language (and a better example). (TS SDK example updated too)

bhosmer-ant avatar May 07 '25 13:05 bhosmer-ant

OK - I'll just note my outstanding concerns on this one - not expecting a response - just adding my perspective as a Host application developer.

  • MCP Server SDK. Introduction of return type polymorphism based on the presence of the Tool outputSchema will make the developer experience around tool definition and implementation more complex than necessary.
  • MCP Client SDK. Return type polymorphism needs to be handled by the SDK along with additional validation, meaning changes will be needed for implementation, and requiring the Host integrator to special-case the new "structured" return type.
  • Compatibility. Forward/Backward compatibility is managed by the MCP Server itself rather than handled by a stated convention within the SDK. This gives a large number of possibilities to integrate and test for - and opens challenges (potentially including security) when there is a content mismatch, as well as potentially doubling the length of returned content.
  • Consistency. Currently Tools, Prompts and Resources have a logical consistency between their types. This adds a unique Tool-only condition that can't otherwise be represented within the MCP protocol (conceptual fragmentation).
  • JSON Specific. The use of mime types for the schema and payload would allow the use of non-JSON structures if desired.
  • Host Application Development. For someone building a generic Host application it's still not immediately obvious what the benefit is in receiving the data in structured form. Since both schema and content are supplied by the Server, the "interacting with untrusted servers" motivation isn't obviously improved here. Without an identifying uri or prior knowledge of the server/schema this is still "just JSON tokens". On this basis, this change brings extra effort to me as an integrator, with no clear benefit.
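
To illustrate the content-mismatch concern in the Compatibility point, here is a sketch of the check a Host might end up writing. It assumes, hypothetically, that a server returns both a serialized text copy in content and an object in structuredContent; the `DualResult` shape and both helper functions are illustrative only.

```typescript
// Structural equality for JSON values (objects, arrays, primitives, null).
function jsonEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ao = a as Record<string, unknown>;
  const bo = b as Record<string, unknown>;
  const keys = Object.keys(ao);
  if (keys.length !== Object.keys(bo).length) return false;
  return keys.every((k) => k in bo && jsonEqual(ao[k], bo[k]));
}

// Assumed result shape: a text copy in `content` plus `structuredContent`.
interface DualResult {
  content: { type: "text"; text: string }[];
  structuredContent: Record<string, unknown>;
}

// True when some text entry parses to the same JSON value as structuredContent.
function copiesAgree(result: DualResult): boolean {
  return result.content.some((entry) => {
    try {
      return jsonEqual(JSON.parse(entry.text), result.structuredContent);
    } catch (_err) {
      return false; // entry was not JSON at all
    }
  });
}

// Same data with different key order: the copies agree.
const consistent = copiesAgree({
  content: [{ type: "text", text: '{"a":1,"b":[1,2]}' }],
  structuredContent: { b: [1, 2], a: 1 },
});
// Different values: the copies disagree.
const mismatched = copiesAgree({
  content: [{ type: "text", text: '{"a":2}' }],
  structuredContent: { a: 1 },
});
```

What a Host should do when the two copies disagree is exactly the open question raised above; this sketch only detects the condition.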

evalstate avatar May 07 '25 15:05 evalstate