MCP tool image results not visible to vision models on OpenAI-compatible APIs — workaround included

Images from MCP Tools Not Visible to Vision Models on OpenAI-Compatible APIs: A Deep Dive and Workaround

When integrating MCP (Model Context Protocol) tools with vision models served through OpenAI-compatible APIs, a common issue arises: images returned by the tools are silently ignored. The vision model behaves as if no image data were provided, which leads to unexpected and incorrect results. This article explores the root causes of the problem and presents a practical workaround.

Root Cause: Incompatibilities Between MCP and OpenAI Vision API Expectations

The core of the issue lies in two incompatibilities between how MCP represents and delivers images and what OpenAI-compatible vision APIs (such as the one exposed by llama-server) expect:

  1. Image Format Mismatch: MCP represents images as a JSON object with type, data (base64 encoded image), and mimeType fields.
  2. Incorrect Message Role: MCP delivers tool results, including images, within messages having a role of "tool".
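
To make the MCP side concrete, here is a minimal sketch of a tool result that carries an image, together with a helper that picks the image items out of it. The name extractMcpImages and the sample payload are assumptions made for illustration (the base64 string is only a few placeholder bytes, not a complete image), and the sketch assumes the result carries its items in a content array.

```javascript
// Illustrative MCP tool result: `content` holds a list of typed items.
// The base64 string is a short placeholder, not a full image.
const exampleToolResult = {
  content: [
    { type: "text", text: "Screenshot captured." },
    { type: "image", data: "iVBORw0KGgo=", mimeType: "image/png" },
  ],
  isError: false,
};

// Hypothetical helper: returns only the well-formed image items of a result.
function extractMcpImages(toolResult) {
  const items = Array.isArray(toolResult?.content) ? toolResult.content : [];
  return items.filter(
    (item) =>
      item?.type === "image" &&
      typeof item.data === "string" &&
      typeof item.mimeType === "string"
  );
}
```

Filtering on both data and mimeType keeps malformed items out of the later conversion step, where both fields are needed to build the data URI.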

OpenAI-compatible vision models, on the other hand, require a different image format and a specific message role:

  1. Expected Image Format: The image must be a JSON object with type set to "image_url" and an image_url field holding a nested object with a url field. The url is a data URI made up of the MIME type followed by the base64-encoded image data. For example (base64 payload omitted):
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/png;base64,"
      }
    }
  2. Expected Message Role: Vision models only process image content when it's included in messages with a role of "user". Images included in messages with a role of "tool" are ignored.

Both conditions must be met for the vision model to process the image data. Fixing only one of them does not resolve the problem.
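
The format half of the fix is a small, mechanical conversion. The sketch below shows one way to write it, assuming an MCP image item with string data and mimeType fields; the function name mcpImageToImageUrlPart is hypothetical, and the whitespace stripping is a defensive assumption rather than something the protocol is known to require.

```javascript
// Hypothetical converter: MCP image item -> OpenAI-style image_url content part.
function mcpImageToImageUrlPart(item) {
  if (!item || item.type !== "image") {
    throw new TypeError("expected an MCP content item with type 'image'");
  }
  if (typeof item.data !== "string" || typeof item.mimeType !== "string") {
    throw new TypeError("MCP image item needs string 'data' and 'mimeType'");
  }
  // Defensive assumption: drop whitespace or line breaks inside the base64 text.
  const base64 = item.data.replace(/\s+/g, "");
  return {
    type: "image_url",
    image_url: { url: `data:${item.mimeType};base64,${base64}` },
  };
}
```

On its own this conversion is not enough: the resulting part still has to be delivered in a message with role "user", as described above.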

Workaround: Bridging the Gap with a Proxy

A practical workaround is a proxy layer that intercepts MCP tool results, converts the image format, and re-injects the image data into the conversation under the correct role. In a Node.js proxy, this comes down to three steps:

  1. Intercept Tool Result Messages: The proxy should monitor messages from the MCP server, looking for those containing image data.
  2. Transform Image Format: Convert the MCP image format to the image_url format expected by OpenAI-compatible APIs.
  3. Re-inject as User Message: Create a new message with role: "user" containing the transformed image data. Insert this message into the conversation history before forwarding it to the vision model.

Here's a simplified example (using JavaScript-like syntax for illustration):


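// Builds a user message carrying any images found in an MCP tool message.
// Returns the original message unchanged when it holds no image.
function transformMCPImage(message) {
  if (message.role !== "tool" || !message.content) {
    return message; // No transformation needed
  }
  // Tool results usually carry an array of content items; accept a single item too.
  const items = Array.isArray(message.content) ? message.content : [message.content];
  const imageParts = items
    .filter((item) => item && item.type === "image" && item.data && item.mimeType)
    .map((item) => ({
      type: "image_url",
      image_url: { url: `data:${item.mimeType};base64,${item.data}` }
    }));
  if (imageParts.length === 0) {
    return message; // No transformation needed
  }
  return { role: "user", content: imageParts };
}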

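This function checks whether the message is a tool message containing image data. If so, it builds the image_url structure and returns a new user message carrying the image. Insert that message into the history after the original tool message, before the request is forwarded to the vision model; dropping the tool message itself is risky, because OpenAI-compatible APIs generally expect every tool call to be answered by a message with role "tool".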
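
The function above handles a single message. Applied to a whole conversation, the re-injection step might look like the sketch below. It rests on several assumptions: tool messages arrive with their MCP content items as an array in content (how a given client serializes them may differ), the name reinjectToolImages is hypothetical, and the short text placed in front of the images is an invented placeholder. Images are emitted only after a run of consecutive tool messages ends, so the tool replies stay adjacent to the assistant turn that requested them.

```javascript
// Sketch: rewrite a message history so tool-returned images reach the model
// inside a user message, while every tool message keeps its tool_call_id.
function reinjectToolImages(messages) {
  const out = [];
  let pending = []; // image_url parts waiting for the current tool run to end

  const flush = () => {
    if (pending.length === 0) return;
    out.push({
      role: "user",
      content: [
        // Placeholder wording (an assumption); adjust or drop as needed.
        { type: "text", text: "Image(s) returned by the preceding tool call:" },
        ...pending,
      ],
    });
    pending = [];
  };

  for (const msg of messages) {
    if (msg.role !== "tool") {
      flush(); // a run of tool messages has ended; emit its images first
      out.push(msg);
      continue;
    }
    if (!Array.isArray(msg.content)) {
      out.push(msg); // plain-string tool result: nothing to convert
      continue;
    }
    const texts = [];
    for (const item of msg.content) {
      if (item && item.type === "image" && item.data && item.mimeType) {
        pending.push({
          type: "image_url",
          image_url: { url: `data:${item.mimeType};base64,${item.data}` },
        });
      } else if (item && item.type === "text") {
        texts.push(item.text);
      }
    }
    // Keep the tool message, reduced to text, so the tool call stays answered.
    out.push({
      ...msg,
      content: texts.join("\n") || "[image returned; see next user message]",
    });
  }
  flush(); // history may end on a tool message
  return out;
}
```

A proxy would run this over the messages array of each incoming chat completion request before forwarding the request to the model server.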

Practical Tips and Considerations

In short, images from MCP tools reach a vision model on an OpenAI-compatible API only when both fixes are applied: each image is converted to the image_url format, and it is delivered in a message with role "user". A proxy that performs both steps lets MCP tools and vision models work together.