Discover multimodal support #1522

SamYuan1990 · 2025-09-23T10:09:49Z

SamYuan1990
Sep 23, 2025

Ref to roadmap, multimodal support in MCP is on the roadmap. I am here to share with community with my discover on multimodal support through MCP.

Why I need multimodal

Generally, we say the task having different types of input like a text from keyboard, or a voice from microphone, some photo from camera... etc, it's a meta data as package. Specific for me, I am doing a "chatbox" style i18n agent https://github.com/SamYuan1990/i18n-agent-action . I want to support human layer input as text, sounds and files. For now, I supports sounds to text and text from chat box today, I will explain my motivation below:

Token effective

Ref to some whisper's test voice, one 30 words wav file is about 200k file size. As I am located in China. To deal with i18n task with voice input, I found 3 options today.

Option 1: Send it to multimodal LLM service, I found one, it cost 1M/50¥.
Option 2: Deploy a Tiny or base whisper, STT at local and send to DeepSeek, DeepSeek cost about 1million token/ 8¥.
Option 3: Just make the voice file base64 and send to some multimodal LLM service... or base64 the voice file and make a MCP server at local. Ask DeepSeek to invoke this MCP server to do STT.

Options	Unit Cost	Comments
STT service	1M / 50¥	We will have 200k/1M * 50¥
Local STT and handle text	1million token/ 8¥	We will have 30 words and 30 / 1M * 8¥
Voice file base64 send to LLM	n/A	In this case, your chat to LLM will contains lots of base64 non meaningful token, and because base64 is your mcp tool parameter format, it will out of content window size.

Performance

Note: it's also benefits performance as throughput, as you can see, the information go through network, either 30 token words in UTF-8 or 200k wav files(no matter you base64 it or not). 30 tokens in UTF-8 always faster than 200k wav files right?

What's the approaches

You can see my requirements is just a file to markdown or plain text. Just like markitdown mcp.

Markitdown

if you go through the code, you will found out it's need a URI to locate your file.

code ref 1
code ref 2
Which means, in mark it down, it makes MCP server to pull the 3rd party data. Which means, you ask agent like "please help translate file at URI:URI_path for me into zh". And the URI_path replaced with file content been used among MCP process.

My attempts

My agent has something different, like I hope my agent able running on mac, ios, linux, windows, android, web as client and server like github action job, container service, MCP...
Which means I have no idea where/what the URI should be. For example, I made a voice recording on my iphone, and running MCP on my mac, connected via SSE via home wifi for network.
Instead, I made a /upload on my mcp server, which allows agent(mcp client), uploads local asset to mcp server. Then I pass file identify among MCP process, such like ./test.txt, ./test.pdf or 0.wav.

The workflow turns to be:

Connected to MCP server.
Send file on to MCP server.
Ask LLM to translate file with MCP tool.
MCP tool be invoked.

Note links above as show me the code.

Test result

Welcome to reproduce
start MCP server, which now I hard code to deep seek as service, you are free to change.

export api_key=...
export encoder=/Users/yuanyi/OpenSource/i18n-agent-action/App/storage/data/base-encoder.onnx
export decoder=/Users/yuanyi/OpenSource/i18n-agent-action/App/storage/data/base-decoder.onnxexport tokens=/Users/yuanyi/OpenSource/i18n-agent-action/App/storage/data/base-tokens.txt
python ./mcp/mcp_server.py

start MCP client in MCP folder

export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_API_KEY=...
python mcp_client.py

My result:

part of log
[调用工具 translate_text]
你好，你好吗？...
[调用工具 translate_file]
翻译完成。文件 `test.txt` 的内容已翻译为中文：...
[调用工具 translate_file]
我已经使用translate_file工具将0.wav文件翻译成了中文。翻译结果是：....
[调用工具 translate_file]
我已经使用translate_file工具将test.pdf文件翻译成了中文（zh）。翻译后的内容显示为：...

If it's a RFC to MCP

Overall

It sounds like on MCP server, we'd better have a strand path which allows client to upload data asset and then, client, server and LLM can negotiate the asset. One of workflow maybe:

sequenceDiagram
    participant User
    participant Client
    participant LLM
    participant MCPServer

    User->>Client: Request to translate file
    Client->>LLM: Send translation request (with file info)
    
    Note over LLM,MCPServer: LLM analyzes request and decides to use MCP tools
    
    LLM->>Client: Instruct to call MCP Server's translation tool
    Client->>MCPServer: Establish connection (initialize session)
    MCPServer-->>Client: Connection confirmed, returns capability list

    Client->>MCPServer: Upload file (call file upload resource)
    MCPServer-->>Client: File upload successful

    Client->>MCPServer: Call translation tool (specify file)
    MCPServer-->>Client: Return translation result

    Client->>LLM: Forward translation result
    LLM->>Client: Possible post-processing or formatting
    Client-->>User: Display final translation result

Availability

As the client/agent able to get asset from device(text from keyboard, or a voice from microphone, some photo from camera...). So the client is able to send the asset on to MCP server.
MCP server is able to handle the asset, like markitdown, pass by URI, instead a 3rd party URI, in this turn, the URI is generated by MCP server itself, or considering asset across MCP servers. The client may able to name the asset.
Hence, the asset is available on both MCP client and server. In this "rfc", just turn the MCP server approach from pull the asset from URI(3rd part) to wait for client push it on to server.

Through put

For each MCP server, if the file is 200k. Previously, the file may on s3, so server pull the file from s3 or other cloud storage. For this approach, the server just wait a push, so no changes on the network through put for single MCP server.
For client, previously may need to push it once to s3, for now, may need to push the file on to different MCP server on demand.

Security

Content security

Ref to https://modelcontextprotocol.io/specification/2025-06-18/basic/security_best_practices#architecture-and-attack-flows
If we run by SSE, the network security is no change.
As the file content goes to MCP server either way by(pull or push), client needs to trust MCP server/MCP server provider.
With this change, we reduced 3rd party storage, for example, URI as s3 storage.

LLM prompt injection?

Some LLM service now has protection on file handler, if you pass them a file, they may response as "I don't have content of the file..." or some LLM service don't support multimodal for today. When a wav file there, it may reject your request. So some prompt eng/prompt skill is needed to make LLM focus use MCP tool today.

Alternative

Current design as client push to server, as one side. We may make server to fetch from client?

Thanks and waiting for any feedback

lilei105 · 2025-09-23T11:41:48Z

lilei105
Sep 23, 2025

Could you define the "Send translation request (with file info)" part in your workflow diagram? What do you plan for the file info (instead of real file) here? Something like "hey LLM I have a wave file to extract text from, tell me what to do"?

2 replies

SamYuan1990 Sep 23, 2025
Author

if you want a show me the code style answer, which is here https://github.com/SamYuan1990/i18n-agent-action/blob/main/mcp/mcp_client.py#L71
if you want a vibe explain....
I didn't define the "Send translation request (with file info)" . Instead, the demo is made for a general case. Let's say you will have voice/picture as a file.

The define sounds like "Translate file to language tool, with input as file name and target language code(as ISO standard)", then, leave it to the LLM service.
if the LLM service see a "pdf"/"txt", then it response it is a file. if the LLM service see a ".wav", then it may noticed it as a voice maybe not.Just leave it to your LLM service.

Considering the MCP tool defined a general description, in MCP tool implementation, there some logic to handle different file format, as a multimodal MCP server....
It may support different file format, txt, ini, md, cvs, xls, docx, pttx, pdf, wav, mp3 ......
Instead test file formats one by one, I just test handled with uploaded file process, as it been described, no matter what format of file, it been uploaded to MCP server.

For demo, what we need to ensure is that LLM will invoke the Translate file tool. And tested with txt, pdf, wav, as multimodal:

The MCP tool defined.
Some prompt with hint as force LLM invoke the MCP tool.
The MCP tool designed and impl to support multimodal.

SamYuan1990 Sep 23, 2025
Author

Which means, I just ask as "please translate file for me, here is the file name xxxx with file translate tool..." and the file name as identity in the demo for now, as general purpose.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Discover multimodal support #1522

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Discover multimodal support #1522

Uh oh!

SamYuan1990 Sep 23, 2025

Why I need multimodal

Token effective

Performance

What's the approaches

Markitdown

My attempts

Test result

If it's a RFC to MCP

Overall

Availability

Through put

Security

Content security

LLM prompt injection?

Alternative

Thanks and waiting for any feedback

Replies: 1 comment · 2 replies

Uh oh!

lilei105 Sep 23, 2025

Uh oh!

SamYuan1990 Sep 23, 2025 Author

Uh oh!

SamYuan1990 Sep 23, 2025 Author

SamYuan1990
Sep 23, 2025

Replies: 1 comment 2 replies

lilei105
Sep 23, 2025

SamYuan1990 Sep 23, 2025
Author

SamYuan1990 Sep 23, 2025
Author