Discover multimodal support #1522
SamYuan1990 started this conversation in Ideas - General
Comment on the proposal below:

> Could you define the "Send translation request (with file info)" part of your workflow diagram? What do you plan to put in the file info (instead of the real file)? Something like "hey LLM, I have a wav file to extract text from, tell me what to do"?

---
Per the roadmap, multimodal support in MCP is planned. I am here to share with the community what I have discovered about multimodal support through MCP.
Why I need multimodal support
Generally, a task may have different types of input: text from a keyboard, voice from a microphone, a photo from a camera, and so on, each arriving as a package with its metadata. In my case, I am building a "chatbox"-style i18n agent, https://github.com/SamYuan1990/i18n-agent-action , and I want to support human input as text, sound, and files. Today it supports speech-to-text and text from the chat box; I will explain my motivation below.
Token effective
Referring to one of Whisper's test voice samples, a wav file of about 30 words is roughly 200 KB in size. Since I am located in China, I found three options today for handling an i18n task with voice input. (A pricing comparison of the options appeared here in the original post; the surviving fragment reads "token / 8¥".)
Performance
Note: this also benefits performance in terms of throughput. The information crosses the network either as ~30 words of UTF-8 text or as a 200 KB wav file (whether you base64-encode it or not), and 30 words of UTF-8 will always transfer faster than 200 KB of audio, right?
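To make that concrete, here is a quick back-of-the-envelope comparison. The 200 KB figure comes from the Whisper sample above; the text payload itself is illustrative:

```python
# Rough size comparison: ~30 words of UTF-8 text vs. a 200 KB wav file.
text = "thirty words of transcribed speech " * 6   # illustrative ~30-word payload
text_bytes = len(text.encode("utf-8"))             # ~210 bytes

wav_bytes = 200 * 1024                             # the ~200 KB wav sample
wav_base64 = (wav_bytes + 2) // 3 * 4              # base64 inflates size by ~33%

print(f"text: {text_bytes} B, wav: {wav_bytes} B, wav as base64: {wav_base64} B")
# The text payload is roughly three orders of magnitude smaller than the audio,
# base64-encoded or not.
```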
What are the approaches?
As you can see, my requirement is simply converting a file to markdown or plain text, just like the markitdown MCP.
Markitdown
If you go through the code, you will find that it needs a URI to locate your file. In other words, markitdown makes the MCP server pull the third-party data: you ask the agent something like "please help translate the file at URI:URI_path into zh for me", and URI_path is resolved to the file content during the MCP process.
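For illustration, a minimal sketch of that pull pattern using the MCP Python SDK's FastMCP. The tool name and the trivial "conversion" are my assumptions, not markitdown's actual code:

```python
# pull_server.py - sketch of a markitdown-style "pull" tool (hypothetical names).
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("pull-demo")

@mcp.tool()
def convert_to_markdown(uri: str) -> str:
    """Fetch a document from a third-party URI and return it as text."""
    # The server pulls the asset itself; the client only ever sends the URI.
    with urllib.request.urlopen(uri) as resp:
        data = resp.read()
    return data.decode("utf-8", errors="replace")  # real conversion omitted

if __name__ == "__main__":
    mcp.run()
```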
My attempts
My agent is somewhat different: I want it to run on macOS, iOS, Linux, Windows, Android, and the web as a client, and as a server in forms like a GitHub Action job, a container service, or an MCP server. Which means I have no idea where, or what, the URI should be. For example, I might make a voice recording on my iPhone while running the MCP server on my Mac, connected over SSE via my home Wi-Fi.
Instead, I added an /upload endpoint on my MCP server, which allows the agent (the MCP client) to upload a local asset to the MCP server. Then I pass a file identifier through the MCP process, such as ./test.txt, ./test.pdf, or 0.wav. The workflow becomes the following (the links in the original post show the code):
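As a rough sketch of this push pattern, under my own assumptions (tool names, base64 transport, and the upload directory are illustrative, not the exact code from i18n-agent-action):

```python
# push_server.py - sketch of "client pushes asset, then refers to it by name".
import base64
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("push-demo")
UPLOAD_DIR = Path("./uploads")
UPLOAD_DIR.mkdir(exist_ok=True)

@mcp.tool()
def upload(name: str, content_base64: str) -> str:
    """Receive a client-pushed asset and store it under a file identifier."""
    path = UPLOAD_DIR / Path(name).name          # strip directories to avoid traversal
    path.write_bytes(base64.b64decode(content_base64))
    return str(path)                              # e.g. "uploads/0.wav"

@mcp.tool()
def translate_file(file_id: str, target_lang: str) -> str:
    """Translate an already-uploaded asset, referenced only by its identifier."""
    text = (UPLOAD_DIR / Path(file_id).name).read_text(errors="replace")
    return f"[{target_lang}] {text}"              # real LLM call omitted

if __name__ == "__main__":
    mcp.run()
```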
Test result
You are welcome to reproduce it:
1. Start the MCP server, which is currently hard-coded to DeepSeek as the service; you are free to change it.
2. Start the MCP client in the MCP folder.
My result: (screenshot in the original post)
If it's an RFC to MCP
Overall
It sounds like the MCP server should have a standard path that allows the client to upload a data asset; then the client, server, and LLM can negotiate over the asset. One possible workflow:
```mermaid
sequenceDiagram
    participant User
    participant Client
    participant LLM
    participant MCPServer
    User->>Client: Request to translate file
    Client->>LLM: Send translation request (with file info)
    Note over LLM,MCPServer: LLM analyzes request and decides to use MCP tools
    LLM->>Client: Instruct to call MCP Server's translation tool
    Client->>MCPServer: Establish connection (initialize session)
    MCPServer-->>Client: Connection confirmed, returns capability list
    Client->>MCPServer: Upload file (call file upload resource)
    MCPServer-->>Client: File upload successful
    Client->>MCPServer: Call translation tool (specify file)
    MCPServer-->>Client: Return translation result
    Client->>LLM: Forward translation result
    LLM->>Client: Possible post-processing or formatting
    Client-->>User: Display final translation result
```

Availability
Since the client/agent can obtain assets from the device (text from the keyboard, voice from the microphone, photos from the camera, ...), the client is able to send the asset on to the MCP server.
The MCP server is able to handle the asset markitdown-style, passed by URI; but instead of a third-party URI, the URI is generated by the MCP server itself (or, considering assets shared across MCP servers, the client may be able to name the asset).
Hence, the asset is available on both the MCP client and the server. This "RFC" simply turns the MCP server approach from pulling the asset from a (third-party) URI into waiting for the client to push it to the server, as sketched below.
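From the client side, the diagram above could map onto the MCP Python SDK roughly like this. The URL, file name, and tool names are illustrative and match the server sketch earlier; the server would need to run with the SSE transport:

```python
# client_sketch.py - the push workflow from the diagram, seen from the client.
import asyncio
import base64

from mcp import ClientSession
from mcp.client.sse import sse_client

async def main() -> None:
    # Connect over SSE (the post's scenario: iPhone recording, Mac server, home Wi-Fi).
    async with sse_client("http://localhost:8000/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()            # connect, get capability list
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Push the local asset to the server, then refer to it by identifier.
            payload = base64.b64encode(open("0.wav", "rb").read()).decode()
            await session.call_tool(
                "upload", {"name": "0.wav", "content_base64": payload}
            )
            result = await session.call_tool(
                "translate_file", {"file_id": "0.wav", "target_lang": "zh"}
            )
            print(result.content)

asyncio.run(main())
```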
Throughput
For each MCP server, suppose the file is 200 KB. Previously the file might live on S3, and the server would pull it from S3 or another cloud storage; with this approach, the server just waits for a push, so network throughput does not change for a single MCP server.
For the client, it previously needed to push the file once to S3; now it may need to push the file to different MCP servers on demand.
Security
Content security
See https://modelcontextprotocol.io/specification/2025-06-18/basic/security_best_practices#architecture-and-attack-flows
If we run over SSE, network security is unchanged.
Since the file content goes to the MCP server either way (pull or push), the client needs to trust the MCP server and its provider.
With this change, we remove the third-party storage, for example a URI pointing to S3.
LLM prompt injection?
Some LLM services now have protections around file handling: if you pass them a file, they may respond with "I don't have the content of the file...", and some LLM services don't support multimodal input today; with a wav file present, they may reject your request. So some prompt engineering is needed today to keep the LLM focused on using the MCP tool.
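For example, a system prompt along these lines (the wording is purely illustrative, and the tool names match the sketches above) can steer the model toward the tool instead of letting it refuse the audio:

```python
# Illustrative system prompt nudging the model to delegate to the MCP tools
# rather than trying (and failing) to read the wav file itself.
system_prompt = (
    "You cannot read audio files directly. "
    "When the user mentions a local asset such as 0.wav, do NOT ask for its "
    "content; instead call the MCP tool `upload` and then `translate_file`, "
    "passing only the file identifier."
)
```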
Alternative
The current design has the client push to the server, as one direction. Could we instead make the server fetch the asset from the client?
Thanks, and I look forward to any feedback.