sirohikartik/mmcp
mmcp is an MCP server for multimodal input, designed to handle and process several types of data (text, images, audio, and video) in a unified manner.
mmcp
An MCP server for Multimodal Input
This is a prototype-stage demonstration of a multimodal MCP server that currently supports four input modalities:
- Text
- Image
- Audio
- Video modality with audio support
The first one is straightforward, so let's discuss the next three in detail:
1. Image Modality
Here we do two things:
- Reading text from the image, if there is any
- Captioning the image using an open-source model
2. Audio Modality
Similarly, here we do two things:
- Converting the audio to text, if it contains any speech
- Using an open-source model to characterize or classify the background noise
3. Video Modality
Here the video is split into frames, one every tenth of a second. Each of those frames is passed to a video captioning model, and each caption is added to the context along with its frame number so that temporal information is maintained.
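
The frame sampling and caption tagging described above can be sketched roughly as follows. This is a minimal illustration, not the code used in mmcp: the function names `sample_frame_indices` and `build_video_context`, the `[frame N]` line format, and the `caption_frame` callable (a stand-in for whatever captioning model is used) are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def sample_frame_indices(total_frames: int, fps: float,
                         interval_s: float = 0.1) -> List[int]:
    """Pick frame indices spaced roughly `interval_s` seconds apart.

    Hypothetical helper: assumes the video's frame count and frame rate
    are already known (e.g. read from the container metadata).
    """
    if total_frames <= 0 or fps <= 0 or interval_s <= 0:
        return []
    indices: List[int] = []
    k = 0
    while True:
        # Frame closest to the k-th sampling instant (k * interval_s seconds).
        idx = round(k * interval_s * fps)
        if idx >= total_frames:
            break
        # At low frame rates two instants can map to the same frame; skip repeats.
        if not indices or idx != indices[-1]:
            indices.append(idx)
        k += 1
    return indices


def build_video_context(frame_indices: Sequence[int],
                        caption_frame: Callable[[int], str]) -> List[str]:
    """Caption each sampled frame and prefix the caption with its frame number.

    `caption_frame` is a placeholder for the captioning model; it receives
    a frame index and returns a caption string.
    """
    return [f"[frame {i}] {caption_frame(i)}" for i in frame_indices]


if __name__ == "__main__":
    # A one-second clip at 30 fps, sampled every 0.1 s, with a dummy captioner.
    picked = sample_frame_indices(total_frames=30, fps=30.0)
    for line in build_video_context(picked, lambda i: f"caption for frame {i}"):
        print(line)
```

Keeping the frame number in each context line is what lets a downstream model reason about the order in which events occur, since the captions alone carry no timing information.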