mmcp

sirohikartik/mmcp


mmcp is an MCP server for multimodal input. It is designed to handle and process several types of input data (text, image, audio, and video) in a unified manner.

An MCP server for Multimodal Input

This is a prototype-stage demonstration of a multimodal MCP server that currently accepts four input modalities:

  1. Text
  2. Image
  3. Audio
  4. Video (with audio support)

Text input is straightforward, so the sections below cover the other three in detail.

1. Image Modality

Image input is processed in two steps:

  1. Reading any text present in the image
  2. Captioning the image using an open-source model
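
A minimal sketch of how these two steps might be merged into a single context entry. The function `describe_image` and the `ocr` and `captioner` callables are hypothetical stand-ins, not the project's actual API; in practice they would wrap whichever OCR engine and captioning model the server uses.

```python
from typing import Callable


def describe_image(
    image_bytes: bytes,
    ocr: Callable[[bytes], str],
    captioner: Callable[[bytes], str],
) -> str:
    """Merge OCR text and a caption into one context string.

    `ocr` and `captioner` are hypothetical callables supplied by the caller.
    """
    text = ocr(image_bytes).strip()
    caption = captioner(image_bytes).strip()

    parts = [f"Caption: {caption}"]
    # The image may contain no text, so include OCR output only when present.
    if text:
        parts.append(f"Text in image: {text}")
    return "\n".join(parts)
```

Passing the two models in as callables keeps the merging logic independent of any particular model, so either one can be swapped without touching the rest.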

2. Audio Modality

Audio input is handled similarly, in two steps:

  1. Converting any speech in the audio to text
  2. Using an open-source model to characterize or classify the background noise
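
As an illustration only, the sketch below assumes the background-noise model returns a score per label and that the transcript may be empty when nobody speaks. The function `summarize_audio`, the score dictionary, and the `min_score` threshold are all assumptions made for this example, not details taken from the project.

```python
from typing import Dict


def summarize_audio(
    transcript: str,
    noise_scores: Dict[str, float],
    min_score: float = 0.5,
) -> str:
    """Build a context entry from a transcript and background-noise scores.

    `noise_scores` maps a hypothetical class label to a confidence in [0, 1].
    """
    lines = []
    transcript = transcript.strip()
    # Speech is optional: skip the line when the transcript is empty.
    if transcript:
        lines.append(f"Speech: {transcript}")

    if noise_scores:
        # Keep only the highest-scoring label, and only if it is confident enough.
        label, score = max(noise_scores.items(), key=lambda kv: kv[1])
        if score >= min_score:
            lines.append(f"Background: {label}")
    return "\n".join(lines)
```

For example, `summarize_audio("hello", {"traffic": 0.8, "rain": 0.1})` yields a speech line followed by `Background: traffic`.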

3. Video Modality

Here the video is split into frames, one every tenth of a second. Each frame is passed to a captioning model, and each caption is added to the context along with its frame number so that temporal information is preserved.
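
The sampling and labeling scheme described above could be sketched as follows. This is an assumption-laden illustration: `frame_timestamps` and `captions_to_context` are hypothetical helper names, the `[frame N]` prefix format is invented for the example, and the actual frame extraction and captioning calls are left out.

```python
from typing import Iterable, List


def frame_timestamps(duration_s: float, interval_s: float = 0.1) -> List[float]:
    """Return sample times in seconds at a fixed interval across the clip."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Multiply an integer index instead of repeatedly adding the interval,
    # which avoids accumulating floating-point error.
    count = int(duration_s / interval_s + 1e-9)
    return [round(i * interval_s, 3) for i in range(count)]


def captions_to_context(captions: Iterable[str]) -> str:
    """Prefix each caption with its frame number to preserve ordering."""
    return "\n".join(f"[frame {i}] {c}" for i, c in enumerate(captions))
```

Because frames are sampled at a fixed interval, the frame number alone is enough to recover an approximate timestamp (frame number times 0.1 seconds in this sketch).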