BilalAltundag/Audio-Intelligence-MCP
Audio-Intelligence-MCP
🎧 An MCP (Model Context Protocol) based server for audio processing. It provides tools for transcription, feature analysis, classification, metadata extraction, and format conversion.
Table of Contents
- Installation
- Running
- Tools
- Example Usage
- Outputs and Directory Structure
- Notes and Tips
Installation
- Python 3.10+ and (optional) CUDA-enabled PyTorch should be installed.
- Install dependencies:
pip install -r requirements.txt
- FFmpeg is required for pydub. Ensure FFmpeg is installed on your system and added to PATH.
- Windows: choco install ffmpeg or winget install Gyan.FFmpeg
- macOS: brew install ffmpeg
- Linux: sudo apt-get install ffmpeg
Running
Start MCP Server via Stdio
The server runs with stdio transport and is launched by an MCP client. For a direct test:
python main.py
Run this way, without an MCP client attached, the server produces no meaningful output on its own; in typical usage a client launches it over stdio.
Example Client (LangGraph + Gemini)
try.py launches the main.py MCP server over stdio, discovers tools, and wires them to a ReAct agent:
python try.py
The default sample sends a transcription request using files under audio/.
Tools
Server name: CustomAudioMCP
All tools accept a list of file paths. Invalid paths or unsupported formats return an error. Supported formats: .wav, .mp3, .ogg, .flac.
transcript
- Purpose: Converts audio files to text (Whisper base).
- Input:
file_paths: List[str]
language: str = "en"
output_dir: str = "output"
overwrite: bool = False
- Output:
{ "transcripts": { "<path>": { "transcription": str, "transcript_file": str } | { "error": str } } }
- Note: Processes at 16kHz and writes output under
output/transcripts/.
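Given the output shape above, a client can unpack results and per-file errors with a small helper. This is illustrative only; collect_transcripts is not part of the server, and the result dict is assumed to match the schema shown.

```python
def collect_transcripts(result):
    """Separate successful transcriptions from per-file errors."""
    texts, failures = {}, {}
    for path, entry in result["transcripts"].items():
        if "error" in entry:
            failures[path] = entry["error"]
        else:
            texts[path] = entry["transcription"]
    return texts, failures
```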
feature_analysis
- Purpose: Extracts basic features like pitch, tempo, duration; generates a waveform PNG.
- Input:
file_paths: List[str]
output_dir: str = "output"
overwrite: bool = False
- Output:
{ "analyses": { "<path>": { "features": { "mean_pitch": float, "tempo": float, "duration_s": float }, "waveform_plot": str } | { "error": str } } }
- Note: Waveform images are saved under
output/waveforms/.
audio_classification
- Purpose: Classifies audio content (AST model fine-tuned on AudioSet).
- Input:
file_paths: List[str]
output_dir: str = "output"
output_csv: Optional[str] = None
overwrite: bool = False
- Output:
{ "classifications": { "<path>": { "label": str, "confidence": float } | { "error": str }, "csv_path"?: str, "csv_error"?: str } }
- Note: Optionally writes a CSV into
output/classifications/.
metadata_extraction
- Purpose: Extracts metadata like duration, sample rate, channels, and tags; saves as JSON.
- Input:
file_paths: List[str]
output_dir: str = "output"
overwrite: bool = False
- Output:
{ "metadata": { "<path>": { "metadata": { "duration_ms": int, "bitrate": int, "channels": int, "tags": object }, "metadata_file": str } | { "error": str } } }
- Note: JSON files are saved under
output/metadata/.
audio_conversion
- Purpose: Converts files to the target audio format.
- Input:
file_paths: List[str]
target_format: str = "wav"
output_dir: str = "output"
overwrite: bool = False
- Output:
{ "converted_files": { "<path>": { "converted_file": str } | { "error": str } } }
- Note: Outputs are written under
output/converted/.
Example Usage
The following examples represent pseudo inputs that an MCP client would translate into tool calls:
{
"tool": "transcript",
"args": { "file_paths": ["audio/speech-94649.wav"], "language": "en" }
}
{
"tool": "feature_analysis",
"args": { "file_paths": ["audio/speech-94649.wav"] }
}
{
"tool": "audio_classification",
"args": { "file_paths": ["audio/speech-94649.wav"], "output_csv": "labels.csv" }
}
{
"tool": "metadata_extraction",
"args": { "file_paths": ["audio/speech-94649.wav"] }
}
{
"tool": "audio_conversion",
"args": { "file_paths": ["audio/speech-94649.wav"], "target_format": "mp3" }
}
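The pseudo calls above can be exercised without a full client using a small dispatcher. The handlers here are hypothetical stand-ins; a real MCP client would route the call to the server over stdio instead.

```python
def dispatch(call, handlers):
    """Route a {"tool": ..., "args": ...} request to a registered handler."""
    name = call["tool"]
    if name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](**call["args"])

# Hypothetical stand-in for the real transcript tool.
def fake_transcript(file_paths, language="en", **kwargs):
    return {"transcripts": {p: {"transcription": "...", "transcript_file": ""}
                            for p in file_paths}}

call = {"tool": "transcript",
        "args": {"file_paths": ["audio/speech-94649.wav"], "language": "en"}}
result = dispatch(call, {"transcript": fake_transcript})
```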
Outputs and Directory Structure
output/
  transcripts/: *_transcript_<timestamp>.txt
  waveforms/: *_waveform_<timestamp>.png
  classifications/: labels.csv (optional)
  metadata/: *_metadata_<timestamp>.json
  converted/: *_converted_<timestamp>.<ext>
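The timestamped naming scheme can be reproduced with a helper like the one below. output_path and the exact timestamp format are assumptions about the scheme, not the server's code.

```python
from datetime import datetime
from pathlib import Path

def output_path(output_dir, subdir, source, suffix, ext):
    """Build e.g. output/transcripts/speech_transcript_20240101-120000.txt."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    stem = Path(source).stem
    return Path(output_dir) / subdir / f"{stem}_{suffix}_{stamp}{ext}"
```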
Notes and Tips
- If a GPU is available, cuda is used automatically; otherwise cpu.
- The first run may take longer due to model downloads.
- If a config.json file exists, it overrides defaults such as output_dir, sample_rate, and overwrite_files.
- Tools return error messages for unsupported formats or non-existent paths.
Upcoming Features
| 🧩 Tool Name | 🎯 Purpose | 🧾 Description |
|---|---|---|
| noise_reduction | Noise reduction | Filters background noise using noisereduce or torchaudio. |
| silence_detection | Silence detection | Detects silent segments in audio for automatic trimming or splitting. |
| speaker_diarization | Speaker diarization | Distinguishes who is speaking using pyannote.audio or resemblyzer. |
| emotion_recognition | Emotion recognition | Predicts emotions such as “happy” or “sad” using MelSpectrogram + CNN/Transformer. |
| auto_editing | Smart audio editing | Removes silence, reduces noise, and balances volume automatically (pipeline-style). |
| voice_similarity | Voice similarity | Determines whether two voice samples belong to the same speaker. |
| keyword_spotting | Keyword spotting | Detects specific keywords within audio (e.g., “hey assistant”). |