
Claude Desktop Real-time Audio MCP Server (Python Implementation)

License: MIT

A Python-based Model Context Protocol (MCP) server that enables real-time microphone input for Claude Desktop on Windows. This implementation leverages Python's superior audio processing ecosystem to provide robust voice-driven conversations with Claude through WASAPI audio capture and multiple speech recognition engines.

🚀 Key Advantages of Python Implementation

  • 🐍 Mature Audio Ecosystem: Leverages sounddevice, webrtcvad, and specialized Windows audio libraries
  • 🧠 Multiple STT Engines: OpenAI Whisper (local/API), Azure Speech, Google Speech-to-Text
  • ⚡ FastMCP Framework: High-level Pythonic interface for rapid MCP development
  • 🔧 Easy Configuration: JSON/YAML configuration with environment variable support
  • 📊 Better Debugging: Comprehensive logging and performance monitoring
  • 🔄 Async Architecture: Non-blocking operations with asyncio
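The async architecture in the last bullet amounts to a producer/consumer pipeline: the audio callback pushes chunks onto an asyncio.Queue and a consumer drains it without blocking the event loop. A minimal, self-contained sketch of that shape (the chunk values are simulated stand-ins; the real server would enqueue PCM data from sounddevice callbacks):

```python
import asyncio

async def capture(queue: asyncio.Queue) -> None:
    """Simulated audio producer: pushes fixed-size chunks, then a sentinel."""
    for i in range(3):
        await queue.put(f"chunk-{i}")   # real code would enqueue PCM bytes
        await asyncio.sleep(0)          # yield control to the event loop
    await queue.put(None)               # sentinel: end of stream

async def transcribe(queue: asyncio.Queue) -> list:
    """Consumer: drains chunks without blocking other tasks."""
    seen = []
    while (chunk := await queue.get()) is not None:
        seen.append(chunk)              # real code would run VAD + STT here
    return seen

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=16)  # bounded to apply backpressure
    producer = asyncio.create_task(capture(queue))
    chunks = await transcribe(queue)
    await producer
    return chunks

print(asyncio.run(main()))  # ['chunk-0', 'chunk-1', 'chunk-2']
```

The bounded queue is the important design choice: if STT falls behind, the producer blocks on `put` instead of growing memory without limit.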

๐Ÿ—๏ธ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Claude        │    │   FastMCP Server │    │  Audio Capture  │
│   Desktop       │◄──►│   (Python)       │◄──►│  (sounddevice)  │
│                 │    │                  │    │  + WASAPI       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │  STT Engines     │    │  Voice Activity │
                       │  • Whisper       │    │  Detection      │
                       │  • Azure Speech  │    │  (webrtcvad)    │
                       │  • Google Speech │    │                 │
                       └──────────────────┘    └─────────────────┘

📋 Prerequisites

  • Windows 10/11 (Windows 7+ with WASAPI support)
  • Python 3.8+
  • Claude Desktop (latest version)

🚦 Quick Start

1. Installation

# Clone the repository
git clone https://github.com/joelfuller2016/claude-desktop-realtime-audio-mcp-python.git
cd claude-desktop-realtime-audio-mcp-python

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Install with GPU support (optional; assumes a "gpu" extra is defined
# in the project's packaging metadata)
pip install ".[gpu]"

2. Configuration

Create a configuration file or use environment variables:

# Set OpenAI API key (for Whisper API)
set OPENAI_API_KEY=your_api_key_here

# Set Azure Speech key (optional)
set AZURE_SPEECH_KEY=your_azure_key
set AZURE_SPEECH_REGION=eastus

# Set Google credentials (optional)
set GOOGLE_CREDENTIALS_PATH=path/to/credentials.json

3. Test Audio Setup

# Test your microphone and audio devices
python -m audio.test_setup

4. Run the MCP Server

# Start the server
python main.py

# Or with debug logging
python main.py --debug

5. Configure Claude Desktop

Add to your Claude Desktop configuration file (claude_desktop_config.json):

{
  "mcpServers": {
    "claude-audio": {
      "command": "python",
      "args": [
        "C:\\full\\path\\to\\main.py"
      ],
      "env": {
        "OPENAI_API_KEY": "your_api_key_here"
      }
    }
  }
}

🛠️ MCP Tools Available

Audio Control

  • start_recording() - Start real-time audio capture
  • stop_recording() - Stop audio capture
  • get_recording_status() - Get current status and config
  • test_audio_capture(duration=3.0) - Test microphone

Device Management

  • list_audio_devices() - List all audio input devices
  • set_audio_device(device_id) - Set audio input device
  • configure_audio_settings() - Adjust sample rate, channels, etc.

Speech Recognition

  • set_stt_engine(engine) - Switch between whisper/azure/google
  • View available engines and status

Resources

  • audio://devices - Available audio devices
  • audio://config - Current audio configuration
  • stt://engines - STT engines status
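Tools like those above are typically exposed through FastMCP as decorated Python functions. The sketch below shows the registration pattern; the function bodies and in-memory state are illustrative stand-ins, not the repository's actual implementation, and the import falls back to plain functions so the sketch runs even without the MCP SDK installed:

```python
# Illustrative in-memory state; the real server tracks the capture stream.
_state = {"recording": False}

def start_recording() -> str:
    """Start real-time audio capture."""
    _state["recording"] = True
    return "recording started"

def stop_recording() -> str:
    """Stop audio capture."""
    _state["recording"] = False
    return "recording stopped"

def get_recording_status() -> dict:
    """Return the current capture status."""
    return dict(_state)

try:
    from mcp.server.fastmcp import FastMCP   # official MCP Python SDK
    mcp = FastMCP("claude-audio")
    for fn in (start_recording, stop_recording, get_recording_status):
        mcp.tool()(fn)                       # expose each function as an MCP tool
except ImportError:                          # lets the sketch run without the SDK
    mcp = None

if __name__ == "__main__" and mcp is not None:
    mcp.run()                                # serves over stdio for Claude Desktop
```

Claude Desktop launches the script as a subprocess and speaks MCP over stdio, which is why no network configuration appears in `claude_desktop_config.json`.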

⚙️ Configuration

Audio Settings

{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_size": 1024,
    "device_id": null,
    "use_wasapi_exclusive": false,
    "low_latency": true
  }
}
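These keys map almost one-to-one onto `sounddevice.InputStream` parameters. A sketch of that mapping (the `dtype` choice is an assumption: 16-bit PCM is what webrtcvad consumes):

```python
import json

# The "Audio Settings" block above, as it might be loaded from a config file.
raw = """{"audio": {"sample_rate": 16000, "channels": 1, "chunk_size": 1024,
          "device_id": null, "use_wasapi_exclusive": false, "low_latency": true}}"""

def stream_kwargs(config: dict) -> dict:
    """Map config keys onto sounddevice.InputStream keyword arguments."""
    audio = config["audio"]
    return {
        "samplerate": audio["sample_rate"],   # 16 kHz matches Whisper's input rate
        "channels": audio["channels"],        # mono keeps STT input simple
        "blocksize": audio["chunk_size"],     # frames delivered per callback
        "device": audio["device_id"],         # None selects the default input
        "dtype": "int16",                     # 16-bit PCM, webrtcvad's format
        "latency": "low" if audio["low_latency"] else "high",
    }

kwargs = stream_kwargs(json.loads(raw))
print(kwargs["samplerate"], kwargs["latency"])  # 16000 low

# With sounddevice installed and a microphone present, a capture stream
# would then be opened as:
#   import sounddevice as sd
#   with sd.InputStream(**kwargs, callback=on_audio_chunk): ...
```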

Voice Activity Detection

{
  "vad": {
    "mode": "hybrid",
    "webrtc_aggressiveness": 2,
    "energy_threshold": 0.01,
    "min_speech_duration": 0.1,
    "min_silence_duration": 0.3
  }
}
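In "hybrid" mode two signals are combined: webrtcvad classifies each frame, and an RMS energy gate filters out low-level noise using the thresholds and durations above. The energy half can be sketched in pure Python as a debounced state machine (the class name and API are illustrative, not the repository's actual vad module):

```python
import math

class EnergyGate:
    """Energy-threshold VAD with minimum speech/silence durations (debouncing)."""

    def __init__(self, energy_threshold=0.01, min_speech_duration=0.1,
                 min_silence_duration=0.3, frame_seconds=0.02):
        self.threshold = energy_threshold
        self.min_speech_frames = math.ceil(min_speech_duration / frame_seconds)
        self.min_silence_frames = math.ceil(min_silence_duration / frame_seconds)
        self.speaking = False
        self._run = 0     # consecutive frames disagreeing with the current state

    def update(self, frame) -> bool:
        """Feed one frame of samples in [-1, 1]; return True while speech is active."""
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        loud = rms >= self.threshold
        if loud != self.speaking:
            self._run += 1
            needed = self.min_speech_frames if loud else self.min_silence_frames
            if self._run >= needed:       # only flip after a sustained change
                self.speaking = loud
                self._run = 0
        else:
            self._run = 0
        return self.speaking

gate = EnergyGate()
quiet, loud = [0.0] * 320, [0.05] * 320   # 20 ms frames at 16 kHz
print(gate.update(quiet))                  # False
for _ in range(5):                         # 100 ms of speech flips the gate on
    active = gate.update(loud)
print(active)                              # True
```

The `min_silence_duration` being longer than `min_speech_duration` is deliberate: it keeps the gate open across the short pauses inside a sentence.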

Speech-to-Text Engines

{
  "stt": {
    "default_engine": "whisper",
    "whisper": {
      "model_size": "base",
      "use_api": false,
      "language": null
    },
    "azure": {
      "enabled": false,
      "api_key": null,
      "region": "eastus"
    },
    "google": {
      "enabled": false,
      "credentials_path": null
    }
  }
}
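The `default_engine` key selects a backend at startup. A minimal dispatch sketch over this config (the engine classes are illustrative stand-ins for the real STT backends):

```python
# Hypothetical engine stand-ins; the real backends wrap Whisper, Azure, Google.
class WhisperEngine:
    name = "whisper"

class AzureEngine:
    name = "azure"

class GoogleEngine:
    name = "google"

ENGINES = {"whisper": WhisperEngine, "azure": AzureEngine, "google": GoogleEngine}

def create_engine(stt_config: dict):
    """Instantiate the configured default engine, rejecting disabled ones."""
    name = stt_config["default_engine"]
    if name not in ENGINES:
        raise ValueError(f"unknown STT engine: {name}")
    settings = stt_config.get(name, {})
    if settings.get("enabled") is False:      # cloud engines default to disabled
        raise ValueError(f"{name} is disabled in the configuration")
    return ENGINES[name]()

engine = create_engine({"default_engine": "whisper",
                        "whisper": {"model_size": "base", "use_api": False}})
print(engine.name)  # whisper
```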

🔧 Advanced Usage

Using Local Whisper Models

# Different model sizes (tiny, base, small, medium, large)
config.stt.whisper.model_size = "small"  # Faster
config.stt.whisper.model_size = "large"  # More accurate

Optimizing for Real-time Performance

# Low-latency settings
config.audio.chunk_size = 512
config.audio.sample_rate = 16000
config.vad.min_speech_duration = 0.1

Using Cloud STT Services

# Azure Speech
set AZURE_SPEECH_KEY=your_key
set AZURE_SPEECH_REGION=eastus

# Google Speech
set GOOGLE_CREDENTIALS_PATH=path/to/credentials.json

🧪 Testing

Test Audio Devices

python -c "from audio.capture import list_audio_devices; print(list_audio_devices())"

Test Voice Activity Detection

python -c "from audio.vad import create_vad; print(create_vad('hybrid'))"

Benchmark STT Engines

python -m stt.benchmark --audio test_audio.wav

📊 Performance Monitoring

The server includes comprehensive logging and performance monitoring:

  • Audio Processing: Chunk processing times, dropouts, queue status
  • VAD Performance: Speech detection accuracy, false positives
  • STT Metrics: Transcription latency, confidence scores, accuracy
  • System Resources: Memory usage, CPU utilization
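Metrics like the chunk processing times above can be collected with a small timing decorator around the hot path. A self-contained sketch (the metric and function names are illustrative, not the server's actual instrumentation):

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("audio.metrics")

def timed(metric_name):
    """Decorator: log the wall-clock duration of each call under a metric name."""
    durations = []
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                durations.append(elapsed_ms)
                log.debug("%s took %.2f ms (n=%d)",
                          metric_name, elapsed_ms, len(durations))
        wrapper.durations = durations   # exposed for summary reports
        return wrapper
    return decorate

@timed("chunk_processing")
def process_chunk(samples):
    return sum(samples) / len(samples)   # stand-in for VAD + STT work

process_chunk([0.0] * 1024)
process_chunk([0.1] * 1024)
print(len(process_chunk.durations))      # 2
```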

Enable debug logging for detailed metrics:

python main.py --debug

๐Ÿ” Troubleshooting

Common Issues

1. No audio devices detected

# Check if sounddevice can see your devices
python -c "import sounddevice as sd; print(sd.query_devices())"

2. High latency

# Reduce chunk size and enable low latency
config.audio.chunk_size = 512
config.audio.low_latency = True

3. Whisper model loading errors

# Reinstall Whisper; model weights re-download to ~/.cache/whisper on next load
pip uninstall openai-whisper
pip install openai-whisper

4. WASAPI permissions on Windows

  • Check microphone privacy settings
  • Run as administrator if needed
  • Ensure Claude Desktop has microphone permissions

Debug Mode

# Enable comprehensive logging
python main.py --debug

# Or set environment variable
set LOG_LEVEL=DEBUG
python main.py

๐Ÿ” Security Considerations

  • API keys are stored in environment variables or secure config files
  • Audio data is processed locally by default (Whisper local models)
  • Cloud STT services can be disabled for maximum privacy
  • No audio data is permanently stored

🚀 Performance Optimization

For Low Latency (<200ms)

{
  "audio": {
    "chunk_size": 512,
    "sample_rate": 16000
  },
  "stt": {
    "whisper": {
      "model_size": "tiny",
      "fp16": true
    }
  }
}
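A back-of-envelope check of why these values fit the <200 ms target (illustrative arithmetic, not a measurement): each callback delivers `chunk_size` frames, so the capture-side delay per chunk is `chunk_size / sample_rate`.

```python
# Capture-side delay implied by the low-latency profile above.
chunk_size, sample_rate = 512, 16000
capture_ms = chunk_size / sample_rate * 1000
print(f"capture delay: {capture_ms:.0f} ms per chunk")  # capture delay: 32 ms per chunk

# That leaves roughly 170 ms of the 200 ms budget for VAD and the "tiny"
# Whisper model. The high-accuracy profile below makes the opposite trade:
print(2048 / 16000 * 1000)  # 128.0 ms per chunk
```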

For High Accuracy

{
  "audio": {
    "chunk_size": 2048
  },
  "stt": {
    "whisper": {
      "model_size": "large",
      "beam_size": 10
    }
  }
}

๐Ÿค Contributing

We welcome contributions! Areas where help is needed:

  • 🎯 Additional STT engines (AssemblyAI, Rev.ai, etc.)
  • 🔧 Audio preprocessing (noise reduction, normalization)
  • 📱 Cross-platform support (macOS, Linux)
  • 🧪 Testing frameworks (automated audio testing)
  • 📚 Documentation (tutorials, examples)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Anthropic for Claude and the Model Context Protocol
  • OpenAI for Whisper speech recognition
  • Python Audio Community for excellent libraries
  • FastMCP for the high-level MCP framework

📞 Support


โญ Star this repository if you find it useful!

This Python implementation provides a more maintainable and feature-rich alternative to the original TypeScript version, with better audio processing capabilities and easier extensibility.