mcp-speech-to-text

michaelyuwh/mcp-speech-to-text

The MCP Speech-to-Text Server is a Model Context Protocol server that provides speech-to-text capabilities via Vosk (offline) and SpeechRecognition, designed for integration with n8n AI workflows.


MCP Speech-to-Text: Production-Ready Local Solution

A production-ready Model Context Protocol (MCP) server that provides completely local, offline speech-to-text capabilities. Optimized for x86_64 production deployment with macOS development support.

🎯 Perfect For

  • Production Systems - x86_64 Linux servers with full offline capabilities
  • Hong Kong Users - No regional restrictions or blocking
  • Privacy-Conscious - All processing happens locally
  • Cost-Conscious - Zero API costs after initial setup
  • n8n Workflows - Direct MCP integration

๐Ÿ—๏ธ Platform Support Matrix

| Platform | Speech Engine | Offline Mode | Internet Required | Production Ready |
|---|---|---|---|---|
| x86_64 Linux | Vosk + SpeechRecognition | ✅ Full | ❌ No | ✅ Yes |
| ARM64 Linux | SpeechRecognition | ⚠️ Limited | ✅ Yes | ✅ Yes |
| macOS (Dev) | SpeechRecognition | ⚠️ Limited | ✅ Yes | 🔧 Dev Only |
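
The automatic engine selection mentioned in the Quick Start below can be approximated with a small platform check. This is a minimal illustrative sketch in Python, not the project's actual __main__.py; it assumes the vosk package only imports cleanly on platforms where offline mode is supported.

# Illustrative engine auto-detection (assumption: not the project's real entry point)
import platform

def pick_engine() -> str:
    try:
        import vosk  # noqa: F401 -- only available where offline mode is supported
    except ImportError:
        return "google"  # SpeechRecognition online fallback
    # Full offline mode is only advertised for x86_64 in the matrix above
    return "vosk" if platform.machine() in ("x86_64", "AMD64") else "google"

print(pick_engine())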

🚀 Quick Start

๐Ÿญ Production Deployment (x86_64 Linux)

Best Choice: Full offline capabilities with Docker

# 1. Clone repository
git clone https://github.com/michaelyuwh/mcp-speech-to-text.git
cd mcp-speech-to-text

# 2. Deploy with Docker Compose (Automatic platform detection)
docker compose up -d

# 3. Verify deployment
./scripts/test-deployment.sh

# 4. Check status
docker compose ps
docker compose logs -f mcp-speech-to-text

💻 Development Setup (macOS)

For Development and Testing

# 1. Clone and setup
git clone https://github.com/michaelyuwh/mcp-speech-to-text.git
cd mcp-speech-to-text

# 2. Install with uv (recommended for macOS)
uv sync
uv run python -c "from src.mcp_speech_to_text.server_sr import SpeechToTextServer; print('✅ Ready')"

# 3. Run development server
uv run python -m mcp_speech_to_text

🛠️ Deployment Methods

Method 1: Docker Compose (Production)

docker compose up -d                    # Start services
docker compose logs -f                  # View logs
docker compose down                     # Stop services

Method 2: Direct Docker Build

./scripts/build-x86_64.sh              # Build for x86_64
docker run -d --name mcp-speech mcp-speech-to-text:x86_64-latest

Method 3: Native Python (Development)

# With uv (macOS recommended)
uv sync && uv run python -m mcp_speech_to_text

# With pip
pip install -e . && python -m mcp_speech_to_text

⚙️ Available MCP Tools

🎯 Core Speech Recognition

  • transcribe_audio_offline - Vosk offline transcription (x86_64 only)
  • transcribe_audio_file - SpeechRecognition transcription (all platforms)
  • record_and_transcribe - Live microphone recording and transcription
  • get_supported_engines - List available speech engines

🔧 Audio Processing

  • convert_audio_format - Convert between audio formats
  • test_microphone - Test microphone functionality and list devices
  • get_supported_formats - List supported audio formats
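
As a reference for wiring these tools into a client (for example an n8n MCP node or a custom script), below is a minimal stdio client sketch using the official MCP Python SDK. The tool argument name (audio_path) is an assumption; inspect the schemas returned by list_tools for the real parameter names.

# Minimal MCP stdio client sketch (argument names are assumed, not documented here)
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    params = StdioServerParameters(command="python", args=["-m", "mcp_speech_to_text"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])   # should include the tools above
            result = await session.call_tool(
                "transcribe_audio_file",
                arguments={"audio_path": "sample.wav"},  # assumed parameter name
            )
            print(result)

asyncio.run(main())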

🧪 Testing and Verification

Quick Health Check

# Test current setup
./scripts/test-deployment.sh

# Test Docker image
docker run --rm mcp-speech-to-text:latest python -c "
from src.mcp_speech_to_text.server import OfflineSpeechToTextServer
server = OfflineSpeechToTextServer()
print('✅ Server healthy')
"

Development Testing (macOS)

# Test SpeechRecognition setup
uv run python -c "
from src.mcp_speech_to_text.server_sr import SpeechToTextServer
server = SpeechToTextServer()
print('✅ Development environment ready')
"

📊 Performance Characteristics

x86_64 Production (Vosk Offline)

  • Startup Time: 10-15 seconds (model loading)
  • Memory Usage: 200-300MB (with small model)
  • CPU Usage: 5-10% during transcription
  • Accuracy: Very good for offline recognition
  • Latency: Near real-time (< 1 second)
  • Internet: Not required after setup
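
For reference, the offline path boils down to standard Vosk usage, roughly like the sketch below. The model directory name is an assumption; the server auto-downloads its models under src/mcp_speech_to_text/models/.

# Bare-bones offline transcription with Vosk (model path is an assumption)
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("src/mcp_speech_to_text/models/vosk-model-small-en-us-0.15")  # assumed path
with wave.open("sample.wav", "rb") as wf:            # 16 kHz mono PCM WAV works best
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        rec.AcceptWaveform(chunk)
print(json.loads(rec.FinalResult())["text"])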

Fallback Mode (SpeechRecognition)

  • Startup Time: 2-3 seconds
  • Memory Usage: 50-100MB
  • CPU Usage: 2-5%
  • Accuracy: Excellent (Google API)
  • Latency: 1-3 seconds
  • Internet: Required for operation
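
The fallback path corresponds to plain SpeechRecognition usage, roughly as sketched below; recognize_google calls Google's Web Speech API, which is why internet access is required in this mode.

# Online fallback via the SpeechRecognition library (requires internet)
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:
    audio = recognizer.record(source)            # read the whole file
print(recognizer.recognize_google(audio))        # Google Web Speech API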

🗂️ Project Structure

mcp-speech-to-text/
├── src/mcp_speech_to_text/
│   ├── server.py              # Vosk server (x86_64 production)
│   ├── server_sr.py           # SpeechRecognition server (dev/fallback)
│   ├── __main__.py            # Auto-detecting entry point
│   └── models/                # Vosk models (auto-downloaded)
├── scripts/
│   ├── build-x86_64.sh        # Production build script
│   └── test-deployment.sh     # Comprehensive testing
├── .github/workflows/
│   └── build-x86_64.yml       # CI/CD for x86_64 builds
├── Dockerfile                 # Multi-platform container
├── docker-compose.yml         # Production deployment config
├── DEPLOYMENT_X86_64.md       # Detailed production guide
└── README.md                  # This file

🔧 Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| SPEECH_ENGINE | auto | vosk, google, or auto |
| VOSK_MODEL_PATH | /app/models | Path to Vosk models |
| MCP_SERVER_PORT | 8000 | Server port |
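
Inside the server, these settings would typically be read from the environment with the defaults listed above. A hedged sketch, not the project's actual configuration code:

# Reading the documented settings with their defaults (illustrative only)
import os

SPEECH_ENGINE = os.environ.get("SPEECH_ENGINE", "auto")            # vosk, google, or auto
VOSK_MODEL_PATH = os.environ.get("VOSK_MODEL_PATH", "/app/models")
MCP_SERVER_PORT = int(os.environ.get("MCP_SERVER_PORT", "8000"))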

Docker Volumes

| Host Path | Container Path | Purpose |
|---|---|---|
| ./audio_files | /app/audio_files | Audio file storage |
| ./models | /app/models/custom | Custom Vosk models |

🔍 Troubleshooting

Common Platform Issues

โ“ "Vosk not available" on macOS

Expected behavior - macOS ARM doesn't support Vosk. Use SpeechRecognition:

uv run python -m mcp_speech_to_text.server_sr
โ“ Docker build fails on Apple Silicon

Use platform-specific build:

docker buildx build --platform linux/amd64 .
โ“ No audio devices in Docker

Add device access:

docker run --device /dev/snd your-image

Platform-Specific Solutions

x86_64 Linux Production
  • Install audio packages: apt-get install portaudio19-dev
  • Verify model download: ls src/mcp_speech_to_text/models/
  • Check container logs: docker logs mcp-speech-to-text
macOS Development
  • Install portaudio: brew install portaudio
  • Use development server: server_sr.py
  • Enable microphone permissions in System Settings

📈 Scaling for Production

Single Server

# Basic deployment
docker compose up -d

Load Balanced (Multiple Containers)

# Scale up containers
docker compose up -d --scale mcp-speech-to-text=3

Kubernetes (Advanced)

See DEPLOYMENT_X86_64.md for Kubernetes manifests and advanced deployment patterns.

🛡️ Security and Privacy

  • ✅ No Data Transmission - All processing happens locally
  • ✅ No API Keys - No external service dependencies
  • ✅ Container Security - Runs as non-root user
  • ✅ Minimal Attack Surface - Only required ports exposed
  • ✅ Audio Privacy - Files never leave your infrastructure

📚 Documentation

  • DEPLOYMENT_X86_64.md - Comprehensive production deployment guide
  • GitHub Actions workflow (.github/workflows/build-x86_64.yml) - Automated testing and building
  • Docker Hub - Pre-built images (coming soon)

🎯 Why This Solution?

✅ Advantages

  • No Vendor Lock-in - Independent of OpenAI, Google, Azure
  • Predictable Costs - Zero ongoing API charges
  • Data Privacy - Audio processing never leaves your infrastructure
  • High Availability - No dependency on external services
  • Regional Independence - Works anywhere, including Hong Kong

🎪 Perfect Use Cases

  • Enterprise Environments - Privacy and compliance requirements
  • Cost-Sensitive Projects - High-volume speech processing
  • Offline Environments - Air-gapped or limited connectivity
  • Regional Restrictions - Areas with limited API access
  • Development Teams - Consistent dev/prod environments

🤝 Contributing

  1. Fork the repository
  2. Develop on macOS using server_sr.py
  3. Test on x86_64 Linux using Docker
  4. Submit pull request with platform testing

📄 License

MIT License - Complete freedom to use, modify, and distribute.


Ready to deploy speech-to-text without the cloud? Choose your platform above and get started! 🚀