Kokoro MCP Server: Text-to-Speech (TTS)
A comprehensive Text-to-Speech toolkit built on Kokoro-82M with audio enhancement, Model Context Protocol (MCP) server integration, CLI interface, and Docker deployment.
Features
- Kokoro-82M TTS Engine: Open-weight model with 82M parameters (510 tokens per pass)
- Streamlit Web UI: Enterprise-grade management interface with real-time preview (OPTIONAL)
- Audio Enhancement: Professional processing with librosa (normalization, noise reduction, fade in/out)
- MCP Server: Model Context Protocol integration for Claude Desktop, Cursor, and other AI tools (OPTIONAL)
- CLI Interface: Command-line tools for quick generation (OPTIONAL)
- Batch Processing: Generate multiple audio files efficiently
- Script Processing: Convert complete video scripts with automatic text chunking
- Docker Support: Containerization with docker-compose
- Enterprise Features: Structured logging, configuration management, comprehensive testing
- CI/CD: GitHub Actions pipeline with automated testing
Streamlit Web Interface (Optional)
Beautiful web UI for managing all TTS functionality:
- Single Generation - Convert text with real-time preview
- Batch Processing - Process multiple texts in one go
- Script Processing - Complete video script conversion
- Voice Explorer - Compare all available voices side-by-side
- Configuration - Manage settings visually
- Analytics - Track generations with charts and statistics
Install with: pip install -e ".[streamlit]"
or pip install -e ".[complete]"
Quick Start:
python run_streamlit.py
# Opens at http://localhost:8501
See the project's Streamlit documentation for complete details.
Building on Kokoro-82M
What Kokoro-82M Provides Out-of-the-Box: Kokoro-82M is an exceptional open-weight TTS model that delivers: core neural TTS inference with 82M parameters, a basic Python inference library (KPipeline), 10 professional voice packs (male/female, American/British), phonemization (G2P) system, and raw 24kHz audio output with a 510-token processing limit per pass.
What aparsoft-tts Adds: We integrate Kokoro-82M's excellent TTS inference with comprehensive development tooling and workflow enhancements. This toolkit adds:
- Audio post-processing - Normalization, noise reduction, silence trimming, and fade in/out using librosa
- Automated script workflows - Direct script-to-voiceover conversion with paragraph detection and gap management
- IDE-native generation - MCP server integration eliminates context switching for Claude Desktop and Cursor users
- Deployment infrastructure - Docker deployment, structured logging, configuration management, and comprehensive testing
- Batch processing - CLI and Python APIs for processing multiple segments efficiently
Technical Implementation
Audio Enhancement (librosa Integration):
This toolkit adds an audio processing pipeline on top of Kokoro-generated TTS output:
# Without enhancement - raw Kokoro output
audio = kokoro_pipeline(text)
# With enhancement
audio = enhance_audio(
    kokoro_output,
    normalize=True,        # Consistent volume
    trim_silence=True,     # Remove dead air
    noise_reduction=True,  # Spectral gating
    add_fade=True          # Smooth transitions
)
Result: Voiceovers ready for YouTube, podcasts, or content creation without additional audio editing.
MCP Server Integration:
Traditional workflow:
# 1. Write script in Claude/Cursor
# 2. Copy text to terminal
# 3. Run Python script
# 4. Switch back to editor
# 5. Repeat for each segment
With MCP server:
# In Claude Desktop or Cursor:
"Generate voiceover for this section using am_michael voice"
# Done. Audio generated without leaving your workspace.
Workflow Enhancement:
- Content creators: Write scripts in AI editors, generate voiceovers inline
- Developers: Generate test audio during development without context switching
- Teams: Standardized TTS across tools (Claude, Cursor, CLI, API)
- Automation: AI agents can generate audio as part of content pipelines
Deployment Features:
The toolkit wraps Kokoro with common deployment and development needs:
- Configuration management - Environment-based settings, no hardcoded values
- Structured logging - JSON logs for aggregation, correlation IDs for tracing (see the sketch after this list)
- Error handling - Custom exceptions, graceful failures, detailed error context
- Testing - Comprehensive test suite, CI/CD integration
- Docker deployment - Containerized with health checks, resource limits
- CLI interface - Quick access without writing code
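As a rough illustration of the structured-logging approach (a sketch assuming structlog, which is a listed dependency; the toolkit's actual setup lives in aparsoft_tts/utils/logging.py and may differ):
import structlog

# Configure JSON output with timestamps and context-bound correlation IDs
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # merge bound correlation IDs
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),       # JSON lines for aggregation
    ]
)

log = structlog.get_logger()
structlog.contextvars.bind_contextvars(correlation_id="req-1234")  # hypothetical ID
log.info("tts_generate", voice="am_michael", chars=120)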
Use Cases
YouTube/Podcast Production:
# Process entire video script with proper gaps
engine.process_script("script.txt", "voiceover.wav", gap_duration=0.5)
AI-Assisted Content Creation:
# In Claude Desktop with MCP:
User: "Generate a 30-second intro for my coding tutorial"
Claude: *generates script and voiceover via MCP*
Batch Content Generation:
# Generate 100 audio segments for e-learning course
engine.batch_generate(lesson_texts, output_dir="lessons/")
Development/Testing:
# Quick CLI test during development
aparsoft-tts generate "Test message" -o test.wav
Quick Start
Installation
System Dependencies (Required):
# Ubuntu/Debian
sudo apt-get install espeak-ng ffmpeg libsndfile1
# macOS
brew install espeak-ng ffmpeg
# Windows: Download from
# - espeak-ng: https://github.com/espeak-ng/espeak-ng/releases
# - ffmpeg: https://ffmpeg.org/download.html
Python Package - Choose Your Installation:
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# OPTION 1: Complete installation (RECOMMENDED)
# Includes: TTS Engine + MCP Server + CLI + Streamlit Web UI
pip install -e ".[complete]"
# OPTION 2: Without Streamlit (Developers)
# Includes: TTS Engine + MCP Server + CLI (no web UI)
pip install -e ".[mcp,cli]"
# OPTION 3: Streamlit Only
# Includes: TTS Engine + Streamlit Web UI (no MCP, no CLI)
pip install -e ".[streamlit]"
# OPTION 4: Core Only (Minimal)
# Includes: TTS Engine only (Python API)
pip install -e .
# OPTION 5: Everything (Contributors)
# Includes: All features + development tools
pip install -e ".[all]"
See the installation documentation for detailed options and troubleshooting.
Quick Launch
Streamlit Web UI:
# Cross-platform launcher
python run_streamlit.py
# Or use platform-specific scripts
./run_streamlit.sh # Linux/macOS
run_streamlit.bat # Windows
# Or direct
streamlit run streamlit_app.py
MCP Server (for Claude Desktop/Cursor):
- See MCP Integration section below
Basic Usage
from aparsoft_tts import TTSEngine
# Initialize engine
engine = TTSEngine()
# Generate speech
engine.generate(
    text="Welcome to Kokoro YouTube TTS",
    output_path="output.wav"
)
CLI Usage
# Generate audio
aparsoft-tts generate "Hello world" -o output.wav
# List available voices
aparsoft-tts voices
# Process video script
aparsoft-tts script video_script.txt -o voiceover.wav
# Batch generate
aparsoft-tts batch "Intro" "Body" "Outro" -d segments/
Available Voices
Male Voices:
- am_adam - American male (natural inflection)
- am_michael - American male (deeper tones, professional)
- bm_george - British male (classic accent)
- bm_lewis - British male (modern accent)
Female Voices:
- af_bella - American female (warm tones)
- af_nicole - American female (dynamic range)
- af_sarah - American female (clear articulation)
- af_sky - American female (youthful energy)
- bf_emma - British female (professional)
- bf_isabella - British female (soft tones)
Special Voices:
- af - Default mix (50-50 blend of Bella and Sarah)
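To hear these differences yourself, a short comparison loop (a sketch using the generate() call shown elsewhere in this README; filenames are arbitrary):
from aparsoft_tts import TTSEngine

engine = TTSEngine()

# Render the same line with several voices for a side-by-side comparison
for voice in ["am_adam", "am_michael", "bf_emma", "af_bella"]:
    engine.generate(
        text="This is a quick voice comparison sample.",
        output_path=f"compare_{voice}.wav",
        voice=voice,
    )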
Advanced Usage
Custom Configuration
from aparsoft_tts import TTSEngine, TTSConfig
# Create custom configuration
config = TTSConfig(
    voice="bm_george",
    speed=1.2,
    enhance_audio=True,
    fade_duration=0.2
)
engine = TTSEngine(config=config)
engine.generate("Custom configuration", "output.wav")
Audio Enhancement
from aparsoft_tts.utils.audio import enhance_audio
# Generate raw audio
audio = engine.generate("Test audio")
# Apply custom enhancement
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,
    trim_silence=True,
    trim_db=25.0,
    noise_reduction=True,
    add_fade=True,
    fade_duration=0.15
)
Batch Processing
# Process multiple texts
texts = [
    "Welcome to the tutorial",
    "Let's explore the features",
    "Thanks for watching"
]
paths = engine.batch_generate(
    texts=texts,
    output_dir="segments/",
    voice="am_michael"
)
Script Processing
# Process complete video script with automatic text chunking
engine.process_script(
    script_path="video_script.txt",
    output_path="complete_voiceover.wav",
    gap_duration=0.5,  # Gap between paragraphs
    voice="am_michael",
    speed=1.0
)
# Note: Kokoro processes up to 510 tokens per pass.
# Long scripts are automatically chunked and combined seamlessly.
Podcast Generation (Multi-Voice)
Create podcast-style content with different voices and speeds per segment. Perfect for interviews, dialogues, or multi-speaker content.
Via MCP (Claude Desktop/Cursor):
"Create a podcast with these segments:
- Intro by am_michael: 'Welcome to Tech Talk'
- Guest by af_bella at 0.95 speed: 'Thanks for having me'
- Outro by am_michael: 'See you next week'"
Via Python API:
from aparsoft_tts import TTSEngine
from aparsoft_tts.utils.audio import combine_audio_segments, save_audio

engine = TTSEngine()

# Define podcast segments with different voices/speeds
segments = [
    {"text": "Welcome to the show", "voice": "am_michael", "speed": 1.0},
    {"text": "Great to be here", "voice": "af_bella", "speed": 0.95},
    {"text": "Thanks for listening", "voice": "am_michael", "speed": 1.0},
]

# Generate each segment
audio_segments = []
for seg in segments:
    audio = engine.generate(
        text=seg["text"],
        voice=seg["voice"],
        speed=seg["speed"]
    )
    audio_segments.append(audio)

# Combine with gaps
combined = combine_audio_segments(
    audio_segments,
    sample_rate=24000,
    gap_duration=0.6  # Pause between segments
)

# Save
save_audio(combined, "podcast.wav", sample_rate=24000)
Via Streamlit UI:
- Open Streamlit:
python run_streamlit.py
- Navigate to the "Podcast Generation" tab
- Click "Add Segment" for each speaker
- Configure voice, speed, and text per segment
- Adjust gap duration in the settings panel
- Click "Generate Podcast"
Features:
- Per-segment voice control (host/guest conversations)
- Individual speed settings (emphasis/pacing)
- Configurable gaps between segments
- Audio enhancement (normalization, crossfades)
- Segment reordering (move up/down)
- Template support for quick start
Streaming Generation
# Generate audio in chunks
for chunk in engine.generate_stream(
    text="Long text for streaming...",
    voice="am_michael"
):
    # Process chunk as it's generated
    process_audio_chunk(chunk)
Model Context Protocol (MCP) Integration
Quick MCP Setup (5 Minutes)
What is MCP? Model Context Protocol lets Claude Desktop and Cursor generate speech directly from your conversations. No copy-pasting, no context switching.
For Developers: Quick Start
# 1. Find your Python path
which python # Linux/Mac
where python # Windows
# Example output: /home/ram/projects/youtube-creator/venv/bin/python
Claude Desktop:
# 1. Open config (creates if doesn't exist)
code ~/Library/Application\ Support/Claude/claude_desktop_config.json # macOS
code ~/.config/Claude/claude_desktop_config.json # Linux
notepad %APPDATA%\Claude\claude_desktop_config.json # Windows
# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}
# 3. Restart Claude (Cmd/Ctrl + R)
Cursor:
# 1. Create/edit config
mkdir -p ~/.cursor && code ~/.cursor/mcp.json
# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}
# 3. Restart Cursor completely
Testing MCP Server
# Quick test - should print server info
python -m aparsoft_tts.mcp_server --help
# Interactive testing with MCP Inspector
npx @modelcontextprotocol/inspector \
--command "/path/to/venv/bin/python" \
--args "-m" "aparsoft_tts.mcp_server"
# Opens UI at http://localhost:6274
Usage Examples
In Claude Desktop or Cursor, just ask naturally:
# Basic generation
"Generate speech for 'Welcome to my channel' using am_michael voice"
# Voice discovery
"List all available TTS voices"
# Batch processing
"Create voiceovers for these three segments: 'Intro', 'Main', 'Outro'"
# Script processing
"Process video_script.txt and create a complete voiceover"
# Custom parameters
"Generate 'Test message' at 1.3x speed with British accent"
MCP Tools Available
- generate_speech: Single audio generation with full control
  - Text input (up to 10,000 characters)
  - Voice selection (6 voices)
  - Speed control (0.5x - 2.0x)
  - Audio enhancement toggle
- list_voices: Get voice catalog with descriptions
- batch_generate: Process multiple texts efficiently
- process_script: Complete video script conversion
  - Automatic paragraph detection
  - Configurable gap duration
  - Handles long texts via automatic chunking
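For orientation, here is a minimal sketch of how a tool like generate_speech can be registered with FastMCP (illustrative only; the actual implementation lives in aparsoft_tts/mcp_server.py and the real tool exposes more parameters):
from fastmcp import FastMCP

from aparsoft_tts import TTSEngine

mcp = FastMCP("aparsoft-tts")
engine = TTSEngine()  # load the model once per server process

@mcp.tool()
def generate_speech(text: str, voice: str = "am_michael", speed: float = 1.0) -> str:
    """Generate speech for `text` and return the output file path."""
    output_path = "output.wav"  # hypothetical fixed path for this sketch
    engine.generate(text=text, output_path=output_path, voice=voice, speed=speed)
    return output_path

if __name__ == "__main__":
    mcp.run()  # stdio transport, as expected by Claude Desktop/Cursor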
Troubleshooting MCP
"Could not attach to MCP server"
- Use absolute path:
/full/path/to/venv/bin/python
- Test server runs:
python -m aparsoft_tts.mcp_server
- Check Python version:
python --version
(needs 3.10+)
"Tool not found"
# Reinstall MCP dependencies
pip install -e ".[mcp]"
# Verify FastMCP
python -c "from fastmcp import FastMCP; print('OK')"
Detailed Documentation: See the full MCP guide for advanced features, debugging, and production deployment.
Docker Deployment
Build and Run
# Build image
docker build -t aparsoft-tts:latest .
# Run MCP server
docker run -d \
--name aparsoft-tts \
-v $(pwd)/outputs:/app/outputs \
-v $(pwd)/logs:/app/logs \
aparsoft-tts:latest
# Run CLI commands
docker run --rm \
-v $(pwd)/outputs:/app/outputs \
aparsoft-tts:latest \
aparsoft-tts generate "Docker test" -o /app/outputs/test.wav
Docker Compose
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Environment Variables
# TTS Configuration
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
# MCP Server
MCP_SERVER_NAME=aparsoft-tts-server
MCP_ENABLE_RATE_LIMITING=true
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json
Project Structure
youtube-creator/
├── aparsoft_tts/
│   ├── core/
│   │   └── engine.py        # TTS engine
│   ├── utils/
│   │   ├── audio.py         # Audio processing with librosa
│   │   ├── logging.py       # Structured logging
│   │   └── exceptions.py    # Custom exceptions
│   ├── config.py            # Configuration management
│   ├── cli.py               # CLI interface
│   └── mcp_server.py        # MCP server (FastMCP)
├── tests/
│   ├── unit/                # Unit tests
│   └── integration/         # Integration tests
├── examples/                # Usage examples
├── pyproject.toml           # Project metadata
├── Dockerfile               # Docker configuration
└── docker-compose.yml       # Docker Compose config
Audio Processing
The toolkit enhances Kokoro's output with professional audio processing:
Features:
- Normalization: Consistent volume levels
- Silence Trimming: Remove quiet sections (configurable threshold)
- Noise Reduction: Spectral gating for cleaner audio
- Fade In/Out: Smooth transitions, prevents clicks
- Custom Processing: Extensible with librosa/scipy
Enhancement Pipeline:
from aparsoft_tts.utils.audio import enhance_audio, save_audio
# Generate raw audio
audio = engine.generate("Your text here")
# Apply enhancement pipeline
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,        # Normalize volume
    trim_silence=True,     # Trim silence
    trim_db=20.0,          # Threshold in dB
    noise_reduction=True,  # Apply noise gate
    add_fade=True,         # Add fade in/out
    fade_duration=0.1      # 100ms fade
)
# Save enhanced audio
save_audio(enhanced, "enhanced.wav", sample_rate=24000)
Configuration
Using Configuration Files
from aparsoft_tts import TTSConfig, MCPConfig, LoggingConfig, Config
# TTS settings
tts_config = TTSConfig(
    voice="am_michael",
    speed=1.0,
    enhance_audio=True,
    sample_rate=24000,
    output_format="wav"
)

# MCP server settings
mcp_config = MCPConfig(
    server_name="aparsoft-tts-production",
    enable_rate_limiting=True,
    rate_limit_calls=100
)

# Logging settings
logging_config = LoggingConfig(
    level="INFO",
    format="json",
    output="file"
)

# Combined configuration
config = Config(
    tts=tts_config,
    mcp=mcp_config,
    logging=logging_config
)
Environment Variables
Create a .env file:
# TTS Settings
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
TTS_SAMPLE_RATE=24000
# Audio Processing
TTS_TRIM_SILENCE=true
TTS_TRIM_DB=20.0
TTS_FADE_DURATION=0.1
# Logging
LOG_LEVEL=INFO
LOG_FORMAT=console
LOG_OUTPUT=stdout
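Since configuration here is pydantic-based, the TTS_* variables presumably map onto TTSConfig fields through an environment prefix. A minimal sketch of that pattern (the class below is illustrative, not the toolkit's actual TTSConfig):
from pydantic_settings import BaseSettings, SettingsConfigDict

class ExampleTTSSettings(BaseSettings):
    """Illustrative only: maps TTS_*-prefixed env vars and .env entries to fields."""
    model_config = SettingsConfigDict(env_prefix="TTS_", env_file=".env")

    voice: str = "am_michael"
    speed: float = 1.0
    enhance_audio: bool = True
    sample_rate: int = 24000

settings = ExampleTTSSettings()  # reads TTS_VOICE, TTS_SPEED, ... automatically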
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=aparsoft_tts --cov-report=html
# Run specific test file
pytest tests/unit/test_engine.py
# Run only fast tests
pytest -m "not slow"
Development
Setup Development Environment
# Clone repository
git clone https://github.com/aparsoft/kokoro-youtube-tts.git
cd kokoro-youtube-tts
# Install with dev dependencies
pip install -e ".[dev,mcp,cli,all]"
# Install pre-commit hooks
pre-commit install
Running CI Locally
The project includes GitHub Actions workflow for CI/CD:
- Code quality checks (Black, Ruff, mypy)
- Tests on multiple Python versions (3.10, 3.11, 3.12)
- Docker build verification
- Security scanning with Trivy
API Reference
TTSEngine
Initialization:
TTSEngine(config: TTSConfig | None = None)
Methods:
- generate(text, output_path, voice, speed, enhance) - Generate speech
- generate_stream(text, voice, speed) - Stream audio chunks
- batch_generate(texts, output_dir, voice, speed) - Batch processing
- process_script(script_path, output_path, gap_duration, voice, speed) - Process scripts
- list_voices() - Get available voices
Configuration Classes
- TTSConfig - TTS engine settings
- MCPConfig - MCP server configuration
- LoggingConfig - Logging configuration
- Config - Main application configuration
Audio Utilities
- enhance_audio(audio, ...) - Apply audio enhancement
- combine_audio_segments(segments, ...) - Combine audio segments
- save_audio(audio, path, ...) - Save audio to file
- load_audio(path, ...) - Load audio from file
- chunk_audio(audio, ...) - Split audio into chunks
- get_audio_duration(audio, ...) - Get audio duration
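A short sketch tying several of these helpers together (keyword arguments beyond those shown elsewhere in this README are assumptions):
from aparsoft_tts import TTSEngine
from aparsoft_tts.utils.audio import (
    combine_audio_segments,
    get_audio_duration,
    save_audio,
)

engine = TTSEngine()
intro = engine.generate("Welcome back.")       # returns raw audio when no path is given
outro = engine.generate("See you next time.")

combined = combine_audio_segments([intro, outro], sample_rate=24000, gap_duration=0.5)
print(f"Duration: {get_audio_duration(combined, sample_rate=24000):.1f}s")  # assumed signature
save_audio(combined, "combined.wav", sample_rate=24000)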
Examples
See the examples/ directory for complete examples:
- basic_usage.py - Simple generation examples
- youtube_workflow.py - Complete YouTube video production workflow
Troubleshooting
espeak-ng not found
# Ubuntu/Debian
sudo apt-get install espeak-ng
# macOS
brew install espeak-ng
# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases
Audio quality issues
Enable audio enhancement:
engine.generate(text="Your text", enhance=True)
Import errors
Ensure virtual environment is activated:
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
Docker issues
Check container logs:
docker logs aparsoft-tts
Performance
Benchmarks (on typical consumer hardware):
- Model Loading: ~2-3 seconds (one-time)
- Generation Speed: ~0.5s per second of audio
- Memory Usage: ~2GB RAM (model loaded)
- Token Processing: Up to 510 tokens per pass
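To reproduce rough numbers on your own hardware, a quick timing sketch:
import time

from aparsoft_tts import TTSEngine

start = time.perf_counter()
engine = TTSEngine()  # one-time model load (~2-3 s on typical hardware)
print(f"Model load: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
engine.generate(
    text="A short benchmark sentence for timing generation speed.",
    output_path="bench.wav",
)
print(f"Generation: {time.perf_counter() - start:.1f}s")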
Text Length Limits:
Kokoro-82M processes up to 510 tokens in a single pass. For longer texts:
- Automatic chunking: Engine automatically splits long texts
- Script processing: Handles unlimited length via intelligent segmentation
- Batch processing: Each segment processed independently
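As a simplified illustration of what automatic chunking involves (the engine's real segmentation is token-aware and internal; this character-based version is only a sketch):
# Illustrative only: approximate the engine's text chunking with a naive
# character budget; the real implementation splits on token counts.
def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    sentences = text.replace("\n", " ").split(". ")
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks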
Optimization Tips:
- Reuse engine instances (avoid reloading the model)
- Disable enhancement for draft generations (enhance=False)
- Use streaming for long texts (automatic chunking)
- Batch process multiple files for efficiency
- Enable GPU acceleration on supported platforms
- For very long texts, use process_script() for optimal chunking
Credits & Acknowledgements
This project builds upon excellent open-source software:
Core Dependencies
- Kokoro-82M by hexgrad - Apache License 2.0
  - Open-weight TTS model with 82M parameters
  - Processes up to 510 tokens per pass
  - Architecture by @yl4579 (StyleTTS 2)
  - 24kHz audio output, trained on <100 hours of audio
- librosa - ISC License - Audio analysis and processing
- FastMCP - MIT License - Model Context Protocol server framework
Additional Dependencies
- soundfile - Audio I/O
- pydantic - Configuration management
- structlog - Structured logging
- typer - CLI framework
- pytest - Testing framework
Special Thanks
- @yl4579 for the StyleTTS 2 architecture
- The hexgrad team for the Kokoro model and inference library
- Anthropic for the Model Context Protocol
- All contributors to the open-source dependencies
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Third-Party Licenses:
- Kokoro-82M: Apache License 2.0
- librosa: ISC License
- FastMCP: MIT License
Support
- Email: contact@aparsoft.com
- Website: https://aparsoft.com
- Issues: GitHub Issues
Citation
If you use this toolkit in your research or project, please cite:
@software{kokoro_youtube_tts,
  author = {Aparsoft},
  title = {Kokoro YouTube TTS: Comprehensive TTS Toolkit},
  year = {2025},
  url = {https://github.com/aparsoft/kokoro-youtube-tts}
}
For the Kokoro model:
@software{kokoro_tts,
  author = {hexgrad},
  title = {Kokoro-82M: Open-weight TTS Model},
  year = {2024},
  url = {https://huggingface.co/hexgrad/Kokoro-82M}
}
Built with ❤️ for the video creator community