
Kokoro MCP Server: Text-to-Speech (TTS)

A comprehensive Text-to-Speech toolkit built on Kokoro-82M with audio enhancement, Model Context Protocol (MCP) server integration, CLI interface, and Docker deployment.

Python 3.10+ · Code style: black


Features

  • Kokoro-82M TTS Engine: Open-weight model with 82M parameters (510 tokens per pass)
  • 🌐 Streamlit Web UI: Enterprise-grade management interface with real-time preview (OPTIONAL)
  • Audio Enhancement: Professional processing with librosa (normalization, noise reduction, fade in/out)
  • MCP Server: Model Context Protocol integration for Claude Desktop, Cursor, and other AI tools (OPTIONAL)
  • CLI Interface: Command-line tools for quick generation (OPTIONAL)
  • Batch Processing: Generate multiple audio files efficiently
  • Script Processing: Convert complete video scripts with automatic text chunking
  • Docker Support: Containerization with docker-compose
  • Enterprise Features: Structured logging, configuration management, comprehensive testing
  • CI/CD: GitHub Actions pipeline with automated testing

Streamlit Web Interface (Optional)

Beautiful web UI for managing all TTS functionality:

  • 🎯 Single Generation - Convert text with real-time preview
  • 📦 Batch Processing - Process multiple texts in one go
  • 📄 Script Processing - Complete video script conversion
  • 🔍 Voice Explorer - Compare all available voices side-by-side
  • ⚙️ Configuration - Manage settings visually
  • 📊 Analytics - Track generations with charts and statistics

Install with: pip install -e ".[streamlit]" or pip install -e ".[complete]"

Quick Start:

python run_streamlit.py
# Opens at http://localhost:8501

📚 See the project's Streamlit documentation for complete details.


Building on Kokoro-82M

What Kokoro-82M Provides Out-of-the-Box: Kokoro-82M is an exceptional open-weight TTS model that delivers: core neural TTS inference with 82M parameters, a basic Python inference library (KPipeline), 10 professional voice packs (male/female, American/British), phonemization (G2P) system, and raw 24kHz audio output with a 510-token processing limit per pass.

What aparsoft-tts Adds: We integrate Kokoro-82M's excellent TTS inference with comprehensive development tooling and workflow enhancements. This toolkit adds:

  1. Audio post-processing - Normalization, noise reduction, silence trimming, and fade in/out using librosa
  2. Automated script workflows - Direct script-to-voiceover conversion with paragraph detection and gap management
  3. IDE-native generation - MCP server integration eliminates context switching for Claude Desktop and Cursor users
  4. Deployment infrastructure - Docker deployment, structured logging, configuration management, and comprehensive testing
  5. Batch processing - CLI and Python APIs for processing multiple segments efficiently

Technical Implementation

Audio Enhancement (librosa Integration):

This toolkit adds an audio-processing pipeline on top of Kokoro-generated TTS output:

# Without enhancement - raw Kokoro output
audio = kokoro_pipeline(text)

# With enhancement
audio = enhance_audio(
    kokoro_output,
    normalize=True,        # Consistent volume
    trim_silence=True,     # Remove dead air
    noise_reduction=True,  # Spectral gating
    add_fade=True         # Smooth transitions
)

Result: Voiceovers ready for YouTube, podcasts, or content creation without additional audio editing.

MCP Server Integration:

Traditional workflow:

# 1. Write script in Claude/Cursor
# 2. Copy text to terminal
# 3. Run Python script
# 4. Switch back to editor
# 5. Repeat for each segment

With MCP server:

# In Claude Desktop or Cursor:
"Generate voiceover for this section using am_michael voice"
# Done. Audio generated without leaving your workspace.

Workflow Enhancement:

  • Content creators: Write scripts in AI editors, generate voiceovers inline
  • Developers: Generate test audio during development without context switching
  • Teams: Standardized TTS across tools (Claude, Cursor, CLI, API)
  • Automation: AI agents can generate audio as part of content pipelines

Deployment Features:

The toolkit wraps Kokoro with common deployment and development needs:

  • Configuration management - Environment-based settings, no hardcoded values
  • Structured logging - JSON logs for aggregation, correlation IDs for tracing
  • Error handling - Custom exceptions, graceful failures, detailed error context
  • Testing - Comprehensive test suite, CI/CD integration
  • Docker deployment - Containerized with health checks, resource limits
  • CLI interface - Quick access without writing code

Use Cases

YouTube/Podcast Production:

# Process entire video script with proper gaps
engine.process_script("script.txt", "voiceover.wav", gap_duration=0.5)

AI-Assisted Content Creation:

# In Claude Desktop with MCP:
User: "Generate a 30-second intro for my coding tutorial"
Claude: *generates script and voiceover via MCP*

Batch Content Generation:

# Generate 100 audio segments for e-learning course
engine.batch_generate(lesson_texts, output_dir="lessons/")

Development/Testing:

# Quick CLI test during development
aparsoft-tts generate "Test message" -o test.wav

Quick Start

Installation

System Dependencies (Required):

# Ubuntu/Debian
sudo apt-get install espeak-ng ffmpeg libsndfile1

# macOS
brew install espeak ffmpeg

# Windows: Download from
# - espeak-ng: https://github.com/espeak-ng/espeak-ng/releases
# - ffmpeg: https://ffmpeg.org/download.html

Python Package - Choose Your Installation:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# OPTION 1: Complete installation (RECOMMENDED)
# Includes: TTS Engine + MCP Server + CLI + Streamlit Web UI
pip install -e ".[complete]"

# OPTION 2: Without Streamlit (Developers)
# Includes: TTS Engine + MCP Server + CLI (no web UI)
pip install -e ".[mcp,cli]"

# OPTION 3: Streamlit Only
# Includes: TTS Engine + Streamlit Web UI (no MCP, no CLI)
pip install -e ".[streamlit]"

# OPTION 4: Core Only (Minimal)
# Includes: TTS Engine only (Python API)
pip install -e .

# OPTION 5: Everything (Contributors)
# Includes: All features + development tools
pip install -e ".[all]"

📚 See the project's installation documentation for detailed options and troubleshooting.

Quick Launch

Streamlit Web UI:

# Cross-platform launcher
python run_streamlit.py

# Or use platform-specific scripts
./run_streamlit.sh      # Linux/macOS
run_streamlit.bat        # Windows

# Or direct
streamlit run streamlit_app.py

MCP Server (for Claude Desktop/Cursor):

python -m aparsoft_tts.mcp_server

Basic Usage

from aparsoft_tts import TTSEngine

# Initialize engine
engine = TTSEngine()

# Generate speech
engine.generate(
    text="Welcome to Kokoro YouTube TTS",
    output_path="output.wav"
)

CLI Usage

# Generate audio
aparsoft-tts generate "Hello world" -o output.wav

# List available voices
aparsoft-tts voices

# Process video script
aparsoft-tts script video_script.txt -o voiceover.wav

# Batch generate
aparsoft-tts batch "Intro" "Body" "Outro" -d segments/

Available Voices

Male Voices:

  • am_adam - American male (natural inflection)
  • am_michael - American male (deeper tones, professional)
  • bm_george - British male (classic accent)
  • bm_lewis - British male (modern accent)

Female Voices:

  • af_bella - American female (warm tones)
  • af_nicole - American female (dynamic range)
  • af_sarah - American female (clear articulation)
  • af_sky - American female (youthful energy)
  • bf_emma - British female (professional)
  • bf_isabella - British female (soft tones)

Special Voices:

  • af - Default mix (50-50 blend of Bella and Sarah)

Advanced Usage

Custom Configuration

from aparsoft_tts import TTSEngine, TTSConfig

# Create custom configuration
config = TTSConfig(
    voice="bm_george",
    speed=1.2,
    enhance_audio=True,
    fade_duration=0.2
)

engine = TTSEngine(config=config)
engine.generate("Custom configuration", "output.wav")

Audio Enhancement

from aparsoft_tts.utils.audio import enhance_audio

# Generate raw audio
audio = engine.generate("Test audio")

# Apply custom enhancement
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,
    trim_silence=True,
    trim_db=25.0,
    noise_reduction=True,
    add_fade=True,
    fade_duration=0.15
)

Batch Processing

# Process multiple texts
texts = [
    "Welcome to the tutorial",
    "Let's explore the features",
    "Thanks for watching"
]

paths = engine.batch_generate(
    texts=texts,
    output_dir="segments/",
    voice="am_michael"
)

Script Processing

# Process complete video script with automatic text chunking
engine.process_script(
    script_path="video_script.txt",
    output_path="complete_voiceover.wav",
    gap_duration=0.5,  # Gap between paragraphs
    voice="am_michael",
    speed=1.0
)

# Note: Kokoro processes up to 510 tokens per pass.
# Long scripts are automatically chunked and combined seamlessly.
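
The chunking behavior noted in the comments can be sketched in plain Python. This is not the toolkit's actual splitter (the engine counts phoneme tokens, not words); it is a word-count stand-in that shows the greedy sentence-packing idea:

```python
import re


def chunk_text(text: str, max_units: int = 100) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_units words.

    A word-count proxy for Kokoro's 510-token-per-pass limit; the real
    engine works on phoneme tokens and keeps sentence boundaries intact.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and count + n > max_units:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


script = "First sentence here. " * 30
chunks = chunk_text(script, max_units=25)
```

Each chunk would then be synthesized separately and the audio concatenated, which is why long scripts come out as one seamless file.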

Podcast Generation (Multi-Voice)

Create podcast-style content with different voices and speeds per segment. Perfect for interviews, dialogues, or multi-speaker content.

Via MCP (Claude Desktop/Cursor):

"Create a podcast with these segments:
- Intro by am_michael: 'Welcome to Tech Talk'
- Guest by af_bella at 0.95 speed: 'Thanks for having me'
- Outro by am_michael: 'See you next week'"

Via Python API:

from aparsoft_tts.utils.audio import combine_audio_segments, save_audio

# Define podcast segments with different voices/speeds
segments = [
    {"text": "Welcome to the show", "voice": "am_michael", "speed": 1.0},
    {"text": "Great to be here", "voice": "af_bella", "speed": 0.95},
    {"text": "Thanks for listening", "voice": "am_michael", "speed": 1.0},
]

# Generate each segment
audio_segments = []
for seg in segments:
    audio = engine.generate(
        text=seg["text"],
        voice=seg["voice"],
        speed=seg["speed"]
    )
    audio_segments.append(audio)

# Combine with gaps
combined = combine_audio_segments(
    audio_segments,
    sample_rate=24000,
    gap_duration=0.6  # Pause between segments
)

# Save
save_audio(combined, "podcast.wav", sample_rate=24000)

Via Streamlit UI:

  1. Open Streamlit: python run_streamlit.py
  2. Navigate to "🎙️ Podcast Generation" tab
  3. Click "➕ Add Segment" for each speaker
  4. Configure voice, speed, and text per segment
  5. Adjust gap duration in settings panel
  6. Click "🎧 Generate Podcast"

Features:

  • Per-segment voice control (host/guest conversations)
  • Individual speed settings (emphasis/pacing)
  • Configurable gaps between segments
  • Audio enhancement (normalization, crossfades)
  • Segment reordering (move up/down)
  • Template support for quick start

Streaming Generation

# Generate audio in chunks
for chunk in engine.generate_stream(
    text="Long text for streaming...",
    voice="am_michael"
):
    # Process chunk as it's generated
    process_audio_chunk(chunk)
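
process_audio_chunk above is a placeholder. A minimal consumer, assuming each chunk arrives as a mono numpy float array at 24 kHz (an assumption based on the 24kHz output noted elsewhere in this README), might accumulate chunks and track the running duration:

```python
import numpy as np

SAMPLE_RATE = 24000
collected: list[np.ndarray] = []


def process_audio_chunk(chunk: np.ndarray) -> float:
    """Store the chunk and return total audio duration so far, in seconds."""
    collected.append(chunk)
    total_samples = sum(len(c) for c in collected)
    return total_samples / SAMPLE_RATE


# Simulated chunks stand in for engine.generate_stream() output
for fake_chunk in (np.zeros(12000), np.zeros(24000)):
    duration = process_audio_chunk(fake_chunk)

# Join everything once streaming finishes
full_audio = np.concatenate(collected)
```

Incremental consumption like this keeps memory bounded for very long texts; the final concatenate (or a streaming file writer) runs only once.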

Model Context Protocol (MCP) Integration

Quick MCP Setup (5 Minutes)

What is MCP? Model Context Protocol lets Claude Desktop and Cursor generate speech directly from your conversations. No copy-pasting, no context switching.

For Developers: Quick Start

# 1. Find your Python path
which python  # Linux/Mac
where python  # Windows

# Example output: /home/ram/projects/youtube-creator/venv/bin/python

Claude Desktop:

# 1. Open config (creates if doesn't exist)
code ~/Library/Application\ Support/Claude/claude_desktop_config.json  # macOS
code ~/.config/Claude/claude_desktop_config.json  # Linux
notepad %APPDATA%\Claude\claude_desktop_config.json  # Windows

# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}

# 3. Restart Claude (Cmd/Ctrl + R)

Cursor:

# 1. Create/edit config
mkdir -p ~/.cursor && code ~/.cursor/mcp.json

# 2. Add this (use YOUR absolute Python path):
{
  "mcpServers": {
    "aparsoft-tts": {
      "command": "/absolute/path/to/your/venv/bin/python",
      "args": ["-m", "aparsoft_tts.mcp_server"]
    }
  }
}

# 3. Restart Cursor completely

Testing MCP Server

# Quick test - should print server info
python -m aparsoft_tts.mcp_server --help

# Interactive testing with MCP Inspector
npx @modelcontextprotocol/inspector \
  --command "/path/to/venv/bin/python" \
  --args "-m" "aparsoft_tts.mcp_server"
# Opens UI at http://localhost:6274

Usage Examples

In Claude Desktop or Cursor, just ask naturally:

# Basic generation
"Generate speech for 'Welcome to my channel' using am_michael voice"

# Voice discovery
"List all available TTS voices"

# Batch processing
"Create voiceovers for these three segments: 'Intro', 'Main', 'Outro'"

# Script processing
"Process video_script.txt and create a complete voiceover"

# Custom parameters
"Generate 'Test message' at 1.3x speed with British accent"

MCP Tools Available

  1. generate_speech: Single audio generation with full control

    • Text input (up to 10,000 characters)
    • Voice selection (see Available Voices above)
    • Speed control (0.5x - 2.0x)
    • Audio enhancement toggle
  2. list_voices: Get voice catalog with descriptions

  3. batch_generate: Process multiple texts efficiently

  4. process_script: Complete video script conversion

    • Automatic paragraph detection
    • Configurable gap duration
    • Handles long texts via automatic chunking

Troubleshooting MCP

"Could not attach to MCP server"

  • Use absolute path: /full/path/to/venv/bin/python
  • Test server runs: python -m aparsoft_tts.mcp_server
  • Check Python version: python --version (needs 3.10+)

"Tool not found"

# Reinstall MCP dependencies
pip install -e ".[mcp]"

# Verify FastMCP
python -c "from fastmcp import FastMCP; print('✅ OK')"

Detailed Documentation: See the project's MCP guide for advanced features, debugging, and production deployment.


Docker Deployment

Build and Run

# Build image
docker build -t aparsoft-tts:latest .

# Run MCP server
docker run -d \
  --name aparsoft-tts \
  -v $(pwd)/outputs:/app/outputs \
  -v $(pwd)/logs:/app/logs \
  aparsoft-tts:latest

# Run CLI commands
docker run --rm \
  -v $(pwd)/outputs:/app/outputs \
  aparsoft-tts:latest \
  aparsoft-tts generate "Docker test" -o /app/outputs/test.wav

Docker Compose

# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Environment Variables

# TTS Configuration
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true

# MCP Server
MCP_SERVER_NAME=aparsoft-tts-server
MCP_ENABLE_RATE_LIMITING=true

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=json

Project Structure

youtube-creator/
├── aparsoft_tts/
│   ├── core/
│   │   └── engine.py          # TTS engine
│   ├── utils/
│   │   ├── audio.py           # Audio processing with librosa
│   │   ├── logging.py         # Structured logging
│   │   └── exceptions.py      # Custom exceptions
│   ├── config.py              # Configuration management
│   ├── cli.py                 # CLI interface
│   └── mcp_server.py          # MCP server (FastMCP)
├── tests/
│   ├── unit/                  # Unit tests
│   └── integration/           # Integration tests
├── examples/                  # Usage examples
├── pyproject.toml             # Project metadata
├── Dockerfile                 # Docker configuration
└── docker-compose.yml         # Docker Compose config

Audio Processing

The toolkit enhances Kokoro's output with professional audio processing:

Features:

  • Normalization: Consistent volume levels
  • Silence Trimming: Remove quiet sections (configurable threshold)
  • Noise Reduction: Spectral gating for cleaner audio
  • Fade In/Out: Smooth transitions, prevents clicks
  • Custom Processing: Extensible with librosa/scipy

Enhancement Pipeline:

from aparsoft_tts.utils.audio import enhance_audio, save_audio

# Generate raw audio
audio = engine.generate("Your text here")

# Apply enhancement pipeline
enhanced = enhance_audio(
    audio,
    sample_rate=24000,
    normalize=True,      # Normalize volume
    trim_silence=True,   # Trim silence
    trim_db=20.0,        # Threshold in dB
    noise_reduction=True,  # Apply noise gate
    add_fade=True,       # Add fade in/out
    fade_duration=0.1    # 100ms fade
)

# Save enhanced audio
save_audio(enhanced, "enhanced.wav", sample_rate=24000)

Configuration

Using Configuration Files

from aparsoft_tts import TTSConfig, MCPConfig, LoggingConfig, Config

# TTS settings
tts_config = TTSConfig(
    voice="am_michael",
    speed=1.0,
    enhance_audio=True,
    sample_rate=24000,
    output_format="wav"
)

# MCP server settings
mcp_config = MCPConfig(
    server_name="aparsoft-tts-production",
    enable_rate_limiting=True,
    rate_limit_calls=100
)

# Logging settings
logging_config = LoggingConfig(
    level="INFO",
    format="json",
    output="file"
)

# Combined configuration
config = Config(
    tts=tts_config,
    mcp=mcp_config,
    logging=logging_config
)

Environment Variables

Create .env file:

# TTS Settings
TTS_VOICE=am_michael
TTS_SPEED=1.0
TTS_ENHANCE_AUDIO=true
TTS_SAMPLE_RATE=24000

# Audio Processing
TTS_TRIM_SILENCE=true
TTS_TRIM_DB=20.0
TTS_FADE_DURATION=0.1

# Logging
LOG_LEVEL=INFO
LOG_FORMAT=console
LOG_OUTPUT=stdout
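
Environment-based settings of this shape can be loaded with a small standard-library sketch. The toolkit itself uses pydantic for this, so the dataclass below is only an illustration; the variable names follow the .env example above:

```python
import os
from dataclasses import dataclass


@dataclass
class TTSSettings:
    voice: str = "am_michael"
    speed: float = 1.0
    enhance_audio: bool = True
    sample_rate: int = 24000

    @classmethod
    def from_env(cls) -> "TTSSettings":
        """Read TTS_* variables, falling back to the defaults above."""
        env = os.environ
        return cls(
            voice=env.get("TTS_VOICE", cls.voice),
            speed=float(env.get("TTS_SPEED", cls.speed)),
            enhance_audio=env.get("TTS_ENHANCE_AUDIO", "true").lower() == "true",
            sample_rate=int(env.get("TTS_SAMPLE_RATE", cls.sample_rate)),
        )


# Values set in the environment (or a loaded .env) override the defaults
os.environ["TTS_SPEED"] = "1.2"
settings = TTSSettings.from_env()
```

The key property, in either implementation, is that nothing is hardcoded: the same code runs with different voices and speeds per deployment.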

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=aparsoft_tts --cov-report=html

# Run specific test file
pytest tests/unit/test_engine.py

# Run only fast tests
pytest -m "not slow"

Development

Setup Development Environment

# Clone repository
git clone https://github.com/aparsoft/kokoro-youtube-tts.git
cd kokoro-youtube-tts

# Install with dev dependencies
pip install -e ".[dev,mcp,cli,all]"

# Install pre-commit hooks
pre-commit install

Running CI Locally

The project includes GitHub Actions workflow for CI/CD:

  • Code quality checks (Black, Ruff, mypy)
  • Tests on multiple Python versions (3.10, 3.11, 3.12)
  • Docker build verification
  • Security scanning with Trivy

API Reference

TTSEngine

Initialization:

TTSEngine(config: TTSConfig | None = None)

Methods:

  • generate(text, output_path, voice, speed, enhance) - Generate speech
  • generate_stream(text, voice, speed) - Stream audio chunks
  • batch_generate(texts, output_dir, voice, speed) - Batch processing
  • process_script(script_path, output_path, gap_duration, voice, speed) - Process scripts
  • list_voices() - Get available voices

Configuration Classes

  • TTSConfig - TTS engine settings
  • MCPConfig - MCP server configuration
  • LoggingConfig - Logging configuration
  • Config - Main application configuration

Audio Utilities

  • enhance_audio(audio, ...) - Apply audio enhancement
  • combine_audio_segments(segments, ...) - Combine audio files
  • save_audio(audio, path, ...) - Save audio to file
  • load_audio(path, ...) - Load audio from file
  • chunk_audio(audio, ...) - Split audio into chunks
  • get_audio_duration(audio, ...) - Get audio duration
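
The semantics of combine_audio_segments and get_audio_duration can be sketched with numpy. This is illustrative, not the toolkit's implementation: segments are joined with a stretch of silence between each pair, and duration is samples divided by sample rate.

```python
import numpy as np


def combine_with_gaps(
    segments: list[np.ndarray],
    sample_rate: int = 24000,
    gap_duration: float = 0.5,
) -> np.ndarray:
    """Concatenate segments, inserting gap_duration seconds of silence between each."""
    gap = np.zeros(int(sample_rate * gap_duration), dtype=np.float32)
    pieces: list[np.ndarray] = []
    for i, seg in enumerate(segments):
        if i > 0:
            pieces.append(gap)  # silence only between segments, not before/after
        pieces.append(seg.astype(np.float32))
    return np.concatenate(pieces)


def duration_seconds(audio: np.ndarray, sample_rate: int = 24000) -> float:
    return len(audio) / sample_rate


a = np.ones(24000, dtype=np.float32)  # 1.0 s of audio
b = np.ones(12000, dtype=np.float32)  # 0.5 s of audio
combined = combine_with_gaps([a, b], gap_duration=0.5)
```

Here the result is 1.0 s + 0.5 s gap + 0.5 s = 2.0 s of audio, matching what the podcast example above relies on.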

Examples

See the examples/ directory for complete examples:

  • basic_usage.py - Simple generation examples
  • youtube_workflow.py - Complete YouTube video production workflow

Troubleshooting

espeak-ng not found

# Ubuntu/Debian
sudo apt-get install espeak-ng

# macOS
brew install espeak

# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases

Audio quality issues

Enable audio enhancement:

engine.generate(text="Your text", enhance=True)

Import errors

Ensure virtual environment is activated:

source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

Docker issues

Check container logs:

docker logs aparsoft-tts

Performance

Benchmarks (on typical consumer hardware):

  • Model Loading: ~2-3 seconds (one-time)
  • Generation Speed: ~0.5s per second of audio
  • Memory Usage: ~2GB RAM (model loaded)
  • Token Processing: Up to 510 tokens per pass

Text Length Limits:

Kokoro-82M processes up to 510 tokens in a single pass. For longer texts:

  • Automatic chunking: Engine automatically splits long texts
  • Script processing: Handles unlimited length via intelligent segmentation
  • Batch processing: Each segment processed independently

Optimization Tips:

  1. Reuse engine instances (avoid reloading model)
  2. Disable enhancement for draft generations (enhance=False)
  3. Use streaming for long texts (automatic chunking)
  4. Batch process multiple files for efficiency
  5. Enable GPU acceleration on supported platforms
  6. For very long texts, use process_script() for optimal chunking

Credits & Acknowledgements

This project builds upon excellent open-source software:

Core Dependencies

  • Kokoro-82M by hexgrad - Apache License 2.0

    • Open-weight TTS model with 82M parameters
    • Processes up to 510 tokens per pass
    • Architecture by @yl4579 (StyleTTS 2)
    • 24kHz audio output, <100 hours training data
  • librosa - ISC License

    • Audio analysis and processing
  • FastMCP - MIT License

    • Model Context Protocol server framework

Additional Dependencies

  • soundfile - Audio I/O
  • pydantic - Configuration management
  • structlog - Structured logging
  • typer - CLI framework
  • pytest - Testing framework

Special Thanks

  • šŸ› ļø @yl4579 for StyleTTS 2 architecture
  • šŸ† hexgrad team for Kokoro model and inference library
  • 🌐 Anthropic for Model Context Protocol
  • šŸ“Š All contributors to the open-source dependencies

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Third-Party Licenses:

  • Kokoro-82M: Apache License 2.0
  • librosa: ISC License
  • FastMCP: MIT License

Support


Citation

If you use this toolkit in your research or project, please cite:

@software{kokoro_youtube_tts,
  author = {Aparsoft},
  title = {Kokoro YouTube TTS: Comprehensive TTS Toolkit},
  year = {2025},
  url = {https://github.com/aparsoft/kokoro-youtube-tts}
}

For the Kokoro model:

@software{kokoro_tts,
  author = {hexgrad},
  title = {Kokoro-82M: Open-weight TTS Model},
  year = {2024},
  url = {https://huggingface.co/hexgrad/Kokoro-82M}
}

Built with ā¤ļø for the video creator community