Avatar Renderer MCP
The Avatar Renderer Pod is a VideoGenie-class avatar generator that natively speaks the MCP protocol, enabling high-quality talking-head generation from a single image.
Production-ready AI-powered talking head generation system with enterprise-grade MCP integration
A high-performance, scalable avatar rendering engine that transforms static images and audio into realistic talking head videos using state-of-the-art deep learning models.
About
Avatar Renderer MCP is a sophisticated video generation system designed for production environments. It combines multiple cutting-edge AI models (FOMM, Diff2Lip, Wav2Lip, SadTalker) to create photorealistic talking avatars from just two inputs:
- A still image of a person's face
- An audio file containing speech
The AI analyzes both inputs and generates a video where the person appears to be speaking naturally, with synchronized lip movements, facial expressions, and head pose animations.
Key Differentiators
- Production-Ready Architecture: Built for enterprise deployments with Kubernetes, Docker, and auto-scaling support
- MCP Protocol Integration: Native Model Context Protocol (MCP) support for seamless AI agent communication
- Intelligent Fallback System: Automatic GPU memory management with graceful degradation
- Scalable Design: Celery-based distributed task processing with KEDA autoscaling
- Cloud-Native: Helm charts, Kubernetes manifests, and Terraform configurations included
Features
Core Capabilities
- Multi-Model Pipeline: FOMM for head pose + Diff2Lip for photorealistic lip-sync
- Automatic Fallback: Switches to SadTalker + Wav2Lip when GPU memory is constrained
- GPU-Accelerated Encoding: NVENC H.264 encoding achieving >200 FPS on V100 GPUs
- MCP STDIO Server: Ready for AI agent integration with auto-discovery
- RESTful API: FastAPI-based HTTP interface for traditional integrations
- Face Enhancement: Built-in GFPGAN support for improved output quality
- Phoneme Alignment: Montreal Forced Aligner integration for precise lip-sync
DevOps & Infrastructure
- Containerized: Production-grade Dockerfile with CUDA 12.4 support
- Kubernetes-Ready: Helm charts and raw manifests for K8s deployments
- Auto-Scaling: KEDA integration for demand-based pod scaling
- CI/CD Pipelines: GitHub Actions and Tekton configurations included
- Monitoring: Prometheus metrics and structured logging
- Cloud Storage: S3/COS integration for output delivery
Installation
Prerequisites
- Python: 3.11 or 3.12
- GPU: NVIDIA GPU with CUDA 12.4+ (optional but recommended)
- Package Manager: uv (recommended) or pip
- FFmpeg: With NVENC support for GPU-accelerated encoding
- Docker: (Optional) For containerized deployments
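Before installing, you can sanity-check the GPU and FFmpeg prerequisites. The snippet below is a minimal sketch (not part of the repository) that assumes torch is installed and ffmpeg is on PATH; it reports whether CUDA is visible and whether the h264_nvenc encoder is available.
# prereq_check.py -- hypothetical helper, not shipped with the repo
import shutil
import subprocess

def check_cuda() -> bool:
    # torch is a dependency of the pipeline; guard the import anyway
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

def check_nvenc(ffmpeg_bin: str = "ffmpeg") -> bool:
    # NVENC shows up as the h264_nvenc entry in `ffmpeg -encoders`
    if shutil.which(ffmpeg_bin) is None:
        return False
    out = subprocess.run([ffmpeg_bin, "-hide_banner", "-encoders"],
                         capture_output=True, text=True)
    return "h264_nvenc" in out.stdout

if __name__ == "__main__":
    print("CUDA available:", check_cuda())
    print("FFmpeg NVENC available:", check_nvenc())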
Quick Start
1. Install uv (Recommended Package Manager)
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or via pip
pip install uv
2. Clone the Repository
git clone https://github.com/ruslanmv/avatar-renderer-mcp.git
cd avatar-renderer-mcp
3. Install Dependencies
# Install production dependencies
make install
# OR install with development tools
make dev-install
4. Download Model Checkpoints
# Downloads ~3GB of model weights
make download-models
5. Run the Server
# Start FastAPI REST server on http://localhost:8080
make run
# OR start MCP STDIO server
make run-stdio
Manual Installation (Without Make)
# Create virtual environment with uv
uv venv .venv --python 3.11
# Activate virtual environment
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
# Install dependencies
uv pip install -e ".[dev]"
# Run the application
uvicorn app.api:app --host 0.0.0.0 --port 8080 --reload
Usage
REST API Example
# Submit a rendering job
curl -X POST http://localhost:8080/render \
-H 'Content-Type: application/json' \
-d '{
"avatarPath": "/path/to/avatar.png",
"audioPath": "/path/to/speech.wav"
}'
# Response
{
"jobId": "550e8400-e29b-41d4-a716-446655440000",
"statusUrl": "/status/550e8400-e29b-41d4-a716-446655440000",
"async": true
}
# Check job status or download result
curl http://localhost:8080/status/550e8400-e29b-41d4-a716-446655440000
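The same flow can be driven from Python. The sketch below uses the requests library against the /render and /status endpoints shown above; the fields of the status payload (for example a terminal "state") are an assumption, so adjust the polling check to whatever your deployment actually returns.
# submit_and_poll.py -- illustrative client; status-payload field names are assumed
import time
import requests

BASE = "http://localhost:8080"

def render(avatar_path: str, audio_path: str) -> str:
    resp = requests.post(f"{BASE}/render", json={
        "avatarPath": avatar_path,
        "audioPath": audio_path,
    })
    resp.raise_for_status()
    job = resp.json()
    print("Submitted job", job["jobId"])
    return job["statusUrl"]

def wait_for(status_url: str, poll_seconds: float = 5.0) -> dict:
    while True:
        status = requests.get(f"{BASE}{status_url}").json()
        # assumed: the payload exposes a terminal state once the job finishes
        if status.get("state") in ("done", "failed"):
            return status
        time.sleep(poll_seconds)

if __name__ == "__main__":
    url = render("/path/to/avatar.png", "/path/to/speech.wav")
    print(wait_for(url))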
MCP Integration
Register with your MCP Gateway:
curl -X POST http://gateway:4444/servers \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "avatar-renderer",
"transport": "stdio",
"command": "/usr/bin/python3",
"args": ["/app/mcp_server.py"],
"autoDiscover": true
}'
The gateway will automatically discover the render_avatar tool via mcp-tool.json.
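For local testing without a gateway, the STDIO server can be exercised directly. The rough sketch below spawns mcp_server.py and performs a standard MCP initialize/tools-call exchange over newline-delimited JSON-RPC; the tool's argument names are assumed to mirror the REST payload (avatarPath, audioPath), so treat mcp-tool.json as the authoritative schema.
# mcp_stdio_probe.py -- rough sketch; assumes newline-delimited JSON-RPC framing
# and REST-style argument names for the render_avatar tool
import json
import subprocess

proc = subprocess.Popen(
    ["python3", "app/mcp_server.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def send(message: dict) -> None:
    proc.stdin.write(json.dumps(message) + "\n")
    proc.stdin.flush()

def request(method: str, params: dict, msg_id: int) -> dict:
    send({"jsonrpc": "2.0", "id": msg_id, "method": method, "params": params})
    return json.loads(proc.stdout.readline())

# standard MCP handshake
print(request("initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "probe", "version": "0.1"},
}, 1))
send({"jsonrpc": "2.0", "method": "notifications/initialized"})

# invoke the auto-discovered tool (argument names assumed)
print(request("tools/call", {
    "name": "render_avatar",
    "arguments": {"avatarPath": "avatars/person.jpg",
                  "audioPath": "audio/speech.wav"},
}, 2))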
Python API Example
from app.pipeline import render_pipeline
# Generate talking head video
render_pipeline(
face_image="avatars/person.jpg",
audio="audio/speech.wav",
out_path="output/result.mp4",
reference_video=None # Optional: provide driving video
)
Architecture
graph TB
A[Client Request] --> B{FastAPI Gateway}
B --> C[Celery Worker Queue]
C --> D[GPU Node]
D --> E[FOMM Head Pose]
E --> F{GPU Memory Check}
F -->|Sufficient| G[Diff2Lip Diffusion]
F -->|Constrained| H[SadTalker + Wav2Lip]
G --> I[GFPGAN Enhancement]
H --> I
I --> J[FFmpeg NVENC Encoding]
J --> K[Cloud Storage Upload]
K --> L[Return Signed URL]
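The GPU-memory check that drives the Diff2Lip vs. SadTalker + Wav2Lip branch can be pictured roughly as follows. This is an illustrative sketch, not the project's actual pipeline.py logic; the function names and the 12 GB threshold are assumptions loosely based on the VRAM figures in the Performance table.
# fallback_sketch.py -- illustrative only; threshold and backend names are hypothetical
import torch

DIFF2LIP_MIN_FREE_GB = 12  # assumed threshold, tune for your GPU

def free_vram_gb(device: int = 0) -> float:
    free_bytes, _total = torch.cuda.mem_get_info(device)
    return free_bytes / 1024**3

def choose_lipsync_backend() -> str:
    if torch.cuda.is_available() and free_vram_gb() >= DIFF2LIP_MIN_FREE_GB:
        return "diff2lip"          # photorealistic diffusion path
    return "sadtalker+wav2lip"     # lighter fallback path

print("Selected backend:", choose_lipsync_backend())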
Component Overview
| Component | Purpose | Technology |
|---|---|---|
| FastAPI | REST API gateway | Python, uvicorn |
| MCP Server | STDIO protocol handler | Python, asyncio |
| Celery | Distributed task queue | Redis, RabbitMQ |
| FOMM | Head pose generation | PyTorch, CUDA |
| Diff2Lip | Diffusion-based lip-sync | Stable Diffusion |
| SadTalker | Fallback motion model | 3DMM, PyTorch |
| Wav2Lip | Fallback lip-sync GAN | PyTorch |
| GFPGAN | Face enhancement | GAN, PyTorch |
| FFmpeg | Video encoding | H.264 NVENC |
Configuration
All configuration is managed via environment variables or a .env file:
# General Settings
LOG_LEVEL=INFO
TMP_DIR=/tmp
# GPU Configuration
CUDA_VISIBLE_DEVICES=0
TORCH_USE_INT8=false
# Celery (Optional - leave empty for local mode)
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_BACKEND_URL=redis://localhost:6379/0
CELERY_CONCURRENCY=1
# Model Paths
MODEL_ROOT=/models
FOMM_CKPT_DIR=/models/fomm
DIFF2LIP_CKPT_DIR=/models/diff2lip
SADTALKER_CKPT_DIR=/models/sadtalker
WAV2LIP_CKPT=/models/wav2lip/wav2lip_gan.pth
GFPGAN_CKPT=/models/gfpgan/GFPGANv1.4.pth
# FFmpeg
FFMPEG_BIN=ffmpeg
# MCP Integration
MCP_ENABLE=true
MCP_TOOL_NAME=avatar_renderer
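In the project these variables are loaded by app/settings.py. Purely as an illustration of how they map into Python, here is a hedged sketch using os.environ with the defaults listed above; it is not the project's actual settings module.
# settings_sketch.py -- illustration of the environment variables above;
# the real implementation lives in app/settings.py
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    log_level: str = os.getenv("LOG_LEVEL", "INFO")
    tmp_dir: str = os.getenv("TMP_DIR", "/tmp")
    model_root: str = os.getenv("MODEL_ROOT", "/models")
    wav2lip_ckpt: str = os.getenv("WAV2LIP_CKPT", "/models/wav2lip/wav2lip_gan.pth")
    gfpgan_ckpt: str = os.getenv("GFPGAN_CKPT", "/models/gfpgan/GFPGANv1.4.pth")
    ffmpeg_bin: str = os.getenv("FFMPEG_BIN", "ffmpeg")
    celery_broker_url: str = os.getenv("CELERY_BROKER_URL", "")  # empty => local mode
    mcp_enable: bool = os.getenv("MCP_ENABLE", "true").lower() == "true"

settings = Settings()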
Development
Setup Development Environment
# Install all dependencies including dev tools
make dev-install
# Install pre-commit hooks
make pre-commit-install
Code Quality
# Run linters (ruff, mypy, black)
make lint
# Auto-format code
make format
# Run tests with coverage
make test
# Run all checks
make check
Testing
# Run all tests
make test
# Run specific test categories
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-gpu # GPU-dependent tests
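The unit/integration/GPU split typically maps onto pytest markers. As a sketch (the repository's actual marker names and conftest.py fixtures may differ), a GPU-dependent test can guard itself like this:
# test_gpu_example.py -- illustrative; marker names are assumed, check
# conftest.py / pyproject.toml for the project's real configuration
import pytest
import torch

requires_gpu = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="CUDA GPU not available"
)

@pytest.mark.gpu
@requires_gpu
def test_nvenc_capable_device_present():
    # GPU-only checks run under `make test-gpu` and are skipped elsewhere
    assert torch.cuda.device_count() >= 1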
Deployment
Docker
# Build Docker image
make docker-build
# Run with GPU support
make docker-run
# Or manually
docker build -t avatar-renderer:latest .
docker run --gpus all -p 8080:8080 \
-v $(pwd)/models:/models:ro \
avatar-renderer:latest
Kubernetes / Helm
# Deploy using Helm
helm upgrade --install avatar-renderer ./charts/avatar-renderer \
--namespace videogenie \
--create-namespace \
--set image.tag=$(git rev-parse --short HEAD) \
--set resources.limits.nvidia\\.com/gpu=1
# Or use raw manifests
kubectl apply -f k8s/
Auto-Scaling with KEDA
The deployment includes KEDA ScaledObject configuration that scales pods based on:
- Kafka message lag
- Redis queue depth
- Custom Prometheus metrics
Performance
| Metric | Value | Hardware |
|---|---|---|
| FPS (Encoding) | >200 fps | V100 GPU |
| Latency (512x512) | ~2.5s | V100, NVENC |
| VRAM Usage | 6-12 GB | Depends on pipeline |
| CPU Cores | 2-4 recommended | For preprocessing |
| Throughput | 100+ jobs/hour | Single V100 |
Project Structure
avatar-renderer-mcp/
├── app/                  # Application source code
│   ├── __init__.py       # Package initialization
│   ├── api.py            # FastAPI REST endpoints
│   ├── mcp_server.py     # MCP STDIO protocol server
│   ├── pipeline.py       # Core rendering pipeline
│   ├── settings.py       # Configuration management
│   ├── viseme_align.py   # Phoneme-to-viseme alignment
│   └── worker.py         # Celery task worker
├── tests/                # Test suite
│   ├── __init__.py
│   ├── conftest.py
│   ├── test_api.py
│   └── test_mcp_stdio.py
├── scripts/              # Utility scripts
│   ├── benchmark.py
│   └── download_models.sh
├── charts/               # Helm deployment charts
├── k8s/                  # Raw Kubernetes manifests
├── terraform/            # Infrastructure as Code
├── docs/                 # Documentation
├── .gitignore            # Git ignore rules
├── Dockerfile            # Container definition
├── LICENSE               # Apache 2.0 license
├── Makefile              # Build automation
├── pyproject.toml        # Project metadata & dependencies
└── README.md             # This file
Troubleshooting
Common Issues
| Problem | Solution |
|---|---|
| CUDA out of memory | Reduce Diff2Lip steps or use the Wav2Lip fallback |
| Green/black artifacts | Update NVIDIA drivers (≥ 545), check FFmpeg NVENC support |
| Lips drift from audio | Check phoneme alignment, resample audio to 16 kHz |
| Models not found | Run make download-models |
| uv not found | Install uv: curl -LsSf https://astral.sh/uv/install.sh \| sh |
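For the lip-drift case, resampling the audio to 16 kHz (and downmixing to mono, which most lip-sync models expect) before submitting often helps. A minimal sketch that shells out to FFmpeg; the paths are placeholders:
# resample_audio.py -- convenience sketch for the 16 kHz fix above
import subprocess

def resample_to_16k(src: str, dst: str, ffmpeg_bin: str = "ffmpeg") -> None:
    # -ar 16000 sets the sample rate, -ac 1 downmixes to mono, -y overwrites
    subprocess.run(
        [ffmpeg_bin, "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True,
    )

resample_to_16k("audio/speech_48k.wav", "audio/speech_16k.wav")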
Debug Mode
# Enable debug logging
export LOG_LEVEL=DEBUG
# Run with verbose output
make run
Roadmap
- WebRTC streaming support for real-time rendering
- Incremental synthesis for low-latency applications
- Multi-language phoneme support
- Advanced emotion control
- Cloud-native TTS integration
- Multi-GPU distributed rendering
Contributing
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Ensure all tests pass and code follows PEP 8 standards:
make check
License
This project is licensed under the Apache License 2.0; see the LICENSE file for details.
Copyright 2025 Ruslan Magana Vsevolodovna
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Author
Ruslan Magana Vsevolodovna
- Website: ruslanmv.com
- Email: contact@ruslanmv.com
- GitHub: @ruslanmv
Acknowledgments
This project builds upon the work of several research teams:
- FOMM (First Order Motion Model) - Aliaksandr Siarohin et al., NeurIPS 2020
- Diff2Lip - Diffusion-based lip synchronization research
- SadTalker - Zhang et al., CVPR 2023
- Wav2Lip - K R Prajwal et al., ACM MM 2020
- GFPGAN - Tencent ARC Lab, 2021
Special thanks to the open-source AI community for advancing the state of the art in generative models.
Support
For issues, questions, or feature requests, please:
- Check the Troubleshooting section
- Search existing issues
- Open a new issue with detailed information
For commercial support or consulting inquiries, contact: contact@ruslanmv.com
Transforming still images into lifelike talking avatars