Avatar Renderer MCP

Python 3.11+ · Code style: black · Linting: Ruff · Package manager: uv

Production-ready AI-powered talking head generation system with enterprise-grade MCP integration

A high-performance, scalable avatar rendering engine that transforms static images and audio into realistic talking head videos using state-of-the-art deep learning models.


About

Avatar Renderer MCP is a sophisticated video generation system designed for production environments. It combines multiple cutting-edge AI models (FOMM, Diff2Lip, Wav2Lip, SadTalker) to create photorealistic talking avatars from just two inputs:

  • šŸ–¼ļø A still image of a person's face
  • šŸŽ¤ An audio file containing speech

The AI analyzes both inputs and generates a video where the person appears to be speaking naturally, with synchronized lip movements, facial expressions, and head pose animations.

Key Differentiators

  • Production-Ready Architecture: Built for enterprise deployments with Kubernetes, Docker, and auto-scaling support
  • MCP Protocol Integration: Native Model Context Protocol (MCP) support for seamless AI agent communication
  • Intelligent Fallback System: Automatic GPU memory management with graceful degradation
  • Scalable Design: Celery-based distributed task processing with KEDA autoscaling
  • Cloud-Native: Helm charts, Kubernetes manifests, and Terraform configurations included

Features

Core Capabilities

  • āœ… Multi-Model Pipeline: FOMM for head pose + Diff2Lip for photorealistic lip-sync
  • āœ… Automatic Fallback: Switches to SadTalker + Wav2Lip when GPU memory is constrained
  • āœ… GPU-Accelerated Encoding: NVENC H.264 encoding achieving >200 FPS on V100 GPUs
  • āœ… MCP STDIO Server: Ready for AI agent integration with auto-discovery
  • āœ… RESTful API: FastAPI-based HTTP interface for traditional integrations
  • āœ… Face Enhancement: Built-in GFPGAN support for improved output quality
  • āœ… Phoneme Alignment: Montreal Forced Aligner integration for precise lip-sync

DevOps & Infrastructure

  • āœ… Containerized: Production-grade Dockerfile with CUDA 12.4 support
  • āœ… Kubernetes-Ready: Helm charts and raw manifests for K8s deployments
  • āœ… Auto-Scaling: KEDA integration for demand-based pod scaling
  • āœ… CI/CD Pipelines: GitHub Actions and Tekton configurations included
  • āœ… Monitoring: Prometheus metrics and structured logging
  • āœ… Cloud Storage: S3/COS integration for output delivery

Installation

Prerequisites

  • Python: 3.11 or 3.12
  • GPU: NVIDIA GPU with CUDA 12.4+ (optional but recommended)
  • Package Manager: uv (recommended) or pip
  • FFmpeg: With NVENC support for GPU-accelerated encoding
  • Docker: (Optional) For containerized deployments

Quick Start

1. Install uv (Recommended Package Manager)

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via pip
pip install uv

2. Clone the Repository

git clone https://github.com/ruslanmv/avatar-renderer-mcp.git
cd avatar-renderer-mcp

3. Install Dependencies

# Install production dependencies
make install

# OR install with development tools
make dev-install

4. Download Model Checkpoints

# Downloads ~3GB of model weights
make download-models

5. Run the Server

# Start FastAPI REST server on http://localhost:8080
make run

# OR start MCP STDIO server
make run-stdio

Manual Installation (Without Make)

# Create virtual environment with uv
uv venv .venv --python 3.11

# Activate virtual environment
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

# Install dependencies
uv pip install -e ".[dev]"

# Run the application
uvicorn app.api:app --host 0.0.0.0 --port 8080 --reload
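
Once the dependencies are installed, a quick environment check can catch missing prerequisites before the first render. This is a minimal sketch, assuming PyTorch was pulled in by the install step above:

# Minimal prerequisite check (a sketch; assumes PyTorch is installed)
import shutil
import subprocess
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0), "| CUDA runtime:", torch.version.cuda)

ffmpeg = shutil.which("ffmpeg")
print("ffmpeg:", ffmpeg or "NOT FOUND")
if ffmpeg:
    # Look for the NVENC H.264 encoder used for GPU-accelerated encoding
    encoders = subprocess.run([ffmpeg, "-hide_banner", "-encoders"],
                              capture_output=True, text=True).stdout
    print("h264_nvenc:", "available" if "h264_nvenc" in encoders else "not available")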

Usage

REST API Example

# Submit a rendering job
curl -X POST http://localhost:8080/render \
  -H 'Content-Type: application/json' \
  -d '{
    "avatarPath": "/path/to/avatar.png",
    "audioPath": "/path/to/speech.wav"
  }'

# Response
{
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "statusUrl": "/status/550e8400-e29b-41d4-a716-446655440000",
  "async": true
}

# Check job status or download result
curl http://localhost:8080/status/550e8400-e29b-41d4-a716-446655440000
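
The same flow from Python: submit a job, then poll the statusUrl returned in the response. This is a sketch; the exact shape of the status payload is not documented here, so the completion check is an assumption.

# Submit a render job and poll for completion (status fields are assumed)
import time
import requests

BASE = "http://localhost:8080"

resp = requests.post(f"{BASE}/render", json={
    "avatarPath": "/path/to/avatar.png",
    "audioPath": "/path/to/speech.wav",
})
resp.raise_for_status()
job = resp.json()
print("Job submitted:", job["jobId"])

while True:
    status = requests.get(f"{BASE}{job['statusUrl']}").json()
    print("Status:", status)
    if status.get("status") in ("completed", "failed"):  # assumed field/values
        break
    time.sleep(5)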

MCP Integration

Register with your MCP Gateway:

curl -X POST http://gateway:4444/servers \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "avatar-renderer",
    "transport": "stdio",
    "command": "/usr/bin/python3",
    "args": ["/app/mcp_server.py"],
    "autoDiscover": true
  }'

The gateway will automatically discover the render_avatar tool via mcp-tool.json.
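
To exercise the STDIO server directly, without a gateway, you can speak JSON-RPC over the process's stdin/stdout. The sketch below assumes standard MCP framing (newline-delimited JSON-RPC) and that the tool accepts the same avatarPath/audioPath arguments as the REST endpoint; check the schema advertised by tools/list for the real argument names.

# Illustrative MCP STDIO client (assumptions: newline-delimited JSON-RPC framing,
# REST-style avatarPath/audioPath arguments). Not the project's official client.
import json
import subprocess

proc = subprocess.Popen(["python3", "app/mcp_server.py"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

def send(msg):
    proc.stdin.write(json.dumps(msg) + "\n")
    proc.stdin.flush()

def recv():
    return json.loads(proc.stdout.readline())

send({"jsonrpc": "2.0", "id": 1, "method": "initialize",
      "params": {"protocolVersion": "2024-11-05", "capabilities": {},
                 "clientInfo": {"name": "demo-client", "version": "0.0.1"}}})
print(recv())
send({"jsonrpc": "2.0", "method": "notifications/initialized"})

send({"jsonrpc": "2.0", "id": 2, "method": "tools/call",
      "params": {"name": "render_avatar",
                 "arguments": {"avatarPath": "/path/to/avatar.png",
                               "audioPath": "/path/to/speech.wav"}}})
print(recv())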

Python API Example

from app.pipeline import render_pipeline

# Generate talking head video
render_pipeline(
    face_image="avatars/person.jpg",
    audio="audio/speech.wav",
    out_path="output/result.mp4",
    reference_video=None  # Optional: provide driving video
)

Architecture

graph TB
    A[Client Request] --> B{FastAPI Gateway}
    B --> C[Celery Worker Queue]
    C --> D[GPU Node]
    D --> E[FOMM Head Pose]
    E --> F{GPU Memory Check}
    F -->|Sufficient| G[Diff2Lip Diffusion]
    F -->|Constrained| H[SadTalker + Wav2Lip]
    G --> I[GFPGAN Enhancement]
    H --> I
    I --> J[FFmpeg NVENC Encoding]
    J --> K[Cloud Storage Upload]
    K --> L[Return Signed URL]
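
The GPU Memory Check branch in the diagram is the heart of the fallback system: when enough VRAM is free, the photorealistic FOMM + Diff2Lip path runs; otherwise the lighter SadTalker + Wav2Lip path takes over. A simplified sketch of that decision (the threshold is illustrative, not the value used by the actual pipeline):

# Simplified sketch of the pipeline-selection decision (threshold is illustrative)
import torch

def choose_pipeline(min_free_gb: float = 10.0) -> str:
    if torch.cuda.is_available():
        free_bytes, _total_bytes = torch.cuda.mem_get_info()
        if free_bytes / 1e9 >= min_free_gb:
            return "fomm+diff2lip"      # primary, photorealistic path
    return "sadtalker+wav2lip"          # memory-constrained fallback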

Component Overview

Component  | Purpose                  | Technology
FastAPI    | REST API gateway         | Python, uvicorn
MCP Server | STDIO protocol handler   | Python, asyncio
Celery     | Distributed task queue   | Redis, RabbitMQ
FOMM       | Head pose generation     | PyTorch, CUDA
Diff2Lip   | Diffusion-based lip-sync | Stable Diffusion
SadTalker  | Fallback motion model    | 3DMM, PyTorch
Wav2Lip    | Fallback lip-sync GAN    | PyTorch
GFPGAN     | Face enhancement         | GAN, PyTorch
FFmpeg     | Video encoding           | H.264 NVENC
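
For illustration, a Celery task wrapping the pipeline could look like the sketch below (names and task signature are hypothetical; see app/worker.py for the actual worker). The broker/backend URLs mirror the CELERY_* variables from the Configuration section.

# Hypothetical Celery task wrapping the render pipeline (see app/worker.py for the real one)
import os
from celery import Celery
from app.pipeline import render_pipeline

celery_app = Celery(
    "avatar_renderer",
    broker=os.getenv("CELERY_BROKER_URL", "redis://localhost:6379/0"),
    backend=os.getenv("CELERY_BACKEND_URL", "redis://localhost:6379/0"),
)

@celery_app.task(name="render_avatar")
def render_avatar_task(face_image: str, audio: str, out_path: str) -> str:
    # Delegate to the core pipeline and return the output path as the task result
    render_pipeline(face_image=face_image, audio=audio, out_path=out_path)
    return out_path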

Configuration

All configuration is managed via environment variables or a .env file:

# General Settings
LOG_LEVEL=INFO
TMP_DIR=/tmp

# GPU Configuration
CUDA_VISIBLE_DEVICES=0
TORCH_USE_INT8=false

# Celery (Optional - leave empty for local mode)
CELERY_BROKER_URL=redis://localhost:6379/0
CELERY_BACKEND_URL=redis://localhost:6379/0
CELERY_CONCURRENCY=1

# Model Paths
MODEL_ROOT=/models
FOMM_CKPT_DIR=/models/fomm
DIFF2LIP_CKPT_DIR=/models/diff2lip
SADTALKER_CKPT_DIR=/models/sadtalker
WAV2LIP_CKPT=/models/wav2lip/wav2lip_gan.pth
GFPGAN_CKPT=/models/gfpgan/GFPGANv1.4.pth

# FFmpeg
FFMPEG_BIN=ffmpeg

# MCP Integration
MCP_ENABLE=true
MCP_TOOL_NAME=avatar_renderer
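
As an illustration of how these variables map onto typed settings, here is a minimal sketch using pydantic-settings (hypothetical; the real app/settings.py may be structured differently):

# Hypothetical settings loader for the variables above (the real app/settings.py may differ)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    log_level: str = "INFO"
    tmp_dir: str = "/tmp"
    celery_broker_url: str = ""          # empty => run the pipeline in local mode
    model_root: str = "/models"
    wav2lip_ckpt: str = "/models/wav2lip/wav2lip_gan.pth"
    gfpgan_ckpt: str = "/models/gfpgan/GFPGANv1.4.pth"
    ffmpeg_bin: str = "ffmpeg"
    mcp_enable: bool = True
    mcp_tool_name: str = "avatar_renderer"

settings = Settings()  # environment variables and .env override the defaults above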

Development

Setup Development Environment

# Install all dependencies including dev tools
make dev-install

# Install pre-commit hooks
make pre-commit-install

Code Quality

# Run linters (ruff, mypy, black)
make lint

# Auto-format code
make format

# Run tests with coverage
make test

# Run all checks
make check

Testing

# Run all tests
make test

# Run specific test categories
make test-unit          # Unit tests only
make test-integration   # Integration tests only
make test-gpu           # GPU-dependent tests

Deployment

Docker

# Build Docker image
make docker-build

# Run with GPU support
make docker-run

# Or manually
docker build -t avatar-renderer:latest .
docker run --gpus all -p 8080:8080 \
  -v $(pwd)/models:/models:ro \
  avatar-renderer:latest

Kubernetes / Helm

# Deploy using Helm
helm upgrade --install avatar-renderer ./charts/avatar-renderer \
  --namespace videogenie \
  --create-namespace \
  --set image.tag=$(git rev-parse --short HEAD) \
  --set resources.limits.nvidia\\.com/gpu=1

# Or use raw manifests
kubectl apply -f k8s/

Auto-Scaling with KEDA

The deployment includes a KEDA ScaledObject configuration that scales pods based on:

  • Kafka message lag
  • Redis queue depth
  • Custom Prometheus metrics

Performance

Metric            | Value           | Hardware
FPS (Encoding)    | >200 fps        | V100 GPU
Latency (512x512) | ~2.5 s          | V100, NVENC
VRAM Usage        | 6-12 GB         | Depends on pipeline
CPU Cores         | 2-4 recommended | For preprocessing
Throughput        | 100+ jobs/hour  | Single V100

Project Structure

avatar-renderer-mcp/
ā”œā”€ā”€ app/                        # Application source code
│   ā”œā”€ā”€ __init__.py            # Package initialization
│   ā”œā”€ā”€ api.py                 # FastAPI REST endpoints
│   ā”œā”€ā”€ mcp_server.py          # MCP STDIO protocol server
│   ā”œā”€ā”€ pipeline.py            # Core rendering pipeline
│   ā”œā”€ā”€ settings.py            # Configuration management
│   ā”œā”€ā”€ viseme_align.py        # Phoneme-to-viseme alignment
│   └── worker.py              # Celery task worker
ā”œā”€ā”€ tests/                      # Test suite
│   ā”œā”€ā”€ __init__.py
│   ā”œā”€ā”€ conftest.py
│   ā”œā”€ā”€ test_api.py
│   └── test_mcp_stdio.py
ā”œā”€ā”€ scripts/                    # Utility scripts
│   ā”œā”€ā”€ benchmark.py
│   └── download_models.sh
ā”œā”€ā”€ charts/                     # Helm deployment charts
ā”œā”€ā”€ k8s/                        # Raw Kubernetes manifests
ā”œā”€ā”€ terraform/                  # Infrastructure as Code
ā”œā”€ā”€ docs/                       # Documentation
ā”œā”€ā”€ .gitignore                  # Git ignore rules
ā”œā”€ā”€ Dockerfile                  # Container definition
ā”œā”€ā”€ LICENSE                     # Apache 2.0 license
ā”œā”€ā”€ Makefile                    # Build automation
ā”œā”€ā”€ pyproject.toml             # Project metadata & dependencies
└── README.md                   # This file

Troubleshooting

Common Issues

Problem               | Solution
CUDA out of memory    | Reduce Diff2Lip steps or use the Wav2Lip fallback
Green/black artifacts | Update NVIDIA drivers (≄545), check FFmpeg NVENC support
Lips drift from audio | Check phoneme alignment, resample audio to 16 kHz
Models not found      | Run make download-models
uv not found          | Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh

Debug Mode

# Enable debug logging
export LOG_LEVEL=DEBUG

# Run with verbose output
make run

Roadmap

  • WebRTC streaming support for real-time rendering
  • Incremental synthesis for low-latency applications
  • Multi-language phoneme support
  • Advanced emotion control
  • Cloud-native TTS integration
  • Multi-GPU distributed rendering

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Ensure all tests pass and code follows PEP 8 standards:

make check

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright 2025 Ruslan Magana Vsevolodovna

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Author

Ruslan Magana Vsevolodovna


Acknowledgments

This project builds upon the work of several research teams, including the authors of FOMM, SadTalker, Wav2Lip, Diff2Lip, GFPGAN, and the Montreal Forced Aligner.

Special thanks to the open-source AI community for advancing the state of the art in generative models.


Support

For issues, questions, or feature requests, please:

  1. Check the Troubleshooting section
  2. Search existing issues
  3. Open a new issue with detailed information

For commercial support or consulting inquiries, contact: contact@ruslanmv.com


Made with ā¤ļø by Ruslan Magana Vsevolodovna
Transforming still images into lifelike talking avatars