Avatar Renderer Pod – VideoGenie‑class avatar generator (MCP‑ready)

A single‑image → talking‑head engine that delivers high‑quality output while natively speaking the MCP protocol.
Drop the container into your GPU pool and an MCP Gateway auto‑discovers the render_avatar tool the moment the process starts.


🚀 Features

  • FOMM (head pose) + Diff2Lip (diffusion visemes) with automatic fallback to SadTalker (+ Wav2Lip) when VRAM is tight.
  • MCP STDIO server and FastAPI REST façade live in the same container.
  • CUDA 12.4 + PyTorch 2.3; NVENC H.264 encodes > 200 fps on a V100.
  • Pluggable pipeline (pipeline.py) – swap in AnimateDiff, DreamTalk, LIAON‑LipSync, etc. (see the interface sketch after this list).
  • Helm chart & raw manifests request nvidia.com/gpu: 1 and tolerate the dedicated=gpu taint.
  • KEDA‑ready: ScaledObject samples Kafka lag and spins 0 → N render pods as demand changes.
  • Full CI (CPU‑only) plus Colab notebooks for checkpoint tuning.
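
pipeline.py is the swap point for alternative generators. Below is a hypothetical sketch of what a pluggable backend interface could look like – the names are illustrative only, not the actual code in the repo:

# Hypothetical illustration only – pipeline.py's real API may differ.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class RenderJob:
    image_path: str   # single source portrait
    audio_path: str   # driving speech audio
    out_path: str     # target MP4

class RenderBackend(Protocol):
    """Anything that can turn (image, audio) into a video can be plugged in."""
    def render(self, job: RenderJob) -> str:
        """Run the generator and return the path of the rendered video."""
        ...

def pick_backend(free_vram_gb: float,
                 primary: RenderBackend,    # e.g. FOMM + Diff2Lip
                 fallback: RenderBackend    # e.g. SadTalker + Wav2Lip
                 ) -> RenderBackend:
    # Mirrors the documented behaviour: diffusion visemes when VRAM allows,
    # SadTalker + Wav2Lip when VRAM is tight.
    return primary if free_vram_gb >= 12 else fallback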

📦 Project tree

avatar-renderer-pod/
├── README.md                    # You are here
├── LICENSE                      # Apache‑2.0
├── app/                         # Runtime code
│   ├── api.py                   # FastAPI → Celery task
│   ├── mcp_server.py            # STDIO MCP server (render_avatar)
│   ├── worker.py                # Celery worker bootstrap
│   ├── pipeline.py              # FOMM+Diff2Lip → FFmpeg glue
│   ├── viseme_align.py          # Montreal Forced Aligner helper
│   └── settings.py              # Pydantic env config
├── models/                      # 🔹 **not tracked** – mount at runtime
│   ├── fomm/                    # `vox-cpk.pth` – FOMM (Aliaksandr Siarohin, 2020)
│   ├── diff2lip/                # `Diff2Lip.pth` – diffusion visemes (Yuan Gary, 2024)
│   ├── sadtalker/               # `sadtalker.pth` – SadTalker motion (Zhang et al., CVPR 2023)
│   ├── wav2lip/                 # `wav2lip_gan.pth` – lip GAN (K Rudrabha, 2020)
│   └── gfpgan/                  # `GFPGANv1.3.pth` – face enhancer (Tencent ARC, 2021)
├── Dockerfile                   # CUDA 12.4 runtime image
├── requirements.txt             # torch, diffusers, fastapi, celery …
├── charts/                      # Helm deployment & service
│   └── avatar-renderer/
│       └── values.yaml
├── k8s/                         # Raw YAML (if Helm not used)
│   ├── deployment.yaml
│   ├── service.yaml
│   └── autoscale.yaml
├── mcp-tool.json                # Manifest auto‑discovered by Gateway
├── ci/
│   ├── github-actions.yml       # lint → build → test → push
│   └── tekton-build.yaml
├── notebooks/                   # Colab/Jupyter demos & fine‑tune
│   ├── 01_fomm_diff2lip_demo.ipynb
│   └── 02_finetune_diff2lip.ipynb
├── scripts/                     # Utility helpers
│   ├── download_models.sh       # Fetch all checkpoints (≈ 3 GB)
│   ├── benchmark.py             # FPS & latency profiler
│   └── healthcheck.sh           # Curl‑based liveness probe
├── tests/                       # pytest (CPU) smoke ≤ 5 s
│   ├── conftest.py
│   ├── test_render_api.py
│   ├── test_mcp_stdio.py
│   └── assets/ {alice.png, hello.wav}
└── docs/
    ├── 00_overview.md
    └── 02_mcp_integration.md
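
As a rough illustration of the api.py → worker.py handoff listed above, a FastAPI endpoint that enqueues a Celery task might look like this (broker URL, task name and field names are assumptions, not the actual code):

from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
celery = Celery("avatar", broker="redis://redis:6379/0")  # broker URL is illustrative

class RenderRequest(BaseModel):
    avatarId: str
    voiceUrl: str

@app.post("/render")
def render(req: RenderRequest):
    # Hand the GPU-heavy work to a Celery worker so the API stays responsive.
    task = celery.send_task("render_avatar", kwargs=req.model_dump())
    return {"jobId": task.id}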

🖼 Render workflow (detailed)

sequenceDiagram
    autonumber

    participant API as FastAPI API
    participant Celery as Celery Worker
    participant FOMM as FOMM Motion Model
    participant Diff2Lip as Diff2Lip Viseme Diffusion
    participant SadTalker as SadTalker Motion Fallback
    participant W2L as Wav2Lip Lip GAN Fallback
    participant FFmpeg as FFmpeg Encoder

    API->>API: POST /render request
    API->>Celery: Enqueue render job
    
    alt Reference video provided
        Celery->>FOMM: Process full-motion driving video
    else Still image provided
        Celery->>FOMM: Infer pose from still image and audio
    end
    
    alt Sufficient GPU RAM >= 12 GB
        Celery->>Diff2Lip: Diffuse mouth frames
        Diff2Lip-->>Celery: Return RGBA frames
    else Lower GPU RAM
        Celery->>SadTalker: Generate 3D motion coefficients
        SadTalker-->>Celery: Return coarse frames
        Celery->>W2L: Refine lips with GAN
        W2L-->>Celery: Return final frames
    end
    
    Celery->>FFmpeg: Encode video to H.264
    FFmpeg-->>Celery: Return final MP4 video
    Celery-->>API: Return signed URL for video

  • FOMM or SadTalker provides realistic head pose, eye‑blink and basic expression.
  • Diff2Lip (Stable Diffusion in‑painting) improves lip realism; falls back to Wav2Lip when VRAM is scarce.
  • Final MP4 is encoded with -profile:v high -preset p7 -b:v 6M for presentation‑ready quality.
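
The encode step boils down to an NVENC invocation with exactly those flags. A minimal sketch of that glue (frame pattern, frame rate and audio path are illustrative; the real logic lives in app/pipeline.py):

import subprocess

def encode_h264(frames_glob: str, audio_path: str, out_path: str) -> None:
    """Encode rendered frames + audio to presentation-ready H.264 via NVENC."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-framerate", "25",                          # frame rate is illustrative
            "-pattern_type", "glob", "-i", frames_glob,  # e.g. "frames/*.png"
            "-i", audio_path,
            "-c:v", "h264_nvenc",
            "-profile:v", "high", "-preset", "p7", "-b:v", "6M",
            "-c:a", "aac", "-shortest",
            out_path,
        ],
        check=True,
    )

encode_h264("frames/*.png", "hello.wav", "avatar.mp4")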

🛠 Local quick‑start (GPU workstation)

# 1. Clone & pull weights (≈ 3 GB, once)
bash scripts/download_models.sh

# 2. Build & run
docker build -t avatar-renderer:dev .
docker run --gpus all -p 8080:8080 \
  -v $(pwd)/models:/models avatar-renderer:dev &

# 3. Render
curl -X POST localhost:8080/render \
  -H 'Content-Type: application/json' \
  -d '{"avatarId":"alice","voiceUrl":"https://example.com/hello.wav"}'

🔗 MCP integration

curl -X POST http://gateway:4444/servers \
 -H "Authorization: Bearer $ADMIN_TOKEN" \
 -H "Content-Type: application/json" \
 -d '{
   "name": "avatar-renderer",
   "transport": "stdio",
   "command": "/usr/bin/python3",
   "args": ["/app/mcp_server.py"],
   "autoDiscover": true
 }'

Gateway detects mcp-tool.json and registers the render_avatar tool automatically.
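
To smoke-test the stdio server outside the Gateway, you can speak JSON-RPC to it directly. The sketch below assumes standard newline-delimited MCP framing and that render_avatar takes the same avatarId/voiceUrl fields as the REST endpoint; check app/mcp_server.py for the actual schema:

import json
import subprocess

proc = subprocess.Popen(
    ["python3", "/app/mcp_server.py"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def rpc(method, params=None, id_=None):
    """Send one JSON-RPC message; if it carries an id, read one reply line."""
    msg = {"jsonrpc": "2.0", "method": method}
    if params is not None:
        msg["params"] = params
    if id_ is not None:
        msg["id"] = id_
    proc.stdin.write(json.dumps(msg) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline()) if id_ is not None else None

# MCP handshake
print(rpc("initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "smoke-test", "version": "0.1"},
}, id_=1))
rpc("notifications/initialized")

# Call the tool (argument names are assumptions mirroring the REST payload)
print(rpc("tools/call", {
    "name": "render_avatar",
    "arguments": {"avatarId": "alice", "voiceUrl": "https://example.com/hello.wav"},
}, id_=2))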


🚦 Feature compliance matrix

| Feature | Status | Notes |
|---|---|---|
| Realistic facial animation from still image | ✅ | FOMM / SadTalker for full head + expressions |
| High‑fidelity lip‑sync | ✅ | Diff2Lip diffusion visemes or Wav2Lip GAN fallback |
| MP4 export (presentation‑ready) | ✅ | H.264 NVENC, signed COS/S3 URL |
| Live WebRTC streaming | ⚠️ Soon | GStreamer NVENC RTP branch in feature/webrtc |
| AI‑agent integration (MCP) | ✅ | STDIO protocol ready, REST remains for demos |
| Low‑latency incremental synthesis | ⚠️ R&D | Needs chunked TTS + sliding‑window Diff2Lip |

⚙ Helm deployment (OpenShift)

helm upgrade --install avatar-renderer charts/avatar-renderer \
  --namespace videogenie --create-namespace \
  --set image.tag=$(git rev-parse --short HEAD)

The chart requests 1 GPU, 6 GiB RAM and 2 vCPUs. The existing VideoGenie KEDA ScaledObject will autoscale pods based on Kafka lag.

Makefile Guide

# first‑time developer workflow
make setup
make download-models
make run        # REST server → http://localhost:8080/render

# MCP stdio test
make run-stdio  # then echo '{"tool":"render_avatar", ...}' | ./app/mcp_server.py

# build + run container
make docker-build
make docker-run

🐛 Troubleshooting

| Symptom | Fix / Hint |
|---|---|
| CUDA error: out of memory | Reduce --diff2lip_steps, enable Wav2Lip fallback, or upgrade GPU |
| Stuck at “Align visemes …” | Ensure the Montreal Forced Aligner English model is in models/mfa/ |
| Green / black artefacts in output video | Driver ≥ 545, verify FFmpeg built with --enable-nvenc |
| Lips drift from audio by > 100 ms | Check viseme_align.py phoneme timing; re‑sample audio to 16 kHz |
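
For the last symptom, resampling the source audio before rendering usually fixes the drift. A minimal sketch using FFmpeg (mono output is an assumption that matches typical forced-aligner input):

import subprocess

def resample_to_16k(src: str, dst: str = "voice_16k.wav") -> str:
    # 16 kHz PCM is what the viseme alignment step expects; mono is an assumption.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst],
        check=True,
    )
    return dst

resample_to_16k("hello.wav")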

📜 License

Apache 2.0 — use it, fork it, break it, fix it, PR back 🙌