MCP as a Judge ⚖️
mcp-name: io.github.OtherVibes/mcp-as-a-judge
MCP as a Judge acts as a validation layer between AI coding assistants and LLMs, helping ensure safer and higher-quality code.
MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations for:
- Research, system design, and planning
- Code changes, testing, and task-completion verification
It enforces evidence-based research, reuse over reinvention, and human-in-the-loop decisions.
If your IDE has rules/agents (Copilot, Cursor, Claude Code), keep using them—this Judge adds enforceable approval gates on plan, code diffs, and tests.
Key problems with AI coding assistants and LLMs
- Treat LLM output as ground truth; skip research and use outdated information
- Reinvent the wheel instead of reusing libraries and existing code
- Cut corners: code below engineering standards and weak tests
- Make unilateral decisions when requirements are ambiguous or plans change
- Security blind spots: missing input validation, injection risks/attack vectors, least‑privilege violations, and weak defensive programming
Vibe coding doesn’t have to be frustrating
What it enforces
- Evidence‑based research and reuse (best practices, libraries, existing code)
- Plan‑first delivery aligned to user requirements
- Human‑in‑the‑loop decisions for ambiguity and blockers
- Quality gates on code and tests (security, performance, maintainability)
Key capabilities
- Intelligent code evaluation via MCP sampling; enforces software‑engineering standards and flags security/performance/maintainability risks
- Comprehensive plan/design review: validates architecture, research depth, requirements fit, and implementation approach
- User‑driven decisions via MCP elicitation: clarifies requirements, resolves obstacles, and keeps choices transparent
- Security validation in system design and code changes
Tools and how they help
| Tool | What it solves |
|---|---|
| `set_coding_task` | Creates/updates task metadata; classifies `task_size`; returns next-step workflow guidance |
| `get_current_coding_task` | Recovers the latest `task_id` and metadata to resume work safely |
| `judge_coding_plan` | Validates plan/design; requires library selection and internal reuse maps; flags risks |
| `judge_code_change` | Reviews unified Git diffs for correctness, reuse, security, and code quality |
| `judge_testing_implementation` | Validates tests using real runner output and optional coverage |
| `judge_coding_task_completion` | Final gate ensuring plan, code, and test approvals before completion |
| `raise_missing_requirements` | Elicits missing details and decisions to unblock progress |
| `raise_obstacle` | Engages the user on trade-offs, constraints, and enforced changes |
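For orientation, here is a minimal sketch of how an MCP client could drive these gates, using the official `mcp` Python SDK. The tool names come from the table above; the argument names are hypothetical placeholders, so consult each tool's declared input schema for the real fields:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the server the same way the configs below do (uv tool run ...).
    params = StdioServerParameters(command="uv", args=["tool", "run", "mcp-as-a-judge"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Gate 1: register the task. The argument name is a placeholder;
            # check the tool's input schema for the real fields.
            result = await session.call_tool(
                "set_coding_task",
                {"description": "Add retry logic to the HTTP client"},
            )
            print(result)

            # Gates 2-4 follow the same pattern: judge_coding_plan with the plan,
            # judge_code_change with a unified Git diff, judge_testing_implementation
            # with raw test output, then judge_coding_task_completion to close out.


asyncio.run(main())
```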
🚀 Quick Start
Requirements & Recommendations
MCP Client Prerequisites
MCP as a Judge depends heavily on the MCP Sampling and MCP Elicitation features for its core functionality (a short sketch follows the list below):
- MCP Sampling - Required for AI-powered code evaluation and judgment
- MCP Elicitation - Required for interactive user decision prompts
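For context, this is roughly what those two features look like from a server's point of view: a minimal, illustrative sketch using the official `mcp` Python SDK's FastMCP API, not this project's actual source:

```python
from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent

mcp = FastMCP("judge-demo")


@mcp.tool()
async def judge_snippet(code: str, ctx: Context) -> str:
    """Ask the *client's* own model to review a snippet via MCP sampling."""
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=f"Review this code:\n{code}"),
            )
        ],
        max_tokens=500,
    )
    # Interactive user decisions would use elicitation instead, e.g. ctx.elicit(...).
    return result.content.text if result.content.type == "text" else str(result.content)
```

Because sampling runs on the client's model, no separate LLM API key is needed when the client supports it.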
System Prerequisites
- Docker Desktop or Python 3.13+ (depending on the install method) - required for running the MCP server
Supported AI Assistants
| AI Assistant | Platform | MCP Support | Status | Notes |
|---|---|---|---|---|
| GitHub Copilot | Visual Studio Code | ✅ Full | Recommended | Complete MCP integration with sampling and elicitation |
| Claude Code | - | ⚠️ Partial | Requires LLM API key | Sampling and elicitation support are open feature requests |
| Cursor | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Augment | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
| Qodo | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited |
✅ Recommended setup: GitHub Copilot + VS Code — full MCP sampling; no API key needed.
⚠️ Critical: For assistants without full MCP sampling (Cursor, Claude Code, Augment, Qodo), you MUST set `LLM_API_KEY`. Without it, the server cannot evaluate plans or code. See LLM API Configuration.
💡 Tip: Prefer large context models (≥ 1M tokens) for better analysis and judgments.
If the MCP server isn't used automatically, see the FAQ section below for troubleshooting.
🔧 MCP Configuration
Configure MCP as a Judge in your MCP-enabled client:
Method 1: Using Docker (Recommended)
One‑click install for VS Code (MCP)
Notes:
- VS Code controls the sampling model; select it via “MCP: List Servers → mcp-as-a-judge → Configure Model Access”.
Configure MCP Settings by adding this to your MCP client configuration file:

```json
{
  "command": "docker",
  "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-4o-mini"
  }
}
```
📝 Configuration Options (All Optional):
- LLM_API_KEY: Optional for GitHub Copilot + VS Code (has built-in MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
- The `--pull=always` flag ensures you always get the latest version automatically
Then manually update when needed:
```bash
# Pull the latest version
docker pull ghcr.io/othervibes/mcp-as-a-judge:latest
```
Method 2: Using uv
1. Install the package:

   ```bash
   uv tool install mcp-as-a-judge
   ```

2. Configure MCP Settings:

   The MCP server may be automatically detected by your MCP-enabled client.

   📝 Notes:
   - No additional configuration needed for GitHub Copilot + VS Code (has built-in MCP sampling)
   - LLM_API_KEY is optional and can be set via environment variable if needed

3. To update to the latest version:

   ```bash
   # Update MCP as a Judge to the latest version
   uv tool upgrade mcp-as-a-judge
   ```
Select a sampling model in VS Code
- Open Command Palette (Cmd/Ctrl+Shift+P) → “MCP: List Servers”
- Select the configured server “mcp-as-a-judge”
- Choose “Configure Model Access”
- Check your preferred model(s) to enable sampling
🔑 LLM API Configuration (Optional)
For AI assistants without full MCP sampling support you can configure an LLM API key as a fallback. This ensures MCP as a Judge works even when the client doesn't support MCP sampling.
- Set `LLM_API_KEY` (unified key). The vendor is auto-detected from the key format (see the sketch after the table below); optionally set `LLM_MODEL_NAME` to override the default model.
Supported LLM Providers
| Rank | Provider | API Key Format | Default Model | Notes |
|---|---|---|---|---|
| 1 | OpenAI | sk-... | gpt-4.1 | Fast and reliable model optimized for speed |
| 2 | Anthropic | sk-ant-... | claude-sonnet-4-20250514 | High-performance with exceptional reasoning |
| 3 | Google | AIza... | gemini-2.5-pro | Most advanced model with built-in thinking |
| 4 | Azure OpenAI | [a-f0-9]{32} | gpt-4.1 | Same as OpenAI but via Azure |
| 5 | AWS Bedrock | AWS credentials | anthropic.claude-sonnet-4-20250514-v1:0 | Aligned with Anthropic |
| 6 | Vertex AI | Service Account JSON | gemini-2.5-pro | Enterprise Gemini via Google Cloud |
| 7 | Groq | gsk_... | deepseek-r1 | Best reasoning model with speed advantage |
| 8 | OpenRouter | sk-or-... | deepseek/deepseek-r1 | Best reasoning model available |
| 9 | xAI | xai-... | grok-code-fast-1 | Latest coding-focused model (Aug 2025) |
| 10 | Mistral | [a-f0-9]{64} | pixtral-large | Most advanced model (124B params) |
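The "API Key Format" column above is what makes vendor auto-detection possible. Here is a simplified, illustrative sketch of how such prefix-based detection could work; the real logic in mcp-as-a-judge may differ:

```python
# Illustrative only: prefix-based vendor detection, derived from the
# "API Key Format" column above. Not the project's actual implementation.
API_KEY_PREFIXES = {
    "sk-ant-": "anthropic",  # checked before the generic "sk-" prefix
    "sk-or-": "openrouter",
    "sk-": "openai",
    "AIza": "google",
    "gsk_": "groq",
    "xai-": "xai",
}


def detect_vendor(api_key: str) -> str | None:
    """Return the provider name for a key, or None if the prefix is unknown."""
    for prefix, vendor in API_KEY_PREFIXES.items():  # insertion order preserved
        if api_key.startswith(prefix):
            return vendor
    return None  # e.g. Azure/Mistral hex keys would need a different check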
Client-Specific Setup
Cursor
1. Open Cursor Settings:

   - Go to File → Preferences → Cursor Settings
   - Navigate to the MCP tab
   - Click + Add to add a new MCP server

2. Add the MCP server configuration:

   ```json
   {
     "command": "uv",
     "args": ["tool", "run", "mcp-as-a-judge"],
     "env": {
       "LLM_API_KEY": "your-openai-api-key-here",
       "LLM_MODEL_NAME": "gpt-4.1"
     }
   }
   ```
📝 Configuration Options:
- LLM_API_KEY: Required for Cursor (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Claude Code
1. Add MCP Server via CLI:

   ```bash
   # Set environment variables first (optional model override)
   export LLM_API_KEY="your_api_key_here"
   export LLM_MODEL_NAME="claude-3-5-haiku"  # Optional: faster/cheaper model

   # Add MCP server
   claude mcp add mcp-as-a-judge -- uv tool run mcp-as-a-judge
   ```

2. Alternative: Manual Configuration

   Create or edit ~/.config/claude-code/mcp_servers.json:

   ```json
   {
     "command": "uv",
     "args": ["tool", "run", "mcp-as-a-judge"],
     "env": {
       "LLM_API_KEY": "your-anthropic-api-key-here",
       "LLM_MODEL_NAME": "claude-3-5-haiku"
     }
   }
   ```

📝 Configuration Options:
- LLM_API_KEY: Required for Claude Code (limited MCP sampling)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Other MCP Clients
For other MCP-compatible clients, use the standard MCP server configuration:
```json
{
  "command": "uv",
  "args": ["tool", "run", "mcp-as-a-judge"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-5"
  }
}
```
📝 Configuration Options:
- LLM_API_KEY: Required for most MCP clients (except GitHub Copilot + VS Code)
- LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
🔒 Privacy & Flexible AI Integration
🔑 MCP Sampling (Preferred) + LLM API Key Fallback
Primary Mode: MCP Sampling
- All judgments are performed using MCP Sampling capability
- No need to configure or pay for external LLM API services
- Works directly with your MCP-compatible client's existing AI model
- Currently supported by: GitHub Copilot + VS Code
Fallback Mode: LLM API Key
- When MCP sampling is not available, the server can use LLM API keys
- Supports multiple providers via LiteLLM: OpenAI, Anthropic, Google, Azure, Groq, Mistral, xAI (see the sketch after this list)
- Automatic vendor detection from API key patterns
- Default model selection per vendor when no model is specified
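To make the fallback concrete, this is roughly what a provider-agnostic judgment call via LiteLLM looks like. The model string and prompt are placeholders; the server's actual prompts and parameters are internal to mcp-as-a-judge:

```python
import os

from litellm import completion

# LiteLLM reads provider-specific env vars (OPENAI_API_KEY, ANTHROPIC_API_KEY, ...);
# mcp-as-a-judge exposes a single unified LLM_API_KEY on top of this.
os.environ["OPENAI_API_KEY"] = "your-openai-api-key-here"

# One call signature across providers; swap the model string to switch vendors.
response = completion(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Review this plan for security risks: ..."}],
)
print(response.choices[0].message.content)
```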
🛡️ Your Privacy Matters
- The server runs locally on your machine
- No data collection - your code and conversations stay private
- No external API calls when using MCP Sampling. If you set `LLM_API_KEY` for fallback, the server will call your chosen LLM provider only to perform judgments (plan/code/test) with the evaluation content you provide.
- Complete control over your development workflow and sensitive information
🤝 Contributing
We welcome contributions! Please see the contributing guidelines in the repository.
Development Setup
```bash
# Clone the repository
git clone https://github.com/OtherVibes/mcp-as-a-judge.git
cd mcp-as-a-judge

# Install dependencies with uv
uv sync --all-extras --dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest

# Run all checks
uv run pytest && uv run ruff check && uv run ruff format --check && uv run mypy src
```
© Concepts and Methodology
© 2025 OtherVibes and Zvi Fried. The "MCP as a Judge" concept, the "behavioral MCP" approach, the staged workflow (plan → code → test → completion), tool taxonomy/descriptions, and prompt templates are original work developed in this repository.
Prior Art and Attribution
While “LLM‑as‑a‑judge” is a broadly known idea, this repository defines the original “MCP as a Judge” behavioral MCP pattern by OtherVibes and Zvi Fried. It combines task‑centric workflow enforcement (plan → code → test → completion), explicit LLM‑based validations, and human‑in‑the‑loop elicitation, along with the prompt templates and tool taxonomy provided here. Please attribute as: “OtherVibes – MCP as a Judge (Zvi Fried)”.
❓ FAQ
How is “MCP as a Judge” different from rules/subagents in IDE assistants (GitHub Copilot, Cursor, Claude Code)?
| Feature | IDE Rules | Subagents | MCP as a Judge |
|---|---|---|---|
| Static behavior guidance | ✓ | ✓ | ✗ |
| Custom system prompts | ✓ | ✓ | ✓ |
| Project context integration | ✓ | ✓ | ✓ |
| Specialized task handling | ✗ | ✓ | ✓ |
| Active quality gates | ✗ | ✗ | ✓ |
| Evidence-based validation | ✗ | ✗ | ✓ |
| Approve/reject with feedback | ✗ | ✗ | ✓ |
| Workflow enforcement | ✗ | ✗ | ✓ |
| Cross-assistant compatibility | ✗ | ✗ | ✓ |
How does the Judge workflow relate to the tasklist? Why do we need both?
- Tasklist = planning/organization: tracks tasks, priorities, and status. It doesn’t guarantee engineering quality or readiness.
- Judge workflow = quality gates: enforces approvals for plan/design, code diffs, tests, and final completion. It demands real evidence (e.g., unified Git diffs and raw test output) and returns structured approvals and required improvements.
- Together: Use the tasklist to organize work; use the Judge to decide when each stage is actually ready to proceed. The server also emits next_tool guidance to keep progress moving through the gates (a hypothetical example follows this list).
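For illustration only, such next_tool guidance could look like the following; every field name here is an assumption, not the server's actual response schema:

```python
# Hypothetical shape of next_tool guidance -- field names are assumptions,
# not mcp-as-a-judge's actual response schema.
next_step = {
    "approved": True,
    "next_tool": "judge_code_change",           # real tool name from the table above
    "required_evidence": ["unified Git diff"],  # evidence the next gate expects
}
```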
If the Judge isn’t used automatically, how do I force it?
- In your prompt: "use mcp-as-a-judge" or "Evaluate plan/code/test using the MCP server mcp-as-a-judge".
- VS Code: Command Palette → "MCP: List Servers" → ensure "mcp-as-a-judge" is listed and enabled.
- Ensure the MCP server is running and, in your client, the judge tools are enabled/approved.
How do I select models for sampling in VS Code?
- Open Command Palette (Cmd/Ctrl+Shift+P) → "MCP: List Servers"
- Select "mcp-as-a-judge" → "Configure Model Access"
- Check your preferred model(s) to enable sampling
📄 License
This project is licensed under the MIT License (see the LICENSE file).
🙏 Acknowledgments
- Model Context Protocol by Anthropic
- LiteLLM for unified LLM API integration