
MCP as a Judge ⚖️

mcp-name: io.github.OtherVibes/mcp-as-a-judge

MCP as a Judge acts as a validation layer between AI coding assistants and LLMs, helping ensure safer and higher-quality code.

License: MIT · Python 3.13+ · MCP Compatible


MCP as a Judge is a behavioral MCP that strengthens AI coding assistants by requiring explicit LLM evaluations for:

  • Research, system design, and planning
  • Code changes, testing, and task-completion verification

It enforces evidence-based research, reuse over reinvention, and human-in-the-loop decisions.

If your IDE has rules/agents (Copilot, Cursor, Claude Code), keep using them—this Judge adds enforceable approval gates on plan, code diffs, and tests.

Key problems with AI coding assistants and LLMs

  • Treat LLM output as ground truth; skip research and use outdated information
  • Reinvent the wheel instead of reusing libraries and existing code
  • Cut corners: code below engineering standards and weak tests
  • Make unilateral decisions when requirements are ambiguous or plans change
  • Security blind spots: missing input validation, injection risks/attack vectors, least‑privilege violations, and weak defensive programming

Vibe coding doesn’t have to be frustrating

What it enforces

  • Evidence‑based research and reuse (best practices, libraries, existing code)
  • Plan‑first delivery aligned to user requirements
  • Human‑in‑the‑loop decisions for ambiguity and blockers
  • Quality gates on code and tests (security, performance, maintainability)

Key capabilities

  • Intelligent code evaluation via MCP sampling; enforces software‑engineering standards and flags security/performance/maintainability risks
  • Comprehensive plan/design review: validates architecture, research depth, requirements fit, and implementation approach
  • User‑driven decisions via MCP elicitation: clarifies requirements, resolves obstacles, and keeps choices transparent
  • Security validation in system design and code changes

Tools and how they help

Tool | What it solves
--- | ---
set_coding_task | Creates/updates task metadata; classifies task_size; returns next-step workflow guidance
get_current_coding_task | Recovers the latest task_id and metadata to resume work safely
judge_coding_plan | Validates plan/design; requires library selection and internal reuse maps; flags risks
judge_code_change | Reviews unified Git diffs for correctness, reuse, security, and code quality
judge_testing_implementation | Validates tests using real runner output and optional coverage
judge_coding_task_completion | Final gate ensuring plan, code, and test approvals before completion
raise_missing_requirements | Elicits missing details and decisions to unblock progress
raise_obstacle | Engages the user on trade-offs, constraints, and enforced changes
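
To see how these gates chain together, here is a minimal client-side sketch using the official MCP Python SDK over stdio. The tool names match the table above, but the argument names ("description", "plan") are illustrative assumptions, not the server's documented schema.

# Hypothetical sketch of driving the judge workflow from an MCP client.
# Tool names come from the table above; argument names are assumed.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="uv", args=["tool", "run", "mcp-as-a-judge"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Gate 0: register the task so the server can track workflow state.
            task = await session.call_tool(
                "set_coding_task",
                {"description": "Add retry logic to the HTTP client"},  # assumed arg name
            )

            # Gate 1: submit the plan for judgment before writing any code.
            verdict = await session.call_tool(
                "judge_coding_plan",
                {"plan": "Reuse an existing retry library; wrap client calls"},  # assumed arg name
            )
            print(task.content, verdict.content)

asyncio.run(main())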

🚀 Quick Start

Requirements & Recommendations

MCP Client Prerequisites

MCP as a Judge depends heavily on the MCP Sampling and MCP Elicitation features for its core functionality.

System Prerequisites
  • Docker Desktop / Python 3.13+ - Required for running the MCP server
Supported AI Assistants
AI Assistant | Platform | MCP Support | Status | Notes
--- | --- | --- | --- | ---
GitHub Copilot | Visual Studio Code | ✅ Full | Recommended | Complete MCP integration with sampling and elicitation
Claude Code | - | ⚠️ Partial | Requires LLM API key | Sampling Support and Elicitation Support feature requests are pending
Cursor | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited
Augment | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited
Qodo | - | ⚠️ Partial | Requires LLM API key | MCP support available, but sampling/elicitation limited

✅ Recommended setup: GitHub Copilot + VS Code — full MCP sampling; no API key needed.

⚠️ Critical: For assistants without full MCP sampling (Cursor, Claude Code, Augment, Qodo), you MUST set LLM_API_KEY. Without it, the server cannot evaluate plans or code. See LLM API Configuration.

💡 Tip: Prefer large context models (≥ 1M tokens) for better analysis and judgments.

If the MCP server isn’t auto‑used

For troubleshooting, see the FAQ section below.

🔧 MCP Configuration

Configure MCP as a Judge in your MCP-enabled client:

Method 1: Using Docker (Recommended)

One‑click install for VS Code (MCP)


Notes:

  • VS Code controls the sampling model; select it via “MCP: List Servers → mcp-as-a-judge → Configure Model Access”.

  1. Configure MCP Settings:

    Add this to your MCP client configuration file:

    {
      "command": "docker",
      "args": ["run", "--rm", "-i", "--pull=always", "ghcr.io/othervibes/mcp-as-a-judge:latest"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key-here",
        "LLM_MODEL_NAME": "gpt-4o-mini"
      }
    }
    

    📝 Configuration Options (All Optional):

    • LLM_API_KEY: Optional for GitHub Copilot + VS Code (has built-in MCP sampling)
    • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
    • The --pull=always flag ensures you always get the latest version automatically

    To update manually when needed:

    # Pull the latest version
    docker pull ghcr.io/othervibes/mcp-as-a-judge:latest
    

Method 2: Using uv

  1. Install the package:

    uv tool install mcp-as-a-judge
    
  2. Configure MCP Settings:

    The MCP server may be automatically detected by your MCP‑enabled client.

    📝 Notes:

    • No additional configuration needed for GitHub Copilot + VS Code (has built-in MCP sampling)
    • LLM_API_KEY is optional and can be set via environment variable if needed
  3. To update to the latest version:

    # Update MCP as a Judge to the latest version
    uv tool upgrade mcp-as-a-judge
    

Select a sampling model in VS Code

  • Open Command Palette (Cmd/Ctrl+Shift+P) → “MCP: List Servers”
  • Select the configured server “mcp-as-a-judge”
  • Choose “Configure Model Access”
  • Check your preferred model(s) to enable sampling

🔑 LLM API Configuration (Optional)

For AI assistants without full MCP sampling support you can configure an LLM API key as a fallback. This ensures MCP as a Judge works even when the client doesn't support MCP sampling.

  • Set LLM_API_KEY (unified key). Vendor is auto-detected; optionally set LLM_MODEL_NAME to override the default.

Supported LLM Providers

Rank | Provider | API Key Format | Default Model | Notes
--- | --- | --- | --- | ---
1 | OpenAI | sk-... | gpt-4.1 | Fast and reliable model optimized for speed
2 | Anthropic | sk-ant-... | claude-sonnet-4-20250514 | High-performance with exceptional reasoning
3 | Google | AIza... | gemini-2.5-pro | Most advanced model with built-in thinking
4 | Azure OpenAI | [a-f0-9]{32} | gpt-4.1 | Same as OpenAI but via Azure
5 | AWS Bedrock | AWS credentials | anthropic.claude-sonnet-4-20250514-v1:0 | Aligned with Anthropic
6 | Vertex AI | Service Account JSON | gemini-2.5-pro | Enterprise Gemini via Google Cloud
7 | Groq | gsk_... | deepseek-r1 | Best reasoning model with speed advantage
8 | OpenRouter | sk-or-... | deepseek/deepseek-r1 | Best reasoning model available
9 | xAI | xai-... | grok-code-fast-1 | Latest coding-focused model (Aug 2025)
10 | Mistral | [a-f0-9]{64} | pixtral-large | Most advanced model (124B params)
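
The key formats above drive automatic vendor detection. As a rough illustration of how prefix-based detection can work, the sketch below maps the key patterns from this table to provider names; detect_provider is a hypothetical helper, not part of the server's actual API.

# Illustrative sketch of API-key vendor detection based on the table above.
# detect_provider is hypothetical, not mcp-as-a-judge's real implementation.
import re

KEY_PATTERNS: list[tuple[str, str]] = [
    (r"^sk-ant-", "anthropic"),    # must match before the generic "sk-" prefix
    (r"^sk-or-", "openrouter"),    # likewise
    (r"^sk-", "openai"),
    (r"^AIza", "google"),
    (r"^gsk_", "groq"),
    (r"^xai-", "xai"),
    (r"^[a-f0-9]{64}$", "mistral"),
    (r"^[a-f0-9]{32}$", "azure_openai"),
]

def detect_provider(api_key: str) -> str | None:
    """Return the first provider whose key pattern matches, if any."""
    for pattern, provider in KEY_PATTERNS:
        if re.match(pattern, api_key):
            return provider
    return None

assert detect_provider("sk-ant-xxxx") == "anthropic"
assert detect_provider("gsk_xxxx") == "groq"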

Client-Specific Setup

Cursor
  1. Open Cursor Settings:

    • Go to File → Preferences → Cursor Settings
    • Navigate to the MCP tab
    • Click + Add to add a new MCP server
  2. Add MCP Server Configuration:

    {
      "command": "uv",
      "args": ["tool", "run", "mcp-as-a-judge"],
      "env": {
        "LLM_API_KEY": "your-openai-api-key-here",
        "LLM_MODEL_NAME": "gpt-4.1"
      }
    }
    

    📝 Configuration Options:

    • LLM_API_KEY: Required for Cursor (limited MCP sampling)
    • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Claude Code
  1. Add MCP Server via CLI:

    # Set environment variables first (optional model override)
    export LLM_API_KEY="your_api_key_here"
    export LLM_MODEL_NAME="claude-3-5-haiku"  # Optional: faster/cheaper model
    
    # Add MCP server
    claude mcp add mcp-as-a-judge -- uv tool run mcp-as-a-judge
    
  2. Alternative: Manual Configuration:

    • Create or edit ~/.config/claude-code/mcp_servers.json
    {
      "command": "uv",
      "args": ["tool", "run", "mcp-as-a-judge"],
      "env": {
        "LLM_API_KEY": "your-anthropic-api-key-here",
        "LLM_MODEL_NAME": "claude-3-5-haiku"
      }
    }
    

    📝 Configuration Options:

    • LLM_API_KEY: Required for Claude Code (limited MCP sampling)
    • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)
Other MCP Clients

For other MCP-compatible clients, use the standard MCP server configuration:

{
  "command": "uv",
  "args": ["tool", "run", "mcp-as-a-judge"],
  "env": {
    "LLM_API_KEY": "your-openai-api-key-here",
    "LLM_MODEL_NAME": "gpt-5"
  }
}

📝 Configuration Options:

  • LLM_API_KEY: Required for most MCP clients (except GitHub Copilot + VS Code)
  • LLM_MODEL_NAME: Optional custom model (see Supported LLM Providers for defaults)

🔒 Privacy & Flexible AI Integration

🔑 MCP Sampling (Preferred) + LLM API Key Fallback

Primary Mode: MCP Sampling

  • All judgments are performed using MCP Sampling capability
  • No need to configure or pay for external LLM API services
  • Works directly with your MCP-compatible client's existing AI model
  • Currently supported by: GitHub Copilot + VS Code

Fallback Mode: LLM API Key

  • When MCP sampling is not available, the server can fall back to an LLM API key (a simplified sketch follows this list)
  • Supports multiple providers via LiteLLM: OpenAI, Anthropic, Google, Azure, Groq, Mistral, xAI
  • Automatic vendor detection from API key patterns
  • Default model selection per vendor when no model is specified
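
As a rough illustration of this control flow, the sketch below wires sampling-first judgment with a LiteLLM fallback inside a FastMCP tool. It is a simplified assumption about how such a server can be structured, not the actual mcp-as-a-judge code; judge_text and its prompt are invented for the example.

# Simplified sketch: MCP sampling first, LiteLLM fallback second.
# judge_text and the prompt are illustrative, not the server's real code.
import os

from mcp.server.fastmcp import Context, FastMCP
from mcp.types import SamplingMessage, TextContent

mcp = FastMCP("judge-sketch")

@mcp.tool()
async def judge_text(plan: str, ctx: Context) -> str:
    """Judge a plan via client-side sampling, else a direct LLM call."""
    prompt = f"Review this coding plan and approve or reject it:\n{plan}"
    try:
        # Preferred: ask the client's own model via MCP sampling.
        result = await ctx.session.create_message(
            messages=[
                SamplingMessage(role="user", content=TextContent(type="text", text=prompt))
            ],
            max_tokens=500,
        )
        if isinstance(result.content, TextContent):
            return result.content.text
        return ""
    except Exception:
        # Fallback: call the configured provider directly through LiteLLM.
        from litellm import completion

        response = completion(
            model=os.environ.get("LLM_MODEL_NAME", "gpt-4.1"),
            messages=[{"role": "user", "content": prompt}],
            api_key=os.environ.get("LLM_API_KEY"),
        )
        return response.choices[0].message.content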

🛡️ Your Privacy Matters

  • The server runs locally on your machine
  • No data collection - your code and conversations stay private
  • No external API calls when using MCP Sampling. If you set LLM_API_KEY for fallback, the server will call your chosen LLM provider only to perform judgments (plan/code/test) with the evaluation content you provide.
  • Complete control over your development workflow and sensitive information

🤝 Contributing

We welcome contributions! Please see the contributing guidelines.

Development Setup

# Clone the repository
git clone https://github.com/OtherVibes/mcp-as-a-judge.git
cd mcp-as-a-judge

# Install dependencies with uv
uv sync --all-extras --dev

# Install pre-commit hooks
uv run pre-commit install

# Run tests
uv run pytest

# Run all checks
uv run pytest && uv run ruff check && uv run ruff format --check && uv run mypy src

© Concepts and Methodology

© 2025 OtherVibes and Zvi Fried. The "MCP as a Judge" concept, the "behavioral MCP" approach, the staged workflow (plan → code → test → completion), tool taxonomy/descriptions, and prompt templates are original work developed in this repository.

Prior Art and Attribution

While “LLM‑as‑a‑judge” is a broadly known idea, this repository defines the original “MCP as a Judge” behavioral MCP pattern by OtherVibes and Zvi Fried. It combines task‑centric workflow enforcement (plan → code → test → completion), explicit LLM‑based validations, and human‑in‑the‑loop elicitation, along with the prompt templates and tool taxonomy provided here. Please attribute as: “OtherVibes – MCP as a Judge (Zvi Fried)”.

❓ FAQ

How is “MCP as a Judge” different from rules/subagents in IDE assistants (GitHub Copilot, Cursor, Claude Code)?

IDE rules and subagents shape how the assistant behaves: they provide static behavior guidance, custom system prompts, project context integration, and specialized task handling. MCP as a Judge adds what they lack: active quality gates, evidence-based validation, approve/reject decisions with feedback, workflow enforcement, and cross-assistant compatibility.

How does the Judge workflow relate to the tasklist? Why do we need both?

  • Tasklist = planning/organization: tracks tasks, priorities, and status. It doesn’t guarantee engineering quality or readiness.
  • Judge workflow = quality gates: enforces approvals for plan/design, code diffs, tests, and final completion. It demands real evidence such as unified Git diffs and raw test output (collected as in the sketch after this list) and returns structured approvals and required improvements.
  • Together: Use the tasklist to organize work; use the Judge to decide when each stage is actually ready to proceed. The server also emits next_tool guidance to keep progress moving through the gates.
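
For the "real evidence" the gates demand, a client can gather a unified Git diff and raw test-runner output before calling the judge tools. A minimal sketch follows, assuming the judge tools accept these strings; the field names are assumptions, not the documented schema.

# Hypothetical sketch: collecting evidence for judge_code_change and
# judge_testing_implementation. Field names are assumptions.
import subprocess

def collect_evidence() -> dict[str, str]:
    """Gather a unified Git diff and raw pytest output."""
    diff = subprocess.run(
        ["git", "diff", "--unified"], capture_output=True, text=True, check=True
    ).stdout
    tests = subprocess.run(
        ["pytest", "-q"], capture_output=True, text=True
    )  # no check=True: failing tests are still evidence
    return {"git_diff": diff, "test_output": tests.stdout + tests.stderr}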

If the Judge isn’t used automatically, how do I force it?

  • In your prompt: "use mcp-as-a-judge" or "Evaluate plan/code/test using the MCP server mcp-as-a-judge".
  • VS Code: Command Palette → "MCP: List Servers" → ensure "mcp-as-a-judge" is listed and enabled.
  • Ensure the MCP server is running and, in your client, the judge tools are enabled/approved.

How do I select models for sampling in VS Code?

  • Open Command Palette (Cmd/Ctrl+Shift+P) → "MCP: List Servers"
  • Select "mcp-as-a-judge" → "Configure Model Access"
  • Check your preferred model(s) to enable sampling

📄 License

This project is licensed under the MIT License.

🙏 Acknowledgments