mcp-ptu-server by VFG92 - MCP Server

🧠 MCP PTU Server

An MCP-compliant Cloudflare Worker that orchestrates structured multi-path reasoning sessions for ChatGPT and other Model Context Protocol clients. The worker keeps session state inside Durable Objects, enforces plan diversity, records evidence, and continuously reports quality metrics so agents can self-regulate their workflow.

Feature highlights

Parallel reasoning workflow – Durable Objects coordinate the end-to-end lifecycle: session init, plan submission, manifest execution, evidence registration, peer critiques, mediation, readiness checks, and finalization.
Self-assessment based evidence ledger – register_execution_results accepts counts instead of raw text. Payloads must use synthetic reference IDs matching ^(Source|Calc|Data|WP)\d+$ and include honest coverage/confidence estimates to avoid moderation throttles.
Quality metrics in real time – Confidence, coverage, consensus, and saliency reports are recalculated after every registration. list_plan_status and readiness tools surface actionable gaps.
Session registry & proxy endpoints – Requests automatically resolve to the correct Durable Object, whether they arrive through /mcp, /proxy, or the /api/register-results direct API. Heartbeat and regeneration helpers keep long sessions alive.
Capability adapters – The worker bundles the "everything" capability set alongside filesystem, memory, and sequential-thinking modules, with rich integration tests under __tests__/.

🆕 What's New (October 2025)

🚀 Latest Updates (v5.10.0 - October 2025)

Enhanced Reasoning Quality (Popper-Inspired Falsification)

✅ Falsification tests now REQUIRED (minimum 3 claims per critique)
✅ Counterfactual scenarios now REQUIRED (prevents overfitting to preferred assumptions)
✅ UCT-based depth tracking (MCTS-inspired exploration guidance for plan expansion)

Why this matters:

🎯 Prevents overfitting: Falsification tests force rigorous hypothesis testing (Popper's criterion)
🎯 Prevents assumption lock-in: Counterfactual scenarios explore "what if driver X doesn't hold?"
🎯 Optimizes exploration: UCT balances exploitation (high-benefit plans) vs exploration (under-explored plans)

Example falsification test:

{
  "claim": "Market size is $50B",
  "falsification_test": "If Q1 2025 growth is <5%, this claim is falsified",
  "counterfactual_scenario": "If market size is actually $30B, plan would need to reduce headcount by 40% and focus on premium segment only"
}

UCT Depth Tracking:

Plans ranked by UCT score (exploitation + C * exploration)
ChatGPT receives guidance on which plan to expand next
Prevents premature convergence on single plan

🔧 Previous Updates (v5.9.1 - October 2025)

Fixed: Diversity Axis Parsing

✅ Added support for Key (value) format in diversity axes (e.g., Metodologia (econometrico))
✅ Fixed semantic diversity calculation returning 0 for clearly different plans
✅ Plans using parentheses format now correctly differentiate from each other

Fixed: Confidence Calculation & Moderation Blocks

✅ Increased evidence quality bonuses to make 85% threshold achievable
✅ Disabled automatic penalties when self-assessment is present (trust ChatGPT's counts)
✅ Added regex validation to enforce synthetic IDs (Source1, Calc1, Data1, WP1)
✅ Prevented moderation blocks by rejecting real source names (ISTAT, WEF, Excelsior)
✅ Reduced payload size (removed evidence_refs, max 100 char summary) to avoid connector blocks
✅ Reduced summary max length from 300 to 200 chars to minimize moderation risk
✅ Updated tool descriptions with CRITICAL anti-moderation rules

Impact:

ChatGPT can now reach quality thresholds more easily with proper evidence
Prevents 403 moderation errors by enforcing synthetic IDs in payloads
Clearer guidance helps ChatGPT self-correct and avoid common mistakes

🎯 Self-Assessment Approach (v5.9.0+) - NO MORE 403 ERRORS!

MAJOR CHANGE: ChatGPT now counts evidence and self-evaluates quality instead of sending textual content.

Why this is revolutionary:

✅ 10x smaller payload (only numbers) → NO 403 errors from OpenAI gateway
✅ NO batching needed (payload always small enough)
✅ Self-correction loop (ChatGPT knows if evidence is insufficient and can improve)
✅ Honest evaluation (ChatGPT takes responsibility for quality verification)

How it works:

ChatGPT executes ALL steps using native tools (web search, Python, code interpreter)
ChatGPT counts evidence items (sources, datapoints, workpapers)
ChatGPT self-evaluates quality honestly (confidence, coverage)
ChatGPT calls register_execution_results with self-assessment (counts + evaluation)
Server validates and provides immediate feedback

🚨 CRITICAL: Anti-Moderation Rules:

❌ NEVER use real source names in evidence_refs.ref_id (ISTAT, WEF, Excelsior, etc.)
❌ NEVER use URLs, citations, or organization names in payload
✅ ALWAYS use synthetic IDs: Source1, Source2, Calc1, Data1, WP1, etc.
✅ Keep summary ultra-short (max 200 chars) with ONLY numbers and generic terms
⚠️ Real names trigger OpenAI moderation blocks causing 403 errors!

Example payload (CORRECT - uses synthetic IDs):

{
  "execution_token": "exec_...",
  "self_assessment": {
    "total_evidence_items": 45,
    "external_sources": 12,
    "quantitative_datapoints": 23,
    "workpapers_created": 8,
    "estimated_confidence": 0.82,
    "estimated_coverage": 0.96,
    "meets_confidence_threshold": false,
    "gaps_identified": ["Missing validation for claim X"]
  },
  "results": [{
    "plan_id": "P1",
    "step_id": "step_1",
    "evidence_count": 3,
    "evidence_refs": [{"ref_id": "Source1", "type": "source"}],
    "summary": "12 journeys. Gap 12-25%. 3 sources, 5 calcs."
  }]
}

Benefits:

🚫 NO MORE 403 "safety" blocks (payload too small to trigger filters)
🚫 NO MORE batching complexity (single call for all results)
🚫 NO MORE URL handling issues (no URLs in payload)
✅ ChatGPT self-corrects (knows when to add more evidence)
✅ Server validates honesty (compares declared vs calculated metrics)

Previous Features (Still Active)

Session Registry Integration:

/api/register-results correctly routes to the same Durable Object that created the session
Mapping session_id → DO_ID registered in global SessionRegistry

Enhanced Evidence Quality Guidance:

list_plan_status shows evidence quality report with self-assessment validation
Server compares ChatGPT's self-evaluation with calculated metrics

Constructive Disagreement Rewarded:

Consensus calculation values well-argued disagreement over shallow agreement
Quality bonuses for: claims_challenged (+0.20), falsification_tests (+0.25), residual_risks (+0.15)

Meta-Reflection Analysis:

generate_meta_reflection analyzes patterns in disagreements
Identifies low-confidence decisions and provides actionable recommendations

Key capabilities

Parallel reasoning orchestration – create, execute, critique, and finalize reasoning plans through dedicated MCP tools.
Evidence ledger – every piece of evidence recorded during plan execution is registered automatically with traceable IDs.
Dynamic quality metrics – confidence, coverage, and consensus scores are calculated from session data and surfaced during finalization.
Session persistence – the /proxy endpoint forwards requests to the correct Durable Object using the session_id, removing the need for custom headers in clients that cannot set them.

Project structure

Path	Purpose
`src/workers/`	Cloudflare Worker entrypoint, Durable Objects, manifest execution logic, and MCP tool handlers.
`src/types/`	Shared type definitions for manifests, payload validation, and capability wiring.
`src/ui/`	Lightweight diagnostic UI bundle used by the worker when running with Wrangler.
`examples/`	Scripted walkthroughs for manifest execution, peer review, and capability integration.
`scripts/`	Shell helpers for local smoke tests and debugging complex sessions.
`__tests__/`	Jest suites covering parallel reasoning flows, evidence validation, consensus metrics, and persistence.

Getting started

Install dependencies
```
npm install
```
Type-check the project
```
npm run build
```
Run the unit and integration suite
```
npm test
```
Start a local worker (requires configured Cloudflare credentials)
```
npm run workers:dev
```

Target runtime: Node.js 20+ locally and Cloudflare Workers (Wrangler 4.40+). Update wrangler.toml with your account bindings before deploying.

Testing & diagnostics

Command	What it verifies
`npm test`	Core Jest suite for Durable Objects, manifest execution, metrics, and adapters.
`npm run test:coverage`	Generates code coverage for the MCP worker surface.
`./test-direct-api.sh`	End-to-end manifest workflow via the direct HTTP API (session extraction and routing).
`./test-simple-direct-api.sh`	Minimal register-results smoke test with moderation-safe payloads.
`./test-403-fix.sh`	Regression test ensuring synthetic IDs and short summaries prevent OpenAI 403 moderation errors.
`./scripts/test-parallel-reasoning-simple.sh`	Lightweight smoke test for Durable Object persistence (requires `workers:dev`).
`./scripts/test-parallel-reasoning-fix.sh`	Verbose MCP walkthrough useful for debugging session lifecycle or mediation issues.

MCP interfaces

HTTP endpoints

POST /mcp – Canonical MCP transport (requires mcp-session-id header).
GET /mcp – Server-sent events stream for MCP responses.
POST /proxy – Extracts session_id from the request body for clients that cannot set custom headers.
POST /api/register-results – Direct API alternative that reuses execution tokens to bypass stale MCP sessions.
POST /heartbeat – Keeps long-running sessions alive.
GET /health – Basic health check.

Core MCP tools

init_parallel_reasoning
submit_reasoning_plan
execute_reasoning_manifest
register_execution_results (with self-assessment counts)
list_plan_status
submit_peer_critique
submit_mediation_decision
generate_meta_reflection
check_session_readiness
finalize_parallel_reasoning
Utility helpers such as regenerate_execution_token

Oracle tools (Optional Formal Verification)

verify_logical_claim - SAT solver for propositional logic (CNF DIMACS-like format)
verify_algebraic_claim - Computer Algebra System for symbolic math (AST format)
verify_proof_sketch - Proof checker for propositional logic proofs

Oracle Features:

✅ Strict input formats: CNF DIMACS-like for SAT, structured AST for CAS, proof steps for checker
✅ Hard timeout: <8ms CPU per oracle call
✅ Deduplication: Automatic caching to avoid re-verification of identical claims
✅ Witness-based evidence: Returns synthetic witness hashes, not full proofs
✅ FORMAT_UNSUPPORTED: Clear error for invalid inputs
✅ Oracle validation rate: Tracked in session metrics (% of claims verified by oracles)

Each response includes a workflow checklist that tracks progress toward readiness thresholds (confidence ≥ 85%, coverage ≥ 95%, consensus ≥ 80%).

Oracle Tools (Optional Formal Verification)

The server includes optional oracle tools for formal verification of critical claims during peer review:

Available Oracles

1. SAT Solver (verify_logical_claim)

Input: CNF formula in DIMACS-like format
Output: SAT/UNSAT/UNKNOWN with witness hash
Use case: Verify logical consistency of propositional claims

Example:

{
  "claim_id": "claim-1",
  "formula": {
    "num_vars": 3,
    "num_clauses": 2,
    "clauses": [[1, -2], [2, 3]]
  }
}

2. Computer Algebra System (verify_algebraic_claim)

Input: Algebraic expression in structured AST format
Output: SIMPLIFIED/EXPANDED/FACTORED/SOLVED/EQUIVALENT/NOT_EQUIVALENT with witness hash
Use case: Verify algebraic equivalences and simplifications
Operations: simplify, factor (limited), expand, solve (limited), equivalent
Note: factor and solve have limited capabilities due to Math.js constraints

Example:

{
  "claim_id": "claim-2",
  "operation": "simplify",
  "expression": {
    "type": "operator",
    "operator": "+",
    "operands": [
      {"type": "number", "value": 2},
      {"type": "number", "value": 3}
    ]
  }
}

3. Proof Checker (verify_proof_sketch)

Input: Proof sketch with premises, conclusion, and justification steps (structured object, not Metamath syntax)
Output: VALID/INVALID with witness hash
Use case: Verify propositional logic proofs
Supported rules:
- Fully verified: premise, modus_ponens, and_intro, and_elim, or_intro
- Simplified (no assumption tracking): or_elim, implies_intro, implies_elim
Note: Rules requiring assumption contexts (or_elim, implies_intro, implies_elim) are accepted but not fully verified

Example:

{
  "claim_id": "claim-3",
  "proof": {
    "premises": ["A", "A -> B"],
    "conclusion": "B",
    "steps": [
      {"formula": "A", "justification": "premise"},
      {"formula": "A -> B", "justification": "premise"},
      {"formula": "B", "justification": "modus_ponens"}
    ]
  }
}

Oracle Integration with Peer Critiques

Oracles can be used during peer review to formally verify challenged claims:

{
  "reviewer_plan_id": "plan_A",
  "reviewed_plan_id": "plan_B",
  "claims_challenged": [
    {
      "claim": "Market size is $50B",
      "evidence_ids": ["ev1"],
      "challenge": "Data is outdated",
      "falsification_test": "If Q1 2025 growth is <5%, claim is falsified",
      "counterfactual_scenario": "If market is $30B, reduce headcount by 40%",
      "oracle_verified": {
        "claim_id": "claim-1",
        "oracle_type": "sat",
        "result": "SAT",
        "witness_hash": "a3f5b2c1...",
        "timestamp": 1234567890
      }
    }
  ]
}

Oracle Validation Metrics

The oracle_validation_rate metric tracks the percentage of critical claims that have been formally verified:

Shown in list_plan_status responses
Included in finalization reports
Calculated as: oracle_verified_claims / total_claims

Oracle Tool Descriptions (as seen by ChatGPT)

When ChatGPT calls list_tools, it sees these descriptions:

verify_logical_claim:

SAT solver for propositional logic (CNF format). Returns SAT/UNSAT/UNKNOWN with witness hash. LIMITATION: Only formulas with ≤10 variables (brute-force). Larger formulas return UNKNOWN. Timeout: 8ms. Automatic deduplication.

verify_algebraic_claim:

Computer Algebra System (Math.js) for symbolic math. Operations: simplify (full), factor (limited), expand (full), solve (limited), equivalent (compares simplified forms). Returns SIMPLIFIED/EXPANDED/FACTORED/SOLVED/EQUIVALENT/NOT_EQUIVALENT with witness hash. LIMITATIONS: factor and solve have limited capabilities due to Math.js constraints. Timeout: 8ms. Automatic deduplication (cache includes operation and expected_result).

verify_proof_sketch:

Proof checker for propositional logic. Returns VALID/INVALID with witness hash. FULLY VERIFIED RULES: premise, modus_ponens, and_intro, and_elim, or_intro. SIMPLIFIED RULES (no assumption tracking): or_elim, implies_intro, implies_elim. Formula syntax: 'A -> B', 'A AND B', 'A OR B'. Uses regex pattern matching (not full AST). Timeout: 8ms. Automatic deduplication.

These descriptions clearly communicate the capabilities and limitations to ChatGPT, preventing misuse.

Examples & further reading

examples/parallel-reasoning-v5-example.ts – Complete manifest workflow showcasing self-assessment registration.
examples/peer-review-example.ts – Peer critique and mediation lifecycle.
examples/capability-integration-example.ts – Capability orchestration and evidence handling primer.
src/workers/everything-workers.ts – Tool registry and descriptions surfaced to MCP clients.
src/workers/session-metrics.ts – Implementation of consensus, confidence, coverage, and saliency calculations.
src/workers/oracle-tools.ts – Oracle implementations for formal verification.
__tests__/oracle-tools.test.ts – Oracle test suite with format validation and performance tests.

For more background on the Model Context Protocol, visit modelcontextprotocol.io.