
MedAgent Copilot



🏆 Benchmark Results

MedAgentBench V2 (300 Tasks) - 98.3% Accuracy

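| Task Type | Description           | Accuracy        | Status |
|-----------|-----------------------|-----------------|--------|
| Task 1    | Patient Search        | 30/30 (100%)    |        |
| Task 2    | Age Calculation       | 30/30 (100%)    |        |
| Task 3    | Record Blood Pressure | 30/30 (100%)    |        |
| Task 4    | Query Magnesium       | 30/30 (100%)    |        |
| Task 5    | Mg Replacement        | 30/30 (100%)    |        |
| Task 6    | Average Glucose       | 30/30 (100%)    |        |
| Task 7    | Latest CBG            | 29/30 (96.7%)   | ⚠️     |
| Task 8    | Ortho Referral        | 30/30 (100%)    |        |
| Task 9    | K Replacement         | 30/30 (100%)    |        |
| Task 10   | HbA1C Check           | 26/30 (86.7%)   | ⚠️     |
| Total     |                       | 295/300 (98.3%) | 🏆     |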
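
As a quick sanity check, the headline figure can be recomputed from the per-task counts. This is a small sketch that only restates the numbers in the table; the variable names are illustrative.

```python
# Per-task correct counts copied from the results table (30 tasks per type in V2).
correct = {1: 30, 2: 30, 3: 30, 4: 30, 5: 30, 6: 30, 7: 29, 8: 30, 9: 30, 10: 26}
TASKS_PER_TYPE = 30

total_correct = sum(correct.values())
total_tasks = TASKS_PER_TYPE * len(correct)
accuracy_pct = round(100 * total_correct / total_tasks, 1)

print(f"{total_correct}/{total_tasks} ({accuracy_pct}%)")
```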

Results by Task Type

[Figure: Task Type Accuracy]

Results by Difficulty

[Figure: Difficulty Chart]

Tested with: Claude Opus 4.5 (Preview) via VS Code GitHub Copilot
Run Date: 2025-11-27
Run Folder: results/v2_20251127_212627


English

Overview

MedAgent Copilot transforms GitHub Copilot into a medical AI agent using the Model Context Protocol (MCP). This project enables Copilot to interact with FHIR (Fast Healthcare Interoperability Resources) electronic health record systems and complete clinical tasks autonomously.

This implementation is designed to work with the MedAgentBench benchmark from Stanford ML Group, which evaluates language model agents on realistic clinical tasks.

What is MedAgentBench?

MedAgentBench is a benchmark for evaluating LLM agents on 10 types of clinical tasks:

| Task    | Description                     | Requires POST        |
|---------|---------------------------------|----------------------|
| Task 1  | Patient Search by Name + DOB    |                      |
| Task 2  | Age Calculation from MRN        |                      |
| Task 3  | Record Blood Pressure           |                      |
| Task 4  | Query Magnesium Level (24h)     |                      |
| Task 5  | Magnesium Replacement Order     | ✅ (if low)          |
| Task 6  | Average Blood Glucose (24h)     |                      |
| Task 7  | Latest Blood Glucose            |                      |
| Task 8  | Orthopedic Surgery Referral     |                      |
| Task 9  | Potassium Replacement + Recheck | ✅ (if low)          |
| Task 10 | HbA1C Check + Order if needed   | ✅ (if missing/old)  |

  • V1: 100 tasks (10 per type)
  • V2: 300 tasks (30 per type)

How It Works

┌─────────────────┐     MCP Protocol      ┌─────────────────┐
│  GitHub Copilot │ ◄──────────────────► │  MedAgent MCP   │
│    (VS Code)    │                       │     Server      │
└─────────────────┘                       └────────┬────────┘
                                                   │
                                                   │ FHIR R4 API
                                                   ▼
                                          ┌─────────────────┐
                                          │  FHIR Server    │
                                          │ (Docker:8080)   │
                                          └─────────────────┘

Memory Architecture 🧠

MedAgent uses a layered memory system to maintain clinical knowledge while ensuring patient privacy:

.med_memory/
├── CONSTITUTION.md              # 📜 Agent Rules (enforced on every tool call)
├── knowledge/                   # 📚 Shared Medical Knowledge
│   ├── clinical_knowledge.md    #    - Clinical protocols & thresholds
│   ├── fhir_functions.md        #    - FHIR API reference
│   ├── task_instructions.md     #    - Task-specific answer formats
│   └── task_examples.md         #    - Worked examples
└── patient_context/             # 🔐 Isolated Patient Memory
    └── {mrn}.json               #    - Single patient at a time (auto-cleared)

Core Principles:

| Principle             | Description                                     |
|-----------------------|-------------------------------------------------|
| One Patient at a Time | Only one patient context loaded simultaneously  |
| Task Isolation        | Patient memory cleared after each task          |
| Knowledge Sharing     | Clinical protocols accessible across all tasks  |
| Privacy by Design     | No cross-patient data access allowed            |

Memory-Aware Workflow:

load_tasks() → get_next_task() → load_patient_context(mrn)
                                          ↓
                              [Complete task with FHIR tools]
                                          ↓
                              submit_answer() → clear_patient_context()
                                          ↓
                              get_next_task() → ... (repeat)
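
The load/clear pairing in this workflow can be sketched as a context manager that guarantees the patient file is removed when the task ends. This is illustrative only: the project's own implementation lives in src/helpers/patient.py and may differ, and the `patient_context` helper name and the JSON payload shape are assumptions.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def patient_context(root: Path, mrn: str):
    """Hold exactly one {mrn}.json file for the duration of a task (hypothetical helper)."""
    root.mkdir(parents=True, exist_ok=True)
    # Enforce "one patient at a time": refuse to load if another context exists.
    leftovers = list(root.glob("*.json"))
    if leftovers:
        raise RuntimeError(f"another patient context is still loaded: {leftovers[0].name}")
    path = root / f"{mrn}.json"
    path.write_text(json.dumps({"mrn": mrn, "notes": []}))  # assumed payload shape
    try:
        yield path
    finally:
        # Task isolation: the file is removed even if the task raises.
        path.unlink(missing_ok=True)
```

Wrapping each task in `with patient_context(root, mrn): ...` makes the clear step impossible to forget, which is the property the workflow relies on.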

Prerequisites

  • Python 3.10+
  • VS Code with GitHub Copilot extension
  • Docker (for FHIR server)
  • Git

Quick Start

1. Clone this repository

git clone https://github.com/u9401066/medagent-copilot.git
cd medagent-copilot

2. Install dependencies

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Clone MedAgentBench (required for task data)

cd ..
git clone https://github.com/stanfordmlgroup/MedAgentBench.git

Final directory structure:

workspace/
├── medagent-copilot/    # This project
└── MedAgentBench/       # Stanford's benchmark (task data)

4. Start FHIR Server

docker run -p 8080:8080 jyxsu6/medagentbench:latest

Verify: curl http://localhost:8080/fhir/Patient?_count=1
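
The same check can be scripted. Below is a minimal sketch using only the standard library; `build_search_url` and `fhir_is_up` are hypothetical helper names, not part of this project. It relies on the fact that a FHIR search returns a resource whose `resourceType` is `Bundle`.

```python
import json
import urllib.request
from urllib.parse import urlencode, urljoin


def build_search_url(base: str, resource: str, **params) -> str:
    """Build a FHIR search URL such as <base>/Patient?_count=1."""
    base = base if base.endswith("/") else base + "/"
    url = urljoin(base, resource)
    return f"{url}?{urlencode(params)}" if params else url


def fhir_is_up(base: str, timeout: float = 5.0) -> bool:
    """Return True if a one-patient search answers with a FHIR Bundle."""
    url = build_search_url(base, "Patient", _count=1)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):  # connection errors or a non-JSON body
        return False
    return isinstance(body, dict) and body.get("resourceType") == "Bundle"


if __name__ == "__main__":
    print(fhir_is_up("http://localhost:8080/fhir/"))
```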

5. Configure VS Code MCP

Create .vscode/mcp.json in your workspace root:

{
  "servers": {
    "medagent-fhir": {
      "type": "stdio",
      "command": "python",
      "args": ["${workspaceFolder}/medagent-copilot/src/mcp_server.py"],
      "env": {
        "FHIR_API_BASE": "http://localhost:8080/fhir/"
      }
    }
  }
}

6. Start MCP Server

  1. Open VS Code
  2. Press Cmd/Ctrl + Shift + P, then type MCP: List Servers
  3. Confirm medagent-fhir shows as Running
  4. If it is not running, use MCP: Start Server and select medagent-fhir

7. Run Tasks

In GitHub Copilot Chat:

@workspace Please load MedAgentBench V1 tasks and start executing

MCP Tools Reference

Task Management

| Tool                           | Description                     |
|--------------------------------|---------------------------------|
| load_tasks(version)            | Load tasks (v1: 100, v2: 300)   |
| get_next_task()                | Get next task                   |
| submit_answer(task_id, answer) | Submit answer (auto-saves)      |
| get_task_status()              | Check progress                  |
| evaluate_results()             | Run official evaluation         |

FHIR Operations

| Tool                    | Description                   |
|-------------------------|-------------------------------|
| search_patient          | Search patient by name/DOB    |
| get_patient_by_mrn      | Get patient by MRN            |
| get_lab_observations    | Query labs (MG, K, GLU, A1C)  |
| get_vital_signs         | Query vital signs             |
| create_vital_sign       | Record BP                     |
| create_medication_order | Order medication              |
| create_service_request  | Create referral/lab order     |

Answer Format (Critical!)

All answers must be JSON array strings:

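| Task   | Format                           | Example                                 |
|--------|----------------------------------|-----------------------------------------|
| task1  | '["MRN"]'                        | '["S6534835"]'                          |
| task2  | '[age]' (integer)                | '[60]'                                  |
| task3  | '[]'                             | '[]'                                    |
| task4  | '[mg]' or '[-1]'                 | '[2.7]'                                 |
| task5  | '[]' or '[mg]'                   | '[1.8]'                                 |
| task6  | '[avg]' (keep decimals!)         | '[89.888889]'                           |
| task7  | '[cbg]'                          | '[123.0]'                               |
| task8  | '[]'                             | '[]'                                    |
| task9  | '[]' or '[k]'                    | '[]'                                    |
| task10 | '[val, "datetime"]' or '[-1]'    | '[5.9, "2023-11-09T03:05:00+00:00"]'    |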
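
One way to avoid hand-writing these strings is to build them with `json.dumps`. The sketch below is an assumption about how an answer could be prepared before calling `submit_answer`; `format_answer` and `is_valid_answer` are hypothetical helpers, not functions from this project.

```python
import json


def format_answer(*values) -> str:
    """Serialize the answer values as a JSON array string."""
    return json.dumps(list(values))


def is_valid_answer(answer: str) -> bool:
    """Check that a string parses as a JSON array."""
    try:
        return isinstance(json.loads(answer), list)
    except ValueError:
        return False


print(format_answer("S6534835"))                        # task1 style
print(format_answer(60))                                # task2 style
print(format_answer())                                  # task3 / task8 style
print(format_answer(5.9, "2023-11-09T03:05:00+00:00"))  # task10 style
```

Using the serializer keeps strings double-quoted and floats undisturbed, which matters for the task6 requirement to keep decimals.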

Results Structure

results/
├── v1_20251126_120000/
│   ├── agent_results.json    # Agent's submitted answers
│   └── evaluation.json       # Official evaluation results
└── v2_20251126_130000/
    └── ...

Key Parameters

| Parameter      | Value                         |
|----------------|-------------------------------|
| FHIR Base      | http://localhost:8080/fhir/   |
| Reference Time | 2023-11-13T10:15:00+00:00     |
| 24h Filter     | ge2023-11-12T10:15:00+00:00   |
| 1 Year Ago     | 2022-11-13T10:15:00+00:00     |
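
The two derived values follow directly from the reference time. A short sketch of that arithmetic is shown here; the variable and function names are illustrative, and `ge` is the standard FHIR search prefix for "greater than or equal".

```python
from datetime import datetime, timedelta

REFERENCE_TIME = datetime.fromisoformat("2023-11-13T10:15:00+00:00")

# Start of the 24-hour lookback window used by the 24h tasks.
window_start = REFERENCE_TIME - timedelta(hours=24)
date_filter = f"ge{window_start.isoformat()}"

# Same calendar date one year earlier, used for the "old result" check.
one_year_ago = REFERENCE_TIME.replace(year=REFERENCE_TIME.year - 1)


def in_last_24h(timestamp: str) -> bool:
    """True if an ISO 8601 timestamp falls inside the 24h window."""
    t = datetime.fromisoformat(timestamp)
    return window_start <= t <= REFERENCE_TIME


print(date_filter)
print(one_year_ago.isoformat())
```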

Known Issues & Limitations ⚠️

1. MCP Resources Not Accessed by VS Code Copilot

The project implements MCP Resources for clinical knowledge (med://knowledge/*), but GitHub Copilot in VS Code does not automatically access MCP Resources - it only calls MCP Tools.

Impact: The rich clinical knowledge in .med_memory/knowledge/ is not utilized during benchmark execution.

Workaround: Knowledge hints are embedded in tool responses instead.

2. Patient Memory Not Utilized

The PatientMemory system (src/helpers/patient.py) is implemented but the agent never calls add_patient_note() during benchmark runs.

Impact: Important clinical observations are not persisted between tool calls.

Status: Working as designed, but underutilized.

3. Large FHIR Response Truncation

Some patients have hundreds of observations (e.g., 372 GLU records). FHIR API responses over ~100KB may be truncated.

Impact: Latest values may be missed for data-heavy patients (observed in task7_24, task7_30).

Mitigation: Pagination support (offset/page_size parameters) has been added but is not yet tested.
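
A paging loop of the following shape would avoid depending on a single large response. This is a sketch under assumptions: `fetch_page(offset, page_size)` is a hypothetical callable standing in for whatever the project's pagination exposes, and observations are assumed to carry the FHIR `effectiveDateTime` field.

```python
from datetime import datetime
from typing import Callable, Iterable, Iterator, Optional

Page = list[dict]


def iter_observations(fetch_page: Callable[[int, int], Page],
                      page_size: int = 50) -> Iterator[dict]:
    """Yield every observation by walking pages until a short page is returned."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:  # last (possibly empty) page reached
            return
        offset += page_size


def latest_observation(observations: Iterable[dict]) -> Optional[dict]:
    """Return the observation with the newest effectiveDateTime, or None if empty."""
    return max(
        observations,
        key=lambda o: datetime.fromisoformat(o["effectiveDateTime"]),
        default=None,
    )
```

Because the latest value is chosen by comparing timestamps across all pages, the result does not depend on the order in which the server returns records.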

4. POST History Recording

POST operations are recorded correctly, but the recording happens in the FHIR client layer, so the evaluation step may not see the POST history unless the two are explicitly integrated.

Task Difficulty Classification 📊

We classify task difficulty based on Agent processing steps (not API calls). API queries are handled by the FHIR server and don't count as agent steps.

| Task    | Description     | Agent Steps                                                | Difficulty |
|---------|-----------------|------------------------------------------------------------|------------|
| Task 1  | Patient Search  | 1 (return MRN)                                             | Easy       |
| Task 2  | Age Calculation | 2 (get patient → calc age)                                 | Easy       |
| Task 3  | Record BP       | 1 (POST BP)                                                | Easy       |
| Task 4  | Query Magnesium | 2 (get labs → find latest)                                 | Easy       |
| Task 5  | Mg Replacement  | 3 (get Mg → check threshold → conditional POST)            | Medium     |
| Task 6  | Average Glucose | 3 (get labs → filter 24h → calc average)                   | Medium     |
| Task 7  | Latest CBG      | 3 (get patient → get labs → sort & find latest)            | Medium     |
| Task 8  | Ortho Referral  | 2 (compose SBAR → POST)                                    | Easy       |
| Task 9  | K Replacement   | 4 (get K → check → POST med → POST lab recheck)            | Hard       |
| Task 10 | HbA1C Check     | 4 (get A1C → check date/value → conditional POST → return) | Hard       |

Classification Criteria:

  • Easy: 1-2 agent steps
  • Medium: 3 agent steps
  • Hard: 4+ agent steps

⚠️ Note: This classification is our own interpretation. The official MedAgentBench paper reports average steps of 2.3±1.3 but does not provide per-task difficulty labels.
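
The criteria above reduce to a simple threshold rule. The sketch below encodes the step counts from the table and applies that rule; the names `AGENT_STEPS` and `classify` are illustrative and not taken from the project.

```python
# Agent step counts per task type, copied from the classification table.
AGENT_STEPS = {1: 1, 2: 2, 3: 1, 4: 2, 5: 3, 6: 3, 7: 3, 8: 2, 9: 4, 10: 4}


def classify(steps: int) -> str:
    """Map an agent step count to Easy (1-2), Medium (3), or Hard (4+)."""
    if steps <= 2:
        return "Easy"
    if steps == 3:
        return "Medium"
    return "Hard"


difficulty = {task: classify(steps) for task, steps in AGENT_STEPS.items()}
print(difficulty)
```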

Related Projects

License

MIT License - See


Project Structure

medagent-copilot/
├── .med_memory/              # Agent memory system
│   ├── CONSTITUTION.md       # 🔒 Agent constitution (rules and formats)
│   ├── knowledge/            # 📚 Medical knowledge base
│   │   ├── clinical_knowledge.md
│   │   ├── fhir_functions.md
│   │   └── task_instructions.md
│   └── patient_context/      # 🔐 Patient context memory (isolated)
├── src/
│   ├── mcp_server.py         # MCP server entry point
│   ├── config.py             # Configuration
│   ├── fhir/                 # FHIR tool module
│   │   ├── client.py         # FHIR API client (with POST history tracking)
│   │   └── tools.py          # FHIR MCP tools
│   ├── tasks/                # Task management module
│   │   ├── tools.py          # Task MCP tools
│   │   └── state.py          # Task state tracking
│   └── helpers/              # Helper utilities
│       ├── reminder.py       # Format reminder system
│       └── patient.py        # Patient memory management
├── docs/                     # Documentation
│   └── RESULT_FORMAT.md      # Result JSON format specification
├── results/                  # Evaluation results
├── evaluate_with_official.py # Official evaluation script
└── requirements.txt


Author