👁️ OSIRIS: Vision-Based GUI Agent (MCP)

Osiris is an Autonomous Computer Use Agent powered by the Model Context Protocol (MCP).
Unlike standard agents that rely on accessibility APIs or HTML DOM (which fail on legacy apps, remote desktops, and games), Osiris uses Pure Vision. It implements Microsoft's OmniParser research to "see" the screen, detect interactable elements (buttons, icons, text fields), and map them to pixel-perfect coordinates.
It enables LLMs (like Gemini Flash) to control any software just like a human does: by looking and clicking.
🎥 Demo
In this demo, Osiris autonomously opens WhatsApp, searches for a specific group, and sends a message based on a high-level natural language command.
🧠 How It Works
Osiris is built on a modular Vision-Language-Action pipeline:
1. **The Eye (Vision Core)**
   - Uses a fine-tuned YOLOv8 model (Microsoft OmniParser) to detect UI elements.
   - Uses EasyOCR/TrOCR to read the text inside buttons and fields.
   - Merges these into a structured "UI Tree" with bounding boxes.
2. **The Protocol (MCP Server)**
   - Exposes the vision logic as an MCP tool (`parse_screen`).
   - This allows any MCP-compliant client (Claude Desktop, Cursor, custom agents) to "plug in" vision capabilities instantly.
3. **The Brain (Autonomous Loop)**
   - A Python agent loop powered by Gemini Flash.
   - Maintains a history of actions, reasons about the screen state, and executes mouse/keyboard commands via `PyAutoGUI`.
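The pipeline above can be sketched in a few lines. This is an illustrative toy, not Osiris's actual API: `bbox_center` and `choose_element` are hypothetical helper names, and the hard-coded "UI tree" stands in for real OmniParser + OCR output.

```python
# Illustrative sketch of the vision-to-action step: the vision core emits
# labeled bounding boxes, and the agent turns the LLM's chosen element into
# a pixel-perfect click target. Names here are hypothetical.

def bbox_center(bbox):
    """Map an (x1, y1, x2, y2) bounding box to its center click coordinate."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)

def choose_element(ui_tree, label):
    """Pick the element whose OCR label matches the target the LLM chose."""
    return next(e for e in ui_tree if e["label"] == label)

# A tiny "UI tree" shaped like what the vision core might emit.
ui_tree = [
    {"label": "Search", "type": "text_field", "bbox": (10, 40, 310, 70)},
    {"label": "Send",   "type": "icon",       "bbox": (320, 40, 360, 70)},
]

target = choose_element(ui_tree, "Send")
x, y = bbox_center(target["bbox"])
# pyautogui.click(x, y)  # the real agent would click here
print(x, y)  # 340 55
```

Grounding actions in pixel coordinates rather than DOM selectors is what lets this approach work on legacy apps, remote desktops, and games alike.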
📸 Visual Debugger
We include a Streamlit Dashboard to visualize exactly what the AI sees.
Left: Original Screenshot | Right: Osiris Detection (Red=Icons, Green=Text Buttons)
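The overlay logic behind that view is straightforward: one colored rectangle per detection. Below is a minimal sketch using Pillow, with the color scheme taken from the legend above; the function and type names are hypothetical, not the dashboard's actual code.

```python
# Minimal sketch of the debugger overlay: draw one colored bounding box per
# detected element (red for icons, green for text buttons, per the legend).
from PIL import Image, ImageDraw

COLORS = {"icon": "red", "text_button": "green"}

def draw_detections(screenshot, elements):
    """Return a copy of the screenshot with detection boxes drawn on it."""
    out = screenshot.copy()
    draw = ImageDraw.Draw(out)
    for el in elements:
        draw.rectangle(el["bbox"], outline=COLORS[el["type"]], width=2)
    return out

# Demo on a blank "screenshot" with a single fake icon detection.
img = Image.new("RGB", (200, 100), "white")
overlay = draw_detections(img, [{"type": "icon", "bbox": (10, 10, 60, 40)}])
```

Drawing on a copy keeps the original screenshot untouched, so the dashboard can show the two images side by side.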
🛠️ Installation
1. Clone the Repository
```bash
git clone https://github.com/yourusername/osiris.git
cd osiris
```
2. Install Dependencies
We recommend using a virtual environment.
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
3. Download Model Weights
This script downloads the OmniParser YOLO weights from HuggingFace.
```bash
python setup.py
```
4. Configure Keys
Create a .env file in the root directory:
```
GOOGLE_API_KEY=your_gemini_api_key_here
```
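A `.env` file is just `KEY=VALUE` lines. Libraries such as python-dotenv parse it for you; purely for illustration (this is an assumption, not how Osiris necessarily loads the key — check `agent.py`), a minimal stdlib reader looks like this:

```python
# Hypothetical sketch: read KEY=VALUE lines from a .env file into os.environ.
import os

def load_env(path=".env"):
    """Load environment variables from a .env file, skipping comments."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage (commented out so the sketch runs without a .env file present):
# load_env()
# api_key = os.environ["GOOGLE_API_KEY"]
```

`setdefault` means a key already exported in your shell wins over the file, which is the usual precedence convention.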
🚀 Usage
1. Run the Autonomous Agent
This starts the loop. You will be prompted to enter a goal.
```bash
python agent.py
```
Example Goal: "Open Notepad, type a poem about AI, and save it."
2. Run the Visual Debugger
This runs a Streamlit app that visualizes the AI's vision.
```bash
streamlit run app.py
```
3. Run as an MCP Server
If you want to connect Osiris to Claude Desktop or another MCP client:
```bash
python -m osiris.server
```
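Conceptually, "running as an MCP server" means advertising a named tool (`parse_screen`) that any client can invoke by name. The toy registry below only illustrates that pattern; the real server would use the MCP SDK, and everything here (handler body, registry shape) is hypothetical.

```python
# Toy illustration of the MCP tool pattern: a server registers named tools
# with descriptions, and a client invokes them by name. Not the real SDK.

def parse_screen():
    """Stand-in for the vision tool: returns detected UI elements."""
    return [{"label": "OK", "type": "text_button", "bbox": [5, 5, 40, 25]}]

TOOLS = {
    "parse_screen": {
        "description": "Detect interactable UI elements on the current screen",
        "handler": parse_screen,
    }
}

def call_tool(name, **kwargs):
    """How a client's tool invocation resolves to a server-side handler."""
    return TOOLS[name]["handler"](**kwargs)

result = call_tool("parse_screen")
```

Because the contract is just "name in, structured result out," the same vision tool serves Claude Desktop, Cursor, or a custom agent without modification.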
🎥 Demo Video
A demo video is available at