👁️ OSIRIS: Vision-Based GUI Agent (MCP)

Osiris is an Autonomous Computer Use Agent powered by the Model Context Protocol (MCP).
Unlike standard agents that rely on accessibility APIs or HTML DOM (which fail on legacy apps, remote desktops, and games), Osiris uses Pure Vision. It implements Microsoft's OmniParser research to "see" the screen, detect interactable elements (buttons, icons, text fields), and map them to pixel-perfect coordinates.
It enables LLMs (like Gemini Flash) to control any software just like a human does: by looking and clicking.
🎥 Demo
In this demo, Osiris autonomously opens WhatsApp, searches for a specific group, and sends a message based on a high-level natural language command.
🧠 How It Works
Osiris is built on a modular Vision-Language-Action pipeline:
1. **The Eye (Vision Core)**
   - Uses a fine-tuned YOLOv8 model (Microsoft OmniParser) to detect UI elements.
   - Uses EasyOCR/TrOCR to read the text inside buttons and fields.
   - Merges these into a structured "UI Tree" with bounding boxes.
2. **The Protocol (MCP Server)**
   - Exposes the vision logic as an MCP tool (`parse_screen`).
   - This allows any MCP-compliant client (Claude Desktop, Cursor, custom agents) to "plug in" vision capabilities instantly.
3. **The Brain (Autonomous Loop)**
   - A Python agent loop powered by Gemini Flash.
   - Maintains a history of actions, reasons about the screen state, and executes mouse/keyboard commands via `PyAutoGUI`.
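The pipeline above can be sketched in a few lines. This is an illustrative toy, not Osiris's actual API: `bbox_center` and `choose_element` are hypothetical helper names, and the hard-coded "UI tree" stands in for real OmniParser + OCR output.

```python
# Illustrative sketch of the vision-to-action step: the vision core emits
# labeled bounding boxes, and the agent turns the LLM's chosen element into
# a pixel-perfect click target. Names here are hypothetical.

def bbox_center(bbox):
    """Map an (x1, y1, x2, y2) bounding box to its center click coordinate."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)

def choose_element(ui_tree, label):
    """Pick the element whose OCR label matches the target the LLM chose."""
    return next(e for e in ui_tree if e["label"] == label)

# A tiny "UI tree" shaped like what the vision core might emit.
ui_tree = [
    {"label": "Search", "type": "text_field", "bbox": (10, 40, 310, 70)},
    {"label": "Send",   "type": "icon",       "bbox": (320, 40, 360, 70)},
]

target = choose_element(ui_tree, "Send")
x, y = bbox_center(target["bbox"])
# pyautogui.click(x, y)  # the real agent would click here
print(x, y)  # 340 55
```

Grounding actions in pixel coordinates rather than DOM selectors is what lets this approach work on legacy apps, remote desktops, and games alike.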
📸 Visual Debugger
We include a Streamlit Dashboard to visualize exactly what the AI sees.
Left: Original Screenshot | Right: Osiris Detection (Red=Icons, Green=Text Buttons)
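The overlay logic behind that view is straightforward: one colored rectangle per detection. Below is a minimal sketch using Pillow, with the color scheme taken from the legend above; the function and type names are hypothetical, not the dashboard's actual code.

```python
# Minimal sketch of the debugger overlay: draw one colored bounding box per
# detected element (red for icons, green for text buttons, per the legend).
from PIL import Image, ImageDraw

COLORS = {"icon": "red", "text_button": "green"}

def draw_detections(screenshot, elements):
    """Return a copy of the screenshot with detection boxes drawn on it."""
    out = screenshot.copy()
    draw = ImageDraw.Draw(out)
    for el in elements:
        draw.rectangle(el["bbox"], outline=COLORS[el["type"]], width=2)
    return out

# Demo on a blank "screenshot" with a single fake icon detection.
img = Image.new("RGB", (200, 100), "white")
overlay = draw_detections(img, [{"type": "icon", "bbox": (10, 10, 60, 40)}])
```

Drawing on a copy keeps the original screenshot untouched, so the dashboard can show the two images side by side.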
🛠️ Installation
1. Clone the Repository
```bash
git clone https://github.com/yourusername/osiris.git
cd osiris
```
2. Install Dependencies
We recommend using a virtual environment.
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
3. Download Model Weights
This script downloads the OmniParser YOLO weights from HuggingFace.
```bash
python setup.py
```
4. Configure Keys
Create a .env file in the root directory:
```
GOOGLE_API_KEY=your_gemini_api_key_here
```
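A `.env` file is just `KEY=VALUE` lines. Libraries such as python-dotenv parse it for you; purely for illustration (this is an assumption, not how Osiris necessarily loads the key — check `agent.py`), a minimal stdlib reader looks like this:

```python
# Hypothetical sketch: read KEY=VALUE lines from a .env file into os.environ.
import os

def load_env(path=".env"):
    """Load environment variables from a .env file, skipping comments."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage (commented out so the sketch runs without a .env file present):
# load_env()
# api_key = os.environ["GOOGLE_API_KEY"]
```

`setdefault` means a key already exported in your shell wins over the file, which is the usual precedence convention.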
🚀 Usage
1. Run the Autonomous Agent
This starts the loop. You will be prompted to enter a goal.
```bash
python agent.py
```
Example Goal: "Open Notepad, type a poem about AI, and save it."
2. Run the Visual Debugger
This runs a Streamlit app that visualizes the AI's vision.
```bash
streamlit run app.py
```
3. Run as an MCP Server
If you want to connect Osiris to Claude Desktop or another MCP client:
```bash
python -m osiris.server
```
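Conceptually, "running as an MCP server" means advertising a named tool (`parse_screen`) that any client can invoke by name. The toy registry below only illustrates that pattern; the real server would use the MCP SDK, and everything here (handler body, registry shape) is hypothetical.

```python
# Toy illustration of the MCP tool pattern: a server registers named tools
# with descriptions, and a client invokes them by name. Not the real SDK.

def parse_screen():
    """Stand-in for the vision tool: returns detected UI elements."""
    return [{"label": "OK", "type": "text_button", "bbox": [5, 5, 40, 25]}]

TOOLS = {
    "parse_screen": {
        "description": "Detect interactable UI elements on the current screen",
        "handler": parse_screen,
    }
}

def call_tool(name, **kwargs):
    """How a client's tool invocation resolves to a server-side handler."""
    return TOOLS[name]["handler"](**kwargs)

result = call_tool("parse_screen")
```

Because the contract is just "name in, structured result out," the same vision tool serves Claude Desktop, Cursor, or a custom agent without modification.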
🎥 Demo Video
A demo video is available at