vision-agent-mcp

landing-ai/vision-agent-mcp

VisionAgent MCP Server is a lightweight, side-car server that facilitates communication between MCP-compatible clients and Landing AI’s VisionAgent REST APIs, enabling natural-language computer-vision and document-analysis commands.

The VisionAgent MCP Server connects LLM agents to external tools through the Model Context Protocol (MCP). It runs as a local server and translates tool calls from MCP-compatible clients into authenticated HTTPS requests to Landing AI’s VisionAgent REST APIs. Users can therefore issue natural-language commands for computer vision and document analysis directly from their editors, without writing custom REST code or installing additional SDKs. Supported use cases include document analysis, object detection, instance segmentation, activity recognition, and depth estimation. The server communicates with the client locally over STDIN/STDOUT, and outputs such as JSON responses and images are streamed back to the client for further processing or visualization.
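The translation step described above can be pictured with a small sketch. It is not the server's actual code: the base URL, the URL layout, the `Bearer` authorization scheme, and the `imagePath` argument name are all hypothetical placeholders chosen for illustration. The function only builds a description of the request and performs no network I/O.

```typescript
// A tool call as received from an MCP client: a tool name plus its arguments.
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// A plain description of the outgoing HTTPS request (nothing is sent here).
interface HttpRequestSpec {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical base URL; the real VisionAgent endpoints are not given in this document.
const BASE_URL = "https://api.example.invalid/v1/tools";

// Maps a tool call to an authenticated request description.
// Assumption: one endpoint per tool and a Bearer token header.
function toHttpRequest(call: ToolCall, apiKey: string): HttpRequestSpec {
  if (apiKey === "") {
    throw new Error("VISION_AGENT_API_KEY is not set");
  }
  return {
    method: "POST",
    url: `${BASE_URL}/${encodeURIComponent(call.name)}`,
    headers: {
      Authorization: `Bearer ${apiKey}`, // hypothetical auth scheme
      "Content-Type": "application/json",
    },
    body: JSON.stringify(call.arguments),
  };
}
```

Keeping the mapping a pure function makes it easy to inspect what would be sent for a given tool call before any request leaves the machine.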

Features

  • Supports natural-language commands for computer vision and document analysis.
  • Runs as a local server that communicates with the client over STDIN/STDOUT.
  • Translates MCP-compatible client calls into authenticated HTTPS requests.
  • Streams JSON responses and images back to the client for visualization.
  • Facilitates integration with various MCP-compatible clients like Claude Desktop and Cursor.

Usage

npx with VS Code

{
  "mcpServers": {
    "VisionAgent": {
      "command": "npx",
      "args": ["vision-tools-mcp"],
      "env": {
        "VISION_AGENT_API_KEY": "<YOUR_API_KEY>",
        "OUTPUT_DIRECTORY": "/path/to/output/directory",
        "IMAGE_DISPLAY_ENABLED": "true"
      }
    }
  }
}

node with VS Code

{
  "mcpServers": {
    "VisionAgent": {
      "command": "node",
      "args": [
        "/path/to/build/index.js"
      ],
      "env": {
        "VISION_AGENT_API_KEY": "<YOUR_API_KEY>",
        "OUTPUT_DIRECTORY": "../../output",
        "IMAGE_DISPLAY_ENABLED": "true"
      }
    }
  }
}
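
Before pointing a client at one of these configurations, it can help to check that the placeholder values were actually replaced. The following is a minimal sketch of such a check. The key names come from the examples above, but treating `VISION_AGENT_API_KEY` as required and `IMAGE_DISPLAY_ENABLED` as a `"true"`/`"false"` string are assumptions, and `checkVisionAgentEntry` is a hypothetical helper, not part of the server.

```typescript
// Shape of the configuration blocks shown above.
interface ServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface McpConfig {
  mcpServers: Record<string, ServerEntry>;
}

// Assumption: the API key is the only key the server cannot run without.
const REQUIRED_ENV_KEYS = ["VISION_AGENT_API_KEY"] as const;

// Returns a list of human-readable problems; an empty list means none were found.
function checkVisionAgentEntry(config: McpConfig, name = "VisionAgent"): string[] {
  const entry = config.mcpServers[name];
  if (entry === undefined) {
    return [`no server entry named "${name}"`];
  }
  const problems: string[] = [];
  const env = entry.env ?? {};
  for (const key of REQUIRED_ENV_KEYS) {
    const value = env[key];
    if (value === undefined || value === "" || value === "<YOUR_API_KEY>") {
      problems.push(`${key} is missing or still a placeholder`);
    }
  }
  // Assumption: the flag is passed as the string "true" or "false".
  const display = env["IMAGE_DISPLAY_ENABLED"];
  if (display !== undefined && display !== "true" && display !== "false") {
    problems.push('IMAGE_DISPLAY_ENABLED should be the string "true" or "false"');
  }
  return problems;
}
```

Calling `checkVisionAgentEntry` on the parsed JSON file returns an empty array when the entry looks complete, and otherwise one message per problem found.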

Tools

  1. agentic-document-analysis

    Parses PDFs/images to extract text, tables, charts, and diagrams.

  2. text-to-object-detection

    Detects objects from free-form text prompts and outputs bounding boxes.

  3. text-to-instance-segmentation

    Provides pixel-perfect masks for images.

  4. activity-recognition

    Recognizes multiple activities in video with start/end timestamps.

  5. depth-pro

    High-resolution monocular depth estimation for single images.
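
For readers curious what a client sends when it invokes one of these tools, here is a minimal sketch of an MCP `tools/call` request serialized for a stdio transport, where each JSON-RPC message occupies one line. The envelope (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follows the MCP specification; the argument names `imagePath` and `prompts` are hypothetical, since the tools' input schemas are not listed here.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Builds the request object for a given tool name and argument map.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Serializes a message for stdio: one JSON document followed by a newline.
// JSON.stringify without an indent argument never emits raw newlines.
function frameForStdio(message: ToolCallRequest): string {
  return JSON.stringify(message) + "\n";
}

// Example: ask the detection tool to find forklifts in an image.
const exampleRequest = buildToolCall(1, "text-to-object-detection", {
  imagePath: "/path/to/image.jpg", // hypothetical argument name
  prompts: ["forklift"], // hypothetical argument name
});
const exampleLine = frameForStdio(exampleRequest);
```

In practice an MCP client such as Claude Desktop or Cursor builds and sends these messages itself; the sketch only shows the shape of what travels over STDIN.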