vision-agent-mcp by landing-ai - MCP Server

The VisionAgent MCP Server connects MCP-compatible LLM clients to Landing AI’s VisionAgent REST APIs through the Model Context Protocol (MCP). It runs as a local server and translates each tool call from the client into an authenticated HTTPS request. Users can therefore issue natural-language commands for computer vision and document analysis directly from their editors, without writing custom REST code or installing additional SDKs. Supported use cases include document analysis, object detection, instance segmentation, activity recognition, and depth estimation. The server communicates with the client over STDIN/STDOUT rather than a network port, and results such as JSON responses and images are streamed back to the client for further processing or visualization.
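
To make the tool-call flow concrete, here is a minimal sketch of the message an MCP client writes to the server's STDIN. MCP frames requests as JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and an `arguments` object. The helper name `build_tool_call` and the argument keys `imagePath` and `prompts` are assumptions for illustration only; the real input schema is whatever the server reports in its `tools/list` response.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as one newline-delimited JSON-RPC message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # The stdio transport sends one JSON message per line, so the
    # serialized message must not contain embedded newlines.
    return json.dumps(message) + "\n"


# Hypothetical arguments: the key names below are NOT taken from the server's schema.
line = build_tool_call(
    1,
    "text-to-object-detection",
    {"imagePath": "/path/to/image.jpg", "prompts": ["car"]},
)
print(line, end="")
```

In practice the MCP client (for example the editor) builds and sends this message itself; the sketch only shows what the server receives before it issues the HTTPS request.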

Features

Supports natural-language commands for computer vision and document analysis.
Runs as a local server that communicates with the client over STDIN/STDOUT.
Translates MCP-compatible client calls into authenticated HTTPS requests.
Streams JSON responses and images back to the client for visualization.
Works with MCP-compatible clients such as Claude Desktop and Cursor.
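
As a rough illustration of how a client might consume the streamed results, the sketch below separates the text and image items of an MCP `tools/call` result. In MCP, a tool result carries a `content` list whose items have a `type` such as `text` or `image`, with image bytes base64-encoded in `data`. The function name `split_tool_result` and the sample payload are made up for this example and are not real output from the VisionAgent server.

```python
import base64
import json


def split_tool_result(result: dict):
    """Separate the text items and the decoded image items of a tools/call result."""
    texts, images = [], []
    for item in result.get("content", []):
        if item.get("type") == "text":
            texts.append(item["text"])
        elif item.get("type") == "image":
            # Image payloads are base64-encoded; decode them to raw bytes.
            images.append((item["mimeType"], base64.b64decode(item["data"])))
    return texts, images


# Fabricated sample result, used only to exercise the function.
sample = {
    "content": [
        {"type": "text", "text": json.dumps({"detections": []})},
        {
            "type": "image",
            "mimeType": "image/png",
            "data": base64.b64encode(b"\x89PNG").decode("ascii"),
        },
    ]
}

texts, images = split_tool_result(sample)
```

A client could then parse each text item as JSON and write each image's bytes to a file for display.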

Usage

npx with VS Code

{
  "mcpServers": {
    "VisionAgent": {
      "command": "npx",
      "args": ["vision-tools-mcp"],
      "env": {
        "VISION_AGENT_API_KEY": "<YOUR_API_KEY>",
        "OUTPUT_DIRECTORY": "/path/to/output/directory",
        "IMAGE_DISPLAY_ENABLED": "true"
      }
    }
  }
}

node with VS Code

{
  "mcpServers": {
    "VisionAgent": {
      "command": "node",
      "args": [
        "/path/to/build/index.js"
      ],
      "env": {
        "VISION_AGENT_API_KEY": "<YOUR_API_KEY>",
        "OUTPUT_DIRECTORY": "../../output",
        "IMAGE_DISPLAY_ENABLED": "true"
      }
    }
  }
}
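
The two configurations above differ only in the launch command and its arguments, so they can be generated from one helper. The sketch below is one possible way to do that, not part of the project: the function name `vision_agent_entry` is hypothetical, the API key is read from the caller's environment with the placeholder as fallback, and the output directory is converted to an absolute path on the assumption that a relative value such as `../../output` would be resolved against whatever working directory the client launches the server in.

```python
import json
import os


def vision_agent_entry(command, args, output_dir, show_images=True):
    """Build one "mcpServers" entry with the environment variables shown above."""
    # Fall back to the placeholder when the key is not set in the environment.
    api_key = os.environ.get("VISION_AGENT_API_KEY", "<YOUR_API_KEY>")
    return {
        "command": command,
        "args": list(args),
        "env": {
            "VISION_AGENT_API_KEY": api_key,
            # Assumption: an absolute path avoids depending on the client's working directory.
            "OUTPUT_DIRECTORY": os.path.abspath(output_dir),
            # Environment variables are strings, so the flag is written as "true"/"false".
            "IMAGE_DISPLAY_ENABLED": "true" if show_images else "false",
        },
    }


config = {
    "mcpServers": {
        "VisionAgent": vision_agent_entry("npx", ["vision-tools-mcp"], "output"),
    }
}
print(json.dumps(config, indent=2))
```

For the node variant, the same helper can be called as `vision_agent_entry("node", ["/path/to/build/index.js"], "output")`.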

Tools

agentic-document-analysis
Parses PDFs and images to extract text, tables, charts, and diagrams.
text-to-object-detection
Detects objects described by free-form text prompts and outputs bounding boxes.
text-to-instance-segmentation
Produces pixel-level instance masks for objects in images.
activity-recognition
Recognizes multiple activities in video and reports start/end timestamps.
depth-pro
Performs high-resolution monocular depth estimation on single images.
