DINO-X-MCP by IDEA-Research - MCP Server

DINO-X MCP is an MCP server that gives large language models fine-grained object detection and image understanding. It integrates the DINO-X and Grounding DINO 1.6 APIs, which provide precise object localization and structured outputs for visual content. The server is most useful where multimodal models alone fall short on precise image analysis. With DINO-X MCP, users can obtain accurate object counts, positions, and attributes, and can combine the server with other MCP servers to build multi-step visual workflows. Typical uses include visual question answering and natural language-driven visual agents for real-world automation.

Features

Fine-grained image understanding with full-scene recognition and targeted detection.
Accurate object count, position, and attribute detection for visual question answering.
Integration with other MCP servers for multi-step visual workflows.
Natural language-driven visual agents for real-world automation.
Support for various image formats and remote URLs.
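
To illustrate how structured detections support counting and localization for visual question answering, here is a minimal sketch that tallies objects per category and computes box centers. The record layout (`category`, `bbox` as `[x_min, y_min, x_max, y_max]` in pixels, `score`) and the sample values are assumptions for illustration, not the server's documented output schema.

```python
from collections import Counter

# Hypothetical detection records; field names and values are illustrative only.
detections = [
    {"category": "person", "bbox": [10, 20, 110, 220], "score": 0.91},
    {"category": "person", "bbox": [200, 30, 280, 210], "score": 0.84},
    {"category": "dog", "bbox": [120, 150, 190, 230], "score": 0.77},
]


def count_by_category(dets, min_score=0.0):
    """Count detections per category, ignoring those below min_score."""
    return Counter(d["category"] for d in dets if d["score"] >= min_score)


def box_center(bbox):
    """Return the (x, y) center of an [x_min, y_min, x_max, y_max] box."""
    x_min, y_min, x_max, y_max = bbox
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


# Answer "how many people are in the image?" using only confident detections.
counts = count_by_category(detections, min_score=0.8)
# Locate every detected object by the center of its bounding box.
centers = [box_center(d["bbox"]) for d in detections]
```

A language model could use a tally like `counts` to answer "how many" questions and the centers to answer "where" questions, without having to estimate either from raw pixels.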

Tools

detect-all-objects
Detects and localizes all recognizable objects in an image.
object-detection-by-text
Detects and localizes objects in an image based on a natural language prompt.
detect-human-pose-keypoints
Detects 17 human body keypoints per person in an image for pose estimation.
visualize-detections
Visualizes detection results by drawing bounding boxes and labels on the image.
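
MCP clients invoke tools such as these through the JSON-RPC 2.0 `tools/call` method. The sketch below shows roughly what such a request could look like for `object-detection-by-text`. The argument names `image_url` and `text_prompt` and the example URL are hypothetical; the server's actual input schema should be read from its `tools/list` response.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "object-detection-by-text",
    # Hypothetical argument names and URL, used here for illustration only.
    {"image_url": "https://example.com/street.jpg", "text_prompt": "person"},
)

# Serialize the request as it would be sent over the MCP transport.
payload = json.dumps(request)
```

In practice an MCP client library handles this framing, so the sketch is mainly useful for seeing which tool name and arguments a model has to supply.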
