patelnav/mcp-vision
mcp-vision
Dead-simple MCP server for vision analysis with Google Gemini Flash-Lite.
What it does
Exposes a single MCP tool that sends your images + a single instruction string straight to Google Gemini Flash-Lite and returns the model's raw text answer.
- One tool, one job: vision.analyze
- Backend: Google AI Studio or Vertex AI (your choice)
- Default model: models/gemini-flash-lite-latest
- Modes: Text + images (no audio/video in v1)
Installation
npm install
npm run build
Configuration
Copy .env.example to .env and configure:
Option 1: AI Studio (Recommended for simplicity)
GEMINI_PROVIDER=ais
GEMINI_API_KEY=your_api_key_here
Get your API key at: https://aistudio.google.com/app/apikey
Option 2: Vertex AI
GEMINI_PROVIDER=vertex
GOOGLE_CLOUD_PROJECT=your-project-id
GEMINI_LOCATION=us-central1
Auth options (any one works):
- Application Default Credentials (recommended): set GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json or run gcloud auth application-default login
- User credentials: run gcloud auth login
Token resolution order used by the server:
- If installed, use google-auth-library to acquire an ADC token (no gcloud required)
- gcloud auth application-default print-access-token
- gcloud auth print-access-token
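The resolution order above is a fallback chain: try each credential source in turn and use the first one that yields a token. A minimal sketch of that pattern follows. The TokenSource type and the resolveAccessToken function are hypothetical names for illustration, not the server's actual code, and the real sources (google-auth-library, the two gcloud commands) are left to the caller to supply.

```typescript
// Hypothetical shape: any async function that resolves to an access token or throws.
type TokenSource = { name: string; getToken: () => Promise<string> };

// Try each source in order and return the first non-empty token.
// Failures are collected so the final error explains what was attempted.
async function resolveAccessToken(
  sources: TokenSource[],
): Promise<{ source: string; token: string }> {
  const failures: string[] = [];
  for (const { name, getToken } of sources) {
    try {
      const token = (await getToken()).trim();
      if (token) return { source: name, token };
      failures.push(`${name}: empty token`);
    } catch (err) {
      failures.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No access token available. Tried ${failures.join("; ")}`);
}
```

In this sketch a caller would pass three sources in the documented order, so a machine without gcloud still works as long as ADC is available.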
Optional Settings
# Use a different model
GEMINI_MODEL=models/gemini-flash-lite-latest
# Auto-resize images - DEFAULT is 2048px (set to 0 to disable)
VISION_MAX_LONG_EDGE=2048
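To make the settings above concrete, here is a sketch of how they could be read and validated. The loadConfig and requireVar functions and the VisionConfig type are hypothetical and not taken from the server's source. The defaults for GEMINI_MODEL and VISION_MAX_LONG_EDGE come from this README; treating GEMINI_PROVIDER, GOOGLE_CLOUD_PROJECT, and GEMINI_LOCATION as required is an assumption.

```typescript
type Env = Record<string, string | undefined>;

interface BaseConfig {
  model: string;
  maxLongEdge: number; // 0 disables auto-resize
}

type VisionConfig =
  | (BaseConfig & { provider: "ais"; apiKey: string })
  | (BaseConfig & { provider: "vertex"; project: string; location: string });

// Return a trimmed, non-empty setting or throw with the variable name.
function requireVar(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) throw new Error(`Missing required setting: ${name}`);
  return value;
}

function loadConfig(env: Env): VisionConfig {
  const model = env.GEMINI_MODEL?.trim() || "models/gemini-flash-lite-latest";
  const rawEdge = env.VISION_MAX_LONG_EDGE?.trim() || "2048";
  const maxLongEdge = Number(rawEdge);
  if (!Number.isInteger(maxLongEdge) || maxLongEdge < 0) {
    throw new Error(`VISION_MAX_LONG_EDGE must be a non-negative integer, got "${rawEdge}"`);
  }

  // Assumption: the provider must be set explicitly; no default is documented.
  const provider = requireVar(env, "GEMINI_PROVIDER");
  if (provider === "ais") {
    return { provider: "ais", apiKey: requireVar(env, "GEMINI_API_KEY"), model, maxLongEdge };
  }
  if (provider === "vertex") {
    return {
      provider: "vertex",
      project: requireVar(env, "GOOGLE_CLOUD_PROJECT"),
      location: requireVar(env, "GEMINI_LOCATION"),
      model,
      maxLongEdge,
    };
  }
  throw new Error(`GEMINI_PROVIDER must be "ais" or "vertex", got "${provider}"`);
}
```

In a Node process this would be called as loadConfig(process.env). Taking the environment as a parameter keeps the function easy to test.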
Claude Desktop Setup
Add to your claude_desktop_config.json:
{
"mcpServers": {
"vision": {
"command": "node",
"args": ["/absolute/path/to/mcp-vision/dist/index.js"],
"env": {
"GEMINI_PROVIDER": "ais",
"GEMINI_API_KEY": "your_api_key_here"
}
}
}
}
Or using npx:
{
"mcpServers": {
"vision": {
"command": "npx",
"args": ["-y", "mcp-gemini-vision"],
"env": {
"GEMINI_PROVIDER": "ais",
"GEMINI_API_KEY": "your_api_key_here"
}
}
}
}
Usage
The tool accepts:
Input:
{
"images": "https://example.com/screenshot.png" | ["/path/to/img1.png", "data:image/png;base64,..."],
"instruction": "Natural language task for the screenshot(s)."
}
Output:
{
"text": "<Gemini raw text reply>"
}
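The input shape above allows images to be either a single string or an array of strings. A sketch of matching TypeScript types and a normalization step is shown here. AnalyzeInput, AnalyzeOutput, and normalizeInput are hypothetical names, and rejecting an empty image list or a blank instruction is an assumption rather than documented behavior.

```typescript
interface AnalyzeInput {
  images: string | string[];
  instruction: string;
}

interface AnalyzeOutput {
  text: string;
}

// Collapse the "string or array" form into a plain array and trim the instruction.
function normalizeInput(input: AnalyzeInput): { images: string[]; instruction: string } {
  const images = Array.isArray(input.images) ? input.images : [input.images];
  if (images.length === 0) throw new Error("At least one image is required");
  const instruction = input.instruction.trim();
  if (!instruction) throw new Error("instruction must be a non-empty string");
  return { images, instruction };
}
```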
Image formats supported
- HTTP(S) URLs: https://example.com/image.png
- File URLs: file:///absolute/path/to/image.png
- Absolute paths: /absolute/path/to/image.png
- Data URIs: data:image/png;base64,iVBORw0KG...
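The four reference formats above can be told apart by their prefix. The sketch below shows one way to do so. classifyImageRef and ImageRefKind are hypothetical names, and the sketch assumes POSIX-style absolute paths only, since Windows paths are not mentioned in this README.

```typescript
type ImageRefKind = "http" | "file-url" | "absolute-path" | "data-uri";

// Decide how an image reference should be loaded, based on its prefix.
function classifyImageRef(ref: string): ImageRefKind {
  if (/^https?:\/\//i.test(ref)) return "http";
  if (/^file:\/\//i.test(ref)) return "file-url";
  if (/^data:image\/[a-z0-9.+-]+;base64,/i.test(ref)) return "data-uri";
  if (ref.startsWith("/")) return "absolute-path"; // assumption: POSIX paths only
  // Truncate the echoed value so a long data URI does not flood the error message.
  throw new Error(`Unsupported image reference: ${ref.slice(0, 40)}`);
}
```

Classifying first keeps the later steps simple: only the "http" branch needs a network fetch, and only the path-based branches touch the filesystem.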
Example instructions
Overlap check:
"Return JSON {overlap:boolean, examples:[{text,bbox,reason}]} — do any borders overlap any text?"
Aesthetic analysis:
"In one sentence: does the hero feel cramped? If so, suggest one fix."
OCR:
"What does the toast say? Quote exactly."
Extract UI elements:
"Extract all visible button labels as a JSON array."
Whitespace rating:
"Rate hero whitespace 0–1; if <0.6, give exactly one fix."
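Because the tool returns the model's raw text, an instruction that asks for JSON still yields a string the caller has to parse. The sketch below is one client-side way to do that, with a fallback to the raw text when parsing fails. parseReply is a hypothetical helper and not part of the server. Stripping a surrounding Markdown code fence is an assumption about how the model may format its reply.

```typescript
// Built at runtime so this example contains no literal fence marker.
const FENCE = "`".repeat(3);

type ParsedReply = { ok: true; value: unknown } | { ok: false; raw: string };

// Try to parse the reply as JSON; on failure, hand back the untouched text.
function parseReply(text: string): ParsedReply {
  let body = text.trim();
  if (body.startsWith(FENCE) && body.endsWith(FENCE)) {
    // Drop the opening fence line (which may carry a language tag) and the closing fence.
    const firstNewline = body.indexOf("\n");
    if (firstNewline !== -1) {
      body = body.slice(firstNewline + 1, body.length - FENCE.length).trim();
    }
  }
  try {
    return { ok: true, value: JSON.parse(body) };
  } catch {
    return { ok: false, raw: text };
  }
}
```

For free-form instructions such as the one-sentence aesthetic check, the raw text is already the answer and no parsing is needed.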
How it works
- Normalize images: Accept URLs, file paths, file:// URLs, or data URIs
  - HTTP(S) URLs are fetched with timeout and validated
  - All images are validated as real images using sharp (prevents exfiltration)
  - MIME types derived from actual image format, not file extension
- Auto-resize: Images larger than 2048px (configurable) are automatically downscaled
- Call Gemini once: Build parts array with images + instruction text, with 60s timeout
- Return raw: Return exactly what Gemini sends back (no schema coercion)
- Error handling: Try/catch on JSON parsing with fallback to text for better diagnostics
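The auto-resize step comes down to a small piece of arithmetic: scale both dimensions by the same factor so the longer edge lands on the configured maximum. The sketch below shows that calculation only. fitLongEdge is a hypothetical name, the rounding choice is an assumption, and the actual pixel work in the server is presumably done by sharp.

```typescript
// Compute target dimensions so the longer edge is at most maxLongEdge.
// A maxLongEdge of 0 (or less) means resizing is disabled.
function fitLongEdge(
  width: number,
  height: number,
  maxLongEdge: number,
): { width: number; height: number } {
  const longEdge = Math.max(width, height);
  if (maxLongEdge <= 0 || longEdge <= maxLongEdge) return { width, height };
  const scale = maxLongEdge / longEdge;
  return {
    // Never round a very thin image down to zero pixels.
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}
```

With the default of 2048, a 4096x2048 screenshot becomes 2048x1024, while a 1000x800 image is left untouched because it is already within the limit.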
Security & Limits
- Image validation: All images validated with sharp.metadata() before upload (prevents arbitrary file exfiltration)
- Size limits: Max 18MB per image, max 10 images per request
- Timeouts: 60s for HTTP fetches and API calls
- Auto-resize: ON by default at 2048px (set VISION_MAX_LONG_EDGE=0 to disable, but validation still runs)
- Images + text only (no audio/video in v1)
For larger or frequently reused assets, consider the Gemini Files API (future enhancement).
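The size and count limits above can be checked before any upload is attempted. The following sketch shows one way to report every violation at once. findLimitViolations and the two constants are hypothetical names, and reading "18MB" as 18 * 1024 * 1024 bytes is an assumption.

```typescript
const MAX_IMAGES = 10;
const MAX_IMAGE_BYTES = 18 * 1024 * 1024; // assumption: "18MB" means 18 MiB

// Return a human-readable message for each limit that is exceeded.
// An empty array means the request is within limits.
function findLimitViolations(imageByteSizes: number[]): string[] {
  const problems: string[] = [];
  if (imageByteSizes.length > MAX_IMAGES) {
    problems.push(`Too many images: ${imageByteSizes.length} (max ${MAX_IMAGES})`);
  }
  imageByteSizes.forEach((size, index) => {
    if (size > MAX_IMAGE_BYTES) {
      problems.push(`Image ${index} is ${size} bytes (max ${MAX_IMAGE_BYTES})`);
    }
  });
  return problems;
}
```

Collecting all problems instead of failing on the first one gives the caller a single, complete error message to act on.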
Development
npm run dev # Watch mode
npm run build # Compile TypeScript
npm start # Run compiled server
License
MIT