Byte Vision MCP

A Model Context Protocol (MCP) server that provides text completion capabilities using local LLama.cpp models. This server exposes a single MCP tool that accepts text prompts and returns AI-generated completions using locally hosted language models.

What is this project?

Byte Vision MCP is a bridge between MCP-compatible clients (like Claude Desktop, IDEs, or other AI tools) and local LLama.cpp language models. It allows you to:

  • Use local language models through the MCP protocol
  • Configure all model parameters via environment files
  • Generate text completions with custom prompts
  • Maintain privacy by keeping everything local
  • Integrate with MCP-compatible applications

Features

  • MCP Protocol Support: Standard MCP server implementation
  • Local Model Execution: Uses LLama.cpp for model inference
  • Configurable Parameters: All settings controlled via environment file
  • GPU Acceleration: Supports CUDA, ROCm, and Metal
  • Prompt Caching: Built-in caching for improved performance
  • Comprehensive Logging: Detailed logging for debugging and monitoring
  • Graceful Shutdown: Proper resource cleanup and error handling

Built With

Runtime Requirements

  • Go SDK 1.23+ - Modern Go runtime with latest features
  • LLama.cpp - Local language model inference engine
  • GGUF Models - Quantized language models in GGUF format

Prerequisites

  • Go 1.23+ for building the server
  • LLama.cpp binaries (see /llamacpp/README.md for installation)
  • GGUF format models (see /models/README.md for sources)
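
Before building, you can sanity-check the toolchain from a shell. The llama-cli path below is illustrative, and the --version flag depends on your LLama.cpp build:

go version
/path/to/llama-cli --version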

Quick Start

1. Clone and Build

git clone <repository-url>
cd byte-vision-mcp
go mod tidy
go build -o byte-vision-mcp

2. Set Up LLama.cpp

Follow the instructions in: /llamacpp/README.md

  • Download prebuilt binaries, or
  • Build from source

3. Download Models

See: /models/README.md

  • Recommended model sources
  • How to download GGUF models
  • Model placement instructions

4. Configure Environment

Copy the example configuration:

cp example-byte-vision-cfg.env byte-vision-cfg.env

Edit byte-vision-cfg.env to match your setup, updating the paths to your installation:

LLamaCliPath=/path/to/your/llama-cli
ModelFullPathVal=/path/to/your/model.gguf
AppLogPath=/path/to/logs/

5. Run the Server

./byte-vision-mcp

The server will start on http://localhost:8080/mcp-completion by default.
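
To confirm the endpoint is reachable, you can issue a quick request; even an error response shows the server is listening (see Integration Examples below for a full completion request):

curl -i http://localhost:8080/mcp-completion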

Project Structure

byte-vision-mcp/
├── llamacpp/                    # LLama.cpp binaries and installation guide
├── logs/                        # Application and model logs
├── models/                      # GGUF model files
├── prompt-cache/                # Cached prompts for performance
├── main.go                      # Main MCP server implementation
├── model.go                     # Model execution logic
├── types.go                     # Configuration structures
├── byte-vision-cfg.env          # Your configuration (create from example)
└── example-byte-vision-cfg.env  # Example configuration

Configuration

The byte-vision-cfg.env file controls all aspects of the server:

Application Settings

  • AppLogPath: Directory for log files
  • AppLogFileName: Log file name
  • HttpPort: Server port (default :8080)
  • EndPoint: MCP endpoint path (default /mcp-completion)
  • TimeOutSeconds: Request timeout in seconds (default 300)

LLama.cpp Settings

  • LLamaCliPath: Path to the llama-cli executable
  • ModelFullPathVal: Path to your GGUF model file
  • CtxSizeVal: Context window size
  • GPULayersVal: Number of layers to offload to the GPU
  • TemperatureVal: Generation temperature
  • PredictVal: Maximum number of tokens to generate
  • And many more LLama.cpp parameters...
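
Putting the application and LLama.cpp settings together, a minimal byte-vision-cfg.env might look like the following (paths and values are illustrative; see example-byte-vision-cfg.env for the full set of options):

AppLogPath=/path/to/logs/
AppLogFileName=byte-vision-mcp.log
HttpPort=:8080
EndPoint=/mcp-completion
TimeOutSeconds=300
LLamaCliPath=/path/to/llama-cli
ModelFullPathVal=/path/to/model.gguf
CtxSizeVal=4096
GPULayersVal=33
TemperatureVal=0.7
PredictVal=512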

Usage

MCP Tool: generate_completion

The server exposes a single MCP tool that accepts both basic and advanced parameters for fine-tuning LLama.cpp execution.

Basic Usage
**Input:**
{
"prompt": "Write a short story about a robot learning to paint"
}
**Output:**
{
"content": [
{
"type": "text",
"text": "Generated completion text..."
}
]
}
Advanced Usage with Parameters

Full Parameter Input:

{
"prompt": "Explain quantum computing in simple terms",
"temperature": 0.7,
"predict": 500,
"top_k": 40,
"top_p": 0.9,
"ctx_size": 4096,
"threads": 8,
"gpu_layers": 35
}
Available Parameters
Core Model & Performance Parameters
| Parameter | Type | Description | Example | Default Source |
|-----------|------|-------------|---------|----------------|
| model | string | Override model path | "/path/to/model.gguf" | ModelFullPathVal |
| threads | int | CPU threads for generation | 8 | ThreadsVal |
| gpu_layers | int | GPU acceleration layers | 35 | GPULayersVal |
| ctx_size | int | Context window size | 4096 | CtxSizeVal |
| batch_size | int | Batch processing size | 512 | BatchCmdVal |
Generation Control Parameters
| Parameter | Type | Description | Range | Default Source |
|-----------|------|-------------|-------|----------------|
| predict | int | Number of tokens to generate | 1-8192 | PredictVal |
| temperature | float | Creativity/randomness control | 0.0-2.0 | TemperatureVal |
| top_k | int | Top-K sampling | 1-100 | TopKVal |
| top_p | float | Top-P (nucleus) sampling | 0.0-1.0 | TopPVal |
| repeat_penalty | float | Repetition penalty | 0.5-2.0 | RepeatPenaltyVal |
Input/Output Parameters
| Parameter | Type | Description | Example | Default Source |
|-----------|------|-------------|---------|----------------|
| prompt_file | string | Load prompt from file | "/path/to/prompt.txt" | PromptFileVal |
| log_file | string | Custom log file path | "/path/to/custom.log" | ModelLogFileNameVal |
Parameter Usage Examples
1. Creative Writing (High Temperature)
{
"prompt": "Write a creative story about time travel",
"temperature": 1.2,
"top_p": 0.95,
"predict": 1000,
"repeat_penalty": 1.1
}
2. Technical Documentation (Low Temperature)
{
"prompt": "Explain the TCP/IP protocol stack",
"temperature": 0.3,
"top_k": 10,
"predict": 800,
"ctx_size": 8192
}
3. Code Generation (Balanced)
{
"prompt": "Write a Python function to sort a list",
"temperature": 0.6,
"top_k": 30,
"top_p": 0.8,
"predict": 400
}
4. Long Context Processing
{
"prompt": "Summarize this document...",
"ctx_size": 32768,
"gpu_layers": 40,
"batch_size": 1024,
"predict": 500
}
5. Performance Optimization
{
"prompt": "Quick question about Go syntax",
"threads": 12,
"gpu_layers": 45,
"batch_size": 2048,
"predict": 200,
"temperature": 0.4
}
6. Using External Prompt File
{
"prompt_file": "/path/to/complex_prompt.txt",
"temperature": 0.8,
"predict": 1500,
"log_file": "/path/to/custom_generation.log"
}
Parameter Guidelines
Temperature Settings
- **0.0-0.3**: Highly deterministic, factual responses
- **0.4-0.7**: Balanced creativity and coherence
- **0.8-1.2**: Creative, varied responses
- **1.3-2.0**: Highly creative, potentially chaotic
Context Size Guidelines
- **2048-4096**: Short conversations, simple tasks
- **8192-16384**: Medium documents, complex reasoning
- **32768+**: Long documents, extensive context
GPU Layers Optimization
- **0**: CPU-only processing
- **25-35**: Balanced CPU/GPU (8GB VRAM)
- **40+**: Full GPU acceleration (12GB+ VRAM)
Prediction Length
- **50-200**: Short answers, code snippets
- **300-800**: Medium explanations, documentation
- **1000+**: Long-form content, stories
Performance Considerations
Memory Usage
{
"ctx_size": 4096, // Lower for limited RAM
"batch_size": 512, // Smaller batches for stability
"gpu_layers": 25 // Reduce if GPU memory limited
}
Speed Optimization
{
"threads": 8, // Match CPU cores
"gpu_layers": 45, // Maximize GPU usage
"batch_size": 2048, // Larger batches for throughput
"predict": 200 // Shorter for quick responses
}
Quality Optimization
{
"temperature": 0.7, // Balanced creativity
"top_k": 40, // Diverse sampling
"top_p": 0.9, // Nucleus sampling
"repeat_penalty": 1.1, // Reduce repetition
"ctx_size": 8192 // Larger context for coherence
}
Error Handling

The tool provides specific error messages for invalid parameters:

{
"content": [
{
"type": "text",
"text": "Error: Invalid temperature value. Must be between 0.0 and 2.0"
}
]
}
Default Behavior
  • All parameters are optional except prompt
  • Environment configuration is used when parameters are not specified
  • Zero values are ignored (e.g., temperature: 0 uses config default)
  • Invalid values fall back to configuration defaults
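As a rough illustration of this fallback behavior, the sketch below shows how a requested value might be merged with the environment default (simplified and hypothetical, not the project's actual code):

package main

import "fmt"

// resolveTemperature returns the requested value when it is set and valid,
// otherwise it falls back to the default loaded from byte-vision-cfg.env.
func resolveTemperature(requested, cfgDefault float64) float64 {
    if requested == 0 || requested < 0 || requested > 2.0 {
        return cfgDefault // unset (zero) or invalid: use the config default
    }
    return requested
}

func main() {
    fmt.Println(resolveTemperature(0, 0.8))   // 0.8 (falls back to the config default)
    fmt.Println(resolveTemperature(0.3, 0.8)) // 0.3 (request value wins)
}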
Integration Examples
JavaScript/Node.js
const result = await mcpClient.callTool("generate_completion", {
prompt: "Explain async/await in JavaScript",
temperature: 0.5,
predict: 600,
top_k: 25
});
console.log(result.content[0].text);
Python
response = mcp_client.call_tool("generate_completion", {
"prompt": "Write a Python class for a binary tree",
"temperature": 0.6,
"predict": 800,
"top_p": 0.8
})
print(response["content"][0]["text"])
curl (Direct HTTP)
curl -X POST http://localhost:8080/mcp-completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Explain Docker containers",
"temperature": 0.4,
"predict": 500,
"ctx_size": 4096
}'
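Go (Direct HTTP)

A comparable request from Go, mirroring the plain-JSON POST used in the curl example above (a sketch; adjust the URL and fields to your setup):

package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    body := []byte(`{"prompt": "Explain Docker containers", "temperature": 0.4, "predict": 500}`)

    // Post the request to the MCP completion endpoint and print the raw response.
    resp, err := http.Post("http://localhost:8080/mcp-completion", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}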

This comprehensive parameter system allows fine-grained control over LLama.cpp behavior while maintaining backward compatibility and ease of use.

GPU Acceleration

NVIDIA GPUs (CUDA)

  • Download CUDA-enabled LLama.cpp binaries
  • Set GPULayersVal=33 (or adjust based on your GPU memory)
  • Set MainGPUVal=0 (or your preferred GPU index)
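
In byte-vision-cfg.env this corresponds to, for example:

GPULayersVal=33
MainGPUVal=0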

AMD GPUs (ROCm - Linux only)

  • Download ROCm-enabled LLama.cpp binaries
  • Configure similar to CUDA setup

Apple Silicon (Metal - macOS)

  • Metal support is built-in
  • No additional configuration needed

Logging

Logs are written to both console and file:

  • Application logs: logs/byte-vision-mcp.log
  • Model logs: logs/[model-name].log
  • Configurable log levels and verbosity

See /logs/README.md for log management details.

Troubleshooting

Common Issues

  1. "llama-cli not found"

    • Check LLamaCliPath in your .env file
    • Ensure the binary has execute permissions
  2. "Model file not found"

    • Verify ModelFullPathVal points to a valid .gguf file
    • Check file permissions
  3. Out of memory errors

    • Reduce CtxSizeVal
    • Use a smaller model
    • Increase GPULayersVal for GPU offloading
  4. Slow generation

    • Enable GPU acceleration
    • Increase GPULayersVal
    • Use quantized models (Q4, Q5, Q8)
  5. Server won't start

    • Check if port is already in use
    • Verify all paths in configuration exist
    • Check logs for detailed error messages

Development

Building from Source

go mod tidy
go build -o byte-vision-mcp

Running Tests

go test ./...

Dependencies

    • github.com/joho/godotenv - Environment file loading
    • github.com/metoro-io/mcp-golang - MCP protocol implementation

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the terms of the MIT license.

Support

  • Check the individual README files in each subdirectory for specific setup instructions
  • Review logs for detailed error information
  • Ensure all paths in configuration are absolute and accessible

Author: Kevin Brisson

Email: kbrisso@gmail.com

LinkedIn: Kevin Brisson

Project Link: https://github.com/kbrisso/byte-vision-mcp