0xsaju/mcp-server
MCP Server with Local LLM
A Model Context Protocol (MCP) server that integrates with your local Large Language Model, providing tools and resources that can be used by MCP-compatible clients like Claude Desktop, IDEs, and other AI assistants.
Features
- Local LLM Integration: Uses your local language model (default: Qwen3-8B) for all operations
- MCP-Compatible: Implements the Model Context Protocol specification
- Multiple Tools: Provides various tools for different use cases
- Resource Access: Exposes model information and conversation history
- Memory Efficient: Uses 4-bit quantization for optimal performance
Available Tools
1. chat_with_llm
Basic chat interface with the local LLM.
Parameters:
- message (required): The message to send to the LLM
- system_prompt (optional): System prompt to guide behavior
- temperature (optional): Temperature for generation (0-1, default: 0.7)
- max_tokens (optional): Maximum tokens to generate (default: 512)
2. analyze_code
Analyze and review code using the local LLM.
Parameters:
- code (required): The code to analyze
- language (optional): Programming language
- analysis_type (optional): Type of analysis (review, explain, optimize, debug)
3. generate_text
Generate text content in various styles and lengths.
Parameters:
- prompt (required): The prompt for text generation
- style (optional): Writing style (formal, casual, technical, creative)
- length (optional): Desired length (short, medium, long)
4. translate_text
Translate text between languages.
Parameters:
- text (required): Text to translate
- target_language (required): Target language
- source_language (optional): Source language (auto-detect if not provided)
5. summarize_content
Summarize content in different formats.
Parameters:
- content (required): Content to summarize
- summary_type (optional): Type of summary (brief, detailed, bullet_points)
- max_length (optional): Maximum length in words (default: 150)
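For reference, an MCP client invokes any of these tools with a JSON-RPC tools/call request over stdio. The snippet below only sketches what that payload looks like for chat_with_llm; the request id and framing are handled by your MCP client (e.g. Claude Desktop), so you normally never build it by hand:

```python
import json

# Illustrative JSON-RPC payload an MCP client would send to invoke
# chat_with_llm; the tool arguments mirror the parameters listed above.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "chat_with_llm",
        "arguments": {
            "message": "Explain what a context manager does in Python.",
            "temperature": 0.7,   # optional, 0-1
            "max_tokens": 256,    # optional
        },
    },
}
print(json.dumps(request, indent=2))
```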
Available Resources
1. llm://model/info
Information about the loaded local LLM model.
2. llm://conversations/history
History of conversations with the LLM.
3. llm://system/status
Current system status and resource usage.
Installation
1. Clone or download this repository to your local machine.
2. Install dependencies:
   pip install -r requirements.txt
3. Set up environment variables (create a .env file):
   MODEL_NAME=Qwen/Qwen3-8B
   MAX_NEW_TOKENS=512
   TEMPERATURE=0.7
4. Ensure you have sufficient hardware:
   - GPU with at least 8GB VRAM (recommended)
   - 16GB+ system RAM
   - CUDA-compatible GPU (for optimal performance)
Usage
Running the Server
1. Start the MCP server:
   python mcp_server.py
2. The server will:
   - Load the specified model (Qwen3-8B by default)
   - Initialize the MCP server
   - Wait for JSON-RPC requests via stdin/stdout (a minimal smoke test of this interface is sketched below)
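Because the server talks JSON-RPC over stdin/stdout, you can exercise it without Claude Desktop by piping requests to it. This is a rough sketch that assumes newline-delimited JSON framing and nothing else being written to stdout; adapt it to however mcp_server.py actually frames its messages:

```python
import json
import subprocess

# Launch the server the same way Claude Desktop would: one process
# speaking JSON-RPC over stdin/stdout (newline-delimited here).
proc = subprocess.Popen(
    ["python", "mcp_server.py"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)

def send(msg):
    """Write one JSON-RPC message and read one response line."""
    proc.stdin.write(json.dumps(msg) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

# Minimal MCP handshake, then list the tools the server exposes.
print(send({
    "jsonrpc": "2.0", "id": 1, "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "smoke-test", "version": "0.1"},
    },
}))
print(send({"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}))

proc.terminate()
```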
Integrating with Claude Desktop
To use this server with Claude Desktop, add it to your MCP configuration:
1. Create or edit your Claude Desktop MCP configuration file:
   - macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
   - Windows: %APPDATA%\Claude\claude_desktop_config.json
2. Add the server configuration (point args at wherever you cloned the repository):
   {
     "mcpServers": {
       "local-llm": {
         "command": "python",
         "args": ["/Users/sazzad/Documents/mcp-server/mcp_server.py"],
         "env": {
           "MODEL_NAME": "Qwen/Qwen3-8B"
         }
       }
     }
   }
3. Restart Claude Desktop to load the new server.
Example Usage in Claude Desktop
Once configured, you can use the tools in Claude Desktop:
Please use the local LLM to analyze this Python code:
Claude will then use the analyze_code tool to get analysis from your local LLM.
Configuration
Environment Variables
- MODEL_NAME: HuggingFace model name (default: "Qwen/Qwen3-8B")
- MAX_NEW_TOKENS: Maximum tokens to generate (default: 512)
- TEMPERATURE: Default temperature for generation (default: 0.7)
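A common way for a server like this to pick these up is os.getenv plus python-dotenv for the .env file; the following is only a sketch of that pattern, not necessarily the exact code in mcp_server.py:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads MODEL_NAME, MAX_NEW_TOKENS, TEMPERATURE from .env

MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen3-8B")
MAX_NEW_TOKENS = int(os.getenv("MAX_NEW_TOKENS", "512"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.7"))
```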
Supported Models
This server works with most HuggingFace transformers models that support:
- Chat templates
- 4-bit quantization via BitsAndBytesConfig
- Causal language modeling
Popular choices:
- Qwen/Qwen3-8B (default)
- microsoft/DialoGPT-medium
- meta-llama/Llama-2-7b-chat-hf
- mistralai/Mistral-7B-Instruct-v0.1
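As an illustration of those three requirements, any of these models can be loaded in 4-bit via BitsAndBytesConfig and prompted through its chat template roughly as follows (a sketch using standard transformers/bitsandbytes APIs, not the exact code in mcp_server.py):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/Qwen3-8B"  # or whatever MODEL_NAME is set to

# 4-bit NF4 quantization keeps an 8B model within a few GB of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)

# Chat-template models expect a list of role/content messages.
messages = [{"role": "user", "content": "Say hello in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```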
Hardware Requirements
Minimum Requirements
- 8GB RAM
- 4GB GPU VRAM (with 4-bit quantization)
- 10GB disk space (for model storage)
Recommended Requirements
- 16GB+ RAM
- 8GB+ GPU VRAM
- CUDA-compatible GPU
- SSD storage for faster model loading
Troubleshooting
Common Issues
- Out of Memory Error:
  - Reduce MAX_NEW_TOKENS
  - Use a smaller model
  - Ensure 4-bit quantization is working
- Model Loading Fails:
  - Check internet connection for initial download
  - Verify model name is correct
  - Ensure sufficient disk space
- CUDA Errors:
  - Update CUDA drivers
  - Check GPU compatibility
  - Fall back to CPU if needed
Performance Optimization
1. Use GPU acceleration (a quick device check is sketched below):
   pip install torch --index-url https://download.pytorch.org/whl/cu118
2. Enable Flash Attention (if compatible):
   pip install flash-attn
3. Adjust batch size and sequence length based on your hardware.
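To confirm the CUDA build of PyTorch actually sees your GPU (and to know when the model will silently fall back to CPU), a quick check like this helps:

```python
import torch

# Reports whether the CUDA-enabled PyTorch build can see a GPU.
if torch.cuda.is_available():
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found; the model will run on CPU (much slower).")
```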
Development
Project Structure
mcp-server/
├── mcp_server.py # Main MCP server implementation
├── mcp-server.py # Legacy FastAPI server (keep for reference)
├── requirements.txt # Python dependencies
├── mcp/ # MCP protocol implementation
│ ├── __init__.py
│ ├── server.py # MCP server base
│ └── types.py # MCP protocol types
├── .env # Environment variables (create this)
└── README.md # This file
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
License
MIT License - see LICENSE file for details.
Support
For issues and questions:
- Check the troubleshooting section
- Review the configuration
- Open an issue on GitHub