AuraFriday/llm_mcp
This MCP server gives local LLMs full access to MCP tools, enabling autonomous tool-calling and offline AI operation.
Local LLM — Offline AI with Tool-Calling
MCP server that runs local LLMs with full access to MCP tools. Callable from Python to chain MCP tools with local intelligence.
Run AI models locally. No cloud. No API keys. Full tool-calling support. Your AI can call other AIs, completely offline.
Benefits
1. 🔒 Complete Privacy
Not cloud-dependent — fully local. Models run on your hardware. Data never leaves your machine. Perfect for sensitive work, proprietary code, or air-gapped environments.
2. 🤖 Autonomous Tool-Calling
Not just text generation — AI agents. Local models can call MCP tools automatically. Query databases, control browsers, show UI popups — all without cloud APIs.
3. ⚡ Zero Cost After Setup
Not pay-per-token — unlimited usage. Download models once, use forever. No API bills, no rate limits, no usage tracking.
Why This Tool Changes Local AI
Cloud APIs are powerful but limited. OpenAI, Anthropic, Google — great models, but you pay per token, data leaves your network, and you're dependent on their availability.
Existing local LLM tools focus on chat. Ollama, LM Studio, and llama.cpp are great for conversation, but offer limited tool-calling, no MCP integration, and no autonomous agents.
Tool-calling is cloud-only. OpenAI's function calling, Anthropic's tool use — powerful features locked to cloud APIs. Local models can't compete.
This tool solves all of that.
Run models locally with:
- Full tool-calling support (OpenAI-compatible format)
- Automatic MCP tool integration
- Hardware optimization (CUDA/CPU, quantization)
- OpenRouter-compatible interface
- Zero cloud dependencies
Your AI can now be an autonomous agent, calling tools and making decisions, all running on your hardware.
Real-World Story: The Air-Gapped Environment
The Problem:
A defense contractor needed AI assistance for code analysis. Security requirements:
- No internet connection allowed
- No data can leave the facility
- No cloud API access
- Must work on classified networks
Standard solution: None. All major AI assistants require internet.
With This Tool:
# 1. Check hardware capabilities
hardware = hardware_info()
# Shows: CUDA available, 24GB VRAM, RTX 4090
# 2. List installed models
models = list_installed_models()
# Shows: Qwen2.5-7B-Instruct (7GB), Llama-3.1-8B (8GB)
# 3. Run local AI with tool-calling
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Analyze the security of auth.py"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "read_file",
                "description": "Read file contents",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"}
                    }
                }
            }
        }
    ],
    auto_execute_tools=True  # Autonomous mode
)
# Model reads file, analyzes code, returns findings
# All local, no internet, no cloud APIs
Result:
- Code analysis working in air-gapped environment
- AI can read files, analyze patterns, suggest improvements
- Zero data exfiltration risk
- Unlimited usage, no costs
- Full audit trail of all operations
The kicker: Same setup now powers their entire development workflow. Code review, documentation generation, test creation — all local, all private, all autonomous.
The Complete Feature Set
Hardware Detection
Check Capabilities:
# Get hardware info
info = hardware_info()
# Returns:
# {
#     "cuda_available": true,
#     "cuda_version": "12.1",
#     "gpu_name": "NVIDIA RTX 4090",
#     "gpu_memory_gb": 24,
#     "cpu_cores": 16,
#     "ram_gb": 64,
#     "recommended_models": ["7B", "13B", "30B"]
# }
Why hardware detection matters: the server automatically selects appropriate models and quantization for your hardware. No manual configuration needed.
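A minimal sketch of that kind of selection logic, written against the hardware_info() shape shown above (the field names and thresholds here are illustrative assumptions, not the server's exact schema or rules):
def pick_quantization(info: dict) -> str:
    """Choose a quantization level for a ~7B model from reported hardware."""
    if not info.get("cuda_available"):
        return "4bit"                      # CPU-only: keep the footprint minimal
    vram_gb = info.get("gpu_memory_gb", 0)
    if vram_gb >= 16:
        return "none"                      # room for full FP16/BF16 weights
    if vram_gb >= 10:
        return "8bit"
    return "4bit"

print(pick_quantization({"cuda_available": True, "gpu_memory_gb": 24}))  # "none"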
Model Management
List Installed Models:
# See what's already downloaded
models = list_installed_models()
# Returns list of cached models:
# [
#     {
#         "id": "Qwen/Qwen2.5-7B-Instruct",
#         "size_gb": 7.2,
#         "type": "text-generation",
#         "supports_tools": true,
#         "cache_path": "~/.cache/huggingface/..."
#     }
# ]
Model Information:
# Get details about a specific model
info = model_info(model="Qwen/Qwen2.5-7B-Instruct")
# Returns:
# {
#     "name": "Qwen2.5-7B-Instruct",
#     "size_gb": 7.2,
#     "context_length": 32768,
#     "supports_tools": true,
#     "supports_vision": false,
#     "quantization_options": ["4bit", "8bit", "none"],
#     "recommended_vram_gb": 8
# }
Why model management matters: Know what's installed, what's possible, what will fit in your VRAM. Make informed decisions about which models to use.
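As a rough sketch of that decision, the snippet below filters the list_installed_models() output down to models whose 4-bit footprint should fit in a given amount of VRAM. The field names follow the illustrative shapes above, and the "size / 3 plus overhead" rule is an approximation, not the server's own logic:
def models_that_fit(models: list, vram_gb: float) -> list:
    """Return ids of installed models whose 4-bit footprint should fit in VRAM."""
    fitting = []
    for m in models:
        approx_4bit_gb = m["size_gb"] / 3 + 1.0   # rough rule of thumb, plus overhead
        if approx_4bit_gb <= vram_gb:
            fitting.append(m["id"])
    return fitting

installed = [{"id": "Qwen/Qwen2.5-7B-Instruct", "size_gb": 7.2}]
print(models_that_fit(installed, vram_gb=8))      # ['Qwen/Qwen2.5-7B-Instruct']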
Chat Completions
Basic Chat:
# Simple text generation
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
Multi-Turn Conversations:
# Maintain conversation history
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant"},
        {"role": "user", "content": "Write a Python function to sort a list"},
        {"role": "assistant", "content": "Here's a sorting function..."},
        {"role": "user", "content": "Now optimize it for large lists"}
    ]
)
Temperature and Sampling:
# Control randomness and creativity
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=0.9,  # More creative
    top_p=0.95,
    max_tokens=1000
)
Tool-Calling (Autonomous Agents):
# AI can call MCP tools automatically
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Show me all browser tabs and save to database"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "browser_list_tabs",
                "description": "Get all open browser tabs",
                "parameters": {"type": "object", "properties": {}}
            }
        },
        {
            "type": "function",
            "function": {
                "name": "sqlite_execute",
                "description": "Execute SQL query",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "sql": {"type": "string"},
                        "database": {"type": "string"}
                    }
                }
            }
        }
    ],
    auto_execute_tools=True  # Autonomous mode
)
# Model will:
# 1. Call browser_list_tabs
# 2. Process results
# 3. Call sqlite_execute to save data
# 4. Return summary
Why tool-calling matters: Local AI becomes an autonomous agent. Not just answering questions — taking actions, calling tools, completing complex workflows.
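If you prefer to keep execution in your own hands instead of setting auto_execute_tools=True, the same OpenAI-compatible format can be driven by a small manual loop. The sketch below assumes the response mirrors the standard choices/message/tool_calls layout; the exact field names in this tool's responses are an assumption here:
import json

def run_tool_loop(chat_completion, model, messages, tools, tool_registry, max_rounds=5):
    """Ask the model, execute any requested tools locally, feed results back."""
    for _ in range(max_rounds):
        response = chat_completion(model=model, messages=messages, tools=tools)
        msg = response["choices"][0]["message"]
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg["content"]                  # no more tools: final answer
        messages.append(msg)                       # keep the assistant turn
        for call in calls:
            fn = call["function"]
            result = tool_registry[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    return None                                    # gave up after max_rounds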
Hardware Optimization
Automatic Quantization:
# Let tool decide best quantization
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    quantization="auto"  # Chooses 4bit/8bit/none based on VRAM
)
Device Selection:
# Control where model runs
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    device="cuda"  # or "cpu" or "auto"
)
Why optimization matters: Run larger models on limited hardware. 7B model in 4-bit quantization fits in 4GB VRAM. 13B model fits in 8GB. Automatic selection ensures best performance for your hardware.
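The arithmetic behind those numbers is simple: weight memory is roughly parameter count times bytes per parameter, before the KV cache and activations are added on top. A back-of-envelope estimator (illustrative only):
BYTES_PER_PARAM = {"none": 2.0, "8bit": 1.0, "4bit": 0.5}   # FP16/BF16, INT8, 4-bit

def weight_memory_gb(params_billions: float, quant: str) -> float:
    # billions of parameters x bytes per parameter = gigabytes (decimal)
    return params_billions * BYTES_PER_PARAM[quant]

for q in ("none", "8bit", "4bit"):
    print(f"7B weights, {q}: ~{weight_memory_gb(7, q):.1f} GB")
# none: ~14.0 GB, 8bit: ~7.0 GB, 4bit: ~3.5 GB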
Usage Examples
Hardware Info
{
    "input": {
        "operation": "hardware_info",
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
List Installed Models
{
    "input": {
        "operation": "list_installed_models",
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Model Info
{
    "input": {
        "operation": "model_info",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Chat Completion
{
    "input": {
        "operation": "chat_completion",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain quantum computing"}
        ],
        "temperature": 0.7,
        "max_tokens": 1000,
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Chat with Tool-Calling
{
    "input": {
        "operation": "chat_completion",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Get browser tabs and save to database"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "browser_list_tabs",
                    "description": "Get all open browser tabs",
                    "parameters": {"type": "object", "properties": {}}
                }
            }
        ],
        "auto_execute_tools": true,
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Technical Architecture
Supported Models
Text Generation:
- Small (0.5-3B): Qwen2.5-0.5B, Phi-3-mini, Gemma-2-2B
- Medium (7-13B): Llama-3.1-8B, Qwen2.5-7B, Mistral-7B
- Large (30B+): Llama-3.1-70B, Qwen2.5-32B
Tool-Calling Support:
- Qwen2.5-Instruct family (recommended)
- Llama-3.x Instruct
- Any model with function-calling training
Multi-Modal (Future):
- LLaVA (vision + text)
- Qwen2-VL (vision + text)
- KOSMOS-2 (vision + text)
Hardware Requirements
Minimum (CPU-only):
- 8GB RAM
- 4-core CPU
- Small models only (0.5-3B)
Recommended (GPU):
- 8GB+ VRAM (NVIDIA GPU)
- CUDA 11.8+
- Medium models (7-13B) with quantization
Optimal (High-end GPU):
- 24GB+ VRAM
- CUDA 12.1+
- Large models (30B+) or unquantized medium models
Quantization
4-bit (BitsAndBytes):
- ~4GB VRAM for 7B model
- Minimal quality loss
- 4x memory reduction
8-bit:
- ~8GB VRAM for 7B model
- Negligible quality loss
- 2x memory reduction
None (FP16/BF16):
- ~14GB VRAM for 7B model
- Full quality
- Maximum memory usage
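For reference, a typical 4-bit load with the bundled Transformers + BitsAndBytes stack looks like the snippet below. This is standard Hugging Face usage shown for illustration; the server handles the equivalent for you when you pass quantization="4bit".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",   # spread layers across GPU/CPU as VRAM allows
)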
Model Storage
Cache Location:
- Linux/Mac: ~/.cache/huggingface/hub/
- Windows: %USERPROFILE%\.cache\huggingface\hub\
- Override: HF_HOME environment variable
Storage Requirements:
- 0.5B model: ~1GB
- 7B model: ~7GB
- 13B model: ~13GB
- 70B model: ~70GB
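To see what is already cached and how much disk it uses, huggingface_hub ships a cache scanner (standard library call, shown for illustration; it honors HF_HOME if you have overridden the default location):
from huggingface_hub import scan_cache_dir

cache = scan_cache_dir()
for repo in cache.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")
print(f"total: {cache.size_on_disk / 1e9:.1f} GB")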
Performance Considerations
Inference Speed
- 0.5B model: ~50-100 tokens/sec (CPU)
- 7B model: ~20-40 tokens/sec (GPU, 4-bit)
- 13B model: ~10-20 tokens/sec (GPU, 4-bit)
- 70B model: ~2-5 tokens/sec (GPU, 4-bit)
Memory Usage
- Model size + context window + overhead
- 7B model (4-bit): ~4-6GB VRAM
- 7B model (8-bit): ~8-10GB VRAM
- 7B model (FP16): ~14-16GB VRAM
Context Length
- Qwen2.5: 32K tokens
- Llama-3.1: 128K tokens
- Gemma-2: 8K tokens
- Varies by model
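Because the window varies by model, it is worth checking that a long prompt fits before sending it. A quick check with the model's own tokenizer (the 32K figure is Qwen2.5's advertised window from the list above):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
prompt = "Summarize the security posture of auth.py in detail."
n_tokens = len(tokenizer.encode(prompt))
context_limit = 32_768                      # Qwen2.5's window, per the list above
print(f"{n_tokens} prompt tokens; {context_limit - n_tokens} left for the reply")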
Limitations & Considerations
Model Quality
- Smaller Models: Less capable than GPT-4/Claude
- Specialized Tasks: May require fine-tuning
- Reasoning: Limited compared to frontier models
- Knowledge: Training data cutoff applies
Hardware Requirements
- GPU Recommended: CPU inference is slow
- VRAM Limits: Larger models need more memory
- Quantization Trade-off: Quality vs memory
- Cooling: Extended use generates heat
Tool-Calling
- Model Support: Not all models support tools
- Schema Compliance: Requires proper training
- Error Handling: Tools may fail
- Security: Validate all tool calls
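One concrete way to honor the last point is a deny-by-default allowlist checked before anything executes. A minimal sketch (the tool names and call shape here are illustrative assumptions):
import json

ALLOWED_TOOLS = {"read_file", "browser_list_tabs"}    # deny by default

def validate_tool_call(call: dict):
    """Reject tool calls that are not allowlisted or carry malformed arguments."""
    name = call["function"]["name"]
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    args = json.loads(call["function"]["arguments"])
    if not isinstance(args, dict):
        raise ValueError("tool arguments must decode to a JSON object")
    return name, args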
Privacy vs Features
- Offline: No internet = no web search
- Local Only: Can't access cloud services
- Updates: Manual model downloads
- Support: Community-driven
Why This Tool is Unmatched
1. Complete Privacy
Data never leaves your machine. Perfect for sensitive work.
2. Autonomous Agents
Tool-calling support. AI can take actions, not just answer questions.
3. Zero Cost
Unlimited usage after initial download. No API bills.
4. OpenRouter-Compatible
Same interface as cloud APIs. Easy switching.
5. Hardware Optimization
Automatic quantization and device selection.
6. MCP Integration
Call any MCP tool from local AI.
7. Multiple Models
Support for Qwen, Llama, Mistral, Gemma, and more.
8. Production-Ready
Error handling, validation, monitoring.
9. Air-Gap Compatible
Works without internet after initial setup.
10. Future-Proof
Multi-modal, embeddings, more features coming.
Powered by MCP-Link
This tool is part of the MCP-Link Server — the only MCP server with local LLM inference and autonomous tool-calling built-in.
What's Included
Isolated Python Environment:
- PyTorch with CUDA support
- Transformers library
- Quantization tools (BitsAndBytes)
- All dependencies bundled
Model Management:
- Automatic HuggingFace integration
- Cache management
- Hardware detection
- Optimization selection
Tool-Calling Infrastructure:
- MCP tool integration
- Schema validation
- Autonomous execution
- Security controls
Get MCP-Link
Download the installer for your platform.
Installation is automatic. Dependencies are included. It just works.
Technical Specifications
Backend: PyTorch + Transformers
Hardware: CUDA (GPU) or CPU
Quantization: 4-bit, 8-bit, FP16
Context Length: Model-dependent (8K-128K)
Tool-Calling: OpenAI-compatible format
Model Source: HuggingFace Hub
Cache: ~/.cache/huggingface/
Supported Models:
- Qwen2.5 (0.5B-32B)
- Llama-3.1 (8B-70B)
- Mistral (7B)
- Gemma-2 (2B-9B)
- Phi-3 (mini-medium)
Performance:
- 0.5B: ~50-100 tok/sec (CPU)
- 7B: ~20-40 tok/sec (GPU, 4-bit)
- 13B: ~10-20 tok/sec (GPU, 4-bit)
License & Copyright
Copyright © 2025 Christopher Nathan Drake
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
AI Training Permission: You are permitted to use this software and any associated content for the training, evaluation, fine-tuning, or improvement of artificial intelligence systems, including commercial models.
SPDX-License-Identifier: Apache-2.0
Part of the Aura Friday MCP-Link Server project.
Support & Community
Issues & Feature Requests:
GitHub Issues
Documentation:
MCP-Link Documentation
HuggingFace:
Model Hub
Community:
Join other developers building private, autonomous AI agents.