AuraFriday/llm_mcp
This MCP server gives local LLMs full access to MCP tools, enabling autonomous tool-calling and offline AI operation.
Local LLM — Offline AI with Tool-Calling
MCP server that runs local LLMs with full access to MCP tools. Callable from Python to chain MCP tools with local intelligence.
Run AI models locally. No cloud. No API keys. Full tool-calling support. Your AI can call other AIs, completely offline.
Benefits
1. 🔒 Complete Privacy
Not cloud-dependent — fully local. Models run on your hardware. Data never leaves your machine. Perfect for sensitive work, proprietary code, or air-gapped environments.
2. 🤖 Autonomous Tool-Calling
Not just text generation — AI agents. Local models can call MCP tools automatically. Query databases, control browsers, show UI popups — all without cloud APIs.
3. ⚡ Zero Cost After Setup
Not pay-per-token — unlimited usage. Download models once, use forever. No API bills, no rate limits, no usage tracking.
Why This Tool Changes Local AI
Cloud APIs are powerful but limited. OpenAI, Anthropic, Google — great models, but you pay per token, data leaves your network, and you're dependent on their availability.
Existing local LLM tools focus on chat. Ollama, LM Studio, and llama.cpp are great for conversation, but offer limited tool-calling, no MCP integration, and no autonomous agents.
Tool-calling is cloud-only. OpenAI's function calling, Anthropic's tool use — powerful features locked to cloud APIs. Local models can't compete.
This tool solves all of that.
Run models locally with:
- Full tool-calling support (OpenAI-compatible format)
- Automatic MCP tool integration
- Hardware optimization (CUDA/CPU, quantization)
- OpenRouter-compatible interface
- Zero cloud dependencies
Your AI can now be an autonomous agent, calling tools and making decisions, all running on your hardware.
Real-World Story: The Air-Gapped Environment
The Problem:
A defense contractor needed AI assistance for code analysis. Security requirements:
- No internet connection allowed
- No data can leave the facility
- No cloud API access
- Must work on classified networks
Standard solution: None. All major AI assistants require internet.
With This Tool:
# 1. Check hardware capabilities
hardware = hardware_info()
# Shows: CUDA available, 24GB VRAM, RTX 4090
# 2. List installed models
models = list_installed_models()
# Shows: Qwen2.5-7B-Instruct (7GB), Llama-3.1-8B (8GB)
# 3. Run local AI with tool-calling
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Analyze the security of auth.py"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "read_file",
                "description": "Read file contents",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "path": {"type": "string"}
                    }
                }
            }
        }
    ],
    auto_execute_tools=True  # Autonomous mode
)
# Model reads file, analyzes code, returns findings
# All local, no internet, no cloud APIs
Result:
- Code analysis working in air-gapped environment
- AI can read files, analyze patterns, suggest improvements
- Zero data exfiltration risk
- Unlimited usage, no costs
- Full audit trail of all operations
The kicker: Same setup now powers their entire development workflow. Code review, documentation generation, test creation — all local, all private, all autonomous.
The Complete Feature Set
Hardware Detection
Check Capabilities:
# Get hardware info
info = hardware_info()
# Returns:
# {
#     "cuda_available": true,
#     "cuda_version": "12.1",
#     "gpu_name": "NVIDIA RTX 4090",
#     "gpu_memory_gb": 24,
#     "cpu_cores": 16,
#     "ram_gb": 64,
#     "recommended_models": ["7B", "13B", "30B"]
# }
Why hardware detection matters: the server automatically selects appropriate models and quantization for your hardware. No manual configuration needed.
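A minimal sketch of that kind of selection logic, written against the hardware_info() shape shown above (the field names and thresholds here are illustrative assumptions, not the server's exact schema or rules):
def pick_quantization(info: dict) -> str:
    """Choose a quantization level for a ~7B model from reported hardware."""
    if not info.get("cuda_available"):
        return "4bit"                      # CPU-only: keep the footprint minimal
    vram_gb = info.get("gpu_memory_gb", 0)
    if vram_gb >= 16:
        return "none"                      # room for full FP16/BF16 weights
    if vram_gb >= 10:
        return "8bit"
    return "4bit"

print(pick_quantization({"cuda_available": True, "gpu_memory_gb": 24}))  # "none"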
Model Management
List Installed Models:
# See what's already downloaded
models = list_installed_models()
# Returns list of cached models:
# [
#     {
#         "id": "Qwen/Qwen2.5-7B-Instruct",
#         "size_gb": 7.2,
#         "type": "text-generation",
#         "supports_tools": true,
#         "cache_path": "~/.cache/huggingface/..."
#     }
# ]
Model Information:
# Get details about a specific model
info = model_info(model="Qwen/Qwen2.5-7B-Instruct")
# Returns:
# {
#     "name": "Qwen2.5-7B-Instruct",
#     "size_gb": 7.2,
#     "context_length": 32768,
#     "supports_tools": true,
#     "supports_vision": false,
#     "quantization_options": ["4bit", "8bit", "none"],
#     "recommended_vram_gb": 8
# }
Why model management matters: Know what's installed, what's possible, what will fit in your VRAM. Make informed decisions about which models to use.
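As a rough sketch of that decision, the snippet below filters the list_installed_models() output down to models whose 4-bit footprint should fit in a given amount of VRAM. The field names follow the illustrative shapes above, and the "size / 3 plus overhead" rule is an approximation, not the server's own logic:
def models_that_fit(models: list, vram_gb: float) -> list:
    """Return ids of installed models whose 4-bit footprint should fit in VRAM."""
    fitting = []
    for m in models:
        approx_4bit_gb = m["size_gb"] / 3 + 1.0   # rough rule of thumb, plus overhead
        if approx_4bit_gb <= vram_gb:
            fitting.append(m["id"])
    return fitting

installed = [{"id": "Qwen/Qwen2.5-7B-Instruct", "size_gb": 7.2}]
print(models_that_fit(installed, vram_gb=8))      # ['Qwen/Qwen2.5-7B-Instruct']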
Chat Completions
Basic Chat:
# Simple text generation
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ]
)
Multi-Turn Conversations:
# Maintain conversation history
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant"},
        {"role": "user", "content": "Write a Python function to sort a list"},
        {"role": "assistant", "content": "Here's a sorting function..."},
        {"role": "user", "content": "Now optimize it for large lists"}
    ]
)
Temperature and Sampling:
# Control randomness and creativity
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a creative story"}],
    temperature=0.9,  # More creative
    top_p=0.95,
    max_tokens=1000
)
Tool-Calling (Autonomous Agents):
# AI can call MCP tools automatically
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Show me all browser tabs and save to database"}
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "browser_list_tabs",
                "description": "Get all open browser tabs",
                "parameters": {"type": "object", "properties": {}}
            }
        },
        {
            "type": "function",
            "function": {
                "name": "sqlite_execute",
                "description": "Execute SQL query",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "sql": {"type": "string"},
                        "database": {"type": "string"}
                    }
                }
            }
        }
    ],
    auto_execute_tools=True  # Autonomous mode
)
# Model will:
# 1. Call browser_list_tabs
# 2. Process results
# 3. Call sqlite_execute to save data
# 4. Return summary
Why tool-calling matters: Local AI becomes an autonomous agent. Not just answering questions — taking actions, calling tools, completing complex workflows.
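If you prefer to keep execution in your own hands instead of setting auto_execute_tools=True, the same OpenAI-compatible format can be driven by a small manual loop. The sketch below assumes the response mirrors the standard choices/message/tool_calls layout; the exact field names in this tool's responses are an assumption here:
import json

def run_tool_loop(chat_completion, model, messages, tools, tool_registry, max_rounds=5):
    """Ask the model, execute any requested tools locally, feed results back."""
    for _ in range(max_rounds):
        response = chat_completion(model=model, messages=messages, tools=tools)
        msg = response["choices"][0]["message"]
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg["content"]                  # no more tools: final answer
        messages.append(msg)                       # keep the assistant turn
        for call in calls:
            fn = call["function"]
            result = tool_registry[fn["name"]](**json.loads(fn["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": json.dumps(result)})
    return None                                    # gave up after max_rounds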
Hardware Optimization
Automatic Quantization:
# Let tool decide best quantization
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    quantization="auto"  # Chooses 4bit/8bit/none based on VRAM
)
Device Selection:
# Control where model runs
response = chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    device="cuda"  # or "cpu" or "auto"
)
Why optimization matters: Run larger models on limited hardware. 7B model in 4-bit quantization fits in 4GB VRAM. 13B model fits in 8GB. Automatic selection ensures best performance for your hardware.
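The arithmetic behind those numbers is simple: weight memory is roughly parameter count times bytes per parameter, before the KV cache and activations are added on top. A back-of-envelope estimator (illustrative only):
BYTES_PER_PARAM = {"none": 2.0, "8bit": 1.0, "4bit": 0.5}   # FP16/BF16, INT8, 4-bit

def weight_memory_gb(params_billions: float, quant: str) -> float:
    # billions of parameters x bytes per parameter = gigabytes (decimal)
    return params_billions * BYTES_PER_PARAM[quant]

for q in ("none", "8bit", "4bit"):
    print(f"7B weights, {q}: ~{weight_memory_gb(7, q):.1f} GB")
# none: ~14.0 GB, 8bit: ~7.0 GB, 4bit: ~3.5 GB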
Usage Examples
Hardware Info
{
    "input": {
        "operation": "hardware_info",
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
List Installed Models
{
    "input": {
        "operation": "list_installed_models",
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Model Info
{
    "input": {
        "operation": "model_info",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Chat Completion
{
    "input": {
        "operation": "chat_completion",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain quantum computing"}
        ],
        "temperature": 0.7,
        "max_tokens": 1000,
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Chat with Tool-Calling
{
    "input": {
        "operation": "chat_completion",
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Get browser tabs and save to database"}
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "browser_list_tabs",
                    "description": "Get all open browser tabs",
                    "parameters": {"type": "object", "properties": {}}
                }
            }
        ],
        "auto_execute_tools": true,
        "tool_unlock_token": "YOUR_TOKEN"
    }
}
Technical Architecture
Supported Models
Text Generation:
- Small (0.5-3B): Qwen2.5-0.5B, Phi-3-mini, Gemma-2-2B
- Medium (7-13B): Llama-3.1-8B, Qwen2.5-7B, Mistral-7B
- Large (30B+): Llama-3.1-70B, Qwen2.5-32B
Tool-Calling Support:
- Qwen2.5-Instruct family (recommended)
- Llama-3.x Instruct
- Any model with function-calling training
Multi-Modal (Future):
- LLaVA (vision + text)
- Qwen2-VL (vision + text)
- KOSMOS-2 (vision + text)
Hardware Requirements
Minimum (CPU-only):
- 8GB RAM
- 4-core CPU
- Small models only (0.5-3B)
Recommended (GPU):
- 8GB+ VRAM (NVIDIA GPU)
- CUDA 11.8+
- Medium models (7-13B) with quantization
Optimal (High-end GPU):
- 24GB+ VRAM
- CUDA 12.1+
- Large models (30B+) or unquantized medium models
Quantization
4-bit (BitsAndBytes):
- ~4GB VRAM for 7B model
- Minimal quality loss
- 4x memory reduction
8-bit:
- ~8GB VRAM for 7B model
- Negligible quality loss
- 2x memory reduction
None (FP16/BF16):
- ~14GB VRAM for 7B model
- Full quality
- Maximum memory usage
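For reference, a typical 4-bit load with the bundled Transformers + BitsAndBytes stack looks like the snippet below. This is standard Hugging Face usage shown for illustration; the server handles the equivalent for you when you pass quantization="4bit".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",   # spread layers across GPU/CPU as VRAM allows
)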
Model Storage
Cache Location:
- Linux/Mac: ~/.cache/huggingface/hub/
- Windows: %USERPROFILE%\.cache\huggingface\hub\
- Override: HF_HOME environment variable
Storage Requirements:
- 0.5B model: ~1GB
- 7B model: ~7GB
- 13B model: ~13GB
- 70B model: ~70GB
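To see what is already cached and how much disk it uses, huggingface_hub ships a cache scanner (standard library call, shown for illustration; it honors HF_HOME if you have overridden the default location):
from huggingface_hub import scan_cache_dir

cache = scan_cache_dir()
for repo in cache.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")
print(f"total: {cache.size_on_disk / 1e9:.1f} GB")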
Performance Considerations
Inference Speed
- 0.5B model: ~50-100 tokens/sec (CPU)
- 7B model: ~20-40 tokens/sec (GPU, 4-bit)
- 13B model: ~10-20 tokens/sec (GPU, 4-bit)
- 70B model: ~2-5 tokens/sec (GPU, 4-bit)
Memory Usage
- Model size + context window + overhead
- 7B model (4-bit): ~4-6GB VRAM
- 7B model (8-bit): ~8-10GB VRAM
- 7B model (FP16): ~14-16GB VRAM
Context Length
- Qwen2.5: 32K tokens
- Llama-3.1: 128K tokens
- Gemma-2: 8K tokens
- Varies by model
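Because the window varies by model, it is worth checking that a long prompt fits before sending it. A quick check with the model's own tokenizer (the 32K figure is Qwen2.5's advertised window from the list above):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
prompt = "Summarize the security posture of auth.py in detail."
n_tokens = len(tokenizer.encode(prompt))
context_limit = 32_768                      # Qwen2.5's window, per the list above
print(f"{n_tokens} prompt tokens; {context_limit - n_tokens} left for the reply")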
Limitations & Considerations
Model Quality
- Smaller Models: Less capable than GPT-4/Claude
- Specialized Tasks: May require fine-tuning
- Reasoning: Limited compared to frontier models
- Knowledge: Training data cutoff applies
Hardware Requirements
- GPU Recommended: CPU inference is slow
- VRAM Limits: Larger models need more memory
- Quantization Trade-off: Quality vs memory
- Cooling: Extended use generates heat
Tool-Calling
- Model Support: Not all models support tools
- Schema Compliance: Requires proper training
- Error Handling: Tools may fail
- Security: Validate all tool calls
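One concrete way to honor the last point is a deny-by-default allowlist checked before anything executes. A minimal sketch (the tool names and call shape here are illustrative assumptions):
import json

ALLOWED_TOOLS = {"read_file", "browser_list_tabs"}    # deny by default

def validate_tool_call(call: dict):
    """Reject tool calls that are not allowlisted or carry malformed arguments."""
    name = call["function"]["name"]
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    args = json.loads(call["function"]["arguments"])
    if not isinstance(args, dict):
        raise ValueError("tool arguments must decode to a JSON object")
    return name, args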
Privacy vs Features
- Offline: No internet = no web search
- Local Only: Can't access cloud services
- Updates: Manual model downloads
- Support: Community-driven
Why This Tool is Unmatched
1. Complete Privacy
Data never leaves your machine. Perfect for sensitive work.
2. Autonomous Agents
Tool-calling support. AI can take actions, not just answer questions.
3. Zero Cost
Unlimited usage after initial download. No API bills.
4. OpenRouter-Compatible
Same interface as cloud APIs. Easy switching.
5. Hardware Optimization
Automatic quantization and device selection.
6. MCP Integration
Call any MCP tool from local AI.
7. Multiple Models
Support for Qwen, Llama, Mistral, Gemma, and more.
8. Production-Ready
Error handling, validation, monitoring.
9. Air-Gap Compatible
Works without internet after initial setup.
10. Future-Proof
Multi-modal, embeddings, more features coming.
Powered by MCP-Link
This tool is part of the MCP-Link Server — the only MCP server with local LLM inference and autonomous tool-calling built-in.
What's Included
Isolated Python Environment:
- PyTorch with CUDA support
- Transformers library
- Quantization tools (BitsAndBytes)
- All dependencies bundled
Model Management:
- Automatic HuggingFace integration
- Cache management
- Hardware detection
- Optimization selection
Tool-Calling Infrastructure:
- MCP tool integration
- Schema validation
- Autonomous execution
- Security controls
Get MCP-Link
Download the installer for your platform.
Installation is automatic. Dependencies are included. It just works.
Technical Specifications
Backend: PyTorch + Transformers
Hardware: CUDA (GPU) or CPU
Quantization: 4-bit, 8-bit, FP16
Context Length: Model-dependent (8K-128K)
Tool-Calling: OpenAI-compatible format
Model Source: HuggingFace Hub
Cache: ~/.cache/huggingface/
Supported Models:
- Qwen2.5 (0.5B-32B)
- Llama-3.1 (8B-70B)
- Mistral (7B)
- Gemma-2 (2B-9B)
- Phi-3 (mini-medium)
Performance:
- 0.5B: ~50-100 tok/sec (CPU)
- 7B: ~20-40 tok/sec (GPU, 4-bit)
- 13B: ~10-20 tok/sec (GPU, 4-bit)
License & Copyright
Copyright © 2025 Christopher Nathan Drake
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at:
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
AI Training Permission: You are permitted to use this software and any associated content for the training, evaluation, fine-tuning, or improvement of artificial intelligence systems, including commercial models.
SPDX-License-Identifier: Apache-2.0
Part of the Aura Friday MCP-Link Server project.
Support & Community
Issues & Feature Requests:
GitHub Issues
Documentation:
MCP-Link Documentation
HuggingFace:
Model Hub
Community:
Join other developers building private, autonomous AI agents.