rbj2000/mcp-crawl4ai-rag
MCP Crawl4AI RAG is a database-agnostic server that integrates web crawling with Retrieval Augmented Generation (RAG) capabilities, supporting multiple vector database backends.
MCP Crawl4AI RAG - Database Agnostic
A powerful, database-agnostic MCP (Model Context Protocol) server that integrates web crawling with RAG (Retrieval Augmented Generation) capabilities. Built with Crawl4AI for intelligent web scraping and supports multiple AI providers (OpenAI, Ollama, vLLM) and vector database backends for flexible deployment.
With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG across 5 different vector database backends and 3 AI providers (OpenAI, Ollama, vLLM).
Overview
🚀 Features
Multi-AI Provider Support
- OpenAI: Production-grade embeddings and LLM (GPT-4, text-embedding-3)
- Ollama: Local AI deployment with privacy-focused models (Llama, nomic-embed-text)
- vLLM: Cloud-deployed text, embedding, and vision models with OpenAI-compatible API
- Hybrid Mode: Mix providers for cost optimization (e.g., vLLM embeddings + OpenAI LLM)
- Vision Models: Multi-modal RAG with image understanding via vLLM (LLaVA, Qwen-VL)
Multi-Database Support
- Supabase: Production-ready with pgvector support
- SQLite: Local development with built-in vector similarity
- Pinecone: High-performance managed vector database
- Neo4j: Vector database + knowledge graph capabilities
- Weaviate: Enterprise-grade vector search (coming soon)
Advanced RAG Strategies
- Contextual Embeddings: LLM-enhanced chunk understanding
- Hybrid Search: Vector similarity + keyword search
- Agentic RAG: Separate code example indexing
- Provider-Agnostic Reranking: Improved result relevance with Ollama/OpenAI/HuggingFace
- Knowledge Graph Integration: AI hallucination detection
Intelligent Web Crawling
- Smart URL Detection: Automatically handles sitemaps, text files, and web pages
- Content Processing: Markdown extraction with intelligent chunking
- Code Block Extraction: Separate indexing of code examples
- Batch Processing: Efficient parallel crawling and embedding
🎯 Quick Start
Option 1: Docker (Recommended)
# Clone the repository
git clone https://github.com/rbj2000/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# Deploy with your preferred database
docker-compose --profile supabase up # Supabase
docker-compose --profile sqlite up # SQLite
docker-compose --profile pinecone up # Pinecone
docker-compose --profile neo4j up # Neo4j
Option 2: Local Development
# Install dependencies
uv pip install -e . # Basic installation
uv pip install -e ".[supabase]" # With Supabase support
uv pip install -e ".[all]" # All database providers
# Setup Crawl4AI
crawl4ai-setup
# Configure environment
export AI_PROVIDER=openai # or ollama, vllm
export VECTOR_DB_PROVIDER=sqlite # or supabase, pinecone, neo4j_vector
export OPENAI_API_KEY=your_key # For OpenAI provider
# Run the server
python src/crawl4ai_mcp_refactored.py
Vision
The Crawl4AI RAG MCP server is just the beginning. Here's where we're headed:
- Integration with Archon: Building this system directly into Archon to create a comprehensive knowledge engine for AI coding assistants to build better AI agents.
- ✅ Multiple AI Providers (Completed): Full support for OpenAI, Ollama, and vLLM with hybrid configurations for complete control and privacy.
- ✅ Advanced RAG Strategies (Completed): Contextual embeddings, hybrid search, agentic RAG, and provider-agnostic reranking are now available.
- 🚧 Multi-Modal RAG (In Progress): Adding vision model support for image understanding, enabling search across text, images, and diagrams in documentation. vLLM provider foundation completed (Story 2.1 ✅).
- Enhanced Chunking Strategy: Implementing a Context 7-inspired chunking approach that focuses on examples and creates distinct, semantically meaningful sections for each chunk, improving retrieval precision.
- Performance Optimization: Increasing crawling and indexing speed so that new documentation can be indexed quickly and then used within the same prompt in an AI coding assistant.
Features
- Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)
- Recursive Crawling: Follows internal links to discover content
- Parallel Processing: Efficiently crawls multiple pages simultaneously
- Content Chunking: Intelligently splits content by headers and size for better processing
- Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision
- Source Retrieval: Retrieve sources available for filtering to guide the RAG process
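The content chunking feature above (splitting by headers and size) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the server's actual chunker: the function name `chunk_markdown` and the `max_chars` default are hypothetical.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split markdown at headers, then cap each piece at max_chars (hypothetical helper)."""
    # Zero-width split before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        while len(section) > max_chars:
            # Prefer cutting at the last paragraph break inside the size window.
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars  # no paragraph break found, so cut hard at the limit
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```

Splitting on headers first keeps each chunk tied to one topic; the size cap only applies to sections that are still too long.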
Tools
The server provides essential web crawling and search tools:
Core Tools (Always Available)
- crawl_single_page: Quickly crawl a single web page and store its content in the vector database
- smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)
- get_available_sources: Get a list of all available sources (domains) in the database
- perform_rag_query: Search for relevant content using semantic search with optional source filtering
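The URL-type dispatch that smart_crawl_url performs could look something like the sketch below. The function name `classify_url`, the returned labels, and the exact matching rules are assumptions for illustration; the server's own detection logic may differ.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Guess how a URL should be crawled (hypothetical helper, not the server's API)."""
    path = urlparse(url).path.lower()
    if path.endswith(".xml") and "sitemap" in path:
        return "sitemap"      # crawl every URL listed in the sitemap
    if path.endswith(".txt"):
        return "text_file"    # e.g. llms-full.txt: fetch and index directly
    return "webpage"          # regular page: crawl recursively via internal links
```

For example, `classify_url("https://example.com/llms-full.txt")` returns `"text_file"`.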
Conditional Tools
- search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.
Knowledge Graph Tools (requires USE_KNOWLEDGE_GRAPH=true, see below)
- parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships for hallucination detection
- check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports, method calls, and class usage against the knowledge graph
- query_knowledge_graph: Explore and query the Neo4j knowledge graph with commands like repos, classes, methods, and custom Cypher queries
Prerequisites
- Docker/Docker Desktop if running the MCP server as a container (recommended)
- Python 3.12+ if running the MCP server directly through uv
- Supabase (database for RAG)
- OpenAI API key (for generating embeddings)
- Neo4j (optional, for knowledge graph functionality) - see Knowledge Graph Setup section
Installation
Using Docker (Recommended)
- Clone this repository:
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
- Build the Docker image:
docker build -t mcp/crawl4ai-rag --build-arg PORT=8051 .
- Create a .env file based on the configuration section below
Using uv directly (no Docker)
- Clone this repository:
git clone https://github.com/coleam00/mcp-crawl4ai-rag.git
cd mcp-crawl4ai-rag
- Install uv if you don't have it:
pip install uv
- Create and activate a virtual environment:
uv venv
.venv\Scripts\activate
# on Mac/Linux: source .venv/bin/activate
- Install dependencies:
uv pip install -e .
crawl4ai-setup
- Create a .env file based on the configuration section below
Database Setup
Before running the server, you need to set up the database with the pgvector extension:
- Go to the SQL Editor in your Supabase dashboard (create a new project first if necessary)
- Create a new query and paste the contents of crawled_pages.sql
- Run the query to create the necessary tables and functions
Knowledge Graph Setup (Optional)
To enable AI hallucination detection and repository analysis features, you need to set up Neo4j.
Note that the knowledge graph implementation isn't fully compatible with Docker yet, so for now I recommend running directly through uv if you want to use hallucination detection within the MCP server.
For installing Neo4j:
Local AI Package (Recommended)
The easiest way to get Neo4j running locally is with the Local AI Package - a curated collection of local AI services including Neo4j:
- Clone the Local AI Package:
git clone https://github.com/coleam00/local-ai-packaged.git
cd local-ai-packaged
- Start Neo4j: Follow the instructions in the Local AI Package repository to start Neo4j with Docker Compose
- Default connection details:
  - URI: bolt://localhost:7687
  - Username: neo4j
  - Password: Check the Local AI Package documentation for the default password
Manual Neo4j Installation
Alternatively, install Neo4j directly:
- Install Neo4j Desktop: Download from neo4j.com/download
- Create a new database:
  - Open Neo4j Desktop
  - Create a new project and database
  - Set a password for the neo4j user
  - Start the database
- Note your connection details:
  - URI: bolt://localhost:7687 (default)
  - Username: neo4j (default)
  - Password: Whatever you set during creation
Configuration
Create a .env file in the project root with the following variables:
# MCP Server Configuration
HOST=0.0.0.0
PORT=8051
TRANSPORT=sse
# AI Provider Selection
AI_PROVIDER=openai # Options: openai, ollama, vllm, mixed
EMBEDDING_PROVIDER=openai # Optional: override for embeddings
LLM_PROVIDER=openai # Optional: override for LLM
# OpenAI Configuration (if using OpenAI)
OPENAI_API_KEY=your_openai_api_key
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_LLM_MODEL=gpt-4o-mini
# Ollama Configuration (if using Ollama)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_EMBEDDING_MODEL=nomic-embed-text
OLLAMA_LLM_MODEL=llama3.2:1b
# vLLM Configuration (if using vLLM)
VLLM_BASE_URL=https://your-vllm-endpoint.com/v1
VLLM_API_KEY=your_vllm_api_key
VLLM_TEXT_MODEL=meta-llama/Llama-3.1-8B-Instruct
VLLM_EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
VLLM_VISION_ENABLED=false # Enable vision models (coming soon)
# Vector Database Configuration
VECTOR_DB_PROVIDER=supabase # Options: supabase, sqlite, neo4j_vector, pinecone, weaviate
SUPABASE_URL=your_supabase_project_url
SUPABASE_SERVICE_KEY=your_supabase_service_key
EMBEDDING_DIMENSION=1536 # Must match your embedding model dimensions
# RAG Strategies (set to "true" or "false", default to "false")
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=false
USE_AGENTIC_RAG=false
USE_RERANKING=false
RERANKING_PROVIDER=ollama # Options: ollama, openai, huggingface
USE_KNOWLEDGE_GRAPH=false
# Neo4j Configuration (required for knowledge graph functionality)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_neo4j_password
For detailed configuration options and examples, see .
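To make the constraints in the variables above concrete, here is a small sketch of how such settings might be read and validated at startup. It is only an illustration: `load_settings` is a hypothetical helper, the fallback of 1536 for EMBEDDING_DIMENSION simply mirrors the example value, and the rule that USE_KNOWLEDGE_GRAPH requires NEO4J_URI is an assumption drawn from the comments in the template.

```python
import os

VALID_AI_PROVIDERS = {"openai", "ollama", "vllm", "mixed"}
VALID_DB_PROVIDERS = {"supabase", "sqlite", "neo4j_vector", "pinecone", "weaviate"}
RAG_FLAGS = (
    "USE_CONTEXTUAL_EMBEDDINGS",
    "USE_HYBRID_SEARCH",
    "USE_AGENTIC_RAG",
    "USE_RERANKING",
    "USE_KNOWLEDGE_GRAPH",
)


def load_settings(env=None) -> dict:
    """Read and validate settings from a mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    ai_provider = env.get("AI_PROVIDER", "")
    if ai_provider not in VALID_AI_PROVIDERS:
        raise ValueError(f"AI_PROVIDER must be one of {sorted(VALID_AI_PROVIDERS)}")
    db_provider = env.get("VECTOR_DB_PROVIDER", "")
    if db_provider not in VALID_DB_PROVIDERS:
        raise ValueError(f"VECTOR_DB_PROVIDER must be one of {sorted(VALID_DB_PROVIDERS)}")
    # Assumption: fall back to the example value when the variable is unset.
    dimension = int(env.get("EMBEDDING_DIMENSION", "1536"))
    if dimension <= 0:
        raise ValueError("EMBEDDING_DIMENSION must be a positive integer")
    # Strategy flags are "true"/"false" strings and default to "false".
    flags = {name: env.get(name, "false").strip().lower() == "true" for name in RAG_FLAGS}
    if flags["USE_KNOWLEDGE_GRAPH"] and not env.get("NEO4J_URI"):
        raise ValueError("USE_KNOWLEDGE_GRAPH=true requires NEO4J_URI")
    return {
        "ai_provider": ai_provider,
        "db_provider": db_provider,
        "embedding_dimension": dimension,
        "flags": flags,
    }
```

Failing fast on an unknown provider or a missing Neo4j URI surfaces configuration mistakes before any crawling or embedding work starts.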
RAG Strategy Options
The Crawl4AI RAG MCP server supports five powerful RAG strategies that can be enabled independently:
1. USE_CONTEXTUAL_EMBEDDINGS
When enabled, this strategy enhances each chunk's embedding with additional context from the entire document. The system passes both the full document and the specific chunk to an LLM (configured via MODEL_CHOICE) to generate enriched context that gets embedded alongside the chunk content.
- When to use: Enable this when you need high-precision retrieval where context matters, such as technical documentation where terms might have different meanings in different sections.
- Trade-offs: Slower indexing due to LLM calls for each chunk, but significantly better retrieval accuracy.
- Cost: Additional LLM API calls during indexing.
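A rough sketch of the idea, assuming the LLM is available as a plain callable that takes a prompt and returns text. The function name, prompt wording, separator, and `max_doc_chars` limit are all illustrative assumptions rather than the server's actual implementation.

```python
from typing import Callable


def contextualize_chunk(
    document: str,
    chunk: str,
    llm: Callable[[str], str],
    max_doc_chars: int = 25000,
) -> str:
    """Prefix a chunk with LLM-written context before embedding (hypothetical helper)."""
    prompt = (
        f"<document>\n{document[:max_doc_chars]}\n</document>\n"
        f"<chunk>\n{chunk}\n</chunk>\n"
        "Give a short context that situates this chunk within the document "
        "to improve search retrieval. Answer with the context only."
    )
    context = llm(prompt).strip()
    # Fall back to the bare chunk if the model returns nothing useful.
    return f"{context}\n---\n{chunk}" if context else chunk
```

The returned string, not the raw chunk, is what would be passed to the embedding model, which is why this strategy costs one extra LLM call per chunk.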
2. USE_HYBRID_SEARCH
Combines traditional keyword search with semantic vector search to provide more comprehensive results. The system performs both searches in parallel and intelligently merges results, prioritizing documents that appear in both result sets.
- When to use: Enable this when users might search using specific technical terms, function names, or when exact keyword matches are important alongside semantic understanding.
- Trade-offs: Slightly slower search queries but more robust results, especially for technical content.
- Cost: No additional API costs, just computational overhead.
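The merge step described above can be sketched as follows, assuming each search returns an ordered list of document IDs. The function name `merge_hybrid` and the tie-breaking order (vector-only hits before keyword-only hits) are assumptions; the server's own merging rules are not spelled out here.

```python
def merge_hybrid(vector_hits: list[str], keyword_hits: list[str], limit: int = 5) -> list[str]:
    """Merge two ranked ID lists, putting IDs found by both searches first."""
    keyword_set = set(keyword_hits)
    merged: list[str] = []
    seen: set[str] = set()
    # First pass: documents that appear in both result sets, in vector-rank order.
    for doc_id in vector_hits:
        if doc_id in keyword_set and doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    # Second pass: everything else, vector results before keyword-only results.
    for doc_id in vector_hits + keyword_hits:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged[:limit]
```

With `vector_hits=["a", "b", "c"]` and `keyword_hits=["c", "d", "a"]`, the documents `a` and `c` are promoted because both searches found them.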
3. USE_AGENTIC_RAG
Enables specialized code example extraction and storage. When crawling documentation, the system identifies code blocks (≥300 characters), extracts them with surrounding context, generates summaries, and stores them in a separate vector database table specifically designed for code search.
- When to use: Essential for AI coding assistants that need to find specific code examples, implementation patterns, or usage examples from documentation.
- Trade-offs: Significantly slower crawling due to code extraction and summarization, requires more storage space.
- Cost: Additional LLM API calls for summarizing each code example.
- Benefits: Provides a dedicated search_code_examples tool that AI agents can use to find specific code implementations.
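The extraction step might look roughly like this sketch, which pulls fenced code blocks of at least 300 characters out of crawled markdown together with some surrounding text. The helper name, the returned dictionary keys, and the `context_chars` window are assumptions; summarization and storage are left out.

```python
import re

TICKS = "`" * 3  # fence marker, built this way to keep the example easy to embed
FENCE = re.compile(TICKS + r"(\w*)\n(.*?)" + TICKS, re.DOTALL)


def extract_code_blocks(markdown: str, min_length: int = 300, context_chars: int = 200) -> list[dict]:
    """Return long fenced code blocks with nearby text (hypothetical helper)."""
    blocks = []
    for match in FENCE.finditer(markdown):
        code = match.group(2).strip()
        if len(code) < min_length:
            continue  # short snippets are skipped, matching the 300-character threshold
        blocks.append({
            "language": match.group(1),
            "code": code,
            "context_before": markdown[max(0, match.start() - context_chars):match.start()].strip(),
            "context_after": markdown[match.end():match.end() + context_chars].strip(),
        })
    return blocks
```

Each returned entry would then be summarized by the LLM and embedded into the separate code-examples table.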
4. USE_RERANKING
Provider-agnostic reranking that improves search result relevance. Supports multiple reranking providers (Ollama, OpenAI, HuggingFace) that can be configured independently from your primary AI provider.
- When to use: Enable this when search precision is critical and you need the most relevant results at the top. Particularly useful for complex queries where semantic similarity alone might not capture query intent.
- Trade-offs: Adds ~100-200ms to search queries depending on result count and provider, but significantly improves result ordering.
- Cost: Depends on provider - Ollama is free (local), HuggingFace has API costs, OpenAI uses embedding similarity (existing costs).
- Configuration: Set RERANKING_PROVIDER to choose the provider (defaults to your AI_PROVIDER).
- Benefits: Better result relevance, especially for complex queries. Works with both regular RAG search and code example search.
5. USE_KNOWLEDGE_GRAPH
Enables AI hallucination detection and repository analysis using Neo4j knowledge graphs. When enabled, the system can parse GitHub repositories into a graph database and validate AI-generated code against real repository structures. (NOT fully compatible with Docker yet, I'd recommend running through uv)
- When to use: Enable this for AI coding assistants that need to validate generated code against real implementations, or when you want to detect when AI models hallucinate non-existent methods, classes, or incorrect usage patterns.
- Trade-offs: Requires Neo4j setup and additional dependencies. Repository parsing can be slow for large codebases, and validation requires repositories to be pre-indexed.
- Cost: No additional API costs for validation, but requires Neo4j infrastructure (can use free local installation or cloud AuraDB).
- Benefits: Provides three powerful tools: parse_github_repository for indexing codebases, check_ai_script_hallucinations for validating AI-generated code, and query_knowledge_graph for exploring indexed repositories.
You can now tell the AI coding assistant to add a Python GitHub repository to the knowledge graph like:
"Add https://github.com/pydantic/pydantic-ai.git to the knowledge graph"
Make sure the repo URL ends with .git.
You can also have the AI coding assistant check for hallucinations with scripts it just created, or you can manually run the command:
python knowledge_graphs/ai_hallucination_detector.py [full path to your script to analyze]
Recommended Configurations
For general documentation RAG:
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=true
For AI coding assistant with code examples:
USE_CONTEXTUAL_EMBEDDINGS=true
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
USE_KNOWLEDGE_GRAPH=false
For AI coding assistant with hallucination detection:
USE_CONTEXTUAL_EMBEDDINGS=true
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=true
USE_RERANKING=true
USE_KNOWLEDGE_GRAPH=true
For fast, basic RAG:
USE_CONTEXTUAL_EMBEDDINGS=false
USE_HYBRID_SEARCH=true
USE_AGENTIC_RAG=false
USE_RERANKING=false
USE_KNOWLEDGE_GRAPH=false
Running the Server
Using Docker
docker run --env-file .env -p 8051:8051 mcp/crawl4ai-rag
Using Python
uv run src/crawl4ai_mcp.py
The server will start and listen on the configured host and port.
Integration with MCP Clients
SSE Configuration
Once you have the server running with SSE transport, you can connect to it using this configuration:
{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "url": "http://localhost:8051/sse"
    }
  }
}
Note for Windsurf users: Use serverUrl instead of url in your configuration:
{
  "mcpServers": {
    "crawl4ai-rag": {
      "transport": "sse",
      "serverUrl": "http://localhost:8051/sse"
    }
  }
}
Note for Docker users: Use host.docker.internal instead of localhost if your client is running in a different container. This applies if you are using this MCP server within n8n!
Note for Claude Code users:
claude mcp add-json crawl4ai-rag '{"type":"http","url":"http://localhost:8051/sse"}' --scope user
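The client-specific variations above (url versus serverUrl, localhost versus host.docker.internal) can be captured in a small generator. This is only a convenience sketch; `sse_client_config` and its parameters are hypothetical and not part of the server.

```python
import json


def sse_client_config(port: int = 8051, client: str = "generic", in_docker: bool = False) -> dict:
    """Build an SSE client configuration dictionary (hypothetical helper)."""
    # Clients running in another container reach the host through host.docker.internal.
    host = "host.docker.internal" if in_docker else "localhost"
    # Windsurf expects "serverUrl" where other clients use "url".
    url_key = "serverUrl" if client == "windsurf" else "url"
    return {
        "mcpServers": {
            "crawl4ai-rag": {
                "transport": "sse",
                url_key: f"http://{host}:{port}/sse",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(sse_client_config(client="windsurf"), indent=2))
```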
Stdio Configuration
Add this server to your MCP configuration for Claude Desktop, Windsurf, or any other MCP client:
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "python",
      "args": ["path/to/crawl4ai-mcp/src/crawl4ai_mcp.py"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
        "USE_KNOWLEDGE_GRAPH": "false",
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USER": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password"
      }
    }
  }
}
Docker with Stdio Configuration
{
  "mcpServers": {
    "crawl4ai-rag": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
               "-e", "TRANSPORT",
               "-e", "OPENAI_API_KEY",
               "-e", "SUPABASE_URL",
               "-e", "SUPABASE_SERVICE_KEY",
               "-e", "USE_KNOWLEDGE_GRAPH",
               "-e", "NEO4J_URI",
               "-e", "NEO4J_USER",
               "-e", "NEO4J_PASSWORD",
               "mcp/crawl4ai-rag"],
      "env": {
        "TRANSPORT": "stdio",
        "OPENAI_API_KEY": "your_openai_api_key",
        "SUPABASE_URL": "your_supabase_url",
        "SUPABASE_SERVICE_KEY": "your_supabase_service_key",
        "USE_KNOWLEDGE_GRAPH": "false",
        "NEO4J_URI": "bolt://localhost:7687",
        "NEO4J_USER": "neo4j",
        "NEO4J_PASSWORD": "your_neo4j_password"
      }
    }
  }
}
Knowledge Graph Architecture
The knowledge graph system stores repository code structure in Neo4j with the following components:
Core Components (knowledge_graphs/ folder):
- parse_repo_into_neo4j.py: Clones and analyzes GitHub repositories, extracting Python classes, methods, functions, and imports into Neo4j nodes and relationships
- ai_script_analyzer.py: Parses Python scripts using AST to extract imports, class instantiations, method calls, and function usage
- knowledge_graph_validator.py: Validates AI-generated code against the knowledge graph to detect hallucinations (non-existent methods, incorrect parameters, etc.)
- hallucination_reporter.py: Generates comprehensive reports about detected hallucinations with confidence scores and recommendations
- query_knowledge_graph.py: Interactive CLI tool for exploring the knowledge graph (functionality now integrated into MCP tools)
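To give a feel for the analyze-then-validate flow these components describe, here is a simplified sketch using Python's standard ast module. It is not the code in knowledge_graphs/: the function names are hypothetical, and a plain set of known names stands in for the Neo4j lookup.

```python
import ast


def summarize_script(source: str) -> dict:
    """Collect imported names and called function/method names from Python source."""
    tree = ast.parse(source)
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            imports.update(f"{module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                calls.add(func.attr)   # method call such as agent.run_sync(...)
            elif isinstance(func, ast.Name):
                calls.add(func.id)     # plain call or class instantiation
    return {"imports": sorted(imports), "calls": sorted(calls)}


def unknown_calls(summary: dict, known_names: set[str]) -> list[str]:
    """Names that were called but are absent from the known set (stand-in for the graph)."""
    return [name for name in summary["calls"] if name not in known_names]
```

A call that shows up in `unknown_calls` is a candidate hallucination; the real validator also checks parameters and reports confidence scores, which this sketch omits.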
Knowledge Graph Schema:
The Neo4j database stores code structure as:
Nodes:
- Repository: GitHub repositories
- File: Python files within repositories
- Class: Python classes with methods and attributes
- Method: Class methods with parameter information
- Function: Standalone functions
- Attribute: Class attributes
Relationships:
- Repository -[:CONTAINS]-> File
- File -[:DEFINES]-> Class
- File -[:DEFINES]-> Function
- Class -[:HAS_METHOD]-> Method
- Class -[:HAS_ATTRIBUTE]-> Attribute
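The schema above could be populated with Cypher statements along these lines. The node labels and relationship types come from the schema; the property names (`name`, `path`), the parameter names, and the helper itself are assumptions made for this sketch.

```python
def cypher_for_class(repo: str, path: str, class_name: str, methods: list[str]) -> list[tuple[str, dict]]:
    """Build (query, parameters) pairs that record one class and its methods."""
    statements = [
        (
            "MERGE (r:Repository {name: $repo}) "
            "MERGE (f:File {path: $path}) "
            "MERGE (r)-[:CONTAINS]->(f) "
            "MERGE (c:Class {name: $cls}) "
            "MERGE (f)-[:DEFINES]->(c)",
            {"repo": repo, "path": path, "cls": class_name},
        )
    ]
    for method in methods:
        statements.append(
            (
                # Merging the whole pattern keeps same-named methods separate per class.
                "MATCH (c:Class {name: $cls}) "
                "MERGE (c)-[:HAS_METHOD]->(m:Method {name: $method})",
                {"cls": class_name, "method": method},
            )
        )
    return statements
```

Each pair could then be passed to a Neo4j session's `run(query, parameters)` call; using MERGE rather than CREATE keeps re-parsing a repository from duplicating nodes.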
Workflow:
- Repository Parsing: Use the parse_github_repository tool to clone and analyze open-source repositories
- Code Validation: Use the check_ai_script_hallucinations tool to validate AI-generated Python scripts
- Knowledge Exploration: Use the query_knowledge_graph tool to explore available repositories, classes, and methods
Building Your Own Server
This implementation provides a foundation for building more complex MCP servers with web crawling capabilities. To build your own:
- Add your own tools by creating methods with the @mcp.tool() decorator
- Create your own lifespan function to add your own dependencies
- Modify the utils.py file for any helper functions you need
- Extend the crawling capabilities by adding more specialized crawlers
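As a starting point for the lifespan step, the sketch below shows the general shape of a lifespan function using only the standard library. `AppContext` and its `cache` field are hypothetical placeholders for whatever dependencies (crawler, database client) your own server needs.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


@dataclass
class AppContext:
    """Hypothetical container for dependencies shared by tools."""
    cache: dict = field(default_factory=dict)


@asynccontextmanager
async def app_lifespan(server):
    """Create dependencies on startup and release them on shutdown."""
    ctx = AppContext()
    try:
        yield ctx
    finally:
        ctx.cache.clear()  # stand-in for closing crawlers or database clients


async def demo() -> int:
    # Stand-in for the server entering and leaving its lifespan.
    async with app_lifespan(server=None) as ctx:
        ctx.cache["pages"] = 3
        return ctx.cache["pages"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the MCP Python SDK, a function of this shape is typically passed to the server constructor through its lifespan argument so that tools can reach the yielded context; check this against the SDK version you use.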