mcp-crawl4ai-rag

coleam00/mcp-crawl4ai-rag


mcp-crawl4ai-rag is hosted online, so all tools can be tested directly either in the Inspector tab or in the Online Client.


Crawl4AI RAG MCP Server is a powerful implementation of the Model Context Protocol (MCP) integrated with Crawl4AI and Supabase, providing AI agents and AI coding assistants with advanced web crawling and RAG capabilities.


Tools

Functions exposed to the LLM to take actions

crawl_single_page

Crawl a single web page and store its content in Supabase.

This tool is ideal for quickly retrieving content from a specific URL without following links.
The content is stored in Supabase for later retrieval and querying.

Args:
    ctx: The MCP server provided context
    url: URL of the web page to crawl

Returns:
    Summary of the crawling operation and storage in Supabase
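As a rough illustration, an MCP client invokes this tool with a standard `tools/call` request; the sketch below shows the assumed JSON-RPC payload shape (the URL is a placeholder, and the exact envelope depends on your client library).

```python
import json

# Minimal sketch of the JSON-RPC payload an MCP client would send to invoke
# crawl_single_page. The field names follow the MCP tools/call convention;
# the URL is purely illustrative.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "crawl_single_page",
        "arguments": {"url": "https://example.com/docs/intro"},
    },
}
print(json.dumps(request, indent=2))
```

The server replies with a JSON string summarizing the crawl and what was stored in Supabase.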

smart_crawl_url

Intelligently crawl a URL based on its type and store content in Supabase.

This tool automatically detects the URL type and applies the appropriate crawling method:
- For sitemaps: Extracts and crawls all URLs in parallel
- For text files (llms.txt): Directly retrieves the content
- For regular webpages: Recursively crawls internal links up to the specified depth

All crawled content is chunked and stored in Supabase for later retrieval and querying.

Args:
    ctx: The MCP server provided context
    url: URL to crawl (can be a regular webpage, sitemap.xml, or .txt file)
    max_depth: Maximum recursion depth for regular URLs (default: 3)
    max_concurrent: Maximum number of concurrent browser sessions (default: 10)
    chunk_size: Maximum size of each content chunk in characters (default: 1000)

Returns:
    JSON string with crawl summary and storage information
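The dispatch logic described above can be sketched as a small classifier; this is an illustrative approximation of the decision, not the server's actual implementation.

```python
from urllib.parse import urlparse

def detect_crawl_strategy(url: str) -> str:
    """Illustrative sketch of the URL-type dispatch described above:
    sitemaps are crawled in parallel, .txt files fetched directly,
    and everything else crawled recursively."""
    path = urlparse(url).path.lower()
    if path.endswith(".xml") and "sitemap" in path:
        return "sitemap"       # extract and crawl all listed URLs in parallel
    if path.endswith(".txt"):
        return "text_file"     # retrieve the content directly (e.g. llms.txt)
    return "recursive"         # follow internal links up to max_depth
```

For example, `detect_crawl_strategy("https://example.com/sitemap.xml")` yields `"sitemap"`, while a plain docs page falls through to `"recursive"`.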

get_available_sources

Get all available sources from the sources table.

This tool returns a list of all unique sources (domains) that have been crawled and stored
in the database, along with their summaries and statistics. This is useful for discovering 
what content is available for querying.

Always use this tool before calling the RAG query or code example query tool
with a specific source filter!

Args:
    ctx: The MCP server provided context

Returns:
    JSON string with the list of available sources and their details
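A client would typically parse the returned JSON string to collect valid source identifiers before filtering later queries. The response keys below (`sources`, `source_id`, `summary`, `total_chunks`) are assumed for illustration; check the actual tool output for the real field names.

```python
import json

# Hypothetical response from get_available_sources (field names assumed).
raw = json.dumps({
    "success": True,
    "sources": [
        {"source_id": "example.com", "summary": "Example docs", "total_chunks": 42},
    ],
})

# Extract the identifiers usable as source filters in subsequent queries.
source_ids = [s["source_id"] for s in json.loads(raw)["sources"]]
```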

perform_rag_query

Perform a RAG (Retrieval Augmented Generation) query on the stored content.

This tool searches the vector database for content relevant to the query and returns
the matching documents. Optionally filter by source domain.
Get the source by using the get_available_sources tool before calling this search!

Args:
    ctx: The MCP server provided context
    query: The search query
    source: Optional source domain to filter results (e.g., 'example.com')
    match_count: Maximum number of results to return (default: 5)

Returns:
    JSON string with the search results
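The recommended two-step workflow, discover sources first, then filter the query, reduces to passing a known source domain in the arguments. A sketch of the argument dict (query text and domain are placeholders):

```python
# Arguments for perform_rag_query, following the parameter names documented
# above. The source value must be one returned by get_available_sources.
query_args = {
    "query": "how do I configure authentication?",
    "source": "example.com",  # obtained from get_available_sources first
    "match_count": 5,
}
```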

search_code_examples

Search for code examples relevant to the query.

This tool searches the vector database for code examples relevant to the query and returns
the matching examples with their summaries. Optionally filter by source_id.
Get the source_id by using the get_available_sources tool before calling this search,
so you know which sources are available for filtering!

Args:
    ctx: The MCP server provided context
    query: The search query
    source_id: Optional source ID to filter results (e.g., 'example.com')
    match_count: Maximum number of results to return (default: 5)

Returns:
    JSON string with the search results

check_ai_script_hallucinations

Check an AI-generated Python script for hallucinations using the knowledge graph.

This tool analyzes a Python script for potential AI hallucinations by validating
imports, method calls, class instantiations, and function calls against a Neo4j
knowledge graph containing real repository data.

The tool performs comprehensive analysis including:
- Import validation against known repositories
- Method call validation on classes from the knowledge graph
- Class instantiation parameter validation
- Function call parameter validation
- Attribute access validation

Args:
    ctx: The MCP server provided context
    script_path: Absolute path to the Python script to analyze

Returns:
    JSON string with hallucination detection results, confidence scores, and recommendations

query_knowledge_graph

Query and explore the Neo4j knowledge graph containing repository data.

This tool provides comprehensive access to the knowledge graph for exploring repositories,
classes, methods, functions, and their relationships. Perfect for understanding what data
is available for hallucination detection and debugging validation results.

**⚠️ IMPORTANT: Always start with the `repos` command first!**
Before using any other commands, run `repos` to see what repositories are available
in your knowledge graph. This will help you understand what data you can explore.

## Available Commands:

**Repository Commands:**
- `repos` - **START HERE!** List all repositories in the knowledge graph
- `explore <repo_name>` - Get detailed overview of a specific repository

**Class Commands:**  
- `classes` - List all classes across all repositories (limited to 20)
- `classes <repo_name>` - List classes in a specific repository
- `class <class_name>` - Get detailed information about a specific class including methods and attributes

**Method Commands:**
- `method <method_name>` - Search for methods by name across all classes
- `method <method_name> <class_name>` - Search for a method within a specific class

**Custom Query:**
- `query <cypher_query>` - Execute a custom Cypher query (results limited to 20 records)

## Knowledge Graph Schema:

**Node Types:**
- Repository: `(r:Repository {name: string})`
- File: `(f:File {path: string, module_name: string})`
- Class: `(c:Class {name: string, full_name: string})`
- Method: `(m:Method {name: string, params_list: [string], params_detailed: [string], return_type: string, args: [string]})`
- Function: `(func:Function {name: string, params_list: [string], params_detailed: [string], return_type: string, args: [string]})`
- Attribute: `(a:Attribute {name: string, type: string})`

**Relationships:**
- `(r:Repository)-[:CONTAINS]->(f:File)`
- `(f:File)-[:DEFINES]->(c:Class)`
- `(c:Class)-[:HAS_METHOD]->(m:Method)`
- `(c:Class)-[:HAS_ATTRIBUTE]->(a:Attribute)`
- `(f:File)-[:DEFINES]->(func:Function)`

## Example Workflow:
```
1. repos                                    # See what repositories are available
2. explore pydantic-ai                      # Explore a specific repository
3. classes pydantic-ai                      # List classes in that repository
4. class Agent                              # Explore the Agent class
5. method run_stream                        # Search for run_stream method
6. method __init__ Agent                    # Find Agent constructor
7. query "MATCH (c:Class)-[:HAS_METHOD]->(m:Method) WHERE m.name = 'run' RETURN c.name, m.name LIMIT 5"
```

Args:
    ctx: The MCP server provided context
    command: Command string to execute (see available commands above)

Returns:
    JSON string with query results, statistics, and metadata

parse_github_repository

Parse a GitHub repository into the Neo4j knowledge graph.

This tool clones a GitHub repository, analyzes its Python files, and stores
the code structure (classes, methods, functions, imports) in Neo4j for use
in hallucination detection. The tool:

- Clones the repository to a temporary location
- Analyzes Python files to extract code structure
- Stores classes, methods, functions, and imports in Neo4j
- Provides detailed statistics about the parsing results
- Automatically handles module name detection for imports

Args:
    ctx: The MCP server provided context
    repo_url: GitHub repository URL (e.g., 'https://github.com/user/repo.git')

Returns:
    JSON string with parsing results, statistics, and repository information
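A quick client-side sanity check on the `repo_url` argument can catch malformed input before calling the tool; this is an illustrative check matching the URL shape given in the docstring, not the server's own validation.

```python
import re

def looks_like_github_repo(url: str) -> bool:
    # Illustrative check for the 'https://github.com/user/repo(.git)' shape
    # that parse_github_repository expects (not the server's own validation).
    return re.fullmatch(r"https://github\.com/[\w.-]+/[\w.-]+(\.git)?", url) is not None
```

For example, `looks_like_github_repo("https://github.com/user/repo.git")` is `True`, while a non-GitHub URL is rejected.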

Prompts

Interactive templates invoked by user choice

No prompts

Resources

Contextual data attached and managed by the client

No resources