# HoarderMCP
HoarderMCP is a Model Context Protocol (MCP) server designed for web content crawling, processing, and vector storage. It provides tools for ingesting web content, extracting relevant information, and making it searchable through vector similarity search.
## Features
- Web Crawling: Crawl websites and sitemaps to extract content
- Advanced Content Processing:
  - Semantic chunking for Markdown with header-based splitting
  - Code-aware chunking for Python and C# with syntax preservation
  - Configurable chunk sizes and overlap for optimal context (see the sketch after this list)
  - Token-based size optimization
- Vector Storage: Store and search content using vector embeddings
- API-First: RESTful API for easy integration with other services
- Asynchronous: Built with async/await for high performance
- Extensible: Support for multiple vector stores (Milvus, FAISS, Chroma, etc.)
- Observability: Integrated with Langfuse for tracing and monitoring
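Configurable chunk size and overlap work like a sliding window over the source text. The snippet below is only a generic, conceptual sketch of that idea; HoarderMCP's own chunkers are header- and syntax-aware rather than fixed-width.

```python
# Conceptual sketch of size/overlap chunking -- not HoarderMCP's implementation.
# chunk_overlap must be smaller than chunk_size for the window to advance.
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_window_chunks("lorem ipsum " * 400)
print(f"{len(chunks)} chunks, each up to 1000 characters, overlapping by 200")
```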
## Installation
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/hoardermcp.git
   cd hoardermcp
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the package in development mode:

   ```bash
   pip install -e .
   ```

4. Install development dependencies:

   ```bash
   pip install -r requirements-dev.txt
   ```
## Prerequisites

- Python 3 with `pip` and `venv` (used in the installation steps above)
- Docker with Docker Compose (used to run Milvus for vector storage)
## Running the Server
1. Start Milvus using Docker:

   ```bash
   docker-compose up -d
   ```

2. Run the development server:

   ```bash
   python -m hoardermcp.main --reload
   ```

The API will be available at `http://localhost:8000`.
## API Documentation
Once the server is running, you can access:
- OpenAPI docs: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
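To confirm the server is reachable, you can fetch the interactive docs page, for example with `httpx` (the same client used in the usage examples below):

```python
import httpx

# A 200 response from the interactive docs page confirms the server is running.
response = httpx.get("http://localhost:8000/docs")
print(response.status_code)
```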
## Configuration
Configuration can be provided through environment variables or a `.env` file. See `.env.example` for available options.
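For local development, one common pattern is to copy `.env.example` to `.env` and load it before starting the server. The snippet below is only a sketch of that pattern: whether HoarderMCP loads settings with `python-dotenv` is an assumption, and `MILVUS_HOST` is a hypothetical name, so check `.env.example` for the options the server actually reads.

```python
# Sketch of dotenv-style configuration loading; the variable name below is
# hypothetical -- consult .env.example for the real option names.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv(".env")  # values already set in the environment take precedence by default
print(os.getenv("MILVUS_HOST", "localhost"))  # MILVUS_HOST is an illustrative name
```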
## Usage

### Ingesting Content
```python
import httpx

# Ingest a webpage
response = httpx.post(
    "http://localhost:8000/ingest",
    json={
        "sources": [
            {
                "url": "https://example.com",
                "content_type": "text/html"
            }
        ]
    }
)
print(response.json())
```
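Because `sources` is a list, several pages can be submitted in one request. The sketch below simply reuses the request shape shown above; the second URL is a placeholder:

```python
import httpx

# Batch-ingest two pages in a single request, following the same schema as above.
response = httpx.post(
    "http://localhost:8000/ingest",
    json={
        "sources": [
            {"url": "https://example.com", "content_type": "text/html"},
            {"url": "https://example.com/about", "content_type": "text/html"},
        ]
    },
)
print(response.json())
```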
### Advanced Chunking Example
```python
from hoardermcp.core.chunking import ChunkingFactory, ChunkingConfig, ChunkingStrategy
from hoardermcp.models.document import Document, DocumentMetadata, DocumentType

# Create a markdown document
markdown_content = """# Title\n\n## Section 1\nContent for section 1\n\n## Section 2\nContent for section 2"""
doc = Document(
    id="test",
    content=markdown_content,
    metadata=DocumentMetadata(
        source="example.md",
        content_type=DocumentType.MARKDOWN
    )
)

# Configure chunking
config = ChunkingConfig(
    strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200
)

# Get appropriate chunker and process document
chunker = ChunkingFactory.get_chunker(
    doc_type=DocumentType.MARKDOWN,
    config=config
)
chunks = chunker.chunk_document(doc)

print(f"Document split into {len(chunks)} chunks")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1} (length: {len(chunk.content)}): {chunk.content[:50]}...")
```
### Searching Content
```python
import httpx

# Search for similar content
response = httpx.post(
    "http://localhost:8000/search",
    json={
        "query": "What is HoarderMCP?",
        "k": 5
    }
)
print(response.json())
```
## Development

### Code Style
This project uses black for formatting, isort for import sorting, and mypy for static type checking. Run the following commands before committing:

```bash
black .
isort .
mypy .
```
### Testing

Run tests using pytest:

```bash
pytest
```
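pytest can also be invoked programmatically, which is sometimes convenient in helper scripts; this is standard pytest API, not anything specific to HoarderMCP:

```python
import pytest

# Run the test suite quietly; pytest.main returns an exit code (0 means all tests passed).
exit_code = pytest.main(["-q"])
print(f"pytest exited with code {int(exit_code)}")
```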