mcphub-com/kreuzberg-mcp
kreuzberg-mcp is hosted online, so all tools can be tested directly, either in the Inspector tab or in the Online Client.
The Kreuzberg MCP Server provides document intelligence and text extraction over the Model Context Protocol (MCP) and handles a variety of file formats.
Tools
Functions exposed to the LLM to take actions
get_pdf_local_path
Given a PDF URL, deterministically derive a local filename from the URL,
check the system temp directory, download if missing, and return the full path.
Returns:
str: Full path to the local PDF file in the system temp directory.
Raises:
ValueError: If the URL is invalid or not HTTP/HTTPS.
URLError/HTTPError/OSError: If download or file operations fail.
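The behavior described for get_pdf_local_path can be sketched roughly as follows. This is an illustration only, not the server's actual code: the SHA-256 naming scheme and the helper names `derive_local_path` and `fetch_pdf_if_missing` are assumptions made for the example.

```python
import hashlib
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def derive_local_path(url: str) -> str:
    """Map a URL to a stable path in the system temp directory (hypothetical scheme)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid or non-HTTP(S) URL: {url!r}")
    # Assumption: hashing the URL gives the same filename for the same URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(tempfile.gettempdir(), f"{digest}.pdf")


def fetch_pdf_if_missing(url: str) -> str:
    """Return the cached path, downloading the file only when it is absent."""
    path = derive_local_path(url)
    if not os.path.exists(path):
        # May raise URLError, HTTPError, or OSError, as listed under Raises.
        urllib.request.urlretrieve(url, path)
    return path
```

Because the filename depends only on the URL, repeated calls with the same URL reuse the cached file instead of downloading it again.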
extract_document
Extract text content from a document file.
Args:
file_path: Path to the document file
mime_type: MIME type of the document (auto-detected if not provided)
force_ocr: Force OCR even for text-based documents
chunk_content: Split content into chunks
extract_tables: Extract tables from the document
extract_entities: Extract named entities
extract_keywords: Extract keywords
ocr_backend: OCR backend to use (tesseract, easyocr, paddleocr)
max_chars: Maximum characters per chunk
max_overlap: Character overlap between chunks
keyword_count: Number of keywords to extract
auto_detect_language: Auto-detect document language
Returns:
Extracted content with metadata, tables, chunks, entities, and keywords
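To make the interaction between max_chars and max_overlap concrete, here is a naive fixed-window chunker. It is a sketch under assumptions: the function name `chunk_text` is hypothetical, and the server's real chunking may split on different boundaries (for example sentences or paragraphs) rather than raw character offsets.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars, each sharing
    max_overlap characters with the previous window."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# Each chunk repeats the final character of the previous one.
print(chunk_text("abcdefghij", max_chars=4, max_overlap=1))
# ['abcd', 'defg', 'ghij']
```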
extract_bytes
Extract text content from document bytes.
Args:
content_base64: Base64-encoded document content
mime_type: MIME type of the document
force_ocr: Force OCR even for text-based documents
chunk_content: Split content into chunks
extract_tables: Extract tables from the document
extract_entities: Extract named entities
extract_keywords: Extract keywords
ocr_backend: OCR backend to use (tesseract, easyocr, paddleocr)
max_chars: Maximum characters per chunk
max_overlap: Character overlap between chunks
keyword_count: Number of keywords to extract
auto_detect_language: Auto-detect document language
Returns:
Extracted content with metadata, tables, chunks, entities, and keywords
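Since extract_bytes takes its input as a Base64 string, a client has to encode the raw bytes before calling the tool. The helper below is a minimal sketch; `build_extract_bytes_arguments` is a hypothetical name, and only the argument keys come from the listing above.

```python
import base64


def build_extract_bytes_arguments(data: bytes, mime_type: str, **options) -> dict:
    """Assemble the argument dict for extract_bytes from raw document bytes."""
    return {
        # Base64 output is pure ASCII, so decoding to str is safe.
        "content_base64": base64.b64encode(data).decode("ascii"),
        "mime_type": mime_type,
        # Optional flags such as force_ocr or extract_tables pass through as-is.
        **options,
    }


args = build_extract_bytes_arguments(b"hello", "text/plain", extract_keywords=True)
print(args["content_base64"])  # aGVsbG8=
```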
extract_simple
Simple text extraction from a document file.
Args:
file_path: Path to the document file
mime_type: MIME type of the document (auto-detected if not provided)
Returns:
Extracted text content as a string
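On the wire, an MCP client invokes a tool with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request for extract_simple; the helper name `build_tool_call` and the path `/tmp/report.pdf` are placeholders, and in practice an MCP client library would normally construct this message for you.

```python
import json


def build_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request for an MCP server."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical file path; mime_type is omitted so the server auto-detects it.
print(build_tool_call(1, "extract_simple", {"file_path": "/tmp/report.pdf"}))
```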
Prompts
Interactive templates invoked by user choice
extract_and_summarize
Extract text from a document and provide a summary prompt.
Args:
file_path: Path to the document file
Returns:
Extracted content with summarization prompt
extract_structured
Extract text with structured analysis prompt.
Args:
file_path: Path to the document file
Returns:
Extracted content with structured analysis prompt
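Prompts are fetched with the MCP `prompts/get` method rather than `tools/call`. A minimal sketch of the request follows; `build_prompt_get` is a hypothetical helper and the file path is a placeholder.

```python
def build_prompt_get(request_id: int, name: str, file_path: str) -> dict:
    """Build a JSON-RPC 2.0 prompts/get request for one of the prompts above."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "prompts/get",
        # Prompt arguments are passed as a string-to-string mapping.
        "params": {"name": name, "arguments": {"file_path": file_path}},
    }


request = build_prompt_get(2, "extract_and_summarize", "/tmp/report.pdf")
```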
Resources
Contextual data attached and managed by the client
get_default_config
URI: config://default
MIME: text/plain
Get the default extraction configuration.
get_discovered_config
URI: config://discovered
MIME: text/plain
Get the discovered configuration from config files.
get_available_backends
URI: config://available-backends
MIME: text/plain
Get available OCR backends.
get_supported_formats
URI: extractors://supported-formats
MIME: text/plain
Get supported document formats.
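Resources are read by URI with the MCP `resources/read` method. The sketch below builds one request per resource listed above; the helper name `build_resource_reads` is an assumption, and the shape of each response body is not specified here.

```python
RESOURCE_URIS = [
    "config://default",
    "config://discovered",
    "config://available-backends",
    "extractors://supported-formats",
]


def build_resource_reads(start_id: int = 1) -> list[dict]:
    """Build one JSON-RPC 2.0 resources/read request per known resource URI."""
    return [
        {
            "jsonrpc": "2.0",
            "id": start_id + offset,  # each request needs its own id
            "method": "resources/read",
            "params": {"uri": uri},
        }
        for offset, uri in enumerate(RESOURCE_URIS)
    ]
```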