mcphub-com/kreuzberg-mcp

kreuzberg-mcp is hosted online, so all tools can be tested directly, either in the Inspector tab or in the Online Client.

The Kreuzberg MCP Server provides document intelligence and text extraction over the Model Context Protocol (MCP). Its tools cover plain text extraction, OCR, chunking, and table, entity, and keyword extraction across a range of file formats.

Tools

Functions exposed to the LLM to take actions

get_pdf_local_path

Given a PDF URL, deterministically derive a local filename from the URL,
check the system temp directory, download if missing, and return the full path.

Returns:
    str: Full path to the local PDF file in the system temp directory.

Raises:
    ValueError: If the URL is invalid or not HTTP/HTTPS.
    URLError/HTTPError/OSError: If download or file operations fail.
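
The caching behavior described for get_pdf_local_path can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the server's implementation: the source only says the filename is derived deterministically from the URL, so the SHA-256-based naming scheme and the helper names derive_local_pdf_path and get_pdf_local_path_sketch are hypothetical.

```python
import hashlib
import os
import tempfile
import urllib.request
from urllib.parse import urlparse


def derive_local_pdf_path(url: str) -> str:
    """Map a URL to a stable path in the system temp directory (no I/O)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid HTTP/HTTPS URL: {url!r}")
    # Assumption: hash the URL to get a filename; the real server may differ.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(tempfile.gettempdir(), f"{digest}.pdf")


def get_pdf_local_path_sketch(url: str) -> str:
    """Return the cached path, downloading the PDF only if it is missing."""
    path = derive_local_pdf_path(url)
    if not os.path.exists(path):
        # urlretrieve raises URLError/HTTPError on network failures
        # and OSError if the file cannot be written.
        urllib.request.urlretrieve(url, path)
    return path
```

Because the path depends only on the URL, repeated calls with the same URL resolve to the same file, and the download happens at most once per temp directory.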

extract_document

Extract text content from a document file.

Args:
    file_path: Path to the document file
    mime_type: MIME type of the document (auto-detected if not provided)
    force_ocr: Force OCR even for text-based documents
    chunk_content: Split content into chunks
    extract_tables: Extract tables from the document
    extract_entities: Extract named entities
    extract_keywords: Extract keywords
    ocr_backend: OCR backend to use (tesseract, easyocr, paddleocr)
    max_chars: Maximum characters per chunk
    max_overlap: Character overlap between chunks
    keyword_count: Number of keywords to extract
    auto_detect_language: Auto-detect document language

Returns:
    Extracted content with metadata, tables, chunks, entities, and keywords
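
To show how these arguments fit together, here is a sketch of an MCP `tools/call` request for extract_document, serialized as a JSON-RPC 2.0 message. The helper name build_tool_call, the file path, and the option values are assumptions chosen for illustration; only the tool name and argument names come from the listing above.

```python
import json
from typing import Any


def build_tool_call(request_id: int, name: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


# Hypothetical path and option values, for illustration only.
request = build_tool_call(
    1,
    "extract_document",
    {
        "file_path": "/tmp/report.pdf",
        "extract_tables": True,
        "chunk_content": True,
        "max_chars": 1000,
        "max_overlap": 200,
        "ocr_backend": "tesseract",
    },
)
print(request)
```

Arguments left out of the request (for example mime_type) are presumably filled in by the server's defaults; the listing does not state what those defaults are.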

extract_bytes

Extract text content from document bytes.

Args:
    content_base64: Base64-encoded document content
    mime_type: MIME type of the document
    force_ocr: Force OCR even for text-based documents
    chunk_content: Split content into chunks
    extract_tables: Extract tables from the document
    extract_entities: Extract named entities
    extract_keywords: Extract keywords
    ocr_backend: OCR backend to use (tesseract, easyocr, paddleocr)
    max_chars: Maximum characters per chunk
    max_overlap: Character overlap between chunks
    keyword_count: Number of keywords to extract
    auto_detect_language: Auto-detect document language

Returns:
    Extracted content with metadata, tables, chunks, entities, and keywords
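
Since extract_bytes takes its input as a Base64 string, a client has to encode the raw document before sending it. The sketch below shows one way to prepare the arguments; the helper name encode_document and the sample bytes are hypothetical stand-ins.

```python
import base64
import json


def encode_document(data: bytes) -> str:
    """Base64-encode raw document bytes as ASCII text for a JSON payload."""
    return base64.b64encode(data).decode("ascii")


# Stand-in bytes; in practice these would be read from a file or a response.
raw = b"Hello, Kreuzberg!"
arguments = {
    "content_base64": encode_document(raw),
    "mime_type": "text/plain",
}
print(json.dumps(arguments))
```

Unlike extract_document, the listing does not mark mime_type as auto-detected for this tool, so it is probably safest to pass it explicitly.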

extract_simple

Simple text extraction from a document file.

Args:
    file_path: Path to the document file
    mime_type: MIME type of the document (auto-detected if not provided)

Returns:
    Extracted text content as a string

Prompts

Interactive templates invoked by user choice

extract_and_summarize

Extract text from a document and provide a summary prompt.

Args:
    file_path: Path to the document file

Returns:
    Extracted content with summarization prompt

extract_structured

Extract text with structured analysis prompt.

Args:
    file_path: Path to the document file

Returns:
    Extracted content with structured analysis prompt
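
Both prompts take a single file_path argument, so a client can request either one with an MCP `prompts/get` message. The following is a sketch only; the helper name build_prompt_request and the file path are assumptions made for illustration.

```python
import json


def build_prompt_request(request_id: int, name: str, file_path: str) -> dict:
    """Build an MCP `prompts/get` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "prompts/get",
        "params": {"name": name, "arguments": {"file_path": file_path}},
    }


# Hypothetical document path, for illustration only.
summarize = build_prompt_request(1, "extract_and_summarize", "/tmp/report.pdf")
structured = build_prompt_request(2, "extract_structured", "/tmp/report.pdf")
print(json.dumps(summarize))
print(json.dumps(structured))
```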

Resources

Contextual data attached and managed by the client

get_default_config

URI: config://default

MIME: text/plain

Get the default extraction configuration.

get_discovered_config

URI: config://discovered

MIME: text/plain

Get the discovered configuration from config files.

get_available_backends

URI: config://available-backends

MIME: text/plain

Get available OCR backends.

get_supported_formats

URI: extractors://supported-formats

MIME: text/plain

Get supported document formats.
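
The four resources above are addressed by URI, so a client can fetch them with MCP `resources/read` requests. The sketch below builds one request per URI; the helper name build_resource_read is hypothetical, and the shape of the server's responses is not shown because the listing does not describe it.

```python
import json

# URIs taken from the resource listing above.
RESOURCE_URIS = [
    "config://default",
    "config://discovered",
    "config://available-backends",
    "extractors://supported-formats",
]


def build_resource_read(request_id: int, uri: str) -> dict:
    """Build an MCP `resources/read` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "resources/read",
        "params": {"uri": uri},
    }


read_requests = [
    build_resource_read(i, uri) for i, uri in enumerate(RESOURCE_URIS, start=1)
]
for req in read_requests:
    print(json.dumps(req))
```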