Escarabajo is a local-first MCP server designed to synchronize binary documents with Markdown, enabling efficient search, summarization, and reasoning over source material.

Escarabajo MCP Server

Escarabajo (Scarab) is a local-first MCP server purpose-built for documentation archaeology. It keeps heavyweight PDF, DOCX, and PPTX artefacts in sync with a clean Markdown knowledge base, then exposes that knowledge safely to agents through a read-only default surface. Maintenance instrumentation is gated behind an explicit mode switch so production agents can browse confidently while operators keep full control over destructive actions.

Highlights

  • Read-only default: agents can browse kb:// resources (kb://list, kb://docgen-spec, kb://doc/{id}) without gaining access to mutating tools.
  • Markdown knowledge base with provenance metadata (SHA-256, byte counts, timestamps) sourced from .Escarabajo/index.json.
  • DocGen spec resource plus a maintenance-gated prompt writer keep Copilot Agent Mode aligned with the repository’s documentation rules.
  • Dedicated documentation: the user guide covers day-to-day workflows, while the technical documentation explains internal modules and data flow for maintainers.
  • Maintenance mode unlocks conversion tools with rate limits, sandboxed path validation, dry-run support, and typed error responses.
  • Environment-aware configuration loader (ESCARABAJO_MODE, ESCARABAJO_EXPOSE_CONTENT, ESCARABAJO_REPO_ROOT) layered above .Escarabajo/config.yaml.

Documentation

  • Getting things done: Read the user guide for CLI commands, MCP capabilities, and guided workflows that pair with Copilot Agent Mode.
  • Understanding the system: Consult the technical documentation for module boundaries, service catalog, data flow, and operational controls.
  • Need context fast? Use the /docgen prompt (see Prompt Files) to let Copilot assemble documentation directly from the knowledge base.

Operating Modes & Environment

| Variable | Values | Effect |
| --- | --- | --- |
| ESCARABAJO_MODE | readonly (default) / maint | Selects read-only browsing or maintenance mode. Tools are registered only when set to maint. |
| ESCARABAJO_EXPOSE_CONTENT | true / false (default) | Enables the read_text maintenance tool for streaming Markdown bodies. |
| ESCARABAJO_REPO_ROOT | Absolute path | Overrides the repository root resolved by the server (else the provided CLI path or cwd). |

Environment variables override values in .Escarabajo/config.yaml. The server refuses non-absolute ESCARABAJO_REPO_ROOT paths and keeps all file access sandboxed to the resolved repo root.
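
The precedence described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual loader: the function name resolve_settings, the DEFAULTS dictionary, and the returned key names are assumptions made for the example. Only the three environment variable names and the absolute-path rule come from this README.

```python
import os
from pathlib import Path

# Hypothetical defaults for illustration; the real loader may differ.
DEFAULTS = {"mode": "readonly", "expose_content": False}


def resolve_settings(file_config, env, cli_path=None):
    """Merge defaults, then config.yaml values, then environment overrides."""
    settings = {**DEFAULTS, **file_config}

    if "ESCARABAJO_MODE" in env:
        mode = env["ESCARABAJO_MODE"]
        if mode not in ("readonly", "maint"):
            raise ValueError(f"unsupported mode: {mode!r}")
        settings["mode"] = mode

    if "ESCARABAJO_EXPOSE_CONTENT" in env:
        raw = env["ESCARABAJO_EXPOSE_CONTENT"].strip().lower()
        settings["expose_content"] = raw == "true"

    root = env.get("ESCARABAJO_REPO_ROOT")
    if root is not None:
        # Relative roots are refused, mirroring the rule stated above.
        if not Path(root).is_absolute():
            raise ValueError("ESCARABAJO_REPO_ROOT must be an absolute path")
        settings["repo_root"] = Path(root)
    else:
        # Fall back to the CLI path, then the current working directory.
        settings["repo_root"] = Path(cli_path or os.getcwd()).resolve()

    return settings
```

Calling resolve_settings({"expose_content": True}, os.environ) would let a value from the file stand unless the matching ESCARABAJO_* variable is set.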

MCP Interface

Resources (kb://)

  • kb://docgen-spec → YAML doc generation contract derived from .Escarabajo/config.yaml plus a tiny structure.outputs block collaborators can edit.
  • kb://list → JSON array of documents: {id, title, path, source, shasum, bytes, updated_at, tags}.
  • kb://doc/{id} → Markdown content for the given knowledge-base document. An optional ?max_chars=20000 query parameter caps transport size; truncated responses append a clear notice.
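
To make the max_chars behaviour concrete, here is a rough sketch of how a kb://doc/{id} URI could be parsed and its body capped. The helper names parse_doc_uri and cap_text, and the wording of the truncation notice, are assumptions for illustration; the server's real notice text is not documented here.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical notice wording; the actual server text may differ.
TRUNCATION_NOTICE = "\n\n[truncated: response capped at {limit} characters]"


def parse_doc_uri(uri):
    """Return (doc_id, max_chars) for a kb://doc/{id} URI."""
    parts = urlsplit(uri)
    if parts.scheme != "kb" or parts.netloc != "doc":
        raise ValueError(f"not a kb://doc/ URI: {uri!r}")
    doc_id = parts.path.lstrip("/")
    query = parse_qs(parts.query)
    max_chars = int(query["max_chars"][0]) if "max_chars" in query else None
    return doc_id, max_chars


def cap_text(text, max_chars):
    """Cap the body at max_chars and append a notice when it was cut."""
    if max_chars is None or len(text) <= max_chars:
        return text
    return text[:max_chars] + TRUNCATION_NOTICE.format(limit=max_chars)
```

A client that sees the notice can re-request the document with a larger max_chars or fetch a narrower set of documents.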

Prompt Files

Run materialize_prompt_files() in maintenance mode to (re)write .github/prompts/docgen.prompt.md. The prompt instructs Copilot Agent Mode to pull in kb://docgen-spec, kb://list, and the specific kb://doc/{id} resources needed for documentation updates.

Maintenance Tools (available only when ESCARABAJO_MODE=maint)

| Tool | Input Schema (summary) | Notes |
| --- | --- | --- |
| scan_repo | {globs?, exclude_globs?} | Lists candidate binaries with size/mtime metadata. |
| sync_all | {globs?, exclude_globs?, ocr?, dry_run?} | Limited to ≤250 files per call; supports dry-run planning. |
| sync_paths | {paths, ocr?, dry_run?} | Paths are repo-sandboxed; same rate limit and dry-run support. |
| get_text_path | {src, ocr?} | Ensures Markdown is fresh and returns path, status, timestamps, and byte counts. |
| purge_outputs | {globs?} | Deletes generated Markdown under .Escarabajo/kb. |
| config_get | {} | Returns the merged configuration with environment overrides applied. |
| config_set | {updates} | Whitelisted keys only; rejects unknown or malformed updates. |
| read_text | {path, max_bytes?} | Streams Markdown when ESCARABAJO_EXPOSE_CONTENT=true; otherwise raises ContentExposureError. |
| materialize_prompt_files | {profile?} | Generates .github/prompts/docgen.prompt.md using kb://docgen-spec; idempotent. |

All maintenance tools return typed errors (ModeDisabledError, SandboxError, RateLimitError, ValidationError, ContentExposureError) so clients can react programmatically.
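
As one way a client might react to these typed errors, the sketch below maps each error name to a suggested next step. The payload shape (a dictionary with a "type" field) and the function name next_step are assumptions; only the five error names come from this README, and the suggested actions are inferred from the surrounding sections.

```python
# Suggested reactions, keyed by the error names listed above.
ERROR_ACTIONS = {
    "ModeDisabledError": "restart the server with ESCARABAJO_MODE=maint",
    "ContentExposureError": "set ESCARABAJO_EXPOSE_CONTENT=true or use kb://doc/{id}",
    "RateLimitError": "split the request into batches of at most 250 files",
    "SandboxError": "use a path inside the configured repository root",
    "ValidationError": "fix the tool arguments and retry",
}


def next_step(error):
    """Pick a follow-up action from an error payload.

    Assumes the payload is a dict carrying the error class name under
    "type"; the real wire format may differ.
    """
    return ERROR_ACTIONS.get(error.get("type", ""), "surface the error to the user")
```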

Example mcp.json

{
  "servers": [
    {
      "name": "Escarabajo KB",
      "command": ["python", "-m", "Escarabajo", "--mcp"],
      "env": {
        "ESCARABAJO_MODE": "readonly"
      },
      "resources": ["kb://list", "kb://docgen-spec", "kb://doc/{id}"]
    }
  ]
}

Switch ESCARABAJO_MODE to maint when you need the maintenance tools (for example inside a trusted operator workspace).
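
If you keep separate client configurations for browsing and for operator work, a small script can derive the maintenance variant from the read-only one. This is only a sketch: with_mode is a hypothetical helper, and the dictionary simply mirrors the example mcp.json above.

```python
import copy
import json

# Mirrors the example mcp.json shown above.
READONLY_CONFIG = {
    "servers": [
        {
            "name": "Escarabajo KB",
            "command": ["python", "-m", "Escarabajo", "--mcp"],
            "env": {"ESCARABAJO_MODE": "readonly"},
            "resources": ["kb://list", "kb://docgen-spec", "kb://doc/{id}"],
        }
    ]
}


def with_mode(config, mode, expose_content=None):
    """Return a copy of the config with the mode (and exposure flag) set."""
    if mode not in ("readonly", "maint"):
        raise ValueError(f"unsupported mode: {mode!r}")
    updated = copy.deepcopy(config)  # leave the original untouched
    for server in updated["servers"]:
        env = server.setdefault("env", {})
        env["ESCARABAJO_MODE"] = mode
        if expose_content is not None:
            env["ESCARABAJO_EXPOSE_CONTENT"] = "true" if expose_content else "false"
    return updated


if __name__ == "__main__":
    print(json.dumps(with_mode(READONLY_CONFIG, "maint"), indent=2))
```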

Typical Client Flow

  1. In Copilot Chat, pick Add Context → MCP Resources and include kb://docgen-spec, kb://list, plus any kb://doc/{id} entries you plan to cite.
  2. Ask the agent to follow the DocGen spec. The /docgen prompt file (generated via materialize_prompt_files()) already contains precise instructions and inlines the spec as YAML for reference.
  3. Use kb://list to discover document ids and refresh the context as needed.
  4. When content looks stale, restart Escarabajo in maintenance mode and rerun the relevant sync tool before regenerating docs.

Maintenance Workflow

  1. Restart the server with ESCARABAJO_MODE=maint (and optionally ESCARABAJO_EXPOSE_CONTENT=true for streaming).

  2. Start with a dry run:

    ESCARABAJO_MODE=maint python -m Escarabajo --mcp
    # From your client:
    sync_all({"dry_run": true})
    
  3. Review the planned candidates. When satisfied, rerun without dry_run (the server caps each invocation at 250 files).

  4. Use get_text_path to confirm a specific Markdown artefact or read_text (with content exposure enabled) to stream its contents.

  5. Regenerate .github/prompts/docgen.prompt.md with materialize_prompt_files() whenever the DocGen spec changes.

  6. Switch back to ESCARABAJO_MODE=readonly once the knowledge base is fresh.
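
For repositories with more than 250 documents, a client can stay under the per-call cap by batching explicit paths through sync_paths. The sketch below assumes a call_tool(name, arguments) callable supplied by your MCP client; that callable and the helper names are hypothetical, while the tool name, its paths and dry_run inputs, and the 250-file cap come from this README.

```python
from itertools import islice

MAX_FILES_PER_CALL = 250  # per-call cap stated in the tool table


def batched(paths, size=MAX_FILES_PER_CALL):
    """Yield successive lists of at most `size` paths."""
    iterator = iter(paths)
    while chunk := list(islice(iterator, size)):
        yield chunk


def sync_in_batches(call_tool, paths, dry_run=True):
    """Invoke sync_paths once per batch and collect the responses.

    `call_tool` stands in for whatever your MCP client exposes for
    tool invocation; it is not part of Escarabajo.
    """
    results = []
    for chunk in batched(paths):
        results.append(call_tool("sync_paths", {"paths": chunk, "dry_run": dry_run}))
    return results
```

Keeping dry_run=True as the default follows the workflow above: plan first, then rerun with dry_run=False once the candidates look right.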

CLI Quick Start

# Initialise (or ensure) the workspace
escarabajo init

# Scan for supported documents
escarabajo scan

# Synchronise everything with OCR enabled for PDFs
escarabajo sync --ocr

# Synchronise specific files
escarabajo sync-paths tests/data/RNGenius_Functional_Spec.docx

# Return the Markdown path for an individual document
escarabajo get-path --src tests/data/RNGenius_Functional_Spec.docx

The CLI mirrors the maintenance tools but runs outside the MCP envelope. All filesystem interaction remains sandboxed to the configured repository root.
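
The sandboxing rule can be illustrated with a short path check. This is a sketch of the general technique (resolve, then verify containment), not Escarabajo's own implementation; resolve_in_sandbox is a hypothetical helper, and the SandboxError class here only borrows the name used by the maintenance tools. It needs Python 3.9 or newer for Path.is_relative_to.

```python
from pathlib import Path


class SandboxError(Exception):
    """Raised when a path escapes the repository root."""


def resolve_in_sandbox(repo_root, candidate):
    """Resolve `candidate` against the root and reject anything outside it."""
    root = Path(repo_root).resolve()
    # Resolving collapses ".." segments and symlinks before the check.
    target = (root / candidate).resolve()
    if not target.is_relative_to(root):
        raise SandboxError(f"{candidate!r} is outside {root}")
    return target
```

With this check, resolve_in_sandbox(root, "../secrets.txt") raises SandboxError, while a path such as "tests/data/RNGenius_Functional_Spec.docx" resolves normally.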

Configuration

escarabajo config-get prints the effective configuration (defaults shown below). Environment overrides (ESCARABAJO_*) take precedence and are echoed in this payload so operators can confirm runtime values.

kb_dir: ".Escarabajo/kb"
globs: ["**/*.docx", "**/*.pptx", "**/*.pdf"]
exclude_globs: [".git/**", ".Escarabajo/**", "node_modules/**", "**/~$*", "**/*.tmp"]
ocr: false
expose_content: false
skip_unchanged: false
pdf:
  page_delimiter: "--- page {n} ---"
pptx:
  slide_delimiter: "--- slide {n} ---"
docx:
  keep_tables: true
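
The delimiter templates make generated Markdown easy to split back into pages or slides. The sketch below assumes each delimiter sits on its own line and precedes the content it labels; that placement is an assumption, as is the helper name split_pages. Pass the pptx slide_delimiter as the template to split slides instead.

```python
import re


def split_pages(markdown, template="--- page {n} ---"):
    """Split Markdown into {page_number: text} using a delimiter template."""
    # Turn the template into a regex, capturing the number at "{n}".
    pattern = re.escape(template).replace(re.escape("{n}"), r"(\d+)")
    parts = re.split(rf"^{pattern}[ \t]*$", markdown, flags=re.MULTILINE)
    # parts looks like [preamble, "1", body1, "2", body2, ...]
    return {int(n): body.strip() for n, body in zip(parts[1::2], parts[2::2])}
```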

Repository Layout

  • .Escarabajo/ – Runtime metadata, configuration, and cached Markdown.
    • config.yaml – User-editable defaults.
    • index.json – Provenance ledger capturing hashes, mtimes, and extraction results.
    • kb/<source>.md – Canonical Markdown derived from each binary.
  • Escarabajo/ – Python package providing the MCP server, CLI, extractor implementations, and knowledge-base helpers.
  • tests/ – Pytest suite covering MCP surfaces, extraction fidelity, and regression fixtures.
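
To show what a provenance entry involves, here is a minimal sketch that computes the SHA-256 digest, byte count, and modification timestamp for a source file. The field names follow the kb://list payload (source, shasum, bytes, updated_at); the actual schema of index.json is not documented here, so treat the record layout and both helper names as assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(path):
    """Build a provenance entry for one source file."""
    source = Path(path)
    digest = hashlib.sha256()
    with source.open("rb") as handle:
        # Hash in chunks so large PDFs do not load into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    stat = source.stat()
    return {
        "source": source.as_posix(),
        "shasum": digest.hexdigest(),
        "bytes": stat.st_size,
        "updated_at": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }


def is_stale(record, path):
    """Report whether the file's content hash no longer matches the record."""
    return provenance_record(path)["shasum"] != record["shasum"]
```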

Testing

Populate the knowledge base by syncing the fixtures (for example with the CLI) so .Escarabajo/index.json and .Escarabajo/kb/ contain the authoritative Markdown. Then run:

uv run pytest -q

The suite verifies resource responses, prompt structure, maintenance gating, and path sandboxing alongside extractor regressions.

Principles of Participation

Everyone is invited and welcome to contribute: open issues, propose pull requests, share ideas, or help improve documentation.
Participation is open to all, regardless of background or viewpoint.

This project follows a set of participation principles that affirm respect for people, freedom to critique ideas, and space for diverse perspectives.

License and Copyright

Copyright (c) 2025, Iwan van der Kleijn

This project is licensed under the MIT License. See the license file in the repository for details.