soyrochus/escarabajo
Escarabajo is a local-first MCP server designed to synchronize binary documents with Markdown, enabling efficient search, summarization, and reasoning over source material.
Escarabajo MCP Server
Escarabajo (Scarab) is a local-first MCP server purpose-built for documentation archaeology. It keeps heavyweight PDF, DOCX, and PPTX artefacts in sync with a clean Markdown knowledge base, then exposes that knowledge safely to agents through a read-only default surface. Maintenance instrumentation is gated behind an explicit mode switch so production agents can browse confidently while operators keep full control over destructive actions.
Highlights
- Read-only default: agents can browse kb:// resources (kb://list, kb://docgen-spec, kb://doc/{id}) without gaining access to mutating tools.
- Markdown knowledge base with provenance metadata (SHA-256, byte counts, timestamps) sourced from .Escarabajo/index.json.
- DocGen spec resource plus a maintenance-gated prompt writer keep Copilot Agent Mode aligned with the repository's documentation rules.
- Dedicated documentation: one guide covers day-to-day workflows, while another explains internal modules and data flow for maintainers.
- Maintenance mode unlocks conversion tools with rate limits, sandboxed path validation, dry-run support, and typed error responses.
- Environment-aware configuration loader (ESCARABAJO_MODE, ESCARABAJO_EXPOSE_CONTENT, ESCARABAJO_REPO_ROOT) layered above .Escarabajo/config.yaml.
Documentation
- Getting things done: Read the usage documentation for CLI commands, MCP capabilities, and guided workflows that pair with Copilot Agent Mode.
- Understanding the system: Consult the architecture documentation for module boundaries, service catalog, data flow, and operational controls.
- Need context fast? Use the /docgen prompt to let Copilot assemble documentation directly from the knowledge base.
Operating Modes & Environment
Variable | Values | Effect |
---|---|---|
ESCARABAJO_MODE | readonly (default) / maint | Selects read-only browsing or maintenance mode. Tools are registered only when set to maint. |
ESCARABAJO_EXPOSE_CONTENT | true / false (default) | Enables the read_text maintenance tool for streaming Markdown bodies. |
ESCARABAJO_REPO_ROOT | Absolute path | Overrides the repository root resolved by the server (otherwise the provided CLI path or cwd). |
Environment variables override values in .Escarabajo/config.yaml. The server refuses non-absolute ESCARABAJO_REPO_ROOT paths and keeps all file access sandboxed to the resolved repo root.
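
The layering described above can be sketched as follows. This is a minimal illustration rather than the project's actual loader: the function name `resolve_settings`, the settings keys, and the choice of `ValueError` are assumptions made for the example.

```python
import os
from pathlib import Path

VALID_MODES = {"readonly", "maint"}


def resolve_settings(file_config, environ, cli_path=None):
    """Hypothetical helper: defaults, then file config, then ESCARABAJO_* overrides."""
    settings = {"mode": "readonly", "expose_content": False}
    settings.update(file_config)  # values read from .Escarabajo/config.yaml

    # Environment variables take precedence over the file.
    if "ESCARABAJO_MODE" in environ:
        settings["mode"] = environ["ESCARABAJO_MODE"]
    if settings["mode"] not in VALID_MODES:
        raise ValueError(f"unsupported mode: {settings['mode']!r}")

    if "ESCARABAJO_EXPOSE_CONTENT" in environ:
        raw = environ["ESCARABAJO_EXPOSE_CONTENT"].strip().lower()
        settings["expose_content"] = raw == "true"

    root = environ.get("ESCARABAJO_REPO_ROOT")
    if root is not None:
        # Non-absolute roots are refused, as the README states.
        if not os.path.isabs(root):
            raise ValueError("ESCARABAJO_REPO_ROOT must be an absolute path")
        settings["repo_root"] = Path(root)
    else:
        # Fall back to the CLI path, then the current working directory.
        settings["repo_root"] = Path(cli_path or os.getcwd()).resolve()
    return settings
```

Calling `resolve_settings({}, os.environ)` with no overrides set would yield read-only mode rooted at the current working directory.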
MCP Interface
Resources (kb://)
- kb://docgen-spec: YAML doc generation contract derived from .Escarabajo/config.yaml plus a tiny structure.outputs block collaborators can edit.
- kb://list: JSON array of documents: {id, title, path, source, shasum, bytes, updated_at, tags}.
- kb://doc/{id}: Markdown content for the given knowledge-base document. An optional ?max_chars=20000 query parameter caps transport size; truncated responses append a clear notice.
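
To illustrate the max_chars cap on kb://doc/{id}, here is a small sketch. The helper names and the wording of the truncation notice are hypothetical; the server's actual notice text is not documented here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Hypothetical notice wording; the real server may phrase it differently.
TRUNCATION_NOTICE = "\n\n[truncated: response capped at {limit} characters]"


def parse_max_chars(uri: str) -> Optional[int]:
    """Read the optional max_chars query parameter from a kb:// URI."""
    values = parse_qs(urlsplit(uri).query).get("max_chars")
    return int(values[0]) if values else None


def cap_markdown(text: str, max_chars: Optional[int] = None) -> str:
    """Return the text unchanged, or cut at max_chars with a notice appended."""
    if max_chars is None or len(text) <= max_chars:
        return text
    return text[:max_chars] + TRUNCATION_NOTICE.format(limit=max_chars)
```

A client that sees the notice can re-request the resource with a larger limit or fetch a narrower document.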
Prompt Files
Run materialize_prompt_files() in maintenance mode to (re)write .github/prompts/docgen.prompt.md. The prompt instructs Copilot Agent Mode to pull in kb://docgen-spec, kb://list, and the specific kb://doc/{id} resources needed for documentation updates.
Maintenance Tools (available only when ESCARABAJO_MODE=maint)
Tool | Input Schema (summary) | Notes |
---|---|---|
scan_repo | {globs?, exclude_globs?} | Lists candidate binaries with size/mtime metadata. |
sync_all | {globs?, exclude_globs?, ocr?, dry_run?} | Limited to ≤250 files per call; supports dry-run planning. |
sync_paths | {paths, ocr?, dry_run?} | Paths are repo-sandboxed; same rate limit and dry-run support. |
get_text_path | {src, ocr?} | Ensures Markdown is fresh and returns path, status, timestamps, and byte counts. |
purge_outputs | {globs?} | Deletes generated Markdown under .Escarabajo/kb. |
config_get | {} | Returns the merged configuration with environment overrides applied. |
config_set | {updates} | Whitelisted keys only; rejects unknown or malformed updates. |
read_text | {path, max_bytes?} | Streams Markdown when ESCARABAJO_EXPOSE_CONTENT=true; otherwise raises ContentExposureError. |
materialize_prompt_files | {profile?} | Generates .github/prompts/docgen.prompt.md using kb://docgen-spec; idempotent. |
All maintenance tools return typed errors (ModeDisabledError, SandboxError, RateLimitError, ValidationError, ContentExposureError) so clients can react programmatically.
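
As a rough sketch of how sandboxing, the per-call limit, and typed errors might fit together: the error names below come from this README, but the class hierarchy, the `validate_paths` helper, and the payload shape are assumptions made for illustration.

```python
from pathlib import Path

MAX_FILES_PER_CALL = 250  # limit stated in the tool table


class EscarabajoError(Exception):
    """Hypothetical common base so clients can catch every typed error."""


class SandboxError(EscarabajoError):
    pass


class RateLimitError(EscarabajoError):
    pass


def validate_paths(repo_root, paths):
    """Resolve paths against the repo root; reject escapes and oversized batches."""
    if len(paths) > MAX_FILES_PER_CALL:
        raise RateLimitError(f"{len(paths)} paths exceeds {MAX_FILES_PER_CALL} per call")
    root = Path(repo_root).resolve()
    resolved = []
    for raw in paths:
        candidate = (root / raw).resolve()
        # A path is inside the sandbox if it is the root or a descendant of it.
        if candidate != root and root not in candidate.parents:
            raise SandboxError(f"{raw!r} resolves outside the repository root")
        resolved.append(candidate)
    return resolved


def to_error_payload(exc):
    """Assumed payload shape: the error type name plus a readable message."""
    return {"error": type(exc).__name__, "message": str(exc)}
```

Because the payload carries the error type name, a client can branch on it, for example splitting a batch after a RateLimitError while treating a SandboxError as a hard failure.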
Example mcp.json
{
  "servers": [
    {
      "name": "Escarabajo KB",
      "command": ["python", "-m", "Escarabajo", "--mcp"],
      "env": {
        "ESCARABAJO_MODE": "readonly"
      },
      "resources": ["kb://list", "kb://docgen-spec", "kb://doc/{id}"]
    }
  ]
}
Switch ESCARABAJO_MODE to maint when you need the maintenance tools (for example inside a trusted operator workspace).
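
If you keep separate client configurations per workspace, the mode switch can be scripted. The sketch below assumes the mcp.json shape shown above; `with_mode` is a hypothetical helper, not part of Escarabajo.

```python
import copy
import json

READONLY_CONFIG = {
    "servers": [
        {
            "name": "Escarabajo KB",
            "command": ["python", "-m", "Escarabajo", "--mcp"],
            "env": {"ESCARABAJO_MODE": "readonly"},
            "resources": ["kb://list", "kb://docgen-spec", "kb://doc/{id}"],
        }
    ]
}


def with_mode(config, mode):
    """Return a copy of the config with ESCARABAJO_MODE set on every server."""
    if mode not in {"readonly", "maint"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    for server in updated["servers"]:
        server.setdefault("env", {})["ESCARABAJO_MODE"] = mode
    return updated


if __name__ == "__main__":
    print(json.dumps(with_mode(READONLY_CONFIG, "maint"), indent=2))
```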
Typical Client Flow
- In Copilot Chat, pick Add Context → MCP Resources and include kb://docgen-spec, kb://list, plus any kb://doc/{id} entries you plan to cite.
- Ask the agent to follow the DocGen spec. The /docgen prompt file (generated via materialize_prompt_files()) already contains precise instructions and inlines the spec as YAML for reference.
- Use kb://list to discover document ids and refresh the context as needed.
- When content looks stale, restart Escarabajo in maintenance mode and rerun the relevant sync tool before regenerating docs.
Maintenance Workflow
- Restart the server with ESCARABAJO_MODE=maint (and optionally ESCARABAJO_EXPOSE_CONTENT=true for streaming).
- Start with a dry run:
  ESCARABAJO_MODE=maint python -m Escarabajo --mcp
  # From your client:
  sync_all({"dry_run": true})
- Review the planned candidates. When satisfied, rerun without dry_run (the server enforces a 250-document rate limit per invocation).
- Use get_text_path to confirm a specific Markdown artefact, or read_text (with content exposure enabled) to stream its contents.
- Regenerate .github/prompts/docgen.prompt.md with materialize_prompt_files() whenever the DocGen spec changes.
- Switch back to ESCARABAJO_MODE=readonly once the knowledge base is fresh.
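
The dry-run-then-apply portion of this workflow can be sketched from the client side. The example treats the MCP client as an injected `call_tool(name, arguments)` callable, which is an assumption; the shape of the dry-run response is not specified here, so the plan is passed through opaquely.

```python
def sync_with_review(call_tool, approve):
    """Plan with a dry run, then apply only if the reviewer approves.

    call_tool: hypothetical callable (tool_name, arguments) -> result.
    approve:   callable taking the dry-run plan and returning True or False.
    """
    plan = call_tool("sync_all", {"dry_run": True})
    if not approve(plan):
        return None  # nothing was written
    return call_tool("sync_all", {"dry_run": False})
```

Keeping the approval step as a separate callable makes it easy to swap an interactive prompt for an automated policy check.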
CLI Quick Start
# Initialise (or ensure) the workspace
escarabajo init
# Scan for supported documents
escarabajo scan
# Synchronise everything with OCR enabled for PDFs
escarabajo sync --ocr
# Synchronise specific files
escarabajo sync-paths tests/data/RNGenius_Functional_Spec.docx
# Return the Markdown path for an individual document
escarabajo get-path --src tests/data/RNGenius_Functional_Spec.docx
The CLI mirrors the maintenance tools but runs outside the MCP envelope. All filesystem interaction remains sandboxed to the configured repository root.
Configuration
escarabajo config-get prints the effective configuration (defaults shown below). Environment overrides (ESCARABAJO_*) take precedence and are echoed in this payload so operators can confirm runtime values.
kb_dir: ".Escarabajo/kb"
globs: ["**/*.docx", "**/*.pptx", "**/*.pdf"]
exclude_globs: [".git/**", ".Escarabajo/**", "node_modules/**", "**/~$*", "**/*.tmp"]
ocr: false
expose_content: false
skip_unchanged: false
pdf:
  page_delimiter: "--- page {n} ---"
pptx:
  slide_delimiter: "--- slide {n} ---"
docx:
  keep_tables: true
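
The {n} placeholder in the delimiter settings reads like a Python format field. The sketch below shows one way such a delimiter could be applied when joining extracted pages; the `join_pages` helper and the placement of the delimiter before each page are assumptions, not the extractor's documented behaviour.

```python
PDF_PAGE_DELIMITER = "--- page {n} ---"  # default from the configuration above


def join_pages(pages, delimiter=PDF_PAGE_DELIMITER):
    """Join per-page text into one Markdown body, numbering pages from 1."""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(delimiter.format(n=number))
        parts.append(text.strip())
    return "\n\n".join(parts) + "\n"
```

The same function works for slides by passing the pptx delimiter, for example `join_pages(slides, "--- slide {n} ---")`.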
Repository Layout
- .Escarabajo/: Runtime metadata, configuration, and cached Markdown.
  - config.yaml: User-editable defaults.
  - index.json: Provenance ledger capturing hashes, mtimes, and extraction results.
  - kb/<source>.md: Canonical Markdown derived from each binary.
- Escarabajo/: Python package providing the MCP server, CLI, extractor implementations, and knowledge-base helpers.
- tests/: Pytest suite covering MCP surfaces, extraction fidelity, and regression fixtures.
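
To make the provenance ledger more concrete, here is a sketch of how one entry could be computed. The field names follow the kb://list schema documented above; whether the real index hashes the source binary or the generated Markdown is not stated, so hashing the Markdown bytes is an assumption, as is the `provenance_entry` helper itself.

```python
import hashlib
from datetime import datetime, timezone


def provenance_entry(doc_id, source, kb_path, markdown_bytes):
    """Build one hypothetical ledger record for a generated Markdown file."""
    return {
        "id": doc_id,
        "source": source,  # path of the original binary
        "path": kb_path,   # path of the derived Markdown
        "shasum": hashlib.sha256(markdown_bytes).hexdigest(),
        "bytes": len(markdown_bytes),
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Comparing a freshly computed shasum with the stored one is a simple way to detect that a document changed since the last sync.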
Testing
Populate the knowledge base by syncing the fixtures (for example with the CLI) so .Escarabajo/index.json and .Escarabajo/kb/ contain the authoritative Markdown. Then run:
uv run pytest -q
The suite verifies resource responses, prompt structure, maintenance gating, and path sandboxing alongside extractor regressions.
Principles of Participation
Everyone is invited and welcome to contribute: open issues, propose pull requests, share ideas, or help improve documentation.
Participation is open to all, regardless of background or viewpoint.
This project follows a set of participation principles that affirm respect for people, freedom to critique ideas, and space for diverse perspectives.
License and Copyright
Copyright (c) 2025, Iwan van der Kleijn
This project is licensed under the MIT License. See the license file for details.