oreofeolurin/corpus
If you are the rightful owner of corpus and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to dayong@mcphub.com.
Corpus is a Python CLI tool designed for extracting, packaging, and serving datasets via MCP, enabling LLMs to browse, search, and retrieve files efficiently.
Corpus (Data extraction, packaging, and MCP serving)
Corpus is a Python CLI for extracting and packaging datasets, then serving them via MCP so LLMs can browse, search, and retrieve files. It supports YouTube capture, packing local directories or remote GitHub repos, a multi-collection catalog, and an HTTP MCP server.
Commands are exposed as corpus (primary) and cpack (packer alias).
Table of Contents
- Features
- Installation
- Quick Start
- Command Reference
- Configuration
- Output Formats
- Examples
- Troubleshooting
- Contributing
- License
Features
- YouTube capture: Fetch
yt-dlpmetadata, subtitles/transcripts, MP4, and screenshots via ffmpeg; emitsindex.jsonautomatically. - Packer (cpack): Package directories or remote GitHub repos into a single bundle with a prepended human index and optional JSON sidecar; supports include/exclude globs, compression, gzip/base64.
- Catalog: Add bundles or directories as named collections and serve them together.
- HTTP MCP server: Expose collections to IDE agents (Cursor, etc.) with tools:
list_collections,list_files,get_file,search. - Batch, inspect, clean, doctor utilities.
Installation
From PyPI (recommended)
pipx install corpus-cli
Build from Source
Clone the repository and install in development mode:
git clone https://github.com/oreofeolurin/corpus.git
cd corpus
make install-dev
This installs the package in editable mode with development dependencies.
Standalone Binaries
Download pre-built executables from GitHub releases:
- Go to Releases
- Download the binary for your platform:
corpus-macOS(macOS)corpus-ubuntu(Linux)corpus-windows(Windows)
- Make executable and run:
chmod +x corpus-macOS ./corpus-macOS --help
Or build from source:
make binary
./dist/corpus --help
Build distribution packages:
make build
# Creates dist/corpus_cli-*.whl and dist/corpus-cli-*.tar.gz
Docker
Build and run with Docker:
docker build -t oreofeolurin/corpus:latest .
docker run --rm -v "$PWD/artifacts:/data" oreofeolurin/corpus:latest capture --url https://www.youtube.com/watch?v=VIDEO_ID --out /data
Quick Start
# Check environment
corpus doctor
# Capture one video
corpus capture --url https://www.youtube.com/watch?v=VIDEO_ID --out artifacts/yt --mode scene --max-shots 100
# Inspect the results
corpus inspect --path artifacts/yt/VIDEO_ID --details
# Pack a directory into a single bundle (human index on top)
corpus pack -d artifacts/yt/VIDEO_ID -o artifacts/yt/VIDEO_ID/corpus-out.txt -i "**/*.txt" -i "**/*.info.json"
# Pack a remote GitHub repo subtree and register it in the catalog
corpus pack --repo https://github.com/temporalio/sdk-python/temporalio \
-i "**/*.py" -o dist/temporal-sdk-python.txt \
--register-name "Temporal SDK (Python)" --register-tags python,sdk
# Serve the catalog via HTTP MCP (default 127.0.0.1:8787)
corpus mcp serve
# Or point at a single bundle/dir (one-off)
corpus mcp serve --source dist/temporal-sdk-python.txt
Command Reference
corpus capture
Capture a YouTube video's metadata, transcript, and screenshots.
Flags:
--url(required): YouTube watch URL--out(defaultartifacts/yt): Output base directory--lang(defaulten): Subtitle language--mode(interval|uniform|scene, defaultinterval)--every(default60): Interval seconds forinterval--max-shots(default60): Cap or count foruniform/scene--scene-thresh(default0.3): Scene detection threshold--overwrite: Overwrite existing files--no-video,--no-subs,--no-shots: Skip stages--json: Emit JSON summary to stdout
Output structure:
<out>/<video_id>/
├─ <video_id>.info.json
├─ <video_id>.<lang>.vtt (if available)
├─ <video_id>.txt (timestamped transcript)
├─ <video_id>.mp4
└─ shots/
├─ shot_0001.jpg
└─ ...
corpus batch
Process multiple URLs with optional JSONL streaming.
Flags:
--file PATHor--stdin--out,--concurrency,--fail-fast, plus capture flags--json: stream JSONL to stdout
corpus pack (alias: cpack)
Pack files from directories or a GitHub repo into a single bundle, optionally registering in the catalog.
Flags:
-d/--dir,-o/--output--repo URL(optional): GitHub URL; supports/tree/<ref>and subdirs-i/--include GLOB(repeatable),-x/--exclude GLOB(repeatable)-c/--compress,-m/--max-compress-z/--gzip,-b/--base64(requires--gzip)-v/--verbose--config PATH: load cpack YAML/JSON config; CLI overrides file--write-index(default true): prepend a human-readable index to the bundle--index-output PATH(JSON, Corpus v1 schema)--register-id ID(optional): register bundle in the catalog after pack--register-name NAME,--register-tags t1,t2(optional); if only name is provided, a slug ID is auto-generated
corpus mcp (catalog and server)
corpus mcp add --id <id> --source <path> [--type auto|bundle|dir] [--name …] [--tags …]corpus mcp rm --id <id>corpus mcp lscorpus mcp serve [--source <path>] [--host 127.0.0.1] [--port 8787]
MCP tools (HTTP):
list_collections→ lines of<id>\t<type>\t<source>list_files {collection}→ file paths in the collectionget_file {collection, path, start?, end?}→ file content (optionally a line range)search {collection, query, top_k?, case_sensitive?}→path:line: snippet
corpus inspect
Summarize an artifact directory (--details for metadata fields).
corpus clean
Delete by --id under --out or everything with --all (guarded by --yes).
corpus doctor
Check for yt-dlp, ffmpeg, ffprobe, and Python dependencies.
Configuration
Packer config (cpack)
You may use cpack.yml, cpack.yaml, or cpack.json in the input directory, or specify --config.
YAML example:
inputDir: ./src
outputFile: output.txt
includeGlobs:
- "**/*.py"
- "**/*.md"
excludeGlobs:
- "**/.git/**"
- "**/node_modules/**"
verbose: true
compress: false
maxCompress: false
gzip: false
base64: false
JSON example:
{
"inputDir": "./src",
"outputFile": "output.txt",
"includeGlobs": ["**/*.py", "**/*.md"],
"excludeGlobs": ["**/.git/**", "**/node_modules/**"],
"verbose": true,
"compress": false,
"maxCompress": false,
"gzip": false,
"base64": false
}
Output formats
Packer supports multiple formats:
- Standard text (default)
- Compressed whitespace (
--compress) - Max-compressed with comment removal (
--max-compress) - Gzipped bytes (
--gzip) - Base64 of gzip (
--gzip --base64)
Examples
- Interval capture; 1/min up to 20 shots:
corpus capture --url https://www.youtube.com/watch?v=VIDEO_ID \
--out artifacts/yt --mode interval --every 60 --max-shots 20
- Scene-based capture (approx 100 shots):
corpus capture --url https://www.youtube.com/watch?v=VIDEO_ID \
--out artifacts/yt --mode scene --scene-thresh 0.3 --max-shots 100
- Batch from file with JSONL output:
corpus batch --file urls.txt --out artifacts/yt --concurrency 4 --json
- Packer include/exclude with compression and registration:
corpus pack -d artifacts/yt/VIDEO_ID -o artifacts/yt/VIDEO_ID/corpus-out.txt \
-i "**/*.txt" -i "**/*.info.json" -x "**/shots/**" -c -v \
--register-name "My Video Artifacts" --register-tags video,youtube
Troubleshooting
ffmpeg not found: install via Homebrew (brew install ffmpeg) or your OS package manager.yt-dlpissues / rate limits: provide cookies (--cookiesvia capture in future), use proxies (--proxyin future), or rerun later.- Permissions: ensure the output directory is writable.
- Docker: use the provided Dockerfile to avoid local dependency issues.
Contributing
Contributions are welcome! See for development, testing, and release guidelines.
License
Licensed under the .