DRanger666/pdf-search-mcp
PDF Search MCP Server
Lightweight Model Context Protocol (MCP) server and CLI harness that wrap CrewAI’s PDFSearchTool so you can index and semantically search PDFs with your own embeddings/Qdrant stack.
Components
- MCP server (`pdfsearch.mcp_pdfsearch`): exposes a `search_pdf` tool that validates input PDFs, ensures the configured Qdrant collection exists, and runs CrewAI searches.
- CLI harness (`scripts/test_pdfsearch_tool.py`): developer-facing script that fingerprints PDFs, records cache metadata, skips re-ingestion on cache hits, and lets you run ad hoc queries without an MCP client.
- Caching helpers (`src/pdfsearch/cache.py`): shared utilities that compute SHA-256 hashes, derive per-PDF collection names (`pdf_<hash[:12]>`), persist metadata under `~/.cache/pdf_search_mcp/cache.json`, and track whether a PDF has already been indexed (Steps 1–3 of the caching rollout).
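The fingerprinting and naming scheme described for the caching helpers can be sketched as follows. This is a minimal illustration rather than the contents of `src/pdfsearch/cache.py`; the function names `compute_pdf_hash` and `collection_name_for` are hypothetical.

```python
import hashlib
from pathlib import Path


def compute_pdf_hash(pdf_path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(pdf_path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collection_name_for(pdf_hash):
    """Derive a per-PDF collection name: 'pdf_' plus the first 12 hex characters."""
    return "pdf_" + pdf_hash[:12]
```

Because the name depends only on file contents, renaming or moving a PDF maps to the same collection, while any change to its bytes produces a new one.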
Requirements
- Python 3.8+
- Access to an embeddings provider supported by CrewAI (VoyageAI recommended)
- Reachable Qdrant instance (cloud or self-hosted)
Installation
pip install -e .
cp .env.example .env
# populate .env with your real credentials
Configuration
Config (src/pdfsearch/config.py) reads the following environment variables:
| Variable | Required | Notes |
|---|---|---|
| `EMBEDDER_PROVIDER` | Required | e.g. `voyageai`, `openai`, etc. |
| `EMBEDDER_MODEL` | Required | Provider-specific model ID (e.g. `voyage-code-3`). |
| `EMBEDDER_API_KEY` | Conditional* | Required for non-Voyage providers. |
| `EMBEDDINGS_VOYAGEAI_API_KEY` | Conditional* | Required when `EMBEDDER_PROVIDER=voyageai`; automatically overrides `EMBEDDER_API_KEY`. |
| `QDRANT_URL` | Required | HTTPS URL or local endpoint for your Qdrant service. |
| `QDRANT_API_KEY` | Optional | Needed for managed Qdrant deployments. |
| `VECTOR_SIZE` | Required | Integer dimension of your embeddings. |
| `VECTOR_DISTANCE` | Required | One of `Cosine`, `Euclid`, `Dot`, etc. (matches `qdrant_client.models.Distance`). |
| `LOG_LEVEL` | Optional | Defaults to `INFO`. |
Tip: `.env.example` enumerates every variable the server expects. Copy it to `.env`, fill in your values, and the CLI/MCP server will load it via `python-dotenv`.
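To make the required/conditional rules in the table concrete, here is one way such a loader could look. It is a sketch under assumptions, not the code in `src/pdfsearch/config.py`: the `Settings` dataclass, `load_settings`, and `ALLOWED_DISTANCES` are hypothetical names, and the distance check is limited to the names I am confident `qdrant_client.models.Distance` defines.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the string values of qdrant_client.models.Distance.
ALLOWED_DISTANCES = {"Cosine", "Euclid", "Dot", "Manhattan"}


@dataclass(frozen=True)
class Settings:
    embedder_provider: str
    embedder_model: str
    embedder_api_key: str
    qdrant_url: str
    qdrant_api_key: Optional[str]
    vector_size: int
    vector_distance: str
    log_level: str


def _require(env, name):
    """Return a non-empty variable or raise the error quoted under Troubleshooting."""
    value = (env.get(name) or "").strip()
    if not value:
        raise ValueError(f"{name} environment variable is required")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    env = os.environ if env is None else env
    provider = _require(env, "EMBEDDER_PROVIDER")
    # For VoyageAI the provider-specific key takes precedence over EMBEDDER_API_KEY.
    key_name = "EMBEDDINGS_VOYAGEAI_API_KEY" if provider.lower() == "voyageai" else "EMBEDDER_API_KEY"
    distance = _require(env, "VECTOR_DISTANCE")
    if distance not in ALLOWED_DISTANCES:
        raise ValueError(f"VECTOR_DISTANCE must be one of {sorted(ALLOWED_DISTANCES)}")
    return Settings(
        embedder_provider=provider,
        embedder_model=_require(env, "EMBEDDER_MODEL"),
        embedder_api_key=_require(env, key_name),
        qdrant_url=_require(env, "QDRANT_URL"),
        qdrant_api_key=env.get("QDRANT_API_KEY") or None,
        vector_size=int(_require(env, "VECTOR_SIZE")),  # int() rejects non-numeric input
        vector_distance=distance,
        log_level=env.get("LOG_LEVEL") or "INFO",
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise in tests without touching the real environment.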
Running the CLI Harness
python scripts/test_pdfsearch_tool.py \
--pdf sample_pdfs/manual.pdf \
--query "explain the architecture" \
--limit 5 \
--similarity-threshold 0.6 \
--format pretty
What it does:
- Loads `.env` and validates the PDF (extension + `%PDF-` header).
- Computes the PDF hash and derives a Qdrant collection name (`pdf_<hash>`).
- Records cache metadata (path, hash, collection, ingestion flag) in `~/.cache/pdf_search_mcp/cache.json`.
- Initializes `PDFSearchTool` with your embeddings + Qdrant config and forces the adapter to use the derived collection.
- On the first run, ingests the PDF into Qdrant and marks the cache entry as indexed; subsequent runs skip ingestion and reuse the existing vectors when the cache says the PDF is already indexed.
- Issues a similarity search using the adapter’s low-level `client.search()` API for structured results (including `chunk_index`, normalized score, and Qdrant `point_id`).
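The validation step in the list above (extension plus `%PDF-` header) could be implemented along these lines. This is an illustrative sketch; `validate_pdf` is a hypothetical name, and the project's own validator may check more than this.

```python
from pathlib import Path


def validate_pdf(pdf_path):
    """Check the .pdf extension and the leading '%PDF-' magic bytes; return the Path."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")
    with path.open("rb") as handle:
        header = handle.read(5)  # a PDF file starts with the bytes %PDF-
    if header != b"%PDF-":
        raise ValueError("File does not appear to be a valid PDF")
    return path
```

Reading only the first five bytes keeps the check cheap even for large files; it confirms the file claims to be a PDF, not that it parses cleanly.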
Resetting a collection: if you need to re-ingest (e.g., after deleting a Qdrant collection), remove the hash entry from `~/.cache/pdf_search_mcp/cache.json` or toggle its `"indexed"` flag to `false` before re-running the CLI.
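If you prefer to script the reset rather than edit the file by hand, something like the following may work. It assumes the cache file is a JSON object keyed by PDF hash whose entries carry an `"indexed"` field; that layout is inferred from the description above, not confirmed, so inspect your own `cache.json` first. The helper name `mark_not_indexed` is hypothetical.

```python
import json
from pathlib import Path

DEFAULT_CACHE_PATH = Path.home() / ".cache" / "pdf_search_mcp" / "cache.json"


def mark_not_indexed(pdf_hash, cache_path=DEFAULT_CACHE_PATH):
    """Set "indexed" to false for one entry. Returns True if the file was updated."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        return False
    data = json.loads(cache_path.read_text(encoding="utf-8"))
    # Assumed layout: {"<sha256>": {"indexed": true, ...}, ...}
    entry = data.get(pdf_hash) if isinstance(data, dict) else None
    if not isinstance(entry, dict):
        return False
    entry["indexed"] = False
    cache_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return True
```

The function leaves the file untouched when the entry is missing or the layout differs from the assumption, so it is safe to try on an unfamiliar cache.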
Running the MCP Server
pdf-search-mcp
# or
python -m pdfsearch.mcp_pdfsearch
The server registers a single MCP tool:
| Tool | Description | Args |
|---|---|---|
| `search_pdf` | Validates/ingests a PDF and runs a semantic search through CrewAI | `pdf_path` (string, required), `query` (string, required) |
Internally the server bootstraps a shared PDFSearchTool instance against the pdf_search_bootstrap collection to verify configuration on startup, then reuses it for incoming tool calls.
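For orientation, this is roughly the JSON-RPC message an MCP client sends to invoke the tool. The `tools/call` method and the `name`/`arguments` parameters come from the MCP specification; the helper name `build_search_pdf_request` is hypothetical, and a real client must complete the MCP `initialize` handshake before sending it.

```python
import json


def build_search_pdf_request(pdf_path, query, request_id=1):
    """Build the JSON-RPC 2.0 payload for an MCP tools/call on search_pdf."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_pdf",
            "arguments": {"pdf_path": pdf_path, "query": query},
        },
    }


# Print the payload for the sample query used by the CLI harness.
print(json.dumps(
    build_search_pdf_request("sample_pdfs/manual.pdf", "explain the architecture"),
    indent=2,
))
```

In practice an MCP-capable client or SDK builds and frames this message for you; the sketch only shows which fields map to the tool's two arguments.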
Caching Rollout
See docs/caching_rollout_plan.md for the staged plan that moves from simple fingerprinting (already implemented) to per-PDF collections, ingestion skipping, CLI cache inspection, concurrency guards, and eventually sharing the cache logic with the MCP server.
Current status:
- Step 1 complete: hash + JSON cache helpers (`src/pdfsearch/cache.py`).
- Step 2 complete: deterministic collection names + adapter retargeting.
- Step 3 complete: cache-aware ingestion skipping in the CLI (per-PDF collections are only populated once unless you reset the cache).
- Step 4 and beyond (CLI cache inspection/clearing flags, locking, MCP reuse) remain on the roadmap.
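The Step 3 behavior, ingest once and skip on later runs, reduces to a small check-then-mark pattern. The sketch below uses an in-memory dictionary and a caller-supplied `ingest` callback; both, along with the name `ensure_indexed`, are hypothetical stand-ins for the CLI's real cache and ingestion code.

```python
def ensure_indexed(cache, pdf_hash, collection, ingest):
    """Run ingest(collection) only if the cache has not marked this PDF as indexed.

    Returns True when ingestion ran and False on a cache hit.
    """
    entry = cache.setdefault(pdf_hash, {"collection": collection, "indexed": False})
    if entry.get("indexed"):
        return False  # cache hit: reuse the vectors already in the collection
    ingest(collection)
    # Mark as indexed only after ingest returns, so a failed run is retried next time.
    entry["indexed"] = True
    return True
```

Note that nothing here guards against two processes ingesting the same PDF at once, which is what the locking item on the roadmap would address.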
Development Notes
- Legacy pytest scaffolding now lives in `legacy_tests/`; it is not wired into the active test suite but is preserved for reference. See `docs/notes/legacy_testing_strategy.md` for the original testing vision.
- `docs/notes/gpt_codex_debugging.md` captures debugging research on CrewAI vectordb config expectations.
- Sample PDFs are ignored by default (`sample_pdfs/` in `.gitignore`); add your own for local testing.
Troubleshooting
- `ValueError: EMBEDDER_PROVIDER environment variable is required` – ensure `.env` is loaded (CLI uses `python-dotenv`) and the variable is set.
- `File does not appear to be a valid PDF` – the validator inspects the `%PDF-` header; export/download the PDF again if it was generated incorrectly.
- Qdrant collection errors – the config layer calls `client.get_or_create_collection(...)`. Verify `VECTOR_SIZE`, `VECTOR_DISTANCE`, and `QDRANT_URL`/API key match your deployment.
- Duplicate chunks after multiple runs – delete the per-PDF collection in Qdrant (e.g., `curl -X DELETE http://<qdrant-host>/collections/pdf_<hash>`) and remove or edit the matching entry in `~/.cache/pdf_search_mcp/cache.json` so `"indexed"` becomes `false` before re-running the CLI.
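The `curl` command for dropping a per-PDF collection has a straightforward Python equivalent using only the standard library. The helper name `build_delete_collection_request` is hypothetical; the `api-key` header is how Qdrant accepts API keys over REST and is only needed when your deployment requires one.

```python
import urllib.request


def build_delete_collection_request(qdrant_url, collection, api_key=None):
    """Build (but do not send) a DELETE request for /collections/<collection>."""
    url = qdrant_url.rstrip("/") + "/collections/" + collection
    headers = {"api-key": api_key} if api_key else {}
    return urllib.request.Request(url, headers=headers, method="DELETE")


# To send it against a reachable Qdrant instance:
#   request = build_delete_collection_request("http://localhost:6333", "pdf_<hash>")
#   with urllib.request.urlopen(request, timeout=10) as response:
#       print(response.status)
```

Building the request separately from sending it lets you confirm the target URL before deleting anything. Remember to reset the matching cache entry afterwards, or the CLI will skip re-ingestion.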