pdf-search-mcp by DRanger666 - MCP Server

PDF Search MCP Server

Lightweight Model Context Protocol (MCP) server and CLI harness that wrap CrewAI’s PDFSearchTool so you can index and semantically search PDFs with your own embeddings/Qdrant stack.

Components

MCP server (pdfsearch.mcp_pdfsearch): exposes a search_pdf tool that validates input PDFs, ensures the configured Qdrant collection exists, and runs CrewAI searches.
CLI harness (scripts/test_pdfsearch_tool.py): developer-facing script that fingerprints PDFs, records cache metadata, skips re-ingestion on cache hits, and lets you run ad hoc queries without a client that speaks MCP.
Caching helpers (src/pdfsearch/cache.py): shared utilities that compute SHA-256 hashes, derive per-PDF collection names (pdf_<hash[:12]>), persist metadata under ~/.cache/pdf_search_mcp/cache.json, and track whether a PDF has already been indexed (Steps 1–3 of the caching rollout).
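
The fingerprinting and naming scheme used by the caching helpers can be sketched roughly as follows. This is an illustration only, not the code in src/pdfsearch/cache.py; the function names fingerprint_pdf and collection_name_for are hypothetical, and only the SHA-256 hash and the pdf_<hash[:12]> pattern come from the description above.

```python
import hashlib
from pathlib import Path


def fingerprint_pdf(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Hypothetical helper: chunked reads keep memory use flat for large PDFs.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def collection_name_for(file_hash: str) -> str:
    """Derive the per-PDF collection name: pdf_ plus the first 12 hex chars."""
    return f"pdf_{file_hash[:12]}"
```

Because the name depends only on file contents, renaming or moving a PDF maps to the same collection, while any byte-level change produces a new one.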

Requirements

Python 3.8+
Access to an embeddings provider supported by CrewAI (VoyageAI recommended)
Reachable Qdrant instance (cloud or self-hosted)

Installation

pip install -e .
cp .env.example .env
# populate .env with your real credentials

Configuration

Config (src/pdfsearch/config.py) reads the following environment variables:

Variable	Required	Notes
`EMBEDDER_PROVIDER`	Required	e.g. `voyageai`, `openai`, etc.
`EMBEDDER_MODEL`	Required	Provider-specific model ID (e.g. `voyage-code-3`).
`EMBEDDER_API_KEY`	Conditional*	Required for non-Voyage providers.
`EMBEDDINGS_VOYAGEAI_API_KEY`	Conditional*	Required when `EMBEDDER_PROVIDER=voyageai`; automatically overrides `EMBEDDER_API_KEY`.
`QDRANT_URL`	Required	HTTPS URL or local endpoint for your Qdrant service.
`QDRANT_API_KEY`	Optional	Needed for managed Qdrant deployments.
`VECTOR_SIZE`	Required	Integer dimension of your embeddings.
`VECTOR_DISTANCE`	Required	One of `Cosine`, `Euclid`, `Dot`, or `Manhattan` (must match a value of `qdrant_client.models.Distance`).
`LOG_LEVEL`	Optional	Defaults to `INFO`.

Tip: .env.example enumerates every variable the server expects. Copy it to .env, fill in your values, and the CLI/MCP server will load it via python-dotenv.
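
The rules in the table above can be sketched as a small loader. This is a hedged approximation, not the actual contents of src/pdfsearch/config.py: the Settings dataclass, load_settings, and the exact set of accepted distance names are assumptions, and the sketch reads from a plain mapping rather than calling python-dotenv.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: distance names accepted by recent qdrant-client releases.
ALLOWED_DISTANCES = {"Cosine", "Euclid", "Dot", "Manhattan"}


@dataclass(frozen=True)
class Settings:
    embedder_provider: str
    embedder_model: str
    embedder_api_key: str
    qdrant_url: str
    qdrant_api_key: Optional[str]
    vector_size: int
    vector_distance: str
    log_level: str


def _require(env: Mapping[str, str], name: str) -> str:
    """Return a non-empty variable or raise the error shown in Troubleshooting."""
    value = env.get(name, "").strip()
    if not value:
        raise ValueError(f"{name} environment variable is required")
    return value


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    provider = _require(env, "EMBEDDER_PROVIDER")
    # The Voyage-specific key takes precedence when the provider is voyageai.
    key_name = "EMBEDDINGS_VOYAGEAI_API_KEY" if provider == "voyageai" else "EMBEDDER_API_KEY"
    vector_size = int(_require(env, "VECTOR_SIZE"))
    if vector_size <= 0:
        raise ValueError("VECTOR_SIZE must be a positive integer")
    distance = _require(env, "VECTOR_DISTANCE")
    if distance not in ALLOWED_DISTANCES:
        raise ValueError(f"VECTOR_DISTANCE must be one of {sorted(ALLOWED_DISTANCES)}")
    return Settings(
        embedder_provider=provider,
        embedder_model=_require(env, "EMBEDDER_MODEL"),
        embedder_api_key=_require(env, key_name),
        qdrant_url=_require(env, "QDRANT_URL"),
        qdrant_api_key=env.get("QDRANT_API_KEY") or None,
        vector_size=vector_size,
        vector_distance=distance,
        log_level=env.get("LOG_LEVEL") or "INFO",
    )
```

Failing fast on a missing or malformed variable at startup tends to be easier to debug than a dimension mismatch reported later by Qdrant.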

Running the CLI Harness

python scripts/test_pdfsearch_tool.py \
  --pdf sample_pdfs/manual.pdf \
  --query "explain the architecture" \
  --limit 5 \
  --similarity-threshold 0.6 \
  --format pretty

What it does:

Loads .env, validates the PDF (extension + %PDF- header).
Computes the PDF’s SHA-256 hash and derives a Qdrant collection name from its first 12 hex characters (pdf_<hash[:12]>).
Records cache metadata (path, hash, collection, ingestion flag) in ~/.cache/pdf_search_mcp/cache.json.
Initializes PDFSearchTool with your embeddings + Qdrant config and forces the adapter to use the derived collection.
On the first run, ingests the PDF into Qdrant and marks the cache entry as indexed; subsequent runs skip ingestion and reuse the existing vectors when the cache says the PDF is already indexed.
Issues a similarity search using the adapter’s low-level client.search() API for structured results (including chunk_index, normalized score, and Qdrant point_id).
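
Two of the steps above, PDF validation and cache-aware ingestion skipping, can be sketched as below. The helper names looks_like_pdf and ensure_indexed and the dict-shaped cache entry are hypothetical; only the extension check, the %PDF- header check, and the "indexed" flag are taken from this README. The real ingestion call into PDFSearchTool is represented by a caller-supplied function.

```python
from pathlib import Path
from typing import Callable, Dict


def looks_like_pdf(path: Path) -> bool:
    """Check the .pdf extension and the %PDF- magic bytes at the file start."""
    if path.suffix.lower() != ".pdf" or not path.is_file():
        return False
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


def ensure_indexed(entry: Dict[str, object], ingest: Callable[[], None]) -> bool:
    """Run ingest() only when the cache entry is not yet marked as indexed.

    Returns True if ingestion ran, False on a cache hit. The flag is set
    only after ingest() returns, so a failed ingestion is retried next run.
    """
    if entry.get("indexed"):
        return False
    ingest()
    entry["indexed"] = True
    return True
```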

Resetting a collection: if you need to re-ingest (e.g., after deleting a Qdrant collection), remove the hash entry from ~/.cache/pdf_search_mcp/cache.json or toggle its "indexed" flag to false before re-running the CLI.
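
If you would rather not edit the file by hand, a small script along these lines could flip the flag. It assumes cache.json is a JSON object keyed by the full PDF hash, with each entry holding an "indexed" boolean; that layout is a guess, so inspect your own cache file first. The function name mark_unindexed is hypothetical.

```python
import json
from pathlib import Path

CACHE_PATH = Path.home() / ".cache" / "pdf_search_mcp" / "cache.json"


def mark_unindexed(file_hash: str, cache_path: Path = CACHE_PATH) -> bool:
    """Set "indexed" to false for one entry; return True if the entry existed.

    Assumes a top-level JSON object keyed by hash (unverified layout).
    """
    if not cache_path.exists():
        return False
    data = json.loads(cache_path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return False
    entry = data.get(file_hash)
    if not isinstance(entry, dict):
        return False
    entry["indexed"] = False
    cache_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return True
```

The next CLI run for that PDF should then treat it as not yet ingested.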

Running the MCP Server

pdf-search-mcp
# or
python -m pdfsearch.mcp_pdfsearch

The server registers a single MCP tool:

Tool	Description	Args
`search_pdf`	Validates/ingests a PDF and runs a semantic search through CrewAI	`pdf_path` (string, required), `query` (string, required)

Internally the server bootstraps a shared PDFSearchTool instance against the pdf_search_bootstrap collection to verify configuration on startup, then reuses it for incoming tool calls.
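
For orientation, a call to search_pdf travels as a JSON-RPC 2.0 tools/call request, as defined by the MCP specification. MCP clients normally build this message for you after the initialize handshake; the sketch below only shows its shape. The helper name build_search_pdf_call is hypothetical, and the sample path and query are placeholders.

```python
import json
from itertools import count

# Monotonic request IDs, as JSON-RPC expects a unique id per request.
_request_ids = count(1)


def build_search_pdf_call(pdf_path: str, query: str) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request for the search_pdf tool."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": "search_pdf",
            "arguments": {"pdf_path": pdf_path, "query": query},
        },
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_search_pdf_call("sample_pdfs/manual.pdf", "explain the architecture"))
```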

Caching Rollout

See docs/caching_rollout_plan.md for the staged plan that moves from simple fingerprinting (already implemented) to per-PDF collections, ingestion skipping, CLI cache inspection, concurrency guards, and eventually sharing the cache logic with the MCP server.

Current status:

Step 1 complete: hash + JSON cache helpers (src/pdfsearch/cache.py).
Step 2 complete: deterministic collection names + adapter retargeting.
Step 3 complete: cache-aware ingestion skipping in the CLI (per-PDF collections are only populated once unless you reset the cache).
Step 4 and beyond (CLI cache inspection/clearing flags, locking, MCP reuse) remain on the roadmap.

Development Notes

Legacy pytest scaffolding now lives in legacy_tests/; it is not wired into the active test suite but is preserved for reference. See docs/notes/legacy_testing_strategy.md for the original testing vision.
docs/notes/gpt_codex_debugging.md captures debugging research on CrewAI vectordb config expectations.
Sample PDFs are ignored by default (sample_pdfs/ in .gitignore); add your own for local testing.

Troubleshooting

ValueError: EMBEDDER_PROVIDER environment variable is required – ensure .env is loaded (CLI uses python-dotenv) and the variable is set.
File does not appear to be a valid PDF – the validator inspects the %PDF- header; export/download the PDF again if it was generated incorrectly.
Qdrant collection errors – the config layer calls client.get_or_create_collection(...). Verify VECTOR_SIZE, VECTOR_DISTANCE, and QDRANT_URL/API key match your deployment.
Duplicate chunks after multiple runs – delete the per-PDF collection in Qdrant (e.g., curl -X DELETE http://<qdrant-host>/collections/pdf_<hash>) and remove or edit the matching entry in ~/.cache/pdf_search_mcp/cache.json so "indexed" becomes false before re-running the CLI.