pdf-search-mcp

DRanger666/pdf-search-mcp

The PDF Search MCP Server is a lightweight server and CLI harness for indexing and semantically searching PDFs with CrewAI's PDFSearchTool, using your own embeddings provider and Qdrant stack.

PDF Search MCP Server

Lightweight Model Context Protocol (MCP) server and CLI harness that wrap CrewAI’s PDFSearchTool so you can index and semantically search PDFs with your own embeddings/Qdrant stack.

Components

  • MCP server (pdfsearch.mcp_pdfsearch): exposes a search_pdf tool that validates input PDFs, ensures the configured Qdrant collection exists, and runs CrewAI searches.
  • CLI harness (scripts/test_pdfsearch_tool.py): developer-facing script that fingerprints PDFs, records cache metadata, skips re-ingestion on cache hits, and lets you run ad‑hoc queries without a client that speaks MCP.
  • Caching helpers (src/pdfsearch/cache.py): shared utilities that compute SHA-256 hashes, derive per-PDF collection names (pdf_<hash[:12]>), persist metadata under ~/.cache/pdf_search_mcp/cache.json, and track whether a PDF has already been indexed (Steps 1–3 of the caching rollout).
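
The fingerprinting and naming scheme used by the caching helpers can be sketched as below. This is a minimal illustration, not the contents of src/pdfsearch/cache.py: the function names hash_pdf and collection_name_for are hypothetical, and only the SHA-256 hashing and the pdf_<hash[:12]> pattern come from the description above.

```python
import hashlib
from pathlib import Path


def hash_pdf(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size blocks so large PDFs are never loaded whole.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def collection_name_for(file_hash: str) -> str:
    """Derive a per-PDF collection name: 'pdf_' plus the first 12 hash characters."""
    return f"pdf_{file_hash[:12]}"
```

Because the name depends only on the file contents, the same PDF maps to the same collection regardless of where it is stored on disk.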

Requirements

  • Python 3.8+
  • Access to an embeddings provider supported by CrewAI (VoyageAI recommended)
  • Reachable Qdrant instance (cloud or self-hosted)

Installation

pip install -e .
cp .env.example .env
# populate .env with your real credentials

Configuration

Config (src/pdfsearch/config.py) reads the following environment variables:

Variable                      Required       Notes
EMBEDDER_PROVIDER             Required       e.g. voyageai, openai, etc.
EMBEDDER_MODEL                Required       Provider-specific model ID (e.g. voyage-code-3).
EMBEDDER_API_KEY              Conditional*   Required for non-Voyage providers.
EMBEDDINGS_VOYAGEAI_API_KEY   Conditional*   Required when EMBEDDER_PROVIDER=voyageai; automatically overrides EMBEDDER_API_KEY.
QDRANT_URL                    Required       HTTPS URL or local endpoint for your Qdrant service.
QDRANT_API_KEY                Optional       Needed for managed Qdrant deployments.
VECTOR_SIZE                   Required       Integer dimension of your embeddings.
VECTOR_DISTANCE               Required       One of Cosine, Euclid, Dot, etc. (matches qdrant_client.models.Distance).
LOG_LEVEL                     Optional       Defaults to INFO.

Tip: .env.example enumerates every variable the server expects. Copy it to .env, fill in your values, and the CLI/MCP server will load it via python-dotenv.
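
To make the rules in the table concrete, here is a hedged sketch of how such variables could be read and validated. It is not the code in src/pdfsearch/config.py: the load_settings function and the returned dictionary keys are assumptions made for illustration, and the error wording simply mirrors the message listed under Troubleshooting.

```python
import os
from typing import Mapping, Optional

# Variables the table marks as unconditionally required.
REQUIRED = (
    "EMBEDDER_PROVIDER",
    "EMBEDDER_MODEL",
    "QDRANT_URL",
    "VECTOR_SIZE",
    "VECTOR_DISTANCE",
)


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Validate the environment and return a plain settings dict (hypothetical shape)."""
    env = os.environ if env is None else env
    for name in REQUIRED:
        if not env.get(name):
            raise ValueError(f"{name} environment variable is required")

    provider = env["EMBEDDER_PROVIDER"]
    # The Voyage-specific key takes precedence when the provider is voyageai.
    key_name = "EMBEDDINGS_VOYAGEAI_API_KEY" if provider == "voyageai" else "EMBEDDER_API_KEY"
    api_key = env.get(key_name)
    if not api_key:
        raise ValueError(f"{key_name} environment variable is required")

    return {
        "provider": provider,
        "model": env["EMBEDDER_MODEL"],
        "api_key": api_key,
        "qdrant_url": env["QDRANT_URL"],
        "qdrant_api_key": env.get("QDRANT_API_KEY"),  # optional
        "vector_size": int(env["VECTOR_SIZE"]),  # raises ValueError if not an integer
        "vector_distance": env["VECTOR_DISTANCE"],
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```

Passing a plain dictionary instead of relying on os.environ keeps the function easy to exercise in tests without touching the real environment.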

Running the CLI Harness

python scripts/test_pdfsearch_tool.py \
  --pdf sample_pdfs/manual.pdf \
  --query "explain the architecture" \
  --limit 5 \
  --similarity-threshold 0.6 \
  --format pretty

What it does:

  1. Loads .env, validates the PDF (extension + %PDF- header).
  2. Computes the PDF's SHA-256 hash and derives a Qdrant collection name (pdf_<hash[:12]>).
  3. Records cache metadata (path, hash, collection, ingestion flag) in ~/.cache/pdf_search_mcp/cache.json.
  4. Initializes PDFSearchTool with your embeddings + Qdrant config and forces the adapter to use the derived collection.
  5. On the first run, ingests the PDF into Qdrant and marks the cache entry as indexed; subsequent runs skip ingestion and reuse the existing vectors when the cache says the PDF is already indexed.
  6. Issues a similarity search using the adapter’s low-level client.search() API for structured results (including chunk_index, normalized score, and Qdrant point_id).
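
Steps 1 and 5 above can be sketched in a few lines. This is an illustrative approximation rather than the harness's actual code: the helper names are hypothetical, the header check assumes the file begins with %PDF- at byte 0, and the cache is assumed to be a dictionary keyed by file hash whose entries carry an "indexed" flag.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Check the extension and the leading %PDF- magic bytes."""
    if path.suffix.lower() != ".pdf":
        return False
    with path.open("rb") as handle:
        return handle.read(5) == b"%PDF-"


def needs_ingestion(cache: dict, file_hash: str) -> bool:
    """Return True unless the cache already marks this hash as indexed."""
    entry = cache.get(file_hash)
    return not (isinstance(entry, dict) and entry.get("indexed") is True)
```

With helpers like these, the first run for a given PDF would ingest and then set the flag, and later runs would go straight to the search step.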

Resetting a collection: if you need to re-ingest (e.g., after deleting a Qdrant collection), remove the hash entry from ~/.cache/pdf_search_mcp/cache.json or toggle its "indexed" flag to false before re-running the CLI.
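
If you prefer not to edit the JSON by hand, a small script can flip the flag. The sketch below assumes the cache file is a JSON object keyed by the full PDF hash and that each entry is an object with an "indexed" field; the actual layout is not documented here, so inspect your cache.json before relying on it. The mark_unindexed helper is hypothetical and not part of the project.

```python
import json
from pathlib import Path

CACHE_PATH = Path.home() / ".cache" / "pdf_search_mcp" / "cache.json"


def mark_unindexed(file_hash: str, cache_path: Path = CACHE_PATH) -> bool:
    """Set "indexed" to false for one entry; return True if an entry was changed."""
    if not cache_path.exists():
        return False
    cache = json.loads(cache_path.read_text(encoding="utf-8"))
    # Assumed layout: {"<hash>": {"indexed": true, ...}, ...}
    entry = cache.get(file_hash) if isinstance(cache, dict) else None
    if not isinstance(entry, dict):
        return False
    entry["indexed"] = False
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return True
```

The function leaves every other entry untouched, so the next CLI run re-ingests only the PDF you reset.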

Running the MCP Server

pdf-search-mcp
# or
python -m pdfsearch.mcp_pdfsearch

The server registers a single MCP tool:

Tool         Description                                                          Args
search_pdf   Validates/ingests a PDF and runs a semantic search through CrewAI    pdf_path (string, required), query (string, required)

Internally the server bootstraps a shared PDFSearchTool instance against the pdf_search_bootstrap collection to verify configuration on startup, then reuses it for incoming tool calls.
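
For orientation, this is roughly what an MCP client sends to invoke the tool once the connection has been initialized. MCP tool calls are JSON-RPC 2.0 requests with the method tools/call; the build_search_request helper is a hypothetical convenience for illustration, and the argument values are placeholders.

```python
import json


def build_search_request(pdf_path: str, query: str, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request for the search_pdf tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search_pdf",
            # Both arguments are required strings, per the tool table above.
            "arguments": {"pdf_path": pdf_path, "query": query},
        },
    }
    return json.dumps(payload)


if __name__ == "__main__":
    print(build_search_request("sample_pdfs/manual.pdf", "explain the architecture"))
```

In practice an MCP client library builds and sends this message for you; the sketch only shows the shape of the call.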

Caching Rollout

See docs/caching_rollout_plan.md for the staged plan that moves from simple fingerprinting (already implemented) to per-PDF collections, ingestion skipping, CLI cache inspection, concurrency guards, and eventually sharing the cache logic with the MCP server.

Current status:

  • Step 1 complete: hash + JSON cache helpers (src/pdfsearch/cache.py).
  • Step 2 complete: deterministic collection names + adapter retargeting.
  • Step 3 complete: cache-aware ingestion skipping in the CLI (per-PDF collections are only populated once unless you reset the cache).
  • Step 4 and beyond (CLI cache inspection/clearing flags, locking, MCP reuse) remain on the roadmap.

Development Notes

  • Legacy pytest scaffolding now lives in legacy_tests/; it is not wired into the active test suite but is preserved for reference. See docs/notes/legacy_testing_strategy.md for the original testing vision.
  • docs/notes/gpt_codex_debugging.md captures debugging research on CrewAI vectordb config expectations.
  • Sample PDFs are ignored by default (sample_pdfs/ in .gitignore); add your own for local testing.

Troubleshooting

  • ValueError: EMBEDDER_PROVIDER environment variable is required – ensure .env is loaded (CLI uses python-dotenv) and the variable is set.
  • File does not appear to be a valid PDF – the validator inspects the %PDF- header; export/download the PDF again if it was generated incorrectly.
  • Qdrant collection errors – the config layer calls client.get_or_create_collection(...). Verify VECTOR_SIZE, VECTOR_DISTANCE, and QDRANT_URL/API key match your deployment.
  • Duplicate chunks after multiple runs – delete the per-PDF collection in Qdrant (e.g., curl -X DELETE http://<qdrant-host>/collections/pdf_<hash>) and remove or edit the matching entry in ~/.cache/pdf_search_mcp/cache.json so "indexed" becomes false before re-running the CLI.