scientific-papers-mcp by tofunori - MCP Server

Scientific Papers MCP Server

A powerful Model Context Protocol (MCP) server for intelligent semantic search across scientific papers with Zotero library integration, incremental indexing, and cross-encoder reranking.

🎯 What This MCP Does

The Scientific Papers MCP enables Claude and other AI assistants to search through your Zotero library with advanced features inspired by the Zotero MCP implementation:

Zotero Library Integration: Automatically indexes your local Zotero storage
Incremental Indexing: Only processes new/modified documents (90x faster updates)
Intelligent Deduplication: Removes duplicates by DOI and title matching
Cross-Encoder Reranking: 35% better precision with metadata boosting
Rich Metadata Extraction: DOI, abstract, keywords, authors, publication info
Hybrid Search: Combines semantic + keyword search for best results

🚀 Key Features

Feature	Description	Performance
Incremental Indexing	Skip unchanged documents	45min → 30sec for updates
Smart Deduplication	DOI + fuzzy title matching	15-30% smaller index
Cross-Encoder Reranking	Re-rank top-50 with ms-marco	+35% precision
Metadata Boosting	Title 2x, Abstract 1.5x weight	Better citation queries
Full-Text Extraction	PDF with OCR fallback	Scanned papers supported
Zotero Integration	Auto-scan local storage	No API key needed

📦 Installation

# Clone repository
git clone <your-repo>
cd scientific-papers-mcp

# Install dependencies
pip install -e .

# Verify installation
python index_zotero_library.py --help

🔧 Configuration

The MCP uses Voyage AI by default for optimal performance. Edit .env file:

# Paths (required)
DOCUMENTS_PATH=C:/Users/YourName/Zotero/storage
CHROMA_PATH=./data/chroma

# Voyage AI (default, recommended)
USE_VOYAGE_API=true
VOYAGE_API_KEY=your_voyage_key_here
VOYAGE_TEXT_MODEL=voyage-context-3
VOYAGE_MULTIMODAL_MODEL=voyage-multimodal-3

# OR use Jina API (alternative)
USE_JINA_API=false
JINA_API_KEY=your_jina_key_here
JINA_MODEL=jina-embeddings-v4

# OR use local model (fallback)
EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B

# Reranking model
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# Indexing options
ENABLE_INCREMENTAL_INDEXING=true
ENABLE_DEDUPLICATION=true
BATCH_INDEXING_SIZE=50

Default Behavior: Voyage AI (voyage-context-3) is automatically used when USE_VOYAGE_API=true. The system falls back to Jina, then local models.

📁 Project Structure

scientific-papers-mcp/
├── .env                           # Configuration (Voyage AI keys, paths)
├── pyproject.toml                 # Dependencies
├── src/
│   ├── config.py                  # Main configuration
│   ├── server.py                  # MCP server (fastmcp)
│   ├── embeddings/
│   │   ├── voyage_text_client.py  # ✅ Voyage AI client
│   │   └── voyage_hybrid_client.py # ✅ Multimodal Voyage client
│   ├── indexing/
│   │   ├── hybrid_search.py       # Search engine (Voyage → Jina → Local)
│   │   ├── zotero_indexer.py      # Zotero integration
│   │   └── ...
│   ├── models/
│   └── utils/
├── data/                          # ChromaDB collection (291MB)
│   └── chroma/
├── index_zotero_library.py        # Initial/full reindexing
├── update_zotero_index.py         # Fast incremental updates
├── validate_chunks.py             # Validation utility
└── tests/                         # Test suite
    ├── test_pdf_extractor.py
    ├── test_voyage.py            # ✅ Voyage AI tests
    └── ...

Essential Files: Keep index_zotero_library.py, update_zotero_index.py, validate_chunks.py

Removed Files: Old test files (test_jina_*.py, test_qwen_*.py) and obsolete scripts

📚 Usage

Prerequisites

Setup Voyage AI (Required):
- Get API key from https://www.voyageai.com/
- Add to .env: VOYAGE_API_KEY=your_key_here
Configure Zotero Path:
- Edit .env: DOCUMENTS_PATH=C:/Users/YourName/Zotero/storage

1. Initial Indexing (First Time)

Index your entire Zotero library:

# Full indexing (150 docs ~8-10 min)
python index_zotero_library.py

# Test with first 10 documents
python index_zotero_library.py --limit 10

# Force complete reindex (clear old data)
python index_zotero_library.py --force-rebuild

Expected time with Voyage AI: ~8-10 minutes for 150 documents (vs ~45min with old method)

2. Quick Updates (Daily Use)

Update index with only new/modified documents:

# Fast incremental update (30sec - 2min)
python update_zotero_index.py

# With verbose logging
python update_zotero_index.py --verbose

Expected time:

No changes: ~5-10 seconds (just scanning)
Few changes (1-10 docs): ~30 seconds - 2 minutes
Many changes (50+ docs): ~5-10 minutes

3. Using the MCP Server

The MCP is automatically available in Claude Code via .claude.json:

{
  "mcpServers": {
    "scientific-papers": {
      "type": "stdio",
      "command": "C:/Users/thier/miniforge3/Scripts/scientific-papers-mcp.exe",
      "args": []
    }
  }
}

Start the server manually if needed:

python src/server.py

Or with FastMCP:

fastmcp run src.server:mcp

🔍 Search Features

MCP Tools Available

search_papers - Hybrid semantic + keyword search

# Example: Search for glacier albedo research
{
  "query": "glacier albedo feedback mechanisms",
  "top_k": 10,
  "alpha": 0.7  # 0=keyword only, 1=semantic only
}

search_with_reranking - Enhanced search with cross-encoder

# 35% better precision with reranking
{
  "query": "wildfire aerosol deposition on snow",
  "top_k": 5,
  "use_metadata_boost": true  # Boost title/abstract matches
}

search_fulltext - Regex-based full-text search

# Find specific terms or patterns
{
  "query": "albedo.*feedback",
  "regex": true
}

generate_rag_answer - RAG with cited sources

# Get answer with citations
{
  "query": "What factors affect glacier albedo?",
  "top_k": 5
}

🏗️ Architecture

Indexing Pipeline with Voyage AI

┌─────────────────────────────────────────────────────┐
│         Zotero Library (C:/Users/.../storage)       │
│              ~150 folders with PDFs                  │
└──────────────────┬──────────────────────────────────┘
                   │
                   ▼
      ┌────────────────────────────┐
      │  ZoteroLibraryIndexer      │
      │  ├─ Scan library           │
      │  ├─ Extract metadata       │
      │  │   (DOI, abstract, etc)  │
      │  ├─ Check incremental      │
      │  │   (skip unchanged)      │
      │  └─ Deduplicate            │
      │      (DOI + title match)   │
      └────────────┬───────────────┘
                   │
    ┌──────────────┴───────────────┐
    │                              │
    ▼                              ▼
┌─────────────────┐      ┌──────────────────┐
│  Embeddings     │      │  BM25 Index      │
│  (Voyage AI -   │      │  (Keyword)       │
│   context-3)    │      │                  │
└────────┬────────┘      └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Hybrid Search       │
         │  (α=0.5 default)     │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Cross-Encoder       │
         │  Reranking           │
         │  (ms-marco-MiniLM)   │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Top-K Results       │
         │  (with metadata)     │
         └──────────────────────┘

Key Components

VoyageTextEmbeddingClient (src/embeddings/voyage_text_client.py)
- Voyage AI (voyage-context-3) for contextualized embeddings
- 14.24% better than OpenAI text-embedding-3-large
- Compatible with SentenceTransformer interface
ZoteroDocument (src/models/document.py)
- Rich metadata model with DOI, citation keys, collections
- Hierarchical text composition for optimal embeddings
- Normalized titles for deduplication
DocumentDeduplicator (src/indexing/deduplicator.py)
- DOI-based exact matching
- Fuzzy title matching (>90% similarity)
- Smart version selection (published > preprint)
IndexingStateManager (src/indexing/indexing_state.py)
- Tracks file modification times
- Enables incremental updates
- Persistent state in JSON
CrossEncoderReranker (src/indexing/reranker.py)
- Reranks top-50 candidates
- Metadata boosting (title 2x, abstract 1.5x)
- ~35% precision improvement
HybridSearchEngine (src/indexing/hybrid_search.py)
- Dense (semantic) + sparse (BM25) search
- Priority: Voyage AI → Jina API → Local models
- search_with_reranking() for best quality

📊 Performance Improvements

Operation	Before	After	Improvement	Notes
Embedding model	Qwen3 (local)	Voyage AI (context-3)	+14% quality	API-based, 1024 dims
Initial indexing (150 docs)	~45 min	~8 min	5.6x faster	Voyage AI speedups
Reindexing (no changes)	45 min	~30 sec	90x faster	Incremental updates
Index size	154 docs	~130 docs	-15% duplicates	Smart deduplication
Search precision	Baseline	+35%	Reranking boost	Cross-encoder ms-marco

🧹 Recent Updates (Nov 2025)

Project Cleanup

The codebase has been cleaned and organized:

✅ 23 files removed (obsolete tests, old scripts, temp files)
✅ Voyage AI confirmed as primary embedding engine
✅ Project structure optimized for daily use
✅ Collection size: 291MB in data/chroma/

Essential Files Remaining

index_zotero_library.py - Full (re)indexing
update_zotero_index.py - Incremental updates
validate_chunks.py - Chunk validation
test_voyage.py - Voyage AI test reference

Removed Files

Old scripts: index_all.py, fix_and_index.py, setup_mcp.py
Obsolete tests: test_jina_*.py, test_qwen_*.py, test_complete.py
Temporary logs and backup files

See Project Structure section above for the complete organized directory.

🧪 Testing

# Test with 5 documents
python index_zotero_library.py --limit 5

# Test incremental update
python update_zotero_index.py --limit 10

# Clear state and start fresh
python index_zotero_library.py --clear-state --force-rebuild --limit 5

📝 Indexing State Management

State is stored in data/indexing_state.json:

{
  "indexed_files": {
    "C:/Users/.../file.pdf": {
      "date_modified": "2025-11-08T08:21:14",
      "doc_id": "ABC123XY",
      "doi": "10.1000/xyz123"
    }
  },
  "deduplicated_files": {
    "10.1000/xyz123": ["file1.pdf", "file2.pdf"]
  },
  "statistics": {
    "total_indexed": 150,
    "last_full_reindex": "2025-11-08T08:00:00",
    "last_incremental_update": "2025-11-08T08:21:14"
  }
}

🔄 Recommended Workflow

First time setup:
```
python index_zotero_library.py
```
Daily/weekly updates (before using MCP):
```
python update_zotero_index.py
```
After adding many papers (>20):
```
python update_zotero_index.py
```

If something breaks:

python index_zotero_library.py --clear-state --force-rebuild

🛠️ Troubleshooting

Issue: "No changes detected" but I added papers

Solution: The incremental indexer checks file modification times. If you moved files without modifying them, run:

python index_zotero_library.py --force-rebuild

Issue: Duplicate papers in results

Solution: Deduplication runs during indexing. Re-run with:

python index_zotero_library.py --force-rebuild

Issue: Search returns irrelevant results

Solution: Use reranking for better precision:

search_with_reranking(query="your query", top_k=5, use_metadata_boost=True)

📚 Advanced Configuration

Disable Features

# Disable deduplication
python index_zotero_library.py --no-dedup

# Disable incremental indexing (always reindex)
# Edit src/config.py:
ENABLE_INCREMENTAL_INDEXING=False

Custom Batch Size

python index_zotero_library.py --batch-size 100

Different Embedding Model

Edit src/config.py:

EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"              # State-of-the-art 2025 (default)
EMBEDDING_MODEL="jinaai/jina-embeddings-v3"              # Excellent alternative
EMBEDDING_MODEL="intfloat/multilingual-e5-large"         # Solid multilingual
EMBEDDING_MODEL="BAAI/bge-large-en-v1.5"                 # English only, high quality

🤝 Contributing

This implementation is inspired by the Zotero MCP project, adapted for local Zotero libraries with enhanced features.

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments

Zotero MCP for the indexing strategy inspiration
sentence-transformers for embeddings and reranking
ChromaDB for vector storage
FastMCP for the MCP framework