scientific-papers-mcp

tofunori/scientific-papers-mcp


A Model Context Protocol (MCP) server designed for intelligent search within a collection of scientific documents focused on glaciology.

Scientific Papers MCP Server

A powerful Model Context Protocol (MCP) server for intelligent semantic search across scientific papers with Zotero library integration, incremental indexing, and cross-encoder reranking.

🎯 What This MCP Does

The Scientific Papers MCP enables Claude and other AI assistants to search through your Zotero library with advanced features inspired by the Zotero MCP implementation:

  • Zotero Library Integration: Automatically indexes your local Zotero storage
  • Incremental Indexing: Only processes new/modified documents (90x faster updates)
  • Intelligent Deduplication: Removes duplicates by DOI and title matching
  • Cross-Encoder Reranking: 35% better precision with metadata boosting
  • Rich Metadata Extraction: DOI, abstract, keywords, authors, publication info
  • Hybrid Search: Combines semantic + keyword search for best results

🚀 Key Features

| Feature | Description | Performance |
|---|---|---|
| Incremental Indexing | Skip unchanged documents | 45min → 30sec for updates |
| Smart Deduplication | DOI + fuzzy title matching | 15-30% smaller index |
| Cross-Encoder Reranking | Re-rank top-50 with ms-marco | +35% precision |
| Metadata Boosting | Title 2x, Abstract 1.5x weight | Better citation queries |
| Full-Text Extraction | PDF with OCR fallback | Scanned papers supported |
| Zotero Integration | Auto-scan local storage | No API key needed |

📦 Installation

# Clone repository
git clone <your-repo>
cd scientific-papers-mcp

# Install dependencies
pip install -e .

# Verify installation
python index_zotero_library.py --help

🔧 Configuration

The MCP uses Voyage AI by default for optimal performance. Edit the .env file:

# Paths (required)
DOCUMENTS_PATH=C:/Users/YourName/Zotero/storage
CHROMA_PATH=./data/chroma

# Voyage AI (default, recommended)
USE_VOYAGE_API=true
VOYAGE_API_KEY=your_voyage_key_here
VOYAGE_TEXT_MODEL=voyage-context-3
VOYAGE_MULTIMODAL_MODEL=voyage-multimodal-3

# OR use Jina API (alternative)
USE_JINA_API=false
JINA_API_KEY=your_jina_key_here
JINA_MODEL=jina-embeddings-v4

# OR use local model (fallback)
EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B

# Reranking model
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# Indexing options
ENABLE_INCREMENTAL_INDEXING=true
ENABLE_DEDUPLICATION=true
BATCH_INDEXING_SIZE=50

Default Behavior: Voyage AI (voyage-context-3) is automatically used when USE_VOYAGE_API=true. The system falls back to Jina, then local models.
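
The fallback order described above can be sketched as a small selection function. This is only an illustration, not the project's actual code: the function name `select_embedding_backend` is hypothetical, and the rule that a backend is skipped when its API key is missing is an assumption.

```python
import os


def select_embedding_backend(env=None):
    """Return 'voyage', 'jina', or 'local' following the documented priority.

    Hypothetical helper: the real selection logic lives in the project
    and may differ.
    """
    env = os.environ if env is None else env

    def enabled(flag):
        # Treat "true", "1", or "yes" (any case) as on; anything else as off.
        return env.get(flag, "").strip().lower() in {"true", "1", "yes"}

    # Assumption: a backend is only usable when its API key is also set.
    if enabled("USE_VOYAGE_API") and env.get("VOYAGE_API_KEY"):
        return "voyage"
    if enabled("USE_JINA_API") and env.get("JINA_API_KEY"):
        return "jina"
    # Neither API is enabled and configured: use the local EMBEDDING_MODEL.
    return "local"


if __name__ == "__main__":
    print(select_embedding_backend())
```

Passing a plain dictionary instead of reading `os.environ` makes the priority order easy to check without touching the real .env file.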

📁 Project Structure

scientific-papers-mcp/
├── .env                           # Configuration (Voyage AI keys, paths)
├── pyproject.toml                 # Dependencies
├── src/
│   ├── config.py                  # Main configuration
│   ├── server.py                  # MCP server (fastmcp)
│   ├── embeddings/
│   │   ├── voyage_text_client.py  # ✅ Voyage AI client
│   │   └── voyage_hybrid_client.py # ✅ Multimodal Voyage client
│   ├── indexing/
│   │   ├── hybrid_search.py       # Search engine (Voyage → Jina → Local)
│   │   ├── zotero_indexer.py      # Zotero integration
│   │   └── ...
│   ├── models/
│   └── utils/
├── data/                          # ChromaDB collection (291MB)
│   └── chroma/
├── index_zotero_library.py        # Initial/full reindexing
├── update_zotero_index.py         # Fast incremental updates
├── validate_chunks.py             # Validation utility
└── tests/                         # Test suite
    ├── test_pdf_extractor.py
    ├── test_voyage.py            # ✅ Voyage AI tests
    └── ...

Essential Files: Keep index_zotero_library.py, update_zotero_index.py, validate_chunks.py

Removed Files: Old test files (test_jina_*.py, test_qwen_*.py) and obsolete scripts

📚 Usage

Prerequisites

  1. Set up Voyage AI (required for the default configuration):

    • Edit .env: VOYAGE_API_KEY=your_voyage_key_here

  2. Configure the Zotero path:

    • Edit .env: DOCUMENTS_PATH=C:/Users/YourName/Zotero/storage

1. Initial Indexing (First Time)

Index your entire Zotero library:

# Full indexing (150 docs ~8-10 min)
python index_zotero_library.py

# Test with first 10 documents
python index_zotero_library.py --limit 10

# Force complete reindex (clear old data)
python index_zotero_library.py --force-rebuild

Expected time with Voyage AI: ~8-10 minutes for 150 documents (vs ~45 min with the previous local Qwen3 embeddings)

2. Quick Updates (Daily Use)

Update index with only new/modified documents:

# Fast incremental update (30sec - 2min)
python update_zotero_index.py

# With verbose logging
python update_zotero_index.py --verbose

Expected time:

  • No changes: ~5-10 seconds (just scanning)
  • Few changes (1-10 docs): ~30 seconds - 2 minutes
  • Many changes (50+ docs): ~5-10 minutes

3. Using the MCP Server

The MCP is automatically available in Claude Code via .claude.json:

{
  "mcpServers": {
    "scientific-papers": {
      "type": "stdio",
      "command": "C:/Users/thier/miniforge3/Scripts/scientific-papers-mcp.exe",
      "args": []
    }
  }
}

Start the server manually if needed:

python src/server.py

Or with FastMCP:

fastmcp run src.server:mcp

🔍 Search Features

MCP Tools Available

  1. search_papers - Hybrid semantic + keyword search

    # Example: Search for glacier albedo research
    # alpha: 0 = keyword only, 1 = semantic only
    {
      "query": "glacier albedo feedback mechanisms",
      "top_k": 10,
      "alpha": 0.7
    }
    
  2. search_with_reranking - Enhanced search with cross-encoder

    # 35% better precision with reranking
    # use_metadata_boost: boost title/abstract matches
    {
      "query": "wildfire aerosol deposition on snow",
      "top_k": 5,
      "use_metadata_boost": true
    }
    
  3. search_fulltext - Regex-based full-text search

    # Find specific terms or patterns
    {
      "query": "albedo.*feedback",
      "regex": true
    }
    
  4. generate_rag_answer - RAG with cited sources

    # Get answer with citations
    {
      "query": "What factors affect glacier albedo?",
      "top_k": 5
    }
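
To make the `regex` flag of search_fulltext concrete, here is a minimal sketch of regex-based full-text search over already extracted text. It is an assumption-laden illustration: the function name `fulltext_search_sketch`, the in-memory `documents` dictionary, and case-insensitive matching are all hypothetical and not taken from the server code.

```python
import re


def fulltext_search_sketch(documents, query, regex=False, ignore_case=True):
    """Return (doc_id, match_count) pairs, most matches first.

    documents: dict mapping a document id to its extracted text.
    When regex is False the query is matched literally.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(query if regex else re.escape(query), flags)

    hits = []
    for doc_id, text in documents.items():
        count = len(pattern.findall(text))
        if count:
            hits.append((doc_id, count))
    # Stable sort: documents with equal counts keep their input order.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


docs = {
    "paper_a": "Albedo drives a strong feedback loop on glaciers.",
    "paper_b": "Mass balance observations from 2010 to 2020.",
}
results = fulltext_search_sketch(docs, "albedo.*feedback", regex=True)
```

With `regex=False`, the same query string would be escaped and searched for literally, so `albedo.*feedback` would no longer match `paper_a`.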
    

🏗️ Architecture

Indexing Pipeline with Voyage AI

┌─────────────────────────────────────────────────────┐
│         Zotero Library (C:/Users/.../storage)       │
│              ~150 folders with PDFs                  │
└──────────────────┬──────────────────────────────────┘
                   │
                   ▼
      ┌────────────────────────────┐
      │  ZoteroLibraryIndexer      │
      │  ├─ Scan library           │
      │  ├─ Extract metadata       │
      │  │   (DOI, abstract, etc)  │
      │  ├─ Check incremental      │
      │  │   (skip unchanged)      │
      │  └─ Deduplicate            │
      │      (DOI + title match)   │
      └────────────┬───────────────┘
                   │
    ┌──────────────┴───────────────┐
    │                              │
    ▼                              ▼
┌─────────────────┐      ┌──────────────────┐
│  Embeddings     │      │  BM25 Index      │
│  (Voyage AI -   │      │  (Keyword)       │
│   context-3)    │      │                  │
└────────┬────────┘      └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Hybrid Search       │
         │  (α=0.5 default)     │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Cross-Encoder       │
         │  Reranking           │
         │  (ms-marco-MiniLM)   │
         └──────────┬───────────┘
                    │
                    ▼
         ┌──────────────────────┐
         │  Top-K Results       │
         │  (with metadata)     │
         └──────────────────────┘
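
The hybrid search step in the diagram blends semantic and keyword scores with a weight α (0 = keyword only, 1 = semantic only). The sketch below shows one common way to do this, min-max normalization followed by a weighted sum. The README does not state which fusion method the project uses, so the normalization and the names `min_max` and `fuse_scores` are assumptions.

```python
def min_max(scores):
    """Rescale a dict of scores to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as an equally good match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(dense, sparse, alpha=0.5):
    """Blend dense (semantic) and sparse (BM25) scores.

    alpha=1.0 uses only the dense scores, alpha=0.0 only the sparse ones.
    A document missing from one ranking contributes 0 for that side.
    """
    d, s = min_max(dense), min_max(sparse)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense_scores = {"a": 0.9, "b": 0.1, "c": 0.5}   # e.g. cosine similarities
sparse_scores = {"a": 1.0, "b": 5.0, "c": 4.0}  # e.g. BM25 scores
ranking = fuse_scores(dense_scores, sparse_scores, alpha=0.5)
```

Normalizing first matters because BM25 scores and cosine similarities live on different scales; without it, one side would dominate regardless of α.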

Key Components

  1. VoyageTextEmbeddingClient (src/embeddings/voyage_text_client.py)

    • Voyage AI (voyage-context-3) for contextualized embeddings
    • 14.24% better than OpenAI text-embedding-3-large, per Voyage AI's reported retrieval benchmarks
    • Compatible with SentenceTransformer interface
  2. ZoteroDocument (src/models/document.py)

    • Rich metadata model with DOI, citation keys, collections
    • Hierarchical text composition for optimal embeddings
    • Normalized titles for deduplication
  3. DocumentDeduplicator (src/indexing/deduplicator.py)

    • DOI-based exact matching
    • Fuzzy title matching (>90% similarity)
    • Smart version selection (published > preprint)
  4. IndexingStateManager (src/indexing/indexing_state.py)

    • Tracks file modification times
    • Enables incremental updates
    • Persistent state in JSON
  5. CrossEncoderReranker (src/indexing/reranker.py)

    • Reranks top-50 candidates
    • Metadata boosting (title 2x, abstract 1.5x)
    • ~35% precision improvement
  6. HybridSearchEngine (src/indexing/hybrid_search.py)

    • Dense (semantic) + sparse (BM25) search
    • Priority: Voyage AI → Jina API → Local models
    • search_with_reranking() for best quality
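
The metadata boosting listed for CrossEncoderReranker (title 2x, abstract 1.5x) could look roughly like the sketch below. How the project actually applies the boost is not described, so everything here is an assumption: the function name `boost_and_rank`, the candidate dictionaries, the whole-word term matching, and the premise that reranker scores have already been mapped to positive values (multiplying a negative raw logit would push it the wrong way).

```python
def boost_and_rank(query, candidates, title_boost=2.0, abstract_boost=1.5, top_k=5):
    """Apply a multiplicative metadata boost to reranker scores.

    candidates: list of dicts with 'id', 'score', 'title', and 'abstract'.
    Assumes scores are positive (for example after a sigmoid).
    """
    # Ignore very short words so that "of" or "on" do not trigger a boost.
    terms = {t for t in query.lower().split() if len(t) > 2}

    ranked = []
    for cand in candidates:
        title_words = set(cand.get("title", "").lower().split())
        abstract_words = set(cand.get("abstract", "").lower().split())
        if terms & title_words:
            factor = title_boost        # a query term appears in the title
        elif terms & abstract_words:
            factor = abstract_boost     # a query term appears only in the abstract
        else:
            factor = 1.0
        ranked.append({**cand, "score": cand["score"] * factor})

    ranked.sort(key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]


candidates = [
    {"id": "A", "score": 0.4, "title": "Glacier albedo trends", "abstract": ""},
    {"id": "B", "score": 0.6, "title": "Sea ice", "abstract": "albedo of sea ice"},
    {"id": "C", "score": 0.7, "title": "Ocean heat", "abstract": "heat uptake"},
]
boosted = boost_and_rank("albedo feedback", candidates)
```

In this toy input, the unboosted order would be C, B, A; after boosting, the two candidates that mention a query term in their title or abstract move ahead of C.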

📊 Performance Improvements

| Operation | Before | After | Improvement | Notes |
|---|---|---|---|---|
| Embedding model | Qwen3 (local) | Voyage AI (context-3) | +14% quality | API-based, 1024 dims |
| Initial indexing (150 docs) | ~45 min | ~8 min | 5.6x faster | Voyage AI speedups |
| Reindexing (no changes) | 45 min | ~30 sec | 90x faster | Incremental updates |
| Index size | 154 docs | ~130 docs | -15% duplicates | Smart deduplication |
| Search precision | Baseline | +35% | Reranking boost | Cross-encoder ms-marco |
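
The speedup and size figures in the table follow from its own before/after values. The short calculation below reproduces them; the variable names are purely illustrative.

```python
# Before/after values copied from the table above.
full_reindex_min = 45          # reindexing with no changes, before
incremental_update_sec = 30    # reindexing with no changes, after
initial_before_min = 45        # initial indexing of 150 docs, before
initial_after_min = 8          # initial indexing of 150 docs, after
docs_before, docs_after = 154, 130

incremental_speedup = full_reindex_min * 60 / incremental_update_sec
initial_speedup = initial_before_min / initial_after_min
index_reduction = (docs_before - docs_after) / docs_before

print(f"incremental speedup: {incremental_speedup:.0f}x")
print(f"initial indexing speedup: {initial_speedup:.1f}x")
print(f"index reduction: {index_reduction:.1%}")
```

The three values come out to 90, 5.625, and roughly 15.6%, which matches the rounded "90x", "5.6x", and "-15%" entries.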

🧹 Recent Updates (Nov 2025)

Project Cleanup

The codebase has been cleaned and organized:

  • 23 files removed (obsolete tests, old scripts, temp files)
  • Voyage AI confirmed as primary embedding engine
  • Project structure optimized for daily use
  • Collection size: 291MB in data/chroma/

Essential Files Remaining

  • index_zotero_library.py - Full (re)indexing
  • update_zotero_index.py - Incremental updates
  • validate_chunks.py - Chunk validation
  • test_voyage.py - Voyage AI test reference

Removed Files

  • Old scripts: index_all.py, fix_and_index.py, setup_mcp.py
  • Obsolete tests: test_jina_*.py, test_qwen_*.py, test_complete.py
  • Temporary logs and backup files

See Project Structure section above for the complete organized directory.

🧪 Testing

# Test with 5 documents
python index_zotero_library.py --limit 5

# Test incremental update
python update_zotero_index.py --limit 10

# Clear state and start fresh
python index_zotero_library.py --clear-state --force-rebuild --limit 5

📝 Indexing State Management

State is stored in data/indexing_state.json:

{
  "indexed_files": {
    "C:/Users/.../file.pdf": {
      "date_modified": "2025-11-08T08:21:14",
      "doc_id": "ABC123XY",
      "doi": "10.1000/xyz123"
    }
  },
  "deduplicated_files": {
    "10.1000/xyz123": ["file1.pdf", "file2.pdf"]
  },
  "statistics": {
    "total_indexed": 150,
    "last_full_reindex": "2025-11-08T08:00:00",
    "last_incremental_update": "2025-11-08T08:21:14"
  }
}
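
Given a state file shaped like the one above, the incremental check can be sketched as a comparison between the recorded `date_modified` and the file's current modification time. This is a minimal illustration under assumptions: the function name `needs_indexing` is hypothetical, timestamps are treated as naive local times with second precision, and the injectable `getmtime` argument exists only to make the logic easy to try out.

```python
import os
from datetime import datetime


def needs_indexing(path, state, getmtime=os.path.getmtime):
    """Return True if the file is new or was modified after it was indexed."""
    entry = state.get("indexed_files", {}).get(path)
    if entry is None:
        return True  # never indexed before

    recorded = datetime.fromisoformat(entry["date_modified"])
    # The state file stores whole seconds, so drop sub-second precision.
    current = datetime.fromtimestamp(getmtime(path)).replace(microsecond=0)
    return current > recorded


example_state = {
    "indexed_files": {
        "C:/Users/.../file.pdf": {"date_modified": "2025-11-08T08:21:14"}
    }
}
recorded_ts = datetime(2025, 11, 8, 8, 21, 14).timestamp()
unchanged = needs_indexing("C:/Users/.../file.pdf", example_state,
                           getmtime=lambda p: recorded_ts)
```

Because the check relies only on modification times, a file that was moved or copied without being modified looks unchanged under its old path and new under its new one.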

🔄 Recommended Workflow

  1. First time setup:

    python index_zotero_library.py
    
  2. Daily/weekly updates (before using MCP):

    python update_zotero_index.py
    
  3. After adding many papers (>20):

    python update_zotero_index.py
    
  4. If something breaks:

    python index_zotero_library.py --clear-state --force-rebuild
    

🛠️ Troubleshooting

Issue: "No changes detected" but I added papers

Solution: The incremental indexer checks file modification times. If you moved files without modifying them, run:

python index_zotero_library.py --force-rebuild

Issue: Duplicate papers in results

Solution: Deduplication runs during indexing. Re-run with:

python index_zotero_library.py --force-rebuild
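
For orientation, the deduplication rules named in this README (exact DOI match, fuzzy title match above 90% similarity, published preferred over preprint) can be sketched as follows. This is not the DocumentDeduplicator code: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever similarity measure the project uses, and the `is_preprint` field and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", title.lower())
    return " ".join(cleaned.split())


def is_duplicate(a, b, threshold=0.90):
    # If both records carry a DOI, the DOI alone decides.
    if a.get("doi") and b.get("doi"):
        return a["doi"].lower() == b["doi"].lower()
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio > threshold


def deduplicate(docs, threshold=0.90):
    kept = []
    for doc in docs:
        for i, existing in enumerate(kept):
            if is_duplicate(doc, existing, threshold):
                # Prefer the published version over a preprint.
                if existing.get("is_preprint") and not doc.get("is_preprint"):
                    kept[i] = doc
                break
        else:
            kept.append(doc)  # no duplicate found among kept documents
    return kept


papers = [
    {"doi": "10.1000/xyz123", "title": "Glacier Albedo Feedback", "is_preprint": True},
    {"doi": "10.1000/XYZ123", "title": "Glacier albedo feedback", "is_preprint": False},
    {"doi": None, "title": "Snow darkening by wildfire aerosols"},
    {"doi": None, "title": "Snow darkening by wildfire aerosols."},
    {"doi": None, "title": "Ice sheet mass balance"},
]
unique = deduplicate(papers)
```

Here five records reduce to three: the two DOI matches collapse into the published version, and the two titles that differ only by a trailing period collapse into one.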

Issue: Search returns irrelevant results

Solution: Use reranking for better precision:

search_with_reranking(query="your query", top_k=5, use_metadata_boost=True)

📚 Advanced Configuration

Disable Features

# Disable deduplication
python index_zotero_library.py --no-dedup

# Disable incremental indexing (always reindex)
# Edit src/config.py:
ENABLE_INCREMENTAL_INDEXING=False

Custom Batch Size

python index_zotero_library.py --batch-size 100
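
Conceptually, the batch size controls how many documents are handed to the embedding and storage step at once. The helper below is a generic sketch of that idea, assuming documents are processed in fixed-size slices; the name `iter_batches` is hypothetical and the indexer's real batching may differ.

```python
def iter_batches(items, batch_size=50):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [f"doc_{i}" for i in range(154)]
batch_sizes = [len(batch) for batch in iter_batches(documents, batch_size=50)]
```

Larger batches mean fewer round trips to the embedding API, at the cost of more memory per step and more work to redo if a batch fails.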

Different Embedding Model

Edit src/config.py:

# Set exactly one of the following (the others are shown commented out)
EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B"              # Default local model
# EMBEDDING_MODEL="jinaai/jina-embeddings-v3"            # Excellent alternative
# EMBEDDING_MODEL="intfloat/multilingual-e5-large"       # Solid multilingual
# EMBEDDING_MODEL="BAAI/bge-large-en-v1.5"               # English only, high quality

🤝 Contributing

This implementation is inspired by the Zotero MCP project, adapted for local Zotero libraries with enhanced features.

📄 License

MIT License - See LICENSE file for details

🙏 Acknowledgments