tofunori/scientific-papers-mcp
If you are the rightful owner of scientific-papers-mcp and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to dayong@mcphub.com.
A Model Context Protocol (MCP) server designed for intelligent search within a collection of scientific documents focused on glaciology.
Scientific Papers MCP Server
A powerful Model Context Protocol (MCP) server for intelligent semantic search across scientific papers with Zotero library integration, incremental indexing, and cross-encoder reranking.
🎯 What This MCP Does
The Scientific Papers MCP enables Claude and other AI assistants to search through your Zotero library with advanced features inspired by the Zotero MCP implementation:
- Zotero Library Integration: Automatically indexes your local Zotero storage
- Incremental Indexing: Only processes new/modified documents (90x faster updates)
- Intelligent Deduplication: Removes duplicates by DOI and title matching
- Cross-Encoder Reranking: 35% better precision with metadata boosting
- Rich Metadata Extraction: DOI, abstract, keywords, authors, publication info
- Hybrid Search: Combines semantic + keyword search for best results
🚀 Key Features
| Feature | Description | Performance |
|---|---|---|
| Incremental Indexing | Skip unchanged documents | 45min → 30sec for updates |
| Smart Deduplication | DOI + fuzzy title matching | 15-30% smaller index |
| Cross-Encoder Reranking | Re-rank top-50 with ms-marco | +35% precision |
| Metadata Boosting | Title 2x, Abstract 1.5x weight | Better citation queries |
| Full-Text Extraction | PDF with OCR fallback | Scanned papers supported |
| Zotero Integration | Auto-scan local storage | No API key needed |
📦 Installation
# Clone repository
git clone <your-repo>
cd scientific-papers-mcp
# Install dependencies
pip install -e .
# Verify installation
python index_zotero_library.py --help
🔧 Configuration
The MCP uses Voyage AI by default for optimal performance. Edit .env file:
# Paths (required)
DOCUMENTS_PATH=C:/Users/YourName/Zotero/storage
CHROMA_PATH=./data/chroma
# Voyage AI (default, recommended)
USE_VOYAGE_API=true
VOYAGE_API_KEY=your_voyage_key_here
VOYAGE_TEXT_MODEL=voyage-context-3
VOYAGE_MULTIMODAL_MODEL=voyage-multimodal-3
# OR use Jina API (alternative)
USE_JINA_API=false
JINA_API_KEY=your_jina_key_here
JINA_MODEL=jina-embeddings-v4
# OR use local model (fallback)
EMBEDDING_MODEL=Qwen/Qwen3-Embedding-0.6B
# Reranking model
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
# Indexing options
ENABLE_INCREMENTAL_INDEXING=true
ENABLE_DEDUPLICATION=true
BATCH_INDEXING_SIZE=50
Default Behavior: Voyage AI (voyage-context-3) is automatically used when USE_VOYAGE_API=true. The system falls back to Jina, then local models.
📁 Project Structure
scientific-papers-mcp/
├── .env # Configuration (Voyage AI keys, paths)
├── pyproject.toml # Dependencies
├── src/
│ ├── config.py # Main configuration
│ ├── server.py # MCP server (fastmcp)
│ ├── embeddings/
│ │ ├── voyage_text_client.py # ✅ Voyage AI client
│ │ └── voyage_hybrid_client.py # ✅ Multimodal Voyage client
│ ├── indexing/
│ │ ├── hybrid_search.py # Search engine (Voyage → Jina → Local)
│ │ ├── zotero_indexer.py # Zotero integration
│ │ └── ...
│ ├── models/
│ └── utils/
├── data/ # ChromaDB collection (291MB)
│ └── chroma/
├── index_zotero_library.py # Initial/full reindexing
├── update_zotero_index.py # Fast incremental updates
├── validate_chunks.py # Validation utility
└── tests/ # Test suite
├── test_pdf_extractor.py
├── test_voyage.py # ✅ Voyage AI tests
└── ...
Essential Files: Keep index_zotero_library.py, update_zotero_index.py, validate_chunks.py
Removed Files: Old test files (test_jina_*.py, test_qwen_*.py) and obsolete scripts
📚 Usage
Prerequisites
-
Setup Voyage AI (Required):
- Get API key from https://www.voyageai.com/
- Add to
.env:VOYAGE_API_KEY=your_key_here
-
Configure Zotero Path:
- Edit
.env:DOCUMENTS_PATH=C:/Users/YourName/Zotero/storage
- Edit
1. Initial Indexing (First Time)
Index your entire Zotero library:
# Full indexing (150 docs ~8-10 min)
python index_zotero_library.py
# Test with first 10 documents
python index_zotero_library.py --limit 10
# Force complete reindex (clear old data)
python index_zotero_library.py --force-rebuild
Expected time with Voyage AI: ~8-10 minutes for 150 documents (vs ~45min with old method)
2. Quick Updates (Daily Use)
Update index with only new/modified documents:
# Fast incremental update (30sec - 2min)
python update_zotero_index.py
# With verbose logging
python update_zotero_index.py --verbose
Expected time:
- No changes: ~5-10 seconds (just scanning)
- Few changes (1-10 docs): ~30 seconds - 2 minutes
- Many changes (50+ docs): ~5-10 minutes
3. Using the MCP Server
The MCP is automatically available in Claude Code via .claude.json:
{
"mcpServers": {
"scientific-papers": {
"type": "stdio",
"command": "C:/Users/thier/miniforge3/Scripts/scientific-papers-mcp.exe",
"args": []
}
}
}
Start the server manually if needed:
python src/server.py
Or with FastMCP:
fastmcp run src.server:mcp
🔍 Search Features
MCP Tools Available
-
search_papers- Hybrid semantic + keyword search# Example: Search for glacier albedo research { "query": "glacier albedo feedback mechanisms", "top_k": 10, "alpha": 0.7 # 0=keyword only, 1=semantic only } -
search_with_reranking- Enhanced search with cross-encoder# 35% better precision with reranking { "query": "wildfire aerosol deposition on snow", "top_k": 5, "use_metadata_boost": true # Boost title/abstract matches } -
search_fulltext- Regex-based full-text search# Find specific terms or patterns { "query": "albedo.*feedback", "regex": true } -
generate_rag_answer- RAG with cited sources# Get answer with citations { "query": "What factors affect glacier albedo?", "top_k": 5 }
🏗️ Architecture
Indexing Pipeline with Voyage AI
┌─────────────────────────────────────────────────────┐
│ Zotero Library (C:/Users/.../storage) │
│ ~150 folders with PDFs │
└──────────────────┬──────────────────────────────────┘
│
▼
┌────────────────────────────┐
│ ZoteroLibraryIndexer │
│ ├─ Scan library │
│ ├─ Extract metadata │
│ │ (DOI, abstract, etc) │
│ ├─ Check incremental │
│ │ (skip unchanged) │
│ └─ Deduplicate │
│ (DOI + title match) │
└────────────┬───────────────┘
│
┌──────────────┴───────────────┐
│ │
▼ ▼
┌─────────────────┐ ┌──────────────────┐
│ Embeddings │ │ BM25 Index │
│ (Voyage AI - │ │ (Keyword) │
│ context-3) │ │ │
└────────┬────────┘ └────────┬─────────┘
│ │
└──────────┬─────────────┘
│
▼
┌──────────────────────┐
│ Hybrid Search │
│ (α=0.5 default) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Cross-Encoder │
│ Reranking │
│ (ms-marco-MiniLM) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Top-K Results │
│ (with metadata) │
└──────────────────────┘
Key Components
-
VoyageTextEmbeddingClient(src/embeddings/voyage_text_client.py)- Voyage AI (voyage-context-3) for contextualized embeddings
- 14.24% better than OpenAI text-embedding-3-large
- Compatible with SentenceTransformer interface
-
ZoteroDocument(src/models/document.py)- Rich metadata model with DOI, citation keys, collections
- Hierarchical text composition for optimal embeddings
- Normalized titles for deduplication
-
DocumentDeduplicator(src/indexing/deduplicator.py)- DOI-based exact matching
- Fuzzy title matching (>90% similarity)
- Smart version selection (published > preprint)
-
IndexingStateManager(src/indexing/indexing_state.py)- Tracks file modification times
- Enables incremental updates
- Persistent state in JSON
-
CrossEncoderReranker(src/indexing/reranker.py)- Reranks top-50 candidates
- Metadata boosting (title 2x, abstract 1.5x)
- ~35% precision improvement
-
HybridSearchEngine(src/indexing/hybrid_search.py)- Dense (semantic) + sparse (BM25) search
- Priority: Voyage AI → Jina API → Local models
search_with_reranking()for best quality
📊 Performance Improvements
| Operation | Before | After | Improvement | Notes |
|---|---|---|---|---|
| Embedding model | Qwen3 (local) | Voyage AI (context-3) | +14% quality | API-based, 1024 dims |
| Initial indexing (150 docs) | ~45 min | ~8 min | 5.6x faster | Voyage AI speedups |
| Reindexing (no changes) | 45 min | ~30 sec | 90x faster | Incremental updates |
| Index size | 154 docs | ~130 docs | -15% duplicates | Smart deduplication |
| Search precision | Baseline | +35% | Reranking boost | Cross-encoder ms-marco |
🧹 Recent Updates (Nov 2025)
Project Cleanup
The codebase has been cleaned and organized:
- ✅ 23 files removed (obsolete tests, old scripts, temp files)
- ✅ Voyage AI confirmed as primary embedding engine
- ✅ Project structure optimized for daily use
- ✅ Collection size: 291MB in
data/chroma/
Essential Files Remaining
index_zotero_library.py- Full (re)indexingupdate_zotero_index.py- Incremental updatesvalidate_chunks.py- Chunk validationtest_voyage.py- Voyage AI test reference
Removed Files
- Old scripts:
index_all.py,fix_and_index.py,setup_mcp.py - Obsolete tests:
test_jina_*.py,test_qwen_*.py,test_complete.py - Temporary logs and backup files
See Project Structure section above for the complete organized directory.
🧪 Testing
# Test with 5 documents
python index_zotero_library.py --limit 5
# Test incremental update
python update_zotero_index.py --limit 10
# Clear state and start fresh
python index_zotero_library.py --clear-state --force-rebuild --limit 5
📝 Indexing State Management
State is stored in data/indexing_state.json:
{
"indexed_files": {
"C:/Users/.../file.pdf": {
"date_modified": "2025-11-08T08:21:14",
"doc_id": "ABC123XY",
"doi": "10.1000/xyz123"
}
},
"deduplicated_files": {
"10.1000/xyz123": ["file1.pdf", "file2.pdf"]
},
"statistics": {
"total_indexed": 150,
"last_full_reindex": "2025-11-08T08:00:00",
"last_incremental_update": "2025-11-08T08:21:14"
}
}
🔄 Recommended Workflow
-
First time setup:
python index_zotero_library.py -
Daily/weekly updates (before using MCP):
python update_zotero_index.py -
After adding many papers (>20):
python update_zotero_index.py -
If something breaks:
python index_zotero_library.py --clear-state --force-rebuild
🛠️ Troubleshooting
Issue: "No changes detected" but I added papers
Solution: The incremental indexer checks file modification times. If you moved files without modifying them, run:
python index_zotero_library.py --force-rebuild
Issue: Duplicate papers in results
Solution: Deduplication runs during indexing. Re-run with:
python index_zotero_library.py --force-rebuild
Issue: Search returns irrelevant results
Solution: Use reranking for better precision:
search_with_reranking(query="your query", top_k=5, use_metadata_boost=True)
📚 Advanced Configuration
Disable Features
# Disable deduplication
python index_zotero_library.py --no-dedup
# Disable incremental indexing (always reindex)
# Edit src/config.py:
ENABLE_INCREMENTAL_INDEXING=False
Custom Batch Size
python index_zotero_library.py --batch-size 100
Different Embedding Model
Edit src/config.py:
EMBEDDING_MODEL="Qwen/Qwen3-Embedding-0.6B" # State-of-the-art 2025 (default)
EMBEDDING_MODEL="jinaai/jina-embeddings-v3" # Excellent alternative
EMBEDDING_MODEL="intfloat/multilingual-e5-large" # Solid multilingual
EMBEDDING_MODEL="BAAI/bge-large-en-v1.5" # English only, high quality
🤝 Contributing
This implementation is inspired by the Zotero MCP project, adapted for local Zotero libraries with enhanced features.
📄 License
MIT License - See LICENSE file for details
🙏 Acknowledgments
- Zotero MCP for the indexing strategy inspiration
- sentence-transformers for embeddings and reranking
- ChromaDB for vector storage
- FastMCP for the MCP framework