Research MCP Server
A production-ready Model Context Protocol (MCP) server that provides AI assistants with comprehensive tools to search, fetch, and process academic papers from multiple sources. Built with parallel processing, authentication, and enterprise deployment capabilities.
Key Features
- Multi-Source Search: Query 250M+ papers across OpenAlex, Semantic Scholar, ArXiv, PDF Mirrors, and Unpaywall
- Parallel Processing: Background PDF processing with worker pools, job queues, and entity management
- Enterprise Security: JWT authentication, secure token validation, and production deployment support
- Smart Text Extraction: Docling-powered PDF processing with structured content extraction
- Production Ready: HTTP/stdio transport modes, systematic logging, and comprehensive monitoring
- Developer Friendly: Extensive testing suite, debugging tools, and modular architecture
Architecture Overview
┌────────────────────────────────────────────────────────┐
│ Research MCP Server                                    │
├────────────────────────────────────────────────────────┤
│ MCP Tools (search, fetch)                              │
├────────────────────────────────────────────────────────┤
│ Search Orchestrator                                    │
│ ├── Paper Availability Checker                         │
│ ├── Results Filtering & Ranking                        │
│ └── Background Prefetching                             │
├────────────────────────────────────────────────────────┤
│ Data Source Repositories                               │
│ ├── OpenAlex (250M papers, comprehensive metadata)     │
│ ├── Semantic Scholar (citation networks, AI-enhanced)  │
│ ├── ArXiv (preprints, latest research)                 │
│ ├── PDF Mirrors (configurable mirror access)           │
│ ├── Unpaywall (open access detection)                  │
│ └── Local Files (cached content)                       │
├────────────────────────────────────────────────────────┤
│ Parallel Processing Engine                             │
│ ├── Queue Manager (job routing & scheduling)           │
│ ├── Worker Pool (configurable concurrency)             │
│ ├── Entity Processors (PDF, metadata extraction)       │
│ └── Job Management (status tracking, error handling)   │
├────────────────────────────────────────────────────────┤
│ Authentication & Security                              │
│ ├── JWT Token Management                               │
│ ├── API Key Validation                                 │
│ └── Secure Download Paths                              │
├────────────────────────────────────────────────────────┤
│ Storage & Database                                     │
│ ├── SQLite Database (papers, jobs, processing status)  │
│ ├── File System Cache (PDFs, extracted text)           │
│ └── Entity Management (automatic CRUD operations)      │
└────────────────────────────────────────────────────────┘
Quick Start
For Claude Code users (recommended):
# Clone and install to Claude Code
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
pip install -r requirements.txt
fastmcp install claude-code research_server.py
For local development:
# Clone and run
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your configuration (see Configuration section below)
python research_server.py
Configuration
Required Environment Variables
Copy the environment template and configure these required variables:
cp .env.example .env
Edit .env with the following required settings:
# Paper source configuration
MIRROR_URLS=https://your-mirror1.com,https://your-mirror2.com
OPENALEX_EMAIL=your.email@example.com
S2_API_KEY=your_semantic_scholar_api_key
# Storage paths
MCP_STORAGE_BASE=~/mcp-storage
DATABASE_PATH=~/mcp-storage/papers.db
Optional Configuration
# Search behavior
SEARCH_RESULTS=40
SEARCH_RESULTS_MULTIPLIER=2
PREFETCH_TOP_N=5
# Production deployment
ENVIRONMENT=prod
MCP_HOST=127.0.0.1
MCP_PORT=8000
MCP_PATH=/mcp
# Authentication (production)
JWT_SECRET=your_generated_secret
JWT_ALGORITHM=HS256
JWT_ISSUER=research-mcp
JWT_AUDIENCE=mcp-api
# Optional API keys for enhanced functionality
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-...
GOOGLE_API_KEY=...
PERPLEXITY_API_KEY=...
Data Sources
The server integrates with multiple academic repositories, each with specialized capabilities:
OpenAlex Repository
- Capabilities: Search, metadata, PDF URLs
- Coverage: 250M+ scholarly papers across all disciplines
- Strengths: Comprehensive metadata, institutional affiliations, funding info
- API: Free, no authentication required (email recommended)
Semantic Scholar Repository
- Capabilities: Search, metadata, citation networks
- Coverage: 200M+ papers with AI-enhanced metadata
- Strengths: Citation graphs, influence metrics, semantic similarity
- API: Free tier (100 requests/5min), enhanced with API key
ArXiv Repository
- Capabilities: Search, metadata, direct PDF access
- Coverage: 2M+ preprints in STEM fields
- Strengths: Latest research, direct PDF downloads, version tracking
- API: Free, no authentication required
PDF Mirrors Repository
- Capabilities: Direct PDF downloads with configurable mirror fallback
- Coverage: Configurable paper access via mirror network
- Strengths: High availability, fast downloads, automatic failover
- Configuration: MIRROR_URLS environment variable
Unpaywall Repository
- Capabilities: Open access detection and free PDF discovery
- Coverage: 50M+ papers with open access status
- Strengths: Legal free access detection, institutional repositories
- API: Free, email required for identification
Local Files Repository
- Capabilities: Cached content, processed text, metadata
- Coverage: Previously downloaded and processed papers
- Strengths: Instant access, no network dependency, full-text search
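Conceptually, the orchestrator fans a query out to every repository at once and merges the deduplicated results. A minimal sketch of that pattern, using stub repositories (the class and method names here are illustrative, not the project's actual API):

```python
import asyncio

# Stub repositories standing in for the real OpenAlex/Semantic Scholar/ArXiv
# classes; names and signatures here are illustrative, not the project's API.
class StubRepository:
    def __init__(self, name, papers):
        self.name = name
        self.papers = papers

    async def search_papers(self, query, limit):
        # A real repository would make an HTTP call here.
        await asyncio.sleep(0)
        return [p for p in self.papers if query.lower() in p.lower()][:limit]

async def fan_out_search(repositories, query, limit):
    # Query every source concurrently; one failing source must not sink the rest.
    results = await asyncio.gather(
        *(repo.search_papers(query, limit) for repo in repositories),
        return_exceptions=True,
    )
    merged, seen = [], set()
    for repo, result in zip(repositories, results):
        if isinstance(result, Exception):
            continue  # skip sources that errored
        for title in result:
            if title not in seen:  # de-duplicate across sources
                seen.add(title)
                merged.append((repo.name, title))
    return merged

repos = [
    StubRepository("openalex", ["Deep learning survey", "Graph methods"]),
    StubRepository("arxiv", ["Deep learning survey", "Transformers"]),
]
papers = asyncio.run(fan_out_search(repos, "deep", limit=5))
```

The `return_exceptions=True` flag is the important design choice: a timeout from one source degrades coverage instead of failing the whole search.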
Parallel Processing System
The server includes a sophisticated parallel processing engine for handling PDF downloads and text extraction:
Core Components
Queue Manager (core/parallel_processing/parallel_queue_manager.py)
- Job scheduling and routing
- Queue priority management
- Worker coordination
- Status monitoring
Worker Pool (core/parallel_processing/worker_pool.py)
- Configurable concurrency levels
- Automatic worker scaling
- Error handling and retry logic
- Resource management
Entity Processors (core/parallel_processing/processors/)
- PDF text extraction
- Metadata normalization
- Content validation
- Result serialization
Job Management (core/parallel_processing/job_management/)
- Status tracking
- Progress monitoring
- Error reporting
- Completion callbacks
Processing Pipeline
1. Job Creation: Search results trigger background processing jobs
2. Queue Routing: Jobs routed to the appropriate processor based on content type
3. Worker Assignment: Available workers claim jobs from priority queues
4. Processing: PDF download, text extraction, and metadata enhancement
5. Storage: Results stored in database with full provenance tracking
6. Notification: Status updates propagated to requesting clients
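The pipeline above follows the standard queue/worker-pool pattern. A self-contained sketch of that pattern (the real engine in core/parallel_processing/ adds priorities, retries, and persistence):

```python
import asyncio

# Illustrative queue/worker-pool sketch; the real engine in
# core/parallel_processing/ adds priorities, retries, and persistence.
async def worker(queue, results):
    while True:
        job = await queue.get()
        try:
            # Stand-in for the PDF download + text extraction step.
            results[job["id"]] = {"status": "completed",
                                  "text": job["url"].upper()}
        finally:
            queue.task_done()

async def process_jobs(jobs, concurrency=3):
    queue = asyncio.Queue()
    results = {}
    for job in jobs:
        queue.put_nowait(job)
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()   # block until every job has been marked done
    for w in workers:
        w.cancel()       # shut the pool down once the queue drains
    return results

jobs = [{"id": i, "url": f"paper-{i}.pdf"} for i in range(5)]
results = asyncio.run(process_jobs(jobs))
```

`queue.join()` plus `task_done()` gives completion tracking for free; status reporting and error callbacks layer on top of the same loop.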
Configuration Options
# Search configuration
SEARCH_RESULTS=40
SEARCH_RESULTS_MULTIPLIER=2
PREFETCH_TOP_N=5
# Environment
ENVIRONMENT=dev
Authentication & Security
JWT Authentication
Production deployments use JWT tokens for secure API access:
# Generate secret key
python -m auth.jwt_token generate-secret
# Configure in .env
JWT_SECRET=your_generated_secret
JWT_ALGORITHM=HS256
JWT_ISSUER=research-mcp
JWT_AUDIENCE=mcp-api
Token Management
# Generate tokens (admin use)
from auth.jwt_token import create_access_token
token = create_access_token({"sub": "client_id"})
# Validate tokens (automatic)
# All API requests validated against JWT_SECRET
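Under the hood, an HS256 token is just a signed header/payload pair. A stdlib-only sketch of the signing and verification flow, with a dummy secret (in practice the project's auth.jwt_token module, or a library such as PyJWT, handles the edge cases this sketch skips):

```python
import base64
import hashlib
import hmac
import json
import time

# Dummy secret for illustration only; production uses JWT_SECRET from .env.
JWT_SECRET = b"dummy-secret-for-illustration"

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_access_token(claims, ttl_seconds=3600):
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {**claims, "iss": "research-mcp", "aud": "mcp-api",
               "exp": int(time.time()) + ttl_seconds}
    signing_input = (_b64url(json.dumps(header).encode()) + "."
                     + _b64url(json.dumps(payload).encode()))
    sig = hmac.new(JWT_SECRET, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)

def validate_token(token):
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(JWT_SECRET, signing_input.encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    payload_b64 = signing_input.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload

claims = validate_token(create_access_token({"sub": "client_id"}))
```

Note the constant-time `hmac.compare_digest`: comparing signatures with `==` would leak timing information.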
Security Features
- Path Validation: All downloads to secure sandbox directories
- URL Validation: PDF URL verification before download
- API Rate Limiting: Configurable per-source rate limits
- Error Sanitization: No sensitive data in error responses
API Reference
Search Tool
search(query: str, limit: int = 10) -> SearchResult
Parameters:
- query: Natural language search query
- limit: Maximum results to return (default: 10, max: 40)
Returns:
{
"papers": [
{
"id": "10.1038/nature12373",
"title": "Deep learning paper title",
"url": "https://doi.org/10.1038/nature12373",
"authors": ["Author One", "Author Two"],
"abstract": "Paper abstract...",
"publication_date": "2023-01-15",
"source": "openalex",
"pdf_available": true
}
],
"total_found": 1500,
"search_time": 0.85
}
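The pdf_available flag lets a client filter for immediately fetchable papers. A small sketch consuming the example payload above (the second entry is invented sample data, added only to show a paper without an available PDF):

```python
import json

# The example response above, embedded as static sample data; the second
# entry is invented here purely to show a paper without an available PDF.
response = json.loads("""
{
  "papers": [
    {"id": "10.1038/nature12373", "title": "Deep learning paper title",
     "source": "openalex", "pdf_available": true},
    {"id": "sample-0001", "title": "A paper without an open PDF",
     "source": "arxiv", "pdf_available": false}
  ],
  "total_found": 1500,
  "search_time": 0.85
}
""")

# Keep only papers whose PDF can be fetched immediately.
fetchable = [p["id"] for p in response["papers"] if p["pdf_available"]]
```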
Fetch Tool
fetch(identifier: str) -> Paper
Parameters:
- identifier: DOI, ArXiv ID, or paper URL
Returns:
{
"id": "10.1038/nature12373",
"title": "Deep learning paper title",
"text": "Full extracted text content...",
"url": "https://doi.org/10.1038/nature12373",
"metadata": {
"authors": ["Author One", "Author Two"],
"abstract": "Paper abstract...",
"publication_date": "2023-01-15",
"citation_count": 1250,
"processing_status": "completed"
}
}
Configuration Validation
The server validates configuration at startup and provides detailed error messages for missing or invalid settings.
Required for basic operation:
- MIRROR_URLS: At least one mirror URL for PDF access
- OPENALEX_EMAIL: Email for OpenAlex API (recommended for politeness)
- S2_API_KEY: Semantic Scholar API key for enhanced rate limits
- MCP_STORAGE_BASE and DATABASE_PATH: Storage locations
Optional for enhanced features:
- API keys (OpenAI, Anthropic, etc.) enable additional functionality
- JWT settings are required for production HTTP mode
- Search parameters can be tuned for performance
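A startup check of this kind can be sketched in a few lines; the real validation lives in core/config.py and covers more cases, but the variable names below match the configuration sections above:

```python
import os

# Minimal sketch of a startup check; the real validation in core/config.py
# covers more cases. Variable names match the configuration section above.
REQUIRED_VARS = ["MIRROR_URLS", "OPENALEX_EMAIL", "S2_API_KEY",
                 "MCP_STORAGE_BASE", "DATABASE_PATH"]

def validate_config(env):
    errors = [f"{name} is not set"
              for name in REQUIRED_VARS if not env.get(name)]
    # JWT settings only become mandatory in production HTTP mode.
    if env.get("ENVIRONMENT") == "prod" and not env.get("JWT_SECRET"):
        errors.append("JWT_SECRET is required when ENVIRONMENT=prod")
    return errors

# In the server this would be called as validate_config(os.environ);
# a deliberately incomplete dict here shows what the error report looks like.
errors = validate_config({"MIRROR_URLS": "https://mirror.example.com",
                          "ENVIRONMENT": "prod"})
```

Collecting every error before reporting, rather than failing on the first one, saves a fix-restart-fix cycle during setup.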
Development Guide
Project Structure
research-mcp/
├── research_server.py        # Main MCP server (recommended)
├── core/                     # Core infrastructure
│   ├── config.py             # Configuration management
│   ├── requests.py           # HTTP request utilities
│   ├── db/                   # Database schema and migrations
│   ├── models/               # Data models and enums
│   └── parallel_processing/  # Parallel processing engine
├── tools/                    # Data source implementations
│   ├── searcher.py           # Main search orchestrator
│   ├── base/                 # Base classes and DTOs
│   ├── openalex/             # OpenAlex integration
│   ├── s2/                   # Semantic Scholar integration
│   ├── arxiv/                # ArXiv integration
│   ├── pdf_mirrors/          # PDF mirror network
│   ├── unpaywall/            # Unpaywall integration
│   ├── local_files/          # Local cache management
│   └── paper_fetcher/        # PDF processing pipeline
├── auth/                     # Authentication providers
├── deploy/                   # Production deployment tools
└── tests/                    # Unit and integration tests
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=. --cov-report=html
Adding New Data Sources
- Create repository class:
# tools/newsource/newsource.py
from tools.base import PaperRepositoryBase, PaperMetadata, SearchResult

class NewSourceRepository(PaperRepositoryBase):
    has_search = True
    has_metadata = True

    async def search_papers(self, query: str, limit: int) -> SearchResult:
        # Implementation here
        pass

    async def get_paper_metadata(self, identifier: str) -> PaperMetadata:
        # Implementation here
        pass
- Register with searcher:
# tools/searcher.py
from tools.newsource import NewSourceRepository
# Add to repository list
repositories = [
    OpenAlexRepository(),
    NewSourceRepository(),  # Add here
    # ... other repositories
]
- Add configuration:
# core/config.py
NEWSOURCE_API_KEY = os.getenv("NEWSOURCE_API_KEY")
Adding New Processors
- Create processor class:
# core/parallel_processing/processors/new_processor.py
from .entity_processor import EntityProcessor
class NewProcessor(EntityProcessor):
    def get_entity_model(self):
        class NewEntity(BaseModel):
            entity_id: str
            data: str
            status: str = "pending"
        return NewEntity

    def process(self, job):
        # Processing logic here
        return {"status": "success", "result": "processed"}
- Register processor:
# core/parallel_processing/queue_router.py
from .processors.new_processor import NewProcessor
processors = {
    "pdf": PDFProcessor(),
    "new_type": NewProcessor(),  # Add here
}
Debugging Tools
Configuration validation:
python research_server.py
# Configuration is validated on startup
Check diagnostics:
# View service logs (if deployed)
sudo journalctl -u research-mcp -n 50
# Test database connection
python -c "from core.config import DATABASE_PATH; print(DATABASE_PATH)"
Production Deployment
Environment Setup
- Copy server configuration:
cp deploy/.env.server.example deploy/.env.server
# Edit with production values
- Generate authentication secrets:
python -m auth.jwt_token generate-secret
# Add to .env.server as JWT_SECRET
- Configure storage paths:
# In .env.server
MCP_STORAGE_BASE=/opt/mcp-data
DATABASE_PATH=/opt/mcp-data/papers.db
ENVIRONMENT=prod
HTTP Transport Mode
Production deployments use HTTP transport instead of stdio:
# Server configuration
MCP_HOST=0.0.0.0
MCP_PORT=8000
MCP_PATH=/mcp
STATELESS_HTTP=true
# Start server
ENVIRONMENT=prod python research_server.py
Systemd Service
# Deploy with systemd service
cd deploy/
python launch.py --create-service
# Service management
sudo systemctl start research-mcp
sudo systemctl enable research-mcp
sudo systemctl status research-mcp
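For reference, the generated service typically looks something like the sketch below; the paths, user, and virtualenv location are assumptions, since launch.py --create-service writes the real unit file:

```ini
# Hypothetical /etc/systemd/system/research-mcp.service sketch; the actual
# file is generated by deploy/launch.py and may differ.
[Unit]
Description=Research MCP Server
After=network.target

[Service]
Type=simple
User=mcp
WorkingDirectory=/opt/research-mcp
EnvironmentFile=/opt/research-mcp/deploy/.env.server
ExecStart=/opt/research-mcp/.venv/bin/python research_server.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```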
Monitoring & Logging
Service logs:
sudo journalctl -u research-mcp -f
Application logs:
tail -f /opt/mcp-data/logs/research-mcp.log
Performance monitoring:
python deploy/post_deploy_check.py
Scaling Considerations
Search performance tuning:
# Adjust search parameters
SEARCH_RESULTS=60
SEARCH_RESULTS_MULTIPLIER=3
PREFETCH_TOP_N=10
Storage optimization:
# Use SSD storage for better performance
MCP_STORAGE_BASE=/fast-storage/mcp-data
DATABASE_PATH=/fast-storage/mcp-data/papers.db
Troubleshooting
Common Issues
Configuration validation failures:
# Check required environment variables
python research_server.py
# Server will validate config on startup and report issues
PDF download failures:
- Check network connectivity to configured mirror URLs
- Verify MIRROR_URLS environment variable is properly set
- Check application logs for specific error messages
Authentication errors:
# Validate JWT configuration
python -m auth.jwt_token generate-secret
Database issues:
- Check DATABASE_PATH is accessible and writable
- Verify MCP_STORAGE_BASE directory exists
- Review application logs for SQLite errors
Performance Issues
Slow searches:
- Check OpenAlex API response times
- Verify network connectivity to all sources
- Increase SEARCH_RESULTS_MULTIPLIER value
Processing bottlenecks:
- Check disk space in MCP_STORAGE_BASE
- Monitor PDF processing queue in logs
- Verify mirror connectivity
Memory usage:
- Monitor PDF processing memory usage in system logs
- Reduce PREFETCH_TOP_N if needed
- Check for PDF processing errors in application logs
Getting Help
- Check logs: Review application and system logs for errors
- Validate config: Run configuration validation tools
- Test connectivity: Verify API access and network connectivity
- Monitor resources: Check CPU, memory, and disk usage
- Review documentation: Check deployment and configuration guides
Contributing
Development Setup
# Fork repository
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
# Create development environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Copy configuration
cp .env.example .env
# Configure for development
Testing Guidelines
- Write unit tests for new functionality
- Include integration tests for external APIs
- Test error handling and edge cases
- Verify configuration validation
- Document new features and APIs
Code Standards
- Follow Python type hints
- Use async/await for I/O operations
- Include docstrings for public APIs
- Handle errors gracefully
- Log important operations
- Maintain backward compatibility
Built for the research community. Empowering AI assistants with comprehensive academic paper access.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 Benjamin Hall