Research MCP Server
A production-ready Model Context Protocol (MCP) server that provides AI assistants with comprehensive tools to search, fetch, and process academic papers from multiple sources. Built with parallel processing, authentication, and enterprise deployment capabilities.
Key Features
- Multi-Source Search: Query 250M+ papers across OpenAlex, Semantic Scholar, ArXiv, PDF Mirrors, and Unpaywall
- Parallel Processing: Background PDF processing with worker pools, job queues, and entity management
- Enterprise Security: JWT authentication, secure token validation, and production deployment support
- Smart Text Extraction: Docling-powered PDF processing with structured content extraction
- Production Ready: HTTP/stdio transport modes, systematic logging, and comprehensive monitoring
- Developer Friendly: Extensive testing suite, debugging tools, and modular architecture
Architecture Overview
┌────────────────────────────────────────────────────────┐
│ Research MCP Server                                    │
├────────────────────────────────────────────────────────┤
│ MCP Tools (search, fetch)                              │
├────────────────────────────────────────────────────────┤
│ Search Orchestrator                                    │
│ ├── Paper Availability Checker                         │
│ ├── Results Filtering & Ranking                        │
│ └── Background Prefetching                             │
├────────────────────────────────────────────────────────┤
│ Data Source Repositories                               │
│ ├── OpenAlex (250M papers, comprehensive metadata)     │
│ ├── Semantic Scholar (citation networks, AI-enhanced)  │
│ ├── ArXiv (preprints, latest research)                 │
│ ├── PDF Mirrors (configurable mirror access)           │
│ ├── Unpaywall (open access detection)                  │
│ └── Local Files (cached content)                       │
├────────────────────────────────────────────────────────┤
│ Parallel Processing Engine                             │
│ ├── Queue Manager (job routing & scheduling)           │
│ ├── Worker Pool (configurable concurrency)             │
│ ├── Entity Processors (PDF, metadata extraction)       │
│ └── Job Management (status tracking, error handling)   │
├────────────────────────────────────────────────────────┤
│ Authentication & Security                              │
│ ├── JWT Token Management                               │
│ ├── API Key Validation                                 │
│ └── Secure Download Paths                              │
├────────────────────────────────────────────────────────┤
│ Storage & Database                                     │
│ ├── SQLite Database (papers, jobs, processing status)  │
│ ├── File System Cache (PDFs, extracted text)           │
│ └── Entity Management (automatic CRUD operations)      │
└────────────────────────────────────────────────────────┘
Quick Start
For Claude Code users (recommended):
# Clone and install to Claude Code
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
pip install -r requirements.txt
fastmcp install claude-code research_server.py
For local development:
# Clone and run
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your configuration (see Configuration section below)
python research_server.py
Configuration
Required Environment Variables
Copy the environment template and configure these required variables:
cp .env.example .env
Edit .env with the following required settings:
# Paper source configuration
MIRROR_URLS=https://your-mirror1.com,https://your-mirror2.com
OPENALEX_EMAIL=your.email@example.com
S2_API_KEY=your_semantic_scholar_api_key
# Storage paths
MCP_STORAGE_BASE=~/mcp-storage
DATABASE_PATH=~/mcp-storage/papers.db
Optional Configuration
# Search behavior
SEARCH_RESULTS=40
SEARCH_RESULTS_MULTIPLIER=2
PREFETCH_TOP_N=5
# Production deployment
ENVIRONMENT=prod
MCP_HOST=127.0.0.1
MCP_PORT=8000
MCP_PATH=/mcp
# Authentication (production)
JWT_SECRET=your_generated_secret
JWT_ALGORITHM=HS256
JWT_ISSUER=research-mcp
JWT_AUDIENCE=mcp-api
# Optional API keys for enhanced functionality
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-...
GOOGLE_API_KEY=...
PERPLEXITY_API_KEY=...
Data Sources
The server integrates with multiple academic repositories, each with specialized capabilities:
OpenAlex Repository
- Capabilities: Search, metadata, PDF URLs
- Coverage: 250M+ scholarly papers across all disciplines
- Strengths: Comprehensive metadata, institutional affiliations, funding info
- API: Free, no authentication required (email recommended)
Semantic Scholar Repository
- Capabilities: Search, metadata, citation networks
- Coverage: 200M+ papers with AI-enhanced metadata
- Strengths: Citation graphs, influence metrics, semantic similarity
- API: Free tier (100 requests/5min), enhanced with API key
ArXiv Repository
- Capabilities: Search, metadata, direct PDF access
- Coverage: 2M+ preprints in STEM fields
- Strengths: Latest research, direct PDF downloads, version tracking
- API: Free, no authentication required
PDF Mirrors Repository
- Capabilities: Direct PDF downloads with configurable mirror fallback
- Coverage: Configurable paper access via mirror network
- Strengths: High availability, fast downloads, automatic failover
- Configuration: MIRROR_URLS environment variable
Unpaywall Repository
- Capabilities: Open access detection and free PDF discovery
- Coverage: 50M+ papers with open access status
- Strengths: Legal free access detection, institutional repositories
- API: Free, email required for identification
Local Files Repository
- Capabilities: Cached content, processed text, metadata
- Coverage: Previously downloaded and processed papers
- Strengths: Instant access, no network dependency, full-text search
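Conceptually, the orchestrator fans a query out to every repository at once and merges the deduplicated results. A minimal sketch of that pattern, using stub repositories (the class and method names here are illustrative, not the project's actual API):

```python
import asyncio

# Stub repositories standing in for the real OpenAlex/Semantic Scholar/ArXiv
# classes; names and signatures here are illustrative, not the project's API.
class StubRepository:
    def __init__(self, name, papers):
        self.name = name
        self.papers = papers

    async def search_papers(self, query, limit):
        # A real repository would make an HTTP call here.
        await asyncio.sleep(0)
        return [p for p in self.papers if query.lower() in p.lower()][:limit]

async def fan_out_search(repositories, query, limit):
    # Query every source concurrently; one failing source must not sink the rest.
    results = await asyncio.gather(
        *(repo.search_papers(query, limit) for repo in repositories),
        return_exceptions=True,
    )
    merged, seen = [], set()
    for repo, result in zip(repositories, results):
        if isinstance(result, Exception):
            continue  # skip sources that errored
        for title in result:
            if title not in seen:  # de-duplicate across sources
                seen.add(title)
                merged.append((repo.name, title))
    return merged

repos = [
    StubRepository("openalex", ["Deep learning survey", "Graph methods"]),
    StubRepository("arxiv", ["Deep learning survey", "Transformers"]),
]
papers = asyncio.run(fan_out_search(repos, "deep", limit=5))
```

The `return_exceptions=True` flag is the important design choice: a timeout from one source degrades coverage instead of failing the whole search.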
Parallel Processing System
The server includes a sophisticated parallel processing engine for handling PDF downloads and text extraction:
Core Components
Queue Manager (core/parallel_processing/parallel_queue_manager.py)
- Job scheduling and routing
- Queue priority management
- Worker coordination
- Status monitoring
Worker Pool (core/parallel_processing/worker_pool.py)
- Configurable concurrency levels
- Automatic worker scaling
- Error handling and retry logic
- Resource management
Entity Processors (core/parallel_processing/processors/)
- PDF text extraction
- Metadata normalization
- Content validation
- Result serialization
Job Management (core/parallel_processing/job_management/)
- Status tracking
- Progress monitoring
- Error reporting
- Completion callbacks
Processing Pipeline
1. Job Creation: Search results trigger background processing jobs
2. Queue Routing: Jobs routed to the appropriate processor based on content type
3. Worker Assignment: Available workers claim jobs from priority queues
4. Processing: PDF download, text extraction, and metadata enhancement
5. Storage: Results stored in database with full provenance tracking
6. Notification: Status updates propagated to requesting clients
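The pipeline above follows the standard queue/worker-pool pattern. A self-contained sketch of that pattern (the real engine in core/parallel_processing/ adds priorities, retries, and persistence):

```python
import asyncio

# Illustrative queue/worker-pool sketch; the real engine in
# core/parallel_processing/ adds priorities, retries, and persistence.
async def worker(queue, results):
    while True:
        job = await queue.get()
        try:
            # Stand-in for the PDF download + text extraction step.
            results[job["id"]] = {"status": "completed",
                                  "text": job["url"].upper()}
        finally:
            queue.task_done()

async def process_jobs(jobs, concurrency=3):
    queue = asyncio.Queue()
    results = {}
    for job in jobs:
        queue.put_nowait(job)
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(concurrency)]
    await queue.join()   # block until every job has been marked done
    for w in workers:
        w.cancel()       # shut the pool down once the queue drains
    return results

jobs = [{"id": i, "url": f"paper-{i}.pdf"} for i in range(5)]
results = asyncio.run(process_jobs(jobs))
```

`queue.join()` plus `task_done()` gives completion tracking for free; status reporting and error callbacks layer on top of the same loop.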
Configuration Options
# Search configuration
SEARCH_RESULTS=40
SEARCH_RESULTS_MULTIPLIER=2
PREFETCH_TOP_N=5
# Environment
ENVIRONMENT=dev
Authentication & Security
JWT Authentication
Production deployments use JWT tokens for secure API access:
# Generate secret key
python -m auth.jwt_token generate-secret
# Configure in .env
JWT_SECRET=your_generated_secret
JWT_ALGORITHM=HS256
JWT_ISSUER=research-mcp
JWT_AUDIENCE=mcp-api
Token Management
# Generate tokens (admin use)
from auth.jwt_token import create_access_token
token = create_access_token({"sub": "client_id"})
# Validate tokens (automatic)
# All API requests validated against JWT_SECRET
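Under the hood, an HS256 token is just a signed header/payload pair. A stdlib-only sketch of the signing and verification flow, with a dummy secret (in practice the project's auth.jwt_token module, or a library such as PyJWT, handles the edge cases this sketch skips):

```python
import base64
import hashlib
import hmac
import json
import time

# Dummy secret for illustration only; production uses JWT_SECRET from .env.
JWT_SECRET = b"dummy-secret-for-illustration"

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def create_access_token(claims, ttl_seconds=3600):
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {**claims, "iss": "research-mcp", "aud": "mcp-api",
               "exp": int(time.time()) + ttl_seconds}
    signing_input = (_b64url(json.dumps(header).encode()) + "."
                     + _b64url(json.dumps(payload).encode()))
    sig = hmac.new(JWT_SECRET, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)

def validate_token(token):
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(JWT_SECRET, signing_input.encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    payload_b64 = signing_input.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload

claims = validate_token(create_access_token({"sub": "client_id"}))
```

Note the constant-time `hmac.compare_digest`: comparing signatures with `==` would leak timing information.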
Security Features
- Path Validation: All downloads to secure sandbox directories
- URL Validation: PDF URL verification before download
- API Rate Limiting: Configurable per-source rate limits
- Error Sanitization: No sensitive data in error responses
API Reference
Search Tool
search(query: str, limit: int = 10) -> SearchResult
Parameters:
- query: Natural language search query
- limit: Maximum results to return (default: 10, max: 40)
Returns:
{
"papers": [
{
"id": "10.1038/nature12373",
"title": "Deep learning paper title",
"url": "https://doi.org/10.1038/nature12373",
"authors": ["Author One", "Author Two"],
"abstract": "Paper abstract...",
"publication_date": "2023-01-15",
"source": "openalex",
"pdf_available": true
}
],
"total_found": 1500,
"search_time": 0.85
}
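The pdf_available flag lets a client filter for immediately fetchable papers. A small sketch consuming the example payload above (the second entry is invented sample data, added only to show a paper without an available PDF):

```python
import json

# The example response above, embedded as static sample data; the second
# entry is invented here purely to show a paper without an available PDF.
response = json.loads("""
{
  "papers": [
    {"id": "10.1038/nature12373", "title": "Deep learning paper title",
     "source": "openalex", "pdf_available": true},
    {"id": "sample-0001", "title": "A paper without an open PDF",
     "source": "arxiv", "pdf_available": false}
  ],
  "total_found": 1500,
  "search_time": 0.85
}
""")

# Keep only papers whose PDF can be fetched immediately.
fetchable = [p["id"] for p in response["papers"] if p["pdf_available"]]
```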
Fetch Tool
fetch(identifier: str) -> Paper
Parameters:
- identifier: DOI, ArXiv ID, or paper URL
Returns:
{
"id": "10.1038/nature12373",
"title": "Deep learning paper title",
"text": "Full extracted text content...",
"url": "https://doi.org/10.1038/nature12373",
"metadata": {
"authors": ["Author One", "Author Two"],
"abstract": "Paper abstract...",
"publication_date": "2023-01-15",
"citation_count": 1250,
"processing_status": "completed"
}
}
Configuration Validation
The server validates configuration at startup and provides detailed error messages for missing or invalid settings.
Required for basic operation:
- MIRROR_URLS: At least one mirror URL for PDF access
- OPENALEX_EMAIL: Email for OpenAlex API (recommended for politeness)
- S2_API_KEY: Semantic Scholar API key for enhanced rate limits
- MCP_STORAGE_BASE and DATABASE_PATH: Storage locations
Optional for enhanced features:
- API keys (OpenAI, Anthropic, etc.) enable additional functionality
- JWT settings are required for production HTTP mode
- Search parameters can be tuned for performance
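A startup check of this kind can be sketched in a few lines; the real validation lives in core/config.py and covers more cases, but the variable names below match the configuration sections above:

```python
import os

# Minimal sketch of a startup check; the real validation in core/config.py
# covers more cases. Variable names match the configuration section above.
REQUIRED_VARS = ["MIRROR_URLS", "OPENALEX_EMAIL", "S2_API_KEY",
                 "MCP_STORAGE_BASE", "DATABASE_PATH"]

def validate_config(env):
    errors = [f"{name} is not set"
              for name in REQUIRED_VARS if not env.get(name)]
    # JWT settings only become mandatory in production HTTP mode.
    if env.get("ENVIRONMENT") == "prod" and not env.get("JWT_SECRET"):
        errors.append("JWT_SECRET is required when ENVIRONMENT=prod")
    return errors

# In the server this would be called as validate_config(os.environ);
# a deliberately incomplete dict here shows what the error report looks like.
errors = validate_config({"MIRROR_URLS": "https://mirror.example.com",
                          "ENVIRONMENT": "prod"})
```

Collecting every error before reporting, rather than failing on the first one, saves a fix-restart-fix cycle during setup.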
Development Guide
Project Structure
research-mcp/
├── research_server.py        # Main MCP server (recommended)
├── core/                     # Core infrastructure
│   ├── config.py             # Configuration management
│   ├── requests.py           # HTTP request utilities
│   ├── db/                   # Database schema and migrations
│   ├── models/               # Data models and enums
│   └── parallel_processing/  # Parallel processing engine
├── tools/                    # Data source implementations
│   ├── searcher.py           # Main search orchestrator
│   ├── base/                 # Base classes and DTOs
│   ├── openalex/             # OpenAlex integration
│   ├── s2/                   # Semantic Scholar integration
│   ├── arxiv/                # ArXiv integration
│   ├── pdf_mirrors/          # PDF mirror network
│   ├── unpaywall/            # Unpaywall integration
│   ├── local_files/          # Local cache management
│   └── paper_fetcher/        # PDF processing pipeline
├── auth/                     # Authentication providers
├── deploy/                   # Production deployment tools
└── tests/                    # Unit and integration tests
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=. --cov-report=html
Adding New Data Sources
- Create repository class:
# tools/newsource/newsource.py
from tools.base import PaperRepositoryBase, PaperMetadata, SearchResult

class NewSourceRepository(PaperRepositoryBase):
    has_search = True
    has_metadata = True

    async def search_papers(self, query: str, limit: int) -> SearchResult:
        # Implementation here
        pass

    async def get_paper_metadata(self, identifier: str) -> PaperMetadata:
        # Implementation here
        pass
- Register with searcher:
# tools/searcher.py
from tools.newsource import NewSourceRepository
# Add to repository list
repositories = [
    OpenAlexRepository(),
    NewSourceRepository(),  # Add here
    # ... other repositories
]
- Add configuration:
# core/config.py
NEWSOURCE_API_KEY = os.getenv("NEWSOURCE_API_KEY")
Adding New Processors
- Create processor class:
# core/parallel_processing/processors/new_processor.py
from .entity_processor import EntityProcessor
class NewProcessor(EntityProcessor):
    def get_entity_model(self):
        class NewEntity(BaseModel):
            entity_id: str
            data: str
            status: str = "pending"
        return NewEntity

    def process(self, job):
        # Processing logic here
        return {"status": "success", "result": "processed"}
- Register processor:
# core/parallel_processing/queue_router.py
from .processors.new_processor import NewProcessor
processors = {
    "pdf": PDFProcessor(),
    "new_type": NewProcessor(),  # Add here
}
Debugging Tools
Configuration validation:
python research_server.py
# Configuration is validated on startup
Check diagnostics:
# View service logs (if deployed)
sudo journalctl -u research-mcp -n 50
# Test database connection
python -c "from core.config import DATABASE_PATH; print(DATABASE_PATH)"
Production Deployment
Environment Setup
- Copy server configuration:
cp deploy/.env.server.example deploy/.env.server
# Edit with production values
- Generate authentication secrets:
python -m auth.jwt_token generate-secret
# Add to .env.server as JWT_SECRET
- Configure storage paths:
# In .env.server
MCP_STORAGE_BASE=/opt/mcp-data
DATABASE_PATH=/opt/mcp-data/papers.db
ENVIRONMENT=prod
HTTP Transport Mode
Production deployments use HTTP transport instead of stdio:
# Server configuration
MCP_HOST=0.0.0.0
MCP_PORT=8000
MCP_PATH=/mcp
STATELESS_HTTP=true
# Start server
ENVIRONMENT=prod python research_server.py
Systemd Service
# Deploy with systemd service
cd deploy/
python launch.py --create-service
# Service management
sudo systemctl start research-mcp
sudo systemctl enable research-mcp
sudo systemctl status research-mcp
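For reference, the generated service typically looks something like the sketch below; the paths, user, and virtualenv location are assumptions, since launch.py --create-service writes the real unit file:

```ini
# Hypothetical /etc/systemd/system/research-mcp.service sketch; the actual
# file is generated by deploy/launch.py and may differ.
[Unit]
Description=Research MCP Server
After=network.target

[Service]
Type=simple
User=mcp
WorkingDirectory=/opt/research-mcp
EnvironmentFile=/opt/research-mcp/deploy/.env.server
ExecStart=/opt/research-mcp/.venv/bin/python research_server.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
```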
Monitoring & Logging
Service logs:
sudo journalctl -u research-mcp -f
Application logs:
tail -f /opt/mcp-data/logs/research-mcp.log
Performance monitoring:
python deploy/post_deploy_check.py
Scaling Considerations
Search performance tuning:
# Adjust search parameters
SEARCH_RESULTS=60
SEARCH_RESULTS_MULTIPLIER=3
PREFETCH_TOP_N=10
Storage optimization:
# Use SSD storage for better performance
MCP_STORAGE_BASE=/fast-storage/mcp-data
DATABASE_PATH=/fast-storage/mcp-data/papers.db
Troubleshooting
Common Issues
Configuration validation failures:
# Check required environment variables
python research_server.py
# Server will validate config on startup and report issues
PDF download failures:
- Check network connectivity to configured mirror URLs
- Verify MIRROR_URLS environment variable is properly set
- Check application logs for specific error messages
Authentication errors:
# Validate JWT configuration
python -m auth.jwt_token generate-secret
Database issues:
- Check DATABASE_PATH is accessible and writable
- Verify MCP_STORAGE_BASE directory exists
- Review application logs for SQLite errors
Performance Issues
Slow searches:
- Check OpenAlex API response times
- Verify network connectivity to all sources
- Increase SEARCH_RESULTS_MULTIPLIER value
Processing bottlenecks:
- Check disk space in MCP_STORAGE_BASE
- Monitor PDF processing queue in logs
- Verify mirror connectivity
Memory usage:
- Monitor PDF processing memory usage in system logs
- Reduce PREFETCH_TOP_N if needed
- Check for PDF processing errors in application logs
Getting Help
- Check logs: Review application and system logs for errors
- Validate config: Run configuration validation tools
- Test connectivity: Verify API access and network connectivity
- Monitor resources: Check CPU, memory, and disk usage
- Review documentation: Check deployment and configuration guides
Contributing
Development Setup
# Fork repository
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
# Create development environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Copy configuration
cp .env.example .env
# Configure for development
Testing Guidelines
- Write unit tests for new functionality
- Include integration tests for external APIs
- Test error handling and edge cases
- Verify configuration validation
- Document new features and APIs
Code Standards
- Follow Python type hints
- Use async/await for I/O operations
- Include docstrings for public APIs
- Handle errors gracefully
- Log important operations
- Maintain backward compatibility
Built for the research community. Empowering AI assistants with comprehensive academic paper access.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2025 Benjamin Hall