Research MCP Server

License: MIT | Python 3.12+

A production-ready Model Context Protocol (MCP) server that provides AI assistants with comprehensive tools to search, fetch, and process academic papers from multiple sources. Built with parallel processing, authentication, and enterprise deployment capabilities.

Key Features

  • Multi-Source Search: Query 250M+ papers across OpenAlex, Semantic Scholar, ArXiv, PDF Mirrors, and Unpaywall
  • Parallel Processing: Background PDF processing with worker pools, job queues, and entity management
  • Enterprise Security: JWT authentication, secure token validation, and production deployment support
  • Smart Text Extraction: Docling-powered PDF processing with structured content extraction
  • Production Ready: HTTP/stdio transport modes, structured logging, and comprehensive monitoring
  • Developer Friendly: Extensive testing suite, debugging tools, and modular architecture

Architecture Overview

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                     Research MCP Server                        │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  MCP Tools (search, fetch)                                     │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  Search Orchestrator                                           │
│  ā”œā”€ā”€ Paper Availability Checker                                │
│  ā”œā”€ā”€ Results Filtering & Ranking                               │
│  └── Background Prefetching                                    │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  Data Source Repositories                                      │
│  ā”œā”€ā”€ OpenAlex (250M papers, comprehensive metadata)            │
│  ā”œā”€ā”€ Semantic Scholar (citation networks, AI-enhanced)         │
│  ā”œā”€ā”€ ArXiv (preprints, latest research)                        │
│  ā”œā”€ā”€ PDF Mirrors (configurable mirror access)                  │
│  ā”œā”€ā”€ Unpaywall (open access detection)                         │
│  └── Local Files (cached content)                              │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  Parallel Processing Engine                                    │
│  ā”œā”€ā”€ Queue Manager (job routing & scheduling)                  │
│  ā”œā”€ā”€ Worker Pool (configurable concurrency)                    │
│  ā”œā”€ā”€ Entity Processors (PDF, metadata extraction)              │
│  └── Job Management (status tracking, error handling)          │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  Authentication & Security                                     │
│  ā”œā”€ā”€ JWT Token Management                                      │
│  ā”œā”€ā”€ API Key Validation                                        │
│  └── Secure Download Paths                                     │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│  Storage & Database                                            │
│  ā”œā”€ā”€ SQLite Database (papers, jobs, processing status)         │
│  ā”œā”€ā”€ File System Cache (PDFs, extracted text)                  │
│  └── Entity Management (automatic CRUD operations)             │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Quick Start

For Claude Code users (recommended):

# Clone and install to Claude Code
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
pip install -r requirements.txt
fastmcp install claude-code research_server.py

For local development:

# Clone and run
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your configuration (see Configuration section below)
python research_server.py

Configuration

Required Environment Variables

Copy the environment template and configure these required variables:

cp .env.example .env

Edit .env with the following required settings:

# Paper source configuration
MIRROR_URLS=https://your-mirror1.com,https://your-mirror2.com
OPENALEX_EMAIL=your.email@example.com
S2_API_KEY=your_semantic_scholar_api_key

# Storage paths
MCP_STORAGE_BASE=~/mcp-storage
DATABASE_PATH=~/mcp-storage/papers.db

Optional Configuration

# Search behavior
SEARCH_RESULTS=40
SEARCH_RESULTS_MULTIPLIER=2
PREFETCH_TOP_N=5

# Production deployment
ENVIRONMENT=prod
MCP_HOST=127.0.0.1
MCP_PORT=8000
MCP_PATH=/mcp

# Authentication (production)
JWT_SECRET=your_generated_secret
JWT_ALGORITHM=HS256
JWT_ISSUER=research-mcp
JWT_AUDIENCE=mcp-api

# Optional API keys for enhanced functionality
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-...
GOOGLE_API_KEY=...
PERPLEXITY_API_KEY=...

Data Sources

The server integrates with multiple academic repositories, each with specialized capabilities:

OpenAlex Repository

  • Capabilities: Search, metadata, PDF URLs
  • Coverage: 250M+ scholarly papers across all disciplines
  • Strengths: Comprehensive metadata, institutional affiliations, funding info
  • API: Free, no authentication required (email recommended)

Semantic Scholar Repository

  • Capabilities: Search, metadata, citation networks
  • Coverage: 200M+ papers with AI-enhanced metadata
  • Strengths: Citation graphs, influence metrics, semantic similarity
  • API: Free tier (100 requests/5min), enhanced with API key

ArXiv Repository

  • Capabilities: Search, metadata, direct PDF access
  • Coverage: 2M+ preprints in STEM fields
  • Strengths: Latest research, direct PDF downloads, version tracking
  • API: Free, no authentication required

PDF Mirrors Repository

  • Capabilities: Direct PDF downloads with configurable mirror fallback
  • Coverage: Configurable paper access via mirror network
  • Strengths: High availability, fast downloads, automatic failover
  • Configuration: MIRROR_URLS environment variable
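
The failover behavior is simple to picture: try each configured mirror in order and return the first PDF that downloads cleanly. The sketch below is illustrative only, not the repository's actual implementation (which lives in tools/pdf_mirrors/), and the URL pattern is a hypothetical assumption:

import os
import httpx

async def fetch_pdf_with_failover(paper_id: str) -> bytes:
    """Try each mirror from MIRROR_URLS in order; return the first good PDF."""
    mirrors = [u.strip() for u in os.getenv("MIRROR_URLS", "").split(",") if u.strip()]
    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
        for base in mirrors:
            try:
                resp = await client.get(f"{base}/{paper_id}.pdf")  # hypothetical URL scheme
                if resp.status_code == 200:
                    return resp.content
            except httpx.HTTPError:
                continue  # automatic failover: move on to the next mirror
    raise RuntimeError(f"No configured mirror could serve {paper_id}")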

Unpaywall Repository

  • Capabilities: Open access detection and free PDF discovery
  • Coverage: 50M+ papers with open access status
  • Strengths: Legal free access detection, institutional repositories
  • API: Free, email required for identification

Local Files Repository

  • Capabilities: Cached content, processed text, metadata
  • Coverage: Previously downloaded and processed papers
  • Strengths: Instant access, no network dependency, full-text search

Parallel Processing System

The server includes a sophisticated parallel processing engine for handling PDF downloads and text extraction:

Core Components

Queue Manager (core/parallel_processing/parallel_queue_manager.py)

  • Job scheduling and routing
  • Queue priority management
  • Worker coordination
  • Status monitoring

Worker Pool (core/parallel_processing/worker_pool.py)

  • Configurable concurrency levels
  • Automatic worker scaling
  • Error handling and retry logic
  • Resource management

Entity Processors (core/parallel_processing/processors/)

  • PDF text extraction
  • Metadata normalization
  • Content validation
  • Result serialization

Job Management (core/parallel_processing/job_management/)

  • Status tracking
  • Progress monitoring
  • Error reporting
  • Completion callbacks

Processing Pipeline

  1. Job Creation: Search results trigger background processing jobs
  2. Queue Routing: Jobs routed to appropriate processor based on content type
  3. Worker Assignment: Available workers claim jobs from priority queues
  4. Processing: PDF download, text extraction, and metadata enhancement
  5. Storage: Results stored in database with full provenance tracking
  6. Notification: Status updates propagated to requesting clients
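
In code, that lifecycle reduces to a queue feeding a pool of workers. The sketch below captures the shape of the pipeline with asyncio primitives; the names and job fields are illustrative, not the exact identifiers used in core/parallel_processing/:

import asyncio

async def worker(queue: asyncio.Queue, results: dict) -> None:
    """Claim jobs (step 3), process them (step 4), record status (steps 5-6)."""
    while True:
        job = await queue.get()
        try:
            # Stand-in for PDF download and text extraction
            results[job["id"]] = {"status": "completed", "source": job["url"]}
        except Exception as exc:
            results[job["id"]] = {"status": "failed", "error": str(exc)}
        finally:
            queue.task_done()

async def run_pipeline(jobs: list[dict], concurrency: int = 4) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    results: dict = {}
    for job in jobs:
        queue.put_nowait(job)  # steps 1-2: job creation and queue routing
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # block until every queued job has been processed
    for w in workers:
        w.cancel()  # shut the pool down
    return results

# e.g. asyncio.run(run_pipeline([{"id": "job-1", "url": "https://example.org/a.pdf"}]))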

Configuration Options

# Search configuration
SEARCH_RESULTS=40
SEARCH_RESULTS_MULTIPLIER=2
PREFETCH_TOP_N=5

# Environment
ENVIRONMENT=dev

Authentication & Security

JWT Authentication

Production deployments use JWT tokens for secure API access:

# Generate secret key
python -m auth.jwt_token generate-secret

# Configure in .env
JWT_SECRET=your_generated_secret
JWT_ALGORITHM=HS256
JWT_ISSUER=research-mcp
JWT_AUDIENCE=mcp-api

Token Management

# Generate tokens (admin use)
from auth.jwt_token import create_access_token
token = create_access_token({"sub": "client_id"})

# Validate tokens (automatic)
# All API requests validated against JWT_SECRET
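
On the wire, a client presents the token as a standard bearer credential on every request to the HTTP transport. A hedged example (the endpoint reflects the MCP_HOST/MCP_PORT/MCP_PATH defaults above; the body shown is the generic MCP tools/list JSON-RPC call):

import httpx
from auth.jwt_token import create_access_token

token = create_access_token({"sub": "client_id"})

resp = httpx.post(
    "http://127.0.0.1:8000/mcp",  # MCP_HOST:MCP_PORT + MCP_PATH
    headers={"Authorization": f"Bearer {token}"},
    json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
)
resp.raise_for_status()  # a 401 here means the token failed validation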

Security Features

  • Path Validation: All downloads to secure sandbox directories
  • URL Validation: PDF URL verification before download
  • API Rate Limiting: Configurable per-source rate limits
  • Error Sanitization: No sensitive data in error responses
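
Per-source rate limiting can be pictured as a token bucket sitting in front of each repository's HTTP client. The class below is an illustrative sketch of that idea, not the server's actual limiter:

import asyncio
import time

class RateLimiter:
    """Token bucket: allow at most `rate` calls per `period` seconds."""

    def __init__(self, rate: int, period: float) -> None:
        self.rate, self.period = rate, period
        self.tokens = float(rate)
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at the bucket size
            self.tokens = min(self.rate, self.tokens + (now - self.updated) * self.rate / self.period)
            self.updated = now
            if self.tokens < 1:
                await asyncio.sleep((1 - self.tokens) * self.period / self.rate)
                self.tokens = 1.0
            self.tokens -= 1

# e.g. the Semantic Scholar free tier above: 100 requests per 5 minutes
s2_limiter = RateLimiter(rate=100, period=300)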

API Reference

Search Tool

search(query: str, limit: int = 10) -> SearchResult

Parameters:

  • query: Natural language search query
  • limit: Maximum results to return (default: 10, max: 40)

Returns:

{
  "papers": [
    {
      "id": "10.1038/nature12373",
      "title": "Deep learning paper title",
      "url": "https://doi.org/10.1038/nature12373",
      "authors": ["Author One", "Author Two"],
      "abstract": "Paper abstract...",
      "publication_date": "2023-01-15",
      "source": "openalex",
      "pdf_available": true
    }
  ],
  "total_found": 1500,
  "search_time": 0.85
}

Fetch Tool

fetch(identifier: str) -> Paper

Parameters:

  • identifier: DOI, ArXiv ID, or paper URL

Returns:

{
  "id": "10.1038/nature12373",
  "title": "Deep learning paper title",
  "text": "Full extracted text content...",
  "url": "https://doi.org/10.1038/nature12373",
  "metadata": {
    "authors": ["Author One", "Author Two"],
    "abstract": "Paper abstract...",
    "publication_date": "2023-01-15",
    "citation_count": 1250,
    "processing_status": "completed"
  }
}
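
Both tools can be exercised end-to-end from any MCP client. A sketch using the FastMCP client over stdio (assuming you run it from the repository root; adjust the script path to your install):

import asyncio
from fastmcp import Client

async def main() -> None:
    # Spawns research_server.py as a subprocess and speaks MCP over stdio
    async with Client("research_server.py") as client:
        results = await client.call_tool("search", {"query": "deep learning", "limit": 5})
        print(results)

        paper = await client.call_tool("fetch", {"identifier": "10.1038/nature12373"})
        print(paper)

asyncio.run(main())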

Configuration Validation

The server validates configuration at startup and provides detailed error messages for missing or invalid settings.

Required for basic operation:

  • MIRROR_URLS: At least one mirror URL for PDF access
  • OPENALEX_EMAIL: Contact email for the OpenAlex API (grants access to its polite pool)
  • S2_API_KEY: Semantic Scholar API key for enhanced rate limits
  • MCP_STORAGE_BASE and DATABASE_PATH: Storage locations

Optional for enhanced features:

  • API keys (OpenAI, Anthropic, etc.) enable additional functionality
  • JWT settings are required for production HTTP mode
  • Search parameters can be tuned for performance
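
Conceptually, the startup check boils down to failing fast when a required variable is unset. A minimal sketch of the idea (the real validation lives in core/config.py and reports more detail):

import os
import sys

REQUIRED = ["MIRROR_URLS", "OPENALEX_EMAIL", "S2_API_KEY", "MCP_STORAGE_BASE", "DATABASE_PATH"]

def validate_config() -> None:
    """Exit with a clear message if any required setting is missing."""
    missing = [name for name in REQUIRED if not os.getenv(name)]
    if missing:
        sys.exit(f"Missing required environment variables: {', '.join(missing)}")

validate_config()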

Development Guide

Project Structure

research-mcp/
ā”œā”€ā”€ research_server.py          # Main MCP server (recommended)
ā”œā”€ā”€ core/                       # Core infrastructure
│   ā”œā”€ā”€ config.py              # Configuration management
│   ā”œā”€ā”€ requests.py            # HTTP request utilities
│   ā”œā”€ā”€ db/                    # Database schema and migrations
│   ā”œā”€ā”€ models/                # Data models and enums
│   └── parallel_processing/   # Parallel processing engine
ā”œā”€ā”€ tools/                     # Data source implementations
│   ā”œā”€ā”€ searcher.py           # Main search orchestrator
│   ā”œā”€ā”€ base/                 # Base classes and DTOs
│   ā”œā”€ā”€ openalex/             # OpenAlex integration
│   ā”œā”€ā”€ s2/                   # Semantic Scholar integration
│   ā”œā”€ā”€ arxiv/                # ArXiv integration
│   ā”œā”€ā”€ pdf_mirrors/          # PDF mirror network
│   ā”œā”€ā”€ unpaywall/            # Unpaywall integration
│   ā”œā”€ā”€ local_files/          # Local cache management
│   └── paper_fetcher/        # PDF processing pipeline
ā”œā”€ā”€ auth/                     # Authentication providers
ā”œā”€ā”€ deploy/                   # Production deployment tools
└── tests/                    # Unit and integration tests

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=. --cov-report=html

Adding New Data Sources

  1. Create repository class:
# tools/newsource/newsource.py
from tools.base import PaperRepositoryBase, PaperMetadata, SearchResult

class NewSourceRepository(PaperRepositoryBase):
    has_search = True
    has_metadata = True

    async def search_papers(self, query: str, limit: int) -> SearchResult:
        # Implementation here
        pass

    async def get_paper_metadata(self, identifier: str) -> PaperMetadata:
        # Implementation here
        pass
  2. Register with searcher:
# tools/searcher.py
from tools.newsource import NewSourceRepository

# Add to repository list
repositories = [
    OpenAlexRepository(),
    NewSourceRepository(),  # Add here
    # ... other repositories
]
  3. Add configuration:
# core/config.py
NEWSOURCE_API_KEY = os.getenv("NEWSOURCE_API_KEY")

Adding New Processors

  1. Create processor class:
# core/parallel_processing/processors/new_processor.py
from pydantic import BaseModel

from .entity_processor import EntityProcessor

class NewProcessor(EntityProcessor):
    def get_entity_model(self):
        class NewEntity(BaseModel):
            entity_id: str
            data: str
            status: str = "pending"
        return NewEntity

    def process(self, job):
        # Processing logic here
        return {"status": "success", "result": "processed"}
  2. Register processor:
# core/parallel_processing/queue_router.py
from .processors.new_processor import NewProcessor

processors = {
    "pdf": PDFProcessor(),
    "new_type": NewProcessor(),  # Add here
}

Debugging Tools

Configuration validation:

python research_server.py
# Configuration is validated on startup

Check diagnostics:

# View service logs (if deployed)
sudo journalctl -u research-mcp -n 50

# Check the configured database path
python -c "from core.config import DATABASE_PATH; print(DATABASE_PATH)"

Production Deployment

Environment Setup

  1. Copy server configuration:
cp deploy/.env.server.example deploy/.env.server
# Edit with production values
  2. Generate authentication secrets:
python -m auth.jwt_token generate-secret
# Add to .env.server as JWT_SECRET
  3. Configure storage paths:
# In .env.server
MCP_STORAGE_BASE=/opt/mcp-data
DATABASE_PATH=/opt/mcp-data/papers.db
ENVIRONMENT=prod

HTTP Transport Mode

Production deployments use HTTP transport instead of stdio:

# Server configuration
MCP_HOST=0.0.0.0
MCP_PORT=8000
MCP_PATH=/mcp
STATELESS_HTTP=true

# Start server
ENVIRONMENT=prod python research_server.py

Systemd Service

# Deploy with systemd service
cd deploy/
python launch.py --create-service

# Service management
sudo systemctl start research-mcp
sudo systemctl enable research-mcp
sudo systemctl status research-mcp

Monitoring & Logging

Service logs:

sudo journalctl -u research-mcp -f

Application logs:

tail -f /opt/mcp-data/logs/research-mcp.log

Performance monitoring:

python deploy/post_deploy_check.py

Scaling Considerations

Search performance tuning:

# Adjust search parameters
SEARCH_RESULTS=60
SEARCH_RESULTS_MULTIPLIER=3
PREFETCH_TOP_N=10

Storage optimization:

# Use SSD storage for better performance
MCP_STORAGE_BASE=/fast-storage/mcp-data
DATABASE_PATH=/fast-storage/mcp-data/papers.db

Troubleshooting

Common Issues

Configuration validation failures:

# Check required environment variables
python research_server.py
# Server will validate config on startup and report issues

PDF download failures:

  • Check network connectivity to configured mirror URLs
  • Verify MIRROR_URLS environment variable is properly set
  • Check application logs for specific error messages

Authentication errors:

# If tokens fail validation, regenerate the secret and update JWT_SECRET in .env
python -m auth.jwt_token generate-secret

Database issues:

  • Check DATABASE_PATH is accessible and writable
  • Verify MCP_STORAGE_BASE directory exists
  • Review application logs for SQLite errors

Performance Issues

Slow searches:

  • Check OpenAlex API response times
  • Verify network connectivity to all sources
  • Increase SEARCH_RESULTS_MULTIPLIER value

Processing bottlenecks:

  • Check disk space in MCP_STORAGE_BASE
  • Monitor PDF processing queue in logs
  • Verify mirror connectivity

Memory usage:

  • Monitor PDF processing memory usage in system logs
  • Reduce PREFETCH_TOP_N if needed
  • Check for PDF processing errors in application logs

Getting Help

  1. Check logs: Review application and system logs for errors
  2. Validate config: Run configuration validation tools
  3. Test connectivity: Verify API access and network connectivity
  4. Monitor resources: Check CPU, memory, and disk usage
  5. Review documentation: Check deployment and configuration guides

Contributing

Development Setup

# Fork the repository on GitHub, then clone it (or your fork)
git clone https://github.com/benjaminfh/research-mcp.git
cd research-mcp

# Create development environment
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Copy configuration
cp .env.example .env
# Configure for development

Testing Guidelines

  • Write unit tests for new functionality
  • Include integration tests for external APIs
  • Test error handling and edge cases
  • Verify configuration validation
  • Document new features and APIs

Code Standards

  • Follow Python type hints
  • Use async/await for I/O operations
  • Include docstrings for public APIs
  • Handle errors gracefully
  • Log important operations
  • Maintain backward compatibility

Built for the research community. Empowering AI assistants with comprehensive academic paper access.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright (c) 2025 Benjamin Hall