paper-download-mcp

Oxidane-bot/paper-download-mcp

3.3

If you are the rightful owner of paper-download-mcp and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to dayong@mcphub.com.

The Paper Download MCP Server is a tool designed to facilitate the downloading of academic papers from multiple sources using intelligent routing and MCP technology.

Tools
3
Resources
0
Prompts
0

Paper Download MCP Server

PyPI version Python Version License: MIT Downloads

MCP server for downloading academic papers from multiple sources with intelligent routing.

Note: This project is built on top of scihub-cli, adapting its core functionality for MCP integration. If you find this useful, consider starring both projects!

Features

  • Multi-Source Support: Downloads from multiple sources with automatic fallback
    • arXiv: Prioritized for preprints (free, no API key needed)
    • Unpaywall: For open access papers (requires email)
    • Direct PDF: Handles direct PDF URLs
    • PMC: PubMed Central articles
    • HTML Landing: Extracts PDF links from article pages
    • Sci-Hub: Fallback for older papers (coverage-driven)
    • CORE: Additional OA fallback
  • Intelligent Routing: Priority-based source selection with year-aware routing
  • 3 MCP Tools:
    • paper_download - Download single paper by DOI or URL
    • paper_batch_download - Download multiple papers with progress reporting
    • paper_metadata - Get paper metadata without downloading PDF
  • Clean Filenames: [YYYY] - Paper Title.pdf format
  • Rate Limiting: Built-in delays for API compliance
  • Comprehensive Error Messages: Actionable suggestions on failures

Installation

For Users (Recommended)

No manual installation required! Use uvx for automatic environment management:

{
  "mcpServers": {
    "paper-download": {
      "command": "uvx",
      "args": ["paper-download-mcp"],
      "env": {
        "PAPER_DOWNLOAD_EMAIL": "your-email@university.edu"
      }
    }
  }
}

For Developers

git clone <repository-url>
cd paper-download-mcp
uv sync
uv run python -m paper_download_mcp.server

Configuration

Required Environment Variables

  • PAPER_DOWNLOAD_EMAIL: Your email address (required for Unpaywall API compliance)
    • Example: researcher@university.edu
    • Used for Unpaywall API tracking and contact purposes

Optional Environment Variables

  • PAPER_DOWNLOAD_OUTPUT_DIR: Default output directory (default: ./downloads)

Claude Desktop Setup

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "paper-download": {
      "command": "uvx",
      "args": ["paper-download-mcp"],
      "env": {
        "PAPER_DOWNLOAD_EMAIL": "your-email@university.edu",
        "PAPER_DOWNLOAD_OUTPUT_DIR": "/path/to/papers"
      }
    }
  }
}

After configuration, restart Claude Desktop.

Tools

paper_download

Download a single academic paper by DOI or URL.

Parameters:

  • identifier (required): DOI or URL (e.g., 10.1038/nature12373)
  • output_dir (optional): Output directory (default: ./downloads)

Example:

Download the paper 10.1038/nature12373

Returns:

  • Markdown with download details (file path, size, source, timing)
  • Error message with suggestions if download fails

paper_batch_download

Download multiple papers sequentially with progress reporting.

Parameters:

  • identifiers (required): List of DOIs or URLs (1-50 maximum)
  • output_dir (optional): Output directory (default: ./downloads)

Example:

Download these papers: 10.1038/nature12373, 10.1126/science.1234567

Returns:

  • Markdown summary with statistics
  • List of successful downloads
  • List of failed downloads with errors

Note: Downloads are sequential with 2-second delays for rate limiting.

paper_metadata

Retrieve paper metadata without downloading the PDF.

Parameters:

  • identifier (required): DOI or URL

Example:

Get metadata for 10.1038/nature12373

Returns:

  • JSON with paper details:
    • DOI, title, year, authors, journal
    • Open access status
    • Available download sources

How It Works

Intelligent Source Routing

The server uses priority routing to keep fast sources first:

  1. arXiv IDs/URLs: Try arXiv first, then OA sources if needed
  2. DOIs (year < 2021): OA sources first, Sci-Hub last
  3. DOIs (year ≥ 2021): OA sources only (Sci-Hub has no coverage)
  4. Year unknown: OA sources first, Sci-Hub last

Download Process

  1. Normalize DOI from URL/identifier
  2. Detect publication year via Crossref API (DOIs only)
  3. Route to appropriate source based on priority/year
  4. Download PDF with retry on failure
  5. Validate file (PDF header, size check)
  6. Generate filename: [YYYY] - Title.pdf
  7. Return absolute file path

Rate Limiting

  • 2-second delay between batch downloads
  • Respects Unpaywall API limits (~100k requests/day)
  • Built-in exponential backoff retry (3 attempts max)

Troubleshooting

"PAPER_DOWNLOAD_EMAIL environment variable is required"

Solution: Set the email in your Claude Desktop config (see Configuration section above).

"Paper not found in any source"

Possible causes:

  • Invalid or incorrect DOI
  • Paper too recent (not yet indexed)
  • Paper behind paywall with no open access version
  • Sci-Hub mirrors temporarily unavailable

Solutions:

  • Verify DOI on doi.org
  • Use paper_metadata to check availability
  • Try again later (mirrors may recover)

Download times out

Causes:

  • Slow network connection
  • Sci-Hub mirror selection taking too long
  • Large PDF file

Solutions:

  • Check internet connection
  • Retry (mirror selection is cached after first success)
  • Single papers typically complete in <15 seconds

Downloaded file is corrupted

The server validates PDFs before returning. If you encounter corruption:

  1. Check disk space
  2. Verify file permissions in output directory
  3. Try different paper (may be source issue)

Testing

MCP Inspector

Test the server with MCP Inspector:

export PAPER_DOWNLOAD_EMAIL=test@example.com
npx @modelcontextprotocol/inspector uv run python -m paper_download_mcp.server

Unit Tests

uv run pytest

Legal Notice

IMPORTANT: This tool provides access to academic papers through multiple sources:

  • Unpaywall (https://unpaywall.org): Legal open-access aggregator operated by OurResearch. Recommended and prioritized when available.

  • Sci-Hub: Operates in a legal gray area. While it provides access to research, it may violate copyright laws in some jurisdictions. Use at your own risk.

User Responsibilities:

  • You are responsible for compliance with applicable copyright laws in your jurisdiction
  • This tool is intended for research and educational purposes only
  • The maintainers assume no liability for how you use this tool
  • When possible, prefer legal open-access sources (Unpaywall)

By using this tool, you acknowledge these legal considerations and agree to use it responsibly.

Project Structure

paper-download-mcp/
├── src/
│   └── paper_download_mcp/
│       ├── server.py           # FastMCP entry point
│       ├── models.py           # Pydantic input schemas
│       ├── formatters.py       # Markdown/JSON formatters
│       ├── tools/
│       │   ├── download.py     # Download tools
│       │   └── metadata.py     # Metadata tool
│       └── scihub_core/        # Copied from scihub-cli
├── pyproject.toml
├── README.md
└── .gitignore

Architecture

Layer Architecture

  1. FastMCP Server Layer: Protocol handling, tool registration, config validation
  2. MCP Tools Layer: Request parsing, response formatting, async coordination
  3. Models & Formatters: Data validation, output serialization
  4. scihub_core Layer: Academic paper logic (unchanged from scihub-cli)

Async Pattern

All tools use asyncio.to_thread() to wrap synchronous scihub-cli code:

@mcp.tool()
async def paper_download(...):
    def _sync_download():
        # Synchronous scihub-cli code
        client = SciHubClient()
        return client.download_paper(doi)

    # Run in thread pool
    result = await asyncio.to_thread(_sync_download)
    return format_result(result)

This preserves the battle-tested scihub-cli code without modifications.

Performance

OperationTargetTypicalMax
Get Metadata<1s0.5s2s
Single Download<5s2-3s10s
Batch (10 papers)<40s25-30s60s

Note: First download may take longer (5-10s) due to mirror selection. Subsequent downloads use cached mirror.

Contributing

For Maintainers: Syncing from scihub-cli

The scihub_core/ directory contains code copied from the upstream project. When bugs are fixed or features added to scihub-cli:

Workflow:

  1. Fix/implement in scihub-cli project first
  2. Run tests and commit to scihub-cli
  3. Copy updated files to paper-download-mcp/src/paper_download_mcp/scihub_core/
  4. Test MCP server functionality
  5. Commit with message referencing upstream commit:
    sync: Update <file> from scihub-cli (<description>)
    
    Synced from scihub-cli commit <hash>
    <details of changes>
    

Last sync: scihub-cli@9787efc (2024-12-02) - Fixed year type bug in UnpaywallSource

License

MIT License - See LICENSE file for details

Credits

Support

For issues or questions:

  1. Check the Troubleshooting section above
  2. Open an issue on GitHub (include error messages and steps to reproduce)

Disclaimer: This tool is provided as-is for research and educational purposes. Users assume all responsibility for compliance with applicable laws and regulations.