MCP Research Server
The MCP Research Server is a powerful, containerized backend service designed for the advanced analysis of scientific research papers. It leverages a Model Context Protocol (MCP) to provide a suite of tools for indexing, searching, and analyzing PDF documents. The server is built with a modern Python stack and utilizes Docker for easy deployment and scalability.
Access Options
You have three ways to interact with the MCP Research Server:
- Claude Desktop Integration (Recommended) - Native MCP integration within Claude Desktop
- Production-Ready CLI Client (`mcp_client.py`) - A versatile command-line tool with the following features:
  - Dual-mode operation: interactive chat and single-command execution.
  - Support for multiple AI models.
  - JSON output for easy integration with other tools.
  - Timeout management to handle long-running tasks.
  - Suitable for production scripting.
- Jupyter Notebook Client (`mcp_notebook.ipynb`) - Research-focused notebook environment for interactive analysis
The CLI and Notebook clients run outside Docker and require local dependency installation. See Alternative MCP Client Access for detailed configuration instructions.
Features
- Robust PDF Processing: Ingests and processes PDF documents using GROBID for high-quality structured data extraction, with a graceful fallback to PyMuPDF for maximum reliability.
- Advanced Vector-Based Indexing: Indexes research papers in a Weaviate vector database, creating a rich, searchable knowledge base.
- State-of-the-Art Hybrid Search: Combines semantic (vector) search and keyword-based (BM25) search, with a reranker module, to deliver highly relevant and accurate search results.
- Comprehensive Duplicate Detection: Implements a multi-level duplicate detection strategy—checking file, content, and metadata hashes—to ensure data integrity and prevent redundant entries.
- In-Depth Citation Analysis: Extracts and analyzes citation networks, allowing users to trace the flow of information and identify influential papers.
- Multi-Faceted Document Comparison: Provides tools to compare documents based on content similarity, topic distribution, and citation overlap.
- Extractive Summarization: Generates concise summaries of research papers, with the ability to focus on specific sections like the abstract, methods, or results.
- Dynamic Terminology Tracking: Identifies and tracks the evolution of terminology across a collection of documents, highlighting emerging concepts and trends.
- Fully Dockerized Environment: All services are containerized for easy setup, consistent deployment, and scalability.
Technology Stack
- Python: 3.11+
- Server Framework: FastMCP
- Vector Database: Weaviate
- PDF Parsing: GROBID
- Caching & Task Queues: Redis
- NLP & Machine Learning: NLTK, spaCy, scikit-learn, Sentence Transformers
Getting Started
Prerequisites
- Docker and Docker Compose
- Python 3.11+ (for local development)
Configuration
The MCP Research Server supports flexible configuration for different deployment environments through a dual configuration system and environment variable overrides.
Standard Docker Deployment (Default)
For most users, no configuration changes are needed. The system uses container names and internal Docker networking:
```bash
# Simply start all services - no configuration needed
docker-compose up -d
```
Default Settings (Automatic):
- Weaviate: Uses container name `weaviate` on port `8080`
- Redis: Uses container name `redis` on port `6379`
- GROBID: Uses container name `grobid` on port `8070`
Environment Variable Overrides
Create a `.env` file to customize settings for your environment:
```bash
cp .env.example .env
```
Dual Configuration Options:
The system supports two configuration patterns for maximum flexibility:
1. Host/Port Configuration (Recommended for Docker):
```bash
# Container-based deployment (default)
WEAVIATE_HOST=weaviate
WEAVIATE_PORT=8080
REDIS_HOST=redis
REDIS_PORT=6379

# Local development override
WEAVIATE_HOST=localhost
WEAVIATE_PORT=8088  # External port mapping
REDIS_HOST=localhost
REDIS_PORT=6379
```
2. URL Configuration (Alternative):
```bash
# Complete endpoint URLs
WEAVIATE_URL=http://localhost:8088
REDIS_URL=redis://localhost:6379
GROBID_URL=http://localhost:8070
```
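As a rough illustration of how the two patterns can coexist, the sketch below resolves a Weaviate endpoint by preferring an explicit `WEAVIATE_URL` and falling back to host/port with the Docker defaults. The precedence shown here is an assumption for illustration; consult the server's configuration code for the actual rules.

```python
import os

# Hypothetical resolution of the dual configuration: an explicit *_URL wins,
# otherwise HOST/PORT (with the container-name defaults) is assembled into a URL.
def weaviate_url() -> str:
    url = os.getenv("WEAVIATE_URL")
    if url:
        return url
    host = os.getenv("WEAVIATE_HOST", "weaviate")
    port = os.getenv("WEAVIATE_PORT", "8080")
    return f"http://{host}:{port}"

print(weaviate_url())  # e.g. http://weaviate:8080 with no overrides set
```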
Other Key Variables:
- `SEARCH_HYBRID_ALPHA`: Controls vector/keyword search balance (`0.0` for pure keyword, `1.0` for pure vector)
- `DEV_MODE=true`: Enables development features like colorful logging
- `LOG_LEVEL=DEBUG`: Increases logging verbosity for troubleshooting
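To make the effect of `SEARCH_HYBRID_ALPHA` concrete, here is a minimal sketch of a hybrid query using the Weaviate Python client (v4) against the `ResearchPapers` collection described later in this README. It assumes the external port mapping (`8088`) and that Weaviate's gRPC port is also reachable; the query text is illustrative.

```python
import weaviate

# Connect through the external port mapping (localhost:8088 -> weaviate:8080).
# connect_to_local also uses Weaviate's gRPC port (50051 by default).
client = weaviate.connect_to_local(host="localhost", port=8088)
try:
    papers = client.collections.get("ResearchPapers")
    # alpha=0.0 -> pure BM25 keyword search; alpha=1.0 -> pure vector search.
    results = papers.query.hybrid(query="transformer architectures", alpha=0.5, limit=5)
    for obj in results.objects:
        print(obj.properties)
finally:
    client.close()
```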
Deployment Examples
Local Development Setup:
```bash
# Override for external access during development
export WEAVIATE_HOST=localhost
export WEAVIATE_PORT=8088  # Use external port mapping
export REDIS_HOST=localhost
docker-compose up -d
```
Production Environment:
```bash
# Point to production services
export WEAVIATE_HOST=prod-weaviate.company.com
export WEAVIATE_PORT=80
export REDIS_HOST=prod-redis.company.com
export REDIS_PORT=6379
docker-compose up -d
```
Kubernetes/Cloud Deployment:
```bash
# Use service discovery names
export WEAVIATE_HOST=weaviate-service
export REDIS_HOST=redis-service
# Apply your Kubernetes manifests
```
Service Ports
Default External Port Mappings (`docker-compose.yml`):
- Weaviate: `8088:8080` (external:internal)
- GROBID: `8070:8070`
- Redis: `6379:6379`
These mappings allow external access while maintaining internal Docker networking.
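If you want to verify the mappings from the host, a quick connectivity check can hit each service's standard probe (Weaviate's readiness endpoint and GROBID's `isalive` API, plus a Redis ping). A minimal sketch, assuming the default ports above:

```python
import redis
import requests

# Probe each service through its external port mapping.
print("Weaviate:", requests.get("http://localhost:8088/v1/.well-known/ready").status_code)  # 200 when ready
print("GROBID:", requests.get("http://localhost:8070/api/isalive").text)                    # "true" when up
print("Redis:", redis.Redis(host="localhost", port=6379).ping())                            # True when reachable
```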
Installation & Running the Server
1. Build and Run the Docker Containers: This command will build the custom Python environment and start all the services in the background.

   ```bash
   docker-compose up -d --build
   ```

2. Set Up Weaviate Schema: After the containers are running, execute the following command to initialize the Weaviate schema. This only needs to be done once.

   ```bash
   docker-compose exec mcp-server python scripts/setup_weaviate.py create

   # Database management
   docker-compose exec mcp-server python scripts/setup_weaviate.py check
   docker-compose exec mcp-server python scripts/setup_weaviate.py backup
   docker-compose exec mcp-server python scripts/setup_weaviate.py create --force  # Reset schema

   # Database and cache cleanup (comprehensive system reset)
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --stats           # Show system statistics
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --all             # Clear everything (prompts for confirmation)
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --database        # Clear only Weaviate database
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --cache           # Clear only Redis cache
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --all --confirm   # Skip confirmation prompt
   ```

3. Run the MCP Server:

   ```bash
   docker-compose exec mcp-server python src/mcp_server.py
   ```
Usage
The MCP Research Server provides a suite of tools for interacting with and analyzing research papers.
Tools
| Tool Name | Description |
|---|---|
| `list_upload_files` | Lists all uploaded PDF files with their metadata. |
| `check_document_status` | Verifies if a document already exists in the database using a multi-level duplicate detection strategy. |
| `index_pdf` | Indexes a new PDF research paper, including content extraction, chunking, and vectorization. |
| `hybrid_search` | Performs a hybrid search combining semantic and keyword-based search, with reranking for improved relevance. |
| `extract_citations` | Extracts all citations from a specified document and provides their context. |
| `compare_papers` | Compares two research papers across multiple dimensions, including content, topics, and citations. |
| `generate_summary` | Generates an extractive summary of a research paper, with options to focus on specific sections. |
| `find_related_sections` | Finds specific sections within documents that are semantically related to a given query. |
| `track_terminology` | Tracks the usage and evolution of terminology and concepts across a set of documents. |
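For a sense of how these tools are invoked programmatically, the sketch below opens an MCP session over stdio with the official Python SDK and calls `list_upload_files`. The launch command mirrors the `docker-compose exec` access pattern described under Technical Requirements and is an assumption; adapt it if your server is started differently.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Assumed launch command: run the server inside the running container.
    params = StdioServerParameters(
        command="docker-compose",
        args=["exec", "-T", "mcp-server", "python", "src/mcp_server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [t.name for t in tools.tools])
            result = await session.call_tool("list_upload_files", arguments={})
            print(result.content)

asyncio.run(main())
```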
Alternative MCP Client Access
In addition to using the MCP server through Claude Desktop, the project provides two alternative client tools for accessing the MCP server functionality outside of the Docker environment:
1. Production-Ready CLI Client (`mcp_client.py`)
The `mcp_client.py` file provides a sophisticated command-line interface powered by PydanticAI. This production-ready tool supports both interactive and non-interactive modes with comprehensive configuration options.
Key Features:
- Dual Operation Modes: Interactive CLI and single-command execution
- Configurable AI Models: Support for GPT-4, Claude, Gemini, and LLaMA models
- Output Formatting: Clean text output or structured JSON responses
- Robust Error Handling: Comprehensive error management with proper exit codes
- Timeout Management: Configurable request timeouts to prevent hanging
- Input Validation: Prompt validation and sanitization
- Clean Output: Optional quiet mode for script-friendly output
Command Line Options:
```
python mcp_client.py [-h] [-p PROMPT] [-q] [-m MODEL] [-t TIMEOUT] [--format {text,json}]

Options:
  -h, --help             Show help message and exit
  -p, --prompt PROMPT    Run non-interactive mode with single prompt
  -q, --quiet            Suppress server logs for clean output
  -m, --model MODEL      AI model to use (default: openrouter:openai/gpt-4.1-nano)
  -t, --timeout TIMEOUT  Timeout in seconds (default: 60)
  --format {text,json}   Output format: text (default) or json
```
Usage Examples:
Interactive Mode (Default):
```bash
python mcp_client.py
```
Non-Interactive Mode:
```bash
# Basic usage
python mcp_client.py -p "List all PDF files in the uploads directory"

# With different AI model
python mcp_client.py -p "Search for ML papers" --model "openrouter:anthropic/claude-3.5-sonnet"

# Clean JSON output for scripting
python mcp_client.py -p "List files" --format json --quiet

# With custom timeout
python mcp_client.py -p "Complex analysis task" --timeout 120

# Production scripting example
python mcp_client.py -p "Generate summary of doc_123" --quiet --format json --model "openrouter:google/gemini-2.5-pro-preview"
```
2. Jupyter Notebook Client (`mcp_notebook.ipynb`)
The `mcp_notebook.ipynb` notebook provides a Jupyter environment for running MCP commands directly in notebook cells. This is ideal for research workflows, experimentation, and interactive analysis.
Key Features:
- Run MCP commands in individual notebook cells
- Perfect for research workflows and data analysis
- Combines code execution with rich output formatting
- Supports multiple AI models for different analysis needs
Usage:
```bash
jupyter notebook mcp_notebook.ipynb
```
Configuration Requirements
Important: Both client tools run outside of Docker and require local installation of all dependencies.
1. Local Dependencies Installation
Install the required Python packages locally:
```bash
pip install pydantic-ai python-dotenv nest-asyncio
```
2. Environment Configuration
Both tools require an OpenRouter API key for AI model access. Configure your `.env` file with:
```bash
# Required for MCP client tools
OPENROUTER_API_KEY=your_openrouter_api_key_here
```
3. Model Selection and Configuration
The CLI client supports multiple AI models through OpenRouter with dynamic configuration:
Available Models:
- `openai/gpt-4.1` - High-quality reasoning and analysis
- `openai/gpt-4.1-nano` - Fast and lightweight (default)
- `google/gemini-2.5-pro-preview` - Google's latest model with large context
- `anthropic/claude-3.5-sonnet` - Anthropic's advanced reasoning model
- `meta-llama/llama-3.1-70b-instruct` - Meta's open-source model
Model Configuration Examples:
CLI Client (Runtime Configuration):
```bash
# Use different models via command line
python mcp_client.py -p "Research query" --model "openrouter:anthropic/claude-3.5-sonnet"
python mcp_client.py -p "Quick task" --model "openrouter:openai/gpt-4.1-nano"
python mcp_client.py -p "Complex analysis" --model "openrouter:google/gemini-2.5-pro-preview"
```
Notebook Client (Code Configuration):
```python
import os

# Imports assume a recent pydantic-ai release; adjust paths for your version.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openrouter import OpenRouterProvider

# Explicit configuration (mcp_server is the MCP server handle defined earlier in the notebook)
api_key = os.getenv('OPENROUTER_API_KEY')
provider = OpenRouterProvider(api_key=api_key)
model = OpenAIModel('openai/gpt-4.1', provider=provider)
agent = Agent(model=model, mcp_servers=[mcp_server])

# Or shorthand syntax
agent = Agent(model='openrouter:anthropic/claude-3.5-sonnet', mcp_servers=[mcp_server])
```
4. File Path Configuration
Both tools are pre-configured to work with the project structure. If you move the files or change the project location, update the project path:
```python
from pathlib import Path

# Update this path in both files if needed
project_path = Path('/path/to/your/mcp-research-server')
```
Advanced Usage Patterns
Script Integration
The CLI client is designed for seamless integration with shell scripts and automation:
```bash
#!/bin/bash
# Example automation script

# Check if papers exist
# WARNING: If you modify this script to use dynamic content in the prompt (-p),
# ensure that the input is properly escaped or sanitized to prevent command injection.
RESULT=$(python mcp_client.py -p "List all PDF files" --format json --quiet)
if [[ $? -eq 0 ]]; then
    echo "Papers found: $RESULT"
else
    echo "Error accessing papers"
    exit 1
fi

# Search for specific topics
python mcp_client.py -p "Search for papers about neural networks" --quiet --timeout 30
```
Batch Processing
Process multiple queries efficiently:
```bash
# Process multiple research queries
for query in "machine learning" "deep learning" "neural networks"; do
    echo "Processing: $query"
    python mcp_client.py -p "Search for papers about $query" --quiet --format json > "results_${query// /_}.json"
done
```
Error Handling and Monitoring
The CLI client provides comprehensive error handling for production use:
```bash
# Monitor for timeouts and connection issues
python mcp_client.py -p "Long running analysis" --timeout 300 --format json --quiet
case $? in
    0) echo "Success" ;;
    1) echo "Request failed or timed out" ;;
    *) echo "Unexpected error" ;;
esac
```
Performance and Optimization
Model Selection Guidelines
Choose the appropriate model based on your use case:
- Quick Tasks: `openai/gpt-4.1-nano` (fastest, most cost-effective)
- Complex Analysis: `google/gemini-2.5-pro-preview` (largest context window)
- Balanced Performance: `anthropic/claude-3.5-sonnet` (high-quality reasoning)
- Open Source: `meta-llama/llama-3.1-70b-instruct` (free alternative)
Output Format Recommendations
- Human Reading: Use default text format
- Script Processing: Use JSON format with the `--quiet` flag (see the sketch below)
- Logging/Monitoring: Use JSON format for structured data
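For script processing from Python rather than the shell, the following sketch wraps the CLI with `subprocess` and parses its JSON output. The exact shape of the JSON payload is an assumption about `mcp_client.py`'s output schema; inspect a real response before relying on specific fields.

```python
import json
import subprocess

# Invoke the CLI in quiet JSON mode and parse the result.
proc = subprocess.run(
    ["python", "mcp_client.py", "-p", "List all PDF files", "--format", "json", "--quiet"],
    capture_output=True,
    text=True,
    timeout=60,
)
if proc.returncode == 0:
    data = json.loads(proc.stdout)  # payload structure depends on mcp_client.py
    print(data)
else:
    print(f"Request failed (exit code {proc.returncode}): {proc.stderr}")
```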
Timeout Configuration
Set appropriate timeouts based on operation complexity:
- Simple queries: 30-60 seconds (default)
- Complex analysis: 120-300 seconds
- Large document processing: 300+ seconds
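A timeout like this can be enforced by bounding the agent call with `asyncio.wait_for`. The sketch below is a hypothetical illustration of that pattern (`run_agent` is a stand-in for the actual client call), not the client's real implementation.

```python
import asyncio

async def run_with_timeout(run_agent, prompt: str, timeout: float = 60.0):
    """Bound an async agent call by a timeout, mirroring the CLI's --timeout flag."""
    try:
        return await asyncio.wait_for(run_agent(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        raise SystemExit(1)  # non-zero exit code on timeout, as the CLI documents
```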
Troubleshooting
Common Issues and Solutions
1. Connection Timeouts

   ```bash
   # Increase timeout for complex operations
   python mcp_client.py -p "Complex query" --timeout 120
   ```

2. Model Not Found

   ```bash
   # Verify model name format
   python mcp_client.py -p "Test" --model "openrouter:openai/gpt-4.1-nano"
   ```

3. Clean Output for Scripts

   ```bash
   # Use quiet mode and JSON format
   python mcp_client.py -p "Query" --quiet --format json 2>/dev/null
   ```

4. Docker Services Not Running

   ```bash
   # Ensure Docker services are running
   docker-compose ps
   docker-compose up -d
   ```
Production Deployment Considerations
Environment Variables
For production deployment, consider these environment variables:
```bash
# .env file for production
OPENROUTER_API_KEY=your_production_api_key
PROJECT_PATH=/path/to/production/mcp-research-server
LOG_LEVEL=ERROR
QUIET_MODE=1
```
Monitoring and Logging
The CLI client provides structured JSON output for monitoring:
```bash
# Monitor success/failure rates
python mcp_client.py -p "Health check" --format json --quiet | jq '.success'

# Log errors for debugging
python mcp_client.py -p "Query" --format json 2>> error.log
```
This enhanced CLI client provides a production-ready interface for accessing the MCP Research Server with comprehensive configuration options, robust error handling, and flexible output formatting suitable for both interactive use and automated workflows.
Usage Examples
CLI Client Examples:
```
# Start the interactive CLI
python mcp_client.py

# Then use natural language commands:
> List all PDF files in the uploads directory
> Search for papers about machine learning transformers
> Index the new research paper 'paper.pdf'
> Compare papers with IDs 'doc_123' and 'doc_456'
```
Notebook Client Examples:
```python
# Run MCP commands in notebook cells
async with agent.run_mcp_servers():
    result = agent.run_sync('List all PDF files')
    print(result.output)

# Research workflow example
async with agent.run_mcp_servers():
    result = agent.run_sync('Search for papers about transformer architectures')
    print(result.output)

# Advanced analysis
async with agent.run_mcp_servers():
    result = agent.run_sync('Generate a summary of the paper with ID doc_123')
    print(result.output)
```
When to Use Each Tool
Use CLI Client (`mcp_client.py`) when:
- You prefer command-line interfaces
- You need quick, interactive access to MCP tools
- You want to integrate MCP functionality into scripts
- You need a lightweight alternative to Claude Desktop
Use Notebook Client (`mcp_notebook.ipynb`) when:
- You're conducting research analysis
- You need to document your research process
- You want to combine code, results, and explanations
- You're experimenting with different queries and approaches
- You need rich output formatting and visualization
Technical Requirements
System Requirements:
- Python 3.11+
- Docker and Docker Compose (for the MCP server)
- OpenRouter API key
- Local installation of PydanticAI and dependencies
Network Requirements:
- Both tools communicate with the MCP server running in Docker
- Ensure Docker containers are running before using either tool
- The tools connect to the server via `docker-compose exec` commands
Architecture & Mechanics
The MCP Research Server is designed with a modular and scalable architecture, with several key mechanics that power its features.
Core Architectural Patterns
- Lazy Initialization: Server components are initialized on the first tool call, not at startup, to ensure efficient resource usage. An `asyncio.Lock` is used to ensure thread safety during this process (a minimal sketch of this pattern follows this list).
- Graceful Degradation for PDF Processing: The system prioritizes GROBID for its high-quality structured data extraction. If GROBID fails or is unavailable, the server automatically falls back to PyMuPDF for text extraction, ensuring that the system remains operational.
- Comprehensive Duplicate Detection: Before indexing, a multi-level duplicate detection strategy is employed, combining content and metadata fingerprinting (hashes) with semantic similarity search to find conceptually similar documents.
- Hybrid Search with Reranking: Search queries combine BM25 keyword-based search for lexical precision with vector-based semantic search for contextual relevance. The results are then passed through a reranker model to provide the most relevant results to the user.
- Graph-Based Citation Network: The server uses a three-collection schema in Weaviate (`ResearchPapers`, `Citation`, and `CitationMention`) to build a graph of citations, enabling complex analysis of citation networks.
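Here is a minimal sketch of that lazy-initialization pattern, using double-checked locking around an `asyncio.Lock`. The names are illustrative, not the server's actual internals.

```python
import asyncio

_components = None
_init_lock = asyncio.Lock()

async def _initialize_components():
    # Stand-in for expensive setup (Weaviate client, Redis pool, embedding models, ...).
    await asyncio.sleep(0)
    return {"ready": True}

async def get_components():
    """Initialize shared components on first use; safe under concurrent tool calls."""
    global _components
    if _components is None:            # fast path: already initialized
        async with _init_lock:
            if _components is None:    # re-check under the lock
                _components = await _initialize_components()
    return _components
```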
Tool Mechanics Explained
- `list_upload_files`: This tool lists the contents of the `uploads` directory, which is the designated location for new PDF files awaiting processing. It provides a simple way to see which documents are ready to be indexed.
- `check_document_status`: This tool performs a comprehensive, multi-level duplicate check to determine if a document already exists in the system. It uses a combination of file hashes, content hashes (fingerprints), metadata (title and authors), and semantic similarity to identify potential duplicates. The tool returns a detailed report, including a recommendation on whether to proceed with indexing, merge with an existing document, or skip the upload.
- `index_pdf`: When a PDF is indexed, it first undergoes duplicate detection. If it's a new document, it's processed by GROBID (or PyMuPDF as a fallback) to extract text, metadata, and citations. The content is then split into chunks using a sentence-aware strategy to maintain semantic coherence. These chunks are vectorized and stored in Weaviate.
- `hybrid_search`: This tool first expands the user's query with synonyms and related terms. It then performs a hybrid search in Weaviate, which combines vector search (for semantic meaning) and BM25 search (for keyword matching). The `alpha` parameter controls the balance between these two methods. Finally, the results are reranked to improve relevance.
- `extract_citations`: This tool leverages the graph-based citation network in Weaviate. It starts by fetching the citation mentions linked to a document. For each mention, it retrieves the full bibliographic data of the cited paper. If `include_context` is true, it also provides the text snippet where the citation was mentioned. This allows for a complete picture of not just what is cited, but also how it is cited.
- `compare_papers`: This tool performs a multi-faceted comparison. It calculates content similarity using the cosine similarity of the documents' vector embeddings. It also compares the topics (derived from the content) and the citation lists to find overlaps. The final output is a comprehensive summary of both the similarities and differences between the two papers.
- `generate_summary`: This tool uses an extractive summarization technique. It scores each sentence in the document (or a specific section) based on a combination of factors (a sketch of this scoring follows this list):
  - TF-IDF: To identify sentences with important keywords.
  - Positional Value: Giving more weight to sentences at the beginning and end of sections.
  - Cue Phrases: Looking for phrases like "in conclusion," "the main finding is," etc.
  The top-scoring sentences are then combined to form the summary.
- `find_related_sections`: This tool finds specific sections within documents that are semantically related to a given query. It can be limited to a specific document and uses the hybrid search functionality to find relevant sections. The results are then grouped by section and ranked by relevance, allowing users to quickly locate the most pertinent information within a document or across the entire collection.
- `track_terminology`: This tool analyzes the content of multiple documents to identify key terms. It uses TF-IDF to find frequent terms and then analyzes their distribution across the documents to identify emerging trends. It can also identify domain-specific terms by comparing the term frequencies in the given documents to a general corpus.
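The sentence-scoring idea behind `generate_summary` can be sketched in a few lines with scikit-learn's `TfidfVectorizer`. The bonus weights and cue phrases below are illustrative, not the server's actual parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

CUE_PHRASES = ("in conclusion", "the main finding")  # illustrative cue list

def extractive_summary(sentences, top_k=3):
    """Score sentences by TF-IDF mass, position, and cue phrases; keep top_k in order."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1  # total keyword weight per sentence
    n = len(sentences)
    for i, sentence in enumerate(sentences):
        if i == 0 or i == n - 1:
            scores[i] *= 1.5  # positional bonus for boundary sentences
        if any(cue in sentence.lower() for cue in CUE_PHRASES):
            scores[i] *= 1.5  # cue-phrase bonus
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k])
    return " ".join(sentences[i] for i in top)  # preserve original order
```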
Development
For local development, you can run the server without Docker, but you will need to have the required services (Weaviate, GROBID, Redis) running.
1. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run Services: Ensure that you have Weaviate, GROBID, and Redis running and accessible to your local machine.

3. Run the Server:

   ```bash
   python src/mcp_server.py
   ```
Troubleshooting
- Weaviate Connection Issues: Ensure the Weaviate container is running and that the `WEAVIATE_URL` in your `.env` file is correctly configured to point to the Weaviate service.
- GROBID Errors: Check the logs of the GROBID container (`docker logs mcp-grobid`) for any errors. Large or complex PDFs can sometimes cause timeouts or processing failures.
- Schema Creation Failed: Make sure that the Weaviate service is fully up and running before you execute the `scripts/setup_weaviate.py create` command.