MCP Research Server
The MCP Research Server is a powerful, containerized backend service designed for the advanced analysis of scientific research papers. It leverages a Model Context Protocol (MCP) to provide a suite of tools for indexing, searching, and analyzing PDF documents. The server is built with a modern Python stack and utilizes Docker for easy deployment and scalability.
Access Options
You have three ways to interact with the MCP Research Server:
- Claude Desktop Integration (Recommended) - Native MCP integration within Claude Desktop
- Production-Ready CLI Client (`mcp_client.py`) - A versatile command-line tool with the following features:
  - Dual-mode operation: interactive chat and single-command execution.
  - Support for multiple AI models.
  - JSON output for easy integration with other tools.
  - Timeout management to handle long-running tasks.
  - Suitable for production scripting.
- Jupyter Notebook Client (`mcp_notebook.ipynb`) - Research-focused notebook environment for interactive analysis
The CLI and Notebook clients run outside Docker and require local dependency installation. See Alternative MCP Client Access for detailed configuration instructions.
Features
- Robust PDF Processing: Ingests and processes PDF documents using GROBID for high-quality structured data extraction, with a graceful fallback to PyMuPDF for maximum reliability.
- Advanced Vector-Based Indexing: Indexes research papers in a Weaviate vector database, creating a rich, searchable knowledge base.
- State-of-the-Art Hybrid Search: Combines semantic (vector) search and keyword-based (BM25) search, with a reranker module, to deliver highly relevant and accurate search results.
- Comprehensive Duplicate Detection: Implements a multi-level duplicate detection strategy—checking file, content, and metadata hashes—to ensure data integrity and prevent redundant entries.
- In-Depth Citation Analysis: Extracts and analyzes citation networks, allowing users to trace the flow of information and identify influential papers.
- Multi-Faceted Document Comparison: Provides tools to compare documents based on content similarity, topic distribution, and citation overlap.
- Extractive Summarization: Generates concise summaries of research papers, with the ability to focus on specific sections like the abstract, methods, or results.
- Dynamic Terminology Tracking: Identifies and tracks the evolution of terminology across a collection of documents, highlighting emerging concepts and trends.
- Fully Dockerized Environment: All services are containerized for easy setup, consistent deployment, and scalability.
Technology Stack
- Python: 3.11+
- Server Framework: FastMCP
- Vector Database: Weaviate
- PDF Parsing: GROBID
- Caching & Task Queues: Redis
- NLP & Machine Learning: NLTK, spaCy, scikit-learn, Sentence Transformers
Getting Started
Prerequisites
- Docker and Docker Compose
- Python 3.11+ (for local development)
Configuration
The MCP Research Server supports flexible configuration for different deployment environments through a dual configuration system and environment variable overrides.
Standard Docker Deployment (Default)
For most users, no configuration changes are needed. The system uses container names and internal Docker networking:
```bash
# Simply start all services - no configuration needed
docker-compose up -d
```
Default Settings (Automatic):
- Weaviate: Uses container name `weaviate` on port `8080`
- Redis: Uses container name `redis` on port `6379`
- GROBID: Uses container name `grobid` on port `8070`
Environment Variable Overrides
Create a `.env` file to customize settings for your environment:
```bash
cp .env.example .env
```
Dual Configuration Options:
The system supports two configuration patterns for maximum flexibility:
1. Host/Port Configuration (Recommended for Docker):
```bash
# Container-based deployment (default)
WEAVIATE_HOST=weaviate
WEAVIATE_PORT=8080
REDIS_HOST=redis
REDIS_PORT=6379

# Local development override
WEAVIATE_HOST=localhost
WEAVIATE_PORT=8088  # External port mapping
REDIS_HOST=localhost
REDIS_PORT=6379
```
2. URL Configuration (Alternative):
```bash
# Complete endpoint URLs
WEAVIATE_URL=http://localhost:8088
REDIS_URL=redis://localhost:6379
GROBID_URL=http://localhost:8070
```
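As a rough illustration of how the two patterns can coexist, the sketch below resolves a Weaviate endpoint by preferring an explicit `WEAVIATE_URL` and falling back to host/port with the Docker defaults. The precedence shown here is an assumption for illustration; consult the server's configuration code for the actual rules.

```python
import os

# Hypothetical resolution of the dual configuration: an explicit *_URL wins,
# otherwise HOST/PORT (with the container-name defaults) is assembled into a URL.
def weaviate_url() -> str:
    url = os.getenv("WEAVIATE_URL")
    if url:
        return url
    host = os.getenv("WEAVIATE_HOST", "weaviate")
    port = os.getenv("WEAVIATE_PORT", "8080")
    return f"http://{host}:{port}"

print(weaviate_url())  # e.g. http://weaviate:8080 with no overrides set
```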
Other Key Variables:
- `SEARCH_HYBRID_ALPHA`: Controls vector/keyword search balance (`0.0` for pure keyword, `1.0` for pure vector)
- `DEV_MODE=true`: Enables development features like colorful logging
- `LOG_LEVEL=DEBUG`: Increases logging verbosity for troubleshooting
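To make the effect of `SEARCH_HYBRID_ALPHA` concrete, here is a minimal sketch of a hybrid query using the Weaviate Python client (v4) against the `ResearchPapers` collection described later in this README. It assumes the external port mapping (`8088`) and that Weaviate's gRPC port is also reachable; the query text is illustrative.

```python
import weaviate

# Connect through the external port mapping (localhost:8088 -> weaviate:8080).
# connect_to_local also uses Weaviate's gRPC port (50051 by default).
client = weaviate.connect_to_local(host="localhost", port=8088)
try:
    papers = client.collections.get("ResearchPapers")
    # alpha=0.0 -> pure BM25 keyword search; alpha=1.0 -> pure vector search.
    results = papers.query.hybrid(query="transformer architectures", alpha=0.5, limit=5)
    for obj in results.objects:
        print(obj.properties)
finally:
    client.close()
```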
Deployment Examples
Local Development Setup:
```bash
# Override for external access during development
export WEAVIATE_HOST=localhost
export WEAVIATE_PORT=8088  # Use external port mapping
export REDIS_HOST=localhost
docker-compose up -d
```
Production Environment:
```bash
# Point to production services
export WEAVIATE_HOST=prod-weaviate.company.com
export WEAVIATE_PORT=80
export REDIS_HOST=prod-redis.company.com
export REDIS_PORT=6379
docker-compose up -d
```
Kubernetes/Cloud Deployment:
```bash
# Use service discovery names
export WEAVIATE_HOST=weaviate-service
export REDIS_HOST=redis-service
# Apply your Kubernetes manifests
```
Service Ports
Default External Port Mappings (`docker-compose.yml`):
- Weaviate: `8088:8080` (external:internal)
- GROBID: `8070:8070`
- Redis: `6379:6379`
These mappings allow external access while maintaining internal Docker networking.
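If you want to verify the mappings from the host, a quick connectivity check can hit each service's standard probe (Weaviate's readiness endpoint and GROBID's `isalive` API, plus a Redis ping). A minimal sketch, assuming the default ports above:

```python
import redis
import requests

# Probe each service through its external port mapping.
print("Weaviate:", requests.get("http://localhost:8088/v1/.well-known/ready").status_code)  # 200 when ready
print("GROBID:", requests.get("http://localhost:8070/api/isalive").text)                    # "true" when up
print("Redis:", redis.Redis(host="localhost", port=6379).ping())                            # True when reachable
```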
Installation & Running the Server
1. Build and Run the Docker Containers: This command will build the custom Python environment and start all the services in the background.

   ```bash
   docker-compose up -d --build
   ```

2. Set Up Weaviate Schema: After the containers are running, execute the following command to initialize the Weaviate schema. This only needs to be done once.

   ```bash
   docker-compose exec mcp-server python scripts/setup_weaviate.py create

   # Database management
   docker-compose exec mcp-server python scripts/setup_weaviate.py check
   docker-compose exec mcp-server python scripts/setup_weaviate.py backup
   docker-compose exec mcp-server python scripts/setup_weaviate.py create --force  # Reset schema

   # Database and cache cleanup (comprehensive system reset)
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --stats           # Show system statistics
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --all             # Clear everything (prompts for confirmation)
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --database        # Clear only Weaviate database
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --cache           # Clear only Redis cache
   docker-compose exec mcp-server python scripts/cleanup_database_cache.py --all --confirm   # Skip confirmation prompt
   ```

3. Run the MCP Server:

   ```bash
   docker-compose exec mcp-server python src/mcp_server.py
   ```
Usage
The MCP Research Server provides a suite of tools for interacting with and analyzing research papers.
Tools
| Tool Name | Description |
|---|---|
| `list_upload_files` | Lists all uploaded PDF files with their metadata. |
| `check_document_status` | Verifies if a document already exists in the database using a multi-level duplicate detection strategy. |
| `index_pdf` | Indexes a new PDF research paper, including content extraction, chunking, and vectorization. |
| `hybrid_search` | Performs a hybrid search combining semantic and keyword-based search, with reranking for improved relevance. |
| `extract_citations` | Extracts all citations from a specified document and provides their context. |
| `compare_papers` | Compares two research papers across multiple dimensions, including content, topics, and citations. |
| `generate_summary` | Generates an extractive summary of a research paper, with options to focus on specific sections. |
| `find_related_sections` | Finds specific sections within documents that are semantically related to a given query. |
| `track_terminology` | Tracks the usage and evolution of terminology and concepts across a set of documents. |
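For a sense of how these tools are invoked programmatically, the sketch below opens an MCP session over stdio with the official Python SDK and calls `list_upload_files`. The launch command mirrors the `docker-compose exec` access pattern described under Technical Requirements and is an assumption; adapt it if your server is started differently.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Assumed launch command: run the server inside the running container.
    params = StdioServerParameters(
        command="docker-compose",
        args=["exec", "-T", "mcp-server", "python", "src/mcp_server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Available tools:", [t.name for t in tools.tools])
            result = await session.call_tool("list_upload_files", arguments={})
            print(result.content)

asyncio.run(main())
```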
Alternative MCP Client Access
In addition to using the MCP server through Claude Desktop, the project provides two alternative client tools for accessing the MCP server functionality outside of the Docker environment:
1. Production-Ready CLI Client (`mcp_client.py`)
The `mcp_client.py` file provides a sophisticated command-line interface powered by PydanticAI. This production-ready tool supports both interactive and non-interactive modes with comprehensive configuration options.
Key Features:
- Dual Operation Modes: Interactive CLI and single-command execution
- Configurable AI Models: Support for GPT-4, Claude, Gemini, and LLaMA models
- Output Formatting: Clean text output or structured JSON responses
- Robust Error Handling: Comprehensive error management with proper exit codes
- Timeout Management: Configurable request timeouts to prevent hanging
- Input Validation: Prompt validation and sanitization
- Clean Output: Optional quiet mode for script-friendly output
Command Line Options:
```
python mcp_client.py [-h] [-p PROMPT] [-q] [-m MODEL] [-t TIMEOUT] [--format {text,json}]

Options:
  -h, --help             Show help message and exit
  -p, --prompt PROMPT    Run non-interactive mode with single prompt
  -q, --quiet            Suppress server logs for clean output
  -m, --model MODEL      AI model to use (default: openrouter:openai/gpt-4.1-nano)
  -t, --timeout TIMEOUT  Timeout in seconds (default: 60)
  --format {text,json}   Output format: text (default) or json
```
Usage Examples:
Interactive Mode (Default):
```bash
python mcp_client.py
```
Non-Interactive Mode:
```bash
# Basic usage
python mcp_client.py -p "List all PDF files in the uploads directory"

# With different AI model
python mcp_client.py -p "Search for ML papers" --model "openrouter:anthropic/claude-3.5-sonnet"

# Clean JSON output for scripting
python mcp_client.py -p "List files" --format json --quiet

# With custom timeout
python mcp_client.py -p "Complex analysis task" --timeout 120

# Production scripting example
python mcp_client.py -p "Generate summary of doc_123" --quiet --format json --model "openrouter:google/gemini-2.5-pro-preview"
```
2. Jupyter Notebook Client (`mcp_notebook.ipynb`)
The `mcp_notebook.ipynb` notebook provides a Jupyter environment for running MCP commands directly in notebook cells. This is ideal for research workflows, experimentation, and interactive analysis.
Key Features:
- Run MCP commands in individual notebook cells
- Perfect for research workflows and data analysis
- Combines code execution with rich output formatting
- Supports multiple AI models for different analysis needs
Usage:
```bash
jupyter notebook mcp_notebook.ipynb
```
Configuration Requirements
Important: Both client tools run outside of Docker and require local installation of all dependencies.
1. Local Dependencies Installation
Install the required Python packages locally:
```bash
pip install pydantic-ai python-dotenv nest-asyncio
```
2. Environment Configuration
Both tools require an OpenRouter API key for AI model access. Configure your `.env` file with:
```bash
# Required for MCP client tools
OPENROUTER_API_KEY=your_openrouter_api_key_here
```
3. Model Selection and Configuration
The CLI client supports multiple AI models through OpenRouter with dynamic configuration:
Available Models:
- `openai/gpt-4.1` - High-quality reasoning and analysis
- `openai/gpt-4.1-nano` - Fast and lightweight (default)
- `google/gemini-2.5-pro-preview` - Google's latest model with large context
- `anthropic/claude-3.5-sonnet` - Anthropic's advanced reasoning model
- `meta-llama/llama-3.1-70b-instruct` - Meta's open-source model
Model Configuration Examples:
CLI Client (Runtime Configuration):
```bash
# Use different models via command line
python mcp_client.py -p "Research query" --model "openrouter:anthropic/claude-3.5-sonnet"
python mcp_client.py -p "Quick task" --model "openrouter:openai/gpt-4.1-nano"
python mcp_client.py -p "Complex analysis" --model "openrouter:google/gemini-2.5-pro-preview"
```
Notebook Client (Code Configuration):
```python
import os

# Imports assume a recent pydantic-ai release; adjust paths for your version.
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openrouter import OpenRouterProvider

# Explicit configuration (mcp_server is the MCP server handle defined earlier in the notebook)
api_key = os.getenv('OPENROUTER_API_KEY')
provider = OpenRouterProvider(api_key=api_key)
model = OpenAIModel('openai/gpt-4.1', provider=provider)
agent = Agent(model=model, mcp_servers=[mcp_server])

# Or shorthand syntax
agent = Agent(model='openrouter:anthropic/claude-3.5-sonnet', mcp_servers=[mcp_server])
```
4. File Path Configuration
Both tools are pre-configured to work with the project structure. If you move the files or change the project location, update the project path:
```python
from pathlib import Path

# Update this path in both files if needed
project_path = Path('/path/to/your/mcp-research-server')
```
Advanced Usage Patterns
Script Integration
The CLI client is designed for seamless integration with shell scripts and automation:
```bash
#!/bin/bash
# Example automation script

# Check if papers exist
# WARNING: If you modify this script to use dynamic content in the prompt (-p),
# ensure that the input is properly escaped or sanitized to prevent command injection.
RESULT=$(python mcp_client.py -p "List all PDF files" --format json --quiet)
if [[ $? -eq 0 ]]; then
    echo "Papers found: $RESULT"
else
    echo "Error accessing papers"
    exit 1
fi

# Search for specific topics
python mcp_client.py -p "Search for papers about neural networks" --quiet --timeout 30
```
Batch Processing
Process multiple queries efficiently:
```bash
# Process multiple research queries
for query in "machine learning" "deep learning" "neural networks"; do
    echo "Processing: $query"
    python mcp_client.py -p "Search for papers about $query" --quiet --format json > "results_${query// /_}.json"
done
```
Error Handling and Monitoring
The CLI client provides comprehensive error handling for production use:
```bash
# Monitor for timeouts and connection issues
python mcp_client.py -p "Long running analysis" --timeout 300 --format json --quiet
case $? in
    0) echo "Success" ;;
    1) echo "Request failed or timed out" ;;
    *) echo "Unexpected error" ;;
esac
```
Performance and Optimization
Model Selection Guidelines
Choose the appropriate model based on your use case:
- Quick Tasks: `openai/gpt-4.1-nano` (fastest, most cost-effective)
- Complex Analysis: `google/gemini-2.5-pro-preview` (largest context window)
- Balanced Performance: `anthropic/claude-3.5-sonnet` (high-quality reasoning)
- Open Source: `meta-llama/llama-3.1-70b-instruct` (free alternative)
Output Format Recommendations
- Human Reading: Use default text format
- Script Processing: Use JSON format with the `--quiet` flag (see the sketch below)
- Logging/Monitoring: Use JSON format for structured data
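For script processing from Python rather than the shell, the following sketch wraps the CLI with `subprocess` and parses its JSON output. The exact shape of the JSON payload is an assumption about `mcp_client.py`'s output schema; inspect a real response before relying on specific fields.

```python
import json
import subprocess

# Invoke the CLI in quiet JSON mode and parse the result.
proc = subprocess.run(
    ["python", "mcp_client.py", "-p", "List all PDF files", "--format", "json", "--quiet"],
    capture_output=True,
    text=True,
    timeout=60,
)
if proc.returncode == 0:
    data = json.loads(proc.stdout)  # payload structure depends on mcp_client.py
    print(data)
else:
    print(f"Request failed (exit code {proc.returncode}): {proc.stderr}")
```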
Timeout Configuration
Set appropriate timeouts based on operation complexity:
- Simple queries: 30-60 seconds (default)
- Complex analysis: 120-300 seconds
- Large document processing: 300+ seconds
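A timeout like this can be enforced by bounding the agent call with `asyncio.wait_for`. The sketch below is a hypothetical illustration of that pattern (`run_agent` is a stand-in for the actual client call), not the client's real implementation.

```python
import asyncio

async def run_with_timeout(run_agent, prompt: str, timeout: float = 60.0):
    """Bound an async agent call by a timeout, mirroring the CLI's --timeout flag."""
    try:
        return await asyncio.wait_for(run_agent(prompt), timeout=timeout)
    except asyncio.TimeoutError:
        raise SystemExit(1)  # non-zero exit code on timeout, as the CLI documents
```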
Troubleshooting
Common Issues and Solutions
1. Connection Timeouts

   ```bash
   # Increase timeout for complex operations
   python mcp_client.py -p "Complex query" --timeout 120
   ```

2. Model Not Found

   ```bash
   # Verify model name format
   python mcp_client.py -p "Test" --model "openrouter:openai/gpt-4.1-nano"
   ```

3. Clean Output for Scripts

   ```bash
   # Use quiet mode and JSON format
   python mcp_client.py -p "Query" --quiet --format json 2>/dev/null
   ```

4. Docker Services Not Running

   ```bash
   # Ensure Docker services are running
   docker-compose ps
   docker-compose up -d
   ```
Production Deployment Considerations
Environment Variables
For production deployment, consider these environment variables:
```bash
# .env file for production
OPENROUTER_API_KEY=your_production_api_key
PROJECT_PATH=/path/to/production/mcp-research-server
LOG_LEVEL=ERROR
QUIET_MODE=1
```
Monitoring and Logging
The CLI client provides structured JSON output for monitoring:
```bash
# Monitor success/failure rates
python mcp_client.py -p "Health check" --format json --quiet | jq '.success'

# Log errors for debugging
python mcp_client.py -p "Query" --format json 2>> error.log
```
This enhanced CLI client provides a production-ready interface for accessing the MCP Research Server with comprehensive configuration options, robust error handling, and flexible output formatting suitable for both interactive use and automated workflows.
Usage Examples
CLI Client Examples:
```
# Start the interactive CLI
python mcp_client.py

# Then use natural language commands:
> List all PDF files in the uploads directory
> Search for papers about machine learning transformers
> Index the new research paper 'paper.pdf'
> Compare papers with IDs 'doc_123' and 'doc_456'
```
Notebook Client Examples:
```python
# Run MCP commands in notebook cells
async with agent.run_mcp_servers():
    result = agent.run_sync('List all PDF files')
    print(result.output)

# Research workflow example
async with agent.run_mcp_servers():
    result = agent.run_sync('Search for papers about transformer architectures')
    print(result.output)

# Advanced analysis
async with agent.run_mcp_servers():
    result = agent.run_sync('Generate a summary of the paper with ID doc_123')
    print(result.output)
```
When to Use Each Tool
Use CLI Client (`mcp_client.py`) when:
- You prefer command-line interfaces
- You need quick, interactive access to MCP tools
- You want to integrate MCP functionality into scripts
- You need a lightweight alternative to Claude Desktop
Use Notebook Client (`mcp_notebook.ipynb`) when:
- You're conducting research analysis
- You need to document your research process
- You want to combine code, results, and explanations
- You're experimenting with different queries and approaches
- You need rich output formatting and visualization
Technical Requirements
System Requirements:
- Python 3.11+
- Docker and Docker Compose (for the MCP server)
- OpenRouter API key
- Local installation of PydanticAI and dependencies
Network Requirements:
- Both tools communicate with the MCP server running in Docker
- Ensure Docker containers are running before using either tool
- The tools connect to the server via `docker-compose exec` commands
Architecture & Mechanics
The MCP Research Server is designed with a modular and scalable architecture, with several key mechanics that power its features.
Core Architectural Patterns
- Lazy Initialization: Server components are initialized on the first tool call, not at startup, to ensure efficient resource usage. An `asyncio.Lock` is used to ensure thread safety during this process (a minimal sketch of this pattern follows this list).
- Graceful Degradation for PDF Processing: The system prioritizes GROBID for its high-quality structured data extraction. If GROBID fails or is unavailable, the server automatically falls back to PyMuPDF for text extraction, ensuring that the system remains operational.
- Comprehensive Duplicate Detection: Before indexing, a multi-level duplicate detection strategy is employed, combining content and metadata fingerprinting (hashes) with semantic similarity search to find conceptually similar documents.
- Hybrid Search with Reranking: Search queries combine BM25 keyword-based search for lexical precision with vector-based semantic search for contextual relevance. The results are then passed through a reranker model to provide the most relevant results to the user.
- Graph-Based Citation Network: The server uses a three-collection schema in Weaviate (`ResearchPapers`, `Citation`, and `CitationMention`) to build a graph of citations, enabling complex analysis of citation networks.
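Here is a minimal sketch of that lazy-initialization pattern, using double-checked locking around an `asyncio.Lock`. The names are illustrative, not the server's actual internals.

```python
import asyncio

_components = None
_init_lock = asyncio.Lock()

async def _initialize_components():
    # Stand-in for expensive setup (Weaviate client, Redis pool, embedding models, ...).
    await asyncio.sleep(0)
    return {"ready": True}

async def get_components():
    """Initialize shared components on first use; safe under concurrent tool calls."""
    global _components
    if _components is None:            # fast path: already initialized
        async with _init_lock:
            if _components is None:    # re-check under the lock
                _components = await _initialize_components()
    return _components
```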
Tool Mechanics Explained
- `list_upload_files`: This tool lists the contents of the `uploads` directory, which is the designated location for new PDF files awaiting processing. It provides a simple way to see which documents are ready to be indexed.
- `check_document_status`: This tool performs a comprehensive, multi-level duplicate check to determine if a document already exists in the system. It uses a combination of file hashes, content hashes (fingerprints), metadata (title and authors), and semantic similarity to identify potential duplicates. The tool returns a detailed report, including a recommendation on whether to proceed with indexing, merge with an existing document, or skip the upload.
- `index_pdf`: When a PDF is indexed, it first undergoes duplicate detection. If it's a new document, it's processed by GROBID (or PyMuPDF as a fallback) to extract text, metadata, and citations. The content is then split into chunks using a sentence-aware strategy to maintain semantic coherence. These chunks are vectorized and stored in Weaviate.
- `hybrid_search`: This tool first expands the user's query with synonyms and related terms. It then performs a hybrid search in Weaviate, which combines vector search (for semantic meaning) and BM25 search (for keyword matching). The `alpha` parameter controls the balance between these two methods. Finally, the results are reranked to improve relevance.
- `extract_citations`: This tool leverages the graph-based citation network in Weaviate. It starts by fetching the citation mentions linked to a document. For each mention, it retrieves the full bibliographic data of the cited paper. If `include_context` is true, it also provides the text snippet where the citation was mentioned. This allows for a complete picture of not just what is cited, but also how it is cited.
- `compare_papers`: This tool performs a multi-faceted comparison. It calculates content similarity using the cosine similarity of the documents' vector embeddings. It also compares the topics (derived from the content) and the citation lists to find overlaps. The final output is a comprehensive summary of both the similarities and differences between the two papers.
- `generate_summary`: This tool uses an extractive summarization technique. It scores each sentence in the document (or a specific section) based on a combination of factors (a sketch of this scoring follows this list):
  - TF-IDF: To identify sentences with important keywords.
  - Positional Value: Giving more weight to sentences at the beginning and end of sections.
  - Cue Phrases: Looking for phrases like "in conclusion," "the main finding is," etc.
  The top-scoring sentences are then combined to form the summary.
- `find_related_sections`: This tool finds specific sections within documents that are semantically related to a given query. It can be limited to a specific document and uses the hybrid search functionality to find relevant sections. The results are then grouped by section and ranked by relevance, allowing users to quickly locate the most pertinent information within a document or across the entire collection.
- `track_terminology`: This tool analyzes the content of multiple documents to identify key terms. It uses TF-IDF to find frequent terms and then analyzes their distribution across the documents to identify emerging trends. It can also identify domain-specific terms by comparing the term frequencies in the given documents to a general corpus.
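The sentence-scoring idea behind `generate_summary` can be sketched in a few lines with scikit-learn's `TfidfVectorizer`. The bonus weights and cue phrases below are illustrative, not the server's actual parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

CUE_PHRASES = ("in conclusion", "the main finding")  # illustrative cue list

def extractive_summary(sentences, top_k=3):
    """Score sentences by TF-IDF mass, position, and cue phrases; keep top_k in order."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1  # total keyword weight per sentence
    n = len(sentences)
    for i, sentence in enumerate(sentences):
        if i == 0 or i == n - 1:
            scores[i] *= 1.5  # positional bonus for boundary sentences
        if any(cue in sentence.lower() for cue in CUE_PHRASES):
            scores[i] *= 1.5  # cue-phrase bonus
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_k])
    return " ".join(sentences[i] for i in top)  # preserve original order
```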
Development
For local development, you can run the server without Docker, but you will need to have the required services (Weaviate, GROBID, Redis) running.
1. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run Services: Ensure that you have Weaviate, GROBID, and Redis running and accessible to your local machine.

3. Run the Server:

   ```bash
   python src/mcp_server.py
   ```
Troubleshooting
- Weaviate Connection Issues: Ensure the Weaviate container is running and that the `WEAVIATE_URL` in your `.env` file is correctly configured to point to the Weaviate service.
- GROBID Errors: Check the logs of the GROBID container (`docker logs mcp-grobid`) for any errors. Large or complex PDFs can sometimes cause timeouts or processing failures.
- Schema Creation Failed: Make sure that the Weaviate service is fully up and running before you execute the `scripts/setup_weaviate.py create` command.