Simple Document MCP Server
A minimal MCP (Model Context Protocol) server for document processing and search with multi-language support.
🎯 Features
- Multi-format Support: PDF, DOCX, XLSX, TXT, CSV, JSON, and Markdown files
- Multi-language Support: English, Japanese, Bangla (Bengali), and more
- Full-text Search: Search across all indexed documents with context
- Regex Search: Advanced pattern matching with regular expressions
- Smart Caching: Intelligent file change detection for optimal performance
- Document Metadata: Extract file type, language, size, and modification date
- Interactive Client: Easy-to-use command-line interface
- Flexible Directory: Custom documents directory via command-line arguments
- Configurable Logging: Adjustable log levels (DEBUG, INFO, WARNING, ERROR)
- Error Handling: Robust error handling and logging
- Auto-create Directories: Automatically creates missing document directories
- Comprehensive Testing: Full test suite with pytest
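The README does not describe how file change detection is implemented. One common approach, sketched here purely as an illustration, is to compare each file's modification time and size against the values recorded at the last scan. The names `file_signature`, `needs_reindex`, and the in-memory `cache` dictionary are hypothetical and are not taken from the server's code.

```python
from pathlib import Path
from typing import Dict, Tuple

# Hypothetical fingerprint: (modification time in ns, size in bytes).
FileSignature = Tuple[int, int]


def file_signature(path: Path) -> FileSignature:
    """Return a cheap fingerprint of the file without reading its content."""
    stat = path.stat()
    return (stat.st_mtime_ns, stat.st_size)


def needs_reindex(path: Path, cache: Dict[str, FileSignature]) -> bool:
    """Return True if the file is new or changed since it was last seen.

    The cache is updated as a side effect, so a second call on an
    unchanged file returns False.
    """
    key = str(path.resolve())
    current = file_signature(path)
    if cache.get(key) == current:
        return False
    cache[key] = current
    return True
```

A scan loop would call `needs_reindex` for every file and skip text extraction when it returns `False`. Hashing the content would be more robust than time and size, at the cost of reading every file on every scan.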
🚀 Quick Start
Option 1: Automated Setup
# Run the setup script
./setup.sh
# Activate the virtual environment
source venv/bin/activate
Option 2: Manual Setup
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Create documents directory
mkdir -p documents
📖 Usage
Starting the Server
# Terminal 1: Start the MCP server
python simple_mcp_server.py # Use default ./documents directory
python simple_mcp_server.py --dir /path/docs # Use custom directory
python simple_mcp_server.py -d ~/Documents # Use home Documents folder
python simple_mcp_server.py --log-level DEBUG # Enable debug logging
Using the Client
# Terminal 2: Start the interactive client
python simple_client.py # Use default settings
python simple_client.py --dir /path/docs # Use custom documents directory
python simple_client.py demo # Run demo mode
python simple_client.py demo --dir ~/docs # Run demo with custom directory
python simple_client.py --log-level DEBUG # Enable debug logging
Available Commands
| Command | Description | Example |
|---|---|---|
| scan | Scan and index all documents | scan |
| search <query> | Search for text in documents | search machine learning |
| list | List all processed documents | list |
| stats | Show document collection statistics | stats |
| content <filename> | Get full content of a document | content sample.txt |
| tools | Show available MCP tools | tools |
| quit | Exit the client | quit |
📁 Directory Structure
docmcp/
├── simple_mcp_server.py # Main MCP server
├── simple_client.py # Interactive client
├── requirements.txt # Python dependencies
├── setup.sh # Automated setup script
├── README.md # This file
└── documents/ # Document storage
    ├── english/ # English documents
    ├── japanese/ # Japanese documents
    └── bangla/ # Bangla documents
🔧 MCP Tools
The server provides 6 MCP tools:
- scan_documents: Index all documents in the documents directory
- search_documents: Search for text with configurable result limits
- search_documents_regex: Advanced pattern search using regular expressions
- list_documents: List all processed documents with metadata
- get_document_stats: Get collection statistics (size, languages, types)
- get_document_content: Retrieve full content of a specific document
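MCP clients invoke tools like these through JSON-RPC 2.0 messages with the method `tools/call`. The sketch below builds such a request for `search_documents`. The argument names `query` and `max_results` are assumed to mirror the `search_documents` method signature shown under Development; the tool's actual input schema may differ, and `build_tool_call` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool_name: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names are an assumption based on the Python method signature.
request = build_tool_call(
    1, "search_documents", {"query": "machine learning", "max_results": 10}
)
print(request)
```

In practice an MCP client library handles this framing; the sketch only shows what travels over the wire.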
🌍 Language Support
The server automatically detects document language using langdetect. Supported languages include:
- English (en)
- Japanese (ja)
- Bangla/Bengali (bn)
- And many more (any language supported by langdetect)
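As a rough illustration of how detection with langdetect can be wrapped, the sketch below returns a language code or a fallback value when detection is not possible. The function name `detect_language` and the `"unknown"` fallback are assumptions for this example, not the server's actual behaviour. The import guard lets the snippet run even where langdetect is not installed.

```python
try:
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make results deterministic across runs
    HAVE_LANGDETECT = True
except ImportError:
    HAVE_LANGDETECT = False


def detect_language(text: str, fallback: str = "unknown") -> str:
    """Return a language code such as 'en', 'ja', or 'bn', or the fallback."""
    # Blank input carries no signal, so skip detection entirely.
    if not HAVE_LANGDETECT or not text.strip():
        return fallback
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. digits only).
        return fallback
```

Detection is statistical, so very short documents can be misclassified; see "Language detection failed" under Troubleshooting.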
📄 Supported File Types
| Extension | Type | Library Used |
|---|---|---|
| .pdf | PDF Documents | PyPDF2 |
| .docx | Word Documents | python-docx |
| .xlsx | Excel Spreadsheets | openpyxl |
| .txt | Text Files | Built-in (multi-encoding) |
| .csv | CSV Files | csv (built-in) |
| .json | JSON Files | json (built-in) |
| .md, .markdown | Markdown Files | markdown |
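The "multi-encoding" entry for text files suggests that several encodings are tried in turn. A minimal sketch of that idea follows. The encoding list, its order, and the name `read_text_with_fallback` are assumptions chosen for illustration; the server's actual list is not documented here.

```python
from pathlib import Path
from typing import Sequence

# Assumed candidate list. latin-1 comes last because it accepts any byte
# sequence and would otherwise mask the more specific encodings.
DEFAULT_ENCODINGS = ("utf-8-sig", "shift_jis", "latin-1")


def read_text_with_fallback(
    path: Path, encodings: Sequence[str] = DEFAULT_ENCODINGS
) -> str:
    """Decode the file with the first encoding that does not raise."""
    raw = path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached when a custom list without a catch-all encoding is passed.
    return raw.decode("utf-8", errors="replace")
```

Trying encodings in sequence is a heuristic: a file can decode without error under the wrong encoding, so the order of the candidates matters.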
🔍 Search Features
- Full-text search across all document content
- Regex search with advanced pattern matching
- Context highlighting around matches
- Multiple matches per document with position tracking
- Result limiting to prevent overwhelming output
- Case-insensitive search
- Smart caching with automatic file change detection
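The behaviours listed above (case-insensitive matching, position tracking, context around each match, and a result cap) can be sketched roughly as follows. This is an illustration and not the server's implementation: the function name `search_with_context`, the `context_chars` parameter, and the in-memory `documents` mapping of filename to text are all assumptions.

```python
import re
from typing import Any, Dict, List


def search_with_context(
    documents: Dict[str, str],
    query: str,
    max_results: int = 50,
    context_chars: int = 30,
) -> List[Dict[str, Any]]:
    """Find every case-insensitive occurrence of `query`, up to max_results."""
    if not query:
        return []
    # re.escape makes the query a literal string rather than a pattern.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    results: List[Dict[str, Any]] = []
    for filename, content in documents.items():
        for match in pattern.finditer(content):
            start = max(0, match.start() - context_chars)
            end = min(len(content), match.end() + context_chars)
            # Mark the matched text with ** on both sides.
            context = (
                content[start:match.start()]
                + "**" + match.group(0) + "**"
                + content[match.end():end]
            )
            results.append(
                {
                    "filename": filename,
                    "position": match.start(),
                    "context": f"...{context}...",
                }
            )
            if len(results) >= max_results:
                return results
    return results
```

Dropping `re.escape` and compiling the query directly would turn the same loop into a regex search.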
🧪 Testing
Run the comprehensive test suite:
# Install test dependencies
pip install -r requirements.txt
# Run all tests
pytest tests/ -v
# Run tests with coverage
pytest tests/ --cov=simple_mcp_server --cov-report=html
# Run specific test file
pytest tests/test_document_processor.py -v
🛠️ Development
Adding New File Types
To add support for new file types, extend the SimpleDocumentProcessor class:
from pathlib import Path

# Inside the SimpleDocumentProcessor class:
def extract_text_from_newtype(self, file_path: Path) -> str:
    # Your extraction logic here
    pass

# Add to the extractors dictionary in process_document()
extractors = {
    '.newext': (self.extract_text_from_newtype, "New Type"),
    # ... existing extractors
}
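As a concrete illustration of an extractor, here is a sketch for HTML files that uses only the standard library. HTML is not among the supported file types listed in this README; the function `extract_text_from_html` and the helper class `_TextCollector` are hypothetical and are written as a standalone function rather than a method so the snippet runs on its own.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text_from_html(file_path: Path) -> str:
    """Return the visible text of an HTML file, one text node per line."""
    parser = _TextCollector()
    parser.feed(file_path.read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return "\n".join(parser.chunks)
```

Converted to a method, it would be registered with an entry such as `'.html': (self.extract_text_from_html, "HTML Document")`, following the pattern shown above.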
Customizing Search
The search functionality can be enhanced by modifying the search_documents method:
from typing import Any, Dict, List

# Inside the SimpleDocumentProcessor class:
def search_documents(self, query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    # Add regex support, fuzzy matching, etc.
    pass
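One way to approach the fuzzy matching mentioned in the comment is the standard library's `difflib`. The sketch below finds words in each document that approximately match a possibly misspelled query. The function name `fuzzy_find`, the `cutoff` default, and the `documents` mapping are assumptions for illustration and are not part of the server.

```python
import difflib
import re
from typing import Dict, List


def fuzzy_find(
    documents: Dict[str, str], query: str, cutoff: float = 0.8
) -> Dict[str, List[str]]:
    """Map each filename to the words in it that approximately match `query`."""
    hits: Dict[str, List[str]] = {}
    for filename, content in documents.items():
        # Build the document's vocabulary of distinct lowercase words.
        vocabulary = sorted(set(re.findall(r"\w+", content.lower())))
        close = difflib.get_close_matches(
            query.lower(), vocabulary, n=5, cutoff=cutoff
        )
        if close:
            hits[filename] = close
    return hits
```

The matched words could then be passed to the exact search to obtain positions and context. Word splitting with `\w+` relies on whitespace or punctuation between words, so it is a poor fit for Japanese text, which would need a tokenizer.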
🔒 Error Handling
The server includes comprehensive error handling:
- File reading errors: Gracefully handles corrupted or unreadable files
- Encoding issues: Tries multiple encodings for text files
- Missing dependencies: Clear error messages for missing libraries
- Server errors: JSON error responses for client handling
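The last point, returning errors as JSON so the client can handle them, might look roughly like the sketch below. The response shape (`success`, `error`, `message`) and the names `run_tool_safely` and `read_missing_document` are assumptions; the server's real payload format is not documented in this README.

```python
import json
from typing import Any, Callable


def run_tool_safely(tool: Callable[..., Any], **arguments: Any) -> str:
    """Run a tool and always return a JSON string, even when it fails."""
    try:
        return json.dumps({"success": True, "result": tool(**arguments)})
    except Exception as exc:
        # Report the failure as data instead of crashing the server loop.
        return json.dumps(
            {"success": False, "error": type(exc).__name__, "message": str(exc)}
        )


def read_missing_document(filename: str) -> str:
    """Stand-in tool that always fails, used to show the error path."""
    raise FileNotFoundError(f"Document not found: {filename}")


print(run_tool_safely(read_missing_document, filename="missing.txt"))
```

The client can then branch on the `success` field rather than parsing log output.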
📊 Example Output
Server Help
$ python simple_mcp_server.py --help
usage: simple_mcp_server.py [-h] [--dir DIR] [--log-level {DEBUG,INFO,WARNING,ERROR}] [--version]

Simple Document MCP Server

options:
  -h, --help            show this help message and exit
  --dir DIR, -d DIR     Directory containing documents to process (default: ./documents)
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Set logging level (default: INFO)
  --version             show program's version number and exit
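Help text of this shape is what Python's `argparse` produces. The following sketch defines a parser with the same options, reconstructed from the output above rather than copied from the server's source. The function name `build_parser` is hypothetical and the version string is a placeholder.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options shown in the server's help output."""
    parser = argparse.ArgumentParser(
        prog="simple_mcp_server.py",
        description="Simple Document MCP Server",
    )
    parser.add_argument(
        "--dir", "-d",
        default="./documents",
        help="Directory containing documents to process (default: ./documents)",
    )
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
        help="Set logging level (default: INFO)",
    )
    # Placeholder version string; the real value is not given in this README.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


args = build_parser().parse_args(["-d", "/path/docs", "--log-level", "DEBUG"])
print(args.dir, args.log_level)
```

Because `choices` is set, an invalid level such as `--log-level TRACE` is rejected by the parser before the server starts.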
Client Help
$ python simple_client.py --help
usage: simple_client.py [-h] [--dir DIR] [--log-level {DEBUG,INFO,WARNING,ERROR}] [{interactive,demo}]

Simple Document MCP Client

positional arguments:
  {interactive,demo}    Run in interactive mode or demo mode (default: interactive)

options:
  -h, --help            show this help message and exit
  --dir DIR, -d DIR     Directory containing documents to process (uses server default if not specified)
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Set server logging level (default: INFO)
Document Scan with Custom Directory
🎬 Running Demo Commands...
📁 Documents directory: /custom/path/docs
✅ Scanned and processed 3 documents
📚 Documents (3):
1. sample.txt (Text File, en)
Path: /custom/path/docs/sample.txt
Size: 1.2 KB
Preview: This is a sample English document for testing...
2. sample.txt (Text File, ja)
Path: /custom/path/docs/sample.txt
Size: 0.8 KB
Preview: これは日本語のサンプル文書です...
Search Results
🎯 Search Results for 'processing' (2 matches):
1. sample.txt (Text File, en)
Path: documents/english/sample.txt
Position: 156
Context: ...document **processing** system supports...
2. readme.txt (Text File, en)
Path: documents/english/readme.txt
Position: 89
Context: ...server can **processing**: - PDF files...
👨‍💻 Author
Shaiful Islam Shabuj
- GitHub: @shaifulshabuj
- Repository: simple-document-mcp-server
🤝 Contributing
- Fork the repository
- Create a feature branch
- Add your improvements
- Test with the provided client
- Submit a pull request
📝 License
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2024 Shaiful Islam Shabuj
🆘 Troubleshooting
Common Issues
"Missing required dependency"
pip install -r requirements.txt
"Server not found"
- Make sure you're in the correct directory
- Check that simple_mcp_server.py exists
- Verify Python path in the client
"No documents found"
- Check that documents exist in the documents/ directory
- Run the scan command first
- Verify file permissions
"Language detection failed"
- Document might be too short for reliable detection
- Try with longer text content
- Check for non-text content in files
Getting Help
- Check the server logs for detailed error messages
- Run the demo client: python simple_client.py demo
- Verify your setup with the sample documents
- Check file permissions and encoding issues