shaifulshabuj/simple-document-mcp-server

Simple Document MCP Server

A minimal MCP (Model Context Protocol) server for document processing and search with multi-language support.

🎯 Features

  • Multi-format Support: PDF, DOCX, XLSX, and TXT files
  • Multi-language Support: English, Japanese, Bangla (Bengali), and more
  • Full-text Search: Search across all indexed documents with context
  • Document Metadata: Extract file type, language, size, and modification date
  • Interactive Client: Easy-to-use command-line interface
  • Flexible Directory: Custom documents directory via command-line arguments
  • Configurable Logging: Adjustable log levels (DEBUG, INFO, WARNING, ERROR)
  • Error Handling: Robust error handling and logging
  • Auto-create Directories: Automatically creates missing document directories
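
As an illustration of the metadata feature listed above, the following sketch collects file type, size, and modification date using only the standard library. The helper name `describe_file`, the returned keys, and the type labels are hypothetical and not taken from the server's code; language detection is left out here.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict

# Extension-to-label mapping; the labels are illustrative, not the server's own.
FILE_TYPES = {
    ".pdf": "PDF Document",
    ".docx": "Word Document",
    ".xlsx": "Excel Spreadsheet",
    ".txt": "Text File",
}


def describe_file(path: Path) -> Dict[str, Any]:
    """Collect basic metadata for one file (hypothetical helper)."""
    stat = path.stat()
    return {
        "name": path.name,
        "file_type": FILE_TYPES.get(path.suffix.lower(), "Unknown"),
        "size_kb": round(stat.st_size / 1024, 1),
        # Modification time as an ISO 8601 string in UTC.
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }
```

Calling `describe_file(Path("documents/english/sample.txt"))` would return a dictionary that a listing command could print one field per line.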

🚀 Quick Start

Option 1: Automated Setup

# Run the setup script
./setup.sh

# Activate the virtual environment
source venv/bin/activate

Option 2: Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Create documents directory
mkdir -p documents

📖 Usage

Starting the Server

# Terminal 1: Start the MCP server
python simple_mcp_server.py                    # Use default ./documents directory
python simple_mcp_server.py --dir /path/docs   # Use custom directory
python simple_mcp_server.py -d ~/Documents     # Use home Documents folder
python simple_mcp_server.py --log-level DEBUG  # Enable debug logging

Using the Client

# Terminal 2: Start the interactive client
python simple_client.py                       # Use default settings
python simple_client.py --dir /path/docs      # Use custom documents directory
python simple_client.py demo                  # Run demo mode
python simple_client.py demo --dir ~/docs     # Run demo with custom directory
python simple_client.py --log-level DEBUG     # Enable debug logging

Available Commands

| Command | Description | Example |
|---------|-------------|---------|
| scan | Scan and index all documents | scan |
| search <query> | Search for text in documents | search machine learning |
| list | List all processed documents | list |
| stats | Show document collection statistics | stats |
| content <filename> | Get full content of a document | content sample.txt |
| tools | Show available MCP tools | tools |
| quit | Exit the client | quit |

📁 Directory Structure

docmcp/
├── simple_mcp_server.py    # Main MCP server
├── simple_client.py        # Interactive client
├── requirements.txt        # Python dependencies
├── setup.sh                # Automated setup script
├── README.md               # This file
└── documents/              # Document storage
    ├── english/            # English documents
    ├── japanese/           # Japanese documents
    └── bangla/             # Bangla documents

🔧 MCP Tools

The server provides 5 MCP tools:

  1. scan_documents: Index all documents in the documents directory
  2. search_documents: Search for text with configurable result limits
  3. list_documents: List all processed documents with metadata
  4. get_document_stats: Get collection statistics (size, languages, types)
  5. get_document_content: Retrieve full content of a specific document
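
For readers writing their own client, the sketch below shows roughly what a call to one of these tools looks like on the wire. MCP uses JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The argument names `query` and `max_results` are an assumption and are not confirmed by the server's tool schema; `build_tool_call` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def build_tool_call(name: str, arguments: Dict[str, Any], request_id: int = 1) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Argument names are assumed; check the tool schema reported by the server.
message = build_tool_call("search_documents", {"query": "machine learning", "max_results": 5})
print(message)
```

A real session also needs the MCP `initialize` handshake before any tool call, and the client's tools command should report the actual argument names.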

🌍 Language Support

The server automatically detects document language using langdetect. Supported languages include:

  • English (en)
  • Japanese (ja)
  • Bangla/Bengali (bn)
  • And many more (any language supported by langdetect)
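
A minimal sketch of defensive language detection with langdetect follows. It is not the server's actual code: the function name `detect_language`, the `min_chars` threshold, and the `"unknown"` fallback are assumptions chosen for illustration.

```python
def detect_language(text: str, min_chars: int = 20, default: str = "unknown") -> str:
    """Return a language code such as 'en', or `default` when detection is not possible."""
    try:
        from langdetect import DetectorFactory, detect
        from langdetect.lang_detect_exception import LangDetectException
    except ImportError:
        return default  # langdetect is not installed

    if len(text.strip()) < min_chars:
        return default  # too little text for a reliable guess

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    try:
        return detect(text)
    except LangDetectException:
        return default  # e.g. text with no usable characters
```

Returning a fallback value instead of raising keeps a scan going when a single file is empty or contains only digits and punctuation.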

📄 Supported File Types

| Extension | Type | Library Used |
|-----------|------|--------------|
| .pdf | PDF Documents | PyPDF2 |
| .docx | Word Documents | python-docx |
| .xlsx | Excel Spreadsheets | openpyxl |
| .txt | Text Files | Built-in (multi-encoding) |

🔍 Search Features

  • Full-text search across all document content
  • Context highlighting around matches
  • Multiple matches per document with position tracking
  • Result limiting to prevent overwhelming output
  • Case-insensitive search
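
The features above can be sketched as a single function. This is an illustration, not the server's implementation: `find_matches`, the `context` width, and the result keys are hypothetical, and the `**...**` highlighting simply mirrors the sample output in this README.

```python
import re
from typing import Dict, List


def find_matches(text: str, query: str, context: int = 30, max_results: int = 50) -> List[Dict[str, object]]:
    """Case-insensitive literal search returning position and highlighted context."""
    if not query:
        return []  # an empty pattern would match at every position
    results: List[Dict[str, object]] = []
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        start, end = match.span()
        before = text[max(0, start - context):start]
        after = text[end:end + context]
        results.append({
            "position": start,
            "context": f"...{before}**{match.group(0)}**{after}...",
        })
        if len(results) >= max_results:
            break  # result limiting
    return results
```

`re.escape` treats the query as plain text, so characters such as `.` or `(` in a search term are matched literally.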

🛠️ Development

Adding New File Types

To add support for new file types, extend the SimpleDocumentProcessor class:

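from pathlib import Path

# 1. Add an extraction method to SimpleDocumentProcessor
def extract_text_from_newtype(self, file_path: Path) -> str:
    """Return the plain text extracted from file_path."""
    # Your extraction logic here
    raise NotImplementedError

# 2. Register it in the extractors dictionary inside process_document()
extractors = {
    '.newext': (self.extract_text_from_newtype, "New Type"),
    # ... existing extractors
}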
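
As a concrete instance of this pattern, here is a sketch of an extractor for a hypothetical `.csv` file type using only the standard library. It is written as a standalone function so it can be run on its own; inside the server it would be a method taking `self`, registered as something like `'.csv': (self.extract_text_from_csv, "CSV File")`. UTF-8 input is assumed.

```python
import csv
from pathlib import Path


def extract_text_from_csv(file_path: Path) -> str:
    """Join every cell of a CSV file into plain text, one row per line."""
    with file_path.open(newline="", encoding="utf-8") as handle:
        rows = csv.reader(handle)
        return "\n".join(" ".join(cell.strip() for cell in row) for row in rows)
```

Flattening cells to space-separated text keeps the output searchable with the same full-text search used for the other formats.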

Customizing Search

The search functionality can be enhanced by modifying the search_documents method:

from typing import Any, Dict, List

def search_documents(self, query: str, max_results: int = 50) -> List[Dict[str, Any]]:
    # Add regex support, fuzzy matching, etc.
    pass
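
One way fuzzy matching could be added is sketched below with the standard library's difflib. The helper `fuzzy_search`, its `documents` mapping of file name to text, and the `cutoff` default are all assumptions for illustration; none of them come from the server's code.

```python
import difflib
import re
from typing import Dict, List


def fuzzy_search(documents: Dict[str, str], query: str, cutoff: float = 0.8) -> List[Dict[str, str]]:
    """Find documents containing a word similar to `query` (hypothetical helper)."""
    hits: List[Dict[str, str]] = []
    for name, text in documents.items():
        # Unique lowercase words of the document.
        words = sorted(set(re.findall(r"\w+", text.lower())))
        close = difflib.get_close_matches(query.lower(), words, n=1, cutoff=cutoff)
        if close:
            hits.append({"document": name, "matched_word": close[0]})
    return hits
```

Word-based matching like this suits space-delimited languages; Japanese text would need a tokenizer before it could be compared word by word.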

🔒 Error Handling

The server includes comprehensive error handling:

  • File reading errors: Gracefully handles corrupted or unreadable files
  • Encoding issues: Tries multiple encodings for text files
  • Missing dependencies: Clear error messages for missing libraries
  • Server errors: JSON error responses for client handling

📊 Example Output

Server Help

$ python simple_mcp_server.py --help
usage: simple_mcp_server.py [-h] [--dir DIR] [--log-level {DEBUG,INFO,WARNING,ERROR}] [--version]

Simple Document MCP Server

options:
  -h, --help            show this help message and exit
  --dir DIR, -d DIR     Directory containing documents to process (default: ./documents)
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Set logging level (default: INFO)
  --version             show program's version number and exit
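
For contributors who want to add an option, the help text above is consistent with an argparse setup along these lines. This is a reconstruction from the help output, not the server's source, and the version string is a placeholder because the real value is not shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the options shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="simple_mcp_server.py",
        description="Simple Document MCP Server",
    )
    parser.add_argument(
        "--dir", "-d",
        default="./documents",
        help="Directory containing documents to process (default: ./documents)",
    )
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
        help="Set logging level (default: INFO)",
    )
    # Placeholder version string; the actual version is not given in this README.
    parser.add_argument("--version", action="version", version="%(prog)s (version placeholder)")
    return parser


args = build_parser().parse_args(["-d", "/path/docs", "--log-level", "DEBUG"])
print(args.dir, args.log_level)
```

Because `choices` is set, argparse rejects an unsupported log level before the server starts.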

Client Help

$ python simple_client.py --help
usage: simple_client.py [-h] [--dir DIR] [--log-level {DEBUG,INFO,WARNING,ERROR}] [{interactive,demo}]

Simple Document MCP Client

positional arguments:
  {interactive,demo}    Run in interactive mode or demo mode (default: interactive)

options:
  -h, --help            show this help message and exit
  --dir DIR, -d DIR     Directory containing documents to process (uses server default if not specified)
  --log-level {DEBUG,INFO,WARNING,ERROR}
                        Set server logging level (default: INFO)

Document Scan with Custom Directory

🎬 Running Demo Commands...
📁 Documents directory: /custom/path/docs
✅ Scanned and processed 3 documents

📚 Documents (3):
  1. sample.txt (Text File, en)
     Path: /custom/path/docs/sample.txt
     Size: 1.2 KB
     Preview: This is a sample English document for testing...

  2. sample.txt (Text File, ja)
     Path: /custom/path/docs/sample.txt
     Size: 0.8 KB
     Preview: これは日本語のサンプル文書です...

Search Results

🎯 Search Results for 'processing' (2 matches):

  1. sample.txt (Text File, en)
     Path: documents/english/sample.txt
     Position: 156
     Context: ...document **processing** system supports...

  2. readme.txt (Text File, en)
     Path: documents/english/readme.txt
     Position: 89
     Context: ...server can **processing**: - PDF files...

👨‍💻 Author

Shaiful Islam Shabuj

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add your improvements
  4. Test with the provided client
  5. Submit a pull request

📝 License

This project is licensed under the MIT License; see the license file in the repository for details.

Copyright (c) 2024 Shaiful Islam Shabuj

🆘 Troubleshooting

Common Issues

"Missing required dependency"

pip install -r requirements.txt

"Server not found"

  • Make sure you're in the correct directory
  • Check that simple_mcp_server.py exists
  • Verify that the client launches the server with the correct Python interpreter and script path

"No documents found"

  • Check that documents exist in the documents/ directory
  • Run the scan command first
  • Verify file permissions
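
The checks above can be automated with a short script. This sketch lists readable files with a supported extension under a directory; the helper name `find_supported_files` is hypothetical, and searching subdirectories recursively is an assumption based on the directory structure shown in this README.

```python
import os
from pathlib import Path
from typing import List

# Extensions taken from the supported file types table in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".txt"}


def find_supported_files(directory: Path) -> List[Path]:
    """Recursively list readable files whose extension the server supports."""
    return sorted(
        path
        for path in directory.rglob("*")
        if path.is_file()
        and path.suffix.lower() in SUPPORTED_EXTENSIONS
        and os.access(path, os.R_OK)
    )
```

An empty result for `find_supported_files(Path("documents"))` suggests the directory is wrong, the files use unsupported extensions, or they are not readable by the current user.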

"Language detection failed"

  • Document might be too short for reliable detection
  • Try with longer text content
  • Check for non-text content in files

Getting Help

  1. Check the server logs for detailed error messages
  2. Run the demo client: python simple_client.py demo
  3. Verify your setup with the sample documents
  4. Check file permissions and encoding issues