Office Reader MCP
A Model Context Protocol (MCP) server for reading and converting office documents (Excel, PDF, DOCX) to markdown format with streaming support and pagination for large files.
Features
- Document Support: Excel (.xlsx, .xls), PDF (.pdf), and DOCX (.docx) files
- Streaming Mode: Process large documents in chunks with progress reporting
- Text Length Check: Get document size without reading the full content
- Pagination Support: Read documents in chunks with offset and size limits
- Progress Tracking: Real-time progress updates for long-running operations
- Non-blocking: Asynchronous processing that doesn't freeze the MCP server
- Error Handling: Graceful error handling with detailed error messages
- UTF-8 Support: Proper handling of multi-byte characters (Chinese, Japanese, Arabic, etc.); see the sketch after this list
- Loop Prevention: Built-in safeguards against infinite loops in edge cases
- Comprehensive Testing: Unit tests and end-to-end tests for all functionality
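UTF-8 support matters because slicing text at an arbitrary byte index can split a multi-byte character and panic. The following is a minimal sketch of boundary-safe chunk splitting in the spirit of the features above; it is illustrative only, not the crate's actual implementation:
/// Split `s` into a chunk of at most `max_bytes` bytes plus the remainder,
/// backing up to the nearest char boundary so multi-byte characters are
/// never cut in half.
fn split_at_char_boundary(s: &str, max_bytes: usize) -> (&str, &str) {
    let mut idx = max_bytes.min(s.len());
    while !s.is_char_boundary(idx) {
        idx -= 1; // safe: index 0 is always a char boundary
    }
    s.split_at(idx)
}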
Tools Available
1. get_document_text_length (NEW)
Best for: Checking document size before processing
- Returns total text length without processing the full document
- Fast operation for size estimation
- Helps decide whether to use pagination or streaming
Parameters:
- file_path (string): Path to the office document file
Response:
{
"file_path": "/path/to/document.pdf",
"total_length": 125000,
"file_exists": true,
"error": null
}
2. read_office_document (ENHANCED)
Best for: Reading documents with size control and pagination
- Supports offset and size limits for large documents
- Default maximum size: 50,000 characters (prevents timeouts)
- Pagination support for reading large documents in chunks
- Returns metadata about total size and remaining content
Parameters:
- file_path (string): Path to the office document file
- max_size (optional number): Maximum characters to return (default: 50,000)
- offset (optional number): Character offset to start reading from (default: 0)
Response:
{
"file_path": "/path/to/document.pdf",
"total_length": 125000,
"offset": 0,
"returned_length": 50000,
"has_more": true,
"content": "# Document Title\n\nContent here..."
}
3. read_office_document_legacy
Best for: Backward compatibility
- Processes the entire document at once (no size limits)
- Returns complete markdown content
- Use only for small documents or when you need full content
Parameters:
- file_path (string): Path to the office document file
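An illustrative call (the file path is a placeholder), following the same shape as the other tools:
{
  "name": "read_office_document_legacy",
  "arguments": {
    "file_path": "/path/to/small-report.docx"
  }
}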
4. stream_office_document
Best for: Large documents with real-time progress
- Processes documents in configurable chunks
- Returns progress information with each chunk
- Prevents timeouts on large files
- Provides real-time feedback
Parameters:
- file_path (string): Path to the office document file
- chunk_size (optional number): Maximum characters per chunk (default: 10,000)
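Each chunk is reported with progress information. Based on the progress structure shown under Error Handling below, an intermediate chunk might look like this (values are illustrative; the exact response shape may differ):
{
  "current_page": 15,
  "total_pages": null,
  "current_chunk": "## Section 3\n\nMore content here...",
  "is_complete": false,
  "error": null
}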
Quick Start
Installation
git clone <repository-url>
cd office_reader
cargo build --release
Running the MCP Server
cargo run
Example Usage
Recommended Workflow for Large Documents
- Check document size first:
{
"name": "get_document_text_length",
"arguments": {
"file_path": "/path/to/large-document.pdf"
}
}
- Read in chunks if large (>50,000 characters):
{
"name": "read_office_document",
"arguments": {
"file_path": "/path/to/large-document.pdf",
"max_size": 25000,
"offset": 0
}
}
- Continue reading with offset:
{
"name": "read_office_document",
"arguments": {
"file_path": "/path/to/large-document.pdf",
"max_size": 25000,
"offset": 25000
}
}
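A client can repeat the last step, advancing offset by returned_length until has_more is false. A minimal sketch of that loop (call_tool is a hypothetical helper standing in for your MCP client's tool-call machinery):
use serde_json::{json, Value};

// Hypothetical helper: issues an MCP tool call and returns the parsed
// JSON response shown in the examples above.
fn call_tool(_name: &str, _args: Value) -> Result<Value, Box<dyn std::error::Error>> {
    unimplemented!("wire this to your MCP client")
}

fn read_whole_document(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut offset = 0u64;
    let mut markdown = String::new();
    loop {
        let resp = call_tool("read_office_document", json!({
            "file_path": path,
            "max_size": 25000,
            "offset": offset
        }))?;
        markdown.push_str(resp["content"].as_str().unwrap_or(""));
        if !resp["has_more"].as_bool().unwrap_or(false) {
            break; // nothing left to read
        }
        offset += resp["returned_length"].as_u64().unwrap_or(0);
    }
    Ok(markdown)
}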
Processing a Large PDF with Streaming
{
"name": "stream_office_document",
"arguments": {
"file_path": "/path/to/large-textbook.pdf",
"chunk_size": 15000
}
}
Processing a Small Excel File
{
"name": "read_office_document",
"arguments": {
"file_path": "/path/to/spreadsheet.xlsx"
}
}
Testing
Run All Tests
# Unit tests
cargo test --lib
# End-to-end tests
cargo test --test e2e_test
# All tests
cargo test
Test Categories
Unit Tests (src/streaming_parser.rs)
- Configuration validation
- Progress serialization/deserialization
- Error handling
- Stream completion
- Custom chunk sizes
End-to-End Tests (tests/e2e_test.rs)
- Excel document streaming
- PDF document streaming with small chunks
- Non-existent file handling
- Unsupported file type handling
- Default chunk size behavior
- Tool availability verification
Test Example
# Test streaming functionality specifically
cargo test streaming_parser
# Test a specific e2e scenario
cargo test test_stream_excel_document
Streaming Demo
Run the included example to see streaming in action:
# Show help and usage
cargo run --example test_streaming
# Process a PDF with default chunk size (10,000 characters)
cargo run --example test_streaming document.pdf
# Process an Excel file with custom chunk size
cargo run --example test_streaming spreadsheet.xlsx 5000
# Process a large document with larger chunks
cargo run --example test_streaming large-textbook.pdf 15000
Example Output:
Testing Office Reader MCP Streaming Functionality
Processing file: document.pdf
File size: 2,450,000 bytes (2.34 MB)
File type: .pdf
Streaming Configuration:
  - Chunk size: 15000 characters
  - Pages per chunk: 5
Starting PDF streaming process...
Chunk #1
  Current position: 14,856
  Progress: 6% (14856/245000)
  Content length: 14,856 characters
  Total processed: 14,856 characters
  Is complete: false
  Success
  Preview: # Document Title ## Chunk 1 (characters 0-15000) This is the beginning of the document...
Chunk #2
  Current position: 29,712
  Progress: 12% (29712/245000)
...
This will:
- Validate the file path and type
- Display file information (size, type)
- Configure streaming with your chosen chunk size
- Process the document chunk by chunk
- Show real-time progress updates
- Preview content from each chunk
Configuration
StreamingConfig Options
pub struct StreamingConfig {
pub chunk_size_pages: usize, // Pages per chunk (for future use)
pub max_chunk_size_chars: usize, // Characters per chunk
pub include_progress: bool, // Include progress info
}
Default Settings
- Chunk size: 10,000 characters
- Progress reporting: Enabled
- Word boundary breaking: Enabled for better readability
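For example, the demo's 15,000-character setup could be expressed as follows (a sketch; the import path office_reader::StreamingConfig is an assumption for illustration):
use office_reader::StreamingConfig; // assumed export path

fn main() {
    let config = StreamingConfig {
        chunk_size_pages: 5,          // pages per chunk (reserved for future use)
        max_chunk_size_chars: 15_000, // characters per chunk
        include_progress: true,       // include progress info with each chunk
    };
    let _ = config; // pass this to the streaming parser
}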
Performance Benefits
Before (Traditional Mode)
- Large files could time out
- No progress feedback
- Memory-intensive for huge documents
- Blocking operation
After (Streaming Mode)
- Processes files of any size
- Real-time progress updates
- Memory-efficient chunking
- Non-blocking async processing
- Configurable chunk sizes
- UTF-8 character boundary safety
- Infinite loop prevention
- Word boundary optimization with fallbacks
Error Handling
The streaming mode provides detailed error information:
{
"current_page": 0,
"total_pages": null,
"current_chunk": "",
"is_complete": true,
"error": "Failed to extract text from PDF: File not found"
}
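A client can branch on the error field before consuming a chunk. A sketch using serde_json (handle_chunk is a hypothetical downstream handler):
use serde_json::Value;

fn consume(progress: &Value) {
    if let Some(err) = progress["error"].as_str() {
        eprintln!("streaming failed: {err}");
    } else if let Some(chunk) = progress["current_chunk"].as_str() {
        handle_chunk(chunk);
    }
}

// Hypothetical: append the chunk to output, render it, etc.
fn handle_chunk(chunk: &str) {
    let _ = chunk;
}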
Use Cases
Perfect for Streaming Mode:
- Textbooks (like "Reinforcement Learning 2ed-Richard Sutton")
- Large Reports (100+ pages)
- Complex Spreadsheets (multiple sheets)
- Technical Documentation
Traditional Mode Still Good For:
- Small Documents (<50 pages)
- Quick Previews
- When you need immediate full content
Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │───▶│  Office Reader   │───▶│    Document     │
│                 │    │   MCP Server     │    │   Processors    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │    Streaming     │    │  • PDF Extract  │
                       │      Parser      │    │  • Calamine     │
                       │                  │    │  • DOCX-RS      │
                       └──────────────────┘    └─────────────────┘
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass:
cargo test
- Submit a pull request
License
[Add your license information here]