mcp-office-reader

cyberelf/mcp-office-reader

The Office Reader MCP is a server designed to read and convert office documents into markdown format, supporting streaming and pagination for large files.

Office Reader MCP

A Model Context Protocol (MCP) server for reading and converting office documents (Excel, PDF, DOCX) to markdown format with streaming support and pagination for large files.

Features

  • 📄 Document Support: Excel (.xlsx, .xls), PDF (.pdf), and DOCX (.docx) files
  • 🚀 Streaming Mode: Process large documents in chunks with progress reporting
  • 📏 Text Length Check: Get document size without reading full content
  • 📖 Pagination Support: Read documents in chunks with offset and size limits
  • 📊 Progress Tracking: Real-time progress updates for long-running operations
  • 🔄 Non-blocking: Asynchronous processing that doesn't freeze the MCP server
  • 🛡️ Error Handling: Graceful error handling with detailed error messages
  • 🌍 UTF-8 Support: Proper handling of multi-byte characters (Chinese, Japanese, Arabic, etc.)
  • 🔒 Loop Prevention: Built-in safeguards against infinite loops in edge cases
  • 🧪 Comprehensive Testing: Unit tests and end-to-end tests for all functionality

Tools Available

1. get_document_text_length ⭐ NEW

Best for: Checking document size before processing

  • Returns total text length without processing the full document
  • Fast operation for size estimation
  • Helps decide whether to use pagination or streaming

Parameters:

  • file_path (string): Path to the office document file

Response:

{
  "file_path": "/path/to/document.pdf",
  "total_length": 125000,
  "file_exists": true,
  "error": null
}
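
The size check is meant to drive the choice between a single read and pagination. A minimal client-side sketch, assuming the caller has already parsed `total_length` from the response above; `ReadStrategy` and `choose_strategy` are hypothetical names that are not part of the server, and the threshold a caller passes would typically be the documented 50,000-character default.

```rust
/// Hypothetical client-side helper: not part of the server's API.
#[derive(Debug, PartialEq)]
enum ReadStrategy {
    /// The whole document fits in one `read_office_document` call.
    SingleRead,
    /// The document needs several paginated reads (or streaming).
    Paginated { pages: usize },
}

/// Picks a strategy from the `total_length` reported by
/// `get_document_text_length` and the page size the client wants to use.
fn choose_strategy(total_length: usize, max_size: usize) -> ReadStrategy {
    assert!(max_size > 0, "max_size must be positive");
    if total_length <= max_size {
        ReadStrategy::SingleRead
    } else {
        // Ceiling division: a partial last page still costs one request.
        let pages = (total_length + max_size - 1) / max_size;
        ReadStrategy::Paginated { pages }
    }
}
```

With the sample response above, `choose_strategy(125_000, 50_000)` yields `Paginated { pages: 3 }`.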

2. read_office_document ⭐ ENHANCED

Best for: Reading documents with size control and pagination

  • Supports offset and size limits for large documents
  • Default maximum size: 50,000 characters (prevents timeouts)
  • Pagination support for reading large documents in chunks
  • Returns metadata about total size and remaining content

Parameters:

  • file_path (string): Path to the office document file
  • max_size (optional number): Maximum characters to return (default: 50,000)
  • offset (optional number): Character offset to start reading from (default: 0)

Response:

{
  "file_path": "/path/to/document.pdf",
  "total_length": 125000,
  "offset": 0,
  "returned_length": 50000,
  "has_more": true,
  "content": "# Document Title\n\nContent here..."
}
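
The relationship between `offset`, `max_size`, `returned_length`, and `has_more` can be sketched as follows. This is an illustration only, not the server's implementation: `Page` and `read_page` are hypothetical names, and the sketch assumes offsets count Unicode characters rather than bytes, which the parameter descriptions suggest but do not confirm.

```rust
/// Hypothetical mirror of the documented response fields.
#[derive(Debug)]
struct Page {
    total_length: usize,
    offset: usize,
    returned_length: usize,
    has_more: bool,
    content: String,
}

/// Returns up to `max_size` characters of `text`, starting at character
/// `offset`. Counting `char`s instead of bytes keeps multi-byte text intact.
fn read_page(text: &str, offset: usize, max_size: usize) -> Page {
    let total_length = text.chars().count();
    let content: String = text.chars().skip(offset).take(max_size).collect();
    let returned_length = content.chars().count();
    Page {
        total_length,
        offset,
        returned_length,
        // More content remains if this page stops short of the end.
        has_more: offset + returned_length < total_length,
        content,
    }
}
```

An offset past the end of the text produces an empty page with `has_more` set to `false`, so a client that keeps requesting pages stops cleanly.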

3. read_office_document_legacy

Best for: Backward compatibility

  • Processes the entire document at once (no size limits)
  • Returns complete markdown content
  • Use only for small documents or when you need full content

Parameters:

  • file_path (string): Path to the office document file

4. stream_office_document

Best for: Large documents with real-time progress

  • Processes documents in configurable chunks
  • Returns progress information with each chunk
  • Prevents timeouts on large files
  • Provides real-time feedback

Parameters:

  • file_path (string): Path to the office document file
  • chunk_size (optional number): Maximum characters per chunk (default: 10,000)

Quick Start

Installation

git clone <repository-url>
cd mcp-office-reader
cargo build --release

Running the MCP Server

cargo run

Example Usage

Recommended Workflow for Large Documents
  1. Check document size first:
{
  "name": "get_document_text_length",
  "arguments": {
    "file_path": "/path/to/large-document.pdf"
  }
}
  2. Read in chunks if large (>50,000 characters):
{
  "name": "read_office_document",
  "arguments": {
    "file_path": "/path/to/large-document.pdf",
    "max_size": 25000,
    "offset": 0
  }
}
  3. Continue reading with offset:
{
  "name": "read_office_document",
  "arguments": {
    "file_path": "/path/to/large-document.pdf",
    "max_size": 25000,
    "offset": 25000
  }
}

Processing a Large PDF with Streaming

{
  "name": "stream_office_document",
  "arguments": {
    "file_path": "/path/to/large-textbook.pdf",
    "chunk_size": 15000
  }
}

Processing a Small Excel File

{
  "name": "read_office_document",
  "arguments": {
    "file_path": "/path/to/spreadsheet.xlsx"
  }
}
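
The recommended workflow for large documents boils down to a loop that advances `offset` until the server reports no more content. A minimal sketch under stated assumptions: `read_all` is a hypothetical helper, and the `fetch` closure stands in for one `read_office_document` call, returning that call's `content` and `has_more` fields.

```rust
/// Reads a whole document page by page.
///
/// `fetch(offset, max_size)` stands in for one `read_office_document` call and
/// returns the page's `content` and `has_more` fields (hypothetical signature).
fn read_all<F>(max_size: usize, mut fetch: F) -> String
where
    F: FnMut(usize, usize) -> (String, bool),
{
    let mut document = String::new();
    let mut offset = 0;
    loop {
        let (content, has_more) = fetch(offset, max_size);
        let returned = content.chars().count();
        document.push_str(&content);
        offset += returned;
        // Stop when the server reports the end, or when a page comes back
        // empty, so a misbehaving response cannot cause an endless loop.
        if !has_more || returned == 0 {
            break;
        }
    }
    document
}
```

Advancing by the number of characters actually returned, rather than by `max_size`, keeps the loop correct even if a page comes back shorter than requested.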

Testing

Run All Tests

# Unit tests
cargo test --lib

# End-to-end tests
cargo test --test e2e_test

# All tests
cargo test

Test Categories

Unit Tests (src/streaming_parser.rs)
  • ✅ Configuration validation
  • ✅ Progress serialization/deserialization
  • ✅ Error handling
  • ✅ Stream completion
  • ✅ Custom chunk sizes

End-to-End Tests (tests/e2e_test.rs)
  • ✅ Excel document streaming
  • ✅ PDF document streaming with small chunks
  • ✅ Non-existent file handling
  • ✅ Unsupported file type handling
  • ✅ Default chunk size behavior
  • ✅ Tool availability verification

Test Example

# Test streaming functionality specifically
cargo test streaming_parser

# Test a specific e2e scenario
cargo test test_stream_excel_document

Streaming Demo

Run the included example to see streaming in action:

# Show help and usage
cargo run --example test_streaming

# Process a PDF with default chunk size (10,000 characters)
cargo run --example test_streaming document.pdf

# Process an Excel file with custom chunk size
cargo run --example test_streaming spreadsheet.xlsx 5000

# Process a large document with larger chunks
cargo run --example test_streaming large-textbook.pdf 15000

Example Output:

🚀 Testing Office Reader MCP Streaming Functionality
📄 Processing file: document.pdf
📊 File size: 2,450,000 bytes (2.34 MB)
📋 File type: .pdf

⚙️  Streaming Configuration:
   - Chunk size: 15000 characters
   - Pages per chunk: 5

🔄 Starting PDF streaming process...

📦 Chunk #1
   Current position: 14,856
   Progress: 6% (14856/245000)
   Content length: 14,856 characters
   Total processed: 14,856 characters
   Is complete: false
   ✅ Success
   Preview: # Document Title  ## Chunk 1 (characters 0-15000)  This is the beginning of the document...

📦 Chunk #2
   Current position: 29,712
   Progress: 12% (29712/245000)
   ...

This will:

  1. ✅ Validate the file path and type
  2. 📊 Display file information (size, type)
  3. ⚙️ Configure streaming with your chosen chunk size
  4. 🔄 Process the document chunk by chunk
  5. 📈 Show real-time progress updates
  6. 📋 Preview content from each chunk
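
The progress figures in the sample output are consistent with an integer percentage of characters processed over total characters. A small sketch of that arithmetic; `progress_percent` and `progress_line` are hypothetical names, and rounding down is an assumption (the sample values do not distinguish it from rounding to nearest).

```rust
/// Integer percentage of `processed` out of `total`, rounded down.
/// Returns `None` when the total is unknown (zero).
fn progress_percent(processed: usize, total: usize) -> Option<usize> {
    if total == 0 {
        return None;
    }
    // Clamp so an overshoot never reports more than 100%.
    Some(processed.min(total) * 100 / total)
}

/// Formats a progress line in the style of the demo output.
fn progress_line(processed: usize, total: usize) -> String {
    match progress_percent(processed, total) {
        Some(pct) => format!("Progress: {pct}% ({processed}/{total})"),
        None => format!("Progress: unknown ({processed} characters processed)"),
    }
}
```

For the first two chunks in the sample, 14,856 and 29,712 characters out of 245,000 work out to 6% and 12%.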

Configuration

StreamingConfig Options

pub struct StreamingConfig {
    pub chunk_size_pages: usize,        // Pages per chunk (for future use)
    pub max_chunk_size_chars: usize,    // Characters per chunk
    pub include_progress: bool,         // Include progress info
}

Default Settings

  • Chunk size: 10,000 characters
  • Progress reporting: Enabled
  • Word boundary breaking: Enabled for better readability
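
One way these defaults could be expressed is a `Default` implementation on the struct. This is a sketch, not the crate's actual code: the value 5 for `chunk_size_pages` is an assumption taken from the "Pages per chunk: 5" line in the demo output, and `config_with_chunk_size` is a hypothetical helper.

```rust
/// Same fields as the `StreamingConfig` struct in this section.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamingConfig {
    pub chunk_size_pages: usize,
    pub max_chunk_size_chars: usize,
    pub include_progress: bool,
}

impl Default for StreamingConfig {
    fn default() -> Self {
        Self {
            // Assumption: 5 matches "Pages per chunk: 5" in the demo output.
            chunk_size_pages: 5,
            // Documented default: 10,000 characters per chunk.
            max_chunk_size_chars: 10_000,
            // Documented default: progress reporting enabled.
            include_progress: true,
        }
    }
}

/// Hypothetical helper: override only the chunk size, keep other defaults.
pub fn config_with_chunk_size(chars: usize) -> StreamingConfig {
    StreamingConfig {
        max_chunk_size_chars: chars,
        ..StreamingConfig::default()
    }
}
```

Struct update syntax (`..StreamingConfig::default()`) lets a caller change one field without restating the rest.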

Performance Benefits

Before (Traditional Mode)

  • ❌ Large files could timeout
  • ❌ No progress feedback
  • ❌ Memory intensive for huge documents
  • ❌ Blocking operation

After (Streaming Mode)

  • ✅ Processes files of any size
  • ✅ Real-time progress updates
  • ✅ Memory efficient chunking
  • ✅ Non-blocking async processing
  • ✅ Configurable chunk sizes
  • ✅ UTF-8 character boundary safety
  • ✅ Infinite loop prevention
  • ✅ Word boundary optimization with fallbacks
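
The last three points (UTF-8 boundary safety, loop prevention, and word-boundary breaking with a fallback) can be illustrated together. The following is a sketch of the general technique, not the project's actual parser; `split_chunks` is a hypothetical function name.

```rust
/// Splits `text` into chunks of at most `max_chars` characters, preferring to
/// break after whitespace. Falls back to a hard cut when a chunk contains no
/// whitespace.
fn split_chunks(text: &str, max_chars: usize) -> Vec<&str> {
    assert!(max_chars > 0, "max_chars must be positive");
    let mut chunks = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Byte index just past the first `max_chars` characters; always a
        // valid char boundary because it comes from `char_indices`.
        let hard_end = rest
            .char_indices()
            .nth(max_chars)
            .map_or(rest.len(), |(idx, _)| idx);
        let mut end = hard_end;
        if hard_end < rest.len() {
            // Prefer to break just after the last whitespace in the window.
            if let Some((idx, ch)) = rest[..hard_end]
                .char_indices()
                .rev()
                .find(|(_, ch)| ch.is_whitespace())
            {
                end = idx + ch.len_utf8();
            }
        }
        // `end` covers at least one character, so every pass makes progress.
        let (chunk, tail) = rest.split_at(end);
        chunks.push(chunk);
        rest = tail;
    }
    chunks
}
```

Because every iteration consumes at least one character, the loop terminates for any input, and slicing only at indices produced by `char_indices` avoids cutting a multi-byte character in half.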

Error Handling

The streaming mode provides detailed error information:

{
  "current_page": 0,
  "total_pages": null,
  "current_chunk": "",
  "is_complete": true,
  "error": "Failed to extract text from PDF: File not found"
}
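
A client consuming these updates needs to distinguish a normal end of stream from a failure, since the error object above also sets `is_complete` to `true`. A minimal sketch, assuming the field types suggested by the JSON (numeric page counters, nullable `total_pages` and `error`); `StreamProgress`, `NextStep`, and `next_step` are hypothetical names.

```rust
/// Hypothetical mirror of the progress object; field types are assumptions.
#[derive(Debug, Clone)]
struct StreamProgress {
    current_page: usize,
    total_pages: Option<usize>,
    current_chunk: String,
    is_complete: bool,
    error: Option<String>,
}

/// What a client might do after receiving one progress update.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// Keep the chunk and wait for the next update.
    Continue,
    /// The stream ended normally.
    Finished,
    /// The stream ended with an error message from the server.
    Failed(String),
}

fn next_step(update: &StreamProgress) -> NextStep {
    // Check `error` first: a failed stream also reports `is_complete: true`.
    match (&update.error, update.is_complete) {
        (Some(message), _) => NextStep::Failed(message.clone()),
        (None, true) => NextStep::Finished,
        (None, false) => NextStep::Continue,
    }
}
```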

Use Cases

Perfect for Streaming Mode:

  • 📚 Textbooks (like "Reinforcement Learning 2ed-Richard Sutton")
  • 📊 Large Reports (100+ pages)
  • 📋 Complex Spreadsheets (multiple sheets)
  • 📄 Technical Documentation

Traditional Mode Still Good For:

  • πŸ“ Small Documents (<50 pages)
  • πŸ” Quick Previews
  • ⚑ When you need immediate full content

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │───▶│  Office Reader   │───▶│  Document       │
│                 │    │  MCP Server      │    │  Processors     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              │                         │
                              ▼                         ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │  Streaming       │    │  • PDF Extract  │
                       │  Parser          │    │  • Calamine     │
                       │                  │    │  • DOCX-RS      │
                       └──────────────────┘    └─────────────────┘

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass: cargo test
  5. Submit a pull request

License

[Add your license information here]