RayenMalouche/MCP-PDF-Extractor-server
Tika MCP Extractor Server
Overview
The Tika MCP Extractor Server is a Model Context Protocol (MCP) compliant server that uses Apache Tika to extract content and metadata from files in various formats (e.g., PDF, DOCX, TXT, HTML, images) stored in a files-to-extract directory. It supports conversion to HTML (with optional CSS styling for better readability) or plain text and provides tools to list files and retrieve metadata. Built with Java 23, Spring Boot, Jetty, and the MCP SDK (0.11.0), it integrates with MCP-compliant clients like Claude Desktop or MCP Inspector.
The server exposes four MCP tools:
- extract-to-html: Converts file content to HTML (with embedded CSS).
- extract-text: Extracts plain text.
- list-available-files: Lists files in the directory with details.
- get-file-metadata: Retrieves detailed file metadata.
It also provides REST endpoints for testing, including a new endpoint to serve raw HTML directly for browser rendering. All operations are local, requiring no internet access, making it ideal for secure document processing workflows.
Features
- File Extraction: Converts file content to HTML (with CSS for readability) or plain text using Apache Tika.
- Metadata Extraction: Retrieves metadata like title, author, content type, and creation date.
- File Listing: Scans files-to-extract for files, providing size, MIME type, and modification details.
- MCP Integration: Four synchronous tools with JSON schema validation.
- REST Testing Endpoints:
  - GET /api/test/list: Lists available files.
  - POST /api/test/extract-html: Extracts file content as JSON with HTML string.
  - POST /api/test/extract-text: Extracts file content as plain text in JSON.
  - POST /api/test/raw-html: Serves raw HTML directly (renderable in browsers).
  - GET/POST /api/health: Checks server and directory status.
- CORS Support: Enabled for all REST endpoints for web-based testing.
- Configurability: Settings (port, directory, Tika options) via application.properties.
- Error Handling: Robust checks for file existence, readability, and parsing errors.
- Logging: Console logs with DEBUG support for Tika and PDFBox.
Prerequisites
- Java: JDK 23+ (tested with OpenJDK 24.0.2).
- Maven: Version 3.6+ for dependency management and building.
- Supported File Formats: PDF, DOCX, TXT, HTML, images, etc., handled by Apache Tika 2.9.1 and PDFBox 2.0.29.
- Optional: IntelliJ IDEA for development (the sample output in this README comes from IntelliJ, but any IDE or the command line works).
- Local Files: Place files in the files-to-extract directory; no internet access is required.
Installation
- Clone the Repository (if hosted):
    git clone https://github.com/RayenMalouche/MCP-PDF-Extractor-server.git
    cd MCP-PDF-Extractor-server
- Create the Files Directory:
  - The server reads from files-to-extract (configurable).
  - Create it:
      mkdir files-to-extract
  - Add sample files (e.g., sample.pdf, document.docx) for testing.
- Build the Project:
  - Use Maven to compile and resolve dependencies:
      mvn clean install
  - Outputs an executable JAR in target/.
Configuration
Settings are defined in src/main/resources/application.properties:
# Tika MCP Extractor Server Configuration
spring.application.name=TikaExtractorMCPServer
# Server Configuration
server.port=45453
# Tika Configuration
tika.max.string.length=-1
tika.detect.language=false
# File Processing Configuration
files.directory=files-to-extract
files.max.size=52428800
# Logging Configuration
logging.level.org.apache.tika=DEBUG
logging.level.org.apache.pdfbox=DEBUG
- spring.application.name: Application name for Spring Boot.
- server.port: HTTP port (default: 45453).
- tika.max.string.length: Sets max string length for Tika (-1 = unlimited).
- tika.detect.language: Disables language detection for performance.
- files.directory: Directory for input files.
- files.max.size: Max file size (50MB).
- logging.level: DEBUG for Tika and PDFBox to troubleshoot extraction issues.
The ConfigLoader class loads these properties at startup, falling back to defaults if the file is missing or malformed.
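The fallback behaviour described above can be sketched with the JDK's java.util.Properties. This is a minimal illustration, not the project's ConfigLoader: the class name ConfigSketch, its method names, and the choice to discard a partially parsed file are all assumptions, and the default values simply mirror the listing above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch; the real ConfigLoader may be structured differently.
class ConfigSketch {
    // Defaults mirror the values shown in application.properties above.
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("server.port", "45453");
        d.setProperty("files.directory", "files-to-extract");
        d.setProperty("files.max.size", "52428800");
        return d;
    }

    // Parses properties text; a missing (null) or malformed input yields defaults only.
    static Properties load(String text) {
        Properties props = new Properties(defaults());
        if (text == null) {
            return props;
        }
        try (Reader reader = new StringReader(text)) {
            props.load(reader);
        } catch (IOException | IllegalArgumentException e) {
            // Assumption: on a malformed file, drop anything partially loaded.
            return new Properties(defaults());
        }
        return props;
    }

    // Reads the port, falling back to the default when the value is not a number.
    static int port(Properties props) {
        try {
            return Integer.parseInt(props.getProperty("server.port").trim());
        } catch (NumberFormatException e) {
            return 45453;
        }
    }

    public static void main(String[] args) {
        Properties props = load("server.port=8080\n");
        System.out.println(port(props));                          // overridden value
        System.out.println(props.getProperty("files.directory")); // default value
    }
}
```

Keys absent from the input are resolved through the defaults table, so a partial application.properties still produces a complete configuration.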
How It Functions
Architecture
- Main Class (PdfExtractorApplication.java):
  - Initializes server, loads config via ConfigLoader.
  - Ensures files-to-extract exists.
  - Supports HTTP/SSE (default or --streamable-http) or STDIO (--stdio) modes.
  - Configures Jetty server with MCP transport, test, and health servlets.
- Service Layer (TikaExtractorService.java):
  - Core extraction logic using Apache Tika.
  - Methods:
    - extractToHtml: Generates HTML with embedded CSS (via ToHTMLContentHandler).
    - extractText: Extracts plain text using BodyContentHandler.
    - listAvailableFiles: Scans directory, returns file details (size, MIME, etc.).
    - getFileMetadata: Extracts metadata (e.g., TikaCoreProperties.TITLE, CREATOR).
  - Validates file existence and readability.
- MCP Tools (McpToolsProvider.java):
  - Defines four tools with JSON schemas and handlers.
  - Calls TikaExtractorService and formats JSON responses (HTML includes CSS).
  - Handles errors with standardized JSON messages.
- Web Layer:
  - TestServlet.java: REST endpoints for testing, including /raw-html for direct HTML rendering.
  - HealthServlet.java: Checks server status and directory accessibility.
  - Supports CORS for web clients.
- Dependencies:
  - Tika (2.9.1): Parses files; PDFBox (2.0.29) for PDF support.
  - MCP SDK (0.11.0): MCP protocol compliance.
  - Jetty (12.0.18): HTTP server.
  - Jackson (2.15.2): JSON processing.
  - Spring Boot: Manages dependencies and configuration.
Workflow
- Place a file (e.g., sample.pdf) in files-to-extract.
- Start the server.
- Use an MCP client to call a tool (e.g., extract-to-html with {"filename": "sample.pdf"}).
- Alternatively, use REST endpoints:
  - JSON response: POST /api/test/extract-html.
  - Raw HTML: POST /api/test/raw-html (renderable in browsers).
- The server parses the file and returns JSON or HTML with embedded CSS for better formatting.
Running the Server
HTTP/SSE Mode
- Default mode for web or MCP Inspector:
    mvn spring-boot:run
- Streamable HTTP (for MCP Inspector):
    mvn spring-boot:run -- --streamable-http
- Output:
    Configuration loaded. Server port: 45453
    Directory exists: files-to-extract
    Starting Tika MCP server with HTTP/SSE transport...
    Tika MCP Extractor Server started on port 45453
    Mode: Standard HTTP/SSE
    MCP endpoint: http://localhost:45453/
    SSE endpoint: http://localhost:45453/sse
    Test endpoints:
    - List files: GET http://localhost:45453/api/test/list
    - Extract HTML: POST http://localhost:45453/api/test/extract-html
    - Extract text: POST http://localhost:45453/api/test/extract-text
    - Raw HTML: POST http://localhost:45453/api/test/raw-html
    Health check: http://localhost:45453/api/health
    Files directory: ./files-to-extract/
STDIO Mode
- For command-line or local MCP clients:
    mvn spring-boot:run -- --stdio
IDE (IntelliJ)
- Run the PdfExtractorApplication main method.
- Native Access Warning: IntelliJ's runtime triggers warnings. Ignore them or add to VM options:
    --enable-native-access=ALL-UNNAMED
Stop with Ctrl+C.
Usage
MCP Tools
- Client: Use MCP-compliant tools (e.g., MCP Inspector, Claude Desktop).
- Payload: JSON with tool parameters:
    { "filename": "sample.pdf" }
- Tools:
  - extract-to-html: Returns {"status": "success", "filename": "...", "contentType": "...", "htmlLength": ..., "html": "..."} (HTML includes CSS).
  - extract-text: Returns plain text in JSON.
  - list-available-files: Returns file list with size, MIME, etc.
  - get-file-metadata: Returns metadata map.
- Errors: {"status": "error", "message": "..."}.
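When simulating a client by hand instead of using MCP Inspector, the tool parameters above travel inside a JSON-RPC 2.0 request whose method is tools/call, as defined by the MCP specification. The sketch below only builds that request body; the class and method names are hypothetical, and it assumes the tool takes a single filename argument as shown in the payload above.

```java
// Hypothetical helper that builds an MCP "tools/call" request body as a string.
class ToolCallSketch {
    // Minimal JSON string quoting: escapes backslashes and double quotes only.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Wraps the tool name and its filename argument in a JSON-RPC 2.0 envelope.
    static String toolCall(int id, String tool, String filename) {
        return "{\"jsonrpc\":\"2.0\",\"id\":" + id
                + ",\"method\":\"tools/call\""
                + ",\"params\":{\"name\":" + quote(tool)
                + ",\"arguments\":{\"filename\":" + quote(filename) + "}}}";
    }

    public static void main(String[] args) {
        System.out.println(toolCall(1, "extract-to-html", "sample.pdf"));
    }
}
```

A real client would normally perform the MCP initialize handshake before sending this request; a JSON library such as Jackson, which the project already depends on, would be a sturdier choice than string concatenation.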
REST Endpoints
Test with curl, Postman, or browsers:
- List Files:
    curl http://localhost:45453/api/test/list
  Response:
    {
      "files": {
        "sample.pdf": {
          "size": 123456,
          "lastModified": 1698765432000,
          "canRead": true,
          "mimeType": "application/pdf"
        }
      },
      "count": 1,
      "path": ".../files-to-extract"
    }
- Extract HTML (JSON):
    curl -X POST http://localhost:45453/api/test/extract-html \
      -H "Content-Type: application/json" \
      -d '{"filename":"sample.pdf"}'
  Response:
    {
      "filename": "sample.pdf",
      "html": "<html><head><style>body { font-family: Arial, sans-serif; ... }</style></head><body>...</body></html>",
      "contentType": "application/pdf",
      "title": "Sample Document",
      "author": "John Doe"
    }
- Extract Raw HTML:
    curl -X POST http://localhost:45453/api/test/raw-html \
      -H "Content-Type: application/json" \
      -d '{"filename":"sample.pdf"}' > output.html
  - Open output.html in a browser to view styled HTML.
- Extract Text:
    curl -X POST http://localhost:45453/api/test/extract-text \
      -H "Content-Type: application/json" \
      -d '{"filename":"sample.pdf"}'
- Health Check:
    curl http://localhost:45453/api/health
  Response:
    {
      "status": "ok",
      "server": "Tika MCP Extractor Server",
      "version": "1.0.0",
      "filesDirectoryExists": true,
      "filesDirectoryReadable": true,
      "filesDirectoryWritable": true
    }
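The same extract-html call can be made from Java with the JDK's built-in java.net.http client, which may be convenient for scripted checks. This is a sketch under assumptions: the class name RestClientSketch is hypothetical, the base URL assumes the default port from application.properties, and running main requires the server to be up with sample.pdf in place.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client for the /api/test/extract-html endpoint documented above.
class RestClientSketch {
    // Builds the POST request; the body mirrors the curl payload {"filename": "..."}.
    static HttpRequest extractHtmlRequest(String baseUrl, String filename) {
        String escaped = filename.replace("\\", "\\\\").replace("\"", "\\\"");
        String body = "{\"filename\":\"" + escaped + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/test/extract-html"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes the server is running locally on the default port.
        HttpRequest request = extractHtmlRequest("http://localhost:45453", "sample.pdf");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Pointing the same request at /api/test/raw-html instead should return the HTML document itself rather than a JSON wrapper, matching the curl example above.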
Testing
Unit Tests
- Add JUnit tests in src/test/java:
    import java.util.Map;

    import org.junit.jupiter.api.Test;
    import static org.junit.jupiter.api.Assertions.*;
    import com.mcp.RayenMalouche.pdf.PDFExtractor.Service.TikaExtractorService;

    class TikaExtractorServiceTest {
        @Test
        void testPdfExtraction() throws Exception {
            TikaExtractorService service = new TikaExtractorService();
            Map<String, Object> result = service.extractToHtml("sample.pdf");
            // The result should carry the HTML string and the detected content type.
            assertNotNull(result.get("html"));
            assertEquals("application/pdf", result.get("contentType"));
            // The generated HTML is expected to embed its CSS.
            assertTrue(((String) result.get("html")).contains("<style>"));
        }
    }
- Run:
    mvn test
- Note: Ensure sample.pdf exists in files-to-extract for tests.
Manual Testing
- Place files in files-to-extract (e.g., sample.pdf).
- Start the server.
- Test REST endpoints with curl/Postman:
  - Verify /raw-html renders in a browser (save output to a .html file).
  - Check /extract-html for JSON with styled HTML.
- For MCP, use MCP Inspector or simulate via HTTP POST to / or /message.
- Check logs for errors (e.g., "ERROR in extract-to-html").
Edge Cases
- Non-existent File: Returns {"status": "error", "message": "File not found: ..."} or an HTML error page for /raw-html.
- Large Files: Limited by files.max.size (50MB); adjust in properties.
- Unsupported Formats: Tika falls back to text extraction if possible.
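The pre-extraction checks behind these edge cases could look roughly like the sketch below, which uses only java.nio.file. It is not taken from TikaExtractorService: the class name, the Optional-based return type, and every message other than "File not found: ..." are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical validation step run before handing a file to the parser.
class FileCheckSketch {
    // Builds the error shape documented above: {"status": "error", "message": "..."}.
    static Map<String, Object> error(String message) {
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("status", "error");
        result.put("message", message);
        return result;
    }

    // Returns an error map if a check fails, or an empty Optional if the file is usable.
    static Optional<Map<String, Object>> validate(Path dir, String filename, long maxBytes) {
        Path file = dir.resolve(filename).normalize();
        if (!Files.isRegularFile(file)) {
            return Optional.of(error("File not found: " + filename));
        }
        if (!Files.isReadable(file)) {
            return Optional.of(error("File not readable: " + filename)); // assumed wording
        }
        try {
            if (Files.size(file) > maxBytes) {
                return Optional.of(error("File too large: " + filename)); // assumed wording
            }
        } catch (IOException e) {
            return Optional.of(error("Cannot read file size: " + filename)); // assumed wording
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // 52428800 bytes is the files.max.size value from application.properties.
        System.out.println(validate(Path.of("files-to-extract"), "sample.pdf", 52428800L));
    }
}
```

Running the size check before parsing keeps an oversized file from being loaded into memory at all, which matters given that extraction is synchronous.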
Project Structure
PDFExtractor/
├── src/
│   ├── main/
│   │   ├── java/com/mcp/RayenMalouche/pdf/PDFExtractor/
│   │   │   ├── PdfExtractorApplication.java    # Main entry point
│   │   │   ├── config/
│   │   │   │   └── ConfigLoader.java           # Loads properties
│   │   │   ├── Service/
│   │   │   │   └── TikaExtractorService.java   # Extraction logic
│   │   │   ├── tools/
│   │   │   │   └── McpToolsProvider.java       # MCP tools
│   │   │   └── web/
│   │   │       ├── TestServlet.java            # REST test endpoints
│   │   │       └── HealthServlet.java          # Health check
│   │   └── resources/
│   │       └── application.properties          # Configuration
│   └── test/                                   # Add tests here
├── files-to-extract/                           # Input files
├── pom.xml                                     # Maven config
├── target/                                     # Build artifacts
└── README.md                                   # This file
Dependencies
From pom.xml:
- Spring Boot (3.5.5): Framework foundation.
- MCP SDK (0.11.0): MCP protocol support (note: deprecated APIs).
- Jetty (12.0.18): Embedded HTTP server.
- Jackson (2.15.2): JSON processing.
- Tika (2.9.1): File parsing.
- PDFBox (2.0.29): PDF support (downgraded to fix NoSuchMethodError).
- Commons-IO (2.11.0), Commons-Codec (1.15): File utilities.
- Run mvn dependency:tree for the full list.
Limitations
- Deprecated APIs: MCP SDK 0.11.0 uses deprecated Tool constructors. Update to the latest SDK when stable.
- Image Handling: Embedded images in files (e.g., DOCX) are referenced (e.g., src="embedded:image1.jpg") but not extracted/served.
- No File Upload: Files must be manually placed in files-to-extract.
- Performance: Large files may strain memory; no async processing.
- Security: No authentication for endpoints; local use only.
- Native Access Warning: IntelliJ runtime triggers warnings; safe to ignore or add --enable-native-access=ALL-UNNAMED.
Future Improvements
- Image Extraction: Extract and serve embedded images via a new endpoint.
- File Upload Endpoint: Allow dynamic file uploads to files-to-extract.
- Update MCP SDK: Migrate to latest version to resolve deprecations.
- Async Processing: Use reactive streams for large files.
- Full Spring Boot Integration: Replace Jetty with Spring's embedded Tomcat/WebFlux.
- Authentication: Add basic auth for REST endpoints.
- Unit Tests: Expand test coverage for all components.
- CI/CD: Add GitHub Actions for automated builds/tests.
Troubleshooting
- Native Access Warning:
  - IntelliJ-related: WARNING: java.lang.System::load has been called...
  - Fix: Add --enable-native-access=ALL-UNNAMED to VM options or ignore.
- Port Conflict:
  - Change server.port in application.properties.
- File Not Found:
  - Ensure the file exists in files-to-extract and the name matches case.
- PDF Extraction Errors:
  - Fixed by downgrading to PDFBox 2.0.29 (resolves NoSuchMethodError).
  - Enable logging.level.org.apache.pdfbox=DEBUG for diagnostics.
- Tika Errors:
  - Verify file format support; update Tika if needed.
- Build Issues:
  - Run mvn clean install; ensure JDK 23+.
  - Check Maven dependencies for conflicts (mvn dependency:tree | grep pdfbox).
Contributing
- Fork the repository.
- Create a feature branch: git checkout -b feature/YourFeature.
- Commit changes: git commit -m "Add YourFeature".
- Push: git push origin feature/YourFeature.
- Open a pull request with test cases and docs.
Contact
- Maintainer: Mohamed Rayen Malouche
- Email: rayenmalouche27@gmail.com
Last Updated: August 30, 2025