RayenMalouche/MCP-PDF-Extractor-server-v2
If you are the rightful owner of MCP-PDF-Extractor-server-v2 and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to henry@mcphub.com.
The PDFExtractor is a Model Context Protocol (MCP) server designed for automated file content extraction and conversion to HTML, suitable for document processing and integration with AI assistants.
PDFExtractor MCP Server
Overview
The PDFExtractor
is a Model Context Protocol (MCP) server that extracts content from files (e.g., PDF, Word, Markdown) located in a designated file-to-extract
directory and converts the content into HTML using Apache Tika. Built with Spring Boot, Jetty, and the MCP SDK, it supports both HTTP/SSE and STDIO transport protocols, making it compatible with MCP-compliant clients like Claude Desktop or MCP Inspector. The server exposes a tool, extract-file-to-html
, that accepts a filename and returns the extracted content in HTML format, along with metadata.
This project is designed for developers and systems requiring automated file content extraction and conversion to HTML, suitable for applications such as document processing, content management, or integration with AI assistants.
Features
- File Content Extraction: Extracts text and metadata from various file formats (PDF, DOCX, Markdown, etc.) using Apache Tika.
- HTML Conversion: Converts extracted content to HTML using Tikaβs
ToHTMLContentHandler
. - MCP Compatibility: Supports MCP protocol with a synchronous tool (
extract-file-to-html
) for integration with MCP-compliant clients. - Transport Options: Supports HTTP/SSE (port 45452) and STDIO transports for flexible integration.
- REST Endpoint: Provides a
/api/test-extract
endpoint for testing file extraction via HTTP POST. - Health Check: Includes a
/api/health
endpoint to verify server status. - CORS Support: Enables cross-origin requests for the test endpoint, facilitating web-based testing.
- Configurable Directory: Reads files from a configurable
file-to-extract
directory, specified inapplication.properties
.
Prerequisites
- Java: Version 21 or higher.
- Maven: Version 3.6.0 or higher for building the project.
- Supported File Formats: Ensure files in the
file-to-extract
directory are in formats supported by Apache Tika (e.g., PDF, DOCX, TXT, Markdown).
Installation
-
Clone the Repository:
git clone https://github.com/RayenMalouche/MCP-PDF-Extractor-server-v2.git cd MCP-PDF-Extractor-server-v2
-
Create the File Directory:
- The server reads files from the
file-to-extract
directory by default, as specified inapplication.properties
. - Create this directory in the project root:
mkdir file-to-extract
- Place the files you want to extract (e.g., PDFs, Word documents) in this directory.
- The server reads files from the
-
Build the Project:
- Use Maven to build the project:
mvn clean install
- Use Maven to build the project:
Configuration
The server configuration is defined in src/main/resources/application.properties
:
spring.application.name=FileExtractMcpServer
file.directory=file-to-extract
- spring.application.name: Name of the application.
- file.directory: Directory where input files are stored (default:
file-to-extract
). Update this property to change the directory path if needed.
Running the Server
The server can be run in two modes: HTTP/SSE or STDIO.
1. HTTP/SSE Mode
- Default mode, suitable for web-based or MCP Inspector integration.
- Run the server:
mvn spring-boot:run
- For streamable HTTP (MCP Inspector compatibility):
mvn spring-boot:run -- --streamable-http
- The server starts on
http://localhost:45452
with the following endpoints:- MCP Endpoint:
http://localhost:45452/
(or/message
for streamable HTTP). - SSE Endpoint:
http://localhost:45452/sse
. - Test Endpoint:
http://localhost:45452/api/test-extract
(POST). - Health Check:
http://localhost:45452/api/health
(GET/POST).
- MCP Endpoint:
2. STDIO Mode
- Suitable for command-line or local MCP client integration.
- Run the server:
mvn spring-boot:run -- --stdio
Example Usage
Using the MCP Tool
- The
extract-file-to-html
tool accepts a JSON payload with afilename
field. - Example request (via MCP client or HTTP POST to
/api/test-extract
):{ "filename": "example.pdf" }
- Example using
curl
:curl -X POST http://localhost:45452/api/test-extract \ -H "Content-Type: application/json" \ -d '{"filename":"example.pdf"}'
- Expected response:
{ "status": "success", "message": "File content extracted successfully", "html": "<html>...</html>", "metadata": { "filename": "example.pdf", "contentType": "application/pdf" } }
- If the file is not found or extraction fails, an error response is returned:
{ "status": "error", "message": "Failed to extract file: File not found", "errorType": "IOException" }
Health Check
- Verify server status:
curl http://localhost:45452/api/health
- Response:
{ "status": "healthy", "server": "File Extract MCP Server", "version": "1.0.0" }
Project Structure
PDFExtractor/
βββ src/
β βββ main/
β β βββ java/com/mcp/RayenMalouche/pdf/v2/PDFExtractor/
β β β βββ PdfExtractorApplication.java # Main application with MCP server logic
β β β βββ config/
β β β β βββ ConfigLoader.java # Loads configuration from application.properties
β β βββ resources/
β β β βββ application.properties # Configuration file
β βββ test/ # Test sources (not included in this setup)
βββ file-to-extract/ # Directory for input files
βββ pom.xml # Maven configuration
βββ README.md # This file
Dependencies
- Spring Boot: Framework for building the application (
spring-boot-starter
,spring-boot-starter-web
). - Apache Tika: For file parsing and HTML conversion (
tika-core
,tika-parsers-standard-package
). - MCP SDK: For MCP protocol support (
io.modelcontextprotocol.sdk:mcp
). - Jetty: Embedded server for HTTP/SSE transport (
jetty-server
,jetty-ee10-servlet
). - Jackson: JSON processing (
jackson-databind
).
Limitations
- Image Handling: Embedded images in documents (e.g., in DOCX or PDF) are referenced in the HTML output (e.g.,
src="embedded:image1.jpeg"
) but not served directly. Additional logic is needed to extract and serve images (see Future Improvements). - Styling: The generated HTML lacks CSS styling, which may affect rendering. Basic formatting (e.g.,
<b>
,<i>
) is preserved, but visual fidelity to the original document is limited. - File Size: Large files may impact performance due to Tikaβs parsing and HTML conversion overhead.
- Supported Formats: Depends on Apache Tikaβs capabilities. Common formats (PDF, DOCX, TXT, Markdown) are supported, but exotic formats may require additional Tika parsers.
Future Improvements
- Image Support: Extract embedded images to a directory and serve them via a dedicated endpoint (e.g.,
/api/images/<filename>
). - CSS Styling: Include a default stylesheet in the HTML output to improve rendering.
- Link Validation: Ensure table of contents and figure links resolve correctly in the HTML output.
- Pagination: Support streaming or paginated responses for large documents to improve performance.
- Metadata Enrichment: Include additional Tika metadata (e.g., word count, creation date) in the response.
- File Upload: Add an endpoint to upload files to the
file-to-extract
directory dynamically.
Troubleshooting
- File Not Found Error:
- Ensure the file exists in the
file-to-extract
directory. - Check the filename in the request matches exactly (case-sensitive).
- Ensure the file exists in the
- Port Conflict:
- If port
45452
is in use, update the port inPdfExtractorApplication.java
(line:connector.setPort(45452)
).
- If port
- Tika Parsing Errors:
- Verify the file format is supported by Tika.
- Update Tika dependencies to the latest version if issues persist.
- Configuration Issues:
- Ensure
application.properties
is insrc/main/resources/
. - Check that
file.directory
points to a valid directory.
- Ensure
Contributing
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch (
git checkout -b feature/YourFeature
). - Commit changes (
git commit -m "Add YourFeature"
). - Push to the branch (
git push origin feature/YourFeature
). - Open a pull request.
Please include tests and update documentation as needed.
Contact
For questions or support, contact the project maintainer:
- Name: Mohamed Rayen Malouche
- Email: rayenmalouche27@gmail.com