MDConverter

hublun/MDConverter


MDConverter is a comprehensive Python ecosystem designed to convert HTML webpage packages into clean, well-formatted Markdown files, featuring both a standalone package and an MCP server implementation.

MDConverter - HTML to Markdown Converter

A comprehensive Python ecosystem for converting HTML webpage packages into clean, well-formatted Markdown files. This repository contains both a standalone Python package and an MCP (Model Context Protocol) server implementation.

🚀 Features

  • Clean Conversion: Converts HTML to well-formatted Markdown
  • Image Processing: Handles and preserves images from webpage packages
  • Metadata Extraction: Extracts and preserves article metadata (title, author, description, etc.)
  • Content Cleaning: Removes ads, scripts, navigation, and other unwanted elements
  • Code Block Preservation: Maintains syntax highlighting in code blocks
  • Configurable Output: Extensive configuration options via files or CLI
  • Multiple Interfaces: CLI, Python API, and MCP server
  • Package Structure: Proper Python package with modular design
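
To make the Content Cleaning bullet concrete, here is a minimal sketch of the general idea using only the standard library. It is not MDConverter's implementation: the `SKIPPED_TAGS` set, the `TextExtractor` class, and the `visible_text` helper are hypothetical names chosen for illustration, and the real package may use different rules and a different parser.

```python
from html.parser import HTMLParser

# Tags whose whole subtree is dropped. This set is an assumption made for
# illustration; it is not taken from MDConverter.
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text chunks that survive cleaning, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Tracking a depth counter rather than a boolean keeps the sketch correct when one skipped element is nested inside another.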

📁 Repository Structure

MDConverter/
├── README.md                         # This file
├── LICENSE                           # License file
│
├── standalone/                       # 📦 STANDALONE PACKAGE
│   ├── setup.py                      # Package setup
│   ├── requirements.txt              # Dependencies
│   ├── html_to_markdown_converter.py # Main converter script
│   ├── src/mdconverter/              # Package source
│   ├── tests/                        # Test suite
│   ├── examples/                     # Usage examples
│   ├── docs/                         # Documentation
│   ├── assets/                       # Processed images and templates
│   └── output/                       # Default output directory
│
└── mcp-server/                       # 🔌 MCP SERVER
    ├── pyproject.toml                # Modern Python project config
    ├── src/                          # MCP server source
    ├── tests/                        # MCP server tests
    ├── converted_articles/           # Example conversions
    ├── test_output/                  # Test outputs
    └── config/                       # MCP configuration files

🔧 Installation

Standalone Package

# Clone the repository
git clone https://github.com/hublun/MDConverter.git
cd MDConverter/standalone

# Install the package
pip install -e .

# Or install dependencies directly
pip install -r requirements.txt

MCP Server

# Navigate to MCP server directory
cd MDConverter/mcp-server

# Install the MCP server
pip install -e .

📖 Usage

Command Line Interface (Standalone)

# Basic usage
python html_to_markdown_converter.py input.html

# With custom output file
python html_to_markdown_converter.py input.html output.md

# Using the package CLI
mdconverter input.html -o output.md

# With configuration file
mdconverter input.html --config config.json

Python API (Standalone)

from mdconverter import HTMLToMarkdownConverter, Config

# Basic conversion
converter = HTMLToMarkdownConverter("input.html")
success = converter.convert()

# With custom configuration
config = Config({
    'output_dir': 'custom_output',
    'add_metadata': True,
    'log_level': 'DEBUG'
})

converter = HTMLToMarkdownConverter(
    "input.html", 
    output_file="custom.md",
    config=config
)
success = converter.convert()
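
For a folder of saved pages, the calls above can be wrapped in a loop. The following is a minimal sketch, not part of the package: `convert_directory` and `mdconverter_convert` are hypothetical helpers, and it assumes that `output_file` accepts a full path and that `convert()` returns a truthy value on success, which should be checked against the package itself. The converter is passed in as a callable so the loop can be exercised without MDConverter installed.

```python
from pathlib import Path


def convert_directory(source_dir, output_dir, convert_one):
    """Convert every .html file in source_dir and return {filename: success}.

    convert_one(input_path, output_path) -> bool is supplied by the caller,
    so this helper assumes nothing else about the converter.
    """
    source = Path(source_dir)
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)

    results = {}
    for html_path in sorted(source.glob("*.html")):
        md_path = target / (html_path.stem + ".md")
        try:
            results[html_path.name] = bool(convert_one(str(html_path), str(md_path)))
        except Exception:
            # One broken page should not abort the whole batch.
            results[html_path.name] = False
    return results


def mdconverter_convert(input_path, output_path):
    """Adapter around the API shown above. The import is deferred so that
    convert_directory can be used and tested without the package."""
    from mdconverter import HTMLToMarkdownConverter

    return HTMLToMarkdownConverter(input_path, output_file=output_path).convert()
```

With the package installed, a call would look like `convert_directory("pages", "output", mdconverter_convert)`.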

MCP Server Usage

Add to your MCP client configuration:

{
  "mcpServers": {
    "mdconverter": {
      "command": "mdconverter-mcp",
      "args": []
    }
  }
}
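
If the client configuration file already lists other servers, the entry can be merged in programmatically instead of by hand. This is a minimal sketch under stated assumptions: `add_mdconverter_server` is a hypothetical helper, and the location of the configuration file depends on the MCP client, so the path is left to the caller.

```python
import json
from pathlib import Path


def add_mdconverter_server(config_path):
    """Insert the mdconverter entry into an MCP client config file,
    keeping any servers that are already configured."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    # Same entry as in the JSON snippet above.
    servers = config.setdefault("mcpServers", {})
    servers["mdconverter"] = {"command": "mdconverter-mcp", "args": []}

    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```
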

Available MCP Tools

  1. convert_html_to_markdown: Full HTML to Markdown conversion
  2. validate_html_file: Validate HTML files before conversion
  3. get_html_metadata: Extract metadata without full conversion
  4. list_supported_formats: Show supported formats and features
  5. convert_html_content: Convert HTML content strings directly

🛠️ Configuration

Standalone Package

Create a config.json file:

{
  "output_dir": "output",
  "images_dir": "assets/images",
  "preserve_images": true,
  "clean_html": true,
  "add_metadata": true,
  "log_level": "INFO"
}
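
A typo such as `"preserve_images": "yes"` is easy to miss in a hand-edited file. The sketch below loads the file and checks the value types of the six keys shown above before the settings are used. `load_settings` and `EXPECTED_TYPES` are hypothetical names, the expected types are inferred from the example values rather than from the package, and keys outside the example are passed through untouched because the full option list is not documented here.

```python
import json

# Types inferred from the example config above (an assumption, not a
# schema published by the package).
EXPECTED_TYPES = {
    "output_dir": str,
    "images_dir": str,
    "preserve_images": bool,
    "clean_html": bool,
    "add_metadata": bool,
    "log_level": str,
}


def load_settings(path):
    """Read a JSON config file and type-check the keys shown in the example."""
    with open(path, encoding="utf-8") as handle:
        settings = json.load(handle)

    for key, expected in EXPECTED_TYPES.items():
        if key in settings and not isinstance(settings[key], expected):
            actual = type(settings[key]).__name__
            raise TypeError(f"{key!r} should be {expected.__name__}, got {actual}")
    return settings
```

Since `Config` is constructed from a dictionary in the Python API example, the result could presumably be passed on as `Config(load_settings("config.json"))`.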

MCP Server

The MCP server supports the same configuration options via tool parameters:

{
  "tool": "convert_html_to_markdown",
  "arguments": {
    "html_file_path": "/path/to/webpage.html",
    "output_dir": "converted_articles",
    "preserve_images": true,
    "add_metadata": true
  }
}
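
On the wire, MCP clients send tool invocations as JSON-RPC 2.0 requests with the method `tools/call`, where the tool name and arguments above go into `params`. Most clients build this message themselves; the sketch below only illustrates the shape. `tool_call_request` is a hypothetical helper, and the request id is arbitrary.

```python
import json


def tool_call_request(tool, arguments, request_id=1):
    """Build a JSON-RPC 2.0 'tools/call' request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = tool_call_request(
    "convert_html_to_markdown",
    {
        "html_file_path": "/path/to/webpage.html",
        "output_dir": "converted_articles",
        "preserve_images": True,
        "add_metadata": True,
    },
)
print(json.dumps(request, indent=2))
```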

🧪 Testing

Standalone Package

cd standalone
python -m pytest tests/

MCP Server

cd mcp-server
python test_conversion.py

📚 Documentation

🔄 Migration Guide

From Legacy Script

If you were using the old html_to_markdown_converter.py script:

# Old way
python html_to_markdown_converter.py input.html

# New way (still supported)
cd standalone
python html_to_markdown_converter.py input.html

# Or use the package
mdconverter input.html

From Standalone to MCP

The MCP server provides the same functionality with additional features:

# Standalone
converter = HTMLToMarkdownConverter("file.html")
converter.convert()

# MCP (tool call, as JSON)
{
  "tool": "convert_html_to_markdown",
  "arguments": {"html_file_path": "file.html"}
}

🎯 Key Features

Content Processing

  • Smart Content Extraction: Identifies and extracts main article content
  • Metadata Preservation: YAML frontmatter with article information
  • Image Organization: Copies and organizes image files
  • Code Syntax Preservation: Maintains syntax highlighting
  • Clean Output: Removes unwanted elements (ads, navigation, etc.)
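
The Metadata Preservation bullet can be illustrated with a short standard-library sketch that reads the title and a few `<meta>` tags and renders them as YAML frontmatter. This is not the package's code: `MetadataParser`, `WANTED_META`, and `to_frontmatter` are hypothetical names, and the fields MDConverter actually emits may differ. Values are written with `json.dumps`, since a JSON string is also a valid double-quoted YAML scalar and this avoids hand-written escaping.

```python
import json
from html.parser import HTMLParser

# Meta tag names to keep; an assumption made for illustration.
WANTED_META = {"author", "description", "keywords"}


class MetadataParser(HTMLParser):
    """Collect the <title> text and selected <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") in WANTED_META:
            content = attributes.get("content")
            if content:
                self.metadata[attributes["name"]] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.metadata["title"] = data.strip()


def to_frontmatter(metadata):
    """Render a flat dict of strings as a YAML frontmatter block."""
    lines = ["---"]
    for key, value in metadata.items():
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


def frontmatter_from_html(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return to_frontmatter(parser.metadata)
```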

Multiple Interfaces

  • CLI Tool: Command-line interface for batch processing
  • Python API: Programmatic access for integration
  • MCP Server: Model Context Protocol for AI assistant integration

Advanced Features

  • Configurable Processing: Extensive customization options
  • Error Handling: Comprehensive validation and error reporting
  • Multiple Formats: Support for various HTML structures
  • Template System: Customizable output templates

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new features
  5. Update documentation
  6. Submit a pull request

📄 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

🆘 Support

  • Issues: Report bugs and feature requests via GitHub Issues
  • Documentation: Check the docs/ directories for detailed guides
  • Examples: See the examples/ directories for usage examples

Note: This repository was extracted from the Leet_Vibe repository to create a focused, standalone HTML to Markdown conversion tool.