dsayswhat/rag-experiment
PDF-to-PostgreSQL MCP Framework
A complete, domain-agnostic framework for extracting high-quality text from PDFs, storing it in a searchable vector database, and providing AI assistant integration through a Model Context Protocol server.
Overview
Status: OPERATIONAL - Complete system from PDF extraction to AI assistant integration.
This framework provides a production-ready solution for PDF-based content management with AI integration:
- Text Extraction: Clean, structured text extraction from any PDF document
- Semantic Chunking: Intelligent section-based content organization
- Vector Database: PostgreSQL + pgvector storage with OpenAI embeddings
- Content Classification: Configurable categorization system for any content domain
- MCP Server: Model Context Protocol server providing semantic search for AI assistants
- Modular Architecture: Centralized configuration and utilities across all components
Well suited to technical documentation, research papers, legal documents, manuals, and any other PDF-based knowledge management workflow that needs AI assistant integration.
Key Features
Text Extraction
- High-Quality Extraction: Uses PyMuPDF4LLM for superior text extraction compared to traditional methods
- Markdown Formatting: Outputs structured Markdown with headers, bold text, and proper formatting
- Semantic Chunking: Intelligent section-based organization
- Batch Processing: Handle multiple PDFs in a single operation
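As a rough illustration of section-based chunking, the sketch below splits extracted Markdown into one chunk per heading. It is not the framework's actual chunker: `Section` and `split_sections` are hypothetical names, and the real logic in `extraction/pdf_extractor.py` may use different rules.

```python
import re
from dataclasses import dataclass
from typing import List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


@dataclass
class Section:
    level: int   # heading depth (0 for text before the first heading)
    title: str
    body: str


def split_sections(markdown: str) -> List[Section]:
    """Split Markdown into one Section per heading (hypothetical helper)."""
    sections: List[Section] = []
    level, title, lines = 0, "", []
    in_fence = False

    def flush() -> None:
        # Keep a chunk only if it has a title or some non-blank text.
        if title or any(line.strip() for line in lines):
            sections.append(Section(level, title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with '#' inside a code fence are not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level, title, lines = len(match.group(1)), match.group(2), []
        else:
            lines.append(line)
    flush()
    return sections
```

Splitting on headings keeps each chunk aligned with a unit the author already considered coherent, which tends to suit embedding-based retrieval better than fixed-size windows.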
Vector Database
- PostgreSQL + pgvector: Production-ready vector storage with semantic search
- OpenAI Embeddings: High-quality text embeddings using text-embedding-ada-002
- Configurable Classification: User-defined content types and categories
- Multi-Source Support: Official documents, supplementary materials, and custom notes
AI Integration
- MCP Server: Claude Code and Claude Desktop compatible
- Semantic Search: Vector similarity search with filtering capabilities
- CRUD Operations: Create, read, and update content through AI assistants
- Configurable Domains: Adapt for any content type or industry
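To make "vector similarity search with filtering" concrete, here is a minimal in-memory sketch in plain Python. In the framework itself the ranking is presumably done inside PostgreSQL by pgvector (which provides a cosine distance operator, `<=>`). The row layout and the `search` function here are assumptions for illustration only.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: Sequence[float],
    rows: List[Dict],
    content_type: Optional[str] = None,
    limit: int = 3,
) -> List[Tuple[float, str]]:
    """Rank rows by similarity to query_vec, optionally filtered by type."""
    candidates = [
        r for r in rows
        if content_type is None or r["content_type"] == content_type
    ]
    scored = [
        (cosine_similarity(query_vec, r["embedding"]), r["title"])
        for r in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:limit]


# Toy two-dimensional "embeddings"; real ones have many more dimensions.
rows = [
    {"title": "Install", "content_type": "procedure", "embedding": [1.0, 0.0]},
    {"title": "Glossary", "content_type": "reference", "embedding": [0.0, 1.0]},
    {"title": "Upgrade", "content_type": "procedure", "embedding": [0.6, 0.8]},
]
```

Calling `search([1.0, 0.0], rows, content_type="procedure")` ranks only the two procedure rows, with "Install" first.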
Quick Start
1. Text Extraction Only
# Set up the extraction environment (run from the cloned repository)
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
# Extract document sections
python extraction/pdf_extractor.py data/sample_document.pdf
2. Full Vector Database Setup
See database/README.md for complete PostgreSQL + pgvector installation and content loading.
# Quick version (after PostgreSQL setup):
cd database/
pip install -r requirements.txt
cp .env.example .env # Edit with your database URL and OpenAI API key
python load_content.py --sections-dir ../output/
3. AI Assistant Integration
See mcp_server/README.md for Claude integration setup.
# Start MCP server
cd mcp_server/
pip install -r requirements.txt
python content_server.py
Architecture
Directory Structure
pdf-to-postgresql-mcp/
├── core/ # Framework utilities
│ ├── config.py # Environment configuration
│ ├── database.py # Database utilities
│ └── openai_utils.py # AI integration
├── extraction/ # PDF processing
│ ├── pdf_extractor.py # Generic section extraction
│ └── extract-dolmenwood.py # Example implementation
├── database/ # Vector database
│ ├── schema.sql # PostgreSQL schema
│ ├── load_content.py # Content loading
│ └── README.md # Setup guide
├── mcp_server/ # AI integration
│ ├── content_server.py # MCP server
│ └── README.md # Integration guide
├── docs/ # Documentation
└── output/ # Extracted content
Configuration
Content Types
The framework supports fully configurable content types via environment variables:
# Example configurations for different domains
CONTENT_TYPES="reference,procedure,concept,analysis,example" # Technical docs
CONTENT_TYPES="statute,regulation,case,procedure,form" # Legal docs
CONTENT_TYPES="spell,location,rule,lore,equipment" # Gaming docs
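One plausible way to turn a `CONTENT_TYPES` value into a validated list is sketched below. `parse_content_types` and `validate_content_type` are hypothetical helpers, and the normalisation rules (trimming, lower-casing, de-duplication) are assumptions rather than documented framework behaviour.

```python
from typing import List


def parse_content_types(raw: str) -> List[str]:
    """Split a comma-separated CONTENT_TYPES value into unique, ordered names."""
    seen, result = set(), []
    for item in raw.split(","):
        name = item.strip().lower()
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    if not result:
        raise ValueError("CONTENT_TYPES must list at least one type")
    return result


def validate_content_type(value: str, allowed: List[str]) -> str:
    """Return the normalised type, or raise if it is not configured."""
    name = value.strip().lower()
    if name not in allowed:
        raise ValueError(
            f"unknown content type {value!r}; expected one of {allowed}"
        )
    return name
```

Rejecting unknown types at load time keeps the set of categories stored in the database consistent with the configured domain.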
Environment Setup
# Copy and configure environment
cp .env.example .env
# Edit .env with your settings:
DATABASE_URL="postgresql://user:pass@localhost/content_database"
OPENAI_API_KEY="your_openai_api_key"
CONTENT_DOMAIN="technical_documentation"
CONTENT_TYPES="reference,procedure,concept,guide,example"
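The settings above could be read and checked at startup roughly as follows. This is a sketch, not the contents of `core/config.py`: the `Settings` class, the `load_settings` function, and the fallback domain name `"general"` are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Settings:
    database_url: str
    openai_api_key: str
    content_domain: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on gaps."""
    missing = [k for k in ("DATABASE_URL", "OPENAI_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))

    scheme = urlsplit(env["DATABASE_URL"]).scheme
    if scheme not in ("postgresql", "postgres"):
        raise RuntimeError(
            f"DATABASE_URL must be a PostgreSQL URL, got scheme {scheme!r}"
        )

    return Settings(
        database_url=env["DATABASE_URL"],
        openai_api_key=env["OPENAI_API_KEY"],
        # "general" is an assumed fallback, not a documented default.
        content_domain=env.get("CONTENT_DOMAIN", "general"),
    )
```

Passing the environment in as a mapping makes the loader easy to exercise in tests without touching the real process environment.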
Usage Examples
Basic Text Extraction
# Extract all content from a document
python extraction/pdf_extractor.py data/technical_manual.pdf
# Extract specific pages
python extraction/pdf_extractor.py data/manual.pdf --pages 15 16 17
# Batch process multiple documents
python extraction/pdf_extractor.py data/*.pdf
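The three invocations above are consistent with a command line that takes one or more PDF paths plus an optional list of page numbers. The `argparse` sketch below reproduces that shape. It is a guess at the interface, and `extraction/pdf_extractor.py` may parse its arguments differently. In the batch example, the shell expands `data/*.pdf` before Python starts, so the script simply receives several paths.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the usage examples above."""
    parser = argparse.ArgumentParser(
        description="Extract sections from one or more PDFs (illustrative)"
    )
    parser.add_argument("pdfs", nargs="+", help="one or more PDF paths")
    parser.add_argument(
        "--pages",
        nargs="+",
        type=int,
        default=None,
        help="page numbers to extract (default: all pages)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["data/manual.pdf", "--pages", "15", "16", "17"])
print(args.pdfs, args.pages)
```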
Database Operations
# Load extracted content into database
python database/load_content.py --sections-dir output/
# Load with custom source type
python database/load_content.py --sections-dir notes/ --source-type supplementary
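Loading a section into pgvector comes down to inserting its text alongside an embedding. The sketch below shows one way to prepare such a row. The table and column names (`content_sections`, `source_type`, and so on) are invented for illustration and may not match `database/schema.sql`. The `[x,y,z]` text form is the input format pgvector accepts for a `vector` value, and 1536 is the output dimension of text-embedding-ada-002.

```python
import math
from typing import Sequence, Tuple

# Hypothetical table layout; placeholders follow the psycopg "%s" style.
INSERT_SQL = (
    "INSERT INTO content_sections (title, body, source_type, embedding) "
    "VALUES (%s, %s, %s, %s::vector)"
)

ADA_002_DIMENSIONS = 1536


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.1,2.0,-3.5]'."""
    values = [float(x) for x in embedding]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return "[" + ",".join(repr(v) for v in values) + "]"


def build_insert_params(
    title: str,
    body: str,
    source_type: str,
    embedding: Sequence[float],
    expected_dim: int = ADA_002_DIMENSIONS,
) -> Tuple[str, str, str, str]:
    """Return the parameter tuple to pass alongside INSERT_SQL."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    return (title, body, source_type, to_vector_literal(embedding))
```

With a database driver such as psycopg, the pair would be used as `cursor.execute(INSERT_SQL, params)`. Checking the dimension before the insert gives a clearer error than a failed cast inside the database.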
AI Assistant Integration
# Start MCP server for AI integration
python mcp_server/content_server.py
# Use with Claude Code (server runs automatically)
# Or configure for Claude Desktop (see mcp_server/README.md)
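Claude Desktop discovers MCP servers through an `mcpServers` entry in its JSON configuration file. The snippet below generates such an entry. The server name `content-server` and the script path are placeholders, and mcp_server/README.md remains the authoritative source for the exact settings this project expects.

```python
import json
from typing import Dict


def claude_desktop_entry(
    server_script: str,
    python_executable: str = "python",
    name: str = "content-server",  # placeholder name, choose your own
) -> Dict:
    """Build an mcpServers config fragment for a Python-based MCP server."""
    return {
        "mcpServers": {
            name: {
                "command": python_executable,
                "args": [server_script],
            }
        }
    }


# Use an absolute path to the server script on your machine.
print(json.dumps(claude_desktop_entry("/path/to/mcp_server/content_server.py"), indent=2))
```

The printed JSON can be merged into the existing configuration file. If the server runs inside a virtual environment, pass that environment's interpreter as `python_executable`.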
Content Management
The framework supports multiple content workflows:
- PDF Extraction: Process existing documents
- Custom Content: Add notes and supplementary materials
- AI Integration: Search and manage content through AI assistants
- Version Control: Track changes and updates
See the project documentation for detailed workflows.
Use Cases
Technical Documentation
- API documentation processing
- User manual digitization
- Knowledge base creation
- Support documentation search
Research & Academic
- Research paper analysis
- Literature review assistance
- Citation and reference management
- Academic content organization
Legal & Compliance
- Regulatory document processing
- Legal research assistance
- Policy and procedure management
- Compliance documentation
Enterprise Knowledge Management
- Internal documentation systems
- Training material digitization
- Procedural knowledge capture
- Expert system development
Performance
- Processing Speed: ~1-2 seconds per page for text extraction
- Memory Efficiency: Processes documents incrementally
- Database Performance: Vector search with configurable indexing
- AI Integration: Batched embedding generation for efficiency
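Batched embedding generation means sending many texts per API request instead of one request per section. A minimal sketch of the batching logic follows. `embed_batch` stands in for whatever function actually calls the embeddings API, and the batch size of 100 is an arbitrary assumption.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed all texts, calling embed_batch once per batch, order preserved."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding call returned the wrong number of vectors")
        vectors.extend(result)
    return vectors
```

Injecting `embed_batch` as a parameter keeps the batching logic testable without network access or an API key.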
Documentation
- Complete setup instructions
- PDF processing details
- Workflow documentation
- Implementation details
- AI assistant setup
Requirements
- Python 3.8+
- PostgreSQL 12+ with pgvector extension
- OpenAI API key for embeddings
- 2GB+ RAM for processing large documents
Framework Design
This framework was designed with the following principles:
- Domain Agnostic: Works with any PDF-based content
- Modular Architecture: Composable components for different use cases
- Production Ready: Robust error handling and performance optimization
- AI Native: Built for AI assistant integration from the ground up
- Extensible: Easy to customize and extend for specific needs
Contributing
The framework is designed to be easily extended and customized. Key extension points:
- Content Classification: Add custom content type detection
- Extraction Methods: Implement domain-specific extraction logic
- AI Integration: Extend MCP server with additional capabilities
- Database Schema: Add custom metadata fields and indexes
License
This project is provided as-is for educational and development purposes. Please ensure compliance with relevant licenses for dependencies and content sources.