marc-shade/synthetic-data-mcp
If you are the rightful owner of synthetic-data-mcp and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to henry@mcphub.com.
The Synthetic Data MCP Server is an enterprise-grade solution designed for generating privacy-compliant synthetic datasets, particularly for regulated industries requiring compliance with standards like HIPAA, PCI DSS, SOX, and GDPR. It supports multiple LLM providers.
Synthetic Data MCP Server
Enterprise-grade Model Context Protocol (MCP) server for generating privacy-compliant synthetic datasets. Built for regulated industries requiring HIPAA, PCI DSS, SOX, and GDPR compliance with multiple LLM provider support.
๐ Features
Core Capabilities
- Privacy-First Local Inference: Ollama integration for 100% local data generation
- Domain-Specific Generation: Specialized synthetic data for healthcare and finance
- Privacy Protection: Differential privacy, k-anonymity, l-diversity
- PII Safety Guarantee: Never retains or outputs original personal data
- Compliance Validation: HIPAA, PCI DSS, SOX, GDPR compliance checking
- Statistical Fidelity: Advanced validation to ensure data utility
- Audit Trail: Comprehensive logging for regulatory compliance
- Multi-Provider Support: Ollama (default), OpenAI, Anthropic, Google, OpenRouter
LLM Provider Support (2025 Models)
- OpenAI: GPT-5, GPT-5 Mini/Nano, GPT-4o
- Anthropic: Claude Opus 4.1, Claude Sonnet 4 (1M context), Claude 3.5 series
- Google: Gemini 2.5 Pro/Flash/Flash-Lite (1M+ context, multimodal)
- Local Models: Dynamic Ollama integration (Llama 3.3, Qwen 2.5/3, DeepSeek-R1, Mistral Small 3)
- Smart Routing: Automatic provider selection with cost optimization
- Fallback: Multi-tier fallback with local model support
Technology Stack (2025 Latest)
- FastAPI 0.116+: High-performance async web framework
- FastMCP: High-performance MCP server implementation
- Pydantic 2.11+: Type-safe data validation with enhanced performance
- SQLAlchemy 2.0+: Modern async ORM with type safety
- DSPy: Language model programming framework for intelligent data generation
- NumPy 2.3+ & Pandas 2.3+: Advanced data processing capabilities
- Redis & DiskCache: Multi-tier caching for cost optimization
- Rich: Beautiful terminal interfaces and progress indicators
๐ฏ Enterprise Benefits
- Privacy-First: Generate synthetic data without exposing sensitive information
- Compliance-Ready: Built-in validation for HIPAA, PCI DSS, SOX, and GDPR
- Multi-Provider: Support for cloud APIs and local inference
- Production-Scale: High-performance generation for enterprise data volumes
- Zero Vendor Lock-in: Switch between providers seamlessly
- Cost Control: Use local models for unlimited generation
๐ฅ Healthcare Use Cases
- Patient record synthesis with HIPAA Safe Harbor compliance
- Clinical trial data generation for FDA submissions
- Medical research datasets without PHI exposure
- Drug discovery data augmentation
- Healthcare analytics and ML model training
- EHR system testing and validation
๐ฐ Finance Use Cases
- Transaction pattern modeling for fraud detection
- Credit risk assessment dataset generation
- Regulatory stress testing data (Basel III, Dodd-Frank)
- PCI DSS compliant payment data synthesis
- Trading algorithm development and backtesting
- Financial reporting system validation
๐ ๏ธ Installation
Production Installation
pip install synthetic-data-mcp
Development Installation
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
pip install -e ".[dev,healthcare,finance]"
๐ฏ Quick Start
1. Configure LLM Provider
Choose your preferred provider:
OpenAI (Recommended for Production)
export OPENAI_API_KEY="sk-your-key-here"
Anthropic Claude
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
Google Gemini
export GOOGLE_API_KEY="your-key-here"
OpenRouter (Access to 100+ Models)
export OPENROUTER_API_KEY="sk-or-your-key-here"
export OPENROUTER_MODEL="meta-llama/llama-3.1-8b-instruct"
Local Models (Ollama) - Privacy-First (DEFAULT)
# Install Ollama first: https://ollama.ai
ollama pull mistral-small:latest # Or any preferred model
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="mistral-small:latest"
# The system automatically detects and uses Ollama if available
# No API keys required for local inference!
2. Start the MCP Server
synthetic-data-mcp serve --port 3000
3. Add to Claude Desktop Configuration
{
"mcpServers": {
"synthetic-data": {
"command": "python",
"args": ["-m", "synthetic_data_mcp.server"],
"env": {
"OPENAI_API_KEY": "your-api-key"
}
}
}
}
4. Generate Synthetic Data
# Using the MCP client
result = await client.call_tool(
"generate_synthetic_dataset",
{
"domain": "healthcare",
"dataset_type": "patient_records",
"record_count": 10000,
"privacy_level": "high",
"compliance_frameworks": ["hipaa"],
"output_format": "json"
}
)
๐๏ธ Provider Configuration
Priority-Based Provider Selection
The system automatically selects the best available provider:
- Local Models (Ollama) - Highest privacy, no API costs
- OpenAI - Best performance and reliability
- Anthropic Claude - Excellent reasoning capabilities
- Google Gemini - Fast and cost-effective
- OpenRouter - Access to open source models
- Fallback Mock - Testing and development
Provider-Specific Configuration
OpenAI Configuration
# Environment variables
OPENAI_API_KEY="sk-your-key-here"
OPENAI_MODEL="gpt-4" # or gpt-4-turbo, gpt-3.5-turbo
OPENAI_TEMPERATURE="0.7"
OPENAI_MAX_TOKENS="2000"
Anthropic Configuration
# Environment variables
ANTHROPIC_API_KEY="sk-ant-your-key-here"
ANTHROPIC_MODEL="claude-3-opus-20240229" # or claude-3-sonnet, claude-3-haiku
ANTHROPIC_MAX_TOKENS="2000"
Local Ollama Configuration
# Environment variables
OLLAMA_BASE_URL="http://localhost:11434"
OLLAMA_MODEL="llama3.1:8b" # or any installed model
# Supported local models:
# - llama3.1:8b, llama3.1:70b
# - mistral:7b, mixtral:8x7b
# - qwen2:7b, deepseek-coder:6.7b
# - and 20+ more models
๐ง Available MCP Tools
generate_synthetic_dataset
Generate domain-specific synthetic datasets with compliance validation.
Parameters:
domain
: Healthcare, finance, or customdataset_type
: Patient records, transactions, clinical trials, etc.record_count
: Number of synthetic records to generateprivacy_level
: Privacy protection level (low/medium/high/maximum)compliance_frameworks
: Required compliance validationsoutput_format
: JSON, CSV, Parquet, or database exportprovider
: Override automatic provider selection
validate_dataset_compliance
Validate existing datasets against regulatory requirements.
analyze_privacy_risk
Comprehensive privacy risk assessment for datasets.
generate_domain_schema
Create Pydantic schemas for domain-specific data structures.
benchmark_synthetic_data
Performance and utility benchmarking against real data.
๐ Compliance Frameworks
Healthcare Compliance
- HIPAA Safe Harbor: Automatic validation of 18 identifiers
- HIPAA Expert Determination: Statistical disclosure control
- FDA Guidance: Synthetic clinical data for submissions
- GDPR: Healthcare data processing compliance
- HITECH: Security and breach notification
Finance Compliance
- PCI DSS: Payment card industry data security
- SOX: Sarbanes-Oxley internal controls
- Basel III: Banking regulatory framework
- MiFID II: Markets in Financial Instruments Directive
- Dodd-Frank: Financial reform regulations
๐ Privacy Protection
Core Privacy Features
- Differential Privacy: Configurable ฮต values (0.1-1.0)
- Statistical Disclosure Control: k-anonymity, l-diversity, t-closeness
- Synthetic Data Indistinguishability: Provable privacy guarantees
- Re-identification Risk Assessment: Continuous monitoring
- Privacy Budget Management: Automatic composition tracking
PII Protection Guarantee
- NO Data Retention: Original personal data is NEVER stored
- Automatic PII Detection: Identifies names, emails, SSNs, phones, addresses, credit cards
- Complete Anonymization: All PII is anonymized before pattern learning
- Statistical Learning Only: Only learns distributions, means, and frequencies
- 100% Synthetic Output: Generated data is completely fake
Credit Card Safety
- Test Card Numbers Only: Uses official test cards (4242-4242-4242-4242, etc.)
- Provider Support: Visa, Mastercard, AmEx, Discover, and more
- Configurable Providers: Specify provider or use weighted distribution
- Never Real Cards: Original credit card numbers are never retained or output
Example usage with credit card provider selection:
# Use specific provider test cards
result = await pipeline.ingest(
source=data,
credit_card_provider='visa' # Uses Visa test cards
)
# Or let system use mixed providers (default)
result = await pipeline.ingest(
source=data # Automatically uses weighted distribution
)
๐ Performance & Quality
- Statistical Fidelity: 95%+ correlation preservation
- Privacy Preservation: <1% re-identification risk
- Utility Preservation: >90% ML model performance
- Compliance Rate: 100% regulatory framework adherence
- Generation Speed: 1,000-10,000 records/second (provider dependent)
Provider Performance Comparison
Provider | Speed (req/s) | Quality | Privacy | Cost |
---|---|---|---|---|
Ollama Local | 10-50 | High | Maximum | Free |
OpenAI GPT-4 | 20-100 | Excellent | Medium | $$$ |
Claude 3 Opus | 15-80 | Excellent | Medium | $$$ |
Gemini Pro | 50-200 | Good | Medium | $ |
OpenRouter | 10-100 | Variable | Medium | $ |
๐งช Testing
# Run all tests
pytest
# Run compliance tests only
pytest -m compliance
# Run privacy tests
pytest -m privacy
# Run with coverage
pytest --cov=synthetic_data_mcp --cov-report=html
# Test specific provider
OPENAI_API_KEY=sk-test pytest -m integration
๐ Deployment
Docker Deployment
docker build -t synthetic-data-mcp .
docker run -p 3000:3000 \
-e OPENAI_API_KEY=your-key \
synthetic-data-mcp
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: synthetic-data-mcp
spec:
replicas: 3
selector:
matchLabels:
app: synthetic-data-mcp
template:
metadata:
labels:
app: synthetic-data-mcp
spec:
containers:
- name: synthetic-data-mcp
image: synthetic-data-mcp:latest
ports:
- containerPort: 3000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secrets
key: openai-key
๐ง Development
Code Quality
# Format code
black .
isort .
# Run linting
flake8 src tests
# Type checking
mypy src
Adding New Providers
- Create provider module in
src/synthetic_data_mcp/providers/
- Implement DSPy LM interface
- Add configuration in
core/generator.py
- Add tests in
tests/test_providers.py
๐ Examples
Healthcare Example
import asyncio
from synthetic_data_mcp import SyntheticDataGenerator
async def generate_patients():
generator = SyntheticDataGenerator()
result = await generator.generate_dataset(
domain="healthcare",
dataset_type="patient_records",
record_count=1000,
privacy_level="high",
compliance_frameworks=["hipaa"]
)
print(f"Generated {len(result['dataset'])} patient records")
return result
# Run the example
asyncio.run(generate_patients())
Finance Example
async def generate_transactions():
generator = SyntheticDataGenerator()
result = await generator.generate_dataset(
domain="finance",
dataset_type="transactions",
record_count=50000,
privacy_level="high",
compliance_frameworks=["pci_dss"]
)
print(f"Generated {len(result['dataset'])} transactions")
return result
๐ค Contributing
We welcome contributions! Please see our for details.
Development Setup
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,healthcare,finance]"
pre-commit install
๐ License
MIT License - see file for details.
๐ Support
๐ Related Projects
- Model Context Protocol (MCP)
- DSPy Framework
- Ollama - Local LLM inference
- OpenRouter - Access to 100+ models
Built with โค๏ธ for enterprise developers who need compliant, privacy-preserving synthetic data generation.