marc-shade/synthetic-data-mcp
If you are the rightful owner of synthetic-data-mcp and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to dayong@mcphub.com.
The Synthetic Data MCP Server is an enterprise-grade solution designed for generating privacy-compliant synthetic datasets, particularly for regulated industries requiring compliance with standards like HIPAA, PCI DSS, SOX, and GDPR. It supports multiple LLM providers.
Synthetic Data MCP Server
Synthetic data generation for training and testing.
Part of the Agentic System - a 24/7 autonomous AI framework with persistent memory.
Enterprise-grade Model Context Protocol (MCP) server for generating privacy-compliant synthetic datasets. Built for regulated industries requiring HIPAA, PCI DSS, SOX, and GDPR compliance with multiple LLM provider support.
Features
Core Capabilities
- Privacy-First Local Inference: Ollama integration for 100% local data generation
- Domain-Specific Generation: Specialized synthetic data for healthcare and finance
- Privacy Protection: Differential privacy, k-anonymity, l-diversity
- PII Safety Guarantee: Never retains or outputs original personal data
- Compliance Validation: HIPAA, PCI DSS, SOX, GDPR compliance checking
- Statistical Fidelity: Advanced validation to ensure data utility
- Audit Trail: Comprehensive logging for regulatory compliance
- Multi-Provider Support: Ollama (default), OpenAI, Anthropic, Google, OpenRouter
LLM Provider Support (2025 Models)
- OpenAI: GPT-5, GPT-5 Mini/Nano, GPT-4o
- Anthropic: Claude Opus 4.1, Claude Sonnet 4 (1M context), Claude 3.5 series
- Google: Gemini 2.5 Pro/Flash/Flash-Lite (1M+ context, multimodal)
- Local Models: Dynamic Ollama integration (Llama 3.3, Qwen 2.5/3, DeepSeek-R1, Mistral Small 3)
- Smart Routing: Automatic provider selection with cost optimization
- Fallback: Multi-tier fallback with local model support
Technology Stack (2025 Latest)
- FastAPI 0.116+: High-performance async web framework
- FastMCP: High-performance MCP server implementation
- Pydantic 2.11+: Type-safe data validation with enhanced performance
- SQLAlchemy 2.0+: Modern async ORM with type safety
- DSPy: Language model programming framework for intelligent data generation
- NumPy 2.3+ & Pandas 2.3+: Advanced data processing capabilities
- Redis & DiskCache: Multi-tier caching for cost optimization
- Rich: Beautiful terminal interfaces and progress indicators
Enterprise Benefits
- Privacy-First: Generate synthetic data without exposing sensitive information
- Compliance-Ready: Built-in validation for HIPAA, PCI DSS, SOX, and GDPR
- Multi-Provider: Support for cloud APIs and local inference
- Production-Scale: High-performance generation for enterprise data volumes
- Zero Vendor Lock-in: Switch between providers seamlessly
- Cost Control: Use local models for unlimited generation
Healthcare Use Cases
- Patient record synthesis with HIPAA Safe Harbor compliance
- Clinical trial data generation for FDA submissions
- Medical research datasets without PHI exposure
- Drug discovery data augmentation
- Healthcare analytics and ML model training
- EHR system testing and validation
Finance Use Cases
- Transaction pattern modeling for fraud detection
- Credit risk assessment dataset generation
- Regulatory stress testing data (Basel III, Dodd-Frank)
- PCI DSS compliant payment data synthesis
- Trading algorithm development and backtesting
- Financial reporting system validation
Installation
Production Installation
pip install synthetic-data-mcp
Development Installation
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
pip install -e ".[dev,healthcare,finance]"
Quick Start
1. Configure LLM Provider
Choose your preferred provider:
OpenAI (Recommended for Production)
export OPENAI_API_KEY="sk-your-key-here"
Anthropic Claude
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
Google Gemini
export GOOGLE_API_KEY="your-key-here"
OpenRouter (Access to 100+ Models)
export OPENROUTER_API_KEY="sk-or-your-key-here"
export OPENROUTER_MODEL="meta-llama/llama-3.1-8b-instruct"
Local Models (Ollama) - Privacy-First (DEFAULT)
# Install Ollama first: https://ollama.ai
ollama pull mistral-small:latest # Or any preferred model
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="mistral-small:latest"
# The system automatically detects and uses Ollama if available
# No API keys required for local inference!
2. Start the MCP Server
synthetic-data-mcp serve --port 3000
3. Add to Claude Desktop Configuration
{
"mcpServers": {
"synthetic-data": {
"command": "python",
"args": ["-m", "synthetic_data_mcp.server"],
"env": {
"OPENAI_API_KEY": "your-api-key"
}
}
}
}
4. Generate Synthetic Data
# Using the MCP client
result = await client.call_tool(
"generate_synthetic_dataset",
{
"domain": "healthcare",
"dataset_type": "patient_records",
"record_count": 10000,
"privacy_level": "high",
"compliance_frameworks": ["hipaa"],
"output_format": "json"
}
)
Provider Configuration
Priority-Based Provider Selection
The system automatically selects the best available provider:
- Local Models (Ollama) - Highest privacy, no API costs
- OpenAI - Best performance and reliability
- Anthropic Claude - Excellent reasoning capabilities
- Google Gemini - Fast and cost-effective
- OpenRouter - Access to open source models
- Fallback Mock - Testing and development
Provider-Specific Configuration
OpenAI Configuration
# Environment variables
OPENAI_API_KEY="sk-your-key-here"
OPENAI_MODEL="gpt-4" # or gpt-4-turbo, gpt-3.5-turbo
OPENAI_TEMPERATURE="0.7"
OPENAI_MAX_TOKENS="2000"
Anthropic Configuration
# Environment variables
ANTHROPIC_API_KEY="sk-ant-your-key-here"
ANTHROPIC_MODEL="claude-3-opus-20240229" # or claude-3-sonnet, claude-3-haiku
ANTHROPIC_MAX_TOKENS="2000"
Local Ollama Configuration
# Environment variables
OLLAMA_BASE_URL="http://localhost:11434"
OLLAMA_MODEL="llama3.1:8b" # or any installed model
# Supported local models:
# - llama3.1:8b, llama3.1:70b
# - mistral:7b, mixtral:8x7b
# - qwen2:7b, deepseek-coder:6.7b
# - and 20+ more models
Available MCP Tools
generate_synthetic_dataset
Generate domain-specific synthetic datasets with compliance validation.
Parameters:
domain: Healthcare, finance, or customdataset_type: Patient records, transactions, clinical trials, etc.record_count: Number of synthetic records to generateprivacy_level: Privacy protection level (low/medium/high/maximum)compliance_frameworks: Required compliance validationsoutput_format: JSON, CSV, Parquet, or database exportprovider: Override automatic provider selection
validate_dataset_compliance
Validate existing datasets against regulatory requirements.
analyze_privacy_risk
Comprehensive privacy risk assessment for datasets.
generate_domain_schema
Create Pydantic schemas for domain-specific data structures.
benchmark_synthetic_data
Performance and utility benchmarking against real data.
Compliance Frameworks
Healthcare Compliance
- HIPAA Safe Harbor: Automatic validation of 18 identifiers
- HIPAA Expert Determination: Statistical disclosure control
- FDA Guidance: Synthetic clinical data for submissions
- GDPR: Healthcare data processing compliance
- HITECH: Security and breach notification
Finance Compliance
- PCI DSS: Payment card industry data security
- SOX: Sarbanes-Oxley internal controls
- Basel III: Banking regulatory framework
- MiFID II: Markets in Financial Instruments Directive
- Dodd-Frank: Financial reform regulations
Privacy Protection
Core Privacy Features
- Differential Privacy: Configurable ε values (0.1-1.0)
- Statistical Disclosure Control: k-anonymity, l-diversity, t-closeness
- Synthetic Data Indistinguishability: Provable privacy guarantees
- Re-identification Risk Assessment: Continuous monitoring
- Privacy Budget Management: Automatic composition tracking
PII Protection Guarantee
- NO Data Retention: Original personal data is NEVER stored
- Automatic PII Detection: Identifies names, emails, SSNs, phones, addresses, credit cards
- Complete Anonymization: All PII is anonymized before pattern learning
- Statistical Learning Only: Only learns distributions, means, and frequencies
- 100% Synthetic Output: Generated data is completely fake
Credit Card Safety
- Test Card Numbers Only: Uses official test cards (4242-4242-4242-4242, etc.)
- Provider Support: Visa, Mastercard, AmEx, Discover, and more
- Configurable Providers: Specify provider or use weighted distribution
- Never Real Cards: Original credit card numbers are never retained or output
Example usage with credit card provider selection:
# Use specific provider test cards
result = await pipeline.ingest(
source=data,
credit_card_provider='visa' # Uses Visa test cards
)
# Or let system use mixed providers (default)
result = await pipeline.ingest(
source=data # Automatically uses weighted distribution
)
Performance & Quality
- Statistical Fidelity: 95%+ correlation preservation
- Privacy Preservation: <1% re-identification risk
- Utility Preservation: >90% ML model performance
- Compliance Rate: 100% regulatory framework adherence
- Generation Speed: 1,000-10,000 records/second (provider dependent)
Provider Performance Comparison
| Provider | Speed (req/s) | Quality | Privacy | Cost |
|---|---|---|---|---|
| Ollama Local | 10-50 | High | Maximum | Free |
| OpenAI GPT-4 | 20-100 | Excellent | Medium | $$$ |
| Claude 3 Opus | 15-80 | Excellent | Medium | $$$ |
| Gemini Pro | 50-200 | Good | Medium | $ |
| OpenRouter | 10-100 | Variable | Medium | $ |
Testing
# Run all tests
pytest
# Run compliance tests only
pytest -m compliance
# Run privacy tests
pytest -m privacy
# Run with coverage
pytest --cov=synthetic_data_mcp --cov-report=html
# Test specific provider
OPENAI_API_KEY=sk-test pytest -m integration
Deployment
Docker Deployment
docker build -t synthetic-data-mcp .
docker run -p 3000:3000 \
-e OPENAI_API_KEY=your-key \
synthetic-data-mcp
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: synthetic-data-mcp
spec:
replicas: 3
selector:
matchLabels:
app: synthetic-data-mcp
template:
metadata:
labels:
app: synthetic-data-mcp
spec:
containers:
- name: synthetic-data-mcp
image: synthetic-data-mcp:latest
ports:
- containerPort: 3000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secrets
key: openai-key
🔧 Development
Code Quality
# Format code
black .
isort .
# Run linting
flake8 src tests
# Type checking
mypy src
Adding New Providers
- Create provider module in
src/synthetic_data_mcp/providers/ - Implement DSPy LM interface
- Add configuration in
core/generator.py - Add tests in
tests/test_providers.py
Examples
Healthcare Example
import asyncio
from synthetic_data_mcp import SyntheticDataGenerator
async def generate_patients():
generator = SyntheticDataGenerator()
result = await generator.generate_dataset(
domain="healthcare",
dataset_type="patient_records",
record_count=1000,
privacy_level="high",
compliance_frameworks=["hipaa"]
)
print(f"Generated {len(result['dataset'])} patient records")
return result
# Run the example
asyncio.run(generate_patients())
Finance Example
async def generate_transactions():
generator = SyntheticDataGenerator()
result = await generator.generate_dataset(
domain="finance",
dataset_type="transactions",
record_count=50000,
privacy_level="high",
compliance_frameworks=["pci_dss"]
)
print(f"Generated {len(result['dataset'])} transactions")
return result
Contributing
We welcome contributions! Please see our for details.
Development Setup
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,healthcare,finance]"
pre-commit install
License
MIT License - see file for details.
Support
Related Projects
- Model Context Protocol (MCP)
- DSPy Framework
- Ollama - Local LLM inference
- OpenRouter - Access to 100+ models
Built with ❤️ for enterprise developers who need compliant, privacy-preserving synthetic data generation.
Part of the MCP Ecosystem
This server integrates with other MCP servers for comprehensive AGI capabilities:
| Server | Purpose |
|---|---|
| enhanced-memory-mcp | 4-tier persistent memory with semantic search |
| agent-runtime-mcp | Persistent task queues and goal decomposition |
| agi-mcp | Full AGI orchestration with 21 tools |
| cluster-execution-mcp | Distributed task routing across nodes |
| node-chat-mcp | Inter-node AI communication |
| ember-mcp | Production-only policy enforcement |
See agentic-system-oss for the complete framework.