DharmpratapSingh/Team-1B-Fusion
If you are the rightful owner of Team-1B-Fusion and would like to certify it and/or have it hosted online, please leave a comment on the right or send an email to dayong@mcphub.com.
The Model Context Protocol (MCP) server is a standardized interface for accessing and querying emissions data using a conversational AI system. It leverages the power of LLMs to interpret natural language queries and provides structured data responses.
ClimateGPT: AI-Powered Climate & Emissions Data Analytics Platform
A comprehensive data engineering system for querying, analyzing, and interpreting global greenhouse gas emissions data using the Model Context Protocol (MCP) and Large Language Models with persona-based response differentiation.
Table of Contents
- Features
- Quick Start
- Example Queries
- Architecture
- API Reference
- Configuration
- Testing
- Data Sources
- Usage Notes
- Troubleshooting
- Documentation
- Development
- License
- Contributing
- Support
Features
- Multi-sector emissions data: Transport, Power Industry, and other EDGAR sectors
- Geographic granularity: Country, admin-1 (state/province), and city-level data
- Temporal analysis: Monthly and annual data from 2000-2024
- Conversational interface: Natural language queries powered by LLM
- MCP server: Standardized data access via Model Context Protocol
- Interactive UI: Streamlit-based chat interface with persona modes
- Smart Entity Resolution: Automatic handling of country/city aliases (USA→United States, NYC→New York, etc.), fuzzy matching for typos, and intelligent geographic level detection
- Enterprise Security: SQL injection prevention, input validation, CORS restrictions, and secure credential management
- High Performance: Optimized query execution, connection pooling, and 50% reduction in database load
Quick Start
Prerequisites
- Python 3.11 (specifically required - not 3.10 or 3.12)
- Why 3.11? This version provides the optimal balance of performance and compatibility with our dependencies (DuckDB, GeoPandas, NumPy 2.x)
- uv - Fast Python package manager and resolver
- Installation:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Installation:
- Docker (optional, for containerized deployment)
- OpenAI API Key (or compatible LLM endpoint) for conversational interface
Local Development
- Start the MCP bridge + MCP stdio server (port 8010):
make serve
- In a second terminal, start the UI (port 8501):
make ui
- Open http://localhost:8501 in your browser
Docker Deployment
docker compose up --build
This will start both services:
- server: HTTP bridge (FastAPI) + true MCP stdio server on port 8010
- ui: Streamlit interface on port 8501
Example Queries
Once the system is running, you can ask questions like:
Simple Queries:
- "What were the CO2 emissions from transport in Germany in 2023?"
- "Show me power industry emissions in California for 2022"
- "What are the transport emissions for Paris in 2020?"
Temporal Analysis:
- "How have transport emissions in the United States changed from 2000 to 2023?"
- "Compare monthly transport emissions in Beijing between 2022 and 2023"
- "What were the peak emissions months for power industry in Texas in 2023?"
Comparative Queries:
- "Compare transport emissions between France and Germany in 2023"
- "Which US state had the highest power industry emissions in 2022?"
- "Compare emissions from transport and power industry in China in 2023"
Complex Multi-Sector:
- "Analyze transport and power industry emissions trends in India from 2015 to 2023"
- "What sectors contribute most to emissions in California?"
Smart Entity Resolution (NEW!): The system now intelligently handles various name formats and aliases:
- Country aliases: "USA", "US", "America" → "United States of America"
- State abbreviations: "CA", "TX", "NY" → "California", "Texas", "New York"
- City nicknames: "NYC" → "New York", "LA" → "Los Angeles"
- Typo correction: "Califronia" → "California" (fuzzy matching at 80%+ similarity)
- Auto-level detection: Automatically determines if you're querying a country, state, or city
- Intelligent fallback: If city data unavailable, tries state → country automatically
Notes:
- You can now use common aliases like "USA", "UK", "NYC" - the system normalizes them automatically!
- Emissions are in tonnes CO₂ (displayed as MtCO₂ for large values)
- Data covers 2000-2024 with monthly resolution
- Available sectors: transport, power-industry, waste, agriculture, buildings, fuel-exploitation, industrial-combustion, industrial-processes
Architecture
MCP Server Stack (mcp_http_bridge.py + mcp_server_stdio.py)
mcp_http_bridge.py: FastAPI bridge that exposes the legacy HTTP REST surface (/query,/list_files, etc.) while proxying every request to the real MCP stdio server. Includes CORS security restrictions and configurable origin whitelisting.mcp_server_stdio.py: The fully featured MCP implementation that speaks the Model Context Protocol over stdio and executes all DuckDB queries. Features include:- Smart Entity Resolution: Normalizes location names, handles aliases, performs fuzzy matching, and auto-detects geographic levels
- Security: Comprehensive SQL injection prevention with input validation, column name sanitization, and parameterized queries
- Performance: Connection pooling, optimized query execution (no redundant queries), and efficient cursor management
- MCP Tools: 15+ tools including
smart_query_emissions,query_emissions,calculate_yoy_change,get_file_info, and more
The bridge automatically spawns mcp_server_stdio.py on startup and relays JSON-RPC traffic between HTTP clients (UI, automation) and the MCP server. This keeps existing HTTP integrations working while using the canonical MCP runtime under the hood.
UI (enhanced_climategpt_with_personas.py)
Streamlit chat interface with:
- Multiple persona modes (Analyst, Technical, Policy Advisor)
- Chat-first layout with inline controls
- CSV export of query results
- Status indicators and error handling
- Secure credential management (no hardcoded defaults)
API Reference
The HTTP bridge exposes the following RESTful endpoints on port 8010:
Query Endpoint
POST /query
Execute a natural language query against the emissions database.
curl -X POST http://localhost:8010/query \
-H "Content-Type: application/json" \
-d '{
"question": "What were transport emissions in Germany in 2023?",
"assist_mode": "smart",
"proxy_strategy": "spatial"
}'
Request Body:
{
"question": "string (required)",
"assist_mode": "smart|lite|off (default: smart)",
"proxy_strategy": "spatial|random|off (default: spatial)",
"proxy_max_k": 10,
"proxy_radius_km": 100
}
Response:
{
"answer": "Natural language summary",
"data": [...],
"metadata": {
"intent": {...},
"datasets_used": [...],
"query_time_ms": 123
}
}
List Files Endpoint
GET /list_files
List all available datasets in the manifest.
curl http://localhost:8010/list_files
Response:
{
"files": [
{
"name": "transport_admin0_yearly",
"description": "Country-level transport emissions (yearly)",
"path": "data/warehouse/climategpt.duckdb",
"table_name": "transport_admin0_yearly"
}
]
}
Health Check
GET /health
Check server status.
curl http://localhost:8010/health
Response:
{
"status": "healthy",
"mcp_server": "running",
"version": "0.2.0"
}
Configuration
Environment Variables
Configure the system using environment variables. Create a .env file or export them in your shell:
MCP Server Configuration
# Required
export MCP_MANIFEST_PATH=data/curated-2/manifest_mcp_duckdb.json
# Optional
export PORT=8010 # HTTP server port (default: 8010)
export MCP_RATE_CAP=60 # Rate limit: requests per 5 minutes per IP
export MCP_LOG_LEVEL=INFO # Logging level: DEBUG|INFO|WARNING|ERROR
LLM Configuration
# Required for conversational interface
# IMPORTANT: Must be in "username:password" format for authentication
export OPENAI_API_KEY=username:password # Your API credentials in username:password format
# Optional
export OPENAI_BASE_URL=https://api.openai.com/v1 # LLM endpoint
export MODEL=gpt-4 # Model name (gpt-4, gpt-3.5-turbo, etc.)
Security Note: The system no longer accepts hardcoded credentials. The OPENAI_API_KEY environment variable is required and must be in username:password format. The application will fail fast with a clear error if credentials are missing or incorrectly formatted.
Query Behavior Defaults
# Assist Mode: how the LLM assists with query interpretation
export ASSIST_DEFAULT=smart # smart|lite|off (default: smart)
# smart: Full LLM-powered query understanding
# lite: Basic query parsing
# off: Direct SQL only
# Proxy Strategy: how to handle missing admin-1/city data
export PROXY_DEFAULT=spatial # spatial|random|off (default: spatial)
# spatial: Find nearby locations with data
# random: Use random sampling
# off: No proxy fallback
export PROXY_MAX_K=10 # Max proxy results to return
export PROXY_RADIUS_KM=100 # Search radius for spatial proxy (km)
Streamlit UI Configuration
export STREAMLIT_SERVER_PORT=8501 # UI port (default: 8501)
export STREAMLIT_SERVER_ADDRESS=0.0.0.0 # Bind address
Security Configuration
# CORS Origins (comma-separated list of allowed origins)
export ALLOWED_ORIGINS=http://localhost:8501,http://localhost:3000
# Default allows localhost origins only for security
# Add your production domains as needed:
# export ALLOWED_ORIGINS=https://yourdomain.com,https://app.yourdomain.com
# Environment mode (affects error messages and security defaults)
export ENVIRONMENT=production # production|development (default: production)
# Rate Limiting
export RATE_LIMIT_MAX_REQUESTS=100 # Max requests per window (default: 100)
export RATE_LIMIT_WINDOW_SECONDS=60 # Rate limit window in seconds (default: 60)
Security Best Practice: Never use * for allowed origins in production. The system now enforces explicit origin whitelisting to prevent unauthorized cross-origin requests.
Database Configuration
# Database file path
export DB_PATH=data/warehouse/climategpt.duckdb # Path to DuckDB database
# Connection pool settings
export DB_POOL_SIZE=10 # Connection pool size (default: 10)
export DB_MAX_CONNECTIONS=20 # Maximum concurrent connections (default: 20)
# Note: Higher pool sizes improve concurrent query performance but use more memory
# Recommended: 10-20 for most deployments, 20-50 for high-traffic production
Performance Configuration
# LLM Concurrency Control
export LLM_CONCURRENCY_LIMIT=10 # Max concurrent LLM API calls (default: 10)
# Lower values (2-5): Conservative, prevents rate limiting
# Higher values (10-20): Better throughput if API allows
# Query Result Caching
export CACHE_SIZE=1000 # Cache entry limit (default: 1000)
export CACHE_TTL_SECONDS=300 # Cache TTL in seconds (default: 300/5min)
Development & Debugging
export DEBUG=true # Enable debug mode
export LOG_QUERIES=true # Log all SQL queries
See docker-compose.yml for container-specific configuration examples.
Testing
Unit & Integration Tests
Run the test suite:
make test
Or with pytest directly:
uv run pytest -v
LLM Comparative Testing
The system has been tested with multiple LLM backends. Key findings:
- Default LLM: 100% success rate, 5.7s average response time (recommended for production)
- Llama Q5_K_M: 80% success rate, 10.4s average response time (viable for development/testing)
- All tool calls and natural language summarization working correctly
See docs/TESTING_RESULTS.md for detailed comparison results.
For automated LLM testing tools, see the testing/ directory which includes:
- Test harness with 50 question bank covering all sectors and query types
- Analysis and visualization scripts
- LM Studio setup guides
Data Sources
This project uses EDGAR (Emissions Database for Global Atmospheric Research) v2024 datasets:
- CO₂ emissions by sector (transport, power industry, etc.)
- Global coverage with spatial resolution
- Monthly temporal resolution (2000-2024)
Usage Notes
- Use exact country names (e.g., "United States of America" not "USA")
- All emissions values are in tonnes CO₂; large numbers displayed as MtCO₂
- No forecasts or per-capita metrics (by design)
- Queries are limited by rate limiting to prevent abuse
Troubleshooting
Common Issues
Server Won't Start
Problem: ModuleNotFoundError or import errors
Solution:
# Ensure dependencies are installed
uv sync
# Or with pip
pip install -r requirements.txt
# Verify Python version
python --version # Should be 3.11.x
Problem: Port 8010 already in use
Solution:
# Find and kill the process using port 8010
lsof -ti:8010 | xargs kill -9
# Or change the port
export PORT=8011
make serve
UI Connection Issues
Problem: UI can't connect to MCP server
Solution:
# Verify server is running
curl http://localhost:8010/health
# Check server logs for errors
# Make sure both server and UI are running in separate terminals
# Terminal 1
make serve
# Terminal 2
make ui
Database Errors
Problem: DuckDB: IO Error: No such file or directory
Solution:
# Verify database exists
ls -lh data/warehouse/climategpt.duckdb
# Check manifest path
export MCP_MANIFEST_PATH=data/curated-2/manifest_mcp_duckdb.json
# Verify manifest is valid
python -c "import json; print(json.load(open('data/curated-2/manifest_mcp_duckdb.json')))"
LLM/OpenAI Errors
Problem: AuthenticationError or Invalid API key
Solution:
# Set your API key
export OPENAI_API_KEY=sk-your-actual-key-here
# Or create a .env file
echo "OPENAI_API_KEY=sk-your-key" > .env
Problem: Rate limit errors from OpenAI
Solution:
- Wait a few minutes and retry
- Switch to a different model with higher limits
- Consider using a local LLM (see
testing/LM_STUDIO_SETUP.md)
Docker Issues
Problem: Docker build fails
Solution:
# Clean Docker cache
docker system prune -a
# Rebuild with no cache
docker compose build --no-cache
# Check Docker resources (need at least 4GB RAM)
docker stats
Problem: Container exits immediately
Solution:
# Check container logs
docker compose logs server
docker compose logs ui
# Verify environment variables in docker-compose.yml
Query Returns No Results
Problem: Valid question returns empty data
Possible causes:
- Country name mismatch: Use "United States of America" not "USA"
- Data not available: Not all locations have city/admin-1 level data
- Time range: Data is limited to 2000-2024
- Sector name: Use exact sector names (e.g., "power-industry" not "power")
Solution:
# Check available datasets
curl http://localhost:8010/list_files
# Try a simpler query first
curl -X POST http://localhost:8010/query \
-H "Content-Type: application/json" \
-d '{"question": "What were transport emissions in Germany in 2023?"}'
Performance Issues
Problem: Slow query responses
Solution:
- Check system resources (CPU, RAM, disk I/O)
- Reduce
PROXY_MAX_Kfor faster spatial queries - Use yearly data instead of monthly for large time ranges
- Consider optimizing DuckDB with
PRAGMA threads=4
Getting Help
If you encounter issues not covered here:
- Check the logs in the terminal where you ran
make serve - Enable debug logging:
export MCP_LOG_LEVEL=DEBUG - Review the full documentation in
docs/ - Check existing issues on GitHub
- Create a new issue with:
- Error message
- Steps to reproduce
- Python version (
python --version) - OS and version
Documentation
Additional documentation is available in the docs/ folder:
Setup & Architecture
docs/SYSTEM_REFERENCE.md- System architecture and quick referencedocs/ARCHITECTURE.md- Detailed system architecturedocs/API.md- API documentationdocs/MCP_ARCHITECTURE.md- MCP protocol detailsdocs/DEPLOYMENT.md- Deployment guide
Development & Testing
docs/TESTING_RESULTS.md- Testing results and LLM comparison findingsdocs/DEPENDENCY_AUDIT.md- Dependency security and audit reporttesting/- Automated testing tools and scripts
Reports & Analysis
docs/reports/- Comprehensive reports on improvements, security fixes, and implementation status
For automated testing tools and scripts, see the testing/ directory.
Recent Improvements Documentation
The following reports document recent enhancements and security improvements:
docs/reports/CODE_REVIEW_REPORT.md- Comprehensive code review and analysis (1,213 lines)docs/SMART_QUERY_GUIDE.md- Smart entity resolution system user guidedocs/reports/FINAL_STATUS.md- Final status report of all implemented improvementsdocs/reports/IMPROVEMENTS_SUMMARY.md- Summary of all fixes and enhancementsdocs/reports/REPOSITORY_STATUS.md- Current repository status and structure
Latest Updates (v0.2.0):
- ✅ Smart entity resolution with alias normalization and fuzzy matching
- ✅ All P0+P1 security issues resolved (hardcoded credentials removed, CORS hardened)
- ✅ Performance optimizations (50% reduction in database load)
- ✅ Comprehensive SQL injection prevention
- ✅ Enhanced input validation across all query endpoints
Development
Project Structure
.
├── mcp_http_bridge.py # HTTP bridge that wraps the MCP stdio server
├── mcp_server_stdio.py # MCP stdio protocol server (source of truth)
├── enhanced_climategpt_with_personas.py # Streamlit UI
├── run_llm.py # LLM integration harness
├── src/
│ ├── pipelines/ # Data processing pipelines
│ │ ├── viirs.py # VIIRS data pipeline
│ │ └── __init__.py
│ └── utils/ # Core utilities
│ ├── router.py # Intent to dataset routing
│ ├── intent.py # Intent extraction
│ ├── answer.py # Response formatting
│ ├── fallbacks.py # Query fallback logic
│ ├── baseline_context.py # Baseline context handling
│ ├── logging.py # Logging utilities
│ └── http.py # HTTP utilities
├── scripts/ # Utility scripts
│ ├── preprocessing/ # Data preprocessing for EDGAR sectors
│ │ ├── sector_config.py # Centralized sector configuration
│ │ ├── geometry_loader.py # Geographic boundary loader
│ │ ├── spatial_aggregation.py # Spatial join engine
│ │ ├── process_transport_sector.py # Transport sector pipeline
│ │ ├── process_power_sector.py # Power sector pipeline
│ │ ├── process_all_sectors.py # Batch process all sectors
│ │ └── (legacy scripts)
│ ├── database/ # Database management
│ │ ├── analyze_database.py # Database analysis
│ │ ├── apply_database_indexes.py # Index management
│ │ ├── create_database_indexes.sql # SQL index definitions
│ │ └── create_materialized_views.* # View management
│ └── analysis/ # Analysis and validation
│ ├── audit_dependencies.py # Dependency audit
│ └── validate_phase5.py # Phase 5 validation
├── notebooks/ # Jupyter notebooks for exploration
│ ├── EDGAR_Transport.ipynb # Original transport processing (now modular)
│ └── (analysis documentation)
├── shared/ # Shared utilities
│ ├── entity_normalization.py # Entity name normalization
│ └── __init__.py
├── middleware/ # Middleware components
│ └── request_tracking.py # Request tracking
├── utils/ # Legacy utilities
│ ├── config.py # Configuration management
│ ├── error_handling.py # Error handling utilities
│ └── serialization.py # Serialization utilities
├── models/ # Data models (placeholder)
├── data/
│ ├── curated/ # Legacy curated data
│ │ └── manifest_mcp.json # Legacy manifest
│ ├── curated-2/ # Processed datasets
│ │ └── manifest_mcp_duckdb.json # Dataset manifest
│ ├── warehouse/ # DuckDB databases
│ └── geo/ # Geographic boundary files
├── testing/ # LLM testing infrastructure
│ ├── test_harness.py # Automated test runner
│ ├── analyze_results.py # Results analysis
│ ├── test_question_bank.json # 50 test questions
│ └── test_results/ # Test outputs (gitignored)
├── tests/ # Unit/integration tests
├── docs/ # Documentation
│ ├── QUICK_START.md # Setup guide
│ ├── SYSTEM_REFERENCE.md # Architecture reference
│ ├── TESTING_GUIDE.md # Testing procedures
│ ├── TESTING_RESULTS.md # LLM comparison results
│ ├── MCP_ARCHITECTURE.md # MCP architecture details
│ ├── API.md # API documentation
│ └── (additional documentation)
├── .github/ # GitHub Actions workflows
│ ├── workflows/
│ │ ├── ci.yml # CI pipeline
│ │ ├── security.yml # Security scanning
│ │ └── deploy.yml # Deployment workflow
│ └── dependabot.yml # Dependency updates
├── Dockerfile.server # Server container
├── Dockerfile.ui # UI container
├── docker-compose.yml # Multi-container setup
├── Makefile # Development commands
├── pyproject.toml # UV package manager config
├── uv.lock # Locked dependencies
├── requirements.txt # Pip fallback dependencies
└── .gitignore # Git ignore rules
Dependencies
Managed via pyproject.toml with pinned versions for reproducibility:
- FastAPI + Uvicorn (API server)
- Streamlit (UI)
- DuckDB (analytical database)
- OpenAI (LLM integration)
- Pandas, NumPy (data processing)
- GeoPandas, Shapely (spatial operations)
Install all dependencies:
uv sync
License
This project is currently in development. License information will be added soon.
For questions about licensing, please contact the project maintainers or open an issue on GitHub.
Contributing
We welcome contributions! To contribute:
Getting Started
- Fork the repository on GitHub
- Clone your fork:
git clone https://github.com/YOUR_USERNAME/Team-1B-Fusion.git cd Team-1B-Fusion - Create a feature branch:
git checkout -b feature/your-feature-name
Development Workflow
-
Install dependencies:
uv sync -
Make your changes:
- Write clean, documented code
- Follow the existing code style
- Add tests for new functionality
-
Run tests and linting:
# Run tests make test # Run linting uv run ruff check . uv run black --check . # Auto-format code uv run black . uv run ruff check --fix . -
Commit your changes:
git add . git commit -m "feat: add your feature description"Use conventional commit messages:
feat:for new featuresfix:for bug fixesdocs:for documentationtest:for testsrefactor:for refactoringchore:for maintenance
-
Push and create a Pull Request:
git push origin feature/your-feature-name
Code Quality Standards
- Python 3.11 compatibility required
- Type hints encouraged for new code
- Docstrings for all public functions/classes
- Test coverage for new features
- No breaking changes without discussion
Areas for Contribution
- 🐛 Bug fixes
- 📚 Documentation improvements
- ✨ New data sources or sectors
- 🔧 Performance optimizations
- 🧪 Additional test coverage
- 🌍 Geographic data expansion
- 🎨 UI/UX improvements
Code Review Process
- All PRs require at least one approval
- CI/CD pipeline must pass (linting, tests, security checks)
- Documentation must be updated for user-facing changes
- Maintainers will review within 1-2 weeks
Support
Getting Help
- 📖 Documentation: Check the
docs/folder for detailed guides - 💬 Issues: Open an issue on GitHub
- 🐛 Bug Reports: Use the issue template and include reproduction steps
- 💡 Feature Requests: Describe your use case and proposed solution
Reporting Security Issues
If you discover a security vulnerability, please DO NOT open a public issue. Instead, email the maintainers directly or use GitHub's private security reporting feature.
Community Guidelines
- Be respectful and inclusive
- Help others learn and grow
- Provide constructive feedback
- Follow the code of conduct (coming soon)
Built with: Python 3.11 | FastAPI | Streamlit | DuckDB | MCP Protocol
Data Source: EDGAR v2024 - Emissions Database for Global Atmospheric Research
Version: 0.2.0 | Status: Active Development 🚧