Team-1B-Fusion

DharmpratapSingh/Team-1B-Fusion


The Model Context Protocol (MCP) server provides a standardized interface for accessing and querying emissions data through a conversational AI system. It uses LLMs to interpret natural-language queries and return structured data responses.

ClimateGPT: AI-Powered Climate & Emissions Data Analytics Platform

A comprehensive data engineering system for querying, analyzing, and interpreting global greenhouse gas emissions data using the Model Context Protocol (MCP) and Large Language Models with persona-based response differentiation.

Features

  • Multi-sector emissions data: Transport, Power Industry, and other EDGAR sectors
  • Geographic granularity: Country, admin-1 (state/province), and city-level data
  • Temporal analysis: Monthly and annual data from 2000-2024
  • Conversational interface: Natural language queries powered by LLM
  • MCP server: Standardized data access via Model Context Protocol
  • Interactive UI: Streamlit-based chat interface with persona modes
  • Smart Entity Resolution: Automatic handling of country/city aliases (USA→United States, NYC→New York, etc.), fuzzy matching for typos, and intelligent geographic level detection
  • Enterprise Security: SQL injection prevention, input validation, CORS restrictions, and secure credential management
  • High Performance: Optimized query execution, connection pooling, and 50% reduction in database load

Quick Start

Prerequisites

  • Python 3.11 (specifically required - not 3.10 or 3.12)
    • Why 3.11? This version provides the optimal balance of performance and compatibility with our dependencies (DuckDB, GeoPandas, NumPy 2.x)
  • uv - Fast Python package manager and resolver
    • Installation: curl -LsSf https://astral.sh/uv/install.sh | sh
  • Docker (optional, for containerized deployment)
  • OpenAI API Key (or compatible LLM endpoint) for conversational interface

Local Development

  1. Start the MCP bridge + MCP stdio server (port 8010):
make serve
  2. In a second terminal, start the UI (port 8501):
make ui
  3. Open http://localhost:8501 in your browser

Docker Deployment

docker compose up --build

This will start both services:

  • server: HTTP bridge (FastAPI) + true MCP stdio server on port 8010
  • ui: Streamlit interface on port 8501

Example Queries

Once the system is running, you can ask questions like:

Simple Queries:

  • "What were the CO2 emissions from transport in Germany in 2023?"
  • "Show me power industry emissions in California for 2022"
  • "What are the transport emissions for Paris in 2020?"

Temporal Analysis:

  • "How have transport emissions in the United States changed from 2000 to 2023?"
  • "Compare monthly transport emissions in Beijing between 2022 and 2023"
  • "What were the peak emissions months for power industry in Texas in 2023?"

Comparative Queries:

  • "Compare transport emissions between France and Germany in 2023"
  • "Which US state had the highest power industry emissions in 2022?"
  • "Compare emissions from transport and power industry in China in 2023"

Complex Multi-Sector:

  • "Analyze transport and power industry emissions trends in India from 2015 to 2023"
  • "What sectors contribute most to emissions in California?"

Smart Entity Resolution (NEW!): The system now intelligently handles various name formats and aliases:

  • Country aliases: "USA", "US", "America" → "United States of America"
  • State abbreviations: "CA", "TX", "NY" → "California", "Texas", "New York"
  • City nicknames: "NYC" → "New York", "LA" → "Los Angeles"
  • Typo correction: "Califronia" → "California" (fuzzy matching at 80%+ similarity)
  • Auto-level detection: Automatically determines if you're querying a country, state, or city
  • Intelligent fallback: If city data unavailable, tries state → country automatically
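The alias lookup and typo correction above can be sketched with Python's standard difflib; the alias table and canonical list below are small illustrative stand-ins for the real mapping in shared/entity_normalization.py:

```python
from difflib import get_close_matches

# Illustrative stand-ins for the project's real alias/canonical tables
ALIASES = {
    "usa": "United States of America",
    "us": "United States of America",
    "america": "United States of America",
    "nyc": "New York",
    "la": "Los Angeles",
}
CANONICAL = ["United States of America", "New York", "Los Angeles",
             "California", "Texas"]

def resolve_location(name: str, cutoff: float = 0.8):
    """Normalize an alias, then fall back to fuzzy matching at 80%+ similarity."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    for canon in CANONICAL:                      # exact match, case-insensitive
        if canon.lower() == key:
            return canon
    # Typo correction, e.g. "Califronia" -> "California"
    matches = get_close_matches(name.title(), CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Here "Califronia" resolves to "California" because the two strings are similar enough to clear the 0.8 cutoff, while an unrecognized name returns None and would trigger the fallback chain.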

Notes:

  • You can now use common aliases like "USA", "UK", "NYC" - the system normalizes them automatically!
  • Emissions are in tonnes CO₂ (displayed as MtCO₂ for large values)
  • Data covers 2000-2024 with monthly resolution
  • Available sectors: transport, power-industry, waste, agriculture, buildings, fuel-exploitation, industrial-combustion, industrial-processes

Architecture

MCP Server Stack (mcp_http_bridge.py + mcp_server_stdio.py)

  • mcp_http_bridge.py: FastAPI bridge that exposes the legacy HTTP REST surface (/query, /list_files, etc.) while proxying every request to the real MCP stdio server. Includes CORS security restrictions and configurable origin whitelisting.
  • mcp_server_stdio.py: The fully featured MCP implementation that speaks the Model Context Protocol over stdio and executes all DuckDB queries. Features include:
    • Smart Entity Resolution: Normalizes location names, handles aliases, performs fuzzy matching, and auto-detects geographic levels
    • Security: Comprehensive SQL injection prevention with input validation, column name sanitization, and parameterized queries
    • Performance: Connection pooling, optimized query execution (no redundant queries), and efficient cursor management
    • MCP Tools: 15+ tools including smart_query_emissions, query_emissions, calculate_yoy_change, get_file_info, and more
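The validation-plus-parameterization idea behind the SQL injection prevention can be sketched as follows (sqlite3 is used here so the example runs standalone; DuckDB supports the same ? placeholders, and the table and column names are illustrative, not the project's actual schema):

```python
import sqlite3

# sqlite3 stands in for DuckDB so the sketch is self-contained
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emissions (country TEXT, year INTEGER, co2_tonnes REAL)")
con.executemany("INSERT INTO emissions VALUES (?, ?, ?)",
                [("Germany", 2023, 1.5e8), ("France", 2023, 1.2e8)])

ALLOWED_COLUMNS = {"country", "year", "co2_tonnes"}  # whitelist for identifiers

def query_emissions(country, year, column="co2_tonnes"):
    # Identifiers cannot be bound as parameters, so validate them against a whitelist...
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column!r}")
    # ...and bind user-supplied values so input like "x'; DROP TABLE--" stays inert
    sql = f"SELECT {column} FROM emissions WHERE country = ? AND year = ?"
    return con.execute(sql, (country, year)).fetchall()
```

Values travel as bound parameters and never reach the SQL text; only whitelisted column names are ever interpolated.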

The bridge automatically spawns mcp_server_stdio.py on startup and relays JSON-RPC traffic between HTTP clients (UI, automation) and the MCP server. This keeps existing HTTP integrations working while using the canonical MCP runtime under the hood.
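Stripped to its essentials, the bridge pattern is: spawn the stdio server as a child process, write a JSON-RPC request to its stdin, and read the reply from its stdout. The sketch below is illustrative, with a tiny echo process standing in for mcp_server_stdio.py; real MCP framing and lifecycle management are more involved:

```python
import json
import subprocess
import sys

def relay_once(request, server_cmd):
    """Send one newline-delimited JSON-RPC request to a stdio server and
    return its decoded reply. Single-shot sketch, not the bridge's code."""
    proc = subprocess.Popen(server_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    response = json.loads(proc.stdout.readline())
    proc.stdin.close()
    proc.wait()
    return response

# Stand-in for mcp_server_stdio.py: answers any request with a fixed result
ECHO_SERVER = [sys.executable, "-c",
    "import sys,json; r=json.loads(sys.stdin.readline());"
    "print(json.dumps({'jsonrpc':'2.0','id':r['id'],'result':'pong'}))"]
```

The real bridge keeps the child alive across requests and matches replies to requests by JSON-RPC id rather than spawning per call.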

UI (enhanced_climategpt_with_personas.py)

Streamlit chat interface with:

  • Multiple persona modes (Analyst, Technical, Policy Advisor)
  • Chat-first layout with inline controls
  • CSV export of query results
  • Status indicators and error handling
  • Secure credential management (no hardcoded defaults)

API Reference

The HTTP bridge exposes the following RESTful endpoints on port 8010:

Query Endpoint

POST /query

Execute a natural language query against the emissions database.

curl -X POST http://localhost:8010/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What were transport emissions in Germany in 2023?",
    "assist_mode": "smart",
    "proxy_strategy": "spatial"
  }'

Request Body:

{
  "question": "string (required)",
  "assist_mode": "smart|lite|off (default: smart)",
  "proxy_strategy": "spatial|random|off (default: spatial)",
  "proxy_max_k": 10,
  "proxy_radius_km": 100
}

Response:

{
  "answer": "Natural language summary",
  "data": [...],
  "metadata": {
    "intent": {...},
    "datasets_used": [...],
    "query_time_ms": 123
  }
}

List Files Endpoint

GET /list_files

List all available datasets in the manifest.

curl http://localhost:8010/list_files

Response:

{
  "files": [
    {
      "name": "transport_admin0_yearly",
      "description": "Country-level transport emissions (yearly)",
      "path": "data/warehouse/climategpt.duckdb",
      "table_name": "transport_admin0_yearly"
    }
  ]
}

Health Check

GET /health

Check server status.

curl http://localhost:8010/health

Response:

{
  "status": "healthy",
  "mcp_server": "running",
  "version": "0.2.0"
}

Configuration

Environment Variables

Configure the system using environment variables. Create a .env file or export them in your shell:

MCP Server Configuration
# Required
export MCP_MANIFEST_PATH=data/curated-2/manifest_mcp_duckdb.json

# Optional
export PORT=8010                          # HTTP server port (default: 8010)
export MCP_RATE_CAP=60                    # Rate limit: requests per 5 minutes per IP
export MCP_LOG_LEVEL=INFO                 # Logging level: DEBUG|INFO|WARNING|ERROR

LLM Configuration
# Required for conversational interface
# IMPORTANT: Must be in "username:password" format for authentication
export OPENAI_API_KEY=username:password   # Your API credentials in username:password format

# Optional
export OPENAI_BASE_URL=https://api.openai.com/v1  # LLM endpoint
export MODEL=gpt-4                        # Model name (gpt-4, gpt-3.5-turbo, etc.)

Security Note: The system no longer accepts hardcoded credentials. The OPENAI_API_KEY environment variable is required and must be in username:password format. The application will fail fast with a clear error if credentials are missing or incorrectly formatted.

Query Behavior Defaults
# Assist Mode: how the LLM assists with query interpretation
export ASSIST_DEFAULT=smart               # smart|lite|off (default: smart)
  # smart: Full LLM-powered query understanding
  # lite: Basic query parsing
  # off: Direct SQL only

# Proxy Strategy: how to handle missing admin-1/city data
export PROXY_DEFAULT=spatial              # spatial|random|off (default: spatial)
  # spatial: Find nearby locations with data
  # random: Use random sampling
  # off: No proxy fallback

export PROXY_MAX_K=10                     # Max proxy results to return
export PROXY_RADIUS_KM=100                # Search radius for spatial proxy (km)

Streamlit UI Configuration
export STREAMLIT_SERVER_PORT=8501         # UI port (default: 8501)
export STREAMLIT_SERVER_ADDRESS=0.0.0.0   # Bind address

Security Configuration
# CORS Origins (comma-separated list of allowed origins)
export ALLOWED_ORIGINS=http://localhost:8501,http://localhost:3000

# Default allows localhost origins only for security
# Add your production domains as needed:
# export ALLOWED_ORIGINS=https://yourdomain.com,https://app.yourdomain.com

# Environment mode (affects error messages and security defaults)
export ENVIRONMENT=production              # production|development (default: production)

# Rate Limiting
export RATE_LIMIT_MAX_REQUESTS=100        # Max requests per window (default: 100)
export RATE_LIMIT_WINDOW_SECONDS=60       # Rate limit window in seconds (default: 60)
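The RATE_LIMIT_MAX_REQUESTS / RATE_LIMIT_WINDOW_SECONDS pair describes a sliding-window limiter, which could be sketched as follows (illustrative only; the server's actual implementation may differ):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter: allow at most max_requests per client
    within any window_seconds span."""
    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)          # client IP -> request timestamps

    def allow(self, client_ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client_ip]
        while q and now - q[0] >= self.window:  # drop timestamps outside the window
            q.popleft()
        if len(q) >= self.max_requests:
            return False                        # over the limit: reject
        q.append(now)
        return True
```

Old timestamps age out of the deque, so a client blocked at the limit becomes eligible again as soon as its earliest request leaves the window.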

Security Best Practice: Never use * for allowed origins in production. The system now enforces explicit origin whitelisting to prevent unauthorized cross-origin requests.

Database Configuration
# Database file path
export DB_PATH=data/warehouse/climategpt.duckdb  # Path to DuckDB database

# Connection pool settings
export DB_POOL_SIZE=10                    # Connection pool size (default: 10)
export DB_MAX_CONNECTIONS=20              # Maximum concurrent connections (default: 20)

# Note: Higher pool sizes improve concurrent query performance but use more memory
# Recommended: 10-20 for most deployments, 20-50 for high-traffic production
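The pool semantics these settings control amount to a fixed-size checkout/return queue; a minimal sketch (sqlite3 stands in for DuckDB so it runs standalone):

```python
import sqlite3
from contextlib import contextmanager
from queue import Queue

class ConnectionPool:
    """Fixed-size pool: checkout blocks once DB_POOL_SIZE connections
    are in use, and connections are reused rather than reopened per query."""
    def __init__(self, db_path, size=10):
        self._pool = Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(sqlite3.connect(db_path, check_same_thread=False))

    @contextmanager
    def connection(self):
        con = self._pool.get()       # blocks if every connection is checked out
        try:
            yield con
        finally:
            self._pool.put(con)      # return to the pool, never close per query

pool = ConnectionPool(":memory:", size=3)
```

A larger pool admits more concurrent queries before checkout blocks, at the cost of keeping more connections (and their memory) open.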

Performance Configuration
# LLM Concurrency Control
export LLM_CONCURRENCY_LIMIT=10           # Max concurrent LLM API calls (default: 10)
  # Lower values (2-5): Conservative, prevents rate limiting
  # Higher values (10-20): Better throughput if API allows

# Query Result Caching
export CACHE_SIZE=1000                    # Cache entry limit (default: 1000)
export CACHE_TTL_SECONDS=300              # Cache TTL in seconds (default: 300/5min)
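CACHE_SIZE and CACHE_TTL_SECONDS describe a bounded cache with per-entry expiry; a sketch of those semantics (not the server's actual code):

```python
import time

class TTLCache:
    """Bounded cache: entries expire after ttl_seconds, and when full the
    entry closest to expiry is evicted (a simple policy; real caches vary)."""
    def __init__(self, max_size=1000, ttl_seconds=300):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store = {}                 # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[0] <= now:
            self._store.pop(key, None)   # lazily drop expired entries
            return None
        return entry[1]

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if len(self._store) >= self.max_size:
            oldest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[oldest]
        self._store[key] = (now + self.ttl, value)
```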

Development & Debugging
export DEBUG=true                         # Enable debug mode
export LOG_QUERIES=true                   # Log all SQL queries

See docker-compose.yml for container-specific configuration examples.

Testing

Unit & Integration Tests

Run the test suite:

make test

Or with pytest directly:

uv run pytest -v

LLM Comparative Testing

The system has been tested with multiple LLM backends. Key findings:

  • Default LLM: 100% success rate, 5.7s average response time (recommended for production)
  • Llama Q5_K_M: 80% success rate, 10.4s average response time (viable for development/testing)
  • Tool calls and natural-language summarization worked correctly with both backends

See docs/TESTING_RESULTS.md for detailed comparison results.

For automated LLM testing tools, see the testing/ directory which includes:

  • Test harness with a 50-question bank covering all sectors and query types
  • Analysis and visualization scripts
  • LM Studio setup guides

Data Sources

This project uses EDGAR (Emissions Database for Global Atmospheric Research) v2024 datasets:

  • CO₂ emissions by sector (transport, power industry, etc.)
  • Global coverage with spatial resolution
  • Monthly temporal resolution (2000-2024)

Usage Notes

  • Common aliases like "USA" and "NYC" are normalized automatically; exact names (e.g., "United States of America") always work
  • All emissions values are in tonnes CO₂; large numbers displayed as MtCO₂
  • No forecasts or per-capita metrics (by design)
  • Queries are limited by rate limiting to prevent abuse

Troubleshooting

Common Issues

Server Won't Start

Problem: ModuleNotFoundError or import errors

Solution:

# Ensure dependencies are installed
uv sync

# Or with pip
pip install -r requirements.txt

# Verify Python version
python --version  # Should be 3.11.x

Problem: Port 8010 already in use

Solution:

# Find and kill the process using port 8010
lsof -ti:8010 | xargs kill -9

# Or change the port
export PORT=8011
make serve

UI Connection Issues

Problem: UI can't connect to MCP server

Solution:

# Verify server is running
curl http://localhost:8010/health

# Check server logs for errors
# Make sure both server and UI are running in separate terminals

# Terminal 1
make serve

# Terminal 2
make ui

Database Errors

Problem: DuckDB: IO Error: No such file or directory

Solution:

# Verify database exists
ls -lh data/warehouse/climategpt.duckdb

# Check manifest path
export MCP_MANIFEST_PATH=data/curated-2/manifest_mcp_duckdb.json

# Verify manifest is valid
python -c "import json; print(json.load(open('data/curated-2/manifest_mcp_duckdb.json')))"

LLM/OpenAI Errors

Problem: AuthenticationError or Invalid API key

Solution:

# Set your API credentials (username:password format, as described in Configuration)
export OPENAI_API_KEY=username:password

# Or create a .env file
echo "OPENAI_API_KEY=username:password" > .env

Problem: Rate limit errors from OpenAI

Solution:

  • Wait a few minutes and retry
  • Switch to a different model with higher limits
  • Consider using a local LLM (see testing/LM_STUDIO_SETUP.md)

Docker Issues

Problem: Docker build fails

Solution:

# Clean Docker cache
docker system prune -a

# Rebuild with no cache
docker compose build --no-cache

# Check Docker resources (need at least 4GB RAM)
docker stats

Problem: Container exits immediately

Solution:

# Check container logs
docker compose logs server
docker compose logs ui

# Verify environment variables in docker-compose.yml

Query Returns No Results

Problem: Valid question returns empty data

Possible causes:

  1. Country name mismatch: common aliases like "USA" are normalized, but an unrecognized spelling may fail; try the exact name, e.g., "United States of America"
  2. Data not available: Not all locations have city/admin-1 level data
  3. Time range: Data is limited to 2000-2024
  4. Sector name: Use exact sector names (e.g., "power-industry" not "power")

Solution:

# Check available datasets
curl http://localhost:8010/list_files

# Try a simpler query first
curl -X POST http://localhost:8010/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What were transport emissions in Germany in 2023?"}'

Performance Issues

Problem: Slow query responses

Solution:

  • Check system resources (CPU, RAM, disk I/O)
  • Reduce PROXY_MAX_K for faster spatial queries
  • Use yearly data instead of monthly for large time ranges
  • Consider optimizing DuckDB with PRAGMA threads=4

Getting Help

If you encounter issues not covered here:

  1. Check the logs in the terminal where you ran make serve
  2. Enable debug logging: export MCP_LOG_LEVEL=DEBUG
  3. Review the full documentation in docs/
  4. Check existing issues on GitHub
  5. Create a new issue with:
    • Error message
    • Steps to reproduce
    • Python version (python --version)
    • OS and version

Documentation

Additional documentation is available in the docs/ folder:

Setup & Architecture

  • docs/SYSTEM_REFERENCE.md - System architecture and quick reference
  • docs/ARCHITECTURE.md - Detailed system architecture
  • docs/API.md - API documentation
  • docs/MCP_ARCHITECTURE.md - MCP protocol details
  • docs/DEPLOYMENT.md - Deployment guide

Development & Testing

  • docs/TESTING_RESULTS.md - Testing results and LLM comparison findings
  • docs/DEPENDENCY_AUDIT.md - Dependency security and audit report
  • testing/ - Automated testing tools and scripts

Reports & Analysis

  • docs/reports/ - Comprehensive reports on improvements, security fixes, and implementation status

Recent Improvements Documentation

The following reports document recent enhancements and security improvements:

  • docs/reports/CODE_REVIEW_REPORT.md - Comprehensive code review and analysis (1,213 lines)
  • docs/SMART_QUERY_GUIDE.md - Smart entity resolution system user guide
  • docs/reports/FINAL_STATUS.md - Final status report of all implemented improvements
  • docs/reports/IMPROVEMENTS_SUMMARY.md - Summary of all fixes and enhancements
  • docs/reports/REPOSITORY_STATUS.md - Current repository status and structure

Latest Updates (v0.2.0):

  • ✅ Smart entity resolution with alias normalization and fuzzy matching
  • ✅ All P0+P1 security issues resolved (hardcoded credentials removed, CORS hardened)
  • ✅ Performance optimizations (50% reduction in database load)
  • ✅ Comprehensive SQL injection prevention
  • ✅ Enhanced input validation across all query endpoints

Development

Project Structure

.
├── mcp_http_bridge.py                     # HTTP bridge that wraps the MCP stdio server
├── mcp_server_stdio.py                    # MCP stdio protocol server (source of truth)
├── enhanced_climategpt_with_personas.py   # Streamlit UI
├── run_llm.py                             # LLM integration harness
├── src/
│   ├── pipelines/                         # Data processing pipelines
│   │   ├── viirs.py                       # VIIRS data pipeline
│   │   └── __init__.py
│   └── utils/                             # Core utilities
│       ├── router.py                      # Intent to dataset routing
│       ├── intent.py                      # Intent extraction
│       ├── answer.py                      # Response formatting
│       ├── fallbacks.py                   # Query fallback logic
│       ├── baseline_context.py            # Baseline context handling
│       ├── logging.py                     # Logging utilities
│       └── http.py                        # HTTP utilities
├── scripts/                               # Utility scripts
│   ├── preprocessing/                     # Data preprocessing for EDGAR sectors
│   │   ├── sector_config.py               # Centralized sector configuration
│   │   ├── geometry_loader.py             # Geographic boundary loader
│   │   ├── spatial_aggregation.py         # Spatial join engine
│   │   ├── process_transport_sector.py    # Transport sector pipeline
│   │   ├── process_power_sector.py        # Power sector pipeline
│   │   ├── process_all_sectors.py         # Batch process all sectors
│   │   └── (legacy scripts)
│   ├── database/                          # Database management
│   │   ├── analyze_database.py            # Database analysis
│   │   ├── apply_database_indexes.py      # Index management
│   │   ├── create_database_indexes.sql    # SQL index definitions
│   │   └── create_materialized_views.*    # View management
│   └── analysis/                          # Analysis and validation
│       ├── audit_dependencies.py          # Dependency audit
│       └── validate_phase5.py             # Phase 5 validation
├── notebooks/                             # Jupyter notebooks for exploration
│   ├── EDGAR_Transport.ipynb              # Original transport processing (now modular)
│   └── (analysis documentation)
├── shared/                                # Shared utilities
│   ├── entity_normalization.py            # Entity name normalization
│   └── __init__.py
├── middleware/                            # Middleware components
│   └── request_tracking.py                # Request tracking
├── utils/                                 # Legacy utilities
│   ├── config.py                          # Configuration management
│   ├── error_handling.py                  # Error handling utilities
│   └── serialization.py                   # Serialization utilities
├── models/                                # Data models (placeholder)
├── data/
│   ├── curated/                           # Legacy curated data
│   │   └── manifest_mcp.json              # Legacy manifest
│   ├── curated-2/                         # Processed datasets
│   │   └── manifest_mcp_duckdb.json       # Dataset manifest
│   ├── warehouse/                         # DuckDB databases
│   └── geo/                               # Geographic boundary files
├── testing/                               # LLM testing infrastructure
│   ├── test_harness.py                    # Automated test runner
│   ├── analyze_results.py                 # Results analysis
│   ├── test_question_bank.json            # 50 test questions
│   └── test_results/                      # Test outputs (gitignored)
├── tests/                                 # Unit/integration tests
├── docs/                                  # Documentation
│   ├── QUICK_START.md                     # Setup guide
│   ├── SYSTEM_REFERENCE.md                # Architecture reference
│   ├── TESTING_GUIDE.md                   # Testing procedures
│   ├── TESTING_RESULTS.md                 # LLM comparison results
│   ├── MCP_ARCHITECTURE.md                # MCP architecture details
│   ├── API.md                             # API documentation
│   └── (additional documentation)
├── .github/                               # GitHub Actions workflows
│   ├── workflows/
│   │   ├── ci.yml                         # CI pipeline
│   │   ├── security.yml                   # Security scanning
│   │   └── deploy.yml                     # Deployment workflow
│   └── dependabot.yml                     # Dependency updates
├── Dockerfile.server                      # Server container
├── Dockerfile.ui                          # UI container
├── docker-compose.yml                     # Multi-container setup
├── Makefile                               # Development commands
├── pyproject.toml                         # UV package manager config
├── uv.lock                                # Locked dependencies
├── requirements.txt                       # Pip fallback dependencies
└── .gitignore                             # Git ignore rules

Dependencies

Managed via pyproject.toml with pinned versions for reproducibility:

  • FastAPI + Uvicorn (API server)
  • Streamlit (UI)
  • DuckDB (analytical database)
  • OpenAI (LLM integration)
  • Pandas, NumPy (data processing)
  • GeoPandas, Shapely (spatial operations)

Install all dependencies:

uv sync

License

This project is currently in development. License information will be added soon.

For questions about licensing, please contact the project maintainers or open an issue on GitHub.

Contributing

We welcome contributions! To contribute:

Getting Started

  1. Fork the repository on GitHub
  2. Clone your fork:
    git clone https://github.com/YOUR_USERNAME/Team-1B-Fusion.git
    cd Team-1B-Fusion
    
  3. Create a feature branch:
    git checkout -b feature/your-feature-name
    

Development Workflow

  1. Install dependencies:

    uv sync
    
  2. Make your changes:

    • Write clean, documented code
    • Follow the existing code style
    • Add tests for new functionality
  3. Run tests and linting:

    # Run tests
    make test
    
    # Run linting
    uv run ruff check .
    uv run black --check .
    
    # Auto-format code
    uv run black .
    uv run ruff check --fix .
    
  4. Commit your changes:

    git add .
    git commit -m "feat: add your feature description"
    

    Use conventional commit messages:

    • feat: for new features
    • fix: for bug fixes
    • docs: for documentation
    • test: for tests
    • refactor: for refactoring
    • chore: for maintenance
  5. Push and create a Pull Request:

    git push origin feature/your-feature-name
    

Code Quality Standards

  • Python 3.11 compatibility required
  • Type hints encouraged for new code
  • Docstrings for all public functions/classes
  • Test coverage for new features
  • No breaking changes without discussion

Areas for Contribution

  • 🐛 Bug fixes
  • 📚 Documentation improvements
  • ✨ New data sources or sectors
  • 🔧 Performance optimizations
  • 🧪 Additional test coverage
  • 🌍 Geographic data expansion
  • 🎨 UI/UX improvements

Code Review Process

  1. All PRs require at least one approval
  2. CI/CD pipeline must pass (linting, tests, security checks)
  3. Documentation must be updated for user-facing changes
  4. Maintainers will review within 1-2 weeks

Support

Getting Help

  • 📖 Documentation: Check the docs/ folder for detailed guides
  • 💬 Issues: Open an issue on GitHub
  • 🐛 Bug Reports: Use the issue template and include reproduction steps
  • 💡 Feature Requests: Describe your use case and proposed solution

Reporting Security Issues

If you discover a security vulnerability, please DO NOT open a public issue. Instead, email the maintainers directly or use GitHub's private security reporting feature.

Community Guidelines

  • Be respectful and inclusive
  • Help others learn and grow
  • Provide constructive feedback
  • Follow the code of conduct (coming soon)

Built with: Python 3.11 | FastAPI | Streamlit | DuckDB | MCP Protocol

Data Source: EDGAR v2024 - Emissions Database for Global Atmospheric Research

Version: 0.2.0 | Status: Active Development 🚧