Samar-Bons/DataChat
Data Chat MVP
A web application that lets non-technical users upload CSV files and query them in natural language through a chat interface.
📊 Project Status
Current Phase: CSV Upload MVP Implementation
✅ Completed (85% Backend Foundation)
- P0 - Foundation: Git repo, pre-commit hooks, CI pipeline, Docker setup ✅
- P1 - Backend Core: FastAPI app, config management, database foundations, health endpoints ✅
- P2 - Auth & Sessions: Complete Google OAuth integration with user management ✅
- User SQLAlchemy model with preferences
- JWT-based session management
- Auth endpoints (login, callback, logout, profile, preferences)
- 103 tests with 89.73% coverage
- P3 - Data Integration: LLM pipeline and data processing infrastructure ✅
- Gemini API service with BYOK model and prompt engineering
- Schema service ready for SQLite data from CSV uploads
- Chat endpoints for NL→SQL conversion with autorun support
- SQL validation and injection prevention
- 55 tests with 95%+ coverage for new components
- P4 - Frontend Auth: Complete authentication flow ✅
- React app with Google OAuth integration
- Protected routes and user context
- Dashboard layout ready for data components
🎯 Next Priority: CSV Upload MVP (Prompt 11)
- Backend: CSV upload endpoint, file processing, SQLite conversion
- Frontend: File upload components, schema preview, chat interface integration
- Integration: Complete upload → query → results flow
🔮 Future Features (Post-MVP)
- Multiple file formats: Excel, JSON, Parquet support
- Database connections: PostgreSQL, MySQL, Snowflake integration
- Advanced features: Query history, favorites, data visualization
- Enterprise: SSO, RBAC, collaboration features
🚀 Quick Start with Docker (RECOMMENDED)
For the fastest development experience:
# 1. Clone and setup environment
git clone <your-repo-url>
cd snowflake-mcp-lambda
make setup # Copies .env.example to .env
# 2. IMPORTANT: Edit .env with required values
# Required: GEMINI_API_KEY (for LLM functionality)
# Optional: GOOGLE_CLIENT_* (for auth), database passwords
# 3. Start everything with one command
make dev-setup
# This builds containers, starts services, and runs migrations
# 4. Access the application
# - Frontend: http://localhost:3000 (React app)
# - Backend API: http://localhost:8000 (FastAPI)
# - API Docs: http://localhost:8000/docs (Interactive API docs)
Common development commands:
make help # Show all available commands
make up # Start services
make down # Stop services
make logs # View all logs
make test # Run backend tests
make health # Check service health
make wait-healthy # Wait for all services to be ready
make clean # Clean up containers/volumes
💡 Key Features
🎯 MVP User Experience
- Upload CSV File - Drag-and-drop interface with progress indication
- Schema Preview - Auto-detected columns and data types with confirmation
- Chat Interface - Ask questions about your data in natural language
- SQL Generation - Gemini LLM converts questions to SQLite queries
- Results Display - Interactive table with sorting, filtering, export options
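To make the flow above concrete, here is a rough sketch of what the upload → question → results round trip could look like from an API client. The endpoint paths, payload fields, and response keys below are illustrative assumptions, not the finalized API.

# Sketch of the upload -> chat -> results flow using the requests library.
# Endpoint paths and response fields are assumptions for illustration only.
import requests

BASE_URL = "http://localhost:8000/api/v1"

# 1. Upload a CSV file (hypothetical endpoint and response shape)
with open("sales.csv", "rb") as f:
    upload = requests.post(f"{BASE_URL}/data/upload", files={"file": f})
upload.raise_for_status()
file_id = upload.json()["file_id"]  # assumed response field

# 2. Ask a question about the uploaded data in natural language
chat = requests.post(
    f"{BASE_URL}/chat",
    json={"file_id": file_id, "prompt": "What were total sales by region?", "autorun": True},
)
chat.raise_for_status()

# 3. Inspect the generated SQL and the query results
result = chat.json()
print(result.get("sql"))   # generated SQLite query
print(result.get("rows"))  # result rows when autorun is enabled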
🔒 Security & Privacy
- No data persistence - Files deleted after session ends
- Read-only queries - Only SELECT statements allowed
- User-provided API keys - Gemini API key never stored server-side
- File validation - Size limits, format validation, secure processing
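As a rough illustration of the read-only guarantee, a guard along these lines rejects anything that is not a single SELECT statement. This is only a sketch; the backend's actual SQL validation may use stricter parsing.

# Minimal sketch of a read-only query guard; the real validation logic
# in the backend may be more thorough (e.g. full statement parsing).
import re

FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH|PRAGMA|REPLACE)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Allow only single SELECT statements; reject anything that mutates data."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # reject multi-statement input
        return False
    if not statement.upper().startswith("SELECT"):
        return False
    return not FORBIDDEN.search(statement)

assert is_read_only("SELECT region, SUM(sales) FROM data GROUP BY region")
assert not is_read_only("DROP TABLE data")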
⚡ Performance
- Fast uploads - Supports files up to 100MB
- Efficient processing - CSV → SQLite conversion with schema inference
- Quick queries - In-memory SQLite for fast query execution
- Session-based - No database dependencies for core functionality
⚙️ Configuration
Required API Keys
Gemini API (Required for LLM functionality)
- Go to Google AI Studio
- Create a new API key
- Add to your .env file: GEMINI_API_KEY=your-api-key
Google OAuth (Optional for MVP)
- Go to Google Cloud Console
- Create OAuth 2.0 Client ID credentials
- Add redirect URI: http://localhost:8000/api/v1/auth/callback
- Add to .env: GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET
Environment Variables
Copy .env.example to .env and fill in your values:
# Required for core functionality
GEMINI_API_KEY=your-gemini-api-key
# Optional for authentication (can use without login)
GOOGLE_CLIENT_ID=your-google-client-id
GOOGLE_CLIENT_SECRET=your-google-client-secret
# Auto-generated secure secrets
JWT_SECRET_KEY=your-secure-secret-key
POSTGRES_PASSWORD=your-db-password
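For reference, the backend's settings loading could look roughly like the sketch below if it uses pydantic-settings; that library choice and the exact field names are assumptions here, but the idea is that missing required values fail fast at startup.

# Assumed configuration sketch using pydantic-settings; field names mirror
# the .env variables above but are not guaranteed to match the real code.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    gemini_api_key: str                      # required for LLM functionality
    google_client_id: str | None = None      # optional: Google OAuth
    google_client_secret: str | None = None  # optional: Google OAuth
    jwt_secret_key: str
    postgres_password: str

settings = Settings()  # raises a validation error if required values are missing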
🏗️ Architecture
Tech Stack
- Backend: FastAPI (Python) + Poetry for dependency management
- Frontend: React + Vite + TypeScript + Tailwind CSS
- Data Processing: pandas + SQLite for CSV file handling
- LLM Integration: Google Gemini API for natural language queries
- Authentication: Google OAuth with JWT cookies (optional)
- Development: Docker Compose for local development
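For the optional authentication layer, the JWT-cookie check might be wired up as a FastAPI dependency along these lines. The cookie name, claims, and error handling below are illustrative assumptions rather than the project's actual implementation.

# Sketch of a JWT-cookie session check as a FastAPI dependency (assumed names).
import jwt  # PyJWT
from fastapi import Cookie, HTTPException

JWT_SECRET_KEY = "your-secure-secret-key"  # loaded from .env in the real app

def get_current_user(session: str | None = Cookie(default=None)) -> dict:
    """Decode the session cookie into user claims, or reject the request."""
    if session is None:
        raise HTTPException(status_code=401, detail="Not authenticated")
    try:
        return jwt.decode(session, JWT_SECRET_KEY, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired session")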
File Processing Pipeline
CSV Upload → Schema Detection → SQLite Conversion → Query Interface
- CSV Upload - File validation, size limits, error handling
- Schema Detection - Column types, data preview, user confirmation
- SQLite Conversion - In-memory DB, fast queries, session scope
- Query Interface - Natural language, SQL generation, results display
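In code, the core of this pipeline boils down to a few lines of pandas plus sqlite3. The sketch below is a simplified assumption of how the conversion step could work; the real service adds file validation, size limits, and session scoping.

# Simplified sketch of the CSV -> SQLite conversion step; table name is illustrative.
import sqlite3
import pandas as pd

def csv_to_sqlite(csv_path: str) -> sqlite3.Connection:
    """Load a CSV, let pandas infer column types, and stage it in an in-memory DB."""
    df = pd.read_csv(csv_path)           # schema/type inference happens here
    conn = sqlite3.connect(":memory:")   # session-scoped, nothing persisted to disk
    df.to_sql("uploaded_data", conn, index=False)
    return conn

conn = csv_to_sqlite("sales.csv")
print(conn.execute("SELECT COUNT(*) FROM uploaded_data").fetchone())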
🧪 Testing & Development
Pre-commit Quality Checks
Our pre-commit hooks run the same checks as CI for immediate feedback:
- Ruff - Fast Python linting and formatting
- MyPy - Type checking with strict mode
- Pytest - Full test suite with 85%+ coverage requirement
- Security - Basic vulnerability scanning
# Setup development environment
./scripts/setup-dev.sh
# All commits now run full CI checks locally
# (30-60 seconds per commit, but zero CI failures)
Test Coverage
- Backend: 91%+ coverage with comprehensive unit and integration tests
- Frontend: Component tests with vitest + E2E tests with cypress
- File Processing: Edge cases for various CSV formats and encodings
- LLM Integration: Mocked Gemini responses for reliable testing
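As an example of the mocked-LLM approach, a test can stub the Gemini client so no external API calls are made. The service and method names below are hypothetical, used only to show the pattern.

# Sketch of the mocked-Gemini testing pattern; "generate_sql" and the helper
# below are hypothetical names, not the project's actual API.
from unittest.mock import MagicMock

def answer_question(gemini_service, question: str) -> str:
    """Hypothetical helper: turn a natural-language question into SQL via the LLM."""
    sql = gemini_service.generate_sql(question)
    if not sql.strip().upper().startswith("SELECT"):
        raise ValueError("Only read-only SELECT queries are allowed")
    return sql

def test_answer_question_with_mocked_gemini():
    gemini_service = MagicMock()
    gemini_service.generate_sql.return_value = "SELECT COUNT(*) FROM uploaded_data"

    sql = answer_question(gemini_service, "How many rows are there?")

    assert sql == "SELECT COUNT(*) FROM uploaded_data"
    gemini_service.generate_sql.assert_called_once_with("How many rows are there?")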
📋 Development Status
Implementation Progress
See docs/planning/PROJECT_STATUS.md for detailed status and next steps.
Key files for developers and AI assistants:
- docs/planning/PROJECT_STATUS.md - Current status and ready-to-execute prompts
- docs/planning/spec_v2.md - Complete feature requirements and architecture
- DEVELOPMENT.md - Developer setup and troubleshooting guide
- CLAUDE.md - Project context for AI code assistants
Current Priority: CSV Upload MVP
The next major milestone is implementing CSV file upload capability that allows users to:
- Upload CSV files without requiring database credentials
- Preview and validate detected schema
- Query uploaded data using natural language via chat
- Export results and manage session data
This creates a complete, demo-ready application that works entirely with user-provided files.
🤖 AI-Assisted Development Notes
This project is designed for seamless AI-assisted development:
Why Comprehensive Pre-commit Hooks?
AI assistants can't directly access CI logs. When CI fails, humans must manually copy error messages back to the AI, creating a frustrating debugging loop. Our solution:
- Immediate feedback - AI gets errors instantly, not after CI failure
- Same environment - What passes locally passes in CI
- Faster iteration - Fix issues in the same conversation
- No context switching - AI can resolve issues without human intervention
AI Assistant Friendly Features
- Clear documentation - Comprehensive planning docs in docs/planning/
- Ready-to-execute prompts - Specific implementation tasks in PROJECT_STATUS.md
- Modular architecture - Clean separation of concerns for focused changes
- Comprehensive tests - Confidence that changes don't break existing functionality
This approach turns AI assistants from helpful but sometimes frustrating tools into genuinely reliable development partners.
Transform your CSV data into insights through natural language conversation. No database setup required - just upload and start asking questions!