Data Science MCP Agent
A comprehensive Model Context Protocol (MCP) server for data science workflows, providing tools for data loading, exploration, visualization, processing, statistical analysis, and code generation.
Features
🔍 Data Exploration & Analysis
- Load and inspect CSV datasets
- Comprehensive data quality assessment
- Exploratory data analysis with statistics and insights
- Column-specific analysis and profiling
- Missing value and outlier detection
📊 Data Visualization
- Create scatter plots, histograms, bar charts, and line plots
- Generate correlation matrices and box plots
- Customizable plot titles and styling
🔧 Data Processing
- Filter datasets with complex conditions
- Transform columns (log, sqrt, standardize, normalize, etc.)
- Group and aggregate data with multiple functions
- Handle missing values with various imputation methods
📈 Statistical Analysis
- Correlation analysis between variables
- One-sample and paired t-tests
- ANOVA for group comparisons
- Chi-square tests for categorical associations
- Normality and homogeneity testing
- Power analysis and effect size calculations
💻 Code Generation & Execution
- Generate Python analysis code from natural language descriptions
- Execute generated code safely in controlled environment
- Store and retrieve code snippets
- Automatic file path resolution
🎯 Guided Workflows
- Pre-built prompt templates for common tasks
- Step-by-step data cleaning workflows
- Feature engineering guidance
- Statistical test recommendations
Installation
Prerequisites
- Python 3.8 or higher
- `uv` package manager (recommended) or `pip`
Setup
1. Clone or download the repository

   ```bash
   git clone <repository-url>
   cd data-science-mcp
   ```

2. Install dependencies

   ```bash
   # Using uv (recommended)
   uv venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   uv add "mcp[cli]" pandas numpy matplotlib seaborn scikit-learn scipy

   # Or using pip
   pip install "mcp[cli]" pandas numpy matplotlib seaborn scikit-learn scipy
   ```

3. Configure the data directory

   Edit the `DATA_DIR` path in `server.py`:

   ```python
   DATA_DIR = Path("/path/to/your/data/directory")
   ```

4. Run the server

   ```bash
   python server.py
   ```
Configuration for Claude Desktop
Add this configuration to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "data-science": {
      "command": "python",
      "args": ["/absolute/path/to/data-science-mcp/server.py"],
      "env": {}
    }
  }
}
```
Configuration file locations:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
Tools Reference
Data Loading & Management
- `load_csv(file_path)` - Load CSV files into the system
- `list_datasets()` - List all available datasets
- `get_dataset_info(file_path)` - Get detailed dataset information
- `save_csv(data, file_path)` - Save data to CSV format
- `upload_csv(content, file_path)` - Upload CSV content as string
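Each tool is a plain Python function exposed over MCP. As a rough, hypothetical sketch, a tool like `load_csv` could be registered with the MCP Python SDK's FastMCP server along these lines (the actual code in `server.py` may differ):

```python
# Minimal sketch, assuming the MCP Python SDK's FastMCP API;
# the real server.py may register tools differently.
from pathlib import Path

import pandas as pd
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("data-science")
DATA_DIR = Path("/path/to/your/data/directory")  # placeholder path

@mcp.tool()
def load_csv(file_path: str) -> str:
    """Load a CSV file and return a short summary."""
    df = pd.read_csv(DATA_DIR / file_path)
    return f"Loaded {file_path}: {df.shape[0]} rows, {df.shape[1]} columns"

if __name__ == "__main__":
    mcp.run()
```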
Data Exploration
- `explore_data(file_path, sample_rows=5)` - Comprehensive data exploration
- `describe_dataset(file_path)` - Generate descriptive statistics
- `get_columns_info(file_path, columns=None)` - Detailed column analysis
- `detect_outliers(file_path, columns=None, method="iqr")` - Outlier detection
- `check_data_quality(file_path)` - Complete data quality assessment
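For reference, the default `method="iqr"` in `detect_outliers` refers to the standard interquartile-range rule; a sketch of that rule in pandas (the tool's internals are not shown in this README and may differ):

```python
# Standard IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# Illustrative only; not taken from the project's source.
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]
```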
Data Visualization
- `plot_scatter(file_path, x_column, y_column, color_column=None)` - Scatter plots
- `plot_histogram(file_path, column, bins=20, kde=False)` - Histograms
- `plot_bar(file_path, x_column, y_column=None, aggfunc="count")` - Bar charts
- `plot_line(file_path, x_column, y_columns)` - Line plots
- `plot_correlation_matrix(file_path, columns=None)` - Correlation heatmaps
- `plot_box(file_path, columns, by_column=None)` - Box plots
- `get_correlation(file_path, column1, column2)` - Correlation analysis
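These tools build on matplotlib and seaborn (see Dependencies). As an illustration, the kind of chart `plot_scatter` produces can be reproduced manually roughly like this (hypothetical column names; not the server's own code):

```python
# Illustrative equivalent of a scatter plot with a color column;
# "age", "income", and "segment" are hypothetical columns.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")  # hypothetical dataset
sns.scatterplot(data=df, x="age", y="income", hue="segment")
plt.title("Age vs. Income")
plt.savefig("scatter.png")
```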
Data Processing
- `filter_data(file_path, condition, output_file_path=None)` - Filter datasets
- `transform_column(file_path, column, transformation, new_column=None)` - Transform columns
- `group_and_aggregate(file_path, group_by, aggregate_cols, aggregate_funcs)` - Group and aggregate
- `handle_missing_values(file_path, columns=None, method="mean")` - Handle missing data
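The transformations named under Features (log, sqrt, standardize, normalize) map onto standard pandas/numpy operations; a sketch of what `transform_column` plausibly does for two of them (assumed, not taken from the source):

```python
# Sketch of common column transformations; the names follow the
# Features list, the implementation details are assumed.
import numpy as np
import pandas as pd

def transform(series: pd.Series, transformation: str) -> pd.Series:
    if transformation == "log":
        return np.log(series)  # assumes strictly positive values
    if transformation == "standardize":
        return (series - series.mean()) / series.std()
    raise ValueError(f"Unknown transformation: {transformation}")
```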
Statistical Analysis
- `run_ttest(file_path, column, test_value=0)` - One-sample t-test
- `run_paired_ttest(file_path, column1, column2)` - Paired t-test
- `run_anova(file_path, value_column, group_column)` - One-way ANOVA
- `run_chi_square(file_path, column1, column2)` - Chi-square test
- `run_correlation_test(file_path, column1, column2, method="pearson")` - Correlation test
- `run_regression(file_path, dependent_var, independent_vars)` - Linear regression
- `check_normality(file_path, column, test="shapiro")` - Normality testing
- `check_homogeneity(file_path, value_column, group_column)` - Homogeneity testing
- `power_analysis(test_type, effect_size, alpha=0.05, power=0.8)` - Power analysis
- `effect_size(file_path, column1, column2=None, test_type="mean_diff")` - Effect size calculation
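These tests correspond to standard routines in scipy (a listed dependency). For example, the one-sample t-test behind `run_ttest` presumably reduces to something like the following (column name hypothetical):

```python
# One-sample t-test via scipy.stats; illustrative equivalent of run_ttest.
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")  # hypothetical dataset
res = stats.ttest_1samp(df["score"].dropna(), popmean=0)
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```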
Code Generation & Execution
- `generate_analysis_code(request, dataset_path)` - Generate Python code from descriptions
- `execute_code(code_id=None, code=None)` - Execute Python code
- `get_code(code_id)` - Retrieve stored code
- `save_code(code_id, file_path)` - Save code to file
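The Changelog notes automatic output capture. A common Python pattern for executing a code string while capturing its printed output, which `execute_code` may resemble (assumed; how the server actually sandboxes code is not specified here):

```python
# Sketch of executing a code string while capturing stdout.
# Real sandboxing requires more than this; details assumed.
import contextlib
import io

def run_snippet(code: str) -> str:
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

print(run_snippet("print(2 + 2)"))  # -> "4"
```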
Resources
The server exposes CSV data through the resource system:
- `csv://{file_path}` - Access CSV file content with preview and statistics
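In the MCP Python SDK, a URI-templated resource like this can be declared with a decorator; a minimal sketch (the server's own registration code may differ):

```python
# Minimal sketch of a templated MCP resource, assuming FastMCP's
# @resource decorator; not taken from the project's source.
import pandas as pd
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("data-science")

@mcp.resource("csv://{file_path}")
def csv_resource(file_path: str) -> str:
    df = pd.read_csv(file_path)
    return f"{df.head().to_string()}\n\n{df.describe().to_string()}"
```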
Prompt Templates
Pre-built prompts for common workflows:
- `analyze_dataset` - Comprehensive dataset analysis
- `explore_relationship` - Guided variable relationship exploration
- `data_science_assistant` - General assistance
- `data_cleaning_workflow` - Step-by-step data cleaning
- `feature_engineering_guide` - Feature engineering guidance
- `explain_correlation` - Correlation explanation
- `interpret_visualization` - Visualization interpretation
- `statistical_test_advisor` - Statistical test recommendations
- `modeling_workflow` - Guided modeling workflow
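Prompts in the MCP Python SDK are declared much like tools; a brief sketch of how one of these templates could be defined (assumed, not from the source):

```python
# Sketch of an MCP prompt template, assuming FastMCP's @prompt decorator.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("data-science")

@mcp.prompt()
def analyze_dataset(file_path: str) -> str:
    return (
        f"Load {file_path}, assess its quality, summarize key statistics, "
        "and report the most notable relationships between variables."
    )
```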
Usage Examples
Basic Data Analysis
1. Load your dataset: "Please load the file customer_data.csv"
2. Explore the data: "Can you explore this dataset and tell me what you find?"
3. Visualize relationships: "Create a scatter plot of age vs income"
4. Generate insights: "What correlations do you see in this data?"
Advanced Workflows
1. Data quality check: "Please assess the quality of my dataset"
2. Handle missing values: "How should I handle the missing data?"
3. Statistical analysis: "Test if there's a significant difference between groups"
4. Generate report: "Create a comprehensive analysis report"
Code Generation
1. "Generate code to analyze the correlation between all numeric variables"
2. "Create a regression model predicting sales from the other variables"
3. "Write code to detect and visualize outliers in the dataset"
File Path Handling
The server supports both absolute and relative paths:
- Relative paths: `"data.csv"` → resolves to `{DATA_DIR}/data.csv`
- Absolute paths: `"/full/path/to/data.csv"` → used as-is
- Auto-extension: `"data"` → automatically becomes `"data.csv"`
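Taken together, these rules amount to resolution logic along the following lines (a sketch; the actual code in `server.py` is not shown here):

```python
# Sketch of the path-resolution rules described above; names assumed.
from pathlib import Path

DATA_DIR = Path("/path/to/your/data/directory")  # placeholder

def resolve(file_path: str) -> Path:
    path = Path(file_path)
    if not path.is_absolute():
        path = DATA_DIR / path           # relative -> under DATA_DIR
    if path.suffix == "":
        path = path.with_suffix(".csv")  # auto-extension
    return path
```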
Error Handling
- Comprehensive input validation
- Clear error messages with suggestions
- Graceful handling of missing files or invalid data
- Safe code execution with proper error reporting
Data Directory
By default, the server uses a data directory for file operations:
- Location: Configurable in `server.py`
- Purpose: Centralized data storage and file resolution
- Sample data: Example dataset created automatically
Dependencies
- Core: `mcp`, `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`
- Statistics: `scipy`, `scikit-learn`
- System: `pathlib`, `os`, `io`
Architecture
The server is organized into modular components:
- `data_loading.py` - Data management and loading
- `exploration.py` - Data exploration and quality assessment
- `visualization.py` - Plotting and visualization tools
- `processing.py` - Data transformation and manipulation
- `statistical_tests.py` - Statistical analysis functions
- `code_generation.py` - Dynamic code generation and execution
- `templates.py` - Prompt templates for guided workflows
Contributing
- Fork the repository
- Create a feature branch
- Add new tools or improve existing ones
- Test with various datasets
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
For issues, questions, or feature requests:
- Check the error messages for specific guidance
- Ensure your data files are accessible and properly formatted
- Verify your Python environment has all required dependencies
- Review the tool documentation for proper usage
Changelog
Version 1.0.0
- Initial release with comprehensive data science toolkit
- Full MCP protocol compliance
- Automatic output capture and reporting
- Guided workflow templates
- Statistical analysis suite