abhisiroha/data_mcp
This document provides a structured summary of a Model Context Protocol (MCP) server designed to assist users in preparing data for model training.
data_mcp
Model Context Protocol (MCP) - Feature Engineering Service
This project provides a FastAPI-based server to help users ingest tabular data and automatically engineer features, making data ready for machine learning model training. It leverages PySpark for scalable data processing and supports both CSV and Parquet files.
Features
- Data Ingestion: Load CSV or Parquet files and get schema, null stats, and sample rows.
- Automated Feature Engineering: Impute missing values, encode categorical variables, extract datetime features, and assemble features for ML.
- Interactive API: Suggests required information (e.g., categorical columns) if not provided.
- Scalable: Uses PySpark for handling large datasets.
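The feature-engineering steps listed above can be illustrated on a few in-memory rows. The sketch below deliberately uses plain Python instead of PySpark so that it runs without Java. The function names (`impute_mean`, `one_hot`, `datetime_parts`) and the generated column names are assumptions made for illustration, not the project's actual implementation.

```python
from datetime import datetime
from statistics import mean


def impute_mean(rows, column):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    categories = sorted({r[column] for r in rows})
    out = []
    for r in rows:
        encoded = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            # Column naming is hypothetical, e.g. cat1_oh_A
            encoded[f"{column}_oh_{cat}"] = 1 if r[column] == cat else 0
        out.append(encoded)
    return out


def datetime_parts(rows, column, fmt="%Y-%m-%d"):
    """Replace a date-string column with year, month, and day-of-week columns."""
    out = []
    for r in rows:
        ts = datetime.strptime(r[column], fmt)
        expanded = {k: v for k, v in r.items() if k != column}
        expanded[f"{column}_year"] = ts.year
        expanded[f"{column}_month"] = ts.month
        expanded[f"{column}_dow"] = ts.weekday()  # Monday == 0
        out.append(expanded)
    return out


rows = [
    {"col1": 1.0, "cat1": "A", "date_col": "2024-01-15"},
    {"col1": None, "cat1": "B", "date_col": "2024-02-01"},
    {"col1": 3.0, "cat1": "A", "date_col": "2024-03-10"},
]
features = datetime_parts(one_hot(impute_mean(rows, "col1"), "cat1"), "date_col")
print(features[1])
```

In a PySpark setting the same ideas would typically map onto distributed transformers rather than Python loops; the sketch is only meant to show what each step does to a row.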
Requirements
- Python 3.8+
- Java (for PySpark)
- See requirements.txt for Python dependencies
Installation
# Clone the repository
# cd into the project directory
python -m venv env
source env/bin/activate # On Windows: env\Scripts\activate
pip install -r requirements.txt
Configuration
- Allowed file types: CSV, Parquet
- Max file size: 200MB (see config.py)
- Data directory: Set DATA_DIR environment variable if needed
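A minimal sketch of how these limits might be enforced before a file is loaded. The contents of config.py are not shown here, so the constant names, the helper functions, the fallback to the current directory when DATA_DIR is unset, and the reading of 200MB as 200 * 1024 * 1024 bytes are all assumptions.

```python
import os
from pathlib import Path

# Values taken from the configuration list above; names are hypothetical.
ALLOWED_FILE_TYPES = {"csv", "parquet"}
MAX_FILE_SIZE_BYTES = 200 * 1024 * 1024  # assumes 200MB means binary megabytes


def resolve_data_path(file_path, data_dir=None):
    """Join a file path onto DATA_DIR (assumed default: current directory)."""
    base = Path(data_dir or os.environ.get("DATA_DIR", "."))
    return base / file_path


def validate_file(path, file_type):
    """Raise ValueError if the type or size is outside the limits; return the size."""
    if file_type.lower() not in ALLOWED_FILE_TYPES:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    size = Path(path).stat().st_size
    if size > MAX_FILE_SIZE_BYTES:
        raise ValueError(f"File is {size} bytes, limit is {MAX_FILE_SIZE_BYTES}")
    return size


print(resolve_data_path("data.csv", data_dir="data"))
```

Checking type and size up front keeps an oversized or unsupported file from ever reaching the Spark reader.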
Running the Server
uvicorn main:app --reload --host 0.0.0.0 --port 8000
API Endpoints
1. Ingest Data
POST /ingest
Request Body:
{
"file_path": "path/to/data.csv",
"file_type": "csv",
"delimiter": ","
}
Response:
{
"columns": ["col1", "col2", ...],
"dtypes": {"col1": "int", "col2": "string", ...},
"null_stats": {"col1": 0, "col2": 5, ...},
"sample_rows": [ {"col1": 1, "col2": "A"}, ... ]
}
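To make the response fields concrete, the sketch below computes the same four keys for a small CSV string. It uses the standard-library `csv` module rather than PySpark, infers only two types ("int" and "string"), and treats empty cells as nulls; the function name `summarize_csv` and these rules are illustrative assumptions, not the server's actual logic.

```python
import csv
import io


def _is_int(value):
    """Return True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def summarize_csv(text, delimiter=",", sample_size=2):
    """Build a dict shaped like the /ingest response from CSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    columns = list(reader.fieldnames or [])

    # Empty strings (and short rows, which DictReader fills with None) count as nulls.
    null_stats = {c: sum(1 for r in rows if r[c] in ("", None)) for c in columns}

    dtypes = {}
    for c in columns:
        values = [r[c] for r in rows if r[c] not in ("", None)]
        dtypes[c] = "int" if values and all(_is_int(v) for v in values) else "string"

    def convert(column, value):
        if value in ("", None):
            return None
        return int(value) if dtypes[column] == "int" else value

    sample_rows = [{c: convert(c, r[c]) for c in columns} for r in rows[:sample_size]]
    return {
        "columns": columns,
        "dtypes": dtypes,
        "null_stats": null_stats,
        "sample_rows": sample_rows,
    }


summary = summarize_csv("col1,col2\n1,A\n2,\n3,B\n")
print(summary)
```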
2. Feature Engineering
POST /feature-engineer
Request Body:
{
"target_column": "label",
"categorical_columns": ["cat1", "cat2"],
"datetime_columns": ["date_col"],
"feature_methods": ["auto"],
"custom_params": {}
}
Response:
{
"engineered_columns": ["col1", "cat1_oh", ...],
"transformations": {"cat1": "onehot", ...},
"feature_stats": {"col1": {"mean": 5.2, "count": 100}, ...},
"message": "Auto feature engineering complete"
}
If required info is missing, the API will suggest or request it interactively.
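How the server decides what to suggest is not documented here. As one plausible heuristic, a client could derive candidate categorical columns from the `dtypes` mapping returned by /ingest and use them when none are supplied. Both helper names below, and the rule "every string column except the target is categorical", are hypothetical.

```python
def suggest_categorical_columns(dtypes, target_column=None):
    """Hypothetical heuristic: string-typed columns, excluding the target."""
    return [c for c, t in dtypes.items() if t == "string" and c != target_column]


def build_feature_request(dtypes, target_column=None, categorical_columns=None):
    """Assemble a /feature-engineer body, filling in categorical columns if omitted."""
    if categorical_columns is None:
        categorical_columns = suggest_categorical_columns(dtypes, target_column)
    payload = {
        "categorical_columns": categorical_columns,
        "feature_methods": ["auto"],
        "custom_params": {},
    }
    if target_column is not None:
        payload["target_column"] = target_column
    return payload


# dtypes as they might appear in an /ingest response
dtypes = {"col1": "int", "cat1": "string", "label": "string"}
print(build_feature_request(dtypes, target_column="label"))
```

Explicitly passed `categorical_columns` always win over the suggestion, so the heuristic only acts as a fallback.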
Example Usage (Python)
import requests
# Ingest data
resp = requests.post("http://localhost:8000/ingest", json={
"file_path": "data.csv",
"file_type": "csv",
"delimiter": ","
})
print(resp.json())
# Feature engineering
resp = requests.post("http://localhost:8000/feature-engineer", json={
"categorical_columns": ["cat1", "cat2"]
})
print(resp.json())
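The example above does not check for HTTP errors or set a timeout. The following sketch shows one way to add both. It swaps `requests` for the standard-library `urllib` so that it has no third-party dependencies; the helper names `build_json_request` and `post_json`, the default base URL, and the 30-second timeout are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the uvicorn command in this README


def build_json_request(endpoint, payload, base_url=BASE_URL):
    """Create a POST request with a JSON body, without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(endpoint, payload, base_url=BASE_URL, timeout=30):
    """Send the request and decode the JSON response.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError if the server cannot be reached.
    """
    req = build_json_request(endpoint, payload, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `post_json("/ingest", {"file_path": "data.csv", "file_type": "csv", "delimiter": ","})` would return the parsed response as a dict.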
Logging
- Logs are printed to the console with timestamps and log levels (see logger.py).
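The contents of logger.py are not reproduced here. A console logger with timestamps and log levels can be set up with the standard `logging` module roughly as follows; the function name `get_logger`, the logger name, and the exact format string are assumptions.

```python
import logging


def get_logger(name="data_mcp", level=logging.INFO):
    """Return a logger that writes timestamped, levelled messages to the console."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()  # defaults to stderr
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


log = get_logger()
log.info("Server starting")
```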
License
MIT License. See .