data_mcp by abhisiroha - MCP Server

data_mcp

Model Context Protocol (MCP) - Feature Engineering Service

This project provides a FastAPI-based server that ingests tabular data and automatically engineers features, so the data is ready for machine learning model training. It uses PySpark for scalable data processing and supports both CSV and Parquet files.

Features

Data Ingestion: Load CSV or Parquet files and get schema, null stats, and sample rows.
Automated Feature Engineering: Impute missing values, encode categorical variables, extract datetime features, and assemble features for ML.
Interactive API: Suggests required information (e.g., categorical columns) when the request does not provide it.
Scalable: Uses PySpark for handling large datasets.
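
To make the feature engineering steps listed above concrete, here is a minimal plain-Python sketch of mean imputation, one-hot encoding, and datetime feature extraction on a small list of rows. It is not the project's implementation (which uses PySpark); the function names, the `column_category` naming of the encoded columns, and the sample data are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean


def impute_mean(rows, column):
    """Replace None in a numeric column with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded_rows = []
    for r in rows:
        encoded = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            encoded[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded_rows.append(encoded)
    return encoded_rows


def datetime_parts(rows, column, fmt="%Y-%m-%d"):
    """Add year, month, and day-of-week (Monday = 0) columns from a date string."""
    out = []
    for r in rows:
        dt = datetime.strptime(r[column], fmt)
        out.append({
            **r,
            f"{column}_year": dt.year,
            f"{column}_month": dt.month,
            f"{column}_dow": dt.weekday(),
        })
    return out


# Hypothetical sample data, used only to exercise the helpers above.
rows = [
    {"age": 30, "city": "NY", "signup": "2024-01-15"},
    {"age": None, "city": "LA", "signup": "2024-02-20"},
    {"age": 40, "city": "NY", "signup": "2024-03-05"},
]
engineered = datetime_parts(one_hot(impute_mean(rows, "age"), "city"), "signup")
print(engineered[1])
```

In a Spark pipeline the same ideas would typically be expressed with DataFrame transformations rather than Python loops, which is what lets them scale to large datasets.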

Requirements

Python 3.8+
Java (for PySpark)
See requirements.txt for Python dependencies

Installation

# Clone the repository
# cd into the project directory
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
pip install -r requirements.txt

Configuration

Allowed file types: CSV, Parquet
Max file size: 200 MB (see config.py)
Data directory: Set DATA_DIR environment variable if needed
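
The settings above could be expressed roughly as follows. This is a sketch under stated assumptions, not the contents of the project's config.py: the constant names, the `validate_file` helper, the default directory `"data"`, and the interpretation of 200 MB as 200 * 1024 * 1024 bytes are all hypothetical.

```python
import os
from pathlib import Path

# Assumed names; the real config.py may use different ones.
ALLOWED_FILE_TYPES = {"csv", "parquet"}
MAX_FILE_SIZE_BYTES = 200 * 1024 * 1024  # 200 MB, assuming binary megabytes
DATA_DIR = Path(os.environ.get("DATA_DIR", "data"))  # default "data" is an assumption


def validate_file(path, file_type, size_bytes=None):
    """Raise ValueError if the file type or size violates the configured limits."""
    if file_type.lower() not in ALLOWED_FILE_TYPES:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    if size_bytes is None:
        # Fall back to the size on disk when the caller does not supply one.
        size_bytes = Path(path).stat().st_size
    if size_bytes > MAX_FILE_SIZE_BYTES:
        raise ValueError(f"File is {size_bytes} bytes; limit is {MAX_FILE_SIZE_BYTES}")
```

Setting `DATA_DIR` before starting the server (for example `export DATA_DIR=/srv/data` in a POSIX shell) would change the directory picked up by this sketch.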

Running the Server

uvicorn main:app --reload --host 0.0.0.0 --port 8000

API Endpoints

1. Ingest Data

POST /ingest

Request Body:

{
  "file_path": "path/to/data.csv",
  "file_type": "csv",
  "delimiter": ","
}

Response:

{
  "columns": ["col1", "col2", ...],
  "dtypes": {"col1": "int", "col2": "string", ...},
  "null_stats": {"col1": 0, "col2": 5, ...},
  "sample_rows": [ {"col1": 1, "col2": "A"}, ... ]
}
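
To illustrate how the fields in this response relate to the input file, the following sketch computes `columns`, `null_stats`, and `sample_rows` from CSV text using only the standard library. It is an approximation for explanation, not the server's PySpark code: the `summarize_csv` name, treating empty strings as nulls, and the sample size of 5 are assumptions, and type inference for `dtypes` is left out.

```python
import csv
import io


def summarize_csv(text, delimiter=",", sample_size=5):
    """Summarize CSV text into a dict shaped like part of the /ingest response."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = list(reader.fieldnames or [])
    null_stats = {c: 0 for c in columns}
    sample_rows = []
    for row in reader:
        for c in columns:
            # Assumption: a missing or empty cell counts as null.
            if row[c] is None or row[c] == "":
                null_stats[c] += 1
        if len(sample_rows) < sample_size:
            sample_rows.append(dict(row))
    return {"columns": columns, "null_stats": null_stats, "sample_rows": sample_rows}


summary = summarize_csv("col1,col2\n1,A\n2,\n3,C\n")
print(summary)
```

Note that the csv module returns every value as a string, so `sample_rows` here contains `"1"` rather than `1`; a typed reader such as Spark's would be needed to reproduce the `dtypes` field.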

2. Feature Engineering

POST /feature-engineer

Request Body:

{
  "target_column": "label",
  "categorical_columns": ["cat1", "cat2"],
  "datetime_columns": ["date_col"],
  "feature_methods": ["auto"],
  "custom_params": {}
}

Response:

{
  "engineered_columns": ["col1", "cat1_oh", ...],
  "transformations": {"cat1": "onehot", ...},
  "feature_stats": {"col1": {"mean": 5.2, "count": 100}, ...},
  "message": "Auto feature engineering complete"
}

If required information is missing from the request, the API suggests it or requests it interactively.
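
One way such suggestions could be derived is from the `dtypes` returned by /ingest. The sketch below is a hypothetical heuristic, not the server's actual logic: the `suggest_columns` name, the rule that string columns are categorical candidates, and the `"date"` and `"timestamp"` type names are assumptions.

```python
def suggest_columns(request_body, dtypes):
    """Return candidate columns for fields the caller left out of the request."""
    suggestions = {}
    target = request_body.get("target_column")
    if not request_body.get("categorical_columns"):
        # Assumed heuristic: string-typed columns other than the target.
        suggestions["categorical_columns"] = [
            name for name, dtype in dtypes.items()
            if dtype == "string" and name != target
        ]
    if not request_body.get("datetime_columns"):
        # Assumed type names for date-like columns.
        suggestions["datetime_columns"] = [
            name for name, dtype in dtypes.items()
            if dtype in ("date", "timestamp")
        ]
    return suggestions


dtypes = {"col1": "int", "col2": "string", "label": "string", "ts": "timestamp"}
print(suggest_columns({"target_column": "label"}, dtypes))
```

A client could feed these candidates back into a second /feature-engineer request after the user confirms them.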

Example Usage (Python)

import requests

# Ingest data
resp = requests.post("http://localhost:8000/ingest", json={
    "file_path": "data.csv",
    "file_type": "csv",
    "delimiter": ","
})
print(resp.json())

# Feature engineering
resp = requests.post("http://localhost:8000/feature-engineer", json={
    "categorical_columns": ["cat1", "cat2"]
})
print(resp.json())

Logging

Logs are printed to the console with timestamps and log levels (see logger.py).
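
A console logger with timestamps and log levels can be set up with the standard logging module along these lines. This is a sketch, not the contents of logger.py: the `get_logger` name, the logger name `"data_mcp"`, and the exact format string are assumptions.

```python
import logging
import sys


def get_logger(name="data_mcp", level=logging.INFO):
    """Return a console logger that prefixes messages with a timestamp and level."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        # Attach the handler only once so repeated calls do not duplicate output.
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger


logger = get_logger()
logger.info("Server starting")
```

Checking `logger.handlers` before adding a handler matters when the module is imported more than once, for example under `uvicorn --reload`.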

License

MIT License.