# Data Strategy: Kaggle-First

## Philosophy

Start with simple, single-table tabular datasets from Kaggle to reproduce and optimize MALMAS results. Graduate to multi-table relational datasets in a future phase once the core system is stable.

## Why Kaggle?
| Advantage | Explanation |
|---|---|
| Real-world complexity | Actual competition data with realistic noise, missing values, categorical features |
| Clear evaluation | Public leaderboards provide ground-truth performance targets |
| Rich metadata | Dataset descriptions, discussion insights, proven solutions |
| Multi-table path | Many competitions have relational data (e.g., Home Credit, Porto Seguro) |
| API access | kagglehub library enables programmatic download |
## Phase 1 Datasets (Single-Table, Lower Risk)

| Dataset | Kaggle Competition | Rows | Features | Task | Why Include |
|---|---|---|---|---|---|
| Titanic | titanic | 891 | 12 | Binary Classification | Hello-world, fast iteration |
| House Prices | house-prices-advanced-regression-techniques | 1,460 | 81 | Regression | Mixed types, missing values |
| Porto Seguro | porto-seguro-safe-driver-prediction | 595K | 57 | Binary Classification | Large scale, anonymized features |
| Santander | santander-customer-transaction-prediction | 200K | 200 | Binary Classification | High dimensionality |
| California Housing | N/A (sklearn) | 20K | 8 | Regression | Baseline sanity check |
## Phase 2 Datasets (Multi-Table, Higher Complexity) — Future Work

| Dataset | Competition | Tables | Challenge |
|---|---|---|---|
| Home Credit | home-credit-default-risk | 7 | Relational joins, feature aggregation |
| IEEE-CIS Fraud | ieee-fraud-detection | 4 | Transaction + identity tables |
| Recruit Restaurant | recruit-restaurant-visitor-forecasting | 5 | Time-series + relational |
## Data Ingestion Architecture

```python
# src/feature_forge/data/ingestion.py
from abc import ABC, abstractmethod

import pandas as pd


class DatasetFetcher(ABC):
    """Abstract base for dataset fetchers."""

    @abstractmethod
    def fetch(self, name: str, save_dir: str = "data/raw") -> dict:
        """Download dataset and return paths + metadata.

        Returns:
            dict with keys: train_path, test_path, target_column,
            description, task_type
        """


class KaggleFetcher(DatasetFetcher):
    """Fetch datasets from Kaggle using kagglehub."""

    def fetch(self, name: str, save_dir: str = "data/raw") -> dict:
        import kagglehub

        path = kagglehub.dataset_download(name)
        # Parse competition structure
        # Return standardized metadata
        return {...}


class OpenMLFetcher(DatasetFetcher):
    """Fetch datasets from OpenML."""

    def fetch(self, name: str, save_dir: str = "data/raw") -> dict:
        from sklearn.datasets import fetch_openml

        # Download and save locally
        return {...}


class LocalFetcher(DatasetFetcher):
    """Load from local files."""

    def fetch(self, name: str, save_dir: str = "data/raw") -> dict:
        # Load from data/raw/{name}/
        return {...}
```
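As a concrete illustration, a `LocalFetcher.fetch` body might look like the sketch below. The directory layout (`{save_dir}/{name}/train.csv`, `test.csv`, `metadata.json`) and the metadata key names are assumptions based on the interface and the metadata format described in this document, not a confirmed implementation.

```python
import json
from pathlib import Path


def local_fetch(name: str, save_dir: str = "data/raw") -> dict:
    """Sketch of LocalFetcher.fetch: resolve paths under {save_dir}/{name}/
    and read metadata.json if present (hypothetical layout)."""
    root = Path(save_dir) / name
    meta_path = root / "metadata.json"
    metadata = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return {
        "train_path": str(root / "train.csv"),
        "test_path": str(root / "test.csv"),
        "target_column": metadata.get("target_column"),
        "description": metadata.get("description", ""),
        "task_type": metadata.get("task", "unknown"),
    }
```

Returning a plain dict keeps the three fetchers interchangeable behind the `DatasetFetcher` interface.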
## Dataset Registry

```python
# src/feature_forge/data/registry.py
class DatasetRegistry:
    """Built-in registry of known datasets."""

    DATASETS = {
        "titanic": {
            "source": "kaggle",
            "competition": "titanic",
            "task": "classification",
            "target": "Survived",
            "description": "Predict survival on the Titanic",
        },
        "house-prices": {
            "source": "kaggle",
            "competition": "house-prices-advanced-regression-techniques",
            "task": "regression",
            "target": "SalePrice",
            "description": "Predict house sale prices",
        },
        # ... etc
    }

    @classmethod
    def list(cls) -> list[str]:
        return list(cls.DATASETS.keys())

    @classmethod
    def get(cls, name: str) -> dict:
        return cls.DATASETS[name]
```
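The registry is queried like any class-level lookup table. The snippet below condenses the class to a runnable example with two entries; the entry payloads are abbreviated from the definition above.

```python
# Condensed copy of DatasetRegistry for a self-contained example.
class DatasetRegistry:
    DATASETS = {
        "titanic": {"source": "kaggle", "task": "classification", "target": "Survived"},
        "house-prices": {"source": "kaggle", "task": "regression", "target": "SalePrice"},
    }

    @classmethod
    def list(cls) -> list[str]:
        return list(cls.DATASETS.keys())

    @classmethod
    def get(cls, name: str) -> dict:
        return cls.DATASETS[name]


print(DatasetRegistry.list())                    # ['titanic', 'house-prices']
print(DatasetRegistry.get("titanic")["target"])  # Survived
```

Because `DATASETS` is an ordinary dict, insertion order is preserved in `list()`, and unknown names raise `KeyError` from `get()`.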
## Sample Datasets

Small samples (<1MB) committed to repo for quick testing:

```
data/samples/
├── titanic_sample.csv            # 100 rows
├── house_prices_sample.csv       # 100 rows
└── california_housing_sample.csv # 100 rows
```
Usage:

```python
from feature_forge.data.registry import DatasetRegistry
from feature_forge.data.loader import DatasetLoader

# Quick test with sample (no internet)
loader = DatasetLoader(use_sample=True)
df = loader.load("titanic")

# Full dataset (downloads if needed)
loader = DatasetLoader(use_sample=False)
df = loader.load("titanic")  # Fetches from Kaggle
```
## Data Metadata Format

Each dataset has a metadata.json:

```json
{
  "name": "titanic",
  "source": "kaggle",
  "competition": "titanic",
  "task": "classification",
  "target_column": "Survived",
  "description": "Predict survival on the Titanic",
  "columns": [
    {"name": "PassengerId", "type": "numeric", "role": "id"},
    {"name": "Survived", "type": "categorical", "role": "target"},
    {"name": "Pclass", "type": "categorical", "role": "feature"},
    {"name": "Name", "type": "text", "role": "feature"},
    {"name": "Sex", "type": "categorical", "role": "feature"},
    {"name": "Age", "type": "numeric", "role": "feature", "missing": true},
    {"name": "SibSp", "type": "numeric", "role": "feature"},
    {"name": "Parch", "type": "numeric", "role": "feature"},
    {"name": "Ticket", "type": "text", "role": "feature"},
    {"name": "Fare", "type": "numeric", "role": "feature"},
    {"name": "Cabin", "type": "categorical", "role": "feature", "missing": true},
    {"name": "Embarked", "type": "categorical", "role": "feature", "missing": true}
  ],
  "num_samples": 891,
  "num_features": 11,
  "missing_values": true,
  "has_text": true
}
```
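Since this file is generated during ingestion, a lightweight validator helps catch malformed metadata early. The sketch below checks the required keys and column roles implied by the example above; the function itself is hypothetical, not part of feature_forge.

```python
REQUIRED_KEYS = {"name", "source", "task", "target_column", "columns"}
VALID_ROLES = {"id", "target", "feature"}


def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems found in a metadata.json dict (empty = valid)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - meta.keys())]
    targets = [c for c in meta.get("columns", []) if c.get("role") == "target"]
    if len(targets) != 1:
        problems.append(f"expected exactly one target column, found {len(targets)}")
    for col in meta.get("columns", []):
        if col.get("role") not in VALID_ROLES:
            problems.append(f"column {col.get('name')}: unknown role {col.get('role')}")
    return problems


good = {
    "name": "titanic",
    "source": "kaggle",
    "task": "classification",
    "target_column": "Survived",
    "columns": [
        {"name": "Survived", "type": "categorical", "role": "target"},
        {"name": "Age", "type": "numeric", "role": "feature"},
    ],
}
print(validate_metadata(good))  # []
```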
## Kaggle API Setup

Users need to configure Kaggle credentials:

```bash
# Install the Kaggle CLI
pip install kaggle

# Download an API token from kaggle.com/account
# and save it to ~/.kaggle/kaggle.json

# Or use environment variables
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_key
```
In feature_forge:
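One way feature_forge could fail fast when credentials are missing is the check below. The helper name is hypothetical; the two locations it probes (the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables and `~/.kaggle/kaggle.json`) are the standard ones described above.

```python
import os
from pathlib import Path


def kaggle_credentials_available() -> bool:
    """Check both standard credential locations before attempting a download."""
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    return (Path.home() / ".kaggle" / "kaggle.json").exists()
```

Running this check before `kagglehub.dataset_download()` turns an opaque download failure into an actionable setup message.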
## Data Flow

```
User requests dataset "titanic"
        │
        ▼
DatasetRegistry.get("titanic") → metadata
        │
        ▼
DatasetLoader.load("titanic")
        │
        ├─→ Check data/raw/titanic/ exists?
        │     ├─→ Yes → Load from disk
        │     └─→ No  → Fetch from Kaggle
        │             ├─→ kagglehub.dataset_download()
        │             ├─→ Parse train.csv, test.csv
        │             ├─→ Generate metadata.json
        │             └─→ Save to data/raw/titanic/
        │
        ▼
Return: df_train, df_test, target, metadata
```
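The cache-or-fetch branch in the flow above can be sketched as follows; the function name and the `fetch` callback parameter are illustrative, not the actual `DatasetLoader` API.

```python
from pathlib import Path
from typing import Callable


def load_or_fetch(name: str, fetch: Callable[[str], None], raw_dir: str = "data/raw") -> Path:
    """Return the local dataset directory, invoking fetch() only on a cache miss."""
    dataset_dir = Path(raw_dir) / name
    if not dataset_dir.exists():
        dataset_dir.mkdir(parents=True)
        fetch(name)  # e.g. kagglehub download + metadata.json generation
    return dataset_dir
```

Keying the cache on the directory's existence keeps repeat runs offline-friendly, at the cost of needing a manual delete to force a re-download.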
## Future: Multi-Table Support

```python
# Future extension for Phase 2
import pandas as pd


class RelationalDataset:
    """Multi-table dataset with relationships."""

    def __init__(self, tables: dict[str, pd.DataFrame], relationships: list[dict]):
        self.tables = tables
        self.relationships = relationships

    def join(self, table1: str, table2: str, on: str) -> pd.DataFrame:
        """Join two tables on a key."""
        return pd.merge(self.tables[table1], self.tables[table2], on=on)
```

This will enable experiments on relational feature engineering (e.g., aggregation across joined tables).
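A small usage sketch with toy tables shows the join-then-aggregate pattern Phase 2 targets. The table names and `relationships` schema are invented for illustration (loosely modeled on Home Credit's application/bureau split), and pandas is assumed to be available.

```python
import pandas as pd


class RelationalDataset:
    """As sketched above: named tables plus a join helper."""

    def __init__(self, tables: dict[str, pd.DataFrame], relationships: list[dict]):
        self.tables = tables
        self.relationships = relationships

    def join(self, table1: str, table2: str, on: str) -> pd.DataFrame:
        return pd.merge(self.tables[table1], self.tables[table2], on=on)


ds = RelationalDataset(
    tables={
        "applications": pd.DataFrame({"app_id": [1, 2], "amount": [100, 250]}),
        "bureau": pd.DataFrame({"app_id": [1, 1, 2], "past_loans": [3, 1, 0]}),
    },
    relationships=[{"left": "applications", "right": "bureau", "on": "app_id"}],
)

joined = ds.join("applications", "bureau", on="app_id")
# Aggregation across the joined table — the relational feature-engineering pattern:
agg = joined.groupby("app_id")["past_loans"].sum()
print(agg.to_dict())  # {1: 4, 2: 0}
```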