Feature Forge Implementation Plan
Version: 0.1.0 | Date: May 2026 | Status: Draft
Overview
This document outlines the implementation plan for feature_forge, a modular experimentation platform for LLM-based multi-agent automated feature engineering (AFE). It is designed to systematically decompose, compare, and optimize feature engineering methods, starting with the MALMAS architecture and its competitive baselines.
Goals
- Reproduce and optimize the MALMAS paper results on standard tabular datasets
- Enable isolated experimentation of every component (agents, memory, router, baselines)
- Provide sklearn-compatible APIs for drop-in adoption
- Track everything (experiments, LLM costs, feature quality) with WandB and Langfuse
- Support dynamic data ingestion from Kaggle (starting simple, scaling to multi-table)
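
As an illustration of the sklearn-compatible goal above, the following is a minimal sketch of a transformer that follows scikit-learn's estimator conventions: the constructor only stores hyperparameters, `fit` returns `self`, and learned state lives in attributes ending with an underscore. It uses only the standard library so that it runs without dependencies. `RatioFeatureGenerator` and its parameters are hypothetical names, not part of feature_forge; a real implementation would presumably subclass `sklearn.base.BaseEstimator` and `TransformerMixin` instead of hand-writing `get_params` and `fit_transform`.

```python
from typing import List, Sequence


class RatioFeatureGenerator:
    """Hypothetical transformer that appends the ratio of two columns.

    Mirrors the scikit-learn fit/transform contract without importing it.
    """

    def __init__(self, numerator: int = 0, denominator: int = 1, eps: float = 1e-9):
        # The constructor only stores hyperparameters (no validation, no learning).
        self.numerator = numerator
        self.denominator = denominator
        self.eps = eps

    def get_params(self, deep: bool = True) -> dict:
        return {
            "numerator": self.numerator,
            "denominator": self.denominator,
            "eps": self.eps,
        }

    def set_params(self, **params) -> "RatioFeatureGenerator":
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        return self

    def fit(self, X: Sequence[Sequence[float]], y=None) -> "RatioFeatureGenerator":
        # Learned state carries a trailing underscore.
        self.n_features_in_ = len(X[0])
        return self

    def transform(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        if not hasattr(self, "n_features_in_"):
            raise RuntimeError("Call fit before transform.")
        out = []
        for row in X:
            # eps avoids a ZeroDivisionError when the denominator column is 0.
            ratio = row[self.numerator] / (row[self.denominator] + self.eps)
            out.append(list(row) + [ratio])  # copy the row; the input is not mutated
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Usage: two rows, two columns in; the same rows with one extra ratio column out.
X_demo = [[2.0, 4.0], [9.0, 3.0]]
X_new = RatioFeatureGenerator().fit_transform(X_demo)
```

Keeping to this contract is what would let a generated-feature step sit inside a standard scikit-learn `Pipeline` next to existing preprocessing and model steps.
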
Plan Structure
| Document | Purpose |
|---|---|
| 01_architecture.md | High-level architecture and design philosophy |
| 02_directory_structure.md | Complete directory layout |
| 03_key_design_decisions.md | Configuration, caching, sandboxing, plugin system |
| 04_implementation_phases.md | 13-phase implementation roadmap |
| 05_dependencies.md | pyproject.toml specification |
| 06_data_strategy.md | Kaggle-focused data ingestion strategy |
| 07_observability.md | structlog + Langfuse + OpenTelemetry |
| 08_experiment_tracking.md | WandB + MLflow abstraction |
| 09_baseline_selection.md | Why MALMAS + OpenFE + CAAFE + LLM-FE |
Research Basis
This plan is informed by:
- Google AI Search (May 2026): LLM-based AFE architecture best practices, WandB vs MLflow comparison, Langfuse multi-agent observability, structlog best practices
- Context7 Documentation: wandb, mlflow, langfuse-python, structlog official docs
- MALMAS Technical Roadmap (docs/MALMAS_Technical_Roadmap.md): Current state assessment and refactoring recommendations
- MALMAS Codebase Analysis (@/Users/minghao/Desktop/personal/MALMAS): Deep dive into existing methods, agents, baselines
- python-project-structure skill: pydantic-settings, YAML config, dotenvx secrets
- python-tooling skill: uv, ruff, pytest, pre-commit, CI/CD
Quick Start Decision Log
| Decision | Choice | Rationale |
|---|---|---|
| Dataset source | Kaggle | Real-world datasets, clear path to multi-table complexity |
| Experiment config | Python-first, YAML supported | Flexibility for researchers, declarative option for reproducibility |
| LLM caching | Enforced default ON | Prevent accidental API costs; explicit opt-out only |
| Tracking backend | WandB default, MLflow optional | Superior visualization, free academic tier, W&B Weave for LLM |
| Observability | Langfuse cloud | Zero infra overhead, hierarchical tracing, prompt management |
| Logging | structlog | 2x faster than stdlib, JSON in prod, pretty in dev, OTel integration |
| Baselines | OpenFE + CAAFE + LLM-FE | Top 3 non-MALMAS methods per 2026 rankings |
| Package manager | uv | Modern, fast, deterministic with uv.lock |
| Layout | src/ | Tests run against installed package |
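
The "enforced default ON" caching decision can be sketched as a thin wrapper around whatever function actually calls the model. This is only an illustration under stated assumptions: `CachedLLM`, `fake_model`, and the `use_cache` flag are hypothetical names, the store is an in-memory dict where a real cache would likely persist to disk, and the cache key here covers only the model name and prompt.

```python
import hashlib
import json
from typing import Callable, Dict


class CachedLLM:
    """Hypothetical wrapper: caching is on unless the caller opts out explicitly."""

    def __init__(self, call_model: Callable[[str, str], str], use_cache: bool = True):
        self._call_model = call_model
        self._use_cache = use_cache          # default ON; opt-out must be spelled out
        self._store: Dict[str, str] = {}     # in-memory stand-in for a persistent cache
        self.api_calls = 0                   # counts calls that reached the backend

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Deterministic key: identical (model, prompt) pairs hash to the same value.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if self._use_cache and key in self._store:
            return self._store[key]          # cache hit: no backend call, no cost
        self.api_calls += 1
        response = self._call_model(model, prompt)
        if self._use_cache:
            self._store[key] = response
        return response


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM client so the sketch runs offline."""
    return f"[{model}] echo: {prompt}"


# Usage: the second identical request is served from the cache.
llm = CachedLLM(fake_model)
first = llm.complete("demo-model", "Suggest a feature")
second = llm.complete("demo-model", "Suggest a feature")
```

Making `use_cache=True` the default means a forgotten argument can never trigger repeated paid calls; skipping the cache requires writing `use_cache=False` at the call site, where it is visible in review.
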
Next Steps
- Review all plan documents in docs/plan/
- Approve or modify Phase 1 scope
- Begin implementation with uv init and directory scaffolding