# Baseline Selection Rationale

## Performance Ranking (2026)

Based on the MALMAS paper and comparative studies (Refs [1]–[3] below):
| Rank | Method | Type | Avg AUC Gain | Why Selected |
|---|---|---|---|---|
| 1 | MALMAS | LLM Multi-Agent | +0.05 to +0.08 | Our core implementation |
| 2 | LLM-FE | LLM Evolutionary | +0.04 to +0.06 | Strongest non-MALMAS LLM method |
| 3 | CAAFE | Context-Aware LLM | +0.03 to +0.05 | Generates features from descriptions |
| 4 | OpenFE | Traditional (Non-LLM) | +0.02 to +0.04 | Strongest non-LLM, "expert-level" |
| 5 | OCTree | Tree-based | +0.01 to +0.03 | Mid-tier, often outperformed |
| 6 | AutoFeat | Classical | +0.01 to +0.02 | Foundational but limited |
| 7 | DFS | Classical | ~0.00 to +0.01 | Basic relational synthesis |
## Selected Baselines for Phase 1

We implement MALMAS plus three baselines:
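The wrappers below all subclass a `Baseline` interface that is not defined in this document. A minimal hypothetical sketch of that contract, including the "no feature engineering" control from the experiment matrix:

```python
from abc import ABC, abstractmethod

class Baseline(ABC):
    """Hypothetical fit/transform contract shared by all baseline wrappers."""

    @abstractmethod
    def fit(self, X_train, y_train, **kwargs):
        """Generate/select candidate features from the training split."""

    @abstractmethod
    def transform(self, X):
        """Return X augmented with the generated features."""

class IdentityBaseline(Baseline):
    """The 'baseline' (no feature engineering) control: pass-through."""

    def fit(self, X_train, y_train, **kwargs):
        return self

    def transform(self, X):
        return X
```

Keeping the interface to `fit`/`transform` lets every method slot into the same evaluation loop regardless of whether it is LLM-based.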
### 1. OpenFE (Highest-Priority Non-LLM Baseline)

**Why:**
- Strongest non-LLM method; its paper reports beating ~99% of Kaggle teams on benchmark competitions
- "Feature boosting" algorithm with two-stage pruning
- Fast, deterministic, no API costs
- Serves as the "floor" that LLM methods must beat

**Implementation (sketch):**

```python
from openfe import OpenFE  # assumes the openfe package is installed

class OpenFEBaseline(Baseline):
    def fit(self, X_train, y_train):
        self.openfe = OpenFE()
        # fit() searches candidate features and prunes them in two stages
        self.openfe.fit(data=X_train, label=y_train, n_jobs=4)
        return self

    def transform(self, X):
        # Sketch only: depending on the openfe version, transform may be a
        # module-level helper taking the selected feature list, not a method
        return self.openfe.transform(X)
```
### 2. CAAFE (Context-Aware LLM Baseline)

**Why:**
- Generates features directly from natural-language dataset descriptions
- Uses an LLM, but with a simpler single-agent architecture
- Good comparison point for "multi-agent vs. single-agent"

**Implementation (sketch):**

```python
from caafe import CAAFEClassifier  # assumes the caafe package is installed

class CAAFEBaseline(Baseline):
    def fit(self, X_train, y_train, description: str):
        # Constructor arguments are a sketch; the released CAAFE API also
        # expects a base classifier and an LLM model name
        self.caafe = CAAFEClassifier(
            iterations=10,
            dataset_description=description,
        )
        self.caafe.fit(X_train, y_train)
        return self

    def transform(self, X):
        return self.caafe.transform(X)
```
### 3. LLM-FE (Evolutionary LLM Baseline)

**Why:**
- Evolutionary search over feature programs, guided by data-driven feedback
- Consistently outperforms OCTree and classical methods
- Good comparison point for "memory-augmented vs. evolutionary"

**Implementation (sketch):**

```python
from llmfe import LLMFE  # hypothetical import; LLM-FE is research code without a packaged API

class LLMFEBaseline(Baseline):
    def fit(self, X_train, y_train):
        self.llmfe = LLMFE()
        self.llmfe.fit(X_train, y_train)
        return self

    def transform(self, X):
        return self.llmfe.transform(X)
```
## Excluded Baselines (Future Work)
| Baseline | Reason for Exclusion | Future Path |
|---|---|---|
| OCTree | Outperformed by LLM-FE, adds complexity for marginal gain | Phase 2 if needed |
| AutoFeat | Foundational but weak; easily beaten by OpenFE | Optional extra |
| DFS | Basic relational synthesis; not competitive on single-table | Phase 2 (multi-table) |
## Comparison Dimensions
Each baseline will be compared across:
| Dimension | How Measured |
|---|---|
| Predictive Gain | ΔAUC (classification) or ΔNRMSE (regression; lower NRMSE is better) |
| Feature Diversity | Number of unique feature types generated |
| Cost Efficiency | LLM API cost per unit gain |
| Latency | Wall-clock time to generate features |
| Robustness | Standard deviation across random seeds |
| Scalability | Performance on datasets of varying size |
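The first, third, and fifth dimensions above can be sketched as simple metric functions. Function names and the toy numbers in the usage note are illustrative, not from the paper; AUC is computed here via the Mann-Whitney rank statistic to avoid external dependencies:

```python
from statistics import stdev

def auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def delta_auc(y_true, baseline_scores, method_scores):
    """Predictive gain: AUC with engineered features minus AUC without."""
    return auc(y_true, method_scores) - auc(y_true, baseline_scores)

def cost_per_unit_gain(api_cost_usd, gain):
    """Cost efficiency: LLM API dollars spent per point of AUC gained."""
    return float("inf") if gain <= 0 else api_cost_usd / gain

def robustness(per_seed_gains):
    """Robustness: std of the gain across random seeds (lower is better)."""
    return stdev(per_seed_gains)
```

For example, a method that lifts AUC from 0.75 to 1.00 on a toy split at $5 of API spend costs $20 per unit of AUC gained.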
## Experiment Design

```python
# Compare all methods on the same dataset/seed grid
matrix = (
    ExperimentMatrix()
    .datasets(["titanic", "house-prices", "porto-seguro"])
    .methods([
        "malmas_full",       # All agents, all memory
        "malmas_no_memory",  # Ablated
        "openfe",
        "caafe",
        "llmfe",
        "baseline",          # No feature engineering
    ])
    .seeds([0, 1, 2, 3, 4])
    .models(["xgboost"])
)
```
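`ExperimentMatrix` itself is not defined in this document. One hypothetical sketch of the builder, expanding the grid into one run per (dataset, method, seed, model) combination:

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class ExperimentMatrix:
    """Hypothetical builder that expands config lists into a full run grid."""
    _datasets: list = field(default_factory=list)
    _methods: list = field(default_factory=list)
    _seeds: list = field(default_factory=list)
    _models: list = field(default_factory=list)

    def datasets(self, names):
        self._datasets = list(names)
        return self

    def methods(self, names):
        self._methods = list(names)
        return self

    def seeds(self, values):
        self._seeds = list(values)
        return self

    def models(self, names):
        self._models = list(names)
        return self

    def runs(self):
        """Cartesian product: one run dict per grid cell."""
        return [
            {"dataset": d, "method": m, "seed": s, "model": mo}
            for d, m, s, mo in product(
                self._datasets, self._methods, self._seeds, self._models
            )
        ]
```

With the grid above (3 datasets × 6 methods × 5 seeds × 1 model), `runs()` yields 90 runs, which a scheduler can then execute and aggregate per seed.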
## Expected Outcomes

Based on the 2026 rankings:
- MALMAS should outperform all baselines on most datasets
- OpenFE should be the strongest non-LLM competitor
- LLM-FE should be the strongest single-agent LLM method
- CAAFE should show strong gains on datasets with rich descriptions
## References

[1] MALMAS (April 2026). Memory-Augmented LLM-based Multi-Agent System for Automated Feature Engineering. arXiv:2604.20261.
[2] Zhang et al. (2023). OpenFE: Automated Feature Generation with Expert-level Performance. ICML.
[3] Hollmann et al. (2023). Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering. NeurIPS.