Baseline Selection Rationale

Performance Ranking (2026)

Based on the MALMAS paper and comparative studies (Refs 1-3 below):

| Rank | Method | Type | Avg AUC Gain | Why Selected |
|------|--------|------|--------------|--------------|
| 1 | MALMAS | LLM multi-agent | +0.05 to +0.08 | Our core implementation |
| 2 | LLM-FE | LLM evolutionary | +0.04 to +0.06 | Strongest non-MALMAS LLM method |
| 3 | CAAFE | Context-aware LLM | +0.03 to +0.05 | Generates features from descriptions |
| 4 | OpenFE | Traditional (non-LLM) | +0.02 to +0.04 | Strongest non-LLM, "expert-level" |
| 5 | OCTree | Tree-based | +0.01 to +0.03 | Mid-tier, often outperformed |
| 6 | AutoFeat | Classical | +0.01 to +0.02 | Foundational but limited |
| 7 | DFS | Classical | ~0.00 to +0.01 | Basic relational synthesis |

Selected Baselines for Phase 1

We implement MALMAS plus three baselines. All baseline wrappers share the minimal interface sketched below.
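The Baseline base class is our own scaffolding rather than part of OpenFE, CAAFE, or LLM-FE; a minimal sketch of the assumed fit/transform contract:

from abc import ABC, abstractmethod

import pandas as pd


# Shared interface for all baseline wrappers in this phase. The class
# name and method contract are our assumptions, not any library's API.
class Baseline(ABC):
    @abstractmethod
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series, **kwargs) -> "Baseline":
        """Generate and select features using the training split only."""

    @abstractmethod
    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Return X augmented with the generated features."""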

1. OpenFE (Highest Priority Non-LLM Baseline)

Why:

- Strongest non-LLM method; reported to beat 99% of Kaggle teams on benchmark competitions
- "Feature boosting" algorithm with two-stage pruning
- Fast, deterministic, no API costs
- Serves as the "floor" for LLM methods to beat

Implementation:

from openfe import OpenFE, transform as openfe_transform


class OpenFEBaseline(Baseline):
    def fit(self, X_train, y_train):
        # OpenFE.fit returns the ranked list of generated candidate
        # features; the training frame is kept because openfe's transform
        # helper augments the train and test frames together.
        self.openfe = OpenFE()
        self.features = self.openfe.fit(data=X_train, label=y_train, n_jobs=4)
        self._X_train = X_train
        return self

    def transform(self, X):
        # openfe.transform applies the selected features to both frames;
        # only the augmented copy of X is needed here.
        _, X_aug = openfe_transform(self._X_train, X, self.features, n_jobs=4)
        return X_aug

2. CAAFE (Context-Aware LLM Baseline)

Why:

- Generates features directly from dataset descriptions
- Uses an LLM, but with a simpler single-agent architecture
- Good comparison point for "multi-agent vs. single-agent"

Implementation:

from caafe import CAAFEClassifier
from caafe.run_llm_code import run_llm_code


class CAAFEBaseline(Baseline):
    def fit(self, X_train, y_train, description: str):
        # CAAFE's documented entry point is fit_pandas, which takes a
        # single dataframe containing the target column plus the
        # natural-language dataset description.
        df_train = X_train.copy()
        df_train["target"] = y_train
        self.caafe = CAAFEClassifier(iterations=10)  # library defaults otherwise
        self.caafe.fit_pandas(
            df_train,
            target_column_name="target",
            dataset_description=description,
        )
        return self

    def transform(self, X):
        # Re-run the LLM-generated feature code on new data. Applying the
        # stored code via run_llm_code is our assumption about the caafe
        # internals; adjust to the installed version.
        return run_llm_code(self.caafe.code, X)

3. LLM-FE (Evolutionary LLM Baseline)

Why:

- Evolutionary search plus data-driven feedback
- Consistently outperforms OCTree and classical methods
- Good comparison point for "memory-augmented vs. evolutionary"

Implementation:

class LLMFEBaseline(Baseline):
    def fit(self, X_train, y_train):
        # LLMFE is our thin wrapper around the LLM-FE reference
        # implementation; the upstream repo ships no packaged
        # fit/transform API, so this adapter drives its search loop.
        self.llmfe = LLMFE()
        self.llmfe.fit(X_train, y_train)
        return self

    def transform(self, X):
        # Apply the best feature program found by the evolutionary search.
        return self.llmfe.transform(X)
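Because all three wrappers share the Baseline interface, they can be benchmarked with one loop. A sketch, where X_train, y_train, X_test, y_test, dataset_description, and evaluate_auc are hypothetical placeholders for the actual harness:

# Hypothetical harness loop; only CAAFE needs the dataset description.
baselines = {
    "openfe": OpenFEBaseline(),
    "caafe": CAAFEBaseline(),
    "llmfe": LLMFEBaseline(),
}

for name, baseline in baselines.items():
    kwargs = {"description": dataset_description} if name == "caafe" else {}
    baseline.fit(X_train, y_train, **kwargs)
    score = evaluate_auc(baseline.transform(X_test), y_test)  # placeholder metric fn
    print(f"{name}: AUC {score:.4f}")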

Excluded Baselines (Future Work)

| Baseline | Reason for Exclusion | Future Path |
|----------|----------------------|-------------|
| OCTree | Outperformed by LLM-FE; adds complexity for marginal gain | Phase 2 if needed |
| AutoFeat | Foundational but weak; easily beaten by OpenFE | Optional extra |
| DFS | Basic relational synthesis; not competitive on single-table data | Phase 2 (multi-table) |

Comparison Dimensions

Each baseline will be compared across:

| Dimension | How Measured |
|-----------|--------------|
| Predictive Gain | ΔAUC (classification) or ΔNRMSE (regression) |
| Feature Diversity | Number of unique feature types generated |
| Cost Efficiency | LLM API cost per unit of gain |
| Latency | Wall-clock time to generate features |
| Robustness | Standard deviation across random seeds |
| Scalability | Performance on datasets of varying size |
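To make the first and fifth rows concrete, a minimal sketch of computing predictive gain and robustness from per-seed predictions (the function names are ours, not the harness's):

import numpy as np
from sklearn.metrics import roc_auc_score


def predictive_gain(y_true, proba_method, proba_no_fe):
    # Delta-AUC: AUC with engineered features minus AUC of the
    # no-feature-engineering baseline on the same test split.
    return roc_auc_score(y_true, proba_method) - roc_auc_score(y_true, proba_no_fe)


def robustness(aucs_per_seed):
    # Sample standard deviation of AUC across the random seeds.
    return float(np.std(aucs_per_seed, ddof=1))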

Experiment Design

# Compare all methods on same dataset/seed
matrix = (
    ExperimentMatrix()
    .datasets(["titanic", "house-prices", "porto-seguro"])
    .methods([
        "malmas_full",      # All agents, all memory
        "malmas_no_memory", # Ablated
        "openfe",
        "caafe",
        "llmfe",
        "baseline",         # No feature engineering
    ])
    .seeds([0, 1, 2, 3, 4])
    .models(["xgboost"])
)
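The matrix expands to 3 datasets × 6 methods × 5 seeds × 1 model = 90 runs. A sketch of that expansion, where run_experiment is a hypothetical entry point:

from itertools import product

datasets = ["titanic", "house-prices", "porto-seguro"]
methods = ["malmas_full", "malmas_no_memory", "openfe", "caafe", "llmfe", "baseline"]
seeds = [0, 1, 2, 3, 4]
models = ["xgboost"]

# 3 * 6 * 5 * 1 = 90 independent runs
for dataset, method, seed, model in product(datasets, methods, seeds, models):
    run_experiment(dataset=dataset, method=method, seed=seed, model=model)  # hypothetical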

Expected Outcomes

Based on the 2026 rankings:

- MALMAS should outperform all baselines on most datasets
- OpenFE should be the strongest non-LLM competitor
- LLM-FE should be the strongest single-agent LLM method
- CAAFE should show strong gains on datasets with rich descriptions

References

[1] MALMAS (April 2026). Memory-Augmented LLM-based Multi-Agent System for Automated Feature Engineering. arXiv:2604.20261.

[2] Zhang, T., et al. (2023). OpenFE: Automated Feature Generation with Expert-level Performance. ICML 2023.

[3] Hollmann, N., Müller, S., and Hutter, F. (2023). Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering. NeurIPS 2023.