Architecture & Design Philosophy
Core Philosophy
feature_forge treats every method as a first-class, independently runnable, composable experiment unit. We move from "run the pipeline" to "design an experiment matrix."
Design Principles
1. Plugin-Ready Everything
Every agent and baseline is discoverable via Python entry points. This makes the core repo lightweight while allowing research groups to publish extensions as independent pip packages.
# Downstream package's pyproject.toml
[project.entry-points."feature_forge.agents"]
my_domain_agent = "my_package:DomainAgent"
2. Experiment-First
Instead of running one pipeline, you define an experiment matrix:
from feature_forge.experiment.matrix import ExperimentMatrix
matrix = (
    ExperimentMatrix()
    .datasets(["titanic", "house-prices"])
    .methods({
        "malmas_full": ["unary", "cross", "aggregation", "temporal"],
        "malmas_no_memory": [...],
        "openfe": ["openfe"],
    })
    .seeds([0, 1, 2])
    .models(["xgboost", "lightgbm"])
    .rounds([1, 2, 4])
)
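A matrix like this implies a Cartesian product of runs. A sketch of that expansion with plain `itertools` (the flat run-dict shape is an assumption, not the real ExperimentRunner contract):

```python
# Sketch of the Cartesian expansion an ExperimentMatrix implies; the
# flat run-dict shape is an assumption, not the real runner's contract.
from itertools import product

datasets = ["titanic", "house-prices"]
methods = ["malmas_full", "malmas_no_memory", "openfe"]
seeds = [0, 1, 2]
models = ["xgboost", "lightgbm"]
rounds = [1, 2, 4]

runs = [
    {"dataset": d, "method": m, "seed": s, "model": mo, "rounds": r}
    for d, m, s, mo, r in product(datasets, methods, seeds, models, rounds)
]
print(len(runs))  # 2 * 3 * 3 * 2 * 3 = 108 experiment combinations
```

Even this small matrix expands to 108 runs, which is why the experiment layer, not the user, owns scheduling and tracking.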
3. Immutable Configuration
No mutable global state. All configuration is instance-based, validated at startup, and overridable via env vars.
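A minimal sketch of what instance-based, immutable, env-overridable configuration can look like (the field names and the `FF_LLM_MODEL` variable are illustrative assumptions, not feature_forge's real config):

```python
# Sketch of instance-based, immutable configuration with env overrides;
# field names and the FF_LLM_MODEL variable are illustrative assumptions.
import os
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LLMConfig:
    model: str = "default-model"
    temperature: float = 0.0

    @classmethod
    def from_env(cls) -> "LLMConfig":
        """Build the default config, applying env-var overrides immutably."""
        cfg = cls()
        if override := os.getenv("FF_LLM_MODEL"):
            cfg = replace(cfg, model=override)  # returns a new instance
        return cfg
```

A frozen dataclass makes accidental mutation an error rather than a silent source of cross-experiment state leakage.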
4. Security by Default
- Sandboxed code execution (AST validation, restricted builtins)
- LLM response caching enforced by default
- No raw exec() without a sandbox
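A sketch of the validate-then-exec pattern described above (the denylist and the set of allowed builtins are illustrative, not the project's actual rules):

```python
# Sketch of AST validation before sandboxed exec(); the denylist and
# the allowed builtins are illustrative, not the project's actual rules.
import ast

FORBIDDEN_NAMES = {"eval", "exec", "open", "__import__", "compile"}
SAFE_BUILTINS = {"len": len, "range": range, "min": min, "max": max, "sum": sum}

def validate(code: str) -> None:
    """Reject imports and forbidden names before any execution."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise ValueError("imports are not allowed in sandboxed code")
        if isinstance(node, ast.Name) and node.id in FORBIDDEN_NAMES:
            raise ValueError(f"forbidden name: {node.id}")

def run_sandboxed(code: str) -> dict:
    """Validate, then exec with restricted builtins; return the scope."""
    validate(code)
    scope: dict = {"__builtins__": SAFE_BUILTINS}
    exec(code, scope)
    return scope
```

Replacing `__builtins__` with an explicit allowlist means LLM-generated code can only touch what the sandbox hands it.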
5. Observable Everything
- Every agent call traced via Langfuse
- Every pipeline step logged via structlog
- Every experiment tracked via WandB
- Costs transparent: token usage → USD per agent per round
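The token-to-USD accounting can be as simple as a per-1K-token price table. A sketch (the rates below are placeholder assumptions, not real provider pricing):

```python
# Sketch of per-agent, per-round token-to-USD accounting; the rates
# are placeholder assumptions, not real provider pricing.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}  # USD per 1K tokens (assumed)

def usd_cost(input_tokens: int, output_tokens: int) -> float:
    """Convert one call's token counts into a dollar figure."""
    return (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + output_tokens / 1000 * PRICE_PER_1K["output"]
    )
```

Summing this per agent per round is what makes the cost column in the experiment tracker possible.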
Layered Architecture
┌─────────────────────────────────────────────────────────────┐
│ EXPERIMENT LAYER │
│ - ExperimentMatrix (Cartesian product definitions) │
│ - ExperimentRunner (execution engine) │
│ - ExperimentTracker (WandB/MLflow abstraction) │
│ - Reporter (auto-generated markdown/HTML reports) │
├─────────────────────────────────────────────────────────────┤
│ PIPELINE LAYER │
│ - MALMASFeatureEngineer (sklearn-compatible API) │
│ - CorePipeline (single-round execution) │
│ - IterativePipeline (N-round with memory + router) │
│ - AblationPipelines (no-memory, no-router, single-agent) │
├─────────────────────────────────────────────────────────────┤
│ AGENT & BASELINE LAYER │
│ - Agent ABC + Registry (entry-point discovery) │
│ - 6 MALMAS agents (unary, cross, aggregation, ...) │
│ - RouterAgent (data-driven, performance-driven, hybrid) │
│ - Baseline ABC + Registry │
│ - OpenFE, CAAFE, LLM-FE baselines │
├─────────────────────────────────────────────────────────────┤
│ MEMORY LAYER │
│ - ProceduralMemory (successful transforms) │
│ - FeedbackMemory (feature gains/losses) │
│ - ConceptualMemory (LLM-summarized rules) │
│ - Persistence (JSON/dill serializers) │
├─────────────────────────────────────────────────────────────┤
│ LLM LAYER │
│ - LLMClient ABC (unified interface) │
│ - Provider implementations (OpenAI, DeepSeek, Anthropic) │
│ - DiskCache (enforced default, SHA-256 keyed) │
│ - LangfuseWrapper (auto-tracing + cost tracking) │
├─────────────────────────────────────────────────────────────┤
│ EVALUATION LAYER │
│ - Metrics (AUC, ACC, NRMSE, custom) │
│ - CV (k-fold cross-validation) │
│ - ModelFactory (XGB, LGB, CatBoost, RF, MLP) │
│ - Sandbox (AST-validated, restricted-builtin execution) │
├─────────────────────────────────────────────────────────────┤
│ DATA LAYER │
│ - Dataset ABC + Registry │
│ - KaggleFetcher (primary source) │
│ - OpenMLFetcher (secondary) │
│ - Sample datasets (for quick testing) │
├─────────────────────────────────────────────────────────────┤
│ OBSERVABILITY LAYER │
│ - structlog (JSON in prod, pretty in dev) │
│ - OpenTelemetry processor (trace_id/span_id in logs) │
│ - Langfuse tracer (@observe decorators) │
└─────────────────────────────────────────────────────────────┘
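One detail from the LLM layer worth making concrete is the SHA-256 cache keying. A sketch, assuming the key covers model, prompt, and temperature (the actual keyed fields are an assumption):

```python
# Sketch of the SHA-256 keyed DiskCache from the LLM layer; the keyed
# fields (model, prompt, temperature) are assumptions.
import hashlib
import json

def cache_key(model: str, prompt: str, temperature: float = 0.0) -> str:
    """Deterministic 64-hex-char key over the call's identifying fields."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Serializing with `sort_keys=True` keeps the key stable across dict orderings, so identical calls always hit the same cache entry.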
Agent System Architecture
┌─────────────────┐ uses ┌─────────────────┐
│ Experiment │──────────────▶│ Iterative │
│ Runner │ │ Pipeline │
└─────────────────┘ └────────┬────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Router │ │ Memory │ │ Eval │
│ Agent │ │ System │ │ Engine │
└───┬────┘ └────────┘ └────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Agent 1 │ │ Agent 2 │ │ Agent N │
│ (unary) │ │ (cross) │ │ (...) │
│ Memory │ │ Memory │ │ Memory │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└─────────────┼─────────────┘
▼
┌──────────────┐
│ LLM Client │
│ + Cache │
│ + Langfuse │
└──────────────┘
Data Flow
Raw Dataset (Kaggle)
│
▼
Dataset Loader → df_train, df_test, target, metadata
│
▼
Feature Engineering Pipeline
│
├─→ Router selects active agents
│ ├─→ Each agent: prompt → LLM plan → LLM code → sandbox execution
│ ├─→ Evaluate each feature via 5-fold CV
│ ├─→ Update agent memory (procedural, feedback, conceptual)
│ └─→ Persist top features to df_train/df_test
│
├─→ Global conceptual summary
│
└─→ Next round (if rounds > 1)
│
▼
Final Evaluation
├─→ Baseline model score (original features)
├─→ MALMAS score (original + generated features)
└─→ Baseline methods scores (OpenFE, CAAFE, LLM-FE)
│
▼
Experiment Tracker (WandB)
├─→ Log all metrics, parameters, artifacts
├─→ Log LLM costs per agent per round
└─→ Generate comparison visualizations
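The per-feature 5-fold CV step in the flow above rests on deterministic fold splitting. A stdlib-only sketch of the index splitting (the real CV module presumably layers models and metrics on top):

```python
# Stdlib-only sketch of the fold splitting behind the 5-fold CV step;
# the real CV module presumably adds models and metrics on top.
def kfold_indices(n: int, k: int = 5):
    """Yield (train_idx, val_idx) pairs covering range(n) in k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Every index lands in exactly one validation fold, so the per-feature gain estimate uses each row once.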
Security Model
| Layer | Mechanism |
|---|---|
| Code Execution | AST parsing + forbidden-name denylist + restricted builtins |
| LLM Calls | DiskCache enforced; no uncached calls by default |
| Secrets | dotenvx encrypted .env, never committed |
| Imports | No dynamic imports in sandboxed code |
| File System | No open(), no file operations in sandbox |
Concurrency Model
- Agent-level parallelism: asyncio.gather() for the selected agents in each round
- Experiment-level parallelism: ProcessPoolExecutor for independent experiment combinations
- LLM calls: async with semaphore-based rate limiting
- Memory access: Per-agent memory is isolated (no shared state)
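The agent-level and LLM-call bullets can be sketched together: asyncio.gather() fanned out over agents, bounded by a semaphore (`call_llm` below is a stand-in, not the real client):

```python
# Sketch of semaphore-bounded fan-out over agents with asyncio.gather();
# call_llm is a stand-in for the real rate-limited LLM client.
import asyncio

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for a network round trip
    return f"plan for: {prompt}"

async def run_agents(prompts: list[str], max_concurrent: int = 4) -> list[str]:
    """Run one LLM call per agent, with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(p: str) -> str:
        async with sem:
            return await call_llm(p)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(p) for p in prompts))
```

The semaphore caps concurrent provider calls without serializing the agents, which keeps rounds fast while respecting rate limits.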