Implementation Phases
Phase 1: Skeleton & Tooling (Days 1-2)
Goal: Establish project foundation with modern Python tooling.
| Day | Task | Deliverable |
|---|---|---|
| 1 | uv init --src feature-forge, create directory structure | pyproject.toml, src/, tests/, config/ |
| 1 | Configure pyproject.toml with metadata, dependencies, tool configs | Installable package |
| 2 | Set up GitHub Actions CI (ci.yml) | Automated lint/test on PR |
| 2 | Set up pre-commit hooks (ruff, conventional commits) | .pre-commit-config.yaml |
| 2 | Configure dotenvx for secrets | .env (encrypted), .env.keys (gitignored) |
Success Criteria:
- uv sync completes without errors
- uv run pytest runs (even if no tests yet)
- uv run ruff check . passes
- pre-commit install succeeds
- import feature_forge works after uv pip install -e .
Phase 2: Config & Types (Days 3-4)
Goal: Immutable, validated, env-var-overridable configuration.
| Day | Task | Deliverable |
|---|---|---|
| 3 | Implement Settings with pydantic-settings | src/feature_forge/config.py |
| 3 | Create config/settings.yaml with defaults | config/settings.yaml |
| 3 | Write exceptions.py with full hierarchy | src/feature_forge/exceptions.py |
| 4 | Write types.py with shared aliases | src/feature_forge/types.py |
| 4 | Add unit tests for config validation | tests/unit/test_config.py |
Success Criteria:
- Settings() loads from YAML
- Settings(temperature=0.5) overrides YAML
- FF_TASK=regression env var overrides
- Invalid config raises ConfigurationError
- settings.llm.api_key returns SecretStr
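To make the override precedence concrete, here is a minimal stdlib sketch of the behaviour the Settings class must exhibit (constructor kwarg beats FF_* env var, which beats the built-in default, with validation raising ConfigurationError). It is a stand-in for the pydantic-settings implementation, not the real class; the real one also layers in config/settings.yaml and SecretStr handling.

```python
import os
from dataclasses import dataclass, field


class ConfigurationError(Exception):
    """Raised when a setting fails validation."""


@dataclass(frozen=True)  # frozen gives the immutability requirement
class Settings:
    # Precedence: constructor kwarg > FF_TASK env var > built-in default.
    task: str = field(
        default_factory=lambda: os.environ.get("FF_TASK", "classification")
    )
    temperature: float = 0.5

    def __post_init__(self):
        # Validate eagerly so a bad config fails at construction time.
        if self.task not in {"classification", "regression"}:
            raise ConfigurationError(f"unknown task: {self.task!r}")
        if not 0.0 <= self.temperature <= 2.0:
            raise ConfigurationError(f"temperature out of range: {self.temperature}")
```

pydantic-settings applies the same priority order (init kwargs over environment over defaults), which is why the success criteria above are all satisfiable at once.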
Phase 3: Observability (Days 5-6)
Goal: Structured logging + LLM tracing from day one.
| Day | Task | Deliverable |
|---|---|---|
| 5 | Configure structlog (JSON prod / pretty dev) | src/feature_forge/observability/structlog_config.py |
| 5 | Add OpenTelemetry processor for trace correlation | add_open_telemetry_spans processor |
| 6 | Integrate Langfuse @observe decorators | src/feature_forge/observability/langfuse_tracer.py |
| 6 | Add unit tests for logging | tests/unit/test_observability.py |
Success Criteria:
- structlog.get_logger().info("test", x=1) outputs JSON in CI
- bind_contextvars(experiment_id="e1") propagates to all logs
- @observe() decorator captures function latency
- Langfuse traces show in cloud dashboard
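The two load-bearing behaviours here are JSON log lines and bound context that propagates to every later log call. structlog provides both out of the box; the stdlib sketch below (contextvars plus json) shows the shape being targeted, with illustrative function names.

```python
import contextvars
import json
import sys

# Ambient logging context, analogous to structlog's bound contextvars.
_context: contextvars.ContextVar[dict] = contextvars.ContextVar("log_ctx", default={})


def bind_contextvars(**kwargs) -> None:
    # Merge new key/value pairs into the ambient context.
    _context.set({**_context.get(), **kwargs})


def log(event: str, **fields) -> str:
    # Every record carries the bound context plus call-site fields.
    record = {"event": event, **_context.get(), **fields}
    line = json.dumps(record, sort_keys=True)
    print(line, file=sys.stderr)
    return line
```

In the real module, a structlog processor chain would emit pretty console output in dev and the JSON form above in CI/prod.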
Phase 4: LLM Layer (Days 7-9)
Goal: Provider-agnostic LLM client with enforced caching.
| Day | Task | Deliverable |
|---|---|---|
| 7 | Implement LLMClient ABC | src/feature_forge/llm/base.py |
| 7 | Implement DiskCache with SHA-256 keys | src/feature_forge/llm/cache.py |
| 8 | Implement OpenAI provider | src/feature_forge/llm/providers/openai.py |
| 8 | Implement DeepSeek provider | src/feature_forge/llm/providers/deepseek.py |
| 9 | Implement Anthropic provider | src/feature_forge/llm/providers/anthropic.py |
| 9 | Add Langfuse wrapper (auto-tracing) | src/feature_forge/llm/langfuse_wrapper.py |
Success Criteria:
- LLMClient.complete(messages) returns response
- Same prompt returns cached response (no API call)
- Langfuse shows generation spans with token usage
- Invalid API key raises LLMError
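The cache-hit criterion hinges on a deterministic key: hashing a canonicalised request (model, messages, sampling params) means an identical prompt maps to the same SHA-256 digest and never reaches the provider. A minimal sketch, with an illustrative API surface:

```python
import hashlib
import json
import tempfile
from pathlib import Path


class DiskCache:
    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key(model: str, messages: list, **params) -> str:
        # sort_keys gives a canonical ordering, so the digest is stable
        # across runs and across dict insertion orders.
        payload = json.dumps(
            {"model": model, "messages": messages, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, key: str):
        path = self.root / f"{key}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, key: str, response) -> None:
        (self.root / f"{key}.json").write_text(json.dumps(response))
```

The real cache sits between the LLMClient ABC and each provider, so caching is enforced regardless of which provider is configured.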
Phase 5: Agent System (Days 10-13)
Goal: All 6 MALMAS agents + Router + Registry.
| Day | Task | Deliverable |
|---|---|---|
| 10 | Implement Agent ABC + AgentRegistry | src/feature_forge/agents/base.py |
| 10 | Port prompt templates to src/feature_forge/prompts/ | All 6 prompt files |
| 11 | Implement UnaryFeatureAgent | src/feature_forge/agents/unary.py |
| 11 | Implement CrossCompositionalAgent | src/feature_forge/agents/cross_compositional.py |
| 12 | Implement AggregationConstructAgent | src/feature_forge/agents/aggregation.py |
| 12 | Implement TemporalFeatureAgent | src/feature_forge/agents/temporal.py |
| 13 | Implement LocalTransformAgent + LocalPatternAgent | src/feature_forge/agents/local_transform.py, local_pattern.py |
| 13 | Implement RouterAgent | src/feature_forge/agents/router.py |
Success Criteria:
- AgentRegistry.discover() finds all 6 agents
- Each agent generates a FeatureSpec from a prompt
- Router selects agents based on data characteristics
- All agents are entry-point registered
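A sketch of the Agent ABC plus registry discovery. The plan registers agents via entry points; a __init_subclass__ hook gives the same effect inside one process, which is enough to illustrate the contract. The propose signature and the returned dict fields are assumptions, not the final FeatureSpec schema.

```python
from abc import ABC, abstractmethod


class AgentRegistry:
    _agents: dict[str, type] = {}

    @classmethod
    def register(cls, name: str, agent_cls: type) -> None:
        cls._agents[name] = agent_cls

    @classmethod
    def discover(cls) -> dict[str, type]:
        return dict(cls._agents)


class Agent(ABC):
    name: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:  # register every concrete, named subclass
            AgentRegistry.register(cls.name, cls)

    @abstractmethod
    def propose(self, context: dict) -> dict:
        """Return a FeatureSpec-like dict for the given data context."""


class UnaryFeatureAgent(Agent):
    name = "unary"

    def propose(self, context: dict) -> dict:
        # Real version: prompt the LLM with the column summary.
        col = context["columns"][0]
        return {"name": f"log_{col}", "code": f"np.log1p(X['{col}'])"}
```

The Router would consult AgentRegistry.discover() plus data characteristics (temporal columns, categorical ratios, and so on) to decide which of the six agents to invoke each round.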
Phase 6: Memory System (Days 14-16)
Goal: Procedural, feedback, and conceptual memory.
| Day | Task | Deliverable |
|---|---|---|
| 14 | Implement MemoryEntry dataclass + persistence | src/feature_forge/memory/base.py, persistence.py |
| 14 | Implement ProceduralMemory | src/feature_forge/memory/procedural.py |
| 15 | Implement FeedbackMemory | src/feature_forge/memory/feedback.py |
| 15 | Implement ConceptualMemory (with LLM summarization) | src/feature_forge/memory/conceptual.py |
| 16 | Integrate memory into agent base class | Agent.memory attribute |
| 16 | Add memory tests | tests/unit/test_memory.py |
Success Criteria:
- Memory persists across rounds
- Conceptual memory generates bullet-point rules
- Top-k features retrievable by score
- Memory serializes to JSON and loads back
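Two of those criteria (JSON round-tripping and top-k retrieval by score) pin down most of the memory layer's shape. A stdlib sketch follows; the field names are assumptions, not the final MemoryEntry schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class MemoryEntry:
    feature_name: str
    code: str
    score: float  # evaluation gain recorded for this feature
    round: int


class ProceduralMemory:
    def __init__(self):
        self.entries: list[MemoryEntry] = []

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def top_k(self, k: int) -> list[MemoryEntry]:
        # Highest-scoring features first, for prompt construction.
        return sorted(self.entries, key=lambda e: e.score, reverse=True)[:k]

    def to_json(self) -> str:
        return json.dumps([asdict(e) for e in self.entries])

    @classmethod
    def from_json(cls, blob: str) -> "ProceduralMemory":
        mem = cls()
        mem.entries = [MemoryEntry(**d) for d in json.loads(blob)]
        return mem
```

ConceptualMemory would add an LLM summarization pass over these entries to distill bullet-point rules; that part is deliberately not sketched here.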
Phase 7: Evaluation (Days 17-19)
Goal: Feature evaluation, model factory, sandboxed execution.
| Day | Task | Deliverable |
|---|---|---|
| 17 | Implement metrics (AUC, ACC, NRMSE) | src/feature_forge/evaluation/metrics.py |
| 17 | Implement k-fold CV evaluator | src/feature_forge/evaluation/cv.py |
| 18 | Implement ModelFactory | src/feature_forge/evaluation/model_factory.py |
| 18 | Implement SandboxedExecutor | src/feature_forge/evaluation/sandbox.py |
| 19 | Add evaluation tests | tests/unit/test_evaluation.py, test_sandbox.py |
Success Criteria:
- cv.evaluate_feature(X, y, feature_code) returns gain
- SandboxedExecutor blocks eval(), open(), imports
- ModelFactory creates XGB/LGB/CatBoost/RF/MLP
- Sandbox allows pandas, numpy, math operations
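The allow/deny behaviour in the last two criteria can be implemented as a static AST check on the generated feature code before any exec. This is only that one layer, sketched with assumed names; a production sandbox would also need resource limits and process isolation.

```python
import ast

BLOCKED_CALLS = {"eval", "exec", "open", "__import__", "compile"}
ALLOWED_IMPORTS = {"pandas", "numpy", "math"}


def check_feature_code(code: str) -> None:
    """Raise PermissionError if the code uses a blocked call or import."""
    for node in ast.walk(ast.parse(code)):
        # Reject direct calls to dangerous builtins.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            raise PermissionError(f"blocked call: {node.func.id}")
        # Reject imports outside the allowlist.
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        for mod in modules:
            if mod.split(".")[0] not in ALLOWED_IMPORTS:
                raise PermissionError(f"blocked import: {mod}")
```

An AST check alone cannot stop every escape (e.g. attribute-lookup tricks), which is why the plan pairs it with an executor boundary rather than relying on it as the sole defense.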
Phase 8: Pipeline & API (Days 20-23)
Goal: Core pipeline, iterative pipeline, ablations, sklearn API.
| Day | Task | Deliverable |
|---|---|---|
| 20 | Implement CorePipeline (single round) | src/feature_forge/pipeline/core.py |
| 21 | Implement IterativePipeline (N-round) | src/feature_forge/pipeline/iterative.py |
| 22 | Implement ablation pipelines | src/feature_forge/pipeline/ablations.py |
| 22 | Implement MALMASFeatureEngineer (sklearn) | src/feature_forge/api.py |
| 23 | Add pipeline integration tests | tests/integration/test_pipeline.py |
Success Criteria:
- fe.fit(X_train, y_train) runs full pipeline
- fe.transform(X_test) applies generated features
- Pipeline([("fe", fe), ("clf", XGBClassifier())]) works
- cross_val_score(pipeline, X, y) works
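What makes the last two criteria work is the scikit-learn estimator contract: fit(X, y) returns self, transform(X) returns the augmented data, and get_params/set_params make the object clonable inside Pipeline and cross_val_score. The sketch below shows only that contract, with the feature-generation body stubbed out and list-of-lists data standing in for DataFrames.

```python
class MALMASFeatureEngineer:
    def __init__(self, n_rounds: int = 3):
        self.n_rounds = n_rounds

    def get_params(self, deep: bool = True) -> dict:
        return {"n_rounds": self.n_rounds}

    def set_params(self, **params) -> "MALMASFeatureEngineer":
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None) -> "MALMASFeatureEngineer":
        # Real version: run the iterative pipeline and keep accepted features.
        # Trailing underscore marks fitted state, per sklearn convention.
        self.feature_codes_ = ["row_sum"]
        return self

    def transform(self, X):
        # Real version: execute each accepted feature's code in the sandbox.
        return [row + [sum(row)] for row in X]
```

Because sklearn only duck-types these methods, following the convention (plus inheriting BaseEstimator/TransformerMixin in the real class) is what makes Pipeline composition and cross_val_score work unmodified.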
Phase 9: Baselines (Days 24-27)
Goal: OpenFE, CAAFE, LLM-FE baseline implementations.
| Day | Task | Deliverable |
|---|---|---|
| 24 | Implement Baseline ABC + BaselineRegistry | src/feature_forge/baselines/base.py |
| 24 | Implement OpenFE baseline | src/feature_forge/baselines/openfe.py |
| 25 | Implement CAAFE baseline | src/feature_forge/baselines/caafe.py |
| 26 | Implement LLM-FE baseline | src/feature_forge/baselines/llmfe.py |
| 27 | Add baseline tests | tests/integration/test_baselines.py |
Success Criteria:
- Each baseline implements fit(X_train, y_train) / transform(X_test)
- BaselineRegistry.discover() finds all baselines
- OpenFE baseline matches reference implementation
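The shared interface in the first criterion can be captured as a structural type, so the experiment runner treats OpenFE, CAAFE, LLM-FE, and MALMAS itself interchangeably. A sketch using typing.Protocol (an assumption; the plan names an ABC, which would work the same way), plus a trivial control-arm baseline:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Baseline(Protocol):
    def fit(self, X_train, y_train) -> "Baseline": ...
    def transform(self, X_test): ...


class IdentityBaseline:
    """Trivial baseline: adds no features; useful as a control arm."""

    def fit(self, X_train, y_train) -> "IdentityBaseline":
        return self

    def transform(self, X_test):
        return X_test
```

An identity baseline is cheap insurance: any method that cannot beat it on a dataset is adding noise, not signal.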
Phase 10: Experiment Harness (Days 28-31)
Goal: Unified tracking, experiment matrices, auto-reporting.
| Day | Task | Deliverable |
|---|---|---|
| 28 | Implement ExperimentTracker ABC | src/feature_forge/experiment/tracker.py |
| 28 | Implement WandBTracker | src/feature_forge/experiment/wandb_backend.py |
| 29 | Implement MLflowTracker | src/feature_forge/experiment/mlflow_backend.py |
| 29 | Implement ExperimentMatrix | src/feature_forge/experiment/matrix.py |
| 30 | Implement ExperimentRunner | src/feature_forge/experiment/runner.py |
| 31 | Implement Reporter | src/feature_forge/experiment/reporter.py |
Success Criteria:
- ExperimentMatrix generates all combinations
- ExperimentRunner executes in parallel
- WandB shows all metrics, parameters, artifacts
- Reporter generates markdown comparison tables
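The "all combinations" criterion is a cross-product over named axes (e.g. dataset x method x seed) expanded into a flat run list the runner can parallelise over. A minimal sketch, with an illustrative constructor:

```python
import itertools


class ExperimentMatrix:
    def __init__(self, **axes):
        # e.g. dataset=[...], method=[...], seed=[...]
        self.axes = axes

    def combinations(self) -> list[dict]:
        keys = list(self.axes)
        return [
            dict(zip(keys, values))
            for values in itertools.product(*self.axes.values())
        ]
```

Each resulting dict is one run config; the runner hands them to workers and the tracker logs each under its own run id.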
Phase 11: Data Layer (Days 32-33)
Goal: Kaggle-focused data ingestion with sample datasets.
| Day | Task | Deliverable |
|---|---|---|
| 32 | Implement KaggleFetcher | src/feature_forge/data/ingestion.py |
| 32 | Implement DatasetRegistry | src/feature_forge/data/registry.py |
| 33 | Add sample datasets + ingestion tests | data/samples/, tests/integration/test_data_ingestion.py |
Success Criteria:
- KaggleFetcher.fetch("titanic") downloads dataset
- DatasetRegistry.list() shows available datasets
- Sample datasets load without internet
- Ingestion handles CSV + metadata JSON
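A registry mapping names to loader callables covers both cases in the criteria: sample datasets register an offline loader, while Kaggle-backed ones register a loader that fetches lazily on first use. Method names are illustrative.

```python
class DatasetRegistry:
    _datasets: dict = {}

    @classmethod
    def register(cls, name: str, loader) -> None:
        # loader: zero-arg callable returning the dataset
        cls._datasets[name] = loader

    @classmethod
    def list(cls) -> list[str]:
        return sorted(cls._datasets)

    @classmethod
    def load(cls, name: str):
        if name not in cls._datasets:
            raise KeyError(f"unknown dataset: {name}")
        return cls._datasets[name]()
```

Because loaders are callables, network access happens only inside a Kaggle-backed loader's body, and the sample-dataset criterion (loads without internet) holds by construction.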
Phase 12: Tests & Documentation (Days 34-38)
Goal: Comprehensive test coverage and interactive notebooks.
| Day | Task | Deliverable |
|---|---|---|
| 34-35 | Unit tests for all core modules | tests/unit/ — target 80%+ coverage |
| 36 | Integration tests | tests/integration/ |
| 37 | Marimo notebooks | notebooks/01_agent_comparison.py, etc. |
| 38 | API reference docs | docs/api_reference.md |
Success Criteria:
- pytest --cov=feature_forge shows >80% coverage
- All integration tests pass
- Notebooks run end-to-end
Phase 13: Benchmarks & Release Prep (Days 39-42)
Goal: Full benchmark suite and release readiness.
| Day | Task | Deliverable |
|---|---|---|
| 39-40 | Run full benchmark suite | .github/workflows/benchmark.yml |
| 41 | Write README with quick start | README.md |
| 42 | Write migration guide | docs/migration_guide.md |
Success Criteria:
- Benchmark workflow runs on schedule
- README has working code examples
- Package installable via uv pip install -e .
Total Timeline
| Phase | Duration | Cumulative |
|---|---|---|
| 1-2 (Foundation) | 4 days | Day 4 |
| 3-5 (Core Engine) | 9 days | Day 13 |
| 6-8 (Pipeline) | 10 days | Day 23 |
| 9-11 (Methods + Data) | 10 days | Day 33 |
| 12-13 (Quality + Release) | 9 days | Day 42 |
Total: 42 days, roughly 6 weeks (assuming one developer, full-time)