
# Implementation Phases

## Phase 1: Skeleton & Tooling (Days 1-2)

**Goal:** Establish the project foundation with modern Python tooling.

| Day | Task | Deliverable |
| --- | --- | --- |
| 1 | `uv init --lib feature-forge`, create directory structure | `pyproject.toml`, `src/`, `tests/`, `config/` |
| 1 | Configure `pyproject.toml` with metadata, dependencies, tool configs | Installable package |
| 2 | Set up GitHub Actions CI (`ci.yml`) | Automated lint/test on PR |
| 2 | Set up pre-commit hooks (ruff, conventional commits) | `.pre-commit-config.yaml` |
| 2 | Configure dotenvx for secrets | `.env` (encrypted), `.env.keys` (gitignored) |

**Success Criteria:**

- `uv sync` completes without errors
- `uv run pytest` runs (even if no tests exist yet)
- `uv run ruff check .` passes
- `pre-commit install` succeeds
- `import feature_forge` works after `uv pip install -e .`


## Phase 2: Config & Types (Days 3-4)

**Goal:** Immutable, validated, env-var-overridable configuration.

| Day | Task | Deliverable |
| --- | --- | --- |
| 3 | Implement `Settings` with pydantic-settings | `src/feature_forge/config.py` |
| 3 | Create `config/settings.yaml` with defaults | `config/settings.yaml` |
| 3 | Write `exceptions.py` with full hierarchy | `src/feature_forge/exceptions.py` |
| 4 | Write `types.py` with shared aliases | `src/feature_forge/types.py` |
| 4 | Add unit tests for config validation | `tests/unit/test_config.py` |

**Success Criteria:**

- `Settings()` loads from YAML
- `Settings(temperature=0.5)` overrides YAML
- `FF_TASK=regression` env var overrides YAML
- Invalid config raises `ConfigurationError`
- `settings.llm.api_key` returns a `SecretStr`


## Phase 3: Observability (Days 5-6)

**Goal:** Structured logging + LLM tracing from day one.

| Day | Task | Deliverable |
| --- | --- | --- |
| 5 | Configure structlog (JSON prod / pretty dev) | `src/feature_forge/observability/structlog_config.py` |
| 5 | Add OpenTelemetry processor for trace correlation | `add_open_telemetry_spans` processor |
| 6 | Integrate Langfuse `@observe` decorators | `src/feature_forge/observability/langfuse_tracer.py` |
| 6 | Add unit tests for logging | `tests/unit/test_observability.py` |

**Success Criteria:**

- `structlog.get_logger().info("test", x=1)` outputs JSON in CI
- `bind_contextvars(experiment_id="e1")` propagates to all logs
- `@observe()` decorator captures function latency
- Langfuse traces show in the cloud dashboard
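The two logging criteria (JSON event output plus context propagation via `bind_contextvars`) describe structlog's bound-contextvars pattern. A stdlib-only approximation shows the mechanics; the `bind_contextvars` and `get_logger` names mirror the structlog API referenced above, but this sketch is not structlog itself:

```python
import contextvars
import json
import logging

# Context bound here is merged into every subsequent log line in this context.
_context: contextvars.ContextVar[dict] = contextvars.ContextVar("log_context", default={})


def bind_contextvars(**kwargs) -> None:
    _context.set({**_context.get(), **kwargs})


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname.lower()}
        payload.update(_context.get())                 # bound context (e.g. experiment_id)
        payload.update(getattr(record, "fields", {}))  # per-call key-value pairs
        return json.dumps(payload)


def get_logger(name: str = "feature_forge") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # configure once, idempotently
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

structlog replaces the `record.fields` workaround with first-class keyword arguments (`log.info("test", x=1)`), and its `merge_contextvars` processor does the context merge shown in `format()`.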


## Phase 4: LLM Layer (Days 7-9)

**Goal:** Provider-agnostic LLM client with enforced caching.

| Day | Task | Deliverable |
| --- | --- | --- |
| 7 | Implement `LLMClient` ABC | `src/feature_forge/llm/base.py` |
| 7 | Implement `DiskCache` with SHA-256 keys | `src/feature_forge/llm/cache.py` |
| 8 | Implement OpenAI provider | `src/feature_forge/llm/providers/openai.py` |
| 8 | Implement DeepSeek provider | `src/feature_forge/llm/providers/deepseek.py` |
| 9 | Implement Anthropic provider | `src/feature_forge/llm/providers/anthropic.py` |
| 9 | Add Langfuse wrapper (auto-tracing) | `src/feature_forge/llm/langfuse_wrapper.py` |

**Success Criteria:**

- `LLMClient.complete(messages)` returns a response
- Same prompt returns the cached response (no API call)
- Langfuse shows generation spans with token usage
- Invalid API key raises `LLMError`


## Phase 5: Agent System (Days 10-13)

**Goal:** All 6 MALMAS agents + Router + Registry.

| Day | Task | Deliverable |
| --- | --- | --- |
| 10 | Implement `Agent` ABC + `AgentRegistry` | `src/feature_forge/agents/base.py` |
| 10 | Port prompt templates to `src/feature_forge/prompts/` | All 6 prompt files |
| 11 | Implement `UnaryFeatureAgent` | `src/feature_forge/agents/unary.py` |
| 11 | Implement `CrossCompositionalAgent` | `src/feature_forge/agents/cross_compositional.py` |
| 12 | Implement `AggregationConstructAgent` | `src/feature_forge/agents/aggregation.py` |
| 12 | Implement `TemporalFeatureAgent` | `src/feature_forge/agents/temporal.py` |
| 13 | Implement `LocalTransformAgent` + `LocalPatternAgent` | `src/feature_forge/agents/local_transform.py`, `local_pattern.py` |
| 13 | Implement `RouterAgent` | `src/feature_forge/agents/router.py` |

**Success Criteria:**

- `AgentRegistry.discover()` finds all 6 agents
- Each agent generates a `FeatureSpec` from a prompt
- Router selects agents based on data characteristics
- All agents are entry-point registered
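The ABC-plus-registry pairing above can be sketched with a decorator-based registry. `FeatureSpec` is stubbed as a plain dict here, and the registered agent name and generated code are illustrative stand-ins for what the LLM-backed agents would produce:

```python
from abc import ABC, abstractmethod


class AgentRegistry:
    _agents: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        """Class decorator: registering is a side effect of definition."""
        def decorator(agent_cls: type) -> type:
            cls._agents[name] = agent_cls
            return agent_cls
        return decorator

    @classmethod
    def discover(cls) -> dict[str, type]:
        return dict(cls._agents)


class Agent(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> dict:
        """Return a FeatureSpec-like dict for the given prompt."""


@AgentRegistry.register("unary")
class UnaryFeatureAgent(Agent):
    def generate(self, prompt: str) -> dict:
        # A real agent would call the Phase 4 LLM layer here.
        return {"agent": "unary", "code": "X['age_log'] = np.log1p(X['age'])"}
```

The decorator form keeps registration next to the class definition; the separate entry-point registration mentioned in the criteria would additionally expose each agent through `pyproject.toml` so third-party packages can contribute agents.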


## Phase 6: Memory System (Days 14-16)

**Goal:** Procedural, feedback, and conceptual memory.

| Day | Task | Deliverable |
| --- | --- | --- |
| 14 | Implement `MemoryEntry` dataclass + persistence | `src/feature_forge/memory/base.py`, `persistence.py` |
| 14 | Implement `ProceduralMemory` | `src/feature_forge/memory/procedural.py` |
| 15 | Implement `FeedbackMemory` | `src/feature_forge/memory/feedback.py` |
| 15 | Implement `ConceptualMemory` (with LLM summarization) | `src/feature_forge/memory/conceptual.py` |
| 16 | Integrate memory into the agent base class | `Agent.memory` attribute |
| 16 | Add memory tests | `tests/unit/test_memory.py` |

**Success Criteria:**

- Memory persists across rounds
- Conceptual memory generates bullet-point rules
- Top-k features retrievable by score
- Memory serializes to JSON and loads back


## Phase 7: Evaluation (Days 17-19)

**Goal:** Feature evaluation, model factory, sandboxed execution.

| Day | Task | Deliverable |
| --- | --- | --- |
| 17 | Implement metrics (AUC, ACC, NRMSE) | `src/feature_forge/evaluation/metrics.py` |
| 17 | Implement k-fold CV evaluator | `src/feature_forge/evaluation/cv.py` |
| 18 | Implement `ModelFactory` | `src/feature_forge/evaluation/model_factory.py` |
| 18 | Implement `SandboxedExecutor` | `src/feature_forge/evaluation/sandbox.py` |
| 19 | Add evaluation tests | `tests/unit/test_evaluation.py`, `test_sandbox.py` |

**Success Criteria:**

- `cv.evaluate_feature(X, y, feature_code)` returns gain
- `SandboxedExecutor` blocks `eval()`, `open()`, imports
- `ModelFactory` creates XGB/LGB/CatBoost/RF/MLP
- Sandbox allows pandas, numpy, math operations
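The sandbox criteria describe allow-list execution: `exec()` over a namespace whose `__builtins__` contains only whitelisted names, with safe libraries injected explicitly. A minimal sketch of that blocking behavior (a production `SandboxedExecutor` would also need time and memory limits, which this does not show):

```python
import math

# Only these builtins are visible to feature code; eval, open, __import__
# are deliberately absent, so using them fails at name resolution.
_ALLOWED_BUILTINS = {"abs": abs, "min": min, "max": max, "sum": sum, "len": len, "range": range}


class SandboxError(RuntimeError):
    pass


def run_sandboxed(code: str, variables: dict) -> dict:
    """Execute feature code with only whitelisted names in scope."""
    # Safe modules (here just math; pandas/numpy in the real executor)
    # are injected rather than imported by the untrusted code.
    env = {"__builtins__": _ALLOWED_BUILTINS, "math": math, **variables}
    try:
        exec(code, env)
    except (NameError, ImportError) as exc:  # blocked builtin or import
        raise SandboxError(f"blocked operation: {exc}") from exc
    return env
```

This is why the criteria can simultaneously block `import` statements yet allow pandas/numpy: the libraries arrive pre-injected in the namespace, never through the import machinery. (Namespace restriction alone is not a hard security boundary; process-level isolation would back it in production.)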


## Phase 8: Pipeline & API (Days 20-23)

**Goal:** Core pipeline, iterative pipeline, ablations, sklearn API.

| Day | Task | Deliverable |
| --- | --- | --- |
| 20 | Implement `CorePipeline` (single round) | `src/feature_forge/pipeline/core.py` |
| 21 | Implement `IterativePipeline` (N-round) | `src/feature_forge/pipeline/iterative.py` |
| 22 | Implement ablation pipelines | `src/feature_forge/pipeline/ablations.py` |
| 22 | Implement `MALMASFeatureEngineer` (sklearn) | `src/feature_forge/api.py` |
| 23 | Add pipeline integration tests | `tests/integration/test_pipeline.py` |

**Success Criteria:**

- `fe.fit(X_train, y_train)` runs the full pipeline
- `fe.transform(X_test)` applies generated features
- `Pipeline([("fe", fe), ("clf", XGBClassifier())])` works
- `cross_val_score(pipeline, X, y)` works
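What makes the sklearn integration work is the fit/transform split: `fit()` learns feature definitions (and any statistics they need) on the training fold only, and `transform()` replays them on unseen data. A dependency-free sketch of that contract; the mean-centering "learned feature" is a hypothetical stand-in for the agent-generated code, and rows are plain dicts instead of DataFrames:

```python
class MALMASFeatureEngineer:
    """Duck-typed sklearn transformer sketch: learn on train, replay on test."""

    def fit(self, X: list[dict], y: list) -> "MALMASFeatureEngineer":
        numeric = [k for k, v in X[0].items() if isinstance(v, (int, float))]
        # One illustrative learned feature per numeric column: mean-centering.
        # The mean is a train-time statistic, never recomputed on test data --
        # this is what prevents leakage inside cross_val_score.
        means = {k: sum(row[k] for row in X) / len(X) for k in numeric}
        self.features_ = {f"{k}_centered": (k, means[k]) for k in numeric}
        return self  # sklearn convention: fit returns self, enabling chaining

    def transform(self, X: list[dict]) -> list[dict]:
        out = []
        for row in X:
            new = dict(row)
            for name, (src, mean) in self.features_.items():
                new[name] = row[src] - mean
            out.append(new)
        return out

    def fit_transform(self, X: list[dict], y: list) -> list[dict]:
        return self.fit(X, y).transform(X)
```

The real class would subclass `sklearn.base.BaseEstimator` and `TransformerMixin` to inherit `get_params`/`set_params`, which `Pipeline` and `cross_val_score` rely on for cloning.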


## Phase 9: Baselines (Days 24-27)

**Goal:** OpenFE, CAAFE, LLM-FE baseline implementations.

| Day | Task | Deliverable |
| --- | --- | --- |
| 24 | Implement `Baseline` ABC + `BaselineRegistry` | `src/feature_forge/baselines/base.py` |
| 24 | Implement OpenFE baseline | `src/feature_forge/baselines/openfe.py` |
| 25 | Implement CAAFE baseline | `src/feature_forge/baselines/caafe.py` |
| 26 | Implement LLM-FE baseline | `src/feature_forge/baselines/llmfe.py` |
| 27 | Add baseline tests | `tests/integration/test_baselines.py` |

**Success Criteria:**

- Each baseline implements `fit(X_train, y_train)` / `transform(X_test)`
- `BaselineRegistry.discover()` finds all baselines
- OpenFE baseline matches the reference implementation


## Phase 10: Experiment Harness (Days 28-31)

**Goal:** Unified tracking, experiment matrices, auto-reporting.

| Day | Task | Deliverable |
| --- | --- | --- |
| 28 | Implement `ExperimentTracker` ABC | `src/feature_forge/experiment/tracker.py` |
| 28 | Implement `WandBTracker` | `src/feature_forge/experiment/wandb_backend.py` |
| 29 | Implement `MLflowTracker` | `src/feature_forge/experiment/mlflow_backend.py` |
| 29 | Implement `ExperimentMatrix` | `src/feature_forge/experiment/matrix.py` |
| 30 | Implement `ExperimentRunner` | `src/feature_forge/experiment/runner.py` |
| 31 | Implement `Reporter` | `src/feature_forge/experiment/reporter.py` |

**Success Criteria:**

- `ExperimentMatrix` generates all combinations
- `ExperimentRunner` executes in parallel
- WandB shows all metrics, parameters, artifacts
- `Reporter` generates markdown comparison tables
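"Generates all combinations" is a cross product over the experiment axes, with each combination emitted as one run configuration for the runner. A small sketch (the axis names in the usage are illustrative):

```python
import itertools


class ExperimentMatrix:
    """Cross product of named axes -> one config dict per run."""

    def __init__(self, **axes: list) -> None:
        self.axes = axes

    def combinations(self) -> list[dict]:
        names = list(self.axes)
        # itertools.product preserves axis order, so run IDs are reproducible.
        return [
            dict(zip(names, values))
            for values in itertools.product(*self.axes.values())
        ]
```

`ExperimentRunner` can then fan these dicts out to a process pool, and each tracker backend logs one run per dict.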


## Phase 11: Data Layer (Days 32-33)

**Goal:** Kaggle-focused data ingestion with sample datasets.

| Day | Task | Deliverable |
| --- | --- | --- |
| 32 | Implement `KaggleFetcher` | `src/feature_forge/data/ingestion.py` |
| 32 | Implement `DatasetRegistry` | `src/feature_forge/data/registry.py` |
| 33 | Add sample datasets + ingestion tests | `data/samples/`, `tests/integration/test_data_ingestion.py` |

**Success Criteria:**

- `KaggleFetcher.fetch("titanic")` downloads the dataset
- `DatasetRegistry.list()` shows available datasets
- Sample datasets load without internet
- Ingestion handles CSV + metadata JSON


## Phase 12: Tests & Documentation (Days 34-38)

**Goal:** Comprehensive test coverage and interactive notebooks.

| Day | Task | Deliverable |
| --- | --- | --- |
| 34-35 | Unit tests for all core modules | `tests/unit/` (target 80%+ coverage) |
| 36 | Integration tests | `tests/integration/` |
| 37 | Marimo notebooks | `notebooks/01_agent_comparison.py`, etc. |
| 38 | API reference docs | `docs/api_reference.md` |

**Success Criteria:**

- `pytest --cov=feature_forge` shows >80% coverage
- All integration tests pass
- Notebooks run end-to-end


## Phase 13: Benchmarks & Release Prep (Days 39-42)

**Goal:** Full benchmark suite and release readiness.

| Day | Task | Deliverable |
| --- | --- | --- |
| 39-40 | Run full benchmark suite | `.github/workflows/benchmark.yml` |
| 41 | Write README with quick start | `README.md` |
| 42 | Write migration guide | `docs/migration_guide.md` |

**Success Criteria:**

- Benchmark workflow runs on schedule
- README has working code examples
- Package installable via `uv pip install -e .`


## Total Timeline

| Phase | Duration | Cumulative |
| --- | --- | --- |
| 1-2 (Foundation) | 4 days | Day 4 |
| 3-5 (Core Engine) | 9 days | Day 13 |
| 6-8 (Pipeline) | 10 days | Day 23 |
| 9-11 (Methods + Data) | 10 days | Day 33 |
| 12-13 (Quality + Release) | 9 days | Day 42 |

**Total:** 42 days (~6 weeks, assuming one full-time developer)