BENCHMARKING.md — Benchmarking Guide

This document explains how to benchmark uncertainty_flow models using the built-in benchmarking framework.


Quick Start

# Run the consolidated benchmark suite on all default datasets
uv run python benchmarks/run_benchmarks.py --all-datasets

# Run on a single dataset with custom iterations
uv run python benchmarks/run_benchmarks.py -d weather -n 500 --iterations 5

# Generate a report from saved results
uv run python benchmarks/generate_report.py --output results/report.md

Available Datasets

The library integrates with HuggingFace datasets and includes 108 datasets for benchmarking:

| Dataset | Domain | Description |
| --- | --- | --- |
| weather | Climate | Weather time series (ts-arena) |
| exchange_rate | Finance | Daily exchange rates |
| electricity | Energy | Electricity demand time series |
| m4_daily | Mixed | M4 daily forecasting competition |
| m4_hourly | Mixed | M4 hourly forecasting competition |
| m4_weekly | Mixed | M4 weekly forecasting competition |
| m4_monthly | Mixed | M4 monthly forecasting competition |
| m4_quarterly | Mixed | M4 quarterly forecasting competition |
| m4_yearly | Mixed | M4 yearly forecasting competition |
| weatherbench_daily | Climate | WeatherBench daily weather |
| weatherbench_hourly_temperature | Climate | WeatherBench hourly temperature |
| monash_electricity_hourly | Energy | Australian electricity demand |
| monash_london_smart_meters | Energy | London smart meter data |
| ercot | Energy | Texas electricity demand |
| monash_traffic | Transportation | Traffic flow data |
| monash_pedestrian_counts | Transportation | Pedestrian counts |
| taxi_1h | Transportation | Taxi trip counts (1h) |
| monash_hospital | Healthcare | Hospital admissions |
| monash_fred_md | Finance | FRED macroeconomic indicators |
| m5 | Retail | Walmart sales data |

Filter by Domain

# List only energy datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Energy

# List only climate datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Climate

CLI Commands

Benchmark orchestration is implemented by the BenchmarkFlow module and exposed publicly through BenchmarkRunner and the CLI.

Flow lifecycle per run:

  1. load dataset
  2. split into tune/train/test (or rolling-origin splits)
  3. tune-per-run-context (optional)
  4. fit/predict
  5. evaluate
  6. sink through ResultSink
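
The split step can be sketched in plain Python. This is a minimal illustration, not the BenchmarkFlow implementation: the function names, the default fractions, and the choice to take the tuning rows from the start of the series are all assumptions made for the example.

```python
def three_way_split(rows, tune_frac=0.2, test_frac=0.2):
    """Chronological tune/train/test split (fractions are hypothetical defaults).

    The earliest rows go to tuning and the latest rows form the test holdout,
    so no row is shared between the three parts.
    """
    n = len(rows)
    n_tune = int(n * tune_frac)
    n_test = int(n * test_frac)
    return rows[:n_tune], rows[n_tune:n - n_test], rows[n - n_test:]


def rolling_origin_splits(n_rows, horizon, min_train):
    """Yield (train_indices, test_indices) with an expanding training window.

    Each fold trains on everything before the origin and tests on the next
    `horizon` rows; the origin then moves forward by `horizon`.
    """
    origin = min_train
    while origin + horizon <= n_rows:
        yield range(0, origin), range(origin, origin + horizon)
        origin += horizon


rows = list(range(10))
tune, train, test = three_way_split(rows)
print(tune, train, test)

for train_idx, test_idx in rolling_origin_splits(len(rows), horizon=3, min_train=4):
    print(list(train_idx), list(test_idx))
```

Both helpers keep the test rows strictly later than the training rows, which is the property a time-series benchmark needs to avoid leaking future values into the fit.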

benchmark — Run Benchmark

uv run python -m uncertainty_flow.cli benchmark --dataset <name> [OPTIONS]

| Option | Default | Description |
| --- | --- | --- |
| --dataset, -d | (required) | Dataset name or HuggingFace path |
| --model, -m | all | Models to run: all, quantile-forest, conformal-regressor, conformal-forecaster |
| --n-samples, -n | 1000 | Number of samples to use |
| --horizon, -h | 3 | Forecast horizon for time series models |
| --n-estimators, -e | 30 | Number of base estimators |
| --target, -t | OT | Target column name |
| --auto-tune | true | Enable/disable auto-tuning |
| --target-coverage, -c | 0.9 | Target coverage level for tuning |
| --tune-samples | 500 | Samples to use for tuning |
| --output, -o | benchmark_results | Output file prefix |
| --json-only | - | Output only JSON |
| --csv-only | - | Output only CSV |

Examples

# Run all models with auto-tuning (default)
uv run python -m uncertainty_flow.cli benchmark --dataset weather

# Run specific models
uv run python -m uncertainty_flow.cli benchmark --dataset m4_daily \
    --model quantile-forest,conformal-regressor

# Run without auto-tuning (faster, uses default params)
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune

# Custom coverage target and sample size
uv run python -m uncertainty_flow.cli benchmark --dataset electricity \
    --target-coverage 0.8 --n-samples 2000

# Save results
uv run python -m uncertainty_flow.cli benchmark --dataset weather \
    --output my_results

list-datasets — List Available Datasets

# List all datasets
uv run python -m uncertainty_flow.cli list-datasets

# Filter by domain
uv run python -m uncertainty_flow.cli list-datasets --domain Energy

download-dataset — Download Dataset for Offline Use

# Download a single dataset
uv run python -m uncertainty_flow.cli download-dataset m4_daily

# Download to custom cache directory
uv run python -m uncertainty_flow.cli download-dataset weather --cache-dir /path/to/cache

Auto-Tuning

Auto-tuning is enabled by default. For each model it searches the hyperparameters listed under Search Space for a combination whose prediction intervals reach the target coverage level (--target-coverage, default 0.9).

How It Works

  1. For each model, the tuner tests multiple parameter combinations
  2. Parameters are scored on validation splits only (no fit/predict on the same rows)
  3. The best parameters are used for the final benchmark on a separate untouched test holdout

Validation Strategy (Leakage-Safe)

  • Tabular tuning defaults to random holdout; for small datasets it uses CV.
  • Time-series tuning defaults to temporal holdout.
  • Optional hybrid validation uses outer holdout + inner out-of-sample CV on outer-train only.
  • The selector is deterministic and logs chosen strategy and rationale.

Example strategy logs:

validation_strategy strategy=temporal_holdout reason=time_series task defaults to temporal holdout ...
tuning_validation_plan model=conformal-forecaster strategy=temporal_holdout reason=time_series task defaults...
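
A deterministic selector of this kind could look like the sketch below. Only the temporal_holdout name and the log line shape come from the example logs above; the function name, the other strategy names, and the small_threshold cutoff are hypothetical.

```python
def select_validation_strategy(task, n_rows, hybrid=False, small_threshold=500):
    """Return (strategy, reason) for a tuning run.

    The same inputs always give the same answer, so the choice can be logged
    and reproduced. small_threshold is a made-up cutoff for "small" datasets.
    """
    if hybrid:
        return "hybrid", "outer holdout with inner out-of-sample CV on outer-train"
    if task == "time_series":
        return "temporal_holdout", "time_series task defaults to temporal holdout"
    if task == "tabular":
        if n_rows < small_threshold:
            return "cv", "small tabular dataset uses cross-validation"
        return "random_holdout", "tabular task defaults to random holdout"
    raise ValueError(f"unknown task type: {task!r}")


strategy, reason = select_validation_strategy("time_series", n_rows=1000)
print(f"validation_strategy strategy={strategy} reason={reason}")
```

Returning the reason alongside the strategy keeps the rationale next to the decision, which is what makes the log lines useful when comparing runs.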

Search Space

| Model | Parameters Tested |
| --- | --- |
| quantile-forest | n_estimators: [20, 30, 50], min_samples_leaf: [3, 5, 10] |
| conformal-regressor | supported base-estimator params such as n_estimators, plus calibration_size: [0.15, 0.20, 0.25, 0.30] |
| conformal-forecaster | supported base-estimator params such as n_estimators, plus calibration_size: [0.15, 0.20, 0.25, 0.30] and lags: [1, 2, 3] |
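
To get a feel for the size of the search, the quantile-forest row can be expanded into individual parameter combinations. The sketch assumes an exhaustive grid, which this guide does not confirm; expand_grid is a helper written for this example, not a library function.

```python
from itertools import product

# Values copied from the quantile-forest row of the table above.
QUANTILE_FOREST_GRID = {
    "n_estimators": [20, 30, 50],
    "min_samples_leaf": [3, 5, 10],
}


def expand_grid(grid):
    """Return every combination of the grid as a list of parameter dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(QUANTILE_FOREST_GRID)
print(len(candidates))  # 9 combinations (3 x 3)
print(candidates[0])    # {'n_estimators': 20, 'min_samples_leaf': 3}
```

Under that assumption each extra parameter multiplies the number of fits, which is why tuning adds noticeably to the runtime of a benchmark.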

Disabling Auto-Tuning

# Faster runs with default parameters
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune

Output Format

JSON Output

{
  "dataset": "weather",
  "metadata": {
    "run_id": "3d115493",
    "timestamp": "2026-03-31T13:30:22Z",
    "dataset": "weather",
    "domain": "Climate",
    "n_samples": 1000,
    "horizon": 3,
    "test_size": 0.2,
    "auto_tune": true,
    "target_coverage": 0.9
  },
  "errors": [],
  "results": [
    {
      "model": "conformal-forecaster",
      "coverage_90": 0.9449,
      "coverage_80": 0.8788,
      "sharpness_90": 0.0223,
      "sharpness_80": 0.0148,
      "winkler_90": 0.0260,
      "winkler_80": 0.0197,
      "pinball_loss": 0.0027,
      "train_time_sec": 0.091,
      "n_samples": 997,
      "tuned_params": {"n_estimators": 50, "calibration_size": 0.25, "lags": 1},
      "was_tuned": true,
      "validation_coverage_90": 0.9123,
      "validation_sharpness_90": 0.0311,
      "validation_winkler_90": 0.0364,
      "validation_split_type": "temporal_holdout",
      "validation_strategy": "temporal_holdout",
      "validation_n_splits": 1,
      "test_split_type": "out_of_time"
    }
  ]
}

Serialized benchmark output is owned by ResultSink (sinks.py) and uses:

  • top-level: dataset, metadata, errors, results
  • no top-level models field
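
A short sketch of consuming this layout with the standard json module follows. The embedded sample is a trimmed copy of the example above; for a saved run, the same function would be applied to the result of json.load on the file written by --output (the exact file name is not stated here, so it is left out).

```python
import json

# Trimmed copy of the JSON example above, embedded so the sketch is self-contained.
SAMPLE = """
{
  "dataset": "weather",
  "metadata": {"run_id": "3d115493", "target_coverage": 0.9},
  "errors": [],
  "results": [
    {"model": "conformal-forecaster", "coverage_90": 0.9449, "sharpness_90": 0.0223}
  ]
}
"""


def summarize(payload):
    """Return (model, coverage_90, gap to the nominal 0.90 level) per result row."""
    rows = []
    for result in payload["results"]:
        gap = round(result["coverage_90"] - 0.90, 4)
        rows.append((result["model"], result["coverage_90"], gap))
    return rows


payload = json.loads(SAMPLE)
if payload["errors"]:
    print("run reported errors:", payload["errors"])
for model, cov, gap in summarize(payload):
    print(f"{model}: coverage_90={cov} (gap {gap:+.4f})")
```

Model rows are read from results, since the serialized output has no top-level models field.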

Metrics Explained

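| Metric | Description | Target |
| --- | --- | --- |
| coverage_90 | Fraction of true values within the 90% prediction interval | ~0.90 |
| coverage_80 | Fraction of true values within the 80% prediction interval | ~0.80 |
| sharpness_90 | Average width of 90% prediction intervals | Lower is better |
| sharpness_80 | Average width of 80% prediction intervals | Lower is better |
| winkler_90 | Winkler score for 90% intervals | Lower is better |
| winkler_80 | Winkler score for 80% intervals | Lower is better |
| pinball_loss | Pinball loss at quantile 0.1 | Lower is better |
| train_time_sec | Training time in seconds | - |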
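
For reference, the interval metrics can be written out in a few lines of plain Python using their standard textbook definitions. This is a sketch for intuition; the library's own implementation may differ in details such as averaging or how quantiles are paired with interval bounds.

```python
def coverage(y, lower, upper):
    """Fraction of true values that fall inside their prediction interval."""
    return sum(lo <= v <= hi for v, lo, hi in zip(y, lower, upper)) / len(y)


def sharpness(lower, upper):
    """Average interval width."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


def winkler(y, lower, upper, alpha):
    """Mean Winkler score: interval width plus a 2/alpha penalty per unit of miss."""
    total = 0.0
    for v, lo, hi in zip(y, lower, upper):
        score = hi - lo
        if v < lo:
            score += (2.0 / alpha) * (lo - v)
        elif v > hi:
            score += (2.0 / alpha) * (v - hi)
        total += score
    return total / len(y)


def pinball(y, q_pred, q):
    """Mean pinball (quantile) loss of predictions q_pred at quantile level q."""
    return sum(max(q * (v - p), (q - 1.0) * (v - p)) for v, p in zip(y, q_pred)) / len(y)


# Made-up numbers: the last observation falls above its interval.
y = [1.0, 2.0, 3.0, 10.0]
lower = [0.0, 1.0, 2.0, 3.0]
upper = [2.0, 3.0, 4.0, 5.0]

print(coverage(y, lower, upper))            # 0.75
print(sharpness(lower, upper))              # 2.0
print(winkler(y, lower, upper, alpha=0.1))  # alpha = 0.1 for a 90% interval
print(pinball(y, lower, q=0.1))             # lower bound treated as the 0.1 quantile
```

The single miss dominates the Winkler score here, which shows why that metric rewards narrow intervals only as long as they still contain the observations.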

Using the Library Programmatically

Python API

from uncertainty_flow.benchmarking import BenchmarkConfig, BenchmarkRunner

# Create config with auto-tuning enabled
config = BenchmarkConfig(
    dataset_name="weather",
    n_samples=1000,
    horizon=3,
    auto_tune=True,
    target_coverage=0.9,
)

# Run benchmark
runner = BenchmarkRunner(config)
runner.load_data()
result = runner.run_all()

# Access results
for model_result in result.models:
    print(f"{model_result.model_name}:")
    print(f"  Coverage @ 90%: {model_result.coverage_90}")
    print(f"  Sharpness @ 90%: {model_result.sharpness_90}")
    print(f"  Tuned params: {model_result.tuned_params}")

# Save results
runner.save_json("results.json")
runner.save_csv("results.csv")

Extending Benchmark Models

Use the provider seam for new benchmark model adapters.

  • Stable built-in names remain: quantile-forest, conformal-regressor, conformal-forecaster
  • Provider contract lives in providers.py (BenchmarkModelProvider)
  • Legacy class registry path in runner.py remains for compatibility, but provider-based extension is the primary path
  • Prefer provider seam when adding new benchmark integrations or custom adapter logic

Maintainer Migration Note

Benchmark architecture moved from an all-in-runner.py mental model to split modules:

  • orchestration: flow.py
  • model seams: providers.py
  • output seams: sinks.py
  • configs/results contracts: configs.py, results.py
  • public adapter: runner.py

Auto-Tuning Only

from uncertainty_flow.benchmarking.tuning import auto_tune_model, TuningConfig
from uncertainty_flow.benchmarking.datasets import load_dataset

# Load data
df, _ = load_dataset("weather", n_samples=500)
target = "OT"

# Tune a specific model
config = TuningConfig(target_coverage=0.9, n_samples=500)
result = auto_tune_model(
    model_name="conformal-forecaster",
    df=df,
    target=target,
    horizon=3,
    config=config,
)

print(f"Best params: {result.best_params}")
print(f"Coverage: {result.coverage_90}")

Best Practices

  1. Use Auto-Tuning: It improves coverage calibration. Expect longer runs, because each model is fitted for several parameter combinations.

  2. Choose Appropriate Sample Size: Use at least 500 samples for reliable tuning and 1000 or more for final benchmarks.

  3. Match Horizon to Dataset: Set --horizon based on your forecasting needs. Larger horizons require more data.

  4. Compare Multiple Models: Different models excel on different datasets. Run all models to find the best fit.

  5. Consider Coverage vs Sharpness Trade-off: A model with slightly lower coverage but much tighter intervals may be preferable for some applications.


Troubleshooting

"Dataset not found"

# Verify dataset name
uv run python -m uncertainty_flow.cli list-datasets | grep <name>

# Use full HuggingFace path if needed
uv run python -m uncertainty_flow.cli benchmark \
    --dataset autogluon/chronos_datasets/m4_daily

Poor Coverage Results

  • Enable auto-tuning (the default; remove --no-auto-tune if you set it) to find better hyperparameters
  • Increase --tune-samples for more reliable tuning
  • Try a different model with --model, since some models work better on certain data patterns

Slow Benchmark Runs

  • Reduce --n-samples for faster iteration
  • Disable auto-tuning with --no-auto-tune for quick experiments
  • Reduce model complexity with a lower --n-estimators value