# Benchmarking Guide

This document explains how to benchmark uncertainty_flow models using the built-in benchmarking framework.

## Quick Start

```bash
# Run the consolidated benchmark suite on all default datasets
uv run python benchmarks/run_benchmarks.py --all-datasets

# Run on a single dataset with custom iterations
uv run python benchmarks/run_benchmarks.py -d weather -n 500 --iterations 5

# Generate a report from saved results
uv run python benchmarks/generate_report.py --output results/report.md
```

---

## Available Datasets

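The library integrates with [HuggingFace datasets](https://huggingface.co/datasets) and includes **108 datasets** for benchmarking. The table below lists 20 of them; run `list-datasets` to see everything available in your installation: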

| Dataset | Domain | Description |
|---------|--------|-------------|
| `weather` | Climate | Weather time series (ts-arena) |
| `exchange_rate` | Finance | Daily exchange rates |
| `electricity` | Energy | Electricity demand time series |
| `m4_daily` | Mixed | M4 daily forecasting competition |
| `m4_hourly` | Mixed | M4 hourly forecasting competition |
| `m4_weekly` | Mixed | M4 weekly forecasting competition |
| `m4_monthly` | Mixed | M4 monthly forecasting competition |
| `m4_quarterly` | Mixed | M4 quarterly forecasting competition |
| `m4_yearly` | Mixed | M4 yearly forecasting competition |
| `weatherbench_daily` | Climate | WeatherBench daily weather |
| `weatherbench_hourly_temperature` | Climate | WeatherBench hourly temperature |
| `monash_electricity_hourly` | Energy | Australian electricity demand |
| `monash_london_smart_meters` | Energy | London smart meter data |
| `ercot` | Energy | Texas electricity demand |
| `monash_traffic` | Transportation | Traffic flow data |
| `monash_pedestrian_counts` | Transportation | Pedestrian counts |
| `taxi_1h` | Transportation | Taxi trip counts (1h) |
| `monash_hospital` | Healthcare | Hospital admissions |
| `monash_fred_md` | Finance | FRED macroeconomic indicators |
| `m5` | Retail | Walmart sales data |

### Filter by Domain

```bash
# List only energy datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Energy

# List only climate datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Climate
```

## CLI Commands

### `benchmark`: Run Benchmark

```bash
uv run python -m uncertainty_flow.cli benchmark --dataset <name> [OPTIONS]
```

| Option | Default | Description |
|--------|---------|-------------|
| `--dataset`, `-d` | (required) | Dataset name or HuggingFace path |
| `--model`, `-m` | `all` | Models to run: `all`, `quantile-forest`, `conformal-regressor`, `conformal-forecaster` |
| `--n-samples`, `-n` | `1000` | Number of samples to use |
| `--horizon`, `-h` | `3` | Forecast horizon for time series models |
| `--n-estimators`, `-e` | `30` | Number of base estimators |
| `--target`, `-t` | `OT` | Target column name |
| `--auto-tune` / `--no-auto-tune` | `true` | Enable or disable auto-tuning |
| `--target-coverage`, `-c` | `0.9` | Target coverage level for tuning |
| `--tune-samples` | `500` | Samples to use for tuning |
| `--output`, `-o` | `benchmark_results` | Output file prefix |
| `--json-only` | - | Output only JSON |
| `--csv-only` | - | Output only CSV |

#### Examples

```bash
# Run all models with auto-tuning (default)
uv run python -m uncertainty_flow.cli benchmark --dataset weather

# Run specific models
uv run python -m uncertainty_flow.cli benchmark --dataset m4_daily \
    --model quantile-forest,conformal-regressor

# Run without auto-tuning (faster, uses default params)
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune

# Custom coverage target and sample size
uv run python -m uncertainty_flow.cli benchmark --dataset electricity \
    --target-coverage 0.8 --n-samples 2000

# Save results
uv run python -m uncertainty_flow.cli benchmark --dataset weather \
    --output my_results
```

### `list-datasets`: List Available Datasets

```bash
# List all datasets
uv run python -m uncertainty_flow.cli list-datasets

# Filter by domain
uv run python -m uncertainty_flow.cli list-datasets --domain Energy
```

### `download-dataset`: Download Dataset for Offline Use

```bash
# Download a single dataset
uv run python -m uncertainty_flow.cli download-dataset m4_daily

# Download to custom cache directory
uv run python -m uncertainty_flow.cli download-dataset weather --cache-dir /path/to/cache
```

## Auto-Tuning

Auto-tuning is enabled by default. For each model, it searches for hyperparameters that achieve the target coverage level (`--target-coverage`, default `0.9`).

### How It Works

1. For each model, the tuner tests multiple parameter combinations.
2. Each combination is scored on validation splits only, so a model is never fit and evaluated on the same rows.
3. The best parameters are used for the final benchmark on a separate, untouched test holdout.
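
The three steps above can be sketched as a small grid search. This is a minimal illustration rather than the library's tuner: the toy interval model (`fit_interval`), the 60/20/20 split, and the scoring rule (absolute gap between validation coverage and the target) are all assumptions made for the example.

```python
import itertools
import random
import statistics


def split(rows, train=0.6, val=0.2):
    """Split rows into non-overlapping train / validation / test partitions."""
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_interval(train_rows, width_factor):
    """Toy 'model' (hypothetical): a symmetric interval around the training mean."""
    mu = statistics.fmean(train_rows)
    sd = statistics.pstdev(train_rows)
    return mu - width_factor * sd, mu + width_factor * sd


def coverage(interval, rows):
    """Fraction of rows that fall inside the interval."""
    lo, hi = interval
    return sum(lo <= y <= hi for y in rows) / len(rows)


def tune(rows, grid, target_coverage=0.9):
    train, val, test = split(rows)
    best_params, best_gap = None, float("inf")
    # Steps 1-2: try every combination, scoring on validation rows only.
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        gap = abs(coverage(fit_interval(train, **params), val) - target_coverage)
        if gap < best_gap:
            best_params, best_gap = params, gap
    # Step 3: evaluate the winner once on the untouched test holdout.
    test_cov = coverage(fit_interval(train, **best_params), test)
    return best_params, test_cov


rng = random.Random(0)
rows = [rng.gauss(0.0, 1.0) for _ in range(1000)]
grid = {"width_factor": [1.0, 1.5, 1.645, 2.0, 2.5]}
best_params, test_cov = tune(rows, grid)
print(best_params, round(test_cov, 3))
```

The test rows never influence which parameters win, so the reported coverage is an out-of-sample estimate.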

### Validation Strategy (Leakage-Safe)

- Tabular tuning defaults to a random holdout; small datasets use cross-validation (CV) instead.
- Time-series tuning defaults to a temporal holdout.
- Optional hybrid validation combines an outer holdout with inner out-of-sample CV run on the outer-train split only.
- The selector is deterministic and logs the chosen strategy and its rationale.

Example strategy logs:

```text
validation_strategy strategy=temporal_holdout reason=time_series task defaults to temporal holdout ...
tuning_validation_plan model=conformal-forecaster strategy=temporal_holdout reason=time_series task defaults...
```
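
A deterministic selector of this kind might look like the sketch below. Only the `temporal_holdout` strategy name and its reason string come from the logs above; the function names, the other strategy labels (`random_holdout`, `cv`, `hybrid`), and the 500-row "small dataset" threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationPlan:
    strategy: str
    reason: str


def choose_validation_strategy(task, n_rows, small_threshold=500, hybrid=False):
    """Pick a strategy from task type and size alone, so the choice is reproducible."""
    if hybrid:
        return ValidationPlan("hybrid", "outer holdout + inner CV on outer-train only")
    if task == "time_series":
        return ValidationPlan(
            "temporal_holdout", "time_series task defaults to temporal holdout"
        )
    if task == "tabular":
        if n_rows < small_threshold:  # threshold is an assumption
            return ValidationPlan("cv", "small tabular dataset uses cross-validation")
        return ValidationPlan("random_holdout", "tabular task defaults to random holdout")
    raise ValueError(f"unknown task type: {task!r}")


def temporal_holdout(rows, val_fraction=0.2):
    """Validate on the most recent rows so no future data leaks into training."""
    cut = len(rows) - max(1, int(len(rows) * val_fraction))
    return rows[:cut], rows[cut:]


plan = choose_validation_strategy("time_series", n_rows=1000)
print(f"validation_strategy strategy={plan.strategy} reason={plan.reason}")

train, val = temporal_holdout(list(range(10)))
print(train, val)
```

Because the temporal split keeps row order, every validation row is later than every training row, which is what makes it leakage-safe for forecasting.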

### Search Space

| Model | Parameters Tested |
|-------|-------------------|
| `quantile-forest` | `n_estimators`: [20, 30, 50], `min_samples_leaf`: [3, 5, 10] |
| `conformal-regressor` | supported base-estimator params such as `n_estimators`, plus `calibration_size`: [0.15, 0.20, 0.25, 0.30] |
| `conformal-forecaster` | supported base-estimator params such as `n_estimators`, plus `calibration_size`: [0.15, 0.20, 0.25, 0.30] and `lags`: [1, 2, 3] |
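
To get a feel for the cost of tuning, the sketch below expands each grid into concrete parameter combinations. It assumes the tuner evaluates the full Cartesian product and that the conformal models reuse the `n_estimators` values listed for `quantile-forest`; neither is stated in the table, and `SEARCH_SPACE` and `expand` are illustrative names.

```python
import itertools

SEARCH_SPACE = {
    "quantile-forest": {
        "n_estimators": [20, 30, 50],
        "min_samples_leaf": [3, 5, 10],
    },
    "conformal-regressor": {
        "n_estimators": [20, 30, 50],  # assumed values for the base estimator
        "calibration_size": [0.15, 0.20, 0.25, 0.30],
    },
    "conformal-forecaster": {
        "n_estimators": [20, 30, 50],  # assumed values for the base estimator
        "calibration_size": [0.15, 0.20, 0.25, 0.30],
        "lags": [1, 2, 3],
    },
}


def expand(grid):
    """Return every combination in the grid as a list of parameter dicts."""
    keys = list(grid)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(grid[k] for k in keys))
    ]


for model, grid in SEARCH_SPACE.items():
    print(f"{model}: {len(expand(grid))} combinations")
```

Under these assumptions the forecaster grid is four times the size of the `quantile-forest` grid, which is worth keeping in mind when tuning time matters.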

### Disabling Auto-Tuning

```bash
# Faster runs with default parameters
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune
```

## Output Format

### JSON Output

```json
{
  "metadata": {
    "run_id": "3d115493",
    "timestamp": "2026-03-31T13:30:22Z",
    "dataset": "weather",
    "domain": "Climate",
    "n_samples": 1000,
    "horizon": 3,
    "auto_tune": true,
    "target_coverage": 0.9
  },
  "results": [
    {
      "model": "conformal-forecaster",
      "coverage_90": 0.9449,
      "coverage_80": 0.8788,
      "sharpness_90": 0.0223,
      "sharpness_80": 0.0148,
      "winkler_90": 0.0260,
      "winkler_80": 0.0197,
      "pinball_loss": 0.0027,
      "train_time_sec": 0.091,
      "n_samples": 997,
      "tuned_params": {"n_estimators": 50, "calibration_size": 0.25, "lags": 1},
      "was_tuned": true
    }
  ]
}
```
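
A results file with this shape can be post-processed with the standard library alone. The sketch below parses an abbreviated copy of the example above and reports how far each model's 90% coverage sits from the target; in practice you would `json.load` the file written by `--output` instead of using an inline string. Comparing `coverage_90` against `target_coverage` only makes sense when the target is 0.9, as it is here.

```python
import json

# Abbreviated copy of the example output above.
payload = json.loads("""
{
  "metadata": {"dataset": "weather", "target_coverage": 0.9},
  "results": [
    {"model": "conformal-forecaster", "coverage_90": 0.9449,
     "sharpness_90": 0.0223, "winkler_90": 0.0260, "was_tuned": true}
  ]
}
""")

target = payload["metadata"]["target_coverage"]

# Positive gap = intervals cover more often than requested (conservative).
summary = [
    {
        "model": r["model"],
        "coverage_gap": round(r["coverage_90"] - target, 4),
        "sharpness_90": r["sharpness_90"],
    }
    for r in payload["results"]
]

for row in summary:
    print(row)
```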

### Metrics Explained

| Metric | Description | Target |
|--------|-------------|--------|
| `coverage_90` | Fraction of true values within 90% prediction interval | ~0.90 |
| `coverage_80` | Fraction of true values within 80% prediction interval | ~0.80 |
| `sharpness_90` | Average width of 90% prediction intervals | Lower is better |
| `winkler_90` | Winkler score for 90% intervals | Lower is better |
| `pinball_loss` | Pinball loss at quantile 0.1 | Lower is better |
| `train_time_sec` | Training time in seconds | - |
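
For reference, the metrics in the table can be computed as follows using their standard textbook definitions. This is a sketch: the library's own implementations may differ in details such as averaging or normalization, and the function names and toy data here are illustrative.

```python
def coverage(y, lower, upper):
    """Fraction of true values inside [lower, upper]."""
    return sum(l <= v <= u for v, l, u in zip(y, lower, upper)) / len(y)


def sharpness(lower, upper):
    """Average interval width."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)


def winkler(y, lower, upper, alpha):
    """Interval width plus a 2/alpha penalty per unit of miss, averaged over rows."""
    total = 0.0
    for v, l, u in zip(y, lower, upper):
        score = u - l
        if v < l:
            score += (2.0 / alpha) * (l - v)
        elif v > u:
            score += (2.0 / alpha) * (v - u)
        total += score
    return total / len(y)


def pinball(y, q_pred, tau):
    """Pinball (quantile) loss for predictions of the tau-quantile."""
    return sum(
        max(tau * (v - q), (tau - 1.0) * (v - q)) for v, q in zip(y, q_pred)
    ) / len(y)


# Toy data: the last observation falls above its interval.
y = [1.0, 2.0, 3.0, 10.0]
lower = [0.0, 1.0, 2.0, 3.0]
upper = [2.0, 3.0, 4.0, 5.0]

print("coverage:", coverage(y, lower, upper))
print("sharpness:", sharpness(lower, upper))
print("winkler (alpha=0.1):", winkler(y, lower, upper, alpha=0.1))
print("pinball (tau=0.1):", pinball(y, lower, tau=0.1))
```

In the toy data a single miss of 5 units dominates the Winkler score, even though all four intervals have the same width. This is why the score rewards narrow intervals only when they still cover.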

## Using the Library Programmatically

### Python API

```python
from uncertainty_flow.benchmarking import BenchmarkConfig, BenchmarkRunner

# Create config with auto-tuning enabled
config = BenchmarkConfig(
    dataset_name="weather",
    n_samples=1000,
    horizon=3,
    auto_tune=True,
    target_coverage=0.9,
)

# Run benchmark
runner = BenchmarkRunner(config)
runner.load_data()
result = runner.run_all()

# Access results
for model_result in result.models:
    print(f"{model_result.model_name}:")
    print(f"  Coverage @ 90%: {model_result.coverage_90}")
    print(f"  Sharpness @ 90%: {model_result.sharpness_90}")
    print(f"  Tuned params: {model_result.tuned_params}")

# Save results
runner.save_json("results.json")
runner.save_csv("results.csv")
```

### Auto-Tuning Only

```python
from uncertainty_flow.benchmarking.tuning import auto_tune_model, TuningConfig
from uncertainty_flow.benchmarking.datasets import load_dataset

# Load data
df, _ = load_dataset("weather", n_samples=500)
target = "OT"

# Tune a specific model
config = TuningConfig(target_coverage=0.9, n_samples=500)
result = auto_tune_model(
    model_name="conformal-forecaster",
    df=df,
    target=target,
    horizon=3,
    config=config,
)

print(f"Best params: {result.best_params}")
print(f"Coverage: {result.coverage_90}")
```

## Best Practices

1. **Use auto-tuning.** It improves coverage calibration for a modest increase in runtime.
2. **Choose an appropriate sample size.** Use at least 500 samples for reliable tuning and 1000 or more for final benchmarks.
3. **Match the horizon to the dataset.** Set `--horizon` based on your forecasting needs; larger horizons require more data.
4. **Compare multiple models.** Different models excel on different datasets, so run all of them to find the best fit.
5. **Consider the coverage vs. sharpness trade-off.** A model with slightly lower coverage but much tighter intervals may be preferable for some applications.
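
One way to make the coverage vs. sharpness trade-off explicit is to accept any model whose coverage is within a tolerance of the target and then pick the sharpest one. The sketch below assumes result dicts with the `coverage_90` and `sharpness_90` keys from the JSON output; the `pick_model` helper, the 0.02 tolerance, and the three result rows are invented for illustration and are not benchmark results.

```python
def pick_model(results, target=0.9, tolerance=0.02):
    """Among models close enough to the target coverage, return the sharpest."""
    eligible = [r for r in results if r["coverage_90"] >= target - tolerance]
    if eligible:
        return min(eligible, key=lambda r: r["sharpness_90"])
    # Nothing reaches the tolerance band: fall back to the best-covering model.
    return max(results, key=lambda r: r["coverage_90"])


# Made-up numbers, for illustration only.
results = [
    {"model": "model_a", "coverage_90": 0.95, "sharpness_90": 0.40},
    {"model": "model_b", "coverage_90": 0.89, "sharpness_90": 0.25},
    {"model": "model_c", "coverage_90": 0.80, "sharpness_90": 0.10},
]

print(pick_model(results)["model"])                 # tolerant: sharper model wins
print(pick_model(results, tolerance=0.0)["model"])  # strict: coverage must reach 0.9
```

With the default tolerance, `model_b` is preferred because it gives up one point of coverage for much narrower intervals; with no tolerance, only `model_a` qualifies. `model_c` has the tightest intervals but is never chosen because it under-covers.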

## Troubleshooting

### "Dataset not found"

```bash
# Verify dataset name
uv run python -m uncertainty_flow.cli list-datasets | grep <name>

# Use full HuggingFace path if needed
uv run python -m uncertainty_flow.cli benchmark \
    --dataset autogluon/chronos_datasets/m4_daily
```

### Poor Coverage Results

- Enable auto-tuning to find better hyperparameters.
- Increase `--tune-samples` for more reliable tuning.
- Try a different model; some models work better on certain data patterns.

### Slow Benchmark Runs

- Reduce `--n-samples` for faster iteration.
- Disable auto-tuning (`--no-auto-tune`) for quick experiments.
- Reduce model complexity by lowering `--n-estimators`.