BENCHMARKING.md — Benchmarking Guide¶
This document explains how to benchmark uncertainty_flow models using the built-in benchmarking framework.
Quick Start¶
# Run the consolidated benchmark suite on all default datasets
uv run python benchmarks/run_benchmarks.py --all-datasets
# Run on a single dataset with custom iterations
uv run python benchmarks/run_benchmarks.py -d weather -n 500 --iterations 5
# Generate a report from saved results
uv run python benchmarks/generate_report.py --output results/report.md
Available Datasets¶
The library integrates with HuggingFace datasets and includes 108 datasets for benchmarking:
| Dataset | Domain | Description |
|---|---|---|
| `weather` | Climate | Weather time series (ts-arena) |
| `exchange_rate` | Finance | Daily exchange rates |
| `electricity` | Energy | Electricity demand time series |
| `m4_daily` | Mixed | M4 daily forecasting competition |
| `m4_hourly` | Mixed | M4 hourly forecasting competition |
| `m4_weekly` | Mixed | M4 weekly forecasting competition |
| `m4_monthly` | Mixed | M4 monthly forecasting competition |
| `m4_quarterly` | Mixed | M4 quarterly forecasting competition |
| `m4_yearly` | Mixed | M4 yearly forecasting competition |
| `weatherbench_daily` | Climate | WeatherBench daily weather |
| `weatherbench_hourly_temperature` | Climate | WeatherBench hourly temperature |
| `monash_electricity_hourly` | Energy | Australian electricity demand |
| `monash_london_smart_meters` | Energy | London smart meter data |
| `ercot` | Energy | Texas electricity demand |
| `monash_traffic` | Transportation | Traffic flow data |
| `monash_pedestrian_counts` | Transportation | Pedestrian counts |
| `taxi_1h` | Transportation | Taxi trip counts (1h) |
| `monash_hospital` | Healthcare | Hospital admissions |
| `monash_fred_md` | Finance | FRED macroeconomic indicators |
| `m5` | Retail | Walmart sales data |
Filter by Domain¶
# List only energy datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Energy
# List only climate datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Climate
CLI Commands¶
Benchmark orchestration is implemented by the BenchmarkFlow module and exposed publicly through BenchmarkRunner and the CLI.
Flow lifecycle per run:
1. `load` dataset
2. `split` into tune/train/test (or rolling-origin splits)
3. `tune-per-run-context` (optional)
4. `fit/predict`
5. `evaluate`
6. `sink` through `ResultSink`
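The lifecycle above can be sketched as a plain sequence of stages. This is a minimal illustration and not the `BenchmarkFlow` implementation: `run_flow`, the 60/20/20 split, and the toy "training mean plus margin" model are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable

# Hypothetical stand-in for the real flow; names and split sizes are illustrative only.
def run_flow(rows: list[float], sink: Callable[[dict], None], tune: bool = True) -> dict:
    # 1. load: `rows` is assumed to be already loaded, in time order.
    # 2. split: oldest 60% train, next 20% tune, newest 20% test.
    n = len(rows)
    a, b = int(n * 0.6), int(n * 0.8)
    train, tune_rows, test = rows[:a], rows[a:b], rows[b:]

    # 3. tune (optional): choose the interval half-width on the tune split only.
    center = sum(train) / len(train)
    half_width = 1.0
    if tune:
        errors = sorted(abs(y - center) for y in tune_rows)
        half_width = errors[int(0.9 * (len(errors) - 1))]

    # 4. fit/predict: a constant interval around the training mean.
    lower, upper = center - half_width, center + half_width

    # 5. evaluate on the untouched test split.
    coverage = sum(lower <= y <= upper for y in test) / len(test)
    result = {"coverage_90": coverage, "sharpness_90": upper - lower, "was_tuned": tune}

    # 6. sink: hand the result to whatever persists it.
    sink(result)
    return result

collected: list[dict] = []
run_flow([float(i % 7) for i in range(100)], collected.append)
print(collected[0])
```

The point of the sketch is the ordering: tuning sees only the tune split, and the test split is touched once, at evaluation time.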
benchmark — Run Benchmark¶
uv run python -m uncertainty_flow.cli benchmark --dataset <name> [OPTIONS]
| Option | Default | Description |
|---|---|---|
| `--dataset`, `-d` | (required) | Dataset name or HuggingFace path |
| `--model`, `-m` | `all` | Models to run: `all`, `quantile-forest`, `conformal-regressor`, `conformal-forecaster` |
| `--n-samples`, `-n` | `1000` | Number of samples to use |
| `--horizon`, `-h` | `3` | Forecast horizon for time series models |
| `--n-estimators`, `-e` | `30` | Number of base estimators |
| `--target`, `-t` | `OT` | Target column name |
| `--auto-tune` | `true` | Enable/disable auto-tuning |
| `--target-coverage`, `-c` | `0.9` | Target coverage level for tuning |
| `--tune-samples` | `500` | Samples to use for tuning |
| `--output`, `-o` | `benchmark_results` | Output file prefix |
| `--json-only` | - | Output only JSON |
| `--csv-only` | - | Output only CSV |
Examples¶
# Run all models with auto-tuning (default)
uv run python -m uncertainty_flow.cli benchmark --dataset weather
# Run specific models
uv run python -m uncertainty_flow.cli benchmark --dataset m4_daily \
--model quantile-forest,conformal-regressor
# Run without auto-tuning (faster, uses default params)
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune
# Custom coverage target and sample size
uv run python -m uncertainty_flow.cli benchmark --dataset electricity \
--target-coverage 0.8 --n-samples 2000
# Save results
uv run python -m uncertainty_flow.cli benchmark --dataset weather \
--output my_results
list-datasets — List Available Datasets¶
# List all datasets
uv run python -m uncertainty_flow.cli list-datasets
# Filter by domain
uv run python -m uncertainty_flow.cli list-datasets --domain Energy
download-dataset — Download Dataset for Offline Use¶
# Download a single dataset
uv run python -m uncertainty_flow.cli download-dataset m4_daily
# Download to custom cache directory
uv run python -m uncertainty_flow.cli download-dataset weather --cache-dir /path/to/cache
Auto-Tuning¶
Auto-tuning is enabled by default. For each model, it searches for hyperparameters that achieve the target coverage level.
How It Works¶
- For each model, the tuner tests multiple parameter combinations
- Parameters are scored on validation splits only (no fit/predict on the same rows)
- The best parameters are used for the final benchmark on a separate untouched test holdout
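As a rough sketch of the selection step, assume candidates are ranked by how close their validation coverage is to the target, with interval width as the tie-breaker. The actual scoring rule is not specified in this guide, so `select_best` and the validation numbers below are hypothetical.

```python
# Hypothetical validation results: (params, validation coverage, validation sharpness).
candidates = [
    ({"n_estimators": 20, "min_samples_leaf": 3}, 0.84, 0.020),
    ({"n_estimators": 30, "min_samples_leaf": 5}, 0.91, 0.031),
    ({"n_estimators": 50, "min_samples_leaf": 10}, 0.97, 0.052),
]

def select_best(candidates, target_coverage=0.9):
    """Pick the candidate whose validation coverage is closest to the target.

    Ties are broken by the narrower interval. This ranking rule is an
    assumption for illustration, not the library's documented objective.
    """
    return min(candidates, key=lambda c: (abs(c[1] - target_coverage), c[2]))

best_params, val_coverage, val_sharpness = select_best(candidates)
print(best_params, val_coverage)
```

Only the winning parameters would then be refit and scored on the test holdout, which keeps the reported test metrics independent of the search.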
Validation Strategy (Leakage-Safe)¶
- Tabular tuning defaults to random holdout; for small datasets it uses CV.
- Time-series tuning defaults to temporal holdout.
- Optional hybrid validation uses outer holdout + inner out-of-sample CV on outer-train only.
- The selector is deterministic and logs chosen strategy and rationale.
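The two default strategies can be illustrated with plain lists. This is a sketch under stated assumptions: `temporal_holdout`, `random_holdout`, and `choose_split` are hypothetical helpers, the 20% validation fraction is arbitrary, and the CV and hybrid branches are omitted.

```python
import random

def temporal_holdout(rows, val_fraction=0.2):
    """Validate on the most recent rows; time-ordered data is never shuffled."""
    cut = len(rows) - int(len(rows) * val_fraction)
    return rows[:cut], rows[cut:]

def random_holdout(rows, val_fraction=0.2, seed=0):
    """Validate on a seeded random subset; suitable for tabular rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * val_fraction)
    return shuffled[:cut], shuffled[cut:]

def choose_split(task, rows):
    # Deterministic selection that also returns a human-readable reason.
    if task == "time_series":
        reason = "time_series task defaults to temporal holdout"
        return "temporal_holdout", reason, temporal_holdout(rows)
    reason = "tabular task defaults to random holdout"
    return "random_holdout", reason, random_holdout(rows)

strategy, reason, (train, val) = choose_split("time_series", list(range(10)))
print(strategy, train, val)
```

With a temporal holdout every validation row is later than every training row, which is what makes the split leakage-safe for forecasting.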
Example strategy logs:
validation_strategy strategy=temporal_holdout reason=time_series task defaults to temporal holdout ...
tuning_validation_plan model=conformal-forecaster strategy=temporal_holdout reason=time_series task defaults...
Search Space¶
| Model | Parameters Tested |
|---|---|
| `quantile-forest` | `n_estimators`: [20, 30, 50], `min_samples_leaf`: [3, 5, 10] |
| `conformal-regressor` | supported base-estimator params such as `n_estimators`, plus `calibration_size`: [0.15, 0.20, 0.25, 0.30] |
| `conformal-forecaster` | supported base-estimator params such as `n_estimators`, plus `calibration_size`: [0.15, 0.20, 0.25, 0.30] and `lags`: [1, 2, 3] |
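To get a feel for the size of a search space, the `quantile-forest` grid can be expanded with `itertools.product`. This assumes an exhaustive grid search, which the guide does not confirm (the tuner may sample or prune combinations); `expand_grid` is a hypothetical helper.

```python
from itertools import product

# Grid for quantile-forest, copied from the search space table.
search_space = {
    "n_estimators": [20, 30, 50],
    "min_samples_leaf": [3, 5, 10],
}

def expand_grid(space):
    """Yield one dict per combination in the Cartesian product of the value lists."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))

grid = list(expand_grid(search_space))
print(len(grid))  # 3 * 3 = 9 combinations
print(grid[0])
```

Under the same assumption, adding `calibration_size` (four values) and `lags` (three values) multiplies the count by 12, which is why tuning dominates run time for the conformal models.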
Disabling Auto-Tuning¶
# Faster runs with default parameters
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune
Output Format¶
JSON Output¶
{
  "dataset": "weather",
  "metadata": {
    "run_id": "3d115493",
    "timestamp": "2026-03-31T13:30:22Z",
    "dataset": "weather",
    "domain": "Climate",
    "n_samples": 1000,
    "horizon": 3,
    "test_size": 0.2,
    "auto_tune": true,
    "target_coverage": 0.9
  },
  "errors": [],
  "results": [
    {
      "model": "conformal-forecaster",
      "coverage_90": 0.9449,
      "coverage_80": 0.8788,
      "sharpness_90": 0.0223,
      "sharpness_80": 0.0148,
      "winkler_90": 0.0260,
      "winkler_80": 0.0197,
      "pinball_loss": 0.0027,
      "train_time_sec": 0.091,
      "n_samples": 997,
      "tuned_params": {"n_estimators": 50, "calibration_size": 0.25, "lags": 1},
      "was_tuned": true,
      "validation_coverage_90": 0.9123,
      "validation_sharpness_90": 0.0311,
      "validation_winkler_90": 0.0364,
      "validation_split_type": "temporal_holdout",
      "validation_strategy": "temporal_holdout",
      "validation_n_splits": 1,
      "test_split_type": "out_of_time"
    }
  ]
}
Serialized benchmark output is owned by `ResultSink` (`sinks.py`) and uses:
- top-level: `dataset`, `metadata`, `errors`, `results`
- no top-level `models` field
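A saved result can be inspected with the standard `json` module. The sketch below parses a trimmed copy of the example payload from an inline string; a real script would read the saved file instead, and the "coverage gap" summary is only an illustration of how the fields relate.

```python
import json

# A trimmed copy of the example payload; only fields shown in this guide are used.
payload = json.loads("""
{
  "dataset": "weather",
  "metadata": {"target_coverage": 0.9},
  "errors": [],
  "results": [
    {"model": "conformal-forecaster", "coverage_90": 0.9449, "sharpness_90": 0.0223}
  ]
}
""")

target = payload["metadata"]["target_coverage"]
summary = [
    (r["model"], round(r["coverage_90"] - target, 4), r["sharpness_90"])
    for r in payload["results"]
]
for model, coverage_gap, sharpness in summary:
    print(f"{model}: coverage gap {coverage_gap:+.4f}, sharpness {sharpness}")
```

Per-model entries live under `results`, so code that consumes the file should iterate over that key and check `errors` before trusting the metrics.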
Metrics Explained¶
| Metric | Description | Target |
|---|---|---|
| `coverage_90` | Fraction of true values within 90% prediction interval | ~0.90 |
| `coverage_80` | Fraction of true values within 80% prediction interval | ~0.80 |
| `sharpness_90` | Average width of 90% prediction intervals | Lower is better |
| `winkler_90` | Winkler score for 90% intervals | Lower is better |
| `pinball_loss` | Pinball loss at quantile 0.1 | Lower is better |
| `train_time_sec` | Training time in seconds | - |
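For reference, the metrics in the table can be computed by hand using their standard textbook definitions. This is a sketch of those definitions, not the library's code: the function names are hypothetical, and the library may aggregate differently (for example, across horizons or quantiles).

```python
def coverage(y, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    return sum(lo <= yi <= hi for yi, lo, hi in zip(y, lower, upper)) / len(y)

def sharpness(lower, upper):
    """Mean interval width."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

def winkler(y, lower, upper, alpha):
    """Mean Winkler score for (1 - alpha) intervals.

    Each score is the interval width plus a 2/alpha penalty per unit of miss.
    """
    total = 0.0
    for yi, lo, hi in zip(y, lower, upper):
        score = hi - lo
        if yi < lo:
            score += (2 / alpha) * (lo - yi)
        elif yi > hi:
            score += (2 / alpha) * (yi - hi)
        total += score
    return total / len(y)

def pinball(y, q_pred, q):
    """Mean pinball (quantile) loss of predictions q_pred at quantile level q."""
    return sum(max(q * (yi - p), (q - 1) * (yi - p)) for yi, p in zip(y, q_pred)) / len(y)

y = [1.0, 2.0, 3.0, 10.0]
lower = [0.0, 1.0, 2.0, 3.0]
upper = [2.0, 3.0, 4.0, 5.0]

print(coverage(y, lower, upper))            # 0.75: the last point is missed
print(sharpness(lower, upper))              # 2.0
print(winkler(y, lower, upper, alpha=0.1))  # 27.0: one miss of 5 units costs 100
print(pinball(y, lower, q=0.1))
```

The Winkler score shows why sharpness alone is not enough: the single missed point raises the mean score from 2.0 to 27.0.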
Using the Library Programmatically¶
Python API¶
from uncertainty_flow.benchmarking import BenchmarkConfig, BenchmarkRunner
# Create config with auto-tuning enabled
config = BenchmarkConfig(
    dataset_name="weather",
    n_samples=1000,
    horizon=3,
    auto_tune=True,
    target_coverage=0.9,
)
# Run benchmark
runner = BenchmarkRunner(config)
runner.load_data()
result = runner.run_all()
# Access results
for model_result in result.models:
    print(f"{model_result.model_name}:")
    print(f"  Coverage @ 90%: {model_result.coverage_90}")
    print(f"  Sharpness @ 90%: {model_result.sharpness_90}")
    print(f"  Tuned params: {model_result.tuned_params}")
# Save results
runner.save_json("results.json")
runner.save_csv("results.csv")
Extending Benchmark Models¶
Use the provider seam for new benchmark model adapters.
- Stable built-in names remain: `quantile-forest`, `conformal-regressor`, `conformal-forecaster`
- Provider contract lives in `providers.py` (`BenchmarkModelProvider`)
- Legacy class registry path in `runner.py` remains for compatibility, but provider-based extension is the primary path
- Prefer provider seam when adding new benchmark integrations or custom adapter logic
Maintainer Migration Note¶
Benchmark architecture moved from an all-in-`runner.py` mental model to split modules:
- orchestration: `flow.py`
- model seams: `providers.py`
- output seams: `sinks.py`
- configs/results contracts: `configs.py`, `results.py`
- public adapter: `runner.py`
Auto-Tuning Only¶
from uncertainty_flow.benchmarking.tuning import auto_tune_model, TuningConfig
from uncertainty_flow.benchmarking.datasets import load_dataset
# Load data
df, _ = load_dataset("weather", n_samples=500)
target = "OT"
# Tune a specific model
config = TuningConfig(target_coverage=0.9, n_samples=500)
result = auto_tune_model(
    model_name="conformal-forecaster",
    df=df,
    target=target,
    horizon=3,
    config=config,
)
print(f"Best params: {result.best_params}")
print(f"Coverage: {result.coverage_90}")
Best Practices¶
- Use Auto-Tuning: It improves coverage calibration at a small runtime cost.
- Choose Appropriate Sample Size: Use at least 500 samples for reliable tuning, 1000+ for final benchmarks.
- Match Horizon to Dataset: Set `--horizon` based on your forecasting needs. Larger horizons require more data.
- Compare Multiple Models: Different models excel on different datasets. Run `all` models to find the best fit.
- Consider Coverage vs Sharpness Trade-off: A model with slightly lower coverage but much tighter intervals may be preferable for some applications.
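One way to make the coverage versus sharpness trade-off explicit is to accept any model whose coverage is within a tolerance of the target and then prefer the tightest intervals. The rule, the `pick_model` helper, and the numbers below are hypothetical; only the field names follow the JSON output format.

```python
# Hypothetical benchmark rows; field names follow the JSON output format.
results = [
    {"model": "quantile-forest", "coverage_90": 0.88, "sharpness_90": 0.015},
    {"model": "conformal-regressor", "coverage_90": 0.91, "sharpness_90": 0.030},
    {"model": "conformal-forecaster", "coverage_90": 0.95, "sharpness_90": 0.022},
]

def pick_model(results, target=0.9, tolerance=0.03):
    """Keep models whose coverage is at most `tolerance` below the target, then take the sharpest."""
    acceptable = [r for r in results if r["coverage_90"] >= target - tolerance]
    if not acceptable:
        return None
    return min(acceptable, key=lambda r: r["sharpness_90"])

print(pick_model(results)["model"])                 # quantile-forest
print(pick_model(results, tolerance=0.0)["model"])  # conformal-forecaster
```

Tightening the tolerance to zero changes the winner, so the tolerance is the knob that encodes how much under-coverage an application can accept.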
Troubleshooting¶
"Dataset not found"¶
# Verify dataset name
uv run python -m uncertainty_flow.cli list-datasets | grep <name>
# Use full HuggingFace path if needed
uv run python -m uncertainty_flow.cli benchmark \
--dataset autogluon/chronos_datasets/m4_daily
Poor Coverage Results¶
- Enable auto-tuning to find better hyperparameters
- Increase `--tune-samples` for more reliable tuning
- Try a different model: some models work better on certain data patterns
Slow Benchmark Runs¶
- Reduce `--n-samples` for faster iteration
- Disable auto-tuning for quick experiments
- Reduce model complexity (fewer estimators)