# Benchmarking Guide
This document explains how to benchmark uncertainty_flow models using the built-in benchmarking framework.
## Quick Start

```bash
# Run the consolidated benchmark suite on all default datasets
uv run python benchmarks/run_benchmarks.py --all-datasets

# Run on a single dataset with custom iterations
uv run python benchmarks/run_benchmarks.py -d weather -n 500 --iterations 5

# Generate a report from saved results
uv run python benchmarks/generate_report.py --output results/report.md
```
---
## Available Datasets
The library integrates with [HuggingFace datasets](https://huggingface.co/datasets) and includes **108 datasets** for benchmarking. A representative subset:
| Dataset | Domain | Description |
|---------|--------|-------------|
| `weather` | Climate | Weather time series (ts-arena) |
| `exchange_rate` | Finance | Daily exchange rates |
| `electricity` | Energy | Electricity demand time series |
| `m4_daily` | Mixed | M4 daily forecasting competition |
| `m4_hourly` | Mixed | M4 hourly forecasting competition |
| `m4_weekly` | Mixed | M4 weekly forecasting competition |
| `m4_monthly` | Mixed | M4 monthly forecasting competition |
| `m4_quarterly` | Mixed | M4 quarterly forecasting competition |
| `m4_yearly` | Mixed | M4 yearly forecasting competition |
| `weatherbench_daily` | Climate | WeatherBench daily weather |
| `weatherbench_hourly_temperature` | Climate | WeatherBench hourly temperature |
| `monash_electricity_hourly` | Energy | Australian electricity demand |
| `monash_london_smart_meters` | Energy | London smart meter data |
| `ercot` | Energy | Texas electricity demand |
| `monash_traffic` | Transportation | Traffic flow data |
| `monash_pedestrian_counts` | Transportation | Pedestrian counts |
| `taxi_1h` | Transportation | Taxi trip counts (1h) |
| `monash_hospital` | Healthcare | Hospital admissions |
| `monash_fred_md` | Finance | FRED macroeconomic indicators |
| `m5` | Retail | Walmart sales data |
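
Datasets can also be loaded directly in Python via the benchmarking module, using the same `load_dataset` call shown later in this guide:

```python
from uncertainty_flow.benchmarking.datasets import load_dataset

# Load a sample of the `weather` dataset; the second return value is
# unused here, matching the usage in the Auto-Tuning section below.
df, _ = load_dataset("weather", n_samples=1000)
print(df.head())
```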
### Filter by Domain
```bash
# List only energy datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Energy

# List only climate datasets
uv run python -m uncertainty_flow.cli list-datasets --domain Climate
```
## CLI Commands

### benchmark — Run Benchmark

```bash
uv run python -m uncertainty_flow.cli benchmark --dataset <name> [OPTIONS]
```
| Option | Default | Description |
|---|---|---|
| `--dataset, -d` | (required) | Dataset name or HuggingFace path |
| `--model, -m` | `all` | Models to run: `all`, `quantile-forest`, `conformal-regressor`, `conformal-forecaster` |
| `--n-samples, -n` | `1000` | Number of samples to use |
| `--horizon, -h` | `3` | Forecast horizon for time series models |
| `--n-estimators, -e` | `30` | Number of base estimators |
| `--target, -t` | `OT` | Target column name |
| `--auto-tune` | `true` | Enable/disable auto-tuning |
| `--target-coverage, -c` | `0.9` | Target coverage level for tuning |
| `--tune-samples` | `500` | Samples to use for tuning |
| `--output, -o` | `benchmark_results` | Output file prefix |
| `--json-only` | - | Output only JSON |
| `--csv-only` | - | Output only CSV |
#### Examples

```bash
# Run all models with auto-tuning (default)
uv run python -m uncertainty_flow.cli benchmark --dataset weather

# Run specific models
uv run python -m uncertainty_flow.cli benchmark --dataset m4_daily \
    --model quantile-forest,conformal-regressor

# Run without auto-tuning (faster, uses default params)
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune

# Custom coverage target and sample size
uv run python -m uncertainty_flow.cli benchmark --dataset electricity \
    --target-coverage 0.8 --n-samples 2000

# Save results
uv run python -m uncertainty_flow.cli benchmark --dataset weather \
    --output my_results
```
### list-datasets — List Available Datasets

```bash
# List all datasets
uv run python -m uncertainty_flow.cli list-datasets

# Filter by domain
uv run python -m uncertainty_flow.cli list-datasets --domain Energy
```
### download-dataset — Download Dataset for Offline Use

```bash
# Download a single dataset
uv run python -m uncertainty_flow.cli download-dataset m4_daily

# Download to custom cache directory
uv run python -m uncertainty_flow.cli download-dataset weather --cache-dir /path/to/cache
```
## Auto-Tuning
Auto-tuning is enabled by default and automatically finds optimal hyperparameters for each model to achieve the target coverage level.
### How It Works
- For each model, the tuner tests multiple parameter combinations (sketched below)
- Parameters are scored on validation splits only (no fit/predict on the same rows)
- The best parameters are used for the final benchmark on a separate, untouched test holdout
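
A rough illustration of this loop, assuming a hypothetical `fit_and_score` callback that trains on the training split and returns validation coverage (a sketch of the idea, not the library's internals):

```python
from itertools import product

def coverage_targeted_grid_search(param_grid, fit_and_score, target_coverage=0.9):
    """Try every combination in `param_grid` and keep the one whose
    validation coverage is closest to `target_coverage`."""
    keys = list(param_grid)
    best_params, best_gap = None, float("inf")
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        coverage = fit_and_score(params)       # fit on train, score on validation only
        gap = abs(coverage - target_coverage)
        if gap < best_gap:
            best_params, best_gap = params, gap
    return best_params

# Example grid matching the quantile-forest search space listed below
grid = {"n_estimators": [20, 30, 50], "min_samples_leaf": [3, 5, 10]}
```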
### Validation Strategy (Leakage-Safe)
- Tabular tuning defaults to random holdout; for small datasets it uses CV.
- Time-series tuning defaults to temporal holdout.
- Optional hybrid validation uses outer holdout + inner out-of-sample CV on outer-train only.
- The selector is deterministic and logs chosen strategy and rationale.
Example strategy logs:
```
validation_strategy strategy=temporal_holdout reason=time_series task defaults to temporal holdout ...
tuning_validation_plan model=conformal-forecaster strategy=temporal_holdout reason=time_series task defaults...
```
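
For intuition, a temporal holdout reserves the most recent rows for validation. A minimal sketch, assuming time-ordered rows (this mirrors the idea, not the library's internal selector):

```python
import pandas as pd

def temporal_holdout(df: pd.DataFrame, val_fraction: float = 0.2):
    """Split time-ordered rows so validation is strictly later than
    training, keeping future values out of the training split."""
    cut = int(len(df) * (1 - val_fraction))
    return df.iloc[:cut], df.iloc[cut:]
```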
### Search Space
| Model | Parameters Tested |
|---|---|
| `quantile-forest` | `n_estimators`: [20, 30, 50], `min_samples_leaf`: [3, 5, 10] |
| `conformal-regressor` | Supported base-estimator params such as `n_estimators`, plus `calibration_size`: [0.15, 0.20, 0.25, 0.30] |
| `conformal-forecaster` | Supported base-estimator params such as `n_estimators`, plus `calibration_size`: [0.15, 0.20, 0.25, 0.30] and `lags`: [1, 2, 3] |
### Disabling Auto-Tuning

```bash
# Faster runs with default parameters
uv run python -m uncertainty_flow.cli benchmark --dataset weather --no-auto-tune
```
## Output Format

### JSON Output
```json
{
  "metadata": {
    "run_id": "3d115493",
    "timestamp": "2026-03-31T13:30:22Z",
    "dataset": "weather",
    "domain": "Climate",
    "n_samples": 1000,
    "horizon": 3,
    "auto_tune": true,
    "target_coverage": 0.9
  },
  "results": [
    {
      "model": "conformal-forecaster",
      "coverage_90": 0.9449,
      "coverage_80": 0.8788,
      "sharpness_90": 0.0223,
      "sharpness_80": 0.0148,
      "winkler_90": 0.0260,
      "winkler_80": 0.0197,
      "pinball_loss": 0.0027,
      "train_time_sec": 0.091,
      "n_samples": 997,
      "tuned_params": {"n_estimators": 50, "calibration_size": 0.25, "lags": 1},
      "was_tuned": true
    }
  ]
}
```
### Metrics Explained
| Metric | Description | Target |
|---|---|---|
| `coverage_90` | Fraction of true values within the 90% prediction interval | ~0.90 |
| `coverage_80` | Fraction of true values within the 80% prediction interval | ~0.80 |
| `sharpness_90` | Average width of the 90% prediction intervals | Lower is better |
| `winkler_90` | Winkler score for 90% intervals | Lower is better |
| `pinball_loss` | Pinball loss at quantile 0.1 | Lower is better |
| `train_time_sec` | Training time in seconds | - |
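
These metrics follow their standard definitions. For reference, a minimal sketch of how they can be computed (not necessarily the library's exact implementation):

```python
import numpy as np

def interval_metrics(y, lower, upper, alpha=0.1):
    """Standard coverage, sharpness, and Winkler score for a
    (1 - alpha) prediction interval, e.g. alpha=0.1 for the 90% level."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    coverage = ((y >= lower) & (y <= upper)).mean()  # fraction of truths inside
    sharpness = (upper - lower).mean()               # average interval width
    winkler = ((upper - lower)
               + (2 / alpha) * (lower - y) * (y < lower)  # penalty below interval
               + (2 / alpha) * (y - upper) * (y > upper)  # penalty above interval
               ).mean()
    return coverage, sharpness, winkler

def pinball_loss(y, y_pred_quantile, q=0.1):
    """Pinball (quantile) loss at quantile q."""
    diff = np.asarray(y) - np.asarray(y_pred_quantile)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))
```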
## Using the Library Programmatically

### Python API
```python
from uncertainty_flow.benchmarking import BenchmarkConfig, BenchmarkRunner

# Create config with auto-tuning enabled
config = BenchmarkConfig(
    dataset_name="weather",
    n_samples=1000,
    horizon=3,
    auto_tune=True,
    target_coverage=0.9,
)

# Run benchmark
runner = BenchmarkRunner(config)
runner.load_data()
result = runner.run_all()

# Access results
for model_result in result.models:
    print(f"{model_result.model_name}:")
    print(f"  Coverage @ 90%: {model_result.coverage_90}")
    print(f"  Sharpness @ 90%: {model_result.sharpness_90}")
    print(f"  Tuned params: {model_result.tuned_params}")

# Save results
runner.save_json("results.json")
runner.save_csv("results.csv")
```
### Auto-Tuning Only
```python
from uncertainty_flow.benchmarking.tuning import auto_tune_model, TuningConfig
from uncertainty_flow.benchmarking.datasets import load_dataset

# Load data
df, _ = load_dataset("weather", n_samples=500)
target = "OT"

# Tune a specific model
config = TuningConfig(target_coverage=0.9, n_samples=500)
result = auto_tune_model(
    model_name="conformal-forecaster",
    df=df,
    target=target,
    horizon=3,
    config=config,
)

print(f"Best params: {result.best_params}")
print(f"Coverage: {result.coverage_90}")
```
## Best Practices
- **Use Auto-Tuning** — It significantly improves coverage calibration with minimal performance overhead.
- **Choose Appropriate Sample Size** — Use at least 500 samples for reliable tuning, 1000+ for final benchmarks.
- **Match Horizon to Dataset** — Set `--horizon` based on your forecasting needs. Larger horizons require more data.
- **Compare Multiple Models** — Different models excel on different datasets. Run `all` models to find the best fit.
- **Consider Coverage vs Sharpness Trade-off** — A model with slightly lower coverage but much tighter intervals may be preferable for some applications (see the sketch after this list).
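
To weigh the trade-off in one number, the Winkler score already combines both concerns: it grows with interval width and adds a penalty for every missed observation. A sketch that ranks models from a saved run (the filename is an assumption; the schema matches the JSON output above):

```python
import json

# Rank models by winkler_90 from a saved benchmark run (filename assumed)
with open("benchmark_results.json") as f:
    run = json.load(f)

for r in sorted(run["results"], key=lambda r: r["winkler_90"]):
    print(f'{r["model"]}: winkler_90={r["winkler_90"]:.4f}, '
          f'coverage_90={r["coverage_90"]:.4f}')
```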
## Troubleshooting

### "Dataset not found"
```bash
# Verify dataset name
uv run python -m uncertainty_flow.cli list-datasets | grep <name>

# Use full HuggingFace path if needed
uv run python -m uncertainty_flow.cli benchmark \
    --dataset autogluon/chronos_datasets/m4_daily
```
### Poor Coverage Results
- Enable auto-tuning to find better hyperparameters
- Increase `--tune-samples` for more reliable tuning
- Try a different model — some models work better on certain data patterns
### Slow Benchmark Runs
- Reduce `--n-samples` for faster iteration
- Disable auto-tuning for quick experiments
- Reduce model complexity (fewer estimators)