# Metrics

## uncertainty_flow.metrics
Metrics for evaluating probabilistic predictions.
### `calibration_error(y_true, lower, upper, nominal_coverage=0.9)`
Absolute deviation of empirical coverage from nominal coverage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `lower` | Series \| ndarray | Lower bound of prediction interval | required |
| `upper` | Series \| ndarray | Upper bound of prediction interval | required |
| `nominal_coverage` | float | Target coverage level (e.g. 0.9) | 0.9 |

Returns:

| Type | Description |
|---|---|
| float | Absolute calibration error (float). Lower is better; 0 means perfectly calibrated. |
Source code in uncertainty_flow/metrics/calibration.py
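The computation reduces to comparing empirical coverage against the nominal level. A minimal numpy sketch (not the library implementation):

```python
import numpy as np

def calibration_error_sketch(y_true, lower, upper, nominal_coverage=0.9):
    # Empirical coverage: fraction of true values inside [lower, upper].
    covered = (y_true >= lower) & (y_true <= upper)
    return abs(covered.mean() - nominal_coverage)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
lo = np.array([0.5, 1.5, 2.5, 4.5, 4.5])
hi = np.array([1.5, 2.5, 3.5, 4.8, 5.5])
# 4 of 5 values fall inside their interval -> empirical coverage 0.8,
# so the deviation from the 0.9 target is about 0.1.
print(calibration_error_sketch(y, lo, hi))
```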
### `diebold_mariano_test(errors_a, errors_b, one_sided=True)`
Diebold-Mariano test for equal predictive accuracy.
Tests the null hypothesis that two sets of forecast errors have equal expected loss.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `errors_a` | ndarray | Per-sample errors from model A. | required |
| `errors_b` | ndarray | Per-sample errors from model B. | required |
| `one_sided` | bool | If True (default), tests whether A has lower loss than B; if False, a two-sided test for equal loss. | True |

Returns:

| Type | Description |
|---|---|
| DataFrame | Polars DataFrame with the test results. |
Source code in uncertainty_flow/metrics/comparison.py
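The core of the test is the standardized mean loss differential. A minimal sketch, assuming squared-error loss and a large-sample normal approximation (the library may use a different loss or small-sample correction):

```python
import numpy as np
from math import erf, sqrt

def dm_test_sketch(errors_a, errors_b, one_sided=True):
    # Loss differential under squared-error loss (an assumption of this sketch).
    d = errors_a ** 2 - errors_b ** 2
    n = len(d)
    # DM statistic: mean differential over its estimated standard error.
    dm = d.mean() / np.sqrt(d.var(ddof=1) / n)
    # Normal approximation for the p-value.
    cdf = 0.5 * (1 + erf(dm / sqrt(2)))
    p_value = cdf if one_sided else 2 * min(cdf, 1 - cdf)
    return dm, p_value

rng = np.random.default_rng(0)
errors_a = rng.normal(0, 0.5, size=200)   # model A: smaller errors
errors_b = rng.normal(0, 1.0, size=200)   # model B: larger errors
stat, p = dm_test_sketch(errors_a, errors_b)
print(stat, p)   # strongly negative statistic, small p-value favoring A
```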
### `model_confidence_set(predictions, y_true, metric='crps', alpha=0.05)`
Hansen et al. (2011) Model Confidence Set.
Sequentially eliminates models that are significantly inferior to the best-performing model. Uses the Diebold-Mariano test for pairwise comparisons with a Bonferroni correction.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `predictions` | dict[str, DistributionPrediction] | Dict mapping model names to `DistributionPrediction` objects. | required |
| `y_true` | ndarray | True values. | required |
| `metric` | str | Scoring metric (e.g. `'crps'`). | `'crps'` |
| `alpha` | float | Significance level (default 0.05). | 0.05 |

Returns:

| Type | Description |
|---|---|
| DataFrame | Polars DataFrame with per-model results, including whether each model survives in the confidence set. |
Source code in uncertainty_flow/metrics/comparison.py
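The elimination loop can be sketched as follows, operating directly on per-sample losses; the details (squared-loss DM statistic, stopping rule) are assumptions of this sketch, not the library's exact procedure:

```python
import numpy as np
from math import erf, sqrt

def mcs_sketch(losses, alpha=0.05):
    """losses: dict of model name -> (n,) per-sample losses.
    Returns the set of surviving model names."""
    survivors = set(losses)
    while len(survivors) > 1:
        means = {m: losses[m].mean() for m in survivors}
        best = min(means, key=means.get)
        worst = max(means, key=means.get)
        # One-sided DM-style test: is `worst` significantly inferior to `best`?
        d = losses[worst] - losses[best]
        stat = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
        p = 1 - 0.5 * (1 + erf(stat / sqrt(2)))
        # Bonferroni correction over the remaining comparisons.
        if p < alpha / (len(survivors) - 1):
            survivors.discard(worst)
        else:
            break
    return survivors

rng = np.random.default_rng(1)
losses = {
    "a": rng.normal(1.0, 0.1, 500),
    "b": rng.normal(1.0, 0.1, 500),
    "bad": rng.normal(2.0, 0.1, 500),   # clearly inferior model
}
print(mcs_sketch(losses))   # "bad" is eliminated from the set
```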
### `skill_score(pred_a, pred_b, y_true, metric='crps')`
Relative skill score of model A vs model B.
The skill score (SS) is defined as:
SS = 1 - score(A) / score(B)
where SS > 0 means model A outperforms model B.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pred_a` | DistributionPrediction | Predictions from model A. | required |
| `pred_b` | DistributionPrediction | Predictions from model B (baseline). | required |
| `y_true` | ndarray | True values. | required |
| `metric` | str | Scoring metric (e.g. `'crps'`). | `'crps'` |

Returns:

| Type | Description |
|---|---|
| DataFrame | Polars DataFrame with the skill score results. |
Source code in uncertainty_flow/metrics/comparison.py
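The formula above is a one-liner once per-model mean scores are in hand. A small worked sketch with hypothetical CRPS values (not the library's DataFrame-returning API):

```python
def skill_score_sketch(score_a, score_b):
    # SS = 1 - score(A) / score(B); positive values mean A beats baseline B.
    return 1 - score_a / score_b

# Hypothetical mean CRPS values: 0.8 for model A, 1.0 for baseline B.
print(skill_score_sketch(0.8, 1.0))   # roughly 0.2: A improves on B by ~20%
```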
### `coverage_score(y_true, lower, upper)`
Fraction of true values that fall within the prediction interval.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `lower` | Series \| ndarray | Lower bound of prediction interval | required |
| `upper` | Series \| ndarray | Upper bound of prediction interval | required |

Returns:

| Type | Description |
|---|---|
| float | Fraction of values within the interval (float in [0, 1]) |

Raises:

| Type | Description |
|---|---|
| ValueError | If bounds are invalid |

Examples:

```python
>>> import polars as pl
>>> y_true = pl.Series([1, 2, 3, 4, 5])
>>> lower = pl.Series([0.5, 1.5, 2.5, 3.5, 4.5])
>>> upper = pl.Series([1.5, 2.5, 3.5, 4.5, 5.5])
>>> coverage_score(y_true, lower, upper)
1.0
```
Source code in uncertainty_flow/metrics/coverage.py
### `crps_quantile(y_true, quantile_matrix, quantile_levels)`
Exact CRPS from quantile predictions via the quantile-score decomposition.
Uses the closed-form decomposition:
CRPS = 2 * Σⱼ wⱼ * [𝟙(y < qⱼ) - τⱼ] * (qⱼ - y)
where wⱼ = (τⱼ₊₁ - τⱼ₋₁) / 2 are trapezoidal quadrature weights.
Reference: Laio & Tamea (2007); also used by properscoring and
scoringrules packages.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | ndarray | (n,) array of true values. | required |
| `quantile_matrix` | ndarray | (n, k) array of predicted quantile values per sample. | required |
| `quantile_levels` | ndarray | (k,) array of quantile levels in (0, 1), strictly increasing. | required |

Returns:

| Type | Description |
|---|---|
| float | Mean CRPS across all samples. Lower is better. |
Source code in uncertainty_flow/metrics/crps.py
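The decomposition above can be sketched in a few lines of numpy. The endpoint padding of the quadrature weights with 0 and 1 is an assumption of this sketch; the library's boundary handling may differ:

```python
import numpy as np

def crps_quantile_sketch(y_true, quantile_matrix, quantile_levels):
    # Trapezoidal quadrature weights w_j = (tau_{j+1} - tau_{j-1}) / 2.
    tau = np.asarray(quantile_levels, dtype=float)
    padded = np.concatenate(([0.0], tau, [1.0]))
    w = (padded[2:] - padded[:-2]) / 2.0
    y = np.asarray(y_true, dtype=float)[:, None]      # shape (n, 1)
    q = np.asarray(quantile_matrix, dtype=float)      # shape (n, k)
    # Quantile (pinball) score per level: (1(y < q) - tau) * (q - y).
    qs = ((y < q).astype(float) - tau) * (q - y)
    return float((2.0 * qs * w).sum(axis=1).mean())

levels = np.linspace(0.05, 0.95, 19)
y = np.zeros(100)
# Logistic quantile function centered at 0 vs. the same forecast shifted by 2.
centered = np.tile(np.log(levels / (1 - levels)), (100, 1))
shifted = centered + 2.0
print(crps_quantile_sketch(y, centered, levels),
      crps_quantile_sketch(y, shifted, levels))   # centered forecast scores lower
```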
### `crps_score(y_true, lower, upper, confidence=0.9)`
Approximate CRPS from a prediction interval (Gaussian assumption).
.. deprecated::
    Use :func:`crps_quantile` or ``DistributionPrediction.crps(y_true)``
    instead. This function will be removed in v0.3.0.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `lower` | Series \| ndarray | Lower bound of prediction interval | required |
| `upper` | Series \| ndarray | Upper bound of prediction interval | required |
| `confidence` | float | Confidence level for the interval | 0.9 |

Returns:

| Type | Description |
|---|---|
| float | Approximate CRPS score (float). Lower is better. |
Source code in uncertainty_flow/metrics/crps.py
### `log_score_kde(y_true, quantile_matrix, quantile_levels, n_draw=500, random_state=None)`
Non-parametric log-score via kernel density estimation.
Draws samples from the piecewise-linear CDF defined by the quantile knots, fits a Gaussian KDE, and evaluates the log-density.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | ndarray | (n,) array of true values. | required |
| `quantile_matrix` | ndarray | (n, k) array of predicted quantile values. | required |
| `quantile_levels` | ndarray | (k,) array of quantile levels. | required |
| `n_draw` | int | Samples per observation for KDE fitting. | 500 |
| `random_state` | int \| None | Random seed. | None |

Returns:

| Type | Description |
|---|---|
| float | Mean log-score (float). |
Source code in uncertainty_flow/metrics/log_score.py
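The sample-then-smooth idea can be sketched with a hand-rolled Gaussian KDE. The Scott's-rule bandwidth is an assumption of this sketch; the library's bandwidth choice may differ:

```python
import numpy as np

def log_score_kde_sketch(y_true, quantile_matrix, quantile_levels,
                         n_draw=500, random_state=None):
    rng = np.random.default_rng(random_state)
    tau = np.asarray(quantile_levels, dtype=float)
    scores = []
    for y, q in zip(np.asarray(y_true), np.asarray(quantile_matrix)):
        # Inverse-CDF sampling from the piecewise-linear CDF through the knots.
        u = rng.uniform(tau[0], tau[-1], n_draw)
        draws = np.interp(u, tau, q)
        # Gaussian KDE with a Scott's-rule bandwidth (assumption).
        h = draws.std(ddof=1) * n_draw ** (-1 / 5)
        dens = np.exp(-0.5 * ((y - draws) / h) ** 2).mean() / (h * np.sqrt(2 * np.pi))
        scores.append(np.log(dens + 1e-300))
    return float(np.mean(scores))

levels = np.linspace(0.05, 0.95, 19)
q = np.log(levels / (1 - levels))[None, :]   # logistic forecast centered at 0
s_center = log_score_kde_sketch(np.array([0.0]), q, levels, random_state=0)
s_tail = log_score_kde_sketch(np.array([5.0]), q, levels, random_state=0)
print(s_center, s_tail)   # observation near the center gets a higher log-score
```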
### `log_score_pooled(y_true, quantile_matrix, quantile_levels, family='auto')`
Mean log-score using one pooled distribution fitted from mean quantiles.
This helper keeps the previous behavior for backward comparisons.
Source code in uncertainty_flow/metrics/log_score.py
### `energy_score(pred, y_true, n_samples=1000, random_state=None)`
Energy score — a proper multivariate scoring rule.
.. math::
\text{ES} = \mathbb{E}[\|X - y\|] - \frac{1}{2} \mathbb{E}[\|X - X'\|]
where :math:`X, X'` are independent draws from the forecast distribution.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pred` | DistributionPrediction | A `DistributionPrediction` object. | required |
| `y_true` |  | True values (DataFrame / array matching targets). | required |
| `n_samples` | int | Monte Carlo samples per observation. | 1000 |
| `random_state` | int \| None | Random seed. | None |

Returns:

| Type | Description |
|---|---|
| float | Mean energy score (float). Lower is better. |
Source code in uncertainty_flow/metrics/multivariate.py
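The two expectations in the definition are estimated by Monte Carlo. A minimal sketch working on raw forecast draws rather than a `DistributionPrediction` (the shift pairing for E‖X − X′‖ is an assumption of this sketch):

```python
import numpy as np

def energy_score_sketch(samples, y_true):
    """samples: (m, d) draws from the forecast; y_true: (d,) observation."""
    # E||X - y||: mean distance from each draw to the observation.
    term1 = np.linalg.norm(samples - y_true, axis=1).mean()
    # E||X - X'||: estimated by pairing each draw with the next one.
    term2 = np.linalg.norm(samples - np.roll(samples, 1, axis=0), axis=1).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(0)
y = np.array([0.0, 0.0])
good = rng.normal(loc=0.0, scale=1.0, size=(2000, 2))   # centered on y
bad = good + 3.0                                        # biased forecast
print(energy_score_sketch(good, y), energy_score_sketch(bad, y))
```

The biased forecast receives a higher (worse) score while both share the same internal spread, illustrating that the score rewards calibration of location as well as dispersion.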
### `variogram_score(pred, y_true, n_samples=1000, p=0.5, random_state=None)`
Variogram score — sensitive to correlation structure.
.. math::
\text{VS}_p = \sum_{i \neq j} w_{ij}
\left(|y_i - y_j|^p - \mathbb{E}[|X_i - X_j|^p]\right)^2
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pred` | DistributionPrediction | A `DistributionPrediction` object. | required |
| `y_true` |  | True values. | required |
| `n_samples` | int | Monte Carlo samples per observation. | 1000 |
| `p` | float | Power parameter (default 0.5). | 0.5 |
| `random_state` | int \| None | Random seed. | None |

Returns:

| Type | Description |
|---|---|
| float | Mean variogram score (float). Lower is better. |
Source code in uncertainty_flow/metrics/multivariate.py
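The double sum over target pairs can be sketched directly on forecast draws; unit weights w_ij = 1 are an assumption of this sketch:

```python
import numpy as np

def variogram_score_sketch(samples, y_true, p=0.5):
    """samples: (m, d) forecast draws; y_true: (d,) observation; unit weights."""
    d = len(y_true)
    score = 0.0
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            # Observed vs. expected p-th power pairwise differences.
            obs = abs(y_true[i] - y_true[j]) ** p
            exp = (np.abs(samples[:, i] - samples[:, j]) ** p).mean()
            score += (obs - exp) ** 2
    return score

rng = np.random.default_rng(0)
y = np.array([0.0, 1.0])
# A forecast that reproduces the spread between the two targets...
good = rng.normal(loc=[0.0, 1.0], scale=0.1, size=(2000, 2))
# ...vs. one that collapses both targets to the same location.
bad = rng.normal(loc=[0.5, 0.5], scale=0.1, size=(2000, 2))
print(variogram_score_sketch(good, y), variogram_score_sketch(bad, y))
```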
### `pinball_loss(y_true, y_pred, quantile)`
Quantile loss (pinball loss).
For quantile q, the loss is max(q * (y_true - y_pred), (q - 1) * (y_true - y_pred)). It penalizes over-prediction and under-prediction asymmetrically.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `y_pred` | Series \| ndarray | Predicted values | required |
| `quantile` | float | Quantile level (e.g., 0.9 for 90th percentile) | required |

Returns:

| Type | Description |
|---|---|
| float | Mean loss across all samples (float) |

Raises:

| Type | Description |
|---|---|
| ValueError | If quantile is not in (0, 1) |

Examples:

```python
>>> import polars as pl
>>> y_true = pl.Series([1, 2, 3, 4, 5])
>>> y_pred = pl.Series([1.5, 2.5, 2.5, 4.5, 4.5])
>>> pinball_loss(y_true, y_pred, 0.5)
0.25
```
Source code in uncertainty_flow/metrics/pinball.py
### `mae_score(y_true, y_pred)`
Mean Absolute Error.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `y_pred` | Series \| ndarray | Predicted values (point predictions, e.g. median) | required |

Returns:

| Type | Description |
|---|---|
| float | Mean absolute error (float). Lower is better. |
Source code in uncertainty_flow/metrics/point.py
### `rmse_score(y_true, y_pred)`
Root Mean Squared Error.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `y_pred` | Series \| ndarray | Predicted values (point predictions, e.g. median) | required |

Returns:

| Type | Description |
|---|---|
| float | Root mean squared error (float). Lower is better. |
Source code in uncertainty_flow/metrics/point.py
### `winkler_score(y_true, lower, upper, confidence)`
Winkler score for prediction intervals.
Penalizes:

- Interval width (wider intervals = higher penalty)
- Misses (if y_true falls outside the interval, a penalty proportional to the distance)
Lower is better.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | Series \| ndarray | True values | required |
| `lower` | Series \| ndarray | Lower bound of prediction interval | required |
| `upper` | Series \| ndarray | Upper bound of prediction interval | required |
| `confidence` | float | Confidence level (e.g., 0.9 for 90% interval) | required |

Returns:

| Type | Description |
|---|---|
| float | Mean Winkler score across all samples (float) |

Raises:

| Type | Description |
|---|---|
| ValueError | If confidence is not in (0, 1) or if bounds are invalid |

Examples:

```python
>>> import polars as pl
>>> y_true = pl.Series([1, 2, 3, 4, 5])
>>> lower = pl.Series([0.5, 1.5, 2.5, 3.5, 4.5])
>>> upper = pl.Series([1.5, 2.5, 3.5, 4.5, 5.5])
>>> winkler_score(y_true, lower, upper, 0.9)
1.0
```
Source code in uncertainty_flow/metrics/winkler.py
### `score(pred, y_true, metric, **kwargs)`
Unified metric entry point.
Dispatches to the correct metric function based on metric name,
extracting the right inputs from the DistributionPrediction object.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pred` | DistributionPrediction | A `DistributionPrediction` object. | required |
| `y_true` |  | True values (Polars Series/DataFrame or numpy array). | required |
| `metric` | str \| Callable | One of the supported metric names, or a custom callable. | required |
| `**kwargs` |  | Extra keyword arguments forwarded to the underlying metric. | {} |

Returns:

| Type | Description |
|---|---|
| float \| dict[str, float] | Scalar for univariate targets, or a dict of per-target scores for multivariate targets. |
Source code in uncertainty_flow/metrics/__init__.py