Novelty Evaluation¶
novelentitymatcher.novelty.evaluation.evaluator¶
Unified novelty detection evaluator.
Supports both benchmark and research evaluation modes with comprehensive metrics and reporting.
Classes¶
NoveltyEvaluator(mode='benchmark', metrics=None)¶
Unified evaluator for novelty detection.
Supports two modes:

- benchmark: Quick evaluation on OOD splits with core metrics
- research: Comprehensive evaluation with confusion matrices and threshold sweeping

Metrics computed:

- AUROC, AUPRC
- Detection rates at 1%, 5%, 10% FPR
- Precision, Recall, F1 at optimal threshold

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mode | Literal['benchmark', 'research'] | Evaluation mode ('benchmark' or 'research') | 'benchmark' |
| metrics | list[str] \| None | List of metrics to compute (None for default based on mode) | None |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
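A minimal construction sketch, assuming the class is imported from the module path shown above; the metric names passed to metrics are illustrative placeholders, not a documented list:

```python
from novelentitymatcher.novelty.evaluation.evaluator import NoveltyEvaluator

# Quick benchmark-mode evaluation with the default metric set
benchmark_evaluator = NoveltyEvaluator(mode="benchmark")

# Comprehensive research-mode evaluation restricted to a custom metric list
# (the metric names below are illustrative, not a verified list)
research_evaluator = NoveltyEvaluator(mode="research", metrics=["auroc", "auprc"])
```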
Functions¶
evaluate(novelty_scores, is_novel_true, threshold=None)¶
Evaluate novelty detection performance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| novelty_scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| is_novel_true | ndarray | Ground truth novelty labels (True = novel) | required |
| threshold | float \| None | Optional threshold for discrete predictions | None |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dictionary of metric name -> value |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
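A usage sketch on synthetic arrays; the exact keys of the returned dictionary depend on the configured metrics, so the example simply prints whatever comes back:

```python
import numpy as np

from novelentitymatcher.novelty.evaluation.evaluator import NoveltyEvaluator

rng = np.random.default_rng(0)

# Synthetic ground truth: 80 known entities followed by 20 novel ones
is_novel_true = np.array([False] * 80 + [True] * 20)

# Synthetic scores where novel items tend to score higher
novelty_scores = np.where(
    is_novel_true,
    rng.normal(0.7, 0.1, size=100),
    rng.normal(0.3, 0.1, size=100),
)

evaluator = NoveltyEvaluator(mode="benchmark")
metrics = evaluator.evaluate(novelty_scores, is_novel_true)  # threshold=None
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```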
create_report(novelty_scores, is_novel_true, threshold=None)¶
Create a comprehensive evaluation report.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| novelty_scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| is_novel_true | ndarray | Ground truth novelty labels (True = novel) | required |
| threshold | float \| None | Optional threshold for discrete predictions | None |

Returns:

| Type | Description |
|---|---|
| EvaluationReport | EvaluationReport with all metrics |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
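Continuing the evaluate sketch above, a report can be built from the same arrays; the attributes of EvaluationReport are not listed on this page, so the example only prints the object:

```python
# Reuses evaluator, novelty_scores, and is_novel_true from the evaluate sketch
report = evaluator.create_report(novelty_scores, is_novel_true, threshold=0.5)
print(report)  # EvaluationReport holding all computed metrics
```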
sweep_thresholds(novelty_scores, is_novel_true, num_thresholds=100)¶
Sweep across thresholds and compute metrics at each.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| novelty_scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| is_novel_true | ndarray | Ground truth novelty labels (True = novel) | required |
| num_thresholds | int | Number of thresholds to evaluate | 100 |

Returns:

| Type | Description |
|---|---|
| dict[str, ndarray] | Dict with arrays for thresholds and metrics |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
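A sketch of a coarse sweep, again reusing the arrays from the evaluate sketch; the "thresholds" key is assumed from the return description, and the remaining keys are per-metric arrays:

```python
# Reuses evaluator, novelty_scores, and is_novel_true from the evaluate sketch
sweep = evaluator.sweep_thresholds(novelty_scores, is_novel_true, num_thresholds=50)
print(sorted(sweep.keys()))       # "thresholds" plus one array per metric
print(sweep["thresholds"].shape)  # (50,)
```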
compare_thresholds(novelty_scores, is_novel_true, thresholds)¶
Compare metrics at specific thresholds.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| novelty_scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| is_novel_true | ndarray | Ground truth novelty labels (True = novel) | required |
| thresholds | list[float] | List of thresholds to evaluate | required |

Returns:

| Type | Description |
|---|---|
| list[dict[str, float]] | List of dicts with metrics at each threshold |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
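A short sketch comparing a handful of hand-picked thresholds, reusing the arrays from the evaluate sketch; the keys of each per-threshold dictionary are not enumerated on this page:

```python
# Reuses evaluator, novelty_scores, and is_novel_true from the evaluate sketch
candidate_thresholds = [0.3, 0.5, 0.7]
rows = evaluator.compare_thresholds(novelty_scores, is_novel_true, candidate_thresholds)
for threshold, row in zip(candidate_thresholds, rows):
    print(threshold, row)  # one metrics dict per threshold
```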
novelentitymatcher.novelty.evaluation.metrics¶
Metric computations for novelty detection evaluation.
Provides functions for computing AUROC, AUPRC, detection rates, precision, recall, F1, and confusion matrices.
Functions¶
compute_auroc(scores, labels)¶
Compute Area Under ROC Curve.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |

Returns:

| Type | Description |
|---|---|
| float | AUROC score (0-1, 0.5 = random) |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
compute_auprc(scores, labels)¶
Compute Area Under Precision-Recall Curve.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |

Returns:

| Type | Description |
|---|---|
| float | AUPRC score (0-1) |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
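A minimal sketch exercising both ranking metrics on a tiny hand-made array, assuming the functions are importable from the module path shown above:

```python
import numpy as np

from novelentitymatcher.novelty.evaluation.metrics import compute_auprc, compute_auroc

labels = np.array([False, False, False, True, True])  # last two items are novel
scores = np.array([0.10, 0.20, 0.35, 0.80, 0.90])     # novel items score higher

print(compute_auroc(scores, labels))  # 1.0, since the scores separate the classes perfectly
print(compute_auprc(scores, labels))  # 1.0 for the same reason
```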
compute_detection_rates(scores, labels, fpr_thresholds=(0.01, 0.05, 0.1))¶
Compute detection rates at specific false positive rates.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |
| fpr_thresholds | tuple[float, ...] | FPR values to compute detection rates for | (0.01, 0.05, 0.1) |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dict mapping fpr_percentage -> detection_rate (e.g., "detection_rate_1" -> 0.95 for 1% FPR) |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
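A sketch on a larger synthetic sample, since a detection rate at 1% FPR is only meaningful with enough negatives; only "detection_rate_1" is named on this page, and the other key names follow the same pattern by assumption:

```python
import numpy as np

from novelentitymatcher.novelty.evaluation.metrics import compute_detection_rates

rng = np.random.default_rng(1)
labels = np.array([False] * 900 + [True] * 100)
scores = np.concatenate([rng.normal(0.3, 0.1, 900), rng.normal(0.7, 0.1, 100)])

rates = compute_detection_rates(scores, labels, fpr_thresholds=(0.01, 0.05, 0.1))
print(rates)  # e.g. {"detection_rate_1": ..., "detection_rate_5": ..., "detection_rate_10": ...}
```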
compute_precision_recall_f1(scores, labels, threshold=None)¶
Compute precision, recall, and F1 score.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |
| threshold | float \| None | Decision threshold (if None, finds optimal) | None |

Returns:

| Type | Description |
|---|---|
| dict[str, float] | Dict with precision, recall, f1, and threshold |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
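A quick sketch; with threshold=None the function picks the F1-optimal threshold itself, per the parameter description:

```python
import numpy as np

from novelentitymatcher.novelty.evaluation.metrics import compute_precision_recall_f1

labels = np.array([False, False, True, True, True])
scores = np.array([0.20, 0.60, 0.55, 0.80, 0.90])

result = compute_precision_recall_f1(scores, labels)  # threshold=None -> F1-optimal threshold
print(result)  # dict with precision, recall, f1, and the threshold that was used
```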
find_optimal_threshold(scores, labels)¶
Find threshold that maximizes F1 score.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |

Returns:

| Type | Description |
|---|---|
| float | Optimal threshold value |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
compute_confusion_matrix(scores, labels, threshold)¶
Compute confusion matrix components.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |
| threshold | float | Decision threshold | required |

Returns:

| Type | Description |
|---|---|
| dict[str, int] | Dict with tp, tn, fp, fn counts |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
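The two functions combine naturally: find the F1-maximizing threshold, then inspect the confusion counts it produces. A sketch under the same small-array assumptions as above:

```python
import numpy as np

from novelentitymatcher.novelty.evaluation.metrics import (
    compute_confusion_matrix,
    find_optimal_threshold,
)

labels = np.array([False, False, True, True, True])
scores = np.array([0.20, 0.60, 0.55, 0.80, 0.90])

threshold = find_optimal_threshold(scores, labels)  # F1-maximizing threshold
counts = compute_confusion_matrix(scores, labels, threshold)
print(threshold, counts)  # counts is a dict with tp, tn, fp, fn
```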
sweep_thresholds(scores, labels, thresholds=None)¶
Sweep across thresholds and compute metrics at each.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| scores | ndarray | Predicted novelty scores (higher = more novel) | required |
| labels | ndarray | Ground truth labels (True = novel) | required |
| thresholds | ndarray \| None | Array of thresholds to sweep (default: 0-100) | None |

Returns:

| Type | Description |
|---|---|
| dict[str, ndarray] | Dict with arrays for thresholds, precision, recall, f1, tp, fp, tn, fn |
Source code in src/novelentitymatcher/novelty/evaluation/metrics.py
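Because the default sweep covers 0-100, scores that live in [0, 1] are better served by an explicit thresholds array. A sketch using the documented return keys:

```python
import numpy as np

from novelentitymatcher.novelty.evaluation.metrics import sweep_thresholds

labels = np.array([False, False, True, True, True])
scores = np.array([0.20, 0.60, 0.55, 0.80, 0.90])

# Explicit thresholds matched to 0-1 scores instead of the 0-100 default
curves = sweep_thresholds(scores, labels, thresholds=np.linspace(0.0, 1.0, 21))
print(curves["thresholds"].shape, curves["f1"].shape)  # (21,) (21,)
```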
novelentitymatcher.novelty.evaluation.splitters¶
Data splitters for novelty detection evaluation.
Provides utilities for creating OOD (Out-of-Distribution) splits and gradual novelty scenarios for testing.
Classes¶
OODSplitter(known_ratio=0.8, random_state=42)¶
Creates OOD (Out-of-Distribution) splits for novelty detection evaluation.
Splits data into known classes and unknown/novel classes to simulate the novelty detection scenario.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| known_ratio | float | Fraction of classes to keep as known (0-1) | 0.8 |
| random_state | int | Random seed for reproducibility | 42 |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
Functions¶
create_split(texts, labels)¶
Create OOD train/test split.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list[str] | List of input texts | required |
| labels | list[str] | List of corresponding labels | required |

Returns:

| Type | Description |
|---|---|
| list[str] | Tuple of (train_texts, train_labels, test_texts, test_is_novel) |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
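A small end-to-end sketch; the texts, labels, and known_ratio are illustrative, and the unpacking order follows the tuple described above:

```python
from novelentitymatcher.novelty.evaluation.splitters import OODSplitter

texts = ["aspirin", "ibuprofen", "paris", "london", "tesla", "toyota"]
labels = ["drug", "drug", "city", "city", "car", "car"]

# Keep roughly two thirds of the classes as known; held-out classes become "novel"
splitter = OODSplitter(known_ratio=0.66, random_state=42)
train_texts, train_labels, test_texts, test_is_novel = splitter.create_split(texts, labels)

print(len(train_texts), len(test_texts))
print(test_is_novel)  # True for test items whose class was held out
```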
create_split_with_indices(texts, labels)¶
Create OOD split with additional metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list[str] | List of input texts | required |
| labels | list[str] | List of corresponding labels | required |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Dict with split data and metadata |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
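Continuing the OODSplitter sketch above; the dictionary keys are not enumerated on this page, so the example just lists them:

```python
# Reuses splitter, texts, and labels from the OODSplitter sketch
split = splitter.create_split_with_indices(texts, labels)
print(sorted(split.keys()))  # split data plus metadata about the known/novel classes
```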
GradualNoveltySplitter(known_ratios=None, random_state=42)¶
Creates multiple splits with gradually increasing novelty.
Useful for testing how novelty detection performance degrades as the number of novel classes increases.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| known_ratios | list[float] \| None | List of known ratios to create splits for | None |
| random_state | int | Random seed for reproducibility | 42 |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
Functions¶
create_splits(texts, labels)¶
Create multiple splits with different novelty levels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list[str] | List of input texts | required |
| labels | list[str] | List of corresponding labels | required |

Returns:

| Type | Description |
|---|---|
| list[dict[str, Any]] | List of split dictionaries, one per known_ratio |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
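A sketch that sweeps three illustrative known ratios over toy data; the structure of each split dictionary is not detailed on this page:

```python
from novelentitymatcher.novelty.evaluation.splitters import GradualNoveltySplitter

texts = ["aspirin", "ibuprofen", "paris", "london", "tesla", "toyota", "mars", "venus"]
labels = ["drug", "drug", "city", "city", "car", "car", "planet", "planet"]

# Sweep from mostly-known to mostly-novel class mixes (ratios are illustrative)
gradual = GradualNoveltySplitter(known_ratios=[0.75, 0.5, 0.25], random_state=42)
splits = gradual.create_splits(texts, labels)

print(len(splits))  # one split dictionary per known_ratio
```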
get_novelty_progression(texts, labels)¶
Get summary of novelty progression across splits.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list[str] | List of input texts | required |
| labels | list[str] | List of corresponding labels | required |

Returns:

| Type | Description |
|---|---|
| dict[str, list] | Dict with arrays for known_ratio, n_known, n_novel |
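Continuing the GradualNoveltySplitter sketch, the progression summary exposes the documented known_ratio, n_known, and n_novel keys:

```python
# Reuses gradual, texts, and labels from the GradualNoveltySplitter sketch
progression = gradual.get_novelty_progression(texts, labels)
print(progression["known_ratio"], progression["n_known"], progression["n_novel"])
```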