Novelty Strategies

novelentitymatcher.novelty.strategies.base

Base protocol for novelty detection strategies.

All strategies must implement this protocol to be compatible with the NoveltyDetector.

Classes

NoveltyStrategy

Bases: ABC

Base protocol for all novelty detection strategies.

Each strategy is responsible for:

1. Initializing with reference embeddings and labels
2. Detecting novel samples from a batch of inputs
3. Providing per-sample metrics for signal combination
4. Specifying its weight for signal fusion

Attributes
config_schema abstractmethod property

Return the config dataclass type for this strategy.

This is used for validation and defaults.

Functions
initialize(reference_embeddings, reference_labels, config) abstractmethod

Initialize strategy with reference data.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples. Required.
- reference_labels (list[str]): Class labels for known samples. Required.
- config (Any): Strategy-specific configuration object. Required.
Source code in src/novelentitymatcher/novelty/strategies/base.py
@abstractmethod
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: Any,
) -> None:
    """
    Initialize strategy with reference data.

    Args:
        reference_embeddings: Embeddings of known samples
        reference_labels: Class labels for known samples
        config: Strategy-specific configuration object
    """
detect(texts, embeddings, predicted_classes, confidences, **kwargs) abstractmethod

Detect novel samples.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted class for each sample. Required.
- confidences (ndarray): Prediction confidence scores. Required.
- **kwargs: Additional strategy-specific parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - flagged indices and per-sample metrics
  - flags: Set of indices flagged as novel
  - metrics: Dict mapping index to metric dict
Source code in src/novelentitymatcher/novelty/strategies/base.py
@abstractmethod
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted class for each sample
        confidences: Prediction confidence scores
        **kwargs: Additional strategy-specific parameters

    Returns:
        (flags, metrics) - flagged indices and per-sample metrics
        - flags: Set of indices flagged as novel
        - metrics: Dict mapping index to metric dict
    """
get_weight() abstractmethod

Return weight for signal combination.

This weight determines how much this strategy contributes to the final novelty score.

Source code in src/novelentitymatcher/novelty/strategies/base.py
@abstractmethod
def get_weight(self) -> float:
    """
    Return weight for signal combination.

    This weight determines how much this strategy contributes
    to the final novelty score.
    """
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)
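
Putting the protocol together, a minimal conforming strategy might look like the sketch below. It is illustrative only: ThresholdConfig and the mean-distance heuristic are hypothetical, not part of this package.

from dataclasses import dataclass

import numpy as np

from novelentitymatcher.novelty.strategies.base import NoveltyStrategy


@dataclass
class ThresholdConfig:  # hypothetical config dataclass for this sketch
    threshold: float = 0.5


class MeanDistanceStrategy(NoveltyStrategy):
    """Toy strategy: flag samples far from the global reference centroid."""

    @property
    def config_schema(self):
        return ThresholdConfig

    def initialize(self, reference_embeddings, reference_labels, config) -> None:
        self._config = config or ThresholdConfig()
        self._mean = reference_embeddings.mean(axis=0)

    def detect(self, texts, embeddings, predicted_classes, confidences, **kwargs):
        distances = np.linalg.norm(embeddings - self._mean, axis=1)
        flags = {i for i, d in enumerate(distances) if d > self._config.threshold}
        metrics = {i: {"mean_distance": float(d)} for i, d in enumerate(distances)}
        return flags, metrics

    def get_weight(self) -> float:
        return 0.1  # low weight: this is a coarse signal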

novelentitymatcher.novelty.strategies.knn_distance

kNN distance-based novelty detection strategy.

Flags samples based on their distance to k-nearest neighbors in the reference set.

Classes

KNNDistanceStrategy()

Bases: NoveltyStrategy

kNN distance strategy for novelty detection.

Flags samples as novel if their average distance to k-nearest neighbors in the reference set exceeds a threshold.

Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def __init__(self):
    self._config: KNNConfig | None = None
    self._ann_index: ANNIndex | None = None
Attributes
config_schema property

Return KNNConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the kNN strategy with reference data.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples. Required.
- reference_labels (list[str]): Labels of known samples. Required.
- config (KNNConfig): KNNConfig with k, thresholds, and metric. Required.
Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: KNNConfig,
) -> None:
    """
    Initialize the kNN strategy with reference data.

    Args:
        reference_embeddings: Embeddings of known samples
        reference_labels: Labels of known samples
        config: KNNConfig with k, thresholds, and metric
    """
    self._config = config or KNNConfig()

    # Initialize ANN index
    self._ann_index = ANNIndex(
        dim=reference_embeddings.shape[1],
        max_elements=len(reference_labels),
    )
    self._ann_index.add_vectors(reference_embeddings, reference_labels)
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using kNN distance.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted classes. Required.
- confidences (ndarray): Prediction confidences. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using kNN distance.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted classes
        confidences: Prediction confidences
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    k = min(self._config.k, self._ann_index.n_elements)

    # Query kNN
    similarities, neighbor_indices = self._ann_index.knn_query(embeddings, k=k)

    flags = set()
    metrics = {}

    for idx in range(len(embeddings)):
        metric = self._compute_knn_metrics(
            idx,
            similarities[idx],
            neighbor_indices[idx],
            predicted_classes[idx],
        )
        metrics[idx] = metric

        # Check if novelty score exceeds threshold
        if metric["knn_novelty_score"] >= self._config.distance_threshold:
            flags.add(idx)

    return flags, metrics
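
For orientation, a usage sketch follows. KNNConfig's import path and constructor kwargs are assumptions inferred from the attribute accesses above (self._config.k, self._config.distance_threshold).

import numpy as np

# KNNConfig's location and kwargs are assumptions; KNNDistanceStrategy's
# module path matches the source path shown above.
from novelentitymatcher.novelty.strategies.knn_distance import (
    KNNConfig,
    KNNDistanceStrategy,
)

rng = np.random.default_rng(0)
reference = rng.random((100, 384), dtype=np.float32)

strategy = KNNDistanceStrategy()
strategy.initialize(
    reference_embeddings=reference,
    reference_labels=["known"] * 100,
    config=KNNConfig(k=5, distance_threshold=0.6),
)
flags, metrics = strategy.detect(
    texts=["a query"],
    embeddings=rng.random((1, 384), dtype=np.float32),
    predicted_classes=["known"],
    confidences=np.array([0.9]),
)
print(flags, metrics[0]["knn_novelty_score"])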
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # kNN is a strong signal, give it high weight
    return 0.45
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.clustering

Clustering-based novelty detection strategy.

Flags samples that form small, isolated clusters or don't fit well into any existing cluster.

Classes

ClusteringStrategy()

Bases: NoveltyStrategy

Clustering-based strategy for novelty detection.

Uses HDBSCAN to cluster samples and flags as novel those that fall into small or low-cohesion clusters, or into noise.

Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def __init__(self):
    self._config: ClusteringConfig | None = None
    self._clusterer: ScalableClusterer | None = None
    self._validator: ClusterValidator | None = None
    self._reference_embeddings: np.ndarray | None = None
    self._reference_labels: list[str] | None = None
Attributes
config_schema property

Return ClusteringConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the clustering strategy.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples. Required.
- reference_labels (list[str]): Labels of known samples. Required.
- config (ClusteringConfig): ClusteringConfig with thresholds. Required.
Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: ClusteringConfig,
) -> None:
    """
    Initialize the clustering strategy.

    Args:
        reference_embeddings: Embeddings of known samples
        reference_labels: Labels of known samples
        config: ClusteringConfig with thresholds
    """
    self._config = config or ClusteringConfig()
    self._reference_embeddings = reference_embeddings
    self._reference_labels = reference_labels

    # Initialize clusterer
    self._clusterer = ScalableClusterer(
        min_cluster_size=self._config.hdbscan_min_cluster_size,
        min_samples=self._config.hdbscan_min_samples,
        cluster_selection_epsilon=self._config.cluster_selection_epsilon,
    )

    # Initialize validator
    self._validator = ClusterValidator(
        min_cohesion_threshold=self._config.cohesion_threshold,
        min_persistence_threshold=self._config.persistence_threshold,
    )
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using clustering.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted classes. Required.
- confidences (ndarray): Prediction confidences. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using clustering.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted classes
        confidences: Prediction confidences
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    # Combine reference and query embeddings for clustering
    all_embeddings = np.vstack([self._reference_embeddings, embeddings])

    # Fit clusterer on all embeddings
    self._clusterer.fit(all_embeddings)

    # Get cluster labels
    labels = self._clusterer.labels

    # Separate query labels (reference samples come first)
    query_labels = labels[len(self._reference_embeddings) :]

    flags = set()
    metrics = {}

    # Validate clusters and identify novel samples
    unique_labels = np.unique(query_labels)

    for label in unique_labels:
        if label == -1:  # Noise points
            # All noise points are novel
            mask = query_labels == label
            indices = np.where(mask)[0]
            for idx in indices:
                flags.add(idx)
                metrics[idx] = {
                    "cluster_label": -1,
                    "cluster_support_score": 0.0,
                    "cluster_is_novel": True,
                    "cluster_size": 1,
                }
        else:
            # Check if cluster is valid
            # Get all embeddings with this label (including reference)
            all_mask = labels == label
            _cluster_embeddings = all_embeddings[all_mask]

            is_valid = self._validator.is_valid_cluster(
                all_embeddings,
                labels,
                label,
                min_size=self._config.min_cluster_size,
            )

            # Compute support score (1 - cohesion)
            cohesion = self._validator.compute_cohesion(
                all_embeddings, labels, label
            )
            support_score = 1.0 - cohesion

            # Get query indices for this cluster
            query_mask = query_labels == label
            query_indices = np.where(query_mask)[0]

            for idx in query_indices:
                # Novel if cluster is invalid or support score is low
                is_novel = not is_valid or support_score < (
                    1.0 - self._config.cohesion_threshold
                )

                if is_novel:
                    flags.add(idx)

                metrics[idx] = {
                    "cluster_label": int(label),
                    "cluster_support_score": support_score,
                    "cluster_is_novel": is_novel,
                    "cluster_size": int(np.sum(all_mask)),
                    "cluster_cohesion": cohesion,
                }

    return flags, metrics
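
The core mechanism above — clustering reference and query embeddings together and treating HDBSCAN noise points as novel — can be reproduced standalone. A simplified sketch with the hdbscan library (ScalableClusterer and ClusterValidator add cohesion and persistence checks on top of this; parameter values here are arbitrary, not the package defaults):

import hdbscan
import numpy as np

rng = np.random.default_rng(0)
reference = rng.random((200, 32))
queries = rng.random((10, 32))
all_points = np.vstack([reference, queries])

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3)
labels = clusterer.fit_predict(all_points)

query_labels = labels[len(reference):]  # reference samples come first
noise_novel = {int(i) for i, lbl in enumerate(query_labels) if lbl == -1}
print("noise-point novel indices:", noise_novel)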
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # Clustering provides complementary signal
    return 0.2
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.pattern

Pattern-based novelty detection strategy wrapper.

Wraps PatternScorer to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.oneclass

One-Class SVM novelty detection strategy wrapper.

Wraps OneClassSVMDetector to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.setfit

SetFit contrastive novelty detection strategy wrapper.

Wraps SetFitDetector to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.setfit_centroid

SetFit centroid distance novelty detection strategy.

Computes minimum cosine distance from each query to known class centroids in the SetFit fine-tuned embedding space. Produces continuous novelty scores.

This is the recommended strategy when full SetFit training is used in Phase 1, since contrastive learning creates tight, well-separated class clusters.

Classes

SetFitCentroidStrategy()

Bases: NoveltyStrategy

Centroid distance strategy using SetFit fine-tuned embeddings.

For each known class, computes a centroid in the SetFit embedding space. Novelty score = minimum cosine distance from query to any centroid.

Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def __init__(self) -> None:
    self._config: SetFitCentroidConfig | None = None
    self._centroids: np.ndarray | None = None
    self._class_labels: list[str] | None = None
    self._threshold: float | None = None
    self._setfit_model: Any | None = None
Functions
initialize(reference_embeddings, reference_labels, config)

Initialize centroids from reference embeddings.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (already from SetFit model). Required.
- reference_labels (list[str]): Class labels for known samples. Required.
- config (SetFitCentroidConfig): SetFitCentroidConfig with threshold. Required.
Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: SetFitCentroidConfig,
) -> None:
    """
    Initialize centroids from reference embeddings.

    Args:
        reference_embeddings: Embeddings of known samples (already from SetFit model)
        reference_labels: Class labels for known samples
        config: SetFitCentroidConfig with threshold
    """
    self._config = config or SetFitCentroidConfig()
    self._class_labels = list(set(reference_labels))

    # Compute per-class centroids
    centroids = {}
    for label in self._class_labels:
        mask = np.array(reference_labels) == label
        class_embeddings = reference_embeddings[mask]
        if len(class_embeddings) > 0:
            centroids[label] = np.mean(class_embeddings, axis=0)

    # Sort centroids by class label for consistent indexing
    self._centroids = np.array(
        [centroids[label] for label in sorted(centroids.keys())]
    )
    self._class_labels = sorted(centroids.keys())

    # Calibrate threshold from reference set if not explicitly set
    if self._config.threshold is None:
        self._threshold = self._calibrate_threshold(
            reference_embeddings, reference_labels
        )
    else:
        self._threshold = self._config.threshold
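
The _calibrate_threshold helper is not reproduced on this page. A plausible sketch, offered purely as an assumption about how such a calibration could work (e.g. a high percentile of in-class centroid distances):

import numpy as np

# Hypothetical sketch; the actual _calibrate_threshold implementation is
# not shown in this reference, and the percentile choice is arbitrary.
def calibrate_threshold(embeddings, labels, centroids, class_labels, pct=95.0):
    """Pick the threshold as a high percentile of in-class centroid distances."""
    label_to_idx = {lbl: i for i, lbl in enumerate(class_labels)}

    def normalize(x):
        return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

    emb_n, cent_n = normalize(embeddings), normalize(centroids)
    distances = [
        1.0 - float(emb_n[i] @ cent_n[label_to_idx[lbl]])
        for i, lbl in enumerate(labels)
    ]
    return float(np.percentile(distances, pct))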
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using centroid distance.

Parameters:

- texts (list[str]): Input texts (unused, embeddings are pre-computed). Required.
- embeddings (ndarray): Query embeddings. Required.
- predicted_classes (list[str]): Predicted class for each sample. Required.
- confidences (ndarray): Prediction confidence scores. Required.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using centroid distance.

    Args:
        texts: Input texts (unused, embeddings are pre-computed)
        embeddings: Query embeddings
        predicted_classes: Predicted class for each sample
        confidences: Prediction confidence scores

    Returns:
        (flags, metrics) - flagged indices and per-sample metrics
    """
    if self._centroids is None or self._threshold is None:
        return set(), {}

    flags: set[int] = set()
    metrics: dict[int, dict[str, Any]] = {}

    # Normalize embeddings for cosine distance
    query_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_norms = np.where(query_norms == 0, 1, query_norms)
    query_normalized = embeddings / query_norms

    centroid_norms = np.linalg.norm(self._centroids, axis=1, keepdims=True)
    centroid_norms = np.where(centroid_norms == 0, 1, centroid_norms)
    centroids_normalized = self._centroids / centroid_norms

    # Compute cosine similarity matrix (queries x centroids)
    similarity_matrix = query_normalized @ centroids_normalized.T

    # Convert to cosine distance
    distance_matrix = 1.0 - similarity_matrix

    for idx in range(len(embeddings)):
        distances = distance_matrix[idx]
        min_distance = float(np.min(distances))
        nearest_centroid_idx = int(np.argmin(distances))
        nearest_class = self._class_labels[nearest_centroid_idx]

        # Continuous novelty score (normalized to [0, 1] via sigmoid)
        novelty_score = self._distance_to_score(min_distance)

        is_novel = min_distance > self._threshold

        if is_novel:
            flags.add(idx)

        metrics[idx] = {
            "setfit_centroid_min_distance": min_distance,
            "setfit_centroid_nearest_class": nearest_class,
            "setfit_centroid_novelty_score": novelty_score,
            "setfit_centroid_is_novel": is_novel,
            "setfit_centroid_predicted_class": predicted_classes[idx],
            "setfit_centroid_confidence": float(confidences[idx]),
        }

    return flags, metrics
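
The _distance_to_score helper is only described as a sigmoid normalization in the code comment above; the midpoint and slope in this sketch are assumed values.

import numpy as np

# Hypothetical sketch of _distance_to_score; only the use of a sigmoid
# is stated above, so midpoint and slope here are assumptions.
def distance_to_score(distance: float, midpoint: float = 0.3, slope: float = 10.0) -> float:
    """Map a cosine distance onto a (0, 1) novelty score via a logistic curve."""
    return float(1.0 / (1.0 + np.exp(-slope * (distance - midpoint))))

distance_to_score(0.1)  # well inside a known class -> score near 0
distance_to_score(0.6)  # far from every centroid -> score near 1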
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    return 0.45
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.prototypical

Prototypical network novelty detection strategy wrapper.

Wraps PrototypicalDetector to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.mahalanobis

Mahalanobis distance-based novelty detection strategy.

Flags samples based on their Mahalanobis distance to the class-conditional distribution of their predicted class. Supports optional conformal calibration for statistically grounded, p-value-based novelty routing.

Classes

MahalanobisDistanceStrategy()

Bases: NoveltyStrategy

Mahalanobis distance strategy for novelty detection.

Computes the Mahalanobis distance from each sample to the class-conditional distribution (mean + shared covariance) of its predicted class. Samples whose distance exceeds a configurable threshold are flagged as novel.

When calibration_mode="conformal", raw distances are wrapped with conformal p-values for statistically grounded routing. This is backward-compatible: calibration_mode="none" produces identical results to the original threshold-only behavior.

Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def __init__(self):
    self._config: MahalanobisConfig | None = None
    self._class_means: dict[str, np.ndarray] = {}
    self._cov_inv: np.ndarray | None = None
    self._dim: int = 0
    self._calibrator: Any = None
Attributes
config_schema property

Return MahalanobisConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the Mahalanobis strategy with reference data.

Computes per-class mean vectors and a shared (pooled) covariance matrix with regularization for numerical stability.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (n_samples, dim). Required.
- reference_labels (list[str]): Class labels for known samples. Required.
- config (MahalanobisConfig): MahalanobisConfig with threshold, regularization, etc. Required.
Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: MahalanobisConfig,
) -> None:
    """
    Initialize the Mahalanobis strategy with reference data.

    Computes per-class mean vectors and a shared (pooled) covariance matrix
    with regularization for numerical stability.

    Args:
        reference_embeddings: Embeddings of known samples (n_samples, dim)
        reference_labels: Class labels for known samples
        config: MahalanobisConfig with threshold, regularization, etc.
    """
    self._config = config or MahalanobisConfig()
    self._dim = reference_embeddings.shape[1]
    self._class_means = {}
    self._cov_inv = None
    self._calibrator = None

    if self._config.calibration_mode == "conformal":
        self._initialize_with_calibration(reference_embeddings, reference_labels)
    else:
        self._initialize_core(reference_embeddings, reference_labels)
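
For reference, the pooled covariance described above can be computed as in this sketch; the ridge regularization constant is an assumption, since _initialize_core is not shown here.

import numpy as np

# Sketch of pooled (shared) covariance with ridge regularization; the
# actual _initialize_core implementation and its regularizer are not shown.
def pooled_inverse_covariance(embeddings, labels, ridge: float = 1e-3):
    dim = embeddings.shape[1]
    means, centered = {}, []
    for label in set(labels):
        cls = embeddings[np.array(labels) == label]
        means[label] = cls.mean(axis=0)
        centered.append(cls - means[label])
    stacked = np.vstack(centered)
    cov = stacked.T @ stacked / max(len(stacked) - 1, 1)
    cov += ridge * np.eye(dim)  # regularize for numerical stability
    return means, np.linalg.inv(cov)

def mahalanobis_distance(x, mean, cov_inv) -> float:
    delta = x - mean
    return float(np.sqrt(delta @ cov_inv @ delta))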
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using Mahalanobis distance.

When calibration_mode="conformal", flagging uses p-values instead of raw distance thresholds. A sample is flagged if p_value < calibration_alpha.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted classes. Required.
- confidences (ndarray): Prediction confidences. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using Mahalanobis distance.

    When ``calibration_mode="conformal"``, flagging uses p-values
    instead of raw distance thresholds. A sample is flagged if
    ``p_value < calibration_alpha``.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted classes
        confidences: Prediction confidences
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    flags = set()
    metrics = {}

    if (
        self._config.calibration_mode == "conformal"
        and self._calibrator is not None
        and self._calibrator.is_calibrated
    ):
        raw_distances = self._compute_all_distances(embeddings, predicted_classes)
        if self._config.calibration_method == "mondrian":
            p_values = self._calibrator.predict_pvalues_for_class(
                raw_distances, predicted_classes
            )
        else:
            p_values = self._calibrator.predict_pvalues(raw_distances)

        for idx in range(len(embeddings)):
            metric = self._compute_mahalanobis_metrics(
                idx,
                embeddings[idx],
                predicted_classes[idx],
            )
            metric["p_value"] = float(p_values[idx])
            metric["calibration_mode"] = "conformal"
            metrics[idx] = metric

            if p_values[idx] < self._config.calibration_alpha:
                flags.add(idx)
    else:
        for idx in range(len(embeddings)):
            metric = self._compute_mahalanobis_metrics(
                idx,
                embeddings[idx],
                predicted_classes[idx],
            )
            metrics[idx] = metric

            if metric["mahalanobis_distance"] >= self._config.threshold:
                flags.add(idx)

    return flags, metrics
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    return 0.35
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)


novelentitymatcher.novelty.strategies.uncertainty

Uncertainty-based novelty detection strategy.

Flags samples based on prediction uncertainty using margin and entropy.

Classes

UncertaintyStrategy()

Bases: NoveltyStrategy

Uncertainty-based strategy for novelty detection.

Flags samples as novel if their prediction uncertainty exceeds configured thresholds (margin or entropy).

Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def __init__(self):
    self._config: UncertaintyConfig | None = None
Attributes
config_schema property

Return UncertaintyConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the uncertainty strategy.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (not used). Required.
- reference_labels (list[str]): Labels of known samples (not used). Required.
- config (UncertaintyConfig): UncertaintyConfig with thresholds. Required.
Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: UncertaintyConfig,
) -> None:
    """
    Initialize the uncertainty strategy.

    Args:
        reference_embeddings: Embeddings of known samples (not used)
        reference_labels: Labels of known samples (not used)
        config: UncertaintyConfig with thresholds
    """
    self._config = config or UncertaintyConfig()
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using uncertainty metrics.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings (not used). Required.
- predicted_classes (list[str]): Predicted classes (not used). Required.
- confidences (ndarray): Prediction confidence scores. Required.
- **kwargs: Additional parameters, may include 'all_probs' for the full distribution. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using uncertainty metrics.

    Args:
        texts: Input texts
        embeddings: Text embeddings (not used)
        predicted_classes: Predicted classes (not used)
        confidences: Prediction confidence scores
        **kwargs: Additional parameters, may include 'all_probs' for full distribution

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    flags = set()
    metrics = {}

    # Check if we have full probability distributions
    all_probs = kwargs.get("all_probs", None)

    for idx, confidence in enumerate(confidences):
        metric = self._compute_uncertainty_metrics(
            idx,
            confidence,
            all_probs[idx] if all_probs is not None else None,
        )
        metrics[idx] = metric

        # Check if uncertainty exceeds thresholds
        is_novel = (
            metric["margin_score"] < self._config.margin_threshold
            or metric["entropy_score"] > self._config.entropy_threshold
        )

        if is_novel:
            flags.add(idx)

    return flags, metrics
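
The _compute_uncertainty_metrics helper is not reproduced here, but margin and entropy are standard quantities; the sketch below shows the usual definitions, while normalization details in the real helper may differ.

import numpy as np

# Standard margin and entropy computations; the real
# _compute_uncertainty_metrics may scale or normalize these differently.
def margin_and_entropy(probs: np.ndarray) -> tuple[float, float]:
    """probs: full class-probability distribution for one sample."""
    top2 = np.sort(probs)[-2:]
    margin = float(top2[1] - top2[0])  # top-1 minus top-2 probability
    entropy = float(-np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0))))
    return margin, entropy

# A peaked distribution: large margin, low entropy -> not novel.
margin_and_entropy(np.array([0.9, 0.05, 0.05]))
# A flat distribution: small margin, high entropy -> likely flagged.
margin_and_entropy(np.array([0.34, 0.33, 0.33]))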
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # Uncertainty is a strong signal
    return 0.35
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.confidence

Confidence threshold-based novelty detection strategy.

Flags samples with prediction confidence below a threshold as novel.

Classes

ConfidenceStrategy()

Bases: NoveltyStrategy

Confidence threshold strategy for novelty detection.

Flags samples as novel if their prediction confidence falls below a configured threshold.

Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def __init__(self):
    self._config: ConfidenceConfig | None = None
Attributes
config_schema property

Return ConfidenceConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the confidence strategy.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (not used). Required.
- reference_labels (list[str]): Labels of known samples (not used). Required.
- config (ConfidenceConfig): ConfidenceConfig with threshold parameter. Required.
Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: ConfidenceConfig,
) -> None:
    """
    Initialize the confidence strategy.

    Args:
        reference_embeddings: Embeddings of known samples (not used)
        reference_labels: Labels of known samples (not used)
        config: ConfidenceConfig with threshold parameter
    """
    self._config = config or ConfidenceConfig()
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using confidence threshold.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings (not used). Required.
- predicted_classes (list[str]): Predicted classes (not used). Required.
- confidences (ndarray): Prediction confidence scores. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using confidence threshold.

    Args:
        texts: Input texts
        embeddings: Text embeddings (not used)
        predicted_classes: Predicted classes (not used)
        confidences: Prediction confidence scores
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    flags = set()
    metrics = {}

    for idx, confidence in enumerate(confidences):
        is_novel = confidence < self._config.threshold

        if is_novel:
            flags.add(idx)

        metrics[idx] = {
            "confidence_score": float(confidence),
            "confidence_is_novel": is_novel,
        }

    return flags, metrics
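
A usage sketch follows. ConfidenceConfig's import path and threshold kwarg are assumptions inferred from the self._config.threshold access in detect() above.

import numpy as np

# ConfidenceConfig's location and kwargs are assumptions; the strategy's
# module path matches the source path shown above.
from novelentitymatcher.novelty.strategies.confidence import (
    ConfidenceConfig,
    ConfidenceStrategy,
)

strategy = ConfidenceStrategy()
strategy.initialize(
    reference_embeddings=np.empty((0, 0)),  # unused by this strategy
    reference_labels=[],
    config=ConfidenceConfig(threshold=0.7),
)
flags, metrics = strategy.detect(
    texts=["a", "b"],
    embeddings=np.empty((2, 0)),  # unused
    predicted_classes=["x", "y"],  # unused
    confidences=np.array([0.9, 0.4]),
)
assert flags == {1}  # only the 0.4-confidence sample falls below 0.7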
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # Confidence is a foundational signal, give it moderate weight
    return 0.35
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.self_knowledge

Self-knowledge detection strategy wrapper.

Wraps SelfKnowledgeDetector to implement NoveltyStrategy protocol.

Classes

SelfKnowledgeStrategy()

Bases: NoveltyStrategy

Self-knowledge strategy for novelty detection.

Uses a sparse autoencoder to learn representations of known samples and flags samples with high reconstruction error as novel.

Source code in src/novelentitymatcher/novelty/strategies/self_knowledge.py
def __init__(self):
    self._config: SelfKnowledgeConfig | None = None
    self._detector: SelfKnowledgeDetector | None = None
Functions
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.conformal

Conformal prediction-based calibration for OOD detection strategies.

Wraps raw strategy scores with statistically grounded p-values, enabling rigorous routing of out-of-distribution inputs.

Classes

ConformalCalibrator(alpha=0.1, method='split')

Calibrate raw OOD scores into conformal p-values.

Supports two methods:

- "split": Holds out a fraction of reference data for calibration.
- "mondrian": Uses class-conditional (Mondrian) conformal calibration with per-class nonconformity distributions.

Usage::

    cal = ConformalCalibrator(alpha=0.1, method="split")
    cal.calibrate(raw_scores, labels)
    pvals = cal.predict_pvalues(test_scores)
Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def __init__(
    self,
    alpha: float = 0.1,
    method: Literal["mondrian", "split"] = "split",
):
    self.alpha = alpha
    self.method = method
    self._nonconformity_scores: np.ndarray | None = None
    self._class_scores: dict[str, np.ndarray] = {}
    self._n_calibration: int = 0
    self._is_calibrated: bool = False
Attributes
calibration_metadata property

Return calibration metadata for reproducibility.

Functions
calibrate(scores, labels)

Compute nonconformity scores from calibration data.

Parameters:

- scores (ndarray): Raw OOD scores for calibration samples, shape (n_samples,). Higher scores indicate more anomalous / novel samples. Required.
- labels (ndarray): Class labels for calibration samples, shape (n_samples,). Required.

Returns:

- ConformalCalibrator: Self for fluent chaining.

Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def calibrate(
    self,
    scores: np.ndarray,
    labels: np.ndarray,
) -> ConformalCalibrator:
    """Compute nonconformity scores from calibration data.

    Args:
        scores: Raw OOD scores for calibration samples, shape (n_samples,).
                Higher scores indicate more anomalous / novel.
        labels: Class labels for calibration samples, shape (n_samples,).

    Returns:
        Self for fluent chaining.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)

    if self.method == "mondrian":
        self._calibrate_mondrian(scores, labels)
    else:
        self._nonconformity_scores = np.sort(scores)
        self._n_calibration = len(scores)

    self._is_calibrated = True
    logger.info(
        "Conformal calibration complete: method=%s, n=%d, alpha=%.3f",
        self.method,
        self._n_calibration,
        self.alpha,
    )
    return self
predict_pvalues(scores)

Convert raw OOD scores to calibrated p-values.

Parameters:

- scores (ndarray): Raw scores for test samples, shape (n_samples,). Required.

Returns:

- ndarray: p-values, shape (n_samples,). Lower p-value = more likely OOD.

Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def predict_pvalues(self, scores: np.ndarray) -> np.ndarray:
    """Convert raw OOD scores to calibrated p-values.

    Args:
        scores: Raw scores for test samples, shape (n_samples,).

    Returns:
        p-values, shape (n_samples,). Lower p-value = more likely OOD.
    """
    if not self._is_calibrated:
        raise RuntimeError(
            "Calibrator has not been calibrated. Call calibrate() first."
        )

    scores = np.asarray(scores, dtype=np.float64)

    if self.method == "mondrian":
        return self._predict_mondrian_pvalues(scores)

    return self._compute_pvalues(scores, self._nonconformity_scores)
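
The private _compute_pvalues helper is not shown on this page. The standard split-conformal p-value is p = (1 + #{calibration scores >= test score}) / (n + 1); a sketch under that assumption:

import numpy as np

# Standard split-conformal p-value; whether _compute_pvalues handles ties
# or smoothing exactly this way is an assumption.
def compute_pvalues(test_scores: np.ndarray, cal_scores: np.ndarray) -> np.ndarray:
    """cal_scores must be sorted ascending (as calibrate() stores them)."""
    n = len(cal_scores)
    # For each test score, count how many calibration scores are >= it.
    greater_equal = n - np.searchsorted(cal_scores, test_scores, side="left")
    return (1.0 + greater_equal) / (n + 1.0)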
predict_pvalues_for_class(scores, predicted_classes)

Compute class-conditional p-values when predicted classes are known.

Parameters:

- scores (ndarray): Raw OOD scores for test samples. Required.
- predicted_classes (list[str]): Predicted class for each sample. Required.

Returns:

- ndarray: p-values, shape (n_samples,).

Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def predict_pvalues_for_class(
    self,
    scores: np.ndarray,
    predicted_classes: list[str],
) -> np.ndarray:
    """Compute class-conditional p-values when predicted classes are known.

    Args:
        scores: Raw OOD scores for test samples.
        predicted_classes: Predicted class for each sample.

    Returns:
        p-values, shape (n_samples,).
    """
    if not self._is_calibrated:
        raise RuntimeError(
            "Calibrator has not been calibrated. Call calibrate() first."
        )

    scores = np.asarray(scores, dtype=np.float64)
    pvalues = np.empty(len(scores))

    all_cal = (
        np.sort(np.concatenate(list(self._class_scores.values())))
        if self._class_scores
        else self._nonconformity_scores
    )

    for i, (score, pred_class) in enumerate(
        zip(scores, predicted_classes, strict=False)
    ):
        class_cal = self._class_scores.get(str(pred_class))
        if class_cal is not None and len(class_cal) > 0:
            pvalues[i] = self._compute_pvalues(np.array([score]), class_cal)[0]
        elif all_cal is not None:
            pvalues[i] = self._compute_pvalues(np.array([score]), all_cal)[0]
        else:
            pvalues[i] = 1.0

    return pvalues
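
Class-conditional calibration follows the same flow as the split example in the class docstring; a short usage sketch:

import numpy as np

from novelentitymatcher.novelty.strategies.conformal import ConformalCalibrator

# Usage sketch for Mondrian (class-conditional) calibration.
cal = ConformalCalibrator(alpha=0.1, method="mondrian")
cal.calibrate(
    scores=np.array([0.2, 0.3, 0.8, 0.9]),
    labels=np.array(["a", "a", "b", "b"]),
)
pvals = cal.predict_pvalues_for_class(
    scores=np.array([0.95]),
    predicted_classes=["b"],
)
is_novel = pvals[0] < cal.alpha  # flag when the p-value falls below alpha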
