Novelty Strategies

novelentitymatcher.novelty.strategies.base

Base protocol for novelty detection strategies.

All strategies must implement this protocol to be compatible with the NoveltyDetector.

Classes

NoveltyStrategy

Bases: ABC

Base protocol for all novelty detection strategies.

Each strategy is responsible for:

1. Initializing with reference embeddings and labels
2. Detecting novel samples from a batch of inputs
3. Providing per-sample metrics for signal combination
4. Specifying its weight for signal fusion

Attributes
config_schema abstractmethod property

Return the config dataclass type for this strategy.

This is used for validation and defaults.

Functions
initialize(reference_embeddings, reference_labels, config) abstractmethod

Initialize strategy with reference data.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples. Required.
- reference_labels (list[str]): Class labels for known samples. Required.
- config (Any): Strategy-specific configuration object. Required.
Source code in src/novelentitymatcher/novelty/strategies/base.py
@abstractmethod
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: Any,
) -> None:
    """
    Initialize strategy with reference data.

    Args:
        reference_embeddings: Embeddings of known samples
        reference_labels: Class labels for known samples
        config: Strategy-specific configuration object
    """
detect(texts, embeddings, predicted_classes, confidences, **kwargs) abstractmethod

Detect novel samples.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted class for each sample. Required.
- confidences (ndarray): Prediction confidence scores. Required.
- **kwargs: Additional strategy-specific parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - flagged indices and per-sample metrics
  - flags: Set of indices flagged as novel
  - metrics: Dict mapping index to metric dict
Source code in src/novelentitymatcher/novelty/strategies/base.py
@abstractmethod
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted class for each sample
        confidences: Prediction confidence scores
        **kwargs: Additional strategy-specific parameters

    Returns:
        (flags, metrics) - flagged indices and per-sample metrics
        - flags: Set of indices flagged as novel
        - metrics: Dict mapping index to metric dict
    """
get_weight() abstractmethod

Return weight for signal combination.

This weight determines how much this strategy contributes to the final novelty score.

Source code in src/novelentitymatcher/novelty/strategies/base.py
@abstractmethod
def get_weight(self) -> float:
    """
    Return weight for signal combination.

    This weight determines how much this strategy contributes
    to the final novelty score.
    """
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)
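
Putting the protocol together, a minimal conforming strategy might look like the sketch below. It is illustrative only: ThresholdConfig and the mean-distance heuristic are hypothetical, not part of this package.

from dataclasses import dataclass

import numpy as np

from novelentitymatcher.novelty.strategies.base import NoveltyStrategy


@dataclass
class ThresholdConfig:  # hypothetical config dataclass for this sketch
    threshold: float = 0.5


class MeanDistanceStrategy(NoveltyStrategy):
    """Toy strategy: flag samples far from the global reference centroid."""

    @property
    def config_schema(self):
        return ThresholdConfig

    def initialize(self, reference_embeddings, reference_labels, config) -> None:
        self._config = config or ThresholdConfig()
        self._mean = reference_embeddings.mean(axis=0)

    def detect(self, texts, embeddings, predicted_classes, confidences, **kwargs):
        distances = np.linalg.norm(embeddings - self._mean, axis=1)
        flags = {i for i, d in enumerate(distances) if d > self._config.threshold}
        metrics = {i: {"mean_distance": float(d)} for i, d in enumerate(distances)}
        return flags, metrics

    def get_weight(self) -> float:
        return 0.1  # low weight: this is a coarse signal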

novelentitymatcher.novelty.strategies.knn_distance

kNN distance-based novelty detection strategy.

Flags samples based on their distance to k-nearest neighbors in the reference set.

Classes

KNNDistanceStrategy()

Bases: NoveltyStrategy

kNN distance strategy for novelty detection.

Flags samples as novel if their average distance to k-nearest neighbors in the reference set exceeds a threshold.

Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def __init__(self):
    self._config: KNNConfig | None = None
    self._ann_index: ANNIndex | None = None
Attributes
config_schema property

Return KNNConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the kNN strategy with reference data.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples. Required.
- reference_labels (list[str]): Labels of known samples. Required.
- config (KNNConfig): KNNConfig with k, thresholds, and metric. Required.
Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: KNNConfig,
) -> None:
    """
    Initialize the kNN strategy with reference data.

    Args:
        reference_embeddings: Embeddings of known samples
        reference_labels: Labels of known samples
        config: KNNConfig with k, thresholds, and metric
    """
    self._config = config or KNNConfig()

    # Initialize ANN index
    self._ann_index = ANNIndex(
        dim=reference_embeddings.shape[1],
        max_elements=len(reference_labels),
    )
    self._ann_index.add_vectors(reference_embeddings, reference_labels)
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using kNN distance.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted classes. Required.
- confidences (ndarray): Prediction confidences. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using kNN distance.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted classes
        confidences: Prediction confidences
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    k = min(self._config.k, self._ann_index.n_elements)

    # Query kNN
    similarities, neighbor_indices = self._ann_index.knn_query(embeddings, k=k)

    flags = set()
    metrics = {}

    for idx in range(len(embeddings)):
        metric = self._compute_knn_metrics(
            idx,
            similarities[idx],
            neighbor_indices[idx],
            predicted_classes[idx],
        )
        metrics[idx] = metric

        # Check if novelty score exceeds threshold
        if metric["knn_novelty_score"] >= self._config.distance_threshold:
            flags.add(idx)

    return flags, metrics
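
For orientation, a usage sketch follows. KNNConfig's import path and constructor kwargs are assumptions inferred from the attribute accesses above (self._config.k, self._config.distance_threshold).

import numpy as np

# KNNConfig's location and kwargs are assumptions; KNNDistanceStrategy's
# module path matches the source path shown above.
from novelentitymatcher.novelty.strategies.knn_distance import (
    KNNConfig,
    KNNDistanceStrategy,
)

rng = np.random.default_rng(0)
reference = rng.random((100, 384), dtype=np.float32)

strategy = KNNDistanceStrategy()
strategy.initialize(
    reference_embeddings=reference,
    reference_labels=["known"] * 100,
    config=KNNConfig(k=5, distance_threshold=0.6),
)
flags, metrics = strategy.detect(
    texts=["a query"],
    embeddings=rng.random((1, 384), dtype=np.float32),
    predicted_classes=["known"],
    confidences=np.array([0.9]),
)
print(flags, metrics[0]["knn_novelty_score"])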
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # kNN is a strong signal, give it high weight
    return 0.45
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.clustering

Clustering-based novelty detection strategy.

Flags samples that form small, isolated clusters or don't fit well into any existing cluster.

Classes

ClusteringStrategy()

Bases: NoveltyStrategy

Clustering-based strategy for novelty detection.

Uses HDBSCAN to cluster samples and flags as novel those that fall into small or low-cohesion clusters, or into noise.

Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def __init__(self):
    self._config: ClusteringConfig | None = None
    self._clusterer: ScalableClusterer | None = None
    self._validator: ClusterValidator | None = None
    self._reference_embeddings: np.ndarray | None = None
    self._reference_labels: list[str] | None = None
Attributes
config_schema property

Return ClusteringConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the clustering strategy.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples. Required.
- reference_labels (list[str]): Labels of known samples. Required.
- config (ClusteringConfig): ClusteringConfig with thresholds. Required.
Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: ClusteringConfig,
) -> None:
    """
    Initialize the clustering strategy.

    Args:
        reference_embeddings: Embeddings of known samples
        reference_labels: Labels of known samples
        config: ClusteringConfig with thresholds
    """
    self._config = config or ClusteringConfig()
    self._reference_embeddings = reference_embeddings
    self._reference_labels = reference_labels

    # Initialize clusterer
    self._clusterer = ScalableClusterer(
        min_cluster_size=self._config.hdbscan_min_cluster_size,
        min_samples=self._config.hdbscan_min_samples,
        cluster_selection_epsilon=self._config.cluster_selection_epsilon,
    )

    # Initialize validator
    self._validator = ClusterValidator(
        min_cohesion_threshold=self._config.cohesion_threshold,
        min_persistence_threshold=self._config.persistence_threshold,
    )
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using clustering.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted classes. Required.
- confidences (ndarray): Prediction confidences. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using clustering.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted classes
        confidences: Prediction confidences
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    # Combine reference and query embeddings for clustering
    all_embeddings = np.vstack([self._reference_embeddings, embeddings])

    # Fit clusterer on all embeddings
    self._clusterer.fit(all_embeddings)

    # Get cluster labels
    labels = self._clusterer.labels

    # Separate query labels (reference samples come first)
    query_labels = labels[len(self._reference_embeddings) :]

    flags = set()
    metrics = {}

    # Validate clusters and identify novel samples
    unique_labels = np.unique(query_labels)

    for label in unique_labels:
        if label == -1:  # Noise points
            # All noise points are novel
            mask = query_labels == label
            indices = np.where(mask)[0]
            for idx in indices:
                flags.add(idx)
                metrics[idx] = {
                    "cluster_label": -1,
                    "cluster_support_score": 0.0,
                    "cluster_is_novel": True,
                    "cluster_size": 1,
                }
        else:
            # Check if cluster is valid
            # Get all embeddings with this label (including reference)
            all_mask = labels == label
            _cluster_embeddings = all_embeddings[all_mask]

            is_valid = self._validator.is_valid_cluster(
                all_embeddings,
                labels,
                label,
                min_size=self._config.min_cluster_size,
            )

            # Compute support score (1 - cohesion)
            cohesion = self._validator.compute_cohesion(
                all_embeddings, labels, label
            )
            support_score = 1.0 - cohesion

            # Get query indices for this cluster
            query_mask = query_labels == label
            query_indices = np.where(query_mask)[0]

            for idx in query_indices:
                # Novel if cluster is invalid or support score is low
                is_novel = not is_valid or support_score < (
                    1.0 - self._config.cohesion_threshold
                )

                if is_novel:
                    flags.add(idx)

                metrics[idx] = {
                    "cluster_label": int(label),
                    "cluster_support_score": support_score,
                    "cluster_is_novel": is_novel,
                    "cluster_size": int(np.sum(all_mask)),
                    "cluster_cohesion": cohesion,
                }

    return flags, metrics
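
The core mechanism above — clustering reference and query embeddings together and treating HDBSCAN noise points as novel — can be reproduced standalone. A simplified sketch with the hdbscan library (ScalableClusterer and ClusterValidator add cohesion and persistence checks on top of this; parameter values here are arbitrary, not the package defaults):

import hdbscan
import numpy as np

rng = np.random.default_rng(0)
reference = rng.random((200, 32))
queries = rng.random((10, 32))
all_points = np.vstack([reference, queries])

clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=3)
labels = clusterer.fit_predict(all_points)

query_labels = labels[len(reference):]  # reference samples come first
noise_novel = {int(i) for i, lbl in enumerate(query_labels) if lbl == -1}
print("noise-point novel indices:", noise_novel)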
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/clustering.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # Clustering provides complementary signal
    return 0.2
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.pattern

Pattern-based novelty detection strategy wrapper.

Wraps PatternScorer to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.oneclass

One-Class SVM novelty detection strategy wrapper.

Wraps OneClassSVMDetector to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.setfit

SetFit contrastive novelty detection strategy wrapper.

Wraps SetFitDetector to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.setfit_centroid

SetFit centroid distance novelty detection strategy.

Computes minimum cosine distance from each query to known class centroids in the SetFit fine-tuned embedding space. Produces continuous novelty scores.

This is the recommended strategy when full SetFit training is used in Phase 1, since contrastive learning creates tight, well-separated class clusters.

Classes

SetFitCentroidStrategy()

Bases: NoveltyStrategy

Centroid distance strategy using SetFit fine-tuned embeddings.

For each known class, computes a centroid in the SetFit embedding space. Novelty score = minimum cosine distance from query to any centroid.

Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def __init__(self) -> None:
    self._config: SetFitCentroidConfig | None = None
    self._centroids: np.ndarray | None = None
    self._class_labels: list[str] | None = None
    self._threshold: float | None = None
    self._setfit_model: Any | None = None
Functions
initialize(reference_embeddings, reference_labels, config)

Initialize centroids from reference embeddings.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (already from SetFit model). Required.
- reference_labels (list[str]): Class labels for known samples. Required.
- config (SetFitCentroidConfig): SetFitCentroidConfig with threshold. Required.
Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: SetFitCentroidConfig,
) -> None:
    """
    Initialize centroids from reference embeddings.

    Args:
        reference_embeddings: Embeddings of known samples (already from SetFit model)
        reference_labels: Class labels for known samples
        config: SetFitCentroidConfig with threshold
    """
    self._config = config or SetFitCentroidConfig()
    self._class_labels = list(set(reference_labels))

    # Compute per-class centroids
    centroids = {}
    for label in self._class_labels:
        mask = np.array(reference_labels) == label
        class_embeddings = reference_embeddings[mask]
        if len(class_embeddings) > 0:
            centroids[label] = np.mean(class_embeddings, axis=0)

    # Sort centroids by class label for consistent indexing
    self._centroids = np.array(
        [centroids[label] for label in sorted(centroids.keys())]
    )
    self._class_labels = sorted(centroids.keys())

    # Calibrate threshold from reference set if not explicitly set
    if self._config.threshold is None:
        self._threshold = self._calibrate_threshold(
            reference_embeddings, reference_labels
        )
    else:
        self._threshold = self._config.threshold
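
The _calibrate_threshold helper is not reproduced on this page. A plausible sketch, offered purely as an assumption about how such a calibration could work (e.g. a high percentile of in-class centroid distances):

import numpy as np

# Hypothetical sketch; the actual _calibrate_threshold implementation is
# not shown in this reference, and the percentile choice is arbitrary.
def calibrate_threshold(embeddings, labels, centroids, class_labels, pct=95.0):
    """Pick the threshold as a high percentile of in-class centroid distances."""
    label_to_idx = {lbl: i for i, lbl in enumerate(class_labels)}

    def normalize(x):
        return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

    emb_n, cent_n = normalize(embeddings), normalize(centroids)
    distances = [
        1.0 - float(emb_n[i] @ cent_n[label_to_idx[lbl]])
        for i, lbl in enumerate(labels)
    ]
    return float(np.percentile(distances, pct))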
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using centroid distance.

Parameters:

- texts (list[str]): Input texts (unused, embeddings are pre-computed). Required.
- embeddings (ndarray): Query embeddings. Required.
- predicted_classes (list[str]): Predicted class for each sample. Required.
- confidences (ndarray): Prediction confidence scores. Required.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using centroid distance.

    Args:
        texts: Input texts (unused, embeddings are pre-computed)
        embeddings: Query embeddings
        predicted_classes: Predicted class for each sample
        confidences: Prediction confidence scores

    Returns:
        (flags, metrics) - flagged indices and per-sample metrics
    """
    if self._centroids is None or self._threshold is None:
        return set(), {}

    flags: set[int] = set()
    metrics: dict[int, dict[str, Any]] = {}

    # Normalize embeddings for cosine distance
    query_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_norms = np.where(query_norms == 0, 1, query_norms)
    query_normalized = embeddings / query_norms

    centroid_norms = np.linalg.norm(self._centroids, axis=1, keepdims=True)
    centroid_norms = np.where(centroid_norms == 0, 1, centroid_norms)
    centroids_normalized = self._centroids / centroid_norms

    # Compute cosine similarity matrix (queries x centroids)
    similarity_matrix = query_normalized @ centroids_normalized.T

    # Convert to cosine distance
    distance_matrix = 1.0 - similarity_matrix

    for idx in range(len(embeddings)):
        distances = distance_matrix[idx]
        min_distance = float(np.min(distances))
        nearest_centroid_idx = int(np.argmin(distances))
        nearest_class = self._class_labels[nearest_centroid_idx]

        # Continuous novelty score (normalized to [0, 1] via sigmoid)
        novelty_score = self._distance_to_score(min_distance)

        is_novel = min_distance > self._threshold

        if is_novel:
            flags.add(idx)

        metrics[idx] = {
            "setfit_centroid_min_distance": min_distance,
            "setfit_centroid_nearest_class": nearest_class,
            "setfit_centroid_novelty_score": novelty_score,
            "setfit_centroid_is_novel": is_novel,
            "setfit_centroid_predicted_class": predicted_classes[idx],
            "setfit_centroid_confidence": float(confidences[idx]),
        }

    return flags, metrics
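
The _distance_to_score helper is only described as a sigmoid normalization in the code comment above; the midpoint and slope in this sketch are assumed values.

import numpy as np

# Hypothetical sketch of _distance_to_score; only the use of a sigmoid
# is stated above, so midpoint and slope here are assumptions.
def distance_to_score(distance: float, midpoint: float = 0.3, slope: float = 10.0) -> float:
    """Map a cosine distance onto a (0, 1) novelty score via a logistic curve."""
    return float(1.0 / (1.0 + np.exp(-slope * (distance - midpoint))))

distance_to_score(0.1)  # well inside a known class -> score near 0
distance_to_score(0.6)  # far from every centroid -> score near 1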
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/setfit_centroid.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    return 0.45
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.prototypical

Prototypical network novelty detection strategy wrapper.

Wraps PrototypicalDetector to implement NoveltyStrategy protocol.


novelentitymatcher.novelty.strategies.mahalanobis

Mahalanobis distance-based novelty detection strategy.

Flags samples based on their Mahalanobis distance to the class-conditional distribution of their predicted class. Supports optional conformal calibration for statistically grounded, p-value-based novelty routing.

Classes

MahalanobisDistanceStrategy()

Bases: NoveltyStrategy

Mahalanobis distance strategy for novelty detection.

Computes the Mahalanobis distance from each sample to the class-conditional distribution (mean + shared covariance) of its predicted class. Samples whose distance exceeds a configurable threshold are flagged as novel.

When calibration_mode="conformal", raw distances are wrapped with conformal p-values for statistically grounded routing. This is backward-compatible: calibration_mode="none" produces identical results to the original threshold-only behavior.

Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def __init__(self):
    self._config: MahalanobisConfig | None = None
    self._class_means: dict[str, np.ndarray] = {}
    self._cov_inv: np.ndarray | None = None
    self._dim: int = 0
    self._calibrator: Any = None
Attributes
config_schema property

Return MahalanobisConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the Mahalanobis strategy with reference data.

Computes per-class mean vectors and a shared (pooled) covariance matrix with regularization for numerical stability.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (n_samples, dim). Required.
- reference_labels (list[str]): Class labels for known samples. Required.
- config (MahalanobisConfig): MahalanobisConfig with threshold, regularization, etc. Required.
Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: MahalanobisConfig,
) -> None:
    """
    Initialize the Mahalanobis strategy with reference data.

    Computes per-class mean vectors and a shared (pooled) covariance matrix
    with regularization for numerical stability.

    Args:
        reference_embeddings: Embeddings of known samples (n_samples, dim)
        reference_labels: Class labels for known samples
        config: MahalanobisConfig with threshold, regularization, etc.
    """
    self._config = config or MahalanobisConfig()
    self._dim = reference_embeddings.shape[1]
    self._class_means = {}
    self._cov_inv = None
    self._calibrator = None

    if self._config.calibration_mode == "conformal":
        self._initialize_with_calibration(reference_embeddings, reference_labels)
    else:
        self._initialize_core(reference_embeddings, reference_labels)
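
For reference, the pooled covariance described above can be computed as in this sketch; the ridge regularization constant is an assumption, since _initialize_core is not shown here.

import numpy as np

# Sketch of pooled (shared) covariance with ridge regularization; the
# actual _initialize_core implementation and its regularizer are not shown.
def pooled_inverse_covariance(embeddings, labels, ridge: float = 1e-3):
    dim = embeddings.shape[1]
    means, centered = {}, []
    for label in set(labels):
        cls = embeddings[np.array(labels) == label]
        means[label] = cls.mean(axis=0)
        centered.append(cls - means[label])
    stacked = np.vstack(centered)
    cov = stacked.T @ stacked / max(len(stacked) - 1, 1)
    cov += ridge * np.eye(dim)  # regularize for numerical stability
    return means, np.linalg.inv(cov)

def mahalanobis_distance(x, mean, cov_inv) -> float:
    delta = x - mean
    return float(np.sqrt(delta @ cov_inv @ delta))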
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using Mahalanobis distance.

When calibration_mode="conformal", flagging uses p-values instead of raw distance thresholds. A sample is flagged if p_value < calibration_alpha.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings. Required.
- predicted_classes (list[str]): Predicted classes. Required.
- confidences (ndarray): Prediction confidences. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using Mahalanobis distance.

    When ``calibration_mode="conformal"``, flagging uses p-values
    instead of raw distance thresholds. A sample is flagged if
    ``p_value < calibration_alpha``.

    Args:
        texts: Input texts
        embeddings: Text embeddings
        predicted_classes: Predicted classes
        confidences: Prediction confidences
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    flags = set()
    metrics = {}

    if (
        self._config.calibration_mode == "conformal"
        and self._calibrator is not None
        and self._calibrator.is_calibrated
    ):
        raw_distances = self._compute_all_distances(embeddings, predicted_classes)
        if self._config.calibration_method == "mondrian":
            p_values = self._calibrator.predict_pvalues_for_class(
                raw_distances, predicted_classes
            )
        else:
            p_values = self._calibrator.predict_pvalues(raw_distances)

        for idx in range(len(embeddings)):
            metric = self._compute_mahalanobis_metrics(
                idx,
                embeddings[idx],
                predicted_classes[idx],
            )
            metric["p_value"] = float(p_values[idx])
            metric["calibration_mode"] = "conformal"
            metrics[idx] = metric

            if p_values[idx] < self._config.calibration_alpha:
                flags.add(idx)
    else:
        for idx in range(len(embeddings)):
            metric = self._compute_mahalanobis_metrics(
                idx,
                embeddings[idx],
                predicted_classes[idx],
            )
            metrics[idx] = metric

            if metric["mahalanobis_distance"] >= self._config.threshold:
                flags.add(idx)

    return flags, metrics
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    return 0.35
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)


novelentitymatcher.novelty.strategies.uncertainty

Uncertainty-based novelty detection strategy.

Flags samples based on prediction uncertainty using margin and entropy.

Classes

UncertaintyStrategy()

Bases: NoveltyStrategy

Uncertainty-based strategy for novelty detection.

Flags samples as novel if their prediction uncertainty exceeds configured thresholds (margin or entropy).

Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def __init__(self):
    self._config: UncertaintyConfig | None = None
Attributes
config_schema property

Return UncertaintyConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the uncertainty strategy.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (not used). Required.
- reference_labels (list[str]): Labels of known samples (not used). Required.
- config (UncertaintyConfig): UncertaintyConfig with thresholds. Required.
Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: UncertaintyConfig,
) -> None:
    """
    Initialize the uncertainty strategy.

    Args:
        reference_embeddings: Embeddings of known samples (not used)
        reference_labels: Labels of known samples (not used)
        config: UncertaintyConfig with thresholds
    """
    self._config = config or UncertaintyConfig()
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using uncertainty metrics.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings (not used). Required.
- predicted_classes (list[str]): Predicted classes (not used). Required.
- confidences (ndarray): Prediction confidence scores. Required.
- **kwargs: Additional parameters, may include 'all_probs' for the full distribution. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using uncertainty metrics.

    Args:
        texts: Input texts
        embeddings: Text embeddings (not used)
        predicted_classes: Predicted classes (not used)
        confidences: Prediction confidence scores
        **kwargs: Additional parameters, may include 'all_probs' for full distribution

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    flags = set()
    metrics = {}

    # Check if we have full probability distributions
    all_probs = kwargs.get("all_probs", None)

    for idx, confidence in enumerate(confidences):
        metric = self._compute_uncertainty_metrics(
            idx,
            confidence,
            all_probs[idx] if all_probs is not None else None,
        )
        metrics[idx] = metric

        # Check if uncertainty exceeds thresholds
        is_novel = (
            metric["margin_score"] < self._config.margin_threshold
            or metric["entropy_score"] > self._config.entropy_threshold
        )

        if is_novel:
            flags.add(idx)

    return flags, metrics
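
The _compute_uncertainty_metrics helper is not reproduced here, but margin and entropy are standard quantities; the sketch below shows the usual definitions, while normalization details in the real helper may differ.

import numpy as np

# Standard margin and entropy computations; the real
# _compute_uncertainty_metrics may scale or normalize these differently.
def margin_and_entropy(probs: np.ndarray) -> tuple[float, float]:
    """probs: full class-probability distribution for one sample."""
    top2 = np.sort(probs)[-2:]
    margin = float(top2[1] - top2[0])  # top-1 minus top-2 probability
    entropy = float(-np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0))))
    return margin, entropy

# A peaked distribution: large margin, low entropy -> not novel.
margin_and_entropy(np.array([0.9, 0.05, 0.05]))
# A flat distribution: small margin, high entropy -> likely flagged.
margin_and_entropy(np.array([0.34, 0.33, 0.33]))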
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # Uncertainty is a strong signal
    return 0.35
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.confidence

Confidence threshold-based novelty detection strategy.

Flags samples with prediction confidence below a threshold as novel.

Classes

ConfidenceStrategy()

Bases: NoveltyStrategy

Confidence threshold strategy for novelty detection.

Flags samples as novel if their prediction confidence falls below a configured threshold.

Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def __init__(self):
    self._config: ConfidenceConfig | None = None
Attributes
config_schema property

Return ConfidenceConfig as the config schema.

Functions
initialize(reference_embeddings, reference_labels, config)

Initialize the confidence strategy.

Parameters:

- reference_embeddings (ndarray): Embeddings of known samples (not used). Required.
- reference_labels (list[str]): Labels of known samples (not used). Required.
- config (ConfidenceConfig): ConfidenceConfig with threshold parameter. Required.
Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def initialize(
    self,
    reference_embeddings: np.ndarray,
    reference_labels: list[str],
    config: ConfidenceConfig,
) -> None:
    """
    Initialize the confidence strategy.

    Args:
        reference_embeddings: Embeddings of known samples (not used)
        reference_labels: Labels of known samples (not used)
        config: ConfidenceConfig with threshold parameter
    """
    self._config = config or ConfidenceConfig()
detect(texts, embeddings, predicted_classes, confidences, **kwargs)

Detect novel samples using confidence threshold.

Parameters:

- texts (list[str]): Input texts. Required.
- embeddings (ndarray): Text embeddings (not used). Required.
- predicted_classes (list[str]): Predicted classes (not used). Required.
- confidences (ndarray): Prediction confidence scores. Required.
- **kwargs: Additional parameters. Default: {}.

Returns:

- tuple[set[int], dict[int, dict[str, Any]]]: (flags, metrics) - Flagged indices and per-sample metrics

Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def detect(
    self,
    texts: list[str],
    embeddings: np.ndarray,
    predicted_classes: list[str],
    confidences: np.ndarray,
    **kwargs,
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    """
    Detect novel samples using confidence threshold.

    Args:
        texts: Input texts
        embeddings: Text embeddings (not used)
        predicted_classes: Predicted classes (not used)
        confidences: Prediction confidence scores
        **kwargs: Additional parameters

    Returns:
        (flags, metrics) - Flagged indices and per-sample metrics
    """
    flags = set()
    metrics = {}

    for idx, confidence in enumerate(confidences):
        is_novel = confidence < self._config.threshold

        if is_novel:
            flags.add(idx)

        metrics[idx] = {
            "confidence_score": float(confidence),
            "confidence_is_novel": is_novel,
        }

    return flags, metrics
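
A usage sketch follows. ConfidenceConfig's import path and threshold kwarg are assumptions inferred from the self._config.threshold access in detect() above.

import numpy as np

# ConfidenceConfig's location and kwargs are assumptions; the strategy's
# module path matches the source path shown above.
from novelentitymatcher.novelty.strategies.confidence import (
    ConfidenceConfig,
    ConfidenceStrategy,
)

strategy = ConfidenceStrategy()
strategy.initialize(
    reference_embeddings=np.empty((0, 0)),  # unused by this strategy
    reference_labels=[],
    config=ConfidenceConfig(threshold=0.7),
)
flags, metrics = strategy.detect(
    texts=["a", "b"],
    embeddings=np.empty((2, 0)),  # unused
    predicted_classes=["x", "y"],  # unused
    confidences=np.array([0.9, 0.4]),
)
assert flags == {1}  # only the 0.4-confidence sample falls below 0.7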
get_weight()

Return weight for signal combination.

Source code in src/novelentitymatcher/novelty/strategies/confidence.py
def get_weight(self) -> float:
    """Return weight for signal combination."""
    # Confidence is a foundational signal, give it moderate weight
    return 0.35
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.self_knowledge

Self-knowledge detection strategy wrapper.

Wraps SelfKnowledgeDetector to implement NoveltyStrategy protocol.

Classes

SelfKnowledgeStrategy()

Bases: NoveltyStrategy

Self-knowledge strategy for novelty detection.

Uses a sparse autoencoder to learn representations of known samples and flags samples with high reconstruction error as novel.

Source code in src/novelentitymatcher/novelty/strategies/self_knowledge.py
def __init__(self):
    self._config: SelfKnowledgeConfig | None = None
    self._detector: SelfKnowledgeDetector | None = None
Functions
get_config()

Get the current configuration for this strategy.

Override this if your strategy stores its config differently.

Source code in src/novelentitymatcher/novelty/strategies/base.py
def get_config(self) -> Any:
    """
    Get the current configuration for this strategy.

    Override this if your strategy stores its config differently.
    """
    return getattr(self, "_config", None)

novelentitymatcher.novelty.strategies.conformal

Conformal prediction-based calibration for OOD detection strategies.

Wraps raw strategy scores with statistically grounded p-values, enabling rigorous routing of out-of-distribution inputs.

Classes

ConformalCalibrator(alpha=0.1, method='split')

Calibrate raw OOD scores into conformal p-values.

Supports two methods:

- "split": Holds out a fraction of reference data for calibration.
- "mondrian": Uses class-conditional (Mondrian) conformal calibration with per-class nonconformity distributions.

Usage::

    cal = ConformalCalibrator(alpha=0.1, method="split")
    cal.calibrate(raw_scores, labels)
    pvals = cal.predict_pvalues(test_scores)
Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def __init__(
    self,
    alpha: float = 0.1,
    method: Literal["mondrian", "split"] = "split",
):
    self.alpha = alpha
    self.method = method
    self._nonconformity_scores: np.ndarray | None = None
    self._class_scores: dict[str, np.ndarray] = {}
    self._n_calibration: int = 0
    self._is_calibrated: bool = False
Attributes
calibration_metadata property

Return calibration metadata for reproducibility.

Functions
calibrate(scores, labels)

Compute nonconformity scores from calibration data.

Parameters:

- scores (ndarray): Raw OOD scores for calibration samples, shape (n_samples,). Higher scores indicate more anomalous / novel samples. Required.
- labels (ndarray): Class labels for calibration samples, shape (n_samples,). Required.

Returns:

- ConformalCalibrator: Self for fluent chaining.

Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def calibrate(
    self,
    scores: np.ndarray,
    labels: np.ndarray,
) -> ConformalCalibrator:
    """Compute nonconformity scores from calibration data.

    Args:
        scores: Raw OOD scores for calibration samples, shape (n_samples,).
                Higher scores indicate more anomalous / novel.
        labels: Class labels for calibration samples, shape (n_samples,).

    Returns:
        Self for fluent chaining.
    """
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)

    if self.method == "mondrian":
        self._calibrate_mondrian(scores, labels)
    else:
        self._nonconformity_scores = np.sort(scores)
        self._n_calibration = len(scores)

    self._is_calibrated = True
    logger.info(
        "Conformal calibration complete: method=%s, n=%d, alpha=%.3f",
        self.method,
        self._n_calibration,
        self.alpha,
    )
    return self
predict_pvalues(scores)

Convert raw OOD scores to calibrated p-values.

Parameters:

- scores (ndarray): Raw scores for test samples, shape (n_samples,). Required.

Returns:

- ndarray: p-values, shape (n_samples,). Lower p-value = more likely OOD.

Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def predict_pvalues(self, scores: np.ndarray) -> np.ndarray:
    """Convert raw OOD scores to calibrated p-values.

    Args:
        scores: Raw scores for test samples, shape (n_samples,).

    Returns:
        p-values, shape (n_samples,). Lower p-value = more likely OOD.
    """
    if not self._is_calibrated:
        raise RuntimeError(
            "Calibrator has not been calibrated. Call calibrate() first."
        )

    scores = np.asarray(scores, dtype=np.float64)

    if self.method == "mondrian":
        return self._predict_mondrian_pvalues(scores)

    return self._compute_pvalues(scores, self._nonconformity_scores)
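
The private _compute_pvalues helper is not shown on this page. The standard split-conformal p-value is p = (1 + #{calibration scores >= test score}) / (n + 1); a sketch under that assumption:

import numpy as np

# Standard split-conformal p-value; whether _compute_pvalues handles ties
# or smoothing exactly this way is an assumption.
def compute_pvalues(test_scores: np.ndarray, cal_scores: np.ndarray) -> np.ndarray:
    """cal_scores must be sorted ascending (as calibrate() stores them)."""
    n = len(cal_scores)
    # For each test score, count how many calibration scores are >= it.
    greater_equal = n - np.searchsorted(cal_scores, test_scores, side="left")
    return (1.0 + greater_equal) / (n + 1.0)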
predict_pvalues_for_class(scores, predicted_classes)

Compute class-conditional p-values when predicted classes are known.

Parameters:

- scores (ndarray): Raw OOD scores for test samples. Required.
- predicted_classes (list[str]): Predicted class for each sample. Required.

Returns:

- ndarray: p-values, shape (n_samples,).

Source code in src/novelentitymatcher/novelty/strategies/conformal.py
def predict_pvalues_for_class(
    self,
    scores: np.ndarray,
    predicted_classes: list[str],
) -> np.ndarray:
    """Compute class-conditional p-values when predicted classes are known.

    Args:
        scores: Raw OOD scores for test samples.
        predicted_classes: Predicted class for each sample.

    Returns:
        p-values, shape (n_samples,).
    """
    if not self._is_calibrated:
        raise RuntimeError(
            "Calibrator has not been calibrated. Call calibrate() first."
        )

    scores = np.asarray(scores, dtype=np.float64)
    pvalues = np.empty(len(scores))

    all_cal = (
        np.sort(np.concatenate(list(self._class_scores.values())))
        if self._class_scores
        else self._nonconformity_scores
    )

    for i, (score, pred_class) in enumerate(
        zip(scores, predicted_classes, strict=False)
    ):
        class_cal = self._class_scores.get(str(pred_class))
        if class_cal is not None and len(class_cal) > 0:
            pvalues[i] = self._compute_pvalues(np.array([score]), class_cal)[0]
        elif all_cal is not None:
            pvalues[i] = self._compute_pvalues(np.array([score]), all_cal)[0]
        else:
            pvalues[i] = 1.0

    return pvalues
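
Class-conditional calibration follows the same flow as the split example in the class docstring; a short usage sketch:

import numpy as np

from novelentitymatcher.novelty.strategies.conformal import ConformalCalibrator

# Usage sketch for Mondrian (class-conditional) calibration.
cal = ConformalCalibrator(alpha=0.1, method="mondrian")
cal.calibrate(
    scores=np.array([0.2, 0.3, 0.8, 0.9]),
    labels=np.array(["a", "a", "b", "b"]),
)
pvals = cal.predict_pvalues_for_class(
    scores=np.array([0.95]),
    predicted_classes=["b"],
)
is_novel = pvals[0] < cal.alpha  # flag when the p-value falls below alpha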
