Novelty Detection

`novelentitymatcher.novelty.entity_matcher`

Primary orchestration API for classification plus novel-class detection.

This module promotes NovelEntityMatcher to the main public entry point for novelty-aware matching. It wraps a fitted Matcher together with the multi-signal NoveltyDetector and optional LLMClassProposer.

Classes

`NovelEntityMatchResult(id, score, is_match, is_novel, novel_score=None, match_method='accepted_known', alternatives=list(), signals=dict(), predicted_id=None, metadata=dict())` `dataclass`

Operational result for a single novelty-aware match decision.
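As a rough illustration of how a result with this signature could be declared, the sketch below mirrors the field names and defaults listed above. The class name `NovelEntityMatchResultSketch` and every field type are assumptions, and the rendered `list()` and `dict()` defaults are assumed to correspond to `default_factory` fields.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class NovelEntityMatchResultSketch:
    """Illustrative stand-in; field types are assumptions, not the library's."""

    id: str | None
    score: float
    is_match: bool
    is_novel: bool
    novel_score: float | None = None
    match_method: str = "accepted_known"
    # Mutable defaults use default_factory so instances do not share state.
    alternatives: list[dict[str, Any]] = field(default_factory=list)
    signals: dict[str, Any] = field(default_factory=dict)
    predicted_id: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


first = NovelEntityMatchResultSketch(
    id="acme", score=0.91, is_match=True, is_novel=False
)
second = NovelEntityMatchResultSketch(
    id=None, score=0.12, is_match=False, is_novel=True, novel_score=0.8
)
first.signals["confidence"] = 0.91  # does not affect second.signals
```

Because the mutable defaults come from factories, mutating `first.signals` leaves `second.signals` empty.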

`NovelEntityMatcher(entities=None, *, matcher=None, model='potion-32m', mode='zero-shot', acceptance_threshold=None, detection_config=None, llm_provider=None, llm_model=None, llm_api_keys=None, output_dir='./proposals', auto_save=True, match_threshold=None, novelty_strategy='confidence', confidence_threshold=0.3, knn_k=5, knn_distance_threshold=0.6, min_cluster_size=5, use_novelty_detector=True, review_storage_path='./proposals/review_records.json')`

Bases: DiscoveryBase

Primary public API for novelty-aware matching.

The class orchestrates three stages:

1. Retrieve rich matcher metadata with top-k candidates and embeddings.
2. Score novelty using ANN-backed multi-signal detection.
3. Optionally propose new class names for novel batches.

Source code in src/novelentitymatcher/novelty/entity_matcher.py

def __init__(
    self,
    entities: list[dict[str, Any]] | None = None,
    *,
    matcher: Matcher | None = None,
    model: str = "potion-32m",
    mode: str = "zero-shot",
    acceptance_threshold: float | None = None,
    detection_config: DetectionConfig | dict[str, Any] | None = None,
    llm_provider: str | None = None,
    llm_model: str | None = None,
    llm_api_keys: dict[str, str] | None = None,
    output_dir: str = "./proposals",
    auto_save: bool = True,
    match_threshold: float | None = None,
    novelty_strategy: str = "confidence",
    confidence_threshold: float = 0.3,
    knn_k: int = 5,
    knn_distance_threshold: float = 0.6,
    min_cluster_size: int = 5,
    use_novelty_detector: bool = True,
    review_storage_path: str = "./proposals/review_records.json",
):
    if matcher is None:
        if entities is None:
            raise ValueError("entities is required when matcher is not provided")
        threshold = (
            acceptance_threshold
            if acceptance_threshold is not None
            else (match_threshold if match_threshold is not None else 0.5)
        )
        matcher = Matcher(
            entities=entities,
            model=model,
            mode=mode,
            threshold=threshold,
        )

    self.matcher = matcher
    self.entities = (
        entities if entities is not None else list(getattr(matcher, "entities", []))
    )
    self.acceptance_threshold = (
        acceptance_threshold
        if acceptance_threshold is not None
        else (
            match_threshold
            if match_threshold is not None
            else getattr(self.matcher, "threshold", 0.5)
        )
    )
    self.output_dir = output_dir
    self.auto_save = auto_save
    self.use_novelty_detector = use_novelty_detector

    self.detection_config = self._coerce_detection_config(
        detection_config=detection_config,
        novelty_strategy=novelty_strategy,
        confidence_threshold=confidence_threshold,
        knn_k=knn_k,
        knn_distance_threshold=knn_distance_threshold,
        min_cluster_size=min_cluster_size,
    )
    self.detector = NoveltyDetector(config=self.detection_config)
    clustering_config = self.detection_config.clustering or ClusteringConfig(
        min_cluster_size=min_cluster_size
    )
    self.clusterer = ScalableClusterer(
        min_cluster_size=clustering_config.min_cluster_size
    )
    self.llm_proposer = LLMClassProposer(
        primary_model=llm_model,
        provider=llm_provider,
        api_keys=llm_api_keys,
    )
    self.review_manager = ProposalReviewManager(review_storage_path)
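
The constructor resolves the acceptance threshold with the same precedence in both places it is computed: an explicit `acceptance_threshold` wins, then `match_threshold`, then a fallback. A minimal sketch of that precedence follows; the helper name `resolve_threshold` is hypothetical and not part of the library.

```python
from __future__ import annotations


def resolve_threshold(
    acceptance_threshold: float | None = None,
    match_threshold: float | None = None,
    fallback: float = 0.5,
) -> float:
    """Return the first threshold that is not None, else the fallback."""
    if acceptance_threshold is not None:
        return acceptance_threshold
    if match_threshold is not None:
        return match_threshold
    return fallback


print(resolve_threshold(0.7, 0.4))    # 0.7: acceptance_threshold takes priority
print(resolve_threshold(None, 0.4))   # 0.4: match_threshold is used next
print(resolve_threshold())            # 0.5: neither was given
print(resolve_threshold(0.0, 0.4))    # 0.0: zero is a valid explicit value
```

The checks use `is not None`, so an explicit `0.0` is honoured rather than treated as missing. In the constructor above, the fallback is `0.5` when a new `Matcher` is built, while `self.acceptance_threshold` falls back to the wrapped matcher's `threshold` attribute, or `0.5` if that attribute is absent.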

Functions

`novelentitymatcher.novelty.core.detector`

Core novelty detector with strategy orchestration.

This is the main entry point for novelty detection, using a strategy pattern to support multiple detection algorithms.

Classes

`NoveltyDetector(config)`

Simplified novelty detector using registered strategies.

This detector manages strategy initialization and orchestration, delegating signal combination and metadata building to specialized components.

Responsibilities:

- Strategy initialization and lifecycle
- Strategy orchestration
- Delegates signal combining to SignalCombiner
- Delegates metadata creation to MetadataBuilder
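
To make the "registered strategies" idea concrete, here is a minimal sketch of a strategy registry with validation. Everything in it is hypothetical: `STRATEGY_REGISTRY`, `register_strategy`, `ConfidenceStrategySketch`, and this standalone `validate_strategies` are illustrative names, and the below-threshold rule is an assumed detection rule. Only the `(flags, metrics)` return shape follows the detector's source.

```python
from __future__ import annotations

# Hypothetical registry mapping strategy IDs to strategy classes.
STRATEGY_REGISTRY: dict[str, type] = {}


def register_strategy(strategy_id: str):
    """Class decorator that records a strategy under the given ID."""

    def decorator(cls: type) -> type:
        STRATEGY_REGISTRY[strategy_id] = cls
        return cls

    return decorator


@register_strategy("confidence")
class ConfidenceStrategySketch:
    """Flags samples whose confidence falls below a threshold (assumed rule)."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold

    def detect(
        self, confidences: list[float]
    ) -> tuple[set[int], dict[int, dict[str, float]]]:
        # flags: indices considered novel; metrics: per-index diagnostics.
        flags = {i for i, c in enumerate(confidences) if c < self.threshold}
        metrics = {i: {"confidence": c} for i, c in enumerate(confidences)}
        return flags, metrics


def validate_strategies(requested: list[str]) -> None:
    """Reject any requested ID that was never registered."""
    unknown = [s for s in requested if s not in STRATEGY_REGISTRY]
    if unknown:
        raise ValueError(f"Unknown strategies: {', '.join(unknown)}")


requested = ["confidence"]
validate_strategies(requested)
strategies = {sid: STRATEGY_REGISTRY[sid]() for sid in requested}
flags, metrics = strategies["confidence"].detect([0.9, 0.2, 0.25])
print(sorted(flags))  # [1, 2]
```

Keeping validation separate from instantiation means a misspelled strategy ID fails before any reference data is processed.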

Parameters:

Name	Type	Description	Default
`config`	`DetectionConfig`	Detection configuration	required

Source code in src/novelentitymatcher/novelty/core/detector.py

def __init__(self, config: DetectionConfig):
    """
    Initialize the novelty detector.

    Args:
        config: Detection configuration
    """
    # Validate configuration
    config.validate_strategies()

    self.config = config
    self._strategies: dict[str, Any] = {}
    self._combiner = SignalCombiner(config)
    self._metadata_builder = MetadataBuilder()
    self._is_initialized = False
    self._reference_signature: str | None = None

Attributes

`is_initialized` `property`

Whether the detector has been initialized with reference data.

Functions

`detect_novel_samples(texts, confidences, embeddings, predicted_classes, reference_embeddings=None, reference_labels=None, **kwargs)`

Detect novel samples using configured strategies.

Parameters:

Name	Type	Description	Default
`texts`	`list[str]`	Input texts	required
`confidences`	`ndarray`	Prediction confidence scores	required
`embeddings`	`ndarray`	Text embeddings	required
`predicted_classes`	`list[str]`	Predicted class for each sample	required
`reference_embeddings`	`ndarray | None`	Embeddings of known samples	`None`
`reference_labels`	`list[str] | None`	Class labels for known samples	`None`
`**kwargs`		Additional strategy-specific parameters	`{}`

Returns:

Type	Description
`NovelSampleReport`	Report containing the detection results

Source code in src/novelentitymatcher/novelty/core/detector.py

def detect_novel_samples(
    self,
    texts: list[str],
    confidences: np.ndarray,
    embeddings: np.ndarray,
    predicted_classes: list[str],
    reference_embeddings: np.ndarray | None = None,
    reference_labels: list[str] | None = None,
    **kwargs,
) -> NovelSampleReport:
    """
    Detect novel samples using configured strategies.

    Args:
        texts: Input texts
        confidences: Prediction confidence scores
        embeddings: Text embeddings
        predicted_classes: Predicted class for each sample
        reference_embeddings: Embeddings of known samples
        reference_labels: Class labels for known samples
        **kwargs: Additional strategy-specific parameters

    Returns:
        NovelSampleReport with detection results
    """
    if reference_embeddings is None or reference_labels is None:
        raise RuntimeError("reference embeddings and labels are required")

    if len(texts) == 0:
        return NovelSampleReport(
            novel_samples=[],
            detection_strategies=list(self.config.strategies),
            config=self.config.model_dump(),
            signal_counts=dict.fromkeys(self.config.strategies, 0),
        )

    reference_signature = self._compute_reference_signature(
        reference_embeddings,
        reference_labels,
    )

    # Initialize strategies if needed or if the reference corpus changed.
    if not self._is_initialized or self._reference_signature != reference_signature:
        self._initialize_strategies(reference_embeddings, reference_labels)

    # Run each strategy
    all_flags: set[int] = set()
    all_metrics: dict[int, dict[str, Any]] = {}
    strategy_outputs: dict[str, tuple[set[int], dict]] = {}

    for strategy_id, strategy in self._strategies.items():
        flags, metrics = strategy.detect(
            texts=texts,
            embeddings=embeddings,
            predicted_classes=predicted_classes,
            confidences=confidences,
            **kwargs,
        )
        strategy_outputs[strategy_id] = (flags, metrics)
        all_flags.update(flags)

        # Merge metrics
        for idx, metric_dict in metrics.items():
            if idx not in all_metrics:
                all_metrics[idx] = {}
            all_metrics[idx].update(metric_dict)

    # Combine signals
    novel_indices, novelty_scores = self._combiner.combine(
        strategy_outputs=strategy_outputs,
        all_metrics=all_metrics,
    )

    # Build report
    report = self._metadata_builder.build_report(
        texts=texts,
        confidences=confidences,
        predicted_classes=predicted_classes,
        novel_indices=novel_indices,
        novelty_scores=novelty_scores,
        all_metrics=all_metrics,
        strategy_outputs=strategy_outputs,
        config=self.config,
    )

    return report
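
The merge step in the loop above can be seen in isolation with plain data. The sketch below hard-codes the outputs of two strategies; the `knn_distance` identifier and all metric values are made up for illustration.

```python
from __future__ import annotations

from typing import Any

# Hypothetical per-strategy outputs: (flagged indices, per-index metrics).
strategy_outputs: dict[str, tuple[set[int], dict[int, dict[str, Any]]]] = {
    "confidence": ({0, 2}, {0: {"confidence": 0.21}, 2: {"confidence": 0.18}}),
    "knn_distance": ({2, 3}, {2: {"knn_distance": 0.74}, 3: {"knn_distance": 0.81}}),
}

all_flags: set[int] = set()
all_metrics: dict[int, dict[str, Any]] = {}

for strategy_id, (flags, metrics) in strategy_outputs.items():
    # Union of every index flagged by any strategy.
    all_flags.update(flags)
    # Per-index metric dicts are merged key by key.
    for idx, metric_dict in metrics.items():
        all_metrics.setdefault(idx, {}).update(metric_dict)

print(sorted(all_flags))  # [0, 2, 3]
print(all_metrics[2])     # {'confidence': 0.18, 'knn_distance': 0.74}
```

Flags are unioned across strategies, and metrics for the same sample index are merged with `dict.update`, so a later strategy overwrites an earlier one if both emit the same metric key.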

`reset()`

Reset the detector, clearing all initialized strategies.

This allows the detector to be re-used with different reference data.

Source code in src/novelentitymatcher/novelty/core/detector.py

def reset(self) -> None:
    """
    Reset the detector, clearing all initialized strategies.

    This allows the detector to be re-used with different reference data.
    """
    self._strategies.clear()
    self._is_initialized = False
    self._reference_signature = None
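
The interplay between `reset()` and the reference-signature check in `detect_novel_samples` can be sketched as follows. The fingerprint function here is an assumption: the source does not show how `_compute_reference_signature` works, so a SHA-256 over the data's text representation stands in for it, and `CachedDetectorSketch` is a hypothetical class.

```python
from __future__ import annotations

import hashlib


def reference_signature(embeddings: list[list[float]], labels: list[str]) -> str:
    """Hypothetical fingerprint of a reference corpus (not the library's method)."""
    digest = hashlib.sha256()
    digest.update(repr(embeddings).encode("utf-8"))
    digest.update(b"\x00")  # separator between the two parts
    digest.update("\x1f".join(labels).encode("utf-8"))
    return digest.hexdigest()


class CachedDetectorSketch:
    """Re-initializes only when the reference corpus fingerprint changes."""

    def __init__(self) -> None:
        self._is_initialized = False
        self._reference_signature: str | None = None
        self.init_count = 0

    def ensure_initialized(
        self, embeddings: list[list[float]], labels: list[str]
    ) -> None:
        signature = reference_signature(embeddings, labels)
        if not self._is_initialized or self._reference_signature != signature:
            self.init_count += 1  # stands in for strategy initialization
            self._is_initialized = True
            self._reference_signature = signature

    def reset(self) -> None:
        self._is_initialized = False
        self._reference_signature = None


detector = CachedDetectorSketch()
refs, labels = [[0.1, 0.2], [0.3, 0.4]], ["a", "b"]
detector.ensure_initialized(refs, labels)
detector.ensure_initialized(refs, labels)      # same corpus: no re-init
detector.ensure_initialized(refs, ["a", "c"])  # labels changed: re-init
detector.reset()
detector.ensure_initialized(refs, ["a", "c"])  # after reset: re-init
print(detector.init_count)  # 3
```

Under this scheme an explicit `reset()` is only needed to force re-initialization on an unchanged corpus, since a changed corpus already triggers it.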

`get_strategy(strategy_id)`

Get an initialized strategy by ID.

Parameters:

Name	Type	Description	Default
`strategy_id`	`str`	Strategy identifier	required

Returns:

Type	Description
`Any`	Strategy instance if initialized

Raises:

Type	Description
`ValueError`	If strategy not found or not initialized

Source code in src/novelentitymatcher/novelty/core/detector.py

def get_strategy(self, strategy_id: str) -> Any:
    """
    Get an initialized strategy by ID.

    Args:
        strategy_id: Strategy identifier

    Returns:
        Strategy instance if initialized

    Raises:
        ValueError: If strategy not found or not initialized
    """
    if strategy_id not in self._strategies:
        available = ", ".join(self._strategies.keys())
        raise ValueError(
            f"Strategy '{strategy_id}' not initialized. Available: {available}"
        )
    return self._strategies[strategy_id]

`list_initialized_strategies()`

List all initialized strategies.

Returns:

Type	Description
`list[str]`	List of strategy IDs

Source code in src/novelentitymatcher/novelty/core/detector.py

def list_initialized_strategies(self) -> list[str]:
    """
    List all initialized strategies.

    Returns:
        List of strategy IDs
    """
    return list(self._strategies.keys())
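
A caller can combine `list_initialized_strategies()` with `get_strategy()` to avoid the `ValueError`. The sketch below uses a hypothetical stand-in class, `StrategyRegistrySketch`, that copies the two method bodies shown above so it runs without the library; the strategy IDs are illustrative.

```python
from __future__ import annotations

from typing import Any


class StrategyRegistrySketch:
    """Stand-in that mirrors the two lookup methods shown above."""

    def __init__(self, strategies: dict[str, Any]) -> None:
        self._strategies = dict(strategies)

    def list_initialized_strategies(self) -> list[str]:
        return list(self._strategies.keys())

    def get_strategy(self, strategy_id: str) -> Any:
        if strategy_id not in self._strategies:
            available = ", ".join(self._strategies.keys())
            raise ValueError(
                f"Strategy '{strategy_id}' not initialized. Available: {available}"
            )
        return self._strategies[strategy_id]


registry = StrategyRegistrySketch({"confidence": object()})

# Guarded lookup: check membership first instead of catching the exception.
wanted = "knn_distance"
strategy = (
    registry.get_strategy(wanted)
    if wanted in registry.list_initialized_strategies()
    else None
)

# Unguarded lookup: the error message lists what is available.
try:
    registry.get_strategy(wanted)
except ValueError as exc:
    message = str(exc)

print(message)  # Strategy 'knn_distance' not initialized. Available: confidence
```

Note that before any strategies are initialized the "Available:" list in the message is empty.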
