Novelty Detection

novelentitymatcher.novelty.entity_matcher

Primary orchestration API for classification plus novel-class detection.

This module promotes NovelEntityMatcher to the main public entry point for novelty-aware matching. It wraps a fitted Matcher together with the multi-signal NoveltyDetector and optional LLMClassProposer.

Classes

NovelEntityMatchResult(id, score, is_match, is_novel, novel_score=None, match_method='accepted_known', alternatives=list(), signals=dict(), predicted_id=None, metadata=dict()) dataclass

Operational result for a single novelty-aware match decision.

NovelEntityMatcher(entities=None, *, matcher=None, model='potion-32m', mode='zero-shot', acceptance_threshold=None, detection_config=None, llm_provider=None, llm_model=None, llm_api_keys=None, output_dir='./proposals', auto_save=True, match_threshold=None, novelty_strategy='confidence', confidence_threshold=0.3, knn_k=5, knn_distance_threshold=0.6, min_cluster_size=5, use_novelty_detector=True, review_storage_path='./proposals/review_records.json')

Bases: DiscoveryBase

Primary public API for novelty-aware matching.

The class orchestrates three stages:

1. Retrieve rich matcher metadata with top-k candidates and embeddings.
2. Score novelty using ANN-backed multi-signal detection.
3. Optionally propose new class names for novel batches.
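
The three stages can be illustrated with a toy pipeline. This is a minimal sketch, not the library's implementation: `cosine`, `ToyResult`, `run_pipeline`, and the `proposer` callable are hypothetical names, and a single similarity threshold stands in for the ANN-backed multi-signal scoring of stage 2.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import sqrt
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


@dataclass
class ToyResult:  # hypothetical, loosely mirrors NovelEntityMatchResult
    id: str | None
    score: float
    is_match: bool
    is_novel: bool


def run_pipeline(
    queries: list[list[float]],
    known: dict[str, list[float]],
    acceptance_threshold: float = 0.5,
    proposer: Callable[[list[int]], list[str]] | None = None,
) -> tuple[list[ToyResult], list[str]]:
    results = []
    for query in queries:
        # Stage 1: rank known entities by similarity to the query.
        ranked = sorted(
            ((cosine(query, vec), name) for name, vec in known.items()),
            reverse=True,
        )
        best_score, best_id = ranked[0]
        # Stage 2 (stand-in): one threshold replaces multi-signal scoring.
        is_novel = best_score < acceptance_threshold
        results.append(
            ToyResult(
                id=None if is_novel else best_id,
                score=best_score,
                is_match=not is_novel,
                is_novel=is_novel,
            )
        )
    # Stage 3 (optional): hand the indices of novel items to a name proposer.
    novel_indices = [i for i, r in enumerate(results) if r.is_novel]
    proposals = proposer(novel_indices) if proposer and novel_indices else []
    return results, proposals


known = {"apple": [1.0, 0.0], "banana": [0.0, 1.0]}
results, proposals = run_pipeline(
    [[0.9, 0.1], [-1.0, -1.0]],
    known,
    proposer=lambda idx: [f"proposed_class_{i}" for i in idx],
)
```

The first query is accepted as a known entity, while the second falls below the threshold for every known entity and is passed to the proposer.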

Source code in src/novelentitymatcher/novelty/entity_matcher.py
def __init__(
    self,
    entities: list[dict[str, Any]] | None = None,
    *,
    matcher: Matcher | None = None,
    model: str = "potion-32m",
    mode: str = "zero-shot",
    acceptance_threshold: float | None = None,
    detection_config: DetectionConfig | dict[str, Any] | None = None,
    llm_provider: str | None = None,
    llm_model: str | None = None,
    llm_api_keys: dict[str, str] | None = None,
    output_dir: str = "./proposals",
    auto_save: bool = True,
    match_threshold: float | None = None,
    novelty_strategy: str = "confidence",
    confidence_threshold: float = 0.3,
    knn_k: int = 5,
    knn_distance_threshold: float = 0.6,
    min_cluster_size: int = 5,
    use_novelty_detector: bool = True,
    review_storage_path: str = "./proposals/review_records.json",
):
    if matcher is None:
        if entities is None:
            raise ValueError("entities is required when matcher is not provided")
        threshold = (
            acceptance_threshold
            if acceptance_threshold is not None
            else (match_threshold if match_threshold is not None else 0.5)
        )
        matcher = Matcher(
            entities=entities,
            model=model,
            mode=mode,
            threshold=threshold,
        )

    self.matcher = matcher
    self.entities = (
        entities if entities is not None else list(getattr(matcher, "entities", []))
    )
    self.acceptance_threshold = (
        acceptance_threshold
        if acceptance_threshold is not None
        else (
            match_threshold
            if match_threshold is not None
            else getattr(self.matcher, "threshold", 0.5)
        )
    )
    self.output_dir = output_dir
    self.auto_save = auto_save
    self.use_novelty_detector = use_novelty_detector

    self.detection_config = self._coerce_detection_config(
        detection_config=detection_config,
        novelty_strategy=novelty_strategy,
        confidence_threshold=confidence_threshold,
        knn_k=knn_k,
        knn_distance_threshold=knn_distance_threshold,
        min_cluster_size=min_cluster_size,
    )
    self.detector = NoveltyDetector(config=self.detection_config)
    clustering_config = self.detection_config.clustering or ClusteringConfig(
        min_cluster_size=min_cluster_size
    )
    self.clusterer = ScalableClusterer(
        min_cluster_size=clustering_config.min_cluster_size
    )
    self.llm_proposer = LLMClassProposer(
        primary_model=llm_model,
        provider=llm_provider,
        api_keys=llm_api_keys,
    )
    self.review_manager = ProposalReviewManager(review_storage_path)
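
The constructor above resolves the acceptance threshold by precedence: `acceptance_threshold` wins, then `match_threshold`, then a fallback (0.5 when a new `Matcher` is built, or the supplied matcher's `threshold` attribute when one is passed in). A minimal sketch of that rule follows; `resolve_threshold` is a hypothetical helper name, not part of the library.

```python
from __future__ import annotations


def resolve_threshold(
    acceptance_threshold: float | None = None,
    match_threshold: float | None = None,
    fallback: float = 0.5,
) -> float:
    """Return the first explicitly provided threshold, else the fallback."""
    # `is not None` matters: an explicit 0.0 must not be treated as "unset".
    if acceptance_threshold is not None:
        return acceptance_threshold
    if match_threshold is not None:
        return match_threshold
    return fallback


both_given = resolve_threshold(0.7, 0.2)
legacy_only = resolve_threshold(match_threshold=0.2)
neither = resolve_threshold()
explicit_zero = resolve_threshold(0.0, 0.9)
```

Because the checks compare against `None` rather than relying on truthiness, an explicit threshold of `0.0` is honored instead of silently falling through to `match_threshold`.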

Functions

novelentitymatcher.novelty.core.detector

Core novelty detector with strategy orchestration.

This is the main entry point for novelty detection, using a strategy pattern to support multiple detection algorithms.

Classes

NoveltyDetector(config)

Simplified novelty detector using registered strategies.

This detector manages strategy initialization and orchestration, delegating signal combination and metadata building to specialized components.

Responsibilities:

- Strategy initialization and lifecycle
- Strategy orchestration
- Delegates signal combining to SignalCombiner
- Delegates metadata creation to MetadataBuilder
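
As a rough illustration of the strategy pattern described here, the sketch below validates strategy IDs against a registry at construction time and instantiates them later. Everything in it is an assumption made for illustration: `STRATEGY_REGISTRY`, `ConfidenceStrategy`, and `ToyDetector` are hypothetical names, and the real registry and strategy interfaces may differ.

```python
from __future__ import annotations

from typing import Any


class ConfidenceStrategy:
    """Flags samples whose confidence falls below a threshold."""

    def __init__(self, threshold: float = 0.3) -> None:
        self.threshold = threshold

    def detect(
        self, confidences: list[float]
    ) -> tuple[set[int], dict[int, dict[str, Any]]]:
        flags = {i for i, c in enumerate(confidences) if c < self.threshold}
        metrics = {i: {"confidence": c} for i, c in enumerate(confidences)}
        return flags, metrics


# Hypothetical registry mapping strategy IDs to strategy classes.
STRATEGY_REGISTRY: dict[str, type] = {"confidence": ConfidenceStrategy}


class ToyDetector:
    def __init__(self, strategy_ids: list[str]) -> None:
        # Validate up front so a misconfiguration fails at construction time.
        unknown = [s for s in strategy_ids if s not in STRATEGY_REGISTRY]
        if unknown:
            raise ValueError(f"Unknown strategies: {unknown}")
        self.strategy_ids = list(strategy_ids)
        self._strategies: dict[str, Any] = {}

    def initialize(self) -> None:
        # Lifecycle step: build one instance per configured strategy.
        for sid in self.strategy_ids:
            self._strategies[sid] = STRATEGY_REGISTRY[sid]()

    def detect(self, confidences: list[float]) -> dict[str, Any]:
        # Orchestration only: each strategy does its own scoring.
        return {sid: s.detect(confidences) for sid, s in self._strategies.items()}


detector = ToyDetector(["confidence"])
detector.initialize()
outputs = detector.detect([0.1, 0.9])
```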

Parameters:

Name    Type             Description              Default
config  DetectionConfig  Detection configuration  required
Source code in src/novelentitymatcher/novelty/core/detector.py
def __init__(self, config: DetectionConfig):
    """
    Initialize the novelty detector.

    Args:
        config: Detection configuration
    """
    # Validate configuration
    config.validate_strategies()

    self.config = config
    self._strategies: dict[str, Any] = {}
    self._combiner = SignalCombiner(config)
    self._metadata_builder = MetadataBuilder()
    self._is_initialized = False
    self._reference_signature: str | None = None
Attributes
is_initialized property

Check if detector has been initialized with reference data.

Functions
detect_novel_samples(texts, confidences, embeddings, predicted_classes, reference_embeddings=None, reference_labels=None, **kwargs)

Detect novel samples using configured strategies.

Parameters:

Name                  Type              Description                                            Default
texts                 list[str]         Input texts                                            required
confidences           ndarray           Prediction confidence scores                           required
embeddings            ndarray           Text embeddings                                        required
predicted_classes     list[str]         Predicted class for each sample                        required
reference_embeddings  ndarray | None    Embeddings of known samples (RuntimeError if None)     None
reference_labels      list[str] | None  Class labels for known samples (RuntimeError if None)  None
**kwargs                                Additional strategy-specific parameters                {}

Returns:

Type               Description
NovelSampleReport  NovelSampleReport with detection results

Source code in src/novelentitymatcher/novelty/core/detector.py
def detect_novel_samples(
    self,
    texts: list[str],
    confidences: np.ndarray,
    embeddings: np.ndarray,
    predicted_classes: list[str],
    reference_embeddings: np.ndarray | None = None,
    reference_labels: list[str] | None = None,
    **kwargs,
) -> NovelSampleReport:
    """
    Detect novel samples using configured strategies.

    Args:
        texts: Input texts
        confidences: Prediction confidence scores
        embeddings: Text embeddings
        predicted_classes: Predicted class for each sample
        reference_embeddings: Embeddings of known samples
        reference_labels: Class labels for known samples
        **kwargs: Additional strategy-specific parameters

    Returns:
        NovelSampleReport with detection results
    """
    if reference_embeddings is None or reference_labels is None:
        raise RuntimeError("reference embeddings and labels are required")

    if len(texts) == 0:
        return NovelSampleReport(
            novel_samples=[],
            detection_strategies=list(self.config.strategies),
            config=self.config.model_dump(),
            signal_counts=dict.fromkeys(self.config.strategies, 0),
        )

    reference_signature = self._compute_reference_signature(
        reference_embeddings,
        reference_labels,
    )

    # Initialize strategies if needed or if the reference corpus changed.
    if not self._is_initialized or self._reference_signature != reference_signature:
        self._initialize_strategies(reference_embeddings, reference_labels)

    # Run each strategy
    all_flags: set[int] = set()
    all_metrics: dict[int, dict[str, Any]] = {}
    strategy_outputs: dict[str, tuple[set[int], dict]] = {}

    for strategy_id, strategy in self._strategies.items():
        flags, metrics = strategy.detect(
            texts=texts,
            embeddings=embeddings,
            predicted_classes=predicted_classes,
            confidences=confidences,
            **kwargs,
        )
        strategy_outputs[strategy_id] = (flags, metrics)
        all_flags.update(flags)

        # Merge metrics
        for idx, metric_dict in metrics.items():
            if idx not in all_metrics:
                all_metrics[idx] = {}
            all_metrics[idx].update(metric_dict)

    # Combine signals
    novel_indices, novelty_scores = self._combiner.combine(
        strategy_outputs=strategy_outputs,
        all_metrics=all_metrics,
    )

    # Build report
    report = self._metadata_builder.build_report(
        texts=texts,
        confidences=confidences,
        predicted_classes=predicted_classes,
        novel_indices=novel_indices,
        novelty_scores=novelty_scores,
        all_metrics=all_metrics,
        strategy_outputs=strategy_outputs,
        config=self.config,
    )

    return report
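
The loop in the listing above unions the flags from every strategy and merges their per-sample metrics before handing both to the combiner. The sketch below reproduces that merge on toy data. The two strategy functions and the vote-based `combine_by_votes` are assumptions for illustration only; how `SignalCombiner` actually combines signals is not shown in this page.

```python
from __future__ import annotations

from typing import Any

StrategyOutput = tuple[set[int], dict[int, dict[str, Any]]]


def low_confidence(confidences: list[float], threshold: float = 0.3) -> StrategyOutput:
    flags = {i for i, c in enumerate(confidences) if c < threshold}
    return flags, {i: {"confidence": c} for i, c in enumerate(confidences)}


def far_from_neighbors(distances: list[float], threshold: float = 0.6) -> StrategyOutput:
    flags = {i for i, d in enumerate(distances) if d > threshold}
    return flags, {i: {"knn_distance": d} for i, d in enumerate(distances)}


def merge_outputs(
    strategy_outputs: dict[str, StrategyOutput],
) -> tuple[set[int], dict[int, dict[str, Any]]]:
    all_flags: set[int] = set()
    all_metrics: dict[int, dict[str, Any]] = {}
    for flags, metrics in strategy_outputs.values():
        all_flags.update(flags)
        for idx, metric_dict in metrics.items():
            # update() means a later strategy overwrites same-named keys.
            all_metrics.setdefault(idx, {}).update(metric_dict)
    return all_flags, all_metrics


def combine_by_votes(
    strategy_outputs: dict[str, StrategyOutput], min_votes: int = 2
) -> tuple[list[int], dict[int, float]]:
    # Hypothetical combiner: score = fraction of strategies that flagged.
    votes: dict[int, int] = {}
    for flags, _ in strategy_outputs.values():
        for idx in flags:
            votes[idx] = votes.get(idx, 0) + 1
    total = len(strategy_outputs)
    scores = {idx: count / total for idx, count in votes.items()}
    novel = sorted(idx for idx, count in votes.items() if count >= min_votes)
    return novel, scores


strategy_outputs = {
    "confidence": low_confidence([0.9, 0.2, 0.1]),
    "knn": far_from_neighbors([0.1, 0.3, 0.8]),
}
all_flags, all_metrics = merge_outputs(strategy_outputs)
novel_indices, novelty_scores = combine_by_votes(strategy_outputs)
```

Since metrics are merged with `update`, strategies should use distinct metric keys; otherwise the value from whichever strategy runs last is the one that survives.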
reset()

Reset the detector, clearing all initialized strategies.

This allows the detector to be re-used with different reference data.

Source code in src/novelentitymatcher/novelty/core/detector.py
def reset(self) -> None:
    """
    Reset the detector, clearing all initialized strategies.

    This allows the detector to be re-used with different reference data.
    """
    self._strategies.clear()
    self._is_initialized = False
    self._reference_signature = None
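
`detect_novel_samples` re-initializes strategies whenever the reference signature changes, and `reset()` clears that cached signature. The body of `_compute_reference_signature` is not shown on this page, so the hashing scheme below is purely an assumption, as are the names `reference_signature` and `CachingDetector`. The sketch only demonstrates the caching behavior.

```python
from __future__ import annotations

import hashlib
import struct


def reference_signature(embeddings: list[list[float]], labels: list[str]) -> str:
    """Hypothetical signature: SHA-256 over embedding values and labels."""
    digest = hashlib.sha256()
    for row in embeddings:
        digest.update(struct.pack(f"{len(row)}d", *row))
    for label in labels:
        digest.update(label.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] != ["a", "bc"]
    return digest.hexdigest()


class CachingDetector:
    def __init__(self) -> None:
        self._is_initialized = False
        self._reference_signature: str | None = None
        self.init_count = 0  # counts how often strategies were (re)built

    def ensure_initialized(
        self, embeddings: list[list[float]], labels: list[str]
    ) -> None:
        signature = reference_signature(embeddings, labels)
        if not self._is_initialized or self._reference_signature != signature:
            self.init_count += 1  # stand-in for _initialize_strategies(...)
            self._reference_signature = signature
            self._is_initialized = True

    def reset(self) -> None:
        self._is_initialized = False
        self._reference_signature = None


detector = CachingDetector()
refs, labels = [[1.0, 0.0], [0.0, 1.0]], ["apple", "banana"]
detector.ensure_initialized(refs, labels)
detector.ensure_initialized(refs, labels)            # same corpus: cache hit
count_after_repeat = detector.init_count
detector.ensure_initialized(refs, ["apple", "pear"])  # labels changed
count_after_change = detector.init_count
detector.reset()
detector.ensure_initialized(refs, ["apple", "pear"])  # forced by reset()
count_after_reset = detector.init_count
```
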
get_strategy(strategy_id)

Get an initialized strategy by ID.

Parameters:

Name         Type  Description          Default
strategy_id  str   Strategy identifier  required

Returns:

Type  Description
Any   Strategy instance if initialized

Raises:

Type        Description
ValueError  If strategy not found or not initialized

Source code in src/novelentitymatcher/novelty/core/detector.py
def get_strategy(self, strategy_id: str) -> Any:
    """
    Get an initialized strategy by ID.

    Args:
        strategy_id: Strategy identifier

    Returns:
        Strategy instance if initialized

    Raises:
        ValueError: If strategy not found or not initialized
    """
    if strategy_id not in self._strategies:
        available = ", ".join(self._strategies.keys())
        raise ValueError(
            f"Strategy '{strategy_id}' not initialized. Available: {available}"
        )
    return self._strategies[strategy_id]
list_initialized_strategies()

List all initialized strategies.

Returns:

Type       Description
list[str]  List of strategy IDs

Source code in src/novelentitymatcher/novelty/core/detector.py
def list_initialized_strategies(self) -> list[str]:
    """
    List all initialized strategies.

    Returns:
        List of strategy IDs
    """
    return list(self._strategies.keys())
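
Because strategies are only initialized inside `detect_novel_samples`, calling `get_strategy` on a fresh detector raises `ValueError` with an empty "Available" list, and `list_initialized_strategies` returns an empty list. The sketch below mimics that behavior with a hypothetical `LazyDetector`; the strategy ID `"knn"` and the parameterless `detect()` are illustrative assumptions.

```python
from __future__ import annotations

from typing import Any


class LazyDetector:
    def __init__(self, strategy_ids: list[str]) -> None:
        self._strategy_ids = list(strategy_ids)
        self._strategies: dict[str, Any] = {}

    def detect(self) -> None:
        # Stand-in for detect_novel_samples: the first call builds strategies.
        if not self._strategies:
            self._strategies = {sid: object() for sid in self._strategy_ids}

    def get_strategy(self, strategy_id: str) -> Any:
        if strategy_id not in self._strategies:
            available = ", ".join(self._strategies.keys())
            raise ValueError(
                f"Strategy '{strategy_id}' not initialized. Available: {available}"
            )
        return self._strategies[strategy_id]

    def list_initialized_strategies(self) -> list[str]:
        return list(self._strategies.keys())


detector = LazyDetector(["confidence", "knn"])
before = detector.list_initialized_strategies()
try:
    detector.get_strategy("knn")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
detector.detect()
after = detector.list_initialized_strategies()
```

A defensive caller can check `list_initialized_strategies()` first, or run detection once before inspecting individual strategies.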