Extraction

`novelentitymatcher.novelty.extraction.evidence`

Cluster evidence extractor with configurable extraction methods.

Supports three evidence methods for benchmarking:

- `"tfidf"`: TF-IDF keyword extraction (baseline).
- `"centroid"`: Terms closest to the cluster embedding centroid.
- `"combined"`: Union of tfidf + centroid with deduplication.
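The keyword logic itself is not shown on this page, so the following is only a minimal sketch of what the `"tfidf"` and `"combined"` methods could look like. The helper names `tfidf_keywords` and `combine_keywords`, the regex tokenizer, and the smoothed IDF formula are all assumptions for illustration, not the library's implementation.

```python
import math
import re
from collections import Counter


def tfidf_keywords(texts: list[str], max_keywords: int = 8) -> list[str]:
    """Rank terms by summed TF-IDF across a cluster (hypothetical helper)."""
    # Assumption: lowercase alphanumeric runs are good enough as tokens.
    docs = [re.findall(r"[a-z0-9]+", text.lower()) for text in texts]
    n_docs = len(docs)
    # Document frequency: the number of texts each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            # Smoothed IDF keeps terms that occur in every text at a positive weight.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[term] += (count / len(doc)) * idf
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:max_keywords]


def combine_keywords(
    primary: list[str], secondary: list[str], max_keywords: int = 8
) -> list[str]:
    """Order-preserving union with deduplication (hypothetical helper)."""
    merged = list(dict.fromkeys(primary + secondary))
    return merged[:max_keywords]


texts = ["acme widget pro", "acme widget mini", "acme gadget"]
tfidf_terms = tfidf_keywords(texts, max_keywords=3)
# Stand-in for a centroid-based ranking, which would need embeddings.
centroid_terms = ["widget", "acme", "mini"]
combined_terms = combine_keywords(tfidf_terms, centroid_terms)
```

In this sketch the combined list keeps the TF-IDF ordering first and appends only those centroid terms that are not already present, which is one plausible reading of "union with deduplication".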

Classes

`ClusterEvidenceExtractor(method='tfidf', max_keywords=8, max_examples=4, token_budget=256)`

Extract compact evidence from a cluster of text samples.

Parameters:

Name	Type	Description	Default
`method`	`str`	Extraction method - `"tfidf"`, `"centroid"`, or `"combined"`.	`'tfidf'`
`max_keywords`	`int`	Maximum keywords to return.	`8`
`max_examples`	`int`	Maximum representative examples.	`4`
`token_budget`	`int`	Soft token budget for representative examples.	`256`

Source code in src/novelentitymatcher/novelty/extraction/evidence.py

def __init__(
    self,
    method: str = "tfidf",
    max_keywords: int = 8,
    max_examples: int = 4,
    token_budget: int = 256,
):
    if method not in ("tfidf", "centroid", "combined"):
        raise ValueError(
            f"Invalid evidence method: '{method}'. "
            "Must be 'tfidf', 'centroid', or 'combined'."
        )
    self.method = method
    self.max_keywords = max_keywords
    self.max_examples = max_examples
    self.token_budget = token_budget
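The constructor only stores `max_examples` and `token_budget`; how representative examples are chosen is not shown here. As a rough illustration of what a "soft token budget" might mean, the sketch below picks examples greedily. The function name `select_representatives`, the whitespace token count, and the rule that the first example is always kept are assumptions, not the library's behaviour.

```python
def select_representatives(
    texts: list[str], max_examples: int = 4, token_budget: int = 256
) -> list[str]:
    """Greedily pick examples under a soft token budget (hypothetical helper)."""
    selected: list[str] = []
    used = 0
    for text in texts:
        if len(selected) >= max_examples:
            break
        # Assumption: one whitespace-separated word counts as one token.
        cost = len(text.split())
        # "Soft" in this sketch: the first example is always kept, even if it
        # alone exceeds the budget; later examples are skipped when too long.
        if selected and used + cost > token_budget:
            continue
        selected.append(text)
        used += cost
    return selected


examples = select_representatives(
    ["a b c", "d e f g", "h", "i j"], max_examples=3, token_budget=5
)
```

With a budget of 5, this keeps the first text (3 tokens), skips the 4-token text, keeps the 1-token text, and skips the last one because it would push the total to 6.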

Functions

`extract(cluster_texts, cluster_embeddings=None, reference_embeddings=None)`

Extract evidence from a single cluster.

Parameters:

Name	Type	Description	Default
`cluster_texts`	`list[str]`	Text samples in the cluster.	required
`cluster_embeddings`	`ndarray | None`	Embeddings for the cluster samples.	`None`
`reference_embeddings`	`ndarray | None`	Optional reference embeddings (not yet used).	`None`

Returns:

Type	Description
`ClusterEvidence`	ClusterEvidence with keywords, examples, and metadata.

Source code in src/novelentitymatcher/novelty/extraction/evidence.py

def extract(
    self,
    cluster_texts: list[str],
    cluster_embeddings: np.ndarray | None = None,
    reference_embeddings: np.ndarray | None = None,
) -> ClusterEvidence:
    """Extract evidence from a single cluster.

    Args:
        cluster_texts: Text samples in the cluster.
        cluster_embeddings: Embeddings for the cluster samples.
        reference_embeddings: Optional reference embeddings (not yet used).

    Returns:
        ClusterEvidence with keywords, examples, and metadata.
    """
    keywords = self._extract_keywords(cluster_texts, cluster_embeddings)
    representatives = self._select_representatives(
        cluster_texts, cluster_embeddings
    )

    return ClusterEvidence(
        keywords=keywords,
        representative_examples=representatives,
        sample_indices=list(range(len(cluster_texts))),
        metadata={
            "evidence_method": self.method,
            "sample_count": len(cluster_texts),
        },
        token_budget=self.token_budget,
    )
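Note that `sample_indices` in the listing above is always `list(range(len(cluster_texts)))`, so the indices are positions within the cluster, not positions in any larger dataset. The sketch below shows one way a caller might map them back. `EvidenceSketch` is a stand-in whose fields mirror the keyword arguments passed to `ClusterEvidence` above (the real class definition is not shown on this page), and `build_evidence` and `to_dataset_ids` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvidenceSketch:
    """Stand-in for ClusterEvidence; field names are taken from extract()."""

    keywords: list[str]
    representative_examples: list[str]
    sample_indices: list[int]
    metadata: dict[str, Any] = field(default_factory=dict)
    token_budget: int = 256


def build_evidence(
    cluster_texts: list[str],
    keywords: list[str],
    representatives: list[str],
    method: str = "tfidf",
    token_budget: int = 256,
) -> EvidenceSketch:
    """Assemble the result the same way the extract() listing does."""
    return EvidenceSketch(
        keywords=keywords,
        representative_examples=representatives,
        sample_indices=list(range(len(cluster_texts))),
        metadata={"evidence_method": method, "sample_count": len(cluster_texts)},
        token_budget=token_budget,
    )


def to_dataset_ids(evidence: EvidenceSketch, member_ids: list[int]) -> list[int]:
    """Translate cluster-local indices into dataset-level ids (hypothetical)."""
    return [member_ids[i] for i in evidence.sample_indices]


cluster_texts = ["acme widget pro", "acme widget mini", "acme gadget"]
member_ids = [17, 42, 99]  # Assumed dataset rows that make up this cluster.
evidence = build_evidence(
    cluster_texts, ["acme", "widget"], cluster_texts[:2], method="combined"
)
dataset_ids = to_dataset_ids(evidence, member_ids)
```

Keeping the cluster membership list alongside the evidence is the caller's responsibility in this sketch, since the result carries only cluster-local positions.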
