Extraction

novelentitymatcher.novelty.extraction.evidence

Cluster evidence extractor with configurable extraction methods.

Supports three evidence methods for benchmarking:

- "tfidf": TF-IDF keyword extraction (baseline).
- "centroid": Terms closest to the cluster embedding centroid.
- "combined": Union of tfidf + centroid with deduplication.
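To make the "tfidf" and "combined" ideas concrete, here is a minimal standard-library sketch. It is not the module's implementation: the helper names `tokenize`, `tfidf_keywords`, and `combine_keywords`, the smoothed IDF formula, and the tie-breaking rule are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(texts, max_keywords=8):
    """Rank terms by summed TF-IDF across the cluster (hypothetical helper)."""
    docs = [tokenize(t) for t in texts]
    n_docs = len(docs)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            # Smoothed IDF, so a term present in every text still scores > 0.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            scores[term] += (count / len(doc)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_keywords]]


def combine_keywords(*keyword_lists, max_keywords=8):
    """Order-preserving union of keyword lists, truncated to max_keywords."""
    merged = list(dict.fromkeys(k for kws in keyword_lists for k in kws))
    return merged[:max_keywords]
```

Under these assumptions, a "combined" result would be `combine_keywords(tfidf_terms, centroid_terms)`: terms from the first list keep their rank, and a term that appears in both lists is emitted only once.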

Classes

ClusterEvidenceExtractor(method='tfidf', max_keywords=8, max_examples=4, token_budget=256)

Extract compact evidence from a cluster of text samples.

Parameters:

- method (str, default 'tfidf'): Extraction method: "tfidf", "centroid", or "combined".
- max_keywords (int, default 8): Maximum keywords to return.
- max_examples (int, default 4): Maximum representative examples.
- token_budget (int, default 256): Soft token budget for representative examples.
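The page does not say how max_examples and the soft token_budget interact, so the following is only one plausible reading. It assumes whitespace-separated words stand in for tokens, and that "soft" means the first example is always kept even if it exceeds the budget on its own. `select_within_budget` is a hypothetical helper, not part of the library.

```python
def select_within_budget(examples, max_examples=4, token_budget=256):
    """Pick leading examples until the count or the token budget is reached."""
    selected = []
    used = 0
    for text in examples:
        if len(selected) >= max_examples:
            break
        # Assumption: approximate the token count by whitespace-separated words.
        cost = len(text.split())
        # Soft budget: never return nothing just because the first example is long.
        if selected and used + cost > token_budget:
            break
        selected.append(text)
        used += cost
    return selected
```

With this reading, max_examples is a hard cap on the count, while token_budget only trims the tail of the list.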
Source code in src/novelentitymatcher/novelty/extraction/evidence.py
def __init__(
    self,
    method: str = "tfidf",
    max_keywords: int = 8,
    max_examples: int = 4,
    token_budget: int = 256,
):
    if method not in ("tfidf", "centroid", "combined"):
        raise ValueError(
            f"Invalid evidence method: '{method}'. "
            "Must be 'tfidf', 'centroid', or 'combined'."
        )
    self.method = method
    self.max_keywords = max_keywords
    self.max_examples = max_examples
    self.token_budget = token_budget
Functions
extract(cluster_texts, cluster_embeddings=None, reference_embeddings=None)

Extract evidence from a single cluster.

Parameters:

- cluster_texts (list[str], required): Text samples in the cluster.
- cluster_embeddings (ndarray | None, default None): Embeddings for the cluster samples.
- reference_embeddings (ndarray | None, default None): Optional reference embeddings (not yet used).

Returns:

- ClusterEvidence: The cluster's keywords, representative examples, and metadata.

Source code in src/novelentitymatcher/novelty/extraction/evidence.py
def extract(
    self,
    cluster_texts: list[str],
    cluster_embeddings: np.ndarray | None = None,
    reference_embeddings: np.ndarray | None = None,
) -> ClusterEvidence:
    """Extract evidence from a single cluster.

    Args:
        cluster_texts: Text samples in the cluster.
        cluster_embeddings: Embeddings for the cluster samples.
        reference_embeddings: Optional reference embeddings (not yet used).

    Returns:
        ClusterEvidence with keywords, examples, and metadata.
    """
    keywords = self._extract_keywords(cluster_texts, cluster_embeddings)
    representatives = self._select_representatives(
        cluster_texts, cluster_embeddings
    )

    return ClusterEvidence(
        keywords=keywords,
        representative_examples=representatives,
        sample_indices=list(range(len(cluster_texts))),
        metadata={
            "evidence_method": self.method,
            "sample_count": len(cluster_texts),
        },
        token_budget=self.token_budget,
    )
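The shape of the returned object can be read off the constructor call above. The sketch below mirrors it with a stand-in dataclass so it runs without the library. `EvidenceSketch` and `assemble_evidence` are hypothetical names, and the real ClusterEvidence class may define more fields or validation than shown here.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvidenceSketch:
    """Hypothetical stand-in with the fields that extract() passes along."""

    keywords: list[str]
    representative_examples: list[str]
    sample_indices: list[int]
    metadata: dict[str, Any] = field(default_factory=dict)
    token_budget: int = 256


def assemble_evidence(cluster_texts, keywords, representatives,
                      method="tfidf", token_budget=256):
    """Build the result the same way the final return statement does."""
    return EvidenceSketch(
        keywords=keywords,
        representative_examples=representatives,
        # Positions within this cluster (0..n-1), not indices into a larger dataset.
        sample_indices=list(range(len(cluster_texts))),
        metadata={"evidence_method": method, "sample_count": len(cluster_texts)},
        token_budget=token_budget,
    )
```

One thing this makes visible: sample_indices is always `0..len(cluster_texts) - 1`, so a caller that needs the original dataset positions has to keep that mapping itself.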

Functions