Extraction
novelentitymatcher.novelty.extraction.evidence
Cluster evidence extractor with configurable extraction methods.
Supports three evidence methods for benchmarking:
- "tfidf": TF-IDF keyword extraction (baseline).
- "centroid": Terms closest to cluster embedding centroid.
- "combined": Union of tfidf + centroid with deduplication.
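The three methods listed above can be illustrated with a minimal, standard-library sketch. This is not the module's implementation: the function names (`tfidf_keywords`, `centroid_keywords`, `combined_keywords`), the whitespace tokenizer, the smoothed IDF formula, and the caller-supplied `term_vectors` mapping are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(texts, max_keywords=8):
    """Rank terms by TF-IDF summed over the cluster's texts (hypothetical helper)."""
    docs = [t.lower().split() for t in texts]  # assumption: whitespace tokenization
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
            scores[term] += (count / len(doc)) * idf
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_keywords]]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid_keywords(term_vectors, sample_vectors, max_keywords=8):
    """Return the terms whose vectors are closest to the mean sample embedding.

    term_vectors is a hypothetical {term: vector} mapping supplied by the caller.
    """
    dim = len(sample_vectors[0])
    centroid = [sum(v[i] for v in sample_vectors) / len(sample_vectors) for i in range(dim)]
    ranked = sorted(term_vectors, key=lambda t: (-_cosine(term_vectors[t], centroid), t))
    return ranked[:max_keywords]


def combined_keywords(tfidf_terms, centroid_terms, max_keywords=8):
    """Union of both lists, deduplicated while preserving first-seen order."""
    return list(dict.fromkeys(list(tfidf_terms) + list(centroid_terms)))[:max_keywords]
```

In this sketch the combined method keeps TF-IDF terms first and appends centroid terms that were not already present; the actual ordering used by the library may differ.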
Classes
ClusterEvidenceExtractor(method='tfidf', max_keywords=8, max_examples=4, token_budget=256)
Extract compact evidence from a cluster of text samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `method` | `str` | Extraction method - | `'tfidf'` |
| `max_keywords` | `int` | Maximum keywords to return. | `8` |
| `max_examples` | `int` | Maximum representative examples. | `4` |
| `token_budget` | `int` | Soft token budget for representative examples. | `256` |
Source code in src/novelentitymatcher/novelty/extraction/evidence.py
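To make the interaction between `max_examples` and `token_budget` concrete, here is a minimal sketch of example selection under a soft budget. It is an illustration only: the helper name `select_examples`, the whitespace token count, and the reading of "soft" as "always keep at least the first example" are assumptions, not the behaviour documented for `ClusterEvidenceExtractor`.

```python
def select_examples(texts, max_examples=4, token_budget=256):
    """Pick up to max_examples texts whose combined token count fits the budget."""
    selected = []
    used = 0
    for text in texts:
        if len(selected) >= max_examples:
            break
        cost = len(text.split())  # assumption: whitespace tokens as a rough proxy
        # Assumed "soft" behaviour: the first example is always kept,
        # later ones are skipped if they would exceed the budget.
        if selected and used + cost > token_budget:
            continue
        selected.append(text)
        used += cost
    return selected
```

Skipping an oversized text rather than stopping lets shorter examples later in the list still fill the remaining budget.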
Functions
extract(cluster_texts, cluster_embeddings=None, reference_embeddings=None)
Extract evidence from a single cluster.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cluster_texts` | `list[str]` | Text samples in the cluster. | required |
| `cluster_embeddings` | `ndarray | None` | Embeddings for the cluster samples. | `None` |
| `reference_embeddings` | `ndarray | None` | Optional reference embeddings (not yet used). | `None` |
Returns:
| Type | Description |
|---|---|
| `ClusterEvidence` | `ClusterEvidence` with keywords, examples, and metadata. |
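A possible call sequence, using only the constructor and `extract` signatures documented above, might look like the sketch below. The import is guarded so the snippet still runs where the package is not installed; the sample texts and the `extract_demo` wrapper are hypothetical, and the attribute names on the returned `ClusterEvidence` are not shown because this page does not list them.

```python
try:
    from novelentitymatcher.novelty.extraction.evidence import ClusterEvidenceExtractor
except ImportError:  # package not installed in this environment
    ClusterEvidenceExtractor = None

# Hypothetical sample cluster, for illustration only.
CLUSTER_TEXTS = [
    "refund not received after order cancellation",
    "cancelled my order but the refund is still pending",
    "how long does a refund take after cancelling",
]


def extract_demo(texts=CLUSTER_TEXTS):
    """Run TF-IDF evidence extraction if the library is available, else return None."""
    if ClusterEvidenceExtractor is None:
        return None
    extractor = ClusterEvidenceExtractor(
        method="tfidf",      # baseline method; no embeddings passed here
        max_keywords=5,
        max_examples=2,
        token_budget=128,
    )
    return extractor.extract(texts)


if __name__ == "__main__":
    print(extract_demo())
```

For the `"centroid"` and `"combined"` methods, `cluster_embeddings` would presumably need to be passed as well, since those methods rely on the cluster's embedding centroid.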