Exceptions & Config¶
novelentitymatcher.exceptions
¶
Custom exceptions for novel_entity_matcher with helpful context and suggestions.
Classes¶
SemanticMatcherError
¶
Bases: Exception
Base exception for all novel_entity_matcher errors.
ValidationError(message, *, entity=None, field=None, suggestion=None)
¶
Bases: ValueError, SemanticMatcherError
Raised when input validation fails with helpful context.
Attributes:

| Name | Type | Description |
|---|---|---|
| `entity` | | The entity that failed validation (if applicable) |
| `field` | | The specific field that failed validation |
| `suggestion` | | Helpful suggestion for fixing the error |
Source code in src/novelentitymatcher/exceptions.py
TrainingError(message, *, training_mode=None, details=None)
¶
Bases: RuntimeError, SemanticMatcherError
Raised when training fails with diagnostic information.
Attributes:

| Name | Type | Description |
|---|---|---|
| `training_mode` | | The mode that was being trained |
| `details` | | Additional diagnostic information |
Source code in src/novelentitymatcher/exceptions.py
MatchingError
¶
Bases: RuntimeError, SemanticMatcherError
Raised when matching operations fail.
ModeError(message, *, invalid_mode=None, valid_modes=None)
¶
Bases: ValueError, SemanticMatcherError
Raised when matcher mode configuration is invalid.
Attributes:

| Name | Type | Description |
|---|---|---|
| `invalid_mode` | | The mode that was provided |
| `valid_modes` | | List of valid mode options |
Source code in src/novelentitymatcher/exceptions.py
LLMError(message, *, last_error=None, attempted_models=None)
¶
Bases: SemanticMatcherError
Raised when LLM operations fail after all retries.
Attributes:

| Name | Type | Description |
|---|---|---|
| `last_error` | | The last exception that caused all models to fail |
| `attempted_models` | | List of models that were attempted |
Source code in src/novelentitymatcher/exceptions.py
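The dual inheritance documented above (e.g. `ValidationError` deriving from both `ValueError` and `SemanticMatcherError`) lets callers catch errors either by built-in category or by library base class. A minimal sketch of that pattern — the constructor body here is illustrative, reconstructed from the documented attributes, not the library's actual source:

```python
class SemanticMatcherError(Exception):
    """Base exception for all matcher errors."""

class ValidationError(ValueError, SemanticMatcherError):
    """Input validation failure carrying helpful context."""
    def __init__(self, message, *, entity=None, field=None, suggestion=None):
        super().__init__(message)
        self.entity = entity
        self.field = field
        self.suggestion = suggestion

err = ValidationError(
    "empty entity name",
    field="name",
    suggestion="provide a non-empty string",
)

# The same error is catchable as a ValueError or as a SemanticMatcherError:
assert isinstance(err, ValueError)
assert isinstance(err, SemanticMatcherError)
assert err.suggestion == "provide a non-empty string"
```

This is why generic `except ValueError:` blocks in calling code keep working even when the library raises its own richer exception types.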
novelentitymatcher.config_registry
¶
novelentitymatcher.api
¶
Single import surface for the novel_entity_matcher public API.
Usage:

```python
from novelentitymatcher.api import *
```

or selective imports:

```python
from novelentitymatcher.api import (
    Matcher,
    NovelEntityMatcher,
    DiscoveryPipeline,
    PipelineConfig,
    DetectionConfig,
    NovelSampleMetadata,
    DiscoveryCluster,
    ClassProposal,
)
```
Classes¶
BERTClassifier(labels, model_name='distilbert-base-uncased', num_epochs=3, batch_size=16, learning_rate=2e-05, max_length=128, use_fp16=True)
¶
BERT-based text classifier using transformers library.
This classifier provides a drop-in alternative to SetFitClassifier with identical interface. It uses fine-tuned BERT models for classification, offering superior accuracy for complex pattern-driven tasks.
Example:

```python
from novelentitymatcher.core.bert_classifier import BERTClassifier

labels = ["DE", "FR", "US"]
clf = BERTClassifier(labels=labels, model_name="distilbert-base-uncased")
training_data = [
    {"text": "Germany", "label": "DE"},
    {"text": "France", "label": "FR"},
    {"text": "USA", "label": "US"},
]
clf.train(training_data, num_epochs=3)
prediction = clf.predict("Deutschland")   # "DE"
proba = clf.predict_proba("Deutschland")  # [0.02, 0.01, 0.97]
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `labels` | `list[str]` | List of class labels for classification. | *required* |
| `model_name` | `str` | HuggingFace model name or path. | `'distilbert-base-uncased'` |
| `num_epochs` | `int` | Number of training epochs. | `3` |
| `batch_size` | `int` | Training batch size. | `16` |
| `learning_rate` | `float` | Learning rate for training. | `2e-05` |
| `max_length` | `int` | Maximum sequence length for tokenization. | `128` |
| `use_fp16` | `bool` | Whether to use mixed precision training (faster, less memory). Only works on GPU. | `True` |
Source code in src/novelentitymatcher/core/bert_classifier.py
Functions¶
train(training_data, num_epochs=None, batch_size=None, show_progress=True)
¶
Train the BERT classifier.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `training_data` | `list[dict]` | List of training examples with 'text' and 'label' keys. | *required* |
| `num_epochs` | `int \| None` | Number of training epochs (overrides default). | `None` |
| `batch_size` | `int \| None` | Batch size for training (overrides default). | `None` |
| `show_progress` | `bool` | Whether to show progress bar during training. | `True` |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If training fails or data is invalid. |
Source code in src/novelentitymatcher/core/bert_classifier.py
predict(texts)
¶
Predict labels for input text(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | Single text string or list of text strings. | *required* |

Returns:

| Type | Description |
|---|---|
| `str \| list[str]` | Predicted label(s). If the input is a single string, returns a single label; if the input is a list, returns a list of labels. |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If model is not trained yet. |
Source code in src/novelentitymatcher/core/bert_classifier.py
predict_proba(text)
¶
Get prediction probabilities for all labels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text string. | *required* |

Returns:

| Type | Description |
|---|---|
| `ndarray` | NumPy array of probabilities for each label, in the same order as `self.labels`. |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If model is not trained yet. |
Source code in src/novelentitymatcher/core/bert_classifier.py
save(path)
¶
Save the trained model and tokenizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Directory path to save the model. | *required* |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If model is not trained yet. |
Source code in src/novelentitymatcher/core/bert_classifier.py
load(path)
classmethod
¶
Load a trained BERTClassifier from disk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Directory path containing the saved model. | *required* |

Returns:

| Type | Description |
|---|---|
| `BERTClassifier` | Loaded BERTClassifier instance. |
Source code in src/novelentitymatcher/core/bert_classifier.py
EmbeddingMatcher(entities, model_name='sentence-transformers/paraphrase-mpnet-base-v2', threshold=0.7, normalize=True, embedding_dim=None, cache=None)
¶
Embedding-based similarity matching without training.
Source code in src/novelentitymatcher/core/embedding_matcher.py
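Conceptually, `EmbeddingMatcher` embeds each entity once and matches a query by cosine similarity against the configured `threshold`. A toy sketch of that idea with hand-made vectors — the real class computes embeddings with a sentence-transformer model and supports caching, neither of which is shown here:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def best_match(query_vec, entity_vecs, threshold=0.7):
    """Return (entity_id, score) for the closest entity, or None below threshold."""
    entity_id, vec = max(entity_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]))
    score = cosine(query_vec, vec)
    return (entity_id, score) if score >= threshold else None

entities = {"germany": [1.0, 0.1, 0.0], "france": [0.0, 1.0, 0.2]}
assert best_match([0.9, 0.2, 0.0], entities)[0] == "germany"
assert best_match([0.0, 0.0, 1.0], entities) is None  # nothing similar enough
```

The `threshold=0.7` default mirrors the constructor signature above; with no training step, matching quality depends entirely on the embedding model.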
HierarchicalScoring(hierarchy_index, alpha=0.7, beta=0.3)
¶
Calculate hierarchy-aware confidence scores.
Combines:

- Semantic similarity (cosine similarity of embeddings)
- Hierarchical proximity boost (based on relationship type)
- Depth penalty (deeper relationships = lower scores)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `hierarchy_index` | `HierarchyIndex` | HierarchyIndex for graph operations | *required* |
| `alpha` | `float` | Weight for semantic similarity (0-1) | `0.7` |
| `beta` | `float` | Weight for hierarchical boost (0-1) | `0.3` |
Source code in src/novelentitymatcher/core/hierarchy.py
Functions¶
compute_score(query_embedding, entity_embedding, entity_id, relationship_type='self', depth=0)
¶
Compute hierarchical score combining semantic and hierarchical features.
Formula:

```
final_score = (semantic_similarity * alpha + hierarchical_boost * beta) * depth_penalty
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query_embedding` | `ndarray` | Query text embedding | *required* |
| `entity_embedding` | `ndarray` | Entity text embedding | *required* |
| `entity_id` | `str` | Entity identifier | *required* |
| `relationship_type` | `str` | `"self"`, `"parent"`, `"child"`, `"ancestor"`, `"descendant"` | `'self'` |
| `depth` | `int` | Relationship depth (0 = self, 1 = direct, etc.) | `0` |

Returns:

| Type | Description |
|---|---|
| `float` | Final hierarchical score (0-1) |
Source code in src/novelentitymatcher/core/hierarchy.py
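The formula above is straightforward to work through by hand. This sketch plugs in hypothetical boost and penalty values (the `BOOSTS` table and the `1 / (1 + depth)` penalty shape are assumptions for illustration; the library's actual constants may differ):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical per-relationship boosts -- illustrative values only.
BOOSTS = {"self": 1.0, "parent": 0.8, "child": 0.8, "ancestor": 0.5, "descendant": 0.5}

def compute_score(query_emb, entity_emb, relationship_type="self", depth=0,
                  alpha=0.7, beta=0.3):
    semantic = cosine(query_emb, entity_emb)
    boost = BOOSTS.get(relationship_type, 0.0)
    depth_penalty = 1.0 / (1.0 + depth)  # assumed penalty shape
    return (semantic * alpha + boost * beta) * depth_penalty

q = [1.0, 0.0]
e = [1.0, 0.0]
self_score = compute_score(q, e)                    # identical text, self match
parent_score = compute_score(q, e, "parent", 1)     # same text, but via a parent link
assert abs(self_score - 1.0) < 1e-9
assert parent_score < self_score                    # depth and boost both reduce it
```

The key property: a perfect semantic match through a hierarchical relationship still scores lower than a direct self match, so exact entities always win ties.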
HierarchyIndex(entities)
¶
Graph-based index for hierarchical entity relationships.
Supports:

- Multi-parent hierarchies (DAG structure)
- Weighted edges for relationship strength
- Fast ancestor/descendant queries
- Path finding and depth calculation
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entities` | `list[dict[str, Any]]` | List of entity dicts with an optional `'hierarchy'` key. Hierarchy format: `{'parents': ['parent_id1', 'parent_id2'], 'children': ['child_id1', 'child_id2'], 'level': int, 'weights': {'parent_id': float}}` | *required* |
Source code in src/novelentitymatcher/core/hierarchy.py
Functions¶
get_ancestors(entity_id, max_depth=None)
¶
Get all ancestor entities for a given entity.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Entity to find ancestors for | *required* |
| `max_depth` | `int \| None` | Maximum depth to traverse (None = unlimited) | `None` |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of ancestor entity IDs |
Source code in src/novelentitymatcher/core/hierarchy.py
get_descendants(entity_id, max_depth=None)
¶
Get all descendant entities for a given entity.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Entity to find descendants for | *required* |
| `max_depth` | `int \| None` | Maximum depth to traverse (None = unlimited) | `None` |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of descendant entity IDs |
Source code in src/novelentitymatcher/core/hierarchy.py
get_relationship_depth(entity_a, entity_b)
¶
Calculate the depth of relationship between two entities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_a` | `str` | First entity ID | *required* |
| `entity_b` | `str` | Second entity ID | *required* |

Returns:

| Type | Description |
|---|---|
| `int` | Depth (0 = same entity, 1 = direct parent/child, 2 = grandparent, etc.). Returns -1 if no relationship is found. |
Source code in src/novelentitymatcher/core/hierarchy.py
get_path(from_entity, to_entity)
¶
Get shortest path between two entities in the hierarchy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `from_entity` | `str` | Starting entity ID | *required* |
| `to_entity` | `str` | Ending entity ID | *required* |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of entity IDs representing the path (inclusive). Returns an empty list if no path exists. |
Source code in src/novelentitymatcher/core/hierarchy.py
is_ancestor(ancestor_id, descendant_id)
¶
Check if ancestor_id is an ancestor of descendant_id.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ancestor_id` | `str` | Potential ancestor | *required* |
| `descendant_id` | `str` | Potential descendant | *required* |

Returns:

| Type | Description |
|---|---|
| `bool` | True if ancestor_id is an ancestor of descendant_id |
Source code in src/novelentitymatcher/core/hierarchy.py
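The queries above (`get_ancestors`, `get_relationship_depth`) are plain graph traversals over the parent/child DAG. A small stdlib-only sketch of the same operations — this is an illustration of the traversal logic, not the library's implementation, and it omits edge weights and levels:

```python
from collections import deque

class TinyHierarchyIndex:
    """Toy DAG index: entity_id -> list of parent ids."""

    def __init__(self, parents):
        self.parents = parents
        self.children = {}
        for child, ps in parents.items():
            for p in ps:
                self.children.setdefault(p, []).append(child)

    def get_ancestors(self, entity_id, max_depth=None):
        """Ancestors in breadth-first order, nearest first."""
        seen, out, frontier, depth = {entity_id}, [], [entity_id], 0
        while frontier and (max_depth is None or depth < max_depth):
            frontier = [p for e in frontier
                        for p in self.parents.get(e, []) if p not in seen]
            seen.update(frontier)
            out.extend(frontier)
            depth += 1
        return out

    def get_relationship_depth(self, a, b):
        """Shortest hop count between a and b; -1 if unrelated."""
        q, seen = deque([(a, 0)]), {a}
        while q:
            node, d = q.popleft()
            if node == b:
                return d
            for nxt in self.parents.get(node, []) + self.children.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    q.append((nxt, d + 1))
        return -1

idx = TinyHierarchyIndex({"cat": ["mammal"], "dog": ["mammal"], "mammal": ["animal"]})
assert idx.get_ancestors("cat") == ["mammal", "animal"]
assert idx.get_relationship_depth("cat", "animal") == 2
assert idx.get_relationship_depth("cat", "rock") == -1
```

Because parents are stored as lists, multi-parent entities (the DAG case) fall out of the same code path with no special handling.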
HDBSCANBackend(min_samples=5, cluster_selection_epsilon=0.0, metric='cosine', prediction_data=True)
¶
Bases: ClusteringBackend
HDBSCAN clustering backend.
Source code in src/novelentitymatcher/novelty/clustering/backends.py
SOPTICSBackend(min_samples=5, metric='cosine')
¶
Bases: ClusteringBackend
sOPTICS (LSH-accelerated OPTICS) clustering backend.
Source code in src/novelentitymatcher/novelty/clustering/backends.py
UMAPHDBSCANBackend(min_samples=5, cluster_selection_epsilon=0.0, n_neighbors=15, umap_dim=10, umap_metric='cosine', prediction_data=True)
¶
Bases: ClusteringBackend
UMAP preprocessing followed by HDBSCAN clustering backend.
Source code in src/novelentitymatcher/novelty/clustering/backends.py
ScalableClusterer(backend='auto', min_cluster_size=5, min_samples=5, cluster_selection_epsilon=0.0, n_neighbors=15, umap_dim=10, umap_metric='cosine', prediction_data=True)
¶
Wrapper for scalable density-based clustering.
Supports:

- HDBSCAN: Standard hierarchical DBSCAN (best for <100K points)
- sOPTICS: LSH-accelerated OPTICS (for 100K-1M points)
- UMAP+HDBSCAN: UMAP dimensionality reduction before HDBSCAN
- Auto: Automatic backend selection based on dataset size
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `backend` | `str` | Clustering backend (`'hdbscan'`, `'soptics'`, `'umap_hdbscan'`, `'auto'`) | `'auto'` |
| `min_cluster_size` | `int` | Minimum points to form a cluster. | `5` |
| `min_samples` | `int` | Min samples for core distance (OPTICS). | `5` |
| `cluster_selection_epsilon` | `float` | Distance threshold for cluster selection. | `0.0` |
| `n_neighbors` | `int` | Neighbors for UMAP (if used). | `15` |
| `umap_dim` | `int` | Target dimensionality for UMAP preprocessing. | `10` |
| `umap_metric` | `str` | Metric for UMAP. | `'cosine'` |
| `prediction_data` | `bool` | Whether to compute prediction_data for HDBSCAN. | `True` |
Source code in src/novelentitymatcher/novelty/clustering/scalable.py
Attributes¶
labels
property
¶
Get cluster labels.
probabilities
property
¶
Get cluster membership probabilities.
Functions¶
fit_predict(embeddings, metric='cosine')
¶
Fit clusterer and predict labels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeddings` | `ndarray` | Input embeddings (n_samples, dim) | *required* |
| `metric` | `str` | Distance metric (`'cosine'`, `'euclidean'`, `'precomputed'`) | `'cosine'` |

Returns:

| Type | Description |
|---|---|
| `tuple[ndarray, ndarray, dict[str, Any]]` | Tuple of (cluster_labels, probabilities, validation_info) |
Source code in src/novelentitymatcher/novelty/clustering/scalable.py
fit(embeddings, metric='cosine')
¶
Fit the clusterer (alias for compatibility).
get_cluster_members(cluster_id)
¶
Get indices of members in a specific cluster.
Source code in src/novelentitymatcher/novelty/clustering/scalable.py
get_noise_points()
¶
Get indices of noise points (label = -1).
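Given the size guidance in the class description (HDBSCAN below 100K points, sOPTICS for 100K-1M, dimensionality reduction beyond that), the `'auto'` backend choice can be sketched as a simple size dispatch. The exact cutoffs here are assumptions taken from that prose, not the library's verified thresholds:

```python
def select_backend(n_samples: int) -> str:
    """Pick a clustering backend from dataset size (illustrative cutoffs)."""
    if n_samples < 100_000:
        return "hdbscan"       # standard hierarchical DBSCAN
    elif n_samples < 1_000_000:
        return "soptics"       # LSH-accelerated OPTICS
    else:
        return "umap_hdbscan"  # reduce dimensionality before clustering

assert select_backend(10_000) == "hdbscan"
assert select_backend(500_000) == "soptics"
assert select_backend(5_000_000) == "umap_hdbscan"
```

Passing an explicit `backend=` string bypasses this dispatch entirely, which is useful when benchmarking one backend across dataset sizes.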
ClusterValidator(min_cohesion_threshold=0.45, min_persistence_threshold=0.1)
¶
Validates clustering results for novelty detection.
Provides metrics and validation methods to assess cluster quality and determine if samples represent novel clusters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `min_cohesion_threshold` | `float` | Minimum cohesion for valid clusters | `0.45` |
| `min_persistence_threshold` | `float` | Minimum persistence for valid clusters | `0.1` |
Source code in src/novelentitymatcher/novelty/clustering/validation.py
Functions¶
compute_cohesion(embeddings, labels, cluster_id)
¶
Compute cluster cohesion (compactness).
Cohesion is the average pairwise similarity within a cluster.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeddings` | `ndarray` | All embeddings | *required* |
| `labels` | `ndarray` | Cluster labels for each embedding | *required* |
| `cluster_id` | `int` | Cluster to compute cohesion for | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Cohesion score (0-1, higher = more compact) |
Source code in src/novelentitymatcher/novelty/clustering/validation.py
compute_separation(embeddings, labels, cluster_id)
¶
Compute cluster separation (distinctiveness from other clusters).
Separation is the minimum average distance to another cluster.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeddings` | `ndarray` | All embeddings | *required* |
| `labels` | `ndarray` | Cluster labels for each embedding | *required* |
| `cluster_id` | `int` | Cluster to compute separation for | *required* |

Returns:

| Type | Description |
|---|---|
| `float` | Separation score (0-1, higher = more separated) |
Source code in src/novelentitymatcher/novelty/clustering/validation.py
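Cohesion as defined above ("average pairwise similarity within a cluster") reduces to a few lines. This sketch uses plain Python lists and cosine similarity; the library's version operates on NumPy arrays, and the singleton-cluster convention here is an assumption:

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def compute_cohesion(embeddings, labels, cluster_id):
    """Average pairwise cosine similarity among members of one cluster."""
    members = [e for e, lab in zip(embeddings, labels) if lab == cluster_id]
    if len(members) < 2:
        return 1.0  # assumed convention: a singleton is trivially compact
    sims = [cosine(a, b) for a, b in combinations(members, 2)]
    return sum(sims) / len(sims)

emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
# Two tight clusters along orthogonal directions: both are highly cohesive.
assert compute_cohesion(emb, labels, 0) > 0.9
assert compute_cohesion(emb, labels, 1) > 0.9
```

A cluster whose cohesion falls below `min_cohesion_threshold` (default 0.45) would be rejected by `is_valid_cluster`, since low internal similarity suggests the density algorithm merged unrelated points.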
is_valid_cluster(embeddings, labels, cluster_id, min_size=5)
¶
Determine if a cluster is valid (stable and meaningful).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeddings` | `ndarray` | All embeddings | *required* |
| `labels` | `ndarray` | Cluster labels | *required* |
| `cluster_id` | `int` | Cluster to validate | *required* |
| `min_size` | `int` | Minimum number of samples for valid cluster | `5` |

Returns:

| Type | Description |
|---|---|
| `bool` | True if cluster is valid |
Source code in src/novelentitymatcher/novelty/clustering/validation.py
get_cluster_statistics(embeddings, labels)
¶
Compute statistics for all clusters.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeddings` | `ndarray` | All embeddings | *required* |
| `labels` | `ndarray` | Cluster labels | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[int, dict[str, float]]` | Dict mapping cluster_id to statistics dict |
Source code in src/novelentitymatcher/novelty/clustering/validation.py
DetectionConfig
¶
Bases: BaseModel
Main configuration for novelty detection.
This config specifies which strategies to use, their individual configurations, and how to combine their signals.
Attributes¶
strategies = Field(default_factory=(lambda: ['confidence', 'knn_distance', 'setfit_centroid']))
class-attribute
instance-attribute
¶
List of strategy IDs to use for novelty detection.
Available strategies:

- `confidence`: Confidence threshold
- `knn_distance`: kNN distance-based
- `uncertainty`: Margin/entropy uncertainty
- `clustering`: Clustering-based
- `self_knowledge`: Sparse autoencoder
- `pattern`: Pattern-based
- `oneclass`: One-Class SVM
- `prototypical`: Prototypical networks
- `setfit`: SetFit contrastive
combine_method = Field(default='weighted')
class-attribute
instance-attribute
¶
Method for combining strategy signals.
Options:

- `weighted`: Weighted fusion of scores
- `union`: Flag if any strategy flags
- `intersection`: Flag if all strategies flag
- `voting`: Flag if majority of strategies flag
- `meta_learner`: Logistic regression meta-learner (requires training)
confidence = None
class-attribute
instance-attribute
¶
Configuration for confidence strategy.
knn_distance = None
class-attribute
instance-attribute
¶
Configuration for kNN distance strategy.
uncertainty = None
class-attribute
instance-attribute
¶
Configuration for uncertainty strategy.
clustering = None
class-attribute
instance-attribute
¶
Configuration for clustering strategy.
self_knowledge = None
class-attribute
instance-attribute
¶
Configuration for self-knowledge strategy.
pattern = None
class-attribute
instance-attribute
¶
Configuration for pattern strategy.
oneclass = None
class-attribute
instance-attribute
¶
Configuration for One-Class SVM strategy.
prototypical = None
class-attribute
instance-attribute
¶
Configuration for prototypical strategy.
setfit = None
class-attribute
instance-attribute
¶
Configuration for SetFit strategy.
setfit_centroid = None
class-attribute
instance-attribute
¶
Configuration for SetFit centroid distance strategy.
mahalanobis = None
class-attribute
instance-attribute
¶
Configuration for Mahalanobis distance strategy.
lof = None
class-attribute
instance-attribute
¶
Configuration for Local Outlier Factor strategy.
weights = None
class-attribute
instance-attribute
¶
Weights for signal combination.
enable_lazy_initialization = Field(default=True)
class-attribute
instance-attribute
¶
Whether to lazily initialize strategies (only when first used).
debug_mode = Field(default=False)
class-attribute
instance-attribute
¶
Enable debug mode for verbose logging.
candidate_top_k = Field(default=5, ge=1)
class-attribute
instance-attribute
¶
How many matcher candidates to request when collecting metadata.
allowed_maturities = Field(default_factory=(lambda: ['production', 'experimental', 'internal']))
class-attribute
instance-attribute
¶
Allowed strategy maturity levels. Strategies outside these levels are rejected during validation.
Functions¶
get_strategy_config(strategy_id)
¶
Get configuration for a specific strategy.
Returns the strategy-specific config if it exists, otherwise returns a default config for that strategy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `strategy_id` | `str` | The strategy identifier | *required* |

Returns:

| Type | Description |
|---|---|
| `Any` | Strategy-specific configuration object |
Source code in src/novelentitymatcher/novelty/config/base.py
get_weight_config()
¶
Get the weight configuration, with defaults if not set.
Returns:

| Type | Description |
|---|---|
| `WeightConfig` | WeightConfig instance |
Source code in src/novelentitymatcher/novelty/config/base.py
validate_strategies()
¶
Validate that all configured strategies are available and allowed by maturity.
Strategies are registered at module load time via decorators. This method only validates — it does not trigger imports.
Raises:

| Type | Description |
|---|---|
| `ValueError` | If an unknown strategy is configured or its maturity is not allowed |
Source code in src/novelentitymatcher/novelty/config/base.py
ClusteringConfig
¶
Bases: BaseModel
Configuration for clustering-based strategy.
Attributes¶
min_cluster_size = Field(default=5, ge=1)
class-attribute
instance-attribute
¶
Minimum cluster size to be considered valid.
persistence_threshold = Field(default=0.1, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Persistence threshold for cluster stability.
cohesion_threshold = Field(default=0.45, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Cohesion threshold for cluster compactness.
hdbscan_min_cluster_size = Field(default=5, ge=1)
class-attribute
instance-attribute
¶
min_cluster_size parameter for HDBSCAN.
hdbscan_min_samples = Field(default=1, ge=1)
class-attribute
instance-attribute
¶
min_samples parameter for HDBSCAN.
cluster_selection_epsilon = Field(default=0.0, ge=0.0)
class-attribute
instance-attribute
¶
cluster_selection_epsilon for HDBSCAN.
ConfidenceConfig
¶
KNNConfig
¶
Bases: BaseModel
Configuration for kNN distance-based strategy.
Attributes¶
k = Field(default=20, ge=1, le=100)
class-attribute
instance-attribute
¶
Number of nearest neighbors to consider.
distance_threshold = Field(default=0.55, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Threshold for kNN distance score. Samples above this are flagged.
strong_threshold = Field(default=0.85, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Strong novelty threshold for high-confidence detection.
metric = Field(default='cosine')
class-attribute
instance-attribute
¶
Distance metric to use ('cosine', 'euclidean', etc.).
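The kNN strategy's core signal is the mean distance from a query to its `k` nearest reference embeddings: large distances mean the sample sits far from everything the model has seen. A stdlib sketch of that scoring (the thresholding convention mirrors `distance_threshold` above; the library's actual implementation uses vectorized neighbor search):

```python
import math

def knn_novelty_score(query, reference, k=3):
    """Mean cosine distance to the k nearest reference embeddings.

    Higher scores mean the query is farther from known data; scores above
    the configured distance_threshold would be flagged as novel.
    """
    def cos_dist(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    dists = sorted(cos_dist(query, r) for r in reference)
    return sum(dists[:k]) / k

ref = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]          # known-class embeddings
near = knn_novelty_score([1.0, 0.05], ref, k=3)      # resembles known data
far = knn_novelty_score([0.0, 1.0], ref, k=3)        # unlike anything known
assert far > near
assert far > 0.55   # would exceed the default distance_threshold
```

The separate `strong_threshold` (default 0.85) marks the high-confidence regime where, per `WeightConfig`, a sample is flagged regardless of the other strategies.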
LOFConfig
¶
Bases: BaseModel
Configuration for Local Outlier Factor (LOF) strategy.
Attributes¶
n_neighbors = Field(default=20, ge=2)
class-attribute
instance-attribute
¶
Number of neighbors to use for LOF.
contamination = Field(default=0.1, gt=0.0, le=0.5)
class-attribute
instance-attribute
¶
Expected proportion of outliers in the reference set.
metric = Field(default='cosine')
class-attribute
instance-attribute
¶
Distance metric to use ('cosine', 'euclidean', 'manhattan', etc.).
score_threshold = Field(default=0.0)
class-attribute
instance-attribute
¶
LOF score threshold. Samples below this are flagged as novel.
MahalanobisConfig
¶
Bases: BaseModel
Configuration for Mahalanobis distance-based strategy.
Attributes¶
threshold = Field(default=3.0, gt=0.0)
class-attribute
instance-attribute
¶
Mahalanobis distance threshold. Samples above this are flagged as novel.
regularization = Field(default=0.0001, gt=0.0)
class-attribute
instance-attribute
¶
Covariance matrix regularization (ridge) for numerical stability.
use_class_conditional = Field(default=True)
class-attribute
instance-attribute
¶
Whether to use per-class distributions (True) or a single global distribution (False).
calibration_mode = Field(default='none')
class-attribute
instance-attribute
¶
Calibration mode: 'none' for raw threshold, 'conformal' for p-value calibration.
calibration_alpha = Field(default=0.1, gt=0.0, le=1.0)
class-attribute
instance-attribute
¶
Significance level for conformal prediction. Lower = stricter.
calibration_method = Field(default='split')
class-attribute
instance-attribute
¶
Conformal calibration method: 'split' or 'mondrian' (class-conditional).
calibration_set_fraction = Field(default=0.2, gt=0.0, le=0.5)
class-attribute
instance-attribute
¶
Fraction of reference data held out for conformal calibration.
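With `calibration_mode='conformal'` and `calibration_method='split'`, the idea is: hold out a calibration set of distances computed on known data, convert each new distance into a p-value, and flag the sample when the p-value falls at or below `calibration_alpha`. A sketch of split-conformal scoring on plain distance values (the covariance/Mahalanobis machinery is omitted; this shows only the calibration step):

```python
def conformal_p_value(score, calibration_scores):
    """p-value with the standard +1 correction: fraction of calibration
    scores at least as extreme as the new score."""
    n = len(calibration_scores)
    ge = sum(1 for s in calibration_scores if s >= score)
    return (ge + 1) / (n + 1)

def is_novel(score, calibration_scores, alpha=0.1):
    # A tiny p-value means the distance is extreme relative to known data.
    return conformal_p_value(score, calibration_scores) <= alpha

# Mahalanobis-style distances measured on a held-out slice of known data:
cal = [0.5, 0.8, 1.1, 1.3, 1.6, 1.9, 2.2, 2.5, 2.8]
assert not is_novel(1.0, cal)   # typical distance -> not flagged
assert is_novel(9.0, cal)       # extreme distance -> flagged at alpha=0.1
```

Unlike the raw `threshold=3.0` mode, this calibrated decision carries a statistical guarantee: under exchangeability, at most roughly `alpha` of genuinely known samples get flagged.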
OneClassConfig
¶
Bases: BaseModel
Configuration for One-Class SVM strategy.
Attributes¶
nu = Field(default=0.1, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Expected outlier fraction. Lower = stricter boundary.
kernel = Field(default='rbf')
class-attribute
instance-attribute
¶
SVM kernel type ('rbf', 'linear', 'poly', 'sigmoid').
gamma = Field(default='scale')
class-attribute
instance-attribute
¶
Kernel coefficient ('scale', 'auto', or float).
model_name = Field(default='sentence-transformers/all-MiniLM-L6-v2')
class-attribute
instance-attribute
¶
Sentence transformer model name for embeddings.
PatternConfig
¶
Bases: BaseModel
Configuration for pattern-based strategy.
Attributes¶
threshold = Field(default=0.5, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Novelty score threshold for pattern-based detection.
char_ngram_n = Field(default=3, ge=1, le=5)
class-attribute
instance-attribute
¶
Character n-gram size for pattern extraction.
char_4gram_n = Field(default=4, ge=1, le=5)
class-attribute
instance-attribute
¶
Character 4-gram size.
prefix_suffix_n = Field(default=3, ge=1, le=5)
class-attribute
instance-attribute
¶
Prefix/suffix length for distribution analysis.
PrototypicalConfig
¶
Bases: BaseModel
Configuration for prototypical networks strategy.
Attributes¶
distance_threshold = Field(default=0.5, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Distance threshold for novelty detection.
model_name = Field(default='sentence-transformers/all-MiniLM-L6-v2')
class-attribute
instance-attribute
¶
Sentence transformer model name for embeddings.
support_samples_per_class = Field(default=5, ge=1)
class-attribute
instance-attribute
¶
Number of support samples per class for prototype computation.
SelfKnowledgeConfig
¶
Bases: BaseModel
Configuration for sparse autoencoder strategy.
Attributes¶
hidden_dim = Field(default=128, ge=1)
class-attribute
instance-attribute
¶
Hidden dimension for the autoencoder.
threshold = Field(default=0.5, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Reconstruction error threshold for novelty detection.
epochs = Field(default=100, ge=1)
class-attribute
instance-attribute
¶
Number of training epochs.
batch_size = Field(default=32, ge=1)
class-attribute
instance-attribute
¶
Training batch size.
learning_rate = Field(default=0.001, gt=0.0)
class-attribute
instance-attribute
¶
Learning rate for training.
SetFitConfig
¶
Bases: BaseModel
Configuration for SetFit contrastive strategy.
Attributes¶
margin = Field(default=0.5, ge=0.0)
class-attribute
instance-attribute
¶
Contrastive loss margin.
model_name = Field(default='sentence-transformers/all-MiniLM-L6-v2')
class-attribute
instance-attribute
¶
Sentence transformer model name.
epochs = Field(default=10, ge=1)
class-attribute
instance-attribute
¶
Number of training epochs.
batch_size = Field(default=16, ge=1)
class-attribute
instance-attribute
¶
Training batch size.
learning_rate = Field(default=2e-05, gt=0.0)
class-attribute
instance-attribute
¶
Learning rate for fine-tuning.
threshold = Field(default=0.7, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Similarity threshold for novelty detection.
UncertaintyConfig
¶
Bases: BaseModel
Configuration for uncertainty-based strategy.
WeightConfig
¶
Bases: BaseModel
Weights for signal combination from different strategies.
Each strategy's contribution to the final novelty score is weighted. Weights should sum to approximately 1.0, but this is not enforced as normalization is applied during combination.
Attributes¶
confidence = Field(default=0.35, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for confidence threshold strategy.
uncertainty = Field(default=0.35, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for uncertainty-based strategy.
knn = Field(default=0.45, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for kNN distance-based strategy.
cluster = Field(default=0.2, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for clustering-based strategy.
self_knowledge = Field(default=0.08, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for sparse autoencoder strategy.
pattern = Field(default=0.2, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for pattern-based strategy.
oneclass = Field(default=0.1, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for One-Class SVM strategy.
prototypical = Field(default=0.02, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for prototypical networks strategy.
setfit = Field(default=0.02, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for SetFit contrastive strategy.
setfit_centroid = Field(default=0.45, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for SetFit centroid distance strategy (recommended, highest weight).
mahalanobis = Field(default=0.35, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for Mahalanobis distance strategy.
lof = Field(default=0.15, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Weight for Local Outlier Factor strategy.
adaptive = Field(default=False)
class-attribute
instance-attribute
¶
Enable adaptive weight computation based on dataset characteristics.
novelty_threshold = Field(default=0.6, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Final novelty score threshold for flagging samples.
knn_gate_threshold = Field(default=0.45, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
kNN gate threshold - samples above this are always considered novel.
strong_uncertainty_threshold = Field(default=0.85, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Strong uncertainty threshold - samples above this are always novel.
strong_knn_threshold = Field(default=0.85, ge=0.0, le=1.0)
class-attribute
instance-attribute
¶
Strong kNN threshold - samples above this are always novel.
Functions¶
normalize_weights()
¶
Normalize weights to sum to 1.0.
Returns:

| Type | Description |
|---|---|
| `WeightConfig` | A new WeightConfig with normalized weights |
Source code in src/novelentitymatcher/novelty/config/weights.py
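Since the class docstring says weights need not sum to 1.0 (normalization is applied during combination), `normalize_weights` is the explicit version of that step. Its arithmetic, sketched over a plain dict rather than the Pydantic model:

```python
def normalize_weights(weights):
    """Scale strategy weights so they sum to 1.0 (illustrative version)."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all weights are zero; nothing to normalize")
    return {name: w / total for name, w in weights.items()}

# Defaults for three of the strategies documented above sum to 1.25:
w = normalize_weights({"confidence": 0.35, "knn": 0.45, "setfit_centroid": 0.45})
assert abs(sum(w.values()) - 1.0) < 1e-9
assert abs(w["knn"] - 0.45 / 1.25) < 1e-9   # relative proportions preserved
```

Because only the ratios matter after normalization, tuning one strategy's weight up effectively tunes every other strategy down.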
MetadataBuilder()
¶
Builds comprehensive reports for novelty detection results.
Aggregates information from all strategies and creates detailed reports with per-sample metrics and explanations.
Source code in src/novelentitymatcher/novelty/core/metadata.py
Functions¶
build_report(texts, confidences, predicted_classes, novel_indices, novelty_scores, all_metrics, strategy_outputs, config)
¶
Build a comprehensive novelty detection report.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | Input texts | *required* |
| `confidences` | `ndarray` | Prediction confidence scores | *required* |
| `predicted_classes` | `list[str]` | Predicted class for each sample | *required* |
| `novel_indices` | `set[int]` | Indices flagged as novel | *required* |
| `novelty_scores` | `dict[int, float]` | Final novelty scores | *required* |
| `all_metrics` | `dict[int, dict[str, Any]]` | All per-sample metrics | *required* |
| `strategy_outputs` | `dict[str, tuple[set[int], dict]]` | Per-strategy outputs | *required* |
| `config` | `DetectionConfig` | Detection configuration | *required* |

Returns:

| Type | Description |
|---|---|
| `NovelSampleReport` | NovelSampleReport with all detection results |
Source code in src/novelentitymatcher/novelty/core/metadata.py
build_summary(report)
¶
Build a summary of the detection report.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `report` | `NovelSampleReport` | NovelSampleReport to summarize | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | Summary dictionary with key statistics |
Source code in src/novelentitymatcher/novelty/core/metadata.py
SignalCombiner(config)
¶
Handles signal combination from multiple strategies.
Supports several combination methods:

- weighted: Weighted fusion of strategy scores
- union: Flag if any strategy flags
- intersection: Flag if all strategies flag
- voting: Flag if majority of strategies flag
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
config
|
DetectionConfig
|
Detection configuration |
required |
Source code in src/novelentitymatcher/novelty/core/signal_combiner.py
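The four combination methods can be sketched in plain Python, independent of the library's internals. The function name, the `strategy_outputs` shape (`strategy_id -> (flagged_indices, scores)`), and the default threshold are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of SignalCombiner's combination methods.
def combine_flags(strategy_outputs, method="union", weights=None, threshold=0.5):
    """strategy_outputs maps strategy_id -> (flagged_indices, scores)."""
    all_flags = [flags for flags, _ in strategy_outputs.values()]
    if method == "union":                      # flag if ANY strategy flags
        return set().union(*all_flags)
    if method == "intersection":               # flag if ALL strategies flag
        return set.intersection(*map(set, all_flags))
    if method == "voting":                     # flag if a MAJORITY flags
        counts = {}
        for flags in all_flags:
            for i in flags:
                counts[i] = counts.get(i, 0) + 1
        return {i for i, c in counts.items() if c > len(all_flags) / 2}
    if method == "weighted":                   # weighted fusion of scores
        weights = weights or {sid: 1.0 for sid in strategy_outputs}
        total = sum(weights.values())
        fused = {}
        for sid, (_, scores) in strategy_outputs.items():
            for i, s in scores.items():
                fused[i] = fused.get(i, 0.0) + weights[sid] * s / total
        return {i for i, s in fused.items() if s >= threshold}
    raise ValueError(method)

outputs = {
    "knn":        ({0, 2}, {0: 0.9, 1: 0.2, 2: 0.8}),
    "confidence": ({2, 3}, {0: 0.1, 2: 0.7, 3: 0.6}),
}
```

With two strategies, `union` flags anything either strategy flagged, `intersection` and `voting` both reduce to agreement, and `weighted` averages the per-sample scores before thresholding.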
Functions¶
combine(strategy_outputs, all_metrics)
¶
Combine strategy signals into final novelty decisions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy_outputs
|
dict[str, tuple[set[int], dict]]
|
Dict mapping strategy_id to (flags, metrics) |
required |
all_metrics
|
dict[int, dict[str, Any]]
|
Dict mapping sample index to all metrics |
required |
Returns:
| Type | Description |
|---|---|
tuple[set[int], dict[int, float]]
|
(novel_indices, novelty_scores) |
Source code in src/novelentitymatcher/novelty/core/signal_combiner.py
train_meta_learner(features, labels)
¶
Train the logistic regression meta-learner.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
features
|
ndarray
|
(n_samples, n_features) matrix of strategy scores |
required |
labels
|
ndarray
|
(n_samples,) binary novelty labels (1=novel, 0=known) |
required |
Returns:
| Type | Description |
|---|---|
float
|
Training accuracy |
Source code in src/novelentitymatcher/novelty/core/signal_combiner.py
save_meta_learner(path)
¶
Persist the trained meta-learner to disk.
Source code in src/novelentitymatcher/novelty/core/signal_combiner.py
load_meta_learner(path)
¶
Load a trained meta-learner from disk.
Source code in src/novelentitymatcher/novelty/core/signal_combiner.py
StrategyRegistry
¶
Registry for novelty detection strategies.
Strategies are registered using the @StrategyRegistry.register decorator. Once registered, they can be instantiated by their strategy_id.
Functions¶
register(strategy_cls)
classmethod
¶
Register a strategy class.
Usage
```python
@StrategyRegistry.register
class MyStrategy(NoveltyStrategy):
    strategy_id = "my_strategy"
    ...
```
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy_cls
|
type[NoveltyStrategy]
|
Strategy class to register |
required |
Returns:
| Type | Description |
|---|---|
type[NoveltyStrategy]
|
The same strategy class (for decorator use) |
Source code in src/novelentitymatcher/novelty/core/strategies.py
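The decorator-based registry pattern can be illustrated with a minimal stand-in. The class and method names mirror the docs above, but this is an assumed sketch, not the library's code:

```python
# Minimal stand-in for the strategy registry pattern described above.
class StrategyRegistry:
    _strategies = {}

    @classmethod
    def register(cls, strategy_cls):
        cls._strategies[strategy_cls.strategy_id] = strategy_cls
        return strategy_cls            # returned unchanged for decorator use

    @classmethod
    def get(cls, strategy_id):
        try:
            return cls._strategies[strategy_id]
        except KeyError:
            raise ValueError(f"unknown strategy: {strategy_id!r}") from None

    @classmethod
    def create(cls, strategy_id):
        return cls.get(strategy_id)()  # instantiate by ID

    @classmethod
    def list_strategies(cls):
        return list(cls._strategies)   # dicts preserve registration order


@StrategyRegistry.register
class MyStrategy:
    strategy_id = "my_strategy"
```

Returning the class unchanged from `register` is what makes it usable as a decorator while still populating the registry as a side effect.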
get(strategy_id)
classmethod
¶
Get a strategy class by ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy_id
|
str
|
Unique strategy identifier |
required |
Returns:
| Type | Description |
|---|---|
type[NoveltyStrategy]
|
Strategy class |
Raises:
| Type | Description |
|---|---|
ValueError
|
If strategy_id is not registered |
Source code in src/novelentitymatcher/novelty/core/strategies.py
create(strategy_id)
classmethod
¶
Create an instance of a strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy_id
|
str
|
Unique strategy identifier |
required |
Returns:
| Type | Description |
|---|---|
NoveltyStrategy
|
Instantiated strategy object |
Source code in src/novelentitymatcher/novelty/core/strategies.py
list_strategies(maturity=None)
classmethod
¶
List all registered strategy IDs, optionally filtered by maturity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
maturity
|
str | None
|
Optional maturity filter ("production", "experimental", "internal"). |
None
|
Returns:
| Type | Description |
|---|---|
list[str]
|
List of strategy IDs in registration order |
Source code in src/novelentitymatcher/novelty/core/strategies.py
is_registered(strategy_id)
classmethod
¶
Check if a strategy is registered.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy_id
|
str
|
Strategy identifier to check |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if strategy is registered |
Source code in src/novelentitymatcher/novelty/core/strategies.py
unregister(strategy_id)
classmethod
¶
Unregister a strategy.
This is primarily useful for testing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy_id
|
str
|
Strategy identifier to unregister |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If strategy_id is not registered |
Source code in src/novelentitymatcher/novelty/core/strategies.py
clear()
classmethod
¶
Clear all registered strategies.
This is primarily useful for testing.
NovelEntityMatchResult(id, score, is_match, is_novel, novel_score=None, match_method='accepted_known', alternatives=list(), signals=dict(), predicted_id=None, metadata=dict())
dataclass
¶
Operational result for a single novelty-aware match decision.
NoveltyEvaluator(mode='benchmark', metrics=None)
¶
Unified evaluator for novelty detection.
Supports two modes:
- benchmark: Quick evaluation on OOD splits with core metrics
- research: Comprehensive evaluation with confusion matrices and threshold sweeping

Metrics computed:
- AUROC, AUPRC
- Detection rates at 1%, 5%, 10% FPR
- Precision, Recall, F1 at optimal threshold
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
mode
|
Literal['benchmark', 'research']
|
Evaluation mode ('benchmark' or 'research') |
'benchmark'
|
metrics
|
list[str] | None
|
List of metrics to compute (None for default based on mode) |
None
|
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
Functions¶
evaluate(novelty_scores, is_novel_true, threshold=None)
¶
Evaluate novelty detection performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
novelty_scores
|
ndarray
|
Predicted novelty scores (higher = more novel) |
required |
is_novel_true
|
ndarray
|
Ground truth novelty labels (True = novel) |
required |
threshold
|
float | None
|
Optional threshold for discrete predictions |
None
|
Returns:
| Type | Description |
|---|---|
dict[str, float]
|
Dictionary of metric name -> value |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
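One of the less standard metrics above, "detection rate at k% FPR", can be computed directly from scores and ground-truth labels. This is an illustrative numpy sketch; the library's exact quantile convention may differ:

```python
import numpy as np

# Detection rate at a fixed false-positive rate: pick the threshold so that
# ~fpr of KNOWN samples score above it, then measure the fraction of NOVEL
# samples that exceed that threshold.
def detection_rate_at_fpr(novelty_scores, is_novel_true, fpr=0.05):
    scores = np.asarray(novelty_scores, dtype=float)
    novel = np.asarray(is_novel_true, dtype=bool)
    threshold = np.quantile(scores[~novel], 1.0 - fpr)
    return float(np.mean(scores[novel] > threshold))
```

Because the threshold is set from the known-sample score distribution only, the metric answers: "if we tolerate k% false alarms, how many genuinely novel samples do we catch?"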
create_report(novelty_scores, is_novel_true, threshold=None)
¶
Create a comprehensive evaluation report.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
novelty_scores
|
ndarray
|
Predicted novelty scores (higher = more novel) |
required |
is_novel_true
|
ndarray
|
Ground truth novelty labels (True = novel) |
required |
threshold
|
float | None
|
Optional threshold for discrete predictions |
None
|
Returns:
| Type | Description |
|---|---|
EvaluationReport
|
EvaluationReport with all metrics |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
sweep_thresholds(novelty_scores, is_novel_true, num_thresholds=100)
¶
Sweep across thresholds and compute metrics at each.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
novelty_scores
|
ndarray
|
Predicted novelty scores (higher = more novel) |
required |
is_novel_true
|
ndarray
|
Ground truth novelty labels (True = novel) |
required |
num_thresholds
|
int
|
Number of thresholds to evaluate |
100
|
Returns:
| Type | Description |
|---|---|
dict[str, ndarray]
|
Dict with arrays for thresholds and metrics |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
compare_thresholds(novelty_scores, is_novel_true, thresholds)
¶
Compare metrics at specific thresholds.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
novelty_scores
|
ndarray
|
Predicted novelty scores (higher = more novel) |
required |
is_novel_true
|
ndarray
|
Ground truth novelty labels (True = novel) |
required |
thresholds
|
list[float]
|
List of thresholds to evaluate |
required |
Returns:
| Type | Description |
|---|---|
list[dict[str, float]]
|
List of dicts with metrics at each threshold |
Source code in src/novelentitymatcher/novelty/evaluation/evaluator.py
GradualNoveltySplitter(known_ratios=None, random_state=42)
¶
Creates multiple splits with gradually increasing novelty.
Useful for testing how novelty detection performance degrades as the number of novel classes increases.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
known_ratios
|
list[float] | None
|
List of known ratios to create splits for |
None
|
random_state
|
int
|
Random seed for reproducibility |
42
|
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
Functions¶
create_splits(texts, labels)
¶
Create multiple splits with different novelty levels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
List of input texts |
required |
labels
|
list[str]
|
List of corresponding labels |
required |
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]]
|
List of split dictionaries, one per known_ratio |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
get_novelty_progression(texts, labels)
¶
Get summary of novelty progression across splits.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
List of input texts |
required |
labels
|
list[str]
|
List of corresponding labels |
required |
Returns:
| Type | Description |
|---|---|
dict[str, list]
|
Dict with arrays for known_ratio, n_known, n_novel |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
OODSplitter(known_ratio=0.8, random_state=42)
¶
Creates OOD (Out-of-Distribution) splits for novelty detection evaluation.
Splits data into known classes and unknown/novel classes to simulate the novelty detection scenario.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
known_ratio
|
float
|
Fraction of classes to keep as known (0-1) |
0.8
|
random_state
|
int
|
Random seed for reproducibility |
42
|
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
Functions¶
create_split(texts, labels)
¶
Create OOD train/test split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
List of input texts |
required |
labels
|
list[str]
|
List of corresponding labels |
required |
Returns:
| Type | Description |
|---|---|
tuple
|
Tuple of (train_texts, train_labels, test_texts, test_is_novel) |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
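The class-level OOD split described above can be sketched with the standard library. The function name and the exact train/test semantics (train on known classes, evaluate on everything) are assumptions for illustration:

```python
import random

# Hypothetical sketch of a class-level OOD split: a fraction of classes is
# kept "known"; samples from every other class count as novel at test time.
def create_ood_split(texts, labels, known_ratio=0.8, random_state=42):
    classes = sorted(set(labels))
    rng = random.Random(random_state)           # seeded for reproducibility
    rng.shuffle(classes)
    n_known = max(1, int(len(classes) * known_ratio))
    known = set(classes[:n_known])
    train_texts = [t for t, y in zip(texts, labels) if y in known]
    train_labels = [y for y in labels if y in known]
    test_texts = list(texts)
    test_is_novel = [y not in known for y in labels]
    return train_texts, train_labels, test_texts, test_is_novel
```

Splitting at the class level (rather than the sample level) is what makes this an out-of-distribution evaluation: the novel classes are entirely unseen during training.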
create_split_with_indices(texts, labels)
¶
Create OOD split with additional metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
List of input texts |
required |
labels
|
list[str]
|
List of corresponding labels |
required |
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Dict with split data and metadata |
Source code in src/novelentitymatcher/novelty/evaluation/splitters.py
BGERetriever(model_name='BAAI/bge-m3', device=None, batch_size=32)
¶
BGE-M3 style dense retriever for examples.
Simple wrapper that uses sentence-transformers for dense retrieval of in-context examples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_name
|
str
|
Model name for sentence-transformers |
'BAAI/bge-m3'
|
device
|
str | None
|
Device to use ("cuda", "cpu", or None for auto) |
None
|
batch_size
|
int
|
Batch size for encoding |
32
|
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
Functions¶
encode(texts, batch_size=None)
¶
Encode texts to embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
List of texts to encode |
required |
batch_size
|
int | None
|
Override batch size |
None
|
Returns:
| Type | Description |
|---|---|
Any
|
numpy array of embeddings (n, dim) |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
similarity(query_embeddings, corpus_embeddings)
¶
Compute similarity between query and corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query_embeddings
|
Any
|
Query embeddings (n, dim) |
required |
corpus_embeddings
|
Any
|
Corpus embeddings (m, dim) |
required |
Returns:
| Type | Description |
|---|---|
ndarray
|
Similarity matrix (n, m) |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
RetrievalAugmentedProposer(retriever=None, llm_proposer=None, k_examples=5, k_novel_per_class=3, retrieval_metric='cosine', rerank=False)
¶
LLM class proposer enhanced with retrieval-based in-context examples.
Retrieves most relevant examples from a corpus to include in the LLM prompt, improving class naming quality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
retriever
|
EmbeddingBackend | None
|
Embedding backend for retrieval (e.g., BGE-M3) |
None
|
llm_proposer
|
Any | None
|
Existing LLMClassProposer to enhance |
None
|
k_examples
|
int
|
Number of in-context examples to retrieve |
5
|
k_novel_per_class
|
int
|
Number of novel examples per proposed class |
3
|
retrieval_metric
|
str
|
Similarity metric for retrieval |
'cosine'
|
rerank
|
bool
|
Whether to use reranking for better examples |
False
|
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
Attributes¶
is_ready
property
¶
Check if proposer is ready for use.
Functions¶
index_examples(examples, embeddings=None)
¶
Index examples for retrieval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
examples
|
list[str]
|
List of example texts to index |
required |
embeddings
|
Any | None
|
Pre-computed embeddings (if None, will compute) |
None
|
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
retrieve(query, k=None)
¶
Retrieve k most relevant examples for a query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query
|
str
|
Query text |
required |
k
|
int | None
|
Number of examples to retrieve (default: k_examples) |
None
|
Returns:
| Type | Description |
|---|---|
list[dict[str, Any]]
|
List of dicts with 'text', 'score', 'index' |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
retrieve_by_class(class_name, novel_samples, existing_classes)
¶
Retrieve examples relevant to a proposed class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
class_name
|
str
|
Proposed class name |
required |
novel_samples
|
list[Any]
|
Novel samples to find examples for |
required |
existing_classes
|
list[str]
|
List of existing class names |
required |
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Dict with retrieved examples and metadata |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
build_prompt(novel_samples, existing_classes, context=None, use_retrieval=True)
¶
Build prompt for LLM class proposal with retrieval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
novel_samples
|
list[Any]
|
Novel samples to propose classes for |
required |
existing_classes
|
list[str]
|
List of existing class names |
required |
context
|
str | None
|
Optional domain context |
None
|
use_retrieval
|
bool
|
Whether to include retrieved examples |
True
|
Returns:
| Type | Description |
|---|---|
str
|
Formatted prompt string |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
propose_classes(novel_samples, existing_classes, context=None)
¶
Propose new classes with retrieval-augmented prompting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
novel_samples
|
list[Any]
|
Novel samples to propose classes for |
required |
existing_classes
|
list[str]
|
List of existing class names |
required |
context
|
str | None
|
Optional domain context |
None
|
Returns:
| Type | Description |
|---|---|
Any | None
|
NovelClassAnalysis from LLM or None if unavailable |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
ClassProposal
¶
Bases: BaseModel
A proposed class for a cluster of novel samples.
ClusterEvidence
¶
Bases: BaseModel
Compact statistical evidence extracted for a cluster.
DiscoveryCluster
¶
Bases: BaseModel
Community of likely novel samples discovered in a batch.
NovelClassAnalysis
¶
Bases: BaseModel
Class proposals generated from a novelty discovery run.
NovelClassDiscoveryReport
¶
Bases: BaseModel
End-to-end report for novelty detection and optional proposal generation.
NovelSampleMetadata
¶
Bases: BaseModel
Metadata for a single sample flagged as novel.
NovelSampleReport
¶
Bases: BaseModel
Novel samples found during a detection run.
ProposalReviewRecord
¶
Bases: BaseModel
Lifecycle-aware review record for a proposed class.
DetectionReport(novelty_report, strategies_used, runtime_seconds, timestamp, additional_info=dict())
dataclass
¶
Report from a complete detection run.
Contains the NovelSampleReport plus additional metadata about the detection run (timing, strategy performance, etc.).
Attributes¶
novelty_report
instance-attribute
¶
The core novelty detection report.
strategies_used
instance-attribute
¶
List of strategies that were used.
runtime_seconds
instance-attribute
¶
Time taken for detection in seconds.
timestamp
instance-attribute
¶
ISO timestamp of when detection was run.
additional_info = field(default_factory=dict)
class-attribute
instance-attribute
¶
Any additional information to include in the report.
EvaluationReport(auroc, auprc, detection_rate_at_1, detection_rate_at_5, detection_rate_at_10, precision, recall, f1, optimal_threshold, confusion_matrix=None, per_class_metrics=None, num_samples=0, num_novel=0, timestamp='')
dataclass
¶
Report from evaluating novelty detection.
Contains metrics from evaluating on a labeled dataset.
Attributes¶
auroc
instance-attribute
¶
Area under ROC curve.
auprc
instance-attribute
¶
Area under Precision-Recall curve.
detection_rate_at_1
instance-attribute
¶
Detection rate at 1% false positive rate.
detection_rate_at_5
instance-attribute
¶
Detection rate at 5% false positive rate.
detection_rate_at_10
instance-attribute
¶
Detection rate at 10% false positive rate.
precision
instance-attribute
¶
Precision at optimal threshold.
recall
instance-attribute
¶
Recall at optimal threshold.
f1
instance-attribute
¶
F1 score at optimal threshold.
optimal_threshold
instance-attribute
¶
Threshold that maximizes F1 score.
confusion_matrix = None
class-attribute
instance-attribute
¶
Confusion matrix at optimal threshold.
per_class_metrics = None
class-attribute
instance-attribute
¶
Per-class metrics if available.
num_samples = 0
class-attribute
instance-attribute
¶
Total number of samples evaluated.
num_novel = 0
class-attribute
instance-attribute
¶
Number of actually novel samples.
timestamp = ''
class-attribute
instance-attribute
¶
ISO timestamp of when evaluation was run.
SampleMetrics(index, text, predicted_class, confidence, is_novel, novelty_score, strategy_flags, raw_metrics)
dataclass
¶
Aggregated metrics for a single sample.
Contains metrics from all strategies for a specific sample.
Attributes¶
index
instance-attribute
¶
Sample index in the input batch.
text
instance-attribute
¶
The input text.
predicted_class
instance-attribute
¶
Predicted class for this sample.
confidence
instance-attribute
¶
Prediction confidence score.
is_novel
instance-attribute
¶
Whether this sample was flagged as novel.
novelty_score
instance-attribute
¶
Final combined novelty score.
strategy_flags
instance-attribute
¶
Which strategies flagged this sample.
raw_metrics
instance-attribute
¶
Raw metrics from each strategy.
StrategyMetrics(strategy_id, flags, metrics)
dataclass
¶
Metrics from a single strategy.
Contains the flags and per-sample metrics produced by a strategy.
ANNBackend
¶
Supported ANN backends.
ANNIndex(dim, backend=ANNBackend.HNSWLIB, max_elements=100000, ef_construction=200, M=16)
¶
Wrapper for Approximate Nearest Neighbor indexing.
Provides efficient O(log n) similarity search using HNSWlib or FAISS.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dim
|
int
|
Dimensionality of embeddings |
required |
backend
|
str
|
ANN backend to use ('hnswlib' or 'faiss') |
HNSWLIB
|
max_elements
|
int
|
Maximum number of elements to index |
100000
|
ef_construction
|
int
|
HNSW ef_construction parameter (higher = better quality) |
200
|
M
|
int
|
HNSW M parameter (higher = better quality, more memory) |
16
|
Source code in src/novelentitymatcher/novelty/storage/index.py
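As a correctness reference for what the ANN index approximates, the exact (brute-force) form of a `knn_query` can be written in a few lines of numpy. This sketch is independent of the HNSWlib/FAISS backends the wrapper actually uses:

```python
import numpy as np

# Brute-force k-NN query: the O(n) operation an ANN index approximates
# in O(log n). Returns (distances, indices), matching the wrapper's shape.
def exact_knn_query(index_vectors, query, k=5):
    q = np.atleast_2d(np.asarray(query, dtype=float))      # (n_queries, dim)
    x = np.asarray(index_vectors, dtype=float)             # (n, dim)
    dists = np.linalg.norm(x[None, :, :] - q[:, None, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, :k]                 # k nearest per query
    return np.take_along_axis(dists, idx, axis=1), idx
```

Comparing an ANN backend's results against this exact version is a practical way to tune recall-affecting parameters like `ef_construction` and `M`.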
Attributes¶
n_elements
property
¶
Get number of elements in the index.
labels
property
¶
Return the labels stored alongside indexed vectors.
Functions¶
add_vectors(vectors, labels=None)
¶
Add vectors to the index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
vectors
|
ndarray
|
Array of shape (n_vectors, dim) |
required |
labels
|
list[str] | None
|
Optional labels for the vectors |
None
|
Source code in src/novelentitymatcher/novelty/storage/index.py
knn_query(query, k=5)
¶
Find k-nearest neighbors for query vector(s).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
query
|
ndarray
|
Query vector or vectors of shape (n_queries, dim) |
required |
k
|
int
|
Number of neighbors to return |
5
|
Returns:
| Type | Description |
|---|---|
tuple[ndarray, ndarray]
|
Tuple of (distances, indices) |
Source code in src/novelentitymatcher/novelty/storage/index.py
get_distance_matrix(queries, targets=None)
¶
Get distance matrix between queries and all indexed vectors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
queries
|
ndarray
|
Query vectors of shape (n_queries, dim) |
required |
targets
|
ndarray | None
|
Optional target vectors (if None, use all indexed vectors) |
None
|
Returns:
| Type | Description |
|---|---|
ndarray
|
Distance matrix of shape (n_queries, n_targets) |
Source code in src/novelentitymatcher/novelty/storage/index.py
save(path)
¶
Save index to disk.
Source code in src/novelentitymatcher/novelty/storage/index.py
load(path)
¶
Load index from disk.
Source code in src/novelentitymatcher/novelty/storage/index.py
clear()
¶
Clear all elements from the index.
Source code in src/novelentitymatcher/novelty/storage/index.py
PromotionResult(review_record, entities_added=list(), index_updated=False, retrain_required=False)
dataclass
¶
ProposalReviewManager(storage_path='./proposals/review_records.json')
¶
Persist and update proposal review records for HITL workflows.
Source code in src/novelentitymatcher/novelty/storage/review.py
Functions¶
promote_with_index_update(review_id, matcher)
¶
Promote and automatically update the matcher's entity index.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
review_id
|
str
|
The review record to promote. |
required |
matcher
|
Any
|
A NovelEntityMatcher or similar object with |
required |
Returns:
| Type | Description |
|---|---|
PromotionResult
|
PromotionResult with full details of the promotion. |
Source code in src/novelentitymatcher/novelty/storage/review.py
NoveltyStrategy
¶
Bases: ABC
Base protocol for all novelty detection strategies.
Each strategy is responsible for:
1. Initializing with reference embeddings and labels
2. Detecting novel samples from a batch of inputs
3. Providing per-sample metrics for signal combination
4. Specifying its weight for signal fusion
Attributes¶
config_schema
abstractmethod
property
¶
Return the config dataclass type for this strategy.
This is used for validation and defaults.
Functions¶
initialize(reference_embeddings, reference_labels, config)
abstractmethod
¶
Initialize strategy with reference data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reference_embeddings
|
ndarray
|
Embeddings of known samples |
required |
reference_labels
|
list[str]
|
Class labels for known samples |
required |
config
|
Any
|
Strategy-specific configuration object |
required |
Source code in src/novelentitymatcher/novelty/strategies/base.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
abstractmethod
¶
Detect novel samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
Input texts |
required |
embeddings
|
ndarray
|
Text embeddings |
required |
predicted_classes
|
list[str]
|
Predicted class for each sample |
required |
confidences
|
ndarray
|
Prediction confidence scores |
required |
**kwargs
|
Additional strategy-specific parameters |
{}
|
Returns:
| Type | Description |
|---|---|
tuple[set[int], dict[int, dict[str, Any]]]
|
(flags, metrics) - flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/base.py
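A minimal class honoring this protocol might look as follows. This is a standalone sketch (no `NoveltyStrategy` base class, and the config is assumed to be a plain dict), showing the `(flags, metrics)` contract that `detect` must satisfy:

```python
import numpy as np

# Illustrative strategy implementing the protocol above: flags samples whose
# prediction confidence falls below a threshold. Names are hypothetical.
class ThresholdStrategySketch:
    strategy_id = "threshold_sketch"

    def initialize(self, reference_embeddings, reference_labels, config):
        # Assumed dict-style config; the real library uses dataclass schemas.
        self.threshold = config.get("threshold", 0.5)

    def detect(self, texts, embeddings, predicted_classes, confidences, **kwargs):
        conf = np.asarray(confidences, dtype=float)
        flags = {i for i, c in enumerate(conf) if c < self.threshold}
        metrics = {i: {"confidence": float(c)} for i, c in enumerate(conf)}
        return flags, metrics          # (flagged indices, per-sample metrics)

    def get_weight(self):
        return 1.0                     # contribution to signal fusion
```

Returning metrics for every sample, not just flagged ones, is what lets the signal combiner fuse scores across strategies.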
get_weight()
abstractmethod
¶
Return weight for signal combination.
This weight determines how much this strategy contributes to the final novelty score.
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
ClusteringStrategy()
¶
Bases: NoveltyStrategy
Clustering-based strategy for novelty detection.
Uses HDBSCAN to cluster samples and identifies novel samples as those that are in small or low-cohesion clusters.
Source code in src/novelentitymatcher/novelty/strategies/clustering.py
Attributes¶
config_schema
property
¶
Return ClusteringConfig as the config schema.
Functions¶
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
initialize(reference_embeddings, reference_labels, config)
¶
Initialize the clustering strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reference_embeddings
|
ndarray
|
Embeddings of known samples |
required |
reference_labels
|
list[str]
|
Labels of known samples |
required |
config
|
ClusteringConfig
|
ClusteringConfig with thresholds |
required |
Source code in src/novelentitymatcher/novelty/strategies/clustering.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
¶
Detect novel samples using clustering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
Input texts |
required |
embeddings
|
ndarray
|
Text embeddings |
required |
predicted_classes
|
list[str]
|
Predicted classes |
required |
confidences
|
ndarray
|
Prediction confidences |
required |
**kwargs
|
Additional parameters |
{}
|
Returns:
| Type | Description |
|---|---|
tuple[set[int], dict[int, dict[str, Any]]]
|
(flags, metrics) - Flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/clustering.py
ConfidenceStrategy()
¶
Bases: NoveltyStrategy
Confidence threshold strategy for novelty detection.
Flags samples as novel if their prediction confidence falls below a configured threshold.
Source code in src/novelentitymatcher/novelty/strategies/confidence.py
Attributes¶
config_schema
property
¶
Return ConfidenceConfig as the config schema.
Functions¶
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
initialize(reference_embeddings, reference_labels, config)
¶
Initialize the confidence strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reference_embeddings
|
ndarray
|
Embeddings of known samples (not used) |
required |
reference_labels
|
list[str]
|
Labels of known samples (not used) |
required |
config
|
ConfidenceConfig
|
ConfidenceConfig with threshold parameter |
required |
Source code in src/novelentitymatcher/novelty/strategies/confidence.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
¶
Detect novel samples using confidence threshold.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
Input texts |
required |
embeddings
|
ndarray
|
Text embeddings (not used) |
required |
predicted_classes
|
list[str]
|
Predicted classes (not used) |
required |
confidences
|
ndarray
|
Prediction confidence scores |
required |
**kwargs
|
Additional parameters |
{}
|
Returns:
| Type | Description |
|---|---|
tuple[set[int], dict[int, dict[str, Any]]]
|
(flags, metrics) - Flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/confidence.py
KNNDistanceStrategy()
¶
Bases: NoveltyStrategy
kNN distance strategy for novelty detection.
Flags samples as novel if their average distance to k-nearest neighbors in the reference set exceeds a threshold.
Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
Attributes¶
config_schema
property
¶
Return KNNConfig as the config schema.
Functions¶
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
initialize(reference_embeddings, reference_labels, config)
¶
Initialize the kNN strategy with reference data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reference_embeddings
|
ndarray
|
Embeddings of known samples |
required |
reference_labels
|
list[str]
|
Labels of known samples |
required |
config
|
KNNConfig
|
KNNConfig with k, thresholds, and metric |
required |
Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
¶
Detect novel samples using kNN distance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
Input texts |
required |
embeddings
|
ndarray
|
Text embeddings |
required |
predicted_classes
|
list[str]
|
Predicted classes |
required |
confidences
|
ndarray
|
Prediction confidences |
required |
**kwargs
|
Additional parameters |
{}
|
Returns:
| Type | Description |
|---|---|
tuple[set[int], dict[int, dict[str, Any]]]
|
(flags, metrics) - Flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/knn_distance.py
LOFStrategy()
¶
Bases: NoveltyStrategy
LOF strategy for novelty detection.
Trains a Local Outlier Factor model on reference embeddings with novelty=True, then scores new samples. Samples with scores below the configurable threshold are flagged as novel.
Source code in src/novelentitymatcher/novelty/strategies/lof.py
Attributes¶
config_schema
property
¶
Return LOFConfig as the config schema.
Functions¶
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
initialize(reference_embeddings, reference_labels, config)
¶
Initialize LOF strategy by fitting on reference embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
reference_embeddings
|
ndarray
|
Embeddings of known samples |
required |
reference_labels
|
list[str]
|
Labels of known samples |
required |
config
|
LOFConfig
|
LOFConfig with n_neighbors, contamination, metric, threshold |
required |
Source code in src/novelentitymatcher/novelty/strategies/lof.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
¶
Detect novel samples using LOF anomaly scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts
|
list[str]
|
Input texts |
required |
embeddings
|
ndarray
|
Text embeddings |
required |
predicted_classes
|
list[str]
|
Predicted classes |
required |
confidences
|
ndarray
|
Prediction confidences |
required |
**kwargs
|
Additional parameters |
{}
|
Returns:
| Type | Description |
|---|---|
tuple[set[int], dict[int, dict[str, Any]]]
|
(flags, metrics) - Flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/lof.py
MahalanobisDistanceStrategy()
¶
Bases: NoveltyStrategy
Mahalanobis distance strategy for novelty detection.
Computes the Mahalanobis distance from each sample to the class-conditional distribution (mean + shared covariance) of its predicted class. Samples whose distance exceeds a configurable threshold are flagged as novel.
When calibration_mode="conformal", raw distances are wrapped with conformal p-values for statistically grounded routing. This is backward-compatible: calibration_mode="none" produces identical results to the original threshold-only behavior.
Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
Attributes¶
config_schema
property
¶
Return MahalanobisConfig as the config schema.
Functions¶
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
initialize(reference_embeddings, reference_labels, config)
¶
Initialize the Mahalanobis strategy with reference data.
Computes per-class mean vectors and a shared (pooled) covariance matrix with regularization for numerical stability.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| reference_embeddings | ndarray | Embeddings of known samples (n_samples, dim) | required |
| reference_labels | list[str] | Class labels for known samples | required |
| config | MahalanobisConfig | MahalanobisConfig with threshold, regularization, etc. | required |
Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
¶
Detect novel samples using Mahalanobis distance.
When calibration_mode="conformal", flagging uses p-values
instead of raw distance thresholds. A sample is flagged if
p_value < calibration_alpha.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list[str] | Input texts | required |
| embeddings | ndarray | Text embeddings | required |
| predicted_classes | list[str] | Predicted classes | required |
| confidences | ndarray | Prediction confidences | required |
| **kwargs | | Additional parameters | {} |

Returns:

| Type | Description |
|---|---|
| tuple[set[int], dict[int, dict[str, Any]]] | (flags, metrics) - Flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/mahalanobis.py
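The raw-distance path (calibration_mode="none") can be illustrated with a numpy-only sketch: per-class means, a shared pooled covariance with a small regularization term, and a distance-versus-threshold check, as initialize() and detect() describe. All data and the threshold value are illustrative.

```python
# Sketch of the Mahalanobis novelty check described above (threshold-only
# path, no conformal calibration). Names and numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
dim = 4
ref_a = rng.normal(0.0, 1.0, size=(100, dim))
ref_b = rng.normal(5.0, 1.0, size=(100, dim))
reference_embeddings = np.vstack([ref_a, ref_b])
reference_labels = ["a"] * 100 + ["b"] * 100

# Per-class means and a shared (pooled) covariance, regularized for stability.
means, centered = {}, []
for label in ("a", "b"):
    cls = reference_embeddings[[lbl == label for lbl in reference_labels]]
    means[label] = cls.mean(axis=0)
    centered.append(cls - means[label])
pooled = np.cov(np.vstack(centered), rowvar=False)
pooled += 1e-3 * np.eye(dim)          # regularization term
inv_cov = np.linalg.inv(pooled)

def mahalanobis(x, label):
    d = x - means[label]
    return float(np.sqrt(d @ inv_cov @ d))

threshold = 5.0                       # illustrative threshold
queries = np.array([[0.1, -0.2, 0.0, 0.3],      # close to class "a"
                    [20.0, 20.0, 20.0, 20.0]])  # far from everything
predicted = ["a", "a"]
flags = {i for i, (x, c) in enumerate(zip(queries, predicted))
         if mahalanobis(x, c) > threshold}
```

Each sample is compared against the distribution of its *predicted* class, so a plausible point assigned to the wrong class can also exceed the threshold.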
SelfKnowledgeStrategy()
¶
Bases: NoveltyStrategy
Self-knowledge strategy for novelty detection.
Uses a sparse autoencoder to learn representations of known samples and flags high reconstruction error as novel.
Source code in src/novelentitymatcher/novelty/strategies/self_knowledge.py
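The actual strategy trains a sparse autoencoder; as a loose, runnable stand-in, a linear autoencoder (truncated PCA via SVD) demonstrates the same principle: learn a low-dimensional representation of known samples and flag large reconstruction error as novel. This is an analogy, not the strategy's real model.

```python
# Reconstruction-error novelty sketch using a linear autoencoder (PCA)
# in place of the sparse autoencoder the real strategy trains.
import numpy as np

rng = np.random.default_rng(2)
# Known samples live (mostly) in a 2-D subspace of 6-D space.
basis = rng.normal(size=(2, 6))
reference = rng.normal(size=(300, 2)) @ basis \
            + 0.01 * rng.normal(size=(300, 6))

mean = reference.mean(axis=0)
_, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
components = vt[:2]                   # learned low-dimensional code

def reconstruction_error(x):
    code = (x - mean) @ components.T  # encode
    recon = code @ components + mean  # decode
    return float(np.linalg.norm(x - recon))

in_dist = (rng.normal(size=(1, 2)) @ basis)[0]   # on the learned subspace
novel = rng.normal(5.0, 1.0, size=6)             # off-subspace point
```

A sample whose reconstruction error exceeds a calibrated threshold would be flagged as novel.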
UncertaintyStrategy()
¶
Bases: NoveltyStrategy
Uncertainty-based strategy for novelty detection.
Flags samples as novel if their prediction uncertainty exceeds configured thresholds (margin or entropy).
Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
Attributes¶
config_schema
property
¶
Return UncertaintyConfig as the config schema.
Functions¶
get_config()
¶
Get the current configuration for this strategy.
Override this if your strategy stores its config differently.
initialize(reference_embeddings, reference_labels, config)
¶
Initialize the uncertainty strategy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| reference_embeddings | ndarray | Embeddings of known samples (not used) | required |
| reference_labels | list[str] | Labels of known samples (not used) | required |
| config | UncertaintyConfig | UncertaintyConfig with thresholds | required |
Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
detect(texts, embeddings, predicted_classes, confidences, **kwargs)
¶
Detect novel samples using uncertainty metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| texts | list[str] | Input texts | required |
| embeddings | ndarray | Text embeddings (not used) | required |
| predicted_classes | list[str] | Predicted classes (not used) | required |
| confidences | ndarray | Prediction confidence scores | required |
| **kwargs | | Additional parameters, may include 'all_probs' for full distribution | {} |

Returns:

| Type | Description |
|---|---|
| tuple[set[int], dict[int, dict[str, Any]]] | (flags, metrics) - Flagged indices and per-sample metrics |
Source code in src/novelentitymatcher/novelty/strategies/uncertainty.py
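The margin-or-entropy logic can be sketched in a few lines of numpy. The threshold names below mirror the "margin or entropy" description; the exact UncertaintyConfig field names and values are assumptions for illustration.

```python
# Sketch of uncertainty-based flagging: a sample is novel if its top-two
# probability margin is too small or its predictive entropy too large.
# It consumes the full distribution, as the 'all_probs' kwarg suggests.
import numpy as np

def uncertainty_flags(all_probs, margin_threshold=0.2, entropy_threshold=1.0):
    flags, metrics = set(), {}
    for i, p in enumerate(all_probs):
        top2 = np.sort(p)[-2:]
        margin = float(top2[1] - top2[0])                 # top-1 minus top-2
        entropy = float(-(p * np.log(p + 1e-12)).sum())   # Shannon entropy
        metrics[i] = {"margin": margin, "entropy": entropy}
        if margin < margin_threshold or entropy > entropy_threshold:
            flags.add(i)
    return flags, metrics

probs = np.array([[0.9, 0.05, 0.05],    # confident prediction
                  [0.4, 0.35, 0.25]])   # ambiguous prediction
flags, metrics = uncertainty_flags(probs)
```

When only top-1 confidences are available (no 'all_probs'), only a confidence-style criterion can apply, which is why detect() accepts the full distribution via **kwargs.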
MatchRecord(text, predicted_id, confidence, embedding, candidates=list(), raw_result=None, metadata=dict(), match_method=None, reference_embedding=None, distance=None)
dataclass
¶
Normalized per-query match metadata for downstream discovery stages.
MatchResultWithMetadata(predictions, confidences, embeddings, scores=None, metadata=None, candidate_results=list(), records=list())
dataclass
¶
Enhanced match result with stable downstream metadata.
The legacy attributes (predictions, confidences, embeddings, metadata)
remain available, while candidate_results and records provide a consistent
contract for novelty and pipeline stages.
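Since the exact internals are not shown here, the following is an illustrative mirror of the documented MatchRecord fields, with types inferred from the signature rather than taken from the source. The list()/dict() defaults in the rendered signature presumably correspond to default_factory fields, so instances do not share mutable state:

```python
# Structural sketch of MatchRecord as documented above; not the library's
# actual definition. Field types are inferred and may differ.
from dataclasses import dataclass, field
from typing import Any, Optional
import numpy as np

@dataclass
class MatchRecord:
    text: str
    predicted_id: str
    confidence: float
    embedding: np.ndarray
    candidates: list = field(default_factory=list)
    raw_result: Any = None
    metadata: dict = field(default_factory=dict)
    match_method: Optional[str] = None
    reference_embedding: Optional[np.ndarray] = None
    distance: Optional[float] = None

record = MatchRecord(text="acme corp", predicted_id="ACME",
                     confidence=0.92, embedding=np.zeros(4),
                     match_method="embedding")
```

Downstream novelty stages can then iterate over a result's records list and read confidence, embedding, and distance uniformly regardless of which matcher produced them.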