Core¶
novelentitymatcher.core.matcher
¶
Classes¶
EmbeddingMatcher(entities, model_name='sentence-transformers/paraphrase-mpnet-base-v2', threshold=0.7, normalize=True, embedding_dim=None, cache=None)
¶
Embedding-based similarity matching without training.
Source code in src/novelentitymatcher/core/embedding_matcher.py
Matcher(entities, model='default', threshold=0.7, normalize=True, mode=None, blocking_strategy=None, reranker_model='default', verbose=False, metrics_callback=None)
¶
Unified entity matcher with smart auto-selection.
Automatically chooses the best matching strategy:

- No training data -> zero-shot (embedding similarity)
- < 3 examples/entity -> head-only training (~30s)
- >= 3 examples/entity -> full training (~3min)
Source code in src/novelentitymatcher/core/matcher.py
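The auto-selection rule above can be sketched in plain Python. The function name `select_mode` and the shape of `training_data` are illustrative assumptions; only the thresholds (no data -> zero-shot, fewer than 3 examples per entity -> head-only, otherwise full) come from the description above.

```python
# Sketch of Matcher's documented auto-selection rule (names are illustrative).
from collections import Counter

def select_mode(training_data):
    """training_data: list of {'text': ..., 'label': entity_id} dicts."""
    if not training_data:
        return "zero-shot"            # embedding similarity only
    per_entity = Counter(ex["label"] for ex in training_data)
    if min(per_entity.values()) < 3:
        return "head-only"            # ~30s: train the classifier head only
    return "full"                     # ~3min: full training

print(select_mode([]))                                       # zero-shot
print(select_mode([{"text": "Berlin", "label": "DE"}] * 2))  # head-only
```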
Functions¶
novelentitymatcher.core.classifier
¶
Classes¶
SetFitClassifier(labels, model_name='sentence-transformers/paraphrase-mpnet-base-v2', num_epochs=4, batch_size=16, weight_decay=0.01, head_c=1.0, num_iterations=5, pca_dims=None, skip_body_training=False)
¶
Wrapper for SetFit training and prediction.
Source code in src/novelentitymatcher/core/classifier.py
Functions¶
train(training_data, num_epochs=None, batch_size=None, show_progress=True)
¶
Train the classifier.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `training_data` | `list[dict]` | List of training examples with 'text' and 'label' keys | required |
| `num_epochs` | `int \| None` | Number of training epochs (overrides default) | `None` |
| `batch_size` | `int \| None` | Batch size for training (overrides default) | `None` |
| `show_progress` | `bool` | Whether to show progress bar during training | `True` |
Source code in src/novelentitymatcher/core/classifier.py
Functions¶
novelentitymatcher.core.normalizer
¶
novelentitymatcher.core.reranker
¶
Cross-encoder reranking for semantic entity matching.
Classes¶
CrossEncoderReranker(model='bge-m3', backend=None, device=None, batch_size=32)
¶
User-facing API for cross-encoder reranking.
Provides precise reranking of candidate entities using cross-encoder models. Typically used after initial retrieval with bi-encoder models.
Example:

    from novelentitymatcher import EmbeddingMatcher, CrossEncoderReranker

    # Initial retrieval
    retriever = EmbeddingMatcher(entities, model_name="bge-base")
    retriever.build_index()
    candidates = retriever.match(query, top_k=50)

    # Rerank top candidates
    reranker = CrossEncoderReranker(model="bge-m3")
    final_results = reranker.rerank(query, candidates, top_k=5)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `str` | Model alias or full model name | `'bge-m3'` |
| `backend` | | Custom backend implementation (defaults to STReranker) | `None` |
| `device` | `str \| None` | Device to run model on (None for auto-detection) | `None` |
| `batch_size` | `int` | Batch size for inference | `32` |
Source code in src/novelentitymatcher/core/reranker.py
Functions¶
rerank(query, candidates, top_k=5, text_field='text')
¶
Rerank candidates using cross-encoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Query text | required |
| `candidates` | `list[dict[str, Any]]` | List of candidate dictionaries | required |
| `top_k` | `int` | Number of top results to return | `5` |
| `text_field` | `str` | Field name containing text to score | `'text'` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | Reranked list of candidates (top_k only) with added 'cross_encoder_score' field |
Source code in src/novelentitymatcher/core/reranker.py
rerank_batch(queries, candidates_list, top_k=5, text_field='text')
¶
Batch reranking for multiple queries.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `list[str]` | List of query texts | required |
| `candidates_list` | `list[list[dict[str, Any]]]` | List of candidate lists (one per query) | required |
| `top_k` | `int` | Number of top results to return per query | `5` |
| `text_field` | `str` | Field name containing text to score | `'text'` |

Returns:

| Type | Description |
|---|---|
| `list[list[dict[str, Any]]]` | List of reranked candidate lists |
Source code in src/novelentitymatcher/core/reranker.py
score(query, docs)
¶
Score query-document pairs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Query text | required |
| `docs` | `list[str]` | List of document texts | required |

Returns:

| Type | Description |
|---|---|
| `list[float]` | List of scores (one per document) |
Source code in src/novelentitymatcher/core/reranker.py
novelentitymatcher.core.hierarchy
¶
Hierarchical entity matching with multi-parent support.
This module provides:

- HierarchyIndex: Graph-based hierarchy representation
- HierarchicalScoring: Depth-aware confidence scoring
- HierarchicalMatcher: User-facing API for hierarchical matching
Classes¶
HierarchyIndex(entities)
¶
Graph-based index for hierarchical entity relationships.
Supports:

- Multi-parent hierarchies (DAG structure)
- Weighted edges for relationship strength
- Fast ancestor/descendant queries
- Path finding and depth calculation
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entities` | `list[dict[str, Any]]` | List of entity dicts with optional 'hierarchy' key. Hierarchy format: `{'parents': ['parent_id1', 'parent_id2'], 'children': ['child_id1', 'child_id2'], 'level': int, 'weights': {'parent_id': float}}` | required |
Source code in src/novelentitymatcher/core/hierarchy.py
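The ancestor query described above can be illustrated with a plain dict as the multi-parent graph. The entity names and the parent-map shape here are illustrative stand-ins; the real HierarchyIndex builds its graph from entity dicts with a 'hierarchy' key.

```python
# Minimal BFS ancestor query over a multi-parent DAG (illustrative data).
from collections import deque

parents = {                       # child -> list of parents (DAG, multi-parent)
    "iphone-15": ["smartphones", "apple-products"],
    "smartphones": ["electronics"],
    "apple-products": ["electronics"],
    "electronics": [],
}

def get_ancestors(entity_id, max_depth=None):
    seen, order = set(), []
    queue = deque((p, 1) for p in parents.get(entity_id, ()))
    while queue:
        node, depth = queue.popleft()
        if node in seen or (max_depth is not None and depth > max_depth):
            continue
        seen.add(node)
        order.append(node)
        queue.extend((p, depth + 1) for p in parents.get(node, ()))
    return order

print(get_ancestors("iphone-15"))
# ['smartphones', 'apple-products', 'electronics']
print(get_ancestors("iphone-15", max_depth=1))
# ['smartphones', 'apple-products']
```

Note that the shared ancestor "electronics" is reported once even though it is reachable through both parents; the `seen` set is what makes multi-parent DAGs safe to traverse.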
Functions¶
get_ancestors(entity_id, max_depth=None)
¶
Get all ancestor entities for a given entity.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Entity to find ancestors for | required |
| `max_depth` | `int \| None` | Maximum depth to traverse (None = unlimited) | `None` |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of ancestor entity IDs |
Source code in src/novelentitymatcher/core/hierarchy.py
get_descendants(entity_id, max_depth=None)
¶
Get all descendant entities for a given entity.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Entity to find descendants for | required |
| `max_depth` | `int \| None` | Maximum depth to traverse (None = unlimited) | `None` |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of descendant entity IDs |
Source code in src/novelentitymatcher/core/hierarchy.py
get_relationship_depth(entity_a, entity_b)
¶
Calculate the depth of relationship between two entities.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_a` | `str` | First entity ID | required |
| `entity_b` | `str` | Second entity ID | required |

Returns:

| Type | Description |
|---|---|
| `int` | Depth (0 = same entity, 1 = direct parent/child, 2 = grandparent, etc.). Returns -1 if no relationship found. |
Source code in src/novelentitymatcher/core/hierarchy.py
get_path(from_entity, to_entity)
¶
Get shortest path between two entities in the hierarchy.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `from_entity` | `str` | Starting entity ID | required |
| `to_entity` | `str` | Ending entity ID | required |

Returns:

| Type | Description |
|---|---|
| `list[str]` | List of entity IDs representing the path (inclusive). Returns an empty list if no path exists. |
Source code in src/novelentitymatcher/core/hierarchy.py
is_ancestor(ancestor_id, descendant_id)
¶
Check if ancestor_id is an ancestor of descendant_id.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `ancestor_id` | `str` | Potential ancestor | required |
| `descendant_id` | `str` | Potential descendant | required |

Returns:

| Type | Description |
|---|---|
| `bool` | True if ancestor_id is an ancestor of descendant_id |
Source code in src/novelentitymatcher/core/hierarchy.py
HierarchicalScoring(hierarchy_index, alpha=0.7, beta=0.3)
¶
Calculate hierarchy-aware confidence scores.
Combines:

- Semantic similarity (cosine similarity of embeddings)
- Hierarchical proximity boost (based on relationship type)
- Depth penalty (deeper relationships = lower scores)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `hierarchy_index` | `HierarchyIndex` | HierarchyIndex for graph operations | required |
| `alpha` | `float` | Weight for semantic similarity (0-1) | `0.7` |
| `beta` | `float` | Weight for hierarchical boost (0-1) | `0.3` |
Source code in src/novelentitymatcher/core/hierarchy.py
Functions¶
compute_score(query_embedding, entity_embedding, entity_id, relationship_type='self', depth=0)
¶
Compute hierarchical score combining semantic and hierarchical features.
Formula:

    final_score = (semantic_similarity * alpha + hierarchical_boost * beta) * depth_penalty
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query_embedding` | `ndarray` | Query text embedding | required |
| `entity_embedding` | `ndarray` | Entity text embedding | required |
| `entity_id` | `str` | Entity identifier | required |
| `relationship_type` | `str` | "self", "parent", "child", "ancestor", "descendant" | `'self'` |
| `depth` | `int` | Relationship depth (0=self, 1=direct, etc.) | `0` |

Returns:

| Type | Description |
|---|---|
| `float` | Final hierarchical score (0-1) |
Source code in src/novelentitymatcher/core/hierarchy.py
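The scoring formula above can be worked through numerically. The `1 / (1 + depth)` penalty shape is an assumption for illustration (the docs only say deeper relationships score lower), and the function name is hypothetical; the alpha/beta weighting is taken directly from the formula.

```python
# Sketch of the documented scoring formula; the depth-penalty shape is assumed.
def hierarchical_score(semantic_similarity, hierarchical_boost,
                       alpha=0.7, beta=0.3, depth=0):
    depth_penalty = 1.0 / (1.0 + depth)   # assumed form, not from the source
    return (semantic_similarity * alpha + hierarchical_boost * beta) * depth_penalty

# A direct match (depth 0) keeps its full combined score:
print(round(hierarchical_score(0.9, 1.0), 2))            # 0.93
# The same similarity to a grandparent (depth 2) is discounted:
print(round(hierarchical_score(0.9, 0.5, depth=2), 2))   # 0.26
```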
HierarchicalMatcher(entities, embedding_model='BAAI/bge-base-en-v1.5', alpha=0.7, beta=0.3, normalize=True)
¶
Hierarchical entity matching with multi-parent support.
Combines semantic similarity (via EmbeddingMatcher) with hierarchy-aware scoring to enable flexible granularity matching.
Features:

- Match at any level in hierarchy (self, ancestors, descendants)
- Multi-parent hierarchy support
- Depth-aware confidence scores
- Flexible granularity matching
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entities` | `list[dict[str, Any]]` | List of entity dicts with optional 'hierarchy' key | required |
| `embedding_model` | `str` | Sentence transformer model name | `'BAAI/bge-base-en-v1.5'` |
| `alpha` | `float` | Weight for semantic similarity (0-1) | `0.7` |
| `beta` | `float` | Weight for hierarchical boost (0-1) | `0.3` |
| `normalize` | `bool` | Whether to apply text normalization | `True` |
Source code in src/novelentitymatcher/core/hierarchy.py
Functions¶
build_index()
¶
Build embedding index for all entities.
Must be called before matching.
Source code in src/novelentitymatcher/core/hierarchy.py
match(query, top_k=5, match_level='all', max_depth=3)
¶
Match query considering hierarchical relationships.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Query text | required |
| `top_k` | `int` | Number of results to return | `5` |
| `match_level` | `str` | "self", "ancestors", "descendants", "all" | `'all'` |
| `max_depth` | `int` | Maximum depth to traverse for hierarchical matches | `3` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of matches |
Source code in src/novelentitymatcher/core/hierarchy.py
get_ancestors(entity_id, max_depth=None)
¶
Get all ancestors of an entity with metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Entity to find ancestors for | required |
| `max_depth` | `int \| None` | Maximum depth to traverse | `None` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of ancestor entities with metadata |
Source code in src/novelentitymatcher/core/hierarchy.py
get_descendants(entity_id, max_depth=None)
¶
Get all descendants of an entity with metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Entity to find descendants for | required |
| `max_depth` | `int \| None` | Maximum depth to traverse | `None` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of descendant entities with metadata |
Source code in src/novelentitymatcher/core/hierarchy.py
get_hierarchy_path(entity_id, to_entity=None)
¶
Get path from entity_id to root or to_entity.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_id` | `str` | Starting entity | required |
| `to_entity` | `str \| None` | Ending entity (None = path to root) | `None` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of entities representing the path |
Source code in src/novelentitymatcher/core/hierarchy.py
novelentitymatcher.core.blocking
¶
Blocking strategies for efficient candidate filtering.
Classes¶
BlockingStrategy
¶
Bases: ABC
Abstract base class for blocking strategies.
Functions¶
block(query, entities, top_k)
abstractmethod
¶
Return top_k candidate entities for the query.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Query text | required |
| `entities` | `list[dict[str, Any]]` | List of all entities | required |
| `top_k` | `int` | Maximum number of candidates to return | required |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of candidate entities (top_k or fewer) |
Source code in src/novelentitymatcher/core/blocking.py
NoOpBlocking
¶
Bases: BlockingStrategy
Pass-through blocking for small datasets.
Returns all entities up to top_k without any filtering.
BM25Blocking(k1=1.5, b=0.75)
¶
Bases: BlockingStrategy
Fast lexical blocking using BM25.
Uses BM25 algorithm for efficient lexical matching. Good for keyword-heavy queries and proper nouns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `k1` | `float` | BM25 k1 parameter (term frequency saturation) | `1.5` |
| `b` | `float` | BM25 b parameter (length normalization) | `0.75` |
Source code in src/novelentitymatcher/core/blocking.py
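The effect of the `k1` and `b` parameters can be seen in a toy BM25 scorer. This is just the textbook formula for illustration; BM25Blocking delegates to a real BM25 implementation rather than this sketch.

```python
# Toy BM25 scorer illustrating the k1 (TF saturation) and b (length
# normalization) parameters documented above.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenized)   # document frequency
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["iphone 15 pro case", "samsung galaxy case", "usb c cable"]
scores = bm25_scores("iphone case", docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 0 -> "iphone 15 pro case"
```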
Functions¶
build_index(entities)
¶
Build BM25 index from entities.
Source code in src/novelentitymatcher/core/blocking.py
block(query, entities, top_k)
¶
Return top_k candidates using BM25 scores.
Source code in src/novelentitymatcher/core/blocking.py
TFIDFBlocking()
¶
Bases: BlockingStrategy
TF-IDF based blocking.
Uses TF-IDF vectorization for lexical matching. Good for document-level similarity.
Optimized with:

- Vocabulary caching across rebuilds
- Efficient content-based hashing (MD5)
- Sparse matrix operations via sklearn
Source code in src/novelentitymatcher/core/blocking.py
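The content-based hashing idea can be sketched with the stdlib: hash the entity texts into a fingerprint so an index rebuild can be skipped when nothing changed. The helper name `corpus_fingerprint` is hypothetical; the real caching lives inside TFIDFBlocking.

```python
# Sketch of MD5 content hashing for cache invalidation (hypothetical helper).
import hashlib

def corpus_fingerprint(entities):
    h = hashlib.md5()
    for ent in entities:
        h.update(ent["text"].encode("utf-8"))
        h.update(b"\x00")   # separator so adjacent texts cannot merge
    return h.hexdigest()

a = [{"text": "iphone"}, {"text": "galaxy"}]
b = [{"text": "iphone"}, {"text": "galaxy"}]
c = [{"text": "iphonegalaxy"}]
print(corpus_fingerprint(a) == corpus_fingerprint(b))  # True  -> reuse index
print(corpus_fingerprint(a) == corpus_fingerprint(c))  # False -> rebuild
```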
Functions¶
build_index(entities)
¶
Build TF-IDF index from entities.
Source code in src/novelentitymatcher/core/blocking.py
block(query, entities, top_k)
¶
Return top_k candidates using TF-IDF scores.
Source code in src/novelentitymatcher/core/blocking.py
FuzzyBlocking(score_cutoff=70)
¶
Bases: BlockingStrategy
Fuzzy string matching blocking.
Uses RapidFuzz for approximate string matching. Good for catching typos and variations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `score_cutoff` | `int` | Minimum similarity score (0-100) | `70` |
Source code in src/novelentitymatcher/core/blocking.py
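The blocking behavior can be sketched with stdlib `difflib` standing in for RapidFuzz: score each entity 0-100, drop anything below `score_cutoff`, keep the best `top_k`. The scoring scale differs slightly from RapidFuzz, so treat this as the shape of the algorithm, not its exact scores.

```python
# Fuzzy blocking sketch: difflib stands in for RapidFuzz.
from difflib import SequenceMatcher

def fuzzy_block(query, entities, top_k, score_cutoff=70):
    scored = []
    for ent in entities:
        score = SequenceMatcher(None, query.lower(), ent["text"].lower()).ratio() * 100
        if score >= score_cutoff:           # drop weak candidates early
            scored.append((score, ent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ent for _, ent in scored[:top_k]]

entities = [{"text": "Muenchen"}, {"text": "Munchen"}, {"text": "Berlin"}]
# Typo/variant forms of "München" survive the cutoff; "Berlin" does not.
print([e["text"] for e in fuzzy_block("München", entities, top_k=2)])
```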
Functions¶
block(query, entities, top_k)
¶
Return top_k candidates using fuzzy matching.
Source code in src/novelentitymatcher/core/blocking.py
novelentitymatcher.core.embedding_matcher
¶
Classes¶
EmbeddingMatcher(entities, model_name='sentence-transformers/paraphrase-mpnet-base-v2', threshold=0.7, normalize=True, embedding_dim=None, cache=None)
¶
Embedding-based similarity matching without training.
Source code in src/novelentitymatcher/core/embedding_matcher.py
Functions¶
novelentitymatcher.core.bert_classifier
¶
BERT-based classifier using transformers library.
This module provides BERTClassifier, a drop-in alternative to SetFitClassifier that uses fine-tuned BERT models for text classification. BERT classifiers provide superior accuracy for complex pattern-driven tasks but with higher computational cost.
Classes¶
BERTClassifier(labels, model_name='distilbert-base-uncased', num_epochs=3, batch_size=16, learning_rate=2e-05, max_length=128, use_fp16=True)
¶
BERT-based text classifier using transformers library.
This classifier provides a drop-in alternative to SetFitClassifier with identical interface. It uses fine-tuned BERT models for classification, offering superior accuracy for complex pattern-driven tasks.
Example:

    from novelentitymatcher.core.bert_classifier import BERTClassifier

    labels = ["DE", "FR", "US"]
    clf = BERTClassifier(labels=labels, model_name="distilbert-base-uncased")
    training_data = [
        {"text": "Germany", "label": "DE"},
        {"text": "France", "label": "FR"},
        {"text": "USA", "label": "US"},
    ]
    clf.train(training_data, num_epochs=3)
    prediction = clf.predict("Deutschland")  # "DE"
    proba = clf.predict_proba("Deutschland")  # [0.02, 0.01, 0.97]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `labels` | `list[str]` | List of class labels for classification. | required |
| `model_name` | `str` | HuggingFace model name or path. | `'distilbert-base-uncased'` |
| `num_epochs` | `int` | Number of training epochs. | `3` |
| `batch_size` | `int` | Training batch size. | `16` |
| `learning_rate` | `float` | Learning rate for training. | `2e-05` |
| `max_length` | `int` | Maximum sequence length for tokenization. | `128` |
| `use_fp16` | `bool` | Whether to use mixed precision training (faster, less memory). Only works on GPU. | `True` |
Source code in src/novelentitymatcher/core/bert_classifier.py
Functions¶
train(training_data, num_epochs=None, batch_size=None, show_progress=True)
¶
Train the BERT classifier.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `training_data` | `list[dict]` | List of training examples with 'text' and 'label' keys. | required |
| `num_epochs` | `int \| None` | Number of training epochs (overrides default). | `None` |
| `batch_size` | `int \| None` | Batch size for training (overrides default). | `None` |
| `show_progress` | `bool` | Whether to show progress bar during training. | `True` |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If training fails or data is invalid. |
Source code in src/novelentitymatcher/core/bert_classifier.py
predict(texts)
¶
Predict labels for input text(s).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | Single text string or list of text strings. | required |

Returns:

| Type | Description |
|---|---|
| `str \| list[str]` | Predicted label(s). If input is a single string, returns a single label; if input is a list, returns a list of labels. |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If model is not trained yet. |
Source code in src/novelentitymatcher/core/bert_classifier.py
predict_proba(text)
¶
Get prediction probabilities for all labels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text string. | required |

Returns:

| Type | Description |
|---|---|
| `ndarray` | NumPy array of probabilities for each label, in same order as self.labels. |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If model is not trained yet. |
Source code in src/novelentitymatcher/core/bert_classifier.py
save(path)
¶
Save the trained model and tokenizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Directory path to save the model. | required |

Raises:

| Type | Description |
|---|---|
| `TrainingError` | If model is not trained yet. |
Source code in src/novelentitymatcher/core/bert_classifier.py
load(path)
classmethod
¶
Load a trained BERTClassifier from disk.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Directory path containing the saved model. | required |

Returns:

| Type | Description |
|---|---|
| `BERTClassifier` | Loaded BERTClassifier instance. |
Source code in src/novelentitymatcher/core/bert_classifier.py
Functions¶
novelentitymatcher.core.matching_strategy
¶
Matching strategy pattern for Matcher mode selection.
Classes¶
StrategyConfig(threshold, model_name, training_mode, normalize=True)
dataclass
¶
Configuration for matching strategies.
Encapsulates threshold, model settings, and training mode that were previously managed in _EntityMatcher.
MatchingStrategy(matcher)
¶
Bases: ABC
Abstract base class for matching strategies.
Source code in src/novelentitymatcher/core/matching_strategy.py
ZeroShotStrategy(matcher)
¶
Bases: MatchingStrategy
Strategy for zero-shot (embedding-only) matching.
Source code in src/novelentitymatcher/core/matching_strategy.py
HeadOnlyFullStrategy(matcher)
¶
Bases: MatchingStrategy
Strategy for head-only and full training modes.
Source code in src/novelentitymatcher/core/matching_strategy.py
BertStrategy(matcher)
¶
Bases: MatchingStrategy
Strategy for BERT-based matching.
Source code in src/novelentitymatcher/core/matching_strategy.py
HybridStrategy(matcher)
¶
Bases: MatchingStrategy
Strategy for hybrid blocking + retrieval matching.
Source code in src/novelentitymatcher/core/matching_strategy.py
MatcherFacade(embedding_matcher, entity_matcher, bert_matcher, hybrid_matcher, config)
¶
Facade providing access to all matcher components for strategies.
Source code in src/novelentitymatcher/core/matching_strategy.py
Functions¶
get_strategy(mode)
¶
Get strategy class for the given mode.
Source code in src/novelentitymatcher/core/matching_strategy.py
novelentitymatcher.core.hybrid
¶
Hybrid matching pipeline with blocking, retrieval, and reranking.
Classes¶
HybridMatcher(entities, blocking_strategy=None, retriever_model='BAAI/bge-base-en-v1.5', reranker_model='BAAI/bge-reranker-v2-m3', normalize=True)
¶
Three-stage waterfall pipeline for semantic entity matching.
Combines fast blocking, semantic retrieval, and precise reranking for accurate and efficient matching.
Pipeline Stages:

1. Blocking (BM25/TF-IDF/Fuzzy) - fast lexical filtering
2. Bi-Encoder Retrieval - semantic similarity search
3. Cross-Encoder Reranking - precise cross-attention scoring
Example:

    from novelentitymatcher import HybridMatcher
    from novelentitymatcher.core.blocking import BM25Blocking

    matcher = HybridMatcher(
        entities=products,
        blocking_strategy=BM25Blocking(),
        retriever_model="bge-base",
        reranker_model="bge-m3",
    )

    results = matcher.match(
        "iPhone 15 case",
        blocking_top_k=1000,
        retrieval_top_k=50,
        final_top_k=5,
    )
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `entities` | `list[dict[str, Any]]` | List of entity dictionaries | required |
| `blocking_strategy` | `BlockingStrategy \| None` | Blocking strategy (defaults to NoOpBlocking) | `None` |
| `retriever_model` | `str` | Model name for bi-encoder retrieval | `'BAAI/bge-base-en-v1.5'` |
| `reranker_model` | `str` | Model name for cross-encoder reranking | `'BAAI/bge-reranker-v2-m3'` |
| `normalize` | `bool` | Whether to normalize text (lowercase, remove accents, etc.) | `True` |
Source code in src/novelentitymatcher/core/hybrid.py
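The waterfall structure reduces to a simple narrowing loop. The scorer lambdas below are placeholders standing in for BM25, the bi-encoder, and the cross-encoder; only the control flow (each stage re-scores the survivors of the previous one and keeps fewer of them) reflects the pipeline above.

```python
# Three-stage waterfall sketch: cheap scorers narrow 2000 -> 1000 -> 50 -> 5.
def waterfall(query, entities, stages, ks):
    candidates = entities
    for score, k in zip(stages, ks):
        candidates = sorted(candidates,
                            key=lambda e: score(query, e), reverse=True)[:k]
    return candidates

entities = [{"text": f"item {i}"} for i in range(2000)]
cheap = lambda q, e: len(set(q.split()) & set(e["text"].split()))  # blocking stand-in
dense = lambda q, e: -abs(len(q) - len(e["text"]))                 # retrieval stand-in
cross = lambda q, e: float(q in e["text"])                         # reranker stand-in

top = waterfall("item 7", entities, [cheap, dense, cross], [1000, 50, 5])
print(len(top))  # 5
```

The design point: each later stage is more accurate but more expensive per candidate, so the cheap stage must keep its `top_k` generous (1000) while the final stage only ever scores 50 items.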
Functions¶
match(query, blocking_top_k=1000, retrieval_top_k=50, final_top_k=5)
¶
Match query using three-stage waterfall pipeline.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Search query | required |
| `blocking_top_k` | `int` | Number of candidates after blocking stage | `1000` |
| `retrieval_top_k` | `int` | Number of candidates after retrieval stage | `50` |
| `final_top_k` | `int` | Number of final results after reranking | `5` |

Returns:

| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of matched entities with scores (bi-encoder and cross-encoder) |
Source code in src/novelentitymatcher/core/hybrid.py
match_bulk(queries, blocking_top_k=1000, retrieval_top_k=50, final_top_k=5, n_jobs=-1, chunk_size=None)
¶
Batch matching for multiple queries.
Batches bi-encoder encoding across all queries (single model.encode call instead of one per query), then computes per-query similarity against blocked candidates.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `list[str]` | List of search queries | required |
| `blocking_top_k` | `int` | Number of candidates after blocking stage | `1000` |
| `retrieval_top_k` | `int` | Number of candidates after retrieval stage | `50` |
| `final_top_k` | `int` | Number of final results after reranking | `5` |
| `n_jobs` | `int` | Ignored (kept for backwards compatibility). | `-1` |
| `chunk_size` | `int \| None` | Ignored (kept for backwards compatibility). | `None` |

Returns:

| Type | Description |
|---|---|
| `list[list[dict[str, Any]]]` | List of matched entity lists (one per query) |
Source code in src/novelentitymatcher/core/hybrid.py
novelentitymatcher.core.matcher_components
¶
novelentitymatcher.core.matcher_runtime
¶
novelentitymatcher.core.matcher_shared
¶
Classes¶
Functions¶
extract_top_prediction_metadata(match_results, single_input)
¶
Normalize matcher output into top-1 predictions and confidences.
Novel class detection only needs the best prediction per input. This keeps a
stable shape even when the underlying matcher returns dicts, lists, strings,
or None values.
Source code in src/novelentitymatcher/core/matcher_shared.py
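The normalization described above can be illustrated with a small re-implementation. The field names `'entity_id'` and `'score'` are assumptions for the example; only the shape (one best prediction plus a confidence per input, tolerant of dicts, lists, strings, and None) follows the docstring.

```python
# Illustrative top-1 normalization (field names are assumed, not from source).
def top1(result):
    if result is None:
        return None, 0.0
    if isinstance(result, str):              # bare label, no score available
        return result, 1.0
    if isinstance(result, list):             # ranked list -> take the head
        return top1(result[0]) if result else (None, 0.0)
    return result.get("entity_id"), result.get("score", 0.0)  # dict match

print(top1(None))                                  # (None, 0.0)
print(top1("DE"))                                  # ('DE', 1.0)
print(top1([{"entity_id": "FR", "score": 0.9}]))   # ('FR', 0.9)
```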
novelentitymatcher.core.async_utils
¶
Classes¶
AsyncExecutor(max_workers=None)
¶
Manages async execution of sync operations.
Runs CPU-bound or blocking sync operations in a thread pool, allowing async code to proceed without blocking the event loop.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_workers` | `int \| None` | Maximum number of worker threads. Defaults to CPU_COUNT * 2, capped at 32 for I/O bound workloads. | `None` |
Source code in src/novelentitymatcher/core/async_utils.py
Functions¶
run_in_thread(func, *args, **kwargs)
async
¶
Run a sync function in a thread pool.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `Callable` | Synchronous function to execute | required |
| `*args` | | Positional arguments to pass to func | `()` |
| `**kwargs` | | Keyword arguments to pass to func | `{}` |

Returns:

| Type | Description |
|---|---|
| `Any` | The return value of func |
Source code in src/novelentitymatcher/core/async_utils.py
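The pattern `run_in_thread` wraps can be sketched with the stdlib alone: dispatch a blocking call to a thread pool via the event loop, so other coroutines keep running. The `slow_match` function is a stand-in for any blocking matcher call.

```python
# Stdlib sketch of running blocking work in a thread pool from async code.
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def slow_match(query):            # stands in for a blocking matcher call
    time.sleep(0.01)
    return f"matched:{query}"

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Both calls run concurrently without blocking the event loop.
        return await asyncio.gather(
            loop.run_in_executor(pool, partial(slow_match, "iphone")),
            loop.run_in_executor(pool, partial(slow_match, "galaxy")),
        )

print(asyncio.run(main()))  # ['matched:iphone', 'matched:galaxy']
```

`functools.partial` is the usual way to bind `*args`/`**kwargs` before handing the callable to the executor, since `run_in_executor` only accepts positional arguments.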
run_in_thread_batch(func, items, batch_size=32)
async
¶
Run sync function on batches concurrently.
Splits items into batches and runs func on each batch in parallel, then flattens the results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `func` | `Callable` | Function that takes a list and returns a list | required |
| `items` | `list[Any]` | Items to process in batches | required |
| `batch_size` | `int` | Size of each batch | `32` |

Returns:

| Type | Description |
|---|---|
| `list[Any]` | Flattened list of results from all batches |
Source code in src/novelentitymatcher/core/async_utils.py
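The batch pattern described above — split into fixed-size chunks, run the sync function on each chunk in parallel, flatten — can be sketched with `asyncio.to_thread`. The `encode_batch` function is a stand-in for any sync batch operation; this is the shape of the behavior, not the library's internals.

```python
# Stdlib sketch of batched thread execution with order-preserving flatten.
import asyncio

def encode_batch(texts):                  # stands in for a sync batch encoder
    return [t.upper() for t in texts]

async def run_in_thread_batch(func, items, batch_size=32):
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # gather() preserves batch order, so the flattened result matches the input.
    results = await asyncio.gather(*(asyncio.to_thread(func, b) for b in batches))
    return [x for batch in results for x in batch]

out = asyncio.run(run_in_thread_batch(encode_batch, ["a", "b", "c"], batch_size=2))
print(out)  # ['A', 'B', 'C']
```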
shutdown()
¶
Clean up resources by shutting down the thread pool. Idempotent.