Matcher Modes¶
Related docs: index.md | quickstart.md | architecture.md
Overview¶
The unified Matcher class supports multiple matching strategies through modes. Modes automatically route to the optimal implementation (EmbeddingMatcher, EntityMatcher, or HybridMatcher).
Available Modes¶
| Mode | Description | Training Time | Use Case |
|---|---|---|---|
| `zero-shot` | Embedding similarity only | None | No training data available |
| `head-only` | Train classifier head only | ~30s | Minimal training data (1-2 examples/entity) |
| `full` | Full SetFit training | ~3min | Sufficient training data (3+ examples/entity) |
| `bert` | BERT-based classifier | ~5min | High accuracy needed (100+ examples/entity) |
| `hybrid` | Multi-stage pipeline | None | Large datasets (10k+ entities) |
| `auto` | Smart auto-detection | Variable | Let the library choose |
Mode Comparison¶
zero-shot¶
What: Pure embedding matching, scored by cosine similarity.

When to use:

- No labeled training data available
- Need immediate results
- Prototyping or exploration

Pros:

- No training required
- Instant setup
- Works out of the box

Cons:

- Lower accuracy than trained modes
- Can't learn from your data
Example:

```python
from novelentitymatcher import Matcher

matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit()
result = matcher.match("query")
```
Implementation: Routes to EmbeddingMatcher
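The core idea can be illustrated with plain cosine similarity. This is an illustrative sketch, not the library's implementation; the toy 2-dimensional vectors stand in for real model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_match(query_vec, entity_vecs):
    """Return the entity id whose embedding is most similar to the query."""
    return max(entity_vecs, key=lambda eid: cosine(query_vec, entity_vecs[eid]))

# Toy embeddings; a real embedding model produces these vectors.
entity_vecs = {"DE": [0.9, 0.1], "FR": [0.2, 0.8]}
print(zero_shot_match([0.85, 0.2], entity_vecs))  # → DE
```

No training ever touches the entity vectors; accuracy is bounded by how well the embedding model already separates your entities.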
head-only¶
What: Lightweight SetFit training that only trains the classification head.
When to use:

- Limited training data (1-2 examples per entity)
- Need fast training
- Quick iteration on the model

Pros:

- Fast training (~30 seconds)
- Better than zero-shot with minimal data
- Good for quick experiments

Cons:

- Lower accuracy than full training
- May not capture complex patterns
Example:

```python
matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data, num_epochs=1)
result = matcher.match("query")
```
Implementation: Routes to EntityMatcher with head-only training
full¶
What: Full SetFit training with contrastive learning.
When to use:

- Sufficient training data (3+ examples per entity)
- Need best accuracy
- Production deployment

Pros:

- Best accuracy
- Learns from your data
- Robust to variations

Cons:

- Slower training (~3 minutes)
- Requires more training data
Example:

```python
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data, num_epochs=4)
result = matcher.match("query")
```
Implementation: Routes to EntityMatcher with full training
bert¶
What: BERT-based classifier built on the transformers library.

When to use:

- Accuracy is critical in high-stakes domains (legal, medical, financial)
- Complex pattern recognition needed (sarcasm, nuanced sentiment)
- Data-rich scenarios (100+ examples per entity recommended)
- GPU resources available
- Inference speed is not critical

Pros:

- Superior accuracy for complex tasks (often 3-5% better than SetFit)
- Works well with smaller datasets (8-16 examples per class)
- State-of-the-art transformer architecture

Cons:

- Slower training (~5 minutes, GPU recommended)
- Slower inference (full transformer pass required)
- Higher computational cost
- Larger model files on disk
Example:

```python
matcher = Matcher(entities=entities, mode="bert", model="distilbert")
matcher.fit(training_data, num_epochs=3)
result = matcher.match("query")
```
Implementation: Routes to EntityMatcher with BERT classifier
BERT Models:

```python
# Default: DistilBERT (recommended)
matcher = Matcher(entities=entities, mode="bert")

# For maximum accuracy
matcher = Matcher(entities=entities, mode="bert", model="deberta-v3")

# For resource-constrained environments
matcher = Matcher(entities=entities, mode="bert", model="tinybert")

# For multilingual text
matcher = Matcher(entities=entities, mode="bert", model="bert-multilingual")
```
See: bert-classifier.md for detailed BERT guide.
hybrid¶
What: Three-stage pipeline: blocking → retrieval → reranking.
Stages:

1. Blocking: fast candidate filtering (BM25/TF-IDF)
2. Retrieval: embedding similarity on the filtered candidates
3. Reranking: cross-encoder scoring for precision

When to use:

- Large datasets (10k+ entities)
- Need both speed and accuracy
- Can tolerate some complexity

Pros:

- Scales to very large datasets
- High accuracy with reranking
- Efficient candidate pruning

Cons:

- More complex setup
- Multiple models to load
- Higher memory usage
Example:

```python
matcher = Matcher(entities=entities, mode="hybrid")
matcher.fit()  # No training required
result = matcher.match("query")
```

Implementation: Routes to HybridMatcher

Pipeline Parameters:

```python
result = matcher.match(
    "query",
    blocking_top_k=1000,  # Candidates after blocking
    retrieval_top_k=50,   # Candidates after retrieval
    final_top_k=5,        # Final results after reranking
)
```
auto¶
What: Smart mode selection based on training data.
Decision Logic:

- No training data → zero-shot
- < 3 examples/entity → head-only
- ≥ 3 examples/entity, < 100 total → full
- ≥ 100 total, ≥ 8 examples/entity → bert
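The thresholds above can be sketched as a small routing function. This is a hypothetical stand-in for the library's internal detection; the function and parameter names are illustrative, not part of the API.

```python
# Hypothetical sketch of the auto-detection thresholds listed above.
# `min_per_entity` is the minimum number of examples across entities.
def detect_mode(min_per_entity: int, total_examples: int) -> str:
    if total_examples == 0:
        return "zero-shot"
    if min_per_entity < 3:
        return "head-only"
    if total_examples >= 100 and min_per_entity >= 8:
        return "bert"
    return "full"  # 3+ examples/entity, but not enough volume for bert

print(detect_mode(5, 60))  # → full
```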
When to use:

- Unsure which mode to pick
- Want the library to choose optimally
- Starting a new project
Example:

```python
matcher = Matcher(entities=entities, mode="auto")
matcher.fit(training_data)  # Auto-selects based on data
```
How it works:

1. Analyzes training data volume per entity
2. Selects the appropriate mode automatically
3. Stores the detected mode for transparency
Check detected mode:

```python
info = matcher.get_training_info()
print(info["detected_mode"])  # "zero-shot", "head-only", "full", or "bert"
```
Mode Selection Decision Tree¶
```text
Do you have training data?
│
├─ No → zero-shot
│
└─ Yes → How many examples per entity?
   │
   ├─ < 3 → head-only (fast, ~30s)
   │
   ├─ ≥ 3, < 100 total examples → full (accurate, ~3min)
   │
   └─ ≥ 100 total, ≥ 8 per entity → bert (very accurate, ~5min)
```

Special case: for large datasets (10k+ entities), use hybrid regardless of training data.
Explicit Mode Selection¶
Override auto-detection when you know what you want:
```python
# Force zero-shot even with training data
matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit(training_data)  # Training data ignored

# Force full training even with minimal data
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)  # Will train but may overfit
```
Performance & Accuracy Tradeoffs¶
Speed Comparison¶
Benchmark with 100 entities and 50 training examples:
| Mode | Training Time | Query Speed |
|---|---|---|
| zero-shot | None | Fast (static: ~1ms, dynamic: ~10ms) |
| head-only | ~30s | Fast (~10ms) |
| full | ~3min | Fast (~10ms) |
| bert | ~5min | Medium (~50ms) |
| hybrid | None | Medium (~50-100ms with reranking) |
Accuracy Comparison¶
Accuracy on a typical dataset (higher is better):
| Mode | Accuracy | Notes |
|---|---|---|
| zero-shot | 70-80% | Good baseline |
| head-only | 80-85% | Better with minimal data |
| full | 85-95% | Best with sufficient data |
| bert | 88-98% | Superior for complex patterns |
| hybrid | 90-95% | Best for large datasets |
Actual results vary by dataset quality and size.
Hybrid Mode Deep Dive¶
Pipeline Stages¶
```python
# Stage 1: Blocking (fast candidate filtering)
#   BM25, TF-IDF, or fuzzy matching
#   Reduces 10k entities → 1000 candidates

# Stage 2: Retrieval (embedding similarity)
#   Static or dynamic embeddings
#   Reduces 1000 candidates → 50 candidates

# Stage 3: Reranking (cross-encoder scoring)
#   Precise but slow
#   Reduces 50 candidates → 5 final results
```
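The funnel can be sketched generically: each stage re-scores the surviving candidates with a progressively more expensive function and keeps only the top k. This is an illustrative sketch, not the library's HybridMatcher; the scoring functions are toy stand-ins for BM25, embedding similarity, and a cross-encoder.

```python
# Generic top-k funnel sketch for a blocking → retrieval → reranking pipeline.
def top_k(candidates, score_fn, k):
    """Keep the k highest-scoring candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def funnel(query, entities, cheap, medium, expensive,
           blocking_top_k=1000, retrieval_top_k=50, final_top_k=5):
    survivors = top_k(entities, lambda e: cheap(query, e), blocking_top_k)
    survivors = top_k(survivors, lambda e: medium(query, e), retrieval_top_k)
    return top_k(survivors, lambda e: expensive(query, e), final_top_k)

# Toy demo: "closeness to 42" plays the role of every scorer.
score = lambda q, e: -abs(q - e)
print(funnel(42, list(range(100)), score, score, score,
             blocking_top_k=10, retrieval_top_k=5, final_top_k=3))
# → [42, 41, 43]
```

The design point is cost shaping: the expensive scorer only ever sees the handful of candidates the cheap stages let through.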
Blocking Strategies¶
```python
from novelentitymatcher.core.blocking import BM25Blocking

matcher = Matcher(
    entities=entities,
    mode="hybrid",
    blocking_strategy=BM25Blocking(),
)
```
Available strategies:
- BM25Blocking - Keyword-based (default)
- TFIDFBlocking - Document similarity
- FuzzyBlocking - Typos and variations
- NoOpBlocking - No filtering (for small datasets)
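As an illustration of what blocking does, here is a toy token-overlap filter in the spirit of keyword-based blocking; it is not any of the classes above, and real strategies such as BM25 score and rank rather than just filter.

```python
# Toy blocking filter: keep entities that share at least one token
# with the query; everything else is pruned before the expensive stages.
def token_blocking(query, entities, name_key="name"):
    q_tokens = set(query.lower().split())
    return [e for e in entities
            if q_tokens & set(e[name_key].lower().split())]

entities = [{"id": "DE", "name": "Federal Republic of Germany"},
            {"id": "FR", "name": "French Republic"}]
print([e["id"] for e in token_blocking("republic of germany", entities)])
# → ['DE', 'FR']
```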
Reranker Models¶
Available rerankers:
- bge-m3 - Multilingual, high quality (default)
- bge-large - Higher accuracy, slower
- ms-marco - Lightweight alternative
Candidate Filtering (Trained Modes)¶
When using head-only, full, or bert modes, restrict matching to known candidates:
```python
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)

# Only match against specific candidates
candidates = [
    {"id": "DE", "name": "Germany"},
    {"id": "FR", "name": "France"},
]
result = matcher.match("query", candidates=candidates)
# Only returns DE or FR, not other entities
```
Use cases:

- Geographic filtering (e.g., only European countries)
- Category filtering (e.g., only technology companies)
- User permissions (e.g., only entities the user can access)
Mode-Specific Features¶
zero-shot Features¶
```python
# Static embeddings (fastest)
matcher = Matcher(mode="zero-shot", model="potion-8m")

# Dynamic embeddings (better accuracy)
matcher = Matcher(mode="zero-shot", model="bge-base")

# Dimension reduction (MRL models)
matcher = Matcher(
    mode="zero-shot",
    model="mrl-en",
    embedding_dim=256,
)
```
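Matryoshka-style (MRL) dimension reduction can be sketched as truncating an embedding to its leading components and re-normalizing. This is an illustrative sketch of the idea, not the library's code.

```python
import math

# Illustrative MRL-style reduction: keep the first `dim` components and
# re-normalize to unit length so cosine similarity remains meaningful.
def truncate_embedding(vec, dim):
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

print(truncate_embedding([3.0, 4.0, 1.0], 2))  # → [0.6, 0.8]
```

MRL-trained models pack the most important information into the leading dimensions, which is why such truncation loses little accuracy.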
head-only / full Features¶
```python
# Training parameters
matcher.fit(
    training_data,
    num_epochs=4,        # Training epochs
    batch_size=16,       # Batch size
    show_progress=True,  # Show progress bar
)

# Candidate filtering
result = matcher.match("query", candidates=candidates)
```
hybrid Features¶
```python
# Pipeline tuning
result = matcher.match(
    "query",
    blocking_top_k=1000,
    retrieval_top_k=50,
    final_top_k=5,
)

# Batch processing
results = matcher.match(
    ["query1", "query2", ...],
    n_jobs=-1,       # Parallel processing
    chunk_size=100,  # Batch size
)
```
Migration from Deprecated Classes¶
Old Way¶
```python
from novelentitymatcher import EmbeddingMatcher, EntityMatcher

# Zero-shot
matcher = EmbeddingMatcher(entities=entities)
matcher.build_index()

# Training
matcher = EntityMatcher(entities=entities)
matcher.train(training_data)
```
New Way¶
```python
from novelentitymatcher import Matcher

# Zero-shot
matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit()

# Training
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)
```
Choosing the Right Mode¶
Quick Decision Guide¶
- I have no training data → zero-shot
- I have some training data (1-2 examples/entity) → head-only
- I have good training data (3+ examples/entity) → full
- I have rich training data (100+ examples, 8+ per entity) → bert
- I have 10k+ entities → hybrid
- I'm not sure → auto (let the library choose)
Scenario Examples¶
API endpoint with no training data:

```python
matcher = Matcher(entities=entities, mode="zero-shot", model="potion-8m")
# Fastest, no training needed
```

Internal tool with a few labeled examples:

```python
matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data)
# Fast training, better than zero-shot
```

Production system with good training data:

```python
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)
# Best accuracy with sufficient data
```

High-stakes application with rich training data:

```python
matcher = Matcher(entities=entities, mode="bert", model="distilbert")
matcher.fit(training_data)
# Superior accuracy for complex patterns
```

Enterprise directory (50k employees):

```python
matcher = Matcher(entities=entities, mode="hybrid")
# Scales to large datasets via blocking and reranking
```
Diagnostic Tools¶
Check Current Mode¶
```python
info = matcher.get_training_info()
print(f"Mode: {info['mode']}")
print(f"Detected: {info['detected_mode']}")
print(f"Active: {info['active_matcher']}")
```
Explain Match Results¶
```python
explanation = matcher.explain_match("query", top_k=5)
print(explanation["matched"])     # True/False
print(explanation["best_match"])  # Top result
print(explanation["top_k"])       # All candidates
```
Debug Issues¶
```python
diagnosis = matcher.diagnose("query")
print(diagnosis["issue"])       # What's wrong
print(diagnosis["suggestion"])  # How to fix it
```
Next Steps¶
- See quickstart.md for basic usage
- See models.md for model selection
- See static-embeddings.md for static embeddings