Matcher Modes¶
Related docs: index.md | quickstart.md | architecture.md
Overview¶
The unified Matcher class supports multiple matching strategies through modes. Modes automatically route to the optimal implementation (EmbeddingMatcher, EntityMatcher, or HybridMatcher).
Available Modes¶
| Mode | Description | Training Time | Use Case |
|---|---|---|---|
| `zero-shot` | Embedding similarity only | None | No training data available |
| `head-only` | Train classifier head only | ~30s | Minimal training data (1-2 examples/entity) |
| `full` | Full SetFit training | ~3min | Sufficient training data (3+ examples/entity) |
| `bert` | BERT-based classifier | ~5min | High accuracy needed (100+ examples/entity) |
| `hybrid` | Multi-stage pipeline | None | Large datasets (10k+ entities) |
| `auto` | Smart auto-detection | Variable | Let the library choose |
Mode Comparison¶
zero-shot¶
What: Pure embedding matching, scored by cosine similarity.

When to use:

- No labeled training data available
- Need immediate results
- Prototyping or exploration

Pros:

- No training required
- Instant setup
- Works out of the box

Cons:

- Lower accuracy than trained modes
- Can't learn from your data
Example:

```python
from novelentitymatcher import Matcher

matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit()
result = matcher.match("query")
```
Implementation: Routes to EmbeddingMatcher
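The core idea can be illustrated with plain cosine similarity. This is an illustrative sketch, not the library's implementation; the toy 2-dimensional vectors stand in for real model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_match(query_vec, entity_vecs):
    """Return the entity id whose embedding is most similar to the query."""
    return max(entity_vecs, key=lambda eid: cosine(query_vec, entity_vecs[eid]))

# Toy embeddings; a real embedding model produces these vectors.
entity_vecs = {"DE": [0.9, 0.1], "FR": [0.2, 0.8]}
print(zero_shot_match([0.85, 0.2], entity_vecs))  # → DE
```

No training ever touches the entity vectors; accuracy is bounded by how well the embedding model already separates your entities.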
head-only¶
What: Lightweight SetFit training that only trains the classification head.
When to use:

- Limited training data (1-2 examples per entity)
- Need fast training
- Quick iteration on the model

Pros:

- Fast training (~30 seconds)
- Better than zero-shot with minimal data
- Good for quick experiments

Cons:

- Lower accuracy than full training
- May not capture complex patterns
Example:

```python
matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data, num_epochs=1)
result = matcher.match("query")
```
Implementation: Routes to EntityMatcher with head-only training
full¶
What: Full SetFit training with contrastive learning.
When to use:

- Sufficient training data (3+ examples per entity)
- Need best accuracy
- Production deployment

Pros:

- Best accuracy
- Learns from your data
- Robust to variations

Cons:

- Slower training (~3 minutes)
- Requires more training data
Example:

```python
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data, num_epochs=4)
result = matcher.match("query")
```
Implementation: Routes to EntityMatcher with full training
bert¶
What: BERT-based classifier built on the transformers library.

When to use:

- Accuracy is critical in high-stakes domains (legal, medical, financial)
- Complex pattern recognition needed (sarcasm, nuanced sentiment)
- Data-rich scenarios (100+ examples per entity recommended)
- GPU resources available
- Inference speed is not critical

Pros:

- Superior accuracy for complex tasks (often 3-5% better than SetFit)
- Works well with smaller datasets (8-16 examples per class)
- State-of-the-art transformer architecture

Cons:

- Slower training (~5 minutes, GPU recommended)
- Slower inference (full transformer pass required)
- Higher computational cost
- Larger model files on disk
Example:

```python
matcher = Matcher(entities=entities, mode="bert", model="distilbert")
matcher.fit(training_data, num_epochs=3)
result = matcher.match("query")
```
Implementation: Routes to EntityMatcher with BERT classifier
BERT Models:

```python
# Default: DistilBERT (recommended)
matcher = Matcher(entities=entities, mode="bert")

# For maximum accuracy
matcher = Matcher(entities=entities, mode="bert", model="deberta-v3")

# For resource-constrained environments
matcher = Matcher(entities=entities, mode="bert", model="tinybert")

# For multilingual text
matcher = Matcher(entities=entities, mode="bert", model="bert-multilingual")
```
See: bert-classifier.md for detailed BERT guide.
hybrid¶
What: Three-stage pipeline: blocking → retrieval → reranking.
Stages:

1. Blocking: fast candidate filtering (BM25/TF-IDF)
2. Retrieval: embedding similarity on the filtered candidates
3. Reranking: cross-encoder scoring for precision

When to use:

- Large datasets (10k+ entities)
- Need both speed and accuracy
- Can tolerate some complexity

Pros:

- Scales to very large datasets
- High accuracy with reranking
- Efficient candidate pruning

Cons:

- More complex setup
- Multiple models to load
- Higher memory usage
Example:

```python
matcher = Matcher(entities=entities, mode="hybrid")
matcher.fit()  # No training required
result = matcher.match("query")
```

Implementation: Routes to HybridMatcher

Pipeline Parameters:

```python
result = matcher.match(
    "query",
    blocking_top_k=1000,  # Candidates after blocking
    retrieval_top_k=50,   # Candidates after retrieval
    final_top_k=5,        # Final results after reranking
)
```
auto¶
What: Smart mode selection based on training data.
Decision Logic:

- No training data → zero-shot
- < 3 examples/entity → head-only
- ≥ 3 examples/entity, < 100 total → full
- ≥ 100 total, ≥ 8 examples/entity → bert
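The thresholds above can be sketched as a small routing function. This is a hypothetical stand-in for the library's internal detection; the function and parameter names are illustrative, not part of the API.

```python
# Hypothetical sketch of the auto-detection thresholds listed above.
# `min_per_entity` is the minimum number of examples across entities.
def detect_mode(min_per_entity: int, total_examples: int) -> str:
    if total_examples == 0:
        return "zero-shot"
    if min_per_entity < 3:
        return "head-only"
    if total_examples >= 100 and min_per_entity >= 8:
        return "bert"
    return "full"  # 3+ examples/entity, but not enough volume for bert

print(detect_mode(5, 60))  # → full
```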
When to use:

- Unsure which mode to pick
- Want the library to choose optimally
- Starting a new project
Example:

```python
matcher = Matcher(entities=entities, mode="auto")
matcher.fit(training_data)  # Auto-selects based on data
```
How it works:

1. Analyzes training data volume per entity
2. Selects the appropriate mode automatically
3. Stores the detected mode for transparency
Check detected mode:

```python
info = matcher.get_training_info()
print(info["detected_mode"])  # "zero-shot", "head-only", "full", or "bert"
```
Mode Selection Decision Tree¶
```text
Do you have training data?
│
├─ No → zero-shot
│
└─ Yes → How many examples per entity?
   │
   ├─ < 3 → head-only (fast, ~30s)
   │
   ├─ ≥ 3, < 100 total examples → full (accurate, ~3min)
   │
   └─ ≥ 100 total, ≥ 8 per entity → bert (very accurate, ~5min)
```

Special case: for large datasets (10k+ entities), use hybrid regardless of training data.
Explicit Mode Selection¶
Override auto-detection when you know what you want:
```python
# Force zero-shot even with training data
matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit(training_data)  # Training data ignored

# Force full training even with minimal data
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)  # Will train but may overfit
```
Performance & Accuracy Tradeoffs¶
Speed Comparison¶
Benchmark with 100 entities and 50 training examples:
| Mode | Training Time | Query Speed |
|---|---|---|
| zero-shot | None | Fast (static: ~1ms, dynamic: ~10ms) |
| head-only | ~30s | Fast (~10ms) |
| full | ~3min | Fast (~10ms) |
| bert | ~5min | Medium (~50ms) |
| hybrid | None | Medium (~50-100ms with reranking) |
Accuracy Comparison¶
Accuracy on a typical dataset (higher is better):
| Mode | Accuracy | Notes |
|---|---|---|
| zero-shot | 70-80% | Good baseline |
| head-only | 80-85% | Better with minimal data |
| full | 85-95% | Best with sufficient data |
| bert | 88-98% | Superior for complex patterns |
| hybrid | 90-95% | Best for large datasets |
Actual results vary by dataset quality and size.
Hybrid Mode Deep Dive¶
Pipeline Stages¶
```python
# Stage 1: Blocking (fast candidate filtering)
#   BM25, TF-IDF, or fuzzy matching
#   Reduces 10k entities → 1000 candidates

# Stage 2: Retrieval (embedding similarity)
#   Static or dynamic embeddings
#   Reduces 1000 candidates → 50 candidates

# Stage 3: Reranking (cross-encoder scoring)
#   Precise but slow
#   Reduces 50 candidates → 5 final results
```
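The funnel can be sketched generically: each stage re-scores the surviving candidates with a progressively more expensive function and keeps only the top k. This is an illustrative sketch, not the library's HybridMatcher; the scoring functions are toy stand-ins for BM25, embedding similarity, and a cross-encoder.

```python
# Generic top-k funnel sketch for a blocking → retrieval → reranking pipeline.
def top_k(candidates, score_fn, k):
    """Keep the k highest-scoring candidates."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def funnel(query, entities, cheap, medium, expensive,
           blocking_top_k=1000, retrieval_top_k=50, final_top_k=5):
    survivors = top_k(entities, lambda e: cheap(query, e), blocking_top_k)
    survivors = top_k(survivors, lambda e: medium(query, e), retrieval_top_k)
    return top_k(survivors, lambda e: expensive(query, e), final_top_k)

# Toy demo: "closeness to 42" plays the role of every scorer.
score = lambda q, e: -abs(q - e)
print(funnel(42, list(range(100)), score, score, score,
             blocking_top_k=10, retrieval_top_k=5, final_top_k=3))
# → [42, 41, 43]
```

The design point is cost shaping: the expensive scorer only ever sees the handful of candidates the cheap stages let through.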
Blocking Strategies¶
```python
from novelentitymatcher.core.blocking import BM25Blocking

matcher = Matcher(
    entities=entities,
    mode="hybrid",
    blocking_strategy=BM25Blocking(),
)
```
Available strategies:
- BM25Blocking - Keyword-based (default)
- TFIDFBlocking - Document similarity
- FuzzyBlocking - Typos and variations
- NoOpBlocking - No filtering (for small datasets)
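As an illustration of what blocking does, here is a toy token-overlap filter in the spirit of keyword-based blocking; it is not any of the classes above, and real strategies such as BM25 score and rank rather than just filter.

```python
# Toy blocking filter: keep entities that share at least one token
# with the query; everything else is pruned before the expensive stages.
def token_blocking(query, entities, name_key="name"):
    q_tokens = set(query.lower().split())
    return [e for e in entities
            if q_tokens & set(e[name_key].lower().split())]

entities = [{"id": "DE", "name": "Federal Republic of Germany"},
            {"id": "FR", "name": "French Republic"}]
print([e["id"] for e in token_blocking("republic of germany", entities)])
# → ['DE', 'FR']
```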
Reranker Models¶
Available rerankers:
- bge-m3 - Multilingual, high quality (default)
- bge-large - Higher accuracy, slower
- ms-marco - Lightweight alternative
Candidate Filtering (Trained Modes)¶
When using head-only, full, or bert modes, restrict matching to known candidates:
```python
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)

# Only match against specific candidates
candidates = [
    {"id": "DE", "name": "Germany"},
    {"id": "FR", "name": "France"},
]
result = matcher.match("query", candidates=candidates)
# Only returns DE or FR, not other entities
```
Use cases:

- Geographic filtering (e.g., only European countries)
- Category filtering (e.g., only technology companies)
- User permissions (e.g., only entities the user can access)
Mode-Specific Features¶
zero-shot Features¶
```python
# Static embeddings (fastest)
matcher = Matcher(mode="zero-shot", model="potion-8m")

# Dynamic embeddings (better accuracy)
matcher = Matcher(mode="zero-shot", model="bge-base")

# Dimension reduction (MRL models)
matcher = Matcher(
    mode="zero-shot",
    model="mrl-en",
    embedding_dim=256,
)
```
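Matryoshka-style (MRL) dimension reduction can be sketched as truncating an embedding to its leading components and re-normalizing. This is an illustrative sketch of the idea, not the library's code.

```python
import math

# Illustrative MRL-style reduction: keep the first `dim` components and
# re-normalize to unit length so cosine similarity remains meaningful.
def truncate_embedding(vec, dim):
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

print(truncate_embedding([3.0, 4.0, 1.0], 2))  # → [0.6, 0.8]
```

MRL-trained models pack the most important information into the leading dimensions, which is why such truncation loses little accuracy.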
head-only / full Features¶
```python
# Training parameters
matcher.fit(
    training_data,
    num_epochs=4,        # Training epochs
    batch_size=16,       # Batch size
    show_progress=True,  # Show progress bar
)

# Candidate filtering
result = matcher.match("query", candidates=candidates)
```
hybrid Features¶
```python
# Pipeline tuning
result = matcher.match(
    "query",
    blocking_top_k=1000,
    retrieval_top_k=50,
    final_top_k=5,
)

# Batch processing
results = matcher.match(
    ["query1", "query2", ...],
    n_jobs=-1,       # Parallel processing
    chunk_size=100,  # Batch size
)
```
Migration from Deprecated Classes¶
Old Way¶
```python
from novelentitymatcher import EmbeddingMatcher, EntityMatcher

# Zero-shot
matcher = EmbeddingMatcher(entities=entities)
matcher.build_index()

# Training
matcher = EntityMatcher(entities=entities)
matcher.train(training_data)
```
New Way¶
```python
from novelentitymatcher import Matcher

# Zero-shot
matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit()

# Training
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)
```
Choosing the Right Mode¶
Quick Decision Guide¶
- I have no training data → zero-shot
- I have some training data (1-2 examples/entity) → head-only
- I have good training data (3+ examples/entity) → full
- I have rich training data (100+ examples, 8+ per entity) → bert
- I have 10k+ entities → hybrid
- I'm not sure → auto (let the library choose)
Scenario Examples¶
API endpoint with no training data:

```python
matcher = Matcher(entities=entities, mode="zero-shot", model="potion-8m")
# Fastest, no training needed
```

Internal tool with a few labeled examples:

```python
matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data)
# Fast training, better than zero-shot
```

Production system with good training data:

```python
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)
# Best accuracy with sufficient data
```

High-stakes application with rich training data:

```python
matcher = Matcher(entities=entities, mode="bert", model="distilbert")
matcher.fit(training_data)
# Superior accuracy for complex patterns
```

Enterprise directory (50k employees):

```python
matcher = Matcher(entities=entities, mode="hybrid")
# Scales to large datasets via blocking and reranking
```
Diagnostic Tools¶
Check Current Mode¶
```python
info = matcher.get_training_info()
print(f"Mode: {info['mode']}")
print(f"Detected: {info['detected_mode']}")
print(f"Active: {info['active_matcher']}")
```
Explain Match Results¶
```python
explanation = matcher.explain_match("query", top_k=5)
print(explanation["matched"])     # True/False
print(explanation["best_match"])  # Top result
print(explanation["top_k"])       # All candidates
```
Debug Issues¶
```python
diagnosis = matcher.diagnose("query")
print(diagnosis["issue"])       # What's wrong
print(diagnosis["suggestion"])  # How to fix it
```
Next Steps¶
- See quickstart.md for basic usage
- See models.md for model selection
- See static-embeddings.md for static embeddings