
Matcher Modes

Related docs: index.md | quickstart.md | architecture.md

Overview

The unified Matcher class supports multiple matching strategies through modes. Modes automatically route to the optimal implementation (EmbeddingMatcher, EntityMatcher, or HybridMatcher).

Available Modes

| Mode | Description | Training Time | Use Case |
| --- | --- | --- | --- |
| zero-shot | Embedding similarity only | None | No training data available |
| head-only | Train classifier head only | ~30s | Minimal training data (1-2 examples/entity) |
| full | Full SetFit training | ~3min | Sufficient training data (3+ examples/entity) |
| bert | BERT-based classifier | ~5min | High accuracy needed (100+ examples/entity) |
| hybrid | Multi-stage pipeline | None | Large datasets (10k+ entities) |
| auto | Smart auto-detection | Variable | Let the library choose |

Mode Comparison

zero-shot

What: Pure embedding-based matching, scored with cosine similarity.

When to use:
- No labeled training data available
- Need immediate results
- Prototyping or exploration

Pros:
- No training required
- Instant setup
- Works out of the box

Cons:
- Lower accuracy than trained modes
- Can't learn from your data

Example:

from novelentitymatcher import Matcher

matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit()
result = matcher.match("query")

Implementation: Routes to EmbeddingMatcher
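The scoring at the heart of this mode is plain cosine similarity between the query embedding and each entity embedding. As a rough, self-contained sketch of that ranking step (the toy vectors below stand in for real model embeddings, which the library computes internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings; real ones come from the configured embedding model
entity_vecs = {
    "Germany": [0.9, 0.1, 0.2],
    "France": [0.1, 0.9, 0.3],
}
query_vec = [0.85, 0.15, 0.25]

# Rank entities by similarity to the query, highest first
ranked = sorted(
    entity_vecs,
    key=lambda e: cosine_similarity(query_vec, entity_vecs[e]),
    reverse=True,
)
print(ranked[0])  # Germany
```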


head-only

What: Lightweight SetFit training that only trains the classification head.

When to use:
- Limited training data (1-2 examples per entity)
- Need fast training
- Quick iteration on model

Pros:
- Fast training (~30 seconds)
- Better than zero-shot with minimal data
- Good for quick experiments

Cons:
- Lower accuracy than full training
- May not capture complex patterns

Example:

matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data, num_epochs=1)
result = matcher.match("query")

Implementation: Routes to EntityMatcher with head-only training


full

What: Full SetFit training with contrastive learning.

When to use:
- Sufficient training data (3+ examples per entity)
- Need best accuracy
- Production deployment

Pros:
- Best accuracy
- Learns from your data
- Robust to variations

Cons:
- Slower training (~3 minutes)
- Requires more training data

Example:

matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data, num_epochs=4)
result = matcher.match("query")

Implementation: Routes to EntityMatcher with full training


bert

What: BERT-based classifier using transformers library.

When to use:
- High-stakes accuracy is critical (legal, medical, financial)
- Complex pattern recognition needed (sarcasm, nuanced sentiment)
- Data-rich scenarios (100+ examples per entity recommended)
- GPU resources available
- Inference speed is not critical

Pros:
- Superior accuracy for complex tasks (often 3-5% better than SetFit)
- Works well with smaller datasets (8-16 examples per class)
- State-of-the-art transformer architecture

Cons:
- Slower training (~5 minutes, GPU recommended)
- Slower inference (full transformer pass required)
- Higher computational cost
- Larger model files on disk

Example:

matcher = Matcher(entities=entities, mode="bert", model="distilbert")
matcher.fit(training_data, num_epochs=3)
result = matcher.match("query")

Implementation: Routes to EntityMatcher with BERT classifier

BERT Models:

# Default: DistilBERT (recommended)
matcher = Matcher(entities=entities, mode="bert")

# For maximum accuracy
matcher = Matcher(entities=entities, mode="bert", model="deberta-v3")

# For resource-constrained environments
matcher = Matcher(entities=entities, mode="bert", model="tinybert")

# For multilingual text
matcher = Matcher(entities=entities, mode="bert", model="bert-multilingual")

See bert-classifier.md for a detailed BERT guide.


hybrid

What: Three-stage pipeline: blocking → retrieval → reranking.

Stages:
1. Blocking: Fast candidate filtering (BM25/TF-IDF)
2. Retrieval: Embedding similarity on filtered candidates
3. Reranking: Cross-encoder scoring for precision

When to use:
- Large datasets (10k+ entities)
- Need both speed and accuracy
- Can tolerate some complexity

Pros:
- Scales to very large datasets
- High accuracy with reranking
- Efficient candidate pruning

Cons:
- More complex setup
- Multiple models to load
- Higher memory usage

Example:

matcher = Matcher(entities=entities, mode="hybrid")
matcher.fit()
result = matcher.match("query")

Implementation: Routes to HybridMatcher

Pipeline Parameters:

result = matcher.match(
    "query",
    blocking_top_k=1000,     # Candidates after blocking
    retrieval_top_k=50,      # Candidates after retrieval
    final_top_k=5,           # Final results after reranking
)


auto

What: Smart mode selection based on training data.

Decision Logic:

- No training data → zero-shot
- < 3 examples/entity → head-only
- ≥ 3 examples/entity, < 100 total → full
- ≥ 100 total, ≥ 8 examples/entity → bert
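The rules above can be sketched in plain Python. This is an illustrative reimplementation of the documented thresholds, not the library's actual detection code:

```python
from collections import Counter

def detect_mode(training_data):
    """Sketch of the documented auto-detection rules.

    training_data: list of (text, entity_id) pairs.
    """
    if not training_data:
        return "zero-shot"
    counts = Counter(entity_id for _, entity_id in training_data)
    total = len(training_data)
    min_per_entity = min(counts.values())
    if total >= 100 and min_per_entity >= 8:
        return "bert"
    if min_per_entity >= 3:
        return "full"
    return "head-only"
```

For example, 10 entities with 10 labeled examples each (100 total, 10 per entity) would select bert, while the same entities with 4 examples each would select full.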

When to use:
- Unsure which mode to pick
- Want the library to choose optimally
- Starting a new project

Example:

matcher = Matcher(entities=entities, mode="auto")
matcher.fit(training_data)  # Auto-selects based on data

How it works:
1. Analyzes training data volume per entity
2. Selects appropriate mode automatically
3. Stores detected mode for transparency

Check detected mode:

info = matcher.get_training_info()
print(info["detected_mode"])  # "zero-shot", "head-only", "full", or "bert"

Mode Selection Decision Tree

Do you have training data?
├─ No → zero-shot
└─ Yes → How many examples per entity?
          ├─ < 3 → head-only (fast, ~30s)
          ├─ ≥ 3, < 100 total examples → full (accurate, ~3min)
          └─ ≥ 100 total, ≥ 8 per entity → bert (very accurate, ~5min)

Special case: Large datasets (10k+ entities)

Dataset size > 10k entities?
└─ Yes → Consider hybrid mode

Explicit Mode Selection

Override auto-detection when you know what you want:

# Force zero-shot even with training data
matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit(training_data)  # Training data ignored

# Force full training even with minimal data
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)  # Will train but may overfit

Performance & Accuracy Tradeoffs

Speed Comparison

Measured with 100 entities and 50 training examples:

| Mode | Training Time | Query Speed |
| --- | --- | --- |
| zero-shot | None | Fast (static: ~1ms, dynamic: ~10ms) |
| head-only | ~30s | Fast (~10ms) |
| full | ~3min | Fast (~10ms) |
| bert | ~5min | Medium (~50ms) |
| hybrid | None | Medium (~50-100ms with reranking) |

Accuracy Comparison

Accuracy on typical dataset (higher is better):

| Mode | Accuracy | Notes |
| --- | --- | --- |
| zero-shot | 70-80% | Good baseline |
| head-only | 80-85% | Better with minimal data |
| full | 85-95% | Best with sufficient data |
| bert | 88-98% | Superior for complex patterns |
| hybrid | 90-95% | Best for large datasets |

Actual results vary by dataset quality and size.

Hybrid Mode Deep Dive

Pipeline Stages

# Stage 1: Blocking (fast candidate filtering)
# BM25, TF-IDF, or fuzzy matching
# Reduces 10k entities → 1000 candidates

# Stage 2: Retrieval (embedding similarity)
# Static or dynamic embeddings
# Reduces 1000 candidates → 50 candidates

# Stage 3: Reranking (cross-encoder scoring)
# Precise but slow
# Reduces 50 candidates → 5 final results
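The funnel pattern those stages implement can be illustrated with stand-in scorers. Nothing below uses the library; the scoring functions are toys that mimic a cheap blocking pass followed by a finer (slower) pass:

```python
def funnel(query, entities, stages):
    """Run (score_fn, top_k) stages in order, shrinking the candidate pool each time."""
    candidates = list(entities)
    for score_fn, top_k in stages:
        candidates = sorted(
            candidates, key=lambda e: score_fn(query, e), reverse=True
        )[:top_k]
    return candidates

def overlap(q, e):
    # Cheap blocking stand-in: shared lowercase tokens (mimics BM25/TF-IDF filtering)
    return len(set(q.lower().split()) & set(e.lower().split()))

def char_score(q, e):
    # "Finer" stand-in: how many query characters appear in the entity
    return sum(c in e.lower() for c in q.lower())

entities = ["Federal Republic of Germany", "French Republic", "Kingdom of Spain"]

# Stage 1 keeps the top 2 by token overlap; stage 2 keeps the single best
result = funnel("republic of germany", entities, [(overlap, 2), (char_score, 1)])
print(result)  # ['Federal Republic of Germany']
```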

Blocking Strategies

from novelentitymatcher.core.blocking import BM25Blocking

matcher = Matcher(
    entities=entities,
    mode="hybrid",
    blocking_strategy=BM25Blocking()
)

Available strategies:
- BM25Blocking - Keyword-based (default)
- TFIDFBlocking - Document similarity
- FuzzyBlocking - Typos and variations
- NoOpBlocking - No filtering (for small datasets)

Reranker Models

matcher = Matcher(
    entities=entities,
    mode="hybrid",
    reranker_model="bge-m3"  # Default reranker
)

Available rerankers:
- bge-m3 - Multilingual, high quality (default)
- bge-large - Higher accuracy, slower
- ms-marco - Lightweight alternative

Candidate Filtering (Trained Modes)

When using head-only, full, or bert modes, restrict matching to known candidates:

matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)

# Only match against specific candidates
candidates = [
    {"id": "DE", "name": "Germany"},
    {"id": "FR", "name": "France"},
]

result = matcher.match("query", candidates=candidates)
# Only returns DE or FR, not other entities

Use cases:
- Geographic filtering (e.g., only European countries)
- Category filtering (e.g., only technology companies)
- User permissions (e.g., only entities a user can access)
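Building such a candidates list is ordinary filtering on whatever metadata your entity records carry. A hypothetical sketch (the region field is illustrative metadata, not a required schema):

```python
# Hypothetical entity records; "region" is illustrative metadata,
# not a field the library requires.
entities = [
    {"id": "DE", "name": "Germany", "region": "Europe"},
    {"id": "FR", "name": "France", "region": "Europe"},
    {"id": "JP", "name": "Japan", "region": "Asia"},
]

# Geographic filtering: restrict matching to European entities only
candidates = [
    {"id": e["id"], "name": e["name"]}
    for e in entities
    if e["region"] == "Europe"
]
# candidates would then be passed as matcher.match("query", candidates=candidates)
```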

Mode-Specific Features

zero-shot Features

# Static embeddings (fastest)
matcher = Matcher(mode="zero-shot", model="potion-8m")

# Dynamic embeddings (better accuracy)
matcher = Matcher(mode="zero-shot", model="bge-base")

# Dimension reduction (MRL models)
matcher = Matcher(
    mode="zero-shot",
    model="mrl-en",
    embedding_dim=256
)

head-only / full Features

# Training parameters
matcher.fit(
    training_data,
    num_epochs=4,      # Training epochs
    batch_size=16,     # Batch size
    show_progress=True # Show progress bar
)

# Candidate filtering
result = matcher.match("query", candidates=candidates)

hybrid Features

# Pipeline tuning
result = matcher.match(
    "query",
    blocking_top_k=1000,
    retrieval_top_k=50,
    final_top_k=5
)

# Batch processing
results = matcher.match(
    ["query1", "query2", ...],
    n_jobs=-1,        # Parallel processing
    chunk_size=100    # Batch size
)

Migration from Deprecated Classes

Old Way

from novelentitymatcher import EmbeddingMatcher, EntityMatcher

# Zero-shot
matcher = EmbeddingMatcher(entities=entities)
matcher.build_index()

# Training
matcher = EntityMatcher(entities=entities)
matcher.train(training_data)

New Way

from novelentitymatcher import Matcher

# Zero-shot
matcher = Matcher(entities=entities, mode="zero-shot")
matcher.fit()

# Training
matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)

Choosing the Right Mode

Quick Decision Guide

I have no training data → zero-shot

I have some training data (1-2 examples/entity) → head-only

I have good training data (3+ examples/entity) → full

I have rich training data (100+ examples, 8+ per entity) → bert

I have 10k+ entities → hybrid

I'm not sure → auto (let the library choose)

Scenario Examples

API endpoint with no training data:

matcher = Matcher(entities=entities, mode="zero-shot", model="potion-8m")
# Fastest, no training needed

Internal tool with a few labeled examples:

matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data)
# Fast training, better than zero-shot

Production system with good training data:

matcher = Matcher(entities=entities, mode="full")
matcher.fit(training_data)
# Best accuracy

High-stakes application with rich training data:

matcher = Matcher(entities=entities, mode="bert", model="distilbert")
matcher.fit(training_data)
# Superior accuracy for complex patterns

Enterprise directory (50k employees):

matcher = Matcher(entities=entities, mode="hybrid")
matcher.fit()
# Scales to large datasets

Diagnostic Tools

Check Current Mode

info = matcher.get_training_info()
print(f"Mode: {info['mode']}")
print(f"Detected: {info['detected_mode']}")
print(f"Active: {info['active_matcher']}")

Explain Match Results

explanation = matcher.explain_match("query", top_k=5)
print(explanation["matched"])      # True/False
print(explanation["best_match"])   # Top result
print(explanation["top_k"])        # All candidates

Debug Issues

diagnosis = matcher.diagnose("query")
print(diagnosis["issue"])       # What's wrong
print(diagnosis["suggestion"])  # How to fix it

Next Steps