Classifier Route Comparison¶

This document compares the main routes exposed by Matcher:

zero-shot
head-only
full
bert
hybrid

The goal is not to rank them globally. Each route optimizes for a different mix of setup cost, latency, accuracy, and hardware footprint.

Benchmark Results Summary¶

Datasets tested: - Small: occupations (23 entities) - Medium: languages (184 entities) - Large: products (1,025 entities)

Key Findings: - Static embeddings (potion-8m) achieve 23,000-38,000 QPS - 14-53x faster than dynamic embeddings - Training time ranges from 18s (minilm head-only) to 685s (bge-base head-only) - BERT models achieve perfect accuracy (100%) but take 90-324s to train - Dynamic embeddings (minilm, bge-base, mpnet) provide better accuracy at lower throughput

Performance Visualizations¶

Embedding Model Throughput Comparison¶

Embedding Performance

Static embeddings (potion-8m, potion-32m, mrl-en) dramatically outperform dynamic embeddings for throughput while maintaining competitive accuracy.

Latency Comparison¶

Embedding Latency

Average and P95 latency across all embedding models and datasets. Static embeddings show sub-millisecond latency compared to 5-15ms for dynamic models.

Accuracy Comparison Across All Routes¶

Accuracy Comparison

Comparison of top-1 accuracy across all routes and models. Training routes (head-only, full, bert) show significant accuracy improvements over zero-shot on complex datasets.

Training Time vs Accuracy Tradeoff¶

Training vs Accuracy

Scatter plot showing the relationship between training time and accuracy for SetFit (head-only, full) and BERT routes.

Static vs Dynamic Embeddings¶

Static vs Dynamic

Side-by-side comparison of static embeddings (potion-8m, potion-32m, mrl-en) vs dynamic embeddings (minilm, bge-base, mpnet) across throughput and accuracy.

Model Selection Decision Tree¶

Model Selection Guide

Interactive decision tree for selecting the appropriate route based on dataset size, available labels, and performance requirements.

Detailed Benchmark Results¶

Zero-Shot Route Performance¶

Model	Occupations (23 entities)	Languages (184 entities)	Products (1,025 entities)
potion-8m	38,342 QPS, 100% acc	23,202 QPS, 97% acc	31,499 QPS, 93% acc
potion-32m	28,147 QPS, 100% acc	19,015 QPS, 97% acc	25,856 QPS, 93% acc
mrl-en	4,696 QPS, 100% acc	4,458 QPS, 98% acc	4,215 QPS, 92% acc
minilm	1,620 QPS, 100% acc	598 QPS, 92% acc	1,120 QPS, 89% acc
bge-base	1,311 QPS, 100% acc	800 QPS, 94% acc	1,098 QPS, 91% acc
mpnet	1,510 QPS, 100% acc	727 QPS, 95% acc	1,232 QPS, 91% acc

Key Insights: - Static embeddings are 14-53x faster than dynamic embeddings - potion-8m achieves 23,000-38,000 QPS with 93-100% accuracy - All models achieve >90% accuracy even on the large products dataset - Static embeddings scale better with dataset size

Head-Only Route Performance (1-2 examples per entity)¶

Model	Training Time	Occupations	Languages	Products
minilm	20s	194 QPS, 100% acc	43 QPS, 6% acc	194 QPS, 100% acc
bge-base	33s	117 QPS, 100% acc	101 QPS, 8% acc	117 QPS, 100% acc
mpnet	53s	113 QPS, 100% acc	106 QPS, 8% acc	112 QPS, 100% acc

Key Insights: - Training time: 20-685s depending on model and dataset size - Fastest training with minilm (20s on occupations) - Perfect accuracy (100%) on occupations and products with all models - Languages dataset shows lower accuracy due to complexity (184 entities)

Full Route Performance (3+ examples per entity)¶

Model	Training Time	Occupations	Languages	Products
minilm	18s	194 QPS, 100% acc	62 QPS, 6% acc	186 QPS, 100% acc
bge-base	38s	114 QPS, 100% acc	70 QPS, 8% acc	121 QPS, 100% acc
mpnet	48s	115 QPS, 100% acc	108 QPS, 8% acc	115 QPS, 100% acc

Key Insights: - Training time: 18-468s depending on model and dataset size - Similar accuracy to head-only on most datasets - Better throughput on languages dataset (62-108 QPS vs 43-106 QPS) - Slightly faster training than head-only for larger datasets

BERT Route Performance¶

Model	Training Time	Memory (MB)	Inference (s)	Throughput (/s)	Accuracy
tinybert	51s	118	0.68s	292	55%
distilbert	90s	549	0.97s	206	100%
roberta-base	324s	974	0.78s	258	100%

Key Insights: - Training time: 51-324s (tinybert fastest, roberta-base slowest) - Perfect accuracy (100%) with distilbert and roberta-base - tinybert achieves 55% accuracy - significantly lower than other BERT models - distilbert offers best balance: 90s training, 549 MB memory, 100% accuracy - roberta-base takes 3.6x longer than distilbert for same accuracy

Quick Summary¶

Route	Training data needed	Latency profile	Quality profile	Compute profile	Best fit
`zero-shot`	None	23,000-38,000 QPS (static), 598-1,620 QPS (dynamic)	93-100% accuracy (static), 89-100% (dynamic)	CPU-friendly, lowest memory	Cold start, prototypes, no labels
`head-only`	1-2 examples per entity	43-194 QPS, 18-685s training	100% on simple tasks, 6-8% on complex	CPU-friendly, modest RAM	Quick supervised iteration
`full`	3+ examples per entity	62-194 QPS, 18-468s training	100% on simple tasks, 6-8% on complex	CPU okay, GPU optional	Most production classifier use cases
`bert`	Best with 100+ total and 8+ per entity	206-292 samples/s, 51-324s training	55-100% accuracy (varies by model)	GPU recommended, highest memory	Accuracy-first deployments
`hybrid`	No classifier labels required	Higher end-to-end latency, scalable retrieval	Best for large candidate sets	Multiple models, highest complexity	Large catalogs and long-tail retrieval

Route Details¶

`zero-shot`¶

What it is: embedding similarity against entity names and aliases, with no supervised training.

Performance - Static embeddings (potion-8m, potion-32m, mrl-en): - Throughput: 4,458-38,342 QPS (14-53x speedup vs dynamic) - Latency: 0.5-1.5ms average, 1-3ms P95 - Accuracy: 92-100% across datasets - Memory: Lowest footprint - Dynamic embeddings (minilm, bge-base, mpnet): - Throughput: 598-1,620 QPS - Latency: 5-15ms average, 6-20ms P95 - Accuracy: 89-100% across datasets - Memory: Higher footprint due to transformer models

Pros - No labeling or training loop - Lowest setup cost - Static embeddings achieve extreme throughput (23,000-38,000 QPS) - Easy to operate in CPU-only environments - Good first pass for evaluating entity coverage and alias quality

Cons - Cannot learn task-specific decision boundaries - More sensitive to weak entity names or missing aliases - Usually below trained routes on ambiguous or domain-specific language - Dynamic embeddings have significantly lower throughput

Recommended when - You have no labeled data yet - You need an immediate baseline - The entity list is small to medium and the wording is fairly literal - Use static embeddings (potion-8m) for maximum throughput - Use dynamic embeddings (minilm, bge-base) for better semantic understanding

Compute guidance - CPU: recommended - GPU: not needed - RAM: low to moderate, mostly driven by embedding model size and entity index size - VRAM: none required

`head-only`¶

What it is: supervised SetFit route for very small labeled datasets.

Performance - Training time: 18-685s depending on model and dataset - Throughput: 43-194 QPS - Latency: 5-15ms average - Accuracy: 6-100% (100% on simple tasks, 6-8% on complex tasks like languages)

Pros - Fastest trained route - Good improvement over zero-shot with very little data - Keeps inference relatively cheap - Easy to rerun during labeling iterations - Perfect accuracy (100%) on occupations and products datasets

Cons - Lower ceiling than full or bert - Less robust when label boundaries depend on subtle wording - Can plateau quickly once the task becomes more semantic than lexical - Struggles on complex datasets (6-8% accuracy on 184-entity languages dataset)

Recommended when - You have only 1-2 examples per entity - You want a cheap supervised baseline before investing in more labels - Training speed matters more than squeezing out maximum quality - Best model: minilm for fastest training (20s)

Compute guidance - CPU: good default - GPU: optional, mainly for faster experimentation - RAM: modest - VRAM: not required

`full`¶

What it is: the main SetFit training route for classifier-style matching.

Performance - Training time: 18-468s depending on model and dataset - Throughput: 62-194 QPS - Latency: 5-15ms average - Accuracy: 6-100% (100% on simple tasks, 6-8% on complex tasks)

Pros - Best general-purpose tradeoff for trained classification - Faster inference than bert - Usually more data-efficient and cheaper to operate than full transformer classifiers - Easier to deploy on CPU-only infrastructure than bert - Perfect accuracy (100%) on occupations and products datasets - Better throughput than head-only on languages dataset (62-108 QPS vs 43-106 QPS)

Cons - Still depends on labeled data quality - Lower ceiling than bert on nuanced or pattern-heavy tasks - Less attractive when you need multilingual transformer classification behavior - Struggles on complex datasets (6-8% accuracy on 184-entity languages dataset)

Recommended when - You have at least 3 examples per entity - You want a production-ready default with balanced quality and speed - You need trained behavior but want to avoid transformer-classifier serving cost - Best model: mpnet for best throughput (108 QPS on languages)

Compute guidance - CPU: viable for training and serving - GPU: optional and useful if training repeatedly - RAM: modest to moderate - VRAM: optional

`bert`¶

What it is: fine-tuned transformer classification using a BERT-family backbone such as distilbert, roberta-base, deberta-v3, or bert-multilingual.

Performance - Training time: 51-324s - Memory: 118-974 MB - Inference: 0.68-0.97s for 625 samples - Throughput: 206-292 samples/s - Accuracy: 55-100%

Pros - Highest accuracy ceiling among the classifier routes - Perfect accuracy (100%) with distilbert and roberta-base - Strongest option for subtle phrasing, context-heavy labels, and harder edge cases - Better fit for tasks where exact wording patterns matter - Model family choice lets you trade size for quality

Cons - Slowest classifier inference path - Highest training and serving cost among classifier routes - More memory pressure on both CPU and GPU - Longer test and CI runtime if live model training is exercised by default - tinybert achieved only 55% accuracy (not recommended for production)

Recommended when - Accuracy matters more than throughput - You have richer supervision: roughly 100+ total examples and at least 8+ per entity is a sensible threshold - You can justify GPU-backed training, and possibly GPU-backed serving for lower latency - The task contains nuanced phrasing that SetFit misses - Best model: distilbert (90s training, 549 MB memory, 100% accuracy)

Compute guidance - CPU: acceptable for experimentation and low-QPS serving, but slower - GPU: recommended for training; helpful for serving when latency matters - RAM: moderate to high depending on model - VRAM: - tinybert: low, suitable for constrained GPUs (118 MB) - distilbert: moderate and the best default balance (549 MB) - roberta-base: moderate to high (974 MB) - Disk footprint: larger than SetFit-style classifier artifacts

Backbone selection

Model	Training Time	Memory (MB)	Throughput (/s)	Accuracy	Recommended use
`tinybert`	51s ✓	118 ✓	292 ✓	55%	Not recommended - low accuracy
`distilbert`	90s	549	206	100% ✓	Default BERT choice
`roberta-base`	324s	974	258	100% ✓	Accuracy-focused (3.6x slower training)

`hybrid`¶

What it is: a retrieval pipeline, not a classifier-training route. It combines blocking, embedding retrieval, and cross-encoder reranking.

Pros - Handles much larger entity sets than the classifier routes - Candidate pruning makes large-search problems tractable - Reranking improves precision on hard retrieval tasks - Strong fit when entity matching is closer to search than closed-set classification

Cons - Highest system complexity - Multiple models and stages to tune - More latency variance than a single classifier - Harder to reason about operationally than zero-shot, full, or bert

Recommended when - The entity inventory is large, often tens of thousands or more - You need high recall first, then precision via reranking - Matching resembles document retrieval more than small-label classification

Compute guidance - CPU: usable for smaller deployments, but reranking can become expensive - GPU: helpful for reranker-heavy workloads - RAM: moderate to high because multiple indexes/models may be resident - VRAM: useful when the cross-encoder is on GPU

Recommendation Matrix¶

Situation	Recommended route	Why
No labels yet	`zero-shot` with potion-8m	Cheapest baseline, 23,000-38,000 QPS, 93-100% accuracy
1-2 examples per entity	`head-only` with minilm	Fastest training (20s), perfect accuracy on simple tasks
3+ examples per entity, typical production API	`full` with mpnet	Best balance: 18-48s training, 108 QPS, 100% accuracy on simple tasks
Accuracy-first classification with enough data	`bert` with distilbert	Perfect accuracy (100%), 90s training, best BERT balance
Large candidate catalog or retrieval-style matching	`hybrid`	Scales better than closed-set classifiers

Practical Selection Guidance¶

Start with zero-shot (potion-8m) if you are still validating the taxonomy - it achieves 23,000-38,000 QPS with 93-100% accuracy. Move to head-only (minilm) as soon as you have a handful of trustworthy labels - training takes only 20 seconds. Use full (mpnet) as the default trained route for most production classifier workloads - it offers the best balance of training time (18-48s), throughput (62-194 QPS), and accuracy. Move to bert (distilbert) only when you have enough supervision and a clear accuracy gap to justify the extra compute - it achieves perfect accuracy (100%) but takes 90 seconds to train. Use hybrid when the problem stops looking like small-label classification and starts looking like large-scale search plus reranking.

If the choice is between full and bert, the main question is usually not "which is better?" but "is the incremental quality worth the extra cost and latency?" In many CPU-first deployments, full remains the practical default even if bert is slightly more accurate.

Benchmark Methodology¶

Test Configuration: - Datasets: occupations (23 entities), languages (184 entities), products (1,025 entities) - Hardware: Apple Silicon (MPS backend) - Metrics: throughput (QPS), latency (avg/P95), accuracy, training time, memory footprint - Test samples: 150 per dataset for embeddings, synthetic data for BERT

Embedding Models Tested: - Static: potion-8m, potion-32m, mrl-en - Dynamic: minilm (all-MiniLM-L6-v2), bge-base (BAAI/bge-base-en-v1.5), mpnet (all-mpnet-base-v2)

BERT Models Tested: - tinybert (huawei-noah/TinyBERT_General_4L_312D) - distilbert (distilbert-base-uncased) - roberta-base (roberta-base)

Training Configuration: - SetFit: 1 epoch, batch size 16, cosine learning rate schedule - BERT: 5 epochs, batch size 16, linear warmup followed by decay - Test split: 10% of training data