
Benchmarking Guide

Related docs: ../index.md | ../models.md | ../static-embeddings.md | benchmark-results.md | speed-benchmark-results.md | novelty-detection-benchmark.md

Overview

Novel Entity Matcher includes a comprehensive benchmarking suite accessed via the novelentitymatcher-bench CLI. It covers accuracy, latency, throughput, and novelty detection across multiple strategies and datasets.

All benchmarks live in src/novelentitymatcher/benchmarks/ and are registered as the novelentitymatcher-bench entry point in pyproject.toml.

CLI Subcommands

Subcommand        Purpose
run               Run entity resolution, classification, and novelty benchmarks on HuggingFace datasets
bench-classifier  Benchmark BERT vs SetFit classifiers head-to-head
bench-novelty     Benchmark novelty detection strategies at quick/standard/full depth
bench-async       Benchmark sync vs async matcher API throughput
render            Render benchmark JSON as markdown tables
plot              Generate charts from benchmark JSON
load              Download/cache HuggingFace datasets
list              List available datasets
clear             Clear cached datasets
sweep             Parameter sweep (threshold, k, distance)

Running Benchmarks

HuggingFace Benchmark Suite (run)

# Run all benchmarks (ER + classification + novelty)
uv run novelentitymatcher-bench run --task all --models potion-8m

# Entity resolution only
uv run novelentitymatcher-bench run --task er --models potion-8m --thresholds 0.5 0.7 0.9

# Classification with trained modes
uv run novelentitymatcher-bench run \
  --task classification \
  --models all-MiniLM-L6-v2 \
  --modes zero-shot head-only \
  --class-counts 4 10 28 \
  --max-train-samples 200

# Novelty detection
uv run novelentitymatcher-bench run --task novelty --ood-ratio 0.2

# Save results to JSON
uv run novelentitymatcher-bench run --task all --output data/hf_benchmarks/results.json

Classifier Comparison (bench-classifier)

# BERT vs SetFit head-to-head
uv run novelentitymatcher-bench bench-classifier --mode compare

# Multi-model BERT sweep
uv run novelentitymatcher-bench bench-classifier \
  --mode sweep-models \
  --models distilbert tinybert roberta-base \
  --num-entities 10 --num-samples 50

# Save results
uv run novelentitymatcher-bench bench-classifier --mode compare --output /tmp/clf_results.md

Novelty Strategy Benchmark (bench-novelty)

Three depth levels control strategy coverage:

Depth     Strategies
quick     KNN, Mahalanobis, LOF, OneClassSVM, IsolationForest
standard  quick + Pattern, SetFit Centroid, ensembles
full      standard + SignalCombiner, meta-learner

A minimal sketch of the knn_distance scoring idea appears at the end of this subsection.

# Quick benchmark (fastest)
uv run novelentitymatcher-bench bench-novelty --depth quick

# Standard with specific datasets
uv run novelentitymatcher-bench bench-novelty \
  --depth standard \
  --datasets ag_news go_emotions \
  --max-train 200 --max-test 500

# Full depth
uv run novelentitymatcher-bench bench-novelty --depth full --output /tmp/novelty_results.csv
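
The strategies in the depth table all reduce to scoring how far a query embedding sits from the training distribution. As a rough illustration of what the knn_distance strategy computes, here is a minimal sketch using scikit-learn (illustrative only, not the library's actual implementation):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_novelty_scores(train_emb, test_emb, k=5):
    """Score novelty as the mean distance to the k nearest training embeddings."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
    dists, _ = nn.kneighbors(test_emb)  # shape: (n_test, k)
    return dists.mean(axis=1)           # higher score = more novel

# Toy usage: known points cluster near the origin; the second query sits far away.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 8))
queries = np.vstack([rng.normal(0.0, 1.0, size=(1, 8)), np.full((1, 8), 6.0)])
print(knn_novelty_scores(train, queries))  # second score should be much larger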

Async Speed Benchmark (bench-async)

uv run novelentitymatcher-bench bench-async \
  --section languages/languages \
  --model default \
  --modes zero-shot \
  --max-entities 50 \
  --max-queries 25 \
  --multiplier 20 \
  --concurrency 8 \
  --output artifacts/benchmarks/speed-routes.json
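
Conceptually, bench-async compares a serial loop against overlapped awaits with bounded concurrency, which is where the --multiplier and --concurrency knobs come in. A minimal sketch of that measurement pattern (match_async here is a stand-in coroutine, not the matcher's real API):

import asyncio
import time

async def match_async(query):
    await asyncio.sleep(0.01)  # stand-in for an awaitable match call
    return query

async def run_concurrent(queries, concurrency):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(q):
        async with sem:
            return await match_async(q)

    start = time.perf_counter()
    await asyncio.gather(*(bounded(q) for q in queries))
    return len(queries) / (time.perf_counter() - start)  # QPS

qps = asyncio.run(run_concurrent([f"q{i}" for i in range(100)], concurrency=8))
print(f"{qps:.0f} QPS with concurrency=8")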

Rendering and Plotting

# Render benchmark JSON as markdown
uv run novelentitymatcher-bench render artifacts/benchmarks/results.json

# Generate charts from benchmark results
uv run novelentitymatcher-bench plot \
  --embedding-results results/embeddings.json \
  --training-results results/training.json \
  --bert-results results/bert.json \
  --output-dir docs/images/benchmarks

Dataset Management

# List available datasets
uv run novelentitymatcher-bench list

# Download/cache specific datasets
uv run novelentitymatcher-bench load --datasets ag_news go_emotions

# Force re-download
uv run novelentitymatcher-bench load --datasets ag_news --force

# Clear cache
uv run novelentitymatcher-bench clear --dataset ag_news
uv run novelentitymatcher-bench clear

Understanding the Output

Console Output

BENCHMARK RESULTS

[embedding]
<section: languages/languages>
model      backend  status  throughput_qps  accuracy_split  base_accuracy  val_accuracy  test_accuracy
potion-8m  static   ok             4032.12  val                    0.9500        0.7250         1.0000
minilm     dynamic  ok              102.45  val                    0.9500        0.7750         0.9474
bge-base   dynamic  ok               41.23  val                    0.9600        0.7900         0.9500

Key metrics:

  • throughput_qps — queries per second (higher is better)
  • accuracy — top-1 accuracy on the preferred populated split
  • accuracy_split — which split the top-line accuracy came from
  • perturbation metrics such as typo_accuracy — robustness by transformation type
  • speedup_vs_minilm — relative speed vs the minilm baseline
  • status — "ok" or "skipped" (with skip_reason)

Novelty Benchmark Output

Strategy                   Val AUROC   Test AUROC   Test DR@1%
------------------------------------------------------------
knn_distance                   0.875        0.851        0.160
mahalanobis                    0.826        0.822        0.090
lof                            0.817        0.799        0.250
oneclass_svm                   0.830        0.825        0.290
isolation_forest               0.644        0.539        0.050

Benchmark Metrics Explained

Throughput (QPS)

Queries per second — how many match queries the system can process each second.

  • Higher is better
  • potion-8m: ~4000 QPS (39x faster than minilm)
  • minilm: ~100 QPS (baseline)
  • bge-base: ~40 QPS (2.5x slower than minilm)

Accuracy

Top-1 match accuracy — the percentage of queries matched to the correct entity.

  • Typical range: 0.80–0.95 (80–95%)
  • Tradeoff with speed: static models trade slight accuracy for huge speed gains

Latency

Time per query — reported as avg_latency, p95_latency, and p99_latency (lower is better).

AUROC (Novelty)

Area Under ROC Curve — overall discrimination ability for novelty detection. 1.0 = perfect, 0.5 = random.

DR@1% (Novelty)

Detection Rate at a 1% false positive rate — the fraction of novel samples caught when only 1% of known samples are incorrectly flagged as novel. Measures practical detection capability at a strict operating point.
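
Both metrics fall out of a single ROC sweep over the novelty scores. A minimal sketch with scikit-learn on synthetic scores (illustrative, not benchmark output):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = np.array([0] * 500 + [1] * 100)           # 1 = novel
scores = np.concatenate([rng.normal(0, 1, 500),    # known samples
                         rng.normal(2, 1, 100)])   # novel samples score higher

auroc = roc_auc_score(y_true, scores)
fpr, tpr, _ = roc_curve(y_true, scores)
dr_at_1pct = tpr[fpr <= 0.01].max()  # detection rate at <= 1% false positive rate
print(f"AUROC={auroc:.3f}  DR@1%={dr_at_1pct:.3f}")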

Benchmark Datasets

HuggingFace Datasets (run command)

Task               Datasets
Entity Resolution  walmart_amazon, amazon_google, fodors_zagats, beer, dblp_acm, dblp_googlescholar, itunes_amazon
Classification     ag_news, yahoo_answers, go_emotions
Novelty Detection  ag_news, go_emotions (with a 20% OOD class split)

Datasets are cached as Parquet files under data/hf_benchmarks/.
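
To inspect a cached dataset directly, the Parquet files are readable with pandas (the file name below is an assumption; check the directory for the actual layout):

import pandas as pd

df = pd.read_parquet("data/hf_benchmarks/ag_news.parquet")  # hypothetical file name
print(df.shape)
print(df.head())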

Benchmark Download Security

Some entity-resolution benchmark sources currently resolve to legacy http:// URLs. The benchmark loader emits a warning when insecure transport is used.

Migration guidance:

  • Prefer HTTPS mirrors for benchmark assets whenever available.
  • Update download_url values in dataset registry entries to trusted HTTPS sources.
  • Treat HTTP benchmark downloads as non-production and integrity-risky until migrated.
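
To audit registry entries for insecure transport yourself, a check along these lines suffices (the registry shape here is hypothetical, not the library's internal structure):

from urllib.parse import urlparse

registry = {
    "walmart_amazon": {"download_url": "http://example.org/walmart_amazon.zip"},  # hypothetical entry
    "beer": {"download_url": "https://example.org/beer.zip"},                     # hypothetical entry
}

for name, entry in registry.items():
    if urlparse(entry["download_url"]).scheme != "https":
        print(f"WARNING: {name} uses insecure transport: {entry['download_url']}")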

Processed Sections (bench-async)

Custom CSV sections in data/processed/*/*.csv:

data/processed/
├── languages/
│   └── languages.csv
├── universities/
│   └── universities.csv
└── currencies/
    └── currencies.csv

CSV columns: id, name, aliases (pipe-separated), type (optional).
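
For reference, a minimal languages.csv in that format might look like this (illustrative rows, not shipped data):

id,name,aliases,type
1,English,en|eng,language
2,German,de|deu|Deutsch,language
3,Japanese,ja|jpn,language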

Interpreting Results for Model Selection

Speed-Critical: potion-8m

  • 39x faster than minilm, minimal accuracy tradeoff (~92% vs 93%)
  • Use for high-traffic APIs (>1000 req/s), tight latency budgets (<10ms)

Accuracy-Critical: bge-base

  • Highest accuracy (~94–95%), better contextual understanding
  • Use when accuracy is paramount, lower traffic volumes

Balanced: minilm

  • Good accuracy (~93%), reasonable speed (~100 QPS)
  • Safe default for moderate traffic

Multilingual: mrl-multi or bge-m3

  • Static (fast) or dynamic (accurate) multilingual options

Programmatic Usage

from novelentitymatcher.benchmarks import BenchmarkRunner

runner = BenchmarkRunner()

# Load datasets
runner.load_all()

# Run specific benchmarks
er_results = runner.run_entity_resolution_benchmark(model="potion-8m")
clf_results = runner.run_classification(model="potion-8m", mode="zero-shot")
novelty_results = runner.run_novelty(model="potion-8m", ood_ratio=0.2)

# Run everything
all_results = runner.run_all()
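
Results from these calls can be saved and later fed to the render subcommand; assuming the returned objects are JSON-serializable (an assumption, not a documented guarantee):

import json

with open("data/hf_benchmarks/results.json", "w") as f:
    json.dump(all_results, f, indent=2)  # assumes dict-like results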

Troubleshooting

"No benchmark sections found"

No processed data in data/processed/. Check with ls data/processed/*/*.csv or specify datasets explicitly.

Model loading errors

Test model loading directly:

from novelentitymatcher import Matcher

m = Matcher(model="your-model", entities=[{"id": "1", "name": "test"}])
m.fit()

Out of memory

Benchmark one model at a time or reduce data: --max-train-samples 100.

Next Steps