Benchmarks

novelentitymatcher.benchmarks.cli

CLI for HuggingFace benchmarks.

novelentitymatcher.benchmarks.loader

Async dataset loader for HuggingFace benchmarks with parquet caching.

Classes

DatasetLoader(cache_dir=None, cache_config=None)

Source code in src/novelentitymatcher/benchmarks/loader.py
def __init__(
    self,
    cache_dir: Path | None = None,
    cache_config: CacheConfig | None = None,
):
    self.cache_dir = cache_dir or DEFAULT_CACHE_DIR
    self.cache_config = cache_config or CacheConfig()
    self.cache_dir.mkdir(parents=True, exist_ok=True)
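
A minimal construction sketch based on the signature above; the loader's fetch/load methods are not documented on this page, so only instantiation is shown:

from pathlib import Path

from novelentitymatcher.benchmarks.loader import DatasetLoader

# Cache Parquet files under a project-local directory instead of DEFAULT_CACHE_DIR.
# The constructor creates the directory if it does not exist.
loader = DatasetLoader(cache_dir=Path(".cache/hf_benchmarks"))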

novelentitymatcher.benchmarks.runner

Benchmark runner orchestrator for HuggingFace benchmarks.

Classes

BenchmarkRunner(output_dir=None, cache_dir=None)

Source code in src/novelentitymatcher/benchmarks/runner.py
def __init__(
    self,
    output_dir: Path | None = None,
    cache_dir: Path | None = None,
):
    self.output_dir = output_dir or Path("data/hf_benchmarks")
    self.output_dir.mkdir(parents=True, exist_ok=True)
    self.loader = DatasetLoader(cache_dir=cache_dir)

    self.er_evaluator = EntityResolutionEvaluator()
    self.clf_evaluator = ClassificationEvaluator()
    self.novelty_evaluator = NoveltyEvaluator()
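
A construction sketch mirroring the signature above; the directory paths are illustrative:

from pathlib import Path

from novelentitymatcher.benchmarks.runner import BenchmarkRunner

# output_dir defaults to data/hf_benchmarks and is created on init;
# cache_dir is forwarded to the internal DatasetLoader.
runner = BenchmarkRunner(
    output_dir=Path("results/hf_benchmarks"),
    cache_dir=Path(".cache/hf_benchmarks"),
)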

novelentitymatcher.benchmarks.shared

Shared utilities for benchmark scripts.

Consolidates duplicated code from:

- benchmark_bert.py / benchmark_bert_models.py (generate_synthetic_data, benchmark_training, benchmark_inference)
- benchmark_full_pipeline.py / benchmark_novelty_strategies.py / benchmark_novelty_full.py (compute_ood_metrics, SplitData, OOD splitting; sketched below)
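
A standalone sketch of the OOD-splitting idea (the real SplitData and compute_ood_metrics signatures are not shown on this page; ood_split here is hypothetical): a subset of label classes is held out as "novel", so detectors are scored on classes never seen during fitting.

import numpy as np

def ood_split(X, y, novel_labels):
    # Rows whose label is in novel_labels form the out-of-distribution set.
    y = np.asarray(y)
    ood_mask = np.isin(y, list(novel_labels))
    return (X[~ood_mask], y[~ood_mask]), (X[ood_mask], y[ood_mask])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 5, size=100)
# Classes 3 and 4 are treated as unseen novel entities.
(in_X, in_y), (ood_X, ood_y) = ood_split(X, y, novel_labels={3, 4})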

novelentitymatcher.benchmarks.registry

Dataset registry for HuggingFace benchmarks.
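
The registry contents are not listed on this page. Purely as an illustration of the pattern, an entry might map a short dataset name to its HuggingFace path plus the columns a benchmark needs (field names here are hypothetical, not the module's actual schema):

# Hypothetical shape of a registry entry; adapt to the real schema.
DATASETS = {
    "ag_news": {
        "hf_path": "ag_news",
        "text_column": "text",
        "label_column": "label",
    },
}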

novelentitymatcher.benchmarks.base

Base evaluator abstract class for benchmarks.

Classes

BaseEvaluator(name)

Bases: ABC, Generic[T]

Source code in src/novelentitymatcher/benchmarks/base.py
def __init__(self, name: str):
    self.name = name

Functions

evaluate(data, **kwargs) abstractmethod

Evaluate on the given data.

Source code in src/novelentitymatcher/benchmarks/base.py
@abstractmethod
def evaluate(
    self,
    data: T,
    **kwargs,
) -> EvaluationResult:
    """Evaluate on the given data."""
    raise NotImplementedError

get_default_metrics() abstractmethod

Return list of default metric names this evaluator computes.

Source code in src/novelentitymatcher/benchmarks/base.py
@abstractmethod
def get_default_metrics(self) -> list[str]:
    """Return list of default metric names this evaluator computes."""
    raise NotImplementedError
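
A minimal concrete subclass sketch. It assumes EvaluationResult is importable from the same module and accepts name/metrics keyword arguments; its real constructor is not shown on this page:

from novelentitymatcher.benchmarks.base import BaseEvaluator, EvaluationResult

class NonEmptyEvaluator(BaseEvaluator[list[str]]):
    """Toy evaluator: scores the fraction of non-empty strings."""

    def __init__(self) -> None:
        super().__init__(name="non_empty")

    def evaluate(self, data: list[str], **kwargs) -> EvaluationResult:
        score = sum(1 for item in data if item) / max(len(data), 1)
        # Assumed constructor -- adapt to the real EvaluationResult fields.
        return EvaluationResult(name=self.name, metrics={"non_empty_fraction": score})

    def get_default_metrics(self) -> list[str]:
        return ["non_empty_fraction"]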

novelentitymatcher.benchmarks.novelty_bench

Merged novelty detection benchmark with depth levels.

Consolidates:

- benchmark_full_pipeline.py (Phase 2)
- benchmark_novelty_strategies.py
- benchmark_novelty_full.py

Depth levels:

- quick: KNN, Mahalanobis, LOF, OneClassSVM, IsolationForest (sketched below)
- standard: quick + Pattern, SetFit Centroid, weighted/voting/adaptive ensembles
- full: standard + hyperparameter tuning + SignalCombiner + meta-learner + adaptive weights
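
The quick tier maps onto standard outlier detectors. A sketch of that tier using scikit-learn stand-ins (this illustrates the strategy names, not the module's actual wiring; KNN-distance and Mahalanobis scores follow the same fit/score pattern):

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))            # in-distribution embeddings
X_test = rng.normal(2.0, 1.0, size=(20, 16))    # shifted, mostly novel

detectors = {
    "LOF": LocalOutlierFactor(novelty=True),
    "OneClassSVM": OneClassSVM(gamma="scale"),
    "IsolationForest": IsolationForest(random_state=0),
}
for name, det in detectors.items():
    det.fit(X_train)
    # score_samples is higher for in-distribution points; negate for novelty.
    novelty = -det.score_samples(X_test)
    print(name, round(float(novelty.mean()), 3))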

novelentitymatcher.benchmarks.infra_bench

Infrastructure benchmarks: ANN backends and reranker models.

Benchmarks:

- ANN backends (hnswlib vs faiss vs exact): build time, query latency, recall@k (sketched below)
- Reranker models (bge-m3 vs bge-large vs ms-marco): accuracy, latency

Usage

novelentitymatcher-bench bench-ann --sizes 1000 10000 100000
novelentitymatcher-bench bench-reranker --queries 100
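
recall@k compares the ids an ANN backend returns against exact nearest neighbors; a standalone sketch of the metric (not the module's implementation):

import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    # Fraction of exact top-k neighbors the ANN index also returned, averaged over queries.
    hits = [len(set(a) & set(e)) / len(e) for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits))

approx = np.array([[1, 2, 9], [4, 5, 6]])   # ANN results for two queries, k=3
exact = np.array([[1, 2, 3], [4, 5, 6]])    # brute-force ground truth
print(recall_at_k(approx, exact))           # 0.833...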

novelentitymatcher.benchmarks.classifier_bench

Merged classifier benchmark: BERT vs SetFit comparison and multi-model sweep.

Consolidates:

- benchmark_bert.py (head-to-head BERT vs SetFit)
- benchmark_bert_models.py (multi-model BERT sweep)

Modes:

- compare: BERT vs SetFit head-to-head (harness sketched below)
- sweep-models: benchmark multiple BERT-family classifiers
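
The compare mode reduces to timing fit/predict for two models on the same split. A hypothetical harness with scikit-learn classifiers standing in for the BERT and SetFit models:

import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def compare(models, X_train, y_train, X_test, y_test):
    results = {}
    for name, model in models.items():
        t0 = time.perf_counter()
        model.fit(X_train, y_train)
        train_s = time.perf_counter() - t0
        t0 = time.perf_counter()
        preds = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, preds),
            "train_s": train_s,
            "infer_s": time.perf_counter() - t0,
        }
    return results

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(compare({"a": LogisticRegression(max_iter=1000), "b": GaussianNB()},
              X_tr, y_tr, X_te, y_te))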

novelentitymatcher.benchmarks.async_bench

Async/sync performance benchmark for matcher APIs.

Consolidated from: benchmark_async.py

Benchmarks sync vs async matcher APIs across zero-shot, head-only, and full modes, measuring construct time, fit time, cold-query latency, steady-state match latency, QPS, and end-to-end wall time.
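
A self-contained sketch of the sync-vs-async measurement pattern, with a toy coroutine in place of the real matcher APIs:

import asyncio
import time

def match_sync(query: str) -> str:
    time.sleep(0.01)           # stand-in for a blocking match call
    return query.upper()

async def match_async(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for an awaitable match call
    return query.upper()

queries = [f"q{i}" for i in range(50)]

t0 = time.perf_counter()
for q in queries:
    match_sync(q)
sync_s = time.perf_counter() - t0

async def run_all() -> None:
    await asyncio.gather(*(match_async(q) for q in queries))

t0 = time.perf_counter()
asyncio.run(run_all())
async_s = time.perf_counter() - t0

print(f"sync: {len(queries) / sync_s:.0f} QPS, async: {len(queries) / async_s:.0f} QPS")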

novelentitymatcher.benchmarks.weight_optimizer

Bayesian optimization of ensemble weights using Optuna.

Searches for optimal strategy weights and thresholds that maximize AUROC on validation data. Compares weighted/voting/meta_learner combination methods.

Usage

novelentitymatcher-bench bench-weights --trials 200 --dataset ag_news
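
The search follows the standard Optuna pattern. A minimal sketch assuming two per-example strategy scores and binary novelty labels (all names and the synthetic data are illustrative):

import numpy as np
import optuna
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                 # 1 = novel
score_a = 0.6 * y_true + rng.normal(0, 0.5, 500)      # two imperfect strategy scores
score_b = 0.3 * y_true + rng.normal(0, 0.5, 500)

def objective(trial: optuna.Trial) -> float:
    w = trial.suggest_float("w_a", 0.0, 1.0)
    combined = w * score_a + (1 - w) * score_b
    return roc_auc_score(y_true, combined)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)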

novelentitymatcher.benchmarks.visualization

Visualization utilities for benchmark results.

Consolidates:

- render_benchmark_report.py (JSON -> markdown tables; sketched below)
- visualize_benchmarks.py (JSON -> PNG charts)
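
The JSON-to-markdown direction is a small transform; a hypothetical sketch (the actual benchmark result schema is not shown here, so the fields and values are illustrative):

import json

rows = json.loads('[{"model": "bge-m3", "accuracy": 0.91}, {"model": "bge-large", "accuracy": 0.89}]')

headers = list(rows[0])
lines = ["| " + " | ".join(headers) + " |",
         "| " + " | ".join("---" for _ in headers) + " |"]
for row in rows:
    lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
print("\n".join(lines))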