Architecture¶
Related docs: index.md | quickstart.md
Overview¶
Novel Entity Matcher is a text-to-entity matching library built on SetFit few-shot learning and Sentence Transformers, with automatic detection and classification of novel entities. Use this page for internals and module boundaries rather than first-run usage.
Module Structure¶
src/novelentitymatcher/
├── __init__.py # Public exports / lazy import surface
├── config.py # Config loading and defaults
├── core/ # Matching pipelines and domain logic
│ ├── matcher.py # EntityMatcher / EmbeddingMatcher / Matcher
│ ├── classifier.py # SetFitClassifier wrapper
│ ├── bert_classifier.py # BERTClassifier wrapper
│ ├── normalizer.py # Text normalization
│ ├── blocking.py # Candidate blocking strategies
│ ├── reranker.py # Cross-encoder reranking
│ ├── hybrid.py # Multi-stage matching pipeline
│ └── monitoring.py # Metrics/monitoring helpers
├── backends/ # Provider integrations (embeddings/reranking)
│ ├── base.py # Backend interfaces / shared abstractions
│ ├── static_embedding.py # Static embedding backend (model2vec, StaticEmbedding)
│ ├── sentencetransformer.py
│ ├── reranker_st.py
│ ├── litellm.py # Planned/in-progress cloud backend support
│ └── ...
├── pipeline/ # Internal stage contracts and orchestrator
│ ├── contracts.py # StageContext / StageResult / PipelineStage
│ ├── orchestrator.py # Ordered stage execution
│ └── adapters.py # Wrappers around matcher, OOD, and proposal flows
├── novelty/ # Novelty detection, discovery, and persistence
│ ├── entity_matcher.py # Public novelty-aware API
│ ├── match_result.py # Stable metadata/result contract for discovery work
│ ├── core/
│ ├── proposal/
│ ├── schemas/
│ └── storage/
├── ingestion/ # Dataset ingestion and normalization CLI/pipelines
│ ├── cli.py # `novelentitymatcher-ingest` entrypoint target
│ └── *.py # Source-specific ingestors (countries/products/etc.)
├── utils/ # Cross-cutting helpers (non-domain specific)
└── data/ # Packaged static data files / defaults
Package Boundaries¶
- `core/`: orchestration and domain logic for matching, retrieval/reranking pipelines, and normalization.
- `pipeline/`: internal stage-oriented contracts used to compose discovery flows without exposing unstable stage APIs yet.
- `novelty/`: novelty detection, discovery reports, proposal generation, and persistence.
- `backends/`: provider-specific integrations for embeddings and rerankers (Hugging Face, LiteLLM, etc.).
- `ingestion/`: data acquisition/transformation utilities and the ingestion CLI.
- `utils/`: shared helpers used across modules that are not themselves product/domain entrypoints.
- `data/`: packaged JSON/static assets required at runtime.
Module Placement Rules¶
- Add a new matcher or pipeline stage to `core/` unless it is provider-specific.
- Add internal discovery-stage contracts or orchestrators to `pipeline/`.
- Add a new model/provider integration to `backends/`.
- Add novelty scoring, proposal generation, and review/persistence logic to `novelty/`.
- Add dataset import/transformation logic or CLI wiring to `ingestion/`.
- Put generic helpers in `utils/`; avoid moving domain logic there just to "reuse" it.
- Keep the public import surface curated through `src/novelentitymatcher/__init__.py` (avoid exposing internal modules unintentionally).
Core Components¶
Matcher (Unified API)¶
Matcher is the recommended entry point; it automatically selects an appropriate matching strategy based on the available training data.
Modes:
- zero-shot: Embedding similarity without training
- head-only: Lightweight SetFit training (~30s)
- full: Full SetFit training (~3min)
- bert: BERT-based classifier (~5min, high accuracy)
- hybrid: Multi-stage pipeline (blocking → retrieval → reranking)
- auto: Auto-detects based on training data volume
Workflow:
matcher = Matcher(entities=[...], mode="auto")
matcher.fit(training_data=None) # Auto-selects mode
result = matcher.match("query") # Routes to appropriate strategy
Auto-selection Rules:
- No training data → zero-shot mode
- < 3 examples per entity → head-only mode
- ≥ 3 examples per entity, < 100 total → full training mode
- ≥ 100 total, ≥ 8 examples per entity → bert mode
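The rules above can be sketched as a small selection function. This is illustrative only; the actual heuristics live inside Matcher and may apply different thresholds or tie-breaking:

```python
def select_mode(examples_per_entity: list[int]) -> str:
    """Illustrative sketch of the auto-selection rules above.

    `examples_per_entity` holds the number of labeled examples available
    for each entity. The shipped Matcher internals may differ.
    """
    if not examples_per_entity or sum(examples_per_entity) == 0:
        return "zero-shot"           # no training data at all
    total = sum(examples_per_entity)
    min_per_entity = min(examples_per_entity)
    if min_per_entity < 3:
        return "head-only"           # too few examples per entity
    if total >= 100 and min_per_entity >= 8:
        return "bert"                # enough data for a BERT classifier
    return "full"                    # >= 3 per entity, < 100 total
```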
EntityMatcher (Deprecated)¶
SetFit-based entity matching with optional text normalization.
Workflow:
1. Initialize with entities list (id, name, aliases)
2. Train with labeled examples
3. Predict entity ID for new inputs
matcher = EntityMatcher(entities=[
{"id": "DE", "name": "Germany", "aliases": ["Deutschland"]}
])
matcher.train(training_data)
result = matcher.predict("Deutschland") # → "DE"
EmbeddingMatcher¶
Similarity-based matching without training. Uses cosine similarity between embeddings.
Workflow:
1. Initialize with entities
2. Build index (encodes all names/aliases)
3. Match queries against index
matcher = EmbeddingMatcher(entities=[...])
matcher.build_index()
result = matcher.match("Deutschland") # → {"id": "DE", "score": 0.92}
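Under the hood, the matching step reduces to cosine similarity between the query embedding and the indexed name/alias embeddings, gated by a confidence threshold. A minimal pure-Python sketch (the toy index and vectors here are made up; real vectors come from the configured embedding backend):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy index: entity id -> embedding. A real index holds one vector per
# name/alias, produced by the embedding backend at build_index() time.
INDEX = {
    "DE": [0.9, 0.1, 0.0],
    "FR": [0.1, 0.9, 0.0],
}

def match(query_vec, threshold=0.5):
    """Return the best entity above the confidence threshold, else None."""
    best_id, best_score = max(
        ((eid, cosine(query_vec, vec)) for eid, vec in INDEX.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None                  # below threshold: no confident match
    return {"id": best_id, "score": best_score}
```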
SetFitClassifier¶
Low-level wrapper around SetFit for training and prediction.
BERTClassifier¶
Low-level wrapper around transformers library for BERT-based classification.
TextNormalizer¶
Text preprocessing with options for:
- Lowercase conversion
- Accent removal
- Punctuation removal
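The three options above can be sketched with the standard library. This is an illustrative stand-in, not the shipped TextNormalizer, which may handle additional edge cases:

```python
import string
import unicodedata

def normalize(text, lowercase=True, strip_accents=True, strip_punct=True):
    """Illustrative normalizer mirroring the options listed above."""
    if lowercase:
        text = text.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        text = "".join(
            c for c in unicodedata.normalize("NFKD", text)
            if not unicodedata.combining(c)
        )
    if strip_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text
```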
Data Flow¶
Input Text
↓
TextNormalizer (optional)
↓
Matcher metadata collection
↓
Known-entity routing / top-k candidates
↓
OOD / novelty detection
↓
Optional class proposal generation
↓
Result (entity match, novelty report, and optional proposals)
Discovery Pipeline¶
DiscoveryPipeline is the pipeline-first public discovery entry point. It owns and wires together:
- a fitted `Matcher` for top-k match metadata and embeddings
- a `NoveltyDetector` built from `PipelineConfig`
- a `ScalableClusterer` configured from pipeline clustering settings
- an optional `LLMClassProposer` for class proposal generation
The internal stage sequence is:
1. `MatcherMetadataStage`
2. `OODDetectionStage`
3. `CommunityDetectionStage`
4. `ClusterEvidenceStage`
5. `ProposalStage`
Key stage-level behaviors:
- OOD strategy and calibration settings are resolved before stage execution and applied through the owned `DetectionConfig`
- clustering metric / density parameters are applied both to the owned `ScalableClusterer` and the cluster stage runtime call
- proposal generation can run in cluster mode or sample mode
- schema discovery augments proposals with discovered attributes and a normalized `attribute_schema`
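The stage contracts can be pictured as a tiny protocol plus an ordered runner. This is a sketch of the idea only; the real `contracts.py` / `orchestrator.py` APIs differ, and the stage bodies here are stand-ins:

```python
from dataclasses import dataclass, field

@dataclass
class StageContext:
    """Mutable state threaded through the stages (illustrative)."""
    query: str
    data: dict = field(default_factory=dict)

class PipelineStage:
    """Minimal stage contract: consume a context, return it enriched."""
    name = "stage"
    def run(self, ctx: StageContext) -> StageContext:
        raise NotImplementedError

class MatcherMetadataStage(PipelineStage):
    name = "matcher_metadata"
    def run(self, ctx):
        ctx.data["candidates"] = ["DE"]   # stand-in for top-k metadata
        return ctx

class OODDetectionStage(PipelineStage):
    name = "ood_detection"
    def run(self, ctx):
        ctx.data["is_novel"] = not ctx.data.get("candidates")
        return ctx

def run_pipeline(stages, ctx):
    """Ordered stage execution, as done by the orchestrator."""
    for stage in stages:
        ctx = stage.run(ctx)
    return ctx
```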
This keeps DiscoveryPipeline as the richest configuration surface for discovery workflows, while NovelEntityMatcher remains the simpler novelty-aware classification entry point.
Backends¶
Static Embeddings¶
Fast retrieval-oriented embeddings using pre-computed lookups.
Supports two approaches:
- model2vec (StaticModel): minishlab potion models (potion-8m, potion-32m)
- StaticEmbedding (sentence-transformers): RikkaBotan MRL models
Benefits:
- 10-100x faster than dynamic embeddings
- Lower memory usage
- Sufficient accuracy for retrieval scenarios
Usage:
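To illustrate why pre-computed lookups are fast: a static embedding is essentially a table from token to vector plus mean pooling, with no transformer forward pass per query. A pure-Python sketch with a made-up two-token table (the real backends wrap model2vec / sentence-transformers models with large vocabularies):

```python
# Toy pre-computed token table; real static models ship vocabularies of
# hundreds of thousands of tokens with higher-dimensional vectors.
TABLE = {
    "germany": [1.0, 0.0],
    "deutschland": [0.9, 0.1],
}
DIM = 2

def embed(text):
    """Mean-pool the pre-computed vectors of known tokens.

    Each token costs one dictionary lookup, which is why static
    embeddings are 10-100x faster than running an encoder model.
    """
    vecs = [TABLE[tok] for tok in text.lower().split() if tok in TABLE]
    if not vecs:
        return [0.0] * DIM           # no known tokens: zero vector
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]
```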
HuggingFace (SentenceTransformers)¶
- `HFEmbedding` - Generate embeddings
- `HFReranker` - Cross-encoder reranking
LiteLLM (future)¶
- Cloud LLM embedding support (planned/in progress; confirm implementation status before relying on it)
Ollama (future)¶
- Local LLM embeddings (planned; may not be fully wired in current release)
Model Registries¶
MODEL_SPECS¶
Central registry of model specifications with metadata:
- Static models: potion-8m, potion-32m, mrl-en, mrl-multi
- Dynamic models: bge-base, bge-m3, nomic, mpnet, minilm
- Training support: Marks which models can be used for SetFit training
Resolution Logic¶
- `resolve_model_alias()`: Maps short aliases to full model names
- `is_static_embedding_model()`: Detects static embedding models
- `resolve_training_model_alias()`: Falls back to training-safe models
Default Models¶
- Retrieval default: `potion-8m` (fast static embeddings)
- Training default: `mpnet` (SetFit-compatible)
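Alias resolution with a training-safe fallback can be sketched as below. The registry contents here are a hypothetical two-entry slice, not the real `MODEL_SPECS`, and the function bodies are illustrative:

```python
# Hypothetical slice of the registry; the real MODEL_SPECS carries more
# metadata (dimensions, multilingual support, etc.).
MODEL_SPECS = {
    "potion-8m": {"hf_id": "minishlab/potion-base-8M",
                  "static": True, "trainable": False},
    "mpnet": {"hf_id": "sentence-transformers/all-mpnet-base-v2",
              "static": False, "trainable": True},
}

def resolve_model_alias(alias):
    """Map a short alias to a full model name; unknown names pass through."""
    spec = MODEL_SPECS.get(alias)
    return spec["hf_id"] if spec else alias

def is_static_embedding_model(alias):
    """True when the alias refers to a static (lookup-table) model."""
    spec = MODEL_SPECS.get(alias)
    return bool(spec and spec["static"])

def resolve_training_model_alias(alias, fallback="mpnet"):
    """Static models cannot back SetFit training, so fall back to a
    training-safe default when a non-trainable model is requested."""
    spec = MODEL_SPECS.get(alias)
    if spec and not spec["trainable"]:
        return resolve_model_alias(fallback)
    return resolve_model_alias(alias)
```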
Matcher Mode System¶
MATCHER_MODE_REGISTRY¶
Maps mode names to implementation classes:
- zero-shot → EmbeddingMatcher
- head-only → EntityMatcher (lightweight training)
- full → EntityMatcher (full training)
- hybrid → HybridMatcher (multi-stage pipeline)
- auto → SmartSelection (runtime detection)
Mode Selection Process¶
1. User specifies mode (or uses `auto`)
2. Matcher routes to the appropriate implementation
3. Training requests with static models auto-fallback to a training-safe backbone
4. Hybrid mode uses the blocking → retrieval → reranking pipeline
Design Decisions¶
- Optional normalization - Users can disable if input is already clean
- Lazy model loading - SentenceTransformer loaded on first use
- Flexible input - Single string or list of strings for batch prediction
- Threshold-based matching - Configurable confidence threshold for EmbeddingMatcher