Novel Entity Matcher
Map messy text to canonical entities with automatic novel entity detection and classification.
What It Solves
- Normalize messy entity strings (typos, aliases, alternate names)
- Map text to canonical IDs (e.g. country code matching)
- Detect novel entities not present in your known classes
- Discover and propose new entity categories automatically
- Run locally with Sentence Transformers + SetFit — no cloud API required
Example: "Deutchland" → DE
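To make the normalization idea concrete, here is a minimal sketch that resolves a misspelled name to a canonical code. It deliberately swaps in plain string similarity from Python's standard `difflib` module rather than the embedding models the package runs on, so it only illustrates the input/output shape. The `ALIASES` table, the `normalize` function, and the `0.8` cutoff are assumptions made up for this example and are not part of the package.

```python
import difflib

# Hypothetical alias table: lowercase surface form -> canonical ID.
ALIASES = {
    "united states": "US",
    "canada": "CA",
    "germany": "DE",
    "deutschland": "DE",
    "france": "FR",
    "japan": "JP",
}

def normalize(text, cutoff=0.8):
    """Return the canonical ID of the closest alias, or None if nothing is close enough."""
    candidates = difflib.get_close_matches(
        text.lower(), list(ALIASES), n=1, cutoff=cutoff
    )
    return ALIASES[candidates[0]] if candidates else None

print(normalize("Deutchland"))  # typo of "deutschland"
print(normalize("Atlantis"))    # no alias is close enough
```

A pure string-similarity approach like this handles typos but not aliases that share no characters with the canonical name, which is one reason to prefer embedding-based matching.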
Quick Start
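import asyncio
from novelentitymatcher import Matcher

# Build a matcher over the known canonical IDs
matcher = Matcher(["US", "CA", "DE", "FR", "JP"])

# Calling the matcher is awaited, so drive it with asyncio.run from synchronous code
results = asyncio.run(matcher("Deutchland"))
# → MatchResult(id="DE", score=0.92)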
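If the matcher is awaitable as shown above, several queries could plausibly be run concurrently with `asyncio.gather`. The sketch below is an assumption-laden illustration: `fake_matcher` is a hypothetical stand-in with a hard-coded lookup so the example runs without the package installed, and `match_batch` is a helper invented here, not a documented API.

```python
import asyncio

async def fake_matcher(text):
    """Hypothetical stand-in for an awaitable matcher; returns a canonical ID or None."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    lookup = {"deutchland": "DE", "u.s.a.": "US"}
    return lookup.get(text.lower())

async def match_batch(matcher, texts):
    """Await one matcher call per input concurrently, preserving input order."""
    return await asyncio.gather(*(matcher(t) for t in texts))

results = asyncio.run(
    match_batch(fake_matcher, ["Deutchland", "U.S.A.", "Atlantis"])
)
print(results)
```

`asyncio.gather` returns results in the same order as the inputs, so each output can be paired with its query by position.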
Key Features
- Unified Matcher class: auto-selects between zero-shot, SetFit, BERT, and hybrid modes
- Novelty Detection: identifies entities that don't match any known class using kNN, clustering, and statistical strategies
- Discovery Pipeline: staged processing with novel class proposal via LLM or heuristic methods
- Blocking & Reranking: BM25, TF-IDF, and fuzzy blocking with cross-encoder reranking for scalability
- Hierarchical Matching: tree-aware entity resolution with configurable depth and pruning
- Async API: high-throughput matching with async/await for batch workloads
- Multiple Backends: Sentence Transformers, LiteLLM, and static embeddings (Model2Vec)
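As a rough illustration of the kNN flavor of novelty detection listed above, the sketch below flags a query as novel when its mean cosine similarity to its k nearest known vectors falls below a threshold. This is a generic textbook-style approach under stated assumptions, not the package's actual implementation: the function names, the 2-D toy vectors, and the `k=2` and `threshold=0.8` values are all invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_novel(query, known, k=2, threshold=0.8):
    """Return True when the mean similarity to the k nearest known vectors is below threshold."""
    sims = sorted((cosine(query, v) for v in known), reverse=True)[:k]
    if not sims:  # nothing is known yet, so everything is novel
        return True
    return sum(sims) / len(sims) < threshold

# Toy 2-D "embeddings" of known entities (made-up values for illustration).
known = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2)]

print(is_novel((1.0, 0.05), known))  # points the same way as the known vectors
print(is_novel((0.0, 1.0), known))   # nearly orthogonal to every known vector
```

In practice the threshold would need tuning on held-out data, since a value that is too high flags ordinary variants as novel and one that is too low misses genuinely new entities.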
Where to Go Next
- Getting Started: Quickstart guide (install, create a matcher, and run your first match)
- API Reference: Auto-generated docs (full API documentation from source docstrings)
- Guides: Async API · Configuration · Models · Matcher Modes
- Experiments: Benchmarking (reproduce results and run your own benchmarks)
- Architecture: Internals (module layout, design decisions, and extension points)
- Roadmap: Technical Roadmap (active development plan and upcoming features)
Installation
Optional extras for novelty detection, LLM features, and visualization: