Novel Class Detection
Related docs: quickstart.md | architecture.md | technical-roadmap.md
NovelEntityMatcher is the supported orchestration API for novelty-aware matching and class discovery. This page is the source of truth for using the novelty stack as it exists in the repo today.
When to Use It
Use Matcher by itself when every query should map to an existing entity. Add NovelEntityMatcher when some queries may represent:
- a genuinely new class
- an out-of-distribution input that should be reviewed instead of force-matched
- a cluster of related unmatched samples that should be proposed as a new entity
Main Flow
import asyncio

from novelentitymatcher import Matcher, NovelEntityMatcher
from novelentitymatcher.novelty import DetectionConfig
from novelentitymatcher.novelty.config.strategies import ConfidenceConfig, KNNConfig

# Known entities that ordinary matching can resolve to.
entities = [
    {"id": "physics", "name": "Physics"},
    {"id": "cs", "name": "Computer Science"},
]

# Fit the base matcher on labeled examples.
matcher = Matcher(entities=entities, model="minilm", threshold=0.6)
matcher.fit(
    texts=["quantum mechanics", "neural networks"],
    labels=["physics", "cs"],
)

# Wrap the fitted matcher with novelty detection.
novel_matcher = NovelEntityMatcher(
    matcher=matcher,
    detection_config=DetectionConfig(
        strategies=["confidence", "knn_distance"],
        confidence=ConfidenceConfig(threshold=0.65),
        knn_distance=KNNConfig(distance_threshold=0.45),
    ),
    auto_save=False,
)


async def main():
    # discover_novel_classes is a coroutine, so it has to run inside an event loop.
    return await novel_matcher.discover_novel_classes(
        queries=["quantum biology", "new interdisciplinary topic"],
        existing_classes=["physics", "cs"],
        run_llm_proposal=False,
    )


report = asyncio.run(main())
What Exists Today
The current novelty stack is already broader than a simple threshold check:
- NovelEntityMatcher orchestrates match, detect, cluster, and optional proposal steps
- NoveltyDetector combines multiple signals instead of relying on one threshold
- distance-based novelty can use ANN-backed search structures
- proposal and review helpers can persist artifacts for later inspection
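To make the cluster step concrete, here is a minimal, library-independent sketch of grouping flagged samples by embedding similarity. It is not the clustering that NovelEntityMatcher performs internally. The function name `cluster_by_similarity`, the `min_similarity` and `min_cluster_size` parameters, and the single-link grouping rule are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; zero vectors are treated as unrelated (0.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(
    vectors: Sequence[Vector],
    min_similarity: float = 0.8,   # hypothetical threshold
    min_cluster_size: int = 2,     # hypothetical minimum group size
) -> List[List[int]]:
    """Single-link grouping: samples joined by any chain of similar pairs share a cluster."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= min_similarity:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)

    # Drop groups that are too small to propose as a new class.
    return sorted(g for g in groups.values() if len(g) >= min_cluster_size)


# Toy 2-D "embeddings" of flagged samples (made-up values).
flagged = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.05, 1.0], [-1.0, -1.0]]
clusters = cluster_by_similarity(flagged)
print(clusters)
```

Each returned list holds the indices of samples that could be reviewed together as one candidate class. Samples that do not reach `min_cluster_size` are left out and would be reviewed individually.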
Detection Strategies
The exact strategy set may continue evolving, but the active system centers on these kinds of signals:
- confidence and uncertainty signals from the base matcher
- nearest-neighbor distance and support checks against known examples
- clustering-based grouping for batches of suspicious samples
- optional proposal generation for naming or summarizing candidate new classes
The practical recommendation is to start with conservative confidence and k-NN thresholds, then add clustering or proposal steps once you have a review workflow.
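As a rough illustration of how a confidence signal and a k-NN distance signal can be combined, the sketch below flags a query when enough signals fire. It is a standalone approximation, not the NoveltyDetector implementation. The default thresholds reuse the values from the Main Flow example, while the helper names, the mean-of-k-nearest distance, and the `min_votes` rule are assumptions.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; zero vectors are treated as unrelated (1.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_knn_distance(query: Vector, known: Sequence[Vector], k: int = 3) -> float:
    """Average cosine distance from the query to its k nearest known examples."""
    if not known:
        raise ValueError("need at least one known example")
    nearest = sorted(cosine_distance(query, v) for v in known)[:k]
    return sum(nearest) / len(nearest)


def novelty_signals(
    confidence: float,
    query: Vector,
    known: Sequence[Vector],
    confidence_threshold: float = 0.65,
    distance_threshold: float = 0.45,
    k: int = 3,
) -> Dict[str, bool]:
    """Each entry is True when that signal suggests the query is novel."""
    return {
        "confidence": confidence < confidence_threshold,
        "knn_distance": mean_knn_distance(query, known, k) > distance_threshold,
    }


def is_novel(signals: Dict[str, bool], min_votes: int = 1) -> bool:
    """Hypothetical voting rule: flag the query when at least min_votes signals fire."""
    return sum(signals.values()) >= min_votes


# Toy 2-D "embeddings" of known examples (made-up values).
known = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
near = novelty_signals(0.9, [1.0, 0.05], known, k=2)
far = novelty_signals(0.3, [-1.0, -1.0], known, k=2)
unsure = novelty_signals(0.5, [1.0, 0.05], known, k=2)
print(near, far, unsure)
```

With `min_votes=2` a query is flagged only when both signals agree, which sends fewer samples to review. With `min_votes=1` either signal is enough, which sends more.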
Lower-Level Components
- novelentitymatcher.novelty.core.detector.NoveltyDetector: modular detector used by NovelEntityMatcher
- novelentitymatcher.novelty.proposal.llm.LLMClassProposer: LLM-backed naming and summarization
- novelentitymatcher.novelty.storage.ANNIndex: ANN search index used by distance-based strategies
- novelentitymatcher.novelty.storage.save_proposals / load_proposals: persistence helpers for discovery reports
Reports
Discovery returns NovelClassDiscoveryReport, which typically contains:
- novel_sample_report.novel_samples: flagged samples with confidence, novelty score, signals, and per-sample metrics
- class_proposals: optional generated class names and justifications
- metadata: counts and output paths for saved artifacts
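A small helper along the following lines could summarize a report for logging or review. This is a sketch under assumptions: it reads only the three attributes named above, it treats `metadata` as a plain mapping, and it runs against a stand-in object built with `SimpleNamespace` rather than a real NovelClassDiscoveryReport. The `text` field and the `num_queries` key in the stand-in are made up.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_report(report: Any) -> Dict[str, Any]:
    """Collect a few top-level facts from a discovery report."""
    samples = list(report.novel_sample_report.novel_samples)
    proposals = getattr(report, "class_proposals", None)
    metadata = getattr(report, "metadata", None) or {}
    return {
        "num_novel_samples": len(samples),
        # Proposals are optional, so they may be missing or empty.
        "has_proposals": bool(proposals),
        "metadata": metadata,
    }


# Stand-in object for illustration only; field values are invented.
fake_report = SimpleNamespace(
    novel_sample_report=SimpleNamespace(
        novel_samples=[
            SimpleNamespace(text="quantum biology"),
            SimpleNamespace(text="new interdisciplinary topic"),
        ]
    ),
    class_proposals=None,
    metadata={"num_queries": 2},
)

summary = summarize_report(fake_report)
print(summary)
```

Check the actual attribute types in your installed version before relying on this shape, especially for `class_proposals` and `metadata`.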
Example Paths
Start with the maintained end-to-end example:
Other novelty-oriented examples in examples/ cover pattern-based, one-class, prototypical, and SetFit-backed workflows. Use them as implementation examples, not as separate sources of truth for the public docs.
Notes
- Use technical-roadmap.md for future architectural direction.
- Older research-heavy novelty proposals have been moved into the archive so this page can stay focused on the current workflow.