Novelty Proposals
novelentitymatcher.novelty.proposal.llm
LLM-based class proposal system for novel class discovery.
Uses litellm with structured output to generate meaningful class names and descriptions for clusters of novel samples.
Classes
LLMProposalSchema
LLMProposalWithSchemaSchema
LLMClassProposer(primary_model=None, provider=None, fallback_models=None, api_keys=None, temperature=0.3, max_tokens=4096, max_clusters_per_summary=20)
Propose new class names and descriptions using LLMs.
Uses litellm for multi-provider support with automatic fallback.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `primary_model` | `str \| None` | Primary model to use (e.g., 'openrouter/anthropic/claude-sonnet-4') | `None` |
| `provider` | `str \| None` | Preferred provider when auto-selecting a default model | `None` |
| `fallback_models` | `list[str] \| None` | Fallback models if primary fails | `None` |
| `api_keys` | `dict[str, str] \| None` | API keys for providers (e.g., {'openrouter': 'sk-...'}) | `None` |
| `temperature` | `float` | Sampling temperature | `0.3` |
| `max_tokens` | `int` | Maximum tokens in response | `4096` |
| `max_clusters_per_summary` | `int` | Maximum clusters to include per LLM summary call (for hierarchical mode) | `20` |
Source code in src/novelentitymatcher/novelty/proposal/llm.py
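The fallback behaviour described for `primary_model` and `fallback_models` can be pictured with a small, library-free sketch. It does not call litellm and is not the proposer's actual implementation: `complete_with_fallback`, `fake_call`, and the model names are hypothetical stand-ins used only to show the "try the primary, then each fallback in order" pattern.

```python
from typing import Callable, Sequence


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_name, response) from the first success."""
    errors: list[str] = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


# Hypothetical stand-in for a provider call: the primary always fails here.
def fake_call(model: str, prompt: str) -> str:
    if model == "primary-model":
        raise TimeoutError("simulated outage")
    return f"[{model}] proposal for: {prompt}"


used, reply = complete_with_fallback(
    "name this cluster", ["primary-model", "fallback-model"], fake_call
)
print(used)
```

Collecting the per-model errors before raising keeps the final failure message useful when every model in the chain is unavailable.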
Functions
propose_classes(novel_samples, existing_classes, context=None)
Propose new classes based on novel samples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `novel_samples` | `list[NovelSampleMetadata]` | List of detected novel samples | required |
| `existing_classes` | `list[str]` | List of existing class names | required |
| `context` | `str \| None` | Optional domain context | `None` |

Returns:
| Type | Description |
|---|---|
| `NovelClassAnalysis` | NovelClassAnalysis with proposed classes |
Source code in src/novelentitymatcher/novelty/proposal/llm.py
propose_from_clusters(discovery_clusters, existing_classes, context=None, max_retries=2, hierarchical=True)
Generate proposals from cluster-level evidence.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `discovery_clusters` | `list[DiscoveryCluster]` | List of discovery clusters. | required |
| `existing_classes` | `list[str]` | List of existing class names. | required |
| `context` | `str \| None` | Optional domain context. | `None` |
| `max_retries` | `int` | Maximum retry attempts. | `2` |
| `hierarchical` | `bool` | If True, use hierarchical summarization for large cluster sets. | `True` |
Source code in src/novelentitymatcher/novelty/proposal/llm.py
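The page does not say how hierarchical summarization splits a large cluster set. One plausible reading, given the `max_clusters_per_summary` constructor parameter, is that clusters are grouped into batches of at most that size, with one summary call per batch. The sketch below shows only that batching step; `batch_clusters` is a hypothetical helper, not part of the library.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def batch_clusters(clusters: Sequence[T], max_per_summary: int = 20) -> list[list[T]]:
    """Split clusters into consecutive batches of at most max_per_summary items."""
    if max_per_summary < 1:
        raise ValueError("max_per_summary must be at least 1")
    return [
        list(clusters[start:start + max_per_summary])
        for start in range(0, len(clusters), max_per_summary)
    ]


# 45 placeholder clusters with the default-sized limit of 20 per summary call.
batches = batch_clusters(list(range(45)), max_per_summary=20)
sizes = [len(batch) for batch in batches]
print(sizes)
```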
propose_from_clusters_with_schema(discovery_clusters, existing_classes, context=None, max_retries=2, hierarchical=True, max_attributes=10)
Generate proposals with attribute/field discovery from cluster evidence.
Like propose_from_clusters but the LLM prompt requests discovery of common attributes and data structures for each proposed class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `discovery_clusters` | `list[DiscoveryCluster]` | List of discovery clusters. | required |
| `existing_classes` | `list[str]` | List of existing class names. | required |
| `context` | `str \| None` | Optional domain context. | `None` |
| `max_retries` | `int` | Maximum retry attempts. | `2` |
| `hierarchical` | `bool` | If True, use hierarchical summarization for large cluster sets. | `True` |
Source code in src/novelentitymatcher/novelty/proposal/llm.py
novelentitymatcher.novelty.proposal.retrieval
Retrieval-Augmented LLM Class Proposer.
Enhances LLM-based class proposal with retrieval of in-context examples using dense embeddings (BGE-M3 style) for improved class naming.
Classes
RetrievalAugmentedProposer(retriever=None, llm_proposer=None, k_examples=5, k_novel_per_class=3, retrieval_metric='cosine', rerank=False)
LLM class proposer enhanced with retrieval-based in-context examples.
Retrieves most relevant examples from a corpus to include in the LLM prompt, improving class naming quality.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `retriever` | `EmbeddingBackend \| None` | Embedding backend for retrieval (e.g., BGE-M3) | `None` |
| `llm_proposer` | `Any \| None` | Existing LLMClassProposer to enhance | `None` |
| `k_examples` | `int` | Number of in-context examples to retrieve | `5` |
| `k_novel_per_class` | `int` | Number of novel examples per proposed class | `3` |
| `retrieval_metric` | `str` | Similarity metric for retrieval | `'cosine'` |
| `rerank` | `bool` | Whether to use reranking for better examples | `False` |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
Attributes
is_ready (property)
Check if proposer is ready for use.
Functions
index_examples(examples, embeddings=None)
Index examples for retrieval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `examples` | `list[str]` | List of example texts to index | required |
| `embeddings` | `Any \| None` | Pre-computed embeddings (if None, will compute) | `None` |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
retrieve(query, k=None)
Retrieve k most relevant examples for a query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Query text | required |
| `k` | `int \| None` | Number of examples to retrieve (default: k_examples) | `None` |

Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of dicts with 'text', 'score', 'index' |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
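To make the return shape concrete, here is a minimal top-k retrieval sketch that produces dicts with the documented `'text'`, `'score'`, and `'index'` keys. It assumes cosine scoring over plain Python lists; `cosine`, `top_k`, and the toy vectors are illustrative and are not taken from the library source.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus_vecs, texts, k=5):
    """Score every indexed example against the query and keep the k best."""
    scored = [
        {"text": text, "score": cosine(query_vec, vec), "index": i}
        for i, (text, vec) in enumerate(zip(texts, corpus_vecs))
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" standing in for real model output.
texts = ["invoice overdue", "reset my password", "refund request"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
hits = top_k([1.0, 0.0], vectors, texts, k=2)
for hit in hits:
    print(hit["index"], hit["text"], round(hit["score"], 3))
```

Keeping the original position in `'index'` lets a caller map each hit back to the indexed example list after sorting.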
retrieve_by_class(class_name, novel_samples, existing_classes)
Retrieve examples relevant to a proposed class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `class_name` | `str` | Proposed class name | required |
| `novel_samples` | `list[Any]` | Novel samples to find examples for | required |
| `existing_classes` | `list[str]` | List of existing class names | required |

Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with retrieved examples and metadata |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
build_prompt(novel_samples, existing_classes, context=None, use_retrieval=True)
Build prompt for LLM class proposal with retrieval.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `novel_samples` | `list[Any]` | Novel samples to propose classes for | required |
| `existing_classes` | `list[str]` | List of existing class names | required |
| `context` | `str \| None` | Optional domain context | `None` |
| `use_retrieval` | `bool` | Whether to include retrieved examples | `True` |

Returns:
| Type | Description |
|---|---|
| `str` | Formatted prompt string |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
propose_classes(novel_samples, existing_classes, context=None)
Propose new classes with retrieval-augmented prompting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `novel_samples` | `list[Any]` | Novel samples to propose classes for | required |
| `existing_classes` | `list[str]` | List of existing class names | required |
| `context` | `str \| None` | Optional domain context | `None` |

Returns:
| Type | Description |
|---|---|
| `Any \| None` | NovelClassAnalysis from LLM or None if unavailable |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
BGERetriever(model_name='BAAI/bge-m3', device=None, batch_size=32)
BGE-M3 style dense retriever for examples.
Simple wrapper that uses sentence-transformers for dense retrieval of in-context examples.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Model name for sentence-transformers | `'BAAI/bge-m3'` |
| `device` | `str \| None` | Device to use ("cuda", "cpu", or None for auto) | `None` |
| `batch_size` | `int` | Batch size for encoding | `32` |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
Functions
encode(texts, batch_size=None)
Encode texts to embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | List of texts to encode | required |
| `batch_size` | `int \| None` | Override batch size | `None` |

Returns:
| Type | Description |
|---|---|
| `Any` | numpy array of embeddings (n, dim) |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
similarity(query_embeddings, corpus_embeddings)
Compute similarity between query and corpus.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query_embeddings` | `Any` | Query embeddings (n, dim) | required |
| `corpus_embeddings` | `Any` | Corpus embeddings (m, dim) | required |

Returns:
| Type | Description |
|---|---|
| `ndarray` | Similarity matrix (n, m) |
Source code in src/novelentitymatcher/novelty/proposal/retrieval.py
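The shape contract, (n, dim) queries against (m, dim) corpus rows giving an (n, m) matrix, can be checked with a dependency-free sketch. The page does not state which metric `similarity` uses, so cosine similarity is an assumption here, and `l2_normalize` and `similarity_matrix` are hypothetical helpers written with nested lists rather than numpy arrays.

```python
import math

Matrix = list[list[float]]


def l2_normalize(rows: Matrix) -> Matrix:
    """Scale each row to unit length; all-zero rows are returned unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm else list(row))
    return out


def similarity_matrix(queries: Matrix, corpus: Matrix) -> Matrix:
    """Return an (n, m) matrix of cosine similarities for n queries and m corpus rows."""
    q, c = l2_normalize(queries), l2_normalize(corpus)
    return [[sum(a * b for a, b in zip(q_row, c_row)) for c_row in c] for q_row in q]


# Two queries against three corpus vectors gives a 2 x 3 matrix.
sim = similarity_matrix(
    [[3.0, 4.0], [0.0, 2.0]],
    [[3.0, 4.0], [4.0, -3.0], [0.0, 1.0]],
)
print(len(sim), len(sim[0]))
```

Normalizing the rows first reduces cosine similarity to a plain dot product, which is the same trick a matrix multiplication over normalized embeddings relies on.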
novelentitymatcher.novelty.proposal.config
LLM API configuration with validation and environment variable support.
Provides Pydantic-based configuration for LLM timeouts, retries, and circuit breaker settings to ensure production-ready LLM integration.
Classes
LLMConfig
Bases: BaseSettings
LLM API configuration with production-ready defaults.
Supports environment variable overrides via LLM_* prefix.
Environment Variables
| Variable | Description | Default |
|---|---|---|
| `LLM_TIMEOUT` | Request timeout in seconds | `30` |
| `LLM_MAX_RETRIES` | Maximum retry attempts | `5` |
| `LLM_CIRCUIT_FAIL_MAX` | Consecutive failures before opening circuit | `3` |
| `LLM_CIRCUIT_RESET_SECONDS` | Circuit open duration | `60` |
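The override behaviour can be illustrated without Pydantic. The sketch below is a standard-library stand-in, not the real `LLMConfig`: the class name `LLMSettingsSketch`, its field names, and the choice of `float` for the two durations are assumptions; only the environment variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMSettingsSketch:
    """Illustrative stand-in for LLMConfig; field names are hypothetical."""

    timeout: float = 30.0
    max_retries: int = 5
    circuit_fail_max: int = 3
    circuit_reset_seconds: float = 60.0

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "LLMSettingsSketch":
        """Read LLM_* variables, falling back to the documented defaults."""
        env = os.environ if env is None else env
        return cls(
            timeout=float(env.get("LLM_TIMEOUT", cls.timeout)),
            max_retries=int(env.get("LLM_MAX_RETRIES", cls.max_retries)),
            circuit_fail_max=int(env.get("LLM_CIRCUIT_FAIL_MAX", cls.circuit_fail_max)),
            circuit_reset_seconds=float(
                env.get("LLM_CIRCUIT_RESET_SECONDS", cls.circuit_reset_seconds)
            ),
        )


# Pass a dict instead of the real environment to keep the example deterministic.
settings = LLMSettingsSketch.from_env({"LLM_TIMEOUT": "10", "LLM_MAX_RETRIES": "2"})
print(settings)
```

Variables that are not set keep their defaults, so a deployment only needs to export the values it wants to change.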
novelentitymatcher.novelty.proposal.schema_enforcement
Schema enforcement for LLM proposal outputs.
Provides retry-aware validation of LLM-generated proposals against Pydantic schemas, with structured error feedback for re-prompting.
Classes
ValidationResult(is_valid, parsed=None, errors=None)
Result of validating raw LLM output against a schema.
Source code in src/novelentitymatcher/novelty/proposal/schema_enforcement.py
SchemaEnforcer(max_retries=2, schema_model=None)
Validate and enforce Pydantic schemas on LLM outputs with retry logic.
Usage:

    enforcer = SchemaEnforcer(max_retries=2, schema_model=LLMProposalSchema)
    result = enforcer.enforce(raw_output, proposer_fn, context)
Source code in src/novelentitymatcher/novelty/proposal/schema_enforcement.py
Functions
validate(raw_output)
Validate raw LLM output against the configured Pydantic schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_output` | `dict[str, Any]` | Parsed JSON dict from LLM response. | required |

Returns:
| Type | Description |
|---|---|
| `ValidationResult` | ValidationResult with validity status and any errors. |
Source code in src/novelentitymatcher/novelty/proposal/schema_enforcement.py
enforce(raw_output, proposer_fn, context=None)
Validate with retry loop. On failure, re-prompt with error feedback.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `raw_output` | `dict[str, Any]` | Initial parsed LLM output to validate. | required |
| `proposer_fn` | `Callable[[str \| None], dict[str, Any]]` | Callable that takes an error feedback string and returns a new raw output dict from the LLM. | required |
| `context` | `dict[str, Any] \| None` | Optional context for error messages. | `None` |

Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Validated raw output dict (possibly from a retry). |
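The validate-then-re-prompt loop can be sketched in plain Python. This is not the `SchemaEnforcer` source: the hand-written `validate` check stands in for Pydantic validation, the required keys `name` and `description` are invented for the example, and raising `ValueError` once retries are exhausted is an assumption, since the page does not say what happens in that case.

```python
from typing import Any, Callable, Optional


def validate(raw: dict[str, Any]) -> list[str]:
    """Stand-in validator: return error strings (an empty list means valid)."""
    errors = []
    if not isinstance(raw.get("name"), str) or not raw.get("name"):
        errors.append("name: must be a non-empty string")
    if not isinstance(raw.get("description"), str):
        errors.append("description: must be a string")
    return errors


def enforce(
    raw: dict[str, Any],
    proposer_fn: Callable[[Optional[str]], dict[str, Any]],
    max_retries: int = 2,
) -> dict[str, Any]:
    """Validate raw; on failure, ask proposer_fn for a new output with error feedback."""
    errors = validate(raw)
    attempts = 0
    while errors and attempts < max_retries:
        feedback = "Fix these validation errors: " + "; ".join(errors)
        raw = proposer_fn(feedback)
        errors = validate(raw)
        attempts += 1
    if errors:
        # Assumed behaviour: give up loudly rather than return invalid output.
        raise ValueError(f"output still invalid after {attempts} retries: {errors}")
    return raw


feedback_log: list[Optional[str]] = []


def fake_proposer(feedback: Optional[str]) -> dict[str, Any]:
    """Hypothetical LLM call that returns a corrected proposal on the first retry."""
    feedback_log.append(feedback)
    return {"name": "Warranty Claim", "description": "Requests about product warranties."}


result = enforce({"name": ""}, fake_proposer)
print(result["name"], len(feedback_log))
```

Passing the concrete error strings back to the proposer is what makes the retry more than a blind resample: the next prompt can name the fields that failed.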