Troubleshooting
Related docs: quickstart.md | examples.md | experiments/index.md | index.md
This page covers common setup and first-run issues for the package and for the exploratory scripts and notebooks.
Import Errors (setfit, datasets, torch, sentence-transformers)
Symptoms:
- ImportError: setfit is required...
- ModuleNotFoundError for datasets, torch, or sentence_transformers
What to check:
- Install the project dependencies in the active environment.
- Make sure your Jupyter kernel matches the environment where the package is installed.
- Re-run from the repo root if using PYTHONPATH=.
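To confirm which interpreter is active and whether the packages named above can be found from it, a quick standard-library check can help. This is a minimal sketch; `check_imports` is a hypothetical helper, not part of the package.

```python
import importlib.util
import sys

# Import names of the dependencies listed in the symptoms above.
REQUIRED = ("setfit", "datasets", "torch", "sentence_transformers")


def check_imports(names=REQUIRED):
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The interpreter path shows which environment (or Jupyter kernel) is running.
    print("Interpreter:", sys.executable)
    for name, found in check_imports().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Running the same snippet in a terminal and in a notebook cell makes a kernel/environment mismatch visible, because the interpreter paths will differ.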
Slow First Run / Model Downloads
The first run often downloads model weights from Hugging Face, which can take time depending on model size and network speed.
What to expect:
- Small examples: may still pause while downloading
- Some advanced experiments/notebooks may require larger model downloads and slower startup
- Typical model sizes: 100-500MB
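To tell a genuine download apart from a slow load of already cached weights, you can measure the local Hugging Face cache. This is a minimal sketch, assuming the usual `huggingface_hub` cache location (`~/.cache/huggingface/hub`, overridable with `HF_HUB_CACHE` or `HF_HOME`); `hf_cache_dir` and `dir_size_mb` are hypothetical helpers.

```python
import os
from pathlib import Path


def hf_cache_dir(env=os.environ):
    """Resolve the Hugging Face hub cache directory (assumed defaults).

    Simplification: XDG_CACHE_HOME is ignored.
    """
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or Path.home() / ".cache" / "huggingface"
    return Path(hf_home) / "hub"


def dir_size_mb(path):
    """Total size in MB of regular files under path (0.0 if it does not exist)."""
    path = Path(path)
    if not path.is_dir():
        return 0.0
    # Symlinks are skipped so that linked snapshot files are not counted twice.
    total = sum(
        f.stat().st_size
        for f in path.rglob("*")
        if f.is_file() and not f.is_symlink()
    )
    return total / 1_000_000


if __name__ == "__main__":
    cache = hf_cache_dir()
    print(f"{cache}: {dir_size_mb(cache):.0f} MB cached")
```

If the reported size grows after a first run and stays constant afterwards, later slow starts are not caused by repeated downloads.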
CPU vs GPU Expectations
- CPU works for basic testing and small examples.
- SetFit training and large embedding models can be significantly slower on CPU.
- GPU is optional but helpful for the country classifier experiments and larger embedding experiments.
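To check which device PyTorch can actually use in the current environment, a short probe is enough. This is a minimal sketch using standard PyTorch availability checks; `detect_device` is a hypothetical helper and returns "cpu" when torch is not installed.

```python
def detect_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # without torch there is no accelerator to report
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists only in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print("Detected device:", detect_device())
```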
EmbeddingMatcher Error: Index Not Built
Symptom:
RuntimeError("Index not built. Call build_index() first.")
Fix: call matcher.build_index() before matching.
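The error message suggests a guard that refuses to search before an index exists. The stand-in class below mimics that behaviour so the required call order is easy to see. `ToyEmbeddingMatcher` is hypothetical: it uses `difflib` string similarity instead of embeddings and only mirrors the method name and error text mentioned on this page.

```python
from difflib import SequenceMatcher


class ToyEmbeddingMatcher:
    """Illustrative stand-in, NOT the library's EmbeddingMatcher."""

    def __init__(self, entities, threshold=0.7):
        self.entities = list(entities)
        self.threshold = threshold
        self._index = None  # populated by build_index()

    def build_index(self):
        # A real matcher would embed the entities here; we only lowercase them.
        self._index = [(name, name.lower()) for name in self.entities]

    def match(self, query):
        if self._index is None:
            raise RuntimeError("Index not built. Call build_index() first.")
        scored = [
            (SequenceMatcher(None, query.lower(), key).ratio(), name)
            for name, key in self._index
        ]
        score, name = max(scored)
        # Scores below the threshold are reported as "no match".
        return name if score >= self.threshold else None


matcher = ToyEmbeddingMatcher(["Germany", "France"])
matcher.build_index()  # skipping this line reproduces the RuntimeError
print(matcher.match("germany"))
```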
EntityMatcher Error: Model Not Trained
Symptom:
RuntimeError("Model not trained. Call train() first.")
Fix:
matcher = EntityMatcher(entities=entities)
matcher.train(training_data)
result = matcher.predict("query")
Low-Confidence Matches (Returns None)
Symptoms:
- match() or predict() returns None for valid queries
- Fewer matches than expected
Causes:
- Threshold too high: Confidence score below threshold
- Insufficient training data: Model hasn't learned variations
- Wrong matcher type: Using EmbeddingMatcher for complex variations
Solutions:
- Lower the threshold.
- Check confidence scores.
- Use EntityMatcher instead.
Threshold Tuning Guidance
If you're getting too many matches (low precision):
- Raise threshold: 0.7 → 0.8 or 0.9
- Use validation data to find optimal threshold
If you're getting too few matches (low recall):
- Lower threshold: 0.7 → 0.6 or 0.5
- Check if queries have typos or extreme variations
- Consider adding more training examples (EntityMatcher)
Recommended thresholds by use case:
- High precision (0.8-0.9): Database lookups, exact matching
- Balanced (0.7, default): General purpose
- High recall (0.5-0.6): Fuzzy search, data cleaning
Use the maintained Matcher examples under examples/ when tuning thresholds or comparing modes.
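One way to use validation data for threshold tuning is to sweep candidate thresholds and compare precision and recall at each. This is a minimal sketch, assuming you have already collected `(score, predicted, expected)` triples from your matcher, with `expected` set to `None` for queries that should not match. `sweep_thresholds` is a hypothetical helper and the sample scores are invented for illustration.

```python
def sweep_thresholds(records, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return {threshold: (precision, recall)} for scored validation records.

    Each record is (score, predicted_id, expected_id); expected_id is None
    when the query has no valid match.
    """
    should_match = sum(1 for _, _, e in records if e is not None)
    results = {}
    for t in thresholds:
        # Only predictions at or above the threshold are returned to the caller.
        accepted = [(p, e) for s, p, e in records if s >= t]
        correct = sum(1 for p, e in accepted if p == e)
        precision = correct / len(accepted) if accepted else 1.0
        recall = correct / should_match if should_match else 1.0
        results[t] = (precision, recall)
    return results


# Illustrative data only: (confidence, predicted entity, expected entity).
validation = [
    (0.95, "DE", "DE"),
    (0.82, "FR", "FR"),
    (0.66, "US", "US"),
    (0.61, "FR", "DE"),  # wrong match with moderate confidence
    (0.55, "US", None),  # query that should not match anything
]
for t, (p, r) in sweep_thresholds(validation).items():
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

With this toy data, raising the threshold from 0.5 to 0.7 lifts precision from 0.60 to 1.00 while recall falls from 0.75 to 0.50, which is the trade-off described above.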
Model Selection Issues
Problem: Model not working well for your language/domain
Solutions:
- Try multilingual models.
- Use domain-specific models.
Notebook Dependency Issues (jupyter, geograpy)
Jupyter
- Install Jupyter in the same environment as novelentitymatcher
- Launch from repo root to avoid path confusion
geograpy
- If you add a local geograpy notebook experiment, expect extra installs and dependency troubleshooting beyond the core project
Path Migration Note (notebook/ -> experiments/)
The old experiment script path notebook/... was moved into experiments/....
Updated examples:
- experiments/country_classifier/country_classifier.py
- experiments/country_classifier/country_classifier_quick.py
- experiments/country_classifier/country_classifier_advanced.py
Static Embedding Issues
model2vec Import Error
Symptom:
- ModuleNotFoundError: No module named 'model2vec'
Cause:
- Trying to use potion models without model2vec installed
Fix:
# Install model2vec
uv pip install model2vec
# Or with extras
uv pip install "novel-entity-matcher[static]"
MRL Model Loading Error
Symptom:
- Failed to load static embedding model
- AttributeError: 'StaticEmbedding' module not found
Cause:
- RikkaBotan MRL models require trust_remote_code=True
Fix:
from novelentitymatcher.backends.static_embedding import StaticEmbeddingBackend
# Automatically handled by Matcher
from novelentitymatcher import Matcher
matcher = Matcher(model="mrl-en") # Works correctly
MPS Fallback Warning (Apple Silicon)
Symptom:
- Warning about MPS fallback on Apple Silicon
Cause:
- RikkaBotan MRL models use operations not supported by MPS
Fix:
- Already handled automatically - library sets PYTORCH_ENABLE_MPS_FALLBACK=1
- Warning is informational, not an error
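If you want to confirm the setting, or apply it yourself in a script that uses PyTorch without going through the library, the variable is normally placed in the environment before `torch` is imported. This is a minimal sketch; `PYTORCH_ENABLE_MPS_FALLBACK` is a PyTorch environment variable, while `mps_fallback_enabled` is a hypothetical helper.

```python
import os

# setdefault keeps any value already exported in the shell.
# Do this before importing torch.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def mps_fallback_enabled(env=os.environ):
    """True when the CPU fallback for unsupported MPS operations is switched on."""
    return env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1"


print("MPS fallback enabled:", mps_fallback_enabled())
```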
Static Model Auto-Fallback
Symptom:
- Training with potion-8m uses mpnet instead
Cause:
- Static models don't support SetFit training
- Library auto-falls back to training-compatible model
Fix:
# Explicitly use training-compatible model
matcher = Matcher(model="mpnet") # Not potion-8m
matcher.fit(training_data, mode="full")
# Or accept the fallback
matcher = Matcher(model="potion-8m")
matcher.fit(training_data, mode="full") # Will use mpnet for training
See static-embeddings.md for more details.
Matcher Mode Issues
Auto-Detection Not Working as Expected
Symptom:
- Wrong mode selected by auto-detection
- Expected head-only but got full
Cause:
- Auto-detection counts examples per entity
- ≥ 3 examples per entity triggers full training
Fix:
# Check detected mode
matcher = Matcher(entities=entities, mode="auto")
matcher.fit(training_data)
print(matcher.get_training_info()["detected_mode"])
# Override mode explicitly
matcher = Matcher(entities=entities, mode="head-only")
matcher.fit(training_data)
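To see why auto-detection chose a mode, it can help to count the examples per entity yourself. This is a minimal sketch with several assumptions: training data is a list of dicts with a `label` key (the key name is a guess), the rule above is read as "every entity has at least 3 examples", and empty data maps to zero-shot. The library's exact rule may differ, and `guess_mode` is a hypothetical helper.

```python
from collections import Counter


def examples_per_entity(training_data, label_key="label"):
    """Count how many training examples each entity label has."""
    return Counter(item[label_key] for item in training_data)


def guess_mode(training_data, min_examples=3, label_key="label"):
    """Approximate the auto-detection rule (assumption, not the library's code)."""
    counts = examples_per_entity(training_data, label_key)
    if not counts:
        return "zero-shot"  # assumption: nothing to train on
    # Assumption: full training needs min_examples for every entity.
    return "full" if min(counts.values()) >= min_examples else "head-only"


sample = [
    {"text": "Deutschland", "label": "DE"},
    {"text": "Germany", "label": "DE"},
    {"text": "Allemagne", "label": "DE"},
    {"text": "Frankreich", "label": "FR"},
]
print(dict(examples_per_entity(sample)), guess_mode(sample))
```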
Training Data Required Error
Symptom:
- ValueError: training_data is required for modes 'head-only' and 'full'
Cause:
- Requested training mode without providing training data
Fix:
# Wrong
matcher = Matcher(mode="full")
matcher.fit() # Error!
# Right
matcher = Matcher(mode="full")
matcher.fit(training_data) # Provide training data
# Or use zero-shot for no training
matcher = Matcher(mode="zero-shot")
matcher.fit() # OK
Hybrid Mode Not Working
Symptom:
- Hybrid mode returns no results
- Very slow matching
Causes:
1. Blocking too aggressive: filters out all candidates
2. Dataset too small: hybrid is overkill for <10k entities
Fix:
# 1. Try different blocking strategy
from novelentitymatcher.core.blocking import NoOpBlocking
matcher = Matcher(
    entities=entities,
    mode="hybrid",
    blocking_strategy=NoOpBlocking(),  # No filtering
)
# 2. Increase blocking_top_k
result = matcher.match(
    "query",
    blocking_top_k=5000,  # More candidates
)
# 3. Use simpler mode for small datasets
matcher = Matcher(entities=entities, mode="zero-shot")
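The first cause can be illustrated without the library. A blocking stage keeps only the top candidates from a cheap filter, so if that filter is too strict the correct entity never reaches the scoring stage. The sketch below uses a hypothetical token-overlap blocker (`token_overlap` and `block` are invented for illustration and do not reflect the library's blocking strategies).

```python
def token_overlap(a, b):
    """Cheap blocking score: number of shared lowercase tokens."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def block(query, entities, top_k):
    """Keep the top_k entities that share at least one token with the query."""
    scored = [(token_overlap(query, e), e) for e in entities]
    kept = [pair for pair in scored if pair[0] > 0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in kept[:top_k]]


entities = ["United States of America", "United Kingdom", "United Arab Emirates"]

# "USA" shares no token with any entity, so blocking leaves nothing to score.
print(block("USA", entities, top_k=2))  # []
# A small top_k keeps only the best-overlapping candidate.
print(block("united emirates", entities, top_k=1))
```

This is why disabling the filter or raising `blocking_top_k` brings results back: more candidates survive to the slower, more accurate stage.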
Mode Not Supported Error
Symptom:
- ModeError: Invalid mode: 'invalid_mode'
Cause:
- Typos or invalid mode names
Valid modes:
- zero-shot
- head-only
- full
- hybrid
- auto
Fix:
# Check mode spelling
matcher = Matcher(entities=entities, mode="zero-shot") # Correct
# Not "zeroshot" or "Zero-Shot"
See matcher-modes.md for complete mode guide.
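If mode names come from user input or configuration files, validating them before constructing a matcher gives a clearer message than the error above. This is a minimal sketch using the valid modes listed on this page; `check_mode` is a hypothetical helper and raises a plain `ValueError` rather than the library's `ModeError`.

```python
from difflib import get_close_matches

VALID_MODES = ("zero-shot", "head-only", "full", "hybrid", "auto")


def check_mode(mode):
    """Return mode if valid, else raise ValueError with a spelling suggestion."""
    if mode in VALID_MODES:
        return mode
    # Compare case-insensitively so "Zero-Shot" still yields a suggestion.
    guess = get_close_matches(mode.lower(), VALID_MODES, n=1, cutoff=0.6)
    hint = f" Did you mean {guess[0]!r}?" if guess else ""
    raise ValueError(f"Invalid mode: {mode!r}.{hint}")


for candidate in ("zero-shot", "zeroshot"):
    try:
        print(check_mode(candidate))
    except ValueError as exc:
        print(exc)
```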
Performance Issues
Problem: Matching is too slow
Solutions:
- Use a faster model.
- Use batch processing.
- Reduce embedding dimension (Matryoshka).
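The idea behind the Matryoshka option is that embeddings trained this way stay useful when cut to their leading dimensions, provided the shortened vector is re-normalized before computing cosine similarity. The following is a pure-Python sketch of that general technique, not the library's API; `truncate_embedding` is a hypothetical helper and the vector is made up.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first dim components and rescale them to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # An all-zero prefix cannot be normalized, so it is returned unchanged.
    return head if norm == 0 else [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy 3-dimensional embedding
small = truncate_embedding(full, 2)
print(len(full), "->", len(small), small)
```

Shorter vectors reduce both memory use and similarity-search time, usually at some cost in accuracy, so it is worth re-checking match quality after truncating.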
Common Errors by Matcher
EmbeddingMatcher
| Error | Cause | Fix |
|---|---|---|
| Index not built | Didn't call build_index() | Call matcher.build_index() |
| Returns None | Threshold too high | Lower threshold |
| Slow matching | Large model or dataset | Use minilm model |
EntityMatcher
| Error | Cause | Fix |
|---|---|---|
| Model not trained | Didn't call train() | Call matcher.train(data) |
| Returns None | Threshold too high or insufficient training | Lower threshold or add training examples |
| Training slow | Large model or many epochs | Use minilm and fewer epochs |
HybridMatcher
| Error | Cause | Fix |
|---|---|---|
| Empty results | Blocking stage filters everything | Try NoOpBlocking() or increase blocking_top_k |
| Slow | Small dataset | Use EmbeddingMatcher instead |
Getting More Help
If you're still stuck:
- Check diagnostic tools.
- Review documentation:
  - quickstart.md - Basic usage
  - examples.md - Example catalog
  - models.md - Model selection guide
  - matcher-modes.md - Mode system
  - static-embeddings.md - Static embedding details
  - configuration.md - Configuration options
- Check diagnostics.
- Search issues: GitHub Issues
- Create an issue with:
  - Code snippet
  - Data sample (sanitized)
  - Error message
  - diagnose() output
  - get_training_info() output