Signal 4: Semantic Router with MLX Embeddings

How para-files matches documents to categories using AI embeddings.

How It Works

  1. Extract content from document (first 2000 characters)
  2. Convert to embedding - 768-dimensional vector via MLX
  3. Compare to utterances - Each route has utterance embeddings
  4. Calculate similarity - Cosine similarity score (0.0-1.0)
  5. Match if above threshold - Default: 0.75
Example:

Document: "electricity invoice from EDF"
    ↓ Embed (768 dims)
    ↓ Compare to utterances: ["electricity bill", "power usage", ...]
    ↓ Cosine similarity = 0.87
    ↓ Above threshold (0.75)? YES
    ↓ Match: factures-utilities (87% confidence)
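The scoring step above can be sketched in plain numpy. The 3-dimensional vectors here are toy stand-ins for the real 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim vectors standing in for the real 768-dim embeddings.
doc = np.array([0.9, 0.1, 0.2])
utterance = np.array([0.8, 0.2, 0.1])

score = cosine_similarity(doc, utterance)
matched = score >= 0.75  # default threshold
```

Identical vectors score 1.0; orthogonal ones score 0.0, which is why the threshold sits between them.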

The MLX Model

Model: nomic-embed-text-v1.5 (default)

Why this model?

  • Optimized for Apple Silicon via MLX
  • Small (~100MB)
  • Fast (~10-15ms per embedding)
  • Good quality (768 dimensions)
  • 8192 token context

Download: Automatic on first use, cached in ~/.cache/huggingface/

Utterances Are Key

Routes have utterances (semantic keywords):

routes:
  - name: factures-utilities
    path: "4_Archives/factures/_Utilities"
    utterances:
      - "electricity bill"
      - "power usage"
      - "energy invoice"
      - "water consumption"
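A route's score against a document is presumably the best similarity over its utterance embeddings. A minimal sketch, using a deterministic toy embedder in place of the MLX model (real embeddings place semantically similar texts close together; this toy one only scores identical texts highly):

```python
import zlib
import numpy as np

def toy_embed(text: str, dims: int = 8) -> np.ndarray:
    """Deterministic stand-in for the real 768-dim MLX embeddings."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=dims)
    return v / np.linalg.norm(v)

def route_score(doc_text: str, utterances: list[str]) -> float:
    """Best cosine similarity between the document and any utterance."""
    doc = toy_embed(doc_text)
    return max(float(doc @ toy_embed(u)) for u in utterances)

utterances = ["electricity bill", "power usage", "energy invoice"]

# An exact utterance scores 1.0; unrelated text scores near 0 with this toy embedder.
exact = route_score("electricity bill", utterances)
other = route_score("unrelated gardening note", utterances)
```

Taking the max (rather than the mean) means one well-chosen utterance is enough for a route to match.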

When you add utterances:

uv run para-files add-utterance factures-utilities "electrical consumption"

Adding an utterance doesn't retrain anything: the new phrase is embedded and joins the route's comparison set, so electricity-related documents score higher against that route.

Similarity Scoring

Cosine similarity ranges from 0.0 to 1.0 in practice (text embeddings rarely produce negative scores):

Score        Meaning
0.95+        Excellent match
0.85-0.95    Good match
0.75-0.85    Acceptable match
<0.75        Poor match (rejected)

Default threshold: 0.75
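The bands in the table map directly to a comparison chain. A hypothetical helper (not part of para-files) to illustrate:

```python
def describe(score: float) -> str:
    """Map a cosine similarity score to the quality bands in the table."""
    if score >= 0.95:
        return "excellent"
    if score >= 0.85:
        return "good"
    if score >= 0.75:
        return "acceptable"
    return "rejected"
```

For example, the 0.87 score from the earlier EDF invoice example lands in the "good" band.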

Configuring Semantic Matching

Adjust Sensitivity

# Lower = more matches, more false positives
export PARA_FILES_MLX_SCORE_THRESHOLD=0.70

# Higher = fewer matches, more to Inbox
export PARA_FILES_MLX_SCORE_THRESHOLD=0.85
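Internally, the threshold is presumably read from this environment variable with the documented 0.75 fallback; the lookup below is an illustrative sketch, not the actual para-files code:

```python
import os

# Fall back to the documented default when the variable is unset.
threshold = float(os.environ.get("PARA_FILES_MLX_SCORE_THRESHOLD", "0.75"))
```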

Add Better Utterances

# Instead of lowering threshold, add utterances
uv run para-files add-utterance route "more specific phrase"

Performance

  • First classification: ~30 seconds (model download + load)
  • Model embedding: ~10-15ms
  • Subsequent classifications: <1 second total

Model loads once and stays in memory.
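That load-once behavior matches a standard cached-loader pattern; a sketch with a stand-in loader (the real code would load the MLX model here):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    """First call pays the load cost; later calls reuse the cached instance."""
    return object()  # stand-in for loading nomic-embed-text-v1.5 via MLX
```

Every caller goes through `get_model()`, so only the first classification pays the ~30-second startup cost.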

Advantages

✓ Works without manual training
✓ Fast inference (MLX on Apple Silicon)
✓ Deterministic (same input = same output)
✓ Works with any language
✓ Content-based (doesn’t need company name)

Limitations

✗ Needs good utterances
✗ Generic documents may not match
✗ Needs 2000+ char content to work well
✗ Doesn’t understand domain context