# MLX Provider
The MLX provider enables on-device AI inference on Apple Silicon via the MLX framework.
## Components
### MLXEmbedder
Generates text embeddings via sentence-transformers for the Semantic Router (Signal 5).
- Default model: `nomic-ai/nomic-embed-text-v1.5`
- Supports task-specific prefixes (`search_query:`, `search_document:`)
- Batch encoding for efficient multi-email processing
- Cosine similarity computation for category matching
### MLXLLM
Text generation via `mlx-lm` for the classification fallback (Signal 6).
- Default model: `mlx-community/gemma-4-e4b-it-OptiQ-4bit`
- Uses `apply_chat_template` with `enable_thinking=False` for Gemma 4
- Strips residual `<|channel>thought...<channel|>` blocks as a safety net
- KV cache quantization (`kv_bits=8`) with graceful fallback
- Cached sampler and generate function for reduced per-call overhead
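The thought-stripping safety net can be sketched as a single regex pass. The marker strings are assumed verbatim from the bullet above; the exact tokens a given model emits may differ.

```python
import re

# Assumed marker format, taken from the docs above.
THOUGHT_BLOCK = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)

def strip_thoughts(text: str) -> str:
    """Safety net: remove any residual thought blocks from generated text."""
    return THOUGHT_BLOCK.sub("", text).strip()
```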
## Lazy Loading
Both classes load lazily; models are downloaded and loaded only on first use:
```python
embedder = MLXEmbedder()  # No model loaded yet
embedder.encode("text")   # Model loaded here on first call
```
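The lazy-loading pattern itself can be sketched like this, with a stand-in for the actual model download (the class and attribute names here are illustrative, not the project's API):

```python
class LazyEmbedder:
    """Sketch of lazy loading: the model is created on first use,
    not at construction time."""

    load_count = 0  # tracks how many times the (stand-in) load runs

    def __init__(self, model_name: str = "nomic-ai/nomic-embed-text-v1.5"):
        self.model_name = model_name
        self._model = None  # no model loaded yet

    def _ensure_loaded(self):
        if self._model is None:
            LazyEmbedder.load_count += 1
            self._model = object()  # stand-in for downloading/loading weights
        return self._model

    def encode(self, text: str):
        self._ensure_loaded()  # model loaded here on first call
        return [float(len(text))]  # placeholder embedding
```

Repeated `encode` calls reuse the already-loaded model, so the download cost is paid at most once.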
## Configuration
Models are configured in `config.toml`:
```toml
[mlx]
enabled = true
embedding_model = "nomic-ai/nomic-embed-text-v1.5"
llm_model = "mlx-community/gemma-4-e4b-it-OptiQ-4bit"
llm_confidence = 0.85
llm_max_tokens = 128
llm_temperature = 0.2
```
## Performance Optimizations
- Prompt prefix caching: the static category list (~600-900 tokens) is built once and reused
- Batch embeddings: `route_batch()` encodes all pending emails in a single call
- Cached sampler: reused across calls when the temperature matches
- KV cache quantization: reduces memory usage (when supported by the model architecture)
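The cached-sampler idea can be sketched by memoizing the sampler constructor on temperature with `functools.lru_cache`. The greedy inner function is a stand-in; the real `mlx-lm` sampler constructor differs.

```python
from functools import lru_cache

@lru_cache(maxsize=8)
def get_sampler(temperature: float):
    """Return a sampler for this temperature, reusing a cached one
    when the temperature matches a previous call."""
    def sampler(logits):
        # Greedy stand-in for an actual temperature-based sampler.
        return max(range(len(logits)), key=logits.__getitem__)
    return sampler
```

Because `lru_cache` keys on the arguments, two calls with the same temperature return the same sampler object, avoiding per-call construction overhead.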