# ADR-002: Switch MLX LLM from Mistral 7B to Gemma 4 E4B

## Status

Accepted (implemented)

## Date

2026-04-12

## Context
The MLX LLM fallback (Signal 6) used `mlx-community/Mistral-7B-Instruct-v0.3-4bit` (7B params, 4-bit quantized). While functional, we evaluated whether newer models could improve:

- Structured JSON output reliability (`{category, confidence, reason}`)
- French language classification quality (prompts are in French)
- Inference speed on Apple Silicon
- Memory efficiency
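Since structured-output reliability is the first criterion, a small validator on the consumer side makes "reliability" measurable: any response that fails to parse into the `{category, confidence, reason}` shape counts as a failure. A minimal sketch (the function name and error handling are illustrative, not the project's actual code):

```python
import json


def parse_classification(raw: str) -> dict:
    """Parse and validate a {category, confidence, reason} JSON response.

    Raises ValueError on malformed output so the caller can fall back to a
    default category instead of mis-tagging the email.
    """
    data = json.loads(raw)
    if not isinstance(data.get("category"), str):
        raise ValueError("missing or non-string 'category'")
    confidence = float(data.get("confidence", -1))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("'confidence' must be in [0, 1]")
    return {
        "category": data["category"],
        "confidence": confidence,
        "reason": str(data.get("reason", "")),
    }


print(parse_classification(
    '{"category": "newsletter", "confidence": 0.92, "reason": "bulk sender"}'
))
```

Counting `ValueError`s per N classifications gives a direct failure-rate metric for comparing the two models.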
Google released Gemma 4 (April 2026) with E4B, 26B MoE, and 31B dense variants, all Apache 2.0 licensed with 140+ language support.
## Decision

Switch to `mlx-community/gemma-4-e4b-it-OptiQ-4bit` (Gemma 4 E4B, 4-bit quantized).
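In config terms, the change is a one-line swap in the `[mlx]` table (a sketch using the `llm_model` and `llm_max_tokens` keys named elsewhere in this ADR; other keys omitted):

```toml
[mlx]
llm_model = "mlx-community/gemma-4-e4b-it-OptiQ-4bit"
llm_max_tokens = 128
```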
### Why E4B over other Gemma 4 variants
| Variant | Active Params | RAM (4-bit) | Fit |
|---|---|---|---|
| E2B | ~2B | ~2 GB | Too small for classification quality |
| E4B | ~4B | ~4 GB | Best balance: similar RAM to Mistral 7B, native JSON |
| 26B MoE | ~3.8B active | ~18 GB | Overkill for email classification |
| 31B Dense | 31B | ~20 GB | Too large for this use case |
## Consequences

### Positive
- Native structured JSON + function-calling support (fewer parsing failures)
- Stronger multilingual/French performance (82% MMLU multilingual)
- Similar or lower RAM footprint (~4 GB vs ~4-5 GB for Mistral 7B)
- Faster inference (fewer effective parameters)
### Negative

- Gemma 4 defaults to "thinking mode" (`<|channel>thought...`), which consumed the entire token budget before producing any JSON output
### Thinking mode fix

Two-part solution based on the official Gemma 4 documentation:

- **Prevention:** pass `enable_thinking=False` to `tokenizer.apply_chat_template()` — prevents the `<|think|>` system token from being included
- **Safety net:** strip residual `<|channel>thought...<channel|>` blocks from responses before JSON parsing
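The safety-net half can be a single regex pass over the response. A minimal sketch, treating the exact marker strings (`<|channel>thought` / `<channel|>`) as assumptions to be matched against what the tokenizer actually emits:

```python
import re

# Residual thought block, as described in this ADR. DOTALL lets the
# thought span multiple lines; non-greedy stops at the first close marker.
_THOUGHT_BLOCK = re.compile(r"<\|channel>thought.*?<channel\|>", re.DOTALL)


def strip_thinking(response: str) -> str:
    """Remove residual thought blocks so only the JSON payload remains."""
    return _THOUGHT_BLOCK.sub("", response).strip()


# Prevention side (sketch) happens earlier, at template time:
#   prompt = tokenizer.apply_chat_template(messages, enable_thinking=False, ...)

raw = '<|channel>thought The user wants a label... <channel|>{"category": "work"}'
print(strip_thinking(raw))  # → {"category": "work"}
```

Stripping before parsing means a response still passes even if the prevention flag is lost in a future refactor.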
### Other changes

- `llm_max_tokens` reduced from 256 → 128 (a JSON classification is under 80 tokens; this halves worst-case generation time)
- Config loader refactored: `_dataclass_from_dict()` eliminates duplicated defaults between `config.toml`, the `MLXConfig` dataclass, and `MLXLLM.__init__`
## Implementation

- `config.toml` — `[mlx]` `llm_model` updated
- `src/mailtag/config.py` — dataclass default updated, loader refactored
- `src/mailtag/mlx_provider.py` — `enable_thinking=False`, response stripping, updated default
- `tests/test_mlx_provider.py` — default model assertion updated