Architecture Overview
Understanding how para-files classifies documents.
The 4-Signal Pipeline (v2.0)
Para-files tries 4 classification signals in order. The first confident match wins.
flowchart TD
A[Input File] --> B{Signal 1: Rules Engine<br/>95% confidence}
B -->|Match| Z[Classification Result]
B -->|No Match| C{Signal 2: Book Detector<br/>92% confidence}
C -->|Match| Z
C -->|No Match| D{Signal 3: Taxonomy Classifier<br/>90% confidence}
D -->|Match| Z
D -->|No Match| E{Signal 4: MLX-LLM Fallback<br/>60% confidence}
E -->|Match| Z
E -->|No Match| G[Inbox Fallback<br/>0_Inbox/]
G --> Z
Signal Details (v2.0)
| Signal | Confidence | What It Does | Data Source |
|---|---|---|---|
| 1. Rules Engine | 95% | Matches extensions/patterns | personal_file_tree.yaml |
| 2. Book Detector | 92% | Detects books via ISBN + Thema | thema.json hierarchy |
| 3. Taxonomy Classifier | 90% | Matches keywords + issuers | documents.json |
| 4. MLX-LLM Fallback | 60% | Native MLX-LM inference | In-process (no Ollama) |
Removed in v2.0
| Signal | Replaced By |
|---|---|
| Validated DB | Taxonomy Classifier (issuers) |
| Domain KB | Taxonomy Classifier (issuers) |
| Semantic Router | Taxonomy Classifier (keywords) |
| LLM Fallback (Ollama) | MLX-LLM Fallback (native) |
How to Improve Matching
Choose based on your situation:
- Photo/video routing → Add extension patterns to
personal_file_tree.yaml - From known companies → Add issuer to
documents.json - Document type detection → Add keywords to
documents.json - Technical books → Automatic via ISBN lookup + Thema classification
Book Path Format (Thema Hybrid Naming)
Books use the THEMA v1.6 international classification with hybrid naming:
3_Resources/livres/{L1_Code}_{ShortName}/{L2_Code}_{ShortName}
Example: 3_Resources/livres/U_Informatique/UB_Programmation
| Raw Thema Description | Hybrid Folder Name |
|---|---|
Informatique et traitement de l'information | U_Informatique |
Informatique : logiciels et programmation | UB_Programmation |
Arts : généralités | AB_Generalites |
Rules applied:
- Max 2 hierarchy levels after
livres/ - Accents removed (é→e, ç→c)
- Colons: take part after (
:→ specific term) - Slashes: take first part (
/→ general term) - Invalid filesystem chars replaced with
_
Component Architecture
graph LR
A["CLI<br/>main.py"]
B["ClassificationPipeline<br/>Orchestrator"]
C["4 Signals<br/>Classifiers"]
D["Taxonomies<br/>JSON Loaders"]
E["Reference Tree<br/>YAML Config"]
A --> B
B --> C
B --> D
B --> E
C --> D
Data Flow
File Input
↓
Extract Metadata (filename, content, dates)
↓
Try Signal 1 → 2 → 3 → 4
↓
Return Result (category + confidence + source)
↓
Action (classify, move, etc.)
Key Technologies
- MLX-LM - Native LLM inference on Apple Silicon (replaces Ollama)
- JSON Taxonomies -
documents.json(issuers + keywords),thema.json(books) - YAML Reference Tree - Routing rules by extension/pattern
- Pydantic - Type-safe taxonomy models
Next Steps
Learn about each signal:
- Signal 1: Rules Engine - Extension/pattern matching
- Signal 2: Book Detector - ISBN + Thema classification
- Signal 3: Taxonomy Classifier - Keywords + issuers
- Signal 4: MLX-LLM Fallback - Native AI inference