# Developer Guide

Complete onboarding guide for new contributors to para-files.
## Prerequisites
- macOS with Apple Silicon (M1/M2/M3/M4) - Required
- Python 3.12+
- uv package manager
- Git
## Environment Setup

### Step 1: Clone and Install
```shell
# Clone the repository
git clone https://github.com/fjacquet/para-files.git
cd para-files

# Install all dependencies including dev tools
uv sync --all-extras

# Verify installation
uv run para-files --version
```
### Step 2: Set Up Development Environment
```shell
# Create a test PARA folder structure
mkdir -p /tmp/test-para/{0_Inbox,1_Projects,2_Areas,3_Resources,4_Archives}

# Set environment variable for testing
export PARA_FILES_PARA_ROOT="/tmp/test-para"

# Install pre-commit hooks
pre-commit install
```
### Step 3: Verify Everything Works
```shell
# Run the test suite
uv run pytest -v

# Run linter
uv run ruff check src/ tests/

# Run type checker
uv run mypy src/

# Test CLI
uv run para-files --help
```
## Codebase Structure
```text
para-files/
├── src/para_files/              # Main source code
│   ├── __init__.py
│   ├── main.py                  # CLI entry point (Typer app)
│   ├── config.py                # Configuration with pydantic-settings
│   ├── pipeline.py              # 6-signal classification orchestrator
│   ├── reference_tree.py        # YAML reference tree loader
│   ├── types.py                 # Pydantic data models
│   ├── mover.py                 # File move/copy operations
│   ├── classifiers/             # Classification signal implementations
│   │   ├── validated_db.py      # Signal 1: Manual mappings
│   │   ├── rules_engine.py      # Signal 2: Glob patterns
│   │   ├── book_detector.py     # Signal 2.5: Book detection
│   │   ├── domain_kb.py         # Signal 3: Known issuers
│   │   ├── semantic_router.py   # Signal 4: MLX embeddings
│   │   └── llm_fallback.py      # Signal 5: LLM fallback
│   ├── encoders/
│   │   └── mlx_encoder.py       # MLX embedding encoder
│   ├── taxonomies/
│   │   ├── models.py            # Thema/document taxonomy models
│   │   └── loader.py            # JSON taxonomy loaders
│   └── utils/
│       ├── filename_sanitizer.py  # Centralized filename sanitization
│       ├── thema_lookup.py        # Thema code lookup service
│       └── ...                    # Other utilities
├── tests/                       # Test suite
├── config/
│   ├── personal_file_tree.yaml  # PARA reference tree (routing rules)
│   ├── documents.json           # Document taxonomy (issuers + keywords)
│   └── thema.json               # Thema v1.6 book classification (9,187 codes)
├── docs/                        # Documentation (Jekyll/GitHub Pages)
├── pyproject.toml               # Project configuration
├── README.md                    # User documentation
├── CLAUDE.md                    # AI assistant instructions
└── CHANGELOG.md                 # Version history
```
## Key Files to Understand First
### 1. `src/para_files/types.py` - Data Models

Contains all the Pydantic models used throughout the codebase:

- `ClassificationResult` - Result of classifying a file
- `ConfidenceLevel` - Confidence and source of a classification
- `Route` - A destination route in the PARA structure
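The rough shape of these models, sketched with stdlib dataclasses for brevity (the real versions are Pydantic models, and any field not named above is an assumption):

```python
from __future__ import annotations

from dataclasses import dataclass

# Illustrative sketch only -- the real models in src/para_files/types.py are
# Pydantic models; field names beyond those mentioned above are assumptions.

@dataclass
class ConfidenceLevel:
    value: float   # confidence score in [0.0, 1.0]
    source: str    # which signal produced the classification

@dataclass
class Route:
    category: str  # destination path, e.g. "2_Areas/finances/factures"

@dataclass
class ClassificationResult:
    matched: bool
    category: str | None = None
    confidence: ConfidenceLevel | None = None
```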
### 2. `src/para_files/pipeline.py` - Core Logic

The `ClassificationPipeline` class orchestrates the 6-signal classification:
```python
# Simplified flow
class ClassificationPipeline:
    def classify_file(self, path: Path) -> ClassificationResult:
        # Try each signal in order (first match wins)
        for signal in [validated_db, rules_engine, book_detector,
                       domain_kb, semantic_router, llm_fallback]:
            result = signal.classify(path)
            if result.matched:
                return result
        return ClassificationResult.unclassified()
```
### 3. `src/para_files/main.py` - CLI Commands

All CLI commands are defined here using Typer:

- `classify` - Classify files
- `move` - Classify and move files
- `scan` - Preview directory classifications
- `clean` - Remove junk files
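A minimal sketch of how one such Typer command might be declared; the option names, help text, and body here are assumptions for illustration, not the project's actual code:

```python
import typer

# Hypothetical sketch of the CLI shape -- see src/para_files/main.py
# for the real commands and options.
app = typer.Typer(help="Illustrative sketch of the para-files CLI.")

@app.command()
def classify(
    path: str,
    verbose: bool = typer.Option(False, "--verbose", "-v", help="Enable verbose logging"),
) -> None:
    """Classify a file and print the proposed route (stand-in body)."""
    typer.echo(f"classify: {path} (verbose={verbose})")

if __name__ == "__main__":
    app()
```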
## Development Workflow

### Making Changes
1. Create a branch:

   ```shell
   git checkout -b feature/my-feature
   ```

2. Make changes following the code style rules below.

3. Run quality checks:

   ```shell
   uv run ruff check src/ tests/
   uv run ruff format src/ tests/
   uv run mypy src/
   uv run pytest -v
   ```

4. Update documentation:
   - Add an entry to CHANGELOG.md under `[Unreleased]`
   - Update README.md if the CLI changes
   - Add docstrings to new public functions

5. Commit and push:

   ```shell
   git add .
   git commit -m "feat: add my feature"
   git push -u origin feature/my-feature
   ```
### Code Style Rules

| Rule | Enforcement |
|---|---|
| Line length: 100 chars | Ruff |
| Type hints required | mypy (strict) |
| `from __future__ import annotations` | All modules |
| Docstrings for public functions | Convention |
| No `print()` - use `logging` | Ruff (T201) |
### Testing Guidelines

- Tests go in the `tests/` directory
- Match test file names to modules: `test_<module>.py`
- Use pytest fixtures for common setup
- Mock external services (MLX, filesystem)
- Aim for 80%+ coverage
```python
# Example test (assumes `pipeline` and `mock_encoder` fixtures are defined)
def test_classify_pdf(tmp_path: Path, mock_encoder: MockEncoder,
                      pipeline: ClassificationPipeline) -> None:
    """Test PDF classification returns correct route."""
    pdf = tmp_path / "invoice.pdf"
    pdf.write_bytes(b"%PDF-1.4...")

    result = pipeline.classify_file(pdf)

    assert result.category == "2_Areas/finances/factures"
    assert result.confidence.value >= 0.75
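One way to satisfy the "mock external services" rule is a deterministic fake encoder behind a fixture. This `MockEncoder` is a hypothetical helper, not the project's actual test code:

```python
from __future__ import annotations

import hashlib

import pytest

class MockEncoder:
    """Deterministic stand-in for MLXEncoder (hypothetical test helper).

    Returns fixed-size pseudo-embeddings derived from a hash of the text,
    so tests never download or run the real MLX model.
    """

    dim = 8

    def __call__(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode()).digest()
            # Map the first `dim` digest bytes to floats in [0, 1)
            vectors.append([b / 256 for b in digest[: self.dim]])
        return vectors

@pytest.fixture
def mock_encoder() -> MockEncoder:
    return MockEncoder()
```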
## Architecture Deep Dive

### The 6-Signal Pipeline
```mermaid
flowchart TB
    file["Input File"] --> extract["Extract Content/Metadata"]
    extract --> s1{"1. Validated DB<br/>100% confidence"}
    s1 -->|match| done["Classification Result"]
    s1 -->|no match| s2
    s2{"2. Rules Engine<br/>95% confidence"}
    s2 -->|match| done
    s2 -->|no match| s3
    s3{"2.5 Book Detector<br/>92% confidence"}
    s3 -->|match| done
    s3 -->|no match| s4
    s4{"3. Domain KB<br/>90% confidence"}
    s4 -->|match| done
    s4 -->|no match| s5
    s5{"4. Semantic Router<br/>85% confidence"}
    s5 -->|match| done
    s5 -->|no match| s6
    s6{"5. LLM Fallback<br/>Variable confidence"}
    s6 -->|match| done
    s6 -->|no match| unclassified["Unclassified"]
```
### MLX Embeddings

The semantic router uses MLX for fast local embeddings:
```python
from para_files.encoders import MLXEncoder

# Lazy loading - model downloads on first use
encoder = MLXEncoder(model_name="mlx-community/nomic-embed-text-v1.5")

# Encode text (10-15ms latency)
embeddings = encoder(["invoice from electric company"])

# Compare with cosine similarity
similarity = cosine_similarity(embeddings, route_embeddings)
```
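The `cosine_similarity` call above refers to whatever helper the project provides. For reference, a dependency-free sketch of the computation for a single pair of vectors (the function name and scope here are assumptions):

```python
from __future__ import annotations

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Illustrative stdlib sketch; the project may well use numpy or
    mlx operations instead.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)
```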
## Adding a New Classifier

When adding a new classifier:

1. Create a new file in `src/para_files/classifiers/`
2. Inherit from `BaseClassifier`
3. Implement the `classify()` method
4. Define the `source`, `name`, and `default_confidence` properties
5. Add it to the pipeline in `ClassificationPipeline.__init__`
6. Write comprehensive tests
Example:

```python
from pathlib import Path

from para_files.classifiers.base import BaseClassifier
from para_files.types import ClassificationResult, ConfidenceLevel

class MyClassifier(BaseClassifier):
    @property
    def source(self) -> str:
        return "my_classifier"

    @property
    def name(self) -> str:
        return "My Classifier"

    @property
    def default_confidence(self) -> float:
        return 0.88

    def classify(self, file_path: Path) -> ClassificationResult | None:
        # Your classification logic here
        pass
```
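Step 5 (registering the signal in the pipeline) might look roughly like the following self-contained sketch; the attribute and method names are assumptions, so check the real `ClassificationPipeline.__init__` for the actual wiring and signal ordering:

```python
from __future__ import annotations

# Hypothetical sketch of wiring a new signal into the pipeline.
# Names like `self.signals` are assumptions, not the project's code.

class MyClassifier:
    def classify(self, path: str) -> str | None:
        # Stand-in logic; the real method returns ClassificationResult | None
        return "2_Areas/example" if path.endswith(".pdf") else None

class ClassificationPipeline:
    def __init__(self) -> None:
        self.signals = [
            # validated_db, rules_engine, book_detector, domain_kb,
            # semantic_router, llm_fallback would surround the new signal
            MyClassifier(),  # inserted at its desired priority position
        ]

    def classify_file(self, path: str) -> str | None:
        # First matching signal wins, as in the real pipeline
        for signal in self.signals:
            result = signal.classify(path)
            if result is not None:
                return result
        return None
```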
## Debugging Tips

### Enable Verbose Logging

```shell
uv run para-files classify document.pdf -v
```

### Test Specific Routes

```shell
uv run para-files test-route factures-energie --file invoice.pdf -v
```

### Inspect Reference Tree

```shell
uv run para-files tree --validate
uv run para-files routes --utterances
uv run para-files issuers
```

### Check Configuration

```shell
uv run para-files config --show
```
## Common Issues

### Issue: MLX Import Error

```text
ImportError: No module named 'mlx'
```

Solution: Ensure you're on an Apple Silicon Mac. MLX only works on M1/M2/M3/M4 chips.

### Issue: Tests Fail with Missing PARA_ROOT

Solution: Set the environment variable:

```shell
export PARA_FILES_PARA_ROOT="/tmp/test-para"
```

### Issue: Type Errors

Solution: Run mypy and fix the annotations it reports:

```shell
uv run mypy src/ --show-error-codes
```
## Documentation Guidelines
When making changes, update documentation:
| Change Type | Documentation Update |
|---|---|
| New feature | CHANGELOG.md, README.md |
| Bug fix | CHANGELOG.md |
| CLI change | README.md, CHANGELOG.md |
| Architecture | docs/architecture.md, CHANGELOG.md |
| Config option | README.md, CHANGELOG.md |
## Getting Help
- Check existing issues on GitHub
- Read the test files for usage examples
- Review CLAUDE.md for conventions
- Ask in discussions/issues