Developer Guide

Complete onboarding guide for new contributors to para-files.

Prerequisites

  • macOS with Apple Silicon (M1/M2/M3/M4) - Required
  • Python 3.12+
  • uv package manager
  • Git

Environment Setup

Step 1: Clone and Install

# Clone the repository
git clone https://github.com/fjacquet/para-files.git
cd para-files

# Install all dependencies including dev tools
uv sync --all-extras

# Verify installation
uv run para-files --version

Step 2: Set Up Development Environment

# Create a test PARA folder structure
mkdir -p /tmp/test-para/{0_Inbox,1_Projects,2_Areas,3_Resources,4_Archives}

# Set environment variable for testing
export PARA_FILES_PARA_ROOT="/tmp/test-para"

# Install pre-commit hooks
pre-commit install

Step 3: Verify Everything Works

# Run the test suite
uv run pytest -v

# Run linter
uv run ruff check src/ tests/

# Run type checker
uv run mypy src/

# Test CLI
uv run para-files --help

Codebase Structure

para-files/
├── src/para_files/           # Main source code
│   ├── __init__.py
│   ├── main.py               # CLI entry point (Typer app)
│   ├── config.py             # Configuration with pydantic-settings
│   ├── pipeline.py           # 6-signal classification orchestrator
│   ├── reference_tree.py     # YAML reference tree loader
│   ├── types.py              # Pydantic data models
│   ├── mover.py              # File move/copy operations
│   ├── classifiers/          # Classification signal implementations
│   │   ├── validated_db.py   # Signal 1: Manual mappings
│   │   ├── rules_engine.py   # Signal 2: Glob patterns
│   │   ├── book_detector.py  # Signal 2.5: Book detection
│   │   ├── domain_kb.py      # Signal 3: Known issuers
│   │   ├── semantic_router.py # Signal 4: MLX embeddings
│   │   └── llm_fallback.py   # Signal 5: LLM fallback
│   ├── encoders/
│   │   └── mlx_encoder.py    # MLX embedding encoder
│   ├── taxonomies/
│   │   ├── models.py         # Thema/document taxonomy models
│   │   └── loader.py         # JSON taxonomy loaders
│   └── utils/
│       ├── filename_sanitizer.py  # Centralized filename sanitization
│       ├── thema_lookup.py   # Thema code lookup service
│       └── ...               # Other utilities
├── tests/                    # Test suite
├── config/
│   ├── personal_file_tree.yaml  # PARA reference tree (routing rules)
│   ├── documents.json        # Document taxonomy (issuers + keywords)
│   └── thema.json            # Thema v1.6 book classification (9,187 codes)
├── docs/                     # Documentation (Jekyll/GitHub Pages)
├── pyproject.toml            # Project configuration
├── README.md                 # User documentation
├── CLAUDE.md                 # AI assistant instructions
└── CHANGELOG.md              # Version history

Key Files to Understand First

1. src/para_files/types.py - Data Models

Contains all Pydantic models used throughout:

  • ClassificationResult - Result of classifying a file
  • ConfidenceLevel - Confidence and source of classification
  • Route - A destination route in the PARA structure
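
A minimal sketch of what these models might look like (field names here are illustrative assumptions, not the actual definitions — check types.py for the real ones):

```python
from __future__ import annotations

from pydantic import BaseModel


class ConfidenceLevel(BaseModel):
    """Confidence score and the signal that produced it."""

    value: float  # 0.0 - 1.0
    source: str   # e.g. "rules_engine"


class Route(BaseModel):
    """A destination route in the PARA structure."""

    name: str
    path: str  # e.g. "2_Areas/finances/factures"


class ClassificationResult(BaseModel):
    """Result of classifying a single file."""

    category: str
    confidence: ConfidenceLevel
    matched: bool = True
```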

2. src/para_files/pipeline.py - Core Logic

The ClassificationPipeline class orchestrates the 6-signal classification:

# Simplified flow
class ClassificationPipeline:
    def classify_file(self, path: Path) -> ClassificationResult:
        # Try each signal in order (first match wins)
        for signal in [validated_db, rules_engine, book_detector,
                       domain_kb, semantic_router, llm_fallback]:
            result = signal.classify(path)
            if result.matched:
                return result
        return ClassificationResult.unclassified()

3. src/para_files/main.py - CLI Commands

All CLI commands are defined here using Typer:

  • classify - Classify files
  • move - Classify and move files
  • scan - Preview directory classifications
  • clean - Remove junk files
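
A Typer command in this style is declared roughly like the following sketch (the option names and body are illustrative, not copied from main.py):

```python
from __future__ import annotations

from pathlib import Path

import typer

app = typer.Typer()


@app.command()
def classify(
    path: Path,
    verbose: bool = typer.Option(False, "--verbose", "-v"),
) -> None:
    """Classify a file and print its destination route."""
    # Real implementation would call the ClassificationPipeline here.
    ...
```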

Development Workflow

Making Changes

  1. Create a branch

    git checkout -b feature/my-feature
    
  2. Make changes following code style

  3. Run quality checks

    uv run ruff check src/ tests/
    uv run ruff format src/ tests/
    uv run mypy src/
    uv run pytest -v
    
  4. Update documentation
    • Add entry to CHANGELOG.md under [Unreleased]
    • Update README.md if CLI changes
    • Add docstrings to new public functions
  5. Commit and push

    git add .
    git commit -m "feat: add my feature"
    git push -u origin feature/my-feature
    

Code Style Rules

| Rule | Enforcement |
|------|-------------|
| Line length: 100 chars | Ruff |
| Type hints required | mypy (strict) |
| from __future__ import annotations | All modules |
| Docstrings for public functions | Convention |
| No print() - use logging | Ruff (T201) |
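
A module skeleton that satisfies all of these rules at once might look like this (the function itself is a made-up example):

```python
"""Example module following the project style rules."""

from __future__ import annotations  # required in every module

import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def count_pdfs(directory: Path) -> int:
    """Return the number of PDF files directly inside *directory*."""
    n = sum(1 for p in directory.iterdir() if p.suffix.lower() == ".pdf")
    logger.debug("Found %d PDFs in %s", n, directory)  # logging, never print()
    return n
```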

Testing Guidelines

  • Tests go in tests/ directory
  • Match test file names: test_<module>.py
  • Use pytest fixtures for common setup
  • Mock external services (MLX, filesystem)
  • Aim for 80%+ coverage

# Example test
def test_classify_pdf(
    tmp_path: Path, pipeline: ClassificationPipeline, mock_encoder: MockEncoder
):
    """Test PDF classification returns correct route."""
    pdf = tmp_path / "invoice.pdf"
    pdf.write_bytes(b"%PDF-1.4...")

    result = pipeline.classify_file(pdf)

    assert result.category == "2_Areas/finances/factures"
    assert result.confidence.value >= 0.75
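
The mock_encoder fixture used above could be defined in tests/conftest.py along these lines (a sketch under the assumption that the encoder is callable on a list of strings; the real fixture may differ):

```python
from __future__ import annotations

import numpy as np
import pytest


class MockEncoder:
    """Stand-in for MLXEncoder that returns deterministic vectors."""

    def __call__(self, texts: list[str]) -> np.ndarray:
        # One fixed-size pseudo-embedding per input text; no model load.
        return np.ones((len(texts), 8), dtype=np.float32)


@pytest.fixture()
def mock_encoder() -> MockEncoder:
    return MockEncoder()
```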

Architecture Deep Dive

The 6-Signal Pipeline

flowchart TB
    file["Input File"] --> extract["Extract Content/Metadata"]

    extract --> s1{"1. Validated DB<br/>100% confidence"}
    s1 -->|match| done["Classification Result"]
    s1 -->|no match| s2

    s2{"2. Rules Engine<br/>95% confidence"}
    s2 -->|match| done
    s2 -->|no match| s3

    s3{"2.5 Book Detector<br/>92% confidence"}
    s3 -->|match| done
    s3 -->|no match| s4

    s4{"3. Domain KB<br/>90% confidence"}
    s4 -->|match| done
    s4 -->|no match| s5

    s5{"4. Semantic Router<br/>85% confidence"}
    s5 -->|match| done
    s5 -->|no match| s6

    s6{"5. LLM Fallback<br/>Variable confidence"}
    s6 -->|match| done
    s6 -->|no match| unclassified["Unclassified"]

MLX Embeddings

The semantic router uses MLX for fast local embeddings:

from para_files.encoders import MLXEncoder

# Lazy loading - model downloads on first use
encoder = MLXEncoder(model_name="mlx-community/nomic-embed-text-v1.5")

# Encode text (10-15ms latency)
embeddings = encoder(["invoice from electric company"])

# Compare with cosine similarity
similarity = cosine_similarity(embeddings, route_embeddings)
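
The cosine_similarity helper above is not shown; a minimal NumPy version could look like this (assuming embeddings arrive as 2-D arrays, one row per text):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T  # shape: (len(a), len(b))
```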

Adding a New Classifier

When adding new classifiers:

  1. Create a new file in src/para_files/classifiers/
  2. Inherit from BaseClassifier
  3. Implement classify() method
  4. Define source, name, and default_confidence properties
  5. Add to pipeline in ClassificationPipeline.__init__
  6. Write comprehensive tests

Example:

from pathlib import Path

from para_files.classifiers.base import BaseClassifier
from para_files.types import ClassificationResult, ConfidenceLevel

class MyClassifier(BaseClassifier):
    @property
    def source(self) -> str:
        return "my_classifier"

    @property
    def name(self) -> str:
        return "My Classifier"

    @property
    def default_confidence(self) -> float:
        return 0.88

    def classify(self, file_path: Path) -> ClassificationResult | None:
        # Your classification logic here.
        # Return a ClassificationResult on a match, or None to fall
        # through to the next signal in the pipeline.
        return None
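
Step 5 (registering the classifier in the pipeline) can be sketched like this; the attribute name signals and the use of strings are assumptions for illustration — the real ClassificationPipeline.__init__ holds classifier instances:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClassificationPipeline:
    """Sketch: signals run in order, so position encodes priority."""

    signals: list[str] = field(default_factory=lambda: [
        "validated_db",     # Signal 1 (1.00)
        "rules_engine",     # Signal 2 (0.95)
        "book_detector",    # Signal 2.5 (0.92)
        "domain_kb",        # Signal 3 (0.90)
        "my_classifier",    # New signal, slotted by confidence (0.88)
        "semantic_router",  # Signal 4 (0.85)
        "llm_fallback",     # Signal 5 (variable)
    ])
```

The slot matters: a classifier with 0.88 default confidence belongs between Domain KB (0.90) and Semantic Router (0.85) so that higher-confidence signals always get first try.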

Debugging Tips

Enable Verbose Logging

uv run para-files classify document.pdf -v

Test Specific Routes

uv run para-files test-route factures-energie --file invoice.pdf -v

Inspect Reference Tree

uv run para-files tree --validate
uv run para-files routes --utterances
uv run para-files issuers

Check Configuration

uv run para-files config --show

Common Issues

Issue: MLX Import Error

ImportError: No module named 'mlx'

Solution: Ensure you are on an Apple Silicon Mac; MLX only runs on Apple Silicon (M1/M2/M3/M4).

Issue: Tests Fail with Missing PARA_ROOT

Solution: Set environment variable:

export PARA_FILES_PARA_ROOT="/tmp/test-para"

Issue: Type Errors

Solution: Run mypy and fix annotations:

uv run mypy src/ --show-error-codes

Documentation Guidelines

When making changes, update documentation:

| Change Type | Documentation Update |
|-------------|----------------------|
| New feature | CHANGELOG.md, README.md |
| Bug fix | CHANGELOG.md |
| CLI change | README.md, CHANGELOG.md |
| Architecture | docs/architecture.md, CHANGELOG.md |
| Config option | README.md, CHANGELOG.md |

Getting Help

  • Check existing issues on GitHub
  • Read the test files for usage examples
  • Review CLAUDE.md for conventions
  • Ask in discussions/issues