ADR-004: DataFrame as Pipeline Data Format

Status: Accepted Date: 2026-02-18

Context

The pipeline processes VM data through ingestion, classification, and calculation stages. Two options: pass DataFrames or VMRecord dataclass instances.

Decision

Use pandas DataFrame as the primary data format throughout the pipeline.

Rationale

DataFrames are natural for batch operations (5000+ VMs)
Vectorized operations are faster than row-by-row dataclass conversion
pandas integrates well with openpyxl (read) and NiceGUI AG Grid (display)
Classification adds columns to DataFrame without schema changes

Consequences

VMRecord dataclass exists but is used for typed access where needed, not as pipeline currency
Type safety is weaker than dataclass (column names are strings)
Must maintain canonical column schema consistency across all parsers