Skip to content

ADR-004: DataFrame as Pipeline Data Format

Status: Accepted Date: 2026-02-18

Context

The pipeline processes VM data through ingestion, classification, and calculation stages. Two options: pass DataFrames or VMRecord dataclass instances.

Decision

Use pandas DataFrame as the primary data format throughout the pipeline.

Rationale

  • DataFrames are natural for batch operations (5000+ VMs)
  • Vectorized operations are faster than row-by-row dataclass conversion
  • pandas integrates well with openpyxl (read) and NiceGUI AG Grid (display)
  • Classification adds columns to DataFrame without schema changes

Consequences

  • VMRecord dataclass exists but is used for typed access where needed, not as pipeline currency
  • Type safety is weaker than dataclass (column names are strings)
  • Must maintain canonical column schema consistency across all parsers