Phase 2: File Ingestion Pipeline - Research

Researched: 2026-02-18 Domain: Excel/CSV parsing, format detection, data normalization, pandas DataFrame operations Confidence: HIGH

Summary

Phase 2 implements parsers for RVTools (.xlsx) and LiveOptics (.xlsx/.csv) files, auto-detects the format, normalizes data into a common DataFrame schema, and filters template VMs. The existing codebase already has the VMRecord dataclass and FileFormat enum from Phase 1, plus pandas and openpyxl in the dependency stack.

Key discovery from inspecting the actual sample files: RVTools has 70 columns in vInfo (we need 8), LiveOptics has 38 columns in VMs (we need ~8). RVTools labels columns as "MB" but values are actually MiB (base-2), matching LiveOptics which explicitly uses "MiB". This means NO unit conversion is needed between formats -- both are already in MiB despite different labeling. The Template column in RVTools is a boolean (True/False/NaN), not a string "True" as the requirements suggest.

The project already has two sample files (samples/rvtools.xlsx with 24 VMs, samples/live-optics.xlsx with 610 VMs) plus a second LiveOptics file in samples/xxx-IT_02_16_2026/. Both LiveOptics files have identical column schemas, confirming the format is stable.

Primary recommendation: Build three parser functions (one per format) that return pd.DataFrame with a canonical schema, plus a format detector that dispatches to the correct parser. Use simple alias dictionaries for column fuzzy matching (no external library needed). Return errors as custom exceptions with user-friendly messages.

Standard Stack

Core (already installed)

Library	Version	Purpose	Why Standard
pandas	>=2.2,<4.0	DataFrame creation, CSV reading	Already in project, handles all tabular ops
openpyxl	>=3.1.2	XLSX reading engine for pandas	Already in project, required by `pd.read_excel`

Supporting

Library	Version	Purpose	When to Use
pytest	>=8.0	Testing parsers	All parser tests

No New Dependencies Needed

Column "fuzzy matching" as described in FR-1.7 does not require fuzzywuzzy, rapidfuzz, or any NLP library. The actual need is alias mapping: known column name variations (e.g., "Provisioned MiB" vs "Provisioned MB") resolved via a dictionary. This is a 10-line solution, not a library problem.

Architecture Patterns

Recommended Module Structure

src/store_predict/
├── pipeline/
│   ├── __init__.py
│   ├── models.py           # Already exists (VMRecord, FileFormat)
│   ├── ingestion.py         # Format detection + orchestration
│   ├── parsers/
│   │   ├── __init__.py
│   │   ├── rvtools.py       # RVTools .xlsx parser
│   │   ├── liveoptics.py    # LiveOptics .xlsx and .csv parser
│   │   └── columns.py       # Column alias maps + resolver
│   └── errors.py            # Custom exceptions

Pattern 1: Parser Function Signature

What: Each parser returns a normalized DataFrame, never VMRecord objects directly. VMRecord conversion happens downstream if needed.

Why: DataFrames are the natural unit for batch operations (filtering, calculations). Converting to/from dataclasses row-by-row is wasteful for 5000+ VMs.

# Each parser module exposes one public function:
def parse_rvtools(path: Path) -> pd.DataFrame:
    """Parse RVTools xlsx vInfo tab into normalized DataFrame.

    Returns DataFrame with canonical columns:
        vm_name, os_name, provisioned_mib, in_use_mib,
        datacenter, cluster, is_template, is_powered_on, source_format

    Raises:
        IngestionError: If file cannot be parsed or required columns missing.
    """

def parse_liveoptics_xlsx(path: Path) -> pd.DataFrame:
    """Parse LiveOptics xlsx VMs tab into normalized DataFrame."""

def parse_liveoptics_csv(path: Path) -> pd.DataFrame:
    """Parse LiveOptics CSV into normalized DataFrame."""

Pattern 2: Format Detection

What: Detect format by inspecting file extension + internal structure (sheet names for xlsx, column headers for csv).

def detect_format(path: Path) -> FileFormat:
    """Auto-detect file format.

    Strategy:
    1. CSV extension -> check headers -> LiveOptics CSV or error
    2. XLSX extension -> check sheet names:
       - Has 'vInfo' sheet -> RVTools
       - Has 'VMs' sheet -> LiveOptics XLSX
       - Neither -> IngestionError
    """

Evidence from sample files:

RVTools sheets: ['vInfo', 'vCPU', 'vMemory', 'vDisk', ...] (25 sheets)
LiveOptics sheets: ['Details', 'ESX Hosts', ..., 'VMs', ...] (11 sheets)
Sheet name overlap: None between the two formats. Detection is unambiguous.

Pattern 3: Column Resolution with Aliases

What: Map known column name variations to canonical names using a dictionary.

# columns.py
RVTOOLS_ALIASES: dict[str, list[str]] = {
    "vm_name": ["VM", "VM Name"],
    "powerstate": ["Powerstate", "Power State"],
    "is_template": ["Template"],
    "os_name": ["OS according to the VMware Tools", "OS according to the configuration file"],
    "provisioned_mib": ["Provisioned MB", "Provisioned MiB"],
    "in_use_mib": ["In Use MB", "In Use MiB"],
    "datacenter": ["Datacenter"],
    "cluster": ["Cluster"],
}

LIVEOPTICS_ALIASES: dict[str, list[str]] = {
    "vm_name": ["VM Name"],
    "os_name": ["VM OS"],
    "provisioned_mib": ["Virtual Disk Size (MiB)"],
    "in_use_mib": ["Guest VM Disk Used (MiB)"],
    "guest_capacity_mib": ["Guest VM Disk Capacity (MiB)"],
    "is_template": ["Template"],
    "powerstate": ["Power State"],
    "datacenter": ["Datacenter"],
    "cluster": ["Cluster"],
}

def resolve_columns(df: pd.DataFrame, aliases: dict[str, list[str]]) -> dict[str, str]:
    """Map canonical names to actual column names found in the DataFrame.

    Returns: {canonical_name: actual_column_name}
    Raises IngestionError if required columns are not found.
    """

Pattern 4: Canonical DataFrame Schema

What: All parsers produce a DataFrame with identical columns regardless of source format.

CANONICAL_COLUMNS = [
    "vm_name",          # str
    "os_name",          # str
    "provisioned_mib",  # float64
    "in_use_mib",       # float64
    "datacenter",       # str (empty string if unavailable)
    "cluster",          # str (empty string if unavailable)
    "is_template",      # bool
    "is_powered_on",    # bool
    "source_format",    # str (FileFormat.value)
]

Pattern 5: Ingestion Orchestrator

What: Single entry point that combines detection + parsing + filtering.

def ingest_file(path: Path) -> pd.DataFrame:
    """Main entry point: detect format, parse, filter templates, return normalized DataFrame.

    Steps:
    1. Validate file exists and has valid extension
    2. Detect format
    3. Parse with appropriate parser
    4. Filter out template VMs (is_template == True)
    5. Return cleaned DataFrame
    """

Anti-Patterns to Avoid

Row-by-row processing: Never iterate rows to create VMRecord objects during ingestion. Use vectorized pandas operations. VMRecord is for downstream typed access if needed, not for parsing.
Reading entire workbook: Use pd.read_excel(path, sheet_name='vInfo') which reads only the specified sheet, not all 25 sheets.
String comparison for booleans: Template column in RVTools is already boolean dtype, not string. Don't do df['Template'] == 'True' -- use df['Template'] == True or df['Template'].fillna(False).astype(bool).
Hardcoded column indices: Always use column names, never positional indexing. Column order varies between RVTools versions.

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
XLSX parsing	Custom XML reader	`pd.read_excel(engine='openpyxl')`	Handles encoding, data types, large files
CSV parsing	Custom delimiter logic	`pd.read_csv()`	Handles quoting, encoding, malformed rows
Column matching	Levenshtein distance	Alias dictionary lookup	Known set of variations, not arbitrary fuzzy match
File validation	Manual magic byte checking	Extension check + sheet name inspection	Sufficient for the known format set

Key insight: The "fuzzy matching" requirement (FR-1.7) sounds like it needs NLP but is really just an alias table. RVTools and LiveOptics have well-known column schemas with minor variations between versions. A dictionary of known aliases covers all real-world cases.

Common Pitfalls

Pitfall 1: RVTools Template Column Contains NaN

What goes wrong: Filtering df[df['Template'] == False] drops rows where Template is NaN (which are non-template VMs that just have missing data). Why it happens: RVTools exports sometimes have NaN for Template on certain VMs (seen in our sample: 4 out of 24 VMs have NaN). How to avoid: Use df['Template'].fillna(False) before filtering, or filter with df['Template'] != True. Warning signs: VM count after filtering is lower than expected.

Pitfall 2: RVTools "MB" Values Are Actually MiB

What goes wrong: Applying MB-to-MiB conversion factor (dividing by 1.048576) when not needed, understating capacity. Why it happens: Column headers say "MB" but VMware/RVTools uses base-2 (MiB) values despite the label. How to avoid: Pass through values as-is. Both RVTools "MB" and LiveOptics "MiB" are already in MiB. Evidence: RVTools documentation confirms values are MiB despite "MB" label. Also confirmed by VMware documentation.

Pitfall 3: LiveOptics Provisioned Capacity Ambiguity

What goes wrong: Using wrong column for "provisioned" capacity. Why it happens: LiveOptics has three capacity columns: Virtual Disk Size (MiB) (provisioned virtual disk), Guest VM Disk Capacity (MiB) (guest-visible capacity), Guest VM Disk Used (MiB) (actual used). How to avoid: Map Virtual Disk Size (MiB) to provisioned_mib and Guest VM Disk Used (MiB) to in_use_mib. The Guest VM Disk Capacity (MiB) is informational but not used in the canonical schema.

Pitfall 4: Encoding Issues in VM Names

What goes wrong: French characters (accents, cedilla) in VM names get corrupted. Why it happens: CSV files may use Windows-1252 encoding instead of UTF-8. How to avoid: Try UTF-8 first, fall back to latin-1/cp1252 for CSV parsing. XLSX files handle encoding internally via openpyxl.

Pitfall 5: Large File Performance

What goes wrong: Reading 5000-VM xlsx files takes too long or uses too much memory. Why it happens: Reading all columns when only 8 are needed; or reading all sheets. How to avoid: Use usecols parameter in pd.read_excel() to read only needed columns. Only read the target sheet. For very large files, openpyxl's read_only=True mode is automatically used by pandas.

Pitfall 6: Column Name Trailing Spaces

What goes wrong: Column matching fails silently. Why it happens: Some RVTools exports have trailing spaces in column headers (observed: 'Free % ' in vPartition). This could also affect vInfo columns. How to avoid: Strip whitespace from all column names: df.columns = df.columns.str.strip().

Code Examples

Reading RVTools vInfo Sheet

# Source: verified against actual samples/rvtools.xlsx
import pandas as pd
from pathlib import Path

def parse_rvtools(path: Path) -> pd.DataFrame:
    # Read only the vInfo sheet
    df = pd.read_excel(path, sheet_name="vInfo", engine="openpyxl")

    # Strip whitespace from column names
    df.columns = df.columns.str.strip()

    # Resolve columns via aliases
    col_map = resolve_columns(df, RVTOOLS_ALIASES)

    # Select and rename to canonical names
    result = df[[col_map[c] for c in REQUIRED_COLUMNS]].copy()
    result.columns = pd.Index(list(REQUIRED_COLUMNS))

    # Normalize types
    result["is_template"] = result["is_template"].fillna(False).astype(bool)
    result["is_powered_on"] = result["powerstate"].str.lower() == "poweredon"
    result["source_format"] = FileFormat.RVTOOLS.value

    # Fill missing string columns
    for col in ["datacenter", "cluster", "os_name"]:
        result[col] = result[col].fillna("")

    return result

Reading LiveOptics XLSX

# Source: verified against actual samples/live-optics.xlsx
def parse_liveoptics_xlsx(path: Path) -> pd.DataFrame:
    df = pd.read_excel(path, sheet_name="VMs", engine="openpyxl")
    df.columns = df.columns.str.strip()

    col_map = resolve_columns(df, LIVEOPTICS_ALIASES)

    result = pd.DataFrame({
        "vm_name": df[col_map["vm_name"]],
        "os_name": df[col_map["os_name"]].fillna(""),
        "provisioned_mib": pd.to_numeric(df[col_map["provisioned_mib"]], errors="coerce").fillna(0.0),
        "in_use_mib": pd.to_numeric(df[col_map["in_use_mib"]], errors="coerce").fillna(0.0),
        "datacenter": df[col_map["datacenter"]].fillna(""),
        "cluster": df[col_map["cluster"]].fillna(""),
        "is_template": df[col_map["is_template"]].fillna(False).astype(bool),
        "is_powered_on": df[col_map["powerstate"]].str.lower() == "poweredon",
        "source_format": FileFormat.LIVEOPTICS_XLSX.value,
    })
    return result

Reading LiveOptics CSV

def parse_liveoptics_csv(path: Path) -> pd.DataFrame:
    # Try UTF-8 first, fall back to latin-1
    for encoding in ("utf-8", "latin-1"):
        try:
            df = pd.read_csv(path, encoding=encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise IngestionError("Cannot decode CSV file. Unsupported encoding.")

    df.columns = df.columns.str.strip()
    # Same column resolution and normalization as XLSX
    # ...

Format Detection

def detect_format(path: Path) -> FileFormat:
    suffix = path.suffix.lower()

    if suffix == ".csv":
        # Validate it has LiveOptics columns
        df_head = pd.read_csv(path, nrows=0)
        df_head.columns = df_head.columns.str.strip()
        if "VM Name" in df_head.columns or "VM OS" in df_head.columns:
            return FileFormat.LIVEOPTICS_CSV
        raise IngestionError(
            "CSV file does not match LiveOptics format. "
            "Expected columns: VM Name, VM OS, etc."
        )

    if suffix == ".xlsx":
        import openpyxl
        wb = openpyxl.load_workbook(path, read_only=True)
        sheets = wb.sheetnames
        wb.close()

        if "vInfo" in sheets:
            return FileFormat.RVTOOLS
        if "VMs" in sheets:
            return FileFormat.LIVEOPTICS_XLSX
        raise IngestionError(
            f"XLSX file has unrecognized sheets: {sheets[:5]}. "
            "Expected 'vInfo' (RVTools) or 'VMs' (LiveOptics)."
        )

    raise IngestionError(
        f"Unsupported file type: {suffix}. "
        "Please upload .xlsx (RVTools or LiveOptics) or .csv (LiveOptics)."
    )

Custom Exception

class IngestionError(Exception):
    """Raised when file ingestion fails with a user-friendly message."""

    def __init__(self, message: str, *, details: str = "") -> None:
        self.message = message  # User-facing
        self.details = details  # Developer-facing (NOT logged to UI)
        super().__init__(message)

Actual Column Data from Sample Files

RVTools vInfo (samples/rvtools.xlsx)

Relevant columns from the 70 available:

Column Name	Dtype	Sample Value	Maps To
VM	str	RSL-DC-01	vm_name
Powerstate	str	poweredOn / poweredOff / NaN	is_powered_on
Template	bool	True / False / NaN	is_template
OS according to the VMware Tools	str	Microsoft Windows Server 2012 (64-bit)	os_name
Provisioned MB	int64	51227	provisioned_mib (no conversion)
In Use MB	int64	22497	in_use_mib (no conversion)
Datacenter	str	RSL Cluster	datacenter
Cluster	str	RSL	cluster

Total VMs in sample: 24
Templates: 2 (True), 4 NaN, 18 False
Powerstate: poweredOn, poweredOff, NaN

LiveOptics VMs (samples/live-optics.xlsx)

Relevant columns from the 38 available:

Column Name	Dtype	Sample Value	Maps To
VM Name	str	3000-GPO-TEST	vm_name
VM OS	str	Microsoft Windows 10 (64-bit)	os_name
Virtual Disk Size (MiB)	int64	61440	provisioned_mib
Guest VM Disk Used (MiB)	int64	32978	in_use_mib
Guest VM Disk Capacity (MiB)	int64	60811	(informational, not mapped)
Template	bool	True / False	is_template
Power State	str	poweredOn / poweredOff	is_powered_on
Datacenter	str	vxr-dc-win	datacenter
Cluster	str	vxr-clu-win	cluster

Total VMs in sample: 610
No NaN in key columns (VM Name, VM OS, capacity columns)

State of the Art

Old Approach	Current Approach	When Changed	Impact
`xlrd` for xlsx	`openpyxl` for xlsx	xlrd 2.0 (2020)	xlrd dropped xlsx support; openpyxl is required
pandas `read_excel` without engine	pandas with `engine='openpyxl'` explicit	pandas 1.2+	Best practice to be explicit about engine
Manual CSV encoding detection	`chardet` or try/except cascade	Ongoing	Simple try/except is sufficient for known formats

No deprecated approaches in current stack. pandas 2.2+ and openpyxl 3.1+ are current and stable.

Testing Strategy

Per project convention: NO unittest.mock. Use real sample files as fixtures.

Test Fixtures Needed

# conftest.py additions
@pytest.fixture
def rvtools_path() -> Path:
    return Path(__file__).parent.parent / "samples" / "rvtools.xlsx"

@pytest.fixture
def liveoptics_xlsx_path() -> Path:
    return Path(__file__).parent.parent / "samples" / "live-optics.xlsx"

# For CSV testing, export a CSV from the LiveOptics xlsx or create a minimal fixture

Key Test Cases

Format detection: RVTools xlsx detected correctly, LiveOptics xlsx detected, CSV detected, unknown rejected
RVTools parsing: Correct column extraction, template filtering, NaN Template handling, Powerstate mapping
LiveOptics parsing: Correct column extraction, template handling
Schema consistency: All parsers produce identical column sets
Edge cases: Empty file, wrong extension, missing required columns, file with extra columns
Column alias resolution: Known variations resolved, unknown columns raise clear error

CSV Test Data

No LiveOptics CSV sample file exists in the repo. Options:

Export from the xlsx sample during test setup (using pandas)
Create a small hand-written CSV fixture in tests/fixtures/

Recommendation: Create a small CSV fixture file (5-10 rows) rather than depending on runtime xlsx-to-csv conversion.

Open Questions

LiveOptics CSV column separator
What we know: LiveOptics xlsx has standard columns. CSV likely uses comma separator.
What's unclear: No sample CSV file exists to verify delimiter and encoding.
Recommendation: Default to comma, add encoding fallback, verify with first real CSV upload.
RVTools version variations
What we know: Current sample has exact column names documented above.
What's unclear: Older RVTools versions may use slightly different column names.
Recommendation: The alias dictionary handles this. Start with known names, add aliases as real-world files surface variations.
Whether to keep VMRecord or use DataFrame throughout
What we know: VMRecord dataclass exists from Phase 1. DataFrames are more efficient for batch operations.
What's unclear: Whether downstream phases (classification, UI) expect VMRecord objects or DataFrames.
Recommendation: Ingestion returns DataFrame. Provide a utility df_to_records() if downstream needs VMRecord. Don't convert eagerly.

Sources

Primary (HIGH confidence)

Actual sample files inspected with openpyxl and pandas:
samples/rvtools.xlsx - 24 VMs, 70 columns in vInfo, Template is boolean with NaN
samples/live-optics.xlsx - 610 VMs, 38 columns in VMs, clean data
samples/xxx-IT_02_16_2026/LiveOptics_3221684_VMWARE_02_12_2026.xlsx - identical schema to main sample
Existing codebase: pipeline/models.py (VMRecord, FileFormat), config.py (paths), conftest.py (fixtures)
pandas documentation: pd.read_excel(), pd.read_csv() API verified from training data (stable API)

Secondary (MEDIUM confidence)

RVTools sizing workshop docs - confirmed MB values are actually MiB
VMware/RVTools unit documentation - confirmed base-2 units despite "MB" labels

Tertiary (LOW confidence)

None. All findings verified against actual sample files.

Metadata

Confidence breakdown:

Standard stack: HIGH - already installed and verified in Phase 1
Architecture: HIGH - patterns derived from actual file inspection, not speculation
Pitfalls: HIGH - NaN Template and MB/MiB issues verified against real sample data
Column schemas: HIGH - extracted directly from sample files with code

Research date: 2026-02-18 Valid until: 2026-04-18 (stable domain, file formats don't change frequently)