Phase 7: integration tests against the cross-version baselines

Master plan: PLAN-measure.md · Previous phase: PLAN-measure-phase-06-baselines.md

Status: Not started

Mission

Wire tests/test_measure.py (8 smoke tests from phase 4, 13 -o tests from phase 5) up to the cross-version baselines that phase 6 committed in instar-testdata. After phase 7, every safe-tier image in the manifest plus every curated --size case is checked against its corresponding qemu-img measure baseline, and the vmdk / vpc / vhdx target formats (which qemu-img doesn't measure) get round-trip coverage that asserts convert output size lies in the [required, fully_allocated] range.

Phase 7 closes the loop: phase 4 says "the CLI works for the cases I tested"; phase 7 says "the CLI matches qemu-img on every case we record a baseline for, on every supported qemu version".

Why this is its own phase

Phase 6 stored bytes on disk. Phase 7 turns those bytes into assertions. They're separable because:

Phase 6 work is in the instar-testdata repo and the long-running tail is regenerating baselines whenever the matrix expands. Phase 7 work is in the instar repo and does not touch the testdata side.
Phase 6 needed scripting that runs once per matrix-refresh; phase 7 needs Python test infrastructure that runs on every CI invocation.
The fan-out is large (~200 test cases). Bundling phase 6's data generation with the test code would mean one giant commit that's hard to review.

Architecture

Baseline lookup

The existing tests/base.py already wraps the expected-outputs/<output-type>/profiles/<profile>/<image-id>.stdout.txt layout via get_output_profiles() and get_expected_output(). Phase 7 reuses both, with one small extension:

# tests/base.py
COMMAND_OUTPUT_DIRS = {
    'info': 'qemu-img',
    'check': 'check',
    'compare': 'compare',
    'measure': 'measure',        # NEW — phase 7
}

The measure-specific peculiarity is the _size/ pseudo-bucket in the raw layout. But for the profile layout (expected-outputs/measure-json/profiles/profile-NN/), files are named by image_id exactly like every other command — detect-profiles.py already produced this naming during phase 6. So get_expected_output('1G-qcow2-default', 'profile-10-0-0', output_type='json', command='measure') returns the right file with no additional helper needed.

For the source-image cases, the image_id in the profile bucket has the __<target> suffix already baked in (verified in phase 6): e.g. cirros-qcow2__qcow2.stdout.txt. Tests pass this composed id to get_expected_output().
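
A minimal sketch of how the composed id maps onto the profile layout described above. Both helper names are hypothetical (they are not part of tests/base.py), and the measure-<type> directory naming is taken from the paths quoted in this plan rather than verified against the code:

```python
from pathlib import PurePosixPath


def source_image_id(manifest_id: str, target: str) -> str:
    """Compose the baseline id for a source-image case (hypothetical helper)."""
    return f"{manifest_id}__{target}"


def measure_baseline_relpath(image_id: str, profile: str, output_type: str) -> PurePosixPath:
    """Relative path of a measure baseline inside the testdata checkout.

    Assumes the layout expected-outputs/measure-<type>/profiles/<profile>/.
    """
    return PurePosixPath(
        "expected-outputs",
        f"measure-{output_type}",
        "profiles",
        profile,
        f"{image_id}.stdout.txt",
    )


image_id = source_image_id("cirros-qcow2", "qcow2")
relpath = measure_baseline_relpath(image_id, "profile-NN", "json")
```

In the real tests get_expected_output() does this resolution; the sketch only makes the naming rule explicit.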

Three test surfaces

Surface 1: `--size` mode baseline comparison

For each of the 21 SIZE_CASES (the same list phase 6's generator iterates), and each output type (human / json):

Translate the case's size_str / target_format / options_list into instar's CLI flags.
Run instar measure --size <SIZE> -O <TARGET> <flags> --output=<format>.
Look up the baseline: read get_output_profiles('json', 'measure')['profiles'] to pick the profile for the installed qemu version (or a representative one), then call get_expected_output(<case-name>, profile, output_type=<format>, command='measure'), for example with case name '1G-qcow2-cs-64k' and output_type='json'.
Assert instar_output == baseline exactly.
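
The translation step can be sketched as follows. The tuple shape follows the (case_name, size_str, target, options_list) entry described in step 7b; the function name and the example option string are illustrative assumptions, not copied from the generator:

```python
from typing import List, Tuple

# (case_name, size_str, target, options_list), as described in step 7b.
SizeCase = Tuple[str, str, str, List[str]]


def args_for_case(case: SizeCase, output_type: str) -> List[str]:
    """Build the argument list that follows the `instar` binary name."""
    _case_name, size_str, target, options = case
    args = ["measure", "--size", size_str, "-O", target]
    if options:
        # Several options collapse into one comma-separated -o value.
        args += ["-o", ",".join(options)]
    args.append(f"--output={output_type}")
    return args


# Hypothetical entry; the real option string lives in the SIZE_CASES list.
example_case: SizeCase = ("1G-qcow2-cs-64k", "1G", "qcow2", ["cluster_size=64k"])
example_args = args_for_case(example_case, "json")
```

Keeping the translation in one pure function means the cross-check test and the generated tests share a single definition of what each case runs.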

Skip cases where the baseline has non-zero exit (qemu-img rejected the option on that profile's representative version). Read the meta.json in the raw bucket (expected-outputs/measure-json/_size/<version>/<case-name>.meta.json) to determine exit code; skip if non-zero.
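
A sketch of the skip decision, assuming meta.json is a JSON object carrying the return_code field named in step 7b. The helper name is hypothetical, and treating a missing field as success is an assumption:

```python
import json
import tempfile
from pathlib import Path


def baseline_failed(meta_path: Path) -> bool:
    """True when the recorded qemu-img run exited non-zero."""
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Assumption: a missing return_code is treated as success.
    return meta.get("return_code", 0) != 0


# Self-contained demonstration against throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    rejected_meta = Path(tmp) / "rejected.meta.json"
    rejected_meta.write_text(json.dumps({"return_code": 1}), encoding="utf-8")
    accepted_meta = Path(tmp) / "accepted.meta.json"
    accepted_meta.write_text(json.dumps({"return_code": 0}), encoding="utf-8")
    rejected = baseline_failed(rejected_meta)
    accepted = baseline_failed(accepted_meta)
```

A test would call self.skipTest(...) when the helper returns True, so the skip reason shows up in the run report.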

Since dedup put all 80 qemu versions into a single profile, in practice every test uses profile-6-0-0 (or whatever the single profile is named). The version-map.json is the source of truth — read it rather than hard-coding.
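
One way to resolve the profile without hard-coding it, sketched under the assumption that version-map.json decodes to a flat mapping from qemu version string to profile name (the real file may be shaped differently). The fallback for an unlisted version is one possible policy, and the profile name in the example is illustrative:

```python
import logging
from typing import Mapping

log = logging.getLogger(__name__)


def profile_for_version(version: str, version_to_profile: Mapping[str, str]) -> str:
    """Return the profile for a qemu version, falling back with a warning."""
    if not version_to_profile:
        raise ValueError("version map is empty")
    try:
        return version_to_profile[version]
    except KeyError:
        # Deterministic fallback: first profile name in sorted order.
        fallback = sorted(set(version_to_profile.values()))[0]
        log.warning("qemu-img %s not in version map; using %s", version, fallback)
        return fallback


# Illustrative map with a single deduplicated profile.
version_map = {"6.0.0": "profile-6-0-0", "10.2.0": "profile-6-0-0"}
known = profile_for_version("10.2.0", version_map)
unknown = profile_for_version("5.2.0", version_map)
```

With a single profile the fallback is trivially correct; the warning only becomes useful if the profile space ever grows.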

Surface 2: source-image baseline comparison

For each safe-tier image in tests/manifest.json, and each target ∈ {raw, qcow2}, and each output type ∈ {human, json}:

Compute the expected image_id as <manifest-id>__<target>.
Skip if no baseline exists (handles caution/malicious images that were filtered out of phase 6's generation).
Skip if the baseline's meta.json shows non-zero exit (e.g. the qcow2-overlay-chain case whose backing-file path is stale).
Run instar measure <image-path> -O <target> --output=<format>.
Compare to baseline byte-for-byte. Note: baselines were recorded with $TESTDATA_ROOT placeholder for portability; the existing substitute_testdata_root() helper takes care of resolving paths in the comparison.
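
The fan-out and the placeholder handling above can be sketched like this. resolve_placeholder is a hypothetical stand-in for the existing substitute_testdata_root() helper, whose real signature is not shown in this plan:

```python
from itertools import product
from typing import Iterable, Iterator, Tuple

TARGETS = ("raw", "qcow2")
OUTPUT_TYPES = ("human", "json")
PLACEHOLDER = "$TESTDATA_ROOT"


def iter_source_cases(manifest_ids: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (image_id, target, output_type) for every image/target/format combination."""
    for manifest_id, target, output_type in product(manifest_ids, TARGETS, OUTPUT_TYPES):
        yield f"{manifest_id}__{target}", target, output_type


def resolve_placeholder(baseline: str, testdata_root: str) -> str:
    """Swap the portable placeholder for the local testdata checkout path."""
    return baseline.replace(PLACEHOLDER, testdata_root)


cases = list(iter_source_cases(["cirros-qcow2"]))
resolved = resolve_placeholder("$TESTDATA_ROOT/cirros.qcow2", "/srv/testdata")
```

Each manifest entry contributes four cases (two targets × two output types), which is where the surface's test count comes from.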

Surface 3: vmdk / vpc / vhdx round-trip

These target formats can't be cross-validated against qemu-img (qemu-img errors with "does not support size measurement"). Instead:

Create a small empty raw tmpfile via qemu-img create -f raw <tmpfile> <SIZE> (sizes: 1 MiB, 16 MiB, 64 MiB).
Run instar measure --size <SIZE> -O <fmt> --output=json → parse required + fully_allocated.
Run instar convert -f raw <tmpfile> -O <fmt> <out_tmpfile>.
Assert required <= os.path.getsize(out_tmpfile) <= fully_allocated, widened by the alignment cushion described under "Round-trip math".

For non-empty sources (a partially allocated input image):

Use one of the existing safe-tier qcow2 test images as source (e.g. cirros-qcow2).
Run instar measure <image> -O <fmt> --output=json.
Run instar convert <image> -O <fmt> <out_tmpfile>.
Same bound assertion.
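
Parsing the two bounds out of the JSON output might look like the sketch below. It assumes instar mirrors qemu-img's JSON key spelling, where the second field is "fully-allocated" with a hyphen; the sample numbers are illustrative, not taken from a real baseline:

```python
import json
from typing import NamedTuple


class MeasureBounds(NamedTuple):
    required: int
    fully_allocated: int


def parse_measure_json(stdout: str) -> MeasureBounds:
    """Extract the size bounds from `measure --output=json` output."""
    doc = json.loads(stdout)
    # Assumption: keys follow qemu-img's spelling ("fully-allocated").
    return MeasureBounds(int(doc["required"]), int(doc["fully-allocated"]))


sample_stdout = '{"required": 327680, "fully-allocated": 1074135040}'
bounds = parse_measure_json(sample_stdout)
```

Returning a named tuple keeps the later bound assertion readable: bounds.required and bounds.fully_allocated rather than dictionary lookups.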

Round-trip tests are slower (each one runs convert end-to-end), so cap at ~15 cases total: 3 sizes × 3 targets (vmdk, vpc, vhdx) = 9 --size cases, plus ~6 source-image cases.

`SIZE_CASES` duplication question

The 21 SIZE_CASES list lives in instar-testdata/scripts/generate-baselines.py. Two options for phase 7's tests:

A. Mirror the list inline in tests/test_measure.py as a Python const. Pros: tests are self-contained, no cross-repo path resolution. Cons: drift risk if testdata adds a new case without updating instar's mirror.
B. Walk the directory expected-outputs/measure-json/_size/<version>/ to discover cases at runtime, then derive args from filenames. Pros: never drifts. Cons: filename-to-args reverse-engineering is brittle (a typo in a case name silently desyncs from the expected args).

Recommendation: A (mirror). Drift is a controllable problem; brittleness is not. Add a one-line cross-check test that asserts every *.stdout.txt in the raw bucket has a mirroring SIZE_CASES entry, so adding a case to phase 6 without updating instar causes a clear test failure.
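
A sketch of the cross-check, with a hypothetical helper name and made-up case names. It reports both directions of drift, although the recommendation above only requires the first:

```python
import tempfile
from pathlib import Path
from typing import Iterable, Set, Tuple

SUFFIX = ".stdout.txt"


def diff_cases(bucket: Path, mirrored: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (on disk but not mirrored, mirrored but not on disk)."""
    on_disk = {p.name[: -len(SUFFIX)] for p in bucket.glob(f"*{SUFFIX}")}
    mirror = set(mirrored)
    return on_disk - mirror, mirror - on_disk


# Demonstration against a throwaway directory standing in for the raw bucket.
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp)
    for name in ("1M-qcow2-default", "1G-qcow2-cs-64k"):
        (bucket / f"{name}{SUFFIX}").write_text("", encoding="utf-8")
    # Sidecar files must not count as cases.
    (bucket / "1M-qcow2-default.meta.json").write_text("{}", encoding="utf-8")
    unmirrored, unbaselined = diff_cases(bucket, ["1M-qcow2-default", "1G-raw-default"])
```

In the test, asserting that the first set is empty gives the drift failure described above; printing the offending names in the assertion message makes the fix obvious.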

Test-class organisation

tests/test_measure.py
├── TestMeasureSmoke          (phase 4 — 8 tests, unchanged)
├── TestMeasureOptions        (phase 5 — 13 tests, unchanged)
├── TestMeasureBaselineSize   (phase 7 — ~42 tests)
├── TestMeasureBaselineSource (phase 7 — ~156 tests)
└── TestMeasureRoundTrip      (phase 7 — ~15 tests)

Total ≈ 230 tests (8 + 13 + ~42 + ~156 + ~15). The two new baseline-comparison classes together fan out to ~200 of those; each test is short (< 1 s), so total runtime is dominated by binary launch (~0.5 s × 200 = ~100 s). Acceptable.

If the runtime balloons under stestr (forking + venv import overhead), raise the worker count via stestr's --concurrency flag. The existing test suite already runs in parallel, and phase 7 inherits that.
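
A fan-out of this size is easier to maintain when the test methods are generated rather than written by hand. The sketch below shows the setattr registration pattern with placeholder case names and a stubbed check; the class and helper names are illustrative. Hyphens in case names are replaced so each generated method is a valid Python identifier:

```python
import unittest

CASES = ["1M-qcow2-default", "1G-qcow2-cs-64k"]  # placeholder case names
OUTPUT_TYPES = ["human", "json"]


class TestGenerated(unittest.TestCase):
    def _check(self, case_name: str, output_type: str) -> None:
        # Stub: the real test would run instar and compare to the baseline.
        self.assertIn(output_type, OUTPUT_TYPES)


def _make_test(case_name: str, output_type: str):
    # A factory binds the loop variables per test; a bare closure inside the
    # loop would see only the final values.
    def test(self):
        self._check(case_name, output_type)
    return test


for _case in CASES:
    for _otype in OUTPUT_TYPES:
        _name = f"test_{_case.replace('-', '_')}_{_otype}"
        setattr(TestGenerated, _name, _make_test(_case, _otype))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGenerated)
result = unittest.TestResult()
suite.run(result)
```

Because each case becomes its own named method, a parallel runner can schedule the cases independently and a failure report names the exact case and output type.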

Round-trip math

For vmdk monolithicSparse / vhd dynamic / vhdx, sparse output skips holes. Empty source → near-minimum output (header + tables only). The phase 1 calculator computes required for that case. So:

Empty source: actual ≈ required. Tolerance: actual <= required + grain_size (one grain of margin for writer alignment artefacts).
Half-allocated source: required < actual < fully_allocated.
Fully-allocated source: actual ≈ fully_allocated.

The plan picks the simplest invariant that always holds:

required - cluster_size <= actual <= fully_allocated + cluster_size

Where cluster_size is a small alignment cushion (one output sector or one block, whichever is larger; reuse vhdx::MB_ALIGN = 1MB for vhdx). This lets the test pass even when convert pads to output-sector boundaries that measure didn't account for (the divergence noted in phase 1e for VHD's leading-footer alignment).

If a round-trip test fails the bound, that's a real bug — either measure is over/underestimating, or convert is producing a wrong-size file. Both are blocking.
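
The invariant is small enough to state directly in code. This is a sketch with a hypothetical function name; the 65536-byte cushion is the one-output-sector value quoted elsewhere in this plan, and the other numbers are illustrative:

```python
def within_bounds(actual: int, required: int, fully_allocated: int, cushion: int = 0) -> bool:
    """Check required - cushion <= actual <= fully_allocated + cushion."""
    return required - cushion <= actual <= fully_allocated + cushion


ONE_MIB = 1024 * 1024
CUSHION = 65536  # one output sector, per the plan's starting value

# Slightly above required: passes even with no cushion.
ok_exact = within_bounds(ONE_MIB + 512, ONE_MIB, 2 * ONE_MIB)
# Padded just past fully_allocated: passes only with the cushion.
ok_padded = within_bounds(2 * ONE_MIB + 4096, ONE_MIB, 2 * ONE_MIB, cushion=CUSHION)
strict_padded = within_bounds(2 * ONE_MIB + 4096, ONE_MIB, 2 * ONE_MIB)
# Far below required: fails even with the cushion.
too_small = within_bounds(ONE_MIB - 70000, ONE_MIB, 2 * ONE_MIB, cushion=CUSHION)
```

Passing cushion=0 gives the exact-bounds variant, so the same helper can serve both the strict and the tolerant form of the assertion.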

Round-trip and `instar convert` semantics

instar's convert already produces vmdk / vpc / vhdx output (phase 3 of the convert plan). For phase 7's round-trip tests, use the existing CLI surface unchanged:

instar convert -f raw <input> -O <target> <output>

For target = vmdk, default is monolithicSparse. For vpc, default is dynamic. For vhdx, default is dynamic. These defaults match what measure computes when given no --subformat. If the test wants to exercise an alternative subformat, both convert and measure need the matching flag.

Open questions

The single-profile dedup means surfaces 1 and 2 don't exercise multiple qemu versions — every test compares against the same baseline regardless of installed qemu version. Recommendation: that's fine. The matrix exists so a future qemu-img-side change that splits the profile gets caught immediately; in the current state, the test coverage is "instar measure matches qemu-img 6.0.0 through 10.2.0" because they're all in one profile.
What about --output=human matching? qemu-img's human output is the same shape across versions. Should surfaces 1 and 2 run both human and json comparisons? Recommendation: yes — both are baselined in phase 6, both are matched. Doubles the test count but keeps coverage symmetric.
Round-trip tolerance: the phase plan picks a one-cluster-size cushion. Could tighten (zero tolerance, exact equality) if convert and measure agree exactly. Try exact first; widen if a test fails on an alignment off-by-one.
What if the installed qemu-img version isn't in any profile bucket? This shouldn't happen post-phase-6 (the matrix covers 6.0.0–10.2.0), but a developer might have qemu-img 5.x or something newer installed. Fallback: use the only profile that exists (since measure has only one profile, this is trivial). For future-proofing if the profile space ever grows, use version_to_profile.get(<v>, list(profiles)[0]) with a logged warning.
VMDK monolithicFlat source rejection: phase 4 rejected this with a clear error. The test surface should confirm the error is still raised — but phase 4 already covered that. Don't duplicate.
Convert's sector-size alignment can cause convert's actual output to exceed fully_allocated slightly (the leading-footer / sector-alignment gap from phase 1e). The cluster_size cushion in the round-trip math accounts for this, but the precise size depends on the output sector size (default 65536). Use that as the cushion directly for VHD specifically.
Should round-trip tests run by default in make test-integration? They each run convert end-to-end and the file I/O isn't free. Recommendation: yes — they're fast enough (~1 s each × 15 = ~15 s total) and they're the only way to catch vmdk/vpc/vhdx measure regressions. No opt-in flag.

Execution

Step	Effort	Model	Isolation	Brief for sub-agent
7a	medium	sonnet	none	Extend `tests/base.py`: add `'measure': 'measure'` to `COMMAND_OUTPUT_DIRS`. No new helper functions needed yet — `get_output_profiles()` and `get_expected_output()` already work for measure once the dict is updated; verify by writing a one-line smoke check (e.g. an `assertNotEqual(get_output_profiles('json', 'measure')['profiles'], {})` test inside an existing class). Touch only `tests/base.py` and `tests/test_measure.py` (the smoke check goes in the existing `TestMeasureSmoke` class). Run `make test-integration` to confirm the new smoke test passes.
7b	medium	sonnet	none	Add `TestMeasureBaselineSize(TestMeasureSmoke)` to `tests/test_measure.py`. Define a `MEASURE_SIZE_CASES = [...]` list at module scope mirroring the 21 entries from `instar-testdata/scripts/generate-baselines.py:SIZE_CASES`. Each entry is `(case_name, size_str, target, options_list)`. Implement `_args_for_case(case)` that translates an entry to a list of `instar measure` CLI args (size → `--size`, target → `-O`, options_list → `-o opt1,opt2`). Generate one test per case × output type (use a loop that calls `setattr(cls, f'test_{case_name}_{output_type}', ...)` to register the methods, or define a parametrised helper). Each test runs `instar measure`, fetches the matching baseline via `get_expected_output(case_name, profile, output_type='json'|'human', command='measure')`, and asserts byte equality. Skip cases whose baseline meta.json (`expected-outputs/measure-<type>/_size/<version>/<case-name>.meta.json` for the profile's representative version) has non-zero return_code. Add a cross-check `test_size_cases_match_baselines()` that walks `expected-outputs/measure-json/_size/<version>/` and asserts every `*.stdout.txt` corresponds to a MEASURE_SIZE_CASES entry, catching drift. Run `make test-integration` and confirm ~42 new tests pass.
7c	high	sonnet	none	Add `TestMeasureBaselineSource(TestMeasureSmoke)` to `tests/test_measure.py`. Iterate `self.get_all_safe_images()` (or whatever the existing helper is — look in `tests/base.py` for the iteration pattern; if it's not exposed, walk `self._images` or load the manifest directly). For each image × target ∈ {raw, qcow2} × output_type ∈ {human, json}, generate a test. Each test computes `image_id = f'{image.id}__{target}'`, skips if no baseline exists or if baseline meta.json shows non-zero return_code, runs `instar measure <path> -O <target> --output=<format>`, fetches the baseline via `get_expected_output(image_id, profile, output_type, command='measure')`, and asserts byte equality (after `substitute_testdata_root` if the baseline uses the placeholder). Note that the baseline filenames include the `__<target>` suffix because of phase 6's naming scheme. Expect ~156 tests; many will skip if their meta.json shows the qcow2-overlay-chain stale-backing-file failure. Run `make test-integration` and report pass/skip/fail counts. High effort because: iterating the manifest cleanly, composing the right image_id, and handling skip cases all interact. The sub-agent must read the manifest-loading code in `tests/base.py` carefully to find the right iteration pattern.
7d	medium	sonnet	none	Add `TestMeasureRoundTrip(TestMeasureSmoke)` to `tests/test_measure.py` covering vmdk / vpc / vhdx target formats (which qemu-img can't measure). Two flavours: (a) `--size` mode — create an empty raw tmpfile via `qemu-img create -f raw <tmpfile> <SIZE>`, run `instar measure --size <SIZE> -O <fmt> --output=json`, run `instar convert -f raw <tmpfile> -O <fmt> <out>`, assert `required <= os.path.getsize(out) <= fully_allocated` with a one-output-sector cushion (65536 bytes); (b) source-image mode — use an existing safe-tier qcow2 (cirros-qcow2 is the standard pick), run measure + convert, same bound assertion. Cap at ~15 tests total (3 sizes × 3 targets for --size mode + 2 source images × 3 targets for source mode = 15). Use `tempfile.NamedTemporaryFile` for the input/output paths, clean up in `addCleanup()`. Run `make test-integration` and confirm all pass.
7e	low	sonnet	none	Update `ARCHITECTURE.md`: in the existing "operations/measure/" bullet (last touched in 5d), append a sentence about the test coverage — "Integration tests in `tests/test_measure.py` cross-validate `instar measure` against the `qemu-img measure` baselines in `instar-testdata/expected-outputs/measure-*` for every safe-tier image and every curated `--size` case, plus round-trip the vmdk / vpc / vhdx outputs through `instar convert` to verify the predicted size bounds." Add to `CHANGELOG.md` Unreleased / Added: "Comprehensive integration tests for `instar measure`: cross-version baseline comparison for raw and qcow2 targets across every safe-tier test image, plus round-trip size-bound checks for vmdk / vpc / vhdx targets where qemu-img cannot validate. (PLAN-measure-phase-07-integration-tests.md)". Run `pre-commit run --all-files`.

Total: 5 commits.

Out of scope for phase 7

Updating instar-testdata (phase 6 already covered that).
Caution-tier / malicious-tier image coverage (phase 6 scope decision; revisit as a follow-up).
LUKS-encrypted source baselines (master-plan future work).
Backing-chain composition tests (chain support isn't in measure yet).
Performance benchmarking (separate effort).
Coverage-guided fuzz updates (phase 8).
Differential fuzz updates (phase 9).
docs/measure.md user guide (phase 10).

Success criteria

tests/test_measure.py has ~230 total tests (8 smoke + 13 options + ~42 size baseline + ~156 source baseline + ~15 round-trip).
make test-integration runs them all; ~210 pass, the rest skip-with-message for known-non-zero baselines (the qcow2-overlay-chain stale-backing-file family).
make instar builds; make lint clean; pre-commit run --all-files clean.
One end-to-end byte-equality check confirms parity: instar measure --size 1M -O qcow2 --output=json matches the baseline in instar-testdata/expected-outputs/measure-json/profiles/profile-NN/1M-qcow2-default.stdout.txt.
ARCHITECTURE.md and CHANGELOG.md updated.

Risks and mitigations

SIZE_CASES list drift between repos. Mitigation: 7b's test_size_cases_match_baselines() cross-checker catches any case present on disk but not in the Python mirror; comparing the two name sets for equality rather than containment would also catch a case in the mirror but not on disk.
Manifest entries without baselines (e.g. images added to the manifest after the phase 6 matrix was generated). Mitigation: skipTest with a clear message rather than fail. Surfaces the gap without blocking CI; user regenerates baselines when convenient.
Round-trip bound cushion too tight or too loose. Mitigation: phase 1e flagged the VHD sector-alignment divergence; start with a one-output-sector cushion (65 536 bytes), tighten or loosen based on observed failures during step 7d.
Parallel test runner conflicts (multiple tests writing to the same tmpfile). Mitigation: each round-trip test uses tempfile.NamedTemporaryFile so paths are unique; stestr's default forked execution handles isolation correctly.
get_expected_output() raises on missing files rather than returning None. Mitigation: wrap with a baseline-exists check that returns False if the path doesn't exist, then skipTest from the caller.
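
The skip-rather-than-fail mitigations can be sketched with a small wrapper. The helper name is hypothetical, and it reads the file directly instead of going through get_expected_output(), whose exact exception type is not documented in this plan:

```python
import tempfile
import unittest
from pathlib import Path


def read_baseline_or_skip(test: unittest.TestCase, path: Path) -> str:
    """Return the baseline text, or skip the calling test if it is absent."""
    if not path.exists():
        test.skipTest(f"no baseline recorded at {path}")
    return path.read_text(encoding="utf-8")


# Demonstration: a test pointing at a missing baseline is skipped, not failed.
with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "absent.stdout.txt"

    class _Demo(unittest.TestCase):
        def test_missing_baseline(self):
            read_baseline_or_skip(self, missing)
            self.fail("unreachable: skipTest raises before this line")

    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(_Demo).run(result)
```

The skip message carries the path, so a developer who adds a manifest entry can see exactly which baseline needs regenerating.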

Back brief

Before executing any step, the executing agent should back-brief: which test class is being added (or extended), which baseline files it reads, and how it locates them (via get_output_profiles / get_expected_output, or by direct path construction). The reviewer should verify no step bleeds into phase 8 (fuzzing), phase 9 (differential fuzzing extension), or phase 10 (docs).
