Phase 7: integration tests against the cross-version baselines
Master plan: PLAN-measure.md · Previous phase: PLAN-measure-phase-06-baselines.md
Status: Not started
Mission
Wire tests/test_measure.py (8 smoke tests from phase 4, 13
-o tests from phase 5) up to the cross-version baselines
that phase 6 committed in instar-testdata. After phase 7,
every safe-tier image in the manifest plus every curated
--size case is checked against its corresponding
qemu-img measure baseline, and the vmdk / vpc / vhdx target
formats (which qemu-img doesn't measure) get round-trip
coverage that asserts convert output size lies in the
[required, fully_allocated] range.
Phase 7 closes the loop: phase 4 says "the CLI works for the cases I tested"; phase 7 says "the CLI matches qemu-img on every case we record a baseline for, on every supported qemu version".
Why this is its own phase
Phase 6 stored bytes on disk. Phase 7 turns those bytes into assertions. They're separable because:
- Phase 6 work is in the instar-testdata repo and the long-running tail is regenerating baselines whenever the matrix expands. Phase 7 work is in the instar repo and does not touch the testdata side.
- Phase 6 needed scripting that runs once per matrix-refresh; phase 7 needs Python test infrastructure that runs on every CI invocation.
- The fan-out is large (~200 test cases). Bundling phase 6's data generation with the test code would mean one giant commit that's hard to review.
Architecture
Baseline lookup
The existing tests/base.py already wraps the
expected-outputs/<output-type>/profiles/<profile>/<image-id>.stdout.txt
layout via get_output_profiles() and get_expected_output().
Phase 7 reuses both, with one small extension:
# tests/base.py
COMMAND_OUTPUT_DIRS = {
    'info': 'qemu-img',
    'check': 'check',
    'compare': 'compare',
    'measure': 'measure',  # NEW (phase 7)
}
The measure-specific peculiarity is the _size/ pseudo-bucket
in the raw layout. But for the profile layout
(expected-outputs/measure-json/profiles/profile-NN/), files
are named by image_id exactly like every other command —
detect-profiles.py already produced this naming during
phase 6. So get_expected_output('1G-qcow2-default',
'profile-10-0-0', output_type='json', command='measure')
returns the right file with no additional helper needed.
For the source-image cases, the image_id in the profile
bucket has the __<target> suffix already baked in
(verified in phase 6): e.g. cirros-qcow2__qcow2.stdout.txt.
Tests pass this composed id to get_expected_output().
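As a rough illustration of the layout described above, the sketch below composes a profile-bucket path and a source-image id. The names baseline_path and source_image_id, and the root argument, are hypothetical; the real lookup goes through get_expected_output() in tests/base.py, whose internals are not shown in this plan.

```python
from pathlib import Path


def baseline_path(root, output_type, profile, image_id):
    """Compose the profile-layout path for a measure baseline.

    Hypothetical helper mirroring the documented layout:
    expected-outputs/measure-<type>/profiles/<profile>/<image-id>.stdout.txt
    """
    return (Path(root) / "expected-outputs" / f"measure-{output_type}"
            / "profiles" / profile / f"{image_id}.stdout.txt")


def source_image_id(manifest_id, target):
    """Source-image baselines carry a __<target> suffix in their id."""
    return f"{manifest_id}__{target}"
```

For example, source_image_id('cirros-qcow2', 'qcow2') yields the cirros-qcow2__qcow2 id mentioned above.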
Three test surfaces
Surface 1: --size mode baseline comparison
For each of the 21 SIZE_CASES (the same list phase 6's generator iterates), and each output type (human / json):
- Translate the case's size_str / target_format / options_list into instar's CLI flags.
- Run instar measure --size <SIZE> -O <TARGET> <flags> --output=<format>.
- Look up the baseline via get_output_profiles('json', 'measure')['profiles'] to get the profile for the installed qemu version (or a representative one), then get_expected_output('1G-qcow2-cs-64k', profile, output_type='json', command='measure').
- Assert instar_output == baseline exactly.
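The translation step can be sketched as below, assuming each case is a (case_name, size_str, target, options_list) tuple and that options are passed as a single comma-separated -o value. The function name and the example tuple contents are illustrative, not copied from generate-baselines.py.

```python
def args_for_case(case, output_format):
    """Translate a (case_name, size_str, target, options_list) tuple
    into an argv list for `instar measure`."""
    _name, size_str, target, options = case
    args = ["instar", "measure", "--size", size_str, "-O", target]
    if options:
        # All options are joined into one comma-separated -o value.
        args += ["-o", ",".join(options)]
    args.append(f"--output={output_format}")
    return args


# Illustrative entry; the real option strings live in the testdata repo.
example_case = ("1G-qcow2-cs-64k", "1G", "qcow2", ["cluster_size=64k"])
example_args = args_for_case(example_case, "json")
```

The resulting list can be handed to subprocess.run() unchanged, which avoids any shell quoting of the option string.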
Skip cases where the baseline has non-zero exit (qemu-img
rejected the option on that profile's representative
version). Read the meta.json in the raw bucket
(expected-outputs/measure-json/_size/<version>/<case-name>.meta.json)
to determine exit code; skip if non-zero.
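A minimal sketch of the skip decision, assuming meta.json is a JSON object with an integer return_code key (the key name used in the execution table). The function name is hypothetical, and treating a missing key as "not successful" is a design choice of this sketch.

```python
import json


def baseline_succeeded(meta_path):
    """Return True only when the recorded qemu-img run exited with code 0.

    A missing return_code key counts as not successful, so the caller
    skips rather than comparing against an unreliable baseline.
    """
    with open(meta_path, encoding="utf-8") as fh:
        meta = json.load(fh)
    return meta.get("return_code") == 0
```

A test would call this before running instar and invoke self.skipTest(...) when it returns False.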
Since dedup put all 80 qemu versions into a single profile,
in practice every test uses profile-6-0-0 (or whatever the
single profile is named). The version-map.json is the source
of truth — read it rather than hard-coding.
Surface 2: source-image baseline comparison
For each safe-tier image in tests/manifest.json, and each
target ∈ {raw, qcow2}, and each output type ∈ {human, json}:
- Compute the expected image_id as <manifest-id>__<target>.
- Skip if no baseline exists (handles caution/malicious images that were filtered out of phase 6's generation).
- Skip if the baseline's meta.json shows non-zero exit (e.g. the qcow2-overlay-chain case whose backing-file path is stale).
- Run instar measure <image-path> -O <target> --output=<format>.
- Compare to baseline byte-for-byte. Note: baselines were recorded with a $TESTDATA_ROOT placeholder for portability; the existing substitute_testdata_root() helper takes care of resolving paths in the comparison.
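The final comparison step might look like the sketch below. resolve_placeholder is a hypothetical stand-in for substitute_testdata_root(), whose real signature is not shown in this plan and may differ.

```python
def resolve_placeholder(baseline_text, testdata_root):
    """Replace the literal $TESTDATA_ROOT placeholder with a real path.

    Hypothetical stand-in for substitute_testdata_root().
    """
    return baseline_text.replace("$TESTDATA_ROOT", str(testdata_root))


def matches_baseline(actual_text, baseline_text, testdata_root):
    """Byte-for-byte comparison after resolving the placeholder."""
    expected = resolve_placeholder(baseline_text, testdata_root)
    return actual_text.encode("utf-8") == expected.encode("utf-8")
```

Resolving the placeholder on the baseline side, rather than rewriting instar's output, keeps the text under test untouched.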
Surface 3: vmdk / vpc / vhdx round-trip
These target formats can't be cross-validated against qemu-img (qemu-img errors with "does not support size measurement"). Instead:
- Create a small empty raw tmpfile via qemu-img create -f raw <tmpfile> <SIZE> (sizes: 1 MiB, 16 MiB, 64 MiB).
- Run instar measure --size <SIZE> -O <fmt> --output=json → parse required + fully_allocated.
- Run instar convert -f raw <tmpfile> -O <fmt> <out_tmpfile>.
- Assert required <= os.path.getsize(out_tmpfile) <= fully_allocated.
For non-empty sources (partially allocated input):
- Use one of the existing safe-tier qcow2 test images as source (e.g. cirros-qcow2).
- Run instar measure <image> -O <fmt> --output=json.
- Run instar convert <image> -O <fmt> <out_tmpfile>.
- Same bound assertion.
Round-trip tests are slower (each one runs convert end-to-end), so cap at ~15 cases total (3 sizes × 3 vmdk-style targets × 1 source mode + ~6 source-image cases).
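The parse-and-assert step can be sketched as follows. It assumes instar mirrors qemu-img's JSON key names, where the second field is spelled "fully-allocated" with a hyphen; both function names are hypothetical.

```python
import json


def parse_measure_json(stdout_text):
    """Extract (required, fully_allocated) from measure JSON output.

    Assumes the qemu-img key spelling: "required" and "fully-allocated".
    """
    doc = json.loads(stdout_text)
    return int(doc["required"]), int(doc["fully-allocated"])


def size_within_bounds(actual_size, required, fully_allocated):
    """The Surface 3 assertion: required <= actual <= fully_allocated."""
    return required <= actual_size <= fully_allocated
```

In a test, actual_size would come from os.path.getsize() on the convert output.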
SIZE_CASES duplication question
The 21 SIZE_CASES list lives in
instar-testdata/scripts/generate-baselines.py. Two options
for phase 7's tests:
A. Mirror the list inline in tests/test_measure.py as a
Python const. Pros: tests are self-contained, no
cross-repo path resolution. Cons: drift risk if testdata
adds a new case without updating instar's mirror.
B. Walk the directory expected-outputs/measure-json/_size/<version>/
to discover cases at runtime, then derive args from
filenames. Pros: never drifts. Cons: filename-to-args
reverse-engineering is brittle (a typo in a case name
silently desyncs from the expected args).
Recommendation: A (mirror). Drift is a controllable
problem; brittleness is not. Add a one-line cross-check test
that asserts every *.stdout.txt in the raw bucket has a
mirroring SIZE_CASES entry, so adding a case to phase 6
without updating instar causes a clear test failure.
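A sketch of the cross-check, assuming baseline files in the raw bucket are named <case-name>.stdout.txt. The function name is hypothetical. It reports drift in both directions so the failure message can say which side is stale.

```python
from pathlib import Path

SUFFIX = ".stdout.txt"


def find_case_drift(raw_bucket_dir, mirrored_case_names):
    """Compare on-disk baselines against the Python mirror.

    Returns (missing_from_mirror, missing_from_disk) as sorted lists.
    """
    on_disk = {
        p.name[: -len(SUFFIX)]
        for p in Path(raw_bucket_dir).glob(f"*{SUFFIX}")
    }
    mirrored = set(mirrored_case_names)
    return sorted(on_disk - mirrored), sorted(mirrored - on_disk)
```

The test would assert that both returned lists are empty and include them in the assertion message.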
Test-class organisation
tests/test_measure.py
├── TestMeasureSmoke (phase 4 — 8 tests, unchanged)
├── TestMeasureOptions (phase 5 — 13 tests, unchanged)
├── TestMeasureBaselineSize (phase 7 — ~42 tests)
├── TestMeasureBaselineSource (phase 7 — ~156 tests)
└── TestMeasureRoundTrip (phase 7 — ~15 tests)
Total ≈ 230 tests. The two new baseline-comparison classes together fan out to ~200 of those; each test is short (< 1 s) so total runtime is dominated by binary launch (~0.5 s × 200 = ~100 s). Acceptable.
If the runtime balloons under stestr (forking + venv import
overhead), enable parallel execution via stestr's
--concurrency flag. The existing test suite already runs
in parallel; phase 7 inherits that.
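The fan-out in the baseline classes could be generated with a loop that registers one method per case and output type. The sketch below uses a two-entry stand-in for the case list and a placeholder check body; the class and helper names are illustrative only.

```python
import unittest

# Illustrative two-entry stand-in for the 21-entry mirror.
CASES = ["1G-qcow2-default", "1G-qcow2-cs-64k"]
OUTPUT_TYPES = ["human", "json"]


class TestMeasureBaselineSizeSketch(unittest.TestCase):
    def _check_case(self, case_name, output_type):
        # Placeholder: the real test would run instar measure and
        # compare its output against the baseline.
        self.assertIn(case_name, CASES)
        self.assertIn(output_type, OUTPUT_TYPES)


def _make_test(case_name, output_type):
    # A factory binds the loop variables per test, avoiding the
    # late-binding closure pitfall.
    def test(self):
        self._check_case(case_name, output_type)
    return test


for _case in CASES:
    for _otype in OUTPUT_TYPES:
        _name = f"test_{_case.replace('-', '_')}_{_otype}"
        setattr(TestMeasureBaselineSizeSketch, _name,
                _make_test(_case, _otype))
```

Because each generated method has a distinct name, a runner such as stestr can schedule and report the cases individually.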
Round-trip math
For vmdk monolithicSparse / vhd dynamic / vhdx, sparse
output skips holes. Empty source → near-minimum output
(header + tables only). The phase 1 calculator computes
required for that case. So:
- Empty source: actual ≈ required. Tolerance: actual <= required + grain_size (one grain of margin for writer alignment artefacts).
- Half-allocated source: required < actual < fully_allocated.
- Fully-allocated source: actual ≈ fully_allocated.
The plan picks the simplest invariant that always holds:
required <= actual <= fully_allocated + cluster_size
Here cluster_size is a small alignment cushion (one
output sector or one block, whichever is larger; reuse
vhdx::MB_ALIGN = 1MB for vhdx). This lets the test pass
even when convert pads to output-sector boundaries that
measure didn't account for (the divergence noted in phase 1e
for VHD's leading-footer alignment).
If a round-trip test fails the bound, that's a real bug — either measure is over/underestimating, or convert is producing a wrong-size file. Both are blocking.
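A sketch of the cushioned bound, assuming the invariant is required <= actual <= fully_allocated + cushion. The 65536-byte value for vpc and the 1 MiB value for vhdx come from this plan; the vmdk value (one 64 KiB grain) is an assumption for illustration, and all names are hypothetical.

```python
MIB = 1024 * 1024

# vpc and vhdx values are taken from the plan; the vmdk value is assumed.
CUSHION_BYTES = {"vmdk": 65536, "vpc": 65536, "vhdx": MIB}


def check_round_trip(fmt, actual, required, fully_allocated):
    """Return None when the bound holds, else a failure message."""
    upper = fully_allocated + CUSHION_BYTES[fmt]
    if actual < required:
        return f"{fmt}: output {actual} is below required {required}"
    if actual > upper:
        return (f"{fmt}: output {actual} exceeds {upper} "
                f"(fully_allocated + cushion)")
    return None
```

Returning a message rather than a bare boolean lets the test report which side of the bound was violated.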
Round-trip and instar convert semantics
instar's convert already produces vmdk / vpc / vhdx output
(phase 3 of the convert plan). For phase 7's round-trip
tests, use the existing CLI surface unchanged, i.e. the instar convert invocations listed under Surface 3.
For target = vmdk, default is monolithicSparse. For vpc,
default is dynamic. For vhdx, default is dynamic. These
defaults match what measure computes when given no
--subformat. If the test wants to exercise an alternative
subformat, both convert and measure need the matching flag.
Open questions
- The single-profile dedup means surfaces 1 and 2 don't exercise multiple qemu versions: every test compares against the same baseline regardless of installed qemu version. Recommendation: that's fine. The matrix exists so a future qemu-img-side change that splits the profile gets caught immediately; in the current state, the test coverage is "instar measure matches qemu-img 6.0.0 through 10.2.0" because they're all in one profile.
- What about --output=human matching? qemu-img's human output is the same shape across versions. Should surfaces 1 and 2 run both human and json comparisons? Recommendation: yes; both are baselined in phase 6, both are matched. Doubles the test count but keeps coverage symmetric.
- Round-trip tolerance: the phase plan picks a one-cluster-size cushion. Could tighten (zero tolerance, exact equality) if convert and measure agree exactly. Try exact first; widen if a test fails on an alignment off-by-one.
- What if the installed qemu-img version isn't in any profile bucket? This shouldn't happen post-phase-6 (the matrix covers 6.0.0–10.2.0), but a developer might have qemu-img 5.x or something newer installed. Fallback: use the only profile that exists (since measure has only one profile, this is trivial). For future-proofing if the profile space ever grows, use version_to_profile.get(<v>, list(profiles)[0]) with a logged warning.
- VMDK monolithicFlat source rejection: phase 4 rejected this with a clear error. The test surface should confirm the error is still raised, but phase 4 already covered that. Don't duplicate.
- Convert's sector-size alignment can cause convert's actual output to exceed fully_allocated slightly (the leading-footer / sector-alignment gap from phase 1e). The cluster_size cushion in the round-trip math accounts for this, but the precise size depends on the output sector size (default 65536). Use that as the cushion directly for VHD specifically.
- Should round-trip tests run by default in make test-integration? They each run convert end-to-end and the file I/O isn't free. Recommendation: yes; they're fast enough (~1 s each × 15 = ~15 s total) and they're the only way to catch vmdk/vpc/vhdx measure regressions. No opt-in flag.
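The profile fallback from the open questions could be sketched like this. The function name is hypothetical; version_to_profile and profiles are assumed to be a plain dict and an ordered collection loaded from version-map.json, whose exact schema is not shown in this plan.

```python
import logging

log = logging.getLogger(__name__)


def profile_for_version(version, version_to_profile, profiles):
    """Map an installed qemu-img version to a baseline profile name.

    Falls back to the first known profile, with a warning, when the
    version is absent from the map.
    """
    profile = version_to_profile.get(version)
    if profile is None:
        profile = list(profiles)[0]
        log.warning("qemu-img %s has no profile; falling back to %s",
                    version, profile)
    return profile
```

Checking for None explicitly, instead of passing the default straight to .get(), is what makes room for the warning.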
Execution
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 7a | medium | sonnet | none | Extend tests/base.py: add 'measure': 'measure' to COMMAND_OUTPUT_DIRS. No new helper functions needed yet — get_output_profiles() and get_expected_output() already work for measure once the dict is updated; verify by writing a one-line smoke check (e.g. an assertNotEqual(get_output_profiles('json', 'measure')['profiles'], {}) test inside an existing class). Touch only tests/base.py and tests/test_measure.py (the smoke check goes in the existing TestMeasureSmoke class). Run make test-integration to confirm the new smoke test passes. |
| 7b | medium | sonnet | none | Add TestMeasureBaselineSize(TestMeasureSmoke) to tests/test_measure.py. Define a MEASURE_SIZE_CASES = [...] list at module scope mirroring the 21 entries from instar-testdata/scripts/generate-baselines.py:SIZE_CASES. Each entry is (case_name, size_str, target, options_list). Implement _args_for_case(case) that translates an entry to a list of instar measure CLI args (size → --size, target → -O, options_list → -o opt1,opt2). Generate one test per case × output type (use a loop that calls setattr(cls, f'test_{case_name}_{output_type}', ...) to register the methods, or define a parametrised helper). Each test runs instar measure, fetches the matching baseline via get_expected_output(case_name, profile, output_type='json'|'human', command='measure'), and asserts byte equality. Skip cases whose baseline meta.json (expected-outputs/measure-<type>/_size/<version>/<case-name>.meta.json for the profile's representative version) has non-zero return_code. Add a cross-check test_size_cases_match_baselines() that walks expected-outputs/measure-json/_size/<version>/ and asserts every *.stdout.txt corresponds to a MEASURE_SIZE_CASES entry, catching drift. Run make test-integration and confirm ~42 new tests pass. |
| 7c | high | sonnet | none | Add TestMeasureBaselineSource(TestMeasureSmoke) to tests/test_measure.py. Iterate self.get_all_safe_images() (or whatever the existing helper is — look in tests/base.py for the iteration pattern; if it's not exposed, walk self._images or load the manifest directly). For each image × target ∈ {raw, qcow2} × output_type ∈ {human, json}, generate a test. Each test computes image_id = f'{image.id}__{target}', skips if no baseline exists or if baseline meta.json shows non-zero return_code, runs instar measure <path> -O <target> --output=<format>, fetches the baseline via get_expected_output(image_id, profile, output_type, command='measure'), and asserts byte equality (after substitute_testdata_root if the baseline uses the placeholder). Note that the baseline filenames include the __<target> suffix because of phase 6's naming scheme. Expect ~156 tests; many will skip if their meta.json shows the qcow2-overlay-chain stale-backing-file failure. Run make test-integration and report pass/skip/fail counts. High effort because: iterating the manifest cleanly, composing the right image_id, and handling skip cases all interact. The sub-agent must read the manifest-loading code in tests/base.py carefully to find the right iteration pattern. |
| 7d | medium | sonnet | none | Add TestMeasureRoundTrip(TestMeasureSmoke) to tests/test_measure.py covering vmdk / vpc / vhdx target formats (which qemu-img can't measure). Two flavours: (a) --size mode — create an empty raw tmpfile via qemu-img create -f raw <tmpfile> <SIZE>, run instar measure --size <SIZE> -O <fmt> --output=json, run instar convert -f raw <tmpfile> -O <fmt> <out>, assert required <= os.path.getsize(out) <= fully_allocated with a one-output-sector cushion (65536 bytes); (b) source-image mode — use an existing safe-tier qcow2 (cirros-qcow2 is the standard pick), run measure + convert, same bound assertion. Cap at ~15 tests total (3 sizes × 3 targets for --size mode + 2 source images × 3 targets for source mode = 15). Use tempfile.NamedTemporaryFile for the input/output paths, clean up in addCleanup(). Run make test-integration and confirm all pass. |
| 7e | low | sonnet | none | Update ARCHITECTURE.md: in the existing "operations/measure/" bullet (last touched in 5d), append a sentence about the test coverage — "Integration tests in tests/test_measure.py cross-validate instar measure against the qemu-img measure baselines in instar-testdata/expected-outputs/measure-* for every safe-tier image and every curated --size case, plus round-trip the vmdk / vpc / vhdx outputs through instar convert to verify the predicted size bounds." Add to CHANGELOG.md Unreleased / Added: "Comprehensive integration tests for instar measure: cross-version baseline comparison for raw and qcow2 targets across every safe-tier test image, plus round-trip size-bound checks for vmdk / vpc / vhdx targets where qemu-img cannot validate. (PLAN-measure-phase-07-integration-tests.md)". Run pre-commit run --all-files. |
Total: 5 commits.
Out of scope for phase 7
- Updating instar-testdata (phase 6 already covered that).
- Caution-tier / malicious-tier image coverage (phase 6 scope decision; revisit as a follow-up).
- LUKS-encrypted source baselines (master-plan future work).
- Backing-chain composition tests (chain support isn't in measure yet).
- Performance benchmarking (separate effort).
- Coverage-guided fuzz updates (phase 8).
- Differential fuzz updates (phase 9).
- docs/measure.md user guide (phase 10).
Success criteria
- tests/test_measure.py has ~230 total tests (8 smoke + 13 options + ~42 size baseline + ~156 source baseline + ~15 round-trip).
- make test-integration runs them all; ~210 pass, the rest skip-with-message for known-non-zero baselines (the qcow2-overlay-chain stale-backing-file family).
- make instar builds; make lint clean; pre-commit run --all-files clean.
- One end-to-end byte-equality check confirms parity: instar measure --size 1M -O qcow2 --output=json matches the baseline in instar-testdata/expected-outputs/measure-json/profiles/profile-NN/1M-qcow2-default.stdout.txt.
- ARCHITECTURE.md and CHANGELOG.md updated.
Risks and mitigations
- SIZE_CASES list drift between repos. Mitigation: 7b's test_size_cases_match_baselines() cross-checker catches any case present on disk but not in the Python mirror, and any case in the mirror but not on disk.
- Manifest entries without baselines (e.g. images added to the manifest after the phase 6 matrix was generated). Mitigation: skipTest with a clear message rather than fail. Surfaces the gap without blocking CI; user regenerates baselines when convenient.
- Round-trip bound cushion too tight or too loose. Mitigation: phase 1e flagged the VHD sector-alignment divergence; start with a one-output-sector cushion (65 536 bytes), tighten or loosen based on observed failures during step 7d.
- Parallel test runner conflicts (multiple tests writing to the same tmpfile). Mitigation: each round-trip test uses tempfile.NamedTemporaryFile so paths are unique; stestr's default forked execution handles isolation correctly.
- get_expected_output() raises on missing files rather than returning None. Mitigation: wrap with a baseline-exists check that returns False if the path doesn't exist, then skipTest from the caller.
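The last two mitigations can be sketched together. The class and helper names are hypothetical; the sketch only shows the pattern of unique temporary output paths with addCleanup() and a baseline-exists check that leads to skipTest().

```python
import os
import tempfile
import unittest


def baseline_exists(path):
    """True when the baseline file is present on disk."""
    return os.path.isfile(path)


def _remove_if_present(path):
    if os.path.exists(path):
        os.unlink(path)


class RoundTripSketch(unittest.TestCase):
    def make_output_path(self, suffix):
        # delete=False keeps the path valid after close(); removal is
        # registered explicitly so it also runs when the test fails.
        tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
        tmp.close()
        self.addCleanup(_remove_if_present, tmp.name)
        return tmp.name

    def require_baseline(self, path):
        if not baseline_exists(path):
            self.skipTest(f"no baseline at {path}")
```

Each call to make_output_path() returns a distinct file name, so tests running in parallel do not share output files.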
Back brief
Before executing any step, the executing agent should
back-brief: which test class is being added (or extended),
which baseline files it reads, and how it locates them
(via get_output_profiles / get_expected_output, or by
direct path construction). The reviewer should verify no
step bleeds into phase 8 (fuzzing), phase 9 (differential
fuzzing extension), or phase 10 (docs).