Phase 5: cross-version baselines
Master plan: PLAN-map.md · Previous phase: PLAN-map-phase-04-output-formatting.md
Status: Complete
instar-testdata commits 4e56008d8 (generator extension),
8e0498ca3 + 315859c3d (profile dedup), and 0f972d5b1
(raw baselines) produced map-human + map-json baselines
for all 80 qemu-img versions (6.0.0–10.2.x) across every
safe-tier source image — ~6,240 baseline cells total.
detect-profiles.py deduplicates into 1 map-human profile
(stable across the full range) and 3 map-json profiles
(transitions at 6.0.x→6.1.x — likely compressed field
addition — and 8.1.x→8.2.x).
Mission
Generate the map-human and map-json baseline matrix in
the sibling instar-testdata repository, covering the full
80-version qemu-img-binaries/x86_64/ set against every
safe-tier source image. The matrix is consumed by phase 6's
integration tests (tests/test_map.py), which check
instar map's output against the version-keyed expected
output for whichever qemu-img is installed on the host.
Phase 5's deliverable is bulk data, not code: ~80 versions
× ~44 safe-tier images × 2 output types ≈ 7,000 raw
baseline files, deduplicated by detect-profiles.py into a
handful of profile directories per (output_type,
src_format). The script changes are small extensions to the
existing baseline generator; the run itself takes ~30
minutes against a warm qemu-img-binaries/ tree.
Why this is its own phase
Phase 4 shipped byte-for-byte qemu-img-compatible output for the current dev qemu-img. Phase 5 confirms that output matches across the supported qemu-img version range and captures the version-keyed expected output that phase 6's integration tests will diff against. Without phase 5 the phase 4 renderer's "byte-for-byte parity" claim is a one-version snapshot.
Bundling phase 5 with phase 4 (renderer) or phase 6 (integration tests) would tangle two unrelated kinds of work: phase 4 is pure Rust polish; phase 6 is Python integration plumbing; phase 5 is a generator extension plus a bulk data run. The clean split is what the PLAN-measure / PLAN-create / PLAN-rebase / PLAN-commit predecessors all use.
Architecture
Cross-repository split
Phase 5 commits land across two repositories:
- instar-testdata (sibling repo, ~/src/shakenfist/instar-testdata/, GitLab remote gitlab.home.stillhq.com:private/instar-testdata.git):
  - Generator script extensions
  - Generated baselines (raw + profiles)
  - testdata README update

  The repository is a private GitLab project; the convention from earlier phases is one commit for the script change and a second for the generated data (kept separate because the data commit is multi-thousand files and would mask the script changes in git log).
- instar-wt-map (this worktree, GitHub remote shakenfist/instar):
  - PLAN-map.md execution-table status update
  - CHANGELOG.md entry
  - Possibly docs/quirks.md adjustments if the generator surfaces version-keyed divergences not already documented in phase 4c.
The phase plan tracks both repos but phase 5's git commits are not atomic across them — instar-testdata lands first (because the baselines are the deliverable), instar documents that landing after the testdata commits are pushed.
generate-baselines.py extension
Add a new 'map' entry to the COMMANDS dict in
instar-testdata/scripts/generate-baselines.py:
    'map': {
        'output_types': {
            'map-human': None,   # default human-readable output
            'map-json': 'json',  # JSON output (--output=json)
        },
        # qemu-img map reads every format the parser supports.
        # Same whitelist as info / check / measure.
        'supported_formats': [
            'raw', 'qcow2', 'vmdk', 'vmdk3', 'vhd', 'vhdx',
            'qcow1', 'qed', 'vdi',
            # vpc is qemu's internal name for vhd
            'vpc',
        ],
        # build_cmd: qemu-img map [--output=FMT] IMAGE
        'build_cmd': lambda binary, image_path, output_format: (
            [str(binary), 'map'] +
            ([f'--output={output_format}'] if output_format else []) +
            [str(image_path)]
        ),
    },
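To make the effect of the build_cmd lambda concrete, here is a minimal standalone sketch of the argv it produces for the two output types. The function name build_map_cmd and both paths are placeholders for illustration; only the argument order comes from the entry above.

```python
from pathlib import Path


def build_map_cmd(binary, image_path, output_format=None):
    """Standalone equivalent of the 'build_cmd' lambda in the 'map' entry."""
    cmd = [str(binary), 'map']
    if output_format:
        cmd.append(f'--output={output_format}')
    cmd.append(str(image_path))
    return cmd


# Hypothetical paths, for illustration only.
binary = Path('qemu-img-binaries/x86_64/example/qemu-img')
image = Path('images/example.qcow2')

human_cmd = build_map_cmd(binary, image, None)    # map-human: no --output flag
json_cmd = build_map_cmd(binary, image, 'json')   # map-json: --output=json

print(human_cmd)
print(json_cmd)
```

The None value for map-human is what keeps the flag off the command line entirely, so the human baseline records qemu-img's default output rather than an explicit --output=human.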
Add a generate_map_baseline helper modelled on
generate_measure_source_baseline (the simpler shape — no
target-format axis, no size-mode):
    from typing import Optional  # if not already imported by the script

    def generate_map_baseline(
        binary: Path,
        version: str,
        image: dict,
        images_root: Path,
        output_dir: Path,
        output_format: Optional[str] = None,  # None selects the human output
        timeout: int = 30,
    ) -> dict:
        """
        Generate a map baseline for one source-image case.

        Runs: qemu-img map [--output=FMT] <image>

        Output filename stem is '<image-id>'.
        Writes into output_dir which should be
        <output_type>/<src_format>/<version>/

        Returns dict with status and details.
        """
        # … boilerplate identical to generate_measure_source_baseline
        # minus the target_format axis.
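The body of generate_measure_source_baseline is not reproduced in this plan, so the following is only a rough sketch of the shape such a helper might take: run one command under a timeout and write the three per-baseline files. The function name run_case and the meta keys (cmd, status, exit_code) are assumptions, not the generator's actual schema.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(cmd, stem, output_dir, timeout=30):
    """Run cmd and write <stem>.stdout.txt, .stderr.txt and .meta.json."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    meta = {'cmd': cmd, 'timeout': timeout}  # hypothetical meta layout
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # A hang is recorded rather than raised, so the sweep can continue.
        meta['status'] = 'timeout'
        stdout, stderr = b'', b''
    else:
        meta['exit_code'] = proc.returncode
        meta['status'] = 'ok' if proc.returncode == 0 else 'error'
        stdout, stderr = proc.stdout, proc.stderr
    (output_dir / f'{stem}.stdout.txt').write_bytes(stdout)
    (output_dir / f'{stem}.stderr.txt').write_bytes(stderr)
    (output_dir / f'{stem}.meta.json').write_text(json.dumps(meta, indent=2))
    return meta


# Stand-in command so the sketch runs without qemu-img installed.
with tempfile.TemporaryDirectory() as tmp:
    demo = run_case([sys.executable, '-c', 'print("demo")'], 'example', tmp)
    demo_stdout = (Path(tmp) / 'example.stdout.txt').read_text()

print(demo['status'], demo_stdout.strip())
```

Writing stdout as raw bytes matters for a byte-for-byte parity check: decoding and re-encoding could silently normalise line endings.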
Add a dispatch branch in the main loop alongside the existing measure / create / resize / rebase / commit branches:
    elif command_name == 'map':
        # -- map command: source-image mode only --
        for output_type_name, output_format in output_types.items():
            print(f' Output type: {output_type_name}')
            for image in images:
                image_format = image.get('format', '').lower()
                src_dir = (
                    output_root / output_type_name / image_format / version
                )
                src_dir.mkdir(parents=True, exist_ok=True)
                total += 1
                result = generate_map_baseline(
                    binary, version, image, images_root,
                    src_dir, output_format,
                )
                label = f'{image["id"]}'
                # ... same OK / WARN / TIMEOUT / ERROR dispatch as measure
Update the --command argparse choices to include map,
and add map to the docstring's command list.
detect-profiles.py extension
Add 'map-human' and 'map-json' to
instar-testdata/scripts/detect-profiles.py:
    # New command-based naming
    CHECK_OUTPUT_TYPES = ['check-human', 'check-json']
    COMPARE_OUTPUT_TYPES = ['compare-human', 'compare-json']
    MEASURE_OUTPUT_TYPES = ['measure-human', 'measure-json']
    CREATE_OUTPUT_TYPES = ['create-info-json']
    MAP_OUTPUT_TYPES = ['map-human', 'map-json']
    OUTPUT_TYPES = (
        INFO_OUTPUT_TYPES
        + CHECK_OUTPUT_TYPES
        + COMPARE_OUTPUT_TYPES
        + MEASURE_OUTPUT_TYPES
        + CREATE_OUTPUT_TYPES
        + MAP_OUTPUT_TYPES
    )
    # Map uses the per-bucket layout:
    # <type>/<src_format>/<version>/<image-id>.stdout.txt
    MULTI_BUCKET_TYPES = set(
        MEASURE_OUTPUT_TYPES + CREATE_OUTPUT_TYPES + MAP_OUTPUT_TYPES
    )
Map's baseline layout buckets by source format (like measure) rather than target format (like create), because map is read-only on the source — there's no target axis.
Baseline volume estimate
- 80 qemu-img versions × ~44 safe-tier images × 2 output types = ~7,040 raw baseline files.
- 3 files per baseline (.stdout.txt, .stderr.txt, .meta.json) = ~21,000 small files.
- JSON outputs for highly fragmented images may reach ~50 KiB each; average is closer to 5 KiB.
- After detect-profiles.py dedup: expected 1-3 profiles per (output_type, src_format) bucket. qemu-img map's output format is stable across versions (the compressed JSON field is the only known addition; need to verify the exact version range during 5b).
Disk space: ~150 MiB raw + ~5 MiB profiles ≈ 155 MiB total. Comparable to the existing measure baselines.
Runtime: ~30 minutes for the full sweep on a warm
qemu-img-binaries/ tree (per the measure precedent in
PLAN-measure phase 6).
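The counts above are plain multiplication; the short check below reproduces them and also back-computes the image count implied by the ~6,240 cells reported in the Status section. That last step assumes every one of the 80 versions produced a cell for every image in both output types, which the plan does not state explicitly.

```python
# Planning figures quoted in this document.
versions = 80
images_per_version = 44     # approximate safe-tier image count
output_types = 2            # map-human and map-json
files_per_baseline = 3      # .stdout.txt, .stderr.txt, .meta.json

raw_baselines = versions * images_per_version * output_types
small_files = raw_baselines * files_per_baseline

# Cells reported after the run (Status section), under the assumption
# that all versions cover the same set of images.
reported_cells = 6240
implied_images = reported_cells / (versions * output_types)

print(raw_baselines)    # 7040
print(small_files)      # 21120
print(implied_images)   # 39.0
```

If that assumption holds, the completed run covered 39 images per version rather than the ~44 estimated at planning time, which accounts for the gap between ~7,040 and ~6,240.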
--start-offset / --max-length window cases
The master plan called for a handful of window cases
analogous to measure's SIZE_CASES. Phase 5 ships only
the default-window baselines (no --start-offset, no
--max-length). Window-case behaviour is exercised in
phase 6's integration tests with bounded fixtures
constructed in tests/test_map.py itself — easier than
baseline-generating per-image window cases (which would
need per-image virtual-size knowledge to construct
mid-image / end-of-image / past-EOF cases sensibly).
Future work: if differential fuzzing (phase 8) surfaces
version-keyed window-handling drift, add a WINDOW_CASES
list to the generator that runs each window case under a
_window bucket (analogous to measure's _size bucket).
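As an illustration of why per-image virtual-size knowledge is needed, here is one way a WINDOW_CASES-style helper could derive the three windows from a known virtual size. The case names, the 64 KiB default window and the choice to make the past-EOF window straddle the end of the image are all assumptions; nothing like this exists in the generator today.

```python
def window_cases(virtual_size, window=64 * 1024):
    """Return (name, start_offset, max_length) tuples for one image."""
    window = min(window, virtual_size)
    return [
        # Window centred in the image.
        ('mid-image', (virtual_size - window) // 2, window),
        # Window ending exactly at the last byte.
        ('end-of-image', virtual_size - window, window),
        # Window starting inside the image and running past its end.
        ('past-eof', virtual_size - window // 2, window),
    ]


def window_args(start_offset, max_length):
    """Build the qemu-img map flags for one window."""
    return [f'--start-offset={start_offset}', f'--max-length={max_length}']


cases = window_cases(1024 * 1024)  # a 1 MiB image
for name, start, length in cases:
    print(name, window_args(start, length))
```

Every offset depends on virtual_size, so a generator-side implementation would have to read each image's size before building its commands, which is the cost the per-test fixtures in phase 6 avoid.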
Old qemu-img versions
The 80-version matrix runs from 6.0.0 to 10.2.0. Spot
checks during plan research suggest qemu-img map output is
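stable across this range modulo one known addition: the
compressed JSON field, whose introducing version was not
known at planning time (the Status section above records
map-json profile transitions at 6.0.x→6.1.x and 8.1.x→8.2.x).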
Phase 5 records whatever each version emits; the dedup
machinery handles version-keyed differences cleanly. If a
particular version segfaults or rejects an image, the
.meta.json records the non-zero exit and the integration
test in phase 6 skips that cell with a documented reason.
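A phase 6 test could turn the recorded .meta.json into a skip decision along these lines. This is a sketch only: the keys status and exit_code are assumed names, and the real meta schema should be taken from the generator.

```python
import json


def skip_reason(meta_text):
    """Return a skip reason for a baseline cell, or None if it is usable."""
    meta = json.loads(meta_text)
    if meta.get('status') == 'timeout':
        return 'qemu-img timed out when generating this baseline'
    exit_code = meta.get('exit_code')
    if exit_code is None:
        return 'baseline has no recorded exit code'
    if exit_code != 0:
        return f'qemu-img exited {exit_code} when generating this baseline'
    return None


ok_reason = skip_reason('{"status": "ok", "exit_code": 0}')
bad_reason = skip_reason('{"status": "error", "exit_code": 1}')
print(ok_reason)
print(bad_reason)
```

In a pytest-based suite the returned string would be passed to pytest.skip, so the documented reason shows up in the test report.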
Cross-version edge cases to verify during the run
During phase 5b, eyeball the output of a handful of representative versions to catch surprises:
- qemu-img 6.0.0: oldest version in the matrix; verify compressed field presence / absence.
- qemu-img 7.2.0: mid-range; sanity check.
- qemu-img 10.0.8 / 10.2.0: newest; matches the phase 4 dev target.
- One source per format: qcow2, raw, vmdk, vhd, vhdx. Confirm each format produces sensible output and that no format hits a "block driver does not support" error uniformly (those would imply we're listing the wrong supported_formats set).
- Empty image: confirm the all-zero qcow2 produces the expected one-extent present: false, zero: true line in JSON / header-only in human.
The expected results match the phase 4a fixtures I already verified against qemu-img 10.0.8.
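The empty-image check can be automated rather than eyeballed. The sketch below tests a map-json payload for the single not-present, zero extent; the sample strings are hand-written for illustration, not captured qemu-img output, and the fields other than present and zero are assumptions.

```python
import json


def is_single_hole(map_json_text):
    """True if the JSON map is exactly one extent with present=false, zero=true."""
    extents = json.loads(map_json_text)
    return (
        len(extents) == 1
        and extents[0].get('present') is False
        and extents[0].get('zero') is True
    )


# Hand-written samples (not real baselines).
expected_shape = '[{"start": 0, "length": 1048576, "present": false, "zero": true}]'
empty_list = '[]'

print(is_single_hole(expected_shape))
print(is_single_hole(empty_list))
```

A version that omits the present key altogether would also fail this check, which is the kind of surprise the spot-check is meant to surface.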
Documentation outside the generator
- instar-testdata/README.md (if it has a baseline-inventory section) gets a one-line entry for the new map baselines.
- instar-wt-map/CHANGELOG.md Unreleased / Added gets a one-line entry citing the new baselines.
- instar-wt-map/docs/plans/PLAN-map.md execution table: phase 5 row flipped from "Not started" to "Complete".
Open questions
1. Window-case baselines: defer to phase 6 (as per-test fixtures) or include in phase 5 (as additional baseline buckets)? Recommendation: defer. The master plan was permissive; per-test fixtures are easier to maintain and don't need version-keyed dedup.
2. VMDK monolithicFlat sources in the matrix: instar refuses these host-side. qemu-img map handles them. The baseline-generator runs qemu-img, not instar, so the qemu-img-side baselines record valid output; phase 6's integration test will need a skip-list for monolithicFlat sources. Recommendation: include them in the baseline run (they're in the safe-tier manifest), and document the integration-test skip in phase 6.
3. Chain images: instar refuses sources with backing files. qemu-img map walks the chain. The baseline-generator records qemu-img's chain-walking output. Recommendation: same as #2. Include them in the baseline run; the integration test skips chain sources with a documented reason pointing at the chain follow-up.
4. Profile naming: existing profiles use sha256-prefix names like profile-a3f4e2d8. Map follows the same convention; no special handling needed.
5. Empty (zero-extent) JSON output: an all-zero qcow2 emits one extent [{ ..., "present": false, "zero": true, ... }]. This is well-defined; just confirm during 5b that older qemu-img versions don't emit [] instead.
6. Run-the-generator host: the script is heavy enough that it warrants a beefy host. Same as measure's precedent: run on the dev box, commit the result, push.
7. Re-generating after a format-detection change: if instar's format detection changes (e.g. better raw/vhd disambiguation), the baseline-generator's image-format assignments don't change (they come from the manifest, which is authored separately). No re-run needed unless the manifest changes.
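For readers unfamiliar with the dedup step, the idea behind the sha256-prefix profile names can be sketched as follows: versions whose outputs are byte-identical across all images in a bucket collapse into one profile. This is an illustrative reimplementation, not detect-profiles.py itself; the function name, the input shape and the exact hashing scheme are assumptions.

```python
import hashlib
from collections import defaultdict


def detect_profiles(outputs_by_version):
    """Group versions with byte-identical outputs.

    outputs_by_version maps version -> {image_id: stdout bytes}.
    Returns {profile_name: [versions]}.
    """
    groups = defaultdict(list)
    for version, outputs in outputs_by_version.items():
        digest = hashlib.sha256()
        for image_id in sorted(outputs):
            # Hash the image id and a per-file digest so that moving bytes
            # between files cannot produce the same profile hash.
            digest.update(image_id.encode())
            digest.update(hashlib.sha256(outputs[image_id]).digest())
        groups[f'profile-{digest.hexdigest()[:8]}'].append(version)
    return dict(groups)


# Toy bucket: the second and third versions emit identical output.
bucket = {
    '6.0.0': {'empty': b'old shape\n'},
    '6.1.0': {'empty': b'new shape\n'},
    '7.2.0': {'empty': b'new shape\n'},
}
profiles = detect_profiles(bucket)
for name, versions in profiles.items():
    print(name, versions)
```

With this grouping, a format that is stable across the whole range yields a single profile, and each version-keyed change adds exactly one more.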
Execution
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 5a (instar-testdata) | medium | sonnet | none | Extend instar-testdata/scripts/generate-baselines.py with the 'map' COMMANDS entry per the schema in the Architecture section. Add a generate_map_baseline(binary, version, image, images_root, output_dir, output_format, timeout=30) helper modelled on generate_measure_source_baseline (line ~946) but without the target_format axis. Add an elif command_name == 'map': dispatch branch in the main loop (around line 2334, next to the measure branch) that iterates over output_types × images, calling generate_map_baseline. Update the --command argparse choices and the docstring's command list to include map. Extend instar-testdata/scripts/detect-profiles.py with MAP_OUTPUT_TYPES = ['map-human', 'map-json'] added to OUTPUT_TYPES and MULTI_BUCKET_TYPES. Smoke-test with ./scripts/generate-baselines.py --command map --version 10.0.0 and confirm the directory structure (expected-outputs/map-{human,json}/<src_format>/10.0.0/<image>.{stdout.txt,stderr.txt,meta.json}) is created with sensible content. Commit to instar-testdata as one commit: scripts: add map command to baseline generator (PLAN-map phase 5a). |
| 5b (instar-testdata) | low | sonnet | none | Run the full sweep: ./scripts/generate-baselines.py --command map (no version filter — exercises all 80 binaries) followed by ./scripts/detect-profiles.py --output-type map-human and --output-type map-json. Expected runtime ~30 minutes. Spot-check the generated profiles against a representative set of versions (6.0.0, 7.2.0, 10.0.8, 10.2.0 — see "Cross-version edge cases" in the Architecture section). Commit to instar-testdata as one commit: expected-outputs: add map baselines for 80 qemu-img versions (PLAN-map phase 5b). Expected commit size: ~7,000 raw baseline files + a handful of profile directories. Disk usage: ~155 MiB. Low effort because: mechanical run-and-commit; the script has been smoke-tested in 5a. If a particular version's output is surprising (e.g. unexpected segfault on a specific image), capture in the commit message and proceed. |
| 5c (instar-wt-map) | low | sonnet | none | Update docs/plans/PLAN-map.md execution table to flip phase 5's status to Complete. Update CHANGELOG.md Unreleased / Added with one line citing the new map baselines in instar-testdata. If 5b surfaced any version-keyed divergences not already documented in phase 4c's quirks, add them to docs/quirks.md's map section. Run pre-commit run --all-files. Commit to instar (this worktree) as: map: close out phase 5 of PLAN-map (cross-version baselines). |
Total: 3 commits across two repositories.
Why no high-effort step
Phase 5 is entirely mechanical extension of an existing generator. The schema is well-understood from measure / create / resize / rebase / commit. The bulk-data commit needs visual sanity-checking but no judgement calls — if a particular cell errors out, the meta.json records it and phase 6 handles the skip.
Out of scope for phase 5
- Integration tests against the baselines (phase 6).
- Window-case (--start-offset / --max-length) baselines (deferred to phase 6 per-test fixtures).
- Coverage-guided fuzz harness updates (phase 7).
- Differential fuzz against qemu-img map (phase 8).
- New testdata fixtures specifically for map (the safe-tier manifest already covers the formats we need).
- Output-profile machinery additions in instar's VMM (phase 4 deferred this to phase 5; phase 5 confirms whether any is needed; based on dev-machine spot checks, none is expected).
Success criteria
- instar-testdata/scripts/generate-baselines.py --command map --help lists the new map command.
- instar-testdata/scripts/detect-profiles.py --output-type map-human runs cleanly against the generated raw data.
- instar-testdata/expected-outputs/map-{human,json}/ directories exist with one bucket per source format and one version directory per qemu-img binary.
- Profile directories exist under expected-outputs/map-{human,json}/<src_format>/profiles/ with a small (1-3) count of profile-hash directories per bucket.
- Spot-check on a handful of cells matches qemu-img map's output for those versions (eyeball comparison during 5b).
- The instar-testdata commits push cleanly to GitLab.
- instar-wt-map's PLAN-map.md table and CHANGELOG reflect phase 5 completion.
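The profile-count criterion lends itself to a quick scripted check. The sketch below walks the layout described above and flags buckets whose profile count falls outside 1-3. The helper names are made up for illustration, and the demo builds a throwaway tree instead of touching the real expected-outputs directory.

```python
import tempfile
from pathlib import Path


def profile_counts(expected_outputs, output_types=('map-human', 'map-json')):
    """Count profile directories per (output_type, src_format) bucket."""
    counts = {}
    for output_type in output_types:
        type_dir = Path(expected_outputs) / output_type
        if not type_dir.is_dir():
            continue
        for bucket in sorted(p for p in type_dir.iterdir() if p.is_dir()):
            profiles = bucket / 'profiles'
            found = 0
            if profiles.is_dir():
                found = sum(1 for p in profiles.iterdir() if p.is_dir())
            counts[(output_type, bucket.name)] = found
    return counts


def out_of_range(counts, low=1, high=3):
    """Return the buckets whose profile count is not within [low, high]."""
    return {key: n for key, n in counts.items() if not low <= n <= high}


# Throwaway tree with one healthy bucket and one bucket without profiles.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'expected-outputs'
    (root / 'map-json' / 'qcow2' / 'profiles' / 'profile-00000001').mkdir(parents=True)
    (root / 'map-json' / 'qcow2' / 'profiles' / 'profile-00000002').mkdir()
    (root / 'map-human' / 'raw' / 'profiles').mkdir(parents=True)
    counts = profile_counts(root)
    suspicious = out_of_range(counts)

print(counts)
print(suspicious)
```

A bucket with zero profiles usually means the dedup step was not run for that output type; a bucket with many profiles points at output that varies more across versions than expected.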
Risks and mitigations
- Old qemu-img versions reject --output=json: some qemu-img versions may not support the flag. Mitigation: the generator records the non-zero exit; phase 6 skips comparison for cells where the qemu-img-side baseline shows a non-zero exit. Same pattern as the existing measure baselines (see KNOWN_MEASURE_VERSION_SKIPS in tests/test_measure.py).
- qemu-img map segfaults / hangs on a specific image: the generator's per-baseline timeout (30s) caps hangs; segfaults appear as non-zero exits. Both are recorded verbatim. If a known qemu CVE affects a specific version-image combination, document in the commit message and let phase 6 skip.
- Disk usage spike during the run: ~155 MiB peak is well within budget. Mitigation: run on the dev box; baseline storage doesn't compete with anything.
- Version-keyed format drift: the compressed field was added at some unknown version. The dedup machinery surfaces this as 2+ profiles per (output_type, src_format) bucket; phase 6's integration test selects the right profile by qemu-img version. No special handling needed in the generator.
- GitLab push size: a ~7,000-file commit may be slow but is well-trodden ground (measure / create / resize / rebase / commit all pushed similar volumes).
- Test-image manifest changes mid-run: the manifest pinning is at the testdata-repo level. If a new image is added between 5a and 5b, re-run 5b. No special handling needed.
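Selecting a profile by qemu-img version, as the format-drift item describes, could look roughly like the sketch below. The boundaries follow the two map-json transitions recorded in the Status section; the profile names are placeholders, and how phase 6 actually stores the version-to-profile mapping is not specified in this plan.

```python
def parse_version(text):
    """Turn '8.2.0' into (8, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split('.'))


# (first version that uses the profile, placeholder profile name)
PROFILE_STARTS = [
    ((6, 0, 0), 'profile-aaaaaaaa'),
    ((6, 1, 0), 'profile-bbbbbbbb'),
    ((8, 2, 0), 'profile-cccccccc'),
]


def select_profile(version_text, starts=PROFILE_STARTS):
    """Return the profile for a version, or None if it predates the matrix."""
    version = parse_version(version_text)
    chosen = None
    for first, name in starts:  # starts must be sorted ascending
        if version >= first:
            chosen = name
    return chosen


for v in ('6.0.1', '7.2.0', '10.2.0', '5.2.0'):
    print(v, select_profile(v))
```

Comparing integer tuples avoids the string-comparison trap where '10.2.0' sorts before '6.0.0'.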
Back brief
Before executing any step, the executing agent should
back-brief: which repository the step affects
(instar-testdata for 5a / 5b, instar for 5c), which existing
function is the closest template
(generate_measure_source_baseline for 5a's helper,
expected-outputs/measure-*/ for 5b's directory layout), and the
runtime budget for 5b (~30 min). The reviewer should
verify that 5a's smoke-test on one version produced a
plausibly-shaped output before running the full sweep,
that 5b's commit includes both raw and profile
directories, and that 5c references the right
testdata-repo commit hash in the CHANGELOG line.