PLAN-resize followup 01: targeted refcount-block pre-pass
Prompt
Before responding to questions or discussion points in this
document, explore the instar codebase thoroughly. Read the
qcow2 resize planner (src/crates/resize/src/qcow2.rs —
especially plan_grow, plan_l1_grow,
plan_l1_and_refcount_grow, ensure_block_staged,
block_offset_in_file, stage_increment,
stage_decrement). Read the current guest pre-pass at
src/operations/resize/src/main.rs:382-420 and the
EXISTING_STATE_LIMIT carve-up immediately above.
Where a question touches on external concepts (qcow2 refcount-table layout, refcount-block coverage math), research as needed to give a confident answer. Flag any uncertainty explicitly rather than guessing.
This is a follow-up to PLAN-resize.md. It addresses a real-world limitation surfaced by the automated reviewer on PR #326 — the guest's "stage every refcount block" pre-pass imposes an image-size ceiling proportional to cluster size, hitting ~128 GiB at the common 64 KiB cluster size.
Mission
Replace the guest's stage-every-refcount-block pre-pass for
qcow2 grow with a targeted pre-pass that stages only the
specific refcount blocks the chosen grow flavour will modify.
The planner already has the right contract
(ensure_block_staged returns ScratchTooSmall if a needed
block isn't present in existing_refcount_block_indices), so
the entire change is on the guest side plus a small new helper
in the planner crate to expose the block-identification logic.
Post-fix, the qcow2 resize image-size ceiling at the default 64 KiB cluster lifts from ~128 GiB to multi-PB (bounded only by what the filesystem can store). Real-world cloud workloads (1 TiB+ qcow2 disks growing by some fixed delta) work end-to-end.
Out of scope: the qcow2 shrink path. Shrink stages L2 tables, not refcount blocks; its staging is already targeted (only the L2 tables covering the discard range) and bounded by a separate cap (256 L2 tables = ~512 GiB of discardable range per operation). Lifting that cap is a different problem with its own design space and is queued as a separate item under PLAN-resize.md Future work.
What the survey turned up
- Current stage-everything pre-pass (`src/operations/resize/src/main.rs:382-420`): walks the refcount table, collects every non-zero entry's `block_idx` into a `block_indices` array (cap 1024), then reads each block at `cluster_size` bytes into `EXISTING_STATE`. The cumulative byte cost is `non_zero_block_count * cluster_size`, capped at `EXISTING_STATE_LIMIT = 4 MiB`.
- Block-coverage math:
    - `entries_per_refblock = cluster_size * 8 / refcount_bits`
    - `bytes_per_block = entries_per_refblock * cluster_size`
    - At cluster=64 KiB, refcount_bits=16: `2^15 entries * 64 KiB = 2 GiB per block`
- Image-size ceiling: `EXISTING_STATE_LIMIT / cluster_size * bytes_per_block = 4 MiB / 64 KiB * 2 GiB = 128 GiB`
- `Qcow2ResizeOpts` carries `existing_refcount_block_bytes: &[u8]` (flat concatenation in `existing_refcount_block_indices` order) plus `existing_refcount_block_indices: &[u64]`. The planner's `ensure_block_staged(opts, block_idx, ...)` linear-scans `existing_refcount_block_indices` for the requested `block_idx` and returns `ScratchTooSmall` if absent.
- What `plan_l1_grow` actually demands (`src/crates/resize/src/qcow2.rs:202-292`): registers an increment patch for every cluster in `[new_l1_first_cluster, new_l1_last_cluster]` (the new L1 region, appended at EOF) and a decrement patch for every cluster in `[old_l1_first_cluster, old_l1_last_cluster]` (the freed old L1 region). For each registered patch the planner calls `ensure_block_staged(block_idx = cluster / entries_per_refblock)`. Worst case is ~1-2 distinct blocks on each side; the same block can appear on both sides (rare overlap).
- What `plan_l1_and_refcount_grow` demands (similar reading at `src/crates/resize/src/qcow2.rs:605` onward; full enumeration needed during 12a): same as L1Grow plus blocks containing the new refcount-table region. The new refcount-block clusters themselves are written from scratch (no existing block to stage; `synthetic_layout` builds them fresh).
- What `plan_header_only` demands: nothing. Header rewrite only, no refcount mutation.
- `decide_action`: takes scalar fields (cluster_size, refcount_bits, current/new virtual_size, current_l1_entries, current_refcount_table_clusters) and returns the grow flavour. No refcount-block data is needed to make the decision. This is the structural pivot that makes targeted staging possible: the guest can call `decide_action` before reading any refcount blocks.
- `ensure_block_staged` contract (`src/crates/resize/src/qcow2.rs:1452-1475`): returns `Err(ScratchTooSmall)` on a miss. The hint at lines 1471-1473 ("Block not staged. The guest's pre-pass should have caught this; surface as ScratchTooSmall so the host can retry with a wider stage list") is forward-looking: the guest can retry, but with the targeted approach we expect this to never fire in practice.
- The shrink path stages L2 tables (not refcount blocks) into `EXISTING_STATE` after the refcount-block region. Capped at 256 L2 tables via a separate guard.
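The coverage arithmetic in the survey can be checked with a small sketch. The function names are illustrative and are not taken from the instar planner; the ceiling ignores the space the L1 and refcount table themselves occupy in `EXISTING_STATE`, so treat it as an upper bound.

```rust
/// Number of refcount entries held by one refcount block.
/// Assumes `refcount_bits` is a non-zero power of two, as qcow2 requires.
pub fn entries_per_refblock(cluster_size: u64, refcount_bits: u64) -> u64 {
    cluster_size * 8 / refcount_bits
}

/// Bytes of image file whose refcounts live in a single refcount block.
pub fn bytes_per_block(cluster_size: u64, refcount_bits: u64) -> u64 {
    entries_per_refblock(cluster_size, refcount_bits) * cluster_size
}

/// Largest file size whose refcount blocks all fit in `state_limit` bytes
/// when every block is staged (the behaviour this plan replaces).
pub fn stage_everything_ceiling(
    state_limit: u64,
    cluster_size: u64,
    refcount_bits: u64,
) -> u64 {
    (state_limit / cluster_size) * bytes_per_block(cluster_size, refcount_bits)
}
```

With a 4 MiB limit, 64 KiB clusters and 16-bit refcounts this gives 64 blocks of 2 GiB each, i.e. the 128 GiB ceiling. The ceiling simplifies to `state_limit * cluster_size * 8 / refcount_bits`, which is why it scales linearly with cluster size.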
Algorithmic design
New planner helper: compute_grow_action_and_required_blocks
Public function on the qcow2 planner module that the guest calls before staging refcount blocks:
    pub fn compute_grow_action_and_required_blocks(
        opts: &Qcow2ResizeGrowQuery,
    ) -> Result<GrowPlan, ResizeError>;
Where Qcow2ResizeGrowQuery is a struct of scalar fields the
guest already has from parsed: QcowHeader:
    pub struct Qcow2ResizeGrowQuery {
        pub cluster_size: u32,
        pub refcount_bits: u8,
        pub extended_l2: bool,
        pub current_virtual_size: u64,
        pub new_virtual_size: u64,
        pub current_file_size: u64,
        pub current_l1_entries: u32,
        pub current_l1_table_offset: u64,
        pub current_refcount_table_clusters: u32,
        pub current_incompatible_features: u64,
    }
And GrowPlan is:
    pub struct GrowPlan {
        pub action: Qcow2GrowAction, // HeaderOnly / L1Grow / L1AndRefcountGrow
        pub required_blocks: ArrayVec<u64, MAX_REQUIRED_BLOCKS>,
    }
MAX_REQUIRED_BLOCKS is small (8 is generous — L1Grow needs ≤
4, L1AndRefcountGrow ≤ 6 even with adversarial layouts).
This function does no I/O, allocates nothing, and is a pure function of the scalar inputs. Same restrictions as the existing planner.
Per-flavour block-identification rules
| Action | Blocks needed |
|---|---|
| HeaderOnly | (none) |
| L1Grow | distinct block_idxs for clusters in [new_l1_first_cluster, new_l1_last_cluster] ∪ [old_l1_first_cluster, old_l1_last_cluster] |
| L1AndRefcountGrow | L1Grow's set ∪ distinct block_idxs for clusters in [new_refcount_table_first_cluster, new_refcount_table_last_cluster] |
The cluster ranges are computable from the scalar inputs only:
    new_l1_size_bytes    = new_l1_entries * 8
    new_l1_clusters      = ceil(new_l1_size_bytes / cluster_size).max(1)
    new_l1_first_cluster = current_file_size / cluster_size
    new_l1_last_cluster  = new_l1_first_cluster + new_l1_clusters - 1
    old_l1_first_cluster = current_l1_table_offset / cluster_size
    old_l1_size_bytes    = current_l1_entries * 8
    old_l1_clusters      = ceil(old_l1_size_bytes / cluster_size).max(1)
    old_l1_last_cluster  = old_l1_first_cluster + old_l1_clusters - 1
    entries_per_refblock = cluster_size * 8 / refcount_bits
    block_idx(c)         = c / entries_per_refblock
For L1AndRefcountGrow, the new refcount-table region offset
follows the new L1 region (see synthetic_layout_after_*
helpers — phase 12a must read these and document the exact
formula in the helper). Same arithmetic shape.
The returned required_blocks is the distinct-block-idx
union of the per-cluster lookups. Dedupe is done in the
helper (small fixed-size array; linear scan is fine).
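The L1Grow row of the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the planner's implementation: `BlockSet` is a hypothetical allocation-free stand-in for `ArrayVec`, `l1_grow_required_blocks` is an invented name, `new_l1_entries` is taken as an input rather than derived from the virtual size, and `current_file_size` is assumed to be cluster-aligned.

```rust
/// Working cap from the plan; the final value is still an open question.
pub const MAX_REQUIRED_BLOCKS: usize = 8;

/// Fixed-capacity, allocation-free list of distinct block indices.
/// Hypothetical stand-in for the `ArrayVec` used in the plan.
#[derive(Debug, Default)]
pub struct BlockSet {
    items: [u64; MAX_REQUIRED_BLOCKS],
    len: usize,
}

impl BlockSet {
    /// Insert with linear-scan dedupe; returns false when the cap is hit.
    pub fn insert(&mut self, idx: u64) -> bool {
        if self.items[..self.len].contains(&idx) {
            return true;
        }
        if self.len == MAX_REQUIRED_BLOCKS {
            return false;
        }
        self.items[self.len] = idx;
        self.len += 1;
        true
    }

    pub fn as_slice(&self) -> &[u64] {
        &self.items[..self.len]
    }
}

/// Inclusive cluster range of an L1 table with `entries` 8-byte entries at `offset`.
fn l1_cluster_range(offset: u64, entries: u64, cluster_size: u64) -> (u64, u64) {
    let clusters = (entries * 8).div_ceil(cluster_size).max(1);
    let first = offset / cluster_size;
    (first, first + clusters - 1)
}

/// Distinct refcount blocks touched by an L1Grow: the new L1 region (appended
/// at EOF) followed by the freed old L1 region. Returns None on cap overflow.
pub fn l1_grow_required_blocks(
    cluster_size: u64,
    refcount_bits: u64,
    current_file_size: u64,
    current_l1_table_offset: u64,
    current_l1_entries: u64,
    new_l1_entries: u64,
) -> Option<BlockSet> {
    let entries_per_refblock = cluster_size * 8 / refcount_bits;
    let ranges = [
        l1_cluster_range(current_file_size, new_l1_entries, cluster_size),
        l1_cluster_range(current_l1_table_offset, current_l1_entries, cluster_size),
    ];
    let mut set = BlockSet::default();
    for &(first, last) in ranges.iter() {
        // A contiguous cluster range maps to a contiguous block-index range.
        for block in (first / entries_per_refblock)..=(last / entries_per_refblock) {
            if !set.insert(block) {
                return None;
            }
        }
    }
    Some(set)
}
```

For a 200 GiB file at 64 KiB clusters with the old L1 at offset `0x30000`, the new L1 lands in block 100 and the old one in block 0, so only two blocks need staging regardless of how many refcount blocks the image has.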
Updated guest pre-pass
    // 1. Read L1 + refcount-table into EXISTING_STATE (small).
    // 2. Compute grow plan via the new helper.
    let grow_plan = qcow2::compute_grow_action_and_required_blocks(&query)?;
    // 3. Stage exactly the blocks the planner will need.
    let mut block_indices: [u64; MAX_REQUIRED_BLOCKS] = [0; MAX_REQUIRED_BLOCKS];
    let count = grow_plan.required_blocks.len();
    block_indices[..count].copy_from_slice(&grow_plan.required_blocks);
    let blocks_off = rt_end;
    let blocks_total = count * cluster_size;
    // blocks_total is bounded by MAX_REQUIRED_BLOCKS * cluster_size.
    // At cluster_size = 64 KiB that is 8 * 64 KiB = 512 KiB, which fits
    // comfortably in EXISTING_STATE_LIMIT. At the 2 MiB cluster_size cap
    // it is 8 * 2 MiB = 16 MiB, which does not fit (see the
    // EXISTING_STATE_LIMIT section for that known limit).
    debug_assert!(blocks_off + blocks_total <= EXISTING_STATE_LIMIT);
    for (slot, &block_idx) in block_indices[..count].iter().enumerate() {
        let block_file_off = block_offset_in_table(rt_slice, block_idx);
        read_byte_range(call_table, sector_size, block_file_off,
                        state_base.add(blocks_off + slot * cluster_size),
                        cluster_size)?;
    }
    // 4. Call plan_resize_qcow2 with the targeted stage list.
    // 5. Apply patches as before.
The L2-staging logic for shrink stays unchanged.
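Step 3 depends on turning a block index into a file offset via the staged refcount table. The sketch below shows one way that lookup could work, based on the qcow2 on-disk format (8-byte big-endian refcount-table entries, with bits 9-63 holding the block's file offset and zero meaning "not allocated"). The name `refblock_file_offset` is hypothetical; it is not the guest's `block_offset_in_table`, whose actual signature and error handling may differ.

```rust
use std::convert::{TryFrom, TryInto};

/// Mask selecting bits 9..=63 of a refcount-table entry (the block's file offset).
const REFTABLE_OFFSET_MASK: u64 = 0xffff_ffff_ffff_fe00;

/// Look up the file offset of refcount block `block_idx` in a staged
/// refcount table. Returns None if the index is out of range or the
/// entry is zero (block not allocated).
pub fn refblock_file_offset(rt_slice: &[u8], block_idx: u64) -> Option<u64> {
    let start = usize::try_from(block_idx).ok()?.checked_mul(8)?;
    let end = start.checked_add(8)?;
    let raw: [u8; 8] = rt_slice.get(start..end)?.try_into().ok()?;
    let offset = u64::from_be_bytes(raw) & REFTABLE_OFFSET_MASK;
    if offset == 0 {
        None
    } else {
        Some(offset)
    }
}
```

Returning `None` for an unallocated entry lets the caller decide whether a missing block is an error for the chosen grow flavour instead of reading from offset 0.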
EXISTING_STATE_LIMIT stays at 4 MiB
After this change, the actual peak usage at the default cluster size is:
- L1 region: ≤ 64 KiB (1 cluster) for any virtual size up to 4 TiB; ≤ 128 KiB up to 8 TiB; ≤ 256 KiB up to 16 TiB. Each 8-byte L1 entry maps one L2 table, which covers 8192 × 64 KiB = 512 MiB with standard (non-extended) L2 entries.
- Refcount-table region: similar order.
- ≤ 6 refcount blocks × 64 KiB = 384 KiB.
- Total: well under 1 MiB at default cluster.
At pathological cluster sizes the worst case no longer fits: 6 blocks ×
2 MiB = 12 MiB at 2 MiB clusters would exceed EXISTING_STATE_LIMIT, and
so would 6 blocks × 1 MiB = 6 MiB at 1 MiB clusters. The 2 MiB
combination is already blocked by other limits
(QCOW2_MAX_RESIZE_SCRATCH at 32 MiB), and the differential
fuzz picker filters cluster_size=2 MiB anyway. Document the
remaining cluster_size ≥ 1 MiB + multi-TB grow edge case
as a known limit rather than working around it.
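The budget arithmetic in this section can be reproduced with a short sketch. The function names are illustrative; `l1_table_bytes` assumes 8-byte L1 entries and standard (non-extended) 8-byte L2 entries, and the refcount-table size is passed in rather than derived.

```rust
pub const EXISTING_STATE_LIMIT: u64 = 4 * 1024 * 1024;

/// Bytes of L1 table, rounded up to whole clusters, for a given virtual size.
/// Assumes 8-byte L1 entries and standard (non-extended) 8-byte L2 entries.
pub fn l1_table_bytes(virtual_size: u64, cluster_size: u64) -> u64 {
    let bytes_per_l1_entry = cluster_size * (cluster_size / 8);
    let entries = virtual_size.div_ceil(bytes_per_l1_entry);
    (entries * 8).div_ceil(cluster_size).max(1) * cluster_size
}

/// Total bytes staged into EXISTING_STATE, or None on arithmetic overflow.
pub fn staged_bytes(
    l1_bytes: u64,
    refcount_table_bytes: u64,
    blocks: u64,
    cluster_size: u64,
) -> Option<u64> {
    blocks
        .checked_mul(cluster_size)?
        .checked_add(l1_bytes)?
        .checked_add(refcount_table_bytes)
}

/// True when the staged state fits within EXISTING_STATE_LIMIT.
pub fn fits_existing_state(
    l1_bytes: u64,
    refcount_table_bytes: u64,
    blocks: u64,
    cluster_size: u64,
) -> bool {
    match staged_bytes(l1_bytes, refcount_table_bytes, blocks, cluster_size) {
        Some(total) => total <= EXISTING_STATE_LIMIT,
        None => false,
    }
}
```

At 64 KiB clusters, one cluster each of L1 and refcount table plus six blocks totals 512 KiB and fits; the same shape at 2 MiB clusters totals 16 MiB and does not.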
Naming / file layout
- New helper lives in `src/crates/resize/src/qcow2.rs`, exported via `src/crates/resize/src/lib.rs`. Following the pattern of `Preallocation`, `Qcow2ResizeOpts`, etc.
- New types: `Qcow2ResizeGrowQuery`, `GrowPlan`, `Qcow2GrowAction` (the latter already exists internally as `decide_action`'s return type; promote to `pub`).
- The internal `decide_action` becomes the implementation detail behind `compute_grow_action_and_required_blocks`.
Test surface
Unit tests in src/crates/resize/src/qcow2.rs::tests:
- compute_grow_action_and_required_blocks for each grow
flavour at default cluster size — verify action +
expected block-idx set.
- Same at 4 KiB and 1 MiB cluster — confirm block-idx math
scales correctly.
- 1 TiB virtual size grow at 64 KiB cluster — confirms the
required-block set stays bounded (≤ 4 blocks).
- HeaderOnly: required_blocks is empty.
- Forward-compat assert: returned required_blocks.len() <=
MAX_REQUIRED_BLOCKS.
Integration tests in src/crates/resize/tests/:
- A new qcow2_grow_large.rs file with a fabricated 500 GiB
qcow2 fixture (header + L1 + refcount-table only; no data
region). Grow to 1 TiB. Re-parse header to verify
virtual_size = new. Demonstrates the limit is lifted.
- Edge case: virtual_size just past the old 128 GiB ceiling
(e.g. 130 GiB) to verify the regression boundary.
Python integration test in tests/test_resize.py:
- Add TestResizeLargeImages with 1-2 cases that create a
qcow2 of e.g. 200 GiB (small file, large virtual via
sparse), resize to 256 GiB, verify info JSON. Conditional
on filesystem support — skip when df reports < a few
MiB free at the test tmpdir.
Fuzz coverage: drop the 40-bit size clamp in
fuzz_resize_planners.rs for the qcow2 branch (or lift it
from 40 to 56 bits = ~64 PiB). The planner-side
defensive-input gap that motivated the clamp is still relevant for
the other formats, so the clamp stays for vhd/vhdx/vmdk.
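As a rough illustration of what lifting the clamp means numerically, the sketch below masks a fuzzer-supplied size to a given bit width. This is an assumption about the clamp's shape: the actual harness may clamp with `min` or a modulo instead, and `clamp_size_bits` is a hypothetical name.

```rust
/// Keep only the low `bits` bits of a fuzzer-supplied size.
/// Hypothetical helper; the real harness may clamp differently.
pub fn clamp_size_bits(raw: u64, bits: u32) -> u64 {
    if bits >= 64 {
        raw
    } else {
        raw & ((1u64 << bits) - 1)
    }
}
```

A 40-bit mask keeps sizes below 1 TiB (2^40 bytes), while a 56-bit mask allows anything below 64 PiB (2^56 bytes), which is the range the targeted pre-pass is meant to open up for qcow2.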
Differential fuzz: remove the size ceiling implicit in
the _resize_option_picker for qcow2 (currently caps at
64 MiB virtual). Bump qcow2 picker to include 200 MiB or
1 GiB sizes so the differential surface exercises the new
code path against qemu.
Public API delta
- New `pub fn compute_grow_action_and_required_blocks` in `crates/resize`.
- New `pub struct Qcow2ResizeGrowQuery`, `pub struct GrowPlan`, `pub enum Qcow2GrowAction`.
- Internal `decide_action` and per-flavour helpers may need to become `pub(crate)` if they aren't already; verify during implementation.
Qcow2ResizeOpts shape and plan_resize_qcow2 signature
are unchanged. The old planner contract still works for
backwards compat: a caller that wants to stage everything
(e.g. a test or a fallback path) can still do so and
ensure_block_staged will accept it.
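The backwards-compatibility claim follows from the lookup being a membership test over the staged index list: any superset of the required blocks satisfies it. A minimal sketch of that contract, with a hypothetical `StageError` and `staged_block` standing in for the planner's own error type and `ensure_block_staged`:

```rust
/// Hypothetical error type standing in for the planner's `ScratchTooSmall`.
#[derive(Debug, PartialEq, Eq)]
pub enum StageError {
    ScratchTooSmall,
}

/// Find the staged bytes for `block_idx` by linear scan over the index list.
/// `bytes` is the flat concatenation of staged blocks in `indices` order.
pub fn staged_block<'a>(
    indices: &[u64],
    bytes: &'a [u8],
    cluster_size: usize,
    block_idx: u64,
) -> Result<&'a [u8], StageError> {
    let slot = indices
        .iter()
        .position(|&i| i == block_idx)
        .ok_or(StageError::ScratchTooSmall)?;
    let start = slot
        .checked_mul(cluster_size)
        .ok_or(StageError::ScratchTooSmall)?;
    let end = start
        .checked_add(cluster_size)
        .ok_or(StageError::ScratchTooSmall)?;
    bytes.get(start..end).ok_or(StageError::ScratchTooSmall)
}
```

Staging extra blocks only lengthens the scan; a miss is the only failure mode, which is what the targeted pre-pass has to rule out.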
Open questions
- Should the helper return only block indices, or also what each block is wanted for (debug aid)? Returning a richer struct is friendlier for diagnostics but adds complexity. Recommendation: indices only; if a future incident needs more context, the planner can be instrumented with `debug!` traces.
- Should `MAX_REQUIRED_BLOCKS` be tight (e.g. 8) or generous (e.g. 16)? L1Grow needs ≤ 4 in normal cases. L1AndRefcountGrow ≤ 6. Doubling is cheap (`ArrayVec` element storage is `8 × MAX` bytes: 64 bytes at 8, 128 bytes at 16). Recommendation: 16 to leave headroom for the refcount-table grow case at non-default cluster sizes; document the calculation.
- Should `compute_grow_action_and_required_blocks` handle the shrink case too? No. Shrink needs L2-table staging, not refcount-block staging. The query function is grow-specific. Shrink will get its own targeted pre-pass in a separate follow-up if/when needed.
- Two-pass safety net? Should the guest, if `plan_resize_qcow2` returns `ScratchTooSmall`, fall back to staging all blocks and retry? Recommendation: no. The targeted helper is the single source of truth. If it's wrong, the integration test should catch it; the fuzz harness too. A silent fallback masks bugs.
- Naming: `compute_grow_action_and_required_blocks` is long. `prepare_grow` is shorter but vague. `qcow2_grow_query` matches the input type. Recommendation: `compute_qcow2_grow_query` (mirrors the input name), or simply `qcow2_grow_query`. Decide at implementation time; the Execution and Success criteria sections provisionally use `compute_qcow2_grow_query`.
- Should we add a `compute_*` for vhd / vhdx / vmdk for symmetry? No. Those formats stage scalar header fields only, not refcount blocks. Their pre-passes don't have this issue.
- Should the limit lift be gated behind a host CLI flag in case the new code regresses on some workload? No. The new pre-pass is strictly more conservative (stages fewer bytes, never more). If anything regresses, it'll be a `ScratchTooSmall` for a case the helper missed: surfaced loudly, easy to fix.
Execution
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 01a | medium | opus | none | Refactor decide_action and the internal grow-flavour helpers into a clean compute_qcow2_grow_query(query: &Qcow2ResizeGrowQuery) -> Result<GrowPlan, ResizeError> public function in src/crates/resize/src/qcow2.rs. Promote Qcow2GrowAction to pub. New pub struct Qcow2ResizeGrowQuery and pub struct GrowPlan per the design. The helper computes the action and the distinct required_blocks set per the per-flavour rules. Unit tests cover all three flavours at default + 4 KiB + 1 MiB clusters, plus the 1 TiB-virtual case. Re-export the new types from src/crates/resize/src/lib.rs. No host or guest changes yet; the existing stage-everything path keeps working. Wave-1 audit clean. |
| 01b | medium | sonnet | none | Update the guest pre-pass at src/operations/resize/src/main.rs:382-420: call compute_qcow2_grow_query first, then stage only grow_plan.required_blocks. Delete the old MAX_RB_INDICES = 1024 scan + the cap check at line 406 (replaced by the bounded required_blocks.len() <= MAX_REQUIRED_BLOCKS). The L2-staging logic for shrink stays unchanged. Build + test-rust + test-integration clean. Smoke a 200 GiB qcow2 grow manually (create sparse, resize, verify info). |
| 01c | small | sonnet | none | Add src/crates/resize/tests/qcow2_grow_large.rs exercising a 500 GiB → 1 TiB grow on a fabricated qcow2 fixture (header + L1 + refcount-table + a few refcount blocks; no data region). Verify the planner accepts the targeted block stage list and emits a correct plan. Add TestResizeLargeImages to tests/test_resize.py with a 200 GiB → 256 GiB end-to-end case (skipif filesystem can't accommodate the small physical file). |
| 01d | small | sonnet | none | Fuzz updates: relax the size clamp in fuzz_resize_planners.rs for the qcow2 branch (40 → 56 bits, or branch-specific); bump the qcow2 picker in differential-fuzz.py to include 200 MiB / 1 GiB sizes (still below qemu-img-resize runtime limits but exercising the new pre-pass with non-trivial L1 sizes). Run a 5-minute coverage smoke + 200-iteration differential smoke at fixed seed; both clean. |
| 01e | low | sonnet | none | Doc + master-plan housekeeping. Update docs/resize.md and docs/quirks.md: remove the "~128 GiB ceiling" note (now stale); add a one-line note that the qcow2 grow limit is now bounded only by the filesystem. Remove the matching item from docs/plans/PLAN-resize.md's ## Future work section (or move to a "Lifted" subsection). Add this followup plan's status to docs/plans/index.md. Commit. |
Out of scope for this followup
- Shrink-path L2 staging cap (separate problem, separate design space; queued in PLAN-resize.md Future work).
- Per-format data-region preallocation parity (separate Future-work item; orthogonal).
- Bumping `EXISTING_STATE_LIMIT` (the targeted pre-pass removes the need for this).
- Adding similar targeted pre-passes to vhd / vhdx / vmdk (those formats don't stage refcount-block-equivalents; their pre-passes are already small and scalar).
Success criteria
- `compute_qcow2_grow_query` exposed; unit tests pass for all three flavours at three cluster sizes (default, 4 KiB, 1 MiB).
- Guest pre-pass uses the targeted stage list; the old `MAX_RB_INDICES = 1024` scan is gone.
- `make test-rust` + `make test-integration` clean.
- New `qcow2_grow_large.rs` integration test passes (500 GiB → 1 TiB).
- New `TestResizeLargeImages` passes when filesystem has room; skips cleanly otherwise.
- Manual smoke: `instar create -f qcow2 t.qcow2 256G && instar resize -f qcow2 t.qcow2 512G && instar info t.qcow2` succeeds end-to-end and reports `virtual-size: 549755813888`.
- 5-minute fuzz smoke + 200-iter differential smoke clean.
- Docs updated: ceiling note removed; Future-work entry consolidated.
Sub-agent guidance
Read these files before any step:
- `src/crates/resize/src/qcow2.rs` lines 66-330 (the grow planner entry point + `plan_l1_grow`'s block-staging pattern), 600-920 (`plan_l1_and_refcount_grow` for the additional block-identification rules), and 1450-1490 (`ensure_block_staged` + `block_offset_in_file`).
- `src/operations/resize/src/main.rs` lines 39-90 (memory layout constants) and 350-475 (the pre-pass to replace).
- `src/crates/resize/src/lib.rs` (the `Qcow2ResizeOpts` shape that the helper's input mirrors).
- `tests/test_resize.py`: `TestResizeBaselineMatrix` and `TestResizeConsistency` for the test-class style to mirror for `TestResizeLargeImages`.
- This plan's §"Algorithmic design" for the exact math the helper must implement.
The management session review checklist is the same as the
PLAN-resize phase plans: per-step git diff review, smoke
before commit, escalate any planner-test divergence to the
user before papering it over.