QCOW2 Snapshot System¶
QCOW2 supports internal snapshots that capture the state of the virtual disk at a point in time. Each snapshot maintains its own L1 table, enabling efficient storage through copy-on-write.
Snapshot Table Location¶
The snapshot table is stored at snapshots_offset (header bytes 64-71).
The number of snapshots is in nb_snapshots (header bytes 60-63).
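As a minimal sketch of the two header fields above, the following reads them from a raw header buffer. The function name `read_snapshot_location` is hypothetical and not part of the qcow2 crate; the only facts it relies on are the byte ranges given above and the format's big-endian encoding.

```rust
/// Read `nb_snapshots` (header bytes 60-63) and `snapshots_offset`
/// (header bytes 64-71) from a raw qcow2 header. Both are big-endian.
/// Returns `None` when the buffer is too short to hold both fields.
/// NOTE: hypothetical helper for illustration, not the crate's API.
pub fn read_snapshot_location(header: &[u8]) -> Option<(u32, u64)> {
    let nb_raw = header.get(60..64)?;
    let off_raw = header.get(64..72)?;

    let mut nb = [0u8; 4];
    nb.copy_from_slice(nb_raw);
    let mut off = [0u8; 8];
    off.copy_from_slice(off_raw);

    Some((u32::from_be_bytes(nb), u64::from_be_bytes(off)))
}
```

A caller would typically treat `nb_snapshots == 0` as "no snapshot table" and ignore the offset in that case.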
Snapshot Table Format¶
The snapshot table is a contiguous area containing variable-length entries:
+------------------+
| Snapshot 1       |  (variable size, 8-byte aligned)
+------------------+
| Padding          |  (to 8-byte boundary)
+------------------+
| Snapshot 2       |
+------------------+
| Padding          |
+------------------+
| ...              |
+------------------+
Snapshot Header Structure¶
typedef struct QCowSnapshotHeader {
    uint64_t l1_table_offset;   // Snapshot's L1 table (cluster-aligned)
    uint32_t l1_size;           // Number of L1 entries
    uint16_t id_str_size;       // Unique ID string length
    uint16_t name_size;         // Snapshot name length
    uint32_t date_sec;          // Creation time (seconds since epoch)
    uint32_t date_nsec;         // Nanoseconds component
    uint64_t vm_clock_nsec;     // VM clock at snapshot time
    uint32_t vm_state_size;     // VM state size (32-bit, legacy)
    uint32_t extra_data_size;   // Size of extra data following header
    // Extra data follows (if extra_data_size > 0)
    // ID string follows (id_str_size bytes)
    // Name string follows (name_size bytes)
    // Padding to 8-byte boundary
} QEMU_PACKED QCowSnapshotHeader;
Size: 40 bytes (base) + extra_data + id_str + name + padding
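The layout above can be decoded field by field. The sketch below is an illustration only: `SnapshotHeader`, `parse_snapshot_header` and the length helpers are hypothetical names, and the offsets are derived by summing the field widths of the struct above (8 + 4 + 2 + 2 + 4 + 4 + 8 + 4 + 4 = 40 bytes).

```rust
/// Fixed-size part of one snapshot table entry (40 bytes, big-endian).
/// NOTE: hypothetical type for illustration, not the crate's API.
#[derive(Debug, PartialEq)]
pub struct SnapshotHeader {
    pub l1_table_offset: u64,
    pub l1_size: u32,
    pub id_str_size: u16,
    pub name_size: u16,
    pub date_sec: u32,
    pub date_nsec: u32,
    pub vm_clock_nsec: u64,
    pub vm_state_size: u32,
    pub extra_data_size: u32,
}

pub const SNAPSHOT_HEADER_LEN: usize = 40;

fn be_u16(b: &[u8], at: usize) -> u16 {
    u16::from_be_bytes([b[at], b[at + 1]])
}

fn be_u32(b: &[u8], at: usize) -> u32 {
    u32::from_be_bytes([b[at], b[at + 1], b[at + 2], b[at + 3]])
}

fn be_u64(b: &[u8], at: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[at..at + 8]);
    u64::from_be_bytes(a)
}

/// Decode the fixed header; `None` if fewer than 40 bytes are available.
pub fn parse_snapshot_header(buf: &[u8]) -> Option<SnapshotHeader> {
    if buf.len() < SNAPSHOT_HEADER_LEN {
        return None;
    }
    Some(SnapshotHeader {
        l1_table_offset: be_u64(buf, 0),
        l1_size: be_u32(buf, 8),
        id_str_size: be_u16(buf, 12),
        name_size: be_u16(buf, 14),
        date_sec: be_u32(buf, 16),
        date_nsec: be_u32(buf, 20),
        vm_clock_nsec: be_u64(buf, 24),
        vm_state_size: be_u32(buf, 32),
        extra_data_size: be_u32(buf, 36),
    })
}

impl SnapshotHeader {
    /// Entry length without trailing padding: header + extra + id + name.
    pub fn unpadded_len(&self) -> u64 {
        SNAPSHOT_HEADER_LEN as u64
            + self.extra_data_size as u64
            + self.id_str_size as u64
            + self.name_size as u64
    }

    /// Entry length rounded up to the next 8-byte boundary.
    pub fn padded_len(&self) -> u64 {
        (self.unpadded_len() + 7) & !7
    }
}
```

For an entry with 24 bytes of extra data, a 1-byte ID and a 5-byte name, the unpadded length is 70 bytes and the next entry starts 72 bytes later.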
Extended Snapshot Data¶
Version 3 images include extra data after the base header:
typedef struct QCowSnapshotExtraData {
    uint64_t vm_state_size_large;   // 64-bit VM state size (bytes 0-7)
    uint64_t disk_size;             // Virtual disk size at snapshot (bytes 8-15)
    uint64_t icount;                // Instruction count (record/replay, bytes 16-23)
} QEMU_PACKED QCowSnapshotExtraData;
Size: 24 bytes when all three fields are present.
Version 3 images must have extra_data_size >= 16 (at least bytes 0-15, i.e. vm_state_size_large and disk_size); icount is optional.
In-Memory Snapshot Representation¶
typedef struct QCowSnapshot {
    uint64_t l1_table_offset;
    uint32_t l1_size;
    char *id_str;               // Unique identifier
    char *name;                 // Human-readable name
    uint64_t disk_size;
    uint64_t vm_state_size;
    uint32_t date_sec;
    uint32_t date_nsec;
    uint64_t vm_clock_nsec;
    uint64_t icount;
    uint32_t extra_data_size;
    void *unknown_extra_data;   // For forward compatibility
} QCowSnapshot;
Parser surface¶
The qcow2 crate exposes two parser variants:
- parse_snapshot_table: bounded. Returns a SnapshotTable with up to MAX_SNAPSHOTS (16) entries on the caller's stack. Used by info (to set the FLAG_HAS_SNAPSHOTS bit) and convert (whose --snapshot resolver, find_snapshot, runs qemu's two-full-pass ID-then-name match over the parsed table before switching to the found snapshot's L1; see docs/quirks.md's convert section, including the bounded 16-entry lookup residual). Behaviour is byte-identical to the pre-streaming implementation; the bounded variant is now a thin wrapper over for_each_snapshot_entry that stops once the 16-entry array is full.
- for_each_snapshot_entry: streaming, no in-memory cap. Invokes a FnMut(&SnapshotEntry) -> bool callback once per entry. Bounded only by the qcow2 header's nb_snapshots field (spec cap 65536). Aborts and returns false on a read error or when the callback returns false. Used by the snapshot subcommand to emit one wire record per entry without ballooning the guest's stack. (An unused find_snapshot_streaming convenience wrapper with the old per-entry id-or-name semantics was removed in PLAN-snapshot phase 14.)
- snapshot_entry_to_record: planner converter from the internal SnapshotEntry to the wire-FFI shared::SnapshotEntryRecord. Splits date_sec into hi/lo halves, truncates id (32) and name (256) to the wire buffer sizes, and resolves the v2 disk_size fallback by taking the active image's virtual size as a parameter.
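The streaming shape described above (one callback per entry, stop on `false`, fail on a short read) can be sketched over an in-memory copy of the table. This is an assumption-laden illustration: `walk_snapshot_table` and its callback signature `(index, start, unpadded_len)` are hypothetical, and the real for_each_snapshot_entry reads from the image and passes a parsed SnapshotEntry instead.

```rust
/// Walk `nb_snapshots` entries of a raw snapshot table held in memory,
/// calling `visit(index, start, unpadded_len)` once per entry.
/// Returns `false` if the table is truncated or the callback asked to
/// stop, and `true` once every entry has been visited.
/// NOTE: hypothetical helper for illustration, not the crate's API.
pub fn walk_snapshot_table<F>(table: &[u8], nb_snapshots: u32, mut visit: F) -> bool
where
    F: FnMut(u32, usize, usize) -> bool,
{
    const HEADER_LEN: usize = 40;
    let mut pos = 0usize;

    for index in 0..nb_snapshots {
        // Every entry starts on an 8-byte boundary.
        pos = (pos + 7) & !7;

        let hdr = match table.get(pos..pos + HEADER_LEN) {
            Some(h) => h,
            None => return false, // truncated table
        };
        let id_len = u16::from_be_bytes([hdr[12], hdr[13]]) as usize;
        let name_len = u16::from_be_bytes([hdr[14], hdr[15]]) as usize;
        let extra_len =
            u32::from_be_bytes([hdr[36], hdr[37], hdr[38], hdr[39]]) as usize;

        // Sketch assumes a 64-bit usize, so this sum cannot overflow.
        let len = HEADER_LEN + extra_len + id_len + name_len;
        if table.len() - pos < len {
            return false; // entry body runs past the end of the table
        }
        if !visit(index, pos, len) {
            return false; // caller requested early stop
        }
        pos += len;
    }
    true
}
```

A bounded variant in the spirit of parse_snapshot_table could be written as a callback that copies entries into a fixed array and returns `false` once the array is full.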
Extra-data fallback rules¶
qemu's block/qcow2-snapshot.c::qcow2_read_snapshots applies
progressive-reveal rules to the extra-data section: fields are
populated only when extra_data_size is large enough to carry
them, with sentinel values otherwise. The instar parser mirrors
this exactly.
| extra_data_size | vm_state_size_large | disk_size | icount |
|---|---|---|---|
| < 8 | vm_state_size as u64 | 0 (use header size) | u64::MAX |
| >= 8 | from offset 0 | 0 (use header size) | u64::MAX |
| >= 16 | from offset 0 | from offset 8 | u64::MAX |
| >= 24 | from offset 0 | from offset 8 | from offset 16 |
| > 1024 | (rejected: entry is skipped per QCOW_MAX_SNAPSHOT_EXTRA_DATA) | | |
disk_size == 0 is the v2 / short-v3 sentinel meaning "fall
back to the active header's virtual size"; the converter
snapshot_entry_to_record substitutes it at conversion time
because the parser is unaware of the active header.
icount == u64::MAX matches qemu's qcow2_snapshot.icount = -1
and the SnapshotEntryRecord::ICOUNT_ABSENT constant.
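The fallback rules above reduce to "read each 8-byte field if it is fully present, otherwise use the sentinel". A minimal sketch, assuming the extra-data bytes have already been sliced out of the entry; `ExtraFields`, `decode_extra_data` and `effective_disk_size` are hypothetical names, not the parser's actual API:

```rust
pub const ICOUNT_ABSENT: u64 = u64::MAX;
pub const MAX_EXTRA_DATA: usize = 1024; // QCOW_MAX_SNAPSHOT_EXTRA_DATA

/// NOTE: hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub struct ExtraFields {
    pub vm_state_size: u64,
    pub disk_size: u64,
    pub icount: u64,
}

/// Apply the progressive-reveal rules to one entry's extra data.
/// `extra` is exactly `extra_data_size` bytes long; `vm_state_size_32`
/// is the legacy 32-bit field from the fixed header.
/// Returns `None` when the extra data exceeds the 1024-byte limit.
pub fn decode_extra_data(extra: &[u8], vm_state_size_32: u32) -> Option<ExtraFields> {
    if extra.len() > MAX_EXTRA_DATA {
        return None;
    }
    // Read a big-endian u64 at `at`, or None if it is not fully present.
    let field = |at: usize| -> Option<u64> {
        let raw = extra.get(at..at + 8)?;
        let mut b = [0u8; 8];
        b.copy_from_slice(raw);
        Some(u64::from_be_bytes(b))
    };
    Some(ExtraFields {
        vm_state_size: field(0).unwrap_or(vm_state_size_32 as u64),
        disk_size: field(8).unwrap_or(0),
        icount: field(16).unwrap_or(ICOUNT_ABSENT),
    })
}

/// Resolve the `disk_size == 0` sentinel against the active image.
pub fn effective_disk_size(stored: u64, active_virtual_size: u64) -> u64 {
    if stored == 0 { active_virtual_size } else { stored }
}
```

Keeping the sentinel resolution in a separate function reflects the split described above: the parser has no access to the active header, so the substitution happens at conversion time.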
Mutator surface¶
The mutating snapshot operations (-c / -d / -a) compose
their per-mode patch lists from pure mutator primitives in the
src/crates/snapshot/ crate (parallel to commit and rebase;
landed in PLAN-snapshot phase 5). The crate is no_std, depends
only on qcow2 for type / constant access, and operates on
caller-staged byte slices without any I/O.
- read_refcount_in_block / set_refcount_in_block: scalar accessors for every spec-permitted refcount width (1, 2, 4, 8, 16, 32, 64). The setter was lifted from resize::qcow2::set_refcount; resize calls it through a thin wrapper.
- check_refcount_after_addend: overflow-safe arithmetic used by the two-pass refcount mutator's dry-run pass.
- alloc_cluster_in_refblocks: cursor-driven linear scan over staged refcount blocks. Claims the first zero entry and returns its host byte offset. v1 supports 16-bit refcounts only.
- rewrite_l1_entry_copied_flag / rewrite_l2_entry_copied_flag: set or clear OFLAG_COPIED on one L1 / L2 entry. The L2 helper handles both standard (8-byte stride) and extended-L2 (16-byte stride; subcluster bitmap untouched) layouts.
- for_each_cluster_in_l1: visitor that walks the L1 -> L2 chain and yields one L1ClusterRef per allocated cluster (Standard or Compressed); unallocated L1 and L2 entries are skipped.
- update_snapshot_refcount: two-pass composed mutator. Pass 1 walks the relevant L1(s) and runs check_refcount_after_addend for every cluster; on the first overflow it returns RefcountOverflow { at_host_offset } before mutating any refblock. Pass 2 walks again and applies the new refcounts via set_refcount_in_block. Handles IncrementForCreate, DecrementForDelete, and SwapForApply { from_l1, to_l1 }. Both passes adjust every reachable data cluster and each L2 table cluster (once per non-zero L1 entry, after that entry's data clusters), mirroring qemu's qcow2_update_snapshot_refcount (block/qcow2-refcount.c). The L2-table bump is mandatory for create: after a create the active L1 and the snapshot's L1 copy share the same physical L2 tables, so each L2 cluster's refcount must reach 2 for a later guest write to trigger the L2 copy-on-write instead of silently overwriting the snapshot's L2 in place. The function never touches the L1 table's own clusters; the caller owns those (create allocates the snapshot L1 copy at refcount 1; delete frees the snapshot L1 explicitly). (The L2-table coverage was added in PLAN-snapshot phase 6, closing a phase 5 correctness gap.)
- update_copied_flags_for_l1: walks the L1, rewriting the OFLAG_COPIED flag on each L1 and L2 entry based on the cluster's current refcount (set when refcount==1, clear otherwise). Returns the number of entries rewritten. L2 entries that reference no cluster (UNALLOCATED, or ZERO_PLAIN on standard L2) are scrubbed, not skipped: qemu's qcow2_update_snapshot_refcount strips OFLAG_COPIED before classifying and assigns refcount = 0 to those entry types, so a stale COPIED bit is actively cleared on every walk, and the walker mirrors that (added in PLAN-snapshot phase 8, closing a phase 5-7 fidelity gap; the extended-L2 subcluster bitmap is untouched).
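The two-pass idea behind update_snapshot_refcount (validate everything, then mutate) can be shown on a single staged refblock. The sketch below is a simplified model, not the crate's code: it assumes 16-bit big-endian refcounts, takes a flat list of cluster indices instead of walking L1/L2, and assumes each index appears at most once in that list. All names are hypothetical.

```rust
/// NOTE: hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
pub enum RefcountError {
    Overflow { cluster_index: usize },
    Underflow { cluster_index: usize },
    OutOfRange { cluster_index: usize },
}

/// Read one 16-bit big-endian refcount from a staged refblock.
fn read_rc16(block: &[u8], idx: usize) -> Option<u16> {
    let at = idx.checked_mul(2)?;
    let raw = block.get(at..at.checked_add(2)?)?;
    Some(u16::from_be_bytes([raw[0], raw[1]]))
}

/// Overflow-safe `current + addend`, done in i64 so neither direction wraps.
fn checked_addend(current: u16, addend: i32) -> Option<u16> {
    let next = current as i64 + addend as i64;
    if next >= 0 && next <= u16::MAX as i64 {
        Some(next as u16)
    } else {
        None
    }
}

fn classify(addend: i32, idx: usize) -> RefcountError {
    if addend > 0 {
        RefcountError::Overflow { cluster_index: idx }
    } else {
        RefcountError::Underflow { cluster_index: idx }
    }
}

/// Apply `addend` to every listed cluster, or to none of them.
/// Precondition (assumed by this sketch): `clusters` has no duplicates.
pub fn update_refcounts_two_pass(
    block: &mut [u8],
    clusters: &[usize],
    addend: i32,
) -> Result<(), RefcountError> {
    // Pass 1: dry run. Nothing is written, so a failure leaves `block` intact.
    for &idx in clusters {
        let cur = read_rc16(block, idx)
            .ok_or(RefcountError::OutOfRange { cluster_index: idx })?;
        if checked_addend(cur, addend).is_none() {
            return Err(classify(addend, idx));
        }
    }
    // Pass 2: apply. Cannot fail when the precondition holds.
    for &idx in clusters {
        let cur = read_rc16(block, idx)
            .ok_or(RefcountError::OutOfRange { cluster_index: idx })?;
        let next = checked_addend(cur, addend).ok_or_else(|| classify(addend, idx))?;
        block[idx * 2..idx * 2 + 2].copy_from_slice(&next.to_be_bytes());
    }
    Ok(())
}
```

The point of the dry run is that a saturated or already-zero refcount is reported before the first byte of any refblock changes, which is what lets delete and apply run their prechecks before any disk write.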
The phase 6 create planner adds these table-serialisation
helpers (src/crates/snapshot/src/table.rs):
- alloc_contiguous_clusters_in_refblocks: first-fit scan for count consecutive zero-refcount clusters (allowed to span refblock boundaries), claiming each. The single-cluster alloc_cluster_in_refblocks is now a count = 1 wrapper.
- NewSnapshotEntry + serialize_snapshot_entry: emit one new on-disk entry: 40-byte big-endian header, extra_data_size = 24, the 24-byte extra data (vm_state_size_large / disk_size / icount), then the id and name strings, with no trailing pad. Matches qemu-img 10.0.x byte-for-byte (icount written as 0, not the u64::MAX "absent" sentinel the read side uses).
- snapshot_table_byte_len: walk the raw old table for nb_snapshots entries (8-aligned starts) and return its exact unpadded byte length, so the guest can stage / copy / free it.
- build_snapshot_table: copy the old entries verbatim (preserving any unknown trailing extra data), zero-pad to the next 8-byte boundary, and append the serialised new entry.
- parse_decimal_id / format_decimal_u64: strtoul- / %lu-style ID arithmetic for the max(existing IDs) + 1 assignment qemu's find_new_snapshot_id performs.
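The entry layout that serialize_snapshot_entry is described as emitting can be sketched as follows. `NewEntry` and `serialize_entry` are hypothetical stand-ins for the crate's types. One detail is an assumption of this sketch rather than something stated above: the legacy 32-bit vm_state_size field is written as 0 when the 64-bit size does not fit.

```rust
/// NOTE: hypothetical input type for illustration.
pub struct NewEntry<'a> {
    pub l1_table_offset: u64,
    pub l1_size: u32,
    pub id: &'a str,
    pub name: &'a str,
    pub date_sec: u32,
    pub date_nsec: u32,
    pub vm_clock_nsec: u64,
    pub vm_state_size: u64,
    pub disk_size: u64,
}

/// Serialise one entry: 40-byte big-endian header, 24 bytes of extra
/// data, then the id and name strings, with no trailing padding.
/// Returns `None` if a string does not fit its 16-bit length field.
pub fn serialize_entry(e: &NewEntry) -> Option<Vec<u8>> {
    if e.id.len() > u16::MAX as usize || e.name.len() > u16::MAX as usize {
        return None;
    }
    // Assumption: the legacy field carries the size only when it fits.
    let legacy_vm_state: u32 = if e.vm_state_size <= u32::MAX as u64 {
        e.vm_state_size as u32
    } else {
        0
    };

    let mut out = Vec::with_capacity(64 + e.id.len() + e.name.len());
    // Fixed header (40 bytes).
    out.extend_from_slice(&e.l1_table_offset.to_be_bytes());
    out.extend_from_slice(&e.l1_size.to_be_bytes());
    out.extend_from_slice(&(e.id.len() as u16).to_be_bytes());
    out.extend_from_slice(&(e.name.len() as u16).to_be_bytes());
    out.extend_from_slice(&e.date_sec.to_be_bytes());
    out.extend_from_slice(&e.date_nsec.to_be_bytes());
    out.extend_from_slice(&e.vm_clock_nsec.to_be_bytes());
    out.extend_from_slice(&legacy_vm_state.to_be_bytes());
    out.extend_from_slice(&24u32.to_be_bytes()); // extra_data_size
    // Extra data (24 bytes): vm_state_size_large, disk_size, icount.
    out.extend_from_slice(&e.vm_state_size.to_be_bytes());
    out.extend_from_slice(&e.disk_size.to_be_bytes());
    out.extend_from_slice(&0u64.to_be_bytes()); // icount written as 0
    // Variable-length strings, no terminator and no padding.
    out.extend_from_slice(e.id.as_bytes());
    out.extend_from_slice(e.name.as_bytes());
    Some(out)
}
```

Because the serialised entry is unpadded, the table builder is the one responsible for zero-padding the preceding entries to an 8-byte boundary before appending it.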
The phase 7 delete planner adds:
- snapshot_table_entry_bounds: the (start offset, unpadded length) of one raw table entry, walking entries exactly like snapshot_table_byte_len. Delete's find-by-name walk uses it to compare the full on-disk name (independent of the bounded parser's 63-byte truncation) and to locate the removed entry.
- build_snapshot_table_without: the table compaction: every entry except the removed one copied verbatim to the next 8-aligned output offset (gaps zeroed, unpadded tail). Removing the sole remaining entry yields length 0; the caller then writes header nb_snapshots = 0, snapshots_offset = 0 and allocates no table, matching qemu.
- precheck_snapshot_refcount (in qcow2.rs): a public read-only wrapper over update_snapshot_refcount's dry-run pass (pass 1), so delete can validate the decrement against the staged refblocks before any disk write while deferring the paired apply until after the commit-point header write.
The phase 8 apply planner adds:
- MatchMode / FoundSnapshot / find_snapshot_in_table: the raw-table snapshot finder with per-mode matching semantics. NameOnly is delete's single name pass (the phase 7 inline find was refactored onto it); IdThenName is apply's two-full-pass resolver (qemu's find_snapshot_by_id_or_name: a complete ID pass, then, only if no ID matched, a complete name pass, so a later ID match beats an earlier name match). Comparisons cover the full on-disk strings, independent of the bounded parser's 63-byte truncation. FoundSnapshot carries the entry's index, L1 geometry, and disk_size_or_zero (extra-data offset 8 when extra_data_size >= 16, else a 0 "absent" sentinel that the caller treats as matching, mirroring qcow2_read_snapshots' default of the current virtual size).
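The difference between the two match modes is easiest to see on already-parsed strings. The sketch below is a simplified model: the real finder works on the raw table bytes, whereas the hypothetical `find_in_parsed` here takes a slice of (id, name) pairs. Only the variant names are borrowed from the description above.

```rust
/// NOTE: hypothetical, simplified entry for illustration.
pub struct Entry<'a> {
    pub id: &'a str,
    pub name: &'a str,
}

#[derive(Clone, Copy)]
pub enum MatchMode {
    /// One pass over names only (delete).
    NameOnly,
    /// A full ID pass, then a full name pass only if no ID matched (apply).
    IdThenName,
}

/// Return the index of the first matching entry, or `None`.
pub fn find_in_parsed(entries: &[Entry], key: &str, mode: MatchMode) -> Option<usize> {
    if let MatchMode::IdThenName = mode {
        // Complete ID pass first: an ID match anywhere in the table wins.
        if let Some(i) = entries.iter().position(|e| e.id == key) {
            return Some(i);
        }
    }
    // Name pass: first match in table order.
    entries.iter().position(|e| e.name == key)
}
```

With entries `[(id "1", name "2"), (id "2", name "x")]` and key `"2"`, IdThenName selects index 1 (the later ID match) while NameOnly selects index 0, which is the asymmetry between apply and delete.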
The crate emits no patch lists: the guest binaries write each
staged region directly (commit-binary style), because the
writeback needs fsync barriers between write groups, which a
flat patch list cannot express. (A speculative SnapshotPatch /
SnapshotPlan patch-list API sat unused in the crate root
through phase 13 and was removed in PLAN-snapshot phase 14.)
Create write ordering (crash safety)¶
instar snapshot -c writes back in four fsync-separated groups,
mirroring qemu's qcow2_snapshot_create + qcow2_write_snapshots:
A: L1 copy (verbatim pre-flag-rewrite bytes), dirty L2 tables,
the rewritten active L1, and the dirty refcount blocks
(covering the data / L2 increments and the new allocations)
-> fsync
B: the new snapshot table (at a freshly allocated, contiguous
region)
-> fsync
C: the 12-byte header write at offset 60 — nb_snapshots (u32 BE)
followed by snapshots_offset (u64 BE). THIS IS THE COMMIT
POINT.
-> fsync
D: free the old snapshot table's clusters (decrement their
refcounts to 0 and write those refblocks back). Skipped when
there was no old table (nb_snapshots was 0).
-> fsync
The barrier ordering gives the same crash-safety contract as
qemu: a crash before group C leaves the old table authoritative
(the new clusters are orphaned garbage — qemu-img check reports
leaks, not corruption); a crash after group C leaves the new
table authoritative (the old table's clusters leak until group D
runs). Leaks are repairable with qemu-img check -r; dangling
references are not, and this ordering never produces them.
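The barrier ordering can be sketched against an abstract disk. Everything named here (`Disk`, `RecordingDisk`, `CreatePlan`, `write_back`, `commit_patch`) is hypothetical and error handling is omitted; the sketch only illustrates the A / B / C / D grouping and the 12-byte commit write at offset 60.

```rust
/// Minimal disk abstraction. NOTE: hypothetical, for illustration only.
pub trait Disk {
    fn write_at(&mut self, offset: u64, data: &[u8]);
    fn fsync(&mut self);
}

/// Test double that records the order of operations.
#[derive(Default)]
pub struct RecordingDisk {
    pub log: Vec<String>,
}

impl Disk for RecordingDisk {
    fn write_at(&mut self, offset: u64, data: &[u8]) {
        self.log.push(format!("write {} {}", offset, data.len()));
    }
    fn fsync(&mut self) {
        self.log.push("fsync".to_string());
    }
}

/// The 12-byte commit write: nb_snapshots (u32 BE), snapshots_offset (u64 BE).
pub fn commit_patch(nb_snapshots: u32, snapshots_offset: u64) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[..4].copy_from_slice(&nb_snapshots.to_be_bytes());
    out[4..].copy_from_slice(&snapshots_offset.to_be_bytes());
    out
}

pub struct CreatePlan<'a> {
    /// Group A: L1 copy, dirty L2s, rewritten active L1, dirty refblocks.
    pub metadata: &'a [(u64, &'a [u8])],
    /// Group B: the new snapshot table at its freshly allocated offset.
    pub new_table: (u64, &'a [u8]),
    /// Group C: the new snapshot count for the header.
    pub nb_snapshots: u32,
    /// Group D: refblocks after freeing the old table (empty if none).
    pub freed_refblocks: &'a [(u64, &'a [u8])],
}

pub fn write_back<D: Disk>(disk: &mut D, plan: &CreatePlan) {
    for (off, bytes) in plan.metadata {
        disk.write_at(*off, bytes);
    }
    disk.fsync();
    disk.write_at(plan.new_table.0, plan.new_table.1);
    disk.fsync();
    // Commit point: the header now names the new table.
    disk.write_at(60, &commit_patch(plan.nb_snapshots, plan.new_table.0));
    disk.fsync();
    if !plan.freed_refblocks.is_empty() {
        for (off, bytes) in plan.freed_refblocks {
            disk.write_at(*off, bytes);
        }
        disk.fsync();
    }
}
```

Writing both header fields in one 12-byte write is what makes group C a single commit point: a reader sees either the old count and offset or the new ones.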
The snapshot's L1 copy is serialised from the active L1's bytes
captured before the COPIED-flag rewrite, so — exactly like
qemu — the stored copy keeps its (now stale) OFLAG_COPIED bits
even though the shared clusters are at refcount 2. qemu-img
check validates only the active L1/L2 flags, so this is
correct; the apply path refreshes the flags if the snapshot is
ever restored.
Delete write ordering (crash safety)¶
instar snapshot -d finds the target by name only, first
match in table order (qemu 10's bdrv_snapshot_find — see
docs/quirks.md), stages BOTH chains (the deleted snapshot's
L1 + L2 set for the decrement walk; the active L1 + L2 set for
the COPIED refresh), then writes back in three fsync-separated
groups, mirroring qcow2_snapshot_delete:
precheck: precheck_snapshot_refcount(DecrementForDelete) over the
snapshot's chain, plus refcount >= 1 checks on the snapshot's
L1 clusters and the old table's clusters. Read-only, BEFORE
any disk write: a corrupt image fails here with the file
untouched. (qemu has no such check; its equivalent failure
would surface after the commit point.)
A: the compacted snapshot table (built by
build_snapshot_table_without at a freshly allocated,
contiguous region) + all staged refblocks, which at this
moment carry ONLY the table-allocation bumps. Skipped
entirely when the remaining snapshot count is 0.
-> fsync
B: the 12-byte header write at offset 60 — nb_snapshots - 1
(u32 BE) followed by the new table offset, or 0 / 0 when the
table is now empty. THIS IS THE COMMIT POINT.
-> fsync
(in-memory, qemu's "we won't recover but just leak clusters"
zone: update_snapshot_refcount(DecrementForDelete) over the
snapshot's chain, then decrement the snapshot's L1 clusters,
then the old table's clusters — decrements, never set-to-0,
so an underflow surfaces a double-free bug. Then the COPIED
refresh over the ACTIVE chain against the post-decrement
refcounts — shared data clusters that dropped 2 -> 1 get
COPIED SET, the reverse direction from create — AND over the
deleted chain's staged L2 set, mirroring qemu's -1 walk,
which recomputes flags on every L2 entry it visits. The
deleted snapshot's L1 buffer is mutated in place but never
written — qemu's "update L1 only if addend >= 0" exemption,
and it is being freed anyway.)
C: all staged refblocks (now carrying the decrements) + the
active L1 + the active L2 set + the SURVIVING snap-set L2s
(those whose own cluster's post-decrement refcount is
non-zero, e.g. L2 tables shared with another snapshot, which
land on disk with refreshed COPIED flags). Freed L2s are
never written, matching qemu's cache discard.
-> fsync
A crash before group B leaves the old table authoritative and at worst an orphaned compacted table (a leak); a crash after group B but before group C completes leaves the snapshot gone with refcounts too high and/or stale COPIED flags — leaks and repairable flag warnings, never a dangling reference. Because delete writes no timestamps, the post-delete image is byte-identical to qemu's given byte-identical inputs (modulo freed-cluster contents and the file tail — docs/quirks.md). The surviving-L2 write-back was added post-phase-13: the differential fuzzer caught a deleted-snapshot L2 shared with a surviving snapshot landing with stale COPIED-clear entries (safe — a spurious COW at worst — but not byte-identical).
Apply write ordering (crash safety)¶
instar snapshot -a finds the target by ID first, then name —
two full passes (qemu's find_snapshot_by_id_or_name; see
docs/quirks.md for the -d / -a asymmetry), refuses geometry
mismatches (a stored disk_size differing from the current
virtual size, or a snapshot L1 larger than the active L1 — qemu
truncates / grows respectively; docs/quirks.md), stages BOTH
chains (the target snapshot's L1 + L2 set; the old active L1 +
L2 set), then writes back in three fsync-separated groups,
mirroring qcow2_snapshot_goto. Apply rewrites the active L1
in place and never touches the snapshot table or the header:
precheck: precheck_snapshot_refcount(SwapForApply { from: active
L1, to: snapshot L1 }) — both directions (decrement underflow
on the outgoing chain, increment overflow on the incoming
one) validated read-only, BEFORE any disk write.
(in-memory: update_snapshot_refcount increment walk over the
snapshot's chain — qemu's +1 walk)
A: all staged refblocks, carrying the increments only.
-> fsync
B: the snapshot's RAW L1 content, zero-padded to the active L1's
byte size (hdr.l1_size * 8), written at hdr.l1_table_offset —
stale COPIED flags intact, mirroring qemu's bdrv_pwrite_sync.
THIS IS THE COMMIT POINT: the active view is now the snapshot.
-> fsync
(in-memory: the -1 walk over the staged OLD active chain, then
ONE final-state COPIED refresh over the padded new-L1 copy +
the snapshot's L2 set, and over the staged old active chain —
qemu's -1 walk also refreshes the old chain's surviving L2s)
C: all staged refblocks (now carrying the decrements) + the
refreshed L1 written to BOTH locations — hdr.l1_table_offset
at the padded length AND sn.l1_table_offset at sn.l1_size * 8
(replicating the snapshot-stored-L1 flag write qemu's +1 walk
performs) — + the dirty snapshot-set L2s + the surviving
old-active L2s (final refcount > 0, e.g. shared with another
snapshot). Freed old-active L2s are NEVER written (qemu runs
the walks with cache_discards = true, so dirty cache entries
for freed clusters are dropped, not flushed).
-> fsync
Why one flag pass suffices (qemu performs three flag-bearing writes: the snapshot's stored L1 mid-state during the +1 walk, the raw padded copy, then the active L1 at final state in the addend-0 walk): after an apply, every cluster reachable from the new active chain has refcount >= 2 — the active L1 is a copy of the snapshot's L1, so everything the active view references is also referenced by the still-present snapshot. Every COPIED flag on the new chain therefore ends clear, and the flags qemu computes mid-state equal the flags at final state. instar computes flags once, at final state, and writes the same bytes.
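The flag rule this argument rests on is small enough to state in code. In this sketch, `with_copied_flag` is a hypothetical helper; the bit position of OFLAG_COPIED (bit 63) follows the qcow2 format.

```rust
/// OFLAG_COPIED is the top bit of an L1 / L2 entry.
pub const OFLAG_COPIED: u64 = 1 << 63;

/// Recompute the COPIED flag for one entry from its cluster's refcount:
/// set when the refcount is exactly 1, clear otherwise (including 0).
/// NOTE: hypothetical helper for illustration, not the crate's API.
pub fn with_copied_flag(entry: u64, refcount: u64) -> u64 {
    if refcount == 1 {
        entry | OFLAG_COPIED
    } else {
        entry & !OFLAG_COPIED
    }
}
```

Since every cluster on the new active chain ends an apply with a refcount of at least 2, this function yields a clear flag for each of them no matter whether it is evaluated mid-state or at final state.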
A crash before group B leaves the image unchanged except over-referenced refcounts (repairable leaks); a crash between B and C leaves the active view switched with leaks and stale COPIED flags — repairable, never a dangling reference. One window differs cosmetically from qemu (qemu scrubs the snapshot's stored L1 before its active overwrite, instar after); both orders leave only repairable states and the final bytes are identical. Because apply writes no timestamps, no snapshot-table bytes and no header bytes, post-apply images are byte-identical to qemu's given byte-identical inputs across every scenario, including diverged applies (modulo freed-cluster contents and the file tail — docs/quirks.md).
Snapshot L1 Table¶
Each snapshot has its own L1 table, independent of the "active" L1 table. This is the key to copy-on-write:
Active State:
  L1 (active) --> L2 tables --> Data clusters

After Snapshot:
  L1 (active) --> L2 tables --> Data clusters
                      ^               ^
  L1 (snapshot) ------+               |
                                      |
  (Both L1s may point to same L2/data until modified)
When data is written after a snapshot:
1. Check if cluster is shared (refcount > 1)
2. If shared, allocate new cluster (COW)
3. Copy data, write new data
4. Update active L1/L2 to point to new cluster
5. Decrement refcount on old cluster
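The steps above can be modelled with a toy in-memory image. This is a deliberately simplified sketch, not the on-disk algorithm: clusters are 4 bytes, the L1/L2 chain is flattened into one guest-to-host map, and the `Image` type and its `write` method are hypothetical.

```rust
/// Toy model of an image with per-cluster refcounts.
/// NOTE: hypothetical type for illustration only.
pub struct Image {
    /// Refcount per host cluster.
    pub refcounts: Vec<u16>,
    /// Host cluster contents (4-byte clusters keep the example small).
    pub clusters: Vec<[u8; 4]>,
    /// Active mapping: guest cluster index -> host cluster index.
    pub active_map: Vec<usize>,
}

impl Image {
    /// Write one byte into a guest cluster, copying the cluster first
    /// if it is shared with a snapshot. Panics on out-of-range indices.
    pub fn write(&mut self, guest: usize, offset: usize, value: u8) {
        let old = self.active_map[guest];

        // Step 1: a refcount above 1 means a snapshot also references it.
        if self.refcounts[old] > 1 {
            // Step 2: allocate a free host cluster (first zero refcount),
            // growing the image if none is free.
            let new = match self.refcounts.iter().position(|&r| r == 0) {
                Some(i) => i,
                None => {
                    self.refcounts.push(0);
                    self.clusters.push([0u8; 4]);
                    self.refcounts.len() - 1
                }
            };
            self.refcounts[new] = 1;
            // Step 3: copy the old contents, then apply the new data.
            self.clusters[new] = self.clusters[old];
            self.clusters[new][offset] = value;
            // Step 4: point the active mapping at the private copy.
            self.active_map[guest] = new;
            // Step 5: the active image no longer references the old cluster.
            self.refcounts[old] -= 1;
        } else {
            // Not shared: write in place.
            self.clusters[old][offset] = value;
        }
    }
}
```

After the first write to a shared cluster the snapshot still sees the original bytes, and later writes to the same guest cluster go in place because the private copy has a refcount of 1.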
Snapshot Operations¶
Creating a Snapshot¶
1. Flush all pending writes
2. Allocate space for snapshot L1 table
3. Copy current L1 table to snapshot location
4. Increment refcounts for all referenced clusters
5. Clear COPIED flags on shared clusters
6. Allocate new snapshot table entry
7. Write snapshot header with metadata
8. Update header (nb_snapshots, snapshots_offset)
9. Free old snapshot table if reallocated
Restoring a Snapshot (goto)¶
1. Validate snapshot L1 table offset/size
2. Grow current L1 table if needed
3. Increment refcounts for snapshot's clusters
4. Decrement refcounts for current L1's clusters
5. Copy snapshot L1 to current L1
6. Update COPIED flags based on new refcounts
7. Clear DIRTY flag if set
Deleting a Snapshot¶
1. Load snapshot's L1 table
2. Decrement refcounts for all referenced clusters
3. Free clusters with refcount reaching 0
4. Set COPIED flags where refcount becomes 1
5. Free snapshot's L1 table
6. Remove entry from snapshot table
7. Write updated snapshot table
8. Update header (nb_snapshots)
Listing Snapshots¶
For each snapshot:
- ID: unique identifier
- Name: human-readable name
- Date: creation timestamp
- VM state size: saved RAM size
- Disk size: virtual disk size at snapshot time
VM State Storage¶
Snapshots can include VM state (memory, device state) for hibernation:
- VM state is stored as regular data in the image
- Located via L1 table entries beyond virtual disk size
- vm_state_size / vm_state_size_large indicates size
- Address: l1_vm_state_index << (cluster_bits + l2_bits)
The active image's VM state area is typically discarded after snapshot creation to avoid unnecessary copy-on-write.
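As a worked example of the address formula, the sketch below computes the VM state offset from the virtual size and cluster_bits. It rests on two assumptions not stated above: that l2_bits = cluster_bits - 3 (standard 8-byte L2 entries), and that l1_vm_state_index is the number of L1 entries needed to cover the virtual disk, rounded up. The function name is hypothetical.

```rust
/// Guest-address offset at which VM state begins, or `None` on
/// invalid geometry or arithmetic overflow.
/// NOTE: hypothetical helper; see the assumptions in the text above.
pub fn vm_state_offset(virtual_size: u64, cluster_bits: u32) -> Option<u64> {
    // Assumption: 8-byte L2 entries, so one L2 table holds
    // 2^(cluster_bits - 3) entries.
    let l2_bits = cluster_bits.checked_sub(3)?;
    let shift = cluster_bits + l2_bits;
    if shift >= 64 {
        return None;
    }
    // Bytes of guest address space covered by one L1 entry.
    let span = 1u64 << shift;
    // Assumption: first L1 index past the virtual disk (round up).
    let l1_vm_state_index = virtual_size.checked_add(span - 1)? >> shift;
    // Same value as `l1_vm_state_index << shift`, with an overflow check.
    l1_vm_state_index.checked_mul(span)
}
```

With 64 KiB clusters (cluster_bits = 16) one L1 entry covers 512 MiB, so a 1 GiB disk places its VM state at offset 1 GiB, and a disk one byte larger places it at 1.5 GiB.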
Snapshot Table Limits¶
#define QCOW_MAX_SNAPSHOTS 65536
#define QCOW_MAX_SNAPSHOTS_SIZE (1024 * QCOW_MAX_SNAPSHOTS) // 64 MB
#define QCOW_MAX_SNAPSHOT_EXTRA_DATA 1024
Refcount Updates for Snapshots¶
The function qcow2_update_snapshot_refcount() handles bulk updates:
int qcow2_update_snapshot_refcount(
    BlockDriverState *bs,
    int64_t l1_table_offset,  // Snapshot's L1 table
    int l1_size,              // Number of L1 entries
    int addend                // +1 for create, -1 for delete, 0 to only refresh COPIED flags
) {
    // For each L1 entry:
    //   Load L2 table
    //   For each L2 entry:
    //     If compressed: update refcount for compressed extent
    //     If normal: update refcount for cluster
    //     Update COPIED flag if needed
    //   Update refcount for the L2 table cluster itself
    //   Update COPIED flag on the L1 entry if needed
}
Snapshot Consistency¶
Atomic updates are critical for snapshot table modifications:
- Allocate new snapshot table at new location
- Write all snapshot entries
- Update header (nb_snapshots, snapshots_offset) atomically
- Only after header update succeeds, free old table
This ensures crash recovery always finds a valid snapshot table.
Forward Compatibility¶
The extra_data_size field enables future extensions:
- Unknown extra data is preserved on read
- Written back unchanged on update
- Allows older qemu to handle newer snapshot formats
Common Issues¶
- Snapshot chain depth: Too many snapshots degrades read performance
- Space consumption: Deleting a snapshot frees only the clusters whose refcount drops to 0; clusters still shared with the active image or another snapshot stay allocated
- VM state size: Large VM state can significantly increase snapshot size
- Consistency: Snapshots taken during writes may have inconsistent state