# Post-write verification for output integrity
## Prompt
Before responding to questions or discussion points in this document, explore the occystrap codebase thoroughly. Read relevant source files, understand existing patterns (pipeline architecture, input/filter/output interfaces, URI parsing, CLI commands, registry authentication, error handling), and ground your answers in what the code actually does today. Do not speculate about the codebase when you could read it instead. Where a question touches on external concepts (Docker Registry V2, OCI specs, container image formats, compression), research as needed to give a confident answer. Flag any uncertainty explicitly rather than guessing.
Consult ARCHITECTURE.md for the pipeline pattern, element types,
input/filter/output interfaces, and cross-cutting concerns (layer
caching, parallel downloads, compression). Consult CLAUDE.md for
build commands and project conventions.
When we get to detailed planning, I prefer a separate plan file
per detailed phase. These separate files should be named for the
master plan, in the same directory as the master plan, and simply
have -phase-NN-descriptive appended before the .md file
extension. Tracking of these sub-phases should be done via a table
like this in this master plan under the Execution section:
| Phase | Plan | Status |
|-------|------|--------|
| 1. Registry listing API | PLAN-thing-phase-01-listing.md | Not started |
| 2. Label filtering | PLAN-thing-phase-02-labels.md | Not started |
| ... | ... | ... |
I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit. Each commit should be self-contained: it should build, pass tests, and have a clear commit message explaining what changed and why.
## Situation
Occystrap recently gained significant parallelism (httpx with HTTP/2, concurrent multi-image processing, parallel layer downloads and uploads). This makes bulk operations much faster but also increases the chance that transient errors (network glitches, rate-limiting, disk I/O issues) could silently produce incomplete output.
The existing check command validates images from an input
source — it reads a manifest and config from a registry, then
optionally downloads and verifies all layers. But there is no
equivalent verification for output — after process writes
an image, nothing confirms that the written output is complete
and correct.
Users running bulk mirrors (e.g., quay:// → dir:// with
hundreds of images) need confidence that every image landed
correctly, especially when they saw transient errors scroll past
during processing.
## What exists today
check.py module:
- CheckResults class: accumulates errors/warnings/info with
error(), warning(), info() methods and has_errors
property.
- check_metadata(manifest, config, results): fast-mode
validation of manifest structure, schema version, layer count
consistency, compression compatibility, media types.
- check_layers(input_source, manifest, config, results):
full validation that downloads layers and verifies diff_ids,
tar format, whiteout correctness.
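Based on the bullets above, the `CheckResults` accumulator can be sketched roughly like this (the internal list names are assumptions, not the actual implementation):

```python
class CheckResults:
    """Accumulates validation findings: errors, warnings, and info."""

    def __init__(self):
        self.errors = []
        self.warnings = []
        self.infos = []

    def error(self, msg):
        self.errors.append(msg)

    def warning(self, msg):
        self.warnings.append(msg)

    def info(self, msg):
        self.infos.append(msg)

    @property
    def has_errors(self):
        # True once any error has been recorded.
        return bool(self.errors)
```

A post-write verifier could reuse this same accumulator so that `process --verify` and `check` report findings uniformly.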
check CLI command:
- Takes a source URI and runs check_metadata (always) plus
check_layers (unless --fast).
- Reports results in text or JSON format.
- Only works against input sources (registries, tarballs,
Docker daemon).
Output writers (finalize() state):
- DirWriter: writes layers as files in subdirectories, writes
manifest-{name}-{tag}.json and updates catalog.json.
- TarWriter: writes layers and manifest into a tarball,
closes the tarball.
- RegistryWriter: pushes blobs and manifest to a registry,
reports upload stats.
- DockerWriter: builds a tarball and POSTs to Docker API.
- OCIBundleWriter / MountWriter: extend DirWriter with
OCI bundle / overlay extraction.
Base output tracking (ImageOutput):
- _track_element(type, size): counts layers and bytes.
- _total_bytes, _layer_count: available after processing.
None of the output writers verify their own output after writing. The pipeline trusts that if no exception was raised, the output is correct.
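The base tracking described above might look roughly like this (a sketch derived from the description; the element-type naming is an assumption):

```python
class ImageOutput:
    """Base output element with simple post-processing counters."""

    def __init__(self):
        self._total_bytes = 0
        self._layer_count = 0

    def _track_element(self, element_type, size):
        # Count layers and accumulate total bytes as elements are written.
        if element_type == 'layer':  # element type name is an assumption
            self._layer_count += 1
        self._total_bytes += size
```

These counters give a verifier a cheap baseline: a read-back pass can at minimum confirm the output contains `_layer_count` layers totalling `_total_bytes` bytes.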
## Mission and problem statement
Add a --verify flag to the process command that runs
post-write verification after each image completes. The
verification should confirm that the output is complete and
correct by reading back what was written and checking it
against what should have been written.
For bulk operations, provide an aggregate summary ("47/47 images verified OK" or "45/47 verified, 2 FAILED") so users can trust the result without scrolling through logs.
The verification should be:
- On by default — users shouldn't have to opt in to
correctness. A --no-verify flag disables it for speed.
- Output-type-specific — each output format has different
things to verify.
- Non-destructive — verification reads but never modifies
the output.
- Efficient — avoid re-downloading or re-reading more data
than necessary. For directory output, stat files and check
hashes. For registry output, HEAD requests for blob
existence.
## Open questions
- Should `--verify` be on by default?
Recommendation: Yes, on by default with --no-verify to
disable. The performance cost is small relative to the
transfer, and the confidence benefit is high. Users who
want maximum speed can opt out.
- Should verification re-read and hash every layer, or just check file existence and size?
Recommendation: Two levels. The default --verify
checks existence and size (fast). A --verify=full mode
also re-reads and hashes layers (thorough but slower).
This mirrors the check command's --fast vs full mode.
- How should verification interact with filters?
When filters modify layer content (e.g., exclude,
normalize-timestamps), the output layers have different
hashes than the input layers. Verification needs to check
against what the output should contain, not what the
input had.
Recommendation: The output writer knows what it wrote. Have each writer record what it expects (file paths, sizes, digests) during processing, then verify against those expectations. This avoids any filter confusion.
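One way to implement this recommendation, sketched with assumed names: the writer appends an expectation record as each blob is written, and the verifier later replays those records against the actual output.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ExpectedFile:
    """What the writer believes it wrote. Field names are assumptions."""
    path: str
    size: int
    sha256: str


def record_expectation(expected, path, data):
    # Called by the output writer as each blob is written, after any
    # filters have run -- so expectations reflect post-filter content.
    expected.append(ExpectedFile(path=path,
                                 size=len(data),
                                 sha256=hashlib.sha256(data).hexdigest()))
```

Because the record is made at write time, filters that rewrite layer content (and therefore change digests) cannot confuse verification.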
- Should verification failures cause a non-zero exit code?
Recommendation: Yes. Exit code 0 = all images processed and verified. Exit code 1 = processing or verification failure.
## Execution
### Phase 1: Verification framework and DirWriter verifier
Add the --verify / --no-verify flags to the process
command. Define an abstract verify() method on ImageOutput
that subclasses implement. Implement verification for
DirWriter (the most common output for bulk operations):
- Check manifest file exists and is valid JSON.
- Check config file exists and matches expected size.
- Check each layer directory and `layer.tar` file exists.
- Check each layer file size matches what was written.
- Optionally (full mode): re-read and hash each layer.
Also implement for OCIBundleWriter and MountWriter which
extend DirWriter.
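A minimal sketch of the existence-and-size check with optional full hashing, assuming expectations are recorded as path → (size, digest). The function name and signature are illustrative, not the planned API:

```python
import hashlib
import os


def verify_dir_output(expected, full=False):
    """Verify directory output against recorded expectations.

    expected maps file path -> (size, sha256 hex digest). Returns a
    list of error strings; an empty list means the output verified OK.
    Reads only -- never modifies the output.
    """
    errors = []
    for path, (size, digest) in expected.items():
        if not os.path.isfile(path):
            errors.append(f'{path}: missing')
            continue
        actual_size = os.stat(path).st_size
        if actual_size != size:
            errors.append(f'{path}: size {actual_size} != expected {size}')
            continue
        if full:
            # Full mode re-reads and hashes the file in chunks.
            h = hashlib.sha256()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            if h.hexdigest() != digest:
                errors.append(f'{path}: digest mismatch')
    return errors
```

The default path costs one `stat()` per file, so it is cheap even for hundreds of images; only `full` mode re-reads layer bytes.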
### Phase 2: TarWriter and DockerWriter verifiers
Implement verification for TarWriter:
- Re-open the tarball and list its entries.
- Check manifest.json, config, and all layer tarballs present.
- Check sizes match.
- Optionally (full mode): re-read and hash layers within the tarball.
Implement verification for DockerWriter:
- Query Docker API to confirm the image was loaded.
- Check image ID matches expected config digest.
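The TarWriter re-open-and-list step could be sketched with the standard `tarfile` module (the helper name and expected-member shape are assumptions):

```python
import tarfile


def verify_tar_output(tar_path, expected_members):
    """Re-open a written tarball and check member presence and sizes.

    expected_members maps member name -> expected size in bytes.
    Returns a list of error strings; empty means verified OK.
    """
    errors = []
    with tarfile.open(tar_path, 'r') as tar:
        # One pass over the archive index; no member data is extracted.
        members = {m.name: m.size for m in tar.getmembers()}
    for name, size in expected_members.items():
        if name not in members:
            errors.append(f'{name}: missing from tarball')
        elif members[name] != size:
            errors.append(f'{name}: size {members[name]} != expected {size}')
    return errors
```

Listing members only reads tar headers, so the default check stays cheap; full mode would additionally call `tar.extractfile()` per layer and hash the stream.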
### Phase 3: RegistryWriter verifier
Implement verification for RegistryWriter:
- HEAD each layer blob to confirm it exists in the registry.
- HEAD the config blob.
- GET the manifest and verify it matches what was pushed.
- Optionally (full mode): GET and hash each blob.
### Phase 4: Bulk verification summary and documentation
Add aggregate reporting to _process_multi():
- Track verification results per image.
- Print summary line: "47/47 images verified OK" or "45/47 verified, 2 FAILED: [list]".
- Update README, ARCHITECTURE.md, docs/command-reference.md.
- Add functional tests for the verification flow.
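The aggregate summary line could be produced by a small helper like this (a sketch; the per-image result shape is an assumption):

```python
def format_verify_summary(results):
    """Build the aggregate summary line for a bulk run.

    results maps image reference -> True (verified) / False (failed).
    """
    total = len(results)
    failed = sorted(ref for ref, ok in results.items() if not ok)
    if not failed:
        return f'{total}/{total} images verified OK'
    return (f'{total - len(failed)}/{total} verified, '
            f'{len(failed)} FAILED: {failed}')
```

Keeping the summary a pure function of collected results makes it trivial to unit-test and to reuse for both text and JSON reporting.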
| Phase | Plan | Status |
|---|---|---|
| 1. Verification framework and DirWriter | PLAN-post-write-verification-phase-01-framework.md | Not started |
| 2. TarWriter and DockerWriter verifiers | PLAN-post-write-verification-phase-02-tar-docker.md | Not started |
| 3. RegistryWriter verifier | PLAN-post-write-verification-phase-03-registry.md | Not started |
| 4. Bulk summary and documentation | PLAN-post-write-verification-phase-04-summary.md | Not started |
## Administration and logistics
### Success criteria
We will know this plan has been successfully implemented when the following statements are true:
- The code passes `flake8 --max-line-length=120` and `pre-commit run --all-files`.
- New code follows the existing pipeline pattern (input/filter/output interfaces) where applicable.
- There are unit tests for core logic and integration tests for new CLI commands.
- Lines are wrapped at 120 characters, single quotes for strings, double quotes for docstrings.
- Documentation in `docs/` has been updated to describe any new commands or features.
- `ARCHITECTURE.md`, `README.md`, and `AGENTS.md` have been updated if the change adds or modifies modules or CLI commands.
- `process --verify` is on by default and exits non-zero on verification failure.
- Bulk operations print an aggregate verification summary.
- Each output writer has a type-specific `verify()` implementation.
### Future work
- Integrate with the existing `check` command so that `check dir:///path/to/output` works (currently `check` only supports input URIs).
- Add a `--verify-only` mode that re-verifies a previously written output without reprocessing.
- Verification for the `proxy` command's downstream writes.
- A checksums file (e.g., `SHA256SUMS`) written alongside directory output for external verification tools.
### Bugs fixed during this work
(None yet.)
### Back brief
Before executing any step of this plan, please back brief the operator as to your understanding of the plan and how the work you intend to do aligns with that plan.