# Phase 5: Benchmarking and tuning

## Prompt
Before responding to questions or discussion points in this document, explore the occystrap codebase thoroughly. Read relevant source files, understand existing patterns (pipeline architecture, input/filter/output interfaces, URI parsing, CLI commands, registry authentication, error handling), and ground your answers in what the code actually does today. Do not speculate about the codebase when you could read it instead. Where a question touches on external concepts (Docker Registry V2, OCI specs, container image formats, compression), research as needed to give a confident answer. Flag any uncertainty explicitly rather than guessing.
Consult ARCHITECTURE.md for the pipeline pattern, element types,
input/filter/output interfaces, and cross-cutting concerns (layer
caching, parallel downloads, compression). Consult CLAUDE.md for
build commands and project conventions.
I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit. Each commit should be self-contained: it should build, pass tests, and have a clear commit message explaining what changed and why.
## Goal

Measure the performance improvements from Phases 1-4, tune
default thread pool sizes, and document performance
characteristics so users know how to get the best results
from `-j` and `-J`.
## Current state

Phases 1-4 introduced:

- httpx with HTTP/2 — connection pooling, multiplexed streams, reduced TLS handshake overhead
- Parallel Quay API resolution — tag existence checks run concurrently via `ThreadPoolExecutor`
- Concurrent multi-image processing — `-J` flag for processing multiple images simultaneously
- `DirWriter` `os.rename` — zero-copy layer placement for same-filesystem temp dirs
There is no existing benchmarking infrastructure. The CI
functional tests verify correctness but don't measure
performance. We need a repeatable way to measure wall-clock
time across the key workflows and different `-j`/`-J` values.
## Design

### Benchmark script: `tools/benchmark.sh`

A shell script (per project convention — large scripts go in
`tools/`) that runs representative workflows and reports
timing. It requires a local Docker registry at `localhost:5000`
populated with test images (same setup as CI).
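As a sketch, the registry availability check could look like the following. The `/v2/` path is the Docker Registry HTTP API V2 version-check endpoint; the `check_registry` helper name is an assumption, and the image-population step is omitted since it depends on the CI setup.

```shell
# Sketch: verify a Docker Registry V2 endpoint is reachable before benchmarking.
# GET /v2/ is the Registry V2 API version-check endpoint; an unauthenticated
# local registry answers it with 200.
check_registry() {
    # -f: treat HTTP errors as failure; --max-time: don't hang on a dead host
    curl -fsS --max-time 5 "http://$1/v2/" > /dev/null 2>&1
}

# Illustrative use in benchmark.sh:
#   check_registry localhost:5000 || { echo 'no local registry at localhost:5000' >&2; exit 1; }
```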
### Workflows to benchmark

| ID | Command | Tests |
|---|---|---|
| single-pull | `registry://localhost:5000/library/ubuntu:latest` → `dir://` | Single image, network+disk |
| single-tar | `registry://localhost:5000/library/ubuntu:latest` → `tar://` | Single image, sequential output |
| single-push | `registry://localhost:5000/library/ubuntu:latest` → `registry://localhost:5000/bench/ubuntu:latest` | Mirror/push with compression |
| bulk-quay-info | `quay://projectquay/*:latest` info | API resolution speed |
| multi-dir | Multiple images → `dir://?unique_names=true` | Multi-image concurrency |
Each workflow runs with several `-j`/`-J` combinations to
find optimal defaults:

- `-j 1`, `-j 4` (current default), `-j 8`, `-j 16`
- `-J 1` (current default equivalent), `-J 3`, `-J 6`
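The sweep itself can be a pair of nested loops over those values. A minimal sketch follows; `run_workflow` is a placeholder for "invoke occystrap with these flags and time it", and the full cross-product shown here could be pruned in the real script:

```shell
# Sketch: run every workflow under each -j/-J combination.
# run_workflow (placeholder) is expected to execute and time one occystrap run.
J_VALUES="1 4 8 16"
BIGJ_VALUES="1 3 6"

sweep() {
    for workflow in "$@"; do
        for j in $J_VALUES; do
            for J in $BIGJ_VALUES; do
                run_workflow "$workflow" "$j" "$J"
            done
        done
    done
}
```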
### Output format

The script outputs a simple TSV table, one row per workflow
and flag combination. This is easy to paste into a spreadsheet
or process with `awk`/`column -t`. A `--json` flag outputs
JSON for programmatic consumption.
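The plan does not pin down the exact columns; a hypothetical layout (column names and the timing values shown are illustrative placeholders, not measurements) might be:

```
workflow	j	J	seconds
single-pull	4	1	12.3
single-tar	4	1	14.1
```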
### Timing

Each test is timed with `time` (wall-clock). The output
directory is cleaned between runs to ensure a cold cache.
The layer cache is disabled (no `--layer-cache`) to measure
raw transfer performance.
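A sketch of the per-run timing wrapper, combining the cold-cache cleanup with the measurement. It uses `date +%s` for a portable integer-seconds reading; the real script may prefer `time -p` for sub-second precision. `OUT_DIR` and `measure` are assumed names.

```shell
# Sketch: time one benchmark run with a cold cache.
# The output directory is removed first so no previous run's layers are reused;
# --layer-cache is deliberately not passed, so raw transfer cost is measured.
measure() {
    rm -rf "$OUT_DIR"              # cold cache: clean output between runs
    start=$(date +%s)
    "$@"                           # the occystrap command under test
    end=$(date +%s)
    echo $((end - start))          # wall-clock seconds
}
```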
### Performance tuning documentation: `docs/performance.md`

A new doc page covering:

- What `-j` controls — per-image layer download parallelism. Higher values help with high-latency registries; diminishing returns above 8 for most registries.
- What `-J` controls — multi-image concurrency. Higher values help when processing many small images (e.g., `quay://` bulk). Limited by registry rate limits.
- What `--rate-limit` controls — a requests-per-second cap. Required for aggressive `-j`/`-J` to avoid 429s.
- What `--retries` controls — persistence with exponential backoff on transient failures.
- Recommended settings — a table of scenarios with suggested `-j`/`-J`/`--rate-limit` values.
- Connection efficiency — httpx with HTTP/2 means fewer TLS handshakes and multiplexed requests. No user action needed; just context for why things are faster.
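The recommended-settings table could start from values this plan already states (defaults `-j 4`/`-J 3`, diminishing `-j` returns above 8, higher `-J` for bulk work) and be refined once benchmark numbers exist. An illustrative skeleton, with all values as starting points rather than measured recommendations:

```
| Scenario                         | -j | -J | --rate-limit            |
|----------------------------------|----|----|-------------------------|
| Single image, local registry     | 4  | 1  | (none)                  |
| Single image, high-latency       | 8  | 1  | (none)                  |
| Bulk mirror of many small images | 4  | 6  | set per registry policy |
```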
### Default tuning

Based on benchmark results, we may adjust the default values
for `-j` and `-J` in `main.py`. Current defaults:

- `-j 4` — likely fine for most use cases
- `-J 3` — may need adjustment based on measurements
Any default changes will be a separate commit with justification in the commit message.
## Open questions

- Should the benchmark script be runnable in CI?
  Recommendation: Not in this phase. CI environments have variable performance (shared runners, network variability). The script is designed for local use on a developer machine with a local registry. We can add a CI benchmark job later if we want regression detection, but that requires a dedicated runner with consistent performance.
- Should we add a `--benchmark` flag to the CLI itself?
  Recommendation: No. A separate script is more flexible and doesn't pollute the CLI with testing concerns. The script can be updated independently of the release cycle.
## Implementation steps

### Step 1: Create benchmark script

Write `tools/benchmark.sh` that:

- Checks for a local registry at `localhost:5000`
- Populates it with test images if needed
- Runs each workflow with each `-j`/`-J` combination
- Reports timing in TSV (default) or JSON (`--json`)
- Cleans output directories between runs
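The dual-format reporting can be factored into one helper. A sketch, where the `--json` flag name comes from this plan but the field names are illustrative:

```shell
# Sketch: emit one result row as TSV (default) or JSON (when --json was given).
emit_row() {
    mode="$1"; workflow="$2"; j="$3"; J="$4"; seconds="$5"
    if [ "$mode" = "json" ]; then
        printf '{"workflow":"%s","j":%s,"J":%s,"seconds":%s}\n' \
            "$workflow" "$j" "$J" "$seconds"
    else
        printf '%s\t%s\t%s\t%s\n' "$workflow" "$j" "$J" "$seconds"
    fi
}
```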
### Step 2: Create performance documentation

Write `docs/performance.md` covering the tuning knobs,
recommended settings, and how HTTP/2 connection efficiency
works under the hood.
### Step 3: Update docs index and README

Add a link to `docs/performance.md` from `docs/index.md`
and mention performance tuning in the README. Update
`ARCHITECTURE.md` if needed.
### Step 4: Run benchmarks and tune defaults
Run the benchmark script, analyze results, and adjust defaults if warranted. Document the benchmark results and rationale for any changes.
## Commit plan

- Add benchmark script and performance documentation. Create `tools/benchmark.sh` and `docs/performance.md`. Update `docs/index.md`, README, and ARCHITECTURE.md.
- Tune defaults based on benchmark results. Adjust `-j`/`-J` defaults if measurements justify it. Include a benchmark results summary in the commit message.
## Risk assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Benchmark results vary across machines | High | Low | Document that results are relative, not absolute |
| Rate limiting during bulk benchmarks | Medium | Low | Use local registry for most tests |
| Default changes regress some workflows | Low | Medium | Only change defaults with clear evidence |
## Administration and logistics

### Success criteria

We will know when this plan has been successfully implemented because the following statements will be true:

- The code passes `flake8 --max-line-length=120` and `pre-commit run --all-files`.
- New code follows the existing pipeline pattern (input/filter/output interfaces) where applicable.
- There are unit tests for core logic and integration tests for new CLI commands.
- Lines are wrapped at 120 characters; single quotes are used for strings and double quotes for docstrings.
- Documentation in `docs/` has been updated to describe any new commands or features. `ARCHITECTURE.md`, `README.md`, and `AGENTS.md` have been updated if the change adds or modifies modules or CLI commands.
## Future work
- CI-based benchmark regression detection with a dedicated stable runner.
- Flame graph / cProfile integration for identifying CPU-bound hotspots.
- Memory profiling for large bulk operations to ensure concurrent image processing doesn't exhaust RAM.
## Bugs fixed during this work
(None yet.)
## Back brief
Before executing any step of this plan, please back brief the operator as to your understanding of the plan and how the work you intend to do aligns with that plan.