Pipeline Architecture¶
Occy Strap processes container images using a flexible pipeline pattern. This document explains how the pipeline works and how its components interact.
Overview¶
The pipeline follows a simple flow:
- Input Source reads image elements (config and layers) from a source
- Filters transform or inspect elements as they pass through
- Output Writer writes the processed elements to their destination
Image Elements¶
Container images consist of two types of elements:
| Element Type | Description |
|---|---|
| CONFIG_FILE | JSON file containing image metadata and configuration |
| IMAGE_LAYER | Tarball containing a filesystem layer |
Each element flows through the pipeline independently, allowing streaming processing without loading entire images into memory.
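The streaming idea can be sketched with a Python generator. This is an illustration only: `ElementType`, `fetch_elements`, and `count_bytes` are hypothetical names invented for this example, not Occy Strap's actual API, and the element payloads are placeholder bytes.

```python
import enum
import io
from typing import BinaryIO, Dict, Iterator, Tuple


class ElementType(enum.Enum):
    # Hypothetical stand-ins for the two element types in the table above.
    CONFIG_FILE = "config_file"
    IMAGE_LAYER = "image_layer"


Element = Tuple[ElementType, str, BinaryIO]


def fetch_elements() -> Iterator[Element]:
    """Yield one element at a time, so only one needs to be open at once."""
    yield ElementType.CONFIG_FILE, "abc123.json", io.BytesIO(b"{}")
    for name in ("layer-one", "layer-two"):
        # A real input would open a stream to the layer blob here.
        payload = b"tar bytes for " + name.encode()
        yield ElementType.IMAGE_LAYER, name, io.BytesIO(payload)


def count_bytes(elements: Iterator[Element]) -> Dict[ElementType, int]:
    """Consume elements as they arrive and total the bytes seen per type."""
    totals: Dict[ElementType, int] = {}
    for element_type, _name, data in elements:
        size = 0
        # Read in chunks rather than loading the whole element.
        for chunk in iter(lambda: data.read(8192), b""):
            size += len(chunk)
        totals[element_type] = totals.get(element_type, 0) + size
    return totals


totals = count_bytes(fetch_elements())
```

Because the consumer pulls elements one by one and reads each in chunks, memory use is bounded by the chunk size rather than the image size.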
Input Sources¶
Input sources implement the ImageInput interface and provide image elements
from various sources.
Registry Input¶
Fetches images from Docker/OCI registries using the HTTP API.
Capabilities:
- Token-based and basic authentication
- Multi-architecture image selection
- Manifest parsing (v1, v2, OCI formats)
- Individual layer blob fetching
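Multi-architecture selection can be illustrated as follows. An OCI image index (or Docker manifest list) carries a `manifests` array whose entries each name a `platform`. The helper `select_manifest` is a hypothetical name for this sketch, and the digests are made up; how Occy Strap itself performs the selection is not shown here.

```python
from typing import Optional


def select_manifest(index: dict, architecture: str,
                    os_name: str = "linux") -> Optional[str]:
    """Return the digest of the first manifest matching the platform."""
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if (platform.get("architecture") == architecture
                and platform.get("os") == os_name):
            return entry["digest"]
    return None  # no image published for this platform


# Trimmed example of an image index; the digests are placeholders.
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:aaaa",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbbb",
         "platform": {"architecture": "arm64", "os": "linux"}},
    ],
}

arm_digest = select_manifest(index, "arm64")
```

The selected digest would then be used to fetch the platform-specific manifest, which lists the config and layer blobs.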
Docker Daemon Input¶
Fetches images from local Docker or Podman daemons.
Uses the Docker Engine API over Unix socket. The entire image is streamed
(equivalent to docker save), then parsed on the fly.
Note: The Docker Engine API only provides complete image export - there's no way to fetch individual layers separately. This is a limitation of the API design.
Tarball Input¶
Reads images from existing docker-save format tarballs.
Parses manifest.json to locate config files and layers within the tarball.
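As a rough sketch of that lookup, the following builds a tiny docker-save style tarball in memory and reads its `manifest.json` with the standard `tarfile` module. The helper names and file contents are invented for the example; only the `Config` and `Layers` keys come from the docker-save format.

```python
import io
import json
import tarfile


def add_member(tar: tarfile.TarFile, name: str, payload: bytes) -> None:
    """Add an in-memory file to an open tarball."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def build_example_tarball() -> bytes:
    """Create a minimal docker-save style tarball with placeholder content."""
    manifest = [{"Config": "abc.json",
                 "RepoTags": ["busybox:latest"],
                 "Layers": ["def/layer.tar"]}]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        add_member(tar, "manifest.json", json.dumps(manifest).encode())
        add_member(tar, "abc.json", b"{}")
        add_member(tar, "def/layer.tar", b"layer bytes")
    return buf.getvalue()


def list_elements(tarball: bytes):
    """Return (config_name, layer_names) for the first image in the tarball."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r") as tar:
        manifest = json.load(tar.extractfile("manifest.json"))
        image = manifest[0]
        return image["Config"], image["Layers"]


config_name, layer_names = list_elements(build_example_tarball())
```

The names returned from the manifest are the member paths to open next when yielding the config and layer elements.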
Filters¶
Filters implement the decorator pattern, wrapping outputs (or other filters)
to transform or inspect elements. They inherit from ImageFilter.
How Filters Work¶
# Conceptual filter structure (transform() and new_name_if_changed are placeholders)
class MyFilter(ImageFilter):
    def __init__(self, wrapped_output):
        self.wrapped = wrapped_output

    def process_image_element(self, element_type, name, data):
        # Transform the element
        modified_data = transform(data)
        # Use a new name if the content (and therefore its hash) changed
        modified_name = new_name_if_changed
        # Pass to wrapped output
        self.wrapped.process_image_element(
            element_type, modified_name, modified_data)
Filter Capabilities¶
Filters can:
- Transform data - Modify element content (e.g., normalize timestamps)
- Transform names - Rename elements (e.g., after hash changes)
- Inspect elements - Read without modification (e.g., search)
- Skip elements - Exclude elements from output
- Accumulate state - Track information across elements
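A small runnable sketch of the skip and accumulate capabilities follows. The base class and output here are simplified stand-ins written for this example (the real `ImageFilter` may differ), and `SkipEmptyLayers` is a hypothetical filter, not one shipped with Occy Strap.

```python
class CollectingOutput:
    """Stand-in terminal output that records what it receives."""

    def __init__(self):
        self.received = []

    def process_image_element(self, element_type, name, data):
        self.received.append((element_type, name, data))


class ImageFilter:
    """Simplified stand-in for the real base class."""

    def __init__(self, wrapped_output):
        self.wrapped = wrapped_output


class SkipEmptyLayers(ImageFilter):
    """Hypothetical filter: drops empty layers and counts what it saw."""

    def __init__(self, wrapped_output):
        super().__init__(wrapped_output)
        self.seen = 0
        self.skipped = 0

    def process_image_element(self, element_type, name, data):
        self.seen += 1  # accumulate state across elements
        if element_type == "IMAGE_LAYER" and not data:
            self.skipped += 1
            return  # skip: the element is never forwarded
        self.wrapped.process_image_element(element_type, name, data)


output = CollectingOutput()
flt = SkipEmptyLayers(output)
flt.process_image_element("CONFIG_FILE", "abc.json", b"{}")
flt.process_image_element("IMAGE_LAYER", "layer-one", b"")
flt.process_image_element("IMAGE_LAYER", "layer-two", b"data")
```

Skipping is simply a matter of not calling the wrapped output, while state lives on the filter instance for the duration of the run.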
Available Filters¶
normalize-timestamps: Rewrites layer tarballs to set all file modification times to a consistent value. Since this changes content, SHA256 hashes are recalculated.
search: Searches layer contents for files matching patterns. Can operate as search-only (prints results) or passthrough (searches AND forwards elements).
exclude: Removes files matching glob patterns from layers, recalculating hashes afterward.
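The exclude behaviour can be approximated with the standard library. This sketch assumes `fnmatch`-style matching, which may not match Occy Strap's own glob semantics, and the helper names are hypothetical.

```python
import fnmatch
import hashlib
import io
import tarfile


def exclude_from_layer(layer: bytes, pattern: str):
    """Return (new_layer_bytes, sha256_hex) with matching members dropped."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            if fnmatch.fnmatchcase(member.name, pattern):
                continue  # excluded: not copied into the new layer
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    rewritten = out.getvalue()
    # The content changed, so the hash has to be computed again.
    return rewritten, hashlib.sha256(rewritten).hexdigest()


def make_layer(files: dict) -> bytes:
    """Build an in-memory layer tarball from a name -> bytes mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


layer = make_layer({"app/main.py": b"print('hi')",
                    "app/.git/config": b"[core]"})
new_layer, new_hash = exclude_from_layer(layer, "**/.git/**")
with tarfile.open(fileobj=io.BytesIO(new_layer), mode="r") as tar:
    remaining = tar.getnames()
```

Note that with `fnmatch` a `*` also matches `/`, so `**` behaves like a single `*` here; a stricter glob implementation would treat the two differently.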
Chaining Filters¶
Multiple filters are chained together:
occystrap process registry://... tar://output.tar \
    -f normalize-timestamps \
    -f "search:pattern=*.conf" \
    -f "exclude:pattern=**/.git/**"
The pipeline becomes:
Registry Input --> normalize-timestamps --> search --> exclude --> Tarball Output
Each filter wraps the next, forming a chain that processes elements in order.
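One way to build such a chain is to wrap the output in reverse order, so that the first filter named ends up outermost and sees each element first. This is a sketch under that ordering assumption; `NamedFilter`, `RecordingOutput`, and `build_chain` are hypothetical names, not Occy Strap's internals.

```python
class RecordingOutput:
    """Stand-in output that logs when an element reaches it."""

    def __init__(self, log):
        self.log = log

    def process_image_element(self, element_type, name, data):
        self.log.append("output")


class NamedFilter:
    """Hypothetical pass-through filter that logs its own label."""

    def __init__(self, label, wrapped_output, log):
        self.label = label
        self.wrapped = wrapped_output
        self.log = log

    def process_image_element(self, element_type, name, data):
        self.log.append(self.label)
        self.wrapped.process_image_element(element_type, name, data)


def build_chain(labels, output, log):
    """Wrap the output so the first label becomes the outermost filter."""
    head = output
    for label in reversed(labels):
        head = NamedFilter(label, head, log)
    return head


log = []
chain = build_chain(["normalize-timestamps", "search", "exclude"],
                    RecordingOutput(log), log)
chain.process_image_element("IMAGE_LAYER", "layer-one", b"data")
```

The input source only ever talks to the head of the chain, so adding or removing filters does not change the input or output code.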
Output Writers¶
Output writers implement the ImageOutput interface and handle the final
destination of processed elements.
Tarball Output¶
Creates docker-loadable tarballs in v1.2 format.
The tarball contains:
- manifest.json - Image manifest
- <hash>.json - Config file
- <hash>/layer.tar - Layer tarballs
Can be loaded with docker load -i output.tar.
Directory Output¶
Extracts images to directories.
Options:
- unique_names=true - Enable layer deduplication by prefixing filenames
- expand=true - Extract layer tarballs to filesystem
With unique_names, a catalog.json tracks which layers belong to which
images, allowing multiple images to share storage.
OCI Bundle Output¶
Creates OCI runtime bundles for runc.
Produces:
- config.json - OCI runtime configuration
- rootfs/ - Merged filesystem from all layers
Registry Output¶
Pushes images to Docker/OCI registries.
Uploads layers as blobs and creates the manifest.
Docker Daemon Output¶
Loads images into local Docker or Podman.
Uses the Docker Engine API to load the image.
Data Flow Example¶
Consider this command:
occystrap process registry://docker.io/library/busybox:latest \
    tar://busybox.tar -f normalize-timestamps
The data flow is:
1. Registry Input fetches manifest from docker.io
2. Registry Input yields CONFIG_FILE element
   --> TimestampNormalizer passes through unchanged
   --> TarWriter writes to tarball
3. For each layer:
   a. Registry Input fetches layer blob
   b. Registry Input yields IMAGE_LAYER element
   c. TimestampNormalizer rewrites tarball with epoch timestamps
   d. TimestampNormalizer recalculates SHA256
   e. TimestampNormalizer yields modified element with new name
   f. TarWriter writes modified layer to tarball
4. TarWriter.finalize() writes manifest.json
Key Concepts¶
Whiteout Files¶
OCI layers use special files to mark deletions:
- .wh.<filename> - Marks a specific file as deleted
- .wh..wh..opq - Marks entire directory as opaque (replaced)
These are processed when extracting layers with expand=true.
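To make the whiteout rules concrete, here is a minimal sketch that models each layer as a mapping of path to content and merges an upper layer onto a lower view. It illustrates the OCI whiteout convention only; `apply_layer` is a hypothetical helper and does not reflect how Occy Strap implements extraction.

```python
import posixpath

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def apply_layer(merged: dict, layer: dict) -> dict:
    """Apply one layer (path -> content) on top of the merged view."""
    result = dict(merged)
    # Handle deletions first so they only affect content from lower layers.
    for path in layer:
        directory, base = posixpath.split(path)
        if base == OPAQUE_MARKER:
            # Hide everything lower layers placed under this directory.
            prefix = directory + "/" if directory else ""
            result = {p: c for p, c in result.items()
                      if not p.startswith(prefix)}
        elif base.startswith(WHITEOUT_PREFIX):
            target = posixpath.join(directory, base[len(WHITEOUT_PREFIX):])
            # Remove the file itself, or a whole tree with that name.
            result = {p: c for p, c in result.items()
                      if p != target and not p.startswith(target + "/")}
    # Then add the layer's regular files; markers themselves are not kept.
    for path, content in layer.items():
        if not posixpath.basename(path).startswith(WHITEOUT_PREFIX):
            result[path] = content
    return result


lower = {"etc/a.conf": "a", "etc/b.conf": "b",
         "var/cache/x": "x", "var/cache/y": "y"}
upper = {"etc/.wh.a.conf": "",
         "var/cache/.wh..wh..opq": "",
         "var/cache/z": "z"}
merged = apply_layer(lower, upper)
```

In this example `etc/a.conf` is deleted by its whiteout, and the opaque marker hides the lower layer's `var/cache` contents while keeping `var/cache/z` from the same layer.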
Layer Deduplication¶
With unique_names=true, layers are stored with content-addressed names.
When downloading multiple images:
- First image stores layers normally
- Subsequent images check if layers already exist
- Shared layers are referenced, not duplicated
- catalog.json maps images to their layers
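The deduplication steps above can be sketched as follows. The on-disk layout and the shape of `catalog.json` used here (image name mapped to a list of layer filenames) are assumptions for illustration; the source does not specify the real schema, and `store_image` is a hypothetical helper.

```python
import hashlib
import json
import os
import tempfile


def store_image(directory, image, layers, catalog):
    """Store layers under content-addressed names, skipping existing ones."""
    written = 0
    catalog[image] = []
    for data in layers:
        name = hashlib.sha256(data).hexdigest() + ".tar"
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            # First time this content is seen: write it out.
            with open(path, "wb") as f:
                f.write(data)
            written += 1
        # Shared layers are only referenced, never written twice.
        catalog[image].append(name)
    with open(os.path.join(directory, "catalog.json"), "w") as f:
        json.dump(catalog, f, indent=2)
    return written


base = b"shared base layer"
with tempfile.TemporaryDirectory() as d:
    catalog = {}
    first = store_image(d, "app-one:latest", [base, b"app one"], catalog)
    second = store_image(d, "app-two:latest", [base, b"app two"], catalog)
    stored = sorted(n for n in os.listdir(d) if n.endswith(".tar"))
```

Two images with two layers each end up as three files on disk, because the shared base layer hashes to the same name both times.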
Hash Recalculation¶
When filters modify layer content (timestamps, file exclusion), the SHA256 hash changes. Filters that modify content:
- Process the layer tarball
- Calculate the new SHA256 hash
- Update the layer name to use the new hash
- Update the manifest to reference the new hash
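The first three steps can be sketched for the timestamp case with the standard `tarfile` and `hashlib` modules. This is an approximation, not the project's implementation: `normalize_layer` is a hypothetical helper, and using the hex digest directly as the layer name is an assumption.

```python
import hashlib
import io
import tarfile


def normalize_layer(layer: bytes, timestamp: int = 0):
    """Rewrite a layer with a fixed mtime; return (new_name, new_bytes)."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            member.mtime = timestamp  # the only field changed here
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    rewritten = out.getvalue()
    # New content means a new SHA256, which becomes the new layer name.
    return hashlib.sha256(rewritten).hexdigest(), rewritten


def make_layer(mtime: int) -> bytes:
    """Build a one-file layer whose file has the given modification time."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"key=value\n"
        info = tarfile.TarInfo(name="etc/app.conf")
        info.size = len(payload)
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


early, late = make_layer(1_000_000), make_layer(2_000_000)
name_a, data_a = normalize_layer(early)
name_b, data_b = normalize_layer(late)
```

Two layers that differ only in file modification time hash differently before normalization and identically afterwards, which is why the manifest must be updated to reference the new name.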