Blob Storage Roadmap

This document describes the planned evolution of Shaken Fist's blob storage system, from the current etcd-based hash tracking to a fully content-addressable, deduplicated storage layer with support for content-defined chunking.

Vision

The end state is a storage system where:

  • Content-addressable: Blobs can be looked up by hash in O(1) time
  • Deduplicated: Identical content is stored once, regardless of how many blobs reference it
  • Chunked: Large blobs are broken into content-defined chunks that can be shared across blobs
  • Efficient: Hash lookups, duplicate detection, and verification are all database-indexed operations

End state architecture:

Instance → Blob A (composite) → [Chunk1, Chunk2, Chunk3]
Artifact → Blob B (composite) → [Chunk1, Chunk4, Chunk5]
                                   ↑
                         Shared chunk (stored once)

The state before any changes

Before this work, blob storage had several limitations:

  • Hash storage in etcd: Hashes stored at attribute/blob/{uuid}/checksums with O(n) lookup
  • No deduplication: Two users uploading identical content get two copies on disk
  • Separate tracking systems: Hashes in etcd, locations in object_references table
  • Brute-force hash lookup: Finding a blob by hash iterates through ALL blobs

Key code locations:

  • Hash verification: shakenfist/blob.py:915-977
  • Hash lookup API: shakenfist/external_api/blob.py:345-358
  • Scheduled verification: shakenfist/daemons/cluster/scheduled_tasks.py:95-111

Roadmap

Phase 1: Hash Tracking in MariaDB

Goal: Move hash storage from etcd to MariaDB with proper indexes, enabling O(1) hash lookups.

Status: COMPLETE

Prerequisites: PLAN-reference-counts.md (COMPLETE)

What it enables:

  • O(1) "find blob by hash" queries via the idx_hash_lookup index
  • Proper separation of hash values from verification metadata
  • A foundation for duplicate detection in Phase 2
  • Support for additional hash algorithms (BLAKE3, etc.)

Schema

CREATE TABLE blob_hashes (
    blob_uuid UUID NOT NULL,
    node VARCHAR(255) NOT NULL,
    algorithm VARCHAR(32) NOT NULL,
    hash_value VARCHAR(256) NOT NULL,
    file_size BIGINT NOT NULL,
    computed_at TIMESTAMP NOT NULL,
    last_verified_at TIMESTAMP NOT NULL,
    verification_status VARCHAR(16) NOT NULL DEFAULT 'pending',
    error_message TEXT,
    PRIMARY KEY (blob_uuid, node, algorithm),
    INDEX idx_blob_status (blob_uuid, verification_status),
    INDEX idx_node (node),
    INDEX idx_last_verified (last_verified_at),
    INDEX idx_status (verification_status),
    INDEX idx_hash_lookup (algorithm, hash_value)
);
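
For illustration, this is the kind of lookup idx_hash_lookup makes cheap. It is a minimal sketch assuming a standard DB-API cursor; the real access path goes through the database daemon's gRPC handlers rather than direct SQL, and the selected columns are illustrative.

# Sketch only: an O(1) "find blob by hash" lookup against blob_hashes.
LOOKUP_SQL = """
SELECT blob_uuid, node, file_size, verification_status
FROM blob_hashes
WHERE algorithm = %s AND hash_value = %s
"""


def find_blobs_by_hash(cursor, algorithm, hash_value):
    # idx_hash_lookup (algorithm, hash_value) makes this an index seek
    # rather than a scan across every blob's attributes.
    cursor.execute(LOOKUP_SQL, (algorithm, hash_value))
    return cursor.fetchall()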

Migration Strategy

Hard cutover using sf-client migrate-checksums-to-mariadb:

  1. Run migration command to copy existing checksums from etcd to MariaDB
  2. Deploy updated code that reads/writes only to MariaDB
  3. Delete old checksums attributes from etcd

The migration command is idempotent and can be run multiple times safely.
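
A sketch of why re-running is safe, assuming each etcd checksum entry is written with an upsert: already-migrated rows are simply refreshed. The input shape and column handling here are illustrative, not the actual sf-client internals.

# Sketch only: idempotent copy of etcd checksums into blob_hashes.
UPSERT_SQL = """
INSERT INTO blob_hashes
    (blob_uuid, node, algorithm, hash_value, file_size,
     computed_at, last_verified_at, verification_status)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    hash_value = VALUES(hash_value),
    file_size = VALUES(file_size),
    last_verified_at = VALUES(last_verified_at),
    verification_status = VALUES(verification_status)
"""


def migrate_checksums(cursor, etcd_checksums):
    # etcd_checksums: an iterable of dicts parsed from the old
    # attribute/blob/{uuid}/checksums entries.
    for c in etcd_checksums:
        cursor.execute(UPSERT_SQL, (
            c['blob_uuid'], c['node'], c['algorithm'], c['hash_value'],
            c['file_size'], c['computed_at'], c['last_verified_at'],
            c.get('verification_status', 'pending')))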

Implementation Steps

  1. Create Pydantic schema for BlobHash (see the sketch after this list)
  2. Add MariaDB table and access functions (following object_references patterns)
  3. Add gRPC handlers in database daemon
  4. Create migration command
  5. Update verify_checksum() to write to MariaDB
  6. Update checksums property to read from MariaDB
  7. Remove etcd checksum code
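
A possible shape for the BlobHash Pydantic model, mirroring the blob_hashes table above. Field names and types are assumptions based on that schema, not the shipped code.

from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class BlobHash(BaseModel):
    blob_uuid: str
    node: str
    algorithm: str                        # e.g. 'sha512', later 'blake3'
    hash_value: str
    file_size: int
    computed_at: datetime
    last_verified_at: datetime
    verification_status: str = 'pending'  # 'pending', 'valid', 'invalid', ...
    error_message: Optional[str] = None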

Phase 2: Composite Blobs and Lazy Deduplication

Goal: Enable storage deduplication by converting duplicate blobs to composites that reference shared content.

Status: Future

Prerequisites: Phase 1 (blob_hashes table with idx_hash_lookup)

What it enables:

  • Storage savings when identical content is uploaded multiple times
  • Incremental deduplication (it happens lazily, with no big migration)
  • A foundation for content-defined chunking in Phase 3

Key Concept: Composite Blobs

Add a type field to blobs: content (default) or composite. A composite blob's content is defined by references to other blobs (chunks).

Before deduplication:
  Blob A (type=content, 10GB file)  ←── Instance disk
  Blob B (type=content, 10GB file)  ←── Artifact index
  # Identical content, stored twice

After lazy dedup detects A.hash == B.hash:
  Blob A (type=composite) → [C]     ←── Instance disk (resolves to C)
  Blob B (type=composite) → [C]     ←── Artifact index (resolves to C)
  Blob C (type=content, 10GB file)  ←── System-owned content blob
  # Only one copy on disk

Ownership Model

Content blobs (type=content) are always system-owned. User-facing blobs are composites that reference shared content:

  • When User A uploads content, system creates content blob C (system-owned) and composite blob A (user-owned) pointing to C
  • When User B uploads identical content, system reuses C and creates composite blob B (user-owned) pointing to C
  • When User A deletes blob A, composite A is deleted, C's ref_count decreases
  • Content blob C only deleted when ref_count hits 0 (no composites reference it)

This cleanly separates user identity/permissions (on composite blobs) from content storage (on system-owned content blobs).
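
A toy, in-memory illustration of that lifecycle. It models the ownership rules only and is not Shaken Fist code; the real reference counting comes from the object_references infrastructure.

content_blobs = {}    # uuid -> {'ref_count': int}, system-owned, hold bytes
composite_blobs = {}  # uuid -> {'owner': str, 'content_uuid': str}, user-owned


def upload(user, composite_uuid, content_uuid):
    # Reuse the content blob if the bytes are already stored, otherwise
    # create it; either way the new composite adds one reference.
    content_blobs.setdefault(content_uuid, {'ref_count': 0})
    content_blobs[content_uuid]['ref_count'] += 1
    composite_blobs[composite_uuid] = {'owner': user,
                                       'content_uuid': content_uuid}


def delete(composite_uuid):
    content_uuid = composite_blobs.pop(composite_uuid)['content_uuid']
    content_blobs[content_uuid]['ref_count'] -= 1
    if content_blobs[content_uuid]['ref_count'] == 0:
        # No composite references the content any more; drop the bytes.
        del content_blobs[content_uuid]


upload('user-a', 'A', 'C')   # User A uploads; C is created, ref_count == 1
upload('user-b', 'B', 'C')   # User B uploads identical content; ref_count == 2
delete('A')                  # C survives with ref_count == 1
delete('B')                  # the last reference goes, so C is removed
assert 'C' not in content_blobs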

How Deduplication Works

  1. Lazy hashing computes hashes for blobs (existing behavior)
  2. idx_hash_lookup finds blobs with matching hashes
  3. Create system-owned content blob (or reuse existing)
  4. Convert user blobs to composites pointing to content blob
  5. Delete the duplicate files, keeping only one copy (see the sketch after this list)
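
A pseudocode-level sketch of one deduplication pass covering steps 2-5. Every helper is passed in as a parameter because the real access functions (hash lookup, blob conversion, file removal) are not defined by this roadmap.

def dedup_pass(find_blobs_with_hash, create_system_content_blob,
               convert_to_composite, remove_local_file,
               algorithm, hash_value):
    duplicates = find_blobs_with_hash(algorithm, hash_value)
    if len(duplicates) < 2:
        return  # nothing to deduplicate

    # Create (or reuse) the system-owned content blob. It is assumed to
    # hard-link or copy one existing file, so the per-blob copies below
    # can all be removed safely.
    content_blob = create_system_content_blob(duplicates[0])

    for blob in duplicates:
        # The user-facing blob becomes a composite pointing at the shared
        # content via a CHUNK reference, then its duplicate file goes away.
        convert_to_composite(blob, content_blob)
        remove_local_file(blob)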

Chunk Storage

Chunk references stored in object_references table:

-- RelationshipType.CHUNK added to enum
source_type=BLOB, source_uuid=composite_blob,
relationship=CHUNK, relationship_value="0",
target_type=BLOB, target_uuid=content_blob

This provides:

  • "What chunks make up blob X?" → get_references_from(BLOB, X, relationship=CHUNK)
  • "What blobs use chunk C?" → get_references_to(BLOB, C, relationship=CHUNK)
  • Reference counting works naturally: a chunk can't be deleted while it is still referenced
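
As a toy illustration of those two queries, here they are run against reference rows shaped like the example above. Field names follow that example; this is not the real access-function API.

references = [
    {'source_type': 'BLOB', 'source_uuid': 'composite-blob',
     'relationship': 'CHUNK', 'relationship_value': '1',
     'target_type': 'BLOB', 'target_uuid': 'content-blob-2'},
    {'source_type': 'BLOB', 'source_uuid': 'composite-blob',
     'relationship': 'CHUNK', 'relationship_value': '0',
     'target_type': 'BLOB', 'target_uuid': 'content-blob-1'},
]


def chunks_of(blob_uuid):
    # "What chunks make up blob X?" -- relationship_value holds the chunk's
    # position within the composite, so sorting it numerically recovers order.
    rows = [r for r in references
            if r['source_uuid'] == blob_uuid and r['relationship'] == 'CHUNK']
    rows.sort(key=lambda r: int(r['relationship_value']))
    return [r['target_uuid'] for r in rows]


def blobs_using(chunk_uuid):
    # "What blobs use chunk C?"
    return [r['source_uuid'] for r in references
            if r['target_uuid'] == chunk_uuid and r['relationship'] == 'CHUNK']


assert chunks_of('composite-blob') == ['content-blob-1', 'content-blob-2']
assert blobs_using('content-blob-1') == ['composite-blob']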


Phase 3: Content-Defined Chunking

Goal: Break large blobs into variable-sized chunks based on content boundaries, enabling cross-blob deduplication.

Status: Future (prototype in src/private/flywheel)

Prerequisites: Phase 2 (composite blobs)

What it enables:

  • Deduplication at sub-blob granularity
  • Efficient storage of similar but not identical images
  • Reduced transfer sizes for incremental updates

Key Concept

Instead of storing a 10GB VM image as one blob, break it into ~1000 chunks of ~10MB each based on content boundaries (using algorithms like FastCDC with BLAKE3 hashing).

Blob X (composite) → [C1, C2, C3, C4, ...]  ←── Ubuntu 22.04 image
Blob Y (composite) → [C1, C2, C5, C6, ...]  ←── Ubuntu 22.04 with customizations
                       ↑   ↑
                 Shared chunks (base OS)
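
A deliberately simplified content-defined chunker to show the idea: a rolling "gear" hash picks boundaries from the bytes themselves, so an edit early in an image only changes nearby chunks. Production code would use FastCDC with BLAKE3; hashlib.blake2b and the size parameters below are stand-ins so this sketch runs with only the standard library.

import hashlib
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random values
MASK = (1 << 13) - 1            # boundary roughly every 8 KiB (toy value)
MIN_SIZE, MAX_SIZE = 2 * 1024, 64 * 1024


def chunk(data):
    # Returns a list of (digest, bytes) pairs whose concatenation is `data`.
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            piece = data[start:i + 1]
            chunks.append((hashlib.blake2b(piece).hexdigest(), piece))
            start, h = i + 1, 0
    if start < len(data):
        piece = data[start:]
        chunks.append((hashlib.blake2b(piece).hexdigest(), piece))
    return chunks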

Implementation Notes

  • Uses same composite blob infrastructure from Phase 2
  • Chunks are just content blobs referenced via CHUNK relationships
  • BLAKE3 hashing for efficient chunk identification (schema already supports it)
  • Content-defined chunking only for new uploads; existing blobs can be lazily re-chunked

Design Decisions

Why Composite Blobs Instead of a Separate ContentStore?

We considered several approaches to deduplication:

  • Approach A: Blob → ContentStore. A two-layer model where a Blob (UUID) references a ContentStore object (hash). Clean separation, but requires a new object type and a migration.
  • Approach B: Hash as identity. Blobs are identified by hash rather than UUID. Matches Docker's model, but is a breaking API change.
  • Approach C: Defer. Keep the current model and revisit later. Safe, but delays the benefits.
  • Approach D: Composite blobs. A single blob type with a content or composite flag. Incremental, API-compatible, and unifies deduplication and chunking.

Decision: Option D (Composite Blobs)

Key insight: Option D achieves Option A's architecture with a single object type. Chunks are just blobs playing a different role: they are system-owned, referenced only via CHUNK relationships, and hold the actual bytes.

Option A: Blob (public) → ContentStore (internal, new type)
Option D: Blob (composite, user-owned) → Blob (content, system-owned)

This allows incremental migration:

  • Day 1: All blobs are type=content (the current state)
  • Day N: Lazy dedup converts some of them to type=composite
  • Day M: Content-defined chunking creates composites from the start
  • End state: Most user-facing blobs are composites, and most storage lives in shared chunks

Why Store Chunks in object_references?

We considered three options for chunk list storage:

  • File content: the composite blob's own file is a JSON manifest of its chunks. Consistent with existing blob storage, but every access needs at least two reads.
  • Blob attribute: the chunk list lives in etcd/MariaDB metadata. A single read, but the attribute grows large for blobs with many chunks, and etcd scaling is a concern.
  • object_references: a CHUNK relationship type. Queryable and reference counting works, at the cost of adding another relationship type.

Decision: Use object_references table

This avoids creating another "two storage locations" problem and leverages infrastructure already built for PLAN-reference-counts.md.

Why VARCHAR Instead of ENUM for Status?

The verification_status column uses VARCHAR(16) instead of ENUM('valid', 'invalid', 'pending') because ENUMs are painful to modify in MySQL. Using VARCHAR allows adding states like 'quarantined' or 'verifying' without schema migration.


Open Questions

REST API Hash References

Should the REST API accept hash-based references? e.g., GET /blob/sha512:abc123...

This would enable Docker-style content-addressable access, but requires:

  • Streaming hash computation on upload (a minimal sketch follows below)
  • A decision about what to return when multiple composites share the same content
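
The streaming requirement is straightforward; here is a minimal sketch that hashes as upload data arrives, so the sha512:... reference is known the moment the last byte lands. The stream interface and chunk size are assumptions, not the actual API code.

import hashlib


def write_upload(stream, destination_path):
    # Hash and persist the upload in a single pass, without re-reading it.
    digest = hashlib.sha512()
    with open(destination_path, 'wb') as out:
        for piece in iter(lambda: stream.read(1024 * 1024), b''):
            digest.update(piece)
            out.write(piece)
    return 'sha512:' + digest.hexdigest()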

Chunk Size Strategy

For content-defined chunking, what is the optimal target chunk size?

  • Smaller chunks give more deduplication but more metadata overhead
  • Larger chunks give less deduplication but simpler management
  • Benchmarking with real workloads is needed