# Blob Storage Roadmap
This document describes the planned evolution of Shaken Fist's blob storage system, from the current etcd-based hash tracking to a fully content-addressable, deduplicated storage layer with support for content-defined chunking.
## Vision
The end state is a storage system that is:
- Content-addressable: Blobs can be looked up by hash in O(1) time
- Deduplicated: Identical content is stored once, regardless of how many blobs reference it
- Chunked: Large blobs are broken into content-defined chunks that can be shared across blobs
- Efficient: Hash lookups, duplicate detection, and verification are all database-indexed operations
End state architecture:
```
Instance → Blob A (composite) → [Chunk1, Chunk2, Chunk3]
Artifact → Blob B (composite) → [Chunk1, Chunk4, Chunk5]
                                    ↑
                         Shared chunk (stored once)
```
## The state before any changes
Before this work, blob storage had several limitations:
- Hash storage in etcd: Hashes stored at `attribute/blob/{uuid}/checksums` with O(n) lookup
- No deduplication: Two users uploading identical content get two copies on disk
- Separate tracking systems: Hashes in etcd, locations in the `object_references` table
- Brute-force hash lookup: Finding a blob by hash iterates through ALL blobs
Key code locations:
- Hash verification: `shakenfist/blob.py:915-977`
- Hash lookup API: `shakenfist/external_api/blob.py:345-358`
- Scheduled verification: `shakenfist/daemons/cluster/scheduled_tasks.py:95-111`
## Roadmap
### Phase 1: Hash Tracking in MariaDB
Goal: Move hash storage from etcd to MariaDB with proper indexes, enabling O(1) hash lookups.
Status: COMPLETE
Prerequisites: PLAN-reference-counts.md (COMPLETE)
What it enables:
- O(1) "find blob by hash" queries via the `idx_hash_lookup` index
- Proper separation of hash values from verification metadata
- Foundation for duplicate detection in Phase 2
- Support for additional hash algorithms (BLAKE3, etc.)
#### Schema
```sql
CREATE TABLE blob_hashes (
    blob_uuid UUID NOT NULL,
    node VARCHAR(255) NOT NULL,
    algorithm VARCHAR(32) NOT NULL,
    hash_value VARCHAR(256) NOT NULL,
    file_size BIGINT NOT NULL,
    computed_at TIMESTAMP NOT NULL,
    last_verified_at TIMESTAMP NOT NULL,
    verification_status VARCHAR(16) NOT NULL DEFAULT 'pending',
    error_message TEXT,
    PRIMARY KEY (blob_uuid, node, algorithm),
    INDEX idx_blob_status (blob_uuid, verification_status),
    INDEX idx_node (node),
    INDEX idx_last_verified (last_verified_at),
    INDEX idx_status (verification_status),
    INDEX idx_hash_lookup (algorithm, hash_value)
);
```
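As an illustration of the O(1) lookup this schema enables, here is a minimal sketch assuming a DB-API style cursor (for example from pymysql); in Shaken Fist the query would sit behind the database daemon's gRPC handlers rather than being issued directly.
```python
# A minimal sketch assuming a DB-API cursor (e.g. pymysql); in practice this
# query would live behind the database daemon's gRPC handlers.
def find_blobs_by_hash(cursor, algorithm, hash_value):
    # The WHERE clause matches idx_hash_lookup (algorithm, hash_value), so
    # this is an index lookup rather than a scan over every blob.
    cursor.execute(
        'SELECT blob_uuid, node, verification_status '
        'FROM blob_hashes '
        'WHERE algorithm = %s AND hash_value = %s',
        (algorithm, hash_value))
    return [(blob_uuid, node)
            for blob_uuid, node, status in cursor.fetchall()
            if status == 'valid']
```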
#### Migration Strategy
Hard cutover using `sf-client migrate-checksums-to-mariadb`:
- Run the migration command to copy existing checksums from etcd to MariaDB
- Deploy updated code that reads/writes only to MariaDB
- Delete the old `checksums` attributes from etcd
The migration command is idempotent and can be run multiple times safely.
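A rough sketch of why re-running is safe, assuming direct cursor access; `iterate_etcd_checksums()` is a hypothetical stand-in for enumerating the legacy etcd records, and the real command's internals may differ.
```python
# iterate_etcd_checksums() is a hypothetical stand-in for reading the legacy
# attribute/blob/{uuid}/checksums records from etcd; illustration only.
UPSERT = (
    'INSERT INTO blob_hashes '
    '(blob_uuid, node, algorithm, hash_value, file_size, computed_at, '
    ' last_verified_at, verification_status) '
    'VALUES (%s, %s, %s, %s, %s, %s, %s, %s) '
    'ON DUPLICATE KEY UPDATE '
    'hash_value = VALUES(hash_value), '
    'last_verified_at = VALUES(last_verified_at), '
    'verification_status = VALUES(verification_status)')


def migrate_checksums(cursor):
    # Idempotent: the (blob_uuid, node, algorithm) primary key plus
    # ON DUPLICATE KEY UPDATE turns every write into an upsert.
    for r in iterate_etcd_checksums():
        cursor.execute(UPSERT, (
            r.blob_uuid, r.node, r.algorithm, r.hash_value, r.file_size,
            r.computed_at, r.last_verified_at, r.verification_status))
```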
#### Implementation Steps
- Create Pydantic schema for `BlobHash` (see the sketch after this list)
- Add MariaDB table and access functions (following `object_references` patterns)
- Add gRPC handlers in database daemon
- Create migration command
- Update `verify_checksum()` to write to MariaDB
- Update `checksums` property to read from MariaDB
- Remove etcd checksum code
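For the first step, a `BlobHash` model might look roughly like this; the field types and defaults are assumptions mirroring the `blob_hashes` columns, not a committed schema.
```python
# Field names mirror the blob_hashes columns; types and defaults here are
# assumptions, not the committed schema.
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class BlobHash(BaseModel):
    blob_uuid: str
    node: str
    algorithm: str
    hash_value: str
    file_size: int
    computed_at: datetime
    last_verified_at: datetime
    verification_status: str = 'pending'
    error_message: Optional[str] = None
```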
### Phase 2: Composite Blobs and Lazy Deduplication
Goal: Enable storage deduplication by converting duplicate blobs to composites that reference shared content.
Status: Future
Prerequisites: Phase 1 (`blob_hashes` table with `idx_hash_lookup`)
What it enables:
- Storage savings when identical content is uploaded multiple times
- Incremental deduplication (happens lazily, no big migration)
- Foundation for content-defined chunking in Phase 3
#### Key Concept: Composite Blobs
Add a `type` field to blobs: `content` (default) or `composite`. A composite blob's content is defined by references to other blobs (chunks).
Before deduplication:
```
Blob A (type=content, 10GB file)  ←── Instance disk
Blob B (type=content, 10GB file)  ←── Artifact index
# Identical content, stored twice
```
After lazy dedup detects `A.hash == B.hash`:
```
Blob A (type=composite) → [C]     ←── Instance disk (resolves to C)
Blob B (type=composite) → [C]     ←── Artifact index (resolves to C)
Blob C (type=content, 10GB file)  ←── System-owned content blob
# Only one copy on disk
```
#### Ownership Model
Content blobs (type=content) are always system-owned. User-facing blobs are composites that reference shared content:
- When User A uploads content, system creates content blob C (system-owned) and composite blob A (user-owned) pointing to C
- When User B uploads identical content, system reuses C and creates composite blob B (user-owned) pointing to C
- When User A deletes blob A, composite A is deleted, C's ref_count decreases
- Content blob C only deleted when ref_count hits 0 (no composites reference it)
This cleanly separates user identity/permissions (on composite blobs) from content storage (on system-owned content blobs).
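A sketch of the deletion path implied by this model. All helper names here (`lookup_blob`, `decrement_ref_count`, `delete_content_blob`, `remove_blob_record`) are hypothetical, and `get_references_from` is used as described under Chunk Storage below.
```python
# Hypothetical helpers; this just traces the ownership rules above.
def delete_user_blob(composite_blob):
    # Deleting a user-facing composite never touches content bytes directly.
    for ref in get_references_from('BLOB', composite_blob.uuid,
                                   relationship='CHUNK'):
        content_blob = lookup_blob(ref.target_uuid)
        if decrement_ref_count(content_blob) == 0:
            # System-owned content is removed only once nothing references it.
            delete_content_blob(content_blob)
    remove_blob_record(composite_blob)
```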
#### How Deduplication Works
- Lazy hashing computes hashes for blobs (existing behavior)
- `idx_hash_lookup` finds blobs with matching hashes
- Create system-owned content blob (or reuse existing)
- Convert user blobs to composites pointing to content blob
- Delete duplicate files, keep only one copy
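The steps above, sketched as code. Every helper named here is hypothetical and exists only to make the sequence concrete.
```python
# Hypothetical helpers throughout; this only illustrates the dedup sequence.
def deduplicate(algorithm, hash_value):
    # idx_hash_lookup makes this a cheap indexed query.
    duplicates = blobs_with_hash(algorithm, hash_value)
    if len(duplicates) < 2:
        return

    # Create a system-owned content blob, adopting the first duplicate's
    # on-disk file (or reuse a content blob that already exists).
    content_blob = ensure_system_content_blob(duplicates[0])

    for blob in duplicates:
        # The user-facing blob keeps its UUID and ownership, but becomes a
        # composite that resolves to the shared content blob.
        mark_as_composite(blob)
        add_reference(source=('BLOB', blob.uuid), relationship='CHUNK',
                      relationship_value='0',
                      target=('BLOB', content_blob.uuid))

    # Reclaim the duplicate copies; the content blob keeps one.
    for blob in duplicates[1:]:
        delete_local_file(blob)
```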
#### Chunk Storage
Chunk references stored in the `object_references` table:
```
-- RelationshipType.CHUNK added to enum
source_type=BLOB, source_uuid=composite_blob,
relationship=CHUNK, relationship_value="0",
target_type=BLOB, target_uuid=content_blob
```
This provides:
- "What chunks make up blob X?" → `get_references_from(BLOB, X, relationship=CHUNK)`
- "What blobs use chunk C?" → `get_references_to(BLOB, C, relationship=CHUNK)`
- Ref counting works naturally: a chunk can't be deleted while referenced
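For example, resolving a composite blob back to its ordered chunk list might look like the following; the exact signature and return shape of `get_references_from` are assumptions made for this sketch.
```python
# get_references_from() is the query named above; its signature and return
# shape are assumed here for illustration.
def chunk_uuids(composite_blob_uuid):
    refs = get_references_from('BLOB', composite_blob_uuid,
                               relationship='CHUNK')
    # relationship_value carries the chunk index ("0", "1", ...), so a numeric
    # sort recovers the byte order of the composite blob's content.
    return [r.target_uuid
            for r in sorted(refs, key=lambda r: int(r.relationship_value))]
```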
### Phase 3: Content-Defined Chunking
Goal: Break large blobs into variable-sized chunks based on content boundaries, enabling cross-blob deduplication.
Status: Future (prototype in src/private/flywheel)
Prerequisites: Phase 2 (composite blobs)
What it enables:
- Deduplication at sub-blob granularity
- Efficient storage of similar but not identical images
- Reduced transfer sizes for incremental updates
#### Key Concept
Instead of storing a 10GB VM image as one blob, break it into ~1000 chunks of ~10MB each based on content boundaries (using algorithms like FastCDC with BLAKE3 hashing).
```
Blob X (composite) → [C1, C2, C3, C4, ...]  ←── Ubuntu 22.04 image
Blob Y (composite) → [C1, C2, C5, C6, ...]  ←── Ubuntu 22.04 with customizations
                      ↑   ↑
                 Shared chunks (base OS)
```
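To make the concept concrete, here is a toy chunker sketch. It uses a simple Gear-style rolling hash as a stand-in for real FastCDC, plus the third-party blake3 package for chunk identifiers; the size parameters are illustrative, not the values Shaken Fist would ship.
```python
# A toy content-defined chunker for illustration only: a simple Gear-style
# rolling hash stands in for FastCDC, and the third-party blake3 package
# (pip install blake3) names chunks by content.
import random

import blake3

# Deterministic per-byte gear table; a real implementation ships a fixed table.
_RNG = random.Random(0)
_GEAR = [_RNG.getrandbits(64) for _ in range(256)]

MIN_SIZE = 2 * 1024     # never cut a chunk shorter than 2 KiB
MAX_SIZE = 64 * 1024    # always cut by 64 KiB
MASK = (1 << 13) - 1    # ~8 KiB average chunk size


def chunk(data: bytes):
    """Yield (offset, length, blake3_hex) for each content-defined chunk."""
    start = 0
    fingerprint = 0
    for i, byte in enumerate(data):
        fingerprint = ((fingerprint << 1) + _GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (fingerprint & MASK) == 0) \
                or length >= MAX_SIZE:
            yield start, length, blake3.blake3(data[start:i + 1]).hexdigest()
            start = i + 1
            fingerprint = 0
    if start < len(data):
        yield start, len(data) - start, blake3.blake3(data[start:]).hexdigest()
```
Each (offset, length, hash) triple would become a content blob plus a CHUNK reference from the composite blob, using the Phase 2 infrastructure.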
#### Implementation Notes
- Uses same composite blob infrastructure from Phase 2
- Chunks are just content blobs referenced via CHUNK relationships
- BLAKE3 hashing for efficient chunk identification (schema already supports it)
- Content-defined chunking only for new uploads; existing blobs can be lazily re-chunked
## Design Decisions
### Why Composite Blobs Instead of a Separate ContentStore?
We considered several approaches to deduplication:
| Approach | Description | Tradeoffs |
|---|---|---|
| A: Blob → ContentStore | Two-layer: Blob (UUID) references ContentStore (hash) | Clean separation, but requires new object type and migration |
| B: Hash as Identity | Blobs identified by hash, not UUID | Matches Docker, but breaking API change |
| C: Defer | Keep current model, revisit later | Safe, but delays benefits |
| D: Composite Blobs | Single type with `content` or `composite` flag | Incremental, API-compatible, unifies dedup and chunking |
Decision: Option D (Composite Blobs)
Key insight: Option D achieves Option A's architecture using a single object type. Chunks are blobs; they just play a different role (system-owned, referenced only via CHUNK relationships, and holding the actual bytes).
```
Option A: Blob (public)                → ContentStore (internal, new type)
Option D: Blob (composite, user-owned) → Blob (content, system-owned)
```
This allows incremental migration:
- Day 1: All blobs are type=content (current state)
- Day N: Lazy dedup converts some to type=composite
- Day M: Content-defined chunking creates composites from the start
- End: Most user-facing blobs are composites, most storage is in shared chunks
### Why Store Chunks in `object_references`?
We considered three options for chunk list storage:
| Option | Storage | Pros | Cons |
|---|---|---|---|
| File content | Composite blob file = JSON manifest | Consistent | Two reads minimum |
| Blob attribute | etcd/MariaDB metadata | Single read | Large for many chunks; etcd scaling |
| object_references | CHUNK relationship type | Queryable; ref counting works | Another relationship type |
Decision: Use object_references table
This avoids creating another "two storage locations" problem and leverages infrastructure already built for PLAN-reference-counts.md.
### Why VARCHAR Instead of ENUM for Status?
The `verification_status` column uses `VARCHAR(16)` instead of `ENUM('valid', 'invalid', 'pending')` because ENUMs are painful to modify in MySQL. Using `VARCHAR` allows adding states like `'quarantined'` or `'verifying'` without schema migration.
## Open Questions
### REST API Hash References
Should the REST API accept hash-based references? e.g., `GET /blob/sha512:abc123...`
This would enable Docker-style content-addressable access but requires:
- Streaming hash computation on upload (sketched below)
- A decision on what to return if multiple composites share the same content
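On the first requirement, streaming hash computation just means hashing the upload as it is written rather than re-reading the file afterwards. A minimal sketch using the third-party blake3 package; `hashlib.sha512()` exposes the same `update()`/`hexdigest()` interface.
```python
# Hash the upload while it streams to disk so a hash-based reference is known
# as soon as the last byte lands. Uses the third-party blake3 package.
import blake3


def write_and_hash(stream, path, chunk_size=1024 * 1024):
    hasher = blake3.blake3()
    with open(path, 'wb') as out:
        while True:
            buf = stream.read(chunk_size)
            if not buf:
                break
            hasher.update(buf)
            out.write(buf)
    return hasher.hexdigest()
```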
### Chunk Size Strategy
For content-defined chunking, what's the optimal target chunk size?
- Smaller chunks = more deduplication, more metadata overhead
- Larger chunks = less deduplication, simpler management
- Need benchmarking with real workloads