Development Plans

This section contains forward-looking roadmaps for Shaken Fist development. These documents describe planned features and architectural directions.

Forward-Looking Statements

Plans describe intended future work and may change based on implementation experience, community feedback, or shifting priorities. Check the status table below to see what has been implemented.

Plan sequencing

The set of incomplete plans has grown to the point where the order they land in matters. The intended sequencing is:

1. Network operations facade: complete. Landed via the network-facade branch.
2. Retire etcd: absorbed into BYO MariaDB and sf-database tier phase 0; the standalone PLAN-remove-etcd.md has been removed. Inspection confirmed the etcd data drain is complete (the DATA_MIGRATIONS dict is empty), but the supporting machinery (shakenfist/etcd.py, etcd3gw, the etcd proto stubs, the drain test files, is_etcd_master, the migration-era sf-ctl aliases) is still in tree. Phase 0 of the BYO-MariaDB plan deletes it as a single sweep. The etcd_master ansible group name rename stays with PLAN-remove-primary phase 7 as a deploy-scope concern.
3. Health checks, readiness, and graceful drain: complete. A precondition for the BYO load-balancer story in remove-primary being operationally honest. Delivered sf-api /livez, /readyz, and /healthz endpoints with SIGTERM drain, dependency-aware grpc.health.v1 on sf-database, systemd WATCHDOG liveness on the worker/elected daemons (which also closes the cluster-lock proof-of-life gap), and operator LB/upgrade docs. Landed on the health-checks branch. Node resource health is a sibling (complete) on a different axis: it drives node.state from the health of the storage/resource dependencies a node's hosted object types declare, so a dead disk or hung NFS mount takes a node out of scheduling. It was not sequenced against the BYO thread; it grew out of the sf-6 blob-NVMe incident and landed independently on the node-resource-health branch.
4. Remove the primary node: the BYO-infrastructure scope reduction. Phase 7 finishes the deployer-level etcd_master → database_node rename, by which point the drain code itself is long gone. Naturally followed by a wipe-and-redeploy of Mikal's production cluster against the new shape.
5. BYO MariaDB and sf-database as a tier: lifted out of remove-primary because it grew into its own master plan. Removes MariaDB-server install from the deployer entirely, reshapes sf-database into a deployer-chosen tier of equal stateless instances reached via client-side gRPC load balancing (not leader election), and carves schema/migration execution out of daemon startup into an operator-run sf-ctl ensure-mariadb-schema command. Can land in parallel with remove-primary's remaining phases; its phase 1 also performs the scope-shift edit to remove-primary itself.
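
To make "a tier of equal stateless instances reached via client-side load balancing" a little more concrete, the sketch below rotates through a list of endpoints and skips any that have been marked unhealthy. It is plain Python for illustration only and is not the Shaken Fist channel factory: `EndpointPicker`, the host names, and the port are all hypothetical, and a real gRPC client would more likely lean on the gRPC library's own round-robin policy than hand-roll this.

```python
import itertools


class EndpointPicker:
    """Round-robin over equal endpoints, skipping ones marked unhealthy.

    Hypothetical helper for illustration; not part of Shaken Fist.
    """

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError('at least one endpoint is required')
        self._endpoints = list(endpoints)
        self._unhealthy = set()
        self._cycle = itertools.cycle(self._endpoints)

    def mark_unhealthy(self, endpoint):
        self._unhealthy.add(endpoint)

    def mark_healthy(self, endpoint):
        self._unhealthy.discard(endpoint)

    def pick(self):
        # Consider each endpoint at most once per call so that a fully
        # unhealthy tier fails fast instead of spinning forever.
        for _ in range(len(self._endpoints)):
            candidate = next(self._cycle)
            if candidate not in self._unhealthy:
                return candidate
        raise RuntimeError('no healthy endpoints')


# Host names and port are made up for the example.
picker = EndpointPicker(['sf-db-1:13002', 'sf-db-2:13002', 'sf-db-3:13002'])
first_round = [picker.pick() for _ in range(3)]

picker.mark_unhealthy('sf-db-2:13002')
after_failure = [picker.pick() for _ in range(4)]
```

Because no instance is special, losing one only narrows the rotation; there is no election to wait for, which is the property the tier design is after.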

Remove syslog forwarding (ship logs to Loki) delivers the "Loki-shipper story" that remove-primary phase 1 is explicitly gated on — it adds structured-JSON logging and an in-process, on-disk-spooled Loki push (modelled on the eventlog spool/drainer) before deleting the rsyslog wiring. It can land in parallel with the other BYO work and is sequenced ahead of remove-primary phase 1, which it realises.
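
The spool-then-push shape described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the `logship_spool.py` / `logship_drainer.py` implementation: the function names, the `ts_ns` and `message` field names, and the `job` label are hypothetical. The request body follows the JSON layout that Loki's HTTP push endpoint (`/loki/api/v1/push`) accepts, with timestamps as nanosecond strings. The network call itself is passed in as `push` so the sketch stays self-contained.

```python
import json
import os
import tempfile
import uuid


def spool_record(spool_dir, record):
    """Persist one structured log record to disk before any network I/O."""
    name = '%020d-%s.json' % (record['ts_ns'], uuid.uuid4().hex)
    tmp_path = os.path.join(spool_dir, name + '.tmp')
    with open(tmp_path, 'w') as f:
        json.dump(record, f)
    # Rename into place so the drainer never reads a half-written file.
    os.rename(tmp_path, os.path.join(spool_dir, name))


def build_push_body(records, labels):
    """Shape records into the JSON body Loki's push endpoint accepts."""
    values = [[str(r['ts_ns']), json.dumps(r)] for r in records]
    return {'streams': [{'stream': labels, 'values': values}]}


def drain(spool_dir, labels, push):
    """Push spooled records; delete them only after push() succeeds."""
    names = sorted(n for n in os.listdir(spool_dir) if n.endswith('.json'))
    if not names:
        return 0
    records = []
    for name in names:
        with open(os.path.join(spool_dir, name)) as f:
            records.append(json.load(f))
    push(build_push_body(records, labels))  # expected to raise on failure
    for name in names:
        os.remove(os.path.join(spool_dir, name))
    return len(names)


sent = []
with tempfile.TemporaryDirectory() as spool:
    spool_record(spool, {'ts_ns': 1, 'message': 'instance created'})
    spool_record(spool, {'ts_ns': 2, 'message': 'instance started'})
    drained = drain(spool, {'job': 'shakenfist'}, sent.append)
    remaining = os.listdir(spool)
```

If `push` raises, the files stay on disk and the next drain attempt retries them, which is why the spool sits between the logging handler and the network.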

The remaining incomplete plans — Embrace TLS, Sticky blob transfers, Replace exec'd network commands with netlink, Atomic scheduling via reservations, the connected Generic allocator / Network service ports / Network carrier model triple, and the not-yet-drafted OpenTelemetry instrumentation thread — are intentionally not ordered relative to each other here. They each have specific dependencies on either remove-primary having established the BYO shape (the operator-provides-PKI surface for TLS, the streaming-proxy baseline for sticky transfers, the sf-database election pattern for the others) or network-facade having landed (the netlink plan, whose privilege-separation phases need network-facade's single-mutator property), but among themselves the order is a triage decision best made when remove-primary is close to landing rather than now. The scheduler-reservations plan is independent of the BYO shape but benefits from the OpenTelemetry thread landing first so that phase 0's design choices can be informed by real load and contention numbers.

Database load reduction sits outside that triage: it addresses a measured production problem (the sf-database tier serving ~527 ops/second at idle, 57% of it one polling loop) and its first phase is deliberately shippable immediately. Its phase 4 (caller attribution via gRPC metadata and per-caller counter labels), preceded by a phase 3 that consolidates the three sf-database client stacks into one so attribution has a single interceptor seam, is also the first concrete slice of the OpenTelemetry instrumentation thread — the caller-identity plumbing is what a later span-propagation phase would reuse, and it is designed to compose with the mTLS peer-identity model from Embrace TLS rather than duplicate it.
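
As a rough picture of what per-caller attribution buys, the sketch below reads a caller identity from request metadata and increments a counter labelled by operation and calling daemon. It is an assumption-laden illustration rather than the sf-database interceptor: the metadata key `x-sf-caller-daemon`, the `record_request` helper, and the daemon names are hypothetical, and a plain `collections.Counter` stands in for a real metrics library.

```python
from collections import Counter

# Hypothetical metadata key; the real plumbing may use a different name.
CALLER_KEY = 'x-sf-caller-daemon'

# Stand-in for a labelled counter: keys are (operation, caller_daemon).
database_requests_total = Counter()


def record_request(operation, metadata):
    """Count one request, labelled by operation and calling daemon."""
    # gRPC metadata is a sequence of (key, value) pairs.
    caller = dict(metadata).get(CALLER_KEY, 'unknown')
    database_requests_total[(operation, caller)] += 1
    return caller


record_request('get_node', [(CALLER_KEY, 'sf-net')])
record_request('get_node', [(CALLER_KEY, 'sf-net')])
record_request('dequeue', [(CALLER_KEY, 'sf-api')])
record_request('dequeue', [])  # callers without metadata are still counted

(top_operation, top_caller), top_count = (
    database_requests_total.most_common(1)[0])
```

With counts broken down this way, a question like "which daemon is responsible for most `get_node` traffic" becomes a lookup rather than a guess.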

The generic-allocator / network-service-ports / network-carrier-model triple is internally ordered. Generic-allocator is the foundational refactor (replaces five ad-hoc allocators with one primitive and is independently shippable). Network-service-ports builds on the allocator to expose per-network DNAT'd ports for managed services (web consoles, transfer agents, managed VPN endpoints). Network-carrier-model layers a smeared lease-based per-network carrier role with VIP advertisement on top, removing the network-node singleton; it depends on both prior plans and is the largest of the three. The triple supersedes the "network node failover" thread that was previously a not-yet-drafted line item.
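
The idea of a single allocation primitive that claims a value by inserting a row can be sketched as below. This is a minimal sketch under assumptions: `sqlite3` stands in for MariaDB so the example is self-contained, the column layout of `resource_pool_allocations` is invented for illustration, and relying on a primary key with `INSERT OR IGNORE` is only one way to realise a conditional insert. The actual allocation strategy is an open phase 0 decision.

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Hypothetical schema: one row per claimed value within a named pool.
db.execute(
    'CREATE TABLE resource_pool_allocations ('
    ' pool TEXT NOT NULL, value INTEGER NOT NULL, owner TEXT NOT NULL,'
    ' PRIMARY KEY (pool, value))')


def allocate(pool, candidates, owner):
    """Claim the first free value; the primary key arbitrates races."""
    for value in candidates:
        with db:  # commit (or roll back) each attempt as its own transaction
            cur = db.execute(
                'INSERT OR IGNORE INTO resource_pool_allocations '
                '(pool, value, owner) VALUES (?, ?, ?)',
                (pool, value, owner))
        if cur.rowcount == 1:
            return value  # our row went in, so the value is ours
    return None  # pool exhausted


def release(pool, value):
    with db:
        db.execute(
            'DELETE FROM resource_pool_allocations '
            'WHERE pool = ? AND value = ?', (pool, value))


a = allocate('vxlan', range(1, 4), 'network-a')
b = allocate('vxlan', range(1, 4), 'network-b')
c = allocate('vxlan', range(1, 4), 'network-c')
exhausted = allocate('vxlan', range(1, 4), 'network-d')

release('vxlan', 2)
reused = allocate('vxlan', range(1, 4), 'network-e')

# A different pool has its own value space.
other_pool = allocate('mac', range(1, 4), 'interface-a')
```

Because the database enforces uniqueness, two callers racing for the same value cannot both succeed, and no cluster-wide lock or local probe is needed to decide the winner.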

The blob-storage and SQL-pushdown roadmaps and the network-facade plan run on their own cadence and are not part of this sequencing.

Plan Status

Plan	Phase	Status	Description
Blob Storage Roadmap	Phase 1: Hash Tracking	Complete	Move hash storage to MariaDB
Blob Storage Roadmap	Phase 2: Lazy Dedup	Future	Composite blobs and deduplication
Blob Storage Roadmap	Phase 3: Chunking	Future	Content-defined chunking
API Query Batching	Phase 1: Batch Infrastructure	Planning	Add batch query functions
API Query Batching	Phase 2: Prefetch Pattern	Future	Modify API to prefetch related data
API Query Batching	Phase 3: Generic Framework	Future	Declarative prefetch requirements
SQL-pushdown Filtering	Phase 1: Query Infrastructure	Complete	Typed criteria + generic `find_objects` primitive
SQL-pushdown Filtering	Phase 2: Artifact Pushdown	Complete	Push state/namespace/name for Artifact lookups to SQL
SQL-pushdown Filtering	Phase 3: Instance and Network Pushdown	Complete	Mirror Artifact pushdown for Instance and Network
SQL-pushdown Filtering	Phase 4: Iterator Rework	Complete	Port iterators to single pushed-down query
SQL-pushdown Filtering	Phase 5: Ad-hoc Bulk Scan Cleanup	Complete	Eliminate remaining full-table scans on filter paths
SQL-pushdown Filtering	Phase 6: Tests and Documentation	Complete	Coverage and docs updates
SQL-pushdown Filtering	Phase 7: Denormalised Child-UUID List Removal	Complete	Replace cached UUID lists on attributes tables with SQL queries
Replace last_cluster_operation	Phase 1: `has_pending_cluster_operation` query	Complete	New query API and tests
Replace last_cluster_operation	Phase 2: Switch gating callers	Complete	Move `is_okay()` and siblings off the single-pointer read
Replace last_cluster_operation	Phase 3: Auto-target tracking	Complete	`*_create_and_enqueue` writes target rows automatically
Replace last_cluster_operation	Phase 4: Remove explicit setters	Complete	Drop redundant `set_last_cluster_operation` callers
Replace last_cluster_operation	Phase 5: Documentation and final audit	Complete	Update docs, verify CI
Fix cluster_operation_targets UNIQUE constraint	Schema fix	Complete	Replace column-level `UNIQUE(operation_uuid)` with composite `UNIQUE(operation_uuid, target_object_type, target_uuid)` so multi-target ops record all their target rows
Network operations facade	Master plan	Complete	Split `Network` into a queue-enqueuing facade and a single-mutator worker so local daemons can no longer bypass `net-worker`'s serialisation
Queue performance and coalescing	Steps 1-6	In Progress	Unified batched dequeue, coalescible-task metadata, worker- and enqueue-side dedup of redundant cluster operations. Step 7 (measure with CI data, decide on fairness) outstanding.
Recurring cluster operations	Master plan	Stub	Cron-like framework for recurring cluster operations; absorbs the cluster and cleaner `scheduled_tasks.py` and `daemons/network/maintain.py`; adds user-facing recurring tasks (e.g. snapshot every 24 hours)
Health checks	Phase 0: Research and decisions	Complete	Routing principle (LB probes only sf-api) collapsed OQ1/2/4/9; decided readiness cache, drain-grace/timeout reconciliation, WATCHDOG liveness + lock proof-of-life, auth, daemon classification
Health checks	Phase 1: sf-api endpoints and drain	Complete	`/livez`, `/readyz`, `/healthz` on sf-api; per-worker readiness checker; SIGTERM-driven drain (`API_DRAIN_GRACE`, reconciled timeouts)
Health checks	Phase 2: gRPC health on sf-database	Complete	`grpc.health.v1.Health` `Check` now tracks live MariaDB reachability via the daemon's ~10s loop; schema currency stays a startup refuse-to-start precondition; no Watch, no client `healthCheckConfig`
Health checks	Phase 3: WATCHDOG liveness wiring	Complete	Wired systemd `WATCHDOG` into the eight non-trivial daemons (`WatchdogSec=60s`, pet in `idle()` + cluster/cleaner heavy iterators); closes lock proof-of-life via watchdog-kill → lease-expiry failover
Health checks	Phase 4: Operator documentation	Complete	`load_balancing.md` HAProxy/nginx-FOSS/ALB probe configs; rolling-upgrade-with-drain in `upgrades.md`; live `ci_drain_check.sh` (in the actions repo) wired into functional-tests
Node resource health	Phases 1-4: Checks, evaluator, cascade, docs	Complete	Timeout-guarded path check; sf-resources marks a dead-storage node `error` (stops scheduling, discounts blob replicas) and exposes a `node_resource_health` gauge; cluster-daemon cascade errors instances + re-replicates blobs; `sf-ctl clear-node-error` recovery and operator docs. Decisions ratified inline (no separate phase-0 doc).
Remove the primary node	Phase 1: Remove monitoring	Complete	Realised by Remove syslog forwarding (ship logs to Loki) phase 5 (Grafana and the primary-node Prometheus server already removed as warmup)
Remove the primary node	Phase 2: ~~Bootstrap CLI~~	Dissolved	Reassessed to nothing — config stays in the idempotent role, `AUTH_SECRET_SEED` caller-supplied, `system` namespace via the existing `bootstrap-system-key`; no `bootstrap_operations` table or `bootstrap-cluster` command. The role-config remainder folds into phase 6
Remove the primary node	Phase 3: Remove LB	Complete	Realised by Remove the Apache load balancer
Remove the primary node	Phases 4-5: (moved to BYO MariaDB)	(moved)	MariaDB BYO and `sf-database`-as-tier lifted into PLAN-byo-mariadb.md
Remove the primary node	Phase 6: Galaxy collection	Complete	Deployer repackaged as the `shakenfist.shakenfist` collection (parameterised `node` role + component roles, native ansible modules under `plugins/modules/`, `release.yml` galaxy publish jobs); the reusable `smoke-cluster.yml` + `tools/` on `shakenfist/actions` drive all CI topologies; the getsf/topology installer chain and legacy roles are deleted
Remove the primary node	Phase 7: Rename and cleanup	Complete	`etcd_master` → `database_node` rename landed with one-release fallbacks (legacy inventory group union + deprecation warning, vestigial `is_etcd_master`/`is_eventlog_node` flags pinned False); the deferred-removal PR is scheduled for the release after v0.8.0 ships
Remove the primary node	Phase 8: Roll out shared smoke-cluster CI	Complete	`build-smoke-cluster` composite action extracted; `client-python` adopted the reusable workflow (replacing its broken getsf-era CI); kerbside-mode recipe documented in the actions README; dead-topology deletion deferred while `v0.6`/`v0.7-releases` CI still consumes them
Remove the Apache load balancer	Phase 1: Document operator-provided LB	Complete	Example apache2 + nginx configs and the `localhost:13000` single-node escape hatch (realises remove-primary phase 3)
Remove the Apache load balancer	Phase 2: Remove Apache from the deployer	Complete	Delete `apache2.yml` + `apache-site-primary.conf`; repoint single-node `api_url` to `:13000`
Remove syslog forwarding (ship logs to Loki)	Phase 0: Decisions and design	Complete	Config surface, Loki label/field-name contract, spool factoring, auth, CI Loki topology — answers recorded in the master plan's Decisions section
Remove syslog forwarding (ship logs to Loki)	Phase 1: Default structured JSON logging	Complete	`shakenfist-utilities==0.8.5` (JSON-only daemon logging, clean record, field contract, tests, draft superseded) released to PyPI and pinned in shakenfist
Remove syslog forwarding (ship logs to Loki)	Phase 2: Loki shipper in shakenfist	Complete	`logship_spool.py` / `logship_drainer.py` + a `logging.Handler` modelled on the eventlog spool/drainer; HTTP push, `LOKI_*` config, lifecycle wiring, metrics, `LOG_EVENTS_TO_LOKI` echo guard, mypy-covered, tests
Remove syslog forwarding (ship logs to Loki)	Phase 3: CI Loki	Complete	`tools/ci-install-loki.sh` stands up Loki on the primary, `LOKI_BASE_URL` plumbed through getsf/deploy into `/etc/sf/config`, and `test_loki.py` asserts SF logs reach Loki end-to-end
Remove syslog forwarding (ship logs to Loki)	Phase 4: Rework CI tooling	Complete	New versioned `shakenfist/actions` log-checks querying Loki (structured `| json` LogQL) + clingwrap bundle dumping the logship spool, per-node journald, and Loki (so a shipper failure stays debuggable); system-origin checks deferred to a phase-5 per-node check
Remove syslog forwarding (ship logs to Loki)	Phase 5: Remove rsyslog forwarding	Complete	Atomic cut-over: delete rsyslog configs/`syslog_target`/install, route gunicorn logs + drop `--log-syslog`, add tenant/auth plumbing, switch both CI workflows to the phase-4 Loki tooling (realises remove-primary phase 1)
Remove syslog forwarding (ship logs to Loki)	Phase 6: Documentation	Complete	`docs/operator_guide/logging.md` (BYO-Loki contract), events-vs-logs convention in AGENTS, ARCHITECTURE/README/installation, nav
BYO MariaDB and sf-database tier	Phase 0: Retire etcd machinery	Complete	Supersedes `PLAN-remove-etcd`; deleted `etcd.py`, `etcd3gw`, etcd protos, `DATA_MIGRATIONS` framework, drain tests, dead `sf-ctl` helpers and `show/set-etcd-config` aliases, stale `migrate-*` comments, `.claude/skills/migrate-etcd-to-mariadb.md` (`is_etcd_master` deferred to `PLAN-remove-primary` phase 7)
BYO MariaDB and sf-database tier	Phase 1: Statelessness and scope shift	Complete	Schema-versions lock; stop sf-database calling `ensure_schema()` at startup; daemon startup-version check; MariaDB compat check (version/engine/charset) on `ensure-mariadb-schema` and sf-database; lift scope out of remove-primary
BYO MariaDB and sf-database tier	Phase 2: Config untangle	Complete	Untangle `MARIADB_HOST` from "I am the database node"; rename `DATABASE_NODE_IP` → `MARIADB_GATEWAY_HOSTS` (plural list); bind sf-database on `NODE_MESH_IP` instead of `DATABASE_NODE_IP`
BYO MariaDB and sf-database tier	Phase 3: gRPC tier	Complete	Multi-endpoint client-side gRPC LB; minimal `grpc.health.v1.Health` on sf-database; channel factory at `shakenfist/util/grpc_channel.py`
BYO MariaDB and sf-database tier	Phase 4: Deploy BYO	Complete	`getsf` prompts for connection details; `roles/mariadb/` deleted; `tools/bootstrap-mariadb.sql` ships; `deploy.py` stops generating a password; tuning `.cnf` moves to `examples/`; `SHAKENFIST_MARIADB_HOST=localhost` escape hatch dissolved
BYO MariaDB and sf-database tier	Phase 5: CI workflow step	Complete	New `tools/ci-install-mariadb.sh` helper + five workflow sites in `functional-tests.yml`/`scheduled-tests.yml` install MariaDB and pass `GETSF_MARIADB_*` env vars to `getsf-wrapper`
BYO MariaDB and sf-database tier	Phase 6: CI tier coverage	Complete	Multi-instance sf-database startup; `MARIADB_GATEWAY_HOSTS` rendered as a list; bind-all drop-in; `ci-topology-slim-tier` in `shakenfist/actions`; merge-queue matrix entry; functional LB-fanout test asserting each etcd_master saw at least 5% of traffic
BYO MariaDB and sf-database tier	Phase 7: Documentation	Complete	`docs/operator_guide/database.md` restructured to lead with BYO and stripped of historical etcd content; tier-model note added to `ARCHITECTURE.md`; deleted-skill bullet dropped from `README.md`; "Bring your own MariaDB" section appended to `docs/release_notes/v07-v08.md`
Embrace TLS	Phase 0: Research and decisions	Not started	Resolve open TLS questions into a decisions document
Embrace TLS	Phase 1: Cert reload	Not started	Graceful TLS material reload across daemons
Embrace TLS	Phase 2: sf-database mTLS	Not started	Canary mTLS path for the highest-traffic gRPC channel
Embrace TLS	Phase 3: Other gRPC mTLS	Not started	Extend mTLS to the remaining inter-daemon channels
Embrace TLS	Phase 4: MariaDB TLS	Not started	TLS on the SF-to-MariaDB connection
Embrace TLS	Phase 5: sf-api TLS	Not started	Optional native TLS on sf-api; document operator-LB story
Embrace TLS	Phase 6: Expiry monitoring	Not started	Cert expiry warnings as events + prometheus metrics
Embrace TLS	Phase 7: Dev CA	Not started	Repurpose `pki_internal_ca` as dev/test convenience only
Sticky blob transfers	Phase 0: Research and decisions	Not started	Resolve cookie format, LB coverage, and placement-interaction questions
Sticky blob transfers	Phase 1: Server-side cookies	Not started	sf-api emits and honours server-set sticky cookies
Sticky blob transfers	Phase 2: LB documentation	Not started	Document HAProxy / Envoy / cloud-LB / nginx configurations
Sticky blob transfers	Phase 3: Client verification	Not started	Verify SF Python client cookie handling end-to-end
Sticky blob transfers	Phase 4: Failover behaviour	Not started	Define recovery path when the sticky backend dies mid-session
Replace exec'd network commands with netlink	Phase 0: Research and decisions	Not started	Pick `pyroute2.nftables` vs `python-nftables`, handle sysctl / arping corners, scope the privexec split, pick auth model
Replace exec'd network commands with netlink	Phase 1: rtnetlink for link / addr / route / neigh	Not started	Port `ip` exec sites to `pyroute2.IPRoute`
Replace exec'd network commands with netlink	Phase 2: Bridge attributes via `IFLA_BR_*`	Not started	Replace `brctl` with rtnetlink bridge link attributes
Replace exec'd network commands with netlink	Phase 3: nftables rules via netlink	Not started	Port iptables rules to nftables in atomic transactions
Replace exec'd network commands with netlink	Phase 4: Stand up `sf-net-privexec`	Not started	New typed-API daemon holding `CAP_NET_ADMIN`, with `net-worker` as its only client
Replace exec'd network commands with netlink	Phase 5: Shrink `sf-privexec`	Not started	Drop `CAP_NET_ADMIN` and network RPCs from the existing privexec daemon
Replace exec'd network commands with netlink	Phase 6: Cleanup	Not started	Close out remaining `sf-net` direct-exec sites and the sysctl / arping corners
Atomic scheduling via reservations	Phase 00a: Load-aware ordering and system reservations	Implemented (awaiting sfcbr soak)	Static quick wins for sfcbr: load-per-thread ordering, core-denominated OS / infra-role reservations (CPU and RAM), headroom-weighted selection, CPU topology tracking, measured overcommit default
Atomic scheduling via reservations	Phase 0: Research and decisions	Not started	Resolve conditional-INSERT vs SELECT-FOR-UPDATE, reservation row schema, lifecycle states, affinity model rework, batch-create semantics, generic-vs-specific
Atomic scheduling via reservations	Phase 1: `node_reservations` schema	Not started	Schema and migration for the reservation table
Atomic scheduling via reservations	Phase 2: Conditional-INSERT primitive	Not started	The scheduling primitive that filters and claims atomically
Atomic scheduling via reservations	Phase 3: Reservation lifecycle	Not started	Consume on `building`, explicit release on failure, leased TTL reaper
Atomic scheduling via reservations	Phase 4: Migrate callers	Not started	Port the three in-process `Scheduler()` call sites
Atomic scheduling via reservations	Phase 5: Batch-create API	Not started	All-or-nothing multi-instance create primitive
Atomic scheduling via reservations	Phase 6: Affinity model rework	Not started	Implement the affinity decision from phase 0
Atomic scheduling via reservations	Phase 7: Diagnostic-mode rejection logging	Not started	Restore per-rejection detail on failed schedules without paying the cost on every success
Atomic scheduling via reservations	Phase 8: Documentation	Not started	Operator guide for the new model and migration notes
Remove the eventlog service	Phase 1: Schema, accessors, RPC, row-count gauge	Complete	`events` / `event_objects` tables, `RecordEventBatch` on sf-database, `database_events_rows` gauge
Remove the eventlog service	Phase 2: Write cut-over and metrics	Complete	Swap drainer's gRPC target to sf-database; promote `event_uuid` / `request_id`; wire spool-depth, drop, insert metrics
Remove the eventlog service	Phase 3: Prune in cluster daemon	Complete	Move per-event-type prune sweep into the cluster maintainer with multi-object semantics
Remove the eventlog service	Phase 4: REST API direct-read	Complete	Event-list endpoints call `GetObjectEvents` on sf-database; no sqlite locality
Remove the eventlog service	Phase 5: Delete the daemon	Complete	Remove `sf-eventlog`, gRPC protos, systemd unit, config, `event_dlq`, and on-disk sqlite chunks
Remove the eventlog service	Phase 6: Documentation	Complete	Operator guide for new eventlog, history-loss called out in release notes, ARCHITECTURE/README/AGENTS
Generic allocator	Phase 0: Research and decisions	Not started	Pick allocation strategy, per-pool policy shape, leased-vs-permanent semantics, migration plan
Generic allocator	Phase 1: Schema and primitive	Not started	`resource_pool_allocations` table and conditional-INSERT allocator
Generic allocator	Phase 2: Port VXLAN allocator	Not started	First migration; sets the template
Generic allocator	Phase 3: Port console / VDI ports	Not started	Drop the local `socket.bind()` race-check
Generic allocator	Phase 4: Port vsock CID allocator	Not started	Drop the global cluster lock
Generic allocator	Phase 5: Port MAC allocator	Not started	Fix today's probabilistic-only correctness
Generic allocator	Phase 6: Documentation	Not started	Audit-log surface and developer docs
Network service ports	Phase 0: Research and decisions	Not started	Pick port range, TLS-on-shared-IP, token model, two-stage cleanup ordering, carrier-coupling contract
Network service ports	Phase 1: Pool registration	Not started	Register service-port pool with the generic allocator
Network service ports	Phase 2: API and token issuance	Not started	`allocate_service_port` / `release_service_port`
Network service ports	Phase 3: Carrier-side DNAT	Not started	Network daemon programs DNAT rules per allocation
Network service ports	Phase 4: Reaper and reconciler	Not started	Drift detection and repair across DB and iptables state
Network service ports	Phase 5: Validation surface	Not started	Smoke-test caller or first real caller
Network service ports	Phase 6: Documentation	Not started	Operator and developer docs including threat-surface change
Network carrier model	Phase 0: Research and decisions	Not started	Resolve pool config, lease TTL, advertisement modes, DHCP state, VLAN-trunk forward compatibility
Network carrier model	Phase 1: Carrier pool config	Not started	Eligible-carrier set and node-capability declaration
Network carrier model	Phase 2: Per-network carrier lease	Not started	Lease primitive backed by `cluster_locks` and the generic allocator
Network carrier model	Phase 3: Renderer process	Not started	Carrier-side process that materialises leased state to kernel
Network carrier model	Phase 4: SNAT and floating IP	Not started	Render SNAT and egress IP via the carrier
Network carrier model	Phase 5: Service ports via renderer	Not started	Carrier-side hookup for `PLAN-network-service-ports`
Network carrier model	Phase 6: DHCP persistence	Not started	DHCP leases survive carrier change
Network carrier model	Phase 7: DNS via renderer	Not started	DNS records rendered from data
Network carrier model	Phase 8: BGP advertisement	Not started	Embedded or operator-external BGP speaker
Network carrier model	Phase 9: L2 / GARP advertisement	Not started	Alternative for non-routed deployments
Network carrier model	Phase 10: Migration	Not started	Cutover from singleton network node to smeared carriers
Network carrier model	Phase 11: Operator documentation	Not started	VIP failover modes and pool sizing
OIDC authentication	Master plan	Stub	OIDC as an authentication option for human users; existing namespace keys re-framed as service-account tokens for automation
Auth federation	Phase 1: Terminology and glossary	Complete	Pin authentication vocabulary and other overloaded codebase terms in a docs/ glossary
Auth federation	Phase 2: Namespace keys as first-class objects	Complete	Keys become DBOs with events, soft delete, expiry, scopes, and provenance; cleaner reaps expired keys
Auth federation	Phase 3: Federated exchange and scope enforcement	Not started	Exchange an external identity token (GitHub Actions first, issuer-generic) for a scoped expiring namespace key
Auth federation	Phase 4: Authentication documentation	Not started	Update the three authentication guides for keys, scopes, and federation without depending on private CI internals
Auth federation	Phase 5: OIDC plan refresh	Not started	Rewrite the human-login OIDC stub plan against the as-built federation infrastructure
Auth federation	Phase 6: Secrets that cannot be logged by accident	Not started	Move key material to pydantic SecretStr, so stringifying a secret into an event renders asterisks instead
Auth federation	Phase 7: Recognisable secrets and leak detection	Not started	GitHub-style prefix and checksum on cluster-minted secrets, with gitleaks and Loki queries to catch leaks
Artifact UX rework	Master plan	Stub	Rework the upload/blob/artifact/label/snapshot surface to remove usability sharp edges (ambiguous name resolution, blob-UUID juggling, underpowered labels, instance-costumed snapshots); adopts #3271, #1634, #1167, #592, #877, #1386, #833, #422
Kerbside VDI console tokens	Master plan	In Progress	Cross-repo (shakenfist / client-python / kerbside / ryll): per-instance console authorisation via short-lived Ed25519-signed tokens minted by a new vdiconsoleproxy endpoint and validated offline by Kerbside; cluster signing key in `cluster_config`, pubkey published API-side, kerbside exchange endpoint mirrors the Nova token flow; pip-installable ryll so `sf-client instance vdiconsole` lands the user in a session seamlessly
Cluster op visibility	Phase 1: CI await helper deflake	Complete	Rewrite CI await helpers onto the history-aware by-target ops endpoint
Cluster op visibility	Phase 2: Observational flag schema	Not started	`cluster_operation_targets` v3: mark measurement ops so they stop permuting `last_cluster_operation`
Cluster op visibility	Phase 3: Mark observational sites	Not started	Audit and classify every `create_and_enqueue()` caller
Cluster op visibility	Phase 4: API surface	Not started	`has_pending_operations` in `external_view()`; truthful `/operations` endpoints; client support
Cluster op visibility	Phase 5: Coverage and docs	Not started	Functional coverage, helper simplification, operator/developer docs
Database load reduction	Phase 1: Stop the idle-loop polls	Complete	Remove the 5 Hz `get_node`/`get_node_daemon_state` polling from `Daemon.idle()` (~57% of measured sf-database load)
Database load reduction	Phase 2: Static object value caching	Complete	Narrow client-side cache for immutable static object values; restores the objects-cacheable/attributes-not principle
Database load reduction	Phase 3: Consolidate the gRPC client stacks	Complete	Remove the orphaned `database.py` client (dead since commit e48d3257f) so the live `mariadb.py` stack is the sole interceptor seam for attribution
Database load reduction	Phase 4: Caller attribution on counters	Complete	gRPC metadata caller identity + an additive `database_requests_total{operation, caller_daemon}` counter; mTLS-compatible, first slice of the OpenTelemetry thread
Database load reduction	Phase 5: Next-tier reductions	Not started	Data-driven reduction of `dequeue`/`get_ipam`/locks/blob-transfer polling, planned from phase 4 attribution numbers
Per-host resource reservations	Phase 1: Config keys + CPU/RAM math	Complete	New `NODE_*` reservation keys; flat per-node CPU (threads) / RAM math, dropping the infra-role branch
Per-host resource reservations	Phase 2: Disk reservation model	Complete	Per-node disk floor at the instances/blobs allocation points; publish `disk_reservation_gb` and convert `MINIMUM_FREE_DISK` consumers
Per-host resource reservations	Phase 3: Ansible templating	Complete	Per-host defaults templated into `/etc/sf/config`; stop `set-config`'ing reservations; inventory override
Per-host resource reservations	Phase 4: Docs and cleanup	Complete	Operator docs, plan index, `sf-ctl unset-config` + inert-row retirement
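
The "Fix cluster_operation_targets UNIQUE constraint" row in the table above is easier to follow with a small demonstration. The sketch below uses `sqlite3` in place of MariaDB and two throwaway tables (`targets_old` and `targets_new`, both hypothetical names) to show why a column-level `UNIQUE(operation_uuid)` keeps only one target row per operation, while the composite constraint lets a multi-target operation record all of its rows and still rejects exact duplicates.

```python
import sqlite3

db = sqlite3.connect(':memory:')

# Before: uniqueness on operation_uuid alone.
db.execute(
    'CREATE TABLE targets_old ('
    ' operation_uuid TEXT UNIQUE,'
    ' target_object_type TEXT, target_uuid TEXT)')

# After: uniqueness on the full (operation, type, target) triple.
db.execute(
    'CREATE TABLE targets_new ('
    ' operation_uuid TEXT, target_object_type TEXT, target_uuid TEXT,'
    ' UNIQUE (operation_uuid, target_object_type, target_uuid))')

# One operation that touches two different objects.
rows = [('op-1', 'instance', 'inst-a'), ('op-1', 'network', 'net-b')]


def insert_all(table, rows):
    """Insert rows one by one and return how many were accepted."""
    stored = 0
    for row in rows:
        try:
            db.execute('INSERT INTO %s VALUES (?, ?, ?)' % table, row)
            stored += 1
        except sqlite3.IntegrityError:
            pass  # rejected by the uniqueness constraint
    return stored


old_count = insert_all('targets_old', rows)
new_count = insert_all('targets_new', rows)
duplicate_count = insert_all('targets_new', rows[:1])
```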

Status Definitions

Stub: Framing recorded for future detailed planning; not yet ready to execute
Not started: Plan exists, work not yet begun
Planning: Design complete, implementation not yet started
In Progress: Currently being implemented
Complete: Implemented and released
Future: Planned but not yet designed in detail
Blocked on preconditions: Plan exists but explicitly waits on another plan or external event before work can begin
