Phase 0: Retire etcd machinery and migration-era scaffolding
Parent plan: PLAN-byo-mariadb.md.
Supersedes the formerly-separate PLAN-remove-etcd.md
(absorbed by this plan; that file was deleted in commit
76fa8ee8b).
Prompt
Before responding, read these source files end-to-end — they are the substrate this phase deletes:
- shakenfist/etcd.py (473 lines; the entire etcd shim layer used to be the dependency surface for every drain function).
- shakenfist/mariadb.py lines 2364-2553 (the DATA_MIGRATIONS framework comment block, declaration, and ensure_data_migrations() function) and lines 2571-~4920 (the ~31 _migrate_etcd_* / _cleanup_etcd_* functions, the _migrate_node_state_key helper, and the DATA_MIGRATIONS.update() block at ~4923).
- shakenfist/protos/etcd_pb2.py, etcd_pb2_grpc.py and their .pyi siblings (generated stubs for the etcd v3 gRPC API).
- protos/etcd.proto (the proto source the stubs come from).
- shakenfist/tests/test_cluster_config_drain.py (117 lines) and shakenfist/tests/test_etcd_ops_queues_drain.py (218 lines): unit tests that exercise the drain path; both die when the code they cover dies.
- shakenfist/client/ctl.py lines 183-196 (the hidden show-etcd-config / set-etcd-config aliases preserved from the etcd era).
- shakenfist/config.py line 488 area (the ETCD_HOST config field, comment already says "remove in next release").
- pyproject.toml (etcd3gw==2.6.0 dependency).
- CLAUDE.md lines mentioning etcd (58, 246, 369 by the latest count; find by string, not line).
- docs/developer_guide/authentication.md line 56 (ETCDCTL_API=3 example).
This phase deletes Python-side etcd machinery. It does not touch:
- The etcd_master ansible group name in deploy.py, deploy.yml, or any ansible role. That rename belongs to PLAN-remove-primary phase 7.
- The is_etcd_master Python attribute on Node, the matching column on the nodes table, the gRPC protobuf field, or any other "is this the database node?" boolean threaded through the codebase. Per the master plan, that attribute is functionally identical to "is database node" today and will be renamed in PLAN-remove-primary phase 7 alongside the ansible group rename. Phase 0 leaves the attribute alone to avoid cross-scope churn.
One commit per step at minimum. Each commit must build,
pass pre-commit run --all-files, and have a clear
message.
Context
The originally-separate PLAN-remove-etcd.md was
absorbed into this plan because the master-plan
exploration found that the etcd machinery was still in
tree even though the data drain had finished. The
master plan's decision 11 records this as the rationale.
That same exploration claimed DATA_MIGRATIONS was an
empty dict. The post-phase-1 inventory reveals this was
wrong: DATA_MIGRATIONS is declared empty at
mariadb.py:2396 but populated via
DATA_MIGRATIONS.update({...}) at line 4923 with 31
table-keyed migration functions that drain etcd into
MariaDB. Those functions never run on a cluster without
ETCD_HOST set (ensure_data_migrations() short-
circuits at lines 2440-2444 when ETCD_HOST is
empty), but they are still in tree as ~2400 lines of
dead code.
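The declare-empty-then-update pattern described above is easy to misread, so here is a minimal sketch of its shape. Every name except DATA_MIGRATIONS and ensure_data_migrations is hypothetical, and the etcd_host parameter is an assumption made to keep the example self-contained; the real function's signature and body are not reproduced here.

```python
from typing import Callable

# Declared empty, as at the declaration site the plan cites. Anyone who
# reads only this line concludes that the registry has no entries.
DATA_MIGRATIONS: dict[str, Callable[[], None]] = {}


def ensure_data_migrations(etcd_host: str) -> list[str]:
    """Run every registered migration and return the table names handled.

    The etcd_host parameter is an assumption for this sketch; it stands in
    for the ETCD_HOST config value.
    """
    if not etcd_host:
        # Short-circuit: with no etcd host configured there is nothing to
        # drain, so none of the registered functions ever run.
        return []
    ran = []
    for table, migrate in DATA_MIGRATIONS.items():
        migrate()
        ran.append(table)
    return ran


def _migrate_etcd_example() -> None:
    """Hypothetical placeholder; a real function would drain etcd keys."""


# Registration happens far below the declaration, which is how a registry
# that looks empty at its declaration can still hold live entries.
DATA_MIGRATIONS.update({'example_table': _migrate_etcd_example})
```

With an empty host the call returns an empty list even though the registry is populated, which is the sense in which the functions are registered but never run.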
Phase 1 of this plan removed the only daemon-side
caller of ensure_data_migrations() (commit
6a4ac7b69). The function is now orphaned: no code
path invokes it. The orphan is harmless but conspicuous
— phase 0 deletes it along with the rest of the
machinery so the tree reflects reality.
After this phase lands, the following statements are true:
- shakenfist/etcd.py no longer exists. No SF code imports etcd v3 gRPC stubs or etcd3gw. The etcd3gw PyPI dependency is gone.
- ensure_data_migrations(), the DATA_MIGRATIONS dict, every _migrate_etcd_* / _cleanup_etcd_* function, and the framework comment block are removed from mariadb.py. ensure_schema() is the only entry point for schema management.
- The ETCD_HOST config field is gone from shakenfist/config.py.
- sf-ctl show-etcd-config / set-etcd-config aliases are gone from shakenfist/client/ctl.py.
- Source comments referencing retired sf-ctl migrate-* commands (in blob.py, upload.py, artifact.py, node.py, namespace.py, network/network.py, constants.py) are removed or reworded.
- The two etcd-drain test files (test_cluster_config_drain.py, test_etcd_ops_queues_drain.py) are deleted.
- CLAUDE.md etcd notes (the "etcd.py module is retained only to service DATA_MIGRATIONS" paragraph and the etcd3gw line in the Dependencies list) are removed.
- docs/developer_guide/authentication.md no longer references ETCDCTL_API=3.
- The stale comment at mariadb.py:16904-16906 that defers version-bumping to ensure_data_migrations is reworded (the v1→v2 ALTER TABLE for instance_attributes.vsock_cids is now standalone; the schema migration block immediately below bumps the version itself).
- The stale _construct_key cross-reference at mariadb.py:1361-1362 is reworded: the comment explains lock-key naming and pointed at the defunct etcd._construct_key function for the historical justification.
Decisions (phase-local)
- is_etcd_master Python attribute is left alone. The attribute is read and written across node.py, mariadb.py (the nodes table column at line ~11200), daemons/database/main.py (gRPC proto round-trips), and daemons/resources/main.py. It is dead in the sense that no code path reacts to its value, but it is alive in the sense that removing it requires a coordinated schema migration, protobuf change, gRPC client recompile, and Node-accessor rewrite. The master plan already scopes the rename (is_etcd_master → is_database_node) to PLAN-remove-primary phase 7. Phase 0 keeps to that boundary.
- No staged removal across releases. Per master plan decision 7 (greenfields only), there is no compatibility shim, no deprecation warning, no --legacy flag preserving the old behaviour. The single operating SF cluster will be redeployed against the post-phase-0 tree.
- Schema-version comment update is in-scope. The comment at mariadb.py:16904-16906 says "We do NOT bump the version here. The version is bumped by the data migration in ensure_data_migrations()." Once ensure_data_migrations is deleted, that comment is misleading. The actual code immediately below (lines 16913+) handles version bumping in a schema-only manner. Step 0a rewords the comment to reflect that the v1→v2 ALTER TABLE is now a pure schema migration with the version bump landing in the per-version block that follows.
- The _construct_key historical-reference comment is reworded, not deleted. At mariadb.py:1361-1362 a comment explains the lock_key column naming convention by referring to the defunct etcd._construct_key(prefix='sflocks') function. The convention is real and worth documenting; the etcd reference is stale. Step 0a rewords the comment to describe the convention directly without referring to the removed function.
- tox -e genprotos is not re-run as part of this phase. Phase 0 deletes the etcd .proto source and its generated stubs by hand. Re-running genprotos would re-generate stubs for whatever .proto files remain, which is exactly what we want, but doing it after-the-fact in this phase risks regenerating other stubs as a side effect (whitespace, version markers, etc.). The phase leaves the rest of shakenfist/protos/ untouched.
Steps
Four steps, run in the order listed. Steps 0a and 0b
are strictly sequential: step 0a removes the etcd
module's only Python importer, which is the
prerequisite for step 0b's actual deletion of etcd.py
and its protos. Steps 0c and 0d follow once the imports
are gone; they are independent of each other.
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 0a | medium | opus | worktree | Purge dead etcd-migration code from shakenfist/mariadb.py. Delete: (i) the from shakenfist import etcd import at the top of the file (find the exact line); (ii) the DATA_MIGRATIONS framework block at ~lines 2364-2553 — the multi-line docstring-comment intro, the DATA_MIGRATIONS: dict[...] = {} declaration, and the ensure_data_migrations() function in its entirety; (iii) every _migrate_etcd_* and _cleanup_etcd_* function at lines ~2571-~4920 (31 functions in total — confirm count with grep -c '^def _migrate_etcd_\\|^def _cleanup_etcd_' shakenfist/mariadb.py after the delete to verify zero remain); (iv) the _migrate_node_state_key helper at ~line 3325 (only called by other migration functions in the same block); (v) the DATA_MIGRATIONS.update({...}) block at ~line 4923-~4960 that registers the functions. Then update two stale comments: (vi) at lines ~16904-16906, replace the "We do NOT bump the version here. The version is bumped by the data migration in ensure_data_migrations()" comment with one that describes the v1→v2 ALTER TABLE as a standalone schema migration (the per-version bump immediately below already handles the version increment); (vii) at lines ~1361-1362, replace the "Mirrors etcd._construct_key(prefix='sflocks')" comment with a direct explanation of the /sflocks/{type}/{subtype}/{name} convention without referencing the removed function. Verify with grep that no etcd. references remain in mariadb.py. Run pre-commit run --all-files; the existing test suite should pass (no test should rely on the removed migration functions — they are dead code; the dedicated drain tests are deleted in step 0b). Worktree isolation: this is the highest-volume single-step deletion in the phase (~2400 lines), and it is worth being able to discard if the boundary detection went wrong. One commit. |
| 0b | low | sonnet | none | Delete the etcd module, its protos, the proto source, and the drain tests. Files: shakenfist/etcd.py, shakenfist/protos/etcd_pb2.py, shakenfist/protos/etcd_pb2.pyi, shakenfist/protos/etcd_pb2_grpc.py, shakenfist/protos/etcd_pb2_grpc.pyi, protos/etcd.proto, shakenfist/tests/test_cluster_config_drain.py, shakenfist/tests/test_etcd_ops_queues_drain.py. Use git rm for each. After deletion, run grep -rn 'shakenfist\\.etcd\\|shakenfist/etcd\\|protos\\.etcd_pb2\\|etcd_pb2_grpc' --include='*.py' shakenfist/ and confirm no hits — if anything remains, stop and report. Run pre-commit run --all-files to confirm the tree is consistent (no orphaned references). Note: etcd3gw is still listed in pyproject.toml after this step; step 0c removes it. The dependency staying in place across this step keeps the tree installable from pyproject.toml even though no code imports etcd3gw any more — sequencing the dep removal into a separate commit makes the bisect history clearer. One commit. |
| 0c | low | sonnet | none | Remove etcd-era surface area from config and the CLI. Three changes: (i) drop the etcd3gw==2.6.0 entry from pyproject.toml's dependencies list (find the exact line; preserve apparent formatting); (ii) remove the ETCD_HOST config field and its accompanying "etcd (retained only for DATA_MIGRATIONS drain — remove in next release)" comment from shakenfist/config.py (around line 488); (iii) remove the hidden show-etcd-config and set-etcd-config @click.command(name=..., hidden=True) blocks from shakenfist/client/ctl.py (lines ~183-196), AND remove the matching cli.add_command(show_etcd_config) / cli.add_command(set_etcd_config) entries further down in the file (around line ~350). Run grep -rn 'ETCD_HOST\\|show-etcd-config\\|set-etcd-config\\|etcd3gw' --include='*.py' --include='*.toml' shakenfist/ pyproject.toml and confirm only doc/comment references remain (those are step 0d's job). Run pre-commit run --all-files. One commit. |
| 0d | low | sonnet | none | Clean stale source comments and developer docs that reference the retired sf-ctl migrate-* commands and etcd. Source comments to remove or rephrase: shakenfist/blob.py:137,143,149 (three "State migration to MariaDB is now handled by sf-ctl migrate-*" comments); shakenfist/upload.py:54,60; shakenfist/artifact.py:121; shakenfist/node.py:138; shakenfist/namespace.py:101; shakenfist/network/network.py:106; shakenfist/constants.py:88 (the "Use 'sf-ctl migrate-floating-network-uuid' to migrate" line — the function it talks about no longer exists). For each: read the surrounding code to decide whether the comment can be deleted entirely (preferred when the comment was only there to document the migration path) or whether a brief rewrite preserves useful context. Docs: in CLAUDE.md, remove the "Note: the etcd.py module is retained only to service DATA_MIGRATIONS entries which drain leftover etcd keys from older clusters. The module will be removed in the next minor version." paragraph (currently around line 246) and remove the etcd3gw - etcd client (retained for DATA_MIGRATIONS drain only) bullet from the Dependencies list (around line 369). In docs/developer_guide/authentication.md, remove the export ETCDCTL_API=3 line (around line 56) — read context to decide whether the surrounding paragraph also needs adjustment. Final verification: grep -rn 'etcd' --include='*.py' --include='*.md' --include='*.toml' shakenfist/ docs/ CLAUDE.md README.md pyproject.toml 2>/dev/null | grep -v 'is_etcd_master\\|etcd_master' | head -30 should show only references in docs/plans/ (the plan documents themselves), nothing in source code or operator docs. One commit. |
Validation
- pre-commit run --all-files passes after each step.
- Final state grep: grep -rn 'from shakenfist import etcd\|from shakenfist.etcd\|import etcd3gw\|from etcd3gw\|DATA_MIGRATIONS\|ensure_data_migrations\|ETCD_HOST\|show-etcd-config\|set-etcd-config\|_migrate_etcd_\|_cleanup_etcd_\|ETCDCTL_API' --include='*.py' --include='*.toml' --include='*.md' shakenfist/ docs/ CLAUDE.md pyproject.toml 2>/dev/null should show only references in docs/plans/ (this plan describing the deletion) and docs/release_notes/ (historical record).
- python -c "import shakenfist.mariadb; import shakenfist.config; import shakenfist.client.ctl" succeeds. (Doesn't run the daemon, but proves the imports unwind cleanly.)
- The CI deploy path (cluster_ci) still stands up a cluster end-to-end. This is the integration check that nothing operator-visible regressed; CI is what catches the case where a deleted comment was actually load-bearing (e.g. someone parsed a docstring at runtime, which I don't believe happens but is worth verifying).
- The set of is_etcd_master references is unchanged before vs. after the phase. Verifies that scope was respected.
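For sessions that prefer a scriptable check over a shell one-liner, the final-state grep can be approximated in Python. This is a sketch under assumptions: the function and constant names are hypothetical, the pattern list is copied from the grep in this section (with the dot in shakenfist.etcd escaped, which grep's basic regex does not do), and the two allowed directories are the ones this section names.

```python
import re
from pathlib import Path

# Strings that must not survive outside the plan and release-note records.
FORBIDDEN = re.compile(
    r'from shakenfist import etcd|from shakenfist\.etcd|import etcd3gw|'
    r'from etcd3gw|DATA_MIGRATIONS|ensure_data_migrations|ETCD_HOST|'
    r'show-etcd-config|set-etcd-config|_migrate_etcd_|_cleanup_etcd_|'
    r'ETCDCTL_API'
)
SUFFIXES = {'.py', '.toml', '.md'}
SCAN_TARGETS = ('shakenfist', 'docs', 'CLAUDE.md', 'pyproject.toml')
ALLOWED_DIRS = ('docs/plans/', 'docs/release_notes/')


def _candidate_files(root: Path):
    """Yield files under the scan targets, mirroring the grep arguments."""
    for name in SCAN_TARGETS:
        target = root / name
        if target.is_file():
            yield target
        elif target.is_dir():
            yield from sorted(p for p in target.rglob('*') if p.is_file())


def unexpected_hits(root: Path) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for hits outside the allowed dirs."""
    hits = []
    for path in _candidate_files(root):
        rel = path.relative_to(root).as_posix()
        if path.suffix not in SUFFIXES or rel.startswith(ALLOWED_DIRS):
            continue
        text = path.read_text(encoding='utf-8', errors='replace')
        for lineno, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN.search(line):
                hits.append((rel, lineno, line.strip()))
    return hits
```

Calling unexpected_hits(Path('.')) from the repository root and getting an empty list corresponds to the grep showing only docs/plans/ and docs/release_notes/ matches.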
Risks
- Boundary-detection error in step 0a. Deleting ~2400 lines from a 17,000-line file by line range risks slicing wrong: catching one line of a non-migration helper that happens to be in the middle of the migration block, or missing the trailing } of DATA_MIGRATIONS.update({...}). The brief for step 0a calls out the grep-after-delete check (grep -c '^def _migrate_etcd_\|^def _cleanup_etcd_' must return 0) and the standalone-import check (grep -n 'etcd\.' shakenfist/mariadb.py must return nothing) as concrete verification steps. Worktree isolation lets the management session discard the sub-agent's output and re-spin with a better brief if the cut went wrong.
- A migration helper turns out to be reachable from non-migration code. Specifically: _migrate_node_state_key at line 3325 has the shape of a generic helper. The grep does say it is only called from within _migrate_etcd_nodes, but the brief tells the sub-agent to verify with a grep before deleting. If a non-migration caller exists, the sub-agent reports rather than deleting.
- Dropping ETCD_HOST reveals other readers. The master-plan-time grep showed ETCD_HOST consumed only by mariadb.py:2442 (the short-circuit in ensure_data_migrations). Step 0a deletes that consumer first; step 0c then drops the config field. If another reader exists somewhere not yet identified (e.g. an ansible template), the pre-commit run on step 0c will not catch it (ansible templates aren't linted for Pydantic config references), but the cluster_ci validation will. If pre-commit fails on step 0c with a config-related error, that's the signal to find the unexpected reader.
- The # sf-ctl migrate-* comments in step 0d are scattered. It is easy to miss one. The final grep in the step 0d brief matches only the string etcd, so it will not flag a leftover migrate comment that does not mention etcd; running grep -rn 'sf-ctl migrate' --include='*.py' shakenfist/ as well covers this.
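The step 0a checks named in the first risk can be sketched as a small Python helper that runs against the text of mariadb.py after the cut. The function names are hypothetical. The two patterns mirror the grep commands quoted above, and the parse check is an addition assumed here as a cheap way to notice a stray closing brace left behind by a bad line-range delete.

```python
import ast
import re

# Mirrors: grep -c '^def _migrate_etcd_\|^def _cleanup_etcd_'
MIGRATION_DEF = re.compile(r'def (?:_migrate_etcd_|_cleanup_etcd_)')
# Mirrors: grep -n 'etcd\.'
ETCD_REFERENCE = re.compile(r'etcd\.')


def check_purge(source: str) -> dict[str, int]:
    """Count lines that should no longer exist after step 0a."""
    lines = source.splitlines()
    return {
        # match() anchors at the start of the line, like ^ in the grep.
        'migration_defs': sum(1 for ln in lines if MIGRATION_DEF.match(ln)),
        'etcd_references': sum(1 for ln in lines if ETCD_REFERENCE.search(ln)),
    }


def still_parses(source: str) -> bool:
    """Return False if the remaining text is no longer valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def purge_is_clean(source: str) -> bool:
    """True only when both counts are zero and the file still parses."""
    counts = check_purge(source)
    return all(n == 0 for n in counts.values()) and still_parses(source)
```

A leftover }) from the DATA_MIGRATIONS.update({...}) block makes still_parses return False even when both grep-style counts are zero, which is the failure the line-range risk describes.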
Out of scope
For clarity, none of these are touched by phase 0:
- is_etcd_master Python attribute on Node, the matching column on nodes, the gRPC field, any caller. → PLAN-remove-primary phase 7.
- etcd_master ansible-group name in deploy.py, deploy.yml, ansible roles, inventory templates, installation.md examples. → PLAN-remove-primary phase 7.
- The release-notes files under docs/release_notes/ (e.g. v07-v08.md) that mention retired sf-ctl migrate-* commands. These are historical records; leaving them as-is is correct.
- The .claude/skills/migrate-etcd-to-mariadb.md skill: already deleted in master-plan commit 76fa8ee8b.
- PLAN-remove-etcd.md: already deleted in master-plan commit 76fa8ee8b.
Back brief
Before executing this phase, please back brief the operator on:
- The four steps in order, with the file or line-range each touches.
- The deliberate decision to leave is_etcd_master alone (left for PLAN-remove-primary phase 7), so the management session and the sub-agents do not accidentally widen scope.
- The corrected understanding that DATA_MIGRATIONS was not empty as the master plan claimed: step 0a is removing 31 active migration-function registrations along with the function bodies. The functions never run on a cluster without ETCD_HOST, but they are not zero-byte stubs.
- The validation step that confirms is_etcd_master references are unchanged before and after the phase, as the cross-check that scope was respected.
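One way to make the is_etcd_master cross-check concrete is to snapshot reference counts per file before the phase and compare them afterwards. The sketch below is an assumption about how that could be done, not a tool that exists in the tree: the function names are hypothetical, and it scans only .py files, so the protobuf field and ansible references would need a separate look.

```python
from collections import Counter
from pathlib import Path


def reference_counts(root: Path, needle: str = 'is_etcd_master') -> Counter:
    """Map each Python file (relative path) to its number of occurrences."""
    counts: Counter = Counter()
    for path in root.rglob('*.py'):
        text = path.read_text(encoding='utf-8', errors='replace')
        found = text.count(needle)
        if found:
            counts[path.relative_to(root).as_posix()] = found
    return counts


def scope_respected(before: Counter, after: Counter) -> bool:
    """True when every file holds the same number of references as before."""
    return before == after
```

Counting per file rather than recording line numbers is deliberate: step 0a removes roughly 2400 lines from mariadb.py, so the surviving references shift position even when none of them is touched.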