Phase 4: Deploy-side BYO (getsf prompts, role deletion, SQL snippet)
Parent plan: PLAN-byo-mariadb.md.
Prompt
Before responding, read these files so you understand the current deployer shape and the cuts this phase makes:
- shakenfist/deploy/getsf: 1080-line bash deployer. Read lines 1-60 (helpers like question_start, record_answer), the topology-prompt block around lines 270-450, and the tail end where it generates topology.json or hands off to deploy.py.
- shakenfist/deploy/ansible/deploy.py lines 150-200 (the update_if_specified calls that translate GETSF_* env vars to ansible variables, including the mariadb_password generation at line 167).
- shakenfist/deploy/ansible/deploy.yml lines 180-220 (the mariadb role invocation that this phase deletes, and the base role's config action that ships /etc/sf/config to every node).
- shakenfist/deploy/ansible/roles/mariadb/: five files total (tasks/bootstrap.yml, tasks/main.yml, handlers/main.yml, meta/main.yml, files/90-shakenfist-tuning.cnf). The whole directory goes away in this phase.
- shakenfist/deploy/ansible/roles/base/templates/config: the /etc/sf/config template. Lines 34-46 are the MariaDB env-var block; phase 4 rewrites it.
- shakenfist/deploy/ansible/roles/primary/tasks/cluster_config.yml: nine SHAKENFIST_MARIADB_HOST=localhost \ lines, one preceding each sf-ctl invocation. Phase 4 dissolves the escape hatch by relying on /etc/sf/config to set SHAKENFIST_MARIADB_HOST to the operator's host.
- shakenfist/deploy/ansible/roles/base/tasks/register.yml lines 10-20 (the comment paragraph about etcd_master, plus the inline SHAKENFIST_MARIADB_HOST=localhost for the ensure-mariadb-schema invocation).
This phase deliberately leaves CI red until phase 5
lands. The master plan calls this out: deleting
roles/mariadb/ means CI's deploy step no longer
installs a MariaDB, so functional tests can't reach
one. Phase 5 adds a workflow step that installs
MariaDB outside getsf and applies the SQL snippet
this phase ships. The two phases should land in
short succession.
One commit per step. Each commit must pass
pre-commit run --all-files; functional CI passing
is not a per-step requirement, because phase 4
intentionally breaks the install path that CI
exercises today.
Context
After phases 0-3, the SF code tolerates BYO
MariaDB cleanly: the config layer expresses the
two-config orthogonal model, the gRPC tier
construction is correct, the schema and migrations
moved out of daemon startup, and the compatibility
gate refuses misconfigured servers. The remaining
gap is operator-facing: the bundled getsf /
deploy.py / roles/mariadb/ pipeline still
installs and tunes a MariaDB server on
etcd_master[0]. That is the bundled-install path
this phase deletes.
What getsf does today:
- Prompts the operator for topology (nodes, hostnames, NICs, IPs, SSH credentials, floating-IP block, DNS server, etc.). 30+ prompts.
- Does not prompt for any MariaDB credentials.
- Hands the topology to deploy.py via GETSF_* env vars.
- deploy.py generates a random 24-character mariadb_password (line 167) and threads it through deploy.yml to the bundled mariadb role and to the base role's /etc/sf/config template.
What getsf does after this phase:
- Same topology prompts as today.
- Plus five new prompts for MariaDB connection details: host, port, user, password, database name. Defaults match the SQL snippet so a single-box convenience deploy needs only the host and password.
- deploy.py accepts the credentials from GETSF_MARIADB_* env vars; refuses to proceed if host or password is empty (no default fallback).
- The bundled mariadb role is gone; deploy.yml no longer references it.
- The 90-shakenfist-tuning.cnf file moves to examples/mariadb-tuning.cnf with a short comment explaining how to install it.
- The SHAKENFIST_MARIADB_HOST=localhost ExecStartPre escape hatch in roles/primary/tasks/cluster_config.yml and roles/base/tasks/register.yml is dissolved. The shell tasks inherit SHAKENFIST_MARIADB_HOST from /etc/sf/config, where it now contains the operator's host.
- A new tools/bootstrap-mariadb.sql snippet is shipped, idempotent, ready for operators to apply against their MariaDB instance with their chosen password.
- roles/base/templates/config rewrites: the MariaDB block now uses the operator-provided variables. The MARIADB_HOST block is wrapped in {% if inventory_hostname in groups['etcd_master'] %} so only database-tier nodes get direct-access credentials.
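The deploy.py behaviour described in this list (required host and password, defaults for the other three fields) can be sketched as below. This is a minimal illustration, not the real plumbing: read_mariadb_settings and its helper are hypothetical names, and the actual change is expected to go through deploy.py's existing update_if_specified calls, whose signature is not reproduced here.

```python
import os

# Env var names and defaults come from the plan. The function names and
# structure are hypothetical and do not mirror deploy.py's real helpers.
REQUIRED = ('GETSF_MARIADB_HOST', 'GETSF_MARIADB_PASSWORD')
OPTIONAL_DEFAULTS = {
    'GETSF_MARIADB_PORT': '3306',
    'GETSF_MARIADB_USER': 'shakenfist',
    'GETSF_MARIADB_DATABASE': 'shakenfist',
}


def _variable_name(env_name):
    """Map GETSF_MARIADB_HOST to mariadb_host, and so on."""
    return env_name[len('GETSF_'):].lower()


def read_mariadb_settings(env=None):
    """Return ansible-style variables, exiting if a required value is empty."""
    env = os.environ if env is None else env
    settings = {}
    for name in REQUIRED:
        value = env.get(name, '')
        # Whitespace-only answers count as empty, but the stored value is
        # left untouched so a password is never silently altered.
        if not value.strip():
            raise SystemExit(
                f'{_variable_name(name)} is required; see '
                'tools/bootstrap-mariadb.sql and '
                'docs/operator_guide/database.md')
        settings[_variable_name(name)] = value
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[_variable_name(name)] = env.get(name, '') or default
    return settings
```

Passing a plain dict instead of os.environ makes the refusal path easy to exercise without touching the real environment.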
The principle: SF is a component slotted into an operator's infrastructure. Operators bring their MariaDB; SF prompts them for its address and uses it.
Decisions (phase-local)
- tools/bootstrap-mariadb.sql ships with a __REPLACE_ME__ placeholder for the password. Operators replace it before applying. The alternative, accepting --password= via the mysql CLI, would require operators to script the password into a shell command, which leaks it to process listings. Substituting the placeholder into a temporary file (or piping through sed) keeps the password off the mysql command line. A password written inline in a sed expression is still visible in sed's own arguments while it runs, so editing a temporary copy is the safer of the two.
- The snippet uses the SF defaults: database shakenfist, user shakenfist, grants ALL ON shakenfist.*. Operators who want different names can edit the snippet AND set the corresponding GETSF_MARIADB_* answers. This phase does not add documentation for non-default names; that's an unusual case not worth surfacing.
- deploy.py refuses to proceed if mariadb_host or mariadb_password is empty. No default fallback. Operators who forgot to provide credentials get a clear error pointing at the tools/bootstrap-mariadb.sql instructions. No silent fallback to localhost or to a generated password.
- The bundled 90-shakenfist-tuning.cnf becomes a documented example at examples/mariadb-tuning.cnf. Operators who want the tuning copy it into /etc/mysql/mariadb.conf.d/ themselves. A comment block at the top of the file explains the install path and notes that the tunings are reasonable starting values, not prescriptions.
- The SHAKENFIST_MARIADB_HOST=localhost escape hatch is dissolved in roles/primary/tasks/cluster_config.yml and roles/base/tasks/register.yml. The shell commands inherit SHAKENFIST_MARIADB_HOST from /etc/sf/config (which now contains the operator's host). The comments above each shell task are reworded to reflect the change.
- /etc/sf/config only gets the direct-MariaDB block on database-tier nodes. The template conditional is {% if inventory_hostname in groups['etcd_master'] %}. Non-database nodes get only the gateway-host block. The etcd_master ansible group name is still the one in use; PLAN-remove-primary phase 7 renames it. Phase 4 does not.
- CI breakage between phase 4 and phase 5 landing is accepted, per the master plan's sequencing note. Phase 4 commits do not need to leave functional CI green; the per-commit bar is pre-commit run --all-files only. Phase 5 restores CI by installing MariaDB in a workflow step.
- topology.json compatibility is dropped. Existing operators with topology.json files that lack the MariaDB block fail at deploy.py validation. Per master plan decision 7 (greenfields only), no shim is provided. The cluster is rebuilt against the new shape.
- The examples/ directory already exists. The tuning .cnf moves there rather than to tools/, because tools/ is for things operators run (the SQL snippet) while examples/ is for things operators adapt (the tuning file).
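As an illustration of the placeholder decision, the substitution can be done by a small script so the password never appears in any command's arguments. This is a sketch under stated assumptions: render_bootstrap_sql is a hypothetical helper that is not part of the plan, and the escaping assumes MariaDB's default sql_mode (backslash escapes enabled).

```python
PLACEHOLDER = '__REPLACE_ME__'


def render_bootstrap_sql(template, password):
    """Return the SQL text with the placeholder replaced by the password.

    Hypothetical helper: escapes the password for use inside a
    single-quoted SQL string literal under MariaDB's default sql_mode.
    """
    if not password:
        raise ValueError('password must not be empty')
    if PLACEHOLDER not in template:
        raise ValueError(f'{PLACEHOLDER} not found in template')
    # Escape backslashes first, then double any single quotes.
    escaped = password.replace('\\', '\\\\').replace("'", "''")
    # Every occurrence is replaced, including any mention of the
    # placeholder inside the snippet's comment header.
    return template.replace(PLACEHOLDER, escaped)
```

An operator-side wrapper could read the password with getpass.getpass(), call this function on the contents of tools/bootstrap-mariadb.sql, and write the result to the mysql client's standard input.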
Steps
Five sequential steps. Each step must pass
pre-commit run --all-files; functional CI may be
red between phase 4 and phase 5 landing.
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 1 | low | sonnet | none | Ship tools/bootstrap-mariadb.sql and move the tuning .cnf. Create tools/bootstrap-mariadb.sql with idempotent CREATE DATABASE IF NOT EXISTS shakenfist CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;, CREATE USER IF NOT EXISTS 'shakenfist'@'%' IDENTIFIED BY '__REPLACE_ME__';, GRANT ALL ON shakenfist.* TO 'shakenfist'@'%';, FLUSH PRIVILEGES;. Add a 10-line SQL comment header explaining: the snippet creates the SF database and user, operators replace __REPLACE_ME__ with their chosen password before applying (e.g. sed 's/__REPLACE_ME__/mypw/' tools/bootstrap-mariadb.sql | mysql -u root), the snippet is idempotent and safe to re-run, and that the database/user names match SF's defaults. Move shakenfist/deploy/ansible/roles/mariadb/files/90-shakenfist-tuning.cnf to examples/mariadb-tuning.cnf (use git mv); rewrite its leading comment block to explain it's a recommended-but-optional drop-in for /etc/mysql/mariadb.conf.d/, operators copy it themselves, and the tunings are starting values not prescriptions. Verify tools/ directory exists at repo root (ls tools/); if not, create it. pre-commit run --all-files. One commit. |
| 2 | high | opus | worktree | Operator-input plumbing in getsf and deploy.py. (a) In shakenfist/deploy/getsf: add five new prompt blocks after the existing topology prompts and before the topology JSON generation. Match the existing question_start / read / record_answer / question_end pattern exactly. The five prompts are: GETSF_MARIADB_HOST (required, no default — the operator's MariaDB host or IP), GETSF_MARIADB_PORT (default 3306), GETSF_MARIADB_USER (default shakenfist), GETSF_MARIADB_PASSWORD (required, no default), GETSF_MARIADB_DATABASE (default shakenfist). For the two required-no-default prompts, do not accept an empty answer; loop the prompt until a value is supplied (or echo a clear error message and exit 1 — match whichever pattern getsf uses elsewhere). Each prompt should include a sentence pointing operators at tools/bootstrap-mariadb.sql and the docs/operator_guide/database.md BYO section. (b) In shakenfist/deploy/ansible/deploy.py: replace the random-password generation at line 167 with explicit reads of GETSF_MARIADB_HOST, GETSF_MARIADB_PORT, GETSF_MARIADB_USER, GETSF_MARIADB_PASSWORD, GETSF_MARIADB_DATABASE. Use update_if_specified('mariadb_port', '3306'), update_if_specified('mariadb_user', 'shakenfist'), update_if_specified('mariadb_database', 'shakenfist') for the three optional fields. For mariadb_host and mariadb_password: read with no default, raise SystemExit('mariadb_host is required; see tools/bootstrap-mariadb.sql and docs/operator_guide/database.md') if empty. Use the same update_if_specified family so the env-var-to-variable plumbing matches the existing pattern. Worktree isolation: this changes the operator-facing prompt UI and the variable-translation contract; getting either wrong silently breaks every deploy. One commit. |
| 3 | high | opus | worktree | Delete the bundled mariadb role and rewrite the config template. (a) git rm -r shakenfist/deploy/ansible/roles/mariadb/ (the four remaining files: tasks/bootstrap.yml, tasks/main.yml, handlers/main.yml, meta/main.yml; 90-shakenfist-tuning.cnf already moved in step 1, so also clean up any now-empty files/ subdir left in the working tree). (b) In shakenfist/deploy/ansible/deploy.yml: remove the ### MariaDB section's - hosts: etcd_master block that invokes the role (around lines 188-200). Remove only the role invocation. Keep the mariadb_password: "{{ mariadb_password }}" var on the subsequent base role's config action (line ~214), because base/templates/config still needs the password threaded through. (c) Rewrite shakenfist/deploy/ansible/roles/base/templates/config lines 34-46 (the MariaDB env-var block). The gateway-host block stays on every node (already correct). Wrap the direct-host block (SHAKENFIST_MARIADB_HOST, SHAKENFIST_MARIADB_PORT, SHAKENFIST_MARIADB_USER, SHAKENFIST_MARIADB_PASSWORD, SHAKENFIST_MARIADB_DATABASE) in {% if inventory_hostname in groups['etcd_master'] %} and {% endif %}. Change the values from hard-coded references to operator-provided variables: SHAKENFIST_MARIADB_HOST="{{ mariadb_host }}", SHAKENFIST_MARIADB_PORT={{ mariadb_port }}, SHAKENFIST_MARIADB_USER="{{ mariadb_user }}", SHAKENFIST_MARIADB_PASSWORD="{{ mariadb_password }}", SHAKENFIST_MARIADB_DATABASE="{{ mariadb_database }}". Before declaring done, verify with ansible-playbook --syntax-check deploy.yml or by running the template through Python's Jinja2 engine with a fake inventory. Worktree isolation: deletes a role, rewrites a template; one typo breaks every node's /etc/sf/config on the next deploy. One commit. |
| 4 | medium | sonnet | none | Dissolve the localhost escape hatch. (a) In shakenfist/deploy/ansible/roles/primary/tasks/cluster_config.yml: every shell task (around lines 13-90) has an inline SHAKENFIST_MARIADB_HOST=localhost \ line as the first line of the shell command. Remove every such line. The shell tasks now inherit SHAKENFIST_MARIADB_HOST from /etc/sf/config via the systemd environment that the shell module respects. Update the leading comment paragraph (around lines 1-11) to remove the localhost-escape-hatch explanation; the new comment should explain that these tasks run sf-ctl commands that need direct MariaDB access (which they get from the operator-provided host in /etc/sf/config). (b) In shakenfist/deploy/ansible/roles/base/tasks/register.yml: remove the inline SHAKENFIST_MARIADB_HOST=localhost \ if present (it was around line 17 — verify by grep), and update the comment paragraph above the task (lines 8-17 area) to reflect the new model. Specifically: replace "All sf-ctl commands run on etcd_master[0] with MARIADB_HOST set so they can ..." with prose explaining that ensure-mariadb-schema runs on a database-tier node which has direct MariaDB access via /etc/sf/config. (c) pre-commit run --all-files. One commit. |
| 5 | medium | sonnet | none | Documentation. (a) docs/operator_guide/database.md BYO section: rewrite to be the canonical operator workflow. Cover: prerequisites (a MariaDB 10.6+ server reachable from every SF node, that meets the compat requirements section already in this file from phase 1), the SQL snippet at tools/bootstrap-mariadb.sql (sed-replace the password, apply it), the optional tuning at examples/mariadb-tuning.cnf (operator-installable drop-in), and the getsf prompts that ask for connection details. Show a complete single-box example: apt install mariadb-server → apply the snippet → optionally drop in the tuning → run getsf → answer the new prompts. (b) Update docs/operator_guide/upgrades.md to reflect that operators run sf-ctl ensure-mariadb-schema against their existing BYO MariaDB after an SF upgrade with schema changes (this content is mostly already there from phase 1; check for any remaining references to bundled MariaDB install). (c) CLAUDE.md: update the "Storage: MariaDB and the Database Service" section's tuning-related guidance if any (likely none, since the tuning was previously in the ansible role and not surfaced to developers). (d) ARCHITECTURE.md, README.md, AGENTS.md: search each for roles/mariadb, "bundled MariaDB", or similar phrases that imply SF installs MariaDB; update each to reflect the BYO model. The README in particular may have a "deployment" section that promises a turnkey install — update to mention the BYO prerequisite. (e) pre-commit run --all-files. One commit. |
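The intended effect of step 3's template conditional can be modelled in plain Python, which may help when reviewing the rendered /etc/sf/config for different nodes. This is an emulation only, not the Jinja template itself: mariadb_config_lines is a hypothetical name, the gateway-host block is omitted because its variable names are not given here, and the host names in the usage note are invented.

```python
def mariadb_config_lines(inventory_hostname, groups, variables):
    """Return the direct-MariaDB lines a node should receive.

    Plain-Python stand-in for the template conditional
    {% if inventory_hostname in groups['etcd_master'] %}.
    """
    if inventory_hostname not in groups.get('etcd_master', []):
        # Non-database nodes get no direct-access credentials.
        return []
    return [
        f'SHAKENFIST_MARIADB_HOST="{variables["mariadb_host"]}"',
        f'SHAKENFIST_MARIADB_PORT={variables["mariadb_port"]}',
        f'SHAKENFIST_MARIADB_USER="{variables["mariadb_user"]}"',
        f'SHAKENFIST_MARIADB_PASSWORD="{variables["mariadb_password"]}"',
        f'SHAKENFIST_MARIADB_DATABASE="{variables["mariadb_database"]}"',
    ]
```

With groups = {'etcd_master': ['sf-1']}, a call for 'sf-1' yields the five lines and a call for 'sf-2' yields an empty list, which is the split the conditional is meant to produce.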
Validation
- pre-commit run --all-files passes after each step.
- After step 1: tools/bootstrap-mariadb.sql exists, examples/mariadb-tuning.cnf exists, the pre-existing roles/mariadb/files/90-shakenfist-tuning.cnf no longer exists. A manual cat tools/bootstrap-mariadb.sql | mysql against a local MariaDB instance succeeds (it's idempotent; safe to re-run).
- After step 2: getsf (run in a test environment) prompts for the five MariaDB fields, refuses empty answers for host and password, records them to .getsfrc. deploy.py reads them correctly. An attempt to run deploy.py with no GETSF_MARIADB_HOST fails fast with a clear error message.
- After step 3: shakenfist/deploy/ansible/roles/ no longer contains a mariadb/ directory. deploy.yml no longer references the role. Running grep -rn 'role: mariadb\|roles/mariadb' shakenfist/deploy/ returns zero hits. The base config template's MariaDB block has the inventory_hostname in groups['etcd_master'] conditional.
- After step 4: grep -rn 'SHAKENFIST_MARIADB_HOST=localhost' shakenfist/deploy/ returns zero hits. The cluster_config.yml shell tasks no longer have inline env overrides.
- After step 5: documentation is consistent. A new operator reading just docs/operator_guide/database.md knows what they need to bring and what getsf will ask them for.
- CI remains red between phase 4 landing and phase 5 landing; this is expected and called out.
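The two grep checks above can also be combined into one scripted scan, for example to run once step 4 has landed. The sketch below is an assumption-laden convenience, not part of the plan: find_leftovers is a hypothetical name, and the patterns are taken directly from the grep commands in the list.

```python
import os
import re

# Patterns lifted from the validation greps for steps 3 and 4.
FORBIDDEN = [
    re.compile(r'role: mariadb|roles/mariadb'),
    re.compile(r'SHAKENFIST_MARIADB_HOST=localhost'),
]


def find_leftovers(root):
    """Yield (path, line_number, line) for each line matching a pattern."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding='utf-8') as f:
                    for number, line in enumerate(f, start=1):
                        if any(p.search(line) for p in FORBIDDEN):
                            yield path, number, line.rstrip('\n')
            except (UnicodeDecodeError, OSError):
                # Skip binary or unreadable files rather than failing.
                continue
```

Calling list(find_leftovers('shakenfist/deploy')) and expecting an empty list mirrors the "zero hits" criterion.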
Risks
- getsf's required-no-default prompts. If the loop pattern is mis-implemented, an operator hits enter and gets stuck in an infinite loop. The brief for step 2 tells the sub-agent to match whichever pattern getsf uses elsewhere for required answers (there may be precedent; see the read pattern in the FLOATING_BLOCK or topology-node-list prompts). If no precedent exists, the sub-agent should pick "echo a clear error and exit 1" rather than loop, to fail fast and let the operator re-run with the right env vars set.
- The inventory_hostname in groups['etcd_master'] conditional in the config template. If the Jinja syntax is wrong, every node's /etc/sf/config is malformed and every daemon fails to start on the next deploy. The brief for step 3 tells the sub-agent to verify with ansible-playbook --syntax-check deploy.yml or by running the Jinja2 template through Python's template engine with a fake inventory before declaring done.
- The localhost escape hatch removal interaction with sf-ctl ensure-mariadb-schema. ensure-mariadb-schema requires MARIADB_HOST to be set (verified by phase 1's brief). After step 4, the shell tasks rely on the env inheritance from /etc/sf/config. The brief for step 4 calls out that the cluster_config.yml shell tasks must run on a node where /etc/sf/config has the direct-host block, which is now only etcd_master nodes per step 3's conditional. The cluster_config.yml tasks today delegate to etcd_master[0], so the ordering works out: tasks delegate to etcd_master[0], which has the direct-host block, which has MARIADB_HOST set to the operator's host. If the delegation is missing on some task, step 4's sub-agent will see the error at the next deploy attempt.
- CI breakage window. Between phase 4 landing and phase 5 landing, the cluster_ci pipeline's install step has no MariaDB to talk to. If the window is more than a day or two, the project visibly red-bars. Mitigation: land phase 5 immediately after phase 4. The master plan calls this out explicitly.
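The infinite-loop risk in the first item comes from an unbounded retry. The logic of a prompt that re-asks a fixed number of times and then fails fast can be sketched as follows. getsf is bash, so this Python version only illustrates the control flow; ask_required and its parameters are hypothetical, and the plan itself only chooses between matching existing getsf precedent and exiting with an error.

```python
def ask_required(prompt, read=input, attempts=3):
    """Ask for a value up to `attempts` times, then exit with an error."""
    for _ in range(attempts):
        try:
            answer = read(prompt)
        except EOFError:
            # Closed stdin can never produce an answer, so stop retrying.
            break
        # Whitespace-only input counts as empty; the answer is returned
        # unmodified so passwords are not altered.
        if answer.strip():
            return answer
        print('A value is required; see tools/bootstrap-mariadb.sql.')
    raise SystemExit(f'no value supplied after {attempts} attempts')
```

The read parameter exists so the retry behaviour can be exercised with canned answers instead of a terminal.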
Out of scope
- CI workflow MariaDB install: phase 5.
- N=2 sf-database functional CI shape: phase 6.
- ARCHITECTURE/README/AGENTS sweep for the broader BYO direction: partial in step 5; the more extensive phase-7 docs sweep handles the full pass.
- etcd_master → database_node ansible group rename: PLAN-remove-primary phase 7.
- topology.json migration shim for operators with existing files: out of scope (greenfields only).
- The bundled Apache reverse proxy and rsyslog install removal: PLAN-remove-primary phases 1 and 3.
Back brief
Before executing this phase, please back brief the operator on:
- The five steps in order with the file boundaries for each.
- The deliberate CI-breakage window between phase 4 and phase 5 landing. Confirm phase 5 is ready to spawn immediately after phase 4 commits land.
- The decision to keep the etcd_master ansible group name in the new template conditional (phase 7's rename territory).
- The decision to refuse empty mariadb_host / mariadb_password in deploy.py rather than fall back to defaults. Operators who forget to set them get a clear error.
- The decision to ship the SQL snippet with a __REPLACE_ME__ placeholder and ask operators to sed-replace rather than scripting the password into a mysql command line.
- The plan to land the broader doc sweep (ARCHITECTURE/README/AGENTS) in step 5 as a starter and finish it in phase 7.