Phase 6: N>1 sf-database functional CI shape¶
Parent plan: PLAN-byo-mariadb.md.
Prompt¶
Before responding, read these files so you understand the current deploy shape and what phases 1-5 have landed:
shakenfist/deploy/ansible/deploy.yml— the playbook. The[primary_node]and[etcd_master]hosts groups, the cluster-config delegation toetcd_master[0], and the threading ofmariadb_password/mariadb_gateway_hosts.shakenfist/deploy/ansible/roles/base/tasks/register.yml— sf-database startup is currentlydelegate_to: etcd_master[0]withrun_once: true, which collapses an N-element group to a single instance. Phase 6 unwinds that.shakenfist/deploy/ansible/roles/base/templates/config—SHAKENFIST_MARIADB_GATEWAY_HOSTSis rendered as a single host (etcd_master[0]'s mesh IP). Phase 6 renders it as a comma-separated list of everyetcd_mastermember's mesh IP. TheSHAKENFIST_MARIADB_HOSTblock already only renders onetcd_mastergroup members; no change needed there.shakenfist/deploy/ansible/roles/base/tasks/config.yml— thesf-database.servicetemplate is dropped withwhen: inventory_hostname in groups.get('etcd_master', []). Multi-instance startup is supported by the existing template; the only thing standing in the way isregister.yml's delegation.shakenfist/daemons/database/main.py— Prometheus counters are nameddatabase_{op}_total(e.g.database_get_state_total,database_set_state_total) and live on each instance'sMARIADB_GATEWAY_METRICS_PORT(default 13006). Per-instance counters are what phase 6's functional test asserts non-zero against, to prove the LB fanned requests out.shakenfist/util/grpc_channel.py— themake_database_channel()factory shipsloadBalancingConfig: round_robinandhealthCheckConfig: serviceName: ""ingrpc.service_config. Round-robin picks a subchannel per RPC, so two sf-database instances fronted by the same channel will see roughly 50/50 request distribution.tools/ci-install-mariadb.shandtools/bootstrap-mariadb.sql— phase 5's operator-facing helpers. The SQL snippet already grants'shakenfist'@'%'so any client host can connect. The CI install script does not override Debian'sbind-address = 127.0.0.1default; phase 6 needs MariaDB to listen for remote etcd_master nodes too, so step 2 extends the script with a bind-all drop-in for multi-node deployments..github/workflows/functional-tests.yml— phase 5 left four "Install BYO MariaDB on primary" / "Run getsf installer on primary" pair sites for the three existing matrix entries (Smoke localhost, threeslim-primarymerge entries) and one ansible-modules site. Phase 6 adds one new matrix entry under the merge-queue job that consumes a newslim-tiertopology with at least twoetcd_mastermembers..github/workflows/scheduled-tests.yml— the scheduled localhost / slim-primary entries. Phase 6 does not add a scheduledslim-tierentry; one entry on the merge-queue side is enough to keep regression coverage on every PR that reaches merge, and the scheduled run already has long enough runtime budgets.docs/operator_guide/database.md— the operator- facing description of the BYO MariaDB and tier model. Phase 6 adds a single short paragraph noting that N>1 sf-database is exercised by CI, plus an exampleMARIADB_GATEWAY_HOSTSvalue with a comma-separated list. The bulk of the doc rewrite is phase 7's scope.
The point of this phase is to prove the tier works under load and to lay down regression coverage so a future "only one sf-database actually serves requests" bug cannot reach a release silently. The tier already works in code (phases 1-3 made it work); phase 6 makes CI watch it work.
One commit per step. Each commit must pass
pre-commit run --all-files (which includes
actionlint on workflow YAML). Step 4 depends on a
landed change in the external shakenfist/actions
repository (a new ci-topology-slim-tier.yml
playbook); that coordination is called out in the
step's brief and below in Risks.
Context¶
After phase 5, CI exercises the BYO contract end to
end against a single sf-database instance.
SHAKENFIST_MARIADB_GATEWAY_HOSTS is a one-element
list, the round_robin LB policy has nothing to
round-robin between, and register.yml's
delegate_to: etcd_master[0] + run_once: true makes
"start sf-database on every etcd_master node" act
the same as "start one sf-database on the primary".
That is correct for single-tier deployments and
invisible for any other shape, because no shape
runs more than one sf-database today.
Two specific properties of the tier model are currently untested:
- A request from a daemon's gRPC channel actually
reaches more than one sf-database instance. The
round_robinpolicy's correctness depends on the resolver returning multiple addresses and the subchannels being healthy. If the resolver fails silently, or the health check fails on N-1 subchannels, the channel falls back to "always pick the one healthy subchannel" and the LB becomes a no-op. Nothing about that would surface in a single-endpoint test. - A second
sf-databasestartup against the same MariaDB completes cleanly. Phase 1 fixed the schema-versions race defensively and moved migrations out of daemon startup, but no CI run has ever brought two instances up against the same schema concurrently. A latent bug — a strayTable(...)registration outsideTABLE_CREATION_LOCK, an SQLAlchemy state-cache race on shared metadata, arecord_startclash on identical hostnames — would not surface in single-instance CI.
Phase 6 closes both gaps by:
- Removing the single-instance pinning in
register.ymlso multi-etcd_master topologies genuinely run sf-database on every group member. - Rendering
MARIADB_GATEWAY_HOSTSas the fulletcd_mastergroup list so daemons construct multi-endpoint LB channels in line with the actual topology. - Teaching
ci-install-mariadb.shto listen on all interfaces so non-primary etcd_master nodes can reach MariaDB on the primary. - Adding a
slim-tierCI topology (in the externalshakenfist/actionsrepo) with three nodes andGETSF_NODE_ETCD_MASTERset to two of them, plus a workflow matrix entry that consumes it. - Shipping a functional test that drives sf-api
for a couple of seconds, scrapes
database_*_totalcounters from each sf-database instance's metrics endpoint, and asserts every instance saw a non-trivial share of the traffic. - A one-paragraph docs update to
docs/operator_guide/database.mdcalling out that N>1 sf-database is the tested production shape and showing the comma-separatedMARIADB_GATEWAY_HOSTSform.
After this phase lands:
- Every PR's merge-queue run exercises a 2-instance sf-database tier against a shared MariaDB.
- The functional test fails if either instance does
not serve a share of the traffic, which catches:
(a) the gRPC LB silently falling back to a single
subchannel, (b) one
sf-databaseinstance crashing on startup, (c) one instance going silently unhealthy mid-run. - A future regression to single-instance behaviour
(someone re-adds
delegate_to: etcd_master[0]on the assumption that sf-database is "still a singleton", or a config-template edit drops the list) fails on the next PR rather than at operator deploy time.
Decisions (phase-local)¶
-
Multi-instance startup is a
when:gate, not a delegate-and-run-once collapse. The "Start the database daemon on etcd_master nodes" task inregister.ymlcurrently usesdelegate_to: etcd_master[0]withrun_once: true, which makes it a single action regardless of group size. Phase 6 replaces both withwhen: inventory_hostname in groups.get('etcd_master', [])so every member of the group starts its own sf-database. Theregister-daemon databasetask immediately above already uses this gate pattern and works correctly under N>1; the start task just needs to follow the same shape. -
MARIADB_GATEWAY_HOSTStemplate renders the full list. Today the template line isSHAKENFIST_MARIADB_GATEWAY_HOSTS="{{ hostvars[groups['etcd_master'][0]]['node_mesh_ip'] }}". Phase 6 makes itSHAKENFIST_MARIADB_GATEWAY_HOSTS="{{ groups['etcd_master'] | map('extract', hostvars, 'node_mesh_ip') | join(',') }}". Single- etcd_master deployments degenerate to a one- element list with no trailing comma, which is what_parse_comma_separated_hostsinconfig.pyalready handles. -
CI install script grows a bind-all drop-in.
ci-install-mariadb.shwrites a small/etc/mysql/mariadb.conf.d/99-bind-all.cnfcontainingbind-address = 0.0.0.0before restarting MariaDB. This is correct for CI (ephemeral VMs reachable only inside the test namespace), correct for single-box dev installs (loopback connects to 0.0.0.0 fine), and only problematic for an operator who copies the script verbatim onto a production host — which the script's leading comment already warns against. The drop-in lands as99-*so it overrides any Debian-shipped50-server.cnfbind-address default. -
New topology lands in the external
shakenfist/actionsrepo. Phase 5 deliberately left the actions repo untouched because phase 5's goal was the narrowest possible "restore green" fix. Phase 6's goal — prove the tier works — requires a new topology the actions repo doesn't ship today. The new playbookansible/ci-topology-slim-tier.ymlprovisions three VMs (10.0.0.20-22 is fine, reusing the existing slim-primary IP block) and the matchinggetsf-wrappersetsGETSF_NODE_ETCD_MASTER="<vm-1-name> <vm-2-name>". Two etcd_master nodes is enough to exercise the LB path; three would not meaningfully strengthen the test and would widen the CI VM footprint. The third VM is a hypervisor. -
Matrix entry only on the merge-queue side. The new entry goes in the merge-queue job's
matrix.mergelist, not in the PR-localhostmatrix.prlist and not in the scheduled workflow. Rationale: every PR already runs the PR-localhost smoke; adding aslim-tierPR entry would double infrastructure spend per PR for marginal coverage. The merge-queue gate is the right place to insist a PR has been exercised against the multi-instance tier before it ships, and the merge-queue runs less often than per-PR CI so the cost is bounded. -
Functional test asserts each instance served at least 5% of total traffic. A daemon's gRPC channel hits the round-robin policy at per-call granularity, so 100 calls across two instances should be roughly 50/50 ± normal jitter. The 5% floor is loose enough to survive transient unhealthiness or one subchannel briefly going into reconnect, and strict enough to fail loudly when the LB silently degenerates to a single subchannel (which manifests as 100% / 0%). The test uses the
database_*_totalfamily, scraping each instance's Prometheus endpoint athttp://<mesh_ip>:13006/metrics. -
The functional test lives in
shakenfist/deploy/shakenfist_ci/cluster_ci_tests/. That directory's tests run against multi-node deployments (slim-primary today, slim-tier after phase 6). It can also live under a new subdirectory if the test fits poorly with the existing tests, but the simpler choice is the existing dir with atest_database_tier.pyfile. Smoke and guest CI don't see the test (they're single-node) which is fine — the test only matters in the tier shape. -
No bind-address change to the operator- facing tuning .cnf.
examples/mariadb- tuning.cnfis operator infrastructure; operators choose their own bind-address based on their network topology. The CI script's drop-in is the right home for the bind-all override because the script is already CI/dev-flavoured and warns it is not for production.
Steps¶
Six steps. Step 3 (the external repo work) is
coordinated outside this repo; steps 4 and 5
must not merge until step 3 has landed and is
discoverable from the CI runner, or the new
matrix entry's Build infrastructure step will
fail looking for a topology playbook that
doesn't exist.
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 1 | medium | opus | worktree | Multi-instance sf-database startup. Edit shakenfist/deploy/ansible/roles/base/tasks/register.yml. Find the task Start the database daemon on etcd_master nodes (currently around line 50, with delegate_to: "{{ groups['etcd_master'][0] }}" and run_once: true). Replace delegate_to and run_once with when: inventory_hostname in groups.get('etcd_master', []), matching the gate pattern the immediately-preceding Register the database daemon task already uses. Verify by reading the task block: after the edit, both the register and start tasks should look structurally identical except for the action (one calls sf-ctl register-daemon database --node-name, the other restarts the systemd unit). Update the leading comment block in register.yml (lines 1-15) — the comment currently says "Step 3: Start the database daemon on etcd_master nodes" which is correct, but it also says "All sf-ctl commands are delegated to etcd_master[0] (the database-tier node)" in the singular — change "node" to "nodes" and "the database-tier node" to "the first database-tier node". One commit. pre-commit run --all-files. |
| 2 | low | sonnet | none | MARIADB_GATEWAY_HOSTS list + bind-all drop-in. Two file edits in one commit. (a) shakenfist/deploy/ansible/roles/base/templates/config line 34 — change the single-host extraction SHAKENFIST_MARIADB_GATEWAY_HOSTS="{{hostvars[groups['etcd_master'][0]]['node_mesh_ip']}}" to a list join: SHAKENFIST_MARIADB_GATEWAY_HOSTS="{{ groups['etcd_master'] \| map('extract', hostvars, 'node_mesh_ip') \| join(',') }}". Verify the single-element case still renders without a trailing comma (jinja's join does the right thing). (b) tools/ci-install-mariadb.sh — after the Applying SF bootstrap snippet block and before the tuning install, add a new section that writes /etc/mysql/mariadb.conf.d/99-bind-all.cnf containing [mysqld]\nbind-address = 0.0.0.0\n. Make it unconditional (no flag); the script's leading comment already warns it is for CI / dev, not production. The existing systemctl restart mariadb block at the tuning install can be reused — if the tuning install is skipped (empty TUNING_PATH), the script must still restart MariaDB after writing the bind-all drop-in. Simplest shape: write the bind-all drop-in BEFORE the tuning install, hoist the restart-and-wait block out so it runs once after both edits. pre-commit run --all-files. bash -n tools/ci-install-mariadb.sh. One commit. |
| 3 | — | — | — | External coordination, not implemented by a sub-agent in this repo. Create a PR against shakenfist/actions adding ansible/ci-topology-slim-tier.yml. The playbook should provision three VMs (reuse the slim-primary IP block 10.0.0.20-22 if practical) and a matching getsf-wrapper that sets GETSF_NODE_PRIMARY to vm-1, GETSF_NODE_ETCD_MASTER to "vm-1 vm-2" (a two-element space-separated list, matching getsf's read GETSF_NODE_ETCD_MASTER shape), GETSF_NODE_NETWORK to vm-1, GETSF_NODE_HYPERVISOR to "vm-1 vm-2 vm-3" (so the third VM is a hypervisor), and GETSF_NODE_STORAGE empty. Pattern-match on the existing ci-topology-slim-primary.yml for the rest of the wrapper structure. This step is not a sub-agent task; the management session coordinates the actions-repo PR with the operator. Step 4 must not merge until this PR has landed on main and the setup-test-environment action's pinned ref (@main per the workflow YAML) picks it up. Steps 1, 2, 5 and 6 are independent of this and can land in any order before step 4. |
| 4 | medium | opus | worktree | Add the slim-tier matrix entry. Edit .github/workflows/functional-tests.yml. In the merge-queue job's matrix.merge array (around line 803) add a new entry after the last existing slim-primary entry: { name: 'Debian 12 tier', job_name: 'debian-12-slim-tier', base_image: 'sf://label/ci-images/debian-12', base_image_user: 'debian', topology: 'slim-tier', concurrency: 5, stestr_config: 'cluster-ci.conf' }. The existing steps under the merge-queue job already consume matrix.merge.topology to pick the topology playbook, so no other workflow edit is needed. Important: the existing Install BYO MariaDB on primary and Run getsf installer on primary step pair (around line 1445/1463 of the slim-primary merge sites) ssh-invokes the install on ${primary} only — but ${primary} is the primary VM, which is also where MariaDB lives, so this is still correct for slim-tier (the additional sf-database instance on vm-2 connects to MariaDB on vm-1 over the mesh). However the GETSF_MARIADB_HOST=127.0.0.1 literal in the installer step is wrong for slim-tier because the non-primary etcd_master node cannot use 127.0.0.1 to reach MariaDB. Change the value at the slim-tier site only to ${primary_mesh_ip} — or simpler, leave 127.0.0.1 at the slim-primary sites and inject a separate if [ matrix.merge.topology == 'slim-tier' ] shell branch in the new entry's getsf invocation that uses the primary's mesh IP. The cleanest approach is a per-site SSH command rather than a conditional, so duplicate the entire Install BYO MariaDB on primary + Run getsf installer on primary pair for slim-tier with the corrected GETSF_MARIADB_HOST value. Yes that's another five-or-so lines of YAML duplication — the phase-5 decision (workflow YAML duplication accepted) applies the same way here. actionlint via pre-commit catches yaml-syntax mistakes. One commit. The commit message must call out the dependency on shakenfist/actions step 3 having landed first. |
| 5 | high | opus | none | Functional test for the LB path. Create shakenfist/deploy/shakenfist_ci/cluster_ci_tests/test_database_tier.py with one test class TestDatabaseTier(base.BaseTestCase) and one test method test_grpc_lb_fans_out_across_sf_database_instances. The test does the following: (a) Discover sf-database instances by reading the cluster's node list (self.system_client.get_nodes()) and filtering to nodes whose is_etcd_master attribute is True — those are the etcd_master members and therefore the sf-database hosts. Note: phase 6 of PLAN-byo-mariadb does not rename is_etcd_master; that rename lives with PLAN-remove-primary phase 7. The attribute is still spelled is_etcd_master in the API response at the time this test ships. (b) Skip the test if fewer than 2 sf-database instances are present, with a clear self.skipTest('test_grpc_lb_fans_out_across_sf_database_instances requires N>=2 sf-database instances; saw N=...') message — that keeps the test harmless on the slim-primary / localhost shapes which only have one etcd_master. (c) Scrape http://<node_mesh_ip>:13006/metrics for each sf-database node before driving any traffic, parse out each database_*_total counter into a dict keyed by counter name. Use the existing requests package (already a test dep) and a small parser; do not pull in a prometheus client library for this. (d) Drive ~100 cheap sf-api calls — self.system_client.get_namespaces() is fine — each of which triggers at least one sf-database RPC on the api server's side. Do it serially; concurrency adds noise without strengthening the assertion. (e) Re-scrape /metrics on each sf-database node, compute deltas, sum across counter names per node. (f) Assert that each node saw at least 0.05 * total_delta_summed_across_nodes requests. The 5% floor accommodates jitter while failing loudly on a 100/0 split. Include self.addDetail calls dumping the before / after counter dicts so a failure produces a readable forensic trail. Keep the test focused on this one assertion — do not also assert per-RPC-name counters, that's brittle to RPC-call-count changes in unrelated phases. The test ships as part of cluster-ci.conf automatically (everything under cluster_ci_tests/ is discovered by stestr). pre-commit run --all-files. One commit. |
| 6 | low | sonnet | none | One-paragraph docs update. Edit docs/operator_guide/database.md. Find the existing section that describes MARIADB_GATEWAY_HOSTS (added during phase 2). Add a new sub-paragraph immediately after it titled "Multi-instance deployments" or similar, two-to-three sentences: (a) "More than one sf-database instance can run against the same MariaDB. List every instance's mesh IP in MARIADB_GATEWAY_HOSTS, comma-separated." (b) Show an example: MARIADB_GATEWAY_HOSTS="10.0.0.20,10.0.0.21". (c) "Every sf-database instance must be able to reach the MariaDB server; in BYO deployments this typically means the operator's MariaDB is bound to a routable interface, not localhost." Single commit. pre-commit run --all-files. The bigger docs sweep (lead-with-BYO restructure, README, ARCHITECTURE.md) is phase 7's scope, not this step. |
Validation¶
pre-commit run --all-filespasses after each step (flake8 / mypy / stestr / actionlint).- After step 1 + step 2: a manual ansible-lint dry
run (
ansible-lint shakenfist/deploy/ansible/) shows no new warnings. - After step 4 (with step 3 landed externally): the
next merge-queue run shows the new
debian-12-slim-tierjob appearing in the matrix and reaching green. The build- infrastructure step provisions three VMs, the BYO MariaDB install step succeeds on vm-1, theRun getsf installer on primarystep completes, andtools/run_remote vm-2 systemctl is-active sf-databasereturnsactive. - After step 5: the
test_grpc_lb_fans_outtest passes on a merge-queue run with thedebian-12-slim-tiermatrix entry, and is reported asskippedon every other matrix entry. The skipped reason in the stestr output is readable and mentions the N>=2 requirement. - After step 6:
mkdocs buildsucceeds and the new paragraph appears in the rendered operator-guide page.
Risks¶
- External repo coordination (step 3 landing
late). If the workflow change (step 4) merges
before the actions-repo PR lands, every PR's
merge-queue run fails at
Build infrastructurelooking foransible/ci-topology-slim-tier.yml. The brief explicitly tells the management session to land step 3 first and confirm thesetup-test-environmentaction picks up the new playbook. The reverse — landing step 3 before step 4 — is harmless; the topology just sits unused until step 4 merges. - Round-robin jitter under low call volume. 100 calls split across 2 instances is mean-50, variance-25. The probability of any single instance getting fewer than 5 calls (i.e. failing the 5% floor) is vanishingly small for a healthy round-robin policy, but a long GC pause or a transient unhealthy subchannel could push it. If the test flakes, the right fix is increasing the call count (to 200 or 500) rather than loosening the floor — the floor exists to catch silent LB regression.
- Non-primary etcd_master cannot reach
MariaDB on the primary. Step 2 fixes the
bind-address default. If the workflow still
fails on vm-2's sf-database startup after step
4 lands, the most likely cause is either: (a)
the bind-all drop-in wasn't applied (check
/etc/mysql/mariadb.conf.d/99-bind-all.cnfon the primary), or (b) iptables / nftables / the CI VM's network is blocking 3306 between VMs. Mitigation: a one-line systemd unit check in the workflow's logs section will catch (a); for (b), the CI VM image is controlled by the actions repo and is the same image used by slim-primary, which is already known to permit intra-cluster traffic, so this is unlikely. is_etcd_masterattribute name churn. PLAN-remove-primary phase 7 will renameis_etcd_master(Python attribute, ansible group, API field). If phase 7 lands between this plan's step 1 and step 5, step 5's test needs the new attribute name. Mitigation: the test's discovery logic is a single line (if node['is_etcd_master']); a future rename catches it via the test failing, which is fine. Phase 7's PR will be where the rename is threaded through.- The phase-1 schema-version verification
fires on the second sf-database start.
Phase 1 made sf-database verify the schema
version on startup and refuse to start on
mismatch. Two sf-database instances bringing
themselves up against the same
schema_versionstable in close succession is the first time this code path runs concurrently in CI. The read is a simpleSELECT version FROM schema_versions; there is no race here. If one instance fails to start because the table read raced with the other instance's read, that is a phase-1 bug to fix, not a phase-6 redesign. Mitigation: the workflow's logs already capture sf-database systemd output, so a failed-to-start on vm-2 surfaces directly. - gRPC subchannel health-check timing. The
round-robin policy excludes subchannels with
failing health checks. If vm-2's
sf-database is slow to come up, the channel
on the client side may briefly see only
vm-1 as healthy and route 100% of early
traffic there. The test scrapes after the
call loop completes, so a transient
unhealthy state during the loop's first few
calls is fine; only a persistent skew to
one instance fails the test. If the test
flakes here, the fix is a short
time.sleep(2)before the call loop to let the channel settle, or hitting the health endpoint of each sf-database directly before the loop to confirm both are SERVING.
Out of scope¶
- The
is_etcd_masterrename — PLAN-remove- primary phase 7. - A bigger docs rewrite (lead-with-BYO,
ARCHITECTURE.md,README.md, release notes) — phase 7 of this plan. - A scheduled-workflow
slim-tierentry — every merge gate already covers it; the scheduled workflow stays focused on long-running shapes. - Cross-instance sf-database state sharing
testing (caches, hostname collisions in
record_start) — these don't exist by design (sf-database is stateless), but if a future bug surfaces one, the regression test for it is its own commit, not part of this phase. - A
slim-tierPR-localhost matrix entry — doubles per-PR cost for marginal benefit. - A LB regression test against
localhosttopology — the test self-skips when N<2, which is the intended shape. - L4 LB / VIP regression coverage — there is no L4 LB in the architecture (decision 6 of the master plan).
- Per-caller authz testing — gated on the
mTLS work in
PLAN-embrace-tls.md.
Back brief¶
Before executing this phase, please back brief the operator on:
- The six steps and the file boundaries for each.
- Step 3 is not a sub-agent task in this
repo; it is a coordinated PR against the
external
shakenfist/actionsrepo. Step 4 cannot land until step 3 has merged and is picked up by thesetup-test-environmentaction's@mainref. Confirm the management session will own that coordination rather than spawn an agent to do it. - The decision to add the matrix entry on the merge-queue side only (not PR-localhost, not scheduled). Confirm.
- The decision to keep
examples/mariadb- tuning.cnfoperator-pure and put the bind-all drop-in inci-install-mariadb.shinstead. Confirm. - The functional test's 5% per-instance floor and the rationale (catches silent LB degeneracy without flaking on healthy jitter). Confirm.
- The decision to skip the test on N<2 topologies rather than fail it. Confirm.
- The dependency on phase 1's schema-version verify behaving cleanly under concurrent startup — phase 6 stresses it for the first time, and any latent bug there is a phase-1 fix, not a phase-6 redesign.