Remove the Apache load balancer from the deployer
Prompt
Before responding to questions or discussion points in this
document, explore the shakenfist codebase thoroughly. Read
relevant source files, understand existing patterns (the
ansible deployer layout, getsf topology generation, the
primary role, how api_url flows from topology JSON into
sfrc / shakenfist.json, and how the cluster_ci rig talks
to the API), and ground your answers in what the code
actually does today. Do not speculate about the codebase when
you could read it instead. Where a question touches on
external concepts (Apache mod_proxy_balancer, nginx
upstream / proxy_pass, HTTP reverse proxying, gunicorn),
research as needed to give a confident answer. Flag any
uncertainty explicitly rather than guessing.
All planning documents should go into docs/plans/.
Consult ARCHITECTURE.md for the system architecture
overview and daemon structure, CLAUDE.md for build commands
and project conventions (note the "REST API URL Structure"
section, which documents the gunicorn-on-:13000 vs
Apache-adds-/api distinction this plan acts on), and the
parent plan PLAN-remove-primary.md,
of which this work is the realisation of phase 3 ("Remove
Apache reverse proxy from deployer"), broken out into its own
focused master plan. Key references inside the repo include
shakenfist/deploy/ansible/roles/primary/tasks/apache2.yml
(the role action to delete),
shakenfist/deploy/ansible/files/apache-site-primary.conf
(the vhost template to delete and preserve as documentation),
shakenfist/deploy/ansible/deploy.yml (the primary_node
play that invokes the apache2 action),
shakenfist/deploy/getsf (where the single-node api_url
default is set), and docs/operator_guide/installation.md
(the operator-facing documentation to rewrite).
When we get to detailed planning, I prefer a separate plan
file per detailed phase. These separate files should be named
for the master plan, in the same directory as the master
plan, and simply have -phase-NN-descriptive appended before
the .md file extension. Tracking of these sub-phases should
be done via the table in the Execution section below.
I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit. Each commit should be self-contained: it should build, pass tests, and have a clear commit message explaining what changed and why.
Situation
The Shaken Fist deployer assumes the primary node doubles as the cluster's API load balancer by installing and configuring Apache. The mechanism is three pieces:
- roles/primary/tasks/apache2.yml installs apache2, enables the proxy, proxy_http and lbmethod_byrequests modules, writes a site config, and restarts the service.
- files/apache-site-primary.conf is the jinja-templated vhost. It listens on port 80 and proxies /api (plus the OpenAPI paths /apidocs, /flasgger_static, /apispec_1.json) to a balancer://sfapi pool whose BalancerMembers are every hypervisor's node_mesh_ip:13000, i.e. it round-robins across the gunicorn API workers.
- deploy.yml's primary_node play (around line 255) invokes role: primary with role_action: "apache2" before running postinstall.
This is the source of the /api URL prefix documented in
CLAUDE.md: gunicorn serves the API at / on port 13000,
and Apache is what maps the external /api/... path onto it.
The single-node api_url default that getsf emits
(http://127.0.0.1/api, set in two places around
getsf:1050 and getsf:1059) only resolves because Apache
is listening on port 80 and proxying /api.
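The prefix mapping described above can be illustrated with a minimal Python sketch. It is not code from the repository: the helper name backend_url is hypothetical, the /instances path is only an illustrative endpoint, and the treatment of non-/api paths (passed through unchanged) is an assumption rather than something read from the vhost.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these values mirror the prose above, not the real vhost file.
GUNICORN_PORT = 13000
API_PREFIX = "/api"


def backend_url(external_url: str, backend_host: str) -> str:
    """Map an external /api/... URL onto the gunicorn URL a proxy would call."""
    parts = urlsplit(external_url)
    path = parts.path
    if path == API_PREFIX or path.startswith(API_PREFIX + "/"):
        # The proxy strips the /api prefix; gunicorn serves the API at /.
        path = path[len(API_PREFIX):] or "/"
    # Any other path (for example /apidocs) is assumed to pass through as-is.
    return urlunsplit(
        ("http", f"{backend_host}:{GUNICORN_PORT}", path, parts.query, "")
    )


print(backend_url("http://127.0.0.1/api/instances", "127.0.0.1"))
```

With no proxy in place there is nothing to perform this rewrite, which is why a single-node api_url has to name port 13000 directly and drop the /api prefix.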
Two facts make this assumption removable with low risk:
- The only known production deployment does not use it. Mikal's production cluster fronts the API with nginx, not the deployer-installed Apache. The Apache install is dead weight there.
- CI does not depend on it either. The cluster_ci rig deploys a localhost (single-node) topology, but the functional tests export SHAKENFIST_API_URL=http://localhost:13000 (functional-tests.yml:416) and talk straight to gunicorn. Apache is installed in CI but never exercised by a test, so removing it does not change what CI validates.
The Apache balancer config is also strictly less capable than what a real operator wants: it has no TLS termination (it listens on port 80 only), no health checking of backends, and hard-codes the OpenAPI passthrough paths. Every operator who takes SF to production replaces it. It belongs in documentation as a worked example, not in the deployer as an installed default.
The broader direction — SF stops being a platform deployer
and becomes an opinionated application that runs against
operator-provided infrastructure (DB, metrics, logs, load
balancer) — is set out in
PLAN-remove-primary.md. This plan
delivers the load-balancer slice of that vision.
Mission and problem statement
Remove every trace of the Apache load balancer from the deployer, and replace it with documentation that shows operators how to put their own reverse proxy / load balancer in front of the cluster. Concretely:
- The deployer no longer installs or configures Apache. The apache2.yml role action and the apache-site-primary.conf template are deleted, and the primary_node play no longer invokes the apache2 action. (The primary role itself stays: it still owns bootstrap, cluster_config, rsyslog, the ad-hoc inventory, and postinstall. Removing the primary node wholesale is the parent plan's job, not this one.)
- The single-node convenience deployment keeps working with no operator-provided proxy at all, by pointing api_url at gunicorn directly (http://127.0.0.1:13000, no /api prefix). This is the documented "localhost:13000 escape hatch."
- Production operators are given working example apache2 and nginx configurations in the operator guide, derived from (and replacing) the deleted vhost. The examples keep the /api external-path convention so existing api_url: https://lb.example.com/api values continue to work, and additionally show TLS termination as the thing an operator actually needs to add.
- The installation docs stop claiming "the primary node runs an apache load balancer" and instead state that load balancing is operator-provided, linking to the examples.
The principle, inherited from the parent plan: SF deploys
sf-* daemons on the hosts you tell it about, against
infrastructure (including the load balancer) whose addresses
you tell it. The load balancer is now firmly on the
operator's side of that line.
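To make the operator-provided proxy concrete, here is a minimal Python sketch that renders an nginx site of the shape described above. It is not the example config this plan ships: the function name render_nginx_site, the certificate paths, and the server name are hypothetical, the upstream name sfapi is borrowed from the Apache balancer name, and the assumption that gunicorn serves the OpenAPI paths unprefixed has not been checked against the vhost.

```python
# Assumption: gunicorn serves these OpenAPI paths at the same path it is asked for.
PASSTHROUGH_PATHS = ["/apidocs", "/flasgger_static", "/apispec_1.json"]


def render_nginx_site(node_ips, server_name, cert_path, key_path):
    """Render an nginx upstream + proxy_pass site for the given API nodes."""
    # One upstream member per hypervisor; nginx round-robins by default.
    members = "\n".join(f"    server {ip}:13000;" for ip in node_ips)
    passthrough = "\n".join(
        f"    location {path} {{ proxy_pass http://sfapi; }}"
        for path in PASSTHROUGH_PATHS
    )
    return f"""upstream sfapi {{
{members}
}}

server {{
    listen 443 ssl;
    server_name {server_name};
    ssl_certificate {cert_path};
    ssl_certificate_key {key_path};

    # Trailing slashes make nginx strip /api/ before proxying.
    location /api/ {{ proxy_pass http://sfapi/; }}
{passthrough}
}}
"""


print(render_nginx_site(
    ["10.0.0.1", "10.0.0.2"],          # hypothetical node_mesh_ip values
    "lb.example.com",
    "/etc/ssl/certs/lb.example.com.pem",    # hypothetical certificate path
    "/etc/ssl/private/lb.example.com.key",  # hypothetical key path
))
```

The listen 443 ssl block is the part the deployer's Apache never had. An operator would normally write this file once by hand; the renderer only shows which values vary from cluster to cluster.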
Alternatives considered
Keep Apache but make it optional (a topology flag)
We could gate the apache2 action behind a topology flag so
operators who want the bundled balancer keep it. We reject
this: it preserves the maintenance burden (the vhost template,
the BalancerMember loop, the unused-in-CI install) to serve
a configuration nobody is known to use, and it muddies the
"infrastructure is operator-provided" line the parent plan is
drawing. A documented example config an operator copies once
is strictly less code to own than a conditional role action
plus its template.
Replace Apache with a bundled nginx
Since the one real deployment uses nginx, we could swap the deployer to install nginx instead of Apache. We reject this for the same reason: it keeps the deployer in the load-balancer business, just with a different daemon. The operator's nginx is already configured to their needs (TLS, WAF, cert rotation, their own logging); a second deployer-managed nginx would fight it. Ship the nginx config as an example, not as an install.
Drop the /api prefix convention entirely
We could take this opportunity to serve the API at / behind
the operator's LB and retire the /api prefix. We reject
this for this plan: the /api prefix is baked into existing
operators' api_url values, the OpenAPI doc URLs, and
client expectations. Changing it is a compatibility break
orthogonal to removing Apache, and would be its own plan. The
example LB configs therefore preserve /api. Single-node is
the one place there is no /api prefix, because there is no
proxy — and that is already the reality, just made explicit.
Open questions
- Where do the example LB configs live? Resolved: a dedicated docs/operator_guide/load_balancing.md page, linked from installation.md.
- Should the example configs ship as files too? Resolved: yes. Ship examples/apache-loadbalancer.conf and examples/nginx-loadbalancer.conf (mirroring the parent plan's examples/grafana-dashboard.json precedent), and snippet them into the doc page so the rendered docs and the copyable files stay in sync.
- Does the multi-node example topology in installation.md need its api_url reconsidered? The sample at installation.md:79 uses https://...your...install...here.com/api. With Apache gone, that URL now points at the operator's own LB rather than the deployer's Apache, but the value itself is still correct if the operator follows the example configs. Phase 1 should make that dependency explicit in the prose, not leave it implied.
- Is the api_url topology field still mandatory for the primary node? Today deploy.yml:42-46 records api_url from the primary_node entry, and getsf only emits the stanza for primary nodes. After this change, single-node gets :13000 and multi-node operators supply their LB URL. Confirm in phase 2 that a missing or non-primary api_url still degrades gracefully (it currently defaults to empty) and document the expectation.
Execution
The work is small and splits cleanly into a docs-first phase (which captures the soon-to-be-deleted Apache config as a worked example before anything is removed, so no knowledge is lost in between) and a deletion phase (which removes the deployer code and repoints the single-node default). Both phases leave CI green.
| Phase | Plan | Status |
|---|---|---|
| 1. Document operator-provided load balancing (example apache2 + nginx configs, single-node escape hatch) | PLAN-remove-apache-lb-phase-01-docs.md | Complete |
| 2. Remove Apache from the deployer; repoint single-node api_url to :13000 | PLAN-remove-apache-lb-phase-02-remove.md | Complete (pending CI confirmation) |
Phase notes:
- Phase 1 (medium effort, docs-first). Add a load balancing page to docs/operator_guide/ (open question 1) with two worked examples: an Apache mod_proxy_balancer config derived from the existing apache-site-primary.conf, and an equivalent nginx upstream + proxy_pass config that matches the real production deployment's shape. Both preserve the /api, /apidocs, /flasgger_static, /apispec_1.json passthroughs and both show where TLS termination plugs in (the deployer's Apache had none). Document the single-node http://localhost:13000 escape hatch: when there is no proxy, point api_url straight at gunicorn. Rewrite the "primary node runs an apache load balancer" paragraph in installation.md:51 to say load balancing is operator-provided and link the new page; clarify the api_url expectation in the multi-node example (open questions 3, 4). Ship the configs under examples/ as well (open question 2, resolved as yes). This phase adds no code changes and cannot break CI; it deliberately lands the documentation before the source vhost is deleted so the example is a faithful copy.
- Phase 2 (medium effort, deletion). Delete roles/primary/tasks/apache2.yml and files/apache-site-primary.conf. Remove the role: primary / role_action: "apache2" invocation from the primary_node play in deploy.yml (keep the postinstall invocation that follows it). Change the single-node / primary api_url default in getsf (both the localhost branch near line 1050 and the GETSF_NODE_PRIMARY branch near line 1059) from http://127.0.0.1/api to http://127.0.0.1:13000. Verify the cluster_ci deploy still goes green. It should, because the functional tests already export SHAKENFIST_API_URL=http://localhost:13000 and never used the Apache /api path. Grep the tree for any remaining apache or /api assumptions in the deployer (e.g. shakenfist.json, sfrc, getsf comments) and for the installation.md apache mention, to make sure nothing still installs or references the bundled balancer. (The apache2 install inside cluster_ci_tests/test_floating_ips.py is not in scope: it installs Apache inside a test VM as a web server for floating-IP testing, unrelated to the deployer's load balancer.)
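The phase 2 grep for leftovers could be sketched in Python as below. This is an illustration rather than a tool from the repository: the function name find_leftovers and the two search patterns are assumptions, and a plain grep -rn would do the same job.

```python
import re
from pathlib import Path

# Assumption: these patterns are illustrative, not an exhaustive audit list.
LEFTOVER = re.compile(r"apache|127\.0\.0\.1/api", re.IGNORECASE)


def find_leftovers(root, skip=("test_floating_ips.py",)):
    """Return (path, line number, text) for lines that still mention Apache."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # The in-VM apache install in test_floating_ips.py is out of scope.
        if not path.is_file() or path.name in skip:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if LEFTOVER.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical invocation from the repository root.
    for hit in find_leftovers("shakenfist/deploy"):
        print(*hit, sep=":")
```

Hits inside the new load balancing documentation are expected, since that page deliberately keeps an Apache example; the sketch is only useful for spotting deployer code that still installs or assumes the bundled balancer.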
Agent guidance
Execution model
All implementation work is done by sub-agents, never in the management session. The management session (this conversation) is reserved for planning, review, and decision-making. This keeps the management context lean and avoids drowning it in implementation diffs.
The workflow is:
- Plan at high effort in the management session.
- Spawn a sub-agent for each implementation step with the brief from the plan, at the recommended effort level and model.
- Review the sub-agent's output in the management session. Check the actual files — the sub-agent's summary describes what it intended, not necessarily what it did.
- Fix or retry if the output is wrong. Diagnose whether the brief was insufficient (improve it) or the model was too light (upgrade it), then re-run.
- Commit once the management session is satisfied with the result.
This applies to all steps. If a sub-agent can't succeed even with a detailed brief and the right model, that's a signal the brief needs improving, not that the management session should do the implementation itself.
Both phases of this plan are low-risk deletions / docs and
can work directly in the main tree. Reach for
isolation: "worktree" only if a phase-2 sub-agent's CI
verification turns out to require experimental iteration.
Planning effort
The master plan itself was created at high effort. Both
phase plans can be planned at medium effort — phase 1 is
documentation authoring with the source config in hand, and
phase 2 is a bounded deletion plus a one-line default change
whose blast radius (the api_url flow and the CI rig) is
already mapped in this master plan.
Step-level guidance
Each phase plan should include a table like this:
| Step | Effort | Model | Isolation | Brief for sub-agent |
|------|--------|-------|-----------|---------------------|
| 1a | medium | sonnet | none | One-sentence summary of what to do and which files to touch |
| 1b | high | opus | worktree | Why this needs high effort: requires understanding X to do Y |
Effort levels:
- high — Requires reading multiple files, making judgment calls, understanding non-obvious invariants, or researching external references.
- medium — The plan provides enough context that the sub-agent can follow a clear brief.
- low — Purely mechanical changes (delete a file, remove a play stanza, change a string default).
Model choice:
- opus — Deep reasoning, cross-daemon architectural understanding, subtle correctness judgment.
- sonnet — Good default for well-briefed implementation work. Both phases here suit sonnet given the briefs in this plan.
- haiku — Purely mechanical tasks: file deletion, search-and-replace.
When in doubt, skew to the more capable model. Saving money only matters if the outcome is still acceptable.
Brief for sub-agent: Write it as if briefing a colleague
who has never seen the codebase. Include what to change,
which files to touch, what patterns to follow, and any
non-obvious constraints. In particular, a phase-2 brief must
spell out that the /api prefix is an Apache artifact, that
single-node must therefore move to :13000, and that CI
already hits :13000 directly so a green functional-tests
run is the acceptance signal.
Management session review checklist
After a sub-agent completes, the management session should verify:
- The files that were supposed to change actually changed (read them, don't trust the summary).
- No unrelated files were modified (in particular, the test_floating_ips.py in-VM apache install is untouched).
- The code passes pre-commit run --all-files (flake8, stestr unit tests, mypy).
- The cluster_ci deploy still succeeds: this plan is operator-facing, and an internally-clean change that breaks the CI deploy is a regression.
- The example LB configs in docs actually proxy the same paths the deleted vhost did (/api, /apidocs, /flasgger_static, /apispec_1.json).
- Commit message follows project conventions (including the Co-Authored-By line with model, context window, effort level, and other settings).
Administration and logistics
Success criteria
We will know this plan has been successfully implemented when the following statements are true:
- The code passes pre-commit run --all-files (flake8, stestr unit tests, and mypy type checking).
- roles/primary/tasks/apache2.yml and files/apache-site-primary.conf no longer exist, and no part of the deployer installs or configures Apache.
- The primary_node play in deploy.yml no longer invokes the apache2 role action; postinstall still runs.
- getsf emits http://127.0.0.1:13000 (not http://127.0.0.1/api) as the single-node / primary api_url default, and a single-node getsf deploy yields a working sf-client with no proxy installed.
- The cluster_ci functional tests still pass (they already target http://localhost:13000 directly).
- docs/operator_guide/ contains a load-balancing page with working example apache2 and nginx configurations that preserve the /api, /apidocs, /flasgger_static, and /apispec_1.json passthroughs and show where TLS termination plugs in.
- installation.md no longer claims the primary node runs an Apache load balancer; it describes load balancing as operator-provided and links the new page, and the single-node :13000 escape hatch is documented.
- ARCHITECTURE.md, README.md, and AGENTS.md are updated if they reference the primary-node Apache load balancer.
Future work
- Drop the /api prefix convention. Serving the API at / behind the operator's LB and retiring the /api prefix would simplify the URL story, but it is a compatibility break orthogonal to removing Apache and wants its own plan.
- Health-check-aware load balancing. The example LB configs can only become genuinely production-honest once SF exposes /healthz / readiness endpoints. Tracked in PLAN-health-checks.md; once it lands, revisit the example configs to add active backend health checks.
- TLS / mTLS between components. The example configs terminate operator TLS at the LB; in-cluster TLS (gunicorn, gRPC) is tracked in PLAN-embrace-tls.md.
- Update the parent plan. Mark phase 3 of PLAN-remove-primary.md ("Remove Apache reverse proxy from deployer") as realised by this plan, so the two do not describe overlapping work as both outstanding.
Bugs fixed during this work
This section should list any bugs we encounter and fix during development.
Documentation index maintenance
This plan has been registered in docs/plans/:
- index.md: a row added to the Plan Status table linking this plan, its phases, status, and one-line description.
- order.yml: an entry added next to PLAN-remove-primary.md so it appears in the documentation navigation. Phase files are not added to order.yml; they are linked from this plan's Execution table and from index.md only.
The site navigation in mkdocs.yml is produced from
mkdocs.yml.tmpl by the docs-sync workflow, which consumes
order.yml; it does not need hand-editing.
When all phases are complete, update the status column in
docs/plans/index.md.
Back brief
Before executing any step of this plan, please back brief the operator as to your understanding of the plan and how the work you intend to do aligns with that plan.