Lease-based per-network carrier model with VIP advertisement
Prompt
Before responding to questions or discussion points in this
document, explore the shakenfist codebase thoroughly. Read
the network daemon (shakenfist/daemons/network/), the
existing "network node" model (where the singleton role is
configured and which daemons are gated on it), the floating
IP allocation and SNAT programming, the DHCP/DNS state
managed via dnsmasq, the cluster-lock leasing pattern in
shakenfist/locks.py, and the leader-election pattern that
PLAN-remove-primary phase 5 introduces for sf-database.
Ground your answers in what the code actually does today.
Where a question touches on external concepts (BGP for VIP advertisement, gratuitous ARP / VRRP for L2 failover, the operational differences between routed and bridged cloud deployments, embedded BGP speakers such as GoBGP, BIRD, FRR), research as needed to give a confident answer. Flag any uncertainty explicitly.
All planning documents go into docs/plans/.
Consult ARCHITECTURE.md for the system architecture
overview, especially the network subsystem and the existing
network node role. Consult CLAUDE.md for build commands,
project conventions, the cluster-lock leasing pattern, the
single-mutator direction in PLAN-network-facade, and the
exec-to-netlink direction in PLAN-replace-exec-with-netlink.
This plan is a placeholder. It captures intent and the known open questions and is intentionally light on detail. Phase 0 will resolve the open questions into a decisions section and the phase table below will be re-cut accordingly.
When we get to detailed planning, I prefer a separate plan
file per detailed phase, named for the master plan with
-phase-NN-descriptive appended before the .md extension.
I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit.
Situation
Shaken Fist's network node is the last named singleton role
in the architecture. After PLAN-remove-primary phase 5
makes sf-database electable, every other daemon will
either run on every node or elect a leader from a pool. The
network node is the holdout: it is configured (not elected),
hosts every virtual network's egress floating IP, programs
every network's SNAT / DHCP / DNS, and is a single failure
domain for external connectivity across the entire cluster.
The original PLAN-remove-primary "future work" stub for
this was "network node failover, with operator-provided
keepalived / corosync / BGP." That framing is honest about
the L2-vs-L3 mechanism question but understates the
architectural opportunity. A richer answer:
Smear the carrier role across a configurable pool of nodes, per network. Each network independently leases its egress floating IP and gateway role to one carrier from an eligible pool. Different networks can have different carriers. Total egress bandwidth scales with the pool instead of bottlenecking at one node. Blast radius shrinks to "the networks currently leased to the failed carrier," not "every network in the cluster." Locality becomes possible — a network whose VMs mostly live on node X can have its carrier on node X, avoiding the mesh hop for the common-case traffic.
The architectural framing that ties this together is
"network-node state is data, the carrier is a renderer."
Every piece of state that today lives as configured kernel
state on the singleton network node — SNAT rules, DHCP
leases, DNS records, the DNAT'd service ports from
PLAN-network-service-ports, the floating-IP itself —
becomes a row (or rows) in MariaDB. The carrier is a
process that, on lease acquire, reads its leased networks'
state and materialises it into kernel state. On lease loss,
it tears that state down. On failover, the new carrier
reads the same data and reprograms. Survivability of
in-progress state across failover is a property of the
data, not of any kernel-state-sync handshake.
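The "state is data, the carrier is a renderer" framing can be sketched as follows. This is a minimal illustration, not the Shaken Fist implementation: `NetworkState`, `FakeKernel`, and `CarrierRenderer` are hypothetical names, the store is a plain dict standing in for MariaDB, and the "kernel" is a dict standing in for real SNAT / DHCP / DNS programming.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NetworkState:
    """Desired state for one network, as it might be read from the database."""
    network_uuid: str
    egress_ip: str
    snat_cidr: str
    dns_records: tuple = ()


@dataclass
class FakeKernel:
    """Stand-in for kernel state; real code would program NAT, dnsmasq, etc."""
    rules: dict = field(default_factory=dict)


class CarrierRenderer:
    def __init__(self, store, kernel):
        self.store = store    # hypothetical mapping: network_uuid -> NetworkState
        self.kernel = kernel

    def on_lease_acquired(self, network_uuid):
        # Read the authoritative data, then materialise it. Nothing is
        # copied from the previous carrier.
        state = self.store[network_uuid]
        self.kernel.rules[network_uuid] = {
            'address': state.egress_ip,
            'snat': (state.snat_cidr, state.egress_ip),
            'dns': list(state.dns_records),
        }

    def on_lease_lost(self, network_uuid):
        # Tear down everything rendered for this network; safe to repeat.
        self.kernel.rules.pop(network_uuid, None)


store = {'net-1': NetworkState('net-1', '203.0.113.10', '10.0.0.0/24')}
old = CarrierRenderer(store, FakeKernel())
new = CarrierRenderer(store, FakeKernel())

old.on_lease_acquired('net-1')
rendered_before = dict(old.kernel.rules)
old.on_lease_lost('net-1')        # the old carrier loses the lease
new.on_lease_acquired('net-1')    # the new carrier renders from the same data
```

Because both carriers render from the same rows, the new carrier ends up with the same rendered state without any handshake with the old one.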
Two advertisement mechanisms cover the operational landscape:
- BGP advertisement: each carrier advertises the floating IPs it currently holds via BGP, either through an embedded speaker (GoBGP / BIRD / FRR managed by SF) or by configuring an operator-external speaker. Failover converges in seconds. Required for L3-routed deployments (cloud, modern on-prem); requires operator network gear to peer.
- L2 advertisement (gratuitous ARP, optionally VRRP-style): the new carrier fires GARP for the floating IP; L2 switches converge. No BGP required. Works only when all candidate carriers share an L2 segment with the operator's upstream router; slower failover; doesn't cross L3 boundaries.
Phase 0 decides which modes ship in v1 and how the operator selects between them.
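One way to keep the mode decision open until phase 0 is to hide advertisement behind a small interface. The sketch below is an assumption about shape only: `Advertiser`, the three subclasses, and `advertiser_for` are hypothetical names, and the methods return descriptions of actions rather than talking to a BGP speaker or sending ARP.

```python
from abc import ABC, abstractmethod


class Advertiser(ABC):
    """Hypothetical interface; each mode decides what 'announce' means."""

    @abstractmethod
    def announce(self, floating_ip: str) -> list[str]:
        """Return the actions taken when this carrier acquires the IP."""

    @abstractmethod
    def withdraw(self, floating_ip: str) -> list[str]:
        """Return the actions taken when this carrier loses the IP."""


class BgpAdvertiser(Advertiser):
    def announce(self, floating_ip):
        return [f'bgp: advertise {floating_ip}/32']

    def withdraw(self, floating_ip):
        return [f'bgp: withdraw {floating_ip}/32']


class GarpAdvertiser(Advertiser):
    def announce(self, floating_ip):
        return [f'l2: send gratuitous ARP for {floating_ip}']

    def withdraw(self, floating_ip):
        # Nothing to retract: the next carrier's GARP overrides ours.
        return []


class ManualAdvertiser(Advertiser):
    def announce(self, floating_ip):
        return [f'event: carrier acquired {floating_ip}']

    def withdraw(self, floating_ip):
        return [f'event: carrier released {floating_ip}']


ADVERTISERS = {
    'bgp': BgpAdvertiser,
    'l2': GarpAdvertiser,
    'manual': ManualAdvertiser,
}


def advertiser_for(mode: str) -> Advertiser:
    try:
        return ADVERTISERS[mode]()
    except KeyError:
        raise ValueError(f'unknown advertisement mode: {mode}') from None
```

Whether the mode is looked up per cluster or per network is exactly the selection question phase 0 has to answer; the registry works for either.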
A future direction worth naming explicitly so the eligibility model is shaped to handle it: mapping external VLANs into specific virtual networks. That makes carrier eligibility per-network and operator-provisioned ("this carrier has VLAN 42 on its trunk; only networks bound to VLAN 42 can be carried here"). This plan does not implement VLAN mapping (that is its own future plan), but the eligibility-filter design must not bake in assumptions that VLAN trunking would later force us to rewrite.
Mission and problem statement
Shaken Fist's network role is no longer a configured singleton. Each network leases its gateway role to one carrier from an operator-configured pool. Carriers materialise network state from MariaDB into kernel state on lease acquire and tear it down on lease loss. Floating IPs are advertised via BGP or L2 depending on operator choice. Locality-aware placement prefers carriers where the network's instances already run. The architecture has no named-singleton roles left.
Concretely, after this plan lands:
- Operators designate a carrier pool — the set of nodes eligible to carry virtual networks. Pool membership is config-driven; defaults to all nodes for small deployments, narrows to a privileged subset for larger ones where only some nodes have the network configuration to be carriers.
- Per-network carrier leases, backed by the cluster_locks leasing pattern (refresh-while-alive, steal-on-expiry, lost_event signal on confirmed loss).
- A carrier process on each pool member that watches for leased networks, materialises their state on acquire, and tears it down on loss. The process is a renderer: state flows from MariaDB to kernel, not the other way.
- All network-node-resident state lives in MariaDB:
  - The egress floating IP itself (as an allocation, via PLAN-generic-allocator).
  - DNAT'd service ports (via PLAN-network-service-ports).
  - DHCP leases (so a carrier change does not lose the lease database).
  - DNS records.
  - SNAT-policy state (derived from network config plus the egress IP allocation, not separately persisted).
- Locality-aware placement for carrier leases. Filter pool members by eligibility; score by locality (instances on this network running on this carrier) and load (number of networks already carried); tie-break by round-robin. Same conditional-INSERT pattern as PLAN-scheduler-reservations.
- Pluggable advertisement:
  - BGP via embedded speaker, OR
  - BGP via operator-external speaker (SF programs routes via the speaker's API), OR
  - L2 GARP / VRRP-style, OR
  - Manual (operator handles it; SF emits events but doesn't program advertisement)
- Cluster-wide network work queue uses assignment-at-enqueue routing. The enqueuer looks up the network's current carrier and stamps assigned_node_uuid on the queue row at insert time. Carriers pull WHERE assigned_node_uuid = me AND status = pending: a fixed pair of equality predicates whose cost is independent of how many networks a carrier ends up holding. Carrier change triggers a re-route step on lease acquire: the new carrier issues UPDATE ... SET assigned_node_uuid = me WHERE network_uuid = ? AND status = pending, bounded by in-flight queue depth. This was chosen over filter-on-read (WHERE network_uuid IN (...)), which works but whose cost grows with fan-out, and over hash-partitioned queues, which carry rebalance overhead this design doesn't need.
- The legacy "network node" config is removed (or kept as the degenerate-single-carrier case for small deployments).
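The assignment-at-enqueue routing described in the list above can be sketched end to end. This is an illustration under stated assumptions: SQLite stands in for MariaDB so the example is self-contained, and the `carrier_leases` and `network_queue` tables and their columns (other than `assigned_node_uuid`, `network_uuid`, and `status`) are hypothetical.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE carrier_leases (
        network_uuid TEXT PRIMARY KEY, node_uuid TEXT);
    CREATE TABLE network_queue (
        id INTEGER PRIMARY KEY, network_uuid TEXT, payload TEXT,
        status TEXT DEFAULT 'pending', assigned_node_uuid TEXT);
''')


def enqueue(network_uuid, payload):
    # Look up the current carrier and stamp it on the row at insert time.
    row = db.execute(
        'SELECT node_uuid FROM carrier_leases WHERE network_uuid = ?',
        (network_uuid,)).fetchone()
    db.execute(
        'INSERT INTO network_queue (network_uuid, payload, assigned_node_uuid) '
        'VALUES (?, ?, ?)',
        (network_uuid, payload, row[0] if row else None))


def pull(node_uuid):
    # The predicate does not mention networks, so its shape does not change
    # with the number of networks this carrier holds.
    return db.execute(
        "SELECT payload FROM network_queue "
        "WHERE assigned_node_uuid = ? AND status = 'pending' ORDER BY id",
        (node_uuid,)).fetchall()


def acquire(network_uuid, node_uuid):
    db.execute('REPLACE INTO carrier_leases VALUES (?, ?)',
               (network_uuid, node_uuid))
    # Re-route step: claim work still pending for this network.
    db.execute(
        "UPDATE network_queue SET assigned_node_uuid = ? "
        "WHERE network_uuid = ? AND status = 'pending'",
        (node_uuid, network_uuid))


acquire('net-1', 'node-a')
enqueue('net-1', 'add-dnat')
acquire('net-1', 'node-b')       # failover: node-b takes the lease
enqueue('net-1', 'update-dns')
```

After the failover, `pull('node-a')` finds nothing and `pull('node-b')` sees both items in order. A real implementation would need the lease change and the re-route to be transactionally consistent, which this sketch does not attempt.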
The principle is: every persistent network role is a leased rendering of data, not a configured kernel state.
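The lease behaviour named above (refresh-while-alive, steal-on-expiry, a lost_event signal on confirmed loss) can be illustrated with an in-memory table and an injected clock. This is a sketch, not the API of shakenfist/locks.py: `LeaseTable` and its methods are hypothetical, and a real version would make each check-and-write a single atomic database statement.

```python
import threading

LEASE_TTL = 60.0  # seconds; the cluster_locks expiry quoted in this plan


class LeaseTable:
    """In-memory stand-in for one lease row per network; not a real schema."""

    def __init__(self):
        self._rows = {}  # network_uuid -> (holder, expires_at)

    def acquire(self, network_uuid, node, now):
        holder, expires_at = self._rows.get(network_uuid, (None, 0.0))
        if holder in (None, node) or expires_at <= now:
            # Free, already ours, or expired: take (or steal) the lease.
            self._rows[network_uuid] = (node, now + LEASE_TTL)
            return True
        return False

    def refresh(self, network_uuid, node, now, lost_event):
        holder, _ = self._rows.get(network_uuid, (None, 0.0))
        if holder != node:
            # Another node holds the row: the loss is confirmed, so signal
            # the renderer to tear down its state.
            lost_event.set()
            return False
        self._rows[network_uuid] = (node, now + LEASE_TTL)
        return True


leases = LeaseTable()
lost_a = threading.Event()

got_a = leases.acquire('net-1', 'node-a', now=0.0)
blocked_b = leases.acquire('net-1', 'node-b', now=30.0)   # lease still live
stolen_b = leases.acquire('net-1', 'node-b', now=61.0)    # expired, so stolen
refreshed_a = leases.refresh('net-1', 'node-a', now=62.0, lost_event=lost_a)
```

Passing `now` explicitly keeps the example deterministic; in the single-member pool case the same code never steals, because the only contender is the current holder.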
Open questions
This plan is the largest of the three in this thread and has the most open questions. Phase 0 will resolve at least:
- Carrier pool configuration shape. Ansible group (parallels how etcd_master is configured today)? Per-node config flag (CARRIER_ELIGIBLE=true)? A node attribute in MariaDB? Phase 0 picks one, with the criterion that operator-driven changes to pool membership should be picked up without restarts.
- Lease TTL, refresh cadence, and lost_event semantics. Mirror cluster_locks (60s expiry, 20s refresh) or pick different numbers based on the failover-time requirements? The advertisement mode matters: BGP convergence is seconds and tolerates short leases; L2 GARP is faster but flap-prone if the lease is too short.
- Advertisement mode selection. Per-cluster default? Per-network override (some networks BGP-advertised, others L2)? Phase 0 confirms whether mixing modes in one cluster is a feature or a footgun.
- Embedded BGP speaker choice. GoBGP (mature, library-friendly, easy embedding), BIRD (standard but process-oriented), FRR (very fully featured, heaviest). If embedded BGP ships in v1, phase 0 picks one and documents the reasoning. If embedded BGP is v2 and v1 ships operator-external BGP only, phase 0 says so.
- L2 advertisement specifics. GARP only (simple, no election, just announce on acquire) or VRRP-style (peers elect the master, advertise via VRRP). GARP-only is cheaper but requires SF's lease to be the source of truth (since there's no L2 election to break ties). VRRP adds a second consensus mechanism that has to agree with SF's lease, which is risky. Phase 0 picks.
- DHCP state across carrier change. Today dnsmasq's lease file is a local file on the network node. If the carrier changes, naive failover loses the lease database — clients get new IPs on renewal. Options: (a) persist leases to MariaDB and have the new carrier rebuild dnsmasq's lease file at acquire; (b) accept transient DHCP problems on failover and document; (c) use a dnsmasq config that survives reload more gracefully. Phase 0 picks; (a) is the right answer for "carrier failure is invisible" but is the most work.
- DNS state across carrier change. Same shape as DHCP. Records are derived from network config plus instance state, so probably rebuildable from data without persistence. Confirm.
- In-flight TCP across carrier change. A long-lived connection through DNAT survives carrier failover only if the new carrier installs identical conntrack state. Almost certainly not in scope for v1; document the operator-visible behaviour clearly ("active TCP sessions through DNAT'd service ports may reset on carrier failover; SSH or browser-console sessions typically reconnect transparently"). Phase 0 confirms.
- VLAN-trunking forward compatibility. Eligibility filtering today is "is this node in the carrier pool?" Future VLAN-trunking adds "...and does this node have the right VLAN on its trunk?" The eligibility model must not bake in "all carriers are equivalent." Phase 0 designs the eligibility-filter query to accept future per-network constraints without rewrite. Out of scope to implement VLAN trunking here; in scope to not block it.
- Migration from singleton network node. Rolling cutover: the existing network node is initially the only pool member. Operators expand the pool, networks rebalance via lease churn or operator-triggered rebalance. Phase 0 designs the migration story and confirms it does not require simultaneous downtime across all networks.
- Single-node-cluster behaviour. Carrier pool of one degenerates to "current behaviour, but lease-managed." Confirm this case is not pessimised (no unnecessary lease churn, no false-failover detection in a one-member pool).
- Interaction with PLAN-network-facade and PLAN-replace-exec-with-netlink. The carrier is exactly the single-mutator that PLAN-network-facade is shaping net-worker to be, but per-carrier-node rather than as a cluster-wide singleton. Phase 0 confirms whether net-worker remains a cluster-wide singleton with the carrier role layered on top, or whether net-worker itself shards per carrier. The sf-net-privexec daemon from PLAN-replace-exec-with-netlink runs on every pool member so this composes naturally; confirm.
- Floating-IP-as-allocation. The egress floating IP is currently a per-network attribute. Once it's an allocation in PLAN-generic-allocator, the per-network attribute can be derived from the allocation or removed in favour of querying the allocation directly. Phase 0 picks.
- Locality scoring inputs. "Instances on this network running on this carrier" is the obvious locality signal. Other plausible inputs: current number of networks the carrier leases (balance); current egress bandwidth on the carrier (load); operator-declared affinity hints. Phase 0 picks the initial scoring function with room to grow.
- Network-node config removal. Once carriers ship, the legacy network_node ansible group / config flag is at best dead code, at worst an attractive nuisance for misconfiguration. Phase 0 decides whether v1 removes it or whether removal is a separate cleanup plan.
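The eligibility and locality-scoring questions above interact, so it may help to see them together. The sketch below is one possible shape, not a decision: `Candidate`, `eligible`, `score`, and `pick_carrier` are hypothetical names, capabilities are modelled as a set of strings, and the lexicographic (locality, then load) ordering is a placeholder for whatever phase 0 picks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node: str
    in_pool: bool
    capabilities: frozenset   # e.g. {'vlan:42'}; hypothetical representation
    local_instances: int      # instances on this network running on the node
    networks_carried: int     # current lease count, used as a load signal


def eligible(candidate, required=frozenset()):
    # Pool membership plus a set of hard constraints. With no constraints
    # this reduces to today's "is this node in the carrier pool?".
    return candidate.in_pool and required <= candidate.capabilities


def score(candidate):
    # Higher locality wins first, then lower load. Placeholder ordering.
    return (candidate.local_instances, -candidate.networks_carried)


def pick_carrier(candidates, required=frozenset(), rr_offset=0):
    pool = [c for c in candidates if eligible(c, required)]
    if not pool:
        return None
    best = max(score(c) for c in pool)
    tied = sorted((c for c in pool if score(c) == best), key=lambda c: c.node)
    # Round-robin tie-break: the caller advances rr_offset between picks.
    return tied[rr_offset % len(tied)].node


nodes = [
    Candidate('node-a', True, frozenset(), 3, 5),
    Candidate('node-b', True, frozenset({'vlan:42'}), 1, 0),
    Candidate('node-c', False, frozenset({'vlan:42'}), 9, 0),
]

unconstrained = pick_carrier(nodes)
vlan_bound = pick_carrier(nodes, required=frozenset({'vlan:42'}))
impossible = pick_carrier(nodes, required=frozenset({'vlan:7'}))
```

Here `required` is the hook for future per-network constraints: adding VLAN trunking would mean populating `capabilities` and `required`, not changing the filter.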
Execution
Provisional, to be re-cut after phase 0.
| Phase | Plan | Status |
|---|---|---|
| 0. Research and decisions document | PLAN-network-carrier-model-phase-00-decisions.md | Not started |
| 1. Carrier pool configuration and node-capability declaration | PLAN-network-carrier-model-phase-01-pool.md | Not started |
| 2. Per-network carrier lease primitive | PLAN-network-carrier-model-phase-02-lease.md | Not started |
| 3. Carrier renderer process | PLAN-network-carrier-model-phase-03-renderer.md | Not started |
| 4. SNAT and floating-IP programming via the renderer | PLAN-network-carrier-model-phase-04-snat.md | Not started |
| 5. DNAT'd service ports via the renderer (carrier-side hookup) | PLAN-network-carrier-model-phase-05-service-ports.md | Not started |
| 6. DHCP state persisted and rebuilt on carrier change | PLAN-network-carrier-model-phase-06-dhcp.md | Not started |
| 7. DNS state via the renderer | PLAN-network-carrier-model-phase-07-dns.md | Not started |
| 8. BGP advertisement mode | PLAN-network-carrier-model-phase-08-bgp.md | Not started |
| 9. L2 / GARP advertisement mode | PLAN-network-carrier-model-phase-09-l2.md | Not started |
| 10. Migration from singleton network node | PLAN-network-carrier-model-phase-10-migration.md | Not started |
| 11. Operator documentation for VIP failover and pool sizing | PLAN-network-carrier-model-phase-11-docs.md | Not started |
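Phase 6 in the table above depends on being able to regenerate dnsmasq's lease file from persisted rows if phase 0 chooses to store leases in MariaDB. The sketch below assumes the five-field IPv4 line format dnsmasq writes (expiry epoch, MAC, IP, hostname, client ID, with `*` for unknown values); that should be verified against the deployed dnsmasq version. `LeaseRow` and `render_lease_file` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LeaseRow:
    """Hypothetical persisted lease; field names are illustrative."""
    expires_at: int                  # epoch seconds
    mac: str
    ip: str
    hostname: Optional[str] = None
    client_id: Optional[str] = None


def render_lease_file(rows, now):
    """Render live leases as lease-file text for the new carrier."""
    lines = []
    # Sort by IP string only to make the output deterministic.
    for row in sorted(rows, key=lambda r: r.ip):
        if row.expires_at <= now:
            continue  # do not resurrect leases that have already expired
        lines.append(' '.join([
            str(row.expires_at),
            row.mac,
            row.ip,
            row.hostname or '*',
            row.client_id or '*',
        ]))
    return '\n'.join(lines) + ('\n' if lines else '')


rows = [
    LeaseRow(2000, '52:54:00:aa:bb:cc', '10.0.0.5', 'web'),
    LeaseRow(500, '52:54:00:aa:bb:dd', '10.0.0.6'),
]
lease_file = render_lease_file(rows, now=1000)
```

The new carrier would write this text before starting dnsmasq for the network, so a renewing client is offered the address it already holds.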
Dependencies on other plans
- Hard dependency on PLAN-generic-allocator. The carrier lease, the floating-IP allocation, and the DNAT'd service ports all use the generic allocator primitive.
- Hard dependency on PLAN-network-service-ports. The carrier reprograms DNAT'd service ports on lease acquire; this plan does not invent its own DNAT mechanism.
- Tight coherence with PLAN-network-facade and PLAN-replace-exec-with-netlink. The carrier is the natural home of single-mutator network operations and of netlink-driven privileged programming. Either both land before this plan does, or this plan accepts that the carrier renderer is the first non-trivial consumer of those plans' interfaces and reviews the interface fit during its own phase 0.
- Parallel-track to PLAN-remove-primary. Both reduce named-singleton roles, but they target different singletons (database vs network) with different mechanisms (election vs lease-and-render). Either order is fine; neither blocks the other.
- Compatible with PLAN-embrace-tls. Carrier-to-carrier and carrier-to-sf-database traffic flows over the same channels that mTLS lands across; this plan does not interact with TLS material.
Agent guidance
Execution model
All implementation work is done by sub-agents, never in the
management session. The workflow mirrors
PLAN-remove-primary.md and the other placeholder plans.
This is a deeply architectural plan touching production network state, failover semantics, and live client traffic. Every implementation phase should default to opus at high effort. The advertisement-mode phases (8 and 9) and the DHCP-persistence phase (6) are the most subtle and deserve the most careful review.
Planning effort
The master plan itself is high effort despite being a placeholder, because the open-questions list is the load-bearing part of the plan and getting any of them wrong shapes a lot of subsequent work. Phase 0 (research and decisions) is the highest-effort decisions document in this thread.
Step-level guidance
Each phase plan should include a step table in the same
format as PLAN-remove-primary.md, with effort, model,
isolation, and brief columns.
Management session review checklist
Standard checklist from PLAN-remove-primary.md, plus:
- Carrier failover is exercised by an end-to-end test that kills the current carrier and confirms the new carrier reprograms all of the network's state from data. Not stubbed.
- DHCP lease survival across carrier change is exercised by a test that has a live client lease before the failover and confirms the lease survives it.
- At least one advertisement mode (BGP or L2) is exercised in deploy/cluster_ci end-to-end, with a real upstream / switch in the path.
- Smearing is exercised: a multi-network setup confirms different networks end up on different carriers and that the locality-scoring heuristic behaves as designed.
- The eligibility-filter design is reviewed against the future VLAN-trunking constraint to confirm no rewrite is forced by it.
- Single-node-cluster behaviour is exercised to confirm no unnecessary lease churn or false failover.
- Object cleanup (hard_delete()) on a network releases all its carrier-side state cleanly.
- mypy coverage for the new carrier renderer is good from day one; this is new code, no excuse for thinly-typed surfaces.
Administration and logistics
Success criteria
We will know this plan has been successfully implemented when the following statements are true:
- No virtual network is bound to a specific configured "network node." Every network leases its carrier role from an eligible pool.
- Carrier failure causes the carrier's leased networks to fail over to other pool members within the documented failover-time bound, without operator intervention.
- Total egress bandwidth scales with the carrier pool, not with one node.
- All network-node-resident state — egress IP, SNAT, DNAT'd service ports, DHCP leases, DNS records — lives in MariaDB and is rendered into kernel state by the carrier.
- At least one advertisement mode (BGP or L2) ships in v1 and is verified end-to-end in deploy/cluster_ci.
- Locality-aware placement demonstrably prefers carriers where the network's instances run, with documented scoring inputs.
- The eligibility-filter design accepts future per-network hard constraints (e.g. VLAN-trunking) without requiring a rewrite.
- Migration from the singleton network-node configuration to a smeared carrier pool is exercised and documented.
- pre-commit run --all-files passes.
Future work
- VLAN trunking from external networks into virtual networks. The motivating future feature that drives the eligibility-filter design here. Wants its own plan once the carrier model is in place; at minimum it adds a node_network_capabilities table or equivalent, a carrier-side configuration for the trunk, and the per-network eligibility constraint in the placement query.
- Carrier rebalance. Once smearing is in place, operator-triggered rebalance ("re-pick carriers for all networks using current locality / load data") is a natural follow-on. Out of scope here.
- Per-carrier metrics. Bandwidth, lease count, lease churn rate, advertisement health. Probably falls out of the OpenTelemetry thread.
- Cross-carrier conntrack sync for stateful failover. Out of scope here, would land in a separate plan with a hard "do we actually want this" decision up front (conntrack sync is operationally heavy).
- BGP route filters and AS-path policies. If embedded BGP ships, operator-controllable route filtering is a natural follow-on. Out of scope here.
Bugs fixed during this work
This section should list any bugs we encounter and fix during development.
Documentation index maintenance
When creating a new master plan from this template, update
the following files in docs/plans/:
- index.md: add a row to the Plan Status table.
- order.yml: add an entry for the new master plan.
Back brief
Before executing any step of this plan, please back brief the operator as to your understanding of the plan and how the work you intend to do aligns with that plan.