Network operations facade and queue-only mutation
Prompt
Before responding to questions or discussion points in this document, explore the shakenfist codebase thoroughly. Read relevant source files, understand existing patterns (object lifecycle, state machines, MariaDB storage via the three-layer direct/gRPC/public pattern, Pydantic schemas, daemon architecture, operation queue system, event logging), and ground your answers in what the code actually does today. Do not speculate about the codebase when you could read it instead. Where a question touches on external concepts (KVM/libvirt, VXLAN networking, MariaDB/Galera, gRPC/protobuf), research as needed to give a confident answer. Flag any uncertainty explicitly rather than guessing.
All planning documents should go into docs/plans/.
Consult ARCHITECTURE.md for the system architecture
overview, object types, and daemon structure. Consult
CLAUDE.md for build commands, project conventions, and
database access patterns. Consult GOALS.md for current
development priorities. Key references inside the repo
include shakenfist/network/network.py (the Network class
under discussion), shakenfist/operations/net_op.py (the
worker dispatch), shakenfist/daemons/network/workitem.py
(the single-threaded net-worker),
shakenfist/daemons/network/maintain.py (the parallel
maintain reconciliation thread),
shakenfist/daemons/privexec/main.py (the
privileged-execution daemon that actually mutates kernel
network state), shakenfist/mariadb.py (the existing
three-layer database access pattern, which is the architectural
precedent for the proposed change), and
shakenfist/external_api/base.py (the
redirect_to_network_node decorator that this plan removes).
When we get to detailed planning, I prefer a separate plan
file per detailed phase. These separate files should be named
for the master plan, in the same directory as the master
plan, and simply have -phase-NN-descriptive appended before
the .md file extension. Tracking of these sub-phases should
be done via the table in the Execution section below.
I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit. Each commit should be self-contained: it should build, pass tests, and have a clear commit message explaining what changed and why.
Situation
The Network class in shakenfist/network/network.py carries
two distinct responsibilities in one type:
- The intent API that callers across the cluster reach for when they want a network to change state: every daemon and REST handler calls n.create_on_hypervisor(), n.ensure_mesh(), n.add_floating_ip(...), etc.
- The worker API that actually mutates host network state on a given node: the same methods invoke util_concurrency.create_vxlan_interface, util_concurrency.ensure_vxlan_mesh, util_concurrency.add_floating_ip, etc., which talk to the local sf-privexec daemon and run ip / bridge / iptables commands.
Because both responsibilities live on the same class, any
node-local caller can invoke the worker API directly without
going through the queue. The sf-net daemon's net-worker
job is single-threaded by design and processes one
net_op work item at a time
(daemons/network/workitem.py:31 carries a comment from
mikal to that effect), so two queued net-ops on one node
are correctly serialised. But the same node typically runs
sf-net's separate maintain reconciliation thread plus
sf-queues, sf-api, and the instance lifecycle paths in
instance.py, and each of those can call n.ensure_mesh()
(and friends) directly without coordinating with the
net-worker. Today there are five call sites for
ensure_mesh across instance.py,
operations/net_op.py, operations/node_inst_netdesc_op.py,
daemons/queues/startup_tasks.py and
daemons/network/maintain.py, plus similar fan-out for
floating-IP and route operations.
The same bypass shape exists at the API layer. Endpoints
that need to run on the network node today wear the
redirect_to_network_node decorator
(external_api/base.py:336), which makes a synchronous HTTP
request from the receiving API server to the network node's
gunicorn on port 13000. This is used in
external_api/network.py (three sites) and
external_api/interface.py (one site). It is the same class
of bypass: instead of routing the intent through the
operation queue that we already have for cluster work, the
API server pretends to be on the network node by proxying.
This produced a concrete CI failure on PR #3182's merge-queue
run (GitHub Actions run 25899623744, "Guests" job). Two
threads inside sf-privexec ran _ensure_mesh for the same
VXLAN interface within 2 ms of each other, both observed the
same stale FDB entry from their initial bridge fdb show,
and both issued bridge fdb del for it. The kernel served
one and rejected the other with RTNETLINK answers: No such
file or directory; the losing privexec request returned
FAILURE, network.ensure_mesh raised EnsureMeshFailed, and
the resulting ERROR sf-net/ERROR sf-privexec/Traceback
lines tripped the post-test stable-log gate. The functional
tests themselves all passed — only the log scrape caught the
race.
A short-term fix is being landed on the stability branch:
wrap the six unlocked host-mutating methods on Network
(ensure_mesh, add_floating_ip, remove_floating_ip,
route_address, unroute_address, remove_nat) with the
existing per-network NodeLock (the same primitive used by
the already-locked methods). That fix is necessary but
addresses the symptom, not the structural problem.
The architectural precedent for the fix exists in this same
codebase. shakenfist/mariadb.py (and the sf-database
daemon at daemons/database/main.py) enforces that every
daemon, including local ones, must reach MariaDB through the
gRPC database service. Only sf-database has MARIADB_HOST
set; all other daemons route through gRPC. This eliminates
direct-vs-indirect access skew, gives the database service a
single point to enforce metrics, throttling, and connection
pooling, and lets the daemon evolve internally without
touching every caller. The proposal in this plan is to apply
the same shape to network operations: one daemon owns the
host-state mutations, and every other caller addresses it
through a typed API that enqueues intents.
Mission and problem statement
Restructure Network so that all host-state mutations are
performed by a single owner (the sf-net net-worker on each
node), and every other caller — including local daemons on
the same node and the API server — interacts via the public
Network class whose host-mutating methods enqueue intents
through the operation queue. Sequenced callers declare their
ordering via the existing ClusterOperation.depends_on
machinery; callers that genuinely need to block until a chain
completes use a small generic poll helper rather than each
method having its own wait shim. After the change:
- Network remains the class every caller already holds. Its host-mutating methods (ensure_mesh, add_floating_ip, create_on_hypervisor, update_dnsmasq, etc.) keep their existing names but their bodies change from "do the work inline" to "build a NetOp task for this intent, enqueue it on the appropriate queue, return the op handle". They do not wait. Their return type changes from None (with raised exceptions on failure) to the enqueued op (with errors surfaced through the op's terminal state).
- A new BridgedVXLanNetwork(network) class, instantiated only inside the net-worker workitem dispatcher, carries the actual _apply_* host-state mutation methods. All calls to util_concurrency.create_vxlan_interface, util_concurrency.ensure_vxlan_mesh, util_concurrency.add_floating_ip, and friends move onto this class.
- Re-entrancy is structurally impossible. The only path that bypasses the queue is constructing a BridgedVXLanNetwork instance, and that constructor is only called inside the workitem dispatcher. A queued op that needs to chain host changes (e.g. _network_deploy doing create_on_network_node then ensure_mesh) performs those chained changes on its BridgedVXLanNetwork instance, so it never re-enqueues and never deadlocks the single-threaded net-worker. Explicitly not used: thread-local "am I inside the queue?" context flags. Detection by inheritance: external callers hold a Network, in-worker callers hold a BridgedVXLanNetwork.
- There is no path by which sf-queues, sf-api, sf-cleaner, the instance lifecycle code, or the maintain reconciliation thread can directly call util_concurrency.ensure_vxlan_mesh, util_concurrency.add_floating_ip, or any other host-mutating privexec helper for a network.
- The single-threaded net-worker is the only mutator and therefore naturally serialises all activity for a network on a node; NodeLock becomes redundant for these methods and the locks added by the stability-branch fix can be removed.
- The maintain reconciliation thread in sf-net no longer calls the same methods as the net-worker; instead it enqueues net_ops and lets the net-worker do the work.
- The redirect_to_network_node decorator and all four of its applications are removed. API handlers no longer need to be on the network node, because the operation they enqueue runs wherever the net-worker that owns the queue lives. The affected endpoints change contract from "block until the host change is done" to "return 202 + terminal op uuid and the client polls"; the client (shakenfist/client-python) is updated to match.
- Two new REST endpoints expose ClusterOperation chains: GET /clusteroperations/<uuid>/chain returns the transitive depends_on closure; GET /clusteroperations?target_object_type=&target_uuid= lists ops targeting an object. Both are namespace-scoped at the SQL layer using the existing cluster_operation_targets table.
- The "queue-jumping" fairness concern (a node that's also the network node, or just a node running both sf-net and sf-queues, can bypass the work queue) disappears because no caller has a bypass to take.
- Errors crossing the queue boundary are serialised as a structured ErrorReport (code, message, details, origin_class, traceback) rather than as rehydrated Python exception classes. In-worker exception handling is unchanged; the boundary translation happens in the worker dispatcher's outer except. REST endpoints render ErrorReport to HTTP via a single mapping. This matches the pattern every mature RPC framework (gRPC, JSON-RPC, Erlang OTP) has converged on, and explicitly avoids the oslo.messaging exception-rehydration model that produced years of version-skew pain in OpenStack. See open question 3.
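The split described above can be sketched in a few lines. This is a minimal illustration, not the shakenfist implementation: the constructor arguments, the `enqueue` callable, the dict-shaped op, the `host_log` list and the `dispatch` function are all hypothetical stand-ins for the real queue and privexec plumbing.

```python
class Network:
    """Public facade: host-mutating methods only enqueue an intent."""

    def __init__(self, uuid, enqueue):
        self.uuid = uuid
        # Hypothetical callable that stores an op and returns its handle.
        self._enqueue = enqueue

    def ensure_mesh(self):
        # No host work happens here; the caller gets the op handle back.
        return self._enqueue({'task': 'ensure_mesh', 'network': self.uuid})


class BridgedVXLanNetwork:
    """Worker-side wrapper: the only type carrying _apply_* methods."""

    def __init__(self, network, host_log):
        self.network = network
        # Stand-in for real host mutation through sf-privexec.
        self.host_log = host_log

    def _apply_ensure_mesh(self):
        self.host_log.append(('ensure_mesh', self.network.uuid))


def dispatch(op, networks, host_log):
    """Hypothetical dispatcher: the one place the worker class is built."""
    worker = BridgedVXLanNetwork(networks[op['network']], host_log)
    getattr(worker, '_apply_' + op['task'])()
```

Because `Network` has no `_apply_*` methods at all, a caller outside the dispatcher has nothing to call that would touch the host.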
Scope boundaries:
- In scope: every Network method that currently invokes util_concurrency.* host-mutating helpers, plus update_dnsmasq, remove_dnsmasq, remove_dhcp_lease, update_dns_entry, remove_dns_entry (which mutate the dnsmasq process state for the network).
- In scope: the maintain thread in sf-net. Its direct calls to n.ensure_mesh(), n.create_on_hypervisor(), n.add_floating_ip(...), n.route_address(...) are precisely the bypasses we are closing.
- In scope: a per-node queue lane that every node's local sf-net services for its own host. The existing networknode-* queues are network-node-only, so we cannot route create_on_hypervisor or ensure_mesh (which are per-hypervisor mutations) through them. See open question 9.
- In scope: removal of the redirect_to_network_node decorator and its four call sites in external_api/network.py and external_api/interface.py. After the facade lands, none of those handlers need to be on the network node.
- In scope: preserving event log fidelity through the refactor. The existing add_event calls inside host-mutating methods are part of the audit trail we want to keep, even when the body of the method moves to BridgedVXLanNetwork. Either the enqueueing Network shim or the executing BridgedVXLanNetwork method must emit equivalent events; double-emission must be avoided.
- Out of scope: the NetworkInterface and IPAM classes. Those have their own concerns (IP reservation, interface attach/detach) that overlap with networks but are not the same problem. They may benefit from the same pattern later, but each has its own audit work.
- Out of scope: the existing nodelock-based fix on the stability branch. That fix stays in place until the facade refactor lands and is proven; it can be removed in the final phase as cleanup.
- Out of scope: the other API redirect decorators (redirect_instance_request, redirect_to_eventlog_node, redirect_upload_request). They are structurally the same problem as redirect_to_network_node and the queue-based pattern established here is the future direction for them too, but each has its own caller fan-out and is its own audit. See future work.
- Out of scope: generalised event-based completion notification (subscribing to terminal-state events for a cluster operation, webhook notifications on object events, etc.). Polling is sufficient for this plan. See future work.
- Out of scope: capability advertisement / client-driven preference for the new 202+poll contract. Since we own both server and client and have no other consumers, we flip the contract directly. See future work if external API consumers ever appear.
- Out of scope: lowering the global BaseClusterOperation.defer(delay=15) default. The refactor passes the tight defer delay through the new chain-builder helper instead, leaving unrelated existing defer() callers undisturbed. See open question 11.
- Out of scope: saga / compensator semantics for chains. This plan keeps the existing convergence model: when a chain step errors, the dispatcher already aborts the remaining dependents (workitem.py:84-101), partial state is left as-is, and maintain plus the cleaner are responsible for detecting and correcting drift on subsequent passes. We are not introducing paired forward/undo operations or running chains backwards on failure. Reasons: compensators carry their own state (they need to know what the forward step observed, not just what it wrote), compensators can themselves fail (requiring dead-letter / halt machinery), idempotency requirements stack on both forward and reverse paths, and compensation is externally observable in confusing ways. The investment is better spent on Phase 6 making convergence robust. See future work for the lighter alternative.
- Out of scope: changing how net_ops are queued in terms of priority semantics, or how the cluster decides which node owns a network. Queue prioritisation and network-node election are unchanged; only the set of available queues grows.
Open questions
- How sequenced callers express ordering. Today n.create_on_hypervisor() returns when the host change is complete; many callers depend on that (e.g. node_inst_netdesc_op.py:243 calls n.create_on_hypervisor() then n.ensure_mesh() then n.update_dnsmasq() in sequence, relying on each having finished before the next runs). Earlier framings of this question asked "what wait primitive do we provide" and considered polling, MariaDB named locks, in-process events, and gRPC streaming. That framing was wrong: the existing ClusterOperation machinery already has a depends_on field whose semantics are exactly what we need (daemons/network/workitem.py:71-109). The right answer is to use it.
Model. Network.ensure_mesh() becomes "build a
NetOp task for ensure_mesh, enqueue it on the
appropriate queue, return the op handle". It does not
wait. A sequenced caller that today does
create_on_hypervisor → ensure_mesh → update_dnsmasq
becomes a caller that enqueues three NetOps, with each
declaring depends_on=[previous], then returns the
handle of the terminal op. Callers that genuinely need a
synchronous wait (CLI tools, some test code, a stubborn
REST endpoint) call a small poll_until_terminal(op)
helper that polls mariadb.get_cluster_operation on a
short interval bounded by ASYNC_OP_TIMEOUT. This is
the same shape as the existing
_await_instance_operations_complete in
deploy/shakenfist_ci/base.py:395, just generalised.
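The model above can be sketched as follows. This is an in-memory illustration only: `FakeOp`, `enqueue_chain`, the plain-dict store and the state names are assumptions made for the example, and the real `poll_until_terminal` would read through mariadb.get_cluster_operation and be bounded by ASYNC_OP_TIMEOUT.

```python
import time
import uuid
from dataclasses import dataclass, field

# Assumed state names; the real ClusterOperation states may differ.
TERMINAL_STATES = {'complete', 'error', 'abort', 'deleted'}


@dataclass
class FakeOp:
    """Hypothetical stand-in for a ClusterOperation record."""
    task: str
    depends_on: list = field(default_factory=list)
    state: str = 'queued'
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def enqueue_chain(store, tasks):
    """Enqueue tasks so each depends on the previous; return the last op."""
    previous = None
    for task in tasks:
        deps = [previous.uuid] if previous else []
        op = FakeOp(task, depends_on=deps)
        store[op.uuid] = op
        previous = op
    return previous


def poll_until_terminal(get_op, op_uuid, timeout=5.0, interval=0.01):
    """Poll get_op(op_uuid) until it is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        op = get_op(op_uuid)
        if op.state in TERMINAL_STATES:
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(f'op {op_uuid} is still {op.state}')
        time.sleep(interval)
```

A sequenced caller would call `enqueue_chain(store, ['create_on_hypervisor', 'ensure_mesh', 'update_dnsmasq'])` and hand the returned terminal op's uuid to whoever needs to wait; only that party polls.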
Why this is better than wait-helper-as-default.
1. No new wait primitive — we use the dependency
machinery the dispatcher already implements.
2. REST handlers can return early with the terminal op's
uuid and the client polls — no gunicorn worker held
for the duration of the host change.
3. The re-entrancy framing simplifies. Network methods
build and enqueue; BridgedVXLanNetwork methods do
the host work. There is no "inside the worker we go
synchronous, outside the worker we go async" split —
everything outside the dispatcher is async by
default.
4. Errors surface through the existing op terminal-state
machinery; the dispatcher already aborts dependents
whose deps end in ERROR/DELETED/ABORT
(workitem.py:85-101). No new error-propagation
plumbing.
Implementation note on errors. In-worker exception
handling is unchanged: BridgedVXLanNetwork._apply_*
still raises the typed exceptions, and the workitem
dispatcher's existing except EnsureMeshFailed: and
sibling blocks in net_op.py:84-94 still catch them
in the same process. The boundary translation happens
in the dispatcher's outer except: anything that
escapes is converted via ErrorReport.from_exception
and persisted on the op record. External callers
using poll_until_terminal read op.error_report
and either pattern-match on report.code or call
op.raise_for_error() to surface the failure as a
single generic wrapper. See open question 3 for the
full ErrorReport model.
Implementation note on idempotency. Chains can be
re-attempted (e.g. a transient defer that exceeds the
timeout, an operator-triggered retry). All _apply_*
methods on BridgedVXLanNetwork must remain
re-entrancy-safe at the host level. ensure_mesh is
already idempotent by design (it computes desired state
and converges). add_floating_ip and friends must be
audited as part of their migration phase.
Implementation note on the defer delay. See open question 11.
- Naming and class shape. Resolved. Network remains the public class with the same call surface every caller already uses (no churn at call sites). A new BridgedVXLanNetwork(network) class, instantiated only inside the workitem dispatcher, holds the _apply_* worker methods. The name forecloses nothing about future alternate implementations (e.g. native L2 underlay, OVS, etc.) and names what the current implementation actually is. There is no separate NetworkFacade type; the facade is Network.
- Error propagation at the queue boundary. What we are building is an in-cluster RPC mechanism, and every mature RPC framework (gRPC, JSON-RPC, Twirp/Connect, Erlang's {'EXIT', Pid, Reason}) has converged on the same pattern for errors: errors are data, never rehydrated exceptions, at the boundary. OpenStack's oslo.messaging is the cautionary tale for the alternative; it tried to rehydrate exception classes from a registry and produced years of version-skew bugs, serialisation fragility, and security concerns about reconstituting arbitrary classes from strings. We will not repeat that experiment.
Resolved approach: ErrorReport. A small typed
record carried on the op record describing a failure:
class ErrorReport:
    code: str          # e.g. 'network.ensure_mesh.failed'
    message: str       # human-readable
    details: dict      # structured context
    origin_class: str  # e.g. 'shakenfist.exceptions.EnsureMeshFailed'
    traceback: str     # stored for operator debugging
ErrorReport lives in
shakenfist/operations/error_report.py (not in
shakenfist/network/) because it is fundamentally a
cluster-operation concept and other subsystems will
adopt it later. It has at minimum:
* ErrorReport.from_exception(exc) -> ErrorReport: called by the worker dispatcher when an _apply_* method raises. A small registry maps known exception classes to stable codes; unknown exceptions become code='internal.unknown' with origin_class set so operators still see what happened.
* ErrorReport.to_http() -> (status_code, body): used by REST endpoints to render a failed op into a clean HTTP response without leaking tracebacks. The mapping from code to HTTP status lives in one place.
* On the op object, op.error_report returns the report (or None if not failed).
* op.raise_for_error() raises a single generic NetworkOperationFailed carrying the report, for the rare caller that wants exception-flow control rather than pattern-matching on the code.
Crucial cleavage point. In-process exception
handling is unchanged. _apply_ensure_mesh()
still raises EnsureMeshFailed; the worker dispatcher
still catches it in the same process; the existing
except EnsureMeshFailed: and except DeadNetwork:
blocks in net_op.py:84-94 and elsewhere keep
working unchanged because they are in-worker and
never crossed the queue boundary. The only change is
that when the dispatcher's outer except catches
anything that did escape, it converts to ErrorReport
before persisting, rather than storing freeform text.
Initial code namespace (registered in
from_exception for v1):
* network.ensure_mesh.failed ← EnsureMeshFailed
* network.dead ← DeadNetwork
* network.create_vxlan.failed ← CreateVXLANInterfaceFailed
* network.floating.assign_failed ← CannotAssignFloatingGateway
* internal.unknown ← anything else (with origin_class preserving the actual exception class)
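A registry-backed from_exception and a single to_http mapping might look like the sketch below. The two exception classes are local stand-ins for the shakenfist ones, the registry matches on exact type only, and the HTTP status values (410 for a dead network, 500 otherwise) are assumptions chosen for the example rather than decisions made in this plan.

```python
import traceback as tb_module
from dataclasses import dataclass, field


class EnsureMeshFailed(Exception):
    """Local stand-in for the shakenfist exception of the same name."""


class DeadNetwork(Exception):
    """Local stand-in for the shakenfist exception of the same name."""


# Registry of known exception classes to stable codes (exact type match).
_CODES = {
    EnsureMeshFailed: 'network.ensure_mesh.failed',
    DeadNetwork: 'network.dead',
}

# Assumed code-to-status mapping; anything unlisted renders as 500.
_HTTP_STATUS = {'network.dead': 410}


@dataclass
class ErrorReport:
    code: str
    message: str
    details: dict = field(default_factory=dict)
    origin_class: str = ''
    traceback: str = ''

    @classmethod
    def from_exception(cls, exc):
        exc_type = type(exc)
        return cls(
            code=_CODES.get(exc_type, 'internal.unknown'),
            message=str(exc),
            origin_class=f'{exc_type.__module__}.{exc_type.__qualname__}',
            traceback=''.join(tb_module.format_exception(
                exc_type, exc, exc.__traceback__)),
        )

    def to_http(self):
        # The traceback is deliberately left out of the response body.
        body = {'code': self.code, 'message': self.message,
                'details': self.details}
        return _HTTP_STATUS.get(self.code, 500), body
```

Keeping the traceback on the record but out of to_http() is what lets operators debug from the op while REST clients only ever see structured JSON.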
New codes are added per-subsystem as needed. The
namespace is hierarchical so operators can grep /
log-scrape (e.g. code starts with 'network.').
What we gain:
* No registry of importable classes on the consumer
side. Wire format is strings + dict; any consumer
in any language can pattern-match.
* Forward compatibility: new codes do not break old
consumers.
* REST clients see structured JSON, not stack traces.
* No oslo.messaging-style version-skew exception
class problem.
* Composes with future cancellation work: a
cancelled op gets code='cancelled' and downstream
handling already knows what to do with codes it
recognises.
What we give up. External callers that today
could except EnsureMeshFailed: to handle one
specific failure mode become callers that either (a)
pattern-match on report.code after polling, or (b)
call op.raise_for_error() and catch the single
generic wrapper. Today's set of such callers is
small; the trade-off is sound and matches what every
other RPC framework arrived at independently.
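The two caller-side options, (a) and (b), can be illustrated with a small sketch. `FakeOp` and the cut-down `ErrorReport` here are hypothetical stand-ins, and the strings returned by `describe_outcome` are invented for the example; only the names error_report, raise_for_error and NetworkOperationFailed come from the plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorReport:
    """Cut-down stand-in carrying only the fields this example needs."""
    code: str
    message: str


class NetworkOperationFailed(Exception):
    """Single generic wrapper that carries the report."""

    def __init__(self, report):
        super().__init__(f'{report.code}: {report.message}')
        self.report = report


@dataclass
class FakeOp:
    """Hypothetical stand-in for a polled, terminal op."""
    state: str
    error_report: Optional[ErrorReport] = None

    def raise_for_error(self):
        if self.error_report is not None:
            raise NetworkOperationFailed(self.error_report)


def describe_outcome(op):
    """Option (a): pattern-match on the stable code."""
    report = op.error_report
    if report is None:
        return 'ok'
    if report.code == 'network.dead':
        return 'network is gone; nothing to retry'
    if report.code.startswith('network.'):
        return 'network failure; retry may help'
    return 'unexpected failure'
```

Option (b) is the `op.raise_for_error()` call; a caller that catches NetworkOperationFailed can still inspect `exc.report.code` if it later needs to distinguish failure modes.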
- What happens to the Network class itself. Resolved together with open question 2. Network keeps its data carrier role (object_states, attributes, queries, subst_dict(), is_okay(), is_dead()) and its existing public method names. The bodies of host-mutating methods change from "do the work inline" to "build a net_op task, enqueue it on the appropriate queue, poll for terminal state, re-raise on error". A BridgedVXLanNetwork(network) wrapper class carries the _apply_* methods that actually run on the host; it is instantiated only by the net-worker dispatcher.
- Migration order. Resolved: per-method migration, one method (or a small tightly-coupled group) per phase. Phases 2–5 migrated each method cleanly in-place without leaving temporary feature flags behind. The per-method migration flag approach discussed during planning was not needed in practice: each phase replaced the method body directly, leaving the public Network surface unchanged while the implementation moved to BridgedVXLanNetwork. The instance-start path was migrated last (Phase 5) as planned.
Safety: per-phase PR runs cluster_ci. The "leaves
the cluster in a runnable state, passes CI" rule for
each phase means each phase's PR must pass not just
pre-commit run --all-files and stestr but also the
functional cluster_ci smoke suite. Migration phases
that change observable behaviour (e.g. enqueue+poll
semantics replacing direct calls) are precisely the
places where unit tests will pass but functional tests
may catch a regression. Phase plans treated cluster_ci
as a required gate, not a courtesy run.
- Behaviour of maintain.py. The reconciliation thread today walks all networks every interval and re-applies host state where it has drifted. Under the facade it instead enqueues reconciliation ops. The risks are (a) maintain enqueues a fresh reconciliation each pass even when the previous one is still in flight, piling up duplicates; (b) a permanently-broken network ("operation of death") loops forever, consuming queue cycles; (c) maintain's observation can be stale by the time it goes to enqueue.
Resolved approach: per-network gating at enqueue time, plus three safety guards.
The reconciliation thread still runs on a fixed interval (current default 60 s; configurable). Each pass:
* Queue-depth safety guard. Before doing any work, the pass queries get_work_queue_length() on the per-node and network-node queue families. If queue depth exceeds a configurable threshold, the pass is skipped and an event is emitted. This handles the genuine pathology of "queue is way backed up; do not pile on".
* Discovery. Walk all networks observed on the host (and the network node's set on the elected network node), compare observed vs desired state. Same logic as today's is_okay / "Recreating not okay network on hypervisor" code paths; we are not weakening the drift detection.
* Per-network gating. For each network with detected drift, call has_pending_cluster_operation(target=network, op_type='net_op') (the history-aware "any in-flight op?" query backed by cluster_operation_targets). If yes, skip enqueueing for this network this pass.
* Cooldown. If no in-flight op exists, look up the most recent terminal reconciliation op for this network. If it ended in ERROR within the last MAINTAIN_RECONCILE_COOLDOWN_SECONDS (default 60 s), skip enqueueing and let the previous failure breathe before retrying.
* Circuit breaker. If the last K terminal reconciliations on this network all ended in ERROR (default MAINTAIN_RECONCILE_CIRCUIT_K = 5), skip enqueueing and emit a prominent event ("network has failed reconciliation K times in a row; quiesced pending operator attention"). The next maintain pass naturally re-checks; once the operator does something that lets a fresh reconciliation succeed, the circuit closes.
* Enqueue. Otherwise build the reconciliation chain (using the same BridgedVXLanNetwork-driven logic as user-facing reconciliation, just with background priority) and enqueue.
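The gating decision in the steps above reduces to a small pure function. This sketch is illustrative only: the function name, the reason strings, the newest-first `history` list of (state, finished_at) tuples and the queue-depth default of 100 are assumptions, and for compactness the per-pass queue-depth guard is folded into the same per-network call.

```python
def should_enqueue_reconcile(queue_depth, in_flight, history, now,
                             max_queue_depth=100, cooldown_seconds=60,
                             circuit_k=5):
    """Decide whether maintain should enqueue a reconciliation chain.

    history is a newest-first list of (state, finished_at) tuples for
    terminal reconciliation ops on this network. Returns (enqueue, reason).
    """
    # Queue-depth safety guard: do not pile on to a backed-up queue.
    if queue_depth > max_queue_depth:
        return False, 'queue-depth'

    # Per-network gating: an op is already in flight for this network.
    if in_flight:
        return False, 'in-flight'

    # Cooldown: the most recent attempt failed very recently.
    if history:
        last_state, last_finished = history[0]
        if last_state == 'error' and now - last_finished < cooldown_seconds:
            return False, 'cooldown'

    # Circuit breaker: the last K terminal reconciliations all failed.
    recent = history[:circuit_k]
    if len(recent) == circuit_k and all(s == 'error' for s, _ in recent):
        return False, 'circuit-open'

    return True, 'enqueue'
```

Returning the reason alongside the decision gives the caller something concrete to put in the emitted event when a pass or a network is skipped.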
This is one cluster_operation_targets query and one
get_work_queue_length query per pass, both indexed.
No new schema.
Why maintain is not itself a ClusterOperation in
this plan. A natural-looking alternative is to make
maintain a CO that enqueues its successor with
runs_after=[spawned_reconciliations] and
delay=60s. We considered this and rejected it for
two concrete reasons:
* depends_on aborts the dependent if a dep ends in
ERROR (workitem.py:84-101), which would
permanently kill the maintain recurrence on the
first failed reconciliation. runs_after doesn't
abort, but it also has no max-wait semantics — a
stuck dep defers the dependent forever.
* The CO model assumes terminal states; maintain is
an indefinite recurring scan. Making it a CO means
re-enqueue-self-on-completion, which is
"reinventing cron with extra steps".
The proper home for "maintain is a recurring CO" is
the future PLAN-recurring-operations.md, which
introduces a typed RecurringOperation object,
addresses the dispatcher gaps named above
(max-wait runs_after, continue-on-failure
recurrence), and absorbs scheduled_tasks.py and
maintain.py as initial consumers. When that lands,
daemons/network/maintain.py disappears and the
gating logic above moves into a
network_maintain_pass op triggered by an internal
RecurringOperation. Until then, the thread-plus-gating
approach above is sufficient and avoids
pre-engineering the dispatcher changes that the
recurrence framework will need to do properly.
- Performance and latency. The instance-start path currently makes ~3-5 host changes back-to-back in the POST /instances handler. Under the dependency-chain model (open question 1), the handler enqueues a chain of ~3-5 NetOps and returns the terminal op's uuid; the chain executes asynchronously and the client polls. The absolute work is unchanged; the question is whether the chain-defer latency (a dep finishes, the dependent has to be re-dequeued before it observes that) creates a visible floor.
The dispatcher's current 15s defer delay
(baseoperation.py:212-240) is too coarse for chained
ops. See open question 11 for tuning.
The existing redirect_to_network_node proxy is itself
a blocking synchronous HTTP round-trip that already
occupies a gunicorn worker for the full host change
(just less obviously), so the move to 202+poll for the
four affected endpoints is a net reduction in worker
occupancy.
Current leaning: measure end-to-end before adding
any composite-op batching. The obvious lever if it
matters is a composite instance_attach_network task
that performs the whole sequence under one queue item;
trade-off is observability of intermediate steps. That
goes on future work for now.
- What about enable_nat? Resolved. enable_nat is the one host-mutating method already protected today (it's only called from inside create_on_network_node's lock). Under the facade it collapses into BridgedVXLanNetwork._apply_enable_nat, an internal worker helper invoked by _apply_create_on_network_node, not part of the Network public surface. Callers who today call enable_nat() directly become callers of a higher-level intent (typically n.create_on_network_node() which now enqueues the composite chain).
- Per-node queue lane for sf-net. The current networknode-* queues (get_all_network_queues() in operations/baseoperation.py:87) are global queues that only the elected network node services. They are the right home for create_on_network_node, add_floating_ip, route_address, enable_nat, etc. But create_on_hypervisor and ensure_mesh are per-hypervisor mutations: they need to run on each node that has interfaces on the network. A second queue family, scoped to each node, was added in Phase 1 for the local sf-net's net-worker.
The new family mirrors the existing five-priority
taxonomy exactly (the priority enum is
schema/operations/baseclusteroperation.py:13-17:
user_waiting=10, user_facing=20,
user_facing_high_io=25, background=30,
background_high_io=40):
* <node_uuid>-network-user_waiting
* <node_uuid>-network-user_facing
* <node_uuid>-network-user_facing_high_io
* <node_uuid>-network-background
* <node_uuid>-network-background_high_io
The existing dispatcher drains queues in priority
order by iterating the list returned from the
queue-list helper (workitem.py:36-39); the per-node
family is added at the front of that iteration before
the network-node-only family, so a local user-facing
op outranks a network-node background op. Within a
family, the established priority semantics apply
unchanged.
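The naming and drain order described above can be sketched as below. The helper names, the `depths` dict and `next_queue_with_work` are hypothetical, and the `networknode-<priority>` spelling is an assumption about how the existing family is named; only the per-node `<node_uuid>-network-<priority>` pattern and the five priorities come from this plan.

```python
# Priority lanes, highest priority first (values 10, 20, 25, 30, 40).
PRIORITIES = [
    'user_waiting',
    'user_facing',
    'user_facing_high_io',
    'background',
    'background_high_io',
]


def per_node_queues(node_uuid):
    """Queue names serviced by the net-worker on one specific node."""
    return [f'{node_uuid}-network-{p}' for p in PRIORITIES]


def network_node_queues():
    """Assumed naming for the existing network-node-only family."""
    return [f'networknode-{p}' for p in PRIORITIES]


def drain_order(node_uuid, is_network_node):
    """Per-node family first, then the network-node family if elected."""
    order = per_node_queues(node_uuid)
    if is_network_node:
        order += network_node_queues()
    return order


def next_queue_with_work(order, depths):
    """Return the first queue in drain order with pending work, or None."""
    for name in order:
        if depths.get(name, 0) > 0:
            return name
    return None
```

With this ordering, a per-node background op is picked ahead of any network-node op on the same host, which is the behaviour the paragraph above describes for the two families.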
Priority assignment:
- REST handlers that have returned 202 to a waiting
client enqueue at
user_facing(oruser_facing_high_iofor steps that touch the host disk path, e.g. dnsmasq config rewrites). - Operations triggered by an interactive CLI / API call
where a human is blocked on the response use
user_waiting— matching the existing semantics of that lane. maintain-thread reconciliation ops enqueue atbackground(orbackground_high_iofor FDB scans that may sweep many networks at once).-
_apply_*chains spawned internally by a worker op inherit the priority of their parent op rather than defaulting to a fixed value. -
Scope of API redirect removal. The
redirect_to_network_nodedecorator and three of its four applications were removed in Phase 7; after the facade landed those handlers no longer needed to be on the network node. The one remaining application —NetworkPingEndpoint.get— was explicitly retained because migrating the synchronous ping to a queue-based op requires op-output infrastructure not yet built; it remains deferred future work (see Future work section).redirect_instance_request(~10 sites in instance.py),redirect_to_eventlog_node(several sites), andredirect_upload_request(two sites) are the same class of bypass and the queue-based pattern established here is the future direction for them, but each has its own caller fan-out and its own correctness concerns. These remain out of scope for this plan and are recorded as explicit future work. -
11. Defer delay for chained ops. The existing dispatcher (baseoperation.py:212-240) defers a not-yet-ready op by re-enqueueing it with a 15-second delay. That value works well when the deferral reason is "a long-running parent op is still going"; it is far too coarse when the reason is "we are chasing the tail of an immediate predecessor that's about to finish". A three-op chain where each step takes 50 ms of host work could spend 30 seconds in defer-induced sleep, which is unacceptable.

    Resolved approach: exponential back-off, tracked in-memory on the net-worker. The dispatcher maintains a dict[op_uuid -> next_delay] on the worker. First defer for an op uses 100 ms; each subsequent defer doubles the entry's value, capped at 15 s. On terminal state (success, error, abort, detected cancellation) the entry is dropped. The dispatcher does not re-read op state from MariaDB on every defer; it just consults the map and re-enqueues with the recorded delay, so the cost of a tight back-off is paid in memory, not query load.

    Refinements:

    * Reset on observed dep-completion. When the dispatcher actually advances op B to executing (because A finally finished), B's entry is dropped and recreated at 100 ms if B itself later defers waiting on C. We do not carry over the back-off depth across distinct deps.
    * Soft map cap. A safety bound (e.g. 1000 entries) with oldest-first eviction. If we ever reach the cap something else is wrong, but we don't want a memory leak in the failure mode.
    * Cancellation check on dequeue (Phase 1 sub-step). Before deciding whether to defer or execute, the dispatcher inspects the op's current state. If it is STATE_ABORT, STATE_DELETED, or otherwise terminal, the dispatcher drops the in-memory entry, resolves the work item, and proceeds to the next job. This also fixes a latent bug: today execute() calls self.state = STATE_EXECUTING, and state_targets[STATE_ABORT] = (STATE_DELETED,) only, so executing a pre-aborted op raises InvalidStateException and lands an untrapped traceback in the logs. See open question 13 for the broader cancellation work that this plan does not take on.

    Critical safety property. The in-memory map is correct only because each queue this plan touches is serviced by exactly one worker:

    * <node_uuid>-network-* queues: only that node's sf-net net-worker dequeues from them.
    * networknode-* queues: only the elected network node's sf-net net-worker dequeues from them.

    If we ever generalise the queue infrastructure such that more than one worker can dequeue from the same queue (e.g. a worker pool inside one process, or multiple nodes voting on the same queue), two workers can each defer the same op independently and double-enqueue it, breaking the delay schedule. The implementation must carry a big, scary comment at the map's declaration explaining this assumption and warning that any move toward multi-worker dequeue requires either a shared map (in-process pool, lock-protected) or a return to DB-backed state (cross-node). Do not bury this in a paragraph; it has to be the first thing a future maintainer sees.

    Existing defer() callers outside this plan are not affected: they continue to pass an explicit delay=... (or use the unchanged 15 s default). The exponential schedule applies only when callers opt in via the new chain helper (or, equivalently, when the dispatcher itself defers without an explicit delay inside the per-node network queues; to be made precise in the Phase 1 plan).

    Option (C) from the earlier framing (dispatcher re-dequeues dependents on terminal-state transition) remains the right long-term answer and is recorded in the Future work section. It would remove the need for back-off entirely because dependents wake on signal rather than retry.
12. API endpoints for chain discoverability. A REST client that gets a 202 + terminal-op uuid needs a way to discover the chain so it can find which step failed if the chain aborts. The depends_on field is already on cluster operations and points the right way (dependent → dependency); from the terminal op the client can walk backward to find a failed predecessor. But there's no "give me the whole chain" endpoint today.

    Both endpoints were added in Phase 7:

    * GET /clusteroperations/<uuid>/chain returns the transitive closure of depends_on starting from <uuid>, scoped to the caller's namespace (admin sees everything; non-admin sees only ops targeting objects in their namespaces). Useful for the "something in this chain failed; where" lookup.
    * GET /clusteroperations?target_object_type=...&target_uuid=... lists ops targeting a given object, scoped by namespace. Useful for "what's currently happening on this network/instance". The backing query is the same one has_pending_cluster_operation already uses on the cluster_operation_targets table.

    Authorisation follows the existing namespace-ownership pattern (admin sees all; users see ops on objects they own). The cluster_operation_targets table records enough to scope the query at the DB layer; no Python-side filtering of full-table scans.
13. Broader cluster-operation cancellation. Out of scope for this plan, but flagged here so the in-scope cancellation-check sub-step in open question 11 lands in a coherent direction. Today there is ad-hoc cancellation processing scattered across the codebase (e.g. the dispatcher's "abort if a dep ended in ERROR/DELETED/ABORT" rule at workitem.py:84-101); there is no formal model for "operations that should be cancelled when their target object is deleted" versus "operations the user explicitly requested whose result is independently valuable (e.g. snapshot)".

    A plausible shape, to be designed properly in its own master plan, is:

    * Each ClusterOperation subclass declares whether it is cancellable.
    * Each subclass declares whether it blocks state transitions of its target object (e.g. a snapshot blocks deletion of its instance; an ensure-mesh op does not).
    * hard_delete() of a parent object sweeps pending ops by target: cancellable ones are aborted, ops that block transitions force the parent's deletion to defer (or fail) until they complete.
    * A "cancel" verb on the cluster-operation REST API so users can explicitly cancel cancellable ops.

    In scope for this plan: only the dispatcher's cancellation-check sub-step in open question 11 (drop aborted/deleted ops on dequeue, fix the latent InvalidStateException). Nothing else.

    Out of scope and explicit future work: everything above is left for PLAN-cluster-op-cancellation.md (not yet written). This plan should not pre-engineer the per-class declarations or the sweep-on-delete behaviour.
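
The exponential back-off schedule described under "Defer delay for chained ops" can be sketched roughly as follows. This is a minimal illustration, not the shakenfist implementation: the class name DeferBackoff, its method names, and the use of millisecond integers are assumptions made for the example. Only the numbers (100 ms first defer, doubling, 15 s cap, roughly 1000-entry soft cap with oldest-first eviction) come from the plan.

```python
from collections import OrderedDict

FIRST_DELAY_MS = 100     # first defer for an op
MAX_DELAY_MS = 15000     # cap, matching the existing 15 s default
MAX_ENTRIES = 1000       # soft cap; reaching it suggests something else is wrong


class DeferBackoff:
    """Hypothetical per-op back-off map held in memory on one worker.

    WARNING: this is only correct while every queue it serves is
    dequeued by exactly one worker. A second consumer would defer the
    same op independently and break the delay schedule.
    """

    def __init__(self, first=FIRST_DELAY_MS, cap=MAX_DELAY_MS,
                 max_entries=MAX_ENTRIES):
        self.first = first
        self.cap = cap
        self.max_entries = max_entries
        # op_uuid -> delay to use the next time this op is deferred
        self._next_delay = OrderedDict()

    def next_delay(self, op_uuid):
        """Return the delay for this defer and double the stored value."""
        delay = self._next_delay.get(op_uuid, self.first)
        self._next_delay[op_uuid] = min(delay * 2, self.cap)
        # Oldest-first eviction keeps the map bounded in failure modes.
        while len(self._next_delay) > self.max_entries:
            self._next_delay.popitem(last=False)
        return delay

    def forget(self, op_uuid):
        """Drop the entry on terminal state or when the op starts executing."""
        self._next_delay.pop(op_uuid, None)
```

Calling forget() both on terminal state and when an op advances to executing gives the "reset on observed dep-completion" behaviour: a later defer for the same op starts again at 100 ms.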
Notes:
* REST contract change. Since we own both server and
client (shakenfist/client-python), the contract for
the affected delete endpoints changed from "block until
the host change is done" to "return 202 + terminal op
uuid and the client polls". A capability-negotiation
scheme on the server (so old clients still get the
blocking behaviour) was considered but explicitly
declined: we are the only consumer of this API and
would rather not carry the dual code path. It can be
revisited if external consumers ever appear.
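
As a rough sketch of what the 202 + poll contract means for a client, the helper below polls a terminal op until it finishes and, on failure, walks depends_on backward to find the step that failed first. The get_op callable, the dictionary shape of an op record, the state names, and the assumption that depends_on is a list of uuids are all hypothetical and chosen for illustration; the real client-python helper may look quite different.

```python
import time

# Hypothetical state names, for illustration only.
TERMINAL_STATES = {'complete', 'error', 'abort', 'deleted'}
FAILED_STATES = {'error', 'abort'}


def wait_for_op(get_op, op_uuid, timeout=60.0, interval=0.5,
                sleep=time.sleep):
    """Poll get_op(op_uuid) until the op reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        op = get_op(op_uuid)
        if op['state'] in TERMINAL_STATES:
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError('op %s did not finish within %.1fs'
                               % (op_uuid, timeout))
        sleep(interval)


def chain_closure(get_op, start_uuid):
    """Return every op reachable from start_uuid by following depends_on."""
    seen = set()
    ops = []
    pending = [start_uuid]
    while pending:
        op_uuid = pending.pop()
        if op_uuid in seen:
            continue
        seen.add(op_uuid)
        op = get_op(op_uuid)
        ops.append(op)
        pending.extend(op.get('depends_on', []))
    return ops


def root_failures(ops):
    """Return uuids of failed ops whose own dependencies did not fail."""
    failed = {op['uuid'] for op in ops if op['state'] in FAILED_STATES}
    return sorted(
        op['uuid'] for op in ops
        if op['uuid'] in failed
        and not failed.intersection(op.get('depends_on', []))
    )
```

A caller that wants the old blocking semantics would call wait_for_op on the returned terminal-op uuid and, if the result is in a failed state, use chain_closure plus root_failures (or the server-side chain endpoint) to report which step broke.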
Execution¶
| Phase | Plan | Status |
|---|---|---|
| 0. Stability-branch lock fix (separate, lands now) | (not a sub-plan; see commit on stability) | Complete |
| 1. Per-node sf-net queue family (five priority lanes mirroring the existing taxonomy) + dispatcher changes: exponential back-off map (100 ms → 15 s, ×2, single-worker safety comment) and cancellation-check on dequeue (drop aborted/deleted ops, fix the latent InvalidStateException from executing a pre-aborted op) | PLAN-network-facade-phase-01-per-node-queues.md | Complete |
| 2. BridgedVXLanNetwork scaffold, ErrorReport infrastructure (class, from_exception registry, to_http mapping, op.error_report / op.raise_for_error API), and ensure_mesh migration | PLAN-network-facade-phase-02-ensure-mesh.md | Complete |
| 3. Floating-IP and route migration | PLAN-network-facade-phase-03-floating-ip.md | Complete |
| 4. dnsmasq operation migration | PLAN-network-facade-phase-04-dnsmasq.md | Complete |
| 5. create_on_* and delete_on_* migration | PLAN-network-facade-phase-05-lifecycle.md | Complete |
| 6. maintain.py rewrite as discovery-only | PLAN-network-facade-phase-06-maintain.md | Complete |
| 7. REST contract: remove redirect_to_network_node from three of its four sites, flip the two delete endpoints to 202+poll, add /clusteroperations/<uuid>/chain and /clusteroperations?target_*= endpoints, update client-python | PLAN-network-facade-phase-07-rest-contract.md | Complete |
| 8. Remove the temporary NodeLocks from the stability fix (no per-method migration flags existed to remove; Phases 2–5 migrated each method cleanly in-place) | PLAN-network-facade-phase-08-cleanup.md | Complete |
| 9. Documentation and tests | PLAN-network-facade-phase-09-docs.md | Complete |
Phase numbering reflects dependency ordering. Phase 1 is
foundational (no per-method migration can happen without per-node
queues); the small, isolated ensure_mesh method comes next; the
broad-fan-out lifecycle methods are later; the redirect-decorator
removal sits at phase 7 because all of its callers need the facade
to be in place. Each phase is expected to compile, pass CI, and
leave the cluster in a runnable state; intermediate phases will
have Network carrying both the old direct-call API (for
unmigrated methods) and the new facade-routed API (for migrated
methods) in parallel.
Agent guidance¶
Execution model¶
All implementation work is done by sub-agents, never in the management session. The management session (this conversation) is reserved for planning, review, and decision-making. This keeps the management context lean and avoids drowning it in implementation diffs.
The workflow is:
- Plan at high effort in the management session.
- Spawn a sub-agent for each implementation step with the brief from the plan, at the recommended effort level and model.
- Review the sub-agent's output in the management session. Check the actual files — the sub-agent's summary describes what it intended, not necessarily what it did.
- Fix or retry if the output is wrong. Diagnose whether the brief was insufficient (improve it) or the model was too light (upgrade it), then re-run.
- Commit once the management session is satisfied with the result.
This applies to all steps, including high-effort ones. If a sub-agent can't succeed even with a detailed brief and the right model, that's a signal the brief needs improving, not that the management session should do the implementation itself.
Use isolation: "worktree" for sub-agents when the change is
risky or experimental. The worktree is discarded if the
output is unsatisfactory. For safe, well-understood changes,
sub-agents can work directly in the main tree.
Planning effort¶
The master plan itself was created at high effort — it requires broad codebase understanding, cross-referencing multiple source files, and making judgment calls about scope and sequencing.
Per-phase guidance:
- Phase 1 (per-node queues): high effort. Touches queue infrastructure that every daemon depends on; getting routing wrong has cluster-wide consequences. Requires understanding the existing get_all_node_queues / get_all_network_queues taxonomy and the dequeue_work_item claim semantics.
- Phase 2 (ensure_mesh migration): high effort. First use of the BridgedVXLanNetwork pattern; establishes the enqueue-and-wait shim, the exception whitelist serialisation, and the event-emission split. All subsequent per-method phases inherit from these decisions.
- Phases 3–5 (per-method migrations): medium effort each. The pattern from phase 2 is in place; each phase applies it to a different method group, with phase-specific judgment about idempotency and the affected_objects parameter for floating-IP methods.
- Phase 6 (maintain rewrite): high effort. Subtle reconciliation logic; risk of generating no-op churn on every interval if the "is it actually wrong" check is not carefully ported.
- Phase 7 (REST contract): high effort. More than purely mechanical: removes redirect_to_network_node from three of its four sites, flips the two delete endpoints to 202+poll, adds two new cluster-operation discovery endpoints with namespace-scoped SQL authz, and updates shakenfist/client-python to match. The four decorator sites need careful audit to confirm nothing else relies on running on the network node beyond what the facade has already moved.
- Phase 8 (cleanup): low effort. Reverts the stability-fix locks; no per-method migration flags existed to remove. Each removal is a small targeted diff.
- Phase 9 (docs): medium effort. Updates ARCHITECTURE.md, AGENTS.md, the API reference docs, and the master plan itself. README, the developer guide state-machine docs, and the operator-guide network docs required no changes (confirmed during Phase 9 planning).
Step-level guidance¶
Each phase plan should include a table like this:
| Step | Effort | Model | Isolation | Brief for sub-agent |
|------|--------|-------|-----------|---------------------|
| 1a | medium | sonnet | none | One-sentence summary of what to do and which files to touch |
| 1b | high | opus | worktree | Why this needs high effort: requires understanding X to do Y |
Effort levels:
- high: Requires reading multiple files, making judgment calls, understanding non-obvious invariants, or researching external references. The sub-agent needs to think carefully about edge cases.
- medium: The plan provides enough context that the sub-agent can follow a clear brief. May need to read a few files but the approach is well-defined.
- low: Purely mechanical changes (rename, reformat, add a log line, regenerate proto stubs). The brief is a complete instruction.

Model choice:
- opus: Best for steps that require deep reasoning, cross-daemon architectural understanding, subtle correctness judgment (locking, state machines, migration), or complex protocol research.
- sonnet: Good default for well-briefed implementation work. Faster and cheaper than opus.
- haiku: Suitable for purely mechanical tasks. The brief must be a near-complete instruction.
When in doubt, skew to the more capable model. Saving money only matters if the outcome is still acceptable.
Brief for sub-agent: Write it as if briefing a colleague who has never seen the codebase. Include: what to change, which files to touch, what patterns to follow, and any non-obvious constraints. The better the brief, the lower the effort level needed and the lighter the model that can succeed.
Management session review checklist¶
After a sub-agent completes, the management session should verify:
- The files that were supposed to change actually changed (read them, don't trust the summary).
- No unrelated files were modified.
- The code passes pre-commit run --all-files (flake8, stestr unit tests, mypy).
- If proto files changed, stubs were regenerated with tox -e genprotos and committed.
- The changes match the intent of the brief: not just syntactically correct but semantically right.
- Commit message follows project conventions (including the Co-Authored-By line with model, context window, effort level, and other settings).
- For network-facade phases specifically: no new direct call to util_concurrency.* host-mutating helpers appears outside BridgedVXLanNetwork; no new caller bypasses the queue.
Administration and logistics¶
Success criteria¶
We will know when this plan has been successfully implemented because the following statements will be true:
- The code passes pre-commit run --all-files (flake8, stestr unit tests, and mypy type checking).
- grep for util_concurrency.create_vxlan_interface, util_concurrency.ensure_vxlan_mesh, util_concurrency.add_floating_ip, util_concurrency.remove_floating_ip outside shakenfist/network/ and the BridgedVXLanNetwork implementation returns no hits. The only callers are the worker class.
- grep for redirect_to_network_node returns only the decorator definition and its one retained application on NetworkPingEndpoint.get; the other three applications are removed.
- The two affected delete endpoints return 202 with a terminal-op uuid in the response body, not 200 with the fully-applied result. The shakenfist/client-python client transparently polls these handles for callers that want synchronous semantics.
- GET /clusteroperations/<uuid>/chain returns the transitive depends_on closure scoped to the caller's namespace.
- GET /clusteroperations?target_object_type=&target_uuid= lists ops targeting a given object, also scoped. Filtering is done at the SQL layer using cluster_operation_targets; there is no Python-side filtering of full-table scans.
- Chained NetOps built by the new facade helper use the in-memory exponential back-off schedule (100 ms first defer, ×2 per subsequent defer, 15 s cap) rather than the global 15 s default. A three-op chain of short-running steps completes without accumulating noticeable extra latency from defer back-off.
- The dispatcher contains a prominent comment at the declaration of the in-memory back-off map warning that its correctness depends on each queue being serviced by exactly one worker, and that any change toward multi-worker dequeue (in-process pool or cross-node voting) requires either a shared / locked map or a return to DB-backed state.
- The dispatcher recognises aborted / deleted ops on dequeue: it drops the in-memory back-off entry, resolves the work item, and proceeds without raising InvalidStateException. Unit-test coverage for this path exists.
- An ErrorReport class exists at shakenfist/operations/error_report.py carrying code, message, details, origin_class, and traceback. ErrorReport.from_exception(exc) maps the four registered network exceptions to stable codes (network.ensure_mesh.failed, network.dead, network.create_vxlan.failed, network.floating.assign_failed) and falls back to internal.unknown with origin_class preserved for anything else.
- The worker dispatcher's outer except converts escaping exceptions to ErrorReport and persists them on the op record. REST endpoints rendering failed ops call ErrorReport.to_http() rather than catching individual exception classes; the traceback is not surfaced to clients by default. op.error_report returns the report on the op object; op.raise_for_error() raises a single generic NetworkOperationFailed carrying the report for callers that want exception-flow control.
- In-worker except EnsureMeshFailed: / except DeadNetwork: blocks in net_op.py and siblings continue to work unchanged; the boundary translation only applies to exceptions that escape past those in-worker handlers.
- The per-network NodeLock calls added by the stability-branch fix are removed; the with self.get_lock(...) blocks inside the migrated _apply_* methods on BridgedVXLanNetwork are removed too (the single-threaded net-worker is the serialisation mechanism now).
- maintain.py no longer calls n.ensure_mesh(), n.create_on_hypervisor(), n.add_floating_ip(...), or n.route_address(...). It only enqueues net_ops.
- maintain.py respects the four guards specified in open question 6: queue-depth safety, per-network gating, cooldown after a failed reconciliation, consecutive-failure circuit breaker. A deliberately-broken reconciliation (test scaffolding) produces no more than MAINTAIN_RECONCILE_CIRCUIT_K reconciliation ops over an extended period; an operator-visible event surfaces the circuit-breaker state.
- The PR #3182 reproduction scenario (two concurrent _ensure_mesh calls for the same VXLAN) cannot occur: there is no second caller on the same node, because the net-worker is single-threaded.
- The per-node <node_uuid>-network-* queue family exposes all five priority lanes (user_waiting, user_facing, user_facing_high_io, background, background_high_io). REST handlers enqueue at user_facing; maintain enqueues at background. Spot-checking under load shows that user-facing chains complete ahead of concurrent background reconciliation.
- New code follows existing patterns: object lifecycle in baseobject.py, MariaDB access via the three-layer pattern (direct/gRPC/public), Pydantic schemas in shakenfist/schema/.
- There are unit tests for the enqueue-and-wait helper and for the exception whitelist serialisation, and functional test coverage in shakenfist/deploy/cluster_ci exercising the migrated paths (instance create, floating IP attach, network maintenance reconciliation).
- Lines are wrapped at 120 characters, single quotes for strings, double quotes for docstrings.
- ARCHITECTURE.md and AGENTS.md describe the Network / BridgedVXLanNetwork split and the per-node <node_uuid>-network-* queue family. README.md, the developer-guide state-machine docs, and the operator-guide network docs were reviewed and required no changes (confirmed during Phase 9 planning).
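
To make the ErrorReport criteria above more concrete, here is a minimal sketch of the shape they describe. It is an illustration under stated assumptions, not the code at shakenfist/operations/error_report.py: the exception classes are local stand-ins, only the two exceptions named in this plan are registered (the class names behind the other two codes are not given here), and the HTTP status returned by to_http is a placeholder because the real mapping is not specified in this document.

```python
import traceback as tb
from dataclasses import dataclass, field


class EnsureMeshFailed(Exception):
    """Local stand-in for the real exception class."""


class DeadNetwork(Exception):
    """Local stand-in for the real exception class."""


# Registry of exception class -> stable code. Open for extension.
_CODES = {
    EnsureMeshFailed: 'network.ensure_mesh.failed',
    DeadNetwork: 'network.dead',
}


@dataclass
class ErrorReport:
    code: str
    message: str
    details: dict = field(default_factory=dict)
    origin_class: str = ''
    traceback: str = ''

    @classmethod
    def from_exception(cls, exc):
        """Translate an escaping exception into a serialisable report."""
        code = 'internal.unknown'
        for klass, registered_code in _CODES.items():
            if isinstance(exc, klass):
                code = registered_code
                break
        return cls(
            code=code,
            message=str(exc),
            origin_class=type(exc).__name__,
            traceback=''.join(
                tb.format_exception(type(exc), exc, exc.__traceback__)),
        )

    def to_http(self):
        """Return (status, body); the traceback is deliberately left out."""
        status = 500  # placeholder: the real per-code mapping is not shown
        body = {
            'code': self.code,
            'message': self.message,
            'details': self.details,
            'origin_class': self.origin_class,
        }
        return status, body
```

Keeping origin_class on the fallback path means an unregistered exception still tells the operator what was raised, even though clients only see the stable internal.unknown code.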
Future work¶
- General subscription mechanism for terminal-state events. Polling is sufficient for v1 but is not the endgame. A clean shape is publishing terminal-state transitions through the EventLog so that REST API callers (and potentially registered webhooks) can subscribe rather than poll. This would replace the polling helper transparently and is useful well beyond networks. Out of scope here.
- Dispatcher re-dequeue on dep completion. Option (C) from open question 11: when a cluster operation reaches a terminal state, the dispatcher immediately requeues any ops that were waiting on it, instead of relying on those dependents to wake up via the defer back-off. This is the proper long-term replacement for the defer-and-retry pattern and removes both the tight-defer-delay knob and the in-memory back-off map entirely.
- Recurring cluster operations framework (PLAN-recurring-operations.md, stub). A typed RecurringOperation object with a cron-like schedule, plus the dispatcher changes needed to support it (max-wait runs_after, continue-on-failure recurrence semantics). When this framework lands, daemons/network/maintain.py is absorbed into it: the per-network gating + cooldown + circuit-breaker behaviour specified in this plan's open question 6 moves into a network_maintain_pass op triggered by an internal RecurringOperation, and the maintain thread disappears. Other internal consumers (daemons/cluster/scheduled_tasks.py) migrate at the same time. User-facing recurring tasks like "snapshot this instance every 24 hours" land in the same framework.
- Broader cluster-operation cancellation (PLAN-cluster-op-cancellation.md, not yet written). Formalise the ad-hoc cancellation handling scattered across the codebase into a coherent model: each ClusterOperation subclass declares whether it is cancellable and whether it blocks state transitions of its target object (e.g. a snapshot blocks deletion of its instance, an ensure-mesh op does not); hard_delete() of a parent sweeps pending ops by target and either aborts them or defers the parent deletion until they complete; a cancel verb is exposed on the cluster-operation REST API. This plan takes only the minimal dispatcher-side cancellation-check needed to make exponential back-off correct; it deliberately does not touch the per-class declarations or the sweep-on-delete behaviour.
- Capability negotiation on the REST API. If external API consumers ever appear, the server may want to advertise the 202+poll contract as a capability and let clients signal whether they prefer it (with the blocking contract as fallback). Out of scope while we are the sole consumer.
- Extend ErrorReport to other subsystems. ErrorReport is introduced here for the network operations queue boundary, but it is a general cluster-operation concept and lives at shakenfist/operations/error_report.py rather than inside shakenfist/network/. Other subsystems (artifact, blob, instance lifecycle) should adopt the same pattern as their operations migrate to the same shape: register their exceptions in ErrorReport.from_exception with stable codes, and use op.error_report / op.raise_for_error on the poll side. Each subsystem can adopt independently; the registry is open for extension.
- Generic background-lane starvation fix. If background / background_high_io reconciliation ops are observed to starve behind a steady stream of user-facing work, the fix is a generic work-queue fairness change (e.g. weighted draining across priority lanes, occasional forced dequeue from the next-lower lane) applied to every queue family, not a special case for the network family. Out of scope here; record the observation if it surfaces during phase rollout.
- Explicit cleanup-on-error chains. If a concrete case arises where the convergence model is genuinely insufficient (i.e. partial state left by a failed chain cannot be reliably reconciled by maintain or the cleaner, and the caller has a precise idea of what to undo), we could extend the dependency vocabulary with a runs_on_error_of variant so that a caller can enqueue a cleanup op that fires only when a specific chain step ends in ERROR. This is much lighter than full saga support: cleanup is just another op the caller chooses to enqueue, paired with a specific failure trigger, rather than a per-op compensator that every operation must implement. Deferred until a real need surfaces; we should not pre-engineer this.
- Composite per-instance network operations. The instance-start path performs ~3-5 host changes back-to-back; if queue round-trip latency dominates, collapsing the sequence into a single instance_attach_network composite op amortises the cost. Worth doing only with measured evidence.
- Apply the facade pattern to NetworkInterface and IPAM. Each has its own audit work but the same bypass problem exists in principle.
- Queue-based ping endpoint migration. Phase 7 retained the @redirect_to_network_node decorator on NetworkPingEndpoint.get because migrating a synchronous ping to the queue-based op model requires op-output infrastructure (a way to surface the ping result back through the op terminal state) that is not yet built. This remains deferred beyond this master plan.
- Remove the remaining API redirect decorators. redirect_instance_request, redirect_to_eventlog_node, and redirect_upload_request are structurally the same problem as redirect_to_network_node. Each needs its own audit because the operations they proxy do not all have queue-based equivalents today.
- Move the REST API layer to its own tier of machines. If the API tier ever becomes the bottleneck, separating gunicorn workers from the compute/network nodes lets them scale independently. Not required for v1.
- Promote the per-node <node_uuid>-network-* queue family to a general "per-node daemon-owned" pattern. If other daemons end up needing the same "this work must run on this specific node, serialised through a single worker" shape, the queue infrastructure introduced here is a starting point.
Bugs fixed during this work¶
- Latent InvalidStateException on pre-aborted ops. (Fixed in Phase 1.) Before the cancellation-check on dequeue was added, the dispatcher could pick up an op already in STATE_ABORT, call self.state = STATE_EXECUTING, and hit InvalidStateException because state_targets[STATE_ABORT] = (STATE_DELETED,) only. The Phase 1 cancellation-check drops such ops cleanly before attempting to execute them.
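
The bug and its fix can be reproduced in miniature. The sketch below is a toy state machine, not the shakenfist object model: the Op class, the dispatch function, and every transition except state_targets[STATE_ABORT] = (STATE_DELETED,) are assumptions made for illustration.

```python
class InvalidStateException(Exception):
    pass


STATE_QUEUED = 'queued'
STATE_EXECUTING = 'executing'
STATE_ABORT = 'abort'
STATE_DELETED = 'deleted'

STATE_TARGETS = {
    STATE_QUEUED: (STATE_EXECUTING, STATE_ABORT, STATE_DELETED),  # assumed
    STATE_EXECUTING: (STATE_ABORT, STATE_DELETED),                # assumed
    STATE_ABORT: (STATE_DELETED,),                                # from the plan
    STATE_DELETED: (),                                            # assumed
}


class Op:
    """Toy operation with a validated state setter."""

    def __init__(self, uuid, state=STATE_QUEUED):
        self.uuid = uuid
        self._state = state

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, new_state):
        if new_state not in STATE_TARGETS[self._state]:
            raise InvalidStateException(
                '%s -> %s is not allowed' % (self._state, new_state))
        self._state = new_state

    def execute(self):
        self.state = STATE_EXECUTING


def dispatch(op, backoff):
    """Check for cancellation on dequeue before trying to execute."""
    if op.state in (STATE_ABORT, STATE_DELETED):
        backoff.pop(op.uuid, None)   # drop any back-off entry
        return 'dropped'
    op.execute()
    return 'executed'
```

Calling execute() directly on an op that is already in STATE_ABORT raises InvalidStateException, which is the pre-fix behaviour; routing the same op through dispatch() drops it cleanly instead.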
Documentation index maintenance¶
When this master plan is updated:
- docs/plans/index.md: the row for this plan should track the current overall status (Planning → In Progress → Complete). Phase rows are not added to index.md; phases are tracked in the Execution table above.
- docs/plans/order.yml: this master plan is already listed; phase files are not added to order.yml.
When all phases of this plan are complete, update the status
column in docs/plans/index.md.
Back brief¶
Before executing any step of this plan, the implementing sub-agent must back brief the operator as to its understanding of the phase plan and how the work it intends to do aligns with that plan.