PLAN: Queue performance and coalescing
Status
Steps 1-6 implemented on the network-facade branch. Step 7
(measure, decide on fairness) is left until the per-op
'started executing' event from step 1 produces real CI data.
Problem
Functional CI on the network-facade branch surfaced a
cluster-wide latency tail: cluster operations -- especially
network_apply_update_dnsmasq on the elected network node --
spent >60 s queued before a worker picked them up. Six instance
starts on one network each enqueued one update_dnsmasq op; the
single-threaded sf-net worker serviced them strictly serially,
and each one paid the full state-machine round-trip
(STATE_EXECUTING write -> work -> STATE_COMPLETE write -> 100 ms
poll lag on the waiter side) even though the actual dnsmasq
restart is sub-second.
The pre-existing topology was already serialised; what changed is that the network-facade refactor moved work that used to run inline on the network node into the queue. Every change now pays the queue+state-machine overhead, and the work backed up.
Approach
Six discrete changes plus one measurement step:
1. Visibility: emit a `'started executing'` event at the dispatcher-pickup boundary carrying `wait_seconds`, `defer_count` and `queue_name`. This is the only place in the pipeline that observes both `op.created_at` (insert time) and `start_time` (when the worker is about to call `op.execute()`), so the per-op queue-wait latency lands directly in eventlog.
2. Unified batched dequeue: replace `dequeue_work_item(qn)` and its direct/gRPC pair with `dequeue_work_items(queue_names, limit)`, served by a single MariaDB SELECT using `ORDER BY FIELD(queue_name, ...), scheduled_at`. Both `sf-net` and `sf-queues` use the new API; the singular method is removed (one way of doing the thing). The previous 10 sequential `Dequeue` gRPCs per idle poll become one.
3. Coalescible-task metadata: declare which (op_type, task) combinations are safe to fold. Subclasses set `coalescible_tasks` (frozenset) and `coalescible_target_column` on `BaseClusterOperation`; the schema module declares the same set under `COALESCIBLE_TASKS`. Metadata-only commit -- no behaviour changes.
4. Worker-side dedup: inside `BaseClusterOperation.execute`, two passes. (a) Within-job: drop duplicate coalescible tasks from `self.tasks`. (b) Cross-op: ask MariaDB (one transactional SQL statement) to fold every other pending op on the same target whose entire task list is one of our coalescible tasks -- their state transitions to `complete`, and when the dispatcher eventually surfaces their `work_queue` row the terminal-state branch drops it cleanly.
5. Enqueue-side dedup: at the top of `net_op.create_and_enqueue`, look up an existing pending coalescible op on the same target. If found, return that op's uuid instead of inserting a duplicate row. Dedup is skipped when the new enqueue carries `depends_on` or `runs_after` (those encode an ordering constraint reusing a sibling would erase). The lookup race is bounded -- two concurrent callers that both miss the lookup produce at most one duplicate row, which the worker-side fold (step 4) catches on dispatch.
6. Caller-site audit: sweep for fan-out patterns we can collapse before they hit `create_and_enqueue`. See the findings section below.
7. Re-measure: once steps 1-6 are deployed in CI, the `'started executing'` event distribution tells us whether the wait tail is gone or whether explicit fairness (bounded staleness, reserved-slot lottery) is still needed for lower-priority queues.
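The within-job pass of step 4(a) and the enqueue-side lookup of step 5 can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not the code on the branch: `PendingOp`, `dedup_tasks`, `enqueue`, the in-memory `pending` list, and the contents of the example coalescible set are all hypothetical stand-ins for the MariaDB-backed implementation. Matching a reusable op on an identical task list is also an assumption; the plan only says the existing op must be pending, coalescible, and on the same target.

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Hypothetical coalescible set; the plan declares the real one per op subclass.
COALESCIBLE_TASKS = frozenset({"update_dnsmasq", "ensure_mesh"})


@dataclass
class PendingOp:
    """Hypothetical stand-in for a pending cluster operation row."""
    target: str
    tasks: list
    op_uuid: str = field(default_factory=lambda: str(uuid4()))


def dedup_tasks(tasks, coalescible=COALESCIBLE_TASKS):
    """Step 4(a): drop repeated coalescible tasks, keeping the first of each.

    Tasks that are not coalescible are never dropped, even if repeated.
    """
    seen = set()
    out = []
    for task in tasks:
        if task in coalescible:
            if task in seen:
                continue
            seen.add(task)
        out.append(task)
    return out


def enqueue(pending, target, tasks, depends_on=None, runs_after=None):
    """Step 5: return an existing pending op's uuid when reuse is safe."""
    has_ordering = depends_on is not None or runs_after is not None
    if not has_ordering and set(tasks) <= COALESCIBLE_TASKS:
        for op in pending:
            # Assumption: "same work" means same target and same task list.
            if op.target == target and op.tasks == list(tasks):
                return op.op_uuid
    op = PendingOp(target=target, tasks=list(tasks))
    pending.append(op)
    return op.op_uuid
```

With this sketch, two back-to-back calls to `enqueue(pending, "net-1", ["update_dnsmasq"])` return the same uuid and leave one entry in `pending`, while a third call that passes `depends_on` always inserts a new op.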
Audit findings (step 6)
| Site | Pattern | Resolution |
|---|---|---|
| `node_inst_netdesc_op._instance_start` | Loop over `net_desc`; per-interface call to `n.ensure_mesh()` and `n.update_dnsmasq()` | Track `reconciled_network_uuids` in a set; first interface on a network triggers reconciliation, subsequent interfaces on the same network skip. Per-interface work (state flip, floating-IP fan-out) stays inside the loop. Fixed in this commit. |
| `node_inst_op._instance_delete` | Loop over `instance_networks`; one `n.update_dnsmasq()` per network | Already deduplicates `network_uuid` before entering the loop. No change. |
| External API hot-plug (`external_api/instance.py`) | Single multi-task enqueue `[create_network_node, ensure_mesh]` | Not a fan-out -- one op per hot-plug call. No change. |
| `daemons/network/maintain.py` | Per-network reconciliation enqueues during a 30 s pass | Bounded by per-network in-flight gate (`has_pending_cluster_operation_target`). After step 5, parallel maintainer passes on the same node coalesce; cross-node dedup deliberately disabled because mesh is per-hypervisor. No change. |
| `daemons/queues/startup_tasks.py` | Per-network sequential `n.create_on_hypervisor()` + `n.ensure_mesh()` waits during node startup | `create_on_hypervisor` enqueues `node_net_op.network_apply_create_hypervisor`, which is not currently coalescible. Multiple instances on the same network on the same node fan out at startup. See the follow-up below. |
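The `_instance_start` fix in the first row amounts to a seen-set guard around the per-network calls. A minimal sketch of that pattern, assuming a simplified interface record: `FakeNetwork`, `instance_start`, and the dict layout are hypothetical, and only `reconciled_network_uuids`, `ensure_mesh`, and `update_dnsmasq` are names taken from the plan.

```python
class FakeNetwork:
    """Hypothetical stand-in that counts reconciliation calls."""

    def __init__(self, network_uuid):
        self.uuid = network_uuid
        self.mesh_calls = 0
        self.dnsmasq_calls = 0

    def ensure_mesh(self):
        self.mesh_calls += 1

    def update_dnsmasq(self):
        self.dnsmasq_calls += 1


def instance_start(interfaces, networks):
    """Reconcile each network once; do per-interface work every time."""
    reconciled_network_uuids = set()
    per_interface_done = []
    for iface in interfaces:
        network_uuid = iface["network_uuid"]
        if network_uuid not in reconciled_network_uuids:
            # First interface seen on this network triggers reconciliation.
            reconciled_network_uuids.add(network_uuid)
            n = networks[network_uuid]
            n.ensure_mesh()
            n.update_dnsmasq()
        # Per-interface work (state flip, floating-IP fan-out in the plan)
        # stays inside the loop and is never skipped.
        per_interface_done.append(iface["uuid"])
    return per_interface_done
```

For an instance with two interfaces on one network and a third on another, each network sees exactly one `ensure_mesh()` and one `update_dnsmasq()` call, while all three interfaces still receive their per-interface work.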
Follow-ups not landed here
- `NodeNetOp.network_apply_create_hypervisor` coalescing. This task is idempotent (`util_concurrency.create_vxlan_interface` is "create if missing") and is enqueued per-instance during startup -- so a node restoring N instances on the same network enqueues N node_net_ops where one would do. Marking it coalescible needs the find/claim primitives to filter on both `node_uuid` and `network_uuid` (same network on the same node), which the current `target_column` parameter does not support. Generalising to a list of `(column, value)` pairs is straightforward; deferred so step 5's CI run gives us measurable baseline numbers first.
- Explicit fairness for low-priority queues. The dequeue query honours strict priority order via `FIELD()`; lower priorities only spill in when higher ones yield fewer rows than `limit`. Sustained heavy load on `user_facing` could in principle starve `background`. The CI signal will tell us whether to add bounded-staleness ordering (`ORDER BY CASE WHEN NOW() - created_at > N THEN top ELSE priority END, ...`) or a reserved-slot mechanism.
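The difference between strict priority and the bounded-staleness option can be illustrated with an in-memory sort. This is a rough sketch, not the dequeue query itself: `QueuedItem`, `strict_order`, `bounded_staleness_order`, and `max_wait_seconds` are hypothetical names, the two-entry `PRIORITY` list is an assumption, and items are tie-broken on `created_at` for simplicity where the plan's query orders by `scheduled_at`.

```python
from dataclasses import dataclass

# Queue names in priority order, highest first (the role FIELD() plays in SQL).
PRIORITY = ["user_facing", "background"]


@dataclass(frozen=True)
class QueuedItem:
    name: str
    queue_name: str
    created_at: float  # seconds since an arbitrary epoch


def strict_order(items, limit):
    """Strict priority: lower queues only fill slots left by higher ones."""
    def key(item):
        return (PRIORITY.index(item.queue_name), item.created_at)
    return sorted(items, key=key)[:limit]


def bounded_staleness_order(items, limit, now, max_wait_seconds):
    """Promote any item that has waited longer than the bound to top rank."""
    def key(item):
        rank = PRIORITY.index(item.queue_name)
        if now - item.created_at > max_wait_seconds:
            rank = 0  # treated as top priority once it is stale
        return (rank, item.created_at)
    return sorted(items, key=key)[:limit]
```

Given one old `background` item and two recent `user_facing` items with `limit=2`, `strict_order` returns only the `user_facing` items, whereas `bounded_staleness_order` places the stale `background` item first. Whether that trade-off is worth it is exactly what the step 7 measurement is meant to decide.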