PLAN: Queue performance and coalescing

Status

Steps 1-6 are implemented on the network-facade branch. Step 7 (measure, decide on fairness) is deferred until the per-op 'started executing' event from step 1 produces real CI data.

Problem

Functional CI on the network-facade branch surfaced a cluster-wide latency tail: cluster operations, especially network_apply_update_dnsmasq on the elected network node, spent more than 60 s queued before a worker picked them up. Six instance starts on one network each enqueued one update_dnsmasq op, and the single-threaded sf-net worker serviced them strictly serially. Each op paid the full state-machine round-trip (STATE_EXECUTING write -> work -> STATE_COMPLETE write -> 100 ms poll lag on the waiter side) even though the actual dnsmasq restart is sub-second.

The pre-existing topology was already serialised; what changed is that the network-facade refactor moved work that used to run inline on the network node into the queue. Every change now pays the queue+state-machine overhead, and the work backed up.

Approach

Six discrete changes plus one measurement step:

1. Visibility: emit a 'started executing' event at the dispatcher-pickup boundary carrying wait_seconds, defer_count and queue_name. This is the only place in the pipeline that observes both op.created_at (insert time) and start_time (when the worker is about to call op.execute()), so the per-op queue-wait latency lands directly in eventlog.
2. Unified batched dequeue: replace dequeue_work_item(qn) and its direct/gRPC pair with dequeue_work_items(queue_names, limit), served by a single MariaDB SELECT using ORDER BY FIELD(queue_name, ...), scheduled_at. Both sf-net and sf-queues use the new API; the singular method is removed (one way of doing the thing). The previous 10 sequential Dequeue gRPCs per idle poll become one.
3. Coalescible-task metadata: declare which (op_type, task) combinations are safe to fold. Subclasses set coalescible_tasks (frozenset) and coalescible_target_column on BaseClusterOperation; the schema module declares the same set under COALESCIBLE_TASKS. Metadata-only commit -- no behaviour changes.
4. Worker-side dedup: inside BaseClusterOperation.execute, two passes. (a) Within-job: drop duplicate coalescible tasks from self.tasks. (b) Cross-op: ask MariaDB (one transactional SQL statement) to fold every other pending op on the same target whose entire task list is one of our coalescible tasks -- their state transitions to complete, and when the dispatcher eventually surfaces their work_queue row the terminal-state branch drops it cleanly.
5. Enqueue-side dedup: at the top of net_op.create_and_enqueue, look up an existing pending coalescible op on the same target. If found, return that op's uuid instead of inserting a duplicate row. Dedup is skipped when the new enqueue carries depends_on or runs_after (those encode an ordering constraint reusing a sibling would erase). The lookup race is bounded -- two concurrent callers that both miss the lookup produce at most one duplicate row, which the worker-side fold (step 4) catches on dispatch.
6. Caller-site audit: sweep for fan-out patterns we can collapse before they hit create_and_enqueue. See the findings section below.
7. Re-measure: once steps 1-6 are deployed in CI, the 'started executing' event distribution tells us whether the wait tail is gone or whether explicit fairness (bounded staleness, reserved-slot lottery) is still needed for lower-priority queues.
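
The two dedup steps are easier to reason about with a toy model. The following is a minimal in-memory sketch, not the branch's implementation: the Op dataclass, the helper names (dedup_within_job, fold_pending, enqueue_with_dedup), the contents of COALESCIBLE_TASKS and the exact-match rule used on enqueue are all assumptions made for illustration, and the real fold runs as a single transactional SQL statement in MariaDB rather than a Python loop.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical contents; the real set lives in the schema module.
COALESCIBLE_TASKS = frozenset({"update_dnsmasq", "ensure_mesh"})


@dataclass
class Op:
    uuid: str
    target: str          # e.g. the network uuid the op acts on
    tasks: List[str]
    state: str = "queued"


def dedup_within_job(tasks):
    """Drop repeated coalescible tasks, keeping the first occurrence."""
    seen = set()
    kept = []
    for task in tasks:
        if task in COALESCIBLE_TASKS:
            if task in seen:
                continue
            seen.add(task)
        kept.append(task)
    return kept


def fold_pending(executing, pending):
    """Complete other queued ops on the same target whose whole task list
    is covered by the executing op's coalescible tasks."""
    ours = set(executing.tasks) & COALESCIBLE_TASKS
    folded = []
    for op in pending:
        if op.uuid == executing.uuid or op.state != "queued":
            continue
        if op.target != executing.target:
            continue
        if op.tasks and set(op.tasks) <= ours:
            op.state = "complete"
            folded.append(op.uuid)
    return folded


def enqueue_with_dedup(pending, new_op, depends_on=None, runs_after=None):
    """Return an existing queued op's uuid instead of adding a duplicate.

    Dedup is skipped when an ordering constraint is present. Matching on an
    identical task list is an assumption of this sketch.
    """
    ordered = depends_on is not None or runs_after is not None
    coalescible = bool(new_op.tasks) and set(new_op.tasks) <= COALESCIBLE_TASKS
    if coalescible and not ordered:
        for op in pending:
            if (op.state == "queued" and op.target == new_op.target
                    and op.tasks == new_op.tasks):
                return op.uuid
    pending.append(new_op)
    return new_op.uuid
```

In this model an op that carries any non-coalescible task is never folded, which mirrors the "entire task list" condition in step 4, and an enqueue with depends_on or runs_after always inserts its own row.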

Audit findings (step 6)

| Site | Pattern | Resolution |
| --- | --- | --- |
| `node_inst_netdesc_op._instance_start` | Loop over `net_desc`; per-interface call to `n.ensure_mesh()` and `n.update_dnsmasq()` | Track `reconciled_network_uuids` in a set; first interface on a network triggers reconciliation, subsequent interfaces on the same network skip. Per-interface work (state flip, floating-IP fan-out) stays inside the loop. Fixed in this commit. |
| `node_inst_op._instance_delete` | Loop over `instance_networks`; one `n.update_dnsmasq()` per network | Already deduplicates network_uuid before entering the loop. No change. |
| External API hot-plug (`external_api/instance.py`) | Single multi-task enqueue `[create_network_node, ensure_mesh]` | Not a fan-out -- one op per hot-plug call. No change. |
| `daemons/network/maintain.py` | Per-network reconciliation enqueues during a 30 s pass | Bounded by per-network in-flight gate (`has_pending_cluster_operation_target`). After step 5, parallel maintainer passes on the same node coalesce; cross-node dedup deliberately disabled because mesh is per-hypervisor. No change. |
| `daemons/queues/startup_tasks.py` | Per-network sequential `n.create_on_hypervisor()` + `n.ensure_mesh()` waits during node startup | `create_on_hypervisor` enqueues `node_net_op.network_apply_create_hypervisor`, which is not currently coalescible. Multiple instances on the same network on the same node fan out at startup. See the follow-up below. |
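
The fix described for `_instance_start` can be sketched as below. This is an illustration under assumptions rather than the committed code: the function name `start_interfaces`, the dict shape of each `net_desc` entry and the `reconcile` and `per_interface` callbacks are hypothetical stand-ins for the real mesh/dnsmasq reconciliation and per-interface work.

```python
def start_interfaces(net_desc, reconcile, per_interface):
    """Reconcile each network once, but run per-interface work every time."""
    reconciled_network_uuids = set()
    for iface in net_desc:
        network_uuid = iface["network_uuid"]
        if network_uuid not in reconciled_network_uuids:
            # First interface seen on this network: in the real code this is
            # where ensure_mesh() and update_dnsmasq() would be requested.
            reconcile(network_uuid)
            reconciled_network_uuids.add(network_uuid)
        # State flips and floating-IP handling stay per-interface.
        per_interface(iface)
    return reconciled_network_uuids
```

With three interfaces spread over two networks, `reconcile` runs twice and `per_interface` runs three times, so the number of enqueued reconciliation ops scales with networks rather than interfaces.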

Follow-ups not landed here

NodeNetOp.network_apply_create_hypervisor coalescing. This task is idempotent (util_concurrency.create_vxlan_interface is "create if missing") and is enqueued per-instance during startup -- so a node restoring N instances on the same network enqueues N node_net_ops where one would do. Marking it coalescible needs the find/claim primitives to filter on both node_uuid and network_uuid (same network on the same node), which the current target_column parameter does not support. Generalising to a list of (column, value) pairs is straightforward; deferred so step 5's CI run gives us measurable baseline numbers first.
Explicit fairness for low-priority queues. The dequeue query honours strict priority order via FIELD(); lower priorities only spill in when higher ones yield fewer rows than limit. Sustained heavy load on user_facing could in principle starve background. The CI signal will tell us whether to add bounded-staleness ordering (ORDER BY CASE WHEN NOW() - created_at > N THEN top ELSE priority END, ...) or a reserved-slot mechanism.
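
To make the starvation concern concrete, here is a small pure-Python model of the two orderings. It is a sketch only: the row dict shape, the function names and the `max_wait` parameter are assumptions, the real ordering is done by the MariaDB query, and the model assumes rows from queues outside `queue_names` are filtered out before ordering.

```python
def strict_order(rows, queue_names, limit):
    """Model of ORDER BY FIELD(queue_name, ...), scheduled_at LIMIT limit."""
    rank = {name: i for i, name in enumerate(queue_names)}
    eligible = [r for r in rows if r["queue_name"] in rank]
    eligible.sort(key=lambda r: (rank[r["queue_name"]], r["scheduled_at"]))
    return eligible[:limit]


def bounded_staleness_order(rows, queue_names, limit, now, max_wait):
    """Same ordering, except rows older than max_wait jump to the top."""
    rank = {name: i for i, name in enumerate(queue_names)}
    eligible = [r for r in rows if r["queue_name"] in rank]

    def key(row):
        stale = now - row["created_at"] > max_wait
        # -1 sorts ahead of every real queue rank, i.e. "top" priority.
        return (-1 if stale else rank[row["queue_name"]], row["scheduled_at"])

    eligible.sort(key=key)
    return eligible[:limit]


rows = [
    {"id": "u1", "queue_name": "user_facing", "created_at": 90, "scheduled_at": 90},
    {"id": "u2", "queue_name": "user_facing", "created_at": 91, "scheduled_at": 91},
    {"id": "u3", "queue_name": "user_facing", "created_at": 92, "scheduled_at": 92},
    {"id": "b1", "queue_name": "background", "created_at": 0, "scheduled_at": 0},
]
queues = ["user_facing", "background"]
strict_ids = [r["id"] for r in strict_order(rows, queues, limit=2)]
fair_ids = [r["id"] for r in bounded_staleness_order(rows, queues, 2, now=100, max_wait=60)]
```

Under strict ordering the batch is filled entirely from user_facing while the old background row keeps waiting; with the staleness override that row is promoted once it has waited longer than `max_wait`.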
