PLAN: Fix cluster_operation_targets UNIQUE constraint¶
Status¶
Proposed. Root cause confirmed from CI artifact bundle for run
25727839217 (Guests job, 2026-05-12), with the diagnostic
instrumentation added in commit 5d0bf73c. See "Evidence" below.
Problem¶
The forbidden-string scanner in CI has been intermittently failing on
Recreating not okay network on hypervisor for months. The post-
86aacb19 change to Network.is_okay() was meant to close the
last_cluster_operation race; commit 5d0bf73c added instrumentation to
distinguish "bridge was torn down unexpectedly" from "bridge create
failed silently". The 2026-05-12 bundle shows neither -- the bridge was
never created on the affected hypervisor in the first place, and
the maintainer was the one creating it for the first time. The race is
between sf-net's maintainer and the node_inst_net_iface_op
(hot-plug interface) cluster operation.
Root cause¶
The cluster_operation_targets table is declared with
unique=True on operation_uuid
(shakenfist/mariadb.py:639-640):
This makes it physically impossible to record more than one target row
per operation. But the design (commit message of 86aacb19 and the
target_fields: ClassVar[dict[str, ObjectType]] declarations on
operation schemas) explicitly requires multiple target rows per op
for multi-target operations.
_direct_create_cluster_operation_target
(shakenfist/mariadb.py:5327-5361) catches IntegrityError and
returns True silently:
except IntegrityError:
# Duplicate operation_uuid — already recorded, which is fine.
# The UNIQUE constraint on operation_uuid ensures idempotency.
return True
So for any op with N>1 targets, only the first target row is
persisted; the remaining N-1 are silently dropped. The comment claims
idempotency; in reality it drops semantically-different rows because
the column the design wants to be unique-per-target-row is
(operation_uuid, target_object_type, target_uuid), not just
operation_uuid.
Affected operation schemas¶
grep -A6 'target_fields: ClassVar' shakenfist/schema/operations/
shows four multi-target ops. For each, only the first declared
target is recorded; the rest are dropped:
| Schema | Targets (dict order) | Recorded | Dropped |
|---|---|---|---|
node_inst_net_iface_op |
instance, network, interface | instance | network, interface |
artifact_fetch_op |
artifact, instance | artifact | instance |
net_iface_ip_op |
network, interface | network | interface |
net_iface_op |
network, interface | network | interface |
The hot-plug op (node_inst_net_iface_op) is the one tripping the CI
guard, because the network target row is the one
Network.is_okay() reads to defer the maintainer. With that row
missing, has_pending_cluster_operation(network) returns False, the
maintainer falls through to is_created(), the bridge is genuinely
absent on the target hypervisor (the hot-plug op hasn't reached
create_on_hypervisor yet), and the forbidden-string event fires.
The bridge is then created idempotently -- first by the maintainer at
poll time, then later (no-op) by the hot-plug op's
create_on_hypervisor once its depends_on net_op completes. There
is no functional breakage of the hot-plug -- only the cosmetic
forbidden-string trip. But the same schema bug would cause real bugs
for is_okay() defers on the other three ops as well.
Evidence (CI run 25727839217)¶
Timeline reconstructed from
bundle-functional-cluster-guests/bundle/t-SqLSw-{3,5}/var/log/syslog:
| Time | Node | What |
|---|---|---|
| 10:46:58.886 | sf-3 | API receives POST /instances/ef251675.../interfaces (hot-plug) |
| 10:46:59.478 | sf-3 | NetworkInterface d17c9638 row written, state=initial |
| 10:47:00.276 | sf-3 | net_op c50f0867 (network_update_dnsmasq) enqueued |
| 10:47:00.986 | sf-3 | node_inst_net_iface_op 7295cbee enqueued, targets={instance,network,interface} |
| 10:47:01.074 | sf-5 | Op 7295cbee execution deferred 15s, waiting_on=[net_op c50f0867] |
| 10:47:13.014 | sf-5 | sf-net maintainer: ip link show br-vxlan-35072f → exit 1 "does not exist" |
| 10:47:13.111 | sf-5 | sf-net emits network not ok, is not created (proves has_pending_cluster_operation returned False) |
| 10:47:13.174 | sf-5 | sf-net emits recreating not okay network on hypervisor -- forbidden string |
| 10:47:13.239 | sf-5 | sf-net runs create_on_hypervisor, bridge created cleanly |
| 10:47:17.002 | sf-5 | Op 7295cbee state→executing |
| 10:47:17.063 | sf-5 | Hot-plug op runs create_on_hypervisor (idempotent, bridge already up) |
| 10:47:17.763 | sf-5 | Op 7295cbee state→complete |
The op was in queued state from 10:47:00.986 through 10:47:17.002 --
a 16-second window in which has_pending_cluster_operation(network)
should have returned True and made Network.is_okay() short-circuit
to True. It returned False. The only consistent explanation is that
the cluster_operation_targets row for target_object_type=network,
target_uuid=88eb4715... was never persisted.
The schema bug (unique=True on operation_uuid) is the proximate
cause: enqueue_cluster_operation iterates target_fields in
declaration order {instance, network, interface}, calls
mariadb.create_cluster_operation_target three times for the same
operation_uuid=7295cbee..., the second and third inserts raise
IntegrityError, and the catch-and-return-True swallows them.
Test gap¶
shakenfist/tests/test_cluster_operation_targets.py:574
(test_hot_plug_enqueue_writes_three_target_rows) passes today
because it mocks mariadb.create_cluster_operation_target directly --
it asserts the function was called three times with the right args
but never exercises the actual DB constraint. The constraint check
must be moved to an integration test against a real engine (or at
minimum an in-memory SQLAlchemy engine with the same table
definition).
Fix¶
Schema change¶
Replace the column-level unique=True on operation_uuid with a
composite uniqueness over the three columns that together identify a
unique target row:
sa.Table(
'cluster_operation_targets', metadata,
sa.Column('sequence_number', sa.BigInteger(),
primary_key=True, autoincrement=True),
sa.Column('operation_uuid', sa.String(36), nullable=False),
sa.Column('operation_type', sa.String(64), nullable=False),
sa.Column('target_object_type', sa.Enum(ObjectType),
nullable=False),
sa.Column('target_uuid', sa.String(36), nullable=False),
sa.Column('created_at', sa.Double(), nullable=False),
sa.UniqueConstraint(
'operation_uuid', 'target_object_type', 'target_uuid',
name='uq_cot_op_target'),
sa.Index('idx_cot_target', 'target_object_type', 'target_uuid'),
sa.Index('idx_cot_operation', 'operation_uuid'),
sa.Index('idx_cot_created', 'created_at'),
)
The new idx_cot_operation preserves the fast lookup that the old
unique-on-operation_uuid constraint was providing as a side effect
(comment at shakenfist/mariadb.py:633).
Migration¶
Bump CLUSTER_OPERATION_TARGETS_VERSION and add a migration step
that:
1. Drops the existing UNIQUE constraint on operation_uuid.
2. Adds the composite UNIQUE on (operation_uuid, target_object_type,
target_uuid).
3. Adds the new non-unique index on operation_uuid.
The existing table will have at most one row per operation_uuid, so
no data fix-up is required. The compound UNIQUE will accept any new
rows after migration.
Insert path¶
Keep the IntegrityError → True early-return for true duplicates (the
composite UNIQUE only fires on actual duplicate (op, target_type,
target_uuid) triples, which is idempotency). Update the comment to
reflect the new constraint.
Backfill for in-flight operations¶
Operations currently in flight at upgrade time will have only one target row each, so multi-target ops in flight at the moment of upgrade will continue to leak through the gate one more time. Two options:
- Do nothing: the next maintainer poll for any affected network will see the bridge missing and recreate it -- exactly the current bug, but only during the upgrade window.
- Replay: on startup of the database daemon after migration, scan
in-flight ops and re-call the
enqueue_cluster_operationtarget- write loop for them. Adds complexity for a one-time window.
Recommend (1) for the upgrade. Document that the forbidden-string trip is expected during the upgrade window of this fix.
Test fix¶
Add an integration-style test in test_cluster_operation_targets.py
that uses a real SQLAlchemy engine (in-memory SQLite is fine for the
shape, MariaDB in CI for the actual constraint) and asserts:
- Enqueuing
node_inst_net_iface_opresults in three rows incluster_operation_targets, one per declared target. has_pending_cluster_operation_target(NETWORK, network_uuid)returns True while the op is queued/preflight/executing.- Replays of the same op (idempotency check) do not produce duplicate rows -- the composite UNIQUE handles this.
The existing mock-based test is fine to keep as a unit test of the
schema's target_fields declaration and the call-site fan-out; mark
it as such in the docstring.
Why this didn't show up earlier¶
- Single-target ops (the vast majority) are unaffected. The four multi-target ops are all relatively recent additions in the cluster_operation_targets era.
- For
node_inst_net_iface_opspecifically, the instance target is recorded -- so any gate that reads the instance's pending-op list sees the op correctly. The bug is invisible from the instance side and only manifests from the network/interface side. - The unit test asserts call counts, which pass regardless of what the DB does with the writes.
Risk¶
Low for the fix itself: removing an overly-strict constraint with a correct replacement, idempotency preserved, no callers depended on the old constraint outside the DB write helper.
The only callers of cluster_operation_targets that read rows are:
get_latest_cluster_operation_target-- unaffected, picks one row.has_pending_cluster_operation_target-- unaffected, EXISTS query.get_cluster_operation_targets_for_object-- already designed to return multiple rows (per-object history).delete_cluster_operation_targets_for_object-- unaffected.delete_stale_cluster_operation_targets-- unaffected.
None assume one-row-per-operation_uuid; they all key on target_uuid
or operation_uuid and tolerate multiple rows.
Out of scope¶
- The 5d0bf73c instrumentation can stay for now; once this fix is in, the next failure (if any) will be a genuinely different race and the debug events still help.
- The TOCTOU race in
_instance_delete(host_networks snapshot vsdelete_on_hypervisor) that I initially suspected is not the cause here, but it remains theoretically possible. Leave that for a separate plan if it ever fires under the new constraint.