
Phase 3 — prune sweep in the cluster daemon

Parent plan: PLAN-eventlog-direct-mariadb.md. Predecessors: Phase 1, Phase 2.

Scope

Phase 3 moves the per-event-type prune sweep from sf-eventlog into the cluster daemon's existing scheduled-task loop, where it runs daily. After this phase ships, the cluster maintainer (one node at a time, behind the cluster ClusterLock) ages rows out of the new events / event_objects tables according to the existing eight MAX_{TYPE}_EVENT_AGE configs: the seven per-event-type settings plus MAX_API_REQUEST_EVENT_AGE. sf-eventlog's own per-sqlite prune loop still runs against the (now stale) sqlite chunks and is harmless; it is deleted in phase 5.

The phase covers:

  • Three new _direct_prune_* functions on sf-database for the per-event-type prune, the api-request object-type override, and the orphan-events sweep.
  • Two new Prometheus counters: database_events_pruned_total{event_type=...} for the per-type and api-request sweeps, and database_orphan_events_pruned_total for the post-sweep orphan cleanup. Both live at module scope in mariadb.py and are picked up automatically on sf-database's existing metrics endpoint via the shared default registry.
  • One new sf-database RPC: PruneEvents. Single RPC rather than three because the orchestration (seven event_types + api-request + orphan sweep, each internally batched) is simpler inside the direct function than it is over the wire, and the prune is a once-a-day call where "long RPC" is acceptable.
  • A new scheduled task scheduled_tasks.prune_events running schedule.every(1).days.do(...) from the cluster daemon's main loop, gated behind the cluster election lock that the maintainer already holds.
  • Multi-object retention semantics preserved exactly per the master plan's decision 4: an events row is deleted only once its last event_objects row has been pruned. An event touching N objects survives until all N objects' retention windows have dropped it.

Out of scope (deferred):

  • The read RPC and REST cut-over (phase 4).
  • Deleting sf-eventlog and the DLQ (phase 5).
  • Docs (phase 6).
  • sf-eventlog's own per-sqlite prune loop. It keeps running against the now-stale sqlite chunks until phase 5 deletes it; harmless.

Prune semantics

Each daily sweep has three logical stages. Stages A and B both delete event_objects rows and can run in either order; stage C must run after both, because it cleans up events that have lost their last reference.

Stage A — per event_type prune. For each of the seven event_type configs:

MAX_AUDIT_EVENT_AGE      90 d
MAX_MUTATE_EVENT_AGE     90 d
MAX_STATUS_EVENT_AGE      7 d
MAX_USAGE_EVENT_AGE      30 d
MAX_RESOURCES_EVENT_AGE   7 d
MAX_PRUNE_EVENT_AGE      30 d
MAX_HISTORIC_EVENT_AGE   90 d

Delete every event_objects row whose event has event_type = X and timestamp < now - MAX_X_EVENT_AGE. A MAX_X_EVENT_AGE = -1 config disables that type's prune, mirroring the existing eventlog daemon's behaviour.
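The cutoff rule can be sketched as below. This is a minimal illustration only: it assumes ages are configured in days (as in the table above) and that timestamps are Unix epoch seconds, neither of which this plan states explicitly. The helper name cutoff_for is hypothetical.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def cutoff_for(max_age_days: int, now: Optional[float] = None) -> Optional[float]:
    """Return the timestamp below which rows are prunable.

    Returns None when the type's prune is disabled (max age of -1).
    Assumes day-granularity config and epoch-second timestamps.
    """
    if max_age_days == -1:
        return None
    if now is None:
        now = time.time()
    return now - max_age_days * SECONDS_PER_DAY
```

A caller would skip the stage A loop entirely whenever this returns None, rather than passing a sentinel cutoff into the SQL.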

Stage B — api-request object-type override. The existing eventlog daemon special-cases events tied to api-request objects: regardless of event_type, they age out at MAX_API_REQUEST_EVENT_AGE (1 day). In the two-table model this becomes a prune of event_objects rows where object_type = 'api-request' and the referenced event is older than MAX_API_REQUEST_EVENT_AGE. An event tied to both an api-request and an instance loses its api-request reference after 1 day but the event row stays alive (and visible from the instance's stream) until the instance reference is dropped by stage A.

Stage C — orphan events sweep. After stages A and B, delete events rows whose event_uuid is no longer referenced by any event_objects row. This is the rule that gives the "delete event only once its last object reference is gone" semantics.
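To make the interaction between the three stages concrete, here is a small in-memory simulation. It is a sketch, not the implementation: the dict and set stand in for the events and event_objects tables, timestamps are expressed in days for readability, and the function names are hypothetical. It replays the example from stage B, an audit event tied to both an api-request and an instance.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory stand-ins for the two tables.
Events = Dict[str, Tuple[str, float]]      # event_uuid -> (event_type, timestamp)
EventObjects = Set[Tuple[str, str, str]]   # (object_type, object_uuid, event_uuid)


def stage_a(events: Events, refs: EventObjects,
            event_type: str, cutoff: float) -> int:
    """Drop references whose event has this type and predates the cutoff."""
    doomed = {r for r in refs
              if r[2] in events
              and events[r[2]][0] == event_type
              and events[r[2]][1] < cutoff}
    refs -= doomed
    return len(doomed)


def stage_b(events: Events, refs: EventObjects, cutoff: float) -> int:
    """Drop api-request references whose event predates the cutoff."""
    doomed = {r for r in refs
              if r[0] == 'api-request'
              and r[2] in events
              and events[r[2]][1] < cutoff}
    refs -= doomed
    return len(doomed)


def stage_c(events: Events, refs: EventObjects) -> int:
    """Drop events that no remaining reference points at."""
    referenced = {r[2] for r in refs}
    orphans = [uuid for uuid in events if uuid not in referenced]
    for uuid in orphans:
        del events[uuid]
    return len(orphans)


def sweep(events: Events, refs: EventObjects, now_days: float):
    """One daily sweep, audit events only, with 90 d and 1 d retention."""
    a = stage_a(events, refs, 'audit', now_days - 90)
    b = stage_b(events, refs, now_days - 1)
    c = stage_c(events, refs)
    return a, b, c


events = {'e1': ('audit', 98.0)}
refs = {('api-request', 'r1', 'e1'), ('instance', 'i1', 'e1')}

first = sweep(events, refs, now_days=100.0)   # event is 2 days old
alive_after_first = 'e1' in events
second = sweep(events, refs, now_days=190.0)  # event is 92 days old
```

In the first sweep only the api-request reference is removed and the event survives through its instance reference; in the second sweep stage A removes the instance reference and stage C then removes the event itself.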

SQL design

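All three stages use a paged LIMIT-and-loop pattern to avoid long lock holds on what is potentially the largest table on the database node. A batch size of 10000 matches the existing eventlog daemon's per-sweep cap and is small enough to commit in a fraction of a second on a healthy InnoDB, even when the rows being deleted are spread across many pages. One caveat for step 3a: the statements below use the multi-table DELETE form, which MariaDB has historically not allowed to carry a LIMIT clause, so confirm that the deployed server version accepts them or rewrite each as a single-table DELETE with a subquery.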

Stage A

DELETE eo
FROM event_objects eo
JOIN events e ON eo.event_uuid = e.event_uuid
WHERE e.event_type = %s
  AND e.timestamp < %s
LIMIT 10000;

Drives off the (event_type, timestamp) index on events for the cutoff scan, joins to event_objects via its secondary (event_uuid) index, and deletes the matching event_objects rows. Loop until rowcount < 10000.
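The "loop until rowcount < 10000" rule is the same for all three stages, so it can be sketched once. This is an illustrative helper, not code from the plan: run_batch is a hypothetical callable that executes one batched DELETE, commits, and returns the rowcount, and on_batch is a hypothetical hook where a per-batch counter increment could go.

```python
from typing import Callable

BATCH_SIZE = 10000


def delete_in_batches(run_batch: Callable[[], int],
                      batch_size: int = BATCH_SIZE,
                      on_batch: Callable[[int], None] = lambda n: None) -> int:
    """Call run_batch until it deletes fewer rows than batch_size.

    A short batch means the statement ran out of matching rows, so the
    loop stops without issuing a further, empty DELETE.
    """
    total = 0
    while True:
        deleted = run_batch()
        total += deleted
        on_batch(deleted)
        if deleted < batch_size:
            return total
```

Note that when the number of matching rows is an exact multiple of the batch size, the loop issues one extra statement that deletes zero rows before terminating.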

Stage B

DELETE eo
FROM event_objects eo
JOIN events e ON eo.event_uuid = e.event_uuid
WHERE eo.object_type = 'api-request'
  AND e.timestamp < %s
LIMIT 10000;

Drives off the event_objects PK prefix (object_type, ...) for the api-request filter and joins to events for the cutoff. Loop until rowcount < 10000.

Stage C

DELETE e
FROM events e
LEFT JOIN event_objects eo ON e.event_uuid = eo.event_uuid
WHERE eo.event_uuid IS NULL
LIMIT 10000;

Anti-join via the events PK and the event_objects secondary (event_uuid) index. Loop until rowcount < 10000.

Why no FK CASCADE

The phase 1 schema deliberately omitted the foreign key between event_objects.event_uuid and events.event_uuid, matching the project convention on object_states / object_metadata / cluster_operation_targets. With no FK there is no cascade and the two-stage delete is explicit. The benefit: stages A and B don't pay any per-row FK check cost, and stage C can be a clean anti-join without the referential-integrity overhead.

Counter design

Two counters at module scope in mariadb.py, mirroring the phase 2d EVENTS_INSERTED pattern:

EVENTS_PRUNED = Counter(
    'database_events_pruned_total',
    'Event-object rows pruned, by event type (and the '
    'synthetic "api-request" type for the object-type '
    'override sweep).',
    ['event_type']
)

ORPHAN_EVENTS_PRUNED = Counter(
    'database_orphan_events_pruned_total',
    'Events rows pruned because no event_objects row '
    'referenced them.'
)

Incremented inside the respective _direct_prune_* functions by the rowcount of each successful batch. The api-request override increments EVENTS_PRUNED.labels(event_type='api-request'), which is a synthetic label value (not a real event_type) chosen so operators can distinguish object-type-override prunes from regular per-type prunes in the same metric.

Counters are registered on every daemon that imports mariadb, but only move on sf-database (the only daemon that runs the direct prune). Other daemons report them at zero — harmless.

RPC and accessor design

One new RPC on protos/database.proto:

message PruneEventsRequest {
  // Empty -- the cluster daemon decides cadence;
  // sf-database decides per-type retention from config.
}

message PruneEventsReply {
  bool success = 1;
  string error = 2;
  int64 rows_pruned = 3;
}

rpc PruneEvents (PruneEventsRequest) returns (PruneEventsReply) {}

The single RPC orchestrates all three stages inside the direct function. The cluster maintainer doesn't see the internal batching — it gets back a total rows-pruned count for the daily summary log line. RPC timeout on the client side is generous (5 min) because the daily prune is allowed to be slow.

Three-layer accessor stack in mariadb.py (direct functions, gRPC wrapper, public dispatcher):

  • _direct_prune_events_by_type(event_type, max_age) -> int runs stage A's loop for one event_type, increments EVENTS_PRUNED.labels(event_type=event_type) by the per-batch rowcount, returns total deleted.
  • _direct_prune_api_request_events(max_age) -> int runs stage B's loop, increments EVENTS_PRUNED.labels(event_type='api-request'), returns total deleted.
  • _direct_prune_orphan_events() -> int runs stage C's loop, increments ORPHAN_EVENTS_PRUNED, returns total deleted.
  • _direct_prune_events() -> int is the orchestrator: iterates the seven event_types from config, calls each stage, sums the rows, returns the total.
  • _grpc_prune_events() -> int marshals to PruneEventsRequest, calls stub.PruneEvents(request, timeout=300.0), returns reply.rows_pruned.
  • Public prune_events() -> int routes via _use_database_service().
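The orchestration described for _direct_prune_events can be sketched as follows. The stage functions are injected as callables so the sketch stays self-contained; prune_all and its parameter names are hypothetical, and reading the ages from a plain dict is an assumption standing in for the real config lookup.

```python
from typing import Callable, Dict

EVENT_TYPES = ('audit', 'mutate', 'status', 'usage',
               'resources', 'prune', 'historic')


def prune_all(max_ages: Dict[str, int],
              api_request_max_age: int,
              prune_by_type: Callable[[str, int], int],
              prune_api_requests: Callable[[int], int],
              prune_orphans: Callable[[], int]) -> int:
    """Run stages A, B, and C in order and return the total rows removed."""
    total = 0

    # Stage A: one loop per event_type, skipping disabled types.
    for event_type in EVENT_TYPES:
        max_age = max_ages[event_type]
        if max_age == -1:
            continue
        total += prune_by_type(event_type, max_age)

    # Stage B: the plan does not say whether -1 applies to the
    # api-request override, so this sketch always runs it.
    total += prune_api_requests(api_request_max_age)

    # Stage C: must come last so it sees references dropped by A and B.
    total += prune_orphans()
    return total
```

Keeping the stage order inside one function is what lets the RPC stay a single call: the caller never needs to know that stage C depends on A and B having finished.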

Per the phase 1 / phase 2 pattern: the proto RPC's abstract method on the daemon servicer must land in the same commit as the proto change, or pre-commit's mypy hook fails.

Cluster daemon wiring

A new function in shakenfist/daemons/cluster/scheduled_tasks.py:

from shakenfist import mariadb

def prune_events() -> None:
    """Daily prune sweep of the events / event_objects tables.

    Runs on the elected cluster maintainer. Drives the
    three-stage prune described in
    docs/plans/PLAN-eventlog-direct-mariadb-phase-03-prune.md.
    """
    try:
        rows = mariadb.prune_events()
        LOG.info(f'Events prune sweep removed {rows} rows.')
    except Exception as e:
        LOG.warning(f'Events prune sweep failed: {e}')

Wired in shakenfist/daemons/cluster/main.py alongside the existing schedule.every(...).do(...) calls (currently around lines 427-436):

schedule.every(1).days.do(scheduled_tasks.prune_events)

The cluster maintainer election lock guarantees only one node runs this per day. If the maintainer loses the election mid-prune, the in-flight RPC completes against sf-database (the work is done) and the cluster daemon's next loop iteration enters _await_election cleanly. Worst case under a maintainer flap: a backup takes over and re-runs the prune later the same day — the per-batch DELETEs are idempotent so the second run is just a fast no-op.

Step plan

| Step | Effort | Model | Isolation | Brief for sub-agent |
| --- | --- | --- | --- | --- |
| 3a | high | opus | none | Add the three _direct_prune_* functions plus the orchestrator _direct_prune_events in shakenfist/mariadb.py, mirroring the placement and pattern of _direct_delete_stale_cluster_operation_targets (currently around lines 6004-6050). Per-type and api-request stages use the paged LIMIT 10000 loop with the SQL from the "SQL design" section of the phase 3 plan. Orphan sweep uses the LEFT JOIN anti-join. All three increment their respective Counter (EVENTS_PRUNED for per-type and api-request, ORPHAN_EVENTS_PRUNED for orphan) at module scope; define those Counters near the existing EVENTS_INSERTED Counter (added in phase 2d). The orchestrator iterates the seven event_types from config (audit, mutate, status, usage, resources, prune, historic), skipping any with MAX_X_EVENT_AGE = -1, plus the api-request stage and the orphan sweep. Returns the total row count. Wrap each _direct_prune_* with try/except OperationalError per the existing prune patterns; a DB error returns the partial count rather than raising. Commit message subject: "mariadb: per-type, api-request, and orphan prune helpers." |
| 3b | high | opus | none | Add the PruneEvents RPC to protos/database.proto, add the PruneEventsRequest / PruneEventsReply messages following the placement of the existing event-related messages from phase 1 step 1b. Then run tox -e genprotos. In the same commit (per the phase 1/2 lesson: abstract methods on the proto require their handlers to land together for mypy), add the PruneEvents handler in shakenfist/daemons/database/main.py mirroring RecordEventBatch (currently around lines 4237-4280). The handler increments self.monitor.counters['prune_events'], calls mariadb._direct_prune_events(), returns PruneEventsReply(success=True, error='', rows_pruned=rows). Register 'prune_events' in the Monitor operations list. Also add _grpc_prune_events and the public prune_events in shakenfist/mariadb.py mirroring _grpc_record_event_batch / record_event_batch. The gRPC call uses timeout=300.0. Run pre-commit run --all-files; it must be fully green. Commit message subject: "database: PruneEvents RPC, handler, and dispatcher." |
| 3c | medium | opus | none | Cluster daemon wiring in shakenfist/daemons/cluster/scheduled_tasks.py and shakenfist/daemons/cluster/main.py. Add prune_events() to scheduled_tasks.py as described in the "Cluster daemon wiring" section of the phase 3 plan. Add schedule.every(1).days.do(scheduled_tasks.prune_events) to main.py's schedule registration block (currently around lines 427-436). Read both files first to confirm the exact placement and import style. The function logs the row count at info; an exception is caught and logged at warning so a transient DB failure doesn't kill the maintainer. Commit message subject: "cluster: schedule daily events prune." |
| 3d | medium | sonnet | none | Tests for phase 3. Add to shakenfist/tests/test_events_storage.py: (i) _direct_prune_events_by_type deletes only matching event_type rows older than cutoff, leaves newer ones, returns the correct count, increments the labeled counter; (ii) multi-object semantics: an event with three object refs, of which one is aged out by stage A and two by separate stages, ultimately gets the event row deleted in stage C; (iii) _direct_prune_api_request_events deletes only the api-request object_type rows; (iv) _direct_prune_orphan_events deletes events with no remaining event_objects rows; (v) orchestrator skips event_types whose MAX_*_EVENT_AGE is -1; (vi) Counter deltas (use the read-before / compare-after pattern from the phase 2d EVENTS_INSERTED tests). Use the same mock-engine/connection pattern from existing tests where possible; for the multi-stage semantic test you may need a deeper mock that lets the rowcount-driven loop terminate. Also add an integration-ish test for scheduled_tasks.prune_events calling mariadb.prune_events and logging the result. Run tox -e py3 and pre-commit run --all-files. Commit message subject: "tests: phase 3 prune coverage." |
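For step 3d's "deeper mock that lets the rowcount-driven loop terminate", one possible shape is sketched below using only unittest.mock. Everything here is an assumption for illustration: prune_loop is a hypothetical stand-in for one _direct_prune_* loop, and the connection/cursor interface is a generic DB-API style rather than the project's actual mock-engine pattern, which should take precedence where it exists.

```python
from unittest import mock

BATCH_SIZE = 10000


def prune_loop(connection, sql: str, params: tuple) -> int:
    """Hypothetical stand-in for one paged prune loop under test."""
    total = 0
    while True:
        cursor = connection.cursor()
        cursor.execute(sql, params)
        deleted = cursor.rowcount
        connection.commit()
        total += deleted
        if deleted < BATCH_SIZE:
            return total


def make_connection(rowcounts):
    """Mock connection whose successive cursors report the given rowcounts."""
    connection = mock.MagicMock()
    # Each call to connection.cursor() returns the next mock cursor in
    # the list, so the final short batch is what ends the loop.
    connection.cursor.side_effect = [
        mock.MagicMock(rowcount=n) for n in rowcounts
    ]
    return connection


connection = make_connection([10000, 10000, 42])
total = prune_loop(connection, 'DELETE ... LIMIT 10000', ('audit', 0.0))
```

Because the rowcount sequence ends with a value below the batch size, the loop terminates after exactly three statements; a sequence without a short final batch would exhaust the side_effect list and raise StopIteration, which is a useful failure mode in a test.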

Ordering:

  • 3a is independent; ships alone.
  • 3b depends on 3a (its handler calls _direct_prune_events). Self-contained commit because proto+handler must land together.
  • 3c depends on 3b (its scheduled task calls mariadb.prune_events).
  • 3d depends on all of the above.

Per the phase 1/2 lesson: management session runs pre-commit run --all-files between every commit so cross-file mypy issues surface immediately. If any single step trips mypy on the abstract-method front, fold it into the predecessor commit rather than shipping a broken intermediate.

Risks and mitigations

  • Risk: A single batched DELETE locks event_objects long enough to delay incoming RecordEventBatch writes from the drainer. Mitigation: 10000-row batches commit in well under a second on InnoDB at the cluster sizes Shaken Fist targets. The per-row write rate from the drainer (50-100 events per batch, hundreds of events per second peak across the cluster) is much smaller than the per-batch delete rate. If contention surfaces in production, the batch size is a single-line tuning knob.

  • Risk: The api-request override drops the wrong rows because of subtle differences from the old per-sqlite-chunk semantics. Mitigation: Phase 3 step 3d test (iii) exercises the override explicitly. The new semantics ("object_type='api-request' row drops at MAX_API_REQUEST_EVENT_AGE regardless of event_type") is a one-line specification, which is easier to reason about than the old per-event branching.

  • Risk: Stage C anti-join is slow on a large events table. Mitigation: The LEFT JOIN uses the event_objects.event_uuid secondary index and the events.event_uuid primary key — both covering. The daily cadence means stage C runs against at most ~yesterday's worth of new orphans (the prior day's prune already cleared everything older).

  • Risk: The cluster maintainer loses the lock mid-prune. The RPC continues on sf-database; the cluster daemon re-enters election and a backup picks up, then re-runs the prune later in the day. Mitigation: Per-batch DELETEs are idempotent; re-running is a fast no-op against rows already gone. Because the Prometheus counters advance by rowcount, a no-op re-run adds nothing to them; only the per-call prune_events monitor counter ticks a second time, which an operator sees as a slightly noisy rate but not as incorrect data. The next normal sweep cycle resumes the expected cadence.

  • Risk: Phase 3 ships but sf-eventlog's own prune loop is still running against the (now stale) sqlite chunks. Two prune loops, but they target different storage. Mitigation: This is intentional. The sf-eventlog loop is harmless work against data nothing reads. Phase 5 deletes the loop along with the daemon.

Definition of done

  • database_events_pruned_total{event_type=...} and database_orphan_events_pruned_total are visible on sf-database's metrics endpoint.
  • A manual integration test passes: insert N events across multiple event_types with a small MAX_*_EVENT_AGE, wait a day or fast-forward the clock, and observe that the sweep deletes the expected rows.
  • The multi-object semantic test (3d ii) passes against a real MariaDB in CI.
  • pre-commit run --all-files is clean.
  • Each commit is self-contained; commit messages follow project conventions including the Prompt paragraph and Co-Authored-By line with model and effort.

Back brief

Before executing any step of this phase, the implementing sub-agent should back-brief the management session on its understanding of the brief and the surrounding context.
