Events¶
Shaken Fist records audit, status, mutation, and usage events for
every object in the cluster. Events are written from every daemon
via a local disk-backed spool, drained in batches to sf-database
via gRPC, and stored in two MariaDB tables (events and
event_objects). REST reads come from sf-database via the
GetObjectEvents RPC and can be served by any sf-api node.
The daily prune runs on the elected cluster maintainer.
Write path¶
Each Shaken Fist daemon process holds its own sqlite spool file at:
When code calls eventlog.add_event() or eventlog.add_event_multi()
the event is written synchronously to the local spool (a single
cheap sqlite insert, sub-millisecond) and the call returns. A
background drainer thread (shakenfist.eventlog_drainer) picks
batches of up to 100 rows off the spool every 100 ms and writes
them to MariaDB via mariadb.record_event_batch.
On the database daemon itself record_event_batch writes directly
to MariaDB. On every other daemon it routes through the database
gRPC interface to sf-database.
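The enqueue-then-drain flow described above can be sketched as follows. This is a minimal illustration, not the Shaken Fist implementation: the table layout, the `open_spool`, `enqueue`, and `drain_once` signatures, and the `send_batch` callable (standing in for mariadb.record_event_batch) are all assumptions made for the example.

```python
import json
import sqlite3
import time

DRAIN_BATCH_SIZE = 100  # batch size quoted in the text above


def open_spool(path=':memory:'):
    # Hypothetical single-table spool layout; the real schema may differ.
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS spool ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)')
    conn.commit()
    return conn


def enqueue(conn, object_type, object_uuid, event_type, message):
    # One local insert, then control returns to the caller.
    payload = json.dumps({
        'object_type': object_type,
        'object_uuid': object_uuid,
        'event_type': event_type,
        'message': message,
        'timestamp': time.time(),
    })
    conn.execute('INSERT INTO spool (payload) VALUES (?)', (payload,))
    conn.commit()


def drain_once(conn, send_batch):
    # Take the oldest rows, hand them to send_batch, and delete them only
    # after send_batch returns without raising. A failed send leaves the
    # rows in place for a later attempt.
    rows = conn.execute(
        'SELECT id, payload FROM spool ORDER BY id LIMIT ?',
        (DRAIN_BATCH_SIZE,)).fetchall()
    if not rows:
        return 0
    send_batch([json.loads(payload) for _, payload in rows])
    conn.execute('DELETE FROM spool WHERE id <= ?', (rows[-1][0],))
    conn.commit()
    return len(rows)
```

Deleting rows only after the send succeeds is what lets the spool absorb an sf-database outage: the caller has already returned, and the rows simply wait on disk.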
Drainer cadence and backoff¶
The drainer runs at DRAIN_POLL_INTERVAL = 0.1 s (100 ms). When
the spool is empty it sleeps between polls. When the spool has
rows the drainer sends back-to-back batches of
DRAIN_BATCH_SIZE = 100 rows until the spool drains, so a burst
of events is on the wire within a second or two under normal
conditions.
If sf-database refuses the batch or the call times out, the batch
is left in the spool and retried on the next drain tick. The
backoff schedule (initial 0.5 s, doubling up to 30 s, resetting on
any success) keeps the drainers from hammering the network while
the database service is down.
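The backoff schedule can be illustrated with a small helper. The `Backoff` class and its method names are hypothetical; only the numbers (0.5 s initial, doubling, 30 s cap, reset on success) come from the text above.

```python
INITIAL_BACKOFF = 0.5  # seconds, from the schedule above
MAX_BACKOFF = 30.0     # seconds, from the schedule above


class Backoff:
    """Doubling delay that resets after any successful send."""

    def __init__(self, initial=INITIAL_BACKOFF, maximum=MAX_BACKOFF):
        self.initial = initial
        self.maximum = maximum
        self.current = 0.0  # no delay until the first failure

    def failure(self):
        # The first failure waits `initial`; each later failure doubles
        # the delay, capped at `maximum`.
        if self.current == 0.0:
            self.current = self.initial
        else:
            self.current = min(self.current * 2, self.maximum)
        return self.current

    def success(self):
        # Any successful send clears the delay.
        self.current = 0.0
```

With these values, consecutive failures wait 0.5, 1, 2, 4, 8, 16, and then 30 seconds, and stay at 30 seconds until a send succeeds.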
High-water mark¶
The spool is bounded by SPOOL_HIGH_WATER_MARK = 100 000 rows
(roughly 50 MiB on disk). When the cap is reached, incoming
enqueue() calls drop the event silently rather than blocking
the caller. Each dropped event advances the
eventlog_spool_dropped_total counter. The
eventlog_spool_depth gauge shows how many rows are currently
pending.
Monitor eventlog_spool_depth proactively. A depth that climbs
toward 100 000 means sf-database has been unreachable for an
extended period or the cluster is producing events faster than
the drainer can flush.
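The drop-on-full behaviour can be sketched with an in-memory stand-in for the spool. The `BoundedSpool` class is hypothetical and uses a deque rather than sqlite; its `dropped_total` and `depth` attributes stand in for the eventlog_spool_dropped_total counter and the eventlog_spool_depth gauge.

```python
import collections

SPOOL_HIGH_WATER_MARK = 100_000  # cap quoted in the text above


class BoundedSpool:
    """In-memory stand-in for the spool, used only to show the cap."""

    def __init__(self, high_water_mark=SPOOL_HIGH_WATER_MARK):
        self.high_water_mark = high_water_mark
        self.rows = collections.deque()
        self.dropped_total = 0  # stands in for eventlog_spool_dropped_total

    @property
    def depth(self):
        # Stands in for eventlog_spool_depth.
        return len(self.rows)

    def enqueue(self, event):
        # At the cap the event is dropped and counted; the caller is
        # never blocked.
        if self.depth >= self.high_water_mark:
            self.dropped_total += 1
            return False
        self.rows.append(event)
        return True
```

The return value is a convenience of this sketch. The text above describes the real drop as silent, so callers should rely on the counter rather than on any signal from enqueue().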
Orphan spool recovery¶
On startup the spool module scans for leftover spool files from previously-dead PIDs. Their rows are migrated into the fresh-pid spool so the drainer picks them up automatically. No operator action is needed.
Read path¶
Five REST endpoints return events, one for each object type:
| Endpoint | Object type |
|---|---|
| GET /instances/<ref>/events | Instance |
| GET /artifacts/<ref>/events | Artifact |
| GET /networks/<ref>/events | Network |
| GET /nodes/<node>/events | Node |
| GET /blobs/<uuid>/events | Blob |
All endpoints accept an optional limit query parameter
(default 100). The server enforces a hard cap of 1000 rows
per request; a limit of 0 or any negative value is treated
as the default 100.
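The limit rules above reduce to a small clamp. The function name `effective_limit` is hypothetical; the default, the cap, and the treatment of zero and negative values come from the text.

```python
DEFAULT_LIMIT = 100  # used when limit is absent, zero, or negative
MAX_LIMIT = 1000     # hard cap enforced by the server


def effective_limit(requested=None):
    # Absent, zero, and negative values all fall back to the default.
    if requested is None or requested <= 0:
        return DEFAULT_LIMIT
    # Anything above the cap is reduced to the cap.
    return min(requested, MAX_LIMIT)
```

A client that wants more than 1000 events therefore cannot get them in a single request by raising limit.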
Events are returned newest-first. Each event in the response includes at minimum:
| Field | Description |
|---|---|
| event_uuid | Unique identifier for the event row |
| request_id | HTTP request ID of the API call that caused the event, if available; null otherwise |
| timestamp | Unix timestamp (float) when the event was recorded |
| event_type | String label for the event category |
| message | Human-readable description |
| extra | Dict of additional structured fields |
| node | Name of the node that emitted the event |
Note: the event_uuid field was named correlation_id in
releases prior to phase 4 of the events-on-MariaDB migration.
Clients that introspect either key need updating; clients that
pass through the response dict opaquely need no change.
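A client that has to work against both older and newer clusters could read the identifier defensively. This is a sketch only; the helper name `event_id` is hypothetical and the event is assumed to be an already-decoded response dict.

```python
def event_id(event):
    # Prefer the current key; fall back to the pre-rename key so the same
    # client code works against older releases.
    if 'event_uuid' in event:
        return event['event_uuid']
    return event.get('correlation_id')
```

The helper returns None when neither key is present, which lets the caller decide whether a missing identifier is an error.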
Retention¶
Event rows are pruned daily by scheduled_tasks.prune_events,
running under ClusterLock election on the cluster maintainer.
The prune runs in three stages:
- Per-event-type sweep. Removes event_objects rows older than the configured age for each event type. The events row is not yet deleted here.
- api-request object-type override sweep. The api-request object type accumulates very verbose events. A second sweep removes event_objects rows associated with api-request objects that are older than MAX_API_REQUEST_EVENT_AGE, regardless of the event's own type. This truncates the API request history much more aggressively than the per-type caps.
- Orphan events sweep. Removes any events rows that are no longer referenced by any event_objects row. This is the step that actually deletes the event content.
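The three stages can be sketched against a toy schema. Everything here is an assumption made for illustration: the column names, the join on event_uuid, the `make_schema` and `prune` helpers, and the use of sqlite3 in place of MariaDB so that the example runs without a database server.

```python
import sqlite3


def make_schema(conn):
    # Hypothetical two-table layout; the real MariaDB schema may differ.
    conn.executescript('''
        CREATE TABLE events (
            event_uuid TEXT PRIMARY KEY,
            event_type TEXT NOT NULL,
            timestamp REAL NOT NULL);
        CREATE TABLE event_objects (
            event_uuid TEXT NOT NULL,
            object_type TEXT NOT NULL,
            object_uuid TEXT NOT NULL);
    ''')


def prune(conn, now, max_age_by_type, max_api_request_age):
    # Stage 1: per-event-type sweep. Only event_objects rows are removed.
    for event_type, max_age in max_age_by_type.items():
        conn.execute(
            'DELETE FROM event_objects WHERE event_uuid IN ('
            'SELECT event_uuid FROM events '
            'WHERE event_type = ? AND timestamp < ?)',
            (event_type, now - max_age))

    # Stage 2: api-request override, regardless of the event's own type.
    conn.execute(
        "DELETE FROM event_objects WHERE object_type = 'api-request' "
        'AND event_uuid IN ('
        'SELECT event_uuid FROM events WHERE timestamp < ?)',
        (now - max_api_request_age,))

    # Stage 3: orphan sweep. This is where event content is deleted.
    cur = conn.execute(
        'DELETE FROM events WHERE event_uuid NOT IN ('
        'SELECT event_uuid FROM event_objects)')
    conn.commit()
    return cur.rowcount  # number of events rows removed
```

Because the first two stages touch only event_objects, an event linked to several objects loses one link at a time and is deleted only once the orphan sweep finds no remaining links.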
Retention configuration¶
Retention ages are in seconds. The defaults below are the
cluster-wide defaults; override any of them in /etc/sf/config
or as environment variables.
| Config key | Default | Description |
|---|---|---|
| MAX_AUDIT_EVENT_AGE | 7 776 000 (90 days) | Audit events |
| MAX_MUTATE_EVENT_AGE | 7 776 000 (90 days) | Mutation events |
| MAX_STATUS_EVENT_AGE | 604 800 (7 days) | Status events |
| MAX_USAGE_EVENT_AGE | 2 592 000 (30 days) | Usage events |
| MAX_RESOURCES_EVENT_AGE | 604 800 (7 days) | Resource events |
| MAX_PRUNE_EVENT_AGE | 2 592 000 (30 days) | Prune events |
| MAX_HISTORIC_EVENT_AGE | 7 776 000 (90 days) | Historic events |
| MAX_API_REQUEST_EVENT_AGE | 86 400 (1 day) | api-request object override |
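Since the settings are expressed in seconds, it can help to compute an override value rather than type it by hand. The helper below is a hypothetical convenience, not part of Shaken Fist.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400


def days_to_seconds(days):
    # Convert a retention period in days to the seconds value that the
    # MAX_*_EVENT_AGE settings expect.
    return int(days * SECONDS_PER_DAY)
```

For example, days_to_seconds(90) gives 7776000, matching the default for audit events in the table above.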
Multi-object retention semantics¶
A single events row can be referenced by event_objects rows
for multiple objects (for example, a network-interface event is
associated with both the interface and its parent network). The
events row survives until its last referencing
event_objects row is pruned. This means an event that carries
a long-retention type tied to a short-retention object remains
visible from the long-retention object's event stream until that
object's retention window closes.
Object hard-delete cleanup¶
When a DatabaseBackedObject is hard-deleted, hard_delete()
calls mariadb.delete_object_events, which issues the
DeleteObjectEvents gRPC RPC. That RPC removes every
event_objects row for the deleted object's UUID. The events
row itself is not deleted immediately: if another object still
references it (because it was a multi-object event) it remains
visible from that other object's event stream. If no other object
references the events row it becomes an orphan, and the next
daily orphan sweep removes it.
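The interaction between hard-delete cleanup and the orphan sweep can be sketched as follows. The schema, the helper names other than `delete_object_events`, and the use of sqlite3 in place of MariaDB and gRPC are assumptions made so that the example is self-contained.

```python
import sqlite3


def make_tables(conn):
    # Hypothetical minimal layout: one content table, one link table.
    conn.executescript('''
        CREATE TABLE events (event_uuid TEXT PRIMARY KEY, message TEXT);
        CREATE TABLE event_objects (
            event_uuid TEXT NOT NULL, object_uuid TEXT NOT NULL);
    ''')


def delete_object_events(conn, object_uuid):
    # Stands in for the DeleteObjectEvents RPC: only link rows are removed,
    # the events rows themselves are left alone.
    conn.execute(
        'DELETE FROM event_objects WHERE object_uuid = ?', (object_uuid,))
    conn.commit()


def sweep_orphans(conn):
    # Stands in for the daily orphan sweep.
    cur = conn.execute(
        'DELETE FROM events WHERE event_uuid NOT IN ('
        'SELECT event_uuid FROM event_objects)')
    conn.commit()
    return cur.rowcount


def visible_events(conn, object_uuid):
    # Events reachable from one object's event stream.
    rows = conn.execute(
        'SELECT e.event_uuid FROM events e '
        'JOIN event_objects o ON o.event_uuid = e.event_uuid '
        'WHERE o.object_uuid = ? ORDER BY e.event_uuid', (object_uuid,))
    return [row[0] for row in rows]
```

In this sketch an event shared between an interface and its network stays visible from the network after the interface is hard-deleted, while an event that belonged only to the interface lingers in the events table until the next sweep.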
Metrics reference¶
The events subsystem exposes the following Prometheus metrics.
All sf-database metrics are scraped from DATABASE_METRICS_PORT
(default 13006). Per-daemon metrics are scraped from each
daemon's own metrics port.
| Metric | Type | Source daemon | Description |
|---|---|---|---|
| database_events_rows | Gauge | sf-database | Current row count in the events table, sampled roughly every 60 seconds. Use this to watch for unchecked growth if pruning stops. |
| database_events_inserted_total | Counter | sf-database | Events inserted, labelled by event_type. Watch the per-label rate for unexpected spikes. |
| database_events_pruned_total | Counter | sf-database | event_objects rows pruned per prune run, labelled by event_type. The synthetic label event_type='api-request' covers the object-type-override sweep. |
| database_orphan_events_pruned_total | Counter | sf-database | events rows removed by the orphan sweep. A non-zero rate here is normal after each daily prune. |
| eventlog_spool_depth | Gauge | every daemon | Rows currently pending in the local spool. Should be close to zero under normal conditions. |
| eventlog_spool_dropped_total | Counter | every daemon | Events dropped at the spool high-water mark. Any non-zero rate indicates that sf-database has been unreachable for an extended period or that events are being produced faster than the drainer can flush. |
The database_*_total{operation} per-RPC counters on sf-database
also pick up the RecordEventBatch, GetObjectEvents, and
DeleteObjectEvents RPCs automatically via the shared counter
registration.
Operator cleanup after upgrade¶
The events-on-MariaDB migration (Shaken Fist v0.8) retired the
sf-eventlog daemon and its on-disk sqlite event chunks. After
upgrading all nodes the directory:
on the former eventlog node holds pre-cut-over sqlite chunks that are no longer read by any daemon. It is safe to remove them once the new code is running:
The new spool files live under /srv/shakenfist/spool/eventlog/
and are managed automatically by the spool module.
See Database Architecture for the broader MariaDB schema, and ARCHITECTURE.md for the full daemon picture.