Recurring cluster operations framework
Prompt
Before responding to questions or discussion points in this document, explore the shakenfist codebase thoroughly. Read relevant source files, understand existing patterns (object lifecycle, state machines, MariaDB storage via the three-layer direct/gRPC/public pattern, Pydantic schemas, daemon architecture, operation queue system, event logging), and ground your answers in what the code actually does today. Do not speculate about the codebase when you could read it instead. Where a question touches on external concepts (cron expression evaluation, scheduling semantics under clustered failover, KVM/libvirt, VXLAN networking, MariaDB/Galera, gRPC/protobuf), research as needed to give a confident answer. Flag any uncertainty explicitly rather than guessing.
All planning documents should go into docs/plans/.
Consult ARCHITECTURE.md for the system architecture
overview, object types, and daemon structure. Consult
CLAUDE.md for build commands, project conventions, and
database access patterns. Consult GOALS.md for current
development priorities. Key references inside the repo
include shakenfist/operations/baseoperation.py (the
BaseClusterOperation framework and its dispatcher
semantics), shakenfist/daemons/cluster/scheduled_tasks.py
(the existing ad-hoc scheduled-tasks code that this
framework would absorb), shakenfist/daemons/network/maintain.py
(the network-maintenance loop that will become a consumer
once the framework lands), and shakenfist/mariadb.py (the
three-layer database access pattern).
When we get to detailed planning, the convention is a
separate plan file per detailed phase, named
PLAN-recurring-operations-phase-NN-descriptive.md in the
same directory.
I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit. Each commit should be self-contained: it should build, pass tests, and have a clear commit message explaining what changed and why.
Situation
Several distinct callers in the codebase have a "do X on a recurring schedule" requirement, each currently solved ad-hoc:

- scheduled_tasks.py at daemons/cluster/scheduled_tasks.py is the closest thing we have to a recurring-operations framework today. It is internal-only, tied to the cluster daemon's loop, and the schedule is hard-coded.
- maintain.py at daemons/network/maintain.py is the network reconciliation loop. It runs on a thread inside sf-net, ticking every interval and walking all networks. The network-facade plan (PLAN-network-facade.md) deliberately keeps it as a thread for now and gates its enqueues per-network, noting that the proper landing place for "maintain is a recurring CO" is here.
- User-driven recurrence is currently impossible. A user who wants "snapshot this instance every 24 hours" has to run external cron + REST client. There is no way to express "this operation should recur" through the API.

The operation queue framework (BaseClusterOperation,
priorities, cluster_operation_targets, the dispatcher's
depends_on / runs_after semantics) gives us most of the
machinery we'd need for a unified recurrence framework. We
have a typed operation model, persistence, queue
infrastructure, priority lanes, namespace authz on the
target object, and history via cluster_operation_targets.
What we lack:

- A schedule type (cron expression and/or "every N seconds")
- A persisted RecurringOperation object that lives alongside other DBOs
- A tick mechanism that fires the recurring op on schedule
- Gating semantics ("don't tick if the previous tick is still in flight")
- User-facing REST API
- A migration path for the existing internal consumers (scheduled_tasks.py, maintain.py)
- Two specific dispatcher gaps the network-facade plan flagged: no max-wait semantics for runs_after (a stuck dep defers the dependent indefinitely); and depends_on aborts the dependent on dep failure, which is wrong for recurring tasks (a single failed reconcile should not stop the recurrence).

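To make the first four gaps concrete, here is a minimal sketch of a schedule type, a persisted record and a gated tick check. It uses plain dataclasses rather than the project's Pydantic schemas so that it stays self-contained, and every name in it (IntervalSchedule, RecurringOperationSketch, is_due) is hypothetical rather than taken from the codebase.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IntervalSchedule:
    # Hypothetical "every N seconds" schedule type.
    every_seconds: float


@dataclass
class RecurringOperationSketch:
    # Hypothetical record; the real object would follow the DBO patterns.
    uuid: str
    namespace: str
    schedule: IntervalSchedule
    template_op: str                             # e.g. 'snapshot_op'
    template_args: dict = field(default_factory=dict)
    last_tick_started: Optional[float] = None    # epoch seconds
    tick_in_flight: bool = False


def is_due(rop, now):
    """Return True if a new tick should be enqueued at time `now`."""
    # Gating: never fire while the previous tick is still in flight.
    if rop.tick_in_flight:
        return False
    # A recurrence that has never ticked fires immediately.
    if rop.last_tick_started is None:
        return True
    return now - rop.last_tick_started >= rop.schedule.every_seconds
```

In this sketch the in-flight flag is stored on the record itself. A real implementation might instead derive it from the cluster_operation_targets history.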
Mission and problem statement
Introduce a RecurringOperation object type and the
supporting framework so that:

- Any internal subsystem that today runs a recurring loop can express it as a RecurringOperation instead. Initial consumers: scheduled_tasks.py and daemons/network/maintain.py.
- Users can create RecurringOperation objects via the REST API for explicit recurring tasks. Initial supported template types include at least snapshot_op (matching the user-driven motivating use case) and agent_op for recurring agent commands.
- The framework respects the existing priority lanes, namespace authz, target tracking, and event logging: a recurring op is not a second-class citizen.
- Operational concerns are addressed up front: per-recurrence "don't double-fire" gating, maximum-wait semantics for runs_after so a stuck dep doesn't break recurrence permanently, and explicit handling for failed ticks (do not break the recurrence on a single failure).
- When the framework lands, scheduled_tasks.py and daemons/network/maintain.py migrate to it in subsequent phases. The network-facade plan's Q6 design (per-network gating + cooldown + circuit breaker) is preserved through the migration; it just lives inside a maintenance-pass recurring op rather than inside a free-standing thread.

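The maximum-wait behaviour for runs_after can be sketched as a small decision function. This is an illustration under stated assumptions, not the dispatcher's actual code: the state names, the DepDecision enum and the max_wait_seconds parameter are all hypothetical.

```python
from enum import Enum


class DepDecision(Enum):
    RUN = 'run'
    DEFER = 'defer'


# Hypothetical terminal states; the real set lives in the operation
# state machine.
TERMINAL_STATES = frozenset({'complete', 'failed', 'aborted'})


def runs_after_decision(dep_state, waited_seconds, max_wait_seconds=None):
    """Decide whether a dependent op may run given one runs_after dep."""
    # Any terminal state releases the dependent, whether the dep
    # succeeded or not.
    if dep_state in TERMINAL_STATES:
        return DepDecision.RUN
    # No deadline configured: keep deferring, as the dispatcher does today.
    if max_wait_seconds is None:
        return DepDecision.DEFER
    # Deadline reached: stop waiting on the stuck dep.
    if waited_seconds >= max_wait_seconds:
        return DepDecision.RUN
    return DepDecision.DEFER
```

Leaving max_wait_seconds as None preserves the current wait-indefinitely behaviour, so in this sketch existing callers would be unaffected unless they opt in.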
Scope boundaries (preliminary — to be refined when this plan moves out of stub status):
- In scope: the RecurringOperation object, its REST API, its tick mechanism, the dispatcher changes needed to support max-wait runs_after and continue-on-failure recurrence semantics, and migration of scheduled_tasks.py and maintain.py as initial consumers.
- Out of scope: general workflow engines (we are not building Airflow). The recurrence vocabulary is deliberately small.
- Out of scope: changing the operation queue's priority taxonomy or the way the cluster elects an owner for cluster-wide operations.
Open questions
These are preliminary sketches. Each will be tightened significantly when this plan moves out of stub status.

1. Schedule format. Cron expressions are powerful but verbose and have many edge cases (timezones, DST, "every Wednesday in the third week"); simple "every N seconds" is sufficient for internal consumers (maintain.py is every 60 s) but inadequate for user-facing recurrences like "snapshot every 24 hours at 3 am UTC". Possible resolution: support both, with a strict subset of cron expressions to avoid the most painful edge cases (no timezone-aware scheduling for v1, evaluate all expressions in UTC).
2. Tick owner and failover. Some recurrence ticks are cluster-wide (run maintain on the elected network node); others are per-node (each node's local sf-net runs its local maintenance). The framework needs to express this. The existing cluster-wide vs per-node queue taxonomy is the right primitive to lean on; the recurring op simply enqueues at the appropriate queue.
3. Gating semantics. "Don't tick if the previous tick is still in flight" is the common case but not the only one. Some recurrences may want to overlap (e.g. metrics collection that's idempotent and cheap). Possible shape: a per-recurrence overlap_policy: 'skip' | 'queue' | 'replace'.
4. Failure semantics. A single failed tick must not stop the recurrence. But repeated failures should surface to operators. The network-facade Q6 design (cooldown + circuit breaker via the cluster_operation_targets history) generalises: RecurringOperation tracks recent tick history and pauses the recurrence after K consecutive failures with an operator-visible event. Manual operator action clears the pause.
5. Dispatcher max-wait for runs_after. Today runs_after defers the dependent indefinitely if the dep is non-terminal. For recurring tasks, a stuck dep should eventually time out and let the dependent proceed. Need a per-dep "deadline" or per-op "max wait on deps" setting. Affects the dispatcher (which today has no such notion).
6. Continue-on-failure deps for recurrences. Today depends_on aborts the dependent on dep failure. The "next tick of a recurrence" wants the opposite: "wait until the previous tick reaches any terminal state, then run regardless". The natural primitive is runs_after (which already has those semantics), but when combined with question 5 the deadline interaction needs to be clear.
7. REST API shape. New endpoints:

    - POST /recurring_operations: create
    - GET /recurring_operations: list (namespace-scoped)
    - GET /recurring_operations/<uuid>: read
    - DELETE /recurring_operations/<uuid>: delete
    - POST /recurring_operations/<uuid>/pause, .../resume, .../trigger: operator and user verbs

    The template-op vocabulary is constrained to a small set initially: snapshot_op, agent_op, possibly network_maintain_pass_op once that exists as a discrete op type.
8. Persistence. A recurring_operations table with the schedule, the template (op type + args as JSON), the gating policy, and the most-recent-tick history. Mutable attributes for paused / resumed state. Follow the existing pattern for DBO persistence in MariaDB (namespaces/artifacts/ etc.).
9. Migration sequencing. Build the framework first, then migrate scheduled_tasks.py (simpler, no user-facing surface), then migrate maintain.py (requires the network-maintain-pass op type to exist as a discrete CO, which is itself non-trivial). The user-facing REST surface and the snapshot/agent template support can land in parallel with or after the internal migrations; they don't block each other.

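Question 1's "strict subset" of cron could be as small as the following sketch, which evaluates everything in UTC. It is one possible shape rather than a decided design. The accepted grammar (five fields, each `*`, `*/N`, or a comma-separated list of integers) and the function names are assumptions. One further simplification is that all five fields must match, whereas traditional cron treats day-of-month and day-of-week as alternatives when both are restricted.

```python
from datetime import datetime, timedelta, timezone

# (low, high) bounds for minute, hour, day-of-month, month, day-of-week.
FIELD_BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def parse_field(text, low, high):
    """Expand one field into the set of values it allows."""
    if text == '*':
        return set(range(low, high + 1))
    if text.startswith('*/'):
        step = int(text[2:])
        if step < 1:
            raise ValueError('step must be positive: %s' % text)
        return set(range(low, high + 1, step))
    values = {int(part) for part in text.split(',')}
    if not all(low <= v <= high for v in values):
        raise ValueError('value out of range: %s' % text)
    return values


def parse_cron(expression):
    """Parse a five-field expression into five sets of allowed values."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError('expected five fields: %r' % expression)
    return [parse_field(f, lo, hi)
            for f, (lo, hi) in zip(fields, FIELD_BOUNDS)]


def next_fire(expression, after):
    """Return the first matching UTC minute strictly after `after`.

    `after` must be a timezone-aware datetime.
    """
    minutes, hours, days, months, weekdays = parse_cron(expression)
    t = after.astimezone(timezone.utc).replace(second=0, microsecond=0)
    # Scan minute by minute; bounded so impossible dates cannot loop forever.
    for _ in range(5 * 366 * 24 * 60):
        t += timedelta(minutes=1)
        # Python numbers Monday as 0; cron numbers Sunday as 0.
        cron_weekday = (t.weekday() + 1) % 7
        if (t.minute in minutes and t.hour in hours and t.day in days
                and t.month in months and cron_weekday in weekdays):
            return t
    raise ValueError('no matching time found: %r' % expression)
```

With this sketch, "snapshot every 24 hours at 3 am UTC" would be expressed as `0 3 * * *`. The minute-by-minute scan is deliberately naive; a production evaluator would jump between candidate fields rather than iterate.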
Execution
(Detailed phase plans will be drafted when this plan moves out of stub status. Phases are tentatively expected to look like:)

| Phase | Plan | Status |
|---|---|---|
| 1. RecurringOperation object and persistence | TBD | Not started |
| 2. Dispatcher max-wait for runs_after + continue-on-failure recurrence semantics | TBD | Not started |
| 3. Tick mechanism and gating policies | TBD | Not started |
| 4. Migrate scheduled_tasks.py as first internal consumer | TBD | Not started |
| 5. Network-maintain-pass op + migrate maintain.py | TBD | Not started |
| 6. REST API and user-facing template vocabulary | TBD | Not started |
| 7. Documentation and tests | TBD | Not started |

This plan is currently in stub form. It exists primarily
to anchor a future-work reference in
PLAN-network-facade.md and to capture the framing for
when work begins.
Agent guidance
(To be filled in when this plan moves out of stub status.
The structure will mirror PLAN-network-facade.md's Agent
guidance section: execution model, planning effort,
step-level guidance table, management session review
checklist.)
Administration and logistics
Success criteria
When this plan is successfully implemented:

- A RecurringOperation object type exists, follows the existing DBO patterns, persists in MariaDB, and is documented in ARCHITECTURE.md.
- daemons/cluster/scheduled_tasks.py is gone (or is a thin wrapper); its contents have moved to internal RecurringOperation instances.
- daemons/network/maintain.py is gone; its contents have moved to a network_maintain_pass CO triggered by an internal RecurringOperation. The per-network gating + cooldown + circuit breaker behaviour from PLAN-network-facade.md Q6 is preserved.
- A user can POST /recurring_operations to create e.g. "snapshot instance X every 24 hours at 3 am UTC" and the snapshot fires on schedule.
- The dispatcher supports a max-wait deadline on runs_after so a stuck dep cannot permanently break recurrence.
- A RecurringOperation whose ticks repeatedly fail pauses itself after K consecutive failures with an operator-visible event, mirroring the network-facade circuit breaker.
- The code passes pre-commit run --all-files.
- Functional test coverage in shakenfist/deploy/cluster_ci exercises both internal consumers and at least one user-driven template.

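The pause-after-K-failures criterion can be illustrated with a small in-memory tracker. This is a sketch only: the TickBreaker name and its fields are hypothetical, the default threshold of 3 is an arbitrary placeholder for K, and a real implementation would presumably derive the failure count from the cluster_operation_targets history rather than hold it in memory.

```python
from dataclasses import dataclass


@dataclass
class TickBreaker:
    # Hypothetical names; max_consecutive_failures stands in for K.
    max_consecutive_failures: int = 3
    consecutive_failures: int = 0
    paused: bool = False

    def record_tick(self, succeeded):
        """Record a tick outcome; return True if this tick tripped the pause."""
        if succeeded:
            # Any success resets the streak, so isolated failures never pause.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if (not self.paused
                and self.consecutive_failures >= self.max_consecutive_failures):
            self.paused = True
            # The caller would emit the operator-visible event here.
            return True
        return False

    def resume(self):
        """Manual operator action: clear the pause and the failure count."""
        self.paused = False
        self.consecutive_failures = 0
```

Returning True only on the tick that trips the breaker means the operator-visible event would be emitted once per pause, not once per subsequent failure.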
Future work
- Time-aware scheduling. Cron expressions evaluated in the cluster's configured timezone (not just UTC), with proper DST handling. Initial implementation is UTC-only.
- Cross-recurrence dependencies. A RecurringOperation whose tick depends on another recurrence's most-recent successful tick. Possibly useful for "do nightly snapshots only if backup completed". Speculative.
- Catch-up policy. What to do when the system was offline through a scheduled tick: skip, fire once, fire for every missed slot. Initial implementation skips.

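The three catch-up options differ only in which missed slots get fired after downtime. A minimal sketch for interval schedules follows, assuming slots fall at last_fired + k * interval; the function name and the policy strings 'skip', 'once' and 'all' are hypothetical labels for the options listed above.

```python
def slots_to_fire(last_fired, now, interval, policy='skip'):
    """Return the missed slot times that should be fired at `now`.

    Times are epoch seconds. Slots fall at last_fired + k * interval.
    """
    if interval <= 0:
        raise ValueError('interval must be positive')
    # Number of whole intervals that elapsed while nothing was firing.
    count = int((now - last_fired) // interval)
    missed = [last_fired + k * interval for k in range(1, count + 1)]
    if policy == 'skip':
        # Drop every missed slot; the next future slot fires as normal.
        return []
    if policy == 'once':
        # Fire a single catch-up tick for the most recent missed slot.
        return missed[-1:]
    if policy == 'all':
        # Fire one tick per missed slot, oldest first.
        return missed
    raise ValueError('unknown policy: %r' % policy)
```

The default of 'skip' mirrors the statement that the initial implementation skips.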
Bugs fixed during this work
(none yet)
Documentation index maintenance
When this plan is updated:

- docs/plans/index.md: the row for this plan should track its overall status. Phase rows are not added.
- docs/plans/order.yml: this master plan is registered; phase files are not.

Back brief
Before executing any step of this plan, the implementing sub-agent must back brief the operator as to its understanding of the phase plan and how the work it intends to do aligns with that plan.