Recurring cluster operations framework

Prompt

Before responding to questions or discussion points in this document, explore the shakenfist codebase thoroughly. Read relevant source files, understand existing patterns (object lifecycle, state machines, MariaDB storage via the three-layer direct/gRPC/public pattern, Pydantic schemas, daemon architecture, operation queue system, event logging), and ground your answers in what the code actually does today. Do not speculate about the codebase when you could read it instead. Where a question touches on external concepts (cron expression evaluation, scheduling semantics under clustered failover, KVM/libvirt, VXLAN networking, MariaDB/Galera, gRPC/protobuf), research as needed to give a confident answer. Flag any uncertainty explicitly rather than guessing.

All planning documents should go into docs/plans/.

Consult ARCHITECTURE.md for the system architecture overview, object types, and daemon structure. Consult CLAUDE.md for build commands, project conventions, and database access patterns. Consult GOALS.md for current development priorities. Key references inside the repo include shakenfist/operations/baseoperation.py (the BaseClusterOperation framework and its dispatcher semantics), shakenfist/daemons/cluster/scheduled_tasks.py (the existing ad-hoc scheduled-tasks code that this framework would absorb), shakenfist/daemons/network/maintain.py (the network-maintenance loop that will become a consumer once the framework lands), and shakenfist/mariadb.py (the three-layer database access pattern).

When we get to detailed planning, the convention is a separate plan file per detailed phase, named PLAN-recurring-operations-phase-NN-descriptive.md in the same directory.

I prefer one commit per logical change, and at minimum one commit per phase. Do not batch unrelated changes into a single commit. Each commit should be self-contained: it should build, pass tests, and have a clear commit message explaining what changed and why.

Situation

Several distinct callers in the codebase have a "do X on a recurring schedule" requirement, each currently solved ad-hoc:

  1. scheduled_tasks.py at daemons/cluster/scheduled_tasks.py is the closest thing we have to a recurring-operations framework today. It is internal-only, tied to the cluster daemon's loop, and the schedule is hard-coded.
  2. maintain.py at daemons/network/maintain.py is the network reconciliation loop. It runs on a thread inside sf-net, ticking every interval and walking all networks. The network-facade plan (PLAN-network-facade.md) deliberately keeps it as a thread for now and gates its enqueues per-network, noting that the proper landing place for "maintain is a recurring CO" is here.
  3. User-driven recurrence is currently impossible. A user who wants "snapshot this instance every 24 hours" has to run external cron + REST client. There is no way to express "this operation should recur" through the API.

The operation queue framework (BaseClusterOperation, priorities, cluster_operation_targets, the dispatcher's depends_on / runs_after semantics) gives us most of the machinery we'd need for a unified recurrence framework. We have a typed operation model, persistence, queue infrastructure, priority lanes, namespace authz on the target object, and history via cluster_operation_targets. What we lack:

  • A schedule type (cron expression and/or "every N seconds")
  • A persisted RecurringOperation object that lives alongside other DBOs
  • A tick mechanism that fires the recurring op on schedule
  • Gating semantics ("don't tick if the previous tick is still in flight")
  • User-facing REST API
  • A migration path for the existing internal consumers (scheduled_tasks.py, maintain.py)
  • Two specific dispatcher gaps the network-facade plan flagged: no max-wait semantics for runs_after (a stuck dep defers the dependent indefinitely); and depends_on aborts the dependent on dep failure, which is wrong for recurring tasks (a single failed reconcile should not stop the recurrence).
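
To make the first few gaps concrete, the following is a minimal sketch of what a persisted recurrence record might carry. Every name in it (RecurringOperationRecord, IntervalSchedule, and all field names) is hypothetical and does not exist in the codebase today. It uses stdlib dataclasses purely to stay self-contained; a real implementation would presumably follow the existing Pydantic and DBO conventions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class IntervalSchedule:
    """Hypothetical 'every N seconds' schedule type."""
    every_seconds: int

    def next_fire(self, last_fire: datetime) -> datetime:
        return last_fire + timedelta(seconds=self.every_seconds)


@dataclass
class RecurringOperationRecord:
    """Hypothetical shape of a persisted recurrence (illustration only)."""
    uuid: str
    namespace: str
    schedule: IntervalSchedule
    template_op: str                                  # e.g. 'snapshot_op'
    template_args: dict[str, Any] = field(default_factory=dict)
    paused: bool = False
    last_fire: Optional[datetime] = None
    in_flight: bool = False                           # previous tick not yet terminal

    def is_due(self, now: datetime) -> bool:
        # Gating: never tick while paused or while the previous tick is running.
        if self.paused or self.in_flight:
            return False
        # A recurrence that has never fired is due immediately.
        if self.last_fire is None:
            return True
        return now >= self.schedule.next_fire(self.last_fire)
```

A tick mechanism could then poll is_due() for each record and enqueue the template op when it returns True.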

Mission and problem statement

Introduce a RecurringOperation object type and the supporting framework so that:

  • Any internal subsystem that today runs a recurring loop can express it as a RecurringOperation instead. Initial consumers: scheduled_tasks.py and daemons/network/maintain.py.
  • Users can create RecurringOperation objects via the REST API for explicit recurring tasks. Initial supported template types include at least snapshot_op (matches the user-driven motivating use case) and agent_op for recurring agent commands.
  • The framework respects the existing priority lanes, namespace authz, target tracking, and event logging — a recurring op is not a second-class citizen.
  • Operational concerns are addressed up front: per-recurrence "don't double-fire" gating, max-wait semantics for runs_after so a stuck dep doesn't break recurrence permanently, and explicit handling for failed ticks (a single failure must not break the recurrence).
  • When the framework lands, scheduled_tasks.py and daemons/network/maintain.py migrate to it in subsequent phases. The network-facade plan's Q6 design (per-network gating + cooldown + circuit breaker) is preserved through the migration — it just lives inside a maintenance-pass recurring op rather than inside a free-standing thread.
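
The max-wait idea for runs_after can be sketched as a pure decision function. This is an illustration under stated assumptions, not the dispatcher's actual logic: the DepState values, the may_proceed name, and the waiting_since and max_wait parameters are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class DepState(Enum):
    """Hypothetical dependency states; the real state names may differ."""
    QUEUED = 'queued'
    RUNNING = 'running'
    COMPLETE = 'complete'
    FAILED = 'failed'


TERMINAL = {DepState.COMPLETE, DepState.FAILED}


def may_proceed(dep_state: DepState, waiting_since: datetime, now: datetime,
                max_wait: Optional[timedelta]) -> bool:
    """Decide whether a runs_after dependent may run.

    Any terminal state of the dep releases the dependent, whether the dep
    succeeded or failed. With max_wait=None the dependent waits indefinitely
    on a non-terminal dep, which is the behaviour this plan wants to bound.
    """
    if dep_state in TERMINAL:
        return True
    if max_wait is None:
        return False
    # The dep is stuck in a non-terminal state: release once the deadline passes.
    return now - waiting_since >= max_wait
```

Keeping the decision free of I/O would let the deadline behaviour be unit tested without a running dispatcher.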

Scope boundaries (preliminary — to be refined when this plan moves out of stub status):

  • In scope: the RecurringOperation object, its REST API, its tick mechanism, the dispatcher changes needed to support max-wait runs_after and continue-on-failure recurrence semantics, and migration of scheduled_tasks.py and maintain.py as initial consumers.
  • Out of scope: general workflow engines (we are not building Airflow). The recurrence vocabulary is deliberately small.
  • Out of scope: changing the operation queue's priority taxonomy or the way the cluster elects an owner for cluster-wide operations.

Open questions

These are preliminary sketches. Each will be tightened significantly when this plan moves out of stub status.

  1. Schedule format. Cron expressions are powerful but verbose and have many edge cases (timezones, DST, "every Wednesday in the third week"); simple "every N seconds" is sufficient for internal consumers (maintain.py is every 60 s) but inadequate for user-facing recurrences like "snapshot every 24 hours at 3 am UTC". Possible resolution: support both, with a strict subset of cron expressions to avoid the most painful edge cases (no timezone-aware scheduling for v1, evaluate all expressions in UTC).

  2. Tick owner and failover. Some recurrence ticks are cluster-wide (run maintain on the elected network node); others are per-node (each node's local sf-net runs its local maintenance). The framework needs to express this. The existing cluster-wide vs per-node queue taxonomy is the right primitive to lean on — the recurring op simply enqueues at the appropriate queue.

  3. Gating semantics. "Don't tick if the previous tick is still in flight" is the common case but not the only one. Some recurrences may want to overlap (e.g. metrics collection that's idempotent and cheap). Possible shape: a per-recurrence overlap_policy: 'skip' | 'queue' | 'replace'.

  4. Failure semantics. A single failed tick must not stop the recurrence. But repeated failures should surface to operators. The network-facade Q6 design (cooldown + circuit breaker via the cluster_operation_targets history) generalises: RecurringOperation tracks recent tick history and pauses the recurrence after K consecutive failures with an operator-visible event. Manual operator action clears the pause.

  5. Dispatcher max-wait for runs_after. Today runs_after defers the dependent indefinitely if the dep is non-terminal. For recurring tasks, a stuck dep should eventually time out and let the dependent proceed. Need a per-dep "deadline" or per-op "max wait on deps" setting. Affects the dispatcher (which today has no such notion).

  6. Continue-on-failure deps for recurrences. Today depends_on aborts the dependent on dep failure. The "next tick of a recurrence" wants the opposite: "wait until the previous tick reaches any terminal state, then run regardless". The natural primitive is runs_after (which already has those semantics), but when combined with question 5 the deadline interaction needs to be clear.

  7. REST API shape. New endpoints:

       • POST /recurring_operations: create
       • GET /recurring_operations: list (namespace-scoped)
       • GET /recurring_operations/<uuid>: read
       • DELETE /recurring_operations/<uuid>: delete
       • POST /recurring_operations/<uuid>/pause, .../resume, .../trigger: operator and user verbs

     The template-op vocabulary is constrained to a small set initially: snapshot_op, agent_op, possibly network_maintain_pass_op once that exists as a discrete op type.

  8. Persistence. A recurring_operations table with the schedule, the template (op type + args as JSON), the gating policy, and the most-recent-tick history. Mutable attributes for paused / resumed state. Follow the existing pattern for DBO persistence in MariaDB (namespaces / artifacts / etc.).

  9. Migration sequencing. Build the framework first, then migrate scheduled_tasks.py (simpler, no user-facing surface), then migrate maintain.py (requires the network-maintain-pass op type to exist as a discrete CO, which is itself non-trivial). The user-facing REST surface and the snapshot/agent template support can land in parallel with or after the internal migrations; they don't block each other.
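
The gating and failure questions above can be sketched together as two small decision functions. The policy names come from the overlap_policy idea in the gating question, but the meaning given here to 'queue' and 'replace' is an assumption, as are the function names, the returned action strings, and the representation of tick history as a list of booleans.

```python
from typing import Literal

OverlapPolicy = Literal['skip', 'queue', 'replace']


def on_tick_due(policy: OverlapPolicy, previous_in_flight: bool) -> str:
    """Return a hypothetical action name for a tick that has come due."""
    if not previous_in_flight:
        return 'enqueue'
    if policy == 'skip':
        return 'skip'                          # drop this tick entirely
    if policy == 'queue':
        return 'enqueue_after_previous'        # assumed: run once the previous tick ends
    if policy == 'replace':
        return 'cancel_previous_and_enqueue'   # assumed: newest tick wins
    raise ValueError(f'unknown overlap policy: {policy!r}')


def should_pause(recent_outcomes: list[bool], k: int) -> bool:
    """True when the last k ticks all failed (True in the list means success)."""
    if k <= 0 or len(recent_outcomes) < k:
        return False
    return not any(recent_outcomes[-k:])
```

A single success inside the last k outcomes resets the breaker in this sketch, which matches the "K consecutive failures" wording rather than a failure-rate threshold.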

Execution

(Detailed phase plans will be drafted when this plan moves out of stub status. Phases are tentatively expected to look like:)

| Phase | Plan | Status |
|-------|------|--------|
| 1. RecurringOperation object and persistence | TBD | Not started |
| 2. Dispatcher max-wait for runs_after + continue-on-failure recurrence semantics | TBD | Not started |
| 3. Tick mechanism and gating policies | TBD | Not started |
| 4. Migrate scheduled_tasks.py as first internal consumer | TBD | Not started |
| 5. Network-maintain-pass op + migrate maintain.py | TBD | Not started |
| 6. REST API and user-facing template vocabulary | TBD | Not started |
| 7. Documentation and tests | TBD | Not started |

This plan is currently in stub form. It exists primarily to anchor a future-work reference in PLAN-network-facade.md and to capture the framing for when work begins.

Agent guidance

(To be filled in when this plan moves out of stub status. The structure will mirror PLAN-network-facade.md's Agent guidance section: execution model, planning effort, step-level guidance table, management session review checklist.)

Administration and logistics

Success criteria

When this plan is successfully implemented:

  • A RecurringOperation object type exists, follows the existing DBO patterns, persists in MariaDB, and is documented in ARCHITECTURE.md.
  • daemons/cluster/scheduled_tasks.py is gone (or is a thin wrapper) — its contents have moved to internal RecurringOperation instances.
  • daemons/network/maintain.py is gone — its contents have moved to a network_maintain_pass CO triggered by an internal RecurringOperation. The per-network gating + cooldown + circuit breaker behaviour from PLAN-network-facade.md Q6 is preserved.
  • A user can POST /recurring_operations to create e.g. "snapshot instance X every 24 hours at 3 am UTC" and the snapshot fires on schedule.
  • The dispatcher supports a max-wait deadline on runs_after so a stuck dep cannot permanently break recurrence.
  • A RecurringOperation whose ticks repeatedly fail pauses itself after K failures with an operator-visible event, mirroring the network-facade circuit breaker.
  • The code passes pre-commit run --all-files.
  • Functional test coverage in shakenfist/deploy/cluster_ci exercises both internal consumers and at least one user-driven template.
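
The user-facing criterion ("snapshot instance X every 24 hours at 3 am UTC") implies a UTC-only next-fire calculation. The sketch below shows one way that could look for the daily case only. The function name and the example request body are hypothetical; the REST schema has not been designed yet, and a real implementation would need to handle the full supported cron subset.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical request body for POST /recurring_operations (illustration only;
# none of these keys are a decided schema).
example_request = {
    'schedule': '0 3 * * *',                  # daily at 03:00, evaluated in UTC
    'template': {'op': 'snapshot_op', 'args': {'instance_uuid': '<uuid>'}},
    'overlap_policy': 'skip',
}


def next_daily_fire_utc(now: datetime, at: time) -> datetime:
    """Return the next occurrence of wall-clock time `at` in UTC, strictly after `now`."""
    if now.tzinfo is None:
        raise ValueError('now must be timezone-aware')
    now = now.astimezone(timezone.utc)
    candidate = datetime.combine(now.date(), at, tzinfo=timezone.utc)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate
```

Evaluating everything in UTC sidesteps DST gaps and repeats, which is the reason the plan defers timezone-aware schedules to future work.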

Future work

  • Timezone-aware scheduling. Cron expressions evaluated in the cluster's configured timezone (not just UTC), with proper DST handling. Initial implementation is UTC-only.
  • Cross-recurrence dependencies. A RecurringOperation whose tick depends on another recurrence's most-recent successful tick. Possibly useful for "do nightly snapshots only if backup completed". Speculative.
  • Catch-up policy. What to do when the system was offline through a scheduled tick: skip, fire once, fire for every missed slot. Initial implementation skips.
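
The three catch-up options can be sketched for an interval schedule as follows. The policy names ('skip', 'fire_once', 'fire_all') and both function names are hypothetical labels for the options listed in the catch-up bullet; only 'skip' corresponds to the planned initial behaviour.

```python
from datetime import datetime, timedelta, timezone


def slots_missed(last_fire: datetime, recovered_at: datetime,
                 interval: timedelta) -> list[datetime]:
    """Scheduled slots that passed between the last fire and recovery."""
    if interval <= timedelta(0):
        raise ValueError('interval must be positive')
    slots = []
    slot = last_fire + interval
    while slot <= recovered_at:
        slots.append(slot)
        slot += interval
    return slots


def catch_up(policy: str, last_fire: datetime, recovered_at: datetime,
             interval: timedelta) -> list[datetime]:
    """Return the missed slots that should be fired on recovery."""
    missed = slots_missed(last_fire, recovered_at, interval)
    if policy == 'skip':
        return []              # planned initial behaviour: drop missed ticks
    if policy == 'fire_once':
        return missed[-1:]     # a single tick standing in for all missed slots
    if policy == 'fire_all':
        return missed          # one tick per missed slot
    raise ValueError(f'unknown catch-up policy: {policy!r}')
```

With a short interval and a long outage, 'fire_all' could produce a very large burst of ticks, so a real implementation would probably want a cap.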

Bugs fixed during this work

(none yet)

Documentation index maintenance

When this plan is updated:

  • docs/plans/index.md — the row for this plan should track its overall status. Phase rows are not added.
  • docs/plans/order.yml — this master plan is registered; phase files are not.

Back brief

Before executing any step of this plan, the implementing sub-agent must back brief the operator as to its understanding of the phase plan and how the work it intends to do aligns with that plan.
