v0.7 to v0.8 release notes¶

Major changes¶

You can now hot plug (add) a network interface to a running instance. This is exposed by a POST to the /instance/...instanceref.../interfaces endpoint, passing a network specification like you would at boot time. More details are available in the
The Shaken Fist client now has a plugin system, which allows additional commands that may not be of interest to all users to be added without cluttering the main codebase. Orchestration of k3s clusters via the shakenfist-client-k3s plugin is the first example of one of these plugins.
You can now optionally have DNS entries for the instances in a virtual network provided to the instances on that network via the provide_dns argument on network creation. This included a re-write of DHCP server options in the configuration file.
We now use threads for most concurrency requirements. See the operator page on threading for further details.
Disk utilization is now a factor in scheduling decisions. This might produce unexpected results if you have many very active filesystems mounted on a single hypervisor.

REST API¶

There is now an API call (GET /admin/resources) which exposes the resource utilization of the cluster to admin users.
You can now POST to /instance/...instanceref.../interfaces as described above to hot plug new network interfaces into an existing instance.
There is now an API call (GET /instances/...instanceref.../vdiconsoleproxy) which mints a short lived, Ed25519-signed token and returns a Kerbside proxy URL for an instance's SPICE console, so users can reach a graphical console without direct network access to the hypervisor. The public verification keys are published to admins at GET /admin/vditokenpubkey. This integration is disabled until the operator sets KERBSIDE_URL; see the VDI console tokens operator guide.

Supported distributions¶

Debian 12 is now supported as a host OS. Support for Debian 11 as a host OS has been dropped.
Ubuntu 22.04 is now supported as a host OS.
Fedora 34, 38, and 29, as well as Ubuntu 22.04 now have canned guest images.
Rocky 8 and 9 now have canned guest images.

Logging¶

REST API request traces are now logged via the event logging mechanism with the object type "api-requests" and the UUID being the request id.
Effort is now taken to report an event against all objects it affects, not just an arbitrary single object. This should ease debugging.

Containers and Kubernetes¶

The Shaken Fist client can now orchestrate k3s Kubernetes clusters for you. The lifecycle support is relatively simple at the moment, with cluster creation and deletion supported, as well as fetching the kubectl configuration from the cluster. This will be expanded over time. This support is implemented entirely in the Shaken Fist Python client, and heavily utilizes the in-guest agent added in v0.7. The client-side nature of the orchestration makes it easy for you to customize the orchestration if desired without having to alter the main server code.

Networking¶

IP address management has moved to a new baseobject called IPAM. Events are therefore recorded for address management as you would expect.
Addresses released on any network (including the floating network) are now quarantined for IP_DELETION_HALO_DURATION seconds after deletion before they can be reused. The only exception to this is if a network is heavily congested and an allocation attempt will fail. In that case the halo is temporarily reduced to 30 seconds and a warning log message is emitted.
You can now list the addresses in use for a given network with the sf-client network addresses ...uuid... command.
In order to support the k3s Kubernetes orchestration, the concept of routed IPs was introduced. A routed IP is an address from the floating address pool which uses routing to deliver traffic to the relevant virtual network. An interface on the virtual network must then have been configured by the user to answer ARP requests for that address. This works well with metallb, which our k3s orchestration uses to expose services.
Network orchestration now waits for iptables locks, instead of failing commands in high load situations.
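The address quarantine described above can be sketched roughly as follows. This is an illustration only, not the Shaken Fist implementation: the function name, the data structure, and the 300 second value for IP_DELETION_HALO_DURATION are assumptions (the notes do not state the default); only the 30 second reduced halo comes from the text. The sketch covers just the pool of released addresses.

```python
# Hypothetical value for illustration; the real default is not stated in the notes.
IP_DELETION_HALO_DURATION = 300
# The reduced halo used under congestion, as described in the notes.
REDUCED_HALO_DURATION = 30


def pick_reusable_address(released, now, halo=IP_DELETION_HALO_DURATION):
    """Return (address, warning) for a released address that may be reused.

    `released` maps address -> release timestamp in seconds. The warning is
    None unless the halo had to be reduced to find a candidate.
    """
    def candidates(min_age):
        return sorted(a for a, t in released.items() if now - t >= min_age)

    ready = candidates(halo)
    if ready:
        return ready[0], None

    # Congested case: the allocation would otherwise fail, so fall back to
    # the shorter halo and report that we did so.
    ready = candidates(REDUCED_HALO_DURATION)
    if ready:
        warning = 'halo temporarily reduced to %d seconds' % REDUCED_HALO_DURATION
        return ready[0], warning
    return None, None
```

An address released 40 seconds ago is therefore only handed out when nothing older is available, and the caller receives a warning string it can log.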

Instances¶

Shaken Fist can now capture screenshots of instance consoles.
Pause and unpause are now retried several times on failure, as sometimes libvirt does not respond correctly.
Specifying an incorrect disk bus now returns a more helpful error.
Power on now implies creation of a config drive if one is specified for the instance. That is, you can force re-creation of the config drive by powering an instance off and then on again.
Deletion of an instance will now cause most outstanding cluster operations for that instance to be cancelled. This is so these operations do not block the delete operation. The one exception at this time is snapshots, which will complete before the deletion occurs.

Artifacts¶

When you refer to an artifact by name and there is more than one match, the match in your local namespace (if any) is now preferred instead of returning an error.
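A minimal sketch of that lookup rule, assuming artifacts are plain dicts with hypothetical name and namespace keys. The function name is invented for illustration, and raising an error when several matches exist but none is local is an assumption about the remaining ambiguous case.

```python
def resolve_artifact(name, caller_namespace, artifacts):
    """Resolve an artifact by name, preferring the caller's own namespace."""
    matches = [a for a in artifacts if a['name'] == name]
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]

    # More than one match: prefer the one in the caller's namespace.
    local = [a for a in matches if a['namespace'] == caller_namespace]
    if len(local) == 1:
        return local[0]

    # Assumption: with no single local match the reference stays ambiguous.
    raise LookupError('artifact name %r is ambiguous' % name)
```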

Deployment¶

The getsf installer and the legacy Ansible deployer have been removed. Shaken Fist is now deployed with the shakenfist.shakenfist Ansible collection: you write an inventory, set your variables, and run a playbook (example inventories and playbooks ship in examples/). See the rewritten installation guide for the new workflow.
We no longer reset the authentication secret used to generate authentication tokens on upgrade. This means tokens from before an upgrade will continue to work for their normal lifetime.
We now lock versions of upstream Ansible Galaxy dependencies.
We now lock versions of all of our indirect python dependencies.
Nodes can now transition directly from the missing state to the stopping state.
The Ansible modules have been rewritten to skip resources that already exist and match your request. The Ansible modules are also now documented in the user guide.
The sf-client ansible subcommand and the module shims it installed have been removed from the python client. The sf_* modules are now native implementations shipped in the shakenfist.shakenfist collection; update playbooks to reference them by their fully qualified names (for example shakenfist.shakenfist.sf_instance) or via a collections: entry.
The deployer no longer configures rsyslog forwarding to a central node (there is no primary node to forward to). Structured logs are shipped to the Loki endpoint you configure instead -- see the logging guide.
The deployer no longer installs an Apache reverse proxy in front of the REST API. You bring your own load balancer for the API tier -- see the load balancing guide.
TLS certificates on hypervisors which are within 30 days of expiry are now automatically replaced.
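The certificate renewal rule in the last item reduces to a date comparison. The sketch below is illustrative only: the function name is hypothetical and treating exactly 30 days as inside the window is an assumption.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=30)


def needs_replacement(not_after, now=None):
    """Return True if a certificate expires within the renewal window.

    `not_after` is the certificate's expiry as a timezone-aware datetime.
    Certificates that have already expired also return True.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return not_after - now <= RENEWAL_WINDOW
```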

Database¶

Object state (created, deleted, error, etc.) is now stored in MariaDB instead of etcd. This improves query performance for state-based lookups and reduces etcd load. MariaDB is now required for all deployments.
A new database microservice runs on database-tier nodes and provides gRPC access to MariaDB for all other nodes. This centralizes database access and means only database-tier nodes require direct MariaDB credentials.
MariaDB schema versioning is now tracked in a schema_versions table. This enables automatic schema migrations when upgrading Shaken Fist. The sf-ctl ensure-mariadb-schema command can be used to manually verify or apply schema migrations.
Existing deployments upgrading to v0.8 must run sf-ctl migrate-state-to-mariadb after stopping services and before starting the new version. Use --dry-run to preview what will be migrated.
Cluster operation headers and the per-node work queues have been moved from etcd to the new cluster_operations and work_queue MariaDB tables. Create and enqueue is now a single MariaDB transaction (header + state + queue row), and dequeue uses SELECT ... FOR UPDATE SKIP LOCKED for race-safe claims. Residual /sf/queue/, /sf/processing/ and /sf/{op_type}/ etcd keys left behind by older clusters are drained automatically by a one-shot data migration on database daemon startup.
A new stuck-job reaper in the cluster daemon reclaims work queue rows whose claim has exceeded CLUSTER_OP_STUCK_THRESHOLD seconds (default 1800) and rejects jobs that have been claimed more than CLUSTER_OP_MAX_ATTEMPTS times (default 5). Reaper activity is exposed via cluster_op_reaper_requeued_total and cluster_op_reaper_rejected_total on CLUSTER_METRICS_PORT (default 13007).
The floating network now uses a well-known UUID (f10a7f10-a7f1-4a7f-a10a-7f10a7f10a7f) instead of the invalid string "floating". Existing deployments must run sf-ctl migrate-floating-network-uuid after stopping services to migrate. This change enables proper UUID4 type validation in the IPAM schema.
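The stuck-job reaper's decision for a single claimed work queue row might look something like the following. The two defaults are taken from the notes; the function name, the return values, and the order in which the two checks are applied are assumptions made for illustration.

```python
CLUSTER_OP_STUCK_THRESHOLD = 1800  # seconds, default from the notes
CLUSTER_OP_MAX_ATTEMPTS = 5        # default from the notes


def reap_decision(claimed_at, attempts, now):
    """Classify a claimed work queue row as 'leave', 'requeue' or 'reject'.

    `claimed_at` and `now` are timestamps in seconds; `attempts` is how many
    times the row has been claimed so far.
    """
    # Assumption: the attempt limit is checked before the stuck check.
    if attempts > CLUSTER_OP_MAX_ATTEMPTS:
        return 'reject'
    if now - claimed_at > CLUSTER_OP_STUCK_THRESHOLD:
        return 'requeue'
    return 'leave'
```

In a deployment, the 'requeue' and 'reject' outcomes would correspond to increments of cluster_op_reaper_requeued_total and cluster_op_reaper_rejected_total respectively.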

Bring your own MariaDB¶

Earlier in the v0.8 cycle the Shaken Fist deployer installed and configured MariaDB for you. As of this batch of changes that stopped: operators provide their own MariaDB and the deployer slots into existing infrastructure rather than prescribing it. This is a breaking change — existing deployments must be rebuilt against the new shape.

Deployer no longer installs MariaDB. The roles/mariadb/ Ansible role has been deleted. Operators are responsible for running a MariaDB 10.6.0+ server before deploying Shaken Fist.
Bootstrap and tuning helpers ship in the source tree. tools/bootstrap-mariadb.sql creates the shakenfist database, the shakenfist user, and the required grants. Apply it once against your MariaDB before the first deploy; the script is idempotent and safe to re-run. examples/mariadb-tuning.cnf is an optional drop-in with starting-point InnoDB and connection-pool settings; copy it to /etc/mysql/mariadb.conf.d/ if you want SF's recommended baseline.
The deployer takes connection details as variables. Set mariadb_host, mariadb_port (default 3306), mariadb_user (default shakenfist), mariadb_password, and mariadb_database in your deployment's group_vars/all.yml. The deployer no longer generates a password — operators choose it when they apply tools/bootstrap-mariadb.sql.
sf-ctl ensure-mariadb-schema is the only path for schema work. sf-database no longer calls ensure_schema() or runs data migrations at startup. Instead it reads the recorded schema version, compares it against its own expectations, and refuses to start on mismatch. Run sf-ctl ensure-mariadb-schema to create or migrate the schema; this command is the single authoritative path for all schema changes.
MariaDB compatibility check at startup. Both sf-ctl ensure-mariadb-schema and sf-database at startup verify that the server is MariaDB (not MySQL), version 10.6.0 or later, with the default storage engine set to InnoDB, the connection charset utf8mb4, and a utf8mb4_* collation. An incompatible server surfaces as a clear refusal-to-start with a multi-line error rather than a runtime failure on the first JSON-column write.
MARIADB_GATEWAY_HOSTS replaces DATABASE_NODE_IP. The new config key is a comma-separated list of sf-database gRPC endpoints; single-instance deployments use a one-element list. The companion keys MARIADB_GATEWAY_PORT (default 13005) and MARIADB_GATEWAY_METRICS_PORT (default 13006) set the gRPC and Prometheus ports that each sf-database instance binds on.
MARIADB_HOST scope is now narrow. Set MARIADB_HOST only on nodes that run sf-database and on any node where an operator manually runs sf-ctl ensure-mariadb-schema. The previous MARIADB_HOST=localhost direct-access hack used at config-bootstrap time is gone.
sf-database is a tier of N >= 1 instances. All instances connect to the same MariaDB. None is elected leader; all serve any inbound gRPC request. Every other SF daemon reaches the tier through a client-side load-balanced gRPC channel that round-robins requests across the MARIADB_GATEWAY_HOSTS list, skipping dead endpoints via subchannel connectivity state and client keepalives. No external L4 load balancer is required. The grpc.health.v1.Health protocol is published for external monitoring via unary Check calls.
CI exercises N > 1. The slim-tier CI topology runs two sf-database instances on every merge-queue run. The multi-instance shape is a supported production configuration.
Several sf-ctl commands have been deleted. migrate-state-to-mariadb, migrate-floating-network-uuid, and all other sf-ctl migrate-* commands tied to the etcd era are gone. Operators on the new shape do not run any migration command. The migrate-etcd-to-mariadb Claude Code skill in .claude/skills/ has also been removed.
Greenfields only. PLAN-byo-mariadb does not preserve compatibility with deployments that took the earlier-in-the-v0.8-cycle shape (deployer-installed MariaDB, singular DATABASE_NODE_IP, etc.). Operators rebuild against the new shape.
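Operators who want to pre-check a server before deploying could approximate the startup compatibility check along these lines. This is a sketch under assumptions, not the code sf-database runs: the function name is hypothetical, and the inputs are assumed to be the string returned by SELECT VERSION() plus the server's default storage engine, connection character set, and collation.

```python
import re

MINIMUM_VERSION = (10, 6, 0)


def check_mariadb_compatibility(version, default_engine, charset, collation):
    """Return a list of problems; an empty list means the server looks usable."""
    problems = []

    # MariaDB version strings contain "MariaDB", e.g. "10.6.12-MariaDB".
    if 'mariadb' not in version.lower():
        problems.append('server is not MariaDB: %s' % version)

    match = re.match(r'(\d+)\.(\d+)\.(\d+)', version)
    if not match:
        problems.append('cannot parse version: %s' % version)
    elif tuple(int(part) for part in match.groups()) < MINIMUM_VERSION:
        problems.append('version %s is older than 10.6.0' % version)

    if default_engine.lower() != 'innodb':
        problems.append('default storage engine is %s' % default_engine)
    if charset != 'utf8mb4':
        problems.append('connection charset is %s' % charset)
    if not collation.startswith('utf8mb4_'):
        problems.append('collation is %s' % collation)
    return problems
```

Returning every problem at once, rather than stopping at the first, mirrors the multi-line error the notes describe.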

See docs/plans/PLAN-byo-mariadb.md for the multi-phase rollout details.

Performance¶

Events are written directly to MariaDB via a per-daemon local spool rather than through etcd or a dedicated eventlog node. This eliminates all event-related etcd traffic and the single-node bottleneck that the old sf-eventlog daemon represented. (Earlier in the v0.8 cycle events were sent via gRPC to the eventlog node; that stage has also been superseded.)
We now use gRPC calls to compact etcd, instead of relying on a python client wrapper. This means we can now update our gRPC and protobuf dependencies to much more recent versions.
etcd traffic levels are now monitored in CI and we attempt to hold fewer cluster level locks for local operations.

Minor changes¶

Attribute updates on instances, networks and artifacts now write only the columns named in the update, instead of rewriting the whole attribute row. This fixes a lost-update race where two daemons concurrently updating different attributes of the same object could silently clobber each other's writes (most visibly, an agent operation enqueue lost to a concurrent power-state update).
Unhandled exceptions are now recorded to /srv/shakenfist/exceptions/ for later analysis. See the operator guide for details.
CI has been moved from relatively unreliable scraping of the instance serial console over telnet to using the Shaken Fist in-guest agent to inspect the state of instances for correctness.
The slow lock warning threshold is no longer configurable (SLOW_LOCK_THRESHOLD). Instead, a warning is emitted if a lock takes more than half of the specified timeout period to be acquired. This change was made because in some places we expect to wait a long time for a lock (for example, serialized fetches of a single resource from outside the cluster), but we also wanted to enforce in CI that locks do not take a long time to acquire.
Shaken Fist now uses Renovate to keep the dependencies of the develop branch up to date. This means that locking requirements at release time is no longer necessary, which makes the release process more reliable.
The qemu commands generated now vary based on the version of qemu installed on the machine. This was required to support the newer qemu version in Ubuntu 22.04.
The Ansible modules have been rewritten to be more reliable.
The Shaken Fist client now uses HTTP sessions to reduce latency for requests.
etcd has been removed. All object storage, cluster locks, work queues, cluster configuration, and the event dead letter queue now live in MariaDB. The etcd.py module is retained as a minimal shim only so that DATA_MIGRATIONS entries can drain residual etcd keys on first startup of an upgraded cluster; it will be removed entirely in the next minor release. Operators no longer need to deploy or maintain etcd. The SHAKENFIST_ETCD_HOST, SHAKENFIST_DATABASE_USE_DIRECT_ETCD, SHAKENFIST_LOG_ETCD_CONNECTIONS, and SHAKENFIST_NODE_IS_ETCD_MASTER configuration settings have been removed. The database tier is identified by the database_node Ansible inventory group; the legacy etcd_master name for that group is still accepted, with a deprecation warning, and is removed in the next release. Nodes now report an is_database_node role flag via the REST API; the vestigial is_etcd_master and is_eventlog_node fields are still emitted (always false) and will be removed in the next release.
sf-backup and sf-restore have been retired. These commands previously backed up and restored etcd state. With etcd removed, they are no longer functional and will print a message directing you to use MariaDB tooling instead. To back up your Shaken Fist cluster, use mariadb-dump (or mysqldump) to export the Shaken Fist database. To restore, use mariadb (or mysql) to import the dump. Event history is now part of the MariaDB database and no separate backup step is needed for it.
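The lost-update race fixed by the column-scoped attribute writes (the first item in this list) can be reproduced with a toy model. Everything here is illustrative: the row is a plain dict and the column names power_state and agent_operations are hypothetical stand-ins for the real schema.

```python
def write_whole_row(stored, snapshot, changes):
    """Old behaviour: rewrite every column from the caller's stale snapshot."""
    row = dict(snapshot)
    row.update(changes)
    stored.clear()
    stored.update(row)


def write_named_columns(stored, changes):
    """New behaviour: touch only the columns named in the update."""
    stored.update(changes)


def simulate(column_scoped):
    stored = {'power_state': 'off', 'agent_operations': []}

    # Both daemons read the row before either of them writes.
    snapshot_a = dict(stored)
    snapshot_b = dict(stored)

    agent_change = {'agent_operations': ['op-1']}  # daemon A enqueues an operation
    power_change = {'power_state': 'on'}           # daemon B records a power state

    if column_scoped:
        write_named_columns(stored, agent_change)
        write_named_columns(stored, power_change)
    else:
        write_whole_row(stored, snapshot_a, agent_change)
        write_whole_row(stored, snapshot_b, power_change)
    return stored
```

With whole-row writes, daemon B's stale snapshot overwrites the operation daemon A enqueued; with column-scoped writes both changes survive.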

Event logging migrated to MariaDB¶

Events previously lived in per-object SQLite files managed by a dedicated sf-eventlog daemon on a designated eventlog node. They now live in two MariaDB tables (events and event_objects) and are written from every daemon via a local spool that is drained in batches by a background thread directly into the database service. REST reads are served from MariaDB by any sf-api node; there is no longer a single eventlog node or a separate sf-eventlog daemon. See the operator guide on events for the full write-path, read-path, retention, and metrics reference.

History loss on upgrade¶

Events written before the upgrade are not reachable through the REST API after cut-over. The on-disk SQLite chunks under /srv/shakenfist/events/ on the former eventlog node remain present until you remove them — no daemon writes there any more. This behaviour was deliberate; preserving pre-upgrade history would have required a separate migration tool with no operator-visible operational benefit.

Once the new code is running on every node, rm -rf /srv/shakenfist/events/ is safe to run on the former eventlog node.

Ansible inventory change¶

The eventlog_node Ansible host group is gone. The daemon registration, service-start, and service-stop blocks that referenced it have been removed from the playbooks.

Remove the eventlog_node group from your inventory before deploying the new version.

Configuration key removals¶

The following configuration keys have been deleted and are no longer read by any daemon. Leaving them in your environment files is harmless but you are encouraged to remove them to avoid confusion:

EVENTLOG_NODE_IP
EVENTLOG_API_PORT
EVENTLOG_METRICS_PORT
EVENTLOG_SUPPRESS_GRPC
NODE_IS_EVENTLOG_NODE
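A small helper like the one below could flag leftover keys in an environment file. It is a sketch only: the function name is hypothetical, and accepting an optional SHAKENFIST_ prefix and an optional leading export are assumptions about how your environment files are written.

```python
REMOVED_KEYS = {
    'EVENTLOG_NODE_IP',
    'EVENTLOG_API_PORT',
    'EVENTLOG_METRICS_PORT',
    'EVENTLOG_SUPPRESS_GRPC',
    'NODE_IS_EVENTLOG_NODE',
}


def find_removed_keys(env_text):
    """Return the names of removed configuration keys set in env file text."""
    found = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        name = line.split('=', 1)[0].strip()
        if name.startswith('export '):
            name = name[len('export '):].strip()
        # Assumption: keys may carry a SHAKENFIST_ prefix in environment files.
        bare = name[len('SHAKENFIST_'):] if name.startswith('SHAKENFIST_') else name
        if bare in REMOVED_KEYS:
            found.append(name)
    return found
```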

REST `/events` response shape change¶

The response objects returned by the /{instance,artifact,network,node,blob}/<uuid>/events endpoints have been updated:

correlation_id is renamed to event_uuid. Clients that read this key by name must be updated; clients that pass the response dict through opaquely (such as the shakenfist-client Python library) require no change.
type is renamed to event_type to match the new schema column name. Clients that read response["type"] must be updated to read response["event_type"] instead.
request_id is now a first-class top-level field rather than being nested inside extra. Clients that read extra["request_id"] must be updated to read the top-level request_id field instead.
The limit query parameter is now capped at 1000 server-side. Any request with a limit greater than 1000 returns at most 1000 rows.
A limit of 0 or any negative value now defaults to 100 rather than returning all rows. Callers that previously passed a negative limit to fetch every event must switch to paginating with a positive limit.
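Clients that need to work against both old and new servers during a transition could read events through a small adapter. The sketch below is one possible approach, with hypothetical function names, and it assumes extra is a dict when present. The second function restates the server-side limit rules from the list above so callers can predict how many rows a request will return.

```python
MAX_LIMIT = 1000
DEFAULT_LIMIT = 100


def normalise_event(event):
    """Read an event in either the old or new response shape.

    Returns a dict that uses the new field names.
    """
    extra = event.get('extra') or {}
    return {
        'event_uuid': event.get('event_uuid', event.get('correlation_id')),
        'event_type': event.get('event_type', event.get('type')),
        'request_id': event.get('request_id', extra.get('request_id')),
    }


def effective_limit(requested):
    """Mirror the server-side handling of the limit query parameter."""
    if requested <= 0:
        return DEFAULT_LIMIT
    return min(requested, MAX_LIMIT)
```

Because a non-positive limit no longer returns every row, fetching a complete history means issuing repeated requests with a positive limit.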

New Prometheus metrics¶

The following metrics are new in this release. The sf-database daemon hosts the storage-side counters; the spool metrics are emitted by every daemon process:

database_events_rows — gauge, sf-database: current row count in the events table.
database_events_inserted_total{event_type} — counter, sf-database: events written per event type.
database_events_pruned_total{event_type} — counter, sf-database: events pruned per daily retention run, broken down by event type.
database_orphan_events_pruned_total — counter, sf-database: events removed by the orphan sweep (no remaining event_objects row).
eventlog_spool_depth — gauge, every daemon: current number of events queued in the local spool awaiting drain.
eventlog_spool_dropped_total — counter, every daemon: events silently dropped because the spool exceeded the high-water mark (SPOOL_HIGH_WATER_MARK, default 100 000).

See the operator guide on events for alert recommendations and the full metrics reference.

Strict namespace scoping on `*_ref` lookups¶

The instance, network, and artifact lookup decorators (arg_is_instance_ref, arg_is_network_ref, arg_is_artifact_ref) now honour the request body's namespace field strictly. Two behaviours changed:

When a system caller passes namespace=<x> (for example client.get_instance(name, namespace='ovirt-homelab')), the lookup is now scoped strictly to namespace <x>. Previously the namespace body field was ignored for resolution and a same-named object in a different namespace could be returned. A system caller who does not pass namespace retains the historical cross-namespace "search everywhere" behaviour.
When a tenant caller passes a namespace=<other> that is not their own namespace, the request now returns 404 instead of silently resolving against the caller's own namespace. Tenants who relied on the old shape were already going to get a 404 from the subsequent ownership check; the request is now rejected sooner.

There is no migration step. Callers that pass namespace matching the object's actual namespace are unaffected.

One specific consequence is worth calling out for operator tooling: the floating network has namespace=NULL in the database and therefore no longer matches any non-empty namespace=<x> filter. Admin scripts that habitually pass namespace='system' on every network call will now receive a 404 when querying the floating network and must omit namespace to reach it.
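The scoping rules above can be summarised in a short model. This is not the decorator code: the function and exception names are hypothetical, objects are plain dicts, a system caller is assumed to be one whose own namespace is 'system', and a tenant who omits namespace is assumed to be scoped to their own namespace.

```python
class NotFound(Exception):
    """Stands in for an HTTP 404 response."""


def resolve_ref(name, objects, caller_namespace, requested_namespace=None):
    """Resolve `name` among `objects` under the strict scoping rules.

    Each object is a dict with 'name' and 'namespace'; the floating network
    is modelled with namespace None.
    """
    if caller_namespace == 'system':
        if requested_namespace:
            # Strictly scoped: only objects in the requested namespace match.
            scope = [o for o in objects if o['namespace'] == requested_namespace]
        else:
            # Historical behaviour: search every namespace.
            scope = list(objects)
    else:
        if requested_namespace and requested_namespace != caller_namespace:
            raise NotFound(name)
        scope = [o for o in objects if o['namespace'] == caller_namespace]

    for obj in scope:
        if obj['name'] == name:
            return obj
    raise NotFound(name)
```

Under this model a system caller that passes namespace='system' cannot reach an object whose namespace is None, which matches the floating network behaviour called out above.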

Load-aware scheduling and per-node resource reservations¶

The scheduler now understands that machines come in different sizes and that some machines have other jobs to do. Three related changes landed together:

Per-node resource reservations. Every node reserves RAM, CPU and disk for the operating system and host-level services, through three config keys: NODE_RAM_RESERVATION_GB (default 2.0 GB), NODE_CPU_RESERVATION_THREADS (default 2 hardware threads -- reservation is denominated in threads, not physical cores) and NODE_DISK_RESERVATION_GB (default 20.0 GB, kept free on the instances and blobs filesystems). These are per-node values, set through each node's /etc/sf/config (which the Ansible deploy templates per host), and are never set cluster-wide with sf-ctl set-config. There is no separate addition for nodes carrying a cluster-wide role (network node, database node) in the server; instead, the deploy computes a per-host default that already folds in that bump -- 10% of the host's RAM floored at 2 GB plus 4 GB on network/database nodes, four threads instead of two on network/database nodes, and a flat 20 GB of disk -- and an operator can override any of the three per host from inventory. The resources daemon reads its own node's values, computes the resulting schedulable capacity, and publishes it in node metrics (cpu_schedulable, memory_reserved_mb, disk_reservation_gb), along with CPU topology (physical cores, threads, and performance / efficiency core counts on hybrid parts). NODE_DISK_RESERVATION_GB replaces the old cluster-wide MINIMUM_FREE_DISK.
Placement is load-aware and size-aware. Candidate nodes are ranked by load per schedulable thread rather than raw load average, and selection within a band of similar nodes is weighted by each node's headroom toward a target load (SCHEDULER_TARGET_LOAD, default 0.75 per schedulable thread). Larger or idler machines now draw a proportionally larger share of a burst of instance creates; previously a small busy machine and a large idle one could be treated as equals.
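The per-host reservation defaults that the deploy computes can be worked through in a few lines. This is one reading of the description above, not the deployer's code: the function name is hypothetical, and the RAM figure is interpreted as 10% of host RAM with a 2 GB floor, with the 4 GB role bump added afterwards.

```python
def default_reservations(total_ram_gb, is_network_or_database_node):
    """Compute per-host reservation defaults as described in the notes."""
    # 10% of host RAM, floored at 2 GB, plus 4 GB for cluster-wide roles.
    ram_gb = max(total_ram_gb / 10, 2.0)
    if is_network_or_database_node:
        ram_gb += 4.0

    # Four threads on network/database nodes, otherwise two.
    threads = 4 if is_network_or_database_node else 2

    return {
        'NODE_RAM_RESERVATION_GB': ram_gb,
        'NODE_CPU_RESERVATION_THREADS': threads,
        'NODE_DISK_RESERVATION_GB': 20.0,  # flat value for every node
    }
```

For example, under this reading an 80 GB hypervisor reserves 8 GB of RAM and 2 threads, while an 80 GB network node reserves 12 GB and 4 threads.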

`CPU_OVERCOMMIT_RATIO` semantic change¶

CPU_OVERCOMMIT_RATIO now multiplies the schedulable thread count (threads minus NODE_CPU_RESERVATION_THREADS) rather than all threads, and its default has dropped from 16 to 3.0. The old default dated back to assumptions about many mostly-idle instances and in practice never rejected a node; the new default was measured on a CI-dominated cluster where busy hypervisors sustain 2.3-3.0 allocated vCPUs per thread with RAM as the binding constraint.
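To make the size of the change concrete, the following sketch compares the old and new vCPU caps for a node. The function names are hypothetical; the arithmetic follows the description above, with the old calculation multiplying all threads by the previous default of 16.

```python
def vcpu_cap(total_threads, reserved_threads=2, overcommit_ratio=3.0):
    """New behaviour: the ratio multiplies schedulable threads only."""
    schedulable = max(total_threads - reserved_threads, 0)
    return schedulable * overcommit_ratio


def old_vcpu_cap(total_threads, overcommit_ratio=16):
    """Old behaviour: the ratio multiplied every hardware thread."""
    return total_threads * overcommit_ratio
```

On a 32-thread hypervisor with default settings, the cap drops from 512 allocated vCPUs to 90.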

If your cluster is already packed beyond the new cap, existing instances are untouched but new schedules to full nodes will be refused until they drain. Operators who prefer the historic behaviour can set CPU_OVERCOMMIT_RATIO=16 and NODE_CPU_RESERVATION_THREADS / NODE_RAM_RESERVATION_GB to zero per node in inventory.
