v0.7 to v0.8 release notes¶
Major changes¶
- You can now hot plug (add) a network interface to a running instance. This is exposed by a POST to the /instance/...instanceref.../interfaces endpoint, passing a network specification like you would at boot time. More details are available in the
- The Shaken Fist client now has a plugin system, which allows additional commands that may not be of interest to all users to be added without cluttering the main codebase. Orchestration of k3s clusters via the shakenfist-client-k3s plugin is the first example of one of these plugins.
- You can now optionally have DNS entries for the instances in a virtual network provided to the instances on that network via the provide_dns argument on network creation. This included a re-write of DHCP server options in the configuration file.
- We now use threads for most concurrency requirements. See the operator page on threading for further details.
- Disk utilization is now a factor in scheduling decisions. This might produce unexpected results if you have many very active filesystems mounted on a single hypervisor.
REST API¶
- There is now an API call (GET /admin/resources) which exposes the resource utilization of the cluster to admin users.
- You can now POST to /instance/...instanceref.../interfaces as described above to hot plug new network interfaces into an existing instance.
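As a rough sketch of the hot plug call, the following builds (but does not send) the POST request. The base URL, the bearer token header, and the fields inside the network specification are all assumptions made for illustration; consult the API documentation for the real request shape.

```python
import json
import urllib.request

# Hypothetical base URL, used only for illustration.
API_BASE = "https://sf.example.com/api"


def build_hotplug_request(instance_ref, netspec, token):
    """Build (without sending) a POST that hot plugs a network interface.

    `netspec` stands in for the same network specification you would pass
    at boot time. Its exact fields are an assumption in this sketch.
    """
    url = f"{API_BASE}/instance/{instance_ref}/interfaces"
    return urllib.request.Request(
        url,
        data=json.dumps(netspec).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme.
            "Authorization": f"Bearer {token}",
        },
    )


# "network_uuid" is a hypothetical field name.
request = build_hotplug_request(
    "my-instance", {"network_uuid": "example-network"}, "example-token")
```

Sending the request is then a matter of passing it to `urllib.request.urlopen()` or the HTTP client of your choice.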
Supported distributions¶
- Debian 12 is now supported as a host OS. Support for Debian 11 as a host OS has been dropped.
- Ubuntu 22.04 is now supported as a host OS.
- Fedora 34, 38, and 29, as well as Ubuntu 22.04 now have canned guest images.
- Rocky 8 and 9 now have canned guest images.
Logging¶
- REST API request traces are now logged via the event logging mechanism with the object type "api-requests" and the UUID being the request id.
- Effort is now taken to report an event against all objects it affects, not just an arbitrary single object. This should ease debugging.
Containers and Kubernetes¶
- The Shaken Fist client can now orchestrate k3s Kubernetes clusters for you. The lifecycle support is relatively simple at the moment, with cluster creation and deletion supported, as well as fetching the kubectl configuration from the cluster. This will be expanded over time. This support is implemented entirely in the Shaken Fist python client, and heavily utilizes the in guest agent added in v0.7. The client side nature of the orchestration makes it easy for you to customize the orchestration if desired without having to alter the main server code.
Networking¶
- IP address management has moved to a new baseobject called IPAM. Events are therefore recorded for address management as you would expect.
- Addresses released on any network (including the floating network) are now quarantined for IP_DELETION_HALO_DURATION seconds after deletion before they can be reused. The only exception to this is if a network is heavily congested and an allocation attempt will fail. In that case the halo is temporarily reduced to 30 seconds and a warning log message is emitted.
- You can now list the addresses in use for a given network with the sf-client network addresses ...uuid... command.
- In order to support the K3S Kubernetes orchestration, the concept of routed IPs was introduced. A routed IP is an address from the floating address pool which uses routing to deliver traffic to the relevant virtual network. An interface on the virtual network must then have been configured by the user to answer ARP requests for that address. This works well with metallb, which our K3S orchestration uses to expose services.
- Network orchestration now waits for iptables locks, instead of failing commands in high load situations.
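The address quarantine rule can be sketched as a small predicate. This is an illustration only, not the IPAM implementation: the value used for IP_DELETION_HALO_DURATION is an assumed placeholder (these notes do not state the default), and the function and constant names other than IP_DELETION_HALO_DURATION are hypothetical.

```python
# Assumed placeholder value; the real default is not stated in these notes.
IP_DELETION_HALO_DURATION = 300

# The reduced halo applied when a network is heavily congested.
CONGESTED_HALO_DURATION = 30


def is_reusable(released_at, now, congested=False):
    """Return True if an address released at `released_at` may be reused.

    Both timestamps are in seconds. When the network is congested, the
    halo shrinks to 30 seconds (or stays smaller if it was already).
    """
    halo = IP_DELETION_HALO_DURATION
    if congested:
        halo = min(halo, CONGESTED_HALO_DURATION)
    return (now - released_at) >= halo
```

For example, an address released 100 seconds ago is still quarantined on a healthy network, but becomes eligible for reuse once the network is congested.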
Instances¶
- Shaken Fist can now capture screenshots of instance consoles.
- Pause and unpause are now retried several times on failure, as sometimes libvirt does not respond correctly.
- Specifying an incorrect disk bus now returns a more helpful error.
- Power on now implies creation of a config drive if one is specified for the instance. That is, you can force re-creation of the config drive by powering an instance off and then on again.
- Deletion of an instance will now cause most outstanding cluster operations for that instance to be cancelled. This is so these operations do not block the delete operation. The one exception at this time is snapshots, which will complete before the deletion occurs.
Artifacts¶
- When you refer to an artifact by name and there is more than one match, the match in your local namespace (if any) is now preferred instead of returning an error.
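The preference rule might look something like the following sketch. The dictionary shape, the function name, and the choice to raise an error when several matches exist but none is local are assumptions for illustration, not the server's actual code.

```python
def resolve_artifact(name, artifacts, caller_namespace):
    """Pick an artifact by name, preferring the caller's own namespace.

    `artifacts` is a list of dicts with "name" and "namespace" keys
    (a hypothetical shape used only for this sketch).
    """
    matches = [a for a in artifacts if a["name"] == name]
    if len(matches) == 1:
        return matches[0]

    # More than one match: prefer the one in the caller's namespace.
    local = [a for a in matches if a["namespace"] == caller_namespace]
    if len(local) == 1:
        return local[0]

    # Assumed behaviour: no match, or an ambiguity with no local match,
    # is still an error.
    raise LookupError(f"cannot uniquely resolve artifact {name!r}")
```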
Deployment¶
- We no longer reset the authentication secret used to generate authentication tokens on upgrade. This means tokens from before an upgrade will continue to work for their normal lifetime.
- We now lock versions of upstream Ansible Galaxy dependencies.
- We now lock versions of all of our indirect python dependencies.
- Nodes can now transition directly from the missing state to the stopping state.
- The Ansible modules have been re-written to skip resources that already exist and are as described by your request. The Ansible module is also now documented in the user guide.
- TLS certificates on hypervisors which are within 30 days of expiry are now automatically replaced.
Database¶
- Object state (created, deleted, error, etc.) is now stored in MariaDB instead of etcd. This improves query performance for state-based lookups and reduces etcd load. MariaDB is now required for all deployments.
- A new database microservice runs on etcd_master nodes and provides gRPC access to MariaDB for all other nodes. This centralizes database access and means only etcd_master nodes require direct MariaDB credentials.
- MariaDB schema versioning is now tracked in a schema_versions table. This enables automatic schema migrations when upgrading Shaken Fist. The sf-ctl ensure-mariadb-schema command can be used to manually verify or apply schema migrations.
- Existing deployments upgrading to v0.8 must run sf-ctl migrate-state-to-mariadb after stopping services and before starting the new version. Use --dry-run to preview what will be migrated.
- Cluster operation headers and the per-node work queues have been moved from etcd to the new cluster_operations and work_queue MariaDB tables. Create and enqueue is now a single MariaDB transaction (header + state + queue row), and dequeue uses SELECT ... FOR UPDATE SKIP LOCKED for race-safe claims. Residual /sf/queue/, /sf/processing/ and /sf/{op_type}/ etcd keys left behind by older clusters are drained automatically by a one-shot data migration on database daemon startup.
- A new stuck-job reaper in the cluster daemon reclaims work queue rows whose claim has exceeded CLUSTER_OP_STUCK_THRESHOLD seconds (default 1800) and rejects jobs that have been claimed more than CLUSTER_OP_MAX_ATTEMPTS times (default 5). Reaper activity is exposed via cluster_op_reaper_requeued_total and cluster_op_reaper_rejected_total on CLUSTER_METRICS_PORT (default 13007).
- The floating network now uses a well-known UUID (f10a7f10-a7f1-4a7f-a10a-7f10a7f10a7f) instead of the invalid string "floating". Existing deployments must run sf-ctl migrate-floating-network-uuid after stopping services to migrate. This change enables proper UUID4 type validation in the IPAM schema.
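The stuck-job reaper's decision can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the cluster daemon's code: the QueueRow fields and the reap function are hypothetical, and the sketch assumes the attempt limit is only evaluated for rows whose claim has already gone stale.

```python
from dataclasses import dataclass

CLUSTER_OP_STUCK_THRESHOLD = 1800  # seconds, the default quoted above
CLUSTER_OP_MAX_ATTEMPTS = 5        # the default quoted above


@dataclass
class QueueRow:
    """A hypothetical view of a claimed work queue row."""
    job_id: str
    claimed_at: float  # epoch seconds at which the row was claimed
    attempts: int      # number of times the job has been claimed


def reap(rows, now):
    """Split stale claims into (requeued, rejected) lists of job ids."""
    requeued, rejected = [], []
    for row in rows:
        if now - row.claimed_at <= CLUSTER_OP_STUCK_THRESHOLD:
            continue  # the claim is still considered healthy
        if row.attempts > CLUSTER_OP_MAX_ATTEMPTS:
            rejected.append(row.job_id)
        else:
            requeued.append(row.job_id)
    return requeued, rejected
```

In this sketch, the lengths of the two lists correspond to the increments you would expect to see on cluster_op_reaper_requeued_total and cluster_op_reaper_rejected_total.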
Bring your own MariaDB¶
Earlier in the v0.8 cycle the Shaken Fist deployer installed and configured MariaDB for you. As of this batch of changes that stopped: operators provide their own MariaDB and the deployer slots into existing infrastructure rather than prescribing it. This is a breaking change — existing deployments must be rebuilt against the new shape.
- Deployer no longer installs MariaDB. The roles/mariadb/ Ansible role has been deleted. Operators are responsible for running a MariaDB 10.6.0+ server before deploying Shaken Fist.
- Bootstrap and tuning helpers ship in the source tree. tools/bootstrap-mariadb.sql creates the shakenfist database, the shakenfist user, and the required grants. Apply it once against your MariaDB before the first deploy; the script is idempotent and safe to re-run. examples/mariadb-tuning.cnf is an optional drop-in with starting-point InnoDB and connection-pool settings; copy it to /etc/mysql/mariadb.conf.d/ if you want SF's recommended baseline.
- getsf prompts for connection details. The installer now asks for GETSF_MARIADB_HOST, GETSF_MARIADB_PORT (default 3306), GETSF_MARIADB_USER (default shakenfist), GETSF_MARIADB_PASSWORD, and GETSF_MARIADB_DATABASE (default shakenfist). All five can be passed as environment variables for non-interactive deploys. The deployer no longer generates a password; operators choose it when they apply tools/bootstrap-mariadb.sql.
- sf-ctl ensure-mariadb-schema is the only path for schema work. sf-database no longer calls ensure_schema() or runs data migrations at startup. Instead it reads the recorded schema version, compares it against its own expectations, and refuses to start on mismatch. Run sf-ctl ensure-mariadb-schema to create or migrate the schema; this command is the single authoritative path for all schema changes.
- MariaDB compatibility check at startup. Both sf-ctl ensure-mariadb-schema and sf-database at startup verify that the server is MariaDB (not MySQL), version 10.6.0 or later, with the default storage engine set to InnoDB, the connection charset utf8mb4, and a utf8mb4_* collation. An incompatible server surfaces as a clear refusal-to-start with a multi-line error rather than a runtime failure on the first JSON-column write.
- MARIADB_GATEWAY_HOSTS replaces DATABASE_NODE_IP. The new config key is a comma-separated list of sf-database gRPC endpoints; single-instance deployments use a one-element list. The companion keys MARIADB_GATEWAY_PORT (default 13005) and MARIADB_GATEWAY_METRICS_PORT (default 13006) set the gRPC and Prometheus ports that each sf-database instance binds on.
- MARIADB_HOST scope is now narrow. Set MARIADB_HOST only on nodes that run sf-database and on any node where an operator manually runs sf-ctl ensure-mariadb-schema. The previous MARIADB_HOST=localhost direct-access hack used at config-bootstrap time is gone.
- sf-database is a tier of N >= 1 instances. All instances connect to the same MariaDB. None is elected leader; all serve any inbound gRPC request. Every other SF daemon reaches the tier through a client-side load-balanced gRPC channel that round-robins requests across the MARIADB_GATEWAY_HOSTS list, skipping dead endpoints via subchannel connectivity state and client keepalives. No external L4 load balancer is required. The grpc.health.v1.Health protocol is published for external monitoring via unary Check calls.
- CI exercises N > 1. The slim-tier CI topology runs two sf-database instances on every merge-queue run. The multi-instance shape is a supported production configuration.
- Several sf-ctl commands have been deleted. migrate-state-to-mariadb, migrate-floating-network-uuid, and all other sf-ctl migrate-* commands tied to the etcd era are gone. Operators on the new shape do not run any migration command. The migrate-etcd-to-mariadb Claude Code skill in .claude/skills/ has also been removed.
- Greenfields only. PLAN-byo-mariadb does not preserve compatibility with deployments that took the earlier-in-the-v0.8-cycle shape (deployer-installed MariaDB, singular DATABASE_NODE_IP, etc.). Operators rebuild against the new shape.
See docs/plans/PLAN-byo-mariadb.md for the multi-phase rollout
details.
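The startup compatibility check described above can be approximated with the sketch below. It is not the code used by sf-database: the function name and the idea of returning a list of problems are assumptions, and the inputs are expected to come from queries such as SELECT VERSION() and the relevant server and session variables.

```python
import re

MIN_VERSION = (10, 6, 0)


def check_mariadb_compat(version, default_engine, charset, collation):
    """Return a list of human-readable problems; empty means compatible.

    `version` is a string as reported by SELECT VERSION(), for example
    "10.6.12-MariaDB-0ubuntu0.22.04.1".
    """
    problems = []

    if "mariadb" not in version.lower():
        problems.append("server is not MariaDB")

    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        problems.append(f"cannot parse server version {version!r}")
    elif tuple(int(part) for part in match.groups()) < MIN_VERSION:
        problems.append(f"server version {version!r} is older than 10.6.0")

    if default_engine.lower() != "innodb":
        problems.append(f"default storage engine is {default_engine!r}")
    if charset != "utf8mb4":
        problems.append(f"connection charset is {charset!r}")
    if not collation.startswith("utf8mb4_"):
        problems.append(f"collation is {collation!r}")

    return problems
```

A caller would typically join the returned problems into one multi-line error and refuse to start if the list is non-empty.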
Performance¶
- Events are written directly to MariaDB via a per-daemon local spool rather than through etcd or a dedicated eventlog node. This eliminates all event-related etcd traffic and the single-node bottleneck that the old sf-eventlog daemon represented. (Earlier in the v0.8 cycle events were sent via gRPC to the eventlog node; that stage has also been superseded.)
- We now use gRPC calls to compact etcd, instead of relying on a python client wrapper. This means we can now update our gRPC and protobuf dependencies to much more recent versions.
- etcd traffic levels are now monitored in CI and we attempt to hold fewer cluster level locks for local operations.
Minor changes¶
- Unhandled exceptions are now recorded to /srv/shakenfist/exceptions/ for later analysis. See the operator guide for details.
- CI has been moved from relatively unreliable scraping of the instance serial console over telnet to using the Shaken Fist in-guest agent to inspect the state of instances for correctness.
- The slow lock warning threshold is no longer configurable (SLOW_LOCK_THRESHOLD). Instead, a warning is emitted if a lock takes more than half of the specified timeout period to be acquired. This change was made because in some places we expect to wait a long time for a lock (for example, serialized fetches of a single resource from outside the cluster), but we also wanted to ensure that locks didn't take a long time to acquire in CI.
- Shaken Fist now uses Renovate to keep the dependencies of the develop branch up to date. This means that locking requirements at release time is no longer required, which makes the release process more reliable.
- The qemu commands generated now vary based on the version of qemu installed on the machine. This was required to support the newer qemu version in Ubuntu 22.04.
- The Ansible modules have been rewritten to be more reliable.
- The Shaken Fist client now uses HTTP sessions to reduce latency for requests.
- etcd has been removed. All object storage, cluster locks, work queues, cluster configuration, and the event dead letter queue now live in MariaDB. The etcd.py module is retained as a minimal shim only so that DATA_MIGRATIONS entries can drain residual etcd keys on first startup of an upgraded cluster; it will be removed entirely in the next minor release. Operators no longer need to deploy or maintain etcd. The SHAKENFIST_ETCD_HOST, SHAKENFIST_DATABASE_USE_DIRECT_ETCD, SHAKENFIST_LOG_ETCD_CONNECTIONS, and SHAKENFIST_NODE_IS_ETCD_MASTER configuration settings have been removed. The etcd_master Ansible host group now identifies the database node and will be renamed in a future release.
- sf-backup and sf-restore have been retired. These commands previously backed up and restored etcd state. With etcd removed, they are no longer functional and will print a message directing you to use MariaDB tooling instead. To back up your Shaken Fist cluster, use mariadb-dump (or mysqldump) to export the Shaken Fist database. To restore, use mariadb (or mysql) to import the dump. Event history is now part of the MariaDB database and no separate backup step is needed for it.
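As one possible way to script the backup, the sketch below assembles a mariadb-dump command line. The helper name, the host, and the choice of --single-transaction are assumptions for illustration; credentials are deliberately left to a client option file rather than being placed on the command line.

```python
import shlex


def build_dump_command(host, port, user, database):
    """Return an argv list for dumping one database with mariadb-dump."""
    return [
        "mariadb-dump",
        # Take a consistent snapshot of InnoDB tables without locking them.
        "--single-transaction",
        f"--host={host}",
        f"--port={port}",
        f"--user={user}",
        database,
    ]


# Hypothetical connection details.
command = build_dump_command(
    "db.example.com", 3306, "shakenfist", "shakenfist")
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, stdout=dump_file, check=True)` with an open file object to write the dump to disk.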
Event logging migrated to MariaDB¶
Events previously lived in per-object SQLite files managed by a dedicated
sf-eventlog daemon on a designated eventlog node. They now live in two
MariaDB tables (events and event_objects) and are written from every
daemon via a local spool that is drained in batches by a background thread
directly into the database service. REST reads are served from MariaDB by
any sf-api node; there is no longer a single eventlog node or a separate
sf-eventlog daemon. See
the operator guide on events for the full
write-path, read-path, retention, and metrics reference.
History loss on upgrade¶
Events written before the upgrade are not reachable through the REST API
after cut-over. The on-disk SQLite chunks under /srv/shakenfist/events/
on the former eventlog node remain present until you remove them — no
daemon writes there any more. This behaviour was deliberate; preserving
pre-upgrade history would have required a separate migration tool with no
operator-visible operational benefit.
- Once the new code is running on every node, rm -rf /srv/shakenfist/events/ is safe to run on the former eventlog node.
Ansible inventory change¶
The eventlog_node Ansible host group is gone. The daemon registration,
service-start, and service-stop blocks that referenced it have been
removed from the playbooks.
- Remove the eventlog_node group from your inventory before deploying the new version.
Configuration key removals¶
The following configuration keys have been deleted and are no longer read by any daemon. Leaving them in your environment files is harmless but you are encouraged to remove them to avoid confusion:
- EVENTLOG_NODE_IP
- EVENTLOG_API_PORT
- EVENTLOG_METRICS_PORT
- EVENTLOG_SUPPRESS_GRPC
- NODE_IS_EVENTLOG_NODE
REST /events response shape change¶
The response objects returned by the /{instance,artifact,network,node,blob}/<uuid>/events endpoints have been updated:
- correlation_id is renamed to event_uuid. Clients that read this key by name must be updated; clients that pass the response dict through opaquely (such as the shakenfist-client Python library) require no change.
- type is renamed to event_type to match the new schema column name. Clients that read response["type"] must be updated to read response["event_type"] instead.
- request_id is now a first-class top-level field rather than being nested inside extra. Clients that read extra["request_id"] must be updated to read the top-level request_id field instead.
- The limit query parameter is now capped at 1000 server-side. Any request with a limit greater than 1000 returns at most 1000 rows.
- A limit of 0 or any negative value now defaults to 100 rather than returning all rows. Callers that previously passed a negative limit to fetch every event must switch to paginating with a positive limit.
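A client that must work against both the old and new response shapes could use something like the sketch below during a transition. The helper names are hypothetical; the key names and the limit rules are taken from the list above.

```python
MAX_LIMIT = 1000
DEFAULT_LIMIT = 100


def effective_limit(requested):
    """Mirror the server-side handling of the limit query parameter."""
    if requested <= 0:
        return DEFAULT_LIMIT
    return min(requested, MAX_LIMIT)


def read_event(event):
    """Read the renamed fields, falling back to the pre-upgrade names."""
    extra = event.get("extra") or {}
    return {
        "event_uuid": event.get("event_uuid", event.get("correlation_id")),
        "event_type": event.get("event_type", event.get("type")),
        "request_id": event.get("request_id", extra.get("request_id")),
    }
```

Once every server the client talks to has been upgraded, the fallbacks to correlation_id, type, and extra["request_id"] can be dropped.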
New Prometheus metrics¶
The following metrics are new in this release. The sf-database daemon hosts the storage-side counters; the spool metrics are emitted by every daemon process:
- database_events_rows: gauge, sf-database: current row count in the events table.
- database_events_inserted_total{event_type}: counter, sf-database: events written per event type.
- database_events_pruned_total{event_type}: counter, sf-database: events pruned per daily retention run, broken down by event type.
- database_orphan_events_pruned_total: counter, sf-database: events removed by the orphan sweep (no remaining event_objects row).
- eventlog_spool_depth: gauge, every daemon: current number of events queued in the local spool awaiting drain.
- eventlog_spool_dropped_total: counter, every daemon: events silently dropped because the spool exceeded the high-water mark (SPOOL_HIGH_WATER_MARK, default 100 000).
See the operator guide on events for alert recommendations and the full metrics reference.
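To make the relationship between the two spool metrics concrete, here is a minimal in-memory sketch of a bounded spool. The class and method names are hypothetical, and dropping the newest event once the high-water mark is reached is an assumption; the real spool may behave differently.

```python
import threading
from collections import deque


class LocalSpool:
    """A bounded, thread-safe queue of events awaiting drain."""

    def __init__(self, high_water_mark=100_000):
        self._events = deque()
        self._lock = threading.Lock()
        self._high_water_mark = high_water_mark
        self.dropped_total = 0  # analogous to eventlog_spool_dropped_total

    def put(self, event):
        """Queue an event, or drop it if the spool is already full."""
        with self._lock:
            if len(self._events) >= self._high_water_mark:
                self.dropped_total += 1
                return False
            self._events.append(event)
            return True

    def depth(self):
        """Current queue length, analogous to eventlog_spool_depth."""
        with self._lock:
            return len(self._events)

    def drain_batch(self, batch_size):
        """Remove and return up to `batch_size` of the oldest events."""
        with self._lock:
            count = min(batch_size, len(self._events))
            return [self._events.popleft() for _ in range(count)]
```

A background thread would call `drain_batch()` in a loop and write each batch to the database service, so a steadily growing depth suggests the drain is not keeping up.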
Strict namespace scoping on *_ref lookups¶
The instance, network, and artifact lookup decorators
(arg_is_instance_ref, arg_is_network_ref, arg_is_artifact_ref) now
honour the request body's namespace field strictly. Two behaviours
changed:
- When a system caller passes namespace=<x> (for example client.get_instance(name, namespace='ovirt-homelab')), the lookup is now scoped strictly to namespace <x>. Previously the namespace body field was ignored for resolution and a same-named object in a different namespace could be returned. A system caller who does not pass namespace retains the historical cross-namespace "search everywhere" behaviour.
- When a tenant caller passes a namespace=<other> that is not their own namespace, the request now returns 404 instead of silently resolving against the caller's own namespace. Tenants who relied on the old shape were already going to get a 404 from the subsequent ownership check; the request is now rejected sooner.
There is no migration step. Callers that pass namespace matching
the object's actual namespace are unaffected.
One specific consequence is worth calling out for operator tooling:
the floating network has namespace=NULL in the database and
therefore no longer matches any non-empty namespace=<x> filter.
Admin scripts that habitually pass namespace='system' on every
network call will now receive a 404 when querying the floating
network and must omit namespace to reach it.
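The scoping rules can be summarised with the following sketch. It is an illustration rather than the decorator implementation: the function name and object shape are hypothetical, a return value of None stands in for a 404, and scoping a tenant who omits namespace to their own namespace is an assumption.

```python
def resolve_ref(name, objects, caller_namespace, requested_namespace=None):
    """Return the matching object, or None to represent a 404.

    `objects` is a list of dicts with "name" and "namespace" keys. A
    namespace of None models the floating network's NULL namespace.
    """
    is_system = caller_namespace == "system"

    if requested_namespace is not None:
        if not is_system and requested_namespace != caller_namespace:
            return None  # tenants cannot look into other namespaces
        scope = requested_namespace
    elif is_system:
        scope = None  # historical "search everywhere" behaviour
    else:
        scope = caller_namespace  # assumed default for tenants

    for obj in objects:
        if obj["name"] != name:
            continue
        if scope is None or obj["namespace"] == scope:
            return obj
    return None
```

With this model, a system caller asking for the floating network with namespace='system' gets no match because None never equals 'system', while the same call without a namespace succeeds.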