CI platform-matrix expansion
Prompt
Expand ryll's CI to catch platform-specific runtime bugs that
the current matrix lets through. Today's CI builds and runs
unit tests on Linux, macOS, and Windows, but the runtime smoke
tests (tools/web-smoke.sh, --web TLS) and lint run on Linux
only. Bugs that need real platform execution to surface — like
the rustls CryptoProvider panic that broke macOS builds at
TLS connect time but never showed on Linux (commit a9aff050,
2026-05-08) — slip through and only surface during dogfooding.
This master plan was spun out of bench observations during the session-001 feedback work. It is independent of the session-001 phases and the macOS-runtime-metrics plan, and can land in any order.
When working through phases, follow the project's plan
conventions (per-phase plan files named
PLAN-ci-platform-matrix-phase-NN-*.md, one logical change per
commit, master-plan table updated as work lands).
Situation
What CI does today
.github/workflows/ci.yml defines:
- Lint (cargo fmt --check + cargo clippy): Linux only.
- Build matrix across Linux / macOS / Windows. Each matrix cell:
    - Builds with cargo build --release -p ryll.
    - Runs cargo test --workspace.
    - Linux-only: tools/web-smoke.sh (HTTP) and tools/web-smoke.sh --tls (HTTPS) against the just-built binary in --web mode.
    - Produces a platform-shaped artifact (.deb, .rpm, macOS tarball, Windows zip).
- tools/web-smoke.sh is a Bash script: it cannot run on the Windows runner without WSL or a PowerShell rewrite. Its if: runner.os == 'Linux' gate also excludes the macOS runner, even though Bash exists there.
- QEMU integration tests (make test-qemu*) run nowhere in CI today; they need KVM and a libvirt-style stack.
Bug classes the current matrix catches
- Compile errors anywhere in the workspace, on every platform.
- Unit-test regressions, on every platform.
- Lint violations and formatting drift (Linux only — but lint is platform-independent in practice).
- --web-mode regressions, including TLS-cert loading and HTTP routing (Linux only).
- Build artifact packaging (.deb / .rpm / .tar.gz / .zip).
Bug classes the current matrix misses
The session-001 work has surfaced three specific gaps:
1. Runtime startup paths only exercised inside --web smoke, on Linux. The rustls install_default() panic at client.rs:157 lay dormant because no test ever constructed a SPICE TLS connector outside --web mode. --web happened to install a CryptoProvider early; the GUI / headless path did not. Tests passed. Local macOS dogfooding fell over the moment a TLS SPICE target was contacted.
2. Platform-specific runtime APIs. Phase 02 of session-001-feedback adds NSProcessInfo.beginActivityWithOptions for macOS App Nap opt-out; G1's macOS-runtime-metrics plan adds Mach task_info / thread_info calls. Both are cfg-gated to target_os = "macos". Our current macOS matrix cell compiles them but cannot exercise them meaningfully: a cargo test running on a fresh macOS runner is foreground-active, so App Nap conditions never trigger, and an FFI call returning the right struct in unit isolation says nothing about a real session's behaviour. This is harder to address than (1), and may require a "manual-but-checklisted" QA pass rather than full automation. See Phase 04.
3. Bash-only smoke tests force runtime-test parity gaps: anywhere a test is written in Bash, Windows is excluded by construction.
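The first gap comes down to a process-wide default that one startup path installed and another did not. A minimal std-only sketch of the usual remedy follows, assuming the shared connector constructor performs an idempotent install itself. `Provider`, `ensure_default_provider`, and `build_tls_connector` are hypothetical names, and `OnceLock` stands in for rustls's provider registry rather than reproducing its API.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a process-wide crypto provider.
#[derive(Debug, PartialEq)]
pub struct Provider {
    pub name: &'static str,
}

/// Process-wide slot, analogous in spirit to a library-level default.
static DEFAULT_PROVIDER: OnceLock<Provider> = OnceLock::new();

/// Idempotent install: safe to call from every entry point.
/// Returns true only if this call performed the install.
pub fn ensure_default_provider() -> bool {
    DEFAULT_PROVIDER.set(Provider { name: "example" }).is_ok()
}

/// Fallible lookup, so a missing install is reported instead of panicking.
pub fn default_provider() -> Result<&'static Provider, &'static str> {
    DEFAULT_PROVIDER.get().ok_or("no default provider installed")
}

/// Shared connector setup: installs the provider itself rather than
/// relying on whichever startup path happened to run first.
pub fn build_tls_connector() -> Result<&'static Provider, &'static str> {
    ensure_default_provider();
    default_provider()
}
```

With this shape, a caller that never ran the --web startup code still gets a provider, and a lookup made before any install surfaces as an error rather than a panic.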
Why "everything on every platform" is the wrong answer
GitHub Actions runners price per minute (current public pricing, 2026):
| Runner | Linux | Windows | macOS |
|---|---|---|---|
| Cost multiplier vs Linux | 1× | 2× | 10× |
| Cold-cache test latency (this repo) | ~6 min | ~12 min | ~10 min |
| Flake rate (observed, this repo) | low | low | medium |
Naïvely promoting every Linux-only job to all three platforms roughly doubles wall-clock latency for those jobs (the cells run concurrently, so the PR waits on the slowest one: ~12 min on Windows against ~6 min on Linux alone) and multiplies their cost by more than ten (one minute on each platform bills 1 + 2 + 10 = 13 Linux-equivalent minutes, and macOS dominates the bill). That tradeoff is fine for a release build; it is wasteful for a typo-fix PR.
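As a rough cross-check of the tradeoff, the table's figures can be plugged into a small calculation. This is a sketch under simplifying assumptions: it treats the cold-cache test latency as the whole job duration and ignores per-job rounding and cache hits. `Cell`, `billed_minutes`, and `wall_clock_minutes` are illustrative names.

```rust
/// One matrix cell: billing multiplier relative to Linux, and job
/// duration in minutes (assumed equal to the table's test latency).
#[derive(Clone, Copy)]
pub struct Cell {
    pub multiplier: u32,
    pub minutes: u32,
}

/// Figures copied from the table above.
pub const LINUX: Cell = Cell { multiplier: 1, minutes: 6 };
pub const WINDOWS: Cell = Cell { multiplier: 2, minutes: 12 };
pub const MACOS: Cell = Cell { multiplier: 10, minutes: 10 };

/// Linux-equivalent billed minutes for a set of cells.
pub fn billed_minutes(cells: &[Cell]) -> u32 {
    cells.iter().map(|c| c.multiplier * c.minutes).sum()
}

/// Wall-clock latency when the cells run concurrently: the slowest cell.
pub fn wall_clock_minutes(cells: &[Cell]) -> u32 {
    cells.iter().map(|c| c.minutes).max().unwrap_or(0)
}
```

Under these assumptions a job that runs on Linux alone bills 6 Linux-equivalent minutes, while the same job on all three cells bills 6 + 24 + 100 = 130 and finishes when the 12-minute Windows cell does.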
The right answer is graduated coverage: every platform runs the parts that catch platform-specific bugs, and only Linux runs the parts that don't (lint, formatting, deep integration). Phase plans below pick one expansion at a time and judge it on bug-class coverage per minute spent.
Mission and problem statement
Make ryll's CI catch the bug classes that currently only surface during macOS / Windows dogfooding. Specifically:
- The TLS startup path on every platform, on every PR. (Catches the rustls panic class.)
- Runtime smoke tests that work on Windows, not just Linux. (Catches the bash-script-portability class.)
- A clear, documented manual-QA checklist for the bug classes automation cannot reach (App Nap, code-signing, Gatekeeper interactions). (Acknowledges the limit, doesn't pretend automation handles it.)
Out of scope: Linux-runner-equivalent integration testing on
macOS / Windows (no KVM, no QEMU SPICE stack); browser-based
automation of --web mode (separate plan); cross-compilation
matrices (we already build natively).
Approach
The plan breaks into four phases. Phases 01 and 02 are no-regret expansions that fit comfortably in current CI budgets. Phase 03 is a portability cleanup that unblocks the Windows half of Phase 02. Phase 04 documents a manual-QA boundary; the work itself is a doc, not automation.
Phase 01: Cross-platform GUI/headless TLS smoke test
Add a smoke test that:
- Spins up a minimal TLS-capable echo server in-process (rustls server config, self-signed cert, pinned CA passed to ryll via the same path the SPICE client uses).
- Invokes ryll's TLS client setup path (constructs a SpiceClient with a TLS-port ConnectionConfig, attempts the handshake).
- Asserts the handshake reaches "connected" (the in-process server logs the client hello) before tearing down.
- Does not require a real SPICE server: the server hello is enough to confirm rustls didn't panic and the client reached the network layer.
Where it lives: a new integration test in
shakenfist-spice-protocol/tests/tls_handshake.rs, picked up
automatically by the existing cargo test --workspace in the
build matrix. No new CI step required.
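The harness shape described above can be sketched with the standard library alone. This is not the TLS test itself: plain TCP stands in for the rustls server and for ryll's client setup, and `spawn_probe_server` and `client_hello` are hypothetical helpers. It only shows the in-process server on an ephemeral port and the check that the server saw the client's first bytes.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Start an in-process server on an ephemeral loopback port. The returned
/// handle yields the first bytes the server received from the client.
pub fn spawn_probe_server(
) -> std::io::Result<(u16, thread::JoinHandle<std::io::Result<Vec<u8>>>)> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    let port = listener.local_addr()?.port();
    let handle = thread::spawn(move || -> std::io::Result<Vec<u8>> {
        let (mut stream, _) = listener.accept()?;
        // Bound the wait so a client that never writes fails the test
        // instead of hanging the CI job.
        stream.set_read_timeout(Some(Duration::from_secs(5)))?;
        let mut buf = [0u8; 64];
        let n = stream.read(&mut buf)?;
        Ok(buf[..n].to_vec())
    });
    Ok((port, handle))
}

/// Stand-in for the client under test. The real test would construct the
/// client and attempt the TLS handshake here (assumption, not ryll's API).
pub fn client_hello(port: u16, hello: &[u8]) -> std::io::Result<()> {
    let mut stream = TcpStream::connect(("127.0.0.1", port))?;
    stream.write_all(hello)?;
    Ok(())
}
```

In the real test, the server thread would wrap the accepted stream in a rustls server connection, and the client side would go through the same TLS setup path the SPICE client uses. Binding to port 0 keeps the test free of fixed-port collisions on shared runners.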
This catches:
- The rustls CryptoProvider install regression.
- Any future TLS feature-unification surprise (cert-loader
pulled in via a transitive crate, etc.).
- Hostname-verifier behaviour on every platform (the
SpiceCaVerifier in client.rs has subtle platform-dependent
behaviour on root-store loading).
Cost: ~5–10 s extra test runtime per matrix cell. Fits inside the existing test step.
Bug-class coverage per minute: very high. The single test would have caught today's bug at PR time.
Phase 02: web-smoke parity on macOS and Windows
Two tasks:
(a) Drop the if: runner.os == 'Linux' gate on
tools/web-smoke.sh for the macOS matrix cell. Bash exists on
the macOS runner, the script is plain Bash (no Linux-isms
beyond the SPICE target it speaks to, which is just
ryll --web itself). Verify the gate is the only blocker.
(b) Either port tools/web-smoke.sh to PowerShell, or
rewrite both as a small Rust integration binary that the
build matrix invokes after cargo build --release. The Rust
rewrite is more work but eliminates a class of "Bash on
Windows is a quagmire" headaches forever — and lets the
smoke test reuse types from the workspace.
Recommendation: do (a) first as a one-line CI change, then
plan (b) as its own follow-up phase if Windows coverage of
--web becomes important enough to justify the rewrite.
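If the Rust rewrite in (b) goes ahead, a smoke binary that starts ryll --web as a child process generally has to wait for the listener before sending requests. A minimal portable sketch of that readiness poll follows; `wait_for_port` and its timings are illustrative and are not taken from tools/web-smoke.sh.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Poll until something accepts TCP connections on `addr`, or until
/// `timeout` elapses. Returns true if a connection succeeded.
pub fn wait_for_port(addr: SocketAddr, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        // A short per-attempt timeout keeps one slow attempt from
        // consuming the whole budget.
        if TcpStream::connect_timeout(&addr, Duration::from_millis(200)).is_ok() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(50));
    }
}
```

The same function works unchanged on Linux, macOS, and Windows, which is the main argument for the Rust route over a PowerShell port. The child process itself could be started with std::process::Command pointed at the just-built binary.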
This catches:
- --web-mode regressions on macOS specifically (rustls,
TLS cert loading, axum behaviour).
- --web-mode regressions on Windows (after (b)).
Cost: extra ~30–60 s on the macOS cell (web-smoke includes a
brief ryll --web startup + handshake exchange). On Windows
(after rewrite) similar.
Bug-class coverage per minute: medium-high — duplicates some
Phase 01 coverage but exercises the actual --web HTTP
endpoint path that Phase 01 doesn't.
Phase 03: Smoke-test portability cleanup
Address the structural issue that Phases 01 and 02 hint at:
runtime tests written in Bash exclude Windows. Audit
tools/*.sh and identify which are CI-relevant. For each,
choose:
- Keep as Bash (Linux-only operational scripts such as propose-release.sh and address-comments-with-claude.sh). These don't run in CI per se; no action needed.
- Port to a small Rust tool in a tools/ workspace member, for tests that need to run on every platform.
- Rewrite as a cargo test under the relevant crate, for tests that exercise crate behaviour and can use the test harness.
The goal is that every CI-relevant smoke test runs on every matrix cell. Linux-only operational scripts are fine to leave as Bash — they're not blocking the matrix.
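The first step of the audit, enumerating the scripts to triage, is easy to do portably. A small sketch follows, assuming the scripts sit directly under a single directory. `shell_scripts` is a hypothetical helper, and the CI-relevance decision for each script stays manual.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the `.sh` files directly under `dir`, sorted for stable output.
/// Subdirectories are not searched.
pub fn shell_scripts(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut scripts = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && path.extension().map_or(false, |ext| ext == "sh") {
            scripts.push(path);
        }
    }
    scripts.sort();
    Ok(scripts)
}
```

Each path returned would then be assigned one of the three dispositions listed above.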
This catches:
- Future smoke-test additions that would accidentally exclude a platform.
- Shell-portability bugs in test infrastructure that would otherwise masquerade as product bugs.
Cost: one-time refactor effort. No ongoing CI minute cost delta.
Bug-class coverage per minute: zero direct (it's plumbing). Indirect: enables higher-coverage phases.
Phase 04: Manual-QA checklist for un-automatable platform behaviour
Acknowledge the bug classes automation cannot reach and write
them down so a human releaser knows what to spot-check. Adds
a docs/release-qa.md file with a checklist organised by
platform:
- macOS: open binary in Finder (Gatekeeper UX), idle a SPICE session for >30 minutes with ryll backgrounded (App Nap behaviour, Phase 02 of session-001-feedback once it lands), confirm clipboard sync survives.
- Windows: clipboard, USB redirection on 32-bit USB drivers, multi-monitor under DPI scaling.
- Linux: Wayland vs. Xorg, libvirt-managed vs. raw QEMU, AppImage / Flatpak packaging if those land.
The checklist is run before tagging a release, by a human, on each platform. Output is a checked-off form attached to the release PR.
This catches:
- The bug classes that need eyes on a real device, full stop.
- Regressions in UX behaviour that pass automation but surprise users.
Cost: not a CI minute cost — a release-time human cost.
Bug-class coverage per minute: not applicable. The point is to be honest that some bugs require this and to make the boundary explicit.
Phase order
| Phase | Plan | Status |
|---|---|---|
| 1. Cross-platform TLS handshake smoke | PLAN-ci-platform-matrix-phase-01-tls-smoke.md | Not started |
| 2. web-smoke on macOS (and Windows after Phase 03) | PLAN-ci-platform-matrix-phase-02-web-smoke-parity.md | Not started |
| 3. Smoke-test portability audit | PLAN-ci-platform-matrix-phase-03-smoke-portability.md | Not started |
| 4. Release QA checklist doc | PLAN-ci-platform-matrix-phase-04-release-qa.md | Not started |
Hard dependencies: Phase 02b (Windows web-smoke) is gated on Phase 03 if we choose the Rust-rewrite route. Phase 02a (macOS web-smoke) has no dependencies.
Open questions
1. CI cost ceiling. What is the project's monthly minutes budget on GitHub Actions? Going by the per-phase estimates, Phases 01 and 02a together add roughly 35 to 70 s on the macOS cell and 5 to 10 s on the others, which is negligible. If coverage grew larger, this would matter. Document the current usage in Phase 01's plan.
2. Should clippy run on macOS / Windows too? Clippy is platform-agnostic in 99% of cases; a macOS run of clippy would catch the 1% of cfg-gated lints. Cost: ~2 minutes on the macOS cell. Probably yes, but not in the first pass; roll into Phase 03 if it falls out cleanly.
3. Self-hosted runners as an escape valve. The repo already uses self-hosted Linux runners (runs-on: [self-hosted, static]) for the Claude bot workflows. A self-hosted Mac mini or Windows VM would let us run heavier integration tests without GitHub minute pricing. Out of scope here; raise as a separate infrastructure plan if cost ever becomes the constraint.
4. Coverage reporting. Phase 01 adds tests; do we want cargo-llvm-cov reporting per-platform coverage too? The answer is "probably yes eventually" but it has its own build complications (llvm-tools-preview availability) and should be a separate item, not bundled here.
5. Renovate / supply-chain bot interactions. Renovate PRs currently hit the same matrix; will the expanded matrix slow them down enough to matter? Empirically, Renovate PRs do not block on any review path, so latency isn't the bottleneck. No action.
Out of scope
- Adding QEMU-based integration tests on macOS / Windows: the lack of a KVM equivalent makes the existing test-qemu* recipes unportable. If a small synthetic SPICE server emerges (a Rust crate that speaks server-side SPICE just enough to unit-test client behaviour), revisit.
- Browser-driven automation of --web mode (Playwright / Selenium against the in-process axum server). Useful but its own master plan; would dominate the cost of the rest of this plan combined.
- Cross-compilation matrices (e.g. building macOS binaries on Linux). The current native-build matrix is the source of truth and matches user expectations; cross-compilation introduces a class of "works on Linux-built macOS binary, not on Mac-built one" bugs that would themselves be a testing problem.
- Code signing / notarisation automation on macOS / Windows. Important for releases but tangential to bug-coverage CI; belongs with the packaging work.
- Performance regression CI (benchmarks, frame-rate regression). Different motivation, different infrastructure, different plan.