CI platform-matrix expansion

Prompt

Expand ryll's CI to catch platform-specific runtime bugs that the current matrix lets through. Today's CI builds and runs unit tests on Linux, macOS, and Windows, but the runtime smoke tests (tools/web-smoke.sh, --web TLS) and lint run on Linux only. Bugs that need real platform execution slip through and only surface during dogfooding. The motivating example is the rustls CryptoProvider panic (commit a9aff050, 2026-05-08), which broke ryll on macOS at TLS connect time but never showed on Linux.

This master plan was spun out of bench observations during the session-001 feedback work. It is independent of the session-001 phases and the macOS-runtime-metrics plan, and can land in any order.

When working through phases, follow the project's plan conventions (per-phase plan files named PLAN-ci-platform-matrix-phase-NN-*.md, one logical change per commit, master-plan table updated as work lands).

Situation

What CI does today

.github/workflows/ci.yml defines:

Lint (cargo fmt --check + cargo clippy): Linux only.
Build matrix across Linux / macOS / Windows. Each matrix cell:
Runs cargo build --release -p ryll.
Runs cargo test --workspace.
Linux-only: tools/web-smoke.sh (HTTP) and tools/web-smoke.sh --tls (HTTPS) against the just-built binary in --web mode.
Produces a platform-shaped artifact (.deb, .rpm, macOS tarball, Windows zip).
tools/web-smoke.sh is a Bash script, so it cannot run on the Windows runner without WSL or a PowerShell rewrite. Its CI step is gated on if: runner.os == 'Linux' for that reason, which also skips the macOS runner even though Bash exists there.
QEMU integration tests (make test-qemu*) run nowhere in CI today — they need KVM and a libvirt-style stack.

Bug classes the current matrix catches

Compile errors anywhere in the workspace, on every platform.
Unit-test regressions, on every platform.
Lint violations and formatting drift (Linux only — but lint is platform-independent in practice).
--web-mode regressions, including TLS-cert loading and HTTP routing — Linux only.
Build artifact packaging (.deb / .rpm / .tar.gz / .zip).

Bug classes the current matrix misses

The session-001 work has surfaced three specific gaps:

Runtime startup paths only exercised inside --web smoke, on Linux. The rustls install_default() panic at client.rs:157 lay dormant because no test ever constructed a SPICE TLS connector outside --web mode. --web happened to install a CryptoProvider early; the GUI / headless path did not. Tests passed. Local macOS dogfooding fell over the moment a TLS SPICE target was contacted.
Platform-specific runtime APIs. Phase 02 of session-001-feedback adds NSProcessInfo.beginActivityWithOptions for macOS App Nap opt-out; G1's macOS-runtime-metrics plan adds Mach task_info / thread_info calls. Both are cfg-gated to target_os = "macos". Our current macOS matrix cell compiles them but cannot exercise them meaningfully: a cargo test running on a fresh macOS runner is foreground-active, so App Nap conditions never trigger, and an FFI call returning the right struct in unit isolation says nothing about a real session's behaviour. This is harder to address than the first gap, and may require a "manual-but-checklisted" QA pass rather than full automation. See Phase 04.
Bash-only smoke tests force runtime-test parity gaps: anywhere a test is written in Bash, Windows is excluded by construction.
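The first gap can be modelled without rustls at all. The sketch below is a std-only stand-in, not ryll's or rustls's actual code: `PROVIDER`, `install_default_provider`, and `build_tls_connector` are hypothetical names that imitate a process-wide default which one startup path installs and another forgets.

```rust
use std::sync::OnceLock;

// Stand-in for a process-wide crypto provider slot (hypothetical; not rustls's real type).
static PROVIDER: OnceLock<&'static str> = OnceLock::new();

/// Hypothetical startup hook: installs the process-wide default at most once.
fn install_default_provider() {
    // `set` returns Err if a value is already present; ignoring it keeps the call idempotent.
    let _ = PROVIDER.set("ring");
}

/// Hypothetical TLS connector constructor: fails if no provider was installed first.
fn build_tls_connector() -> Result<String, String> {
    match PROVIDER.get() {
        Some(name) => Ok(format!("connector using {name}")),
        None => Err("no process-wide crypto provider installed".to_string()),
    }
}

fn main() {
    // Models the GUI / headless path: nothing has installed the provider yet.
    println!("gui path: {:?}", build_tls_connector());

    // Models --web startup: installs the provider early, then connects.
    install_default_provider();
    println!("web path: {:?}", build_tls_connector());
}
```

A test that only ever runs the second path passes on every platform, which is why the failure stayed hidden until something exercised the first path for real.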

Why "everything on every platform" is the wrong answer

GitHub Actions runners are billed per minute (current public pricing, 2026):

Runner	Linux	Windows	macOS
Cost multiplier vs Linux	1×	2×	10×
Cold-cache test latency (this repo)	~6 min	~12 min	~10 min
Flake rate (observed, this repo)	low	low	medium

Naïvely promoting every Linux-only job to all three platforms roughly doubles wall-clock CI latency for those jobs on every PR (the cells run concurrently, so the slowest one gates the result: ~12 min on Windows versus ~6 min on Linux) and multiplies their billed minutes roughly thirteenfold at equal runtime (1× + 2× + 10×, with macOS dominating the bill). That tradeoff is fine for a release build; it is wasteful for a typo-fix PR.
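As a rough check on the table, the arithmetic can be written out. This is a sketch under a simplifying assumption (a promoted job takes the same wall-clock time on every runner); `billed_minutes` is a hypothetical helper, and the latency figures are the approximate cold-cache numbers from the table.

```rust
/// Billing multipliers relative to a Linux minute, taken from the table above.
const LINUX: f64 = 1.0;
const WINDOWS: f64 = 2.0;
const MACOS: f64 = 10.0;

/// Billed Linux-equivalent minutes for a job with the given wall-clock minutes per platform.
fn billed_minutes(linux_min: f64, windows_min: f64, macos_min: f64) -> f64 {
    linux_min * LINUX + windows_min * WINDOWS + macos_min * MACOS
}

fn main() {
    // A one-minute job, run on Linux only versus on all three platforms.
    let linux_only = billed_minutes(1.0, 0.0, 0.0);
    let all_three = billed_minutes(1.0, 1.0, 1.0);
    println!("billed-minute ratio at equal runtime: {}", all_three / linux_only);

    // Concurrent cells are gated by the slowest one (Linux, Windows, macOS).
    let latencies = [6.0_f64, 12.0, 10.0];
    let slowest = latencies.iter().cloned().fold(f64::MIN, f64::max);
    println!("wall-clock: {} min versus {} min on Linux alone", slowest, latencies[0]);
}
```

With equal runtimes the ratio is (1 + 2 + 10) / 1 = 13, and the macOS term alone accounts for 10 of those 13 units.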

The right answer is graduated coverage: every platform runs the parts that catch platform-specific bugs, and only Linux runs the parts that don't (lint, formatting, deep integration). Phase plans below pick one expansion at a time and judge it on bug-class coverage per minute spent.

Mission and problem statement

Make ryll's CI catch the bug classes that currently only surface during macOS / Windows dogfooding. Specifically:

The TLS startup path on every platform, on every PR. (Catches the rustls panic class.)
Runtime smoke tests that work on Windows, not just Linux. (Catches the bash-script-portability class.)
A clear, documented manual-QA checklist for the bug classes automation cannot reach (App Nap, code-signing, Gatekeeper interactions). (Acknowledges the limit, doesn't pretend automation handles it.)

Out of scope: Linux-runner-equivalent integration testing on macOS / Windows (no KVM, no QEMU SPICE stack); browser-based automation of --web mode (separate plan); cross-compilation matrices (we already build natively).

Approach

The plan breaks into four phases. Phases 01 and 02 are no-regret expansions that fit comfortably in current CI budgets. Phase 03 is a portability cleanup that unblocks the Windows half of Phase 02. Phase 04 documents a manual-QA boundary; the work itself is a doc, not automation.

Phase 01: Cross-platform GUI/headless TLS smoke test

Add a smoke test that:

Spins up a minimal TLS-capable echo server in-process (rustls server config, self-signed cert, pinned CA passed to ryll via the same path the SPICE client uses).
Invokes ryll's TLS client setup path (constructs a SpiceClient with a TLS-port ConnectionConfig, attempts the handshake).
Asserts the handshake reaches "connected" (the in-process server logs the client hello) before tearing down.
Does not require a real SPICE server: the client hello arriving at the in-process server is enough to confirm rustls didn't panic and the client reached the network layer.

Where it lives: a new integration test in shakenfist-spice-protocol/tests/tls_handshake.rs, picked up automatically by the existing cargo test --workspace in the build matrix. No new CI step required.
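The harness shape described above (ephemeral port, in-process server thread, assert on the first bytes seen) can be sketched with the standard library alone. This is not the proposed test: it does not use rustls or ryll's SpiceClient, the six bytes written by the client merely imitate the start of a ClientHello record, and `looks_like_client_hello` and `spawn_observer` are hypothetical names.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Returns true if `buf` starts with a TLS handshake record carrying a ClientHello.
/// Record header: content type (0x16 = handshake), 2-byte version, 2-byte length;
/// the first handshake byte is the message type (0x01 = ClientHello).
fn looks_like_client_hello(buf: &[u8]) -> bool {
    buf.len() >= 6 && buf[0] == 0x16 && buf[5] == 0x01
}

/// Starts an in-process listener on an ephemeral port and reports, over a channel,
/// whether the first bytes received look like a ClientHello.
fn spawn_observer() -> std::io::Result<(SocketAddr, mpsc::Receiver<bool>)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        if let Ok((mut sock, _)) = listener.accept() {
            let mut buf = [0u8; 6];
            let ok = sock.read_exact(&mut buf).is_ok() && looks_like_client_hello(&buf);
            let _ = tx.send(ok);
        }
    });
    Ok((addr, rx))
}

fn main() -> std::io::Result<()> {
    let (addr, rx) = spawn_observer()?;
    // Placeholder for the client under test: the real test would call into the
    // TLS client setup path here instead of writing hand-built bytes.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(&[0x16, 0x03, 0x01, 0x00, 0x01, 0x01])?;
    // A bounded wait keeps a hung client from stalling the whole test run.
    let seen = rx.recv_timeout(Duration::from_secs(5)).unwrap_or(false);
    assert!(seen, "server never observed a ClientHello");
    println!("client hello observed");
    Ok(())
}
```

Binding to port 0 lets the OS pick a free port, which avoids collisions when matrix cells or tests run in parallel on the same runner.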

This catches:

- The rustls CryptoProvider install regression.
- Any future TLS feature-unification surprise (cert-loader pulled in via a transitive crate, etc.).
- Hostname-verifier behaviour on every platform (the SpiceCaVerifier in client.rs has subtle platform-dependent behaviour on root-store loading).

Cost: ~5–10 s extra test runtime per matrix cell. Fits inside the existing test step.

Bug-class coverage per minute: very high. This single test would have caught the rustls CryptoProvider panic at PR time.

Phase 02: `web-smoke` parity on macOS and Windows

Two tasks:

(a) Drop the if: runner.os == 'Linux' gate on tools/web-smoke.sh for the macOS matrix cell. Bash exists on the macOS runner, and the script is plain Bash (no Linux-isms beyond the SPICE target it speaks to, which is just ryll --web itself). Verify the gate is the only blocker.

(b) Either port tools/web-smoke.sh to PowerShell, or replace it on every platform with a small Rust integration binary that the build matrix invokes after cargo build --release. The Rust rewrite is more work, but it eliminates the "Bash on Windows is a quagmire" class of headaches for good and lets the smoke test reuse types from the workspace.

Recommendation: do (a) first as a one-line CI change, then plan (b) as its own follow-up phase if Windows coverage of --web becomes important enough to justify the rewrite.

This catches:

- --web-mode regressions on macOS specifically (rustls, TLS cert loading, axum behaviour).
- --web-mode regressions on Windows (after (b)).

Cost: an extra ~30–60 s on the macOS cell (web-smoke includes a brief ryll --web startup + handshake exchange). The Windows cost after the rewrite should be similar.

Bug-class coverage per minute: medium-high — duplicates some Phase 01 coverage but exercises the actual --web HTTP endpoint path that Phase 01 doesn't.

Phase 03: Smoke-test portability cleanup

Address the structural issue that Phases 01 and 02 hint at: runtime tests written in Bash exclude Windows. Audit tools/*.sh and identify which are CI-relevant. For each, choose:

Keep as Bash (Linux-only operational scripts — propose-release.sh, address-comments-with-claude.sh). These don't run in CI per se; no action needed.
Port to a small Rust tool in a tools/ workspace member — for tests that need to run on every platform.
Rewrite as a cargo test under the relevant crate — for tests that exercise crate behaviour and can use the test harness.

The goal is that every CI-relevant smoke test runs on every matrix cell. Linux-only operational scripts are fine to leave as Bash — they're not blocking the matrix.

This catches:

- Future smoke-test additions that accidentally exclude a platform.
- Shell-portability bugs in test infrastructure that masquerade as product bugs.

Cost: one-time refactor effort. No ongoing CI minute cost delta.

Bug-class coverage per minute: zero direct (it's plumbing). Indirect: enables higher-coverage phases.

Phase 04: Manual-QA checklist for un-automatable platform behaviour

Acknowledge the bug classes automation cannot reach and write them down so a human releaser knows what to spot-check. This phase adds a docs/release-qa.md file with a checklist organised by platform:

macOS: open binary in Finder (Gatekeeper UX), idle a SPICE session for >30 minutes with ryll backgrounded (App Nap behaviour, Phase 02 of session-001-feedback once it lands), confirm clipboard sync survives.
Windows: clipboard, USB redirection on 32-bit USB drivers, multi-monitor under DPI scaling.
Linux: Wayland vs. Xorg, libvirt-managed vs. raw QEMU, AppImage / Flatpak packaging if those land.

The checklist is run before tagging a release, by a human, on each platform. Output is a checked-off form attached to the release PR.

This catches:

- The bug classes that need eyes on a real device, full stop.
- Regressions in UX behaviour that pass automation but surprise users.

Cost: not a CI minute cost — a release-time human cost.

Bug-class coverage per minute: not applicable. The point is to be honest that some bugs require this and to make the boundary explicit.

Phase order

Phase	Plan	Status
1. Cross-platform TLS handshake smoke	PLAN-ci-platform-matrix-phase-01-tls-smoke.md	Not started
2. `web-smoke` on macOS (and Windows after Phase 03)	PLAN-ci-platform-matrix-phase-02-web-smoke-parity.md	Not started
3. Smoke-test portability audit	PLAN-ci-platform-matrix-phase-03-smoke-portability.md	Not started
4. Release QA checklist doc	PLAN-ci-platform-matrix-phase-04-release-qa.md	Not started

Hard dependencies: Phase 02b (Windows web-smoke) is gated on Phase 03 if we choose the Rust-rewrite route. Phase 02a (macOS web-smoke) has no dependencies.

Open questions

CI cost ceiling. What is the project's monthly minutes budget on GitHub Actions? Phases 01 and 02a together probably add about a minute on the macOS cell (~5–10 s plus ~30–60 s) and ~5–10 s on the other cells, which is negligible. If coverage grows much larger, this will start to matter. Document the current usage in Phase 01's plan.
Should clippy run on macOS / Windows too? Clippy is platform-agnostic in 99% of cases; a macOS run of clippy would catch the 1% of cfg-gated lints. Cost: ~2 minutes on the macOS cell. Probably yes, but not in the first pass — roll into Phase 03 if it falls out cleanly.
Self-hosted runners as an escape valve. The repo already uses self-hosted Linux runners (runs-on: [self-hosted, static]) for the Claude bot workflows. A self-hosted Mac mini or Windows VM would let us run heavier integration tests without GitHub minute pricing. Out of scope here; raise as a separate infrastructure plan if cost ever becomes the constraint.
Coverage reporting. Phase 01 adds tests; do we want cargo-llvm-cov reporting per-platform coverage too? The answer is "probably yes eventually" but it has its own build complications (llvm-tools-preview availability) and should be a separate item, not bundled here.
Renovate / supply-chain bot interactions. Renovate PRs currently hit the same matrix; will the expanded matrix slow them down enough to matter? Empirically, Renovate PRs are not blocked on any review path, so CI latency isn't their bottleneck. No action.
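On the clippy question above, the reason a Linux-only lint job has a blind spot is that cfg-gated items are removed before compilation, so lints that run on compiled code never see them. A minimal illustration follows; `platform_note` is a hypothetical function, not something in the ryll workspace.

```rust
/// Compiled, and therefore linted, only when targeting macOS.
#[cfg(target_os = "macos")]
fn platform_note() -> &'static str {
    "macOS-only code path"
}

/// Compiled on every other target; a Linux lint job sees this version only.
#[cfg(not(target_os = "macos"))]
fn platform_note() -> &'static str {
    "generic code path"
}

fn main() {
    // std::env::consts::OS reports the target the binary was built for.
    println!("{} (target_os = {})", platform_note(), std::env::consts::OS);
}
```

Any lint violation inside the first function would surface only on a macOS clippy run, which is the small slice of code the question is about.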

Out of scope

Adding QEMU-based integration tests on macOS / Windows. The lack of a KVM equivalent makes the existing test-qemu* recipes unportable. If a small synthetic SPICE server emerges (a Rust crate that speaks server-side SPICE just enough to unit-test client behaviour), revisit.
Browser-driven automation of --web mode (Playwright / Selenium against the in-process axum server). Useful but its own master plan; would dominate the cost of the rest of this plan combined.
Cross-compilation matrices (e.g. building macOS binaries on Linux). The current native-build matrix is the source of truth and matches user expectations; cross-compilation introduces a class of "works on Linux-built macOS binary, not on Mac-built one" bugs that would themselves be a testing problem.
Code signing / notarisation automation on macOS / Windows. Important for releases but tangential to bug-coverage CI; belongs with the packaging work.
Performance regression CI (benchmarks, frame-rate regression). Different motivation, different infrastructure, different plan.
