Phase 17: Patched libspice-server for hypothesis validation
Phase 17 of PLAN-stream-caps-and-flap.md.
Goal
Build a Debian package of libspice-server1 with
NUM_TRACE_ITEMS bumped from 8 to 128, install it on ONE
hypervisor, re-run the 006-style workload, and measure
whether stream re-engagement actually improves. The
package is a validation tool first and an operational
artefact second — if it works, the question of cluster-wide
rollout is a separate scope decision.
Why this exists
Phase 13A identified NUM_TRACE_ITEMS = 8
(server/display-channel-private.h:23-25) as the binding
constraint on stream re-engagement under OOM pressure;
phase 13C identified the workload-driven OOM rate as the
trigger. A one-line bump on the spice-server side is the
cheapest server-side mitigation imaginable — IF the
hypothesis is right. The only way to know for sure is to
build the patched package and measure.
Three things this is NOT:
- A commitment to ship the patched package cluster-wide. That's a separate decision after 17B has data.
- A replacement for upstreaming. The upstream issue should be filed in parallel; the local package is for our test timeline, not as a permanent fork.
- A replacement for phase 16 (guest-driver alternatives). If 17B is positive, phase 16 becomes optional; if 17B is negative, phase 16 escalates.
Scope and sequencing
Gated on session 006 results. Do not start 17A until 006
has been analysed and the trace-ring-contention model is
confirmed. The signature is 006d (fullscreen at 64 MiB)
showing OOM counts substantially lower than 006c (windowed
at 256 MiB) — that's the workload-shape-beats-VRAM
prediction, which only matters if the trace ring is the
floor. If 006 instead shows command-ring depth dominates
(006d roughly equal to 006c), then patching NUM_TRACE_ITEMS
alone won't help and this phase should pivot or be cancelled.
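The gate can be written down as a small check on the two OOM counts. This is a sketch only: the function name `gate_17a` and the 50% cut-off standing in for "substantially lower" are assumptions made for illustration, not values from this plan. The inputs are the OOM counts that the 006 analysis reports for 006c and 006d.

```shell
# Hypothetical gate for starting 17A. The 50% default threshold is an
# assumption; choose the real cut-off when analysing session 006.
gate_17a() {
    oom_006c="$1"              # OOM count for 006c (windowed, 256 MiB)
    oom_006d="$2"              # OOM count for 006d (fullscreen, 64 MiB)
    threshold_pct="${3:-50}"   # 006d must be below this percentage of 006c

    if [ $(( oom_006d * 100 )) -lt $(( oom_006c * threshold_pct )) ]; then
        echo "proceed"   # workload shape beats VRAM: trace ring is the floor
    else
        echo "pivot"     # counts comparable: command-ring depth may dominate
    fi
}
```

Called as `gate_17a <006c-count> <006d-count>`, it prints either `proceed` or `pivot`; an optional third argument overrides the threshold.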
In scope:
- A standalone build recipe (shell script in ryll-test-sessions/bin/ or similar) that pulls Debian's spice-server source matching the deployed version, applies the one-line patch, and produces a .deb for the matching architecture.
- A documented install + revert procedure for one hypervisor (sf-2 is the suggested test host based on 005's per-node activity).
- A re-run of the 006a (windowed, 64 MiB VRAM) workload against the patched hypervisor, with bundle capture, so the result is comparable to 006a's baseline by direct qemu-log diff.
- The drafted upstream issue body (outlined in PLAN-stream-caps-and-flap-phase-13-streaming-intermittency.md's 13A "Implications for 16 / upstream" section). Filing upstream is operator action; the draft text lives here.
Out of scope:
- Cluster-wide rollout. That's phase 18 if we decide it's worth doing.
- Long-term package-maintenance automation (renovate-bot sync, etc.). Same — phase 18 territory.
- Patching qemu's QXL_COMMAND_RING_SIZE. Out of scope for the validation; the spice-server patch should be sufficient if 13A's model is right. If it isn't, file a separate investigation.
- Building the patched package from the ryll repo. The build recipe lives in ryll-test-sessions (the test-harness repo) because it's a hypervisor-side test artefact, not a ryll-client artefact.
Patch value rationale
RED_RELEASE_BUNCH_SIZE = 64
(server/image-encoders.h:221). A single
display_channel_free_some cycle can evict up to 64
drawables, each potentially writing into the trace ring.
For the ring to survive one OOM cycle without being fully
overwritten by unrelated draws we need NUM_TRACE_ITEMS >
RED_RELEASE_BUNCH_SIZE, so 128 is the natural next power
of two (the existing code uses ITEMS_TRACE_MASK =
NUM_TRACE_ITEMS - 1, so the value must remain a power of
two). Going to 256 buys a small additional margin at
roughly 8 extra KB per display channel — defensible if
something downstream wants headroom, but 128 is the
principled choice for "just survive one OOM cycle".
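The arithmetic behind that choice can be checked mechanically. The sketch below uses two hypothetical helpers, `is_pow2` and `next_pow2_above`, which are illustration only and not part of spice-server. It finds the smallest power of two strictly greater than RED_RELEASE_BUNCH_SIZE.

```shell
# True if the argument is a positive power of two. The ring index is masked
# with NUM_TRACE_ITEMS - 1, so any other value would break the wrap-around.
is_pow2() {
    [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

# Smallest power of two strictly greater than the argument.
next_pow2_above() {
    n=1
    while [ "$n" -le "$1" ]; do
        n=$(( n * 2 ))
    done
    echo "$n"
}

RED_RELEASE_BUNCH_SIZE=64
next_pow2_above "$RED_RELEASE_BUNCH_SIZE"   # prints 128
```

Note that 64 itself would not satisfy the strict inequality, which is why the result is 128 rather than 64.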
The patch is therefore:
--- a/server/display-channel-private.h
+++ b/server/display-channel-private.h
-#define NUM_TRACE_ITEMS (1 << 3)
+#define NUM_TRACE_ITEMS (1 << 7)
(Verify the exact form in the Debian source before
relying on this: the value may be written as a literal 8
rather than (1 << 3), or wrapped in an enum, depending on
the version.)
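One way to do that verification before applying the patch is a pair of grep helpers. This is a sketch; `has_expected_define` and `show_define` are hypothetical names, and the header path is passed in rather than assumed.

```shell
# Succeeds only if the header defines NUM_TRACE_ITEMS exactly as (1 << 3),
# which is the form the one-line patch expects to replace.
has_expected_define() {
    grep -Eq '^#define[[:space:]]+NUM_TRACE_ITEMS[[:space:]]+\(1 << 3\)' "$1"
}

# Print every line mentioning NUM_TRACE_ITEMS, for manual inspection when
# the expected form is not found.
show_define() {
    grep -n 'NUM_TRACE_ITEMS' "$1"
}
```

From the top of the unpacked source tree, `has_expected_define server/display-channel-private.h || show_define server/display-channel-private.h` shows what needs adjusting when the form differs.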
Step table
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 17A | medium | sonnet | none | Build the patched .deb. In ryll-test-sessions, add a bin/build-patched-libspice.sh. Steps the script should automate (or document if non-automatable): (1) Determine the spice-server version installed on the target hypervisor: ssh sf-2 dpkg -l libspice-server1. (2) On a build host (the Kasm machine is fine, or a fresh Debian container, whichever the operator prefers; docker run --rm -it -v "$PWD:/work" debian:12 bash keeps the host clean), run apt-get source spice-server=<version> to get the source matching the running version exactly. Confirm the source package name first with apt-cache showsrc libspice-server1, since it may not be spice-server. (3) Apply a one-file patch bumping NUM_TRACE_ITEMS from (1 << 3) to (1 << 7); store the patch as patches/0001-bump-num-trace-items.patch adjacent to the build script. (4) Run dpkg-buildpackage -us -uc -b to build the binary .deb files (binary-only build, no source package, no signing). (5) Drop the resulting libspice-server1_*.deb (and any companion -dev / -dbg packages we need) into a built/ directory in ryll-test-sessions/, with a README explaining what's in it and how to install. The script should be idempotent: re-running it should produce a fresh build without manual cleanup. Verify the script runs end-to-end by producing a .deb for at least one Debian version. Do NOT push the resulting .deb to git (binary artefact); add built/ to .gitignore if not already there. |
| 17B | — | — | — | Operator validation run. This is manual operator work, NOT a sub-agent task. The operator: (a) installs the patched libspice-server1 on sf-2 (sudo dpkg -i libspice-server1_*.deb), (b) restarts libvirtd (sudo systemctl restart libvirtd) so that the next VM spawn starts a qemu process that loads the patched library, (c) destroys any existing test VMs on sf-2 and creates a fresh one (./bin/makeinstance.sh 64), which matters because already-running VMs still have the old library loaded, (d) runs the 006a-equivalent workload on this fresh VM with auto-snapshot enabled, tag 007a-patched, (e) captures the bundle, (f) reverts with sudo apt-get install --reinstall libspice-server1 to pull the distro package back (if the package was held per Q3, release the hold first with sudo apt-mark unhold libspice-server1). Detailed instructions go into a 007.md in the test-sessions repo when 17A's build script is ready; write it as part of 17A's commit. The comparison the bundle should reveal: 007a-patched's display_channel_create_stream count vs. 006a's, plus 007a-patched's display_channel_debug_oom count vs. 006a's (OOM count should be roughly unchanged, because the patch doesn't reduce OOM frequency, it reduces the consequence of each OOM). The expected result if the trace-ring-contention model is right: stream re-engagements should be substantially higher in 007a-patched than in 006a despite similar OOM counts. |
| 17C | low | haiku | none | File the upstream issue + write follow-up plan. Take the writeup paragraph already drafted in PLAN-stream-caps-and-flap-phase-13-streaming-intermittency.md (the 13A "Implications for 16 / upstream" section) and turn it into a properly-formatted upstream bug-report at https://gitlab.freedesktop.org/spice/spice/-/issues — operator does the actual filing; this step produces the issue body as a .md file in docs/upstream-issues/spice-trace-ring.md. If 17B was negative (the patch didn't help), update the issue draft to reflect what we learned instead — possibly mentioning the command-ring as a secondary hypothesis. If 17B was positive, write PLAN-stream-caps-and-flap-phase-18-libspice-rollout.md as a stub for the cluster-wide rollout question (proposed-concept status — actual planning happens later, only if the operator decides rollout is worth the maintenance tail). |
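
The 17A steps could be laid out roughly as follows. This is a dry-run sketch rather than the finished script: the function names, the SRC_PKG and WORKDIR variables, and the build-tmp directory are assumptions. Commands are printed instead of executed unless DRY_RUN=0 is set. It also assumes that deb-src entries and the package's build dependencies are already available on the build host, which the step table does not spell out.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical outline of bin/build-patched-libspice.sh; run it from the
# top of ryll-test-sessions.
build_patched_libspice() {
    version="$1"                         # from: ssh sf-2 dpkg -l libspice-server1
    src_pkg="${SRC_PKG:-spice-server}"   # name used in the plan; verify it
    root="$(pwd)"
    patch_file="$root/patches/0001-bump-num-trace-items.patch"
    workdir="${WORKDIR:-$root/build-tmp}"

    # Idempotent: wipe any previous build state before starting.
    run rm -rf "$workdir" "$root/built"
    run mkdir -p "$workdir" "$root/built"
    run cd "$workdir"

    # Fetch and unpack the source matching the deployed version.
    run apt-get source "${src_pkg}=${version}"

    # The work directory is fresh, so */ matches only the unpacked tree.
    run sh -c 'cd */ && patch -p1 -i "$1" && dpkg-buildpackage -us -uc -b' \
        sh "$patch_file"

    # Binary packages land next to the source tree; collect the one we need.
    run sh -c 'cp libspice-server1_*.deb "$1"' sh "$root/built/"
    run cd "$root"
}
```

Running `build_patched_libspice <version>` with the default DRY_RUN=1 prints the command sequence for review; setting DRY_RUN=0 executes it.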
Success criteria
- A reproducible bin/build-patched-libspice.sh that produces a libspice-server1*.deb matching the deployed Debian version on sf-*. Verified by at least one successful run.
- A 007.md instruction file in ryll-test-sessions documenting the install / run / revert procedure.
- Either: bundle test-session-007a-patched.tar.gz showing measurably better stream re-engagement than 006a's baseline (success; record the numbers in this phase plan file), OR: a written falsification with the comparison data and a pivot decision.
- Upstream issue body drafted as docs/upstream-issues/spice-trace-ring.md.
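
The stream re-engagement comparison comes down to counting two event names in the baseline and patched qemu logs. A minimal sketch follows, assuming each event shows up as one log line containing its function name. The helper names `count_event` and `compare_runs` are hypothetical, and the log paths are whatever the bundles contain.

```shell
# Count lines in a log that contain the given event name (literal match).
count_event() {
    # grep -c prints 0 but exits 1 when nothing matches; keep the 0.
    grep -c -F -- "$2" "$1" || true
}

# Print baseline vs. patched counts for the two events 17B compares.
compare_runs() {
    baseline="$1"
    patched="$2"
    for ev in display_channel_create_stream display_channel_debug_oom; do
        printf '%s baseline=%s patched=%s\n' "$ev" \
            "$(count_event "$baseline" "$ev")" \
            "$(count_event "$patched" "$ev")"
    done
}
```

If the trace-ring model holds, the display_channel_create_stream line should show a clearly higher patched count while the display_channel_debug_oom line stays roughly level.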
Open questions
- Q1: build host choice. Kasm machine vs. Debian container? The container keeps the Kasm host clean and matches the deployed Debian version cleanly; the host is faster and avoids docker-in-docker concerns. Decide in 17A; default is a debian:12 container if the sf-* hosts are running bookworm (likely).
- Q2: dev/dbg companions. Do we need libspice-server-dev or libspice-server1-dbgsym? Almost certainly not for production install, but worth grabbing once for symbolisation when reading the patched binary's behaviour. Decide in 17A; default is "build only libspice-server1, skip companions".
- Q3: pinning. Should the operator pin the patched package via apt-mark hold libspice-server1 after install to prevent apt upgrade clobbering it? Yes for the validation host during the test window. Note in 007.md.
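
With the hold from Q3 included, the install and revert sequence for the validation host might look like the sketch below. It is a dry run by default: commands are printed unless DRY_RUN=0 is set. The function names are hypothetical, and the libvirtd restart after revert is an assumption that mirrors the install step rather than something stated in 17B.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Install the patched package and hold it for the test window.
install_patched() {
    run sudo dpkg -i "$1"
    run sudo apt-mark hold libspice-server1
    run sudo systemctl restart libvirtd
}

# Release the hold, then pull the distro package back.
revert_patched() {
    run sudo apt-mark unhold libspice-server1
    run sudo apt-get install --reinstall libspice-server1
    run sudo systemctl restart libvirtd   # assumption: mirrors the install step
}
```

The unhold comes first on revert because a held package would otherwise block the reinstall.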
Cross-references
- PLAN-stream-caps-and-flap-phase-13-streaming-intermittency.md: the 13A and 13C findings that motivate this phase, and the drafted upstream-issue paragraph that 17C will formalise.
- PLAN-stream-caps-and-flap-phase-16-qxl-viability.md: the alternative path if 17B falsifies the trace-ring model. 16's relative priority depends on 17B's outcome.
- ryll-test-sessions 006.md: the workload shape that 17B's run replicates against the patched library.
- Future PLAN-stream-caps-and-flap-phase-18-libspice-rollout.md: written only if 17B is positive AND the operator decides cluster-wide rollout is worth the maintenance tail.