Diagnosing "video stream not keeping up" reports
Prompt
Build the instrumentation and (where warranted) the threading changes needed so that future "video stream isn't keeping up" bug reports are actionable — i.e. a developer reading the report can tell whether the server stopped sending frames, the network was slow, decode was slow, render was slow, or our own bug-report I/O (pcap, screenshot encoding) was applying backpressure to the SPICE socket.
This master plan was spun out of the session-001 dogfooding
triage (PLAN-session-001-feedback.md), which logged a
"U1: video stream appears to not be keeping up" report
(ryll-bugreport-2026-05-05T10-39-47Z.zip) without enough data
to act on. Rather than guess at a fix, we're closing the
observability gap first.
When working through phases, follow the project's plan
conventions (per-phase plan files named
PLAN-video-keeping-up-phase-NN-*.md, one logical change per
commit, master-plan table updated as work lands).
Situation
Findings from a code walk of the display pipeline as of this plan's creation:
- Display channel pipeline is single-threaded: socket read, message parse, image decode, and ImageReady emission all run on one async task (shakenfist-spice-renderer/src/channels/display.rs:575). Backlogs can accumulate at the socket buffer, the decode step (especially GLZ), or implicitly downstream of ImageReady.
- Bug-report / capture I/O is inline on hot paths:
    - PcapChannelWriter (ryll/src/capture.rs:28) is Mutex-guarded and writes to an unbuffered File. It is invoked on the display read task at display.rs:631 (every received chunk) and on the send path at display.rs:2170. Slow disk on the read side back-pressures the SPICE socket directly.
    - VideoWriter (H.264 + MP4) at ryll/src/capture.rs:230 encodes synchronously, but the call site is CaptureSession::frame() at capture.rs:622, invoked from the egui event handler at app.rs:1615 in response to ChannelEvent::ImageReady, not on the display read task. So video encoding cannot directly stall the SPICE socket; it can, however, block the GUI event loop and starve presentation. Phase 3 addresses this UI-side backpressure rather than socket-side backpressure.
- Existing metrics partially diagnose the problem:
    - last_latency_ms: PING-to-PING interval on the main channel (main_channel.rs:502); captures network + server-send delay, not client processing time.
    - bandwidth_current / bandwidth_history: socket arrival bytes/sec sampled once per second.
    - fps: derived from DisplayMark timestamps; measures presentation rate.
    - frames_received: cumulative ImageReady count.
    - DisplaySnapshot.recent_decodes: VecDeque (cap MAX_RECENT_DECODES = 20) with per-decode success flag, image type, dimensions, and session-relative timestamp, but no wall-clock duration.
- Gaps that block diagnosis today:
    - No decode wall-time per image; can't tell GLZ is choking.
    - No socket-buffer high-water mark; can't tell the read loop is falling behind.
    - No ACK-window-exhaustion signal; can't tell we're applying backpressure at the SPICE level.
    - No render-side arrival-to-display latency; can't tell the renderer is the bottleneck once decode finishes.
Mission and problem statement
Make a "video not keeping up" bug report self-diagnosing: the report alone, without re-running the session, should tell a maintainer which of {server, network, decode, render, our own I/O backpressure} is the bottleneck. Where measurement reveals ryll itself is the bottleneck, fix it.
Approach
Instrumentation before threading. The cheap signals (phase 1) will tell us whether the threading work in phases 2–3 is even worth doing. If phase 1 reveals the bottleneck is consistently decode CPU or server-side, we may stop after phase 1 + a fix in the right place rather than refactoring I/O paths that aren't actually hot.
Resolved decisions
- Drop-on-overflow for the future pcap and screenshot writer tasks (phases 2 and 3). Dropping preserves the socket read rate; the cost is gaps in pcaps when disk falls behind. Each dropped item increments a counter exposed in the snapshot (writer_dropped_count per channel) so a bug report makes the drop visible. Blocking would reintroduce the backpressure the threading split is meant to remove.
- Last-N plus min/max/mean for decode duration, sharing the existing recent_decodes ring (cap MAX_RECENT_DECODES = 20 in display.rs). Cheap, matches the surrounding code, and the aggregate stats give a long-run summary without the state cost of a full histogram. Computed at snapshot-emit time over the ring contents.
- Render-side latency = mpsc-queue lag between event emission in the display channel and event drain in the egui frame loop. Captured as produced_at_secs: f64 on ChannelEvent::ImageReady* and ChannelEvent::DisplayMark at emit; consumed in process_events to compute consumed_at - produced_at and feed two bounded rings surfaced as min/max/mean aggregates on AppSnapshot. Finer-grained "inside the egui paint" timing was rejected as disproportionate; end-to-end "pixels visible to user" is unmeasurable without external instrumentation. See PLAN-video-keeping-up-phase-04-render-latency.md.
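The "Last-N plus min/max/mean" decision above can be illustrated with a small ring that stores a wall-clock duration per decode and computes the aggregate only when a snapshot is emitted. This is a minimal sketch, not the code in display.rs: the names RecentDecodes, DecodeRecord, DecodeStats, and time_decode are hypothetical, and the real ring entries carry more fields (image type, dimensions, timestamp) than shown here.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Ring capacity; mirrors MAX_RECENT_DECODES = 20 from the plan.
pub const MAX_RECENT_DECODES: usize = 20;

/// Hypothetical ring entry; only the fields needed for timing are shown.
#[derive(Debug, Clone, Copy)]
pub struct DecodeRecord {
    pub success: bool,
    pub duration: Duration,
}

/// Hypothetical aggregate attached to the snapshot, in milliseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DecodeStats {
    pub min_ms: f64,
    pub max_ms: f64,
    pub mean_ms: f64,
}

#[derive(Default)]
pub struct RecentDecodes {
    ring: VecDeque<DecodeRecord>,
}

impl RecentDecodes {
    /// Push a record, evicting the oldest once the ring is full.
    pub fn push(&mut self, rec: DecodeRecord) {
        if self.ring.len() == MAX_RECENT_DECODES {
            self.ring.pop_front();
        }
        self.ring.push_back(rec);
    }

    /// Run a decode closure and record how long it took on the wall clock.
    pub fn time_decode<T, E>(
        &mut self,
        decode: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        let start = Instant::now();
        let result = decode();
        self.push(DecodeRecord {
            success: result.is_ok(),
            duration: start.elapsed(),
        });
        result
    }

    /// Aggregate over the current ring contents; None when the ring is empty.
    pub fn stats(&self) -> Option<DecodeStats> {
        if self.ring.is_empty() {
            return None;
        }
        let mut min = f64::INFINITY;
        let mut max = 0.0_f64;
        let mut sum = 0.0_f64;
        for rec in &self.ring {
            let ms = rec.duration.as_secs_f64() * 1000.0;
            min = min.min(ms);
            max = max.max(ms);
            sum += ms;
        }
        Some(DecodeStats {
            min_ms: min,
            max_ms: max,
            mean_ms: sum / self.ring.len() as f64,
        })
    }

    pub fn len(&self) -> usize {
        self.ring.len()
    }
}
```

Because the aggregate is derived on demand, the hot path only pays for one Instant::now() pair and one ring push per decode.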
Acceptance criteria
A "video stream not keeping up" bug report is self-diagnosing
when its channel-state.json for the display channel lets a
maintainer answer all of the following without re-running the
session:
- Decode load. What was the per-decode wall time (min / max / mean over the recent window)? Were failures or cache misses spiking? (Phase 1.)
- Socket read pressure. Was the read loop consistently filling its 256 KB chunks (a signal that the OS recv buffer had bytes waiting when we read)? (Phase 1.)
- SPICE-level backpressure. How often did we fill the ACK window before sending an ACK, and how long were the gaps between ACKs? (Phase 1.)
- Inline writer cost. When pcap capture is enabled, are writer drops happening? (Phase 2.)
- GUI loop encode cost. When MP4 capture is enabled, is encoding stalling presentation? (Phase 3.)
- Render path. Optional: once the prior signals show decode + I/O are healthy, can we attribute remaining latency to the renderer? (Phase 4.)
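The socket-read and ACK-window questions above could be answered by counters along the following lines. This is a sketch under stated assumptions: ReadFillStats, AckStats, and their methods are hypothetical names rather than ryll's actual types, the 256 KB constant is taken from the criteria above, and timestamps are session-relative durations supplied by the caller.

```rust
use std::time::Duration;

/// Read chunk size named in the acceptance criteria (256 KB).
pub const READ_CHUNK_BYTES: usize = 256 * 1024;

/// Hypothetical counter: how often a socket read filled the whole chunk.
#[derive(Debug, Default)]
pub struct ReadFillStats {
    pub reads: u64,
    pub full_reads: u64,
}

impl ReadFillStats {
    /// Call once per socket read with the number of bytes returned.
    pub fn record_read(&mut self, bytes_read: usize) {
        self.reads += 1;
        if bytes_read >= READ_CHUNK_BYTES {
            self.full_reads += 1;
        }
    }

    /// Fraction of reads that filled the chunk; 0.0 before any read.
    pub fn full_ratio(&self) -> f64 {
        if self.reads == 0 {
            0.0
        } else {
            self.full_reads as f64 / self.reads as f64
        }
    }
}

/// Hypothetical counter for ACK behaviour on one channel.
#[derive(Debug, Default)]
pub struct AckStats {
    pub acks_sent: u64,
    /// ACKs sent only after the whole window had been consumed.
    pub window_filled: u64,
    /// Longest observed gap between two consecutive ACKs.
    pub max_gap: Duration,
    pending: u32,
    last_ack_at: Option<Duration>,
}

impl AckStats {
    /// Call for every message that counts against the ACK window.
    pub fn record_message(&mut self) {
        self.pending += 1;
    }

    /// Call when an ACK is sent; `now` is a session-relative timestamp.
    pub fn record_ack(&mut self, now: Duration, window: u32) {
        self.acks_sent += 1;
        if self.pending >= window {
            self.window_filled += 1;
        }
        self.pending = 0;
        if let Some(prev) = self.last_ack_at {
            let gap = now.saturating_sub(prev);
            if gap > self.max_gap {
                self.max_gap = gap;
            }
        }
        self.last_ack_at = Some(now);
    }
}
```

A full_ratio close to 1.0 suggests the OS receive buffer usually had more data waiting than one read could take. A high window_filled count relative to acks_sent suggests the client was the side holding back the sender.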
Execution
| Phase | Plan | Status |
|---|---|---|
| 1. Decode duration + socket fill + ACK-window signals | PLAN-video-keeping-up-phase-01-instrumentation.md | Done |
| 2. Move pcap writes to a dedicated writer task | PLAN-video-keeping-up-phase-02-pcap-thread.md | Done |
| 3. Move MP4 video encoding off the GUI event loop | PLAN-video-keeping-up-phase-03-video-encode-thread.md | Done |
| 4. Render-side arrival-to-display latency | PLAN-video-keeping-up-phase-04-render-latency.md | Done |
Phase 1 is the gate. It produces the data needed to triage
U1 and tells us whether phases 2–4 are warranted. Done when the
display channel-state.json in a bug report includes per-decode
wall time, socket-read fill stats, and ACK-send stats, and a
maintainer can read those fields without consulting code.
Phase 2 moves PcapChannelWriter writes off the channel
read tasks onto a dedicated writer task with a bounded queue.
Done when packet_received and packet_sent are non-blocking
enqueues, and dropped items are counted in the snapshot.
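The phase 2 shape (non-blocking enqueue, bounded queue, counted drops) can be sketched with the standard library's bounded channel. This is an illustration only: QueuedWriter, bounded_writer, run_writer, and enqueue are hypothetical names, and a std channel stands in for whatever queue and task runtime the real writer uses.

```rust
use std::io::{self, Write};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical producer handle held by the channel read task.
pub struct QueuedWriter {
    tx: SyncSender<Vec<u8>>,
    dropped: AtomicU64,
}

/// Create the producer handle and the receiving end for the writer task.
pub fn bounded_writer(capacity: usize) -> (QueuedWriter, Receiver<Vec<u8>>) {
    let (tx, rx) = sync_channel(capacity);
    (
        QueuedWriter {
            tx,
            dropped: AtomicU64::new(0),
        },
        rx,
    )
}

impl QueuedWriter {
    /// Non-blocking enqueue. Returns false and counts a drop when the
    /// queue is full or the writer task has gone away.
    pub fn enqueue(&self, chunk: &[u8]) -> bool {
        match self.tx.try_send(chunk.to_vec()) {
            Ok(()) => true,
            Err(_) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Value to surface in the snapshot (writer_dropped_count in the plan).
    pub fn dropped_count(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}

/// Writer loop: drains the queue into the sink until every producer
/// handle has been dropped, then flushes and returns the sink.
pub fn run_writer<W: Write>(rx: Receiver<Vec<u8>>, mut sink: W) -> io::Result<W> {
    for chunk in rx {
        sink.write_all(&chunk)?;
    }
    sink.flush()?;
    Ok(sink)
}
```

In use, run_writer would be moved onto its own thread or task (for example with std::thread::spawn) so that a slow disk only fills the queue, while enqueue returns immediately on the read path.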
Phase 3 moves VideoWriter::write_frame() off the egui
event handler onto a dedicated encoder task. Done when
CaptureSession::frame() is a non-blocking enqueue and the egui
event loop is not stalled by H.264 encoding.
Phase 4 is optional and only justified if phases 1–3 leave a residual gap where the renderer is suspected. It adds an arrival-to-display latency measurement at the renderer.
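The render-side measurement described under "Resolved decisions" (queue lag between emit and drain) might look roughly like the sketch below. It is not the phase 4 implementation: the event enum is reduced to the timestamp field, LagRing, LagRings, and LagAggregate are hypothetical names, and splitting the two rings by event kind is an assumption about what "two bounded rings" means.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Receiver;

/// Reduced stand-in for the real event enum; only the timestamp is kept.
pub enum ChannelEvent {
    ImageReady { produced_at_secs: f64 },
    DisplayMark { produced_at_secs: f64 },
}

/// Hypothetical min/max/mean summary of queue lag, in seconds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LagAggregate {
    pub min: f64,
    pub max: f64,
    pub mean: f64,
}

/// Bounded ring of recent lag samples.
pub struct LagRing {
    samples: VecDeque<f64>,
    cap: usize,
}

impl LagRing {
    pub fn new(cap: usize) -> Self {
        let cap = cap.max(1); // a zero-capacity ring would never evict
        Self { samples: VecDeque::with_capacity(cap), cap }
    }

    pub fn push(&mut self, lag_secs: f64) {
        if self.samples.len() == self.cap {
            self.samples.pop_front();
        }
        self.samples.push_back(lag_secs);
    }

    pub fn aggregate(&self) -> Option<LagAggregate> {
        if self.samples.is_empty() {
            return None;
        }
        let min = self.samples.iter().copied().fold(f64::INFINITY, f64::min);
        let max = self.samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
        let mean = self.samples.iter().sum::<f64>() / self.samples.len() as f64;
        Some(LagAggregate { min, max, mean })
    }
}

/// One ring per event kind (assumed split).
pub struct LagRings {
    pub image_ready: LagRing,
    pub display_mark: LagRing,
}

impl LagRings {
    pub fn new(cap: usize) -> Self {
        Self { image_ready: LagRing::new(cap), display_mark: LagRing::new(cap) }
    }
}

/// Drain pending events and record consumed_at - produced_at for each.
pub fn process_events(
    rx: &Receiver<ChannelEvent>,
    consumed_at_secs: f64,
    rings: &mut LagRings,
) {
    while let Ok(event) = rx.try_recv() {
        match event {
            ChannelEvent::ImageReady { produced_at_secs } => rings
                .image_ready
                .push((consumed_at_secs - produced_at_secs).max(0.0)),
            ChannelEvent::DisplayMark { produced_at_secs } => rings
                .display_mark
                .push((consumed_at_secs - produced_at_secs).max(0.0)),
        }
    }
}
```

Both timestamps need to come from the same session-relative clock for the subtraction to be meaningful; the clamp to zero only guards against small ordering artefacts.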
Phases 2 and 3 can run in either order once phase 1 data confirms inline I/O is actually a meaningful contributor.
Out of scope: changes to the SPICE protocol layer, decode algorithm changes (GLZ, Lz4), or renderer architecture changes.