K1: tokio waker silently lost on main channel after T+465 s¶
Status: RESOLVED (2026-05-11)¶
Root cause was not a tokio waker bug. It was an
abandoned-receiver deadlock in our own session orchestrator
(shakenfist-spice-renderer/src/session.rs): an intermediate
mpsc::channel(64) was drained only until SessionInitialized
and ChannelsAvailable had arrived, then dropped on the floor
while MainChannel kept producing ChannelEvent::Latency on
every server PING. After ~65 pings (~T+466 s, the exact
fingerprint described below) the buffer filled and main blocked
forever inside event_tx.send().await. The "wakers never fire"
appearance was real but downstream: main's whole task was
suspended on the mpsc backpressure, so heartbeat/keepalive arms
of its tokio::select! couldn't be polled, and the server's
30 s rcc timer eventually tore the channel down.
Fix (commit 370d8ce5): replaced the temp mpsc with two
oneshot channels for the session id and channels list, so
main now sends every event directly into the real caller-owned
event_tx (capacity 1024, actually drained by the renderer).
All event_tx.send() calls in main_channel.rs are now
wrapped in a 5 s timeout helper as defense-in-depth.
Regression test (commit cf3d31f5): make test-k1-idle
(driver in tools/test-k1-idle.sh) connects ryll headless,
idles 540 s, and asserts no wedge / no timeout / sufficient
pong count. Verified passing against both spi2eth (TLS) and
the local QEMU test-qemu target.
The chronology below is preserved as a record of how the diagnosis went, including everything we ruled out before we caught the real culprit. The original "TL;DR" framing below described the symptom, not the cause.
Original investigation (preserved for history)¶
TL;DR (as written during the investigation)¶
After ~7 minutes 45 seconds of an idle SPICE session, ryll's main-channel tokio task suspends with two valid Wakers registered (one with the timer wheel, one with the IO driver for the TLS socket). Neither Waker is ever called again. Other tokio tasks on the same runtime continue to function normally; the rest of the session disconnects ~45 seconds later when the SPICE server's rcc 30 s connectivity check fires and tears every channel down. ryll's main task remains hung indefinitely (does not exit even on SIGINT).
This document is a checkpoint of the K1 investigation (PLAN-session-001-feedback Phase 02) so a future session can pick up where we left off.
Reproducer¶
- A built ryll connecting to any SPICE server (we have used Kerbside-fronted oVirt VMs and a direct libvirt-managed GNOME desktop). Server choice does not appear to matter.
- Connect, leave the session backgrounded / unfocused / idle for >7m45s.
- Disconnect modal pops at exactly session_uptime ≈ 510 s in every reproduction (range 510–540 s, dominated by 510 s).
- main's tokio task does not exit; the run-loop wrapper (added
in commit
c956ed65) never logsexited cleanlyorexited with errorfor main even though it does for every other channel.
This reproduces against the existing sf-4 test target from
the dev box at 192.168.1.7 and from a macOS laptop. It
reproduces against a GNOME guest (which also fires K2 large-
frame warnings every 60 s) — we have not yet tested against
uefi-latency-guest to see whether the trigger is GNOME-
display-traffic-shaped.
What we have eliminated¶
The bug survives all of the following changes:
| Tested change | Outcome |
|---|---|
| macOS only? Linux reproduces with identical signature | ❌ not platform-specific |
Multi-threaded scheduler? --debug-single-thread-runtime reproduces too |
❌ not scheduler |
| macOS App Nap / pasteboard? Linux has neither, reproduces | ❌ not macOS-specific |
Clipboard polling? RYLL_DISABLE_CLIPBOARD_POLL=1 reproduces with last_arm=keepalive_idle_done |
❌ not the clipboard arm |
| My Phase 02 keepalive code? K1 was already in session-001b before any keepalive/heartbeat additions; the new code just made it diagnosable | ❌ not Phase 02 additions |
| Server-side stop? Wire-level tcpdump confirms: server keeps a healthy connection until T+510 (ack exchanges, then FIN-ACK), and FIN delivery is intact at the kernel level | ❌ not server-side after T+465 |
Diagnostic infrastructure left in tree¶
The following diagnostic instrumentation is live on the
session-001-feedback branch and should stay in until K1 is
closed. Each is gated either by an env var or a Cargo feature
so production builds aren't affected.
| What | Where | Activated by |
|---|---|---|
Per-channel run_loop exit logs |
shakenfist-spice-renderer/src/channels/*.rs (commit c956ed65) |
always |
Heartbeat with iter_count and last_arm |
main_channel.rs run_loop |
always (info!) |
In-process gdb watchdog (5 s silence threshold, dumps thread apply all bt to /tmp/ryll-watchdog-bt-<pid>-<ts>.txt) |
main_channel.rs run_loop |
env: RYLL_WATCHDOG_GDB=1 |
--debug-single-thread-runtime flag |
ryll/src/{config,app,main}.rs |
CLI flag |
RYLL_DISABLE_CLIPBOARD_POLL env var |
main_channel.rs run_loop |
env: RYLL_DISABLE_CLIPBOARD_POLL=1 |
tokio-console cargo feature with console-subscriber 0.5+ |
ryll/Cargo.toml, ryll/src/main.rs |
env: RYLL_TOKIO_CONSOLE=1 (build with make build-tokio-console) |
make macos-build / make build-tokio-console Makefile targets |
Makefile |
manual |
| Per-channel pcap capture (post-decryption SPICE) | existing --capture <DIR> |
CLI flag |
The RYLL_GIT_SHA build-time embed (commit e3179ca5) means
every test binary identifies itself in info!("ryll v… (sha)")
at startup and in the bug-report metadata.json.
Test session evidence directory¶
~/ryll-test-sessions/test-session-001*/ holds raw artefacts
from every reproduction. Notable ones:
001b/— three early macOS captures, no Phase 02 code.001c/— first macOS capture with Phase 02 keepalive plumbing.001d/— first run with theclient_keepalive_send_countfield; established the ~31-keepalive deterministic pattern.001f/— confirmed main never logs an exit.001g/,001h/— Linux reproductions; established the bug is not macOS-specific.001i/— first Linux disconnect zip.001j/— Linux with watchdog + run-loop wrapper.001k/—--debug-single-thread-runtime(still hangs).001m/—RYLL_DISABLE_CLIPBOARD_POLL=1(still hangs).001n/,001o/— heartbeat with iter counter showing the loop iterated ~3/sec right up to T+465 then stopped.001p/— per-channel pcap capture confirmed which arms fired and when each channel went silent.001q/— wire-level tcpdump (wire.pcap) confirmed the server sent FIN at T+510; ryll's main never reacted.001s/— tokio-console capture (this document's smoking gun).
The smoking-gun snapshot (test-session-001s, 2026-05-10)¶
Test #10: ryll built with make build-tokio-console, run with
RYLL_WATCHDOG_GDB=1 RYLL_TOKIO_CONSOLE=1. ryll started at
19:29:12 local. K1 fired at T+465.246. tokio-console connected.
Drilling into main's task (id=23, spawned at
shakenfist-spice-renderer/src/session.rs:118:23) yielded:
ID: 23 ⏸
Target: tokio::task
Location: shakenfist-spice-renderer/src/session.rs:118:23
Total Time: 13m00s
Busy: 1.17s (0.15%)
Waker:
Current wakers: 2
Clones: 12595, Drops: 12593
Woken: 2179 times
Last woken: 5m14s ago
Warning: This task occupies a large amount of stack space (1432 bytes)
Decoded:
- Last woken 5m14s ago = session_uptime 7m46s = T+466: matches every prior reproduction's freeze time exactly.
- Current wakers: 2 = one clone for the timer wheel
(covers
heartbeat,keepalive_idle,keepalive_timeout), one clone for the IO driver (TLS socket via mio). - Both wakers stay registered for 5+ minutes without firing — even though heartbeat's deadline (T+467) and the FIN at T+510 should both have called them.
- Other tokio tasks on the same runtime are unaffected. Their wakers fire correctly throughout. Only main's task is affected.
- The 1432-byte stack-space warning is a perf hint, not a trigger; tokio's auto-box threshold is 2 KiB, we're under it. Probably unrelated.
Conclusion¶
Two valid Waker subscriptions are registered for main's task; both are silently dropped/disconnected by the runtime after T+465–466 specifically for this task. This is either a bug in tokio's runtime bookkeeping or an extremely subtle interaction in our code that triggers exactly that condition.
Tests against current dep versions¶
| Test | Tokio | Rustls | Tokio-rustls | Mio | Result |
|---|---|---|---|---|---|
| Original (session-001b–001s) | 1.51.1 | 0.23.37 | 0.26.4 | 1.x | K1 reproduces |
| Test #11 (001t) | 1.52.3 | 0.23.37 | 0.26.4 | 1.x | K1 reproduces |
| Test #12 (001u) | 1.52.3 | 0.23.40 | 0.26.4 | 1.2.0 | K1 reproduces |
All semver-compatible bumps to the relevant async-IO crates
(via cargo update) leave the K1 signature identical:
pong_send_count: 66, client_keepalive_send_count: 31,
last_send_ts: 465.x, main hangs.
Searched upstream tokio issues and discussions for matching symptoms (waker registered but never fires, task suspended indefinitely with select!, similar). Closest hits — #4730 (one task halts executor — ruled out, our other tasks run),
7632 (wake_by_ref weak-memory bug — ruled out, x86 Linux¶
reproduces and is sequentially consistent), #2565 (interval inside select — ruled out, user error in that case) — none match our specific signature. No open issue currently documents "select! arm with sleep_until + interval + tokio-rustls TcpStream loses both registered wakers after specific server-driven event".
Next steps (in order of cheapness)¶
- Bisect ryll's history via
make buildcheckout-by- checkout to find when K1 was first introduced. The pattern (always at T+510 disconnect) suggests it's been latent for a while and only became visible once the user's dogfooding hit the trigger window. - Try replacing
tokio::select!withtokio_util::time::FuturesUnorderedor hand-rolled future polling in main_channel — different waker subscription shape, might dodge the trigger even if it doesn't fix the underlying tokio bug. - Test against
uefi-latency-guestviamake test-qemuto rule in or out GNOME-display-traffic shape. - Try major bumps: tokio-rustls 0.26 → 0.27 (may need rustls 0.24), eframe 0.29 → 0.34 (significant API churn). Bigger code change, but exhaustively rules out dep-version dependence.
- Construct a minimal reproducer suitable for filing a
tokio issue. The shape is: a single channel reading a
tokio-rustls TcpStream inside a
tokio::select!with severalsleep_untilandinterval.tick()arms; under sustained server PING/PONG load the read arm has been firing reliably; after some specific server-driven event the task suspends with non-zero registered Wakers that never fire. - Pivot to Phase 02 Block A (auto-reconnect with backoff) so the user-visible failure mode improves regardless of K1's root cause. K1 becomes a survivable "session blip" rather than a hard hang. ~several hours of solid implementation work; valuable on its own merits.
Open questions¶
- Does the bug trigger only on main-channel-shaped tasks? Main has 7 select! arms (read + monitors_config_rx + debounce_sleep + clipboard_interval + keepalive_timeout + keepalive_idle + heartbeat) where most other channels have 3–4. Could the count of arms or the size of the state machine matter?
- What's special about the 12th server PING? Server
consistently sends 12 PINGs at 15 s intervals (T+300 to
T+465) then stops. This is server-side behaviour we have
not characterised — but it does NOT cause the freeze on its
own (server keeps the TCP connection alive cleanly through
T+510 per
wire.pcap). - Why does the freeze fire after PING #66 specifically?
The disconnect-cause snapshots in every reproduction show
pong_send_count: 65at the last visible heartbeat andpong_send_count: 66after the freeze. The hang is triggered after exactly one more PING-PONG cycle past whatever server-side state change happens at T+465.
What to do if the session restarts¶
- The branch is
session-001-feedback. Latest commits should be onorigin/session-001-feedback. - All of the diagnostic instrumentation listed above is in the latest commits; nothing needs to be re-added.
- The
~/ryll-test-sessions/test-session-001*/directories are NOT under git — they're scratch evidence. Don't delete. - The
--cfg tokio_unstablebuild is needed for tokio-console but the regularmake builddoes NOT need it. Don't accidentally applyRUSTFLAGS=--cfg tokio_unstableto a release build. - The watchdog environment variable is
RYLL_WATCHDOG_GDB=1, notRYLL_GDB_WATCHDOG=1.
Companion plan¶
This investigation is part of Phase 02 of
docs/plans/PLAN-session-001-feedback-phase-02-reconnect.md.
The K1 fix (whatever it ends up being) should be folded into
Phase 02's "Block C" once we have it. The auto-reconnect UX
work in Block A still applies regardless of K1's outcome —
even if main is fixed, transient network errors will still
benefit from automatic reconnect.