# Phase 5: Auto-snapshot bug-report mode

Phase 5 of `PLAN-stream-caps-and-flap.md`.

## Prompt
Before responding to questions or discussion points in this document, explore the ryll codebase thoroughly. The bug-report plumbing already exists end-to-end — your job is to wire a new automated trigger to the existing assembly + write path, not to redesign either. Specifically read:
- `ryll/src/bugreport.rs`: `BugReport::new` (assembly) and `BugReport::write_zip` (file emission).
- `ryll/src/app.rs:2186`: the call site that builds a `BugReport` for a manual F8 trigger; `manual_bug_report_dir` for the existing dir-resolution logic.
- `ryll/src/config.rs`: how CLI args are declared and threaded through.
- `ryll/src/main.rs:228` and `:772`: how `bug_report_dir` is resolved and handed to the renderer.
Flag uncertainty explicitly rather than guessing.
## Goal
Add a flight-data-recorder mode that fires a complete bug report on a fixed cadence into a rolling subdirectory. Operator sets it once at session start; whatever happens during the run is captured by construction, regardless of whether the operator notices a symptom in real time.
This directly addresses the session-002f problem: audio worked for the entire run, so no bug-report was filed — but if audio had gone silent for 30 seconds mid-session the operator might not have caught it in time to trigger a report manually. With auto-snapshot at a 30-second cadence, the silent window would have been captured in two reports either side and we could correlate playback counters across the boundary.
## Scope
In scope:
- New CLI flag `--auto-snapshot-interval SECONDS` (0 or unset = disabled; any value ≥ 1 enables periodic capture).
- New CLI flag `--auto-snapshot-cap N` (default 20). This is the maximum number of auto-snapshot zips kept on disk; the oldest are pruned when capacity is exceeded.
- A tokio interval task, spawned at session start when the interval is set. It ticks every N seconds, assembles a `BugReport` via the existing `BugReport::new` path, and writes it via `write_zip` into a dedicated `auto-snapshots/` subdirectory of the operator's bug-report dir.
- An auto-generated description matching the manual-report convention: `"auto-snapshot T+47.3s"`. Session uptime is embedded so an operator can correlate across snapshots.
- The auto-snapshot's report type. The obvious candidate is the existing, lightweight `BugReportType::Connection`. The channel-state dispatch checks the report type to pick which channel's snapshot to embed, so this choice determines which diagnostics each zip carries.
  - Unverified: this plan describes `Connection` both as carrying the new phase-4 playback diagnostics and as covering "just main". Confirm against `bugreport.rs` in 5A.
  - Decide in 5A: keep `Connection` (lightweight, just main) or add a new `BugReportType::AutoSnapshot` that also includes playback/usbredir/webdav. Working answer: extend with the new variant.
- Per operator direction, pcap stays in auto-snapshot zips. The estimated disk cost is ~700 KiB pcap + ~150 KiB JSON, plus a screenshot when a display surface is available. That comes to roughly 1 to 1.5 MiB per zip, or ~20 to 30 MiB at the default cap of 20, which is acceptable for the diagnostic value.
- Filename scheme: `auto-snapshots/ryll-auto-snapshot-<utc_iso>-T+<uptime_secs>.zip`. The UTC ISO timestamp makes the names sortable, and the appended uptime means the filename alone tells you when in the run it fired.
- Operator awareness: startup notification. When the auto-snapshot interval task spawns at session start, push one `NotifySeverity::Info` notification via `push_notification` with `NotificationSource::Internal`: `"Auto-snapshot mode enabled — every {N}s, max {cap} snapshots, saving to {path}"`.
  - It is one-shot, never repeated, with no cool-down; it is the operator's confirmation that the flag took effect.
  - Without it, the operator has no in-app signal that the mode is active until the first zip lands on disk.
- Operator awareness: live counter in the stats panel. Surface `auto_snapshots_saved: u64` and `auto_snapshots_pruned: u64` on a snapshot the stats panel reads. `AppSnapshot` is the natural home, since it already carries other session-wide counters.
  - When the mode is enabled, render a single line in the existing stats panel as `"Auto-snapshot: {saved}/{cap}"`. Hide the line entirely when the mode is disabled, so it doesn't add visual noise to non-auto sessions. The operator can glance at it at any time without scrolling notifications.
  - The interval task updates the counters after each successful `write_zip` (`saved += 1`) and after each prune (`pruned += deleted`).
  - Open point: `saved` is cumulative, so once pruning starts the line reads e.g. `21/20`. Smoke test 5C(e), however, expects the line to stay at or below the cap. Resolve in 5A: either render `saved - pruned` (zips currently retained), or keep the cumulative `saved` and relax 5C(e).
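
A minimal sketch of the counter plumbing. It assumes a hypothetical `AutoSnapshotCounters` shared (e.g. via `Arc`) between the interval task, which writes it, and the stats panel, which reads it. Atomics are used here instead of fields on `AppSnapshot` behind its `Mutex`. Either approach works; atomics just avoid taking a lock on every frame. The line is rendered as currently specified (cumulative `saved` over `cap`), and the `saved - pruned` alternative would be a one-line change to `stats_line`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters shared between the auto-snapshot task (writer)
/// and the stats panel (reader). Relaxed ordering is enough: the values
/// are display-only and do not guard any other data.
#[derive(Default)]
pub struct AutoSnapshotCounters {
    saved: AtomicU64,
    pruned: AtomicU64,
}

impl AutoSnapshotCounters {
    /// Call after each successful `write_zip`.
    pub fn record_saved(&self) {
        self.saved.fetch_add(1, Ordering::Relaxed);
    }

    /// Call after each prune pass with the number of files it deleted.
    pub fn record_pruned(&self, deleted: u64) {
        self.pruned.fetch_add(deleted, Ordering::Relaxed);
    }

    pub fn saved(&self) -> u64 {
        self.saved.load(Ordering::Relaxed)
    }

    pub fn pruned(&self) -> u64 {
        self.pruned.load(Ordering::Relaxed)
    }
}

/// Stats-panel line. `None` (mode disabled) means the line is not drawn at all.
pub fn stats_line(counters: Option<&AutoSnapshotCounters>, cap: usize) -> Option<String> {
    counters.map(|c| format!("Auto-snapshot: {}/{}", c.saved(), cap))
}
```
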
Out of scope:
- A hamburger-menu live toggle. CLI-only for this phase. Adding a UI toggle later is straightforward but doesn't add diagnostic value over the CLI flag (operator decides before the run whether they're hunting for an intermittent issue).
- Per-snapshot description customisation. The auto string is deterministic; an operator who wants a specific description should use the manual F8 path.
- Compression of the pcap before zip. The existing zip flow already compresses; extra work for marginal saving.
- Screenshot in auto-snapshots. Decision: include the screenshot if it can be captured cheaply from the existing trigger-snapshot ring (`AppSnapshot`'s `surfaces` already provide the latest frame). If reaching that data from a non-GUI thread is intrusive, drop the screenshot for auto reports; pcap + channel-state is the high-value payload. Document the decision in code comments.
- A notification per snapshot. At a 30 s cadence, a 10-minute run would produce 20 notifications, burying everything else in the panel. The startup notification plus the live stats-panel counter cover awareness without the noise.
  - If a future operator wants per-snapshot logging, the cheap-to-add follow-up is a separate `--auto-snapshot-verbose` flag with a milestone-cadence notification (`"10 saved, 1 pruned"` every Nth tick). Deferred deliberately.
- Disconnect-triggered auto-snapshot. The existing auto-disconnect path already fires a bug report on disconnect; auto-snapshot mode is independent of that.
## Open questions
- Q1 (decide in 5A): how does the auto-snapshot task reach the data it needs? `BugReport::new` takes references to `traffic`, `channel_snapshots`, `app_snapshot` and `notifications`, plus the target host/port and optional surface pixels. The GUI thread holds these on the `RyllApp` struct. Working proposal:
  - Introduce an `AutoSnapshotState` struct holding `Arc<TrafficCapture>`, `Arc<ChannelSnapshots>`, `Arc<Mutex<AppSnapshot>>`, `Arc<Mutex<NotificationStore>>`, plus the resolved target host/port, output dir, cap, and interval. Construct it once on session bring-up, after `RyllApp::new`.
  - The interval task owns the `AutoSnapshotState` and calls a new `BugReport::new_auto(...)` helper. The helper wraps `BugReport::new` with the auto description and a `None` screenshot, unless the screenshot decision under Out of scope lands on including it.
  - Suggested: copy what the report needs out of each `Mutex` before the 2-second metric sample, so the GUI thread never waits on a lock held for the whole sample.
  - Assumption to verify: the existing fields on `RyllApp` are already `Arc`-backed (check by reading `app.rs:2186` and nearby). If any are not, this step bumps them to `Arc`.
- Q2 (decide in 5A): how do we prune to the cap? Working proposal:
  - After each successful `write_zip`, scan `auto-snapshots/` for `ryll-auto-snapshot-*.zip` and sort by filename, which is timestamp-ordered by construction.
  - Keep the newest `cap` files and delete the rest.
  - It is a single pass, with no state carried in memory across ticks. If the operator deletes some zips between ticks, the prune still works correctly, because it operates on whatever is on disk.
- Q3 (decide in 5A): what happens on `write_zip` failure? The existing manual path surfaces failures via the `NotificationStore`. The auto-snapshot task should NOT spam the UI on every failed tick; with a full disk, a notification every 30 seconds would be too noisy. Working proposal:
  - Log at `warn` on the first failure and emit a single `NotifySeverity::Warn` notification, with a 5-minute cool-down before any further notification.
  - Log at `debug` thereafter.
  - Never block the interval task; keep ticking in case the underlying problem clears.
- Q4 (open): does auto-snapshot mode work without `--bug-report-dir`? Working answer: yes.
  - The existing `manual_bug_report_dir()` fallback chain applies unchanged: `--bug-report-dir` → `<--capture dir>/bug-reports/` → cwd. Auto-snapshots go into `<that_dir>/auto-snapshots/`.
  - If neither flag is set, zips land in `./auto-snapshots/` under the current working directory. An operator can therefore run `--auto-snapshot-interval 30` from any directory and find their data afterwards.
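
The Q3 cool-down policy is small enough to isolate from the tokio loop and unit-test without sleeping. Below is a sketch built around a hypothetical `FailureGate` that the loop would consult on every failed tick. The caller maps the returned action to `warn!`/`debug!` logs and `push_notification` calls; those calls are not shown, and the names are illustrative.

```rust
use std::time::{Duration, Instant};

/// Hypothetical outcome of the Q3 policy for one failed tick.
#[derive(Debug, PartialEq, Eq)]
pub enum FailureAction {
    /// First failure: log at warn and push one Warn notification.
    WarnAndNotify,
    /// Cool-down has elapsed since the last notification: notify again.
    Notify,
    /// Still inside the cool-down window: log at debug only.
    DebugOnly,
}

pub struct FailureGate {
    cooldown: Duration,
    last_notified: Option<Instant>,
}

impl FailureGate {
    pub fn new(cooldown: Duration) -> Self {
        Self { cooldown, last_notified: None }
    }

    /// `now` is injected so the policy can be tested without real time passing.
    pub fn on_failure(&mut self, now: Instant) -> FailureAction {
        match self.last_notified {
            None => {
                self.last_notified = Some(now);
                FailureAction::WarnAndNotify
            }
            Some(prev) if now.duration_since(prev) >= self.cooldown => {
                self.last_notified = Some(now);
                FailureAction::Notify
            }
            Some(_) => FailureAction::DebugOnly,
        }
    }
}
```

The plan does not say whether a successful tick should reset the gate so that a later, unrelated outage warns immediately. Setting `last_notified` back to `None` on success would be a one-line addition.
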
## Design notes

### Where it slots in
```
[ryll startup]
  │
  ├─ parse args (--auto-snapshot-interval N, --auto-snapshot-cap M)
  │
  └─ RyllApp::new(...) returns the GUI app
       │
       └─ on session start (post-handshake):
            │
            └─ if auto_snapshot_interval > 0:
                 │
                 ├─ resolve dir = <bug_report_dir>/auto-snapshots/
                 ├─ build AutoSnapshotState (Arc'd handles)
                 └─ tokio::spawn(auto_snapshot_loop(state))
                      │
                      └─ interval.tick() every N seconds:
                           ├─ BugReport::new_auto(...)
                           ├─ report.write_zip(dir)
                           ├─ prune dir to cap
                           └─ continue loop
```
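
The tick body in the diagram can be kept synchronous and run inside `spawn_blocking`. Below is a sketch under stated assumptions:

- `write` and `prune` are closures standing in for `BugReport::new_auto(...)` + `write_zip` and for the Q2 prune step.
- `write_zip` is assumed to return the written path; its real return type is not confirmed here.
- `TickOutcome` and `run_tick` are hypothetical. The async loop would turn the outcome into counter bumps and Q3 failure handling.

```rust
use std::io;
use std::path::PathBuf;

/// Hypothetical result of one tick, handed back to the async loop.
#[derive(Debug, PartialEq)]
pub enum TickOutcome {
    /// Zip written and the directory pruned back to the cap.
    Saved { path: PathBuf, pruned: u64 },
    /// Zip written but pruning failed; the next tick's prune retries it.
    SavedPruneFailed { path: PathBuf, error: String },
    /// Nothing new landed on disk; feed this into the Q3 cool-down gate.
    WriteFailed(String),
}

/// One tick's blocking work: write first, prune only after a successful write.
/// Errors are returned as values so the caller's loop never stops ticking.
pub fn run_tick<W, P>(write: W, prune: P) -> TickOutcome
where
    W: FnOnce() -> io::Result<PathBuf>,
    P: FnOnce() -> io::Result<u64>,
{
    let path = match write() {
        Ok(path) => path,
        Err(e) => return TickOutcome::WriteFailed(e.to_string()),
    };
    match prune() {
        Ok(pruned) => TickOutcome::Saved { path, pruned },
        Err(e) => TickOutcome::SavedPruneFailed { path, error: e.to_string() },
    }
}
```
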
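### Filename scheme

Each zip is named `ryll-auto-snapshot-<utc_iso>-T+<uptime_secs>.zip`: an ISO-style UTC timestamp plus the session uptime. The directory listing sorts chronologically, and the uptime tells the operator how far into the session each snapshot landed without opening `metadata.json`.

Lexical order matches chronological order only if the timestamp is fixed-width and zero-padded. The ISO 8601 basic form (e.g. `20231114T221320Z`) satisfies that, and it also avoids colons, which are not valid in Windows filenames.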
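
Below is a std-only sketch of the filename and description generators. It rests on these assumptions:

- The timestamp is rendered in ISO 8601 basic form from seconds since the Unix epoch. The caller would pass `SystemTime::now().duration_since(UNIX_EPOCH)?.as_secs()`.
- `civil_from_days` is Howard Hinnant's well-known days-to-civil-date algorithm. If ryll already depends on `chrono`, `Utc::now().format("%Y%m%dT%H%M%SZ")` would replace it.
- All function names are hypothetical.

```rust
use std::time::Duration;

/// Days since 1970-01-01 -> (year, month, day). Howard Hinnant's
/// `civil_from_days`, restricted to dates on or after the epoch so that
/// unsigned arithmetic is enough.
fn civil_from_days(days: u64) -> (u64, u64, u64) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z / 146_097; // 400-year eras
    let doe = z - era * 146_097; // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // March-based month, [0, 11]
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = era * 400 + yoe + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Hypothetical generator for
/// `ryll-auto-snapshot-<YYYYMMDDTHHMMSSZ>-T+<uptime_secs>.zip`.
pub fn snapshot_filename(unix_secs: u64, uptime: Duration) -> String {
    let (y, mo, d) = civil_from_days(unix_secs / 86_400);
    let secs_of_day = unix_secs % 86_400;
    let (h, mi, s) = (secs_of_day / 3_600, secs_of_day % 3_600 / 60, secs_of_day % 60);
    let up = uptime.as_secs();
    format!("ryll-auto-snapshot-{y:04}{mo:02}{d:02}T{h:02}{mi:02}{s:02}Z-T+{up}.zip")
}

/// Hypothetical description generator matching `"auto-snapshot T+47.3s"`.
pub fn auto_description(uptime: Duration) -> String {
    format!("auto-snapshot T+{:.1}s", uptime.as_secs_f64())
}
```

The filename carries whole seconds of uptime, while the description keeps one decimal place to match the existing `T+47.3s` convention.
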
### Disk pressure

Per snapshot (rough order of magnitude):

- `channel-state.json`: 5–15 KiB
- `metadata.json`, `session.json`, `notifications.json`, `runtime-metrics.json`: ~5 KiB combined
- pcap: ~700 KiB per 20-second window (varies with bandwidth)
- `screenshot.png`: ~700 KiB for a 1920×1472 surface
- `screenshot-region.png`: not applicable for auto-snapshots (no region selected)

Per zip after compression: 700 KiB – 1.5 MiB. At the default cap of 20, total disk use is roughly 14–30 MiB. This is acceptable, and it can be adjusted via `--auto-snapshot-cap`.
### Interaction with `--capture`

`--capture <dir>` already creates `<dir>/bug-reports/` for manual reports. When `--capture` is set and no explicit `--bug-report-dir` is provided, auto-snapshots go into `<dir>/bug-reports/auto-snapshots/`. Documented in 5B.
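
Below is a sketch of the directory resolution described in Q4 and above, written as a pure function so the fallback order is testable. In ryll itself it would be better to call the existing `manual_bug_report_dir()` and `.join("auto-snapshots")` than to duplicate the chain. `auto_snapshot_dir` and its parameters are hypothetical stand-ins.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical mirror of the fallback chain: explicit `--bug-report-dir`,
/// then `<--capture dir>/bug-reports/`, then the current working directory.
/// Auto-snapshots always go one level below, in `auto-snapshots/`.
pub fn auto_snapshot_dir(
    bug_report_dir: Option<&Path>,
    capture_dir: Option<&Path>,
    cwd: &Path,
) -> PathBuf {
    let base = match (bug_report_dir, capture_dir) {
        (Some(dir), _) => dir.to_path_buf(),
        (None, Some(capture)) => capture.join("bug-reports"),
        (None, None) => cwd.to_path_buf(),
    };
    base.join("auto-snapshots")
}
```

Nothing in the existing flow is described as creating `auto-snapshots/`, so the task would likely need `std::fs::create_dir_all` on the result before the first `write_zip`, unless `write_zip` already creates missing directories.
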
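### Interaction with `BugReport::new`'s 2-second metric sample

`BugReport::new` blocks for 2 seconds to sample runtime metrics. The interval loop runs as its own tokio task, and the blocking sample happens on a `spawn_blocking` thread so the tokio runtime stays responsive.

Whether short intervals cause overlapping samples depends on the loop's shape:

- If each tick spawns the report without awaiting it, an interval under ~3 seconds overlaps samples. This is mostly harmless, since each sample is independent.
- If the loop awaits each report, ticks are missed instead. Tokio's default `MissedTickBehavior::Burst` then fires the missed ticks back to back; `MissedTickBehavior::Skip` avoids that catch-up burst.

Document a minimum recommended interval of 10 s in the CLI help text.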
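
For the await-each-report shape, tokio exposes the skip policy directly via `Interval::set_missed_tick_behavior(MissedTickBehavior::Skip)`. The scheduling rule is simple enough to show without tokio: after a slow tick, the next tick lands on the next multiple of the period measured from the loop's start, and the missed ticks are not replayed. A std-only sketch of that rule follows; the helper name is hypothetical.

```rust
use std::time::Duration;

/// Hypothetical "skip missed ticks" rule: given how long the loop has been
/// running, return the offset (from the loop's start) of the next tick, i.e.
/// the first multiple of `period` strictly after `elapsed`. Ticks that fell
/// inside a slow report are skipped rather than fired back to back.
pub fn next_tick_offset(elapsed: Duration, period: Duration) -> Duration {
    assert!(!period.is_zero(), "the CLI only enables intervals of >= 1 s");
    let period_ns = period.as_nanos();
    let next_index = elapsed.as_nanos() / period_ns + 1;
    let next_ns = u64::try_from(next_index * period_ns).expect("offset fits in u64 nanoseconds");
    Duration::from_nanos(next_ns)
}
```

With a 1 s interval and a report that finishes 2.3 s into the run, the next tick lands at 3 s rather than firing twice immediately. Separately, the first tick of `tokio::time::interval` completes immediately, so the first auto-snapshot would fire right at session start. If that is unwanted, `tokio::time::interval_at(Instant::now() + period, period)` delays it by one period.
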
## Execution step table
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 5A | medium | sonnet | none | Core implementation. Add `auto_snapshot_interval: Option<u64>` and `auto_snapshot_cap: Option<usize>` (unset resolves to the default cap of 20) to `ryll/src/config.rs` (clap derive). Thread them through `main.rs` to `RyllApp::new`. Add an `AutoSnapshotState` struct in a new `ryll/src/auto_snapshot.rs` (or `bugreport.rs` if cleaner) holding `Arc`'d handles to `traffic`, `channel_snapshots`, `app_snapshot`, `notifications`, plus the resolved `target_host`/port, `output_dir`, cap, and interval. Per the Q1 working proposal: if any of these fields on `RyllApp` aren't already `Arc`-backed, bump them. Add a `BugReport::new_auto(...)` helper that wraps `BugReport::new` with the auto-generated description (`format!("auto-snapshot T+{:.1}s", session_uptime_secs)`) and `BugReportType::AutoSnapshot`, a new variant whose channel-name dispatch includes playback + main + display + cursor + inputs + usbredir + webdav so a single zip carries everything. Spawn the interval task from `RyllApp::on_session_ready` (or wherever the session bring-up code lives) when `auto_snapshot_interval` is set. At spawn time, push one `NotifySeverity::Info` notification (`NotificationSource::Internal`): `"Auto-snapshot mode enabled — every {N}s, max {cap} snapshots, saving to {path}"`; one-shot, no cool-down. Add `auto_snapshots_saved: u64` and `auto_snapshots_pruned: u64` to `AppSnapshot` (or the snapshot the stats panel reads); bump them after each successful `write_zip` / prune respectively. Render them in the existing stats panel as `"Auto-snapshot: {saved}/{cap}"` only when the mode is enabled (hide the line entirely when disabled). Implement the prune-to-cap step per Q2: glob `auto-snapshots/ryll-auto-snapshot-*.zip`, sort by filename, delete the oldest beyond the cap. Per Q3: handle `write_zip` errors with a notification cool-down (5 min, single `warn` log on first failure). Per Q4: the subdirectory is `<bug_report_dir>/auto-snapshots/`, using `manual_bug_report_dir`'s fallback chain. Verify `make build && make test && make lint && pre-commit run --all-files`. |
| 5B | low | haiku | none | Docs touch-up. Update README.md (if it covers CLI flags) and docs/configuration.md to document --auto-snapshot-interval and --auto-snapshot-cap. Add a short paragraph to docs/troubleshooting.md under or near the playback-observability section explaining when to enable auto-snapshot mode (intermittent issues, flight-data-recorder use case). Cross-link from docs/libvirt-spice-recommendations.md's "Side-by-side testing recipe" section as an alternative to manual periodic reports. Run pre-commit run --all-files. |
| 5C | — | — | — | Operator smoke test. Run a ryll session against sf-4 with `--auto-snapshot-interval 30 --auto-snapshot-cap 20`. Let it run for ≥ 3 minutes under a typical workload. Confirm: (a) at session start the notification panel shows one Info notification confirming auto-snapshot mode is enabled, with the interval, cap, and target path; (b) the stats panel shows `Auto-snapshot: N/20` and N increments by 1 every ~30 s; (c) `<bug-report-dir>/auto-snapshots/` contains 6+ zips spaced ~30 s apart; (d) each zip's `channel-state.json` shows playback fields populating differently across snapshots (proving the snapshot is re-taken, not the same data re-zipped); (e) after the cap is exceeded, the oldest zips are pruned and the stats counter still reads N/20 (no overflow). At a 30 s cadence with a cap of 20, exceeding the cap takes more than ~10 minutes, which is longer than the 3-minute minimum above, so either extend the run or repeat this check with a smaller cap; (f) pcap is present in each zip and contains the ~20 seconds of traffic preceding the snapshot; (g) the notification panel does NOT show one notification per snapshot (proving we didn't accidentally make it noisy). This is operator verification, not a code change. |
Commits: one per step (5A, 5B). 5C is operator verification.
## Test plan
Automated (5A):
- Unit test for the filename generator: given `(utc_now, uptime_secs)`, it produces the expected pattern.
- Unit test for the prune helper: given a list of `cap + 5` fake filenames, it deletes the 5 oldest. Lexical sort is timestamp order by construction.
- Integration test (if feasible): construct a small `AutoSnapshotState` and run the interval loop for 2 ticks (interval = 1 s). Assert that two zips appear in the target dir and both deserialise without error. Because each report blocks for the 2-second metric sample, expect this test to take several seconds of wall-clock time.
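
The prune helper from Q2 can be split in two:

- a pure planning step, which is what the prune unit test exercises;
- a thin filesystem step that reports how many files it removed (the value fed into `auto_snapshots_pruned`).

This is a sketch. `plan_prune` and `prune_to_cap` are hypothetical names, and the "lexical order equals chronological order" assumption holds only for fixed-width timestamps.

```rust
use std::fs;
use std::io;
use std::path::Path;

const PREFIX: &str = "ryll-auto-snapshot-";
const SUFFIX: &str = ".zip";

/// Pure planning step: which auto-snapshot files to delete so that at most
/// `cap` remain. Other files in the directory are never touched.
pub fn plan_prune(mut names: Vec<String>, cap: usize) -> Vec<String> {
    names.retain(|n| n.starts_with(PREFIX) && n.ends_with(SUFFIX));
    names.sort(); // oldest first, given fixed-width timestamps after PREFIX
    let excess = names.len().saturating_sub(cap);
    names.truncate(excess); // keep only the oldest `excess` names
    names
}

/// Filesystem step: list `dir`, delete the oldest files beyond `cap`, and
/// return how many were removed.
pub fn prune_to_cap(dir: &Path, cap: usize) -> io::Result<u64> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_file() {
            if let Some(name) = entry.file_name().to_str() {
                names.push(name.to_owned());
            }
        }
    }
    let mut deleted = 0;
    for name in plan_prune(names, cap) {
        match fs::remove_file(dir.join(&name)) {
            Ok(()) => deleted += 1,
            // The operator may have removed it between listing and deletion.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(deleted)
}
```
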
Manual (5C):
- Operator-driven; the smoke test in 5C is the contract.
## Documentation impact

- `README.md` / `docs/configuration.md`: new CLI flag docs (5B).
- `docs/troubleshooting.md`: paragraph on auto-snapshot mode and when to use it (5B).
- `docs/libvirt-spice-recommendations.md`: brief reference from the side-by-side testing recipe (5B).
- Phase 10 (the documentation phase) will consolidate further.
## Success criteria

- `--auto-snapshot-interval 30` produces a new zip every ~30 seconds in `<bug-report-dir>/auto-snapshots/`.
- Each zip is a full bug-report artefact equivalent to a manual F8 trigger: `channel-state.json` with all phase-4 diagnostics, pcap, metadata, runtime-metrics.
- The rolling cap is enforced; the oldest zips are pruned when the cap is exceeded.
- Disk usage stays bounded at roughly `cap` × 1–1.5 MiB per zip.
- The auto-snapshot task does not interfere with the GUI thread, the audio thread, or the manual F8 report path.
- Operator awareness:
  - One `Info` notification at session start confirms the mode is enabled, with interval, cap, and target path.
  - A live counter in the stats panel shows `"Auto-snapshot: {saved}/{cap}"` and increments on each successful write; it is hidden when the mode is disabled.
  - No per-snapshot notifications.
- `make build && make test && make lint && pre-commit run --all-files` is clean.
## Back brief

Before executing 5A, the implementing sub-agent should back-brief the operator with:

- Which fields on `RyllApp` need to become `Arc`-backed (if any).
- Whether they're introducing a new `BugReportType::AutoSnapshot` variant or piggybacking on `Connection`.
- The filename scheme (confirm UTC ISO + uptime).
- How they're handling the cpal/audio-thread thread-safety concern. This should be a non-issue if `channel_snapshots` is already `Arc<ChannelSnapshots>`; verify.