# Phase 5: Capture runtime metrics in bug reports
Parent plan: PLAN-idle-cpu-and-latency.md
## Motivation

When the user originally reported "ryll burns 6 cores at
idle", we had no choice but to spawn a profiling sub-agent,
run ryll under `top -H`, poll `/proc/<pid>/task/*/stat`,
and reverse-engineer where the CPU was going. All of that
information is trivially available inside ryll's own
process — we just don't capture it.
If the bug-report ZIP had included per-thread CPU and
thread names, the user's first bug report would have shown
llvmpipe-{0..15} each at ~36% of one core, and phase 1
of this plan would not have needed to exist. This phase
makes future "ryll is slow / hot / leaking" reports
self-debugging.
## Goal
Capture process and per-thread runtime metrics into every bug report. Metrics include:
- Process-level: total CPU%, RSS, VmSize, uptime.
- Per-thread: thread name (from `/proc/self/task/*/comm`), CPU% over the sample window, total CPU time.
The numbers must reflect current rate (e.g. last 2 seconds), not lifetime averages, because lifetime numbers get diluted by startup costs and hide steady-state behaviour.
Linux-first. macOS and Windows can log "metrics unavailable on this platform" and omit the section gracefully — better than no bug reports at all on non-Linux.
## Background

Bug reports are assembled in ryll/src/bugreport.rs. The existing flow:

- `BugReport::new(...)` collects metadata, channel state, optional pcap traffic, and an optional screenshot (bugreport.rs:608+).
- `metadata.json` already contains ryll version, platform, and target host (see `chrono_now()` and friends at bugreport.rs:540-579).
- The ZIP layout is documented in README.md (capture mode section).
The natural extension: add a runtime_metrics field to
the report struct, populated at report-creation time, and
either embed it in metadata.json or write it as a
separate runtime-metrics.json file in the ZIP.
## Approach

### Sampling
CPU% requires two reads of /proc/self/stat (and per-task
equivalents) separated by a sample window, with the delta
divided by elapsed wall time and sysconf(_SC_CLK_TCK).
A 2-second sample is a good default — long enough to be
meaningful, short enough that the user doesn't notice the
button took longer than usual.
Pseudocode:

```rust
fn sample_metrics(window: Duration) -> RuntimeMetrics {
    let snapshot_a = read_proc_self_and_tasks();
    sleep(window);
    let snapshot_b = read_proc_self_and_tasks();
    compute_deltas(snapshot_a, snapshot_b, window)
}
```
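The fiddly part of `read_proc_self_and_tasks` is parsing the stat line itself. A minimal sketch of that helper (the name `parse_stat_cpu_ticks` is illustrative, not existing ryll code) — note that the comm field is parenthesised and may itself contain spaces, so split on the *last* `)` rather than naively on whitespace:

```rust
/// Extract (utime, stime) in clock ticks from a /proc/<pid>/stat or
/// /proc/self/task/<tid>/stat line. Illustrative helper, not ryll code.
fn parse_stat_cpu_ticks(line: &str) -> Option<(u64, u64)> {
    // Field 2 (comm) is wrapped in parentheses and may contain spaces,
    // so everything after the last ')' is the safely-splittable tail.
    let rest = line.rsplit_once(')')?.1;
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // The tail starts at field 3 (state) in proc(5)'s 1-based numbering,
    // so utime (field 14) and stime (field 15) land at indices 11 and 12.
    let utime = fields.get(11)?.parse().ok()?;
    let stime = fields.get(12)?.parse().ok()?;
    Some((utime, stime))
}

fn main() {
    let line = "12347 (llvmpipe-0) S 1 2 3 4 5 6 7 8 9 10 250 120 0 0 20 0 1 0";
    println!("{:?}", parse_stat_cpu_ticks(line)); // → Some((250, 120))
}
```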
The 2-second sleep blocks the report-creation path. Two options:
a) Block on the report-creation thread (simplest, user sees a brief "collecting metrics..." message).
b) Sample continuously in the background (a small task that polls every N seconds, ring-buffers the result, and reports the most recent sample).
Recommendation: option (a) for v1. Bug reports are already a deliberate, non-interactive operation — F12 opens a dialog, the user types a description, then clicks Save. An extra 2 seconds is invisible. Continuous sampling adds a thread that runs forever and contradicts the spirit of this plan's CPU-reduction goal.
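The `compute_deltas` arithmetic reduces to one pure function, which is also the piece worth unit-testing without any sleeping. A sketch (function name is an assumption; `clk_tck` would come from `sysconf(_SC_CLK_TCK)`, typically 100 on Linux):

```rust
use std::time::Duration;

/// Convert a tick delta over a sample window into a CPU percentage.
/// Illustrative sketch: delta_ticks is (utime + stime) growth between
/// the two /proc snapshots; clk_tck is sysconf(_SC_CLK_TCK).
fn cpu_percent(delta_ticks: u64, clk_tck: u64, window: Duration) -> f64 {
    (delta_ticks as f64 / clk_tck as f64) / window.as_secs_f64() * 100.0
}

fn main() {
    // 100 ticks at CLK_TCK=100 over a 2 s window = 1 CPU-second / 2 s = 50%.
    println!("{}", cpu_percent(100, 100, Duration::from_secs(2))); // → 50
    // A fully busy thread accumulates ~200 ticks in 2 s → 100%.
    println!("{}", cpu_percent(200, 100, Duration::from_secs(2))); // → 100
}
```

Because this is pure, the 2-second window never appears in tests — exactly the testability constraint noted below.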
### Per-thread data
/proc/self/task/<tid>/comm gives the thread name (set
by pthread_setname_np or prctl(PR_SET_NAME) — egui,
tokio, cpal, and Mesa all set sensible names). Tokio
worker threads will be named like tokio-runtime-worker;
the egui main thread inherits the binary name; cpal sets
its own; Mesa's llvmpipe threads are llvmpipe-N.
/proc/self/task/<tid>/stat gives utime + stime in
clock ticks, same format as /proc/<pid>/stat.
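Enumerating the task directory can be sketched as below. Taking the base path as a parameter (an assumption, not ryll's current shape) keeps the function testable against a fixture directory instead of a live procfs:

```rust
use std::fs;
use std::path::Path;

/// List (tid, thread name) pairs from a procfs-style task directory.
/// `base` is /proc/self/task in production; parameterised here so a
/// test can point it at a fixture tree. Illustrative sketch.
fn list_threads(base: &Path) -> Vec<(u64, String)> {
    let mut out = Vec::new();
    if let Ok(entries) = fs::read_dir(base) {
        for entry in entries.flatten() {
            // Each subdirectory of /proc/self/task is named after a tid.
            if let Ok(tid) = entry.file_name().to_string_lossy().parse::<u64>() {
                // comm holds the name set via pthread_setname_np / prctl.
                let name = fs::read_to_string(entry.path().join("comm"))
                    .unwrap_or_default()
                    .trim_end()
                    .to_string();
                out.push((tid, name));
            }
        }
    }
    out.sort();
    out
}

fn main() {
    // On Linux this prints the current process's threads; elsewhere the
    // directory doesn't exist and the list is simply empty.
    for (tid, name) in list_threads(Path::new("/proc/self/task")) {
        println!("{tid}\t{name}");
    }
}
```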
### Cross-platform
For v1, gate the implementation behind
#[cfg(target_os = "linux")]. Non-Linux platforms emit
a RuntimeMetrics::Unavailable { reason: "..." } variant.
macOS and Windows support is reasonable to add later via
mach_task_info and GetProcessTimes respectively, but
the dominant ryll user base today is Linux.
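The gate can be sketched as two `cfg`-selected constructors behind one enum. Variant and field names below are assumptions (the real type would also derive `serde::Serialize`):

```rust
/// Sketch of the platform gate; variant/field names are assumptions,
/// and the real struct would carry the full process + thread payload.
#[derive(Debug)]
enum RuntimeMetrics {
    Sampled { sample_window_ms: u64 /* , process, threads, ... */ },
    Unavailable { platform: &'static str, reason: String },
}

#[cfg(target_os = "linux")]
fn sample() -> RuntimeMetrics {
    // Real implementation: two /proc snapshots separated by the window.
    RuntimeMetrics::Sampled { sample_window_ms: 2000 }
}

#[cfg(not(target_os = "linux"))]
fn sample() -> RuntimeMetrics {
    RuntimeMetrics::Unavailable {
        platform: std::env::consts::OS,
        reason: "per-thread metrics not implemented on this platform".into(),
    }
}

fn main() {
    println!("{:?}", sample());
}
```

Callers never branch on platform themselves; they serialise whatever variant they get.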
## Output format

A new file in the ZIP: `runtime-metrics.json`. Schema:
```json
{
  "sample_window_ms": 2000,
  "process": {
    "cpu_percent": 624.3,
    "rss_kb": 184320,
    "vm_size_kb": 1572864,
    "uptime_secs": 47.2
  },
  "threads": [
    { "tid": 12345, "name": "ryll", "cpu_percent": 43.1 },
    { "tid": 12346, "name": "tokio-runtime-worker", "cpu_percent": 0.4 },
    { "tid": 12347, "name": "llvmpipe-0", "cpu_percent": 36.2 },
    ...
  ],
  "platform": "linux"
}
```
Or for non-Linux:

```json
{
  "platform": "macos",
  "available": false,
  "reason": "per-thread metrics not implemented on macOS"
}
```
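In ryll this payload would go through serde, but the fallback shape is simple enough to pin down by hand. A hypothetical helper, shown only to make the schema concrete (not proposed production code):

```rust
/// Render the non-Linux fallback payload. Hand-rolled here purely to
/// illustrate the schema; real code would use serde_json.
fn unavailable_json(platform: &str, reason: &str) -> String {
    format!(
        "{{\n  \"platform\": \"{platform}\",\n  \"available\": false,\n  \"reason\": \"{reason}\"\n}}"
    )
}

fn main() {
    println!(
        "{}",
        unavailable_json("macos", "per-thread metrics not implemented on macOS")
    );
}
```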
## Constraints and edge cases
- Sample window blocks the GUI. Acceptable for bug reports, which already block on the file dialog.
- Thread count can be high (16 llvmpipe + 16 tokio workers + a handful of others = ~35 on the user's machine). Keep the JSON compact; include all threads rather than truncating.
- Counter wraparound for `utime`/`stime`: these are u64 ticks, so not a real concern for any realistic ryll session.
- Permissions: `/proc/self/*` is always readable by the same UID; no privilege issues.
- Test environment: tests should not actually sample for 2 seconds. Either expose the window as a parameter for testing, or skip the sampling test entirely and only test the parsing.
## Steps
| Step | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|
| 5a | medium | sonnet | none | Add a new module ryll/src/metrics.rs with a RuntimeMetrics struct, a sample(window: Duration) -> RuntimeMetrics function (Linux only — gated by #[cfg(target_os = "linux")]), and a non-Linux fallback returning RuntimeMetrics::unavailable(reason). Read /proc/self/stat, /proc/self/status, /proc/self/task/<tid>/stat, /proc/self/task/<tid>/comm. Compute CPU% as (delta_utime + delta_stime) / sysconf(_SC_CLK_TCK) / window.as_secs_f64() * 100.0. Use serde::Serialize so the struct serialises directly to JSON. Add unit tests for the parsing helpers (parse a sample /proc/self/stat string, verify field extraction) — do NOT add a test that actually sleeps for 2 seconds. |
| 5b | medium | sonnet | none | Wire metrics::sample() into BugReport::new() in ryll/src/bugreport.rs. Sample at the start of report assembly with a 2-second window. Add a runtime_metrics: RuntimeMetrics field to the BugReport struct. In the write_zip() method, write a new runtime-metrics.json file alongside the existing files. Add a unit test that constructs a BugReport with a stub metrics value and verifies the ZIP contains runtime-metrics.json with the expected JSON shape. |
| 5c | low | sonnet | none | Update README.md bug-report bullet (around line 24) to mention runtime metrics. Update docs/plans/PLAN-idle-cpu-and-latency.md phase 5 status to Complete. |
## Success criteria for this phase

- F12 → Save → unzip the result → `runtime-metrics.json` is present and contains process + per-thread CPU% from the last 2 seconds.
- On Linux, llvmpipe threads (or whatever's hot) are visible in the per-thread list.
- On non-Linux, the file exists with a clear "platform unsupported" payload.
- `pre-commit run --all-files` and `make test` pass.
- The 2-second sample window is visible to the user but unobtrusive — the bug report dialog can show "Collecting metrics..." or just block the Save button.
- README.md mentions runtime metrics in the bug-report feature bullet.
## Open question
Should runtime-metrics.json also include version info
about the GPU stack (Mesa version, renderer string from
glGetString(GL_RENDERER))? That would have made the
"llvmpipe is the bottleneck" diagnosis even more
immediate. Recommendation: yes, if it's free — wgpu
already queries adapter info during init; if we can capture
the adapter name and backend at startup and embed it here,
that's a one-liner. If it requires a fresh wgpu context,
defer to follow-up.