Phase 1: Display correctness¶
Part of PLAN-deferred-debt.md.
Scope¶
Fix the bugs that produce visually wrong output on screen, harden JPEG parsing against malformed input, and add graceful handling for unsupported image types.
Four sub-tasks, each a separate commit:
| Step | Description | Source |
|---|---|---|
| 1a | GLZ cross-frame image corruption | PLAN-remaining-issues.md #1 |
| 1b | GLZ cross-image retry loop optimisation | PLAN-display-iteration-followups.md |
| 1c | Missing image type stubs | PLAN-display-iteration-followups.md |
| 1d | Bounds check in inject_dht | PLAN-pr20-followup.md #3 |
1a. GLZ cross-frame image corruption¶
Current state¶
The GLZ decompressor lives in
shakenfist-spice-compression/src/glz.rs (307 lines).
It decompresses SPICE GLZ images which can reference
pixels from previously decompressed images via a shared
dictionary of type Arc<Mutex<HashMap<u64, Vec<u8>>>>.
The symptom is random areas of incorrect pixels after incremental display updates, specifically in regions updated by ZLIB_GLZ_RGB or GLZ_RGB draw_copy messages.
Root cause investigation¶
Comparing the Rust implementation against the Python
reference in kerbside/kerbside/utilities/glz.py (132
lines), the decompression algorithm itself appears
faithful. The control byte parsing, pixel_flag/image_flag
decoding, and pixel copy loops all match. However, the
investigation has revealed three dictionary management
bugs that are the likely cause of corruption:
Bug 1: No dictionary eviction. The GLZ header
contains a win_head_dist field (parsed at glz.rs:61)
that specifies the sliding window size for the
dictionary. Images older than image_id - win_head_dist
should be evicted. Currently win_head_dist is parsed
but discarded -- it is not in the DecompressedImage
struct and no eviction code exists anywhere. The
dictionary grows without bound until a RESET message
clears it entirely (display.rs:525).
Without eviction, the dictionary may hold stale images that the server has already evicted from its own dictionary. If a new image is compressed against a server-side dictionary that does NOT contain an old image, but our client-side dictionary still has it, the decompressor could pick up stale pixel data. Conversely, if the dictionary grows too large, memory pressure could cause issues.
The Python reference in
kerbside/testclient/ryll/display_types.py has
eviction code that is commented out -- it was a
fixed-size (100 image) LRU that was disabled. The proxy
copy (kerbside/kerbside/utilities/glz.py) has no
eviction at all. So neither Python implementation
evicts either, suggesting this may not be the primary
corruption cause but is still incorrect behaviour.
Bug 2: IMAGE_FLAGS_CACHE_ME is never checked. The
SPICE protocol's ImageDescriptor has a flags field;
bit 0 (IMAGE_FLAGS_CACHE_ME) tells the client whether
to cache the decoded image. Currently display.rs caches
every decoded GLZ image unconditionally (lines 1117-1129),
regardless of this flag. The constant is defined in the
protocol crate (constants.rs:312) and the flag value
is logged (display.rs:842) but never tested.
If the server sends images without CACHE_ME set, we
should not insert them into the GLZ dictionary. Having
extra images in the dictionary should not corrupt
references to images that ARE in the dictionary, so this
is more of a memory waste / correctness hygiene issue.
Bug 3: INVALIDATE_LIST does not clear GLZ dictionary.
Display channel message handlers for INVALIDATE_LIST
(display.rs:480-512) and INVAL_ALL_PIXMAPS
(display.rs:514-519) only clear self.image_cache, not
self.glz_dictionary. Server-initiated cache
invalidation therefore does not affect GLZ-cached images.
If the server sends an invalidation and then reuses an
image ID with different content, the client will render
stale pixels.
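The fix planned below amounts to touching both maps in these handlers. A minimal sketch, with the struct and handler names simplified stand-ins for the real display.rs types:

```rust
use std::collections::HashMap;

// Simplified stand-in for the display channel state; the real struct in
// display.rs has many more fields and different value types.
struct DisplayChannel {
    image_cache: HashMap<u64, Vec<u8>>,
    glz_dictionary: HashMap<u64, Vec<u8>>,
}

impl DisplayChannel {
    // INVALIDATE_LIST: remove the listed IDs from BOTH caches, not just
    // image_cache as the current handler does.
    fn handle_invalidate_list(&mut self, ids: &[u64]) {
        for id in ids {
            self.image_cache.remove(id);
            self.glz_dictionary.remove(id);
        }
    }

    // INVAL_ALL_PIXMAPS: clear both caches.
    fn handle_inval_all_pixmaps(&mut self) {
        self.image_cache.clear();
        self.glz_dictionary.clear();
    }
}

fn main() {
    let mut ch = DisplayChannel {
        image_cache: HashMap::from([(1, vec![0]), (2, vec![0])]),
        glz_dictionary: HashMap::from([(1, vec![0]), (3, vec![0])]),
    };
    ch.handle_invalidate_list(&[1]);
    assert!(!ch.image_cache.contains_key(&1));
    assert!(!ch.glz_dictionary.contains_key(&1));
    ch.handle_inval_all_pixmaps();
    assert!(ch.image_cache.is_empty() && ch.glz_dictionary.is_empty());
}
```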
Plan¶
- Add `win_head_dist` to `DecompressedImage`. Add a `pub win_head_dist: u32` field to the `DecompressedImage` struct in `shakenfist-spice-compression/src/lib.rs`. Populate it from the parsed header value in `glz.rs`. Update `DecompressedImage::new()` and all construction sites.
- Implement dictionary eviction in display.rs. After inserting a GLZ image into the dictionary (display.rs:1122-1126), evict entries with `image_id < current_id - win_head_dist`. This matches the SPICE server's sliding window semantics.
- Check IMAGE_FLAGS_CACHE_ME before caching. At display.rs:1117, check `img_desc.flags` against `IMAGE_FLAGS_CACHE_ME` before inserting into either cache. Non-GLZ images already go through `image_cache`, which is subject to invalidation, so the main impact is on GLZ images.
Note: The GLZ dictionary is special -- GLZ
cross-frame references require the referenced image
to be present. The server should only generate
cross-frame references to images it told us to cache.
If we start skipping un-flagged images and then get a
cross-frame reference to one, we'll hit the retry/
miss path. This needs careful testing. If it causes
regressions, gate it behind a log warning instead
and keep caching everything. However, for GLZ images
specifically, the server manages the dictionary
window via win_head_dist and the protocol
guarantees that cross-frame references only target
images within the window. So we should always cache
GLZ images (they're part of the dictionary by
definition) but can skip caching for non-GLZ types
when the flag is not set.
- Extend INVALIDATE_LIST to cover GLZ dictionary. In the `INVALIDATE_LIST` handler (display.rs:480-512), also remove matching IDs from `self.glz_dictionary`. Similarly, in `INVAL_ALL_PIXMAPS` (display.rs:514-519), clear both caches. This ensures server-initiated invalidation works for all image types.
- Add debug logging for cross-frame references. In the cross-frame reference path (glz.rs:209-248), add `debug!` logging that shows:
  - the source image ID being looked up
  - whether it was found
  - the pixel_offset and length being copied
  - the dictionary size at lookup time
This will help diagnose any remaining corruption after the dictionary fixes land.
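The eviction and flag checks above can be sketched as follows. `GlzDictionary`, `insert_and_evict`, and `should_cache` are hypothetical names for illustration; the real dictionary lives behind the `Arc<Mutex<HashMap<u64, Vec<u8>>>>` described earlier:

```rust
use std::collections::HashMap;

// Bit 0 of ImageDescriptor.flags, matching the constant in the protocol crate.
const IMAGE_FLAGS_CACHE_ME: u8 = 1 << 0;

// Hypothetical wrapper around the GLZ dictionary map.
struct GlzDictionary {
    images: HashMap<u64, Vec<u8>>,
}

impl GlzDictionary {
    fn new() -> Self {
        Self { images: HashMap::new() }
    }

    /// Insert a decoded GLZ image, then evict everything outside the
    /// server's sliding window: ids older than image_id - win_head_dist.
    fn insert_and_evict(&mut self, image_id: u64, pixels: Vec<u8>, win_head_dist: u32) {
        self.images.insert(image_id, pixels);
        let threshold = image_id.saturating_sub(win_head_dist as u64);
        self.images.retain(|&id, _| id >= threshold);
    }
}

/// GLZ images are always cached (they are part of the dictionary by
/// definition); non-GLZ images are cached only when CACHE_ME is set.
fn should_cache(flags: u8, is_glz: bool) -> bool {
    is_glz || flags & IMAGE_FLAGS_CACHE_ME != 0
}

fn main() {
    let mut dict = GlzDictionary::new();
    for id in 0..10u64 {
        dict.insert_and_evict(id, vec![0u8; 4], 3);
    }
    // A window of 3 behind id 9 keeps ids 6, 7, 8, 9.
    assert_eq!(dict.images.len(), 4);
    assert!(should_cache(0, true));
    assert!(!should_cache(0, false));
}
```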
Files to modify¶
- `shakenfist-spice-compression/src/lib.rs` -- add `win_head_dist` to `DecompressedImage`
- `shakenfist-spice-compression/src/glz.rs` -- populate `win_head_dist`, add debug logging
- `ryll/src/channels/display.rs` -- eviction, CACHE_ME check, INVALIDATE_LIST fix
Testing¶
- `cargo test --workspace` must pass.
- Manual test with a real VM session: connect, move windows, scroll content, and verify no pixel corruption in GLZ-decoded regions.
- Monitor logs for cross-frame reference warnings (which would indicate dictionary misses).
Risk¶
Medium. The dictionary management changes are
functionally new code in a correctness-critical path.
The eviction and invalidation changes should make
behaviour MORE correct (matching what the server
expects), but if the eviction window is too aggressive,
we could evict images that are still referenced. The
win_head_dist value from the header should be the
authoritative window size, matching the server's own
eviction policy.
1b. GLZ cross-image retry loop optimisation¶
Current state¶
When a GLZ image references a previous image that is not yet in the shared dictionary (because another display channel task hasn't finished decompressing it), the decompressor retries up to 20 times with 5ms sleeps (glz.rs:214-216). This adds up to 100ms latency per miss.
The retry loop also holds the dictionary mutex lock for the entire pixel copy operation (glz.rs:219-234), blocking other channels from inserting into the dictionary during that time.
Plan¶
- Replace polling with `tokio::sync::Notify`. Add a `Notify` alongside the dictionary. When an image is inserted into the dictionary (display.rs:1122-1126), call `notify.notify_waiters()`. In the cross-frame reference path, instead of sleeping 5ms, await `notify.notified()` with a timeout. This wakes the decompressor immediately when the referenced image becomes available.
- Minimise lock duration. Clone the referenced image's pixel data while holding the lock, then drop the lock before copying pixels into the output buffer. This reduces contention with other channels. The clone is a memory cost, but GLZ images are typically small tiles, not full frames.
- Keep a timeout. Use `tokio::time::timeout(Duration::from_millis(100), ...)` around the notify wait loop as a safety net. If the image never arrives (e.g. because the server sent a bad reference), fail after 100ms as before rather than blocking forever.
- Update the shared dictionary type. The type alias `SharedGlzDictionary` in display.rs:139 changes from `Arc<Mutex<HashMap<u64, Vec<u8>>>>` to `Arc<GlzDictionary>`. Update all usage sites.
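A std-only sketch of the proposed dictionary. `GlzDictionary` here uses `Mutex` plus `Condvar::wait_timeout` as a synchronous stand-in for the async `tokio::sync::Notify` / `tokio::time::timeout` pair, and the method names are assumptions:

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

// Hypothetical synchronous analogue of the proposed GlzDictionary. The real
// implementation would pair the map with tokio::sync::Notify and await
// notified() under tokio::time::timeout.
struct GlzDictionary {
    images: Mutex<HashMap<u64, Vec<u8>>>,
    available: Condvar,
}

impl GlzDictionary {
    fn new() -> Self {
        Self { images: Mutex::new(HashMap::new()), available: Condvar::new() }
    }

    /// Insert and wake any decompressor waiting on a cross-frame reference.
    fn insert(&self, image_id: u64, pixels: Vec<u8>) {
        self.images.lock().unwrap().insert(image_id, pixels);
        self.available.notify_all();
    }

    /// Wait up to `timeout` for `image_id`, cloning the pixels so the lock
    /// is released before the caller copies them into the output buffer.
    fn wait_for(&self, image_id: u64, timeout: Duration) -> Option<Vec<u8>> {
        let mut guard = self.images.lock().unwrap();
        let deadline = Instant::now() + timeout;
        loop {
            if let Some(pixels) = guard.get(&image_id) {
                return Some(pixels.clone()); // clone, then drop the lock
            }
            let remaining = deadline.checked_duration_since(Instant::now())?;
            let (g, res) = self.available.wait_timeout(guard, remaining).unwrap();
            guard = g;
            if res.timed_out() && !guard.contains_key(&image_id) {
                return None;
            }
        }
    }
}

fn main() {
    use std::sync::Arc;
    let dict = Arc::new(GlzDictionary::new());
    let writer = Arc::clone(&dict);
    let h = std::thread::spawn(move || {
        std::thread::sleep(Duration::from_millis(10));
        writer.insert(42, vec![1, 2, 3]);
    });
    // Woken as soon as image 42 lands, instead of polling every 5ms.
    assert_eq!(dict.wait_for(42, Duration::from_millis(200)), Some(vec![1, 2, 3]));
    // A bad reference still fails after the timeout instead of blocking.
    assert_eq!(dict.wait_for(99, Duration::from_millis(10)), None);
    h.join().unwrap();
}
```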
Files to modify¶
- `shakenfist-spice-compression/src/glz.rs` -- new dictionary type, notify-based wait
- `shakenfist-spice-compression/src/lib.rs` -- export `GlzDictionary`
- `ryll/src/channels/display.rs` -- update type alias, notify on insert, update clear/evict calls
Dependency¶
This step changes the dictionary type, so it must be coordinated with step 1a (which adds eviction logic). Both steps touch the same code paths. Order: 1a first (fix correctness), 1b second (improve performance). It is acceptable to combine 1a and 1b into a single commit if the diff is cleaner that way.
Testing¶
- `cargo test --workspace` must pass.
- Manual test: cross-frame references should resolve without visible delay.
- Check logs for timeout warnings (which would indicate the 100ms fallback being hit).
1c. Missing image type stubs¶
Current state¶
The ImageType enum defines 12 types. The display
channel handles 8 of them. The remaining 4 fall through
to a warn! at display.rs:1089-1095:
- `LzPalette` (100) -- paletted LZ images
- `Surface` (104) -- surface-to-surface copy
- `FromCacheLossless` (106) -- lossless cache variant
- `JpegAlpha` (108) -- JPEG with separate alpha plane
Plan¶
Add explicit match arms for each type that log a
descriptive warning and return None. The current
catch-all already does this, but explicit arms:
- Document that we know about these types.
- Allow type-specific log messages (e.g. "LzPalette images require palette data which is not yet implemented").
- Make the `_ =>` arm truly unreachable for known protocol values, catching any future additions.
This is a small, self-contained change.
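The shape of the change can be sketched as below. The variant names and numeric values come from the list above, but the dispatch function is simplified, and `eprintln!` stands in for the real `warn!` macro:

```rust
// Hypothetical sketch of the explicit match arms for the four unsupported
// image types; the eight handled variants are elided.
#[derive(Debug, Clone, Copy)]
enum ImageType {
    LzPalette = 100,
    Surface = 104,
    FromCacheLossless = 106,
    JpegAlpha = 108,
}

fn decode_unsupported(ty: ImageType) -> Option<Vec<u8>> {
    match ty {
        ImageType::LzPalette => {
            eprintln!("warn: LzPalette images require palette data, not yet implemented");
            None
        }
        ImageType::Surface => {
            eprintln!("warn: Surface (surface-to-surface copy) not yet implemented");
            None
        }
        ImageType::FromCacheLossless => {
            eprintln!("warn: FromCacheLossless cache variant not yet implemented");
            None
        }
        ImageType::JpegAlpha => {
            eprintln!("warn: JpegAlpha (JPEG with separate alpha plane) not yet implemented");
            None
        }
    }
}

fn main() {
    assert!(decode_unsupported(ImageType::LzPalette).is_none());
}
```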
Files to modify¶
- `ryll/src/channels/display.rs` -- add four match arms in the image type dispatch (~line 1089)
Testing¶
- `cargo test --workspace` must pass.
- No functional change unless one of these types is actually encountered in a session.
1d. Bounds check in inject_dht¶
Current state¶
inject_dht() (display.rs:58-78) injects cached DHT
(Huffman table) segments into JPEG streams that lack
them. The function walks past APP0/APP1 markers by
reading segment lengths, but does not validate that the
computed segment length stays within bounds:
while i + 3 < jpeg.len()
&& jpeg[i] == 0xFF
&& (jpeg[i + 1] == 0xE0 || jpeg[i + 1] == 0xE1)
{
let seg_len = u16::from_be_bytes(
[jpeg[i + 2], jpeg[i + 3]]
) as usize + 2;
if i + seg_len > jpeg.len() { // this check exists!
break;
}
out.extend_from_slice(&jpeg[i..i + seg_len]);
i += seg_len;
}
Re-reading the code, the bounds check i + seg_len >
jpeg.len() IS present (line 69-71). The original
auto-reviewer flagged this because extract_dht_segments
has the check and the reviewer expected inject_dht
might not -- but it does. The check prevents the panic.
However, there is still a minor issue: seg_len is
computed as u16 + 2, which can be at most 65537. If
the length field contains 0xFFFF, seg_len = 65537
and i + seg_len could overflow on 32-bit platforms
(though ryll targets 64-bit only). This is not a
practical concern but can be tightened with a
saturating_add.
Plan¶
- Add `saturating_add` for the `seg_len` computation. Replace `as usize + 2` with `.saturating_add(2)`. This is a one-line defensive change.
- Add a brief comment noting that this matches the bounds checking in `extract_dht_segments`.
This is minimal -- the code is already correct for all practical inputs.
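The change, isolated into a hypothetical helper (`segment_len` is a name for illustration; the real code computes this inline):

```rust
// Minimal sketch of the defensive seg_len computation. On 64-bit targets the
// old `as usize + 2` form cannot overflow, but saturating_add makes the
// worst case (0xFFFF + 2 = 65537) explicit and safe on any target width.
fn segment_len(len_bytes: [u8; 2]) -> usize {
    (u16::from_be_bytes(len_bytes) as usize).saturating_add(2)
}

fn main() {
    assert_eq!(segment_len([0x00, 0x10]), 18);
    assert_eq!(segment_len([0xFF, 0xFF]), 65537);
}
```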
Files to modify¶
- `ryll/src/channels/display.rs` -- line 68
Testing¶
- `cargo test --workspace` must pass.
- Phase 6 of the master plan will add dedicated JPEG parsing tests; this phase does not add tests.
Administration¶
Commit sequence¶
Four commits, one per sub-task:
- 1a: GLZ dictionary management fixes (eviction, CACHE_ME, invalidation)
- 1b: GLZ retry loop → notify-based wake
- 1c: Explicit match arms for unsupported image types
- 1d: `saturating_add` in `inject_dht`
Steps 1c and 1d are independent and can be done in any order. Steps 1a and 1b are ordered (1a first).
Back brief¶
Before executing any step of this plan, please back brief the operator as to your understanding of the plan and how the work you intend to do aligns with that plan.