Phase 8: OpenStack CI lane disposition + oVirt provisioning flake
Part of PLAN-test-harness.md. The disposition
decision and documentation land in kerbside; the root-cause fix lands
in shakenfist/actions. Per the master plan's single-home rule, the
plan file lives here in docs/plans/.
Goal
The master plan framed phase 8 as deciding the fate of the heavyweight "Test cloud compatibility" CI lane (oVirt + kolla-OpenStack) now that the direct-qemu lane covers the spine. That decision is now made (see Decisions): keep the lane per-PR, both legs blocking, for now.
With the disposition settled as status quo, the substance of phase 8 becomes making that per-PR gate trustworthy, because a gate the team merges past is worse than no gate. Two concrete problems block that:
- The oVirt leg has a real, intermittent provisioning flake. On pull_request runs (the actual merge gate) the oVirt leg failed genuinely on 2026-06-11 with "ssh: connect to host 10.0.2.2 port 22: Connection refused" immediately after the provisioning playbook's SSH wait had already passed. Root cause confirmed against the code (see Situation): the wait gate checks only that port 22 is listening and presents an OpenSSH banner. It does not wait for cloud-init to finish, so sshd's host-key-regen restart (or a not-yet-ready guest on a busy hypervisor) drops the next real SSH. This is the bug that already trained us to merge past red once (the phase 6→7 smoke-client version-pin failure merged on a red lane).
- The lane shows spurious oVirt red on workflow_dispatch runs. Manual/develop-branch dispatches default target to kolla-only, and the oVirt job's "Filter workflow_dispatch runs" step calls core.setFailed('Target skipped'), marking the deliberately-skipped target as a failure rather than a clean skip. Its always() artifact step then times out SSHing a guest that was never provisioned. This is the cosmetic red that made the lane look broken on develop and obscured the real PR-gate health.
This phase is scope-bounded to:
- A root-cause fix to the oVirt (and kolla, same playbook) provisioning readiness gate in shakenfist/actions.
- A GitHub issue in shakenfist/actions tracking the flake and its diagnosis (none exists today; the plan template now asks us to file one).
- A workflow fix in kerbside so unselected workflow_dispatch targets skip cleanly instead of reporting red.
- A documented disposition record (keep per-PR; the criteria under which we would later demote to a schedule) and the master-plan status bump.
Out of scope for phase 8:
- Demoting the lane to a schedule. Explicitly decided against for now (see Decisions). The revisit criteria are recorded so a future phase can pick it up.
- Retiring the oVirt or kolla leg. Both stay per-PR and blocking.
- Any change to the direct-qemu lane. It is green and untouched.
- A second runner shape or a CI matrix expansion.
- Broader shakenfist/actions provisioning rework. The fix is surgical: the readiness gate, nothing else in the provisioning flow.
- Deduplicating the oVirt vs kolla provisioning paths. The kolla leg provisions via setup-kerbside-environment and the oVirt leg via a direct kerbside-single-node.yml invocation; both share the same wait gate, so the fix covers both, but unifying the two entry points is separate work.
Decisions baked into this plan
These are judgment calls made while drafting, surfaced explicitly so they can be challenged before code lands.
- Disposition: keep the cloud-compat lane per-PR, both legs blocking, for now. The operator's call (2026-06-20): Kerbside PRs are frequent and getting Nova reliable with Kerbside is the primary focus, so the per-PR Nova coverage is worth its cost. A future demotion to a schedule is foreseeable but not yet. Phase 8 therefore makes no cadence change; it records the decision and the revisit criteria.
- Revisit-to-schedule criteria (for a future phase, not now): demote the heavy legs to nightly + workflow_dispatch when any of the following holds: the Nova+Kerbside integration has been stable for a sustained period and the per-PR signal stops catching regressions; PR volume drops enough that per-PR cost outweighs value; or runner capacity becomes a binding constraint. Recorded in the master plan's Future work.
- Root-cause the flake rather than route around it. The operator chose this over making oVirt non-blocking or accepting the red. The diagnosis is confirmed and the fix is small, so this is the right trade: fix the actual bug, keep the gate honest.
- The fix is wait_for_connection + cloud-init status --wait, not a bigger wait_for timeout. Bumping the existing wait_for timeout would not help: the wait already passes (the banner is present) and the failure happens after it, when sshd bounces during cloud-init. The correct gate waits for cloud-init to actually finish. See the fix sketch in the Execution table.
- Keep the cheap banner wait_for as an early gate; add the real readiness gate after it. The existing wait_for port:22 search_regex:OpenSSH is harmless and gives a fast early failure if the guest never boots at all. Rather than replace it, add a follow-on readiness play targeting the provisioned hosts. Minimal blast radius.
- The cosmetic target-skip becomes a job-level if:, not a setFailed step. Replacing core.setFailed('Target skipped') with a job-level condition (github.event_name != 'workflow_dispatch' || contains(inputs.target, matrix.test.name)) makes an unselected target skip (neutral grey) instead of fail (red). The always() artifact step then never runs against a non-existent guest. As implemented, the target name in that condition is written as a literal, because the matrix context is unavailable in a job-level if: (see Bugs fixed during this work).
- Verification of the provisioning fix is CI-based and may need iteration. The SF fabric provisioning cannot be exercised locally, and the flake is intermittent (~10% on the pre-fix gate), so a single green run does not prove the fix. Confidence comes from the mechanism (cloud-init status --wait closes the exact window that was failing) plus several consecutive green oVirt runs. Like phase 5, CI iteration is part of finishing the step, not a new phase.
- Fable is not used in this phase. Phase 7 was the deliberate Fable experiment. Phase 8's steps are a well-understood ansible fix, a small workflow edit, and documentation; opus and sonnet cover them. Noted so the model choice is a decision, not an omission.
Situation
What the cloud-compat lane is, today
.github/workflows/functional-tests.yml ("Test cloud compatibility")
runs three jobs, triggered on pull_request to develop and on
workflow_dispatch:
- sanity_checks: lint, unit tests, coverage. Fast, reliable.
- ovirt_matrix ("oVirt 4.5 on Rocky 8"): provisions a Rocky 8 guest on the SF fabric, installs ovirt-engine, creates a SPICE test target, and checks console connectivity.
- openstack_matrix ("OpenStack via Kolla-Ansible master on Debian 12"): provisions a guest, builds kolla images, deploys all-in-one OpenStack, and runs the kerbside tempest plugin.
The kolla leg was red until shakenfist/kerbside-patches PR #1306
merged (2026-06-12); since then it has been green on every pull_request
run. The oVirt leg is the remaining trouble.
Measured lane health (as of 2026-06-20)
Distinguishing trigger type matters and was initially missed:
- On pull_request runs (the real merge gate), oVirt was green on roughly 12 of the last ~14 runs, with two genuine failures on 2026-06-11 and none since 2026-06-13.
- On workflow_dispatch runs (manual, develop branch), the oVirt job is reported as failed on every run, but only cosmetically: those dispatches default target to kolla-only, and the oVirt job's filter step calls core.setFailed('Target skipped'). This is not the infra flake; it is the workflow marking a deliberately-skipped target as a failure.
So the genuine flake is real but low-rate (~10% on the gate, none in the last week), and the dramatic "always red on develop" appearance was the cosmetic target-skip.
Root cause of the genuine oVirt flake (confirmed)
The provisioning playbook shakenfist/actions
ansible/kerbside-single-node.yml ends with:

    - name: Wait for all instances to present an "OpenSSH" prompt
      wait_for:
        port: 22
        host: "{{hostvars[item]['ansible_ssh_host']}}"
        search_regex: OpenSSH
        delay: 60
        timeout: 300
      with_items: "{{ groups['allsf'] }}"

This waits only until port 22 is open and emits an OpenSSH banner. It
does not authenticate and does not wait for cloud-init to
finish. The instance is created (in kerbside-create-instance.yml) with
await: true (SF agent up), added to the allsf group with
ansible_ssh_user: cloud-user, and the runner then SSHes to it directly
from the kerbside workflow ("Prepare /srv on target" and later steps).
Primary evidence, oVirt job on run 27335927162 (2026-06-11):
- 10:10:08 TASK [Wait for all instances to present an "OpenSSH" prompt]: the wait passed (the playbook step "Build infrastructure" did not fail; only later steps did, and the timing is far too early for a 300 s wait timeout)
- 10:11:56 "ssh: connect to host 10.0.2.2 port 22: Connection refused" at the next workflow step ("Prepare /srv on target")
"Connection refused" (not "timed out") means the host was routable
but nothing was listening on port 22 at that instant — sshd had bounced.
The signature is cloud-init: sshd starts and presents a banner early
(the wait_for passes), then cloud-init regenerates host keys and
restarts sshd, and the runner's next real SSH lands in that window. On a
busier hypervisor the gap between "banner present" and "cloud-init done"
widens, which is why the failure rate tracks fabric load.
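
The readiness gate this diagnosis calls for can be sketched as a follow-on play. This is a minimal sketch, not the merged fix: it assumes the play is appended after the banner wait, that the allsf hosts carry the connection variables set in kerbside-create-instance.yml, and that cloud-init status --wait needs become: true on the image (an open question in this plan). The failed_when: false setting mirrors the note under Bugs fixed during this work.

```yaml
# Hypothetical follow-on play; task names and placement are assumptions.
- hosts: allsf
  # Fact gathering would attempt SSH before the guest is ready, so skip it.
  gather_facts: false
  tasks:
    - name: Wait for a real, authenticated SSH connection
      ansible.builtin.wait_for_connection:
        delay: 0
        timeout: 300

    - name: Block until cloud-init reports it has finished
      ansible.builtin.command: cloud-init status --wait
      become: true          # assumption: drop if cloud-user can run it
      changed_when: false   # a status query never changes the guest
      failed_when: false    # wait only; do not assert cloud-init's exit code
```

Unlike the banner check, wait_for_connection keeps retrying a full login until the timeout, so an sshd bounce during cloud-init delays the play rather than failing the next workflow step.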
The cosmetic target-skip
Both ovirt_matrix and openstack_matrix begin with:

    - name: Filter workflow_dispatch runs
      if: github.event_name == 'workflow_dispatch' && ! contains(inputs.target, matrix.test.name)
      uses: actions/github-script@v7
      with:
        script: |
          core.setFailed('Target skipped')

On a workflow_dispatch run with target set to one matrix entry, the
other job's filter step fails the job. The downstream Gather
artifacts / Collect logs steps run if: always() and then time out
SSHing a guest that was never created. The whole job shows red for a
target that was simply not selected.
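
A job-level condition avoids this, because a job whose if: evaluates to false is skipped outright and none of its steps run, including those marked always(). The fragment below is a rough sketch only: the literal 'ovirt', the runner label, and the placeholder step are assumptions for illustration, not the workflow's real values. A literal is used because the matrix context is not available in a job-level if:.

```yaml
jobs:
  ovirt_matrix:
    # True on every pull_request; on workflow_dispatch, true only when the
    # chosen target names this job. 'ovirt' is a hypothetical target name.
    if: github.event_name != 'workflow_dispatch' || contains(inputs.target, 'ovirt')
    runs-on: ubuntu-latest   # placeholder runner label
    steps:
      - name: Placeholder for the provisioning and test steps
        run: echo "provision and test"
```

On a pull_request event the first clause is already true, so the value of inputs.target never matters there.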
Bug tracker scan (per template guidance)
- shakenfist/kerbside open issues: none relate to the cloud-compat lane, oVirt, or CI flakiness (the open set is workflow-standards #59, security-settings #58, ryll-decoupling #15, install-guide #3).
- shakenfist/actions open issues: none match ssh / cloud-init / wait / flake / provisioning.
So no existing issue tracks this; phase 8 files one (step 8a).
Mission and problem statement
After phase 8:
- shakenfist/actions ansible/kerbside-single-node.yml gates instance readiness on cloud-init completion, not just an open SSH port. The oVirt (and kolla) leg no longer fails with post-wait "connection refused". Confirmed by several consecutive green oVirt pull_request runs.
- A shakenfist/actions issue records the diagnosis and links the fix.
- kerbside .github/workflows/functional-tests.yml skips unselected workflow_dispatch targets cleanly (neutral), so manual/develop runs no longer show spurious oVirt red.
- The master plan carries the disposition decision (keep per-PR; revisit criteria) and the phase 8 row reads "Implementation complete; PR pending operator".
- pre-commit run --all-files and actionlint clean on the kerbside commits; ansible-lint / yamllint clean (as the actions repo requires) on the provisioning commit.
Open questions
These do not block writing this plan but must be resolved during implementation:
- Is cloud-init present and is cloud-init status --wait available to cloud-user on both base images (Rocky 8 for oVirt, Debian 12 for kolla)? Expected yes on standard cloud images; verify in step 8b, and if --wait needs root, run it with become: true.
- Does wait_for_connection work against the allsf hosts as added to the inventory (it uses the ansible_ssh_* facts set in kerbside-create-instance.yml)? Expected yes; confirm the new readiness play targets hosts: allsf and inherits those connection vars.
- Should the kolla leg's setup-kerbside-environment path also be audited? The kolla leg provisions through that composite action, not a bare kerbside-single-node.yml call. Confirm whether it routes through the same playbook (and therefore the same wait gate) or has its own wait; if separate, the fix may need mirroring. Resolve early in step 8b.
- How many consecutive green oVirt runs constitute "fixed" given the ~10% pre-fix rate? Provisional bar: the mechanism plus ≥3 consecutive green oVirt pull_request runs. The operator may want more.
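
As a rough guide to that last question, the arithmetic below estimates how often an unfixed gate would pass n runs in a row anyway. It assumes a per-run failure probability of p = 0.1 and independent runs; both are simplifications, since the measured rate is approximate and the failure rate tracks fabric load.

```latex
% Probability that an UNFIXED gate still shows n consecutive green runs,
% assuming independent runs with failure probability p = 0.1.
P(n \text{ green} \mid \text{unfixed}) = (1 - p)^{n}

% The provisional bar of three runs:
(0.9)^{3} = 0.729

% Runs needed before this probability drops below 5%:
n \ge \frac{\ln 0.05}{\ln 0.9} \approx 28.4 \quad\Rightarrow\quad n = 29
```

Under these assumptions three green runs would occur about 73% of the time even with no fix in place, so the run count alone is weak evidence and the mechanism argument carries most of the weight.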
Execution
Each step is one logical change. Kerbside commits land on the branch
test-harness-phase-8 (this plan file travels with them). The
shakenfist/actions fix lands as its own commit in that repo; per the
master plan, the two repos do not share git operations. The operator
opens all PRs.
| Step | Repo | Effort | Model | Isolation | Brief for sub-agent |
|---|---|---|---|---|---|
| 8a. File the tracking issue | shakenfist/actions | low | (management session) | none | Not a sub-agent coding task — the management session files a GitHub issue on shakenfist/actions titled for the oVirt provisioning readiness flake. Body: the confirmed diagnosis (the wait_for port:22 search_regex:OpenSSH gate in ansible/kerbside-single-node.yml returns as soon as sshd presents a banner, before cloud-init finishes; sshd's host-key-regen restart then drops the next real SSH with "connection refused"; rate tracks hypervisor load). Cite the 2026-06-11 evidence (run 27335927162: wait passed at 10:10:08, "connection refused" at 10:11:56 on the next step). State the intended fix (cloud-init readiness gate) so the issue and the fix PR cross-link. Mark it found during kerbside test-harness phase 8. |
| 8b. Fix the provisioning readiness gate | shakenfist/actions | medium | opus | worktree | Requires a shakenfist/actions checkout (the operator must confirm one is available; clone if not). In ansible/kerbside-single-node.yml, after the existing Wait for all instances to present an "OpenSSH" prompt task, add a readiness gate that waits for cloud-init to finish, not just for the SSH port. Recommended shape — a follow-on play (or delegated tasks) targeting hosts: allsf so it uses the per-instance ansible_ssh_host / ansible_ssh_user: cloud-user / key facts set in kerbside-create-instance.yml: wait_for_connection: {delay: 0, timeout: 300} to establish a real authenticated SSH connection (retries through an sshd bounce), then command: cloud-init status --wait (changed_when: false, become: true if --wait needs root on the image) to block until cloud-init is done. Keep the existing wait_for banner check as the cheap early gate. First, verify the open questions: that the kolla leg (setup-kerbside-environment) routes through the same playbook/gate (mirror the fix if not), that cloud-init is present on both Rocky 8 and Debian 12 base images, and that cloud-init status --wait is available to the SSH user. Do NOT broaden scope to other provisioning tasks. Verify locally with ansible-lint and yamllint (or whatever the actions repo's pre-commit runs) and a --syntax-check; note in the commit body that full verification is CI-only (the SF fabric cannot be exercised locally) and that the real proof is consecutive green oVirt runs. One commit. The operator opens the actions PR; expect to watch several oVirt runs before declaring it fixed. |
| 8c. Fix the cosmetic target-skip false-red | kerbside | low | sonnet | none | On test-harness-phase-8. In .github/workflows/functional-tests.yml, for BOTH the ovirt_matrix and openstack_matrix jobs: remove the "Filter workflow_dispatch runs" step (the core.setFailed('Target skipped') github-script step) and instead add a job-level condition so unselected workflow_dispatch targets skip cleanly: if: github.event_name != 'workflow_dispatch' || contains(inputs.target, matrix.test.name). This makes a non-selected target neutral-skip (grey) rather than red, and stops the always() artifact/log steps from running against a guest that was never provisioned. Confirm the jobs still run on every pull_request (the condition is true when the event is not workflow_dispatch) and that a workflow_dispatch with a given target runs only the matching job. Verify with actionlint and pre-commit run --all-files. One commit. |
| 8d. Disposition record + docs + status | kerbside | low | sonnet | none | On test-harness-phase-8. Update docs/plans/PLAN-test-harness.md: set the phase 8 row status to "Implementation complete; PR pending operator"; in Future work, record the disposition decision (keep the cloud-compat lane per-PR, both legs blocking, for now) and the revisit-to-schedule criteria (Nova+Kerbside integration stable and per-PR signal no longer catching regressions; PR volume drops; or runner capacity becomes constraining); add the two bugs to the master plan's "Bugs fixed during this work" (the oVirt readiness-gate flake fixed in shakenfist/actions, and the cosmetic target-skip false-red). If README.md / AGENTS.md / ARCHITECTURE.md describe the cloud-compat lane's reliability or trigger behaviour, add a one-line note that unselected workflow_dispatch targets skip cleanly and that instance readiness now gates on cloud-init. pre-commit run --all-files clean. One commit. |
Sequencing notes
- 8a (file the issue) first, so 8b's fix PR can reference it.
- 8b is the long pole: the real fix, cross-repo, CI-verified, likely several oVirt runs before it is trusted. It is independent of 8c/8d and can proceed in parallel with them.
- 8c and 8d are small kerbside-side changes; they can land before 8b is confirmed green, since they do not depend on the provisioning fix.
- The operator opens the shakenfist/actions PR (8b) and the kerbside PR (8c+8d). The phase is "done" once the actions fix is merged and the oVirt leg has been green across several consecutive pull_request runs.
Branch and PR shape
- kerbside: new branch test-harness-phase-8 from develop. Steps 8c and 8d land here, plus this plan file. One PR.
- shakenfist/actions: a single commit on its own branch for step 8b. Separate PR. The plan file stays in kerbside per the single-home rule.
Agent guidance
This phase plan follows the conventions in PLAN-TEMPLATE.md at the
kerbside repo root. The execution model, effort levels, brief-writing
standards, and management-session review checklist apply unchanged.
Notes specific to phase 8:
- The fix target is shakenfist/actions, a third repo. As with the ryll phases, the management session must confirm the operator has that repo checked out and brief the sub-agent that the plan lives in shakenfist/kerbside/docs/plans/ even though the commit lands elsewhere. Use a worktree for 8b so the experimental ansible change is isolated.
- Do not "fix" the flake by enlarging the existing wait_for timeout. The wait already passes; the failure is after it. Anyone reaching for a bigger timeout has misread the root cause; point them at the Situation section.
- The provisioning fix cannot be proven locally. The SF fabric is not reproducible on the dev host. Verification is ansible-lint / syntax-check plus CI observation across several runs. Resist any sub-agent claiming the fix is "verified" off a single green run, given the ~10% pre-fix rate.
- Keep the cosmetic and the real fix in separate commits and repos. The target-skip edit (kerbside) and the readiness gate (actions) are unrelated causes that happened to both show as "oVirt red"; do not conflate them.
- Do not change the lane's cadence. The disposition is keep-per-PR. No schedule: trigger, no demotion. If a sub-agent proposes one, that is out of scope; it belongs to a future phase guided by the recorded revisit criteria.
Back brief
Before executing any step of this plan, please back brief the operator as to your understanding of the step and how the work you intend to do aligns with that step's brief.
Administration and logistics
Success criteria
Phase 8 is done when:
- shakenfist/actions ansible/kerbside-single-node.yml gates on cloud-init completion (wait_for_connection + cloud-init status --wait, or equivalent) after the existing banner wait, and the change is ansible-lint / yamllint clean.
- A shakenfist/actions issue records the diagnosis and is linked from the fix PR.
- The oVirt pull_request leg is green across several consecutive runs after the fix merges (provisional bar: ≥3).
- kerbside .github/workflows/functional-tests.yml skips unselected workflow_dispatch targets cleanly; actionlint and pre-commit run --all-files are clean.
- The master plan records the disposition decision and revisit criteria, lists both fixed bugs, and marks the phase 8 row "Implementation complete; PR pending operator".
Future work
Items deliberately deferred from phase 8:
- Demote the heavy legs to a schedule. Recorded in the master plan's Future work with its revisit criteria. Not now.
- Unify the oVirt and kolla provisioning entry points. They share the wait gate but enter through different paths (kerbside-single-node.yml directly vs setup-kerbside-environment); consolidating is separate work.
- Retry wrappers on the kerbside workflow's direct SSH steps. A belt-and-braces complement to the readiness gate. If the gate fix proves insufficient, add ssh ... || retry backoff to the "Prepare /srv on target" and sibling steps. Not needed if 8b closes the window.
- Audit other shakenfist/actions playbooks for the same port-only wait pattern. The multi-node playbooks likely share it.
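
If the retry wrapper is ever needed, it could look roughly like the step below. This is an illustrative sketch only: the remote command is a placeholder rather than the workflow's real one, and the attempt count and delays are arbitrary assumptions. The user and address are the ones quoted in the Situation section.

```yaml
# Hypothetical hardened form of a direct SSH step.
- name: Prepare /srv on target
  shell: bash
  run: |
    # Four attempts, sleeping 0, 10, 20, then 40 seconds before each one.
    for delay in 0 10 20 40; do
      sleep "${delay}"
      # ConnectTimeout bounds each try; the remote command is a placeholder.
      if ssh -o ConnectTimeout=10 cloud-user@10.0.2.2 'sudo mkdir -p /srv'; then
        exit 0
      fi
      echo "ssh attempt failed after a ${delay}s delay; retrying"
    done
    exit 1
```

A wrapper like this only masks the window that the readiness gate is meant to close, which is why it is deferred rather than part of step 8b.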
Bugs fixed during this work
- oVirt provisioning readiness gate waited for an open SSH port, not cloud-init completion (shakenfist/actions kerbside-single-node.yml). The wait_for port:22 search_regex:OpenSSH task returned on the sshd banner; cloud-init then restarted sshd (host-key regen) and the runner's next real SSH was refused. Fixed by adding a wait_for_connection + cloud-init status --wait readiness gate (with failed_when: false so it waits without asserting cloud-init's exit code). Confirmed by inspection that the kolla all-in-one leg routes through the same playbook, so one fix covers both legs; the multi-node playbooks carry the same pattern and are logged as future work. Tracked as shakenfist/actions issue #2.
- Cloud-compat lane reported unselected workflow_dispatch targets as red (kerbside functional-tests.yml). The core.setFailed('Target skipped') filter step marked a deliberately-unselected matrix target as failed, and the always() artifact steps then timed out against a never-provisioned guest. Fixed by replacing the step with a job-level if: so unselected targets skip cleanly (the target name is a literal because the matrix context is unavailable in a job-level if:).