flake: CI required checks aggregator fails when gen job is cancelled #1081

@ibetitsmike


CI Run Link: https://github.com/coder/coder/actions/runs/18689201542

Branch: main
Commit: 86f0f39863a27040acd17dd6bc354cc6a430df7c (Steven Masley) — coder/coder@86f0f39

Summary:

  • The "required" aggregator job failed because it detected a cancelled required check: gen.
  • There were no failing tests; most test jobs succeeded. The gen job shows an explicit cancellation.

Evidence:

  • required job log shows cancelled check and exits 1:
Checking required checks
- changes: success
- fmt: success
- lint: success
- gen: cancelled
- test-go-pg: success
- test-go-pg-17: success
- test-go-race-pg: success
- test-js: success
- test-e2e: success
- offlinedocs: success
- check-build: skipped

One of the required checks has failed or has been cancelled
##[error]Process completed with exit code 1.
  • gen job log shows cancellation (no test failure):
##[error]The operation was canceled.
Post job cleanup.
Terminate orphan process: pid (...) (make)

Classification: Infrastructure/CI pipeline

  • Root cause: The required checks aggregator treats cancelled jobs as failures. The gen job was cancelled mid-run, which caused the aggregator to fail and a Slack alert to be sent, even though all test suites passed.
  • Not a test flake, not a data race, and no panic/OOM found.
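To make the failure mode concrete, here is a minimal Python model of the behaviour seen in the required job log. It is only a sketch: the real aggregator lives in .github/workflows/ci.yaml and was not inspected here, so the function name `aggregate` and the rule "fail on failure or cancelled, tolerate skipped" are assumptions inferred from the log output above.

```python
# Hypothetical model of the "required" aggregator, inferred from the job log.
# This is NOT the script in .github/workflows/ci.yaml.

# Job results exactly as reported in the failing run's log.
RESULTS = {
    "changes": "success",
    "fmt": "success",
    "lint": "success",
    "gen": "cancelled",
    "test-go-pg": "success",
    "test-go-pg-17": "success",
    "test-go-race-pg": "success",
    "test-js": "success",
    "test-e2e": "success",
    "offlinedocs": "success",
    "check-build": "skipped",
}


def aggregate(results):
    """Return (exit_code, log_lines) for a mapping of job name -> result."""
    lines = ["Checking required checks"]
    for name, result in results.items():
        lines.append(f"- {name}: {result}")
    # Assumption: "failure" and "cancelled" both fail the run; "skipped" does not.
    if any(result in ("failure", "cancelled") for result in results.values()):
        lines.append("")
        lines.append("One of the required checks has failed or has been cancelled")
        return 1, lines
    return 0, lines


if __name__ == "__main__":
    exit_code, log_lines = aggregate(RESULTS)
    print("\n".join(log_lines))
    print(f"exit code: {exit_code}")
```

Under this model the single `gen: cancelled` entry is enough to produce exit code 1, regardless of how many test jobs succeeded.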

Timing verification:

  • required job failed at 2025-10-21T15:45:36Z, matching the Slack notification window for this run.

Duplicates search (last 30 days & historical keywords):

  • Queries used in coder/internal:
    • "One of the required checks has failed or has been cancelled"
    • gen cancelled required checks
    • ci required checks cancelled
    • required checks aggregator
  • Closest prior issue: flake: CI failure in main - gen job (pnpm setup) and build job (Java setup) #929 (different root cause: network timeouts in setup steps). No active duplicate found for cancelled gen causing required to fail.

Assignment analysis (component ownership):

  • This is owned by CI workflow maintainers. Recent ownership signals for .github/workflows/ci.yaml:
    • ci: make changes required (#20131) — Ethan
    • ci: make test-go-pg-17 a required check (#19722) — Ethan
    • ci: ping slack on ci failures / prompt changes (#19835, #19435) — Ethan
    • Recent edits also by Michael for Slack agent wiring (#20379)
  • Because the recent substantive changes that added and tuned the required checks mechanism were made by Ethan, assigning to @ethanndickson for triage.

Recommendations:

  • Consider allowing specific jobs (e.g., gen) to be treated as neutral on cancellation, or gate the aggregator until all required jobs are completed/succeeded/explicitly skipped.
  • Investigate why gen was cancelled mid-run within the same workflow (concurrency/cancel-in-progress or runner preemption). If expected, update the aggregator’s logic; if not, address job stability.
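The first recommendation could look roughly like the sketch below, which models an aggregator that treats cancellation as neutral for an allow-list of jobs. The function `required_exit_code` and the `neutral_on_cancel` parameter are hypothetical names for illustration, and the choice of gen as the only neutral job simply mirrors the example in the recommendation; none of this reflects the current ci.yaml.

```python
# Hypothetical variant of the aggregator: cancellation is neutral for selected jobs.
# Names and policy are illustrative assumptions, not the existing ci.yaml logic.

# Job results from the failing run.
OBSERVED = {
    "changes": "success",
    "fmt": "success",
    "lint": "success",
    "gen": "cancelled",
    "test-go-pg": "success",
    "test-go-pg-17": "success",
    "test-go-race-pg": "success",
    "test-js": "success",
    "test-e2e": "success",
    "offlinedocs": "success",
    "check-build": "skipped",
}


def required_exit_code(results, neutral_on_cancel=frozenset({"gen"})):
    """Return 1 if any required job failed, or was cancelled without being allow-listed."""
    for name, result in results.items():
        if result == "failure":
            return 1  # a real failure always fails the aggregator
        if result == "cancelled" and name not in neutral_on_cancel:
            return 1  # cancellation still fails for jobs outside the allow-list
    return 0


if __name__ == "__main__":
    print(required_exit_code(OBSERVED))                      # cancelled gen is tolerated
    print(required_exit_code({**OBSERVED, "gen": "failure"}))  # failed gen is not
```

One trade-off to weigh: with this policy a run can pass without gen having finished, so it is only appropriate if cancellation of gen is an expected outcome rather than a symptom of runner instability.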

Repro/Next steps:

  • Review .github/workflows/ci.yaml: required job logic for handling cancelled jobs.
  • Check workflow concurrency settings and any conditions that could cancel gen.
  • If cancellations are expected in some paths, adjust required to ignore those paths or re-run gen automatically.
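As a starting point for the concurrency check, a small script can list where a workflow file sets `concurrency` or `cancel-in-progress`, the GitHub Actions keys that can cause an in-progress run to be cancelled. The workflow excerpt below is invented for illustration and is not the contents of coder/coder's ci.yaml; the helper name `find_cancellation_settings` is likewise hypothetical.

```python
import re

# Hypothetical workflow excerpt for illustration only; NOT the real ci.yaml.
SAMPLE_WORKFLOW = """\
name: ci
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
jobs:
  gen:
    runs-on: ubuntu-latest
"""

# Keys that control whether an in-progress run or job may be cancelled.
KEY_PATTERN = re.compile(r"^\s*(concurrency|cancel-in-progress)\s*:")


def find_cancellation_settings(text):
    """Return (line_number, key, stripped_line) for each matching workflow line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = KEY_PATTERN.match(line)
        if match:
            hits.append((number, match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    for number, key, line in find_cancellation_settings(SAMPLE_WORKFLOW):
        print(f"line {number}: {line}")
```

To run it against the real file, read .github/workflows/ci.yaml into a string and pass it to `find_cancellation_settings`. A line scan like this only locates the settings; whether they apply to main still has to be judged by reading the expressions.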

Quality Checklist:

  • Identified actual failing job and validated timing
  • Downloaded and reviewed failing job logs (required) and cancelled job logs (gen)
  • Searched coder/internal for duplicates with multiple query patterns
  • Classified as Infrastructure, not a test/data race/process crash
  • Assignment based on component ownership history, not PR/commit author of the failing run
