openova/clusters
e3mrah 3a728eb36c
fix(bootstrap-kit): bp-catalyst-platform dependsOn + remediation hardening (chart-roll-rca PR-1+3) (#1302)
Closes the 90-min chart-roll wedge observed on omantel.biz provision #6
(2026-05-10) where bp-catalyst-platform 1.4.128 sat in `pending-upgrade`
for ~90 min until manual reconcile + Bucket-A kubectl-apply unblocked it.

Root cause (chart-roll-rca-iter15.md): bp-catalyst-platform's dependsOn
omits bp-crossplane-claims, the chart that owns the
access.openova.io/v1alpha1 XRD. Slot 13 races slot 14, qa-fixtures
UserAccess CRs hit admission before the XRD is registered, Helm rejects
the manifests with `no matches for kind "UserAccess" in version
"access.openova.io/v1alpha1"`, the release Secret enters
pending-upgrade, and the install/upgrade blocks' 25m timeout x 3
remediation.retries ceiling allows the wedge to compound for ~75 min
worst case before any operator-visible failure.

PR-1 (CRITICAL) - bp-catalyst-platform dependsOn += bp-crossplane-claims:
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
- Adds the missing edge so Flux blocks the umbrella install until the
  XRD is live. Eliminates the race entirely on a fresh roll.

PR-3 - install/upgrade remediation hardening:
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
- Adds `cleanupOnFail: true` to both install and upgrade blocks (purges
  partial release artifacts on retry).
- Adds `remediation.strategy: rollback` and `remediateLastFailure: true`
  (rollback to last good release before retrying, instead of pinning
  the release Secret at `pending-upgrade` for the full timeout).
- Reduces `timeout: 25m` -> `15m` (with PR-1 fixed, 25m is overkill;
  faster failure = faster automatic recovery).
- Net: failed-then-recoverable upgrade collapses from ~75 min worst case
  to ~15 min worst case.

PR-2 (defense-in-depth) - APIVersions.Has gate on UserAccess templates:
- products/catalyst/chart/templates/qa-fixtures/useraccess-qa-user1.yaml
- products/catalyst/chart/templates/qa-fixtures/useraccess-qa-test-tiers.yaml
- Wraps the gating `if` with `(.Capabilities.APIVersions.Has
  "access.openova.io/v1alpha1/UserAccess")`. If the XRD is not yet
  registered the manifest evaluates to empty bytes, eliminating the
  admission-rejection class of chart-roll wedges even if dependsOn
  ordering breaks again.

Acceptance test (next fresh provision, e.g., provision #7):
- `kubectl get hr -n flux-system bp-catalyst-platform` reaches
  Ready=True on the FIRST install action (no `pending-upgrade`).
- Chart roll completes in <15 min, zero-touch (no manual
  `flux reconcile`, no Bucket-A kubectl-apply).
- `kubectl get useraccess -A` shows qa-user1 + 5 qa-test-{tier} CRs
  without operator intervention.

Refs: chart-roll-rca-iter15.md (PR-1, PR-2, PR-3 sections).

Co-authored-by: e3mrah <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:20:29 +04:00
..
_template fix(bootstrap-kit): bp-catalyst-platform dependsOn + remediation hardening (chart-roll-rca PR-1+3) (#1302) 2026-05-10 20:20:29 +04:00
contabo-mkt/tenants provision: deploy tenant e2e-wp-test (plan: m, apps: 1) 2026-05-06 02:23:14 +04:00
omantel.omani.works fix(bp-cilium): disable kubeProxyReplacement (DNS pathology unblock) (#1283) 2026-05-10 15:10:58 +04:00
otech.omani.works fix(bp-trivy): node-collector tolerates control-plane taint (closes #769) (#772) 2026-05-04 17:38:29 +02:00