fix(bp-keycloak): bump keycloak-config-cli hook timeouts (#129)

Fresh-Sovereign provision #15 (otech 0ad3687ddd72deb7) wedged at
phase1-watching for 30+ min: bp-keycloak HelmRelease failed with
`post-upgrade hooks failed: timed out waiting for the condition` →
bp-gitea (dependsOn keycloak OIDC) blocked → bp-self-sovereign-cutover
never converged.

Root cause
──────────
The bitnami keycloak subchart's `keycloak-config-cli-job.yaml` is
rendered as a Helm post-install/post-upgrade/post-rollback hook
(default annotations on the Job, weight 5). On a fresh k3s the
realm-import Job fires before Postgres+Liquibase finish bootstrapping
Keycloak (legitimately 3-10 min), and the bitnami subchart defaults
are too tight to absorb that race:

  - keycloakConfigCli.availabilityCheck.timeout="" → keycloak-config-cli
    falls back to its internal ~120s wait for Keycloak's /admin endpoint
  - keycloakConfigCli.backoffLimit: 1 → only 2 Pod attempts total
    before the Job is marked Failed

Both attempts hit the 120s window, Job goes Failed, Helm reports the
post-upgrade hook timed out, HR install/upgrade retries (×3) all hit
the same race, HR remains Failed → downstream blueprints never install.
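
For orientation, the hook wiring described above would look roughly like
the fragment below. This is a sketch reconstructed from this commit's
description, not the subchart's actual rendered template: the Job name is
hypothetical, and only the annotation values and `backoffLimit` are taken
from the text above.

```yaml
# Sketch only: a Job wired as a Helm hook, per the description above.
apiVersion: batch/v1
kind: Job
metadata:
  name: keycloak-keycloak-config-cli   # hypothetical name
  annotations:
    # Helm runs the Job after install, upgrade, and rollback, and waits
    # for it to finish before it reports the release as successful.
    helm.sh/hook: post-install,post-upgrade,post-rollback
    helm.sh/hook-weight: "5"
spec:
  backoffLimit: 1   # old default: one retry, so 2 Pod attempts in total
  # Pod template omitted from this sketch.
```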

Fix
───
Tune the hook's internal timing to fit comfortably inside the parent
HR's 15m install/upgrade timeout while leaving headroom for cold image
pull + Pod scheduling:

  keycloak.keycloakConfigCli.availabilityCheck.timeout: "600s"   (was "")
  keycloak.keycloakConfigCli.backoffLimit:               5        (was 1)

Both knobs remain operator-overridable via per-Sovereign
`valuesFrom` (Inviolable Principle #4: no hardcoding). Per
Inviolable Principle #3 (no workarounds), this does NOT disable the
hook semantics — disabling the hook would break the documented
contract that the realm exists before the HR reaches Ready
(downstream bp-gitea + catalyst-api consume the realm).
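
A per-Sovereign override could be sketched as below, assuming the
HelmRelease uses Flux's `helm.toolkit.fluxcd.io/v2` API. The ConfigMap
name and the override values (300s, 3) are hypothetical and chosen only
for illustration; the two value paths are the ones changed by this commit.

```yaml
# Hypothetical overlay ConfigMap holding per-Sovereign values.
apiVersion: v1
kind: ConfigMap
metadata:
  name: bp-keycloak-overrides   # hypothetical name
data:
  values.yaml: |
    keycloak:
      keycloakConfigCli:
        availabilityCheck:
          timeout: "300s"   # illustrative: shorter wait on a fast substrate
        backoffLimit: 3     # illustrative
---
# Fragment of the bp-keycloak HelmRelease that consumes the overlay.
apiVersion: helm.toolkit.fluxcd.io/v2   # assumed API version
kind: HelmRelease
metadata:
  name: bp-keycloak
spec:
  valuesFrom:
    - kind: ConfigMap
      name: bp-keycloak-overrides   # must match the ConfigMap above
      valuesKey: values.yaml
```

Values supplied through `valuesFrom` are merged over the chart defaults,
so an overlay only needs to list the keys it changes.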

Files
─────
  platform/keycloak/chart/values.yaml           (+53  inline rationale)
  platform/keycloak/chart/Chart.yaml            (1.4.2 → 1.4.3 + changelog)
  clusters/_template/bootstrap-kit/09-keycloak.yaml (HR pin → 1.4.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-11 00:50:31 +02:00
parent 58f518ff3d
commit 9a5cbcd178
3 changed files with 73 additions and 5 deletions

clusters/_template/bootstrap-kit/09-keycloak.yaml

@@ -45,7 +45,12 @@ spec:
# parameter so each Sovereign owns its KC realm named after the tenant
# short-name (omantel chroot → "omantel"). Default `sovereign` is kept
# in the chart for backward compat with overlays not yet migrated.
version: 1.4.2
# 1.4.3 (issue #129): bumps keycloakConfigCli.availabilityCheck.timeout
# "" (~120s fallback) → 600s + backoffLimit 1 → 5. Fixes "post-upgrade
# hooks failed: timed out waiting for the condition" wedge on fresh
# provisions where Postgres+Liquibase bootstrap exceeds config-cli's
# ~120s fallback Keycloak-availability window for the realm-import Job.
version: 1.4.3
sourceRef:
kind: HelmRepository
name: bp-keycloak

platform/keycloak/chart/Chart.yaml

@@ -1,6 +1,6 @@
apiVersion: v2
name: bp-keycloak
version: 1.4.2
version: 1.4.3
description: |
Catalyst-curated Blueprint umbrella chart for Keycloak. Depends on the
upstream `keycloak` chart (bitnami) as a Helm subchart so
@@ -36,6 +36,19 @@ description: |
credentials Secret `realm` key flow from the same value, keeping
Keycloak realm and catalyst-api CATALYST_KC_REALM env in sync. Closes
TC-124, TC-125, TC-159, TC-160, TC-161, TC-176, TC-190, TC-285.
1.4.3 (issue #129, post-upgrade hook timeout wedge): bump bitnami
keycloakConfigCli.availabilityCheck.timeout "" (~120s fallback) → 600s
and backoffLimit 1 → 5. Fresh-Sovereign provision #15 (otech
0ad3687ddd72deb7) wedged at phase1-watching because the
keycloak-config-cli Job (rendered as a Helm post-install/post-upgrade
hook by the bitnami subchart's default annotations) timed out twice
inside its 120s availability window before Postgres+Liquibase finished
bootstrapping Keycloak. Helm reported "post-upgrade hooks failed:
timed out waiting for the condition" → bp-keycloak HR never reached
Ready → bp-gitea (dependsOn keycloak OIDC) blocked → bp-self-sovereign-
cutover never converged. New defaults sit comfortably inside the parent
HR's 15m install/upgrade timeout. Both knobs remain per-Sovereign
overridable via valuesFrom (Inviolable Principle #4: no hardcoding).
type: application
keywords: [catalyst, blueprint, keycloak]
maintainers:

platform/keycloak/chart/values.yaml

@@ -276,13 +276,63 @@ keycloak:
# k3s --oidc-groups-prefix flag)
keycloakConfigCli:
enabled: true
# Run the realm-import Job as a Helm post-install hook so the realm
# is provisioned exactly once per fresh release. Re-run on upgrade
# is idempotent (config-cli reconciles the realm to spec).
# Run the realm-import Job as a Helm post-install/post-upgrade/post-rollback
# hook (bitnami subchart default — annotations defined on the Job template
# via .Values.keycloakConfigCli.annotations: helm.sh/hook = post-install,
# post-upgrade,post-rollback with weight 5). The realm is provisioned
# exactly once per fresh release; re-run on upgrade is idempotent
# (config-cli reconciles the realm to spec).
image:
registry: docker.io
repository: bitnamilegacy/keycloak-config-cli
tag: 6.4.0-debian-12-r11
# ─── Hook timing — issue #129 (post-upgrade hook timeout wedge) ─────────
#
# Root cause history (prov #15, otech 0ad3687ddd72deb7, 2026-05-10):
# Flux's first install of bp-keycloak races against PostgreSQL readiness;
# when keycloak-config-cli's first run-attempt fires before Keycloak's
# admin endpoint is reachable, it counts toward Job retries. Bitnami
# subchart defaults are too tight for a fresh Sovereign:
# - availabilityCheck.timeout="" → keycloak-config-cli falls back to
# its own internal default (~120s) waiting for Keycloak
# - backoffLimit: 1 → only 2 Pod attempts total before Job marked Failed
# On a brand-new k3s where Postgres+Liquibase migrations + Keycloak
# bootstrap legitimately take 3-10+ minutes, both attempts time out
# inside the 120s window → Job Failed → Helm reports
# "post-upgrade hooks failed: timed out waiting for the condition"
# → bp-keycloak HelmRelease never reaches Ready=True → bp-gitea
# (dependsOn keycloak OIDC) never installs → bp-self-sovereign-cutover
# never converges → fresh provision wedges at phase1-watching.
#
# Fix per Inviolable Principle #4 (no hardcoding) — both knobs are
# operator-overridable via per-Sovereign overlay valuesFrom path:
# keycloak.keycloakConfigCli.availabilityCheck.timeout
# keycloak.keycloakConfigCli.backoffLimit
# The chart defaults below are tuned for the documented worst case
# (slow Postgres + cold image-pull); operators with faster substrates
# may shorten them.
#
# Numbers picked to fit inside the parent HR's 15m install/upgrade
# timeout while leaving headroom for Job pod scheduling + image pull:
# - availabilityCheck.timeout 600s (10 min) — keycloak-config-cli
# polls Keycloak's /admin REST endpoint; this caps the single-Pod
# wait. Covers Postgres bootstrap + Liquibase + Keycloak internal
# startup with margin.
# - backoffLimit 5 — Job retries with exponential backoff (cap 6m by
# default). Combined with a 10m availability poll the realistic
# worst case is one Pod hitting the timeout, then a successful
#   retry once Keycloak is Ready.
#
# NOT a workaround per Inviolable Principle #3. The workaround would
# be disabling the hook semantics entirely (set annotations: {}); that
# breaks the documented contract that the realm is imported before
# the Helm release reaches Ready (downstream bp-gitea + catalyst-api
# depend on the realm existing). Tuning the hook's internal timing
# respects the contract.
availabilityCheck:
enabled: true
timeout: "600s"
backoffLimit: 5
# ─── Phase-8b realm ConfigMap (issue #604, #899) ─────────────────────────
# The realm import JSON is now owned by the parent bp-keycloak chart's
# templates/configmap-sovereign-realm.yaml template. That template has