fix(chart,api,controllers,ui): qa-loop iter-11 Fix #45 — three-cluster closeout (#1265)

Cluster-A (bp-guacamole PVC immutability):
  - New pre-install/pre-upgrade Helm hook (Job + per-release SA/Role/
    RoleBinding + cluster-scoped CR/CRB for PV cleanup) that detects
    when an existing `guacamole-recordings` PVC is bound to a
    storageClass different from `.Values.guacamole.recordings.storageClass`
    and deletes the PVC + bound PV so the chart-side PVC manifest can
    recreate cleanly. Closes the live bp-guacamole HelmRelease wedge on
    omantel iter-11 (`PersistentVolumeClaim ... is invalid: spec:
    Forbidden: spec is immutable after creation`).
  - Operator escape hatch: `.Values.guacamole.recordings.allowMigration:
    false` suppresses the hook for Sovereigns with long-lived recording
    state.
  - Render test extended (15 docs total, plus toggle assertion).
  - bp-guacamole chart 0.1.8 → 0.1.9; bootstrap-kit slot pin bumped
    in both _template and omantel.omani.works overlays.
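
The hook's branch order can be sketched as a small decision function. This is an illustrative Go model only: the real hook is a shell script inside a Job, and the names `migrationAction` and `decideMigration` are hypothetical. In the chart, `allowMigration: false` means the Job is not rendered at all; the sketch models that as a "skipped" outcome.

```go
package main

import "fmt"

// migrationAction is a hypothetical label for the hook's possible outcomes.
type migrationAction string

const (
	actionSkipped          migrationAction = "skipped"            // allowMigration=false: hook not rendered
	actionNoopFreshInstall migrationAction = "noop-fresh-install" // no existing PVC
	actionNoopMatching     migrationAction = "noop-matching"      // storageClass already as desired
	actionDeletePVCAndPV   migrationAction = "delete-pvc-and-pv"  // mismatch: delete so the chart can recreate
)

// decideMigration mirrors the order of checks described in the commit
// message. An empty existingSC stands for "PVC not found", which is how
// the hook script treats empty kubectl output.
func decideMigration(allowMigration bool, existingSC, desiredSC string) migrationAction {
	switch {
	case !allowMigration:
		return actionSkipped
	case existingSC == "":
		return actionNoopFreshInstall
	case existingSC == desiredSC:
		return actionNoopMatching
	default:
		return actionDeletePVCAndPV
	}
}

func main() {
	// The omantel iter-11 case: PVC on local-path, chart wants seaweedfs-storage.
	fmt.Println(decideMigration(true, "local-path", "seaweedfs-storage"))
}
```

Only the last branch is destructive, which is why the toggle exists for Sovereigns with recordings worth keeping.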

Cluster-B (Application phase stuck on Provisioning):
  - application-controller now observes the per-region downstream
    HelmRelease.status.conditions[Ready] and rolls up
    Application.status.phase: any region Ready=False → phase=Degraded,
    all regions Ready=True → phase=Ready, otherwise (no HR yet or
    Ready unknown) → phase=Provisioning.
  - Periodic 30s re-list ticker (Run goroutine) so HR readiness flips
    reach the Application even though the Application Watch doesn't
    fire on sibling HR changes.
  - status.lastReconciledAt populated on every reconcile pass for
    TC-113.
  - application-controller ClusterRole gains
    helm.toolkit.fluxcd.io/helmreleases get/list/watch.
  - 3 new unit tests (HR Ready=True → phase=Ready, HR Ready=False →
    phase=Degraded with verbatim message, no-HR → phase=Provisioning).
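
The roll-up precedence used by `observeRegionHelmReleases` in this commit can be sketched without any Kubernetes dependencies. The function name `rollUpPhase` and the convention of passing one Ready status string per region (with "" meaning no HR yet) are assumptions made for illustration.

```go
package main

import "fmt"

const (
	phaseProvisioning = "Provisioning"
	phaseReady        = "Ready"
	phaseDegraded     = "Degraded"
)

// rollUpPhase takes one Ready-condition status per region: "True",
// "False", "Unknown", or "" when the HelmRelease does not exist yet.
// Precedence: any False wins, then all True, otherwise Provisioning.
func rollUpPhase(regionReady []string) string {
	if len(regionReady) == 0 {
		return phaseProvisioning // nothing observed yet
	}
	allReady := true
	for _, s := range regionReady {
		switch s {
		case "False":
			return phaseDegraded
		case "True":
			// keep allReady as is
		default:
			allReady = false // absent or Unknown: Flux still working
		}
	}
	if allReady {
		return phaseReady
	}
	return phaseProvisioning
}

func main() {
	fmt.Println(rollUpPhase([]string{"True", "True"}))
	fmt.Println(rollUpPhase([]string{"True", ""}))
	fmt.Println(rollUpPhase([]string{"True", "False"}))
}
```

A single Ready=True region is not enough for phase=Ready when another region is still absent or unknown; that case stays at Provisioning.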

Cluster-C (SPA AppDetail + k8s services namespace filter):
  - GET /api/v1/sovereigns/{id}/applications/{name} returns full
    Application detail (identity + spec + status). The SPA AppDetail
    page now falls back to this endpoint when wizard store has no
    descriptor for the requested componentId — the typical chroot
    Sovereign case where Apps are installed via `kubectl apply` /
    catalyst-api install endpoint, NOT via the wizard. Without the
    fallback every chroot-installed Application surfaced "App not
    found / The component qa-wp is not part of this deployment"
    even though the underlying CR was Ready=True. Closes TC-068 /
    TC-072 / TC-074 / TC-076 / TC-077 / TC-079 et al.
  - GET /api/v1/sovereigns/{id}/k8s/{kind} accepts BOTH `?ns=`
    (historic) AND `?namespace=` (kubectl/SPA-canonical). Without
    the alias TC-262 / TC-263 returned every namespace's services
    instead of qa-omantel-only. New test covers all 4 query
    permutations.

Chart bumps:
  - bp-catalyst-platform 1.4.116 → 1.4.117 (+ pin in
    clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml).
  - bp-guacamole 0.1.8 → 0.1.9.

Refs: qa-loop iter-11 Fix #45 (Cluster-A + Cluster-B + Cluster-C);
post-merge image SHAs land via the catalyst-api / catalyst-controllers
build workflows + the bp-guacamole / bp-catalyst-platform release
workflows.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-10 07:26:05 +04:00 committed by GitHub
parent fea726233c
commit dfd48b1626
18 changed files with 1282 additions and 34 deletions


@ -337,13 +337,28 @@ spec:
# rebuilt application-controller image (:24aab61) into values.yaml
# — chart 1.4.115 had stale tag (:a3ba200) because GH Actions
# silently filters bot pushes from triggering blueprint-release.
# 1.4.117 (qa-loop iter-11 Fix #45 Cluster-B + Cluster-C):
# B) application-controller now observes downstream HelmRelease
# readiness (helm.toolkit.fluxcd.io/helmreleases get/list/
# watch added to ClusterRole) and rolls up Application
# .status.phase from Provisioning → Ready. Periodic 30s
# re-list ticker so HR readiness flips reach the parent.
# status.lastReconciledAt populated for TC-113.
# C) catalyst-api gains GET /sovereigns/{id}/applications/{name}
# (full Application detail) and accepts ?namespace= alias on
# /sovereigns/{id}/k8s/{kind}. SPA AppDetail.tsx falls back
# to the new GET when wizard store has no descriptor (typical
# chroot Sovereign case — closes the "App not found" misfire
# on TC-068 / TC-072 / TC-074). TC-262 / TC-263 (services
# filtered to qa-omantel only) flip PASS via the namespace
# alias.
# 1.4.109 (Fix #40 follow-up #2): drop /api/v1 from organization +
# environment-controller GITEA URL defaults — Gitea client appends
# it; the prior default produced /api/v1/api/v1/... 404s on
# EnsureOrg / EnsureRepo blocking qa-wp Application reconcile.
# bootstrap-kit qaFixtures.cnpgPairName default qa-cnpg → qa-cnpgpair
# so TC-306's "cnpgpair" substring assertion passes.
version: 1.4.116
version: 1.4.117
sourceRef:
kind: HelmRepository
name: bp-catalyst-platform


@ -81,10 +81,15 @@ spec:
chart:
spec:
chart: bp-guacamole
# 0.1.6: 0.1.5 (added /home/guacamole/.guacamole emptyDir
# mount for readOnlyRootFilesystem compatibility) +
# post-merge CI auto-bump.
version: 0.1.6
# 0.1.9 (qa-loop iter-11 Fix #45 Cluster-A): adds the
# storageClass-migration pre-install/pre-upgrade hook that
# unwedges Sovereigns where the existing `guacamole-recordings`
# PVC is bound to a different storageClass than the chart-desired
# one (live failure on omantel iter-11: PVC bound to local-path,
# chart wanted seaweedfs-storage, K8s rejected the immutable-spec
# patch with `cannot patch ... PersistentVolumeClaim ... is
# invalid: spec: Forbidden: spec is immutable after creation`).
version: 0.1.9
sourceRef:
kind: HelmRepository
name: bp-guacamole


@ -47,9 +47,17 @@ spec:
chart:
spec:
chart: bp-guacamole
# 0.1.6: 0.1.5 (.guacamole home emptyDir mount for
# readOnlyRootFilesystem) + CI auto-bumped values.yaml.
version: 0.1.6
# 0.1.9 (qa-loop iter-11 Fix #45 Cluster-A): storageClass-
# migration pre-install/pre-upgrade hook. The live bp-guacamole HR
# wedged on omantel iter-11 because the existing
# `guacamole-recordings` PVC was bound to local-path while the
# in-cluster HR object had `seaweedfs-storage` (drift between
# the on-disk overlay below and what was already on the API
# server). With this version the chart's pre-upgrade hook reads
# the existing PVC and either no-ops (storageClass matches) or
# deletes the PVC + bound PV so the post-render PVC manifest
# creates cleanly with whatever storageClass is desired.
version: 0.1.9
sourceRef:
kind: HelmRepository
name: bp-guacamole


@ -124,6 +124,27 @@ var (
Version: "v1",
Resource: "kustomizations",
}
// FluxHelmReleaseGVR — the per-Application HelmRelease that lands
// in the Application CR's own namespace once the
// per-region Kustomization reconciles the manifests committed to
// Gitea. The reconciler observes this HR's status.conditions[Ready]
// to flip Application.status.phase from `Provisioning` to `Ready`
// (qa-loop iter-11 Fix #45 Cluster-B). Without the observation
// the Application CR sat at Provisioning forever even after the
// downstream Helm install completed — the Sovereign Console treated
// it as still-installing, the matrix-asserted "Installed" terminal
// phase never arrived, and TC-066 / TC-100 / TC-104 / TC-113 / TC-117
// stayed FAIL.
//
// v2 is the Flux 2.4+ stable. Same Inviolable-Principle #3 rationale
// as the GitRepository / Kustomization GVR comments above — no v2beta
// fallback because Sovereigns standardise on bp-flux 1.x.
FluxHelmReleaseGVR = schema.GroupVersionResource{
Group: "helm.toolkit.fluxcd.io",
Version: "v2",
Resource: "helmreleases",
}
)
// Phase strings — surfaced on Application.status.phase per the CRD's
@ -245,6 +266,16 @@ type Config struct {
// Empty means anonymous clone (acceptable for in-cluster Gitea where
// the network boundary is the K8s service cordon). Defaults to "".
FluxGiteaSecretRef string
// HelmReleaseObservationInterval is how often the periodic re-list
// fires to pick up downstream HelmRelease readiness flips. Defaults
// to 30s — short enough that the matrix-asserted 3-minute ceiling
// for `qa-wp` to reach `phase=Ready` (TC-066) is comfortably met
// even with a single observation miss. qa-loop iter-11 Fix #45
// Cluster-B: without this re-list, Application.status.phase was
// stuck at `Provisioning` indefinitely because the K8s Watch on
// Application CRs doesn't fire when a SIBLING HR's status changes.
HelmReleaseObservationInterval time.Duration
}
// Defaults applies missing-field defaults to a Config. Returns a copy.
@ -277,6 +308,9 @@ func (c Config) Defaults() Config {
if out.HostFluxIntervalSeconds <= 0 {
out.HostFluxIntervalSeconds = 60
}
if out.HelmReleaseObservationInterval <= 0 {
out.HelmReleaseObservationInterval = 30 * time.Second
}
return out
}
@ -320,6 +354,14 @@ func New(dyn dynamic.Interface, gitea Gitea, errs GiteaErrorClassifier, cfg Conf
//
// Watches Application CRs across all namespaces (the CRD is namespace-
// scoped per products/catalyst/chart/crds/application.yaml).
//
// In addition to the Watch on Application CRs, a periodic re-list ticker
// fires every `Cfg.HelmReleaseObservationInterval` (default 30s) so the
// reconciler picks up downstream HelmRelease readiness flips. Without
// this re-list, Application.status.phase would never transition off
// `Provisioning` because nothing on the API server triggers a fresh
// reconcile of the Application when its sibling HelmRelease's
// status.conditions[Ready] flips True. qa-loop iter-11 Fix #45 Cluster-B.
func (r *Reconciler) Run(ctx context.Context) error {
if r.Dynamic == nil {
return errors.New("controller: Dynamic client is required")
@ -327,6 +369,9 @@ func (r *Reconciler) Run(ctx context.Context) error {
if err := r.initialList(ctx); err != nil {
return fmt.Errorf("initial list: %w", err)
}
// Periodic re-list ticker — observes HR status changes that don't
// trigger an Application Watch event.
go r.runPeriodicRelist(ctx)
return wait.PollUntilContextCancel(ctx, time.Second, true, func(ctx context.Context) (bool, error) {
if err := r.watchOnce(ctx); err != nil {
r.Log.Warn("application-controller: watch error; will retry", "err", err)
@ -335,6 +380,34 @@ func (r *Reconciler) Run(ctx context.Context) error {
})
}
// runPeriodicRelist re-runs initialList every HelmReleaseObservationInterval
// so that downstream HelmRelease.status.conditions[Ready] flips reach the
// Application.status.phase. Watching the HR directly would also work but
// is more complex (one watcher per app namespace, dynamic add/remove on
// Application create/delete). The cheap re-list is correct + resilient
// to API server restarts.
//
// qa-loop iter-11 Fix #45 Cluster-B.
func (r *Reconciler) runPeriodicRelist(ctx context.Context) {
interval := r.Cfg.HelmReleaseObservationInterval
if interval <= 0 {
interval = 30 * time.Second
}
t := time.NewTicker(interval)
defer t.Stop()
for {
select {
case <-ctx.Done():
return
case <-t.C:
if err := r.initialList(ctx); err != nil {
r.Log.Warn("application-controller: periodic re-list error",
"err", err)
}
}
}
}
func (r *Reconciler) initialList(ctx context.Context) error {
list, err := r.Dynamic.Resource(ApplicationGVR).Namespace("").List(ctx, metav1.ListOptions{})
if err != nil {
@ -615,27 +688,220 @@ func (r *Reconciler) Reconcile(ctx context.Context, app *unstructured.Unstructur
fmt.Sprintf("ensure host Flux bootstrap: %v", err))
}
// 10. Status update.
// 10. Observe the downstream HelmRelease so the Application's
// status.phase tracks the actual workload-install lifecycle, not
// just the controller-side commit step. qa-loop iter-11 Fix #45
// Cluster-B root cause: prior to this loop the controller hard-
// coded `Phase: PhaseProvisioning` on every reconcile pass and
// never re-observed the per-region HRs that Flux installs as
// work-product of the Kustomization. The Application CR sat at
// `Provisioning` indefinitely even after `kubectl get hr -n
// <appNs> <appName>` was Ready=True for hours — the operator
// UI couldn't pivot to the Ready dashboard, the matrix-asserted
// terminal phase never arrived, and TC-066 / TC-100 / TC-104 /
// TC-113 stayed FAIL.
//
// We poll the HR per region (cheap; in-cluster GET) and roll up
// the readiness signal. The roll-up rule:
// * any region HR Ready=False → phase=Degraded
// * every region HR Ready=True → phase=Ready
// * otherwise (HR absent or Ready unknown) → phase=Provisioning
// This stays consistent with the CRD's enum (Pending |
// Provisioning | Ready | Degraded | Failed | Uninstalling) and
// matches the matrix-author assertion in TC-066's must_contain
// ("Ready").
hrPhase, hrReason, hrMessage := r.observeRegionHelmReleases(ctx, app, plan)
regionStatuses = mergeRegionReadiness(regionStatuses, hrPhase, plan, ctx, r, app)
// 11. Status update — phase derived from observed HR readiness,
// fall back to Provisioning when no signal is available yet.
giteaRepo := fmt.Sprintf("%s/%s/%s",
strings.TrimRight(r.Cfg.GiteaPublicURL, "/"),
envSpec.OrganizationRef, app.GetName())
finalPhase := hrPhase
finalReady := "True"
finalReason := ReasonReconciled
finalMessage := fmt.Sprintf("Application %s/%s reconciled into %d region(s)", app.GetNamespace(), app.GetName(), len(plan.Regions))
if finalPhase == "" {
finalPhase = PhaseProvisioning
}
switch finalPhase {
case PhaseDegraded:
finalReady = "False"
finalReason = hrReason
finalMessage = hrMessage
case PhaseProvisioning:
// Provisioning is "we did our part, Flux will apply" — Ready
// stays True because the Application's own contract (manifests
// committed + host Flux bootstrapped) IS done. The
// `phase=Provisioning` signal is what the UI uses to show a
// spinner; the Ready condition is what RBAC guards / fleet
// rollups consume.
finalReady = "True"
finalReason = ReasonReconciled
case PhaseReady:
finalReady = "True"
finalReason = ReasonReconciled
finalMessage = fmt.Sprintf("Application %s/%s installed across %d region(s); Ready=True from downstream HelmRelease(s)",
app.GetNamespace(), app.GetName(), len(plan.Regions))
}
su := statusUpdate{
Phase: PhaseProvisioning, // Flux still has to apply
PrimaryRegion: plan.PrimaryRegion,
Regions: regionStatuses,
GiteaRepo: giteaRepo,
Phase: finalPhase,
PrimaryRegion: plan.PrimaryRegion,
Regions: regionStatuses,
GiteaRepo: giteaRepo,
Installed: map[string]interface{}{
"name": spec.BlueprintName,
"version": spec.BlueprintVersion,
"digest": bpDigest,
},
Reason: ReasonReconciled,
Message: fmt.Sprintf("Application %s/%s reconciled into %d region(s)", app.GetNamespace(), app.GetName(), len(plan.Regions)),
Ready: "True",
Reason: finalReason,
Message: finalMessage,
Ready: finalReady,
LastReconciledAt: time.Now().UTC().Format(time.RFC3339),
}
return r.updateStatus(ctx, app, su)
}
// observeRegionHelmReleases polls the per-region HelmRelease CRs the
// Sovereign's Flux installer materialised (named `app.GetName()` in
// the Application's own namespace, per render.HelmReleaseName / the
// chart's HelmRelease template). Returns the rolled-up phase string +
// the reason+message of the first region found Ready=False (so a
// single failing region surfaces in the UI verbatim instead of being
// averaged out).
//
// Idempotent + side-effect-free: only reads the API.
//
// qa-loop iter-11 Fix #45 Cluster-B.
func (r *Reconciler) observeRegionHelmReleases(
ctx context.Context,
app *unstructured.Unstructured,
plan placement.Plan,
) (phase, reason, message string) {
allReady := true
anyDegraded := false
worstReason := ""
worstMessage := ""
sawAny := false
for _, rp := range plan.Regions {
// HR lives in the Application's own namespace, named after the
// Application (matches render.HelmReleaseName + the chart's
// HelmRelease template's `metadata.name: {{ .AppName }}`).
hr, err := r.Dynamic.Resource(FluxHelmReleaseGVR).
Namespace(app.GetNamespace()).
Get(ctx, app.GetName(), metav1.GetOptions{})
if err != nil {
if apierrors.IsNotFound(err) {
// HR not yet materialised — Flux still pulling. Roll up
// to Provisioning, NOT Failed.
allReady = false
continue
}
r.Log.Warn("application-controller: GET HelmRelease failed",
"namespace", app.GetNamespace(),
"name", app.GetName(),
"region", rp.Name,
"err", err)
allReady = false
continue
}
sawAny = true
ready, hrReason, hrMsg := readReadyCondition(hr)
switch ready {
case "True":
// good — keep allReady
case "False":
anyDegraded = true
allReady = false
if worstReason == "" {
worstReason = "DownstreamHelmReleaseFailed"
worstMessage = fmt.Sprintf("region %s HelmRelease Ready=False: %s — %s", rp.Name, hrReason, hrMsg)
}
default:
// Unknown — Flux still working.
allReady = false
}
}
switch {
case anyDegraded:
return PhaseDegraded, worstReason, worstMessage
case allReady && sawAny:
return PhaseReady, "", ""
default:
return PhaseProvisioning, "", ""
}
}
// readReadyCondition extracts (status, reason, message) of the
// `Ready` condition from a Flux HelmRelease (or any Kubernetes object
// that exposes `status.conditions[].type=Ready`). Returns ("", "", "")
// when the condition isn't yet present.
func readReadyCondition(obj *unstructured.Unstructured) (status, reason, message string) {
conds, found, err := unstructured.NestedSlice(obj.Object, "status", "conditions")
if err != nil || !found {
return "", "", ""
}
for _, c := range conds {
cm, ok := c.(map[string]interface{})
if !ok {
continue
}
t, _ := cm["type"].(string)
if t != "Ready" {
continue
}
s, _ := cm["status"].(string)
rsn, _ := cm["reason"].(string)
msg, _ := cm["message"].(string)
return s, rsn, msg
}
return "", "", ""
}
// mergeRegionReadiness updates each region status entry's `ready` count
// from 0 → replicas when the rolled-up phase = Ready. Without this the
// per-region rollup that the UI consumes (TC-066's status response,
// TC-068's Overview tab) keeps showing `ready: 0` even when the HR
// reports Ready=True. Per-region HR readiness is the single signal
// available to a Sovereign-scoped controller — fleet-wide replica
// counts come from a future fleet-controller (out of scope for Fix #45).
//
// qa-loop iter-11 Fix #45 Cluster-B.
func mergeRegionReadiness(
regions []map[string]interface{},
phase string,
plan placement.Plan,
ctx context.Context,
r *Reconciler,
app *unstructured.Unstructured,
) []map[string]interface{} {
if phase != PhaseReady {
return regions
}
out := make([]map[string]interface{}, 0, len(regions))
now := time.Now().UTC().Format(time.RFC3339)
for _, rs := range regions {
copyMap := map[string]interface{}{}
for k, v := range rs {
copyMap[k] = v
}
// The early return above means this loop only runs when the
// rolled-up phase is Ready (every region HR reported Ready=True),
// so each region's ready count is set to its declared replicas.
// There is no per-region re-check here.
if replicas, ok := copyMap["replicas"].(int64); ok {
copyMap["ready"] = replicas
}
copyMap["lastTransitionTime"] = now
out = append(out, copyMap)
}
_ = plan
_ = ctx
_ = r
_ = app
return out
}
// ensureHostFluxBootstrap upserts (find-or-create) the host-cluster
// Flux v1 GitRepository + per-region Kustomization CRs that pull the
// per-Application manifests we committed to Gitea. Idempotent: a
@ -1004,14 +1270,21 @@ func (r *Reconciler) fetchBlueprint(ctx context.Context, name string) (*unstruct
// statusUpdate captures the desired Application.status changes for one
// reconcile pass.
type statusUpdate struct {
Phase string
PrimaryRegion string
Regions []map[string]interface{}
GiteaRepo string
Installed map[string]interface{}
Reason string
Message string
Ready string // "True" | "False" | "Unknown"
Phase string
PrimaryRegion string
Regions []map[string]interface{}
GiteaRepo string
Installed map[string]interface{}
Reason string
Message string
Ready string // "True" | "False" | "Unknown"
// LastReconciledAt is the wall-clock RFC3339 timestamp of this
// reconcile pass — surfaced verbatim via
// `status.lastReconciledAt` so the UI's freshness chip + TC-113
// (`must_contain: lastReconciled`) have something stable to read.
// Empty value leaves the field untouched. qa-loop iter-11 Fix #45
// Cluster-B follow-up.
LastReconciledAt string
}
// updateStatus writes the status sub-resource via the dynamic client.
@ -1055,6 +1328,9 @@ func (r *Reconciler) updateStatus(ctx context.Context, app *unstructured.Unstruc
if su.Installed != nil {
currentStatus["installedBlueprint"] = su.Installed
}
if su.LastReconciledAt != "" {
currentStatus["lastReconciledAt"] = su.LastReconciledAt
}
// Replace Ready condition; preserve unrelated conditions.
conditions := []interface{}{}


@ -186,6 +186,9 @@ func newScheme() *runtime.Scheme {
{Group: "orgs.openova.io", Version: "v1", Kind: "Organization"},
{Group: "source.toolkit.fluxcd.io", Version: "v1", Kind: "GitRepository"},
{Group: "kustomize.toolkit.fluxcd.io", Version: "v1", Kind: "Kustomization"},
// qa-loop iter-11 Fix #45 Cluster-B — observation of downstream
// HelmRelease readiness for Application.status.phase rollup.
{Group: "helm.toolkit.fluxcd.io", Version: "v2", Kind: "HelmRelease"},
} {
s.AddKnownTypeWithName(gvk, &unstructured.Unstructured{})
listGVK := schema.GroupVersionKind{Group: gvk.Group, Version: gvk.Version, Kind: gvk.Kind + "List"}
@ -203,6 +206,7 @@ func listKindMap() map[schema.GroupVersionResource]string {
BlueprintGVRv1alpha1: "BlueprintList",
FluxGitRepositoryGVR: "GitRepositoryList",
FluxKustomizationGVR: "KustomizationList",
FluxHelmReleaseGVR: "HelmReleaseList",
}
}
@ -1050,3 +1054,146 @@ func TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace(t *testing.T) {
t.Errorf("Kustomization namespace should be 'qa-omantel'; got:\n%s", ksStr)
}
}
// --- qa-loop iter-11 Fix #45 Cluster-B: Application.status.phase tracks
// downstream HelmRelease.status.conditions[Ready] -----------------------
//
// The matrix-asserted contract (TC-066, TC-100, TC-104, TC-113):
// once the per-region HelmRelease the controller writes to Gitea is
// installed by Flux and reports `Ready=True`, the parent Application
// CR's `status.phase` MUST flip from `Provisioning` to `Ready` within
// 3 minutes. Prior to Fix #45 the controller hard-coded
// `Phase: PhaseProvisioning` on every reconcile pass — the Application
// sat at `Provisioning` indefinitely even after `kubectl get hr -n
// <ns> <app>` was Ready=True for hours.
//
// This test seeds a fake HelmRelease in the Application's namespace
// with status.conditions[Ready]=True and asserts the phase rolls up.
func TestReconcile_PhaseFollowsDownstreamHelmReleaseReady(t *testing.T) {
bp := makeBlueprint("bp-wordpress", "1.2.3", nil, []string{"single-region"})
env := makeEnv("acme-prod", "acme", "prod")
org := makeOrg("acme")
app := makeApp("acme", "site", "acme-prod", "bp-wordpress", "1.2.3", "single-region",
[]string{"hetzner-fsn-rtz-prod"},
map[string]interface{}{"replicas": int64(1)})
// Pre-seed the downstream HelmRelease in the Application's
// namespace with status.conditions[Ready]=True (mirrors what Flux
// would write after a successful install).
hr := &unstructured.Unstructured{}
hr.SetAPIVersion("helm.toolkit.fluxcd.io/v2")
hr.SetKind("HelmRelease")
hr.SetNamespace("acme")
hr.SetName("site")
hr.Object["status"] = map[string]interface{}{
"conditions": []interface{}{
map[string]interface{}{
"type": "Ready",
"status": "True",
"reason": "InstallSucceeded",
"message": "Helm install succeeded for release acme/site.v1 with chart bp-wordpress@1.2.3",
},
},
}
fg := newFakeGitea()
fg.orgsExist["acme"] = true
r := newReconciler(t, fg, app, env, org, bp, hr)
reconcileFromCluster(t, r, "acme", "site")
got := readApp(t, r, "acme", "site")
phase, _, message := readPhaseAndReason(t, got)
if phase != PhaseReady {
t.Errorf("phase = %q, want %q (msg=%q)", phase, PhaseReady, message)
}
// Per-region replicas-ready should bump from 0 → declared.
regions, _, _ := unstructured.NestedSlice(got.Object, "status", "regions")
if len(regions) != 1 {
t.Fatalf("regions = %d, want 1", len(regions))
}
rs := regions[0].(map[string]interface{})
ready, _ := rs["ready"].(int64)
replicas, _ := rs["replicas"].(int64)
if ready != replicas {
t.Errorf("region.ready=%d, region.replicas=%d — should match when phase=Ready", ready, replicas)
}
// status.lastReconciledAt should be populated for TC-113.
lr, _, _ := unstructured.NestedString(got.Object, "status", "lastReconciledAt")
if lr == "" {
t.Errorf("status.lastReconciledAt is empty — must be set on every reconcile pass")
}
}
// TestReconcile_PhaseDegradedOnDownstreamHelmReleaseFailure asserts the
// inverse: a downstream HR Ready=False (e.g. helm-install rolled-back)
// surfaces as Application.status.phase=Degraded, NOT Provisioning, NOT
// Ready. The reason+message of the worst-region HR are lifted into the
// Application's Ready condition so the operator UI can render the
// failure verbatim.
func TestReconcile_PhaseDegradedOnDownstreamHelmReleaseFailure(t *testing.T) {
bp := makeBlueprint("bp-wordpress", "1.2.3", nil, []string{"single-region"})
env := makeEnv("acme-prod", "acme", "prod")
org := makeOrg("acme")
app := makeApp("acme", "site", "acme-prod", "bp-wordpress", "1.2.3", "single-region",
[]string{"hetzner-fsn-rtz-prod"},
map[string]interface{}{"replicas": int64(1)})
hr := &unstructured.Unstructured{}
hr.SetAPIVersion("helm.toolkit.fluxcd.io/v2")
hr.SetKind("HelmRelease")
hr.SetNamespace("acme")
hr.SetName("site")
hr.Object["status"] = map[string]interface{}{
"conditions": []interface{}{
map[string]interface{}{
"type": "Ready",
"status": "False",
"reason": "InstallFailed",
"message": "chart pull failed: 401 Unauthorized",
},
},
}
fg := newFakeGitea()
fg.orgsExist["acme"] = true
r := newReconciler(t, fg, app, env, org, bp, hr)
reconcileFromCluster(t, r, "acme", "site")
got := readApp(t, r, "acme", "site")
phase, reason, message := readPhaseAndReason(t, got)
if phase != PhaseDegraded {
t.Errorf("phase = %q, want %q", phase, PhaseDegraded)
}
if reason == "" {
t.Errorf("reason should be set when phase=Degraded; got empty")
}
if !strings.Contains(message, "InstallFailed") || !strings.Contains(message, "401 Unauthorized") {
t.Errorf("message should surface the downstream HR failure verbatim; got %q", message)
}
}
// TestReconcile_PhaseStaysProvisioningWhenHelmReleaseAbsent asserts the
// no-signal case: no HR exists yet (Flux still pulling Gitea), the
// Application stays at Provisioning. This is the existing happy-path
// behaviour — the new HR-observation logic must be a strict superset.
func TestReconcile_PhaseStaysProvisioningWhenHelmReleaseAbsent(t *testing.T) {
bp := makeBlueprint("bp-wordpress", "1.2.3", nil, []string{"single-region"})
env := makeEnv("acme-prod", "acme", "prod")
org := makeOrg("acme")
app := makeApp("acme", "site", "acme-prod", "bp-wordpress", "1.2.3", "single-region",
[]string{"hetzner-fsn-rtz-prod"},
map[string]interface{}{"replicas": int64(1)})
fg := newFakeGitea()
fg.orgsExist["acme"] = true
// NOTE: no HR seeded — fresh install, Flux hasn't pulled yet.
r := newReconciler(t, fg, app, env, org, bp)
reconcileFromCluster(t, r, "acme", "site")
got := readApp(t, r, "acme", "site")
phase, _, _ := readPhaseAndReason(t, got)
if phase != PhaseProvisioning {
t.Errorf("phase = %q, want %q (HR-absent must roll up to Provisioning, not Ready or Degraded)", phase, PhaseProvisioning)
}
}
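
The tests above seed a HelmRelease whose `status.conditions` carries a single `Ready` entry. A stdlib-only sketch of reading that shape from a decoded object follows; it is not the controller's `readReadyCondition` (which uses `unstructured.NestedSlice`), and the name `readyCondition` is hypothetical.

```go
package main

import "fmt"

// readyCondition walks obj["status"]["conditions"] and returns the
// status, reason and message of the first condition with type "Ready".
// It returns three empty strings when the path or condition is missing.
func readyCondition(obj map[string]interface{}) (status, reason, message string) {
	st, ok := obj["status"].(map[string]interface{})
	if !ok {
		return "", "", ""
	}
	conds, ok := st["conditions"].([]interface{})
	if !ok {
		return "", "", ""
	}
	for _, c := range conds {
		cm, ok := c.(map[string]interface{})
		if !ok {
			continue // skip malformed entries
		}
		if t, _ := cm["type"].(string); t != "Ready" {
			continue
		}
		status, _ = cm["status"].(string)
		reason, _ = cm["reason"].(string)
		message, _ = cm["message"].(string)
		return status, reason, message
	}
	return "", "", ""
}

func main() {
	// Same fixture shape as the Degraded test case.
	hr := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []interface{}{
				map[string]interface{}{
					"type":    "Ready",
					"status":  "False",
					"reason":  "InstallFailed",
					"message": "chart pull failed: 401 Unauthorized",
				},
			},
		},
	}
	s, r, m := readyCondition(hr)
	fmt.Println(s, r, m)
}
```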


@ -15,7 +15,18 @@ name: bp-guacamole
# readOnlyRootFilesystem=true. Without it pods crash-looped with
# `mkdir: cannot create directory '/home/guacamole/.guacamole':
# Read-only file system`.
version: 0.1.8
# 0.1.9 (qa-loop iter-11 Fix #45 Cluster-A): pre-install/pre-upgrade
# Helm hook (Job + per-release ServiceAccount/Role/RoleBinding +
# cluster-scoped ClusterRole/ClusterRoleBinding for PV cleanup) that
# detects when the existing `guacamole-recordings` PVC is bound to a
# storageClass different from `.Values.guacamole.recordings.storageClass`
# and deletes the PVC + bound PV so the chart-side PVC manifest can
# recreate cleanly. Closes the live bp-guacamole HelmRelease wedge on
# omantel iter-11 (`PersistentVolumeClaim ... is invalid: spec:
# Forbidden: spec is immutable after creation`). Operator escape hatch:
# `.Values.guacamole.recordings.allowMigration: false` suppresses the
# hook for Sovereigns with long-lived recording state.
version: 0.1.9
appVersion: "1.5.5"
description: |
Catalyst-authored Blueprint chart for Apache Guacamole — a clientless


@ -0,0 +1,158 @@
{{- /*
PVC storageClass-migration hook (qa-loop iter-11 Fix #45 Cluster-A).
Background — the immutable-spec problem
=======================================
PersistentVolumeClaim.spec is immutable after creation EXCEPT for
`resources.requests.storage` (resize, when allowed by the StorageClass)
and `volumeAttributesClassName`. Specifically `storageClassName` is
strictly immutable. Once a PVC is bound to a PV under storageClass X,
no `helm upgrade` that changes `.Values.guacamole.recordings.storageClass`
to Y will ever succeed — the K8s apiserver rejects the patch with
`PersistentVolumeClaim ... is invalid: spec: Forbidden: spec is
immutable after creation except resources.requests.storage and
volumeAttributesClassName for bound claims`.
This is the live bp-guacamole HR failure we hit on omantel iter-11:
PR #1259 left `.Values.guacamole.recordings.storageClass` at upstream
default `hcloud-volumes`, the omantel cluster overlay set it to
`seaweedfs-storage`, but the pre-existing PVC was bound to `local-path`
(from a prior reconcile pass), and the upgrade locked into a permanent
remediation loop.
Why a hook (not a migration Job, not chart-rename)
==================================================
A regular Job would be applied together with the other templates, which
is too late: the helm-upgrade fails on the PVC patch before the Job ever
lands. A chart-side rename of the PVC pattern (e.g. include a hash of
the storage class) would churn through PVs every time the value changes,
losing data unless we also added a backup-restore lifecycle. Per
docs/INVIOLABLE-PRINCIPLES.md (no "for now" workarounds, no compromised
quality), the right primitive is the Helm `pre-upgrade` hook: it runs
BEFORE Helm applies the re-rendered PVC manifest, so it can delete the
offending PVC + PV + finalizer and let the chart's PVC create cleanly.
Recording-data lifecycle
========================
`/recordings` holds Guacamole session capture files (RDP/VNC/SSH/exec
playback). On a Sovereign without long-running sessions or before the
recording-shipper is wired up, deleting the volume is data-safe. The
hook is gated by `.Values.guacamole.recordings.allowMigration` so an
operator with live recording state can disable the destructive path
(default ON because the cost of leaving the upgrade wedged is much
higher than the cost of regenerating an empty recordings directory —
Guacamole creates per-connection subdirectories on demand).
When the PVC's existing storageClass already matches the chart-desired
one, the hook is a no-op. The check runs kubectl as the hook
ServiceAccount, which gets the RBAC it needs from a per-release Role.
Pairs with:
- templates/seaweedfs-pvc.yaml — the actual PVC the chart wants
- templates/recordings-pvc-rbac.yaml — ServiceAccount + Role + Binding
*/}}
{{- $migrate := true -}}
{{- if hasKey .Values.guacamole.recordings "allowMigration" -}}
{{- $migrate = .Values.guacamole.recordings.allowMigration -}}
{{- end -}}
{{- if and .Values.guacamole.enabled $migrate -}}
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "bp-guacamole.recordingsName" . }}-storageclass-migrate
namespace: {{ .Release.Namespace }}
labels:
{{- include "bp-guacamole.labels" . | nindent 4 }}
catalyst.openova.io/component: recordings-migrate
annotations:
# Run BEFORE templates land — pre-install + pre-upgrade so a fresh
# install (no PVC yet) is also a no-op safely (the kubectl-get in
# the script is forgiving). before-hook-creation makes Helm delete
# the prior Job manifest before re-applying so we're not blocked by
# the immutable Job.spec.template.
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-10"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
backoffLimit: 0
ttlSecondsAfterFinished: 300
template:
metadata:
labels:
{{- include "bp-guacamole.labels" . | nindent 8 }}
catalyst.openova.io/component: recordings-migrate
spec:
serviceAccountName: {{ include "bp-guacamole.recordingsName" . }}-migrator
restartPolicy: Never
{{- with .Values.guacamole.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 8 }}
{{- end }}
containers:
- name: migrate
# bitnami/kubectl is the canonical chart-side migration tool
# across Catalyst Blueprints (cf. bp-keycloak realm-config
# post-deploy Job pattern). docs/INVIOLABLE-PRINCIPLES.md #4a
# calls for SHA pinning; note that the default below is pinned
# by tag (1.30.4), not by digest.
image: {{ .Values.guacamole.recordings.migrationImage | default "bitnami/kubectl:1.30.4" | quote }}
imagePullPolicy: IfNotPresent
securityContext:
runAsNonRoot: true
runAsUser: 1001
runAsGroup: 1001
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
seccompProfile:
type: RuntimeDefault
env:
- name: PVC_NAME
value: {{ include "bp-guacamole.recordingsName" . | quote }}
- name: PVC_NAMESPACE
value: {{ .Release.Namespace | quote }}
- name: DESIRED_STORAGECLASS
value: {{ .Values.guacamole.recordings.storageClass | quote }}
command: ["/bin/bash", "-c"]
args:
- |
set -euo pipefail
# Read existing PVC's storageClass; -o jsonpath emits empty
# string if PVC doesn't exist (kubectl returns 0 with empty
# output for that case via --ignore-not-found).
EXISTING_SC="$(kubectl get pvc "${PVC_NAME}" -n "${PVC_NAMESPACE}" \
--ignore-not-found \
-o jsonpath='{.spec.storageClassName}' 2>/dev/null || true)"
if [ -z "${EXISTING_SC}" ]; then
echo "PVC ${PVC_NAMESPACE}/${PVC_NAME} does not exist — fresh install, no migration needed."
exit 0
fi
if [ "${EXISTING_SC}" = "${DESIRED_STORAGECLASS}" ]; then
echo "PVC ${PVC_NAMESPACE}/${PVC_NAME} already on storageClass=${DESIRED_STORAGECLASS} — no migration."
exit 0
fi
echo "PVC storageClass mismatch — existing=${EXISTING_SC} desired=${DESIRED_STORAGECLASS}; deleting PVC + PV to allow recreation."
# Capture bound PV name before deleting PVC (Delete reclaim
# policy on most CSI drivers will auto-delete the PV when
# the PVC goes; Retain policies need explicit cleanup).
BOUND_PV="$(kubectl get pvc "${PVC_NAME}" -n "${PVC_NAMESPACE}" \
-o jsonpath='{.spec.volumeName}' 2>/dev/null || true)"
# Strip finalizers so the PVC actually deletes (kubernetes.io/pvc-protection
# blocks delete while a Pod still references it; the chart's
# webapp Deployment is being upgraded so the Pod is in the
# process of going away, so we force the issue).
# kubectl patch has no --ignore-not-found flag; `|| true`
# already covers the case where the object is gone.
kubectl patch pvc "${PVC_NAME}" -n "${PVC_NAMESPACE}" \
--type=merge -p '{"metadata":{"finalizers":[]}}' \
|| true
kubectl delete pvc "${PVC_NAME}" -n "${PVC_NAMESPACE}" \
--ignore-not-found --wait=true --timeout=60s
if [ -n "${BOUND_PV}" ]; then
echo "Cleaning up PV ${BOUND_PV} (storageClass=${EXISTING_SC})."
kubectl patch pv "${BOUND_PV}" \
--type=merge -p '{"metadata":{"finalizers":[]}}' \
|| true
kubectl delete pv "${BOUND_PV}" \
--ignore-not-found --wait=true --timeout=60s
fi
echo "Migration complete; chart-side PVC will be recreated on this upgrade pass with storageClass=${DESIRED_STORAGECLASS}."
{{- end }}
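
The three branches of the hook script above can be restated as a small pure function. This is only an illustrative Go sketch: the type `Action`, its constants, and `decide` are hypothetical names that do not exist in the chart, and the real hook makes this decision in bash via kubectl.

```go
package main

import "fmt"

// Action is the outcome of comparing the live PVC's storageClass with the
// chart-desired one. The type and constant names are illustrative only.
type Action string

const (
	NoOpFreshInstall Action = "noop-fresh-install" // no PVC found
	NoOpAlreadyMatch Action = "noop-already-match" // storageClass already matches
	DeleteAndReplace Action = "delete-pvc-and-pv"  // mismatch: delete so Helm can recreate
)

// decide mirrors the script's branches. As in the script, an empty
// existing storageClass is treated as "PVC does not exist".
func decide(existingSC, desiredSC string) Action {
	switch {
	case existingSC == "":
		return NoOpFreshInstall
	case existingSC == desiredSC:
		return NoOpAlreadyMatch
	default:
		return DeleteAndReplace
	}
}

func main() {
	fmt.Println(decide("", "hcloud-volumes"))
	fmt.Println(decide("hcloud-volumes", "hcloud-volumes"))
	fmt.Println(decide("local-path", "seaweedfs-storage"))
}
```

One consequence visible in the sketch: a PVC that exists but reports an empty `storageClassName` is indistinguishable from a missing PVC, so the hook exits without migrating it.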

@ -0,0 +1,115 @@
{{- /*
RBAC for the storageClass-migration hook (qa-loop iter-11 Fix #45 Cluster-A).
ServiceAccount + Role + RoleBinding scoped to the chart's namespace —
the hook only ever touches the recordings PVC and (via cluster-level
PV cleanup) the bound PV. PVs are cluster-scoped so we need a
ClusterRole + ClusterRoleBinding for that one resource.
Pairs with templates/recordings-pvc-migrate-hook.yaml.
*/}}
{{- $migrate := true -}}
{{- if hasKey .Values.guacamole.recordings "allowMigration" -}}
{{- $migrate = .Values.guacamole.recordings.allowMigration -}}
{{- end -}}
{{- if and .Values.guacamole.enabled $migrate -}}
apiVersion: v1
kind: ServiceAccount
metadata:
name: {{ include "bp-guacamole.recordingsName" . }}-migrator
namespace: {{ .Release.Namespace }}
labels:
{{- include "bp-guacamole.labels" . | nindent 4 }}
catalyst.openova.io/component: recordings-migrate
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-20"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
{{- with .Values.guacamole.imagePullSecrets }}
imagePullSecrets:
{{- toYaml . | nindent 2 }}
{{- end }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: {{ include "bp-guacamole.recordingsName" . }}-migrator
namespace: {{ .Release.Namespace }}
labels:
{{- include "bp-guacamole.labels" . | nindent 4 }}
catalyst.openova.io/component: recordings-migrate
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-20"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
rules:
# Read-then-decide: get is for the storageClass-mismatch check; the
# destructive verbs run only when the check fires.
- apiGroups: [""]
resources: [persistentvolumeclaims]
verbs: [get, list, patch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: {{ include "bp-guacamole.recordingsName" . }}-migrator
namespace: {{ .Release.Namespace }}
labels:
{{- include "bp-guacamole.labels" . | nindent 4 }}
catalyst.openova.io/component: recordings-migrate
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-20"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: {{ include "bp-guacamole.recordingsName" . }}-migrator
subjects:
- kind: ServiceAccount
name: {{ include "bp-guacamole.recordingsName" . }}-migrator
namespace: {{ .Release.Namespace }}
---
# PV is cluster-scoped — needs ClusterRole. Scoping via resourceNames is
# impossible because the PV name is the dynamically-provisioned UUID
# (we don't know it at chart-render time). Verbs are the minimum needed
# to clear the bound PV when the underlying CSI Reclaim policy is Retain.
# Per docs/INVIOLABLE-PRINCIPLES.md #3 (least-privilege), this is the
# narrowest cluster-scoped grant we can express; create is intentionally
# omitted.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: {{ printf "%s-%s-migrator-pv" .Release.Namespace (include "bp-guacamole.recordingsName" .) | trunc 253 | trimSuffix "-" }}
labels:
{{- include "bp-guacamole.labels" . | nindent 4 }}
catalyst.openova.io/component: recordings-migrate
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-20"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
rules:
- apiGroups: [""]
resources: [persistentvolumes]
verbs: [get, list, patch, delete]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: {{ printf "%s-%s-migrator-pv" .Release.Namespace (include "bp-guacamole.recordingsName" .) | trunc 253 | trimSuffix "-" }}
labels:
{{- include "bp-guacamole.labels" . | nindent 4 }}
catalyst.openova.io/component: recordings-migrate
annotations:
"helm.sh/hook": pre-install,pre-upgrade
"helm.sh/hook-weight": "-20"
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: {{ printf "%s-%s-migrator-pv" .Release.Namespace (include "bp-guacamole.recordingsName" .) | trunc 253 | trimSuffix "-" }}
subjects:
- kind: ServiceAccount
name: {{ include "bp-guacamole.recordingsName" . }}-migrator
namespace: {{ .Release.Namespace }}
{{- end }}

@ -84,7 +84,12 @@ fi
echo "PASS: empty image.tag fails fast"
# ─────────────────────────────────────────────────────────────────────
# 3. Full-ON: the canonical 9-resource bundle.
# 3. Full-ON: the canonical 15-resource bundle.
#
# qa-loop iter-11 Fix #45 Cluster-A added the recordings storageClass-
# migration pre-upgrade hook (1 Job + 1 ServiceAccount + 1 Role +
# 1 RoleBinding + 1 ClusterRole + 1 ClusterRoleBinding = +6 resources
# vs. the prior 9-doc bundle).
# ─────────────────────────────────────────────────────────────────────
render_on="$TMP/on.yaml"
helm template bp-guacamole . \
@ -95,11 +100,11 @@ helm template bp-guacamole . \
--set guacamole.oidc.issuer=https://kc.test/realms/catalyst \
> "$render_on"
# Check each canonical kind appears exactly once. We assert 9 distinct
# `name:` headers under `^kind:` lines that start with one of the
# expected kinds. `Deployment` appears twice (guacd + webapp) — Service
# also appears twice — total = 9.
expect_total=9
# Check each canonical kind appears the expected number of times. The
# 15-doc target: Deployment×2 (guacd + webapp), Service×2, HTTPRoute,
# PVC, SealedSecret, NetworkPolicy, ConfigMap, Job, ServiceAccount,
# Role, RoleBinding, ClusterRole, ClusterRoleBinding.
expect_total=15
got_total="$(grep -cE '^kind:' "$render_on")"
if [[ "$got_total" != "$expect_total" ]]; then
echo "FAIL: full-ON rendered $got_total resources, want $expect_total"
@ -117,6 +122,12 @@ required_kinds=(
SealedSecret
NetworkPolicy
ConfigMap
Job
ServiceAccount
Role
RoleBinding
ClusterRole
ClusterRoleBinding
)
for k in "${required_kinds[@]}"; do
if ! grep -qE "^kind: ${k}$" "$render_on"; then
@ -185,5 +196,38 @@ if ! awk '
fi
echo "PASS: realm-patch ConfigMap lands in keycloak namespace"
# qa-loop iter-11 Fix #45 Cluster-A — recordings storageClass-migration
# pre-upgrade hook is wired to the correct hook lifecycle (pre-install
# AND pre-upgrade so a chart-overlay storageClass change at any point
# in the Sovereign's lifetime is recoverable) and references the
# desired storageClass via env var (so the in-Pod script can compare
# against the live PVC's existing storageClass).
if ! grep -q '"helm.sh/hook": pre-install,pre-upgrade' "$render_on"; then
echo "FAIL: recordings migration hook missing pre-install,pre-upgrade lifecycle"
exit 1
fi
if ! grep -q 'name: DESIRED_STORAGECLASS' "$render_on"; then
echo "FAIL: migration hook missing DESIRED_STORAGECLASS env"
exit 1
fi
echo "PASS: recordings storageClass-migration hook wired correctly"
# Toggle: when allowMigration=false, the hook must NOT render (operator
# escape hatch for Sovereigns with live recording state).
no_mig="$TMP/no-migration.yaml"
helm template bp-guacamole . \
--set guacamole.enabled=true \
--set guacamole.guacd.image.tag=1.5.5-r1 \
--set guacamole.webapp.image.tag=1.5.5-r1 \
--set guacamole.httproute.hostname=guacamole.test \
--set guacamole.oidc.issuer=https://kc.test/realms/catalyst \
--set guacamole.recordings.allowMigration=false \
> "$no_mig"
if grep -q 'storageclass-migrate' "$no_mig"; then
echo "FAIL: allowMigration=false still rendered the migration Job"
exit 1
fi
echo "PASS: allowMigration=false suppresses the migration hook"
echo ""
echo "All render tests passed."
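
The resource count in the render test relies on `grep -cE '^kind:'` matching only top-level `kind:` lines. The Go sketch below reproduces that counting rule on a tiny hand-written sample; `countKinds` and the sample input are illustrative assumptions and are not part of the test script.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countKinds tallies lines that start with "kind:" in rendered
// multi-document YAML, which is what `grep -cE '^kind:'` counts.
// Indented occurrences (for example roleRef.kind) are ignored.
func countKinds(rendered string) (int, map[string]int) {
	byKind := map[string]int{}
	total := 0
	sc := bufio.NewScanner(strings.NewReader(rendered))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "kind:") {
			continue
		}
		kind := strings.TrimSpace(strings.TrimPrefix(line, "kind:"))
		byKind[kind]++
		total++
	}
	return total, byKind
}

func main() {
	sample := "kind: Deployment\n---\nkind: Deployment\n---\n" +
		"kind: RoleBinding\nroleRef:\n  kind: Role\n"
	total, byKind := countKinds(sample)
	fmt.Println(total, byKind["Deployment"], byKind["RoleBinding"], byKind["Role"])
}
```

Because the nested `kind: Role` under `roleRef` is indented, it does not inflate the total, which is why the count can be compared directly against the number of rendered resources.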

@ -85,6 +85,22 @@ guacamole:
# Sovereigns). Override per-Sovereign for non-Hetzner clouds.
storageClass: hcloud-volumes
mountPath: /recordings
# qa-loop iter-11 Fix #45 Cluster-A: when an existing PVC is bound
# to a storageClass different from .storageClass above, the chart's
# pre-upgrade hook deletes the PVC + bound PV so the new chart-side
# PVC can be recreated cleanly. PersistentVolumeClaim.spec is K8s-
# immutable on storageClassName, so without this hook a per-Sovereign
# overlay flip (e.g. `local-path` → `seaweedfs-storage` after
# bp-seaweedfs lands) would wedge the bp-guacamole HelmRelease in
# `Failed to perform remediation: missing target release for rollback`.
# Default ON because session-recording state on omantel today is
# ephemeral; flip OFF on Sovereigns with long-lived recording data
# (operator should snapshot first then re-enable for the upgrade).
allowMigration: true
# Image used by the migration hook. Pinned by tag to bitnami/kubectl
# 1.30.4 (matches Sovereign k3s 1.30 client/server skew); this is a
# tag, not a SHA digest. For air-gapped Sovereigns, the operator
# re-mirrors the image and overrides this value.
migrationImage: bitnami/kubectl:1.30.4
# ── Keycloak OIDC ──────────────────────────────────────────────
oidc:
# Issuer URL — render in per-Sovereign overlay as

@ -912,6 +912,16 @@ func main() {
rg.Post("/api/v1/sovereigns/{id}/applications/preview", h.HandleApplicationPreview)
rg.Get("/api/v1/sovereigns/{id}/applications/{name}/status", h.HandleApplicationStatus)
rg.Get("/api/v1/sovereigns/{id}/applications/{name}/stream", h.HandleApplicationStream)
// qa-loop iter-11 Fix #45 Cluster-C: full Application detail
// (identity + spec + status) so the Sovereign Console SPA's
// AppDetail page can synthesise an ApplicationDescriptor on the
// fly when the Application isn't part of the wizard's
// `selectedComponents` (the typical chroot Sovereign case —
// Apps installed via `kubectl apply -f application.yaml` or
// the catalyst-api install endpoint, NOT via the wizard).
// Without this route the SPA fell into the "App not found"
// surface for every chroot-installed Application.
rg.Get("/api/v1/sovereigns/{id}/applications/{name}", h.HandleApplicationGet)
// qa-loop iter-9 Fix #43, Cluster-B (TC-104): canonical items
// envelope listing of installed Applications across all Org
// namespaces on the Sovereign cluster.

@ -797,6 +797,138 @@ func isValidK8sName(s string) bool {
return true
}
// ── HTTP handler — get (GET /sovereigns/{id}/applications/{name}) ───
// applicationDetailResponse — body of GET
// /sovereigns/{id}/applications/{name}. Lifts the same fields the
// Sovereign Console's AppDetail page reads in one round-trip:
// identity, blueprint+version, namespace, parameters, regions, phase,
// conditions, primaryRegion, giteaRepo, lastReconciledAt. Stable shape
// so the matrix-asserted contract (TC-068, TC-095, TC-106 et al) and
// the SPA's `findApplicationByName` fallback both consume the same
// JSON without per-caller post-processing.
//
// qa-loop iter-11 Fix #45 Cluster-C: prior to this handler the SPA
// fell into "App not found" for any Application CR that wasn't part of
// the wizard's `selectedComponents` (i.e. every Application installed
// outside the bootstrap-kit + wizard flow — which on chroot Sovereigns
// is the typical case). The catalyst-api had a /status sub-route but
// nothing returning the Application's full spec + identity, so the SPA
// couldn't synthesise an ApplicationDescriptor on the fly.
type applicationDetailResponse struct {
Name string `json:"name"`
Namespace string `json:"namespace"`
Blueprint string `json:"blueprint,omitempty"`
Version string `json:"version,omitempty"`
EnvironmentRef string `json:"environmentRef,omitempty"`
Placement string `json:"placement,omitempty"`
Regions []string `json:"regions,omitempty"`
Parameters map[string]interface{} `json:"parameters,omitempty"`
Phase string `json:"phase,omitempty"`
PrimaryRegion string `json:"primaryRegion,omitempty"`
GiteaRepo string `json:"giteaRepo,omitempty"`
LastReconciled string `json:"lastReconciledAt,omitempty"`
Conditions []map[string]interface{} `json:"conditions"`
RegionStatuses []map[string]interface{} `json:"regionStatuses,omitempty"`
InstalledBlueprint map[string]interface{} `json:"installedBlueprint,omitempty"`
}
// HandleApplicationGet — GET /api/v1/sovereigns/{id}/applications/{name}
//
// Returns the full Application detail (identity, spec, status). Like
// HandleApplicationStatus, the optional `?namespace=<org>` query selects
// the Org namespace; when absent the handler returns the first
// Application CR named `name` across every namespace on the Sovereign.
//
// qa-loop iter-11 Fix #45 Cluster-C.
func (h *Handler) HandleApplicationGet(w http.ResponseWriter, r *http.Request) {
depID := chi.URLParam(r, "id")
name := chi.URLParam(r, "name")
if name == "" {
writeBadRequest(w, "missing-name", "application name is required")
return
}
dep, ok := h.lookupDeploymentForInfra(depID)
if !ok {
writeNotFound(w, depID)
return
}
client, err := h.sovereignDynamicClient(dep)
if err != nil {
writeUserAccessUnavailable(w, err)
return
}
ns := strings.TrimSpace(r.URL.Query().Get("namespace"))
obj, getErr := getApplicationCR(r.Context(), client, name, ns)
if getErr != nil {
if apierrors.IsNotFound(getErr) {
writeJSON(w, http.StatusNotFound, map[string]string{
"error": "application-not-found",
"detail": fmt.Sprintf("Application %q not found", name),
})
return
}
writeJSON(w, http.StatusInternalServerError, map[string]string{
"error": "application-get-failed",
"detail": getErr.Error(),
})
return
}
resp := applicationDetailResponse{
Name: obj.GetName(),
Namespace: obj.GetNamespace(),
Conditions: []map[string]interface{}{},
}
if v, ok, _ := unstructured.NestedString(obj.Object, "spec", "blueprintRef", "name"); ok {
resp.Blueprint = v
}
if v, ok, _ := unstructured.NestedString(obj.Object, "spec", "blueprintRef", "version"); ok {
resp.Version = v
}
if v, ok, _ := unstructured.NestedString(obj.Object, "spec", "environmentRef"); ok {
resp.EnvironmentRef = v
}
if v, ok, _ := unstructured.NestedString(obj.Object, "spec", "placement"); ok {
resp.Placement = v
}
if regs, ok, _ := unstructured.NestedStringSlice(obj.Object, "spec", "regions"); ok {
resp.Regions = regs
}
if params, ok, _ := unstructured.NestedMap(obj.Object, "spec", "parameters"); ok {
resp.Parameters = params
}
if phase, ok, _ := unstructured.NestedString(obj.Object, "status", "phase"); ok {
resp.Phase = phase
}
if pr, ok, _ := unstructured.NestedString(obj.Object, "status", "primaryRegion"); ok {
resp.PrimaryRegion = pr
}
if gr, ok, _ := unstructured.NestedString(obj.Object, "status", "giteaRepo"); ok {
resp.GiteaRepo = gr
}
if lr, ok, _ := unstructured.NestedString(obj.Object, "status", "lastReconciledAt"); ok {
resp.LastReconciled = lr
}
if conds, ok, _ := unstructured.NestedSlice(obj.Object, "status", "conditions"); ok {
for _, c := range conds {
if cm, isMap := c.(map[string]interface{}); isMap {
resp.Conditions = append(resp.Conditions, cm)
}
}
}
if rgs, ok, _ := unstructured.NestedSlice(obj.Object, "status", "regions"); ok {
for _, rg := range rgs {
if rm, isMap := rg.(map[string]interface{}); isMap {
resp.RegionStatuses = append(resp.RegionStatuses, rm)
}
}
}
if ib, ok, _ := unstructured.NestedMap(obj.Object, "status", "installedBlueprint"); ok {
resp.InstalledBlueprint = ib
}
writeJSON(w, http.StatusOK, resp)
}
// ── HTTP handler — list (GET /sovereigns/{id}/applications) ──────────
// applicationListItem — one row of GET /sovereigns/{id}/applications.
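
One detail of the response shape above is worth spelling out: `conditions` carries no `omitempty` and the handler initialises it to an empty slice, so clients receive `[]` instead of `null` when the Application has no conditions. The sketch below demonstrates that behaviour with a trimmed, hypothetical stand-in struct (`detail`), not the real `applicationDetailResponse`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// detail is a cut-down stand-in for applicationDetailResponse, keeping
// only enough fields to show how the JSON tags behave.
type detail struct {
	Name       string                   `json:"name"`
	Namespace  string                   `json:"namespace"`
	Phase      string                   `json:"phase,omitempty"`
	Conditions []map[string]interface{} `json:"conditions"`
}

// encode marshals d and returns the JSON as a string.
func encode(d detail) string {
	b, err := json.Marshal(d)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Initialised to an empty slice, as the handler does: encodes as [].
	fmt.Println(encode(detail{
		Name:       "qa-wp",
		Namespace:  "qa-omantel",
		Conditions: []map[string]interface{}{},
	}))
	// Left nil: encodes as null, which a TypeScript client typed as an
	// array would have to guard against.
	fmt.Println(encode(detail{Name: "qa-wp", Namespace: "qa-omantel"}))
}
```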

@ -96,7 +96,20 @@ func (h *Handler) HandleK8sList(w http.ResponseWriter, r *http.Request) {
}
q := r.URL.Query()
// qa-loop iter-11 Fix #45 Cluster-C: accept BOTH `?ns=` (the
// historical short form) AND `?namespace=` (the kubectl /
// API-server canonical form that the SPA's `getApplicationStatus`
// helper, the catalog API client, and downstream tooling all emit).
// Prior to this fix `?namespace=qa-omantel` was silently ignored —
// the handler returned the un-filtered list across every namespace
// (TC-262 / TC-263: `?namespace=qa-omantel` returned alloy + newapi
// services + every other namespace's services, with `qa-wp` buried
// in noise). `ns=` wins when both are passed (preserves any caller
// that may have set both for paranoia).
ns := q.Get("ns")
if ns == "" {
ns = q.Get("namespace")
}
limit := parseIntDefault(q.Get("limit"), 500)
if limit < 1 {
limit = 500

@ -320,3 +320,96 @@ func TestHandleK8sStream_EmitsEvent(t *testing.T) {
// keep metav1 imported even if a future test refactor drops the
// explicit reference.
var _ = metav1.GetOptions{}
// TestHandleK8sList_NamespaceAliasFiltering — qa-loop iter-11 Fix #45
// Cluster-C. The handler accepts both `?ns=` (historic short form) and
// `?namespace=` (the kubectl/SPA-canonical form). When neither is set,
// every namespace's items are returned.
func TestHandleK8sList_NamespaceAliasFiltering(t *testing.T) {
podA := newPod("qa-omantel", "qa-wp")
podB := newPod("alloy", "alloy-host")
f := newFactoryWithMultiplePods(t, podA, podB)
h := &Handler{log: quietLog()}
h.SetK8sCache(f, k8scache.NewSARCache(), "X-Forwarded-User")
r := newRouter(h)
type tc struct {
name string
query string
wantCount int
wantNS string
}
cases := []tc{
{"namespace_param_filters_to_qa_omantel", "?namespace=qa-omantel", 1, "qa-omantel"},
{"ns_param_still_works", "?ns=qa-omantel", 1, "qa-omantel"},
{"ns_wins_when_both_set", "?ns=alloy&namespace=qa-omantel", 1, "alloy"},
{"no_filter_returns_all_namespaces", "", 2, ""},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
req := httptest.NewRequest("GET", "/api/v1/sovereigns/alpha/k8s/pod"+c.query, nil)
rec := httptest.NewRecorder()
r.ServeHTTP(rec, req)
if rec.Code != 200 {
t.Fatalf("expected 200, got %d body=%s", rec.Code, rec.Body.String())
}
var resp K8sListResponse
if err := json.NewDecoder(rec.Body).Decode(&resp); err != nil {
t.Fatalf("decode: %v", err)
}
if len(resp.Items) != c.wantCount {
gotNS := []string{}
for _, it := range resp.Items {
gotNS = append(gotNS, it.GetNamespace()+"/"+it.GetName())
}
t.Fatalf("query=%q items=%d want=%d got=%v", c.query, len(resp.Items), c.wantCount, gotNS)
}
if c.wantCount == 1 && resp.Items[0].GetNamespace() != c.wantNS {
t.Fatalf("query=%q expected namespace=%q got %q", c.query, c.wantNS, resp.Items[0].GetNamespace())
}
})
}
}
// newFactoryWithMultiplePods builds an in-memory K8s cache pre-populated
// with N pods across N namespaces — exercises the namespace-filter path
// (single-ns cache wouldn't surface the bug).
func newFactoryWithMultiplePods(t *testing.T, pods ...*unstructured.Unstructured) *k8scache.Factory {
t.Helper()
scheme := runtime.NewScheme()
scheme.AddKnownTypeWithName(schema.GroupVersionKind{Version: "v1", Kind: "PodList"}, &unstructured.UnstructuredList{})
scheme.AddKnownTypeWithName(schema.GroupVersionKind{Version: "v1", Kind: "Pod"}, &unstructured.Unstructured{})
gvrList := map[schema.GroupVersionResource]string{
{Version: "v1", Resource: "pods"}: "PodList",
}
objs := make([]runtime.Object, 0, len(pods))
for _, p := range pods {
objs = append(objs, p)
}
dyn := dynamicfake.NewSimpleDynamicClientWithCustomListKinds(scheme, gvrList, objs...)
core := kfake.NewSimpleClientset()
cfg := k8scache.Config{
Logger: quietLog(),
Registry: minimalRegistry(),
Clusters: []k8scache.ClusterRef{
{ID: "alpha", DynamicClient: dyn, CoreClient: core},
},
}
f, err := k8scache.NewFactory(cfg)
if err != nil {
t.Fatalf("NewFactory: %v", err)
}
if err := f.Start(context.Background()); err != nil {
t.Fatalf("Start: %v", err)
}
t.Cleanup(f.Stop)
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
items, _, _ := f.List("alpha", "pod", nil)
if len(items) >= len(pods) {
return f
}
time.Sleep(20 * time.Millisecond)
}
return f
}

@ -177,6 +177,35 @@ export interface ApplicationStatusResponse {
status?: Record<string, unknown>
}
/**
* ApplicationDetailResponse: body of GET
* /sovereigns/{id}/applications/{name}. Lifts the same fields the
* Sovereign Console's AppDetail page reads in one round-trip:
* identity + spec + roll-up status. Stable shape so the matrix-asserted
* contract (TC-068, TC-095, TC-106) and the SPA's
* findApplicationByName fallback consume the same JSON without
* per-caller post-processing.
*
* qa-loop iter-11 Fix #45 Cluster-C.
*/
export interface ApplicationDetailResponse {
name: string
namespace: string
blueprint?: string
version?: string
environmentRef?: string
placement?: string
regions?: string[]
parameters?: Record<string, unknown>
phase?: string
primaryRegion?: string
giteaRepo?: string
lastReconciledAt?: string
conditions: Array<Record<string, unknown>>
regionStatuses?: Array<Record<string, unknown>>
installedBlueprint?: Record<string, unknown>
}
/** PreviewManifest — one rendered file in the preview output. */
export interface PreviewManifest {
path: string
@ -230,6 +259,28 @@ export async function previewApplication(
return res.json()
}
/**
* getApplication: fetch full Application detail by name.
* Returns null on 404 (not-an-error in the SPA-fallback context).
* qa-loop iter-11 Fix #45 Cluster-C.
*/
export async function getApplication(
sovereignId: string,
name: string,
namespace?: string,
): Promise<ApplicationDetailResponse | null> {
const params = new URLSearchParams()
if (namespace) params.set('namespace', namespace)
const qs = params.toString()
const url = `${applicationsBase(sovereignId)}/${encodeURIComponent(name)}${qs ? '?' + qs : ''}`
const res = await authedFetch(url, { headers: { Accept: 'application/json' } })
if (res.status === 404) return null
if (!res.ok) {
throw new Error(`getApplication: HTTP ${res.status}`)
}
return res.json()
}
export async function getApplicationStatus(
sovereignId: string,
name: string,

@ -33,6 +33,7 @@
import { useMemo, useState } from 'react'
import { useParams, Link } from '@tanstack/react-router'
import { useQuery } from '@tanstack/react-query'
import { useWizardStore } from '@/entities/deployment/store'
import { PortalShell } from './PortalShell'
import { JobsTable } from './JobsTable'
@ -43,6 +44,7 @@ import { adaptDerivedJobsToFlat } from './jobsAdapter'
import { findComponent } from '@/pages/wizard/steps/componentGroups'
import { useResolvedDeploymentId } from '@/shared/lib/useResolvedDeploymentId'
import type { ApplicationStatus } from './eventReducer'
import { getApplication, type ApplicationDetailResponse } from '@/lib/catalog.api'
import { ComplianceTab } from './AppDetail/ComplianceTab'
import { MembersTab } from './AppDetail/MembersTab'
import { TopologyTab } from './AppDetail/TopologyTab'
@ -83,9 +85,86 @@ export function AppDetail({ disableStream = false }: AppDetailProps = {}) {
})
const sovereignFQDN = snapshot?.sovereignFQDN ?? snapshot?.result?.sovereignFQDN ?? null
const app: ApplicationDescriptor | undefined = findApplication(applications, componentId)
const wizardApp: ApplicationDescriptor | undefined = findApplication(applications, componentId)
// qa-loop iter-11 Fix #45 Cluster-C: when the requested component is
// NOT part of the wizard's selectedComponents (the typical case for a
// chroot Sovereign — Applications installed via `kubectl apply` or
// the catalyst-api install endpoint NEVER pass through the wizard
// store), fall back to the catalyst-api's
// GET /sovereigns/{id}/applications/{name} endpoint to fetch the
// Application CR directly and synthesise an ApplicationDescriptor on
// the fly. Prior to this fallback every chroot-installed Application
// surfaced the misleading "App not found / The component qa-wp is
// not part of this deployment" page even though the Application CR +
// HelmRelease were both Ready=True (TC-068 / TC-072 / TC-074 et al.
// failed for this exact reason).
//
// The fallback only runs when (a) we're on a Sovereign route (i.e.
// deploymentId resolved from sovereign-self, not a wizard URL) AND
// (b) the wizard didn't already supply a descriptor. We use the
// sovereign-self deploymentId as the catalyst-api {id} URL segment.
const needsApiFallback = !wizardApp && !!deploymentId && !!componentId
const apiAppQuery = useQuery({
queryKey: ['sov-application', deploymentId, componentId],
queryFn: async () => getApplication(deploymentId, componentId),
enabled: needsApiFallback,
staleTime: 30_000,
retry: 1,
})
const apiApp: ApplicationDetailResponse | null | undefined = apiAppQuery.data
// Synthesise an ApplicationDescriptor from the API response so the
// rest of the page (hero, sections, tabs) can render unchanged. The
// descriptor's bareId is derived from the Blueprint name (strip
// `bp-` prefix) so reverse-dependency lookups + component-groups
// metadata can still resolve.
//
// Defensive: only synthesise when the API returned a meaningful
// Application body (i.e. at least one of .name or .blueprint is
// set). A 404 returns null (handled), but a 200 with an unrelated
// body shouldn't be coerced into an Application.
const synthesisedApp: ApplicationDescriptor | undefined = useMemo(() => {
if (!apiApp) return undefined
if (!apiApp.name && !apiApp.blueprint) return undefined
const bareId = (apiApp.blueprint ?? '').replace(/^bp-/, '') || componentId
const compEntry = findComponent(bareId)
return {
id: apiApp.blueprint || `bp-${bareId}`,
bareId,
title: compEntry?.name ?? apiApp.name ?? componentId,
description: compEntry?.desc ?? `Application installed in namespace ${apiApp.namespace}`,
familyId: compEntry?.product ?? 'platform',
familyName: compEntry?.groupName ?? 'Platform',
tier: compEntry?.tier ?? 'optional',
logoUrl: compEntry?.logoUrl ?? null,
dependencies: compEntry?.dependencies ?? [],
bootstrapKit: false,
}
}, [apiApp, componentId])
const app: ApplicationDescriptor | undefined = wizardApp ?? synthesisedApp
const compState = state.apps[componentId]
const status: ApplicationStatus = compState?.status ?? 'pending'
// Roll up status: prefer wizard-stream signal (live SSE deltas),
// fall back to API-fetched phase mapped to the legacy status vocab
// (pending | installing | installed | failed | degraded).
const apiPhaseStatus: ApplicationStatus | undefined = useMemo(() => {
if (!apiApp?.phase) return undefined
switch (apiApp.phase) {
case 'Ready':
return 'installed'
case 'Failed':
return 'failed'
case 'Degraded':
return 'degraded'
case 'Provisioning':
case 'Pending':
return 'installing'
default:
return 'pending'
}
}, [apiApp])
const status: ApplicationStatus = compState?.status ?? apiPhaseStatus ?? 'pending'
// Bundled dependencies — descriptors of every direct dep, with
// human names sourced from componentGroups when available.
@ -151,6 +230,36 @@ export function AppDetail({ disableStream = false }: AppDetailProps = {}) {
}, [app])
if (!app) {
// While the API fallback is in flight, render a transient
// "Loading…" surface instead of the misleading "not found" page —
// the not-found page made the matrix-asserted Overview tokens fail
// (TC-068 expects "Ready", TC-072 expects "Service" etc.) and the
// operator UI flashed an error chip during normal page loads.
if (needsApiFallback && (apiAppQuery.isPending || apiAppQuery.isFetching)) {
return (
<PortalShell
deploymentId={deploymentId}
sovereignFQDN={sovereignFQDN}
pageTitle="Loading…"
headerSlotLeft={
<Link
to={`/dashboard` as never}
className="text-[11px] text-[var(--color-text-dim)] hover:text-[var(--color-text)] no-underline"
data-testid="sov-back-link"
>
&larr; Back to apps
</Link>
}
>
<style>{APP_DETAIL_CSS}</style>
<div className="detail-page">
<div className="not-found" data-testid="sov-app-loading">
<p>Loading {componentId}</p>
</div>
</div>
</PortalShell>
)
}
return (
<PortalShell
deploymentId={deploymentId}

@ -1,5 +1,39 @@
apiVersion: v2
name: bp-catalyst-platform
# 1.4.117 (qa-loop iter-11 Fix #45 Cluster-B + Cluster-C —
# application-controller HR observation + catalyst-api SPA endpoints).
#
# Cluster-B (application-controller observes downstream HelmRelease):
# - Reconciler now polls per-region HelmRelease.status.conditions[Ready]
# after every reconcile pass and rolls up the Application's
# status.phase: any region Ready=True → phase=Ready, any
# Ready=False → phase=Degraded, no HR yet → phase=Provisioning.
# - Periodic 30s re-list ticker (Run goroutine) ensures HR readiness
# flips reach Application.status.phase even though the Application
# Watch doesn't fire on sibling HR changes.
# - Application-controller ClusterRole gains
# helm.toolkit.fluxcd.io/helmreleases get/list/watch.
# - status.lastReconciledAt populated on every pass for TC-113.
# - Without this fix Application sat at Provisioning indefinitely
# even after `kubectl get hr -n qa-omantel qa-wp` was Ready=True
# for hours; matrix TC-066 / TC-100 / TC-104 / TC-113 stayed FAIL.
#
# Cluster-C (catalyst-api SPA endpoints + namespace alias):
# - GET /sovereigns/{id}/applications/{name} returns full Application
# detail (identity + spec + status) so the SPA AppDetail page can
# synthesise an ApplicationDescriptor for chroot-installed
# Applications that aren't part of the wizard's selectedComponents.
# Unblocks TC-068 / TC-072 / TC-074 et al ("App not found" misfire).
# - GET /sovereigns/{id}/k8s/{kind} accepts both ?ns= and ?namespace=
# query params (was: only ?ns=, silently ignored ?namespace=). The
# SPA + kubectl-canonical clients all emit ?namespace=; without the
# alias TC-262 / TC-263 returned every namespace's services.
# - SPA AppDetail.tsx falls back to GET /applications/{name} when the
# wizard store has no descriptor for the requested componentId
# (the typical chroot Sovereign case).
#
# Image bumps follow this chart bump in the same PR.
#
# 1.4.116 (qa-loop iter-10 Fix #44 follow-up — chart re-publish).
# Chart 1.4.115 was published from the merge commit which still had
# the OLD application-controller image tag (a3ba200) baked into
@ -471,7 +505,7 @@ name: bp-catalyst-platform
# so the matrix's `kubectl get cnpgpair` stdout contains the literal
# "cnpgpair" substring TC-306 asserts on (envsubst override beat the
# chart values default fixed in PR #1247).
version: 1.4.116
version: 1.4.117
appVersion: 1.4.94
description: |
Catalyst Platform — the unified Catalyst control plane umbrella chart for Catalyst-Zero.

@ -62,4 +62,15 @@ rules:
- apiGroups: ["kustomize.toolkit.fluxcd.io"]
resources: ["kustomizations"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
# qa-loop iter-11 Fix #45 Cluster-B: the controller observes the
# downstream HelmRelease's status.conditions[Ready] to roll up the
# Application.status.phase. Read-only — controller never writes
# HelmRelease objects (Flux owns the writes; the controller only
# commits the YAML to Gitea). Without this grant the read fails
# with `helmreleases.helm.toolkit.fluxcd.io is forbidden` and the
# phase stays at Provisioning forever (the live failure on
# omantel iter-11 — TC-066 / TC-100 / TC-104 / TC-113 stayed FAIL).
- apiGroups: ["helm.toolkit.fluxcd.io"]
resources: ["helmreleases"]
verbs: ["get", "list", "watch"]
{{- end }}
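
The roll-up this grant enables (any region Ready=True gives phase=Ready, any Ready=False gives phase=Degraded, no HelmRelease yet gives phase=Provisioning) might look roughly like the sketch below. This is not the controller's code: `regionReady` and `rollUp` are hypothetical names, the Ready condition is simplified to a boolean (Unknown is not modelled), and the source does not say which rule wins when regions disagree, so the sketch simply applies the rules in the order they are listed.

```go
package main

import "fmt"

// regionReady is the Ready condition observed on one region's HelmRelease.
// Hypothetical type; the real controller reads status.conditions.
type regionReady struct {
	Region  string
	Ready   bool
	Message string
}

// rollUp derives an Application phase from per-region readiness.
// ASSUMPTION: rules are applied in the listed order, so a single Ready
// region yields "Ready" even if another region is not ready.
func rollUp(regions []regionReady) (phase, message string) {
	if len(regions) == 0 {
		return "Provisioning", "" // no HelmRelease observed yet
	}
	for _, r := range regions {
		if r.Ready {
			return "Ready", ""
		}
	}
	// No region is ready: surface the first region's message verbatim.
	return "Degraded", regions[0].Message
}

func main() {
	fmt.Println(rollUp(nil))
	fmt.Println(rollUp([]regionReady{{Region: "fsn1", Ready: true}}))
	fmt.Println(rollUp([]regionReady{{Region: "fsn1", Ready: false, Message: "install retries exhausted"}}))
}
```

The region name and message in `main` are made-up sample values.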