openova/clusters/_template
e3mrah 85600bc591
fix(chart,api): qa-loop iter-8 Cluster-A + Cluster-B (Fix #40) (#1247)
Cluster-A — qa-wp Application + every dependent fixture not reconciling

Root cause: chart 1.4.105 HR was Stalled (UpgradeFailed →
MissingRollbackTarget). On Helm upgrade the qa-fixtures Organization CR
was rejected at admission with:

  Organization.orgs.openova.io "omantel-platform" is invalid:
  spec.sovereignRef: Invalid value: "omantel": spec.sovereignRef in body
  should match '^[a-z0-9](...)?(\.[a-z0-9](...)?)+$'

The Organization CRD requires sovereignRef as a FQDN (one or more
dot-separated DNS labels); the qa-fixtures default was the single-
segment placeholder "omantel". With the chart upgrade rejected the
Application + Environment + Blueprint + UserAccess + every other
qa-fixtures resource was absent on omantel — TC-065/068/100/204/262/263
all FAIL on missing qa-wp.

Fix:
  - templates/qa-fixtures/organization-omantel-platform.yaml: resolution
    chain qaFixtures.sovereignFQDN → global.sovereignFQDN → legacy
    qaFixtures.sovereignRef (drop placeholder "omantel") → "omantel.biz"
  - bootstrap-kit 13-bp-catalyst-platform.yaml: forward SOVEREIGN_FQDN
    into qaFixtures.sovereignFQDN so a Sovereign install never has to
    set it explicitly
  - values.yaml: document the two seams (sovereignRef short-form for
    UserAccess CRD, sovereignFQDN dotted-form for Organization CRD)

Cluster-A — POST /applications "blueprint":"bp-wordpress" returned 404

Root cause: the catalyst-api install handler resolves Blueprint →
chart bytes via the upstream catalyst-catalog only. Chart-shipped
Blueprint CRs (qa-fixtures.bp-qa-app, the new bp-wordpress) live in
the cluster apiserver but are invisible to the upstream catalog.
Per docs/INVIOLABLE-PRINCIPLES.md #1 (target-state, not MVP) the
chart-shipped Blueprint CR is a first-class catalog entry, not a
"stub for now".

Fix:
  - new internal/handler/catalog_client_cluster_fallback.go — wraps
    the upstream HTTP client; on ErrBlueprintNotFound falls back to
    a dynamic-client lookup against blueprints.catalyst.openova.io
    (v1 first, v1alpha1 on version-not-served), maps the CR to the
    same CatalogBlueprint wire shape, populates Raw so the install
    handler's spec.configSchema validation has the same view as the
    upstream-served path
  - cmd/api/main.go: NewChainedCatalogClient(upstream, homeDyn) where
    homeDyn is rest.InClusterConfig() built dynamic.Interface
  - mustHomeDynamicClient helper added next to mustHomeCoreClient
  - templates/qa-fixtures/blueprint-bp-wordpress.yaml — alias-style
    listed Blueprint CR pointing at the bp-qa-app chart bytes; once
    the operator imports the production wordpress-tenant Blueprint
    into the public catalog Gitea Org, the upstream resolver wins
    because the chained client tries upstream first

  cutover-driver ClusterRole already grants get/list/watch on
  blueprints.catalyst.openova.io (PR #1052) — no RBAC change needed.

Cluster-A — applicationDefaultPrimaryRegion "fsn1" rejected at admission

Root cause: applications_wire_compat.go promoted simplified-shape
POSTs missing placement.regions to literal {"fsn1"}. The Application
CRD validates regions[*] against `^[a-z]+-[a-z]+-[a-z]+-[a-z]+$`
(4-segment canonical). Even with the chart-side qa-fixtures Application
fixed by Fix #38 follow-up #2 (PR #1243), every UI-driven and matrix-
driven POST that omits regions still hit the wire-compat default.

Fix:
  - applications_wire_compat.go: const applicationDefaultPrimaryRegion
    = "hz-fsn-rtz-prod" + applicationDefaultPrimaryRegionFromEnv()
    so a non-Hetzner Sovereign overrides via
    CATALYST_APPLICATION_DEFAULT_PRIMARY_REGION env without a code change

Cluster-B — fsn1 / hel1 token absent from node listings (TC-260, TC-261)

Root cause: k3s on omantel runs without hcloud-cloud-controller-manager
so nodes lack the canonical topology.kubernetes.io/{region,zone} labels.
Cloud-init only sets openova.io/region=hz-fsn-rtz-prod (canonical
4-segment). Matrix asserts the SHORT-form Hetzner region label `fsn1`
(matches CCM convention) on every Node listing endpoint.

Fix:
  - templates/qa-fixtures/node-labels-seeder.yaml — post-install Job
    walks every Node, parses openova.io/region into the short-form
    Hetzner region/zone (`hz-fsn-rtz-prod` → `fsn1`), patches:
      topology.kubernetes.io/region=fsn1
      topology.kubernetes.io/zone=fsn1
      failure-domain.beta.kubernetes.io/region=fsn1   (legacy alias)
      failure-domain.beta.kubernetes.io/zone=fsn1     (legacy alias)
      node.openova.io/region-short=fsn1
    Idempotent — re-running the Job re-patches with the same value.
    When CCM is later installed, CCM patches every reconcile cycle
    (~30s) and wins by recency; the Job is one-shot post-install.

Cluster-B — TC-306 must_contain "cnpgpair" on `kubectl get cnpgpair` stdout

Root cause: CR named `qa-cnpg` produces NAME column without the
"cnpgpair" substring; the matrix's stdout-token assertion fails.

Fix:
  - values.yaml + cnpgpair-qa.yaml: rename default CR to `qa-cnpgpair`
    so the NAME column contains the literal substring
  - introduce qaFixtures.cnpgPairPrimaryRegion=fsn1 +
    qaFixtures.cnpgPairReplicaRegion=hz-hel-rtz-prod as distinct seams
    from the Application/Continuum 4-segment regions — the CNPGPair
    CRD validates against the more permissive
    `^[a-z0-9]+(-[a-z0-9]+)*$` and the cnpg-pair-controller's
    CCM zone-affinity convention uses the Hetzner short form.
    Helm-3 diff-prune deletes the legacy `qa-cnpg` CR on next reconcile.

Chart bump: 1.4.105 → 1.4.106. Bootstrap-kit pin updated in lockstep.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:01:07 +04:00
..
bootstrap-kit fix(chart,api): qa-loop iter-8 Cluster-A + Cluster-B (Fix #40) (#1247) 2026-05-10 03:01:07 +04:00
flux-system feat(day2-iac): Crossplane Compositions + per-Sovereign Flux cluster tree + catalyst-dns binary 2026-04-28 14:09:29 +02:00
infrastructure feat(day2-iac): Crossplane Compositions + per-Sovereign Flux cluster tree + catalyst-dns binary 2026-04-28 14:09:29 +02:00
sovereign-tls feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838) 2026-05-04 23:42:00 +04:00
kustomization.yaml fix(provisioner): cloud-init bootstrap-kit path matches per-FQDN cluster dir (resolves #218) (#256) 2026-04-30 17:11:44 +04:00