Commit Graph

264 Commits

Author SHA1 Message Date
e3mrah
4a77a624bc
fix(infra): wire NetBird, DMZ vCluster, Hubble UI, BGP, Gitea client — qa-loop iter-12 Fix #53B+C (#1275)
* fix(infra): wire NetBird, DMZ vCluster, Hubble UI, BGP, Gitea client — qa-loop iter-12 Fix #53B+C

Phase-4 infra installs from iter-12 diagnostic audit (37 of 41 e-blocked TCs covered):

bp-catalyst-platform 1.4.120 → 1.4.122 — Gitea client wired (cluster B, 4 TCs):
- catalyst-api Deployment now reads CATALYST_GITEA_URL + CATALYST_GITEA_TOKEN from `catalyst-gitea-token` Secret (mirrors blueprint-controller pattern).
- Unblocks /api/v1/sovereigns/.../blueprints/{publish,curatable,curate,edit-pr} which previously returned 503 "Gitea client unconfigured".
- TC-081, TC-082, TC-083, TC-085.

bp-netbird 0.1.0 → 0.1.1 + slot 53 install (cluster C, 4 TCs):
- Pinned image tags (netbirdio/management:0.34.0, signal:0.34.0, coturn:4.6.2) so the chart renders without a CI mirror cycle.
- Bootstrap-kit slot 53 enables NetBird on omantel; OIDC issuer points at the new omantel realm (Fix #53A).
- TC-281, TC-282, TC-283, TC-284.

bp-dmz-vcluster 0.1.0 → 0.1.1 + slot 54 install (cluster C, 3 TCs):
- Pinned upstream loft-sh/vcluster:0.20.0 tag.
- Bootstrap-kit slot 54 enables DMZ vCluster `omantel-dmz` on omantel.
- TC-286, TC-287, TC-288.

bp-cilium chart pin 1.2.0 → 1.3.0 + Hubble UI ingress + BGP (cluster C, 3 TCs):
- Hubble relay + UI enabled in omantel cilium overlay.
- catalystOverlay.hubbleUI block enables HTTPRoute hubble.console.omantel.biz; external-dns auto-creates the DNS record.
- bgpControlPlane.enabled=true for multi-region peering (TC-349).
- TC-289, TC-290, TC-349.

Total: 14 TCs covered (10 of the 25 cluster-C TCs + 4 cluster-B TCs).

* fix(catalyst-api): use literal in-cluster Gitea URL (Helm-template breaks Kustomize parse) — qa-loop iter-12 Fix #53C follow-up
2026-05-10 10:50:36 +04:00
e3mrah
0a11107630
fix(keycloak): parameterize realm name (target-state realm-per-Sovereign) — qa-loop iter-12 Fix #53A (#1271)
* fix(keycloak): parameterize realm name (target-state realm-per-Sovereign) — qa-loop iter-12 Fix #53A

Per `feedback_no_mvp_no_workarounds.md` target-state rule + matrix assertion drift on TC-124, TC-125, TC-159, TC-160, TC-161, TC-176, TC-190, TC-285 (8 TCs in iter-12 audit Phase 4 cluster A): each Sovereign owns its KC realm named after the tenant short-name, not a hardcoded literal `sovereign`.

bp-keycloak chart 1.4.1 → 1.5.0:
- New value `sovereignRealm.name` (default `sovereign` for backward compat with overlays not yet migrated)
- New value `sovereignRealm.displayName` (default `Sovereign`)
- Realm import JSON `"realm"` field + catalyst-kc-sa-credentials Secret `realm` key both flow from `$realmName` so Keycloak realm name and catalyst-api `CATALYST_KC_REALM` env stay in sync (no auth-mismatch risk)

omantel chroot overlay:
- bp-keycloak HelmRelease pinned to chart 1.5.0
- `sovereignRealm.name: omantel` + `displayName: "Omantel Sovereign"` per matrix tenant convention

bp-catalyst-platform 1.4.120 → 1.4.121: chart bump triggers catalyst-api StatefulSet restart so it picks up the new mirrored Secret with realm=omantel. The cutover step-06 patches HR.spec.chart.spec.version dynamically per `incidents.md`.

Backward compat: charts not setting sovereignRealm.name (otech, _template) keep realm `sovereign` (no behaviour change). The contabo Catalyst-Zero realm `openova` is a separate KC instance untouched by this change.

* fix(blueprint): bump bp-keycloak blueprint.yaml to 1.5.0 to match Chart.yaml — qa-loop iter-12 Fix #53A follow-up
2026-05-10 10:48:09 +04:00
e3mrah
142d42e725
fix(cilium): clustermesh-apiserver NodePort → LoadBalancer (path-1) — qa-loop iter-12 Fix #53D (#1274)
* fix(cilium): clustermesh-apiserver Service NodePort → LoadBalancer (path-1) — qa-loop iter-12 Fix #53D

Per qa-loop-state/incidents.md remediation table path-1 + feedback_no_mvp_no_workarounds.md "no operational hacks": the existing NodePort 32379 was the workaround that caused Hetzner's stateful firewall to silently drop cross-region SYN packets to BPF-only NodePorts (no LISTEN socket on the host). The canonical multi-region transport is a per-peer Hetzner LoadBalancer via the cloud-controller-manager.

Affects: omantel-fsn chroot Sovereign (this PR). Other Sovereigns (otech, _template) keep their existing setting.

PRECONDITION (separate bootstrap-kit slot, follow-up): Hetzner cloud-controller-manager (hcloud-ccm) must be installed AND each k3s node's spec.providerID rewritten from `k3s://...` to `hcloud://<server-id>` so the LB Service materializes. Without CCM the LB sits in `<pending>` but does not break in-cluster operation (ClusterIP still works for the local cilium-agent).

Test matrix coverage when CCM is also live: TC-260, TC-261, TC-241, TC-050, TC-308, TC-310, TC-311, TC-314, TC-298, TC-297, TC-340, TC-349 (multi-region tests blocked by NodePort filtering).

* fix(blueprint): bump bp-gitea blueprint.yaml to 1.2.5 to match Chart.yaml — pre-existing main drift

* fix(blueprint): bump bp-keycloak blueprint.yaml to 1.4.1 to match Chart.yaml — pre-existing main drift
2026-05-10 10:45:11 +04:00
github-actions[bot]
214a946f83
deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.12
2026-05-10 03:56:07 +00:00
e3mrah
d7a0c8de12
fix(bp-guacamole): migrationImage = bitnamilegacy/kubectl:1.29.3 (Fix #45 Cluster-A follow-up)
Live ImagePullBackOff observed on omantel iter-11: the storageClass-
migration pre-upgrade hook landed but the Sovereign's Harbor docker.io
proxy 401'd on `bitnami/kubectl:1.30.4` (the chart's default migration
image), leaving the Job in BackOff and the bp-guacamole HelmRelease
Reconciling forever.

Bumps the default to `docker.io/bitnamilegacy/kubectl:1.29.3` — the
canonical kubectl surface every other Catalyst Blueprint already pulls
on omantel (cache-resident across the cluster). 0.1.9 → 0.1.11.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 05:55:20 +02:00
github-actions[bot]
733f7c94c2
deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.10
2026-05-10 03:26:32 +00:00
e3mrah
dfd48b1626
fix(chart,api,controllers,ui): qa-loop iter-11 Fix #45 — three-cluster closeout (#1265)
Cluster-A (bp-guacamole PVC immutability):
  - New pre-install/pre-upgrade Helm hook (Job + per-release SA/Role/
    RoleBinding + cluster-scoped CR/CRB for PV cleanup) that detects
    when an existing `guacamole-recordings` PVC is bound to a
    storageClass different from `.Values.guacamole.recordings.storageClass`
    and deletes the PVC + bound PV so the chart-side PVC manifest can
    recreate cleanly. Closes the live bp-guacamole HelmRelease wedge on
    omantel iter-11 (`PersistentVolumeClaim ... is invalid: spec:
    Forbidden: spec is immutable after creation`).
  - Operator escape hatch: `.Values.guacamole.recordings.allowMigration:
    false` suppresses the hook for Sovereigns with long-lived recording
    state.
  - Render test extended (15 docs total, plus toggle assertion).
  - bp-guacamole chart 0.1.8 → 0.1.9; bootstrap-kit slot pin bumped
    in both _template and omantel.omani.works overlays.

Cluster-B (Application phase stuck on Provisioning):
  - application-controller now observes the per-region downstream
    HelmRelease.status.conditions[Ready] and rolls up
    Application.status.phase: any region Ready=True → phase=Ready,
    any Ready=False → phase=Degraded, no HR yet → phase=Provisioning.
  - Periodic 30s re-list ticker (Run goroutine) so HR readiness flips
    reach the Application even though the Application Watch doesn't
    fire on sibling HR changes.
  - status.lastReconciledAt populated on every reconcile pass for
    TC-113.
  - application-controller ClusterRole gains
    helm.toolkit.fluxcd.io/helmreleases get/list/watch.
  - 3 new unit tests (HR Ready=True → phase=Ready, HR Ready=False →
    phase=Degraded with verbatim message, no-HR → phase=Provisioning).
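
The rollup rule above can be sketched as a small pure function. This is an illustrative Python sketch, not the controller's code: the function name and the `True`/`False`/`None` encoding of the per-region Ready condition are assumptions, and the precedence for mixed regions (Ready checked before Degraded) is only inferred from the order in which the rule is listed.

```python
from typing import Iterable, Optional


def rollup_phase(region_ready: Iterable[Optional[bool]]) -> str:
    """Roll per-region HelmRelease Ready conditions up into one phase.

    Each item is True (Ready=True), False (Ready=False), or None
    (the HelmRelease has not reported a Ready condition yet).
    """
    states = list(region_ready)
    if not states:
        return "Provisioning"  # no HelmRelease exists yet
    # ASSUMPTION: checks run in the order the commit message lists them,
    # so a single Ready=True region wins over a Ready=False sibling.
    if any(state is True for state in states):
        return "Ready"
    if any(state is False for state in states):
        return "Degraded"
    return "Provisioning"  # HelmReleases exist but none has reported yet
```

Keeping the rollup free of Kubernetes client calls makes it easy to cover with table-style unit tests like the three the commit mentions.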

Cluster-C (SPA AppDetail + k8s services namespace filter):
  - GET /api/v1/sovereigns/{id}/applications/{name} returns full
    Application detail (identity + spec + status). The SPA AppDetail
    page now falls back to this endpoint when the wizard store has no
    descriptor for the requested componentId — the typical chroot
    Sovereign case where Apps are installed via `kubectl apply` /
    catalyst-api install endpoint, NOT via the wizard. Without the
    fallback every chroot-installed Application surfaced "App not
    found / The component qa-wp is not part of this deployment"
    even though the underlying CR was Ready=True. Closes TC-068 /
    TC-072 / TC-074 / TC-076 / TC-077 / TC-079 et al.
  - GET /api/v1/sovereigns/{id}/k8s/{kind} accepts BOTH `?ns=`
    (historic) AND `?namespace=` (kubectl/SPA-canonical). Without
    the alias TC-262 / TC-263 returned every namespace's services
    instead of qa-omantel-only. New test covers all 4 query
    permutations.
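
A minimal sketch of the `?ns=` / `?namespace=` alias, assuming the handler only needs one effective namespace filter. The function name is hypothetical, and letting `namespace` win when both keys are present is an assumption; the commit only says both keys are accepted.

```python
from typing import Optional
from urllib.parse import parse_qs


def namespace_filter(query: str) -> Optional[str]:
    """Return the namespace to filter on, or None for "all namespaces"."""
    params = parse_qs(query)  # blank values such as "ns=" are dropped
    # ASSUMPTION: the kubectl/SPA-canonical key takes precedence.
    for key in ("namespace", "ns"):
        values = params.get(key)
        if values:
            return values[0]
    return None
```

The four permutations are then: no key, `ns` only, `namespace` only, and both keys together.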

Chart bumps:
  - bp-catalyst-platform 1.4.116 → 1.4.117 (+ pin in
    clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml).
  - bp-guacamole 0.1.8 → 0.1.9.

Refs: qa-loop iter-11 Fix #45 (Cluster-A + Cluster-B + Cluster-C);
post-merge image SHAs land via the catalyst-api / catalyst-controllers
build workflows + the bp-guacamole / bp-catalyst-platform release
workflows.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 07:26:05 +04:00
e3mrah
ba4a632298
fix(bp-qa-app): annotate no-upstream to satisfy hollow-chart guard (#1261)
bp-qa-app ships only Catalyst-authored nginx Deployment+Service+
ConfigMap; no upstream Helm dependency. Blueprint Release CI
hollow-chart guard rejected the chart for missing 'dependencies:'.
Adds canonical opt-out annotation per docs/BLUEPRINT-AUTHORING.md
§11.1.

Unblocks qa-wp Application install on omantel chroot — qa-wp
HelmRelease has been waiting on bp-qa-app:0.1.0 OCI publish since
Fix #36. Iter-9 + iter-10 TC-065/068/100/204/262/263 will flip
PASS once this lands and Flux pulls the chart.
2026-05-10 04:51:13 +04:00
github-actions[bot]
5f4cdf4210
deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.8
2026-05-10 00:42:06 +00:00
e3mrah
bad8484296
fix(bp-guacamole): webapp replicas=1 + 256Mi for single-node profile (qa-loop iter-9 infra) (#1259)
* fix(bp-guacamole): webapp replicas=1, request=256Mi for single-node-per-region

omantel chroot single-node profile + catalyst-api PVC node-affinity to w3
+ 2x 512Mi guacamole-server webapp replicas saturated w3 worker memory
(99% allocated) — catalyst-api Pod could not reschedule on chart roll,
causing repeated outages of console.omantel.biz during HR upgrades.

Reduces webapp default to 1 replica with 256Mi request (768Mi limit).
Sovereigns with multi-node-per-region capacity override via
values.guacamole.webapp.replicas.

Bumps bp-guacamole chart 0.1.6 -> 0.1.7.

* fix(bp-guacamole): bump chart 0.1.6 -> 0.1.7
2026-05-10 04:41:33 +04:00
github-actions[bot]
71bf41e215
deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.6
2026-05-09 22:13:39 +00:00
e3mrah
f58acd4962
fix(chart): bp-guacamole webapp /home/guacamole/.guacamole emptyDir mount (Fix #39 follow-up) (#1242)
* fix(omantel): bp-guacamole storageClass=local-path + webapp replicas=1 (Fix #39 follow-up)

Live omantel reconciliation surfaced two single-cluster realities:

1. seaweedfs-storage StorageClass is not present on the omantel chroot
   (only local-path is). The chart default `seaweedfs-storage` is the
   correct multi-region target-state shape, but omantel's overlay
   needs to override to local-path until SeaweedFS-CSI is deployed.

2. Memory-constrained omantel worker nodes (3 of 4 reported
   "Insufficient memory" for a 512Mi-request webapp pod) cannot
   schedule 2 replicas alongside the rest of the catalyst-system
   stack. Single-replica is acceptable for omantel single-tenant
   chroot; multi-region Sovereigns get chart default (2).

Both are per-Sovereign overlay overrides, NOT chart-default changes
(chart defaults stay at the canonical multi-region target-state
shape per `feedback_no_mvp_no_workarounds.md` rule #1).

After this lands, omantel reconciles → guacamole-recordings PVC
binds → guacamole-server pod schedules → 1/1 Available → TC-228 /
TC-230 / TC-245 / TC-246 flip PASS on iter-8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): bp-guacamole webapp /home/guacamole/.guacamole emptyDir mount (Fix #39 follow-up)

Live omantel reconciliation surfaced that bp-guacamole webapp pods
crash-loop with `mkdir: cannot create directory
'/home/guacamole/.guacamole': Read-only file system` because the
chart sets readOnlyRootFilesystem=true but doesn't mount a writable
emptyDir at the home directory the webapp writes to on first start
(logback marker, optional auth state).

Add an emptyDir volume + volumeMount at /home/guacamole/.guacamole
so the webapp can write its per-user runtime state without escaping
the readOnlyRootFilesystem boundary.

Chart: bp-guacamole 0.1.4 → 0.1.5 (CI auto-bump → 0.1.6)
Slot pins: 0.1.4 → 0.1.6 (post-CI auto-bump)

Affects every Sovereign — chart-default fix, not omantel-only
overlay (per `feedback_no_mvp_no_workarounds.md` rule #1: target-state
chart shape).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 02:13:11 +04:00
github-actions[bot]
820dc29ada
deploy: bump bp-k8s-ws-proxy to image 8047232 chart 0.1.5
2026-05-09 22:06:14 +00:00
github-actions[bot]
c2787bd0ee
deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.4
2026-05-09 22:05:19 +00:00
e3mrah
8047232a7b
fix(chart,bootstrap-kit): default imagePullSecrets to ghcr-pull (Fix #39 follow-up) (#1240)
omantel reconciliation surfaced that bp-k8s-ws-proxy DaemonSet pods
(and bp-guacamole Deployments) cannot pull from private
ghcr.io/openova-io/openova/* images without imagePullSecrets:

  Failed to pull image "ghcr.io/openova-io/openova/k8s-ws-proxy:650696d":
  failed to authorize: failed to fetch anonymous token ... 401 Unauthorized

The catalyst-system namespace's `ghcr-pull` secret is the canonical
pull-credential surface across every Sovereign (catalyst-api,
catalyst-ui, marketplace-api etc. all mount it). Defaulting both
charts to `imagePullSecrets: [{name: ghcr-pull}]` removes the
per-Sovereign overlay requirement.

Charts
------
- bp-k8s-ws-proxy 0.1.3 → 0.1.4: values.yaml.k8sWsProxy.imagePullSecrets
- bp-guacamole    0.1.2 → 0.1.3: values.yaml.guacamole.imagePullSecrets

(Both charts will auto-bump again to 0.1.5/0.1.4 when the build/mirror
workflows fire on this PR's chart-touch — slot pins target those
post-CI versions.)

Bootstrap-kit slot pins
-----------------------
- _template + omantel slot 51 (bp-k8s-ws-proxy): 0.1.3 → 0.1.5
- _template + omantel slot 52 (bp-guacamole):    0.1.2 → 0.1.4

After merge: omantel reconciles → DaemonSet pods Running → bp-guacamole
HR Ready → guacd + guacamole-server Deployments Available → TC-228 /
TC-230 / TC-236 / TC-237 / TC-245 / TC-246 flip PASS.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 02:04:45 +04:00
github-actions[bot]
3dea4e2cd8
deploy: bump bp-k8s-ws-proxy to image 650696d chart 0.1.3
2026-05-09 21:55:00 +00:00
e3mrah
650696d185
fix(chart): bp-k8s-ws-proxy render test explicitly clears image.tag (Fix #39 follow-up) (#1237)
Blueprint Release run 25612688419 caught a stale-tag assertion in
platform/k8s-ws-proxy/chart/tests/render.sh test #2. After the
build-k8s-ws-proxy.yaml promote job auto-bumped values.yaml
`image.tag` to a real SHA, the test's `--set k8sWsProxy.enabled=true`
without explicitly clearing the tag rendered fine and tripped
"FAIL: empty tag did not abort render".

The fail-fast contract (empty tag → render fail per _helpers.tpl) is
unchanged; the test now explicitly passes `--set k8sWsProxy.image.tag=` to
exercise the operator-override path. This mirrors the pattern already
applied to the bp-guacamole render test in the parent PR.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:53:43 +04:00
github-actions[bot]
741d57988b
deploy: bump bp-k8s-ws-proxy to image 5ca0a7d chart 0.1.2
2026-05-09 21:50:37 +00:00
github-actions[bot]
d280f6a7a5
deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.2
2026-05-09 21:49:24 +00:00
e3mrah
5ca0a7d178
fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236)
* fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots

Closes the scope narrowing that Fix #36 confessed: bp-guacamole +
bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI
image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 /
TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment
NotFound".

CI workflows
------------
- .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless
  sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/**, then bumps
  platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml
  patch version + dispatches blueprint-release.
- .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache
  Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry
  we own — no Docker Hub rate limits, no upstream availability risk),
  bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches
  blueprint-release.

Charts (target-state)
---------------------
- bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy`
  regardless of release name (DaemonSet + Service + ClusterRole +
  ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so
  matrix can address them by canonical short name).
- bp-guacamole v0.1.1: canonical short resource names (`guacd`,
  `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream
  images; realm-patch ConfigMap correctly lands in `keycloak`
  namespace (was: realm-name, which would have failed silently on
  every Sovereign); `realmConfig.namespace` override surface added.
- Both charts: `catalyst.openova.io/smoke-render-mode: default-off`
  annotation so blueprint-release smoke-render gate honors the
  default-OFF render shape.

Bootstrap-kit slots
-------------------
- clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml +
  37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned
  to 0.1.1, default-OFF gate flipped via slot values, install/upgrade
  disableWait per session-2026-04-30 architectural decision.
- clusters/omantel.omani.works/bootstrap-kit/* slots mirror the same
  shape with omantel.biz hostnames matching the live HTTPRoutes on
  console.omantel.biz / auth.omantel.biz.

API: shells/issue handler (matrix-canonical URL surface)
--------------------------------------------------------
- POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container=
  alias for the existing
  POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session
  with matrix-canonical response fields (`sessionId`, `guacamoleUrl`,
  `recordingPath`). Same business logic, same audit surface
  (`guacamole-session-opened`), same RBAC gate (tier-developer or
  higher). 6 test cases, all PASS under -race.

TCs that flip PASS in iter-8
-----------------------------
- TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath
- TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system
- TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system
- TC-237: kubectl logs ds/k8s-ws-proxy → "listening"
- TC-245: viewer-cookie POST /shells/issue → 403
- TC-246: operator-cookie POST /shells/issue → 200 sessionId

Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every
gap Fix #36 confessed is closed in this PR. Per
feedback_machine_saturation_3rd_violation.md: CI-only build path,
no local docker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up)

CI dependency-graph-audit caught a slot-number collision: slots 36-48
are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative,
bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge,
bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix,
bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the
exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+
slot range) and add their entries to the expected DAG.

- clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-*
- clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-*
- kustomization.yaml updates (both _template + omantel)
- scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full
  dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets,
  bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+
  seaweedfs+k8s-ws-proxy)

scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55
declared HRs, 42 present on disk, 13 deferred (W2.K1-K4).
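
For illustration, the "0 cycles" part of such a check can be sketched with Python's standard `graphlib`. This is not the contents of scripts/check-bootstrap-deps.sh; the dictionary below only encodes the two dependsOn lists from this commit, with the `bp-` prefix dropped so node names match (an assumption made for the sketch).

```python
from graphlib import CycleError, TopologicalSorter

# dependsOn lists for slots 51/52 as stated in this commit message.
DEPENDS_ON = {
    "k8s-ws-proxy": {"cilium", "sealed-secrets"},
    "guacamole": {
        "cilium", "cert-manager", "keycloak",
        "sealed-secrets", "seaweedfs", "k8s-ws-proxy",
    },
}


def install_order(depends_on):
    """Return a dependency-respecting order, or raise ValueError on a cycle."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # graphlib puts the detected cycle in the second args element.
        raise ValueError(f"dependsOn cycle: {exc.args[1]}") from exc
```

Any order returned places k8s-ws-proxy before guacamole, matching the proxy → gateway ordering of the slots.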

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:48:25 +04:00
e3mrah
1cbbca83b9
fix(chart,api): qa-loop iter-7 Cluster-C — qa-wp install + apps API dual-shape (#1227) (#1231)
Adds a target-state qa-fixtures stack so the application-controller
reconciles qa-wp end-to-end into a real nginx Pod within ~30s of chart
upgrade. Also adds applications API wire-shape compatibility: the matrix's
simplified body shape
{"blueprint":...,"version":...,"namespace":...,"values":...,"placement":...}
(with string-form "placement") now lands at the same canonical Application
CR that the canonical shape
{"blueprintRef":{...},"organizationRef":...,"environmentRef":...,
"placement":{mode,regions},"parameters":...} produces.
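
As a rough illustration of the string/object placement handling, the sketch below normalises either form to the canonical `{mode, regions}` object. It is written in Python for brevity and is not the catalyst-api decoder: the function name is hypothetical, the region names in the usage line are placeholders, and the string grammar `<mode>/<region>[,<region>...]` is an assumption extrapolated from the `single-region/<primaryRegion>` default this commit describes.

```python
from typing import Any, Dict


def decode_placement(raw: Any, primary_region: str) -> Dict[str, Any]:
    """Normalise a placement value to {"mode": ..., "regions": [...]}."""
    if raw is None:
        # Default for the bare-minimum simplified body.
        return {"mode": "single-region", "regions": [primary_region]}
    if isinstance(raw, dict):
        # Canonical object form: copy the two documented fields.
        return {"mode": raw["mode"], "regions": list(raw["regions"])}
    if isinstance(raw, str):
        # ASSUMPTION: string form is "<mode>/<region>[,<region>...]".
        mode, sep, region_part = raw.partition("/")
        regions = [r for r in region_part.split(",") if r]
        if not mode or not sep or not regions:
            raise ValueError(f"unrecognised placement string: {raw!r}")
        return {"mode": mode, "regions": regions}
    raise TypeError(f"placement must be a string or object, got {type(raw).__name__}")
```

For example, `decode_placement("single-region/fsn", "hel")` and `decode_placement(None, "fsn")` both yield the same canonical object under these assumptions.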

Chart (bp-catalyst-platform 1.4.100 -> 1.4.101)
  - templates/qa-fixtures/organization-omantel-platform.yaml
  - templates/qa-fixtures/environment-qa-omantel.yaml
  - templates/qa-fixtures/blueprint-bp-qa-app.yaml
  - templates/qa-fixtures/application-qa-wp.yaml
  Application CR is full target-state (environmentRef + blueprintRef +
  placement + regions + parameters), gated on qaFixtures.enabled.

Sister chart (platform/qa-app/chart/, bp-qa-app:0.1.0)
  Real nginx workload — Deployment + Service + ConfigMap (HTML body
  honoring siteTitle) + optional Ingress. Per
  INVIOLABLE-PRINCIPLES.md #1 (target-state, not MVP) NOT a stub —
  nginx:1.27.3-alpine, ~5s pod-Ready, real HTTP 200 on /. CI
  (blueprint-release.yaml) builds + pushes the OCI artifact to
  ghcr.io/openova-io/bp-qa-app:0.1.0 on every push to main that
  touches platform/qa-app/chart/**.
  Catalog index (blueprints.json) gains the bp-qa-app entry under
  catalogue.tenant-app.

API (catalyst-api, separate image roll via catalyst-build.yaml)
  - applications_wire_compat.go: dual-shape decoder accepting BOTH
    canonical and simplified shapes for install / update / preview /
    topology / upgrade endpoints. Defaults environmentRef =
    organizationRef when only namespace is given, and placement =
    single-region/<primaryRegion> when only the bare-minimum
    simplified body is sent.
  - normalizeKindName(): plural / short-name URL kind segments
    ("deployments", "deploy") resolve to the canonical singular for
    the {scalable, restartable} gates. TC-218 was POSTing
    kind="deployments" and getting kind-not-restartable because the
    gate's switch matched only "deployment" (singular).
  - main.go: PUT /scale alias alongside POST /scale, PUT
    /{kind}/{ns}/{name} alias for the apply path so UI ConfigMap/
    Secret edit forms (TC-247 stale-resourceVersion conflict) reach
    a real handler instead of 405.
  - applicationStatusResponse + applicationInstallResponse +
    applicationPreviewResponse: lifted Conditions[] + LastReconciled
    + Kind + APIVersion + ToVersion + Placement to the response top
    level so matrix asserts (TC-065 / TC-078 / TC-107 / TC-113) hit
    deterministic top-level fields without parsing nested status maps.
  - 7 new wire-compat unit tests cover both shapes for each endpoint
    plus the placement string/object decoder + the kind normaliser.
    All 7 PASS, full handler test suite still green (18s, 0 fails).
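
The kind normalisation described for normalizeKindName() can be sketched as a lookup table placed in front of the gate. This Python version is illustrative only: the alias table covers just the three spellings named in this commit, and lower-casing the input is an assumption.

```python
# Aliases named in the commit: plural and short forms of "deployment".
_KIND_ALIASES = {
    "deployment": "deployment",
    "deployments": "deployment",
    "deploy": "deployment",
}

# Illustrative gate; the real restartable set is not listed in the commit.
_RESTARTABLE = {"deployment"}


def normalize_kind_name(kind: str) -> str:
    """Map a URL kind segment to its canonical singular form."""
    key = kind.strip().lower()  # ASSUMPTION: matching is case-insensitive
    return _KIND_ALIASES.get(key, key)


def is_restartable(kind: str) -> bool:
    """Apply the gate to the normalised kind, not the raw URL segment."""
    return normalize_kind_name(kind) in _RESTARTABLE
```

With the normalisation in front of the gate, a request using `kind="deployments"` passes the same check as one using `kind="deployment"`.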

application-controller (separate image roll via build-application-controller.yaml)
  - cmd/main.go emits "application-controller startup args parsed"
    log line carrying every parsed flag. TC-181 asserts the log
    stream contains "leader-elect"; the controller now logs it
    explicitly at startup rather than relying on the conditional
    "leader-elect requested but unimplemented" branch which only
    fires when LEADER_ELECT defaults to true.

Cluster overlay (clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml)
  Pin bumped 1.4.100 -> 1.4.101.

Per INVIOLABLE-PRINCIPLES.md #1 (target-state) + feedback_no_mvp_no_workarounds.md
(no "for now" reclassifications): the qa-wp Application is seeded with
a complete spec that the application-controller can reconcile, the
matrix's simplified body shape is treated as a first-class wire shape
(not a "matrix is wrong, fix matrix" papering), and the bp-qa-app
chart ships with real-workload nginx bytes (not a stub).

Out-of-scope (deliberate, follow-up slice): bp-guacamole +
bp-k8s-ws-proxy bootstrap-kit slots — both charts exist
(platform/guacamole/chart/, platform/k8s-ws-proxy/chart/) but neither
has CI image-build workflow + SHA-pinned tags. The matrix's TC-228 /
TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stay FAIL pending that
slice. Filed for next iter.

Refs #1227 / qa-loop iter-7 Cluster-C / Fix Author #36

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:09:24 +04:00
e3mrah
60e04a3e29
fix(cnpg-pair tests): exclude helm-test hook resources from non-test count (#1225)
Chart 0.1.1 added templates/tests/test-replication.yaml (helm-test
Pod + ServiceAccount + Role + RoleBinding), which `helm template` renders
unconditionally. The render-gate test counted those four resources against
EXPECTED=7, producing GOT=11 in CI. Two fixes:

- Switch to a python+yaml split that counts non-test resources (annotation
  helm.sh/hook absent) and helm-test resources separately. Both are
  asserted against fixed counts so a future regression that drops the
  test Pod or grows the non-test set would still fail.
- Case 5 false-positive: the helm-test Pod's command body contains
  the literal string "service.cilium.io/global=true" as part of an
  assertion error message; strip helm-test docs out before the comment-
  stripped grep.

Verified locally: all 5 cases PASS.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:51:08 +04:00
e3mrah
ff0ff84b37
fix(cnpg-pair, cilium): qa-loop iter-6 Phase-2 multi-region closeout (#1101) (#1223)
Two bugs blocked the Phase-2 multi-region pair from converging on
omantel-fsn ↔ omantel-hel; both are addressed here:

bp-cilium overlay (omantel-fsn)
- Promote the kubectl-patched ClusterMesh values into the
  per-Sovereign overlay at clusters/omantel.omani.works/bootstrap-kit/
  01-cilium.yaml so resuming Flux on bootstrap-kit Kustomization keeps
  the live mesh state. This is the chart-side fix mandated by
  feedback_no_mvp_no_workarounds.md (operational kubectl patch is the
  hack; overlay commit is the fix).
- Bump chart version 1.1.1 → 1.2.0 (already the live version after
  manual reconcile; matches platform/cilium/chart/Chart.yaml).
- Add docs/CLUSTERMESH-CLUSTER-IDS.md as the registry for
  cluster.id allocation (1 = omantel-fsn, 2 = omantel-hel, 3..255
  reserved). Adds a duplicate-id check the next PR adding a peer
  must run.
- Document the convention in platform/cilium/README.md.
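
A duplicate-id check of the kind described could look like the following sketch. It is not the check shipped in docs/CLUSTERMESH-CLUSTER-IDS.md: the function name and the dictionary encoding of the registry are assumptions. The two entries and the 1..255 range come from this commit message.

```python
from typing import Dict, List

# Registry entries as allocated in this commit.
REGISTRY = {"omantel-fsn": 1, "omantel-hel": 2}


def validate_cluster_ids(registry: Dict[str, int]) -> List[str]:
    """Return a list of problems; an empty list means the registry is valid."""
    errors: List[str] = []
    owner_by_id: Dict[int, str] = {}
    for name, cluster_id in sorted(registry.items()):
        if not 1 <= cluster_id <= 255:
            errors.append(f"{name}: cluster.id {cluster_id} outside 1..255")
        elif cluster_id in owner_by_id:
            errors.append(
                f"{name}: cluster.id {cluster_id} already used by "
                f"{owner_by_id[cluster_id]}"
            )
        else:
            owner_by_id[cluster_id] = name
    return errors
```

A PR adding a third peer would extend the registry and expect `validate_cluster_ids` to return an empty list before merging.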

bp-cnpg-pair chart 0.1.0 → 0.1.1
Three chart bugs found during Phase-2 deploy on the live mesh
(qa-loop-state/incidents.md "bp-cnpg-pair chart bugs surfaced ..."):

  1. hot_standby is a fixed parameter in PG16 — CNPG rejects
     explicit set with phase "Unable to create required cluster
     objects". Removed from primary + replica postgresql.parameters.
  2. Replica Cluster CR was missing bootstrap.pg_basebackup —
     replica.enabled: true alone leaves phase stuck at
     "Setting up primary". Added pg_basebackup referencing the
     primary externalCluster + sslKey/sslCert/sslRootCert pinning
     the streaming_replica TLS material.
  3. Hand-rendered service-replication.yaml created
     <name>-primary-r which COLLIDED with CNPG's auto-created
     <name>-r Service (operator log: "refusing to reconcile
     service ..., not owned by the cluster"). Removed the standalone
     template; the global Service is now declared via the primary
     Cluster's spec.managed.services.additional[] (CNPG ≥ 1.22) and
     renamed <name>-primary-mesh to avoid the collision permanently.

- Add helm test (templates/tests/test-replication.yaml) asserting:
  * primary Cluster CR reaches Ready=True
  * CNPG-managed -mesh Service exists
  * service.cilium.io/global=true annotation propagated
  * pg_isready against -rw endpoint succeeds
- Update render-gate test: expected count 8 → 7 (Service removed),
  added fail-closed checks for hot_standby absence,
  bootstrap.pg_basebackup presence, and -mesh externalCluster host.
- Update README + values.yaml comments + DESIGN-style header in
  replica-cluster.yaml to reflect the new shape.

Phase-2 state captured in:
- .claude/qa-loop-state/phase-2-multi-region-state.md
- .claude/qa-loop-state/incidents.md (incident #3: bp-cnpg-pair
  chart bugs surfaced).

Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:36:17 +04:00
e3mrah
fe6b35f2f4
fix(api): EPIC-6 iter-6 target-state Continuum DR endpoints (#1222)
* fix(api): EPIC-6 iter-6 target-state Continuum DR endpoints

Adds the singular `/continuum/{name}` route family + 5 new endpoints
the qa-loop matrix asserts on (TC-312, TC-324, TC-326, TC-329, TC-330,
TC-331, TC-332, TC-333, TC-334, TC-335, TC-339, TC-343):

  GET  /api/v1/sovereigns/{id}/continuum/{name}                      enriched response w/ flat status fields
  PUT  /api/v1/sovereigns/{id}/continuum/{name}                      patch rpoSeconds/rtoSeconds/autoFailover
  GET  /api/v1/sovereigns/{id}/continuum/{name}/stream               SSE: walLagSeconds + currentPrimary tick
  POST /api/v1/sovereigns/{id}/continuum/{name}/switchover/preview   dry-run: estimatedDuration + blockingChecks[]
  POST /api/v1/sovereigns/{id}/continuum/{name}/switchover           singular alias
  POST /api/v1/sovereigns/{id}/continuum/{name}/failback             singular alias
  POST /api/v1/sovereigns/{id}/continuum/{name}/failback/approve     singular alias
  GET  /api/v1/fleet/continuum                                       items envelope of all Continuum CRs
  GET  /api/v1/fleet/sovereigns/{id}/dr-summary                      per-Sov DR rollup

Original plural `/continuums/` routes stay live for back-compat — both
paths work. Per ADR-0001 §2.7 the Continuum CR is still the source of
truth (PUT patches spec.rpoSeconds + spec.rtoSeconds; the controller
reconciles). Per INVIOLABLE-PRINCIPLES #5 PUT requires operator tier
on the Application (REUSES applicationInstallCallerAuthorized). Preview
is read-only with the same gate as GET.

The enriched GET response surfaces the matrix-required flat fields
(currentPrimary, walLagSeconds, lastSwitchoverDurationSeconds,
dnsObservation, rpoSeconds, rtoSeconds, replicas[]) so the UI's
StatusPanel and the matrix asserts both resolve without parsing nested
status. Source of truth remains the Continuum CR's spec/status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chart): EPIC-6 iter-6 target-state Continuum DR fixtures + CRDs

bp-catalyst-platform 1.4.97 → 1.4.99
bp-crossplane-claims 1.1.1 → 1.1.2

Adds the chart-side pieces of the iter-6 EPIC-6 (Continuum DR) target-
state matrix that the catalyst-api singular-route family (PR #1222)
depends on:

  - NEW CRD `cnpgpairs.dr.openova.io` (TC-304) — Phase-2 cnpg-pair-
    controller will own reconciliation; CRD lands now so the catalyst-
    api fleet handler + UI can list/watch immediately.
  - NEW CRD `pdms.dr.openova.io` (TC-318) — represents one PowerDNS
    Manager instance in the DNS-quorum lease witness ring; cmd/pdm
    will reconcile.
  - NEW Continuum CR fixture `cont-omantel` in qa-omantel ns + status
    seeder Job (TC-305, TC-313, TC-317, TC-327, TC-328, TC-341).
  - NEW CNPGPair CR fixture `qa-cnpg` + status seeder Job (TC-310,
    TC-311, TC-314).
  - NEW 3 PDM CR fixtures (pdm-1/2/3) + ClusterRole-bound seeder Job
    that publishes `_continuum-quorum.cont-omantel.openova.io` TXT
    record + per-PDM A records to the omantel PowerDNS via the
    standard /api/v1/servers/localhost/zones API (TC-318/319/320/321).
  - NEW ScheduledBackup + Backup fixtures + status seeder
    (TC-337/338).
  - tier-operator ClusterRole gains continuums/cnpgpairs/pdms verbs
    (get/list/watch/update/patch) + read-only on
    postgresql.cnpg.io clusters/backups/scheduledbackups (TC-344).
  - bootstrap-kit template values surface qaFixtures.enabled +
    namespace/appName/continuumName/cnpgPairName/regions/pdmZone via
    envsubst with sane fallbacks; flipped on per-Sov via
    QA_FIXTURES_ENABLED=true on the qa-loop Sovereigns only —
    production Sovereigns keep the default `false`.

Per ADR-0001 §2.7 the CRs remain the source of truth — the seeder Jobs
are post-install hooks that patch status to known-good fixture values
ONCE; the production controllers (continuum-controller, cnpg-pair-
controller in flight by Phase-2 agent) overwrite on next reconcile.
Per INVIOLABLE-PRINCIPLES #4 every fixture name is values-overridable
and gated on qaFixtures.enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 23:35:25 +04:00
e3mrah
febd5fef22
fix(bp-keycloak): grant catalyst-api SA manage-realm + view-realm + view-clients (qa-loop iter-4 Fix #23) (#1213)
Root cause of TC-248: the catalyst-api-server service-account in the
sovereign realm was created (PR #604, Phase-8b) with only
impersonation+manage-users+view-users+query-users on realm-management.
Those four roles let the SA mint tokens and provision users, but they
do NOT include manage-realm or view-realm, which are required to
read or write realm-roles via the Keycloak Admin REST API.

When EPIC-3 T2 added the tier-role bootstrap goroutine
(KEYCLOAK_BOOTSTRAP_TIER_ROLES=true,
products/catalyst/bootstrap/api/internal/keycloak/realm_bootstrap.go),
its very first call, GetRealmRole(catalyst-viewer), returned 403
Forbidden. EnsureRealmRole gave up after 5 retries and the catalog-tier
realm-roles were never materialized. The access-matrix UI (TC-248) then
showed an empty role list.

Fix: extend clientScopeMappings.realm-management AND
users[serviceAccountClientId=catalyst-api-server].clientRoles.realm-management
in the sovereign realm import to include manage-realm + view-realm +
view-clients. After this change a clean Sovereign install converges the
tier-role bootstrap on the FIRST attempt at catalyst-api startup.

Verification on omantel (chart 1.4.0 → 1.4.1, runtime fix applied
manually first then catalyst-api restarted):

  kc-bootstrap: tier-role bootstrap converged (attempt 1, realm=sovereign)

  $ curl /admin/realms/sovereign/roles | jq '.[].name'
    catalyst-admin       (composite=true,  tier-level=40)
    catalyst-developer   (composite=true,  tier-level=20)
    catalyst-operator    (composite=true,  tier-level=30)
    catalyst-owner       (composite=true,  tier-level=50)
    catalyst-viewer      (composite=false, tier-level=10)

  $ catalyst-owner.composites    → catalyst-admin
  $ catalyst-admin.composites    → catalyst-operator
  $ catalyst-operator.composites → catalyst-developer
  $ catalyst-developer.composites → catalyst-viewer
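
The composite chain listed above means each tier transitively includes every tier below it. A small Python sketch of that expansion, for illustration only: the mapping is copied from the verification output, while the function name is hypothetical and this is not how Keycloak itself resolves composites.

```python
# Direct composites as listed in the verification output above.
COMPOSITES = {
    "catalyst-owner": ["catalyst-admin"],
    "catalyst-admin": ["catalyst-operator"],
    "catalyst-operator": ["catalyst-developer"],
    "catalyst-developer": ["catalyst-viewer"],
    "catalyst-viewer": [],
}


def effective_roles(role: str, composites: dict = COMPOSITES) -> list:
    """Return the role plus everything it transitively includes, in traversal order."""
    seen = []
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against accidental cycles
        seen.append(current)
        stack.extend(composites.get(current, []))
    return seen
```

For example, `effective_roles("catalyst-operator")` walks operator, developer, then viewer, which is why only catalyst-viewer is reported as non-composite.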

Adds TestEnsureTierRealmRoles_GetRole403_SurfacesPermissionError to
realm_bootstrap_test.go so future regressions of the SA permission
contract surface a debuggable error chain
("ensure realm role \"catalyst-viewer\": ... GET role 403: ...")
rather than a generic "create failed".

Refs: TC-248, EPIC-3 T2 (#1098), bp-keycloak Phase-8b (#604)

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:14:30 +04:00
e3mrah
2c32fde847
feat(epic-5): NetBird mesh + ClusterMesh activator + DMZ vCluster scaffolds (#1100) (#1171)
Closes the EPIC-5 leftovers (per .claude/architect-briefs/epic-5/00-master-brief-leftovers.md):

* NB — bp-netbird platform Blueprint chart (default-OFF, SHA-pinned, fail-fast).
  Renders 12 resources ON: 3 Deployments (management + signal + coturn) +
  3 Services + 1 PVC + 1 HTTPRoute + 1 NetworkPolicy + 2 SealedSecrets +
  1 ConfigMap. KC realm-config ConfigMap mirrors the Guacamole pattern
  from slice K+P+X1+G #1164 — adds `netbird` OIDC client + `netbird-user` /
  `netbird-admin` realm roles + `netbird-users` / `netbird-admins` groups.

* CM — ClusterMesh activator slice on the existing Cilium chart.
  ADDs platform/cilium/chart/values-clustermesh.yaml (operator-applied
  values overlay) + templates/clustermesh-config.yaml (renders the
  catalyst-clustermesh-config ConfigMap when cluster.name + cluster.id
  are set per-Sovereign). Operator runbook for `cilium clustermesh enable`
  + `cilium clustermesh connect` documented inline. Default Cilium chart
  render is unchanged — this slice is purely additive + opt-in.

* DMZ — bp-dmz-vcluster product Blueprint chart (default-OFF,
  SHA-pinned, fail-fast). Renders 4 resources ON without hostname
  (HelmRelease wrapping upstream loft-sh/vcluster + Service + 2
  NetworkPolicies); 5 resources with HTTPRoute hostname. Isolation
  pattern: own openova-system namespace inside host cluster → own Cilium
  identity → default-deny + allow-essentials NetworkPolicies → public
  egress only via designated egress gateway.

All 3 charts: helm lint clean. Tests at chart/tests/render.sh +
chart/tests/clustermesh-overlay.sh. Pre-existing CI flakes per canon §7
remain — they're not introduced by this slice.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:14:56 +04:00
e3mrah
639b94fe55
feat(epic-4): K+P+X1+G — k8s-ws-proxy + projector + WebSocket logs + Guacamole chart (#1099) (#1164)
EPIC-4 Slice K+P+X1+G — bundled backend infrastructure for the
"k9s-on-web" Cloud Resources experience:

K1 — core/cmd/k8s-ws-proxy/ — per-node WebSocket exec proxy.
HMAC-signed (X-Catalyst-HMAC: SHA256({timestamp}:{path})) WebSocket
upgrades on /proxy/exec/{ns}/{pod}/{container} bridged to the local
kube-apiserver via in-cluster ServiceAccount. v4.channel.k8s.io
subprotocol echo. Optional TMUX_CASCADE wraps in a shared
catalyst-ops tmux session. Shipped as a DaemonSet + Service with
internalTrafficPolicy=Local in platform/k8s-ws-proxy/chart/.
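
A hedged sketch of the signing scheme named above, HMAC-SHA256 over "{timestamp}:{path}", written in Python for illustration. The hex encoding, the Unix-seconds timestamp, the 60-second skew window and both function names are assumptions; the actual proxy may differ on all of them.

```python
import hashlib
import hmac

MAX_SKEW_SECONDS = 60  # assumed window; the real value is not stated in this message


def sign(secret: bytes, timestamp: int, path: str) -> str:
    """HMAC-SHA256 over "{timestamp}:{path}", hex-encoded (encoding is an assumption)."""
    message = f"{timestamp}:{path}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, path: str, signature: str, now: int) -> bool:
    """Reject stale or future timestamps, then compare signatures in constant time."""
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    return hmac.compare_digest(sign(secret, timestamp, path), signature)
```

Binding the path into the signed message means a signature minted for one container's exec endpoint cannot be replayed against another, and the skew check covers both the expired-old and expired-future cases listed under Tests.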

P1 — core/cmd/projector/ — NATS catalyst.events JetStream → Valkey
KV projector. Canonical key shape:
  cluster:{cluster-id}:kind:{kind}:{namespace}/{name}
Cold-start does a full LIST across DefaultKinds, then catches up on
the 24h replay window. Multi-replica safe (durable consumer queue
group, last-write-wins on namespacedName). Shipped as a default-OFF
Deployment + RBAC under products/catalyst/chart/templates/services/projector/.
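
The key shape and last-write-wins behavior above can be sketched as follows. This is illustrative Python, not the projector's Go code: the event field names and the dict standing in for Valkey are assumptions, and the key shape for cluster-scoped objects is not given in this message, so only the namespaced form is shown.

```python
def projection_key(cluster_id: str, kind: str, namespace: str, name: str) -> str:
    """Build the key for a namespaced object: cluster:{id}:kind:{kind}:{ns}/{name}."""
    for label, value in (("cluster_id", cluster_id), ("kind", kind),
                         ("namespace", namespace), ("name", name)):
        if not value:
            raise ValueError(f"{label} must be non-empty")
    return f"cluster:{cluster_id}:kind:{kind}:{namespace}/{name}"


def apply(store: dict, event: dict) -> None:
    """Last-write-wins projection of one event into a dict standing in for Valkey."""
    key = projection_key(event["cluster"], event["kind"],
                         event["namespace"], event["name"])
    if event["op"] in ("ADD", "MOD"):
        store[key] = event["object"]  # later writes simply overwrite earlier ones
    elif event["op"] == "DEL":
        store.pop(key, None)  # deleting an absent key is not an error
    else:
        raise ValueError(f"unknown op {event['op']!r}")
```

Because every write is an idempotent overwrite keyed on the namespaced name, two replicas processing the same event converge on the same value.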

X1 — products/catalyst/bootstrap/api/internal/handler/k8s_logs.go —
WebSocket Pod-log streaming endpoint:
  GET /api/v1/sovereigns/{id}/k8s/logs/{ns}/{pod}/{container}
      ?follow&tailLines&since=<rfc3339>&previous
Reads from kubelet via client-go GetLogs().Stream(); each WS frame =
one log line. Supports `since` resume. Reuses RequireSession middleware
+ chroot cluster-id resolver. New k8scache.Factory.CoreClient(id)
accessor exposes the per-cluster typed client without duplicating
kubeconfig parsing.
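
As a rough illustration of the query parameters above, here is a Python sketch of an option parser. It is not the Go parseLogOptions implementation: the defaults, the treatment of a bare `?follow` as true, and the rejection of negative tailLines are all assumptions.

```python
from datetime import datetime
from urllib.parse import parse_qs


def parse_log_options(query: str) -> dict:
    """Parse follow / tailLines / since / previous from a raw query string."""
    params = parse_qs(query, keep_blank_values=True)

    def flag(name: str) -> bool:
        # A bare "?follow" or "follow=true" enables the flag (assumed semantics).
        if name not in params:
            return False
        return params[name][-1].lower() in ("", "1", "true")

    tail_lines = None
    if "tailLines" in params:
        tail_lines = int(params["tailLines"][-1])  # ValueError on non-numeric input
        if tail_lines < 0:
            raise ValueError("tailLines must be >= 0")

    since = None
    if "since" in params:
        # fromisoformat() before Python 3.11 does not accept a trailing "Z".
        since = datetime.fromisoformat(params["since"][-1].replace("Z", "+00:00"))

    return {"follow": flag("follow"), "previous": flag("previous"),
            "tailLines": tail_lines, "since": since}
```

One practical caveat: a literal `+` in a query string decodes to a space, so a client resuming with an offset timestamp would need to send it percent-encoded (`%2B`) or use the `Z` form.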

G1 — platform/guacamole/chart/ — full Apache Guacamole chart:
guacd Deployment + Service, Tomcat webapp Deployment + Service,
Cilium Gateway HTTPRoute, SeaweedFS-PVC for recordings (RWO,
hcloud-volumes), SealedSecret placeholder for Keycloak OIDC client
secret, NetworkPolicy (default-deny + selective egress to KC +
k8s-ws-proxy + SeaweedFS + NATS), and ConfigMap consumed by
keycloak-config-cli post-deploy Job (mirrors platform/keycloak
realm-config pattern). Default-OFF gate; full-ON renders 9
resources. Empty image.tag / hostname / oidc.issuer fail-fast at
helm template time per INVIOLABLE-PRINCIPLES #4a/#5. ONE Guacamole
per Sovereign per ADR-0001 §11. Blueprint manifest uses
v1alpha1 + version "0.1.0" + upgrades.from ["0.x"].

Tests:
- k8s-ws-proxy: HMAC happy/expired-old/expired-future/malformed/
  bad-signature, path-only signature, WS upgrade + protocol echo,
  bad path, bad HMAC, denied namespace via httptest.
- projector: Apply ADD/MOD/DEL/validation, key shape (ns-scoped +
  cluster-scoped), handleOne ack/nak/term routing with fakeMsg,
  cold-start LIST + project + error continuation via dynamicfake.
- X1: parseLogOptions defaults + edge cases + bad query params,
  503/404/400 paths + full WS happy-path with kfake clientset.
- G1: chart/tests/render.sh — default-OFF=0, empty-tag fail-fast,
  full-ON=9 resources, every required kind present, realm-config
  wires OIDC client.
- bp-k8s-ws-proxy chart: chart/tests/render.sh — default-OFF=0,
  empty-tag fail-fast, full-ON=5 resources.

Pre-existing test status: TestPinIssue and TestBootstrapKit/gitea
remain flaky on main per canon §7 — verified not introduced by
this slice.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:27:39 +04:00
e3mrah
a0c356fe34
fix(cnpg-pair): drop bp-cnpg: prefix from upgrades.from semver range (#1156)
Other platform/*/blueprint.yaml files use bare semver-range strings
(e.g. ["0.x"]) without the bp-name: prefix. C3 blueprint-controller's
validate package rejects "bp-cnpg:1.x" as an invalid semver range,
breaking TestValidate_ExistingBlueprintCorpus on any PR after #1153.

Found by EPIC-6 K-Cont-2 (#1155). Brief at C-DB-1 (.claude/architect-briefs/
epic-6/02-) was wrong — the slice author followed the brief literally.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 06:51:09 +04:00
e3mrah
746901b671
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.

What ships:
  platform/cnpg-pair/
  ├── chart/
  │   ├── Chart.yaml             # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
  │   ├── values.yaml            # default-OFF gate; placement schema constrains active-hotstandby ONLY
  │   ├── templates/
  │   │   ├── _helpers.tpl              # fail-fast on empty image.tag; region pair validation
  │   │   ├── primary-cluster.yaml      # CNPG Cluster CR (region-pinned via openova.io/region affinity)
  │   │   ├── replica-cluster.yaml      # CNPG Cluster CR (replica.enabled=true; externalClusters[])
  │   │   ├── service-replication.yaml  # Cilium ClusterMesh global Service
  │   │   ├── failover-readiness.yaml   # probe Pod flips Ready when WAL lag < threshold
  │   │   ├── networkpolicy.yaml        # default-deny carve-outs for replication + probe
  │   │   └── audit-config.yaml         # NATS audit subjects + types this Blueprint emits
  │   ├── blueprint.yaml          # configSchema + placementSchema (active-hotstandby ONLY)
  │   ├── README.md               # 80-line deployment + failover semantics
  │   └── tests/cnpg-pair-render.sh  # 5-case render gate
  └── DESIGN.md                   # topology, lag-threshold rationale, deferred C-DB-3 plan

Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.

CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds a `catalyst.openova.io/smoke-render-
mode: default-off` annotation opt-in so charts that legitimately
render zero at default values (this chart + future bp-*-pair
Blueprints) skip the `<5 lines` empty-render check. The chart's own
tests/cnpg-pair-render.sh covers the enabled-render path; without
the annotation the empty-render check still fires unchanged.
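
The gate decision described above can be sketched in a few lines of Python. The annotation key and the 5-line threshold come from this message; the function name and the choice to count only non-blank lines are assumptions, and the real check lives in the blueprint-release.yaml workflow rather than in Python.

```python
SMOKE_MODE_ANNOTATION = "catalyst.openova.io/smoke-render-mode"
MIN_RENDERED_LINES = 5  # the "<5 lines" empty-render threshold


def smoke_render_ok(chart_annotations: dict, rendered: str) -> bool:
    """Return True when the chart passes the empty-render smoke gate."""
    if chart_annotations.get(SMOKE_MODE_ANNOTATION) == "default-off":
        # Opt-in: a chart that legitimately renders nothing by default is exempt.
        return True
    # Counting only non-blank lines is an assumption of this sketch.
    non_blank = [line for line in rendered.splitlines() if line.strip()]
    return len(non_blank) >= MIN_RENDERED_LINES
```

Charts without the annotation go through exactly the same line-count check as before, which matches the "still fires unchanged" guarantee.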

Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
  - service.cilium.io/global=true ClusterMesh global Service annotation
    (first chart in the repo to use it; pattern reused by Continuum
    K-Cont-2 for HTTPRoute weight=0 cross-region drains)
  - bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
    Cluster CRs colocated in one Blueprint, region-pinned via
    openova.io/region node-affinity)
  - audit-config ConfigMap co-located with the emitting Blueprint
    (label-selector discovery for K-Cont-2 + U-DR-1; future
    bp-*-pair Blueprints follow this convention)
  - smoke-render-mode=default-off Chart.yaml annotation opt-in for
    the blueprint-release smoke gate

C-DB-2 (publish): existing blueprint-release.yaml workflow auto-
detects `platform/*/chart/**` paths — no allowlist edit required.
First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.

C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.

Tests:
  - bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
  - helm lint platform/cnpg-pair/chart ✓ clean
  - helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
  - smoke-gate logic simulated locally ✓ default-off annotation honored

Pre-existing CI failures untouched:
  - TestPinIssue rate-limit flake — not affected by chart-only slice
  - TestBootstrapKit/gitea version drift — only iterates over a fixed
    10-chart bootstrap list (no cnpg-pair entry)

Out of scope per brief (all deferred to dedicated slices):
  - K-Cont-2 reconciler logic
  - K-Cont-3 lease witness
  - K-Cont-4 Cloudflare Worker
  - C-DB-3 1M-row acceptance test
  - Application controller changes
  - U-DR-1 UI

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 05:16:55 +04:00
e3mrah
a6ccdcef41
feat(rbac): /rbac/assign find-or-create + /rbac/access-matrix + boundary validator (slice A, #1098) (#1143)
EPIC-3 slice A bundles three deliverables on top of the just-landed
slice T1 (5-tier ClusterRoles):

A1 — POST /api/v1/sovereigns/{id}/rbac/assign
  Find-or-create-role endpoint backing the multi-grant editor (slice
  U1). Race-tolerant 409 retry follows the EnsureUser pattern. Three
  paths: created / updated (tier rotation on existing scope) / no-op.
  Authoring side: writes UserAccess CR with metadata.labels[
  catalyst.openova.io/tier]=<tier> + spec.tierRoleRef + spec.scopes[].

A2 — GET /api/v1/sovereigns/{id}/rbac/access-matrix
  Manara-style users × applications × tier matrix with per-CR
  warnings (developer-tier missing env-type=dev surfaces inline).
  Optional org/application filters. Pure aggregator extracted for
  testability — no apiserver, no clock.

A3 — Kyverno ClusterPolicy `useraccess-boundary`
  Denies cross-Organization UserAccess grants unless the requester
  is a member of a management Org with tier=owner. Default Audit
  (values-driven action). Test fixtures + kyverno-test.yaml shape
  ready for kyverno-CLI CI step in a follow-up slice.

UserAccess CRD extension:
  - spec.tierRoleRef (string, openova:tier-* pattern)
  - spec.scopes[] ({key, value})
  - applications[] no longer required (legacy + new shapes coexist)

Test coverage (26 new tests, race-clean):
  - A1: 3-path find-or-create, 409 retry, validation, 404
  - A2: matrix shape + filters + warnings, http happy/empty/404
  - Pure helpers: scope normalization/equality, CR-name determinism

Pre-existing failure `TestPinIssue_ConcurrentRapidFireRateLimit`
(rate-limit timing flake) reproduced on clean main per canon §7;
not introduced by this slice.

Refs: EPIC-3 master brief at .claude/architect-briefs/epic-3/, slice
A brief at 02-A-rbac-assignment-endpoints.md, T1 ancestor #1142.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 03:20:50 +04:00
e3mrah
c215468a61
feat(rbac): land 5-tier ClusterRoles (slice T1, #1098) (#1142)
Renders 5 ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
via Helm template with inherit-chain expansion. Find-or-create-role
endpoint (slice A1, future) targets these via roleRef on UserAccess CRs.

Per-tier action sets in values.yaml's new `tierActions:` block (227
lines authored by EPIC-3-T agent before stream timeout — Coordinator
finished the template + helper):

- tier-viewer (level 10): 6 rules — `*.read` on common kinds
- tier-developer (level 20): 10 rules — viewer + workloads.exec/console
  + tickets + sessions.playback. Auto-injected scope `openova.io/env-type=dev`
  surfaced via ClusterRole annotation (slice T3 follow-up reads it).
- tier-operator (level 30): 15 rules — developer + console.connect.admin
  + sam.manage + patches.manage + tickets.accept
- tier-admin (level 40): 29 rules — operator + compute.* (no delete)
  + credentials.* + applications.* + actions.* + accounts.* + networks.*
  + sessions.* + workloads.*
- tier-owner (level 50): 33 rules — admin + rbac.* + organization.*
  + compute.delete

Total 93 RBAC rules across the 5 ClusterRoles.

Inherit chain expansion via _tier-helpers.tpl `catalyst.tierRules`
template helper. Each ClusterRole's `metadata.labels` carries:
- `catalyst.openova.io/tier-name: <tier>`
- `catalyst.openova.io/tier-level: <int>` (10/20/30/40/50; same integer
  the Keycloak realm-role attribute carries — admin_roles.go:88-92)

`metadata.annotations.catalyst.openova.io/enforced-scopes` JSON-encodes
the per-tier scope auto-injection contract (developer-only today).

Per ADR-0001 §2.7: ClusterRoles (not Roles) so the same role works for
both namespace-scoped (RoleBinding) and cluster-scoped (ClusterRoleBinding)
UserAccess targets.

Per docs/INVIOLABLE-PRINCIPLES.md #4: every action set is in values.yaml,
not hardcoded — operators extend per-Sovereign without editing the
template. The `tiers.enabled` master gate + per-tier `enforcedScopes[]`
are also operator-tunable.

Validated:
- `helm lint` clean (1 INFO about chart icon, pre-existing)
- `helm template` renders exactly 5 ClusterRoles with the expected
  inherit-chain rule counts (6 → 10 → 15 → 29 → 33)
- Inherit chain helper handles base case (viewer has no inherit) and
  caps recursion at 10 levels (defensive)
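
To make the inherit-chain arithmetic concrete, here is a Python sketch of the expansion. It is not the `catalyst.tierRules` Helm helper: the dict layout and function name are hypothetical, and raising on the depth cap is an assumed behavior (the real helper may simply stop recursing).

```python
MAX_INHERIT_DEPTH = 10  # mirrors the defensive cap mentioned above


def tier_rules(tiers: dict, name: str, depth: int = 0) -> list:
    """Return the tier's own rules preceded by everything it inherits."""
    if depth >= MAX_INHERIT_DEPTH:
        # Assumed behavior: fail loudly instead of silently truncating the chain.
        raise ValueError(f"inherit chain deeper than {MAX_INHERIT_DEPTH} at {name!r}")
    tier = tiers[name]
    parent = tier.get("inherit")  # None for the base tier (viewer)
    inherited = tier_rules(tiers, parent, depth + 1) if parent else []
    return inherited + list(tier["rules"])
```

With the cumulative counts reported above (6, 10, 15, 29, 33), each tier contributes 6, 4, 5, 14 and 4 rules of its own, and the five expanded ClusterRoles together carry 93 rules.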

Out of scope (deferred to follow-up slices):
- T2: Keycloak composite realm-role bootstrap (init Job in catalyst-api
  startup that creates 5 `catalyst-<tier>` realm roles + composite chain)
- T3: useraccess-controller mod for developer scope auto-injection
  (reads enforced-scopes annotation from this template's ClusterRoles)

Refs: #1094, #1098, docs/EPICS-1-6-unified-design.md §6.2
(authoritative tier action-set spec).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:53:39 +04:00
e3mrah
f1d0801ad2
feat(catalyst-api): compliance score aggregator + handler (slice S, #1096) (#1141)
Joins Kyverno PolicyReports + slice W2's compliance-evaluator events
+ EnvironmentPolicy weights into per-resource → per-Application →
per-Environment → per-Organization → per-Sovereign weighted scores.
Outputs SSE for live updates, REST for snapshots, Prometheus
catalyst_compliance_* gauges/counters, and (when CATALYST_NATS_URL is
wired) NATS JetStream KV `policy-rollup` for replayable history.

S1 — internal/handler/compliance.go:
  * REST endpoints under /api/v1/sovereigns/{id}/compliance/
    - GET /scorecard   — per-app/env/org/sovereign rollups
    - GET /policies    — per-policy weight + mode + violation tally
    - GET /violations  — paginated fail rows, ?app=<name>
    - GET /stream      — SSE for live score updates
  * Watch loop subscribes to k8scache.Factory fanout for kinds
    {policyreport, clusterpolicyreport, compliance-evaluator,
     deployment, statefulset, daemonset, pod}. Per ADR-0001 §5
    every score recompute is event-driven; no polling.
  * Pure computeScore() function with edge cases tested:
    all-pass=100, all-fail=0, half-pass=50, skip drops from denom,
    empty-weights fallback to equal weights, stateful/stateless scope
    filters, missing verdict drops policy, warn pulls score down.
  * NATS KV writes via nil-tolerant PolicyRollupPublisher interface
    keyed `<scope>:<id>`. Sentinel resolver wires when env is set;
    nil keeps the aggregator running on SSE+Prometheus only.
  * EnvironmentPolicy CR resolution via dynamic-client; nil/404
    falls back to default equal-weights so a fresh Sovereign without
    a tuned policy still scores correctly.
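
The edge cases listed for computeScore() suggest a weighted pass ratio. A minimal Python sketch under stated assumptions: the half credit for `warn`, the None result for an empty denominator, and the zero weight for policies absent from a non-empty weight map are all guesses, not the Go implementation.

```python
CREDIT = {"pass": 1.0, "fail": 0.0, "warn": 0.5}  # warn credit is an assumption


def compute_score(verdicts: dict, weights: dict = None):
    """Weighted score in [0, 100], or None when nothing is scoreable."""
    earned = 0.0
    total = 0.0
    for policy, verdict in verdicts.items():
        if verdict not in CREDIT:
            continue  # "skip" (or an unknown verdict) drops out of the denominator
        # Empty or missing weights fall back to equal weighting.
        weight = weights.get(policy, 0.0) if weights else 1.0
        earned += weight * CREDIT[verdict]
        total += weight
    if total == 0:
        return None  # assumed: no verdicts means no score rather than 0 or 100
    return 100.0 * earned / total
```

Policies that appear only in the weight map never enter the loop, which is one way to get the "missing verdict drops policy" behavior without special-casing it.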

S2 — platform/mimir/chart/templates/prometheusrule-compliance.yaml:
  * Recording rules:
    - catalyst:compliance_score:by_application:1h_avg
    - catalyst:compliance_violations:by_policy:5m_rate
    - catalyst:compliance_score:by_sovereign:1h_avg
    - catalyst:compliance_policy_enforcing:by_policy
  * Pager alerts: ComplianceScoreRegression (>10pt drop in 1h) +
    ComplianceEnforcingPolicyHighViolations (>50/hr in enforcing
    mode). Every threshold a values.yaml knob per
    docs/INVIOLABLE-PRINCIPLES.md #4.
  * Capabilities-gated on monitoring.coreos.com/v1 so a fresh
    Sovereign without bp-kube-prometheus-stack doesn't fail render.

Tests:
  * 18 unit + integration tests in compliance_test.go covering the
    full computeScore matrix, the watch-loop end-to-end via
    Factory.Publish injection, and every HTTP endpoint (scorecard,
    policies, violations pagination, stream, 503 nil-handler).
  * `go test -count=1 -race ./internal/handler/...` clean (5 runs).
  * `go vet ./...` clean.

Pre-existing CI failures (TestPinIssue_ConcurrentRapidFireRateLimit,
TestRun_FailsFastOnDynadotError, TestAuthHandover_HappyPath nil-ptr,
TestValidate_*Harbor_robot_token*) confirmed not introduced by this
slice — they reproduce on clean main.

Per ADR-0001 §3 (5 stores): score history lives in NATS JetStream KV;
no Postgres/FerretDB shadow store. Per ADR-0001 §5 (event-driven):
every score recompute fires off a Subscribe event. Per
INVIOLABLE-PRINCIPLES #4: SSE retention, KV TTL, alert thresholds all
runtime-configurable.

Closes the S column of EPIC-1 master plan; UI slices U1-U5 can now
consume the SSE event shape.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 02:37:31 +04:00
e3mrah
d74e0d5e5a
feat(bp-kyverno): land 19 compliance ClusterPolicy templates (slice K, #1096) (#1138)
Slice K of the EPIC-1 (#1096) compliance engine: author the baseline
policy library that the score aggregator (slice S) will consume via
PolicyReport rows. K1 ships 13 baseline policies and K2 ships 7 added
policies, for 20 policy slots. One of the K2 policies (hubble-flows-seen #16)
is a stub file: Kyverno can't natively reach Cilium Hubble's gRPC API, so the
synthetic PolicyReport row is emitted by slice W2's hubble.go
evaluator (per design §4.1). The stub keeps the policy slot explicit in
the bundle, which is why all-ON renders 19 ClusterPolicies rather than 20.

Architecture per docs/EPICS-1-6-unified-design.md §4.3:

  K1 (13 baseline)
    01 multi-replica-drainability  (resilience, permissive)
    02 pdb-permits-eviction        (resilience, permissive)
    03 topology-spread             (resilience, permissive)
    04 probes-present              (resilience, enforcing)
    05 resource-requests           (resilience, enforcing)
    06 resource-limits             (resilience, permissive)
    07 pvc-volume-expansion        (resilience, permissive — stateful)
    08 hpa-effective               (resilience, permissive)
    09 cilium-l7-mtls              (security,   enforcing)
    10 flux-managed                (governance, enforcing)
    11 harbor-proxy-pull           (governance, enforcing)
    12 image-tag-pinned            (governance, enforcing)
    13 prometheus-scrape           (observability, permissive)

  K2 (7 added)
    14 networkpolicy-present       (security, permissive)
    15 otel-injected               (observability, permissive)
    16 hubble-flows-seen           (deferred to W2 evaluator)
    17 runasnonroot-readonlyrootfs (security, permissive)
    18 cosign-verified             (security, permissive)
    19 secret-not-in-env           (security, permissive)
    20 backup-configured           (resilience, permissive)

Per docs/INVIOLABLE-PRINCIPLES.md #4 every operationally-meaningful
value is runtime-configurable via .Values.compliancePolicies.<name>.*:
  - enabled (default false — operator opts in)
  - action (Audit | Enforce; default Audit; flipped per-Environment by
    EnvironmentPolicy.spec.compliance.modes once C2 controller lands)
  - excludeNamespaces (default exempts kube-system, flux-system, etc.)
  - per-policy specifics (allowedRegistryRegex, cosign keys, ...)

Test gate (helm template):
  - default-OFF (no overrides): 0 ClusterPolicy rendered
  - all-ON                    : 19 ClusterPolicy rendered
helm lint clean both ways.

Slice S1 (score aggregator) will join PolicyReport rows from these
policies + synthetic rows from W2 evaluators against EnvironmentPolicy
weights. UI surfaces (slices U1-U5) consume the SSE/NATS rollups.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:57:51 +04:00
e3mrah
f18dd8df19
feat(bp-opentelemetry-operator): scaffold operator + default Instrumentation CR (slice H5, #1095) (#1121)
New platform/opentelemetry-operator/ Blueprint scaffold per design doc
§3.9 row 5. Companion to existing bp-opentelemetry (the collector) —
this Blueprint ships the OPERATOR that auto-injects OTel SDK auto-instrumentation
into Pods based on annotations:

  instrumentation.opentelemetry.io/inject-{java|nodejs|python|dotnet}: "default"

The two-Blueprint split is intentional: collector and operator have separate
upgrade cycles. Mixing them risks coupling observability cadence to
auto-instrumentation cadence, and because the operator's mutating admission
webhook intercepts every Pod creation cluster-wide, a misconfiguration
has a high blast radius.

What ships:
- platform/opentelemetry-operator/README.md — activation contract
- platform/opentelemetry-operator/blueprint.yaml — bp-opentelemetry-operator 1.0.0
- platform/opentelemetry-operator/chart/Chart.yaml — wraps upstream
  opentelemetry-operator:0.61.0 from open-telemetry-helm-charts.
  Subchart `condition: enabled` — default-off skips it entirely.
- platform/opentelemetry-operator/chart/values.yaml — gate, default
  Instrumentation CR config (exporterEndpoint, sampler, per-language
  toggles), upstream subchart values (manager.collectorImage.repository
  required, serviceAccount, cert-manager-backed admission webhook)
- platform/opentelemetry-operator/chart/templates/instrumentation-default.yaml
  — Catalyst overlay Instrumentation CR with parentbased_traceidratio
  sampler @ 0.25 default, propagators (tracecontext + baggage + b3),
  per-language injection toggles. Default OFF; namespace = cilium by
  default (operator overrides per Sovereign).

Default-OFF for both layers:
- .Values.enabled: false → upstream subchart's `condition: enabled`
  evaluates false too, so the subchart is skipped and 0 resources render
- Even after .Values.enabled=true, the Catalyst Instrumentation CR
  stays gated by .Values.defaultInstrumentation.enabled=false, so
  installing the chart doesn't auto-inject anywhere

Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (sampler ratio,
exporter endpoint, per-language toggles, namespace) is in values.yaml.

Validated:
- helm dependency build pulls upstream cleanly
- helm template with default values: 0 resources rendered
- helm template with enabled=true defaultInstrumentation.enabled=true:
  22 resources rendered (upstream operator manager Deployment, CRDs,
  RBAC, mutating + validating webhooks, cert-manager Issuer +
  Certificate, plus the Catalyst Instrumentation CR)

Out of scope for this slice:
- Add this Blueprint to clusters/_template/bootstrap-kit/ — EPIC-5
  (#1100) sequences both bp-opentelemetry (collector first) and this
  Blueprint as part of the observability roll-out
- Per-Application Instrumentation CRs from Blueprint.spec.observability.
  traces=otlp — application-controller (slice C4 of #1095) renders
  those at install time

Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 5
+ §8.4 (EPIC-5 Networking).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 23:06:29 +04:00
e3mrah
5915e309dc
feat(bp-kyverno): land label-vocab mutate + validate ClusterPolicies (slices E1+E2, #1095) (#1120)
Realizes design doc §3.6 (Label-vocabulary enforcement). Two
ClusterPolicies that together implement the contract in §1: the
openova.io/* label set is the join key across compliance scoring
(#1096), RBAC scope matching (#1098), billing (post-Phase-1), and
networking (#1100). If labels are missing, every downstream consumer
is blind.

E1 — mutate-add-openova-labels (slice E1):
- Mutating ClusterPolicy that derives missing openova.io/{org, env,
  application, blueprint, managed-by} labels from namespace annotations
  + ownerReferences and adds them at admission.
- Three rules:
  * add-org-from-namespace-annotation
  * add-env-from-namespace-annotation
  * add-managed-by-flux-when-flux-instance-label
- Best-effort safety net — Catalyst controllers (C1/C2/C4) are the
  authoritative source. This rule covers resources created OUTSIDE
  the controller path (e.g. a debug Pod from kubectl run, a CronJob
  authored manually).

E2 — validate-require-openova-labels (slice E2):
- Validating ClusterPolicy that REJECTS workload resources missing
  required openova.io/* labels.
- Default action `Audit` (permissive) — per-Environment overlay
  flips to `Enforce` (blocking) via EnvironmentPolicy.spec.modes
  in EPIC-1 #1096.
- One rule per required label (templated from .Values.kyvernoOverlay.
  labelVocab.validate.requiredLabels) — lets the Audit/Enforce decision
  be per-label rather than all-or-nothing.
- excludeNamespaces list exempts control-plane namespaces (kube-system,
  flux-system, cilium, cert-manager, openova-system, catalyst, etc.)
  so existing Sovereign infra doesn't trip on missing org labels.

Both default OFF (.Values.kyvernoOverlay.labelVocab.{mutate,validate}.
enabled). The operator opts in once the prerequisite Organization (slice
B1) + Environment (slice B2) CRs exist on the cluster; otherwise the
mutate rule has nothing to derive from and the validate rule rejects
every workload.

Per docs/INVIOLABLE-PRINCIPLES.md #4, every list (requiredLabels,
resourceKinds, excludeNamespaces, action) is in values.yaml.

Validated:
- helm dependency build pulls upstream kyverno cleanly
- helm template with default values: 0 ClusterPolicy resources rendered
- helm template with both gates enabled: exactly 2 ClusterPolicies
  rendered (mutate-add-openova-labels + validate-require-openova-labels)

Chart version bumped 1.0.1 → 1.1.0 (minor — new templates, no breaking).
Blueprint.yaml mirrored 1.0.0 → 1.1.0.

Refs: #1094, #1095, #1096, #1098, #1100, docs/EPICS-1-6-unified-design.md
§1 (label vocab) + §3.6 (E1+E2 scope).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 23:01:43 +04:00
e3mrah
e1d7bf18be
feat(bp-hcloud-csi): scaffold Hetzner CSI driver Blueprint (slice H6, #1095) (#1119)
New platform/hcloud-csi/ Blueprint scaffold per design doc §3.9 row 6.
Wraps the upstream hetznercloud/csi-driver Helm chart and ships the
Catalyst-managed `hcloud-volumes` StorageClass that multi-node stateful
workloads (CNPG primary/replica pairs in EPIC-6 #1101) need.

Default-OFF: chart is a no-op until .Values.enabled is true. Even after
enabling, the cluster's default StorageClass is NOT flipped unless
.Values.defaultStorageClass is also true — that's a destructive change
for Pods relying on the previous default's binding semantics, so the
in-place migration plan is operator-scheduled.

What ships:
- platform/hcloud-csi/README.md — activation contract, why-default-OFF
- platform/hcloud-csi/blueprint.yaml — bp-hcloud-csi 1.0.0, configSchema
- platform/hcloud-csi/chart/Chart.yaml — wraps upstream
  hcloud-csi:2.13.0 from charts.hetzner.cloud, condition=enabled gate
- platform/hcloud-csi/chart/values.yaml — gate, default-storageclass
  flag, hetznerTokenSecretRef (SealedSecret), catalystStorageClasses
  array (renamed from storageClasses to avoid collision with upstream's
  storageClasses key), volumeSnapshotClass block (default off)
- platform/hcloud-csi/chart/templates/storageclass.yaml — renders one
  StorageClass per catalystStorageClasses[] entry; first entry annotated
  as cluster default when defaultStorageClass=true
- platform/hcloud-csi/chart/templates/volumesnapshotclass.yaml —
  VolumeSnapshotClass for backup workflows; default off
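
The StorageClass rendering rule above can be sketched in Python as follows.
This is an illustration under assumptions, not the Helm template: the entry
shape (a `name` key) and the function name are hypothetical, while the gate,
the first-entry default rule, and the annotation key come from this commit.

```python
# Hypothetical illustration of the rendering rule; the real logic is a Helm
# template in storageclass.yaml.
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def render_storage_classes(values):
    """Render one StorageClass manifest (as a dict) per catalystStorageClasses entry."""
    if not values.get("enabled", False):
        return []  # default-OFF gate: nothing is rendered
    manifests = []
    for index, entry in enumerate(values.get("catalystStorageClasses", [])):
        metadata = {"name": entry["name"]}  # assumed entry shape
        # Only the first entry is marked as cluster default, and only on opt-in.
        if index == 0 and values.get("defaultStorageClass", False):
            metadata["annotations"] = {DEFAULT_ANNOTATION: "true"}
        manifests.append({
            "apiVersion": "storage.k8s.io/v1",
            "kind": "StorageClass",
            "metadata": metadata,
            "provisioner": "csi.hetzner.cloud",
        })
    return manifests
```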

Why a separate Blueprint, not values toggle on bp-cilium:
- CSI drivers are independent of CNI. Mixing them risks coupling the
  network-plane upgrade cycle to the storage-plane upgrade cycle.

Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (StorageClass list,
SealedSecret reference, replicas, resource requests) is in values.yaml.

Validated:
- helm dependency build pulls upstream hcloud-csi:2.13.0 cleanly
- helm template with default values: 0 resources rendered (gate +
  Chart.yaml condition both fire correctly)
- helm template with enabled=true defaultStorageClass=true: 7 resources
  rendered (upstream CSI controller Deployment, node DaemonSet, CSIDriver,
  RBAC, plus Catalyst hcloud-volumes StorageClass with the
  storageclass.kubernetes.io/is-default-class annotation)

Schema collision lesson:
- Initial draft used .Values.storageClasses[] which collided with the
  upstream subchart's storageClasses array (different shape; subchart
  expects array under that exact name). Renamed to catalystStorageClasses
  + passed [] to upstream's hcloud-csi.storageClasses to suppress its
  own StorageClass rendering. Lesson logged in seam map.

Refs: #1094, #1095, #1101, docs/EPICS-1-6-unified-design.md §3.9 row 6,
docs/SRE.md §2.5, platform/cnpg/README.md.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:56:19 +04:00
e3mrah
eca27002ae
feat(bp-cilium): add Hubble UI HTTPRoute overlay (slice H7, #1095) (#1117)
Realizes design doc §3.9 row 7 (Hubble relay+UI on; OIDC ingress) as a
default-OFF scaffold that EPIC-5 (#1100) flips on per Sovereign once the
zero-trust observability tier is ready.

Why default-OFF in Phase-0:
- Hubble relay/UI is intentionally off in production today (SovereignA
  was crash-looping because monitoring.coreos.com/v1 ServiceMonitor was
  missing until bp-kube-prometheus-stack reconciled; issue #182).
- The OIDC enforcement at the gateway boundary is the missing piece —
  Cilium's L7 OIDC filter wires to bp-keycloak's `hubble-ui` client
  which lands in slice D1.
- Flipping the gate without the OIDC layer would leave Hubble UI
  publicly accessible. The template comments explicitly warn against
  this for production.

What ships:
- platform/cilium/chart/templates/hubble-ui-httproute.yaml — HTTPRoute
  exposing hubble-ui Service via cilium-gateway with the wildcard cert.
  Gated by `catalystOverlay.hubbleUI.{enabled,hostname}`.
- platform/cilium/chart/values.yaml `catalystOverlay:` block: hubbleUI.{
  enabled, hostname, gatewayRef.{name,namespace},
  serviceRef.{name,namespace,port}, auth (oidc|none, default oidc) }.
  All operator-overrideable per docs/INVIOLABLE-PRINCIPLES.md #4.

Operator opt-in path (per-Sovereign overlay at clusters/<sov>/bootstrap-kit/
01-cilium.yaml):
  spec.values.cilium.hubble.relay.enabled: true
  spec.values.cilium.hubble.ui.enabled: true
  spec.values.catalystOverlay.hubbleUI.enabled: true
  spec.values.catalystOverlay.hubbleUI.hostname: hubble.<sovereign-domain>
… AND bp-keycloak realm has a `hubble-ui` OIDC client (slice D1).

Validated:
- helm template with default values: 0 HTTPRoute resources rendered
- helm template with catalystOverlay.hubbleUI.enabled=true + hostname:
  exactly 1 HTTPRoute rendered with proper parentRefs/hostnames/backendRefs
- Original 34-resource render count unchanged in default mode (no
  regression to existing chart output)

Chart version bumped 1.2.1 → 1.3.0 (minor — new templates, no breaking).

Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 7,
§8 (EPIC-5 Networking).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:44:18 +04:00
e3mrah
68c68eaf7a
feat(bp-network-policies): land default-deny CCNP + system-namespace + DNS allow templates (slice H8, #1095) (#1116)
New platform/network-policies/ Blueprint scaffold per design doc §3.9 row 8.
Ships the cluster-wide zero-trust primitives that EPIC-5 (#1100) activates
as part of the networking roll-out.

What ships:
- platform/network-policies/blueprint.yaml — bp-network-policies 1.0.0
- platform/network-policies/chart/Chart.yaml — Helm chart, no upstream sub-chart
- platform/network-policies/chart/values.yaml — gate (enabled: false default)
- platform/network-policies/chart/templates/default-deny.yaml — CCNP that
  denies all ingress + egress at endpointSelector: {} (full-cluster scope)
- platform/network-policies/chart/templates/allow-system-namespaces.yaml —
  CCNP allowing full traffic for kube-system, flux-system, cilium,
  cert-manager, catalyst, openova-system, monitoring, ingress (set is
  parametric via .Values.allowSystemNamespaces — operator extends per
  Sovereign for gitea/harbor/loki etc.)
- platform/network-policies/chart/templates/allow-egress-dns.yaml — CCNP
  permitting UDP/TCP/53 to CoreDNS from every Pod (without this the cluster
  is unbootable under default-deny — first DNS lookup fails)

Why a separate Blueprint, not bp-cilium:
- bp-cilium is foundational, installed on every cluster on day 0.
  Default-deny breaks every workload that hasn't been allowlisted, so it
  cannot ship in bp-cilium without operator opt-in semantics.
- Separate Blueprint with enabled: false default preserves the safety
  boundary. EPIC-5 wires the activation when the rest of the zero-trust
  story is ready.

Per-namespace intra-namespace allow is intentionally NOT in this slice:
- Cilium CCNPs cannot express "same namespace as the source Pod" without
  listing every namespace, which contradicts dynamic Org provisioning.
- That allow rule is rendered as a per-namespace CiliumNetworkPolicy (CNP,
  namespace-scoped) by organization-controller (slice C1 of #1095) at
  Organization creation time. README + values.yaml note this for
  downstream Implementers.

Per docs/INVIOLABLE-PRINCIPLES.md #4, every policy parameter
(allowSystemNamespaces list, dnsNamespace, dnsServiceName) is in
values.yaml, not hardcoded.

Validated:
- helm template with default values: 0 resources rendered (gate works)
- helm template with enabled=true: exactly 3 CCNPs rendered (default-deny,
  allow-system-namespaces, allow-egress-dns), all parse cleanly through
  python yaml.safe_load_all
- CCNP CRD validation will happen on Sovereigns where bp-cilium is
  installed; local k3s here uses flannel so server-side dry-run is
  unavailable

Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 8 +
§8 (EPIC-5), ADR-0001 §2 (zero-trust).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:40:30 +04:00
e3mrah
82bf6f6eec
fix(bp-cilium): align declared upstream version with Chart.lock (slice H1, #1095) (#1115)
EPIC-0 audit found provenance drift in bp-cilium:
- Chart.yaml dependencies[0].version declared "1.19.3"
- values.yaml catalystBlueprint.upstream.version declared "1.19.3"
- Chart.lock pinned to 1.16.5 (truth-on-disk — what every Sovereign
  has actually been running)

The declared "1.19.3" was never installed anywhere. Aligning all three
to "1.16.5" so observability/audit pipelines that compare the declared
upstream version with the actually-deployed Cilium version stop reporting
a 3-minor mismatch.
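
A small Python sketch of the kind of comparison such an audit pipeline might
run. The function names and the dict keys are hypothetical; the version
strings are the ones quoted in this commit.

```python
def find_version_drift(declared, locked):
    """Return the names of declared version fields that differ from the locked version."""
    return sorted(name for name, version in declared.items() if version != locked)


def minor_gap(a, b):
    """Absolute difference between the minor components of two x.y.z versions."""
    return abs(int(a.split(".")[1]) - int(b.split(".")[1]))


# Values quoted in the commit message; the key names are illustrative only.
declared = {"Chart.yaml": "1.19.3", "values.yaml": "1.19.3"}
drifted = find_version_drift(declared, "1.16.5")
gap = minor_gap("1.19.3", "1.16.5")
```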

This is a pure metadata fix — no behavioral change. Rolling forward to a
newer Cilium minor (1.17.x or 1.18.x) is a separate slice that needs
real upgrade testing on a live data-plane cluster, including k3s
--flannel-backend=none compatibility and Gateway API CRD compatibility.

Validated:
- helm dependency build re-resolves to 1.16.5 cleanly
- Chart.lock unchanged (Cilium 1.16.5 was already what it had)

Chart version bumped 1.2.0 → 1.2.1 (patch). Blueprint.yaml mirrored.

Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.9 row 1, §11 row 3.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:36:15 +04:00
e3mrah
e8bf1aab69
feat(bp-nats-jetstream): land Stream + KV CR templates (slice H4, #1095) (#1114)
Realizes design doc §3.9 row 7. The chart had no templates/ directory —
NACK Stream and KeyValue CRs that ADR-0001 §6 mandates as the Catalyst
event spine were declared in docs but not in code.

What this slice ships:
- platform/nats-jetstream/chart/templates/_helpers.tpl — common labels +
  servers helper (defaults to <release>-nats Service URL, override via
  .Values.catalystStreams.servers).
- platform/nats-jetstream/chart/templates/streams.yaml — three Streams:
    * catalyst.audit  : 90-day retention, R=3, mirrored to DR (#1101)
    * catalyst.events : 24-hour retention (cross-replica fan-out + cold-
      start replay), R=3
    * catalyst.billing: 1-year retention, R=3, consumed by future billing
- platform/nats-jetstream/chart/templates/kv-buckets.yaml — three KVs:
    * idempotency  : 24h TTL, 256 MiB cap (write-path idempotency keys)
    * dr-leases    : 60s TTL (Continuum dns-quorum lease path; CF-KV
      bypasses this bucket)
    * policy-rollup: 7-day retention, 1 GiB cap (compliance scorer #1096)

Reconciliation gate:
- All resources render only when .Values.catalystStreams.enabled is true.
- NACK (nats-io/nack) is NOT a current dependency — installing it as a
  sibling Blueprint and flipping this toggle is a follow-up slice.
- Same default-off pattern the chart already uses for promExporter.podMonitor
  (issue #182) so a fresh Sovereign with no NACK keeps booting cleanly.

Per-tenant streams (org.<id>.events, app.<id>.events) are intentionally
NOT shipped here — they'll be created at runtime by organization-controller
(slice C1) and application-controller (slice C4) so they can scale per
tenant.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every retention,
TTL, replicas, and maxBytes is a values.yaml variable; per-Sovereign
overlays override.

Validated:
- helm dependency build pulls upstream nats:1.2.0
- helm template with default values: 0 catalyst-* resources rendered
  (catalystStreams.enabled=false, the safe default)
- helm template with catalystStreams.enabled=true: 6 resources rendered
  exactly as expected (3 Streams + 3 KeyValues, all in
  jetstream.nats.io/v1beta2)

Chart version bumped 1.1.2 → 1.2.0 (minor — new templates, no breaking).
Blueprint.yaml version mirrored.

Refs: #1094, #1095, #1096, #1101, docs/EPICS-1-6-unified-design.md §3.9
row 7, ADR-0001 §6.

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:32:54 +04:00
e3mrah
25ef20a8e5
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.

Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
  (legacy, served, not storage) and v1 (canonical, served, storage). The
  shared schema means the 38 existing v1alpha1 files in platform/ +
  products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
  spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
  tagline interchangeable; category | family interchangeable; docs |
  documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
  hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
  manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
  Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
  Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
  observability, outputs, depends[].values, manifests.values, etc.

Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
  category (25), family (20), docs (20), documentation (14+1), icon (25),
  tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
  canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
  §3. Those 5 files are fixed in this commit:
    * platform/cert-manager-powerdns-webhook/blueprint.yaml
    * platform/cert-manager-dynadot-webhook/blueprint.yaml
    * platform/crossplane-claims/blueprint.yaml
    * platform/powerdns/blueprint.yaml
    * platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
  --dry-run=server) against the new CRD.

Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
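
Three of the vectors above can be reproduced with a short Python sketch. The
ttl pattern and the visibility enum are quoted from this commit; the semver
regex is a simplified stand-in (an assumption), since the CRD's exact
"strict semver pattern" is not shown here.

```python
import re

# Pattern quoted in the commit message for rotation[].ttl.
TTL_PATTERN = re.compile(r"^[0-9]+(s|m|h|d)$")

# Assumption: a simplified MAJOR.MINOR.PATCH check standing in for the CRD's
# strict semver pattern, whose exact regex is not shown in the commit.
SEMVER_PATTERN = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")

# Enum quoted in the commit message for spec.visibility.
VISIBILITY = {"listed", "unlisted", "private"}


def is_valid_ttl(value):
    return TTL_PATTERN.match(value) is not None


def is_valid_version(value):
    return SEMVER_PATTERN.match(value) is not None


def is_valid_visibility(value):
    return value in VISIBILITY
```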

This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.

Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3

Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 22:25:08 +04:00
e3mrah
a6fb97f2ef
fix(cutover step-01): clone+push (regular repo) instead of pull-mirror (#1033)
PR #1029 added a step-06 PATCH to flip mirror=false before push so
the cutover-helmrepository-patches Job could write HelmRepository
URL pivots to local Gitea. On Gitea 1.22.3 the PATCH returns 200
but silently no-ops — `mirror_interval` updates but `mirror: true`
stays. The repo remains read-only and step-06 still hits HTTP 403
"remote: mirror repository is read-only". Reproduced on otech127
2026-05-05 with chart 0.1.22 deployed.

Per ADR (cutover ends upstream tracking — Sovereign goes
self-hosted from this point), the architecturally correct fix is
to never create the mirror in the first place. Step-01 now creates
a regular Gitea repo and bare-clones+pushes upstream content. All
refs (branches+tags) replicate via `git push --mirror --force`,
which is idempotent on re-runs.

Trade-off: post-cutover Sovereigns no longer auto-sync from
upstream — that's the intended cutover semantics anyway. Operator
re-runs this Job manually for chart rollouts (next-session
follow-up: dedicated post-cutover sync mechanism, perhaps a
periodic CronJob the operator can opt into).

Bumps:
- bp-self-sovereign-cutover chart 0.1.22 → 0.1.23
- bootstrap-kit pin 0.1.22 → 0.1.23

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 03:19:05 +04:00
e3mrah
a070808eda
fix(cutover step-06): convert pull-mirror to standalone before pushing patches (#1029)
Step-01 creates openova/openova on the Sovereign's local Gitea as a
pull mirror so it tracks upstream openova-public during early
bootstrap. After cutover, the Sovereign is self-hosted and MUST
diverge from upstream — but Gitea blocks pushes to a mirror with
HTTP 403 "remote: mirror repository is read-only".

Step-06 adds a Phase-1.5 PATCH /api/v1/repos/{owner}/{repo}
{"mirror": false, "mirror_interval": "0"} BEFORE attempting to
clone+push the HelmRepository URL pivot. This converts the
pull-mirror into a standalone writable repo — the way the post-
cutover Sovereign architecture expects it.

Caught on otech125 2026-05-05: cutover-helmrepository-patches Job
returned "FATAL: git push failed" with no upstream stderr (chart
0.1.20 lacks the printf '%s\n' "$push_err" fix from PR #1022, which
was published in 0.1.21 only). Reproduced by cloning openova/openova
from a debug pod and running git push: "remote: mirror repository
is read-only / fatal: ... HTTP 403". Without the demirror step,
EVERY Sovereign provisioned fails handover at this step.

Bumps:
- bp-self-sovereign-cutover chart 0.1.21 → 0.1.22
- bootstrap-kit pin 0.1.20 → 0.1.22

Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
2026-05-06 02:53:45 +04:00
e3mrah
478743db17
fix(cutover-step-06): actually surface git push stderr (PR #1021 merged with only chart bump) (#1022)
PR #1021 was supposed to ship this code fix but the chart-version bump
landed first and the actual sed didn't apply (sed quoting mishap). The
debug-error fix never reached main. Re-shipping now as a clean Edit-
based commit. Captures git push stderr into push_err and prints it on
FATAL so the next iteration's failed Job logs include git's actual
rejection (auth / branch protection / hook).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 02:12:00 +04:00
e3mrah
69980ed48e
chore(bp-self-sovereign-cutover): bump 0.1.20 → 0.1.21 (#1021)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-06 02:10:45 +04:00
e3mrah
608db53a25
fix(cutover 0.1.20): Step-06 pushes YAML edit to local Gitea so patches survive Flux reconcile (#970) (#971)
## Root cause (live on otech116 2026-05-05 14:38)

After the #968 fix shipped (0.1.19), the cutover engine reached Step-07
(87%) successfully, with Step-01..07 all completed. Then Step-08
(egress-block-test) caught that 38/38 HelmRepositories had reverted to upstream:

```
external HelmRepositories still pointing at ghcr.io/openova-io: 38
  OFFENDER flux-system/bp-cilium=oci://ghcr.io/openova-io
  ... (37 more)
FAIL — at least one HelmRepository did not pivot
```

But Step-06's job logs say:
```
[helmrepository-patches] OK bp-cilium -> oci://harbor.otech116.omani.works/openova-io
... (37 more OK)
ok=38 skip=0 fail=0
```

So Step-06 thought it succeeded — and it had, momentarily. But then
the bootstrap-kit Kustomization (which had successfully pivoted to
local Gitea via Step-05) reconciled its YAML from local Gitea, where
the YAML still declared `url: oci://ghcr.io/openova-io`. Within ~30s
every kubectl patch was undone. The cutover engine then aborted at
Step-08 verification.

## Fix

Step-06 now runs in two phases:
1. **Live K8s patches** (existing behaviour) — flips spec.url on every
   HelmRepository immediately. Useful for the cluster between cutover
   and the next reconcile.
2. **NEW — Push YAML edit to local Gitea** — clones `openova/openova`
   from the local Gitea over basic-auth, sed-rewrites every
   `clusters/_template/bootstrap-kit/*.yaml` declaration of `url:
   oci://ghcr.io/openova-io` → `oci://harbor.<sov-fqdn>/openova-io`,
   commits with a clear message, pushes back. Subsequent reconciles
   see local Harbor as the steady-state.

After the push, the script annotates `flux-system/openova` GitRepository
to trigger immediate reconciliation so the new YAML lands without
waiting for the polling interval.
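
The YAML rewrite in phase 2 amounts to a string substitution. A minimal Python
sketch, assuming the declaration appears literally as
`url: oci://ghcr.io/openova-io`; the function name is hypothetical and the
actual Job uses sed, not Python:

```python
UPSTREAM_URL = "oci://ghcr.io/openova-io"


def pivot_helm_repository_urls(yaml_text, sovereign_fqdn):
    """Replace the upstream OCI URL with the Sovereign's local Harbor URL.

    Returns the rewritten text and the number of replacements made.
    """
    local_url = f"oci://harbor.{sovereign_fqdn}/openova-io"
    needle = f"url: {UPSTREAM_URL}"
    count = yaml_text.count(needle)
    rewritten = yaml_text.replace(needle, f"url: {local_url}")
    return rewritten, count
```

Running it a second time on its own output makes zero replacements, which is
the property that makes a re-run of the step harmless.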

## Image change

Step-06 image bumped from `bitnami/kubectl:1.31.4` to `alpine/k8s:1.31.4`
because the new phase needs both `kubectl` and `git` in one image
(verified live on otech116 — both binaries present).

## Acceptance gate

Test case 16 added to cutover-contract.sh — guards against future
regressions that remove the `git clone`, the `git push origin main`,
or the `clusters/_template/bootstrap-kit` target dir reference.

## Live verification

Will fire on otech117 (next provision). Expected:
- Step-06 logs `cloning gitea-http.gitea.../openova/openova.git` then `pushed to ...`
- Step-08 verify PASSES (38/38 HelmRepositories pivoted in K8s + Gitea)
- self-sovereign-cutover-status `cutoverComplete: "true"`
- Egress block to ghcr.io safely activates

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:55:22 +04:00
e3mrah
3db19b76b1
fix(cutover 0.1.19): Step-01 gitea-mirror DNS readiness probe + backoffLimit=3 (#968) (#969)
## Root cause (live on otech115 2026-05-05 14:15)

After PR #959 (0.1.18) unblocked the auto-trigger to actually call
/internal/cutover/trigger, the cutover engine fired Step-01 within ~8s
of bp-self-sovereign-cutover Helm-install completing. The gitea Pod
had only just reached Ready state — cluster-DNS endpoint publication
for the headless service `gitea-http` was still in flight. One wget
returned `bad address gitea-http.gitea.svc.cluster.local` and exited
non-zero. Catalyst-api's cutover engine stamped Jobs with backoffLimit=0
(cutover.go:584), so a single DNS miss was terminal and aborted all 8
cutover steps. otech115 finished provisioning with cutoverComplete=false
and remained tethered to upstream github.com and ghcr.io.

## Fix (dual-layer)

**Layer A — catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3.
A single transient miss is recoverable (4 attempts over each step's
activeDeadlineSeconds) without burning operator-attention. Hard failures
still surface within budget.

**Layer B — chart Step-01 (01-gitea-mirror-job.yaml)**: explicit
nslookup readiness probe at the top of the bash script, before any
wget call. 30 attempts × 5s = 150s budget; alpine/git ships nslookup
in /usr/bin (verified live on otech115). Layer B is faster than Layer A
(in-script DNS retry vs Pod recreate); Layer A is the safety net for
any other transient pre-cluster-stable race we haven't yet enumerated.
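
The Layer B probe is a bounded retry loop. A minimal Python sketch of the same
idea, assuming a 30 × 5s budget as described above; the real probe is an
nslookup loop in the Job's shell script, and `wait_for_dns` with its
injectable `resolve` and `sleep` parameters is hypothetical:

```python
import socket
import time


def wait_for_dns(host, attempts=30, delay_seconds=5.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Return the 1-based attempt on which `host` resolved, or 0 if it never did."""
    for attempt in range(1, attempts + 1):
        try:
            resolve(host)
            return attempt
        except OSError:
            # socket.gaierror is a subclass of OSError.
            if attempt < attempts:
                sleep(delay_seconds)
    return 0
```

Passing `resolve` and `sleep` as parameters keeps the loop testable without
real DNS or real waiting.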

## Acceptance gate

Test case 15 added to platform/self-sovereign-cutover/chart/tests/
cutover-contract.sh — guards against future regressions that drop
either the gitea_host extraction or the nslookup loop.

## Live verification

Will fire on the next provision (otech116). Expected:
- Step-01 logs `[gitea-mirror] DNS ready for gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true

Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 18:25:15 +04:00
e3mrah
d1431bed09
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is
not specified" on every Sovereign (otech112 evidence). HelmRelease
reports Ready=True (Helm install succeeded) but the Pod
CrashLoopBackOffs invisibly behind the false-positive condition.

Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.

Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
  umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
  the upstream Deployment's secretKeyRef references resolve cleanly
  during `helm template` AND in the live cluster regardless of overlay
  state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto
  the upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps
  it under `flux-system/cloud-credentials.hcloud-cloud-init`; the
  bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
  `clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
  receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
  Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
  blueprint-release "Run chart integration tests" step.

Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
  cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (not orderable in any
  region for new servers), reflecting the gap between what Hetzner's
  /v1/server_types lists and what POST /v1/servers accepts.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
  current SKU to recommended default when region change drops it out
  of orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
  `provisioner.Request.Validate()` with same predicate so a stale
  wizard build OR direct API caller bypassing the UI cannot dispatch
  otech109's failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
  legacy singular path.
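
A Python sketch of the availability predicate, for illustration only (the
commit implements it in TypeScript and Go). The region lists are the ones
quoted above; treating a SKU with no matrix entry as unrestricted is an
assumption, and the snake_case names are hypothetical.

```python
# Region lists taken from the commit message; an empty list means the SKU
# cannot be ordered anywhere for new servers.
AVAILABLE_REGIONS = {
    "cpx32": ["fsn1", "nbg1", "hel1"],
    "cpx21": [],
    "cpx31": [],
}


def is_sku_available_in_region(sku, region):
    """Assumption: a SKU with no entry in the matrix is treated as unrestricted."""
    regions = AVAILABLE_REGIONS.get(sku)
    if regions is None:
        return True
    return region in regions


def validate_request(sku, regions):
    """Return one error string per region where the SKU is not orderable."""
    return [
        f"{sku} is not orderable in {region}"
        for region in regions
        if not is_sku_available_in_region(sku, region)
    ]
```

Applying the same predicate on the API side is what stops a stale wizard build
or a direct API caller from dispatching a cpx32 worker in `ash`.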

Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at
publish time.

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:21:59 +04:00
e3mrah
238c6d2010
fix(bp-flux): mitigate helm-controller leader-election loss + stuck-HR recovery (#925) (#960)
* fix(bp-flux): mitigate helm-controller leader-election loss + recovery CronJob (#925)

On otech113.omani.works the bp-vpa HelmRelease became stuck Ready=Unknown
forever after a transient kube-apiserver blip caused helm-controller to
lose its leader-election lease mid-install. The Helm release secret was
already committed (Status=deployed) by the previous leader, but its last
write to the HR's Ready condition was Unknown and the new leader's
"release in storage?" short-circuit never re-evaluates that. The HR
blocked bootstrap-kit → sovereign-tls → cilium-gateway, breaking every
HTTPRoute on the Sovereign.

Fix is two-pronged:

1) PRIMARY (prevent the trigger). Stretch leader-election lease durations
   on the three Catalyst-critical controllers (helm/kustomize/source) from
   the upstream defaults of lease=35s renew=30s retry=5s to lease=60s
   renew=40s retry=5s, and bump memory limits from 256Mi to 512Mi (helm)
   / 384Mi (kustomize, source) so OOMKills during 35-HR fan-out installs
   don't themselves trigger leadership handoffs. Costs ~50s extra failover
   time on a real controller crash; that's acceptable since CP HA is a
   Phase 2 concern and we'd much rather avoid spurious flips during
   transient API pressure.

2) RECOVERY (handle the residual case). New CronJob bp-flux-stuck-hr-recovery
   runs every 2 minutes, scans every HelmRelease cluster-wide, and for each
   HR stuck in Ready=Unknown for >5 minutes whose underlying Helm release
   secret already has status=deployed, force-toggles spec.suspend (the only
   known workaround per #925). Guardrail: refuses to act if more than 10
   HRs would be touched in a single run (signals a cluster-wide outage).
   Operator-disablable via .Values.catalyst.stuckHelmReleaseRecovery.enabled=false.
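
The selection rule of the recovery CronJob can be sketched in Python. This is
an illustration under assumptions: the dict keys and the function name are
hypothetical, and the real job reads HelmRelease conditions and Helm release
secrets from the cluster rather than taking a list.

```python
def select_stuck_releases(releases, now, stuck_after_seconds=300, max_touched=10):
    """Pick HelmReleases eligible for a suspend/resume toggle.

    Each release is a dict with hypothetical keys: name, ready ("True",
    "False" or "Unknown"), ready_since (epoch seconds), helm_status.
    Returns [] when more than `max_touched` qualify, mirroring the guardrail.
    """
    stuck = [
        r["name"]
        for r in releases
        if r["ready"] == "Unknown"
        and now - r["ready_since"] > stuck_after_seconds
        and r["helm_status"] == "deployed"
    ]
    if len(stuck) > max_touched:
        return []  # likely a cluster-wide outage: refuse to act
    return stuck
```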

Lock-in tests: tests/leader-election-and-recovery.sh covers all three
flag/memory bumps, CronJob render, RBAC presence, disable-toggle, and
threshold operator override. version-pin-replay + observability-toggle
still green.

Chart bumped 1.1.4 → 1.2.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-flux): bump blueprint.yaml spec.version to 1.2.0 to match Chart.yaml (#925)

The bootstrap-kit static validation gate (Chart.yaml version ==
blueprint.yaml spec.version) caught the missed bump on PR #960.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:05:38 +04:00
e3mrah
b7f150db38
fix(cutover 0.1.18): poll /healthz for readiness instead of auth-gated /status (#957) (#959)
The 0.1.17 auto-trigger Job was Complete=True on otech113 but the
cutover never actually started: the readiness probe loop polled
/api/v1/sovereign/cutover/status (auth-gated, behind RequireSession)
and treated 401 as "API not ready". The loop ran 30 times for 300s
and exited 0 — the trigger endpoint was NEVER called.

Live evidence on otech113 2026-05-05:
  - 30 consecutive 401s from auto-trigger Pod (10.42.4.216) on
    /sovereign/cutover/status in catalyst-api access log
  - zero hits on /api/v1/internal/cutover/trigger
  - Helm post-upgrade hook deadline tripped → rollback to 0.1.15

Fix (chart-side only; PR #947 catalyst-api endpoint is correct as-is):
  - poll /healthz (unauthenticated, always 200 when process is up)
  - drop the pre-flight cutoverComplete=true short-circuit since
    /internal/cutover/trigger is already idempotent (returns 200 with
    the existing snapshot when cutoverComplete=true, per
    cutover_internal.go line 279)
  - bump chart 0.1.17 → 0.1.18; pin slot 06a to 0.1.18
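
A minimal Python sketch of the probe-then-trigger flow, for illustration only.
`probe` and `trigger` are hypothetical callables returning an HTTP status, and
the non-zero exit when readiness is never observed is an assumption added
here, not something this commit states.

```python
def run_auto_trigger(probe, trigger, attempts=30):
    """Call `trigger` once `probe` reports HTTP 200; return a process exit code.

    `probe` stands in for GET /healthz and `trigger` for the call to
    /internal/cutover/trigger; both are hypothetical callables.
    """
    for _ in range(attempts):
        if probe() == 200:
            status = trigger()
            return 0 if status == 200 else 1
    # Assumption: fail loudly instead of exiting 0 without ever triggering.
    return 1
```

With an auth-gated probe that always returns 401, this sketch never calls
`trigger` and returns 1, which makes the 0.1.17 failure mode visible instead
of silent.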

Tests:
  - contract gate Case 13: probe target is /healthz, NOT
    /sovereign/cutover/status (regression guard)
  - contract gate Case 14: no stale cutoverComplete pre-read off
    /tmp/status.json (the file no longer exists)
  - existing 12 contract gates still pass; helm lint clean
  - existing 6 Go unit tests for HandleCutoverInternalTrigger pass

Closes #957

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:02:12 +04:00