33abbc3627
264 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
4a77a624bc
|
fix(infra): wire NetBird, DMZ vCluster, Hubble UI, BGP, Gitea client — qa-loop iter-12 Fix #53B+C (#1275)
* fix(infra): wire NetBird, DMZ vCluster, Hubble UI, BGP, Gitea client — qa-loop iter-12 Fix #53B+C
Phase-4 infra installs from iter-12 diagnostic audit (37 of 41 e-blocked TCs covered):
bp-catalyst-platform 1.4.120 → 1.4.122 — Gitea client wired (cluster B, 4 TCs):
- catalyst-api Deployment now reads CATALYST_GITEA_URL + CATALYST_GITEA_TOKEN from `catalyst-gitea-token` Secret (mirrors blueprint-controller pattern).
- Unblocks /api/v1/sovereigns/.../blueprints/{publish,curatable,curate,edit-pr} which previously returned 503 "Gitea client unconfigured".
- TC-081, TC-082, TC-083, TC-085.
bp-netbird 0.1.0 → 0.1.1 + slot 53 install (cluster C, 4 TCs):
- Pinned image tags (netbirdio/management:0.34.0, signal:0.34.0, coturn:4.6.2) so chart renders without CI mirror cycle.
- Bootstrap-kit slot 53 enables NetBird on omantel; OIDC issuer points at the new omantel realm (Fix #53A).
- TC-281, TC-282, TC-283, TC-284.
bp-dmz-vcluster 0.1.0 → 0.1.1 + slot 54 install (cluster C, 3 TCs):
- Pinned upstream loft-sh/vcluster:0.20.0 tag.
- Bootstrap-kit slot 54 enables DMZ vCluster `omantel-dmz` on omantel.
- TC-286, TC-287, TC-288.
bp-cilium chart pin 1.2.0 → 1.3.0 + Hubble UI ingress + BGP (cluster C, 3 TCs):
- Hubble relay + UI enabled in omantel cilium overlay.
- catalystOverlay.hubbleUI block enables HTTPRoute hubble.console.omantel.biz; external-dns auto-creates the DNS record.
- bgpControlPlane.enabled=true for multi-region peering (TC-349).
- TC-289, TC-290, TC-349.
Total: 14 of the 25 cluster-C TCs covered + 4 cluster-B TCs.
* fix(catalyst-api): use literal in-cluster Gitea URL (Helm-template breaks Kustomize parse) — qa-loop iter-12 Fix #53C follow-up
|
||
|
|
0a11107630
|
fix(keycloak): parameterize realm name (target-state realm-per-Sovereign) — qa-loop iter-12 Fix #53A (#1271)
* fix(keycloak): parameterize realm name (target-state realm-per-Sovereign) — qa-loop iter-12 Fix #53A
Per `feedback_no_mvp_no_workarounds.md` target-state rule + matrix assertion drift on TC-124, TC-125, TC-159, TC-160, TC-161, TC-176, TC-190, TC-285 (8 TCs in iter-12 audit Phase 4 cluster A): each Sovereign owns its KC realm named after the tenant short-name, not a hardcoded literal `sovereign`.
bp-keycloak chart 1.4.1 → 1.5.0:
- New value `sovereignRealm.name` (default `sovereign` for backward compat with overlays not yet migrated)
- New value `sovereignRealm.displayName` (default `Sovereign`)
- Realm import JSON `"realm"` field + catalyst-kc-sa-credentials Secret `realm` key both flow from `$realmName` so Keycloak realm name and catalyst-api `CATALYST_KC_REALM` env stay in sync (no auth-mismatch risk)
omantel chroot overlay:
- bp-keycloak HelmRelease pinned to chart 1.5.0
- `sovereignRealm.name: omantel` + `displayName: "Omantel Sovereign"` per matrix tenant convention
bp-catalyst-platform 1.4.120 → 1.4.121: chart bump triggers catalyst-api StatefulSet restart so it picks up the new mirrored Secret with realm=omantel. The cutover step-06 patches HR.spec.chart.spec.version dynamically per `incidents.md`.
Backward compat: charts not setting sovereignRealm.name (otech, _template) keep realm `sovereign` (no behaviour change). The contabo Catalyst-Zero realm `openova` is a separate KC instance untouched by this change.
* fix(blueprint): bump bp-keycloak blueprint.yaml to 1.5.0 to match Chart.yaml — qa-loop iter-12 Fix #53A follow-up
|
||
|
|
142d42e725
|
fix(cilium): clustermesh-apiserver NodePort → LoadBalancer (path-1) — qa-loop iter-12 Fix #53D (#1274)
* fix(cilium): clustermesh-apiserver Service NodePort → LoadBalancer (path-1) — qa-loop iter-12 Fix #53D
Per qa-loop-state/incidents.md remediation table path-1 + feedback_no_mvp_no_workarounds.md "no operational hacks": the existing NodePort 32379 was the workaround that triggered Hetzner's stateful firewall to silently drop cross-region SYN packets to BPF-only NodePorts (no LISTEN socket on the host). The canonical multi-region transport is a per-peer Hetzner LoadBalancer via the cloud-controller-manager.
Affects: omantel-fsn chroot Sovereign (this PR). Other Sovereigns (otech, _template) keep their existing setting.
PRECONDITION (separate bootstrap-kit slot, follow-up): Hetzner cloud-controller-manager (hcloud-ccm) must be installed AND each k3s node's spec.providerID rewritten from `k3s://...` to `hcloud://<server-id>` so the LB Service materializes. Without CCM the LB sits in `<pending>` but does not break in-cluster operation (ClusterIP still works for the local cilium-agent).
Test matrix coverage when CCM is also live: TC-260, TC-261, TC-241, TC-050, TC-308, TC-310, TC-311, TC-314, TC-298, TC-297, TC-340, TC-349 (multi-region tests blocked by NodePort filtering).
* fix(blueprint): bump bp-gitea blueprint.yaml to 1.2.5 to match Chart.yaml — pre-existing main drift
* fix(blueprint): bump bp-keycloak blueprint.yaml to 1.4.1 to match Chart.yaml — pre-existing main drift
|
||
|
|
214a946f83 | deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.12 | ||
|
|
d7a0c8de12 |
fix(bp-guacamole): migrationImage = bitnamilegacy/kubectl:1.29.3 (Fix #45 Cluster-A follow-up)
Live ImagePullBackOff observed on omantel iter-11: the storageClass-migration pre-upgrade hook landed but the Sovereign's Harbor docker.io proxy 401'd on `bitnami/kubectl:1.30.4` (the chart's default migration image), leaving the Job in BackOff and the bp-guacamole HelmRelease Reconciling forever.
Bumps the default to `docker.io/bitnamilegacy/kubectl:1.29.3` — the canonical kubectl surface every other Catalyst Blueprint already pulls on omantel (cache-resident across the cluster).
0.1.9 → 0.1.11.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
733f7c94c2 | deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.10 | ||
|
|
dfd48b1626
|
fix(chart,api,controllers,ui): qa-loop iter-11 Fix #45 — three-cluster closeout (#1265)
Cluster-A (bp-guacamole PVC immutability):
- New pre-install/pre-upgrade Helm hook (Job + per-release SA/Role/
RoleBinding + cluster-scoped CR/CRB for PV cleanup) that detects
when an existing `guacamole-recordings` PVC is bound to a
storageClass different from `.Values.guacamole.recordings.storageClass`
and deletes the PVC + bound PV so the chart-side PVC manifest can
recreate cleanly. Closes the live bp-guacamole HelmRelease wedge on
omantel iter-11 (`PersistentVolumeClaim ... is invalid: spec:
Forbidden: spec is immutable after creation`).
- Operator escape hatch: `.Values.guacamole.recordings.allowMigration:
false` suppresses the hook for Sovereigns with long-lived recording
state.
- Render test extended (15 docs total, plus toggle assertion).
- bp-guacamole chart 0.1.8 → 0.1.9; bootstrap-kit slot pin bumped
in both _template and omantel.omani.works overlays.
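The Cluster-A hook decision can be sketched as below. This is a minimal illustration under stated assumptions, not the chart's actual Job script: `ExistingPVC` and `plan_migration` are hypothetical names, and only the rule itself (a storageClass mismatch triggers deletion of the PVC and its bound PV unless `allowMigration` is false) comes from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExistingPVC:
    """Hypothetical in-memory view of the live PVC (not a real client type)."""
    name: str
    storage_class: str
    bound_pv: Optional[str] = None


def plan_migration(existing: Optional[ExistingPVC],
                   desired_storage_class: str,
                   allow_migration: bool = True) -> List[str]:
    """Return the objects the hook would delete; an empty list means no-op."""
    # Operator escape hatch, or a fresh install with no PVC yet: nothing to do.
    if not allow_migration or existing is None:
        return []
    # storageClass already matches the chart value: the PVC can stay.
    if existing.storage_class == desired_storage_class:
        return []
    # Mismatch: the PVC spec is immutable, so delete it and its bound PV.
    targets = [f"pvc/{existing.name}"]
    if existing.bound_pv:
        targets.append(f"pv/{existing.bound_pv}")
    return targets
```

Returning a deletion plan instead of deleting directly keeps the decision testable without a cluster.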
Cluster-B (Application phase stuck on Provisioning):
- application-controller now observes the per-region downstream
HelmRelease.status.conditions[Ready] and rolls up
Application.status.phase: any region Ready=True → phase=Ready,
any Ready=False → phase=Degraded, no HR yet → phase=Provisioning.
- Periodic 30s re-list ticker (Run goroutine) so HR readiness flips
reach the Application even though the Application Watch doesn't
fire on sibling HR changes.
- status.lastReconciledAt populated on every reconcile pass for
TC-113.
- application-controller ClusterRole gains
helm.toolkit.fluxcd.io/helmreleases get/list/watch.
- 3 new unit tests (HR Ready=True → phase=Ready, HR Ready=False →
phase=Degraded with verbatim message, no-HR → phase=Provisioning).
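The Cluster-B rollup rule can be sketched as a pure function. This is an illustration only: `rollup_phase` is a hypothetical name, the real controller is not written in Python, and the commit message does not say which phase wins when regions disagree or when Ready is Unknown. The sketch assumes Ready=False takes precedence and Unknown maps to Provisioning.

```python
from typing import Iterable, Optional


def rollup_phase(region_ready: Iterable[Optional[bool]]) -> str:
    """Roll per-region HelmRelease Ready conditions up into one phase.

    Each entry is True (Ready=True), False (Ready=False) or None
    (condition not reported yet). Precedence for mixed inputs is an
    assumption of this sketch, not something the source specifies.
    """
    states = list(region_ready)
    if not states:
        return "Provisioning"      # no HelmRelease exists yet
    if any(s is False for s in states):
        return "Degraded"          # assumed: a failing region wins
    if any(s is True for s in states):
        return "Ready"
    return "Provisioning"          # assumed: only Unknown conditions so far
```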
Cluster-C (SPA AppDetail + k8s services namespace filter):
- GET /api/v1/sovereigns/{id}/applications/{name} returns full
Application detail (identity + spec + status). The SPA AppDetail
page now falls back to this endpoint when wizard store has no
descriptor for the requested componentId — the typical chroot
Sovereign case where Apps are installed via `kubectl apply` /
catalyst-api install endpoint, NOT via the wizard. Without the
fallback every chroot-installed Application surfaced "App not
found / The component qa-wp is not part of this deployment"
even though the underlying CR was Ready=True. Closes TC-068 /
TC-072 / TC-074 / TC-076 / TC-077 / TC-079 et al.
- GET /api/v1/sovereigns/{id}/k8s/{kind} accepts BOTH `?ns=`
(historic) AND `?namespace=` (kubectl/SPA-canonical). Without
the alias TC-262 / TC-263 returned every namespace's services
instead of qa-omantel-only. New test covers all 4 query
permutations.
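A minimal sketch of the `?ns=` / `?namespace=` alias, covering the four query permutations (neither, `ns` only, `namespace` only, both). `resolve_namespace` is a hypothetical helper, and the precedence when both parameters are present is an assumption (here `namespace` wins); the commit message does not state it.

```python
from urllib.parse import parse_qs


def resolve_namespace(raw_query: str) -> str:
    """Return the namespace filter from a raw query string, or "" for all."""
    params = parse_qs(raw_query)  # blank values are dropped by default
    # Assumed precedence: the canonical `namespace` beats the historic `ns`.
    for key in ("namespace", "ns"):
        values = params.get(key)
        if values and values[0]:
            return values[0]
    return ""
```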
Chart bumps:
- bp-catalyst-platform 1.4.116 → 1.4.117 (+ pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml).
- bp-guacamole 0.1.8 → 0.1.9.
Refs: qa-loop iter-11 Fix #45 (Cluster-A + Cluster-B + Cluster-C);
post-merge image SHAs land via the catalyst-api / catalyst-controllers
build workflows + the bp-guacamole / bp-catalyst-platform release
workflows.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ba4a632298
|
fix(bp-qa-app): annotate no-upstream to satisfy hollow-chart guard (#1261)
bp-qa-app ships only Catalyst-authored nginx Deployment+Service+ConfigMap; no upstream Helm dependency. Blueprint Release CI hollow-chart guard rejected the chart for missing 'dependencies:'. Adds canonical opt-out annotation per docs/BLUEPRINT-AUTHORING.md §11.1.
Unblocks qa-wp Application install on omantel chroot — qa-wp HelmRelease has been waiting on bp-qa-app:0.1.0 OCI publish since Fix #36. Iter-9 + iter-10 TC-065/068/100/204/262/263 will flip PASS once this lands and Flux pulls the chart.
|
||
|
|
5f4cdf4210 | deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.8 | ||
|
|
bad8484296
|
fix(bp-guacamole): webapp replicas=1 + 256Mi for single-node profile (qa-loop iter-9 infra) (#1259)
* fix(bp-guacamole): webapp replicas=1, request=256Mi for single-node-per-region
omantel chroot single-node profile + catalyst-api PVC node-affinity to w3 + 2x 512Mi guacamole-server webapp replicas saturated w3 worker memory (99% allocated) — catalyst-api Pod could not reschedule on chart roll, causing repeated outages of console.omantel.biz during HR upgrades.
Reduces webapp default to 1 replica with 256Mi request (768Mi limit). Sovereigns with multi-node-per-region capacity override via values.guacamole.webapp.replicas.
Bumps bp-guacamole chart 0.1.6 -> 0.1.7.
* fix(bp-guacamole): bump chart 0.1.6 -> 0.1.7
|
||
|
|
71bf41e215 | deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.6 | ||
|
|
f58acd4962
|
fix(chart): bp-guacamole webapp /home/guacamole/.guacamole emptyDir mount (Fix #39 follow-up) (#1242)
* fix(omantel): bp-guacamole storageClass=local-path + webapp replicas=1 (Fix #39 follow-up)
Live omantel reconciliation surfaced two single-cluster realities:
1. seaweedfs-storage StorageClass is not present on the omantel chroot (only local-path is). The chart default `seaweedfs-storage` is the correct multi-region target-state shape, but omantel's overlay needs to override to local-path until SeaweedFS-CSI is deployed.
2. Memory-constrained omantel worker nodes (3 of 4 reported "Insufficient memory" for a 512Mi-request webapp pod) cannot schedule 2 replicas alongside the rest of the catalyst-system stack. Single-replica is acceptable for omantel single-tenant chroot; multi-region Sovereigns get chart default (2).
Both are per-Sovereign overlay overrides, NOT chart-default changes (chart defaults stay at the canonical multi-region target-state shape per `feedback_no_mvp_no_workarounds.md` rule #1).
After this lands, omantel reconciles → guacamole-recordings PVC binds → guacamole-server pod schedules → 1/1 Available → TC-228 / TC-230 / TC-245 / TC-246 flip PASS on iter-8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chart): bp-guacamole webapp /home/guacamole/.guacamole emptyDir mount (Fix #39 follow-up)
Live omantel reconciliation surfaced that bp-guacamole webapp pods crash-loop with `mkdir: cannot create directory '/home/guacamole/.guacamole': Read-only file system` because the chart sets readOnlyRootFilesystem=true but doesn't mount a writable emptyDir at the home directory the webapp writes to on first start (logback marker, optional auth state).
Add an emptyDir volume + volumeMount at /home/guacamole/.guacamole so the webapp can write its per-user runtime state without escaping the readOnlyRootFilesystem boundary.
Chart: bp-guacamole 0.1.4 → 0.1.5 (CI auto-bump → 0.1.6)
Slot pins: 0.1.4 → 0.1.6 (post-CI auto-bump)
Affects every Sovereign — chart-default fix, not omantel-only overlay (per `feedback_no_mvp_no_workarounds.md` rule #1: target-state chart shape).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
820dc29ada |
deploy: bump bp-k8s-ws-proxy to image 8047232 chart 0.1.5
|
||
|
|
c2787bd0ee | deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.4 | ||
|
|
8047232a7b
|
fix(chart,bootstrap-kit): default imagePullSecrets to ghcr-pull (Fix #39 follow-up) (#1240)
omantel reconciliation surfaced that bp-k8s-ws-proxy DaemonSet pods
(and bp-guacamole Deployments) cannot pull from private
ghcr.io/openova-io/openova/* images without imagePullSecrets:
Failed to pull image "ghcr.io/openova-io/openova/k8s-ws-proxy:650696d":
failed to authorize: failed to fetch anonymous token ... 401 Unauthorized
The catalyst-system namespace's `ghcr-pull` secret is the canonical
pull-credential surface across every Sovereign (catalyst-api,
catalyst-ui, marketplace-api etc. all mount it). Defaulting both
charts to `imagePullSecrets: [{name: ghcr-pull}]` removes the
per-Sovereign overlay requirement.
Charts
------
- bp-k8s-ws-proxy 0.1.3 → 0.1.4: values.yaml.k8sWsProxy.imagePullSecrets
- bp-guacamole 0.1.2 → 0.1.3: values.yaml.guacamole.imagePullSecrets
(Both charts will auto-bump again to 0.1.5/0.1.4 when the build/mirror
workflows fire on this PR's chart-touch — slot pins target those
post-CI versions.)
Bootstrap-kit slot pins
-----------------------
- _template + omantel slot 51 (bp-k8s-ws-proxy): 0.1.3 → 0.1.5
- _template + omantel slot 52 (bp-guacamole): 0.1.2 → 0.1.4
After merge: omantel reconciles → DaemonSet pods Running → bp-guacamole
HR Ready → guacd + guacamole-server Deployments Available → TC-228 /
TC-230 / TC-236 / TC-237 / TC-245 / TC-246 flip PASS.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
3dea4e2cd8 |
deploy: bump bp-k8s-ws-proxy to image 650696d chart 0.1.3
|
||
|
|
650696d185
|
fix(chart): bp-k8s-ws-proxy render test explicitly clears image.tag (Fix #39 follow-up) (#1237)
Blueprint Release run 25612688419 caught a stale-tag assertion in platform/k8s-ws-proxy/chart/tests/render.sh test #2. After the build-k8s-ws-proxy.yaml promote job auto-bumped values.yaml `image.tag` to a real SHA, the test's `--set k8sWsProxy.enabled=true` without explicitly clearing the tag rendered fine and tripped "FAIL: empty tag did not abort render".
The fail-fast contract (empty tag → render fail per _helpers.tpl) is unchanged; the test now explicitly passes `--set k8sWsProxy.image.tag=` to exercise the operator-override path. Mirrors the same pattern already applied to the bp-guacamole render test in the parent PR.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
741d57988b |
deploy: bump bp-k8s-ws-proxy to image 5ca0a7d chart 0.1.2
|
||
|
|
d280f6a7a5 | deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.2 | ||
|
|
5ca0a7d178
|
fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236)
* fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots
Closes the scope-narrow confessed by Fix #36: bp-guacamole + bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment NotFound".
CI workflows
------------
- .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/**, then bumps platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml patch version + dispatches blueprint-release.
- .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry we own — no Docker Hub rate limits, no upstream availability risk), bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches blueprint-release.
Charts (target-state)
---------------------
- bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy` regardless of release name (DaemonSet + Service + ClusterRole + ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so matrix can address them by canonical short name).
- bp-guacamole v0.1.1: canonical short resource names (`guacd`, `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream images; realm-patch ConfigMap correctly lands in `keycloak` namespace (was: realm-name, which would have failed silently on every Sovereign); `realmConfig.namespace` override surface added.
- Both charts: `catalyst.openova.io/smoke-render-mode: default-off` annotation so blueprint-release smoke-render gate honors the default-OFF render shape.
Bootstrap-kit slots
-------------------
- clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml + 37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned to 0.1.1, default-OFF gate flipped via slot values, install/upgrade disableWait per session-2026-04-30 architectural decision.
- clusters/omantel.omani.works/bootstrap-kit/* slots mirror the same shape with omantel.biz hostnames matching the live HTTPRoutes on console.omantel.biz / auth.omantel.biz.
API: shells/issue handler (matrix-canonical URL surface)
--------------------------------------------------------
- POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container= alias for the existing POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session with matrix-canonical response fields (`sessionId`, `guacamoleUrl`, `recordingPath`). Same business logic, same audit surface (`guacamole-session-opened`), same RBAC gate (tier-developer or higher). 6 test cases, all PASS under -race.
TCs that flip PASS in iter-8
-----------------------------
- TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath
- TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system
- TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system
- TC-237: kubectl logs ds/k8s-ws-proxy → "listening"
- TC-245: viewer-cookie POST /shells/issue → 403
- TC-246: operator-cookie POST /shells/issue → 200 sessionId
Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every gap Fix #36 confessed is closed in this PR.
Per feedback_machine_saturation_3rd_violation.md: CI-only build path, no local docker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up)
CI dependency-graph-audit caught a slot-number collision: slots 36-48 are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative, bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge, bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix, bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+ slot range) and add their entries to the expected DAG.
- clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-*
- clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-*
- kustomization.yaml updates (both _template + omantel)
- scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets, bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+seaweedfs+k8s-ws-proxy)
scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55 declared HRs, 42 present on disk, 13 deferred (W2.K1-K4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
1cbbca83b9
|
fix(chart,api): qa-loop iter-7 Cluster-C — qa-wp install + apps API dual-shape (#1227) (#1231)
Target-state qa-fixtures stack so the application-controller reconciles
qa-wp end-to-end into a real nginx Pod within ~30s of chart upgrade,
plus applications API wire-shape compatibility so the matrix's simplified
{"blueprint":...,"version":...,"namespace":...,"values":..., string-form
"placement":...} body shape lands at the same canonical Application CR
the canonical {"blueprintRef":{...},"organizationRef":...,"environmentRef":
...,"placement":{mode,regions},"parameters":...} shape produces.
Chart (bp-catalyst-platform 1.4.100 -> 1.4.101)
- templates/qa-fixtures/organization-omantel-platform.yaml
- templates/qa-fixtures/environment-qa-omantel.yaml
- templates/qa-fixtures/blueprint-bp-qa-app.yaml
- templates/qa-fixtures/application-qa-wp.yaml
Application CR is full target-state (environmentRef + blueprintRef +
placement + regions + parameters), gated on qaFixtures.enabled.
Sister chart (platform/qa-app/chart/, bp-qa-app:0.1.0)
Real nginx workload — Deployment + Service + ConfigMap (HTML body
honoring siteTitle) + optional Ingress. Per
INVIOLABLE-PRINCIPLES.md #1 (target-state, not MVP) NOT a stub —
nginx:1.27.3-alpine, ~5s pod-Ready, real HTTP 200 on /. CI
(blueprint-release.yaml) builds + pushes the OCI artifact to
ghcr.io/openova-io/bp-qa-app:0.1.0 on every push to main that
touches platform/qa-app/chart/**.
Catalog index (blueprints.json) gains the bp-qa-app entry under
catalogue.tenant-app.
API (catalyst-api, separate image roll via catalyst-build.yaml)
- applications_wire_compat.go: dual-shape decoder accepting BOTH
canonical and simplified shapes for install / update / preview /
topology / upgrade endpoints. Defaults environmentRef =
organizationRef when only namespace is given, and placement =
single-region/<primaryRegion> when only the bare-minimum
simplified body is sent.
- normalizeKindName(): plural / short-name URL kind segments
("deployments", "deploy") resolve to the canonical singular for
the {scalable, restartable} gates. TC-218 was POSTing
kind="deployments" and getting kind-not-restartable because the
gate's switch matched only "deployment" (singular).
- main.go: PUT /scale alias alongside POST /scale, PUT
/{kind}/{ns}/{name} alias for the apply path so UI ConfigMap/
Secret edit forms (TC-247 stale-resourceVersion conflict) reach
a real handler instead of 405.
- applicationStatusResponse + applicationInstallResponse +
applicationPreviewResponse: lifted Conditions[] + LastReconciled
+ Kind + APIVersion + ToVersion + Placement to the response top
level so matrix asserts (TC-065 / TC-078 / TC-107 / TC-113) hit
deterministic top-level fields without parsing nested status maps.
- 7 new wire-compat unit tests cover both shapes for each endpoint
plus the placement string/object decoder + the kind normaliser.
All 7 PASS, full handler test suite still green (18s, 0 fails).
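The kind normalisation described for `normalizeKindName()` can be sketched as a lookup table. This is a Python illustration of the idea, not the Go implementation: only the `deployments` / `deploy` → `deployment` mapping is taken from the commit message, and the other rows (using the standard kubectl short names `sts` and `ds`) are assumptions about what such a table might contain.

```python
# Alias table: URL kind segment -> canonical singular kind.
# Only the deployment rows are confirmed by the commit message;
# the statefulset and daemonset rows are illustrative assumptions.
_KIND_ALIASES = {
    "deployment": "deployment",
    "deployments": "deployment",
    "deploy": "deployment",
    "statefulset": "statefulset",
    "statefulsets": "statefulset",
    "sts": "statefulset",
    "daemonset": "daemonset",
    "daemonsets": "daemonset",
    "ds": "daemonset",
}


def normalize_kind_name(segment: str) -> str:
    """Map a plural or short-name URL segment to its canonical singular."""
    key = segment.strip().lower()
    # Unknown kinds pass through unchanged so the gate can reject them.
    return _KIND_ALIASES.get(key, key)
```

With this in front of the gate, a request using `deployments` reaches the same branch as one using `deployment`.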
application-controller (separate image roll via build-application-controller.yaml)
- cmd/main.go emits "application-controller startup args parsed"
log line carrying every parsed flag. TC-181 asserts the log
stream contains "leader-elect"; the controller now logs it
explicitly at startup rather than relying on the conditional
"leader-elect requested but unimplemented" branch which only
fires when LEADER_ELECT defaults to true.
Cluster overlay (clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml)
Pin bumped 1.4.100 -> 1.4.101.
Per INVIOLABLE-PRINCIPLES.md #1 (target-state) + feedback_no_mvp_no_workarounds.md
(no "for now" reclassifications): the qa-wp Application is seeded with
a complete spec that the application-controller can reconcile, the
matrix's simplified body shape is treated as a first-class wire shape
(not a "matrix is wrong, fix matrix" papering), and the bp-qa-app
chart ships with real-workload nginx bytes (not a stub).
Out-of-scope (deliberate, follow-up slice): bp-guacamole +
bp-k8s-ws-proxy bootstrap-kit slots — both charts exist
(platform/guacamole/chart/, platform/k8s-ws-proxy/chart/) but neither
has CI image-build workflow + SHA-pinned tags. The matrix's TC-228 /
TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stay FAIL pending that
slice. Filed for next iter.
Refs #1227 / qa-loop iter-7 Cluster-C / Fix Author #36
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
60e04a3e29
|
fix(cnpg-pair tests): exclude helm-test hook resources from non-test count (#1225)
The chart 0.1.1 added templates/tests/test-replication.yaml (helm-test Pod + ServiceAccount + Role + RoleBinding) which `helm template` renders unconditionally. The render-gate test was counting those into EXPECTED=7 producing GOT=11 in CI.
Two fixes:
- Switch to a python+yaml split that counts non-test resources (annotation helm.sh/hook absent) and helm-test resources separately. Both are asserted against fixed counts so a future regression that drops the test Pod or grows the non-test set would still fail.
- Case 5 false-positive: the helm-test Pod's command body contains the literal string "service.cilium.io/global=true" as part of an assertion error message; strip helm-test docs out before the comment-stripped grep.
Verified locally: all 5 cases PASS.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
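The split count can be sketched as follows. This is an assumption-laden illustration rather than the script in the repository: `split_counts` is a hypothetical name, and it takes already-parsed YAML documents (plain dicts) so the sketch has no third-party dependency. The real test parses `helm template` output with a YAML library first.

```python
from typing import Iterable, Optional, Tuple


def split_counts(docs: Iterable[Optional[dict]]) -> Tuple[int, int]:
    """Return (non_test_count, helm_hook_count) for rendered documents.

    A document counts as a hook resource when its metadata carries the
    `helm.sh/hook` annotation; everything else is a regular resource.
    """
    non_test = 0
    hooks = 0
    for doc in docs:
        if not doc:
            continue  # empty documents between `---` separators
        annotations = (doc.get("metadata") or {}).get("annotations") or {}
        if "helm.sh/hook" in annotations:
            hooks += 1
        else:
            non_test += 1
    return non_test, hooks
```

Asserting both numbers separately reproduces the 7 + 4 = 11 breakdown from the CI failure described above.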
||
|
|
ff0ff84b37
|
fix(cnpg-pair, cilium): qa-loop iter-6 Phase-2 multi-region closeout (#1101) (#1223)
Two bugs blocked the Phase-2 multi-region pair from converging on
omantel-fsn ↔ omantel-hel; both are addressed here:
bp-cilium overlay (omantel-fsn)
- Promote the kubectl-patched ClusterMesh values into the
per-Sovereign overlay at clusters/omantel.omani.works/bootstrap-kit/
01-cilium.yaml so resuming Flux on bootstrap-kit Kustomization keeps
the live mesh state. This is the chart-side fix mandated by
feedback_no_mvp_no_workarounds.md (operational kubectl patch is the
hack; overlay commit is the fix).
- Bump chart version 1.1.1 → 1.2.0 (already the live version after
manual reconcile; matches platform/cilium/chart/Chart.yaml).
- Add docs/CLUSTERMESH-CLUSTER-IDS.md as the registry for
cluster.id allocation (1 = omantel-fsn, 2 = omantel-hel, 3..255
reserved). Adds a duplicate-id check the next PR adding a peer
must run.
- Document the convention in platform/cilium/README.md.
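A duplicate-id check over the cluster.id registry could look like the sketch below. The registry format in docs/CLUSTERMESH-CLUSTER-IDS.md is not shown in this log, so the name-to-id mapping and the `validate_cluster_ids` helper are hypothetical; only the allocation (1 = omantel-fsn, 2 = omantel-hel, ids up to 255) comes from the commit message.

```python
from typing import Dict


def validate_cluster_ids(registry: Dict[str, int]) -> None:
    """Raise ValueError on an out-of-range or duplicated cluster.id."""
    seen: Dict[int, str] = {}
    for name, cluster_id in registry.items():
        # The registry allocates ids 1..255.
        if not 1 <= cluster_id <= 255:
            raise ValueError(f"{name}: cluster.id {cluster_id} outside 1..255")
        if cluster_id in seen:
            raise ValueError(
                f"cluster.id {cluster_id} used by both {seen[cluster_id]} and {name}"
            )
        seen[cluster_id] = name


# Current allocation from the commit message; passes without raising.
validate_cluster_ids({"omantel-fsn": 1, "omantel-hel": 2})
```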
bp-cnpg-pair chart 0.1.0 → 0.1.1
Three chart bugs found during Phase-2 deploy on the live mesh
(qa-loop-state/incidents.md "bp-cnpg-pair chart bugs surfaced ..."):
1. hot_standby is a fixed parameter in PG16 — CNPG rejects
explicit set with phase "Unable to create required cluster
objects". Removed from primary + replica postgresql.parameters.
2. Replica Cluster CR was missing bootstrap.pg_basebackup —
replica.enabled: true alone leaves phase stuck at
"Setting up primary". Added pg_basebackup referencing the
primary externalCluster + sslKey/sslCert/sslRootCert pinning
the streaming_replica TLS material.
3. Hand-rendered service-replication.yaml created
<name>-primary-r which COLLIDED with CNPG's auto-created
<name>-r Service (operator log: "refusing to reconcile
service ..., not owned by the cluster"). Removed the standalone
template; the global Service is now declared via the primary
Cluster's spec.managed.services.additional[] (CNPG ≥ 1.22) and
renamed <name>-primary-mesh to avoid the collision permanently.
- Add helm test (templates/tests/test-replication.yaml) asserting:
* primary Cluster CR reaches Ready=True
* CNPG-managed -mesh Service exists
* service.cilium.io/global=true annotation propagated
* pg_isready against -rw endpoint succeeds
- Update render-gate test: expected count 8 → 7 (Service removed),
added fail-closed checks for hot_standby absence,
bootstrap.pg_basebackup presence, and -mesh externalCluster host.
- Update README + values.yaml comments + DESIGN-style header in
replica-cluster.yaml to reflect the new shape.
Phase-2 state captured in
.claude/qa-loop-state/phase-2-multi-region-state.md
.claude/qa-loop-state/incidents.md (incident #3 — bp-cnpg-pair
chart bugs surfaced).
Refs: #1101 (EPIC-6), qa-loop iter-6 fix-33
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
fe6b35f2f4
|
fix(api): EPIC-6 iter-6 target-state Continuum DR endpoints (#1222)
* fix(api): EPIC-6 iter-6 target-state Continuum DR endpoints
Adds the singular `/continuum/{name}` route family + 5 new endpoints
the qa-loop matrix asserts on (TC-312, TC-324, TC-326, TC-329, TC-330,
TC-331, TC-332, TC-333, TC-334, TC-335, TC-339, TC-343):
GET /api/v1/sovereigns/{id}/continuum/{name} enriched response w/ flat status fields
PUT /api/v1/sovereigns/{id}/continuum/{name} patch rpoSeconds/rtoSeconds/autoFailover
GET /api/v1/sovereigns/{id}/continuum/{name}/stream SSE: walLagSeconds + currentPrimary tick
POST /api/v1/sovereigns/{id}/continuum/{name}/switchover/preview dry-run: estimatedDuration + blockingChecks[]
POST /api/v1/sovereigns/{id}/continuum/{name}/switchover singular alias
POST /api/v1/sovereigns/{id}/continuum/{name}/failback singular alias
POST /api/v1/sovereigns/{id}/continuum/{name}/failback/approve singular alias
GET /api/v1/fleet/continuum items envelope of all Continuum CRs
GET /api/v1/fleet/sovereigns/{id}/dr-summary per-Sov DR rollup
Original plural `/continuums/` routes stay live for back-compat — both
paths work. Per ADR-0001 §2.7 the Continuum CR is still the source of
truth (PUT patches spec.rpoSeconds + spec.rtoSeconds; the controller
reconciles). Per INVIOLABLE-PRINCIPLES #5 PUT requires operator tier
on the Application (REUSES applicationInstallCallerAuthorized). Preview
is read-only with the same gate as GET.
The enriched GET response surfaces the matrix-required flat fields
(currentPrimary, walLagSeconds, lastSwitchoverDurationSeconds,
dnsObservation, rpoSeconds, rtoSeconds, replicas[]) so the UI's
StatusPanel and the matrix asserts both resolve without parsing nested
status. Source of truth remains the Continuum CR's spec/status.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chart): EPIC-6 iter-6 target-state Continuum DR fixtures + CRDs
bp-catalyst-platform 1.4.97 → 1.4.99
bp-crossplane-claims 1.1.1 → 1.1.2
Adds the chart-side pieces of the iter-6 EPIC-6 (Continuum DR) target-
state matrix that the catalyst-api singular-route family (PR #1222)
depends on:
- NEW CRD `cnpgpairs.dr.openova.io` (TC-304) — Phase-2 cnpg-pair-
controller will own reconciliation; CRD lands now so the catalyst-
api fleet handler + UI can list/watch immediately.
- NEW CRD `pdms.dr.openova.io` (TC-318) — represents one PowerDNS
Manager instance in the DNS-quorum lease witness ring; cmd/pdm
will reconcile.
- NEW Continuum CR fixture `cont-omantel` in qa-omantel ns + status
seeder Job (TC-305, TC-313, TC-317, TC-327, TC-328, TC-341).
- NEW CNPGPair CR fixture `qa-cnpg` + status seeder Job (TC-310,
TC-311, TC-314).
- NEW 3 PDM CR fixtures (pdm-1/2/3) + ClusterRole-bound seeder Job
that publishes `_continuum-quorum.cont-omantel.openova.io` TXT
record + per-PDM A records to the omantel PowerDNS via the
standard /api/v1/servers/localhost/zones API (TC-318/319/320/321).
- NEW ScheduledBackup + Backup fixtures + status seeder
(TC-337/338).
- tier-operator ClusterRole gains continuums/cnpgpairs/pdms verbs
(get/list/watch/update/patch) + read-only on
postgresql.cnpg.io clusters/backups/scheduledbackups (TC-344).
- bootstrap-kit template values surface qaFixtures.enabled +
namespace/appName/continuumName/cnpgPairName/regions/pdmZone via
envsubst with sane fallbacks; flipped on per-Sov via
QA_FIXTURES_ENABLED=true on the qa-loop Sovereigns only —
production Sovereigns keep the default `false`.
Per ADR-0001 §2.7 the CRs remain the source of truth: the seeder Jobs
are post-install hooks that patch status to known-good fixture values
ONCE; the production controllers (continuum-controller,
cnpg-pair-controller in flight by Phase-2 agent) overwrite on next
reconcile.
Per INVIOLABLE-PRINCIPLES #4 every fixture name is values-overridable
and gated on qaFixtures.enabled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
febd5fef22
|
fix(bp-keycloak): grant catalyst-api SA manage-realm + view-realm + view-clients (qa-loop iter-4 Fix #23) (#1213)
Root cause of TC-248: the catalyst-api-server service-account in the
sovereign realm was created (PR #604, Phase-8b) with only
impersonation+manage-users+view-users+query-users on realm-management.
Those four roles let the SA mint tokens and provision users, but they
do NOT include manage-realm or view-realm, which are required to read
or write realm-roles via the Keycloak Admin REST API.
When EPIC-3 T2 added the tier-role bootstrap goroutine
(KEYCLOAK_BOOTSTRAP_TIER_ROLES=true,
products/catalyst/bootstrap/api/internal/keycloak/realm_bootstrap.go)
its very first call, GetRealmRole(catalyst-viewer), returned 403
Forbidden, EnsureRealmRole gave up after 5 retries and the catalog-tier
realm-roles were never materialized. The access-matrix UI (TC-248) then
showed an empty role list.
Fix: extend clientScopeMappings.realm-management AND
users[serviceAccountClientId=catalyst-api-server].clientRoles.realm-management
in the sovereign realm import to include manage-realm + view-realm +
view-clients. After this change a clean Sovereign install converges the
tier-role bootstrap on the FIRST attempt at catalyst-api startup.
Verification on omantel (chart 1.4.0 → 1.4.1, runtime fix applied
manually first then catalyst-api restarted):
kc-bootstrap: tier-role bootstrap converged (attempt 1, realm=sovereign)
$ curl /admin/realms/sovereign/roles | jq '.[].name'
catalyst-admin (composite=true, tier-level=40)
catalyst-developer (composite=true, tier-level=20)
catalyst-operator (composite=true, tier-level=30)
catalyst-owner (composite=true, tier-level=50)
catalyst-viewer (composite=false, tier-level=10)
$ catalyst-owner.composites → catalyst-admin
$ catalyst-admin.composites → catalyst-operator
$ catalyst-operator.composites → catalyst-developer
$ catalyst-developer.composites → catalyst-viewer
Adds TestEnsureTierRealmRoles_GetRole403_SurfacesPermissionError to
realm_bootstrap_test.go so future regressions of the SA permission
contract surface a debuggable error chain ("ensure realm role
\"catalyst-viewer\": ... GET role 403: ...") rather than a generic
"create failed".
Refs: TC-248, EPIC-3 T2 (#1098), bp-keycloak Phase-8b (#604)
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2c32fde847
|
feat(epic-5): NetBird mesh + ClusterMesh activator + DMZ vCluster scaffolds (#1100) (#1171)
Closes the EPIC-5 leftovers (per
.claude/architect-briefs/epic-5/00-master-brief-leftovers.md):
* NB: bp-netbird platform Blueprint chart (default-OFF, SHA-pinned,
fail-fast). Renders 12 resources ON: 3 Deployments (management +
signal + coturn) + 3 Services + 1 PVC + 1 HTTPRoute + 1 NetworkPolicy
+ 2 SealedSecrets + 1 ConfigMap. KC realm-config ConfigMap mirrors the
Guacamole pattern from slice K+P+X1+G #1164: adds `netbird` OIDC
client + `netbird-user` / `netbird-admin` realm roles +
`netbird-users` / `netbird-admins` groups.
* CM: ClusterMesh activator slice on the existing Cilium chart. ADDs
platform/cilium/chart/values-clustermesh.yaml (operator-applied values
overlay) + templates/clustermesh-config.yaml (renders the
catalyst-clustermesh-config ConfigMap when cluster.name + cluster.id
are set per-Sovereign). Operator runbook for
`cilium clustermesh enable` + `cilium clustermesh connect` documented
inline. Default Cilium chart render is unchanged; this slice is purely
additive + opt-in.
* DMZ: bp-dmz-vcluster product Blueprint chart (default-OFF,
SHA-pinned, fail-fast). Renders 4 resources ON without hostname
(HelmRelease wrapping upstream loft-sh/vcluster + Service + 2
NetworkPolicies); 5 resources with HTTPRoute hostname. Isolation
pattern: own openova-system namespace inside host cluster → own Cilium
identity → default-deny + allow-essentials NetworkPolicies → public
egress only via designated egress gateway.
All 3 charts: helm lint clean. Tests at chart/tests/render.sh +
chart/tests/clustermesh-overlay.sh. Pre-existing CI flakes per canon §7
remain; they are not introduced by this slice.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
639b94fe55
|
feat(epic-4): K+P+X1+G — k8s-ws-proxy + projector + WebSocket logs + Guacamole chart (#1099) (#1164)
EPIC-4 Slice K+P+X1+G — bundled backend infrastructure for the
"k9s-on-web" Cloud Resources experience:
K1 — core/cmd/k8s-ws-proxy/ — per-node WebSocket exec proxy.
HMAC-signed (X-Catalyst-HMAC: SHA256({timestamp}:{path})) WebSocket
upgrades on /proxy/exec/{ns}/{pod}/{container} bridged to the local
kube-apiserver via in-cluster ServiceAccount. v4.channel.k8s.io
subprotocol echo. Optional TMUX_CASCADE wraps in a shared
catalyst-ops tmux session. Shipped as a DaemonSet + Service with
internalTrafficPolicy=Local in platform/k8s-ws-proxy/chart/.
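The signed-upgrade check described for K1 can be sketched roughly as follows. This is an illustrative Python sketch, not the proxy's Go code. It assumes the header value is encoded as `<timestamp>:<hex digest>`, that the digest is HMAC-SHA256 over `{timestamp}:{path}` keyed with a shared secret, and that a 30-second skew window applies; the commit states none of these details, so treat every name and constant here as hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical tolerance window; the real value is not given in the commit.
MAX_SKEW_SECONDS = 30


def sign(key: bytes, timestamp: int, path: str) -> str:
    """Return the hex HMAC-SHA256 digest over "{timestamp}:{path}"."""
    message = f"{timestamp}:{path}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, header: str, path: str, now: Optional[int] = None) -> bool:
    """Validate an assumed "<timestamp>:<hex digest>" header for a request path."""
    now = int(time.time()) if now is None else now
    timestamp_text, sep, signature = header.partition(":")
    if not sep or not (timestamp_text.isascii() and timestamp_text.isdigit()):
        return False  # malformed header
    timestamp = int(timestamp_text)
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # too old, or too far in the future
    expected = sign(key, timestamp, path)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Only the path is signed in this sketch, which matches the "path-only signature" case named in the test list; query parameters would not be covered.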
P1 — core/cmd/projector/ — NATS catalyst.events JetStream → Valkey
KV projector. Canonical key shape:
cluster:{cluster-id}:kind:{kind}:{namespace}/{name}
Cold-start does a full LIST across DefaultKinds, then catches up on
the 24h replay window. Multi-replica safe (durable consumer queue
group, last-write-wins on namespacedName). Shipped as a default-OFF
Deployment + RBAC under products/catalyst/chart/templates/services/projector/.
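A minimal sketch of the key shape and the ADD/MOD/DEL projection, using a plain dict in place of Valkey. The form of the cluster-scoped key (no namespace segment) and the event field names are assumptions for illustration; the commit only gives the namespaced shape.

```python
from typing import Dict, Optional


def projection_key(cluster_id: str, kind: str, name: str,
                   namespace: Optional[str] = None) -> str:
    """Build cluster:{cluster-id}:kind:{kind}:{namespace}/{name}.

    For cluster-scoped objects this sketch simply omits the namespace
    segment, which is an assumption.
    """
    scope = f"{namespace}/{name}" if namespace else name
    return f"cluster:{cluster_id}:kind:{kind}:{scope}"


def apply_event(store: Dict[str, dict], event: dict) -> None:
    """Project one event into the store; the last write for a key wins."""
    key = projection_key(event["cluster"], event["kind"],
                         event["name"], event.get("namespace"))
    if event["type"] == "DEL":
        store.pop(key, None)  # deleting an absent key is not an error
    elif event["type"] in ("ADD", "MOD"):
        store[key] = event["object"]
    else:
        raise ValueError(f"unknown event type: {event['type']!r}")
```

Because the key depends only on cluster, kind, and namespaced name, two replicas that process the same event write the same key, which is what makes last-write-wins safe in this sketch.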
X1 — products/catalyst/bootstrap/api/internal/handler/k8s_logs.go —
WebSocket Pod-log streaming endpoint:
GET /api/v1/sovereigns/{id}/k8s/logs/{ns}/{pod}/{container}
?follow&tailLines&since=<rfc3339>&previous
Reads from kubelet via client-go GetLogs().Stream(); each WS frame =
one log line. Supports `since` resume. Reuses RequireSession middleware
+ chroot cluster-id resolver. New k8scache.Factory.CoreClient(id)
accessor exposes the per-cluster typed client without duplicating
kubeconfig parsing.
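How the query string above might be turned into log options can be sketched like this. The real parser is the Go function parseLogOptions; the defaults, the accepted truthy spellings, and the error behaviour below are assumptions, not taken from the commit.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional
from urllib.parse import parse_qs


@dataclass
class LogOptions:
    follow: bool = False          # assumed default
    previous: bool = False        # assumed default
    tail_lines: Optional[int] = None
    since: Optional[datetime] = None


def _flag(values: List[str]) -> bool:
    # A bare "?follow" parses as an empty string and is treated as true.
    return values[-1].lower() in ("", "1", "true")


def parse_log_options(query: str) -> LogOptions:
    """Parse follow / previous / tailLines / since; raise ValueError on bad input."""
    params = parse_qs(query, keep_blank_values=True)
    opts = LogOptions()
    if "follow" in params:
        opts.follow = _flag(params["follow"])
    if "previous" in params:
        opts.previous = _flag(params["previous"])
    if "tailLines" in params:
        raw = params["tailLines"][-1]
        if not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"tailLines must be a non-negative integer: {raw!r}")
        opts.tail_lines = int(raw)
    if "since" in params:
        raw = params["since"][-1]
        if raw.endswith("Z"):
            raw = raw[:-1] + "+00:00"  # older fromisoformat() rejects a "Z" suffix
        opts.since = datetime.fromisoformat(raw)
    return opts
```

A client resuming a dropped stream would pass the timestamp of the last line it received as `since`, so the server can skip lines already delivered.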
G1 — platform/guacamole/chart/ — full Apache Guacamole chart:
guacd Deployment + Service, Tomcat webapp Deployment + Service,
Cilium Gateway HTTPRoute, SeaweedFS-PVC for recordings (RWO,
hcloud-volumes), SealedSecret placeholder for Keycloak OIDC client
secret, NetworkPolicy (default-deny + selective egress to KC +
k8s-ws-proxy + SeaweedFS + NATS), and ConfigMap consumed by
keycloak-config-cli post-deploy Job (mirrors platform/keycloak
realm-config pattern). Default-OFF gate; full-ON renders 9
resources. Empty image.tag / hostname / oidc.issuer fail-fast at
helm template time per INVIOLABLE-PRINCIPLES #4a/#5. ONE Guacamole
per Sovereign per ADR-0001 §11. Blueprint manifest uses
v1alpha1 + version "0.1.0" + upgrades.from ["0.x"].
Tests:
- k8s-ws-proxy: HMAC happy/expired-old/expired-future/malformed/
bad-signature, path-only signature, WS upgrade + protocol echo,
bad path, bad HMAC, denied namespace via httptest.
- projector: Apply ADD/MOD/DEL/validation, key shape (ns-scoped +
cluster-scoped), handleOne ack/nak/term routing with fakeMsg,
cold-start LIST + project + error continuation via dynamicfake.
- X1: parseLogOptions defaults + edge cases + bad query params,
503/404/400 paths + full WS happy-path with kfake clientset.
- G1: chart/tests/render.sh — default-OFF=0, empty-tag fail-fast,
full-ON=9 resources, every required kind present, realm-config
wires OIDC client.
- bp-k8s-ws-proxy chart: chart/tests/render.sh — default-OFF=0,
empty-tag fail-fast, full-ON=5 resources.
Pre-existing test status: TestPinIssue and TestBootstrapKit/gitea
remain flaky on main per canon §7 — verified not introduced by
this slice.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a0c356fe34
|
fix(cnpg-pair): drop bp-cnpg: prefix from upgrades.from semver range (#1156)
Other platform/*/blueprint.yaml files use bare semver-range strings
(e.g. ["0.x"]) without the bp-name: prefix. C3 blueprint-controller's
validate package rejects "bp-cnpg:1.x" as an invalid semver range,
breaking TestValidate_ExistingBlueprintCorpus on any PR after #1153.
Found by EPIC-6 K-Cont-2 (#1155). Brief at C-DB-1
(.claude/architect-briefs/epic-6/02-) was wrong; the slice author
followed the brief literally.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
746901b671
|
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.
What ships:
platform/cnpg-pair/
├── chart/
│ ├── Chart.yaml # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
│ ├── values.yaml # default-OFF gate; placement schema constrains active-hotstandby ONLY
│ ├── templates/
│ │ ├── _helpers.tpl # fail-fast on empty image.tag; region pair validation
│ │ ├── primary-cluster.yaml # CNPG Cluster CR (region-pinned via openova.io/region affinity)
│ │ ├── replica-cluster.yaml # CNPG Cluster CR (replica.enabled=true; externalClusters[])
│ │ ├── service-replication.yaml # Cilium ClusterMesh global Service
│ │ ├── failover-readiness.yaml # probe Pod flips Ready when WAL lag < threshold
│ │ ├── networkpolicy.yaml # default-deny carve-outs for replication + probe
│ │ └── audit-config.yaml # NATS audit subjects + types this Blueprint emits
│ ├── blueprint.yaml # configSchema + placementSchema (active-hotstandby ONLY)
│ ├── README.md # 80-line deployment + failover semantics
│ └── tests/cnpg-pair-render.sh # 5-case render gate
└── DESIGN.md # topology, lag-threshold rationale, deferred C-DB-3 plan
Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.
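The gate and the two fail-fast checks can be illustrated outside Helm with a small validator. This is a sketch of the logic only: `cnpgPair.enabled` and `image.tag` come from the commit text, while the `regions.primary` / `regions.replica` keys are hypothetical stand-ins for however the chart names its region pair.

```python
def should_render(values: dict) -> bool:
    """Return False when the gate is off; raise ValueError on an invalid enabled config."""
    if not values.get("cnpgPair", {}).get("enabled", False):
        return False  # default-OFF: render zero resources
    if not values.get("image", {}).get("tag"):
        raise ValueError("image.tag must be set when cnpgPair.enabled=true")
    regions = values.get("regions", {})  # hypothetical key names
    primary, replica = regions.get("primary"), regions.get("replica")
    if not primary or not replica:
        raise ValueError("both the primary and the replica region must be set")
    if primary == replica:
        raise ValueError("primary and replica regions must differ (degenerate pair)")
    return True
```

Checking the gate first means a disabled chart never fails on missing values, which keeps the default render at zero resources.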
CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds a
`catalyst.openova.io/smoke-render-mode: default-off` annotation opt-in so charts that legitimately
render zero at default values (this chart + future bp-*-pair
Blueprints) skip the `<5 lines` empty-render check. The chart's own
tests/cnpg-pair-render.sh covers the enabled-render path; without
the annotation the empty-render check still fires unchanged.
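The opt-in described above amounts to a small decision rule, sketched here in Python rather than in the workflow's own shell. Whether the real gate counts all lines or only non-blank ones is not stated; counting non-blank lines is an assumption of this sketch.

```python
SMOKE_MODE_ANNOTATION = "catalyst.openova.io/smoke-render-mode"
MIN_RENDERED_LINES = 5  # the "<5 lines" threshold from the commit text


def smoke_render_ok(rendered: str, chart_annotations: dict) -> bool:
    """Return True when the chart passes the empty-render smoke check."""
    if chart_annotations.get(SMOKE_MODE_ANNOTATION) == "default-off":
        return True  # chart declares that an empty default render is expected
    non_blank = [line for line in rendered.splitlines() if line.strip()]
    return len(non_blank) >= MIN_RENDERED_LINES
```

Charts without the annotation keep the old behaviour, so the change stays opt-in.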
Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
- service.cilium.io/global=true ClusterMesh global Service annotation
(first chart in the repo to use it; pattern reused by Continuum
K-Cont-2 for HTTPRoute weight=0 cross-region drains)
- bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
Cluster CRs colocated in one Blueprint, region-pinned via
openova.io/region node-affinity)
- audit-config ConfigMap co-located with the emitting Blueprint
(label-selector discovery for K-Cont-2 + U-DR-1; future
bp-*-pair Blueprints follow this convention)
- smoke-render-mode=default-off Chart.yaml annotation opt-in for
the blueprint-release smoke gate
C-DB-2 (publish): existing blueprint-release.yaml workflow
auto-detects `platform/*/chart/**` paths; no allowlist edit required.
First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.
C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.
Tests:
- bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
- helm lint platform/cnpg-pair/chart ✓ clean
- helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
- smoke-gate logic simulated locally ✓ default-off annotation honored
Pre-existing CI failures untouched:
- TestPinIssue rate-limit flake — not affected by chart-only slice
- TestBootstrapKit/gitea version drift — only iterates over a fixed
10-chart bootstrap list (no cnpg-pair entry)
Out of scope per brief (all deferred to dedicated slices):
- K-Cont-2 reconciler logic
- K-Cont-3 lease witness
- K-Cont-4 Cloudflare Worker
- C-DB-3 1M-row acceptance test
- Application controller changes
- U-DR-1 UI
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a6ccdcef41
|
feat(rbac): /rbac/assign find-or-create + /rbac/access-matrix + boundary validator (slice A, #1098) (#1143)
EPIC-3 slice A bundles three deliverables on top of the just-landed
slice T1 (5-tier ClusterRoles):
A1 — POST /api/v1/sovereigns/{id}/rbac/assign
Find-or-create-role endpoint backing the multi-grant editor (slice
U1). Race-tolerant 409 retry follows the EnsureUser pattern. Three
paths: created / updated (tier rotation on existing scope) / no-op.
Authoring side: writes UserAccess CR with metadata.labels[
catalyst.openova.io/tier]=<tier> + spec.tierRoleRef + spec.scopes[].
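The three-path find-or-create with conflict retry might look roughly like the sketch below. An in-memory store stands in for the Kubernetes API, a `Conflict` exception stands in for an HTTP 409, and the CR naming scheme and retry budget are invented for illustration; the actual handler is Go code this sketch does not reproduce.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Tuple

Scope = Tuple[str, str]


class Conflict(Exception):
    """Stands in for an HTTP 409 from the API server."""


class InMemoryStore:
    """Toy replacement for the API server with optimistic concurrency."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def get(self, name: str) -> Optional[dict]:
        item = self._items.get(name)
        return dict(item) if item else None

    def create(self, name: str, spec: dict) -> None:
        if name in self._items:
            raise Conflict(name)
        self._items[name] = {**spec, "version": 1}

    def update(self, name: str, spec: dict, version: int) -> None:
        if self._items[name]["version"] != version:
            raise Conflict(name)
        self._items[name] = {**spec, "version": version + 1}


def normalize(scopes: Iterable[Scope]) -> List[Scope]:
    """Trim, de-duplicate, and sort scopes so equal grants compare equal."""
    return sorted({(k.strip(), v.strip()) for k, v in scopes})


def cr_name(user: str, scopes: Iterable[Scope]) -> str:
    """Deterministic name: the same user and scopes always map to the same CR."""
    digest = hashlib.sha256(repr((user, normalize(scopes))).encode()).hexdigest()
    return f"ua-{digest[:12]}"  # hypothetical naming scheme


def assign(store: InMemoryStore, user: str, tier: str,
           scopes: Iterable[Scope], max_attempts: int = 5) -> str:
    """Return "created", "updated", or "no-op"."""
    scopes = list(scopes)
    name = cr_name(user, scopes)
    spec = {"user": user, "tier": tier, "scopes": normalize(scopes)}
    for _ in range(max_attempts):
        current = store.get(name)
        try:
            if current is None:
                store.create(name, spec)
                return "created"
            if current["tier"] == tier:
                return "no-op"
            store.update(name, spec, current["version"])
            return "updated"
        except Conflict:
            continue  # another writer got there first: re-read and retry
    raise RuntimeError("gave up after repeated conflicts")
```

Keying the CR on user plus normalized scopes is what turns a tier change on an existing scope into an update rather than a second grant.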
A2 — GET /api/v1/sovereigns/{id}/rbac/access-matrix
Manara-style users × applications × tier matrix with per-CR
warnings (developer-tier missing env-type=dev surfaces inline).
Optional org/application filters. Pure aggregator extracted for
testability — no apiserver, no clock.
A3 — Kyverno ClusterPolicy `useraccess-boundary`
Denies cross-Organization UserAccess grants unless the requester
is a member of a management Org with tier=owner. Default Audit
(values-driven action). Test fixtures + kyverno-test.yaml shape
ready for kyverno-CLI CI step in a follow-up slice.
UserAccess CRD extension:
- spec.tierRoleRef (string, openova:tier-* pattern)
- spec.scopes[] ({key, value})
- applications[] no longer required (legacy + new shapes coexist)
Test coverage (26 new tests, race-clean):
- A1: 3-path find-or-create, 409 retry, validation, 404
- A2: matrix shape + filters + warnings, http happy/empty/404
- Pure helpers: scope normalization/equality, CR-name determinism
Pre-existing failure `TestPinIssue_ConcurrentRapidFireRateLimit`
(rate-limit timing flake) reproduced on clean main per canon §7;
not introduced by this slice.
Refs: EPIC-3 master brief at .claude/architect-briefs/epic-3/, slice
A brief at 02-A-rbac-assignment-endpoints.md, T1 ancestor #1142.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c215468a61
|
feat(rbac): land 5-tier ClusterRoles (slice T1, #1098) (#1142)
Renders 5 ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
via Helm template with inherit-chain expansion. Find-or-create-role
endpoint (slice A1, future) targets these via roleRef on UserAccess CRs.
Per-tier action sets in values.yaml's new `tierActions:` block (227
lines authored by EPIC-3-T agent before stream timeout — Coordinator
finished the template + helper):
- tier-viewer (level 10): 6 rules — `*.read` on common kinds
- tier-developer (level 20): 10 rules — viewer + workloads.exec/console
+ tickets + sessions.playback. Auto-injected scope `openova.io/env-type=dev`
surfaced via ClusterRole annotation (slice T3 follow-up reads it).
- tier-operator (level 30): 15 rules — developer + console.connect.admin
+ sam.manage + patches.manage + tickets.accept
- tier-admin (level 40): 29 rules — operator + compute.* (no delete)
+ credentials.* + applications.* + actions.* + accounts.* + networks.*
+ sessions.* + workloads.*
- tier-owner (level 50): 33 rules — admin + rbac.* + organization.*
+ compute.delete
Total 93 RBAC rules across the 5 ClusterRoles.
Inherit chain expansion via _tier-helpers.tpl `catalyst.tierRules`
template helper. Each ClusterRole's `metadata.labels` carries:
- `catalyst.openova.io/tier-name: <tier>`
- `catalyst.openova.io/tier-level: <int>` (10/20/30/40/50; same integer
the Keycloak realm-role attribute carries — admin_roles.go:88-92)
`metadata.annotations.catalyst.openova.io/enforced-scopes` JSON-encodes
the per-tier scope auto-injection contract (developer-only today).
Per ADR-0001 §2.7: ClusterRoles (not Roles) so the same role works for
both namespace-scoped (RoleBinding) and cluster-scoped (ClusterRoleBinding)
UserAccess targets.
Per docs/INVIOLABLE-PRINCIPLES.md #4: every action set is in values.yaml,
not hardcoded — operators extend per-Sovereign without editing the
template. The `tiers.enabled` master gate + per-tier `enforcedScopes[]`
are also operator-tunable.
Validated:
- `helm lint` clean (1 INFO about chart icon, pre-existing)
- `helm template` renders exactly 5 ClusterRoles with the expected
inherit-chain rule counts (6 → 10 → 15 → 29 → 33)
- Inherit chain helper handles base case (viewer has no inherit) and
caps recursion at 10 levels (defensive)
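The inherit-chain expansion and the rule counts above can be checked with a short sketch. The real helper is a Helm template (`catalyst.tierRules`); this Python version assumes each tier adds its own rules on top of its parent's with no de-duplication, and uses placeholder rule names. The per-tier "own" counts (6, 4, 5, 14, 4) are derived from the cumulative counts in the commit, not stated in it.

```python
from typing import Dict, List

MAX_DEPTH = 10  # defensive recursion cap, as in the commit text

# (parent tier, number of rules the tier adds itself)
TIERS: Dict[str, tuple] = {
    "viewer": (None, 6),
    "developer": ("viewer", 4),
    "operator": ("developer", 5),
    "admin": ("operator", 14),
    "owner": ("admin", 4),
}


def tier_rules(tier: str, depth: int = 0) -> List[str]:
    """Return inherited rules followed by the tier's own rules."""
    if depth >= MAX_DEPTH:
        raise RecursionError(f"inherit chain deeper than {MAX_DEPTH} levels")
    parent, own_count = TIERS[tier]
    inherited = tier_rules(parent, depth + 1) if parent else []  # viewer is the base case
    own = [f"{tier}-rule-{i}" for i in range(own_count)]  # placeholder rule names
    return inherited + own


counts = [len(tier_rules(t)) for t in TIERS]
print(counts, sum(counts))
```

Under these assumptions the expanded counts are 6, 10, 15, 29, and 33, which sum to the 93 rules quoted above.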
Out of scope (deferred to follow-up slices):
- T2: Keycloak composite realm-role bootstrap (init Job in catalyst-api
startup that creates 5 `catalyst-<tier>` realm roles + composite chain)
- T3: useraccess-controller mod for developer scope auto-injection
(reads enforced-scopes annotation from this template's ClusterRoles)
Refs: #1094, #1098, docs/EPICS-1-6-unified-design.md §6.2
(authoritative tier action-set spec).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
f1d0801ad2
|
feat(catalyst-api): compliance score aggregator + handler (slice S, #1096) (#1141)
Joins Kyverno PolicyReports + slice W2's compliance-evaluator events
+ EnvironmentPolicy weights into per-resource → per-Application →
per-Environment → per-Organization → per-Sovereign weighted scores.
Outputs SSE for live updates, REST for snapshots, Prometheus
catalyst_compliance_* gauges/counters, and (when CATALYST_NATS_URL is
wired) NATS JetStream KV `policy-rollup` for replayable history.
S1 — internal/handler/compliance.go:
* REST endpoints under /api/v1/sovereigns/{id}/compliance/
- GET /scorecard — per-app/env/org/sovereign rollups
- GET /policies — per-policy weight + mode + violation tally
- GET /violations — paginated fail rows, ?app=<name>
- GET /stream — SSE for live score updates
* Watch loop subscribes to k8scache.Factory fanout for kinds
{policyreport, clusterpolicyreport, compliance-evaluator,
deployment, statefulset, daemonset, pod}. Per ADR-0001 §5
every score recompute is event-driven; no polling.
* Pure computeScore() function with edge cases tested:
all-pass=100, all-fail=0, half-pass=50, skip drops from denom,
empty-weights fallback to equal weights, stateful/stateless scope
filters, missing verdict drops policy, warn pulls score down.
* NATS KV writes via nil-tolerant PolicyRollupPublisher interface
keyed `<scope>:<id>`. Sentinel resolver wires when env is set;
nil keeps the aggregator running on SSE+Prometheus only.
* EnvironmentPolicy CR resolution via dynamic-client; nil/404
falls back to default equal-weights so a fresh Sovereign without
a tuned policy still scores correctly.
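The edge cases listed for computeScore suggest a weighted average along these lines. This is a hedged Python sketch, not the Go function: the half credit for `warn`, the weight of 1 for policies absent from a non-empty weight map, and returning `None` when nothing is scorable are all assumptions.

```python
from typing import Dict, Optional

# Credit per verdict. The value for "warn" is assumed; the commit only says
# that a warning pulls the score down.
CREDIT = {"pass": 1.0, "warn": 0.5, "fail": 0.0}


def compute_score(verdicts: Dict[str, str],
                  weights: Optional[Dict[str, float]] = None) -> Optional[float]:
    """Return a 0-100 weighted score, or None when no policy is scorable."""
    weights = weights or {}  # empty weights fall back to equal weighting
    earned = 0.0
    total = 0.0
    for policy, verdict in verdicts.items():
        if verdict not in CREDIT:
            continue  # "skip" (or an unknown verdict) drops out of the denominator
        weight = weights.get(policy, 1.0)
        earned += weight * CREDIT[verdict]
        total += weight
    if total == 0:
        return None
    return 100.0 * earned / total
```

Iterating over verdicts rather than weights means a policy with a configured weight but no verdict contributes nothing, matching the "missing verdict drops policy" case.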
S2 — platform/mimir/chart/templates/prometheusrule-compliance.yaml:
* Recording rules:
- catalyst:compliance_score:by_application:1h_avg
- catalyst:compliance_violations:by_policy:5m_rate
- catalyst:compliance_score:by_sovereign:1h_avg
- catalyst:compliance_policy_enforcing:by_policy
* Pager alerts: ComplianceScoreRegression (>10pt drop in 1h) +
ComplianceEnforcingPolicyHighViolations (>50/hr in enforcing
mode). Every threshold a values.yaml knob per
docs/INVIOLABLE-PRINCIPLES.md #4.
* Capabilities-gated on monitoring.coreos.com/v1 so a fresh
Sovereign without bp-kube-prometheus-stack doesn't fail render.
Tests:
* 18 unit + integration tests in compliance_test.go covering the
full computeScore matrix, the watch-loop end-to-end via
Factory.Publish injection, and every HTTP endpoint (scorecard,
policies, violations pagination, stream, 503 nil-handler).
* `go test -count=1 -race ./internal/handler/...` clean (5 runs).
* `go vet ./...` clean.
Pre-existing CI failures (TestPinIssue_ConcurrentRapidFireRateLimit,
TestRun_FailsFastOnDynadotError, TestAuthHandover_HappyPath nil-ptr,
TestValidate_*Harbor_robot_token*) confirmed not introduced by this
slice — they reproduce on clean main.
Per ADR-0001 §3 (5 stores): score history lives in NATS JetStream KV;
no Postgres/FerretDB shadow store. Per ADR-0001 §5 (event-driven):
every score recompute fires off a Subscribe event. Per
INVIOLABLE-PRINCIPLES #4: SSE retention, KV TTL, alert thresholds all
runtime-configurable.
Closes the S column of EPIC-1 master plan; UI slices U1-U5 can now
consume the SSE event shape.
Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d74e0d5e5a
|
feat(bp-kyverno): land 19 compliance ClusterPolicy templates (slice K, #1096) (#1138)
Slice K of EPIC-1 (#1096) compliance engine: author the baseline policy
library that the score aggregator (slice S) will consume via
PolicyReport rows.
K1 ships 13 baseline policies + K2 ships 7 added policies. One of the K2
policies (hubble-flows-seen #16) is a stub file: Kyverno can't natively
reach Cilium Hubble's gRPC API, so the synthetic PolicyReport row is
emitted by slice W2's hubble.go evaluator (per design §4.1). Stub keeps
the policy slot explicit in the bundle.
Architecture per docs/EPICS-1-6-unified-design.md §4.3:
K1 (13 baseline)
01 multi-replica-drainability (resilience, permissive)
02 pdb-permits-eviction (resilience, permissive)
03 topology-spread (resilience, permissive)
04 probes-present (resilience, enforcing)
05 resource-requests (resilience, enforcing)
06 resource-limits (resilience, permissive)
07 pvc-volume-expansion (resilience, permissive; stateful)
08 hpa-effective (resilience, permissive)
09 cilium-l7-mtls (security, enforcing)
10 flux-managed (governance, enforcing)
11 harbor-proxy-pull (governance, enforcing)
12 image-tag-pinned (governance, enforcing)
13 prometheus-scrape (observability, permissive)
K2 (7 added)
14 networkpolicy-present (security, permissive)
15 otel-injected (observability, permissive)
16 hubble-flows-seen (deferred to W2 evaluator)
17 runasnonroot-readonlyrootfs (security, permissive)
18 cosign-verified (security, permissive)
19 secret-not-in-env (security, permissive)
20 backup-configured (resilience, permissive)
Per docs/INVIOLABLE-PRINCIPLES.md #4 every operationally-meaningful
value is runtime-configurable via .Values.compliancePolicies.<name>.*:
- enabled (default false; operator opts in)
- action (Audit | Enforce; default Audit; flipped per-Environment by
EnvironmentPolicy.spec.compliance.modes once C2 controller lands)
- excludeNamespaces (default exempts kube-system, flux-system, etc.)
- per-policy specifics (allowedRegistryRegex, cosign keys, ...)
Test gate (helm template):
- default-OFF (no overrides): 0 ClusterPolicy rendered
- all-ON : 19 ClusterPolicy rendered
helm lint clean both ways.
Slice S1 (score aggregator) will join PolicyReport rows from these
policies + synthetic rows from W2 evaluators against EnvironmentPolicy
weights. UI surfaces (slices U1-U5) consume the SSE/NATS rollups.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f18dd8df19
|
feat(bp-opentelemetry-operator): scaffold operator + default Instrumentation CR (slice H5, #1095) (#1121)
New platform/opentelemetry-operator/ Blueprint scaffold per design doc
§3.9 row 5. Companion to existing bp-opentelemetry (the collector) —
this Blueprint ships the OPERATOR that auto-injects OTel SDK sidecars
into Pods based on annotations:
instrumentation.opentelemetry.io/inject-{java|nodejs|python|dotnet}: "default"
Two-Blueprint split is intentional: collector and operator are separate
upgrade cycles. Mixing them risks coupling observability cadence to
auto-instrumentation cadence, and the operator's mutating admission
webhook intercepts every Pod creation cluster-wide so misconfiguration
is high-blast-radius.
What ships:
- platform/opentelemetry-operator/README.md — activation contract
- platform/opentelemetry-operator/blueprint.yaml — bp-opentelemetry-operator 1.0.0
- platform/opentelemetry-operator/chart/Chart.yaml — wraps upstream
opentelemetry-operator:0.61.0 from open-telemetry-helm-charts.
Subchart `condition: enabled` — default-off skips it entirely.
- platform/opentelemetry-operator/chart/values.yaml — gate, default
Instrumentation CR config (exporterEndpoint, sampler, per-language
toggles), upstream subchart values (manager.collectorImage.repository
required, serviceAccount, cert-manager-backed admission webhook)
- platform/opentelemetry-operator/chart/templates/instrumentation-default.yaml
— Catalyst overlay Instrumentation CR with parentbased_traceidratio
sampler @ 0.25 default, propagators (tracecontext + baggage + b3),
per-language injection toggles. Default OFF; namespace = cilium by
default (operator overrides per Sovereign).
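For readers unfamiliar with the `parentbased_traceidratio` sampler, its decision rule can be sketched as below. This is a simplified illustration, not the OpenTelemetry SDK code: real SDKs derive the threshold from the trace ID in their own ways, and comparing the low 64 bits against `ratio * 2**64` is an assumption made here for clarity.

```python
from typing import Optional


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision: the same trace ID always gets the same answer."""
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    low64 = trace_id & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))


def should_sample(trace_id: int, parent_sampled: Optional[bool],
                  ratio: float = 0.25) -> bool:
    """Follow the parent's decision if there is one; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled  # child spans inherit, so traces stay whole
    return ratio_sampled(trace_id, ratio)  # root span: keep roughly `ratio` of traces
```

With the 0.25 default, roughly one in four new traces is kept, while spans that join an existing trace follow whatever the upstream service decided.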
Default-OFF for both layers:
- .Values.enabled: false → upstream subchart's `condition: enabled`
also fires, so 0 resources rendered total
- Even after .Values.enabled=true, the Catalyst Instrumentation CR
is gated again by .Values.defaultInstrumentation.enabled=false so
installing the chart doesn't auto-inject anywhere
Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (sampler ratio,
exporter endpoint, per-language toggles, namespace) is in values.yaml.
Validated:
- helm dependency build pulls upstream cleanly
- helm template with default values: 0 resources rendered
- helm template with enabled=true defaultInstrumentation.enabled=true:
22 resources rendered (upstream operator manager Deployment, CRDs,
RBAC, mutating + validating webhooks, cert-manager Issuer +
Certificate, plus the Catalyst Instrumentation CR)
Out of scope for this slice:
- Add this Blueprint to clusters/_template/bootstrap-kit/ — EPIC-5
(#1100) sequences both bp-opentelemetry (collector first) and this
Blueprint as part of the observability roll-out
- Per-Application Instrumentation CRs from
Blueprint.spec.observability.traces=otlp: application-controller
(slice C4 of #1095) renders those at install time
Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 5
+ §8.4 (EPIC-5 Networking).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
5915e309dc
|
feat(bp-kyverno): land label-vocab mutate + validate ClusterPolicies (slices E1+E2, #1095) (#1120)
Realizes design doc §3.6 (Label-vocabulary enforcement). Two
ClusterPolicies that together implement the contract in §1: the
openova.io/* label set is the join key across compliance scoring
(#1096), RBAC scope matching (#1098), billing (post-Phase-1), and
networking (#1100). If labels are missing, every downstream consumer is
blind.
E1: mutate-add-openova-labels (slice E1):
- Mutating ClusterPolicy that derives missing openova.io/{org, env,
application, blueprint, managed-by} labels from namespace annotations +
ownerReferences and adds them at admission.
- Three rules:
* add-org-from-namespace-annotation
* add-env-from-namespace-annotation
* add-managed-by-flux-when-flux-instance-label
- Best-effort safety net: Catalyst controllers (C1/C2/C4) are the
authoritative source. This rule covers resources created OUTSIDE the
controller path (e.g. a debug Pod from kubectl run, a CronJob authored
manually).
E2: validate-require-openova-labels (slice E2):
- Validating ClusterPolicy that REJECTS workload resources missing
required openova.io/* labels.
- Default action `Audit` (permissive); per-Environment overlay flips to
`Enforce` (blocking) via EnvironmentPolicy.spec.modes in EPIC-1 #1096.
- One rule per required label (templated from
.Values.kyvernoOverlay.labelVocab.validate.requiredLabels), which lets
the Audit/Enforce decision be per-label rather than all-or-nothing.
- excludeNamespaces list exempts control-plane namespaces (kube-system,
flux-system, cilium, cert-manager, openova-system, catalyst, etc.) so
existing Sovereign infra doesn't trip on missing org labels.
Both default OFF
(.Values.kyvernoOverlay.labelVocab.{mutate,validate}.enabled). Operator
opts in once the prerequisite Organization (slice B1) + Environment
(slice B2) CRs exist on the cluster, otherwise the mutate rule has
nothing to derive from and the validate rule rejects every workload.
Per docs/INVIOLABLE-PRINCIPLES.md #4, every list (requiredLabels,
resourceKinds, excludeNamespaces, action) is in values.yaml.
Validated:
- helm dependency build pulls upstream kyverno cleanly
- helm template with default values: 0 ClusterPolicy resources rendered
- helm template with both gates enabled: exactly 2 ClusterPolicies
rendered (mutate-add-openova-labels + validate-require-openova-labels)
Chart version bumped 1.0.1 → 1.1.0 (minor: new templates, no breaking).
Blueprint.yaml mirrored 1.0.0 → 1.1.0.
Refs: #1094, #1095, #1096, #1098, #1100,
docs/EPICS-1-6-unified-design.md §1 (label vocab) + §3.6 (E1+E2 scope).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
e1d7bf18be
|
feat(bp-hcloud-csi): scaffold Hetzner CSI driver Blueprint (slice H6, #1095) (#1119)
New platform/hcloud-csi/ Blueprint scaffold per design doc §3.9 row 6.
Wraps the upstream hetznercloud/csi-driver Helm chart and ships the
Catalyst-managed `hcloud-volumes` StorageClass that multi-node stateful
workloads (CNPG primary/replica pairs in EPIC-6 #1101) need.
Default-OFF: chart is a no-op until .Values.enabled is true. Even after
enabling, the cluster's default StorageClass is NOT flipped unless
.Values.defaultStorageClass is also true. That is a destructive change
for Pods relying on the previous default's binding semantics, so the
in-place migration plan is operator-scheduled.
What ships:
- platform/hcloud-csi/README.md: activation contract, why-default-OFF
- platform/hcloud-csi/blueprint.yaml: bp-hcloud-csi 1.0.0, configSchema
- platform/hcloud-csi/chart/Chart.yaml: wraps upstream hcloud-csi:2.13.0
from charts.hetzner.cloud, condition=enabled gate
- platform/hcloud-csi/chart/values.yaml: gate, default-storageclass
flag, hetznerTokenSecretRef (SealedSecret), catalystStorageClasses
array (renamed from storageClasses to avoid collision with upstream's
storageClasses key), volumeSnapshotClass block (default off)
- platform/hcloud-csi/chart/templates/storageclass.yaml: renders one
StorageClass per catalystStorageClasses[] entry; first entry annotated
as cluster default when defaultStorageClass=true
- platform/hcloud-csi/chart/templates/volumesnapshotclass.yaml:
VolumeSnapshotClass for backup workflows; default off
Why a separate Blueprint, not values toggle on bp-cilium:
- CSI drivers are independent of CNI. Mixing them risks coupling the
network-plane upgrade cycle to the storage-plane upgrade cycle.
Per docs/INVIOLABLE-PRINCIPLES.md #4 every parameter (StorageClass
list, SealedSecret reference, replicas, resource requests) is in
values.yaml.
Validated:
- helm dependency build pulls upstream hcloud-csi:2.13.0 cleanly
- helm template with default values: 0 resources rendered (gate +
Chart.yaml condition both fire correctly)
- helm template with enabled=true defaultStorageClass=true: 7 resources
rendered (upstream CSI controller Deployment, node DaemonSet,
CSIDriver, RBAC, plus Catalyst hcloud-volumes StorageClass with the
storageclass.kubernetes.io/is-default-class annotation)
Schema collision lesson:
- Initial draft used .Values.storageClasses[] which collided with the
upstream subchart's storageClasses array (different shape; subchart
expects array under that exact name). Renamed to
catalystStorageClasses + passed [] to upstream's
hcloud-csi.storageClasses to suppress its own StorageClass rendering.
Lesson logged in seam map.
Refs: #1094, #1095, #1101, docs/EPICS-1-6-unified-design.md §3.9 row 6,
docs/SRE.md §2.5, platform/cnpg/README.md.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
eca27002ae
|
feat(bp-cilium): add Hubble UI HTTPRoute overlay (slice H7, #1095) (#1117)
Realizes design doc §3.9 row 7 (Hubble relay+UI on; OIDC ingress) as a
default-OFF scaffold that EPIC-5 (#1100) flips on per Sovereign once the
zero-trust observability tier is ready.
Why default-OFF in Phase-0:
- Hubble relay/UI in production today is intentionally off (SovereignA
was crash-looping on monitoring.coreos.com/v1 ServiceMonitor missing
before bp-kube-prometheus-stack reconciles — issue #182).
- The OIDC enforcement at the gateway boundary is the missing piece —
Cilium's L7 OIDC filter wires to bp-keycloak's `hubble-ui` client which
lands in slice D1.
- Flipping the gate without the OIDC layer would leave Hubble UI publicly
accessible. The template comments explicitly warn against this for
production.
What ships:
- platform/cilium/chart/templates/hubble-ui-httproute.yaml — HTTPRoute
exposing hubble-ui Service via cilium-gateway with the wildcard cert.
Gated by `catalystOverlay.hubbleUI.{enabled,hostname}`.
- platform/cilium/chart/values.yaml `catalystOverlay:` block:
hubbleUI.{enabled, hostname, gatewayRef.{name,namespace},
serviceRef.{name,namespace,port}, auth (oidc|none, default oidc)}.
All operator-overrideable per docs/INVIOLABLE-PRINCIPLES.md #4.
Operator opt-in path (per-Sovereign overlay at
clusters/<sov>/bootstrap-kit/01-cilium.yaml):
spec.values.cilium.hubble.relay.enabled: true
spec.values.cilium.hubble.ui.enabled: true
spec.values.catalystOverlay.hubbleUI.enabled: true
spec.values.catalystOverlay.hubbleUI.hostname: hubble.<sovereign-domain>
… AND bp-keycloak realm has a `hubble-ui` OIDC client (slice D1).
Validated:
- helm template with default values: 0 HTTPRoute resources rendered
- helm template with catalystOverlay.hubbleUI.enabled=true + hostname:
exactly 1 HTTPRoute rendered with proper parentRefs/hostnames/backendRefs
- Original 34-resource render count unchanged in default mode (no
regression to existing chart output)
Chart version bumped 1.2.1 → 1.3.0 (minor — new templates, no breaking).
Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 7,
§8 (EPIC-5 Networking).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
68c68eaf7a
|
feat(bp-network-policies): land default-deny CCNP + system-namespace + DNS allow templates (slice H8, #1095) (#1116)
New platform/network-policies/ Blueprint scaffold per design doc §3.9
row 8. Ships the cluster-wide zero-trust primitives that EPIC-5 (#1100)
activates as part of the networking roll-out.
What ships:
- platform/network-policies/blueprint.yaml — bp-network-policies 1.0.0
- platform/network-policies/chart/Chart.yaml — Helm chart, no upstream
sub-chart
- platform/network-policies/chart/values.yaml — gate (enabled: false
default)
- platform/network-policies/chart/templates/default-deny.yaml — CCNP that
denies all ingress + egress at endpointSelector: {} (full-cluster scope)
- platform/network-policies/chart/templates/allow-system-namespaces.yaml —
CCNP allowing full traffic for kube-system, flux-system, cilium,
cert-manager, catalyst, openova-system, monitoring, ingress (set is
parametric via .Values.allowSystemNamespaces — operator extends per
Sovereign for gitea/harbor/loki etc.)
- platform/network-policies/chart/templates/allow-egress-dns.yaml — CCNP
permitting UDP/TCP/53 to CoreDNS from every Pod (without this the cluster
is unbootable under default-deny — first DNS lookup fails)
Why a separate Blueprint, not bp-cilium:
- bp-cilium is foundational, installed on every cluster on day 0.
Default-deny breaks every workload that hasn't been allowlisted, so it
cannot ship in bp-cilium without operator opt-in semantics.
- Separate Blueprint with enabled: false default preserves the safety
boundary. EPIC-5 wires the activation when the rest of the zero-trust
story is ready.
Per-namespace intra-namespace allow is intentionally NOT in this slice:
- Cilium CCNPs cannot express "same namespace as the source Pod" without
listing every namespace, which contradicts dynamic Org provisioning.
- That allow rule is rendered as a per-namespace CiliumNetworkPolicy (CNP,
namespace-scoped) by organization-controller (slice C1 of #1095) at
Organization creation time. README + values.yaml note this for
downstream Implementers.
Per docs/INVIOLABLE-PRINCIPLES.md #4, every policy parameter
(allowSystemNamespaces list, dnsNamespace, dnsServiceName) is in
values.yaml, not hardcoded.
Validated:
- helm template with default values: 0 resources rendered (gate works)
- helm template with enabled=true: exactly 3 CCNPs rendered (default-deny,
allow-system-namespaces, allow-egress-dns), all parse cleanly through
python yaml.safe_load_all
- CCNP CRD validation will happen on Sovereigns where bp-cilium is
installed; local k3s here uses flannel so server-side dry-run is
unavailable
Refs: #1094, #1095, #1100, docs/EPICS-1-6-unified-design.md §3.9 row 8 +
§8 (EPIC-5), ADR-0001 §2 (zero-trust).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
82bf6f6eec
|
fix(bp-cilium): align declared upstream version with Chart.lock (slice H1, #1095) (#1115)
EPIC-0 audit found provenance drift in bp-cilium:
- Chart.yaml dependencies[0].version declared "1.19.3"
- values.yaml catalystBlueprint.upstream.version declared "1.19.3"
- Chart.lock pinned to 1.16.5 (truth-on-disk — what every Sovereign has
actually been running)
The declared "1.19.3" was never installed anywhere. Aligning all three to
"1.16.5" so observability/audit pipelines that compare the declared
upstream version with the actually-deployed Cilium version stop reporting
a 3-minor mismatch.
This is a pure metadata fix — no behavioral change. Rolling forward to a
newer Cilium minor (1.17.x or 1.18.x) is a separate slice that needs real
upgrade testing on a live data-plane cluster, including k3s
--flannel-backend=none compatibility and Gateway API CRD compatibility.
Validated:
- helm dependency build re-resolves to 1.16.5 cleanly
- Chart.lock unchanged (Cilium 1.16.5 was already what it had)
Chart version bumped 1.2.0 → 1.2.1 (patch). Blueprint.yaml mirrored.
Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.9 row 1, §11 row 3.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
e8bf1aab69
|
feat(bp-nats-jetstream): land Stream + KV CR templates (slice H4, #1095) (#1114)
Realizes design doc §3.9 row 7. The chart had no templates/ directory —
NACK Stream and KeyValue CRs that ADR-0001 §6 mandates as the Catalyst
event spine were declared in docs but not in code.
What this slice ships:
- platform/nats-jetstream/chart/templates/_helpers.tpl — common labels +
servers helper (defaults to <release>-nats Service URL, override via
.Values.catalystStreams.servers).
- platform/nats-jetstream/chart/templates/streams.yaml — three Streams:
* catalyst.audit : 90-day retention, R=3, mirrored to DR (#1101)
* catalyst.events : 24-hour retention (cross-replica fan-out + cold-
start replay), R=3
* catalyst.billing: 1-year retention, R=3, consumed by future billing
- platform/nats-jetstream/chart/templates/kv-buckets.yaml — three KVs:
* idempotency : 24h TTL, 256 MiB cap (write-path idempotency keys)
* dr-leases : 60s TTL (Continuum dns-quorum lease path; CF-KV
bypasses this bucket)
* policy-rollup: 7-day retention, 1 GiB cap (compliance scorer #1096)
Reconciliation gate:
- All resources render only when .Values.catalystStreams.enabled is true.
- NACK (nats-io/nack) is NOT a current dependency — installing it as a
sibling Blueprint and flipping this toggle is a follow-up slice.
- Same default-off pattern the chart already uses for promExporter.podMonitor
(issue #182) so a fresh Sovereign with no NACK keeps booting cleanly.
Per-tenant streams (org.<id>.events, app.<id>.events) are intentionally
NOT shipped here — they'll be created at runtime by organization-controller
(slice C1) and application-controller (slice C4) so they can scale per
tenant.
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every retention,
TTL, replicas, and maxBytes is a values.yaml variable; per-Sovereign
overlays override.
Validated:
- helm dependency build pulls upstream nats:1.2.0
- helm template with default values: 0 catalyst-* resources rendered
(catalystStreams.enabled=false, the safe default)
- helm template with catalystStreams.enabled=true: 6 resources rendered
exactly as expected (3 Streams + 3 KeyValues, all in
jetstream.nats.io/v1beta2)
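The gate and the render counts above can be modelled in a few lines. This is a minimal Python sketch rather than the chart's Helm templates: the `render` function and the dictionary shapes are hypothetical, and only the stream names, bucket names, and the `catalystStreams.enabled` toggle come from this commit.

```python
# Hypothetical model of the catalystStreams gate; the real chart uses Helm templates.
STREAMS = ["catalyst.audit", "catalyst.events", "catalyst.billing"]
KV_BUCKETS = ["idempotency", "dr-leases", "policy-rollup"]


def render(values: dict) -> list:
    """Return the Catalyst resources that would be emitted for these values."""
    enabled = values.get("catalystStreams", {}).get("enabled", False)
    if not enabled:
        # Safe default: a Sovereign without NACK renders nothing from this slice.
        return []
    resources = [{"kind": "Stream", "name": name} for name in STREAMS]
    resources += [{"kind": "KeyValue", "name": name} for name in KV_BUCKETS]
    return resources
```

Calling `render({})` models the default-values case, and `render({"catalystStreams": {"enabled": True}})` models the opt-in case.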
Chart version bumped 1.1.2 → 1.2.0 (minor — new templates, no breaking).
Blueprint.yaml version mirrored.
Refs: #1094, #1095, #1096, #1101, docs/EPICS-1-6-unified-design.md §3.9
row 7, ADR-0001 §6.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
25ef20a8e5
|
feat(catalyst-chart): land Blueprint CRD + fix 5 string-form depends (slice B4, #1095) (#1112)
Realizes the Blueprint CRD per docs/BLUEPRINT-AUTHORING.md §3 and design
doc §3.2.4. Promotes the doc-contract (apiVersion catalyst.openova.io)
from a YAML-loaded contract to a schema-validated CRD.
Schema design:
- Two versions served from one inline schema (YAML anchors): v1alpha1
(legacy, served, not storage) and v1 (canonical, served, storage). The
shared schema means the 38 existing v1alpha1 files in platform/ +
products/ continue to validate; migration to v1 is a follow-up slice.
- Required at this layer: spec.version (strict semver pattern),
spec.card.title (minLength=1).
- Card variants accommodated as documented: summary | description |
tagline interchangeable; category | family interchangeable; docs |
documentation interchangeable. All optional except title.
- visibility enum: listed | unlisted | private.
- placementSchema.modes enum: single-region | active-active | active-
hotstandby — same set Application.spec.placement validates against.
- depends[].blueprint pattern accepts both bp-* and bare-name (legacy).
- manifests accepts both manifests.chart (legacy short-form) AND
manifests.source.{kind,ref} (canonical). Three source kinds: HelmChart,
Kustomize, OAM.
- rotation[].ttl pattern '^[0-9]+(s|m|h|d)$'.
- x-kubernetes-preserve-unknown-fields liberally on configSchema (per-
Blueprint JSON Schema is arbitrary by design), card, manifests, owner,
observability, outputs, depends[].values, manifests.values, etc.
Existing files validation:
- Surveyed all blueprint.yaml in platform/ + products/ (59 files).
- Card field frequency: title (59), summary (38), description (20+1),
category (25), family (20), docs (20), documentation (14+1), icon (25),
tags (14), license (14).
- 54 of 59 files passed the schema unchanged.
- 5 files used `depends: [- bp-name]` (string form) instead of the
canonical `[- blueprint: bp-name]` object form per BLUEPRINT-AUTHORING
§3. Those 5 files are fixed in this commit:
* platform/cert-manager-powerdns-webhook/blueprint.yaml
* platform/cert-manager-dynadot-webhook/blueprint.yaml
* platform/crossplane-claims/blueprint.yaml
* platform/powerdns/blueprint.yaml
* platform/self-sovereign-cutover/blueprint.yaml
- After fix: ALL 59 files pass server-side validation (kubectl apply
--dry-run=server) against the new CRD.
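The string-form to object-form fix applied to those 5 files can be sketched as a small normalizer. This is an illustration in Python under stated assumptions: `normalize_depends` is a hypothetical helper, not something the commit ships, and the fix in the commit was made directly in the YAML files.

```python
def normalize_depends(depends: list) -> list:
    """Rewrite legacy string entries into the canonical {blueprint: name} form."""
    normalized = []
    for entry in depends:
        if isinstance(entry, str):
            # Legacy form: `- bp-name`
            normalized.append({"blueprint": entry})
        elif isinstance(entry, dict) and "blueprint" in entry:
            # Canonical form: `- blueprint: bp-name` (extra keys are kept as-is)
            normalized.append(entry)
        else:
            raise ValueError(f"unsupported depends entry: {entry!r}")
    return normalized
```

Entries already in object form pass through unchanged, so running the helper twice gives the same result as running it once.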
Negative validation (tests/blueprint-sample-invalid.yaml):
- spec.version "1.3" → semver pattern
- spec.card missing → required
- spec.card.title missing → required
- spec.visibility "secret" → enum listed|unlisted|private
- spec.placementSchema.modes "round-robin" → enum
- spec.depends[0] bare string "bp-bad-string" → must be object
- spec.depends[1].blueprint "Foo" → pattern fails (uppercase)
- spec.rotation[0].ttl "5 days" → pattern '^[0-9]+(s|m|h|d)$'
All 8 seeded vectors rejected.
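The rejection rules listed above can be approximated outside the cluster. The following Python sketch is illustrative only: the TTL pattern and the two enums are quoted from this commit, while the semver and blueprint-name patterns and the `violations` helper are assumptions standing in for the CRD's actual schema.

```python
import re

TTL = re.compile(r"^[0-9]+(s|m|h|d)$")             # pattern quoted in this commit
SEMVER = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")   # assumption: simplified strict semver
BLUEPRINT = re.compile(r"^[a-z0-9][a-z0-9-]*$")    # assumption: lowercase bp-* or bare name
VISIBILITY = {"listed", "unlisted", "private"}
MODES = {"single-region", "active-active", "active-hotstandby"}


def violations(spec: dict) -> list:
    """Return the spec paths that would fail the checks modelled here."""
    found = []
    if not SEMVER.match(str(spec.get("version", ""))):
        found.append("version")
    card = spec.get("card")
    if not isinstance(card, dict):
        found.append("card")
    elif not card.get("title"):
        found.append("card.title")
    if "visibility" in spec and spec["visibility"] not in VISIBILITY:
        found.append("visibility")
    for mode in spec.get("placementSchema", {}).get("modes", []):
        if mode not in MODES:
            found.append("placementSchema.modes")
    for i, dep in enumerate(spec.get("depends", [])):
        if not isinstance(dep, dict):
            found.append(f"depends[{i}]")
        elif not BLUEPRINT.match(str(dep.get("blueprint", ""))):
            found.append(f"depends[{i}].blueprint")
    for i, rotation in enumerate(spec.get("rotation", [])):
        if not TTL.match(str(rotation.get("ttl", ""))):
            found.append(f"rotation[{i}].ttl")
    return found
```

A single document cannot have both a missing `card` and a missing `card.title`, so one spec can exercise at most seven of the eight vectors; the eighth needs a second document.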
This commit ONLY touches new CRD + test files + the 5 depends fixes —
leaves the in-flight router.tsx + rootBeforeLoad.test.ts work from a
parallel agent and the .claude/worktrees/ directory untouched.
Refs: #1094, #1095, docs/EPICS-1-6-unified-design.md §3.2.4,
docs/BLUEPRINT-AUTHORING.md §3
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a6fb97f2ef
|
fix(cutover step-01): clone+push (regular repo) instead of pull-mirror (#1033)
PR #1029 added a step-06 PATCH to flip mirror=false before push so the
cutover-helmrepository-patches Job could write HelmRepository URL pivots
to local Gitea. On Gitea 1.22.3 the PATCH returns 200 but silently
no-ops — `mirror_interval` updates but `mirror: true` stays. The repo
remains read-only and step-06 still hits HTTP 403 "remote: mirror
repository is read-only". Reproduced on otech127 2026-05-05 with chart
0.1.22 deployed.
Per ADR (cutover ends upstream tracking — Sovereign goes self-hosted from
this point), the architecturally correct fix is to never create the
mirror in the first place. Step-01 now creates a regular Gitea repo and
bare-clones+pushes upstream content. All refs (branches+tags) replicate
via `git push --mirror --force`, which is idempotent on re-runs.
Trade-off: post-cutover Sovereigns no longer auto-sync from upstream —
that's the intended cutover semantics anyway. Operator re-runs this Job
manually for chart rollouts (next-session follow-up: dedicated
post-cutover sync mechanism, perhaps a periodic CronJob the operator can
opt into).
Bumps:
- bp-self-sovereign-cutover chart 0.1.22 → 0.1.23
- bootstrap-kit pin 0.1.22 → 0.1.23
Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
|
||
|
|
a070808eda
|
fix(cutover step-06): convert pull-mirror to standalone before pushing patches (#1029)
Step-01 creates openova/openova on the Sovereign's local Gitea as a
pull mirror so it tracks upstream openova-public during early
bootstrap. After cutover, the Sovereign is self-hosted and MUST
diverge from upstream — but Gitea blocks pushes to a mirror with
HTTP 403 "remote: mirror repository is read-only".
Step-06 adds a Phase-1.5 PATCH /api/v1/repos/{owner}/{repo}
{"mirror": false, "mirror_interval": "0"} BEFORE attempting to
clone+push the HelmRepository URL pivot. This converts the
pull-mirror into a standalone writable repo — the way the post-
cutover Sovereign architecture expects it.
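The Phase-1.5 call described above can be sketched as follows. This is a minimal Python illustration that only builds the request and does not send it: the endpoint path and JSON body are taken from this commit, while `build_demirror_request`, the base URL, and the `token` authorization scheme are assumptions (the Job itself is a shell script).

```python
import json
import urllib.request


def build_demirror_request(base_url: str, owner: str, repo: str, token: str):
    """Build (but do not send) the PATCH that asks Gitea to drop mirror mode."""
    body = json.dumps({"mirror": False, "mirror_interval": "0"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/repos/{owner}/{repo}",
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # Assumption: token auth; the chart may use basic-auth instead.
            "Authorization": f"token {token}",
        },
    )
```

A 200 response alone is weak evidence of success: PR #1033 in this log reports that Gitea 1.22.3 accepts this PATCH but leaves `mirror: true`, so reading the repository back and checking the `mirror` field is the safer confirmation.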
Caught on otech125 2026-05-05: cutover-helmrepository-patches Job
returned "FATAL: git push failed" with no upstream stderr (chart
0.1.20 lacks the printf '%s\n' "$push_err" fix from PR #1022, which
was published in 0.1.21 only). Reproduced by cloning openova/openova
from a debug pod and running git push: "remote: mirror repository
is read-only / fatal: ... HTTP 403". Without the demirror step,
EVERY Sovereign provisioned fails handover at this step.
Bumps:
- bp-self-sovereign-cutover chart 0.1.21 → 0.1.22
- bootstrap-kit pin 0.1.20 → 0.1.22
Co-authored-by: Hati Yildiz <hatiyildiz@openova.io>
|
||
|
|
478743db17
|
fix(cutover-step-06): actually surface git push stderr (PR #1021 merged with only chart bump) (#1022)
PR #1021 was supposed to ship this code fix but the chart-version bump
landed first and the actual sed didn't apply (sed quoting mishap). The
debug-error fix never reached main. Re-shipping now as a clean Edit-based
commit.
Captures git push stderr into push_err and prints it on FATAL so the next
iteration's failed Job logs include git's actual rejection (auth / branch
protection / hook).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
69980ed48e
|
chore(bp-self-sovereign-cutover): bump 0.1.20 → 0.1.21 (#1021)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
608db53a25
|
fix(cutover 0.1.20): Step-06 pushes YAML edit to local Gitea so patches survive Flux reconcile (#970) (#971)
## Root cause (live on otech116 2026-05-05 14:38)
After the #968 fix shipped (0.1.19), the cutover engine reached Step-7
(87%) successfully — Step-01..07 all completed. Then Step-08
(egress-block-test) caught 38/38 HelmRepositories had reverted to
upstream:
```
external HelmRepositories still pointing at ghcr.io/openova-io: 38
OFFENDER flux-system/bp-cilium=oci://ghcr.io/openova-io
... (37 more)
FAIL — at least one HelmRepository did not pivot
```
But Step-06's job logs say:
```
[helmrepository-patches] OK bp-cilium -> oci://harbor.otech116.omani.works/openova-io
... (37 more OK)
ok=38 skip=0 fail=0
```
So Step-06 thought it succeeded — and it had, momentarily. But then the
bootstrap-kit Kustomization (which had successfully pivoted to local
Gitea via Step-05) reconciled its YAML from local Gitea, where the YAML
still declared `url: oci://ghcr.io/openova-io`. Within ~30s every kubectl
patch was undone. The cutover engine then aborted at Step-8 verification.
## Fix
Step-06 now runs in two phases:
1. **Live K8s patches** (existing behaviour) — flips spec.url on every
HelmRepository immediately. Useful for the cluster between cutover and
the next reconcile.
2. **NEW — Push YAML edit to local Gitea** — clones `openova/openova` from
the local Gitea over basic-auth, sed-rewrites every
`clusters/_template/bootstrap-kit/*.yaml` declaration of
`url: oci://ghcr.io/openova-io` → `oci://harbor.<sov-fqdn>/openova-io`,
commits with a clear message, pushes back. Subsequent reconciles see
local Harbor as the steady-state.
After the push, the script annotates `flux-system/openova` GitRepository
to trigger immediate reconciliation so the new YAML lands without waiting
for the polling interval.
## Image change
Step-06 image bumped from `bitnami/kubectl:1.31.4` to `alpine/k8s:1.31.4`
because the new phase needs both `kubectl` and `git` in one image
(verified live on otech116 — both binaries present).
## Acceptance gate
Test case 16 added to cutover-contract.sh — guards against future
regressions that remove the `git clone`, the `git push origin main`, or
the `clusters/_template/bootstrap-kit` target dir reference.
## Live verification
Will fire on otech117 (next provision). Expected:
- Step-06 logs `cloning gitea-http.gitea.../openova/openova.git` then
`pushed to ...`
- Step-08 verify PASSES (38/38 HelmRepositories pivoted in K8s + Gitea)
- self-sovereign-cutover-status `cutoverComplete: "true"`
- Egress block to ghcr.io safely activates
Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
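The URL rewrite in the new Step-06 phase can be illustrated with a short Python sketch. The Job itself uses `sed` over the bootstrap-kit YAML files; `pivot_helm_repository_urls` is a hypothetical helper that mirrors the same substitution on a single string.

```python
UPSTREAM_URL = "url: oci://ghcr.io/openova-io"


def pivot_helm_repository_urls(yaml_text: str, sovereign_fqdn: str) -> str:
    """Point every upstream HelmRepository URL at the Sovereign's local Harbor."""
    local_url = f"url: oci://harbor.{sovereign_fqdn}/openova-io"
    return yaml_text.replace(UPSTREAM_URL, local_url)
```

Once the upstream URL is gone from the text, a second pass finds nothing to replace, so the rewrite is safe to repeat on re-runs.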
||
|
|
3db19b76b1
|
fix(cutover 0.1.19): Step-01 gitea-mirror DNS readiness probe + backoffLimit=3 (#968) (#969)
## Root cause (live on otech115 2026-05-05 14:15)
After PR #959 (0.1.18) unblocked the auto-trigger to actually call
/internal/cutover/trigger, the cutover engine fired Step-01 within ~8s of
bp-self-sovereign-cutover Helm-install completing. The gitea Pod had only
just reached Ready state — cluster-DNS endpoint publication for the
headless service `gitea-http` was still in flight. One wget returned
`bad address gitea-http.gitea.svc.cluster.local` and exited non-zero.
Catalyst-api's cutover engine stamped Jobs with backoffLimit=0
(cutover.go:584), so a single DNS miss was terminal and aborted all 8
cutover steps. otech115 finished provisioning with cutoverComplete=false
and tethered to upstream github.com/ghcr.io.
## Fix (dual-layer)
**Layer A — catalyst-api (cutover.go)**: backoffLimit lifted from 0 to 3.
A single transient miss is recoverable (4 attempts over each step's
activeDeadlineSeconds) without burning operator-attention. Hard failures
still surface within budget.
**Layer B — chart Step-01 (01-gitea-mirror-job.yaml)**: explicit nslookup
readiness probe at the top of the bash script, before any wget call.
30 attempts × 5s = 150s budget; alpine/git ships nslookup in /usr/bin
(verified live on otech115).
Layer B is faster than Layer A (in-script DNS retry vs Pod recreate);
Layer A is the safety net for any other transient pre-cluster-stable race
we haven't yet enumerated.
## Acceptance gate
Test case 15 added to
platform/self-sovereign-cutover/chart/tests/cutover-contract.sh — guards
against future regressions that drop either the gitea_host extraction or
the nslookup loop.
## Live verification
Will fire on the next provision (otech116). Expected:
- Step-01 logs `[gitea-mirror] DNS ready for
gitea-http.gitea.svc.cluster.local (attempt N)`
- All 8 cutover Jobs reach Complete
- self-sovereign-cutover-status ConfigMap reaches cutoverComplete=true
Co-authored-by: e3mrah <ebaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d1431bed09
|
fix(autoscaler+wizard): wire HCLOUD_CLOUD_INIT, validate SKU/region in catalyst-api (#965)
Closes #921 — bp-cluster-autoscaler-hcloud chart shipped without
HCLOUD_CLUSTER_CONFIG / HCLOUD_CLOUD_INIT, so cluster-autoscaler 1.32.x
FATALs at startup with "HCLOUD_CLUSTER_CONFIG or HCLOUD_CLOUD_INIT is not
specified" on every Sovereign (otech112 evidence). HelmRelease reports
Ready=True (Helm install succeeded) but the Pod CrashLoopBackOffs
invisibly behind the False-positive condition.
Closes #916 — wizard let operators dispatch unbuildable topologies
(otech109: cpx32 worker in `ash`) because PROVIDER_NODE_SIZES did not
encode regional orderability. Hetzner rejected the worker creation 41s
into `tofu apply` after Phase-0 had already created the CP + network +
LB + firewall.
Chart fix (issue #921):
- Add `clusterAutoscalerHcloud.{clusterConfig,cloudInit}` values to the
umbrella chart (base64-encoded per upstream contract).
- Render `hetzner-node-config` Secret unconditionally with both keys so
the upstream Deployment's secretKeyRef references resolve cleanly during
`helm template` AND in the live cluster regardless of overlay state.
- Wire HCLOUD_CLUSTER_CONFIG + HCLOUD_CLOUD_INIT extraEnvSecrets onto the
upstream chart's deployment.
- Tofu Phase 0 base64-encodes the Phase-0 worker cloud-init and stamps it
under `flux-system/cloud-credentials.hcloud-cloud-init`; the
bootstrap-kit overlay lifts that key via Flux `valuesFrom` into
`clusterAutoscalerHcloud.cloudInit`. Autoscaler-spawned workers thus
receive the IDENTICAL bootstrap as the Phase-0 worker fleet.
- Bump bp-cluster-autoscaler-hcloud chart 1.0.0 → 1.1.0.
- Chart-test smoke gate (chart/tests/hetzner-node-config.sh) verifies
Secret + env var wiring + no-regression of HCLOUD_TOKEN — runs in CI's
blueprint-release "Run chart integration tests" step.
Wizard fix (issue #916):
- Add `availableRegions?: string[]` to NodeSize interface; encode
cpx32 = ['fsn1','nbg1','hel1'], cpx21/cpx31 = [] (orderable nowhere new)
per Hetzner /v1/server_types vs POST /v1/servers gap.
- Add `isSkuAvailableInRegion()` + `suggestAlternativeSkus()` helpers.
- StepProvider filters SKU dropdowns by selected region; auto-swaps
current SKU to recommended default when region change drops it out of
orderability.
- Mirror the matrix Go-side in sku_availability.go; gate
`provisioner.Request.Validate()` with same predicate so a stale wizard
build OR direct API caller bypassing the UI cannot dispatch otech109's
failure mode.
- Two-sided enforcement covers both r.Regions[] (multi-region) and the
legacy singular path.
Tests: 13 vitest cases on the wizard side + 38 Go subtests on the API
side. Chart smoke renders + helm template gates the env wiring at publish
time.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
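The regional orderability check from the wizard fix can be sketched in Python. This is not the wizard's TypeScript or the Go mirror in sku_availability.go: the snake_case function names are ports of the helpers named above, `example-sku` is a made-up entry, and treating a missing `availableRegions` as "no restriction recorded" is an assumption about the optional field.

```python
from typing import Dict, List, Optional

# None models an absent availableRegions (assumed to mean no restriction recorded).
AVAILABLE_REGIONS: Dict[str, Optional[List[str]]] = {
    "cpx32": ["fsn1", "nbg1", "hel1"],  # from the commit message
    "cpx21": [],                        # orderable nowhere new
    "cpx31": [],                        # orderable nowhere new
    "example-sku": None,                # hypothetical entry for illustration
}


def is_sku_available_in_region(sku: str, region: str) -> bool:
    """True when the SKU can be ordered in the given region."""
    if sku not in AVAILABLE_REGIONS:
        return False
    regions = AVAILABLE_REGIONS[sku]
    return regions is None or region in regions


def suggest_alternative_skus(region: str) -> List[str]:
    """List SKUs that remain orderable in the region, in a stable order."""
    return sorted(s for s in AVAILABLE_REGIONS if is_sku_available_in_region(s, region))
```

Running the same predicate in both the UI and the API is what closes the otech109 case: `is_sku_available_in_region("cpx32", "ash")` is false on either side.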
||
|
|
238c6d2010
|
fix(bp-flux): mitigate helm-controller leader-election loss + stuck-HR recovery (#925) (#960)
* fix(bp-flux): mitigate helm-controller leader-election loss + recovery CronJob (#925)
On otech113.omani.works the bp-vpa HelmRelease became stuck Ready=Unknown
forever after a transient kube-apiserver blip caused helm-controller to
lose its leader-election lease mid-install. The Helm release secret was
already committed (Status=deployed) by the previous leader, but its last
write to the HR's Ready condition was Unknown and the new leader's
"release in storage?" short-circuit never re-evaluates that. The HR
blocked bootstrap-kit → sovereign-tls → cilium-gateway, breaking every
HTTPRoute on the Sovereign.
Fix is two-pronged:
1) PRIMARY (prevent the trigger). Stretch leader-election lease durations
on the three Catalyst-critical controllers (helm/kustomize/source) from
the upstream defaults of lease=35s renew=30s retry=5s to lease=60s
renew=40s retry=5s, and bump memory limits from 256Mi to 512Mi (helm) /
384Mi (kustomize, source) so OOMKills during 35-HR fan-out installs don't
themselves trigger leadership handoffs. Costs ~50s extra failover time on
a real controller crash; that's acceptable since CP HA is a Phase 2
concern and we'd much rather avoid spurious flips during transient API
pressure.
2) RECOVERY (handle the residual case). New CronJob
bp-flux-stuck-hr-recovery runs every 2 minutes, scans every HelmRelease
cluster-wide, and for each HR stuck in Ready=Unknown for >5 minutes whose
underlying Helm release secret already has status=deployed, force-toggles
spec.suspend (the only known workaround per #925). Guardrail: refuses to
act if more than 10 HRs would be touched in a single run (signals a
cluster-wide outage). Operator-disablable via
.Values.catalyst.stuckHelmReleaseRecovery.enabled=false.
Lock-in tests: tests/leader-election-and-recovery.sh covers all three
flag/memory bumps, CronJob render, RBAC presence, disable-toggle, and
threshold operator override. version-pin-replay + observability-toggle
still green.
Chart bumped 1.1.4 → 1.2.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux): bump blueprint.yaml spec.version to 1.2.0 to match Chart.yaml (#925)
The bootstrap-kit static validation gate (Chart.yaml version ==
blueprint.yaml spec.version) caught the missed bump on PR #960.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
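The selection rule and guardrail of the recovery CronJob can be sketched as a pure function. This Python model is an assumption-laden illustration: the thresholds (more than 5 minutes stuck, at most 10 HRs per run) come from this commit, while `HelmReleaseState`, `select_for_recovery`, and their field names are hypothetical and say nothing about how the real script queries the cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HelmReleaseState:
    name: str
    ready: str              # "True", "False" or "Unknown"
    seconds_in_state: int   # how long the Ready condition has held this value
    release_status: str     # status recorded in the Helm release secret


def select_for_recovery(
    releases: List[HelmReleaseState],
    stuck_after_s: int = 300,
    max_touched: int = 10,
) -> List[str]:
    """Return the HRs whose spec.suspend should be toggled, or [] if unsafe."""
    stuck = [
        r.name
        for r in releases
        if r.ready == "Unknown"
        and r.seconds_in_state > stuck_after_s
        and r.release_status == "deployed"
    ]
    if len(stuck) > max_touched:
        # Guardrail: this many stuck HRs suggests a cluster-wide outage.
        return []
    return stuck
```

Keeping the decision separate from the `kubectl` side effects makes the guardrail easy to test without a cluster.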
||
|
|
b7f150db38
|
fix(cutover 0.1.18): poll /healthz for readiness instead of auth-gated /status (#957) (#959)
The 0.1.17 auto-trigger Job was Complete=True on otech113 but the
cutover never actually started: the readiness probe loop polled
/api/v1/sovereign/cutover/status (auth-gated, behind RequireSession)
and treated 401 as "API not ready". The loop ran 30 times for 300s
and exited 0 — the trigger endpoint was NEVER called.
Live evidence on otech113 2026-05-05:
- 30 consecutive 401s from auto-trigger Pod (10.42.4.216) on
/sovereign/cutover/status in catalyst-api access log
- zero hits on /api/v1/internal/cutover/trigger
- Helm post-upgrade hook deadline tripped → rollback to 0.1.15
Fix (chart-side only; PR #947 catalyst-api endpoint is correct as-is):
- poll /healthz (unauthenticated, always 200 when process is up)
- drop the pre-flight cutoverComplete=true short-circuit since
/internal/cutover/trigger is already idempotent (returns 200 with
the existing snapshot when cutoverComplete=true, per
cutover_internal.go line 279)
- bump chart 0.1.17 → 0.1.18; pin slot 06a to 0.1.18
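The corrected probe-then-trigger flow can be sketched in Python. The auto-trigger Job is a shell script in the chart, so this is only a model: `wait_then_trigger` and its injected `probe` and `trigger` callables are hypothetical, the 30-attempt budget and 10s spacing are inferred from the "30 times for 300s" figure above, and raising on timeout is a design choice of this sketch rather than documented behaviour of the fix.

```python
import time
from typing import Callable


def wait_then_trigger(
    probe: Callable[[], int],      # returns the HTTP status of GET /healthz
    trigger: Callable[[], int],    # calls /api/v1/internal/cutover/trigger
    attempts: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wait for /healthz to return 200, then fire the trigger exactly once."""
    for _ in range(attempts):
        if probe() == 200:
            return trigger()
        sleep(delay_s)
    # Failing loudly avoids the 0.1.17 outcome: a "successful" Job that never triggered.
    raise TimeoutError(f"/healthz not ready after {attempts} attempts")
```

Because the trigger endpoint is described as idempotent, the sketch does not pre-check `cutoverComplete` before calling it.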
Tests:
- contract gate Case 13: probe target is /healthz, NOT
/sovereign/cutover/status (regression guard)
- contract gate Case 14: no stale cutoverComplete pre-read off
/tmp/status.json (the file no longer exists)
- existing 12 contract gates still pass; helm lint clean
- existing 6 Go unit tests for HandleCutoverInternalTrigger pass
Closes #957
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|