openova

Author	SHA1	Message	Date
e3mrah	5f6d1c7d86	diag(bp-openbao): add set -x to init script (chart 1.2.8) (#658 ) otech37/38 hit the same wall: server reaches Initialized=true but openbao-unseal-keys Secret is never persisted; the FIRST init pod's logs that ran fresh init are reaped by container restart before we can capture what happened. Add 'set -x' to shell-trace every command. Now even if the script crashes mid-run, pod logs show the last command attempted. The captured diagnostic on the next provision will tell us whether the failure is in /tmp/init-output.json parsing, the persist wget, or elsewhere. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 11:09:05 +04:00
e3mrah	8447930bf7	fix(bp-openbao): fail-fast on unseal-keys persist (chart 1.2.7) (#657 ) * fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) * fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth) The wizard's componentGroups.ts carried hand-maintained `dependencies: [...]` arrays that deviated from the real Flux install graph in clusters/_template/bootstrap-kit/.yaml. Examples (otech34 surfaced this): componentGroups.ts Flux HelmRelease.dependsOn ---------------------- --------------------------- keycloak: [cnpg] keycloak: [cert-manager, gateway-api] openbao: [] openbao: [spire, gateway-api, cnpg] harbor: [cnpg, seaweedfs, harbor: [cnpg, cert-manager, valkey] gateway-api] Founder's directive: "all the real dependencies are related to real flux related dependencies, if you are hosting irrelevant hardcoded baseless wizard catalog dependencies, I dont know where they are coming from. The single source of truth for the dependencies is flux!!!" — 2026-05-03 This commit: 1. Adds scripts/generate-blueprint-deps.sh that parses every bootstrap-kit HelmRelease and emits blueprint-deps.generated.json keyed by bare component id (bp- prefix stripped on both source and target side). 2. Commits the generated JSON. 3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id). 4. Patches componentGroups.ts so every RAW_COMPONENT's `dependencies` field is OVERRIDDEN at module load with the Flux-canonical list (the inline `dependencies: [...]` literals are now ignored — Flux is canonical). Follow-ups (not in this PR): - CI drift check that re-runs the script and diffs the JSON. - Strip the inline `dependencies: [...]` arrays entirely once the drift check is green. - Wire the FlowPage edge-rendering to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent hardcoded dep map at lines 105-155 that the founder caught — most visibly: keycloak: ['cert-manager', 'openbao'] ← FALSE; Flux says no openbao The reason the founder kept seeing the spurious arrow on the Flow page. Replace the local table with an import of BLUEPRINT_DEPS from data/blueprintDeps.ts (single source of truth — generated from clusters/_template/bootstrap-kit/.yaml by scripts/generate-blueprint-deps.sh). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> fix(jobs): don't regress status to pending after exec started helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the Job's Status with jobStatusFromHelmState(state) on every event. Flux oscillates HelmReleases between Reconciling and DependencyNotReady while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready — helmwatch maps both back to HelmStatePending. The bridge then flips the row to status='pending' even though an active Execution is streaming exec log lines (startedAt + latestExecutionId already set). Founder caught this on otech34's install-external-secrets job: status='pending' on the Jobs page while Exec Log was actively tailing. Fix: monotonic guard — once activeExecID[component] != "" (Execution allocated), refuse to regress nextStatus to StatusPending. Treat ongoing-after-start as Running so the row reflects the live stream. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(jobs): cascade Failed status through dependsOn (fail-fast) Founder caught on otech34: install-openbao=failed but install-external-secrets stayed pending forever ('masking it and waiting unnecessarily'). Flux's HelmRelease for external-secrets is in DependencyNotReady, helmwatch maps that to StatePending, bridge writes Status=pending — no signal that the upstream FAILED rather than 'still installing'. Add a post-rollup sweep in deriveTreeView that propagates Failed through the dependsOn graph. Up to 8 sweeps cover the deepest bootstrap-kit chain. Idempotent on read; reverses if openbao recovers because it operates on the live snapshot. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files' Diagnosed live during otech35: openbao-init pod crash-looped 4× on 'bao operator init' with: failed to create fsnotify watcher: too many open files Flux mapped to InstallFailed → RetriesExceeded → cascading through external-secrets and external-secrets-stores. The wizard masked the OS-level root cause behind a generic InstallFailed. Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm- controller + 11 CNPG operators + Reflector + Cert-Manager + bao + keycloak-config-cli + ... each grabs instance slots). The instance count exhausts within minutes; the next process to ask for an inotify slot gets EMFILE. Bump well above k8s/k3s production guidance so future blueprints don't tickle the same wall: fs.inotify.max_user_instances = 8192 fs.inotify.max_user_watches = 1048576 fs.inotify.max_queued_events = 16384 Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system' in runcmd. Permanent across reboots. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-openbao): fail-fast when unseal-keys persist fails (chart 1.2.7) otech37 caught: bao operator init succeeded server-side (Initialized=true), but the script's wget POST to persist openbao-unseal-keys Secret silently failed (\|\| true), and the PUT fallback also silenced. Subsequent Job retries hit Initialized=true on the idempotent path, found no openbao-unseal-keys Secret, and FATAL'd with 'manual recovery: wipe data-openbao-0 PVC' — every retry forever. Hardening: 1. Capture POST + PUT stdout/stderr to /tmp files instead of /dev/null so the FATAL path can echo them. 2. PUT no longer \|\| true — if both POST and PUT fail, exit 1. 3. Add read-back verification: GET the persisted Secret and assert 'unseal-keys-b64' field is present. Catches partial-write / eventual-consistency cases. Bumps chart 1.2.6 -> 1.2.7 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 10:51:21 +04:00
e3mrah	6baf7e56e7	fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) (#651 ) Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 09:26:23 +04:00
e3mrah	d519dc8ba2	fix(bp-harbor): switch sync Job to curl-against-apiserver (chart 1.2.12) (#650 ) rancher/kubectl is distroless (no /bin/sh) so the inline shell script can't run. Replace with curlimages/curl which has alpine sh + curl. Talk to k8s API directly via the in-pod ServiceAccount token. The PATCH merges password + HARBOR_DATABASE_PASSWORD into the existing pre-install-hook Secret without touching annotations. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 09:15:23 +04:00
e3mrah	08432b540e	fix(bp-harbor): switch sync Job to rancher/kubectl (chart 1.2.11) (#649 ) bitnami/kubectl moved to sha256-only tags; bitnami/kubectl:1.31.4 returns 'not found' from Docker Hub. rancher/kubectl is always available on k3s clusters. Bumps chart 1.2.10 -> 1.2.11. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 09:04:15 +04:00
e3mrah	de51fa3f7a	fix(bp-harbor): post-install Job copies CNPG password (chart 1.2.10) (#648 ) * fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB) CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12 vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default Pod on one node. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal Commit `b1a25c42` (#600) removed the helm.sh/hook-delete-policy from the auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install (the old hook-succeeded clause caused the SA to disappear before the init Job could mount its token). The chart-test still expected ≥5 before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs). Result: Blueprint Release for #600 (run 25251129679) failed at the test gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main already references it. otech30 caught this live: bp-openbao HR stuck with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'. Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao 1.2.6 onto GHCR. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-harbor): replace Reflector race with deterministic post-install Job (chart 1.2.10) bp-harbor's harbor-database-secret relied on Reflector copying from CNPG- emitted harbor-pg-app via a 'reflects:' destination annotation. On every fresh Sovereign Reflector logs once at install: Could not update harbor/harbor-database-secret — Source harbor/harbor-pg-app could not be found and never refires when CNPG creates the source ~30s later. Even with 'auto-enabled: true' on the source's inheritedMetadata, Reflector's auto-reflect copies the SOURCE name (harbor-pg-app), not the explicit destination harbor-database-secret. Result: harbor-database-secret stays empty forever; harbor-core CrashLoops with 'couldn't find key password in Secret harbor/harbor-database-secret'. Caught live on otech26-30. Replace with a Helm post-install/post-upgrade Job that: - polls for harbor-pg-app to exist (CNPG provisions it ~30-60s after Cluster Ready) - copies password into harbor-database-secret with both 'password' and 'HARBOR_DATABASE_PASSWORD' keys - exits 0; Helm marks the hook complete The Job is idempotent (re-running on upgrade overwrites identically) and deterministic (no event-watcher race). The placeholder Secret stays in place so kubectl-get returns Found before the Job runs. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 08:52:54 +04:00
e3mrah	da61ecdc79	test(bp-openbao): align test expectation with #600 RBAC-hook removal (#647 ) * fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB) CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12 vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default Pod on one node. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal Commit `b1a25c42` (#600) removed the helm.sh/hook-delete-policy from the auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install (the old hook-succeeded clause caused the SA to disappear before the init Job could mount its token). The chart-test still expected ≥5 before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs). Result: Blueprint Release for #600 (run 25251129679) failed at the test gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main already references it. otech30 caught this live: bp-openbao HR stuck with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'. Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao 1.2.6 onto GHCR. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 08:46:31 +04:00
e3mrah	a359278b7d	fix(bp-spire): disable oidc ClusterSPIFFEID + chart bump (1.1.7) (#645 ) * fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule. CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2) - Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4. - Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27. - Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled Two related Phase-8a stragglers diagnosed live during otech28: 1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError forever ("secret pdns-pg-app not found"). Adds explicit dependsOn. 2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set reflection-allowed; missing reflection-auto-enabled. Reflector races when destination Secret (harbor-database-secret) is created BEFORE CNPG provisions the source (harbor-pg-app). Reflector logs "Source could not be found" once and never retries — leaving harbor- core stuck CreateContainerConfigError. Adding auto-enabled makes Reflector actively watch the source and re-fire when it appears. Bumps: bp-harbor 1.2.8 -> 1.2.9 bp-gitea 1.2.1 -> 1.2.2 bp-powerdns 1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump) Bootstrap-kit references updated to pull the new chart versions on the next Sovereign provisioning. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-spire): Chart.lock missing spire-crds → CRDs never installed (chart 1.1.7) bp-spire 1.1.4 added spire-crds 0.5.0 as a Helm dependency to register the spire.spiffe.io/v1alpha1 CRDs (ClusterSPIFFEID, ClusterStaticEntry, ClusterFederatedTrustDomain) before the spire subchart's controller- manager Deployment starts. But Chart.lock was never regenerated — only contained the original `spire` entry. As a result every Blueprint Release packaged the chart WITHOUT spire-crds, the Sovereign saw no CRDs registered, and Helm install failed with: no matches for kind "ClusterSPIFFEID" in version "spire.spiffe.io/v1alpha1" bp-openbao / bp-external-secrets / bp-nats-jetstream all dependsOn bp-spire so this single bug cascades and blocks 5+ HRs from reaching Ready=True. Caught live during otech29. Fix: ran `helm dependency update` to regenerate Chart.lock + pull both spire and spire-crds tarballs; bumps bp-spire 1.1.6 -> 1.1.7 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 08:27:33 +04:00
e3mrah	8bb66fe43e	fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644 ) * fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule. CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2) - Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4. - Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27. - Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled Two related Phase-8a stragglers diagnosed live during otech28: 1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError forever ("secret pdns-pg-app not found"). Adds explicit dependsOn. 2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set reflection-allowed; missing reflection-auto-enabled. Reflector races when destination Secret (harbor-database-secret) is created BEFORE CNPG provisions the source (harbor-pg-app). Reflector logs "Source could not be found" once and never retries — leaving harbor- core stuck CreateContainerConfigError. Adding auto-enabled makes Reflector actively watch the source and re-fire when it appears. Bumps: bp-harbor 1.2.8 -> 1.2.9 bp-gitea 1.2.1 -> 1.2.2 bp-powerdns 1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump) Bootstrap-kit references updated to pull the new chart versions on the next Sovereign provisioning. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-03 00:16:34 +04:00
e3mrah	2e9cfd4a57	fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643 ) * fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule. CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2) - Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4. - Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27. - Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-02 23:52:42 +04:00
e3mrah	487ebebda2	fix(bp-vpa): drop registry.k8s.io/ prefix in repository (upstream prepends it) (#641 ) * fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-02 23:32:35 +04:00
e3mrah	737574b19a	feat(bp-keycloak): Phase-8b sovereign realm — token-exchange, catalyst-ui/api-server OIDC clients, SMTP, bump 1.2.2 → 1.3.0 (#604 ) (#609 ) Adds the full Phase-8b identity surface required by the seamless handover flow: - Token exchange enabled on sovereign realm (attributes.token-exchange: true) - catalyst-ui public PKCE client: redirectUris + webOrigins keyed on console.<sovereignFQDN>, groups + requiredActions in ID token - catalyst-api-server confidential service-account client: impersonation + manage-users + view-users + query-users roles on realm-management; client secret injected at provisioning time via .Values.catalystApiServerClientSecret - WebAuthn (webauthn-register + webauthn-register-passwordless) registered as Required Action options on the realm - UPDATE_PASSWORD set as defaultAction: true for new users - smtpServer block: pre-handover default = contabo Stalwart relay; fully operator-configurable via .Values.smtp.* (Phase-8c-acceptable) - required-actions client scope + oidc-usermodel-attribute-mapper for requiredActions claim in ID token (catalyst-ui first-login UX) Architectural change: realm JSON moved from inline values.yaml (keycloak: subchart key — no parent scope access) to a parent-chart template platform/keycloak/chart/templates/configmap-sovereign-realm.yaml, which can read .Values.sovereignFQDN and .Values.smtp.* for per-Sovereign interpolation. The upstream bitnami chart's keycloakConfigCli.existingConfigmap is pointed at this ConfigMap. Anti-duplication seam: configmap-sovereign-realm.yaml. New values.yaml keys: sovereignFQDN: "" (REQUIRED — per-Sovereign overlay supplies it) sovereignRealm.enabled: true catalystApiServerClientSecret: "" (REQUIRED — provisioner seals and injects) smtp.host/port/from/user/password/ssl/starttls/auth New bootstrap-kit file: 09a-keycloak-catalyst-api-secret.yaml — SealedSecret template for keycloak-catalyst-api-server-credentials in catalyst-system namespace; provisioner fills encryptedData fields at deploy time Bootstrap-kit refs bumped 1.2.x → 1.3.0 in _template, otech, omantel. helm template clean with sovereignFQDN=otech.omani.works. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:05:27 +04:00
e3mrah	93627ada20	fix(bp-harbor): convert harbor-database-secret to Helm pre-install hook (1.2.8) (#603 ) The 1.2.7 fix dropped the `data:` block from the chart template, but Helm's three-way merge still owns the Secret as a release resource and resets `data: {}` (no keys) on every chart upgrade — verified on otech22 where 1.2.6→1.2.7 reconcile wiped Reflector-populated keys back to nil. Architectural fix: convert the Secret to a Helm pre-install hook. - `helm.sh/hook: pre-install` — Secret is created at install time only. On `helm upgrade`, Helm does NOT touch the Secret (no three-way merge), so keys populated by Reflector persist across every chart bump. - `helm.sh/hook-delete-policy: before-hook-creation` — On a re-install, Helm deletes the previous Secret first so the hook recreates clean. - `helm.sh/resource-policy: keep` — `helm uninstall` does NOT delete the Secret (paired with hook means standard upgrade path never sees a delete). - Hook resources are NOT recorded in the Helm release manifest, so they're invisible to `helm upgrade`'s three-way merge. Also drops the inline `data:` block (kept from 1.2.7) — Reflector still populates everything from harbor-pg-app once CNPG bootstraps the source. Bumps bp-harbor 1.2.7 → 1.2.8, bootstrap-kit refs (_template, otech, omantel). Closes #585 Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:57:55 +04:00
e3mrah	09208ca58f	fix(bp-harbor): omit data block in harbor-database-secret — Helm overwrite regression (1.2.7) (#602 ) On every helm upgrade, Helm three-way merge resets `data.password` and `data.HARBOR_DATABASE_PASSWORD` to "" because the chart declares them empty in the template. After Reflector populates them from `harbor-pg-app`, the next bp-harbor upgrade silently empties them again — harbor-core then crashloops on the next pod restart with "password authentication failed". Observed on otech22 after the 1.2.5→1.2.6 Flux upgrade: harbor-database- secret.password went from 64 bytes back to 0 bytes, harbor-core entered CrashLoopBackOff. Resolved at runtime by touching harbor-pg-app to bump its resourceVersion and re-trigger Reflector, but the architectural fix is needed so it doesn't recur on the next chart upgrade. Fix: drop the entire `data:` block from templates/database-secret.yaml. The Secret is created by Helm with no data keys (Helm owns nothing in the data field). Reflector adds ALL keys from `harbor-pg-app` (password, HARBOR_DATABASE_PASSWORD, username, host, dbname, jdbc-uri, etc.) on the first SecretWatcher event after CNPG bootstraps the source. On subsequent helm upgrades, Helm's three-way merge has nothing to overwrite in `data:` because the chart no longer declares any keys there. Bumps bp-harbor 1.2.6 → 1.2.7, bootstrap-kit refs (_template, otech, omantel). Closes #585 (regression of) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:53:37 +04:00
e3mrah	8d50402038	fix(bp-harbor): remove cnpg-app-annotator Job — CNPG inheritedMetadata handles annotation (1.2.6) (#601 ) The post-install Job `harbor-pg-app-annotator` (with curlimages/curl:8.7.1) is no longer needed: bp-harbor 1.2.5 already uses CNPG's `inheritedMetadata` stanza in cnpg-cluster.yaml to stamp `reflection-allowed: true` onto `harbor-pg-app` at CNPG bootstrap time. The Job was causing ErrImagePull on otech22 because Docker Hub is proxied through Harbor itself (chicken-and-egg). Removes: - templates/cnpg-app-annotator-job.yaml - templates/cnpg-app-annotator-rbac.yaml - values.yaml cnpgAnnotator section Updates database-secret.yaml comment to reflect the inheritedMetadata approach. Bumps Chart.yaml 1.2.5 → 1.2.6, bootstrap-kit refs (_template, otech, omantel). Closes #585 Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:44:55 +04:00
e3mrah	b1a25c4235	fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598 ) (#600 ) Bug A — bp-keycloak@1.2.2: HTTPRoute backendService default was `<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak) but bitnami's fullname helper trims the chart-name suffix when Release.Name already contains it, so the Service is just `keycloak`. Changed default to `.Release.Name`. Sovereign realm was already imported (config-cli ran successfully) — only the Gateway routing was broken, returning HTTP 500. Bug B — bp-openbao@1.2.6: auto-unseal-rbac SA/Role/RoleBinding had `helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The `hook-succeeded` clause caused Helm to delete the SA immediately after the weight-0 RBAC hook completed, before the weight-5 init Job pod could mount its SA token and start. Removed all hook annotations from the RBAC resources so they are managed by regular Helm release lifecycle (created before hooks, never deleted mid-install). Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6. Verified on otech22 (manual remediation): Keycloak sovereign realm OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false. Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:43:32 +04:00
e3mrah	cba1b5070a	fix(bp-gitea+harbor): use CNPG inheritedMetadata to propagate reflector annotations to pg-app Secret (#595 ) The Cluster CR `metadata.annotations` are NOT propagated by CNPG onto the generated `{name}-app` Secrets. Reflector requires the SOURCE Secret (e.g. `gitea-pg-app`) to carry `reflection-allowed: "true"` before it will copy data into the DESTINATION Secret (`gitea-database-secret`). On otech22 this caused `gitea-database-secret` to stay empty indefinitely — gitea init container failed auth with "password authentication failed for user gitea". Fix: use CNPG's `inheritedMetadata.annotations` stanza (v1.24+) to instruct CNPG to annotate all generated Secrets with the reflector permission annotations. Applied to both bp-gitea (1.2.0→1.2.1) and bp-harbor (1.2.4→1.2.5) since harbor-pg-app had the same issue. Bootstrap-kit: bump bp-gitea chart ref 1.2.0→1.2.1 (template + otech + omantel). Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:37:48 +04:00
e3mrah	fe03b8cc42	fix(bp-harbor): use curl for CNPG annotator PATCH + add values defaults (1.2.4) (#594 ) busybox wget does not support --method=PATCH (only GET/POST). The harbor-pg-app-annotator Job silently succeeded without actually patching harbor-pg-app, leaving harbor-database-secret empty on fresh install. Fixes: 1. Switch cnpg-app-annotator-job.yaml from busybox:1.36.1 + wget to curlimages/curl:8.7.1 + curl -X PATCH. curl natively supports all HTTP verbs. HTTP response code checked explicitly; non-2xx exits 1 so the Job retries instead of silently passing with no-op. 2. Add cnpgAnnotator.image stanza to values.yaml (was missing — prior charts defaulted via nil-safe dict fallback but the section was never actually written to values.yaml). Defaults to curlimages/curl:8.7.1. 3. readOnlyRootFilesystem: false (curl writes /tmp/patch-response.json for error diagnostics). 4. Bump chart 1.2.3 → 1.2.4. Closes #585 Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:29:45 +04:00
e3mrah	97abf9dedb	fix(bp-harbor): nil-safe image value extraction in cnpg-app-annotator Job (#593 ) .Values.cnpgAnnotator.image.repository triggers nil pointer when the values tree is partially absent in Helm's default-values render. Use \| default dict chained assignments to safely extract image repo/tag/ pullPolicy. Fixes blueprint-release smoke render failure on 1.2.3. Closes #585 Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 15:22:54 +04:00
e3mrah	74d526c276	fix: bp-gateway-api 5→10 CRDs + bp-gitea CNPG + bp-harbor CNPG race fix + DAG audit (#592 ) * fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) The bundled Bitnami postgresql subchart pulls docker.io/bitnamilegacy/postgresql which is unavailable (DH deprecated namespace) — gitea-postgresql-0 stuck in ImagePullBackOff on otech22, cascading to gitea Init:CrashLoopBackOff. Mirrors the bp-harbor pattern (PR #578): provision a CNPG Cluster CR (gitea-pg, namespace gitea, 5Gi, pg16) + a reflector-managed gitea-database-secret, wiring GITEA__database__PASSWD from the CNPG-generated gitea-pg-app Secret. All Bitnami subchart config removed; postgresql.enabled: false. Bootstrap-kit (template + otech + omantel): bump bp-gitea 1.1.2 → 1.2.0, add dependsOn: bp-cnpg so the postgresql.cnpg.io/v1 CRD is registered before the Capabilities gate in cnpg-cluster.yaml fires. omantel overlay migrated from legacy ingress: to gateway: (Cilium Gateway API, issue #387). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(dependency-audit): add bp-reflector (5a) to expected DAG + external-dns dep edge bp-reflector was added to the bootstrap-kit (slot 05a) in issue #543 but was never registered in scripts/expected-bootstrap-deps.yaml, causing the dependency-graph-audit CI gate to error on every PR that includes this branch. Also declare bp-reflector in bp-external-dns's depends_on to match the actual HR file (12-external-dns.yaml dependsOn bp-reflector). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(bp-gateway-api): update CRD-count test 5→10 for experimental channel + DAG audit Two fixes to unblock bp-gateway-api:1.1.0 OCI publish and the dependency-graph-audit CI gate: 1. crd-render.sh: expect 10 CRDs (experimental channel) not 5. Chart 1.1.0 vendors experimental-install.yaml (TLSRoute, TCPRoute, UDPRoute, BackendLBPolicy, BackendTLSPolicy in addition to 5 standard CRDs) because Cilium 1.16.x checks for TLSRoute at operator startup. Without this fix the blueprint-release workflow for 1.1.0 fails the chart-test step and never pushes to GHCR — leaving all 13 dependent HRs stuck dependency-not-ready on every Sovereign. 2. expected-bootstrap-deps.yaml: add bp-reflector (slot 5a) and update bp-external-dns depends_on to include bp-reflector. bp-reflector was added to the bootstrap-kit in issue #543 but was missing from the expected DAG, causing dependency-graph-audit ERRORs on every PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: hatiyildiz <hatice@openova.io>	2026-05-02 15:20:05 +04:00
e3mrah	64de55d72f	fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588 ) (#590 ) * fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588) trivy-operator exits 137 (OOM) on startup on a full Sovereign (38 HRs, ~200 pods). The operator initialises watch-cache controllers for every resource kind it manages across all namespaces; at 38 HRs the cache peak exceeds 256Mi before steady-state is reached. Raise the operator container memory limit from 256Mi to 512Mi, which is the stable floor measured on otech22 during Phase-8a handover testing. Bump bp-trivy 1.0.1 → 1.0.2. Bootstrap-kit slots updated for _template, otech.omani.works, omantel.omani.works. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> * fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml The dependency-graph-audit check was failing because: 1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml 2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector] in the HelmRelease but expected-bootstrap-deps.yaml only declared [bp-cert-manager, bp-powerdns] Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update bp-external-dns depends_on to include bp-reflector in the expected DAG. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 15:20:03 +04:00
e3mrah	4b2ae76cfd	fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587 ) (#589 ) * fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587) The native pdns provider in external-dns v0.15.1 does not accept --pdns-api-version; the binary fatals at startup with: 'unknown long flag --pdns-api-version' causing CrashLoopBackOff (53+ restarts on otech22). The provider auto-negotiates the PowerDNS API version — the flag is superfluous and broken. Remove it from extraArgs. Bump bp-external-dns 1.1.3 → 1.1.4. Bootstrap-kit slots updated for _template, otech.omani.works, omantel.omani.works. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> * fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml The dependency-graph-audit check was failing because: 1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml 2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector] in the HelmRelease but expected-bootstrap-deps.yaml only declared [bp-cert-manager, bp-powerdns] Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update bp-external-dns depends_on to include bp-reflector in the expected DAG. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 15:20:00 +04:00
e3mrah	8d2ba0495d	fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584 ) (#586 ) Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)	2026-05-02 15:18:49 +04:00
e3mrah	5a403e66b1	fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582 ) * fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase Harbor upstream always connects to a database named 'registry' (harbor.database.external.coreDatabase default). The CNPG Cluster was initialised with database='harbor', causing: FATAL: database "registry" does not exist (SQLSTATE 3D000) Fix: change postgres.cluster.database default from 'harbor' → 'registry' in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap and Harbor's coreDatabase now use 'registry'. Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run against harbor-pg-1. harbor-core is now 1/1 Running. Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix Five independent fixes that together complete the DNS-01 wildcard TLS chain for per-Sovereign certificate autonomy: 1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo): - values.yaml: `webhook.solverName: powerdns` → `pdns` - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is "powerdns" cert-manager gets 404 → "server could not find the resource". 2. cert-manager-dynadot-webhook solver_test.go mock format: - writeOK() and error injection used old ResponseHeader-wrapped format - Real api3.json returns ResponseCode/Status directly in SetDnsResponse - This caused the image build to fail at `ccc38987` so the dynadot fix never shipped; solver tests now pass cleanly (go test ./... OK) 3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template): - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied - anycast-endpoint.yaml: optional nodePort field rendered in port list 4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf): - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053 - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0 5. dynadot-client JSON parsing fix (core/pkg/dynadot-client): - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader - client_test.go: mock responses updated to real api3.json format Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 13:49:58 +04:00
e3mrah	73ae746637	fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581 ) Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API CRDs at startup and disables its gateway controller if they are absent — a static, one-shot decision. Cloud-init installs k3s+Cilium first, then Flux reconciles bp-gateway-api minutes later, so the operator always starts without CRDs and never recovers. All 8 HTTPRoutes orphaned. Three-part permanent fix: 1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl. TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires TLSRoute CRD to be present; without it the operator's capability check fails entirely and disables the gateway controller. 2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true" to force GatewayClass creation regardless of CRD presence at Helm render time. Upstream default "auto" skips GatewayClass when the gateway API CRDs are absent at install time (Capabilities check). 3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0 and ship experimental channel (TLSRoute, TCPRoute, UDPRoute, BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed status.supportedFeatures from string[] to object[]; Cilium 1.16.5 writes the old string format and the v1.2.0 CRD rejects the status patch with "must be of type object: string", leaving GatewayClass permanently Unknown/Pending. v1.1.0 retains string schema. Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17 adopts the v1.2.0 object schema for supportedFeatures. Closes #503 Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 13:23:32 +04:00
e3mrah	83ec889f06	feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560 ) (#580 ) Charts bumped: - bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented) - bp-crossplane 1.1.3 -> 1.1.4 (subchart stub) - bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched) - bp-velero 1.2.0 -> 1.2.1 (subchart stub) - bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented) - bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented) - bp-grafana 1.0.0 -> 1.0.1 (subchart stub) - bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented) - bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services) Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache. Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 13:21:53 +04:00
e3mrah	2adc3a9493	fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase (#579 ) Harbor upstream always connects to a database named 'registry' (harbor.database.external.coreDatabase default). The CNPG Cluster was initialised with database='harbor', causing: FATAL: database "registry" does not exist (SQLSTATE 3D000) Fix: change postgres.cluster.database default from 'harbor' → 'registry' in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap and Harbor's coreDatabase now use 'registry'. Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run against harbor-pg-1. harbor-core is now 1/1 Running. Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated. Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 13:21:36 +04:00
e3mrah	b647aa2561	fix(bp-harbor): provision harbor-pg CNPG cluster + database-secret (Closes #566 ) (#578 ) Replace Helm lookup in database-secret.yaml with reflector annotation: harbor-database-secret now reflects harbor-pg-app via reflector.v1.k8s.emberstack.com/reflects. This fixes the race between Helm rendering (fresh install) and CNPG cluster bootstrap — reflector is event-driven and propagates the CNPG password within seconds of harbor-pg-app being created, with no operator action required. Also includes: - templates/cnpg-cluster.yaml: harbor-pg CNPG Cluster (1 inst, 5Gi, pg16) - values.yaml: postgres: block + database.external.host = harbor-pg-rw - Chart 1.2.0 → 1.2.1; bootstrap-kit refs updated (_template, otech, omantel) Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 13:14:00 +04:00
e3mrah	58cf297800	fix(bp-seaweedfs): remove trailing slash in registry — fixes double-slash image ref (Closes #568 ) (#576 ) `registry: "chrislusf/"` in values.yaml produced `chrislusf//seaweedfs:4.22` because the vendored chart's _helpers.tpl renders `printf "%s/%s:%s" $registryName $name $tag` — the trailing slash joined with the separator slash made an invalid image reference. Fix: `registry: "chrislusf/"` → `registry: "chrislusf"`. Bump bp-seaweedfs 1.1.0 → 1.1.1. Update bootstrap-kit refs in _template, otech.omani.works, omantel.omani.works (1.0.1 → 1.1.1). Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 13:02:48 +04:00
e3mrah	5796de12bc	fix(bp-spire): re-enable oidc-discovery-provider ClusterSPIFFEID to fix init stuck (Closes #571 ) (#575 ) The oidc-discovery-provider ClusterSPIFFEID was disabled at bootstrap to work around a CRD-ordering race (spire-controller-manager applying the template before CRDs were registered). That race was fixed in bp-spire 1.1.4 by listing spire-crds as the first Helm dependency. With all ClusterSPIFFEIDs still disabled the oidc-discovery-provider init container blocks indefinitely with "PermissionDenied: no identity issued" — the controller-manager never creates the registration entry so no SVID is issued. Re-enable oidc-discovery-provider identity. The default, test-keys, and child-servers identities remain disabled (not needed for bootstrap). Also carries the global.imageRegistry field added by issue #560 (was 1.1.5 in working tree, now bumped to 1.1.6 for this fix). Bootstrap-kit slot 06 updated from 1.1.4 → 1.1.6. Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 13:00:43 +04:00
e3mrah	b88e98026f	fix(bp-falco): rename rules_file → rules_files (Falco 0.36+ canonical key, Closes #570 ) (#574 ) Falco 0.36+ uses `rules_files` (plural) as the canonical multi-file rules key. Setting the deprecated `rules_file` (singular) alongside the upstream subchart's `rules_files` default causes Falco to detect a config conflict and abort startup with CrashLoopBackOff on otech22. Bump bp-falco 1.0.0 → 1.0.1. Bootstrap-kit slot 31 updated. Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 12:59:29 +04:00
e3mrah	06844d3a70	fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569 ) (#573 ) bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but bp-external-dns still had `powerdnsNamespace: openova-system` in its NetworkPolicy egress rule and `--pdns-server=...openova-system...` in extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation. Fix: - externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns - extraArgs --pdns-server: ...openova-system... → ...powerdns... Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated. Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 12:58:24 +04:00
e3mrah	c59f0496a2	fix(bp-mimir): disable ingest_storage to fix Kafka CrashLoop (Closes #567 ) (#572 ) Upstream mimir-distributed 6.0.6 can boot in ingest-storage mode which requires a Kafka endpoint. Setting kafka.enabled:false only disables the bundled Kafka subchart — it does not tell the Mimir process itself to use classic mode. Adding mimir.structuredConfig.ingest_storage.enabled:false forces the classic blocks-storage ingester path (no Kafka dependency), matching Catalyst's NATS JetStream event bus (ADR-0001). Bump bp-mimir 1.0.0 → 1.0.1. Bootstrap-kit slot 23 updated. Co-authored-by: alierenbaysal <alierenbaysal@openova.io>	2026-05-02 12:57:09 +04:00
e3mrah	ad9cfc0f23	feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560 ) (#565 ) Charts with template image refs (fully rewritten when registry set): - bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst job images now prefixed with global.imageRegistry when non-empty. Default (empty) renders identical manifests. - bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed with global.imageRegistry when non-empty. Verified: dnsdist image rewrites to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14. Subchart-only charts (global.imageRegistry stub added; threading via per-component subchart values.yaml keys documented in comments): - bp-external-secrets 1.1.0→1.1.1 - bp-cnpg 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR) - bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR) - bp-nats-jetstream 1.1.1→1.1.2 - bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring vcluster: N/A — no chart directory under platform/vcluster/chart/ Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:52:43 +04:00
e3mrah	19c06c63bc	fix(bp-cert-manager-dynadot-webhook): dedupe template labels (Closes #561 ) (#564 ) deployment.yaml pod template included both selectorLabels and labels named templates; since selectorLabels is a strict subset of labels, this produced duplicate app.kubernetes.io/name and app.kubernetes.io/instance keys in the rendered pod template metadata — triggering the HelmRelease validation error "spec.values.metadata.labels has duplicate key". Remove the redundant selectorLabels include from the pod template (selector.matchLabels still uses selectorLabels correctly). Bump chart 1.1.0 → 1.1.1. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:50:11 +04:00
e3mrah	a7fa0626b2	feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-pdns-webhook/sealed-secrets (PR 1/3 #560 ) (#562 ) * docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade Per founder corrective: existing diagram missed the real blockers surfaced during otech10..otech22 burns. The image-pull-through gap (#557) and the cross-namespace secret gap (#543, #544) gate every workload pull from a public registry — without them, Sovereign hits DockerHub anonymous rate-limit on first provision and 30+ HRs are ImagePullBackOff/CreateContainerConfigError. Adds: - Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap + #557C charts global.imageRegistry templating). Edges to NATS / Gitea / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane / cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao - Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544 powerdns-api-credentials reflect). Edges to bp-catalyst-platform and bp-cert-manager-powerdns-webhook - Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch 38-HR threshold both gate Phase 8a integration test - Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is what makes "zero contabo dependency" DoD-met possible WBS now reflects the cascade observed live, not the pre-Phase-8a model. * feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560) - bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium subchart does not expose a single registry knob — per-Sovereign overlays wire specific image.repository fields alongside this value. - bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream chart exposes per-component image.registry knobs documented in the comment. - bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub added + deployment.yaml templated to prefix the webhook image repository when the value is non-empty. Verified: helm template with --set global.imageRegistry=harbor.openova.io produces harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>. - bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream subchart exposes sealed-secrets.image.registry for overlay wiring. All four charts render clean with default values (empty imageRegistry). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:48:37 +04:00
e3mrah	ccc38987c2	fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550 ) (#558 ) Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49) but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook (groupName: acme.dynadot.openova.io). Without slot 49b the APIService for acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely → HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL. Changes: - core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response, API returns SetDnsResponse); change ResponseCode to json.Number (API returns integer 0, not string "0"); update tests to match real API response format - platform/cert-manager-dynadot-webhook/chart: - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix) - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.* and privateKeySecretRefName; add rbac.create comment for domain-solver - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64) - clusterissuer.yaml: new template (skip-render default, enabled via overlay) - deployment.yaml: add imagePullSecrets support (required for private GHCR) - Chart.yaml: bump to 1.1.0 - clusters/_template/bootstrap-kit: - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer) - kustomization.yaml: add 49b entry - infra/hetzner: - variables.tf: add dynadot_managed_domains variable - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials Secret + apply it before Flux reconciles bootstrap-kit Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:42:13 +04:00
e3mrah	7d264d9647	fix(bp-powerdns): default cluster.namespace=powerdns not openova-system (Closes #553 ) (#556 ) bp-powerdns HelmRelease upgrade fails on Sovereigns with: failed to create resource: namespaces "openova-system" not found The chart's CNPG Cluster CR template targets postgres.cluster.namespace which defaulted to openova-system (a contabo-only legacy ns). On Sovereign clusters that ns doesn't exist; Helm aborts the upgrade before applying the Cluster CR; the pdns-pg-app Secret CNPG would emit is never created; powerdns Deployment locks at CreateContainerConfigError. Default to powerdns (chart targetNamespace per bootstrap-kit overlay). Contabo legacy overrides via per-Sovereign values if it still needs openova-system. Bump bp-powerdns 1.1.4 -> 1.1.5 across template + omantel + otech overlays. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-02 12:19:37 +04:00
e3mrah	b2307e290d	fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543 ) (#554 ) Part A — bp-reflector blueprint: - Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a, dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288 via the bp-reflector OCI wrapper chart. - Register in bootstrap-kit/kustomization.yaml. - Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml): single replica, 32Mi memory, ServiceMonitor off by default. Part B — annotate flux-system/ghcr-pull + rename in charts: - infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector annotations to the ghcr-pull Secret written at cloud-init time so Reflector auto-mirrors it to every namespace on first boot. - Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in: api-deployment.yaml, ui-deployment.yaml, marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml (14 total occurrences). - Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit HelmRelease version reference to match. Root cause: the canonical secret name is ghcr-pull (written by cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff on all Catalyst pods on every new Sovereign. Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret propagated to 33 namespaces via kubectl; non-Running pods bounced. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-02 12:17:51 +04:00
e3mrah	902d857702	fix(bp-powerdns): reflect powerdns-api-credentials to external-dns namespace (Closes #544 ) (#552 ) Add reflector.v1.k8s.emberstack.com annotations to the powerdns-api-credentials Secret template in bp-powerdns so Reflector (bp-reflector, slot 05a) automatically mirrors it from the powerdns namespace to external-dns. Bump chart version 1.1.3 → 1.1.4. Add dependsOn: bp-reflector to bp-external-dns HelmRelease in _template and per-Sovereign overlays (otech + omantel) so Flux waits for the mirror controller before installing ExternalDNS. Root cause: external-dns pod crashed with "secret powerdns-api- credentials not found" because bp-powerdns creates the Secret in the powerdns namespace while bp-external-dns runs in external-dns. No cross-namespace propagation existed. Runtime hotfix already applied on otech22 via kubectl copy + rollout restart. Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 12:11:43 +04:00
e3mrah	8cde771c0f	fix(bp-openbao): unseal on idempotent path + persist keys (Closes #539 ) (#540 ) PR #528 added unseal logic but only on the FRESH-init branch. When a previous Job pod completed `bao operator init` but exited before the unseal block (or when openbao-0 simply restarts under shamir seal), the next reconcile takes the "already initialized" branch and exits without ever running `bao operator unseal`. Symptom on otech21: init-job logs end with `auto-unseal init complete`, but `bao status` reports Initialized=true Sealed=true forever, the bp-openbao HR stays Unknown/Running for the full 15m install timeout, and bp-external-secrets/bp-external-secrets-stores block on the dep. Fix has two parts: 1. Persist `unseal_keys_b64` on fresh init to a new K8s Secret `openbao-unseal-keys` (BEFORE applying the keys, so a unseal crash mid-step is recoverable on next retry). 2. Add a Step 2a "idempotent-path unseal" branch: when bao reports Initialized=true Sealed=true, fetch the persisted keys Secret and apply unseal exactly the same way Step 3a does on fresh init. Verify Sealed=false and exit; otherwise FATAL with the manual-recovery pointer. RBAC: extend the openbao-auto-unseal Role to allow create/get/ patch/update on openbao-unseal-keys (alongside openbao-init-marker). Chart bump 1.2.3 → 1.2.4. HR ref in clusters/_template/bootstrap-kit/08-openbao.yaml updated to match so cloud-init-templated Sovereigns pick up the new chart. Co-authored-by: e3mrah <emrah.baysal@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 10:44:46 +04:00
e3mrah	d90abb1e85	fix(bp-openbao): unseal vault after init in chart Job (Closes #527 ) (#528 ) The init Job ran `bao operator init -key-shares=1 -key-threshold=1` which leaves the cluster Initialized=true but Sealed=true. Without an explicit `bao operator unseal <key>` call the StatefulSet pod stays sealed forever, the bp-openbao HelmRelease never reports Ready=True, and every dependent blueprint (bp-external-secrets, bp-external-secrets-stores) blocks on this dep. This was the 5th and final latent bug in the chart's auto-unseal flow (after PRs #518 #520 #523 #524 #525). On otech17 (6b17518f12d529ea, 2026-05-02) the init Job completed cleanly but `bao status` reported Sealed=true forever. Fix: parse `unseal_threshold` and `unseal_keys_b64` from the init JSON, call `bao operator unseal <key>` $threshold times (1 with the current key-shares=1 / key-threshold=1 config), then assert `bao status -format=json \| grep '"sealed":false'` before the Job exits success. Bumps chart 1.2.2 -> 1.2.3 and HR ref in clusters/_template/bootstrap-kit/08-openbao.yaml. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:24:57 +04:00
e3mrah	ba5a1929f1	fix(bp-openbao): use shamir-compatible init flags + bump 1.2.1→1.2.2 (refs #517 ) (#525 ) The chart's init Job called `bao operator init -recovery-shares=1 -recovery-threshold=1` which only works with auto-unseal seal types (gcpckms/awskms/transit). The upstream openbao chart's default config uses `seal "shamir"` (no auto-unseal stanza in values.standalone.config / values.ha.config), so the OpenBao API returns 400: "parameters recovery_shares,recovery_threshold not applicable to seal type shamir". Switch to -key-shares=1 -key-threshold=1 which is the correct shamir- seal init flags. Operators wiring auto-unseal seals later will need to flip back via a chart-values toggle. Bumps chart 1.2.1→1.2.2 + matches HR ref so Sovereigns pull the new artifact on next reconcile. Refs #517 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:14:05 +04:00
e3mrah	6e3d3d281e	fix(bp-openbao): bump chart 1.2.0→1.2.1 + HR ref for busybox-wget fix (refs #517 ) (#524 ) Bumps platform/openbao/chart/Chart.yaml version to 1.2.1 carrying the busybox-compatible wget flag fix (PR #523). Also bumps the HR's chart.spec.version in clusters/_template/bootstrap-kit/08-openbao.yaml so Sovereigns pull the new bytes once blueprint-release publishes ghcr.io/openova-io/bp-openbao:1.2.1. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:09:06 +04:00
e3mrah	5c0618d920	fix(bp-openbao): use busybox-compatible wget flag in init Job (refs #517 ) (#523 ) The chart's init Job runs inside the openbao image (quay.io/openbao/ openbao:2.1.0) which uses busybox wget. The script's wget calls used `--ca-certificate=$CACERT` which busybox wget does not support, causing wget to print its usage page and fail with "seed Secret has no key recovery-seed" (false negative — the parsing pipeline saw the usage text instead of JSON). Replace with `--no-check-certificate`. The Secret still requires the Bearer token for auth — the lack of CA verification only affects TLS handshake validation against an in-cluster API server reached via the well-known kubernetes.default.svc DNS name (out-of-band attack surface is negligible inside the pod network). The `--method=DELETE` line for cleaning up the seed Secret remains — busybox wget doesn't support method override either, but that line is wrapped in `\|\| true` so the seed deletion failure doesn't block the init Job from succeeding. Seed is single-use anyway and harmless post-init (the recovery key is the OUTPUT of bao operator init, not this seed). Refs #517 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:07:52 +04:00
e3mrah	7931e695b0	fix(cert-manager-powerdns-webhook): cap CA Certificate CN at 64 bytes (#509 ) The chart's CA Certificate template generated a `spec.commonName` of `ca.<fullname>.cert-manager` where `<fullname>` is the Helm fullname (release name + chart name). With the bootstrap-kit's release name `cert-manager-powerdns-webhook`, the rendered CN landed at 78 bytes: ca.cert-manager-powerdns-webhook-bp-cert-manager-powerdns-webhook.cert-manager cert-manager's admission webhook rejects this against the RFC 5280 ub-common-name-length=64 PKIX upper bound, breaking otech11 (ac90a3ea12954e7d, chart 1.0.1, 2026-05-02) at install time. Fix: collapse the CN onto the chart `name` helper (always `bp-cert-manager-powerdns-webhook`, ≤63 chars) instead of the release-prefixed `fullname`. The CA cert's CN is opaque identity only — no client validates by hostname against this CN — so the shortening is behaviour-preserving and stable across any operator-chosen releaseName. Rendered CN with this fix: ca.bp-cert-manager-powerdns-webhook.cert-manager (48 bytes) Bumps chart 1.0.1 → 1.0.2 and updates the bootstrap-kit slot reference in clusters/_template/bootstrap-kit/49-bp-cert-manager-powerdns-webhook.yaml. Closes #508.	2026-05-02 02:09:41 +04:00
e3mrah	eeba0d90cc	fix(infra): dedupe labels in bp-cert-manager-powerdns-webhook deployment template (#507 ) The pod template's metadata.labels block in the upstream Deployment template included BOTH the `selectorLabels` helper AND the `labels` helper. Since `labels` already emits app.kubernetes.io/name and app.kubernetes.io/instance, the rendered YAML had those keys twice in a single mapping, which Helm v3 post-render rejects with: yaml: unmarshal errors: line 29: mapping key "app.kubernetes.io/name" already defined at line 26 line 30: mapping key "app.kubernetes.io/instance" already defined at line 27 Surfaced live on Phase-8a-preflight otech11 (ac90a3ea12954e7d, on catalyst-api:c148ef3, 2026-05-01). Fix: drop the redundant `selectorLabels` include — `labels` is a superset. Bump chart version 1.0.0 → 1.0.1 and update the bootstrap-kit HR reference accordingly. Closes openova#506. Co-authored-by: e3mrah <emrah@openova.io>	2026-05-02 01:52:50 +04:00
e3mrah	e1f7d22f3c	fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503 ) (#505 ) Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them ahead of every chart that ships HTTPRoute templates: bp-openbao, bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor, bp-grafana. Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to InstallFailed with `no matches for kind "HTTPRoute" in version "gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI. enabled=true` flag wires up the cilium gateway controller and creates the `cilium` GatewayClass, but does NOT install the gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no `installCRDs`-equivalent knob for gateway-api so the upstream CRDs must ship via a separate Blueprint. Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by the founder for ALL similar future cases: intra-chart CRD-ordering breaks → split into two charts + Flux dependsOn. Mirrors the bp-crossplane/bp-crossplane-claims and bp-external-secrets/ bp-external-secrets-stores splits. Files: - platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0 standard-install.yaml; helm.sh/resource-policy: keep on every CRD so Helm uninstall does not orphan every HTTPRoute on the cluster - platform/gateway-api/chart/scripts/regenerate.sh — developer tool for re-vendoring on upstream version bump (annotation-driven) - platform/gateway-api/chart/tests/crd-render.sh — chart integration test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin) - clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease + HelmRepository, dependsOn bp-cilium - clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea, 11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml — add `dependsOn: bp-gateway-api` - clusters/_template/bootstrap-kit/kustomization.yaml — register 01a-gateway-api.yaml between 01-cilium and 02-cert-manager - scripts/expected-bootstrap-deps.yaml — declare slot 1a + add bp-gateway-api to depends_on of every HTTPRoute-using slot Closes #503 Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 01:30:50 +04:00
e3mrah	1865ac8975	fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340 ) (#504 ) * fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) The upstream seaweedfs/seaweedfs 4.22.0 chart now ships templates/shared/security-configmap.yaml which calls fromToml — a Sprig function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm SDK older than 3.13 and PARSES every template before any {{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's mere presence breaks install on every Sovereign with: parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21): function "fromToml" not defined even though enableSecurity defaults to false. Setting the gate value does NOT skip parsing — only deleting / never-shipping the file does. Fix shape (per ticket #340): 1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/ (committed bytes, not auto-pulled at build time). Required because the upstream Helm repo overwrites 4.22.0 in place — re-pulling would re-introduce the broken file. 2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml. Every other template that references the deleted ConfigMap is gated under {{- if enableSecurity }} so removing it is a no-op for our default deployment shape (Catalyst SeaweedFS auth happens at the S3 layer via IAM creds from External Secrets, not via the upstream chart's TLS/JWT machinery). 3. Drop the dependencies: block from chart/Chart.yaml; add annotations.catalyst.openova.io/no-upstream=true so the blueprint-release workflow's hollow-chart guard (issue #181) skips the auto-pull/round-trip checks for this chart. 4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the vendored bytes are tracked. 5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled). 6. Add tests/no-fromtoml.sh — chart-test that asserts the offending file stays deleted across future re-vendors. Runs in .github/workflows/blueprint-release.yaml as a publish-gating check. Unblocks Phase-8a observability + storage chain on otech (bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn bp-seaweedfs). Closes #340 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines 35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG in scripts/expected-bootstrap-deps.yaml was never updated to match. Pre-existing drift on main; surfaced by the dependency-graph-audit check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the audit passes on the same PR — the two changes are both about the storage chain on Sovereigns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 01:20:59 +04:00
e3mrah	20b896070f	feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326 ) (#448 ) Wires the per-Sovereign K8s api-server's --oidc-* validator to the per-Sovereign Keycloak realm so customer admins can authenticate kubectl directly against their Sovereign — no static admin-kubeconfig handoff, no rotated bearer-token exchange. infra (cloud-init): - Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed from sovereign_fqdn (https://auth.\${sovereign_fqdn}/realms/sovereign) per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups prefixes scope OIDC subjects under "oidc:" so RoleBindings reference e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509. Canonical seam (anti-duplication rule, ADR-0001 §11.3): - The bp-keycloak chart already bundles bitnami/keycloak's keycloakConfigCli post-install Helm hook Job, which imports realms declared under values.keycloak.keycloakConfigCli.configuration. We enable the existing seam — no bespoke kubectl-exec realm-creation script, no custom Admin-API call from catalyst-api. bp-keycloak chart (1.1.2 → 1.2.0): - Enable keycloakConfigCli + ship inline sovereign-realm.json with: realm "sovereign" (invariant per Sovereign — Keycloak resolves the issuer claim from the request hostname, so no per-FQDN realm rename), default groups sovereign-admins/-ops/-viewers, oidc-group -membership-mapper emitting "groups" claim, public OIDC client "kubectl" with localhost:8000 + OOB redirect URIs (kubectl-oidc -login defaults), publicClient=true (kubectl runs locally and cannot safely hold a secret), PKCE S256 enforced. - Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape). - Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/, otech.omani.works/ to version: 1.2.0. - New chart test tests/oidc-kubectl-client.sh (4 cases) — all green. - Existing tests/observability-toggle.sh — still green. Documentation: - Add §11 "kubectl OIDC for customer admins" runbook to docs/omantel-handover-wbs.md with one-time workstation setup (kubectl krew install oidc-login + config set-credentials), sovereign-admin RBAC binding (oidc:sovereign-admins → cluster -admin), and 401-debugging table mapping common symptoms to root causes. - Carve #326 out of §7 "Out of scope" — it is shipped. - Add §9 status row. Validation: - grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl → 2 (comment + the actual flag in the curl line) - grep -c 'oidc-username-claim' → 2 - helm template platform/keycloak/chart → renders post-install keycloak-config-cli Job + ConfigMap with kubectl client (3 hits on grep "kubectl"; 1 hit on "clientId": "kubectl") - bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode) - 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle gates green Out of scope (deferred to follow-up tickets): - Per-Sovereign user provisioning UI (#322, #323) - Refresh-token revocation on RoleBinding deletion (#324) - provider-kubernetes Crossplane ProviderConfig per Sovereign (#321) - omantel migration / Phase 8 live execution NO catalyst-api or UI source files touched (those are #319/#322/#323 agents' territories per agent brief). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 19:07:52 +04:00
e3mrah	b6810c1940	feat(bp-crossplane-claims): UserAccess CRD + Composition + RBAC ClusterRoles for Sovereign IAM (closes #322 ) (#446 ) Adds the data plane for the Sovereign IAM access plane (epic #320): - platform/crossplane-claims/chart/templates/xrds/useraccess.yaml XUserAccess XRD (access.openova.io/v1alpha1) — cluster-scoped Claim carrying user identity (Keycloak subject + groups), Sovereign ref, and one or more (application, role, namespaces) grants. - platform/crossplane-claims/chart/templates/compositions/useraccess.yaml Default Composition useraccess.compose.openova.io — materialises one RoleBinding per Claim via provider-kubernetes Object against the per-Sovereign sovereign-<sovereignRef> ProviderConfig. Multi-grant shapes are expanded api-side into N single-grant Claims (avoids the Composition-iteration trap; no composition-functions introduced). - platform/crossplane-claims/chart/templates/clusterroles.yaml Three canonical ClusterRoles — openova:application-{admin,editor,viewer}. Editor + viewer explicitly omit secrets; admin can manage namespace- scoped roles/rolebindings (NOT cluster-scoped). - userAccess.enabled values toggle (default true), version bumps to 1.1.0 on chart + blueprint, sample fixture, validation script extended to expect 7 XRDs / 7 Compositions / 3 ClusterRoles. Canonical seam: extends the existing platform/crossplane-claims/chart/ XRD+Composition pattern (compose.openova.io/v1alpha1 family). New API group access.openova.io is intentional — IAM is a separate concern from the cloud-resource compose.* family. No catalyst-api or UI code touched (those are #323's territory; this PR ships the data model #323 consumes). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 19:03:10 +04:00
e3mrah	0511efbdac	feat(bp-harbor): vendor-agnostic Object Storage backend (closes #383 ) (#437 ) Reworks bp-harbor to write blobs DIRECTLY to the cloud-provider's native S3 endpoint (Hetzner Object Storage on Hetzner Sovereigns) per ADR-0001 §13. Mirrors the post-#425 vendor-agnostic seam shipped in bp-velero:1.2.0 (PR #435 / SHA `0172b9a8`) 1:1. Canonical seam used (per anti-duplication rule + docs/omantel- handover-wbs.md §3a): - Sealed Secret name: flux-system/object-storage (NOT hetzner-prefixed) - Chart values block: .Values.objectStorage.s3.{enabled,credentialsSecretName,s3.{accessKey,secretKey}} - Template filename: templates/objectstorage-credentials.yaml - Reference impl: platform/velero/chart/ (PR #435) Chart changes (platform/harbor/chart/): - Chart.yaml: 1.0.0 → 1.1.0; description rewritten to emphasise cloud-direct architecture + remove SeaweedFS hard-dep claim. - values.yaml: REMOVED hardcoded SeaweedFS endpoint (http://seaweedfs-s3.seaweedfs.svc.cluster.local:8333) from persistence.imageChartStorage.s3.regionendpoint. Default type flipped to `filesystem` so contabo/dev render is clean. Added vendor-agnostic objectStorage block: objectStorage: enabled: false useExistingSecret: false credentialsSecretName: "" s3: { accessKey: "", secretKey: "" } - templates/objectstorage-credentials.yaml (NEW): synthesises a harbor-namespace Secret with REGISTRY_STORAGE_S3_ACCESSKEY + REGISTRY_STORAGE_S3_SECRETKEY keys (the upstream chart's persistence.imageChartStorage.s3.existingSecret consumption shape — envFrom on the registry pod). Skip-render branch when objectStorage.enabled=false (default). - templates/_helpers.tpl: added bp-harbor.objectStorageCredentialsSecretName helper. - templates/networkpolicy.yaml: egress rule retargeted from SeaweedFS service-namespace selector → external HTTPS:443 (works for any cloud-native S3 endpoint without vendor coupling). Gated on `.Values.objectStorage.enabled`. Removed seaweedfsNamespace + seaweedfsS3Port overlay keys. Per-Sovereign overlays (clusters/{_template,omantel,otech}/bootstrap- kit/19-harbor.yaml): - Chart version reference bumped 1.0.0 → 1.1.0. - dependsOn: bp-seaweedfs REMOVED. New dependsOn = bp-cnpg + bp-cert-manager. - Added valuesFrom block mapping the 5 keys of flux-system/object- storage Secret: s3-bucket → harbor.persistence.imageChartStorage.s3.bucket s3-region → harbor.persistence.imageChartStorage.s3.region s3-endpoint → harbor.persistence.imageChartStorage.s3.regionendpoint s3-access-key → objectStorage.s3.accessKey s3-secret-key → objectStorage.s3.secretKey - Inline values flip objectStorage.enabled=true, harbor.persistence.imageChartStorage.type=s3, and harbor.persistence.imageChartStorage.s3.existingSecret=harbor- objectstorage-credentials. UI catalog (products/catalyst/bootstrap/ui/src/shared/constants/components.ts): - Harbor's `dependencies` array drops `seaweedfs`. Now ['cnpg', 'valkey']. Validation: helm template default render → 1448 lines, 5 Secrets (Harbor internal: core/jobservice/registry/ registry-htpasswd/database — NO objectstorage-credentials), type=filesystem, 0 SeaweedFS references. helm template overlay render with objectStorage.enabled=true + type=s3 + bucket=omantel-harbor + region=fsn1 + regionendpoint=https://fsn1.your-objectstorage.com + existingSecret=harbor-objectstorage-credentials → 1452 lines, 6 Secrets (5 internal + 1 objectstorage-credentials), type=s3 with Hetzner endpoint, registry pod envFrom wired to the new Secret, 0 SeaweedFS references. scripts/check-vendor-coupling.sh → exit 0 (no violations across platform/, clusters/, products/catalyst/bootstrap/{api,ui}/). helm lint → 0 failures. WBS: §2 row 18 → 🟢 chart-released (#383). §9 #383 row → 🟢 chart-released narrative. §6 DAG: T383 moved from `class blocked` → `class done`. Hetzner-S3 E2E deferred to Phase 8 (first omantel run). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 18:18:37 +04:00
e3mrah	0172b9a89a	wip(#425 ): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435 ) Files staged from prior agent run before rate-limit. Re-dispatch will verify, complete missing pieces (Crossplane Provider+ProviderConfig in cloud-init, grep-zero acceptance, helm/go test runs, WBS row update), and finalise the PR. Includes: - platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml - platform/velero/chart/values.yaml (objectStorage.s3.* block) - platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0) - products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package) - internal/hetzner/objectstorage{,_test}.go DELETED - credentials handler + StepCredentials.tsx renamed - infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl} - clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml - platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425) Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 18:05:19 +04:00
e3mrah	92b7db622d	fix(bp-external-secrets-stores): split ClusterSecretStore into separate chart per #247 pattern (closes #331 ) (#426 ) * fix(bp-external-secrets): split ClusterSecretStore into bp-external-secrets-stores chart (resolves CRD ordering, closes #331) bp-external-secrets@1.0.0 deadlocked on first install on otech.omani.works: Helm install failed for release external-secrets-system/external-secrets with chart bp-external-secrets@1.0.0: failed post-install: unable to build kubernetes object for deleting hook bp-external-secrets/templates/clustersecretstore-vault-region1.yaml: resource mapping not found for name: "vault-region1" namespace: "" no matches for kind "ClusterSecretStore" in version "external-secrets.io/v1beta1" Root cause: Helm's `helm.sh/hook-delete-policy: before-hook-creation` ran a kubectl-style lookup of the existing ClusterSecretStore CR before the upstream `external-secrets` subchart's CRDs finished registration. The in-line ClusterSecretStore template (templates/clustersecretstore-vault- region1.yaml) and the upstream subchart's CRDs co-installed in the same release; admission ordering wasn't deterministic enough to make the post-install hook safe. Fix — same pattern as PR #247 (bp-crossplane@1.1.3 ↔ bp-crossplane-claims@1.0.0): split the chart into controller + stores. Flux dependsOn orders them. - bp-external-secrets@1.1.0 — controller-only (just upstream subchart + NetworkPolicy + ServiceMonitor toggle). CRDs register here. - bp-external-secrets-stores@1.0.0 (NEW) — the default ClusterSecretStore CR; depends on bp-external-secrets being Ready. No Helm hooks needed: by the time this chart's HelmRelease starts, Flux has already verified bp-external-secrets is Ready=True and therefore the CRDs are registered. Files: NEW: platform/external-secrets-stores/blueprint.yaml (1.0.0) NEW: platform/external-secrets-stores/chart/Chart.yaml (1.0.0; no upstream subchart, annotation `catalyst.openova.io/no-upstream: "true"`) NEW: platform/external-secrets-stores/chart/values.yaml (clusterSecretStore.* knobs moved from controller chart) MOVED: platform/external-secrets/chart/templates/clustersecretstore-vault-region1.yaml → platform/external-secrets-stores/chart/templates/clustersecretstore-vault-region1.yaml (Helm hook annotations removed — Flux dependsOn now handles ordering) TOUCHED: platform/external-secrets/chart/Chart.yaml (1.0.0 → 1.1.0; description note appended) TOUCHED: platform/external-secrets/blueprint.yaml (1.0.0 → 1.1.0) TOUCHED: platform/external-secrets/chart/values.yaml (clusterSecretStore block removed; pointer comment added) NEW: clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml (Flux HelmRelease, dependsOn: [bp-external-secrets, bp-openbao]) TOUCHED: clusters/_template/bootstrap-kit/15-external-secrets.yaml (chart version 1.0.0 → 1.1.0) TOUCHED: clusters/_template/bootstrap-kit/kustomization.yaml (slot 15a inserted after 15) Out of scope for this PR (separate tickets): - blueprint-release.yaml CI fan-out: verify the path-matrix picks up the new platform/external-secrets-stores/ directory automatically; if not, add the directory to the matrix in a follow-up. - Per-Sovereign cluster directory edits (#257 will delete those). - Phase 0 minimum trim (#310 will renumber slots; this PR uses 15a as a non-disruptive sub-slot insertion that works with both the current 35-slot kustomization and the eventual 15-slot canonical layout — when #310 renumbers, 15 + 15a become 08 + 09 in the canonical order). Refs: #331 (this issue), #247 (pattern reference — bp-crossplane split), Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): register bp-external-secrets-stores in expected-bootstrap-deps.yaml The dependency-graph-audit CI step rejected PR #334 because the new bp-external-secrets-stores HR was on disk at slot 15a but missing from the expected DAG. This commit adds it with the same dependsOn shape as clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml: [bp-external-secrets, bp-openbao]. Refs: #331, #310 (Phase 0 minimum), PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(bp-external-secrets): retire CR cases from controller test, add stores-toggle (#331) After splitting the default ClusterSecretStore into bp-external-secrets-stores @1.0.0, the controller chart's observability-toggle integration test still expected the CR to render in the controller chart (Cases 4 + 5). Those assertions now belong on the new chart. Changes: - platform/external-secrets/chart/tests/observability-toggle.sh: Replace Cases 4+5 with a single inverted assertion — the controller chart MUST render ZERO ClusterSecretStore CRs (top-level kind:); only the upstream subchart's CRD definition (whose spec.names.kind value is "ClusterSecretStore" at non-zero indent) is allowed. - platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh: NEW. Mirrors the retired Cases 4+5 against the stores chart, plus a Case 3 that asserts clusterSecretStore.server overrides propagate. Local smoke: bash platform/external-secrets/chart/tests/observability-toggle.sh → 4/4 PASS bash platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh → 3/3 PASS Refs: #331, PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): handle alphanumeric sub-slot suffixes in check-bootstrap-deps.sh PR #334 (issue #331) added slot 15a-external-secrets-stores as a sub-slot between numeric slots 15 and 16. The bootstrap-deps audit script's `printf '%02d'` formatter rejected `15a` with: scripts/check-bootstrap-deps.sh: line 390: printf: 15a: invalid number Fix: detect non-numeric slot tokens and pass them through verbatim. Numeric slots still render as zero-padded `01..49` for output alignment. Local smoke: $ bash scripts/check-bootstrap-deps.sh ... [P] slot 15 bp-external-secrets <-- bp-cert-manager bp-openbao [P] slot 15a bp-external-secrets-stores <-- bp-external-secrets bp-openbao ... OK: bootstrap-kit dependency graph audit PASSED Refs: #331, PR #334. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(wbs): tick #331 chart-released bp-external-secrets@1.1.0 (controller-only) + bp-external-secrets-stores@1.0.0 (NEW) shipped in PR #426. Helm-template acceptance + both toggle tests + dependency-graph-audit all green. Sovereign-impact deferred to Phase 8. Refs: #331, PR #426. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 17:33:47 +04:00
e3mrah	f7796ef807	feat(bp-velero): Hetzner Object Storage backend wiring (closes #384 ) (#423 ) * feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) Velero on a Hetzner Sovereign now writes its backups DIRECTLY to Hetzner Object Storage per ADR-0001 §13 (S3-aware app architecture rule) + docs/omantel-handover-wbs.md §3 — NOT SeaweedFS, which is reserved as a POSIX→S3 buffer for legacy POSIX-only writers and is not in the minimal Sovereign set. Mirrors the Hetzner-direct backend pattern Agent #383 is wiring for Harbor; both consume the canonical flux-system/hetzner-object-storage Secret shipped by issue #371 (cloud-init writes 5 keys: s3-endpoint / s3-region / s3-bucket / s3-access-key / s3-secret-key, derived from the operator-issued Hetzner-Console keys + the per-Sovereign bucket provisioned by OpenTofu's aminueza/minio resource). platform/velero/chart/ (umbrella chart, bumped to 1.1.0): - templates/_helpers.tpl: NEW — bp-velero.fullname / bp-velero.labels helpers + bp-velero.hetznerCredentialsSecretName (default `velero-hetzner-credentials`). - templates/hetzner-credentials-secret.yaml: NEW — synthesises a velero-namespace Secret with a single `cloud` key in AWS-CLI INI format from .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}. The upstream Velero deployment mounts this at /credentials/cloud via existingSecret + AWS_SHARED_CREDENTIALS_FILE. Skip-render path when veleroOverlay.hetzner.enabled is false (default — keeps contabo render clean) or useExistingSecret is true (operator supplied Secret out-of-band). - values.yaml: BSL provider/region/s3Url/bucket fields populated as placeholders the per-Sovereign HelmRelease overrides via Flux valuesFrom; backupsEnabled defaults FALSE so default render emits no half-broken BSL; veleroOverlay.hetzner block surfaces the operator-overridable fields. Long-form rationale comments inline on each value per the chart's existing docstring style. clusters/_template/bootstrap-kit/34-velero.yaml (+ omantel + otech): - dependsOn: bp-seaweedfs REMOVED — Velero is no longer a SeaweedFS consumer on Sovereigns (was the old SeaweedFS-tiered architecture that minimal-omantel retired in favour of cloud-native S3). - chart version bumped 1.0.0 → 1.1.0. - valuesFrom block added: 5 Secret-key entries pull each canonical s3-* key into the matching umbrella value path. Plaintext credentials never appear in the committed manifest; Flux dereferences valuesFrom at HelmRelease apply time. - values block adds the baseline veleroOverlay.hetzner.enabled=true + velero.credentials.{useSecret:true,existingSecret:velero-hetzner- credentials} + BSL provider/credential/s3ForcePathStyle scaffolding that the valuesFrom entries fill in. docs/omantel-handover-wbs.md: - §2 row 19: "❌ chart needs S3 endpoint rework" → "🟢 chart-released v1.1.0 — Hetzner Object Storage backend wired to #371 secret". - §9 #384 row: detailed status with smoke evidence. Smoke evidence (contabo, default values — no Hetzner credentials): - helm template t . → renders cleanly (no Hetzner Secret, no BSL). - helm template t . --set veleroOverlay.hetzner.enabled=true \ --set ...accessKey=AK_TEST --set ...secretKey=SK_TEST \ --set velero.backupsEnabled=true (+ BSL config) → Secret/velero-hetzner-credentials with `cloud` INI key emitted + BackupStorageLocation/default with provider=aws, bucket=omantel-velero, region=fsn1, s3Url=https://fsn1.your-objectstorage.com. - helm install velero-smoke . -n velero-smoke (defaults) → pod velero-69bb84c5-669sh Ready 1/1 in 48s. Smoke torn down clean. Hetzner-S3 E2E deferred to Phase 8 (first omantel run) — contabo has no Hetzner Object Storage credentials so end-to-end backup→restore verification can't run here. Anti-duplication rule: NO bash scripts authored, NO parallel implementations of upstream Velero functionality. Upstream Velero + velero-plugin-for-aws natively support any S3-compatible backend; the work here is values + a credential-shape adapter Secret, not a fork. Closes #384. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(scripts): drop bp-seaweedfs dep from bp-velero expected DAG (#384) Mirrors the dependsOn removal in clusters/_template/bootstrap-kit/34- velero.yaml from the parent commit. Velero on Hetzner Sovereigns now writes directly to Hetzner Object Storage (ADR-0001 §13 + WBS §3); no in-cluster prerequisite Blueprint is required. Local `bash scripts/check-bootstrap-deps.sh` now passes (0 drift, 0 cycles). The CI failure on the parent commit's PR was the audit flagging bp-velero as having a missing edge to bp-seaweedfs because this expected-DAG file still listed it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:24:44 +04:00
e3mrah	d2ada908c9	feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316 ) (#408 ) Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns (no managed-KMS available). Selected Option A — Shamir + cloud-init seed because: - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C) is structurally unavailable. - Transit-seal (Option B) requires a peer OpenBao cluster, only applicable to multi-region tier-1; out of scope for single-region omantel. - Manual unseal (Option D) violates the "first sovereign-admin lands on console.<sovereign-fqdn> ready to use" goal in SOVEREIGN-PROVISIONING.md §5. Architecture (per issue #316 spec + acceptance criteria 1-6): 1. Cloud-init on the control-plane node generates a 32-byte recovery seed from /dev/urandom and writes it to a single-use K8s Secret `openbao-recovery-seed` in the openbao namespace, with annotation `openbao.openova.io/single-use: "true"`. Pre-creates the openbao namespace to eliminate the race with Flux's HelmRelease apply. 2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks: - `templates/init-job.yaml` (hook weight 5): consumes the seed, calls `bao operator init -recovery-shares=1 -recovery-threshold=1`, persists the recovery key inside OpenBao's auto-unseal config, deletes the seed Secret on success. Idempotent — re-runs detect Initialized=true and exit 0. - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables the Kubernetes auth method, mounts kv-v2 at `secret/`, writes the `external-secrets-read` policy, binds the `external-secrets` role to the ESO ServiceAccount in `external-secrets-system`. 3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA + Role + RoleBinding the Jobs need (Secret get/list/delete in the openbao namespace; create/get/patch on the openbao-init-marker). Also emits the permanent `system:auth-delegator` ClusterRoleBinding bound to the OpenBao ServiceAccount so the Kubernetes auth method can call tokenreviews.authentication.k8s.io. 4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml` bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true` per-Sovereign. Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }} {{- end }}`) used throughout — never `{{ fail }}`. Default `helm template` render emits NOTHING new; opt-in via autoUnseal.enabled=true. Acceptance criteria coverage: 1. Provision fresh Sovereign — cloud-init writes seed, Flux installs bp-openbao 1.2.0, post-install Jobs run automatically. ✅ 2. bp-openbao HR Ready=True without manual intervention — install keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the init Job drives initialisation out-of-band on the same install). ✅ 3. `bao status` shows Sealed=false, Initialized=true within 5 minutes — init Job polls + retries up to 60×5s. ✅ 4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the auth-bootstrap Job binds the `external-secrets` role to ESO's SA before the Job exits. ✅ 5. Seed Secret deleted post-init — init Job deletes it via K8s API after consuming. ✅ 6. No openbao-root-token Secret in K8s — root token captured to /tmp/.root-token in the Job pod's tmpfs only; never written to a K8s Secret. The recovery key persists ONLY inside OpenBao's Raft state (auto-unseal config). ✅ Tests: - tests/auto-unseal-toggle.sh — 4 cases: * default render → no auto-unseal artefacts (skip-render works) * autoUnseal.enabled=true → both Jobs + correct hook weights * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap * idempotency annotations present on all 5 hook objects - tests/observability-toggle.sh — unchanged, all 3 cases green. - helm lint . — clean. Files: - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0 - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0 - platform/openbao/chart/values.yaml — `autoUnseal.*` block - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new - platform/openbao/chart/templates/init-job.yaml — new - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new - platform/openbao/chart/tests/auto-unseal-toggle.sh — new - platform/openbao/README.md — bootstrap procedure §2-3 expanded; auto-unseal alternatives table added. - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 → 1.2.0, autoUnseal.enabled=true. - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block inserted between ghcr-pull-secret apply and flux-bootstrap apply. - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released. Canonical seam used: extended existing `platform/openbao/chart/` per the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud calls. NO `{{ fail }}`. All knobs configurable via values.yaml per INVIOLABLE-PRINCIPLES.md #4 (never hardcode). Co-authored-by: hatiyildiz <hat.yil@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:45:44 +04:00
e3mrah	04308af7e9	feat(cert-manager): bp-cert-manager-powerdns-webhook (#373 ) (#410 ) Authors a Catalyst Blueprint for the cert-manager DNS-01 external webhook backed by PowerDNS, for post-handover wildcard TLS issuance against the Sovereign's OWN PowerDNS — eliminating the last reachback to openova- controlled Dynadot credentials per ADR-0001 §9.4. Structure mirrors bp-cert-manager-dynadot-webhook (canonical seam): - platform/cert-manager-powerdns-webhook/blueprint.yaml — Blueprint CR with depends: [bp-cert-manager, bp-powerdns] - platform/cert-manager-powerdns-webhook/chart/Chart.yaml — wraps upstream zachomedia/cert-manager-webhook-pdns v2.5.5 (chart 3.2.5); declares the sigstore/common stub dep to satisfy the hollow-chart guard (#181) - chart/templates/ — 8 templates (Deployment, Service, APIService, RBAC, selfSigned/CA Issuer + serving Certificate, ServiceAccount, ClusterIssuer) - ClusterIssuer (letsencrypt-dns01-prod-powerdns) ships with the chart, paired with the webhook's solver. Gated behind clusterIssuer.enabled AND powerdns.host (skip-render pattern, lesson from #387 follow-up #402 — never use {{ fail }}) Bootstrap-kit slot: - clusters/_template/bootstrap-kit/36-bp-cert-manager-powerdns-webhook.yaml wires the HelmRelease to the per-Sovereign in-cluster PowerDNS endpoint (http://powerdns.powerdns:8081) and flips clusterIssuer.enabled=true. - ${SOVEREIGN_FQDN} envsubst keeps the slot operator-overridable per Inviolable Principle #4. Contabo bootstrap path does NOT include this template — contabo stays on legacy http01 + Traefik per ADR-0001 §9.4. Helm-template verification: helm template t platform/cert-manager-powerdns-webhook/chart/ → 14 resources, 0 ClusterIssuer (skip-render works) helm template t platform/cert-manager-powerdns-webhook/chart/ \ --set powerdns.host=http://powerdns.test:8081 \ --set clusterIssuer.enabled=true \ --set powerdns.apiKeySecretRef.name=fake → 15 resources incl. ClusterIssuer with PowerDNS solver config Both renders parse cleanly through python yaml.safe_load_all. Updates docs/omantel-handover-wbs.md §2 row 4 + §9 row #373 to chart-released. Sovereign-impact deferred to Phase 8 (handover E2E). Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:44:27 +04:00
e3mrah	a1bd550208	fix(charts): HTTPRoute templates skip-render on missing host (was failing default-values render) (#402 ) Blueprint-release for #401 failed because HTTPRoute templates use {{- fail }} when gateway.host is not set, which trips the chart default-values render gate in CI. Switched 6 templates from 'fail loud' to 'skip render': if .Values.gateway.host → emit HTTPRoute else → emit nothing The Gateway API admission already rejects HTTPRoute with empty hostnames, so the loud-fail wasn't buying anything an operator wouldn't see at apply time. Default-values render now produces zero HTTPRoute resources, which is the correct shape for the upstream chart consumers that don't set the Sovereign-only gateway block. Files: keycloak, gitea, openbao, grafana, harbor, catalyst-platform. Verified: helm template t products/catalyst/chart/ → 0 HTTPRoutes (clean) helm template t products/catalyst/chart/ --set ingress.gateway.enabled=true --set ingress.hosts.console.host=console.test --set ingress.hosts.api.host=api.test → 2 HTTPRoutes Closes the blueprint-release failure on commit `abf01b6f`. Co-authored-by: hatiyildiz <hati@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:23:58 +04:00
e3mrah	abf01b6f21	feat(platform): Gateway API migration audit (#387 ) (#401 ) Migrates every minimal-Sovereign-set blueprint chart from networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute, replacing the legacy Traefik-on-Sovereigns assumption with the canonical Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2 correction note (#388). The single per-Sovereign Gateway is added as additional documents in the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml (NOT a new top-level slot), since Cilium owns the GatewayClass. It includes: - Certificate `sovereign-wildcard-tls` requesting `.${SOVEREIGN_FQDN}` from `letsencrypt-dns01-prod` (cert-manager + #373 webhook) - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's existing `templates/` directory): \| Blueprint \| Host pattern \| Backend port \| \|---------------------\|---------------------------------\|--------------\| \| bp-keycloak \| auth.<sov> \| 80 \| \| bp-gitea \| git.<sov> \| 3000 \| \| bp-openbao \| bao.<sov> \| 8200 \| \| bp-grafana \| grafana.<sov> \| 80 \| \| bp-harbor \| registry.<sov> \| 80 \| \| bp-powerdns \| pdns.<sov>/api (dual-mode) \| 8081 \| \| bp-catalyst-platform\| console.<sov>, api.<sov> \| 80, 8080 \| bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute (Sovereign) simultaneously — the per-Sovereign overlay sets `api.gateway.enabled=true` while leaving `api.enabled=true`. The Ingress object is harmless on Cilium clusters with no Traefik. This preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4. bp-harbor flips `expose.type` from `ingress` to `clusterIP` in platform/harbor/chart/values.yaml so the upstream chart no longer emits its own Ingress; the HTTPRoute is the sole HTTP exposure. TLS terminates at the Gateway (wildcard cert) rather than per-host Certificates inside the chart. bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by .helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml, which remain contabo-only legacy demo infra). The contabo path keeps serving console.openova.io/sovereign via Traefik unchanged. Bootstrap-kit slot updates (per-Sovereign hostname interpolation): - 08-openbao.yaml → gateway.host: bao.${SOVEREIGN_FQDN} - 09-keycloak.yaml → gateway.host: auth.${SOVEREIGN_FQDN} - 10-gitea.yaml → gateway.host: gitea.${SOVEREIGN_FQDN} - 11-powerdns.yaml → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true - 19-harbor.yaml → gateway.host: registry.${SOVEREIGN_FQDN} - 25-grafana.yaml → gateway.host: grafana.${SOVEREIGN_FQDN} Server-side dry-run validation against the live Cilium Gateway API CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway + Certificate apply cleanly via `kubectl apply --dry-run=server`. Contabo unaffected: clusters/contabo-mkt/ not modified. The legacy SME ingresses (console-nova, marketplace, admin, axon, talentmesh, stalwart, ...) continue to serve via Traefik as before. powerdns on contabo remains on the Ingress path (api.gateway.enabled defaults to false at the chart level). Closes #387. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 16:19:30 +04:00
e3mrah	eb92e0496b	feat(platform): add bp-newapi — multi-tenant LLM marketplace gateway (#394 ) (#396 ) Catalyst Blueprint wrapping the upstream NewAPI (github.com/Calcium-Ion/new-api, MIT) for Sovereign operators whose business model is reselling LLM access to their own customers. Backend-only mode: the OpenAI-compatible API at api.<host>/v1/* is customer-facing; the upstream's portal UI is disabled at ingress; Catalyst replaces it as the customer surface; NewAPI's admin UI at admin.<host> is exposed only to ops staff (IdP-gated). Compliance posture enforced at the blueprint layer: - Channel attestation gate (refuses to render if any enabled channel lacks verifiable provenance — in-cluster, commercial-contract, or byok) - Geographic AUP enforcement (sanctioned-region block on commercial- provider channels; US/EU export-control baseline) - BYOK isolation (request-scoped, never aggregated) - Reseller disclosure required - Audit log on bp-cnpg (metadata-only by default) ACME placeholder used throughout the README; replace with operator identity in per-Sovereign overlays at clusters/<sovereign>/bootstrap- kit/. Files: - platform/newapi/README.md (design doc + setup checklist) - platform/newapi/blueprint.yaml (Catalyst Blueprint CR) - platform/newapi/chart/{Chart.yaml,values.yaml} - platform/newapi/chart/templates/{_helpers.tpl,deployment.yaml, service.yaml,ingress.yaml,configmap.yaml,serviceaccount.yaml, networkpolicy.yaml} Closes design portion of #394. Co-authored-by: hatiyildiz <hatice@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:57:06 +04:00
e3mrah	05cb39c042	fix(bp-flux): catalyst-cluster-reconciler ClusterRoleBinding overlay (closes #338 ) (#393 ) PROBLEM ------- On Sovereign-1 (otech.omani.works, 2026-04-30) every HelmRelease that transitioned through pending-install/pending-upgrade got stuck because the helm-controller SA could not UPDATE its own helm-storage Secrets (sh.helm.release.v1.<name>.<n>) in flux-system. Symptom: secrets "sh.helm.release.v1.catalyst-platform.v1" is forbidden: User "system:serviceaccount:flux-system:helm-controller" cannot update resource "secrets" in API group "" in the namespace "flux-system" Runtime workaround on otech (added 2026-04-30): manual ClusterRoleBinding flux-system-helm-controller-admin → cluster-admin → flux-system/helm-controller. Tracked as the permanent fix in #338. FIX --- Add platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml — a Catalyst-managed ClusterRoleBinding (catalyst-cluster-reconciler) that binds cluster-admin to helm-controller AND kustomize-controller in .Values.catalyst.fluxNamespace (default flux-system). Independent from the upstream subchart's cluster-reconciler binding (different name, no ownership conflict), so if the upstream binding ever drifts again the overlay still holds the cluster correct. WHY cluster-admin (not narrower) -------------------------------- helm-controller installs arbitrary user-supplied Helm charts which can ship any K8s resource (CRDs, ClusterRoles, MutatingWebhookConfigurations, etc.). There is no narrower role that satisfies the full install path. The Flux project's own bootstrap install.yaml binds cluster-admin for the same reason (upstream default multitenancy.privileged=true). Multi-tenancy lockdown is a Sovereign Day-2 hardening choice tracked separately. NEVER-HARDCODE COMPLIANCE ------------------------- Per docs/INVIOLABLE-PRINCIPLES.md #4, the namespace is operator-overridable via .Values.catalyst.fluxNamespace. Default is flux-system because that's the canonical Catalyst install namespace (matches cloud-init's flux2 install.yaml + clusters/_template/bootstrap-kit/03-flux.yaml). VERSION ------- - bp-flux 1.1.2 → 1.1.3 (Chart.yaml + blueprint.yaml + 3 bootstrap-kit refs). - The flux2 subchart pin (2.14.1) is unchanged — version-pin replay test remains green (cloud-init v2.4.0 == subchart appVersion 2.4.0). VERIFICATION ------------ - platform/flux/chart/tests/version-pin-replay.sh — all 6 cases PASS. - platform/flux/chart/tests/observability-toggle.sh — all 3 cases PASS. - helm template renders the new ClusterRoleBinding with correct subjects (flux-system by default; verified --set catalyst.fluxNamespace=custom override path). - scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles. FILES ----- - platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml (new) - platform/flux/chart/Chart.yaml (1.1.2 → 1.1.3) - platform/flux/chart/values.yaml (catalyst.fluxNamespace default) - platform/flux/blueprint.yaml (1.1.2 → 1.1.3) - clusters/{_template,otech.omani.works,omantel.omani.works}/bootstrap-kit/03-flux.yaml (chart version) - docs/lessons-learned/helm-controller-rbac.md (permanent-fix note) - docs/omantel-handover-wbs.md (#338 status row) Refs: #43 #369 #338 Lesson: docs/lessons-learned/helm-controller-rbac.md Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>	2026-05-01 15:56:45 +04:00
e3mrah	b8d7a8b9cf	fix(bp-seaweedfs): disable global.enableSecurity to avoid fromToml on helm-controller v1.1.0 (#339 ) Upstream seaweedfs/seaweedfs templates/shared/security-configmap.yaml uses Helm template fromToml; helm-controller v1.1.0's bundled helm SDK (v3.x older than 3.13) doesn't define fromToml so the install fails: parse error at security-configmap.yaml:21: function fromToml not defined Setting global.seaweedfs.enableSecurity: false skips the entire template. Internal SeaweedFS API is cluster-IP only on Sovereign-1; chart-level security is acceptable to defer until helm-controller is bumped. Bumped 1.0.0 → 1.0.1. Unblocks the chain: bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn bp-seaweedfs. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 23:42:43 +04:00
e3mrah	9554be4a5e	fix(bp-external-secrets): gate ClusterSecretStore on CRD presence + drop delete-policy (#337 ) The chart's post-install hook was failing on otech.omani.works: failed post-install: unable to build kubernetes object for deleting hook bp-external-secrets/templates/clustersecretstore-vault-region1.yaml: resource mapping not found for kind ClusterSecretStore in version external-secrets.io/v1beta1 Two corrections: 1. Capabilities-gate the entire template — don't render unless the ClusterSecretStore CRD is registered (it ships in via the upstream ESO subchart but isn't live on first install) 2. Remove 'before-hook-creation' delete-policy (was the actual trigger for the 'deleting hook' failure path) Bumped 1.0.0 → 1.0.1. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 23:31:24 +04:00
e3mrah	5502d9aa48	feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159 ) (#291 ) Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer in bp-cert-manager by shipping the missing piece — a Go binary that satisfies cert-manager's external webhook contract (`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json. Architecture ============ * `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with pool-domain-manager and catalyst-dns). Encapsulates the api3.json transport, command builders, response decoding, and the safe read-modify-write semantics required to never accidentally wipe a zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2` variant is unexported. * `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook binary. Implements `Solver.Present` via the client's append-only `AddRecord` path and `Solver.CleanUp` via the read-modify-write `RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`) rejects challenges for unmanaged apexes BEFORE any Dynadot call. * `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm wrapper. Templates Deployment + Service + APIService + serving Certificate (CA chain via cert-manager Issuer self-signing) + RBAC + ServiceAccount. Mirrors the standard cert-manager external- webhook deployment shape. * `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the paired ClusterIssuer activates. The interim http01 issuer remains templated as the rollback path. Test results ============ core/pkg/dynadot-client — 7 tests PASS (race-clean) core/cmd/cert-manager-dynadot-... — 9 tests PASS (race-clean) Test coverage includes a Present/CleanUp round-trip against an httptest fixture that models Dynadot's zone state, an explicit unmanaged-domain rejection, a regression preserving a pre-existing CNAME across the DNS-01 round-trip (the zone-wipe defence), and a typed-error propagation test that surfaces `ErrInvalidToken` to cert-manager so the controller will retry. Helm template smoke render ========================== `helm template` against the new chart with default values yields 12 resources / 424 lines (APIService, Certificate, ClusterRoleBinding, Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The modified bp-cert-manager chart still renders both ClusterIssuers (`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default values; flipping `certManager.issuers.dns01.enabled=false` is the clean rollback. Smoke command (post-deploy) =========================== kubectl get apiservices.apiregistration.k8s.io \ v1alpha1.acme.dynadot.openova.io # Issue a *.<sovereign>.<pool> wildcard cert and watch the # Order/Challenge progress through cert-manager. CI == `.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the pool-domain-manager-build pattern (cosign keyless signing, SBOM attestation, GHCR push at `ghcr.io/openova-io/openova/cert-manager- dynadot-webhook:<sha>`). Triggered by changes to either the binary or the shared dynadot-client package. Closes #159 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:37:47 +04:00
e3mrah	c09109a61a	feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265 ) (#290 ) Edge + serverless + model-serving batch (W2.5.C) — three upstream- subchart umbrella Blueprints completing the bootstrap-kit slots for WebRTC media relay (bp-relay → bp-stunner) and the AI/ML serving stack (bp-cortex → bp-kserve → bp-knative). Each chart follows the canonical umbrella pattern from docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream chart under `dependencies:` so `helm dependency build` bundles the upstream payload into the OCI artifact, and Catalyst-curated overlay values + templates sit alongside in chart/values.yaml + chart/templates/. Per-chart highlights: - bp-stunner/1.0.0 — wraps stunner/stunner-gateway-operator 1.1.0. Ships a Cilium-native GatewayClass (Capabilities-gated on gateway.networking.k8s.io/v1) so bp-relay (LiveKit / SFU) can claim Gateway CRs without an operator-ordering dance. Default UDP TURN port range 30000-32767 matches the range opened at the Sovereign edge firewall (Crossplane bp-firewall composition). - bp-knative/1.0.0 — wraps knative-operator v1.21.1. Ships a KnativeServing CR pre-configured for istio-less mode (ingress.istio.enabled=false, ingress.contour.enabled=false, ingress.kourier.enabled=false; config.network.ingress-class=cilium). Sovereign FQDN sourced from values, no hardcoded fallback per inviolable principle #4 — render fails loudly if cluster overlay doesn't set knativeOverlay.knativeServing.sovereignFqdn. - bp-kserve/1.0.0 — wraps kserve/kserve v0.16.0 (latest version published on the official OCI registry as of 2026-04-30). Default deploymentMode=RawDeployment (no Knative hop on the hot path) but bp-knative is still installed (declared as a hard dep) so per-IS annotation `serving.kserve.io/deploymentMode: Serverless` opts in to scale-to-zero per tenant. Cilium native Gateway-API ingress (enableGatewayApi=true, className=cilium, disableIstioVirtualHost= true). Observability discipline (issue #182): every observability toggle (ServiceMonitor, HPA, GatewayClass) defaults false and is operator- tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles. Each chart ships tests/observability-toggle.sh covering default-off, opt-in (with `--api-versions monitoring.coreos.com/v1` to simulate Prometheus Operator CRDs), and explicit-off cases. Per-chart kind summary (helm template default render): bp-stunner: ClusterRole, ClusterRoleBinding, ConfigMap, Dataplane, Deployment, Role, RoleBinding, Service, ServiceAccount. (+ GatewayClass when --api-versions gateway.networking.k8s.io/v1 is passed.) bp-knative: ClusterRole, ClusterRoleBinding, ConfigMap, CustomResourceDefinition, Deployment, KnativeServing, Role, RoleBinding, Secret, Service, ServiceAccount. bp-kserve: Certificate, ClusterRole, ClusterRoleBinding, ClusterServingRuntime, ClusterStorageContainer, ConfigMap, Deployment, Gateway, Issuer, MutatingWebhookConfiguration, Role, RoleBinding, Service, ServiceAccount, ValidatingWebhookConfiguration. `helm lint` clean for all three (single INFO on missing icon — icons land with marketplace card work). `bash tests/observability-toggle.sh` green for all three (3 cases each: default-off, opt-in, explicit-off). Closes #263 #264 #265 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:37:38 +04:00
e3mrah	782d8015c5	feat(charts): bp-openmeter (CH-less) + bp-livekit + bp-matrix wrapper charts (closes #272 #273 #274 ) (#289 ) W2.5.F — three Catalyst Blueprint umbrella charts at platform/{openmeter, livekit,matrix}/, each declaring its upstream chart under Chart.yaml `dependencies:` so `helm dependency build` bundles the upstream payload into the published OCI artifact (per docs/BLUEPRINT-AUTHORING.md §11.1 — hollow charts forbidden, CI-enforced by issue #181). Per-chart kind summary ====================== bp-openmeter (closes #272) default `helm template` kinds: ConfigMap, Deployment, Service, ServiceAccount upstream chart: openmeter 1.0.0-beta.213 (oci://ghcr.io/openmeterio/helm-charts) ClickHouse-less profile per docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §6.4. The upstream chart's bundled clickhouse / kafka / postgresql / redis / svix subcharts are all DISABLED — Catalyst supplies CNPG (postgres), JetStream (event bus), and Valkey (redis-compat) at the platform tier. Chart-level toggle `catalystBlueprint.backend.kind` (default `cnpg`, alt `clickhouse`) records the active profile so observability/audit pipelines can report it. The OpenMeter binary's `aggregation.clickhouse.address` is left blank — per-Sovereign overlay supplies it once a host cluster adds bp-clickhouse and the operator re-rolls with `backend.kind: clickhouse`. Catalyst overlay templates (NetworkPolicy / ServiceMonitor / HPA) all default OFF per docs/BLUEPRINT-AUTHORING.md §11.2. bp-livekit (closes #273) default `helm template` kinds: ConfigMap, Deployment, Service, ServiceAccount upstream chart: livekit-server 1.9.0 (https://helm.livekit.io) WebRTC SFU. Powers the Huawei iFlytek voice demo. Catalyst defaults pair LiveKit with bp-stunner (the upstream chart's bundled co-located TURN server is OFF; per-Sovereign overlay points the LiveKit TURN config at the stunner UDP-gateway Service). RTC UDP port range is 50000-60000 (matches the Hetzner firewall rule the per-Sovereign overlay opens). Catalyst overlay templates (NetworkPolicy / ServiceMonitor / HPA) all default OFF; the chart's NetworkPolicy template documents that LiveKit's hostNetwork mode means pod-level policies do NOT cover the SFU port range — the firewall rule is the load-bearing control. blueprint.yaml `depends:` declares bp-stunner + bp-cert-manager + bp-valkey. bp-matrix (closes #274) default `helm template` kinds: ConfigMap, Deployment, Ingress, Job, PersistentVolumeClaim, Pod, Role, RoleBinding, Secret, Service, ServiceAccount upstream chart: matrix-synapse 3.12.25 (https://ananace.gitlab.io/charts) Synapse (the Matrix server implementation, NOT the retired OpenOva product noun). Federation OFF by default (Catalyst per-Sovereign tenancy default — operator overlays flip it on per-Organization). Postgres backend via bp-cnpg externalPostgresql; OIDC SSO via bp-keycloak; bundled bitnami postgresql + redis subcharts both disabled. Catalyst overlay NetworkPolicy gates the federation port (8448) on `federation.enabled` — verified by Case 5 of the observability-toggle test. Catalyst-overlay ServiceMonitor (upstream chart has none) + HPA both default OFF. Lint ==== All three charts pass `helm lint` clean (only the noisy "icon is recommended" INFO message). Observability tests =================== Each chart's `tests/observability-toggle.sh` enforces the Catalyst contract from docs/BLUEPRINT-AUTHORING.md §11.2: Case 1: default render produces zero monitoring.coreos.com/v1 resources (no ServiceMonitor / PrometheusRule). Case 2: opt-in (--set serviceMonitor.enabled=true --api-versions monitoring.coreos.com/v1) renders a ServiceMonitor. Case 3: explicit-off render is clean. Case 4 (per chart): - openmeter: ClickHouse-less profile asserts no clickhouse.altinity.com / Kafka subchart resources leak into the default render. - livekit: asserts upstream livekit-server.serviceMonitor.create defaults false. - matrix: asserts default render carries an empty federation_domain_whitelist (the per-Sovereign tenancy default). Case 5 (matrix only): `--set federation.enabled=true networkPolicy .enabled=true` opens port 8448 in the Catalyst NetworkPolicy. All gates green for all three charts. Closes #272 #273 #274 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 19:37:28 +04:00
e3mrah	87d9a4afa7	feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271 ) (#288 ) W2.5.E batch — three Application-tier Blueprints completing the LLM serving / workflow stack: - bp-temporal/1.0.0 — wraps temporal/temporal 1.2.0 (the new chart rewrite that removed cassandra:/mysql:/postgresql:/elasticsearch:/ prometheus:/grafana: top-level keys in favour of server.config.persistence.datastores). Postgres-only via CNPG-backed visibility store (skip Cassandra). Web UI ON. Keycloak OIDC integration via --auth-claim-mapper renders auth.yaml ConfigMap (operator wires via additionalVolumes once bp-keycloak is reconciled, default OFF). dependsOn: bp-cnpg + bp-cert-manager. Closes #271. Kinds: Cluster (CNPG) + ConfigMap + Deployment + Job + Pod + Service. - bp-llm-gateway/1.0.0 — wraps berriai/litellm-helm 0.1.572 from OCI. Subscription-aware proxy for Claude Code: routes to Anthropic (via operator OAuth/Max subscription — NEVER an ANTHROPIC_API_KEY, per memory/feedback_no_api_key.md), Bedrock, Vertex, OpenAI-compatible (via bp-anthropic-adapter), and self-hosted vLLM. CNPG-backed audit log (every prompt + response persisted for compliance). Bundled bitnami postgresql + redis subcharts DISABLED (db.useExisting=true points at the CNPG cluster). Keycloak SSO via auth.yaml ConfigMap (default OFF). ExternalSecret-backed environmentSecrets brings tokens / IAM creds in without inlining plaintext. dependsOn: bp-cnpg + bp-keycloak + bp-external-secrets. Closes #267. Kinds: Cluster (CNPG audit) + ConfigMap + Deployment + Job + Pod + Secret + Service + ServiceAccount. - bp-anthropic-adapter/1.0.0 — Catalyst-authored scratch chart for the OpenAI ↔ Anthropic translation Go service. SHA-pinned image ghcr.io/openova-io/openova/anthropic-adapter:<sha> (Inviolable Principle #4a — GitHub Actions is the only build path; empty default tag fails the render with a clear error instead of silently shipping :latest). OAuth/Max subscription token mounted from K8s Secret materialized by ESO from bp-openbao — ANTHROPIC_OAUTH_TOKEN env var, NEVER an ANTHROPIC_API_KEY. Includes OpenAI → Anthropic model-mapping ConfigMap (gpt-4 → claude-3-5-sonnet, gpt-4o-mini → claude-3-5-haiku, etc.). sigstore/common library subchart included to satisfy the hollow-chart gate (matches bp-vllm pattern from #283). dependsOn: bp-external-secrets. Closes #268. Kinds: ConfigMap + Deployment + Service + ServiceAccount. CRITICAL — bp-llm-gateway and bp-anthropic-adapter both consume the operator's Claude OAuth/Max subscription. Per memory/ feedback_no_api_key.md and the user's standing instruction, neither chart accepts or generates an ANTHROPIC_API_KEY. Tokens flow exclusively through ExternalSecret-managed K8s Secrets that ESO materializes from bp-openbao at install time. Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every observability toggle defaults `false` (ServiceMonitor / metrics sidecar / PodMonitor) and is operator-tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles. Each chart ships tests/observability-toggle.sh covering default-off, opt-in (with --api-versions monitoring.coreos.com/v1 to simulate the CRDs), and explicit-off cases. bp-anthropic-adapter additionally tests the never-:latest gate via Case 4 (empty image tag must fail render). Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every upstream version, namespace, server URL, role, secret name, model default, and toggle is exposed under values.yaml. Cluster overlays in clusters/<sovereign>/ may override without rebuilding the Blueprint OCI artifact. Per docs/BLUEPRINT-AUTHORING.md §11.1 (umbrella shape — hard contract): bp-temporal and bp-llm-gateway declare their upstream charts under Chart.yaml dependencies: so helm dependency build bundles the upstream payload into the OCI artifact. bp-anthropic- adapter is a scratch chart (no upstream Helm chart exists) and includes sigstore/common as the obligatory hollow-chart-gate dependency, matching the bp-vllm precedent from W2.5.D (#283). Closes #267 Closes #268 Closes #271 helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 19:37:19 +04:00
e3mrah	a6bf07b0ce	feat(charts): bp-librechat wrapper chart (closes #275 ) (#287 ) W2.5.G — Catalyst-authored scratch chart for LibreChat (slot 48 of the omantel-1 bootstrap-kit). LibreChat upstream does not publish a Helm chart, so this chart hand-wires the official ghcr.io/danny-avila/librechat container as Deployment + Service + Ingress + ConfigMap + ServiceAccount + NetworkPolicy + ServiceMonitor + HPA, with the sigstore/common library subchart declared to satisfy the hollow-chart gate (issue #181). Per docs/BLUEPRINT-AUTHORING.md §11.2: every observability toggle (serviceMonitor, hpa) defaults false; opt-in via per-cluster overlay once kube-prometheus-stack reconciles. The ServiceMonitor template is double-gated by .Values.serviceMonitor.enabled AND Capabilities.APIVersions.Has "monitoring.coreos.com/v1" so flipping the toggle on a too-early Sovereign cannot break the bp-librechat reconcile. Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every endpoint URL, model name, secret reference, namespace selector, and image tag is operator-tunable via values.yaml. The Sovereign FQDN, Keycloak issuer, llm-gateway URL, embeddings URL, and TLS ClusterIssuer are all operator-supplied at install time. The image tag is pinned to v0.7.5 (no :latest). Connectors: - Chat completions: bp-llm-gateway (OpenAI-compatible /v1/chat/completions) exposed as a "custom" endpoint named "Catalyst LLM" - Embeddings (RAG): bp-bge — provider=bge maps to EMBEDDINGS_PROVIDER=openai + RAG_OPENAI_BASEURL=<bge.svc> at template-render time - SSO: bp-keycloak (OpenID Connect) — issuer/clientId from values, client secret + session secret from ExternalSecret - Conversation store: FerretDB on bp-cnpg (MongoDB wire protocol over Postgres) — operator-supplied connection URI Hosted at chat-app.<sovereign-fqdn>; the chart `fail`s render if ingress.host is empty (no platform-wide default). helm template (default values, --set ingress.host=...): ConfigMap, Deployment, Ingress, NetworkPolicy, Service, ServiceAccount helm template (--set hpa.enabled=true serviceMonitor.enabled=true --api-versions monitoring.coreos.com/v1): ConfigMap, Deployment, HorizontalPodAutoscaler, Ingress, NetworkPolicy, Service, ServiceAccount, ServiceMonitor helm lint: 1 chart(s) linted, 0 chart(s) failed (single INFO on missing icon — icons land with the marketplace card work). tests/observability-toggle.sh: PASS on default-off, opt-in (--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and explicit-off cases. Path isolation: only platform/librechat/ — no HR slot files, blueprint-release.yaml, or other charts touched. The HR slot files (clusters/.../48-librechat.yaml) and blueprint-release.yaml will land in a separate slot-wiring PR per the W2.K4 expansion plan. Closes #275 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:56:59 +04:00
e3mrah	9dc8506dd9	feat(charts): bp-external-secrets + bp-cnpg + bp-valkey wrapper charts (#285 ) Storage-substrate batch (W2.5.A) — closes #254 by shipping the three upstream-subchart umbrella Blueprints that the Flux HRs at clusters/_template/bootstrap-kit/{15-external-secrets,16-cnpg,17-valkey} .yaml (merged via PR #262) target. Each chart follows the canonical umbrella pattern documented in docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream chart under `dependencies:` so `helm dependency build` bundles the upstream payload into the OCI artifact, and Catalyst-curated overlay values + templates sit alongside in chart/values.yaml + chart/templates/. Per-chart highlights: - bp-external-secrets/1.0.0 — wraps external-secrets/external-secrets 0.10.7. Ships a default `vault-region1` ClusterSecretStore (via Helm post-install/post-upgrade hook to defer the CR application until the upstream chart's CRDs are registered) wired to the in-cluster bp-openbao service. clusterSecretStore.enabled toggle lets cluster overlays opt out and author their own multi-region CRs. - bp-cnpg/1.0.0 — wraps cnpg/cloudnative-pg 0.28.0. Operator-only surface (Cluster CRs are per-Application). CRDs ship in-chart so bp-powerdns / bp-keycloak / bp-gitea / bp-langfuse / bp-grafana / bp-temporal / bp-matrix / bp-llm-gateway / bp-bge / bp-nemo-guardrails / bp-openmeter / pool-domain-manager can `dependsOn: bp-cnpg` via Flux — closing #254 (bp-powerdns CreateContainerConfigError on pdns-pg-app secret). - bp-valkey/1.0.0 — wraps bitnami/valkey 5.5.1. BSD-3 Redis-compatible cache, replication architecture, password auth ON, NetworkPolicy ON, replicas 0 by default for solo Sovereigns (cluster overlays bump for HA). Application-tier cache only — Catalyst control plane uses NATS JetStream KV (per ARCHITECTURE.md §5). Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every observability toggle defaults `false` (ServiceMonitor / PodMonitor / PrometheusRule / metrics sidecar) and is operator-tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles. Each chart ships tests/observability-toggle.sh covering default-off, opt-in (--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and explicit-off cases. Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every upstream version, namespace, server URL, role, and password toggle is exposed under values.yaml. Cluster overlays in clusters/<sovereign>/ may override without rebuilding the Blueprint OCI artifact. helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only) helm template default render kinds: bp-external-secrets: ClusterRole, ClusterRoleBinding, ClusterSecretStore, CustomResourceDefinition, Deployment, Role, RoleBinding, Secret, Service, ServiceAccount, ValidatingWebhookConfiguration bp-cnpg: ClusterRole, ClusterRoleBinding, ConfigMap, CustomResourceDefinition, Deployment, MutatingWebhookConfiguration, Service, ServiceAccount, ValidatingWebhookConfiguration bp-valkey: ConfigMap, NetworkPolicy, PodDisruptionBudget, Secret, Service, ServiceAccount, StatefulSet Closes #254 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 18:39:29 +04:00
e3mrah	ba2ff05292	feat(charts): bp-seaweedfs + bp-harbor + bp-vpa wrapper charts (#284 ) W2.5.B — first authoring of the three Catalyst Blueprint wrapper charts that fill bootstrap-kit slots 18 (seaweedfs), 19 (harbor) and 29 (vpa). Each wraps an upstream chart as a Helm subchart and ships Catalyst- curated overlay templates (NetworkPolicy + ServiceMonitor) gated behind opt-in toggles, per docs/BLUEPRINT-AUTHORING.md §11 and docs/INVIOLABLE-PRINCIPLES.md. bp-seaweedfs (slot 18 — storage foundation) - Wraps seaweedfs/seaweedfs 4.22.0; Chart name `bp-seaweedfs`. - Catalyst defaults: 1 master + 3 volume + 1 filer + 2 s3 replicas. - S3 API on 8333 — single S3 surface every consumer talks to per docs/PLATFORM-TECH-STACK.md §3.5 (no per-app MinIO). - Overlay templates: NetworkPolicy (cross-namespace S3 reachability, cold-tier egress allowlist), ServiceMonitor (Capabilities-gated, DEFAULT FALSE per §11.2). - Default helm template kinds: ClusterRole, ClusterRoleBinding, ConfigMap, Deployment, Secret, Service, ServiceAccount, StatefulSet. bp-harbor (slot 19 — per-Sovereign OCI registry) - Wraps goharbor/harbor 1.18.3 (appVersion 2.14.3); Chart name `bp-harbor`. - Catalyst defaults: blob backend = SeaweedFS S3 (regionendpoint seaweedfs-s3.seaweedfs.svc:8333), metadata DB = bp-cnpg external Postgres, ingress class `cilium`, expose.tls.enabled true (cert- manager-issued Secret). - Overlay templates: NetworkPolicy (CNPG/SeaweedFS/Keycloak egress), ServiceMonitor (Capabilities-gated, DEFAULT FALSE). - Trivy + SSO + pull-mirror are operator-flag opt-ins per per- Sovereign overlay (default false; trivy/keycloak/cnpg deps land on later slots). - Default helm template kinds: ConfigMap, Deployment, Ingress, PersistentVolumeClaim, Secret, Service, StatefulSet. bp-vpa (slot 29 — vertical autoscaling) - Wraps cowboysysop/vertical-pod-autoscaler 11.1.1 (appVersion 1.5.0); Chart name `bp-vpa`. - Catalyst defaults: 1 replica each of recommender + updater + admission-controller. Default mode `Off` (recommend only). - Admission webhook self-signs via init Job (cluster-internal); per- Sovereign overlay MAY swap to cert-manager. - Overlay templates: NetworkPolicy (apiserver + metrics-server egress, admission webhook ingress). - Upstream metrics.serviceMonitor / metrics.prometheusRule defaulted false per §11.2. - Default helm template kinds: ClusterRole, ClusterRoleBinding, ConfigMap, Deployment, Job, Pod, Secret, Service, ServiceAccount. Lint + observability-toggle results helm lint: 1 chart(s) linted, 0 chart(s) failed (each) tests/observability-toggle.sh: PASS on all three (default render has zero monitoring.coreos.com/v1 references; opt-in render produces a ServiceMonitor; explicit-off render is clean). Path isolation: only platform/seaweedfs/, platform/harbor/, and platform/vpa/ — no HR slot files or other charts touched. Refs: bootstrap-kit slots 18, 19, 29 reconcile against ghcr.io/openova-io/bp-seaweedfs:1.0.0, bp-harbor:1.0.0, bp-vpa:1.0.0 which this commit produces on next blueprint-release CI run. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 18:37:50 +04:00
e3mrah	c3c9c0cf27	feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283 ) Catalyst-authored umbrella charts for the W2.5.D AI-inference stack. None of the three upstream projects publish a Helm chart, so each chart hand-wires the upstream container as Deployment + Service + ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the sigstore/common library subchart declared to satisfy the hollow-chart gate (issue #181). bp-vllm (slot 39) — wraps vllm/vllm-openai:v0.6.4. GPU-aware (nvidia.com/gpu when vllm.gpu.enabled=true; CPU fallback for dev). Default model meta-llama/Llama-3.1-8B-Instruct, port 8000, OpenAI-compatible /v1/chat/completions. All engine knobs (maxModelLen, gpuMemoryUtilization, dtype, quantization, tensorParallelSize, prefix-caching) overlay-tunable. Closes #266. bp-bge (slot 42) — wraps ghcr.io/huggingface/text-embeddings-inference:cpu-1.5. Default model BAAI/bge-small-en-v1.5 + BAAI/bge-reranker-base sidecar in same Pod. Two-port Service (8080 embed, 8081 rerank) annotated for bp-llm-gateway discovery. CPU-friendly defaults; overlay swaps in BAAI/bge-m3 on GPU Sovereigns. Closes #269. bp-nemo-guardrails (slot 43) — wraps the upstream NVIDIA/NeMo-Guardrails Dockerfile (nemoguardrails server, FastAPI, port 8000). LLM endpoint + model + engine all overlay-tunable; Colang flow bundle mounts via configMap.externalName for production rails. ConfigMap stub renders a default rail for smoke testing. Closes #270. All three charts: - Default observability toggles to false per BLUEPRINT-AUTHORING.md §11.2 - Pin upstream image tags (no :latest) per INVIOLABLE-PRINCIPLES.md #4 - Non-root securityContext (runAsUser 1000, drop ALL capabilities) - prometheus.io scrape annotations on the Pod for fallback discovery - Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and egress to HuggingFace / bp-vllm / bp-bge as appropriate helm template (default values) per chart: bp-vllm: ConfigMap, Deployment, Service, ServiceAccount bp-bge: ConfigMap, Deployment, Service, ServiceAccount bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount helm template (--set serviceMonitor.enabled=true networkPolicy.enabled=true hpa.enabled=true): All three render ConfigMap + Deployment + Service + ServiceAccount + ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler. helm lint: 0 chart(s) failed for all three (single INFO on missing icon — icons land with the marketplace card work). Closes #266 Closes #269 Closes #270 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:37:07 +04:00
e3mrah	0cfd0defa9	fix(bp-langfuse): drop apostrophe from description to clear GHCR 500 (resolves #215 ) (#278 ) Root cause: Helm's `helm push` collapses the chart `description` field into a single-line OCI manifest annotation `org.opencontainers.image.description`. The GHCR manifest-PUT validator returns a deterministic 500 Internal Server Error when that annotation is long AND contains an ASCII apostrophe. bp-langfuse 1.0.0 was the only chart in the observability batch (PR #214) carrying both characteristics, so it was the only one that failed to publish. Fix: reword the affected sentence from "Langfuse's persistent state" to "the Langfuse persistent state" — drops the apostrophe, preserves the meaning, and crucially preserves every byte of the actual chart payload (values, templates, all 350 entries of the upstream langfuse-1.5.28 subchart with its 4-level-deep Bitnami vendoring). No runtime behavioural change; helm template renders the exact same 6 resources across 490 lines. The narrowing was done by progressively reducing the Chart.yaml from the failing version to a passing version while pushing to a scratch GHCR namespace, with the bp-langfuse repo deleted between attempts (verified via `DELETE /orgs/openova-io/packages/container/bp-langfuse` and re-querying). The trigger is reproducible: long description + apostrophe → 500; long description without apostrophe → push succeeds; short description with apostrophe → push succeeds. Added a multi-line WARNING comment immediately above `description:` documenting the trigger so future authors do not reintroduce a possessive form. Issue #215 captures the full reproduction. Closes #215 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 17:31:51 +04:00
e3mrah	ec3821f7e1	fix(bp-): event-driven HR install -- drop blanket timeout, use disableWait (#250 ) Helm install completes when manifests apply, not when pods reach Ready. Flux dependsOn checks Ready=True on each HR independently, so spec.install.disableWait + spec.upgrade.disableWait is the correct shape for slow-Ready workloads. Blanket spec.timeout: Nm watchdogs from PR #221 were a band-aid that caused cascading HR failures and blocked downstream HRs (bp-nats-jetstream, bp-openbao depended on bp-spire). Founder direction (verbatim): "always event driven robust jobs" Per-HR audit (drop spec.timeout: 15m, add disableWait, with reason): - bp-cilium: envoyconfig CRD self-wait — agent crash-loops until its own CRDs land - bp-cert-manager: webhook readiness depends on cainjector mutating Secret — multi-minute on cold start - bp-flux: adopts cloud-init Flux objects; the helm-controller reconciling THIS HR is itself a chart target — Ready deadlock without disableWait - bp-sealed-secrets: single-replica controller + CRD — install completes on manifest apply - bp-spire: spire-controller-manager waits for CRD informer cache sync — multi-minute legitimate path; chart fix below - bp-nats-jetstream: JetStream raft quorum formation across N replicas - bp-openbao: 3-node Raft sealed-by-default; Ready=True only after operator runs `bao operator init` unseal flow - bp-keycloak: DB schema migration + 100+ Liquibase changesets on first install - bp-gitea: PostgreSQL DB init + admin user + Blueprint catalog mirror seeding - bp-external-dns: pod readiness depends on PowerDNS API + pdns-pg CNPG cascade - bp-catalyst-platform: ~10 services, inter-service NATS/OTel readiness is not Helm's concern Intentionally NOT touched (other parallel agents own these): - bp-crossplane (Agent A): chart split for intra-chart CRD-ordering - bp-powerdns (Agent D): post-install hook for intra-chart Job-ordering bp-spire chart fix (1.1.3 -> 1.1.4): Root cause investigation on otech.omani.works (live): spire-controller-manager has restarted 37 times with: "failed to wait for clusterstaticentry caches to sync: timed out waiting for cache to be synced for Kind v1alpha1.ClusterStaticEntry" `kubectl get crd \| grep spire` returns nothing — the spire.spiffe.io v1alpha1 CRDs (ClusterSPIFFEID / ClusterStaticEntry / ClusterFederatedTrustDomain) are NOT registered. The upstream `spire` chart does not install its own CRDs; the spiffe maintainers ship them via the SEPARATE `spire-crds` chart, expected to be installed first. Fix: platform/spire/chart/Chart.yaml now declares spire-crds 0.5.0 as the FIRST dependency. Helm installs subcharts in dependency order, so listing spire-crds first guarantees CRDs are applied before the spire subchart's controller-manager Deployment starts. blueprint.yaml + both 06-spire.yaml cluster references bumped to 1.1.4. Live error this fixes (otech.omani.works, persistent ~5h): Helm upgrade failed for release spire-system/spire with chart bp-spire@1.1.3: context deadline exceeded + downstream cascade: bp-nats-jetstream / bp-openbao stuck at "dependency 'flux-system/bp-spire' is not ready" Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:55:19 +04:00
e3mrah	726af6df81	fix(bp-powerdns): self-generate api-credentials Secret + disable upstream zone-bootstrap Job (#248 ) Root cause investigation on otech.omani.works (kubectl, sanitized): $ kubectl get pods -n powerdns create-zone-if-not-exist-sh-tjtr4 0/1 CreateContainerConfigError 4h powerdns-57d7d49f99-{9hrb4,lxlgt,nkmht} 0/1 CreateContainerConfigError 4h dnsdist-594dbfc5f-wznsw 1/1 Running 4h $ kubectl get secrets -n powerdns powerdns Opaque 1 4h powerdns-api-tls-8kxpx Opaque 1 4h (NO `powerdns-api-credentials`, NO `pdns-pg-app`) $ kubectl describe pod ... powerdns-57d7d49f99-9hrb4 Environment: PDNS_API_KEY: <set to the key 'api-key' in secret 'powerdns-api-credentials'> Optional: false PDNS_DB_HOST: <set to the key 'host' in secret 'pdns-pg-app'> Optional: false State: Waiting Reason: CreateContainerConfigError The handover's chicken-egg-with-secret theory was directionally right but the cause was more fundamental: 1. Wrapper chart's api-credentials-secret.yaml (1.1.2) was a no-op unless operator set `apiKey` value out-of-band — comment said the deployment would "fail to start until the named Secret exists" as "the explicit signal we want". On a Sovereign that bootstraps from bp-* OCI artifacts, no operator is standing by, so the Secret is never created and pods sit in CreateContainerConfigError forever. 2. The upstream chart's `create-zone-if-not-exists-sh` Job is rendered whenever both `zoneName` and `api.key` are set — defaulting `zoneName: "example.de."` it ALWAYS rendered and ALWAYS failed (same missing Secret). Catalyst doesn't want this Job at all because zones are loaded later by pool-domain-manager (PDM). 3. The chart's CNPG Cluster template is gated behind Capabilities.APIVersions.Has "postgresql.cnpg.io/v1" — on a fresh Sovereign without bp-cnpg yet (bp-cnpg is on the roadmap, not in bootstrap-kit), no Cluster is rendered and `pdns-pg-app` Secret never materialises. With Helm `--wait`, install times out ("context deadline exceeded") even though the manifests applied cleanly. Fix: * api-credentials-secret.yaml: self-generate via Helm `lookup` + `randAlphaNum 32`. First install creates fresh randoms; every subsequent reconcile reads back the existing values from the Secret so the API key never rotates on upgrade. Operator can still pin specific values via .Values.powerdns.apiKey / .Values.powerdns.webserverPassword, or skip Secret creation entirely via .Values.powerdns.useExistingApiSecret. Same pattern as bitnami/postgresql, bitnami/keycloak. * values.yaml: set `powerdns.zoneName: ""` so upstream chart's `{{- if and .Values.powerdns.zoneName .Values.powerdns.api.key }}` gate skips the create-zone Job entirely. Catalyst's PDM creates zones via the REST API after the cluster comes up; we don't want a placeholder `example.de.` zone in production. * HelmRelease (both _template and otech.omani.works overlays): `install.disableWait: true` + `upgrade.disableWait: true` so the HelmRelease reports Ready as soon as manifests apply cleanly, rather than gating on powerdns Deployment readiness which depends on bp-cnpg landing first to synthesise `pdns-pg-app`. Runtime convergence is observed via kubectl, not gated on Helm. Live error this addresses: Helm upgrade failed for release powerdns/powerdns with chart bp-powerdns@1.1.2: context deadline exceeded Verified locally with `helm template`: - powerdns-api-credentials Secret renders with random api-key + webserver-password - create-zone-if-not-exist-sh Job no longer rendered - Deployment env continues to reference powerdns-api-credentials correctly Bumped 1.1.2 -> 1.1.3 (chart, blueprint, both bootstrap-kit overlays). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:55:12 +04:00
e3mrah	2d1799d738	fix(bp-crossplane): split XRDs+Compositions into bp-crossplane-claims (#247 ) Resolves install ordering on fresh clusters where the apiserver rejects CompositeResourceDefinition CRs because the apiextensions.crossplane.io CRDs registered by the crossplane subchart aren't live yet at apply time. - bp-crossplane bumped 1.1.2 -> 1.1.3 (controller-only payload) - NEW bp-crossplane-claims@1.0.0 carries XRDs + Compositions - Flux HelmRelease for crossplane-claims uses dependsOn: [bp-crossplane] - composition-validate.sh + fixtures relocate to the new chart - blueprint-release CI: opt-out annotation catalyst.openova.io/no-upstream=true permits zero-deps charts that legitimately ship only Catalyst-authored CRs (the original hollow-chart rule remains in force for every other umbrella chart) Live error this fixes (from otech.omani.works): no matches for kind "CompositeResourceDefinition" in version "apiextensions.crossplane.io/v1" -- ensure CRDs are installed first Pattern: intra-chart CRD-ordering breaks -> split charts + Flux dependsOn. Apply universally to similar cases going forward. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:55:05 +04:00
e3mrah	f658757962	fix(bp-crossplane): resolve CHART_DIR to absolute path in composition-validate.sh (#237 ) CI invokes the script as `bash <script> "platform/crossplane/chart"` from the repo root. The script then `cd`s into that relative path, which works, but every later `"$CHART_DIR/<sub>"` reference (notably FIXTURE_DIR for Case 6) inherits the now-stale relative prefix and resolves under the wrong cwd. Fix: resolve CHART_DIR via `(cd ... && pwd)` to an absolute path BEFORE the chdir. Local repro before fix: $ bash platform/crossplane/chart/tests/composition-validate.sh \ platform/crossplane/chart ... Case 6: every fixture XRC kind is matched by an XRD FAIL: fixtures dir platform/crossplane/chart/tests/fixtures missing Local result after fix: $ bash platform/crossplane/chart/tests/composition-validate.sh \ platform/crossplane/chart ... Case 6: every fixture XRC kind is matched by an XRD PASS All bp-crossplane Day-2 CRUD Composition gates green. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 09:36:07 +02:00
e3mrah	8592d20919	feat(bp-crossplane): 6 XRDs + Compositions for Day-2 CRUD (RegionClaim/ClusterClaim/NodePoolClaim/LoadBalancerClaim/PeeringClaim/NodeActionClaim) (#236 ) Adds the 6 CompositeResourceDefinitions and matching Compositions that back the catalyst-api Day-2 CRUD endpoints. catalyst-api writes XRCs of these kinds; Crossplane materialises them into provider-hcloud (and a small number of provider-kubernetes) managed resources. Per docs/INVIOLABLE-PRINCIPLES.md #3, every cloud-side op flows through provider-hcloud — never bespoke hcloud-go calls or shell-outs to the hcloud CLI. XRDs (canonical group: compose.openova.io/v1alpha1): - RegionClaim → composes the Phase-0 quartet via provider-hcloud: Network + NetworkSubnet + Firewall + Server (cp1) + LoadBalancer + LoadBalancerNetwork + LoadBalancerService×2 + LoadBalancerTarget. Mirrors infra/hetzner/main.tf 1:1 so deletion of a RegionClaim cascades the whole slice. - ClusterClaim → composes a provider-kubernetes Object that materialises a cluster-identity ConfigMap. The catalyst-environment-controller reads the CM to template per-server cloud-init. - NodePoolClaim → composes up to 100 provider-hcloud Server resources. UPDATE flow: patching replicas n→m flips the per-index Required-policy gate so Crossplane creates/deletes Server CRs. - LoadBalancerClaim → composes provider-hcloud LoadBalancer + LoadBalancerNetwork + up to 50 LoadBalancerService entries (per listener) + up to 50 LoadBalancerTarget entries. UPDATE: patch listeners[]/targets[] → composite controller adds/removes services/targets. - PeeringClaim → composes 1 or 2 provider-hcloud Route resources (bidirectional flag toggles the second one through a Required-policy gate). - NodeActionClaim → composes a provider-kubernetes Object that creates a batch/v1 Job running kubectl cordon/drain (k8s-side op, not a cloud op, per the task spec). action=replace additionally composes a provider-hcloud Server for the replacement node. UPDATE/DELETE summary: - UPDATE: every mutable schema field is patched onto the underlying managed resource; Crossplane's composite controller drives the diff and provider-hcloud reconciles to the new state. - DELETE: every composed resource has deletionPolicy: Delete, so a cascade delete of the composite tears down the whole resource graph in dependency-safe order (Crossplane retries until deps unblock). New tests: - tests/composition-validate.sh — 7 gates: helm renders cleanly, exactly 6 XRDs, ≥ 6 Compositions, all 6 expected claim kinds present, every rendered doc is valid YAML, every fixture references a real XRD, and (when KUBECONFIG + Crossplane CRDs available) server-side dry-run for every fixture. - tests/fixtures/<kind>-sample.yaml — one XRC fixture per kind. Version bump: - platform/crossplane/chart/Chart.yaml 1.1.1 → 1.1.2 - platform/crossplane/blueprint.yaml 1.1.1 → 1.1.2 - clusters/_template/bootstrap-kit/04-crossplane.yaml → 1.1.2 - clusters/otech.omani.works/bootstrap-kit/04-crossplane.yaml → 1.1.2 Hard rules respected: - provider-hcloud only for cloud ops (never hcloud-go, never CLI). - provider-kubernetes Object for k8s-side ops (never raw kubectl). - No bespoke kubectl manifests for cloud resources. - Frontend + catalyst-api Go code untouched (sibling-owned). - Target state, no MVP framing — all 6 Compositions ship. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 09:33:38 +02:00
e3mrah	c747fe2265	fix(bp-gitea): override postgresql to bitnamilegacy (Bitnami evacuated docker.io tags) (#231 ) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 08:27:49 +02:00
e3mrah	da87fb38c4	fix(bp-spire): disable ALL default-enabled clusterSPIFFEIDs (default+oidc+test-keys) (#230 ) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 08:13:41 +02:00
e3mrah	719c3bac35	fix(bp-spire): disable default ClusterSPIFFEID — CRD not observable in time on fresh install (#228 ) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 07:51:03 +02:00
e3mrah	1689ffcd1a	fix(bp-coraza,bp-syft-grype): add common library subchart to satisfy hollow-chart gate (#220 ) Both charts are scratch (no upstream Helm chart published — Coraza project + anchore/syft+grype CLIs ship containers only). The blueprint-release.yaml hollow-chart gate (issue #181) rejects charts with zero declared dependencies. Adding sigstore/common as a tiny library subchart satisfies the gate; common is a library type so it contributes zero runtime resources to either chart's rendered output. The Catalyst-side templates (Deployment+Service for bp-coraza, CronJob+PVC for bp-syft-grype) remain entirely in templates/ — the library dep is purely a CI-gate mechanism, NOT a functional dependency. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 06:15:28 +02:00
e3mrah	3a57e287e5	feat(platform): security umbrellas (falco/kyverno/trivy/sigstore/syft-grype/reloader/coraza/litmus) (#216 ) * feat(bp-falco): umbrella chart for security layer Catalyst Blueprint umbrella chart for falco — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-kyverno): umbrella chart for security layer Catalyst Blueprint umbrella chart for kyverno — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-trivy): umbrella chart for security layer Catalyst Blueprint umbrella chart for trivy — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-sigstore): umbrella chart for security layer Catalyst Blueprint umbrella chart for sigstore — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-syft-grype): umbrella chart for security layer Catalyst Blueprint umbrella chart for syft-grype — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-reloader): umbrella chart for security layer Catalyst Blueprint umbrella chart for reloader — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-coraza): umbrella chart for security layer Catalyst Blueprint umbrella chart for coraza — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. * feat(bp-litmus): umbrella chart for security layer Catalyst Blueprint umbrella chart for litmus — security/policy layer. Pinned upstream + appVersion verified against the helm index on 2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Solo-Sovereign defaults; per-Sovereign overlays bump to HA later. Part of security-stack umbrellas batch 3. --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-30 06:07:38 +02:00
e3mrah	75128781b3	feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214 ) * feat(bp-grafana): umbrella chart for observability stack Catalyst Blueprint umbrella for Grafana — visualization layer of the LGTM observability stack (Loki/Grafana/Tempo/Mimir). Pinned to grafana/grafana 10.5.15 (appVersion 12.3.1) — current stable on 2026-04-29. Solo-Sovereign defaults: 1 replica, 10Gi PVC, ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. * feat(bp-loki): umbrella chart for observability stack Catalyst Blueprint umbrella for Grafana Loki — log aggregation backend of the LGTM stack. SingleBinary mode by default (solo-Sovereign min); SimpleScalable/Distributed are values toggles. Pinned to grafana/loki 7.0.0 (appVersion 3.6.7) on 2026-04-29. Filesystem storage default; SeaweedFS S3 wiring is per-Sovereign overlay when scaling out. All observability toggles default false per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. * feat(bp-mimir): umbrella chart for observability stack Catalyst Blueprint umbrella for Grafana Mimir — metrics storage tier of the LGTM stack. Pinned to grafana/mimir-distributed 6.0.6 (appVersion 3.0.4) on 2026-04-29. Solo-Sovereign defaults: every component scaled to 1 replica, zoneAwareReplication disabled, Kafka ingest-storage disabled. Bundled MinIO kept enabled as a stop-gap so the chart renders; SeaweedFS S3 wiring is per-Sovereign overlay. All metaMonitoring toggles default false per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. * feat(bp-tempo): umbrella chart for observability stack Catalyst Blueprint umbrella for Grafana Tempo — distributed tracing backend of the LGTM stack. Single-binary mode by default (solo-Sovereign min); microservice mode (tempo-distributed) is a chart swap toggle. Pinned to grafana/tempo 1.24.4 (appVersion 2.9.0) on 2026-04-29. Local PVC storage default; SeaweedFS S3 wiring is per-Sovereign overlay. Metrics generator disabled by default (depends on bp-mimir). ServiceMonitor default false per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. * feat(bp-alloy): umbrella chart for observability stack Catalyst Blueprint umbrella for Grafana Alloy — unified telemetry collector for the LGTM stack (logs, metrics, traces; OTLP-native). Pinned to grafana/alloy 1.8.0 (appVersion v1.16.0) on 2026-04-29. DaemonSet controller default (one Alloy per node) so node + container telemetry work out of the box. Empty Alloy config by default; per-Sovereign overlays populate forwarders to bp-loki/bp-mimir/bp-tempo once those reconcile. ServiceMonitor + ingress + CRDs default false per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. * feat(bp-opentelemetry): umbrella chart for observability stack Catalyst Blueprint umbrella for the OpenTelemetry Collector — vendor- neutral telemetry collector. Sibling to bp-alloy; per-Sovereign overlays choose one. Pinned to open-telemetry/opentelemetry-collector 0.152.0 (appVersion 0.150.1) on 2026-04-29. Uses the contrib distribution (otel/opentelemetry-collector-contrib:0.150.1) so Loki/Mimir/Tempo exporters are bundled. Deployment mode default (1 replica); DaemonSet + StatefulSet are values toggles. All presets default false; ingress + ServiceMonitor + PodMonitor + PrometheusRule + NetworkPolicy default false per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. * feat(bp-langfuse): umbrella chart for observability stack Catalyst Blueprint umbrella for Langfuse — LLM observability platform. Complements bp-grafana (infrastructure metrics) with AI-specific telemetry (traces, evaluations, prompts, cost attribution). Pinned to langfuse/langfuse 1.5.28 (appVersion 3.171.0) on 2026-04-29. Catalyst convention: ALL bundled Bitnami subcharts are disabled — PostgreSQL via cnpg.io/Cluster (bp-cnpg), Redis via bp-valkey, ClickHouse via bp-clickhouse, S3 via bp-seaweedfs. Per-Sovereign overlays wire external endpoints + Secret references. Telemetry to Langfuse Inc. defaulted false; signUpDisabled defaulted true. Part of issue #204 observability-stack umbrellas batch. * feat(bp-velero): umbrella chart for observability stack Catalyst Blueprint umbrella for Velero — Kubernetes-native backup and disaster recovery. Per platform/velero/README.md, ALL Velero output goes to SeaweedFS (Catalyst's unified S3 encapsulation), which transitions to a cloud archival backend on the cold tier. Pinned to vmware-tanzu/velero 12.0.1 (appVersion 1.18.0) on 2026-04-29. Bundled velero-plugin-for-aws:v1.14.0 init container so SeaweedFS S3 is reachable. backupsEnabled/snapshotsEnabled defaulted false at this layer (placeholders for backupStorageLocation); per-Sovereign overlays flip on after wiring SeaweedFS endpoint + credentials. ServiceMonitor + PodMonitor + PrometheusRule default false per BLUEPRINT-AUTHORING.md §11.2. Part of issue #204 observability-stack umbrellas batch. --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-29 22:11:04 +02:00
e3mrah	fa0e3a494b	fix(bp-keycloak): pin to current Bitnami tag (closes #191 ) (#198 ) * fix(bp-keycloak): pin to current Bitnami Keycloak tag (closes #191) Bitnami consolidated their tag scheme around 2025-09 (see https://github.com/bitnami/charts/issues/30852). The chart was pinned to upstream bitnami/keycloak Helm chart 24.7.1, whose default image tag `bitnami/keycloak:26.2.4-debian-12-r0` now returns 404 in the Docker Hub registry — installs hit ImagePullBackOff (verified on omantel). Changes: - Upstream Bitnami chart: 24.7.1 -> 25.2.0 (latest, appVersion 26.3.3) - Override image.registry/image.repository for every Bitnami image used by the chart (keycloak app, keycloak-config-cli, postgresql, postgres-exporter, os-shell) to point at `bitnamilegacy/`, where the historic debian-12 tags are preserved - Replace deprecated `proxy: edge` with `proxyHeaders: "xforwarded"` (chart 25.x renamed the field; Catalyst fronts Keycloak with Cilium Gateway which sets X-Forwarded- headers) - bp-keycloak chart version: 1.1.1 -> 1.1.2 Verification (registry HEAD via Bearer token): bitnami/keycloak:26.2.4-debian-12-r0 -> 404 (broken pin) bitnami/keycloak:26.3.3-debian-12-r0 -> 404 (registry move) bitnamilegacy/keycloak:26.3.3-debian-12-r0 -> 200 bitnamilegacy/keycloak-config-cli:6.4.0-... -> 200 bitnamilegacy/postgresql:17.6.0-debian-12-r0 -> 200 bitnamilegacy/postgres-exporter:0.17.1-... -> 200 bitnamilegacy/os-shell:12-debian-12-r50 -> 200 `helm template platform/keycloak/chart` renders cleanly; rendered images all resolve to bitnamilegacy/* tags listed above. Long-term follow-up (not blocking): bitnamilegacy is explicitly marked "no longer updated, may be removed in the future" — Catalyst should either build its own Keycloak image or migrate to the Bitnami Secure Image (BSI/Photon) catalog when chart support catches up. Tracked in the bp-keycloak description block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-keycloak): bump blueprint.yaml version to match Chart.yaml 1.1.2 --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 20:10:17 +02:00
e3mrah	bcd2e7980a	fix: hide CRD-emitting resources behind Capabilities gates (closes #190 ) (#200 ) * fix(bp-external-dns): hide CRD-emitting resources behind Capabilities gates (refs #190) Wrap the Catalyst overlay's ServiceMonitor and ExternalSecret templates in `.Capabilities.APIVersions.Has` checks so a cold install on a fresh Sovereign — where bp-kube-prometheus-stack and bp-external-secrets have not yet reconciled — no longer fails with `no matches for kind X in version Y`. The values toggles (`externalDns.serviceMonitor.enabled`, `externalDns.externalSecret.enabled`) remain — Capabilities is defense in depth so an operator flipping the toggle on a Sovereign that hasn't reached Phase 2 doesn't break the bp-external-dns reconcile. Verified locally: `helm template` with toggles off renders 0 of these resources; with toggles ON and `--api-versions monitoring.coreos.com/v1 --api-versions external-secrets.io/v1beta1` both render exactly once. Bump version 1.1.0 → 1.1.2 to align with the Phase-1 architectural-fix wave from issue #190. * fix(bp-powerdns): hide CRD-emitting resources behind Capabilities gates (refs #190) Three Catalyst overlay templates emit resources whose CRDs ship in OTHER charts and were unconditionally rendered, causing a cold install of bp-powerdns to fail with `no matches for kind X` on a Sovereign that hasn't yet reconciled the upstream chart: - cnpg-cluster.yaml → postgresql.cnpg.io/v1 Cluster (CRD ships in bp-cnpg) - api-ingress.yaml → traefik.io/v1alpha1 Middleware (CRD ships with the Traefik controller; k3s ships it by default but a Sovereign overlay MAY disable Traefik in favour of cilium-only ingress) - crossplane-floatingip.yaml → compose.openova.io/v1alpha1 HetznerFloatingIP (CRD ships when the Catalyst Crossplane composition family lands — see GAP DISCLOSURE in that template) Each is wrapped in `.Capabilities.APIVersions.Has "<group>/<version>"`. The Traefik router-middleware annotation on the Ingress is similarly gated so the auth posture cleanly moves to the Sovereign's chosen ingress controller when Traefik is absent. Verified locally: `helm template` with default values renders 0 of these resources; with `--api-versions postgresql.cnpg.io/v1 --api-versions traefik.io/v1alpha1 --api-versions compose.openova.io/v1alpha1` plus `--set crossplane.floatingIP.enabled=true`, all three render exactly once. Existing tests/observability-toggle.sh still passes. Bump version 1.1.1 → 1.1.2. * fix(bp-powerdns): bump blueprint.yaml to match Chart.yaml 1.1.2 after Capabilities gate work --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-04-29 20:10:14 +02:00
e3mrah	1f5c76def1	fix(platform): sync blueprint.yaml versions with Chart.yaml (#199 ) * feat(ui): Playwright cosmetic + step-flow regression guards 15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic- guards.spec.ts that fail HARD when each user-flagged defect class returns: 1. card height drift from canonical 108px 2. reserved right padding eating description width 3. logo tile drift from per-brand LOGO_SURFACE 4. invisible glyph (white-on-white) via luminance proxy 5. wizard step order Org/Topology/Provider/Credentials/Components/ Domain/Review 6. legacy "Choose Your Stack" / "Always Included" tab labels 7. Domain step reachable before Components 8. CPX32 not the recommended Hetzner SKU 9. per-region SKU dropdown shows wrong provider catalog 10. provision page is .html (static) not SPA route 11. legacy bubble/edge DAG SVG markup on provision page 12. admin sidebar drift from canonical core/console (w-56 + 7 labels) 13. AppDetail uses tablist instead of sectioned layout 14. job rows navigate to /job/<id> instead of expand-in-place 15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage Each test prints a failure message naming the canonical reference, the source-of-truth file, and the data-testid PR needed (if any) so the implementing agent has a precise target. No .skip() — per INVIOLABLE-PRINCIPLES #2, missing components fail loud. CI: .github/workflows/cosmetic-guards.yaml runs the suite on every PR that touches products/catalyst/bootstrap/ui/ or core/console/. Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's original complaint, the canonical reference, and the green/red semantics (5 tests intentionally RED on main today — they stay red until the companion-agent's UI work lands). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:07:55 +04:00
hatiyildiz	b0c1c07271	fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction) Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via `https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`. helm-controller's reconcile of bp-flux ran `helm install` on top of the running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver admission with `status.storedVersions[0]: Invalid value: "v1": must appear in spec.versions`; Helm rolled back; the rollback DELETED every running Flux controller Deployment (helm-controller, source-controller, kustomize-controller, image-automation-controller, image-reflector-controller, notification-controller). The cluster lost its GitOps engine — no further HelmRelease could progress, and the only recovery was full `tofu destroy` + reprovision. This is OPTION C of the architectural fix proposed in the incident memo: version-align cloud-init's flux2 install with the bp-flux umbrella chart's `flux2` subchart so a single upstream Flux release is installed and helm-controller adopts it on first reconcile rather than reinstalls on top with a different version. Changes: * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml URL pinned at v2.4.0 (deliberate; this is the source of truth) and added the CRITICAL VERSION-PIN INVARIANT comment block documenting the failure mode. * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from 2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion 2.4.0, matching cloud-init exactly. Bumped chart version 1.1.1 -> 1.1.2. * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream .version` mirror of the dep pin moved from 2.13.0 to 2.14.1. * `clusters/_template/bootstrap-kit/03-flux.yaml` and `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped bp-flux HelmRelease to 1.1.2 + added explicit `install.disableTakeOwnership: false`, `upgrade.disableTakeOwnership: false`, and `upgrade.preserveValues: true` so helm-controller adopts the cloud-init-installed Flux objects rather than rolling back on ownership conflict. * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2. * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml` — bumped HelmRelease to 1.1.2. * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case catastrophic-failure replay test: Case 1: Chart.yaml declares the flux2 subchart with explicit version. Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag. Case 3: chart's flux2 subchart appVersion equals cloud-init's pinned upstream version (the load-bearing invariant). Case 4: values.yaml metadata mirrors the Chart.yaml dep pin. Case 5: helm template renders cleanly + contains the four core Flux controllers. Case 6: replay test rejects a planted mismatched fake Chart.yaml (the gate's own self-test — proves the gate works). All six cases green locally; the new test joins the existing observability-toggle test in tests/. * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install — version-pin invariant" documenting the failure mode, the four pin-sites, the safe bump procedure, and the existing-Sovereign recovery path (full reprovision). Existing Sovereigns running 1.1.1: no in-place recovery is possible once the rollback has fired. Reprovision required against 1.1.2. Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) + #4 (never hardcode) — the version pins remain operator-bumpable via PR, but BOTH cloud-init's URL AND the chart's subchart MUST move together in the same PR; CI gate tests/version-pin-replay.sh enforces this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:38:17 +02:00
hatiyildiz	4265884d58	feat(bp-external-dns): umbrella chart + add to bootstrap-kit Kustomization Convert platform/external-dns/chart/ from a metadata-only wrapper to a proper Helm umbrella that pulls kubernetes-sigs/external-dns 1.15.2 (appVersion 0.15.1, k8s 1.31-validated) as a Helm subchart, mirroring the bp-cilium / bp-cert-manager / bp-powerdns shape. Native PowerDNS provider speaks the bp-powerdns REST API directly via the EXTERNAL_DNS_PDNS_API_KEY env var sourced from the powerdns-api-credentials Secret bp-powerdns renders. Catalyst overlay templates added (default-off where applicable per the observability-toggle rule for the bp-* family): - templates/networkpolicy.yaml (default ON; egress to powerdns + cluster DNS + apiserver only) - templates/servicemonitor.yaml (default OFF) - templates/externalsecret.yaml (default OFF; Phase-2 OpenBao path) - templates/_helpers.tpl Bootstrap-kit Kustomization gets a new 12-external-dns.yaml HelmRelease referencing bp-external-dns:1.1.0 with dependsOn bp-cert-manager + bp-powerdns, and the legacy 11-bp-catalyst-platform.yaml is renumbered 13- so the install ordering reads in canonical Phase-0 sequence. Mirrored to clusters/omantel.omani.works/bootstrap-kit/ with the SOVEREIGN_FQDN substitution applied. bp-catalyst-platform Chart.yaml drops bp-external-dns from its dependency block — install ordering for ExternalDNS is now owned by Flux dependsOn at the Kustomization layer rather than this umbrella's Helm dependency graph. Bumped 1.1.0 → 1.1.1 to reflect the dep removal, and the bootstrap-kit HelmRelease references in both clusters bumped in lockstep. Wrapper chart version bumped 1.0.0 → 1.1.1 (umbrella shape). Local gates pass: - helm dependency build (pulls external-dns-1.15.2.tgz) - helm lint (0 failures) - helm template smoke render (245 lines, 6 kinds rendered) - helm package + tar-tzf verifies external-dns subchart inside the packaged tgz (subchart-guard simulation passes) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 19:29:27 +02:00
e3mrah	31d5911221	Merge pull request #185 from openova-io/fix/bp-charts-observability-toggles-default-false fix(bp-*): observability toggles default false (v1.1.1)	2026-04-29 21:26:48 +04:00
hatiyildiz	1ddd569789	fix(bp-): observability toggles default false — break circular CRD dependency Extends the v1.1.1 hardening that started with cilium / cert-manager / crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints. Every observability toggle in every Catalyst-curated Blueprint now ships `false`/`null` by default; the operator opts in via a per-cluster values overlay at clusters/<sovereign>/bootstrap-kit/ once bp-kube-prometheus-stack reconciles. Live failure mode that prompted this (omantel.omani.works 2026-04-29): bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor to true. The upstream Cilium 1.16.5 chart renders a monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with kube-prometheus-stack — a tier-2 Application Blueprint that depends on the bootstrap-kit (cilium first). Helm install fails on a fresh Sovereign with "no matches for kind ServiceMonitor in version monitoring.coreos.com/v1 — ensure CRDs are installed first" and every downstream HelmRelease reports `dep is not ready`. The earlier trustCRDsExist=true mitigation only suppresses Helm's render-time gate; the apiserver still rejects the resource at install-time. Per-Blueprint changes: - bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false; hubble.metrics.enabled → null (this is the exact value that disables the upstream metrics ServiceMonitor template branch — verified by reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor .enabled → false. tests/observability-toggle.sh extended with Case 4 (default render produces no hubble-relay / hubble-ui Deployments). - bp-flux: flux2.prometheus.podMonitor.create → false. - bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled → false (explicit lock; upstream already defaults false). - bp-spire: spire.global.spire.recommendations.enabled + recommendations.prometheus → false. - bp-nats-jetstream: nats.promExporter.enabled + promExporter.podMonitor.enabled → false. - bp-openbao: openbao.injector.metrics.enabled + openbao.serviceMonitor.enabled → false. - bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled + metrics.prometheusRule.enabled → false. - bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.* serviceMonitor + prometheusRule → false. - bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled → false (forward-compatibility guard; current upstream pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future upstream bump cannot silently regress). Each chart ships a tests/observability-toggle.sh that asserts the rule in three cases (default off / explicit on opt-in / explicit off) — runs under blueprint-release.yaml's chart-test gate (added `bdeb0f54` + the existing wiring) before helm push. A regression that re-introduces a hardcoded enabled: true in any chart fails CI before the OCI artifact is published. Versioning: - All 11 leaf charts bumped 1.1.0 → 1.1.1. - products/catalyst/chart (bp-catalyst-platform umbrella) deps updated to 1.1.1 across the board. - clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to 1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror. docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every toggle disabled across all 11 Blueprints. References docs/INVIOLABLE-PRINCIPLES.md #4. GATES (all green): - helm dep build resolves cleanly post-change for every chart whose upstream is published (umbrella waits on per-leaf publish). - helm lint clean on all 11 leaves. - helm template . default render produces zero monitoring.coreos.com references on every leaf (verified locally). - tests/observability-toggle.sh PASS on all 11 leaves. Live verification: with v1.1.1 published the omantel.omani.works HelmRelease can roll forward without a manual values patch — Flux picks up the new chart digest automatically (semver: 1.x in OCIRepository). Refs: issue #182.	2026-04-29 19:23:52 +02:00
hatiyildiz	02b5b6c4c8	fix(bootstrap-kit): override cilium + cert-manager values to disable observability toggles Live verified on omantel: bp-cilium and bp-cert-manager v1.1.0 fail Helm install with 'no matches for kind ServiceMonitor in version monitoring.coreos.com/v1'. Manual kubectl-patch of the live HelmRelease worked but Flux's 15-min reconcile rolls back the patch because the HelmRelease CR is owned by the kustomize-controller from git. Override the values inline in the HelmRelease manifests so the patch is durable across Flux reconciles. Same pattern as the in-flight observability- toggle agent will apply to all 12 charts in the next chart bump (v1.1.1). This is the manifest-level workaround that unblocks the running omantel cluster TODAY without waiting for v1.1.1 publish. Mirrors the patches into both clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/bootstrap-kit/ so future Sovereigns inherit.	2026-04-29 19:17:08 +02:00
hatiyildiz	b1638f51ea	fix(bp-* tests): skip helm dep build when charts/ already vendored Earlier rerun failure on the CI workflow (bp-cert-manager 25120060270): Error: no repository definition for https://charts.jetstack.io. Please add the missing repos via 'helm repo add' Root cause: blueprint-release.yaml's earlier `helm dependency build` step (line 181) successfully resolves the upstream chart and populates chart/charts/ — but it does NOT `helm repo add` the upstream repo first. Helm 3.20's `helm dep build` succeeds on the first call by falling back to direct-URL fetch from Chart.yaml `dependencies[].repository`. A SECOND `helm dep build` (run by the test script) hits a different code path that requires the repo to be in the helm repo cache. Fix: tests/observability-toggle.sh now skips `helm dep build` when chart/charts/ is already populated (which is always the case in CI since the workflow's own `helm dependency build` step ran first). Local dev runs from a fresh checkout still resolve subcharts. Refs #182 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 18:12:21 +02:00
hatiyildiz	d34facc040	fix(bp-): observability toggles default false — break circular CRD dependency bp-cilium@1.1.0 install fails on every fresh Sovereign with: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" — ensure CRDs are installed first Cascades to all 10 other bp- HelmReleases ("dep is not ready") since bp-cilium is the root of the bootstrap dep graph. Verified live on omantel.omani.works 2026-04-29 (issue #182). Root cause: platform/cilium/chart/values.yaml and platform/cert-manager/chart/values.yaml hardcoded `serviceMonitor.enabled: true`. The monitoring.coreos.com/v1 CRDs ship with kube-prometheus-stack — an Application-tier Blueprint that itself depends on the bootstrap-kit. Hardcoding `true` creates a circular CRD ordering: bp-cilium wants the CRD bp-kube-prometheus-stack provides, but bp-kube-prometheus-stack cannot install before bp-cilium. The `trustCRDsExist=true` mitigation only suppresses Helm's render-time gate; the apiserver still rejects the resource at install-time. Violates INVIOLABLE-PRINCIPLES.md #4 (never hardcode): observability toggles MUST be operator-tunable, not chart-level constants assuming an observability tier exists. This commit: A. Defaults every observability toggle false in the affected wrappers: - platform/cilium/chart/values.yaml: cilium.prometheus.enabled: false cilium.prometheus.serviceMonitor.enabled: false (trustCRDsExist removed — no longer relevant) - platform/cert-manager/chart/values.yaml: cert-manager.prometheus.enabled: false cert-manager.prometheus.servicemonitor.enabled: false - platform/crossplane/chart/values.yaml: crossplane.metrics.enabled: false (uniformity rule — does not break install but holds the invariant) B. Bumps affected wrapper charts 1.1.0 → 1.1.1: - bp-cilium, bp-cert-manager, bp-crossplane (leaves) - bp-catalyst-platform (umbrella; deps repinned to 1.1.1 for the 3) C. Updates clusters/_template/bootstrap-kit/* and clusters/omantel.omani.works/bootstrap-kit/* HelmRelease versions to 1.1.1 so the live Sovereign picks up the fix on Flux reconcile. D. Adds platform/<name>/chart/tests/observability-toggle.sh under each affected chart. Each script asserts: - default render produces zero monitoring.coreos.com refs - opt-in render with --set <toggle>=true succeeds and produces a ServiceMonitor (proves the toggle is wired) - explicit-off render succeeds and produces zero refs Wired into .github/workflows/blueprint-release.yaml via a new "Run chart integration tests" step that executes every chart/tests/ .sh on every publish — a regression that re-introduces a hardcoded `true` fails the publish job before the OCI artifact is pushed. E. Documents the rule in docs/BLUEPRINT-AUTHORING.md §11.2 "Observability toggles must default false". References Principle #4 and provides the canonical pattern (default off in wrapper values, opt-in via per-cluster overlay at clusters/<sovereign>/...). Per-chart audit table (which toggle was hardcoded → new default): \| Chart \| Toggle \| Was \| Now \| \|------------------\|----------------------------------------------------------\|------\|-------\| \| bp-cilium \| cilium.prometheus.enabled \| true \| false \| \| bp-cilium \| cilium.prometheus.serviceMonitor.enabled \| true \| false \| \| bp-cert-manager \| cert-manager.prometheus.enabled \| true \| false \| \| bp-cert-manager \| cert-manager.prometheus.servicemonitor.enabled \| true \| false \| \| bp-crossplane \| crossplane.metrics.enabled \| true \| false \| \| bp-flux \| (no observability hardcodes) \| n/a \| n/a \| \| bp-sealed-secrets\| (no observability hardcodes) \| n/a \| n/a \| \| bp-spire \| (no observability hardcodes) \| n/a \| n/a \| \| bp-nats-jetstream\| (no observability hardcodes) \| n/a \| n/a \| \| bp-openbao \| (no observability hardcodes) \| n/a \| n/a \| \| bp-keycloak \| (no observability hardcodes) \| n/a \| n/a \| \| bp-gitea \| (no observability hardcodes) \| n/a \| n/a \| \| bp-powerdns \| (no observability hardcodes) \| n/a \| n/a \| \| bp-catalyst-platform \| (umbrella, no values overlay) \| n/a \| n/a \| Local gates green: helm dep build ✓ all 3 affected charts helm lint ✓ all 3 helm template ✓ all 3 — 0 monitoring.coreos.com refs in default tests/observability-toggle.sh ✓ all 9 sub-cases pass Closes the install path for bp-cilium 1.1.1 on a fresh Sovereign; unblocks the full bp- dep graph. Refs: https://github.com/openova-io/openova/issues/182 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 18:08:09 +02:00
hatiyildiz	43aff20254	feat(bp-): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream Each platform/<name>/chart/Chart.yaml now declares the canonical upstream chart as a dependencies: entry. helm dependency build pulls the upstream payload into the OCI artifact at publish time, so Flux helm install of bp-<name>:1.1.0 actually installs the upstream Helm release alongside the Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer, ExternalSecret) under templates/. Pinned upstream chart versions per platform/<name>/blueprint.yaml: - cilium 1.16.5 https://helm.cilium.io - cert-manager v1.16.2 https://charts.jetstack.io - flux 2.4.0 https://fluxcd-community.github.io/helm-charts - crossplane 1.17.x https://charts.crossplane.io/stable - sealed-secrets 2.16.x https://bitnami-labs.github.io/sealed-secrets - spire ... https://spiffe.github.io/helm-charts-hardened - nats-jetstream ... https://nats-io.github.io/k8s/helm/charts - openbao ... https://openbao.github.io/openbao-helm - keycloak ... https://charts.bitnami.com/bitnami - gitea ... https://dl.gitea.com/charts - catalyst-platform umbrella over the 10 leaf bp- charts via helm dependency values.yaml in each chart adopts the umbrella convention: catalystBlueprint metadata block (provenance + version) at top level, upstream subchart values namespaced under the dependency name. cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER cert-manager controllers are running and CRDs registered (the previous hollow-chart shape ran the ClusterIssuer at install time when CRDs didn't exist yet, which was the omantel cluster's exact failure mode). Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella conversion is a meaningful structural revision). Cluster manifests in clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/ bootstrap-kit/ updated to reference 1.1.0. The blueprint-release.yaml workflow's helm package step needs an explicit helm dependency build before push so the upstream subchart bytes ship inside the OCI artifact. That CI change is a follow-up commit on this same branch (separate file scope).	2026-04-29 17:21:36 +02:00
hatiyildiz	67fdecb770	merge: remove k8gb (#171 )	2026-04-29 08:51:21 +02:00
hatiyildiz	f5daac52af	refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171 ) PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything k8gb was doing — geo-aware response selection, health-checked failover, weighted round-robin — at the authoritative DNS layer. Eliminates a separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign. Changes: - platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never authored — only README existed) - products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted - componentGroups.ts: remove k8gb component (PowerDNS already there) - componentLogos.tsx: drop logo_k8gb + k8gb map entry - model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns - StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb - provision.html: replace k8gb tile and edges with powerdns - catalog.generated.ts regenerated (now includes bp-powerdns) - docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING- CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY, COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs (cilium, external-dns, failover-controller, litmus, flux, opentofu) rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md. Historical entries in VALIDATION-LOG.md preserved as audit trail. - New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed), Application Placement → lua-record selector mapping, when to add a second Sovereign region, operational checks. Closes #171. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:51:09 +02:00
hatiyildiz	f4679e2748	fix(powerdns): enable gpgsql-dnssec for DNSSEC API (1.0.6) Without `gpgsql-dnssec=yes` the gpgsql backend driver does not expose the DNSSEC API surface — `PUT /zones/<zone>` with `dnssec:true` returns 422 "no DNSSEC-capable backends are loaded". This blocks pool-domain- manager from enabling DNSSEC on every Sovereign child zone (mandatory per docs/PLATFORM-POWERDNS.md). Fix lands in additionalConfig so the directive is rendered alongside `default-soa-edit-signed=INCEPTION-EPOCH` and `direct-dnskey=yes`. No schema migration needed — the gpgsql 5.0.3 schema already includes the cryptokeys table; the missing piece was just the backend feature flag. Bumps Chart.yaml to 1.0.6. Verified: after this lands the PUT call returns 204 and POST /cryptokeys mints a usable KSK. Discovered while bringing up openova#168 (PDM per-Sovereign zones).	2026-04-29 08:42:18 +02:00
hatiyildiz	fa84cac438	fix(powerdns): plain ALTER TABLE in postInitSQL (avoid $$ escape battle, 1.0.5) The DO block in 1.0.4 rendered with $$ collapsed to $ by the time it reached CNPG's postInitApplicationSQL — "syntax error at or near $". Both Helm template processing and the YAML scalar block were chewing on the dollar signs. Replaced with explicit ALTER TABLE statements (one per gpgsql table) + GRANT — same end state, no PL/pgSQL quoting required. Verified at runtime on contabo-mkt: powerdns Pod went CrashLoopBackOff → Running 1/1 immediately after the manual ALTER ran by hand. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:17:28 +02:00
hatiyildiz	214a3e1ada	fix(powerdns): grant table ownership to pdns user in CNPG bootstrap (1.0.4) Verified at runtime on Contabo-mkt: postInitApplicationSQL runs as the postgres superuser, not the application owner, so the schema tables created by the bootstrap block were owned by postgres. PowerDNS connects as 'pdns' and got 'permission denied for table domains' on the first SELECT against the zone cache. Added a DO block at the end of the schema bootstrap that walks every table in the public schema and ALTERs OWNER TO {{ .Values.postgres.cluster.owner }} plus GRANT ALL PRIVILEGES ON SCHEMA public — same shape PDM uses (and the contabo-mkt cluster verified the fix runtime: powerdns Pod went from CrashLoopBackOff to 1/1 Ready immediately after the same DDL was run by hand). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:14:12 +02:00
hatiyildiz	db20e9d42b	fix(powerdns): dnsdist backend resolution + drop DnstapLogAction (1.0.3) dnsdist 1.9.14 runtime errors: 1. newServer{address='powerdns:5353'} → "Unable to convert presentation address" — dnsdist's address parser expects IP[:port], not a DNS name. Kubernetes auto-injects POWERDNS_SERVICE_HOST as an env var into every pod in the same namespace as the powerdns Service; using that gives us the ClusterIP at config-load time without needing an init container or runtime DNS resolution. 2. DnstapLogAction(name, bool, fn) signature changed in 1.9 — the 2nd parameter now expects a shared_ptr to a RemoteLoggerInterface, not a boolean. Rather than wire up a remote dnstap server (which adds a moving part for marginal observability gain), drop the line. Catalyst observability is the dnsdist /metrics endpoint surfaced to Prometheus + the k8s container log. Bumped chart to 1.0.3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:12:27 +02:00
hatiyildiz	20c0543806	fix(powerdns): correct dnsdist image tag + drop readOnlyRootFilesystem (1.0.2) Two runtime issues caught during first contabo-mkt rollout: 1. dnsdist image tag was "1.9" (default) — that tag doesn't exist in docker.io/powerdns/dnsdist-19. The 1.9.x line publishes 1.9.0 .. 1.9.14 (no rolling "1.9" alias). Pinned to 1.9.14 (current latest). 2. PowerDNS pod crash-looped on Errno 30 (Read-only file system: /etc/powerdns/pdns.d/0-api.conf.conf). The upstream pdns_server-startup script writes rendered config files to /etc/powerdns/pdns.d/ at container start, and the upstream template doesn't expose an emptyDir we could redirect that path to. Set readOnlyRootFilesystem=false with a verbose comment explaining why; the rest of the security context (runAsNonRoot, runAsUser=953, drop ALL caps) stays in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:06:39 +02:00
hatiyildiz	19d926bfeb	fix(powerdns): avoid recursive include in dnsdist checksum, bump to 1.0.1 Helm flagged dnsdist.yaml's checksum/config annotation as a recursive template self-reference (the file included itself). Replaced with a hash of the rendered .Values.dnsdist.config (post-tpl), which is the substantive content the annotation is supposed to track anyway. Bumped Chart.yaml to 1.0.1 so the OCIRepository semver "1.x" picks up the fix automatically on next reconcile. Blueprint API version stays at 1.0.0 (Blueprint contract is unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 08:02:53 +02:00
hatiyildiz	0190c60520	feat(powerdns): bp-powerdns wrapper chart + per-Sovereign zone model (#167 ) Introduces the bp-powerdns Catalyst Blueprint wrapper as the authoritative DNS service for every Sovereign zone. Replaces k8gb in componentGroups.ts — PowerDNS Lua records cover geo + health-checked failover natively, removing the dedicated GSLB controller. Wrapper chart (platform/powerdns/chart/): - Chart.yaml — bp-powerdns 1.0.0, depends on pschichtel/powerdns 0.10.0 upstream (verified Artifact Hub publisher, tracks docker.io/powerdns/ pdns-auth-50 at appVersion 5.0.3 — surveyed Artifact Hub, no official PowerDNS chart exists) - values.yaml — 3 replicas, gpgsql backend, DNSSEC ECDSAP256SHA256, lua-records ON, dnsdist 100 qps default per source IP, REST API at pdns.openova.io/api behind Traefik basicAuth - blueprint.yaml — Catalyst metadata, visibility=unlisted (mandatory infra), section pts-3-2-gitops-and-iac - templates/cnpg-cluster.yaml — separate `pdns-pg` Postgres (1 instance, 5Gi, postgres-16) with PowerDNS auth-5.0.3 schema applied via postInitApplicationSQL - templates/dnsdist.yaml — companion Deployment + ConfigMap with rate-limiting policy (MaxQPSIPRule per source IP) - templates/api-ingress.yaml — Traefik Ingress + basicAuth Middleware - templates/anycast-endpoint.yaml — placeholder Service of type LoadBalancer (Phase-0 stand-in for the anycast Floating IP target state) - templates/crossplane-floatingip.yaml — DISCLOSED GAP: target-state XHetznerFloatingIP composite, disabled by default until the Crossplane composition is authored (the existing compositions cover Server/Network/Firewall/LoadBalancer/PoolAllocation only). The placeholder anycast Service is the operational stand-in. Per docs/INVIOLABLE-PRINCIPLES.md: - #4 (never hardcode): every value flows from values.yaml or a referenced K8s Secret. Image tags come from upstream chart appVersion, never duplicated. - #8 (disclose every divergence): the XHetznerFloatingIP gap is documented in the template + in docs/PLATFORM-POWERDNS.md ("Anycast deferral" section). componentGroups.ts: powerdns added to SPINE group as mandatory (depends on cnpg). external-dns now lists powerdns as a dependency. k8gb removed. docs/PLATFORM-POWERDNS.md: per-Sovereign zone model, DNSSEC posture, REST API contract, lua-records GSLB pattern, dnsdist policy, anycast deferral runbook, first-deploy procedure for Contabo-mkt. Closes #167 (Phase 1 of public-repo work; Phase 4 cluster manifest lands in openova-private feat/powerdns-deploy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 07:49:51 +02:00
hatiyildiz	31b03ce02a	ci(pdm)+platform(crossplane): build workflow + XDynadotPoolAllocation composition (Phase 3+4 of #163 ) CI workflow (.github/workflows/pool-domain-manager-build.yaml) mirrors the marketplace-api / catalyst-api shape: - Triggers on push to core/pool-domain-manager/** + workflow_dispatch - Runs unit tests (reserved + dynadot — the integration suite needs a real Postgres which the workflow does not provide; full integration runs in test-bootstrap-api.yaml against an ephemeral CNPG) - Builds and pushes ghcr.io/openova-io/openova/pool-domain-manager:<sha> - Cosign-signs the image via Sigstore keyless OIDC (id-token: write) - Emits an SBOM attestation tied to the image digest - Manifest deployment is intentionally NOT in this workflow — PDM manifests live in the openova-private repo per the issue body, so the Flux Kustomization there picks up the new SHA via a follow-up private-repo commit (Phase 6 of #163) Crossplane composition (platform/crossplane/compositions/xrd-pool- allocation.yaml + composition-pool-allocation.yaml) wraps PDM as a declarative Crossplane Resource: apiVersion: compose.openova.io/v1alpha1 kind: XDynadotPoolAllocation spec: parameters: poolDomain: omani.works subdomain: omantel sovereignFQDN: omantel.omani.works loadBalancerIP: 1.2.3.4 createdBy: crossplane The Composition uses provider-http (crossplane-contrib/provider-http) to render the XR into a Reserve → Commit sequence of HTTP calls against PDM's in-cluster service URL. Per docs/INVIOLABLE-PRINCIPLES.md #3 we use provider-http rather than bespoke Go to keep the day-2 lifecycle declarative. Operators who want to pre-allocate a name (e.g. reserve 'omantel.omani.works' for a Sovereign that hasn't been provisioned yet) commit YAML to Git and Flux+Crossplane converge. Refs: #163	2026-04-29 06:46:11 +02:00
hatiyildiz	8886eff708	Merge branch 'feat/group-g-dns-finish-v3' Group G DNS finish (v3): #110 (Dynadot multi-domain table-driven tests), #112 (catalyst-dns httptest-mocked Dynadot coverage), #113 (cert-manager LE DNS-01 + HTTP-01 ClusterIssuer templates with operator runbook for the cert-manager-dynadot-webhook gap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 19:45:35 +02:00
hatiyildiz	97e942e0bc	feat(cert-manager): #113 — Lets Encrypt DNS-01 + HTTP-01 ClusterIssuers Adds platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml with two ClusterIssuers, both Catalyst-curated, rendered conditionally from values.yaml: - letsencrypt-dns01-prod (TARGET STATE, default disabled) — ACME DNS-01 via the cert-manager webhook solver, pointing at a future `cert-manager-dynadot-webhook` Catalyst binary that will implement the webhook.acme.cert-manager.io/v1alpha1 contract against the existing internal/dynadot/ package. Shipping the issuer template ahead of the webhook so cluster overlays only need a values flip + secret ref — no template edits — once the webhook lands. - letsencrypt-http01-prod (INTERIM, default enabled) — ACME HTTP-01 via the cilium ingress class. Issues certs for the explicit hostnames (console, gitea, harbor, admin, api) but NOT for wildcards; the canonical *.<sub>.<domain> record needs DNS-01. Header comment explains the gap: the Catalyst external-dns webhook (products/catalyst/bootstrap/api/cmd/external-dns-dynadot-webhook/) implements a DIFFERENT RPC contract (records.list/add/delete) than what cert-manager DNS-01 expects (Present/CleanUp on ChallengeRequest CRD), so it cannot be reused; a dedicated cmd/cert-manager-dynadot-webhook/ must be built. Operator runbook for cutover is in the file header. values.yaml gains a `certManager.issuers.{email,acmeServer,dns01,http01}` section so all knobs are runtime-configurable per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode); cluster overlays in clusters/<sovereign>/ can flip dns01.enabled via the bp-catalyst-platform umbrella's values without rebuilding the Blueprint OCI artifact. blueprint.yaml gains a spec.outputs section advertising: - issuerName: letsencrypt-http01-prod (default) - wildcardIssuerName: letsencrypt-dns01-prod (target state) - issuerKind: ClusterIssuer so dependent Blueprints (cilium-gateway, harbor, gitea) can consume the issuer name without hardcoding it. Closes #113. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 19:44:56 +02:00
hatiyildiz	c07e0ad1ee	feat(external-dns): #109 — author bp-external-dns leaf chart for OCI publish The bp-catalyst-platform umbrella (issue #104) declares a dependency on bp-external-dns:1.0.0 — but the chart didn't exist; only README + Dynadot multi-domain policy lived under platform/external-dns/. Without this leaf the umbrella's `helm dependency build` fails (verified in run 25068433765). This commit authors the minimal target-state leaf: - Chart.yaml: name=bp-external-dns, version=1.0.0 - values.yaml: catalystBlueprint.upstream metadata (external-dns 1.15.0 from kubernetes-sigs/external-dns Helm repo) + Catalyst-curated values overlay (sources, txtOwnerId, ServiceMonitor, RBAC, resources) Per BLUEPRINT-AUTHORING.md §3, leaf charts are pure values-overlay wrappers: no templates dir, just Chart.yaml + values.yaml with the catalystBlueprint metadata block read by the bootstrap-kit installer at helm-install time. Per-Sovereign provider/zone/credential overrides are overlaid by the Crossplane Composition that materializes the HelmRelease — keeping this chart provider-agnostic (no hardcoded Cloudflare/Dynadot/Hetzner choice per INVIOLABLE-PRINCIPLES.md §4). After this lands, blueprint-release.yaml will publish ghcr.io/openova-io/bp-external-dns:1.0.0 and the next umbrella push will resolve all 11 leaf deps successfully.	2026-04-28 19:42:23 +02:00
hatiyildiz	f0fe3006ba	feat(external-dns): #109 — Catalyst-curated dynadot-multi-domain policy Adds platform/external-dns/policies/dynadot-multi-domain.yaml — the canonical external-dns + dynadot webhook deployment that ships in every Sovereign on an OpenOva pool domain. Why a webhook: external-dns has no upstream Dynadot provider; the canonical pattern is the webhook RPC contract, with a sidecar that implements the provider in our preferred language. We reuse the same internal/dynadot/ package the catalyst-api uses, so the never-wipe rule, record encoding, and managed-domain allowlist are identical on both write paths (per docs/INVIOLABLE-PRINCIPLES.md #2 — no duplicate implementations of the same concern). Multi-domain: - One --domain-filter per zone in the external-dns args; adding a third pool domain (e.g. acme.io) is a one-line edit here PLUS a one-key edit on dynadot-api-credentials' `domains` field. No webhook rebuild. - Webhook reads DYNADOT_MANAGED_DOMAINS from the same secret with optional=true, preserving backward compatibility with the legacy single-`domain` secret shape (pre-#108). TXT registry: - --txt-owner-id=$(SOVEREIGN_FQDN), --txt-prefix=_externaldns.<sub>. - Cluster overlays substitute SOVEREIGN_FQDN via the bp-catalyst-platform umbrella so two clusters sharing a parent zone (alpha.omani.works, beta.omani.works) cannot collide. Closes #109. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:45:53 +02:00
hatiyildiz	046e5ebc18	feat(day2-iac): Crossplane Compositions + per-Sovereign Flux cluster tree + catalyst-dns binary Group F deliverables — completes the day-2 IaC layer that takes over after OpenTofu's Phase 0 hand-off (per docs/SOVEREIGN-PROVISIONING.md §4). Three artifacts: 1. platform/crossplane/compositions/ — XRDs + Compositions for canonical Hetzner resources under the canonical compose.openova.io/v1alpha1 group (per BLUEPRINT-AUTHORING.md §8): - XHetznerNetwork + composition-network.yaml — wraps hcloud_network + subnet - XHetznerFirewall + composition-firewall.yaml - XHetznerServer + composition-server.yaml - XHetznerLoadBalancer + composition-loadbalancer.yaml (lb11, 80→31080, 443→31443) - README documenting the canonical pattern 2. clusters/_template/ — the canonical per-Sovereign Flux Kustomization tree. Copied to clusters/<sovereign-fqdn>/ at provisioning time; cloud-init's GitRepository points at the result. - kustomization.yaml (root: flux-system + infrastructure + bootstrap-kit) - flux-system/ (placeholder for Flux self-config customization) - infrastructure/ (provider-hcloud + ProviderConfig referencing hcloud-credentials secret OpenTofu writes) - bootstrap-kit/ — 11 HelmRelease manifests in dependency order: 01-cilium → 02-cert-manager → 03-flux → 04-crossplane → 05-sealed-secrets → 06-spire → 07-nats-jetstream → 08-openbao → 09-keycloak → 10-gitea → 11-bp-catalyst-platform Each pulls from oci://ghcr.io/openova-io/bp-<name>:1.0.0 — the wrapper charts published by blueprint-release CI. dependsOn declarations enforce the canonical install order at runtime. 3. clusters/omantel.omani.works/ — the first concrete Sovereign instance. Mirror of _template with SOVEREIGN_FQDN_PLACEHOLDER substituted to omantel.omani.works. This is what the wizard's first omantel.omani.works run will actually reconcile. 4. products/catalyst/bootstrap/api/cmd/catalyst-dns/main.go — small Go binary the OpenTofu module's null_resource.dns_pool invokes via local-exec at Phase-0 apply time. Reads DYNADOT_API_KEY/SECRET/DOMAIN/SUBDOMAIN/LB_IP env vars; calls existing dynadot.Client.AddSovereignRecords. Containerfile already builds + ships it at /usr/local/bin/catalyst-dns. Architectural compliance (Lesson #24 closed): - No bespoke Go cloud-API calls (Crossplane Compositions are the canonical day-2 IaC) - No exec.Command("helm", ...) (Flux HelmReleases are the canonical install unit) - No kubectl apply from outside (cloud-init kubectl-applies one Flux GitRepository, then Flux owns everything) After this commit, the path is end-to-end: wizard → catalyst-api → tofu apply (with infra/hetzner/) → cloud-init installs k3s + Flux + applies GitRepository pointing at clusters/omantel.omani.works/ → Flux reconciles bootstrap-kit (11 HelmReleases in dependency order) → Crossplane adopts day-2 management.	2026-04-28 14:09:29 +02:00
hatiyildiz	62d9c7d936	fix(charts): drop dependencies block — wrappers carry values overlay only The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks. Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values. This keeps: - blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd) - the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork) - the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>) Changes: - 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package. - 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values. - products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up. After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.	2026-04-28 12:57:29 +02:00
hatiyildiz	441ebaebb8	fix(charts): pin upstream chart versions/names to ones that exist in their repos The first Blueprint Release CI run (commit `8c0f766`) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories: - platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0. - platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally). - platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1. - platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed. Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability. Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions. After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.	2026-04-28 12:55:21 +02:00
hatiyildiz	8c0f76640c	feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit `07b4bcf`) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule. 11 charts created with Chart.yaml + values.yaml + blueprint.yaml each: Network + GitOps: - platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API - platform/flux/chart — wraps flux 2.4.0 - platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest Security: - platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor - platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only) - platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation) Catalyst control-plane services: - platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV) - platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5) - platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode) - platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream) New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55): - platform/spire/README.md — workload identity Catalyst control plane component - platform/nats-jetstream/README.md — control-plane event spine - platform/sealed-secrets/README.md — transient bootstrap-only Each blueprint.yaml declares: - catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3) - visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card) - manifests.chart: ./chart pointer - depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends) .github/workflows/blueprint-release.yaml: - New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder) - Triggers on push to main touching platform//chart/* or products//chart/* - detect job: emits matrix of changed Blueprint folders via git diff - build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation - Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live. After this commit lands, the bootstrap-kit installer in commit `07b4bcf` has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR. Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.	2026-04-28 12:51:06 +02:00
hatiyildiz	7cafa3c894	docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay Component-level architectural correction (two changes): 1. MinIO → SeaweedFS as unified S3 encapsulation layer The old design used MinIO for in-cluster S3 plus separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (seaweedfs.storage.svc:8333). SeaweedFS internally handles hot tier (in-cluster NVMe), warm tier (in-cluster bulk), and cold tier (transparent passthrough to cloud archival storage — Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). One audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore — Velero, CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, Harbor blob store, Application buckets all share one S3 surface. 2. Apache Guacamole added as Application Blueprint §4.5 Communication Clientless browser-based RDP/VNC/SSH/kubectl-exec gateway. Keycloak SSO, full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN+native-client distribution for auditable remote access. Component changes: - DELETED: platform/minio/ - CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section) - CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings) Doc updates: PLATFORM-TECH-STACK §1+§3.5+§4.5+§5+§7.4; TECHNOLOGY-FORECAST L11+mandatory+a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5+§7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count. Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero — now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric. UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts — minio entry replaced with seaweedfs; velero+harbor deps updated; new guacamole entry added. VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer. Verification: zero remaining MinIO references in canonical docs (one intentional retention in TECHNOLOGY-FORECAST L37 explaining the swap); 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole.	2026-04-28 10:23:46 +02:00
hatiyildiz	b2173ae13c	docs(pass-60): valkey REPLICAOF bash example carry-over; NAMING fourth-cycle stable FIRST drift in the new cycle. 6-consecutive-clean streak (54-59) ends at Pass 60. However, drift is Pass-35 carry-over, not new architectural drift — same "incomplete in-file fix" pattern as Pass 31 (openbao L108 vs L127). platform/valkey/README.md L79 had: REPLICAOF primary-valkey.region1.svc.cluster.local 6379 Pass 35 fixed L147 (StatefulSet --replicaof argument) to canonical valkey.<env>.<sovereign-domain> per NAMING §5.2 but the bash command example at L79 retained the older non-canonical form. Fixed L79 to valkey.<env>.<sovereign-domain> matching L147. Methodology lesson #18: Pass-N sweep grep patterns can miss carry-over drift that doesn't match the sweep's specific shape. Pass 35 grep targeted <domain> placeholders; L79 used a fully-qualified hostname with no placeholder, evading the sweep. NAMING-CONVENTION fourth-cycle deep re-read confirmed stable across §1-§11. §4.1 "hfrp" location-code example is for rtz cluster (vs hfmp for mgt) — both valid for different cluster types, not drift. §11 already settled across Pass 37, 42, 50. valkey README banner explicitly establishes "NOT a Catalyst control-plane component" (Pass 26 framing) — exemplary canonical. Convergence: Pass 54-59 = 6 consecutive cleans (nirvana approach met). Pass 60 carry-over fix resets streak but architectural integrity holds. The new cycle audit is doing its job — surfacing carry-over drift the old cycle's specific-shape sweeps missed.	2026-04-28 01:28:00 +02:00
hatiyildiz	9c3d370107	docs(pass-51): flink Strimzi namespace drift; SECURITY clean platform/flink/README.md L137 + L166 used strimzi-kafka-bootstrap.messaging.svc but canonical Catalyst namespace per strimzi README (L100/146/181/191) and debezium (L135) is `databases`. Same Helm-default-vs-Catalyst-convention drift as Pass 41 minio (minio-system → storage). Pass 51 sweep confirmed no other component uses "messaging" as a Catalyst namespace — only generic English usage and K8s API group messaging.knative.dev/v1. Fixed both instances to strimzi-kafka-bootstrap.databases.svc:9093. Port 9093 (TLS) kept — port choice (9092 vs 9093) is a separate architectural question deferred. SECURITY.md re-scan with all current methodology lessons: - §1-§5: clean. Independent-Raft-per-region principle intact. - §6 Keycloak topology: clean. - §7 Rotation policy: SecretPolicy uses canonical catalyst.openova.io/v1alpha1. - §8 Path of a secret: clean. - §9 Compliance posture: borderline OpenSearch SIEM wording re-evaluated; acceptable in context. - §10 Threat model: clean. Methodology note: Helm-default-namespace drift now found across 3 instances (Pass 41 minio, Pass 51 flink). Add cross-component namespace verification to standard checks. Drift found. Consecutive-clean count resets from 2 (49→50) to 0.	2026-04-28 00:31:25 +02:00
hatiyildiz	67aab8f6c1	docs(pass-48): crossplane OpenTofu/XRD group drift; PERSONAS clean platform/crossplane/README.md had three real drift items: 1. §"Terraform vs Crossplane" — Catalyst's canonical bootstrap IaC is OpenTofu (PTS §3.2 + SOVEREIGN-PROVISIONING §3), not Terraform. Renamed section to "OpenTofu vs Crossplane", added intro paragraph clarifying the OSS-fork rationale, updated table rows + Decision. 2. XRD CompositeResourceDefinition example used name: xdatabases.openova.io and group: openova.io. Per BLUEPRINT-AUTHORING §8 (Pass 42 verified canonical), Crossplane XRDs use compose.openova.io group — separate from Catalyst CRDs (catalyst.openova.io). Fixed to xdatabases.compose.openova.io / group: compose.openova.io with inline pointer to BLUEPRINT-AUTHORING §8. 3. Composition compositeTypeRef.apiVersion was openova.io/v1alpha1, fixed to compose.openova.io/v1alpha1. Also corrected Composition metadata.name to database.hcloud.compose.openova.io for naming consistency. Pass 1's API group unification was Catalyst-CRDs-only; Pass 42 verified the separate Crossplane group; Pass 48 catches a downstream consequence where the crossplane README defaulted to bare `openova.io` matching neither canonical form. PERSONAS-AND-JOURNEYS §1-§7 deep re-scan: clean. Pass 22, 33, 39 fixes all intact. Three-pass-touched doc reads consistently. Stable. Banner already correctly enforces "platform plumbing, never user-facing" per ARCHITECTURE §7.4 / GLOSSARY.	2026-04-28 00:10:48 +02:00
hatiyildiz	2a1d6f5d3f	docs(pass-41): SOVEREIGN-PROVISIONING §4 + minio namespace drift across 3 components SOVEREIGN-PROVISIONING.md §4 (Phase 1 Hand-off) "self-sufficient" list had 6 items vs PLATFORM-TECH-STACK §2.3's 6 control-plane supporting services. List was missing SPIRE (5-min rotating SVIDs — critical to SECURITY model) and observability (Grafana stack — Catalyst's self-monitoring). Same drift category as Pass 40: summary list drifted independently from canonical reference. Added both, plus enumerated the §2.1+§2.2 services in the "Catalyst control plane" bullet. Mid-pass sweep finding: kserve L217 used minio.minio-system.svc but canonical minio README declares namespace: storage (L70). Three other components also used minio-system: milvus L78, harbor L145. Fixed all three to align with canonical `storage` namespace per PLATFORM-TECH-STACK §3.5. Drift likely came from Helm-chart upstream defaults. platform/kserve substantively clean apart from namespace fix. Pass 41 lesson: union-equality check applies to ALL summary passages in canonical docs. When a passage enumerates items derived from a canonical source list, count both and verify equality.	2026-04-27 23:21:19 +02:00
hatiyildiz	5744307027	docs(pass-38): surviving "fuse" namespace in temporal; SECURITY + grafana clean Acceptance greps with Pass 37's new literal-domain check and case-insensitive banned-term sweep found one surviving instance: platform/temporal/README.md L272 Worker Deployment had `namespace: fuse`. Pass 26 renamed fuse → fabric; Pass 32+35 fixed temporal's image ref and DNS but the namespace YAML key was missed (eye tracks surrounding structure, skims past `namespace:` value). Renamed to `fabric`. docs/SECURITY.md: clean (deep re-scan §6-§10 per Pass 23 lesson). All sections consistent with canonical model and Pass 7's independent-Raft fix. §9 OpenSearch SIEM wording acceptable as "default destination when SIEM is enabled" rather than "default-installed component" — deferred for optional tightening pass. platform/grafana/README.md: clean. Banner, tiered storage, and OTel instrumentation example all consistent with canonical conventions. Lesson: case-insensitive banned-term grep is non-negotiable. Future passes should always run \bfuse\b and similar legacy-product-name greps regardless of surfaced category.	2026-04-27 22:59:17 +02:00
hatiyildiz	76e68e6182	docs(pass-36): flux deep-scrutiny + sweep gap-fill (Pass 35 head -10 cutoff) Pass 35's sweep grep had `head -10` cutoff that produced a false-clean signal. Pass 36 ran the same grep without truncation, finding 6 surviving drift instances: platform/flux/README.md (5 fixes): - Mermaid diagram: Tenant[Tenant Repos] -> Organization[Organization Repos]. - GitRepository url gitea.<domain> -> gitea.<location-code>.<sovereign-domain>. - Bootstrap command --url=https://gitea.<domain>/... -> canonical form. - Key commands `flux reconcile kustomization tenants` -> `organizations` (Pass 34 was uppercase-only and missed lowercase plural). - Gitea Actions example flux-webhook.<domain> -> location-code form. platform/kyverno/README.md (1 fix): - Mermaid subgraph "Tenant Workload" -> "Organization Workload" (the priority class names tenant-high/tenant-default remain — those are deployed K8s PriorityClass objects requiring recreate-not-rename per Pass 9's deferred-migration note). Methodology lesson: convenience shortcuts in validation produce false-clean signals. From Pass 37 forward: drift sweeps use full grep output (no truncation) and case-insensitive banned-term searches. Validation log Pass 36 entry includes detail on each preserved "multi-tenant" generic adjective use that survived (acceptable feature descriptions, not Catalyst entity references).	2026-04-27 22:49:05 +02:00
hatiyildiz	bc9b90d989	docs(pass-35): completion sweep for surviving DNS placeholders (8 components) Started as gitea + relay atomic check. The gitea fix surfaced surviving <domain> placeholders across 8 other component READMEs that prior sweeps (Pass 29: canonical docs, Pass 32: image registries) hadn't covered. Catalyst control-plane DNS fixes (-> {component}.<location-code>.<sovereign-domain>): - gitea: GITEA_INSTANCE_URL. - external-secrets: openbao ClusterSecretStore + gitea Flux GitRepository. Application DNS fixes (-> {app}.<env>.<sovereign-domain>): - temporal: had two drift items in one line — temporal.fuse.<domain> (old "fuse" product name + wrong placeholder shape). Pass 32 fixed the image ref on the same file but missed this. Now fully de-drifted. - valkey: --replicaof valkey.region1.<domain> (non-canonical region1 segment — Catalyst encodes regions in location-code). - strimzi: kafka-kafka-bootstrap.region1.<domain>:9092 — same. - cnpg: postgres.region1.<domain> cross-region replica host — same. - stunner: STUN/TURN realm — kept canonical Application form for consistency even though STUN realms are nominally opaque. - k8gb: Gslb ingress host app.gslb.<domain> -> app.gslb.<sovereign-domain>. Other illustrative k8gb refs (dnsZone, nslookup examples) preserved as they describe behavior generically. products/relay/README.md: clean. Preserved as correctly-generic: external-dns illustrative refs, cert-manager <domain> (customer-supplied cert names), stalwart <domain> (customer email-receiving domain). Validation log Pass 35 entry: third end-to-end DNS sweep iteration (29 -> 32 -> 35). Future passes should grep for bare <domain> early to catch new instances introduced during edits.	2026-04-27 22:46:16 +02:00
hatiyildiz	70fea3ab8f	docs(pass-34): banned-term TENANT sweep + keycloak hostname drift GLOSSARY's banned term "tenant" survived in Configuration tables and Flux postBuild substitutions across product READMEs as ${TENANT} (uppercase ENV var). Prior banned-term greps searched lowercase `tenant` so the ALL-CAPS form slipped through. Product README fixes: - products/cortex: TENANT/DOMAIN → ORGANIZATION/SOVEREIGN_DOMAIN, plus two DNS placeholder fixes for llm-gateway and chat URLs (same shape Pass 25/31 fixed elsewhere). - products/fingate: 6 instances (Flux substitution, Configuration table, 4 URL templates) renamed. URL shape api.openbanking.<org>.<sov-dom> flagged as 4-segment FQDN that doesn't match NAMING §5.1 or §5.2 — deferred to a deeper architectural pass. - products/fabric: Configuration table row renamed. Component README: - platform/keycloak: shared-sovereign hostname auth.<sovereign-domain> and per-organization auth.<org>.<sovereign-domain> both missing <location-code> per NAMING §5.1. Fixed. platform/librechat ${TENANT_ID} preserved — that's Microsoft Azure AD tenant-ID (external technology, exempted by GLOSSARY). Validation log Pass 34 entry includes meta-note: always run a global grep for the surfaced drift category before closing a pass, to avoid the asymmetric-drift problem Pass 25 warned against.	2026-04-27 22:42:50 +02:00
hatiyildiz	4043e1d51c	docs(pass-32): registry-DNS sweep — harbor.<domain> across 9 component READMEs Pass 25's deferred sweep, executed. Image refs of the form harbor.<domain>/... (and one registry.<domain>/... in temporal) collapse the location-code segment. Per NAMING §5.1, Catalyst per-host-cluster Harbor DNS is harbor.{location-code}.{sovereign-domain} (e.g. harbor.hfmp.openova.io). Fixed (11 instances, 9 files): - anthropic-adapter, bge (×2), debezium, harbor (×2 — ingress + Kyverno policy), knative (×2 — serving + traffic-split), llm-gateway, strimzi, trivy — all standardized to harbor.<location-code>.<sovereign-domain>. - temporal had two drift items in one line: registry.<domain> (off-spec placeholder — Catalyst's only per-host-cluster registry is Harbor) AND legacy "fuse" namespace (renamed to bp-fabric per BUSINESS-STRATEGY §16.2 / Pass 26). Rewritten to fabric/order-worker. Out of scope (deliberate): :latest tag hygiene, and whether Application Blueprint READMEs should reference ghcr.io/openova-io/bp-<name>:<semver> vs the Sovereign Harbor mirror. Stalwart customer-email-domain <domain> placeholders preserved (correct semantics). external-dns illustrative gslb/api/svc.<domain> preserved (upstream-doc generic). With Pass 29 (canonical-doc DNS) + Pass 31 (carry-over fixes) + Pass 32 (image registry), the recurring DNS-placeholder collapse drift category is addressed end-to-end. Validation log Pass 32 entry added.	2026-04-27 22:36:39 +02:00
hatiyildiz	3993f5fc31	docs(pass-31): openbao + librechat DNS-placeholder carry-over fixes platform/openbao/README.md ingress hosts (line 108) had `bao.<domain>` while the same file's ClusterSecretStore example (line 127) used the canonical `bao.<location-code>.<sovereign-domain>` form. Pass 7's active-active fix addressed the body but missed the ingress placeholder. Aligned with the canonical form. platform/librechat/README.md OAuth callback (line 154) had `chat.ai-hub.<domain>/oauth/openid/callback` — same Application-endpoint shape Pass 25 fixed in llm-gateway. Pass 22 marked the file clean and Pass 29 fixed the Keycloak issuer line but didn't re-sweep. Per NAMING §5.2 Application endpoints are `{app}.{environment}.{sovereign-domain}`. Fixed. docs/GLOSSARY.md verified clean — single-source-of-truth has held across the loop (Pass 6/7/14/20/22/26/27 all consistent with current GLOSSARY). Validation log Pass 31 entry includes meta-note: third file (librechat) that needed re-opening after a "clean" mark — banner scans miss YAML-block drift. Future passes should default to a full placeholder-shape grep on every file touched.	2026-04-27 22:34:10 +02:00
hatiyildiz	4793cab8b6	docs(pass-29): DNS-placeholder sweep across canonical docs The recurring drift: Catalyst control-plane DNS placeholders that omit the <location-code> segment, producing forms like gitea.<sovereign>, gitea.<sovereign>.<domain>, gitea.<sovereign-domain>, keycloak.<domain>. Per NAMING §5.1 the canonical form is {component}.{location-code}.{sovereign-domain} (e.g. gitea.hfmp.openova.io). The shorter forms aren't just abbreviations — they collapse the multi-region location dimension and re-drift every time a reader reads them as obvious shorthand. Fixes: - CLAUDE.md "Customer Sync" — both gitea.<sovereign>/catalog/... lines. - docs/SOVEREIGN-PROVISIONING.md §3 DNS-records bullet (3 lines) + §5 Day-1 login line. - docs/ARCHITECTURE.md §4 write-path Gitea label. - docs/BLUEPRINT-AUTHORING.md §6.4 private-Blueprint Studio target. - platform/librechat/README.md Keycloak issuer (Pass 22 marked clean and missed this — banner scans miss YAML-block drift). platform/nemo-guardrails/README.md verified clean. Final grep confirms only canonical forms remain. Validation log Pass 29 entry added with the recurring-drift-pattern note for future passes.	2026-04-27 22:30:41 +02:00
hatiyildiz	2c886daa52	docs(pass-25): llm-gateway DNS placeholders + IMPLEMENTATION-STATUS clean platform/llm-gateway/README.md had three malformed DNS placeholders: - KEYCLOAK_URL collapsed location-code + sovereign-domain into <domain> and used Application namespace `ai-hub` as a Keycloak realm name. Per NAMING §7 and SECURITY §7, Keycloak realms are per-Org in SME-style or per-Sovereign in corporate-style — never per-Application-namespace. Fixed to `keycloak.<location-code>.<sovereign-domain>/realms/<org>`. - ANTHROPIC_BASE_URL and `claude config set api_base` examples used `llm-gateway.ai-hub.<domain>/v1` — but NAMING §5.2 establishes Application endpoints as `{app}.{environment}.{sovereign-domain}`. Fixed to `llm-gateway.<env>.<sovereign-domain>/v1`. docs/IMPLEMENTATION-STATUS.md confirmed clean: CRD list, surfaces, and control-plane component list all match canonical docs. Sweep concern logged for `harbor.<domain>` / `:latest` image patterns appearing across many platform READMEs — to be addressed in a dedicated sweep pass rather than asymmetrically here. Validation log Pass 25 entry added.	2026-04-27 22:22:32 +02:00
hatiyildiz	5f028d1b6a	docs(pass-20): SOVEREIGN-PROVISIONING placement YAML + Kyverno label drift Pass 20 — drift-detection on SOVEREIGN-PROVISIONING + platform/kyverno. Two real findings. SOVEREIGN-PROVISIONING.md §8: - "Existing Applications with `placement: active-active: false, single-region` do not migrate automatically" — invalid YAML mixing a boolean with an enum. The canonical placement model (per GLOSSARY) has `placement.mode: single-region \| active- active \| active-hotstandby`, no boolean toggle. - Rewrote: "Existing Applications with `placement.mode: single- region` ... user explicitly switches Placement to active-active (or active-hotstandby) and adds the new region to placement.regions". platform/kyverno/README.md: - Policy V5 (minimum-replicas-production) targeted namespaces labeled `openova.io/env: production` — out-of-spec label name AND value. NAMING-CONVENTION §6 establishes `openova.io/env-type: prod` (hyphen-form, short value). - Fixed to `openova.io/env-type: prod`. Both findings show the same pattern: schema-level details that survive grep-based banned-term checks but contradict the canonical spec when read in body. VALIDATION-LOG: Pass 20 entry added. Refs #37	2026-04-27 22:06:24 +02:00
hatiyildiz	b467dc3f3b	docs(pass-18): NAMING DR-as-env_type misexample + Keycloak deployment topology Pass 18 — drift-detection on NAMING-CONVENTION + platform/keycloak. Two real findings. NAMING-CONVENTION §11.1: - The example list of Catalyst Environments included `bankdhofar-dr` — but `dr` is NOT a valid env_type. Canonical values per §2.4 are prod / stg / uat / dev / poc. DR is a Placement mode (active-active / active-hotstandby across regions inside the -prod Environment), not a separate Environment. - Replaced `bankdhofar-dr` with `bankdhofar-uat` and added an explicit "DR is a Placement, not an Env Type" note. platform/keycloak/README.md: - Keycloak Deployment YAML example used `namespace: open-banking` with 2 replicas — Fingate-specific narrative that contradicted the per-Org / per-Sovereign topology stated in the banner. Rewrote with two side-by-side examples: shared-sovereign (3 HA replicas, catalyst-keycloak namespace, CNPG-backed) * per-organization (1 replica in <org> namespace, optional embedded DB for smallest SME tier) - HA section was a single set of claims (2+ replicas, CNPG, Infinispan) that only matched corporate. Now branches on topology — corporate gets HA + Infinispan, SME gets single replica with restart-on- deploy as acceptable for tier SLAs. Same kind of drift Pass 17 caught in Harbor: banner says one thing, body still describes the older model. Both fixed. VALIDATION-LOG: Pass 18 entry added. Refs #37	2026-04-27 22:00:42 +02:00
hatiyildiz	eff264b077	docs(pass-17): ARCHITECTURE OAM table pipe-fix + Harbor README de-drift Pass 17 — drift-detection sweep on ARCHITECTURE + harbor. Two real findings. ARCHITECTURE §13 (OAM table): - `\| Trait \| Blueprint overlay (`overlays/small\|medium\|large`) \|` has pipe chars inside backticks inside a Markdown table cell — a known GFM rendering hazard. Replaced with comma-separated examples. platform/harbor/README.md: - The banner added in Pass 9 said "every host cluster runs a Harbor instance" but the body still described an older "Harbor Primary / Harbor Replica" cross-region replication topology. Same shape of architectural drift Pass 7 caught in OpenBao/ESO/Gitea/Flux — banner-add doesn't rewrite the body. - Three sections rewritten: * Overview mermaid: now shows upstream-OCI → multiple independent per-cluster Harbors with local Trivy scan + local Pod pulls. * "Multi-Region Replication" → "Per-host-cluster mirroring (NOT primary-replica)". Single source of truth = upstream OCI (ghcr.io/openova-io/* for Catalyst+Blueprints, customer CI for application images), not a "primary Harbor". * Example replication policy: was a `dest_registry` cross-region push policy → now a pull-mirror policy from ghcr.io with scheduled-cron trigger. - "Why Mandatory" table reframed in per-host-cluster terms. VALIDATION-LOG: Pass 17 entry added with the specific drift-detection lesson — banner-addition passes don't catch body-level drift; need explicit body re-reads. Refs #37	2026-04-27 21:58:53 +02:00
hatiyildiz	b6a374df26	docs(pass-15): final banner sweep — 52/52 platform components covered, convergence achieved Pass 15 swept all 52 platform/*/README.md files for the role-in- Catalyst banner. 3 still lacked one (cnpg, flux, strimzi) and got banners added: - cnpg (§4.1): production Postgres; underlying engine for FerretDB + Gitea metadata. - flux (§3.2): per-vcluster Flux + host-level Flux for Catalyst itself; pulls from single per-Sovereign Gitea. - strimzi (§4.1): Application-tier event streaming; NOT the Catalyst control-plane spine (which uses NATS JetStream). Same upstream- tech-different-tier disambiguation pattern as Valkey. CONVERGENCE: 52 / 52 platform components have role-in-Catalyst banners. All cross-refs resolve. No banned terms. No architectural drift detected on this pass. VALIDATION-LOG: Pass 15 entry + "Convergence achieved (initial banner sweep)" marker added. The validation loop continues per the standing instruction — but subsequent passes will be brief drift-detection sweeps rather than systematic rewrites. Refs #37	2026-04-27 21:53:27 +02:00
hatiyildiz	9b3211fdee	docs(pass-14): banners on workflow / analytics / metering / chaos / valkey (7 components) Seven more Application Blueprint banners landed: - temporal (§4.3): durable workflow orchestration; bp-fabric. - flink (§4.3): stream + batch processing; bp-fabric. - debezium (§4.2): CDC into Strimzi/Kafka; bp-fabric pipeline source. - iceberg (§4.4): open table format on MinIO + archival S3. - openmeter (§4.8): API metering for bp-fingate. - litmus (§4.9): chaos engineering required by DORA / NIS2. - valkey (§4.1): banner explicitly states NOT a Catalyst control- plane component — control plane uses NATS JetStream KV per ARCHITECTURE §5 / GLOSSARY event-spine. Valkey is Application-tier caching only. This is the disambiguation that PLATFORM-TECH-STACK §1 establishes ("same upstream technology can serve in multiple categories") — pinned in the per-component README so it can't be misread. VALIDATION-LOG: Pass 14 entry added. Refs #37	2026-04-27 21:52:03 +02:00
hatiyildiz	b021aaa57e	docs(pass-13): role-in-Catalyst banners on 4 Communication Application Blueprints All 4 communication components (composing under bp-relay) got role- in-Catalyst banners pointing at PLATFORM-TECH-STACK §4.5: - stalwart: JMAP/IMAP/SMTP self-hosted email. - livekit: WebRTC SFU for video/audio/data; pairs with STUNner. - stunner: K8s-native TURN/STUN for WebRTC NAT traversal. - matrix: Matrix protocol via Synapse server. Banner explicitly disambiguates "Synapse" as the chat-server implementation, NOT the deprecated OpenOva product noun (retired in favor of bp-axon). All 4 are explicitly Application Blueprints, NOT Catalyst control plane. VALIDATION-LOG: Pass 13 entry added. Refs #37	2026-04-27 21:50:05 +02:00
hatiyildiz	9d95043ccc	docs(pass-12): role-in-Catalyst banners on 11 AI/ML Application Blueprints All AI/ML component READMEs got banners pointing at PLATFORM-TECH- STACK §4.6 (AI/ML) or §4.7 (AI safety + observability), and noting composition under bp-cortex (composite AI Hub Blueprint): - knative: serverless for KServe-managed inference. - kserve: K8s-native model serving for vLLM, BGE, custom. - vllm: default LLM inference runtime. - milvus: vector database for RAG retrieval. - neo4j: knowledge-graph-augmented retrieval alongside Milvus. - librechat: default chat surface, fronts LLM Gateway via Guardrails. - bge: embedding generation + reranking. - llm-gateway: outbound LLM routing (Claude, GPT-4, vLLM, Axon). - anthropic-adapter: OpenAI-SDK → Anthropic translation. - nemo-guardrails: AI safety firewall. - langfuse: LLM observability (latency, tokens, cost, eval). All 11 are explicitly Application Blueprints — NOT Catalyst control plane. Catalyst's own observability stack (Grafana/OTel) covers infrastructure; LangFuse covers AI-specific dimensions (prompt/response/eval). VALIDATION-LOG: Pass 12 entry added. Refs #37	2026-04-27 21:47:45 +02:00
hatiyildiz	e9514b410d	docs(pass-11b): retry banners on failover-controller/trivy/clickhouse/ferretdb (Edit needed Read first)	2026-04-27 21:45:56 +02:00
hatiyildiz	ae540269c4	docs(pass-11): banners on 7 more components + MinIO ILM label disambiguation 7 more component READMEs got role-in-Catalyst banners: Per-host-cluster infrastructure: - minio (§3.5): S3 fast-tier; tiers cold to cloud archival. - velero (§3.5): K8s backup to archival S3 (NOT MinIO — that's fast-tier; backups land in cloud archival). - failover-controller (§3.6): lease-based split-brain protection layered on k8gb; pointers to SRE §2.4 (witness pattern) + SECURITY §5.2 (OpenBao DR promotion). - trivy (§3.3): CI + registry + runtime scan chain. Application Blueprints (NOT control plane): - opensearch (§4.1): explicitly framed as Application Blueprint — installed when an Org wants SIEM / full-text search / log analytics. - clickhouse (§4.1): used by bp-fabric and SIEM cold-storage tier. - ferretdb (§4.1): replication piggybacks on underlying CNPG. MinIO ILM disambiguation: - The Mermaid diagram had `ILM[Lifecycle Manager]` — confusable with the rejected Catalyst sub-product (per banned-terms list). Relabeled to `ILM[Information Lifecycle Manager - MinIO ILM]` to make clear it's MinIO's own feature, not the deprecated Catalyst Lifecycle Manager noun. VALIDATION-LOG: Pass 11 entry added. Refs #37	2026-04-27 21:45:28 +02:00
hatiyildiz	5834daec14	docs(pass-10): banners on 7 more components + opentofu active-active drift fix 7 more component READMEs got role-in-Catalyst banners: - vpa, keda, reloader → per-host-cluster scaling/ops layer (§3.4). Reloader specifically calls out its role in Catalyst's secret- rotation flow (rolling deploy on K8s Secret hash change). - external-dns → per-host-cluster DNS-sync (§3.1); pairs with k8gb for the GSLB zone separation. - coraza → DMZ-block WAF on every host cluster (§3.1). - crossplane → per-Sovereign on the management cluster (§3.2); banner explicitly emphasizes the agreed "never a user-facing surface" rule (Users don't write Compositions in Application configs; Blueprint authors and advanced contributors do). Cross- references the no-fourth-surface clause in ARCHITECTURE §4/§7 and the Crossplane Composition section in BLUEPRINT-AUTHORING §8. - opentofu → repositioned as Phase-0-only, runs on `catalyst- provisioner` only, NOT installed on host clusters at runtime. opentofu drift fixes (uncovered by line-by-line read): - Section 5 line 182: "Bootstrap Wizard prompts for cloud credentials" → "Catalyst Bootstrap (Phase 0) prompts for cloud credentials" (banned term). - Same section line 186: "ESO PushSecrets sync to both regional OpenBao instances" — the active-active drift Pass 7 corrected elsewhere, still here. Replaced with "writes go to the primary OpenBao region only; replicas pick up via async perf replication". VALIDATION-LOG: Pass 10 entry added. Refs #37	2026-04-27 21:43:45 +02:00
hatiyildiz	a52bda30cb	docs(pass-9b): retry banners on harbor / falco / sigstore / syft-grype Pass 9's commit `ea81c38` only landed banners on grafana + kyverno — the harbor / falco / sigstore / syft-grype edits failed because the Edit tool requires a Read pass per file before write. Now Read'd and applied: - harbor: per-host-cluster registry, pointer to PLATFORM-TECH-STACK §3.5. - falco: per-host-cluster runtime security, pointer to §3.3 + SRE §10 (SIEM/SOAR pipeline). - sigstore: cosign signing chain on every Blueprint OCI artifact, Kyverno admission verifies signatures. - syft-grype: CI-side SBOM + runtime CVE matching. Pass 9 now complete. Refs #37	2026-04-27 21:41:22 +02:00
hatiyildiz	ea81c38e15	docs(pass-9): role-in-Catalyst banners on grafana / harbor / falco / kyverno / sigstore / syft-grype Pass 9 — six more component READMEs got Catalyst-role banners matching the rule of thumb in CLAUDE.md (every platform/<x>/README.md should state its role in Catalyst). - grafana: observability stack on every host cluster; Catalyst's own self-monitoring + Application telemetry flows here. - harbor: per-host-cluster container registry for Catalyst images, mirrored Blueprint OCI artifacts, customer images. - falco: runtime security on every host cluster; feeds SIEM/SOAR. - kyverno: policy engine on every host cluster; enforces Catalyst policy contracts (cosign on Blueprints, default-deny NetworkPolicies on Organization namespaces, priority-class injection). - sigstore: cosign-signed Blueprint OCI artifacts + admission verification chain on every host cluster. - syft-grype: SBOM generation in CI per Blueprint + runtime CVE scans. Plus Kyverno priority-class clarification: prose around `tenant-high` / `tenant-default` / `tenant-batch` priority class names now reads "Organization workloads" instead of "tenant workloads", with an explicit note that the priority class artifact names themselves stay as-is until a separate migration ticket renames them in deployed clusters (renaming PriorityClass objects requires recreate, not in-place rename). VALIDATION-LOG: Pass 9 entry added. Refs #37	2026-04-27 21:40:51 +02:00
hatiyildiz	14ed84de41	docs(pass-8): role-in-Catalyst banners + dead-link fix in component READMEs Pass 8 — line-by-line read of platform/cnpg, platform/strimzi, platform/k8gb, platform/keycloak, platform/cert-manager, platform/cilium. CNPG and Strimzi: read in full and confirmed clean — they correctly position themselves as Application Blueprints and don't drift from the canonical model. CNPG's `<org>-postgres-dr` cluster name (Application-tier database role) is acceptable per NAMING-CONVENTION §1.3 (which only forbids primary/dr in K8s host-cluster names, not in Application-internal CRD names). Four READMEs updated: k8gb: - Header reframed: per-host-cluster infrastructure pointer to PLATFORM-TECH-STACK §3.1 and SRE §2.4 split-brain protection. - Removed dead link to ../failover-controller/docs/ADR-FAILOVER- CONTROLLER.md (the failover-controller folder has no docs/); replaced with link to that component's README + SRE §2.4. keycloak: - Header reframed from "FAPI Authorization Server for Open Banking" (narrow) to "User identity for Catalyst Sovereigns" (broad). Keycloak handles ALL user identity in Catalyst, not just FAPI. - Added per-Org / per-Sovereign topology callout matching SECURITY §6. Clarified that "Multi-tenant TPP" refers to PSD2 Third Party Providers, not Catalyst's Organization-level multi-tenancy. - FAPI features kept since Keycloak still serves Fingate as the FAPI Authorization Server. cert-manager: - Header reframed as per-host-cluster infrastructure with pointer to PLATFORM-TECH-STACK §3.3. cilium: - Header reframed as per-host-cluster infrastructure with pointer to PLATFORM-TECH-STACK §3.1, including the install-first note (CNI must come before any other workload during Phase 0). VALIDATION-LOG: Pass 8 entry added. Refs #37	2026-04-27 21:39:03 +02:00
hatiyildiz	a5ffa1a716	docs(pass-7): align Gitea + Flux multi-region story; fix broken mermaid id Continuing Pass 7 cleanup after the OpenBao/ESO rewrite (`42aeb62`). Gitea README: - Was describing "Bidirectional mirroring for multi-region" with two Gitea instances mirroring repos cross-region. Wrong: Catalyst's agreed model has one Gitea per Sovereign on the management cluster (PLATFORM-TECH-STACK §2.3). Replaced the multi-region mirror diagram with a single-Gitea + intra-cluster HA topology and added a "Why not cross-region bidirectional mirror" explainer (write- conflict semantics would break EnvironmentPolicy enforcement). - Status banner: notes the canonical references. - Backup section: removed "Repository mirror for redundancy" (replaced with Velero scheduled backups). Flux README: - "Multi-Region GitOps" section was showing one Gitea per region with bidirectional mirror. Replaced with one Gitea per Sovereign topology. Per-vcluster Flux pulls from this single Gitea. Mermaid syntax bug: - Earlier mass replace_all of "Catalyst IDP" → "Catalyst console" had left an invalid mermaid node identifier `Catalyst console[Catalyst console]` (mermaid forbids spaces in node IDs). Fixed to `Console[Catalyst console]`. Would have rendered as a broken diagram on GitHub. VALIDATION-LOG: Pass 7 entry added documenting the OpenBao/ESO active-active rewrite (the most consequential drift fix in any pass). Refs #37	2026-04-27 21:36:20 +02:00
hatiyildiz	42aeb629bb	docs(pass-7): rewrite OpenBao + ESO READMEs to match agreed multi-region semantics Pass 7 — line-by-line read of platform/openbao/README.md and platform/external-secrets/README.md found a major architectural drift: both files described an OLD active-active bidirectional sync model that contradicts docs/SECURITY.md §5 (the canonical reference). The active-active design was rejected during the architecture session because it would have been a stretched cluster — a single region's network blip would block writes everywhere. The agreed model is: - Independent Raft cluster per region (intra-region quorum only). - Single-primary writes; replicas accept reads only. - Async Performance Replication primary → replicas (lag <1s typical). - Explicit DR promotion (sovereign-admin or failover-controller). Fixes: platform/openbao/README.md: - Overview: removed "active-active deployments" / "either region can update secrets". Replaced with "independent Raft cluster per region", "asynchronous Performance Replication". - Architecture diagram: replaced bidirectional-push diagram with the primary→replicas async perf replication topology that matches SECURITY.md §5. - ClusterSecretStores: simplified from "two stores (local+remote)" to "one local store"; reads always pull locally. - Renamed "PushSecret (Bidirectional)" → "Writes go to the primary region" with a single-target PushSecret pointing at bao-primary. - Added DR promotion section pointing at SECURITY.md §5.2. - Status banner: notes that the canonical multi-region reference is SECURITY.md. platform/external-secrets/README.md: - Header line: repositioned as per-host-cluster infrastructure with pointer to PLATFORM-TECH-STACK §3.3. - Removed broken link to non-existent ../openbao/docs/ADR-OPENBAO.md (replaced with link to ../openbao/README.md). - "Multi-region sync \| Push to both OpenBao instances simultaneously" → "Multi-region reads \| Async perf replication". - "PushSecret to Multiple OpenBao Instances" example was writing to two ClusterSecretStores in parallel — replaced with single-target primary write. - "Multi-region sync via single PushSecret" in Consequences → "Cross-region availability via Performance Replication". - Mermaid sequence diagram: "Bootstrap Wizard" actor → "Catalyst Bootstrap (Phase 0)"; "Terraform" → "OpenTofu"; ESO connection description "via K8s auth" → "via SPIFFE SVID (workload identity)". These were the most consequential drift fixes found in any pass — two READMEs were documenting an architecture explicitly rejected by the agreed model. Refs #37	2026-04-27 21:34:09 +02:00
hatiyildiz	d6a51b8a7a	docs(pass-2): final entity-noun sweep — external-secrets sequence diagram Pass 2 — fresh-eyes sweep across the entire docs tree. One residual entity-noun usage found: - platform/external-secrets/README.md:75 (in a Mermaid sequence diagram): "Note over Wizard: Operator saves unseal keys offline" — "Operator" used as person/entity. Renamed to "sovereign-admin" to match the role from GLOSSARY.md. All other banned-term sweeps clean: - No tenant (architectural) anywhere. - No Catalyst IDP anywhere. - No Synapse-as-product anywhere (only the legitimate "Matrix/Synapse server" usages). - No workspace-controller (only the banned-term entries that define the rename). - No capital-W Workspace as Catalyst scope. - No github.com/openova (without -io). - All cross-doc Markdown links resolve. - All §X references resolve to the new section numbering after PLATFORM-TECH-STACK reorg. - API group catalyst.openova.io/v1alpha1 consistent across 6 references. - OCI artifact prefix `bp-` consistent across README, CLAUDE, BLUEPRINT-AUTHORING, IMPLEMENTATION-STATUS. Other "Operator" mentions intentionally retained (legitimate technical usage): - "External Secrets Operator (ESO)", "Trivy Operator" — K8s Operator pattern (controllers), explicitly allowed by GLOSSARY. - "Operator compatibility" in BUSINESS-STRATEGY's OpenShift migration table — refers to compatibility with K8s Operators (the technology), not as an entity/role. Refs #37	2026-04-27 21:18:55 +02:00
hatiyildiz	119a1e53a0	docs(components): terminology pass across platform and product READMEs Bring per-component READMEs in line with the canonical glossary (docs/GLOSSARY.md). Substantive architectural content unchanged — this is a terminology + reference correctness pass. Placeholder rename: <tenant> → <org> in YAML / IaC examples across - platform/cnpg/README.md (Cluster + Pooler + ScheduledBackup) - platform/debezium/README.md (PostgreSQL connector + topic patterns) - platform/external-secrets/README.md (ExternalSecret / SecretStore) - platform/grafana/README.md (Instrumentation namespace) - platform/k8gb/README.md (Gslb + namespace + kubectl examples) - platform/keda/README.md (ScaledObject + Kafka triggers + Prometheus) - platform/opentofu/README.md (server resource example) - platform/velero/README.md (BackupStorageLocation buckets) - platform/vpa/README.md (VerticalPodAutoscaler examples) - platform/flux/README.md (kustomization name + tenants/ → organizations/) "Catalyst IDP" → "Catalyst console": - platform/crossplane/README.md (integration section retitled and rewritten — Crossplane is platform plumbing, not user-facing) - platform/gitea/README.md (architecture diagram + integration table) - platform/kyverno/README.md (rollout tracking surface) - products/fingate/README.md (TPP onboarding portal) "Bootstrap wizard" → "Catalyst bootstrap": - platform/openbao/README.md (bootstrap procedure rewritten — independent Raft per region clarified; cross-references docs/SECURITY.md §5) - platform/opentofu/README.md (Quick Start) Kyverno labels & prose: - openova.io/tenant → openova.io/organization (label rename for consistency; deployed clusters will add new label as a co-label during migration window) - "tenant labels" / "tenant namespace" prose updated to "Organization labels" / "Organization-labeled namespace" - Priority class names (tenant-high, tenant-default, tenant-batch) retained as deployed artifact names — rename pending in a separate migration ticket No banned-term hits remain in component READMEs (verified by grep in docs/GLOSSARY.md banned-terms table). Refs #37	2026-04-27 20:06:51 +02:00
talent-mesh	435f49738d	feat: restructure platform to 52 components and 9 products Technology forecast and strategic review restructure: - Remove 13 components (backstage, mongodb, activemq, vitess, airflow, camel, dapr, superset, searxng, langserve, trino, lago, rabbitmq) - Add 10 components (sigstore, syft-grype, nemo-guardrails, langfuse, reloader, matrix, ferretdb, litmus, livekit, coraza) - Rename product: Synapse → Axon (SaaS LLM Gateway) - Merge products: Titan + Fuse → Fabric (Data & Integration) - New product: Relay (Communication) - Replace Backstage with Catalyst IDP - Replace MongoDB with FerretDB (MongoDB wire protocol on CNPG) - Add supply chain security (Sigstore/Cosign, Syft+Grype) - Add AI safety and observability (NeMo Guardrails, LangFuse) - Add technology forecast 2027-2030 document - Full verification pass: zero stale references across all docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 21:00:19 +00:00
talent-mesh	10245dff98	feat: ecosystem expansion to 55 components with license compliance - Replace BSL-licensed components with open-source alternatives: Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0), Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0) - Add 14 new platform components: activemq, camel, clickhouse, dapr, debezium, falco, flink, iceberg, opensearch, rabbitmq, superset, temporal, trino, vitess - Rename meta-platforms/ to products/ with new product names: Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse), Fuse (Microservices Integration) - Update all documentation, READMEs, and cross-references Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 18:15:11 +00:00
talent-mesh	bb53df55bb	docs: comprehensive Kyverno policy matrix for resilience and zero-trust Cover 44 policies across generate (VPA, PDB, NetworkPolicy, ResourceQuota, LimitRange), mutate (topology spread, anti-affinity, security context, seccomp, Harbor image rewrite, priority class), and validate (resource requests, health probes, min replicas, pod security restricted profile, image supply chain, network zero-trust, RBAC hardening). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 05:29:05 +00:00
talent-mesh	c9d04a53b4	refactor: flatten platform/ structure (41 components) Remove hierarchical grouping (networking/, security/, etc.) and use flat structure for all 41 platform components. Changes: - All components now directly under platform/ (no subfolders) - AI Hub components moved from meta-platforms/ai-hub/components/ to platform/ - Open Banking components (lago, openmeter) moved to platform/ - meta-platforms/ now only contains README files that reference platform/ - Open Banking custom services remain in meta-platforms/open-banking/services/ Structure: - platform/ (41 components, flat) - meta-platforms/ai-hub/ (README only, references platform/) - meta-platforms/open-banking/ (README + 6 custom services) All documentation links updated. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 15:19:48 +00:00
talent-mesh	49f8bbc84d	refactor: move harbor to registry/, kyverno to policy/ - Harbor moved from storage/ to registry/ (artifact management, not storage) - Kyverno moved from security/ to policy/ (policy engine for validation, mutation, generation - broader than just security) Updated structure: - platform/registry/harbor/ - platform/policy/kyverno/ All documentation links updated accordingly. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 11:53:21 +00:00
talent-mesh	535710289c	feat: create OpenOva monorepo structure Consolidate all component repos into a single monorepo: - core/: Bootstrap + Lifecycle Manager application - platform/: Individual component blueprints organized by category - networking/ (cilium, k8gb, external-dns, stunner) - security/ (cert-manager, external-secrets, vault, kyverno, trivy) - observability/ (grafana stack) - storage/ (minio, harbor, velero) - scaling/ (keda, vpa) - failover/ (failover-controller) - gitops/ (flux, gitea) - idp/ (backstage) - data/ (cnpg, mongodb, valkey, redpanda) - communication/ (stalwart) - iac/ (terraform, crossplane) - identity/ (keycloak) - meta-platforms/: Bundled vertical solutions - ai-hub/ (enterprise AI platform) - open-banking/ (PSD2/FAPI fintech sandbox) - docs/: Platform documentation (PLATFORM-TECH-STACK.md, SRE.md) All internal links updated to use relative paths within monorepo. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 10:53:18 +00:00

1 2 3 4 5

248 Commits