openova

Author	SHA1	Message	Date
e3mrah	06bea550ff	feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin (#1874 ) The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml, bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish commits land on main, but it cannot detect the "chart bumped but never published" failure mode: the bootstrap-kit pin points at a chart version that GHCR never received because blueprint-release.yaml failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep, runner cancellation, transient GHCR push 5xx). Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180 and 1.4.181 were "lost" during the TBD-A20 scanner break window (21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist until A58 manually re-fired the workflow via dispatch. Fresh Sovereigns silently fell back to the last working tag. What this adds - scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and optional `--ghcr-org <org>`). For every chart pinned in the kit, it lists ghcr.io/<org>/<chart> tags via `gh api /orgs/<org>/packages/container/<chart>/versions --paginate`, then asserts the pinned version appears. Exits 1 on any missing tag. - A per-chart tag cache avoids redundant paginations. - .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now passes `--check-ghcr` on `push` to main + `workflow_dispatch` (PR mode stays `--changed-only` and skips GHCR — PRs cannot publish to GHCR anyway). The job stays `continue-on-error: true` under the same observational umbrella as the existing post-merge full sweep so a transient API blip cannot red-flag every chart bump; the missing-tag list still surfaces on the run summary for operator attention. - Job grants `packages: read` so the workflow GITHUB_TOKEN can list private package versions. 
Verification (origin/main snapshot, 2026-05-19)
- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags present: PASS, exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13 both set to a non-existent 99.99.99, the script exits 1 with `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the remediation hint pointing at `gh workflow run blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean exit 0 with the existing "nothing to check" message.
Refs #1872, #1864, #1856. Closes #1872
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:12:13 +04:00
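The `--check-ghcr` behaviour described in this commit (list a chart's tags once, then assert the pinned version is among them) can be sketched roughly as below. This is an illustrative Python sketch, not the shell script itself: `find_missing_pins` and the `list_tags` callback are hypothetical names, and the callback stands in for the paginated `gh api` call so the logic can be exercised without network access.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical callback type: given a chart name, return every published tag.
TagLister = Callable[[str], Iterable[str]]


def find_missing_pins(
    pins: Iterable[Tuple[str, str]], list_tags: TagLister
) -> List[Tuple[str, str]]:
    """Return the (chart, version) pins whose version is not a published tag."""
    # Per-chart tag cache: each chart's tag list is fetched at most once,
    # even if several slots pin the same chart.
    cache: Dict[str, FrozenSet[str]] = {}
    missing: List[Tuple[str, str]] = []
    for chart, version in pins:
        if chart not in cache:
            cache[chart] = frozenset(list_tags(chart))
        if version not in cache[chart]:
            missing.append((chart, version))
    return missing


def exit_code(missing: List[Tuple[str, str]]) -> int:
    """Mirror the described contract: exit 1 on any missing tag, else 0."""
    return 1 if missing else 0
```

A caller would pass the kit's pins and a lister backed by the registry API, print one line per missing pair, and exit with `exit_code(missing)`.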
hatiyildiz	69f2d7d91a	fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864 ) Root cause of the auto-bump-pin miss flagged in #1864. The Blueprint Release workflow has been in `startup_failure` since PR #1858 (commit `cf35b4a`) merged at 21:04:22Z. The lockstep step's multi-line shell heredoc inside a `run: \|` block-scalar: if [ ... ]; then msg="deploy(...) (auto, Refs TBD-A6) <-- literal blank line Also locksteps platform blueprint.yaml ..." <-- column 1, no indent is interpreted by the YAML scanner as the END of the block-scalar at the blank line, and the next column-1 line is then parsed as a new top-level mapping key — which fails because the previous mapping isn't terminated. The whole workflow file is rejected at workflow- startup time. Verified with `python3 -c yaml.safe_load(...)` (raises `ScannerError: could not find expected ':' line 815`) and by `gh api .../actions/runs/26060392136` returning `conclusion=failure, status=completed, jobs: []` for every push since `cf35b4a`. Consequence: no chart bump since `cf35b4a` has triggered the TBD-A6 auto-bump-pin or the TBD-A20 blueprint.yaml lockstep. PR #1865 was the manual catch-up for bp-newapi (1.4.20 -> 1.4.21); without this fix every future chart publish will drift the same way. Fix: build the multi-line commit message with `printf '%s\n\n%s'` so the string source stays on physically-indented lines that the YAML block-scalar accepts. Behaviour is identical — same commit subject, same blank line, same body — only the construction shape changes. Added a 9-line comment naming the seam so future authors don't reintroduce the same trap. Verified locally: * `python3 -c yaml.safe_load(open(...))` succeeds, parses 24 build-job steps. 
* `CHART_NAME=bp-newapi PREV_VERSION=1.4.20 CHART_VERSION=1.4.21 BP_PREV_VERSION=1.4.20 bash -c "$(printf ...)"` emits the canonical "deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (auto, Refs TBD-A6)\n\nAlso locksteps platform ..." body. Refs #1864. Refs PR #1858 (TBD-A20 lockstep that introduced the YAML defect).	2026-05-19 00:07:07 +02:00
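The `printf '%s\n\n%s'` fix keeps every source line physically indented while still producing a subject, a blank line, and a body. A rough Python equivalent of that construction is sketched below; the function names are hypothetical, and the body text is passed in by the caller because the log above truncates it.

```python
def bump_subject(chart: str, prev_version: str, new_version: str) -> str:
    """Build the commit subject in the shape quoted in the commit message."""
    return (
        f"deploy({chart}): bump bootstrap-kit pin "
        f"{prev_version} -> {new_version} (auto, Refs TBD-A6)"
    )


def build_commit_message(subject: str, body: str) -> str:
    # Same shape as the shell printf with a '%s', blank line, '%s' format:
    # the blank line comes from the format string, so no source line ever
    # needs to be empty or start at column 1.
    return "%s\n\n%s" % (subject, body)
```

Because the blank line lives inside the format string rather than in the script text, a YAML block scalar wrapping the equivalent shell code never sees an unindented or empty source line.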
e3mrah	cf35b4a9b6	fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856 ) (#1858 ) A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux, openbao, keycloak, gitea) where blueprint.yaml spec.version had silently fallen behind chart/Chart.yaml version, breaking TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart publish — never the upstream platform/<bp>/blueprint.yaml. This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml spec.version whenever Chart.yaml version bumps. Both file edits land in the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint lockstep). Idempotent reset-and-rewrite retry preserved for the existing parallel-matrix race case. Workflow changes (.github/workflows/blueprint-release.yaml): * New step `bump_blueprint` after `bump_pin` — locates ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml (handles both platform-leaf and products-umbrella conventions), filters to kind:Blueprint (defensive against CRD yaml at the products/catalyst/chart/crds path), reads current spec.version at 2-space indent, sed-rewrites to CHART_VERSION, verifies post-write. * Commit step renamed to "Commit + push bootstrap-kit pin bump + blueprint.yaml lockstep"; stages both files, single commit, with convergent retry on conflict. * Summary block surfaces both bumps separately. Regression test (tests/e2e/bootstrap-kit/main_test.go): * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks platform/* and products/, discovers every Blueprint manifest with a sibling Chart.yaml, asserts spec.version == Chart.yaml version. Covers ALL ~70 blueprints, not just the canonical 10 kit ones the existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates. 
Failure messages name the file, drift direction, and the exact sed command to fix — drift remediation is mechanical. Drift cleanup (mandatory companion, same shape as A17/#1855): 26 Application-Blueprint blueprints whose spec.version had been left at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to Chart.yaml as authoritative. All currently surface in the new sweep test; without the cleanup the test would block this PR (and every subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook, cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores], falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir, netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover, trivy, valkey, velero, vpa, products/dmz-vcluster. After this lands, the next chart-version bump in any platform/<bp>/ folder auto-converges all three artifacts (Chart.yaml, blueprint.yaml, bootstrap-kit pin) in a single bot commit. No more manual collector PRs; no more silent drift between chart and Blueprint manifest. Closes #1856. Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:04:22 +04:00
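The lockstep contract (blueprint.yaml `spec.version` must equal Chart.yaml `version`) can be illustrated with a small text-based check. This is a sketch under assumptions, not the Go sweep test: the function names are hypothetical, and it reads `version:` at a 2-space indent under a top-level `spec:` key, as the commit describes, without a full YAML parser.

```python
import re
from typing import Optional


def chart_version(chart_yaml: str) -> Optional[str]:
    """Return the top-level `version:` value from Chart.yaml text."""
    m = re.search(r"^version:\s*(\S+)\s*$", chart_yaml, re.MULTILINE)
    return m.group(1).strip("\"'") if m else None


def blueprint_spec_version(blueprint_yaml: str) -> Optional[str]:
    """Return spec.version from a kind: Blueprint manifest, else None."""
    # Defensive filter: ignore documents that are not Blueprints (e.g. CRDs).
    if not re.search(r"^kind:\s*Blueprint\s*$", blueprint_yaml, re.MULTILINE):
        return None
    in_spec = False
    for line in blueprint_yaml.splitlines():
        if line.startswith("#"):
            continue  # column-0 comments do not end the current section
        if re.match(r"^\S", line):
            in_spec = line.startswith("spec:")  # any column-0 key switches section
            continue
        if in_spec:
            m = re.match(r"^  version:\s*(\S+)\s*$", line)
            if m:
                return m.group(1).strip("\"'")
    return None


def lockstep_drift(chart_yaml: str, blueprint_yaml: str) -> Optional[str]:
    """Describe the drift, or return None when in sync or not applicable."""
    cv = chart_version(chart_yaml)
    bv = blueprint_spec_version(blueprint_yaml)
    if cv is None or bv is None or cv == bv:
        return None
    return f"blueprint spec.version {bv} != Chart.yaml version {cv}"
```

Treating Chart.yaml as authoritative, a sweep would call `lockstep_drift` for every Blueprint manifest with a sibling Chart.yaml and fail on any non-None result.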
e3mrah	a8931db541	fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849 ) (#1855 ) Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main: 1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml), gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin -> Sovereign install); blueprint.yaml gets resynced down/up to match. 2. pin-sync-audit on push — full-sweep audit races the blueprint-release auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift until the auto-bump bot commits the pin update ~60s later; the bot push (GITHUB_TOKEN convention) does not retrigger this workflow, so the failure remains in run history. Fix: set continue-on-error: true on push/workflow_dispatch events (PR remains blocking via --changed-only). The full-sweep output still surfaces drift on the run summary; it just doesn't fail the overall run while the heal-in- ~60s window is open. Documented inline in the job header. Net effect: every push to main re-runs cleanly green. The 13 pre-existing drifts called out in the existing job comment will continue to heal as each lagging chart gets its next bump (auto-bump hook + this PR's manifest-validation alignment). Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence to blueprint.yaml versions which the test asserts but the auto-bump hook does not yet update). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>	2026-05-19 00:34:48 +04:00
e3mrah	f061088ad3	fix(ci): harden TBD-A6 against bootstrap-kit slot indent drift (#1751)
The TBD-A6 auto-bump hook in blueprint-release.yaml AND the check-bootstrap-kit-pin-sync.sh audit both key on the regex `^      chart: <name>$` (exactly 6 leading spaces) to find the slot pinning a given chart. Every existing slot is at that indent today (audited 2026-05-18 across all 49 bp-* slots + the one sandbox slot). But if a future slot author writes the `chart:` / `version:` lines at a DIFFERENT indent (e.g. 4 or 8 spaces, copy-pasting from a tutorial that uses different YAML style), BOTH systems silently skip that slot:
- The audit reports the chart as "not in the bootstrap kit" (skipped count, no drift entry).
- The auto-bump hook logs "graceful no-op" and the chart-pin pair drifts forever undetected.
This is exactly the silent-drift failure mode TBD-A6 was created to prevent.
Two-layer defence:
1. scripts/check-bootstrap-kit-pin-sync.sh: scan every slot file for `chart:` and `version:` lines at any non-6-space indent ≤14 spaces (deeper than that = legit nested values field). Fail loudly with an actionable error that points at both the audit script and the auto-bump regex that need lockstep updating if the schema ever legitimately changes.
2. .github/workflows/blueprint-release.yaml auto-bump step: before declaring "no slot pins this chart, graceful no-op", do a second grep at any indent. If THAT finds something, fail with the same actionable error; this prevents publishing a chart whose pin will silently lag.
Verified: the indent-sanity scan passes on origin/main today (no false positives across 51 slot files). Synthetic 4-space and 8-space test slots both trigger the new error correctly.
Edge cases also re-verified live on run 26041355274 (`0812d890`, first production auto-bump):
- Chart-not-in-kit: bp-wordpress-tenant correctly logs "graceful no-op (chart is an opt-in Application Blueprint)".
- Already-in-sync: bp-catalyst-platform on a second push logs "already pins ... — no-op" without spam.
- Race against concurrent bump: the existing fetch+reset+re-sed retry loop (lines 661-680) handles cross-chart matrix races; same-chart races converge to the same final pin via rewrite_pin's idempotent reset path. No `git pull --rebase` needed before commit.
Refs TBD-A6
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 19:42:12 +04:00
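The indent-sanity scan in layer 1 can be sketched as follows. This is an illustrative Python version, not the shell audit: the names are hypothetical, and it assumes only lines that carry a value (`chart: <name>`, `version: <x>`) are of interest, so a bare `chart:` mapping key at a shallower level is not flagged.

```python
import re
from typing import List, Tuple

EXPECTED_INDENT = 6      # the indent both the audit and the auto-bump hook key on
MAX_SCANNED_INDENT = 14  # deeper lines are treated as nested values fields

# Matches "chart: <value>" or "version: <value>" with only spaces before the key.
_PIN_LINE = re.compile(r"^( *)(chart|version):\s+\S")


def indent_violations(slot_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for pin-like lines at an unexpected indent."""
    bad: List[Tuple[int, str]] = []
    for number, line in enumerate(slot_text.splitlines(), start=1):
        m = _PIN_LINE.match(line)
        if not m:
            continue
        indent = len(m.group(1))
        if indent != EXPECTED_INDENT and indent <= MAX_SCANNED_INDENT:
            bad.append((number, line))
    return bad
```

A non-empty result is the "fail loudly" case: the slot contains a pin that the 6-space regex would silently skip.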
e3mrah	5f85c731c1	fix(ci): deploy-bot auto-bumps bootstrap-kit pin when chart version bumps (Refs TBD-A6 meta-fix) (#1713 ) TBD-A6: every chart-publishing wave in the 2026-05-17/18 session required a SEPARATE manual collector PR to bump the bootstrap-kit pin so Sovereigns would actually install the freshly published OCI artifact. Without the pin bump, the chart at e.g. bp-catalyst-platform:1.4.166 gets published to GHCR but clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml still pins `version: 1.4.165` and fresh Sovereigns silently install the OLD artifact. Manual collector PRs eliminated by this hook (sample from one session): #1676 chart 1.4.162 -> 1.4.163 (Wave 16 collector) #1687 chart 1.4.163 -> 1.4.164 (Wave 17 collector) #1694 bp-guacamole 0.1.21 -> 0.1.22 (TBD-G6) #1695 chart 1.4.164 -> 1.4.165 (Wave 18 collector) #1698 chart 1.4.165 -> 1.4.166 (TBD-E8) #1700 bp-guacamole 0.1.22 -> 0.1.23 (TBD-G4 phase 2) #1706 bp-self-sovereign-cutover 0.1.29 -> 0.1.30 (TBD-C18) #1707 chart 1.4.166 -> 1.4.167 (Wave 24 collector) The fix lives in .github/workflows/blueprint-release.yaml — the single workflow that publishes every chart's OCI artifact. After a successful push + cosign sign + SBOM attest, a new "Auto-bump bootstrap-kit pin" step: 1. Reads ${{ steps.chart.outputs.name }} (e.g. `bp-newapi`). 2. Greps clusters/_template/bootstrap-kit/.yaml for the canonical ` chart: <name>` line (6-space indent matches every existing slot's HelmRelease.spec.chart.spec.chart shape). 3. If a matching slot file is found, sed-replaces the slot's ` version: <old>` with `version: <new>` and commits + pushes back to main as hatiyildiz <noreply>. 4. If no slot file matches, the chart is an opt-in Application Blueprint (e.g. bp-vllm, bp-temporal) and the step gracefully no-ops. 5. Conflict-tolerant retry up to 3 times with idempotent reset-and-rewrite for the parallel matrix case (multiple charts bumped in the same push). 
The bot-author commit does NOT re-trigger workflows (GitHub Actions GITHUB_TOKEN convention), so the chain converges in ONE pass: chart bump -> blueprint-release -> publish artifact -> bump pin. No loop. A regression test (scripts/check-bootstrap-kit-pin-sync.sh) asserts the convergence contract: every Chart.yaml in platform/ or products/* whose chart name is pinned in clusters/_template/bootstrap-kit/ MUST have the same version on both sides. The .github/workflows/test- bootstrap-kit.yaml workflow now runs this audit: - On `pull_request`: `--changed-only --base <pr-base>` so a PR is only blocked on chart->pin pairs IT modified. This avoids forcing pre-existing drifts (13 charts as of 2026-05-18, validated via a full sweep against origin/main) to be fixed before any unrelated PR can land. The auto-bump hook will heal those drifts on the next bump of each lagging chart. - On `push` and `workflow_dispatch`: full sweep so post-merge drift is observable on the run summary. Why blueprint-release.yaml is the right insertion point (not each build-bp-<name>.yaml or services-build.yaml or catalyst-build.yaml): - It runs after EVERY chart publish, regardless of upstream trigger. - It already has the canonical chart name + version in ${{ steps.chart.outputs.name }} + ${{ steps.chart.outputs.version }}. - One file changed, one hook covers all 51 bootstrap-kit slots plus future additions. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 18:36:55 +04:00
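The core of the auto-bump step (find the slot's 6-space `chart: <name>` line, then rewrite the 6-space `version:` line) can be sketched in Python as below. The workflow itself uses grep and sed; `rewrite_pin` here is a hypothetical stand-in that operates on the slot file's text and returns None for the graceful no-op case.

```python
import re
from typing import Optional


def rewrite_pin(slot_text: str, chart: str, new_version: str) -> Optional[str]:
    """Return slot text pinned to new_version, or None if chart is not pinned here."""
    chart_line = re.compile(r"^ {6}chart: " + re.escape(chart) + r"$", re.MULTILINE)
    if not chart_line.search(slot_text):
        return None  # graceful no-op: this slot does not pin the chart
    # Replace only the first 6-space `version:` line; a function replacement
    # avoids any backslash interpretation in new_version.
    return re.sub(
        r"^ {6}version: .*$",
        lambda _match: "      version: " + new_version,
        slot_text,
        count=1,
        flags=re.MULTILINE,
    )
```

Running the rewrite twice with the same target version yields the same text, which is the property the reset-and-rewrite retry relies on.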
e3mrah	96674b71c9	fix(ci+catalyst-api): hold deploy-bot bumps when any prov is in-flight (was rolling catalyst-api Pod mid-tofu-apply, abandoning provs) (#1688 ) Context — t13/t17/t21 incident, 2026-05-17. catalyst-api is single-replica with strategy: Recreate; the OpenTofu workdir lives on a /tmp emptyDir that dies with the Pod. When this workflow bumped the image SHA mid-prov, Flux rolled the Pod and killed `tofu apply` mid-resource. The on-disk record was rewritten to status=failed by restoreFromStore on the new Pod, but Hetzner resources tagged with the abandoned deployment-id stayed orphaned and required manual `hcloud` cleanup. Three consecutive provs died this way in one afternoon. Option C (smallest blast radius): gate the deploy-bot at the workflow level. 1. New public endpoint GET /api/v1/deployments/in-flight-count on catalyst-api. Returns {count, ids} of deployments in Phase-0 in-flight status (pending / provisioning / tofu-applying / flux-bootstrapping). Phase-1 (phase1-watching) is observational and resumes across Pod restarts via resumePhase1Watch, so it does NOT block. Adopted deployments are excluded. No FQDNs / owner emails in the response — same information-disclosure posture as /api/v1/subdomains/check. Unauthenticated; the deploy-bot has no session cookie. 2. .github/workflows/catalyst-build.yaml `deploy` job polls this endpoint before bumping values.yaml. count==0 → green light. count>0 → sleep 20s and retry. Hard cap 30 min (a stuck prov must not block all future deploys — that would be the worst possible failure mode for a CI gate). Fail-open on any non-200 / network error so the gate cannot itself become an outage. Notes: - Mothership URL configurable via vars.CATALYST_API_URL (defaults to https://console.openova.io). Sovereign chroot self-deploys can point to their local catalyst-api. 
- First-rollout safe: the endpoint does not exist on the LIVE mothership until THIS PR's image lands, so the first run after merge falls through the 404 branch and proceeds. Subsequent runs benefit from the gate. - NOT a Chart.yaml bump. The deploy-bot itself bumps the literal image refs in chart templates (existing behaviour), so the new endpoint reaches Sovereigns through the normal chart-rebake path. Tests: handler/deployments_in_flight_count_test.go covers Phase-0 vs Phase-1 vs terminal vs adopted classification + empty-store green light. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 15:54:54 +04:00
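The gate's control flow (proceed on count==0, sleep 20s and retry on count>0, stop waiting after 30 minutes, fail open on any error) can be sketched as below. This is an illustrative Python sketch rather than the workflow's shell step; the function names and the injectable `sleep`/`clock` parameters are assumptions added so the loop can be tested without waiting.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def http_fetch_count(base_url: str, timeout: float = 10.0) -> Optional[int]:
    """Query the in-flight endpoint; None means 'not a usable 200 response'."""
    url = base_url.rstrip("/") + "/api/v1/deployments/in-flight-count"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        if resp.status != 200:
            return None
        return int(json.load(resp)["count"])


def wait_for_green_light(
    fetch_count: Callable[[], Optional[int]],
    poll_seconds: float = 20.0,
    cap_seconds: float = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Block until the deploy may proceed; return why the gate opened."""
    deadline = clock() + cap_seconds
    while True:
        try:
            count = fetch_count()
        except Exception:
            return "fail-open"  # network error, 404, bad JSON: never block deploys
        if count is None:
            return "fail-open"
        if count == 0:
            return "clear"
        if clock() >= deadline:
            return "timeout"    # hard cap: a stuck prov must not block forever
        sleep(poll_seconds)
```

All three return values mean "proceed with the bump"; they differ only in what a caller would log. `urllib.request.urlopen` raises on HTTP error statuses such as 404, so the first-rollout case lands in the fail-open branch.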
e3mrah	cadc7b5cea	fix(sandbox-ci): mcp-server Dockerfile repo-root context + pty/mcp auto-bump wiring (chart was half-deployable) (#1667 ) Sandbox chart was un-deployable end-to-end because three CI-side gaps compounded after PR #1658 wired the mcp-server module to depend on core/controllers + core/services/shared via `replace` directives: 1. mcp-server Dockerfile built against a too-narrow context. The workflow passed `context: products/sandbox/mcp-server` and the Dockerfile assumed `COPY . .` could see everything it needed, but the `replace ../../../core/controllers` line in the module's go.mod only resolves when the build can actually reach those paths. Result: every push after #1658 failed at `go build` with `module not found`. Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout): COPY the replace targets' module roots + sources, then build with WORKDIR set to the dependent module. Static binary still produced into a distroless/static-debian12:nonroot final stage. 2. mcp-server workflow had no chart auto-bump step. Even after a green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml stayed empty so the chart's `required` guard (deployment.yaml line 72) refused to render. Added the same yq-bump + bot-commit pattern build-sandbox-controller.yaml already uses, targeting `.runtime.mcpImage` and writing a fully-qualified `<repo>:<sha>` string (consumer reads it as one image reference, not a {repository,tag} pair). Also widened paths-filter to include core/controllers/ + core/services/shared/ so changes to the replace targets re-trigger the build. 3. pty-server workflow had no auto-bump either. Same surgery: yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays narrow (pty-server has no cross-tree `replace` directives). 4. 
Stop-gap pin values for runtime.{ptyServerImage,mcpImage} so the next chart roll out doesn't fail-fast before the rebuilt workflows land their first bumps: - ptyServerImage → `ad5163e6` (current latest pty-server) - mcpImage → `1b0e86c` (last pre-#1658 green build; the rebuilt workflow will land the next real SHA on the next push to main). Verified locally: - `go build ./products/sandbox/mcp-server/...` clean (43.8 MB static binary at /tmp/openova-sandbox-mcp; `file` confirms statically linked ELF). - `helm template test platform/sandbox/chart --set enabled=true …` renders cleanly; both env vars carry the SHA-pinned image refs. No Chart.yaml bump. Read-only clusters. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 13:22:17 +04:00
e3mrah	1b0e86cb1a	ci(sandbox): build workflows for controller + pty-server + mcp-server (so chart can actually deploy) (#1632 ) PR #1622 shipped the sandbox-controller binary + chart, and PR #1618 shipped pty-server + mcp-server scaffolds, but neither came with CI build workflows — meaning the chart's image.repository points at a GHCR package that no workflow ever publishes (ImagePullBackOff on every install). Per docs/INVIOLABLE-PRINCIPLES.md #4a every runtime image MUST be produced by a GitHub Actions workflow from a committed git SHA; this PR closes that gap. Three new workflows, all event-driven (push paths-filter + PR + workflow_dispatch, no cron): - build-sandbox-controller.yaml — mirrors build-application-controller (shared core/controllers go.mod, go vet + race tests, Buildx push, cosign keyless sign, SBOM attest, auto-bump platform/sandbox/chart/ values.yaml image.tag back to main so the next install picks up the SHA-pinned image without operator action). - build-sandbox-pty-server.yaml — separate go module under products/sandbox/pty-server (own go.mod/go.sum), Dockerfile uses COPY . . so build context is the server directory. Same Buildx + cosign + SBOM flow as the controller. No values.yaml bump yet: Wave-2 wiring of the StatefulSet template will land in a follow-up. - build-sandbox-mcp-server.yaml — stdlib-only stdio MCP sidecar (no go.sum yet), same shape as pty-server. Per `feedback_no_mvp_no_workarounds.md` rule 1 (target-state, never "manual follow-up bump") the controller workflow auto-bumps the chart values.yaml so a Sovereign overlay flipping `enabled: true` Just Works. Per the user's hard rule for this PR, no Chart.yaml bump and no blueprint-release dispatch — the Sandbox chart's publication cadence is gated by Wave-2 readiness, not per-image builds. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:11:28 +04:00
e3mrah	0ac12970d8	ci(openova-flow): build openova-flow-server + adapter-flux images + sed chart tags (#1398 ) Add the two missing GitHub Actions build pipelines for the OpenovaFlow Go binaries so prov #34 has real images to install. Both auto-bump their chart's values.yaml `image.tag` on every main-branch push and dispatch blueprint-release for chart re-publish. Workflows shipped: - .github/workflows/build-openova-flow-server.yaml · Triggers on push to products/openova-flow/server/** or the chart · `go vet` + `go test -race` + Buildx push to ghcr.io/openova-io/openova/openova-flow-server:<sha> + :latest · cosign keyless sign + SBOM attest · awk-bumps platform/openova-flow-server/chart/values.yaml flowServer.image.tag, commits to main with [skip ci] · Dispatches blueprint-release.yaml for chart re-publish - .github/workflows/build-openova-flow-adapter-flux.yaml · Same shape; bumps platform/openova-flow-emitter/chart/values.yaml flowEmitter.image.tag Chart defaults (`tag: "latest"`) already shipped in PR #1397 — no values.yaml changes needed in this PR. Canonical patterns cited (ARCHITECT-FIRST): - Build shape mirrors .github/workflows/build-application-controller.yaml (Go vet + test + Buildx + cosign + SBOM + values.yaml awk-bump + blueprint-release dispatch). - awk-sed bump pattern mirrors catalystApi/catalystUi tag-bump in .github/workflows/catalyst-build.yaml `deploy` job (with the `[skip ci]` + explicit blueprint-release dispatch fix from #712). Per docs/INVIOLABLE-PRINCIPLES.md: - #4a (GitHub Actions is the only build path) - event-driven (no cron triggers, only push/PR/workflow_dispatch) MIRROR-EVERYTHING: image refs in chart values point at harbor.openova.io/proxy-ghcr/...; CI pushes to ghcr.io directly and Harbor proxy-pulls. No direct push to harbor. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 16:03:31 +04:00
e3mrah	3a5d9fc102	fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111 ) (#1331 ) Two infrastructure-hardening fixes that together eliminate ~30 min of provision-cycle waste per regression event documented in Fix #101. ## Fix A — CI guard against unescaped tftpl shell expansion Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml that scans every infra/hetzner/*.tftpl for unescaped \${VAR:-default} inside YAML comment lines. Uses PCRE negative-lookbehind so correctly escaped \$\${VAR:-default} (templatefile() literal-dollar) does not trip the guard. Background: PR #1311 (Fix #73) added a YAML comment with bare \${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL \${...} sequences regardless of YAML/HCL/shell context; the colon in the interpolation hits HCL's reserved conditional grammar and crashes 'tofu plan' with "Template interpolation doesn't expect a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted ~30 min before PR #1328 fixed the one offender. Without the guard, the next operator who adds a similar comment repeats the incident. Documented in infra/hetzner/README.md so editors learn the \$\$ escape pattern before they trip the CI gate. ## Fix B — bucket-name suffix to escape global Hetzner namespace Hetzner Object Storage bucket names share a GLOBAL namespace across every tenant. The previous BucketNameForSovereign(fqdn) derivation 'catalyst-<fqdn-with-dashes>' would collide on the second CreateDeployment for the same FQDN (re-provision after wipe, two operators on adjacent pools, race conditions) and the second 'tofu apply' would fail with BucketAlreadyExists. Change BucketNameForSovereign signature to (fqdn, deploymentID) and append the first 8 chars of the deployment-id as a suffix: catalyst-omantel-omani-works-b3b837a2 newID() already returns 16-hex random — the leading 8 chars are 32 bits of fresh entropy, enough to make collisions cryptographically negligible. 
Backward-compat: empty deploymentID (legacy on-disk records) falls back to first-8-hex of sha256(fqdn) so wipes of pre-Fix-111 Sovereigns remain deterministic. Call-sites updated: - handler/deployments.go: id := newID() moved before bucket-name derivation; uses hetzner.BucketNameForSovereign - handler/wipe.go: passes dep.ID to PurgeBuckets and to BucketNameForSovereign in the report - hetzner/buckets.go: PurgeBuckets signature now takes deploymentID; bucketSuffix() handles the fallback Tests: - hetzner/buckets_test.go: 6-case TestBucketNameForSovereign table covers canonical newID() shape, collision avoidance, uppercase normalisation, empty + non-hex fallback paths. New TestBucketNameForSovereign_CollisionAvoidance asserts the Fix #111 invariant directly. - handler/deployments_test.go: TestCreateDeployment_DerivesObjectStorageBucketFromFQDN now asserts the suffixed shape against the actual dep.ID. - All produced names re-validated against the S3 bucket-naming RFC (mirrored regex from provisioner.s3BucketNamePattern). ## Claimed TCs _None directly — infrastructure hardening; eliminates 30+ min wasted per cycle from regressions like PR #1311 + bucket-collision_ ## Verification - go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS - go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS - go vet ./... → clean - go build ./... → clean - yaml.safe_load on workflow → clean - pre-existing handler-package fails (whoami, continuum-switchover) are unrelated and present on origin/main Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 23:31:56 +04:00
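The suffixed bucket-name derivation from Fix B can be sketched as follows. This is a Python illustration, not the Go implementation: the function names are hypothetical, and it assumes dots in the FQDN become dashes, that both inputs are lower-cased, and that the fallback hashes the lower-cased FQDN. Those details are inferred from the example name and the test descriptions above.

```python
import hashlib
import re

_LEADING_HEX8 = re.compile(r"^[0-9a-f]{8}")


def bucket_suffix(fqdn: str, deployment_id: str) -> str:
    """First 8 hex chars of the deployment id, else first 8 hex of sha256(fqdn)."""
    m = _LEADING_HEX8.match(deployment_id.lower())
    if m:
        return m.group(0)
    # Fallback for empty or non-hex ids (legacy records): deterministic per FQDN.
    return hashlib.sha256(fqdn.encode("utf-8")).hexdigest()[:8]


def bucket_name_for_sovereign(fqdn: str, deployment_id: str) -> str:
    """Derive 'catalyst-<fqdn-with-dashes>-<suffix>'."""
    fqdn = fqdn.lower()
    return "catalyst-" + fqdn.replace(".", "-") + "-" + bucket_suffix(fqdn, deployment_id)
```

With a hypothetical id beginning `b3b837a2`, the FQDN `omantel.omani.works` maps to the `catalyst-omantel-omani-works-b3b837a2` shape quoted in the commit, and two deployments of the same FQDN get different names as long as their ids differ in the first 8 characters.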
e3mrah	f668d791ab	fix(bp-newapi): publish newapi-mirror image + repoint chart to existing tag (qa-loop bounded-cycle audit prov #7 Gap F) (#1315 ) Root cause from live diagnosis (omantel.biz prov #7, kubectl --context=omantel): The bp-newapi chart at platform/newapi/chart/values.yaml referenced `ghcr.io/openova-io/openova/newapi-mirror:v0.4.5` since its first commit (44d0200a, 2026-05-01). However: 1. NO CI workflow ever built that image. There is no `build-bp-newapi.yaml` (or similar) under .github/workflows/. The GHCR package `ghcr.io/openova-io/openova/newapi-mirror` does not exist (404 from /orgs/openova-io/packages/container/...). 2. The tag `v0.4.5` is fictitious — neither upstream Calcium-Ion/new-api (`docker.io/calciumion/new-api`) nor the alternate ancestor (`justsong/one-api`) ever published a `v0.4.5`. The lowest stable Calcium-Ion tag is `v0.6.0.9`; the highest stable v0.x is `v0.13.2` (upstream publish 2026-04-27). Result: every fresh Sovereign's NewAPI Pod ImagePullBackOff'd 403 Forbidden on the never-existed image, blocking alice signup gate 5 (LLM) and surfacing in the bounded-cycle audit as Gap F. Fix (mirrors bp-guacamole CI pattern in .github/workflows/build-bp-guacamole.yaml): - NEW .github/workflows/build-bp-newapi.yaml — push to platform/newapi/chart/* triggers a Job that pulls `docker.io/calciumion/new-api:<UPSTREAM_VER>`, captures the upstream repo digest, re-tags as `ghcr.io/openova-io/openova/newapi-mirror: <UPSTREAM_VER>` + `:latest`, pushes both, then bumps values.yaml + Chart.yaml + dispatches blueprint-release. - platform/newapi/chart/values.yaml — newapi.image.tag bumped from `v0.4.5` (fictitious) to `v0.13.2` (latest stable Calcium-Ion/new-api on Docker Hub). Comment block expanded with full rationale + link to the new build workflow + bump-in-lockstep instructions. - platform/newapi/chart/Chart.yaml — version 1.4.1 → 1.4.2, appVersion `0.4.5` → `0.13.2` (Helm convention: appVersion = upstream version without the `v` prefix). 
Inline changelog records the audit-prov-7 Gap F lineage. - clusters/_template/bootstrap-kit/80-newapi.yaml — pinned chart version 1.4.1 → 1.4.2 with the same changelog inline. Verified locally: - `helm template smoke platform/newapi/chart --set database.existingSecret=fake --set credentials.existingSecret=fake --set auth.adminUI.mode=masterKey` renders `image: "ghcr.io/openova-io/openova/newapi-mirror:v0.13.2"` and `app.kubernetes.io/version: "0.13.2"`. The v1.0.0-rc.x upstream line is gated on schema migration stabilisation; the channel-seed Job uses the legacy admin-API request shape, so do NOT auto-roll past v0.13.x without re-running the channel-seed integration smoke against NewAPI's `/api/channel/`. Pairs with the Gap C re-investigation memo (no chart fix needed; PR #1309 only gated `defaultCompositionRef`, not the XRD itself; the useraccesses.access.openova.io CRD is present on omantel prov #7). DO NOT MERGE — this PR is for qa-loop bounded-cycle Wave 5 Fix #80 (Gap F) review. Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 21:20:49 +04:00
e3mrah	9780e8d72d	fix(chart): bp-catalyst-platform 1.4.116 — chart re-publish + dispatch (qa-loop iter-10 Fix #44 follow-up) (#1264)
Chart 1.4.115 was published from the merge commit, which still had the OLD application-controller image tag (`a3ba200`) in values.yaml. The auto-bump commit landed seconds later, but GitHub Actions does NOT trigger workflows from bot pushes by default (anti-recursion safeguard), so blueprint-release was never re-run and the published chart shipped with the wrong image. Sovereigns installing chart 1.4.115 still ran the buggy application-controller without the targetNamespace fix.
Fix:
- Bump bp-catalyst-platform 1.4.115 → 1.4.116 (this commit is human-authored so blueprint-release fires via the path filter).
- Bump clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pin to 1.4.116.
- Extend build-application-controller.yaml to dispatch blueprint-release.yaml after the bot bumps values.yaml, so the same race never blocks any future controller image roll-out.
Per docs/INVIOLABLE-PRINCIPLES.md #1 (target-state): the operator must never have to manually re-trigger a chart publish after a controller image rebuild.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 06:17:13 +04:00
e3mrah	24aab61207	fix(application-controller): HelmRelease targetNamespace = App's namespace, not Org slug (qa-loop iter-10 Fix #44 ) (#1262 ) Root cause: the application-controller rendered the per-Application HelmRelease with `metadata.namespace = Org` and `spec.targetNamespace = Org` where Org is the parent Organization slug. On omantel the Application(qa-wp) lives in ns `qa-omantel` while the Org is named `omantel-platform` — so the workload Pod landed in the wrong namespace, breaking matrix rows TC-068 / TC-100 / TC-204 / TC-262 / TC-263 (all asserting Pod in qa-omantel). Symmetric Kustomization wrapper had the same bug. Existing render unit test only covered the org==namespace case (`acme/acme`) which masked the bug. Fix: - render.Inputs gains AppNamespace field. helmRelease + kustomization templates resolve `metadata.namespace` and `spec.targetNamespace` to AppNamespace (back-compat default = Org). - application_controller.go passes app.GetNamespace() as AppNamespace on every render.Render call. - HelmRelease spec.install.createNamespace = true so a missing workload namespace is provisioned by helm-controller (per docs/INVIOLABLE-PRINCIPLES.md #1 target-state — controller must work without an operator pre-creating the namespace). - Org slug is still stamped on the catalyst.openova.io/organization label for traceability. - 3 new Go tests: TestRender_NamespaceIsAppNamespace (omantel scenario via render pkg) TestRender_CreateNamespaceTrue TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace (drives the omantel scenario end-to-end through the controller fake) - build-application-controller.yaml extended with auto-bump of controllers.application.image.tag in values.yaml on push-to-main, so the chart picks up the rebuilt image without a manual operator edit (per feedback_no_mvp_no_workarounds.md rule 1). - bp-catalyst-platform chart 1.4.114 → 1.4.115. 
Verification (post-roll on omantel): - delete omantel-platform/qa-wp Pod - annotate qa-omantel/qa-wp HR for reconcile - expect: Pod in qa-omantel ns + HR.spec.targetNamespace == qa-omantel Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 05:17:48 +04:00
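The namespace fallback described in commit 24aab61207 can be sketched as follows. This is a minimal illustration, not the controller's actual code: the `Inputs` struct here is a hypothetical stand-in for `render.Inputs`, and `ResolveNamespace` is an assumed helper name. It only shows the rule the message states, namely that `metadata.namespace` and `spec.targetNamespace` resolve to `AppNamespace`, with the Org slug as the back-compat default.

```go
package main

import "fmt"

// Inputs is a hypothetical stand-in for the render inputs named in the
// commit message; only the two fields relevant to the fix are modelled.
type Inputs struct {
	Org          string // parent Organization slug
	AppNamespace string // namespace the Application CR lives in; may be empty
}

// ResolveNamespace (assumed name) returns the namespace to stamp on both
// metadata.namespace and spec.targetNamespace: AppNamespace when it is set,
// otherwise the Org slug for backward compatibility.
func ResolveNamespace(in Inputs) string {
	if in.AppNamespace != "" {
		return in.AppNamespace
	}
	return in.Org
}

func main() {
	// The omantel scenario: Org and Application namespace differ.
	fmt.Println(ResolveNamespace(Inputs{Org: "omantel-platform", AppNamespace: "qa-omantel"}))
	// The legacy case the old unit test covered: no AppNamespace supplied.
	fmt.Println(ResolveNamespace(Inputs{Org: "acme"}))
}
```

A test fixture in which Org and namespace differ is what exposes the original bug; a fixture where both are `acme` passes with or without the fix.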
e3mrah	5ca0a7d178	fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236 ) * fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots Closes the scope-narrow confessed by Fix #36: bp-guacamole + bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment NotFound". CI workflows ------------ - .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/*, then bumps platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml patch version + dispatches blueprint-release. - .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry we own — no Docker Hub rate limits, no upstream availability risk), bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches blueprint-release. Charts (target-state) --------------------- - bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy` regardless of release name (DaemonSet + Service + ClusterRole + ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so matrix can address them by canonical short name). - bp-guacamole v0.1.1: canonical short resource names (`guacd`, `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream images; realm-patch ConfigMap correctly lands in `keycloak` namespace (was: realm-name, which would have failed silently on every Sovereign); `realmConfig.namespace` override surface added. - Both charts: `catalyst.openova.io/smoke-render-mode: default-off` annotation so blueprint-release smoke-render gate honors the default-OFF render shape. 
Bootstrap-kit slots ------------------- - clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml + 37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned to 0.1.1, default-OFF gate flipped via slot values, install/upgrade disableWait per session-2026-04-30 architectural decision. - clusters/omantel.omani.works/bootstrap-kit/ slots mirror the same shape with omantel.biz hostnames matching the live HTTPRoutes on console.omantel.biz / auth.omantel.biz. API: shells/issue handler (matrix-canonical URL surface) -------------------------------------------------------- - POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container= alias for the existing POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session with matrix-canonical response fields (`sessionId`, `guacamoleUrl`, `recordingPath`). Same business logic, same audit surface (`guacamole-session-opened`), same RBAC gate (tier-developer or higher). 6 test cases, all PASS under -race. TCs that flip PASS in iter-8 ----------------------------- - TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath - TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system - TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system - TC-237: kubectl logs ds/k8s-ws-proxy → "listening" - TC-245: viewer-cookie POST /shells/issue → 403 - TC-246: operator-cookie POST /shells/issue → 200 sessionId Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every gap Fix #36 confessed is closed in this PR. Per feedback_machine_saturation_3rd_violation.md: CI-only build path, no local docker. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up) CI dependency-graph-audit caught a slot-number collision: slots 36-48 are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative, bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge, bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix, bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+ slot range) and add their entries to the expected DAG. - clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-* - clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-* - kustomization.yaml updates (both _template + omantel) - scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets, bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+ seaweedfs+k8s-ws-proxy) scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55 declared HRs, 42 present on disk, 13 deferred (W2.K1-K4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-10 01:48:25 +04:00
e3mrah	b24475e2c2	fix(api+chart): clusterroles GVR + CATALYST_BUILD_SHA env injection (qa-loop iter-3) (#1206 ) Two coupled fixes for QA-loop iter-3 cluster `clusterroles-gvr-and-sha-injection`: Sub-A — clusterroles GVR (TC-122/196/199/248): - Add rbac.authorization.k8s.io/v1 ClusterRole + ClusterRoleBinding to k8scache.DefaultKinds. Both cluster-scoped. - Add matching get/list/watch verbs on catalyst-api-cutover-driver ClusterRole. Per feedback_chroot_in_cluster_fallback.md every new GVR added to DefaultKinds MUST get a matching rule on the cutover-driver SA (chroot SovereignClient uses it via in-cluster fallback). - Pin both kinds in TestDefaultKinds_GraphAndDashboardSurface so a regression that drops them from the registry fails the unit test. Sub-B — CATALYST_BUILD_SHA env injection (TC-261): - api-deployment.yaml: inject CATALYST_BUILD_SHA + CATALYST_CHART_VERSION env vars with LITERAL values (not Helm directives) per the dual-mode contract — Kustomize on contabo can't render `{{ .Values... }}` in `value:` fields. - .github/workflows/catalyst-build.yaml: extend the "bump literal image refs" sed pass to also bump the CATALYST_BUILD_SHA env literal so /api/v1/version returns the SHA the Pod is actually running (no drift between image tag and reported SHA). - The handler (version.go) already reads CATALYST_BUILD_SHA via envOrTrim with `dev`/`0.0.0` ldflag fallbacks — no Go change needed; the version_test.go env-override test already covers it. Chart bumped 1.4.94 -> 1.4.95. Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 17:56:21 +04:00
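Commit b24475e2c2 says the version handler reads `CATALYST_BUILD_SHA` via `envOrTrim` with `dev`/`0.0.0` fallbacks. The real helper is not shown in the log, so the signature and behaviour below are assumptions: a rough sketch of an "environment value, trimmed, else fallback" lookup, with the pure part split out as a hypothetical `orTrim` so it can be checked without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// orTrim (hypothetical helper) returns raw with surrounding whitespace
// removed, or fallback when nothing is left after trimming.
func orTrim(raw, fallback string) string {
	if v := strings.TrimSpace(raw); v != "" {
		return v
	}
	return fallback
}

// envOrTrim mirrors the helper name used in the commit message; this
// signature is an assumption, not the repository's actual one.
func envOrTrim(key, fallback string) string {
	return orTrim(os.Getenv(key), fallback)
}

func main() {
	// With the env literal injected by the Deployment, the SHA is reported.
	os.Setenv("CATALYST_BUILD_SHA", " b24475e\n")
	fmt.Println(envOrTrim("CATALYST_BUILD_SHA", "dev"))

	// Without it, the build-time fallback is reported instead.
	os.Unsetenv("CATALYST_CHART_VERSION")
	fmt.Println(envOrTrim("CATALYST_CHART_VERSION", "0.0.0"))
}
```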
e3mrah	c1b92404ee	fix(chart): enable 5 Group C controllers + KC realm-role bootstrap (qa-loop iter-1) (#1194 ) EPIC-3 RBAC reconciliation loop was dormant on every Sovereign because the 5 Group C controllers (organization, environment, blueprint, application, useraccess) shipped with `enabled: false` and the KEYCLOAK_BOOTSTRAP_TIER_ROLES env var was hardcoded to "false". Result: UserAccess CRs created by /api/v1/sovereigns/{id}/rbac/assign never materialised into RoleBindings + composite realm-roles. Cluster: controllers-and-kc-bootstrap-gates (qa-loop iter-1). Changes: - values.yaml: organization/environment/application/useraccess controllers flipped to `enabled: true` and `image.tag` SHA-pinned to the latest GHCR-published push-on-main builds (organization/environment/application :1b29c71, useraccess :ff2172f) per Inviolable Principle #4a. - values.yaml: blueprint stays `enabled: false` until first push-on-main build of build-blueprint-controller.yaml lands an image in GHCR (never reference an image not built by CI). - values.yaml: new top-level `keycloak.bootstrap.ensureTierRoles: true`. - api-deployment.yaml: KEYCLOAK_BOOTSTRAP_TIER_ROLES now sources its default from `.Values.keycloak.bootstrap.ensureTierRoles` (per slice T2 brief #1098/#1146) instead of hardcoded "false". - .github/workflows/build-blueprint-controller.yaml: new workflow scaffolded (mirror of build-application-controller shape) so the first commit touching core/controllers/blueprint/** ships a CI-built, SHA-pinned, cosign-signed image to GHCR. - Chart.yaml: bumped 1.4.89 → 1.4.90. Verified via `helm template`: - 4 controller Deployments + 4 controller ClusterRoles render (blueprint pending image build). - KEYCLOAK_BOOTSTRAP_TIER_ROLES renders as "true" by default. - 5 tier ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}` render from platform/crossplane-claims/chart/. 
Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 15:41:58 +04:00
e3mrah	7ca4abddd2	feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159) * feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) Implements the server side of the Cloudflare KV lease-witness pattern that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare Workers KV namespace with read-then-CAS-write semantics enforced via the If-Match header — exact contract per K-Cont-3 #1158 report (item d) and the canonical-seams "Cloudflare KV Worker contract" entry. Routes: GET /lease/<slot-url-encoded> → 200 + LeaseState | 404 | 401 PUT /lease/<slot> → 200 + LeaseState | 412 + state | 401 DELETE /lease/<slot> → 204 | 412 | 401 All 7 K-Cont-3 trap behaviors verified by 46 vitest tests: 1. If-Match: 0 = first-acquire-on-empty-slot 2. Generation increments unconditionally (incl. Release) 3. 412 includes current state body 4. TTL eviction is server-authoritative in stamping (Worker doesn't auto-evict — controller's IsHeldBy decides) 5. X-Holder mismatch on DELETE returns 412 (stale region can't evict new primary) 6. Bearer token validation against env-bound allow-list 7. Optional X-Lease-Slot header logged for KV granularity Files: products/continuum/cloudflare-worker/{package.json, tsconfig.json, wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore, DESIGN.md, src/{index,auth,kv,types}.ts, src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts} infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf + README.md .github/workflows/cloudflare-worker-leases-build.yaml (event-driven, NO cron — push-on-paths + PR + workflow_dispatch) Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean. tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle. Per the brief: tofu module ships ready for operator action — no auto-deploy. 
Operator runbook in DESIGN.md §"Operator runbook — deploy a new Sovereign". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource) `tofu validate` failed on `cloudflare_workers_secret` — that resource was REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline `bindings = [...]` array on `cloudflare_workers_script` with `type = "secret_text"`). Same security guarantee — encrypted at rest in CF, never visible via dashboard read API once written. `tofu fmt` also wanted versions.tf alignment + the .terraform.lock.hcl pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/ which commits its lock file). Per Inviolable Principle #5 the bearer token value still flows from TF_VAR_bearer_tokens_csv extracted at apply time from a K8s SealedSecret — never inlined here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 08:01:44 +04:00
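The compare-and-swap rules listed for the K-Cont-4 Worker (commit 7ca4abddd2) can be modelled in memory. The sketch below is not the Worker source, which is TypeScript and not shown in this log. It assumes that `If-Match` carries the generation the caller last observed, that a released lease stays in the store as a holder-less record, and that a DELETE on a never-written slot also answers 412. Bearer-token checks (401) and TTL stamping are left out. The `Store` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
)

// LeaseState is a reduced, assumed shape of the lease record.
type LeaseState struct {
	Holder     string
	Generation int64
}

// Store is a hypothetical in-memory stand-in for the KV namespace.
type Store struct {
	slots map[string]LeaseState
}

func NewStore() *Store { return &Store{slots: map[string]LeaseState{}} }

// Put models PUT /lease/<slot>. ifMatch is the generation the caller saw;
// 0 means "I expect the slot to be empty". On mismatch it returns 412 plus
// the current state so the caller can see who holds the lease.
func (s *Store) Put(slot, holder string, ifMatch int64) (int, LeaseState) {
	cur := s.slots[slot] // zero value (generation 0) for a never-written slot
	if cur.Generation != ifMatch {
		return http.StatusPreconditionFailed, cur
	}
	next := LeaseState{Holder: holder, Generation: cur.Generation + 1}
	s.slots[slot] = next
	return http.StatusOK, next
}

// Delete models DELETE /lease/<slot>. A holder mismatch yields 412, so a
// stale region cannot evict the new primary. A successful release still
// bumps the generation.
func (s *Store) Delete(slot, holder string, ifMatch int64) (int, LeaseState) {
	cur := s.slots[slot]
	if cur.Holder == "" || cur.Holder != holder || cur.Generation != ifMatch {
		return http.StatusPreconditionFailed, cur
	}
	next := LeaseState{Generation: cur.Generation + 1}
	s.slots[slot] = next
	return http.StatusNoContent, next
}

func main() {
	s := NewStore()
	fmt.Println(s.Put("db-primary", "fsn", 0))    // first acquire on an empty slot
	fmt.Println(s.Put("db-primary", "hel", 0))    // loses the race, sees current state
	fmt.Println(s.Delete("db-primary", "hel", 1)) // wrong holder is rejected
	fmt.Println(s.Delete("db-primary", "fsn", 1)) // holder releases
}
```

Returning the current state alongside 412 lets the losing caller retry with the right generation without an extra GET.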
e3mrah	746901b671	feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101 ) (#1153 ) EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a companion to bp-cnpg: primary CNPG Cluster CR in region A, replica Cluster CR in region B configured as a CNPG replica cluster (replica.enabled=true + externalCluster), WAL streaming over a Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the only canonical inter-region transport — never public TLS. What ships: platform/cnpg-pair/ ├── chart/ │ ├── Chart.yaml # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off │ ├── values.yaml # default-OFF gate; placement schema constrains active-hotstandby ONLY │ ├── templates/ │ │ ├── _helpers.tpl # fail-fast on empty image.tag; region pair validation │ │ ├── primary-cluster.yaml # CNPG Cluster CR (region-pinned via openova.io/region affinity) │ │ ├── replica-cluster.yaml # CNPG Cluster CR (replica.enabled=true; externalClusters[]) │ │ ├── service-replication.yaml # Cilium ClusterMesh global Service │ │ ├── failover-readiness.yaml # probe Pod flips Ready when WAL lag < threshold │ │ ├── networkpolicy.yaml # default-deny carve-outs for replication + probe │ │ └── audit-config.yaml # NATS audit subjects + types this Blueprint emits │ ├── blueprint.yaml # configSchema + placementSchema (active-hotstandby ONLY) │ ├── README.md # 80-line deployment + failover semantics │ └── tests/cnpg-pair-render.sh # 5-case render gate └── DESIGN.md # topology, lag-threshold rationale, deferred C-DB-3 plan Default-OFF gate per the brief: helm template with default values renders ZERO resources; helm template with cnpgPair.enabled=true + both regions + image.tag renders 8 resources (2 Cluster CRs, 1 Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap). Empty image.tag fails fast at template-render per Inviolable Principle #4a; same primary/replica region fails fast (degenerate pair). 
All 5 render gates pass locally; helm lint + YAML parse clean. CI smoke-render gate fix (single-line behavior change in blueprint-release.yaml): adds a `catalyst.openova.io/smoke-render-mode: default-off` annotation opt-in so charts that legitimately render zero at default values (this chart + future bp-*-pair Blueprints) skip the `<5 lines` empty-render check. The chart's own tests/cnpg-pair-render.sh covers the enabled-render path; without the annotation the empty-render check still fires unchanged. Seam-map additions (return diff for 01-canonical-seams.md Platform table): - service.cilium.io/global=true ClusterMesh global Service annotation (first chart in the repo to use it; pattern reused by Continuum K-Cont-2 for HTTPRoute weight=0 cross-region drains) - bp-*-pair active-hotstandby cluster-pair pattern (primary+replica Cluster CRs colocated in one Blueprint, region-pinned via openova.io/region node-affinity) - audit-config ConfigMap co-located with the emitting Blueprint (label-selector discovery for K-Cont-2 + U-DR-1; future bp-*-pair Blueprints follow this convention) - smoke-render-mode=default-off Chart.yaml annotation opt-in for the blueprint-release smoke gate C-DB-2 (publish): existing blueprint-release.yaml workflow auto-detects `platform/*/chart/**` paths — no allowlist edit required. First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build. C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the future implementer's brief is self-contained. Tests: - bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS - helm lint platform/cnpg-pair/chart ✓ clean - helm template ... 
| python3 yaml.safe_load_all ✓ 8 docs parse clean - smoke-gate logic simulated locally ✓ default-off annotation honored Pre-existing CI failures untouched: - TestPinIssue rate-limit flake — not affected by chart-only slice - TestBootstrapKit/gitea version drift — only iterates over a fixed 10-chart bootstrap list (no cnpg-pair entry) Out of scope per brief (all deferred to dedicated slices): - K-Cont-2 reconciler logic - K-Cont-3 lease witness - K-Cont-4 Cloudflare Worker - C-DB-3 1M-row acceptance test - Application controller changes - U-DR-1 UI Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 05:16:55 +04:00
e3mrah	ddbe44918f	feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101 ) (#1151 ) Slice K-Cont-1 of EPIC-6 (#1101) ships the Continuum product skeleton: - core/controllers/continuum/{cmd,internal/{controller,events}} - cmd/main.go — controller-runtime Manager bootstrap; leader election; /healthz, /readyz, /metrics endpoints; env-only config per INVIOLABLE-PRINCIPLES #4 - internal/controller — ContinuumReconciler with no-op Reconcile() (K-Cont-2 fills the body); SetupWithManager() watches Continuum CRs via unstructured.Unstructured per ADR-0001 §2.7 (no controller-gen) - internal/events — placeholder package documenting K-Cont-2's NATS audit-event-type list - Containerfile — multi-stage Go build → alpine:3.20 runtime, UID 65534 - products/continuum/chart/ — full Helm chart shape (default-OFF): - Chart.yaml + values.yaml (continuum.enabled: false; image.tag empty; fail-fast on empty tag at render time) - templates/{_helpers.tpl, deployment, service, serviceaccount, rbac, networkpolicy}.yaml - blueprint.yaml — OpenOva Blueprint manifest with configSchema + placementSchema (single-region: management cluster) + depends: bp-cnpg-pair + bp-powerdns - crds/README.md — pointer to the canonical Continuum CRD shipped in products/catalyst/chart/crds/continuum.yaml (B8 #1110); not duplicated - products/continuum/DESIGN.md — chart-vs-binary split decision (Option A: binary in shared core/controllers/ module per CC1 #1135), K-Cont-2 fill list, K-Cont-3 lease witness API contract sketch - .github/workflows/build-continuum-controller.yaml — event-driven CI (NO cron) with go vet + go test -race + helm template ON/OFF resource count gates + fail-fast verification + GHCR build & push (cosign keyless signed) + repository_dispatch for chart-bump fan-out helm template verification: - continuum.enabled=false → 0 resources (default OFF) - continuum.enabled=true + image.tag=ci-test → 6 resources (ServiceAccount, ClusterRole, 
ClusterRoleBinding, Deployment, Service, NetworkPolicy) - continuum.enabled=true + empty image.tag → render fails per #4a go vet ./continuum/... → clean. go test -count=1 -race → all green. Out of scope (per the K-Cont-1 brief): - Reconcile body — K-Cont-2 - Lease witness implementations — K-Cont-3 - Cloudflare Worker source — K-Cont-4 - bp-cnpg-pair Blueprint — C-DB-1 Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 04:45:00 +04:00
e3mrah	b0ed216e81	feat(catalog): catalog-svc HTTP REST service + chart wiring (slice L1+L2, #1097 ) (#1148 ) EPIC-2 Slice L of #1097. Multi-source Blueprint catalog HTTP REST service backed by Gitea (3 sources: public mirror, sovereign-curated, per-Org private). Replaces the per-Org SME catalog per ADR-0001 §4.3 (different scope: SME's was Org-bound; catalyst-catalog is Sovereign- wide multi-source). L1 — core/services/catalyst-catalog/ Go service: - Separate go.mod (services group is for HTTP services, controllers group is for CRD reconcilers — documented in DESIGN.md). - Imports the unified Gitea client via Go module replace directive. - Promoted core/controllers/internal/gitea → pkg/gitea so the catalog (a sibling Go module) can import it (Go internal/ rule). 5 Group C controllers updated atomically. - HTTP REST endpoints: /api/v1/catalog{,/{name},/{name}/versions, /{name}/versions/{version}} + /healthz. - Source resolution priority on collision: private > sovereign > public. - Per-Org access filter: caller's Claims.Groups[] determines visible private blueprints; Org A user does NOT see Org B's private set. - 30s TTL LRU cache on blueprint.yaml reads (capacity 1024 default). - Session-cookie / Bearer / ?access_token= claim extraction matching catalyst-api's seam; expired-token rejection in-process. - Containerfile: distroless-static, non-root UID 65532. L2 — products/catalyst/chart/templates/services/catalog/ wiring: - 5 templates (deployment, service, serviceaccount, rbac, httproute) + _helpers.tpl. Default-OFF gate via .Values.services.catalog.enabled. - helm template: 0 catalog resources when OFF, 6 when ON. - Empty image.tag fail-fasts at render per Inviolable Principle #4a. - HTTPRoute exposes /api/v1/catalog on api.<sovereign> hostname. - Chart bumped 1.4.85 → 1.4.86. Gitea client extension (canonical seam, NOT per-service variant): - +ListOrgRepos(ctx, org) []Repo — paginated repo listing. 
- +ListContents(ctx, org, repo, branch, path) []ContentEntry — directory listing for per-Org shared-blueprints fan-out. GitHub Actions workflow: - .github/workflows/catalyst-catalog-build.yaml — push-on-paths + pull_request + workflow_dispatch (NO cron). go vet + go test (race + count=1) + image build → GHCR :<sha>. repository_dispatch fan-out to chart-bump matches the Group C controllers' pattern. Tests (3-tier gate): unit (config, cache, auth, source, handler) + integration (httptest-backed Gitea fixtures across all 3 sources + priority + per-Org access). All green; race detector on. L3 (SME catalog retirement) is deferred per the EPIC-2 master brief. GraphQL deferred (REST first; gqlgen would pull ~80MB of indirect deps for a feature no UI consumer has asked for yet). Co-authored-by: hatiyildiz <hati.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 04:04:52 +04:00
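The collision rule from commit b0ed216e81 (private > sovereign > public) together with the per-Org visibility filter can be illustrated with a small resolver. This is a sketch under assumptions, not the catalog service's code: the `Blueprint` and `Source` types and the `Resolve` function are hypothetical, and it assumes each entry in the caller's groups is simply the name of an Org whose private blueprints the caller may see.

```go
package main

import "fmt"

// Source ranks the three catalog sources; a higher value wins on collision.
type Source int

const (
	Public Source = iota
	Sovereign
	Private
)

// Blueprint is a reduced, hypothetical catalog entry.
type Blueprint struct {
	Name   string
	Source Source
	Org    string // owning Org; only meaningful for Private entries
}

// Resolve returns the visible blueprint per name. Private entries are kept
// only when the caller belongs to the owning Org; among visible entries
// with the same name, the highest-priority source wins.
func Resolve(all []Blueprint, callerGroups []string) map[string]Blueprint {
	member := make(map[string]bool, len(callerGroups))
	for _, g := range callerGroups {
		member[g] = true
	}
	out := make(map[string]Blueprint)
	for _, bp := range all {
		if bp.Source == Private && !member[bp.Org] {
			continue // Org A users never see Org B's private set
		}
		if cur, ok := out[bp.Name]; !ok || bp.Source > cur.Source {
			out[bp.Name] = bp
		}
	}
	return out
}

func main() {
	catalog := []Blueprint{
		{Name: "wordpress", Source: Public},
		{Name: "wordpress", Source: Sovereign},
		{Name: "wordpress", Source: Private, Org: "org-a"},
		{Name: "billing", Source: Private, Org: "org-b"},
	}
	fmt.Println(Resolve(catalog, []string{"org-a"}))
	fmt.Println(Resolve(catalog, nil))
}
```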
e3mrah	66fd0bbae3	refactor(controllers): promote duplicated internal/ packages to shared core/controllers/internal/ (CC1, #1095 ) (#1135 ) Slice CC1 of EPIC-0 (#1095) — Coordinator-led consolidation. The 5 Group C controllers (slices C1-C5: organization, environment, blueprint, application, useraccess) all merged with their own per-controller go.mod + per-controller internal/ tree. This PR canonicalizes the shared layout per `02-implementer-canon.md` §1+§2: * One go.mod at core/controllers/go.mod (Path A — single shared module) * Shared helpers under core/controllers/internal/: - semver/ (was: blueprint/internal/semver + application/internal/semver, now exposes blueprint's IsValidRange + app's IsExact, with the union of both test corpora) - placement/ (was: application/internal/placement; promoted per seam map) - render/ (was: application/internal/render; promoted per seam map) - labels/ (was: useraccess/internal/labels; promoted per seam map — Manara-style scope matcher, owner-of-record C5) Module-discipline decision (Path A vs Path B): Path A. The 5 controllers' go.mod files use the same controller-runtime v0.19.0, k8s.io/* @ 0.31.x, sigs.k8s.io/yaml v1.4.0, etc. The only drift was organization-controller on k8s.io/api 0.31.0 vs the others on 0.31.1 — a trivial bump. Independent dep-version pinning would only be valuable if a controller needed a hostile dep the others shouldn't pull; nothing in the current tree is hostile. Containerfiles + workflows updated: * 5 Containerfiles now COPY core/controllers/{go.mod,go.sum,internal/} plus the per-controller tree from a repo-root build context. * 4 per-controller workflows (application/environment/organization/ useraccess; blueprint-controller has no dedicated workflow yet) now trigger on core/controllers/{<name>/, internal/, go.mod, go.sum} and run go vet + go test scoped to their own tree + shared internal. * useraccess workflow context flipped from core/controllers/useraccess to . 
(repo root) so the Containerfile can reach the shared go.mod. Subpackages NOT promoted in this PR (compromise — flagged for follow-up): * gitea/ — 4 of 5 controllers each ship a Gitea HTTP client. The APIs DIVERGE (organization has Org+Repo CRUD with Repo struct return values; application/blueprint/environment have File CRUD with Org-not-found sentinel). A SUPERSET package would require renaming methods (e.g. EnsureRepo collides on signature) which crosses the brief's "no API redesign" line. CC2 follow-up slice should design the unified surface before promoting. * validate/ — application's package validates Application.spec.parameters against a JSON Schema (santhosh-tekuri lib); blueprint's validates Blueprint CR business rules (semver-backed). Same dir name, completely different functions — not actually duplicates. * gitops/ — environment's renders Flux GitRepository for an Environment; organization's renders HelmRelease+Namespace for an Org. Same dir name, different inputs and outputs. Test-coverage delta: pre-consolidation 134 root-level tests (sum across 5 modules); post-consolidation 133 tests. Net delta -1: blueprint and application each had their own TestIsValidRange in their semver pkg; the shared semver pkg's TestIsValidRange now exercises the union of both controllers' valid+invalid input corpora — coverage strictly improved even though one redundant test name disappeared. Verified locally: go build + go vet + `go test -count=1 -race ./...` all clean; all 5 controller binaries (cmd/) link successfully. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:54:42 +04:00
e3mrah	dbf585744c	feat(controllers): land application-controller (slice C4, #1095) (#1133) Watches Application.apps.openova.io/v1 CRs and reconciles each Application to per-region kustomization + helmrelease manifests in the per-Org Gitea repo (gitea.<location-code>.<sovereign-domain>/<org>/<app>). Reconcile flow per slice C4 brief: 1. Resolve parents: spec.environmentRef → Environment CR, then Environment.spec.organizationRef → Organization CR. Pending-on-miss. 2. Fetch Blueprint at spec.blueprintRef.{name,version} (v1 with v1alpha1 fallback). Pending-on-miss. 3. Validate spec.parameters against Blueprint.spec.configSchema via github.com/santhosh-tekuri/jsonschema/v5. On invalid → status.phase=Failed + Condition reason=Invalid listing every failing JSON pointer. 4. Validate placement against Blueprint.spec.placementSchema.modes. 5. Resolve placement → per-region work plan: - single-region: regions[0] only, role=primary - active-active: every region rendered identically (sorted for byte-stability), role=active, no primaryRegion - active-hotstandby: regions[0] primary, regions[1..] standby (replicas: 0 + _openova_standby: true overlay; Continuum #1101 flips on switchover) 6. Render kustomization.yaml + helmrelease.yaml per region under clusters/<region>/applications/<app>/{...}.yaml on the env-type-mapped branch (develop|staging|main per NAMING §11.2). 7. Idempotent commit via gitea.PutFile's byte-equality short-circuit — re-reconcile on steady state = 0 Gitea writes (slice C4 brief test #7). 8. Status update: phase / primaryRegion / regions[] / giteaRepo / installedBlueprint{name,version,digest} / conditions[]. 9. Finalizer + cascade delete: on metadata.deletionTimestamp, removes every manifest the controller wrote and releases the finalizer. Architecture compliance per docs/INVIOLABLE-PRINCIPLES.md: - Flux is the only reconciler. Controller writes to Gitea; Flux applies. NO direct K8s create of HelmRelease/Kustomization/Service. 
- Dynamic client + unstructured.Unstructured (no controller-gen, no zz_generated_deepcopy.go). - Every value is environment-configurable (GITEA_API_URL, GITEA_TOKEN, GITEA_PUBLIC_URL, SOURCE_NAMESPACE, HELMRELEASE_INTERVAL, CATALOG_SOURCE_REF, REQUEUE_AFTER_SECONDS, METRICS_ADDR, HEALTH_ADDR, LEADER_ELECT, LEADER_ELECT_NS, LOG_LEVEL). - SHA-pinned images via the focused build-application-controller.yaml workflow (push-on-paths + PR + workflow_dispatch — no cron). Tests cover the full 9-test matrix from the brief plus 3 bonus paths: T1 Pending on missing Environment (no Gitea writes). T2 Pending on missing Blueprint (no Gitea writes). T3 Invalid on parameters schema mismatch — Condition message names the failing path 'replicas'; no Gitea writes. T4 single-region happy path → expected manifests written under clusters/<region>/applications/<app>/ on branch=main, finalizer added, status.phase=Provisioning, status.primaryRegion populated, status.giteaRepo populated. T5 active-active fan-out → 2 regions, 2 manifest sets byte-equal after region-name canonicalisation. status.primaryRegion empty. T6 active-hotstandby → primary renders replicas:3 (user param); standby renders replicas:0 + _openova_standby:true marker. T7 Idempotency → re-reconcile after success = 0 Gitea writes (PutFile byte-equality short-circuit). T8 Deletion cascade → manifests removed from Gitea, finalizer released after delete pass. T9 Drift detection → Gitea-side manifest hand-edited; controller restores byte-identical original on next pass. + Pending on Gitea Org missing (org doesn't exist in Gitea even though Organization CR exists — slice C1 hasn't run yet). + Invalid placement-vs-blueprint-allowed-modes (placement-active-active rejected on a Blueprint declaring only single-region). Module path: github.com/openova-io/openova/core/controllers/application (per-controller go.mod, matching siblings C1/C2/C3/C5; CC1 promotes shared internals to core/controllers/internal/ in a follow-up slice). 
`go vet ./...` clean. `go test -count=1 -race ./...` all green. Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:34:22 +04:00
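Step 5 of the application-controller reconcile flow (commit dbf585744c) maps a placement mode to a per-region work plan. The sketch below restates those three rules in Go. It is not the controller's placement package: `RegionPlan` and `Plan` are hypothetical names, and the standby overlay (`replicas: 0`, `_openova_standby: true`) is reduced to a role string.

```go
package main

import (
	"fmt"
	"sort"
)

// RegionPlan is a hypothetical, reduced per-region work item.
type RegionPlan struct {
	Region string
	Role   string // "primary", "active", or "standby"
}

// Plan resolves a placement mode and region list into a work plan,
// following the three modes named in the commit message.
func Plan(mode string, regions []string) ([]RegionPlan, error) {
	if len(regions) == 0 {
		return nil, fmt.Errorf("placement %q needs at least one region", mode)
	}
	switch mode {
	case "single-region":
		// Only regions[0] is rendered, as primary.
		return []RegionPlan{{Region: regions[0], Role: "primary"}}, nil
	case "active-active":
		// Every region is rendered identically; sorting a copy keeps the
		// output stable regardless of the order in the spec.
		sorted := append([]string(nil), regions...)
		sort.Strings(sorted)
		plan := make([]RegionPlan, 0, len(sorted))
		for _, r := range sorted {
			plan = append(plan, RegionPlan{Region: r, Role: "active"})
		}
		return plan, nil
	case "active-hotstandby":
		// regions[0] is primary; every later region is a standby.
		plan := []RegionPlan{{Region: regions[0], Role: "primary"}}
		for _, r := range regions[1:] {
			plan = append(plan, RegionPlan{Region: r, Role: "standby"})
		}
		return plan, nil
	default:
		return nil, fmt.Errorf("unknown placement mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"single-region", "active-active", "active-hotstandby"} {
		plan, err := Plan(mode, []string{"hel", "fsn"})
		fmt.Println(mode, plan, err)
	}
}
```

Order matters only for `active-hotstandby`, where the first region is the primary; `active-active` deliberately ignores the order given in the spec.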
e3mrah	8988cd9e4f	feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095 ) (#1131 ) Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap. Architecture: hybrid singular-path + secondary-region overlay. - The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched — every existing Sovereign state (omantel, otech) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc) and produces a no-op plan diff. - New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The shared hcloud_firewall + hcloud_ssh_key (one tenant boundary per Sovereign). Why hybrid not full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive — secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook required. Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR. Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/. 
Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner): - legacy_no_regions_payload (var.regions=[]) - single_region_entry_does_not_double_provision (len==1) - three_region_mgmt_fsn_hel (EPIC-6 shape) - same_region_duplicates_produce_distinct_keys - non_hetzner_regions_are_filtered_out (oci entries skipped) All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/*. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron. Validation: $ tofu validate Success! The configuration is valid. $ tofu fmt -check -recursive exit=0 $ tofu test tests/multi_region.tftest.hcl... pass run "legacy_no_regions_payload"... pass run "single_region_entry_does_not_double_provision"... pass run "three_region_mgmt_fsn_hel"... pass run "same_region_duplicates_produce_distinct_keys"... pass run "non_hetzner_regions_are_filtered_out"... pass Success! 5 passed, 0 failed. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:29:44 +04:00
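The keying rule described in the G1 commit above (secondary regions keyed by "{cloudRegion}-{index}", non-Hetzner entries skipped, regions[0] left on the legacy singular path) can be illustrated with a minimal Python sketch. This is not the OpenTofu code from the commit. The field names `provider` and `cloudRegion`, the function name, and the use of the original list position as the index are assumptions made for illustration.

```python
def secondary_region_keys(regions):
    """Return for_each-style keys for regions[1:] that target Hetzner.

    Assumption: each region is a dict with hypothetical fields
    "provider" and "cloudRegion"; the index is the position in the
    original list, so duplicates of the same cloud region stay distinct.
    """
    keys = []
    for index, region in enumerate(regions):
        if index == 0:
            continue  # regions[0] stays on the legacy singular path
        if region.get("provider") != "hetzner":
            continue  # e.g. oci entries are filtered out
        keys.append(f"{region['cloudRegion']}-{index}")
    return keys


if __name__ == "__main__":
    payload = [
        {"provider": "hetzner", "cloudRegion": "nbg1"},
        {"provider": "hetzner", "cloudRegion": "fsn1"},
        {"provider": "hetzner", "cloudRegion": "hel1"},
    ]
    print(secondary_region_keys(payload))
```

Under these assumptions an empty list and a single-entry list both yield no secondary keys, which mirrors the "identical plans before and after" property the commit claims for existing Sovereigns.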
e3mrah	2ab442544e	feat(controllers): land environment-controller (slice C2, #1095 ) (#1127 ) Implements slice C2 of EPIC-0 #1095 — the environment-controller Go binary. Watches Environment.catalyst.openova.io/v1 CRs (cluster-scoped) and reconciles each Environment to: 1. Verify the per-Org Gitea Org exists (parent Organization gate). Missing org surfaces GiteaOrgReady=False + Pending phase, never panics or crashloops. 2. Track the canonical branch name for this Environment in status.giteaRepoRef.{org,branch} per NAMING-CONVENTION.md §11.2 item 1 (develop/staging/main ↔ dev/stg/prod; uat/poc map to their own branch name). 3. Idempotently write per-vCluster Flux GitRepository manifests into the Org's Gitea repo at the canonical path `clusters/<host-cluster>/environments/<env-name>/gitrepository.yaml` per NAMING §11.2 item 3. Multi-region Environments fan out one commit per spec.regions[]. Identical bytes short-circuit (zero spurious commits in repo history); drift triggers an overwrite with the existing blob SHA. 4. Surface the canonical JetStream subject prefix `ws.{organizationRef}-{envType}.>` on status.jetstreamSubjectPrefix per NAMING §11.2 item 4 + ARCHITECTURE.md §5. Per-Environment NATS Stream CR creation is OUT OF SCOPE here — NACK isn't installed yet (future slice). 5. Set status.phase, status.regionCount (printer column), status.vclusters[], status.observedGeneration, and the Ready/GiteaOrgReady/GitRepositoryWritten conditions. Architecture rules honored (per docs/INVIOLABLE-PRINCIPLES.md + docs/adr/0001-catalyst-control-plane-architecture.md): - Flux is the only reconciler in production. The controller writes manifests to Gitea; Flux applies them. NO kubectl apply, NO helm install, NO exec.Command in the codebase. - Crossplane is cloud-only. This controller is K8s-to-K8s native via controller-runtime + client-go. - DR is a Placement, not an Env Type. 
The controller treats spec.envType as the schema-validated enum {prod|stg|uat|dev|poc} with no special-case for DR (per NAMING §11.1). - Sovereign-independent. The Gitea base URL, secret ref, branch suffix, commit author, and Flux interval are ALL runtime config (per Inviolable Principle #4 — never hardcode). Files: - core/controllers/environment/api/v1/types.go — Environment Go types matching the CRD; hand-written DeepCopy to avoid build-time codegen tool dependency. - core/controllers/environment/internal/gitea/client.go — minimal GitHub-compatible REST client targeting Gitea's /api/v1 (GET /orgs/{org}, GET/POST/PUT /repos/{org}/{repo}/contents/{path}). Idempotent UpsertFile with byte-equality short-circuit + blob-SHA conflict refusal. - core/controllers/environment/internal/gitops/render.go — pure template rendering of the Flux GitRepository CR. Deterministic field ordering for byte-equality idempotency. - core/controllers/environment/internal/controller/environment_controller.go — reconciler: validate spec, gate on Gitea Org, fan out per-region manifest writes, set status + conditions. - core/controllers/environment/cmd/main.go — controller-runtime manager entry point with leader election. - core/controllers/environment/Containerfile — two-stage build, alpine:3.20 runtime, non-root UID 65534, ENTRYPOINT. - core/controllers/environment/deploy/rbac.yaml — ClusterRole watching Environments + status subresource + leader election lease. - .github/workflows/build-environment-controller.yaml — CI mirrors build-cert-manager-dynadot-webhook.yaml: vet + race tests, docker buildx + cosign keyless sign + SBOM attest, push to ghcr.io/openova-io/openova/environment-controller. 
Tests (35 total, all GREEN, race-detector enabled): - internal/controller (T1–T11): T1 happy-path single-region reconcile T2 idempotent re-reconcile (zero spurious commits) T3 parent Org missing → Pending + GiteaOrgReady=False (no panic) T4 multi-region fan-out (3 commits, 3 regions) T5 drift detection — operator hand-edit gets overwritten T6 placement-vs-regions cardinality violations → Failed T7 env_type→branch mapping table T8 Gitea repo missing → Pending + GiteaRepoMissing reason T9 partial-failure one region → Degraded with that region Failed T10 Config.Defaults applies the documented defaults T11 NotFound between dequeue and Get is benign - internal/gitea: GET /orgs OK + 404 + 500; UpsertFile create / idempotent / update with SHA / repo-not-found; pathEscape preserves slashes; arg-validation. - internal/gitops: BranchForEnvType / JetStreamSubjectPrefix / HostClusterName (with override) / GitRepositoryPath / RenderGitRepository (deterministic + complete + anonymous + default interval + required-field validation) / EnvironmentName. go vet ./... clean. go test -count=1 -race ./... GREEN. Out of scope per slice brief: organization-controller (C1), blueprint-controller (C3), application-controller (C4), useraccess-controller (C5), catalyst-api codebase changes, NACK install, per-Environment NATS Stream CRs. Co-authored-by: hatiyildiz <hati@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:05:53 +04:00
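The three pure naming rules the environment-controller commit describes (env type to branch, JetStream subject prefix, GitRepository manifest path) can be written out as a small Python sketch. The real implementation is Go code under internal/gitops; the function names here are illustrative, and the organization, cluster and environment names in the usage lines are made-up examples.

```python
# Mapping stated in the commit: develop/staging/main <-> dev/stg/prod,
# while uat and poc map to a branch of the same name.
BRANCH_BY_ENV_TYPE = {
    "dev": "develop",
    "stg": "staging",
    "prod": "main",
    "uat": "uat",
    "poc": "poc",
}


def branch_for_env_type(env_type):
    """Return the canonical branch for an env type, rejecting unknown values."""
    try:
        return BRANCH_BY_ENV_TYPE[env_type]
    except KeyError:
        raise ValueError(f"unknown envType: {env_type!r}") from None


def jetstream_subject_prefix(organization_ref, env_type):
    """Build the prefix in the form ws.{organizationRef}-{envType}.>"""
    return f"ws.{organization_ref}-{env_type}.>"


def git_repository_path(host_cluster, env_name):
    """Build clusters/<host-cluster>/environments/<env-name>/gitrepository.yaml"""
    return f"clusters/{host_cluster}/environments/{env_name}/gitrepository.yaml"


if __name__ == "__main__":
    # "acme" and "hz-fsn" are hypothetical names used only for this example.
    print(branch_for_env_type("stg"))
    print(jetstream_subject_prefix("acme", "prod"))
    print(git_repository_path("hz-fsn", "acme-prod"))
```

Keeping these as pure functions of their inputs is what makes the byte-equality idempotency check in the commit workable: the same spec always renders the same path and content.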
e3mrah	84167a768e	feat(controllers): land organization-controller (slice C1, #1095) (#1129) A thin in-cluster Go controller that watches Organization CRs (orgs.openova.io/v1) and reconciles four downstream artifacts per the EPICS-1-6 unified design §3.3 + §3.7 and ADR-0001 §2.7: 1. vCluster HelmRelease — written into the per-Org Gitea repo (NOT direct apply; Flux reconciles per ADR-0001 §2.1). 2. Keycloak group — at path /<slug> with attributes {org=[<slug>], tier=[<sme|corporate>]}. 3. Gitea Org — auto-created if absent; one repo per Org seeds the vCluster + tenant manifests. 4. UserAccess CR — one per spec.owners[] entry; slice C5's useraccess-controller materializes the RoleBindings. Per ADR-0001 §2.2 (Crossplane is cloud-only) this is K8s-to-K8s reconciliation NOT a Crossplane Composition. Per §2.1 the controller writes manifests via the Gitea HTTP contents API — never kubectl apply, never helm install, never exec.Command("helm", ...). Idempotent: re-running on a steady-state CR is a no-op (every "ensure" is find-or-create with byte-equal short-circuit on PutFile). What ships: - core/controllers/organization/cmd/main.go — entry point with envconfig, leader election, signal handling - core/controllers/organization/internal/controller/ — reconciler + KeycloakClient interface + LiveKeycloak impl - core/controllers/organization/internal/gitea/ — minimal Gitea Admin REST client (Org/Repo + contents-API). Self-contained — extractable to core/pkg/gitea-client/ when slice C2 needs it. 
- core/controllers/organization/internal/gitops/ — manifest renderer (namespace + vcluster HelmRelease + kustomization) - core/controllers/organization/internal/orgapi/ — Organization Go types mirroring the CRD schema (no deepcopy-gen — inlined) - core/controllers/organization/Containerfile — multi-stage build (alpine-based, runs as UID 65534) - core/controllers/organization/config/{rbac,manager}/ — ClusterRole + Deployment scaffolding for chart consumption (slice F1) - .github/workflows/build-organization-controller.yaml — push/PR/ manual triggers, no cron Tests: 9 unit tests across 3 packages cover happy-path reconcile, idempotency (zero net writes on second reconcile), Keycloak group already exists, Gitea Org already exists, slug/metadata drift, missing CR no-op, byte-equal PutFile no-op, 422-race re-find, template structural-YAML validity, and label-vocabulary compliance. go test -count=1 -race ./... and go vet ./... both clean. Out of scope: environment-controller (C2), application-controller (C4), useraccess-controller (C5 — this controller only WRITES UserAccess CRs). Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:04:29 +04:00
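The "byte-equal short-circuit on PutFile" idea from the organization-controller commit can be shown with an in-memory stand-in. This is a hedged Python sketch, not the Gitea client from the commit: `FakeRepo`, `put_file` and the returned status strings are hypothetical, and a real client would also have to carry the existing blob SHA on updates.

```python
class FakeRepo:
    """In-memory stand-in for a Git-backed contents API (illustration only)."""

    def __init__(self):
        self.files = {}   # path -> bytes currently stored
        self.commits = 0  # number of writes that would create a commit

    def put_file(self, path, content):
        """Write content at path unless identical bytes are already there."""
        existing = self.files.get(path)
        if existing == content:
            return "unchanged"  # short-circuit: no commit, no history noise
        self.files[path] = content
        self.commits += 1
        return "created" if existing is None else "updated"


if __name__ == "__main__":
    repo = FakeRepo()
    manifest = b"kind: HelmRelease\n"
    print(repo.put_file("vcluster/helmrelease.yaml", manifest))
    print(repo.put_file("vcluster/helmrelease.yaml", manifest))
    print(repo.commits)
```

The second call is the steady-state reconcile: it compares bytes, finds no difference, and leaves the commit count untouched, which is the "zero net writes on second reconcile" behaviour the tests in the commit check for.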
e3mrah	dd1699afe3	feat(controllers): land useraccess-controller — fix silently broken Crossplane path (slice C5, #1095, P0) (#1128) Per docs/EPICS-1-6-unified-design.md §3.5 and ADR-0001 §2.3 amendment, K8s-to-K8s reconciliation belongs to thin in-cluster controllers, not Crossplane Compositions. The existing useraccess.compose.openova.io Composition writes RoleBindings via provider-kubernetes — but provider-kubernetes is NOT installed on any production Sovereign (caught in the EPIC-0 audit). Every UserAccess CR has been silently no-op'd. This controller fixes that. What lands: - core/controllers/useraccess/cmd/main.go — controller-runtime Manager with leader election + signal handling, environment-only config - internal/controller/{reconciler,desired,spec,status,types}.go — the reconciler. Watches UserAccess.access.openova.io/v1alpha1 (cluster- scoped, unstructured client) and owns RoleBinding + ClusterRoleBinding via Owns() so drift triggers reconcile via ownerRef indexing - internal/labels/scope.go — Manara DNA scope matcher: AND-within / OR-across, wildcard scopes, EnforcedScopes() per catalog tier (the developer auto-injection of openova.io/env-type=dev) - internal/controller/*_test.go + internal/labels/scope_test.go — 26 unit tests with the controller-runtime fake client. 
Covers happy-path, multi-app/multi-ns fan-out, namespaces:["*"]→CRB, group subjects, drift detection+restore, orphan deletion on spec shrink, idempotency, invalid spec, ownerRef shape, NotFound no-op, and the 5-catalog-tier matrix - deploy/{rbac,deployment}.yaml — ClusterRole/SA/Deployment with non-root, read-only-rootfs, drop-ALL caps, leader-election Role - Containerfile — Alpine 3.20 final stage, CGO_ENABLED=0, UID 65534 - .github/workflows/useraccess-controller-build.yaml — event-driven build (push-on-main + PR test job), SHA-pinned image tags Behaviour: - Per UserAccess CR, materialises RoleBindings (per namespace) or ClusterRoleBindings (when namespaces:["*"]) referencing the canonical openova:application-{admin,editor,viewer} ClusterRoles - ownerRef back to the UserAccess CR with controller=true + blockOwnerDeletion=true so K8s GC cascades deletes - Drift detection: hand-mutated bindings are restored on next pass + Condition Drift=True surfaced for the UI - Idempotent: steady-state reconcile = 0 K8s writes - Status: phase (Pending|Active|Failed), rolebindingsCreated, observedGeneration, conditions[] Out of scope per the brief: - Crossplane Composition deletion (operator retires post-verify) - 5-catalog-tier role inheritance (lands with EPIC-3 #1098) - Keycloak realm-role sync (slice D1b, this controller is consumer) Tests: go vet ./... # clean go test -count=1 -race ./... # 26/26 pass go test ./internal/labels/... -run TestScope # full 5-tier matrix Co-authored-by: Hatice Yildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 00:04:07 +04:00
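The "AND-within / OR-across, wildcard scopes" matcher named in the useraccess-controller commit can be read as follows. This Python sketch is one plausible interpretation, not the Go code in internal/labels/scope.go. It assumes that a scope is a mapping of label key to wanted value, that "*" accepts any value as long as the label is present, and that an empty list of scopes matches nothing. The label key `openova.io/app` is hypothetical; `openova.io/env-type` comes from the commit message.

```python
def scope_matches(scope, labels):
    """AND within a scope: every key in the scope must be satisfied."""
    for key, wanted in scope.items():
        if key not in labels:
            return False
        if wanted != "*" and labels[key] != wanted:
            return False
    return True


def any_scope_matches(scopes, labels):
    """OR across scopes: one fully satisfied scope is enough."""
    return any(scope_matches(scope, labels) for scope in scopes)


if __name__ == "__main__":
    scopes = [
        {"openova.io/env-type": "dev", "openova.io/app": "*"},
        {"openova.io/env-type": "stg", "openova.io/app": "billing"},
    ]
    resource = {"openova.io/env-type": "dev", "openova.io/app": "wiki"}
    print(any_scope_matches(scopes, resource))
```

Under this reading, an enforced scope such as the developer tier's openova.io/env-type=dev would simply be added to every scope before matching, which narrows each AND clause without changing the OR structure.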
e3mrah	358c32c032	ci: add cluster bootstrap-kit drift guardrail (slice H2 scope-reduced, #1095 ) (#1122 ) Adds .github/workflows/cluster-template-drift.yaml — a warn-only workflow that reports drift between each clusters/<sovereign>/bootstrap-kit/ tree and the canonical clusters/_template/bootstrap-kit/. Why warn-only, not enforce: - Every existing Sovereign carries some legitimate drift (per-Sovereign image SHAs, region-specific values overlay) — blocking PRs on diff count would prevent ALL cluster work. - The right place to enforce the boundary is Catalyst's organization- controller (slice C1 of #1095), not CI. Once C1 ships, every new Sovereign bootstrap-kit is generated from _template and the attestation lives at apply-time, not at CI-time. - Retroactively reconciling the existing omantel.omani.works/ and otech.omani.works/ trees (which have 20+ differing files plus structural changes — extra files on each side) is a high-blast-radius maintenance-window operation, NOT a CI scoped slice. What this workflow does: - Triggers on push to main + PR + workflow_dispatch when clusters/** changes. - For each clusters/<sovereign>/ directory, runs `diff -rq` against clusters/_template/bootstrap-kit/ and writes a Markdown report to the run summary AND a sticky PR comment. - Counts differing files + only-in-template + only-in-Sovereign per Sovereign so reviewers can quickly see whether new drift was introduced. Per docs/EPICS-1-6-unified-design.md §3.9 row 2 + §11 row 6 (decision amended from "reconcile + CI gate" to "warn-only CI gate"; structural reconcile deferred to slice C1 organization-controller). Per docs/INVIOLABLE-PRINCIPLES.md #4a — workflow only inspects YAML; no images built, no cloud calls. Refs: #1094, #1095, slice C1 (organization-controller). Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 23:09:50 +04:00
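The three per-Sovereign counters in the drift-guardrail commit (differing files, only-in-template, only-in-Sovereign) can be approximated in a few lines. The workflow itself uses `diff -rq`; this Python sketch is a stand-in with a hypothetical function name, and it counts individual files, whereas `diff -rq` reports a directory that exists on only one side as a single entry.

```python
from pathlib import Path


def drift_counts(template, sovereign):
    """Compare two directory trees file by file and count the drift."""
    template, sovereign = Path(template), Path(sovereign)
    # Index every regular file by its path relative to the tree root.
    t_files = {p.relative_to(template): p for p in template.rglob("*") if p.is_file()}
    s_files = {p.relative_to(sovereign): p for p in sovereign.rglob("*") if p.is_file()}
    common = t_files.keys() & s_files.keys()
    differing = [r for r in common if t_files[r].read_bytes() != s_files[r].read_bytes()]
    return {
        "differing": len(differing),
        "only_in_template": len(t_files.keys() - s_files.keys()),
        "only_in_sovereign": len(s_files.keys() - t_files.keys()),
    }


if __name__ == "__main__":
    import sys

    # Usage (paths are supplied by the caller):
    #   python drift.py clusters/_template/bootstrap-kit clusters/<sovereign>/bootstrap-kit
    if len(sys.argv) == 3:
        print(drift_counts(sys.argv[1], sys.argv[2]))
```

Because the result is only reported and never used as an exit code, a sketch like this fits the warn-only intent of the commit: reviewers see whether a PR moved the numbers, but nothing is blocked.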
e3mrah	eb6a3c1812	fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060 ) * fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56 PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers, HandleSovereignCatalog, HandleSovereignSettings, HandleSovereignTopology) but left four route registrations in cmd/api/main.go that still referenced those handler methods. The catalyst-api build for the merged revert (run 25439549879) failed with: cmd/api/main.go:690:39: h.HandleSovereignUsers undefined cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined cmd/api/main.go:692:42: h.HandleSovereignSettings undefined cmd/api/main.go:693:42: h.HandleSovereignTopology undefined That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never published — only the UI image rolled. Result: omantel.biz catalyst-api pod stuck in ImagePullBackOff. Drop the four route registrations. Same baby, new address — the chroot Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/* endpoints. Also revert two more parallel-baby fragments still on main: - getHierarchicalInfrastructure mode-aware fetcher → single mother URL (the chroot resolves deploymentId from the cookie and the mother-side topology handler serves byte-identical data once cutover-import has persisted the deployment record on the Sovereign's local store) - CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster Kustomization version pin to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign The chroot Sovereign Console at console.<sov-fqdn> is the SAME catalyst-api binary as the mother. 
When that binary runs ON the Sovereign cluster (catalyst-system namespace on the Sovereign itself), there is no posted-back kubeconfig — the catalyst-api IS in the cluster it needs to talk to, and rest.InClusterConfig() returns the right credentials. Without this, every endpoint that needs the Sovereign-side dynamic client returned 503 with "sovereign cluster kubeconfig not yet posted back" — including ListUserAccess (/users page), CreateUserAccess, infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users rendered "list user-access: HTTP 503" because the Sovereign-side catalyst-api was looking for a kubeconfig that doesn't exist on the chroot side of the cutover boundary. Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api deployment by the chart) matches dep.Request.SovereignFQDN. On the mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot, SOVEREIGN_FQDN matches the only deployment served (its own) → use in-cluster. Same fallback applied to tryDynamicClientLocked (loaderInputFor's best-effort live-source client) so /infrastructure/topology and the /cloud graph render with live data on the chroot too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(user-access): empty list when CRD absent + RBAC for chroot Two coupled fixes for the /users page on chroot Sovereign Console: 1. catalyst-api-cutover-driver ClusterRole: grant read/write on useraccesses.access.openova.io. The Sovereign chroot's catalyst-api uses the in-cluster ServiceAccount (per PR #1052). The list call was returning 403 from the apiserver because the SA had no rule covering this CRD. 2. ListUserAccess: return 200 with empty items when the CRD itself is not installed (apierrors.IsNotFound). The access.openova.io CRD ships via a separate blueprint that may not yet be installed on a fresh Sovereign — the page should render its empty state, not a 500 toast. 
Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the in-cluster client path: list call surfaced first as 403 (RBAC), then as 500 "server could not find the requested resource" (CRD absent). Both now resolve to a 200 + []. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): byte-identical /jobs + /cloud — kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint Two parallel-baby paths still made the chroot diverge from the mother on /cloud and /jobs/{jobId}. Both now ship one path that serves byte-identical data on both surfaces. 1. CloudPage rendered fictional topology (Frankfurt, Helsinki, omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the topology query errored — because it fell back to `infrastructureTopologyFixture` from `src/test/fixtures/`. That is a test-only file leaking into production via the production import tree, in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data — empty state when you don't know). Fix: drop the fixture fallback. On error → null → empty-state render. The mother shows the same empty state when its loader returns nothing; byte-identical. 2. JobsTable + JobDetail rendered a flat green-grid because the chroot was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no dependsOn, no parentId, no exec records). Mother's `/api/v1/deployments/{depId}/jobs` returns the rich shape from a per-deployment jobs.Store, which on the chroot starts empty (the mother's exportDeploymentToChild only ships the deployment record, not the jobs.Store contents). Fix: ship one URL on both surfaces — `/api/v1/deployments/{id}/jobs`. 
Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the per- deployment jobs.Store has 0 records: do a one-shot HelmRelease list via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases — exported here, mirrors Watcher.SnapshotComponents without spinning up an informer), pass through snapshotsToSeeds + Bridge.SeedJobsFromInformerList. Subsequent calls read directly from the now-populated store and return rich Job records with dependsOn / parentId / status — exactly like the mother. useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses the same `/api/v1/deployments/{id}/jobs` URL as the mother. 3. HandleDeploymentImport now also loads the imported record into the in-memory deployments map immediately, so `/deployments/{id}/` handlers don't need a pod restart's restoreFromStore to see the chroot-imported deployment. Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(jobdetail): bare-jobName URL — Traefik strips %3A so canonical id 404s JobDetail navigation was 404ing on the chroot because the link builder URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak") and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does not decode `%3A` inside path segments. The catalyst-api router saw the literal "%3A" and Store.GetJob's exact-match path missed. Two coupled fixes: 1. useJobLinkBuilder strips the "<deploymentId>:" prefix before encoding, producing /jobs/install-keycloak (Traefik-safe) instead of /jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts both bare jobName and canonical id (see store.go:781-789). 2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so the URL param resolves regardless of which format the link emitted. Bump chart 1.4.58 → 1.4.59. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cloud): resolve deploymentId from cookie on chroot — was firing topology against undefined CloudPage's topology query fired against /deployments/undefined/... on the chroot (URL is /cloud, no deploymentId path segment), so the page showed "Couldn't load architecture" with all node counts at 0/0. Fix: same pattern as JobDetail — useResolvedDeploymentId() reads the JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling back from URL params. Topology query also gates on `!!deploymentId` so it doesn't waste a 404 round-trip during cookie resolution. Bump chart 1.4.60 → 1.4.61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): single chrome — no frame in frame, no mother handover banner Two visible bleed-throughs from the mother's wizard UX onto the chroot Sovereign Console at console.<sov-fqdn>: 1. Two stacked headers + sidebar inside sidebar ("frame in frame"). SovereignConsoleLayout rendered its own sidebar+header AND the page inside rendered PortalShell which rendered ANOTHER header (its sidebar was already skipped for chroot per a prior fix). User saw two horizontal title bars stacked. Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It runs the cookie/OIDC auth gate + RequiredActionsModal, then renders <Outlet/> with NO chrome. PortalShell is now the single chrome owner on both surfaces: - Mother (/sovereign/provision/$id): renders Sidebar with /provision/$id/X URLs + its header. - Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X URLs + the same header. One sidebar, one header, byte-identical to mother layout. 2. "✓ Sovereign is ready — Redirecting to your Sovereign console" banner on /apps. This is the mother's wizard celebration that tells the operator "you can now jump to your new Sovereign". 
On the chroot the operator IS already on the Sovereign Console; the banner bleeds through because the imported deployment record carries the mother's handover-ready event in its history. Resolution: AppsPage gates the banner, the toast, and the auto-redirect timer on `!isSovereignMode`. Chroot stays clean. Bump chart 1.4.62 → 1.4.63. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page Three chroot-only pages bypassed PortalShell entirely. After SovereignConsoleLayout went auth-only in #1057, they rendered full-bleed with no sidebar / no header — visible look-and-feel break. /settings/marketplace → MarketplaceSettings (wrapped in PortalShell) /parent-domains → ParentDomainsPage (wrapped in PortalShell) /catalog → CatalogAdminPage (deleted) Drop /catalog entirely per founder direction: a separate page just to flip a "publish to marketplace" boolean per app is the wrong shape. The natural place for that toggle is on each /apps card (future PR — needs HandleSovereignApps to join publish state from the SME catalog microservice). Removed: - /catalog route registration in router.tsx - 'Catalog' entry in SovereignSidebar's FLAT_NAV - CatalogAdminPage.tsx (525 lines) - 'catalog' from ActiveSection union + deriveActiveSection regex The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish on the SME catalog service is unaffected; it's exposed at marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future apps-card toggle will call it via the same path. Bump chart 1.4.64 → 1.4.65. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(apps): publish chip on each card — replaces deleted /catalog page Per founder direction: "if the catalog is just labeling an app to be shown in marketplace, why don't we do it through the apps?" — drop the standalone /catalog page (#1058), put the publish toggle on each /apps card. 
Backend (catalyst-api): - New file sme_catalog_client.go — best-effort client for the in-cluster SME catalog microservice at http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s probe budget, returns nil on DNS NXDOMAIN (SME services tier not deployed on this Sovereign — common when marketplace.enabled is false). - HandleSovereignApps decorates each app with `marketplacePublished` bool joined by slug from the SME catalog. nil ⇒ slug not in SME catalog (bootstrap component, or marketplace not deployed) ⇒ FE suppresses the chip. - New handler HandleSovereignAppPublish at PATCH /api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}. Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog. Surfaces upstream status verbatim. Invalidates the cache so the next /apps poll reflects the change immediately. Frontend (AppsPage): - liveAppsQuery returns { statusById, publishedBySlug } instead of the bare status map. - Each AppCard with a non-null marketplacePublished renders a PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH → optimistic refetch via React Query. - Bootstrap components and apps not in the SME catalog have nil → no chip (correct: nothing to toggle). - Cards with marketplace.enabled=false render no chips at all (SME catalog unreachable → nil for every slug). Bump chart 1.4.66 → 1.4.67. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code Audit triggered by founder asking if PRs #1051..#1059 reach NEW Sovereigns or just my manual `kubectl set image` patches on omantel. Answer was: nothing reached anyone except omantel via manual patches. Both contabo AND every fresh Sovereign would install :2122fb8 — the SHA frozen at PR #1040's last manual chart-touch on May 6 morning. 
Root cause: - chart/templates/api-deployment.yaml + ui-deployment.yaml carry LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"), not Helm-templated `{{ .Values.images.catalystApi.tag }}`. - catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag on every push — but no template reads from it. Dead code. - contabo's catalyst-platform Flux Kustomization at ./products/catalyst/chart/templates applies these as raw manifests. - Sovereigns Helm-install the same chart; Helm passes the literal through unchanged. - Both ended up frozen at whatever literal was committed at the last manual chart-touching PR. Fix: 1. CI's deploy step now bumps both the literal SHAs in the two template files AND the unused-but-kept-for-SME-services values.yaml. Sed-patches the literal directly so contabo's Kustomize path keeps working. 2. The commit step adds the two templates to the staged set alongside values.yaml, so every "deploy: update catalyst images to <sha>" commit propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart publish via blueprint-release). 3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the latest literal (currently :8361df4) gets republished and pinned in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml. Why drop the "freeze contabo" intent of the previous comment: The previous comment said contabo auto-roll on every PR was bad because PR #975's image broke contabo (k8scache startup loop). Solution there is: fix the bug in the code, not freeze contabo. Freezing masked real divergence — the reason the founder caught this is that manual omantel patches were the only thing keeping omantel current while contabo + every other fresh Sovereign quietly ran 9 PRs behind. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 21:10:31 +04:00
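The root cause in the commit above is that the Deployment templates carry a literal image reference, so the CI step has to rewrite that literal in place. A minimal Python sketch of such a rewrite follows. The commit says the workflow uses sed; this is an equivalent illustration rather than the workflow's actual command, and the function name and the 7-to-40-character hex tag shape are assumptions.

```python
import re


def bump_literal_image(manifest, image, new_tag):
    """Replace the short-SHA tag of a literal image ref; return (text, count)."""
    pattern = re.compile("(" + re.escape(image) + r":)[0-9a-f]{7,40}\b")
    # A function replacement avoids ambiguity between the group reference
    # and a tag that starts with a digit.
    return pattern.subn(lambda m: m.group(1) + new_tag, manifest)


if __name__ == "__main__":
    template = 'image: "ghcr.io/openova-io/openova/catalyst-api:2122fb8"\n'
    text, count = bump_literal_image(
        template, "ghcr.io/openova-io/openova/catalyst-api", "8361df4"
    )
    print(count)
    print(text, end="")
```

Returning the substitution count lets a caller fail loudly when it is zero, which is the failure mode the commit describes: a bump step that reports success while no consumer of the image ref actually changed.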
e3mrah	953ef8290f	fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs (#980 ) * fix(catalyst-ui): drop stale params={{ deploymentId }} from clean-root Links (#975) #976 collapsed `to="/provision/$deploymentId/<page>"` to clean root paths (`to="/<page>"`) but left the `params={{ deploymentId }}` prop on every callsite, breaking the Vite tsc build with TS2353. Fixes: - Drop `params={{ deploymentId }}` from Links whose target is now a parameterless clean root path (StatusStrip, AppDetail, AppsPage, DecommissionPage, FlowPage, JobDetail, JobsPage, JobsTimeline, SettingsPage, DeploymentsList). - For Links whose `to` still uses `$componentId`/`$jobId`, cast `params` with `as never` to match the existing pattern in cloud-compute/cloud-network/cloud-storage/Sidebar/UserAccess (the dual-mount under provisionRoute + consoleLayoutRoute defeats TS's strict params inference; the runtime path is correct). - Drop `deploymentId` prop + interface field from JobCard / JobRow / JobsTable / AppCard now that the Links don't need it; update test fixtures + the JobsTable row-link assertion to match the new clean `/jobs/$jobId` href. - Drop the unused ArchEdgeType import in k8sAdapter (TS6196). - Dashboard navigateToApp uses `as never` casts to align with the same pattern. * fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs Two paths consume the catalyst-api / catalyst-ui images: 1. bp-catalyst-platform OCI chart (Sovereigns) — values.yaml driven, tag in values.yaml is rendered at helm install time by Sovereign Flux. 2. contabo Kustomize-path — literal image refs in templates/api-deployment.yaml and templates/ui-deployment.yaml. Flux kustomize-controller on contabo reconciles those files directly. The CI deploy step was bumping BOTH on every PR, which auto-rolled contabo every time anyone merged a catalyst-api code change. 
On 2026-05-05 PR #975's k8scache feature broke contabo startup on the auto-roll because contabo has 27 dead-Sovereign kubeconfigs that the new code iterates synchronously at startup, blocking readiness. Fix: keep the values.yaml bump (Sovereigns auto-pick-up via OCI chart which is the right behaviour for fresh provisions). Drop the templates/*-deployment.yaml bump so contabo only rolls when an operator manually commits a validated SHA into those files. Closes the auto-deploy-to-contabo blast radius on every PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 21:24:57 +04:00
e3mrah	2ff50f0591	fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955 ) Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign): #952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden. Fix: - Templatize spec.imagePullSecrets on Deployment + channel-seed Job. - Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`. - Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign. - Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay. #953 — services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain). Fix: - After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field). - Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug. Tests: - platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4). 
- tests/integration/services-build-rewrite.sh — 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly. Refs: #952 #953 (umbrella #915 — alice signup gate 5). Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 15:47:37 +04:00
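The strict `smeTag` bump described in #953 can be illustrated with a minimal Python sketch. The actual workflow uses sed inside services-build.yaml; the function name, the two-space indent, and the 7 to 40 character hex SHA shape below are assumptions made for illustration, not details taken from the repository.

```python
import re

# Assumed line shape: two-space indent, double-quoted hex SHA, nothing else
# on the line. The real pattern in services-build.yaml may differ.
SME_TAG_LINE = re.compile(r'^(?P<prefix>  smeTag: ")[0-9a-f]{7,40}"$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Return values_yaml with the smeTag line pointing at new_sha.

    Raises ValueError unless exactly one line has the expected shape, so a
    renamed or reformatted field fails loudly instead of silently no-opping.
    """
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one smeTag line, found {len(matches)}")
    return SME_TAG_LINE.sub(lambda m: f'{m.group("prefix")}{new_sha}"', values_yaml)
```

Refusing to rewrite when the match count is not exactly one is the point of the sketch: the original bug was a rewrite that matched nothing and still reported success.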
e3mrah	db332f6767	fix(ci): services-build auto-bumps chart patch + dispatches blueprint-release (#874 ) * fix(bp-catalyst-platform): bump 1.4.8 -> 1.4.9 to republish with current services-auth image (#871) Chart 1.4.8 was published from commit `95a06f56` BEFORE the deploy-bot updated templates/sme-services/auth.yaml's image pin from services-auth:fa4395f -> services-auth:95a06f5 (which has the /auth/send-pin alias from PR #869). The blueprint-release workflow fired on `95a06f56` only, so the OCI artifact for 1.4.8 was published with the OLD image SHA in chart bytes. otech103 reconciled 1.4.8 and rendered the auth Deployment with the OLD image -> /auth/send-pin returns 404 -> SME marketplace signup blocked. Same deploy-step race documented in feedback_idempotent_iac_purge.md and the overnight DoD bookmark. Long-term fix is a double-bump sequencing PR (file separately); short-term fix is bumping the chart version so blueprint-release republishes the artifact with the current image pin. No template change. Lockstep slot 13 pin in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml bumps from 1.4.8 -> 1.4.9. Closes #871 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(ci): services-build deploy auto-bumps chart patch + dispatches blueprint-release (#872) Eliminate the recurring race between services-build's deploy commit and blueprint-release's path-trigger on chart-version-bumping PRs. Before: a PR bumping `products/catalyst/chart/Chart.yaml` AND touching `core/services/*` triggered both workflows on the same merge SHA in parallel. blueprint-release packaged the chart at the merge commit (which still held the OLD image SHAs) and published the bumped chart version with stale image refs. services-build's deploy commit landed AFTER, but per GitHub Actions design GITHUB_TOKEN-authored pushes do NOT re-trigger workflows, so blueprint-release never fired again on the corrected chart. 
A manual no-op chart bump PR was the only way to republish (PR #865 chasing PR #864 was the live incident). After: services-build's deploy step 1. sed-rewrites image: lines under products/catalyst/chart/templates/sme-services/.yaml (unchanged) 2. Pure-bash semver patch-bumps Chart.yaml `version:` and `appVersion:` atomically 3. Single commit captures both rewrites 4. Explicit `gh workflow run blueprint-release.yaml -f blueprint=catalyst -f tree=products` dispatches the chart publish (matches catalyst-build's PR #720 pattern) 5. Idempotent push retry re-reads origin/main and bumps from THAT version on conflict, so concurrent CI runs produce strictly increasing patch versions instead of clobbering each other Adds `actions: write` to the deploy job permissions so the gh workflow run dispatch doesn't return HTTP 403. The manual chart-version field in author PRs becomes a floor; CI auto-bumps from there. PR authors should NOT bump the patch themselves any more — the deploy step does it. Major/minor bumps remain the author's call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 08:32:34 +04:00
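The patch bump in step 2 of #872 can be sketched as follows. The commit describes a pure-bash implementation; this Python version is only an illustration, and the helper names and the assumption that `version:` and `appVersion:` each appear exactly once at column 0 are hypothetical.

```python
import re

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def bump_patch(version: str) -> str:
    """Increment the patch component of a plain MAJOR.MINOR.PATCH string."""
    m = SEMVER.match(version)
    if m is None:
        raise ValueError(f"not a plain semver: {version!r}")
    major, minor, patch = (int(part) for part in m.groups())
    return f"{major}.{minor}.{patch + 1}"


def bump_field(chart_yaml: str, field: str) -> str:
    """Patch-bump one top-level field, keeping any surrounding quotes."""
    pattern = re.compile(
        rf'^({re.escape(field)}: "?)(\d+\.\d+\.\d+)("?)$', re.MULTILINE
    )
    if len(pattern.findall(chart_yaml)) != 1:
        raise ValueError(f"expected exactly one top-level {field!r} line")
    return pattern.sub(
        lambda m: m.group(1) + bump_patch(m.group(2)) + m.group(3), chart_yaml
    )


def bump_chart(chart_yaml: str) -> str:
    """Bump version and appVersion together so one commit carries both."""
    return bump_field(bump_field(chart_yaml, "version"), "appVersion")
```

For the retry in step 5, the same `bump_chart` would be applied to the Chart.yaml re-read from origin/main rather than to the local copy, which is what keeps concurrent runs strictly increasing.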
e3mrah	1d93b6c5af	feat(e2e): SME demo Playwright spec — full 6-step happy path (#805 ) (#823 ) Authors the load-bearing investor-demo proof artefact for the SME-tenant turnkey experience epic (#795). The spec walks the FULL happy path against the catalyst-ui SPA and emits 1440×900 screenshots at every assertion so the DoD checklist is satisfied with visual evidence rather than narrative. What landed: - products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts — single linear spec covering Step 1 (marketplace signup) → Step 2 (provisioning) → Step 3 (SME admin first login + dashboard) → Step 4 (create alice via unified-rbac with 3-step ADR-0003 hook progress) → Step 5a (alice on WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to unblocking issues. - products/catalyst/bootstrap/ui/e2e/lib/config.ts — central registry of every URL, hostname, fixture user, and UUID the spec uses. Per feedback_never_hardcode_urls.md, no test inlines a hostname; every asserted host derives from OTECH_FQDN + SME_SLUG. - products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts — wire-shape- faithful page.route mocks for tenant discovery, /api/v1/whoami, /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the deployment endpoints, app placeholders for WordPress/OpenClaw/webmail, and the /api/v1/sme/billing/ledger surface. Each helper is the seam between mock-mode (today) and live-mode (post-#804) so the spec opts out of any single mock by simply not calling that helper. - .github/workflows/sme-demo-e2e.yaml — push + PR + dispatch trigger that runs the spec against a freshly-installed dev tree with VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the SovereignConsoleLayout's auth gate has a non-null sovereignFQDN. Uploads the 805-* screenshot evidence as a 30-day artefact. 
Run today on a fresh checkout: cd products/catalyst/bootstrap/ui VITE_CATALYST_MODE=sovereign \ VITE_SOVEREIGN_FQDN=acme.otech.example \ npm run dev & PLAYWRIGHT_HOST=http://localhost:5173 \ npx playwright test e2e/sme-demo.spec.ts Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 / #798 / #802-followup). Live-mode follow-up (after #804 lands a fresh otech with the SME tenant pipeline wired): drop the mock installers from beforeEach and flip OTECH_FQDN/SME_SLUG via env. The spec stays — only the helper calls change. Per docs/INVIOLABLE-PRINCIPLES.md: #1 (waterfall): the canonical 6-step contract from #805 is asserted in this first cut, not staged across cycles. #2 (never compromise): every step that's deferred is fixme'd with a blocker link, never silently skipped. #4 (never hardcode): every URL routes through e2e/lib/config.ts. Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-04 22:52:07 +04:00
e3mrah	9645a9044a	feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798 ) (#818 ) * feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add the SME-2 metering integration end-to-end. NewAPI is consumed as the upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned mirror, not a fork) — the metering envelope is produced by a Go sidecar that observes the OpenAI-style `usage.total_tokens` field on every 2xx /v1/* response. This avoids forking the upstream binary while still producing the canonical envelope shape on `catalyst.usage.recorded`. A) NewAPI metering sidecar — core/services/metering-sidecar/ - Transparent reverse proxy in front of NewAPI on its own port; the bp-newapi Service routes the cluster-fronting port to the sidecar, which forwards to NewAPI on the pod's loopback. - Observes successful /v1/* JSON responses, parses `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes one envelope on `catalyst.usage.recorded` per completed request. - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed. - Customer-facing latency is NEVER blocked on metering: the response body is restored before publish; on NATS unreachable the envelope is persisted to disk and retried by a background drain loop. - 14 unit tests (proxy + publisher + safeFilename guards). B) sme-billing NATS subscriber — core/services/billing/handlers/ metering_consumer.go - JetStream durable consumer `sme-billing-metering` on stream `CATALYST_USAGE` (provisioned by sme-billing on startup). - Idempotent on metadata.request_id via a UNIQUE partial index on credit_ledger.external_ref; redelivery from the broker collapses to a single ledger row. 
- Customer auto-create on cold start (the rbac sme.user.created envelope may land AFTER the first metered request; we don't strand usage waiting for it). - 11 unit tests covering happy-path, idempotency, malformed-payload poison-pill, missing-request-id, non-negative amount guard, resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak. C) HTTP handler POST /billing/metering/record — handlers/metering.go - Synchronous validate → INSERT credit_ledger → return {ledger_entry_id, balance_after_omr, balance_after_micro_omr, duplicate}. Same payload + idempotency guard as the NATS path. - Auth: superadmin OR sovereign-admin (operator-admin model; end-user LLM traffic flows through the sidecar, never this URL). - 8 unit tests covering happy-path, idempotency, role gating, malformed-JSON, positive-amount rejection, customer-not-found. D) Schema — core/services/billing/store/store.go - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR exact integer — preserves precision at metering rates). - ADD COLUMN external_ref TEXT + UNIQUE partial index for idempotency dedup. - ADD COLUMN metadata JSONB for the raw envelope. - GetCreditBalance projects both amount_omr (legacy) and amount_micro_omr (new) into the integer-OMR view. - GetCreditBalanceMicroOMR returns canonical precision. - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0) distinguishes fresh insert from duplicate without a follow-up SELECT. E) Wiring - core/services/shared/events/nats.go — minimal NATS JetStream publisher + subscriber surface; legacy RedPanda producer/consumer in events.go untouched per [Q-mine-3]. - core/services/billing/main.go — NATS_URL env; subscriber wired in parallel with the existing RedPanda tenant-events consumer. - middleware/jwt.go — exported test helper WithClaims so handler tests can construct an authenticated context without minting a real signed token. 
- .github/workflows/services-build.yaml — metering-sidecar added to the build matrix; deploy job skips it (image consumed by the bp-newapi chart, not products/catalyst sme-services). F) bp-newapi chart (1.0.0 → 1.1.0) - meteringSidecar block in values.yaml: image, port, NATS URL, priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool dir, header names, resources, securityContext (read-only-rootfs). - deployment.yaml renders the sidecar container + emptyDir spool volume when meteringSidecar.enabled (default true). - service.yaml routes the cluster-fronting :3000 to the sidecar when enabled, exposes a separate :3001 → NewAPI direct port for bp-catalyst-platform admin-API traffic (ADR-0003 §3.2). - networkpolicy.yaml allows the sidecar's port + nats-system egress for JetStream publish. Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green. Helm template renders cleanly with sidecar enabled and disabled. Closes #798 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798) Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000") that does NOT scan directly into Go int64 — the integration test TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on the post-redeem balance read. Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is unambiguously bigint and Scan target stays uniform across pre-#798 rows (amount_omr only) and post-#798 rows (amount_micro_omr present). Affects: - GetCreditBalance - GetCreditBalanceMicroOMR - RecordUsage's running-balance read Test mocks updated to match the new SQL prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:32:42 +04:00
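A small Python sketch of the metering arithmetic and the idempotency rule from #798, under stated assumptions. The services themselves are written in Go and dedupe through a UNIQUE partial index on credit_ledger.external_ref; the in-memory class and all function names here are hypothetical stand-ins. Only the formula, the 1,000,000 micro-OMR per OMR ratio, and the default price of 156 come from the commit message.

```python
from decimal import Decimal

MICRO_OMR_PER_OMR = 1_000_000      # 1 OMR = 1,000,000 micro-OMR
DEFAULT_PRICE_MICRO_OMR = 156      # chart default: 0.000156 OMR per token


def usage_amount_micro_omr(total_tokens: int, price: int = DEFAULT_PRICE_MICRO_OMR) -> int:
    """amount_micro_omr = -tokens * priceMicroOMRPerToken (a debit, hence negative)."""
    if total_tokens < 0 or price < 0:
        raise ValueError("tokens and price must be non-negative")
    return -total_tokens * price


def micro_to_omr(amount_micro_omr: int) -> Decimal:
    """Convert an integer micro-OMR amount to OMR without float rounding."""
    return Decimal(amount_micro_omr) / Decimal(MICRO_OMR_PER_OMR)


class InMemoryLedger:
    """Hypothetical stand-in for credit_ledger; the dict key plays the role
    of the unique index on external_ref."""

    def __init__(self) -> None:
        self._rows: dict[str, int] = {}

    def record_usage(self, request_id: str, amount_micro_omr: int) -> bool:
        """Insert one row per request_id; return True if it was a duplicate."""
        if request_id in self._rows:
            return True
        self._rows[request_id] = amount_micro_omr
        return False

    def balance_micro_omr(self) -> int:
        return sum(self._rows.values())
```

Keeping amounts as integers is what lets a redelivered envelope collapse to the same row with no rounding drift between the NATS path and the HTTP path.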
e3mrah	93bd3ace5b	feat(bp-openclaw): workspace controller + per-user pod chart (#803 ) (#810 ) Implements locked decision [A] of epic #795: per-SME-tenant workspace controller deployment + per-user runtime pod, identity-blind by construction. Consumes the per-user newapi-key-{uuid} Secrets rendered by the unified-rbac user-create hook (ADR-0003 §3.3). What this delivers: - platform/openclaw/chart/ bp-openclaw v0.1.0 (no-upstream) - platform/openclaw/runtime/ Go reference runtime (NEWAPI_BASE_URL + NEWAPI_KEY env contract only) - .github/workflows/openclaw-runtime.yaml Event-driven build for the runtime image (paths-on-push + manual rerun; NO schedule:cron per CLAUDE.md). - platform/openclaw/blueprint.yaml Catalyst registration + configSchema. Chart highlights: - Required values guarded by _helpers.tpl :: assertRequired so missing realmURL/clientSecretName/tenant.namespace/baseURL/host fail render with helpful messages. - RBAC: namespaced Role in tenant ns; create verbs split into separate rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md. Label-based ownership (catalyst.openova.io/openclaw-user) enforced at the controller, not in RBAC. - ingress: cert-manager.io/cluster-issuer annotation triggers ACME auto-issuance for openclaw.<sme-domain>. - per-user pod template ConfigMap holds the pod-spec the controller renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders filled at session-start. - networkPolicy covers controller pod only; per-user pod NetworkPolicy is rendered by the controller at session-start (target hostname is read from the per-user Secret which doesn't exist at chart-render time — documented in README.md). Tests: chart/tests/render-toggles.sh (7 cases) covers required-value enforcement, RBAC create+resourceNames violation guard, ServiceMonitor default-off, networkPolicy toggle, pod-template placeholder presence, cert-manager annotation. All seven gates pass locally. Closes part of #795 (epic still open). 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 22:10:24 +04:00
e3mrah	9adca8442a	fix: ci actions:write + auth-layout overflow scroll (#712 followup, #721 followup) (#728 ) Two unrelated production-bug fixes squashed because they came out of the same live verification pass on console.openova.io 2026-05-04. 1. catalyst-build.yaml deploy job permissions PR #720 added a `gh workflow run blueprint-release.yaml` dispatch step at the end of the deploy job to close the bot-deploy-doesn't- trigger-workflows gap from #712. Step has been failing on every run since with HTTP 403 "Resource not accessible by integration" because GITHUB_TOKEN lacks `actions: write` by default. Result: blueprint-release was never dispatched after PR #722–727 merged; the bp-catalyst-platform OCI artifact stayed on the pre-fix chart and any Sovereign provisioned afterwards picked up the buggy chart. Add the missing permission so dispatch succeeds. 2. AuthLayout.tsx vertical centering at small viewport heights The sign-in / verify cards were mathematically centered at 1440×900 (Δ=0.008px verified via getBoundingClientRect in Playwright) but founder reports the card sitting at the top of the screen on real-world viewports. Root cause: the right panel had `flex flex-1 items-center justify-center` which centers ONLY if the inner content fits within the viewport — at smaller heights the form's natural content flow pushed the card off-screen with no scroll fallback. Fix: add `items-stretch` to the outer flex (so the right panel fills full viewport height), `overflow-y-auto` on the right column (so the card can scroll inside its column when too tall), and `py-8` padding on the card wrapper (breathing room when scrolling kicks in). Result: card is vertically centered when content fits, and stays visible (column-scrollable) when it doesn't, on every viewport height from 1024×600 up. Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-04 12:44:44 +04:00
e3mrah	35183af5be	fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712 ) (#720 ) * feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710) Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS operator with a single overlay toggle. Changes ======= products/catalyst/chart: - Chart.yaml 1.2.7 → 1.3.0 - values.yaml: ingress.marketplace.enabled toggle (default false) + marketplace.{brand,currency,paymentProvider,signupPolicy} surface - templates/sme-services/marketplace-routes.yaml: HTTPRoute marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin, / → marketplace; HTTPRoute .<sov> → console (per-tenant wildcard) - templates/sme-services/marketplace-reference-grant.yaml: cross- namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services - .helmignore: stop excluding sme-services/ and marketplace-api/* (only .kustomization.yaml + .ingress.yaml remain Kustomize-only) - All sme-services/* + marketplace-api/* manifests wrapped with {{ if .Values.ingress.marketplace.enabled }} so non-marketplace Sovereigns render the chart unchanged clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: - chart version 1.2.7 → 1.3.0 - ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN} - ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false} infra/hetzner: - variables.tf: marketplace_enabled var (string "true"/"false", default "false") - main.tf: thread var into cloudinit-control-plane.tftpl - cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations products/catalyst/bootstrap/api/internal/provisioner/provisioner.go: - Request.MarketplaceEnabled bool (json:"marketplaceEnabled") - writeTfvars: marketplace_enabled = "true"\|"false" core/pool-domain-manager/internal/allocator/allocator.go: - canonicalRecordSet adds "marketplace" prefix → marketplace.<sov> resolves via PDM at 
zone-commit time (PR #710 explicit record so caches don't depend on the .<sov> wildcard alone) DoD ready ========= - helm template with ingress.marketplace.enabled=false → identical manifest set to 1.2.7 (verified locally) - helm template with ingress.marketplace.enabled=true → emits 17 extra resources: 13 sme-services workloads + 2 marketplace-api + 1 HTTPRoute pair + 1 ReferenceGrant - pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green - catalyst-api builds, provisioner cloudinit_path_test green fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub Actions design, commits authored by GITHUB_TOKEN don't re-trigger workflows. blueprint-release.yaml's `on.push.paths: products//chart/*` filter matches the deploy commit's diff (chart/values.yaml + chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire, but doesn't — leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck on whatever catalyst-api SHA was current at the last manual chart- touching PR. Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb six PRs after the SHA was superseded — every fresh Sovereign installed the buggy pre-#701 image and rejected handover with 401 unauthenticated. Fix: after `git push` succeeds in the deploy job, dispatch blueprint-release explicitly via `gh workflow run`. The dispatched run re-renders + re-publishes the chart with the just-pushed values.yaml. Closes #712. --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io>	2026-05-04 07:49:03 +04:00
e3mrah	b5c9839da7	feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611 ) Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables: UI: - AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server callback; sovereign → client-side OIDC token exchange via oidc.ts) - Router: sovereign console routes (/console/), DETECTED_MODE index redirect, authCallbackRoute dedup fix, authHandoverRoute safety net - StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token before redirecting operator to Sovereign console (falls back to plain URL on error) API: - main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env - deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time - provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON - auth.go: /auth/handover endpoint for seamless single-identity flow Infra: - cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/ - variables.tf: handover_jwt_public_key variable (sensitive, default empty) Chart: - api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars Playwright CI fixes: - playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard - playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix - cosmetic-guards.spec.ts: provision URL /sovereign/provision/ → /provision/* - sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests). Co-authored-by: e3mrah <e3mrah@openova.io>	2026-05-02 19:17:56 +04:00
e3mrah	10c8e997c4	fix(catalyst): restore literal image refs in Kustomize-path deployment YAMLs (#614 ) The feat/global-imageRegistry (#580) PR converted the literal image refs in api-deployment.yaml and ui-deployment.yaml to Helm template expressions ({{ .Values.global.imageRegistry }}...) without updating the CI deploy step to also patch those files. Since the catalyst-platform Flux Kustomization reads these files as raw manifests (not via helm-controller), the Helm template syntax was never rendered, leaving a literal '{{ if ... }}' string as the image reference → InvalidImageName on every Pod start. Root cause: two consumers of the same file — Helm chart path (Sovereign clusters) and Kustomize path (contabo-mkt) — but only the Helm path was handled by the deploy job. Fix: - Restore literal `ghcr.io/openova-io/openova/catalyst-{api,ui}:b50a600` image refs in the Kustomize-path deployment YAMLs (immediate unblock). - Update CI deploy step to sed-patch those literal refs on every deploy commit so future image rolls keep both paths in sync (durable fix). Closes: the InvalidImageName regression introduced in #580. Unblocks: issue #608 (Phase-8b Agent A magic-link auth) — catalyst-api was stuck at InvalidImageName since commit `83ec889f`, preventing the CATALYST_KC_ADDR / session-cookie auth gate from loading. Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 18:29:09 +04:00
hatiyildiz	59fb2b742c	fix(ci): use awk instead of python heredoc in deploy — fixes YAML parse error	2026-05-02 13:48:17 +02:00
hatiyildiz	885e032dc5	fix(ci): deploy job updates values.yaml SHA tags, not Helm template files The previous sed targeted ui-deployment.yaml + api-deployment.yaml for `image: ghcr.io/.../catalyst-ui:.*` but those files use Helm template expressions (`{{ .Values.images.catalystUi.tag }}`), so sed silently no-ops. Result: every catalyst build committed "No changes" and the deployed image was never updated. Fix: switch deploy job to update images.catalystUi.tag and images.catalystApi.tag in products/catalyst/chart/values.yaml via python3 regex (handles multiline YAML reliably). Also bump catalystUi + catalystApi tags to `32c5e43` (the build from #596 / PR #599 — Vite base: '/' fix). Fixes #596 deploy path.	2026-05-02 13:46:03 +02:00
e3mrah	942be6f58d	fix(ci): disable buildx provenance+sbom attestation in dynadot-webhook build (#583) containerd 1.7.x on k3s cannot pull multi-arch images whose OCI index includes an attestation manifest (the unknown/unknown platform entry added by docker/build-push-action when provenance=true). Containerd resolves the manifest index, encounters the attestation entry, fetches its descriptor from GHCR which returns an HTML 404 page, and then caches that HTML page as a blob SHA — every subsequent pull of ANY tag for that image returns the same HTML SHA instead of the real layer. Fix: set provenance=false + sbom=false on the build-push-action step. SBOM attestation is handled separately by cosign attest, which does not embed its manifest into the OCI index. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-02 14:29:58 +04:00
e3mrah	52c6938e02	ci(catalyst-build): watch infra/hetzner/ so cloudinit changes rebuild catalyst-api (#472) Phase-8a-preflight bug #2 (after #471's tftpl escape fix): catalyst-api Docker image bakes /infra/hetzner/cloudinit-control-plane.tftpl. Without this path in the build trigger, fixes to that file do NOT rebuild the image — the running pod keeps using the stale tftpl and provisioning keeps failing with the same Tofu error. Per CLAUDE.md Rule 4a (GitHub Actions is the only build path), the path filter MUST cover every directory the image depends on. Missing infra/hetzner/ was a long-standing latent CI bug — surfaced by Phase-8a #454 first live provision attempt. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:34:13 +04:00
e3mrah	1628a1b3aa	ci(preflight): GHCR auth for A+E + WBS tick — all 4 preflights done (#470) First runs of preflight A (bootstrap-kit) and E (Keycloak) failed with the same error: helm OCI pull from ghcr.io/openova-io/bp-* returning 401 'unauthorized: authentication required'. bp-* are PRIVATE GHCR packages. #460's agent fixed it for B in c26fbcaf. #461 already had GHCR login. This commit applies the same helm-registry-login pattern to A and E. WBS state on main after this commit: - done (35): all chart-level + #317 + #319 + #453 + 4 preflights - wip (0) - blocked (3): 454, 455, 456 (Phase-8 live runs, operator-driven) The preflights' first runs ALREADY surfaced a real CI bug pattern that would have hit Phase 8a — exactly what they're for. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:06:36 +04:00
e3mrah	4a7eb42d26	feat(ci): Phase-8a preflight E — Keycloak realm-import + kubectl OIDC client (closes #462 ) (#468 ) Surfaces Risk R6 (docs/omantel-handover-wbs.md §9a — Keycloak realm-import config-CLI bootstrap timing untested). bp-keycloak 1.2.0 ships a sovereign realm + a public kubectl OIDC client via the upstream bitnami/keycloak chart's keycloakConfigCli post-install Helm hook (issue #326); this workflow proves it actually wires up on a clean cluster before we run it on a real Sovereign. Workflow installs bp-keycloak 1.2.0 on a kind cluster (helm/kind-action v1, kindest/node:v1.30.6 — same versions as test-bootstrap-kit), waits for the keycloak StatefulSet to roll out, polls for the keycloakConfigCli post-install Job by label (app.kubernetes.io/component=keycloak-config-cli), waits for it to Complete, port-forwards svc/keycloak and asserts: 1. /realms/sovereign returns 200 (realm exists in Keycloak's DB). 2. The kubectl OIDC client is provisioned with publicClient=true, redirectUris contains http://localhost:8000 (kubectl-oidc-login default), and the groups client scope is wired with the oidc-group-membership-mapper (the per-Sovereign k3s api-server's --oidc-groups-claim flag depends on this). Acceptance per ticket: if the post-install Job fails, the workflow summary captures Job logs + StatefulSet logs + cluster state via GITHUB_STEP_SUMMARY so a failed run is debuggable without re-running. Triggers are event-driven only per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled" rule — push on the workflow file itself plus workflow_dispatch for ad-hoc re-runs. Closes #462. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:30 +04:00
e3mrah	abac00d8b3	feat(ci): Phase-8a preflight A — bootstrap-kit reconcile dry-run on kind (closes #459 ) (#467 ) Surfaces Risk-register R4 (docs/omantel-handover-wbs.md §9a — bootstrap-kit reconcile-chain order untested under load) before Phase 8a (#454) burns Hetzner credit on test.omani.works. New workflow .github/workflows/preflight-bootstrap-kit.yaml: - kind v0.25.0 + kindest/node:v1.30.6 - Gateway API CRDs v1.2.0 standard channel - Full Flux controller set (fluxcd/flux2/action@main + flux install) - Mock Secrets: flux-system/object-storage, flux-system/cloud-credentials, flux-system/ghcr-pull - Renders clusters/_template/bootstrap-kit/ with SOVEREIGN_FQDN_PLACEHOLDER + ${SOVEREIGN_FQDN} -> test-sov.example.com (matches test harness pattern in tests/e2e/bootstrap-kit/main_test.go:247) - 30 x 30s HR poll loop, never-fail-fast (goal: surface ALL bugs, not stop at first) - $GITHUB_STEP_SUMMARY emits Markdown table of every HR's terminal Ready condition + per-HR describe blocks for non-Ready + recent flux-system events + raw hrs.json artefact (14d retention) - Event-driven only: push on self-edit + workflow_dispatch; no schedule: cron (per CLAUDE.md "every workflow MUST be event-driven") Canonical seam reused (no duplication): - kind setup + flux install pattern from .github/workflows/test-bootstrap-kit.yaml - bootstrap-kit kustomization at clusters/_template/bootstrap-kit/ (the same overlay production Sovereigns consume; substitution shape mirrors tests/e2e/bootstrap-kit/main_test.go:247) - event-driven shape per .github/workflows/check-vendor-coupling.yaml (#428) Out of scope (sibling preflights): - #460 Crossplane provider-hcloud Healthy probe - #461 Cilium Gateway HTTPRoute admission - #462 Keycloak realm-import Validated: actionlint clean, YAML parses cleanly. WBS row #459 in §9 updated: 🟡 in flight -> 🟢 done (workflow shipped). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:26 +04:00
e3mrah	6f9ee43a9d	fix(ci): GHCR auth for bp-crossplane OCI pull in preflight (#460) (#466) Run 25221515110 surfaced the exact blocking error the workflow was designed to surface — but for the install step, not the Healthy probe: Error: INSTALLATION FAILED: failed to perform "FetchReference" on source: GET "https://ghcr.io/v2/openova-io/bp-crossplane/manifests/1.1.3": ... 401: unauthorized: authentication required bp-crossplane is a PRIVATE GHCR package (verified via `gh api /orgs/openova-io/packages/container/bp-crossplane`). The fix mirrors the canonical seam in .github/workflows/blueprint-release.yaml: add `packages: read` to the job permissions and run `helm registry login ghcr.io` against GITHUB_TOKEN before the `helm install oci://...` step. No new pattern; just reuse. This unblocks the actual goal of #460 — observing provider-hcloud Healthy=True (or surfacing whatever blocks it) on a kind cluster. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:15 +04:00
e3mrah	48b73af6ae	feat(ci): Phase-8a preflight C — Cilium Gateway HTTPRoute admission on kind (closes #461) (#465) Surfaces Risk-register R3 (docs/omantel-handover-wbs.md §9a) — Cilium Gateway HTTPRoute admission was untested on contabo because contabo runs Traefik (no `cilium-gateway` Gateway present per ADR-0001 §9.4). This workflow boots a kind cluster, installs upstream Cilium 1.16.5 with `gatewayAPI.enabled=true`, applies the per-Sovereign Gateway shape from `clusters/_template/bootstrap-kit/01-cilium.yaml` (HTTP listener only — TLS is Phase 8a), pulls bp-catalyst-platform:1.1.8 from GHCR, renders its httproute.yaml template with sovereign overlay values, and asserts that `catalyst-ui` and `catalyst-api` HTTPRoutes both reach Accepted=True against the Cilium Gateway. Anti-duplication: GHCR helm-registry-login mirrors blueprint-release.yaml (lines 173-177); kind+Cilium pattern matches playwright-smoke shape; per-Sovereign Gateway is a 1:1 mirror of the canonical bootstrap-kit slot 01 (HTTP listener), no new shape invented. Trigger pattern is event-driven per CLAUDE.md: push on this file or the chart templates it validates, plus workflow_dispatch for re-runs. No cron. Out of scope (Phase 8a/8b): TLS termination, real DNS resolution, backend Deployment health, the 10 leaf bp-* dependencies (which have their own chart-verify smoke runs). Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 20:01:01 +04:00
e3mrah	48a1623b28	feat(ci): Phase-8a preflight B: Crossplane provider-hcloud Healthy on kind (closes #460) (#463) Surfaces Risk-register R2 (docs/omantel-handover-wbs.md §9a: provider-hcloud Healthy=True never observed). New workflow spins up kind, installs bp-crossplane 1.1.3 from GHCR, applies the EXACT Provider + ProviderConfig shape from infra/hetzner/cloudinit-control-plane.tftpl (#425), waits up to 5 min for Healthy=True, plants a fake hcloud-token Secret in flux-system to match the canonical secretRef, and asserts the ProviderConfig is accepted by the API. Reuses existing seams: - helm/kind-action@v1 pattern from .github/workflows/test-bootstrap-kit.yaml - event-driven trigger shape from .github/workflows/check-vendor-coupling.yaml - canonical Provider/ProviderConfig YAML from infra/hetzner/cloudinit-control-plane.tftpl No schedule: cron (per CLAUDE.md "every workflow MUST be event-driven"). No live Hetzner calls; fake-readonly-token only. Real-credential validation is Phase 8a, not this preflight. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 19:58:32 +04:00
e3mrah	1e7d1e67c9	test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429) (#432) Phase 8 of the omantel handover (#369) needs an automated E2E that proves DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with zero contabo dependency post-handover. Today this is a SCAFFOLD: when Phase 4/6/7 land, dispatching the new workflow against a live omantel is the entire Phase 8. Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md): - tests/e2e/playwright/tests/ ← mirror of sovereign-wizard.spec.ts shape (NOT specs/ as the issue body said; actual repo path is tests/) - tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries, workers=1, reporter=list): reused as-is - tests/e2e/playwright/tests/_helpers.ts:reachable(): reused for the pre-flight skip-when-unreachable pattern - .github/workflows/playwright-smoke.yaml: workflow shape (checkout v4, setup-node v4, npm install, playwright install --with-deps chromium, upload-artifact on failure), mirrored, NOT duplicated What ships: - tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests): 1. sovereign Ready + 23/23 blueprints 2. all bp-* HelmReleases Ready=True 3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready") 4. vendor-agnostic Object Storage (post-#425 canonical secret name flux-system/object-storage, NOT hetzner-object-storage) 5. dig +trace omantel.omani.works ends at omantel NS, not contabo 6. zero contabo dependency (omantel /api/healthz keeps returning 200) Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset. - .github/workflows/omantel-e2e-handover.yaml (NEW): workflow_dispatch ONLY (no schedule cron, per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled"). Inputs let the operator override base URLs at dispatch time.
- docs/omantel-handover-wbs.md: new §10 "Phase 8 acceptance criteria (executable DoD)" — 6 bullets 1:1 with the spec test() blocks; §9 status row added for #429 (🟢 scaffold-shipped). Local verification: cd tests/e2e/playwright && npm install && \ npx playwright test --list tests/omantel-handover.spec.ts → 6 tests listed cleanly npx playwright test tests/omantel-handover.spec.ts → 6 skipped (env vars unset, expected) Out of scope (per #425 / #428 territory split): - internal/hetzner/, infra/hetzner/, platform/velero/chart/, clusters/.../34-velero.yaml — #425's vendor-agnostic sweep - .github/workflows/check-vendor-coupling.yaml — #428's coupling guard Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>	2026-05-01 17:52:18 +04:00


93 Commits