06bea550ff
93 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
06bea550ff
|
feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin (#1874)
The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml, the bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish commits land on main, but it cannot detect the "chart bumped but never published" failure mode: the bootstrap-kit pin points at a chart version that GHCR never received because blueprint-release.yaml failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep, runner cancellation, transient GHCR push 5xx).

Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180 and 1.4.181 were "lost" during the TBD-A20 scanner break window (21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist until A58 manually re-fired the workflow via dispatch. Fresh Sovereigns silently fell back to the last working tag.

What this adds
- scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and optional `--ghcr-org <org>`). For every chart pinned in the kit, it lists ghcr.io/<org>/<chart> tags via `gh api /orgs/<org>/packages/container/<chart>/versions --paginate`, then asserts the pinned version appears. Exits 1 on any missing tag.
- A per-chart tag cache avoids redundant paginations.
- .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now passes `--check-ghcr` on `push` to main + `workflow_dispatch` (PR mode stays `--changed-only` and skips GHCR; PRs cannot publish to GHCR anyway). The job stays `continue-on-error: true` under the same observational umbrella as the existing post-merge full sweep so a transient API blip cannot red-flag every chart bump; the missing-tag list still surfaces on the run summary for operator attention.
- Job grants `packages: read` so the workflow GITHUB_TOKEN can list private package versions.

Verification (origin/main snapshot, 2026-05-19)
- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags present: PASS, exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13 both set to a non-existent 99.99.99, the script exits 1 with `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the remediation hint pointing at `gh workflow run blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean exit 0 with the existing "nothing to check" message.

Refs #1872, #1864, #1856.
Closes #1872
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
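The `--check-ghcr` pass described above can be sketched as a pure function. This is a minimal illustration, not the script itself: `find_missing_ghcr_tags` and the injected `list_tags` callable are hypothetical names, and the real audit is a shell script that obtains tags through `gh api`.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def find_missing_ghcr_tags(
    pins: Iterable[Tuple[str, str]],
    list_tags: Callable[[str], Iterable[str]],
) -> List[Tuple[str, str]]:
    """Return the (chart, version) pins whose version has no registry tag.

    `list_tags` is a hypothetical stand-in for the paginated registry
    listing; it is called at most once per chart thanks to the cache.
    """
    cache: Dict[str, Set[str]] = {}  # per-chart tag cache
    missing: List[Tuple[str, str]] = []
    for chart, version in pins:
        if chart not in cache:
            cache[chart] = set(list_tags(chart))
        if version not in cache[chart]:
            missing.append((chart, version))
    return missing
```

A caller would exit non-zero when the returned list is non-empty, which mirrors the "exits 1 on any missing tag" contract stated in the commit message.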
||
|
|
69f2d7d91a |
fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)
Root cause of the auto-bump-pin miss flagged in #1864. The Blueprint Release workflow has been in `startup_failure` since PR #1858 (commit |
||
|
|
cf35b4a9b6
|
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux, openbao, keycloak, gitea) where blueprint.yaml spec.version had silently fallen behind chart/Chart.yaml version, breaking TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart publish, never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml spec.version whenever Chart.yaml version bumps. Both file edits land in the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint lockstep). Idempotent reset-and-rewrite retry preserved for the existing parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
* New step `bump_blueprint` after `bump_pin`: locates ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml (handles both platform-leaf and products-umbrella conventions), filters to kind:Blueprint (defensive against CRD yaml at the products/catalyst/chart/crds path), reads current spec.version at 2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
* Commit step renamed to "Commit + push bootstrap-kit pin bump + blueprint.yaml lockstep"; stages both files, single commit, with convergent retry on conflict.
* Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
* New TestBootstrapKit_BlueprintVersionLockstepSweep: walks platform/* and products/*, discovers every Blueprint manifest with a sibling Chart.yaml, asserts spec.version == Chart.yaml version. Covers ALL ~70 blueprints, not just the canonical 10 kit ones the existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
* Failure messages name the file, drift direction, and the exact sed command to fix, so drift remediation is mechanical.

Drift cleanup (mandatory companion, same shape as A17/#1855): 26 Application-Blueprint blueprints whose spec.version had been left at 1.0.0 / 0.1.0 while Chart.yaml moved forward are synced to Chart.yaml as authoritative. All currently surface in the new sweep test; without the cleanup the test would block this PR (and every subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook, cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores], falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir, netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover, trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder auto-converges all three artifacts (Chart.yaml, blueprint.yaml, bootstrap-kit pin) in a single bot commit. No more manual collector PRs; no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
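The `bump_blueprint` step described above (filter to kind:Blueprint, rewrite spec.version at 2-space indent, verify) can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the real step uses sed inside the workflow, and the sketch assumes the first 2-space `version:` line in the file is spec.version and that the value is unquoted.

```python
import re

# A `version:` key at exactly two leading spaces, with a value on the same line.
_SPEC_VERSION = re.compile(r"^(  version:[ \t]*)(\S+)[ \t]*$", re.MULTILINE)
_KIND_BLUEPRINT = re.compile(r"^kind:\s*Blueprint\s*$", re.MULTILINE)


def lockstep_blueprint_version(blueprint_yaml: str, chart_version: str) -> str:
    """Rewrite the 2-space-indented version line to `chart_version`.

    Non-Blueprint documents (for example a CRD) are returned untouched.
    Raises ValueError when a Blueprint has no rewritable version line,
    which plays the role of the post-write verification.
    """
    if not _KIND_BLUEPRINT.search(blueprint_yaml):
        return blueprint_yaml
    rewritten, count = _SPEC_VERSION.subn(
        lambda m: m.group(1) + chart_version, blueprint_yaml, count=1
    )
    if count != 1:
        raise ValueError("no 2-space `version:` line found in Blueprint")
    return rewritten
```

Running the function twice with the same version leaves the text unchanged, which is the property the idempotent retry in the workflow relies on.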
||
|
|
a8931db541
|
fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855)
Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main:

1. manifest-validation: TestBootstrapKit_BlueprintCardsHaveRequiredFields asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak (1.5.0->1.4.5; blueprint led chart, sync to authoritative Chart.yaml), gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin -> Sovereign install); blueprint.yaml gets resynced down/up to match.
2. pin-sync-audit on push: the full-sweep audit races the blueprint-release auto-bump hook. The chart-bump merge commit has chart=N pin=N-1 drift until the auto-bump bot commits the pin update ~60s later; the bot push (GITHUB_TOKEN convention) does not retrigger this workflow, so the failure remains in run history. Fix: set continue-on-error: true on push/workflow_dispatch events (PR remains blocking via --changed-only). The full-sweep output still surfaces drift on the run summary; it just doesn't fail the overall run while the heal-in-~60s window is open. Documented inline in the job header.

Net effect: every push to main re-runs cleanly green. The 13 pre-existing drifts called out in the existing job comment will continue to heal as each lagging chart gets its next bump (auto-bump hook + this PR's manifest-validation alignment).

Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence to blueprint.yaml versions which the test asserts but the auto-bump hook does not yet update).
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> |
||
|
|
f061088ad3
|
fix(ci): harden TBD-A6 against bootstrap-kit slot indent drift (#1751)
The TBD-A6 auto-bump hook in blueprint-release.yaml AND the
check-bootstrap-kit-pin-sync.sh audit both key on the regex
`^      chart: <name>$` (exactly 6 leading spaces) to find the
slot pinning a given chart. Every existing slot is at that indent
today (audited 2026-05-18 across all 49 bp-* slots + the one
sandbox slot). But if a future slot author writes the
`chart:` / `version:` lines at a DIFFERENT indent (e.g. 4 or 8
spaces, copy-pasting from a tutorial that uses different YAML
style), BOTH systems silently skip that slot:
- The audit reports the chart as "not in the bootstrap kit"
(skipped count, no drift entry).
- The auto-bump hook logs "graceful no-op" and the chart-pin
pair drifts forever undetected.
This is exactly the silent-drift failure mode TBD-A6 was created
to prevent. Two-layer defence:
1. scripts/check-bootstrap-kit-pin-sync.sh: scan every slot file
for `chart:` and `version:` lines at any non-6-space indent
≤14 spaces (deeper than that = legit nested values field).
Fail loudly with an actionable error that points at both the
audit script and the auto-bump regex that need lockstep
updating if the schema ever legitimately changes.
2. .github/workflows/blueprint-release.yaml auto-bump step:
before declaring "no slot pins this chart, graceful no-op",
do a second grep at any indent. If THAT finds something, fail
with the same actionable error — prevents publishing a chart
whose pin will silently lag.
Verified: the indent-sanity scan passes on origin/main today
(no false positives across 51 slot files). Synthetic 4-space and
8-space test slots both trigger the new error correctly.
Edge cases also re-verified live on run 26041355274
(
|
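The indent-sanity scan from layer 1 above can be sketched like this. It is a hedged illustration: `find_indent_drift` is a hypothetical name, and the sketch assumes only keys that carry a value on the same line are of interest, so that the structural `chart:` mapping key of a HelmRelease is not reported.

```python
import re
from typing import List, Tuple

# `chart:` or `version:` followed by a value on the same line.
_KEY_LINE = re.compile(r"^( *)(chart|version):[ \t]+\S")


def find_indent_drift(
    slot_text: str, expected: int = 6, max_indent: int = 14
) -> List[Tuple[int, int, str]]:
    """Return (line number, indent, key) for pin lines at a suspicious indent.

    Lines at the expected indent are fine; lines deeper than `max_indent`
    are treated as nested values fields and ignored.
    """
    hits: List[Tuple[int, int, str]] = []
    for lineno, line in enumerate(slot_text.splitlines(), start=1):
        match = _KEY_LINE.match(line)
        if not match:
            continue
        indent = len(match.group(1))
        if indent != expected and indent <= max_indent:
            hits.append((lineno, indent, match.group(2)))
    return hits
```

An audit built on this would fail loudly whenever the returned list is non-empty, instead of counting the slot as "not in the bootstrap kit".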
||
|
|
5f85c731c1
|
fix(ci): deploy-bot auto-bumps bootstrap-kit pin when chart version bumps (Refs TBD-A6 meta-fix) (#1713)
TBD-A6: every chart-publishing wave in the 2026-05-17/18 session required a SEPARATE manual collector PR to bump the bootstrap-kit pin so Sovereigns would actually install the freshly published OCI artifact. Without the pin bump, the chart at e.g. bp-catalyst-platform:1.4.166 gets published to GHCR but clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml still pins `version: 1.4.165` and fresh Sovereigns silently install the OLD artifact.

Manual collector PRs eliminated by this hook (sample from one session):
  #1676 chart 1.4.162 -> 1.4.163 (Wave 16 collector)
  #1687 chart 1.4.163 -> 1.4.164 (Wave 17 collector)
  #1694 bp-guacamole 0.1.21 -> 0.1.22 (TBD-G6)
  #1695 chart 1.4.164 -> 1.4.165 (Wave 18 collector)
  #1698 chart 1.4.165 -> 1.4.166 (TBD-E8)
  #1700 bp-guacamole 0.1.22 -> 0.1.23 (TBD-G4 phase 2)
  #1706 bp-self-sovereign-cutover 0.1.29 -> 0.1.30 (TBD-C18)
  #1707 chart 1.4.166 -> 1.4.167 (Wave 24 collector)

The fix lives in .github/workflows/blueprint-release.yaml, the single workflow that publishes every chart's OCI artifact. After a successful push + cosign sign + SBOM attest, a new "Auto-bump bootstrap-kit pin" step:
1. Reads ${{ steps.chart.outputs.name }} (e.g. `bp-newapi`).
2. Greps clusters/_template/bootstrap-kit/*.yaml for the canonical `      chart: <name>` line (6-space indent matches every existing slot's HelmRelease.spec.chart.spec.chart shape).
3. If a matching slot file is found, sed-replaces the slot's `      version: <old>` with `version: <new>` and commits + pushes back to main as hatiyildiz <noreply>.
4. If no slot file matches, the chart is an opt-in Application Blueprint (e.g. bp-vllm, bp-temporal) and the step gracefully no-ops.
5. Conflict-tolerant retry up to 3 times with idempotent reset-and-rewrite for the parallel matrix case (multiple charts bumped in the same push).

The bot-author commit does NOT re-trigger workflows (GitHub Actions GITHUB_TOKEN convention), so the chain converges in ONE pass: chart bump -> blueprint-release -> publish artifact -> bump pin. No loop.

A regression test (scripts/check-bootstrap-kit-pin-sync.sh) asserts the convergence contract: every Chart.yaml in platform/* or products/* whose chart name is pinned in clusters/_template/bootstrap-kit/ MUST have the same version on both sides. The .github/workflows/test-bootstrap-kit.yaml workflow now runs this audit:
- On `pull_request`: `--changed-only --base <pr-base>` so a PR is only blocked on chart->pin pairs IT modified. This avoids forcing pre-existing drifts (13 charts as of 2026-05-18, validated via a full sweep against origin/main) to be fixed before any unrelated PR can land. The auto-bump hook will heal those drifts on the next bump of each lagging chart.
- On `push` and `workflow_dispatch`: full sweep so post-merge drift is observable on the run summary.

Why blueprint-release.yaml is the right insertion point (not each build-bp-<name>.yaml or services-build.yaml or catalyst-build.yaml):
- It runs after EVERY chart publish, regardless of upstream trigger.
- It already has the canonical chart name + version in ${{ steps.chart.outputs.name }} + ${{ steps.chart.outputs.version }}.
- One file changed, one hook covers all 51 bootstrap-kit slots plus future additions.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
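Steps 2 to 4 of the auto-bump hook above amount to a match-then-rewrite on the slot file. A minimal sketch, assuming one pinned chart per slot file and the 6-space indent the commit describes; `bump_slot_pin` is a hypothetical name, and the real step uses grep and sed inside the workflow.

```python
import re
from typing import Optional


def bump_slot_pin(slot_text: str, chart: str, new_version: str) -> Optional[str]:
    """Return the slot text with its pin bumped, or None when the slot
    does not pin `chart` (the graceful no-op case)."""
    chart_line = re.compile(rf"^ {{6}}chart: {re.escape(chart)}$", re.MULTILINE)
    if not chart_line.search(slot_text):
        return None
    bumped, count = re.subn(
        r"^( {6}version: ).*$",
        lambda m: m.group(1) + new_version,
        slot_text,
        count=1,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError(f"slot pins {chart} but has no 6-space version line")
    return bumped
```

Because the rewrite depends only on the target version, re-running it after a reset produces the same text, which is what makes a reset-and-rewrite retry safe.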
||
|
|
96674b71c9
|
fix(ci+catalyst-api): hold deploy-bot bumps when any prov is in-flight (was rolling catalyst-api Pod mid-tofu-apply, abandoning provs) (#1688)
Context — t13/t17/t21 incident, 2026-05-17. catalyst-api is single-replica
with strategy: Recreate; the OpenTofu workdir lives on a /tmp emptyDir that
dies with the Pod. When this workflow bumped the image SHA mid-prov, Flux
rolled the Pod and killed `tofu apply` mid-resource. The on-disk record was
rewritten to status=failed by restoreFromStore on the new Pod, but Hetzner
resources tagged with the abandoned deployment-id stayed orphaned and
required manual `hcloud` cleanup. Three consecutive provs died this way in
one afternoon.
Option C (smallest blast radius): gate the deploy-bot at the workflow level.
1. New public endpoint GET /api/v1/deployments/in-flight-count on
catalyst-api. Returns {count, ids} of deployments in Phase-0 in-flight
status (pending / provisioning / tofu-applying / flux-bootstrapping).
Phase-1 (phase1-watching) is observational and resumes across Pod
restarts via resumePhase1Watch, so it does NOT block. Adopted
deployments are excluded. No FQDNs / owner emails in the response —
same information-disclosure posture as /api/v1/subdomains/check.
Unauthenticated; the deploy-bot has no session cookie.
2. .github/workflows/catalyst-build.yaml `deploy` job polls this endpoint
before bumping values.yaml. count==0 → green light. count>0 → sleep
20s and retry. Hard cap 30 min (a stuck prov must not block all
future deploys — that would be the worst possible failure mode for a
CI gate). Fail-open on any non-200 / network error so the gate
cannot itself become an outage.
Notes:
- Mothership URL configurable via vars.CATALYST_API_URL (defaults to
https://console.openova.io). Sovereign chroot self-deploys can point
to their local catalyst-api.
- First-rollout safe: the endpoint does not exist on the LIVE
mothership until THIS PR's image lands, so the first run after merge
falls through the 404 branch and proceeds. Subsequent runs benefit
from the gate.
- NOT a Chart.yaml bump. The deploy-bot itself bumps the literal image
refs in chart templates (existing behaviour), so the new endpoint
reaches Sovereigns through the normal chart-rebake path.
Tests: handler/deployments_in_flight_count_test.go covers Phase-0 vs
Phase-1 vs terminal vs adopted classification + empty-store green light.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
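The deploy gate described in item 2 above (poll, sleep 20s, hard cap of 30 min, fail-open) can be sketched with injected dependencies so the timing logic is testable. The names `wait_for_idle`, `fetch`, `sleep`, and `now` are hypothetical; the real gate lives in a workflow step, and the sketch assumes `fetch` returns an (HTTP status, in-flight count) pair.

```python
from typing import Callable, Tuple


def wait_for_idle(
    fetch: Callable[[], Tuple[int, int]],
    sleep: Callable[[float], None],
    now: Callable[[], float],
    poll_seconds: float = 20,
    cap_seconds: float = 30 * 60,
) -> str:
    """Block until no provision is in flight, the cap expires, or the
    endpoint misbehaves. Returns "idle", "cap-reached", or "fail-open"."""
    deadline = now() + cap_seconds
    while True:
        try:
            status, count = fetch()
        except Exception:
            return "fail-open"  # network error: the gate must not become an outage
        if status != 200:
            return "fail-open"  # e.g. the 404 on first rollout
        if count == 0:
            return "idle"
        if now() >= deadline:
            return "cap-reached"  # a stuck prov must not block deploys forever
        sleep(poll_seconds)
```

In this reading all three outcomes let the deploy proceed; the return value only records why, which could be surfaced on the run summary.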
||
|
|
cadc7b5cea
|
fix(sandbox-ci): mcp-server Dockerfile repo-root context + pty/mcp auto-bump wiring (chart was half-deployable) (#1667)
Sandbox chart was un-deployable end-to-end because three CI-side gaps compounded after PR #1658 wired the mcp-server module to depend on core/controllers + core/services/shared via `replace` directives:

1. **mcp-server Dockerfile built against a too-narrow context**. The workflow passed `context: products/sandbox/mcp-server` and the Dockerfile assumed `COPY . .` could see everything it needed, but the `replace ../../../core/controllers` line in the module's go.mod only resolves when the build can actually reach those paths. Result: every push after #1658 failed at `go build` with `module not found`. Fix mirrors core/controllers/sandbox/Dockerfile (Slice-CC1 layout): COPY the replace targets' module roots + sources, then build with WORKDIR set to the dependent module. Static binary still produced into a distroless/static-debian12:nonroot final stage.
2. **mcp-server workflow had no chart auto-bump step**. Even after a green build, `runtime.mcpImage` in platform/sandbox/chart/values.yaml stayed empty so the chart's `required` guard (deployment.yaml line 72) refused to render. Added the same yq-bump + bot-commit pattern build-sandbox-controller.yaml already uses, targeting `.runtime.mcpImage` and writing a fully-qualified `<repo>:<sha>` string (consumer reads it as one image reference, not a {repository,tag} pair). Also widened paths-filter to include core/controllers/** + core/services/shared/** so changes to the replace targets re-trigger the build.
3. **pty-server workflow had no auto-bump either**. Same surgery: yq-bump `.runtime.ptyServerImage` + commit-and-push. Context stays narrow (pty-server has no cross-tree `replace` directives).
4. **Stop-gap pin values for runtime.{ptyServerImage,mcpImage}** so the next chart roll out doesn't fail-fast before the rebuilt workflows land their first bumps:
   - ptyServerImage → |
||
|
|
1b0e86cb1a
|
ci(sandbox): build workflows for controller + pty-server + mcp-server (so chart can actually deploy) (#1632)
PR #1622 shipped the sandbox-controller binary + chart, and PR #1618 shipped pty-server + mcp-server scaffolds, but neither came with CI build workflows, meaning the chart's image.repository points at a GHCR package that no workflow ever publishes (ImagePullBackOff on every install). Per docs/INVIOLABLE-PRINCIPLES.md #4a every runtime image MUST be produced by a GitHub Actions workflow from a committed git SHA; this PR closes that gap.

Three new workflows, all event-driven (push paths-filter + PR + workflow_dispatch, no cron):
- build-sandbox-controller.yaml: mirrors build-application-controller (shared core/controllers go.mod, go vet + race tests, Buildx push, cosign keyless sign, SBOM attest, auto-bump platform/sandbox/chart/values.yaml image.tag back to main so the next install picks up the SHA-pinned image without operator action).
- build-sandbox-pty-server.yaml: separate go module under products/sandbox/pty-server (own go.mod/go.sum), Dockerfile uses COPY . . so build context is the server directory. Same Buildx + cosign + SBOM flow as the controller. No values.yaml bump yet: Wave-2 wiring of the StatefulSet template will land in a follow-up.
- build-sandbox-mcp-server.yaml: stdlib-only stdio MCP sidecar (no go.sum yet), same shape as pty-server.

Per `feedback_no_mvp_no_workarounds.md` rule 1 (target-state, never "manual follow-up bump") the controller workflow auto-bumps the chart values.yaml so a Sovereign overlay flipping `enabled: true` Just Works. Per the user's hard rule for this PR, no Chart.yaml bump and no blueprint-release dispatch: the Sandbox chart's publication cadence is gated by Wave-2 readiness, not per-image builds.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
0ac12970d8
|
ci(openova-flow): build openova-flow-server + adapter-flux images + sed chart tags (#1398)
Add the two missing GitHub Actions build pipelines for the OpenovaFlow Go binaries so prov #34 has real images to install. Both auto-bump their chart's values.yaml `image.tag` on every main-branch push and dispatch blueprint-release for chart re-publish.

Workflows shipped:
- .github/workflows/build-openova-flow-server.yaml
  · Triggers on push to products/openova-flow/server/** or the chart
  · `go vet` + `go test -race` + Buildx push to ghcr.io/openova-io/openova/openova-flow-server:<sha> + :latest
  · cosign keyless sign + SBOM attest
  · awk-bumps platform/openova-flow-server/chart/values.yaml flowServer.image.tag, commits to main with [skip ci]
  · Dispatches blueprint-release.yaml for chart re-publish
- .github/workflows/build-openova-flow-adapter-flux.yaml
  · Same shape; bumps platform/openova-flow-emitter/chart/values.yaml flowEmitter.image.tag

Chart defaults (`tag: "latest"`) already shipped in PR #1397, so no values.yaml changes are needed in this PR.

Canonical patterns cited (ARCHITECT-FIRST):
- Build shape mirrors .github/workflows/build-application-controller.yaml (Go vet + test + Buildx + cosign + SBOM + values.yaml awk-bump + blueprint-release dispatch).
- awk-sed bump pattern mirrors catalystApi/catalystUi tag-bump in .github/workflows/catalyst-build.yaml `deploy` job (with the `[skip ci]` + explicit blueprint-release dispatch fix from #712).

Per docs/INVIOLABLE-PRINCIPLES.md:
- #4a (GitHub Actions is the only build path)
- event-driven (no cron triggers, only push/PR/workflow_dispatch)

MIRROR-EVERYTHING: image refs in chart values point at harbor.openova.io/proxy-ghcr/...; CI pushes to ghcr.io directly and Harbor proxy-pulls. No direct push to harbor.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3a5d9fc102
|
fix(infra,catalyst-api provisioner): tftpl CI guard + bucket-name suffix (Fix #101 followup, Fix #111) (#1331)
Two infrastructure-hardening fixes that together eliminate ~30 min of provision-cycle waste per regression event documented in Fix #101.

## Fix A — CI guard against unescaped tftpl shell expansion

Adds a grep-based step to .github/workflows/infra-hetzner-tofu.yaml that scans every infra/hetzner/*.tftpl for unescaped ${VAR:-default} inside YAML comment lines. Uses PCRE negative-lookbehind so correctly escaped $${VAR:-default} (templatefile() literal-dollar) does not trip the guard.

Background: PR #1311 (Fix #73) added a YAML comment with bare ${QA_FIXTURES_ENABLED:-false}. tofu's templatefile() parses ALL ${...} sequences regardless of YAML/HCL/shell context; the colon in the interpolation hits HCL's reserved conditional grammar and crashes 'tofu plan' with "Template interpolation doesn't expect a colon at this location". Prov #9 (4204f0b0c5e37a80) wasted ~30 min before PR #1328 fixed the one offender. Without the guard, the next operator who adds a similar comment repeats the incident.

Documented in infra/hetzner/README.md so editors learn the $$ escape pattern before they trip the CI gate.

## Fix B — bucket-name suffix to escape global Hetzner namespace

Hetzner Object Storage bucket names share a GLOBAL namespace across every tenant. The previous BucketNameForSovereign(fqdn) derivation 'catalyst-<fqdn-with-dashes>' would collide on the second CreateDeployment for the same FQDN (re-provision after wipe, two operators on adjacent pools, race conditions) and the second 'tofu apply' would fail with BucketAlreadyExists.

Change BucketNameForSovereign signature to (fqdn, deploymentID) and append the first 8 chars of the deployment-id as a suffix:

  catalyst-omantel-omani-works-b3b837a2

newID() already returns 16-hex random; the leading 8 chars are 32 bits of fresh entropy, enough to make collisions cryptographically negligible.

Backward-compat: empty deploymentID (legacy on-disk records) falls back to first-8-hex of sha256(fqdn) so wipes of pre-Fix-111 Sovereigns remain deterministic.

Call-sites updated:
- handler/deployments.go: id := newID() moved before bucket-name derivation; uses hetzner.BucketNameForSovereign
- handler/wipe.go: passes dep.ID to PurgeBuckets and to BucketNameForSovereign in the report
- hetzner/buckets.go: PurgeBuckets signature now takes deploymentID; bucketSuffix() handles the fallback

Tests:
- hetzner/buckets_test.go: 6-case TestBucketNameForSovereign table covers canonical newID() shape, collision avoidance, uppercase normalisation, empty + non-hex fallback paths. New TestBucketNameForSovereign_CollisionAvoidance asserts the Fix #111 invariant directly.
- handler/deployments_test.go: TestCreateDeployment_DerivesObjectStorageBucketFromFQDN now asserts the suffixed shape against the actual dep.ID.
- All produced names re-validated against the S3 bucket-naming RFC (mirrored regex from provisioner.s3BucketNamePattern).

## Claimed TCs

_None directly — infrastructure hardening; eliminates 30+ min wasted per cycle from regressions like PR #1311 + bucket-collision_

## Verification

- go test ./internal/hetzner/... -run "Bucket" → 9/9 PASS
- go test ./internal/handler/ -run "DerivesObjectStorageBucket" → PASS
- go vet ./... → clean
- go build ./... → clean
- yaml.safe_load on workflow → clean
- pre-existing handler-package fails (whoami, continuum-switchover) are unrelated and present on origin/main

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
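The Fix B naming rule can be sketched as below. This is an approximation of the behaviour the commit describes, not the Go implementation: the function names are Python stand-ins, and lowercasing the FQDN before hashing in the fallback path is an assumption made for determinism.

```python
import hashlib
import re

_HEX8 = re.compile(r"^[0-9a-f]{8}")


def bucket_suffix(fqdn: str, deployment_id: str) -> str:
    """First 8 hex chars of the deployment id, or a deterministic
    sha256(fqdn) prefix when the id is empty or not hex."""
    match = _HEX8.match(deployment_id.lower())
    if match:
        return match.group(0)
    return hashlib.sha256(fqdn.lower().encode("utf-8")).hexdigest()[:8]


def bucket_name_for_sovereign(fqdn: str, deployment_id: str = "") -> str:
    """Build `catalyst-<fqdn-with-dashes>-<suffix>`."""
    slug = fqdn.lower().replace(".", "-")
    return f"catalyst-{slug}-{bucket_suffix(fqdn, deployment_id)}"
```

Two deployments of the same FQDN with different ids get different bucket names, while a legacy record with no id always maps to the same name, so a wipe can still find its bucket.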
||
|
|
f668d791ab
|
fix(bp-newapi): publish newapi-mirror image + repoint chart to existing tag (qa-loop bounded-cycle audit prov #7 Gap F) (#1315)
Root cause from live diagnosis (omantel.biz prov #7, kubectl --context=omantel):

The bp-newapi chart at platform/newapi/chart/values.yaml referenced `ghcr.io/openova-io/openova/newapi-mirror:v0.4.5` since its first commit (44d0200a, 2026-05-01). However:
1. NO CI workflow ever built that image. There is no `build-bp-newapi*.yaml` (or similar) under .github/workflows/. The GHCR package `ghcr.io/openova-io/openova/newapi-mirror` does not exist (404 from /orgs/openova-io/packages/container/...).
2. The tag `v0.4.5` is fictitious: neither upstream Calcium-Ion/new-api (`docker.io/calciumion/new-api`) nor the alternate ancestor (`justsong/one-api`) ever published a `v0.4.5`. The lowest stable Calcium-Ion tag is `v0.6.0.9`; the highest stable v0.x is `v0.13.2` (upstream publish 2026-04-27).

Result: every fresh Sovereign's NewAPI Pod ImagePullBackOff'd 403 Forbidden on the never-existed image, blocking alice signup gate 5 (LLM) and surfacing in the bounded-cycle audit as Gap F.

Fix (mirrors bp-guacamole CI pattern in .github/workflows/build-bp-guacamole.yaml):
- NEW .github/workflows/build-bp-newapi.yaml: push to platform/newapi/chart/** triggers a Job that pulls `docker.io/calciumion/new-api:<UPSTREAM_VER>`, captures the upstream repo digest, re-tags as `ghcr.io/openova-io/openova/newapi-mirror:<UPSTREAM_VER>` + `:latest`, pushes both, then bumps values.yaml + Chart.yaml + dispatches blueprint-release.
- platform/newapi/chart/values.yaml: newapi.image.tag bumped from `v0.4.5` (fictitious) to `v0.13.2` (latest stable Calcium-Ion/new-api on Docker Hub). Comment block expanded with full rationale + link to the new build workflow + bump-in-lockstep instructions.
- platform/newapi/chart/Chart.yaml: version 1.4.1 → 1.4.2, appVersion `0.4.5` → `0.13.2` (Helm convention: appVersion = upstream version without the `v` prefix). Inline changelog records the audit-prov-7 Gap F lineage.
- clusters/_template/bootstrap-kit/80-newapi.yaml: pinned chart version 1.4.1 → 1.4.2 with the same changelog inline.

Verified locally:
- `helm template smoke platform/newapi/chart --set database.existingSecret=fake --set credentials.existingSecret=fake --set auth.adminUI.mode=masterKey` renders `image: "ghcr.io/openova-io/openova/newapi-mirror:v0.13.2"` and `app.kubernetes.io/version: "0.13.2"`.

The v1.0.0-rc.x upstream line is gated on schema migration stabilisation; the channel-seed Job uses the legacy admin-API request shape, so do NOT auto-roll past v0.13.x without re-running the channel-seed integration smoke against NewAPI's `/api/channel/`.

Pairs with the Gap C re-investigation memo (no chart fix needed; PR #1309 only gated `defaultCompositionRef`, not the XRD itself; the useraccesses.access.openova.io CRD is present on omantel prov #7).

DO NOT MERGE: this PR is for qa-loop bounded-cycle Wave 5 Fix #80 (Gap F) review.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9780e8d72d
|
fix(chart): bp-catalyst-platform 1.4.116 — chart re-publish + dispatch (qa-loop iter-10 Fix #44 follow-up) (#1264)
Chart 1.4.115 was published from the merge commit which still had the
OLD application-controller image tag (
|
||
|
|
24aab61207
|
fix(application-controller): HelmRelease targetNamespace = App's namespace, not Org slug (qa-loop iter-10 Fix #44) (#1262)
Root cause: the application-controller rendered the per-Application HelmRelease with `metadata.namespace = Org` and `spec.targetNamespace = Org` where Org is the parent Organization slug. On omantel the Application(qa-wp) lives in ns `qa-omantel` while the Org is named `omantel-platform`, so the workload Pod landed in the wrong namespace, breaking matrix rows TC-068 / TC-100 / TC-204 / TC-262 / TC-263 (all asserting Pod in qa-omantel). The symmetric Kustomization wrapper had the same bug. The existing render unit test only covered the org==namespace case (`acme/acme`), which masked the bug.

Fix:
- render.Inputs gains AppNamespace field. helmRelease + kustomization templates resolve `metadata.namespace` and `spec.targetNamespace` to AppNamespace (back-compat default = Org).
- application_controller.go passes app.GetNamespace() as AppNamespace on every render.Render call.
- HelmRelease spec.install.createNamespace = true so a missing workload namespace is provisioned by helm-controller (per docs/INVIOLABLE-PRINCIPLES.md #1 target-state: the controller must work without an operator pre-creating the namespace).
- Org slug is still stamped on the catalyst.openova.io/organization label for traceability.
- 3 new Go tests:
  - TestRender_NamespaceIsAppNamespace (omantel scenario via render pkg)
  - TestRender_CreateNamespaceTrue
  - TestReconcile_HelmReleaseTargetNamespaceIsAppNamespace (drives the omantel scenario end-to-end through the controller fake)
- build-application-controller.yaml extended with auto-bump of controllers.application.image.tag in values.yaml on push-to-main, so the chart picks up the rebuilt image without a manual operator edit (per feedback_no_mvp_no_workarounds.md rule 1).
- bp-catalyst-platform chart 1.4.114 → 1.4.115.

Verification (post-roll on omantel):
- delete omantel-platform/qa-wp Pod
- annotate qa-omantel/qa-wp HR for reconcile
- expect: Pod in qa-omantel ns + HR.spec.targetNamespace == qa-omantel

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
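The namespace resolution in the fix above can be illustrated with a small sketch. The controller itself is Go; `RenderInputs` and `helm_release_namespaces` are hypothetical Python stand-ins, and the returned dictionary shows only the fields the commit message mentions.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class RenderInputs:
    org: str
    app_name: str
    app_namespace: str = ""  # empty means "not set" (legacy callers)


def helm_release_namespaces(inputs: RenderInputs) -> Dict[str, Any]:
    """Resolve the namespace fields: AppNamespace wins, Org is the
    back-compat default, and the Org slug survives only as a label."""
    namespace = inputs.app_namespace or inputs.org
    return {
        "metadata": {
            "name": inputs.app_name,
            "namespace": namespace,
            "labels": {"catalyst.openova.io/organization": inputs.org},
        },
        "spec": {
            "targetNamespace": namespace,
            "install": {"createNamespace": True},
        },
    }
```

The omantel case (Org `omantel-platform`, Application in `qa-omantel`) is exactly the input where the old org==namespace test would not have caught the bug.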
||
|
|
5ca0a7d178
|
fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots (#1236)
* fix(ci,charts,api): qa-loop iter-7 Fix #39 — bp-guacamole + bp-k8s-ws-proxy bootstrap-kit slots Closes the scope-narrow confessed by Fix #36: bp-guacamole + bp-k8s-ws-proxy chart skeletons existed at platform/* but lacked CI image-build workflows + bootstrap-kit slots, so TC-228 / TC-230 / TC-236 / TC-237 / TC-245 / TC-246 stayed FAIL with "deployment NotFound". CI workflows ------------ - .github/workflows/build-k8s-ws-proxy.yaml: Buildx + cosign keyless sign + SBOM attestation flow on core/cmd/k8s-ws-proxy/**, then bumps platform/k8s-ws-proxy/chart/values.yaml image.tag + Chart.yaml patch version + dispatches blueprint-release. - .github/workflows/build-bp-guacamole.yaml: mirrors upstream Apache Guacamole 1.5.5 to GHCR (so every Sovereign pulls from a registry we own — no Docker Hub rate limits, no upstream availability risk), bumps values.yaml.image.{repository,tag} + Chart.yaml + dispatches blueprint-release. Charts (target-state) --------------------- - bp-k8s-ws-proxy v0.1.1: canonical workload name `k8s-ws-proxy` regardless of release name (DaemonSet + Service + ClusterRole + ClusterRoleBinding + ServiceAccount all named `k8s-ws-proxy` so matrix can address them by canonical short name). - bp-guacamole v0.1.1: canonical short resource names (`guacd`, `guacamole-server`, `guacamole-recordings`); GHCR-mirrored upstream images; realm-patch ConfigMap correctly lands in `keycloak` namespace (was: realm-name, which would have failed silently on every Sovereign); `realmConfig.namespace` override surface added. - Both charts: `catalyst.openova.io/smoke-render-mode: default-off` annotation so blueprint-release smoke-render gate honors the default-OFF render shape. Bootstrap-kit slots ------------------- - clusters/_template/bootstrap-kit/36-bp-k8s-ws-proxy.yaml + 37-bp-guacamole.yaml: dependsOn-ordered (proxy → gateway), pinned to 0.1.1, default-OFF gate flipped via slot values, install/upgrade disableWait per session-2026-04-30 architectural decision. 
- clusters/omantel.omani.works/bootstrap-kit/* slots mirror the same shape with omantel.biz hostnames matching the live HTTPRoutes on console.omantel.biz / auth.omantel.biz.
API: shells/issue handler (matrix-canonical URL surface)
--------------------------------------------------------
- POST /api/v1/sovereigns/{id}/shells/issue?namespace=&pod=&container= alias for the existing POST /api/v1/sovereigns/{id}/k8s/exec/{ns}/{pod}/{container}/session with matrix-canonical response fields (`sessionId`, `guacamoleUrl`, `recordingPath`). Same business logic, same audit surface (`guacamole-session-opened`), same RBAC gate (tier-developer or higher). 6 test cases, all PASS under -race.
TCs that flip PASS in iter-8
-----------------------------
- TC-228: POST /shells/issue → sessionId + guacamoleUrl + recordingPath
- TC-230: kubectl get deploy guacd guacamole-server -n catalyst-system
- TC-236: kubectl get ds k8s-ws-proxy -n catalyst-system
- TC-237: kubectl logs ds/k8s-ws-proxy → "listening"
- TC-245: viewer-cookie POST /shells/issue → 403
- TC-246: operator-cookie POST /shells/issue → 200 sessionId
Per feedback_no_mvp_no_workarounds.md: NO follow-up slices — every gap Fix #36 confessed is closed in this PR.
Per feedback_machine_saturation_3rd_violation.md: CI-only build path, no local docker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bootstrap-kit): move bp-k8s-ws-proxy + bp-guacamole to slots 51/52 (Fix #39 follow-up)
CI dependency-graph-audit caught a slot-number collision: slots 36-48 are reserved for the W2.K4 AI-runtime cohort (bp-stunner, bp-knative, bp-kserve, bp-vllm, bp-llm-gateway, bp-anthropic-adapter, bp-bge, bp-nemo-guardrails, bp-temporal, bp-openmeter, bp-livekit, bp-matrix, bp-librechat) per scripts/expected-bootstrap-deps.yaml. Move the exec-fan-out blueprints to slots 51/52 (post-W2.K4, pre-Phase-2 80+ slot range) and add their entries to the expected DAG.
- clusters/_template/bootstrap-kit/{36,37}-* → {51,52}-*
- clusters/omantel.omani.works/bootstrap-kit/{36,37}-* → {51,52}-*
- kustomization.yaml updates (both _template + omantel)
- scripts/expected-bootstrap-deps.yaml: declare slots 51/52 with full dependsOn lists (bp-k8s-ws-proxy on cilium+sealed-secrets, bp-guacamole on cilium+cert-manager+keycloak+sealed-secrets+seaweedfs+k8s-ws-proxy)
scripts/check-bootstrap-deps.sh re-run: 0 drift, 0 cycles, 55 declared HRs, 42 present on disk, 13 deferred (W2.K1-K4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b24475e2c2
|
fix(api+chart): clusterroles GVR + CATALYST_BUILD_SHA env injection (qa-loop iter-3) (#1206)
Two coupled fixes for QA-loop iter-3 cluster
`clusterroles-gvr-and-sha-injection`:
Sub-A — clusterroles GVR (TC-122/196/199/248):
- Add rbac.authorization.k8s.io/v1 ClusterRole + ClusterRoleBinding
to k8scache.DefaultKinds. Both cluster-scoped.
- Add matching get/list/watch verbs on
catalyst-api-cutover-driver ClusterRole. Per
feedback_chroot_in_cluster_fallback.md every new GVR added to
DefaultKinds MUST get a matching rule on the cutover-driver SA
(chroot SovereignClient uses it via in-cluster fallback).
- Pin both kinds in TestDefaultKinds_GraphAndDashboardSurface so a
regression that drops them from the registry fails the unit test.
Sub-B — CATALYST_BUILD_SHA env injection (TC-261):
- api-deployment.yaml: inject CATALYST_BUILD_SHA + CATALYST_CHART_VERSION
env vars with LITERAL values (not Helm directives) per the
dual-mode contract — Kustomize on contabo can't render
`{{ .Values... }}` in `value:` fields.
- .github/workflows/catalyst-build.yaml: extend the "bump literal
image refs" sed pass to also bump the CATALYST_BUILD_SHA env
literal so /api/v1/version returns the SHA the Pod is actually
running (no drift between image tag and reported SHA).
- The handler (version.go) already reads CATALYST_BUILD_SHA via
envOrTrim with `dev`/`0.0.0` ldflag fallbacks — no Go change
needed; the version_test.go env-override test already covers it.
Chart bumped 1.4.94 -> 1.4.95.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
c1b92404ee
|
fix(chart): enable 5 Group C controllers + KC realm-role bootstrap (qa-loop iter-1) (#1194)
EPIC-3 RBAC reconciliation loop was dormant on every Sovereign because
the 5 Group C controllers (organization, environment, blueprint,
application, useraccess) shipped with `enabled: false` and the
KEYCLOAK_BOOTSTRAP_TIER_ROLES env var was hardcoded to "false". Result:
UserAccess CRs created by /api/v1/sovereigns/{id}/rbac/assign never
materialised into RoleBindings + composite realm-roles.
Cluster: controllers-and-kc-bootstrap-gates (qa-loop iter-1).
Changes:
- values.yaml: organization/environment/application/useraccess controllers
flipped to `enabled: true` and `image.tag` SHA-pinned to the latest
GHCR-published push-on-main builds (organization/environment/application
:1b29c71, useraccess :ff2172f) per Inviolable Principle #4a.
- values.yaml: blueprint stays `enabled: false` until first
push-on-main build of build-blueprint-controller.yaml lands an image
in GHCR (never reference an image not built by CI).
- values.yaml: new top-level `keycloak.bootstrap.ensureTierRoles: true`.
- api-deployment.yaml: KEYCLOAK_BOOTSTRAP_TIER_ROLES now sources its
default from `.Values.keycloak.bootstrap.ensureTierRoles` (per slice
T2 brief #1098/#1146) instead of hardcoded "false".
- .github/workflows/build-blueprint-controller.yaml: new workflow
scaffolded (mirror of build-application-controller shape) so the
first commit touching core/controllers/blueprint/** ships a
CI-built, SHA-pinned, cosign-signed image to GHCR.
- Chart.yaml: bumped 1.4.89 → 1.4.90.
Verified via `helm template`:
- 4 controller Deployments + 4 controller ClusterRoles render (blueprint
pending image build).
- KEYCLOAK_BOOTSTRAP_TIER_ROLES renders as "true" by default.
- 5 tier ClusterRoles `openova:tier-{viewer,developer,operator,admin,owner}`
render from platform/crossplane-claims/chart/.
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7ca4abddd2
|
feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101) (#1159)
* feat(continuum): K-Cont-4 — Cloudflare Worker source + tofu wiring for lease witness (#1101)
Implements the server side of the Cloudflare KV lease-witness pattern that K-Cont-3's CFKVClient (in core/controllers/continuum/internal/witness/cloudflarekv/) speaks to. The Worker fronts a Cloudflare Workers KV namespace with read-then-CAS-write semantics enforced via the If-Match header — exact contract per K-Cont-3 #1158 report (item d) and the canonical-seams "Cloudflare KV Worker contract" entry.
Routes:
GET /lease/<slot-url-encoded> → 200 + LeaseState | 404 | 401
PUT /lease/<slot> → 200 + LeaseState | 412 + state | 401
DELETE /lease/<slot> → 204 | 412 | 401
All 7 K-Cont-3 trap behaviors verified by 46 vitest tests:
1. If-Match: 0 = first-acquire-on-empty-slot
2. Generation increments unconditionally (incl. Release)
3. 412 includes current state body
4. TTL eviction is server-authoritative in stamping (Worker doesn't auto-evict — controller's IsHeldBy decides)
5. X-Holder mismatch on DELETE returns 412 (stale region can't evict new primary)
6. Bearer token validation against env-bound allow-list
7. Optional X-Lease-Slot header logged for KV granularity
Files:
products/continuum/cloudflare-worker/{package.json, tsconfig.json, wrangler.toml, vitest.config.ts, .eslintrc.cjs, .gitignore, DESIGN.md, src/{index,auth,kv,types}.ts, src/handlers/{get,put,delete}.ts, test/{handlers,contract,env.d}.ts}
infra/cloudflare-worker-leases/{versions,variables,main,outputs}.tf + README.md
.github/workflows/cloudflare-worker-leases-build.yaml (event-driven, NO cron — push-on-paths + PR + workflow_dispatch)
Tests: 46/46 vitest pass (handlers 37 + contract 9). ESLint clean. tsc --noEmit clean. wrangler deploy --dry-run produces 9.47 KiB bundle.
Per the brief: tofu module ships ready for operator action — no auto-deploy. Operator runbook in DESIGN.md §"Operator runbook — deploy a new Sovereign".
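The read-then-CAS-write contract on PUT /lease/<slot> can be modelled in a few lines. This is a minimal in-memory sketch in Python, not the Worker's TypeScript source: the `LeaseState` fields (`holder`, `generation`), the `LeaseStore` class, and treating an empty slot as generation 0 are assumptions made for illustration. It covers only traps 1 to 3 (first acquire with If-Match: 0, generation increment on every successful write, 412 carrying the current state); auth, TTL stamping, and DELETE are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LeaseState:
    # Hypothetical shape; the real LeaseState fields are not listed in the commit.
    holder: str
    generation: int


class LeaseStore:
    """In-memory stand-in for the KV namespace behind the Worker."""

    def __init__(self) -> None:
        self._slots: Dict[str, LeaseState] = {}

    def put(self, slot: str, holder: str, if_match: int) -> Tuple[int, Optional[LeaseState]]:
        """Compare-and-set write keyed on the caller's If-Match generation."""
        current = self._slots.get(slot)
        # Assumption: an empty slot behaves as generation 0, so If-Match: 0
        # is the first-acquire case.
        current_generation = current.generation if current else 0
        if if_match != current_generation:
            # Precondition failed: hand back the current state with the 412.
            return 412, current
        new_state = LeaseState(holder=holder, generation=current_generation + 1)
        self._slots[slot] = new_state
        return 200, new_state
```

A caller that receives 412 can read the returned generation and decide whether to retry, which is what keeps two regions from both believing they hold the slot.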
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(continuum/cf-worker-tofu): K-Cont-4 — adopt CF v5 inline secret_text binding (was v4 separate resource)
`tofu validate` failed on `cloudflare_workers_secret` — that resource was REMOVED in cloudflare/cloudflare v5 (it consolidated into the inline `bindings = [...]` array on `cloudflare_workers_script` with `type = "secret_text"`). Same security guarantee — encrypted at rest in CF, never visible via dashboard read API once written.
`tofu fmt` also wanted versions.tf alignment + the .terraform.lock.hcl pinning the resolved cloudflare/cloudflare v5.19.1 (mirrors infra/hetzner/ which commits its lock file).
Per Inviolable Principle #5 the bearer token value still flows from TF_VAR_bearer_tokens_csv extracted at apply time from a K8s SealedSecret — never inlined here.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
746901b671
|
feat(cnpg-pair): C-DB-1 — bp-cnpg-pair Blueprint (active-hotstandby CNPG cluster-pair across regions) (#1101) (#1153)
EPIC-6 Slice C-DB-1+C-DB-2. Active-hotstandby CNPG cluster-pair as a
companion to bp-cnpg: primary CNPG Cluster CR in region A, replica
Cluster CR in region B configured as a CNPG replica cluster
(replica.enabled=true + externalCluster), WAL streaming over a
Cilium ClusterMesh-shared Service. Per ADR-0001 §9 ClusterMesh is the
only canonical inter-region transport — never public TLS.
What ships:
platform/cnpg-pair/
├── chart/
│ ├── Chart.yaml # bp-cnpg-pair 0.1.0; no-upstream + smoke-render-mode=default-off
│ ├── values.yaml # default-OFF gate; placement schema constrains active-hotstandby ONLY
│ ├── templates/
│ │ ├── _helpers.tpl # fail-fast on empty image.tag; region pair validation
│ │ ├── primary-cluster.yaml # CNPG Cluster CR (region-pinned via openova.io/region affinity)
│ │ ├── replica-cluster.yaml # CNPG Cluster CR (replica.enabled=true; externalClusters[])
│ │ ├── service-replication.yaml # Cilium ClusterMesh global Service
│ │ ├── failover-readiness.yaml # probe Pod flips Ready when WAL lag < threshold
│ │ ├── networkpolicy.yaml # default-deny carve-outs for replication + probe
│ │ └── audit-config.yaml # NATS audit subjects + types this Blueprint emits
│ ├── blueprint.yaml # configSchema + placementSchema (active-hotstandby ONLY)
│ ├── README.md # 80-line deployment + failover semantics
│ └── tests/cnpg-pair-render.sh # 5-case render gate
└── DESIGN.md # topology, lag-threshold rationale, deferred C-DB-3 plan
Default-OFF gate per the brief: helm template with default values
renders ZERO resources; helm template with cnpgPair.enabled=true +
both regions + image.tag renders 8 resources (2 Cluster CRs, 1
Service, 1 Deployment, 3 NetworkPolicies, 1 audit-config ConfigMap).
Empty image.tag fails fast at template-render per Inviolable
Principle #4a; same primary/replica region fails fast (degenerate
pair). All 5 render gates pass locally; helm lint + YAML parse clean.
CI smoke-render gate fix (single-line behavior change in
blueprint-release.yaml): adds a
`catalyst.openova.io/smoke-render-mode: default-off` annotation
opt-in so charts that legitimately
render zero at default values (this chart + future bp-*-pair
Blueprints) skip the `<5 lines` empty-render check. The chart's own
tests/cnpg-pair-render.sh covers the enabled-render path; without
the annotation the empty-render check still fires unchanged.
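The gate decision described above is small enough to sketch. The following Python function is an illustration only, not the workflow's shell logic: the function name is hypothetical, and counting every rendered line (blank or not) toward the `<5 lines` threshold is an assumption.

```python
from typing import Mapping

SMOKE_RENDER_MODE = "catalyst.openova.io/smoke-render-mode"
MIN_RENDERED_LINES = 5  # the `<5 lines` threshold quoted in the commit message


def empty_render_check_fails(rendered: str, annotations: Mapping[str, str]) -> bool:
    """Return True when the smoke gate should reject the render as empty."""
    # Charts that opt in via the annotation may legitimately render nothing.
    if annotations.get(SMOKE_RENDER_MODE) == "default-off":
        return False
    # Assumption: every line of `helm template` output counts, blank or not.
    return len(rendered.splitlines()) < MIN_RENDERED_LINES
```

Without the annotation the check behaves as before, so existing charts are unaffected by the opt-in.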
Seam-map additions (return diff for 01-canonical-seams.md Platform
table):
- service.cilium.io/global=true ClusterMesh global Service annotation
(first chart in the repo to use it; pattern reused by Continuum
K-Cont-2 for HTTPRoute weight=0 cross-region drains)
- bp-*-pair active-hotstandby cluster-pair pattern (primary+replica
Cluster CRs colocated in one Blueprint, region-pinned via
openova.io/region node-affinity)
- audit-config ConfigMap co-located with the emitting Blueprint
(label-selector discovery for K-Cont-2 + U-DR-1; future
bp-*-pair Blueprints follow this convention)
- smoke-render-mode=default-off Chart.yaml annotation opt-in for
the blueprint-release smoke gate
C-DB-2 (publish): existing blueprint-release.yaml workflow auto-
detects `platform/*/chart/**` paths — no allowlist edit required.
First push triggers `ghcr.io/openova-io/bp-cnpg-pair:0.1.0` build.
C-DB-3 (1M-row acceptance test) DEFERRED — full plan documented in
DESIGN.md "Deferred — C-DB-3 acceptance test plan" section so the
future implementer's brief is self-contained.
Tests:
- bash platform/cnpg-pair/chart/tests/cnpg-pair-render.sh ✓ 5/5 PASS
- helm lint platform/cnpg-pair/chart ✓ clean
- helm template ... | python3 yaml.safe_load_all ✓ 8 docs parse clean
- smoke-gate logic simulated locally ✓ default-off annotation honored
Pre-existing CI failures untouched:
- TestPinIssue rate-limit flake — not affected by chart-only slice
- TestBootstrapKit/gitea version drift — only iterates over a fixed
10-chart bootstrap list (no cnpg-pair entry)
Out of scope per brief (all deferred to dedicated slices):
- K-Cont-2 reconciler logic
- K-Cont-3 lease witness
- K-Cont-4 Cloudflare Worker
- C-DB-3 1M-row acceptance test
- Application controller changes
- U-DR-1 UI
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ddbe44918f
|
feat(continuum): K-Cont-1 — Continuum product skeleton (chart + binary + GHA workflow, no reconcile yet) (#1101) (#1151)
Slice K-Cont-1 of EPIC-6 (#1101) ships the Continuum product skeleton:
- core/controllers/continuum/{cmd,internal/{controller,events}}
  - cmd/main.go — controller-runtime Manager bootstrap; leader election; /healthz, /readyz, /metrics endpoints; env-only config per INVIOLABLE-PRINCIPLES #4
  - internal/controller — ContinuumReconciler with no-op Reconcile() (K-Cont-2 fills the body); SetupWithManager() watches Continuum CRs via unstructured.Unstructured per ADR-0001 §2.7 (no controller-gen)
  - internal/events — placeholder package documenting K-Cont-2's NATS audit-event-type list
  - Containerfile — multi-stage Go build → alpine:3.20 runtime, UID 65534
- products/continuum/chart/ — full Helm chart shape (default-OFF):
  - Chart.yaml + values.yaml (continuum.enabled: false; image.tag empty; fail-fast on empty tag at render time)
  - templates/{_helpers.tpl, deployment, service, serviceaccount, rbac, networkpolicy}.yaml
  - blueprint.yaml — OpenOva Blueprint manifest with configSchema + placementSchema (single-region: management cluster) + depends: bp-cnpg-pair + bp-powerdns
  - crds/README.md — pointer to the canonical Continuum CRD shipped in products/catalyst/chart/crds/continuum.yaml (B8 #1110); not duplicated
- products/continuum/DESIGN.md — chart-vs-binary split decision (Option A: binary in shared core/controllers/ module per CC1 #1135), K-Cont-2 fill list, K-Cont-3 lease witness API contract sketch
- .github/workflows/build-continuum-controller.yaml — event-driven CI (NO cron) with go vet + go test -race + helm template ON/OFF resource count gates + fail-fast verification + GHCR build & push (cosign keyless signed) + repository_dispatch for chart-bump fan-out
helm template verification:
- continuum.enabled=false → 0 resources (default OFF)
- continuum.enabled=true + image.tag=ci-test → 6 resources (ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, Service, NetworkPolicy)
- continuum.enabled=true + empty image.tag → render fails per #4a
go vet ./continuum/... → clean. go test -count=1 -race → all green.
Out of scope (per the K-Cont-1 brief):
- Reconcile body — K-Cont-2
- Lease witness implementations — K-Cont-3
- Cloudflare Worker source — K-Cont-4
- bp-cnpg-pair Blueprint — C-DB-1
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
b0ed216e81
|
feat(catalog): catalog-svc HTTP REST service + chart wiring (slice L1+L2, #1097) (#1148)
EPIC-2 Slice L of #1097. Multi-source Blueprint catalog HTTP REST service backed by Gitea (3 sources: public mirror, sovereign-curated, per-Org private). Replaces the per-Org SME catalog per ADR-0001 §4.3 (different scope: SME's was Org-bound; catalyst-catalog is Sovereign-wide multi-source).
L1 — core/services/catalyst-catalog/ Go service:
- Separate go.mod (services group is for HTTP services, controllers group is for CRD reconcilers — documented in DESIGN.md).
- Imports the unified Gitea client via Go module replace directive.
- Promoted core/controllers/internal/gitea → pkg/gitea so the catalog (a sibling Go module) can import it (Go internal/ rule). 5 Group C controllers updated atomically.
- HTTP REST endpoints: /api/v1/catalog{,/{name},/{name}/versions,/{name}/versions/{version}} + /healthz.
- Source resolution priority on collision: private > sovereign > public.
- Per-Org access filter: caller's Claims.Groups[] determines visible private blueprints; Org A user does NOT see Org B's private set.
- 30s TTL LRU cache on blueprint.yaml reads (capacity 1024 default).
- Session-cookie / Bearer / ?access_token= claim extraction matching catalyst-api's seam; expired-token rejection in-process.
- Containerfile: distroless-static, non-root UID 65532.
L2 — products/catalyst/chart/templates/services/catalog/ wiring:
- 5 templates (deployment, service, serviceaccount, rbac, httproute) + _helpers.tpl. Default-OFF gate via .Values.services.catalog.enabled.
- helm template: 0 catalog resources when OFF, 6 when ON.
- Empty image.tag fail-fasts at render per Inviolable Principle #4a.
- HTTPRoute exposes /api/v1/catalog on api.<sovereign> hostname.
- Chart bumped 1.4.85 → 1.4.86.
Gitea client extension (canonical seam, NOT per-service variant):
- +ListOrgRepos(ctx, org) []Repo — paginated repo listing.
- +ListContents(ctx, org, repo, branch, path) []ContentEntry — directory listing for per-Org shared-blueprints fan-out.
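The source-resolution and per-Org visibility rules listed under L1 can be illustrated with a short Python sketch. It is not the Go service's code: the `Blueprint` record, the `visible_catalog` function, and the assumption that a caller's group names equal Org names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Lower number wins on a name collision: private > sovereign > public.
SOURCE_PRIORITY = {"private": 0, "sovereign": 1, "public": 2}


@dataclass(frozen=True)
class Blueprint:
    name: str
    source: str                # "private", "sovereign" or "public"
    org: Optional[str] = None  # owning Org, set only for private entries


def visible_catalog(blueprints: Iterable[Blueprint], caller_groups: Iterable[str]) -> List[Blueprint]:
    """Filter out other Orgs' private entries, then resolve collisions by priority."""
    groups = set(caller_groups)
    chosen: Dict[str, Blueprint] = {}
    for bp in blueprints:
        # Assumption: group names match Org names one to one.
        if bp.source == "private" and bp.org not in groups:
            continue
        current = chosen.get(bp.name)
        if current is None or SOURCE_PRIORITY[bp.source] < SOURCE_PRIORITY[current.source]:
            chosen[bp.name] = bp
    return sorted(chosen.values(), key=lambda bp: bp.name)
```

Filtering before resolving matters: a private entry the caller cannot see must not shadow the sovereign or public entry of the same name.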
GitHub Actions workflow:
- .github/workflows/catalyst-catalog-build.yaml — push-on-paths + pull_request + workflow_dispatch (NO cron). go vet + go test (race + count=1) + image build → GHCR :<sha>. repository_dispatch fan-out to chart-bump matches the Group C controllers' pattern.
Tests (3-tier gate): unit (config, cache, auth, source, handler) + integration (httptest-backed Gitea fixtures across all 3 sources + priority + per-Org access). All green; race detector on.
L3 (SME catalog retirement) is deferred per the EPIC-2 master brief. GraphQL deferred (REST first; gqlgen would pull ~80MB of indirect deps for a feature no UI consumer has asked for yet).
Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
66fd0bbae3
|
refactor(controllers): promote duplicated internal/ packages to shared core/controllers/internal/ (CC1, #1095) (#1135)
Slice CC1 of EPIC-0 (#1095) — Coordinator-led consolidation. The 5 Group C controllers (slices C1-C5: organization, environment, blueprint, application, useraccess) all merged with their own per-controller go.mod + per-controller internal/ tree. This PR canonicalizes the shared layout per `02-implementer-canon.md` §1+§2:
* One go.mod at core/controllers/go.mod (Path A — single shared module)
* Shared helpers under core/controllers/internal/:
  - semver/ (was: blueprint/internal/semver + application/internal/semver, now exposes blueprint's IsValidRange + app's IsExact, with the union of both test corpora)
  - placement/ (was: application/internal/placement; promoted per seam map)
  - render/ (was: application/internal/render; promoted per seam map)
  - labels/ (was: useraccess/internal/labels; promoted per seam map — Manara-style scope matcher, owner-of-record C5)
Module-discipline decision (Path A vs Path B): Path A. The 5 controllers' go.mod files use the same controller-runtime v0.19.0, k8s.io/* @ 0.31.x, sigs.k8s.io/yaml v1.4.0, etc. The only drift was organization-controller on k8s.io/api 0.31.0 vs the others on 0.31.1 — a trivial bump. Independent dep-version pinning would only be valuable if a controller needed a hostile dep the others shouldn't pull; nothing in the current tree is hostile.
Containerfiles + workflows updated:
* 5 Containerfiles now COPY core/controllers/{go.mod,go.sum,internal/} plus the per-controller tree from a repo-root build context.
* 4 per-controller workflows (application/environment/organization/useraccess; blueprint-controller has no dedicated workflow yet) now trigger on core/controllers/{<name>/**, internal/**, go.mod, go.sum} and run go vet + go test scoped to their own tree + shared internal.
* useraccess workflow context flipped from core/controllers/useraccess to . (repo root) so the Containerfile can reach the shared go.mod.
Subpackages NOT promoted in this PR (compromise — flagged for follow-up):
* gitea/ — 4 of 5 controllers each ship a Gitea HTTP client. The APIs DIVERGE (organization has Org+Repo CRUD with Repo struct return values; application/blueprint/environment have File CRUD with Org-not-found sentinel). A SUPERSET package would require renaming methods (e.g. EnsureRepo collides on signature) which crosses the brief's "no API redesign" line. CC2 follow-up slice should design the unified surface before promoting.
* validate/ — application's package validates Application.spec.parameters against a JSON Schema (santhosh-tekuri lib); blueprint's validates Blueprint CR business rules (semver-backed). Same dir name, completely different functions — not actually duplicates.
* gitops/ — environment's renders Flux GitRepository for an Environment; organization's renders HelmRelease+Namespace for an Org. Same dir name, different inputs and outputs.
Test-coverage delta: pre-consolidation 134 root-level tests (sum across 5 modules); post-consolidation 133 tests. Net delta -1: blueprint and application each had their own TestIsValidRange in their semver pkg; the shared semver pkg's TestIsValidRange now exercises the union of both controllers' valid+invalid input corpora — coverage strictly improved even though one redundant test name disappeared.
Verified locally: go build + go vet + `go test -count=1 -race ./...` all clean; all 5 controller binaries (cmd/) link successfully.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
dbf585744c
|
feat(controllers): land application-controller (slice C4, #1095) (#1133)
Watches Application.apps.openova.io/v1 CRs and reconciles each
Application to per-region kustomization + helmrelease manifests in
the per-Org Gitea repo (gitea.<location-code>.<sovereign-domain>/<org>/<app>).
Reconcile flow per slice C4 brief:
1. Resolve parents: spec.environmentRef → Environment CR, then
Environment.spec.organizationRef → Organization CR. Pending-on-miss.
2. Fetch Blueprint at spec.blueprintRef.{name,version} (v1 with
v1alpha1 fallback). Pending-on-miss.
3. Validate spec.parameters against Blueprint.spec.configSchema via
github.com/santhosh-tekuri/jsonschema/v5. On invalid → status.phase=
Failed + Condition reason=Invalid listing every failing JSON pointer.
4. Validate placement against Blueprint.spec.placementSchema.modes.
5. Resolve placement → per-region work plan:
- single-region: regions[0] only, role=primary
- active-active: every region rendered identically (sorted
for byte-stability), role=active, no primaryRegion
- active-hotstandby: regions[0] primary, regions[1..] standby
(replicas: 0 + _openova_standby: true overlay; Continuum
#1101 flips on switchover)
6. Render kustomization.yaml + helmrelease.yaml per region under
clusters/<region>/applications/<app>/{...}.yaml on the env-type-
mapped branch (develop|staging|main per NAMING §11.2).
7. Idempotent commit via gitea.PutFile's byte-equality short-circuit
— re-reconcile on steady state = 0 Gitea writes (slice C4 brief
test #7).
8. Status update: phase / primaryRegion / regions[] / giteaRepo /
installedBlueprint{name,version,digest} / conditions[].
9. Finalizer + cascade delete: on metadata.deletionTimestamp, removes
every manifest the controller wrote and releases the finalizer.
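Step 5 of the reconcile flow above (placement → per-region work plan) can be sketched as follows. This is a Python illustration under stated assumptions, not the controller's Go code: the `resolve_placement` name and the dict-based plan entries are hypothetical; the three modes, the roles, and the standby overlay (`replicas: 0` + `_openova_standby: true`) come from the description.

```python
from typing import Any, Dict, List


def resolve_placement(mode: str, regions: List[str], params: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand a placement mode into one plan entry per region to render."""
    if not regions:
        raise ValueError("placement needs at least one region")
    if mode == "single-region":
        # Only regions[0] is rendered, as the primary.
        return [{"region": regions[0], "role": "primary", "values": dict(params)}]
    if mode == "active-active":
        # Every region renders identically; sorting keeps output byte-stable.
        return [{"region": r, "role": "active", "values": dict(params)} for r in sorted(regions)]
    if mode == "active-hotstandby":
        plan = [{"region": regions[0], "role": "primary", "values": dict(params)}]
        for region in regions[1:]:
            standby = dict(params)
            standby["replicas"] = 0               # standby stays scaled down
            standby["_openova_standby"] = True    # marker overlay for switchover
            plan.append({"region": region, "role": "standby", "values": standby})
        return plan
    raise ValueError(f"unknown placement mode: {mode}")
```

Copying `params` per entry keeps the standby overlay from leaking into the primary's values.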
Architecture compliance per docs/INVIOLABLE-PRINCIPLES.md:
- Flux is the only reconciler. Controller writes to Gitea; Flux
applies. NO direct K8s create of HelmRelease/Kustomization/Service.
- Dynamic client + unstructured.Unstructured (no controller-gen, no
zz_generated_deepcopy.go).
- Every value is environment-configurable (GITEA_API_URL, GITEA_TOKEN,
GITEA_PUBLIC_URL, SOURCE_NAMESPACE, HELMRELEASE_INTERVAL,
CATALOG_SOURCE_REF, REQUEUE_AFTER_SECONDS, METRICS_ADDR, HEALTH_ADDR,
LEADER_ELECT, LEADER_ELECT_NS, LOG_LEVEL).
- SHA-pinned images via the focused build-application-controller.yaml
workflow (push-on-paths + PR + workflow_dispatch — no cron).
Tests cover the full 9-test matrix from the brief plus 3 bonus paths:
T1 Pending on missing Environment (no Gitea writes).
T2 Pending on missing Blueprint (no Gitea writes).
T3 Invalid on parameters schema mismatch — Condition message names
the failing path 'replicas'; no Gitea writes.
T4 single-region happy path → expected manifests written under
clusters/<region>/applications/<app>/ on branch=main, finalizer
added, status.phase=Provisioning, status.primaryRegion populated,
status.giteaRepo populated.
T5 active-active fan-out → 2 regions, 2 manifest sets byte-equal
after region-name canonicalisation. status.primaryRegion empty.
T6 active-hotstandby → primary renders replicas:3 (user param);
standby renders replicas:0 + _openova_standby:true marker.
T7 Idempotency → re-reconcile after success = 0 Gitea writes
(PutFile byte-equality short-circuit).
T8 Deletion cascade → manifests removed from Gitea, finalizer
released after delete pass.
T9 Drift detection → Gitea-side manifest hand-edited; controller
restores byte-identical original on next pass.
+ Pending on Gitea Org missing (org doesn't exist in Gitea even
though Organization CR exists — slice C1 hasn't run yet).
+ Invalid placement-vs-blueprint-allowed-modes (placement-active-active
rejected on a Blueprint declaring only single-region).
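Tests T7 and T9 both follow from the byte-equality short-circuit in step 7. A minimal Python model of that behaviour, assuming a hypothetical in-memory `FakeGitea` in place of the real client (the actual `gitea.PutFile` signature is not shown in the commit message):

```python
from typing import Dict


class FakeGitea:
    """In-memory stand-in that counts how many writes actually happen."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.writes = 0

    def put_file(self, path: str, content: bytes) -> bool:
        """Write only when the stored bytes differ; return True if a write occurred."""
        if self.files.get(path) == content:
            return False  # steady state: nothing to commit
        self.files[path] = content
        self.writes += 1
        return True
```

Re-running a reconcile against unchanged content produces zero writes, while a hand-edited file no longer matches and is overwritten with the rendered bytes on the next pass.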
Module path: github.com/openova-io/openova/core/controllers/application
(per-controller go.mod, matching siblings C1/C2/C3/C5; CC1 promotes
shared internals to core/controllers/internal/ in a follow-up slice).
`go vet ./...` clean. `go test -count=1 -race ./...` all green.
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
8988cd9e4f
|
feat(infra-hetzner): wire all var.regions[] entries end-to-end (slice G1, #1095) (#1131)
Slice G1 of EPIC-0 (#1095, Group G "Multi-cluster substrate"). Today infra/hetzner/main.tf only realises regions[0] end-to-end — every wizard payload's regions[1..N] entries silently no-op. EPIC-6 (#1101) Continuum DR demo needs 3 regions (mgmt + fsn + hel per docs/EPICS-1-6-unified-design.md §3.8 + §11), so this slice closes the gap.
Architecture: hybrid singular-path + secondary-region overlay.
- The legacy singular path (var.region + count = local.control_plane_count) STAYS untouched — every existing Sovereign state (omantel, otech*) keeps its resource addresses (hcloud_server.control_plane[0], hcloud_load_balancer.main, etc) and produces a no-op plan diff.
- New regions (regions[1+]) are realised via a parallel for_each set keyed by "{cloudRegion}-{index}" (e.g. fsn1-1, hel1-2). Each secondary region gets its own /24 subnet inside the shared /16 hcloud_network, its own CP server, its own workers, and its own lb11 load balancer. The hcloud_firewall + hcloud_ssh_key stay shared across regions (one tenant boundary per Sovereign).
Why hybrid not full for_each: a wholesale refactor would change every existing resource address (hcloud_server.control_plane[0] → hcloud_server.control_plane["mgmt"]), forcing every running Sovereign to run `tofu state mv` for ~12 resources or face destructive recreates. The brief explicitly bans that. Hybrid is purely additive — secondary resources are NEW addresses no existing state carries. No `tofu state mv` runbook required. Existing Sovereigns provisioned with var.regions = [] or len(var.regions) == 1 produce identical plans before and after this PR.
Slice G3 (out of scope here) wires Cilium ClusterMesh between secondary regions and adds per-cluster GitOps path differentiation; today every secondary CP renders an identical Flux Kustomization pointed at clusters/<sovereign_fqdn>/.
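The secondary-region keying described above can be sketched in Python. This is an illustration, not the HCL in infra/hetzner/main.tf: the `provider` field name and the choice to take `{index}` from the entry's position in the original regions list (before non-Hetzner entries are filtered out) are assumptions; only the `{cloudRegion}-{index}` key shape and the regions[0] carve-out come from the commit message.

```python
from typing import Dict, List


def secondary_region_keys(regions: List[Dict[str, str]]) -> List[str]:
    """Return the for_each keys for regions[1+] that target Hetzner."""
    keys = []
    for index, region in enumerate(regions):
        if index == 0:
            continue  # regions[0] stays on the legacy singular path
        if region.get("provider") != "hetzner":
            continue  # hypothetical field; other providers are skipped
        keys.append(f"{region['cloudRegion']}-{index}")
    return keys
```

Including the index in the key is what lets two entries for the same cloud region produce distinct resource addresses.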
Tests: tests/multi_region.tftest.hcl exercises 5 scenarios offline via mock_provider + override_resource (no real Hetzner):
- legacy_no_regions_payload (var.regions=[])
- single_region_entry_does_not_double_provision (len==1)
- three_region_mgmt_fsn_hel (EPIC-6 shape)
- same_region_duplicates_produce_distinct_keys
- non_hetzner_regions_are_filtered_out (oci entries skipped)
All 5 pass. CI workflow infra-hetzner-tofu.yaml runs validate + fmt -check + test on every PR touching infra/hetzner/**. Per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled": push-on-merge + pull-request-on-touch + workflow_dispatch only. No cron.
Validation:
$ tofu validate
Success! The configuration is valid.
$ tofu fmt -check -recursive
exit=0
$ tofu test
tests/multi_region.tftest.hcl... pass
  run "legacy_no_regions_payload"... pass
  run "single_region_entry_does_not_double_provision"... pass
  run "three_region_mgmt_fsn_hel"... pass
  run "same_region_duplicates_produce_distinct_keys"... pass
  run "non_hetzner_regions_are_filtered_out"... pass
Success! 5 passed, 0 failed.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2ab442544e
|
feat(controllers): land environment-controller (slice C2, #1095) (#1127)
Implements slice C2 of EPIC-0 #1095 — the environment-controller Go binary. Watches Environment.catalyst.openova.io/v1 CRs (cluster-scoped) and reconciles each Environment to:
1. Verify the per-Org Gitea Org exists (parent Organization gate). Missing org surfaces GiteaOrgReady=False + Pending phase, never panics or crashloops.
2. Track the canonical branch name for this Environment in status.giteaRepoRef.{org,branch} per NAMING-CONVENTION.md §11.2 item 1 (develop/staging/main ↔ dev/stg/prod; uat/poc map to their own branch name).
3. Idempotently write per-vCluster Flux GitRepository manifests into the Org's Gitea repo at the canonical path `clusters/<host-cluster>/environments/<env-name>/gitrepository.yaml` per NAMING §11.2 item 3. Multi-region Environments fan out one commit per spec.regions[]. Identical bytes short-circuit (zero spurious commits in repo history); drift triggers an overwrite with the existing blob SHA.
4. Surface the canonical JetStream subject prefix `ws.{organizationRef}-{envType}.>` on status.jetstreamSubjectPrefix per NAMING §11.2 item 4 + ARCHITECTURE.md §5. Per-Environment NATS Stream CR creation is OUT OF SCOPE here — NACK isn't installed yet (future slice).
5. Set status.phase, status.regionCount (printer column), status.vclusters[], status.observedGeneration, and the Ready/GiteaOrgReady/GitRepositoryWritten conditions.
Architecture rules honored (per docs/INVIOLABLE-PRINCIPLES.md + docs/adr/0001-catalyst-control-plane-architecture.md):
- Flux is the only reconciler in production. The controller writes manifests to Gitea; Flux applies them. NO kubectl apply, NO helm install, NO exec.Command in the codebase.
- Crossplane is cloud-only. This controller is K8s-to-K8s native via controller-runtime + client-go.
- DR is a Placement, not an Env Type. The controller treats spec.envType as the schema-validated enum {prod|stg|uat|dev|poc} with no special-case for DR (per NAMING §11.1).
- Sovereign-independent. The Gitea base URL, secret ref, branch suffix, commit author, and Flux interval are ALL runtime config (per Inviolable Principle #4 — never hardcode).
Files:
- core/controllers/environment/api/v1/types.go — Environment Go types matching the CRD; hand-written DeepCopy to avoid build-time codegen tool dependency.
- core/controllers/environment/internal/gitea/client.go — minimal GitHub-compatible REST client targeting Gitea's /api/v1 (GET /orgs/{org}, GET/POST/PUT /repos/{org}/{repo}/contents/{path}). Idempotent UpsertFile with byte-equality short-circuit + blob-SHA conflict refusal.
- core/controllers/environment/internal/gitops/render.go — pure template rendering of the Flux GitRepository CR. Deterministic field ordering for byte-equality idempotency.
- core/controllers/environment/internal/controller/environment_controller.go — reconciler: validate spec, gate on Gitea Org, fan out per-region manifest writes, set status + conditions.
- core/controllers/environment/cmd/main.go — controller-runtime manager entry point with leader election.
- core/controllers/environment/Containerfile — two-stage build, alpine:3.20 runtime, non-root UID 65534, ENTRYPOINT.
- core/controllers/environment/deploy/rbac.yaml — ClusterRole watching Environments + status subresource + leader election lease.
- .github/workflows/build-environment-controller.yaml — CI mirrors build-cert-manager-dynadot-webhook.yaml: vet + race tests, docker buildx + cosign keyless sign + SBOM attest, push to ghcr.io/openova-io/openova/environment-controller.
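The naming rules in items 2 to 4 of the reconcile description are pure string functions, so they are easy to illustrate. The sketch below is Python rather than the controller's Go, and the function names are hypothetical; the env-type to branch table, the manifest path, and the subject-prefix pattern are taken from the description above.

```python
ENV_TYPES = {"prod", "stg", "uat", "dev", "poc"}
# dev/stg/prod map to develop/staging/main; uat and poc keep their own name.
_BRANCH_BY_ENV_TYPE = {"dev": "develop", "stg": "staging", "prod": "main"}


def branch_for_env_type(env_type: str) -> str:
    """Map an Environment's envType to its canonical Git branch."""
    if env_type not in ENV_TYPES:
        raise ValueError(f"unknown envType: {env_type}")
    return _BRANCH_BY_ENV_TYPE.get(env_type, env_type)


def jetstream_subject_prefix(organization_ref: str, env_type: str) -> str:
    """Build the ws.{organizationRef}-{envType}.> subject prefix."""
    return f"ws.{organization_ref}-{env_type}.>"


def git_repository_path(host_cluster: str, env_name: str) -> str:
    """Build the canonical path of the Flux GitRepository manifest."""
    return f"clusters/{host_cluster}/environments/{env_name}/gitrepository.yaml"
```

Keeping these as side-effect-free helpers is what makes the rendered manifests deterministic, which the byte-equality short-circuit depends on.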
Tests (35 total, all GREEN, race-detector enabled):
- internal/controller (T1–T11):
T1 happy-path single-region reconcile
T2 idempotent re-reconcile (zero spurious commits)
T3 parent Org missing → Pending + GiteaOrgReady=False (no panic)
T4 multi-region fan-out (3 commits, 3 regions)
T5 drift detection: operator hand-edit gets overwritten
T6 placement-vs-regions cardinality violations → Failed
T7 env_type→branch mapping table
T8 Gitea repo missing → Pending + GiteaRepoMissing reason
T9 partial-failure one region → Degraded with that region Failed
T10 Config.Defaults applies the documented defaults
T11 NotFound between dequeue and Get is benign
- internal/gitea: GET /orgs OK + 404 + 500; UpsertFile create /
idempotent / update with SHA / repo-not-found; pathEscape preserves
slashes; arg-validation.
- internal/gitops: BranchForEnvType / JetStreamSubjectPrefix /
HostClusterName (with override) / GitRepositoryPath /
RenderGitRepository (deterministic + complete + anonymous + default
interval + required-field validation) / EnvironmentName.
go vet ./... clean. go test -count=1 -race ./... GREEN.
Out of scope per slice brief: organization-controller (C1),
blueprint-controller (C3), application-controller (C4),
useraccess-controller (C5), catalyst-api codebase changes, NACK
install, per-Environment NATS Stream CRs.
Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
84167a768e
|
feat(controllers): land organization-controller (slice C1, #1095) (#1129)
A thin in-cluster Go controller that watches Organization CRs
(orgs.openova.io/v1) and reconciles four downstream artifacts per
the EPICS-1-6 unified design §3.3 + §3.7 and ADR-0001 §2.7:
1. vCluster HelmRelease — written into the per-Org Gitea repo
(NOT direct apply; Flux reconciles per ADR-0001 §2.1).
2. Keycloak group — at path /<slug> with attributes
{org=[<slug>], tier=[<sme|corporate>]}.
3. Gitea Org — auto-created if absent; one repo per Org seeds
the vCluster + tenant manifests.
4. UserAccess CR — one per spec.owners[] entry; slice C5's
useraccess-controller materializes the RoleBindings.
Per ADR-0001 §2.2 (Crossplane is cloud-only) this is K8s-to-K8s
reconciliation NOT a Crossplane Composition. Per §2.1 the controller
writes manifests via the Gitea HTTP contents API — never kubectl
apply, never helm install, never exec.Command("helm", ...).
Idempotent: re-running on a steady-state CR is a no-op (every
"ensure" is find-or-create with byte-equal short-circuit on PutFile).
What ships:
- core/controllers/organization/cmd/main.go — entry point with
envconfig, leader election, signal handling
- core/controllers/organization/internal/controller/ — reconciler +
KeycloakClient interface + LiveKeycloak impl
- core/controllers/organization/internal/gitea/ — minimal Gitea Admin
REST client (Org/Repo + contents-API). Self-contained — extractable
to core/pkg/gitea-client/ when slice C2 needs it.
- core/controllers/organization/internal/gitops/ — manifest renderer
(namespace + vcluster HelmRelease + kustomization)
- core/controllers/organization/internal/orgapi/ — Organization Go
types mirroring the CRD schema (no deepcopy-gen — inlined)
- core/controllers/organization/Containerfile — multi-stage build
(alpine-based, runs as UID 65534)
- core/controllers/organization/config/{rbac,manager}/ — ClusterRole
+ Deployment scaffolding for chart consumption (slice F1)
- .github/workflows/build-organization-controller.yaml — push/PR/
manual triggers, no cron
Tests: 9 unit tests across 3 packages cover happy-path reconcile,
idempotency (zero net writes on second reconcile), Keycloak group
already exists, Gitea Org already exists, slug/metadata drift,
missing CR no-op, byte-equal PutFile no-op, 422-race re-find,
template structural-YAML validity, and label-vocabulary compliance.
go test -count=1 -race ./... and go vet ./... both clean.
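The byte-equal short-circuit that makes each "ensure" step a no-op on a steady-state CR can be sketched as a small decision function. This is a minimal illustration, not the controller's PutFile: the type and function names are hypothetical, and the real client also deals with blob SHAs and HTTP status codes.

```go
package main

import (
	"bytes"
	"fmt"
)

// upsertAction is the outcome of comparing desired bytes with what the
// repository already holds. (Hypothetical type for illustration.)
type upsertAction string

const (
	actionCreate upsertAction = "create"
	actionUpdate upsertAction = "update"
	actionNoop   upsertAction = "noop"
)

// planUpsert decides what a find-or-create write should do:
// a missing file is created, identical bytes are left alone (so no
// spurious commit lands in history), and anything else is updated.
func planUpsert(existing []byte, exists bool, desired []byte) upsertAction {
	if !exists {
		return actionCreate
	}
	if bytes.Equal(existing, desired) {
		return actionNoop
	}
	return actionUpdate
}

func main() {
	manifest := []byte("kind: HelmRelease\n")
	fmt.Println(planUpsert(nil, false, manifest))
	fmt.Println(planUpsert(manifest, true, manifest))
	fmt.Println(planUpsert([]byte("kind: Other\n"), true, manifest))
}
```

The comparison only works as an idempotency check if rendering is deterministic, which is why stable field ordering in the renderer matters.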
Out of scope: environment-controller (C2), application-controller
(C4), useraccess-controller (C5 — this controller only WRITES
UserAccess CRs).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
dd1699afe3
|
feat(controllers): land useraccess-controller — fix silently broken Crossplane path (slice C5, #1095, P0) (#1128)
Per docs/EPICS-1-6-unified-design.md §3.5 and ADR-0001 §2.3 amendment,
K8s-to-K8s reconciliation belongs to thin in-cluster controllers, not
Crossplane Compositions. The existing useraccess.compose.openova.io
Composition writes RoleBindings via provider-kubernetes — but
provider-kubernetes is NOT installed on any production Sovereign
(caught in the EPIC-0 audit). Every UserAccess CR has been silently
no-op'd. This controller fixes that.
What lands:
- core/controllers/useraccess/cmd/main.go — controller-runtime Manager
with leader election + signal handling, environment-only config
- internal/controller/{reconciler,desired,spec,status,types}.go — the
reconciler. Watches UserAccess.access.openova.io/v1alpha1 (cluster-
scoped, unstructured client) and owns RoleBinding +
ClusterRoleBinding via Owns() so drift triggers reconcile via
ownerRef indexing
- internal/labels/scope.go — Manara DNA scope matcher: AND-within /
OR-across, wildcard scopes, EnforcedScopes() per catalog tier (the
developer auto-injection of openova.io/env-type=dev)
- internal/controller/*_test.go + internal/labels/scope_test.go —
26 unit tests with the controller-runtime fake client. Covers
happy-path, multi-app/multi-ns fan-out, namespaces:["*"]→CRB,
group subjects, drift detection+restore, orphan deletion on spec
shrink, idempotency, invalid spec, ownerRef shape, NotFound no-op,
and the 5-catalog-tier matrix
- deploy/{rbac,deployment}.yaml — ClusterRole/SA/Deployment with
non-root, read-only-rootfs, drop-ALL caps, leader-election Role
- Containerfile — Alpine 3.20 final stage, CGO_ENABLED=0, UID 65534
- .github/workflows/useraccess-controller-build.yaml — event-driven
build (push-on-main + PR test job), SHA-pinned image tags
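The "AND-within / OR-across" matching named for internal/labels/scope.go can be illustrated roughly as follows. The exact semantics are not spelled out in this commit message, so this sketch makes assumptions: a scope is a set of label key/value pairs, a value of "*" matches any value of a key that is present, and an empty scope list matches nothing. All names are hypothetical.

```go
package main

import "fmt"

// scope is one set of label requirements. (Hypothetical type.)
type scope map[string]string

// matchesScope is the AND-within part: every key in the scope must be
// present on the resource, and its value must be equal unless the scope
// value is the wildcard "*". An empty scope therefore matches anything.
func matchesScope(labels map[string]string, s scope) bool {
	for key, want := range s {
		got, ok := labels[key]
		if !ok {
			return false
		}
		if want != "*" && got != want {
			return false
		}
	}
	return true
}

// matchesAny is the OR-across part: one matching scope is enough.
func matchesAny(labels map[string]string, scopes []scope) bool {
	for _, s := range scopes {
		if matchesScope(labels, s) {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]string{
		"openova.io/env-type": "dev",
		"app":                 "wordpress",
	}
	scopes := []scope{
		{"openova.io/env-type": "prod"},
		{"openova.io/env-type": "dev", "app": "*"},
	}
	fmt.Println(matchesAny(labels, scopes))
}
```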
Behaviour:
- Per UserAccess CR, materialises RoleBindings (per namespace) or
ClusterRoleBindings (when namespaces:["*"]) referencing the
canonical openova:application-{admin,editor,viewer} ClusterRoles
- ownerRef back to the UserAccess CR with controller=true +
blockOwnerDeletion=true so K8s GC cascades deletes
- Drift detection: hand-mutated bindings are restored on next pass +
Condition Drift=True surfaced for the UI
- Idempotent: steady-state reconcile = 0 K8s writes
- Status: phase (Pending|Active|Failed), rolebindingsCreated,
observedGeneration, conditions[]
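The fan-out rule in the Behaviour list (one RoleBinding per namespace, or a single ClusterRoleBinding when namespaces is ["*"]) can be sketched without any Kubernetes imports. The struct and function below are hypothetical stand-ins for the real desired-state code, and treating a "*" anywhere in the list as cluster-wide is an assumption.

```go
package main

import "fmt"

// binding is a simplified stand-in for a RoleBinding or
// ClusterRoleBinding. (Hypothetical type for illustration.)
type binding struct {
	Kind        string // "RoleBinding" or "ClusterRoleBinding"
	Namespace   string // empty for ClusterRoleBinding
	ClusterRole string // openova:application-{admin,editor,viewer}
}

// desiredBindings computes the bindings one UserAccess entry should own.
func desiredBindings(role string, namespaces []string) ([]binding, error) {
	switch role {
	case "admin", "editor", "viewer":
	default:
		return nil, fmt.Errorf("unknown role %q", role)
	}
	clusterRole := "openova:application-" + role

	// Assumption: a "*" anywhere in the list means cluster-wide access.
	for _, ns := range namespaces {
		if ns == "*" {
			return []binding{{Kind: "ClusterRoleBinding", ClusterRole: clusterRole}}, nil
		}
	}

	out := make([]binding, 0, len(namespaces))
	for _, ns := range namespaces {
		out = append(out, binding{Kind: "RoleBinding", Namespace: ns, ClusterRole: clusterRole})
	}
	return out, nil
}

func main() {
	bs, err := desiredBindings("editor", []string{"team-a", "team-b"})
	if err != nil {
		panic(err)
	}
	for _, b := range bs {
		fmt.Printf("%s %s -> %s\n", b.Kind, b.Namespace, b.ClusterRole)
	}
}
```

Computing the desired set as plain data first makes drift detection and orphan deletion a simple comparison between desired and observed bindings.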
Out of scope per the brief:
- Crossplane Composition deletion (operator retires post-verify)
- 5-catalog-tier role inheritance (lands with EPIC-3 #1098)
- Keycloak realm-role sync (slice D1b, this controller is consumer)
Tests:
go vet ./... # clean
go test -count=1 -race ./... # 26/26 pass
go test ./internal/labels/... -run TestScope # full 5-tier matrix
Co-authored-by: Hatice Yildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
358c32c032
|
ci: add cluster bootstrap-kit drift guardrail (slice H2 scope-reduced, #1095) (#1122)
Adds .github/workflows/cluster-template-drift.yaml, a warn-only
workflow that reports drift between each
clusters/<sovereign>/bootstrap-kit/ tree and the canonical
clusters/_template/bootstrap-kit/.
Why warn-only, not enforce:
- Every existing Sovereign carries some legitimate drift (per-Sovereign
image SHAs, region-specific values overlay); blocking PRs on diff
count would prevent ALL cluster work.
- The right place to enforce the boundary is Catalyst's
organization-controller (slice C1 of #1095), not CI. Once C1 ships,
every new Sovereign bootstrap-kit is generated from _template and the
attestation lives at apply-time, not at CI-time.
- Retroactively reconciling the existing omantel.omani.works/ and
otech.omani.works/ trees (which have 20+ differing files plus
structural changes, with extra files on each side) is a
high-blast-radius maintenance-window operation, NOT a CI scoped slice.
What this workflow does:
- Triggers on push to main + PR + workflow_dispatch when clusters/**
changes.
- For each clusters/<sovereign>/ directory, runs `diff -rq` against
clusters/_template/bootstrap-kit/ and writes a Markdown report to the
run summary AND a sticky PR comment.
- Counts differing files + only-in-template + only-in-Sovereign per
Sovereign so reviewers can quickly see whether new drift was
introduced.
Per docs/EPICS-1-6-unified-design.md §3.9 row 2 + §11 row 6 (decision
amended from "reconcile + CI gate" to "warn-only CI gate"; structural
reconcile deferred to slice C1 organization-controller).
Per docs/INVIOLABLE-PRINCIPLES.md #4a: workflow only inspects YAML; no
images built, no cloud calls.
Refs: #1094, #1095, slice C1 (organization-controller).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eb6a3c1812
|
fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs — Sovereigns + contabo were frozen at :2122fb8 (#1060)
* fix(catalyst-api): rip out dangling sovereign_* route registrations + chart 1.4.56
PR #1050 deleted sovereign_more.go (which defined HandleSovereignUsers,
HandleSovereignCatalog, HandleSovereignSettings,
HandleSovereignTopology) but left four route registrations in
cmd/api/main.go that still referenced those handler methods. The
catalyst-api build for the merged revert (run 25439549879) failed with:
cmd/api/main.go:690:39: h.HandleSovereignUsers undefined
cmd/api/main.go:691:41: h.HandleSovereignCatalog undefined
cmd/api/main.go:692:42: h.HandleSovereignSettings undefined
cmd/api/main.go:693:42: h.HandleSovereignTopology undefined
That's why ghcr.io/openova-io/openova/catalyst-api:fdd3354 was never
published; only the UI image rolled. Result: omantel.biz catalyst-api
pod stuck in ImagePullBackOff.
Drop the four route registrations. Same baby, new address: the chroot
Sovereign uses the existing /api/v1/deployments/{depId}/* handlers via
the JWT-resolved deploymentId, not parallel-baby /api/v1/sovereign/*
endpoints.
Also revert two more parallel-baby fragments still on main:
- getHierarchicalInfrastructure mode-aware fetcher → single mother URL
(the chroot resolves deploymentId from the cookie and the mother-side
topology handler serves byte-identical data once cutover-import has
persisted the deployment record on the Sovereign's local store)
- CatalogAdminPage.fetchApps mode-aware → /catalog/apps everywhere
Bump bp-catalyst-platform chart 1.4.55 → 1.4.56 and the cluster
Kustomization version pin to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(sovereignDynamicClient): in-cluster fallback when running ON the Sovereign
The chroot Sovereign Console at console.<sov-fqdn> is the SAME
catalyst-api binary as the mother. When that binary runs ON the
Sovereign cluster (catalyst-system namespace on the Sovereign itself),
there is no posted-back kubeconfig: the catalyst-api IS in the cluster
it needs to talk to, and rest.InClusterConfig() returns the right
credentials.
Without this, every endpoint that needs the Sovereign-side dynamic
client returned 503 with "sovereign cluster kubeconfig not yet posted
back", including ListUserAccess (/users page), CreateUserAccess,
infrastructure CRUD, etc. Caught on omantel.biz 2026-05-06: /users
rendered "list user-access: HTTP 503" because the Sovereign-side
catalyst-api was looking for a kubeconfig that doesn't exist on the
chroot side of the cutover boundary.
Detection: SOVEREIGN_FQDN env (set on every Sovereign-side catalyst-api
deployment by the chart) matches dep.Request.SovereignFQDN. On the
mother, SOVEREIGN_FQDN is unset → unchanged behavior. On the chroot,
SOVEREIGN_FQDN matches the only deployment served (its own) → use
in-cluster.
Same fallback applied to tryDynamicClientLocked (loaderInputFor's
best-effort live-source client) so /infrastructure/topology and the
/cloud graph render with live data on the chroot too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(user-access): empty list when CRD absent + RBAC for chroot
Two coupled fixes for the /users page on chroot Sovereign Console:
1. catalyst-api-cutover-driver ClusterRole: grant read/write on
useraccesses.access.openova.io. The Sovereign chroot's catalyst-api
uses the in-cluster ServiceAccount (per PR #1052). The list call was
returning 403 from the apiserver because the SA had no rule covering
this CRD.
2. ListUserAccess: return 200 with empty items when the CRD itself is
not installed (apierrors.IsNotFound). The access.openova.io CRD ships
via a separate blueprint that may not yet be installed on a fresh
Sovereign; the page should render its empty state, not a 500 toast.
Caught live on omantel.biz 2026-05-06 after PR #1052 unblocked the
in-cluster client path: list call surfaced first as 403 (RBAC), then as
500 "server could not find the requested resource" (CRD absent). Both
now resolve to a 200 + [].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): byte-identical /jobs + /cloud: kill fixture fallback, lazy-seed jobs.Store from live cluster, single endpoint
Two parallel-baby paths still made the chroot diverge from the mother
on /cloud and /jobs/{jobId}. Both now ship one path that serves
byte-identical data on both surfaces.
1. CloudPage rendered fictional topology (Frankfurt, Helsinki,
omantel-primary, omantel-secondary, edge-lb, vpc-net-eu, …) when the
topology query errored, because it fell back to
`infrastructureTopologyFixture` from `src/test/fixtures/`. That is a
test-only file leaking into production via the production import tree,
in direct violation of INVIOLABLE-PRINCIPLES #1 (no placeholder data;
empty state when you don't know).
Fix: drop the fixture fallback. On error → null → empty-state render.
The mother shows the same empty state when its loader returns nothing;
byte-identical.
2. JobsTable + JobDetail rendered a flat green-grid because the chroot
was hitting `/api/v1/sovereign/jobs` which returns a minimal shape (no
dependsOn, no parentId, no exec records). Mother's
`/api/v1/deployments/{depId}/jobs` returns the rich shape from a
per-deployment jobs.Store, which on the chroot starts empty (the
mother's exportDeploymentToChild only ships the deployment record, not
the jobs.Store contents).
Fix: ship one URL on both surfaces, `/api/v1/deployments/{id}/jobs`.
Add `chrootSeedJobsStoreIfEmpty` that runs at handler-time when
SOVEREIGN_FQDN matches dep.Request.SovereignFQDN AND the
per-deployment jobs.Store has 0 records: do a one-shot HelmRelease list
via the in-cluster client (helmwatch.ListAndSnapshotHelmReleases,
exported here, mirrors Watcher.SnapshotComponents without spinning up
an informer), pass through snapshotsToSeeds +
Bridge.SeedJobsFromInformerList. Subsequent calls read directly from
the now-populated store and return rich Job records with dependsOn /
parentId / status, exactly like the mother.
useLiveJobsBackfill loses its mode-aware fetcher; the chroot UI uses
the same `/api/v1/deployments/{id}/jobs` URL as the mother.
3. HandleDeploymentImport now also loads the imported record into the
in-memory deployments map immediately, so `/deployments/{id}/*`
handlers don't need a pod restart's restoreFromStore to see the
chroot-imported deployment.
Bump bp-catalyst-platform 1.4.56 → 1.4.57 (chart + Kustomization).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(jobdetail): bare-jobName URL: Traefik strips %3A so canonical id 404s
JobDetail navigation was 404ing on the chroot because the link builder
URL-encoded the canonical Job id ("69e73b3abe673840:install-keycloak")
and Traefik (or any upstream proxy that's RFC 3986 §3.3-strict) does
not decode `%3A` inside path segments. The catalyst-api router saw the
literal "%3A" and Store.GetJob's exact-match path missed.
Two coupled fixes:
1. useJobLinkBuilder strips the "<deploymentId>:" prefix before
encoding, producing /jobs/install-keycloak (Traefik-safe) instead of
/jobs/69e73b3abe673840%3Ainstall-keycloak. Store.GetJob already accepts
both bare jobName and canonical id (see store.go:781-789).
2. JobDetail.jobsById indexes by BOTH canonical id AND bare jobName so
the URL param resolves regardless of which format the link emitted.
Bump chart 1.4.58 → 1.4.59.
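The two jobdetail fixes (strip the "<deploymentId>:" prefix when building links, and index jobs under both forms) can be sketched as below. The actual link builder lives in the TypeScript UI; this Go version is only an illustration with hypothetical names, and it assumes a bare jobName never contains a colon.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// bareJobName strips a leading "<deploymentId>:" prefix if present.
// Assumption: job names themselves contain no ":".
func bareJobName(id string) string {
	if i := strings.Index(id, ":"); i >= 0 {
		return id[i+1:]
	}
	return id
}

// jobPath builds a link from the bare name so that no colon, encoded or
// not, ever appears in the path segment.
func jobPath(id string) string {
	return "/jobs/" + url.PathEscape(bareJobName(id))
}

// indexJobs maps both the canonical id and the bare jobName to the
// canonical id, so a URL parameter in either format resolves.
func indexJobs(canonicalIDs []string) map[string]string {
	idx := make(map[string]string, 2*len(canonicalIDs))
	for _, id := range canonicalIDs {
		idx[id] = id
		idx[bareJobName(id)] = id
	}
	return idx
}

func main() {
	id := "69e73b3abe673840:install-keycloak"
	fmt.Println(jobPath(id))

	idx := indexJobs([]string{id})
	fmt.Println(idx["install-keycloak"] == idx[id])
}
```

Indexing under both keys only stays unambiguous while bare names are unique within one deployment, which is the case when each store is per-deployment.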
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cloud): resolve deploymentId from cookie on chroot: was firing topology against undefined
CloudPage's topology query fired against /deployments/undefined/... on
the chroot (URL is /cloud, no deploymentId path segment), so the page
showed "Couldn't load architecture" with all node counts at 0/0.
Fix: same pattern as JobDetail. useResolvedDeploymentId() reads the
JWT cookie's deployment_id claim via /api/v1/sovereign/self, falling
back from URL params. Topology query also gates on `!!deploymentId` so
it doesn't waste a 404 round-trip during cookie resolution.
Bump chart 1.4.60 → 1.4.61.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): single chrome: no frame in frame, no mother handover banner
Two visible bleed-throughs from the mother's wizard UX onto the chroot
Sovereign Console at console.<sov-fqdn>:
1. **Two stacked headers + sidebar inside sidebar** ("frame in frame").
SovereignConsoleLayout rendered its own sidebar+header AND the page
inside rendered PortalShell which rendered ANOTHER header (its sidebar
was already skipped for chroot per a prior fix). User saw two
horizontal title bars stacked.
Resolution: SovereignConsoleLayout becomes auth-only on the chroot. It
runs the cookie/OIDC auth gate + RequiredActionsModal, then renders
<Outlet/> with NO chrome. PortalShell is now the single chrome owner on
both surfaces:
- Mother (/sovereign/provision/$id): renders Sidebar with
/provision/$id/X URLs + its header.
- Chroot (console.<sov-fqdn>): renders SovereignSidebar with clean /X
URLs + the same header.
One sidebar, one header, byte-identical to mother layout.
2. **"✓ Sovereign is ready — Redirecting to your Sovereign console"
banner on /apps.** This is the mother's wizard celebration that tells
the operator "you can now jump to your new Sovereign". On the chroot
the operator IS already on the Sovereign Console; the banner bleeds
through because the imported deployment record carries the mother's
handover-ready event in its history.
Resolution: AppsPage gates the banner, the toast, and the
auto-redirect timer on `!isSovereignMode`. Chroot stays clean.
Bump chart 1.4.62 → 1.4.63.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chroot): wrap chroot-only pages in PortalShell + drop /catalog page
Three chroot-only pages bypassed PortalShell entirely. After
SovereignConsoleLayout went auth-only in #1057, they rendered
full-bleed with no sidebar / no header, a visible look-and-feel break.
/settings/marketplace → MarketplaceSettings (wrapped in PortalShell)
/parent-domains → ParentDomainsPage (wrapped in PortalShell)
/catalog → CatalogAdminPage (deleted)
Drop /catalog entirely per founder direction: a separate page just to
flip a "publish to marketplace" boolean per app is the wrong shape. The
natural place for that toggle is on each /apps card (future PR; needs
HandleSovereignApps to join publish state from the SME catalog
microservice).
Removed:
- /catalog route registration in router.tsx
- 'Catalog' entry in SovereignSidebar's FLAT_NAV
- CatalogAdminPage.tsx (525 lines)
- 'catalog' from ActiveSection union + deriveActiveSection regex
The publish-state PATCH endpoint at /catalog/admin/apps/{slug}/publish
on the SME catalog service is unaffected; it's exposed at
marketplace.<sov-fqdn>, not console.<sov-fqdn>, and the future
apps-card toggle will call it via the same path.
Bump chart 1.4.64 → 1.4.65.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(apps): publish chip on each card: replaces deleted /catalog page
Per founder direction: "if the catalog is just labeling an app to be
shown in marketplace, why don't we do it through the apps?" Drop the
standalone /catalog page (#1058), put the publish toggle on each /apps
card.
Backend (catalyst-api):
- New file sme_catalog_client.go: best-effort client for the
in-cluster SME catalog microservice at
http://catalog.sme.svc.cluster.local:8082. 30s response cache, 1.5s
probe budget, returns nil on DNS NXDOMAIN (SME services tier not
deployed on this Sovereign, common when marketplace.enabled is false).
- HandleSovereignApps decorates each app with `marketplacePublished`
*bool joined by slug from the SME catalog. nil ⇒ slug not in SME
catalog (bootstrap component, or marketplace not deployed) ⇒ FE
suppresses the chip.
- New handler HandleSovereignAppPublish at PATCH
/api/v1/sovereign/apps/{slug}/publish. Body {"published": bool}.
Proxies to PATCH /catalog/admin/apps/{slug}/publish on the SME catalog.
Surfaces upstream status verbatim. Invalidates the cache so the next
/apps poll reflects the change immediately.
Frontend (AppsPage):
- liveAppsQuery returns { statusById, publishedBySlug } instead of the
bare status map.
- Each AppCard with a non-null marketplacePublished renders a
PUBLISHED / UNPUBLISHED chip alongside the status chip. Click → PATCH →
optimistic refetch via React Query.
- Bootstrap components and apps not in the SME catalog have nil → no
chip (correct: nothing to toggle).
- Cards with marketplace.enabled=false render no chips at all (SME
catalog unreachable → nil for every slug).
Bump chart 1.4.66 → 1.4.67.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(chart,ci): auto-bump literal catalyst-{api,ui} SHAs so all Sovereigns + contabo get fresh code
Audit triggered by founder asking if PRs #1051..#1059 reach NEW
Sovereigns or just my manual `kubectl set image` patches on omantel.
Answer was: nothing reached anyone except omantel via manual patches.
Both contabo AND every fresh Sovereign would install :2122fb8, the SHA
frozen at PR #1040's last manual chart-touch on May 6 morning.
Root cause:
- chart/templates/api-deployment.yaml + ui-deployment.yaml carry
LITERAL image refs ("ghcr.io/openova-io/openova/catalyst-api:2122fb8"),
not Helm-templated `{{ .Values.images.catalystApi.tag }}`.
- catalyst-build CI's deploy step bumped values.yaml's catalystApi.tag
on every push, but no template reads from it. Dead code.
- contabo's catalyst-platform Flux Kustomization at
./products/catalyst/chart/templates applies these as raw manifests.
- Sovereigns Helm-install the same chart; Helm passes the literal
through unchanged.
- Both ended up frozen at whatever literal was committed at the last
manual chart-touching PR.
Fix:
1. CI's deploy step now bumps both the literal SHAs in the two
template files AND the unused-but-kept-for-SME-services values.yaml.
Sed-patches the literal directly so contabo's Kustomize path keeps
working.
2. The commit step adds the two templates to the staged set alongside
values.yaml, so every "deploy: update catalyst images to <sha>" commit
propagates to contabo (10-min reconcile) AND Sovereigns (next OCI chart
publish via blueprint-release).
3. Bump bp-catalyst-platform 1.4.68 → 1.4.69 so the new chart with the
latest literal (currently :8361df4) gets republished and pinned in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.
Why drop the "freeze contabo" intent of the previous comment:
The previous comment said contabo auto-roll on every PR was bad because
PR #975's image broke contabo (k8scache startup loop). Solution there
is: fix the bug in the code, not freeze contabo. Freezing masked real
divergence: the reason the founder caught this is that manual omantel
patches were the only thing keeping omantel current while contabo +
every other fresh Sovereign quietly ran 9 PRs behind.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
953ef8290f
|
fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs (#980)
* fix(catalyst-ui): drop stale params={{ deploymentId }} from clean-root Links (#975)
#976 collapsed `to="/provision/$deploymentId/<page>"` to clean root
paths (`to="/<page>"`) but left the `params={{ deploymentId }}` prop
on every callsite, breaking the Vite tsc build with TS2353. Fixes:
- Drop `params={{ deploymentId }}` from Links whose target is now a
parameterless clean root path (StatusStrip, AppDetail, AppsPage,
DecommissionPage, FlowPage, JobDetail, JobsPage, JobsTimeline,
SettingsPage, DeploymentsList).
- For Links whose `to` still uses `$componentId`/`$jobId`, cast
`params` with `as never` to match the existing pattern in
cloud-compute/cloud-network/cloud-storage/Sidebar/UserAccess
(the dual-mount under provisionRoute + consoleLayoutRoute defeats
TS's strict params inference; the runtime path is correct).
- Drop `deploymentId` prop + interface field from JobCard / JobRow /
JobsTable / AppCard now that the Links don't need it; update test
fixtures + the JobsTable row-link assertion to match the new
clean `/jobs/$jobId` href.
- Drop the unused ArchEdgeType import in k8sAdapter (TS6196).
- Dashboard navigateToApp uses `as never` casts to align with the
same pattern.
* fix(catalyst-build): stop auto-bumping contabo Kustomize-path image refs
Two paths consume the catalyst-api / catalyst-ui images:
1. bp-catalyst-platform OCI chart (Sovereigns) — values.yaml driven, tag
in values.yaml is rendered at helm install time by Sovereign Flux.
2. contabo Kustomize-path — literal image refs in templates/api-deployment.yaml
and templates/ui-deployment.yaml. Flux kustomize-controller on contabo
reconciles those files directly.
The CI deploy step was bumping BOTH on every PR, which auto-rolled
contabo every time anyone merged a catalyst-api code change. On
2026-05-05 PR #975's k8scache feature broke contabo startup on the
auto-roll because contabo has 27 dead-Sovereign kubeconfigs that the
new code iterates synchronously at startup, blocking readiness.
Fix: keep the values.yaml bump (Sovereigns auto-pick-up via OCI chart
which is the right behaviour for fresh provisions). Drop the
templates/*-deployment.yaml bump so contabo only rolls when an
operator manually commits a validated SHA into those files.
Closes the auto-deploy-to-contabo blast radius on every PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2ff50f0591
|
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails
on fresh Sovereign):
#952: bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:
- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
so bp-reflector mirrors the source Secret into the namespace
automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.
#953: services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use
`image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each
services-build run bumped only auth.yaml while reporting "update sme
service images to ${SHA}", leaving the live Pod on stale bytes (PR
#951's #941 fix never reached services-catalog despite the merge +
chart bump chain). Fix:
- After the hardcoded loop, also bump `images.smeTag` in
products/catalyst/chart/values.yaml with a strict regex match
(`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes
(defends against silent drift if a contributor renames the field).
- Mirror the change into the retry-path `rewrite()` function so a
reset-to-origin/main retry does not recreate the original bug.
Tests:
- platform/newapi/chart/tests/imagepullsecrets-render.sh: 4 cases
asserting the Deployment and channel-seed Job carry the default
ghcr-pull reference, that an empty override suppresses the block, and
that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh: 3 cases reproducing the
workflow's rewrite logic on a sandboxed copy of the live chart,
asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get
bumped, that helm-render of the catalyst chart with the bumped values
produces all 8 SME-service Deployments at the new SHA, and that an
idempotent re-bump to a second SHA also lands cleanly.
Refs: #952 #953 (umbrella #915, alice signup gate 5).
Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
db332f6767
|
fix(ci): services-build auto-bumps chart patch + dispatches blueprint-release (#874)
* fix(bp-catalyst-platform): bump 1.4.8 -> 1.4.9 to republish with current services-auth image (#871) Chart 1.4.8 was published from commit |
||
|
|
1d93b6c5af
|
feat(e2e): SME demo Playwright spec — full 6-step happy path (#805) (#823)
Authors the load-bearing investor-demo proof artefact for the
SME-tenant turnkey experience epic (#795). The spec walks the FULL
happy path against the catalyst-ui SPA and emits 1440×900 screenshots
at every assertion so the DoD checklist is satisfied with visual
evidence rather than narrative.
What landed:
- products/catalyst/bootstrap/ui/e2e/sme-demo.spec.ts: single linear
spec covering Step 1 (marketplace signup) → Step 2 (provisioning) →
Step 3 (SME admin first login + dashboard) → Step 4 (create alice via
unified-rbac with 3-step ADR-0003 hook progress) → Step 5a (alice on
WordPress) → Steps 5b/5c/5d/6 fixme'd with TODO links to unblocking
issues.
- products/catalyst/bootstrap/ui/e2e/lib/config.ts: central registry
of every URL, hostname, fixture user, and UUID the spec uses. Per
feedback_never_hardcode_urls.md, no test inlines a hostname; every
asserted host derives from OTECH_FQDN + SME_SLUG.
- products/catalyst/bootstrap/ui/e2e/lib/sme-fixtures.ts:
wire-shape-faithful page.route mocks for tenant discovery,
/api/v1/whoami, /api/v1/sme/tenants, /api/v1/sme/users (CRUD), the
deployment endpoints, app placeholders for WordPress/OpenClaw/webmail,
and the /api/v1/sme/billing/ledger surface. Each helper is the seam
between mock-mode (today) and live-mode (post-#804) so the spec opts
out of any single mock by simply not calling that helper.
- .github/workflows/sme-demo-e2e.yaml: push + PR + dispatch trigger
that runs the spec against a freshly-installed dev tree with
VITE_CATALYST_MODE=sovereign + VITE_SOVEREIGN_FQDN set so the
SovereignConsoleLayout's auth gate has a non-null sovereignFQDN.
Uploads the 805-* screenshot evidence as a 30-day artefact.
Run today on a fresh checkout:
cd products/catalyst/bootstrap/ui
VITE_CATALYST_MODE=sovereign \
VITE_SOVEREIGN_FQDN=acme.otech.example \
npm run dev &
PLAYWRIGHT_HOST=http://localhost:5173 \
npx playwright test e2e/sme-demo.spec.ts
Result: 6 passed, 4 fixme (5b/5c/5d/6, all with TODO links to #804 /
#798 / #802-followup).
Live-mode follow-up (after #804 lands a fresh otech with the SME tenant
pipeline wired): drop the mock installers from beforeEach and flip
OTECH_FQDN/SME_SLUG via env. The spec stays; only the helper calls
change.
Per docs/INVIOLABLE-PRINCIPLES.md:
#1 (waterfall): the canonical 6-step contract from #805 is asserted in
this first cut, not staged across cycles.
#2 (never compromise): every step that's deferred is fixme'd with a
blocker link, never silently skipped.
#4 (never hardcode): every URL routes through e2e/lib/config.ts.
Refs: openova-io/openova#795, openova-io/openova#804, ADR-0003
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
|
||
|
|
9645a9044a
|
feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) (#818)
* feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798)
Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add the SME-2 metering integration end-to-end.
NewAPI is consumed as the upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned mirror, not a fork); the metering envelope is produced by a Go sidecar that observes the OpenAI-style `usage.total_tokens` field on every 2xx /v1/* response. This avoids forking the upstream binary while still producing the canonical envelope shape on `catalyst.usage.recorded`.
A) NewAPI metering sidecar: core/services/metering-sidecar/
- Transparent reverse proxy in front of NewAPI on its own port; the bp-newapi Service routes the cluster-fronting port to the sidecar, which forwards to NewAPI on the pod's loopback.
- Observes successful /v1/* JSON responses, parses `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes one envelope on `catalyst.usage.recorded` per completed request.
- Failed (non-2xx), non-JSON, and admin-path requests are NOT billed.
- Customer-facing latency is NEVER blocked on metering: the response body is restored before publish; on NATS unreachable the envelope is persisted to disk and retried by a background drain loop.
- 14 unit tests (proxy + publisher + safeFilename guards).
B) sme-billing NATS subscriber: core/services/billing/handlers/metering_consumer.go
- JetStream durable consumer `sme-billing-metering` on stream `CATALYST_USAGE` (provisioned by sme-billing on startup).
- Idempotent on metadata.request_id via a UNIQUE partial index on credit_ledger.external_ref; redelivery from the broker collapses to a single ledger row.
- Customer auto-create on cold start (the rbac sme.user.created envelope may land AFTER the first metered request; we don't strand usage waiting for it).
- 11 unit tests covering happy-path, idempotency, malformed-payload poison-pill, missing-request-id, non-negative amount guard, resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak.
C) HTTP handler POST /billing/metering/record: handlers/metering.go
- Synchronous validate → INSERT credit_ledger → return {ledger_entry_id, balance_after_omr, balance_after_micro_omr, duplicate}. Same payload + idempotency guard as the NATS path.
- Auth: superadmin OR sovereign-admin (operator-admin model; end-user LLM traffic flows through the sidecar, never this URL).
- 8 unit tests covering happy-path, idempotency, role gating, malformed-JSON, positive-amount rejection, customer-not-found.
D) Schema: core/services/billing/store/store.go
- ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR as an exact integer, which preserves precision at metering rates).
- ADD COLUMN external_ref TEXT + UNIQUE partial index for idempotency dedup.
- ADD COLUMN metadata JSONB for the raw envelope.
- GetCreditBalance projects both amount_omr (legacy) and amount_micro_omr (new) into the integer-OMR view.
- GetCreditBalanceMicroOMR returns canonical precision.
- RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0) distinguishes fresh insert from duplicate without a follow-up SELECT.
E) Wiring
- core/services/shared/events/nats.go: minimal NATS JetStream publisher + subscriber surface; legacy RedPanda producer/consumer in events.go untouched per [Q-mine-3].
- core/services/billing/main.go: NATS_URL env; subscriber wired in parallel with the existing RedPanda tenant-events consumer.
- middleware/jwt.go: exported test helper WithClaims so handler tests can construct an authenticated context without minting a real signed token.
- .github/workflows/services-build.yaml: metering-sidecar added to the build matrix; deploy job skips it (image consumed by the bp-newapi chart, not products/catalyst sme-services).
F) bp-newapi chart (1.0.0 → 1.1.0)
- meteringSidecar block in values.yaml: image, port, NATS URL, priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool dir, header names, resources, securityContext (read-only-rootfs).
- deployment.yaml renders the sidecar container + emptyDir spool volume when meteringSidecar.enabled (default true).
- service.yaml routes the cluster-fronting :3000 to the sidecar when enabled, exposes a separate :3001 → NewAPI direct port for bp-catalyst-platform admin-API traffic (ADR-0003 §3.2).
- networkpolicy.yaml allows the sidecar's port + nats-system egress for JetStream publish.
Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green. Helm template renders cleanly with sidecar enabled and disabled.
Closes #798
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798)
Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000") that does NOT scan directly into Go int64; the integration test TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on the post-redeem balance read.
Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is unambiguously bigint and Scan target stays uniform across pre-#798 rows (amount_omr only) and post-#798 rows (amount_micro_omr present).
Affects:
- GetCreditBalance
- GetCreditBalanceMicroOMR
- RecordUsage's running-balance read
Test mocks updated to match the new SQL prefix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
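The micro-OMR arithmetic from the metering commit above can be sketched as follows. This is an illustrative Python sketch, not the sidecar's Go code: the function names are hypothetical, and only the formula `amount_micro_omr = -tokens * priceMicroOMRPerToken`, the default price of 156, and the 1 OMR = 1,000,000 micro-OMR ratio are taken from the commit message.

```python
from fractions import Fraction

# Both constants are quoted in the commit message.
MICRO_PER_OMR = 1_000_000                # 1 OMR = 1,000,000 micro-OMR
DEFAULT_PRICE_MICRO_OMR_PER_TOKEN = 156  # chart default, i.e. 0.000156 OMR/token


def usage_amount_micro_omr(total_tokens, price=DEFAULT_PRICE_MICRO_OMR_PER_TOKEN):
    """Ledger amount for one metered request (hypothetical helper).

    Usage is a debit, so the amount is negative; integer math keeps it exact.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return -total_tokens * price


def micro_omr_to_omr(amount_micro_omr):
    """Convert micro-OMR to OMR as an exact fraction (no float rounding)."""
    return Fraction(amount_micro_omr, MICRO_PER_OMR)


if __name__ == "__main__":
    amount = usage_amount_micro_omr(1500)
    print(amount, float(micro_omr_to_omr(amount)))
```

Keeping the ledger column as an integer count of micro-OMR is what lets a sub-cent charge such as -0.000234 OMR be stored as exactly -234 rather than as a rounded decimal.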
||
|
|
93bd3ace5b
|
feat(bp-openclaw): workspace controller + per-user pod chart (#803) (#810)
Implements locked decision [A] of epic #795: per-SME-tenant workspace controller deployment + per-user runtime pod, identity-blind by construction. Consumes the per-user newapi-key-{uuid} Secrets rendered by the unified-rbac user-create hook (ADR-0003 §3.3).
What this delivers:
- platform/openclaw/chart/: bp-openclaw v0.1.0 (no-upstream)
- platform/openclaw/runtime/: Go reference runtime (NEWAPI_BASE_URL + NEWAPI_KEY env contract only)
- .github/workflows/openclaw-runtime.yaml: event-driven build for the runtime image (paths-on-push + manual rerun; NO schedule:cron per CLAUDE.md).
- platform/openclaw/blueprint.yaml: Catalyst registration + configSchema.
Chart highlights:
- Required values guarded by _helpers.tpl :: assertRequired so missing realmURL/clientSecretName/tenant.namespace/baseURL/host fail render with helpful messages.
- RBAC: namespaced Role in tenant ns; create verbs split into separate rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md. Label-based ownership (catalyst.openova.io/openclaw-user) enforced at the controller, not in RBAC.
- ingress: cert-manager.io/cluster-issuer annotation triggers ACME auto-issuance for openclaw.<sme-domain>.
- per-user pod template ConfigMap holds the pod-spec the controller renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders filled at session-start.
- networkPolicy covers controller pod only; per-user pod NetworkPolicy is rendered by the controller at session-start (target hostname is read from the per-user Secret, which doesn't exist at chart-render time; documented in README.md).
Tests: chart/tests/render-toggles.sh (7 cases) covers required-value enforcement, RBAC create+resourceNames violation guard, ServiceMonitor default-off, networkPolicy toggle, pod-template placeholder presence, cert-manager annotation. All seven gates pass locally.
Closes part of #795 (epic still open).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
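The session-start placeholder fill described for the per-user pod template can be illustrated with a short sketch. It is written in Python with `string.Template` purely for illustration; the controller's real rendering code is not shown in this log. The pod manifest, the image name, and the Secret key `key` are made-up assumptions; only the `${USER_UUID}`/`${SECRET_NAME}` placeholders, the `newapi-key-{uuid}` Secret naming, the `NEWAPI_KEY` env name, and the ownership label come from the commit message.

```python
from string import Template

# Hypothetical pod template; only the two placeholders, the label key and
# the NEWAPI_KEY env name are taken from the commit message.
POD_TEMPLATE = """\
apiVersion: v1
kind: Pod
metadata:
  name: openclaw-${USER_UUID}
  labels:
    catalyst.openova.io/openclaw-user: ${USER_UUID}
spec:
  containers:
    - name: runtime
      image: example.invalid/openclaw-runtime:dev
      env:
        - name: NEWAPI_KEY
          valueFrom:
            secretKeyRef:
              name: ${SECRET_NAME}
              key: key
"""


def render_pod_spec(template_text, user_uuid):
    """Fill the placeholders for one user session.

    substitute() raises KeyError if the template contains a placeholder
    that is not supplied, so a typo in the template fails loudly.
    """
    secret_name = f"newapi-key-{user_uuid}"
    return Template(template_text).substitute(
        USER_UUID=user_uuid, SECRET_NAME=secret_name
    )


if __name__ == "__main__":
    print(render_pod_spec(POD_TEMPLATE, "0b9d7c1e-demo"))
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: an unfilled placeholder should stop the render instead of reaching the cluster.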
||
|
|
9adca8442a
|
fix: ci actions:write + auth-layout overflow scroll (#712 followup, #721 followup) (#728)
Two unrelated production-bug fixes squashed because they came out of the same live verification pass on console.openova.io 2026-05-04.
1. catalyst-build.yaml deploy job permissions
PR #720 added a `gh workflow run blueprint-release.yaml` dispatch step at the end of the deploy job to close the bot-deploy-doesn't-trigger-workflows gap from #712. Step has been failing on every run since with HTTP 403 "Resource not accessible by integration" because GITHUB_TOKEN lacks `actions: write` by default. Result: blueprint-release was never dispatched after PR #722–727 merged; the bp-catalyst-platform OCI artifact stayed on the pre-fix chart and any Sovereign provisioned afterwards picked up the buggy chart. Add the missing permission so dispatch succeeds.
2. AuthLayout.tsx vertical centering at small viewport heights
The sign-in / verify cards were mathematically centered at 1440×900 (Δ=0.008px verified via getBoundingClientRect in Playwright) but founder reports the card sitting at the top of the screen on real-world viewports. Root cause: the right panel had `flex flex-1 items-center justify-center`, which centers ONLY if the inner content fits within the viewport; at smaller heights the form's natural content flow pushed the card off-screen with no scroll fallback.
Fix: add `items-stretch` to the outer flex (so the right panel fills full viewport height), `overflow-y-auto` on the right column (so the card can scroll inside its column when too tall), and `py-8` padding on the card wrapper (breathing room when scrolling kicks in).
Result: card is vertically centered when content fits, and stays visible (column-scrollable) when it doesn't, on every viewport height from 1024×600 up.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
35183af5be
|
fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712) (#720)
* feat(bp-catalyst-platform): expose marketplace + tenant wildcard, bump 1.3.0 (closes #710)
Marketplace exposure for franchised Sovereigns. Otech becomes a SaaS operator with a single overlay toggle.
Changes
=======
products/catalyst/chart:
- Chart.yaml 1.2.7 → 1.3.0
- values.yaml: ingress.marketplace.enabled toggle (default false) + marketplace.{brand,currency,paymentProvider,signupPolicy} surface
- templates/sme-services/marketplace-routes.yaml: HTTPRoute marketplace.<sov> with /api/ → marketplace-api, /back-office/ → admin, / → marketplace; HTTPRoute *.<sov> → console (per-tenant wildcard)
- templates/sme-services/marketplace-reference-grant.yaml: cross-namespace ReferenceGrant from catalyst-system HTTPRoute → sme Services
- .helmignore: stop excluding sme-services/* and marketplace-api/* (only *.kustomization.yaml + *.ingress.yaml remain Kustomize-only)
- All sme-services/* + marketplace-api/* manifests wrapped with {{ if .Values.ingress.marketplace.enabled }} so non-marketplace Sovereigns render the chart unchanged
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
- chart version 1.2.7 → 1.3.0
- ingress.hosts.marketplace.host: marketplace.${SOVEREIGN_FQDN}
- ingress.marketplace.enabled: ${MARKETPLACE_ENABLED:-false}
infra/hetzner:
- variables.tf: marketplace_enabled var (string "true"/"false", default "false")
- main.tf: thread var into cloudinit-control-plane.tftpl
- cloudinit-control-plane.tftpl: postBuild.substitute.MARKETPLACE_ENABLED on bootstrap-kit, sovereign-tls, infrastructure-config Kustomizations
products/catalyst/bootstrap/api/internal/provisioner/provisioner.go:
- Request.MarketplaceEnabled bool (json:"marketplaceEnabled")
- writeTfvars: marketplace_enabled = "true"|"false"
core/pool-domain-manager/internal/allocator/allocator.go:
- canonicalRecordSet adds "marketplace" prefix → marketplace.<sov> resolves via PDM at zone-commit time (PR #710 explicit record so caches don't depend on the *.<sov> wildcard alone)
DoD ready
=========
- helm template with ingress.marketplace.enabled=false → identical manifest set to 1.2.7 (verified locally)
- helm template with ingress.marketplace.enabled=true → emits 17 extra resources: 13 sme-services workloads + 2 marketplace-api + 1 HTTPRoute pair + 1 ReferenceGrant
- pdm tests: TestCanonicalRecordSet, TestCommitDNSShape green
- catalyst-api builds, provisioner cloudinit_path_test green
* fix(ci): catalyst-build dispatches blueprint-release after deploy commit (closes #712)
The deploy job's `git push` is made under GITHUB_TOKEN; per GitHub Actions design, commits authored by GITHUB_TOKEN don't re-trigger workflows. blueprint-release.yaml's `on.push.paths: products/*/chart/**` filter matches the deploy commit's diff (chart/values.yaml + chart/templates/{api,ui}-deployment.yaml), so the workflow SHOULD fire, but doesn't, leaving the bp-catalyst-platform:1.2.7 OCI artifact stuck on whatever catalyst-api SHA was current at the last manual chart-touching PR.
Today (2026-05-03) this stranded otech62-otech66 on catalyst-api:74d08eb six PRs after the SHA was superseded; every fresh Sovereign installed the buggy pre-#701 image and rejected handover with 401 unauthenticated.
Fix: after `git push` succeeds in the deploy job, dispatch blueprint-release explicitly via `gh workflow run`. The dispatched run re-renders + re-publishes the chart with the just-pushed values.yaml.
Closes #712.
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
b5c9839da7
|
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables: UI: - AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server callback; sovereign → client-side OIDC token exchange via oidc.ts) - Router: sovereign console routes (/console/*), DETECTED_MODE index redirect, authCallbackRoute dedup fix, authHandoverRoute safety net - StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token before redirecting operator to Sovereign console (falls back to plain URL on error) API: - main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env - deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time - provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON - auth.go: /auth/handover endpoint for seamless single-identity flow Infra: - cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/ - variables.tf: handover_jwt_public_key variable (sensitive, default empty) Chart: - api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars Playwright CI fixes: - playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard - playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix - cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/* - sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests). Co-authored-by: e3mrah <e3mrah@openova.io> |
||
|
|
10c8e997c4
|
fix(catalyst): restore literal image refs in Kustomize-path deployment YAMLs (#614)
The feat/global-imageRegistry (#580) PR converted the literal image refs
in api-deployment.yaml and ui-deployment.yaml to Helm template expressions
({{ .Values.global.imageRegistry }}...) without updating the CI deploy step
to also patch those files. Since the catalyst-platform Flux Kustomization
reads these files as raw manifests (not via helm-controller), the Helm
template syntax was never rendered, leaving a literal '{{ if ... }}'
string as the image reference → InvalidImageName on every Pod start.
Root cause: two consumers of the same file — Helm chart path (Sovereign
clusters) and Kustomize path (contabo-mkt) — but only the Helm path was
handled by the deploy job.
Fix:
- Restore literal `ghcr.io/openova-io/openova/catalyst-{api,ui}:b50a600`
image refs in the Kustomize-path deployment YAMLs (immediate unblock).
- Update CI deploy step to sed-patch those literal refs on every deploy
commit so future image rolls keep both paths in sync (durable fix).
Closes: the InvalidImageName regression introduced in #580.
Unblocks: issue #608 (Phase-8b Agent A magic-link auth) — catalyst-api
was stuck at InvalidImageName since commit
|
||
|
|
59fb2b742c | fix(ci): use awk instead of python heredoc in deploy — fixes YAML parse error | ||
|
|
885e032dc5 |
fix(ci): deploy job updates values.yaml SHA tags, not Helm template files
The previous sed targeted ui-deployment.yaml + api-deployment.yaml for
`image: ghcr.io/.../catalyst-ui:.*` but those files use Helm template
expressions (`{{ .Values.images.catalystUi.tag }}`), so sed silently
no-ops. Result: every catalyst build committed "No changes" and the
deployed image was never updated.
Fix: switch deploy job to update images.catalystUi.tag and
images.catalystApi.tag in products/catalyst/chart/values.yaml via
python3 regex (handles multiline YAML reliably).
Also bump catalystUi + catalystApi tags to
|
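The values.yaml tag update that this commit describes can be sketched as a small regex rewrite. This is a minimal Python sketch under stated assumptions, not the deploy job's actual script (which is not reproduced in this log): the `images.<name>.tag` layout of the sample is assumed from the key names in the message, the function name is hypothetical, and the tag values are placeholders.

```python
import re

# Assumed layout of products/catalyst/chart/values.yaml (illustrative only).
SAMPLE_VALUES = """\
images:
  catalystUi:
    repository: ghcr.io/openova-io/openova/catalyst-ui
    tag: "old1111"
  catalystApi:
    repository: ghcr.io/openova-io/openova/catalyst-api
    tag: "old2222"
"""


def set_image_tag(values_text, image_key, new_tag):
    """Rewrite the first `tag:` found under `<image_key>:` (hypothetical helper).

    Caveat: if the block has no tag line, the lazy match could run into the
    next block, so this assumes every image block carries its own tag.
    """
    pattern = re.compile(
        rf"(^[ \t]*{re.escape(image_key)}:[ \t]*\n"  # the "<image_key>:" line
        rf"(?:[ \t]+.*\n)*?"                         # indented lines before the tag
        rf"[ \t]+tag:[ \t]*)(\S+)",                  # the tag key, then its value
        re.MULTILINE,
    )
    updated, count = pattern.subn(
        lambda m: m.group(1) + f'"{new_tag}"', values_text, count=1
    )
    if count == 0:
        # Fail loudly instead of committing "No changes".
        raise ValueError(f"no tag found for image key {image_key!r}")
    return updated


if __name__ == "__main__":
    print(set_image_tag(SAMPLE_VALUES, "catalystApi", "abc1234"))
```

Raising when nothing matched addresses the failure mode the commit message describes, where a pattern that silently matches nothing lets every build report "No changes".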
||
|
|
942be6f58d
|
fix(ci): disable buildx provenance+sbom attestation in dynadot-webhook build (#583)
containerd 1.7.x on k3s cannot pull multi-arch images whose OCI index includes an attestation manifest (the unknown/unknown platform entry added by docker/build-push-action when provenance=true). Containerd resolves the manifest index, encounters the attestation entry, fetches its descriptor from GHCR which returns an HTML 404 page, and then caches that HTML page as a blob SHA — every subsequent pull of ANY tag for that image returns the same HTML SHA instead of the real layer. Fix: set provenance=false + sbom=false on the build-push-action step. SBOM attestation is handled separately by cosign attest, which does not embed its manifest into the OCI index. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
52c6938e02
|
ci(catalyst-build): watch infra/hetzner/** so cloudinit changes rebuild catalyst-api (#472)
Phase-8a-preflight bug #2 (after #471's tftpl escape fix): catalyst-api Docker image bakes /infra/hetzner/cloudinit-control-plane.tftpl. Without this path in the build trigger, fixes to that file do NOT rebuild the image — the running pod keeps using the stale tftpl and provisioning keeps failing with the same Tofu error. Per CLAUDE.md Rule 4a (GitHub Actions is the only build path), the path filter MUST cover every directory the image depends on. Missing infra/hetzner/** was a long-standing latent CI bug — surfaced by Phase-8a #454 first live provision attempt. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
1628a1b3aa
|
ci(preflight): GHCR auth for A+E + WBS tick — all 4 preflights done (#470)
First runs of preflight A (bootstrap-kit) and E (Keycloak) failed with the same error: helm OCI pull from ghcr.io/openova-io/bp-* returning 401 'unauthorized: authentication required'. bp-* are PRIVATE GHCR packages. #460's agent fixed it for B in c26fbcaf. #461's already had GHCR login. This commit applies the same helm-registry-login pattern to A and E. WBS state on main after this commit: - done (35): all chart-level + #317 + #319 + #453 + 4 preflights - wip (0) - blocked (3): 454, 455, 456 (Phase-8 live runs, operator-driven) The preflights' first runs ALREADY surfaced a real CI bug pattern that would have hit Phase 8a — exactly what they're for. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
4a7eb42d26
|
feat(ci): Phase-8a preflight E — Keycloak realm-import + kubectl OIDC client (closes #462) (#468)
Surfaces Risk R6 (docs/omantel-handover-wbs.md §9a — Keycloak realm-import config-CLI bootstrap timing untested). bp-keycloak 1.2.0 ships a sovereign realm + a public kubectl OIDC client via the upstream bitnami/keycloak chart's keycloakConfigCli post-install Helm hook (issue #326); this workflow proves it actually wires up on a clean cluster before we run it on a real Sovereign. Workflow installs bp-keycloak 1.2.0 on a kind cluster (helm/kind-action v1, kindest/node:v1.30.6 — same versions as test-bootstrap-kit), waits for the keycloak StatefulSet to roll out, polls for the keycloakConfigCli post-install Job by label (app.kubernetes.io/component=keycloak-config-cli), waits for it to Complete, port-forwards svc/keycloak and asserts: 1. /realms/sovereign returns 200 (realm exists in Keycloak's DB). 2. The kubectl OIDC client is provisioned with publicClient=true, redirectUris contains http://localhost:8000 (kubectl-oidc-login default), and the groups client scope is wired with the oidc-group-membership-mapper (the per-Sovereign k3s api-server's --oidc-groups-claim flag depends on this). Acceptance per ticket: if the post-install Job fails, the workflow summary captures Job logs + StatefulSet logs + cluster state via GITHUB_STEP_SUMMARY so a failed run is debuggable without re-running. Triggers are event-driven only per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled" rule — push on the workflow file itself plus workflow_dispatch for ad-hoc re-runs. Closes #462. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
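The client-shape assertions listed in this commit can be expressed as a small check over the client's JSON representation. The Python sketch below is illustrative only: how the workflow actually fetches and inspects the client is not shown in this log, the function name is hypothetical, and looking for `groups` in either `defaultClientScopes` or `optionalClientScopes` is an assumption about where the scope is attached.

```python
def check_kubectl_client(client):
    """Return a list of problems found in a Keycloak client representation.

    `client` is the parsed JSON dict for the kubectl OIDC client.
    An empty list means every check passed.
    """
    problems = []
    if client.get("publicClient") is not True:
        problems.append("publicClient must be true")
    if "http://localhost:8000" not in client.get("redirectUris", []):
        problems.append("redirectUris must contain http://localhost:8000")
    # Assumption: the groups scope may be attached as default or optional.
    scopes = set(client.get("defaultClientScopes", [])) | set(
        client.get("optionalClientScopes", [])
    )
    if "groups" not in scopes:
        problems.append("groups client scope is not attached")
    return problems


if __name__ == "__main__":
    example = {
        "clientId": "kubectl",
        "publicClient": True,
        "redirectUris": ["http://localhost:8000"],
        "defaultClientScopes": ["profile", "groups"],
    }
    print(check_kubectl_client(example))
```

Collecting every problem instead of stopping at the first one suits the stated goal of making a failed run debuggable without re-running it.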
||
|
|
abac00d8b3
|
feat(ci): Phase-8a preflight A — bootstrap-kit reconcile dry-run on kind (closes #459) (#467)
Surfaces Risk-register R4 (docs/omantel-handover-wbs.md §9a: bootstrap-kit reconcile-chain order untested under load) before Phase 8a (#454) burns Hetzner credit on test.omani.works.
New workflow .github/workflows/preflight-bootstrap-kit.yaml:
- kind v0.25.0 + kindest/node:v1.30.6
- Gateway API CRDs v1.2.0 standard channel
- Full Flux controller set (fluxcd/flux2/action@main + flux install)
- Mock Secrets: flux-system/object-storage, flux-system/cloud-credentials, flux-system/ghcr-pull
- Renders clusters/_template/bootstrap-kit/ with SOVEREIGN_FQDN_PLACEHOLDER + ${SOVEREIGN_FQDN} -> test-sov.example.com (matches test harness pattern in tests/e2e/bootstrap-kit/main_test.go:247)
- 30 x 30s HR poll loop, never-fail-fast (goal: surface ALL bugs, not stop at first)
- $GITHUB_STEP_SUMMARY emits Markdown table of every HR's terminal Ready condition + per-HR describe blocks for non-Ready + recent flux-system events + raw hrs.json artefact (14d retention)
- Event-driven only: push on self-edit + workflow_dispatch; no schedule: cron (per CLAUDE.md "every workflow MUST be event-driven")
Canonical seam reused (no duplication):
- kind setup + flux install pattern from .github/workflows/test-bootstrap-kit.yaml
- bootstrap-kit kustomization at clusters/_template/bootstrap-kit/ (the same overlay production Sovereigns consume; substitution shape mirrors tests/e2e/bootstrap-kit/main_test.go:247)
- event-driven shape per .github/workflows/check-vendor-coupling.yaml (#428)
Out of scope (sibling preflights):
- #460 Crossplane provider-hcloud Healthy probe
- #461 Cilium Gateway HTTPRoute admission
- #462 Keycloak realm-import
Validated: actionlint clean, YAML parses cleanly.
WBS row #459 in §9 updated: 🟡 in flight -> 🟢 done (workflow shipped).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
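The "30 x 30s, never-fail-fast" polling idea from this commit can be sketched as below. This is a hedged Python illustration rather than the workflow's shell step: the function name and the `get_statuses` callback (which would wrap something like a HelmRelease listing) are hypothetical, and only the attempt count, the interval, and the never-fail-fast goal come from the commit message.

```python
import time


def poll_until_settled(get_statuses, attempts=30, interval_s=30.0, sleep=time.sleep):
    """Poll until every item is ready or the attempts run out.

    `get_statuses` returns a dict of name -> bool (True means Ready).
    The loop never aborts on a single failing item: it keeps polling so the
    final snapshot shows the state of ALL items, not just the first failure.
    """
    statuses = {}
    for attempt in range(attempts):
        statuses = get_statuses()
        if statuses and all(statuses.values()):
            break  # everything is Ready, no need to keep waiting
        if attempt < attempts - 1:
            sleep(interval_s)  # no sleep after the final attempt
    return statuses


if __name__ == "__main__":
    snapshots = iter([{"bp-a": False, "bp-b": True}, {"bp-a": True, "bp-b": True}])
    final = poll_until_settled(lambda: next(snapshots), interval_s=0.0)
    print(final)
```

Returning the last snapshot, whether or not everything became Ready, leaves the pass/fail decision and the summary table to the caller.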
||
|
|
6f9ee43a9d
|
fix(ci): GHCR auth for bp-crossplane OCI pull in preflight (#460) (#466)
Run 25221515110 surfaced the exact blocking error the workflow was designed to surface — but for the install step, not the Healthy probe: Error: INSTALLATION FAILED: failed to perform "FetchReference" on source: GET "https://ghcr.io/v2/openova-io/bp-crossplane/manifests/1.1.3": ... 401: unauthorized: authentication required bp-crossplane is a PRIVATE GHCR package (verified via `gh api /orgs/openova-io/packages/container/bp-crossplane`). The fix mirrors the canonical seam in .github/workflows/blueprint-release.yaml: add `packages: read` to the job permissions and run `helm registry login ghcr.io` against GITHUB_TOKEN before the `helm install oci://...` step. No new pattern; just reuse. This unblocks the actual goal of #460 — observing provider-hcloud Healthy=True (or surfacing whatever blocks it) on a kind cluster. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
48b73af6ae
|
feat(ci): Phase-8a preflight C — Cilium Gateway HTTPRoute admission on kind (closes #461) (#465)
Surfaces Risk-register R3 (docs/omantel-handover-wbs.md §9a): Cilium Gateway HTTPRoute admission was untested on contabo because contabo runs Traefik (no `cilium-gateway` Gateway present per ADR-0001 §9.4).
This workflow boots a kind cluster, installs upstream Cilium 1.16.5 with `gatewayAPI.enabled=true`, applies the per-Sovereign Gateway shape from `clusters/_template/bootstrap-kit/01-cilium.yaml` (HTTP listener only; TLS is Phase 8a), pulls bp-catalyst-platform:1.1.8 from GHCR, renders its httproute.yaml template with sovereign overlay values, and asserts that `catalyst-ui` and `catalyst-api` HTTPRoutes both reach Accepted=True against the Cilium Gateway.
Anti-duplication: GHCR helm-registry-login mirrors blueprint-release.yaml (lines 173-177); kind+Cilium pattern matches playwright-smoke shape; per-Sovereign Gateway is a 1:1 mirror of the canonical bootstrap-kit slot 01 (HTTP listener), no new shape invented.
Trigger pattern is event-driven per CLAUDE.md: push on this file or the chart templates it validates, plus workflow_dispatch for re-runs. No cron.
Out of scope (Phase 8a/8b): TLS termination, real DNS resolution, backend Deployment health, the 10 leaf bp-* dependencies (which have their own chart-verify smoke runs).
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
48a1623b28
|
feat(ci): Phase-8a preflight B — Crossplane provider-hcloud Healthy on kind (closes #460) (#463)
Surfaces Risk-register R2 (docs/omantel-handover-wbs.md §9a — provider-hcloud Healthy=True never observed). New workflow spins up kind, installs bp-crossplane 1.1.3 from GHCR, applies the EXACT Provider + ProviderConfig shape from infra/hetzner/cloudinit-control-plane.tftpl (#425), waits up to 5 min for Healthy=True, plants a fake hcloud-token Secret in flux-system to match the canonical secretRef, and asserts the ProviderConfig is accepted by the API. Reuses existing seams: - helm/kind-action@v1 pattern from .github/workflows/test-bootstrap-kit.yaml - event-driven trigger shape from .github/workflows/check-vendor-coupling.yaml - canonical Provider/ProviderConfig YAML from infra/hetzner/cloudinit-control-plane.tftpl No schedule: cron (per CLAUDE.md "every workflow MUST be event-driven"). No live Hetzner calls — fake-readonly-token only; real-credential validation is Phase 8a, not this preflight. Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
||
|
|
1e7d1e67c9
|
test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429) (#432)
Phase 8 of the omantel handover (#369) needs an automated E2E that proves DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with zero contabo dependency post-handover. Today this is a SCAFFOLD; when Phase 4/6/7 land, dispatching the new workflow against a live omantel is the entire Phase 8.
Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md):
- tests/e2e/playwright/tests/ ← mirror of sovereign-wizard.spec.ts shape (NOT specs/ as the issue body said; actual repo path is tests/)
- tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries, workers=1, reporter=list): reused as-is
- tests/e2e/playwright/tests/_helpers.ts:reachable(): reused for the pre-flight skip-when-unreachable pattern
- .github/workflows/playwright-smoke.yaml: workflow shape (checkout v4, setup-node v4, npm install, playwright install --with-deps chromium, upload-artifact on failure), mirrored, NOT duplicated
What ships:
- tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests):
  1. sovereign Ready + 23/23 blueprints
  2. all bp-* HelmReleases Ready=True
  3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready")
  4. vendor-agnostic Object Storage (post-#425 canonical secret name flux-system/object-storage, NOT hetzner-object-storage)
  5. dig +trace omantel.omani.works ends at omantel NS, not contabo
  6. zero contabo dependency (omantel /api/healthz keeps returning 200)
  Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset.
- .github/workflows/omantel-e2e-handover.yaml (NEW): workflow_dispatch ONLY (no schedule cron, per CLAUDE.md "every workflow MUST be event-driven, NEVER scheduled"). Inputs let the operator override base URLs at dispatch time.
- docs/omantel-handover-wbs.md: new §10 "Phase 8 acceptance criteria (executable DoD)" with 6 bullets 1:1 with the spec test() blocks; §9 status row added for #429 (🟢 scaffold-shipped).
Local verification:
    cd tests/e2e/playwright && npm install && \
    npx playwright test --list tests/omantel-handover.spec.ts
    → 6 tests listed cleanly
    npx playwright test tests/omantel-handover.spec.ts
    → 6 skipped (env vars unset, expected)
Out of scope (per #425 / #428 territory split):
- internal/hetzner/, infra/hetzner/, platform/velero/chart/, clusters/.../34-velero.yaml: #425's vendor-agnostic sweep
- .github/workflows/check-vendor-coupling.yaml: #428's coupling guard
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com> |
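The self-skip behaviour noted in this commit amounts to checking a fixed set of environment variables before running anything. The sketch below shows that check in Python for illustration; the real spec is a Playwright TypeScript file whose code is not shown here. The function name is hypothetical, and treating any single missing or empty variable as a reason to skip is an assumption.

```python
import os

# Variable names are taken from the commit message.
REQUIRED_ENV = ("OMANTEL_BASE_URL", "OMANTEL_API_BASE", "OPERATOR_BEARER")


def missing_env(env=None):
    """Return the required variables that are unset or empty.

    An empty result means the suite has what it needs to run against a live
    Sovereign; otherwise the caller should skip rather than fail.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        print("skipping: missing " + ", ".join(missing))
    else:
        print("all required variables present")
```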