openova

Author	SHA1	Message	Date
e3mrah	139a620ea7	fix(sovereign-tls): cilium-gateway propagates Hetzner LB annotations via spec.infrastructure (#1889 ) Closes #1885 (TBD-A31). Problem (t28 evidence — A98 + A107 reports, 2026-05-19 00:30Z): `console.t28.omani.works:443` accepts TCP but TLS resets. Inspection: `kubectl get svc -n kube-system cilium-gateway-cilium-gateway` shows type=ClusterIP with no Hetzner LB. Even with the tofu-provisioned `hcloud_load_balancer.main` (infra/hetzner/main.tf:955) carrying 443→30443 service-port at the infra layer, the cluster-side hcloud-CCM has no signal to materialise a parallel Service-level LB for the auto-generated gateway Service — so operators inspecting kubectl see a non-LoadBalancer Service and conclude the LB chain is broken. Fix: Add `spec.infrastructure.annotations` to the Gateway resource. The Gateway-API spec mandates that controllers propagate these annotations to any infrastructure resources they create — in Cilium 1.16+ this means the auto-generated `cilium-gateway-cilium-gateway` Service in kube-system. hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55) then picks the annotations up at Service reconcile time and provisions a Hetzner LB. 
Annotations (mirrors clustermesh-apiserver block in 01-cilium.yaml): - load-balancer.hetzner.cloud/name = <slug>-<region>-gateway - load-balancer.hetzner.cloud/location = <Hetzner DC> - load-balancer.hetzner.cloud/type = lb11 - load-balancer.hetzner.cloud/use-private-ip = "false" (DoD A2 — public IPs always) - load-balancer.hetzner.cloud/disable-private-ingress = "true" - load-balancer.hetzner.cloud/health-check-protocol = tcp - load-balancer.hetzner.cloud/health-check-port = "30443" - load-balancer.hetzner.cloud/health-check-interval = 15s - load-balancer.hetzner.cloud/health-check-timeout = 10s - load-balancer.hetzner.cloud/health-check-retries = "3" Per-region segmentation: SOVEREIGN_FQDN_SLUG + SOVEREIGN_REGION_KEY in the LB name so each multi-region peer's cilium-gateway gets its own public LB (Hetzner LBs are unique-by-name; duplicate-name allocations collapse to the first-created instance, hiding the LB for every subsequent region). Wiring: 3 substitute vars (SOVEREIGN_FQDN_SLUG, SOVEREIGN_REGION_KEY, HCLOUD_LB_LOCATION) threaded into the sovereign-tls Kustomization's postBuild.substitute block. These mirror the same vars already passed to bootstrap-kit's Kustomization for the clustermesh-apiserver LB block in 01-cilium.yaml apiserver.service.annotations, so the configuration boundary is symmetric across the gateway LB and the clustermesh LB. 
Memory rules respected: - A2 (PUBLIC IPs for inter-region): use-private-ip=false - feedback_overlap_provs_dont_serialize_wait (no provisioning gate) - feedback_subagents_inherit_design_system (no new architectural seam, reuses existing Gateway-API + hcloud-CCM contracts) Validation: $ kubectl kustomize clusters/_template/sovereign-tls/ | grep -A 30 'kind: Gateway' → renders all 10 Hetzner LB annotations under spec.infrastructure → ${SOVEREIGN_FQDN_SLUG}/${SOVEREIGN_REGION_KEY}/${HCLOUD_LB_LOCATION} substituted at Flux apply time Acceptance criteria (per issue): - kubectl get svc -n kube-system cilium-gateway-cilium-gateway shows type=LoadBalancer with external IP (after fresh prov + handover) - curl -skI https://console.<fqdn>/ returns HTTP 200 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:50:35 +04:00
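Illustration (not part of the commit): a minimal Python sketch of the annotation set listed in 139a620ea7, assuming the LB name is composed as `<slug>-<region>-gateway` exactly as the message states. The function name and the sample slug, region, and location values are hypothetical; the real values come from the Flux `postBuild.substitute` variables.

```python
def gateway_lb_annotations(slug: str, region: str, location: str) -> dict[str, str]:
    """Build the ten Hetzner LB annotations named in the commit message.

    slug/region/location stand in for SOVEREIGN_FQDN_SLUG,
    SOVEREIGN_REGION_KEY and HCLOUD_LB_LOCATION (hypothetical inputs here).
    """
    prefix = "load-balancer.hetzner.cloud/"
    values = {
        # Region is part of the name so each peer gets its own LB.
        "name": f"{slug}-{region}-gateway",
        "location": location,
        "type": "lb11",
        "use-private-ip": "false",
        "disable-private-ingress": "true",
        "health-check-protocol": "tcp",
        "health-check-port": "30443",
        "health-check-interval": "15s",
        "health-check-timeout": "10s",
        "health-check-retries": "3",
    }
    return {prefix + key: value for key, value in values.items()}


if __name__ == "__main__":
    for key, value in gateway_lb_annotations("t28-omani-works", "fsn1", "fsn1").items():
        print(f"{key}: {value!r}")
```

Because the region key is part of the name, two regions of the same Sovereign yield two distinct LB names, which is the per-region segmentation the message describes.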
hatiyildiz	cd45d074af	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.182 -> 1.4.183 (auto, Refs TBD-A6)	2026-05-19 00:47:50 +00:00
e3mrah	90e30e084c	fix(httproute): omit default sectionName so multi-zone Sovereigns attach via Cilium Gateway hostname matcher (Closes #1884, TBD-A30) (#1888) Pre-1.4.183 the chart pinned every catalyst-system HTTPRoute to `sectionName: https` (via the values.yaml default), but the Cilium Gateway template (clusters/_template/sovereign-tls/cilium-gateway.yaml + infra/hetzner/main.tf locals.parent_domains_listeners_yaml) names HTTPS listeners as follows: - SINGLE parent zone → bare `https` / `http` - MULTIPLE parent zones → unique `https-<sanitised-zone>` / `http-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`) On t28 (omani.works primary + omani.homes SME pool, A107 D29 walk 2026-05-19) every public HTTPRoute reported `Accepted=False NoMatchingListener` and console.<sov> / api.<sov> / marketplace.<sov> / *.<sov> returned 404 / connection-refused. Single-zone Sovereigns were unaffected because their Gateway used bare `https`. Fix (Option C - omit sectionName): default `ingress.gateway.parentRef.sectionName=""` in values.yaml. The existing `{{- with .Values.ingress.gateway.parentRef.sectionName }}` guards in templates/httproute.yaml, templates/services/catalog/httproute.yaml, and templates/sme-services/marketplace-routes.yaml skip the field entirely when empty. Cilium Gateway then matches each route to listeners by hostname filter - every listener has `hostname: *.<zone>`, so `console.<sov-fqdn>` auto-attaches to the listener whose hostname matches (which is precisely the listener whose certificateRef terminates the right wildcard cert).
This is the canonical pattern already in use elsewhere in the codebase: - core/controllers/sandbox/internal/gitops/manifests.go (sandbox) - core/controllers/organization/internal/controller/tenant_route.go (per-Org tenant routes) - products/catalyst/chart/templates/sme-services/tenant-public-routes.yaml Preflight CI (.github/workflows/preflight-cilium-httproute.yaml) explicitly overrides `--set ingress.gateway.parentRef.sectionName=http` because it ships a Gateway with an HTTP-only listener named `http`; that override path is preserved unchanged. helm template render verifies all 5 affected HTTPRoutes (catalyst-ui, catalyst-api, catalyst-catalog, marketplace, tenant-wildcard) now emit a `parentRefs` block with name+namespace only, no `sectionName`. helm lint clean. Chart bumped 1.4.182 -> 1.4.183. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:47:14 +04:00
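Illustration (not part of the commit): a rough Python sketch of why a pinned `sectionName: https` fails on multi-zone Sovereigns while hostname matching succeeds. The sanitisation rule (dots replaced by hyphens) is an assumption inferred from the `https-omani-works` example; the real template may do more. Both function names are hypothetical.

```python
def listener_names(zones: list[str]) -> dict[str, str]:
    """Map each parent zone to its HTTPS listener name.

    Single zone: bare "https". Multiple zones: "https-<sanitised-zone>".
    Sanitisation (dots to hyphens) is assumed from the commit's examples.
    """
    if len(zones) == 1:
        return {zones[0]: "https"}
    return {zone: "https-" + zone.replace(".", "-") for zone in zones}


def matching_listeners(route_host: str, zones: list[str]) -> list[str]:
    """Return listeners whose wildcard hostname (*.<zone>) covers route_host."""
    names = listener_names(zones)
    return [names[zone] for zone in zones if route_host.endswith("." + zone)]


if __name__ == "__main__":
    zones = ["omani.works", "omani.homes"]
    names = listener_names(zones)
    # A route pinned to sectionName "https" finds no listener of that name.
    print("https" in names.values())
    # Without sectionName, the hostname filter selects the right listener.
    print(matching_listeners("console.t28.omani.works", zones))
```

With two zones no listener is named `https`, which corresponds to the `NoMatchingListener` condition seen on t28; dropping the pin lets the wildcard hostname select the listener.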
e3mrah	ab6f3e6510	fix(scripts): scrub stale sovereign-tls from expected-bootstrap-deps.yaml (post #1879 cleanup, fixes dep-graph-audit) (Refs #1871 ) (#1881 ) PR #1875 added `sovereign-tls` to the bp-self-sovereign-cutover dependsOn in both the chart AND scripts/expected-bootstrap-deps.yaml. PR #1879 reverted the chart half (because HelmRelease.dependsOn cannot reference a Flux Kustomization — helm-controller logs "not found", chart parks Stalled, handover never fires). The scripts/expected-bootstrap-deps.yaml half was left behind, so the dep-graph-audit job now fails on origin/main with drift between the declared expectation (`bp-gitea bp-harbor sovereign-tls`) and the chart on disk (`bp-gitea bp-harbor`). Scrub: - Remove `sovereign-tls` from the cutover's depends_on list. - Remove the stale `sovereign-tls` placeholder slot 0t entry (no HR file exists for it — it is a Flux Kustomization). - Replace the obsolete comment block with a short note explaining the PR #1875 / #1879 history so the next reader doesn't re-add it. Verified: `bash scripts/check-bootstrap-deps.sh` -> "OK: bootstrap-kit dependency graph audit PASSED" with Drift: 0, Cycles: 0. Verified: `helm template platform/self-sovereign-cutover/chart` -> exit 0. Refs #1871 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:14:43 +04:00
hatiyildiz	d6e9379b20	deploy(bp-self-sovereign-cutover): lockstep blueprint.yaml spec.version 0.1.31 -> 0.1.32 (auto, Refs TBD-A20, #1856 )	2026-05-19 00:04:44 +00:00
e3mrah	ee4dfedef8	fix(cutover): Step-06 Job waits for Cilium Gateway Programmed=True before HelmRepository URL rewrite (Closes #1871, supersedes #1875) (#1879) PR #1875 added `- name: sovereign-tls` to bp-self-sovereign-cutover.dependsOn to gate the URL rewrite behind Gateway TLS readiness. That fix was unresolvable: Flux HelmRelease.dependsOn can ONLY reference other HelmReleases, but sovereign-tls is a Flux Kustomization. helm-controller verbatim on t27 fresh-prov (A84 empirical test, 2026-05-18): `helmreleases.helm.toolkit.fluxcd.io "sovereign-tls" not found`. bp-self-sovereign-cutover sat forever in dependency-wait, cutover never fired, handover never fired. This commit moves the readiness check INTO the chart: chart 0.1.32 adds a Phase -1 (gateway-wait) at the top of the Step-06 helmrepository-patches Job. The Job polls `gateway.networking.k8s.io/v1.Gateway cilium-gateway` in `kube-system` until status.conditions[Programmed]=True, with a 30 min default deadline. If the Gateway never programs, the Job exits 1 (surfacing the block to the operator) rather than rewriting URLs into a Gateway that won't answer TLS. RBAC: ClusterRole gains gateway.networking.k8s.io/gateways {get,list,watch}. Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml`: - reverts the bad PR #1875 `- name: sovereign-tls` dependsOn entry - bumps chart pin 0.1.31 -> 0.1.32 Tests: cutover-contract Case 20 guards the Phase -1 block + RBAC. helm-template confirms the Phase -1 wait + env (GATEWAY_NAMESPACE=kube-system, GATEWAY_NAME=cilium-gateway, GATEWAY_WAIT_TIMEOUT_SECONDS=1800) renders into the cutover-step-06-helmrepository-patches ConfigMap.podSpec. Closes #1871 Refs #1875 (supersedes) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:04:12 +04:00
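Illustration (not part of the commit): a minimal Python sketch of the Phase -1 gateway-wait logic described in ee4dfedef8, assuming a poll-until-deadline loop. The `fetch` callable, the 10-second poll interval, and the function names are hypothetical; only the `Programmed=True` condition, the 1800-second default, and the exit codes come from the message.

```python
import time
from typing import Callable, Optional


def is_programmed(gateway: dict) -> bool:
    """True when status.conditions contains Programmed with status "True"."""
    conditions = gateway.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Programmed" and c.get("status") == "True"
        for c in conditions
    )


def wait_for_gateway(
    fetch: Callable[[], Optional[dict]],
    timeout_seconds: float = 1800,   # 30 min default from the commit message
    poll_seconds: float = 10,        # assumed interval, not from the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return 0 once the Gateway is Programmed, 1 if the deadline passes."""
    deadline = clock() + timeout_seconds
    while True:
        gateway = fetch()  # hypothetical: returns the Gateway object or None
        if gateway is not None and is_programmed(gateway):
            return 0
        if clock() >= deadline:
            return 1  # surface the block instead of rewriting URLs
        sleep(poll_seconds)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting; a caller would pass a `fetch` that reads the `cilium-gateway` object in `kube-system` and use the return value as the process exit code.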
e3mrah	366d5d2b33	docs(principles): clarify #14 — HelmRelease.dependsOn cannot reference Kustomizations (empirical t27 finding) (#1878 ) A84 empirical finding (t27 / PR #1875): HelmRelease.spec.dependsOn strictly references OTHER HelmReleases — it cannot reference Flux Kustomizations or other resource kinds. PR #1875 added the `sovereign-tls` Kustomization to a HelmRelease's dependsOn; helm-controller logged `helmreleases "sovereign-tls" not found` and retried every 30s forever. Adds a critical sub-rule to principle #14 documenting the cross-kind limitation, the recommended workaround (wait-HelmRelease shim or move the gated workload into a Kustomization), and the verbatim helm-controller error message so the next regression is greppable. Doc-only. Co-authored-by: hatiyildiz <claude@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 04:00:25 +04:00
hatiyildiz	2e1826abb4	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.181 -> 1.4.182 (auto, Refs TBD-A6)	2026-05-18 23:49:51 +00:00
github-actions[bot]	5a25c254a1	deploy: update sme service images to `5d5c557` + bump chart to 1.4.182	2026-05-18 23:49:14 +00:00
e3mrah	5d5c55739e	fix(notification): retry-backoff on Stalwart 503 5.5.1 rate-limit (#1876 ) When Stalwart trips its rate-limit and returns "503 5.5.1", the notification service previously surfaced the error immediately to the events consumer, which kept hammering on the next event and prolonged the rate-limit window. Now Mailer.Send detects 503 5.5.1 specifically (via textproto.Error unwrap + canonical-code substring fallback) and retries up to 3 times with a 60s backoff between attempts. The backoff is configurable via SMTP_RETRY_BACKOFF env var (Go duration string OR bare integer seconds; 30s floor to keep the rate-limiter happy). Non-rate-limit errors (auth failure, transient I/O, etc.) bubble up unchanged so the consumer can NACK / dead-letter as appropriate. Adds smtp_test.go covering: - single rate-limit -> retry -> success - exhausted retries -> wrapped error preserving textproto.Error - non-rate-limit error -> immediate pass-through, no backoff - isRateLimit detection (textproto, multiline 503-5.5.1, negative cases) - parseRetryBackoff env-var forms + 30s floor + zero/garbage fallbacks No credential touches: this is a retry-hardening fix only; the chart-side SMTP creds path is already GREEN (see #1793 A80 diagnosis). Refs #1793 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:47:58 +04:00
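Illustration (not part of the commit): a Python sketch of the retry behaviour described in 5d5c55739e. The service itself is Go; this is only a model of the stated rules. Assumptions: values below the 30s floor are raised to 30s, zero or unparseable values fall back to the 60s default, "up to 3 times" means three retries after the first attempt, and the duration parser handles only `h`, `m`, `s`, and `ms` units. All names are hypothetical.

```python
import re
import time

DEFAULT_BACKOFF_SECONDS = 60.0  # backoff stated in the commit message
FLOOR_SECONDS = 30.0            # floor stated in the commit message
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_TERM = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")


def parse_retry_backoff(raw: str) -> float:
    """Parse a bare integer (seconds) or a simplified Go-style duration."""
    raw = raw.strip()
    if raw.isascii() and raw.isdigit():
        seconds = float(raw)
    else:
        seconds, pos = 0.0, 0
        for term in _TERM.finditer(raw):
            if term.start() != pos:      # junk between terms
                return DEFAULT_BACKOFF_SECONDS
            seconds += float(term.group(1)) * _UNIT_SECONDS[term.group(2)]
            pos = term.end()
        if pos == 0 or pos != len(raw):  # nothing parsed, or trailing junk
            return DEFAULT_BACKOFF_SECONDS
    if seconds <= 0:
        return DEFAULT_BACKOFF_SECONDS
    return max(seconds, FLOOR_SECONDS)


class SMTPError(Exception):
    def __init__(self, code: int, message: str):
        super().__init__(f"{code} {message}")
        self.code, self.message = code, message


def is_rate_limit(err: SMTPError) -> bool:
    """Only 503 with enhanced status 5.5.1 counts as a rate limit."""
    return err.code == 503 and "5.5.1" in err.message


def send_with_retry(send, retries=3, backoff=DEFAULT_BACKOFF_SECONDS, sleep=time.sleep):
    """Retry rate-limit errors; let every other error propagate at once."""
    for attempt in range(retries + 1):
        try:
            return send()
        except SMTPError as err:
            if not is_rate_limit(err) or attempt == retries:
                raise
            sleep(backoff)
```

Passing `sleep` as a parameter lets a test record the backoff intervals instead of waiting for them.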
e3mrah	3b4c130129	fix(bootstrap-kit): cutover dependsOn sovereign-tls — wait for Gateway TLS before HelmRepository URL rewrite (Closes #1871 ) (#1875 ) TBD-A24 cutover↔gateway circular deadlock — discovered on t26 zero-touch prov 2026-05-18 (99bb823cb0513f4b): 1. bp-catalyst-platform HR installs at v1.4.179 (Ready=True) 2. bp-self-sovereign-cutover HR Ready=True (deps gitea+harbor only) 3. Step-06 rewrites all 50 HelmRepository URLs ghcr.io → registry.<fqdn> 4. bp-catalyst-platform flips Ready=False (TLS handshake EOF — no Gateway) 5. sovereign-tls Kustomization blocked on bootstrap-kit Ready=True 6. bootstrap-kit blocked on bp-catalyst-platform Ready=True 7. Full deadlock — no Gateway, no handover, every UI route 404 Fix: add `sovereign-tls` as a third dependsOn entry on the cutover HR so Flux waits for the Cilium Gateway to be serving TLS before the URL rewrite fires. Same architectural shape as Wave 7 bp-hcloud-csi removal (#1610) — chicken-and-egg between bootstrap-kit and sovereign-tls broken by ordering the dangerous-side-effect chart AFTER the Gateway is ready. Also updates scripts/expected-bootstrap-deps.yaml so the dep-graph audit (check-bootstrap-deps.sh) recognises the new edge: slot 6a gets the extra `sovereign-tls` entry, plus a new "slot 0t" entry declaring sovereign-tls as a known node (no HR file on disk → audit reports it as `deferred`, info not error; Phase 4 cycle detection accepts it as a zero-in-degree root). Verified locally: - yq parses spec.dependsOn → 3 entries (bp-gitea, bp-harbor, sovereign-tls) - scripts/check-bootstrap-deps.sh: 50 present, 65 declared, 0 drift, 0 cycles - helm template platform/self-sovereign-cutover/chart: exit 0 (smoke OK) Refs: t26 ID 99bb823cb0513f4b, A55 diagnostic, A67 diagnosis, slot 17a comment in clusters/_template/bootstrap-kit/kustomization.yaml documenting the same chicken-and-egg shape. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:19:55 +04:00
e3mrah	06bea550ff	feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin (#1874 ) The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml, bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish commits land on main, but it cannot detect the "chart bumped but never published" failure mode: the bootstrap-kit pin points at a chart version that GHCR never received because blueprint-release.yaml failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep, runner cancellation, transient GHCR push 5xx). Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180 and 1.4.181 were "lost" during the TBD-A20 scanner break window (21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist until A58 manually re-fired the workflow via dispatch. Fresh Sovereigns silently fell back to the last working tag. What this adds - scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and optional `--ghcr-org <org>`). For every chart pinned in the kit, it lists ghcr.io/<org>/<chart> tags via `gh api /orgs/<org>/packages/container/<chart>/versions --paginate`, then asserts the pinned version appears. Exits 1 on any missing tag. - A per-chart tag cache avoids redundant paginations. - .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now passes `--check-ghcr` on `push` to main + `workflow_dispatch` (PR mode stays `--changed-only` and skips GHCR — PRs cannot publish to GHCR anyway). The job stays `continue-on-error: true` under the same observational umbrella as the existing post-merge full sweep so a transient API blip cannot red-flag every chart bump; the missing-tag list still surfaces on the run summary for operator attention. - Job grants `packages: read` so the workflow GITHUB_TOKEN can list private package versions. 
Verification (origin/main snapshot, 2026-05-19) - Full sweep default: 50/50 chart→pin pairs OK, no GHCR check. - Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags present — PASS exit 0. - Negative test: with products/catalyst/chart/Chart.yaml + slot 13 both set to a non-existent 99.99.99, the script exits 1 with `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the remediation hint pointing at `gh workflow run blueprint-release.yaml`. - `--changed-only --base origin/main` against a no-change tree: clean exit 0 with the existing "nothing to check" message. Refs #1872, #1864, #1856. Closes #1872 Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:12:13 +04:00
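Illustration (not part of the commit): a small Python sketch of the `--check-ghcr` idea from 06bea550ff, assuming the check reduces to "the pinned version must appear in the chart's tag list" with a per-chart cache. The real script is shell and calls `gh api`; here `list_tags` is a hypothetical injected callable and the sample tag data is invented for the example.

```python
from typing import Callable, Iterable


def find_missing_tags(
    pins: Iterable[tuple[str, str]],
    list_tags: Callable[[str], list[str]],
) -> list[str]:
    """Return "chart:version" for every pin whose tag is not published.

    list_tags is a stand-in for the registry lookup; it is called at most
    once per chart thanks to the cache.
    """
    cache: dict[str, set[str]] = {}
    missing: list[str] = []
    for chart, version in pins:
        if chart not in cache:
            cache[chart] = set(list_tags(chart))
        if version not in cache[chart]:
            missing.append(f"{chart}:{version}")
    return missing


if __name__ == "__main__":
    published = {
        "bp-catalyst-platform": ["1.4.180", "1.4.181"],
        "bp-guacamole": ["0.1.25"],
    }
    pins = [("bp-catalyst-platform", "99.99.99"), ("bp-guacamole", "0.1.25")]
    missing = find_missing_tags(pins, lambda chart: published.get(chart, []))
    for entry in missing:
        print(f"GHCR MISS {entry}")
    raise SystemExit(1 if missing else 0)
```

A non-empty result maps to exit code 1, mirroring the negative test in the commit where a pin to a non-existent 99.99.99 fails the audit.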
e3mrah	a7cd2fc21f	docs(principles): add 3 session-2026-05-18 principles (validate-vs-origin / GHCR-tag-check / cutover-dependsOn-Gateway) (#1873 ) Adds three new inviolable principles surfaced by 2026-05-18 incidents: - #12 Never validate against the local working tree — A19 false-positive (verifier grepped a feature-branch working copy with unstaged edits, reported "already on main" when it was not). - #13 Chart-pin bumps must match a GHCR tag that exists — TBD-A48 / PR #1869 drift: pin to bp-self-sovereign-cutover:0.1.4 landed on main while the chart artifact had not been published, causing hours of ImagePullBackOff. - #14 Cutover-style HRs that rewrite HelmRepository URLs must dependsOn Gateway readiness — TBD-A24 / PR #1871: bp-self-sovereign-cutover flipped URLs to local registry before Cilium Gateway was serving TLS, deadlocking the cluster. Doc-only change; bumps the front-matter Updated date to 2026-05-18. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:09:26 +04:00
hatiyildiz	26e4c8e30e	deploy(bp-guacamole): bump bootstrap-kit pin 0.1.25 -> 0.1.26 (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version 0.1.25 -> 0.1.26 (Refs TBD-A20, #1856).	2026-05-18 22:20:35 +00:00
github-actions[bot]	8ce7c02aa9	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.26	2026-05-18 22:19:59 +00:00
e3mrah	1b87d38e94	deploy: catch-up pins for bp-catalyst-platform 1.4.181 + bp-guacamole 0.1.25 (post #1866 fix) (#1869 ) Catch-up for drift introduced during the Blueprint Release workflow outage 21:04:22Z (PR #1858 merge with YAML scanner break) → 22:07:49Z (PR #1866 fix). Charts published in that window: - bp-catalyst-platform 1.4.180 → 1.4.181 (umbrella) - bp-guacamole 0.1.24 → 0.1.25 Auto-bump-pin step didn't fire during the outage. A39 already caught up bp-newapi (PR #1865). This PR catches up the remaining 2. Refs #1864, PR #1866 (workflow fix), PR #1858 (root cause). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 02:19:34 +04:00
hatiyildiz	66fa508b74	deploy(bp-newapi): bump bootstrap-kit pin 1.4.21 -> 1.4.22 (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version 1.4.21 -> 1.4.22 (Refs TBD-A20, #1856).	2026-05-18 22:11:05 +00:00
e3mrah	22e046b554	Merge pull request #1866 from openova-io/fix/1864-workflow-yaml-startup-failure fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)	2026-05-19 02:07:48 +04:00
hatiyildiz	69f2d7d91a	fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864) Root cause of the auto-bump-pin miss flagged in #1864. The Blueprint Release workflow has been in `startup_failure` since PR #1858 (commit `cf35b4a`) merged at 21:04:22Z. The lockstep step's multi-line shell heredoc inside a `run: |` block-scalar: if [ ... ]; then msg="deploy(...) (auto, Refs TBD-A6) <-- literal blank line Also locksteps platform blueprint.yaml ..." <-- column 1, no indent is interpreted by the YAML scanner as the END of the block-scalar at the blank line, and the next column-1 line is then parsed as a new top-level mapping key, which fails because the previous mapping isn't terminated. The whole workflow file is rejected at workflow-startup time. Verified with `python3 -c yaml.safe_load(...)` (raises `ScannerError: could not find expected ':' line 815`) and by `gh api .../actions/runs/26060392136` returning `conclusion=failure, status=completed, jobs: []` for every push since `cf35b4a`. Consequence: no chart bump since `cf35b4a` has triggered the TBD-A6 auto-bump-pin or the TBD-A20 blueprint.yaml lockstep. PR #1865 was the manual catch-up for bp-newapi (1.4.20 -> 1.4.21); without this fix every future chart publish will drift the same way. Fix: build the multi-line commit message with `printf '%s\n\n%s'` so the string source stays on physically-indented lines that the YAML block-scalar accepts. Behaviour is identical (same commit subject, same blank line, same body); only the construction shape changes. Added a 9-line comment naming the seam so future authors don't reintroduce the same trap. Verified locally: * `python3 -c yaml.safe_load(open(...))` succeeds, parses 24 build-job steps.
* `CHART_NAME=bp-newapi PREV_VERSION=1.4.20 CHART_VERSION=1.4.21 BP_PREV_VERSION=1.4.20 bash -c "$(printf ...)"` emits the canonical "deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (auto, Refs TBD-A6)\n\nAlso locksteps platform ..." body. Refs #1864. Refs PR #1858 (TBD-A20 lockstep that introduced the YAML defect).	2026-05-19 00:07:07 +02:00
github-actions[bot]	c64220f8cc	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.22	2026-05-18 22:05:58 +00:00
e3mrah	1e1fe26e02	Merge pull request #1865 from openova-io/fix/1864-bp-newapi-pin-catchup deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race)	2026-05-19 02:05:33 +04:00
hatiyildiz	f57f62764b	deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race) Closes #1864 Manual catch-up. The auto-bump-pin step (TBD-A6) did NOT run for the 1.4.20 -> 1.4.21 chart bump at commit `8b33188` because the Blueprint Release workflow has been stuck in startup_failure since PR #1858 (commit `cf35b4a`) merged at 21:04:22Z. The workflow YAML at .github/workflows/blueprint-release.yaml lines 812-814 has a multi-line heredoc string inside a `run: |` block-scalar whose continuation lines are unindented: msg="deploy(${CHART_NAME}): bump bootstrap-kit pin ${PREV_VERSION} -> ... (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version ${BP_PREV_VERSION} ..." YAML treats the unindented line as the end of the block-scalar and the next line as a new mapping key (which it isn't), so the entire workflow file fails the GitHub Actions YAML validator at workflow-start time. Every push since `cf35b4a` has produced a run with `conclusion=failure, status=completed, jobs=[]` (zero jobs spun up). Evidence: * gh api repos/openova-io/openova/actions/runs/26060392136 -> 'This run likely failed because of a workflow file issue.' * Same for every subsequent run including the chart 1.4.21 publish (no run was even created for `8b33188` because the workflow file couldn't parse). * `python3 -c 'yaml.safe_load(open(...))'` raises `ScannerError ... could not find expected ':' line 815`. This PR is the ONE-LINE catch-up so the pin drift is closed. A companion PR fixes the workflow YAML so future chart bumps auto-bump the pin again.	2026-05-19 00:04:40 +02:00
github-actions[bot]	6b11734a81	deploy: update sme service images to `4a61543` + bump chart to 1.4.181	2026-05-18 21:48:56 +00:00
e3mrah	4a61543957	test(tenant): wire round-trip for tenant.created owner_email contract (#1863 ) Verifies the publisher-side wrapper struct in CreateOrg (handlers.go:248-252) marshals to bytes the provisioning consumer in organization_create.go can decode flat with owner_email as a sibling field. Pairs with TestHandleTenantCreated_FullTenantStructDecode on the consumer side — together they pin BOTH ends of the contract so a refactor that nests under "tenant" or renames the tag fails in CI rather than at staging. Refs #1829 (D29). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 01:47:38 +04:00
github-actions[bot]	de86df1126	deploy: bump sandbox-controller image to `a405572`	2026-05-18 21:46:29 +00:00
github-actions[bot]	a09445482b	deploy: bump sandbox-mcp-server image to `a405572`	2026-05-18 21:44:53 +00:00
github-actions[bot]	4fd3aae99b	deploy: bump application-controller image to `a405572`	2026-05-18 21:44:52 +00:00
e3mrah	a40557227e	fix(controllers): NATS consume-leg for D35 (organization + sandbox) (#1862 ) PR #1626 wired the publish-leg (tenant + billing → NATS JetStream catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow even though the publish leg shipped. This PR adds: - core/controllers/pkg/natsbus: minimal JetStream subscriber shared by Group-C controllers. Self-contained (no dep on core/services/shared which pulls in franz-go/Kafka the controllers never touch). - core/controllers/organization/internal/controller/nats_bridge.go: subscribes to catalyst.tenant.created + catalyst.billing.order.placed, patches openova.io/last-event-observed-at + ...-subject annotations on the matching Organization CR. The annotation patch triggers an informer event → controller-runtime enqueues Reconcile within ~50ms instead of waiting for the 30s requeue fallback. - core/controllers/sandbox/internal/controller/nats_bridge.go: same pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR using the same `sandbox-<sanitised-email>` naming convention tenant-service's SandboxOrchestrator (PR #1633) writes under. - main.go wiring in both controllers reads NATS_URL from env. Unset = log "consume-leg disabled" + continue (informer requeue fallback intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS is an accelerator, not the only path. Idempotency: ev.Timestamp is the broker-side time stamp, so duplicate JetStream delivery produces a byte-stable annotation patch and controller-runtime does NOT enqueue a redundant Reconcile. Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the happy path, the no-matching-CR soft miss, duplicate-envelope no-churn, malformed JSON poison-pill, and the publish-side ↔ consume-side name derivation lockstep for Sandbox CRs. 
HARD CONSTRAINT respected: no credential mutations — bridges read only the envelope + the target CR, never Secrets or Keycloak SA creds. Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:43:08 +04:00
github-actions[bot]	376cd7d14c	deploy: update catalyst images to `c86fb3d`	2026-05-18 21:36:21 +00:00
e3mrah	c86fb3d1dc	fix(catalyst-api): seed full 4-entry .omani.X sme-pool (D30 / #1830 ) (#1861 ) LoadSMETenantParentDomainsFromEnv's hardcoded-fallback only seeded 2 entries (omani.works + omani.trade), but the marketplace UI (core/marketplace/src/components/AddonsStep.svelte) lists 4 (omani.homes + omani.rest + omani.trade + omani.works) and core/services/domain/store.AllowedTLDs has the same canonical 4. Result: a customer picking .omani.homes or .omani.rest in /addons sailed through the picker but got 422 invalid-parent-domain at catalyst-api signup because FindParentDomain didn't recognise the TLD. This widens the seed to all 4 canonical .omani.X entries so the backend pool, the marketplace picker, and AllowedTLDs all agree. NSFlipReady=true on every entry (the zones are already delegated to the Sovereign's PowerDNS at gTLD level — pdmFlipNS short-circuits via nsAlreadyMatches for Day-2 re-adds). Updated TestLoadSMETenantParentDomainsFromEnv_StubFallback (`pool != 4`) and added 3 fresh tests in sovereign_parent_domains_test.go covering: canonical 4-entry seed, OTECH primary + 4 sme-pool composition, env-override path without fallback leakage. Closes #1830 (Part 1 — Day-1 pool seed). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 01:34:24 +04:00
github-actions[bot]	a6b5752391	deploy: update sme service images to `b214566` + bump chart to 1.4.180	2026-05-18 21:28:12 +00:00
e3mrah	b214566c1a	fix(provisioning): create Organization CR on tenant.created (C16 root cause) (#1860 ) Closes the voucher-checkout → Organization-CR loop that was missing from the convergence chain. Before this PR the flow stalled at: voucher accept → tenant-service CreateOrg → writes Tenant row, publishes tenant.created ↓ (DROP — no consumer) provisioning consumer switch (case "tenant.created" missing — A26 verifier pinpointed this) ↓ organization-controller has nothing to reconcile ↓ no vCluster / Keycloak group / Gitea org / per-tenant HTTPRoute A26 verifier on t22: zero Organization CRs after 168min despite the tenant row existing. Closes #1722. Unblocks D29 zero-touch tenant provisioning (Refs #1829). Changes: - core/services/tenant/handlers/handlers.go Enrich tenant.created payload with owner_email from JWT claims so the provisioning consumer can mint the Organization owner roster without a second store round-trip. Wrapper struct embeds *Tenant so existing decoders are wire-compatible. - core/services/provisioning/handlers/consumer.go Add case "tenant.created" to the dispatch switch. - core/services/provisioning/handlers/organization_create.go New handler. Validates slug + owner_email, builds cluster-scoped Organization CR (apiVersion orgs.openova.io/v1), POSTs via k8sRequest. Idempotent on 409 AlreadyExists (NATS redelivery safe). 404 → operator-misconfiguration error event. 5xx → return err so broker redelivers. Inviolable Principle #4: parent domain flows env → Handler.TenantParentDomain → CR (with per-tenant parent_domain payload override for multi-pool Sovereigns). - core/services/provisioning/handlers/organization_create_test.go Unit tests: malformed payload, invalid slug (incl. path-traversal), missing owner_email, full Tenant decode, default-fill paths, empty parent domain mints anyway, payload-shape pinning. All exercised with KUBERNETES_SERVICE_HOST scrubbed so no real apiserver dial. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:26:57 +04:00
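Illustration (not part of the commit): a Python sketch of the response handling that b214566c1a describes for the Organization CR POST. Only the 409, 404, and 5xx rules come from the message; treating 2xx as created and any other status as a non-retryable rejection are assumptions, as are all names below.

```python
from enum import Enum


class Outcome(Enum):
    CREATED = "created"
    ALREADY_EXISTS = "already-exists"   # 409: redelivery-safe no-op
    MISCONFIGURED = "misconfigured"     # 404: operator-misconfiguration event
    RETRY = "retry"                     # 5xx: return an error so the broker redelivers
    REJECTED = "rejected"               # assumed bucket for any other status


def classify_create_response(status: int) -> Outcome:
    """Map the apiserver HTTP status of the Organization POST to an outcome."""
    if 200 <= status < 300:
        return Outcome.CREATED
    if status == 409:
        return Outcome.ALREADY_EXISTS
    if status == 404:
        return Outcome.MISCONFIGURED
    if status >= 500:
        return Outcome.RETRY
    return Outcome.REJECTED


def should_redeliver(outcome: Outcome) -> bool:
    """Only transient server errors are handed back to the broker."""
    return outcome is Outcome.RETRY
```

Treating 409 as success is what makes a duplicate `tenant.created` delivery harmless: the second POST lands on the already-existing CR and the message is acknowledged instead of looping.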
github-actions[bot]	2457926e16	deploy: update catalyst images to `7ec73f9`	2026-05-18 21:16:51 +00:00
e3mrah	7ec73f9e2b	fix(catalyst-api): handler test baseline GREEN — 6 failing tests fixed (Closes #1853 ) (#1859 ) Per-test root cause + fix: 1. TestPinIssue_ConcurrentRapidFireRateLimit (TOCTOU race) — pinStore.canIssue and put() ran under separate mutex acquisitions; three concurrent /pin/issue goroutines all observed "no entry", passed canIssue, then raced EnsureUser against Keycloak. Replaced with atomic tryReserve() that check-and-stamps under a single lock; HandlePinIssue calls store.drop(email) on EnsureUser/generatePin/no-KC failure to roll back the reservation so the 60s cooldown doesn't punish operator retries. 2. TestFinaliseHandover_FullFlow — test fixture drift after PR #1487 keyed the tofu workdir by DeploymentID (provisioner.workdirKey). Test still wrote workdir at filepath.Join(tmp, "tenant-y-omani-works") (the legacy sovereign-name slug); FinaliseHandover handler uses `id`. Updated test to write workdir at filepath.Join(tmp, "dep-full") so it matches the actual prod lookup path. Same fix for the receiver-failure sibling test. 3. TestEnsureOwnerUserAccess_CreatesCanonicalCR — drifted twice: (a) test queried Namespace("") but the t134 D21 fix moved the CR to userAccessOwnerNamespace ("catalyst-system") because useraccesses is namespaced per the XRD claimNames block; (b) test asserted spec.applications = [{app:"", role:"admin"}] but the t135 D21 fix switched to spec.tierRoleRef = "openova:tier-owner" (XRD pattern rejects `app: ""`). Updated test to query catalyst-system namespace and assert tierRoleRef + applications-must-be-absent. 4. TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty — production unstructuredToUserAccess left Spec.Applications=nil when the CR has no spec.applications, which json-marshals to `null` and crashes the React UI's items.map() (qa-loop iter-4 users-page-null-map regression). Initialize Spec.Applications = []userAccessAppGrantBody{} in the struct literal so the empty-slice contract is preserved. 5. 
TestHandleWhoami_PinSessionRBACClaims — whoamiInjectTierRoles unconditionally appended every inherited tier role even when the upstream JWT already shaped the role list authoritatively. A PIN-minted session carrying tier=owner + realm_access=[catalyst-owner] was getting fanned out to all 5 inheritance entries, which the route-guard couldn't reconcile. Now: if the operator's own catalyst-<tier> role is already present, the projection returns early and preserves the upstream list. TestHandleWhoami_ProjectsTierToRealmRoles still passes (empty input → still injects inheritance) and TestWhoamiInjectTierRoles_PreservesExistingRoles still passes (idempotent — same input out). 6. TestHandleWhoami_NoRBACOmitsFields — whoamiResponse.RealmAccess was a struct value with `omitempty`, which encoding/json does NOT honour for structs (only pointers/slices/maps until Go 1.24's `omitzero`). A pre-RBAC session always serialized realm_access:{} on the wire, breaking the legacy {email,sub,verified} contract. Changed to *whoamiRealmAccess so omitempty actually drops the field; HandleWhoami only allocates the pointer when claims carry roles, and drops it back to nil if the projection ended up empty. Test status after fix (worktree off origin/main): - All 6 target tests PASS - Full TestPin, TestHandleWhoami, TestWhoamiInjectTierRoles, TestEnsureOwnerUserAccess, TestOwnerUserAccessName, TestListUserAccess, TestFinaliseHandover, TestUnstructuredToUserAccess* PASS (57 tests) - go test ./... -p 1 across the entire catalyst-api module PASS Pre-existing parallelism flakes (TestGetKubeconfig_ReadsFromPathPointer / TestPhase1Started_GuardPreventsDoubleWatch / TestPodRestart_*) exist on baseline too — write to /var/lib/catalyst/ from a goroutine that outlives test scope. Out of scope for this PR; tracked separately. Closes #1853 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:14:38 +04:00
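Items 4 and 6 of the commit above both turn on how encoding/json treats zero values. The following is a minimal sketch of that behaviour, not the catalyst-api code: the types `realmAccess`, `byValue`, `byPointer`, and `grants`, and the helper `encode`, are hypothetical stand-ins for the real structs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// realmAccess is a hypothetical stand-in for the realm_access payload.
type realmAccess struct {
	Roles []string `json:"roles,omitempty"`
}

// byValue holds the struct by value: omitempty has no effect on a struct field.
type byValue struct {
	Email       string      `json:"email"`
	RealmAccess realmAccess `json:"realm_access,omitempty"`
}

// byPointer holds a pointer: omitempty drops the field when the pointer is nil.
type byPointer struct {
	Email       string       `json:"email"`
	RealmAccess *realmAccess `json:"realm_access,omitempty"`
}

// grants contrasts a nil slice (encodes as null) with an empty one (encodes as []).
type grants struct {
	Applications []string `json:"applications"`
}

// encode marshals v and panics on error; acceptable for a sketch only.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(byValue{Email: "a@b"}))            // {"email":"a@b","realm_access":{}}
	fmt.Println(encode(byPointer{Email: "a@b"}))          // {"email":"a@b"}
	fmt.Println(encode(grants{}))                         // {"applications":null}
	fmt.Println(encode(grants{Applications: []string{}})) // {"applications":[]}
}
```

The pointer form keeps the legacy wire shape for sessions without roles, and the explicit empty slice gives a JavaScript consumer an array it can call `.map()` on.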
github-actions[bot]	de53b39d13	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.25	2026-05-18 21:05:55 +00:00
github-actions[bot]	8b33188019	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.21	2026-05-18 21:04:48 +00:00
e3mrah	cf35b4a9b6	fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856 ) (#1858 ) A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux, openbao, keycloak, gitea) where blueprint.yaml spec.version had silently fallen behind chart/Chart.yaml version, breaking TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart publish — never the upstream platform/<bp>/blueprint.yaml. This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml spec.version whenever Chart.yaml version bumps. Both file edits land in the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint lockstep). Idempotent reset-and-rewrite retry preserved for the existing parallel-matrix race case. Workflow changes (.github/workflows/blueprint-release.yaml): * New step `bump_blueprint` after `bump_pin` — locates ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml (handles both platform-leaf and products-umbrella conventions), filters to kind:Blueprint (defensive against CRD yaml at the products/catalyst/chart/crds path), reads current spec.version at 2-space indent, sed-rewrites to CHART_VERSION, verifies post-write. * Commit step renamed to "Commit + push bootstrap-kit pin bump + blueprint.yaml lockstep"; stages both files, single commit, with convergent retry on conflict. * Summary block surfaces both bumps separately. Regression test (tests/e2e/bootstrap-kit/main_test.go): * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks platform/* and products/, discovers every Blueprint manifest with a sibling Chart.yaml, asserts spec.version == Chart.yaml version. Covers ALL ~70 blueprints, not just the canonical 10 kit ones the existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates. 
Failure messages name the file, drift direction, and the exact sed command to fix — drift remediation is mechanical. Drift cleanup (mandatory companion, same shape as A17/#1855): 26 Application-Blueprint blueprints whose spec.version had been left at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to Chart.yaml as authoritative. All currently surface in the new sweep test; without the cleanup the test would block this PR (and every subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook, cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores], falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir, netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover, trivy, valkey, velero, vpa, products/dmz-vcluster. After this lands, the next chart-version bump in any platform/<bp>/ folder auto-converges all three artifacts (Chart.yaml, blueprint.yaml, bootstrap-kit pin) in a single bot commit. No more manual collector PRs; no more silent drift between chart and Blueprint manifest. Closes #1856. Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:04:22 +04:00
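The lockstep assertion described in the commit above (spec.version in blueprint.yaml must equal version in Chart.yaml) could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in tests/e2e/bootstrap-kit/main_test.go: it scans lines instead of parsing YAML, and the function names and sample manifests are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// clean strips surrounding whitespace and quotes from a scalar value.
func clean(s string) string { return strings.Trim(strings.TrimSpace(s), `"'`) }

// chartVersion returns the top-level "version:" value of a Chart.yaml text.
// "appVersion:" and indented dependency versions do not match the prefix.
func chartVersion(chartYAML string) string {
	for _, line := range strings.Split(chartYAML, "\n") {
		if strings.HasPrefix(line, "version:") {
			return clean(strings.TrimPrefix(line, "version:"))
		}
	}
	return ""
}

// blueprintSpecVersion returns the "version:" value found at 2-space indent
// directly under the top-level "spec:" key.
func blueprintSpecVersion(blueprintYAML string) string {
	inSpec := false
	for _, line := range strings.Split(blueprintYAML, "\n") {
		switch {
		case strings.TrimRight(line, " \r") == "spec:":
			inSpec = true
		case line != "" && line[0] != ' ' && line[0] != '#':
			inSpec = false // another top-level key closes the spec block
		case inSpec && strings.HasPrefix(line, "  version:"):
			return clean(strings.TrimPrefix(line, "  version:"))
		}
	}
	return ""
}

// checkLockstep reports drift between the two manifests, treating
// Chart.yaml as authoritative as the commit message describes.
func checkLockstep(chartYAML, blueprintYAML string) error {
	cv, bv := chartVersion(chartYAML), blueprintSpecVersion(blueprintYAML)
	if cv == "" || bv == "" {
		return fmt.Errorf("missing version: chart=%q blueprint=%q", cv, bv)
	}
	if cv != bv {
		return fmt.Errorf("drift: blueprint spec.version %s != Chart.yaml version %s", bv, cv)
	}
	return nil
}

func main() {
	// Sample manifests are made up for the demonstration.
	chart := "apiVersion: v2\nname: bp-velero\nversion: 1.2.2\n"
	stale := "kind: Blueprint\nmetadata:\n  name: velero\nspec:\n  version: 1.2.1\n"
	fmt.Println(checkLockstep(chart, stale))
	fmt.Println(checkLockstep(chart, strings.Replace(stale, "1.2.1", "1.2.2", 1)))
}
```

A real sweep would walk the directory tree and use a YAML parser; the line scan here only mirrors the "2-space indent" convention the commit mentions.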
e3mrah	2484c8a3de	fix(bp-velero): bump 1.2.1 -> 1.2.2 to force a publish (Closes #1799 ) (#1846 ) TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image to cadc7b5") which never triggered the blueprint-release workflow. As a result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck InProgress and the bootstrap-kit kustomization fails its health check. GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via `/orgs/openova-io/packages/container/bp-velero/versions`. Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate stays GREEN) so blueprint-release.yaml fires on this push, publishes `ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1 subchart, same templates, same values. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:43:13 +04:00
hatiyildiz	9975e057da	deploy(bp-newapi): bump bootstrap-kit pin 1.4.19 -> 1.4.20 (auto, Refs TBD-A6)	2026-05-18 20:38:15 +00:00
github-actions[bot]	9982dcafa8	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.20	2026-05-18 20:37:26 +00:00
e3mrah	3d0c96a237	fix(bp-newapi): single-pod DB migration via startupProbe (Closes #1798 ) (#1857 ) newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay 30s + period 10s + failureThreshold 3 = ~50s ceiling) SIGKILLs the binary mid-migration on every restart. The 28-CREATE-TABLE + 2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with sslmode=require — well over the kill window. On t22 chart 1.4.18 the `newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff restarts because every kill happened before the GORM connection pool's first wire write completed (pg_stat_activity on the CNPG primary showed no newapi-user connections). Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2): [SYS] ... database migration started ← last log line exitCode=2 finishedAt-startedAt = 50s exactly Readiness probe: connect: connection refused 10.42.0.185:3000 DB: psql \\dt → "Did not find any relations" CNPG: pg_stat_activity → no `newapi` user connections Fix (canonical k8s pattern, Inviolable Principle #16 — own the seam): add a startupProbe that gates BOTH liveness and readiness until the binary opens :3000/api/status. Budget 30 × 10s = 5 min, comfortably above the observed 60-120s ceiling and below operator- impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is unchanged but only activates after startupProbe success per kubelet semantics. The probe block is operator-tunable via `.Values.newapi.probes.startup.*`; setting it to `null` skip-renders the block so overlays against a pre-seeded DB can opt out (Inviolable Principle #4). Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so freshly franchised Sovereigns pull the new chart on next prov. Render tested (smoke + override): startupProbe present with failureThreshold=30 in defaults; suppressed when startup: null. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:37:00 +04:00
e3mrah	a8931db541	fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849 ) (#1855 ) Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main: 1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml), gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin -> Sovereign install); blueprint.yaml gets resynced down/up to match. 2. pin-sync-audit on push — full-sweep audit races the blueprint-release auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift until the auto-bump bot commits the pin update ~60s later; the bot push (GITHUB_TOKEN convention) does not retrigger this workflow, so the failure remains in run history. Fix: set continue-on-error: true on push/workflow_dispatch events (PR remains blocking via --changed-only). The full-sweep output still surfaces drift on the run summary; it just doesn't fail the overall run while the heal-in- ~60s window is open. Documented inline in the job header. Net effect: every push to main re-runs cleanly green. The 13 pre-existing drifts called out in the existing job comment will continue to heal as each lagging chart gets its next bump (auto-bump hook + this PR's manifest-validation alignment). Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence to blueprint.yaml versions which the test asserts but the auto-bump hook does not yet update). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>	2026-05-19 00:34:48 +04:00
e3mrah	d36e54df74	test(chart): baseline CNP allow-list contract gate — guards #1785→#1803→#1847 cascade (Closes #1850 ) (#1854 ) The May 2026 baseline-CNP cascade shipped three production bugs in two days because nothing in CI rendered the chart and asserted on the rendered CiliumNetworkPolicy shape: - #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system with WORLD egress restricted to TCP/443 only AND no ingress allow for the `catalyst` namespace. - #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after /api/v1/auth/pin-request 502'd on every fresh onboarding. - #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24 fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s. This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh — a pure helm-template + grep/awk contract gate matching the existing platform/self-sovereign-cutover/chart/tests/cutover-contract.sh pattern. The Blueprint Release workflow already runs every *.sh under chart/tests/ as a publish gate (see blueprint-release.yaml line 384), so the gate is wired automatically and fails publish BEFORE the OCI artifact reaches a Sovereign. 13 cases asserted: 1. baseline-default-deny CNP renders + is namespaced to catalyst-system 2. egress allows SMTP submission 587/TCP (#1803 regression guard) 3. egress allows SMTPS 465/TCP (#1803 regression guard) 4. egress allows legacy SMTP 25/TCP (#1803 regression guard) 5. egress allows HTTPS 443/TCP to world 6. egress allows kube-dns 53/UDP + 53/TCP 7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847) 8. ingress allows `flux-system` (HelmRelease readiness probes) 9. ingress allows `kube-system` (operator + ccm + CoreDNS) 10. ingress is namespace-scoped — no fromEntities:{cluster\|world\|all} wildcard 11. catalyst-api Service exposes port 8080 (auto-trigger contract) 12. CNP toggles off cleanly with security.baselineCnp.enabled=false 13. 
allowedIngressNamespaces propagates via --set (operator-tunable) Negative-test confirmation (executed locally before commit): - Remove SMTP 587 from template → Case 2 FAILS, exit 1 - Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1 - Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1 - Restore originals → all 13 cases PASS, exit 0 Refs: TBD-A18, PRs #1785 #1803 #1847, audit /tmp/audit-recent-prs-quality-report.json Closes #1850 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 00:32:28 +04:00
github-actions[bot]	82e972fb77	deploy: update catalyst images to `75cb059`	2026-05-18 20:26:21 +00:00
e3mrah	75cb059fc0	Merge pull request #1851 from openova-io/fix/a16-hetzner-ssh-key-sweep fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)	2026-05-19 00:24:19 +04:00
github-actions[bot]	e78faa986c	deploy: update catalyst images to `f07312c`	2026-05-18 20:23:49 +00:00
e3mrah	f07312c5ae	fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852 ) Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179 + bootstrap-kit pin bump + cloud-init substitute extension, because each fix is small and they share the same fresh-prov verification cycle. TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the chroot in-cluster fallback's k8sCache.Factory reflector emitted continuous `networkpolicies is forbidden` errors at the cluster scope because only update/patch/delete were granted (existing mutation block) — the read path was never wired. Mirrors the existing cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7). TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields configuredRegions / controlPlaneIP / primaryRegion / replicaRegion / selfDeploymentId / enableHotStandby / qaApplications empty on every fresh prov. Pre-fix the envsubst placeholders resolved to empty because nothing wrote them into the bootstrap-kit Kustomization postBuild substitute map → the chart rendered empty strings → Dashboard SovereignCard configured-regions chips, Settings page operator-identity, /api/v1/sovereign/self, and the D31 active-hot-standby gating ALL silently fell through to default behaviour. 
Wired via three coordinated changes: - Chart values.yaml gains global.sovereignSelfDeploymentId default - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId, sovereign.configuredRegions, sovereign.qaApplications mappings (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`) - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION / SOVEREIGN_REPLICA_REGION (canonical 4-segment labels), SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty), SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list), QA_APPLICATIONS_YAML (reserved, default `[]`) - main.tf: new template inputs sovereign_configured_regions_yaml + replica_region_canonical_label (derived from local.secondary_regions), threaded into both primary CP and per-secondary-region cloud-init templatefile calls TBD-A10b (issue #1845) — GET /api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409 kubeconfig-file-missing on fresh prov for every region. Pre-fix the handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly call with the bare `cloudRegion` (`?region=hel1`) because that's the matrix-doc-friendly form. Fall-back resolution order added to GetKubeconfig: exact-name first (legacy + manual operator PUT), then `<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test covers all three paths: exact match, slot-suffix glob, unknown-region still 409. Closes the regression introduced when PR #1763 (mothership→chroot kubeconfig handover hook) started using the cloud-init naming convention for fan-out exports. 
Closes #1843, Closes #1844, Closes #1845 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:21:38 +04:00
hatiyildiz	6e883c1f8b	fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16) Third match pass for SSH keys whose name AND label both drifted from the Tofu canonical emission. The OpenSSH public_key comment is the one piece of metadata that survives Console-rename, partial tofu apply, and out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical prefix into it at generation. Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh t25 provs because previous wipe cycles left it as an orphan. Label-pass + name-prefix-pass had no signal once the name/label drifted. Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping t2.omantel.biz cannot delete t20.omantel.biz's SSH key. Tests: - PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match) - PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin) - PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes) - PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched) - PublicKeyComment_ParsesFormats (OpenSSH parser unit pins) - CommentMatchesPrefix_BoundaryRules (separator rune table) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:15:51 +02:00
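The boundary-aware prefix check from the commit above (wiping t2 must not match t20) can be illustrated with the sketch below. The function names echo the test names in the commit, but the bodies are assumptions, and the separator set in `isBoundary` is a guess: the real "separator rune table" is not shown in the log.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// publicKeyComment returns the trailing comment of an OpenSSH public key
// line of the form "<type> <base64> <comment>", or "" if there is none.
func publicKeyComment(line string) string {
	parts := strings.SplitN(strings.TrimSpace(line), " ", 3)
	if len(parts) < 3 {
		return ""
	}
	return strings.TrimSpace(parts[2])
}

// isBoundary reports whether r may legitimately follow a prefix.
// The separator set is assumed for illustration.
func isBoundary(r rune) bool {
	switch r {
	case '-', '.', '_', '@', ' ':
		return true
	}
	return false
}

// commentMatchesPrefix is true only when comment starts with prefix AND the
// prefix ends at the end of the comment or at a separator rune.
func commentMatchesPrefix(comment, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(comment, prefix) {
		return false
	}
	rest := comment[len(prefix):]
	if rest == "" {
		return true
	}
	r, _ := utf8.DecodeRuneInString(rest)
	return isBoundary(r)
}

func main() {
	// Key material below is a placeholder, not a real key.
	key := "ssh-ed25519 AAAAC3placeholder catalyst-t20-omantel-biz"
	comment := publicKeyComment(key)
	fmt.Println(strings.HasPrefix(comment, "catalyst-t2"))     // naive check matches
	fmt.Println(commentMatchesPrefix(comment, "catalyst-t2"))  // boundary check rejects
	fmt.Println(commentMatchesPrefix(comment, "catalyst-t20")) // intended owner matches
}
```

The plain strings.HasPrefix call is what makes the t2 sweep dangerous; requiring a separator after the prefix is the extra guard.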
hatiyildiz	7a2cad9a47	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.177 -> 1.4.178 (auto, Refs TBD-A6)	2026-05-18 19:46:12 +00:00
e3mrah	31b7dc5859	fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847 ) PR #1785 (chart 1.4.171) shipped a baseline default-deny CiliumNetworkPolicy in catalyst-system whose ingress allowlist was limited to: - reserved.ingress: "" (cilium-gateway endpoint) - same-namespace catalyst-system Pods - host / remote-node / kube-apiserver entities The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst` namespace, including the 10-auto-trigger Job whose Pod curls catalyst-api.catalyst-system.svc.cluster.local:8080 to fire /api/v1/internal/cutover/trigger. With #1785 in effect on a FRESH prov, every auto-trigger Pod times out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and the D0 auto-redirect to the Sovereign Console never happens — the operator is stuck on mothership /jobs forever. Caught by t24 zero-touch verification (2026-05-18): handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst' ns cannot reach catalyst-api in 'catalyst-system' ns because baseline-default-deny CNP allows ingress only from {reserved.ingress, catalyst-system ns, host entities}" The companion symptom on t22 was masked because t22's cutover Job had already completed before the CNP rolled out — the CNP did not gate ingress there. Fix ───────────────────────────────────────────────────────────────── Add a fourth ingress rule to baseline-default-deny allowing fromEndpoints in the operator-tunable list .Values.security.baselineCnp.allowedIngressNamespaces. 
Defaults: - catalyst — cutover Pods (the load-bearing fix) - flux-system — Helm/Kustomize/Source controllers probing Service readiness for HelmRelease health rollups (worked pre-#1785 via no-CNP default) - kube-system — Cilium operator + hcloud-ccm + CoreDNS that do cluster introspection calls (the reserved.ingress gateway endpoint here is still matched by rule 1's reserved.ingress: "" selector — this rule covers non-gateway Pods) The list mirrors the existing allowedPlatformNamespaces pattern on the egress side. No other rule semantics change. Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177 (PR #1803, SMTP egress) — both are sub-regressions from the same #1785 baseline-CNP ship. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:45:28 +04:00

2747 Commits