Commit Graph

2747 Commits

Author SHA1 Message Date
e3mrah
139a620ea7
fix(sovereign-tls): cilium-gateway propagates Hetzner LB annotations via spec.infrastructure (#1889)
Closes #1885 (TBD-A31).

Problem (t28 evidence — A98 + A107 reports, 2026-05-19 00:30Z):
`console.t28.omani.works:443` accepts TCP but TLS resets. Inspection:
`kubectl get svc -n kube-system cilium-gateway-cilium-gateway` shows
type=ClusterIP with no Hetzner LB. Even with the tofu-provisioned
`hcloud_load_balancer.main` (infra/hetzner/main.tf:955) carrying
443→30443 service-port at the infra layer, the cluster-side hcloud-CCM
has no signal to materialise a parallel Service-level LB for the
auto-generated gateway Service — so operators inspecting kubectl see
a non-LoadBalancer Service and conclude the LB chain is broken.

Fix:
Add `spec.infrastructure.annotations` to the Gateway resource. The
Gateway-API spec mandates that controllers propagate these annotations
to any infrastructure resources they create — in Cilium 1.16+ this means
the auto-generated `cilium-gateway-cilium-gateway` Service in kube-system.
hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55) then picks the
annotations up at Service reconcile time and provisions a Hetzner LB.

Annotations (mirrors clustermesh-apiserver block in 01-cilium.yaml):
  - load-balancer.hetzner.cloud/name = <slug>-<region>-gateway
  - load-balancer.hetzner.cloud/location = <Hetzner DC>
  - load-balancer.hetzner.cloud/type = lb11
  - load-balancer.hetzner.cloud/use-private-ip = "false"  (DoD A2 — public IPs always)
  - load-balancer.hetzner.cloud/disable-private-ingress = "true"
  - load-balancer.hetzner.cloud/health-check-protocol = tcp
  - load-balancer.hetzner.cloud/health-check-port = "30443"
  - load-balancer.hetzner.cloud/health-check-interval = 15s
  - load-balancer.hetzner.cloud/health-check-timeout = 10s
  - load-balancer.hetzner.cloud/health-check-retries = "3"

Per-region segmentation: SOVEREIGN_FQDN_SLUG + SOVEREIGN_REGION_KEY in
the LB name so each multi-region peer's cilium-gateway gets its own
public LB (Hetzner LBs are unique-by-name; duplicate-name allocations
collapse to the first-created instance, hiding the LB for every
subsequent region).
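
A minimal Go sketch of the per-region naming rule above, for illustration
only. The real wiring is YAML with Flux substitution; the function name,
signature, and sample inputs here are hypothetical. The annotation keys
and values are the ones listed in this message.

```go
package main

import "fmt"

// gatewayLBAnnotations builds the ten Hetzner LB annotations for one
// region's gateway. Name and signature are hypothetical; only the keys
// and values come from the commit message.
func gatewayLBAnnotations(slug, regionKey, location string) map[string]string {
	const p = "load-balancer.hetzner.cloud/"
	return map[string]string{
		// slug + region in the name keeps each region's LB distinct.
		p + "name":                    slug + "-" + regionKey + "-gateway",
		p + "location":                location,
		p + "type":                    "lb11",
		p + "use-private-ip":          "false",
		p + "disable-private-ingress": "true",
		p + "health-check-protocol":   "tcp",
		p + "health-check-port":       "30443",
		p + "health-check-interval":   "15s",
		p + "health-check-timeout":    "10s",
		p + "health-check-retries":    "3",
	}
}

func main() {
	// Sample slug and regions are made up for the example.
	for _, region := range []string{"fsn1", "hel1"} {
		a := gatewayLBAnnotations("omani-works", region, region)
		fmt.Println(a["load-balancer.hetzner.cloud/name"])
	}
}
```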

Wiring: 3 substitute vars (SOVEREIGN_FQDN_SLUG, SOVEREIGN_REGION_KEY,
HCLOUD_LB_LOCATION) threaded into the sovereign-tls Kustomization's
postBuild.substitute block. These mirror the same vars already passed
to bootstrap-kit's Kustomization for the clustermesh-apiserver LB block
in 01-cilium.yaml apiserver.service.annotations, so the configuration
boundary is symmetric across the gateway LB and the clustermesh LB.

Memory rules respected:
  - A2 (PUBLIC IPs for inter-region) — use-private-ip=false
  - feedback_overlap_provs_dont_serialize_wait (no provisioning gate)
  - feedback_subagents_inherit_design_system (no new architectural seam,
    reuses existing Gateway-API + hcloud-CCM contracts)

Validation:
  $ kubectl kustomize clusters/_template/sovereign-tls/ | grep -A 30 'kind: Gateway'
  → renders all 10 Hetzner LB annotations under spec.infrastructure
  → ${SOVEREIGN_FQDN_SLUG}/${SOVEREIGN_REGION_KEY}/${HCLOUD_LB_LOCATION}
    substituted at Flux apply time

Acceptance criteria (per issue):
  - kubectl get svc -n kube-system cilium-gateway-cilium-gateway shows
    type=LoadBalancer with external IP (after fresh prov + handover)
  - curl -skI https://console.<fqdn>/ returns HTTP 200

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:50:35 +04:00
hatiyildiz
cd45d074af deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.182 -> 1.4.183 (auto, Refs TBD-A6) 2026-05-19 00:47:50 +00:00
e3mrah
90e30e084c
fix(httproute): omit default sectionName so multi-zone Sovereigns attach via Cilium Gateway hostname matcher (Closes #1884, TBD-A30) (#1888)
Pre-1.4.183 the chart pinned every catalyst-system HTTPRoute to
`sectionName: https` (via values.yaml default), but the Cilium Gateway
template (clusters/_template/sovereign-tls/cilium-gateway.yaml +
infra/hetzner/main.tf locals.parent_domains_listeners_yaml) names HTTPS
listeners:

  - SINGLE parent zone → bare `https` / `http`
  - MULTIPLE parent zones → unique `https-<sanitised-zone>` /
    `http-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`)

On t28 (omani.works primary + omani.homes SME pool, A107 D29 walk
2026-05-19) every public HTTPRoute reported `Accepted=False
NoMatchingListener` and console.<sov> / api.<sov> / marketplace.<sov> /
*.<sov> returned 404 / connection-refused. Single-zone Sovereigns were
unaffected because Gateway used bare `https`.

Fix (Option C - omit sectionName): default
`ingress.gateway.parentRef.sectionName=""` in values.yaml. The existing
`{{- with .Values.ingress.gateway.parentRef.sectionName }}` guards in
templates/httproute.yaml, templates/services/catalog/httproute.yaml, and
templates/sme-services/marketplace-routes.yaml skip the field entirely
when empty. Cilium Gateway then matches each route to listeners by
hostname filter - every listener has `hostname: *.<zone>`, so
`console.<sov-fqdn>` auto-attaches to the listener whose hostname matches
(which is precisely the listener whose certificateRef terminates the
right wildcard cert).
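
A simplified Go sketch of that hostname-based attachment, assuming a
leading `*.` acts as a suffix wildcard. The types and function names are
hypothetical, and this is not Cilium's implementation; it only shows why
a route without sectionName lands on exactly one zone's listener.

```go
package main

import (
	"fmt"
	"strings"
)

// listener is a hypothetical, trimmed-down view of a Gateway listener.
type listener struct {
	Name     string
	Hostname string // e.g. "*.omani.works"
}

// hostnameMatches reports whether host is covered by pattern. A leading
// "*." is treated as a suffix wildcard; anything else must match exactly.
func hostnameMatches(pattern, host string) bool {
	if strings.HasPrefix(pattern, "*.") {
		suffix := pattern[1:] // ".omani.works"
		return len(host) > len(suffix) && strings.HasSuffix(host, suffix)
	}
	return pattern == host
}

// attach returns the names of the listeners whose hostname covers host.
func attach(listeners []listener, host string) []string {
	var out []string
	for _, l := range listeners {
		if hostnameMatches(l.Hostname, host) {
			out = append(out, l.Name)
		}
	}
	return out
}

func main() {
	ls := []listener{
		{Name: "https-omani-works", Hostname: "*.omani.works"},
		{Name: "https-omani-homes", Hostname: "*.omani.homes"},
	}
	fmt.Println(attach(ls, "console.t28.omani.works"))
}
```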

This is the canonical pattern already in use elsewhere in the codebase:
  - core/controllers/sandbox/internal/gitops/manifests.go (sandbox)
  - core/controllers/organization/internal/controller/tenant_route.go
    (per-Org tenant routes)
  - products/catalyst/chart/templates/sme-services/tenant-public-routes.yaml

Preflight CI (.github/workflows/preflight-cilium-httproute.yaml) explicitly
overrides `--set ingress.gateway.parentRef.sectionName=http` because it
ships a Gateway with an HTTP-only listener named `http`; that override
path is preserved unchanged.

helm template render verifies all 5 affected HTTPRoutes
(catalyst-ui, catalyst-api, catalyst-catalog, marketplace,
tenant-wildcard) now emit a `parentRefs` block with name+namespace only,
no `sectionName`. helm lint clean.

Chart bumped 1.4.182 -> 1.4.183.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:47:14 +04:00
e3mrah
ab6f3e6510
fix(scripts): scrub stale sovereign-tls from expected-bootstrap-deps.yaml (post #1879 cleanup, fixes dep-graph-audit) (Refs #1871) (#1881)
PR #1875 added `sovereign-tls` to the bp-self-sovereign-cutover dependsOn
in both the chart AND scripts/expected-bootstrap-deps.yaml. PR #1879
reverted the chart half (because HelmRelease.dependsOn cannot reference a
Flux Kustomization — helm-controller logs "not found", chart parks
Stalled, handover never fires).

The scripts/expected-bootstrap-deps.yaml half was left behind, so the
dep-graph-audit job now fails on origin/main with drift between the
declared expectation (`bp-gitea bp-harbor sovereign-tls`) and the chart
on disk (`bp-gitea bp-harbor`).

Scrub:
- Remove `sovereign-tls` from the cutover's depends_on list.
- Remove the stale `sovereign-tls` placeholder slot 0t entry (no HR
  file exists for it — it is a Flux Kustomization).
- Replace the obsolete comment block with a short note explaining the
  PR #1875 / #1879 history so the next reader doesn't re-add it.

Verified: `bash scripts/check-bootstrap-deps.sh` -> "OK: bootstrap-kit
dependency graph audit PASSED" with Drift: 0, Cycles: 0.
Verified: `helm template platform/self-sovereign-cutover/chart` -> exit 0.

Refs #1871

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:14:43 +04:00
hatiyildiz
d6e9379b20 deploy(bp-self-sovereign-cutover): lockstep blueprint.yaml spec.version 0.1.31 -> 0.1.32 (auto, Refs TBD-A20, #1856) 2026-05-19 00:04:44 +00:00
e3mrah
ee4dfedef8
fix(cutover): Step-06 Job waits for Cilium Gateway Programmed=True before HelmRepository URL rewrite (Closes #1871, supersedes #1875) (#1879)
PR #1875 added `- name: sovereign-tls` to bp-self-sovereign-cutover.dependsOn
to gate the URL rewrite behind Gateway TLS readiness. That fix was
unresolvable: Flux HelmRelease.dependsOn can ONLY reference other
HelmReleases, but sovereign-tls is a Flux Kustomization. helm-controller
verbatim on t27 fresh-prov (A84 empirical test, 2026-05-18):

  helmreleases.helm.toolkit.fluxcd.io "sovereign-tls" not found

bp-self-sovereign-cutover sat forever in dependency-wait, cutover never
fired, handover never fired.

This commit moves the readiness check INTO the chart: chart 0.1.32 adds
a Phase -1 (gateway-wait) at the top of the Step-06
helmrepository-patches Job. The Job polls
`gateway.networking.k8s.io/v1.Gateway cilium-gateway` in `kube-system`
until status.conditions[Programmed]=True, with a 30 min default
deadline. If the Gateway never programs, the Job exits 1 (surfacing the
block to the operator) rather than rewriting URLs into a Gateway that
won't answer TLS.
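
The poll-until-deadline shape of Phase -1 can be sketched in Go as below.
This is an illustration, not the chart's Job script: the function name is
hypothetical and the `check` callback stands in for reading the Gateway's
Programmed condition from the API server.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errGatewayTimeout = errors.New("gateway not Programmed before deadline")

// waitProgrammed calls check every interval until it reports true,
// returns an error, or the next poll would fall past the deadline.
func waitProgrammed(timeout, interval time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().Add(interval).After(deadline) {
			return errGatewayTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Short durations keep the demo fast; the text above uses 30 min.
	err := waitProgrammed(200*time.Millisecond, 10*time.Millisecond, func() (bool, error) {
		calls++
		return calls >= 3, nil // pretend the Gateway programs on poll 3
	})
	fmt.Println("err:", err, "calls:", calls)
	// A caller mirroring the Job would exit 1 on a non-nil error.
}
```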

RBAC: ClusterRole gains gateway.networking.k8s.io/gateways
{get,list,watch}.

Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml`:
  - reverts the bad PR #1875 `- name: sovereign-tls` dependsOn entry
  - bumps chart pin 0.1.31 -> 0.1.32

Tests: cutover-contract Case 20 guards the Phase -1 block + RBAC.
helm-template confirms the Phase -1 wait + env
(GATEWAY_NAMESPACE=kube-system, GATEWAY_NAME=cilium-gateway,
GATEWAY_WAIT_TIMEOUT_SECONDS=1800) renders into the
cutover-step-06-helmrepository-patches ConfigMap.podSpec.

Closes #1871
Refs #1875 (supersedes)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:04:12 +04:00
e3mrah
366d5d2b33
docs(principles): clarify #14 — HelmRelease.dependsOn cannot reference Kustomizations (empirical t27 finding) (#1878)
A84 empirical finding (t27 / PR #1875): HelmRelease.spec.dependsOn
strictly references OTHER HelmReleases — it cannot reference Flux
Kustomizations or other resource kinds. PR #1875 added the `sovereign-tls`
Kustomization to a HelmRelease's dependsOn; helm-controller logged
`helmreleases "sovereign-tls" not found` and retried every 30s forever.

Adds a critical sub-rule to principle #14 documenting the cross-kind
limitation, the recommended workaround (wait-HelmRelease shim or move the
gated workload into a Kustomization), and the verbatim helm-controller
error message so the next regression is greppable.

Doc-only.

Co-authored-by: hatiyildiz <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 04:00:25 +04:00
hatiyildiz
2e1826abb4 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.181 -> 1.4.182 (auto, Refs TBD-A6) 2026-05-18 23:49:51 +00:00
github-actions[bot]
5a25c254a1 deploy: update sme service images to 5d5c557 + bump chart to 1.4.182 2026-05-18 23:49:14 +00:00
e3mrah
5d5c55739e
fix(notification): retry-backoff on Stalwart 503 5.5.1 rate-limit (#1876)
When Stalwart trips its rate-limit and returns "503 5.5.1", the
notification service previously surfaced the error immediately to the
events consumer, which kept hammering on the next event and prolonged
the rate-limit window.

Now Mailer.Send detects 503 5.5.1 specifically (via *textproto.Error
unwrap + canonical-code substring fallback) and retries up to 3 times
with a 60s backoff between attempts. The backoff is configurable via
SMTP_RETRY_BACKOFF env var (Go duration string OR bare integer seconds;
30s floor to keep the rate-limiter happy). Non-rate-limit errors
(auth failure, transient I/O, etc.) bubble up unchanged so the
consumer can NACK / dead-letter as appropriate.

Adds smtp_test.go covering:
- single rate-limit -> retry -> success
- exhausted retries -> wrapped error preserving *textproto.Error
- non-rate-limit error -> immediate pass-through, no backoff
- isRateLimit detection (textproto, multiline 503-5.5.1, negative cases)
- parseRetryBackoff env-var forms + 30s floor + zero/garbage fallbacks
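
A rough Go sketch of the detection, backoff parsing, and retry loop
described above. It is not the code in this commit: signatures, the
`demoRateLimit` and `noSleep` helpers, and the reading of "up to 3 times"
as three retries after the first attempt are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/textproto"
	"strconv"
	"strings"
	"time"
)

const (
	defaultBackoff = 60 * time.Second
	minBackoff     = 30 * time.Second
)

// demoRateLimit is a sample error used only by this example.
var demoRateLimit error = &textproto.Error{Code: 503, Msg: "5.5.1 rate limit exceeded"}

// noSleep lets the example run without real waiting.
func noSleep(time.Duration) {}

// isRateLimit unwraps to *textproto.Error first, then falls back to a
// substring check on the error text.
func isRateLimit(err error) bool {
	if err == nil {
		return false
	}
	var tp *textproto.Error
	if errors.As(err, &tp) {
		return tp.Code == 503 && strings.Contains(tp.Msg, "5.5.1")
	}
	msg := err.Error()
	return strings.Contains(msg, "503 5.5.1") || strings.Contains(msg, "503-5.5.1")
}

// parseRetryBackoff accepts a Go duration or bare integer seconds,
// falls back to the default on zero/garbage, and applies the 30s floor.
func parseRetryBackoff(raw string) time.Duration {
	raw = strings.TrimSpace(raw)
	d, err := time.ParseDuration(raw)
	if err != nil {
		secs, convErr := strconv.Atoi(raw)
		if convErr != nil {
			return defaultBackoff
		}
		d = time.Duration(secs) * time.Second
	}
	if d <= 0 {
		return defaultBackoff
	}
	if d < minBackoff {
		return minBackoff
	}
	return d
}

// sendWithRetry retries only rate-limit errors; anything else is
// returned unchanged on the first occurrence.
func sendWithRetry(send func() error, maxRetries int, backoff time.Duration, sleep func(time.Duration)) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = send(); err == nil || !isRateLimit(err) {
			return err
		}
		if attempt < maxRetries {
			sleep(backoff)
		}
	}
	return fmt.Errorf("smtp rate limit persisted after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	err := sendWithRetry(func() error {
		calls++
		if calls == 1 {
			return demoRateLimit
		}
		return nil
	}, 3, parseRetryBackoff("45"), noSleep)
	fmt.Println("err:", err, "calls:", calls)
}
```

Wrapping with `%w` keeps the original *textproto.Error reachable through
errors.As after retries are exhausted.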

No credential touches: this is a retry-hardening fix only; the
chart-side SMTP creds path is already GREEN (see #1793 A80 diagnosis).

Refs #1793

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:47:58 +04:00
e3mrah
3b4c130129
fix(bootstrap-kit): cutover dependsOn sovereign-tls — wait for Gateway TLS before HelmRepository URL rewrite (Closes #1871) (#1875)
TBD-A24 cutover↔gateway circular deadlock — discovered on t26 zero-touch
prov 2026-05-18 (99bb823cb0513f4b):

  1. bp-catalyst-platform HR installs at v1.4.179 (Ready=True)
  2. bp-self-sovereign-cutover HR Ready=True (deps gitea+harbor only)
  3. Step-06 rewrites all 50 HelmRepository URLs ghcr.io → registry.<fqdn>
  4. bp-catalyst-platform flips Ready=False (TLS handshake EOF — no Gateway)
  5. sovereign-tls Kustomization blocked on bootstrap-kit Ready=True
  6. bootstrap-kit blocked on bp-catalyst-platform Ready=True
  7. Full deadlock — no Gateway, no handover, every UI route 404

Fix: add `sovereign-tls` as a third dependsOn entry on the cutover HR so
Flux waits for the Cilium Gateway to be serving TLS before the URL
rewrite fires. Same architectural shape as Wave 7 bp-hcloud-csi removal
(#1610) — chicken-and-egg between bootstrap-kit and sovereign-tls broken
by ordering the dangerous-side-effect chart AFTER the Gateway is ready.

Also updates scripts/expected-bootstrap-deps.yaml so the dep-graph audit
(check-bootstrap-deps.sh) recognises the new edge: slot 6a gets the
extra `sovereign-tls` entry, plus a new "slot 0t" entry declaring
sovereign-tls as a known node (no HR file on disk → audit reports it as
`deferred`, info not error; Phase 4 cycle detection accepts it as a
zero-in-degree root).
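
For illustration, cycle detection over a dependsOn graph can be sketched
with Kahn's algorithm in Go. The audit itself is a shell script; this is
not a port of it, and the function name is hypothetical. A node with no
dependencies of its own (such as sovereign-tls here) starts in the queue
as a zero-in-degree root.

```go
package main

import "fmt"

// hasCycle reports whether the dependsOn graph contains a cycle.
// dependsOn maps a node to the nodes it waits for.
func hasCycle(dependsOn map[string][]string) bool {
	inDegree := map[string]int{}        // unmet dependencies per node
	dependents := map[string][]string{} // reverse edges
	for node, deps := range dependsOn {
		if _, ok := inDegree[node]; !ok {
			inDegree[node] = 0
		}
		for _, d := range deps {
			if _, ok := inDegree[d]; !ok {
				inDegree[d] = 0
			}
			inDegree[node]++
			dependents[d] = append(dependents[d], node)
		}
	}
	var queue []string
	for n, deg := range inDegree {
		if deg == 0 {
			queue = append(queue, n)
		}
	}
	visited := 0
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		visited++
		for _, m := range dependents[n] {
			inDegree[m]--
			if inDegree[m] == 0 {
				queue = append(queue, m)
			}
		}
	}
	// Nodes left unvisited are stuck behind a cycle.
	return visited != len(inDegree)
}

func main() {
	g := map[string][]string{
		"bp-self-sovereign-cutover": {"bp-gitea", "bp-harbor", "sovereign-tls"},
	}
	fmt.Println("cycle:", hasCycle(g))
}
```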

Verified locally:
  - yq parses spec.dependsOn → 3 entries (bp-gitea, bp-harbor, sovereign-tls)
  - scripts/check-bootstrap-deps.sh: 50 present, 65 declared, 0 drift, 0 cycles
  - helm template platform/self-sovereign-cutover/chart: exit 0 (smoke OK)

Refs: t26 ID 99bb823cb0513f4b, A55 diagnostic, A67 diagnosis, slot 17a
comment in clusters/_template/bootstrap-kit/kustomization.yaml documenting
the same chicken-and-egg shape.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:19:55 +04:00
e3mrah
06bea550ff
feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin (#1874)
The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml,
bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish
commits land on main, but it cannot detect the "chart bumped but never
published" failure mode: the bootstrap-kit pin points at a chart
version that GHCR never received because blueprint-release.yaml
failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep,
runner cancellation, transient GHCR push 5xx).

Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180
and 1.4.181 were "lost" during the TBD-A20 scanner break window
(21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS
while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist
until A58 manually re-fired the workflow via dispatch. Fresh
Sovereigns silently fell back to the last working tag.

What this adds
- scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and
  optional `--ghcr-org <org>`). For every chart pinned in the kit, it
  lists ghcr.io/<org>/<chart> tags via `gh api
  /orgs/<org>/packages/container/<chart>/versions --paginate`, then
  asserts the pinned version appears. Exits 1 on any missing tag.
- A per-chart tag cache avoids redundant paginations.
- .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now
  passes `--check-ghcr` on `push` to main + `workflow_dispatch`
  (PR mode stays `--changed-only` and skips GHCR — PRs cannot publish
  to GHCR anyway). The job stays `continue-on-error: true` under the
  same observational umbrella as the existing post-merge full sweep
  so a transient API blip cannot red-flag every chart bump; the
  missing-tag list still surfaces on the run summary for operator
  attention.
- Job grants `packages: read` so the workflow GITHUB_TOKEN can list
  private package versions.
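
The core of the `--check-ghcr` assertion, sketched in Go for clarity (the
real check is a shell script calling `gh api`). The `pin` type, function
name, and the `listTags` callback are hypothetical; the callback stands
in for the paginated GHCR listing.

```go
package main

import "fmt"

// pin is one chart/version pair taken from the bootstrap kit.
type pin struct{ Chart, Version string }

// missingPins returns every pin whose version is absent from the tag
// list. Tag lists are cached per chart so each chart is listed once.
func missingPins(pins []pin, listTags func(chart string) ([]string, error)) ([]pin, error) {
	cache := map[string]map[string]bool{}
	var missing []pin
	for _, p := range pins {
		tags, ok := cache[p.Chart]
		if !ok {
			list, err := listTags(p.Chart)
			if err != nil {
				return nil, fmt.Errorf("list tags for %s: %w", p.Chart, err)
			}
			tags = make(map[string]bool, len(list))
			for _, t := range list {
				tags[t] = true
			}
			cache[p.Chart] = tags
		}
		if !tags[p.Version] {
			missing = append(missing, p)
		}
	}
	return missing, nil
}

func main() {
	// Fake registry contents for the example.
	registry := map[string][]string{
		"bp-catalyst-platform": {"1.4.179", "1.4.182"},
	}
	pins := []pin{
		{"bp-catalyst-platform", "1.4.182"},
		{"bp-catalyst-platform", "99.99.99"},
	}
	missing, err := missingPins(pins, func(chart string) ([]string, error) {
		return registry[chart], nil
	})
	fmt.Println(missing, err)
}
```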

Verification (origin/main snapshot, 2026-05-19)
- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags
  present — PASS exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13
  both set to a non-existent 99.99.99, the script exits 1 with
  `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the
  remediation hint pointing at `gh workflow run
  blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean
  exit 0 with the existing "nothing to check" message.

Refs #1872, #1864, #1856.

Closes #1872

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:12:13 +04:00
e3mrah
a7cd2fc21f
docs(principles): add 3 session-2026-05-18 principles (validate-vs-origin / GHCR-tag-check / cutover-dependsOn-Gateway) (#1873)
Adds three new inviolable principles surfaced by 2026-05-18 incidents:

- #12 Never validate against the local working tree — A19 false-positive
  (verifier grepped a feature-branch working copy with unstaged edits,
  reported "already on main" when it was not).
- #13 Chart-pin bumps must match a GHCR tag that exists — TBD-A48 / PR #1869
  drift: pin to bp-self-sovereign-cutover:0.1.4 landed on main while the
  chart artifact had not been published, causing hours of ImagePullBackOff.
- #14 Cutover-style HRs that rewrite HelmRepository URLs must dependsOn
  Gateway readiness — TBD-A24 / PR #1871: bp-self-sovereign-cutover flipped
  URLs to local registry before Cilium Gateway was serving TLS, deadlocking
  the cluster.

Doc-only change; bumps the front-matter Updated date to 2026-05-18.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:09:26 +04:00
hatiyildiz
26e4c8e30e deploy(bp-guacamole): bump bootstrap-kit pin 0.1.25 -> 0.1.26 (auto, Refs TBD-A6)
Also locksteps platform blueprint.yaml spec.version 0.1.25 -> 0.1.26 (Refs TBD-A20, #1856).
2026-05-18 22:20:35 +00:00
github-actions[bot]
8ce7c02aa9 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.26 2026-05-18 22:19:59 +00:00
e3mrah
1b87d38e94
deploy: catch-up pins for bp-catalyst-platform 1.4.181 + bp-guacamole 0.1.25 (post #1866 fix) (#1869)
Catch-up for drift introduced during the Blueprint Release workflow outage
21:04:22Z (PR #1858 merge with YAML scanner break) → 22:07:49Z (PR #1866 fix).

Charts published in that window:
- bp-catalyst-platform 1.4.180 → 1.4.181 (umbrella)
- bp-guacamole 0.1.24 → 0.1.25

Auto-bump-pin step didn't fire during the outage. A39 already caught up bp-newapi
(PR #1865). This PR catches up the remaining 2.

Refs #1864, PR #1866 (workflow fix), PR #1858 (root cause).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 02:19:34 +04:00
hatiyildiz
66fa508b74 deploy(bp-newapi): bump bootstrap-kit pin 1.4.21 -> 1.4.22 (auto, Refs TBD-A6)
Also locksteps platform blueprint.yaml spec.version 1.4.21 -> 1.4.22 (Refs TBD-A20, #1856).
2026-05-18 22:11:05 +00:00
e3mrah
22e046b554
Merge pull request #1866 from openova-io/fix/1864-workflow-yaml-startup-failure
fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)
2026-05-19 02:07:48 +04:00
hatiyildiz
69f2d7d91a fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)
Root cause of the auto-bump-pin miss flagged in #1864.

The Blueprint Release workflow has been in `startup_failure` since
PR #1858 (commit cf35b4a) merged at 21:04:22Z. The lockstep step's
multi-line shell heredoc inside a `run: |` block-scalar:

    if [ ... ]; then
      msg="deploy(...) (auto, Refs TBD-A6)
                                                        <-- literal blank line
    Also locksteps platform blueprint.yaml ..."          <-- column 1, no indent

ends the block-scalar early. Blank lines stay inside a block-scalar, but
the first non-blank line indented less than the block (the column-1
`Also locksteps ...` line) terminates it, and the scanner then tries to
parse that line as a new top-level mapping key, which fails. The whole
workflow file is rejected at workflow-startup time. Verified with
`python3 -c yaml.safe_load(...)` (raises `ScannerError: could not find
expected ':' line 815`) and by `gh api .../actions/runs/26060392136`
returning `conclusion=failure, status=completed, jobs: []` for every
push since cf35b4a.

Consequence: no chart bump since cf35b4a has triggered the TBD-A6
auto-bump-pin or the TBD-A20 blueprint.yaml lockstep. PR #1865 was
the manual catch-up for bp-newapi (1.4.20 -> 1.4.21); without this
fix every future chart publish will drift the same way.

Fix: build the multi-line commit message with `printf '%s\n\n%s'`
so the string source stays on physically-indented lines that the
YAML block-scalar accepts. Behaviour is identical — same commit
subject, same blank line, same body — only the construction shape
changes. Added a 9-line comment naming the seam so future authors
don't reintroduce the same trap.

Verified locally:
  * `python3 -c yaml.safe_load(open(...))` succeeds, parses 24
    build-job steps.
  * `CHART_NAME=bp-newapi PREV_VERSION=1.4.20 CHART_VERSION=1.4.21
    BP_PREV_VERSION=1.4.20 bash -c "$(printf ...)"` emits the
    canonical "deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 ->
    1.4.21 (auto, Refs TBD-A6)\n\nAlso locksteps platform ..." body.

Refs #1864.
Refs PR #1858 (TBD-A20 lockstep that introduced the YAML defect).
2026-05-19 00:07:07 +02:00
github-actions[bot]
c64220f8cc deploy: bump bp-newapi upstream v0.13.2 chart 1.4.22 2026-05-18 22:05:58 +00:00
e3mrah
1e1fe26e02
Merge pull request #1865 from openova-io/fix/1864-bp-newapi-pin-catchup
deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race)
2026-05-19 02:05:33 +04:00
hatiyildiz
f57f62764b deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race)
Closes #1864

Manual catch-up. The auto-bump-pin step (TBD-A6) did NOT run for the
1.4.20 -> 1.4.21 chart bump at commit 8b33188 because the Blueprint
Release workflow has been stuck in **startup_failure** since PR #1858
(commit cf35b4a) merged at 21:04:22Z. The workflow YAML at
.github/workflows/blueprint-release.yaml lines 812-814 has a multi-line
heredoc string inside a `run: |` block-scalar whose continuation lines
are unindented:

  msg="deploy(${CHART_NAME}): bump bootstrap-kit pin ${PREV_VERSION} -> ...
                                                              (auto, Refs TBD-A6)

  Also locksteps platform blueprint.yaml spec.version ${BP_PREV_VERSION} ..."

YAML treats the unindented line as the end of the block-scalar and the
next line as a new mapping key (which it isn't), so the entire workflow
file fails the GitHub Actions YAML validator at workflow-start time.
Every push since cf35b4a has produced a run with `conclusion=failure,
status=completed, jobs=[]` (zero jobs spun up).

Evidence:
  * gh api repos/openova-io/openova/actions/runs/26060392136 ->
    'This run likely failed because of a workflow file issue.'
  * Same for every subsequent run including the chart 1.4.21 publish
    (no run was even created for 8b33188 because the workflow file
    couldn't parse).
  * `python3 -c 'yaml.safe_load(open(...))'` raises
    `ScannerError ... could not find expected ':' line 815`.

This PR is the ONE-LINE catch-up so the pin drift is closed. A
companion PR fixes the workflow YAML so future chart bumps auto-bump
the pin again.
2026-05-19 00:04:40 +02:00
github-actions[bot]
6b11734a81 deploy: update sme service images to 4a61543 + bump chart to 1.4.181 2026-05-18 21:48:56 +00:00
e3mrah
4a61543957
test(tenant): wire round-trip for tenant.created owner_email contract (#1863)
Verifies the publisher-side wrapper struct in CreateOrg
(handlers.go:248-252) marshals to bytes the provisioning consumer
in organization_create.go can decode flat with owner_email as a
sibling field. Pairs with TestHandleTenantCreated_FullTenantStructDecode
on the consumer side — together they pin BOTH ends of the contract
so a refactor that nests under "tenant" or renames the tag fails
in CI rather than at staging.
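
A small Go sketch of why the wrapper decodes flat: encoding/json promotes
the fields of an embedded struct pointer to the outer object, so
owner_email ends up as a sibling. The struct and field names below are
hypothetical stand-ins, not the real Tenant type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tenant is a hypothetical, trimmed-down tenant record.
type Tenant struct {
	ID   string `json:"id"`
	Slug string `json:"slug"`
}

// tenantCreated is the publisher-side wrapper: it embeds *Tenant, so the
// tenant fields are emitted at the top level next to owner_email.
type tenantCreated struct {
	*Tenant
	OwnerEmail string `json:"owner_email"`
}

// consumerEvent is what a consumer decoding flat would declare.
type consumerEvent struct {
	ID         string `json:"id"`
	Slug       string `json:"slug"`
	OwnerEmail string `json:"owner_email"`
}

// roundTrip marshals the wrapper and decodes it with the flat struct.
func roundTrip(t *Tenant, ownerEmail string) (consumerEvent, error) {
	var out consumerEvent
	raw, err := json.Marshal(tenantCreated{Tenant: t, OwnerEmail: ownerEmail})
	if err != nil {
		return out, err
	}
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	ev, err := roundTrip(&Tenant{ID: "t-1", Slug: "acme"}, "owner@example.com")
	fmt.Println(ev, err)
}
```

Changing the embedded field to a named field with a `json:"tenant"` tag
would nest the payload and leave the flat decoder with empty ID and Slug,
which is the regression the paired tests are meant to catch.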

Refs #1829 (D29).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 01:47:38 +04:00
github-actions[bot]
de86df1126 deploy: bump sandbox-controller image to a405572 2026-05-18 21:46:29 +00:00
github-actions[bot]
a09445482b deploy: bump sandbox-mcp-server image to a405572 2026-05-18 21:44:53 +00:00
github-actions[bot]
4fd3aae99b deploy: bump application-controller image to a405572 2026-05-18 21:44:52 +00:00
e3mrah
a40557227e
fix(controllers): NATS consume-leg for D35 (organization + sandbox) (#1862)
PR #1626 wired the publish-leg (tenant + billing → NATS JetStream
catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster
controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow
even though the publish leg shipped.

This PR adds:

- core/controllers/pkg/natsbus: minimal JetStream subscriber shared by
  Group-C controllers. Self-contained (no dep on core/services/shared
  which pulls in franz-go/Kafka the controllers never touch).
- core/controllers/organization/internal/controller/nats_bridge.go:
  subscribes to catalyst.tenant.created + catalyst.billing.order.placed,
  patches openova.io/last-event-observed-at + ...-subject annotations on
  the matching Organization CR. The annotation patch triggers an
  informer event → controller-runtime enqueues Reconcile within ~50ms
  instead of waiting for the 30s requeue fallback.
- core/controllers/sandbox/internal/controller/nats_bridge.go: same
  pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR
  using the same `sandbox-<sanitised-email>` naming convention
  tenant-service's SandboxOrchestrator (PR #1633) writes under.
- main.go wiring in both controllers reads NATS_URL from env. Unset =
  log "consume-leg disabled" + continue (informer requeue fallback
  intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS
  is an accelerator, not the only path.

Idempotency: ev.Timestamp is the broker-side timestamp, so duplicate
JetStream delivery produces a byte-stable annotation patch and
controller-runtime does NOT enqueue a redundant Reconcile.
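
The idempotency argument can be sketched in Go with a plain map standing
in for the CR's annotations. The envelope type, function names, and the
RFC 3339 formatting are assumptions; only the observed-at key appears in
full above, so the subject annotation is left out.

```go
package main

import (
	"fmt"
	"time"
)

const observedAtKey = "openova.io/last-event-observed-at"

// envelope is a hypothetical, minimal event envelope.
type envelope struct {
	Subject   string
	Timestamp time.Time // broker-side timestamp, stable across redelivery
}

// applyObservedAt writes the annotation and reports whether anything
// changed. A redelivered event yields the same value, so no change.
func applyObservedAt(annotations map[string]string, ev envelope) bool {
	want := ev.Timestamp.UTC().Format(time.RFC3339Nano)
	if annotations[observedAtKey] == want {
		return false
	}
	annotations[observedAtKey] = want
	return true
}

// demoEvent returns a fixed sample event for the example.
func demoEvent() envelope {
	return envelope{
		Subject:   "catalyst.tenant.created",
		Timestamp: time.Date(2026, 5, 18, 21, 0, 0, 0, time.UTC),
	}
}

func main() {
	ann := map[string]string{}
	ev := demoEvent()
	fmt.Println("first delivery changed:", applyObservedAt(ann, ev))
	fmt.Println("duplicate changed:", applyObservedAt(ann, ev))
}
```

Had the bridge stamped time.Now() instead, every redelivery would produce
a new value and trigger another Reconcile.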

Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the
happy path, the no-matching-CR soft miss, duplicate-envelope no-churn,
malformed JSON poison-pill, and the publish-side ↔ consume-side name
derivation lockstep for Sandbox CRs.

HARD CONSTRAINT respected: no credential mutations — bridges read only
the envelope + the target CR, never Secrets or Keycloak SA creds.

Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:43:08 +04:00
github-actions[bot]
376cd7d14c deploy: update catalyst images to c86fb3d 2026-05-18 21:36:21 +00:00
e3mrah
c86fb3d1dc
fix(catalyst-api): seed full 4-entry .omani.X sme-pool (D30 / #1830) (#1861)
LoadSMETenantParentDomainsFromEnv's hardcoded-fallback only seeded
2 entries (omani.works + omani.trade), but the marketplace UI
(core/marketplace/src/components/AddonsStep.svelte) lists 4
(omani.homes + omani.rest + omani.trade + omani.works) and
core/services/domain/store.AllowedTLDs has the same canonical 4.
Result: a customer picking .omani.homes or .omani.rest in /addons
sailed through the picker but got 422 invalid-parent-domain at
catalyst-api signup because FindParentDomain didn't recognise the
TLD.

This widens the seed to all 4 canonical .omani.X entries so the
backend pool, the marketplace picker, and AllowedTLDs all agree.
NSFlipReady=true on every entry (the zones are already delegated
to the Sovereign's PowerDNS at gTLD level — pdmFlipNS
short-circuits via nsAlreadyMatches for Day-2 re-adds).
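
A hedged Go sketch of the widened fallback seed and a suffix lookup over
it. The type and function names are hypothetical and do not mirror the
real LoadSMETenantParentDomainsFromEnv / FindParentDomain signatures; the
four zone names are the ones listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// parentDomain is a hypothetical pool entry.
type parentDomain struct {
	Name        string
	NSFlipReady bool
}

// seedPool returns the four canonical .omani.X entries, all flip-ready.
func seedPool() []parentDomain {
	names := []string{"omani.homes", "omani.rest", "omani.trade", "omani.works"}
	pool := make([]parentDomain, 0, len(names))
	for _, n := range names {
		pool = append(pool, parentDomain{Name: n, NSFlipReady: true})
	}
	return pool
}

// findParentDomain returns the pool entry that fqdn belongs to, if any.
func findParentDomain(pool []parentDomain, fqdn string) (parentDomain, bool) {
	fqdn = strings.ToLower(strings.TrimSuffix(fqdn, "."))
	for _, pd := range pool {
		if fqdn == pd.Name || strings.HasSuffix(fqdn, "."+pd.Name) {
			return pd, true
		}
	}
	return parentDomain{}, false
}

func main() {
	pool := seedPool()
	for _, host := range []string{"shop.omani.homes", "shop.omani.example"} {
		pd, ok := findParentDomain(pool, host)
		fmt.Println(host, ok, pd.Name)
	}
}
```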

Updated TestLoadSMETenantParentDomainsFromEnv_StubFallback
(`pool != 4`) and added 3 fresh tests in
sovereign_parent_domains_test.go covering: canonical 4-entry seed,
OTECH primary + 4 sme-pool composition, env-override path without
fallback leakage.

Closes #1830 (Part 1 — Day-1 pool seed).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 01:34:24 +04:00
github-actions[bot]
a6b5752391 deploy: update sme service images to b214566 + bump chart to 1.4.180 2026-05-18 21:28:12 +00:00
e3mrah
b214566c1a
fix(provisioning): create Organization CR on tenant.created (C16 root cause) (#1860)
Closes the voucher-checkout → Organization-CR loop that was missing
from the convergence chain. Before this PR the flow stalled at:

  voucher accept → tenant-service CreateOrg
     → writes Tenant row, publishes tenant.created
              ↓                                  (DROP — no consumer)
  provisioning consumer switch
     (case "tenant.created" missing — A26 verifier pinpointed this)
              ↓
  organization-controller has nothing to reconcile
              ↓
  no vCluster / Keycloak group / Gitea org / per-tenant HTTPRoute

A26 verifier on t22: zero Organization CRs after 168min despite the
tenant row existing. Closes #1722. Unblocks D29 zero-touch tenant
provisioning (Refs #1829).

Changes:

- core/services/tenant/handlers/handlers.go
  Enrich tenant.created payload with owner_email from JWT claims so
  the provisioning consumer can mint the Organization owner roster
  without a second store round-trip. Wrapper struct embeds *Tenant
  so existing decoders are wire-compatible.

- core/services/provisioning/handlers/consumer.go
  Add case "tenant.created" to the dispatch switch.

- core/services/provisioning/handlers/organization_create.go
  New handler. Validates slug + owner_email, builds cluster-scoped
  Organization CR (apiVersion orgs.openova.io/v1), POSTs via
  k8sRequest. Idempotent on 409 AlreadyExists (NATS redelivery
  safe). 404 → operator-misconfiguration error event. 5xx → return
  err so broker redelivers. Inviolable Principle #4: parent domain
  flows env → Handler.TenantParentDomain → CR (with per-tenant
  parent_domain payload override for multi-pool Sovereigns).

- core/services/provisioning/handlers/organization_create_test.go
  Unit tests: malformed payload, invalid slug (incl. path-traversal),
  missing owner_email, full Tenant decode, default-fill paths, empty
  parent domain mints anyway, payload-shape pinning. All exercised
  with KUBERNETES_SERVICE_HOST scrubbed so no real apiserver dial.
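
The status handling and slug validation in organization_create.go might
look roughly like the Go sketch below. All names and the slug pattern are
hypothetical. The 409/404/5xx mapping follows the text; routing every
other unexpected status to redelivery is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// slugRe is an assumed DNS-label-style pattern; it rejects slashes and
// dots, which also rules out path-traversal input such as "../etc".
var slugRe = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$`)

func validSlug(s string) bool { return slugRe.MatchString(s) }

type action int

const (
	actionDone            action = iota // created or already present
	actionReportMisconfig               // emit an operator-facing error event
	actionRedeliver                     // return an error so the broker retries
)

// classify maps the apiserver's HTTP status to what the handler does.
func classify(status int) action {
	switch {
	case status >= 200 && status < 300:
		return actionDone
	case status == 409: // AlreadyExists: safe under NATS redelivery
		return actionDone
	case status == 404: // resource kind not served: misconfiguration
		return actionReportMisconfig
	default: // 5xx per the text; other statuses are an assumption
		return actionRedeliver
	}
}

func main() {
	for _, s := range []int{201, 409, 404, 503} {
		fmt.Println(s, classify(s))
	}
	fmt.Println(validSlug("acme"), validSlug("../etc"))
}
```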

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:26:57 +04:00
github-actions[bot]
2457926e16 deploy: update catalyst images to 7ec73f9 2026-05-18 21:16:51 +00:00
e3mrah
7ec73f9e2b
fix(catalyst-api): handler test baseline GREEN — 6 failing tests fixed (Closes #1853) (#1859)
Per-test root cause + fix:

1. TestPinIssue_ConcurrentRapidFireRateLimit (TOCTOU race) — pinStore.canIssue
   and put() ran under separate mutex acquisitions; three concurrent
   /pin/issue goroutines all observed "no entry", passed canIssue, then
   raced EnsureUser against Keycloak. Replaced with atomic tryReserve()
   that check-and-stamps under a single lock; HandlePinIssue calls
   store.drop(email) on EnsureUser/generatePin/no-KC failure to roll back
   the reservation so the 60s cooldown doesn't punish operator retries.

2. TestFinaliseHandover_FullFlow — test fixture drift after PR #1487 keyed
   the tofu workdir by DeploymentID (provisioner.workdirKey). Test still
   wrote workdir at filepath.Join(tmp, "tenant-y-omani-works") (the legacy
   sovereign-name slug); FinaliseHandover handler uses `id`. Updated test
   to write workdir at filepath.Join(tmp, "dep-full") so it matches the
   actual prod lookup path. Same fix for the receiver-failure sibling test.

3. TestEnsureOwnerUserAccess_CreatesCanonicalCR — drifted twice: (a) test
   queried Namespace("") but the t134 D21 fix moved the CR to
   userAccessOwnerNamespace ("catalyst-system") because useraccesses is
   namespaced per the XRD claimNames block; (b) test asserted
   spec.applications = [{app:"*", role:"admin"}] but the t135 D21 fix
   switched to spec.tierRoleRef = "openova:tier-owner" (XRD pattern
   rejects `app: "*"`). Updated test to query catalyst-system namespace
   and assert tierRoleRef + applications-must-be-absent.

4. TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty — production
   unstructuredToUserAccess left Spec.Applications=nil when the CR has no
   spec.applications, which json-marshals to `null` and crashes the React
   UI's items.map() (qa-loop iter-4 users-page-null-map regression).
   Initialize Spec.Applications = []userAccessAppGrantBody{} in the
   struct literal so the empty-slice contract is preserved.

5. TestHandleWhoami_PinSessionRBACClaims — whoamiInjectTierRoles
   unconditionally appended every inherited tier role even when the
   upstream JWT already shaped the role list authoritatively. A
   PIN-minted session carrying tier=owner + realm_access=[catalyst-owner]
   was getting fanned out to all 5 inheritance entries, which the
   route-guard couldn't reconcile. Now: if the operator's own
   catalyst-<tier> role is already present, the projection returns early
   and preserves the upstream list. TestHandleWhoami_ProjectsTierToRealmRoles
   still passes (empty input → still injects inheritance) and
   TestWhoamiInjectTierRoles_PreservesExistingRoles still passes
   (idempotent — same input out).

6. TestHandleWhoami_NoRBACOmitsFields — whoamiResponse.RealmAccess was a
   struct value with `omitempty`, which encoding/json does NOT honour for
   structs (it only drops false, 0, empty strings, nil pointers/interfaces,
   and empty slices/maps/arrays; Go 1.24 added `omitzero` for structs). A
   pre-RBAC session always serialized realm_access:{} on the wire,
   breaking the legacy {email,sub,verified} contract. Changed to
   *whoamiRealmAccess so omitempty actually drops the field; HandleWhoami
   only allocates the pointer when claims carry roles, and drops it back
   to nil if the projection ended up empty.
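
The two encoding/json behaviours behind fixes 4 and 6 can be reproduced in isolation. The sketch below uses hypothetical type names (`realmAccess`, `byValue`, `byPointer`, `grants`) that only mirror the shapes described above; it is not the catalyst-api code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// realmAccess is a hypothetical stand-in for the realm_access payload.
type realmAccess struct {
	Roles []string `json:"roles,omitempty"`
}

// byValue mirrors the pre-fix shape: a struct value tagged omitempty.
type byValue struct {
	Email       string      `json:"email"`
	RealmAccess realmAccess `json:"realm_access,omitempty"`
}

// byPointer mirrors the post-fix shape: a pointer tagged omitempty.
type byPointer struct {
	Email       string       `json:"email"`
	RealmAccess *realmAccess `json:"realm_access,omitempty"`
}

// grants mirrors a response field that a UI iterates over.
type grants struct {
	Applications []string `json:"applications"`
}

// mustJSON marshals v and panics on error (acceptable in a demo).
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Struct value: omitempty has no effect, the empty object is emitted.
	fmt.Println(mustJSON(byValue{Email: "a@example.com"}))
	// Nil pointer: omitempty drops the field entirely.
	fmt.Println(mustJSON(byPointer{Email: "a@example.com"}))
	// Nil slice marshals to null; an empty, non-nil slice marshals to [].
	fmt.Println(mustJSON(grants{}))
	fmt.Println(mustJSON(grants{Applications: []string{}}))
}
```

The third and fourth lines of output differ only in `null` versus `[]`, which is the difference a client calling `.map()` on the field would trip over.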

Test status after fix (worktree off origin/main):
- All 6 target tests PASS
- Full TestPin*, TestHandleWhoami*, TestWhoamiInjectTierRoles*,
  TestEnsureOwnerUserAccess*, TestOwnerUserAccessName*, TestListUserAccess*,
  TestFinaliseHandover*, TestUnstructuredToUserAccess* PASS (57 tests)
- go test ./... -p 1 across the entire catalyst-api module PASS

Pre-existing parallelism flakes (TestGetKubeconfig_ReadsFromPathPointer /
TestPhase1Started_GuardPreventsDoubleWatch / TestPodRestart_*) exist on
baseline too: they write to /var/lib/catalyst/ from a goroutine that outlives
test scope. Out of scope for this PR; tracked separately.

Closes #1853

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:14:38 +04:00
github-actions[bot]
de53b39d13 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.25 2026-05-18 21:05:55 +00:00
github-actions[bot]
8b33188019 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.21 2026-05-18 21:04:48 +00:00
e3mrah
cf35b4a9b6
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
  * New step `bump_blueprint` after `bump_pin` — locates
    ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
    (handles both platform-leaf and products-umbrella conventions),
    filters to kind:Blueprint (defensive against CRD yaml at the
    products/catalyst/chart/crds path), reads current spec.version at
    2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
  * Commit step renamed to "Commit + push bootstrap-kit pin bump +
    blueprint.yaml lockstep"; stages both files, single commit, with
    convergent retry on conflict.
  * Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
  * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
    platform/* and products/*, discovers every Blueprint manifest with
    a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
    Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
    existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
  * Failure messages name the file, drift direction, and the exact sed
    command to fix — drift remediation is mechanical.
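
The core comparison behind such a sweep can be sketched with the standard library alone. This is a simplified illustration, not the test in main_test.go: `specVersion`, `chartVersion`, and `checkLockstep` are hypothetical helpers, the line-based scan stands in for a real YAML parser, and trailing comments on the version line are not handled.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scalar trims whitespace and surrounding quotes from a YAML scalar.
func scalar(rest string) (string, bool) {
	v := strings.Trim(strings.TrimSpace(rest), `"'`)
	return v, v != ""
}

// specVersion returns the `version:` value nested directly under a
// top-level `spec:` key, i.e. at exactly two spaces of indent.
func specVersion(manifest string) (string, bool) {
	inSpec := false
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		line := sc.Text()
		if len(line) > 0 && line[0] != ' ' && line[0] != '#' {
			// A new top-level key starts; remember whether it is spec.
			inSpec = strings.HasPrefix(line, "spec:")
			continue
		}
		if inSpec && strings.HasPrefix(line, "  version:") {
			return scalar(strings.TrimPrefix(line, "  version:"))
		}
	}
	return "", false
}

// chartVersion returns the top-level `version:` value of a Chart.yaml.
func chartVersion(chart string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(chart))
	for sc.Scan() {
		if line := sc.Text(); strings.HasPrefix(line, "version:") {
			return scalar(strings.TrimPrefix(line, "version:"))
		}
	}
	return "", false
}

// checkLockstep reports drift between the two manifests.
func checkLockstep(blueprint, chart string) error {
	bv, ok := specVersion(blueprint)
	if !ok {
		return fmt.Errorf("blueprint: spec.version not found")
	}
	cv, ok := chartVersion(chart)
	if !ok {
		return fmt.Errorf("chart: version not found")
	}
	if bv != cv {
		return fmt.Errorf("drift: spec.version=%s, Chart.yaml version=%s", bv, cv)
	}
	return nil
}

func main() {
	bp := "kind: Blueprint\nmetadata:\n  name: bp-velero\nspec:\n  version: 1.2.1\n"
	ch := "apiVersion: v2\nname: bp-velero\nversion: 1.2.2\n"
	fmt.Println(checkLockstep(bp, ch))
}
```

Matching on the exact two-space indent keeps a nested `version:` deeper inside `spec` from being mistaken for `spec.version`.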

Drift cleanup (mandatory companion, same shape as A17/#1855):
  26 Application-Blueprint blueprints whose spec.version had been left
  at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to
  Chart.yaml as authoritative. All currently surface in the new sweep
  test; without the cleanup the test would block this PR (and every
  subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
  cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
  falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
  netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
  trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:22 +04:00
e3mrah
2484c8a3de
fix(bp-velero): bump 1.2.1 -> 1.2.2 to force a publish (Closes #1799) (#1846)
TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because
the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the
initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image
to cadc7b5") which never triggered the blueprint-release workflow. As a
result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck
InProgress and the bootstrap-kit kustomization fails its health check.

GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via
`/orgs/openova-io/packages/container/bp-velero/versions`.

Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate
stays GREEN) so blueprint-release.yaml fires on this push, publishes
`ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a
no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1
subchart, same templates, same values.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:43:13 +04:00
hatiyildiz
9975e057da deploy(bp-newapi): bump bootstrap-kit pin 1.4.19 -> 1.4.20 (auto, Refs TBD-A6) 2026-05-18 20:38:15 +00:00
github-actions[bot]
9982dcafa8 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.20 2026-05-18 20:37:26 +00:00
e3mrah
3d0c96a237
fix(bp-newapi): single-pod DB migration via startupProbe (Closes #1798) (#1857)
newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an
empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay
30s + period 10s + failureThreshold 3 = ~50s ceiling) SIGKILLs the
binary mid-migration on every restart. The 28-CREATE-TABLE +
2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with
sslmode=require — well over the kill window. On t22 chart 1.4.18 the
`newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff
restarts because every kill happened before the GORM connection
pool's first wire write completed (pg_stat_activity on the CNPG
primary showed no newapi-user connections).

Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2):
  [SYS] ... database migration started   ← last log line
  exitCode=2 finishedAt-startedAt = 50s exactly
  Readiness probe: connect: connection refused 10.42.0.185:3000
  DB: psql \dt → "Did not find any relations"
  CNPG: pg_stat_activity → no `newapi` user connections

Fix (canonical k8s pattern, Inviolable Principle #16 — own the
seam): add a startupProbe that gates BOTH liveness and readiness
until the binary opens :3000/api/status. Budget 30 × 10s = 5 min,
comfortably above the observed 60-120s ceiling and below operator-
impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is
unchanged but only activates after startupProbe success per kubelet
semantics. The probe block is operator-tunable via
`.Values.newapi.probes.startup.*`; setting it to `null` skip-renders
the block so overlays against a pre-seeded DB can opt out
(Inviolable Principle #4).

Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so
freshly franchised Sovereigns pull the new chart on next prov.

Render tested (smoke + override): startupProbe present with
failureThreshold=30 in defaults; suppressed when startup: null.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:37:00 +04:00
e3mrah
a8931db541
fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855)
Two unrelated regressions together failed test-bootstrap-kit.yaml on every push to main:

1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields
   asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml
   version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager
   (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak
   (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml),
   gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin
   -> Sovereign install); blueprint.yaml gets resynced down/up to match.

2. pin-sync-audit on push — full-sweep audit races the blueprint-release
   auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift
   until the auto-bump bot commits the pin update ~60s later; the bot
   push (GITHUB_TOKEN convention) does not retrigger this workflow, so
   the failure remains in run history. Fix: set continue-on-error: true
   on push/workflow_dispatch events (PR remains blocking via
   --changed-only). The full-sweep output still surfaces drift on the
   run summary; it just doesn't fail the overall run while the heal-in-
   ~60s window is open. Documented inline in the job header.

Net effect: every push to main re-runs cleanly green. The 13 pre-existing
drifts called out in the existing job comment will continue to heal as
each lagging chart gets its next bump (auto-bump hook + this PR's
manifest-validation alignment).

Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs
TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence
to blueprint.yaml versions which the test asserts but the auto-bump hook
does not yet update).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
2026-05-19 00:34:48 +04:00
e3mrah
d36e54df74
test(chart): baseline CNP allow-list contract gate — guards #1785→#1803→#1847 cascade (Closes #1850) (#1854)
The May 2026 baseline-CNP cascade shipped three production bugs in
two days because nothing in CI rendered the chart and asserted on the
rendered CiliumNetworkPolicy shape:

  - #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system
    with WORLD egress restricted to TCP/443 only AND no ingress allow
    for the `catalyst` namespace.
  - #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after
    /api/v1/auth/pin-request 502'd on every fresh onboarding.
  - #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24
    fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s.

This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh —
a pure helm-template + grep/awk contract gate matching the existing
platform/self-sovereign-cutover/chart/tests/cutover-contract.sh
pattern. The Blueprint Release workflow already runs every *.sh under
chart/tests/ as a publish gate (see blueprint-release.yaml line 384),
so the gate is wired automatically and fails publish BEFORE the OCI
artifact reaches a Sovereign.

13 cases asserted:
  1. baseline-default-deny CNP renders + is namespaced to catalyst-system
  2. egress allows SMTP submission 587/TCP (#1803 regression guard)
  3. egress allows SMTPS 465/TCP (#1803 regression guard)
  4. egress allows legacy SMTP 25/TCP (#1803 regression guard)
  5. egress allows HTTPS 443/TCP to world
  6. egress allows kube-dns 53/UDP + 53/TCP
  7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847)
  8. ingress allows `flux-system` (HelmRelease readiness probes)
  9. ingress allows `kube-system` (operator + ccm + CoreDNS)
 10. ingress is namespace-scoped — no fromEntities:{cluster|world|all} wildcard
 11. catalyst-api Service exposes port 8080 (auto-trigger contract)
 12. CNP toggles off cleanly with security.baselineCnp.enabled=false
 13. allowedIngressNamespaces propagates via --set (operator-tunable)

Negative-test confirmation (executed locally before commit):
  - Remove SMTP 587 from template → Case 2 FAILS, exit 1
  - Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1
  - Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1
  - Restore originals → all 13 cases PASS, exit 0

Refs: TBD-A18, PRs #1785 #1803 #1847, audit /tmp/audit-recent-prs-quality-report.json
Closes #1850

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 00:32:28 +04:00
github-actions[bot]
82e972fb77 deploy: update catalyst images to 75cb059 2026-05-18 20:26:21 +00:00
e3mrah
75cb059fc0
Merge pull request #1851 from openova-io/fix/a16-hetzner-ssh-key-sweep
fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)
2026-05-19 00:24:19 +04:00
github-actions[bot]
e78faa986c deploy: update catalyst images to f07312c 2026-05-18 20:23:49 +00:00
e3mrah
f07312c5ae
fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852)
Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179
+ bootstrap-kit pin bump + cloud-init substitute extension, because each
fix is small and they share the same fresh-prov verification cycle.

TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list
networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies
get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the
chroot in-cluster fallback's k8sCache.Factory reflector emitted
continuous `networkpolicies is forbidden` errors at the cluster scope
because only update/patch/delete were granted (existing mutation block)
— the read path was never wired. Mirrors the existing
cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s
NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7).

TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields
configuredRegions / controlPlaneIP / primaryRegion / replicaRegion /
selfDeploymentId / enableHotStandby / qaApplications empty on every
fresh prov. Pre-fix the envsubst placeholders resolved to empty because
nothing wrote them into the bootstrap-kit Kustomization postBuild
substitute map → the chart rendered empty strings → Dashboard
SovereignCard configured-regions chips, Settings page operator-identity,
/api/v1/sovereign/self, and the D31 active-hot-standby gating ALL
silently fell through to default behaviour. Wired via four coordinated
changes:
  - Chart values.yaml gains global.sovereignSelfDeploymentId default
  - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId,
    sovereign.configuredRegions, sovereign.qaApplications mappings
    (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`)
  - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP
    (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION /
    SOVEREIGN_REPLICA_REGION (canonical 4-segment labels),
    SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty),
    SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list),
    QA_APPLICATIONS_YAML (reserved, default `[]`)
  - main.tf: new template inputs sovereign_configured_regions_yaml +
    replica_region_canonical_label (derived from local.secondary_regions),
    threaded into both primary CP and per-secondary-region cloud-init
    templatefile calls

TBD-A10b (issue #1845) — GET
/api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409
kubeconfig-file-missing on fresh prov for every region. Pre-fix the
handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init
PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region
key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk
filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly
call with the bare `cloudRegion` (`?region=hel1`) because that's the
matrix-doc-friendly form. Fall-back resolution order added to
GetKubeconfig: exact-name first (legacy + manual operator PUT), then
`<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test
covers all three paths: exact match, slot-suffix glob, unknown-region
still 409. Closes the regression introduced when PR #1763
(mothership→chroot kubeconfig handover hook) started using the
cloud-init naming convention for fan-out exports.

Closes #1843, Closes #1844, Closes #1845

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:21:38 +04:00
hatiyildiz
6e883c1f8b fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)
Third match pass for SSH keys whose name AND label both drifted from the
Tofu canonical emission. The OpenSSH public_key comment is the one piece
of metadata that survives Console-rename, partial tofu apply, and
out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical
prefix into it at generation.

Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh
t25 provs because previous wipe cycles left it as an orphan. Label-pass
+ name-prefix-pass had no signal once the name/label drifted.

Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by
TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping
t2.omantel.biz cannot delete t20.omantel.biz's SSH key.
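
A boundary-aware prefix match on the key comment might look like the sketch below. The function names echo the test names listed in this commit, but the bodies, the separator set, and the `catalyst-t2` prefix are assumptions chosen to illustrate the t2 versus t20 case; lines carrying authorized_keys options before the key type are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// publicKeyComment returns the comment of a single-line OpenSSH public
// key ("<type> <base64> <comment...>"), or "" when there is none.
func publicKeyComment(pub string) string {
	fields := strings.Fields(pub)
	if len(fields) < 3 {
		return ""
	}
	return strings.Join(fields[2:], " ")
}

// commentMatchesPrefix reports whether comment starts with prefix AND the
// prefix ends on a boundary: either the comment ends there or the next
// character is a separator.
func commentMatchesPrefix(comment, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(comment, prefix) {
		return false
	}
	rest := comment[len(prefix):]
	if rest == "" {
		return true
	}
	// Assumption: the real separator table is not given in the message.
	return strings.ContainsRune("-_.@ ", rune(rest[0]))
}

func main() {
	keys := []string{
		"ssh-ed25519 AAAAC3Nza catalyst-t2-omantel-biz",
		"ssh-ed25519 AAAAC3Nza catalyst-t20-omantel-biz",
		"ssh-ed25519 AAAAC3Nza",
	}
	for _, k := range keys {
		c := publicKeyComment(k)
		fmt.Printf("%q matches catalyst-t2: %v\n", c, commentMatchesPrefix(c, "catalyst-t2"))
	}
}
```

A plain `strings.HasPrefix` would accept the second key as well, which is exactly the cross-tenant deletion the boundary check is meant to rule out.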

Tests:
  - PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match)
  - PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin)
  - PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes)
  - PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched)
  - PublicKeyComment_ParsesFormats (OpenSSH parser unit pins)
  - CommentMatchesPrefix_BoundaryRules (separator rune table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:15:51 +02:00
hatiyildiz
7a2cad9a47 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.177 -> 1.4.178 (auto, Refs TBD-A6) 2026-05-18 19:46:12 +00:00
e3mrah
31b7dc5859
fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847)
PR #1785 (chart 1.4.171) shipped a baseline default-deny
CiliumNetworkPolicy in catalyst-system whose ingress allowlist was
limited to:

  - reserved.ingress: "" (cilium-gateway endpoint)
  - same-namespace catalyst-system Pods
  - host / remote-node / kube-apiserver entities

The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst`
namespace, including the 10-auto-trigger Job whose Pod curls
catalyst-api.catalyst-system.svc.cluster.local:8080 to fire
/api/v1/internal/cutover/trigger.

With #1785 in effect on a FRESH prov, every auto-trigger Pod times
out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and
the D0 auto-redirect to the Sovereign Console never happens — the
operator is stuck on mothership /jobs forever.

Caught by t24 zero-touch verification (2026-05-18):

  handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst'
  ns cannot reach catalyst-api in 'catalyst-system' ns because
  baseline-default-deny CNP allows ingress only from {reserved.ingress,
  catalyst-system ns, host entities}"

The companion symptom on t22 was masked because t22's cutover Job
had already completed before the CNP rolled out — the CNP did not
gate ingress there.

Fix
─────────────────────────────────────────────────────────────────
Add a fourth ingress rule to baseline-default-deny allowing
fromEndpoints in the operator-tunable list
.Values.security.baselineCnp.allowedIngressNamespaces. Defaults:

  - catalyst       — cutover Pods (the load-bearing fix)
  - flux-system    — Helm/Kustomize/Source controllers probing
                     Service readiness for HelmRelease health
                     rollups (worked pre-#1785 via no-CNP default)
  - kube-system    — Cilium operator + hcloud-ccm + CoreDNS that
                     do cluster introspection calls (the
                     reserved.ingress gateway endpoint here is
                     still matched by rule 1's reserved.ingress: ""
                     selector — this rule covers non-gateway Pods)

The list mirrors the existing allowedPlatformNamespaces pattern on
the egress side. No other rule semantics change.

Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177
(PR #1803, SMTP egress) — both are sub-regressions from the same
#1785 baseline-CNP ship.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:45:28 +04:00