Closes #1885 (TBD-A31).
Problem (t28 evidence — A98 + A107 reports, 2026-05-19 00:30Z):
`console.t28.omani.works:443` accepts TCP but resets the TLS handshake. Inspection:
`kubectl get svc -n kube-system cilium-gateway-cilium-gateway` shows
type=ClusterIP with no Hetzner LB. Even with the tofu-provisioned
`hcloud_load_balancer.main` (infra/hetzner/main.tf:955) carrying
443→30443 service-port at the infra layer, the cluster-side hcloud-CCM
has no signal to materialise a parallel Service-level LB for the
auto-generated gateway Service — so operators inspecting kubectl see
a non-LoadBalancer Service and conclude the LB chain is broken.
Fix:
Add `spec.infrastructure.annotations` to the Gateway resource. The
Gateway-API spec says controllers SHOULD propagate these annotations
to any infrastructure resources they create; in Cilium 1.16+ this means
the auto-generated `cilium-gateway-cilium-gateway` Service in kube-system.
hcloud-cloud-controller-manager (bp-hcloud-ccm slot 55) then picks the
annotations up at Service reconcile time and provisions a Hetzner LB.
Annotations (mirrors clustermesh-apiserver block in 01-cilium.yaml):
- load-balancer.hetzner.cloud/name = <slug>-<region>-gateway
- load-balancer.hetzner.cloud/location = <Hetzner DC>
- load-balancer.hetzner.cloud/type = lb11
- load-balancer.hetzner.cloud/use-private-ip = "false" (DoD A2 — public IPs always)
- load-balancer.hetzner.cloud/disable-private-ingress = "true"
- load-balancer.hetzner.cloud/health-check-protocol = tcp
- load-balancer.hetzner.cloud/health-check-port = "30443"
- load-balancer.hetzner.cloud/health-check-interval = 15s
- load-balancer.hetzner.cloud/health-check-timeout = 10s
- load-balancer.hetzner.cloud/health-check-retries = "3"
Per-region segmentation: SOVEREIGN_FQDN_SLUG + SOVEREIGN_REGION_KEY in
the LB name so each multi-region peer's cilium-gateway gets its own
public LB (Hetzner LBs are unique-by-name; duplicate-name allocations
collapse to the first-created instance, hiding the LB for every
subsequent region).
Wiring: 3 substitute vars (SOVEREIGN_FQDN_SLUG, SOVEREIGN_REGION_KEY,
HCLOUD_LB_LOCATION) threaded into the sovereign-tls Kustomization's
postBuild.substitute block. These mirror the same vars already passed
to bootstrap-kit's Kustomization for the clustermesh-apiserver LB block
in 01-cilium.yaml apiserver.service.annotations, so the configuration
boundary is symmetric across the gateway LB and the clustermesh LB.
Memory rules respected:
- A2 (PUBLIC IPs for inter-region) — use-private-ip=false
- feedback_overlap_provs_dont_serialize_wait (no provisioning gate)
- feedback_subagents_inherit_design_system (no new architectural seam,
reuses existing Gateway-API + hcloud-CCM contracts)
Validation:
$ kubectl kustomize clusters/_template/sovereign-tls/ | grep -A 30 'kind: Gateway'
→ renders all 10 Hetzner LB annotations under spec.infrastructure
→ ${SOVEREIGN_FQDN_SLUG}/${SOVEREIGN_REGION_KEY}/${HCLOUD_LB_LOCATION}
substituted at Flux apply time
Acceptance criteria (per issue):
- kubectl get svc -n kube-system cilium-gateway-cilium-gateway shows
type=LoadBalancer with external IP (after fresh prov + handover)
- curl -skI https://console.<fqdn>/ returns HTTP 200
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-1.4.183 the chart pinned every catalyst-system HTTPRoute to
`sectionName: https` (via values.yaml default), but the Cilium Gateway
template (clusters/_template/sovereign-tls/cilium-gateway.yaml +
infra/hetzner/main.tf locals.parent_domains_listeners_yaml) names HTTPS
listeners:
- SINGLE parent zone → bare `https` / `http`
- MULTIPLE parent zones → unique `https-<sanitised-zone>` /
`http-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`)
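The naming rule above can be sketched in Go. This is an illustration only: the helper name `listenerName` is hypothetical, and "sanitised" is assumed to mean "dots replaced by hyphens", which is inferred from the `https-omani-works` example rather than from the template itself.

```go
package main

import (
	"fmt"
	"strings"
)

// listenerName returns the listener name for one parent zone.
// Assumption: sanitising a zone means replacing "." with "-"; the real
// template may apply further rules.
func listenerName(proto, zone string, zoneCount int) string {
	if zoneCount <= 1 {
		return proto // single parent zone: bare `https` / `http`
	}
	return proto + "-" + strings.ReplaceAll(zone, ".", "-")
}

func main() {
	zones := []string{"omani.works", "omani.homes"}
	for _, z := range zones {
		fmt.Println(listenerName("https", z, len(zones)))
	}
	// A route pinned to `sectionName: https` only matches this case.
	fmt.Println(listenerName("https", "omani.works", 1))
}
```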
On t28 (omani.works primary + omani.homes SME pool, A107 D29 walk
2026-05-19) every public HTTPRoute reported `Accepted=False
NoMatchingListener` and console.<sov> / api.<sov> / marketplace.<sov> /
*.<sov> returned 404 / connection-refused. Single-zone Sovereigns were
unaffected because Gateway used bare `https`.
Fix (Option C - omit sectionName): default `ingress.gateway.parentRef.
sectionName=""` in values.yaml. The existing `{{- with .Values.ingress.
gateway.parentRef.sectionName }}` guards in templates/httproute.yaml,
templates/services/catalog/httproute.yaml, and templates/sme-services/
marketplace-routes.yaml skip the field entirely when empty. Cilium
Gateway then matches each route to listeners by hostname filter - every
listener has `hostname: *.<zone>`, so `console.<sov-fqdn>` auto-attaches
to the listener whose hostname matches (which is precisely the listener
whose certificateRef terminates the right wildcard cert).
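A minimal sketch of the hostname-based attachment described above, assuming concrete route hostnames and listeners that carry either an exact or a `*.`-prefixed wildcard hostname. The types and function names are hypothetical, and this is not Cilium's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// listener is a simplified stand-in for a Gateway listener.
type listener struct{ Name, Hostname string }

// hostnameMatches reports whether a route hostname falls under a listener
// hostname. A wildcard "*.zone" matches any name below the zone but not
// the zone apex itself.
func hostnameMatches(listenerHost, routeHost string) bool {
	listenerHost = strings.ToLower(listenerHost)
	routeHost = strings.ToLower(routeHost)
	if listenerHost == "" {
		return true // a listener without a hostname accepts every host
	}
	if strings.HasPrefix(listenerHost, "*.") {
		suffix := listenerHost[1:] // e.g. ".omani.works"
		return len(routeHost) > len(suffix) && strings.HasSuffix(routeHost, suffix)
	}
	return listenerHost == routeHost
}

// attach returns the names of all listeners a route hostname attaches to
// when the route omits sectionName.
func attach(listeners []listener, routeHost string) []string {
	var names []string
	for _, l := range listeners {
		if hostnameMatches(l.Hostname, routeHost) {
			names = append(names, l.Name)
		}
	}
	return names
}

func main() {
	ls := []listener{
		{"https-omani-works", "*.omani.works"},
		{"https-omani-homes", "*.omani.homes"},
	}
	fmt.Println(attach(ls, "console.omani.works"))
	fmt.Println(attach(ls, "shop.omani.homes"))
}
```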
This is the canonical pattern already in use elsewhere in the codebase:
- core/controllers/sandbox/internal/gitops/manifests.go (sandbox)
- core/controllers/organization/internal/controller/tenant_route.go
(per-Org tenant routes)
- products/catalyst/chart/templates/sme-services/tenant-public-routes.yaml
Preflight CI (.github/workflows/preflight-cilium-httproute.yaml) explicitly
overrides `--set ingress.gateway.parentRef.sectionName=http` because it
ships a Gateway with an HTTP-only listener named `http`; that override
path is preserved unchanged.
helm template render verifies all 5 affected HTTPRoutes
(catalyst-ui, catalyst-api, catalyst-catalog, marketplace,
tenant-wildcard) now emit a `parentRefs` block with name+namespace only,
no `sectionName`. helm lint clean.
Chart bumped 1.4.182 -> 1.4.183.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1875 added `sovereign-tls` to the bp-self-sovereign-cutover dependsOn
in both the chart AND scripts/expected-bootstrap-deps.yaml. PR #1879
reverted the chart half (because HelmRelease.dependsOn cannot reference a
Flux Kustomization — helm-controller logs "not found", chart parks
Stalled, handover never fires).
The scripts/expected-bootstrap-deps.yaml half was left behind, so the
dep-graph-audit job now fails on origin/main with drift between the
declared expectation (`bp-gitea bp-harbor sovereign-tls`) and the chart
on disk (`bp-gitea bp-harbor`).
Scrub:
- Remove `sovereign-tls` from the cutover's depends_on list.
- Remove the stale `sovereign-tls` placeholder slot 0t entry (no HR
file exists for it — it is a Flux Kustomization).
- Replace the obsolete comment block with a short note explaining the
PR #1875 / #1879 history so the next reader doesn't re-add it.
Verified: `bash scripts/check-bootstrap-deps.sh` -> "OK: bootstrap-kit
dependency graph audit PASSED" with Drift: 0, Cycles: 0.
Verified: `helm template platform/self-sovereign-cutover/chart` -> exit 0.
Refs #1871
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1875 added `- name: sovereign-tls` to bp-self-sovereign-cutover.dependsOn
to gate the URL rewrite behind Gateway TLS readiness. That fix was
unresolvable: Flux HelmRelease.dependsOn can ONLY reference other
HelmReleases, but sovereign-tls is a Flux Kustomization. helm-controller
verbatim on t27 fresh-prov (A84 empirical test, 2026-05-18):
helmreleases.helm.toolkit.fluxcd.io "sovereign-tls" not found
bp-self-sovereign-cutover sat forever in dependency-wait, cutover never
fired, handover never fired.
This commit moves the readiness check INTO the chart: chart 0.1.32 adds
a Phase -1 (gateway-wait) at the top of the Step-06 helmrepository-
patches Job. The Job polls `gateway.networking.k8s.io/v1.Gateway
cilium-gateway` in `kube-system` until status.conditions[Programmed]=
True, with a 30 min default deadline. If the Gateway never programs,
the Job exits 1 (surfacing the block to the operator) rather than
rewriting URLs into a Gateway that won't answer TLS.
RBAC: ClusterRole gains gateway.networking.k8s.io/gateways
{get,list,watch}.
Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml`:
- reverts the bad PR #1875 `- name: sovereign-tls` dependsOn entry
- bumps chart pin 0.1.31 -> 0.1.32
Tests: cutover-contract Case 20 guards the Phase -1 block + RBAC.
helm-template confirms the Phase -1 wait + env (GATEWAY_NAMESPACE=
kube-system, GATEWAY_NAME=cilium-gateway, GATEWAY_WAIT_TIMEOUT_
SECONDS=1800) renders into the cutover-step-06-helmrepository-patches
ConfigMap.podSpec.
Closes #1871
Refs #1875 (supersedes)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A84 empirical finding (t27 / PR #1875): HelmRelease.spec.dependsOn
strictly references OTHER HelmReleases — it cannot reference Flux
Kustomizations or other resource kinds. PR #1875 added the `sovereign-tls`
Kustomization to a HelmRelease's dependsOn; helm-controller logged
`helmreleases "sovereign-tls" not found` and retried every 30s forever.
Adds a critical sub-rule to principle #14 documenting the cross-kind
limitation, the recommended workaround (wait-HelmRelease shim or move the
gated workload into a Kustomization), and the verbatim helm-controller
error message so the next regression is greppable.
Doc-only.
Co-authored-by: hatiyildiz <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Stalwart trips its rate-limit and returns "503 5.5.1", the
notification service previously surfaced the error immediately to the
events consumer, which kept hammering on the next event and prolonged
the rate-limit window.
Now Mailer.Send detects 503 5.5.1 specifically (via *textproto.Error
unwrap + canonical-code substring fallback) and retries up to 3 times
with a 60s backoff between attempts. The backoff is configurable via
SMTP_RETRY_BACKOFF env var (Go duration string OR bare integer seconds;
30s floor to keep the rate-limiter happy). Non-rate-limit errors
(auth failure, transient I/O, etc.) bubble up unchanged so the
consumer can NACK / dead-letter as appropriate.
Adds smtp_test.go covering:
- single rate-limit -> retry -> success
- exhausted retries -> wrapped error preserving *textproto.Error
- non-rate-limit error -> immediate pass-through, no backoff
- isRateLimit detection (textproto, multiline 503-5.5.1, negative cases)
- parseRetryBackoff env-var forms + 30s floor + zero/garbage fallbacks
No credential touches: this is a retry-hardening fix only; the
chart-side SMTP creds path is already GREEN (see #1793 A80 diagnosis).
Refs #1793
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-A24 cutover↔gateway circular deadlock — discovered on t26 zero-touch
prov 2026-05-18 (99bb823cb0513f4b):
1. bp-catalyst-platform HR installs at v1.4.179 (Ready=True)
2. bp-self-sovereign-cutover HR Ready=True (deps gitea+harbor only)
3. Step-06 rewrites all 50 HelmRepository URLs ghcr.io → registry.<fqdn>
4. bp-catalyst-platform flips Ready=False (TLS handshake EOF — no Gateway)
5. sovereign-tls Kustomization blocked on bootstrap-kit Ready=True
6. bootstrap-kit blocked on bp-catalyst-platform Ready=True
7. Full deadlock — no Gateway, no handover, every UI route 404
Fix: add `sovereign-tls` as a third dependsOn entry on the cutover HR so
Flux waits for the Cilium Gateway to be serving TLS before the URL
rewrite fires. Same architectural shape as Wave 7 bp-hcloud-csi removal
(#1610) — chicken-and-egg between bootstrap-kit and sovereign-tls broken
by ordering the dangerous-side-effect chart AFTER the Gateway is ready.
Also updates scripts/expected-bootstrap-deps.yaml so the dep-graph audit
(check-bootstrap-deps.sh) recognises the new edge: slot 6a gets the
extra `sovereign-tls` entry, plus a new "slot 0t" entry declaring
sovereign-tls as a known node (no HR file on disk → audit reports it as
`deferred`, info not error; Phase 4 cycle detection accepts it as a
zero-in-degree root).
Verified locally:
- yq parses spec.dependsOn → 3 entries (bp-gitea, bp-harbor, sovereign-tls)
- scripts/check-bootstrap-deps.sh: 50 present, 65 declared, 0 drift, 0 cycles
- helm template platform/self-sovereign-cutover/chart: exit 0 (smoke OK)
Refs: t26 ID 99bb823cb0513f4b, A55 diagnostic, A67 diagnosis, slot 17a
comment in clusters/_template/bootstrap-kit/kustomization.yaml documenting
the same chicken-and-egg shape.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml,
bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish
commits land on main, but it cannot detect the "chart bumped but never
published" failure mode: the bootstrap-kit pin points at a chart
version that GHCR never received because blueprint-release.yaml
failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep,
runner cancellation, transient GHCR push 5xx).
Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180
and 1.4.181 were "lost" during the TBD-A20 scanner break window
(21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS
while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist
until A58 manually re-fired the workflow via dispatch. Fresh
Sovereigns silently fell back to the last working tag.
What this adds
- scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and
optional `--ghcr-org <org>`). For every chart pinned in the kit, it
lists ghcr.io/<org>/<chart> tags via `gh api
/orgs/<org>/packages/container/<chart>/versions --paginate`, then
asserts the pinned version appears. Exits 1 on any missing tag.
- A per-chart tag cache avoids redundant paginations.
- .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now
passes `--check-ghcr` on `push` to main + `workflow_dispatch`
(PR mode stays `--changed-only` and skips GHCR — PRs cannot publish
to GHCR anyway). The job stays `continue-on-error: true` under the
same observational umbrella as the existing post-merge full sweep
so a transient API blip cannot red-flag every chart bump; the
missing-tag list still surfaces on the run summary for operator
attention.
- Job grants `packages: read` so the workflow GITHUB_TOKEN can list
private package versions.
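The `--check-ghcr` logic (list tags per chart, cache per chart, assert the pinned version is present) can be sketched as follows. The actual check lives in a shell script that calls `gh api`. The Go below only illustrates the control flow, and the `pin` type, `missingTags`, and the injected `listTags` callback are hypothetical.

```go
package main

import "fmt"

// pin is one chart/version pair from the bootstrap kit.
type pin struct{ Chart, Version string }

// missingTags returns every pin whose version is absent from the registry.
// listTags is called at most once per chart thanks to the cache.
func missingTags(pins []pin, listTags func(chart string) ([]string, error)) ([]pin, error) {
	cache := map[string]map[string]bool{}
	var missing []pin
	for _, p := range pins {
		tags, ok := cache[p.Chart]
		if !ok {
			list, err := listTags(p.Chart)
			if err != nil {
				return nil, fmt.Errorf("list tags for %s: %w", p.Chart, err)
			}
			tags = make(map[string]bool, len(list))
			for _, t := range list {
				tags[t] = true
			}
			cache[p.Chart] = tags
		}
		if !tags[p.Version] {
			missing = append(missing, p)
		}
	}
	return missing, nil
}

func main() {
	// Fake registry standing in for the GHCR package-versions API.
	registry := map[string][]string{"bp-example": {"1.0.0", "1.1.0"}}
	list := func(chart string) ([]string, error) { return registry[chart], nil }

	missing, err := missingTags([]pin{{"bp-example", "1.1.0"}, {"bp-example", "9.9.9"}}, list)
	fmt.Println(missing, err) // a non-empty list would map to exit 1
}
```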
Verification (origin/main snapshot, 2026-05-19)
- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags
present — PASS exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13
both set to a non-existent 99.99.99, the script exits 1 with
`GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the
remediation hint pointing at `gh workflow run
blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean
exit 0 with the existing "nothing to check" message.
Refs #1872, #1864, #1856.
Closes #1872
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three new inviolable principles surfaced by 2026-05-18 incidents:
- #12 Never validate against the local working tree — A19 false-positive
(verifier grepped a feature-branch working copy with unstaged edits,
reported "already on main" when it was not).
- #13 Chart-pin bumps must match a GHCR tag that exists — TBD-A48 / PR #1869
drift: pin to bp-self-sovereign-cutover:0.1.4 landed on main while the
chart artifact had not been published, causing hours of ImagePullBackOff.
- #14 Cutover-style HRs that rewrite HelmRepository URLs must dependsOn
Gateway readiness — TBD-A24 / PR #1871: bp-self-sovereign-cutover flipped
URLs to local registry before Cilium Gateway was serving TLS, deadlocking
the cluster.
Doc-only change; bumps the front-matter Updated date to 2026-05-18.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the auto-bump-pin miss flagged in #1864.
The Blueprint Release workflow has been in `startup_failure` since
PR #1858 (commit cf35b4a) merged at 21:04:22Z. The lockstep step's
multi-line double-quoted shell string inside a `run: |` block-scalar:
if [ ... ]; then
msg="deploy(...) (auto, Refs TBD-A6)
<-- literal blank line
Also locksteps platform blueprint.yaml ..." <-- column 1, no indent
terminates the block-scalar early: the blank line alone is harmless, but
the YAML scanner ends the scalar at the first non-blank line indented
less than the block, then parses that column-1 line as a new top-level
mapping key, which fails because the line has no `:`. The whole workflow
file is rejected at workflow-startup time. Verified with
`python3 -c yaml.safe_load(...)` (raises
`ScannerError: could not find expected ':' line 815`) and by `gh api
.../actions/runs/26060392136` returning `conclusion=failure,
status=completed, jobs: []` for every push since cf35b4a.
Consequence: no chart bump since cf35b4a has triggered the TBD-A6
auto-bump-pin or the TBD-A20 blueprint.yaml lockstep. PR #1865 was
the manual catch-up for bp-newapi (1.4.20 -> 1.4.21); without this
fix every future chart publish will drift the same way.
Fix: build the multi-line commit message with `printf '%s\n\n%s'`
so the string source stays on physically-indented lines that the
YAML block-scalar accepts. Behaviour is identical — same commit
subject, same blank line, same body — only the construction shape
changes. Added a 9-line comment naming the seam so future authors
don't reintroduce the same trap.
Verified locally:
* `python3 -c yaml.safe_load(open(...))` succeeds, parses 24
build-job steps.
* `CHART_NAME=bp-newapi PREV_VERSION=1.4.20 CHART_VERSION=1.4.21
BP_PREV_VERSION=1.4.20 bash -c "$(printf ...)"` emits the
canonical "deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 ->
1.4.21 (auto, Refs TBD-A6)\n\nAlso locksteps platform ..." body.
Refs #1864.
Refs PR #1858 (TBD-A20 lockstep that introduced the YAML defect).
Closes #1864
Manual catch-up. The auto-bump-pin step (TBD-A6) did NOT run for the
1.4.20 -> 1.4.21 chart bump at commit 8b33188 because the Blueprint
Release workflow has been stuck in **startup_failure** since PR #1858
(commit cf35b4a) merged at 21:04:22Z. The workflow YAML at
.github/workflows/blueprint-release.yaml lines 812-814 has a multi-line
double-quoted string inside a `run: |` block-scalar whose continuation lines
are unindented:
msg="deploy(${CHART_NAME}): bump bootstrap-kit pin ${PREV_VERSION} -> ...
(auto, Refs TBD-A6)
Also locksteps platform blueprint.yaml spec.version ${BP_PREV_VERSION} ..."
YAML treats the unindented line as the end of the block-scalar and the
next line as a new mapping key (which it isn't), so the entire workflow
file fails the GitHub Actions YAML validator at workflow-start time.
Every push since cf35b4a has produced a run with `conclusion=failure,
status=completed, jobs=[]` (zero jobs spun up).
Evidence:
* gh api repos/openova-io/openova/actions/runs/26060392136 ->
'This run likely failed because of a workflow file issue.'
* Same for every subsequent run including the chart 1.4.21 publish
(no run was even created for 8b33188 because the workflow file
couldn't parse).
* `python3 -c 'yaml.safe_load(open(...))'` raises
`ScannerError ... could not find expected ':' line 815`.
This PR is the ONE-LINE catch-up so the pin drift is closed. A
companion PR fixes the workflow YAML so future chart bumps auto-bump
the pin again.
Verifies the publisher-side wrapper struct in CreateOrg
(handlers.go:248-252) marshals to bytes the provisioning consumer
in organization_create.go can decode flat with owner_email as a
sibling field. Pairs with TestHandleTenantCreated_FullTenantStructDecode
on the consumer side — together they pin BOTH ends of the contract
so a refactor that nests under "tenant" or renames the tag fails
in CI rather than at staging.
Refs #1829 (D29).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
PR #1626 wired the publish-leg (tenant + billing → NATS JetStream
catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster
controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow
even though the publish leg shipped.
This PR adds:
- core/controllers/pkg/natsbus: minimal JetStream subscriber shared by
Group-C controllers. Self-contained (no dep on core/services/shared
which pulls in franz-go/Kafka the controllers never touch).
- core/controllers/organization/internal/controller/nats_bridge.go:
subscribes to catalyst.tenant.created + catalyst.billing.order.placed,
patches openova.io/last-event-observed-at + ...-subject annotations on
the matching Organization CR. The annotation patch triggers an
informer event → controller-runtime enqueues Reconcile within ~50ms
instead of waiting for the 30s requeue fallback.
- core/controllers/sandbox/internal/controller/nats_bridge.go: same
pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR
using the same `sandbox-<sanitised-email>` naming convention
tenant-service's SandboxOrchestrator (PR #1633) writes under.
- main.go wiring in both controllers reads NATS_URL from env. Unset =
log "consume-leg disabled" + continue (informer requeue fallback
intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS
is an accelerator, not the only path.
Idempotency: ev.Timestamp is the broker-side timestamp, so duplicate
JetStream delivery produces a byte-stable annotation patch and
controller-runtime does NOT enqueue a redundant Reconcile.
Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the
happy path, the no-matching-CR soft miss, duplicate-envelope no-churn,
malformed JSON poison-pill, and the publish-side ↔ consume-side name
derivation lockstep for Sandbox CRs.
HARD CONSTRAINT respected: no credential mutations — bridges read only
the envelope + the target CR, never Secrets or Keycloak SA creds.
Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LoadSMETenantParentDomainsFromEnv's hardcoded-fallback only seeded
2 entries (omani.works + omani.trade), but the marketplace UI
(core/marketplace/src/components/AddonsStep.svelte) lists 4
(omani.homes + omani.rest + omani.trade + omani.works) and
core/services/domain/store.AllowedTLDs has the same canonical 4.
Result: a customer picking .omani.homes or .omani.rest in /addons
sailed through the picker but got 422 invalid-parent-domain at
catalyst-api signup because FindParentDomain didn't recognise the
TLD.
This widens the seed to all 4 canonical .omani.X entries so the
backend pool, the marketplace picker, and AllowedTLDs all agree.
NSFlipReady=true on every entry (the zones are already delegated
to the Sovereign's PowerDNS at gTLD level — pdmFlipNS
short-circuits via nsAlreadyMatches for Day-2 re-adds).
Updated TestLoadSMETenantParentDomainsFromEnv_StubFallback
(`pool != 4`) and added 3 fresh tests in
sovereign_parent_domains_test.go covering: canonical 4-entry seed,
OTECH primary + 4 sme-pool composition, env-override path without
fallback leakage.
Closes #1830 (Part 1: Day-1 pool seed).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Closes the voucher-checkout → Organization-CR loop that was missing
from the convergence chain. Before this PR the flow stalled at:
voucher accept → tenant-service CreateOrg
→ writes Tenant row, publishes tenant.created
↓ (DROP — no consumer)
provisioning consumer switch
(case "tenant.created" missing — A26 verifier pinpointed this)
↓
organization-controller has nothing to reconcile
↓
no vCluster / Keycloak group / Gitea org / per-tenant HTTPRoute
A26 verifier on t22: zero Organization CRs after 168min despite the
tenant row existing. Closes #1722. Unblocks D29 zero-touch tenant
provisioning (Refs #1829).
Changes:
- core/services/tenant/handlers/handlers.go
Enrich tenant.created payload with owner_email from JWT claims so
the provisioning consumer can mint the Organization owner roster
without a second store round-trip. Wrapper struct embeds *Tenant
so existing decoders are wire-compatible.
- core/services/provisioning/handlers/consumer.go
Add case "tenant.created" to the dispatch switch.
- core/services/provisioning/handlers/organization_create.go
New handler. Validates slug + owner_email, builds cluster-scoped
Organization CR (apiVersion orgs.openova.io/v1), POSTs via
k8sRequest. Idempotent on 409 AlreadyExists (NATS redelivery
safe). 404 → operator-misconfiguration error event. 5xx → return
err so broker redelivers. Inviolable Principle #4: parent domain
flows env → Handler.TenantParentDomain → CR (with per-tenant
parent_domain payload override for multi-pool Sovereigns).
- core/services/provisioning/handlers/organization_create_test.go
Unit tests: malformed payload, invalid slug (incl. path-traversal),
missing owner_email, full Tenant decode, default-fill paths, empty
parent domain mints anyway, payload-shape pinning. All exercised
with KUBERNETES_SERVICE_HOST scrubbed so no real apiserver dial.
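The apiserver status handling described for organization_create.go can be summarised as a small decision table. This Go sketch is illustrative only: the `action` type and `classify` are hypothetical, acknowledging the message after a 404 error event is an assumption, and statuses the message does not mention are treated like 5xx here.

```go
package main

import (
	"fmt"
	"net/http"
)

// action is what the consumer does with the NATS message afterwards.
type action string

const (
	ack          action = "ack"
	ackWithEvent action = "ack+error-event" // assumption: 404 is not retried
	redeliver    action = "redeliver"
)

// classify maps the create response status to a consumer action.
func classify(status int) action {
	switch {
	case status >= 200 && status < 300:
		return ack // CR created
	case status == http.StatusConflict:
		return ack // 409 AlreadyExists: idempotent under redelivery
	case status == http.StatusNotFound:
		return ackWithEvent // operator misconfiguration
	default:
		return redeliver // 5xx (and, by assumption, anything else)
	}
}

func main() {
	for _, s := range []int{201, 409, 404, 503} {
		fmt.Println(s, classify(s))
	}
}
```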
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-test root cause + fix:
1. TestPinIssue_ConcurrentRapidFireRateLimit (TOCTOU race) — pinStore.canIssue
and put() ran under separate mutex acquisitions; three concurrent
/pin/issue goroutines all observed "no entry", passed canIssue, then
raced EnsureUser against Keycloak. Replaced with atomic tryReserve()
that check-and-stamps under a single lock; HandlePinIssue calls
store.drop(email) on EnsureUser/generatePin/no-KC failure to roll back
the reservation so the 60s cooldown doesn't punish operator retries.
2. TestFinaliseHandover_FullFlow — test fixture drift after PR #1487 keyed
the tofu workdir by DeploymentID (provisioner.workdirKey). Test still
wrote workdir at filepath.Join(tmp, "tenant-y-omani-works") (the legacy
sovereign-name slug); FinaliseHandover handler uses `id`. Updated test
to write workdir at filepath.Join(tmp, "dep-full") so it matches the
actual prod lookup path. Same fix for the receiver-failure sibling test.
3. TestEnsureOwnerUserAccess_CreatesCanonicalCR — drifted twice: (a) test
queried Namespace("") but the t134 D21 fix moved the CR to
userAccessOwnerNamespace ("catalyst-system") because useraccesses is
namespaced per the XRD claimNames block; (b) test asserted
spec.applications = [{app:"*", role:"admin"}] but the t135 D21 fix
switched to spec.tierRoleRef = "openova:tier-owner" (XRD pattern
rejects `app: "*"`). Updated test to query catalyst-system namespace
and assert tierRoleRef + applications-must-be-absent.
4. TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty — production
unstructuredToUserAccess left Spec.Applications=nil when the CR has no
spec.applications, which json-marshals to `null` and crashes the React
UI's items.map() (qa-loop iter-4 users-page-null-map regression).
Initialize Spec.Applications = []userAccessAppGrantBody{} in the
struct literal so the empty-slice contract is preserved.
5. TestHandleWhoami_PinSessionRBACClaims — whoamiInjectTierRoles
unconditionally appended every inherited tier role even when the
upstream JWT already shaped the role list authoritatively. A
PIN-minted session carrying tier=owner + realm_access=[catalyst-owner]
was getting fanned out to all 5 inheritance entries, which the
route-guard couldn't reconcile. Now: if the operator's own
catalyst-<tier> role is already present, the projection returns early
and preserves the upstream list. TestHandleWhoami_ProjectsTierToRealmRoles
still passes (empty input → still injects inheritance) and
TestWhoamiInjectTierRoles_PreservesExistingRoles still passes
(idempotent — same input out).
6. TestHandleWhoami_NoRBACOmitsFields — whoamiResponse.RealmAccess was a
struct value with `omitempty`, which encoding/json does NOT honour for
structs (only pointers/slices/maps until Go 1.24's `omitzero`). A
pre-RBAC session always serialized realm_access:{} on the wire,
breaking the legacy {email,sub,verified} contract. Changed to
*whoamiRealmAccess so omitempty actually drops the field; HandleWhoami
only allocates the pointer when claims carry roles, and drops it back
to nil if the projection ended up empty.
Test status after fix (worktree off origin/main):
- All 6 target tests PASS
- Full TestPin*, TestHandleWhoami*, TestWhoamiInjectTierRoles*,
TestEnsureOwnerUserAccess*, TestOwnerUserAccessName*, TestListUserAccess*,
TestFinaliseHandover*, TestUnstructuredToUserAccess* PASS (57 tests)
- go test ./... -p 1 across the entire catalyst-api module PASS
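The two encoding/json behaviours behind items 4 and 6 are easy to reproduce in isolation. The types below are cut-down hypothetical stand-ins, not the real whoamiResponse or userAccess structs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type realmAccess struct {
	Roles []string `json:"roles,omitempty"`
}

// byValue mirrors the pre-fix shape: omitempty on a struct value.
type byValue struct {
	Email       string      `json:"email"`
	RealmAccess realmAccess `json:"realm_access,omitempty"`
}

// byPointer mirrors the post-fix shape: omitempty on a pointer.
type byPointer struct {
	Email       string       `json:"email"`
	RealmAccess *realmAccess `json:"realm_access,omitempty"`
}

type grants struct {
	Applications []string `json:"applications"`
}

func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(byValue{Email: "a@example.com"}))      // realm_access:{} stays
	fmt.Println(mustJSON(byPointer{Email: "a@example.com"}))    // realm_access dropped
	fmt.Println(mustJSON(grants{}))                             // nil slice -> null
	fmt.Println(mustJSON(grants{Applications: []string{}}))     // empty slice -> []
}
```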
Pre-existing parallelism flakes (TestGetKubeconfig_ReadsFromPathPointer /
TestPhase1Started_GuardPreventsDoubleWatch / TestPodRestart_*) exist on
baseline too — write to /var/lib/catalyst/ from a goroutine that outlives
test scope. Out of scope for this PR; tracked separately.
Closes #1853
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.
This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.
Workflow changes (.github/workflows/blueprint-release.yaml):
* New step `bump_blueprint` after `bump_pin` — locates
${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
(handles both platform-leaf and products-umbrella conventions),
filters to kind:Blueprint (defensive against CRD yaml at the
products/catalyst/chart/crds path), reads current spec.version at
2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
* Commit step renamed to "Commit + push bootstrap-kit pin bump +
blueprint.yaml lockstep"; stages both files, single commit, with
convergent retry on conflict.
* Summary block surfaces both bumps separately.
Regression test (tests/e2e/bootstrap-kit/main_test.go):
* New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
platform/* and products/*, discovers every Blueprint manifest with
a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
* Failure messages name the file, drift direction, and the exact sed
command to fix — drift remediation is mechanical.
Drift cleanup (mandatory companion, same shape as A17/#1855):
26 Application-Blueprint blueprints whose spec.version had been left
at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to
Chart.yaml as authoritative. All currently surface in the new sweep
test; without the cleanup the test would block this PR (and every
subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
trivy, valkey, velero, vpa, products/dmz-vcluster.
After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.
Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because
the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the
initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image
to cadc7b5") which never triggered the blueprint-release workflow. As a
result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck
InProgress and the bootstrap-kit kustomization fails its health check.
GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via
`/orgs/openova-io/packages/container/bp-velero/versions`.
Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate
stays GREEN) so blueprint-release.yaml fires on this push, publishes
`ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a
no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1
subchart, same templates, same values.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an
empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay
30s, period 10s, failureThreshold 3: third consecutive failure at
~50s) SIGKILLs the binary mid-migration on every restart. The
28-CREATE-TABLE +
2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with
sslmode=require — well over the kill window. On t22 chart 1.4.18 the
`newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff
restarts because every kill happened before the GORM connection
pool's first wire write completed (pg_stat_activity on the CNPG
primary showed no newapi-user connections).
Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2):
[SYS] ... database migration started ← last log line
exitCode=2 finishedAt-startedAt = 50s exactly
Readiness probe: connect: connection refused 10.42.0.185:3000
DB: psql \dt → "Did not find any relations"
CNPG: pg_stat_activity → no `newapi` user connections
Fix (canonical k8s pattern, Inviolable Principle #16 — own the
seam): add a startupProbe that gates BOTH liveness and readiness
until the binary opens :3000/api/status. Budget 30 × 10s = 5 min,
comfortably above the observed 60-120s ceiling and below operator-
impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is
unchanged but only activates after startupProbe success per kubelet
semantics. The probe block is operator-tunable via
`.Values.newapi.probes.startup.*`; setting it to `null` skip-renders
the block so overlays against a pre-seeded DB can opt out
(Inviolable Principle #4).
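The timing argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `probe` type and its methods are hypothetical, the startupProbe is assumed to have no initial delay, and probe timeouts and kubelet jitter are ignored.

```go
package main

import "fmt"

// probe holds the three kubelet probe fields that matter for timing.
type probe struct {
	initialDelaySeconds int
	periodSeconds       int
	failureThreshold    int
}

// failureDeadline is the time (seconds after container start) of the
// failureThreshold-th consecutive failed probe, i.e. when kubelet acts.
func (p probe) failureDeadline() int {
	return p.initialDelaySeconds + p.periodSeconds*(p.failureThreshold-1)
}

// startupBudget is the upper bound the commit message uses for a
// startupProbe: failureThreshold × periodSeconds.
func (p probe) startupBudget() int {
	return p.failureThreshold * p.periodSeconds
}

func main() {
	liveness := probe{initialDelaySeconds: 30, periodSeconds: 10, failureThreshold: 3}
	startup := probe{periodSeconds: 10, failureThreshold: 30}
	slowestMigration := 120 // upper end of the observed 60-120s range

	fmt.Println("liveness kill at (s):", liveness.failureDeadline())
	fmt.Println("startup budget (s):", startup.startupBudget())
	fmt.Println("migration fits startup budget:", slowestMigration < startup.startupBudget())
	fmt.Println("migration fits liveness alone:", slowestMigration < liveness.failureDeadline())
}
```

With these numbers the liveness probe alone acts at 50s, below even the fastest observed migration, while the startup budget of 300s leaves more than a 2x margin over the slowest one.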
Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so
freshly franchised Sovereigns pull the new chart on next prov.
Render tested (smoke + override): startupProbe present with
failureThreshold=30 in defaults; suppressed when startup: null.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main:
1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields
asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml
version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager
(1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak
(1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml),
gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin
-> Sovereign install); blueprint.yaml gets resynced down/up to match.
2. pin-sync-audit on push — full-sweep audit races the blueprint-release
auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift
until the auto-bump bot commits the pin update ~60s later; the bot
push (GITHUB_TOKEN convention) does not retrigger this workflow, so
the failure remains in run history. Fix: set continue-on-error: true
on push/workflow_dispatch events (PR remains blocking via
--changed-only). The full-sweep output still surfaces drift on the
run summary; it just doesn't fail the overall run while the ~60s
heal window is open. Documented inline in the job header.
Net effect: every push to main re-runs cleanly green. The 13 pre-existing
drifts called out in the existing job comment will continue to heal as
each lagging chart gets its next bump (auto-bump hook + this PR's
manifest-validation alignment).
Refs PRs #1666, #1687, #1695, #1698, #1706, #1707 (the manual collector PRs
TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence
to blueprint.yaml versions which the test asserts but the auto-bump hook
does not yet update).
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
The May 2026 baseline-CNP cascade shipped three production bugs in
two days because nothing in CI rendered the chart and asserted on the
rendered CiliumNetworkPolicy shape:
- #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system
with WORLD egress restricted to TCP/443 only AND no ingress allow
for the `catalyst` namespace.
- #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after
/api/v1/auth/pin-request 502'd on every fresh onboarding.
- #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24
fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s.
This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh —
a pure helm-template + grep/awk contract gate matching the existing
platform/self-sovereign-cutover/chart/tests/cutover-contract.sh
pattern. The Blueprint Release workflow already runs every *.sh under
chart/tests/ as a publish gate (see blueprint-release.yaml line 384),
so the gate is wired automatically and fails publish BEFORE the OCI
artifact reaches a Sovereign.
13 cases asserted:
1. baseline-default-deny CNP renders + is namespaced to catalyst-system
2. egress allows SMTP submission 587/TCP (#1803 regression guard)
3. egress allows SMTPS 465/TCP (#1803 regression guard)
4. egress allows legacy SMTP 25/TCP (#1803 regression guard)
5. egress allows HTTPS 443/TCP to world
6. egress allows kube-dns 53/UDP + 53/TCP
7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847)
8. ingress allows `flux-system` (HelmRelease readiness probes)
9. ingress allows `kube-system` (operator + ccm + CoreDNS)
10. ingress is namespace-scoped — no fromEntities:{cluster|world|all} wildcard
11. catalyst-api Service exposes port 8080 (auto-trigger contract)
12. CNP toggles off cleanly with security.baselineCnp.enabled=false
13. allowedIngressNamespaces propagates via --set (operator-tunable)
Negative-test confirmation (executed locally before commit):
- Remove SMTP 587 from template → Case 2 FAILS, exit 1
- Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1
- Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1
- Restore originals → all 13 cases PASS, exit 0
Refs: TBD-A18, PRs #1785, #1803, #1847, audit /tmp/audit-recent-prs-quality-report.json
Closes #1850
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179
+ bootstrap-kit pin bump + cloud-init substitute extension, because each
fix is small and they share the same fresh-prov verification cycle.
TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list
networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies
get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the
chroot in-cluster fallback's k8sCache.Factory reflector emitted
continuous `networkpolicies is forbidden` errors at the cluster scope
because only update/patch/delete were granted (existing mutation block)
— the read path was never wired. Mirrors the existing
cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s
NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7).
TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields
configuredRegions / controlPlaneIP / primaryRegion / replicaRegion /
selfDeploymentId / enableHotStandby / qaApplications empty on every
fresh prov. Pre-fix the envsubst placeholders resolved to empty because
nothing wrote them into the bootstrap-kit Kustomization postBuild
substitute map → the chart rendered empty strings → Dashboard
SovereignCard configured-regions chips, Settings page operator-identity,
/api/v1/sovereign/self, and the D31 active-hot-standby gating ALL
silently fell through to default behaviour. Wired via the following
coordinated changes:
- Chart values.yaml gains global.sovereignSelfDeploymentId default
- bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId,
sovereign.configuredRegions, sovereign.qaApplications mappings
(YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`)
- cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP
(= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION /
SOVEREIGN_REPLICA_REGION (canonical 4-segment labels),
SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty),
SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list),
QA_APPLICATIONS_YAML (reserved, default `[]`)
- main.tf: new template inputs sovereign_configured_regions_yaml +
replica_region_canonical_label (derived from local.secondary_regions),
threaded into both primary CP and per-secondary-region cloud-init
templatefile calls
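Why a JSON-encoded list works for the `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}` placeholder: a JSON array of strings is also a valid YAML flow sequence, so the substituted value parses as an inline list. The sketch below only illustrates that encoding; the real value is produced on the tofu side, and the function name `regionsInlineList` is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// regionsInlineList encodes regions as a JSON array, which doubles as a YAML
// inline list. A nil slice is normalised to an empty one so the result is
// "[]" rather than "null", matching the placeholder's `:-[]` default.
func regionsInlineList(regions []string) (string, error) {
	if regions == nil {
		regions = []string{}
	}
	b, err := json.Marshal(regions)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	for _, regions := range [][]string{{"hel1", "nbg1"}, nil} {
		s, err := regionsInlineList(regions)
		if err != nil {
			panic(err)
		}
		fmt.Println("configuredRegions:", s)
	}
}
```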
TBD-A10b (issue #1845) — GET
/api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409
kubeconfig-file-missing on fresh prov for every region. Pre-fix the
handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init
PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region
key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk
filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly
call with the bare `cloudRegion` (`?region=hel1`) because that's the
matrix-doc-friendly form. Fall-back resolution order added to
GetKubeconfig: exact-name first (legacy + manual operator PUT), then
`<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test
covers all three paths: exact match, slot-suffix glob, unknown-region
still 409. Closes the regression introduced when PR #1763
(mothership→chroot kubeconfig handover hook) started using the
cloud-init naming convention for fan-out exports.
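The fall-back order can be sketched as a pure function over the file names found in the kubeconfig directory. This is an illustration under assumptions, not the handler's code: `resolveKubeconfig` is a hypothetical name, the real GetKubeconfig reads the names from disk, and deployment ids are assumed to contain no glob metacharacters.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// resolveKubeconfig picks the kubeconfig file for (id, region):
//  1. exact `<id>-<region>.yaml` (legacy + manual operator PUT);
//  2. otherwise the lexically first match of `<id>-<region>-*.yaml`.
// The second return value is false when nothing matches; the caller would
// answer 409 kubeconfig-file-missing in that case.
func resolveKubeconfig(names []string, id, region string) (string, bool) {
	exact := id + "-" + region + ".yaml"
	for _, n := range names {
		if n == exact {
			return n, true
		}
	}
	pattern := id + "-" + region + "-*.yaml"
	var matches []string
	for _, n := range names {
		if ok, err := filepath.Match(pattern, n); err == nil && ok {
			matches = append(matches, n)
		}
	}
	if len(matches) == 0 {
		return "", false
	}
	sort.Strings(matches) // deterministic choice when several slots exist
	return matches[0], true
}

func main() {
	names := []string{"d1-nbg1-2.yaml", "d1-hel1-2.yaml", "d1-hel1-1.yaml", "d1-fsn1.yaml"}
	for _, region := range []string{"fsn1", "hel1", "ash"} {
		name, ok := resolveKubeconfig(names, "d1", region)
		fmt.Printf("region=%s file=%q found=%v\n", region, name, ok)
	}
}
```

Because the glob pattern requires the `-` after the region, `?region=hel` cannot accidentally match `d1-hel1-1.yaml`.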
Closes #1843, Closes #1844, Closes #1845
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third match pass for SSH keys whose name AND label both drifted from the
Tofu canonical emission. The OpenSSH public_key comment is the one piece
of metadata that survives Console-rename, partial tofu apply, and
out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical
prefix into it at generation.
Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh
t25 provs because previous wipe cycles left it as an orphan. Label-pass
+ name-prefix-pass had no signal once the name/label drifted.
Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by
TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping
t2.omantel.biz cannot delete t20.omantel.biz's SSH key.
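A boundary-aware prefix check of this kind can be sketched as follows. The separator set, the prefix shape (`catalyst-t2`), and the function names are assumptions for illustration; the actual rune table is the one pinned by CommentMatchesPrefix_BoundaryRules.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// separators is a hypothetical set of runes that may legally follow the
// canonical prefix inside a public-key comment.
const separators = "-._@ "

// publicKeyComment returns the trailing comment of a plain OpenSSH public key
// line of the form "<type> <base64> <comment...>", or "" if there is none.
func publicKeyComment(line string) string {
	parts := strings.SplitN(strings.TrimSpace(line), " ", 3)
	if len(parts) < 3 {
		return ""
	}
	return strings.TrimSpace(parts[2])
}

// commentMatchesPrefix reports whether comment starts with prefix AND the
// prefix ends on a boundary: end of string or a separator rune. A bare
// strings.HasPrefix would let "catalyst-t2" match "catalyst-t20...".
func commentMatchesPrefix(comment, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(comment, prefix) {
		return false
	}
	rest := comment[len(prefix):]
	if rest == "" {
		return true
	}
	r, _ := utf8.DecodeRuneInString(rest)
	return strings.ContainsRune(separators, r)
}

func main() {
	prefix := "catalyst-t2" // hypothetical canonical prefix for t2
	for _, line := range []string{
		"ssh-ed25519 AAAAC3Nz catalyst-t2.omantel.biz",
		"ssh-ed25519 AAAAC3Nz catalyst-t20.omantel.biz",
	} {
		c := publicKeyComment(line)
		fmt.Printf("%q matches %q: %v\n", c, prefix, commentMatchesPrefix(c, prefix))
	}
}
```

Rejecting an empty prefix matters as much as the boundary rule: an empty prefix would otherwise match every key in the project.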
Tests:
- PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match)
- PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin)
- PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes)
- PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched)
- PublicKeyComment_ParsesFormats (OpenSSH parser unit pins)
- CommentMatchesPrefix_BoundaryRules (separator rune table)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1785 (chart 1.4.171) shipped a baseline default-deny
CiliumNetworkPolicy in catalyst-system whose ingress allowlist was
limited to:
- reserved.ingress: "" (cilium-gateway endpoint)
- same-namespace catalyst-system Pods
- host / remote-node / kube-apiserver entities
The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst`
namespace, including the 10-auto-trigger Job whose Pod curls
catalyst-api.catalyst-system.svc.cluster.local:8080 to fire
/api/v1/internal/cutover/trigger.
With #1785 in effect on a FRESH prov, every auto-trigger Pod times
out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and
the D0 auto-redirect to the Sovereign Console never happens — the
operator is stuck on mothership /jobs forever.
Caught by t24 zero-touch verification (2026-05-18):
handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst'
ns cannot reach catalyst-api in 'catalyst-system' ns because
baseline-default-deny CNP allows ingress only from {reserved.ingress,
catalyst-system ns, host entities}"
The companion symptom on t22 was masked because t22's cutover Job
had already completed before the CNP rolled out — the CNP did not
gate ingress there.
Fix
─────────────────────────────────────────────────────────────────
Add a fourth ingress rule to baseline-default-deny allowing
fromEndpoints in the operator-tunable list
.Values.security.baselineCnp.allowedIngressNamespaces. Defaults:
- catalyst — cutover Pods (the load-bearing fix)
- flux-system — Helm/Kustomize/Source controllers probing
Service readiness for HelmRelease health
rollups (worked pre-#1785 via no-CNP default)
- kube-system — Cilium operator + hcloud-ccm + CoreDNS that
do cluster introspection calls (the
reserved.ingress gateway endpoint here is
still matched by rule 1's reserved.ingress: ""
selector — this rule covers non-gateway Pods)
The list mirrors the existing allowedPlatformNamespaces pattern on
the egress side. No other rule semantics change.
Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177
(PR #1803, SMTP egress) — both are sub-regressions from the same
#1785 baseline-CNP ship.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>