Commit Graph

2490 Commits

Author SHA1 Message Date
hatiyildiz
2e1826abb4 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.181 -> 1.4.182 (auto, Refs TBD-A6) 2026-05-18 23:49:51 +00:00
github-actions[bot]
5a25c254a1 deploy: update sme service images to 5d5c557 + bump chart to 1.4.182 2026-05-18 23:49:14 +00:00
e3mrah
5d5c55739e
fix(notification): retry-backoff on Stalwart 503 5.5.1 rate-limit (#1876)
When Stalwart trips its rate-limit and returns "503 5.5.1", the
notification service previously surfaced the error immediately to the
events consumer, which kept hammering on the next event and prolonged
the rate-limit window.

Now Mailer.Send detects 503 5.5.1 specifically (via *textproto.Error
unwrap + canonical-code substring fallback) and retries up to 3 times
with a 60s backoff between attempts. The backoff is configurable via
SMTP_RETRY_BACKOFF env var (Go duration string OR bare integer seconds;
30s floor to keep the rate-limiter happy). Non-rate-limit errors
(auth failure, transient I/O, etc.) bubble up unchanged so the
consumer can NACK / dead-letter as appropriate.
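
A minimal Go sketch of the detection, backoff parsing, and retry loop described above. This is not the code from the commit: the function bodies, the sample error text, the injected sleep function, and the choice of 60s as the fallback for zero or garbage values are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/textproto"
	"strconv"
	"strings"
	"time"
)

const (
	defaultBackoff = 60 * time.Second // wait named in the commit message
	minBackoff     = 30 * time.Second // floor named in the commit message
	maxRetries     = 3                // "retries up to 3 times"
)

// errRateLimited is a sample reply for the demo (hypothetical message text).
var errRateLimited error = &textproto.Error{Code: 503, Msg: "5.5.1 Too many requests"}

// noSleep replaces time.Sleep so the demo does not actually wait.
func noSleep(time.Duration) {}

// isRateLimit reports whether err looks like an SMTP "503 5.5.1" reply.
func isRateLimit(err error) bool {
	if err == nil {
		return false
	}
	var tp *textproto.Error
	if errors.As(err, &tp) {
		return tp.Code == 503 && strings.HasPrefix(strings.TrimSpace(tp.Msg), "5.5.1")
	}
	// Fallback for errors that lost their type: match the rendered text,
	// including the multiline "503-5.5.1" continuation form.
	s := err.Error()
	return strings.Contains(s, "503 5.5.1") || strings.Contains(s, "503-5.5.1")
}

// parseRetryBackoff accepts a Go duration string or bare integer seconds.
// Empty, zero, negative, or unparsable input falls back to the default
// (an assumption), and anything below the floor is raised to it.
func parseRetryBackoff(raw string) time.Duration {
	raw = strings.TrimSpace(raw)
	d, err := time.ParseDuration(raw)
	if err != nil {
		secs, convErr := strconv.Atoi(raw)
		if convErr != nil {
			return defaultBackoff
		}
		d = time.Duration(secs) * time.Second
	}
	if d <= 0 {
		return defaultBackoff
	}
	if d < minBackoff {
		return minBackoff
	}
	return d
}

// sendWithRetry retries only rate-limit errors; every other error is
// returned unchanged on the first occurrence.
func sendWithRetry(send func() error, backoff time.Duration, sleep func(time.Duration)) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = send(); err == nil {
			return nil
		}
		if !isRateLimit(err) {
			return err
		}
		if attempt < maxRetries {
			sleep(backoff)
		}
	}
	// %w keeps the *textproto.Error reachable through errors.As.
	return fmt.Errorf("rate-limited after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	send := func() error {
		calls++
		if calls == 1 {
			return errRateLimited // first attempt is rate-limited
		}
		return nil
	}
	err := sendWithRetry(send, parseRetryBackoff("45"), noSleep)
	fmt.Println("attempts:", calls, "err:", err)
}
```

Injecting the sleep function keeps the loop testable without waiting out real backoff windows; a production caller would pass time.Sleep.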

Adds smtp_test.go covering:
- single rate-limit -> retry -> success
- exhausted retries -> wrapped error preserving *textproto.Error
- non-rate-limit error -> immediate pass-through, no backoff
- isRateLimit detection (textproto, multiline 503-5.5.1, negative cases)
- parseRetryBackoff env-var forms + 30s floor + zero/garbage fallbacks

No credential touches: this is a retry-hardening fix only; the
chart-side SMTP creds path is already GREEN (see #1793 A80 diagnosis).

Refs #1793

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:47:58 +04:00
e3mrah
3b4c130129
fix(bootstrap-kit): cutover dependsOn sovereign-tls — wait for Gateway TLS before HelmRepository URL rewrite (Closes #1871) (#1875)
TBD-A24 cutover↔gateway circular deadlock — discovered on t26 zero-touch
prov 2026-05-18 (99bb823cb0513f4b):

  1. bp-catalyst-platform HR installs at v1.4.179 (Ready=True)
  2. bp-self-sovereign-cutover HR Ready=True (deps gitea+harbor only)
  3. Step-06 rewrites all 50 HelmRepository URLs ghcr.io → registry.<fqdn>
  4. bp-catalyst-platform flips Ready=False (TLS handshake EOF — no Gateway)
  5. sovereign-tls Kustomization blocked on bootstrap-kit Ready=True
  6. bootstrap-kit blocked on bp-catalyst-platform Ready=True
  7. Full deadlock — no Gateway, no handover, every UI route 404

Fix: add `sovereign-tls` as a third dependsOn entry on the cutover HR so
Flux waits for the Cilium Gateway to be serving TLS before the URL
rewrite fires. Same architectural shape as Wave 7 bp-hcloud-csi removal
(#1610) — chicken-and-egg between bootstrap-kit and sovereign-tls broken
by ordering the dangerous-side-effect chart AFTER the Gateway is ready.

Also updates scripts/expected-bootstrap-deps.yaml so the dep-graph audit
(check-bootstrap-deps.sh) recognises the new edge: slot 6a gets the
extra `sovereign-tls` entry, plus a new "slot 0t" entry declaring
sovereign-tls as a known node (no HR file on disk → audit reports it as
`deferred`, info not error; Phase 4 cycle detection accepts it as a
zero-in-degree root).
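
The cycle check can be illustrated with a small Go sketch of Kahn's algorithm over a dependsOn graph. The real audit is a shell script and may work differently; the function name is hypothetical, and the "before" graph is only one reading of the wait chain in steps 4 to 6 above, modelled as explicit edges.

```go
package main

import (
	"fmt"
	"sort"
)

// unresolvedNodes runs Kahn's algorithm. deps maps a node to the nodes it
// depends on. It returns, sorted, every node that can never become ready
// because it sits on (or behind) a dependency cycle.
func unresolvedNodes(deps map[string][]string) []string {
	unmet := map[string]int{}           // unmet dependency count per node
	dependents := map[string][]string{} // reverse edges
	for node, ds := range deps {
		if _, seen := unmet[node]; !seen {
			unmet[node] = 0
		}
		for _, d := range ds {
			if _, seen := unmet[d]; !seen {
				unmet[d] = 0
			}
			unmet[node]++
			dependents[d] = append(dependents[d], node)
		}
	}

	// Start from roots: nodes with nothing left to wait for.
	var queue []string
	for node, n := range unmet {
		if n == 0 {
			queue = append(queue, node)
		}
	}
	for len(queue) > 0 {
		ready := queue[0]
		queue = queue[1:]
		for _, m := range dependents[ready] {
			unmet[m]--
			if unmet[m] == 0 {
				queue = append(queue, m)
			}
		}
	}

	var stuck []string
	for node, n := range unmet {
		if n > 0 {
			stuck = append(stuck, node)
		}
	}
	sort.Strings(stuck)
	return stuck
}

func main() {
	// Wait chain from the incident, modelled as edges (an interpretation).
	before := map[string][]string{
		"bp-catalyst-platform": {"sovereign-tls"}, // needs Gateway TLS after the rewrite
		"sovereign-tls":        {"bootstrap-kit"},
		"bootstrap-kit":        {"bp-catalyst-platform"},
	}
	// Declared edges after the fix, with sovereign-tls as a root.
	after := map[string][]string{
		"bp-self-sovereign-cutover": {"bp-gitea", "bp-harbor", "sovereign-tls"},
		"sovereign-tls":             {},
	}
	fmt.Println("before:", unresolvedNodes(before))
	fmt.Println("after: ", unresolvedNodes(after))
}
```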

Verified locally:
  - yq parses spec.dependsOn → 3 entries (bp-gitea, bp-harbor, sovereign-tls)
  - scripts/check-bootstrap-deps.sh: 50 present, 65 declared, 0 drift, 0 cycles
  - helm template platform/self-sovereign-cutover/chart: exit 0 (smoke OK)

Refs: t26 ID 99bb823cb0513f4b, A55 diagnostic, A67 diagnosis, slot 17a
comment in clusters/_template/bootstrap-kit/kustomization.yaml documenting
the same chicken-and-egg shape.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:19:55 +04:00
e3mrah
06bea550ff
feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin (#1874)
The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml,
bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish
commits land on main, but it cannot detect the "chart bumped but never
published" failure mode: the bootstrap-kit pin points at a chart
version that GHCR never received because blueprint-release.yaml
failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep,
runner cancellation, transient GHCR push 5xx).

Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180
and 1.4.181 were "lost" during the TBD-A20 scanner break window
(21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS
while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist
until A58 manually re-fired the workflow via dispatch. Fresh
Sovereigns silently fell back to the last working tag.

What this adds
- scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and
  optional `--ghcr-org <org>`). For every chart pinned in the kit, it
  lists ghcr.io/<org>/<chart> tags via `gh api
  /orgs/<org>/packages/container/<chart>/versions --paginate`, then
  asserts the pinned version appears. Exits 1 on any missing tag.
- A per-chart tag cache avoids redundant paginations.
- .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now
  passes `--check-ghcr` on `push` to main + `workflow_dispatch`
  (PR mode stays `--changed-only` and skips GHCR — PRs cannot publish
  to GHCR anyway). The job stays `continue-on-error: true` under the
  same observational umbrella as the existing post-merge full sweep
  so a transient API blip cannot red-flag every chart bump; the
  missing-tag list still surfaces on the run summary for operator
  attention.
- Job grants `packages: read` so the workflow GITHUB_TOKEN can list
  private package versions.
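
The pin-versus-published-tag check with a per-chart cache could look roughly like the Go sketch below. The real implementation is a shell script calling `gh api`; here the tag listing is an injected function so the logic runs offline, and all type and function names are hypothetical. The sample tags are the bp-velero versions quoted elsewhere in this log.

```go
package main

import "fmt"

// pin is one chart/version pair taken from the bootstrap kit.
type pin struct{ Chart, Version string }

// tagLister returns every published tag for a chart. A real caller would
// page through the registry API; the demo injects a fake.
type tagLister func(chart string) ([]string, error)

// cached wraps a lister so each chart is listed at most once.
func cached(fetch tagLister) tagLister {
	cache := map[string][]string{}
	return func(chart string) ([]string, error) {
		if tags, ok := cache[chart]; ok {
			return tags, nil
		}
		tags, err := fetch(chart)
		if err != nil {
			return nil, err
		}
		cache[chart] = tags
		return tags, nil
	}
}

// missingPins returns "chart:version" for every pin with no matching tag.
func missingPins(pins []pin, list tagLister) ([]string, error) {
	var missing []string
	for _, p := range pins {
		tags, err := list(p.Chart)
		if err != nil {
			return nil, fmt.Errorf("list tags for %s: %w", p.Chart, err)
		}
		found := false
		for _, t := range tags {
			if t == p.Version {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, p.Chart+":"+p.Version)
		}
	}
	return missing, nil
}

func main() {
	fetches := 0
	fake := func(chart string) ([]string, error) {
		fetches++
		published := map[string][]string{
			"bp-velero": {"1.0.0", "1.1.0", "1.2.0"},
		}
		return published[chart], nil
	}
	pins := []pin{{"bp-velero", "1.2.0"}, {"bp-velero", "1.2.1"}}
	missing, err := missingPins(pins, cached(fake))
	fmt.Println("missing:", missing, "fetches:", fetches, "err:", err)
	// A CI wrapper would exit non-zero when missing is non-empty.
}
```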

Verification (origin/main snapshot, 2026-05-19)
- Full sweep default: 50/50 chart→pin pairs OK, no GHCR check.
- Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags
  present — PASS exit 0.
- Negative test: with products/catalyst/chart/Chart.yaml + slot 13
  both set to a non-existent 99.99.99, the script exits 1 with
  `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the
  remediation hint pointing at `gh workflow run
  blueprint-release.yaml`.
- `--changed-only --base origin/main` against a no-change tree: clean
  exit 0 with the existing "nothing to check" message.

Refs #1872, #1864, #1856.

Closes #1872

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:12:13 +04:00
e3mrah
a7cd2fc21f
docs(principles): add 3 session-2026-05-18 principles (validate-vs-origin / GHCR-tag-check / cutover-dependsOn-Gateway) (#1873)
Adds three new inviolable principles surfaced by 2026-05-18 incidents:

- #12 Never validate against the local working tree — A19 false-positive
  (verifier grepped a feature-branch working copy with unstaged edits,
  reported "already on main" when it was not).
- #13 Chart-pin bumps must match a GHCR tag that exists — TBD-A48 / PR #1869
  drift: pin to bp-self-sovereign-cutover:0.1.4 landed on main while the
  chart artifact had not been published, causing hours of ImagePullBackOff.
- #14 Cutover-style HRs that rewrite HelmRepository URLs must dependsOn
  Gateway readiness — TBD-A24 / PR #1871: bp-self-sovereign-cutover flipped
  URLs to local registry before Cilium Gateway was serving TLS, deadlocking
  the cluster.

Doc-only change; bumps the front-matter Updated date to 2026-05-18.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 03:09:26 +04:00
hatiyildiz
26e4c8e30e deploy(bp-guacamole): bump bootstrap-kit pin 0.1.25 -> 0.1.26 (auto, Refs TBD-A6)
Also locksteps platform blueprint.yaml spec.version 0.1.25 -> 0.1.26 (Refs TBD-A20, #1856).
2026-05-18 22:20:35 +00:00
github-actions[bot]
8ce7c02aa9 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.26 2026-05-18 22:19:59 +00:00
e3mrah
1b87d38e94
deploy: catch-up pins for bp-catalyst-platform 1.4.181 + bp-guacamole 0.1.25 (post #1866 fix) (#1869)
Catch-up for drift introduced during the Blueprint Release workflow outage
21:04:22Z (PR #1858 merge with YAML scanner break) → 22:07:49Z (PR #1866 fix).

Charts published in that window:
- bp-catalyst-platform 1.4.180 → 1.4.181 (umbrella)
- bp-guacamole 0.1.24 → 0.1.25

Auto-bump-pin step didn't fire during the outage. A39 already caught up bp-newapi
(PR #1865). This PR catches up the remaining 2.

Refs #1864, PR #1866 (workflow fix), PR #1858 (root cause).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 02:19:34 +04:00
hatiyildiz
66fa508b74 deploy(bp-newapi): bump bootstrap-kit pin 1.4.21 -> 1.4.22 (auto, Refs TBD-A6)
Also locksteps platform blueprint.yaml spec.version 1.4.21 -> 1.4.22 (Refs TBD-A20, #1856).
2026-05-18 22:11:05 +00:00
e3mrah
22e046b554
Merge pull request #1866 from openova-io/fix/1864-workflow-yaml-startup-failure
fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)
2026-05-19 02:07:48 +04:00
hatiyildiz
69f2d7d91a fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)
Root cause of the auto-bump-pin miss flagged in #1864.

The Blueprint Release workflow has been in `startup_failure` since
PR #1858 (commit cf35b4a) merged at 21:04:22Z. The lockstep step's
multi-line quoted shell string inside a `run: |` block-scalar:

    if [ ... ]; then
      msg="deploy(...) (auto, Refs TBD-A6)
                                                        <-- literal blank line
    Also locksteps platform blueprint.yaml ..."          <-- column 1, no indent

is interpreted by the YAML scanner as the END of the block-scalar
at the blank line, and the next column-1 line is then parsed as a
new top-level mapping key — which fails because the previous mapping
isn't terminated. The whole workflow file is rejected at workflow-
startup time. Verified with `python3 -c yaml.safe_load(...)` (raises
`ScannerError: could not find expected ':' line 815`) and by `gh api
.../actions/runs/26060392136` returning `conclusion=failure,
status=completed, jobs: []` for every push since cf35b4a.

Consequence: no chart bump since cf35b4a has triggered the TBD-A6
auto-bump-pin or the TBD-A20 blueprint.yaml lockstep. PR #1865 was
the manual catch-up for bp-newapi (1.4.20 -> 1.4.21); without this
fix every future chart publish will drift the same way.

Fix: build the multi-line commit message with `printf '%s\n\n%s'`
so the string source stays on physically-indented lines that the
YAML block-scalar accepts. Behaviour is identical — same commit
subject, same blank line, same body — only the construction shape
changes. Added a 9-line comment naming the seam so future authors
don't reintroduce the same trap.

Verified locally:
  * `python3 -c yaml.safe_load(open(...))` succeeds, parses 24
    build-job steps.
  * `CHART_NAME=bp-newapi PREV_VERSION=1.4.20 CHART_VERSION=1.4.21
    BP_PREV_VERSION=1.4.20 bash -c "$(printf ...)"` emits the
    canonical "deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 ->
    1.4.21 (auto, Refs TBD-A6)\n\nAlso locksteps platform ..." body.

Refs #1864.
Refs PR #1858 (TBD-A20 lockstep that introduced the YAML defect).
2026-05-19 00:07:07 +02:00
github-actions[bot]
c64220f8cc deploy: bump bp-newapi upstream v0.13.2 chart 1.4.22 2026-05-18 22:05:58 +00:00
e3mrah
1e1fe26e02
Merge pull request #1865 from openova-io/fix/1864-bp-newapi-pin-catchup
deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race)
2026-05-19 02:05:33 +04:00
hatiyildiz
f57f62764b deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race)
Closes #1864

Manual catch-up. The auto-bump-pin step (TBD-A6) did NOT run for the
1.4.20 -> 1.4.21 chart bump at commit 8b33188 because the Blueprint
Release workflow has been stuck in **startup_failure** since PR #1858
(commit cf35b4a) merged at 21:04:22Z. The workflow YAML at
.github/workflows/blueprint-release.yaml lines 812-814 has a multi-line
quoted string inside a `run: |` block-scalar whose continuation lines
are unindented:

  msg="deploy(${CHART_NAME}): bump bootstrap-kit pin ${PREV_VERSION} -> ...
                                                              (auto, Refs TBD-A6)

  Also locksteps platform blueprint.yaml spec.version ${BP_PREV_VERSION} ..."

YAML treats the unindented line as the end of the block-scalar and the
next line as a new mapping key (which it isn't), so the entire workflow
file fails the GitHub Actions YAML validator at workflow-start time.
Every push since cf35b4a has produced a run with `conclusion=failure,
status=completed, jobs=[]` (zero jobs spun up).

Evidence:
  * gh api repos/openova-io/openova/actions/runs/26060392136 ->
    'This run likely failed because of a workflow file issue.'
  * Same for every subsequent run including the chart 1.4.21 publish
    (no run was even created for 8b33188 because the workflow file
    couldn't parse).
  * `python3 -c 'yaml.safe_load(open(...))'` raises
    `ScannerError ... could not find expected ':' line 815`.

This PR is the ONE-LINE catch-up so the pin drift is closed. A
companion PR fixes the workflow YAML so future chart bumps auto-bump
the pin again.
2026-05-19 00:04:40 +02:00
github-actions[bot]
6b11734a81 deploy: update sme service images to 4a61543 + bump chart to 1.4.181 2026-05-18 21:48:56 +00:00
e3mrah
4a61543957
test(tenant): wire round-trip for tenant.created owner_email contract (#1863)
Verifies the publisher-side wrapper struct in CreateOrg
(handlers.go:248-252) marshals to bytes the provisioning consumer
in organization_create.go can decode flat with owner_email as a
sibling field. Pairs with TestHandleTenantCreated_FullTenantStructDecode
on the consumer side — together they pin BOTH ends of the contract
so a refactor that nests under "tenant" or renames the tag fails
in CI rather than at staging.

Refs #1829 (D29).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 01:47:38 +04:00
github-actions[bot]
de86df1126 deploy: bump sandbox-controller image to a405572 2026-05-18 21:46:29 +00:00
github-actions[bot]
a09445482b deploy: bump sandbox-mcp-server image to a405572 2026-05-18 21:44:53 +00:00
github-actions[bot]
4fd3aae99b deploy: bump application-controller image to a405572 2026-05-18 21:44:52 +00:00
e3mrah
a40557227e
fix(controllers): NATS consume-leg for D35 (organization + sandbox) (#1862)
PR #1626 wired the publish-leg (tenant + billing → NATS JetStream
catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster
controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow
even though the publish leg shipped.

This PR adds:

- core/controllers/pkg/natsbus: minimal JetStream subscriber shared by
  Group-C controllers. Self-contained (no dep on core/services/shared
  which pulls in franz-go/Kafka the controllers never touch).
- core/controllers/organization/internal/controller/nats_bridge.go:
  subscribes to catalyst.tenant.created + catalyst.billing.order.placed,
  patches openova.io/last-event-observed-at + ...-subject annotations on
  the matching Organization CR. The annotation patch triggers an
  informer event → controller-runtime enqueues Reconcile within ~50ms
  instead of waiting for the 30s requeue fallback.
- core/controllers/sandbox/internal/controller/nats_bridge.go: same
  pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR
  using the same `sandbox-<sanitised-email>` naming convention
  tenant-service's SandboxOrchestrator (PR #1633) writes under.
- main.go wiring in both controllers reads NATS_URL from env. Unset =
  log "consume-leg disabled" + continue (informer requeue fallback
  intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS
  is an accelerator, not the only path.

Idempotency: ev.Timestamp is the broker-side timestamp, so duplicate
JetStream delivery produces a byte-stable annotation patch and
controller-runtime does NOT enqueue a redundant Reconcile.
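
To illustrate why a broker-side timestamp makes redelivery harmless, here is a small Go sketch that builds the annotation patch from the envelope alone. It is not the bridge code: the envelope shape and the RFC 3339 string timestamp are assumptions, and the second annotation key is hypothetical because the message above elides its full name.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

const (
	observedAtKey = "openova.io/last-event-observed-at" // named in the commit message
	subjectKey    = "openova.io/last-event-subject"     // hypothetical: full key is elided above
)

// envelope is an assumed minimal shape of the event wrapper.
type envelope struct {
	Subject   string `json:"subject"`
	Timestamp string `json:"timestamp"` // broker-side time, assumed RFC 3339
}

// annotationPatch builds a merge patch purely from envelope fields. Nothing
// here reads the wall clock, so the same envelope always yields the same
// bytes. encoding/json sorts map keys, which keeps the output stable.
func annotationPatch(ev envelope) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]string{
				observedAtKey: ev.Timestamp,
				subjectKey:    ev.Subject,
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	ev := envelope{Subject: "catalyst.tenant.created", Timestamp: "2026-05-18T21:43:08Z"}
	first, _ := annotationPatch(ev)
	second, _ := annotationPatch(ev) // simulated duplicate delivery
	fmt.Println(string(first))
	fmt.Println("identical on redelivery:", bytes.Equal(first, second))
}
```

Had the patch stamped the consumer's own receive time instead, every redelivery would change the annotation value and trigger another Reconcile.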

Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the
happy path, the no-matching-CR soft miss, duplicate-envelope no-churn,
malformed JSON poison-pill, and the publish-side ↔ consume-side name
derivation lockstep for Sandbox CRs.

HARD CONSTRAINT respected: no credential mutations — bridges read only
the envelope + the target CR, never Secrets or Keycloak SA creds.

Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:43:08 +04:00
github-actions[bot]
376cd7d14c deploy: update catalyst images to c86fb3d 2026-05-18 21:36:21 +00:00
e3mrah
c86fb3d1dc
fix(catalyst-api): seed full 4-entry .omani.X sme-pool (D30 / #1830) (#1861)
LoadSMETenantParentDomainsFromEnv's hardcoded-fallback only seeded
2 entries (omani.works + omani.trade), but the marketplace UI
(core/marketplace/src/components/AddonsStep.svelte) lists 4
(omani.homes + omani.rest + omani.trade + omani.works) and
core/services/domain/store.AllowedTLDs has the same canonical 4.
Result: a customer picking .omani.homes or .omani.rest in /addons
sailed through the picker but got 422 invalid-parent-domain at
catalyst-api signup because FindParentDomain didn't recognise the
TLD.

This widens the seed to all 4 canonical .omani.X entries so the
backend pool, the marketplace picker, and AllowedTLDs all agree.
NSFlipReady=true on every entry (the zones are already delegated
to the Sovereign's PowerDNS at gTLD level — pdmFlipNS
short-circuits via nsAlreadyMatches for Day-2 re-adds).

Updated TestLoadSMETenantParentDomainsFromEnv_StubFallback
(`pool != 4`) and added 3 fresh tests in
sovereign_parent_domains_test.go covering: canonical 4-entry seed,
OTECH primary + 4 sme-pool composition, env-override path without
fallback leakage.

Closes #1830 (Part 1 — Day-1 pool seed).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 01:34:24 +04:00
github-actions[bot]
a6b5752391 deploy: update sme service images to b214566 + bump chart to 1.4.180 2026-05-18 21:28:12 +00:00
e3mrah
b214566c1a
fix(provisioning): create Organization CR on tenant.created (C16 root cause) (#1860)
Closes the voucher-checkout → Organization-CR loop that was missing
from the convergence chain. Before this PR the flow stalled at:

  voucher accept → tenant-service CreateOrg
     → writes Tenant row, publishes tenant.created
              ↓                                  (DROP — no consumer)
  provisioning consumer switch
     (case "tenant.created" missing — A26 verifier pinpointed this)
              ↓
  organization-controller has nothing to reconcile
              ↓
  no vCluster / Keycloak group / Gitea org / per-tenant HTTPRoute

A26 verifier on t22: zero Organization CRs after 168min despite the
tenant row existing. Closes #1722. Unblocks D29 zero-touch tenant
provisioning (Refs #1829).

Changes:

- core/services/tenant/handlers/handlers.go
  Enrich tenant.created payload with owner_email from JWT claims so
  the provisioning consumer can mint the Organization owner roster
  without a second store round-trip. Wrapper struct embeds *Tenant
  so existing decoders are wire-compatible.

- core/services/provisioning/handlers/consumer.go
  Add case "tenant.created" to the dispatch switch.

- core/services/provisioning/handlers/organization_create.go
  New handler. Validates slug + owner_email, builds cluster-scoped
  Organization CR (apiVersion orgs.openova.io/v1), POSTs via
  k8sRequest. Idempotent on 409 AlreadyExists (NATS redelivery
  safe). 404 → operator-misconfiguration error event. 5xx → return
  err so broker redelivers. Inviolable Principle #4: parent domain
  flows env → Handler.TenantParentDomain → CR (with per-tenant
  parent_domain payload override for multi-pool Sovereigns).

- core/services/provisioning/handlers/organization_create_test.go
  Unit tests: malformed payload, invalid slug (incl. path-traversal),
  missing owner_email, full Tenant decode, default-fill paths, empty
  parent domain mints anyway, payload-shape pinning. All exercised
  with KUBERNETES_SERVICE_HOST scrubbed so no real apiserver dial.
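
The status handling described for organization_create.go can be summarised as a small decision function. The Go sketch below is illustrative only: the action names are hypothetical, and the treatment of status codes the message does not mention (2xx as success, other 4xx as a non-retryable error event) is an assumption.

```go
package main

import "fmt"

// action is what the consumer does after the create call returns.
type action string

const (
	actionDone       action = "done"        // CR exists; nothing more to do
	actionErrorEvent action = "error-event" // surface an operator-facing error event
	actionRedeliver  action = "redeliver"   // return an error so the broker retries
)

// classifyCreate maps the apiserver's HTTP status to a consumer action.
func classifyCreate(status int) action {
	switch {
	case status >= 200 && status < 300:
		return actionDone // assumed: any 2xx means the CR was created
	case status == 409:
		return actionDone // AlreadyExists: a redelivered event is a no-op
	case status == 404:
		return actionErrorEvent // resource type not served: misconfiguration
	case status >= 500:
		return actionRedeliver // transient server failure
	default:
		return actionErrorEvent // assumption: other 4xx will not heal on retry
	}
}

func main() {
	for _, status := range []int{201, 409, 404, 503} {
		fmt.Println(status, "->", classifyCreate(status))
	}
}
```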

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:26:57 +04:00
github-actions[bot]
2457926e16 deploy: update catalyst images to 7ec73f9 2026-05-18 21:16:51 +00:00
e3mrah
7ec73f9e2b
fix(catalyst-api): handler test baseline GREEN — 6 failing tests fixed (Closes #1853) (#1859)
Per-test root cause + fix:

1. TestPinIssue_ConcurrentRapidFireRateLimit (TOCTOU race) — pinStore.canIssue
   and put() ran under separate mutex acquisitions; three concurrent
   /pin/issue goroutines all observed "no entry", passed canIssue, then
   raced EnsureUser against Keycloak. Replaced with atomic tryReserve()
   that check-and-stamps under a single lock; HandlePinIssue calls
   store.drop(email) on EnsureUser/generatePin/no-KC failure to roll back
   the reservation so the 60s cooldown doesn't punish operator retries.

2. TestFinaliseHandover_FullFlow — test fixture drift after PR #1487 keyed
   the tofu workdir by DeploymentID (provisioner.workdirKey). Test still
   wrote workdir at filepath.Join(tmp, "tenant-y-omani-works") (the legacy
   sovereign-name slug); FinaliseHandover handler uses `id`. Updated test
   to write workdir at filepath.Join(tmp, "dep-full") so it matches the
   actual prod lookup path. Same fix for the receiver-failure sibling test.

3. TestEnsureOwnerUserAccess_CreatesCanonicalCR — drifted twice: (a) test
   queried Namespace("") but the t134 D21 fix moved the CR to
   userAccessOwnerNamespace ("catalyst-system") because useraccesses is
   namespaced per the XRD claimNames block; (b) test asserted
   spec.applications = [{app:"*", role:"admin"}] but the t135 D21 fix
   switched to spec.tierRoleRef = "openova:tier-owner" (XRD pattern
   rejects `app: "*"`). Updated test to query catalyst-system namespace
   and assert tierRoleRef + applications-must-be-absent.

4. TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty — production
   unstructuredToUserAccess left Spec.Applications=nil when the CR has no
   spec.applications, which json-marshals to `null` and crashes the React
   UI's items.map() (qa-loop iter-4 users-page-null-map regression).
   Initialize Spec.Applications = []userAccessAppGrantBody{} in the
   struct literal so the empty-slice contract is preserved.

5. TestHandleWhoami_PinSessionRBACClaims — whoamiInjectTierRoles
   unconditionally appended every inherited tier role even when the
   upstream JWT already shaped the role list authoritatively. A
   PIN-minted session carrying tier=owner + realm_access=[catalyst-owner]
   was getting fanned out to all 5 inheritance entries, which the
   route-guard couldn't reconcile. Now: if the operator's own
   catalyst-<tier> role is already present, the projection returns early
   and preserves the upstream list. TestHandleWhoami_ProjectsTierToRealmRoles
   still passes (empty input → still injects inheritance) and
   TestWhoamiInjectTierRoles_PreservesExistingRoles still passes
   (idempotent — same input out).

6. TestHandleWhoami_NoRBACOmitsFields — whoamiResponse.RealmAccess was a
   struct value with `omitempty`, which encoding/json does NOT honour for
   structs (only pointers/slices/maps until Go 1.24's `omitzero`). A
   pre-RBAC session always serialized realm_access:{} on the wire,
   breaking the legacy {email,sub,verified} contract. Changed to
   *whoamiRealmAccess so omitempty actually drops the field; HandleWhoami
   only allocates the pointer when claims carry roles, and drops it back
   to nil if the projection ended up empty.
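
The omitempty behaviour behind fix 6 is easy to reproduce in isolation. The Go sketch below uses simplified stand-in types (the names and the single email field are hypothetical, not the real whoamiResponse) to show that encoding/json keeps an empty struct value but drops a nil pointer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type realmAccess struct {
	Roles []string `json:"roles,omitempty"`
}

// whoamiByValue mirrors the pre-fix shape: omitempty on a struct value.
type whoamiByValue struct {
	Email       string      `json:"email"`
	RealmAccess realmAccess `json:"realm_access,omitempty"`
}

// whoamiByPointer mirrors the post-fix shape: omitempty on a pointer.
type whoamiByPointer struct {
	Email       string       `json:"email"`
	RealmAccess *realmAccess `json:"realm_access,omitempty"`
}

// encode marshals v and returns the JSON text (or the error text).
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Struct value: the field is emitted as {} even though it is empty.
	fmt.Println(encode(whoamiByValue{Email: "op@example.com"}))
	// Nil pointer: the field is dropped entirely.
	fmt.Println(encode(whoamiByPointer{Email: "op@example.com"}))
	// Non-nil pointer: the field is emitted with its contents.
	fmt.Println(encode(whoamiByPointer{
		Email:       "op@example.com",
		RealmAccess: &realmAccess{Roles: []string{"catalyst-owner"}},
	}))
}
```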

Test status after fix (worktree off origin/main):
- All 6 target tests PASS
- Full TestPin*, TestHandleWhoami*, TestWhoamiInjectTierRoles*,
  TestEnsureOwnerUserAccess*, TestOwnerUserAccessName*, TestListUserAccess*,
  TestFinaliseHandover*, TestUnstructuredToUserAccess* PASS (57 tests)
- go test ./... -p 1 across the entire catalyst-api module PASS

Pre-existing parallelism flakes (TestGetKubeconfig_ReadsFromPathPointer /
TestPhase1Started_GuardPreventsDoubleWatch / TestPodRestart_*) exist on
baseline too: they write to /var/lib/catalyst/ from a goroutine that outlives
test scope. Out of scope for this PR; tracked separately.

Closes #1853

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:14:38 +04:00
github-actions[bot]
de53b39d13 deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.25 2026-05-18 21:05:55 +00:00
github-actions[bot]
8b33188019 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.21 2026-05-18 21:04:48 +00:00
e3mrah
cf35b4a9b6
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
  * New step `bump_blueprint` after `bump_pin` — locates
    ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
    (handles both platform-leaf and products-umbrella conventions),
    filters to kind:Blueprint (defensive against CRD yaml at the
    products/catalyst/chart/crds path), reads current spec.version at
    2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
  * Commit step renamed to "Commit + push bootstrap-kit pin bump +
    blueprint.yaml lockstep"; stages both files, single commit, with
    convergent retry on conflict.
  * Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
  * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
    platform/* and products/*, discovers every Blueprint manifest with
    a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
    Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
    existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
  * Failure messages name the file, drift direction, and the exact sed
    command to fix — drift remediation is mechanical.

Drift cleanup (mandatory companion, same shape as A17/#1855):
  26 Application-Blueprint blueprints whose spec.version had been left
  at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to
  Chart.yaml as authoritative. All currently surface in the new sweep
  test; without the cleanup the test would block this PR (and every
  subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
  cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
  falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
  netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
  trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:22 +04:00
e3mrah
2484c8a3de
fix(bp-velero): bump 1.2.1 -> 1.2.2 to force a publish (Closes #1799) (#1846)
TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because
the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the
initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image
to cadc7b5") which never triggered the blueprint-release workflow. As a
result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck
InProgress and the bootstrap-kit kustomization fails its health check.

GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via
`/orgs/openova-io/packages/container/bp-velero/versions`.

Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate
stays GREEN) so blueprint-release.yaml fires on this push, publishes
`ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a
no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1
subchart, same templates, same values.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:43:13 +04:00
hatiyildiz
9975e057da deploy(bp-newapi): bump bootstrap-kit pin 1.4.19 -> 1.4.20 (auto, Refs TBD-A6) 2026-05-18 20:38:15 +00:00
github-actions[bot]
9982dcafa8 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.20 2026-05-18 20:37:26 +00:00
e3mrah
3d0c96a237
fix(bp-newapi): single-pod DB migration via startupProbe (Closes #1798) (#1857)
newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an
empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay
30s, period 10s, failureThreshold 3, i.e. a ~50s ceiling) SIGKILLs the
binary mid-migration on every restart. The 28-CREATE-TABLE +
2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with
sslmode=require — well over the kill window. On t22 chart 1.4.18 the
`newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff
restarts because every kill happened before the GORM connection
pool's first wire write completed (pg_stat_activity on the CNPG
primary showed no newapi-user connections).

Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2):
  [SYS] ... database migration started   ← last log line
  exitCode=2 finishedAt-startedAt = 50s exactly
  Readiness probe: connect: connection refused 10.42.0.185:3000
  DB: psql \\dt → "Did not find any relations"
  CNPG: pg_stat_activity → no `newapi` user connections

Fix (canonical k8s pattern, Inviolable Principle #16 — own the
seam): add a startupProbe that gates BOTH liveness and readiness
until the binary opens :3000/api/status. Budget 30 × 10s = 5 min,
comfortably above the observed 60-120s ceiling and below operator-
impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is
unchanged but only activates after startupProbe success per kubelet
semantics. The probe block is operator-tunable via
`.Values.newapi.probes.startup.*`; setting it to `null` skip-renders
the block so overlays against a pre-seeded DB can opt out
(Inviolable Principle #4).
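
The probe arithmetic above can be checked with a few lines of Go. This is a back-of-the-envelope sketch, not kubelet's exact timing model: the liveness formula assumes the first probe fires at initialDelay and the container is killed on the third consecutive failure, which matches the observed 50s, and the startup probe is assumed to have no initial delay.

```go
package main

import "fmt"

// probe holds the three timing knobs discussed above, in seconds.
type probe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// livenessKillAfter approximates when a never-healthy container is killed:
// first failure at initialDelay, then one more per period.
func livenessKillAfter(p probe) int {
	return p.InitialDelaySeconds + (p.FailureThreshold-1)*p.PeriodSeconds
}

// startupBudget approximates how long a startupProbe lets the container
// boot before giving up: failureThreshold x period.
func startupBudget(p probe) int {
	return p.InitialDelaySeconds + p.FailureThreshold*p.PeriodSeconds
}

func main() {
	liveness := probe{InitialDelaySeconds: 30, PeriodSeconds: 10, FailureThreshold: 3}
	startup := probe{PeriodSeconds: 10, FailureThreshold: 30}
	worstMigration := 120 // upper end of the observed 60-120s range

	fmt.Println("liveness kill window (s):", livenessKillAfter(liveness))
	fmt.Println("startup budget (s):      ", startupBudget(startup))
	fmt.Println("migration fits without startupProbe:", worstMigration <= livenessKillAfter(liveness))
	fmt.Println("migration fits with startupProbe:   ", worstMigration <= startupBudget(startup))
}
```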

Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so
freshly franchised Sovereigns pull the new chart on next prov.

Render tested (smoke + override): startupProbe present with
failureThreshold=30 in defaults; suppressed when startup: null.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:37:00 +04:00
e3mrah
a8931db541
fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849) (#1855)
Two independent regressions combined to fail test-bootstrap-kit.yaml on every push to main:

1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields
   asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml
   version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager
   (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak
   (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml),
   gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin
   -> Sovereign install); blueprint.yaml gets resynced down/up to match.

2. pin-sync-audit on push — full-sweep audit races the blueprint-release
   auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift
   until the auto-bump bot commits the pin update ~60s later; the bot
   push (GITHUB_TOKEN convention) does not retrigger this workflow, so
   the failure remains in run history. Fix: set continue-on-error: true
   on push/workflow_dispatch events (PR remains blocking via
   --changed-only). The full-sweep output still surfaces drift on the
   run summary; it just doesn't fail the overall run while the heal-in-
   ~60s window is open. Documented inline in the job header.

Net effect: every push to main now runs green. The 13 pre-existing
drifts called out in the existing job comment will continue to heal as
each lagging chart gets its next bump (auto-bump hook + this PR's
manifest-validation alignment).

Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs
TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence
to blueprint.yaml versions which the test asserts but the auto-bump hook
does not yet update).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
2026-05-19 00:34:48 +04:00
e3mrah
d36e54df74
test(chart): baseline CNP allow-list contract gate — guards #1785→#1803→#1847 cascade (Closes #1850) (#1854)
The May 2026 baseline-CNP cascade shipped three production bugs in
two days because nothing in CI rendered the chart and asserted on the
rendered CiliumNetworkPolicy shape:

  - #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system
    with WORLD egress restricted to TCP/443 only AND no ingress allow
    for the `catalyst` namespace.
  - #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after
    /api/v1/auth/pin-request 502'd on every fresh onboarding.
  - #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24
    fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s.

This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh —
a pure helm-template + grep/awk contract gate matching the existing
platform/self-sovereign-cutover/chart/tests/cutover-contract.sh
pattern. The Blueprint Release workflow already runs every *.sh under
chart/tests/ as a publish gate (see blueprint-release.yaml line 384),
so the gate is wired automatically and fails publish BEFORE the OCI
artifact reaches a Sovereign.

13 cases asserted:
  1. baseline-default-deny CNP renders + is namespaced to catalyst-system
  2. egress allows SMTP submission 587/TCP (#1803 regression guard)
  3. egress allows SMTPS 465/TCP (#1803 regression guard)
  4. egress allows legacy SMTP 25/TCP (#1803 regression guard)
  5. egress allows HTTPS 443/TCP to world
  6. egress allows kube-dns 53/UDP + 53/TCP
  7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847)
  8. ingress allows `flux-system` (HelmRelease readiness probes)
  9. ingress allows `kube-system` (operator + ccm + CoreDNS)
 10. ingress is namespace-scoped — no fromEntities:{cluster|world|all} wildcard
 11. catalyst-api Service exposes port 8080 (auto-trigger contract)
 12. CNP toggles off cleanly with security.baselineCnp.enabled=false
 13. allowedIngressNamespaces propagates via --set (operator-tunable)

Negative-test confirmation (executed locally before commit):
  - Remove SMTP 587 from template → Case 2 FAILS, exit 1
  - Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1
  - Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1
  - Restore originals → all 13 cases PASS, exit 0

Refs: TBD-A18, PRs #1785 #1803 #1847, audit /tmp/audit-recent-prs-quality-report.json
Closes #1850

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 00:32:28 +04:00
github-actions[bot]
82e972fb77 deploy: update catalyst images to 75cb059 2026-05-18 20:26:21 +00:00
e3mrah
75cb059fc0
Merge pull request #1851 from openova-io/fix/a16-hetzner-ssh-key-sweep
fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)
2026-05-19 00:24:19 +04:00
github-actions[bot]
e78faa986c deploy: update catalyst images to f07312c 2026-05-18 20:23:49 +00:00
e3mrah
f07312c5ae
fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852)
Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179
+ bootstrap-kit pin bump + cloud-init substitute extension, because each
fix is small and they share the same fresh-prov verification cycle.

TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list
networkpolicies at cluster scope. Add networking.k8s.io/networkpolicies
get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the
chroot in-cluster fallback's k8sCache.Factory reflector emitted
continuous `networkpolicies is forbidden` errors at the cluster scope
because only update/patch/delete were granted (existing mutation block)
— the read path was never wired. Mirrors the existing
cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s
NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7).

TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields
configuredRegions / controlPlaneIP / primaryRegion / replicaRegion /
selfDeploymentId / enableHotStandby / qaApplications empty on every
fresh prov. Pre-fix the envsubst placeholders resolved to empty because
nothing wrote them into the bootstrap-kit Kustomization postBuild
substitute map → the chart rendered empty strings → Dashboard
SovereignCard configured-regions chips, Settings page operator-identity,
/api/v1/sovereign/self, and the D31 active-hot-standby gating ALL
silently fell through to default behaviour. Wired via three coordinated
changes:
  - Chart values.yaml gains global.sovereignSelfDeploymentId default
  - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId,
    sovereign.configuredRegions, sovereign.qaApplications mappings
    (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`)
  - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP
    (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION /
    SOVEREIGN_REPLICA_REGION (canonical 4-segment labels),
    SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty),
    SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list),
    QA_APPLICATIONS_YAML (reserved, default `[]`)
  - main.tf: new template inputs sovereign_configured_regions_yaml +
    replica_region_canonical_label (derived from local.secondary_regions),
    threaded into both primary CP and per-secondary-region cloud-init
    templatefile calls

TBD-A10b (issue #1845) — GET
/api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409
kubeconfig-file-missing on fresh prov for every region. Pre-fix the
handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init
PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region
key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk
filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly
call with the bare `cloudRegion` (`?region=hel1`) because that's the
matrix-doc-friendly form. Fall-back resolution order added to
GetKubeconfig: exact-name first (legacy + manual operator PUT), then
`<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test
covers all three paths: exact match, slot-suffix glob, unknown-region
still 409. Closes the regression introduced when PR #1763
(mothership→chroot kubeconfig handover hook) started using the
cloud-init naming convention for fan-out exports.

Closes #1843, Closes #1844, Closes #1845

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 00:21:38 +04:00
hatiyildiz
6e883c1f8b fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)
Third match pass for SSH keys whose name AND label both drifted from the
Tofu canonical emission. The OpenSSH public_key comment is the one piece
of metadata that survives Console-rename, partial tofu apply, and
out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical
prefix into it at generation.

Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh
t25 provs because previous wipe cycles left it as an orphan. Label-pass
+ name-prefix-pass had no signal once the name/label drifted.

Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by
TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping
t2.omantel.biz cannot delete t20.omantel.biz's SSH key.
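
A rough sketch of what a boundary-aware prefix check on the public_key
comment could look like. The separator set, the example prefix shape
(catalyst-t2), and both function names are assumptions made for
illustration; the real separator rune table is pinned by
CommentMatchesPrefix_BoundaryRules and is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// publicKeyComment returns the trailing comment of an OpenSSH public key
// line of the form "<type> <base64> <comment...>", or "" if there is none.
func publicKeyComment(line string) string {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return ""
	}
	return strings.Join(fields[2:], " ")
}

// matchesPrefixAtBoundary reports whether s starts with prefix AND the
// prefix ends at a separator (or at the end of s), so a prefix ending in
// "t2" does not match a value continuing with "t20".
// Assumption: the separator set "-._@" is hypothetical.
func matchesPrefixAtBoundary(s, prefix string) bool {
	if prefix == "" || !strings.HasPrefix(s, prefix) {
		return false
	}
	rest := s[len(prefix):]
	if rest == "" {
		return true
	}
	return strings.ContainsRune("-._@", rune(rest[0]))
}

func main() {
	prefix := "catalyst-t2" // hypothetical canonical prefix for the t2 tenant
	for _, line := range []string{
		"ssh-ed25519 AAAAC3Nz catalyst-t2-omantel-biz",
		"ssh-ed25519 AAAAC3Nz catalyst-t20-omantel-biz",
		"ssh-ed25519 AAAAC3Nz",
	} {
		comment := publicKeyComment(line)
		fmt.Printf("%q -> %v\n", comment, matchesPrefixAtBoundary(comment, prefix))
	}
}
```

A plain strings.HasPrefix would accept the t20 comment for the t2
prefix; the extra separator check is what rejects it.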

Tests:
  - PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match)
  - PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin)
  - PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes)
  - PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched)
  - PublicKeyComment_ParsesFormats (OpenSSH parser unit pins)
  - CommentMatchesPrefix_BoundaryRules (separator rune table)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:15:51 +02:00
hatiyildiz
7a2cad9a47 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.177 -> 1.4.178 (auto, Refs TBD-A6) 2026-05-18 19:46:12 +00:00
e3mrah
31b7dc5859
fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847)
PR #1785 (chart 1.4.171) shipped a baseline default-deny
CiliumNetworkPolicy in catalyst-system whose ingress allowlist was
limited to:

  - reserved.ingress: "" (cilium-gateway endpoint)
  - same-namespace catalyst-system Pods
  - host / remote-node / kube-apiserver entities

The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst`
namespace, including the 10-auto-trigger Job whose Pod curls
catalyst-api.catalyst-system.svc.cluster.local:8080 to fire
/api/v1/internal/cutover/trigger.

With #1785 in effect on a FRESH prov, every auto-trigger Pod times
out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and
the D0 auto-redirect to the Sovereign Console never happens — the
operator is stuck on mothership /jobs forever.

Caught by t24 zero-touch verification (2026-05-18):

  handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst'
  ns cannot reach catalyst-api in 'catalyst-system' ns because
  baseline-default-deny CNP allows ingress only from {reserved.ingress,
  catalyst-system ns, host entities}"

The companion symptom on t22 was masked because t22's cutover Job
had already completed before the CNP rolled out — the CNP did not
gate ingress there.

Fix
─────────────────────────────────────────────────────────────────
Add a fourth ingress rule to baseline-default-deny allowing
fromEndpoints in the operator-tunable list
.Values.security.baselineCnp.allowedIngressNamespaces. Defaults:

  - catalyst       — cutover Pods (the load-bearing fix)
  - flux-system    — Helm/Kustomize/Source controllers probing
                     Service readiness for HelmRelease health
                     rollups (worked pre-#1785 via no-CNP default)
  - kube-system    — Cilium operator + hcloud-ccm + CoreDNS that
                     do cluster introspection calls (the
                     reserved.ingress gateway endpoint here is
                     still matched by rule 1's reserved.ingress: ""
                     selector — this rule covers non-gateway Pods)

The list mirrors the existing allowedPlatformNamespaces pattern on
the egress side. No other rule semantics change.

Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177
(PR #1803, SMTP egress) — both are sub-regressions from the same
#1785 baseline-CNP ship.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:45:28 +04:00
hatiyildiz
61948474b5 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.176 -> 1.4.177 (auto, Refs TBD-A6) 2026-05-18 19:28:52 +00:00
e3mrah
153fcf9419
fix(cnp): allow SMTP egress (587/465/25) from catalyst-system — fixes PIN-issue 502 regression from #1785 (#1803)
PR #1785 (chart 1.4.171) shipped a baseline-default-deny CiliumNetworkPolicy
in catalyst-system whose world-egress block was restricted to TCP/443 only.
That silently broke SMTP submission from catalyst-api to the operator
Stalwart relay (mail.openova.io), surfacing as 502s at
/api/v1/auth/pin-request — customer journey step 11/12 (PIN-issue email
delivery) is now blocked on every fresh Sovereign onboarding flow.

DIAGNOSTIC EVIDENCE
-------------------
- CNP `baseline-default-deny` in catalyst-system was created at
  2026-05-18 18:13:09Z (the moment chart 1.4.171 rolled out).
- Egress rule:
    toEntities: [world]
    toPorts:    [443/TCP]
  i.e. only HTTPS world egress permitted.
- A Pod in catalyst-system cannot `nc 45.151.123.50 587` (timeout).
- A Pod in the default namespace on the SAME node connects fine
  and receives the `220 Stalwart ESMTP` banner — confirming the
  block is policy-driven, not network/host-firewall driven.

FIX
---
Extend the world-egress block in
products/catalyst/chart/templates/network-policies/baseline-catalyst-system.yaml
to permit, in addition to the existing 443/TCP:

  - 587/TCP — SMTP submission (the production path to mail.openova.io)
  - 465/TCP — SMTPS (fallback)
  - 25/TCP  — legacy SMTP (fallback)

All four ports are scoped to `toEntities: [world]`, matching the
existing 443 allow. No other rule semantics change — same-namespace,
cluster-DNS, kube-apiserver, and platform-namespace allows are
untouched. The 25/TCP allow is included only as a legacy fallback;
production traffic is on 587.

A "Regression context — DO NOT NARROW THIS BLOCK WITHOUT REVIEW"
comment is added inline so the next reviewer who tightens the block
sees the failure mode that drove the widening.

CHART
-----
1.4.176 → 1.4.177. Changelog entry added under the 1.4.176 block,
above the version line, describing the regression + fix.

VERIFICATION
------------
`helm template products/catalyst/chart` renders the updated CNP with
four ports (443/587/465/25) under the world egress block; all other
rules byte-identical to 1.4.176.

Refs PR #1785 (the regression source), Issue #1746 (the original
baseline-CNP work).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-18 23:28:19 +04:00
github-actions[bot]
732f2363b9 deploy: update catalyst images to c422c97 2026-05-18 19:16:52 +00:00
e3mrah
c422c97b97
fix(catalyst-api): publish body→query translation + rbac/assign CRD-NotFound detection (Refs TBD-C4-fup, TBD-C6-006-followup) (#1802)
TBD-C4-fup — publish body→query translation regression guard:
- Adds sme_catalog_client_test.go pinning the wire shape on
  smeCatalogClient.SetPublished. The C4-012 / #1735 fix (PR #1789)
  translates the chroot's {"published":true} JSON body into the
  upstream catalog's ?value=true|false query param shape that
  services-catalog SetAppPublished (handlers.go:303-313) requires.
  Wave 35 cov-bench v3 surfaced 400 here because the deploy bot
  hadn't bumped catalyst-api past e2c56c3 (PR #1787) when the
  bench ran — PR #1789's translation was already in the merged
  code but not in the live image. The test pins URL +
  ?value=<bool> + empty body so any future revert fires.

TBD-C6-006-followup — RBAC assign 500 → 503:
- Root cause: UserAccess is a NAMESPACED Crossplane Claim per the
  XRD's claimNames block (platform/crossplane-claims/chart/
  templates/xrds/useraccess.yaml). rbacAssignNamespace = "" routed
  the dynamic Create to the apiserver's cluster-scoped REST path
  /apis/access.openova.io/v1alpha1/useraccesses, which the
  apiserver doesn't serve for a namespaced CRD — returns 404 with
  "the server could not find the requested resource". PR #1789's
  apierrors.IsNotFound→503 wrapper never fired because the 404 was
  for the route, not the resource.
- Fix: pin rbacAssignNamespace = "catalyst-system" and stamp it on
  every Create. Mirrors user_access_owner_seed.go's t134 D21 fix
  (userAccessOwnerNamespace = "catalyst-system"). Lists keep
  Namespace("") for cross-namespace listing (valid against a
  namespaced CRD — apiserver returns the union).
- Defense in depth: isCRDNotInstalledErr() string-fallback for
  "the server could not find the requested resource" / "no matches
  for kind" — apierrors.IsNotFound can lose StatusReasonNotFound
  through error-chain wrapping. Mirrors
  catalog_client_cluster_fallback.isVersionNotServed.
- user_access.go: same defect class — CreateUserAccess /
  UpdateUserAccess / tryDeleteUserAccess all called .Namespace("")
  on a namespaced CRD. CreateUserAccess now stamps
  rbacAssignNamespace; Update + Delete walk the all-namespaces
  list via findUserAccessByName() to discover the canonical ns
  before issuing the mutation against that exact REST path.

Tests:
- TestSetPublished_SendsQueryParamNotBody (regression guard for
  TBD-C4-fup)
- TestHandleRBACAssign_CreateStampsNamespace (regression guard for
  TBD-C6-006-followup namespace fix)
- TestIsCRDNotInstalledErr_StringFallback (regression guard for
  defense-in-depth detection)
- Existing test reads updated to use rbacAssignNamespace instead
  of Namespace("") (no behavioural change — the fake dynamic
  client routes accurately now)

Refs TBD-C4-fup
Refs TBD-C6-006-followup

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:14:40 +04:00
hatiyildiz
0293318a3a deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.175 -> 1.4.176 (auto, Refs TBD-A6) 2026-05-18 19:14:22 +00:00
github-actions[bot]
fbbf1b395f deploy: update sme service images to 989328d + bump chart to 1.4.176 2026-05-18 19:13:00 +00:00
hatiyildiz
da28ae6936 deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.174 -> 1.4.175 (auto, Refs TBD-A6) 2026-05-18 19:12:31 +00:00