openova

Author	SHA1	Message	Date
hatiyildiz	2e1826abb4	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.181 -> 1.4.182 (auto, Refs TBD-A6)	2026-05-18 23:49:51 +00:00
github-actions[bot]	5a25c254a1	deploy: update sme service images to `5d5c557` + bump chart to 1.4.182	2026-05-18 23:49:14 +00:00
e3mrah	5d5c55739e	fix(notification): retry-backoff on Stalwart 503 5.5.1 rate-limit (#1876 ) When Stalwart trips its rate-limit and returns "503 5.5.1", the notification service previously surfaced the error immediately to the events consumer, which kept hammering on the next event and prolonged the rate-limit window. Now Mailer.Send detects 503 5.5.1 specifically (via textproto.Error unwrap + canonical-code substring fallback) and retries up to 3 times with a 60s backoff between attempts. The backoff is configurable via SMTP_RETRY_BACKOFF env var (Go duration string OR bare integer seconds; 30s floor to keep the rate-limiter happy). Non-rate-limit errors (auth failure, transient I/O, etc.) bubble up unchanged so the consumer can NACK / dead-letter as appropriate. Adds smtp_test.go covering: - single rate-limit -> retry -> success - exhausted retries -> wrapped error preserving textproto.Error - non-rate-limit error -> immediate pass-through, no backoff - isRateLimit detection (textproto, multiline 503-5.5.1, negative cases) - parseRetryBackoff env-var forms + 30s floor + zero/garbage fallbacks No credential touches: this is a retry-hardening fix only; the chart-side SMTP creds path is already GREEN (see #1793 A80 diagnosis). Refs #1793 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:47:58 +04:00
e3mrah	3b4c130129	fix(bootstrap-kit): cutover dependsOn sovereign-tls — wait for Gateway TLS before HelmRepository URL rewrite (Closes #1871 ) (#1875 ) TBD-A24 cutover↔gateway circular deadlock — discovered on t26 zero-touch prov 2026-05-18 (99bb823cb0513f4b): 1. bp-catalyst-platform HR installs at v1.4.179 (Ready=True) 2. bp-self-sovereign-cutover HR Ready=True (deps gitea+harbor only) 3. Step-06 rewrites all 50 HelmRepository URLs ghcr.io → registry.<fqdn> 4. bp-catalyst-platform flips Ready=False (TLS handshake EOF — no Gateway) 5. sovereign-tls Kustomization blocked on bootstrap-kit Ready=True 6. bootstrap-kit blocked on bp-catalyst-platform Ready=True 7. Full deadlock — no Gateway, no handover, every UI route 404 Fix: add `sovereign-tls` as a third dependsOn entry on the cutover HR so Flux waits for the Cilium Gateway to be serving TLS before the URL rewrite fires. Same architectural shape as Wave 7 bp-hcloud-csi removal (#1610) — chicken-and-egg between bootstrap-kit and sovereign-tls broken by ordering the dangerous-side-effect chart AFTER the Gateway is ready. Also updates scripts/expected-bootstrap-deps.yaml so the dep-graph audit (check-bootstrap-deps.sh) recognises the new edge: slot 6a gets the extra `sovereign-tls` entry, plus a new "slot 0t" entry declaring sovereign-tls as a known node (no HR file on disk → audit reports it as `deferred`, info not error; Phase 4 cycle detection accepts it as a zero-in-degree root). Verified locally: - yq parses spec.dependsOn → 3 entries (bp-gitea, bp-harbor, sovereign-tls) - scripts/check-bootstrap-deps.sh: 50 present, 65 declared, 0 drift, 0 cycles - helm template platform/self-sovereign-cutover/chart: exit 0 (smoke OK) Refs: t26 ID 99bb823cb0513f4b, A55 diagnostic, A67 diagnosis, slot 17a comment in clusters/_template/bootstrap-kit/kustomization.yaml documenting the same chicken-and-egg shape. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:19:55 +04:00
e3mrah	06bea550ff	feat(ci): TBD-A26 pin-sync audit verifies GHCR artifact exists for each bootstrap-kit pin (#1874 ) The existing TBD-A6 + TBD-A20 system catches drift between Chart.yaml, bootstrap-kit pin, and blueprint.yaml spec.version AFTER chart-publish commits land on main, but it cannot detect the "chart bumped but never published" failure mode: the bootstrap-kit pin points at a chart version that GHCR never received because blueprint-release.yaml failed (e.g. TBD-A20 YAML scanner break, race with TBD-A20 lockstep, runner cancellation, transient GHCR push 5xx). Concrete observed failure (2026-05-18/19): bp-catalyst-platform 1.4.180 and 1.4.181 were "lost" during the TBD-A20 scanner break window (21:04Z → 22:07Z). The pin sync audit reported chart=pin=1.4.181 PASS while ghcr.io/openova-io/bp-catalyst-platform:1.4.181 did NOT exist until A58 manually re-fired the workflow via dispatch. Fresh Sovereigns silently fell back to the last working tag. What this adds - scripts/check-bootstrap-kit-pin-sync.sh gains `--check-ghcr` (and optional `--ghcr-org <org>`). For every chart pinned in the kit, it lists ghcr.io/<org>/<chart> tags via `gh api /orgs/<org>/packages/container/<chart>/versions --paginate`, then asserts the pinned version appears. Exits 1 on any missing tag. - A per-chart tag cache avoids redundant paginations. - .github/workflows/test-bootstrap-kit.yaml `pin-sync-audit` job now passes `--check-ghcr` on `push` to main + `workflow_dispatch` (PR mode stays `--changed-only` and skips GHCR — PRs cannot publish to GHCR anyway). The job stays `continue-on-error: true` under the same observational umbrella as the existing post-merge full sweep so a transient API blip cannot red-flag every chart bump; the missing-tag list still surfaces on the run summary for operator attention. - Job grants `packages: read` so the workflow GITHUB_TOKEN can list private package versions. 
Verification (origin/main snapshot, 2026-05-19) - Full sweep default: 50/50 chart→pin pairs OK, no GHCR check. - Full sweep `--check-ghcr`: 50/50 pairs OK AND 50/50 GHCR tags present — PASS exit 0. - Negative test: with products/catalyst/chart/Chart.yaml + slot 13 both set to a non-existent 99.99.99, the script exits 1 with `GHCR MISS bp-catalyst-platform:99.99.99 — tag NOT FOUND` and the remediation hint pointing at `gh workflow run blueprint-release.yaml`. - `--changed-only --base origin/main` against a no-change tree: clean exit 0 with the existing "nothing to check" message. Refs #1872, #1864, #1856. Closes #1872 Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:12:13 +04:00
e3mrah	a7cd2fc21f	docs(principles): add 3 session-2026-05-18 principles (validate-vs-origin / GHCR-tag-check / cutover-dependsOn-Gateway) (#1873 ) Adds three new inviolable principles surfaced by 2026-05-18 incidents: - #12 Never validate against the local working tree — A19 false-positive (verifier grepped a feature-branch working copy with unstaged edits, reported "already on main" when it was not). - #13 Chart-pin bumps must match a GHCR tag that exists — TBD-A48 / PR #1869 drift: pin to bp-self-sovereign-cutover:0.1.4 landed on main while the chart artifact had not been published, causing hours of ImagePullBackOff. - #14 Cutover-style HRs that rewrite HelmRepository URLs must dependsOn Gateway readiness — TBD-A24 / PR #1871: bp-self-sovereign-cutover flipped URLs to local registry before Cilium Gateway was serving TLS, deadlocking the cluster. Doc-only change; bumps the front-matter Updated date to 2026-05-18. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 03:09:26 +04:00
hatiyildiz	26e4c8e30e	deploy(bp-guacamole): bump bootstrap-kit pin 0.1.25 -> 0.1.26 (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version 0.1.25 -> 0.1.26 (Refs TBD-A20, #1856).	2026-05-18 22:20:35 +00:00
github-actions[bot]	8ce7c02aa9	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.26	2026-05-18 22:19:59 +00:00
e3mrah	1b87d38e94	deploy: catch-up pins for bp-catalyst-platform 1.4.181 + bp-guacamole 0.1.25 (post #1866 fix) (#1869 ) Catch-up for drift introduced during the Blueprint Release workflow outage 21:04:22Z (PR #1858 merge with YAML scanner break) → 22:07:49Z (PR #1866 fix). Charts published in that window: - bp-catalyst-platform 1.4.180 → 1.4.181 (umbrella) - bp-guacamole 0.1.24 → 0.1.25 Auto-bump-pin step didn't fire during the outage. A39 already caught up bp-newapi (PR #1865). This PR catches up the remaining 2. Refs #1864, PR #1866 (workflow fix), PR #1858 (root cause). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 02:19:34 +04:00
hatiyildiz	66fa508b74	deploy(bp-newapi): bump bootstrap-kit pin 1.4.21 -> 1.4.22 (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version 1.4.21 -> 1.4.22 (Refs TBD-A20, #1856).	2026-05-18 22:11:05 +00:00
e3mrah	22e046b554	Merge pull request #1866 from openova-io/fix/1864-workflow-yaml-startup-failure fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864)	2026-05-19 02:07:48 +04:00
hatiyildiz	69f2d7d91a	fix(ci): TBD-A6 auto-bump-pin must trigger after chart-publish commits even when TBD-A20 lockstep ran (Refs #1864) Root cause of the auto-bump-pin miss flagged in #1864. The Blueprint Release workflow has been in `startup_failure` since PR #1858 (commit `cf35b4a`) merged at 21:04:22Z. The lockstep step's multi-line shell heredoc inside a `run: |` block-scalar: if [ ... ]; then msg="deploy(...) (auto, Refs TBD-A6) <-- literal blank line Also locksteps platform blueprint.yaml ..." <-- column 1, no indent is interpreted by the YAML scanner as the END of the block-scalar at the blank line, and the next column-1 line is then parsed as a new top-level mapping key, which fails because the previous mapping isn't terminated. The whole workflow file is rejected at workflow-startup time. Verified with `python3 -c yaml.safe_load(...)` (raises `ScannerError: could not find expected ':' line 815`) and by `gh api .../actions/runs/26060392136` returning `conclusion=failure, status=completed, jobs: []` for every push since `cf35b4a`. Consequence: no chart bump since `cf35b4a` has triggered the TBD-A6 auto-bump-pin or the TBD-A20 blueprint.yaml lockstep. PR #1865 was the manual catch-up for bp-newapi (1.4.20 -> 1.4.21); without this fix every future chart publish will drift the same way. Fix: build the multi-line commit message with `printf '%s\n\n%s'` so the string source stays on physically-indented lines that the YAML block-scalar accepts. Behaviour is identical (same commit subject, same blank line, same body); only the construction shape changes. Added a 9-line comment naming the seam so future authors don't reintroduce the same trap. Verified locally: * `python3 -c yaml.safe_load(open(...))` succeeds, parses 24 build-job steps. * `CHART_NAME=bp-newapi PREV_VERSION=1.4.20 CHART_VERSION=1.4.21 BP_PREV_VERSION=1.4.20 bash -c "$(printf ...)"` emits the canonical "deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (auto, Refs TBD-A6)\n\nAlso locksteps platform ..." body. Refs #1864. Refs PR #1858 (TBD-A20 lockstep that introduced the YAML defect).	2026-05-19 00:07:07 +02:00
github-actions[bot]	c64220f8cc	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.22	2026-05-18 22:05:58 +00:00
e3mrah	1e1fe26e02	Merge pull request #1865 from openova-io/fix/1864-bp-newapi-pin-catchup deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race)	2026-05-19 02:05:33 +04:00
hatiyildiz	f57f62764b	deploy(bp-newapi): bump bootstrap-kit pin 1.4.20 -> 1.4.21 (catch-up after TBD-A23 / TBD-A20 race) Closes #1864 Manual catch-up. The auto-bump-pin step (TBD-A6) did NOT run for the 1.4.20 -> 1.4.21 chart bump at commit `8b33188` because the Blueprint Release workflow has been stuck in startup_failure since PR #1858 (commit `cf35b4a`) merged at 21:04:22Z. The workflow YAML at .github/workflows/blueprint-release.yaml lines 812-814 has a multi-line heredoc string inside a `run: |` block-scalar whose continuation lines are unindented: msg="deploy(${CHART_NAME}): bump bootstrap-kit pin ${PREV_VERSION} -> ... (auto, Refs TBD-A6) Also locksteps platform blueprint.yaml spec.version ${BP_PREV_VERSION} ..." YAML treats the unindented line as the end of the block-scalar and the next line as a new mapping key (which it isn't), so the entire workflow file fails the GitHub Actions YAML validator at workflow-start time. Every push since `cf35b4a` has produced a run with `conclusion=failure, status=completed, jobs=[]` (zero jobs spun up). Evidence: * gh api repos/openova-io/openova/actions/runs/26060392136 -> 'This run likely failed because of a workflow file issue.' * Same for every subsequent run including the chart 1.4.21 publish (no run was even created for `8b33188` because the workflow file couldn't parse). * `python3 -c 'yaml.safe_load(open(...))'` raises `ScannerError ... could not find expected ':' line 815`. This PR is the ONE-LINE catch-up so the pin drift is closed. A companion PR fixes the workflow YAML so future chart bumps auto-bump the pin again.	2026-05-19 00:04:40 +02:00
github-actions[bot]	6b11734a81	deploy: update sme service images to `4a61543` + bump chart to 1.4.181	2026-05-18 21:48:56 +00:00
e3mrah	4a61543957	test(tenant): wire round-trip for tenant.created owner_email contract (#1863 ) Verifies the publisher-side wrapper struct in CreateOrg (handlers.go:248-252) marshals to bytes the provisioning consumer in organization_create.go can decode flat with owner_email as a sibling field. Pairs with TestHandleTenantCreated_FullTenantStructDecode on the consumer side — together they pin BOTH ends of the contract so a refactor that nests under "tenant" or renames the tag fails in CI rather than at staging. Refs #1829 (D29). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 01:47:38 +04:00
github-actions[bot]	de86df1126	deploy: bump sandbox-controller image to `a405572`	2026-05-18 21:46:29 +00:00
github-actions[bot]	a09445482b	deploy: bump sandbox-mcp-server image to `a405572`	2026-05-18 21:44:53 +00:00
github-actions[bot]	4fd3aae99b	deploy: bump application-controller image to `a405572`	2026-05-18 21:44:52 +00:00
e3mrah	a40557227e	fix(controllers): NATS consume-leg for D35 (organization + sandbox) (#1862 ) PR #1626 wired the publish-leg (tenant + billing → NATS JetStream catalyst.<domain>.<event>). The consume-leg was missing: no in-cluster controller subscribed, so D35 (NATS round-trip end-to-end) stayed yellow even though the publish leg shipped. This PR adds: - core/controllers/pkg/natsbus: minimal JetStream subscriber shared by Group-C controllers. Self-contained (no dep on core/services/shared which pulls in franz-go/Kafka the controllers never touch). - core/controllers/organization/internal/controller/nats_bridge.go: subscribes to catalyst.tenant.created + catalyst.billing.order.placed, patches openova.io/last-event-observed-at + ...-subject annotations on the matching Organization CR. The annotation patch triggers an informer event → controller-runtime enqueues Reconcile within ~50ms instead of waiting for the 30s requeue fallback. - core/controllers/sandbox/internal/controller/nats_bridge.go: same pattern for catalyst.tenant.sandbox_requested. Looks up Sandbox CR using the same `sandbox-<sanitised-email>` naming convention tenant-service's SandboxOrchestrator (PR #1633) writes under. - main.go wiring in both controllers reads NATS_URL from env. Unset = log "consume-leg disabled" + continue (informer requeue fallback intact). The 30s RequeueAfter inside r.Reconcile is unchanged — NATS is an accelerator, not the only path. Idempotency: ev.Timestamp is the broker-side time stamp, so duplicate JetStream delivery produces a byte-stable annotation patch and controller-runtime does NOT enqueue a redundant Reconcile. Tests cover Ack/Nak/Ack-to-skip dispatch (subscriber_test.go), the happy path, the no-matching-CR soft miss, duplicate-envelope no-churn, malformed JSON poison-pill, and the publish-side ↔ consume-side name derivation lockstep for Sandbox CRs. 
HARD CONSTRAINT respected: no credential mutations — bridges read only the envelope + the target CR, never Secrets or Keycloak SA creds. Refs #1835 (D35 round-trip end-to-end), Refs #1776 (D35b sandbox). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:43:08 +04:00
github-actions[bot]	376cd7d14c	deploy: update catalyst images to `c86fb3d`	2026-05-18 21:36:21 +00:00
e3mrah	c86fb3d1dc	fix(catalyst-api): seed full 4-entry .omani.X sme-pool (D30 / #1830 ) (#1861 ) LoadSMETenantParentDomainsFromEnv's hardcoded-fallback only seeded 2 entries (omani.works + omani.trade), but the marketplace UI (core/marketplace/src/components/AddonsStep.svelte) lists 4 (omani.homes + omani.rest + omani.trade + omani.works) and core/services/domain/store.AllowedTLDs has the same canonical 4. Result: a customer picking .omani.homes or .omani.rest in /addons sailed through the picker but got 422 invalid-parent-domain at catalyst-api signup because FindParentDomain didn't recognise the TLD. This widens the seed to all 4 canonical .omani.X entries so the backend pool, the marketplace picker, and AllowedTLDs all agree. NSFlipReady=true on every entry (the zones are already delegated to the Sovereign's PowerDNS at gTLD level — pdmFlipNS short-circuits via nsAlreadyMatches for Day-2 re-adds). Updated TestLoadSMETenantParentDomainsFromEnv_StubFallback (`pool != 4`) and added 3 fresh tests in sovereign_parent_domains_test.go covering: canonical 4-entry seed, OTECH primary + 4 sme-pool composition, env-override path without fallback leakage. Closes #1830 (Part 1 — Day-1 pool seed). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 01:34:24 +04:00
github-actions[bot]	a6b5752391	deploy: update sme service images to `b214566` + bump chart to 1.4.180	2026-05-18 21:28:12 +00:00
e3mrah	b214566c1a	fix(provisioning): create Organization CR on tenant.created (C16 root cause) (#1860 ) Closes the voucher-checkout → Organization-CR loop that was missing from the convergence chain. Before this PR the flow stalled at: voucher accept → tenant-service CreateOrg → writes Tenant row, publishes tenant.created ↓ (DROP — no consumer) provisioning consumer switch (case "tenant.created" missing — A26 verifier pinpointed this) ↓ organization-controller has nothing to reconcile ↓ no vCluster / Keycloak group / Gitea org / per-tenant HTTPRoute A26 verifier on t22: zero Organization CRs after 168min despite the tenant row existing. Closes #1722. Unblocks D29 zero-touch tenant provisioning (Refs #1829). Changes: - core/services/tenant/handlers/handlers.go Enrich tenant.created payload with owner_email from JWT claims so the provisioning consumer can mint the Organization owner roster without a second store round-trip. Wrapper struct embeds *Tenant so existing decoders are wire-compatible. - core/services/provisioning/handlers/consumer.go Add case "tenant.created" to the dispatch switch. - core/services/provisioning/handlers/organization_create.go New handler. Validates slug + owner_email, builds cluster-scoped Organization CR (apiVersion orgs.openova.io/v1), POSTs via k8sRequest. Idempotent on 409 AlreadyExists (NATS redelivery safe). 404 → operator-misconfiguration error event. 5xx → return err so broker redelivers. Inviolable Principle #4: parent domain flows env → Handler.TenantParentDomain → CR (with per-tenant parent_domain payload override for multi-pool Sovereigns). - core/services/provisioning/handlers/organization_create_test.go Unit tests: malformed payload, invalid slug (incl. path-traversal), missing owner_email, full Tenant decode, default-fill paths, empty parent domain mints anyway, payload-shape pinning. All exercised with KUBERNETES_SERVICE_HOST scrubbed so no real apiserver dial. 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:26:57 +04:00
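The wire-compatibility claim in the commit above (a wrapper struct that embeds *Tenant still decodes flat, with owner_email as a sibling field) follows from how encoding/json promotes embedded struct fields. The sketch below is illustrative: the `Tenant` fields, the `tenantCreatedPayload` and `consumerView` type names, and the `roundTrip` helper are all hypothetical; only the embedding of *Tenant and the `owner_email` key come from the message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Tenant stands in for the tenant-service row; its fields are hypothetical.
type Tenant struct {
	ID   string `json:"id"`
	Slug string `json:"slug"`
}

// tenantCreatedPayload embeds *Tenant, so encoding/json promotes the tenant
// fields to the top level and emits owner_email next to them.
type tenantCreatedPayload struct {
	*Tenant
	OwnerEmail string `json:"owner_email"`
}

// consumerView is a hypothetical flat decode target on the consumer side.
type consumerView struct {
	Slug       string `json:"slug"`
	OwnerEmail string `json:"owner_email"`
}

// roundTrip marshals the publisher-side wrapper and decodes it twice: once
// into the flat consumer struct and once into a raw key map for inspection.
func roundTrip(t *Tenant, owner string) (consumerView, map[string]json.RawMessage, error) {
	var view consumerView
	var keys map[string]json.RawMessage
	wire, err := json.Marshal(tenantCreatedPayload{Tenant: t, OwnerEmail: owner})
	if err != nil {
		return view, nil, err
	}
	if err := json.Unmarshal(wire, &view); err != nil {
		return view, nil, err
	}
	err = json.Unmarshal(wire, &keys)
	return view, keys, err
}

func main() {
	view, keys, err := roundTrip(&Tenant{ID: "t-1", Slug: "acme"}, "owner@example.com")
	if err != nil {
		panic(err)
	}
	_, nested := keys["tenant"]
	fmt.Println(view.Slug, view.OwnerEmail, "nested under tenant:", nested)
}
```

A refactor that gave the embedded field a name or a `json:"tenant"` tag would nest the fields and break the flat decode, which is the kind of change a contract test on both ends is meant to catch.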
github-actions[bot]	2457926e16	deploy: update catalyst images to `7ec73f9`	2026-05-18 21:16:51 +00:00
e3mrah	7ec73f9e2b	fix(catalyst-api): handler test baseline GREEN - 6 failing tests fixed (Closes #1853) (#1859) Per-test root cause + fix: 1. TestPinIssue_ConcurrentRapidFireRateLimit (TOCTOU race): pinStore.canIssue and put() ran under separate mutex acquisitions; three concurrent /pin/issue goroutines all observed "no entry", passed canIssue, then raced EnsureUser against Keycloak. Replaced with atomic tryReserve() that check-and-stamps under a single lock; HandlePinIssue calls store.drop(email) on EnsureUser/generatePin/no-KC failure to roll back the reservation so the 60s cooldown doesn't punish operator retries. 2. TestFinaliseHandover_FullFlow: test fixture drift after PR #1487 keyed the tofu workdir by DeploymentID (provisioner.workdirKey). Test still wrote workdir at filepath.Join(tmp, "tenant-y-omani-works") (the legacy sovereign-name slug); FinaliseHandover handler uses `id`. Updated test to write workdir at filepath.Join(tmp, "dep-full") so it matches the actual prod lookup path. Same fix for the receiver-failure sibling test. 3. TestEnsureOwnerUserAccess_CreatesCanonicalCR: drifted twice: (a) test queried Namespace("") but the t134 D21 fix moved the CR to userAccessOwnerNamespace ("catalyst-system") because useraccesses is namespaced per the XRD claimNames block; (b) test asserted spec.applications = [{app:"", role:"admin"}] but the t135 D21 fix switched to spec.tierRoleRef = "openova:tier-owner" (XRD pattern rejects `app: ""`). Updated test to query catalyst-system namespace and assert tierRoleRef + applications-must-be-absent. 4. TestUnstructuredToUserAccess_NilApplicationsBecomesEmpty: production unstructuredToUserAccess left Spec.Applications=nil when the CR has no spec.applications, which json-marshals to `null` and crashes the React UI's items.map() (qa-loop iter-4 users-page-null-map regression). Initialize Spec.Applications = []userAccessAppGrantBody{} in the struct literal so the empty-slice contract is preserved. 5. TestHandleWhoami_PinSessionRBACClaims: whoamiInjectTierRoles unconditionally appended every inherited tier role even when the upstream JWT already shaped the role list authoritatively. A PIN-minted session carrying tier=owner + realm_access=[catalyst-owner] was getting fanned out to all 5 inheritance entries, which the route-guard couldn't reconcile. Now: if the operator's own catalyst-<tier> role is already present, the projection returns early and preserves the upstream list. TestHandleWhoami_ProjectsTierToRealmRoles still passes (empty input → still injects inheritance) and TestWhoamiInjectTierRoles_PreservesExistingRoles still passes (idempotent: same input out). 6. TestHandleWhoami_NoRBACOmitsFields: whoamiResponse.RealmAccess was a struct value with `omitempty`, which encoding/json does NOT honour for structs (only pointers/slices/maps until Go 1.24's `omitzero`). A pre-RBAC session always serialized realm_access:{} on the wire, breaking the legacy {email,sub,verified} contract. Changed to a pointer (*whoamiRealmAccess) so omitempty actually drops the field; HandleWhoami only allocates the pointer when claims carry roles, and drops it back to nil if the projection ended up empty. Test status after fix (worktree off origin/main): - All 6 target tests PASS - Full TestPin, TestHandleWhoami, TestWhoamiInjectTierRoles, TestEnsureOwnerUserAccess, TestOwnerUserAccessName, TestListUserAccess, TestFinaliseHandover, TestUnstructuredToUserAccess* PASS (57 tests) - go test ./... -p 1 across the entire catalyst-api module PASS Pre-existing parallelism flakes (TestGetKubeconfig_ReadsFromPathPointer / TestPhase1Started_GuardPreventsDoubleWatch / TestPodRestart_*) exist on baseline too: they write to /var/lib/catalyst/ from a goroutine that outlives test scope. Out of scope for this PR; tracked separately. Closes #1853 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:14:38 +04:00
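Fixes 4 and 6 in the commit above both rest on encoding/json behaviour that is easy to reproduce in isolation. The sketch below is a standalone illustration with made-up type names (`byValue`, `byPointer`, `userAccess`, `mustJSON`), not the catalyst-api types: it shows that `omitempty` leaves an empty struct value on the wire but drops a nil pointer, and that a nil slice marshals to `null` while an empty slice marshals to `[]`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// realmAccess mirrors the shape described in the commit; names are illustrative.
type realmAccess struct {
	Roles []string `json:"roles,omitempty"`
}

// byValue holds the nested struct by value: omitempty has no effect on it.
type byValue struct {
	Email       string      `json:"email"`
	RealmAccess realmAccess `json:"realm_access,omitempty"`
}

// byPointer holds it by pointer: a nil pointer is omitted.
type byPointer struct {
	Email       string       `json:"email"`
	RealmAccess *realmAccess `json:"realm_access,omitempty"`
}

// userAccess has a slice without omitempty, so nil vs empty is visible.
type userAccess struct {
	Applications []string `json:"applications"`
}

// mustJSON marshals v and panics on error; fine for a demo.
func mustJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(byValue{Email: "a@example.com"}))     // {"email":"a@example.com","realm_access":{}}
	fmt.Println(mustJSON(byPointer{Email: "a@example.com"}))   // {"email":"a@example.com"}
	fmt.Println(mustJSON(userAccess{}))                        // {"applications":null}
	fmt.Println(mustJSON(userAccess{Applications: []string{}})) // {"applications":[]}
}
```

The `null` case is the one that breaks a client calling `.map()` on the field, which is why initialising the slice in the struct literal matters even though both values have length zero in Go.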
github-actions[bot]	de53b39d13	deploy: bump bp-guacamole upstream 1.5.5 chart 0.1.25	2026-05-18 21:05:55 +00:00
github-actions[bot]	8b33188019	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.21	2026-05-18 21:04:48 +00:00
e3mrah	cf35b4a9b6	fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856 ) (#1858 ) A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux, openbao, keycloak, gitea) where blueprint.yaml spec.version had silently fallen behind chart/Chart.yaml version, breaking TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart publish — never the upstream platform/<bp>/blueprint.yaml. This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml spec.version whenever Chart.yaml version bumps. Both file edits land in the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint lockstep). Idempotent reset-and-rewrite retry preserved for the existing parallel-matrix race case. Workflow changes (.github/workflows/blueprint-release.yaml): * New step `bump_blueprint` after `bump_pin` — locates ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml (handles both platform-leaf and products-umbrella conventions), filters to kind:Blueprint (defensive against CRD yaml at the products/catalyst/chart/crds path), reads current spec.version at 2-space indent, sed-rewrites to CHART_VERSION, verifies post-write. * Commit step renamed to "Commit + push bootstrap-kit pin bump + blueprint.yaml lockstep"; stages both files, single commit, with convergent retry on conflict. * Summary block surfaces both bumps separately. Regression test (tests/e2e/bootstrap-kit/main_test.go): * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks platform/* and products/, discovers every Blueprint manifest with a sibling Chart.yaml, asserts spec.version == Chart.yaml version. Covers ALL ~70 blueprints, not just the canonical 10 kit ones the existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates. 
Failure messages name the file, drift direction, and the exact sed command to fix — drift remediation is mechanical. Drift cleanup (mandatory companion, same shape as A17/#1855): 26 Application-Blueprint blueprints whose spec.version had been left at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to Chart.yaml as authoritative. All currently surface in the new sweep test; without the cleanup the test would block this PR (and every subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook, cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores], falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir, netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover, trivy, valkey, velero, vpa, products/dmz-vcluster. After this lands, the next chart-version bump in any platform/<bp>/ folder auto-converges all three artifacts (Chart.yaml, blueprint.yaml, bootstrap-kit pin) in a single bot commit. No more manual collector PRs; no more silent drift between chart and Blueprint manifest. Closes #1856. Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 01:04:22 +04:00
e3mrah	2484c8a3de	fix(bp-velero): bump 1.2.1 -> 1.2.2 to force a publish (Closes #1799 ) (#1846 ) TBD-A13: `ghcr.io/openova-io/bp-velero:1.2.1` returns not-found because the 1.2.1 bump in platform/velero/chart/Chart.yaml shipped only in the initial-fill commit (`e5c2797c` "deploy: bump sandbox-mcp-server image to cadc7b5") which never triggered the blueprint-release workflow. As a result every fresh Sovereign's bp-velero HelmRelease (slot 34) is stuck InProgress and the bootstrap-kit kustomization fails its health check. GHCR currently has 1.0.0, 1.1.0, 1.2.0 — confirmed via `/orgs/openova-io/packages/container/bp-velero/versions`. Bump to 1.2.2 (chart + bootstrap-kit pin in lockstep so the A6 sync gate stays GREEN) so blueprint-release.yaml fires on this push, publishes `ghcr.io/openova-io/bp-velero:1.2.2`, and the auto-bump-pin step is a no-op. No payload changes — same upstream vmware-tanzu/velero 12.0.1 subchart, same templates, same values. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:43:13 +04:00
hatiyildiz	9975e057da	deploy(bp-newapi): bump bootstrap-kit pin 1.4.19 -> 1.4.20 (auto, Refs TBD-A6)	2026-05-18 20:38:15 +00:00
github-actions[bot]	9982dcafa8	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.20	2026-05-18 20:37:26 +00:00
e3mrah	3d0c96a237	fix(bp-newapi): single-pod DB migration via startupProbe (Closes #1798) (#1857) newapi-mirror:v0.13.2 hangs on first-boot GORM AutoMigrate against an empty CNPG database: kubelet's pre-A12 liveness probe (initialDelay 30s + period 10s + failureThreshold 3 = ~50s ceiling) SIGKILLs the binary mid-migration on every restart. The 28-CREATE-TABLE + 2-column-type AutoMigrate takes 60-120s on cpx21/cpx31 nodes with sslmode=require, well over the kill window. On t22 chart 1.4.18 the `newapi` DB had ZERO public-schema tables after 29 CrashLoopBackOff restarts because every kill happened before the GORM connection pool's first wire write completed (pg_stat_activity on the CNPG primary showed no newapi-user connections). Symptom (t22 verify, pod newapi-bp-newapi-6fd8799b6-lpsd2): [SYS] ... database migration started ← last log line exitCode=2 finishedAt-startedAt = 50s exactly Readiness probe: connect: connection refused 10.42.0.185:3000 DB: psql \\dt → "Did not find any relations" CNPG: pg_stat_activity → no `newapi` user connections Fix (canonical k8s pattern, Inviolable Principle #16 - own the seam): add a startupProbe that gates BOTH liveness and readiness until the binary opens :3000/api/status. Budget 30 × 10s = 5 min, comfortably above the observed 60-120s ceiling and below operator-impatience limits. Liveness's pre-A12 cadence (30s/10s/3) is unchanged but only activates after startupProbe success per kubelet semantics. The probe block is operator-tunable via `.Values.newapi.probes.startup.*`; setting it to `null` skip-renders the block so overlays against a pre-seeded DB can opt out (Inviolable Principle #4). Also bumps the bootstrap-kit pin 1.4.18 → 1.4.19 in slot 80 so freshly franchised Sovereigns pull the new chart on next prov. Render tested (smoke + override): startupProbe present with failureThreshold=30 in defaults; suppressed when startup: null. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:37:00 +04:00
e3mrah	a8931db541	fix(ci): sync stale blueprint.yaml versions + soften push-mode pin-sync race (Closes #1849 ) (#1855 ) Two disjoint regressions stack-failed test-bootstrap-kit.yaml on every push to main: 1. manifest-validation — TestBootstrapKit_BlueprintCardsHaveRequiredFields asserts platform/<bp>/blueprint.yaml spec.version == chart/Chart.yaml version. Six blueprints had drifted: cilium (1.3.0->1.3.5), cert-manager (1.2.0->1.2.2), flux (1.2.0->1.2.2), openbao (1.2.14->1.2.16), keycloak (1.5.0->1.4.5 — blueprint led chart, sync to authoritative Chart.yaml), gitea (1.2.5->1.2.7). Chart.yaml is canonical (drives bootstrap-kit pin -> Sovereign install); blueprint.yaml gets resynced down/up to match. 2. pin-sync-audit on push — full-sweep audit races the blueprint-release auto-bump hook. Chart-bump merge commit has chart=N pin=N-1 drift until the auto-bump bot commits the pin update ~60s later; the bot push (GITHUB_TOKEN convention) does not retrigger this workflow, so the failure remains in run history. Fix: set continue-on-error: true on push/workflow_dispatch events (PR remains blocking via --changed-only). The full-sweep output still surfaces drift on the run summary; it just doesn't fail the overall run while the heal-in- ~60s window is open. Documented inline in the job header. Net effect: every push to main re-runs cleanly green. The 13 pre-existing drifts called out in the existing job comment will continue to heal as each lagging chart gets its next bump (auto-bump hook + this PR's manifest-validation alignment). Refs PRs #1666 #1687 #1695 #1698 #1706 #1707 (the manual collector PRs TBD-A6 eliminated for bootstrap-kit pins; this PR extends the convergence to blueprint.yaml versions which the test asserts but the auto-bump hook does not yet update). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>	2026-05-19 00:34:48 +04:00
e3mrah	d36e54df74	test(chart): baseline CNP allow-list contract gate — guards #1785→#1803→#1847 cascade (Closes #1850 ) (#1854 ) The May 2026 baseline-CNP cascade shipped three production bugs in two days because nothing in CI rendered the chart and asserted on the rendered CiliumNetworkPolicy shape: - #1785 (chart 1.4.171) — added the baseline CNP for catalyst-system with WORLD egress restricted to TCP/443 only AND no ingress allow for the `catalyst` namespace. - #1803 (chart 1.4.177) — re-added SMTP egress (587/465/25 TCP) after /api/v1/auth/pin-request 502'd on every fresh onboarding. - #1847 (chart 1.4.178) — re-added ingress from `catalyst` after t24 fresh-prov handover hung at WAIT_TIMEOUT_SECONDS=1500s. This adds products/catalyst/chart/tests/baseline-cnp-allowlist.sh — a pure helm-template + grep/awk contract gate matching the existing platform/self-sovereign-cutover/chart/tests/cutover-contract.sh pattern. The Blueprint Release workflow already runs every *.sh under chart/tests/ as a publish gate (see blueprint-release.yaml line 384), so the gate is wired automatically and fails publish BEFORE the OCI artifact reaches a Sovereign. 13 cases asserted: 1. baseline-default-deny CNP renders + is namespaced to catalyst-system 2. egress allows SMTP submission 587/TCP (#1803 regression guard) 3. egress allows SMTPS 465/TCP (#1803 regression guard) 4. egress allows legacy SMTP 25/TCP (#1803 regression guard) 5. egress allows HTTPS 443/TCP to world 6. egress allows kube-dns 53/UDP + 53/TCP 7. ingress allows `catalyst` ns — cutover Pods → catalyst-api:8080 (#1847) 8. ingress allows `flux-system` (HelmRelease readiness probes) 9. ingress allows `kube-system` (operator + ccm + CoreDNS) 10. ingress is namespace-scoped — no fromEntities:{cluster|world|all} wildcard 11. catalyst-api Service exposes port 8080 (auto-trigger contract) 12. CNP toggles off cleanly with security.baselineCnp.enabled=false 13. 
allowedIngressNamespaces propagates via --set (operator-tunable) Negative-test confirmation (executed locally before commit): - Remove SMTP 587 from template → Case 2 FAILS, exit 1 - Remove `catalyst` from values.yaml default → Case 7 FAILS, exit 1 - Add `fromEntities: [cluster]` wildcard → Case 10 FAILS, exit 1 - Restore originals → all 13 cases PASS, exit 0 Refs: TBD-A18, PRs #1785 #1803 #1847, audit /tmp/audit-recent-prs-quality-report.json Closes #1850 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-19 00:32:28 +04:00
github-actions[bot]	82e972fb77	deploy: update catalyst images to `75cb059`	2026-05-18 20:26:21 +00:00
e3mrah	75cb059fc0	Merge pull request #1851 from openova-io/fix/a16-hetzner-ssh-key-sweep fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16)	2026-05-19 00:24:19 +04:00
github-actions[bot]	e78faa986c	deploy: update catalyst images to `f07312c`	2026-05-18 20:23:49 +00:00
e3mrah	f07312c5ae	fix(cutover): RBAC + sovereign-fqdn ConfigMap + kubeconfig?region path — 3 t24 zero-touch P1 blockers (#1852 ) Three Wave 36 P1 fresh-prov blockers ship together as one chart 1.4.179 + bootstrap-kit pin bump + cloud-init substitute extension, because each fix is small and they share the same fresh-prov verification cycle. TBD-A14 (issue #1843) — catalyst-api-cutover-driver SA cannot list networkpolicies cluster-scope. Add networking.k8s.io/networkpolicies get/list/watch verbs to clusterrole-cutover-driver.yaml. Pre-fix the chroot in-cluster fallback's k8sCache.Factory reflector emitted continuous `networkpolicies is forbidden` errors at the cluster scope because only update/patch/delete were granted (existing mutation block) — the read path was never wired. Mirrors the existing cilium.io/ciliumnetworkpolicies block; the two CRDs co-exist (k8s NetworkPolicy = baseline L3/L4, CiliumNetworkPolicy = tier-3 L7). TBD-A15 (issue #1844) — sovereign-fqdn ConfigMap fields configuredRegions / controlPlaneIP / primaryRegion / replicaRegion / selfDeploymentId / enableHotStandby / qaApplications empty on every fresh prov. Pre-fix the envsubst placeholders resolved to empty because nothing wrote them into the bootstrap-kit Kustomization postBuild substitute map → the chart rendered empty strings → Dashboard SovereignCard configured-regions chips, Settings page operator-identity, /api/v1/sovereign/self, and the D31 active-hot-standby gating ALL silently fell through to default behaviour. 
Wired via three coordinated changes: - Chart values.yaml gains global.sovereignSelfDeploymentId default - bootstrap-kit slot 13 gains global.sovereignSelfDeploymentId, sovereign.configuredRegions, sovereign.qaApplications mappings (YAML inline-list shape `${SOVEREIGN_CONFIGURED_REGIONS_YAML:-[]}`) - cloud-init Kustomization substitute map gains SOVEREIGN_CONTROL_PLANE_IP (= load_balancer_ipv4), SOVEREIGN_PRIMARY_REGION / SOVEREIGN_REPLICA_REGION (canonical 4-segment labels), SOVEREIGN_ENABLE_HOT_STANDBY (reserved, default empty), SOVEREIGN_CONFIGURED_REGIONS_YAML (JSON-encoded cloudRegion list), QA_APPLICATIONS_YAML (reserved, default `[]`) - main.tf: new template inputs sovereign_configured_regions_yaml + replica_region_canonical_label (derived from local.secondary_regions), threaded into both primary CP and per-secondary-region cloud-init templatefile calls TBD-A10b (issue #1845) — GET /api/v1/deployments/{id}/kubeconfig?region=<cloudRegion> returns 409 kubeconfig-file-missing on fresh prov for every region. Pre-fix the handler only resolved `<id>-<region>.yaml` exactly, but the cloud-init PUT-back + mothership→chroot D16 fan-out use the tofu secondary-region key shape `<cloudRegion>-<i>` (e.g. `hel1-1`, `nbg1-2`) — so on-disk filenames look like `<id>-hel1-1.yaml`. Verifiers + operators commonly call with the bare `cloudRegion` (`?region=hel1`) because that's the matrix-doc-friendly form. Fall-back resolution order added to GetKubeconfig: exact-name first (legacy + manual operator PUT), then `<id>-<region>-*.yaml` glob (sort.Strings deterministic). Unit test covers all three paths: exact match, slot-suffix glob, unknown-region still 409. Closes the regression introduced when PR #1763 (mothership→chroot kubeconfig handover hook) started using the cloud-init naming convention for fan-out exports. 
Closes #1843, Closes #1844, Closes #1845 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 00:21:38 +04:00
hatiyildiz	6e883c1f8b	fix(hetzner): sweep orphan SSH keys by public_key comment (TBD-A16) Third match pass for SSH keys whose name AND label both drifted from the Tofu canonical emission. The OpenSSH public_key comment is the one piece of metadata that survives Console-rename, partial tofu apply, and out-of-band hcloud-cli edits — bootstrap-cli stamps the canonical prefix into it at generation. Caught in production 2026-05-18: catalyst-t24-omantel-biz blocked fresh t25 provs because previous wipe cycles left it as an orphan. Label-pass + name-prefix-pass had no signal once the name/label drifted. Adds boundary-aware HasPrefix check (the same P0 safety guard pinned by TestPurge_NamePrefixFallback_DoesNotTouchOtherCustomers) so wiping t2.omantel.biz cannot delete t20.omantel.biz's SSH key. Tests: - PublicKeyCommentFallback_DeletesUnlabeled (the third-pass match) - PublicKeyCommentFallback_BoundarySafety (P0 t2 vs t20 safety pin) - PublicKeyCommentFallback_NoDoubleCount (idempotent against earlier passes) - PublicKeyCommentFallback_LeavesOtherKeys (other tenants untouched) - PublicKeyComment_ParsesFormats (OpenSSH parser unit pins) - CommentMatchesPrefix_BoundaryRules (separator rune table) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 22:15:51 +02:00
hatiyildiz	7a2cad9a47	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.177 -> 1.4.178 (auto, Refs TBD-A6)	2026-05-18 19:46:12 +00:00
e3mrah	31b7dc5859	fix(cnp): allow ingress from catalyst ns (cutover Pods) — fresh-prov handover blocker (Refs PR #1785 regression, t24 zero-touch finding) (#1847 ) PR #1785 (chart 1.4.171) shipped a baseline default-deny CiliumNetworkPolicy in catalyst-system whose ingress allowlist was limited to: - reserved.ingress: "" (cilium-gateway endpoint) - same-namespace catalyst-system Pods - host / remote-node / kube-apiserver entities The bp-self-sovereign-cutover chart stamps Jobs into the `catalyst` namespace, including the 10-auto-trigger Job whose Pod curls catalyst-api.catalyst-system.svc.cluster.local:8080 to fire /api/v1/internal/cutover/trigger. With #1785 in effect on a FRESH prov, every auto-trigger Pod times out at WAIT_TIMEOUT_SECONDS=1500s, handoverFiredAt stays null, and the D0 auto-redirect to the Sovereign Console never happens — the operator is stuck on mothership /jobs forever. Caught by t24 zero-touch verification (2026-05-18): handover_status: "BLOCKED — cutover auto-trigger Pod in 'catalyst' ns cannot reach catalyst-api in 'catalyst-system' ns because baseline-default-deny CNP allows ingress only from {reserved.ingress, catalyst-system ns, host entities}" The companion symptom on t22 was masked because t22's cutover Job had already completed before the CNP rolled out — the CNP did not gate ingress there. Fix ───────────────────────────────────────────────────────────────── Add a fourth ingress rule to baseline-default-deny allowing fromEndpoints in the operator-tunable list .Values.security.baselineCnp.allowedIngressNamespaces. 
Defaults: - catalyst — cutover Pods (the load-bearing fix) - flux-system — Helm/Kustomize/Source controllers probing Service readiness for HelmRelease health rollups (worked pre-#1785 via no-CNP default) - kube-system — Cilium operator + hcloud-ccm + CoreDNS that do cluster introspection calls (the reserved.ingress gateway endpoint here is still matched by rule 1's reserved.ingress: "" selector — this rule covers non-gateway Pods) The list mirrors the existing allowedPlatformNamespaces pattern on the egress side. No other rule semantics change. Chart bump 1.4.177 → 1.4.178. Companion regression to chart 1.4.177 (PR #1803, SMTP egress) — both are sub-regressions from the same #1785 baseline-CNP ship. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:45:28 +04:00
hatiyildiz	61948474b5	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.176 -> 1.4.177 (auto, Refs TBD-A6)	2026-05-18 19:28:52 +00:00
e3mrah	153fcf9419	fix(cnp): allow SMTP egress (587/465/25) from catalyst-system — fixes PIN-issue 502 regression from #1785 (#1803 ) PR #1785 (chart 1.4.171) shipped a baseline-default-deny CiliumNetworkPolicy in catalyst-system whose world-egress block was restricted to TCP/443 only. That silently broke SMTP submission from catalyst-api to the operator Stalwart relay (mail.openova.io), surfacing as 502s at /api/v1/auth/pin-request — customer journey step 11/12 (PIN-issue email delivery) is now blocked on every fresh Sovereign onboarding flow. DIAGNOSTIC EVIDENCE ------------------- - CNP `baseline-default-deny` in catalyst-system was created at 2026-05-18 18:13:09Z (the moment chart 1.4.171 rolled out). - Egress rule: toEntities: [world] toPorts: [443/TCP] i.e. only HTTPS world egress permitted. - A Pod in catalyst-system cannot `nc 45.151.123.50 587` (timeout). - A Pod in the default namespace on the SAME node connects fine and receives the `220 Stalwart ESMTP` banner — confirming the block is policy-driven, not network/host-firewall driven. FIX --- Extend the world-egress block in products/catalyst/chart/templates/network-policies/baseline-catalyst-system.yaml to permit, in addition to the existing 443/TCP: - 587/TCP — SMTP submission (the production path to mail.openova.io) - 465/TCP — SMTPS (fallback) - 25/TCP — legacy SMTP (fallback) All four ports are scoped to `toEntities: [world]`, matching the existing 443 allow. No other rule semantics change — same-namespace, cluster-DNS, kube-apiserver, and platform-namespace allows are untouched. The 25/TCP allow is included only as a legacy fallback; production traffic is on 587. A "Regression context — DO NOT NARROW THIS BLOCK WITHOUT REVIEW" comment is added inline so the next reviewer who tightens the block sees the failure mode that drove the widening. CHART ----- 1.4.176 → 1.4.177. Changelog entry added under the 1.4.176 block, above the version line, describing the regression + fix. 
VERIFICATION ------------ `helm template products/catalyst/chart` renders the updated CNP with four ports (443/587/465/25) under the world egress block; all other rules byte-identical to 1.4.176. Refs PR #1785 (the regression source), Issue #1746 (the original baseline-CNP work). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>	2026-05-18 23:28:19 +04:00
github-actions[bot]	732f2363b9	deploy: update catalyst images to `c422c97`	2026-05-18 19:16:52 +00:00
e3mrah	c422c97b97	fix(catalyst-api): publish body→query translation + rbac/assign CRD-NotFound detection (Refs TBD-C4-fup, TBD-C6-006-followup) (#1802 ) TBD-C4-fup — publish body→query translation regression guard: - Adds sme_catalog_client_test.go pinning the wire shape on smeCatalogClient.SetPublished. The C4-012 / #1735 fix (PR #1789) translates the chroot's {"published":true} JSON body into the upstream catalog's ?value=true|false query param shape that services-catalog SetAppPublished (handlers.go:303-313) requires. Wave 35 cov-bench v3 surfaced 400 here because the deploy bot hadn't bumped catalyst-api past `e2c56c3` (PR #1787) when the bench ran — PR #1789's translation was already in the merged code but not in the live image. The test pins URL + ?value=<bool> + empty body so any future revert fires. TBD-C6-006-followup — RBAC assign 500 → 503: - Root cause: UserAccess is a NAMESPACED Crossplane Claim per the XRD's claimNames block (platform/crossplane-claims/chart/ templates/xrds/useraccess.yaml). rbacAssignNamespace = "" routed the dynamic Create to the apiserver's cluster-scoped REST path /apis/access.openova.io/v1alpha1/useraccesses, which the apiserver doesn't serve for a namespaced CRD — returns 404 with "the server could not find the requested resource". PR #1789's apierrors.IsNotFound→503 wrapper never fired because the 404 was for the route, not the resource. - Fix: pin rbacAssignNamespace = "catalyst-system" and stamp it on every Create. Mirrors user_access_owner_seed.go's t134 D21 fix (userAccessOwnerNamespace = "catalyst-system"). Lists keep Namespace("") for cross-namespace listing (valid against a namespaced CRD — apiserver returns the union). - Defense in depth: isCRDNotInstalledErr() string-fallback for "the server could not find the requested resource" / "no matches for kind" — apierrors.IsNotFound can lose StatusReasonNotFound through error-chain wrapping. Mirrors catalog_client_cluster_fallback.isVersionNotServed. 
- user_access.go: same defect class — CreateUserAccess / UpdateUserAccess / tryDeleteUserAccess all called .Namespace("") on a namespaced CRD. CreateUserAccess now stamps rbacAssignNamespace; Update + Delete walk the all-namespaces list via findUserAccessByName() to discover the canonical ns before issuing the mutation against that exact REST path. Tests: - TestSetPublished_SendsQueryParamNotBody (regression guard for TBD-C4-fup) - TestHandleRBACAssign_CreateStampsNamespace (regression guard for TBD-C6-006-followup namespace fix) - TestIsCRDNotInstalledErr_StringFallback (regression guard for defense-in-depth detection) - Existing test reads updated to use rbacAssignNamespace instead of Namespace("") (no behavioural change — the fake dynamic client routes accurately now) Refs TBD-C4-fup Refs TBD-C6-006-followup Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 23:14:40 +04:00
hatiyildiz	0293318a3a	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.175 -> 1.4.176 (auto, Refs TBD-A6)	2026-05-18 19:14:22 +00:00
github-actions[bot]	fbbf1b395f	deploy: update sme service images to `989328d` + bump chart to 1.4.176	2026-05-18 19:13:00 +00:00
hatiyildiz	da28ae6936	deploy(bp-catalyst-platform): bump bootstrap-kit pin 1.4.174 -> 1.4.175 (auto, Refs TBD-A6)	2026-05-18 19:12:31 +00:00
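
Commit 6e883c1f8b in the log above describes a boundary-aware prefix check on the OpenSSH public_key comment, so that wiping t2.omantel.biz cannot delete a key belonging to t20.omantel.biz. The following is a minimal sketch of that idea in Python, not the repository's Go implementation. The function names and the separator set are assumptions made for illustration; the commit's actual separator table is not shown in the log.

```python
# Characters allowed to follow the prefix for a match to count.
# ASSUMPTION: this set is illustrative; the real separator table
# referenced by CommentMatchesPrefix_BoundaryRules is not in the log.
SEPARATORS = frozenset("-._@ ")


def public_key_comment(public_key: str) -> str:
    """Return the comment field of a '<type> <base64> <comment>' line.

    Returns '' when the line has no comment. Leading authorized_keys
    options are not handled in this sketch.
    """
    parts = public_key.strip().split(None, 2)
    return parts[2] if len(parts) == 3 else ""


def comment_matches_prefix(comment: str, prefix: str) -> bool:
    """True if comment starts with prefix AND the prefix ends on a boundary.

    A bare startswith() would let 'catalyst-t2' match 'catalyst-t20-...';
    requiring end-of-string or a separator right after the prefix stops that.
    """
    if not prefix or not comment.startswith(prefix):
        return False
    rest = comment[len(prefix):]
    return rest == "" or rest[0] in SEPARATORS
```

With this shape, the prefix `catalyst-t2` matches the comment `catalyst-t2-omantel-biz` but not `catalyst-t20-omantel-biz`, because the character after the prefix in the second case is a digit rather than a separator.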

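Commit f07312c5ae in the log above adds a fall-back resolution order for the kubeconfig `?region=` lookup: the exact `<id>-<region>.yaml` name first, then the `<id>-<region>-*.yaml` glob in sorted order, and a miss (reported as 409 by the handler) when neither exists. A rough Python sketch of that ordering follows. The function name and the in-memory list of filenames are hypothetical; the real handler is Go and reads from disk.

```python
from typing import Iterable, Optional


def resolve_kubeconfig(filenames: Iterable[str], deployment_id: str,
                       region: str) -> Optional[str]:
    """Pick the kubeconfig file for a region, or None if nothing matches.

    Order: exact '<id>-<region>.yaml' first, then the first name (in
    sorted order) matching '<id>-<region>-*.yaml'.
    """
    names = sorted(filenames)  # sorting makes the glob pass deterministic

    exact = f"{deployment_id}-{region}.yaml"
    if exact in names:
        return exact

    # Equivalent of the '<id>-<region>-*.yaml' glob, written with plain
    # string checks so special characters in the id cannot act as wildcards.
    prefix = f"{deployment_id}-{region}-"
    suffix = ".yaml"
    for name in names:
        if (len(name) >= len(prefix) + len(suffix)
                and name.startswith(prefix) and name.endswith(suffix)):
            return name

    return None  # the caller would translate this into the 409 response
```

Given the files `d1-hel1-2.yaml`, `d1-hel1-1.yaml` and `d1-nbg1.yaml`, a lookup for `nbg1` takes the exact-name path, a lookup for `hel1` falls through to the slot-suffix pass and returns `d1-hel1-1.yaml`, and a lookup for an unknown region returns `None`.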
2490 Commits