Commit Graph

2490 Commits

Author SHA1 Message Date
e3mrah
a2cbe3baa0
feat(sandbox-mcp): sandbox.auth.* + sandbox.secrets.* real impls (#1658)
Wave 11 follow-up to PR #1653 (sandbox.db.*). Replaces the stubbed
sandbox.auth.* and sandbox.secrets.* tool handlers with real
implementations so agents can manage per-Sandbox Keycloak realms /
OIDC clients and a per-Sandbox Secret store.

sandbox.auth.* (Keycloak Admin REST via the sandbox-controller-
injected admin bearer):

  - sandbox.auth.provisionRealm {realm_name, display_name?}
      POST /admin/realms — idempotent on 409 Conflict.
  - sandbox.auth.listClients
      GET /admin/realms/<sandbox-realm>/clients — friendly empty
      list on 404 (realm not yet provisioned).
  - sandbox.auth.registerClient {client_id, redirect_uris,
                                 public_client?, name?}
      POST /admin/realms/<sandbox-realm>/clients — idempotent on
      409 Conflict, typed error on 404 (realm missing).

  The Sandbox's "own" realm name is deterministic (`sandbox-<org>-
  <id>`); the agent CANNOT pass a `realm` argument to list /
  register, only provisionRealm accepts a free-form name.
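
A minimal sketch of the status handling described for sandbox.auth.registerClient, assuming the handler only needs to branch on the HTTP status returned by Keycloak. The names classifyRegisterClient and errRealmMissing are hypothetical and are not taken from the mcp-server source.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errRealmMissing is a hypothetical typed error for the 404 case
// (the Sandbox realm has not been provisioned yet).
var errRealmMissing = errors.New("sandbox realm not provisioned")

// classifyRegisterClient maps a Keycloak Admin REST status code to an outcome.
// created=true means a new client was registered; 409 Conflict is treated as
// success without creation, which is what makes the call idempotent.
func classifyRegisterClient(status int) (created bool, err error) {
	switch {
	case status >= 200 && status < 300:
		return true, nil
	case status == http.StatusConflict:
		return false, nil
	case status == http.StatusNotFound:
		return false, errRealmMissing
	default:
		return false, fmt.Errorf("unexpected Keycloak status %d", status)
	}
}

func main() {
	for _, s := range []int{201, 409, 404, 500} {
		created, err := classifyRegisterClient(s)
		fmt.Println(s, created, err)
	}
}
```

Treating 409 as a non-error keeps retries safe, while 404 stays a distinct error so the agent can call provisionRealm first.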

sandbox.secrets.* (per-Sandbox K8s Secret store, base64-encoded
data, encrypted at rest by kube-apiserver encryption-provider):

  - sandbox.secrets.read  {key}        — returns Found / KeyNotFound
                                          / NotFound (Secret missing)
  - sandbox.secrets.write {key, value} — auto-creates the Secret on
                                          first write (Added /
                                          Updated / Created)

  The Secret is named `sandbox-<owner-uid>-secrets` in env.Sandbox-
  Namespace and gated by openova.io/managed-by=openova-sandbox-mcp
  so sandbox.secrets.write CANNOT mutate the controller-injected
  `sandbox-tokens` Secret or any other unmanaged Secret in the ns.
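
The naming rule and the managed-by gate above can be sketched as follows. This is an illustration only: secretNameFor and isManaged are hypothetical helper names, and the real handler presumably reads the labels from the Secret's metadata via the Kubernetes client.

```go
package main

import "fmt"

// Label key and value as quoted in the commit message.
const (
	managedByLabel = "openova.io/managed-by"
	managedByValue = "openova-sandbox-mcp"
)

// secretNameFor builds the per-Sandbox Secret name from SANDBOX_OWNER_UID.
func secretNameFor(ownerUID string) string {
	return "sandbox-" + ownerUID + "-secrets"
}

// isManaged reports whether a Secret's labels mark it as owned by the MCP
// server. A write would be refused when this returns false.
func isManaged(labels map[string]string) bool {
	return labels[managedByLabel] == managedByValue
}

func main() {
	fmt.Println(secretNameFor("u123"))
	fmt.Println(isManaged(map[string]string{managedByLabel: managedByValue}))
	fmt.Println(isManaged(nil)) // unlabeled Secret such as sandbox-tokens
}
```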

Auth: claims.OrgID == env.OrgID required (same as sandbox.db.*),
RequiredCapability = "sandbox.auth" / "sandbox.secrets".

New env vars (sandbox-controller injects on MCP Deployment):

  - SANDBOX_OWNER_UID      — `sandbox-<owner-uid>-secrets` suffix
  - KEYCLOAK_ADMIN_URL     — root of the Keycloak Admin REST API
  - KEYCLOAK_ADMIN_TOKEN   — pre-minted admin bearer
  - KEYCLOAK_PARENT_REALM  — default "master"

No chart bump; mcp-server-only change. go build + go test clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:19:46 +04:00
e3mrah
d5ea7d9de6
feat(sandbox): sandbox.<sov-fqdn> public URL — DNS + cert SAN + correct parentRefs (#1657)
The Sandbox public-URL flow (sandbox.<sov-fqdn>/sessions/<owner-uid>/*) had
three independent gaps that prevented PR #1641's HTTPRoute from resolving
end-to-end:

1. HTTPRoute parentRefs pointed at "catalyst-public/catalyst-system/https",
   a Gateway that does not exist on a Sovereign. The canonical public
   Gateway is "cilium-gateway/kube-system" (clusters/_template/
   sovereign-tls/cilium-gateway.yaml), the same parent that organization-
   controller's tenant_route.go and the chart's httproute.yaml attach to.
   sectionName is omitted so the HTTPRoute auto-attaches to every listener
   whose hostname matches sandbox.<sov-fqdn> — the wildcard
   *.${SOVEREIGN_FQDN} HTTPS listener already in place per infra/hetzner/
   main.tf locals.parent_domains_listeners_yaml fallback path.

2. The per-name Cilium Gateway cert (clusters/_template/sovereign-tls/
   cilium-gateway-cert.yaml) is a SAN list, not a wildcard. Without
   "sandbox.<sov-fqdn>" in its dnsNames cilium-envoy serves the default
   fallback cert and browsers see NET::ERR_CERT_COMMON_NAME_INVALID.
   This file is the source of the per-zone Secret
   sovereign-wildcard-tls-<sov-fqdn-dashed> the Gateway listener
   references — adding the SAN is the only TLS-side change needed; the
   Gateway listener wildcard is already a hostname match.

3. The parent zone's A-record set is built from CanonicalSovereignSubdomains
   in products/catalyst/bootstrap/api/internal/handler/
   sovereign_dns_records.go. Without "sandbox" the PowerDNS PATCH never
   writes sandbox.<sov-fqdn> A-record → primary LB IP, and the URL
   resolves NXDOMAIN even when the listener + cert are healthy.

End-to-end resolution chain after this PR:

  Browser → sandbox.<sov-fqdn>/sessions/<owner-uid>/  (PowerDNS A record
    points at primary LB IPv4)
  → Hetzner LB :443 → cp-node :30443 (cilium-envoy)
  → Gateway listener https-<sov-fqdn-dashed> on *.<sov-fqdn> matches
    hostname; cert SAN includes sandbox.<sov-fqdn> so TLS terminates
  → HTTPRoute pty-server in sandbox-<owner-uid> namespace matches
    hostname + /sessions/<owner-uid>/ path prefix; URLRewrite strips
    /sessions/<owner-uid>/ → /sessions/
  → backendRef pty-server:7681 in sandbox-<owner-uid> namespace
  → pty-server StatefulSet (PR #1641) serves the session
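
As an illustration of the path rewrite step in the chain above, here is a small Go sketch of the prefix strip. It only models the string transformation; the actual rewrite is performed by the Gateway from the HTTPRoute filter, and rewriteSessionPath is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteSessionPath models the rewrite described above:
// /sessions/<owner-uid>/<rest> becomes /sessions/<rest>.
// ok is false when the path does not carry the owner prefix, in which
// case the route would not match at all.
func rewriteSessionPath(ownerUID, path string) (rewritten string, ok bool) {
	prefix := "/sessions/" + ownerUID + "/"
	rest, found := strings.CutPrefix(path, prefix)
	if !found {
		return "", false
	}
	return "/sessions/" + rest, true
}

func main() {
	fmt.Println(rewriteSessionPath("u123", "/sessions/u123/abc/ws"))
	fmt.Println(rewriteSessionPath("u123", "/sessions/u123/"))
	fmt.Println(rewriteSessionPath("u123", "/sessions/other/abc"))
}
```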

Hard rules respected: READ-ONLY clusters, no Chart.yaml bump (only
template content + Go renderer + Go handler list), helm template +
kubectl kustomize clean (verified locally), tests updated to assert the
new parentRefs shape and pass under go 1.23.
2026-05-18 12:15:59 +04:00
github-actions[bot]
5309bb8c39 deploy: bump sandbox-controller image to 63255bf 2026-05-18 08:15:56 +00:00
e3mrah
63255bf172
feat(sandbox-mcp): gitea.pr.create/merge + issue.* + k8s.read.logs (was stubs) (#1656)
Wave 11 promotes the remaining write-surface tools from #1645's stubs
to real handlers, so an agent inside a Sandbox can end-to-end open PRs,
file issues, comment, merge, and pull container logs without leaving the
MCP transport:

  - pkg/gitea: +MergePullRequest, +Issue + IssueComment types, +List/Get/
    Create/CommentOnIssue methods (new issues.go; pulls.go grows the
    merge helper). Same client envelope, same ErrRepoNotFound mapping.
  - mcp-server gitea.go: gitea.pr.create / gitea.pr.merge /
    gitea.issue.list / get / create / comment handlers + JSON Schemas.
    Same HS256 bearer + claims.OrgID match as #1645.
  - mcp-server k8s_read.go: k8s.read.logs via client-go's typed
    kubernetes.Interface (dynamic client doesn't expose Pods/log).
    Bounded fetch — follow=false, tail_lines default 200 capped at 5000,
    1 MiB byte cap, 30s deadline. Long-lived streams stay on the
    catalyst-api WebSocket surface.
  - tests: +merge_issues_test.go (pkg/gitea, 11 cases) + gitea_wave11_test.go
    (mcp-server, 14 cases) covering happy paths, missing-arg validation,
    explicit merge styles, list-after-create idempotency, and the two
    pre-cluster guard rails on k8s.read.logs.
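
The bounds on k8s.read.logs can be sketched like this. The limits come from the description above; the function names are hypothetical, treating a non-positive tail_lines as "use the default" is an assumption, and so is truncating from the front of the buffer.

```go
package main

import "fmt"

const (
	defaultTailLines = 200
	maxTailLines     = 5000
	maxLogBytes      = 1 << 20 // 1 MiB byte cap
)

// effectiveTailLines applies the default and the upper cap to the
// caller-supplied tail_lines argument.
func effectiveTailLines(requested int64) int64 {
	if requested <= 0 {
		return defaultTailLines // assumption: unset or invalid means default
	}
	if requested > maxTailLines {
		return maxTailLines
	}
	return requested
}

// capBytes truncates a fetched log buffer to the byte cap.
func capBytes(b []byte) []byte {
	if len(b) > maxLogBytes {
		return b[:maxLogBytes]
	}
	return b
}

func main() {
	fmt.Println(effectiveTailLines(0), effectiveTailLines(50), effectiveTailLines(9000))
	fmt.Println(len(capBytes(make([]byte, 2<<20))))
}
```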

Hard rules honoured: READ-ONLY clusters (k8s.write.* still stubbed),
no chart bump, go build + go test clean. Kept stubbed: sandbox.db.*,
sandbox.auth.*, gitea.release.list (Wave 12+).

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:12:41 +04:00
github-actions[bot]
e2e8132b00 deploy: update Catalyst marketplace image to f3915c0 2026-05-18 08:03:45 +00:00
e3mrah
f3915c01fa
test(marketplace): codified customer-journey regression (17 steps) (#1655)
Codifies the 17-step marketplace customer journey (storefront → catalog →
product detail → voucher → signup → subdomain pick → PIN → checkout →
provisioning chain → console redirect) as a hermetic Playwright suite.

Previously the journey was only walked manually by ad-hoc fix-author
agents (see PR #1635 / docs/SESSION-2026-05-17-CONVERGENCE.md). This adds
a regression gate so future PRs catch breakage in any of the 14 spec
tests (17 step labels grouped into 14 Playwright tests — steps 12-15 are
asserted as one API-chain contract since CheckoutStep redirects to
console before the panel-poll UI would render).

Highlights
----------
- core/marketplace/playwright.config.ts — testDir=./playwright,
  workers=1, baseURL from MARKETPLACE_BASE_URL (default
  http://localhost:4321), same posture as
  tests/e2e/playwright/playwright.config.ts.
- core/marketplace/playwright/customer-journey.spec.ts — every backend
  call (/api/catalog/*, /api/auth/*, /api/tenant/*, /api/billing/*,
  /api/provisioning/*) intercepted via page.route() so the run is
  hermetic (npm run build && npm run preview is enough — no real
  catalyst-api / billing / provisioning service required).
- Asserts the PR #1627 fix (deriveConsoleURL host-driven) — Sovereign
  hosts redirect to console.<sov-fqdn> (no /nova), mothership stays on
  console.openova.io/nova.

Verification
------------
npx playwright test customer-journey → 14 passed (2.5m).
2026-05-18 12:02:39 +04:00
github-actions[bot]
18df061895 deploy: bump bp-newapi upstream v0.13.2 chart 1.4.11 2026-05-18 08:00:46 +00:00
e3mrah
0604c5e057
fix(newapi): gate channel render on attestation present (was blocking install when accountId env empty) (#1654)
Convergence wave 11 blocker on t16: bp-newapi HR install fails with

  Error: template: bp-newapi/templates/configmap.yaml:1:4: executing
  "bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>:
  channel[0] (qwen3.6-bankdhofar): commercial-contract attestation
  requires accountId

PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can
opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping
`defaultChannels.qwenBankDhofar.enabled` to true with envsubst
placeholders for the attestation:

  attestation:
    kind: commercial-contract
    accountId:   ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
    contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}

On a Sovereign that has not yet signed the commercial contract those
variables expand to empty strings, and the chart's
`assertChannelAttestation` helper hard-fails the helm template before
any manifest is rendered — newapi install crashes at slot 80 and the
whole bootstrap-kit reconciliation stalls.

Fix (Option A — smallest change, makes the chart actually install):
SKIP composing the qwenBankDhofar channel when
attestation.kind=commercial-contract AND either accountId or contractRef
is empty. NewAPI installs with zero default channels (operator-supplied
`.Values.channels` still compose). Once the operator overlay supplies
the attestation values the channel composes on the next reconcile.

Touches two templates that gate on the same effective channel list:

  - templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a
    pre-check ($qbdAttReady) that short-circuits the channel composition
    block when attestation is incomplete. The downstream
    `assertChannelAttestation` helper then sees an empty channel list
    for the qwenBankDhofar slot and emits no error.
  - templates/channel-seed-job.yaml — mirrors the same gate so the
    post-install Helm hook Job + RBAC + audit ConfigMap also skip when
    the channel itself was skipped (otherwise the Job would POST a row
    whose ConfigMap entry was omitted from /etc/newapi/channels.yaml).

`helm template platform/newapi/chart` renders cleanly in all three
states:
  - default (qbd.enabled=false) → no channel, no seed Job
  - qbd.enabled=true + empty accountId/contractRef → no channel, no
    seed Job (NEW: pre-1.4.10 this hard-failed)
  - qbd.enabled=true + accountId + contractRef present → channel
    composed normally, seed Job emitted
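
The gate itself lives in Helm template code, but its truth table is easy to restate in Go. The sketch below mirrors the three states listed above; the type and function names are hypothetical and not part of the chart.

```go
package main

import "fmt"

// attestation mirrors the values block quoted earlier in this message.
type attestation struct {
	Kind        string
	AccountID   string
	ContractRef string
}

// composeChannel reports whether the qwenBankDhofar channel (and its seed
// Job) would be rendered for the given values.
func composeChannel(enabled bool, att attestation) bool {
	if !enabled {
		return false
	}
	// A commercial-contract attestation needs both fields before the
	// channel composes; otherwise it is skipped instead of failing.
	if att.Kind == "commercial-contract" && (att.AccountID == "" || att.ContractRef == "") {
		return false
	}
	return true
}

func main() {
	empty := attestation{Kind: "commercial-contract"}
	full := attestation{Kind: "commercial-contract", AccountID: "acct", ContractRef: "ref"}
	fmt.Println(composeChannel(false, full)) // default state
	fmt.Println(composeChannel(true, empty)) // enabled, attestation incomplete
	fmt.Println(composeChannel(true, full))  // enabled, attestation present
}
```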

Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped
1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix.

READ-ONLY clusters preserved. NO Chart.yaml bump on
bp-catalyst-platform.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:00:06 +04:00
e3mrah
d080207c32
feat(sandbox-mcp): sandbox.db.* real impl (CNPG provision/list/get/drop/dump) (#1653)
PR #1645 (Wave 8) wired gitea.* + k8s.read.* + session.* in the MCP
server but left sandbox.db.* as not_implemented stubs. This commit
ships the real handlers using the same dynamic-client pattern.

Tools shipped (all gated on `RequiredCapability=sandbox.db` + claim
OrgID==env.OrgID, all scoped to env.SandboxNamespace):

  - sandbox.db.provision {name, plan?} — POSTs a CNPG Cluster CR
    (default plan: 1 instance, 5Gi PVC, postgres 16, db=app). Returns
    {host:<name>-rw.<ns>.svc.cluster.local, port:5432, dbname, user,
    secretName:<name>-app, secretKey:password}.
  - sandbox.db.list — labels-filtered LIST scoped to the Sandbox ns,
    returns the same connection envelope per item plus a distilled
    status summary (phase, readyInstances, Ready condition).
  - sandbox.db.get {name} — GET one Cluster; refuses to surface a
    Cluster lacking openova.io/managed-by=openova-sandbox-mcp
    (defence-in-depth against an agent fishing for per-Org pair DBs).
  - sandbox.db.drop {name} — DELETE with foreground propagation so the
    operator cascades PVC/Service/Secret cleanup before returning.
    Same managed-by guard as get.
  - sandbox.db.dump {name} — POSTs a one-shot Backup CR
    (`<cluster>-dump-<UTC>`). Returns the Backup name + the Cluster's
    configured barmanObjectStore.destinationPath so the agent can find
    the resulting S3 prefix without polling Backup.status.
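
A rough sketch of the connection envelope returned by sandbox.db.provision, built only from the fields listed above. The struct layout is an assumption, the user field is left out because its value is not stated here, and this is not the repository's connectionFor helper.

```go
package main

import "fmt"

// connection is a hypothetical shape for the returned envelope.
type connection struct {
	Host       string
	Port       int
	DBName     string
	SecretName string
	SecretKey  string
}

// connectionFor derives the envelope from the Cluster name and the Sandbox
// namespace, following the CNPG service and Secret naming quoted above.
func connectionFor(name, namespace string) connection {
	return connection{
		Host:       fmt.Sprintf("%s-rw.%s.svc.cluster.local", name, namespace),
		Port:       5432,
		DBName:     "app", // default plan uses db=app
		SecretName: name + "-app",
		SecretKey:  "password",
	}
}

func main() {
	fmt.Printf("%+v\n", connectionFor("orders", "sandbox-u123"))
}
```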

Why CNPG Cluster CRs (not a per-Sandbox shared DB): a per-app DB keeps
the tenancy / backup / restart blast radius scoped to one app, and
matches architecture §3 + §7. Cluster CRs live in the Sandbox's OWN namespace
(sandbox-<owner-uid>); the agent cannot pass `namespace` — it's read
from env. The MCP server never mutates the resulting Pods/PVCs/
Services — the upstream CNPG operator (bp-cnpg) owns those.

Tests (sandbox_db_test.go, 9 cases incl. 5 capability-gate sub-tests):
  - validation (name regex, missing name, unknown plan)
  - default-plan CR shape (apiVersion, kind, labels, spec.instances,
    storage.size, bootstrap.initdb.database, enableSuperuserAccess)
  - connectionFor envelope matches CNPG service-name defaults
  - on-demand Backup CR shape + managed-by label
  - requireSandboxNS guard rails (no env / empty ns / populated)
  - capability gate rejects bearers w/o sandbox.db
  - status summary surfaces phase + Ready condition only

Hard rules respected: NO chart bump, no host-cluster touch — every
mutation lands inside the Sandbox's own namespace via the SA the
sandbox-controller already gives the MCP pod. go build + go vet +
go test clean. Catalogue test updated for new `sandbox.db.get`.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:59:56 +04:00
e3mrah
7b77ebe99c
fix(bootstrap-kit): bp-sandbox slot move 61 → 19a to break harbor chicken-and-egg (#1652)
Caught live on t16.omantel.biz convergence: bp-sandbox HR stuck
Reconciling because its chart pull goes through harbor.<sov-fqdn>
(post-handover cutover slot 06a Step-06 phase-1 rewrites every
HelmRepository URL `oci://ghcr.io/openova-io` →
`oci://harbor.<sov-fqdn>/openova-io`), but harbor.<sov-fqdn> is not
reachable yet because bp-harbor itself has not reached Ready —
chicken-and-egg.

Same failure shape as Wave 7 #1610 with bp-hcloud-csi (REMOVED). This
PR takes the cleaner long-term path: rather than removing the slot, it
sequences bp-sandbox AFTER bp-harbor (slot 19) by renumbering it to 19a
and adding `bp-harbor` to the HR's dependsOn graph. The Sandbox MVP
Wave 11 slot stays available with no manual Day-2 add-app
re-introduction needed.

bp-harbor itself does not hit the cycle because its chart pull goes
through harbor.openova.io (the mothership-warmed proxy-cache wired
into k3s registries.yaml at cloud-init time) — NOT through
harbor.<sov-fqdn>.

Diff:
- clusters/_template/bootstrap-kit/61-bp-sandbox.yaml renamed →
  19a-bp-sandbox.yaml; slot label "61" → "19a"; dependsOn adds
  bp-harbor; header documents the move + chicken-and-egg context.
- clusters/_template/bootstrap-kit/kustomization.yaml: 19a slot
  inserted right after 19-harbor.yaml with the post-cutover URL
  rewrite rationale inline; old slot-61 entry replaced with a
  back-pointer comment.

Verified `kubectl kustomize clusters/_template/bootstrap-kit/`
renders clean: bp-sandbox HR keeps slot label, gains
`- name: bp-harbor` in dependsOn, all other fields unchanged.

No Chart.yaml bump (this is a bootstrap-kit Kustomization-only fix,
not a chart change). READ-ONLY clusters.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:56:52 +04:00
github-actions[bot]
51913fe380 deploy: bump sandbox-controller image to ad5163e 2026-05-18 07:54:45 +00:00
e3mrah
ad5163e69a
feat(sandbox-controller): IdleScaler scales pty-server replicas to 0 after configured idle window (#1651)
PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on
every pty-server StatefulSet but no controller was reading it. This closes
the loop:

pty-server (products/sandbox/pty-server/):
  - session.Manager tracks lastActivity; Touch() called on session
    create/stop, WS attach/detach, every WS message in/out, resize/signal.
  - New GET /idle endpoint returns {lastActivityAt, activeSessions}.
  - Unit tests cover the endpoint shape + Touch() bump.

sandbox-controller (core/controllers/sandbox/internal/idlescaler/):
  - New IdleScaler runnable, registered with mgr.Add() in main.go.
  - NeedLeaderElection=true (singleton across HA replicas).
  - Every 60s lists pty-server StatefulSets by label selector
    (app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst),
    constrained to `sandbox-*` namespaces in code for defence-in-depth.
  - For each: probes the in-cluster Service /idle endpoint, stamps the
    `openova.io/sandbox-last-activity-at` annotation, and patches
    spec.replicas=0 once now-lastActivity exceeds the per-SS
    `openova.io/sandbox-idle-timeout-minutes` annotation (falling back to
    SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30).
  - Probe failure with no prior annotation → skip (next tick); probe
    failure WITH prior annotation → still decide on stale data so a
    degraded probe path doesn't keep a forgotten Pod alive forever.
  - activeSessions > 0 keeps the Pod alive regardless of idle window.
  - Already-zero replicas → idempotent no-op.
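
The per-StatefulSet decision can be summarised in a few lines of Go. This is a sketch under the rules listed above, not the IdleScaler source: shouldScaleToZero and minutes are hypothetical names, and the probe and annotation handling are left out.

```go
package main

import (
	"fmt"
	"time"
)

// minutes is a small helper for readable durations in the examples.
func minutes(n int) time.Duration { return time.Duration(n) * time.Minute }

// shouldScaleToZero decides whether spec.replicas should be patched to 0.
// idleFor is now minus the last recorded activity.
func shouldScaleToZero(idleFor, timeout time.Duration, activeSessions int, replicas int32) bool {
	if replicas == 0 {
		return false // already scaled down: idempotent no-op
	}
	if activeSessions > 0 {
		return false // live sessions keep the Pod regardless of idle time
	}
	return idleFor > timeout
}

func main() {
	timeout := minutes(30) // default when no annotation or env override is set
	fmt.Println(shouldScaleToZero(minutes(45), timeout, 0, 1))
	fmt.Println(shouldScaleToZero(minutes(45), timeout, 2, 1))
	fmt.Println(shouldScaleToZero(minutes(10), timeout, 0, 1))
	fmt.Println(shouldScaleToZero(minutes(45), timeout, 0, 0))
}
```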

Chart RBAC:
  - ClusterRole gains apps/statefulsets get/list/watch/patch — the ONLY
    cluster-wide write on a non-CR resource, scoped to the controller's
    own managed StatefulSets via the label selector + namespace prefix.

Tests: 9 unit tests covering active-not-idle, idle-scales-zero,
active-sessions-never-scales, probe-fail-no-annotation-skips,
per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op,
default-URL-builder, leader-election-singleton.

Approach: controller polls pty-server's /idle endpoint via cluster-DNS
(smaller diff than embedding a k8s client in pty-server — pty-server
keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:51:36 +04:00
github-actions[bot]
c4fa06a9f4 deploy: bump sandbox-controller image to 3a3ee74 2026-05-18 07:46:53 +00:00
github-actions[bot]
c9fe39a20f deploy: bump bp-newapi upstream v0.13.2 chart 1.4.9 2026-05-18 07:44:23 +00:00
e3mrah
96d2d9bce7
fix(provisioning): set Organization.spec.tenantPublic on product-install (was empty; HTTPRoute reconciler had nothing to render) (#1650)
PR #1644 added Organization.spec.tenantPublic + per-tenant HTTPRoute
reconciler, but nothing set the field — every Org CR's TenantPublic
stayed zero-value, the reconciler short-circuited at the empty
ParentDomain guard, and `<slug>.omani.homes` 404'd at the Cilium
Gateway.

Wire the patch at the only point that knows a tenant's product is
actually Ready: the provisioning service. Both the initial workflow
(`provision.completed`) and the day-2 install path
(`provision.app_ready`) now patch the Organization CR's
spec.tenantPublic with parentDomain (from TENANT_PARENT_DOMAIN env),
subdomain (= slug), backendService (canonical vcluster-synced name),
port 80, and the picked product slug. Last-write-wins on subsequent
installs.

Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent zone flows through
env, never hardcoded — every Sovereign picks its own pool zone.
Empty env disables the patch entirely (legacy tenants keep working
through the Sovereign-wide tenant-wildcard route). Best-effort:
failures don't fail the provision. 404 on the CR is benign (legacy
tenant without an Organization counterpart).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:44:00 +04:00
e3mrah
3a3ee742ec
feat(sandbox-controller): call newapi /admin/tokens/sandbox + write Secret + rotation (was placeholder) (#1643)
Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox
LLM-gateway tokens via the catalyst-api bridge handler shipped in
PR #1638, replacing the Wave 1 placeholder Secret with a real
LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo.

Changes:

  - New newapi.Client (core/controllers/sandbox/internal/newapi/) —
    thin HTTP client for POST /admin/tokens/sandbox with the bridge's
    {org_id, user_id, sandbox_id, allowed_channels} body + Bearer
    ADMIN_SECRET auth. Interface so tests can stub.

  - Reconciler extended:
      * NewAPIClient + DefaultChannels + TokenRotationLeadTime fields
      * On every reconcile: decide mint-or-skip from annotation
        openova.io/sandbox-token-expires-at vs. now + lead-time
      * On mint: POST to bridge, stamp expires-at + rotated-at
        annotations on the CR, render token bytes into a new
        gitops manifest secret-newapi-token.yaml committed to the
        per-Org catalyst-tenant repo at sandbox/<owner-uid>/
      * Bridge failure → Failed/TokenMintFailed condition + 30s
        requeue + no gitops writes (fail-loud)
      * Empty DefaultChannels → NoAllowedChannels condition (fail
        earlier than the bridge's 400)

  - gitops.Render:
      * New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt
        /NewAPITokenRotatedAt fields
      * New secret-newapi-token.yaml template — Secret with
        stringData.LLM_GATEWAY_TOKEN + expires-at annotation +
        optional kubectl.kubernetes.io/restartedAt rotation marker
        so Wave 2's pty-server StatefulSet picks up rolling
        restarts on token rotation
      * kustomization.yaml appends the new manifest when token
        present

  - Chart wiring (platform/sandbox/chart):
      * Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET
        (secretKeyRef from newapi-bp-newapi-token-signing-key,
         optional: true), NEWAPI_DEFAULT_CHANNELS
      * ClusterRole bumped to allow update/patch on the
        sandboxes/ resource (the controller now stamps annotations
        on the CR)

  - platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml:
      * Added emberstack/reflector annotations so the chart-emitted
        Secret (newapi namespace) mirrors into the sandbox-controller
        namespace by default; reflectorNamespaces is overrideable.
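
The mint-or-skip decision in the reconciler bullet above can be sketched as below. It assumes the openova.io/sandbox-token-expires-at annotation holds an RFC 3339 timestamp, which this message does not state; needsMint, mustParse, and the 30-minute leadTime are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// leadTime is an illustrative TokenRotationLeadTime value.
const leadTime = 30 * time.Minute

// mustParse parses an RFC 3339 timestamp for the examples and panics on error.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// needsMint reports whether a new token should be minted: when there is no
// usable expiry annotation, or when now + lead has reached the expiry.
func needsMint(expiresAt string, now time.Time, lead time.Duration) bool {
	if expiresAt == "" {
		return true // fresh Sandbox: nothing minted yet
	}
	exp, err := time.Parse(time.RFC3339, expiresAt)
	if err != nil {
		return true // unreadable annotation: mint rather than trust it
	}
	return !now.Add(lead).Before(exp)
}

func main() {
	now := mustParse("2026-05-18T11:00:00Z")
	fmt.Println(needsMint("", now, leadTime))
	fmt.Println(needsMint("2026-05-18T11:20:00Z", now, leadTime))
	fmt.Println(needsMint("2026-05-19T11:00:00Z", now, leadTime))
}
```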

Tests:

  - newapi client: happy-path round-trip, 401 surfaces, input
    validation, request validation. 4 cases.
  - sandbox-controller: existing Wave 1 cases (happy/idempotent/
    drift/missing) still pass; 5 new cases for the token path:
    fresh mint + Secret render, rotation on near-expiry, steady-
    state no-mint, bridge failure surfaces condition, no-channels
    misconfig fails early. 9 cases total, all green.

Hard rules honored:
  - No Chart.yaml bump (chart pinning is a release-driver concern)
  - go build + go test ./core/controllers/sandbox/... clean

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:43:50 +04:00
e3mrah
8f4b34edd3
test(sandbox-ui): Playwright e2e for landing + settings + session nav (#1649)
Wave 9 regression gate for the Sandbox UI scaffold shipped in PR #1621.
Covers four happy-path surfaces:

- Sidebar Sandbox entry exists + accent-active class on /sandbox
- Landing renders 6 agent cards (aider / claude-code / cursor-agent /
  little-coder / opencode / qwen-code) with Connect Claude Max CTA
- /sandbox/settings BYOS Connect button when disconnected
- /sandbox/$id route resolves + create POST sends agent=aider

Auth gate, deployment self-discovery, SSE events, and sandbox API are
all mocked via page.route so the spec runs against `npm run dev` (Vite
on :5173) with no catalyst-api required. Per-test timeout bumped to 90s
to absorb Vite's cold-cache xterm/tanstack-router module load.

Sovereign-mode env vars required for SovereignSidebar to render:
  VITE_CATALYST_MODE=sovereign \
  VITE_SOVEREIGN_FQDN=sandbox.example.test \
  npm run dev

Local result: 4/4 passed in 2.1m (warm cache).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:41:25 +04:00
github-actions[bot]
2fee03f7d2 deploy: bump sandbox-controller image to c0020d9 2026-05-18 07:40:02 +00:00
e3mrah
c0020d9c33
feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs) (#1645)
* feat(pkg/gitea): add ListPullRequests + GetPullRequest read API

Wave 8 prerequisite for openova-sandbox-mcp's gitea.pr.list +
gitea.pr.get tools. Mirrors the existing client surface
(CreatePullRequest, ListOrgRepos) with state-filtered pagination and
a get-by-number fetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs)

Wave 8 swaps the openova-sandbox-mcp Wave-2 not_implemented stubs for
production-ready handlers on:

- gitea.repo.list / gitea.repo.get (delegates to core/controllers/pkg/gitea)
- gitea.pr.list / gitea.pr.get     (delegates to new ListPullRequests +
  GetPullRequest helpers in pkg/gitea; org-scope check rejects cross-tenant
  owner overrides at tool dispatch time)
- k8s.read.get / k8s.read.list / k8s.read.watch (dynamic.Interface against
  the Sandbox pod's in-cluster SA or SANDBOX_KUBECONFIG; watch is a
  bounded short-watch — long-lived subs land Wave 9 via MCP
  resources/subscribe)
- sandbox.session.whoami / sandbox.session.info (echo per-call Claims +
  Sandbox metadata so the agent can self-discover its scope)

Auth: every tools/call carries a bearer (via _auth.token arg OR
SANDBOX_TOKEN env). main.go validates HS256 against SANDBOX_JWT_SECRET
using the canonical core/services/shared/auth.Claims shape (PR #1619),
strips _auth from the args, installs Claims on ctx, then Registry.Call
gates on capability + org_id-match before reaching the handler.
sandbox.session.* skips the org-scope check (the operator's session
is the operator's regardless of which Org slug their claim carries).
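
A simplified sketch of the gate described above, run after the bearer has been validated. The claims struct is a stand-in for the shared auth.Claims shape, whose real fields are not shown here, and the capability strings in the example are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
	"strings"
)

// claims is a hypothetical, reduced view of the validated JWT claims.
type claims struct {
	OrgID        string
	Capabilities []string
}

var (
	errMissingCapability = errors.New("missing capability")
	errOrgMismatch       = errors.New("org_id mismatch")
)

// authorize checks the capability first, then the org match.
// sandbox.session.* tools skip the org-scope check, as described above.
func authorize(c claims, tool, requiredCap, envOrgID string) error {
	if !slices.Contains(c.Capabilities, requiredCap) {
		return errMissingCapability
	}
	if strings.HasPrefix(tool, "sandbox.session.") {
		return nil
	}
	if c.OrgID != envOrgID {
		return errOrgMismatch
	}
	return nil
}

func main() {
	c := claims{OrgID: "acme", Capabilities: []string{"gitea", "sandbox.session"}}
	fmt.Println(authorize(c, "gitea.pr.list", "gitea", "acme"))
	fmt.Println(authorize(c, "gitea.pr.list", "gitea", "other"))
	fmt.Println(authorize(c, "sandbox.session.whoami", "sandbox.session", "other"))
	fmt.Println(authorize(c, "k8s.read.get", "k8s.read", "acme"))
}
```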

Stubs retained (Wave 8+):
- sandbox.db.*   (CNPG Cluster CR provisioning)
- sandbox.auth.* (Keycloak realm/client management)
- gitea.pr.create / gitea.pr.merge / gitea.issue.* / gitea.release.*
- k8s.read.logs

Hard rule preserved: k8s.write.* never lands in the MCP surface.

24 new tests (registry catalogue completeness, auth gate, gitea via
httptest stub, JWT round-trip, env-var parsing).

Builds clean against go 1.23 + k8s.io/client-go v0.31.1; module wires
core/controllers + core/services/shared via the same replace pattern
catalyst-bootstrap and every sme-service already use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:36:53 +04:00
github-actions[bot]
c6820e3d4a deploy: bump sandbox-controller image to 9f6354f 2026-05-18 07:33:12 +00:00
e3mrah
a8a56a25f6
fix(org-controller): render per-tenant HTTPRoute so <slug>.omani.homes serves traffic (#1644)
PowerDNS now resolves <slug>.<parentDomain> for every Org mapped onto a
Sovereign's role=sme-pool parent domain (PR #1629), but no HTTPRoute was
attaching that hostname to the tenant's installed product Service. The
Cilium Gateway terminated TLS on the wildcard cert and fell through to
the marketplace tenant-wildcard route — serving the storefront landing
page instead of the tenant's WordPress / Nextcloud / GitLab install.

Fix:

1. Extend Organization CRD with optional spec.tenantPublic
   (parentDomain, subdomain, backendService, backendPort, product).
2. organization-controller renders a Gateway-API HTTPRoute in the Org
   namespace (= slug) attached to cilium-gateway/kube-system when
   parentDomain is set. Skipped silently when unset so existing Orgs
   keep working.
3. Chart-side templates/sme-services/tenant-public-routes.yaml renders
   the same HTTPRoute shape from .Values.tenantRoutes[] for operators
   that prefer static fixtures over the controller's reconcile loop.
4. Tests: TestReconcile_TenantPublic_RendersHTTPRoute and
   TestReconcile_TenantPublic_DisabledByDefault cover both paths.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:32:54 +04:00
e3mrah
8888d9edd1
feat(catalog+billing): Sandbox Free/Pro/Ent plans + quota wire (was no plans = broken checkout) (#1642)
PR #1633 added the Sandbox app to seedApps but never wired the matching plan
rows. The marketplace checkout hit "plan_id not found" the moment a customer
picked Sandbox, and PR #1639's sandbox-orchestrator could only mint CRs with
the Wave 1 baseline quota regardless of the picked tier.

This PR closes both gaps in lockstep:

Catalog:
- Plan struct gets ProductSlug + IncludedQuotas fields (back-compat:
  omitempty BSON tags so legacy rows decode fine).
- expectedSandboxPlans() helper canonical-defines the three tiers:
    sandbox-free  0 OMR  1 session, 1 agent,    5 GB, BYOS
    sandbox-pro   9 OMR  3 sessions, 6 agents, 50 GB, BYOS (Popular)
    sandbox-ent  49 OMR  unlimited,  6 agents, 500 GB, BYOS
- seedAllData appends them on fresh seed; seedMissingSandboxPlans
  backfills them on already-populated Sovereigns (idempotent GET-then-
  create, patches missing ProductSlug/IncludedQuotas on legacy rows).
- UpdatePlan persists the two new fields.

Sandbox orchestrator wiring:
- SandboxRequestedPayload.PlanID added; CreateOrg forwards body.PlanID.
- buildSandbox stamps openova.io/plan-id annotation + spec.planId when
  PlanID is non-empty.
- quotaForPlan() maps sandbox-{free,pro,ent} → SandboxQuota; empty or
  unknown plan_id falls through to DefaultQuota (Wave 1 baseline =
  Sandbox Free shape). Hard-coded map mirrors catalog IncludedQuotas so
  tenant-service avoids a compile-time dep on the catalog mongo stack.
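
A minimal sketch of the plan-to-quota mapping with its fallback, using the tier numbers from the catalog list above. The struct fields and the use of -1 for "unlimited" are assumptions; the real SandboxQuota type may differ.

```go
package main

import "fmt"

// unlimited is an assumed sentinel for the Enterprise session count.
const unlimited = -1

// quota is a hypothetical reduced form of SandboxQuota.
type quota struct {
	Sessions  int
	Agents    int
	StorageGB int
}

// planQuotas mirrors the three catalog tiers.
var planQuotas = map[string]quota{
	"sandbox-free": {Sessions: 1, Agents: 1, StorageGB: 5},
	"sandbox-pro":  {Sessions: 3, Agents: 6, StorageGB: 50},
	"sandbox-ent":  {Sessions: unlimited, Agents: 6, StorageGB: 500},
}

// defaultQuota is the Wave 1 baseline, which has the Sandbox Free shape.
var defaultQuota = planQuotas["sandbox-free"]

// quotaForPlan returns the tier quota, falling back to the default for an
// empty or unknown plan ID.
func quotaForPlan(planID string) quota {
	if q, ok := planQuotas[planID]; ok {
		return q
	}
	return defaultQuota
}

func main() {
	for _, id := range []string{"sandbox-pro", "sandbox-ent", "", "sandbox-typo"} {
		fmt.Printf("%q -> %+v\n", id, quotaForPlan(id))
	}
}
```

Keeping the map local is the trade-off named above: it duplicates the catalog values, so the tests that lock the quota ladder are what keep the two in step.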

Tests:
- TestExpectedSandboxPlans_Shape locks slugs, prices, quota keys, the
  Popular flag (sandbox-pro), and the quota ladder.
- TestSandboxHandle_PlanIDStampsAnnotationAndQuota table-test exercises
  all three tiers end-to-end (annotation + spec.planId + spec.quota).
- TestSandboxHandle_PlanIDEmptyKeepsDefaultQuota guards back-compat
  with pre-PR publishers.
- TestSandboxHandle_PlanIDUnknownFallsBackToDefault guards typo'd /
  retired plan IDs.

go build + go test clean for catalog, tenant, billing, provisioning,
shared, marketplace-api.

No Chart.yaml bump, no cluster touch.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:31:25 +04:00
e3mrah
9f6354f1e1
feat(sandbox): controller spawns pty-server + MCP Pods (was just namespace+RBAC+PVCs) (#1641)
Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous
slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC +
PVCs + placeholder Secret — but NO pty-server, NO MCP server. A freshly-
created Sandbox sat there with empty plumbing and no way for the user
to actually run a coding session.

This PR completes the per-Sandbox runtime by extending
core/controllers/sandbox/internal/gitops/manifests.go to render the
four manifests architecture.md §7 enumerates:

- StatefulSet pty-server (replicas = spec.quota.concurrentSessions,
  one Pod per in-flight session per architecture.md §1/§2). Env wired
  per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID,
  SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL,
  LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret
  (key llm-gateway-token, optional). When claude-code is in
  spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the
  per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key
  access_token, optional) per claude-code-byos.md §3. Repo PVCs mount
  at /workspace/<repo-slug>.
- Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP
  server, talks to pty-server via the in-namespace ClusterIP Service.
- Service pty-server (ClusterIP :7681) — backend for both the MCP
  Deployment and the HTTPRoute.
- HTTPRoute pty-server — publishes
  sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via
  the existing catalyst-public Cilium Gateway in catalyst-system.
  PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees
  its own /sessions/<id> surface.
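
Illustration of the PathPrefix rewrite on the HTTPRoute. The rewrite
itself is done by the Gateway, not by Go code; the sketch below only
models the intended effect. StripSessionPrefix is a hypothetical helper
and matching on whole path segments is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// StripSessionPrefix models the rewrite: it removes /sessions/<owner-uid>
// from the front of the request path. It matches whole path segments only,
// so owner "u1" does not match "/sessions/u10/...".
func StripSessionPrefix(ownerUID, path string) (string, bool) {
	prefix := "/sessions/" + ownerUID
	if path != prefix && !strings.HasPrefix(path, prefix+"/") {
		return "", false
	}
	rest := strings.TrimPrefix(path, prefix)
	if rest == "" {
		rest = "/"
	}
	return rest, true
}

func main() {
	for _, p := range []string{"/sessions/u1/sessions/abc", "/sessions/u10/x", "/sessions/u1"} {
		rest, ok := StripSessionPrefix("u1", p)
		fmt.Printf("%s -> %q (matched=%v)\n", p, rest, ok)
	}
}
```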

Knobs are env-plumbed from the chart per Inviolable Principle #4:
- SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image
  refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm
  render fast on empty, no silent :latest).
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit
  overlay derives it from ${SOVEREIGN_FQDN}).
- SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX /
  SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc
  defaults.

Idle timeout (architecture.md §7) lands as a StatefulSet annotation
openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually
scales the StatefulSet down on idle ships in a sibling PR (out of
scope for "spawn the Pods"; this PR makes the Pods exist).

Tests cover the full Wave-8 manifest shape: replicas count, identity
env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname
binding, kustomization stitching, idempotency. go test
./core/controllers/sandbox/... green; helm template renders cleanly +
required guard fires on missing runtime values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:30:00 +04:00
e3mrah
422da46360
fix(sovereign-tls): cilium-gateway listeners per parentZone (#1640)
Issue #831 follow-on to #827. Previously the Cilium Gateway declared a
single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under
non-primary parent zones (e.g. wp-foo.omani.homes when the operator
brings omani.homes as the SME pool) hit cilium-envoy's default fallback
cert and failed the TLS handshake on a hostname mismatch. The per-zone
wildcard Secret rendered by
products/catalyst/chart/templates/sovereign-wildcard-certs.yaml (PR #827)
existed but had no Gateway listener claiming its hostname.

Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent
zone. Materialised at Terraform plan time as a JSON-flow array
(infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode
of the listener objects iterating decoded parent_domains_yaml), threaded
through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and
consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}`
in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone
Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay
in lockstep.

Scalar placeholder (not multi-line block) because kustomize-build parses
the YAML before Flux runs envsubst — a placeholder on its own line at
column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then
swaps it for the JSON-flow array string, which the apiserver parses as
the real listener list.

Single-zone fallback preserved (var.parent_domains_yaml empty →
[{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone
provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions
(e.g. primary omani.works + sme-pool omani.homes) render 4 listeners.
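
Illustration of the per-zone listener expansion. The real expansion is
a Terraform jsonencode in infra/hetzner/main.tf; this Go sketch only
shows the resulting shape. The struct names, the sanitiseZone rule
(dots to dashes), and the `*.<zone>` hostname form are assumptions. The
ports, listener name prefixes, and Secret name prefix come from the text
above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type CertRef struct {
	Name string `json:"name"`
}

type TLS struct {
	CertificateRefs []CertRef `json:"certificateRefs"`
}

type Listener struct {
	Name     string `json:"name"`
	Hostname string `json:"hostname"`
	Port     int    `json:"port"`
	Protocol string `json:"protocol"`
	TLS      *TLS   `json:"tls,omitempty"`
}

// sanitiseZone is an assumed rule: dots become dashes.
func sanitiseZone(zone string) string { return strings.ReplaceAll(zone, ".", "-") }

// ListenersForZones renders one HTTPS + one HTTP listener per parent zone,
// each HTTPS listener pointing at that zone's wildcard Secret.
func ListenersForZones(zones []string) []Listener {
	out := make([]Listener, 0, 2*len(zones))
	for _, z := range zones {
		s := sanitiseZone(z)
		out = append(out,
			Listener{Name: "https-" + s, Hostname: "*." + z, Port: 30443, Protocol: "HTTPS",
				TLS: &TLS{CertificateRefs: []CertRef{{Name: "sovereign-wildcard-tls-" + s}}}},
			Listener{Name: "http-" + s, Hostname: "*." + z, Port: 30080, Protocol: "HTTP"},
		)
	}
	return out
}

func main() {
	// JSON-flow output is valid YAML, which is why it can replace a scalar placeholder.
	b, err := json.Marshal(ListenersForZones([]string{"omani.works", "omani.homes"}))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```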

Verification:
  - kubectl kustomize clusters/_template/sovereign-tls/ → clean
  - End-to-end simulation (single-zone, two-zone) renders correct
    listener counts (2 / 4) with correct certificateRefs per zone.
  - Listener naming `https-<sanitised>` / `http-<sanitised>` is unique
    per listener so Gateway controller programs them all (duplicate
    names produce Conflicting status condition).

Files:
  - clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar
    listeners placeholder + comment block explaining the why)
  - infra/hetzner/main.tf (locals.parent_domains_decoded +
    locals.parent_domains_listeners_yaml; threaded into primary CP and
    secondary regions' templatefile() calls)
  - infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML
    substitute var in sovereign-tls Kustomization block)

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:09:26 +04:00
e3mrah
4c83d98765
feat(sandbox): orchestrator listens tenant.sandbox_requested → Sandbox CR materialisation (#1639)
PR #1633 wired CreateOrg to publish `tenant.sandbox_requested` when the
marketplace cart includes the sandbox product. Nobody was subscribing —
the event landed in NATS `catalyst.tenant.sandbox_requested` and aged
out unread, so no Sandbox CR (PR #1622) was ever minted and the
customer sat on a "Provisioning…" spinner forever.

This slice closes the loop. A new SandboxOrchestrator in tenant-service:

- Subscribes via events.MultiSubscriber (PR #1636) to the canonical
  NATS subject + legacy Kafka topic.
- Parses {tenant_id, org_slug, owner_id, owner_email, agents,
  sovereign, requested_at} and resolves the owner email (event field
  → store.GetMemberEmail → owner_id fallback).
- Materialises a Sandbox CR in catalyst-system (SANDBOX_NAMESPACE
  override) via a dynamic client, with spec per architecture §7:
  owner.email + owner.orgRef.slug, default quota (4 CPU / 8 Gi /
  50 Gi / 3 sessions), spec.agentCatalogue from the cart.
- Idempotent: Get-then-Create with AlreadyExists swallowed so NATS
  redeliveries + duplicate marketplace submits stay no-ops; the
  sandbox-controller remains SoR for spec mutations.
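
Illustration of the Get-then-Create idempotency above. This is a sketch
against an in-memory fake, not the dynamic-client code: SandboxStore,
memStore, EnsureSandbox, and the two sentinel errors are hypothetical
stand-ins for the Kubernetes client and its NotFound / AlreadyExists
errors.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrNotFound      = errors.New("not found")
	ErrAlreadyExists = errors.New("already exists")
)

// SandboxStore is a hypothetical stand-in for the cluster API.
type SandboxStore interface {
	Get(name string) (string, error)
	Create(name, spec string) error
}

type memStore struct{ items map[string]string }

func (m *memStore) Get(name string) (string, error) {
	spec, ok := m.items[name]
	if !ok {
		return "", ErrNotFound
	}
	return spec, nil
}

func (m *memStore) Create(name, spec string) error {
	if _, ok := m.items[name]; ok {
		return ErrAlreadyExists
	}
	m.items[name] = spec
	return nil
}

// EnsureSandbox reports created=true only when this call created the object.
// An existing object is never mutated, so redeliveries are no-ops.
func EnsureSandbox(s SandboxStore, name, spec string) (bool, error) {
	_, err := s.Get(name)
	if err == nil {
		return false, nil
	}
	if !errors.Is(err, ErrNotFound) {
		return false, err
	}
	err = s.Create(name, spec)
	if errors.Is(err, ErrAlreadyExists) {
		return false, nil // another delivery won the race between Get and Create
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	store := &memStore{items: map[string]string{}}
	for i := 0; i < 2; i++ {
		created, err := EnsureSandbox(store, "sandbox-acme", "spec-v1")
		fmt.Println(created, err)
	}
}
```

Swallowing AlreadyExists after the Get covers the window where two
deliveries of the same event race each other.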

Wiring in main.go is best-effort — when no in-cluster config nor
KUBECONFIG is available (CI / dev loops) the orchestrator is skipped
with a Warn; the rest of the tenant service still boots.

Hard rules: no chart bump, no cluster writes outside of the Sandbox
Create call (sandbox-controller reconciles the rest), `go build ./...`
clean, `go test ./...` clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:09:22 +04:00
github-actions[bot]
22851d980d deploy: bump bp-newapi upstream v0.13.2 chart 1.4.8 2026-05-18 07:03:09 +00:00
e3mrah
4abd156fee
feat(newapi): real /admin/tokens/sandbox mint impl (was stub from #1619) (#1638)
Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a
real HS256 mint flow the sandbox-controller can call when it rolls out
a fresh Sandbox Pod.

Handler (platform/newapi/internal/handler/sandbox_token.go):
  - Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET),
    constant-time compared. 401 on mismatch / missing bearer.
  - Request body: {org_id, user_id, sandbox_id, allowed_channels[]}.
    De-duplicates + scrubs empty channel names so a controller bug
    sending [""] can't mint a token that NewAPI silently treats as
    "no restriction".
  - Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape:
    {sub: sandbox_id, org: org_id, user: user_id, channels: [...],
     iat, exp: iat+7d, typ: "sandbox"}.
  - Returns {token, expires_at}.
  - Refuses with 503 when SigningKey or AdminSecret is unset
    (visible chart-wiring gap, not a forgeable-token leak).
  - Removes the previous Claims/jwt.Parse PAT-validation path that
    came with the stub — caller is the controller, not an operator.
  - NewHandlerFromEnv() factory loads + validates env at process
    start so catalyst-api can fail loudly instead of shipping the
    endpoint silently.
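
Illustration of the mint flow above using only the Go standard library.
This is a sketch under assumptions, not the code in sandbox_token.go:
the function names and signatures are hypothetical, and the real
handler may use a JWT library. Only the claim names, the 7-day
lifetime, the constant-time bearer check, and the channel scrub come
from the description.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// CleanChannels drops empty names and duplicates, preserving order.
func CleanChannels(in []string) []string {
	seen := map[string]bool{}
	out := []string{}
	for _, c := range in {
		c = strings.TrimSpace(c)
		if c == "" || seen[c] {
			continue
		}
		seen[c] = true
		out = append(out, c)
	}
	return out
}

// BearerMatches compares in constant time and rejects an unset secret.
func BearerMatches(got, want string) bool {
	return want != "" && subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

// MintSandboxToken builds header.payload.signature with HMAC-SHA256.
// iat is Unix seconds; exp is iat plus seven days.
func MintSandboxToken(key []byte, orgID, userID, sandboxID string, channels []string, iat int64) (string, error) {
	if len(key) == 0 {
		return "", errors.New("signing key unset")
	}
	header, err := json.Marshal(map[string]string{"alg": "HS256", "typ": "JWT"})
	if err != nil {
		return "", err
	}
	claims, err := json.Marshal(map[string]any{
		"sub": sandboxID, "org": orgID, "user": userID, "channels": channels,
		"iat": iat, "exp": iat + 7*24*3600, "typ": "sandbox",
	})
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding
	signingInput := enc.EncodeToString(header) + "." + enc.EncodeToString(claims)
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(signingInput))
	return signingInput + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

func main() {
	channels := CleanChannels([]string{"", "qwen3.6-bankdhofar", "qwen3.6-bankdhofar"})
	tok, err := MintSandboxToken([]byte("demo-key-not-a-real-secret"), "org", "user@example.com", "uid-1", channels, time.Now().Unix())
	fmt.Println(tok, err)
}
```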

Unit tests (sandbox_token_test.go) — 11 cases:
  - happy path (mint + claim shape + signature round-trip)
  - de-dup + empty-channel scrub
  - admin-secret mismatch / missing bearer → 401
  - missing org_id / user_id / sandbox_id / empty channels → 400
  - non-POST → 405
  - unset env → 503
  - mintSandboxToken empty-secret guard + round-trip
  - response does not echo admin secret or signing key

Chart wiring (platform/newapi/chart):
  - New Secret template sandbox-token-signing-key-secret.yaml
    auto-renders with Helm `lookup` + helm.sh/resource-policy: keep
    (same load-bearing pattern as credentials-secret.yaml #943 and
    gitea admin-secret.yaml #830 Bug 2). 64-char alphanumeric values
    for both SIGNING_KEY and ADMIN_SECRET; persistence across
    reconciles is required because a reconcile-time rotation would
    silently invalidate every per-Sandbox token across the Sovereign
    AND break the sandbox-controller's auth path until its Pod
    restarts.
  - values.yaml block sandboxTokenSigningKey.{existingSecret,
    autoProvision, autoSecretName} matching the `credentials`
    convention (operator override > auto-provision > skip-render).
  - No Chart.yaml bump — chart value addition only.

Verification:
  - go build ./platform/newapi/internal/handler/... — clean
  - go test ./platform/newapi/internal/handler/... — 11/11 PASS
  - helm template platform/newapi/chart — Secret renders

How sandbox-controller will use it:
  1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key.
  2. POST /admin/tokens/sandbox with bearer + body
     {org_id: <Sandbox.spec.owner.orgRef.slug>,
      user_id: <Sandbox.spec.owner.email>,
      sandbox_id: <Sandbox.metadata.uid>,
      allowed_channels: ["qwen3.6-bankdhofar"]}.
  3. Write returned token into Secret/sandbox-<uid>-newapi-token.
  4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 11:02:40 +04:00
e3mrah
401ab6713a
feat(catalyst-api): /api/v1/sandbox/sessions CRUD + sandboxes GVR in k8sCache + cutover-driver RBAC (#1637)
Wires the catalyst-api backend the Sandbox FE (PR #1621 — getSandboxes /
createSandbox / getByosStatus in sandbox.api.ts) has been calling into.
Without this handler the /sandbox surface on the Sovereign Console rendered
its empty state forever — every getSandboxes() 404'd at the catalyst-api
ingress and every "Start a session" click hit the same wall.

Handler — products/catalyst/bootstrap/api/internal/handler/sandbox_sessions.go
- GET    /api/v1/sandbox/sessions          — list Sandbox CRs in the
                                              operator's Org namespace
- POST   /api/v1/sandbox/sessions          — create Sandbox CR with agent
                                              validated against the 6-agent
                                              catalogue (aider / claude-code /
                                              cursor-agent / little-coder /
                                              opencode / qwen-code)
- GET    /api/v1/sandbox/sessions/{id}     — fetch single Sandbox detail
- DELETE /api/v1/sandbox/sessions/{id}     — graceful delete (the controller
                                              fires finalizers + cleans up
                                              the per-Sandbox vcluster
                                              namespace + PVCs + RBAC)

Client resolution mirrors the Family E compliance + k8s_resource_actions.go
seam: k8sCache.Factory.DynamicClientFor(resolveChrootClusterID("")) is the
primary path; sovereignDepsFor() — rest.InClusterConfig() — is the chroot
in-cluster fallback per feedback_chroot_in_cluster_fallback.md. Both 503
when unavailable so the FE renders its "API pending" pill rather than a
spinner.

Org-scoping uses claims.Org (the org_id Keycloak claim PR #1619 lit up)
for the CR namespace + spec.owner.orgRef.slug. Single-tenant chroots
without an org_id fall back through CATALYST_SANDBOX_DEFAULT_NAMESPACE
to a sensible default per docs/INVIOLABLE-PRINCIPLES.md #4. Wave-1 quota
defaults (4 CPU / 8Gi memory / 50Gi storage / 3 concurrent sessions)
mirror products/sandbox/docs/architecture.md §7 — the FE doesn't yet
expose a quota picker.

Status projection: CRD vocabulary (Pending|Provisioning|Ready|Failed)
maps to FE vocabulary (pending|running|stopped|failed|unknown) in
mapSandboxStatus so a fresh Sandbox shows the spinner rather than
"unknown" until the controller catches up.

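Illustration of the status projection. The commit names the two
vocabularies but not the exact pairing, so the mapping below is an
assumption, and projectStatus is a hypothetical stand-in for
mapSandboxStatus. Nothing maps to "stopped" here because the commit
does not say which CRD phase would produce it.

```go
package main

import "fmt"

// projectStatus maps a CRD phase to the FE vocabulary. An empty phase
// (controller has not written status yet) is treated as pending so the
// FE shows a spinner instead of "unknown".
func projectStatus(phase string) string {
	switch phase {
	case "", "Pending", "Provisioning":
		return "pending"
	case "Ready":
		return "running"
	case "Failed":
		return "failed"
	default:
		return "unknown"
	}
}

func main() {
	for _, p := range []string{"", "Pending", "Provisioning", "Ready", "Failed", "Bogus"} {
		fmt.Printf("%q -> %s\n", p, projectStatus(p))
	}
}
```
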
k8sCache.DefaultKinds — products/catalyst/bootstrap/api/internal/k8scache/kinds.go
- Adds sandbox.openova.io/v1 Sandbox so the generic /k8s/{kind} surface
  enumerates Sandboxes the same way it does Applications + UserAccess.
  Per feedback_chroot_in_cluster_fallback.md every new GVR here needs a
  matching rule on the cutover-driver SA.

Cutover-driver RBAC — products/catalyst/chart/templates/clusterrole-cutover-driver.yaml
- Adds sandboxes.sandbox.openova.io with verbs split per
  feedback_rbac_create_no_resourcenames.md:
    rule 1: ["create"]
    rule 2: ["get","list","watch","delete"]
- Read-only on status (the controller owns status); write is spec-only
  on POST + the apiserver delete on DELETE.

Routes — products/catalyst/bootstrap/api/cmd/api/main.go
- Registered inside the RequireSession group alongside the existing
  /api/v1/sandbox/byos/claude-code/* surface; same auth gate, same
  patternless leading "/api/v1/sandbox/...".

Verified: go build clean, go vet clean, k8scache test suite green
(2.7s), helm template renders the new RBAC block.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:45:05 +04:00
github-actions[bot]
0ad7879013 deploy: update sme service images to 72f82ea + bump chart to 1.4.162 2026-05-18 06:33:51 +00:00
e3mrah
72f82ea7f2
fix(sme): wire provisioning/notification/domain consumers to NATS (was Kafka-only, was silent-dropping every tenant.created event) (#1636)
PR #1626 wired the PUBLISH leg of tenant + billing to NATS via
events.MultiPublisher (canonical subject `catalyst.<event.Type>` per
ADR-0001 §6). The CONSUME leg stayed Kafka-only — provisioning,
notification, domain, billing's tenant-events cascade, AND tenant's own
provision-events + members-cleanup consumers all called
events.NewConsumer(redpandaBrokers, …). On Sovereigns REDPANDA_BROKERS
is empty by design (no Redpanda exists; NATS is the canonical bus per
the convergence-fix block in configmap.yaml) so those consumers either
never started OR dialed `localhost:9092` in a hot crash loop.

Net effect on every Sovereign install pre-this-PR:
  1. alice POSTs /sme/tenants → tenant publishes catalyst.tenant.created
     to NATS (PR #1626).
  2. provisioning's only subscriber was Kafka-only → silent drop.
  3. No Organization CR ever spawned → no vCluster → CONVERGENCE BROKEN.

This change introduces a symmetric subscribe-side abstraction mirroring
bridge.go's MultiPublisher:

  - events.BrokerSubscriber: unified Subscribe(ctx, handler) interface,
    satisfied by *Consumer, *DLQSubscriber, *MultiSubscriber.
  - events.MultiSubscriber: fans in from NATS JetStream durable
    consumers (one per canonical subject) + an optional legacy Kafka
    Consumer. NewMultiSubscriber refuses to construct with both legs
    nil (the silent-no-op pattern this PR exists to prevent).
  - events.NATSConn.ensureSMEStream: idempotently creates the
    CATALYST_SME Stream filtering `catalyst.>` so the first consumer
    on a fresh Sovereign bootstraps lifecycle.
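
Illustration of the "refuse to construct with both legs nil" rule. This
is a simplified sketch, not the events package: the Handler signature
and the staticLeg fake are hypothetical, and the real MultiSubscriber
presumably runs its legs concurrently rather than one after another.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Handler is a hypothetical callback shape.
type Handler func(ctx context.Context, subject string, payload []byte) error

// BrokerSubscriber is the unified subscribe-side interface.
type BrokerSubscriber interface {
	Subscribe(ctx context.Context, h Handler) error
}

type MultiSubscriber struct{ legs []BrokerSubscriber }

// NewMultiSubscriber errors out when no transport is configured, so a
// misconfigured service fails at startup instead of silently dropping events.
func NewMultiSubscriber(nats, kafka BrokerSubscriber) (*MultiSubscriber, error) {
	var legs []BrokerSubscriber
	if nats != nil {
		legs = append(legs, nats)
	}
	if kafka != nil {
		legs = append(legs, kafka)
	}
	if len(legs) == 0 {
		return nil, errors.New("multi-subscriber: no transport configured")
	}
	return &MultiSubscriber{legs: legs}, nil
}

// Subscribe fans in sequentially in this sketch.
func (m *MultiSubscriber) Subscribe(ctx context.Context, h Handler) error {
	for _, leg := range m.legs {
		if err := leg.Subscribe(ctx, h); err != nil {
			return err
		}
	}
	return nil
}

// staticLeg is a fake transport that delivers fixed messages synchronously.
type staticLeg struct {
	subject string
	msgs    []string
}

func (s staticLeg) Subscribe(ctx context.Context, h Handler) error {
	for _, m := range s.msgs {
		if err := h(ctx, s.subject, []byte(m)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	if _, err := NewMultiSubscriber(nil, nil); err != nil {
		fmt.Println("refused:", err)
	}
	ms, _ := NewMultiSubscriber(staticLeg{subject: "catalyst.tenant.created", msgs: []string{"{}"}}, nil)
	_ = ms.Subscribe(context.Background(), func(_ context.Context, subj string, p []byte) error {
		fmt.Println(subj, string(p))
		return nil
	})
}
```

One caveat with this shape: a typed nil pointer passed as an interface
is not == nil, so callers should pass a literal nil for an absent leg.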

Each service's main.go now constructs a MultiSubscriber and passes it
to the consumer dispatch loop. Consumer signatures take
events.BrokerSubscriber instead of *events.Consumer (interface upcast,
so *events.Consumer call sites keep working on Catalyst-Zero):

  - provisioning: tenant.created / tenant.deleted /
    tenant.app_install_requested / tenant.app_uninstall_requested /
    order.placed (the 5 subjects PR #1626 publishes to NATS).
    Also wires MultiPublisher so provision.* publishes hit NATS too —
    downstream tenant + notification consumers need them.
  - notification: full fan-in (user.login, order.placed,
    payment.received, provision.*, domain.*, member.invited).
  - domain: tenant.deleted (subdomain + BYOD reclamation cascade).
  - billing: tenant.deleted (Stripe sub-cancel + invoice void + ledger
    marker cascade). Existing metering NATS subscriber unaffected.
  - tenant: provision.* + tenant.deleted (members cleanup).
    Now reachable on Sovereigns; pre-this-PR they were inside the
    `if redpandaBrokersRaw != ""` block.

Chart wiring: NATS_URL env added to provisioning, notification, and
domain Deployments (tenant + billing already wired via PR #1626).
notification.yaml also flips its hardcoded REDPANDA_BROKERS literal to
the shared ConfigMap key so the per-topology default (empty on
Sovereigns, talentmesh redpanda on Catalyst-Zero) applies.

Verification:
  - go build ./core/services/{shared,tenant,billing,provisioning,
    notification,domain}/... clean.
  - go test ./... clean across all 6 modules.
  - helm template with global.sovereignFQDN=test.example.com renders
    NATS_URL="nats://nats-jetstream.nats-system.svc.cluster.local:4222"
    into all 5 Deployments + ConfigMap.
  - helm template without sovereignFQDN renders NATS_URL="" and
    REDPANDA_BROKERS=talentmesh redpanda, matching Catalyst-Zero.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:32:49 +04:00
e3mrah
62c5620741
docs: session 2026-05-17/18 convergence report + DoD D32-D35 + Sandbox status update (#1635)
- New docs/SESSION-2026-05-17-CONVERGENCE.md narrative session report covering
  the 22 user-facing PRs (#1597-#1632) across 9 waves: founder bug families,
  BSS iframe-seam removal, bp-hcloud-csi removal, CloudPage TS hotfix,
  Sandbox W1-W5 scaffold, and 9 convergence-cleanup fixes.
- SOVEREIGN-MULTI-REGION-DOD.md extended D31 -> D35: Sandbox CRD installable
  (D32), Sandbox agent catalogue picker (D33), newapi Sovereign-side LLM
  gateway (D34), NATS broker round-trip publish+consume (D35).
- products/sandbox/README.md flips Status from "Design. Not yet implemented."
  to "Wave 1-5 implementation in flight (PRs #1615/#1618/#1619/#1621/#1622/#1632
  merged; runtime smoke pending fresh prov)". Adds founder TODO to register
  Anthropic OAuth client_id per claude-code-byos.md.

No code, chart, or test changes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:28:11 +04:00
e3mrah
9690ff8351
feat(sandbox+bootstrap-kit): slot 61 bp-sandbox HR (deploys sandbox-controller on Sovereigns, gated SANDBOX_ENABLED) (#1634)
Wires PR #1622's platform/sandbox/chart/ into bootstrap-kit so that
sandbox-controller actually deploys on Sovereigns. Without this slot,
the chart ships but no HelmRelease installs it — Sandbox CRs sit
unhandled.

- NEW clusters/_template/bootstrap-kit/61-bp-sandbox.yaml — HelmRepository
  + HelmRelease for the `sandbox` chart (name comes from
  platform/sandbox/chart/Chart.yaml `name: sandbox`).
  - dependsOn: bp-vcluster-helmrepo (slot 60, Wave 2 per-Sandbox vCluster
    source), bp-catalyst-platform (slot 13, catalyst-system Namespace +
    catalyst-gitea-token Secret).
  - targetNamespace: catalyst-system (where the controller lives).
  - values.enabled gated default-OFF via ${SANDBOX_ENABLED:-false}
    (matches platform/sandbox/chart/values.yaml `enabled: false`).
  - env.hostCluster + env.sovereignFQDN fed from canonical
    SOVEREIGN_REGION_CANONICAL_LABEL + SOVEREIGN_FQDN substitutes.
- MODIFY kustomization.yaml — register 61-bp-sandbox.yaml after slot 60.
- MODIFY scripts/expected-bootstrap-deps.yaml — declare slot 61 with
  depends_on=[bp-vcluster-helmrepo, bp-catalyst-platform]; validator
  reports drift=0/cycles=0.

NO chart Chart.yaml bump (Wave 1 chart stays at 0.1.0).
`helm template` + `kubectl kustomize` render clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:26:18 +04:00
github-actions[bot]
ced295726a deploy: update sme service images to 4bad2a3 + bump chart to 1.4.161 2026-05-18 06:24:41 +00:00
github-actions[bot]
77083dbcd5 deploy: update Catalyst marketplace image to 4bad2a3 2026-05-18 06:24:18 +00:00
e3mrah
4bad2a3cea
Merge pull request #1633 from openova-io/sandbox-wave4-marketplace-catalog-entry
feat(sandbox): Wave 4 — marketplace catalog entry (customer can pick Sandbox alongside WordPress)
2026-05-18 10:23:36 +04:00
Emrah Baysal
b8b80973de feat(sandbox): Wave 4 — marketplace catalog entry (customer can pick Sandbox alongside WordPress)
Adds the Sandbox product to the marketplace storefront so a customer
picks it off marketplace.<sov>/apps the same way they pick WordPress /
Nextcloud. Card chrome is the existing .app-card shape verbatim — no
new components per the design-system inheritance rule. The detail page
gains a 6-agent picker (aider, claude-code, cursor-agent, little-coder,
opencode, qwen-code) using the existing .related-card chrome with a
picked state mirroring .app-card.in-cart. Picks land on cart.agents
and travel through checkout into the tenant create-org payload.

Tenant-service emits a sibling `tenant.sandbox_requested` event on
sme.tenant.events when the cart contains the sandbox product. The
event carries org slug + owner + agents list, sufficient for the
sandbox-controller (or its upstream orchestrator) to mint a Sandbox
CR with matching spec.agentCatalogue. The Organization CR creation
path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 08:22:37 +02:00
github-actions[bot]
41eba2d436 deploy: bump sandbox-controller image to 1b0e86c 2026-05-18 06:14:36 +00:00
github-actions[bot]
2e57d76ce1 deploy: update sme service images to d681f64 + bump chart to 1.4.160 2026-05-18 06:12:09 +00:00
e3mrah
1b0e86cb1a
ci(sandbox): build workflows for controller + pty-server + mcp-server (so chart can actually deploy) (#1632)
PR #1622 shipped the sandbox-controller binary + chart, and PR #1618
shipped pty-server + mcp-server scaffolds, but neither came with CI
build workflows — meaning the chart's image.repository points at a
GHCR package that no workflow ever publishes (ImagePullBackOff on
every install). Per docs/INVIOLABLE-PRINCIPLES.md #4a every runtime
image MUST be produced by a GitHub Actions workflow from a committed
git SHA; this PR closes that gap.

Three new workflows, all event-driven (push paths-filter + PR +
workflow_dispatch, no cron):

- build-sandbox-controller.yaml — mirrors build-application-controller
  (shared core/controllers go.mod, go vet + race tests, Buildx push,
  cosign keyless sign, SBOM attest, auto-bump platform/sandbox/chart/
  values.yaml image.tag back to main so the next install picks up the
  SHA-pinned image without operator action).

- build-sandbox-pty-server.yaml — separate go module under
  products/sandbox/pty-server (own go.mod/go.sum), Dockerfile uses
  COPY . . so build context is the server directory. Same Buildx +
  cosign + SBOM flow as the controller. No values.yaml bump yet:
  Wave-2 wiring of the StatefulSet template will land in a follow-up.

- build-sandbox-mcp-server.yaml — stdlib-only stdio MCP sidecar
  (no go.sum yet), same shape as pty-server.

Per `feedback_no_mvp_no_workarounds.md` rule 1 (target-state, never
"manual follow-up bump") the controller workflow auto-bumps the chart
values.yaml so a Sovereign overlay flipping `enabled: true` Just Works.
Per the user's hard rule for this PR, no Chart.yaml bump and no
blueprint-release dispatch — the Sandbox chart's publication cadence
is gated by Wave-2 readiness, not per-image builds.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:11:28 +04:00
e3mrah
d681f64505
fix(catalyst-api): mint HS256 token on SME proxy calls (was forwarding incompatible RS256) (#1630)
PR #1625 shipped the /api/v1/sme/billing/vouchers/* proxies but the
SME gateway (core/services/gateway/proxy.go) rejects RS256 outright
— it only accepts HS256 signed with sme-secrets/JWT_SECRET. Result
on every fresh Sovereign: operator clicks on /bss/vouchers returned
silent 401 with no upstream audit trail.

This commit ships the bridge:

- core/services/shared/auth/mint_sme.go (new)
  - MintSMEAccessToken(secret, sub, email, role) → 5-min HS256 JWT
    in the wire shape billing's requireVoucherIssuer expects.
  - SMERoleFor(realmRoles, tier) → maps Keycloak roles + tier claim
    onto SME vocab (superadmin | sovereign-admin | member).
  - Pure, no IO, fully unit-tested (mint_sme_test.go).

- products/catalyst/bootstrap/api/internal/handler/sme_billing_vouchers.go
  - proxySMEVoucher now mints a fresh HS256 token per upstream hop
    from the operator's already-validated RS256 session claims and
    forwards that as Bearer to the SME gateway. RS256 header is no
    longer leaked upstream.
  - Unwired bridge (CATALYST_SME_JWT_SECRET empty) surfaces 503
    `sme-jwt-bridge-unwired` instead of the silent 401.

- products/catalyst/bootstrap/api/internal/handler/handler.go
  - h.smeJWTSecret field + SetSMEJWTSecret(secret) setter.

- products/catalyst/bootstrap/api/cmd/api/main.go
  - Reads CATALYST_SME_JWT_SECRET on startup and wires it.
  - Log line includes byte count only (never the secret value, per
    INVIOLABLE-PRINCIPLES.md #10).

- products/catalyst/chart/templates/api-deployment.yaml
  - New env CATALYST_SME_JWT_SECRET sourced from sme-secrets/JWT_SECRET
    in the same namespace (catalyst-system). optional: true so
    Sovereigns without marketplace surface a 503 rather than
    CreateContainerConfigError.

- products/catalyst/chart/templates/sme-services/sme-secrets.yaml
  - emberstack/reflector annotation block mirroring sme-secrets
    from `sme` ns into `catalyst-system` (Kubernetes secretKeyRef
    is same-namespace-only). Same pattern as cnpg-cluster.yaml
    and provisioning-github-token.yaml.

Operator-visible behaviour: the bridge is transparent on the happy
path (operator with sovereign-admin tier on a Sovereign with
marketplace enabled clicks /bss/vouchers → list returns). On the
unhappy paths the operator now sees a real status code:
  - 503 sme-jwt-bridge-unwired (chart wire missing) — actionable
  - 503 sme-gateway-unreachable (DNS NXDOMAIN) — pre-existing
  - 403 from billing's requireVoucherIssuer (role insufficient)
    — was silent 401 before, now propagates the real authz result.

Tests: core/services/shared/auth `go test ./...` PASS. catalyst-api
`go build ./...` PASS.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:11:04 +04:00
e3mrah
51b6188eb1
feat(sandbox+bootstrap-kit): newapi Sovereign install (Bank Dhofar Qwen wired for Sandbox) (#1631)
Sandbox Wave 4 retry. Slot 80 (bp-newapi) already exists in the
_template bootstrap-kit but ships the qwenBankDhofar channel
hard-coded to `enabled: false` with empty endpoint — so every
franchised Sovereign came up without an LLM channel and sandbox
agents fell back to mothership newapi, defeating per-Sovereign
sandboxing.

Wire the qwenBankDhofar channel to the same envsubst flag the
Catalyst control plane uses (`${MARKETPLACE_ENABLED:-false}`)
and default the endpoint to the canonical first-otech relay
(`https://llm-api.omtd.bankdhofar.com`) with override via
`${LLM_BANK_DHOFAR_BASE_URL}`. API key is still pulled from the
`newapi-channel-qwen-bankdhofar` Secret (cloud-init or
ExternalSecret per existing chart contract).

No chart bump — chart 1.4.6 (slot 80) already supports gating
qwenBankDhofar via .Values.defaultChannels.qwenBankDhofar.enabled
and reading endpoint/secret from those values. Only the
bootstrap-kit overlay was wired with the wrong defaults.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 10:08:43 +04:00
github-actions[bot]
48a4a86548 deploy: update catalyst images to e7b2062 2026-05-18 05:58:02 +00:00
e3mrah
e7b20620aa
fix(domain): per-tenant DNS reconciler — <slug>.<pool-domain> resolves to Sovereign LB (was mothership) (#1629)
Wire CATALYST_OTECH_INGRESS_IPV4 from the sovereign-fqdn ConfigMap key
`lbIP` so DefaultSMETenantDNSProvisioner.ProvisionFreeSubdomain (already
implemented in sme_tenant_dns.go) actually receives the Sovereign's LB
IP at run time. Without this env, ProvisionFreeSubdomain has been
returning `errors.New("otech ingress IPv4 unconfigured")` — silently —
on every Sovereign tenant signup, so the per-tenant A records for
`console|wordpress|openclaw|mail|keycloak.<slug>.<pool>` were never
PATCHed into PowerDNS, leaving the pool zone's apex/wildcard delegation
to point new tenants at the mothership IP (49.12.16.160) instead of the
correct Sovereign LB.

Same plumbing pattern as SOVEREIGN_LB_IP a few lines above (sourced from
the same ConfigMap, same key). Per-tenant approach (not a single
`*.<pool>` wildcard) is required because: (a) each tenant gets five
distinct host records, (b) the pool zone hosts records for the Sovereign
itself, so a blanket wildcard would shadow legitimate Sovereign-owned
subdomains, and (c) the reconciler is already there — only the env wire
is missing.
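
Illustration of the per-tenant record set. This is a sketch, not the
code in sme_tenant_dns.go: ARecord and TenantARecords are hypothetical
names. The five host labels and the "unconfigured" error text come from
the description above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// tenantHosts are the five per-tenant host labels named in the commit.
var tenantHosts = []string{"console", "wordpress", "openclaw", "mail", "keycloak"}

type ARecord struct {
	Name string
	IPv4 string
}

// TenantARecords builds <host>.<slug>.<pool> A records pointing at the
// Sovereign LB. It fails loudly when the ingress IP is unset or invalid.
func TenantARecords(slug, pool, lbIPv4 string) ([]ARecord, error) {
	if lbIPv4 == "" {
		return nil, errors.New("otech ingress IPv4 unconfigured")
	}
	if net.ParseIP(lbIPv4).To4() == nil {
		return nil, fmt.Errorf("invalid ingress IPv4 %q", lbIPv4)
	}
	out := make([]ARecord, 0, len(tenantHosts))
	for _, h := range tenantHosts {
		out = append(out, ARecord{Name: h + "." + slug + "." + pool, IPv4: lbIPv4})
	}
	return out, nil
}

func main() {
	// 203.0.113.10 is a documentation-range address, not a real LB IP.
	recs, err := TenantARecords("acme", "omani.homes", "203.0.113.10")
	fmt.Println(recs, err)
}
```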

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:56:02 +04:00
github-actions[bot]
ca880b0e3f deploy: update sme service images to 50a45a9 + bump chart to 1.4.159 2026-05-18 05:45:23 +00:00
e3mrah
50a45a9783
fix(billing): skip Stripe when voucher covers 100% of total (unblocks fully-paid voucher checkout) (#1628)
POST /billing/checkout was 503'ing with "payment processor is not
configured" on Sovereigns that have not pasted Stripe keys yet — even
when the customer's credit balance (from a fresh voucher redemption
in the same request, or a prior balance) fully covered the order
total. Make the credit-only short-circuit explicit: compute
`remainingOMR := totalOMR - creditBalance` and settle via
CreditOnlyCheckout when `<= 0`, BEFORE any Stripe settings probe.
This is the path that has to keep working during the voucher-only
weeks of a new Sovereign.
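
Illustration of the credit-only short-circuit. This is a sketch of the
decision only, not the billing handler: CheckoutDecision and
DecideCheckout are hypothetical names, and storing amounts as int64
baisa (1 OMR = 1000 baisa) is an assumption made to avoid float
arithmetic.

```go
package main

import "fmt"

// CheckoutDecision says whether the order settles from credit alone.
type CheckoutDecision struct {
	PaidByCredit        bool
	RemainingBaisa      int64 // still owed to the card processor
	LeftoverCreditBaisa int64 // credit left after a credit-only settlement
}

// DecideCheckout runs before any payment-processor lookup. When credit
// covers the total, the processor configuration is never consulted.
func DecideCheckout(totalBaisa, creditBaisa int64) CheckoutDecision {
	remaining := totalBaisa - creditBaisa
	if remaining <= 0 {
		return CheckoutDecision{PaidByCredit: true, LeftoverCreditBaisa: -remaining}
	}
	return CheckoutDecision{RemainingBaisa: remaining}
}

func main() {
	fmt.Printf("%+v\n", DecideCheckout(50_000, 50_000))   // voucher exactly covers the plan
	fmt.Printf("%+v\n", DecideCheckout(100_000, 200_000)) // standing balance exceeds the plan
	fmt.Printf("%+v\n", DecideCheckout(100_000, 30_000))  // card payment still needed
}
```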

Adds checkout_test.go covering two regression paths:

  - fresh-voucher path: customer with 0 credit redeems WELCOME50
    against a 50-OMR plan → 200 + paid_by_credit:true, settings table
    never probed (sqlmock asserts no unexpected queries).
  - pre-existing-credit path: customer with 200-OMR standing balance
    buys a 100-OMR plan, no promo_code in request → 200 +
    paid_by_credit:true + 100-OMR leftover credit.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:44:22 +04:00
github-actions[bot]
467964d898 deploy: update Catalyst marketplace image to 556813d 2026-05-18 05:43:19 +00:00
e3mrah
556813d0ac
fix(marketplace): post-purchase redirect to Sovereign-local console (was hardcoded to mothership) (#1627)
Previously, after a successful checkout on a Sovereign marketplace
(e.g. marketplace.t15.omantel.biz), the browser was redirected to
https://console.openova.io/nova which is the MOTHERSHIP console — so
the user was bounced off their own Sovereign and re-prompted to sign in
against the wrong identity provider. Same bug fired the "returning user"
auto-redirect in Layout.astro.

Root cause: CONSOLE_URL in core/marketplace/src/lib/config.ts and the
inline returning-user redirect in core/marketplace/src/layouts/
Layout.astro both hardcoded "https://console.openova.io/nova". The
marketplace pod is shared across mothership + every Sovereign (one
deployment, multiple ingress hostnames — see marketplace-routes.yaml
which fronts marketplace.<sov-fqdn> on Sovereigns), so a build-time
constant could never name the right console.

Fix: derive the console URL from window.location.hostname at runtime.

  - marketplace.openova.io      -> https://console.openova.io/nova   (mothership, /nova prefix preserved)
  - marketplace.<sov-fqdn>      -> https://console.<sov-fqdn>        (Sovereign — Cilium Gateway *.<sov-fqdn> wildcard route, NO /nova)
  - partner hosts + dev         -> mothership fallback                (skipConsoleRedirect tenants don't reach this path anyway)
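
Illustration of the hostname mapping above. The shipped logic lives in
TypeScript (src/lib/config.ts and Layout.astro); the Go function below
is only a sketch of the same rule, and ConsoleURLForHost is a
hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// ConsoleURLForHost derives the console URL from the marketplace hostname.
func ConsoleURLForHost(host string) string {
	const mothership = "https://console.openova.io/nova"
	const prefix = "marketplace."
	if host == "marketplace.openova.io" {
		return mothership
	}
	if strings.HasPrefix(host, prefix) && len(host) > len(prefix) {
		// Sovereign: same FQDN, console subdomain, no /nova prefix.
		return "https://console." + strings.TrimPrefix(host, prefix)
	}
	// Partner hosts and dev fall back to the mothership console.
	return mothership
}

func main() {
	for _, h := range []string{"marketplace.openova.io", "marketplace.t15.omantel.biz", "localhost"} {
		fmt.Println(h, "->", ConsoleURLForHost(h))
	}
}
```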

Implemented twice in lockstep — once in src/lib/config.ts for the
Svelte components that use consoleHref(), once inline in
src/layouts/Layout.astro because the returning-user redirect must fire
before the Svelte bundle loads.

Test: npm run build in core/marketplace clean (9 pages, 0 warnings).
Inline detector verified present in dist/checkout/index.html +
dist/index.html.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:42:42 +04:00
github-actions[bot]
1063916b73 deploy: update sme service images to 048cb2c + bump chart to 1.4.158 2026-05-18 05:34:41 +00:00
e3mrah
048cb2c3de
fix(sme): wire tenant + billing event dispatchers to NATS (was Redpanda-only, blocking convergence) (#1626)
The tenant + billing services hardcoded a franz-go Kafka publisher
pointing at REDPANDA_BROKERS. On Sovereigns there is NO Redpanda in
cluster — only NATS JetStream at
nats-jetstream.nats-system.svc.cluster.local:4222 — so every
tenant.created / tenant.deleted / order.placed event was silently
dropped, blocking provisioning + downstream consumers and stalling
the convergence chain end to end.

Per ADR-0001 §6 the canonical event bus is NATS JetStream with
subject convention `catalyst.<domain>.<event>`. This change:

  - Adds events.BrokerPublisher + events.MultiPublisher that fan out
    to NATS (subject derived from Event.Type via CanonicalSubject,
    e.g. `catalyst.tenant.created`) and the legacy Redpanda topic in
    one call. Either transport may be nil, but not both: the
    constructor refuses to build a no-op publisher (the exact
    silent-failure mode we just hit).
  - Adds NATSConn.PublishEvent so the generic Event envelope can flow
    over the same JetStream connection used for the metering
    subscriber (#798), with Event.ID as the JetStream Msg-Id for
    broker-side de-dup.
  - Updates tenant + billing main.go to read NATS_URL +
    REDPANDA_BROKERS independently, construct the appropriate
    transports, and wire MultiPublisher into the Handler. Legacy
    Kafka consumers only start when REDPANDA_BROKERS is non-empty
    so the pods no longer crashloop dialling localhost:9092 on
    Sovereigns.
  - Updates chart templates to inject NATS_URL into both tenant and
    billing Deployments. ConfigMap default for NATS_URL on Sovereigns
    is nats://nats-jetstream.nats-system.svc.cluster.local:4222
    (fixes the existing bug where defaults pointed at the wrong
    namespace `nats-jetstream` — NATS actually lives in `nats-system`
    per clusters/_template/bootstrap-kit/07-nats-jetstream.yaml).
  - Sovereign default of REDPANDA_BROKERS is now empty (was the wrong
    NATS URL stuffed into a Kafka env, which made franz-go fail every
    dial).

Subject mapping per CanonicalSubject:
  tenant.created               → catalyst.tenant.created
  tenant.deleted               → catalyst.tenant.deleted
  tenant.app_install_requested → catalyst.tenant.app_install_requested
  order.placed                 → catalyst.billing.order.placed
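
A minimal sketch of a function that reproduces the four mappings above.
The commit names CanonicalSubject but does not show how it decides that
order.placed belongs to the billing domain, so the override table here is
an assumption, as is the plain `catalyst.` prefix for unlisted types.

```go
package main

import "fmt"

// subjectOverrides holds event types whose subject is not a plain
// "catalyst." prefix. Assumed structure; only the mapping itself is
// taken from the commit message.
var subjectOverrides = map[string]string{
	"order.placed": "catalyst.billing.order.placed",
}

// CanonicalSubject returns the NATS subject for an event type.
func CanonicalSubject(eventType string) string {
	if s, ok := subjectOverrides[eventType]; ok {
		return s
	}
	return "catalyst." + eventType
}

func main() {
	for _, t := range []string{
		"tenant.created",
		"tenant.deleted",
		"tenant.app_install_requested",
		"order.placed",
	} {
		fmt.Printf("%-28s -> %s\n", t, CanonicalSubject(t))
	}
}
```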

Test:
  go build ./... in shared/, tenant/, billing/ (clean)
  go test ./events/... ./handlers/... in all three (existing + new
    bridge_test.go pass)
  helm template with global.sovereignFQDN set renders NATS_URL in
    both Deployments + REDPANDA_BROKERS="" in ConfigMap
  helm template without global.sovereignFQDN renders the legacy
    Redpanda broker (Catalyst-Zero contabo path remains intact)

NATS-side consumers for sme.tenant.events / sme.provision.events ship
in a follow-up PR per the ADR-0001 §6 migration plan; this PR only
unblocks the publish leg, which is the immediate convergence blocker.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:33:36 +04:00
e3mrah
586bf7fd2d
fix(catalyst-api): wire /api/v1/sme/billing/vouchers/{list,issue,revoke} proxy (#1625)
Wave 6 PR #1609 shipped the BSS Vouchers FE
(products/catalyst/bootstrap/ui/src/lib/bss.api.ts —
listVouchers/issueVoucher/revokeVoucher) but never added the matching
catalyst-api proxy handlers. The /bss/vouchers page could render its
target-state chrome but every voucher action 404'd at the catalyst-api
ingress because no `/api/v1/sme/billing/vouchers/*` route existed.

This PR adds the catalyst-api → SME gateway proxy in
sme_billing_vouchers.go, mirroring the sme_billing_revenue.go /
sme_catalog_client.go patterns:

- GET    /api/v1/sme/billing/vouchers/list
- POST   /api/v1/sme/billing/vouchers/issue
- POST   /api/v1/sme/billing/vouchers/revoke/{code}  (task spec)
- DELETE /api/v1/sme/billing/vouchers/revoke/{code}  (FE wire)

All four are registered inside the existing RequireSession group in
cmd/api/main.go alongside the other /api/v1/sme/* routes. Upstream is
the SME gateway at http://gateway.sme.svc.cluster.local:8080 (override
via CATALYST_SME_GATEWAY_URL per docs/INVIOLABLE-PRINCIPLES.md #4),
which strips `/api` and forwards to core/services/billing/handlers/
vouchers.go (gated by requireVoucherIssuer — superadmin OR
sovereign-admin per docs/FRANCHISE-MODEL.md §3). The handler always
forwards revoke as DELETE so the billing service's
`DELETE /billing/vouchers/revoke/{code}` route matches.

The Authorization header is forwarded verbatim; status + body stream
through unchanged so the FE's listVouchers (which throws on non-2xx)
sees the upstream's real status. On DNS NXDOMAIN the proxy returns
503 + sme-gateway-unreachable, so a Sovereign with
marketplace.enabled=false degrades gracefully with an explicit error
rather than an opaque 5xx.

No chart bump. Build clean; only pre-existing whoami/user_access
test failures remain (unrelated to this surface — confirmed by
running the same tests on origin/main without this change).

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 09:33:01 +04:00