openova

Author	SHA1	Message	Date
e3mrah	a2cbe3baa0	feat(sandbox-mcp): sandbox.auth.* + sandbox.secrets.* real impls (#1658 ) Wave 11 follow-up to PR #1653 (sandbox.db.*). Replaces the stubbed sandbox.auth.* and sandbox.secrets.* tool handlers with real implementations so agents can manage per-Sandbox Keycloak realms / OIDC clients and a per-Sandbox Secret store. sandbox.auth.* (Keycloak Admin REST via the sandbox-controller-injected admin bearer): - sandbox.auth.provisionRealm {realm_name, display_name?} POST /admin/realms — idempotent on 409 Conflict. - sandbox.auth.listClients GET /admin/realms/<sandbox-realm>/clients — friendly empty list on 404 (realm not yet provisioned). - sandbox.auth.registerClient {client_id, redirect_uris, public_client?, name?} POST /admin/realms/<sandbox-realm>/clients — idempotent on 409 Conflict, typed error on 404 (realm missing). The Sandbox's "own" realm name is deterministic (`sandbox-<org>-<id>`); the agent CANNOT pass a `realm` argument to list / register, only provisionRealm accepts a free-form name. sandbox.secrets.* (per-Sandbox K8s Secret store, base64-encoded data, encrypted at rest by kube-apiserver encryption-provider): - sandbox.secrets.read {key} — returns Found / KeyNotFound / NotFound (Secret missing) - sandbox.secrets.write {key, value} — auto-creates the Secret on first write (Added / Updated / Created) The Secret is named `sandbox-<owner-uid>-secrets` in env.SandboxNamespace and gated by openova.io/managed-by=openova-sandbox-mcp so sandbox.secrets.write CANNOT mutate the controller-injected `sandbox-tokens` Secret or any other unmanaged Secret in the ns. Auth: claims.OrgID == env.OrgID required (same as sandbox.db.*), RequiredCapability = "sandbox.auth" / "sandbox.secrets". 
New env vars (sandbox-controller injects on MCP Deployment): - SANDBOX_OWNER_UID — `sandbox-<owner-uid>-secrets` suffix - KEYCLOAK_ADMIN_URL — root of the Keycloak Admin REST API - KEYCLOAK_ADMIN_TOKEN — pre-minted admin bearer - KEYCLOAK_PARENT_REALM — default "master" No chart bump; mcp-server-only change. go build + go test clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:19:46 +04:00
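The sandbox.secrets.write behaviour described in the commit above (auto-create on first write, the Added / Updated / Created outcomes, and the managed-by guard) can be sketched roughly as follows. This is an illustrative sketch only: the `secret` struct, `writeKey`, `secretNameFor` and `errUnmanaged` are hypothetical names standing in for the real Kubernetes types and handler, which are not shown in the log.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	managedByLabel = "openova.io/managed-by"
	managedByValue = "openova-sandbox-mcp"
)

// secret is a simplified, hypothetical stand-in for a Kubernetes Secret.
type secret struct {
	Name   string
	Labels map[string]string
	Data   map[string]string
}

// errUnmanaged is a hypothetical error for Secrets lacking the managed-by label.
var errUnmanaged = errors.New("secret is not managed by openova-sandbox-mcp")

// secretNameFor builds the per-Sandbox Secret name given in the commit message.
func secretNameFor(ownerUID string) string {
	return "sandbox-" + ownerUID + "-secrets"
}

// writeKey returns the resulting Secret and "Created", "Added" or "Updated".
// A nil existing Secret models the first write.
func writeKey(existing *secret, ownerUID, key, value string) (*secret, string, error) {
	if existing == nil {
		return &secret{
			Name:   secretNameFor(ownerUID),
			Labels: map[string]string{managedByLabel: managedByValue},
			Data:   map[string]string{key: value},
		}, "Created", nil
	}
	// Refuse to touch anything not labelled as managed by the MCP server.
	if existing.Labels[managedByLabel] != managedByValue {
		return nil, "", errUnmanaged
	}
	outcome := "Added"
	if _, ok := existing.Data[key]; ok {
		outcome = "Updated"
	}
	if existing.Data == nil {
		existing.Data = map[string]string{}
	}
	existing.Data[key] = value
	return existing, outcome, nil
}

func main() {
	s, outcome, _ := writeKey(nil, "u123", "API_KEY", "v1")
	fmt.Println(s.Name, outcome)
	_, outcome, _ = writeKey(s, "u123", "API_KEY", "v2")
	fmt.Println(outcome)
}
```

In this sketch an unlabelled Secret such as `sandbox-tokens` is rejected before any mutation, which is the property the commit message relies on.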
e3mrah	d5ea7d9de6	feat(sandbox): sandbox.<sov-fqdn> public URL — DNS + cert SAN + correct parentRefs (#1657 ) The Sandbox public-URL flow (sandbox.<sov-fqdn>/sessions/<owner-uid>/) had three independent gaps that prevented PR #1641's HTTPRoute from resolving end-to-end: 1. HTTPRoute parentRefs pointed at "catalyst-public/catalyst-system/https", a Gateway that does not exist on a Sovereign. The canonical public Gateway is "cilium-gateway/kube-system" (clusters/_template/sovereign-tls/cilium-gateway.yaml), the same parent that organization-controller's tenant_route.go and the chart's httproute.yaml attach to. sectionName is omitted so the HTTPRoute auto-attaches to every listener whose hostname matches sandbox.<sov-fqdn> — the wildcard *.${SOVEREIGN_FQDN} HTTPS listener already in place per infra/hetzner/main.tf locals.parent_domains_listeners_yaml fallback path. 2. The per-name Cilium Gateway cert (clusters/_template/sovereign-tls/cilium-gateway-cert.yaml) is a SAN list, not a wildcard. Without "sandbox.<sov-fqdn>" in its dnsNames cilium-envoy serves the default fallback cert and browsers see NET::ERR_CERT_COMMON_NAME_INVALID. This file is the source of the per-zone Secret sovereign-wildcard-tls-<sov-fqdn-dashed> the Gateway listener references — adding the SAN is the only TLS-side change needed; the Gateway listener wildcard is already a hostname match. 3. The parent zone's A-record set is built from CanonicalSovereignSubdomains in products/catalyst/bootstrap/api/internal/handler/sovereign_dns_records.go. Without "sandbox" the PowerDNS PATCH never writes sandbox.<sov-fqdn> A-record → primary LB IP, and the URL resolves NXDOMAIN even when the listener + cert are healthy. 
End-to-end resolution chain after this PR: Browser → sandbox.<sov-fqdn>/sessions/<owner-uid>/ (PowerDNS A record points at primary LB IPv4) → Hetzner LB :443 → cp-node :30443 (cilium-envoy) → Gateway listener https-<sov-fqdn-dashed> on *.<sov-fqdn> matches hostname; cert SAN includes sandbox.<sov-fqdn> so TLS terminates → HTTPRoute pty-server in sandbox-<owner-uid> namespace matches hostname + /sessions/<owner-uid>/ path prefix; URLRewrite strips /sessions/<owner-uid>/ → /sessions/ → backendRef pty-server:7681 in sandbox-<owner-uid> namespace → pty-server StatefulSet (PR #1641) serves the session Hard rules respected: READ-ONLY clusters, no Chart.yaml bump (only template content + Go renderer + Go handler list), helm template + kubectl kustomize clean (verified locally), tests updated to assert the new parentRefs shape and pass under go 1.23.	2026-05-18 12:15:59 +04:00
github-actions[bot]	5309bb8c39	deploy: bump sandbox-controller image to `63255bf`	2026-05-18 08:15:56 +00:00
e3mrah	63255bf172	feat(sandbox-mcp): gitea.pr.create/merge + issue.* + k8s.read.logs (was stubs) (#1656 ) Wave 11 promotes the remaining write-surface tools from #1645's stubs to real handlers, so an agent inside a Sandbox can end-to-end open PRs, file issues, comment, merge, and pull container logs without leaving the MCP transport: - pkg/gitea: +MergePullRequest, +Issue + IssueComment types, +List/Get/Create/CommentOnIssue methods (new issues.go; pulls.go grows the merge helper). Same client envelope, same ErrRepoNotFound mapping. - mcp-server gitea.go: gitea.pr.create / gitea.pr.merge / gitea.issue.list / get / create / comment handlers + JSON Schemas. Same HS256 bearer + claims.OrgID match as #1645. - mcp-server k8s_read.go: k8s.read.logs via client-go's typed kubernetes.Interface (dynamic client doesn't expose Pods/log). Bounded fetch — follow=false, tail_lines default 200 capped at 5000, 1 MiB byte cap, 30s deadline. Long-lived streams stay on the catalyst-api WebSocket surface. - tests: +merge_issues_test.go (pkg/gitea, 11 cases) + gitea_wave11_test.go (mcp-server, 14 cases) covering happy paths, missing-arg validation, explicit merge styles, list-after-create idempotency, and the two pre-cluster guard rails on k8s.read.logs. Hard rules honoured: READ-ONLY clusters (k8s.write.* still stubbed), no chart bump, go build + go test clean. Kept stubbed: sandbox.db.*, sandbox.auth.*, gitea.release.list (Wave 12+). Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:12:41 +04:00
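The tail_lines bound on k8s.read.logs in the commit above (default 200, capped at 5000) amounts to a small clamp. The sketch below is a guess at its shape, not the handler's actual code: `clampTailLines` is a hypothetical name, and treating a zero or negative request as "not provided" is an assumption.

```go
package main

import "fmt"

const (
	defaultTailLines = 200  // default from the commit message
	maxTailLines     = 5000 // cap from the commit message
)

// clampTailLines applies the default and the cap to a requested line count.
// Assumption: a non-positive request means the caller did not set tail_lines.
func clampTailLines(requested int64) int64 {
	switch {
	case requested <= 0:
		return defaultTailLines
	case requested > maxTailLines:
		return maxTailLines
	default:
		return requested
	}
}

func main() {
	for _, n := range []int64{0, 50, 99999} {
		fmt.Println(n, "->", clampTailLines(n))
	}
}
```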
github-actions[bot]	e2e8132b00	deploy: update Catalyst marketplace image to `f3915c0`	2026-05-18 08:03:45 +00:00
e3mrah	f3915c01fa	test(marketplace): codified customer-journey regression (17 steps) (#1655 ) Codifies the 17-step marketplace customer journey (storefront → catalog → product detail → voucher → signup → subdomain pick → PIN → checkout → provisioning chain → console redirect) as a hermetic Playwright suite. Previously the journey was only walked manually by ad-hoc fix-author agents (see PR #1635 / docs/SESSION-2026-05-17-CONVERGENCE.md). This adds a regression gate so future PRs catch breakage in any of the 14 spec tests (17 step labels grouped into 14 Playwright tests — steps 12-15 are asserted as one API-chain contract since CheckoutStep redirects to console before the panel-poll UI would render). Highlights ---------- - core/marketplace/playwright.config.ts — testDir=./playwright, workers=1, baseURL from MARKETPLACE_BASE_URL (default http://localhost:4321), same posture as tests/e2e/playwright/playwright.config.ts. - core/marketplace/playwright/customer-journey.spec.ts — every backend call (/api/catalog/, /api/auth/, /api/tenant/, /api/billing/, /api/provisioning/*) intercepted via page.route() so the run is hermetic (npm run build && npm run preview is enough — no real catalyst-api / billing / provisioning service required). - Asserts the PR #1627 fix (deriveConsoleURL host-driven) — Sovereign hosts redirect to console.<sov-fqdn> (no /nova), mothership stays on console.openova.io/nova. Verification ------------ npx playwright test customer-journey → 14 passed (2.5m).	2026-05-18 12:02:39 +04:00
github-actions[bot]	18df061895	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.11	2026-05-18 08:00:46 +00:00
e3mrah	0604c5e057	fix(newapi): gate channel render on attestation present (was blocking install when accountId env empty) (#1654 ) Convergence wave 11 blocker on t16: bp-newapi HR install fails with Error: template: bp-newapi/templates/configmap.yaml:1:4: executing "bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>: channel[0] (qwen3.6-bankdhofar): commercial-contract attestation requires accountId PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping `defaultChannels.qwenBankDhofar.enabled` to true with envsubst placeholders for the attestation: attestation: kind: commercial-contract accountId: ${LLM_BANK_DHOFAR_ACCOUNT_ID:-} contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-} On a Sovereign that has not yet signed the commercial contract those variables expand to empty strings, and the chart's `assertChannelAttestation` helper hard-fails the helm template before any manifest is rendered — newapi install crashes at slot 80 and the whole bootstrap-kit reconciliation stalls. Fix (Option A — smallest change, makes the chart actually install): SKIP composing the qwenBankDhofar channel when attestation.kind=commercial-contract AND either accountId or contractRef is empty. NewAPI installs with zero default channels (operator-supplied `.Values.channels` still compose). Once the operator overlay supplies the attestation values the channel composes on the next reconcile. Touches two templates that gate on the same effective channel list: - templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a pre-check ($qbdAttReady) that short-circuits the channel composition block when attestation is incomplete. The downstream `assertChannelAttestation` helper then sees an empty channel list for the qwenBankDhofar slot and emits no error. 
- templates/channel-seed-job.yaml — mirrors the same gate so the post-install Helm hook Job + RBAC + audit ConfigMap also skip when the channel itself was skipped (otherwise the Job would POST a row whose ConfigMap entry was omitted from /etc/newapi/channels.yaml). `helm template platform/newapi/chart` renders cleanly in all three states: - default (qbd.enabled=false) → no channel, no seed Job - qbd.enabled=true + empty accountId/contractRef → no channel, no seed Job (NEW: pre-1.4.10 this hard-failed) - qbd.enabled=true + accountId + contractRef present → channel composed normally, seed Job emitted Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped 1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix. READ-ONLY clusters preserved. NO Chart.yaml bump on bp-catalyst-platform. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 12:00:06 +04:00
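The three render states listed in the commit above reduce to one predicate. The following restates that gate in Go for illustration; the real logic lives in the chart's Helm templates, and the `attestation` struct and `channelComposes` function are hypothetical names.

```go
package main

import "fmt"

// attestation mirrors the three overlay values named in the commit message.
type attestation struct {
	Kind        string
	AccountID   string
	ContractRef string
}

// channelComposes reports whether the qwenBankDhofar channel (and its seed
// Job) would be rendered under the gate described in the commit.
func channelComposes(enabled bool, att attestation) bool {
	if !enabled {
		return false
	}
	// Incomplete commercial-contract attestation: skip instead of failing.
	if att.Kind == "commercial-contract" && (att.AccountID == "" || att.ContractRef == "") {
		return false
	}
	return true
}

func main() {
	empty := attestation{Kind: "commercial-contract"}
	full := attestation{Kind: "commercial-contract", AccountID: "acct", ContractRef: "ref"}
	fmt.Println(channelComposes(false, full)) // default state
	fmt.Println(channelComposes(true, empty)) // enabled, attestation missing
	fmt.Println(channelComposes(true, full))  // enabled, attestation present
}
```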
e3mrah	d080207c32	feat(sandbox-mcp): sandbox.db.* real impl (CNPG provision/list/get/drop/dump) (#1653 ) PR #1645 (Wave 8) wired gitea.* + k8s.read.* + session.* in the MCP server but left sandbox.db.* as not_implemented stubs. This commit ships the real handlers using the same dynamic-client pattern. Tools shipped (all gated on `RequiredCapability=sandbox.db` + claim OrgID==env.OrgID, all scoped to env.SandboxNamespace): - sandbox.db.provision {name, plan?} — POSTs a CNPG Cluster CR (default plan: 1 instance, 5Gi PVC, postgres 16, db=app). Returns {host:<name>-rw.<ns>.svc.cluster.local, port:5432, dbname, user, secretName:<name>-app, secretKey:password}. - sandbox.db.list — labels-filtered LIST scoped to the Sandbox ns, returns the same connection envelope per item plus a distilled status summary (phase, readyInstances, Ready condition). - sandbox.db.get {name} — GET one Cluster; refuses to surface a Cluster lacking openova.io/managed-by=openova-sandbox-mcp (defence-in-depth against an agent fishing for per-Org pair DBs). - sandbox.db.drop {name} — DELETE with foreground propagation so the operator cascades PVC/Service/Secret cleanup before returning. Same managed-by guard as get. - sandbox.db.dump {name} — POSTs a one-shot Backup CR (`<cluster>-dump-<UTC>`). Returns the Backup name + the Cluster's configured barmanObjectStore.destinationPath so the agent can find the resulting S3 prefix without polling Backup.status. Why CNPG Cluster CRs (not a per-Sandbox shared DB): per app DB keeps tenancy / backup / restart blast-radius per-app, matches architecture §3 + §7. Cluster CRs live in the Sandbox's OWN namespace (sandbox-<owner-uid>); the agent cannot pass `namespace` — it's read from env. The MCP server never mutates the resulting Pods/PVCs/ Services — the upstream CNPG operator (bp-cnpg) owns those. Tests (sandbox_db_test.go, 9 cases incl. 
5 capability-gate sub-tests): - validation (name regex, missing name, unknown plan) - default-plan CR shape (apiVersion, kind, labels, spec.instances, storage.size, bootstrap.initdb.database, enableSuperuserAccess) - connectionFor envelope matches CNPG service-name defaults - on-demand Backup CR shape + managed-by label - requireSandboxNS guard rails (no env / empty ns / populated) - capability gate rejects bearers w/o sandbox.db - status summary surfaces phase + Ready condition only Hard rules respected: NO chart bump, no host-cluster touch — every mutation lands inside the Sandbox's own namespace via the SA the sandbox-controller already gives the MCP pod. go build + go vet + go test clean. Catalogue test updated for new `sandbox.db.get`. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:59:56 +04:00
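The connection envelope returned by sandbox.db.provision in the commit above follows directly from the naming rules it quotes. A minimal sketch, assuming a hypothetical `connInfo` struct and `connectionSketch` function (the commit mentions a `connectionFor` helper but not its signature, and the `user` field is omitted here because its value is not stated in the log):

```go
package main

import "fmt"

// connInfo is a hypothetical shape for the envelope described in the commit.
type connInfo struct {
	Host       string
	Port       int
	DBName     string
	SecretName string
	SecretKey  string
}

// connectionSketch derives the envelope from the Cluster name and namespace.
func connectionSketch(name, namespace, dbname string) connInfo {
	return connInfo{
		Host:       fmt.Sprintf("%s-rw.%s.svc.cluster.local", name, namespace),
		Port:       5432,
		DBName:     dbname,
		SecretName: name + "-app",
		SecretKey:  "password",
	}
}

func main() {
	c := connectionSketch("orders", "sandbox-u123", "app")
	fmt.Printf("%+v\n", c)
}
```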
e3mrah	7b77ebe99c	fix(bootstrap-kit): bp-sandbox slot move 61 → 19a to break harbor chicken-and-egg (#1652 ) Caught live on t16.omantel.biz convergence: bp-sandbox HR stuck Reconciling because its chart pull goes through harbor.<sov-fqdn> (post-handover cutover slot 06a Step-06 phase-1 rewrites every HelmRepository URL `oci://ghcr.io/openova-io` → `oci://harbor.<sov-fqdn>/openova-io`), but harbor.<sov-fqdn> is not reachable yet because bp-harbor itself has not reached Ready — chicken-and-egg. Same failure shape as Wave 7 #1610 with bp-hcloud-csi (REMOVED). This PR takes the cleaner long-term cousin path: rather than remove the slot, sequence it AFTER bp-harbor (slot 19) by renumbering to 19a + adding `bp-harbor` to the HR's dependsOn graph. The Sandbox MVP Wave 11 slot stays available with no manual Day-2 add-app re-introduction needed. bp-harbor itself does not hit the cycle because its chart pull goes through harbor.openova.io (the mothership-warmed proxy-cache wired into k3s registries.yaml at cloud-init time) — NOT through harbor.<sov-fqdn>. Diff: - clusters/_template/bootstrap-kit/61-bp-sandbox.yaml renamed → 19a-bp-sandbox.yaml; slot label "61" → "19a"; dependsOn adds bp-harbor; header documents the move + chicken-and-egg context. - clusters/_template/bootstrap-kit/kustomization.yaml: 19a slot inserted right after 19-harbor.yaml with the post-cutover URL rewrite rationale inline; old slot-61 entry replaced with a back-pointer comment. Verified `kubectl kustomize clusters/_template/bootstrap-kit/` renders clean: bp-sandbox HR keeps slot label, gains - name: bp-harbor in dependsOn, all other fields unchanged. No Chart.yaml bump (this is a bootstrap-kit Kustomization-only fix, not a chart change). READ-ONLY clusters. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:56:52 +04:00
github-actions[bot]	51913fe380	deploy: bump sandbox-controller image to `ad5163e`	2026-05-18 07:54:45 +00:00
e3mrah	ad5163e69a	feat(sandbox-controller): IdleScaler scales pty-server replicas to 0 after configured idle window (#1651 ) PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on every pty-server StatefulSet but no controller was reading it. This closes the loop: pty-server (products/sandbox/pty-server/): - session.Manager tracks lastActivity; Touch() called on session create/stop, WS attach/detach, every WS message in/out, resize/signal. - New GET /idle endpoint returns {lastActivityAt, activeSessions}. - Unit tests cover the endpoint shape + Touch() bump. sandbox-controller (core/controllers/sandbox/internal/idlescaler/): - New IdleScaler runnable, registered with mgr.Add() in main.go. - NeedLeaderElection=true (singleton across HA replicas). - Every 60s lists pty-server StatefulSets by label selector (app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst), constrained to `sandbox-*` namespaces in code for defence-in-depth. - For each: probes the in-cluster Service /idle endpoint, stamps the `openova.io/sandbox-last-activity-at` annotation, and patches spec.replicas=0 once now-lastActivity exceeds the per-SS `openova.io/sandbox-idle-timeout-minutes` annotation (falling back to SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30). - Probe failure with no prior annotation → skip (next tick); probe failure WITH prior annotation → still decide on stale data so a degraded probe path doesn't keep a forgotten Pod alive forever. - activeSessions > 0 keeps the Pod alive regardless of idle window. - Already-zero replicas → idempotent no-op. Chart RBAC: - ClusterRole gains apps/statefulsets get/list/watch/patch — the ONLY cluster-wide write on a non-CR resource, scoped to the controller's own managed StatefulSets via the label selector + namespace prefix. 
Tests: 9 unit tests covering active-not-idle, idle-scales-zero, active-sessions-never-scales, probe-fail-no-annotation-skips, per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op, default-URL-builder, leader-election-singleton. Approach: controller polls pty-server's /idle endpoint via cluster-DNS (smaller diff than embedding a k8s client in pty-server — pty-server keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:51:36 +04:00
github-actions[bot]	c4fa06a9f4	deploy: bump sandbox-controller image to `3a3ee74`	2026-05-18 07:46:53 +00:00
github-actions[bot]	c9fe39a20f	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.9	2026-05-18 07:44:23 +00:00
e3mrah	96d2d9bce7	fix(provisioning): set Organization.spec.tenantPublic on product-install (was empty; HTTPRoute reconciler had nothing to render) (#1650 ) PR #1644 added Organization.spec.tenantPublic + per-tenant HTTPRoute reconciler, but nothing set the field — every Org CR's TenantPublic stayed zero-value, the reconciler short-circuited at the empty ParentDomain guard, and `<slug>.omani.homes` 404'd at the Cilium Gateway. Wire the patch at the only point that knows a tenant's product is actually Ready: the provisioning service. Both the initial workflow (`provision.completed`) and the day-2 install path (`provision.app_ready`) now patch the Organization CR's spec.tenantPublic with parentDomain (from TENANT_PARENT_DOMAIN env), subdomain (= slug), backendService (canonical vcluster-synced name), port 80, and the picked product slug. Last-write-wins on subsequent installs. Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent zone flows through env, never hardcoded — every Sovereign picks its own pool zone. Empty env disables the patch entirely (legacy tenants keep working through the Sovereign-wide tenant-wildcard route). Best-effort: failures don't fail the provision. 404 on the CR is benign (legacy tenant without an Organization counterpart). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:44:00 +04:00
e3mrah	3a3ee742ec	feat(sandbox-controller): call newapi /admin/tokens/sandbox + write Secret + rotation (was placeholder) (#1643 ) Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox LLM-gateway tokens via the catalyst-api bridge handler shipped in PR #1638, replacing the Wave 1 placeholder Secret with a real LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo. Changes: - New newapi.Client (core/controllers/sandbox/internal/newapi/) — thin HTTP client for POST /admin/tokens/sandbox with the bridge's {org_id, user_id, sandbox_id, allowed_channels} body + Bearer ADMIN_SECRET auth. Interface so tests can stub. - Reconciler extended: * NewAPIClient + DefaultChannels + TokenRotationLeadTime fields * On every reconcile: decide mint-or-skip from annotation openova.io/sandbox-token-expires-at vs. now + lead-time * On mint: POST to bridge, stamp expires-at + rotated-at annotations on the CR, render token bytes into a new gitops manifest secret-newapi-token.yaml committed to the per-Org catalyst-tenant repo at sandbox/<owner-uid>/ * Bridge failure → Failed/TokenMintFailed condition + 30s requeue + no gitops writes (fail-loud) * Empty DefaultChannels → NoAllowedChannels condition (fail earlier than the bridge's 400) - gitops.Render: * New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt /NewAPITokenRotatedAt fields * New secret-newapi-token.yaml template — Secret with stringData.LLM_GATEWAY_TOKEN + expires-at annotation + optional kubectl.kubernetes.io/restartedAt rotation marker so Wave 2's pty-server StatefulSet picks up rolling restarts on token rotation * kustomization.yaml appends the new manifest when token present - Chart wiring (platform/sandbox/chart): * Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET (secretKeyRef from newapi-bp-newapi-token-signing-key, optional: true), NEWAPI_DEFAULT_CHANNELS * ClusterRole bumped to allow update/patch on the sandboxes/ resource (the controller now stamps annotations on the CR) - 
platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml: * Added emberstack/reflector annotations so the chart-emitted Secret (newapi namespace) mirrors into the sandbox-controller namespace by default; reflectorNamespaces is overrideable. Tests: - newapi client: happy-path round-trip, 401 surfaces, input validation, request validation. 4 cases. - sandbox-controller: existing Wave 1 cases (happy/idempotent/ drift/missing) still pass; 5 new cases for the token path: fresh mint + Secret render, rotation on near-expiry, steady- state no-mint, bridge failure surfaces condition, no-channels misconfig fails early. 9 cases total, all green. Hard rules honored: - No Chart.yaml bump (chart pinning is a release-driver concern) - go build + go test ./core/controllers/sandbox/... clean Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:43:50 +04:00
e3mrah	8f4b34edd3	test(sandbox-ui): Playwright e2e for landing + settings + session nav (#1649 ) Wave 9 regression gate for the Sandbox UI scaffold shipped in PR #1621. Covers four happy-path surfaces: - Sidebar Sandbox entry exists + accent-active class on /sandbox - Landing renders 6 agent cards (aider / claude-code / cursor-agent / little-coder / opencode / qwen-code) with Connect Claude Max CTA - /sandbox/settings BYOS Connect button when disconnected - /sandbox/$id route resolves + create POST sends agent=aider Auth gate, deployment self-discovery, SSE events, and sandbox API are all mocked via page.route so the spec runs against `npm run dev` (Vite on :5173) with no catalyst-api required. Per-test timeout bumped to 90s to absorb Vite's cold-cache xterm/tanstack-router module load. Sovereign-mode env vars required for SovereignSidebar to render: `VITE_CATALYST_MODE=sovereign VITE_SOVEREIGN_FQDN=sandbox.example.test npm run dev` Local result: 4/4 passed in 2.1m (warm cache). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:41:25 +04:00
github-actions[bot]	2fee03f7d2	deploy: bump sandbox-controller image to `c0020d9`	2026-05-18 07:40:02 +00:00
e3mrah	c0020d9c33	feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs) (#1645 ) * feat(pkg/gitea): add ListPullRequests + GetPullRequest read API Wave 8 prerequisite for openova-sandbox-mcp's gitea.pr.list + gitea.pr.get tools. Mirrors the existing client surface (CreatePullRequest, ListOrgRepos) with state-filtered pagination and a get-by-number fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs) Wave 8 swaps the openova-sandbox-mcp Wave-2 not_implemented stubs for production-ready handlers on: - gitea.repo.list / gitea.repo.get (delegates to core/controllers/pkg/gitea) - gitea.pr.list / gitea.pr.get (delegates to new ListPullRequests + GetPullRequest helpers in pkg/gitea; org-scope check rejects cross-tenant owner overrides at tool dispatch time) - k8s.read.get / k8s.read.list / k8s.read.watch (dynamic.Interface against the Sandbox pod's in-cluster SA or SANDBOX_KUBECONFIG; watch is a bounded short-watch — long-lived subs land Wave 9 via MCP resources/subscribe) - sandbox.session.whoami / sandbox.session.info (echo per-call Claims + Sandbox metadata so the agent can self-discover its scope) Auth: every tools/call carries a bearer (via _auth.token arg OR SANDBOX_TOKEN env). main.go validates HS256 against SANDBOX_JWT_SECRET using the canonical core/services/shared/auth.Claims shape (PR #1619), strips _auth from the args, installs Claims on ctx, then Registry.Call gates on capability + org_id-match before reaching the handler. sandbox.session.* skips the org-scope check (the operator's session is the operator's regardless of which Org slug their claim carries). 
Stubs retained (Wave 8+): - sandbox.db.* (CNPG Cluster CR provisioning) - sandbox.auth.* (Keycloak realm/client management) - gitea.pr.create / gitea.pr.merge / gitea.issue.* / gitea.release.* - k8s.read.logs Hard rule preserved: k8s.write.* never lands in the MCP surface. 24 new tests (registry catalogue completeness, auth gate, gitea via httptest stub, JWT round-trip, env-var parsing). Builds clean against go 1.23 + k8s.io/client-go v0.31.1; module wires core/controllers + core/services/shared via the same replace pattern catalyst-bootstrap and every sme-service already use. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:36:53 +04:00
github-actions[bot]	c6820e3d4a	deploy: bump sandbox-controller image to `9f6354f`	2026-05-18 07:33:12 +00:00
e3mrah	a8a56a25f6	fix(org-controller): render per-tenant HTTPRoute so <slug>.omani.homes serves traffic (#1644 ) PowerDNS now resolves <slug>.<parentDomain> for every Org mapped onto a Sovereign's role=sme-pool parent domain (PR #1629), but no HTTPRoute was attaching that hostname to the tenant's installed product Service. The Cilium Gateway terminated TLS on the wildcard cert and fell through to the marketplace tenant-wildcard route — serving the storefront landing page instead of the tenant's WordPress / Nextcloud / GitLab install. Fix: 1. Extend Organization CRD with optional spec.tenantPublic (parentDomain, subdomain, backendService, backendPort, product). 2. organization-controller renders a Gateway-API HTTPRoute in the Org namespace (= slug) attached to cilium-gateway/kube-system when parentDomain is set. Skipped silently when unset so existing Orgs keep working. 3. Chart-side templates/sme-services/tenant-public-routes.yaml renders the same HTTPRoute shape from .Values.tenantRoutes[] for operators that prefer static fixtures over the controller's reconcile loop. 4. Tests: TestReconcile_TenantPublic_RendersHTTPRoute and TestReconcile_TenantPublic_DisabledByDefault cover both paths. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:32:54 +04:00
e3mrah	8888d9edd1	feat(catalog+billing): Sandbox Free/Pro/Ent plans + quota wire (was no plans = broken checkout) (#1642 ) PR #1633 added the Sandbox app to seedApps but never wired the matching plan rows. The marketplace checkout hit "plan_id not found" the moment a customer picked Sandbox, and PR #1639's sandbox-orchestrator could only mint CRs with the Wave 1 baseline quota regardless of the picked tier. This PR closes both gaps in lockstep: Catalog: - Plan struct gets ProductSlug + IncludedQuotas fields (back-compat: omitempty BSON tags so legacy rows decode fine). - expectedSandboxPlans() helper canonical-defines the three tiers: sandbox-free 0 OMR 1 session, 1 agent, 5 GB, BYOS sandbox-pro 9 OMR 3 sessions, 6 agents, 50 GB, BYOS (Popular) sandbox-ent 49 OMR unlimited, 6 agents, 500 GB, BYOS - seedAllData appends them on fresh seed; seedMissingSandboxPlans backfills them on already-populated Sovereigns (idempotent GET-then- create, patches missing ProductSlug/IncludedQuotas on legacy rows). - UpdatePlan persists the two new fields. Sandbox orchestrator wiring: - SandboxRequestedPayload.PlanID added; CreateOrg forwards body.PlanID. - buildSandbox stamps openova.io/plan-id annotation + spec.planId when PlanID is non-empty. - quotaForPlan() maps sandbox-{free,pro,ent} → SandboxQuota; empty or unknown plan_id falls through to DefaultQuota (Wave 1 baseline = Sandbox Free shape). Hard-coded map mirrors catalog IncludedQuotas so tenant-service avoids a compile-time dep on the catalog mongo stack. Tests: - TestExpectedSandboxPlans_Shape locks slugs, prices, quota keys, the Popular flag (sandbox-pro), and the quota ladder. - TestSandboxHandle_PlanIDStampsAnnotationAndQuota table-test exercises all three tiers end-to-end (annotation + spec.planId + spec.quota). - TestSandboxHandle_PlanIDEmptyKeepsDefaultQuota guards back-compat with pre-PR publishers. - TestSandboxHandle_PlanIDUnknownFallsBackToDefault guards typo'd / retired plan IDs. 
go build + go test clean for catalog, tenant, billing, provisioning, shared, marketplace-api. No Chart.yaml bump, no cluster touch. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:31:25 +04:00
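The plan-to-quota mapping in the commit above (three tiers, with empty or unknown plan IDs falling back to the Sandbox Free shape) could look roughly like this. The `quota` struct, its `Agents` and `StorageGB` field names, the `quotaForPlanSketch` function and the use of -1 for "unlimited" are assumptions made for illustration; only the tier values come from the commit message.

```go
package main

import "fmt"

// unlimited is a hypothetical sentinel for the Ent tier's session count.
const unlimited = -1

// quota is a simplified stand-in for SandboxQuota (field names assumed).
type quota struct {
	ConcurrentSessions int
	Agents             int
	StorageGB          int
}

var planQuotas = map[string]quota{
	"sandbox-free": {ConcurrentSessions: 1, Agents: 1, StorageGB: 5},
	"sandbox-pro":  {ConcurrentSessions: 3, Agents: 6, StorageGB: 50},
	"sandbox-ent":  {ConcurrentSessions: unlimited, Agents: 6, StorageGB: 500},
}

// quotaForPlanSketch maps a plan ID to its quota; empty or unknown IDs fall
// back to the default, which the commit describes as the Sandbox Free shape.
func quotaForPlanSketch(planID string) quota {
	if q, ok := planQuotas[planID]; ok {
		return q
	}
	return planQuotas["sandbox-free"]
}

func main() {
	for _, id := range []string{"sandbox-pro", "", "sandbox-typo"} {
		fmt.Printf("%q -> %+v\n", id, quotaForPlanSketch(id))
	}
}
```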
e3mrah	9f6354f1e1	feat(sandbox): controller spawns pty-server + MCP Pods (was just namespace+RBAC+PVCs) (#1641 ) Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC + PVCs + placeholder Secret — but NO pty-server, NO MCP server. A freshly- created Sandbox sat there with empty plumbing and no way for the user to actually run a coding session. This PR completes the per-Sandbox runtime by extending core/controllers/sandbox/internal/gitops/manifests.go to render the four manifests architecture.md §7 enumerates: - StatefulSet pty-server (replicas = spec.quota.concurrentSessions, one Pod per in-flight session per architecture.md §1/§2). Env wired per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID, SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL, LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret (key llm-gateway-token, optional). When claude-code is in spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key access_token, optional) per claude-code-byos.md §3. Repo PVCs mount at /workspace/<repo-slug>. - Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP server, talks to pty-server via the in-namespace ClusterIP Service. - Service pty-server (ClusterIP :7681) — backend for both the MCP Deployment and the HTTPRoute. - HTTPRoute pty-server — publishes sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via the existing catalyst-public Cilium Gateway in catalyst-system. PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees its own /sessions/<id> surface. Knobs are env-plumbed from the chart per Inviolable Principle #4: - SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm render fast on empty, no silent :latest). 
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit overlay derives it from ${SOVEREIGN_FQDN}). - SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX / SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc defaults. Idle timeout (architecture.md §7) lands as a StatefulSet annotation openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually scales the StatefulSet down on idle ships in a sibling PR (out of scope for "spawn the Pods"; this PR makes the Pods exist). Tests cover the full Wave-8 manifest shape: replicas count, identity env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname binding, kustomization stitching, idempotency. go test ./core/controllers/sandbox/... green; helm template renders cleanly + required guard fires on missing runtime values. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:30:00 +04:00
e3mrah	422da46360	fix(sovereign-tls): cilium-gateway listeners per parentZone (#1640) Issue #831 follow-on to #827. Previously the Cilium Gateway declared a single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under non-primary parent zones (e.g. wp-foo.omani.homes when the operator brings omani.homes as the SME pool) hit cilium-envoy's default fallback cert and failed the TLS handshake with a certificate mismatch. The per-zone wildcard Secret rendered by products/catalyst/chart/templates/sovereign-wildcard-certs.yaml (PR #827) existed but had no Gateway listener claiming its hostname. Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent zone. Materialised at Terraform plan time as a JSON-flow array (infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode of the listener objects iterating decoded parent_domains_yaml), threaded through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}` in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay in lockstep. Scalar placeholder (not multi-line block) because kustomize-build parses the YAML before Flux runs envsubst — a placeholder on its own line at column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then swaps it for the JSON-flow array string, which the apiserver parses as the real listener list. Single-zone fallback preserved (var.parent_domains_yaml empty → [{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions (e.g. primary omani.works + sme-pool omani.homes) render 4 listeners. Verification: - kubectl kustomize clusters/_template/sovereign-tls/ → clean - End-to-end simulation (single-zone, two-zone) renders correct listener counts (2 / 4) with correct certificateRefs per zone. 
- Listener naming `https-<sanitised>` / `http-<sanitised>` is unique per listener so Gateway controller programs them all (duplicate names produce Conflicting status condition). Files: - clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar listeners placeholder + comment block explaining the why) - infra/hetzner/main.tf (locals.parent_domains_decoded + locals.parent_domains_listeners_yaml; threaded into primary CP and secondary regions' templatefile() calls) - infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML substitute var in sovereign-tls Kustomization block) Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:09:26 +04:00
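The per-zone listener rendering described in #1640 can be sketched in Go as follows. This is only an illustration: the real rendering happens in Terraform (`jsonencode` in infra/hetzner/main.tf), the `sanitise` rule (dots to dashes) is an assumption, and the struct and function names are hypothetical. The field names follow the Gateway API listener shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type certRef struct {
	Name string `json:"name"`
}

type tlsConfig struct {
	Mode            string    `json:"mode"`
	CertificateRefs []certRef `json:"certificateRefs"`
}

type listener struct {
	Name     string     `json:"name"`
	Hostname string     `json:"hostname"`
	Port     int        `json:"port"`
	Protocol string     `json:"protocol"`
	TLS      *tlsConfig `json:"tls,omitempty"`
}

// sanitise turns a DNS zone into a name-safe token. Replacing dots with
// dashes is an assumption; the commit does not spell out the exact rule.
func sanitise(zone string) string {
	return strings.ReplaceAll(strings.ToLower(zone), ".", "-")
}

// listenersFor renders one HTTPS + HTTP listener pair per parent zone, with
// the HTTPS listener pointing at that zone's wildcard Secret.
func listenersFor(zones []string) []listener {
	out := make([]listener, 0, 2*len(zones))
	for _, z := range zones {
		s := sanitise(z)
		out = append(out,
			listener{
				Name: "https-" + s, Hostname: "*." + z, Port: 30443, Protocol: "HTTPS",
				TLS: &tlsConfig{
					Mode:            "Terminate",
					CertificateRefs: []certRef{{Name: "sovereign-wildcard-tls-" + s}},
				},
			},
			listener{Name: "http-" + s, Hostname: "*." + z, Port: 30080, Protocol: "HTTP"},
		)
	}
	return out
}

func main() {
	ls := listenersFor([]string{"omani.works", "omani.homes"})
	raw, err := json.Marshal(ls)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(ls)) // two zones render four listeners
	fmt.Println(string(raw))
}
```

The JSON output is a flow-style array on a single line, which is why it can be substituted into a scalar YAML placeholder.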
e3mrah	4c83d98765	feat(sandbox): orchestrator listens tenant.sandbox_requested → Sandbox CR materialisation (#1639 ) PR #1633 wired CreateOrg to publish `tenant.sandbox_requested` when the marketplace cart includes the sandbox product. Nobody was subscribing — the event landed in NATS `catalyst.tenant.sandbox_requested` and aged out unread, so no Sandbox CR (PR #1622) was ever minted and the customer sat on a "Provisioning…" spinner forever. This slice closes the loop. A new SandboxOrchestrator in tenant-service: - Subscribes via events.MultiSubscriber (PR #1636) to the canonical NATS subject + legacy Kafka topic. - Parses {tenant_id, org_slug, owner_id, owner_email, agents, sovereign, requested_at} and resolves the owner email (event field → store.GetMemberEmail → owner_id fallback). - Materialises a Sandbox CR in catalyst-system (SANDBOX_NAMESPACE override) via a dynamic client, with spec per architecture §7: owner.email + owner.orgRef.slug, default quota (4 CPU / 8 Gi / 50 Gi / 3 sessions), spec.agentCatalogue from the cart. - Idempotent: Get-then-Create with AlreadyExists swallowed so NATS redeliveries + duplicate marketplace submits stay no-ops; the sandbox-controller remains SoR for spec mutations. Wiring in main.go is best-effort — when no in-cluster config nor KUBECONFIG is available (CI / dev loops) the orchestrator is skipped with a Warn; the rest of the tenant service still boots. Hard rules: no chart bump, no cluster writes outside of the Sandbox Create call (sandbox-controller reconciles the rest), `go build ./...` clean, `go test ./...` clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:09:22 +04:00
github-actions[bot]	22851d980d	deploy: bump bp-newapi upstream v0.13.2 chart 1.4.8	2026-05-18 07:03:09 +00:00
e3mrah	4abd156fee	feat(newapi): real /admin/tokens/sandbox mint impl (was stub from #1619 ) (#1638 ) Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a real HS256 mint flow the sandbox-controller can call when it rolls out a fresh Sandbox Pod. Handler (platform/newapi/internal/handler/sandbox_token.go): - Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET), constant-time compared. 401 on mismatch / missing bearer. - Request body: {org_id, user_id, sandbox_id, allowed_channels[]}. De-duplicates + scrubs empty channel names so a controller bug sending [""] can't mint a token that NewAPI silently treats as "no restriction". - Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape: {sub: sandbox_id, org: org_id, user: user_id, channels: [...], iat, exp: iat+7d, typ: "sandbox"}. - Returns {token, expires_at}. - Refuses with 503 when SigningKey or AdminSecret is unset (visible chart-wiring gap, not a forgeable-token leak). - Removes the previous Claims/jwt.Parse PAT-validation path that came with the stub — caller is the controller, not an operator. - NewHandlerFromEnv() factory loads + validates env at process start so catalyst-api can fail loudly instead of shipping the endpoint silently. Unit tests (sandbox_token_test.go) — 11 cases: - happy path (mint + claim shape + signature round-trip) - de-dup + empty-channel scrub - admin-secret mismatch / missing bearer → 401 - missing org_id / user_id / sandbox_id / empty channels → 400 - non-POST → 405 - unset env → 503 - mintSandboxToken empty-secret guard + round-trip - response does not echo admin secret or signing key Chart wiring (platform/newapi/chart): - New Secret template sandbox-token-signing-key-secret.yaml auto-renders with Helm `lookup` + helm.sh/resource-policy: keep (same load-bearing pattern as credentials-secret.yaml #943 and gitea admin-secret.yaml #830 Bug 2). 
64-char alphanumeric values for both SIGNING_KEY and ADMIN_SECRET; persistence across reconciles is required because a reconcile-time rotation would silently invalidate every per-Sandbox token across the Sovereign AND break the sandbox-controller's auth path until its Pod restarts. - values.yaml block sandboxTokenSigningKey.{existingSecret, autoProvision, autoSecretName} matching the `credentials` convention (operator override > auto-provision > skip-render). - No Chart.yaml bump — chart value addition only. Verification: - go build ./platform/newapi/internal/handler/... — clean - go test ./platform/newapi/internal/handler/... — 11/11 PASS - helm template platform/newapi/chart — Secret renders How sandbox-controller will use it: 1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key. 2. POST /admin/tokens/sandbox with bearer + body {org_id: <Sandbox.spec.owner.orgRef.slug>, user_id: <Sandbox.spec.owner.email>, sandbox_id: <Sandbox.metadata.uid>, allowed_channels: ["qwen3.6-bankdhofar"]}. 3. Write returned token into Secret/sandbox-<uid>-newapi-token. 4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 11:02:40 +04:00
e3mrah	401ab6713a	feat(catalyst-api): /api/v1/sandbox/sessions CRUD + sandboxes GVR in k8sCache + cutover-driver RBAC (#1637 ) Wires the catalyst-api backend the Sandbox FE (PR #1621 — getSandboxes / createSandbox / getByosStatus in sandbox.api.ts) has been calling into. Without this handler the /sandbox surface on the Sovereign Console rendered its empty state forever — every getSandboxes() 404'd at the catalyst-api ingress and every "Start a session" click hit the same wall. Handler — products/catalyst/bootstrap/api/internal/handler/sandbox_sessions.go - GET /api/v1/sandbox/sessions — list Sandbox CRs in the operator's Org namespace - POST /api/v1/sandbox/sessions — create Sandbox CR with agent validated against the 6-agent catalogue (aider / claude-code / cursor-agent / little-coder / opencode / qwen-code) - GET /api/v1/sandbox/sessions/{id} — fetch single Sandbox detail - DELETE /api/v1/sandbox/sessions/{id} — graceful delete (the controller fires finalizers + cleans up the per-Sandbox vcluster namespace + PVCs + RBAC) Client resolution mirrors the Family E compliance + k8s_resource_actions.go seam: k8sCache.Factory.DynamicClientFor(resolveChrootClusterID("")) is the primary path; sovereignDepsFor() — rest.InClusterConfig() — is the chroot in-cluster fallback per feedback_chroot_in_cluster_fallback.md. Both 503 when unavailable so the FE renders its "API pending" pill rather than a spinner. Org-scoping uses claims.Org (the org_id Keycloak claim PR #1619 lit up) for the CR namespace + spec.owner.orgRef.slug. Single-tenant chroots without an org_id fall back through CATALYST_SANDBOX_DEFAULT_NAMESPACE to a sensible default per docs/INVIOLABLE-PRINCIPLES.md #4. Wave-1 quota defaults (4 CPU / 8Gi memory / 50Gi storage / 3 concurrent sessions) mirror products/sandbox/docs/architecture.md §7 — the FE doesn't yet expose a quota picker. 
Status projection: CRD vocabulary (Pending|Provisioning|Ready|Failed) maps to FE vocabulary (pending|running|stopped|failed|unknown) in mapSandboxStatus so a fresh Sandbox shows the spinner rather than "unknown" until the controller catches up. k8sCache.DefaultKinds — products/catalyst/bootstrap/api/internal/k8scache/kinds.go - Adds sandbox.openova.io/v1 Sandbox so the generic /k8s/{kind} surface enumerates Sandboxes the same way it does Applications + UserAccess. Per feedback_chroot_in_cluster_fallback.md every new GVR here needs a matching rule on the cutover-driver SA. Cutover-driver RBAC — products/catalyst/chart/templates/clusterrole-cutover-driver.yaml - Adds sandboxes.sandbox.openova.io with verbs split per feedback_rbac_create_no_resourcenames.md: rule 1: ["create"] rule 2: ["get","list","watch","delete"] - Read-only on status (the controller owns status); write is spec-only on POST + the apiserver delete on DELETE. Routes — products/catalyst/bootstrap/api/cmd/api/main.go - Registered inside the RequireSession group alongside the existing /api/v1/sandbox/byos/claude-code/* surface; same auth gate, same patternless leading "/api/v1/sandbox/...". Verified: go build clean, go vet clean, k8scache test suite green (2.7s), helm template renders the new RBAC block. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:45:05 +04:00
github-actions[bot]	0ad7879013	deploy: update sme service images to `72f82ea` + bump chart to 1.4.162	2026-05-18 06:33:51 +00:00
e3mrah	72f82ea7f2	fix(sme): wire provisioning/notification/domain consumers to NATS (was Kafka-only, was silent-dropping every tenant.created event) (#1636 ) PR #1626 wired the PUBLISH leg of tenant + billing to NATS via events.MultiPublisher (canonical subject `catalyst.<event.Type>` per ADR-0001 §6). The CONSUME leg stayed Kafka-only — provisioning, notification, domain, billing's tenant-events cascade, AND tenant's own provision-events + members-cleanup consumers all called events.NewConsumer(redpandaBrokers, …). On Sovereigns REDPANDA_BROKERS is empty by design (no Redpanda exists; NATS is the canonical bus per the convergence-fix block in configmap.yaml) so those consumers either never started OR dialed `localhost:9092` in a hot crash loop. Net effect on every Sovereign install pre-this-PR: 1. alice POSTs /sme/tenants → tenant publishes catalyst.tenant.created to NATS (PR #1626). 2. provisioning's only subscriber was Kafka-only → silent drop. 3. No Organization CR ever spawned → no vCluster → CONVERGENCE BROKEN. This change introduces a symmetric subscribe-side abstraction mirroring bridge.go's MultiPublisher: - events.BrokerSubscriber: unified Subscribe(ctx, handler) interface, satisfied by Consumer, DLQSubscriber, MultiSubscriber. - events.MultiSubscriber: fans in from NATS JetStream durable consumers (one per canonical subject) + an optional legacy Kafka Consumer. NewMultiSubscriber refuses to construct with both legs nil (the silent-no-op pattern this PR exists to prevent). - events.NATSConn.ensureSMEStream: idempotently creates the CATALYST_SME Stream filtering `catalyst.>` so the first consumer on a fresh Sovereign bootstraps lifecycle. Each service's main.go now constructs a MultiSubscriber and passes it to the consumer dispatch loop. 
Consumer signatures take events.BrokerSubscriber instead of events.Consumer (interface upcast, so events.Consumer call sites keep working on Catalyst-Zero): - provisioning: tenant.created / tenant.deleted / tenant.app_install_requested / tenant.app_uninstall_requested / order.placed (the 5 subjects PR #1626 publishes to NATS). Also wires MultiPublisher so provision.* publishes hit NATS too — downstream tenant + notification consumers need them. - notification: full fan-in (user.login, order.placed, payment.received, provision.*, domain.*, member.invited). - domain: tenant.deleted (subdomain + BYOD reclamation cascade). - billing: tenant.deleted (Stripe sub-cancel + invoice void + ledger marker cascade). Existing metering NATS subscriber unaffected. - tenant: provision.* + tenant.deleted (members cleanup). Now reachable on Sovereigns; pre-this-PR they were inside the `if redpandaBrokersRaw != ""` block. Chart wiring: NATS_URL env added to provisioning, notification, and domain Deployments (tenant + billing already wired via PR #1626). notification.yaml also flips its hardcoded REDPANDA_BROKERS literal to the shared ConfigMap key so the per-topology default (empty on Sovereigns, talentmesh redpanda on Catalyst-Zero) applies. Verification: - go build ./core/services/{shared,tenant,billing,provisioning, notification,domain}/... clean. - go test ./... clean across all 6 modules. - helm template with global.sovereignFQDN=test.example.com renders NATS_URL="nats://nats-jetstream.nats-system.svc.cluster.local:4222" into all 5 Deployments + ConfigMap. - helm template without sovereignFQDN renders NATS_URL="" and REDPANDA_BROKERS=talentmesh redpanda, matching Catalyst-Zero. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:32:49 +04:00
e3mrah	62c5620741	docs: session 2026-05-17/18 convergence report + DoD D32-D35 + Sandbox status update (#1635 ) - New docs/SESSION-2026-05-17-CONVERGENCE.md narrative session report covering the 22 user-facing PRs (#1597-#1632) across 9 waves: founder bug families, BSS iframe-seam removal, bp-hcloud-csi removal, CloudPage TS hotfix, Sandbox W1-W5 scaffold, and 9 convergence-cleanup fixes. - SOVEREIGN-MULTI-REGION-DOD.md extended D31 -> D35: Sandbox CRD installable (D32), Sandbox agent catalogue picker (D33), newapi Sovereign-side LLM gateway (D34), NATS broker round-trip publish+consume (D35). - products/sandbox/README.md flips Status from "Design. Not yet implemented." to "Wave 1-5 implementation in flight (PRs #1615/#1618/#1619/#1621/#1622/#1632 merged; runtime smoke pending fresh prov)". Adds founder TODO to register Anthropic OAuth client_id per claude-code-byos.md. No code, chart, or test changes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:28:11 +04:00
e3mrah	9690ff8351	feat(sandbox+bootstrap-kit): slot 61 bp-sandbox HR (deploys sandbox-controller on Sovereigns, gated SANDBOX_ENABLED) (#1634 ) Wires PR #1622's platform/sandbox/chart/ into bootstrap-kit so that sandbox-controller actually deploys on Sovereigns. Without this slot, the chart ships but no HelmRelease installs it — Sandbox CRs sit unhandled. - NEW clusters/_template/bootstrap-kit/61-bp-sandbox.yaml — HelmRepository + HelmRelease for the `sandbox` chart (name comes from platform/sandbox/chart/Chart.yaml `name: sandbox`). - dependsOn: bp-vcluster-helmrepo (slot 60, Wave 2 per-Sandbox vCluster source), bp-catalyst-platform (slot 13, catalyst-system Namespace + catalyst-gitea-token Secret). - targetNamespace: catalyst-system (where the controller lives). - values.enabled gated default-OFF via ${SANDBOX_ENABLED:-false} (matches platform/sandbox/chart/values.yaml `enabled: false`). - env.hostCluster + env.sovereignFQDN fed from canonical SOVEREIGN_REGION_CANONICAL_LABEL + SOVEREIGN_FQDN substitutes. - MODIFY kustomization.yaml — register 61-bp-sandbox.yaml after slot 60. - MODIFY scripts/expected-bootstrap-deps.yaml — declare slot 61 with depends_on=[bp-vcluster-helmrepo, bp-catalyst-platform]; validator reports drift=0/cycles=0. NO chart Chart.yaml bump (Wave 1 chart stays at 0.1.0). `helm template` + `kubectl kustomize` render clean. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:26:18 +04:00
github-actions[bot]	ced295726a	deploy: update sme service images to `4bad2a3` + bump chart to 1.4.161	2026-05-18 06:24:41 +00:00
github-actions[bot]	77083dbcd5	deploy: update Catalyst marketplace image to `4bad2a3`	2026-05-18 06:24:18 +00:00
e3mrah	4bad2a3cea	Merge pull request #1633 from openova-io/sandbox-wave4-marketplace-catalog-entry feat(sandbox): Wave 4 — marketplace catalog entry (customer can pick Sandbox alongside WordPress)	2026-05-18 10:23:36 +04:00
Emrah Baysal	b8b80973de	feat(sandbox): Wave 4 — marketplace catalog entry (customer can pick Sandbox alongside WordPress) Adds the Sandbox product to the marketplace storefront so a customer picks it off marketplace.<sov>/apps the same way they pick WordPress / Nextcloud. Card chrome is the existing .app-card shape verbatim — no new components per the design-system inheritance rule. The detail page gains a 6-agent picker (aider, claude-code, cursor-agent, little-coder, opencode, qwen-code) using the existing .related-card chrome with a picked state mirroring .app-card.in-cart. Picks land on cart.agents and travel through checkout into the tenant create-org payload. Tenant-service emits a sibling `tenant.sandbox_requested` event on sme.tenant.events when the cart contains the sandbox product. The event carries org slug + owner + agents list, sufficient for the sandbox-controller (or its upstream orchestrator) to mint a Sandbox CR with matching spec.agentCatalogue. The Organization CR creation path is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 08:22:37 +02:00
github-actions[bot]	41eba2d436	deploy: bump sandbox-controller image to `1b0e86c`	2026-05-18 06:14:36 +00:00
github-actions[bot]	2e57d76ce1	deploy: update sme service images to `d681f64` + bump chart to 1.4.160	2026-05-18 06:12:09 +00:00
e3mrah	1b0e86cb1a	ci(sandbox): build workflows for controller + pty-server + mcp-server (so chart can actually deploy) (#1632 ) PR #1622 shipped the sandbox-controller binary + chart, and PR #1618 shipped pty-server + mcp-server scaffolds, but neither came with CI build workflows — meaning the chart's image.repository points at a GHCR package that no workflow ever publishes (ImagePullBackOff on every install). Per docs/INVIOLABLE-PRINCIPLES.md #4a every runtime image MUST be produced by a GitHub Actions workflow from a committed git SHA; this PR closes that gap. Three new workflows, all event-driven (push paths-filter + PR + workflow_dispatch, no cron): - build-sandbox-controller.yaml — mirrors build-application-controller (shared core/controllers go.mod, go vet + race tests, Buildx push, cosign keyless sign, SBOM attest, auto-bump platform/sandbox/chart/ values.yaml image.tag back to main so the next install picks up the SHA-pinned image without operator action). - build-sandbox-pty-server.yaml — separate go module under products/sandbox/pty-server (own go.mod/go.sum), Dockerfile uses COPY . . so build context is the server directory. Same Buildx + cosign + SBOM flow as the controller. No values.yaml bump yet: Wave-2 wiring of the StatefulSet template will land in a follow-up. - build-sandbox-mcp-server.yaml — stdlib-only stdio MCP sidecar (no go.sum yet), same shape as pty-server. Per `feedback_no_mvp_no_workarounds.md` rule 1 (target-state, never "manual follow-up bump") the controller workflow auto-bumps the chart values.yaml so a Sovereign overlay flipping `enabled: true` Just Works. Per the user's hard rule for this PR, no Chart.yaml bump and no blueprint-release dispatch — the Sandbox chart's publication cadence is gated by Wave-2 readiness, not per-image builds. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:11:28 +04:00
e3mrah	d681f64505	fix(catalyst-api): mint HS256 token on SME proxy calls (was forwarding incompatible RS256) (#1630) PR #1625 shipped the /api/v1/sme/billing/vouchers/* proxies but the SME gateway (core/services/gateway/proxy.go) rejects RS256 outright — it only accepts HS256 signed with sme-secrets/JWT_SECRET. Result on every fresh Sovereign: operator clicks on /bss/vouchers returned silent 401 with no upstream audit trail. This commit ships the bridge: - core/services/shared/auth/mint_sme.go (new) - MintSMEAccessToken(secret, sub, email, role) → 5-min HS256 JWT in the wire shape billing's requireVoucherIssuer expects. - SMERoleFor(realmRoles, tier) → maps Keycloak roles + tier claim onto SME vocab (superadmin | sovereign-admin | member). - Pure, no IO, fully unit-tested (mint_sme_test.go). - products/catalyst/bootstrap/api/internal/handler/sme_billing_vouchers.go - proxySMEVoucher now mints a fresh HS256 token per upstream hop from the operator's already-validated RS256 session claims and forwards that as Bearer to the SME gateway. RS256 header is no longer leaked upstream. - Unwired bridge (CATALYST_SME_JWT_SECRET empty) surfaces 503 `sme-jwt-bridge-unwired` instead of the silent 401. - products/catalyst/bootstrap/api/internal/handler/handler.go - h.smeJWTSecret field + SetSMEJWTSecret(secret) setter. - products/catalyst/bootstrap/api/cmd/api/main.go - Reads CATALYST_SME_JWT_SECRET on startup and wires it. - Log line includes byte count only (never the secret value, per INVIOLABLE-PRINCIPLES.md #10). - products/catalyst/chart/templates/api-deployment.yaml - New env CATALYST_SME_JWT_SECRET sourced from sme-secrets/JWT_SECRET in the same namespace (catalyst-system). optional: true so Sovereigns without marketplace surface a 503 rather than CreateContainerConfigError. 
- products/catalyst/chart/templates/sme-services/sme-secrets.yaml - emberstack/reflector annotation block mirroring sme-secrets from `sme` ns into `catalyst-system` (Kubernetes secretKeyRef is same-namespace-only). Same pattern as cnpg-cluster.yaml and provisioning-github-token.yaml. Operator-visible behaviour: the bridge is transparent on the happy path (operator with sovereign-admin tier on a Sovereign with marketplace enabled clicks /bss/vouchers → list returns). On the unhappy paths the operator now sees a real status code: - 503 sme-jwt-bridge-unwired (chart wire missing) — actionable - 503 sme-gateway-unreachable (DNS NXDOMAIN) — pre-existing - 403 from billing's requireVoucherIssuer (role insufficient) — was silent 401 before, now propagates the real authz result. Tests: core/services/shared/auth `go test ./...` PASS. catalyst-api `go build ./...` PASS. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:11:04 +04:00
e3mrah	51b6188eb1	feat(sandbox+bootstrap-kit): newapi Sovereign install (Bank Dhofar Qwen wired for Sandbox) (#1631 ) Sandbox Wave 4 retry. Slot 80 (bp-newapi) already exists in the _template bootstrap-kit but ships the qwenBankDhofar channel hard-coded to `enabled: false` with empty endpoint — so every franchised Sovereign came up without an LLM channel and sandbox agents fell back to mothership newapi, defeating per-Sovereign sandboxing. Wire the qwenBankDhofar channel to the same envsubst flag the Catalyst control plane uses (`${MARKETPLACE_ENABLED:-false}`) and default the endpoint to the canonical first-otech relay (`https://llm-api.omtd.bankdhofar.com`) with override via `${LLM_BANK_DHOFAR_BASE_URL}`. API key is still pulled from the `newapi-channel-qwen-bankdhofar` Secret (cloud-init or ExternalSecret per existing chart contract). No chart bump — chart 1.4.6 (slot 80) already supports gating qwenBankDhofar via .Values.defaultChannels.qwenBankDhofar.enabled and reading endpoint/secret from those values. Only the bootstrap-kit overlay was wired with the wrong defaults. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 10:08:43 +04:00
github-actions[bot]	48a4a86548	deploy: update catalyst images to `e7b2062`	2026-05-18 05:58:02 +00:00
e3mrah	e7b20620aa	fix(domain): per-tenant DNS reconciler — <slug>.<pool-domain> resolves to Sovereign LB (was mothership) (#1629) Wire CATALYST_OTECH_INGRESS_IPV4 from the sovereign-fqdn ConfigMap key `lbIP` so DefaultSMETenantDNSProvisioner.ProvisionFreeSubdomain (already implemented in sme_tenant_dns.go) actually receives the Sovereign's LB IP at run time. Without this env, ProvisionFreeSubdomain has been returning `errors.New("otech ingress IPv4 unconfigured")` — silently — on every Sovereign tenant signup, so the per-tenant A records for `console|wordpress|openclaw|mail|keycloak.<slug>.<pool>` were never PATCHed into PowerDNS, leaving the pool zone's apex/wildcard delegation to point new tenants at the mothership IP (49.12.16.160) instead of the correct Sovereign LB. Same plumbing pattern as SOVEREIGN_LB_IP a few lines above (sourced from the same ConfigMap, same key). Per-tenant approach (not a single `*.<pool>` wildcard) is required because: (a) each tenant gets five distinct host records, (b) the pool zone hosts records for the Sovereign itself, so a blanket wildcard would shadow legitimate Sovereign-owned subdomains, and (c) the reconciler is already there — only the env wire is missing. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 09:56:02 +04:00
github-actions[bot]	ca880b0e3f	deploy: update sme service images to `50a45a9` + bump chart to 1.4.159	2026-05-18 05:45:23 +00:00
e3mrah	50a45a9783	fix(billing): skip Stripe when voucher covers 100% of total (unblocks fully-paid voucher checkout) (#1628 ) POST /billing/checkout was 503'ing with "payment processor is not configured" on Sovereigns that have not pasted Stripe keys yet — even when the customer's credit balance (from a fresh voucher redemption in the same request, or a prior balance) fully covered the order total. Make the credit-only short-circuit explicit: compute `remainingOMR := totalOMR - creditBalance` and settle via CreditOnlyCheckout when `<= 0`, BEFORE any Stripe settings probe. This is the path that has to keep working during the voucher-only weeks of a new Sovereign. Adds checkout_test.go covering two regression paths: - fresh-voucher path: customer with 0 credit redeems WELCOME50 against a 50-OMR plan → 200 + paid_by_credit:true, settings table never probed (sqlmock asserts no unexpected queries). - pre-existing-credit path: customer with 200-OMR standing balance buys a 100-OMR plan, no promo_code in request → 200 + paid_by_credit:true + 100-OMR leftover credit. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 09:44:22 +04:00
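The credit-only short-circuit from #1628 reduces to one subtraction made before any payment-processor lookup. A minimal sketch follows; the `decision` type and `settle` function are hypothetical names, and storing amounts as integer baisa (1 OMR = 1000 baisa) is an assumption, since the commit does not say how the billing service represents money.

```go
package main

import "fmt"

// decision describes how an order is settled.
type decision struct {
	PaidByCredit  bool  // true when credit covers the whole total
	ChargeBaisa   int64 // amount still owed to the payment processor
	LeftoverBaisa int64 // credit remaining after settlement
}

// settle computes remaining = total - credit and short-circuits to a
// credit-only settlement when remaining <= 0. Only the other branch would
// need to consult payment-processor settings.
func settle(totalBaisa, creditBaisa int64) decision {
	remaining := totalBaisa - creditBaisa
	if remaining <= 0 {
		return decision{PaidByCredit: true, LeftoverBaisa: -remaining}
	}
	return decision{ChargeBaisa: remaining}
}

func main() {
	const omr = 1000 // baisa per OMR

	// Fresh 50-OMR voucher redeemed against a 50-OMR plan.
	fmt.Printf("%+v\n", settle(50*omr, 50*omr))

	// 200-OMR standing balance against a 100-OMR plan.
	fmt.Printf("%+v\n", settle(100*omr, 200*omr))

	// Credit covers only part of the order; the rest goes to the processor.
	fmt.Printf("%+v\n", settle(100*omr, 30*omr))
}
```

The first two calls correspond to the two regression paths the commit's checkout_test.go covers.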
github-actions[bot]	467964d898	deploy: update Catalyst marketplace image to `556813d`	2026-05-18 05:43:19 +00:00
e3mrah	556813d0ac	fix(marketplace): post-purchase redirect to Sovereign-local console (was hardcoded to mothership) (#1627 ) Previously, after a successful checkout on a Sovereign marketplace (e.g. marketplace.t15.omantel.biz), the browser was redirected to https://console.openova.io/nova which is the MOTHERSHIP console — so the user was bounced off their own Sovereign and re-prompted to sign in against the wrong identity provider. Same bug fired the "returning user" auto-redirect in Layout.astro. Root cause: CONSOLE_URL in core/marketplace/src/lib/config.ts and the inline returning-user redirect in core/marketplace/src/layouts/ Layout.astro both hardcoded "https://console.openova.io/nova". The marketplace pod is shared across mothership + every Sovereign (one deployment, multiple ingress hostnames — see marketplace-routes.yaml which fronts marketplace.<sov-fqdn> on Sovereigns), so a build-time constant could never name the right console. Fix: derive the console URL from window.location.hostname at runtime. - marketplace.openova.io -> https://console.openova.io/nova (mothership, /nova prefix preserved) - marketplace.<sov-fqdn> -> https://console.<sov-fqdn> (Sovereign — Cilium Gateway *.<sov-fqdn> wildcard route, NO /nova) - partner hosts + dev -> mothership fallback (skipConsoleRedirect tenants don't reach this path anyway) Implemented twice in lockstep — once in src/lib/config.ts for the Svelte components that use consoleHref(), once inline in src/layouts/Layout.astro because the returning-user redirect must fire before the Svelte bundle loads. Test: npm run build in core/marketplace clean (9 pages, 0 warnings). Inline detector verified present in dist/checkout/index.html + dist/index.html. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 09:42:42 +04:00
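The hostname-to-console mapping from #1627 is shown here as a Go sketch for consistency with the rest of this log. The actual fix lives in browser code (src/lib/config.ts and Layout.astro) and reads `window.location.hostname`; the function name `consoleURL` is hypothetical and the rules are transcribed from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

const mothershipConsole = "https://console.openova.io/nova"

// consoleURL derives the console URL from the marketplace hostname:
// the mothership keeps its /nova prefix, a Sovereign marketplace maps to
// console.<sov-fqdn> with no prefix, and anything else falls back to the
// mothership console.
func consoleURL(hostname string) string {
	host := strings.ToLower(hostname)
	const prefix = "marketplace."
	if host == "marketplace.openova.io" {
		return mothershipConsole
	}
	if strings.HasPrefix(host, prefix) && len(host) > len(prefix) {
		return "https://console." + host[len(prefix):]
	}
	return mothershipConsole
}

func main() {
	for _, h := range []string{
		"marketplace.openova.io",
		"marketplace.t15.omantel.biz",
		"localhost",
	} {
		fmt.Println(h, "->", consoleURL(h))
	}
}
```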
github-actions[bot]	1063916b73	deploy: update sme service images to `048cb2c` + bump chart to 1.4.158	2026-05-18 05:34:41 +00:00
e3mrah	048cb2c3de	fix(sme): wire tenant + billing event dispatchers to NATS (was Redpanda-only, blocking convergence) (#1626 ) The tenant + billing services hardcoded a franz-go Kafka publisher pointing at REDPANDA_BROKERS. On Sovereigns there is NO Redpanda in cluster — only NATS JetStream at nats-jetstream.nats-system.svc.cluster.local:4222 — so every tenant.created / tenant.deleted / order.placed event was silently dropped, blocking provisioning + downstream consumers and stalling the convergence chain end to end. Per ADR-0001 §6 the canonical event bus is NATS JetStream with subject convention `catalyst.<domain>.<event>`. This change: - Adds events.BrokerPublisher + events.MultiPublisher that fan out to NATS (`catalyst.<event.Type>` derived from Event.Type) and the legacy Redpanda topic in one call. Either transport may be nil; the constructor refuses to build a no-op publisher (the exact silent-failure mode we just hit). - Adds NATSConn.PublishEvent so the generic Event envelope can flow over the same JetStream connection used for the metering subscriber (#798), with Event.ID as the JetStream Msg-Id for broker-side de-dup. - Updates tenant + billing main.go to read NATS_URL + REDPANDA_BROKERS independently, construct the appropriate transports, and wire MultiPublisher into the Handler. Legacy Kafka consumers only start when REDPANDA_BROKERS is non-empty so the pods no longer crashloop dialling localhost:9092 on Sovereigns. - Updates chart templates to inject NATS_URL into both tenant and billing Deployments. ConfigMap default for NATS_URL on Sovereigns is nats://nats-jetstream.nats-system.svc.cluster.local:4222 (fixes the existing bug where defaults pointed at the wrong namespace `nats-jetstream` — NATS actually lives in `nats-system` per clusters/_template/bootstrap-kit/07-nats-jetstream.yaml). - Sovereign default of REDPANDA_BROKERS is now empty (was the wrong NATS URL stuffed into a Kafka env, which made franz-go fail every dial). 
Subject mapping per CanonicalSubject: tenant.created → catalyst.tenant.created tenant.deleted → catalyst.tenant.deleted tenant.app_install_requested → catalyst.tenant.app_install_requested order.placed → catalyst.billing.order.placed Test: go build ./... in shared/, tenant/, billing/ (clean) go test ./events/... ./handlers/... in all three (existing + new bridge_test.go pass) helm template with global.sovereignFQDN set renders NATS_URL in both Deployments + REDPANDA_BROKERS="" in ConfigMap helm template without global.sovereignFQDN renders the legacy Redpanda broker (Catalyst-Zero contabo path remains intact) NATS-side consumers for sme.tenant.events / sme.provision.events ship in a follow-up PR per the ADR-0001 §6 migration plan; this PR only unblocks the publish leg which is the immediate convergence blocker. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 09:33:36 +04:00
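The subject mapping listed in #1626 can be sketched as a default prefix plus an override table. This is an assumption about how CanonicalSubject might be structured, not its actual implementation: the commit states the general rule `catalyst.<event.Type>` and lists `order.placed → catalyst.billing.order.placed`, which the sketch treats as an explicit override. The names `subjectOverrides` and `canonicalSubject` are hypothetical.

```go
package main

import "fmt"

// subjectOverrides holds event types whose subject does not follow the
// plain "catalyst.<type>" rule. Modelling order.placed as an override is
// an assumption based on the mapping table in the commit message.
var subjectOverrides = map[string]string{
	"order.placed": "catalyst.billing.order.placed",
}

// canonicalSubject maps an event type to its NATS subject.
func canonicalSubject(eventType string) string {
	if s, ok := subjectOverrides[eventType]; ok {
		return s
	}
	return "catalyst." + eventType
}

func main() {
	for _, t := range []string{
		"tenant.created",
		"tenant.deleted",
		"tenant.app_install_requested",
		"order.placed",
	} {
		fmt.Println(t, "->", canonicalSubject(t))
	}
}
```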
e3mrah	586bf7fd2d	fix(catalyst-api): wire /api/v1/sme/billing/vouchers/{list,issue,revoke} proxy (#1625 ) Wave 6 PR #1609 shipped the BSS Vouchers FE (products/catalyst/bootstrap/ui/src/lib/bss.api.ts — listVouchers/issueVoucher/revokeVoucher) but never added the matching catalyst-api proxy handlers. The /bss/vouchers page could render its target-state chrome but every voucher action 404'd at the catalyst-api ingress because no `/api/v1/sme/billing/vouchers/` route existed. This PR adds the catalyst-api → SME gateway proxy in sme_billing_vouchers.go, mirroring the sme_billing_revenue.go / sme_catalog_client.go patterns: - GET /api/v1/sme/billing/vouchers/list - POST /api/v1/sme/billing/vouchers/issue - POST /api/v1/sme/billing/vouchers/revoke/{code} (task spec) - DELETE /api/v1/sme/billing/vouchers/revoke/{code} (FE wire) All four registered inside the existing RequireSession group in cmd/api/main.go alongside the other /api/v1/sme/ routes. Upstream is the SME gateway at http://gateway.sme.svc.cluster.local:8080 (override via CATALYST_SME_GATEWAY_URL per docs/INVIOLABLE-PRINCIPLES.md #4), which strips `/api` and forwards to core/services/billing/handlers/ vouchers.go (gated by requireVoucherIssuer — superadmin OR sovereign-admin per docs/FRANCHISE-MODEL.md §3). The handler always forwards revoke as DELETE so the billing service's `DELETE /billing/vouchers/revoke/{code}` route matches. The Authorization header is forwarded verbatim; status + body stream through unchanged so the FE's listVouchers (which throws on non-2xx) sees the upstream's real status. 503 + sme-gateway-unreachable on DNS NXDOMAIN so a Sovereign with marketplace.enabled=false degrades gracefully rather than 5xx-ing. No chart bump. Build clean; only pre-existing whoami/user_access test failures remain (unrelated to this surface — confirmed by running the same tests on origin/main without this change). 
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-18 09:33:01 +04:00

2490 Commits