Wave 11 follow-up to PR #1653 (sandbox.db.*). Replaces the stubbed
sandbox.auth.* and sandbox.secrets.* tool handlers with real
implementations so agents can manage per-Sandbox Keycloak realms /
OIDC clients and a per-Sandbox Secret store.
sandbox.auth.* (Keycloak Admin REST via the sandbox-controller-
injected admin bearer):
- sandbox.auth.provisionRealm {realm_name, display_name?}
POST /admin/realms — idempotent on 409 Conflict.
- sandbox.auth.listClients
GET /admin/realms/<sandbox-realm>/clients — friendly empty
list on 404 (realm not yet provisioned).
- sandbox.auth.registerClient {client_id, redirect_uris,
public_client?, name?}
POST /admin/realms/<sandbox-realm>/clients — idempotent on
409 Conflict, typed error on 404 (realm missing).
The Sandbox's "own" realm name is deterministic (`sandbox-<org>-<id>`).
The agent CANNOT pass a `realm` argument to list / register; only
provisionRealm accepts a free-form name.
sandbox.secrets.* (per-Sandbox K8s Secret store, base64-encoded
data, encrypted at rest by kube-apiserver encryption-provider):
- sandbox.secrets.read {key} — returns Found / KeyNotFound
/ NotFound (Secret missing)
- sandbox.secrets.write {key, value} — auto-creates the Secret on
first write (Added /
Updated / Created)
The Secret is named `sandbox-<owner-uid>-secrets` in
env.SandboxNamespace and gated by
openova.io/managed-by=openova-sandbox-mcp, so sandbox.secrets.write
CANNOT mutate the controller-injected `sandbox-tokens` Secret or any
other unmanaged Secret in the ns.
Auth: claims.OrgID == env.OrgID required (same as sandbox.db.*),
RequiredCapability = "sandbox.auth" / "sandbox.secrets".
New env vars (sandbox-controller injects on MCP Deployment):
- SANDBOX_OWNER_UID — `sandbox-<owner-uid>-secrets` suffix
- KEYCLOAK_ADMIN_URL — root of the Keycloak Admin REST API
- KEYCLOAK_ADMIN_TOKEN — pre-minted admin bearer
- KEYCLOAK_PARENT_REALM — default "master"
No chart bump; mcp-server-only change. go build + go test clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Sandbox public-URL flow (sandbox.<sov-fqdn>/sessions/<owner-uid>/*) had
three independent gaps that prevented PR #1641's HTTPRoute from resolving
end-to-end:
1. HTTPRoute parentRefs pointed at "catalyst-public/catalyst-system/https",
a Gateway that does not exist on a Sovereign. The canonical public
Gateway is "cilium-gateway/kube-system" (clusters/_template/
sovereign-tls/cilium-gateway.yaml), the same parent that organization-
controller's tenant_route.go and the chart's httproute.yaml attach to.
sectionName is omitted so the HTTPRoute auto-attaches to every listener
whose hostname matches sandbox.<sov-fqdn> — the wildcard
*.${SOVEREIGN_FQDN} HTTPS listener already in place per infra/hetzner/
main.tf locals.parent_domains_listeners_yaml fallback path.
2. The per-name Cilium Gateway cert (clusters/_template/sovereign-tls/
cilium-gateway-cert.yaml) is a SAN list, not a wildcard. Without
"sandbox.<sov-fqdn>" in its dnsNames cilium-envoy serves the default
fallback cert and browsers see NET::ERR_CERT_COMMON_NAME_INVALID.
This file is the source of the per-zone Secret
sovereign-wildcard-tls-<sov-fqdn-dashed> the Gateway listener
references — adding the SAN is the only TLS-side change needed; the
Gateway listener wildcard is already a hostname match.
3. The parent zone's A-record set is built from CanonicalSovereignSubdomains
in products/catalyst/bootstrap/api/internal/handler/
sovereign_dns_records.go. Without "sandbox" the PowerDNS PATCH never
writes sandbox.<sov-fqdn> A-record → primary LB IP, and the URL
resolves NXDOMAIN even when the listener + cert are healthy.
End-to-end resolution chain after this PR:
Browser → sandbox.<sov-fqdn>/sessions/<owner-uid>/ (PowerDNS A record
points at primary LB IPv4)
→ Hetzner LB :443 → cp-node :30443 (cilium-envoy)
→ Gateway listener https-<sov-fqdn-dashed> on *.<sov-fqdn> matches
hostname; cert SAN includes sandbox.<sov-fqdn> so TLS terminates
→ HTTPRoute pty-server in sandbox-<owner-uid> namespace matches
hostname + /sessions/<owner-uid>/ path prefix; URLRewrite strips
/sessions/<owner-uid>/ → /sessions/
→ backendRef pty-server:7681 in sandbox-<owner-uid> namespace
→ pty-server StatefulSet (PR #1641) serves the session
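The net effect of the hostname match and path rewrite in the chain
above can be sketched in Go. This mirrors only the described outcome
(/sessions/<owner-uid>/ becomes /sessions/); it is not the Gateway API
implementation, and sandboxHost / rewrite are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// sandboxHost builds the public hostname sandbox.<sov-fqdn>.
func sandboxHost(sovFQDN string) string {
	return "sandbox." + sovFQDN
}

// rewrite strips the per-owner prefix so the backend sees /sessions/...
// The second return value is false when the path belongs to another
// owner or does not sit under /sessions/<owner-uid>/ at all.
func rewrite(ownerUID, path string) (string, bool) {
	prefix := "/sessions/" + ownerUID + "/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	return "/sessions/" + strings.TrimPrefix(path, prefix), true
}

func main() {
	p, ok := rewrite("u1", "/sessions/u1/abc/ws")
	fmt.Println(sandboxHost("example.test"), p, ok)
}
```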
Hard rules respected: READ-ONLY clusters, no Chart.yaml bump (only
template content + Go renderer + Go handler list), helm template +
kubectl kustomize clean (verified locally), tests updated to assert the
new parentRefs shape and pass under go 1.23.
Codifies the 17-step marketplace customer journey (storefront → catalog →
product detail → voucher → signup → subdomain pick → PIN → checkout →
provisioning chain → console redirect) as a hermetic Playwright suite.
Previously the journey was only walked manually by ad-hoc fix-author
agents (see PR #1635 / docs/SESSION-2026-05-17-CONVERGENCE.md). This adds
a regression gate so future PRs catch breakage in any of the 14 spec
tests (17 step labels grouped into 14 Playwright tests — steps 12-15 are
asserted as one API-chain contract since CheckoutStep redirects to
console before the panel-poll UI would render).
Highlights
----------
- core/marketplace/playwright.config.ts — testDir=./playwright,
workers=1, baseURL from MARKETPLACE_BASE_URL (default
http://localhost:4321), same posture as
tests/e2e/playwright/playwright.config.ts.
- core/marketplace/playwright/customer-journey.spec.ts — every backend
call (/api/catalog/*, /api/auth/*, /api/tenant/*, /api/billing/*,
/api/provisioning/*) intercepted via page.route() so the run is
hermetic (npm run build && npm run preview is enough — no real
catalyst-api / billing / provisioning service required).
- Asserts the PR #1627 fix (deriveConsoleURL host-driven) — Sovereign
hosts redirect to console.<sov-fqdn> (no /nova), mothership stays on
console.openova.io/nova.
Verification
------------
npx playwright test customer-journey → 14 passed (2.5m).
Convergence wave 11 blocker on t16: bp-newapi HR install fails with
Error: template: bp-newapi/templates/configmap.yaml:1:4: executing
"bp-newapi/templates/configmap.yaml" at <include "bp-newapi.assertChannelAttestation" .>:
channel[0] (qwen3.6-bankdhofar): commercial-contract attestation
requires accountId
PR #1631 wired the bootstrap-kit overlay so franchised Sovereigns can
opt in to marketplace via `MARKETPLACE_ENABLED=true` — flipping
`defaultChannels.qwenBankDhofar.enabled` to true with envsubst
placeholders for the attestation:
attestation:
kind: commercial-contract
accountId: ${LLM_BANK_DHOFAR_ACCOUNT_ID:-}
contractRef: ${LLM_BANK_DHOFAR_CONTRACT_REF:-}
On a Sovereign that has not yet signed the commercial contract those
variables expand to empty strings, and the chart's
`assertChannelAttestation` helper hard-fails the helm template before
any manifest is rendered — newapi install crashes at slot 80 and the
whole bootstrap-kit reconciliation stalls.
Fix (Option A — smallest change, makes the chart actually install):
SKIP composing the qwenBankDhofar channel when
attestation.kind=commercial-contract AND either accountId or contractRef
is empty. NewAPI installs with zero default channels (operator-supplied
`.Values.channels` still compose). Once the operator overlay supplies
the attestation values the channel composes on the next reconcile.
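The skip rule is implemented in Helm templates; the same predicate is
sketched here in Go purely for illustration. The attestation struct and
the channelComposes function are hypothetical names.

```go
package main

import "fmt"

// attestation mirrors the three values the overlay supplies.
type attestation struct {
	Kind        string
	AccountID   string
	ContractRef string
}

// channelComposes reports whether the qwenBankDhofar channel should be
// composed. A commercial-contract attestation with an empty accountId
// or contractRef skips the channel instead of failing the render.
func channelComposes(enabled bool, a attestation) bool {
	if !enabled {
		return false
	}
	if a.Kind == "commercial-contract" && (a.AccountID == "" || a.ContractRef == "") {
		return false
	}
	return true
}

func main() {
	empty := attestation{Kind: "commercial-contract"}
	fmt.Println(channelComposes(true, empty))
}
```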
Touches two templates that gate on the same effective channel list:
- templates/_helpers.tpl `bp-newapi.effectiveChannels` — adds a
pre-check ($qbdAttReady) that short-circuits the channel composition
block when attestation is incomplete. The downstream
`assertChannelAttestation` helper then sees an empty channel list
for the qwenBankDhofar slot and emits no error.
- templates/channel-seed-job.yaml — mirrors the same gate so the
post-install Helm hook Job + RBAC + audit ConfigMap also skip when
the channel itself was skipped (otherwise the Job would POST a row
whose ConfigMap entry was omitted from /etc/newapi/channels.yaml).
`helm template platform/newapi/chart` renders cleanly in all three
states:
- default (qbd.enabled=false) → no channel, no seed Job
- qbd.enabled=true + empty accountId/contractRef → no channel, no
seed Job (NEW: pre-1.4.10 this hard-failed)
- qbd.enabled=true + accountId + contractRef present → channel
composed normally, seed Job emitted
Chart bumped 1.4.9 → 1.4.10; bootstrap-kit overlay pin bumped
1.4.6 → 1.4.10 so franchised Sovereigns immediately pick up the fix.
READ-ONLY clusters preserved. NO Chart.yaml bump on
bp-catalyst-platform.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1645 (Wave 8) wired gitea.* + k8s.read.* + session.* in the MCP
server but left sandbox.db.* as not_implemented stubs. This commit
ships the real handlers using the same dynamic-client pattern.
Tools shipped (all gated on `RequiredCapability=sandbox.db` + claim
OrgID==env.OrgID, all scoped to env.SandboxNamespace):
- sandbox.db.provision {name, plan?} — POSTs a CNPG Cluster CR
(default plan: 1 instance, 5Gi PVC, postgres 16, db=app). Returns
{host:<name>-rw.<ns>.svc.cluster.local, port:5432, dbname, user,
secretName:<name>-app, secretKey:password}.
- sandbox.db.list — labels-filtered LIST scoped to the Sandbox ns,
returns the same connection envelope per item plus a distilled
status summary (phase, readyInstances, Ready condition).
- sandbox.db.get {name} — GET one Cluster; refuses to surface a
Cluster lacking openova.io/managed-by=openova-sandbox-mcp
(defence-in-depth against an agent fishing for per-Org pair DBs).
- sandbox.db.drop {name} — DELETE with foreground propagation so the
operator cascades PVC/Service/Secret cleanup before returning.
Same managed-by guard as get.
- sandbox.db.dump {name} — POSTs a one-shot Backup CR
(`<cluster>-dump-<UTC>`). Returns the Backup name + the Cluster's
configured barmanObjectStore.destinationPath so the agent can find
the resulting S3 prefix without polling Backup.status.
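A sketch of the connection envelope returned by provision / list / get,
assuming the field names below (the real struct is not shown in this
message). The user value "app" is an assumption based on CNPG naming
the default owner after the database; only db=app is stated above.

```go
package main

import "fmt"

// Connection is a hypothetical shape for the returned envelope.
type Connection struct {
	Host       string
	Port       int
	DBName     string
	User       string
	SecretName string
	SecretKey  string
}

// connectionFor derives the envelope from the Cluster name and the
// Sandbox namespace, following the <name>-rw Service and <name>-app
// Secret names quoted in the provision description.
func connectionFor(name, ns string) Connection {
	return Connection{
		Host:       fmt.Sprintf("%s-rw.%s.svc.cluster.local", name, ns),
		Port:       5432,
		DBName:     "app",
		User:       "app", // assumption, see lead-in
		SecretName: name + "-app",
		SecretKey:  "password",
	}
}

func main() {
	fmt.Printf("%+v\n", connectionFor("orders", "sandbox-u1"))
}
```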
Why CNPG Cluster CRs (not a per-Sandbox shared DB): a per-app DB keeps
the tenancy, backup, and restart blast radius scoped to one app, which
matches architecture §3 + §7. Cluster CRs live in the Sandbox's OWN namespace
(sandbox-<owner-uid>); the agent cannot pass `namespace` — it's read
from env. The MCP server never mutates the resulting Pods/PVCs/
Services — the upstream CNPG operator (bp-cnpg) owns those.
Tests (sandbox_db_test.go, 9 cases incl. 5 capability-gate sub-tests):
- validation (name regex, missing name, unknown plan)
- default-plan CR shape (apiVersion, kind, labels, spec.instances,
storage.size, bootstrap.initdb.database, enableSuperuserAccess)
- connectionFor envelope matches CNPG service-name defaults
- on-demand Backup CR shape + managed-by label
- requireSandboxNS guard rails (no env / empty ns / populated)
- capability gate rejects bearers w/o sandbox.db
- status summary surfaces phase + Ready condition only
Hard rules respected: NO chart bump, no host-cluster touch — every
mutation lands inside the Sandbox's own namespace via the SA the
sandbox-controller already gives the MCP pod. go build + go vet +
go test clean. Catalogue test updated for new `sandbox.db.get`.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught live on t16.omantel.biz convergence: bp-sandbox HR stuck
Reconciling because its chart pull goes through harbor.<sov-fqdn>
(post-handover cutover slot 06a Step-06 phase-1 rewrites every
HelmRepository URL `oci://ghcr.io/openova-io` →
`oci://harbor.<sov-fqdn>/openova-io`), but harbor.<sov-fqdn> is not
reachable yet because bp-harbor itself has not reached Ready —
chicken-and-egg.
Same failure shape as Wave 7 #1610 with bp-hcloud-csi (REMOVED). This
PR takes the cleaner long-term route: rather than remove the
slot, sequence it AFTER bp-harbor (slot 19) by renumbering it to 19a
and adding `bp-harbor` to the HR's dependsOn graph. The Sandbox MVP
Wave 11 slot stays available, with no manual Day-2 add-app
re-introduction needed.
bp-harbor itself does not hit the cycle because its chart pull goes
through harbor.openova.io (the mothership-warmed proxy-cache wired
into k3s registries.yaml at cloud-init time) — NOT through
harbor.<sov-fqdn>.
Diff:
- clusters/_template/bootstrap-kit/61-bp-sandbox.yaml renamed →
19a-bp-sandbox.yaml; slot label "61" → "19a"; dependsOn adds
bp-harbor; header documents the move + chicken-and-egg context.
- clusters/_template/bootstrap-kit/kustomization.yaml: 19a slot
inserted right after 19-harbor.yaml with the post-cutover URL
rewrite rationale inline; old slot-61 entry replaced with a
back-pointer comment.
Verified `kubectl kustomize clusters/_template/bootstrap-kit/`
renders clean: bp-sandbox HR keeps slot label, gains
- name: bp-harbor in dependsOn, all other fields unchanged.
No Chart.yaml bump (this is a bootstrap-kit Kustomization-only fix,
not a chart change). READ-ONLY clusters.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1641 shipped the `openova.io/sandbox-idle-timeout-minutes` annotation on
every pty-server StatefulSet but no controller was reading it. This closes
the loop:
pty-server (products/sandbox/pty-server/):
- session.Manager tracks lastActivity; Touch() called on session
create/stop, WS attach/detach, every WS message in/out, resize/signal.
- New GET /idle endpoint returns {lastActivityAt, activeSessions}.
- Unit tests cover the endpoint shape + Touch() bump.
sandbox-controller (core/controllers/sandbox/internal/idlescaler/):
- New IdleScaler runnable, registered with mgr.Add() in main.go.
- NeedLeaderElection=true (singleton across HA replicas).
- Every 60s lists pty-server StatefulSets by label selector
(app.kubernetes.io/component=pty-server + openova.io/managed-by=catalyst),
constrained to `sandbox-*` namespaces in code for defence-in-depth.
- For each: probes the in-cluster Service /idle endpoint, stamps the
`openova.io/sandbox-last-activity-at` annotation, and patches
spec.replicas=0 once now-lastActivity exceeds the per-SS
`openova.io/sandbox-idle-timeout-minutes` annotation (falling back to
SANDBOX_IDLE_TIMEOUT_MINUTES env, default 30).
- Probe failure with no prior annotation → skip (next tick); probe
failure WITH prior annotation → still decide on stale data so a
degraded probe path doesn't keep a forgotten Pod alive forever.
- activeSessions > 0 keeps the Pod alive regardless of idle window.
- Already-zero replicas → idempotent no-op.
Chart RBAC:
- ClusterRole gains apps/statefulsets get/list/watch/patch — the ONLY
cluster-wide write on a non-CR resource, scoped to the controller's
own managed StatefulSets via the label selector + namespace prefix.
Tests: 9 unit tests covering active-not-idle, idle-scales-zero,
active-sessions-never-scales, probe-fail-no-annotation-skips,
per-SS-annotation-override, namespace-prefix-defence, already-zero-no-op,
default-URL-builder, leader-election-singleton.
Approach: controller polls pty-server's /idle endpoint via cluster-DNS
(smaller diff than embedding a k8s client in pty-server — pty-server
keeps its ~80-line go.mod, no new RBAC inside the per-Sandbox namespace).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1644 added Organization.spec.tenantPublic + per-tenant HTTPRoute
reconciler, but nothing set the field — every Org CR's TenantPublic
stayed zero-value, the reconciler short-circuited at the empty
ParentDomain guard, and `<slug>.omani.homes` 404'd at the Cilium
Gateway.
Wire the patch at the only point that knows a tenant's product is
actually Ready: the provisioning service. Both the initial workflow
(`provision.completed`) and the day-2 install path
(`provision.app_ready`) now patch the Organization CR's
spec.tenantPublic with parentDomain (from TENANT_PARENT_DOMAIN env),
subdomain (= slug), backendService (canonical vcluster-synced name),
port 80, and the picked product slug. Last-write-wins on subsequent
installs.
Per docs/INVIOLABLE-PRINCIPLES.md #4 the parent zone flows through
env, never hardcoded — every Sovereign picks its own pool zone.
Empty env disables the patch entirely (legacy tenants keep working
through the Sovereign-wide tenant-wildcard route). Best-effort:
failures don't fail the provision. 404 on the CR is benign (legacy
tenant without an Organization counterpart).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the sandbox-controller (PR #1622) to actually mint per-Sandbox
LLM-gateway tokens via the catalyst-api bridge handler shipped in
PR #1638, replacing the Wave 1 placeholder Secret with a real
LLM_GATEWAY_TOKEN-bearing manifest pushed to the per-Org Gitea repo.
Changes:
- New newapi.Client (core/controllers/sandbox/internal/newapi/) —
thin HTTP client for POST /admin/tokens/sandbox with the bridge's
{org_id, user_id, sandbox_id, allowed_channels} body + Bearer
ADMIN_SECRET auth. Interface so tests can stub.
- Reconciler extended:
* NewAPIClient + DefaultChannels + TokenRotationLeadTime fields
* On every reconcile: decide mint-or-skip from annotation
openova.io/sandbox-token-expires-at vs. now + lead-time
* On mint: POST to bridge, stamp expires-at + rotated-at
annotations on the CR, render token bytes into a new
gitops manifest secret-newapi-token.yaml committed to the
per-Org catalyst-tenant repo at sandbox/<owner-uid>/
* Bridge failure → Failed/TokenMintFailed condition + 30s
requeue + no gitops writes (fail-loud)
* Empty DefaultChannels → NoAllowedChannels condition (fail
earlier than the bridge's 400)
- gitops.Render:
* New Inputs.NewAPIToken/NewAPITokenSecretName/NewAPITokenExpiresAt
/NewAPITokenRotatedAt fields
* New secret-newapi-token.yaml template — Secret with
stringData.LLM_GATEWAY_TOKEN + expires-at annotation +
optional kubectl.kubernetes.io/restartedAt rotation marker
so Wave 2's pty-server StatefulSet picks up rolling
restarts on token rotation
* kustomization.yaml appends the new manifest when token
present
- Chart wiring (platform/sandbox/chart):
* Deployment env: NEWAPI_BASE_URL, NEWAPI_ADMIN_SECRET
(secretKeyRef from newapi-bp-newapi-token-signing-key,
optional: true), NEWAPI_DEFAULT_CHANNELS
* ClusterRole bumped to allow update/patch on the
sandboxes/ resource (the controller now stamps annotations
on the CR)
- platform/newapi/chart/templates/sandbox-token-signing-key-secret.yaml:
* Added emberstack/reflector annotations so the chart-emitted
Secret (newapi namespace) mirrors into the sandbox-controller
namespace by default; reflectorNamespaces is overrideable.
Tests:
- newapi client: happy-path round-trip, 401 surfaces, input
validation, request validation. 4 cases.
- sandbox-controller: existing Wave 1 cases (happy/idempotent/
drift/missing) still pass; 5 new cases for the token path:
fresh mint + Secret render, rotation on near-expiry, steady-
state no-mint, bridge failure surfaces condition, no-channels
misconfig fails early. 9 cases total, all green.
Hard rules honored:
- No Chart.yaml bump (chart pinning is a release-driver concern)
- go build + go test ./core/controllers/sandbox/... clean
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 9 regression gate for the Sandbox UI scaffold shipped in PR #1621.
Covers four happy-path surfaces:
- Sidebar Sandbox entry exists + accent-active class on /sandbox
- Landing renders 6 agent cards (aider / claude-code / cursor-agent /
little-coder / opencode / qwen-code) with Connect Claude Max CTA
- /sandbox/settings BYOS Connect button when disconnected
- /sandbox/$id route resolves + create POST sends agent=aider
Auth gate, deployment self-discovery, SSE events, and sandbox API are
all mocked via page.route so the spec runs against `npm run dev` (Vite
on :5173) with no catalyst-api required. Per-test timeout bumped to 90s
to absorb Vite's cold-cache xterm/tanstack-router module load.
Sovereign-mode env vars required for SovereignSidebar to render:
VITE_CATALYST_MODE=sovereign \
VITE_SOVEREIGN_FQDN=sandbox.example.test \
npm run dev
Local result: 4/4 passed in 2.1m (warm cache).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(pkg/gitea): add ListPullRequests + GetPullRequest read API
Wave 8 prerequisite for openova-sandbox-mcp's gitea.pr.list +
gitea.pr.get tools. Mirrors the existing client surface
(CreatePullRequest, ListOrgRepos) with state-filtered pagination and
a get-by-number fetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(sandbox): real impls for gitea.* + k8s.read.* MCP tools (was not_implemented stubs)
Wave 8 swaps the openova-sandbox-mcp Wave-2 not_implemented stubs for
production-ready handlers on:
- gitea.repo.list / gitea.repo.get (delegates to core/controllers/pkg/gitea)
- gitea.pr.list / gitea.pr.get (delegates to new ListPullRequests +
GetPullRequest helpers in pkg/gitea; org-scope check rejects cross-tenant
owner overrides at tool dispatch time)
- k8s.read.get / k8s.read.list / k8s.read.watch (dynamic.Interface against
the Sandbox pod's in-cluster SA or SANDBOX_KUBECONFIG; watch is a
bounded short-watch — long-lived subs land Wave 9 via MCP
resources/subscribe)
- sandbox.session.whoami / sandbox.session.info (echo per-call Claims +
Sandbox metadata so the agent can self-discover its scope)
Auth: every tools/call carries a bearer (via _auth.token arg OR
SANDBOX_TOKEN env). main.go validates HS256 against SANDBOX_JWT_SECRET
using the canonical core/services/shared/auth.Claims shape (PR #1619),
strips _auth from the args, installs Claims on ctx, then Registry.Call
gates on capability + org_id-match before reaching the handler.
sandbox.session.* skips the org-scope check (the operator's session
is the operator's regardless of which Org slug their claim carries).
Stubs retained (Wave 8+):
- sandbox.db.* (CNPG Cluster CR provisioning)
- sandbox.auth.* (Keycloak realm/client management)
- gitea.pr.create / gitea.pr.merge / gitea.issue.* / gitea.release.*
- k8s.read.logs
Hard rule preserved: k8s.write.* never lands in the MCP surface.
24 new tests (registry catalogue completeness, auth gate, gitea via
httptest stub, JWT round-trip, env-var parsing).
Builds clean against go 1.23 + k8s.io/client-go v0.31.1; module wires
core/controllers + core/services/shared via the same replace pattern
catalyst-bootstrap and every sme-service already use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PowerDNS now resolves <slug>.<parentDomain> for every Org mapped onto a
Sovereign's role=sme-pool parent domain (PR #1629), but no HTTPRoute was
attaching that hostname to the tenant's installed product Service. The
Cilium Gateway terminated TLS on the wildcard cert and fell through to
the marketplace tenant-wildcard route — serving the storefront landing
page instead of the tenant's WordPress / Nextcloud / GitLab install.
Fix:
1. Extend Organization CRD with optional spec.tenantPublic
(parentDomain, subdomain, backendService, backendPort, product).
2. organization-controller renders a Gateway-API HTTPRoute in the Org
namespace (= slug) attached to cilium-gateway/kube-system when
parentDomain is set. Skipped silently when unset so existing Orgs
keep working.
3. Chart-side templates/sme-services/tenant-public-routes.yaml renders
the same HTTPRoute shape from .Values.tenantRoutes[] for operators
that prefer static fixtures over the controller's reconcile loop.
4. Tests: TestReconcile_TenantPublic_RendersHTTPRoute and
TestReconcile_TenantPublic_DisabledByDefault cover both paths.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1633 added the Sandbox app to seedApps but never wired the matching plan
rows. The marketplace checkout hit "plan_id not found" the moment a customer
picked Sandbox, and PR #1639's sandbox-orchestrator could only mint CRs with
the Wave 1 baseline quota regardless of the picked tier.
This PR closes both gaps in lockstep:
Catalog:
- Plan struct gets ProductSlug + IncludedQuotas fields (back-compat:
omitempty BSON tags so legacy rows decode fine).
- expectedSandboxPlans() helper canonical-defines the three tiers:
sandbox-free 0 OMR 1 session, 1 agent, 5 GB, BYOS
sandbox-pro 9 OMR 3 sessions, 6 agents, 50 GB, BYOS (Popular)
sandbox-ent 49 OMR unlimited, 6 agents, 500 GB, BYOS
- seedAllData appends them on fresh seed; seedMissingSandboxPlans
backfills them on already-populated Sovereigns (idempotent GET-then-
create, patches missing ProductSlug/IncludedQuotas on legacy rows).
- UpdatePlan persists the two new fields.
Sandbox orchestrator wiring:
- SandboxRequestedPayload.PlanID added; CreateOrg forwards body.PlanID.
- buildSandbox stamps openova.io/plan-id annotation + spec.planId when
PlanID is non-empty.
- quotaForPlan() maps sandbox-{free,pro,ent} → SandboxQuota; empty or
unknown plan_id falls through to DefaultQuota (Wave 1 baseline =
Sandbox Free shape). Hard-coded map mirrors catalog IncludedQuotas so
tenant-service avoids a compile-time dep on the catalog mongo stack.
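A sketch of the plan-to-quota lookup with its fallback. The Quota
fields and the Unlimited sentinel are hypothetical; the real
SandboxQuota shape is not shown here. The numbers are taken from the
three tiers listed under Catalog.

```go
package main

import "fmt"

// Unlimited is a hypothetical sentinel for the "unlimited" session tier.
const Unlimited = -1

// Quota is a hypothetical, trimmed quota shape for this sketch.
type Quota struct {
	Sessions  int
	Agents    int
	StorageGB int
}

// defaultQuota follows the statement that the Wave 1 baseline has the
// Sandbox Free shape.
var defaultQuota = Quota{Sessions: 1, Agents: 1, StorageGB: 5}

// planQuotas is a hard-coded copy of the catalog tiers.
var planQuotas = map[string]Quota{
	"sandbox-free": {Sessions: 1, Agents: 1, StorageGB: 5},
	"sandbox-pro":  {Sessions: 3, Agents: 6, StorageGB: 50},
	"sandbox-ent":  {Sessions: Unlimited, Agents: 6, StorageGB: 500},
}

// quotaForPlan returns the tier quota, or the default for an empty or
// unknown plan ID.
func quotaForPlan(planID string) Quota {
	if q, ok := planQuotas[planID]; ok {
		return q
	}
	return defaultQuota
}

func main() {
	fmt.Printf("%+v\n", quotaForPlan("sandbox-pro"))
	fmt.Printf("%+v\n", quotaForPlan("retired-plan"))
}
```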
Tests:
- TestExpectedSandboxPlans_Shape locks slugs, prices, quota keys, the
Popular flag (sandbox-pro), and the quota ladder.
- TestSandboxHandle_PlanIDStampsAnnotationAndQuota table-test exercises
all three tiers end-to-end (annotation + spec.planId + spec.quota).
- TestSandboxHandle_PlanIDEmptyKeepsDefaultQuota guards back-compat
with pre-PR publishers.
- TestSandboxHandle_PlanIDUnknownFallsBackToDefault guards typo'd /
retired plan IDs.
go build + go test clean for catalog, tenant, billing, provisioning,
shared, marketplace-api.
No Chart.yaml bump, no cluster touch.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 8 extension to PR #1622 (Wave-1 sandbox-controller). The previous
slice reconciled a Sandbox CR into namespace + ResourceQuota + RBAC +
PVCs + placeholder Secret, but NO pty-server and NO MCP server. A freshly
created Sandbox sat there with empty plumbing and no way for the user
to actually run a coding session.
This PR completes the per-Sandbox runtime by extending
core/controllers/sandbox/internal/gitops/manifests.go to render the
four manifests architecture.md §7 enumerates:
- StatefulSet pty-server (replicas = spec.quota.concurrentSessions,
one Pod per in-flight session per architecture.md §1/§2). Env wired
per newapi-proxy-contract.md §1: SANDBOX_OWNER_UID, ORG_ID,
SOVEREIGN_FQDN, NEWAPI_URL, LLM_GATEWAY_URL / OPENAI_BASE_URL,
LLM_GATEWAY_TOKEN / OPENAI_API_KEY from per-sandbox Secret
(key llm-gateway-token, optional). When claude-code is in
spec.agentCatalogue, ANTHROPIC_API_KEY is ALSO wired from the
per-user BYOS Secret `sandbox-byos-claude-code-<owner-uid>` (key
access_token, optional) per claude-code-byos.md §3. Repo PVCs mount
at /workspace/<repo-slug>.
- Deployment openova-sandbox-mcp (architecture.md §3). Companion MCP
server, talks to pty-server via the in-namespace ClusterIP Service.
- Service pty-server (ClusterIP :7681) — backend for both the MCP
Deployment and the HTTPRoute.
- HTTPRoute pty-server — publishes
sandbox.<sov-fqdn>/sessions/<owner-uid>/* → pty-server :7681 via
the existing catalyst-public Cilium Gateway in catalyst-system.
PathPrefix rewrite strips /sessions/<owner-uid> so pty-server sees
its own /sessions/<id> surface.
Knobs are env-plumbed from the chart per Inviolable Principle #4:
- SANDBOX_PTY_SERVER_IMAGE / SANDBOX_MCP_IMAGE — SHA-pinned image
refs from values.runtime.{ptyServerImage,mcpImage} (fails Helm
render fast on empty, no silent :latest).
- SANDBOX_NEWAPI_URL — from values.runtime.newapiURL (bootstrap-kit
overlay derives it from ${SOVEREIGN_FQDN}).
- SANDBOX_LLM_GATEWAY_TOKEN_SECRET / SANDBOX_BYOS_SECRET_PREFIX /
SANDBOX_IDLE_TIMEOUT_MINUTES — optional with architecture-doc
defaults.
Idle timeout (architecture.md §7) lands as a StatefulSet annotation
openova.io/sandbox-idle-timeout-minutes — the poll-loop that actually
scales the StatefulSet down on idle ships in a sibling PR (out of
scope for "spawn the Pods"; this PR makes the Pods exist).
Tests cover the full Wave-8 manifest shape: replicas count, identity
env keys, BYOS gating on spec.agentCatalogue, HTTPRoute hostname
binding, kustomization stitching, idempotency. go test
./core/controllers/sandbox/... green; helm template renders cleanly +
required guard fires on missing runtime values.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #831 follow-on to #827. Previously the Cilium Gateway declared a
single listener pair on `*.${SOVEREIGN_FQDN}` only — tenant URLs under
non-primary parent zones (e.g. wp-foo.omani.homes when the operator
brings omani.homes as the SME pool) hit cilium-envoy's default fallback
cert and TLS-handshake-mismatched. The per-zone wildcard Secret rendered
by products/catalyst/chart/templates/sovereign-wildcard-certs.yaml
(PR #827) existed but had no Gateway listener claiming its hostname.
Fix: render one listener pair (HTTPS:30443 + HTTP:30080) per parent
zone. Materialised at Terraform plan time as a JSON-flow array
(infra/hetzner/main.tf locals.parent_domains_listeners_yaml — jsonencode
of the listener objects iterating decoded parent_domains_yaml), threaded
through Flux postBuild.substitute as PARENT_DOMAINS_LISTENERS_YAML, and
consumed as a scalar value at `listeners: ${PARENT_DOMAINS_LISTENERS_YAML}`
in cilium-gateway.yaml. Each pair's certificateRefs target the per-zone
Secret `sovereign-wildcard-tls-<sanitised-zone>` so listener + cert stay
in lockstep.
Scalar placeholder (not multi-line block) because kustomize-build parses
the YAML before Flux runs envsubst — a placeholder on its own line at
column 0 fails YAML parse. Scalar `${VAR}` parses cleanly; envsubst then
swaps it for the JSON-flow array string, which the apiserver parses as
the real listener list.
Single-zone fallback preserved (var.parent_domains_yaml empty →
[{name: <sovereign_fqdn>, role: primary}]) so legacy single-zone
provisions render 2 listeners (1 HTTPS + 1 HTTP). Multi-zone provisions
(e.g. primary omani.works + sme-pool omani.homes) render 4 listeners.
Verification:
- kubectl kustomize clusters/_template/sovereign-tls/ → clean
- End-to-end simulation (single-zone, two-zone) renders correct
listener counts (2 / 4) with correct certificateRefs per zone.
- Listener naming `https-<sanitised>` / `http-<sanitised>` is unique
per listener so Gateway controller programs them all (duplicate
names produce Conflicting status condition).
Files:
- clusters/_template/sovereign-tls/cilium-gateway.yaml (scalar
listeners placeholder + comment block explaining the why)
- infra/hetzner/main.tf (locals.parent_domains_decoded +
locals.parent_domains_listeners_yaml; threaded into primary CP and
secondary regions' templatefile() calls)
- infra/hetzner/cloudinit-control-plane.tftpl (PARENT_DOMAINS_LISTENERS_YAML
substitute var in sovereign-tls Kustomization block)
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1633 wired CreateOrg to publish `tenant.sandbox_requested` when the
marketplace cart includes the sandbox product. Nobody was subscribing —
the event landed in NATS `catalyst.tenant.sandbox_requested` and aged
out unread, so no Sandbox CR (PR #1622) was ever minted and the
customer sat on a "Provisioning…" spinner forever.
This slice closes the loop. A new SandboxOrchestrator in tenant-service:
- Subscribes via events.MultiSubscriber (PR #1636) to the canonical
NATS subject + legacy Kafka topic.
- Parses {tenant_id, org_slug, owner_id, owner_email, agents,
sovereign, requested_at} and resolves the owner email (event field
→ store.GetMemberEmail → owner_id fallback).
- Materialises a Sandbox CR in catalyst-system (SANDBOX_NAMESPACE
override) via a dynamic client, with spec per architecture §7:
owner.email + owner.orgRef.slug, default quota (4 CPU / 8 Gi /
50 Gi / 3 sessions), spec.agentCatalogue from the cart.
- Idempotent: Get-then-Create with AlreadyExists swallowed so NATS
redeliveries + duplicate marketplace submits stay no-ops; the
sandbox-controller remains SoR for spec mutations.
Wiring in main.go is best-effort: when neither in-cluster config nor
KUBECONFIG is available (CI / dev loops) the orchestrator is skipped
with a Warn; the rest of the tenant service still boots.
Hard rules: no chart bump, no cluster writes outside of the Sandbox
Create call (sandbox-controller reconciles the rest), `go build ./...`
clean, `go test ./...` clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the Wave 1b stub that echoed the inbound PAT verbatim with a
real HS256 mint flow the sandbox-controller can call when it rolls out
a fresh Sandbox Pod.
Handler (platform/newapi/internal/handler/sandbox_token.go):
- Caller auth: shared admin-secret bearer (env NEWAPI_ADMIN_SECRET),
constant-time compared. 401 on mismatch / missing bearer.
- Request body: {org_id, user_id, sandbox_id, allowed_channels[]}.
De-duplicates + scrubs empty channel names so a controller bug
sending [""] can't mint a token that NewAPI silently treats as
"no restriction".
- Mints HS256 JWT signed with NEWAPI_TOKEN_SIGNING_KEY. Claim shape:
{sub: sandbox_id, org: org_id, user: user_id, channels: [...],
iat, exp: iat+7d, typ: "sandbox"}.
- Returns {token, expires_at}.
- Refuses with 503 when SigningKey or AdminSecret is unset
(visible chart-wiring gap, not a forgeable-token leak).
- Removes the previous Claims/jwt.Parse PAT-validation path that
came with the stub — caller is the controller, not an operator.
- NewHandlerFromEnv() factory loads + validates env at process
start so catalyst-api can fail loudly instead of shipping the
endpoint silently.
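For illustration, the channel scrub and the HS256 mint can be sketched with the standard library alone. This is a sketch under assumptions, not the handler's code: `sandboxClaims`, `scrubChannels`, `mintToken` and `verifyToken` are hypothetical names, and the real handler may well use a JWT library instead of hand-rolled signing.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// sandboxClaims mirrors the claim shape listed above.
type sandboxClaims struct {
	Sub      string   `json:"sub"`
	Org      string   `json:"org"`
	User     string   `json:"user"`
	Channels []string `json:"channels"`
	Iat      int64    `json:"iat"`
	Exp      int64    `json:"exp"`
	Typ      string   `json:"typ"`
}

// scrubChannels drops empty names and duplicates, preserving first-seen order.
func scrubChannels(in []string) []string {
	seen := make(map[string]bool, len(in))
	out := make([]string, 0, len(in))
	for _, c := range in {
		if c == "" || seen[c] {
			continue
		}
		seen[c] = true
		out = append(out, c)
	}
	return out
}

// signHS256 returns the base64url HMAC-SHA256 of the signing input.
func signHS256(key []byte, signingInput string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(signingInput))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// mintToken builds header.payload.signature and refuses an empty key.
func mintToken(key []byte, c sandboxClaims) (string, error) {
	if len(key) == 0 {
		return "", errors.New("signing key unset")
	}
	header := base64.RawURLEncoding.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`))
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	input := header + "." + base64.RawURLEncoding.EncodeToString(payload)
	return input + "." + signHS256(key, input), nil
}

// verifyToken recomputes the signature and compares it in constant time.
func verifyToken(key []byte, token string) bool {
	i := strings.LastIndex(token, ".")
	if i < 0 {
		return false
	}
	return hmac.Equal([]byte(signHS256(key, token[:i])), []byte(token[i+1:]))
}

func main() {
	now := time.Now()
	channels := scrubChannels([]string{"", "qwen3.6-bankdhofar", "qwen3.6-bankdhofar"})
	tok, err := mintToken([]byte("demo-signing-key"), sandboxClaims{
		Sub: "sandbox-uid", Org: "acme", User: "alice@example.com",
		Channels: channels, Iat: now.Unix(),
		Exp: now.Add(7 * 24 * time.Hour).Unix(), Typ: "sandbox",
	})
	if err != nil {
		fmt.Println("mint failed:", err)
		return
	}
	fmt.Println(len(channels), verifyToken([]byte("demo-signing-key"), tok))
}
```

Scrubbing before the emptiness check matters: an input of [""] collapses to an empty list, so the caller can reject it with 400 rather than mint an unrestricted token.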
Unit tests (sandbox_token_test.go) — 11 cases:
- happy path (mint + claim shape + signature round-trip)
- de-dup + empty-channel scrub
- admin-secret mismatch / missing bearer → 401
- missing org_id / user_id / sandbox_id / empty channels → 400
- non-POST → 405
- unset env → 503
- mintSandboxToken empty-secret guard + round-trip
- response does not echo admin secret or signing key
Chart wiring (platform/newapi/chart):
- New Secret template sandbox-token-signing-key-secret.yaml
auto-renders with Helm `lookup` + helm.sh/resource-policy: keep
(same load-bearing pattern as credentials-secret.yaml #943 and
gitea admin-secret.yaml #830 Bug 2). 64-char alphanumeric values
for both SIGNING_KEY and ADMIN_SECRET; persistence across
reconciles is required because a reconcile-time rotation would
silently invalidate every per-Sandbox token across the Sovereign
AND break the sandbox-controller's auth path until its Pod
restarts.
- values.yaml block sandboxTokenSigningKey.{existingSecret,
autoProvision, autoSecretName} matching the `credentials`
convention (operator override > auto-provision > skip-render).
- No Chart.yaml bump — chart value addition only.
Verification:
- go build ./platform/newapi/internal/handler/... — clean
- go test ./platform/newapi/internal/handler/... — 11/11 PASS
- helm template platform/newapi/chart — Secret renders
How sandbox-controller will use it:
1. Read NEWAPI_ADMIN_SECRET from mounted Secret newapi-token-signing-key.
2. POST /admin/tokens/sandbox with bearer + body
{org_id: <Sandbox.spec.owner.orgRef.slug>,
user_id: <Sandbox.spec.owner.email>,
sandbox_id: <Sandbox.metadata.uid>,
allowed_channels: ["qwen3.6-bankdhofar"]}.
3. Write returned token into Secret/sandbox-<uid>-newapi-token.
4. Mount that Secret into the Sandbox Pod as LLM_GATEWAY_TOKEN.
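Step 2 of the list above might look roughly like this on the controller side. The sketch only builds the request; `mintRequest`, `buildMintRequest` and the base URL are hypothetical, while the path and body field names come from this commit.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// mintRequest mirrors the request body {org_id, user_id, sandbox_id, allowed_channels[]}.
type mintRequest struct {
	OrgID           string   `json:"org_id"`
	UserID          string   `json:"user_id"`
	SandboxID       string   `json:"sandbox_id"`
	AllowedChannels []string `json:"allowed_channels"`
}

// buildMintRequest prepares the POST with the admin-secret bearer.
// It returns the encoded body as well so callers can log or inspect it.
func buildMintRequest(baseURL, adminSecret string, in mintRequest) (*http.Request, []byte, error) {
	body, err := json.Marshal(in)
	if err != nil {
		return nil, nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/admin/tokens/sandbox", bytes.NewReader(body))
	if err != nil {
		return nil, nil, err
	}
	req.Header.Set("Authorization", "Bearer "+adminSecret)
	req.Header.Set("Content-Type", "application/json")
	return req, body, nil
}

func main() {
	// The base URL below is a placeholder, not the real Service address.
	req, body, err := buildMintRequest("http://newapi.example.svc:3000", "demo-admin-secret", mintRequest{
		OrgID:           "acme",
		UserID:          "alice@example.com",
		SandboxID:       "uid-1",
		AllowedChannels: []string{"qwen3.6-bankdhofar"},
	})
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println(string(body))
}
```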
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the catalyst-api backend the Sandbox FE (PR #1621 — getSandboxes /
createSandbox / getByosStatus in sandbox.api.ts) has been calling into.
Without this handler the /sandbox surface on the Sovereign Console rendered
its empty state forever — every getSandboxes() 404'd at the catalyst-api
ingress and every "Start a session" click hit the same wall.
Handler — products/catalyst/bootstrap/api/internal/handler/sandbox_sessions.go
- GET /api/v1/sandbox/sessions — list Sandbox CRs in the
operator's Org namespace
- POST /api/v1/sandbox/sessions — create Sandbox CR with agent
validated against the 6-agent
catalogue (aider / claude-code /
cursor-agent / little-coder /
opencode / qwen-code)
- GET /api/v1/sandbox/sessions/{id} — fetch single Sandbox detail
- DELETE /api/v1/sandbox/sessions/{id} — graceful delete (the controller
fires finalizers + cleans up
the per-Sandbox vcluster
namespace + PVCs + RBAC)
Client resolution mirrors the Family E compliance + k8s_resource_actions.go
seam: k8sCache.Factory.DynamicClientFor(resolveChrootClusterID("")) is the
primary path; sovereignDepsFor() — rest.InClusterConfig() — is the chroot
in-cluster fallback per feedback_chroot_in_cluster_fallback.md. Both return
503 when unavailable so the FE renders its "API pending" pill rather than a
spinner.
Org-scoping uses claims.Org (the org_id Keycloak claim PR #1619 lit up)
for the CR namespace + spec.owner.orgRef.slug. Single-tenant chroots
without an org_id fall back through CATALYST_SANDBOX_DEFAULT_NAMESPACE
to a sensible default per docs/INVIOLABLE-PRINCIPLES.md #4. Wave-1 quota
defaults (4 CPU / 8Gi memory / 50Gi storage / 3 concurrent sessions)
mirror products/sandbox/docs/architecture.md §7 — the FE doesn't yet
expose a quota picker.
Status projection: CRD vocabulary (Pending|Provisioning|Ready|Failed)
maps to FE vocabulary (pending|running|stopped|failed|unknown) in
mapSandboxStatus so a fresh Sandbox shows the spinner rather than
"unknown" until the controller catches up.
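A minimal sketch of such a projection follows. The function name matches the one mentioned above, but the exact pairing of phases is an assumption: the commit only states the two vocabularies and that a fresh (status-less) Sandbox should show the spinner.

```go
package main

import "fmt"

// mapSandboxStatus projects a CRD phase onto the FE vocabulary.
// Assumed pairing: an empty phase (controller has not written status yet),
// Pending and Provisioning all map to "pending" so the FE shows its spinner.
// None of the four listed CRD phases maps to "stopped" in this sketch.
func mapSandboxStatus(phase string) string {
	switch phase {
	case "", "Pending", "Provisioning":
		return "pending"
	case "Ready":
		return "running"
	case "Failed":
		return "failed"
	default:
		return "unknown"
	}
}

func main() {
	for _, p := range []string{"", "Pending", "Provisioning", "Ready", "Failed", "Terminating"} {
		fmt.Printf("%q -> %s\n", p, mapSandboxStatus(p))
	}
}
```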
k8sCache.DefaultKinds — products/catalyst/bootstrap/api/internal/k8scache/kinds.go
- Adds sandbox.openova.io/v1 Sandbox so the generic /k8s/{kind} surface
enumerates Sandboxes the same way it does Applications + UserAccess.
Per feedback_chroot_in_cluster_fallback.md every new GVR here needs a
matching rule on the cutover-driver SA.
Cutover-driver RBAC — products/catalyst/chart/templates/clusterrole-cutover-driver.yaml
- Adds sandboxes.sandbox.openova.io with verbs split per
feedback_rbac_create_no_resourcenames.md:
rule 1: ["create"]
rule 2: ["get","list","watch","delete"]
- Read-only on status (the controller owns status); write is spec-only
on POST + the apiserver delete on DELETE.
Routes — products/catalyst/bootstrap/api/cmd/api/main.go
- Registered inside the RequireSession group alongside the existing
/api/v1/sandbox/byos/claude-code/* surface; same auth gate, same
patternless leading "/api/v1/sandbox/...".
Verified: go build clean, go vet clean, k8scache test suite green
(2.7s), helm template renders the new RBAC block.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1626 wired the PUBLISH leg of tenant + billing to NATS via
events.MultiPublisher (canonical subject `catalyst.<event.Type>` per
ADR-0001 §6). The CONSUME leg stayed Kafka-only — provisioning,
notification, domain, billing's tenant-events cascade, AND tenant's own
provision-events + members-cleanup consumers all called
events.NewConsumer(redpandaBrokers, …). On Sovereigns REDPANDA_BROKERS
is empty by design (no Redpanda exists; NATS is the canonical bus per
the convergence-fix block in configmap.yaml) so those consumers either
never started OR dialed `localhost:9092` in a hot crash loop.
Net effect on every Sovereign install pre-this-PR:
1. alice POSTs /sme/tenants → tenant publishes catalyst.tenant.created
to NATS (PR #1626).
2. provisioning's only subscriber was Kafka-only → silent drop.
3. No Organization CR ever spawned → no vCluster → CONVERGENCE BROKEN.
This change introduces a symmetric subscribe-side abstraction mirroring
bridge.go's MultiPublisher:
- events.BrokerSubscriber: unified Subscribe(ctx, handler) interface,
satisfied by *Consumer, *DLQSubscriber, *MultiSubscriber.
- events.MultiSubscriber: fans in from NATS JetStream durable
consumers (one per canonical subject) + an optional legacy Kafka
Consumer. NewMultiSubscriber refuses to construct with both legs
nil (the silent-no-op pattern this PR exists to prevent).
- events.NATSConn.ensureSMEStream: idempotently creates the
CATALYST_SME Stream filtering `catalyst.>` so the first consumer
on a fresh Sovereign bootstraps the stream's lifecycle.
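The shape of the subscribe-side abstraction can be sketched as below. The type names follow the commit, but the signatures are assumptions: the real Handler type, the constructor arguments and the error handling are not shown in this log, and `staticLeg` and `drain` exist only to exercise the sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Handler is an assumed callback shape for delivered events.
type Handler func(ctx context.Context, subject string, payload []byte) error

// BrokerSubscriber is the unified subscribe interface.
type BrokerSubscriber interface {
	Subscribe(ctx context.Context, h Handler) error
}

// MultiSubscriber fans in from every configured leg.
type MultiSubscriber struct{ legs []BrokerSubscriber }

// NewMultiSubscriber refuses to build a subscriber with no transport at all.
// Callers must pass a literal nil for a missing leg; a typed nil pointer
// wrapped in the interface would not compare equal to nil here.
func NewMultiSubscriber(nats, kafka BrokerSubscriber) (*MultiSubscriber, error) {
	var legs []BrokerSubscriber
	if nats != nil {
		legs = append(legs, nats)
	}
	if kafka != nil {
		legs = append(legs, kafka)
	}
	if len(legs) == 0 {
		return nil, errors.New("multi-subscriber: both legs are nil")
	}
	return &MultiSubscriber{legs: legs}, nil
}

// Subscribe runs every leg concurrently and joins their errors.
func (m *MultiSubscriber) Subscribe(ctx context.Context, h Handler) error {
	var wg sync.WaitGroup
	errs := make([]error, len(m.legs))
	for i, leg := range m.legs {
		wg.Add(1)
		go func(i int, leg BrokerSubscriber) {
			defer wg.Done()
			errs[i] = leg.Subscribe(ctx, h)
		}(i, leg)
	}
	wg.Wait()
	return errors.Join(errs...)
}

// staticLeg is a fake transport that delivers a fixed batch and returns.
type staticLeg struct {
	subject string
	msgs    []string
}

func (s staticLeg) Subscribe(ctx context.Context, h Handler) error {
	for _, m := range s.msgs {
		if err := h(ctx, s.subject, []byte(m)); err != nil {
			return err
		}
	}
	return nil
}

// drain counts the events delivered across all legs.
func drain(ms *MultiSubscriber) (int, error) {
	var mu sync.Mutex
	n := 0
	err := ms.Subscribe(context.Background(), func(context.Context, string, []byte) error {
		mu.Lock()
		defer mu.Unlock()
		n++
		return nil
	})
	return n, err
}

func main() {
	if _, err := NewMultiSubscriber(nil, nil); err != nil {
		fmt.Println("refused:", err)
	}
	ms, _ := NewMultiSubscriber(
		staticLeg{"catalyst.tenant.created", []string{"a", "b"}},
		staticLeg{"sme.tenant.events", []string{"c"}},
	)
	fmt.Println(drain(ms))
}
```

Because the handler is invoked from several goroutines, it has to be safe for concurrent use, which is why drain guards its counter with a mutex.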
Each service's main.go now constructs a MultiSubscriber and passes it
to the consumer dispatch loop. Consumer signatures take
events.BrokerSubscriber instead of *events.Consumer (interface upcast,
so *events.Consumer call sites keep working on Catalyst-Zero):
- provisioning: tenant.created / tenant.deleted /
tenant.app_install_requested / tenant.app_uninstall_requested /
order.placed (the 5 subjects PR #1626 publishes to NATS).
Also wires MultiPublisher so provision.* publishes hit NATS too —
downstream tenant + notification consumers need them.
- notification: full fan-in (user.login, order.placed,
payment.received, provision.*, domain.*, member.invited).
- domain: tenant.deleted (subdomain + BYOD reclamation cascade).
- billing: tenant.deleted (Stripe sub-cancel + invoice void + ledger
marker cascade). Existing metering NATS subscriber unaffected.
- tenant: provision.* + tenant.deleted (members cleanup).
Now reachable on Sovereigns; pre-this-PR they were inside the
`if redpandaBrokersRaw != ""` block.
Chart wiring: NATS_URL env added to provisioning, notification, and
domain Deployments (tenant + billing already wired via PR #1626).
notification.yaml also flips its hardcoded REDPANDA_BROKERS literal to
the shared ConfigMap key so the per-topology default (empty on
Sovereigns, talentmesh redpanda on Catalyst-Zero) applies.
Verification:
- go build ./core/services/{shared,tenant,billing,provisioning,
notification,domain}/... clean.
- go test ./... clean across all 6 modules.
- helm template with global.sovereignFQDN=test.example.com renders
NATS_URL="nats://nats-jetstream.nats-system.svc.cluster.local:4222"
into all 5 Deployments + ConfigMap.
- helm template without sovereignFQDN renders NATS_URL="" and
REDPANDA_BROKERS=talentmesh redpanda, matching Catalyst-Zero.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the Sandbox product to the marketplace storefront so a customer
picks it off marketplace.<sov>/apps the same way they pick WordPress /
Nextcloud. Card chrome is the existing .app-card shape verbatim — no
new components per the design-system inheritance rule. The detail page
gains a 6-agent picker (aider, claude-code, cursor-agent, little-coder,
opencode, qwen-code) using the existing .related-card chrome with a
picked state mirroring .app-card.in-cart. Picks land on cart.agents
and travel through checkout into the tenant create-org payload.
Tenant-service emits a sibling `tenant.sandbox_requested` event on
sme.tenant.events when the cart contains the sandbox product. The
event carries org slug + owner + agents list, sufficient for the
sandbox-controller (or its upstream orchestrator) to mint a Sandbox
CR with matching spec.agentCatalogue. The Organization CR creation
path is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1622 shipped the sandbox-controller binary + chart, and PR #1618
shipped pty-server + mcp-server scaffolds, but neither came with CI
build workflows — meaning the chart's image.repository points at a
GHCR package that no workflow ever publishes (ImagePullBackOff on
every install). Per docs/INVIOLABLE-PRINCIPLES.md #4a every runtime
image MUST be produced by a GitHub Actions workflow from a committed
git SHA; this PR closes that gap.
Three new workflows, all event-driven (push paths-filter + PR +
workflow_dispatch, no cron):
- build-sandbox-controller.yaml — mirrors build-application-controller
(shared core/controllers go.mod, go vet + race tests, Buildx push,
cosign keyless sign, SBOM attest, auto-bump platform/sandbox/chart/
values.yaml image.tag back to main so the next install picks up the
SHA-pinned image without operator action).
- build-sandbox-pty-server.yaml — separate go module under
products/sandbox/pty-server (own go.mod/go.sum), Dockerfile uses
COPY . . so build context is the server directory. Same Buildx +
cosign + SBOM flow as the controller. No values.yaml bump yet:
Wave-2 wiring of the StatefulSet template will land in a follow-up.
- build-sandbox-mcp-server.yaml — stdlib-only stdio MCP sidecar
(no go.sum yet), same shape as pty-server.
Per `feedback_no_mvp_no_workarounds.md` rule 1 (target-state, never
"manual follow-up bump") the controller workflow auto-bumps the chart
values.yaml so a Sovereign overlay flipping `enabled: true` Just Works.
Per the user's hard rule for this PR, no Chart.yaml bump and no
blueprint-release dispatch — the Sandbox chart's publication cadence
is gated by Wave-2 readiness, not per-image builds.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1625 shipped the /api/v1/sme/billing/vouchers/* proxies but the
SME gateway (core/services/gateway/proxy.go) rejects RS256 outright
— it only accepts HS256 signed with sme-secrets/JWT_SECRET. Result
on every fresh Sovereign: operator clicks on /bss/vouchers returned a
silent 401 with no upstream audit trail.
This commit ships the bridge:
- core/services/shared/auth/mint_sme.go (new)
- MintSMEAccessToken(secret, sub, email, role) → 5-min HS256 JWT
in the wire shape billing's requireVoucherIssuer expects.
- SMERoleFor(realmRoles, tier) → maps Keycloak roles + tier claim
onto SME vocab (superadmin | sovereign-admin | member).
- Pure, no IO, fully unit-tested (mint_sme_test.go).
- products/catalyst/bootstrap/api/internal/handler/sme_billing_vouchers.go
- proxySMEVoucher now mints a fresh HS256 token per upstream hop
from the operator's already-validated RS256 session claims and
forwards that as Bearer to the SME gateway. The operator's RS256
Authorization header is no longer leaked upstream.
- Unwired bridge (CATALYST_SME_JWT_SECRET empty) surfaces 503
`sme-jwt-bridge-unwired` instead of the silent 401.
- products/catalyst/bootstrap/api/internal/handler/handler.go
- h.smeJWTSecret field + SetSMEJWTSecret(secret) setter.
- products/catalyst/bootstrap/api/cmd/api/main.go
- Reads CATALYST_SME_JWT_SECRET on startup and wires it.
- Log line includes byte count only (never the secret value, per
INVIOLABLE-PRINCIPLES.md #10).
- products/catalyst/chart/templates/api-deployment.yaml
- New env CATALYST_SME_JWT_SECRET sourced from sme-secrets/JWT_SECRET
in the same namespace (catalyst-system). optional: true so
Sovereigns without marketplace surface a 503 rather than
CreateContainerConfigError.
- products/catalyst/chart/templates/sme-services/sme-secrets.yaml
- emberstack/reflector annotation block mirroring sme-secrets
from `sme` ns into `catalyst-system` (Kubernetes secretKeyRef
is same-namespace-only). Same pattern as cnpg-cluster.yaml
and provisioning-github-token.yaml.
Operator-visible behaviour: the bridge is transparent on the happy
path (operator with sovereign-admin tier on a Sovereign with
marketplace enabled clicks /bss/vouchers → list returns). On the
unhappy paths the operator now sees a real status code:
- 503 sme-jwt-bridge-unwired (chart wire missing) — actionable
- 503 sme-gateway-unreachable (DNS NXDOMAIN) — pre-existing
- 403 from billing's requireVoucherIssuer (role insufficient)
— was silent 401 before, now propagates the real authz result.
Tests: core/services/shared/auth `go test ./...` PASS. catalyst-api
`go build ./...` PASS.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sandbox Wave 4 retry. Slot 80 (bp-newapi) already exists in the
_template bootstrap-kit but ships the qwenBankDhofar channel
hard-coded to `enabled: false` with empty endpoint — so every
franchised Sovereign came up without an LLM channel and sandbox
agents fell back to mothership newapi, defeating per-Sovereign
sandboxing.
Wire the qwenBankDhofar channel to the same envsubst flag the
Catalyst control plane uses (`${MARKETPLACE_ENABLED:-false}`)
and default the endpoint to the canonical first-otech relay
(`https://llm-api.omtd.bankdhofar.com`) with override via
`${LLM_BANK_DHOFAR_BASE_URL}`. API key is still pulled from the
`newapi-channel-qwen-bankdhofar` Secret (cloud-init or
ExternalSecret per existing chart contract).
No chart bump — chart 1.4.6 (slot 80) already supports gating
qwenBankDhofar via .Values.defaultChannels.qwenBankDhofar.enabled
and reading endpoint/secret from those values. Only the
bootstrap-kit overlay was wired with the wrong defaults.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wire CATALYST_OTECH_INGRESS_IPV4 from the sovereign-fqdn ConfigMap key
`lbIP` so DefaultSMETenantDNSProvisioner.ProvisionFreeSubdomain (already
implemented in sme_tenant_dns.go) actually receives the Sovereign's LB
IP at run time. Without this env, ProvisionFreeSubdomain has been
returning `errors.New("otech ingress IPv4 unconfigured")` — silently —
on every Sovereign tenant signup, so the per-tenant A records for
`console|wordpress|openclaw|mail|keycloak.<slug>.<pool>` were never
PATCHed into PowerDNS, leaving the pool zone's apex/wildcard delegation
to point new tenants at the mothership IP (49.12.16.160) instead of the
correct Sovereign LB.
Same plumbing pattern as SOVEREIGN_LB_IP a few lines above (sourced from
the same ConfigMap, same key). Per-tenant approach (not a single
`*.<pool>` wildcard) is required because: (a) each tenant gets five
distinct host records, (b) the pool zone hosts records for the Sovereign
itself, so a blanket wildcard would shadow legitimate Sovereign-owned
subdomains, and (c) the reconciler is already there — only the env wire
is missing.
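To make the per-tenant record set concrete, here is a small sketch of how the five A records could be derived from slug, pool and LB IP. `tenantARecords`, `aRecord` and `tenantHosts` are hypothetical names; the real provisioner in sme_tenant_dns.go is not reproduced here, only the error text and the five host labels come from this commit.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// tenantHosts are the five per-tenant host labels named in the commit.
var tenantHosts = []string{"console", "wordpress", "openclaw", "mail", "keycloak"}

// aRecord is a hypothetical record shape for this sketch.
type aRecord struct {
	Name string
	IPv4 string
}

// tenantARecords builds <host>.<slug>.<pool> A records pointing at the
// Sovereign LB. It fails fast when the ingress IP is missing or malformed
// instead of letting new tenants resolve through the pool zone's wildcard.
func tenantARecords(slug, pool, ingressIPv4 string) ([]aRecord, error) {
	if ingressIPv4 == "" {
		return nil, errors.New("otech ingress IPv4 unconfigured")
	}
	ip := net.ParseIP(ingressIPv4)
	if ip == nil || ip.To4() == nil {
		return nil, fmt.Errorf("invalid ingress IPv4 %q", ingressIPv4)
	}
	records := make([]aRecord, 0, len(tenantHosts))
	for _, h := range tenantHosts {
		records = append(records, aRecord{Name: h + "." + slug + "." + pool, IPv4: ip.String()})
	}
	return records, nil
}

func main() {
	// 203.0.113.10 is a documentation address, used here as a placeholder.
	recs, err := tenantARecords("acme", "pool.example", "203.0.113.10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range recs {
		fmt.Println(r.Name, "A", r.IPv4)
	}
}
```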
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /billing/checkout was 503'ing with "payment processor is not
configured" on Sovereigns that have not pasted Stripe keys yet — even
when the customer's credit balance (from a fresh voucher redemption
in the same request, or a prior balance) fully covered the order
total. Make the credit-only short-circuit explicit: compute
`remainingOMR := totalOMR - creditBalance` and settle via
CreditOnlyCheckout when `remainingOMR <= 0`, BEFORE any Stripe settings probe.
This is the path that has to keep working during the voucher-only
weeks of a new Sovereign.
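The ordering is the whole fix, so a sketch may help. `settle`, `checkoutResult` and the `stripeConfigured` probe are hypothetical stand-ins for the handler, CreditOnlyCheckout and the settings lookup; amounts are plain integers for simplicity.

```go
package main

import (
	"errors"
	"fmt"
)

var errProcessorUnconfigured = errors.New("payment processor is not configured")

// checkoutResult is a hypothetical outcome shape for this sketch.
type checkoutResult struct {
	PaidByCredit   bool
	LeftoverCredit int64
	ChargeOMR      int64
}

// settle decides the payment path. The credit check runs first; the Stripe
// probe is only consulted when credit does not cover the order total.
func settle(totalOMR, creditBalance int64, stripeConfigured func() bool) (checkoutResult, error) {
	remainingOMR := totalOMR - creditBalance
	if remainingOMR <= 0 {
		return checkoutResult{PaidByCredit: true, LeftoverCredit: -remainingOMR}, nil
	}
	if !stripeConfigured() {
		return checkoutResult{}, errProcessorUnconfigured
	}
	return checkoutResult{ChargeOMR: remainingOMR}, nil
}

func main() {
	noStripe := func() bool { return false }
	fmt.Println(settle(50, 50, noStripe))   // voucher covers the plan exactly
	fmt.Println(settle(100, 200, noStripe)) // standing balance leaves 100 over
	fmt.Println(settle(100, 20, noStripe))  // credit short, processor missing
}
```

With the probe deferred, a Sovereign without Stripe keys only fails orders that actually need a card charge.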
Adds checkout_test.go covering two regression paths:
- fresh-voucher path: customer with 0 credit redeems WELCOME50
against a 50-OMR plan → 200 + paid_by_credit:true, settings table
never probed (sqlmock asserts no unexpected queries).
- pre-existing-credit path: customer with 200-OMR standing balance
buys a 100-OMR plan, no promo_code in request → 200 +
paid_by_credit:true + 100-OMR leftover credit.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously, after a successful checkout on a Sovereign marketplace
(e.g. marketplace.t15.omantel.biz), the browser was redirected to
https://console.openova.io/nova which is the MOTHERSHIP console — so
the user was bounced off their own Sovereign and re-prompted to sign in
against the wrong identity provider. The same bug hit the "returning user"
auto-redirect in Layout.astro.
Root cause: CONSOLE_URL in core/marketplace/src/lib/config.ts and the
inline returning-user redirect in core/marketplace/src/layouts/
Layout.astro both hardcoded "https://console.openova.io/nova". The
marketplace pod is shared across mothership + every Sovereign (one
deployment, multiple ingress hostnames — see marketplace-routes.yaml
which fronts marketplace.<sov-fqdn> on Sovereigns), so a build-time
constant could never name the right console.
Fix: derive the console URL from window.location.hostname at runtime.
- marketplace.openova.io -> https://console.openova.io/nova (mothership, /nova prefix preserved)
- marketplace.<sov-fqdn> -> https://console.<sov-fqdn> (Sovereign — Cilium Gateway *.<sov-fqdn> wildcard route, NO /nova)
- partner hosts + dev -> mothership fallback (skipConsoleRedirect tenants don't reach this path anyway)
Implemented twice in lockstep — once in src/lib/config.ts for the
Svelte components that use consoleHref(), once inline in
src/layouts/Layout.astro because the returning-user redirect must fire
before the Svelte bundle loads.
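The hostname rules above are small enough to state as a pure function. The shipped code is TypeScript running in the browser; the sketch below restates the same three rules in Go purely for illustration. `consoleURL` is a hypothetical name, and treating every non-mothership `marketplace.` host as a Sovereign is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

const mothershipConsole = "https://console.openova.io/nova"

// consoleURL maps the marketplace hostname to the console that owns it.
func consoleURL(hostname string) string {
	host := strings.ToLower(hostname)
	if host == "marketplace.openova.io" {
		return mothershipConsole // mothership keeps the /nova prefix
	}
	const prefix = "marketplace."
	if strings.HasPrefix(host, prefix) && len(host) > len(prefix) {
		// Sovereign: console.<sov-fqdn>, no /nova prefix.
		return "https://console." + strings.TrimPrefix(host, prefix)
	}
	// Partner hosts and dev fall back to the mothership.
	return mothershipConsole
}

func main() {
	for _, h := range []string{"marketplace.openova.io", "marketplace.t15.omantel.biz", "localhost"} {
		fmt.Println(h, "->", consoleURL(h))
	}
}
```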
Test: npm run build in core/marketplace clean (9 pages, 0 warnings).
Inline detector verified present in dist/checkout/index.html +
dist/index.html.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tenant + billing services hardcoded a franz-go Kafka publisher
pointing at REDPANDA_BROKERS. On Sovereigns there is NO Redpanda in
cluster — only NATS JetStream at
nats-jetstream.nats-system.svc.cluster.local:4222 — so every
tenant.created / tenant.deleted / order.placed event was silently
dropped, blocking provisioning + downstream consumers and stalling
the convergence chain end to end.
Per ADR-0001 §6 the canonical event bus is NATS JetStream with
subject convention `catalyst.<domain>.<event>`. This change:
- Adds events.BrokerPublisher + events.MultiPublisher that fan out
to NATS (`catalyst.<event.Type>` derived from Event.Type) and the
legacy Redpanda topic in one call. Either transport may be nil;
the constructor refuses to build a no-op publisher (the exact
silent-failure mode we just hit).
- Adds NATSConn.PublishEvent so the generic Event envelope can flow
over the same JetStream connection used for the metering
subscriber (#798), with Event.ID as the JetStream Msg-Id for
broker-side de-dup.
- Updates tenant + billing main.go to read NATS_URL +
REDPANDA_BROKERS independently, construct the appropriate
transports, and wire MultiPublisher into the Handler. Legacy
Kafka consumers only start when REDPANDA_BROKERS is non-empty
so the pods no longer crashloop dialling localhost:9092 on
Sovereigns.
- Updates chart templates to inject NATS_URL into both tenant and
billing Deployments. ConfigMap default for NATS_URL on Sovereigns
is nats://nats-jetstream.nats-system.svc.cluster.local:4222
(fixes the existing bug where defaults pointed at the wrong
namespace `nats-jetstream` — NATS actually lives in `nats-system`
per clusters/_template/bootstrap-kit/07-nats-jetstream.yaml).
- Sovereign default of REDPANDA_BROKERS is now empty (was the wrong
NATS URL stuffed into a Kafka env, which made franz-go fail every
dial).
Subject mapping per CanonicalSubject:
tenant.created → catalyst.tenant.created
tenant.deleted → catalyst.tenant.deleted
tenant.app_install_requested → catalyst.tenant.app_install_requested
order.placed → catalyst.billing.order.placed
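One way to express the table above in code is a default prefix plus an override map. This is only a sketch: how the real CanonicalSubject decides is not shown in this log, and `canonicalSubject` and `subjectOverrides` are hypothetical names. The override exists because order.placed lands under catalyst.billing rather than directly under catalyst.

```go
package main

import "fmt"

// subjectOverrides holds event types whose subject is not simply
// "catalyst." + type. Only the entry visible in the mapping is listed.
var subjectOverrides = map[string]string{
	"order.placed": "catalyst.billing.order.placed",
}

// canonicalSubject derives the NATS subject for an event type.
func canonicalSubject(eventType string) string {
	if s, ok := subjectOverrides[eventType]; ok {
		return s
	}
	return "catalyst." + eventType
}

func main() {
	for _, t := range []string{
		"tenant.created",
		"tenant.deleted",
		"tenant.app_install_requested",
		"order.placed",
	} {
		fmt.Println(t, "->", canonicalSubject(t))
	}
}
```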
Test:
go build ./... in shared/, tenant/, billing/ (clean)
go test ./events/... ./handlers/... in all three (existing + new
bridge_test.go pass)
helm template with global.sovereignFQDN set renders NATS_URL in
both Deployments + REDPANDA_BROKERS="" in ConfigMap
helm template without global.sovereignFQDN renders the legacy
Redpanda broker (Catalyst-Zero contabo path remains intact)
NATS-side consumers for sme.tenant.events / sme.provision.events ship
in a follow-up PR per the ADR-0001 §6 migration plan; this PR only
unblocks the publish leg which is the immediate convergence blocker.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave 6 PR #1609 shipped the BSS Vouchers FE
(products/catalyst/bootstrap/ui/src/lib/bss.api.ts —
listVouchers/issueVoucher/revokeVoucher) but never added the matching
catalyst-api proxy handlers. The /bss/vouchers page could render its
target-state chrome but every voucher action 404'd at the catalyst-api
ingress because no `/api/v1/sme/billing/vouchers/*` route existed.
This PR adds the catalyst-api → SME gateway proxy in
sme_billing_vouchers.go, mirroring the sme_billing_revenue.go /
sme_catalog_client.go patterns:
- GET /api/v1/sme/billing/vouchers/list
- POST /api/v1/sme/billing/vouchers/issue
- POST /api/v1/sme/billing/vouchers/revoke/{code} (task spec)
- DELETE /api/v1/sme/billing/vouchers/revoke/{code} (FE wire)
All four registered inside the existing RequireSession group in
cmd/api/main.go alongside the other /api/v1/sme/* routes. Upstream is
the SME gateway at http://gateway.sme.svc.cluster.local:8080 (override
via CATALYST_SME_GATEWAY_URL per docs/INVIOLABLE-PRINCIPLES.md #4),
which strips `/api` and forwards to core/services/billing/handlers/
vouchers.go (gated by requireVoucherIssuer — superadmin OR
sovereign-admin per docs/FRANCHISE-MODEL.md §3). The handler always
forwards revoke as DELETE so the billing service's
`DELETE /billing/vouchers/revoke/{code}` route matches.
The Authorization header is forwarded verbatim; status + body stream
through unchanged so the FE's listVouchers (which throws on non-2xx)
sees the upstream's real status. 503 + sme-gateway-unreachable on DNS
NXDOMAIN so a Sovereign with marketplace.enabled=false degrades
gracefully rather than 5xx-ing.
No chart bump. Build clean; only pre-existing whoami/user_access
test failures remain (unrelated to this surface — confirmed by
running the same tests on origin/main without this change).
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>