Splits the 20 EPIC-1 (#1096) compliance ClusterPolicy templates out of
bp-kyverno (engine umbrella chart) into a dedicated Blueprint
bp-kyverno-policies@1.0.0 with its own HelmRelease, ordered via HR-to-HR
dependsOn on bp-kyverno in the bootstrap-kit Kustomization.
WHY (the bug we're killing):
PR #1138 (2026-05-08) shipped 20 ClusterPolicy templates with
`enabled: false` defaults → dead-on-arrival for 11 days. PR #1933
(2026-05-19) flipped 18 defaults to `enabled: true` + bumped chart
1.1.0 → 1.2.0 + bumped the bootstrap-kit pin — but hit a CRD install-
ordering race on fresh prov t33: ClusterPolicy CRs (in
templates/policies/baseline/*.yaml) and Kyverno CRDs (in upstream
charts/crds/templates/) render in the SAME Helm pass, and the
apiserver's RESTMapper has not yet learned kyverno.io/v1.ClusterPolicy
when Helm applies the ClusterPolicy CRs. PR #1935 reverted ONLY the
bootstrap-kit pin (1.2.0 → 1.1.0) — chart source kept claiming policies
were on by default while the deployed pin pulled an engine-only artifact
with zero policies. "Theater on theater" — founder walk on t34 confirmed
GET /api/v1/sovereigns/<id>/compliance/policies returns `policyCount=0`,
only `useraccess-boundary` (from bp-crossplane-claims) was installed.
The structural fix is splitting the chart so the engine + CRDs reconcile
+ register first, THEN the policy chart applies its CRs cleanly. Audit
mode default = non-blocking (admission still passes, PolicyReport rows
populate). Operators flip individual policies to Enforce per-Sovereign
overlay or via EnvironmentPolicy.spec.compliance.modes (slice C2
controller path — separate work item).
CHANGES:
1. NEW chart `platform/kyverno-policies/chart/`:
- Chart.yaml: name=bp-kyverno-policies, version=1.0.0, no subchart deps
- values.yaml: `compliancePolicies:` block moved verbatim from bp-kyverno
(defaults: 18 enabled+Audit, 2 intentionally OFF — `hubbleFlowsSeen`
stub for W2 evaluator, `cosignVerified` until operator supplies PEM)
- templates/baseline/01-..20-*.yaml: 20 ClusterPolicy templates moved
via `git mv` (preserves blame; preserves PR #1933's 3 operator fixes
— regex_match JMESPath + operator: Equals for 11/12/19)
- tests/fixtures/: moved with the policies (fixtures reference policy
output, not engine output)
2. ENGINE chart `platform/kyverno/chart/`:
- Chart.yaml: 1.2.0 → 1.2.1 (policies removed; the chart source no longer
claims compliance content that the deployed artifact lacks)
- values.yaml: `compliancePolicies:` block deleted (now lives in
bp-kyverno-policies)
- templates/clusterpolicy-mutate-add-openova-labels.yaml + ...require-
openova-labels.yaml KEPT (engine-coupled mutating policies, EPIC-0
label-vocab E1/E2, defaults OFF — separate concern from EPIC-1
compliance library)
- Empty `templates/policies/` directory removed
3. NEW bootstrap-kit slot `clusters/_template/bootstrap-kit/27a-kyverno-
policies.yaml`:
- HelmRelease bp-kyverno-policies pinned at chart `1.0.0`
- HR-level `dependsOn: [bp-kyverno]` — same-kind, honored by Flux
(per docs/INVIOLABLE-PRINCIPLES.md #14, cross-kind HR→Kustomization
dependsOn is silently ignored, so we keep ordering at HR→HR within
the single bootstrap-kit Kustomization)
- targetNamespace: kyverno (same as engine — ClusterPolicy is cluster-
scoped but the umbrella overlay namespacing matches the engine)
- disableWait: true — Kyverno reports ClusterPolicy Ready asynchronously
so we don't want downstream HRs stalling on policy-level health
4. UPDATED `clusters/_template/bootstrap-kit/kustomization.yaml`:
- Added `27a-kyverno-policies.yaml` immediately after `27-kyverno.yaml`
5. BUMPED `clusters/_template/bootstrap-kit/27-kyverno.yaml`:
- Engine pin 1.1.0 → 1.2.1 (engine-only; install behavior identical
to 1.1.0 since policies + their values are no longer in this chart)
VALIDATION (Principle #15 — validate against fresh state, not stable state):
$ helm template bp-kyverno-policies platform/kyverno-policies/chart \
| grep -c '^kind: ClusterPolicy'
18
$ helm lint platform/kyverno-policies/chart && helm lint platform/kyverno/chart
==> 1 chart(s) linted, 0 chart(s) failed (both)
$ helm template bp-kyverno platform/kyverno/chart \
| grep -c '^kind: ClusterPolicy'
0 # engine no longer renders any ClusterPolicy CRs
$ helm package platform/kyverno-policies/chart
Successfully packaged → bp-kyverno-policies-1.0.0.tgz (20 templates)
CRD-race REPRODUCED locally without container runtime: applying the
rendered policy YAML to a cluster WITHOUT Kyverno CRDs returns
"no matches for kind \"ClusterPolicy\" in version \"kyverno.io/v1\"
ensure CRDs are installed first"
for every policy — proving the install-order fix is necessary.
Full `helm install` from-scratch on Kind blocked locally (no container
runtime on bastion); the Blueprint-Release CI workflow runs the full
`helm dependency build` + package + GHCR push pipeline AND a
`helm template` smoke render at publish time — that is the fresh-state
Helm install gate before any pin lands.
CI / GHCR (Principle #13):
Blueprint-Release workflow auto-detects `platform/kyverno-policies/chart/**`
and publishes `oci://ghcr.io/openova-io/bp-kyverno-policies:1.0.0`
on push to main. The slot pin in 27a-kyverno-policies.yaml is set to
`1.0.0` to match (auto-bump-pin step is a no-op when source version
already matches the slot pin).
DELIBERATELY OUT OF SCOPE:
- W2 Go evaluator for `hubble-flows-seen` (stub stays a no-op)
- Cosign publicKey supply path for `cosign-verified`
- Per-Environment EnvironmentPolicy.spec.compliance.modes enforcement
flip controller
- Score-aggregator weight defaults configuration UI
- `useraccess-boundary` (lives in bp-crossplane-claims, unchanged)
This does NOT close #1096. The EPIC remains open until a fresh-prov walk
shows `kubectl get clusterpolicies -A` returning the 18 baseline policies
+ useraccess-boundary, plus the AppDetail Compliance tab rendering
non-zero policyCount. Founder closes #1096 after that walk.
Refs #1096, Refs #2019, Refs #1929, Refs #1936
TBD-A69. PR #2005 fixed build-organization-controller.yaml only. The
other six controller workflows (application, blueprint, continuum,
environment, sandbox, useraccess) had the same gaps that caused the
#1997 18h deploy gap:
- application-controller: missing pkg/** in path filter (auto-bump
already present from earlier work).
- blueprint, continuum, environment, useraccess: missing BOTH pkg/**
path filter AND auto-bump pipeline (permissions promotion +
values.yaml bump + commit/push + blueprint-release dispatch).
- sandbox: already complete (pkg/** + auto-bump to platform/sandbox
chart) — left untouched.
Each updated workflow inherits the canonical shape from
build-organization-controller.yaml (PR #2005):
1. `core/controllers/pkg/**` added to BOTH push.paths and
pull_request.paths. Without this, a fix that only touches the
shared HTTP-client tree (gitea/keycloak/kc-mappers) silently
fails to rebuild the controller image.
2. `permissions.contents: write` + `actions: write` so the build
job can push the values.yaml bump and dispatch the downstream
chart re-publish.
3. An awk-scoped `Bump controllers.<who>.image.tag in values.yaml`
step that updates ONLY the targeted controller's tag (verified
locally — sibling tags remain untouched).
4. A commit/push step that bumps
products/catalyst/chart/values.yaml (or
products/continuum/chart/values.yaml for continuum, which has
its own chart).
5. A `gh workflow run blueprint-release.yaml` dispatch so the
bot-pushed commit fires the downstream chart re-publish
(GitHub Actions silently filters bot pushes from path-trigger
workflows otherwise).
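The scoped bump in step 3 can be illustrated in Go. This is a minimal sketch, not the workflow's awk program: the function name `bumpControllerTag`, the sample tags, and the assumed values.yaml layout (each controller nested two spaces under `controllers:`) are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// indentOf returns the number of leading spaces on a line.
func indentOf(line string) int {
	return len(line) - len(strings.TrimLeft(line, " "))
}

// bumpControllerTag rewrites the first `tag:` line inside the block that
// starts at "  <who>:" and leaves every sibling controller untouched.
// The 2-space nesting under `controllers:` is an assumed layout.
func bumpControllerTag(values, who, newTag string) string {
	lines := strings.Split(values, "\n")
	header := "  " + who + ":"
	inBlock, done := false, false
	for i, line := range lines {
		switch {
		case line == header:
			inBlock = true
		case inBlock && strings.TrimSpace(line) != "" && indentOf(line) <= 2:
			// A key at the same or shallower indent ends the targeted block.
			inBlock = false
		case inBlock && !done && strings.HasPrefix(strings.TrimSpace(line), "tag:"):
			lines[i] = strings.Repeat(" ", indentOf(line)) + `tag: "` + newTag + `"`
			done = true
		}
	}
	return strings.Join(lines, "\n")
}

// sampleValues is a hypothetical excerpt, not the real chart values.
const sampleValues = `controllers:
  organization:
    image:
      tag: "aaa1111"
  environment:
    image:
      tag: "bbb2222"`

func main() {
	// Only the environment controller's tag changes.
	fmt.Println(bumpControllerTag(sampleValues, "environment", "ccc3333"))
}
```

Scoping the rewrite to one block is what keeps sibling tags stable when several controller workflows push bumps to the same file.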
Adds two new files to lock the shape in:
- `scripts/check-controller-workflow-uniformity.sh` — a CI
regression test that grep-asserts every controller workflow has
the canonical pkg/** filter + auto-bump pipeline. Fails loudly
if any new controller workflow ships without the canonical shape,
or if an existing one regresses.
- `.github/workflows/check-controller-workflow-uniformity.yaml` —
push-on-touch + pull_request-on-touch event-driven wrapper that
runs the script. Mirrors the shape of check-vendor-coupling.yaml.
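The grep-assert idea behind the uniformity check can be sketched in Go. The marker strings below are taken from the canonical shape listed above; the real script's patterns are not shown in this message, so `missingMarkers` and `canonicalWorkflow` are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"strings"
)

// missingMarkers returns the canonical markers a workflow file lacks.
func missingMarkers(workflow string) []string {
	var missing []string
	// pkg/** must appear in BOTH push.paths and pull_request.paths.
	if strings.Count(workflow, "core/controllers/pkg/**") < 2 {
		missing = append(missing, "core/controllers/pkg/** in push and pull_request paths")
	}
	for _, marker := range []string{
		"contents: write",
		"actions: write",
		"image.tag in values.yaml",
		"gh workflow run blueprint-release.yaml",
	} {
		if !strings.Contains(workflow, marker) {
			missing = append(missing, marker)
		}
	}
	return missing
}

// canonicalWorkflow is a trimmed, hypothetical workflow used only for the demo.
const canonicalWorkflow = `on:
  push:
    paths:
      - core/controllers/pkg/**
  pull_request:
    paths:
      - core/controllers/pkg/**
permissions:
  contents: write
  actions: write
steps:
  - name: Bump controllers.environment.image.tag in values.yaml
  - run: gh workflow run blueprint-release.yaml
`

func main() {
	fmt.Println("missing in canonical:", missingMarkers(canonicalWorkflow))
	fmt.Println("missing in regressed:", missingMarkers("permissions:\n  contents: read\n"))
}
```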
Verified locally:
- YAML syntax valid for all 7 controller workflows + the new check
workflow.
- Regression script passes on all 7 controller workflows.
- Simulated awk bumps against products/catalyst/chart/values.yaml
and products/continuum/chart/values.yaml — each script bumps
ONLY the targeted controller's tag, sibling tags untouched.
No chart bumps. No Go/chart changes. CI-workflow-only.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bp-self-sovereign-cutover orchestrator stuck at step 5/9 on t38
2026-05-19 when catalyst-api restarted mid-cutover. The in-process
runCutover goroutine died; the durable status ConfigMap captured the
in-flight state but NOTHING auto-fired the engine on the fresh Pod.
The chart's auto-trigger Helm Job only runs on post-install /
post-upgrade hooks; a catalyst-api Pod restart AFTER the chart is
already installed leaves the cutover stranded. Step 09 (gitea-token-mint)
was never created → PR #2008's provisioning init-container blocked
forever waiting for the cutover-step-09 token annotation → tenant
onboarding flow stuck (Pillar 1 + 4 + 5 blocked).
Root cause (cutover.go, lines 770-790): the engine reads `priorStatus`
on a fresh /start call and skips steps where result==success, but only
HandleCutoverStart / HandleCutoverInternalTrigger can trigger that
code path. No startup hook → no auto-resume. Additionally, in-flight
step rows whose result==running stay "running" forever in the durable
record.
Fix (single PR, no chart changes — purely catalyst-api Go code):
1. Handler.ResumeInterruptedCutover(ctx) — new exported method that
reads the cutover status ConfigMap, detects in-flight cutovers
(cutoverComplete=false AND cutoverStartedAt!=""), resets every
step row whose .result=="running" back to "" (so the engine
treats it as not-yet-attempted), and spawns runCutover with a
background context.
2. cmd/api/main.go — call h.ResumeInterruptedCutover(ctx) just before
ListenAndServe so a startup-resume race against a stale auto-
trigger Job retry is serialised through the in-process running
flag (tryStartRun).
3. createCutoverJob — Create-or-Get on AlreadyExists (concurrent
trigger fires from operator CTA + auto-trigger Job hitting
catalyst-api simultaneously are now benign).
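The resume decision in step 1 reduces to two small checks. The sketch below assumes a simplified status shape; `CutoverStatus`, `StepStatus`, and the helper names are hypothetical and do not mirror the actual ConfigMap schema or the real method body.

```go
package main

import "fmt"

// StepStatus and CutoverStatus are hypothetical mirrors of the durable record.
type StepStatus struct {
	Name   string
	Result string // "", "running", "success"
}

type CutoverStatus struct {
	CutoverComplete  bool
	CutoverStartedAt string
	Steps            []StepStatus
}

// needsResume reports an in-flight cutover: started but not complete.
func needsResume(s CutoverStatus) bool {
	return !s.CutoverComplete && s.CutoverStartedAt != ""
}

// resetRunningSteps clears every "running" row so the engine treats it as
// not-yet-attempted. It returns the number of rows reset.
func resetRunningSteps(s *CutoverStatus) int {
	n := 0
	for i := range s.Steps {
		if s.Steps[i].Result == "running" {
			s.Steps[i].Result = ""
			n++
		}
	}
	return n
}

// stepsToRun lists the steps the engine would execute: all but "success".
func stepsToRun(s CutoverStatus) []string {
	var out []string
	for _, st := range s.Steps {
		if st.Result != "success" {
			out = append(out, st.Name)
		}
	}
	return out
}

// seedStatus builds an interrupted three-step cutover (illustrative values).
func seedStatus() CutoverStatus {
	return CutoverStatus{
		CutoverStartedAt: "2026-05-19T00:00:00Z",
		Steps: []StepStatus{
			{Name: "step-1", Result: "success"},
			{Name: "step-2", Result: "running"},
			{Name: "step-3", Result: ""},
		},
	}
}

func main() {
	s := seedStatus()
	if needsResume(s) {
		fmt.Println("reset rows:", resetRunningSteps(&s))
		fmt.Println("will run:", stepsToRun(s))
	}
}
```

Clearing "running" rather than marking it failed relies on every step being idempotent, so that re-running a half-finished step is safe.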
Tests (cutover_test.go):
- TestResumeInterruptedCutover_ResumesAndCompletes — seeds 3-step
status with step-1 success, step-2 running, step-3 untouched.
Asserts after resume: step-1 NOT re-run, step-2 re-run, step-3
run, cutoverComplete=true.
- TestResumeInterruptedCutover_NoOpWhenComplete — already-done
status produces zero Job creates.
- TestResumeInterruptedCutover_NoOpWhenNeverStarted — empty
cutoverStartedAt MUST not pre-empt the chart's auto-trigger Job.
Chart bump: bp-catalyst-platform 1.4.219 → 1.4.220 + bootstrap-kit
pin in lockstep (clusters/_template/bootstrap-kit/
13-bp-catalyst-platform.yaml). No bp-self-sovereign-cutover chart
changes — every step PodSpec is already idempotent by design.
Refs #2016
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Root cause: CFKVClient.Renew compared the server-stamped ExpiresAt
against the client's wall-clock (time.Now()). The Cloudflare Worker
is the timestamping authority — ExpiresAt is in the Worker's clock
frame. Whenever the Worker's clock and the client's wall-clock
diverged (NTP skew, fake-clock tests, or simply the test fixture
clock pinned to 2026-05-09 while CI runs on a later date), the
client's check declared the lease expired and Renew returned
ErrLeaseLost — even though the Worker still considered the lease
healthy.
This caused the Build continuum-controller workflow to go red on every
push since 2026-05-09 with:
--- FAIL: TestCFKV_ContractSuite/RenewExtendsTTLAndBumpsGeneration
contract.go:214: Renew: witness: lease lost
--- FAIL: TestCFKV_ContractSuite/GenerationMonotonicityAcrossOps
contract.go:298: Renew: witness: lease lost
Fix: remove the client-side wall-clock expiry check. Expiry is
enforced server-side — an expired renew returns 412, which write()
already maps to ErrLeaseHeldByAnother, which the Renew wrapper then
re-maps to ErrLeaseLost. This keeps a single source of truth for
"is the lease alive" (the Worker), avoiding the dual-clock
disagreement. The non-holder early return (cur.Holder != holder ->
ErrLeaseLost) is preserved because it never depended on time.
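The resulting control flow can be sketched as below. This is an illustration under assumptions, not the CFKVClient code: `renew`, `mapWriteStatus`, the `lease` struct, and the text of `ErrLeaseHeldByAnother` are hypothetical, and the Worker is replaced by a function returning an HTTP status.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrLeaseLost          = errors.New("witness: lease lost")
	ErrLeaseHeldByAnother = errors.New("witness: lease held by another") // message assumed
)

// lease is a hypothetical view of the current record.
type lease struct {
	Holder string
}

// mapWriteStatus mirrors the described write() mapping: 412 means the
// server rejected the conditional write.
func mapWriteStatus(status int) error {
	switch status {
	case http.StatusOK:
		return nil
	case http.StatusPreconditionFailed:
		return ErrLeaseHeldByAnother
	default:
		return fmt.Errorf("unexpected status %d", status)
	}
}

// renew keeps the holder check (it never depended on time) and performs NO
// wall-clock comparison; expiry is decided by the server's response alone.
func renew(cur lease, holder string, write func(holder string) int) error {
	if cur.Holder != holder {
		return ErrLeaseLost
	}
	err := mapWriteStatus(write(holder))
	if errors.Is(err, ErrLeaseHeldByAnother) {
		return ErrLeaseLost
	}
	return err
}

func main() {
	healthy := func(string) int { return http.StatusOK }
	expired := func(string) int { return http.StatusPreconditionFailed }
	fmt.Println(renew(lease{Holder: "a"}, "a", healthy))
	fmt.Println(renew(lease{Holder: "a"}, "a", expired))
	fmt.Println(renew(lease{Holder: "b"}, "a", healthy))
}
```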
Validation:
- TestCFKV_ContractSuite/RenewExtendsTTLAndBumpsGeneration GREEN
- All 14 contract suite sub-tests GREEN
- ./continuum/internal/witness/cloudflarekv/... -count=10 GREEN
- All ./continuum/... packages GREEN
Refs #2012
Co-authored-by: Emrah Baysal <emrah.baysal@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bp-sandbox chart defaulted `env.newapiBaseURL` to
`http://newapi.newapi.svc.cluster.local:3000`. That assumes the bp-newapi
ClusterIP Service is named bare `newapi`. In practice the canonical
service name rendered by `helm template newapi platform/newapi/chart/
-s templates/service.yaml` is `newapi-bp-newapi`, because
`bp-newapi.fullname` in `platform/newapi/chart/templates/_helpers.tpl`
emits `{Release.Name}-{Chart.Name}` and `clusters/_template/bootstrap-kit/
80-newapi.yaml` sets `releaseName: newapi` against chart `bp-newapi`.
The bootstrap-kit overlay at
`clusters/_template/bootstrap-kit/19a-bp-sandbox.yaml` does NOT override
`env.newapiBaseURL`, so every Sovereign's sandbox-controller resolved a
DNS name no Service ever publishes:
POST /admin/tokens/sandbox → lookup newapi.newapi.svc.cluster.local
on 10.43.0.10:53: no such host
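The name derivation is simple enough to check by hand. The Go sketch below reproduces the `{Release.Name}-{Chart.Name}` rule described above; `fullname` and `serviceURL` are illustrative helpers, not chart code.

```go
package main

import "fmt"

// fullname mirrors the {Release.Name}-{Chart.Name} rule from _helpers.tpl.
func fullname(releaseName, chartName string) string {
	return releaseName + "-" + chartName
}

// serviceURL builds the in-cluster URL for a ClusterIP Service.
func serviceURL(service, namespace string, port int) string {
	return fmt.Sprintf("http://%s.%s.svc.cluster.local:%d", service, namespace, port)
}

func main() {
	// releaseName "newapi" against chart "bp-newapi", namespace "newapi".
	fmt.Println(serviceURL(fullname("newapi", "bp-newapi"), "newapi", 3000))
}
```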
Walker on t38 (chart 1.4.216, substrate be4f78bc872e2c56, 2026-05-19)
caught the live regression. Every qwen-code Sandbox session failed at
TokenMint, blocking the canonical Pillar-4 customer journey
(console.<orgslug>.omani.homes → Sandbox → qwen-code provisions
additional app).
Fix scope:
- platform/sandbox/chart/values.yaml: default flipped to
`http://newapi-bp-newapi.newapi.svc.cluster.local:3000`.
- platform/sandbox/chart/templates/deployment.yaml: inline `default` in
the env block matched.
- platform/sandbox/chart/Chart.yaml: bp-sandbox 0.3.0 -> 0.3.1.
- clusters/_template/bootstrap-kit/19a-bp-sandbox.yaml: pin 0.3.0 ->
0.3.1 in lockstep (Inviolable Principle #14).
Verification:
- `helm template bp-sandbox platform/sandbox/chart/ -s
templates/deployment.yaml` with required values set renders the env
literal `value: "http://newapi-bp-newapi.newapi.svc.cluster.local:3000"`.
- `helm template newapi platform/newapi/chart/ -s templates/service.yaml`
renders `metadata.name: newapi-bp-newapi`.
DoD per anti-theater discipline (CLAUDE.md §0): issue stays open until a
fresh-prov Sandbox session successfully mints a NewAPI token and reaches
qwen-code. This PR ships the source-of-truth env-var fix only; it does
NOT defensively retry alternate names in the dial path.
Refs #2015
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Two tiers of placement modes coexist in the Blueprint corpus but only
one was registered in the validator + CRD enum, causing
TestValidate_ExistingBlueprintCorpus to fail on the 4 bp-*-vcluster
blueprints since 2026-05-09:
- Application-tier (marketplace 99%): single-region / active-active /
active-hotstandby
- Bootstrap-topology tier (docs/SOVEREIGN-MULTI-REGION-DOD.md A4):
primary-only / secondary-only / every-region
The 4 affected blueprints (bp-mgmt-vcluster / bp-dmz-vcluster /
bp-rtz-vcluster / bp-vcluster-helmrepo) correctly use the bootstrap-
topology tier — these are NOT operator-selectable; they document
which regions the bootstrap layer auto-installs the chart into.
Extends:
- validate.go canonicalPlacementModes with the three bootstrap modes
+ inline documentation of the two-tier taxonomy
- blueprint.yaml CRD enum (placementSchema.modes.items + .default)
kept in sync per the validator's "must mirror" comment
- 4 new unit-test cases for the bootstrap-topology modes
Result: TestValidate_ExistingBlueprintCorpus 71/71 GREEN
(previously 67/71, 4 FAIL).
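A two-tier lookup along these lines would accept both families of modes. The map shape and the `placementTier` helper are assumptions for illustration; only the name `canonicalPlacementModes` and the six mode strings come from this change.

```go
package main

import "fmt"

// canonicalPlacementModes maps each accepted mode to its tier.
// The map-of-tier representation is an assumption for this sketch.
var canonicalPlacementModes = map[string]string{
	// Application tier: operator-selectable.
	"single-region":     "application",
	"active-active":     "application",
	"active-hotstandby": "application",
	// Bootstrap-topology tier: not operator-selectable.
	"primary-only":   "bootstrap-topology",
	"secondary-only": "bootstrap-topology",
	"every-region":   "bootstrap-topology",
}

// placementTier validates a mode and reports which tier it belongs to.
func placementTier(mode string) (string, error) {
	tier, ok := canonicalPlacementModes[mode]
	if !ok {
		return "", fmt.Errorf("unknown placement mode %q", mode)
	}
	return tier, nil
}

func main() {
	for _, m := range []string{"single-region", "every-region", "bogus"} {
		tier, err := placementTier(m)
		fmt.Println(m, tier, err)
	}
}
```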
Unblocks #2012 and every other PR touching blueprint-controller.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
t38 walk caught the canonical TBD-V9 bug: customer redeems voucher
WALK-T38-2138 on a 50 OMR order, voucher credit is only 10 OMR, Stripe
is unconfigured in the Sovereign, Checkout returns 503 "payment processor
not configured" — but promo_codes.times_redeemed had already advanced
0→1, promo_redemptions row was inserted, and a credit_ledger grant was
written. Voucher shows "Exhausted 1/1" with no order to show for it; the
customer's one-per-customer promo is silently burned.
Root cause: store.RedeemPromoCode runs its own transaction (necessary
for the FOR UPDATE concurrency cap) and commits the three side effects
up front. The rest of the Checkout pipeline (GetCreditBalance, GetSettings,
CreditOnlyCheckout, Stripe customer + session creation) can fail without
undoing the redemption.
Fix (saga / compensating action):
- store.RollbackPromoCodeRedemption(customerID, code) — single tx that
DELETEs promo_redemptions, decrements times_redeemed (GREATEST(..,0)
underflow guarded), and DELETEs the credit_ledger redeem grant (filter
reason='promo:<code>' AND order_id IS NULL so order spend ledger rows
are not touched). Idempotent: 0-row DELETE on promo_redemptions
short-circuits the rest, so re-running a failed checkout never
double-decrements.
- handlers.Checkout tracks voucherRedeemed and calls
RollbackPromoCodeRedemption on every downstream failure: settings load,
Stripe-unconfigured 503 (the t38 walk path), CreateOrder failure,
Stripe customer rejection, Stripe session rejection, plan-price
unresolvable.
- Voucher only stays committed once (a) CreditOnlyCheckout commits the
order+spend+sub transactionally and order.placed fires, or (b) the
Stripe Checkout Session URL is handed back to the customer (canonical
abandoned-cart: credit persists on ledger for the next attempt).
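The compensating action can be sketched with an in-memory store. This is a simplified model under assumptions: the real code runs SQL in a single transaction, and the `store`, `redeem`, `rollback`, and `checkout` shapes here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type ledgerRow struct {
	CustomerID string
	Reason     string
	OrderID    string // empty for a redeem grant, set for order spend rows
	Amount     int
}

type store struct {
	timesRedeemed map[string]int
	redemptions   map[string]bool
	ledger        []ledgerRow
}

func newStore() *store {
	return &store{timesRedeemed: map[string]int{}, redemptions: map[string]bool{}}
}

// redeem commits the three side effects up front, as RedeemPromoCode does.
func (s *store) redeem(customerID, code string, credit int) {
	s.redemptions[customerID+"|"+code] = true
	s.timesRedeemed[code]++
	s.ledger = append(s.ledger, ledgerRow{CustomerID: customerID, Reason: "promo:" + code, Amount: credit})
}

// rollback undoes the redemption. It is idempotent: with no redemption row
// it returns false before touching the counter or the ledger.
func (s *store) rollback(customerID, code string) bool {
	if customerID == "" || code == "" || !s.redemptions[customerID+"|"+code] {
		return false
	}
	delete(s.redemptions, customerID+"|"+code)
	if s.timesRedeemed[code] > 0 { // underflow guard
		s.timesRedeemed[code]--
	}
	kept := s.ledger[:0]
	for _, r := range s.ledger {
		if r.CustomerID == customerID && r.Reason == "promo:"+code && r.OrderID == "" {
			continue // drop only the redeem grant, never order spend rows
		}
		kept = append(kept, r)
	}
	s.ledger = kept
	return true
}

var errStripeUnconfigured = errors.New("payment processor not configured")

// checkout redeems first and compensates on a downstream failure.
func checkout(s *store, customerID, code string, credit, total int, stripeConfigured bool) error {
	s.redeem(customerID, code, credit)
	if credit < total && !stripeConfigured {
		s.rollback(customerID, code)
		return errStripeUnconfigured
	}
	return nil
}

func main() {
	s := newStore()
	err := checkout(s, "cust-1", "WALK-T38-2138", 10, 50, false)
	fmt.Println(err, s.timesRedeemed["WALK-T38-2138"], len(s.ledger))
}
```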
Tests:
- store_test.go: three new tests cover the rollback contract — happy
path (all three side effects undone in one tx), idempotent no-op
when no redemption row exists, empty-args no-op (no DB hit).
- checkout_test.go: TestCheckout_VoucherPartialCover_StripeUnconfigured_RollsBackRedemption
is the t38 regression — full sqlmock walk asserting the rollback tx
fires before the 503 response.
bp-catalyst-platform Chart.yaml + bootstrap-kit pin bumped 1.4.214 → 1.4.215.
Co-authored-by: Claude Code (hatiyildiz) <claude@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-V10 — t38 walk: after successful /redeem + /checkout the customer
was redirected to the operator console URL (`console.<sov-fqdn>`)
instead of the per-tenant console (`console.<slug>.<sov-fqdn>`).
Root cause: `core/marketplace/src/lib/config.ts::deriveConsoleURL`
mapped `marketplace.<sov-fqdn> → console.<sov-fqdn>`, never prepending
the tenant slug. PR #1993 (TBD-A67) restored the `console.` prefix in
the chart-side HTTPRoute (tenant-public-routes.yaml) AND the runtime
organization-controller's tenant_route.go (both emit
`console.<slug>.<parentDomain>` byte-identically), but the marketplace
JS that does the post-checkout redirect never picked up the slug-
prefixed shape.
Fix
---
- `src/lib/config.ts`: `deriveConsoleURL(slug?)` now splices the slug
as the left-most label when the marketplace host is
`marketplace.<sov-fqdn>`. Slug source: explicit arg → localStorage
(`sme-active-org-slug`) → fallback to slug-less operator host.
Exported pure helper `composeTenantConsoleURL(host, slug)` for
testability. Mothership (`marketplace.openova.io`) and partner
vanity hosts unchanged.
- `src/lib/api.ts`: new `setActiveOrgSlug()`. `logout()` clears both
`sme-active-org-slug` and `sme-checkout-tenant-slug`.
- `src/components/CheckoutStep.svelte`: persist `tenant.slug` to
`sme-checkout-tenant-slug` BEFORE the Stripe hop so the cross-
origin return can re-stamp it; call `setActiveOrgSlug(tenant.slug)`
on credit-covered path; pass the slug through `consoleHref(...,
{ slug })` for the redirect navigation.
- `src/layouts/Layout.astro`: inline returning-user redirect now
pulls the slug from the live-orgs response (preferring the org
matching `sme-active-org`) and stamps `sme-active-org-slug` before
redirecting to `console.<slug>.<sov-fqdn>`.
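The host composition rule is shown here as a Go port for illustration only; the shipped helper is TypeScript in `src/lib/config.ts`. How the real helper treats the mothership and partner hosts is not spelled out in this message, so this sketch simply reports `ok=false` for them and leaves the existing mapping to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

const mothershipHost = "marketplace.openova.io"

// composeTenantConsoleURL splices the slug as the left-most label after
// "console." when the host is marketplace.<sov-fqdn>. An empty slug falls
// back to the slug-less operator host. ok=false means "not a Sovereign
// marketplace host" (an assumption of this sketch).
func composeTenantConsoleURL(host, slug string) (url string, ok bool) {
	host = strings.ToLower(strings.TrimSpace(host))
	if host == mothershipHost || !strings.HasPrefix(host, "marketplace.") {
		return "", false
	}
	sovFQDN := strings.TrimPrefix(host, "marketplace.")
	slug = strings.ToLower(strings.TrimSpace(slug))
	if slug == "" {
		return "https://console." + sovFQDN, true
	}
	return "https://console." + slug + "." + sovFQDN, true
}

func main() {
	for _, c := range [][2]string{
		{"marketplace.omani.homes", "demo"},
		{"marketplace.t38.omani.works", "acme"},
		{"marketplace.omani.homes", ""},
		{"marketplace.openova.io", "demo"},
	} {
		url, ok := composeTenantConsoleURL(c[0], c[1])
		fmt.Println(c[0], c[1], url, ok)
	}
}
```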
Validation
----------
- `playwright/customer-journey.spec.ts` step 16 extended with the
brief's exact assertion: `marketplace.omani.homes` + slug `demo`
→ `https://console.demo.omani.homes`. Plus regression guards for
multi-label sov-fqdn (`marketplace.t38.omani.works` + `acme` →
`console.acme.t38.omani.works`), mixed-case slug lowercasing, empty/
null slug falling back to operator host, and mothership ignoring
the slug.
- `git grep '\.openova\.io"' core/marketplace/src/` returns ZERO new
hits introduced by this PR (existing references are the tenant
table for `omantel.openova.io` and the canonical mothership host
guard — both intentional).
- `npm run build` clean on the affected files (Astro static export
including CheckoutStep.svelte rebuild).
Chart bump
----------
- products/catalyst/chart/Chart.yaml: 1.4.213 → 1.4.214
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pin:
1.4.213 → 1.4.214
Refs: PR #1993 (TBD-A67 console-prefix chart fix), #1949 (/redeem)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-V8: voucher email never delivered. On t38 canonical walk (agent
a550281a, 2026-05-19 21:37:33Z) operator issued voucher, row persisted,
HTTP 200 returned, but recipient IMAP stayed empty. catalyst-api logs
showed sme/notification returning 401 to the downstream dispatch.
Trace (end-to-end, per docs/INVIOLABLE-PRINCIPLES.md #1):
FE → catalyst-api → SME gateway → billing → notification
catalyst-api → gateway → billing wire is correct: catalyst-api mints an
HS256 bridge token from the operator's RS256 Keycloak session via
sharedauth.MintSMEAccessToken (signed with the reflector mirror of
sme-secrets/JWT_SECRET into catalyst-system), gateway and billing both
verify HS256 with the same bytes.
billing → notification wire was broken: billing's sendVoucherIssuedEmail
(core/services/billing/handlers/vouchers.go) POSTed with only
Content-Type — NO Authorization header. notification's HTTP surface is
gated by the shared HS256 JWTAuth middleware
(core/services/shared/middleware/jwt.go); a missing header returns 401
silently. The voucher upsert already persisted so the operator saw 200,
but no email ever landed.
TBD's hypothesis ("JWT signing-secret mismatch between catalyst-api and
sme/notification") was incorrect — both Pods already read from the SAME
sme-secrets/JWT_SECRET (chart templates/sme-services/billing.yaml lines
67-71 and notification.yaml lines 47-51, both pointing at the same
secretKeyRef). The real gap was that billing never USED those bytes to
mint an outbound service token.
Fix (Go-side only, no chart-template change):
1. Add JWTSecret []byte to billing's Handler struct
(core/services/billing/handlers/handlers.go).
2. Wire it in core/services/billing/main.go from the same JWT_SECRET
env the inbound JWTAuth middleware already consumes.
3. In sendVoucherIssuedEmail, mint a 5-minute HS256 service token
via sharedauth.MintSMEAccessToken (the SAME helper catalyst-api's
RS256→HS256 bridge uses, so the wire contract is symmetric) and
forward it as Authorization: Bearer <token>.
Claims: sub="sme-billing", role="superadmin", typ="session".
4. Empty JWTSecret falls back to the legacy no-header path so a stale
chart that doesn't wire JWT_SECRET into billing doesn't crash the
voucher upsert (mirrors optional:true on catalyst-api's
CATALYST_SME_JWT_SECRET secretKeyRef).
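The mint-and-verify round trip can be sketched with the standard library alone. This is not `sharedauth.MintSMEAccessToken`, whose signature is not shown here; `mintServiceToken`, `verify`, and `authHeader` are hypothetical helpers that only demonstrate the HS256 wire contract and the empty-secret fallback.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

type claims struct {
	Sub  string `json:"sub"`
	Role string `json:"role"`
	Typ  string `json:"typ"`
	Exp  int64  `json:"exp"`
}

func b64(b []byte) string { return base64.RawURLEncoding.EncodeToString(b) }

// sign computes the base64url HMAC-SHA256 signature over the signing input.
func sign(secret []byte, signingInput string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(signingInput))
	return b64(mac.Sum(nil))
}

// mintServiceToken issues a 5-minute HS256 token with the claims named above.
func mintServiceToken(secret []byte, nowUnix int64) (string, error) {
	payload, err := json.Marshal(claims{Sub: "sme-billing", Role: "superadmin", Typ: "session", Exp: nowUnix + 300})
	if err != nil {
		return "", err
	}
	signingInput := b64([]byte(`{"alg":"HS256","typ":"JWT"}`)) + "." + b64(payload)
	return signingInput + "." + sign(secret, signingInput), nil
}

// verify re-checks the signature with the same bytes, as a receiver would.
func verify(secret []byte, token string) (claims, bool) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 || !hmac.Equal([]byte(sign(secret, parts[0]+"."+parts[1])), []byte(parts[2])) {
		return claims{}, false
	}
	raw, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return claims{}, false
	}
	var c claims
	if err := json.Unmarshal(raw, &c); err != nil {
		return claims{}, false
	}
	return c, true
}

// authHeader returns "" for an empty secret (the legacy no-header path).
func authHeader(secret []byte, nowUnix int64) string {
	if len(secret) == 0 {
		return ""
	}
	tok, err := mintServiceToken(secret, nowUnix)
	if err != nil {
		return ""
	}
	return "Bearer " + tok
}

func main() {
	secret := []byte("example-secret") // placeholder, not a real key
	h := authHeader(secret, time.Now().Unix())
	c, ok := verify(secret, strings.TrimPrefix(h, "Bearer "))
	fmt.Println(ok, c.Sub, c.Role, c.Typ)
}
```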
Tests:
- TestIssueVoucher_SendsAuthorizationHeader: exercises the full round-
trip. Billing mints with the test bytes; we re-parse the captured
token with the SAME bytes (the exact path notification's JWTAuth
middleware takes on receive) and assert claim shape — sub, role,
typ, exp. Pre-fix the captured request had no Authorization header
so this would have failed at the first check.
- TestIssueVoucher_NoAuthHeader_WhenJWTSecretUnset: back-compat guard
for the legacy no-secret path.
- All pre-existing TestIssueVoucher_* tests still pass.
Chart bumped 1.4.213 → 1.4.214 and bootstrap-kit pin in
clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml updated
to match.
Validation:
- go test ./core/services/billing/... → PASS (3 packages)
- helm template products/catalyst/chart --set
ingress.marketplace.enabled=true → both sme/billing and
sme/notification Deployments read JWT_SECRET from
secretKeyRef.name=sme-secrets, key=JWT_SECRET.
Refs #1842 (D28 voucher email arrival), #1829 (D29 customer journey).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-V11 / Issue #2002. On t38 fresh prov, sme/provisioning Pod logged
`HTTP 401 user does not exist [uid: 0, name: ""]` on the first tenant
Org CR creation. Root cause: provisioning Pod started with the chart's
first-install placeholder GITHUB_TOKEN (the Gitea admin password mirrored
verbatim by provisioning-github-token.yaml — enough to clear Container-
ConfigError but NOT a valid Gitea API token). Step 09 of bp-self-
sovereign-cutover later mints a real API token + patches the Secret
+ rollout-restarts the Pod, but the FIRST tenant journey always 401'd
because the Pod was already serving with the bad placeholder.
Approach (B): add an init container `wait-for-cutover-token` to the
SME provisioning Deployment that polls the Secret for the cutover
annotation `catalyst.openova.io/token-source: self-sovereign-cutover-
step-09` (stamped by Step 09 alongside the minted token bytes). The
Pod stays in Init:0/1 until Step 09 has actually completed, then the
main container starts with a guaranteed-valid token. Default poll
budget = 10s × 180 = 1800s (covers Hetzner cold-start ~18m + slack).
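The gating logic amounts to a bounded poll. The chart's init container uses a kubectl loop; the Go sketch below only models the same budget arithmetic, with `waitForCutoverToken`, `fakeSecret`, and the injected `sleep` function as hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	annotationKey   = "catalyst.openova.io/token-source"
	wantedSource    = "self-sovereign-cutover-step-09"
	defaultInterval = 10 * time.Second
	defaultAttempts = 180 // 10s × 180 = 1800s total budget
)

var errBudgetExhausted = errors.New("cutover token annotation not seen within poll budget")

// noSleep lets demos and tests run the loop without waiting.
func noSleep(time.Duration) {}

// waitForCutoverToken polls until the Secret carries the Step 09 annotation
// or the attempt budget runs out. It returns the number of polls made.
func waitForCutoverToken(get func() map[string]string, interval time.Duration, attempts int, sleep func(time.Duration)) (int, error) {
	for i := 1; i <= attempts; i++ {
		if get()[annotationKey] == wantedSource {
			return i, nil
		}
		if i < attempts {
			sleep(interval)
		}
	}
	return attempts, errBudgetExhausted
}

// fakeSecret returns a getter that exposes the annotation from call readyAt on.
func fakeSecret(readyAt int) func() map[string]string {
	calls := 0
	return func() map[string]string {
		calls++
		if calls >= readyAt {
			return map[string]string{annotationKey: wantedSource}
		}
		return map[string]string{}
	}
}

func main() {
	n, err := waitForCutoverToken(fakeSecret(3), defaultInterval, defaultAttempts, noSleep)
	fmt.Println(n, err)
}
```

Waiting on the annotation rather than on the Secret's existence matters because the placeholder Secret exists from first install.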
Why NOT HelmRelease.dependsOn:
- Per Principle #14, HR.dependsOn → Kustomization is silently ignored.
- bp-self-sovereign-cutover HR is dormant + disableWait:true: it goes
Ready=True at install BEFORE Step 09's Job actually runs. Adding it
to bp-catalyst-platform.dependsOn would buy nothing.
- Pod-level init gating waits on the actual condition (Secret
annotation set by Step 09), not on a proxy.
Why NOT change bp-self-sovereign-cutover trigger order:
- Step 09 must run AFTER bp-catalyst-platform creates the Secret
(otherwise the patch has no target). Reordering would break the
inverse dependency.
Why NOT a Job that bootstraps the user upfront:
- Step 09 already mints the token; we don't need a second bootstrap.
- The bug is timing, not absence of bootstrap.
Files changed:
- products/catalyst/chart/templates/sme-services/provisioning.yaml:
add initContainers block gated on
smeServices.provisioning.waitForCutoverToken.enabled (default true).
Re-uses existing `provisioning` SA (already has secrets get/list/watch
in `sme` ns via sme-provisioning ClusterRole — no new RBAC).
- products/catalyst/chart/values.yaml: add
smeServices.provisioning.waitForCutoverToken.{enabled,image,
intervalSeconds,timeoutSeconds} block.
- products/catalyst/chart/Chart.yaml: bump 1.4.213 → 1.4.214 with
full TBD-V11 changelog entry.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bump
HelmRelease pin 1.4.213 → 1.4.214 (chart bump only delivers the fix
when the pin moves — TBD-A68 / 1.4.213 precedent).
Validation:
- `helm template` Sovereign-mode render shows the init container in
the provisioning Deployment with kubectl-poll loop.
- Default-values smoke render unaffected (gate is
ingress.marketplace.enabled=true; smoke uses defaults where false).
- `helm lint products/catalyst/chart/` passes.
- Contabo-Zero render path safe by construction (chart only renders
the Deployment when ingress.marketplace.enabled=true; contabo
doesn't enable marketplace via this chart).
Closes #2002. Refs #1829 (D29 tenant materialisation gate).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Followup hardening for #1997 (PR #2004 catch-up bumped the
organization-controller chart pin to c9b58ea). PR #2004 unblocks t38
right now, but the underlying cause — `build-organization-controller.yaml`
has no auto-bump step and its path filter misses `core/controllers/pkg/**`
— is still live and will re-strand the next gitea-client fix the
moment it lands. This PR closes both gaps so the bug cannot recur.
Two surgical additions:
1. `.github/workflows/build-organization-controller.yaml`
a. Promote `permissions.contents: read` → `write` (+ `actions:
write`), mirroring `build-application-controller.yaml`.
b. Add `Bump controllers.organization.image.tag in values.yaml`
step (awk-scoped to the `organization:` block only — cannot
accidentally bump a sibling controller's tag).
c. Add `Commit and push values.yaml bump` step (rebase-safe,
skip-if-no-change).
d. Add `Dispatch blueprint-release for chart re-publish` step
— anti-recursion bypass for the GH-Actions rule that bot
pushes don't fire downstream workflows. Without this the
rebuilt image is NEVER baked into a new chart version.
e. Add `core/controllers/pkg/**` to push + pull_request path
filters. The shared HTTP-client tree (gitea, keycloak,
kc-mappers, …) is COPYed into every Group C controller's
image via the Containerfile, so a change to it MUST rebuild.
PR #1910 only triggered a rebuild because it happened to
also touch `organization_controller_test.go`; a pure pkg/
fix would silently skip the workflow.
2. `core/controllers/pkg/gitea/client_test.go`
New `TestCreateOrg_HitsOrgsEndpointWithAuth` — wire-level
regression guard that:
- Fails hard if the client EVER hits `/api/v1/admin/orgs` (would
catch a refactor accident that re-introduces the Gitea 1.22+
405 bug regardless of which chart pin is deployed).
- Asserts the request is `POST /api/v1/orgs` exactly once.
- Asserts the request carries `Authorization: token <hex>` with
the exact expected value (defense-in-depth: even if the URL
is right, Gitea 1.22+ still returns 405 without the token).
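A wire-level guard of this kind can be built on `net/http/httptest`. The sketch below is not the repository's test: `createOrg` is a hypothetical minimal client, the request body is illustrative, and `checkCreateOrgWire` returns violations instead of calling `t.Fatal` so it can run outside the testing package.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// createOrg is a hypothetical client call against POST /api/v1/orgs.
func createOrg(baseURL, token, name string) error {
	body, err := json.Marshal(map[string]string{"username": name})
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/v1/orgs", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "token "+token)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("create org: HTTP %d", resp.StatusCode)
	}
	return nil
}

// checkCreateOrgWire runs createOrg against a fake server and reports every
// deviation from "exactly one authenticated POST /api/v1/orgs".
func checkCreateOrgWire(token string) []string {
	var (
		mu         sync.Mutex
		violations []string
		hits       int
	)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		switch {
		case r.URL.Path == "/api/v1/admin/orgs":
			violations = append(violations, "client hit /api/v1/admin/orgs")
			w.WriteHeader(http.StatusMethodNotAllowed)
		case r.Method == http.MethodPost && r.URL.Path == "/api/v1/orgs":
			hits++
			if r.Header.Get("Authorization") != "token "+token {
				violations = append(violations, "missing or wrong Authorization header")
			}
			w.WriteHeader(http.StatusCreated)
		default:
			violations = append(violations, "unexpected "+r.Method+" "+r.URL.Path)
			w.WriteHeader(http.StatusNotFound)
		}
	}))
	defer srv.Close()

	err := createOrg(srv.URL, token, "walkdemo")
	mu.Lock()
	defer mu.Unlock()
	if err != nil {
		violations = append(violations, err.Error())
	}
	if hits != 1 {
		violations = append(violations, fmt.Sprintf("want exactly 1 POST /api/v1/orgs, got %d", hits))
	}
	return violations
}

func main() {
	fmt.Println("violations:", checkCreateOrgWire("0123abcd"))
}
```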
Sibling controllers (environment, blueprint, useraccess, …) likely
have the same missing-auto-bump + missing-pkg/** path filter. NOT
fixing them in this PR — blast-radius discipline. Follow-up
recommended: audit every `build-*-controller.yaml` for both gaps.
Validation:
• go vet ./pkg/gitea/... — clean
• go test -race ./pkg/gitea/... — ok, all pre-existing + new tests pass
• go test -run TestCreateOrg_HitsOrgsEndpointWithAuth -v — PASS
Refs #1997 (PR #2004 closed the immediate symptom; this PR closes
the deploy gap so #1997 cannot recur)
Refs #1910 (the original /admin/orgs → /orgs code fix)
Refs #1829 (D29 customer journey hardening)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-1.0.2 bp-valkey shipped `valkey.auth.enabled: true` (bitnami default)
while bp-newapi's REDIS_CONN_STRING default was the passwordless URL
`redis://valkey-primary.valkey.svc.cluster.local:6379`. On every
freshly-franchised Sovereign the newapi Pod CrashLoopBackOff'd 45x on
the Redis ping probe with `NOAUTH Authentication required` — caught
on t38 sandbox walk 2026-05-20. This is the Pillar-4 verifier-killing
bug for the Sandbox + qwen-code + MCP end-user DoD (#1986).
Approach A (simpler, this PR): flip bp-valkey's default to
`auth.enabled: false` so the upstream bitnami chart exports
`ALLOW_EMPTY_PASSWORD=yes` to the Valkey container. Verified via
`helm template` — the render now contains:
- name: ALLOW_EMPTY_PASSWORD
value: "yes"
Other in-cluster consumers tolerate the change:
- products/catalyst sme-services (auth.yaml + gateway.yaml) read
VALKEY_PASSWORD via `secretKeyRef ... optional: true` and fall
back to the no-auth connect path in
core/services/shared/db/valkey.go when the value is empty.
- products/catalyst projector wraps the password Secret mount in
`{{- with .Values.services.projector.valkey.passwordSecret }}`
so an absent Secret simply skips the password env var.
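The empty-password fallback these consumers rely on can be sketched as below. This is an illustration under assumptions, not the code in core/services/shared/db/valkey.go: `valkeyURL` is a hypothetical helper, and embedding the password as URL userinfo is one common convention for redis:// URLs.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// valkeyURL builds a redis:// connection string. An empty password selects
// the no-auth form; a non-empty one is embedded as URL userinfo.
func valkeyURL(hostPort, password string) string {
	u := url.URL{Scheme: "redis", Host: hostPort}
	if password != "" {
		u.User = url.UserPassword("", password)
	}
	return u.String()
}

func main() {
	// VALKEY_PASSWORD is empty when the optional Secret is absent.
	host := "valkey-primary.valkey.svc.cluster.local:6379"
	fmt.Println(valkeyURL(host, os.Getenv("VALKEY_PASSWORD")))
}
```

With `auth.enabled: false` the password is empty and the result matches the passwordless default that bp-newapi already uses.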
Approach B (deferred): make bp-newapi mirror the bp-valkey
auto-generated password Secret into the newapi namespace and template
it into REDIS_CONN_STRING. Larger scope, tracked under #2003 follow-up.
Changes:
- platform/valkey/chart/values.yaml — auth.enabled: true → false
- platform/valkey/chart/Chart.yaml — version 1.0.1 → 1.0.2
- platform/valkey/blueprint.yaml — spec.version + configSchema default
- clusters/_template/bootstrap-kit/17-valkey.yaml — chart pin 1.0.1 → 1.0.2
Verified:
- `helm dependency build` succeeds (bitnami/valkey 5.5.1 unchanged)
- `helm template` renders `ALLOW_EMPTY_PASSWORD=yes` on the Pod
- tests/observability-toggle.sh — all 4 cases PASS
Closes #2003
Refs #1986
Co-authored-by: hatiyildiz <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-A68: t38 walkthrough on 2026-05-19 21:41Z (chart 1.4.211) put two
tenant Organization CRs (walkdemo38, walk-t38-2138) into
Ready=False/GiteaOrgFailed with `POST .../api/v1/admin/orgs HTTP 405`.
Investigation showed the code fix already landed on main as PR #1910
(merged 2026-05-19 03:59Z, commit f442c28): `gitea.EnsureOrg` now hits
`POST /api/v1/orgs` (the user-token endpoint) instead of the admin-only
`/api/v1/admin/orgs` that returns 405 to the in-cluster service-account
token. The build-organization-controller workflow successfully produced
fresh images at f442c28 and then again at c9b58ea (most recent main-
HEAD push touching the controller, 2026-05-19 20:58Z).
The bug on t38 was deployment-time: the chart's image pin at
products/catalyst/chart/values.yaml:369 still pointed at `72e3f08`
from 2026-05-10 across three subsequent chart bumps (1.4.210 / 1.4.211
/ 1.4.212). The CI auto-bump-images job covers SME images only, not
controller images, so this class of stale pin slips through. Filing
TBD-A69 separately to close that CI gap.
Files (pure deployment-pin update, no code change):
- products/catalyst/chart/values.yaml:369
tag: "72e3f08" -> tag: "c9b58ea"
- products/catalyst/chart/Chart.yaml
version + appVersion 1.4.212 -> 1.4.213, changelog entry added.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
version: 1.4.212 -> 1.4.213, changelog entry added.
Validation:
- `helm template products/catalyst/chart | grep organization-controller`
-> `image: "ghcr.io/openova-io/openova/organization-controller:c9b58ea"`
- `grep -c "72e3f08" <helm template output>` -> 0
- GHCR manifest probe for c9b58ea returns HTTP 200 with
application/vnd.docker.distribution.manifest.v2+json (image exists
and is pullable by the in-cluster ghcr-pull secret).
Post-deploy expectation:
- organization-controller Pod rolls to c9b58ea on `helm upgrade`.
- Controller logs flip from `POST /api/v1/admin/orgs HTTP 405` (every 30s)
to `POST /api/v1/orgs 201` on the existing stuck Organization CRs.
- walkdemo38 + walk-t38-2138 auto-recover to Ready=True without operator
intervention (gitea EnsureOrg is idempotent; the reconcile loop will
re-fire and succeed).
- Unblocks D29 tenant-org provisioning chain (Keycloak group +
vCluster + tenant URL HTTPRoute + WordPress install all gate on the
Organization CR being Ready).
Closes #1997
Refs #1829 (D29 tenant onboarding), #1842, #1945, #1910 (the upstream
code fix this chart bump finally ships).
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux-stuck-hr-recovery): grant helmreleases/status patch RBAC + log stderr (Closes #1995)
Agent ae9d7638 verifying PR #1991 on t38 (2026-05-19 21:18Z) found
the bp-flux-stuck-hr-recovery CronJob correctly detected bp-alloy in
`Ready=Unknown for 427s, history[0].status=deployed` state, entered
the TBD-A66 branch B, and attempted the patch — but the in-Pod
`kubectl patch hr --subresource=status` silently failed because
`>/dev/null 2>&1` sent its stderr to /dev/null along with stdout.
An identical manual patch from bastion succeeded immediately,
so RBAC was not the blocker.
Investigation: the 1.2.3 ClusterRole already grants `helmreleases`
+ `helmreleases/status` patch+update verbs (it was added in PR #1991
to enable the new branch in the first place). The actual root cause
of the silent failure was a diagnostic blind spot: the script could
not distinguish a successful patch from a failing one, so the
human-readable `RECOVER ... — patching` log line was emitted in both
cases.
Fix (1.2.4):
- Capture `kubectl patch --subresource=status` stderr to a tempfile
under /tmp (the writable emptyDir mount) so multi-line apiserver
errors survive intact.
- Emit three structured `[A66]` log lines that operators / agents
can grep:
detection: `[A66] HR <ns>/<name> Ready=Unknown for <age>s,
history[0]=deployed → attempting patch`
success: `[A66] HR <ns>/<name> patched to Ready=True`
failure: `[A66] HR <ns>/<name> patch FAILED: <stderr>`
- Same treatment for the annotation-rollback path so a stuck
idempotency annotation can also be diagnosed.
- Add Case 8 to leader-election-and-recovery.sh asserting:
* detection / success / failure log lines render in the script
* the `>/dev/null 2>&1` pattern is no longer on the critical
`kubectl patch --subresource=status` line
* stderr is captured via `mktemp /tmp/a66-patch-err.XXXXXX`
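The fix itself lives in the CronJob's shell script. The same idea, keeping stderr separate from the exit status so a failure cannot be logged as a success, is sketched here in Go for illustration only. `runCapturingStderr` and `patchLogLine` are hypothetical names, the `flux-system` namespace is an assumption, and `sh -c` stands in for the real kubectl call.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// runCapturingStderr runs a command and returns its stderr separately from
// its exit status, so a failure is never mistaken for a success.
func runCapturingStderr(name string, args ...string) (string, error) {
	var stderr bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stderr = &stderr
	err := cmd.Run()
	return strings.TrimSpace(stderr.String()), err
}

// patchLogLine renders the success or failure line in the [A66] format.
func patchLogLine(ns, name, stderr string, err error) string {
	if err != nil {
		return fmt.Sprintf("[A66] HR %s/%s patch FAILED: %s", ns, name, stderr)
	}
	return fmt.Sprintf("[A66] HR %s/%s patched to Ready=True", ns, name)
}

func main() {
	// Stand-in for the kubectl call: writes to stderr and exits non-zero.
	stderr, err := runCapturingStderr("sh", "-c", "echo 'forbidden' >&2; exit 1")
	fmt.Println(patchLogLine("flux-system", "bp-alloy", stderr, err))
}
```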
Chart 1.2.3 -> 1.2.4; bootstrap-kit pin 03-flux.yaml bumped in
lockstep (bootstrap-kit pin-sync check passes for bp-flux).
Refs #1989 (TBD-A66). Closes #1995.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux): bump blueprint.yaml spec.version 1.2.3 → 1.2.4 in lockstep with Chart.yaml
manifest-validation's TestBootstrapKit_BlueprintCardsHaveRequiredFields + TestBootstrapKit_BlueprintVersionLockstepSweep require blueprint.yaml spec.version to track Chart.yaml version exactly (TBD-A20 / #1856). Forgotten in the previous commit.
Refs #1995.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five surgical fixes for TBD-A68 (#1994) — every tenant-facing URL the
catalyst-api / SPA / chart could emit now follows the Sovereign FQDN
the deployment is bound to, instead of hardcoding the mothership host.
1. products/catalyst/bootstrap/api/internal/handler/auth.go
PIN email plaintext + HTML bodies now read SOVEREIGN_FQDN env via a
new pinEmailLoginURL() helper. Chroot mode (SOVEREIGN_FQDN set)
emits `https://console.<fqdn>/login`; mothership mode keeps the
historical `https://console.openova.io/sovereign/login`. The HTML
visible-link text is also derived from the resolved host.
2. core/console/src/lib/config.ts
MARKETPLACE_URL / CHECKOUT_URL / MARKETPLACE_HOME_URL now lazy-
resolve via resolveMarketplaceOrigin() — Astro public env
`PUBLIC_MARKETPLACE_ORIGIN` first, runtime `window.location.host`
second (strip `console.<slug>?` + prepend `marketplace.`), legacy
`https://marketplace.openova.io` fallback for SSR snapshots.
3. products/catalyst/chart/templates/sme-services/configmap.yaml
CORS_ORIGIN_PUBLIC / CORS_ORIGIN_ADMIN / CORS_ORIGIN_GATEWAY /
PUBLIC_BASE_URL / PUBLIC_API_BASE_URL / CNAME_TARGET /
CHECKOUT_SUCCESS_URL / CHECKOUT_CANCEL_URL now templated against
`marketplace.<global.sovereignFQDN>` + sibling platform zone.
Catalyst-Zero render (no sovereignFQDN, no host override) keeps
the legacy `sme.openova.io` byte-identical so contabo's existing
CORS / public URLs don't drift.
4. products/catalyst/chart/templates/sme-services/notification.yaml
Notification Deployment's CORS_ORIGIN env now sources from the
shared `sme-services-config.CORS_ORIGIN_PUBLIC` key instead of
hardcoding `https://sme.openova.io`. Per-Sovereign FQDN
substitution flows through automatically.
5. Regression test:
TestPinEmail_SovereignFQDNRoutesLoginURL in auth_pin_test.go covers
both modes (chroot routes to sovereign console; mothership keeps
openova.io target) and asserts the HTML body never routes tenant
traffic through openova.io when SOVEREIGN_FQDN is set.
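The mode switch described in item 1 can be sketched as follows. This is an approximation under assumptions: the real pinEmailLoginURL() reads SOVEREIGN_FQDN itself, whereas this version takes the value as a parameter to stay testable, and the whitespace trimming is an added assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// pinEmailLoginURL returns the login link placed in PIN emails.
// A non-empty FQDN selects chroot mode; empty keeps the mothership URL.
func pinEmailLoginURL(sovereignFQDN string) string {
	fqdn := strings.TrimSpace(sovereignFQDN)
	if fqdn == "" {
		return "https://console.openova.io/sovereign/login"
	}
	return "https://console." + fqdn + "/login"
}

func main() {
	fmt.Println(pinEmailLoginURL(os.Getenv("SOVEREIGN_FQDN")))
}
```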
Validation:
- `helm template products/catalyst/chart --set global.sovereignFQDN=t38.omani.works`
renders ZERO openova.io strings in CORS / PUBLIC_BASE_URL / CHECKOUT
keys. Catalyst-Zero render preserves the legacy sme.openova.io paths.
- `go test ./internal/handler/` passes in 101.4s (full suite + new
TestPinEmail regression test).
Chart bump: bp-catalyst-platform 1.4.211 -> 1.4.212 + bootstrap-kit
pin in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml.
Closes #1994
Co-authored-by: hatiyildiz <claude@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-A67: three surgical fixes for the tenant org URL drift between
the founder's spec (`console.<slug>.<parent>` per CLAUDE.md §0) and
the runtime emit. Pre-fix the controller emitted `<slug>.<parent>`
while the chart-side overlay AND sme_tenant_gitops.go:536 emitted
`console.<slug>.<parent>`; tenant onboarding emails on every non-
openova.io Sovereign leaked the platform marketing host into the
WorkspaceURL.
Files (three production paths + symmetric tests):
- core/controllers/organization/internal/controller/tenant_route.go:113
-> emits `console.<subdomain>.<parentDomain>` so the runtime
reconciler and the chart-side overlay produce byte-identical
HTTPRoute shapes.
- products/catalyst/chart/templates/sme-services/tenant-public-routes.yaml:82
-> chart-side analogue mirrors the new console-prefixed shape.
- core/services/notification/handlers/enrich.go
-> WorkspaceURL now `https://console.<sub>.<parentZone>` where
parentZone comes from a new TENANT_PARENT_DOMAIN env (same name
the provisioning service uses for Handler.TenantParentDomain).
Empty parent zone yields empty URL — NEVER falls back to
`.openova.io`, restoring compliance with the "never touch
openova.io" rule on per-Sovereign deployments.
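A sketch of the pure helper behind WorkspaceURL, under assumptions: the signature is guessed from the description above, and returning an empty URL for an empty subdomain is an assumption of this sketch (the source only specifies the empty-parent-zone case).

```go
package main

import (
	"fmt"
	"os"
)

// workspaceURL builds https://console.<sub>.<parentZone>.
// An empty parent zone yields an empty URL; there is no openova.io fallback.
// Treating an empty subdomain the same way is an assumption of this sketch.
func workspaceURL(sub, parentZone string) string {
	if sub == "" || parentZone == "" {
		return ""
	}
	return "https://console." + sub + "." + parentZone
}

func main() {
	fmt.Println(workspaceURL("acme", os.Getenv("TENANT_PARENT_DOMAIN")))
}
```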
Tests:
- new enrich_test.go: 5 truth-table cases on the pure workspaceURL
helper + 2 end-to-end Lookup cases. Hard regression guard that
the rendered URL contains neither a missing `console.` prefix nor
a leaked `openova.io` substring.
- organization_controller_test.go: TenantPublic_RendersHTTPRoute
assertion bumped from `acme.omani.homes` to `console.acme.omani.homes`
+ HasPrefix("console.") regression guard.
Chart bump: bp-catalyst-platform 1.4.209 -> 1.4.210; bootstrap-kit
pin in clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
follows.
Refs #1990 TBD-A67.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux-stuck-hr-recovery): detect+correct deployed-but-unknown-Ready HRs (Refs #1989)
t37 canonical walk on nbg1-2 / hel1-1 secondary CPs surfaced a second
stuck-HR failure mode: helm-controller completes the install — the HR's
own `.status.history[0].status` flips to "deployed" — but apiserver
flap on the slow secondary CP loses the write that flips
`.status.conditions[type=Ready]` from Unknown to True. The existing
suspend-toggle recovery (issue #925) does NOT fix this because helm-
controller's "release in storage" short-circuit returns yes on every
subsequent reconcile, so it never re-evaluates Ready.
This PR extends the stuckHelmReleaseRecovery CronJob with a second
detection branch:
for hr where
.status.conditions[type=Ready].status == "Unknown"
AND age(Unknown) > stuckThreshold (default 5m)
AND .status.history[0].status == "deployed"
AND metadata.annotations["stuck-hr-recovery.openova.io/auto-corrected-at"] == ""
→ kubectl annotate hr stuck-hr-recovery.openova.io/auto-corrected-at=<RFC3339>
→ kubectl patch hr --subresource=status --type=merge
status.conditions=[{type:Ready, status:True,
reason:ReconciliationSucceeded,
message:"auto-corrected from deployed-but-
unknown-Ready by stuck-hr-recovery
(TBD-A66)",
lastTransitionTime:<RFC3339>}]
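The detection predicate above translates to a small pure function. The sketch below is illustrative only: the CronJob implements this in shell with kubectl, and `helmRelease` is a hypothetical struct carrying just the inspected fields, not the Flux API type.

```go
package main

import (
	"fmt"
	"time"
)

const correctedAnnotation = "stuck-hr-recovery.openova.io/auto-corrected-at"

// helmRelease holds only the fields the detection branch inspects.
type helmRelease struct {
	ReadyStatus   string    // .status.conditions[type=Ready].status
	ReadySince    time.Time // lastTransitionTime of the Ready condition
	LatestHistory string    // .status.history[0].status
	Annotations   map[string]string
}

// needsAutoCorrect reports whether an HR matches the deployed-but-unknown-Ready
// branch and has not been corrected before.
func needsAutoCorrect(hr helmRelease, now time.Time, threshold time.Duration) bool {
	return hr.ReadyStatus == "Unknown" &&
		now.Sub(hr.ReadySince) > threshold &&
		hr.LatestHistory == "deployed" &&
		hr.Annotations[correctedAnnotation] == ""
}

// evaluate builds a sample HR that has been Unknown for ageSeconds and checks
// it against the default 5m threshold.
func evaluate(ageSeconds int, history, annotation string) bool {
	now := time.Now()
	hr := helmRelease{
		ReadyStatus:   "Unknown",
		ReadySince:    now.Add(-time.Duration(ageSeconds) * time.Second),
		LatestHistory: history,
		Annotations:   map[string]string{correctedAnnotation: annotation},
	}
	return needsAutoCorrect(hr, now, 5*time.Minute)
}

func main() {
	fmt.Println(evaluate(427, "deployed", ""))                     // stuck: act
	fmt.Println(evaluate(427, "deployed", "2026-05-19T21:18:00Z")) // already corrected: skip
}
```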
Safety / idempotency:
- Annotation acts as both audit trail AND idempotency guard. Re-runs
on an already-corrected HR skip immediately.
- If the status patch fails, the annotation is rolled back so the
next CronJob run re-attempts.
- Guardrail unchanged: >10 acted-on HRs in a single run → exit 1 +
operator alert.
- The 10-HR guardrail spans BOTH branches combined.
RBAC additions:
- helmreleases/status with verbs [patch, update] — status subresource
is a separate RBAC target in Kubernetes. Without this rule
`kubectl patch --subresource=status` returns 403.
Validation:
- tests/leader-election-and-recovery.sh: 6 → 7 cases (existing 6
issue #925 cases still PASS; new Case 7 covers TBD-A66 — script
contains history[0].status check, status-subresource patch verb,
audit annotation key, helmreleases/status ClusterRole verb, and
operator-greppable "auto-corrected from deployed-but-unknown-Ready"
audit string).
- Mock JSONPath replay against 4 synthetic HRs: branch B routes
deployed-but-unknown to status patch, branch A still handles
pending-install via the secret check, idempotency annotation
correctly skips re-run, healthy Ready=True HR is no-op.
Chart bump:
- platform/flux/chart/Chart.yaml: 1.2.2 → 1.2.3
- clusters/_template/bootstrap-kit/03-flux.yaml: bp-flux HR pin
1.2.2 → 1.2.3 (the existing pin for omantel/otech live clusters
sits at 1.1.3 — unchanged, those clusters are pre-#925 baseline).
Closure note:
- Refs #1989 (not Closes — closure happens when the t37 canonical
walk reaches handover successfully on a fresh prov).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(bp-flux): bump blueprint.yaml spec.version 1.2.2 → 1.2.3 (lockstep with Chart.yaml)
Companion to TBD-A66 / #1989 bump. CI gate
`TestBootstrapKit_BlueprintVersionLockstepSweep` (TBD-A20, #1856)
asserts blueprint.yaml spec.version == chart/Chart.yaml version per
platform/*. This was missed in the parent commit because the older bp-flux
bumps (1.2.1 → 1.2.2 etc.) predate the lockstep gate and needed no
companion bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: claude-bot <claude-bot@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the agent-slug -> binary mapping inside pty-server, closing the
B3 wiring hole identified in TBD-P4 #1986.
Design source: /tmp/p4-b3-design-spec.md (agent abfeafd7, 2026-05-19).
Files touched:
- products/sandbox/pty-server/internal/agentcatalog/agentcatalog.go (NEW)
Hardcoded 7-row table: 6 real-agent slugs in lock-step with the
FE / catalyst-api / chart-CRD enum, plus sovereign-shell as a
rescue row that's always present (black-screen prevention).
Lookup / AllSlugs / Resolve API + optional JSON override at
/etc/openova/sandbox-agents.json (path overridable via
OPENOVA_SANDBOX_AGENTS_PATH).
- products/sandbox/pty-server/internal/agentcatalog/agentcatalog_test.go (NEW)
7 unit tests: known slugs / unknown slug / override file /
override-supersedes-builtin / argv shape / env-merge precedence /
AllSlugs sorted+exhaustive + upstream-catalogue drift guard.
- products/sandbox/pty-server/internal/agentcatalog/export_test_helpers.go (NEW)
ResetCache helper for sibling-package tests.
- products/sandbox/pty-server/internal/server/routes.go
createRequest gains Agent + ExtraArgs + EnvMap fields. Exactly
one of {agent, command} required; unknown slug -> 400 with the
canonical list (NOT bash fallback); RequiredEnv presence check
surfaces missing wiring at create time. New lazySpawn helper
wires WS /sessions/{id}/attach to either ?agent= query or
SANDBOX_DEFAULT_AGENT env so the FE stays zero-touch when the
controller renders that env from spec.agentCatalogue[0].
- products/sandbox/pty-server/internal/server/routes_test.go
9 HTTP-level tests covering happy path / unknown slug 400 / both
set / neither set / missing required env / backward-compat
command path + 4 lazy-spawn scenarios (env-set, query-overrides,
neither -> 404, unknown slug surfaces invalid-agent).
- products/sandbox/pty-server/internal/session/manager.go
+CreateWithID for the lazy-spawn path, where the session id is
the Sandbox CRD name (carried in the WS URL) rather than a
pty-server-minted hex string.
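The slug lookup and the exactly-one-of rule can be sketched as below. This is a reduced illustration, not agentcatalog.go: the table has three rows instead of seven, the binary names (including `/bin/bash` for sovereign-shell) are assumptions, and the function names and error wording are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// catalog maps an agent slug to the binary to exec. Rows are illustrative.
var catalog = map[string]string{
	"qwen-code":       "qwen-code",
	"claude-code":     "claude-code",
	"sovereign-shell": "/bin/bash",
}

// allSlugs returns every known slug in sorted order.
func allSlugs() []string {
	slugs := make([]string, 0, len(catalog))
	for s := range catalog {
		slugs = append(slugs, s)
	}
	sort.Strings(slugs)
	return slugs
}

// resolve returns the binary for a slug. An unknown slug is an error that
// lists the valid slugs; there is no shell fallback.
func resolve(slug string) (string, error) {
	if bin, ok := catalog[slug]; ok {
		return bin, nil
	}
	return "", fmt.Errorf("unknown agent %q; valid agents: %s", slug, strings.Join(allSlugs(), ", "))
}

// validateCreate enforces that exactly one of agent or command is set.
func validateCreate(agent, command string) error {
	if (agent == "") == (command == "") {
		return errors.New("exactly one of agent or command is required")
	}
	return nil
}

func main() {
	fmt.Println(resolve("qwen-code"))
	fmt.Println(resolve("vim"))
	fmt.Println(validateCreate("qwen-code", "bash"))
}
```

Returning the canonical list in the error keeps a typo in the slug visible to the caller instead of silently landing in a shell.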
Design notes preserved:
- No new MCP env-injection code. The controller already renders
every relevant env var (NEWAPI_URL, OPENAI_*, LLM_GATEWAY_*,
ANTHROPIC_*, SANDBOX_*) on the pty-server StatefulSet at
gitops/manifests.go:321-359; session.New passes os.Environ()
through to exec.Cmd.Env at session.go:89.
- No chart bump. SANDBOX_DEFAULT_AGENT is consumed only if
rendered; pty-server falls back to the historic 404 behaviour
when the env is empty (forward-compat with current chart).
- B3-followup (SANDBOX_* rename on the pty-server StatefulSet to
match #1987's MCP Deployment) is deferred per the design spec.
Refs #1986
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pty-server image used `distroless/static-debian12:nonroot` which
shipped only the Go pty-server binary. `exec.Command("qwen-code")`
returned ENOENT — Pillar 4 of the end-user DoD (customer picks
`qwen-code` in Sandbox → agent launches with MCP) could not work on
any prov regardless of the controller/MCP wiring.
Swap the final stage to `node:22-bookworm-slim` and install the four
publicly fetchable agents architecture.md §1+§7 promises:
qwen-code npm @qwen-code/qwen-code (Node)
claude-code npm @anthropic-ai/claude-code (Node)
opencode npm opencode-ai (Node)
aider pip aider-chat (Python venv)
Symlink the slug form (`qwen-code`, `claude-code`) over the short
binary names the npm packages expose (`qwen`, `claude`) so the
existing `exec.Command(<slug>)` shape lights up without waiting on B3
(the slug→binary registry).
`cursor-agent` is intentionally not bundled — Cursor's product shape
is a cloud-hosted IDE companion, not a self-hosted CLI; the
analogous bring-your-own bridge for hosted vendors lives in
`claude-code-byos.md`.
Non-root posture preserved (runs as `node` uid 1000). `tini` added
for clean PID-1 signal propagation on session DELETE. Image grows
~580 MiB (distroless 14 MiB → ~600 MiB) — worth it: the four agents
are the Sandbox surface, and Pillar 4 cannot be GREEN without them
on PATH.
Chart bump: bp-sandbox 0.2.0 → 0.3.0 in both `platform/sandbox/chart/
Chart.yaml` and `clusters/_template/bootstrap-kit/19a-bp-sandbox.yaml`
so the next bootstrap-kit reconcile picks up the runtime image bump
the build-sandbox-pty-server workflow will commit on push.
Refs #1986 (TBD-P4 umbrella — B2 newapi default, B3 slug registry,
B4 MCP env-var drift remain).
Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TBD-P4 B4 — env-var name drift between the sandbox-controller and the
MCP plugin silently degraded every MCP tool family to "not configured"
at runtime. The controller emitted bare `ORG_ID` and `SOVEREIGN_FQDN`
on every rendered MCP Deployment while the MCP binary
(products/sandbox/mcp-server/internal/tools/env.go) reads the
namespaced canonical `SANDBOX_ORG_ID` / `SANDBOX_SOVEREIGN_FQDN`. Per
agent a99ea3aa's investigation, six additional env-var families the
MCP requires were never wired at all.
Surgical alignment across renderer + chart + controller wiring:
1. core/controllers/sandbox/internal/gitops/manifests.go — MCP
Deployment template renamed the bare names AND grew env entries
for the canonical set the MCP plugin reads:
Rename (MCP Deployment only; pty-server StatefulSet keeps the bare
names since they are inherited into the user's agent shell — that
is a distinct contract):
ORG_ID -> SANDBOX_ORG_ID (tool family: all)
SOVEREIGN_FQDN -> SANDBOX_SOVEREIGN_FQDN (tool family: all)
Added (the MCP plugin was reading them; controller wasn't emitting):
SANDBOX_ID -> identifies the Sandbox CR
SANDBOX_NAMESPACE -> rendered ns sandbox-<owner-uid>
SANDBOX_TENANT_ID -> scopes marketplace/byod handler
SANDBOX_GITEA_BASE_URL -> sandbox.deploy / gitea tool family
SANDBOX_GITEA_TOKEN (secret) -> ditto, via secretKeyRef optional
SANDBOX_DOMAIN_API_URL -> marketplace tool family
SANDBOX_MARKETPLACE_API_URL -> marketplace tool family
SANDBOX_STORAGE_S3_ENDPOINT -> sandbox.storage tool family
SANDBOX_STORAGE_S3_REGION -> ditto
SANDBOX_STORAGE_S3_USE_TLS -> ditto
SANDBOX_STORAGE_S3_ACCESS_KEY -> ditto, via secretKeyRef optional
SANDBOX_STORAGE_S3_SECRET_KEY -> ditto, via secretKeyRef optional
KEYCLOAK_ADMIN_URL -> sandbox.auth tool family
KEYCLOAK_PARENT_REALM -> ditto
KEYCLOAK_ADMIN_TOKEN (secret) -> ditto, via secretKeyRef optional
2. platform/sandbox/chart — bp-sandbox HR surfaces the new wiring as
chart-level values (mcp.giteaBaseURL, mcp.domainAPIURL,
mcp.storage.*, mcp.keycloak.*) defaulting to the in-cluster Service
DNS of a stock Sovereign install. Per-Sovereign overlays may
override any value. Secrets are NEVER written from this chart —
name+key references only with `optional: true` so a fresh-prov
Sovereign with a credential source in flight does NOT crash the
per-Sandbox MCP Pod; the affected tool family surfaces a clean
"not configured" error at call time (matches the MCP plugin's
existing per-tool guard pattern).
3. Chart.yaml + bootstrap-kit pin (19a-bp-sandbox.yaml) bumped to
0.2.0 so the per-Sovereign overlay picks up the new env surface
on the next reconcile.
4. sandbox_controller_test.go — extended deployment-mcp.yaml assertion
block to assert the canonical SANDBOX_* env-var set + value
plumbing AND added a negative assertion that the bare `ORG_ID` /
`SOVEREIGN_FQDN` names MUST NOT appear on the MCP Deployment
(they remain on the pty-server StatefulSet, distinct contract).
Regression test against future re-introduction of the drift.
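The per-tool guard pattern referenced in item 2 can be sketched roughly as follows. This is an illustration, not the code in products/sandbox/mcp-server/internal/tools/env.go: `missingEnv` and `guard` are hypothetical names and the error wording is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// missingEnv returns the keys for which lookup yields an empty value.
func missingEnv(lookup func(string) string, keys ...string) []string {
	var missing []string
	for _, k := range keys {
		if lookup(k) == "" {
			missing = append(missing, k)
		}
	}
	return missing
}

// guard returns a "not configured" error for a tool family when any of its
// env vars is absent, instead of crashing the process at startup.
func guard(family string, lookup func(string) string, keys ...string) error {
	if m := missingEnv(lookup, keys...); len(m) > 0 {
		return fmt.Errorf("%s: not configured (missing %s)", family, strings.Join(m, ", "))
	}
	return nil
}

func main() {
	err := guard("gitea", os.Getenv, "SANDBOX_GITEA_BASE_URL", "SANDBOX_GITEA_TOKEN")
	fmt.Println(err)
}
```

With a guard of this shape, a controller that emits only the bare `ORG_ID` leaves `SANDBOX_ORG_ID` empty, which is how the name drift degraded every tool family without any Pod crashing.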
Validation:
- go test ./sandbox/... — all green (controller / gitops / idlescaler
/ newapi / sandboxapi).
- helm template platform/sandbox/chart --set enabled=true ... — clean
render, 16 SANDBOX_MCP_* env vars emitted on the controller
Deployment.
Hard rules honoured:
- READ-ONLY against existing cluster (no kubectl writes).
- No Secret writes — name+key references only, all `optional: true`.
- emrah.baysal mailbox + Stalwart admin untouched.
- Principle #12 fresh clone validation.
Refs #1986
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #1979 (TBD-A50 layer 3, merged 18:00Z 2026-05-19) added the
idempotent ExternalIP reconciler as inline runcmd heredocs and bumped
the rendered cloud-init guardrail from 30720 to 31744. The ~3 KiB of
inline bash + systemd unit heredocs overshot the new headroom: t36
fresh-prov tofu plan FAILED with rendered control-plane cloud-init
at ~32498 B vs the 31744 B guardrail (754 B over). Issue #1981.
This PR repackages PR #1979 using the PR #1978 pattern that fixed the
analogous #1977 / TBD-A52 incident:
- Adds an `l3` subcommand to /usr/local/bin/openova-externalip-bootstrap.sh
(the same write_files script that hosts `l1` + `l2`). Same reconciler
logic — read /etc/openova/cp-public-ipv4, compare to Node ExternalIP,
restart k3s on mismatch, log to /var/log/openova-externalip.log.
- Adds two new write_files entries for the systemd .service + .timer
unit files (replaces the 3× cat-heredoc runcmd block).
- The runcmd L3 step collapses from 77 lines of inline heredocs to
a single line: `systemctl daemon-reload && systemctl enable --now
openova-extip-reconcile.timer`.
- Bumps the CP cloud-init guardrail from 31744 to 32256 (Hetzner hard
cap 32768 minus 512 B safety buffer), applied to both primary +
secondary CP preconditions in main.tf. The +512 B headroom buys
room for the next legitimate addition without re-tripping the gate.
## Behavior
Behavior identical to PR #1979 — same reconciler script, same exit
codes (0=ok, 2=no-file, 3=apiserver-unreachable, 4=unrecovered), same
systemd .service `SuccessExitStatus=0 2 3 4`, same .timer `OnBootSec=2min
/ OnUnitActiveSec=5min`. Diagnostic strings trimmed (~150 B saved) but
key tokens preserved (`OK`, `MISMATCH`, `RECOVERED`, `FATAL nofile`,
`FATAL apiserver`, `FATAL unrec`, `#1941` reference).
## Validation (Principle #15)
- `tofu validate infra/hetzner/` → Success
- Templatefile() measurement harness (`/tmp/measure-cloudinit/`,
same fixture PR #1978 used):
- pre-fix rendered: 31865 B (over fixture 30720 by 1145 B)
- post-fix rendered: 31130 B (under new 32256 guardrail with
1126 B headroom)
- savings: ~735 B vs PR #1979 baseline
- Production headroom (after +633 B fixture↔prod variance offset):
estimated 31763 B in prod, 493 B headroom under new 32256 guardrail.
- `shellcheck` on rendered bootstrap script: clean (only one pre-
existing SC2034 for loop counter `i`, present before this PR).
- Mock test 3-case battery (matching/missing-file/mismatch-recovers):
rc=0/2/0 with expected log tokens.
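The headroom figures above follow from simple subtraction. The sketch below reproduces them using the numbers quoted in this message; the constant and function names are illustrative, not taken from main.tf.

```go
package main

import "fmt"

const (
	hetznerHardCap = 32768                         // bytes, Hetzner hard cap
	safetyBuffer   = 512                           // bytes kept free below the cap
	guardrail      = hetznerHardCap - safetyBuffer // 32256
	prodVariance   = 633                           // fixture-to-prod offset, bytes
)

// headroom returns how many bytes remain under the guardrail;
// a negative result means the precondition would fail.
func headroom(rendered int) int {
	return guardrail - rendered
}

func main() {
	fixture := 31130
	fmt.Println(headroom(fixture))                // 1126
	fmt.Println(headroom(fixture + prodVariance)) // 493
}
```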
## Hard rules
- `Closes #1981` because acceptance is code-level (size proof + tofu
validate). Functional closure of #1941 (Refs only here) still depends on
a fresh-prov walk demonstrating that the timer fires and the log accumulates.
- READ-ONLY on cluster. No Secrets touched. No emrah.baysal email
/ Stalwart admin API touched.
Refs #1941, #1979, #1978, #1977, #1958, #966.
Co-authored-by: hatiyildiz <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three surgical fixes for the 11 cosmetic-guard regressions caught on
CI run 26112245005 (issue #1976 / TBD-A64). 8 of 11 deferred — see
TBD-A65..A71 for the architectural follow-up tickets.
1. wizard/steps/logoTone.ts:126
`alloy` tile background `#FFFFFF` → `#FD6F00` (canonical Grafana
Alloy swirl colour per grafana.com/oss/alloy hero). The vendored
Badge already paints a white glyph; on a white tile the mark was
invisible. Cosmetic-guards `logo tiles use canonical brand surface`
test now matches LOGO_SURFACE_CANON[alloy] = '#FD6F00'.
2. wizard/steps/stepComponentsCopy.ts:33-34 + StepComponents.tsx:920-941
Retired the legacy "Choose Your Stack" / "Always Included" labels
(renamed to "Components" / "Foundation") and dropped `role="tablist"`
+ `role="tab"` on the section toggle. Matches the canonical SME
marketplace single-grid pattern in
core/marketplace/src/components/AppsStep.svelte. The
`tab === 'choose' | 'always'` state machine stays — only the
operator-visible strings + ARIA semantics changed.
`stepDescription` rephrased to drop both legacy phrases.
StepComponents.test.tsx updated for the new labels + `aria-pressed`.
3. sovereign/AppDetail.tsx:806-859
`data-testid="sov-app-tab-${id}"` alias exposed on every TabButton
via an absolutely-positioned aria-hidden span overlay. A single DOM
node can't carry two `data-testid` values, so the primary
`app-tab-${id}` stays on the <button> for back-compat with the
AppDetail.test.tsx matrix. Unblocks the 22+ existing
`sov-app-tab-*` Playwright selectors in application-pages-t-o-p,
continuum-dr-section, compliance-dashboards, and rbac-membership
that have been broken since the rename.
Chart bump: bp-catalyst-platform 1.4.208 → 1.4.209.
Bootstrap-kit pin: 13-bp-catalyst-platform.yaml 1.4.208 → 1.4.209.
Refs #1976 TBD-A64.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>