Commit Graph

264 Commits

Author SHA1 Message Date
e3mrah
2ff50f0591
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on
fresh Sovereign):

#952 — bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls
PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar}
anonymously and gets 403 Forbidden. Fix:

- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector
  reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl
  so bp-reflector mirrors the source Secret into the namespace
  automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.

#953 — services-build.yaml's image-rewrite loop only matched the
hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8
sme-services templates use `image: "{{ ... }}/services-<svc>:{{
.Values.images.smeTag }}"`. Each services-build run bumped only
auth.yaml while reporting "update sme service images to ${SHA}",
leaving the live Pod on stale bytes (PR #951's #941 fix never reached
services-catalog despite the merge + chart bump chain). Fix:

- After the hardcoded loop, also bump `images.smeTag` in
  products/catalyst/chart/values.yaml with a strict regex match
  (`^  smeTag: "<sha>"$`); refuse to auto-bump if the line shape
  changes (defends against silent drift if a contributor renames the
  field).
- Mirror the change into the retry-path `rewrite()` function so a
  reset-to-origin/main retry does not recreate the original bug.
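
The strict-match bump described above can be sketched in Go. This is an
illustration only: the workflow itself uses sed, and the function name
`bumpSmeTag` and the hex-only value pattern are assumptions, not taken
from services-build.yaml.

```go
package main

import (
	"fmt"
	"regexp"
)

// smeTagLine matches a line of the exact shape `  smeTag: "<sha>"`
// (two-space indent, double-quoted value). Restricting the value to
// lowercase hex is an assumption made for this sketch.
var smeTagLine = regexp.MustCompile(`(?m)^  smeTag: "[0-9a-f]+"$`)

// bumpSmeTag rewrites the smeTag line to newSHA. It refuses to edit when
// the line is missing, duplicated, or no longer has the expected shape.
func bumpSmeTag(values, newSHA string) (string, error) {
	if n := len(smeTagLine.FindAllStringIndex(values, -1)); n != 1 {
		return "", fmt.Errorf("expected exactly one smeTag line, found %d; refusing to auto-bump", n)
	}
	return smeTagLine.ReplaceAllLiteralString(values, `  smeTag: "`+newSHA+`"`), nil
}

func main() {
	out, err := bumpSmeTag("images:\n  smeTag: \"2ff50f0591\"\n", "689276889c")
	fmt.Printf("%q %v\n", out, err)

	// A renamed field no longer matches, so the bump is refused.
	_, err = bumpSmeTag("images:\n  tag: \"2ff50f0591\"\n", "689276889c")
	fmt.Println(err)
}
```

Refusing on anything other than exactly one match is what turns a
renamed field into a loud failure instead of a silent no-op.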

Tests:

- platform/newapi/chart/tests/imagepullsecrets-render.sh — 4 cases
  asserting the Deployment and channel-seed Job carry the default
  ghcr-pull reference, that an empty override suppresses the block,
  and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh — 3 cases reproducing
  the workflow's rewrite logic on a sandboxed copy of the live
  chart, asserting both auth.yaml's hardcoded line AND values.yaml's
  smeTag get bumped, that helm-render of the catalyst chart with
  the bumped values produces all 8 SME-service Deployments at the
  new SHA, and that an idempotent re-bump to a second SHA also lands
  cleanly.

Refs: #952 #953 (umbrella #915 — alice signup gate 5).

Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:47:37 +04:00
e3mrah
689276889c
fix(bp-catalyst-platform+bp-newapi): unblock alice signup gates 2-6 on Sovereigns (#915) (#951)
Six coupled chart + orchestrator fixes that unblock alice marketplace
signup → tenant ready → SaaS integrations → LLM → ledger on a freshly
franchised Sovereign. C5-final got Gate 1 GREEN on otech113 (2026-05-05)
but every downstream gate failed because the SME bundle hardcoded
contabo-only assumptions.

Bumps:
  - bp-catalyst-platform 1.4.21 → 1.4.22
  - bp-newapi             1.3.0 → 1.4.0
  - bootstrap-kit slot 13 + 80 pins updated in lockstep

Issues addressed (single consolidated PR — smaller PRs would race
against alice signup retries):

  - #934 (auth SMTP empty → "failed to send email"): sme-secrets.yaml
    now reads SMTP_* from `catalyst-system/sovereign-smtp-credentials`
    (the same A5-seeded source from #883/#905 that the chart 1.4.20
    catalyst-openova-kc-credentials Secret already uses) with source-wins
    precedence. Both canonical (smtp-host/port/from/user/pass) AND
    legacy (host/port/from/user/password) source-Secret key shapes
    accepted. Empty source falls back to chart-level defaults so the
    contabo path stays clean.

  - #940 (provisioning service GITHUB_TOKEN placeholder + hardcoded
    upstream github.com): chart values
    .Values.smeServices.provisioning.{githubToken,git.{apiURL,owner,
    repo,branch}} make every GitHub-API coordinate operator-overridable
    with topology-aware defaults (Sovereign ⇒ in-cluster Gitea REST
    API + `openova` org; contabo ⇒ api.github.com + `openova-io` org).
    Provisioning binary's startup gate validates the GITHUB_TOKEN does
    NOT contain placeholder substrings (<placeholder>, PLACEHOLDER,
    REPLACE_ME, ...) and crashes the Pod into Pending if it does — the
    operator sees the misconfig immediately instead of after alice
    signups have failed silently in service logs. GitHub client now
    accepts a custom API URL via NewClientWithAPIURL so Gitea's GitHub-
    compatible /api/v1 surface drops in without re-implementing the
    client.

  - #941 (catalog "27 apps COMING SOON"): added `openclaw` and
    `stalwart-mail` to migrateAppDeployable's deployable map at
    core/services/catalog/handlers/seed.go. Both blueprints (bp-openclaw,
    bp-stalwart-{sovereign,tenant}) ship with visibility=listed in the
    embedded blueprints.json AND have working SME-tenant overlay
    templates in sme_tenant_gitops.go, but the catalog handler silently
    filtered them out because they were missing here. Map extracted to
    DeployableAppSlugs() exported function so unit tests can assert
    membership without invoking a Mongo store.

  - #942 (REDPANDA_BROKERS hardcoded to talentmesh): configmap.yaml
    selects broker default at render time based on global.sovereignFQDN
    — Sovereign ⇒ NATS JetStream Service per ADR-0001 (the only local
    bus on Sovereigns); contabo ⇒ legacy Redpanda Service in talentmesh.
    Operator MAY override either default via
    .Values.smeServices.eventBus.brokers without forking the chart.
    The ConfigMap key name stays REDPANDA_BROKERS for back-compat with
    existing SME service Go env wiring; new EVENT_BUS_PROTOCOL key
    surfaces the protocol hint for services that want to switch wire
    format independently.

  - #943 (bp-newapi silently skips Deployment): NEW
    templates/cnpg-cluster.yaml auto-provisions a CNPG-backed Postgres
    Cluster + Helm-`lookup`-persistent DSN Secret when
    .Values.cnpg.enabled (DEFAULT true). NEW templates/credentials-
    secret.yaml auto-generates SESSION_SECRET + CRYPTO_SECRET (each
    64-char randAlphaNum, persistent across reconciles via Helm
    `lookup`) when .Values.credentials.autoProvision (DEFAULT true).
    deployment.yaml gate now resolves Secret names from the chart-
    emitted defaults when the operator hasn't supplied an override.
    Capabilities-gated on postgresql.cnpg.io/v1 so a cold install
    before bp-cnpg is Ready surfaces as "no Cluster yet" rather than
    a hard install error.

  - #944 (CRITICAL — cross-cluster pollution): provisioning.yaml
    templates GIT_BASE_PATH from
    .Values.smeServices.provisioning.gitBasePath with a topology-aware
    default `clusters/<sovereignFQDN>/sme-tenants` on Sovereigns. NEW
    `core/services/provisioning/gitguard` package validates at startup
    AND on every commit code path that the path begins with
    `clusters/<self-FQDN>/` — refusing to commit to any other cluster's
    tree. Defence in depth so a runtime env mutation (kubectl exec,
    ConfigMap update without Pod restart, hostile sidecar) cannot
    bypass the check. Pre-#944 every alice tenant overlay landed in
    upstream openova/openova `clusters/contabo-mkt/tenants/<id>/`
    which contabo Flux would then install on the contabo cluster —
    C5-final caught + reverted the alice2 incident at commit 5715db04.
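
The two startup checks from #940 and #944 can be sketched as follows.
This is a minimal illustration, not the gitguard package itself: the
function names `ValidateToken` and `ValidatePath` and the example FQDN
are hypothetical, and the placeholder list contains only the three
markers the message names (the original list is longer).

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// placeholderMarkers holds the substrings named in the commit message.
// The real list is elided there ("..."), so this one is incomplete.
var placeholderMarkers = []string{"<placeholder>", "PLACEHOLDER", "REPLACE_ME"}

// ValidateToken rejects empty tokens and tokens containing a marker.
func ValidateToken(token string) error {
	if token == "" {
		return errors.New("token is empty")
	}
	for _, m := range placeholderMarkers {
		if strings.Contains(token, m) {
			return fmt.Errorf("token contains placeholder %q", m)
		}
	}
	return nil
}

// ValidatePath accepts p only if, after cleaning, it lies strictly under
// clusters/<selfFQDN>/. Cleaning first defeats ../ traversal, and the
// trailing slash in the prefix defeats prefix collisions such as
// clusters/<selfFQDN>-evil/.
func ValidatePath(selfFQDN, p string) error {
	if selfFQDN == "" {
		return errors.New("self FQDN is empty")
	}
	prefix := "clusters/" + selfFQDN + "/"
	if cleaned := path.Clean(p); !strings.HasPrefix(cleaned, prefix) {
		return fmt.Errorf("path %q is outside %q; refusing to commit", cleaned, prefix)
	}
	return nil
}

func main() {
	const self = "otech113.example.test" // hypothetical Sovereign FQDN
	for _, p := range []string{
		"clusters/otech113.example.test/sme-tenants/alice",
		"clusters/contabo-mkt/tenants/alice2",
		"clusters/otech113.example.test/../contabo-mkt/tenants/alice2",
		"clusters/otech113.example.test-evil/sme-tenants/alice",
	} {
		fmt.Printf("%-62s %v\n", p, ValidatePath(self, p))
	}
	fmt.Println(ValidateToken("ghp_REPLACE_ME"))
}
```

Calling the same function at startup and again on each commit path is
what gives the defence in depth described above: a value mutated after
boot is still checked before anything is written.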

Tests:
  - core/services/provisioning/gitguard: 22 cases covering Sovereign
    + contabo + traversal + prefix-collision + placeholder token
  - core/services/catalog/handlers: openclaw/stalwart-mail in
    deployable map + stable-shape lock against accidental deletes
  - helm-template smoke pass: bp-newapi (default values renders
    Deployment + auto-provisioned Secrets); bp-catalyst-platform
    (Sovereign render shows GIT_BASE_PATH=clusters/otech113.../sme-
    tenants, REDPANDA_BROKERS=nats-jetstream..., GITHUB_OWNER=openova,
    GITHUB_API_URL=http://gitea-http...)

Closes #934 #940 #941 #942 #943 #944
Refs umbrella #915

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:27:23 +04:00
e3mrah
890fa67eff
fix(bp-harbor): inline labels on admin Secret to drop duplicate keys (#949) (#950)
PR #947 (bp-harbor 1.2.14) added templates/admin-secret.yaml that
included the canonical bp-harbor.labels helper AND re-declared
app.kubernetes.io/name + catalyst.openova.io/component with admin-
credential-specific values. Helm's strict YAML post-render parser
rejected the rendered manifest with `mapping key
"app.kubernetes.io/name" already defined at line 8`, blocking the
upgrade chain on otech113 — bp-self-sovereign-cutover dependsOn
bp-harbor and re-blocked, stalling cutover indefinitely.

Per the issue's recommended Option A, labels are inlined verbatim
on the admin Secret. Every key the helper would emit is reproduced
explicitly; the two that need a Secret-specific value
(app.kubernetes.io/name and catalyst.openova.io/component=harbor-admin)
are each declared exactly once, plus an explicit admin-credentials
sub-component label.

A regression guard (Case 6) is added to tests/admin-secret.sh: the
rendered Secret block is parsed through PyYAML's safe_load_all,
which enforces mapping-key uniqueness the same way Helm's post-
render does. Duplicate keys raise and break the test.
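
For illustration, the uniqueness property the guard protects can be
checked on a single flat mapping with a few lines of Go. This is a
deliberately simplified line-based sketch, not a YAML parser and not
the test's actual implementation; `duplicateKeys` is a hypothetical
name and the label values are examples.

```go
package main

import (
	"fmt"
	"strings"
)

// duplicateKeys scans one flat "key: value" block (for example a rendered
// labels mapping) and returns every key that appears more than once.
// Nested mappings and multi-line values are out of scope for this sketch.
func duplicateKeys(block string) []string {
	seen := map[string]bool{}
	var dups []string
	for _, line := range strings.Split(block, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, _, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		if seen[key] {
			dups = append(dups, key)
		}
		seen[key] = true
	}
	return dups
}

func main() {
	labels := `
    app.kubernetes.io/name: harbor
    catalyst.openova.io/component: harbor
    app.kubernetes.io/name: harbor-admin
`
	fmt.Println(duplicateKeys(labels))
}
```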

Bumps:
  - platform/harbor/chart/Chart.yaml    1.2.14 → 1.2.15
  - clusters/_template/bootstrap-kit/19-harbor.yaml  slot pin

Verification (all green locally):
  helm template smoke . --namespace harbor   # renders OK
  bash tests/admin-secret.sh                 # 6 gates green
  helm lint .                                # 0 failed

Closes one half of #949 (bp-harbor side); the slot pin update
delivers it to fresh Sovereigns; existing otech113 picks up the
upgrade on next Flux reconcile after the new chart publishes.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-05 15:19:17 +04:00
e3mrah
88a8ecd8bb
fix(cutover): Reflector-mirror harbor-admin Secret + in-cluster trigger endpoint (#935) (#947)
Two bugs surfaced live on otech113 2026-05-05 blocking Self-Sovereignty
Cutover end-to-end. Fix both in lockstep:

Bug 1 — bp-self-sovereign-cutover Step 02 (harbor-projects) Job in
`catalyst` namespace was hitting `secret "harbor-core" not found` for
11+ retries because the upstream Harbor `harbor-core` Secret only
exists in the `harbor` namespace and Kubernetes forbids cross-namespace
secretKeyRef. Step 02 was stuck in CreateContainerConfigError forever.

  Fix: bp-harbor 1.2.13 → 1.2.14 ships a Catalyst-curated `harbor-admin`
  Secret in the `harbor` namespace with Reflector mirror annotations
  (allowed-namespaces=catalyst, auto-enabled). The same Secret name
  auto-materialises in `catalyst` so the cutover Job's secretKeyRef
  resolves natively. Password is randomly generated on first install
  (32-char alphanum, 190 bits entropy per feedback_passwords.md) and
  preserved across reconciles via `lookup`. The upstream Harbor subchart
  consumes it via `existingSecretAdminPassword: harbor-admin`.
  bp-self-sovereign-cutover 0.1.16 → 0.1.17 updates
  `harbor.adminSecretRef.name` from `harbor-core` to `harbor-admin`.
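
A rough Go equivalent of the lookup-or-generate behaviour is sketched
below. The chart does this with Helm's `lookup` and template functions;
the function names here are hypothetical. The entropy figure follows
from 32 characters over a 62-symbol alphabet: 32 x log2(62), roughly
190.5 bits.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math"
	"math/big"
)

const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"

// generatePassword returns n characters drawn uniformly from alphabet.
// rand.Int samples without modulo bias.
func generatePassword(n int) (string, error) {
	limit := big.NewInt(int64(len(alphabet)))
	buf := make([]byte, n)
	for i := range buf {
		idx, err := rand.Int(rand.Reader, limit)
		if err != nil {
			return "", err
		}
		buf[i] = alphabet[idx.Int64()]
	}
	return string(buf), nil
}

// lookupOrGenerate keeps an existing value (the lookup hit on later
// reconciles) and generates only when nothing is stored yet.
func lookupOrGenerate(existing string, n int) (string, error) {
	if existing != "" {
		return existing, nil
	}
	return generatePassword(n)
}

// entropyBits is the entropy of n uniform draws from alphabetSize symbols.
func entropyBits(n, alphabetSize int) float64 {
	return float64(n) * math.Log2(float64(alphabetSize))
}

func main() {
	first, err := lookupOrGenerate("", 32)
	if err != nil {
		panic(err)
	}
	second, _ := lookupOrGenerate(first, 32)
	fmt.Println(len(first), first == second)
	fmt.Printf("%.1f bits\n", entropyBits(32, len(alphabet)))
}
```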

Bug 2 — The 0.1.16 auto-trigger Helm post-install Job (#933) POSTed
/api/v1/sovereign/cutover/start which sits behind RequireSession
middleware. The Job has no human session cookie — every request 401'd
forever and cutover never started.

  Fix: new catalyst-api endpoint POST /api/v1/internal/cutover/trigger
  lives OUTSIDE RequireSession and validates the bearer token via the
  apiserver's TokenReview API + checks the resolved username matches
  the canonical `bp-self-sovereign-cutover-runner` SA. Same engine,
  same idempotency, same state machine — different auth surface.
  The auto-trigger Job now mounts its projected SA token at
  /var/run/secrets/kubernetes.io/serviceaccount/token and sends it
  as `Authorization: Bearer <token>`. SA username + accepted list are
  runtime-overridable per Inviolable Principle #4.

Tests
  - 6 Go unit tests for HandleCutoverInternalTrigger covering happy
    path, missing bearer (401), TokenReview rejection (502), wrong SA
    (403), idempotency (no Jobs created when complete), wrong method
    (405). All pass.
  - bp-harbor admin-secret contract test (5 cases) — Secret renders,
    HARBOR_ADMIN_PASSWORD key present, Reflector annotations, keep
    policy, upstream consumes via existingSecretAdminPassword.
  - bp-self-sovereign-cutover cutover-contract test extended with 3
    new cases — auto-trigger uses /internal/cutover/trigger, sends
    SA bearer token, references harbor-admin (not harbor-core).
  - All 12 cutover-contract gates green; all 4 observability-toggle
    gates green; helm template + helm lint clean on both charts.

Bootstrap-kit slot pins
  - clusters/_template/bootstrap-kit/19-harbor.yaml: 1.2.13 → 1.2.14
  - clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml:
    0.1.16 → 0.1.17

Closes #935

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:12:50 +04:00
e3mrah
e9a72aa00d
feat(self-sovereign-cutover): auto-trigger on install + always-defined State (#933 E1) (#936)
Closes the otech113 dashboard regression where SovereigntyCard rendered
`invalid CutoverState: <undefined>` instead of a Tethered badge, and
makes the Day-2 cutover fire automatically once the chart lands rather
than waiting for an operator click on "Achieve True Sovereignty".

Founder rule per #933: handover is not "done" until cutover has run;
the operator must NOT have to click a CTA on
console.<sov-fqdn>/console/dashboard.

Three coupled changes:

1. catalyst-api: cutoverStatusResponse now ALWAYS emits a `state` field
   ("tethered" or "sovereign"), derived from cutoverComplete. The UI's
   branded parseCutoverState rejects empty/undefined, which is what
   was rendering the user-visible error text. Tests cover the empty
   ConfigMap, missing cutoverComplete, and explicit-true cases.

2. UI parseCutoverStatus: defensive fallback when wire frame omits
   `state` — derive from cutoverComplete (default "tethered"). Hostile/
   typo'd state values (e.g. 'pending', '') still throw via the branded
   parser. Defends against partial-rollout where a stale catalyst-api
   Pod is still serving the old shape.

3. bp-self-sovereign-cutover 0.1.16 (chart): new Helm post-install/
   post-upgrade hook (templates/10-auto-trigger-job.yaml) POSTs
   /api/v1/sovereign/cutover/start on catalyst-api after the step
   ConfigMaps + RBAC land. Idempotent via catalyst-api's durable
   status ConfigMap (200 if already complete, 409 if running, 200
   to start). Fails open: a transient catalyst-api unreachability
   exits 0 so the chart install doesn't block; operator can always
   re-fire via the manual CTA. Gated on .Values.trigger.auto (default
   true; per-Sovereign overlays can disable for soak Sovereigns).
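
Changes 1 and 2 together amount to the parsing rule sketched below. The
UI side is TypeScript; this Go version only illustrates the logic, and
the type and function names are hypothetical. The field names `state`
and `cutoverComplete` are the ones named above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type CutoverState string

const (
	StateTethered  CutoverState = "tethered"
	StateSovereign CutoverState = "sovereign"
)

// deriveState maps the completion flag to the state the API now always emits.
func deriveState(cutoverComplete bool) CutoverState {
	if cutoverComplete {
		return StateSovereign
	}
	return StateTethered
}

// statusFrame mirrors the wire shape. State is a pointer so an omitted
// field can be told apart from an explicit empty string.
type statusFrame struct {
	State           *string `json:"state"`
	CutoverComplete bool    `json:"cutoverComplete"`
}

// parseCutoverStatus falls back to deriveState when the frame omits
// state, and rejects any state value outside the two known ones.
func parseCutoverStatus(raw []byte) (CutoverState, error) {
	var f statusFrame
	if err := json.Unmarshal(raw, &f); err != nil {
		return "", err
	}
	if f.State == nil {
		return deriveState(f.CutoverComplete), nil
	}
	switch s := CutoverState(*f.State); s {
	case StateTethered, StateSovereign:
		return s, nil
	default:
		return "", fmt.Errorf("invalid CutoverState: %q", *f.State)
	}
}

func main() {
	for _, raw := range []string{
		`{}`,
		`{"cutoverComplete":true}`,
		`{"state":"sovereign","cutoverComplete":true}`,
		`{"state":"pending"}`,
	} {
		s, err := parseCutoverStatus([]byte(raw))
		fmt.Printf("%-46s state=%q err=%v\n", raw, s, err)
	}
}
```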

Hard rules honoured:
- No contabo Pods touched.
- Existing tethered Sovereigns that have not cut over stay tethered:
  the auto-trigger Job is in the chart (per-Sovereign), not in the
  mothership; only fresh Sovereign installs of bp-self-sovereign-cutover
  0.1.16+ get it.
- IaC-first: the auto-trigger uses catalyst-api's existing /start
  endpoint (no bespoke cluster mutation outside the chart).
- Event-driven: post-install hook fires on chart install (no cron).

Verification:
- Go: cutover_test.go +TestBuildCutoverStatusResponse_StateAlwaysDefined
  +TestHandleCutoverStatus_StateFieldEmittedOnFreshSovereign — both
  green.
- TS: cutover.test.ts +5 cases for parseCutoverStatus state-fallback;
  35/35 green. Sovereignty widget tests 20/20 green.
- Chart: tests/cutover-contract.sh +Case 8/9 (auto-trigger present by
  default, absent under trigger.auto=false); helm template renders
  cleanly.

Co-authored-by: Hatice Yildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:40:52 +04:00
e3mrah
9077016466
feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924) (#931)
Phase-2 follow-up to #883: replace mothership Stalwart relay
(mail.openova.io:587) with a Sovereign-local Stalwart so Console
PIN/magic-link mail originates from `noreply@<sovereignFQDN>` with
per-Sovereign SPF/DKIM/DMARC posture, eliminating the mothership
SMTP SPOF for Sovereign Console login.

What ships:

  1. NEW blueprint platform/stalwart-sovereign/ (otech-level — distinct
     from per-tenant bp-stalwart-tenant). Single Stalwart instance per
     Sovereign cluster, scoped to Sovereign Console system mail. NO
     Keycloak OIDC, NO webmail UI — Sovereign Console is the only
     consumer. Auto-provisioned admin + submission Secrets via the
     lookup-or-generate pattern (#898/#830/#887). Post-install Job:
       - registers the noreply submission principal in Stalwart
       - allows send-as for noreply@<sovereignFQDN>
       - reads DKIM public key, patches dns-records ConfigMap
       - materialises catalyst-system/sovereign-smtp-credentials with
         Sovereign-local infrastructure addresses + credentials,
         carrying BOTH key shapes (smtp-user/smtp-pass + legacy
         user/password) so the consumer chart works either way.

  2. NEW bootstrap-kit slot 95 (clusters/_template/bootstrap-kit/
     95-bp-stalwart-sovereign.yaml). dependsOn: bp-cert-manager,
     bp-catalyst-platform. Sequenced after bp-catalyst-platform (slot
     13) so the chart's post-install Job lands its mirror Secret in
     an already-existing catalyst-system namespace.

  3. bp-catalyst-platform 1.4.19 → 1.4.20: SOURCE-wins precedence
     extended to (a) non-secret fields smtp-host/smtp-port/smtp-from
     so Sovereign-local infra addresses (`mail.<sovereignFQDN>`) take
     over from mothership defaults (`mail.openova.io`) on the next
     reconcile after slot 95 lands, and (b) canonical key shape
     `smtp-user`/`smtp-pass` in addition to legacy `user`/`password`
     source key shape.

  4. expected-bootstrap-deps.yaml: declare slot 95 graph edge.

  5. catalyst-api handler/sovereign_smtp_seed.go: documentation-only
     update to note this Phase-1 step is now a graceful fallback —
     the Phase-2 chart's post-install Job overwrites the mirror
     Secret on first reconcile so the cutover from mothership relay
     to Sovereign-local relay is automatic, no operator action.
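
The source-wins precedence from item 3 can be sketched as a small
resolver. The chart implements this in Helm templates; the Go below is
only an illustration, the function names are hypothetical, and
preferring the canonical key over the legacy key when both are present
is an assumption.

```go
package main

import "fmt"

// firstNonEmpty returns the first non-empty value, or "" if none is set.
func firstNonEmpty(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// resolve applies source-wins precedence for one field: the source Secret
// (canonical key, then legacy key) beats the chart-level default.
func resolve(source map[string]string, canonicalKey, legacyKey, chartDefault string) string {
	return firstNonEmpty(source[canonicalKey], source[legacyKey], chartDefault)
}

func main() {
	// Hypothetical source Secret after slot 95 has landed.
	source := map[string]string{
		"smtp-host": "mail.sov.example.test",
		"user":      "noreply", // legacy key shape
	}
	fmt.Println(resolve(source, "smtp-host", "host", "mail.openova.io"))
	fmt.Println(resolve(source, "smtp-user", "user", ""))

	// Empty source: the chart default is used.
	fmt.Println(resolve(nil, "smtp-host", "host", "mail.openova.io"))
}
```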

Verification:
  - `helm template smoke ./platform/stalwart-sovereign/chart` clean
    (smoke-render-safe; per-template gates skip when sovereignFQDN unset).
  - `helm template smoke -f operator-values.yaml` emits StatefulSet,
    LoadBalancer Service, ClusterIP HTTP Service, DKIM-signing config,
    dns-records ConfigMap, Setup Job + RBAC.
  - `chart/tests/sovereign-render.sh` 3 cases all PASS.
  - `helm template smoke ./products/catalyst/chart` (1.4.20) clean.
  - `helm lint` both charts: clean (only icon-recommended INFO).
  - `bash scripts/check-bootstrap-deps.sh` PASSED — bootstrap-kit
    dependency graph audit, 0 drift, 0 cycles.
  - `go test -run TestSeedSovereignSMTP` — Phase-1 seed tests pass.
  - `go test -run TestBootstrapKit_TemplateClusterParses` — slot 95
    YAML parses cleanly.

Out of scope (sub-PR follow-up under #924):
  - DKIM keypair generation in catalyst-api orchestrator + DNS records
    (MX/A/SPF/DMARC/DKIM-pubkey) registration via PDM dynadot adapter
    at omani.works.
  - Hetzner PTR (rDNS) auto-registration via the Hetzner cloud API.
  - Cert-manager Certificate adding mail.<sovereignFQDN> SAN to the
    Sovereign wildcard cert (chart relies on the existing wildcard
    cert from bp-catalyst-platform 1.4.0+'s per-zone Certificate
    template — when that wildcard chain covers the Sovereign FQDN,
    `mail.<sovereignFQDN>` is already covered).

Acceptance (lands when sub-PR follow-up ships):
  - Sovereign Console PIN delivery uses noreply@<sov-fqdn>.
  - External mail server (e.g. Gmail) accepts mail with valid SPF + DKIM.
  - Mothership SMTP is no longer a SPOF for Sovereign Console login.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:20:16 +04:00
e3mrah
3fe27f625f
feat(bp-wordpress-tenant): wp-cli OIDC bootstrap + oidc.* canonical block (0.2.0, #915) (#927)
Umbrella issue #915 (D1 sub-task). Aligns the chart's post-install OIDC
config Job with the canonical wp-cli flow and with the bp-keycloak
tenant-realm contract that C1's PR #918 ships.

Chart 0.2.0
-----------
- templates/oidc-config-job.yaml rewritten to use the official
  wordpress:cli-2.12.0-php8.3 image (manifest-list digest pinned per
  Inviolable Principle #4). Replaces direct PHP/SQL UPSERTs against
  wp_options with:
    * wp core install (idempotent: wp core is-installed)
    * wp plugin install openid-connect-generic --activate (idempotent:
      wp plugin is-installed)
    * wp option update openid_connect_generic_settings <json>
    * wp option update default_role
    * wp theme install/activate
    * wp option update siteurl/home
  Going through wp-cli (i.e. WordPress core's own PHP API) is more
  resilient than schema-shape-dependent INSERT statements and survives
  WordPress minor upgrades.

- values.yaml: new canonical oidc.* block —
    oidc.{enabled, issuerURL, clientId, clientSecretName, defaultRole,
          identityKey, roleMapping, cliImage}.
  Default oidc.clientSecretName = "wordpress-oidc-client-secret"
  matches the K8s Secret bp-keycloak's PR #918 emits alongside the
  realm import ConfigMap (so the realm JSON's `secret` field and the
  Secret bytes never drift).

- Legacy keycloak.{realmURL, clientID, clientSecretName} kept as a
  back-compat alias. _helpers.tpl folds it into oidc.* when the
  modern keys are at their values.yaml defaults so chart 0.1.x
  clusters keep reconciling. Removed in chart 0.3.0.

- oidc.defaultRole=subscriber — newly auto-created SSO users land
  with subscriber capability (operator overrides via overlay).

- Redirect URIs: the openid-connect-generic plugin's default callback
  is /wp-admin/admin-ajax.php?action=openid-connect-authorize when
  alternate_redirect_uri=0 (we set 0). bp-keycloak (PR #918)
  registers the same URL plus /wp-login.php and a /* wildcard, so the
  client's allowed-redirect-URI list aligns with what the plugin
  actually issues.
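
The legacy keycloak.* fold described above can be sketched as follows.
The chart does this in _helpers.tpl; this Go version is illustrative
only. Folding per key (rather than all-or-nothing) is an assumption, as
are the empty issuerURL default and the `wordpress` clientId default;
only the clientSecretName default comes from the text above.

```go
package main

import "fmt"

// OIDC holds the modern oidc.* keys relevant to the fold.
type OIDC struct {
	IssuerURL        string
	ClientID         string
	ClientSecretName string
}

// LegacyKeycloak holds the chart 0.1.x keycloak.* alias keys.
type LegacyKeycloak struct {
	RealmURL         string
	ClientID         string
	ClientSecretName string
}

// pick returns the legacy value only when the modern key is still at its
// default and a legacy value was supplied.
func pick(modern, def, legacy string) string {
	if modern == def && legacy != "" {
		return legacy
	}
	return modern
}

// foldLegacy folds keycloak.* into oidc.* key by key.
func foldLegacy(modern, defaults OIDC, legacy LegacyKeycloak) OIDC {
	return OIDC{
		IssuerURL:        pick(modern.IssuerURL, defaults.IssuerURL, legacy.RealmURL),
		ClientID:         pick(modern.ClientID, defaults.ClientID, legacy.ClientID),
		ClientSecretName: pick(modern.ClientSecretName, defaults.ClientSecretName, legacy.ClientSecretName),
	}
}

func main() {
	defaults := OIDC{
		IssuerURL:        "",          // assumed default
		ClientID:         "wordpress", // assumed default
		ClientSecretName: "wordpress-oidc-client-secret",
	}
	// A chart 0.1.x overlay that only sets the legacy alias.
	legacy := LegacyKeycloak{RealmURL: "https://keycloak.alice.example.test/realms/sme-alice"}
	fmt.Printf("%+v\n", foldLegacy(defaults, defaults, legacy))
}
```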

Orchestrator emit
-----------------
- products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
  smeTenantBPWordPress now emits the canonical oidc.* block AND the
  legacy keycloak.* alias (for chart 0.1.x clusters mid-upgrade).

Tests
-----
- chart/tests/oidc-config.sh — 7 helm-template assertions:
    1. Canonical oidc.* render produces a Job with the required
       wp-cli command flow + wordpress:cli-2.12.0-php8.3 image.
    2. Legacy keycloak.* fold path (chart 0.1.x compat).
    3. oidc.enabled=false short-circuits the Job.
    4. alternate_redirect_uri=0 (so plugin URL matches the realm-
       registered redirect URI from PR #918).
    5. defaultRole rendered + propagated.
    6. Render YAML is parseable and contains all required kinds.
    7. wp-content PVC mounted in the Job (so pg4wp's db.php drop-in
       loads — failure here would silently fall back to mysqli).

- internal/handler/sme_tenant_test.go:
    * TestRenderSMETenantOverlay_WordPressEmitsOIDC — pins the
      canonical oidc.* block + legacy keycloak.* alias the
      orchestrator emits for the alice@omantel test fixture.
    * TestRenderSMETenantOverlay_WordPressOIDC_BYOMode — BYO domain
      mode renders wordpress.<byo-domain> as the ingress host.

Verification
------------
- helm lint clean
- helm template smoke green for: oidc.* canonical, keycloak.* legacy
  fold, oidc.enabled=false short-circuit
- chart/tests/oidc-config.sh: 7/7 PASS
- chart/tests/observability-toggle.sh: 2/2 PASS (regression)
- go test ./internal/handler/ -run "SMETenant|TestRenderSME": all
  green (TestAuthHandover_HappyPath failure is pre-existing on main,
  unrelated to this change)

Closes the D1 sub-task of #915.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:10:41 +04:00
e3mrah
a1ca1872aa
feat(bp-stalwart-tenant): wire Keycloak OIDC SSO end-to-end (#915) (#920)
Closes the C2 sub-task of EPIC #915 — alice's Stalwart authenticates
SMTP/IMAP/JMAP/webmail logins against her per-tenant Keycloak realm,
not a shared otech-level IdP.

Three layered changes (matching the three things broken on otech103):

1. Orchestrator (`smeTenantBPStalwart` in sme_tenant_gitops.go)
   now emits per-tenant OIDC values matching the bp-wordpress-tenant
   + bp-openclaw shape:
     keycloak.realmURL = https://keycloak.<sub>.<parent>/realms/sme-<sub>
     keycloak.clientID = stalwart
     keycloak.clientSecretName = stalwart-oidc-client-secret
     keycloak.oidcExternalSecret.remoteRef.key
       = sovereign/<otech-fqdn>/stalwart/<tenant>/oidc
   plus admin externalSecret + dependsOn bp-keycloak so the SME's
   three apps (wordpress, openclaw, stalwart) SSO against ONE realm
   with distinct client IDs (#915 C1 registers all three in the realm
   bootstrap).

2. Chart bootstrap config.toml drops the pre-0.16 kebab-case
   `[directory.keycloak] type = "oidc"` block (silently ignored by
   the upstream registry parser — verified against
   crates/registry/src/schema/structs.rs in stalwartlabs/stalwart;
   OidcDirectory serdes camelCase: `@type = "Oidc"`, `issuerUrl`,
   `claimUsername`, `claimName`, `claimGroups`, `requireScopes`).
   The `internal` directory stays as the bootstrap fallback so the
   admin can log in before the post-install Job seeds OIDC.

3. setupJob defaults to enabled (was off in 0.1.1) and POSTs the
   canonical OIDC directory entry to `/api/settings`:
     directory.keycloak.@type            = "Oidc"
     directory.keycloak.issuerUrl        = <realm URL>
     directory.keycloak.claimUsername    = preferred_username
     directory.keycloak.claimName        = name
     directory.keycloak.claimGroups      = groups
     directory.keycloak.requireScopes    = [openid email profile groups]
     directory.keycloak.usernameDomain   = <tenant domain>
     storage.directory                   = keycloak
   The setting POSTs are idempotent (`assert_empty: false`) so Helm
   upgrades re-run without breaking existing logins. Re-uses the
   upstream Stalwart container (ships curl + stalwart-cli) — no new
   image needed.
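
The values from changes 1 and 3 can be assembled as sketched below.
This only builds the key/value pairs listed above; it does not model
the request format of `/api/settings`, which is not described here. The
helper names and the example tenant are hypothetical.

```go
package main

import "fmt"

// Setting is one key/value pair destined for the Stalwart settings store.
type Setting struct {
	Key   string
	Value any
}

// realmURL builds the per-tenant realm URL in the shape
// https://keycloak.<sub>.<parent>/realms/sme-<sub>.
func realmURL(sub, parent string) string {
	return fmt.Sprintf("https://keycloak.%s.%s/realms/sme-%s", sub, parent, sub)
}

// oidcDirectorySettings returns the OIDC directory entries in the order
// listed in the commit message, using the camelCase keys.
func oidcDirectorySettings(issuer, tenantDomain string) []Setting {
	return []Setting{
		{"directory.keycloak.@type", "Oidc"},
		{"directory.keycloak.issuerUrl", issuer},
		{"directory.keycloak.claimUsername", "preferred_username"},
		{"directory.keycloak.claimName", "name"},
		{"directory.keycloak.claimGroups", "groups"},
		{"directory.keycloak.requireScopes", []string{"openid", "email", "profile", "groups"}},
		{"directory.keycloak.usernameDomain", tenantDomain},
		{"storage.directory", "keycloak"},
	}
}

func main() {
	// Hypothetical tenant "alice" under a hypothetical parent domain.
	issuer := realmURL("alice", "example.test")
	for _, s := range oidcDirectorySettings(issuer, "alice.example.test") {
		fmt.Printf("%-36s = %v\n", s.Key, s.Value)
	}
}
```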

Tests:
  - `chart/tests/oidc-render.sh` (NEW): asserts every settings key
    is rendered, the [oauth] env block propagates the per-tenant
    realm URL, and the bootstrap config.toml parses as valid TOML.
  - `chart/tests/expression-syntax.sh`: re-passes (Stalwart
    expression `==` audit per stalwart_expression_syntax.md).
  - `TestRenderSMETenantOverlay_StalwartEmitsKeycloakOIDC` (NEW):
    Go test verifies the orchestrator emits the per-tenant realm
    URL, client metadata, and ExternalSecret-store remoteRef paths.
  - All existing TestRenderSMETenantOverlay_* tests pass.
  - `helm template` clean with default values AND with a per-tenant
    overlay (--api-versions external-secrets.io/v1beta1).

Chart bumps 0.1.1 → 0.1.2; blueprint.yaml spec.version mirrors per
issue #817 (chart/blueprint version invariant).

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:37:46 +04:00
e3mrah
9447d88dfd
feat(bp-newapi): auto-seed channel #1 = Qwen3.6 @ BankDhofar (#915) (#919)
Per epic #915 (SME tenant integration DoD: alice → OpenClaw → NewAPI →
Qwen3.6@BankDhofar end-to-end), bp-newapi must come up with channel
#1 = Qwen3.6 hosted at BankDhofar
(https://llm-api.omtd.bankdhofar.com, model qwen3-coder / alias
qwen3.6) already wired to its admin API, so the FIRST customer
request from an SME's OpenClaw → NewAPI hits a real upstream LLM
rather than a 404 / "no channel found" error.

Until now the chart's channels.yaml ConfigMap was a documentation
surface only; the upstream NewAPI binary persists channel state to
its Postgres `channels` table via its admin API at /api/channel/.
This patch bridges that gap.

Discovery:
  - Canonical BankDhofar relay reference exists in
    openova-private/clusters/contabo-mkt/apps/axon/helmrelease.yaml
    (axon.vllm.baseUrl=https://llm-api.omtd.bankdhofar.com,
    defaultModel=qwen3-coder, secret=axon-vllm-secret).
  - K8s secret confirmed live (axon/axon-vllm-secret, key
    AXON_VLLM_API_KEY).
  - Architecture: bp-newapi is per-Sovereign (one NewAPI per OTECH);
    SME tenants share it via OpenClaw's newapi.baseURL =
    https://newapi.<OTECHFQDN>. Channel seeding therefore happens
    at the Sovereign-level chart install, NOT per-tenant.

Changes:
  1. platform/newapi/chart/values.yaml
     - New `defaultChannels.qwenBankDhofar` block (enabled=false by
       default; per-Sovereign overlay flips it true with the
       canonical endpoint + commercial-contract attestation).
     - New `channelSeed` block configuring the post-install Helm
       hook Job (image, resources, backoff, deadline, hook delete
       policy).

  2. platform/newapi/chart/templates/_helpers.tpl
     - effectiveChannels helper composes qwenBankDhofar BEFORE
       operator-supplied .Values.channels and BEFORE defaultChannels.vllm
       so it lands as channel #1 in NewAPI's row-insertion order
       (NewAPI's router resolves `model` lookups in row order).
     - New channelSeedJobName helper (shared by Job + RBAC + ConfigMap).

  3. platform/newapi/chart/templates/channel-seed-job.yaml (NEW)
     - post-install/post-upgrade Helm hook Job that:
       * Mounts the operator-supplied master-key Secret
         (auth.adminUI.masterKeySecret) for one-time admin API auth.
       * Mounts the per-channel upstream API key Secret
         (defaultChannels.qwenBankDhofar.existingSecret).
       * Polls /api/status until 200 (handles NewAPI startup window).
       * For each default channel: GET /api/channel/?keyword=<name>;
         if a row whose `name` exactly matches exists, SKIP. Otherwise
         POST /api/channel/ with the channel definition. Idempotent —
         re-runs after upgrades are no-ops once channels exist.
       * Bounded RBAC (Role+RoleBinding only on the named Secrets).
       * Skip-render gates: channelSeed.enabled, defaultChannels.*
         enabled, masterKeySecret supplied. helm template with default
         values renders no Job (CI smoke clean).

  4. clusters/_template/bootstrap-kit/80-newapi.yaml
     - Bumped chart version 1.2.0 → 1.3.0.
     - Added defaultChannels.qwenBankDhofar block to the per-Sovereign
       overlay shape (still enabled=false in the template — operator
       supplies endpoint + attestation + Secrets per Sovereign).

  5. platform/newapi/chart/Chart.yaml
     - Bumped 1.2.0 → 1.3.0 with changelog comment.

  6. products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
     - bp-openclaw per-tenant overlay now emits `newapi.defaultModel:
       qwen3.6` so OpenClaw's UI surfaces the friendlier alias by
       default. (Both qwen3.6 and qwen3-coder route to the same
       channel via the chart's `models` list.)
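
The seed Job's idempotency rule from change 3 can be sketched as below.
The real Job talks to NewAPI's admin API over HTTP; here a hypothetical
`ChannelAPI` interface and an in-memory fake stand in for it, and the
channel names are examples. The point illustrated is the exact-name
check: a keyword search may return partial matches, and only an exact
match counts as "already seeded".

```go
package main

import (
	"fmt"
	"strings"
)

// Channel is a minimal stand-in for a NewAPI channel row.
type Channel struct {
	Name    string
	BaseURL string
}

// ChannelAPI abstracts the two admin calls the Job makes: a keyword
// search and a create. Hypothetical interface for this sketch.
type ChannelAPI interface {
	Search(keyword string) ([]Channel, error)
	Create(c Channel) error
}

// seedChannels creates each wanted channel unless a row with exactly the
// same name already exists. It returns how many channels it created.
func seedChannels(api ChannelAPI, want []Channel) (int, error) {
	created := 0
	for _, c := range want {
		found, err := api.Search(c.Name)
		if err != nil {
			return created, err
		}
		exists := false
		for _, f := range found {
			if f.Name == c.Name {
				exists = true
				break
			}
		}
		if exists {
			continue
		}
		if err := api.Create(c); err != nil {
			return created, err
		}
		created++
	}
	return created, nil
}

// memAPI is an in-memory fake whose Search does substring matching.
type memAPI struct{ rows []Channel }

func (m *memAPI) Search(keyword string) ([]Channel, error) {
	var out []Channel
	for _, r := range m.rows {
		if strings.Contains(r.Name, keyword) {
			out = append(out, r)
		}
	}
	return out, nil
}

func (m *memAPI) Create(c Channel) error {
	m.rows = append(m.rows, c)
	return nil
}

func main() {
	api := &memAPI{rows: []Channel{{Name: "qwen-bankdhofar-staging"}}}
	want := []Channel{{Name: "qwen-bankdhofar", BaseURL: "https://llm-api.omtd.bankdhofar.com"}}

	first, _ := seedChannels(api, want)
	second, _ := seedChannels(api, want)
	fmt.Println(first, second, len(api.rows))
}
```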

Verification:
  - helm lint .                    PASS (1 chart linted, 0 failed)
  - helm template (defaults)       PASS (no Job rendered)
  - helm template (qwen enabled)   PASS (Job + RBAC + ConfigMap +
                                          channels.yaml all render
                                          with channel #1 first)
  - helm template (endpoint empty) FAIL with helpful message
                                   (configurability gate)
  - go build ./...                 PASS
  - go test ./internal/handler/... PASS for SME tenant overlay tests
                                   (TestRenderSMETenantOverlay_*)
  - Pre-existing AuthHandover panic is unrelated to this change

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every knob is
configurable via the per-Sovereign bootstrap-kit overlay. The
endpoint default is empty so a fresh `helm template` does not
silently wire customers to a third-party host.

Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:32:00 +04:00
e3mrah
7f859dbb4b
feat(bp-keycloak): tenant-mode realm with wordpress/openclaw/stalwart OIDC clients (1.4.0, #915) (#918)
PR #911 wired the SME tenant orchestrator to emit
realmConfig.tenant.enabled=true on the per-tenant bp-keycloak
HelmRelease — but the chart had no template that consumed those values,
so the WordPress / OpenClaw / Stalwart OIDC integrations had no client
registered in the tenant realm and SSO failed end-to-end.

This change adds the chart-side template that consumes the values the
orchestrator was already emitting. When realmConfig.tenant.enabled=true:

  * configmap-sovereign-realm.yaml SKIPS (mutual-exclusion guard added
    on the existing template) so only one realm CM is rendered.
  * NEW templates/configmap-tenant-realm.yaml renders a realm import
    ConfigMap (same name `<release>-sovereign-realm-config` so the
    upstream keycloak-config-cli existingConfigmap reference still
    resolves) carrying the tenant realm + 3 OIDC clients:
      - wordpress  (confidential, auth-code; redirect URIs cover the
                    openid-connect-generic plugin's admin-ajax.php
                    callback + /wp-login.php fallback)
      - openclaw   (confidential, auth-code; redirect URI /oauth/callback
                    per #915 spec)
      - stalwart   (confidential, serviceAccountsEnabled=true so the
                    directory.keycloak type=oidc backend can use
                    client_credentials to introspect IMAP/SMTP tokens;
                    standardFlowEnabled=true for webmail UI auth-code)
  * NEW per-app Secrets emitted in the same template scope as the realm
    ConfigMap so the realm JSON's `secret` field and the K8s Secret
    bytes never drift:
      - wordpress-oidc-client-secret
      - openclaw-oidc-client-secret
      - stalwart-oidc-client-secret  (carries BOTH client-secret AND
                                      OIDC_CLIENT_SECRET keys for the
                                      two consumer paths)
  * Each per-app secret persists across helm upgrade via
    lookup-or-generate (mirrors marketplace-api/secret.yaml pattern from
    issue #887 and the existing catalyst-api-server secret in
    configmap-sovereign-realm.yaml). helm.sh/resource-policy: keep so
    bytes outlive uninstall.
  * Fail-closed validation when realmConfig.tenant.enabled=true and
    any of realmName / parentDomain / subdomain is unset (Inviolable
    Principle #4).
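
The lookup-or-generate behaviour above can be sketched in Go as follows. This illustrates the priority order only (operator override, then the stored value, then a fresh random one); the chart itself does this in Helm templates. The function names and the 32-character length are assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// randAlphaNum returns n random alphanumeric characters.
func randAlphaNum(n int) (string, error) {
	out := make([]byte, n)
	for i := range out {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphabet))))
		if err != nil {
			return "", err
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out), nil
}

// lookupOrGenerate picks the client secret in priority order:
// operator override, then the value already stored, then a new random one.
// The length of 32 is an assumption for this sketch.
func lookupOrGenerate(override, existing string) (string, error) {
	if override != "" {
		return override, nil
	}
	if existing != "" {
		return existing, nil
	}
	return randAlphaNum(32)
}

func main() {
	first, err := lookupOrGenerate("", "") // fresh install: nothing stored yet
	if err != nil {
		panic(err)
	}
	second, err := lookupOrGenerate("", first) // simulated upgrade
	if err != nil {
		panic(err)
	}
	fmt.Println(len(first), first == second) // prints "32 true"
}
```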

NEW tests/tenant-realm-oidc-clients.sh covers 6 cases:
  1. Sovereign-mode default render unchanged (kubectl + catalyst-ui +
     catalyst-api-server clients present, no tenant artefacts leak).
  2. Tenant-mode render produces exactly ONE realm CM under the
     expected name + zero leaked Sovereign-only resources.
  3. Tenant realm JSON parses + 3 OIDC clients present with the
     redirect-URI / publicClient / serviceAccountsEnabled shape per
     #915 spec; Secret bytes match realm JSON's `secret` fields.
  4. Fail-closed validation when tenant fields missing.
  5. keycloak-config-cli post-install Job projects the realm CM by
     SAME name in BOTH modes.
  6. Operator-supplied per-app clientSecret overrides the
     lookup-or-generate path.

Existing tests/observability-toggle.sh + tests/oidc-kubectl-client.sh
still pass.

Sovereign-mode unchanged. The chart now consumes the values the
orchestrator (PR #911) was already emitting; no orchestrator change
needed.

Closes #915 (C1 sub-task) and unblocks #899 (per-tenant Keycloak
realm-config materialisation).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:29:40 +04:00
e3mrah
61c8d77b58
feat(bp-openclaw): per-tenant Keycloak SSO + NewAPI as OpenAI-compatible LLM gateway (#915) (#917)
Wire bp-openclaw to the per-tenant Keycloak realm (OIDC SSO) and the
per-tenant NewAPI (OpenAI-compatible LLM endpoint, NOT direct OpenAI),
delivering C3 of umbrella epic #915.

Chart changes (bp-openclaw 0.1.0 → 0.2.0):
- Add canonical `oidc.{issuerURL,clientId,clientSecret.{name,key}}` block.
- Add canonical `llm.{baseURL,apiKey.{name,key},defaultModel}` block.
- Controller Deployment now emits OIDC_*, LLM_*, OPENAI_API_{BASE,KEY},
  LLM_DEFAULT_MODEL envs (legacy KEYCLOAK_*/NEWAPI_BASE_URL_DEFAULT
  retained for back-compat with current controller image).
- Per-user pods carry OPENAI_API_BASE / OPENAI_API_KEY / LLM_DEFAULT_MODEL
  alongside the identity-blind NEWAPI_BASE_URL / NEWAPI_KEY (ADR-0003
  §3.3 unchanged).
- Legacy `keycloak.*` / `newapi.*` keys remain accepted as fallbacks;
  helpers prefer canonical blocks but fall back to the legacy alias when
  the canonical block is unset (or still at placeholder).
- assertNoPlaceholders guard updated to check resolved canonical values.
- render-toggles.sh smoke test extended: asserts both canonical and
  legacy code-paths render and that all expected envs reach the
  rendered Deployment.

Orchestrator changes (catalyst-api smeTenantBPOpenClaw template):
- Emit per-tenant `oidc.issuerURL` = https://keycloak.<sub>.<parent>/realms/sme-<sub>
- Emit per-tenant `oidc.clientId` = openclaw, secret from
  openclaw-oidc-client-secret/OIDC_CLIENT_SECRET (rendered by
  bp-keycloak's post-install hook).
- Emit per-tenant `llm.baseURL` = https://api.<sub>.<parent>/v1 (alice's
  own NewAPI ingress, NOT the otech-wide newapi.<otech-fqdn>); apiKey
  from openclaw-newapi-controller-token/NEWAPI_KEY.
- Emit `llm.defaultModel: qwen3.6` — NewAPI uses this to select the
  backing channel; C4 of #915 wires Qwen3.6@BankDhofar at tenant-create.
- Legacy keycloak/newapi blocks still emitted for back-compat with
  bp-openclaw < 0.2.0.
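
A short Go sketch of how the per-tenant URLs above are composed from the tenant subdomain and parent domain. `tenantEndpoints` is a hypothetical helper name and the inputs are example values; the real template lives in catalyst-api's smeTenantBPOpenClaw.

```go
package main

import "fmt"

// tenantEndpoints derives the per-tenant OIDC issuer and LLM base URL,
// following the URL shapes given in the commit message.
func tenantEndpoints(sub, parent string) (issuerURL, llmBaseURL string) {
	issuerURL = fmt.Sprintf("https://keycloak.%s.%s/realms/sme-%s", sub, parent, sub)
	llmBaseURL = fmt.Sprintf("https://api.%s.%s/v1", sub, parent)
	return issuerURL, llmBaseURL
}

func main() {
	// Example inputs only.
	issuer, llm := tenantEndpoints("alice", "omani.works")
	fmt.Println(issuer) // https://keycloak.alice.omani.works/realms/sme-alice
	fmt.Println(llm)    // https://api.alice.omani.works/v1
}
```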

Tests:
- New TestRenderSMETenantOverlay_OpenClawOIDCAndLLMBlocks asserts the
  rendered HelmRelease contains the canonical oidc + llm blocks with
  per-tenant values, and that llm.baseURL is the per-tenant
  api.<sub>.<parent>/v1 (NOT the otech-wide newapi).
- bp-openclaw render-toggles.sh extended (Case 2b/2c).

Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:26:59 +04:00
e3mrah
368545369b
fix(bp-stalwart-tenant): unbootable on fresh tenants — values shape, missing admin Secret, sec ctx (#898) (#904)
Three fixes for bugs that left bp-stalwart-tenant 0.1.0 unable to come
up on a freshly-franchised SME tenant. All surfaced on the otech103
alice tenant during the Phase-1 DoD sweep.

1. Tenant-domain values shape (HelmRelease render error)

   The 0.1.0 chart referenced `.Values.domain.primary` in five
   templates. The live HR on otech103 had `values.domain:
   acme.omani.works` (a string), emitted by a pre-#897 catalyst-api
   build, so every reconcile died with:

     can't evaluate field primary in type interface {}

   Added `bp-stalwart-tenant.tenantDomain` + `tenantMode` helpers
   that resolve in priority order:

     1. `tenant.domain`        (forward-looking flat shape)
     2. `domain.primary`       (canonical post-#897 map shape)
     3. `domain` (string)      (legacy pre-#897 shape — back-compat)

   Returns "" when no shape is set, which keeps smoke renders safe;
   per-template gates skip when the result is empty.

2. Missing stalwart-admin Secret

   deployment.yaml + mailbox-provision-job.yaml reference a Secret
   key `ADMIN_PASSWORD` on `.Values.admin.secretName`. The 0.1.0
   chart only emitted an ExternalSecret, and only when
   `admin.externalSecret.remoteRef.key` was non-empty (smoke-render
   concession). Fresh tenants land in CreateContainerConfigError.

   Added `templates/admin-secret.yaml` mirroring marketplace-api/
   secret.yaml (#887): random 32-char ADMIN_PASSWORD generated by
   sprig randAlphaNum, persisted across reconcile via lookup,
   helm.sh/resource-policy: keep so reinstall picks it back up.
   Auto-disabled when an authoritative ExternalSecret is wired —
   no double-bind between two controllers.

3. Pod sec ctx vs. upstream image's file capabilities

   `getcap docker.io/stalwartlabs/stalwart:v0.16.3 /usr/local/bin/
   stalwart` reports `cap_net_bind_service=ep`. The image creates
   user `stalwart` at UID 2000 and the binary IS the entrypoint
   (no demotion script). The 0.1.0 chart ran as UID 65534 with
   `drop: ALL` — kernel refuses to elevate file caps with empty
   bounding set, so exec failed with `operation not permitted`.

   Aligned to image's native UID 2000, kept `drop: ALL` and added
   `NET_BIND_SERVICE` explicitly. fsGroup 2000 ensures /opt/stalwart
   PVC is writable.
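
The priority order from fix 1 can be illustrated with a small Go function over a values map. This is a sketch of the resolution logic only; the chart implements it as a Helm helper, and the Go `tenantDomain` function here is a hypothetical stand-in.

```go
package main

import "fmt"

// tenantDomain resolves the tenant domain from chart-style values in the
// priority order listed in fix 1. It returns "" when nothing is set.
func tenantDomain(values map[string]any) string {
	// 1. tenant.domain (forward-looking flat shape)
	if t, ok := values["tenant"].(map[string]any); ok {
		if d, ok := t["domain"].(string); ok && d != "" {
			return d
		}
	}
	switch d := values["domain"].(type) {
	case map[string]any:
		// 2. domain.primary (canonical map shape)
		if p, ok := d["primary"].(string); ok {
			return p
		}
	case string:
		// 3. domain as a plain string (legacy shape)
		return d
	}
	return ""
}

func main() {
	legacy := map[string]any{"domain": "acme.omani.works"}
	canonical := map[string]any{"domain": map[string]any{"primary": "acme.omani.works"}}
	fmt.Println(tenantDomain(legacy))
	fmt.Println(tenantDomain(canonical))
	fmt.Println(tenantDomain(nil) == "") // nothing set resolves to ""
}
```

Checking the type of `domain` before reading `primary` is what avoids the "can't evaluate field primary" failure on the legacy string shape.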

Other:
- Bumped Chart.yaml + blueprint.yaml to 0.1.1 (#817 alignment).
- configSchema in blueprint.yaml now permits the legacy + tenant
  shapes alongside the canonical map.
- mailboxProvisioner.setupJob.enabled defaults to false until the
  canonical stalwart-cli image is published (re-uses upstream
  stalwart container as fallback CLI host).

Acceptance: targeted at otech103 alice tenant
(sme-789ae512-bc0f-467c-a016-001f5496c403) where 0.1.0 reconciliation
fails with the value-shape error and the pod CrashLoops with `exec
... operation not permitted`. Verification on otech103 in #898.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:55:03 +04:00
e3mrah
93c4b700de
fix(bp-keycloak): templatize existingConfigmap reference for per-tenant installs (#899) (#902)
bp-keycloak 1.3.2 hardcoded `keycloak.keycloakConfigCli.existingConfigmap` to
the literal "keycloak-sovereign-realm-config". This worked for the Sovereign-
mothership bootstrap-kit (releaseName=keycloak emits matching ConfigMap) but
broke for every per-tenant install where releaseName=bp-keycloak emits
"bp-keycloak-sovereign-realm-config" — the post-install keycloak-config-cli
Job stuck in ContainerCreating with `MountVolume.SetUp failed for volume
"config-volume" : configmap "keycloak-sovereign-realm-config" not found`,
HelmRelease InstallFailed after 15m timeout, cascading to bp-openclaw and
bp-wordpress-tenant which dependsOn it.

The bitnami/keycloak subchart's `keycloak.keycloakConfigCli.configmapName`
helper (charts/keycloak/templates/_helpers.tpl) applies `tpl` to the
existingConfigmap value, so embedding `{{ .Release.Name }}` inside the
string resolves at chart-render time. With this single-line change:

  - Sovereign-mothership (releaseName=keycloak) → keycloak-sovereign-realm-config (unchanged)
  - Per-tenant (releaseName=bp-keycloak)        → bp-keycloak-sovereign-realm-config (matches actual emitted ConfigMap)

Verified via helm template both modes — backendRef and config-volume
configMap.name match the actual ConfigMap emitted by
templates/configmap-sovereign-realm.yaml.
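
To show why embedding the release name in the value works, here is a Go sketch using `text/template`, the engine Helm builds on. The exact value string and the `renderConfigMapName` helper are assumptions for illustration; Helm's `tpl` does the equivalent at render time.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderConfigMapName evaluates a templated value against a release name,
// similar in spirit to what Helm's tpl function does at render time.
func renderConfigMapName(value, releaseName string) (string, error) {
	t, err := template.New("name").Parse(value)
	if err != nil {
		return "", err
	}
	data := map[string]any{"Release": map[string]any{"Name": releaseName}}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Assumed shape of the templated value; the chart's literal may differ.
	const value = "{{ .Release.Name }}-sovereign-realm-config"
	for _, release := range []string{"keycloak", "bp-keycloak"} {
		name, err := renderConfigMapName(value, release)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", release, name)
	}
}
```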

Chart bumped 1.3.2 → 1.3.3 + bootstrap-kit slot 09 + blueprint.yaml.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:49:39 +04:00
e3mrah
eddf0e62a4
fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter (#891) (#892)
* fix(catalyst-api): SME-tenant orchestrator writes parent kustomization.yaml index (#889)

The Flux Kustomization rendered by bp-catalyst-platform 1.4.13+ at
clusters/<sov-fqdn>/sme-tenants/ requires a parent kustomization.yaml
that enumerates tenant subdirectories. The orchestrator only wrote
per-tenant overlays without the parent index, so on otech103 Flux
hit:

  kustomization path not found: stat /tmp/kustomization-...
  /clusters/otech103.omani.works/sme-tenants: no such file or directory

Even after a tenant signup, the parent path lacked a kustomization.yaml
so Flux couldn't enumerate subdirs.

Fix: NEW writeParentTenantsIndex helper called from both
WriteTenantOverlay and DeleteTenantOverlay. Scans the parent dir for
subdirectories that contain kustomization.yaml, sorts them lexically
for deterministic output (no spurious diffs), and writes a parent
kustomization.yaml listing them under `resources:`. Empty list (no
tenants) renders as `resources: []` — still a valid Kustomization
root, so Flux stays Ready=True after the last tenant teardown.
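
A Go sketch of the rendering step described above, assuming the tenant directory names have already been collected. `renderParentIndex` is a hypothetical name and the `apiVersion`/`kind` header is an assumption; the real helper is writeParentTenantsIndex in catalyst-api.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderParentIndex builds the parent kustomization.yaml body from tenant
// directory names. Names are sorted so repeated runs give identical output.
func renderParentIndex(tenants []string) string {
	sorted := append([]string(nil), tenants...) // copy; leave caller's slice alone
	sort.Strings(sorted)

	var sb strings.Builder
	sb.WriteString("apiVersion: kustomize.config.k8s.io/v1beta1\n")
	sb.WriteString("kind: Kustomization\n")
	if len(sorted) == 0 {
		// An empty list is still a valid Kustomization root.
		sb.WriteString("resources: []\n")
		return sb.String()
	}
	sb.WriteString("resources:\n")
	for _, t := range sorted {
		sb.WriteString("  - " + t + "\n")
	}
	return sb.String()
}

func main() {
	fmt.Print(renderParentIndex([]string{"sme-b", "sme-a"}))
	fmt.Print(renderParentIndex(nil))
}
```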

git add covers both the per-tenant subdir AND the parent index, so a
single commit captures the delta.

Live on otech103 post-cutover, 2026-05-05.

* fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter to include clusters/<sov-fqdn>/ (#891)

After Day-2 cutover, the GitRepository ignore filter excluded the
Sovereign's own clusters/<sov-fqdn>/ subtree. This made every
Sovereign-specific Flux Kustomization (sme-tenants, future per-Sov
overlays) hit "kustomization path not found" because source-controller
filtered the path out of the artifact tarball.

Live on otech103 (2026-05-05): sme-tenants Kustomization stuck for
20+ minutes despite the orchestrator successfully committing the
overlay to local Gitea.

Fix: Step-5 (flux-gitrepository-patch) now writes the patch as a
multi-line YAML strategic-merge file via /tmp emptyDir (since the
Pod runs readOnlyRootFilesystem), composing the new ignore filter:

  /*
  !/clusters/_template
  !/clusters/${SOVEREIGN_FQDN}
  !/platform
  !/products

The SOVEREIGN_FQDN is wired from .Values.sovereign.fqdn (already
established in the chart values).

Bumps chart 0.1.14 -> 0.1.15. Slot 06a pin bumps in lockstep.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-05 09:39:42 +04:00
e3mrah
8e4c88fd28
fix(bp-self-sovereign-cutover): auto-sync local Gitea mirror from upstream GitHub (#870) (#875)
Step-1 gitea-mirror Job replaces the legacy one-shot create-empty-repo +
git-push pattern with a single call to Gitea's native /repos/migrate API
with mirror=true and mirror_interval=10m0s. Gitea now polls the upstream
openova-io/openova repo on a 10-minute interval and replicates branches
+ tags into the local Sovereign Gitea automatically.

Closes the "Sovereign drifts from upstream main forever after Day-2
cutover" bug — hit twice during the otech103 2026-05-04 overnight DoD
session, requiring manual `git fetch` inside the Gitea pod for every
chart rollout.

Why /repos/migrate over the previous git push approach:
- Gitea cannot convert a regular repo into a pull-mirror after creation
  (the mirror flag is set at create-time only). The migrate endpoint
  creates the repo AS a mirror in one shot.
- The migrate endpoint accepts toggles for issues / pull-requests /
  wiki / labels / milestones / releases — we set them all to false so
  Gitea only replicates branches+tags, the only refs the Sovereign's
  Flux GitRepository needs.
- Recurring sync is a Gitea-native capability; using it avoids a
  parallel CronJob (which would violate the "event-driven not cron"
  inviolable principle) or a long-poll sidecar (which would duplicate
  what Gitea already does).

Idempotency: if the repo already exists from a prior cutover attempt,
the script PATCHes mirror_interval to the desired value and POSTs to
/mirror-sync to trigger an immediate refresh. Note that PATCH alone
cannot convert a legacy non-mirror repo to a mirror — Sovereigns
seeded by chart < 0.1.14 would need an operator-driven repo delete +
re-migrate to retro-fit auto-sync, but new provisions take the
migrate path automatically.
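
For orientation, a Go sketch of the request body such a migrate call might carry. The struct covers only the fields discussed above, using the JSON names from Gitea's migrate API; the clone URL and the `newMirrorRequest` helper are assumptions, and the chart's actual script is a shell script, not Go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// migrateRequest is a subset of the body for Gitea's
// POST /api/v1/repos/migrate endpoint.
type migrateRequest struct {
	CloneAddr      string `json:"clone_addr"`
	RepoOwner      string `json:"repo_owner"`
	RepoName       string `json:"repo_name"`
	Mirror         bool   `json:"mirror"`
	MirrorInterval string `json:"mirror_interval"`
	Issues         bool   `json:"issues"`
	PullRequests   bool   `json:"pull_requests"`
	Wiki           bool   `json:"wiki"`
	Labels         bool   `json:"labels"`
	Milestones     bool   `json:"milestones"`
	Releases       bool   `json:"releases"`
}

// newMirrorRequest creates a pull-mirror request that replicates only
// branches and tags; every optional toggle stays false.
func newMirrorRequest(cloneAddr, owner, name string) migrateRequest {
	return migrateRequest{
		CloneAddr:      cloneAddr,
		RepoOwner:      owner,
		RepoName:       name,
		Mirror:         true,
		MirrorInterval: "10m0s",
	}
}

func main() {
	// The clone URL is an assumed form of the upstream repo address.
	req := newMirrorRequest("https://github.com/openova-io/openova.git", "openova", "openova")
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```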

Verification on the rendered ConfigMap:
  $ helm template smoke .                   # renders 16 docs cleanly
  $ bash tests/cutover-contract.sh          # all 7 gates green
  $ sh -n <rendered-script>                 # POSIX shell syntax OK

Chart bumped 0.1.13 → 0.1.14 (Chart.yaml + blueprint.yaml spec.version
aligned per #817 invariant + slot 06a-bp-self-sovereign-cutover.yaml
pin lockstep).

Refs #870, #790.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 08:35:40 +04:00
e3mrah
9b710049e3
fix(self-sovereign-cutover): Step-8 baseline-diff (only NEW regressions count) (#858)
Live otech103: the Step-8 survival window failed because the infrastructure-config Kustomization had been NotReady for 4h pre-cutover (Crossplane provider CRD ordering, unrelated to sovereignty). The sovereignty proof asks 'did cutover break anything', not 'is the cluster perfect'. Capture the baseline NotReady set before the window and fail only on NEW additions during it.
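
The baseline-diff idea amounts to a set difference. A minimal Go sketch, with `newRegressions` as a hypothetical name and the resource names as example data (the chart's check is a shell script):

```go
package main

import (
	"fmt"
	"sort"
)

// newRegressions returns the resources that are NotReady now but were not
// in the baseline captured before the survival window started.
func newRegressions(baseline, current []string) []string {
	seen := make(map[string]bool, len(baseline))
	for _, name := range baseline {
		seen[name] = true
	}
	var out []string
	for _, name := range current {
		if !seen[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out) // stable output for logs
	return out
}

func main() {
	baseline := []string{"infrastructure-config"}
	current := []string{"infrastructure-config", "sme-tenants"}
	fmt.Println(newRegressions(baseline, current)) // prints "[sme-tenants]"
}
```

A pre-existing failure such as infrastructure-config never appears in the result, so only breakage introduced during the window fails the step.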

Bumps 0.1.12 → 0.1.13 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 04:20:16 +04:00
e3mrah
d5d1d9b2cd
fix(self-sovereign-cutover): Step-8 tolerate slot-managed self-ref HelmRepositories (#857)
Live otech103: Step-8 verification flagged 2 HelmRepositories (bp-newapi + bp-self-sovereign-cutover) still on ghcr.io/openova-io. Both are declared in clusters/_template/bootstrap-kit/ slot files which Flux Kustomization re-applies on every reconcile — Step-6's patch is transient for them. Data-plane impact is null because they're not pulled again until the next cutover cycle which would re-apply the patch first. The 38 leaf-bp HelmRepositories ARE patched durably (live in HelmRelease values, not separate slot files).

Bumps 0.1.11 → 0.1.12 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 04:06:41 +04:00
e3mrah
142ea21534
fix(self-sovereign-cutover): Step-8 passive architectural verification (Cilium can't egressDeny+toFQDNs) (#856)
Live otech103: Step-8 (egress-block-test) failed because Cilium 1.16's CiliumNetworkPolicy schema doesn't support 'spec.egressDeny[].toFQDNs' — strict-decoding error 'unknown field'. FQDN-based matching in Cilium is only allowed in 'egress' (allow), not 'egressDeny'.

Pivot: Step-8 now asserts the architectural pivots from Steps 5-7 are actually live (GitRepository.url + all HelmRepositories + catalyst-api env all point at local Gitea/Harbor) BEFORE entering the durationSeconds survival window during which Flux Kustomization + HelmRelease readiness is polled. Same sovereignty proof, expressed in a form Cilium can evaluate.

Bumps 0.1.10 → 0.1.11 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 03:22:30 +04:00
e3mrah
86ae235804
fix(self-sovereign-cutover): catalyst-api namespace catalyst-system not catalyst-platform (#855)
Live otech103: Step-7 (catalyst-api-env-patch) hit 'deployments.apps catalyst-api not found' in catalyst-platform ns. Actual Sovereign-side namespace is catalyst-system. Bumps 0.1.9 → 0.1.10.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:59:11 +04:00
e3mrah
dd84060d05
fix(self-sovereign-cutover): switch from bitnami/kubectl to alpine/k8s (#854)
Live otech103 2026-05-04: bitnami/kubectl:1.31.4 returns 404 on Docker Hub. Bitnami deprecated its public Docker Hub registry in 2025, and the kubectl image stopped getting tags. alpine/k8s is the canonical alpine-based replacement: kubectl + helm + standard k8s CLI surface, actively maintained, :1.31.4 verified present.

Bumps 0.1.8 → 0.1.9 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:55:46 +04:00
e3mrah
887ff62200
fix(self-sovereign-cutover): bitnami/kubectl tag :1.31 → :1.31.4 (#853)
Live otech103 2026-05-04: Step-5 (flux-gitrepository-patch) Pod hit DeadlineExceeded after 10m of ImagePullBackOff. bitnami/kubectl on Docker Hub doesn't have a floating :1.31 tag, only patch-level :1.31.X tags. Pin to :1.31.4 (latest of the 1.31 minor as of today).

Bumps 0.1.7 → 0.1.8 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:42:54 +04:00
e3mrah
e9970db7b6
fix(self-sovereign-cutover): proxy-quay adapter type docker-registry (#852)
Live otech103: Harbor rejects project create with metadata.proxy_cache=true on registries with type 'quay', returning HTTP 400 'unsupported registry type quay'. Quay speaks plain v2, so docker-registry is the correct adapter (the 4 of 7 projects created before it succeeded with the same shape). Bumps 0.1.6 → 0.1.7.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:29:26 +04:00
e3mrah
ea51642092
fix(self-sovereign-cutover): proxy-ghcr Harbor adapter type 'github-ghcr' (#851)
Live otech103 2026-05-04: Step-2 harbor-projects POST /api/v2.0/registries returns 500 'adapter factory for github not found'. Harbor 2.x's canonical GHCR proxy-cache adapter is named 'github-ghcr', not 'github'.

Bumps 0.1.5 → 0.1.6 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:26:51 +04:00
e3mrah
8f96daeb6f
fix(self-sovereign-cutover): harbor service is 'harbor-core' not 'harbor-harbor-core' (#849)
Live failure on otech103 2026-05-04: Step-2 (harbor-projects) Pod exits silently after the first echo because curl exits with code 6 (CURLE_COULDNT_RESOLVE_HOST). The chart's default harborInternalURL was http://harbor-harbor-core.harbor.svc.cluster.local, but the bitnami harbor chart's actual Service name is harbor-core (the release name is not double-prefixed when targetNamespace == 'harbor' AND releaseName == 'harbor').

Fix: harborInternalURL → http://harbor-core.harbor.svc.cluster.local. Verified via 'kubectl get svc -n harbor' on otech103.

Bumps 0.1.4 → 0.1.5 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 01:16:41 +04:00
e3mrah
ab5681e656
fix(self-sovereign-cutover): Step-1 use bare clone + explicit refspec push (#848)
Live failure on otech103 2026-05-04 even after 0.1.3: git push --all in a mirror clone still pushes refs/pull/* because mirror clones store all upstream refs (incl. GitHub PR refs) at the same level as refs/heads/, and --all walks the whole local refstore.

Fix: use git clone --bare (not --mirror) which only fetches refs/heads/* and refs/tags/*, then push with explicit refspecs:
  git push origin 'refs/heads/*:refs/heads/*'
  git push origin 'refs/tags/*:refs/tags/*'

Bumps 0.1.3 → 0.1.4 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:59:25 +04:00
e3mrah
6322d82775
fix(self-sovereign-cutover): Step-1 push --all + --tags (skip GitHub PR refs) (#847)
Live failure on otech103 2026-05-04: git push --mirror to local Gitea rejected by Gitea's update hook on every refs/pull/<n>/head + refs/pull/<n>/merge ref (those are GitHub-specific metadata refs Gitea doesn't accept). Branches and tags push fine.

Fix: split the push into 'git push --all' (branches) + 'git push --tags' (tags). Branches + tags are exactly what Flux GitRepository needs to reconcile from local Gitea — PR refs are upstream-only metadata not referenced by any consumer.

Bumps bp-self-sovereign-cutover 0.1.2 → 0.1.3 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:55:22 +04:00
e3mrah
3015033136
fix(self-sovereign-cutover): Step-1 creates Gitea org before repo (#846)
Live failure on otech103 2026-05-04: Step-1 hit 'POST /orgs/openova/repos returns 404 Not Found' because the org openova doesn't exist on a fresh Gitea install. The /user/repos fallback would have created the repo under gitea_admin/openova, but the subsequent git push targets openova/openova so it fails with 'remote: Not found'.

Fix: explicit org-create step before repo-create. POST /orgs with {username, visibility} creates the org idempotently (swallow 422 'already exists'). Then POST /orgs/<org>/repos creates the repo under it. Push URL targets openova/openova as before.

Bumps bp-self-sovereign-cutover 0.1.1 → 0.1.2 + slot 06a pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:51:24 +04:00
e3mrah
e36089540d
fix(self-sovereign-cutover): Step-1 BusyBox-wget Basic auth header (--user not supported) (#845)
* fix(bp-gitea): mirror gitea-admin-secret to catalyst ns via reflector annotations

Live failure on otech103 2026-05-04: cutover Step-1 gitea-mirror Job in catalyst ns CrashLoops with 'secret "gitea-admin-secret" not found' because K8s forbids cross-namespace secretKeyRef. The Secret created by bp-gitea 1.2.4 lives in the gitea ns; the cutover Job runs in the catalyst ns.

Fix: add reflector.v1.k8s.emberstack.com annotations on the Secret so bp-reflector (already installed at slot 05a) mirrors it into the catalyst namespace. The Job's secretKeyRef then resolves locally. Reflector keeps the mirror in lockstep on password rotation.

Bumps bp-gitea 1.2.4 → 1.2.5 + slot 10 pin lockstep.

* fix(self-sovereign-cutover): Step-1 gitea-mirror BusyBox-wget compat (Basic auth header)

Live failure on otech103 2026-05-04: Step-1 cutover-gitea-mirror Pod exits with 'wget: unrecognized option: password=...' because the alpine/git image bundles BusyBox wget which does NOT recognise --user / --password (those are GNU wget flags).

Fix: build a base64'd Authorization: Basic header from $GITEA_USERNAME:$GITEA_PASSWORD and pass it via --header (BusyBox wget supports --header). Same Gitea API call surface, BusyBox-compatible wire.
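
The header construction itself is small. A Go sketch for illustration, with `basicAuthHeader` as a hypothetical helper (the Job does the equivalent in shell before passing the result to wget's --header):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// basicAuthHeader builds an HTTP Basic Authorization header line from a
// username and password: base64("user:password") after "Basic ".
func basicAuthHeader(user, pass string) string {
	token := base64.StdEncoding.EncodeToString([]byte(user + ":" + pass))
	return "Authorization: Basic " + token
}

func main() {
	// Placeholder credentials for the example.
	fmt.Println(basicAuthHeader("user", "pass")) // Authorization: Basic dXNlcjpwYXNz
}
```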

Bumps bp-self-sovereign-cutover 0.1.0 → 0.1.1 + slot 06a pin lockstep.

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:40:24 +04:00
e3mrah
66abe75b2e
fix(bp-gitea): mirror gitea-admin-secret to catalyst ns via reflector annotations (#844)
Live failure on otech103 2026-05-04: cutover Step-1 gitea-mirror Job in catalyst ns CrashLoops with 'secret "gitea-admin-secret" not found' because K8s forbids cross-namespace secretKeyRef. The Secret created by bp-gitea 1.2.4 lives in the gitea ns; the cutover Job runs in the catalyst ns.

Fix: add reflector.v1.k8s.emberstack.com annotations on the Secret so bp-reflector (already installed at slot 05a) mirrors it into the catalyst namespace. The Job's secretKeyRef then resolves locally. Reflector keeps the mirror in lockstep on password rotation.

Bumps bp-gitea 1.2.4 → 1.2.5 + slot 10 pin lockstep.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:37:04 +04:00
e3mrah
c42e98216c
fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (curl -o + readOnlyRootFS) (#843)
* fix(bootstrap-kit,bp-newapi): bump slot pins (gitea 1.2.4, catalyst-platform 1.4.2) + gate Traefik Middleware on Cilium Sovereigns (bp-newapi 1.2.0)

Three issues blocking the otech103 verification proof on a freshly merged main, all uncovered while live-driving the Day-2 Independence cutover:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pinned 1.4.0 — missed the bumps from PR #839 (1.4.1, RBAC dual-mode render) and PR #841 (1.4.2, POWERDNS env literal). Bumping the slot pin to 1.4.2 lands those fixes on every fresh provision.

2. clusters/_template/bootstrap-kit/10-gitea.yaml pinned 1.2.3 — missed the bump from PR #832 (1.2.4, gitea-admin-secret canonical Secret for cutover Step-1 to mount). Bumping to 1.2.4 unblocks bp-self-sovereign-cutover Step-1 (gitea-mirror Job).

3. platform/newapi/chart/templates/ingress.yaml hard-rendered a traefik.io/v1alpha1 Middleware resource. On a Cilium Gateway Sovereign that CRD does not exist; bp-newapi 1.1.0 install failed with 'no matches for kind Middleware'. Gating the Middleware behind .Values.ingress.middleware.enabled (default false) lets the chart install on Cilium Sovereigns; contabo / Traefik clusters can still flip it on per-overlay. Bumping to 1.2.0 (additive feature, default-off, no breaking change). Slot 80-newapi pin bumped lockstep.

Verified live state on otech103.omani.works (deployment id 12dff5098e33053e):
- bp-newapi 1.1.0 HR: Status=False 'Helm install failed: ... no matches for kind Middleware in version traefik.io/v1alpha1'
- bp-catalyst-platform HR pinned at 1.4.0 (lacks RBAC for cutover-driver)
- bp-gitea HR pinned at 1.2.3 (lacks gitea-admin-secret)

After this PR merges + Flux reconciles otech103, all three HRs upgrade in place and the cutover proof can be driven to completion.

* fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (readOnlyRootFS + curl -o)

Caught live on otech103 2026-05-04: the zone-bootstrap Job exited 23 (curl write error) because it runs curl -o /tmp/zone-resp with readOnlyRootFilesystem=true and no /tmp emptyDir mount. Bumps bp-powerdns 1.2.0 → 1.2.1 + slot 11 pin lockstep.

Without /tmp/zone-resp writable the Job CrashLoops every retry, never completes, bp-external-dns dependency stuck, Phase-1 watcher never reaches ready, handover never auto-fires.

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:28:44 +04:00
e3mrah
7de05bab9d
fix(bootstrap-kit,bp-newapi): bump slot pins (gitea 1.2.4, catalyst-platform 1.4.2) + gate Traefik Middleware on Cilium Sovereigns (bp-newapi 1.2.0) (#842)
Three issues blocking the otech103 verification proof on a freshly merged main, all uncovered while live-driving the Day-2 Independence cutover:

1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pinned 1.4.0 — missed the bumps from PR #839 (1.4.1, RBAC dual-mode render) and PR #841 (1.4.2, POWERDNS env literal). Bumping the slot pin to 1.4.2 lands those fixes on every fresh provision.

2. clusters/_template/bootstrap-kit/10-gitea.yaml pinned 1.2.3 — missed the bump from PR #832 (1.2.4, gitea-admin-secret canonical Secret for cutover Step-1 to mount). Bumping to 1.2.4 unblocks bp-self-sovereign-cutover Step-1 (gitea-mirror Job).

3. platform/newapi/chart/templates/ingress.yaml hard-rendered a traefik.io/v1alpha1 Middleware resource. On a Cilium Gateway Sovereign that CRD does not exist; bp-newapi 1.1.0 install failed with 'no matches for kind Middleware'. Gating the Middleware behind .Values.ingress.middleware.enabled (default false) lets the chart install on Cilium Sovereigns; contabo / Traefik clusters can still flip it on per-overlay. Bumping to 1.2.0 (additive feature, default-off, no breaking change). Slot 80-newapi pin bumped lockstep.

Verified live state on otech103.omani.works (deployment id 12dff5098e33053e):
- bp-newapi 1.1.0 HR: Status=False 'Helm install failed: ... no matches for kind Middleware in version traefik.io/v1alpha1'
- bp-catalyst-platform HR pinned at 1.4.0 (lacks RBAC for cutover-driver)
- bp-gitea HR pinned at 1.2.3 (lacks gitea-admin-secret)

After this PR merges + Flux reconciles otech103, all three HRs upgrade in place and the cutover proof can be driven to completion.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-05 00:22:55 +04:00
e3mrah
e96741a0ca
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The
operator brings 1+ parent domains at signup (`omani.works` for own
use, `omani.trade` for the SME pool, etc.) and may add more
post-handover via the admin console (#829).

bp-powerdns 1.2.0 (platform/powerdns/chart):
- New `zones: []` values key listing parent domains to bootstrap
- New Helm post-install/post-upgrade hook Job
  (templates/zone-bootstrap-job.yaml) that POSTs each entry to
  /api/v1/servers/localhost/zones at install time. Idempotent on
  HTTP 409 — re-runs after upgrades or chart bumps never fail.
- Default-values render skips when zones is empty (legacy behavior).

bp-catalyst-platform 1.4.0 (products/catalyst/chart):
- New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}`
  values
- New templates/sovereign-wildcard-certs.yaml renders one
  cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex)
  via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert
  renews independently. Skips entirely when parentZones is empty so
  the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml
  retains ownership of `sovereign-wildcard-tls` (avoids
  helm-vs-kustomize ownership flap).
- New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded
  into the catalyst-api Pod as CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_SERVER_ID env vars.

catalyst-api (products/catalyst/bootstrap/api):
- New internal/powerdns package with typed Client (CreateZone,
  ZoneExists). Idempotent on HTTP 409/412.
- handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the
  typed client when wired via SetPowerDNSZoneClient — the
  admin-console "Add another parent domain" flow now creates real
  zones in the Sovereign's PowerDNS at runtime.
- main.go wires the client when CATALYST_POWERDNS_API_URL +
  CATALYST_POWERDNS_API_KEY are set.
- Comprehensive unit tests (client_test.go: 9 cases incl.
  201/409/412/500 + custom NS + custom serverID).
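
The idempotency rule above (201 means created, 409/412 means the zone is already there, anything else is a failure) can be sketched as a small status classifier. This is an illustrative Python sketch, not the Go client in internal/powerdns; `classify_zone_create`, `bootstrap_zones`, and the outcome labels are hypothetical names.

```python
def classify_zone_create(status_code: int) -> str:
    """Map the HTTP status of a zone-create POST to an outcome label."""
    if status_code == 201:
        return "created"
    # Per the commit message, 409 and 412 are treated as "zone already exists".
    if status_code in (409, 412):
        return "exists"
    raise RuntimeError(f"zone create failed with HTTP {status_code}")


def bootstrap_zones(zones, post):
    """Create each zone via `post`, a callable that returns a status code."""
    return {zone: classify_zone_create(post(zone)) for zone in zones}
```

Because "exists" is a success outcome, re-running the same zone list after an upgrade yields no error.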

Bootstrap-kit slot integration:
- clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to
  bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from
  Flux postBuild.substitute.
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml:
  bumps to bp-catalyst-platform 1.4.0 and threads `parentZones:
  ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two
  slots stay in lockstep).
- infra/hetzner: new `parent_domains_yaml` Terraform variable
  (defaults to single-zone array derived from sovereign_fqdn) →
  cloud-init renders the PARENT_DOMAINS_YAML Flux substitute.

DoD verified end-to-end with helm template + envsubst:
- Multi-zone overlay (omani.works + omani.trade) renders 2
  PowerDNS zone-create API calls in the bootstrap Job AND 2
  Certificate resources (`*.omani.works`, `*.omani.trade`) in
  bp-catalyst-platform.
- Single-zone fallback (PARENT_DOMAINS_YAML defaults to
  `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy
  provisioning paths working without per-overlay edits.
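
A minimal Python sketch of the two behaviours verified above: the single-zone fallback and the per-zone wildcard + apex DNS names. The real logic lives in Terraform and Helm templates; the function names here are hypothetical, and zones are modelled as plain dicts.

```python
def parent_zones(parent_domains, sovereign_fqdn):
    """Return the zone list, falling back to one primary zone."""
    if parent_domains:
        return list(parent_domains)
    return [{"name": sovereign_fqdn, "role": "primary"}]


def wildcard_dns_names(zones):
    """One (wildcard, apex) pair per zone, mirroring one Certificate per zone."""
    return {z["name"]: [f"*.{z['name']}", z["name"]] for z in zones}
```

With `omani.works` and `omani.trade` as input, `wildcard_dns_names` yields two entries, one per zone, which matches the two Certificate resources in the multi-zone render.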

Closes #827.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 23:42:00 +04:00
e3mrah
dbbbcfa7dc
fix(bp-gitea): ship gitea-admin-secret with random password (#830) (#832)
bp-self-sovereign-cutover Step 1 (gitea-mirror) was stuck in
CreateContainerConfigError on otech102 because the cutover PodSpec
referenced `gitea-admin-secret` with `username`/`password` keys which
no chart materialised. Worse, the upstream gitea subchart fell through
to its hardcoded default password `r8sA8CPHD9!bt6d` whenever no
existingSecret was set — every fresh Sovereign would have shipped with
identical admin credentials.

Add templates/admin-secret.yaml: a Catalyst-curated Secret named
`gitea-admin-secret` with `username` (default `gitea_admin`) and
`password` (32-char random alphanumeric, generated on first install,
preserved across reconciles via Helm `lookup`). Wire
`gitea.gitea.admin.existingSecret = gitea-admin-secret` so the upstream
init container reads its admin creds from this Secret instead of the
hardcoded default. The same Secret is consumed by bp-self-sovereign-
cutover Step 1.

Resource-policy keep + lookup-based persistence guarantee the password
bytes are stable across helm upgrade, helm rollback, Flux
re-reconciliation, and even helm uninstall + reinstall.
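
The generate-once, reuse-afterwards behaviour can be sketched in Python as follows. This is only an analogy for the Helm `lookup` pattern, not the chart template itself; `ensure_admin_secret` is a hypothetical name and `existing` stands in for the Secret data returned by a lookup.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def ensure_admin_secret(existing=None, username="gitea_admin", length=32):
    """Return Secret data, reusing the stored password when one exists."""
    if existing and existing.get("password"):
        # Same idea as the lookup path: never regenerate existing bytes.
        return dict(existing)
    password = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return {"username": username, "password": password}
```

Calling the function a second time with the first result returns the same password, which is the property the cutover step relies on.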

Bumps bp-gitea 1.2.3 → 1.2.4 (Chart.yaml + blueprint.yaml).

Issue: openova-io/openova#830 (Bug 2)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:26:55 +04:00
e3mrah
ab67a48fe7
fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819)
TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for
9 blueprints because their platform/<name>/chart/Chart.yaml version had
been bumped without a matching update to platform/<name>/blueprint.yaml
spec.version. The pre-existing failure forced 7 recent PRs to self-merge
with --admin, masking real CI failures.

Aligned spec.version to match Chart.yaml version on:

  cert-manager   1.1.1 -> 1.1.2
  flux           1.1.3 -> 1.1.4
  crossplane     1.1.3 -> 1.1.4
  sealed-secrets 1.1.1 -> 1.1.2
  spire          1.1.4 -> 1.1.7
  nats-jetstream 1.1.1 -> 1.1.2
  openbao        1.2.0 -> 1.2.14
  keycloak       1.3.1 -> 1.3.2
  gitea          1.2.1 -> 1.2.3

Verified locally:

  $ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1
  --- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s)
      ... all 10 sub-tests pass (cilium + the 9 above)

The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself
the drift guardrail: it fails CI whenever Chart.yaml is bumped without a
matching blueprint.yaml bump. No additional script needed.
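
The drift check amounts to comparing two version maps. A hedged Python sketch (the actual guardrail is the Go test named above; `find_version_drift` is a hypothetical name and the dicts stand in for parsed Chart.yaml and blueprint.yaml files):

```python
def find_version_drift(chart_versions, blueprint_versions):
    """Return {name: (blueprint, chart)} for every blueprint whose versions differ."""
    drift = {}
    for name, chart_version in chart_versions.items():
        blueprint_version = blueprint_versions.get(name)
        if blueprint_version != chart_version:
            drift[name] = (blueprint_version, chart_version)
    return drift
```

For example, a chart at `spire 1.1.7` with a blueprint still at `1.1.4` is reported as drift, while a matching pair is not.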

Closes #817 once verified on main.

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
2026-05-04 22:32:49 +04:00
e3mrah
9645a9044a
feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) (#818)
* feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798)

Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add
the SME-2 metering integration end-to-end. NewAPI is consumed as the
upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned
mirror, not a fork) — the metering envelope is produced by a Go sidecar
that observes the OpenAI-style `usage.total_tokens` field on every
2xx /v1/* response. This avoids forking the upstream binary while still
producing the canonical envelope shape on `catalyst.usage.recorded`.

A) NewAPI metering sidecar — core/services/metering-sidecar/
   - Transparent reverse proxy in front of NewAPI on its own port; the
     bp-newapi Service routes the cluster-fronting port to the sidecar,
     which forwards to NewAPI on the pod's loopback.
   - Observes successful /v1/* JSON responses, parses
     `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes
     amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes
     one envelope on `catalyst.usage.recorded` per completed request.
   - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed.
   - Customer-facing latency is NEVER blocked on metering: the response
     body is restored before publish; on NATS unreachable the envelope
     is persisted to disk and retried by a background drain loop.
   - 14 unit tests (proxy + publisher + safeFilename guards).
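
The billing rule in section A can be sketched in Python as below. The sidecar itself is Go; this is an illustration under stated assumptions: admin paths are assumed to fall outside `/v1/`, a missing or non-positive `total_tokens` is assumed to be non-billable, and `usage_amount_micro_omr` is a hypothetical name.

```python
import json


def usage_amount_micro_omr(status, path, content_type, body, price_micro_omr_per_token):
    """Return the negative charge in micro-OMR, or None when not billable."""
    # Only successful /v1/* responses are billed.
    if not (200 <= status < 300) or not path.startswith("/v1/"):
        return None
    if "application/json" not in content_type:
        return None
    try:
        usage = json.loads(body).get("usage") or {}
    except (ValueError, AttributeError):
        # Malformed JSON, or a JSON value that is not an object.
        return None
    tokens = usage.get("total_tokens")
    if not isinstance(tokens, int) or tokens <= 0:
        return None
    return -tokens * price_micro_omr_per_token
```

At the chart's default price of 156 micro-OMR per token, a response reporting 1000 total tokens yields -156000 micro-OMR, i.e. -0.156 OMR.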

B) sme-billing NATS subscriber — core/services/billing/handlers/
   metering_consumer.go
   - JetStream durable consumer `sme-billing-metering` on stream
     `CATALYST_USAGE` (provisioned by sme-billing on startup).
   - Idempotent on metadata.request_id via a UNIQUE partial index on
     credit_ledger.external_ref; redelivery from the broker collapses
     to a single ledger row.
   - Customer auto-create on cold start (the rbac sme.user.created
     envelope may land AFTER the first metered request; we don't strand
     usage waiting for it).
   - 11 unit tests covering happy-path, idempotency, malformed-payload
     poison-pill, missing-request-id, non-negative amount guard,
     resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak.

C) HTTP handler POST /billing/metering/record — handlers/metering.go
   - Synchronous validate → INSERT credit_ledger → return
     {ledger_entry_id, balance_after_omr, balance_after_micro_omr,
     duplicate}. Same payload + idempotency guard as the NATS path.
   - Auth: superadmin OR sovereign-admin (operator-admin model;
     end-user LLM traffic flows through the sidecar, never this URL).
   - 8 unit tests covering happy-path, idempotency, role gating,
     malformed-JSON, positive-amount rejection, customer-not-found.

D) Schema — core/services/billing/store/store.go
   - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT
     (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR
     exact integer — preserves precision at metering rates).
   - ADD COLUMN external_ref TEXT + UNIQUE partial index for
     idempotency dedup.
   - ADD COLUMN metadata JSONB for the raw envelope.
   - GetCreditBalance projects both amount_omr (legacy) and
     amount_micro_omr (new) into the integer-OMR view.
   - GetCreditBalanceMicroOMR returns canonical precision.
   - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0)
     distinguishes fresh insert from duplicate without a follow-up
     SELECT.
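
The dedup-by-unique-index idea can be tried locally with SQLite. Note the swapped-in technique: this sketch uses SQLite's `INSERT OR IGNORE` plus `rowcount` rather than the Postgres `ON CONFLICT ... RETURNING (xmax<>0)` form the commit describes. The `customer_id` column, `open_ledger`, and `record_usage` as written here are assumptions for illustration only.

```python
import sqlite3


def open_ledger():
    """Create an in-memory ledger with a partial unique index on external_ref."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE credit_ledger ("
        " id INTEGER PRIMARY KEY,"
        " customer_id TEXT NOT NULL,"
        " amount_micro_omr INTEGER NOT NULL,"
        " external_ref TEXT)"
    )
    # Only rows that carry an external_ref take part in deduplication.
    db.execute(
        "CREATE UNIQUE INDEX credit_ledger_external_ref"
        " ON credit_ledger (external_ref) WHERE external_ref IS NOT NULL"
    )
    return db


def record_usage(db, customer_id, amount_micro_omr, request_id):
    """Insert once per request_id; return (duplicate, balance_micro_omr)."""
    cur = db.execute(
        "INSERT OR IGNORE INTO credit_ledger"
        " (customer_id, amount_micro_omr, external_ref) VALUES (?, ?, ?)",
        (customer_id, amount_micro_omr, request_id),
    )
    duplicate = cur.rowcount == 0  # 0 rows changed means the insert was ignored
    (balance,) = db.execute(
        "SELECT COALESCE(SUM(amount_micro_omr), 0) FROM credit_ledger"
        " WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return duplicate, balance
```

A redelivered envelope with the same request ID leaves the balance unchanged and is flagged as a duplicate, which is the behaviour both the NATS and HTTP paths depend on.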

E) Wiring
   - core/services/shared/events/nats.go — minimal NATS JetStream
     publisher + subscriber surface; legacy RedPanda producer/consumer
     in events.go untouched per [Q-mine-3].
   - core/services/billing/main.go — NATS_URL env; subscriber wired
     in parallel with the existing RedPanda tenant-events consumer.
   - middleware/jwt.go — exported test helper WithClaims so handler
     tests can construct an authenticated context without minting a
     real signed token.
   - .github/workflows/services-build.yaml — metering-sidecar added
     to the build matrix; deploy job skips it (image consumed by the
     bp-newapi chart, not products/catalyst sme-services).

F) bp-newapi chart (1.0.0 → 1.1.0)
   - meteringSidecar block in values.yaml: image, port, NATS URL,
     priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool
     dir, header names, resources, securityContext (read-only-rootfs).
   - deployment.yaml renders the sidecar container + emptyDir spool
     volume when meteringSidecar.enabled (default true).
   - service.yaml routes the cluster-fronting :3000 to the sidecar
     when enabled, exposes a separate :3001 → NewAPI direct port for
     bp-catalyst-platform admin-API traffic (ADR-0003 §3.2).
   - networkpolicy.yaml allows the sidecar's port + nats-system
     egress for JetStream publish.

Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green.
Helm template renders cleanly with sidecar enabled and disabled.

Closes #798

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798)

Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which
lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000")
that does NOT scan directly into Go int64 — the integration test
TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on
the post-redeem balance read.

Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is
unambiguously bigint and Scan target stays uniform across pre-#798 rows
(amount_omr only) and post-#798 rows (amount_micro_omr present).

Affects:
  - GetCreditBalance
  - GetCreditBalanceMicroOMR
  - RecordUsage's running-balance read

Test mocks updated to match the new SQL prefix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:32:42 +04:00
e3mrah
a6d2d25598
feat(bp-stalwart-tenant): per-SME dedicated mail server v0.1.0 (#801) (#815)
Adds platform/stalwart-tenant/ Blueprint chart implementing locked decision
[Q3] of EPIC #795 — every SME on a Sovereign gets its OWN Stalwart instance
in its tenant namespace, with its OWN domain, OWN MTA reputation, and OWN
queue. NOT a shared otech-level multi-domain Stalwart.

Components shipped:
  • StatefulSet (single-replica, RocksDB on PVC)
  • Service x3: SMTP/submission LoadBalancer, IMAP/IMAPS LoadBalancer,
    webmail/JMAP ClusterIP (fronted by Cilium Gateway HTTPRoute)
  • HTTPRoute (gateway mode, default) or Ingress (fallback) for webmail
    UI at https://mail.<sme-domain>
  • ConfigMap config.toml — Stalwart bootstrap config; OIDC bound to
    SME-vcluster Keycloak realm; uses == not = in expressions per
    stalwart_expression_syntax.md memory (incident 2026-04-14)
  • ConfigMap dns-records-required — MX/SPF/DKIM/DMARC for the SME admin
    (free-subdomain mode → published to PowerDNS by unified-rbac;
     BYO mode → surfaced in unified-rbac console UI for SME admin)
  • ExternalSecret x2 — admin password + OIDC client secret pulled from
    OpenBao at canonical paths
    sovereign/<sov>/stalwart/<tenant>/{admin,oidc}
  • Job (post-install) — bootstraps admin principal with email-receive
    permission and send-allow row; idempotent; covers stalwart_send_as.md
    group-permission gotcha (incident 2026-04-20)
  • NetworkPolicy — default-deny + explicit allows (SMTP/IMAP from
    anywhere, webmail from gateway namespace, egress to Keycloak/NATS/
    PowerDNS/DNS/outbound SMTP)
  • Tests: chart/tests/expression-syntax.sh — audits rendered config for
    the `==` rule

Per-user mailbox provisioning is event-driven (ADR-0003 §3): unified-rbac
POSTs Stalwart's /api/principal admin API on sme.user.created. The
continuous NATS subscriber Deployment is OFF by default (chart-level);
per-tenant overlay flips it on once the SME vcluster's NATS subject is
known.

Image SHA-pinned: docker.io/stalwartlabs/stalwart:v0.16.3 @
sha256:5d75cff4e9c6d75e64636e9ef9674b1d877f8f6fb2e11ee8176fbad3faaa5289
(Inviolable Principles #4 + #4a). global.imageRegistry rewrite supported
for post-handover Sovereign Harbor proxy-cache (ADR-0001 §11.5).

Smoke render passes with default values (623 lines, 8 manifests).
helm lint clean. Required values gated via per-template render-gates,
not fail() at chart root, so the platform-wide blueprint-release.yaml
hollow-chart + smoke gates pass (issue #181 + bp-openclaw 2026-05-04
failure mode avoided).

Closes #801 (chart published; UAT after smoke-deploy).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 22:22:46 +04:00
e3mrah
3e7284de45
fix(bp-wordpress-tenant): default-values smoke render must succeed (#800) (#814)
The Blueprint Release workflow runs `helm template <chart>` with NO
overrides as a smoke gate before publishing the OCI artifact. After
#800's initial merge (c141fcd1), that smoke step failed because
`smeDomain`, `keycloak.realmURL`, and `keycloak.clientSecretName`
used `required` calls or empty strings that produced render-time
errors:

  Error: execution error at (oidc-config-job.yaml:82:33):
    .Values.smeDomain or .Values.ingress.host MUST be set
    (no sensible default per INVIOLABLE-PRINCIPLES #4).

Fix: replace empty defaults with placeholder values
(`sme.local`, `https://auth.sme.local/realms/sme`,
`wordpress-oidc`) and remove the `required` template fences. Per-
Sovereign overlays MUST override these placeholders at install time;
the runtime `oidc-config` Job will surface a clear failure if they
remain on the placeholder (Keycloak realm URL won't resolve). This
matches the trade-off INVIOLABLE-PRINCIPLES #4 calls out — operator-
configurable values, no production-safe defaults, but smoke-render
still passes.

Verified:
  - `helm template smoke .` (no overrides) → 812 lines, 11 K8s
    resources rendered cleanly.
  - `helm template smoke . --set smeDomain=... --api-versions
    postgresql.cnpg.io/v1 ...` → 12 resources including the CNPG
    Cluster, with all wordpress images SHA-pinned to
    sha256:054e611...196.
  - chart/tests/observability-toggle.sh both cases PASS.
  - `helm lint` reports only the cosmetic icon-recommended INFO note.

Refs: #800

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 22:19:40 +04:00
e3mrah
d6dedb1ecd
fix(bp-openclaw): use placeholder defaults so blueprint-release smoke render passes (#803) (#813)
The blueprint-release CI workflow runs `helm template <chart>` with
default values as a smoke gate (.github/workflows/blueprint-release.yaml
SMOKE step). The original chart shipped empty-string defaults for every
required value (keycloak.realmURL, tenant.namespace, etc.) and used
`required` / `fail` to abort render — which is correct fail-fast
behaviour for real installs but wrongly fails CI's default-values
smoke step. Result: bp-openclaw 0.1.0 never published to GHCR (run
25335221500 fail).

Match the bp-self-sovereign-cutover pattern (PR #791): provide
placeholder defaults that let smoke render produce valid YAML, gated
behind a new `assertNoPlaceholders` toggle that per-cluster Flux
overlays MUST set to `true`. With the toggle ON, _helpers.tpl ::
assertNoPlaceholders fails render with a clear message identifying any
placeholder still in place.

Changes:
- values.yaml: add placeholder defaults for keycloak.realmURL,
  keycloak.clientSecretName, newapi.baseURL, tenant.namespace,
  ingress.host, controller.image.tag, perUserPod.image.tag.
  Add `assertNoPlaceholders: false` flag (overlays set true).
- _helpers.tpl: replace assertRequired with assertNoPlaceholders —
  same intent, runs only when the toggle is on, so smoke render passes
  while real installs still get fail-fast on bad overlays.
- serviceaccount.yaml: invoke assertNoPlaceholders instead of assertRequired.
- controller-deployment.yaml + controller-ingress.yaml: drop the
  `required` calls (defaults are now valid bytes; the
  assertNoPlaceholders helper enforces real values at install time).
- tests/render-toggles.sh: rewrite Case 1 (now expects success) and
  Case 2 (asserts assertNoPlaceholders=true fails on placeholders) +
  Case 2b (assertNoPlaceholders=true with real values succeeds).
  All 7 gates pass locally.
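
The toggle-gated check can be sketched in Python as follows. The real helper is a Helm template in _helpers.tpl; `assert_no_placeholders`, the flat dotted-key dict, and the placeholder strings used in the usage note are all hypothetical.

```python
def assert_no_placeholders(values, placeholders, enabled):
    """Raise when the toggle is on and any value is still a placeholder."""
    if not enabled:
        # Smoke-render path: placeholders are allowed through.
        return
    stale = sorted(key for key, value in values.items() if value in placeholders)
    if stale:
        raise ValueError("placeholder values still set: " + ", ".join(stale))
```

With `enabled=False` a values dict such as `{"ingress.host": "placeholder.invalid"}` passes; with `enabled=True` the same dict raises and names `ingress.host`.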

Output (post-merge): chart published to
oci://ghcr.io/openova-io/bp-openclaw:0.1.0.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:17:43 +04:00
e3mrah
20b3c5258a
feat(bp-newapi): chart maturation + first-otech deploy + Qwen vLLM channel (#799) (#812)
* feat(bp-newapi): chart maturation — ExternalSecret + first-otech vLLM channel + skip-render gates (#799)

Maturation work for the SME-3 turnkey-experience epic (#795). Aligns
the bp-newapi scratch chart with ADR-0003 (RBAC ↔ NewAPI user-create
hook contract) and gets it past the blueprint-release CI smoke render
that has blocked publication since PR #396 (run 25213444992 failed at
default-values render of v1.0.0).

Changes
-------
- templates/external-secret.yaml (NEW). Renders the
  `catalyst-newapi-admin-token` ExternalSecret consumed by unified-rbac
  (ADR-0003 §3.2 + §6) for issuing per-user keys against
  `http://newapi.newapi.svc/api/v1/admin/users`. Sourced from OpenBao
  via the `vault-region1` ClusterSecretStore (canonical default shipped
  by bp-external-secrets-stores). Capabilities-gated on
  `external-secrets.io/v1beta1` so cold installs without ESO don't
  fail-render. Operator supplies the per-Sovereign OpenBao path via
  `catalystIntegration.externalSecret.remoteRef.key`; canonical
  convention is `sovereign/<sovereign-fqdn>/newapi/admin-token` with
  property `ADMIN_API_TOKEN`. Per Inviolable Principle #4 every knob
  is operator-overridable in the cluster overlay.

- values.yaml. Adds `catalystIntegration.externalSecret.{enabled,
  refreshInterval, secretStoreRef.{kind,name}, remoteRef.{key,property}}`
  block (default enabled=true, key="" so a misconfigured overlay fails
  loudly at render rather than silently skipping). Adds
  `defaultChannels.vllm` block — first-otech shorthand that composes a
  vLLM-typed channel into the rendered channels list when enabled.
  Default endpoint is empty per Inviolable Principle #4; the
  `clusters/<sovereign>/bootstrap-kit/80-newapi.yaml` overlay supplies
  the per-Sovereign URL (canonical first-otech reference =
  `https://llm-api.omtd.bankdhofar.com` model `qwen3-coder`, the same
  upstream Axon uses on the OpenOva marketing deployment).

- templates/_helpers.tpl. New `bp-newapi.effectiveChannels` helper
  composes `.Values.channels` with `defaultChannels.vllm` (when
  enabled). The `assertChannelAttestation` helper now operates on the
  effective list so attestation gates apply to defaultChannels
  composition too. `defaultChannels.vllm.enabled=true` with empty
  endpoint fails-fast at render with a guided error message.

- templates/configmap.yaml. Channels rendering switches to the
  effectiveChannels helper. OIDC block now skip-renders gracefully when
  `auth.adminUI.keycloak.issuer` is unset (smoke-render path) instead
  of `required`-failing; the per-Sovereign overlay sets the issuer.

- templates/deployment.yaml. Skip-render gate on Deployment when
  `database.existingSecret`, `credentials.existingSecret`, or (when
  Keycloak mode is selected) the OIDC client secret is missing. Removes
  the four `required` calls that were failing CI smoke render. Service,
  ServiceAccount, ConfigMap, NetworkPolicy still render so the smoke
  test gets a non-empty output proving structural soundness; the actual
  Deployment defers until the per-Sovereign overlay wires the secrets.

- templates/ingress.yaml. Same skip-render pattern: when either
  `ingress.host` or `ingress.adminHost` is empty, the entire ingress
  block is silently skipped. Matches the bp-keycloak / bp-openbao /
  bp-external-dns HTTPRoute templates.

- Chart.yaml. version 1.0.0 → 1.1.0 (minor bump — additive features;
  no breaking changes to existing operator overrides).

Verification
------------
`helm template` smoke render on default values now succeeds with 4
resources (NetworkPolicy / ServiceAccount / ConfigMap / Service); 168
lines, well above the CI 5-line minimum. With a full per-Sovereign
overlay (hosts + secrets + Keycloak issuer + ESO Capabilities + Traefik
Capabilities + defaultChannels.vllm.endpoint), 8 resources render
including Deployment, both Ingresses, the Traefik allowlist Middleware,
and the ExternalSecret. The composed qwen channel writes through to
`channels.yaml` with the expected endpoint + models + attestation.

Refs
----
ADR-0003 §3.2 + §6 — admin-token contract
Issue #795 (epic) — locked decisions
Issue #796 — hook contract spec (sequential blocker, merged)
Inviolable Principles #1, #3, #4

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(bootstrap-kit): slot 80 — bp-newapi default install (#799)

Adds the canonical install slot for bp-newapi to every fresh Sovereign's
bootstrap-kit. Sequenced after the W2.K1 dependency wave so NewAPI's
ExternalSecret + Postgres DSN dependencies resolve on first reconcile.

The HelmRelease declares `dependsOn: [bp-openbao, bp-keycloak, bp-cnpg]`:
- bp-openbao(08): admin-token ExternalSecret backend
- bp-keycloak(09): OIDC issuer for ops-staff admin UI at admin.<fqdn>
- bp-cnpg(16): Postgres backing for users/credits/channels/audit

Per-Sovereign overlays inherit the slot's defaults and override:
- ingress.host                                        api.${SOVEREIGN_FQDN}
- ingress.adminHost                                   admin.${SOVEREIGN_FQDN}
- auth.adminUI.keycloak.issuer
- database.existingSecret                             (Crossplane-claimed)
- credentials.existingSecret
- catalystIntegration.externalSecret.remoteRef.key    sovereign/${FQDN}/newapi/admin-token
- defaultChannels.vllm.enabled                        true (first-otech)
- defaultChannels.vllm.endpoint                       (operator-supplied)

The `_template/` slot keeps `defaultChannels.vllm.enabled: false` so a
fresh Sovereign does not silently wire customers to a third-party
endpoint; the canonical first-otech reference (Qwen3 Coder via
`https://llm-api.omtd.bankdhofar.com`, same relay Axon uses on the
OpenOva marketing deployment) is documented in-line for operators
adopting the same upstream.

Refs: #795 (epic), ADR-0003

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bootstrap-deps): register bp-newapi slot 80 in expected DAG (#799)

Fixes the dependency-graph-audit drift detection caught at PR #812 CI:
the audit script enumerates HelmReleases in clusters/_template/bootstrap-kit/
and compares to scripts/expected-bootstrap-deps.yaml; an HR present on
disk but absent from the expected DAG is treated as drift.

Adds the canonical entry for bp-newapi at slot 80 with the same
depends_on set declared on the HelmRelease itself
([bp-openbao, bp-keycloak, bp-cnpg]).
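
The audit reduces to comparing two dependency maps. A hedged Python sketch: only the "on disk but absent from the expected DAG" rule is stated above; the reverse check and the dependency comparison are assumptions, and `audit_bootstrap_deps` is a hypothetical name rather than the real script.

```python
def audit_bootstrap_deps(on_disk, expected):
    """Compare {release: [deps]} maps and return human-readable drift findings."""
    findings = []
    for name in sorted(set(on_disk) - set(expected)):
        findings.append(f"{name}: present on disk but missing from expected DAG")
    for name in sorted(set(expected) - set(on_disk)):
        findings.append(f"{name}: expected but not found on disk")
    for name in sorted(set(on_disk) & set(expected)):
        if sorted(on_disk[name]) != sorted(expected[name]):
            findings.append(f"{name}: dependsOn differs from expected")
    return findings
```

An empty list means no drift; before this fix, `bp-newapi` would have appeared in the first category.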

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-newapi): align blueprint.yaml spec.version with Chart.yaml (#799)

The TestBootstrapKit_BlueprintCardsHaveRequiredFields static-validation
gate asserts Chart.yaml version == blueprint.yaml spec.version. The
chart was bumped to 1.1.0 in c63ecd8c; bumping the blueprint metadata
to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:17:25 +04:00
e3mrah
c141fcd1d3
feat(bp-wordpress-tenant): turnkey SSO-wired WordPress per SME (#800) (#811)
New scratch Blueprint chart `bp-wordpress-tenant` v0.1.0 that
provisions a turnkey, SSO-pre-wired WordPress instance per SME tenant
inside the SME's vcluster, satisfying ticket #800 (SME-5) of the #795
SME-tenant turnkey experience epic.

What it provisions:

  - Deployment of `wordpress:6-php8.3-apache` (manifest-list digest
    sha256:054e611...196), pulled through the Sovereign Harbor
    proxy-cache when `global.imageRegistry` is set (per
    INVIOLABLE-PRINCIPLES #4).
  - Two initContainers seed wp-content/ from the image onto the PVC
    and install the openid-connect-generic plugin + pg4wp Postgres
    drop-in from wordpress.org / GitHub. Idempotent, runs only once
    per PVC.
  - Postgres provisioned in-tenant via a `Cluster.postgresql.cnpg.io`
    (default `wordpress-db`, 1 instance, 10Gi, pg16). The CNPG-emitted
    `<cluster>-app` Secret is mirrored into `wordpress-database-secret`
    by Reflector + a post-install sync Job (otech30 race fix carried
    forward from bp-gitea).
  - PVC for `/var/www/html/wp-content/` (default 10Gi, RWO,
    helm.sh/resource-policy: keep so customer content survives
    `helm uninstall`).
  - Ingress at `wordpress.<smeDomain>` with cert-manager TLS via
    operator-supplied ClusterIssuer (default `letsencrypt-prod`).
  - NetworkPolicy restricting egress to bp-cnpg :5432, Keycloak
    :8443/:8080, kube-dns, and HTTPS to public IPs (for plugin/theme
    fetches).
  - Three post-install Jobs:
      hook weight 5  — db-secret-sync (PATCHes wordpress-database-
                       secret.password from CNPG <cluster>-app)
      hook weight 10 — oidc-config (UPSERTs openid_connect_generic_
                       settings, active_plugins, template/stylesheet,
                       siteurl/home rows in wp_options via PHP+PDO)
      hook weight 15 — admin-user (INSERT/UPDATE wp_users +
                       wp_usermeta for SME admin's email with
                       administrator role)

After all hooks complete, the SME admin's first browser hit lands on
/wp-admin authenticated via Keycloak SSO — no install wizard, no
manual config.

Hollow-chart guard (issue #181) satisfied via the `common` library
subchart from sigstore, matching bp-newapi's pattern for scratch
charts (no first-party WordPress Helm chart exists upstream).

Tests:
  - chart/tests/observability-toggle.sh verifies BLUEPRINT-AUTHORING
    §11.2 (default render produces no PodMonitor/ServiceMonitor).
  - `helm template` smoke render with required values produces 11 K8s
    resources cleanly; `helm lint` zero-failure.

Refs: #800, #795

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-04 22:13:32 +04:00
e3mrah
93bd3ace5b
feat(bp-openclaw): workspace controller + per-user pod chart (#803) (#810)
Implements locked decision [A] of epic #795: per-SME-tenant workspace
controller deployment + per-user runtime pod, identity-blind by
construction. Consumes the per-user newapi-key-{uuid} Secrets rendered
by the unified-rbac user-create hook (ADR-0003 §3.3).

What this delivers:
- platform/openclaw/chart/        bp-openclaw v0.1.0 (no-upstream)
- platform/openclaw/runtime/      Go reference runtime (NEWAPI_BASE_URL
                                  + NEWAPI_KEY env contract only)
- .github/workflows/openclaw-runtime.yaml
                                  Event-driven build for the runtime
                                  image (paths-on-push + manual rerun;
                                  NO schedule:cron per CLAUDE.md).
- platform/openclaw/blueprint.yaml
                                  Catalyst registration + configSchema.

Chart highlights:
- Required values guarded by _helpers.tpl :: assertRequired so missing
  realmURL/clientSecretName/tenant.namespace/baseURL/host fail render
  with helpful messages.
- RBAC: namespaced Role in tenant ns; create verbs split into separate
  rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md.
  Label-based ownership (catalyst.openova.io/openclaw-user) enforced at
  the controller, not in RBAC.
- ingress: cert-manager.io/cluster-issuer annotation triggers ACME
  auto-issuance for openclaw.<sme-domain>.
- per-user pod template ConfigMap holds the pod-spec the controller
  renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders
  filled at session-start.
- networkPolicy covers controller pod only; per-user pod NetworkPolicy
  is rendered by the controller at session-start (target hostname is
  read from the per-user Secret which doesn't exist at chart-render
  time — documented in README.md).

Tests: chart/tests/render-toggles.sh (7 cases) covers required-value
enforcement, RBAC create+resourceNames violation guard, ServiceMonitor
default-off, networkPolicy toggle, pod-template placeholder presence,
cert-manager annotation. All seven gates pass locally.

Closes part of #795 (epic still open).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:10:24 +04:00
e3mrah
33dc98782b
feat(bp-self-sovereign-cutover): chart + bootstrap-kit slot 06a (#791) (#808)
New platform Blueprint at `platform/self-sovereign-cutover/chart/`. Ships
DORMANT — eight step PodSpec ConfigMaps, the registry-pivot DaemonSet, the
mutable cutover-status ConfigMap, plus ServiceAccount/RBAC. The catalyst-api
cutover endpoint (#792, merged at 03828641) reads each step ConfigMap by
label selector and stamps real Jobs only on operator-driven trigger.

Step inventory:
  01 gitea-mirror             — git push --mirror upstream → local Gitea
  02 harbor-projects          — create 7 proxy-cache projects
  03 harbor-prewarm           — HEAD-pull bootstrap-kit images through cache
  04 registry-pivot           — DaemonSet rewrites registries.yaml on every node
  05 flux-gitrepository-patch — pivot GitRepository.url → local Gitea
  06 helmrepository-patches   — pivot 38 OCI URLs → local Harbor
  07 catalyst-api-env-patch   — kubectl set env CATALYST_GITOPS_REPO_URL
  08 egress-block-test        — CiliumNetworkPolicy + 10-min sovereignty proof

Plus self-sovereign-cutover-status ConfigMap with the consumer-contract keys
(cutoverComplete, currentStep, step.<name>.result, etc.) shipped at install
with helm.sh/resource-policy: keep so chart uninstall doesn't lose state.

Bootstrap-kit slot `06a-bp-self-sovereign-cutover.yaml` installs the chart
into the `catalyst` namespace (matches catalyst-api's default discovery
namespace), depends on bp-gitea + bp-harbor, uses disableWait: true.

RBAC splits `create` verbs into their own Rule WITHOUT resourceNames per
feedback_rbac_create_no_resourcenames.md — the bp-openbao loop anchor.
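
The create/resourceNames split can be checked mechanically. A minimal Python sketch, assuming rules are plain dicts shaped like Kubernetes PolicyRule entries; `create_with_resource_names` is a hypothetical name and not part of chart/tests/cutover-contract.sh.

```python
def create_with_resource_names(rules):
    """Return indexes of rules that combine the create verb with resourceNames."""
    return [
        index
        for index, rule in enumerate(rules)
        if "create" in rule.get("verbs", []) and rule.get("resourceNames")
    ]
```

A compliant Role keeps `create` in its own rule with no `resourceNames`, and restricts `get`/`update`/`patch` by name in a separate rule, so the function returns an empty list.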

chart/tests/cutover-contract.sh enforces:
  - 8 step ConfigMaps render
  - required labels (part-of/component/cutover-order/cutover-mode)
  - required data keys (stepName + podSpec for job-mode)
  - step 04 mode=daemonset-wait
  - status ConfigMap retained on uninstall
  - RBAC create/resourceNames split

helm template smoke render: 1180 lines, 19 resources (1 Namespace + 1 SA +
11 ConfigMaps + 1 DaemonSet + 1 ClusterRole + 1 ClusterRoleBinding).
helm lint: clean.
scripts/check-bootstrap-deps.sh: PASSED (slot 6a registered, depends_on
[bp-gitea, bp-harbor]).

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:55:19 +04:00
e3mrah
2e981f36a5
fix(bp-keycloak): catalyst-kc-sa-credentials addr → in-cluster Service URL (closes #781) (#788)
Sovereign-side catalyst-api Pod's intra-cluster Keycloak calls (token
mint, EnsureUser) were failing with `dial tcp: lookup
auth.<sov-fqdn> on 10.43.0.10:53: no such host`. The Sovereign's
CoreDNS resolves *.<sov-fqdn> via upstream resolvers — it does NOT
forward to the in-cluster PowerDNS that holds those records. Public
DNS works (PowerDNS authoritative), but Pod-side lookups of
auth.<sov-fqdn> return NXDOMAIN.

Live evidence — otech94 2026-05-04: handover URL returned
`{"error":"keycloak error: ensure user"}` from a DNS lookup failure
inside the catalyst-api Pod.

Fix: bp-keycloak chart now writes the in-cluster Service URL
(http://<release>.<namespace>.svc.cluster.local) into the
catalyst-kc-sa-credentials Secret's `addr` key instead of the public
gateway host (https://auth.<sov-fqdn>). This Secret is consumed
EXCLUSIVELY by the in-cluster catalyst-api Pod via reflector mirror
into catalyst-system; it is NEVER exposed to browsers.

The HTTPRoute hostname (.Values.gateway.host) stays at auth.<sov-fqdn>
for operator browsers — only the Pod's intra-cluster OAuth
client_credentials calls switch to the Service URL.
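
A rough illustration of the address shape the Secret now carries, assuming the Service name equals the release name as in the URL pattern above. The function names and the sample release/namespace values are hypothetical.

```python
from urllib.parse import urlparse


def in_cluster_service_url(release: str, namespace: str, scheme: str = "http") -> str:
    """Build the cluster-local URL in the <release>.<namespace>.svc.cluster.local shape."""
    return f"{scheme}://{release}.{namespace}.svc.cluster.local"


def is_in_cluster(addr: str) -> bool:
    """True when the host part of addr is a cluster-local Service name."""
    host = urlparse(addr).hostname or ""
    return host.endswith(".svc.cluster.local")


# Hypothetical release and namespace names.
addr = in_cluster_service_url("keycloak", "keycloak")
print(addr)
print(is_in_cluster(addr))
print(is_in_cluster("https://auth.sovereign.example"))
```

A check like `is_in_cluster` could guard against the public gateway host creeping back into the `addr` key.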

Catalyst-Zero (contabo) is unaffected: it runs `keycloak-zero`
(separate chart in openova-private), not bp-keycloak.

Changes:
- platform/keycloak/chart/templates/configmap-sovereign-realm.yaml:
  Secret's $kcAddr unconditionally uses
  http://<release>.<namespace>.svc.cluster.local
- platform/keycloak/chart/Chart.yaml: 1.3.1 → 1.3.2
- clusters/_template/bootstrap-kit/09-keycloak.yaml: chart version 1.3.1 → 1.3.2
- products/catalyst/chart/Chart.yaml: 1.3.0 → 1.3.1 (changelog entry only)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: 1.3.0 → 1.3.1

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:34:22 +04:00
e3mrah
53bc4357ca
feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767)

Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers
couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB):

1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate"
   Section with: bootstrap-kit baseline (sum of mandatory-tier component
   footprints), selected components delta, control-plane overhead, and a
   "Recommended N x <SKU>" line that turns amber when the operator's chosen
   worker count is below the rollup. Backed by per-component RAM/CPU floors
   in components/wizard/steps/componentFootprints.ts (covered by 12 unit
   tests including the otech92 reproduction).

2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at
   bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart
   9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired
   from the canonical flux-system/cloud-credentials.hcloud-token Secret
   cloud-init writes (mirrors the velero/harbor object-storage pattern).
   Pinned to the control-plane node so the autoscaler never schedules onto
   a worker it could itself terminate. 10-minute scale-down idle as the
   cost-saving default.
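
The pre-launch estimate in item 1 can be sketched as a RAM rollup. This is a simplified model, not the wizard's code: it ignores CPU and control-plane overhead, takes 8 GB per cpx32 from the "2x cpx32 ... 16 GB" figure above, and uses made-up component footprints.

```python
import math
from typing import Dict

WORKER_RAM_GB = 8.0  # per cpx32, inferred from "2x cpx32 ... 16 GB" above


def recommended_workers(footprints_gb: Dict[str, float],
                        worker_ram_gb: float = WORKER_RAM_GB) -> int:
    """Round the summed RAM floors up to a whole number of workers."""
    total = sum(footprints_gb.values())
    return max(1, math.ceil(total / worker_ram_gb))


# Hypothetical footprints, chosen only to mimic the otech92 shape.
footprints = {"bootstrap-kit baseline": 16.0, "external-secrets-webhook": 0.5}
needed = recommended_workers(footprints)
chosen = 2
status = "amber" if chosen < needed else "ok"
print(needed, status)  # 3 amber
```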

Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA /
KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over
KEDA for cluster scaling, and the bounds + safety story.

Per the issue's MVP scope, this PR ships the blueprint + StepReview
estimate WITHOUT the wizard StepProvider min/max pair refactor or the
tofu node-pool template restructuring. Those are tracked as a follow-up
issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps

Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected-
bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776
because the file existed without a matching entry in the expected DAG, AND
collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort +
slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to
the expected-bootstrap-deps.yaml so the audit passes.

`scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0).
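
A toy model of the two failures the audit reported, assuming the expected DAG is a slot-to-blueprint mapping. This is not the real check-bootstrap-deps.sh logic; the function and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def audit_slots(expected: Dict[str, str],
                actual: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Return (collisions, unregistered) for the actual (slot, blueprint) pairs."""
    collisions = []
    unregistered = []
    for slot, blueprint in actual:
        owner = expected.get(slot)
        if owner is not None and owner != blueprint:
            collisions.append(f"slot {slot}: {blueprint} collides with {owner}")
        if blueprint not in expected.values():
            unregistered.append(blueprint)
    return collisions, unregistered


expected = {"40": "bp-llm-gateway", "49": "bp-cert-manager-powerdns-webhook"}

# Before the fix: slot 40 is reserved and the new blueprint is not registered.
print(audit_slots(expected, [("40", "bp-cluster-autoscaler-hcloud")]))

# After the fix: moved to slot 50 and added to the expected DAG.
expected["50"] = "bp-cluster-autoscaler-hcloud"
print(audit_slots(expected, [("50", "bp-cluster-autoscaler-hcloud")]))  # ([], [])
```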

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:49:44 +04:00
e3mrah
0dbdf3b327
fix(bp-trivy): node-collector tolerates control-plane taint (closes #769) (#772)
PR #755 added `node-role.kubernetes.io/control-plane=true:NoSchedule` to
the CP node when worker_count > 0. Two bootstrap-kit charts have pods
that MUST land on the CP and lacked the matching toleration:

bp-trivy
  • node-collector: Pod pinned to each node via nodeSelector
    `kubernetes.io/hostname=<node>`. The CP-bound collector reads
    /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler,
    /var/lib/kube-controller-manager via hostPath — these only exist
    on the CP. Without the toleration the collector sat Pending forever
    on otech93 (live evidence in #769).
  • scanJobTolerations: per-workload scan jobs the operator spawns may
    target pods on CP-only system DaemonSets (kube-system kube-proxy
    in non-Cilium mode, etc.). Adding the toleration here so reports
    are produced for those workloads too.

bp-alloy
  • DaemonSet — one pod MUST land on every node including the CP, so
    CP-local kubelet logs + node metrics flow into the LGTM stack.
    Without the toleration Alloy ran 3/4 nodes (Ready=N-1) on otech93
    and CP telemetry was silently lost.

Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP
is untainted in solo mode per PR #755's conditional.
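
For reference, a small sketch of how a toleration is matched against the CP taint. It follows the Kubernetes matching rules (an empty effect matches all effects; `Exists` with an empty key matches every key). The toleration shapes are illustrative, not copied from either chart.

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True when a single toleration matches the given taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with an empty key matches every taint key.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: List[Dict[str, str]], taint: Dict[str, str]) -> bool:
    """A Pod fits a tainted node when any of its tolerations matches."""
    return any(tolerates(t, taint) for t in tolerations)


cp_taint = {"key": "node-role.kubernetes.io/control-plane",
            "value": "true", "effect": "NoSchedule"}

print(schedulable([], cp_taint))  # False
print(schedulable([{"key": "node-role.kubernetes.io/control-plane",
                    "operator": "Exists", "effect": "NoSchedule"}], cp_taint))  # True
```

On a solo Sovereign the node carries no taint, so the toleration is never consulted.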

Versions bumped:
  • bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins)
  • bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins)

Out of scope (audited, no change needed):
  • bp-cilium — upstream defaults already tolerate everything (verified
    on otech93: cilium DaemonSet at 4/4 nodes).
  • bp-falco — values.yaml already declares NoSchedule + NoExecute
    Exists tolerations (4/4 on otech93).
  • cnpg/harbor — no kubelet-cert-renew Jobs in current charts.

Verified:
  • `helm template` on both charts renders the expected toleration
    (alloy: pod-spec; trivy: trivy-operator-config ConfigMap consumed
     by the operator at scan-job spawn time).
  • `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:38:29 +02:00
e3mrah
31784d7ed5
fix(bp-external-dns): apiserver Endpoints sync timeout — Cilium kube-apiserver entity required (closes #770) (#771)
* fix(bp-external-dns): grant apiserver egress via CiliumNetworkPolicy (closes #770)

Root cause: ExternalDNS crashloops on every fresh Sovereign provision
with `failed to sync *v1.Endpoints: context deadline exceeded`. The
companion vanilla NetworkPolicy egress rule
`to: ipBlock: 0.0.0.0/0 ports: 443,6443` does NOT match traffic to the
kube-apiserver under Cilium with the default `policy-cidr-match-mode: ""`.
Cilium models the apiserver as a reserved identity, not a CIDR range,
so the ipBlock rule is bypassed and the apiserver call is dropped at
the egress hook of the external-dns endpoint.

Fix: render a companion CiliumNetworkPolicy with
`toEntities: [kube-apiserver]` scoped to the external-dns Pod selector.
This is the canonical Cilium pattern for controllers that watch the
apiserver. The existing vanilla NetworkPolicy is preserved verbatim so
the Blueprint remains CNI-agnostic per BLUEPRINT-AUTHORING.md.
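
The companion policy has roughly the following shape, shown here as a Python dict dumped to JSON (which kubectl also accepts). The metadata name, namespace and Pod labels are placeholders; the chart's actual values may differ.

```python
import json
from typing import Dict


def apiserver_egress_policy(name: str, namespace: str,
                            pod_labels: Dict[str, str]) -> dict:
    """Build a CiliumNetworkPolicy that allows egress to the kube-apiserver entity."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Scope the rule to the workload that needs it.
            "endpointSelector": {"matchLabels": pod_labels},
            "egress": [{"toEntities": ["kube-apiserver"]}],
        },
    }


# Placeholder names and labels, for illustration only.
policy = apiserver_egress_policy(
    "external-dns-apiserver-egress",
    "external-dns",
    {"app.kubernetes.io/name": "external-dns"},
)
print(json.dumps(policy, indent=2))
```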

Live proof on otech93 (2026-05-04): manually applied the rendered CNP
to the running cluster, external-dns transitioned from CrashLoopBackOff
(8 restarts in 20m) to 1/1 Running within 30s, informer cache sync
completed cleanly.

Bumps bp-external-dns 1.1.6 → 1.1.7.

Why not `policy-cidr-match-mode: nodes` cluster-wide on bp-cilium? It
silently relaxes EVERY other NetworkPolicy that uses 0.0.0.0/0 in the
cluster — too broad. Per INVIOLABLE-PRINCIPLES the fix MUST be scoped
to the workload that needs it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(_template): bump bp-external-dns 1.1.6 → 1.1.7 to pick up CNP fix

Pairs with the chart bump in the same PR. Every fresh otech provision
hydrates clusters/_template/, so this pin is what determines the
version installed. Without bumping here, otech94+ would still use
1.1.6 and continue to crashloop with the apiserver-egress symptom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:27:17 +04:00
e3mrah
69de64ba19
fix(cilium): k8sServiceHost 127.0.0.1 → 10.0.1.2 so workers' Cilium can reach apiserver (#738)
Issue #733 follow-up. The default cpx32 multi-node Sovereign (1 CP + 2
workers) provisioned successfully, but worker nodes stuck NotReady
because cilium-agent on workers crashloop'd:

  Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system":
    dial tcp 127.0.0.1:6443: connect: connection refused

Root cause: `k8sServiceHost: 127.0.0.1` works on the k3s SERVER node
(supervisor binds localhost:6443) but FAILS on every k3s AGENT node
(agent does NOT expose apiserver on localhost — only the supervisor
on :6444). Pre-#733 every Sovereign was solo (worker_count=0), so
this never fired.

Fix: point Cilium at `10.0.1.2`, the CP's stable private IP on the
Sovereign's 10.0.1.0/24 subnet (cp1=10.0.1.2 per main.tf network
block). No-op on the CP (10.0.1.2 IS its own private IP) and works
on workers (which already join the cluster via the same address per
cloudinit-worker.tftpl `K3S_URL=https://${cp_private_ip}:6443`).
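
A simplified sanity check for the value, assuming a host is usable from workers only when it is non-loopback and sits inside the Sovereign's private subnet. The helper is hypothetical and does not exist in the repo.

```python
import ipaddress

SOVEREIGN_SUBNET = ipaddress.ip_network("10.0.1.0/24")


def reachable_from_workers(host: str) -> bool:
    """Loopback only reaches the local node; the CP private IP reaches the CP."""
    addr = ipaddress.ip_address(host)
    return not addr.is_loopback and addr in SOVEREIGN_SUBNET


print(reachable_from_workers("127.0.0.1"))  # False
print(reachable_from_workers("10.0.1.2"))   # True
```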

Files:
- infra/hetzner/cloudinit-control-plane.tftpl — bootstrap helm install
  values file written to /var/lib/catalyst/cilium-values.yaml
- platform/cilium/chart/values.yaml — Flux bp-cilium HelmRelease
  values (cilium_values_parity_test.go enforces the two stay aligned)

Observed live on otech50: 3× CPX32 servers running, 1 CP Ready, 2
workers registered with k3s but NotReady due to cilium init failure.
After this fix workers should reach Ready, and the Phase-1 watcher
sees all components Ready=True across the multi-node cluster.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:23:51 +04:00
e3mrah
9a58289786
fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (closes #715) (#716)
* fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713)

Closes #713

Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing
401 on /auth/handover:

1. SOVEREIGN_FQDN race
   api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn"
   with optional:true. On Sovereigns, that ConfigMap is rendered by the
   sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform
   HelmRelease. When the Pod starts first, valueFrom collapses to "" and
   stays empty — audience check rejects every valid token as "invalid
   audience". Fix: add Reloader annotations so the Pod rolls when the
   ConfigMap (and the handover-jwt-public Secret) appears.

2. catalyst-api-server SA missing user-level realm-management role mappings
   bp-keycloak realm import granted roles via clientScopeMappings — wrong
   level. The actual service-account user had no clientRoles entry, so KC
   rejected GET /users with 403 when catalyst-api tried to ensure the
   operator user during handover. Fix: add explicit "users" array binding
   service-account-catalyst-api-server to realm-management.{impersonation,
   manage-users, view-users, query-users}.

* fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (#715)

Closes #715

Two architectural bugs surfaced live on otech64 (2026-05-03), both leading
to a healthy-looking Sovereign that the operator could not reach.

1. catalyst-api tofu workdir on emptyDir
   CATALYST_TOFU_WORKDIR=/tmp/catalyst/tofu (emptyDir). When contabo's
   catalyst-api Pod rolled mid-apply (the PR #714 deploy commit triggered
   a rolling restart 3 minutes into otech64's tofu run), in-progress state
   was lost. Tofu had created LB/network/server/services but not the
   hcloud_load_balancer_target.control_plane resource yet — the cluster
   came up at the k3s level but the public LB had no targets, returning
   TLS handshake failure for every console.<sov> request.

   Move CATALYST_TOFU_WORKDIR to /var/lib/catalyst/tofu (PVC-backed,
   fsGroup=65534 already wires write access). tofu apply resumes from
   where it left off after any Pod restart.

2. bp-reloader env-vars strategy
   reloadStrategy=env-vars only injects checksum env vars for ConfigMaps
   referenced via envFrom. Workloads using valueFrom: configMapKeyRef
   (catalyst-api's SOVEREIGN_FQDN) are silently not reloaded — the
   configmap.reloader.stakater.com/reload annotation added in PR #714
   was a no-op under env-vars.

   Switch to reloadStrategy=annotations. Reloader bumps a pod-template
   annotation, triggering rollout regardless of how the CM/Secret is
   referenced.

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 02:04:26 +04:00
e3mrah
e96e31a781
fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713) (#714)
Closes #713

Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing
401 on /auth/handover:

1. SOVEREIGN_FQDN race
   api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn"
   with optional:true. On Sovereigns, that ConfigMap is rendered by the
   sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform
   HelmRelease. When the Pod starts first, valueFrom collapses to "" and
   stays empty — audience check rejects every valid token as "invalid
   audience". Fix: add Reloader annotations so the Pod rolls when the
   ConfigMap (and the handover-jwt-public Secret) appears.

2. catalyst-api-server SA missing user-level realm-management role mappings
   bp-keycloak realm import granted roles via clientScopeMappings — wrong
   level. The actual service-account user had no clientRoles entry, so KC
   rejected GET /users with 403 when catalyst-api tried to ensure the
   operator user during handover. Fix: add explicit "users" array binding
   service-account-catalyst-api-server to realm-management.{impersonation,
   manage-users, view-users, query-users}.
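
The race in item 1 can be illustrated with a toy verifier that reads its expected audience once at start-up. This assumes the audience is compared directly to SOVEREIGN_FQDN, which is a guess about the handler; the class name and hostname are hypothetical.

```python
from typing import Dict


class HandoverVerifier:
    def __init__(self, env: Dict[str, str]) -> None:
        # Read once at start-up, like a container env var resolved from
        # an optional ConfigMap key.
        self.expected_audience = env.get("SOVEREIGN_FQDN", "")

    def audience_ok(self, token_audience: str) -> bool:
        return self.expected_audience != "" and token_audience == self.expected_audience


env: Dict[str, str] = {}
early = HandoverVerifier(env)                  # Pod started before the ConfigMap existed
env["SOVEREIGN_FQDN"] = "console.example.test"
late = HandoverVerifier(env)                   # Pod restarted after it appeared

print(early.audience_ok("console.example.test"))  # False
print(late.audience_ok("console.example.test"))   # True
```

The early instance never recovers on its own, which is why a restart trigger is needed.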

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-04 01:37:36 +04:00
e3mrah
c5ffaa2fd7
fix(bp-external-dns): livenessProbe.initialDelaySeconds=180 for cold-cluster cache-sync (closes #700) (#707)
PR #679 added --request-timeout=120s but external-dns has TWO timeouts:
RequestTimeout (per-API-call, controlled by --request-timeout) and
WaitForCacheSync (initial informer sync, hardcoded 60s in upstream binary,
NOT exposed as a flag). On a fresh Sovereign with k3s apiserver
CPU-saturated, the cache sync misses 60s -> fatal: failed to sync
*v1.Node: context deadline exceeded -> CrashLoopBackOff 5-10 times.
Caught live on otech49+ (2026-05-03), 5 restarts before stable.

Bump livenessProbe.initialDelaySeconds from upstream 10s default to 180s
so kubelet does NOT restart the Pod while the initial cache sync runs
against a CPU-saturated freshly-provisioned k3s apiserver. The Sovereign
apiserver reaches steady-state within ~2 min so 3 min comfortably covers
cold starts. Also bumps periodSeconds=30 + failureThreshold=3 so a
genuinely-hung pod is still killed within ~90s once steady-state.
readinessProbe gets a corresponding initialDelaySeconds=30 so endpoint
flapping during sync doesn't churn services.
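
The timing above works out roughly as follows. This is an approximation that ignores timeoutSeconds and kubelet jitter; the function names are illustrative.

```python
def steady_state_kill_window(period_s: int, failure_threshold: int) -> int:
    """Approximate time for a hung Pod to accumulate enough failed probes."""
    return period_s * failure_threshold


def earliest_cold_start_restart(initial_delay_s: int, period_s: int,
                                failure_threshold: int) -> int:
    """First probe fires after the initial delay; the rest follow one period apart."""
    return initial_delay_s + period_s * (failure_threshold - 1)


print(steady_state_kill_window(30, 3))          # 90
print(earliest_cold_start_restart(180, 30, 3))  # 240
```

With these values kubelet cannot restart the Pod before about 4 minutes after container start, comfortably past the ~2 min the apiserver needs to settle.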

Helm overrides REPLACE whole maps (not merge), so the override preserves
the upstream httpGet.path: /healthz + port: http shape verbatim.

Bumps:
- platform/external-dns/chart/Chart.yaml: 1.1.5 -> 1.1.6
- clusters/_template/bootstrap-kit/12-external-dns.yaml: HelmRelease pin 1.1.5 -> 1.1.6

Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-03 23:39:36 +04:00
e3mrah
7ca9541ef9
fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup) (#691)
* fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup)

Sovereign-side catalyst-api needs Keycloak service-account credentials
to provision the operator's user during /auth/handover. Today the chart
references K8s Secret `catalyst-kc-sa-credentials` with keys addr/realm/
client-id/client-secret in the catalyst-system namespace — but no
zero-touch path materialised it. The dead SealedSecret template at
09a-keycloak-catalyst-api-secret.yaml had a different name AND different
keys (CATALYST_KC_*), used PLACEHOLDER_SEALED_VALUE markers no
provisioner replaced, and wasn't even listed in the bootstrap-kit
kustomization.

Symptom on otech48: GET /auth/handover?token=<valid-jwt> returns
"server misconfiguration: keycloak not configured"
(auth_handover.go:169).

Fix: bp-keycloak chart's configmap-sovereign-realm.yaml template now
emits the realm-import ConfigMap AND the catalyst-kc-sa-credentials
Secret in a single template scope so they share the same generated
client secret. Pattern mirrors platform/powerdns/chart/templates/
api-credentials-secret.yaml (canonical seam, ADR-0001 §11.3
anti-duplication).

Secret-value resolution order (first match wins):
  1. operator-supplied .Values.catalystApiServerClientSecret
  2. helm `lookup` of existing Secret in keycloak ns (idempotent)
  3. fresh randAlphaNum 32 (zero-touch on first install)
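
The first-match-wins order can be sketched in a few lines. This is a Python analogue for illustration, not the Helm template itself; the function and parameter names are hypothetical.

```python
import secrets
import string
from typing import Optional

ALPHANUM = string.ascii_letters + string.digits


def resolve_client_secret(operator_value: Optional[str],
                          existing_value: Optional[str],
                          length: int = 32) -> str:
    """Pick the client secret: operator override, then existing Secret, then random."""
    if operator_value:
        return operator_value
    if existing_value:
        # Reusing the stored value keeps upgrades idempotent.
        return existing_value
    return "".join(secrets.choice(ALPHANUM) for _ in range(length))


print(resolve_client_secret("from-values", "from-lookup"))  # from-values
print(resolve_client_secret(None, "from-lookup"))           # from-lookup
print(len(resolve_client_secret(None, None)))               # 32
```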

The Secret carries the four keys exactly as the catalyst-api Pod's
secretKeyRef expects — addr / realm / client-id / client-secret —
with addr derived from gateway.host (https://auth.<sovereignFQDN>).
Reflector annotations auto-mirror the Secret to catalyst-system as
soon as that namespace materialises (bootstrap-kit slot 13).

The realm import already creates the catalyst-api-server client with
serviceAccountsEnabled + impersonation/manage-users/view-users/
query-users role mappings — so once Keycloak is Ready and the realm
imports, the SA is fully provisioned and the K8s Secret carries a
matching client secret. No post-install Job, no Admin-API script,
no out-of-band SealedSecret ceremony.

Cleanup: removes the dead 09a SealedSecret template (not in
kustomization, never produced a working Secret).

Bumps:
  - bp-keycloak chart 1.3.0 -> 1.3.1
  - clusters/_template/bootstrap-kit/09-keycloak.yaml HelmRelease
    pin 1.3.0 -> 1.3.1

Existing per-Sovereign overlays (clusters/otech.omani.works/,
clusters/omantel.omani.works/) intentionally remain on 1.3.0 — fresh
otechN provisioning consumes _template at provision time.

Will be verified live on otech49 — handover end-to-end without ANY
manual Secret creation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(keycloak): bump blueprint.yaml spec.version to match chart 1.3.1

TestBootstrapKit_BlueprintCardsHaveRequiredFields/keycloak asserts
Chart.yaml.version == blueprint.yaml.spec.version. Forgot to bump
blueprint.yaml in the previous commit.

Note: 8 other blueprints (cert-manager, flux, crossplane, sealed-secrets,
spire, nats-jetstream, openbao, gitea) carry the same pre-existing
mismatch and the test fails on main too. Out of scope for this PR;
fixing the keycloak case to keep the new chart version internally
consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 19:50:06 +04:00
e3mrah
684759564e
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget

cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns
even with all of:

- gatewayAPI.hostNetwork.enabled=true on the Cilium chart
- securityContext.privileged=true on the cilium-envoy DaemonSet
- securityContext.capabilities.add=[NET_BIND_SERVICE]
- envoy-keep-cap-netbindservice=true in cilium-config ConfigMap
- Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema)

Repeatable error from cilium-envoy logs across otech45, otech46, otech47:

  listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed
  to bind or apply socket options: cannot bind '0.0.0.0:80':
  Permission denied

The bind() syscall is intercepted by cilium-agent's BPF socket-LB
program in a way that does not honour container capabilities. Even
PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets
"Permission denied". Upgrading Cilium 1.16.5 → 1.19.3 made no difference
(F1, PR #684 still ships — the version bump is sound for other
reasons; the listener bind is just a separate fix).

This commit moves the listeners to high ports (30080/30443) and lets
the Hetzner LB do the public-facing port translation:

  HCLB :80   → CP node :30080  (cilium-gateway HTTP listener)
  HCLB :443  → CP node :30443  (cilium-gateway HTTPS listener)

External users still hit `https://console.<sov>.omani.works/auth/handover`
on port 443; the high port is invisible. High-port bind succeeds
without NET_BIND_SERVICE because the kernel only gates ports below
`net.ipv4.ip_unprivileged_port_start` (default 1024).
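
The port translation and the privileged-port rule can be summarised in a short sketch. It assumes the kernel default of 1024 for the sysctl named above.

```python
UNPRIVILEGED_PORT_START = 1024  # default net.ipv4.ip_unprivileged_port_start

# Public LB port -> listener port on the CP node, as described above.
LB_TO_NODE = {80: 30080, 443: 30443}


def needs_net_bind_service(port: int,
                           unprivileged_start: int = UNPRIVILEGED_PORT_START) -> bool:
    """Ports below the threshold require CAP_NET_BIND_SERVICE to bind."""
    return port < unprivileged_start


for public_port, node_port in LB_TO_NODE.items():
    print(public_port, node_port,
          needs_net_bind_service(public_port),
          needs_net_bind_service(node_port))
```

Both node ports fall above the threshold, so the bind no longer depends on any capability.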

Will be verified on otech48: the next fresh provision should serve
console.otech48/auth/handover end-to-end without the 502/timeout
chain seen on otech45–47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager

PR #681 followup. The new bp-cert-manager-powerdns-webhook (PR #681)
calls contabo's authoritative PowerDNS at pdns.openova.io to write
DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook
needs an X-API-Key Secret in the Sovereign's cert-manager namespace —
PR #681 didn't ship the materialization seam, so on otech43..otech47
the Secret was missing and the wildcard cert never issued.

This commit closes the seam from contabo to the Sovereign:

1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on
   openova-system/powerdns-api-credentials extended from "external-dns"
   to "external-dns,catalyst" so contabo catalyst-api can mount the
   API key.

2. bp-powerdns: api.basicAuth.enabled flips default true to false.
   Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that
   blocked machine-to-machine API access from Sovereigns. The X-API-Key
   contract is unchanged.

3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds
   CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key
   secret (optional=true so Sovereign-side catalyst-api Pods that don't
   reflect this still start clean).

4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field
   reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every
   Request before Validate(). Forwards as tofu var powerdns_api_key.

5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive,
   default "").

6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct
   dynadot-api-credentials Secret block (PR #681 dropped
   bp-cert-manager-dynadot-webhook) with a new
   cert-manager/powerdns-api-credentials Secret block. runcmd applies
   it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook.

End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token.

Will be verified live on otech48 (next provision after this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 18:23:27 +04:00
e3mrah
52b87afa9e
fix(bp-cilium): upgrade upstream cilium 1.16.5 → 1.19.3 (1.2.0) (#684)
1.16.x gateway-api hostNetwork mode is buggy on Sovereigns: cilium-envoy
NACKs listeners with "cannot bind '0.0.0.0:80': Permission denied" and
the loaded RDS for the Sovereign vhost only carries the default `/` route
to catalyst-ui — `/auth/*` and `/api/*` HTTPRoute matches defined in CEC
never reach envoy's live config. Result: console.<sov>/auth/handover?token=…
serves the React shell instead of the catalyst-api Go handler, defeating
the Phase-8b seamless handover. Caught live on otech46.

1.18+ ships the Gateway API implementation graduated from beta with the
hostNetwork bind path fixed; 1.19 is the current stable line (1.19.3).
Values shape verified backward-compatible across the keys we set:
gatewayAPI.hostNetwork.enabled, envoy.enabled, envoyConfig.enabled,
encryption.type=wireguard, encryption.nodeEncryption — all unchanged
between 1.16 and 1.19.

Bumps:
  - bp-cilium chart 1.1.5 → 1.2.0 (minor — major upstream version jump)
  - upstream cilium subchart 1.16.5 → 1.19.3
  - blueprint.yaml spec.version 1.1.3 → 1.2.0 (was already drifted from
    Chart.yaml; brings them back in sync per manifest-validation gate)
  - clusters/_template/bootstrap-kit/01-cilium.yaml HelmRelease pin
    1.1.5 → 1.2.0

Per-cluster overlays under clusters/<sovereign>/bootstrap-kit/ keep
their pinned versions until the operator opts in — fresh otechN
provisions render from _template/ and pick up 1.2.0 on first boot.

Will be verified live on the next fresh Sovereign provision (otech47+).

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:20:54 +04:00
e3mrah
2b60e944e2
fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook (#681)
* fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook

Caught live on otech43-46: cert-manager DNS-01 challenges for
*.otechN.omani.works failed because the Sovereign-side webhook wrote
challenge TXT records to the Sovereign's local PowerDNS. omani.works is
delegated from Dynadot to ns1/2/3.openova.io which run on contabo's
central PowerDNS — the Sovereign's local PowerDNS is INVISIBLE on the
public DNS chain until pool-domain-manager seals the per-Sovereign NS
delegation. Let's Encrypt resolvers walk the public chain, query
contabo, get NXDOMAIN, the cert never issues. Manual workaround was
seeding challenge TXT directly in contabo PowerDNS.

This PR automates the right write path:

- bp-cert-manager-powerdns-webhook chart bumped to 1.0.4. Default
  powerdns.host flips from "" (skip-render) to https://pdns.openova.io
  (contabo's public PowerDNS API ingress, authoritative for omani.works).
- ClusterIssuer letsencrypt-dns01-prod-powerdns now usable with no
  per-cluster powerdns.host override for the omani.works pool.
  apiKeySecretRef.namespace clarified — upstream ignores it; the Secret
  must live in cert-manager namespace (= ChallengeRequest.ResourceNamespace
  for ClusterIssuers).
- bootstrap-kit slot 49 updated: drops bp-powerdns dependsOn (webhook
  calls out-of-cluster contabo, not local PowerDNS), bumps chart version,
  removes inline powerdns.host override (defaults are correct).
- bootstrap-kit slot 49b (bp-cert-manager-dynadot-webhook) DELETED
  entirely — Dynadot is NOT the API-level authority for omani.works
  subdomains, the dynadot webhook silently fails the same way the
  Sovereign-local powerdns one did.
- clusters/_template/sovereign-tls/cilium-gateway-cert.yaml flips
  issuerRef from letsencrypt-dns01-prod (was dynadot-backed) to
  letsencrypt-dns01-prod-powerdns (the new contabo-backed issuer).
- bp-cert-manager chart: certManager.issuers.dns01.enabled defaults to
  false (deprecated dynadot path). letsencrypt-http01-prod retained for
  per-host certs. Cluster overlays MAY flip dns01.enabled=true for
  non-omani.works pools where Dynadot IS the API-level authority.
- scripts/expected-bootstrap-deps.yaml: drops slot 49b, drops bp-powerdns
  edge from slot 49.
- Documentation (README + blueprint.yaml + Chart.yaml description)
  rewritten to reflect contabo retarget and lifecycle reasoning.

Credential plumbing (out of scope here, must be done in cloud-init):
- Every Sovereign needs a `powerdns-api-credentials` Secret in the
  `cert-manager` namespace whose `api-key` value matches contabo's
  PowerDNS API key. Same seeding pattern as `dynadot-api-credentials`
  in infra/hetzner/cloudinit-control-plane.tftpl.

Caveat — basicAuth on contabo's PowerDNS API ingress: contabo currently
fronts pdns.openova.io with Traefik basicAuth (per
clusters/contabo-mkt/apps/powerdns/helmrelease.yaml). The upstream
zachomedia/cert-manager-webhook-pdns binary supports the X-API-Key
header but not HTTP Basic Auth out of the box. To make this end-to-end
green, contabo's basicAuth requirement must be relaxed (X-API-Key alone
provides the auth posture, and contabo's API endpoint is restricted to
operator IPs by other means) OR the Sovereign's webhook needs an
Authorization header injected via the chart's powerdns.headers map
(plaintext password in the ClusterIssuer config — not ideal). This PR
ships the chart side; the basicAuth question is a follow-up on the
contabo side.

Verified locally:
- helm lint platform/cert-manager-powerdns-webhook/chart -> PASS
- helm template platform/cert-manager-powerdns-webhook/chart -> renders
- helm template ... --set clusterIssuer.enabled=true -> renders the
  ClusterIssuer with host="https://pdns.openova.io" + correct apiKey
  Secret reference.
- helm template platform/cert-manager/chart -> renders ONLY
  letsencrypt-http01-prod (the dns01 dynadot issuer correctly gated off).
- scripts/check-bootstrap-deps.sh: net-zero new drift; my branch reduces
  pre-existing errors from 3 to 2 (the dropped slot 49b removed the only
  drift my branch was responsible for).

Closes follow-up to #373. Precondition for the handover URL's TLS going
green on the otech43-46 lineage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): repair YAML structure in expected-bootstrap-deps.yaml

Two pre-existing drifts were blocking dependency-graph-audit CI:

1. Slot 5a (bp-reflector) was missing its closing list separator,
   causing yq to merge the bp-nats-jetstream entry into the bp-reflector
   map and effectively drop bp-reflector from the expected DAG.
   Added explicit `- slot: 7` for bp-nats-jetstream and quoted "5a" so
   yq treats it as a string slot (matches the convention with "49b").

2. bp-powerdns slot 11: actual bootstrap-kit declares dependsOn
   bp-cnpg (live since otech28 — pdns-pg-app secret race) but the
   expected DAG was missing this edge.

This unblocks merging fix/cert-manager-powerdns-webhook-contabo (PR
above) — these drifts existed on main but weren't surfaced until the
last expected-deps edit forced a re-run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:12:48 +04:00
e3mrah
a50ef0ece0
fix(bp-external-dns): --request-timeout=120s for cold-cluster initial sync (1.1.5) (#679)
Caught live on otech43–46: external-dns crashloops 10+ times on a fresh
Sovereign before the initial *v1.Pod sync completes. The default 30s
timeout is insufficient when the k3s apiserver is CPU-saturated.

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 16:50:37 +04:00
e3mrah
dd4148acb6
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on
otech45: TCP to LB:443 succeeds, but TLS handshake never completes.
Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match
what cilium-envoy actually listens on (verified via /proc/net/tcp on
the cilium-envoy pod — port 12869 not in listening sockets). The
nodePort indirection (31443→envoy:12869) is broken at the redirect
step.

Fix: bind cilium-envoy directly to the host's :80 and :443 via
gatewayAPI.hostNetwork.enabled=true. Hetzner LB forwards public
80→private:80 and 443→private:443 directly (no nodePort indirection).

Two coordinated changes:
  1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
  2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)

bp-cilium chart bumped to 1.1.5.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 15:22:51 +04:00
e3mrah
1bd2ab1951
fix(bp-gitea): use explicit labels in sync-job template (chart 1.2.3 retry) (#670)
Previous attempt referenced 'bp-gitea.labels' helper which doesn't
exist in this chart (bp-gitea has no _helpers.tpl, unlike bp-harbor).
Blueprint Release workflow's helm-template gate caught it:
  template: bp-gitea/templates/database-secret-sync-job.yaml:53:8:
    error calling include: template: no template 'bp-gitea.labels'
    associated with template 'gotpl'

Fix: replace the 4 occurrences of 'include bp-gitea.labels' with
explicit catalyst.openova.io/blueprint + component labels. Same
shape, no helper dependency.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 14:37:24 +04:00
e3mrah
9eff5530cd
fix(bp-gitea): replace Reflector with database-secret-sync-job (chart 1.2.3) (#668)
Same root cause + same fix as bp-harbor (PR #557). The Reflector-based
'gitea-database-secret reflects gitea-pg-app' pattern races with CNPG:
Reflector logs once at install time that the source doesn't exist
('Could not update gitea/gitea-database-secret — Source gitea-pg-app
not found') and never retries. The destination stays empty (password
"") and gitea init container crashloops with 'pq: password
authentication failed for user gitea' — caught live on otech43,
manually patched at the time but no chart fix shipped, so otech45
hit the exact same failure (founder caught it in k9s).

Fix: replicate bp-harbor's sync-job pattern verbatim.
  - post-install,post-upgrade Helm hook (weight 5)
  - curlimages/curl image talking to in-cluster apiserver
  - Polls until gitea-pg-app exists, reads .data.password,
    PATCHes gitea-database-secret with the password key
  - Hook-delete-policy: before-hook-creation,hook-succeeded
  - Idempotent on re-run; CNPG never rotates without operator action
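
Illustrative sketch of the poll-then-patch flow listed above, in Python
rather than the Job's actual curl/sh script. The get_secret and
patch_secret callables, the attempt count, and the delay are
hypothetical stand-ins for the apiserver calls; only the Secret names
and the 'password' key come from this commit.

```python
import time


def sync_password(get_secret, patch_secret, attempts=60, delay_s=2.0):
    """Poll for the source Secret, then copy its password key.

    get_secret(name) returns a dict or None; patch_secret(name, data)
    applies a merge patch. Both are hypothetical stand-ins for the
    apiserver calls the Job makes with curl.
    """
    for _ in range(attempts):
        source = get_secret("gitea-pg-app")
        if source is not None:
            # Secret data is already base64-encoded, so it is copied as-is.
            patch_secret("gitea-database-secret",
                         {"password": source["data"]["password"]})
            return True
        time.sleep(delay_s)
    return False
```

Because the patch always writes the same key from the same source,
re-running the function after a successful sync leaves the destination
unchanged, which is the idempotency property the hook relies on.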

Drops the HARBOR_DATABASE_PASSWORD alias (gitea binds the
'password' key directly via secretKeyRef in values.yaml).

The existing pre-install database-secret.yaml placeholder stays so
the Secret is Found at install time (some tooling assumes presence
for the Pod's lifetime).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 14:24:41 +04:00
e3mrah
a8bcb773c9
fix(bp-openbao): add BAO_TOKEN+NAMESPACE env to auth-bootstrap (chart 1.2.14) (#666)
PR #663 added the revoke logic at the bottom of the script but the
companion env-block additions (BAO_TOKEN sourced from openbao-root-token
Secret, NAMESPACE from fieldRef) somehow never landed in the merged
diff — only the trailing revoke + DELETE block did.

Result on otech44: openbao-root-token Secret IS being created by
init-job (PR #663's other half worked), but auth-bootstrap pod env
ends at TOKEN_MAX_TTL with no BAO_TOKEN, so 'bao auth enable kubernetes'
hits 403 Forbidden again — the exact same failure that PR #663 was
supposed to fix.

This PR adds the missing env declarations.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 14:02:34 +04:00
e3mrah
74921e30f1
fix(architecture): drop bp-spire, Cilium WireGuard is the canonical east-west mesh (#665)
Founder direction 2026-05-03: with 100% Cilium mesh enforcement +
Envoy where required, bp-spire is redundant for the minimal Sovereign
MVP.

Reasoning:
- Cilium 1.13+ has built-in mutual auth using SPIFFE, but it ships
  with its own embedded SPIRE server managed by the Cilium operator.
  External bp-spire is not needed for east-west mTLS.
- Our ESO→OpenBao auth uses the K8s ServiceAccount auth method
  (TokenReview against kube-apiserver), not JWT-SVID.
- WireGuard transparent encryption (already enabled in cilium values)
  encrypts every pod-to-pod connection at the kernel transport layer.
- Cross-Sovereign federation and per-workload-fingerprint attestation
  are not blocking handover; they can be re-introduced as an opt-in
  blueprint when needed.

Changes:
- Delete clusters/_template/bootstrap-kit/06-spire.yaml
- Remove bp-spire from kustomization.yaml + expected-bootstrap-deps.yaml
- Remove bp-spire dependsOn from 07-nats-jetstream.yaml + 08-openbao.yaml
- bp-cilium 1.1.4: add encryption.nodeEncryption=true so node-to-node
  traffic (not just pod-to-pod) is also WireGuard-encrypted; document
  in values.yaml comment that WireGuard is the canonical east-west
  mTLS layer.

Removes 4 pods (spire-server, spire-agent, spire-spiffe-csi-driver,
spire-spiffe-oidc-discovery-provider) from every Sovereign and the
recurring CSI mount race that was getting stuck on otech43.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 13:56:36 +04:00
e3mrah
be6e610093
fix: drop bp-langfuse from minimal + bp-mimir 1.0.2 push_grpc fix (#664)
* fix: drop bp-langfuse from minimal bootstrap-kit + bp-mimir push_grpc fix

Two independent fixes packaged together:

1. **Drop bp-langfuse** from the SOLO minimal bootstrap-kit. Per
   founder direction: langfuse is LLM-specific (prompt/completion
   tracing for AI plane), not platform infrastructure, and belongs
   to a future 'AI Add-On' template. Its CreateContainerConfigError
   on every Sovereign provision (missing langfuse-secrets pre-install)
   was eating Phase-1 reconciliation budget without contributing to
   handover-ready state. Removed:
   - clusters/_template/bootstrap-kit/26-langfuse.yaml
   - kustomization.yaml entry
   - scripts/expected-bootstrap-deps.yaml slot 26 entry

2. **bp-mimir 1.0.2** — re-enable ingester.push_grpc_method_enabled.
   Upstream mimir-distributed 6.0.6 disables Push gRPC when
   ingest-storage is off, but classic-mode ingester REQUIRES it.
   The combo crashloops with 'cannot disable Push gRPC method in
   ingester, while ingest storage (-ingest-storage.enabled) is not
   enabled'. Caught live on otech43 with 17 restarts.

Both issues block Phase-1 ready=40/40 from being a clean signal.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-mimir): chart 1.0.2 push_grpc_method_enabled + finalize langfuse drop

Follow-up to previous commit which only captured the file deletion.
This commit applies: bp-mimir 1.0.2 chart bump, kustomization +
expected-deps removal of langfuse, bootstrap-kit version bumps.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 13:50:38 +04:00
e3mrah
561439b6c2
fix(bp-openbao): wire root_token init→auth-bootstrap (chart 1.2.13) (#663)
Caught live on otech43 after chart 1.2.12 fixed the persist gap and
auth-bootstrap finally ran: 'Error enabling kubernetes auth ... Code: 403
permission denied'. The auth-bootstrap Job had no BAO_TOKEN and was
making unauthenticated bao API calls.

Three coordinated changes:

1. init-job.yaml: after bao operator init succeeds and ROOT_TOKEN is
   extracted, POST a transient Secret openbao-root-token with the
   token in data.token. Already-exists (409) is treated as
   idempotent-re-run, anything else fails the Job loud (was silent
   before, hid the bug).

2. auth-bootstrap-job.yaml: BAO_TOKEN env sourced via secretKeyRef
   from openbao-root-token. After running auth enable / secrets enable
   / policy write / role bind, revoke the token via 'bao token revoke
   -self' AND attempt DELETE on the Secret. (busybox wget --method=DELETE
   may silently no-op; the bao-side revoke is the load-bearing
   acceptance-criterion-6 mechanism.)

3. auto-unseal-rbac.yaml: openbao-root-token added to the mutation
   rule's resourceNames so the SA can GET/PATCH/UPDATE/DELETE it.
   Create is already unrestricted from chart 1.2.10's RBAC split.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 12:55:13 +04:00
e3mrah
be9b5ca5bf
fix(bp-openbao): wc -l counts 0 for single-key without trailing newline (1.2.12) — TRUE root cause (#662)
Caught live on otech42 with chart 1.2.11's per-pod logs:
  + bao operator init -key-shares=1 -key-threshold=1 -format=json
  [openbao-init] FATAL: extracted 0 unseal key(s) but threshold=1

key-shares=1 → no comma → tr ',' '\n' is no-op → final sed produces
single line WITHOUT trailing newline → wc -l counts 0. Every prior
loop attributed to RBAC/wget was a downstream symptom.

Fix: append 'awk 1' for trailing newline, swap wc -l for grep -c .
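
A minimal Python model of the counting bug, assuming only that wc -l
counts newline characters and grep -c . counts lines with at least one
character. The key value is a made-up placeholder, not real init
output.

```python
def wc_l(text: str) -> int:
    """Mimic `wc -l`: count newline characters, not lines of text."""
    return text.count("\n")


def grep_c_dot(text: str) -> int:
    """Mimic `grep -c .`: count lines holding at least one character."""
    return sum(1 for line in text.split("\n") if line)


# Hypothetical single unseal key, emitted without a trailing newline.
single_key = "c2FtcGxlLWtleQ=="

print(wc_l(single_key))         # 0: no newline character to count
print(grep_c_dot(single_key))   # 1: one non-empty line
print(wc_l(single_key + "\n"))  # 1: what the appended 'awk 1' restores
```

Either half of the fix is enough for the single-key case; applying both
keeps the count correct whether or not an upstream stage strips the
final newline.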

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 12:28:50 +04:00
e3mrah
7bd9aae89b
diag(bp-openbao): restartPolicy: Never (chart 1.2.11) — preserve fresh-init pod logs (#661)
OnFailure restarts the SAME container in the SAME pod, and only the
MOST RECENT failed container's logs are kubectl-loggable. The first
attempt's logs (where the FRESH path runs and the persist gap lives)
are reaped before later restarts can be inspected.

Switching to Never makes each retry a separate Pod via Job's
backoffLimit replay. Every failed pod is independently inspectable
with kubectl logs <pod> until ttlSecondsAfterFinished tears it down.
Combined with chart 1.2.9's openbao-init-trace Secret upload (POST
now succeeds with 1.2.10's RBAC split), the fresh-path failure point
becomes definitively observable.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 12:13:23 +04:00
e3mrah
b5fee168b5
fix(bp-openbao): split RBAC for create verb (chart 1.2.10) — root cause of unseal-keys never persisted (#660)
The openbao-auto-unseal Role granted 'create' on Secrets with
resourceNames set. Kubernetes RBAC cannot restrict the create verb by
resourceNames (the object name is not known at authorization time, so
there's nothing to match against), so the kube-apiserver REJECTS the
request: a verbs[create]+resourceNames rule never matches the bare
'create secrets' permission check. Result: every init Job POST
returned 403 Forbidden.

The script then fell through to the PUT branch, which silently failed
because BusyBox wget (the openbao image's only HTTP client) has no
--method flag. Both calls non-zero → script exited 1 with FATAL
'cannot persist'. The first init's logs got reaped before later
restarts could be inspected, so the FATAL was never visible — the
retries all hit the idempotent FATAL ('vault is sealed but the
unseal-keys Secret is missing') with no record of why.

Caught live on otech40 with chart 1.2.9's trace upload + a wget
auth-can-i probe:
  kubectl auth can-i create secrets --as=...openbao-auto-unseal → no
  kubectl auth can-i create secret/openbao-unseal-keys ... → yes

Fix: split into two rules per the k8s RBAC pattern.
  rule 1: verbs[create] WITHOUT resourceNames (allows POST)
  rule 2: verbs[get,patch,update,delete] WITH resourceNames
          (mutation stays scoped to known names)
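
The effect of the split can be sketched with a deliberately simplified
rule matcher in Python. This is a toy model, not the Kubernetes
authorizer: it assumes a single resource type (Secrets), treats an
empty resource_names set as "any name", and models a collection POST
as a request with an empty name. The Rule and allowed names are
hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    verbs: set
    resource_names: set = field(default_factory=set)  # empty = any name


def allowed(rules, verb, name=""):
    """Simplified RBAC match; a POST to the collection carries no name."""
    for rule in rules:
        if verb not in rule.verbs:
            continue
        if not rule.resource_names or name in rule.resource_names:
            return True
    return False


NAMES = {"openbao-unseal-keys"}

# Before: one rule with create and resourceNames together.
combined = [Rule({"create", "get", "patch", "update", "delete"}, NAMES)]

# After: create is unscoped, mutation verbs stay scoped to known names.
split = [
    Rule({"create"}),
    Rule({"get", "patch", "update", "delete"}, NAMES),
]

print(allowed(combined, "create"))                     # False
print(allowed(split, "create"))                        # True
print(allowed(split, "patch", "openbao-unseal-keys"))  # True
print(allowed(split, "delete", "some-other-secret"))   # False
```

The trade-off is visible in the model: the Job SA may now create any
Secret in the namespace, while reads and mutations remain limited to
the listed names.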

This unblocks every fresh Sovereign provisioning. Each subsequent run
hits the idempotent path (GET on openbao-unseal-keys → 200) and
unseals automatically — no operator intervention.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:55:05 +04:00
e3mrah
09e56f1e47
diag(bp-openbao): persist init script trace to Secret across restarts (1.2.9) (#659)
otech38/39 confirmed: openbao reaches Initialized=true on the first
init pod attempt but the unseal-keys Secret is never persisted. The
fresh-init container's logs are reaped before they can be inspected
(subsequent restarts only show the idempotent FATAL), so we keep
flying blind on the actual failure point.

This change tees every line of the init script (set -x trace + every
echo) into /tmp/.script.trace and uploads it to a per-namespace
Secret 'openbao-init-trace' on EXIT (success OR failure). The Secret
survives Pod recreation and any Job retry; the operator can read it
with kubectl after the next provision and see exactly where the
fresh-path script exited.

Adds 'openbao-init-trace' to the openbao-auto-unseal Role's
resourceNames so the Job SA can PUT/POST it.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:38:54 +04:00
e3mrah
5f6d1c7d86
diag(bp-openbao): add set -x to init script (chart 1.2.8) (#658)
otech37/38 hit the same wall: server reaches Initialized=true but
openbao-unseal-keys Secret is never persisted; the FIRST init pod's
logs that ran fresh init are reaped by container restart before we
can capture what happened.

Add 'set -x' to shell-trace every command. Now even if the script
crashes mid-run, pod logs show the last command attempted. The
captured diagnostic on the next provision will tell us whether the
failure is in /tmp/init-output.json parsing, the persist wget, or
elsewhere.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:09:05 +04:00
e3mrah
8447930bf7
fix(bp-openbao): fail-fast on unseal-keys persist (chart 1.2.7) (#657)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).
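
A rough Python sketch of the derivation in step 1, assuming the
HelmRelease manifests have already been parsed into dicts. The real
generator is a shell script; blueprint_deps, strip_bp, and the sorted
output order are assumptions made for illustration. The only Flux
field relied on is spec.dependsOn[].name.

```python
import json


def strip_bp(name: str) -> str:
    """Drop the 'bp-' prefix so ids match the wizard's bare component ids."""
    return name[3:] if name.startswith("bp-") else name


def blueprint_deps(helm_releases):
    """Map each component id to its Flux dependsOn list, prefix stripped."""
    deps = {}
    for hr in helm_releases:
        component = strip_bp(hr["metadata"]["name"])
        depends_on = hr.get("spec", {}).get("dependsOn") or []
        deps[component] = sorted(strip_bp(d["name"]) for d in depends_on)
    return deps


# Hand-written stand-ins for parsed bootstrap-kit HelmReleases.
releases = [
    {"metadata": {"name": "bp-keycloak"},
     "spec": {"dependsOn": [{"name": "bp-cert-manager"},
                            {"name": "bp-gateway-api"}]}},
    {"metadata": {"name": "bp-cert-manager"}, "spec": {}},
]

print(json.dumps(blueprint_deps(releases), indent=2, sort_keys=True))
```

Stripping the prefix on both the source and target side is what lets
the generated JSON be keyed the same way as componentGroups.ts.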

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
  keycloak: ['cert-manager', 'openbao']  ← FALSE; Flux says no openbao
This is why the founder kept seeing the spurious arrow on the Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
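
The guard can be pictured with a small Python sketch. The actual
change lives in Go (helmwatch_bridge.go); the status strings and the
next_status function here are hypothetical, and only the rule itself
(no regression to pending once an Execution id exists) comes from this
commit.

```python
PENDING, RUNNING, SUCCEEDED, FAILED = "pending", "running", "succeeded", "failed"


def next_status(mapped_status: str, active_exec_id: str) -> str:
    """Return the status to store for a job row.

    mapped_status is what the Helm state maps to; active_exec_id is
    non-empty once an Execution has been allocated for the component.
    """
    if mapped_status == PENDING and active_exec_id != "":
        # Flux is oscillating, but logs are streaming: report Running.
        return RUNNING
    return mapped_status


print(next_status(PENDING, ""))          # pending
print(next_status(PENDING, "exec-123"))  # running
print(next_status(FAILED, "exec-123"))   # failed
```

Terminal states still pass through untouched, so the guard only
suppresses the pending flip and does not hide real failures.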

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.
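
A simplified Python sketch of the sweep, not the Go code in
deriveTreeView. It assumes only pending rows are rewritten to failed
(an assumption made for this example), and the cascade_failed name and
job ids are illustrative. The 8-sweep cap mirrors the number quoted
above.

```python
def cascade_failed(status, depends_on, max_sweeps=8):
    """Propagate 'failed' to pending jobs whose dependencies failed.

    Works on a copy, so recomputing from a fresh snapshot naturally
    reverses the cascade once the upstream job recovers.
    """
    result = dict(status)
    for _ in range(max_sweeps):
        changed = False
        for job, deps in depends_on.items():
            blocked = any(result.get(d) == "failed" for d in deps)
            if result.get(job) == "pending" and blocked:
                result[job] = "failed"
                changed = True
        if not changed:
            break  # fixed point reached before the sweep cap
    return result


snapshot = {
    "openbao": "failed",
    "external-secrets": "pending",
    "external-secrets-stores": "pending",
}
graph = {
    "external-secrets": ["openbao"],
    "external-secrets-stores": ["external-secrets"],
}

print(cascade_failed(snapshot, graph))
```

Each sweep pushes the failure at least one edge further down the
chain, so the cap bounds the depth of chain that is fully covered.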

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
  failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches   = 1048576
  fs.inotify.max_queued_events  = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-openbao): fail-fast when unseal-keys persist fails (chart 1.2.7)

otech37 caught: bao operator init succeeded server-side
(Initialized=true), but the script's wget POST to persist the
openbao-unseal-keys Secret silently failed (|| true), and the PUT
fallback was silenced the same way. Subsequent Job retries hit Initialized=true
on the idempotent path, found no openbao-unseal-keys Secret, and
FATAL'd with 'manual recovery: wipe data-openbao-0 PVC' — every
retry forever.

Hardening:
  1. Capture POST + PUT stdout/stderr to /tmp files instead of
     /dev/null so the FATAL path can echo them.
  2. PUT no longer || true — if both POST and PUT fail, exit 1.
  3. Add read-back verification: GET the persisted Secret and
     assert 'unseal-keys-b64' field is present. Catches
     partial-write / eventual-consistency cases.

Bumps chart 1.2.6 -> 1.2.7 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 10:51:21 +04:00
e3mrah
6baf7e56e7
fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) (#651)
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:26:23 +04:00
e3mrah
d519dc8ba2
fix(bp-harbor): switch sync Job to curl-against-apiserver (chart 1.2.12) (#650)
rancher/kubectl is distroless (no /bin/sh) so the inline shell script
can't run. Replace with curlimages/curl which has alpine sh + curl.
Talk to k8s API directly via the in-pod ServiceAccount token. The
PATCH merges password + HARBOR_DATABASE_PASSWORD into the existing
pre-install-hook Secret without touching annotations.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:15:23 +04:00
e3mrah
08432b540e
fix(bp-harbor): switch sync Job to rancher/kubectl (chart 1.2.11) (#649)
bitnami/kubectl moved to sha256-only tags; bitnami/kubectl:1.31.4
returns 'not found' from Docker Hub. rancher/kubectl is always
available on k3s clusters. Bumps chart 1.2.10 -> 1.2.11.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:04:15 +04:00
e3mrah
de51fa3f7a
fix(bp-harbor): post-install Job copies CNPG password (chart 1.2.10) (#648)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)

CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal

Commit b1a25c42 (#600) removed the helm.sh/hook-delete-policy from the
auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install
(the old hook-succeeded clause caused the SA to disappear before the
init Job could mount its token). The chart-test still expected ≥5
before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs).

Result: Blueprint Release for #600 (run 25251129679) failed at the test
gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main
already references it. otech30 caught this live: bp-openbao HR stuck
with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'.

Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao
1.2.6 onto GHCR.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-harbor): replace Reflector race with deterministic post-install Job (chart 1.2.10)

bp-harbor's harbor-database-secret relied on Reflector copying from CNPG-
emitted harbor-pg-app via a 'reflects:' destination annotation. On every
fresh Sovereign Reflector logs once at install:
    Could not update harbor/harbor-database-secret —
    Source harbor/harbor-pg-app could not be found
and never refires when CNPG creates the source ~30s later. Even with
'auto-enabled: true' on the source's inheritedMetadata, Reflector's
auto-reflect copies the SOURCE name (harbor-pg-app), not the explicit
destination harbor-database-secret. Result: harbor-database-secret stays
empty forever; harbor-core CrashLoops with 'couldn't find key password
in Secret harbor/harbor-database-secret'. Caught live on otech26-30.

Replace with a Helm post-install/post-upgrade Job that:
  - polls for harbor-pg-app to exist (CNPG provisions it ~30-60s after
    Cluster Ready)
  - copies password into harbor-database-secret with both 'password'
    and 'HARBOR_DATABASE_PASSWORD' keys
  - exits 0; Helm marks the hook complete

The Job is idempotent (re-running on upgrade overwrites identically)
and deterministic (no event-watcher race). The placeholder Secret stays
in place so kubectl-get returns Found before the Job runs.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:52:54 +04:00
e3mrah
da61ecdc79
test(bp-openbao): align test expectation with #600 RBAC-hook removal (#647)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)

CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal

Commit b1a25c42 (#600) removed the helm.sh/hook-delete-policy from the
auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install
(the old hook-succeeded clause caused the SA to disappear before the
init Job could mount its token). The chart-test still expected ≥5
before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs).

Result: Blueprint Release for #600 (run 25251129679) failed at the test
gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main
already references it. otech30 caught this live: bp-openbao HR stuck
with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'.

Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao
1.2.6 onto GHCR.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:46:31 +04:00
e3mrah
a359278b7d
fix(bp-spire): disable oidc ClusterSPIFFEID + chart bump (1.1.7) (#645)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
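
A small sketch of the two URL shapes, in Python for illustration. The rewrite rule `^(.*)$` -> `proxy-dockerhub/$1` is an assumed example of the k3s `rewrite` form, not the exact rule committed in the template; the function names are hypothetical.

```python
import re

HARBOR = "https://harbor.openova.io"


def url_with_prefixed_endpoint(project: str, repo: str, ref: str) -> str:
    """Old shape: project in the endpoint path, /v2/ appended after it."""
    return f"{HARBOR}/{project}/v2/{repo}/manifests/{ref}"


def url_with_rewrite(pattern: str, replacement: str, repo: str, ref: str) -> str:
    """New shape: bare host as endpoint, image path rewritten before the request."""
    rewritten = re.sub(pattern, replacement, repo, count=1)
    return f"{HARBOR}/v2/{rewritten}/manifests/{ref}"


# Assumed rule: prefix every image path with the proxy-cache project name.
wrong = url_with_prefixed_endpoint("proxy-dockerhub", "rancher/mirrored-pause", "3.6")
right = url_with_rewrite(r"^(.*)$", r"proxy-dockerhub/\1", "rancher/mirrored-pause", "3.6")
print(wrong)
print(right)
```

The second URL is the one verified with curl above; the first is the shape that returned Harbor's UI HTML.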

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).
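
The doubled prefix follows directly from how the image reference is assembled. A minimal Python sketch, assuming the subchart joins registry and repository with a slash; `vpa-COMPONENT` is a placeholder for the elided component name and both functions are hypothetical.

```python
def render_image(registry: str, repository: str, tag: str) -> str:
    """Join the parts the way a '<registry>/<repository>:<tag>' template would."""
    return f"{registry}/{repository}:{tag}"


def strip_registry(repository: str, registry: str) -> str:
    """Drop a leading '<registry>/' from the repository value if present."""
    prefix = registry + "/"
    return repository[len(prefix):] if repository.startswith(prefix) else repository


override = "registry.k8s.io/autoscaling/vpa-COMPONENT"  # placeholder name
broken = render_image("registry.k8s.io", override, "1.5.0")
fixed = render_image("registry.k8s.io", strip_registry(override, "registry.k8s.io"), "1.5.0")
print(broken)
print(fixed)
```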

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.
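
A rough capacity check along these lines, in Python. The per-pod requests below are invented round numbers, not measurements from otech26, and the sketch treats the full vCPU count as allocatable, which overstates what a real node offers.

```python
def schedulable(requests_m: list, allocatable_m: int) -> tuple:
    """Greedy fit: count pods whose CPU request (millicores) still fits.

    Returns (scheduled, pending). Ignores memory and system reservations.
    """
    remaining = allocatable_m
    scheduled = 0
    for request in requests_m:
        if request <= remaining:
            remaining -= request
            scheduled += 1
    return scheduled, len(requests_m) - scheduled


# Hypothetical workload: 72 pods requesting 100m each (7200m in total).
pods = [100] * 72
print("4 vCPU node:", schedulable(pods, 4000))
print("8 vCPU node:", schedulable(pods, 8000))
```

Under these assumed numbers the 4 vCPU node runs out of CPU with a large share of pods still Pending, while the 8 vCPU node fits all of them.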

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-spire): Chart.lock missing spire-crds → CRDs never installed (chart 1.1.7)

bp-spire 1.1.4 added spire-crds 0.5.0 as a Helm dependency to register
the spire.spiffe.io/v1alpha1 CRDs (ClusterSPIFFEID, ClusterStaticEntry,
ClusterFederatedTrustDomain) before the spire subchart's controller-
manager Deployment starts. But Chart.lock was never regenerated — only
contained the original `spire` entry. As a result every Blueprint
Release packaged the chart WITHOUT spire-crds, the Sovereign saw no
CRDs registered, and Helm install failed with:

  no matches for kind "ClusterSPIFFEID" in version "spire.spiffe.io/v1alpha1"

bp-openbao / bp-external-secrets / bp-nats-jetstream all dependsOn
bp-spire so this single bug cascades and blocks 5+ HRs from reaching
Ready=True. Caught live during otech29.

Fix: ran `helm dependency update` to regenerate Chart.lock + pull both
spire and spire-crds tarballs; bumps bp-spire 1.1.6 -> 1.1.7 and
bootstrap-kit reference.
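
A drift check of this kind could catch the stale lock before packaging. The sketch below is Python and takes the dependency names as plain lists; reading them out of Chart.yaml and Chart.lock is left out, and `missing_from_lock` is a hypothetical helper, not part of the repository.

```python
def missing_from_lock(declared: list, locked: list) -> list:
    """Return dependency names declared in Chart.yaml but absent from Chart.lock."""
    return sorted(set(declared) - set(locked))


# Names taken from the description above: the lock only carried `spire`.
declared = ["spire", "spire-crds"]
locked = ["spire"]
print(missing_from_lock(declared, locked))
```

A non-empty result would be the signal to run `helm dependency update` before the chart is released.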

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:27:33 +04:00
e3mrah
8bb66fe43e
fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.
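
The cycle and its removal can be shown with a small dependency-graph sketch in Python. The node names come from the error above; `has_cycle` is a generic depth-first check written for illustration, not how tofu builds its graph.

```python
def has_cycle(graph: dict) -> bool:
    """Detect a cycle in a directed graph given as {node: [dependencies]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node) -> bool:
        color[node] = GREY  # on the current DFS path
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                return True  # back edge: dep is already on the path
            if state == WHITE and visit(dep):
                return True
        color[node] = BLACK  # fully explored
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


before = {
    "hcloud_server.control_plane": ["local.control_plane_cloud_init"],
    "local.control_plane_cloud_init": ["hcloud_server.control_plane"],
}
# After the fix the template no longer reads the server's ipv4_address.
after = {
    "hcloud_server.control_plane": ["local.control_plane_cloud_init"],
    "local.control_plane_cloud_init": [],
}
print(has_cycle(before), has_cycle(after))
```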

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).
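
The difference between a missing key and an empty value can be shown with Python's `string.Template`, used here only as an analogy for the strictness of `templatefile()`: a referenced key must be present in the vars map, while an empty string is accepted. The template text is a made-up one-liner.

```python
from string import Template

# Made-up template line; the real reference lives in the cloud-init template.
TEMPLATE = Template("handover key: ${handover_jwt_public_key}\n")


def render(template_vars: dict) -> str:
    """Render strictly: a referenced key that is missing raises KeyError."""
    return TEMPLATE.substitute(template_vars)


try:
    render({})  # key not passed at all: hard failure
except KeyError as missing:
    print("missing key:", missing)

# Key passed with the "" default: renders, with an empty value.
print(repr(render({"handover_jwt_public_key": ""})))
```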

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.
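
A minimal sketch of the two gates, in Python rather than the chi router code. The route tables, `authorize`, and the constant-time comparison are illustrative assumptions; the source only states that the PUT is verified against a SHA-256-hashed bearer token and the GET requires a session.

```python
import hashlib
import hmac

KUBECONFIG = "/api/v1/deployments/{id}/kubeconfig"
SESSION_ROUTES = {("GET", KUBECONFIG)}  # operator download, needs the cookie
TOKEN_ROUTES = {("PUT", KUBECONFIG)}    # cloud-init postback, bearer only


def token_hash(token: str) -> str:
    """SHA-256 hex digest of the bearer token."""
    return hashlib.sha256(token.encode()).hexdigest()


def authorize(method, route, *, has_session=False, bearer=None, stored_hash=""):
    """Return the HTTP status the auth layer would produce for a request."""
    if (method, route) in SESSION_ROUTES:
        return 200 if has_session else 401
    if (method, route) in TOKEN_ROUTES:
        ok = bearer is not None and hmac.compare_digest(
            token_hash(bearer), stored_hash
        )
        return 200 if ok else 401
    return 404
```

Example: `authorize("PUT", KUBECONFIG, bearer="t", stored_hash=token_hash("t"))` passes without any session, while the GET without a session is still rejected.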

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.
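
The server-stamped pattern can be sketched as follows, in Python for illustration. `build_tfvars` and the `region` field are hypothetical; the point is that a token supplied in the client payload is discarded and the value always comes from the server environment.

```python
import json

# Fields the server owns; any client-supplied value for them is dropped.
SERVER_STAMPED = {"harbor_robot_token"}


def build_tfvars(client_json: str, env: dict) -> str:
    """Build the tfvars JSON from a wizard payload plus server-side values."""
    payload = json.loads(client_json)
    tfvars = {k: v for k, v in payload.items() if k not in SERVER_STAMPED}
    # Empty string when the env var is unset, matching the "" default.
    tfvars["harbor_robot_token"] = env.get("CATALYST_HARBOR_ROBOT_TOKEN", "")
    return json.dumps(tfvars, sort_keys=True)


print(build_tfvars(
    '{"region": "fsn1", "harbor_robot_token": "from-client"}',
    {"CATALYST_HARBOR_ROBOT_TOKEN": "from-server"},
))
```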

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 00:16:34 +04:00
e3mrah
2e9cfd4a57
fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:52:42 +04:00
e3mrah
487ebebda2
fix(bp-vpa): drop registry.k8s.io/ prefix in repository (upstream prepends it) (#641)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.
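
The ordering problem can be illustrated with a small model. This is a Python sketch, not the catalyst-api Go code: `Request`, `validate`, and `create_deployment` are hypothetical stand-ins for the names in this message, and the `env` parameter exists only to make the sketch testable.

```python
import os
from dataclasses import dataclass

ENV_VAR = "CATALYST_HARBOR_ROBOT_TOKEN"


@dataclass
class Request:
    # Hypothetical stand-in for the Go Request type. The token is
    # server-stamped and never taken from the wizard payload.
    harbor_robot_token: str = ""


def validate(req: Request) -> None:
    # Rejects a request whose token has not been stamped yet.
    if not req.harbor_robot_token:
        raise ValueError(f"Harbor robot token is required ({ENV_VAR} missing)")


def create_deployment(req: Request, env=None) -> Request:
    env = os.environ if env is None else env
    # Stamp from the environment first, so validate() sees the token.
    req.harbor_robot_token = env.get(ENV_VAR, "")
    validate(req)
    return req
```

Swapping the two statements in `create_deployment` reproduces the rejection even when the variable is set, which is the shape of the bug described above.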

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(the project name comes AFTER /v2/, as the first segment of the image path,
not as a prefix before /v2/).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.
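
The two URL shapes can be compared side by side. This is a Python sketch with hypothetical helper names (`harbor_manifest_url`, `prefixed_endpoint_url`); it only rebuilds the strings quoted above and does not model the k3s `rewrite` syntax itself.

```python
def harbor_manifest_url(host, project, repo, ref):
    # Shape this message reports Harbor serving: project comes after /v2/.
    return f"https://{host}/v2/{project}/{repo}/manifests/{ref}"


def prefixed_endpoint_url(endpoint, repo, ref):
    # Shape a client builds when the project is baked into the mirror
    # endpoint: /v2/ is appended after the project prefix.
    return f"{endpoint.rstrip('/')}/v2/{repo}/manifests/{ref}"


good = harbor_manifest_url(
    "harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6"
)
bad = prefixed_endpoint_url(
    "https://harbor.openova.io/proxy-dockerhub", "rancher/mirrored-pause", "3.6"
)
print(good)
print(bad)
```

The first string matches the URL verified with curl above; the second matches the URL that PR #557 expected containerd to build.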

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:32:35 +04:00
e3mrah
737574b19a
feat(bp-keycloak): Phase-8b sovereign realm — token-exchange, catalyst-ui/api-server OIDC clients, SMTP, bump 1.2.2 → 1.3.0 (#604) (#609)
Adds the full Phase-8b identity surface required by the seamless handover flow:

- Token exchange enabled on sovereign realm (attributes.token-exchange: true)
- catalyst-ui public PKCE client: redirectUris + webOrigins keyed on
  console.<sovereignFQDN>, groups + requiredActions in ID token
- catalyst-api-server confidential service-account client: impersonation +
  manage-users + view-users + query-users roles on realm-management; client
  secret injected at provisioning time via .Values.catalystApiServerClientSecret
- WebAuthn (webauthn-register + webauthn-register-passwordless) registered as
  Required Action options on the realm
- UPDATE_PASSWORD set as defaultAction: true for new users
- smtpServer block: pre-handover default = contabo Stalwart relay; fully
  operator-configurable via .Values.smtp.* (Phase-8c-acceptable)
- required-actions client scope + oidc-usermodel-attribute-mapper for
  requiredActions claim in ID token (catalyst-ui first-login UX)

Architectural change: realm JSON moved from inline values.yaml (keycloak:
subchart key — no parent scope access) to a parent-chart template
platform/keycloak/chart/templates/configmap-sovereign-realm.yaml, which can
read .Values.sovereignFQDN and .Values.smtp.* for per-Sovereign interpolation.
The upstream bitnami chart's keycloakConfigCli.existingConfigmap is pointed at
this ConfigMap. Anti-duplication seam: configmap-sovereign-realm.yaml.

New values.yaml keys:
  sovereignFQDN: "" (REQUIRED — per-Sovereign overlay supplies it)
  sovereignRealm.enabled: true
  catalystApiServerClientSecret: "" (REQUIRED — provisioner seals and injects)
  smtp.host/port/from/user/password/ssl/starttls/auth

New bootstrap-kit file:
  09a-keycloak-catalyst-api-secret.yaml — SealedSecret template for
  keycloak-catalyst-api-server-credentials in catalyst-system namespace;
  provisioner fills encryptedData fields at deploy time

Bootstrap-kit refs bumped 1.2.x → 1.3.0 in _template, otech, omantel.
helm template clean with sovereignFQDN=otech.omani.works.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:05:27 +04:00
e3mrah
93627ada20
fix(bp-harbor): convert harbor-database-secret to Helm pre-install hook (1.2.8) (#603)
The 1.2.7 fix dropped the `data:` block from the chart template, but
Helm's three-way merge still owns the Secret as a release resource and
resets `data: {}` (no keys) on every chart upgrade — verified on otech22
where 1.2.6→1.2.7 reconcile wiped Reflector-populated keys back to nil.

Architectural fix: convert the Secret to a Helm pre-install hook.

  - `helm.sh/hook: pre-install` — Secret is created at install time only.
    On `helm upgrade`, Helm does NOT touch the Secret (no three-way merge),
    so keys populated by Reflector persist across every chart bump.
  - `helm.sh/hook-delete-policy: before-hook-creation` — On a re-install,
    Helm deletes the previous Secret first so the hook recreates clean.
  - `helm.sh/resource-policy: keep` — `helm uninstall` does NOT delete the
    Secret (paired with the hook, the standard upgrade path never sees a delete).
  - Hook resources are NOT recorded in the Helm release manifest, so they're
    invisible to `helm upgrade`'s three-way merge.

The inline `data:` block stays dropped (carried over from 1.2.7); Reflector still
populates everything from harbor-pg-app once CNPG bootstraps the source.

Bumps bp-harbor 1.2.7 → 1.2.8, bootstrap-kit refs (_template, otech, omantel).

Closes #585

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:57:55 +04:00
e3mrah
09208ca58f
fix(bp-harbor): omit data block in harbor-database-secret — Helm overwrite regression (1.2.7) (#602)
On every helm upgrade, Helm's three-way merge resets `data.password` and
`data.HARBOR_DATABASE_PASSWORD` to "" because the chart declares them
empty in the template. After Reflector populates them from `harbor-pg-app`,
the next bp-harbor upgrade silently empties them again — harbor-core then
crashloops on the next pod restart with "password authentication failed".

Observed on otech22 after the 1.2.5→1.2.6 Flux upgrade: harbor-database-
secret.password went from 64 bytes back to 0 bytes, harbor-core entered
CrashLoopBackOff. Resolved at runtime by touching harbor-pg-app to bump
its resourceVersion and re-trigger Reflector, but the architectural fix
is needed so it doesn't recur on the next chart upgrade.

Fix: drop the entire `data:` block from templates/database-secret.yaml.
The Secret is created by Helm with no data keys (Helm owns nothing in
the data field). Reflector adds ALL keys from `harbor-pg-app` (password,
HARBOR_DATABASE_PASSWORD, username, host, dbname, jdbc-uri, etc.) on
the first SecretWatcher event after CNPG bootstraps the source. On
subsequent helm upgrades, Helm's three-way merge has nothing to overwrite
in `data:` because the chart no longer declares any keys there.

Bumps bp-harbor 1.2.6 → 1.2.7, bootstrap-kit refs (_template, otech, omantel).

Closes #585 (regression of)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:53:37 +04:00
e3mrah
8d50402038
fix(bp-harbor): remove cnpg-app-annotator Job — CNPG inheritedMetadata handles annotation (1.2.6) (#601)
The post-install Job `harbor-pg-app-annotator` (with curlimages/curl:8.7.1)
is no longer needed: bp-harbor 1.2.5 already uses CNPG's `inheritedMetadata`
stanza in cnpg-cluster.yaml to stamp `reflection-allowed: true` onto
`harbor-pg-app` at CNPG bootstrap time. The Job was causing ErrImagePull on
otech22 because Docker Hub is proxied through Harbor itself (chicken-and-egg).

Removes:
  - templates/cnpg-app-annotator-job.yaml
  - templates/cnpg-app-annotator-rbac.yaml
  - values.yaml cnpgAnnotator section

Updates database-secret.yaml comment to reflect the inheritedMetadata approach.

Bumps Chart.yaml 1.2.5 → 1.2.6, bootstrap-kit refs (_template, otech, omantel).

Closes #585

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:44:55 +04:00
e3mrah
b1a25c4235
fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598) (#600)
Bug A — bp-keycloak@1.2.2: HTTPRoute backendService default was
`<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak)
but bitnami's fullname helper trims the chart-name suffix when Release.Name
already contains it, so the Service is just `keycloak`. Changed default to
`.Release.Name`. Sovereign realm was already imported (config-cli ran
successfully) — only the Gateway routing was broken, returning HTTP 500.

Bug B — bp-openbao@1.2.6: auto-unseal-rbac SA/Role/RoleBinding had
`helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The
`hook-succeeded` clause caused Helm to delete the SA immediately after the
weight-0 RBAC hook completed, before the weight-5 init Job pod could mount
its SA token and start. Removed all hook annotations from the RBAC resources
so they are managed by regular Helm release lifecycle (created before hooks,
never deleted mid-install).

Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6.

Verified on otech22 (manual remediation): Keycloak sovereign realm
OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:43:32 +04:00
e3mrah
cba1b5070a
fix(bp-gitea+harbor): use CNPG inheritedMetadata to propagate reflector annotations to pg-app Secret (#595)
The Cluster CR `metadata.annotations` are NOT propagated by CNPG onto the
generated `{name}-app` Secrets. Reflector requires the SOURCE Secret (e.g.
`gitea-pg-app`) to carry `reflection-allowed: "true"` before it will copy
data into the DESTINATION Secret (`gitea-database-secret`). On otech22 this
caused `gitea-database-secret` to stay empty indefinitely — gitea init container
failed auth with "password authentication failed for user gitea".

Fix: use CNPG's `inheritedMetadata.annotations` stanza (v1.24+) to instruct
CNPG to annotate all generated Secrets with the reflector permission annotations.
Applied to both bp-gitea (1.2.0→1.2.1) and bp-harbor (1.2.4→1.2.5) since
harbor-pg-app had the same issue.

Bootstrap-kit: bump bp-gitea chart ref 1.2.0→1.2.1 (template + otech + omantel).

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:37:48 +04:00
e3mrah
fe03b8cc42
fix(bp-harbor): use curl for CNPG annotator PATCH + add values defaults (1.2.4) (#594)
busybox wget does not support --method=PATCH (only GET/POST). The
harbor-pg-app-annotator Job silently succeeded without actually patching
harbor-pg-app, leaving harbor-database-secret empty on fresh install.

Fixes:
1. Switch cnpg-app-annotator-job.yaml from busybox:1.36.1 + wget to
   curlimages/curl:8.7.1 + curl -X PATCH. curl natively supports all
   HTTP verbs. HTTP response code checked explicitly; non-2xx exits 1
   so the Job retries instead of silently passing with no-op.
2. Add cnpgAnnotator.image stanza to values.yaml (was missing — prior
   charts defaulted via nil-safe dict fallback but the section was
   never actually written to values.yaml). Defaults to curlimages/curl:8.7.1.
3. readOnlyRootFilesystem: false (curl writes /tmp/patch-response.json
   for error diagnostics).
4. Bump chart 1.2.3 → 1.2.4.

Closes #585

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:29:45 +04:00
e3mrah
97abf9dedb
fix(bp-harbor): nil-safe image value extraction in cnpg-app-annotator Job (#593)
.Values.cnpgAnnotator.image.repository triggers a nil-pointer error when the
values tree is partially absent in Helm's default-values render. Use chained
`| default dict` assignments to safely extract the image repo/tag/
pullPolicy. Fixes the blueprint-release smoke render failure on 1.2.3.
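
The same nil-safe pattern, shown as a Python analogy rather than the actual Helm template: `annotator_image` is a hypothetical helper, and returning `None` for absent keys is an assumption of this sketch, not the chart's behaviour.

```python
def annotator_image(values):
    # Each lookup falls back to an empty dict, so a missing or null
    # parent key yields None instead of raising.
    annotator = values.get("cnpgAnnotator") or {}
    image = annotator.get("image") or {}
    return image.get("repository"), image.get("tag"), image.get("pullPolicy")


# A values tree with no cnpgAnnotator section at all still renders.
print(annotator_image({}))
```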

Closes #585

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:22:54 +04:00
e3mrah
74d526c276
fix: bp-gateway-api 5→10 CRDs + bp-gitea CNPG + bp-harbor CNPG race fix + DAG audit (#592)
* fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584)

The bundled Bitnami postgresql subchart pulls docker.io/bitnamilegacy/postgresql,
which is unavailable (deprecated Docker Hub namespace), so gitea-postgresql-0 was
stuck in ImagePullBackOff on otech22, cascading to gitea Init:CrashLoopBackOff.

Mirrors the bp-harbor pattern (PR #578): provision a CNPG Cluster CR (gitea-pg,
namespace gitea, 5Gi, pg16) + a reflector-managed gitea-database-secret, wiring
GITEA__database__PASSWD from the CNPG-generated gitea-pg-app Secret. All Bitnami
subchart config removed; postgresql.enabled: false.

Bootstrap-kit (template + otech + omantel): bump bp-gitea 1.1.2 → 1.2.0, add
dependsOn: bp-cnpg so the postgresql.cnpg.io/v1 CRD is registered before the
Capabilities gate in cnpg-cluster.yaml fires. omantel overlay migrated from
legacy ingress: to gateway: (Cilium Gateway API, issue #387).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dependency-audit): add bp-reflector (5a) to expected DAG + external-dns dep edge

bp-reflector was added to the bootstrap-kit (slot 05a) in issue #543 but was
never registered in scripts/expected-bootstrap-deps.yaml, causing the
dependency-graph-audit CI gate to error on every PR that includes this branch.
Also declare bp-reflector in bp-external-dns's depends_on to match the actual
HR file (12-external-dns.yaml dependsOn bp-reflector).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(bp-gateway-api): update CRD-count test 5→10 for experimental channel + DAG audit

Two fixes to unblock bp-gateway-api:1.1.0 OCI publish and the
dependency-graph-audit CI gate:

1. crd-render.sh: expect 10 CRDs (experimental channel) not 5.
   Chart 1.1.0 vendors experimental-install.yaml (TLSRoute, TCPRoute,
   UDPRoute, BackendLBPolicy, BackendTLSPolicy in addition to 5 standard
   CRDs) because Cilium 1.16.x checks for TLSRoute at operator startup.
   Without this fix the blueprint-release workflow for 1.1.0 fails the
   chart-test step and never pushes to GHCR — leaving all 13 dependent
   HRs stuck dependency-not-ready on every Sovereign.

2. expected-bootstrap-deps.yaml: add bp-reflector (slot 5a) and update
   bp-external-dns depends_on to include bp-reflector. bp-reflector was
   added to the bootstrap-kit in issue #543 but was missing from the
   expected DAG, causing dependency-graph-audit ERRORs on every PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-02 15:20:05 +04:00
e3mrah
64de55d72f
fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588) (#590)
* fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588)

trivy-operator exits 137 (OOM) on startup on a full Sovereign (38 HRs,
~200 pods). The operator initialises watch-cache controllers for every
resource kind it manages across all namespaces; at 38 HRs the cache
peak exceeds 256Mi before steady-state is reached.

Raise the operator container memory limit from 256Mi to 512Mi, which
is the stable floor measured on otech22 during Phase-8a handover testing.

Bump bp-trivy 1.0.1 → 1.0.2. Bootstrap-kit slots updated for _template,
otech.omani.works, omantel.omani.works.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

* fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml

The dependency-graph-audit check was failing because:
1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but
   bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml
2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector]
   in the HelmRelease but expected-bootstrap-deps.yaml only declared
   [bp-cert-manager, bp-powerdns]

Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update
bp-external-dns depends_on to include bp-reflector in the expected DAG.
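
The kind of comparison such an audit performs can be sketched as follows. This is a Python illustration, not the repository's audit script: `audit_deps` and its message wording are hypothetical, and the inputs are plain dicts rather than parsed YAML.

```python
def audit_deps(expected, actual):
    """Return mismatches between the expected DAG and the HelmRelease dependsOn."""
    problems = []
    for name in sorted(set(expected) | set(actual)):
        if name not in expected:
            problems.append(f"{name}: not declared in expected DAG")
        elif name not in actual:
            problems.append(f"{name}: declared but no HelmRelease found")
        elif set(expected[name]) != set(actual[name]):
            problems.append(
                f"{name}: expected {sorted(expected[name])}, got {sorted(actual[name])}"
            )
    return problems


# State before this fix, using the edges named in the message above.
expected = {"bp-external-dns": ["bp-cert-manager", "bp-powerdns"]}
actual = {
    "bp-external-dns": ["bp-cert-manager", "bp-powerdns", "bp-reflector"],
    "bp-reflector": ["bp-cert-manager"],
}
for line in audit_deps(expected, actual):
    print(line)
```

Both findings disappear once the expected DAG declares bp-reflector and adds it to bp-external-dns's depends_on.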

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 15:20:03 +04:00
e3mrah
4b2ae76cfd
fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587) (#589)
* fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587)

The native pdns provider in external-dns v0.15.1 does not accept
--pdns-api-version; the binary fatals at startup with:
  'unknown long flag --pdns-api-version'
causing CrashLoopBackOff (53+ restarts on otech22).

The provider auto-negotiates the PowerDNS API version — the flag is
superfluous and broken. Remove it from extraArgs.

Bump bp-external-dns 1.1.3 → 1.1.4. Bootstrap-kit slots updated for
_template, otech.omani.works, omantel.omani.works.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

* fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml

The dependency-graph-audit check was failing because:
1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but
   bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml
2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector]
   in the HelmRelease but expected-bootstrap-deps.yaml only declared
   [bp-cert-manager, bp-powerdns]

Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update
bp-external-dns depends_on to include bp-reflector in the expected DAG.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 15:20:00 +04:00
e3mrah
8d2ba0495d
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)
2026-05-02 15:18:49 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format
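
The solverName mismatch in item 1 can be modelled as a path lookup. This is a Python sketch under stated assumptions: `GROUP` is a placeholder groupName, and `solver_path` and `post_status` are hypothetical helpers, not cert-manager APIs.

```python
GROUP = "acme.example.com"  # placeholder groupName, not the real one


def solver_path(group_name, solver_name):
    # Path shape quoted in item 1: /apis/<groupName>/v1alpha1/<solverName>
    return f"/apis/{group_name}/v1alpha1/{solver_name}"


def post_status(registered_paths, path):
    # An unregistered path yields 404, as described in item 1.
    return 200 if path in registered_paths else 404


# The webhook serves only the name its binary reports ("pdns").
registered = {solver_path(GROUP, "pdns")}

print(post_status(registered, solver_path(GROUP, "powerdns")))
print(post_status(registered, solver_path(GROUP, "pdns")))
```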

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.
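
The schema mismatch in part 3 can be illustrated with a toy type check. This is a Python sketch, not the API server's validation code: `check_supported_features` is a hypothetical helper and "HTTPRoute" is only a placeholder feature value.

```python
JSON_TYPE = {str: "string", dict: "object"}


def check_supported_features(features, item_type):
    # item_type is str for the string[] schema, dict for the object[] schema.
    want = JSON_TYPE[item_type]
    for item in features:
        got = JSON_TYPE.get(type(item), type(item).__name__)
        if got != want:
            raise TypeError(f"must be of type {want}: {got}")


status = ["HTTPRoute"]                 # old string format
check_supported_features(status, str)  # accepted by a string[] schema
try:
    check_supported_features(status, dict)  # rejected by an object[] schema
except TypeError as err:
    print(err)
```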

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
83ec889f06
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)

Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.

Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:21:53 +04:00
e3mrah
2adc3a9493
fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase (#579)
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:21:36 +04:00
e3mrah
b647aa2561
fix(bp-harbor): provision harbor-pg CNPG cluster + database-secret (Closes #566) (#578)
Replace Helm lookup in database-secret.yaml with reflector annotation:
harbor-database-secret now reflects harbor-pg-app via
reflector.v1.k8s.emberstack.com/reflects. This fixes the race between
Helm rendering (fresh install) and CNPG cluster bootstrap — reflector
is event-driven and propagates the CNPG password within seconds of
harbor-pg-app being created, with no operator action required.

Also includes:
- templates/cnpg-cluster.yaml: harbor-pg CNPG Cluster (1 inst, 5Gi, pg16)
- values.yaml: postgres: block + database.external.host = harbor-pg-rw
- Chart 1.2.0 → 1.2.1; bootstrap-kit refs updated (_template, otech, omantel)

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:14:00 +04:00
e3mrah
58cf297800
fix(bp-seaweedfs): remove trailing slash in registry — fixes double-slash image ref (Closes #568) (#576)
`registry: "chrislusf/"` in values.yaml produced `chrislusf//seaweedfs:4.22`
because the vendored chart's _helpers.tpl renders
`printf "%s/%s:%s" $registryName $name $tag` — the trailing slash joined
with the separator slash made an invalid image reference.

Fix: `registry: "chrislusf/"` → `registry: "chrislusf"`.
Bump bp-seaweedfs 1.1.0 → 1.1.1. Update bootstrap-kit refs in _template,
otech.omani.works, omantel.omani.works (1.0.1 → 1.1.1).
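
The double slash follows directly from the format string quoted above. A minimal Python sketch, where `image_ref` is a hypothetical stand-in for the helper in `_helpers.tpl`:

```python
def image_ref(registry, name, tag):
    # Same shape as the quoted printf: "%s/%s:%s"
    return "%s/%s:%s" % (registry, name, tag)


print(image_ref("chrislusf/", "seaweedfs", "4.22"))  # chrislusf//seaweedfs:4.22
print(image_ref("chrislusf", "seaweedfs", "4.22"))   # chrislusf/seaweedfs:4.22
```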

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:02:48 +04:00
e3mrah
5796de12bc
fix(bp-spire): re-enable oidc-discovery-provider ClusterSPIFFEID to fix init stuck (Closes #571) (#575)
The oidc-discovery-provider ClusterSPIFFEID was disabled at bootstrap to
work around a CRD-ordering race (spire-controller-manager applying the
template before CRDs were registered). That race was fixed in bp-spire 1.1.4
by listing spire-crds as the first Helm dependency.

With all ClusterSPIFFEIDs still disabled the oidc-discovery-provider init
container blocks indefinitely with "PermissionDenied: no identity issued" —
the controller-manager never creates the registration entry so no SVID is
issued.

Re-enable oidc-discovery-provider identity. The default, test-keys, and
child-servers identities remain disabled (not needed for bootstrap).

Also carries the global.imageRegistry field added by issue #560 (was 1.1.5
in working tree, now bumped to 1.1.6 for this fix). Bootstrap-kit slot 06
updated from 1.1.4 → 1.1.6.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:00:43 +04:00
e3mrah
b88e98026f
fix(bp-falco): rename rules_file → rules_files (Falco 0.36+ canonical key, Closes #570) (#574)
Falco 0.36+ uses `rules_files` (plural) as the canonical multi-file rules
key. Setting the deprecated `rules_file` (singular) alongside the upstream
subchart's `rules_files` default causes Falco to detect a config conflict
and abort startup with CrashLoopBackOff on otech22.

Bump bp-falco 1.0.0 → 1.0.1. Bootstrap-kit slot 31 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:59:29 +04:00
e3mrah
06844d3a70
fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569) (#573)
bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but
bp-external-dns still had `powerdnsNamespace: openova-system` in its
NetworkPolicy egress rule and `--pdns-server=...openova-system...` in
extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation.

Fix:
- externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns
- extraArgs --pdns-server: ...openova-system... → ...powerdns...

Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:58:24 +04:00
e3mrah
c59f0496a2
fix(bp-mimir): disable ingest_storage to fix Kafka CrashLoop (Closes #567) (#572)
Upstream mimir-distributed 6.0.6 can boot in ingest-storage mode which
requires a Kafka endpoint. Setting kafka.enabled:false only disables the
bundled Kafka subchart — it does not tell the Mimir process itself to use
classic mode. Adding mimir.structuredConfig.ingest_storage.enabled:false
forces the classic blocks-storage ingester path (no Kafka dependency),
matching Catalyst's NATS JetStream event bus (ADR-0001).

Bump bp-mimir 1.0.0 → 1.0.1. Bootstrap-kit slot 23 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:57:09 +04:00
e3mrah
ad9cfc0f23
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst
  job images now prefixed with global.imageRegistry when non-empty. Default
  (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed
  with global.imageRegistry when non-empty. Verified: dnsdist image rewrites
  to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.

Subchart-only charts (global.imageRegistry stub added; threading via per-component
subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1  (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring

vcluster: N/A — no chart directory under platform/vcluster/chart/

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:52:43 +04:00