33abbc3627
264 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2ff50f0591
|
fix(bp-newapi+services-build): imagePullSecrets on Pod, sed bumps values.yaml smeTag (#955)
Two SME-blocker bugs caught live on otech113 (alice signup gate 5 fails on fresh Sovereign):
#952: bp-newapi 1.4.0 Pod has no imagePullSecrets, so kubelet pulls PRIVATE ghcr.io/openova-io/openova/{newapi-mirror,services-metering-sidecar} anonymously and gets 403 Forbidden.
Fix:
- Templatize spec.imagePullSecrets on Deployment + channel-seed Job.
- Default values.yaml `imagePullSecrets: [{name: ghcr-pull}]`.
- Add `newapi` to flux-system/ghcr-pull's reflector reflection-{allowed,auto}-namespaces in cloudinit-control-plane.tftpl so bp-reflector mirrors the source Secret into the namespace automatically on every fresh Sovereign.
- Bump bp-newapi 1.4.0 -> 1.4.1, update _template overlay.
#953: services-build.yaml's image-rewrite loop only matched the hardcoded `image: ghcr.io/.../services-<svc>:<sha>` form. 7 of 8 sme-services templates use `image: "{{ ... }}/services-<svc>:{{ .Values.images.smeTag }}"`. Each services-build run bumped only auth.yaml while reporting "update sme service images to ${SHA}", leaving the live Pod on stale bytes (PR #951's #941 fix never reached services-catalog despite the merge + chart bump chain).
Fix:
- After the hardcoded loop, also bump `images.smeTag` in products/catalyst/chart/values.yaml with a strict regex match (`^ smeTag: "<sha>"$`); refuse to auto-bump if the line shape changes (defends against silent drift if a contributor renames the field).
- Mirror the change into the retry-path `rewrite()` function so a reset-to-origin/main retry does not recreate the original bug.
Tests:
- platform/newapi/chart/tests/imagepullsecrets-render.sh: 4 cases asserting the Deployment and channel-seed Job carry the default ghcr-pull reference, that an empty override suppresses the block, and that custom secret names propagate (Inviolable Principle #4).
- tests/integration/services-build-rewrite.sh: 3 cases reproducing the workflow's rewrite logic on a sandboxed copy of the live chart, asserting both auth.yaml's hardcoded line AND values.yaml's smeTag get bumped, that helm-render of the catalyst chart with the bumped values produces all 8 SME-service Deployments at the new SHA, and that an idempotent re-bump to a second SHA also lands cleanly.
Refs: #952 #953 (umbrella #915: alice signup gate 5).
Co-authored-by: hatiyildiz <143030955+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
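The strict `smeTag` bump described for #953 can be sketched as follows. This is a minimal illustration, not the workflow's actual script: the function name `bump_sme_tag`, the two-space indent, and the hex-SHA pattern are assumptions made for the example.

```python
import re

# Assumed line shape: two-space indent, double-quoted hex SHA (hypothetical;
# the real workflow's pattern may differ in indentation or SHA length).
SME_TAG_LINE = re.compile(r'^(  smeTag: ")([0-9a-f]{7,40})(")$', re.MULTILINE)


def bump_sme_tag(values_yaml: str, new_sha: str) -> str:
    """Return values_yaml with smeTag set to new_sha.

    Raises ValueError instead of guessing when the line shape has drifted,
    for example if the field was renamed or appears more than once.
    """
    matches = SME_TAG_LINE.findall(values_yaml)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one smeTag line, found {len(matches)}; "
            "refusing to auto-bump"
        )
    # Keep the prefix and closing quote, swap only the SHA.
    return SME_TAG_LINE.sub(
        lambda m: m.group(1) + new_sha + m.group(3), values_yaml
    )
```

Failing loudly on a shape mismatch is the point of the design: a bump that silently matches nothing is how the stale-image bug went unnoticed.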
|
|
689276889c
|
fix(bp-catalyst-platform+bp-newapi): unblock alice signup gates 2-6 on Sovereigns (#915) (#951)
Six coupled chart + orchestrator fixes that unblock alice marketplace
signup → tenant ready → SaaS integrations → LLM → ledger on a freshly
franchised Sovereign. C5-final got Gate 1 GREEN on otech113 (2026-05-05)
but every downstream gate failed because the SME bundle hardcoded
contabo-only assumptions.
Bumps:
- bp-catalyst-platform 1.4.21 → 1.4.22
- bp-newapi 1.3.0 → 1.4.0
- bootstrap-kit slot 13 + 80 pins updated in lockstep
Issues addressed (single consolidated PR — smaller PRs would race
against alice signup retries):
- #934 (auth SMTP empty → "failed to send email"): sme-secrets.yaml
now reads SMTP_* from `catalyst-system/sovereign-smtp-credentials`
(the same A5-seeded source #883/#905 the chart 1.4.20 catalyst-
openova-kc-credentials Secret already uses) with source-wins
precedence. Both canonical (smtp-host/port/from/user/pass) AND
legacy (host/port/from/user/password) source-Secret key shapes
accepted. Empty source falls back to chart-level defaults so the
contabo path stays clean.
- #940 (provisioning service GITHUB_TOKEN placeholder + hardcoded
upstream github.com): chart values
.Values.smeServices.provisioning.{githubToken,git.{apiURL,owner,
repo,branch}} make every GitHub-API coordinate operator-overridable
with topology-aware defaults (Sovereign ⇒ in-cluster Gitea REST
API + `openova` org; contabo ⇒ api.github.com + `openova-io` org).
Provisioning binary's startup gate validates the GITHUB_TOKEN does
NOT contain placeholder substrings (<placeholder>, PLACEHOLDER,
REPLACE_ME, ...) and crashes the Pod into Pending if it does — the
operator sees the misconfig immediately instead of after alice
signups have failed silently in service logs. GitHub client now
accepts a custom API URL via NewClientWithAPIURL so Gitea's GitHub-
compatible /api/v1 surface drops in without re-implementing the
client.
- #941 (catalog "27 apps COMING SOON"): added `openclaw` and
`stalwart-mail` to migrateAppDeployable's deployable map at
core/services/catalog/handlers/seed.go. Both blueprints (bp-openclaw,
bp-stalwart-{sovereign,tenant}) ship with visibility=listed in the
embedded blueprints.json AND have working SME-tenant overlay
templates in sme_tenant_gitops.go, but the catalog handler silently
filtered them out because they were missing here. Map extracted to
DeployableAppSlugs() exported function so unit tests can assert
membership without invoking a Mongo store.
- #942 (REDPANDA_BROKERS hardcoded to talentmesh): configmap.yaml
selects broker default at render time based on global.sovereignFQDN
— Sovereign ⇒ NATS JetStream Service per ADR-0001 (the only local
bus on Sovereigns); contabo ⇒ legacy Redpanda Service in talentmesh.
Operator MAY override either default via
.Values.smeServices.eventBus.brokers without forking the chart.
The ConfigMap key name stays REDPANDA_BROKERS for back-compat with
existing SME service Go env wiring; new EVENT_BUS_PROTOCOL key
surfaces the protocol hint for services that want to switch wire
format independently.
- #943 (bp-newapi silently skips Deployment): NEW
templates/cnpg-cluster.yaml auto-provisions a CNPG-backed Postgres
Cluster + Helm-`lookup`-persistent DSN Secret when
.Values.cnpg.enabled (DEFAULT true). NEW templates/credentials-
secret.yaml auto-generates SESSION_SECRET + CRYPTO_SECRET (each
64-char randAlphaNum, persistent across reconciles via Helm
`lookup`) when .Values.credentials.autoProvision (DEFAULT true).
deployment.yaml gate now resolves Secret names from the chart-
emitted defaults when the operator hasn't supplied an override.
Capabilities-gated on postgresql.cnpg.io/v1 so a cold install
before bp-cnpg is Ready surfaces as "no Cluster yet" rather than
a hard install error.
- #944 (CRITICAL — cross-cluster pollution): provisioning.yaml
templates GIT_BASE_PATH from
.Values.smeServices.provisioning.gitBasePath with a topology-aware
default `clusters/<sovereignFQDN>/sme-tenants` on Sovereigns. NEW
`core/services/provisioning/gitguard` package validates at startup
AND on every commit code path that the path begins with
`clusters/<self-FQDN>/` — refusing to commit to any other cluster's
tree. Defence in depth so a runtime env mutation (kubectl exec,
ConfigMap update without Pod restart, hostile sidecar) cannot
bypass the check. Pre-#944 every alice tenant overlay landed in
upstream openova/openova `clusters/contabo-mkt/tenants/<id>/`
which contabo Flux would then install on the contabo cluster —
C5-final caught + reverted the alice2 incident at commit
|
||
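The two startup gates from #940 and #944 above (placeholder-token rejection and the cluster-path guard) can be sketched together. This is an illustrative Python rendering of checks the commit implements in Go; the function names and the example FQDN are hypothetical, and the marker tuple is only the subset the commit message names.

```python
import posixpath

# Subset of markers named in the commit message; the real list is longer.
PLACEHOLDER_MARKERS = ("<placeholder>", "PLACEHOLDER", "REPLACE_ME")


def validate_token(token: str) -> None:
    """Reject tokens that still contain a placeholder substring."""
    for marker in PLACEHOLDER_MARKERS:
        if marker in token:
            raise ValueError(f"GITHUB_TOKEN contains placeholder {marker!r}")


def validate_git_base_path(path: str, self_fqdn: str) -> None:
    """Refuse any path outside clusters/<self_fqdn>/.

    The path is normalised first so that '..' segments cannot escape
    the cluster's own tree.
    """
    normalized = posixpath.normpath(path)
    required_prefix = f"clusters/{self_fqdn}/"
    if not (normalized + "/").startswith(required_prefix):
        raise ValueError(
            f"{path!r} is outside {required_prefix!r}; refusing to commit"
        )
```

Calling `validate_git_base_path` on every commit path, not only at startup, matches the defence-in-depth intent described above: a value mutated at runtime is still checked before it is used.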
|
|
890fa67eff
|
fix(bp-harbor): inline labels on admin Secret to drop duplicate keys (#949) (#950)
PR #947 (bp-harbor 1.2.14) added templates/admin-secret.yaml that included the canonical bp-harbor.labels helper AND re-declared app.kubernetes.io/name + catalyst.openova.io/component with admin-credential-specific values. Helm's strict YAML post-render parser rejected the rendered manifest with `mapping key "app.kubernetes.io/name" already defined at line 8`, blocking the upgrade chain on otech113: bp-self-sovereign-cutover dependsOn bp-harbor and re-blocked, stalling cutover indefinitely.
Per the issue's recommended Option A, labels are inlined verbatim on the admin Secret. Every key the helper would emit is reproduced explicitly, except the two that need a Secret-specific value (catalyst.openova.io/component=harbor-admin) plus an explicit admin-credentials sub-component label.
A regression guard (Case 6) is added to tests/admin-secret.sh: the rendered Secret block is parsed through PyYAML's safe_load_all, which enforces mapping-key uniqueness the same way Helm's post-render does. Duplicate keys raise and break the test.
Bumps:
- platform/harbor/chart/Chart.yaml 1.2.14 → 1.2.15
- clusters/_template/bootstrap-kit/19-harbor.yaml slot pin
Verification (all green locally):
  helm template smoke . --namespace harbor # renders OK
  bash tests/admin-secret.sh # 6 gates green
  helm lint . # 0 failed
Closes one half of #949 (bp-harbor side); the slot pin update delivers it to fresh Sovereigns; existing otech113 picks up the upgrade on next Flux reconcile after the new chart publishes.
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> |
||
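The duplicate-label failure in #949 is easy to reproduce with an explicit check. The sketch below is a simplified, stdlib-only stand-in for the regression guard: it only scans a flat `key: value` labels block and is not a YAML parser. Whether a given YAML library rejects duplicate keys depends on the library and loader in use, which is why this example counts keys itself. The function name is hypothetical.

```python
from collections import Counter


def duplicate_label_keys(labels_block: str) -> list:
    """Return the sorted keys that occur more than once in a flat labels block."""
    keys = []
    for line in labels_block.splitlines():
        stripped = line.strip()
        # Skip blanks, comments, and anything that is not "key: value".
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        key = stripped.split(":", 1)[0].strip().strip("\"'")
        keys.append(key)
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)
```

A test in this style would fail whenever the returned list is non-empty, catching a helper include plus a re-declared key before the chart is published.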
|
|
88a8ecd8bb
|
fix(cutover): Reflector-mirror harbor-admin Secret + in-cluster trigger endpoint (#935) (#947)
Two bugs surfaced live on otech113 2026-05-05 blocking Self-Sovereignty Cutover end-to-end. Fix both in lockstep:
Bug 1: bp-self-sovereign-cutover Step 02 (harbor-projects) Job in `catalyst` namespace was hitting `secret "harbor-core" not found` for 11+ retries because the upstream Harbor `harbor-core` Secret only exists in the `harbor` namespace and Kubernetes forbids cross-namespace secretKeyRef. Step 02 was stuck in CreateContainerConfigError forever.
Fix: bp-harbor 1.2.13 → 1.2.14 ships a Catalyst-curated `harbor-admin` Secret in the `harbor` namespace with Reflector mirror annotations (allowed-namespaces=catalyst, auto-enabled). The same Secret name auto-materialises in `catalyst` so the cutover Job's secretKeyRef resolves natively. Password is randomly generated on first install (32-char alphanum, 190 bits entropy per feedback_passwords.md) and preserved across reconciles via `lookup`. The upstream Harbor subchart consumes it via `existingSecretAdminPassword: harbor-admin`. bp-self-sovereign-cutover 0.1.16 → 0.1.17 updates `harbor.adminSecretRef.name` from `harbor-core` to `harbor-admin`.
Bug 2: The 0.1.16 auto-trigger Helm post-install Job (#933) POSTed /api/v1/sovereign/cutover/start which sits behind RequireSession middleware. The Job has no human session cookie, so every request 401'd forever and cutover never started.
Fix: new catalyst-api endpoint POST /api/v1/internal/cutover/trigger lives OUTSIDE RequireSession and validates the bearer token via the apiserver's TokenReview API + checks the resolved username matches the canonical `bp-self-sovereign-cutover-runner` SA. Same engine, same idempotency, same state machine; different auth surface. The auto-trigger Job now mounts its projected SA token at /var/run/secrets/kubernetes.io/serviceaccount/token and sends it as `Authorization: Bearer <token>`. SA username + accepted list are runtime-overridable per Inviolable Principle #4.
Tests
- 6 Go unit tests for HandleCutoverInternalTrigger covering happy path, missing bearer (401), TokenReview rejection (502), wrong SA (403), idempotency (no Jobs created when complete), wrong method (405). All pass.
- bp-harbor admin-secret contract test (5 cases): Secret renders, HARBOR_ADMIN_PASSWORD key present, Reflector annotations, keep policy, upstream consumes via existingSecretAdminPassword.
- bp-self-sovereign-cutover cutover-contract test extended with 3 new cases: auto-trigger uses /internal/cutover/trigger, sends SA bearer token, references harbor-admin (not harbor-core).
- All 12 cutover-contract gates green; all 4 observability-toggle gates green; helm template + helm lint clean on both charts.
Bootstrap-kit slot pins
- clusters/_template/bootstrap-kit/19-harbor.yaml: 1.2.13 → 1.2.14
- clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml: 0.1.16 → 0.1.17
Closes #935
Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
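The decision order of the internal trigger endpoint in #947 can be sketched as a pure function with the TokenReview call injected. This is an illustrative Python model of a Go handler, following the status codes listed in the tests above. The function name, the `review` callable, the `catalyst` namespace in the username, the 200 on success, and the evaluation order are assumptions; Kubernetes does format ServiceAccount usernames as `system:serviceaccount:<namespace>:<name>`.

```python
# Assumed accepted list; the commit says it is runtime-overridable.
ALLOWED_USERNAMES = {
    "system:serviceaccount:catalyst:bp-self-sovereign-cutover-runner",
}


def handle_internal_trigger(method, auth_header, review, cutover_complete):
    """Return (status_code, outcome) for a trigger request.

    `review` stands in for a TokenReview call: it takes the bearer token
    and returns (authenticated, username), or raises on transport errors.
    """
    if method != "POST":
        return 405, "method-not-allowed"
    prefix = "Bearer "
    if not auth_header or not auth_header.startswith(prefix):
        return 401, "missing-bearer"
    token = auth_header[len(prefix):].strip()
    if not token:
        return 401, "missing-bearer"
    try:
        authenticated, username = review(token)
    except Exception:
        return 502, "token-review-failed"
    if not authenticated:
        return 502, "token-review-failed"
    if username not in ALLOWED_USERNAMES:
        return 403, "wrong-service-account"
    if cutover_complete:
        # Idempotent: nothing is started when cutover already finished.
        return 200, "already-complete"
    return 200, "started"
```

Injecting `review` keeps the sketch testable without a cluster, which mirrors how the listed unit tests can cover each branch in isolation.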
|
|
e9a72aa00d
|
feat(self-sovereign-cutover): auto-trigger on install + always-defined State (#933 E1) (#936)
Closes the otech113 dashboard regression where SovereigntyCard rendered `invalid CutoverState: <undefined>` instead of a Tethered badge, and makes the Day-2 cutover fire automatically once the chart lands rather than waiting for an operator click on "Achieve True Sovereignty". Founder rule per #933: handover is not "done" until cutover has run; the operator must NOT have to click a CTA on console.<sov-fqdn>/console/dashboard.
Three coupled changes:
1. catalyst-api: cutoverStatusResponse now ALWAYS emits a `state` field ("tethered" or "sovereign"), derived from cutoverComplete. The UI's branded parseCutoverState rejects empty/undefined, which is what was rendering the user-visible error text. Tests cover the empty ConfigMap, missing cutoverComplete, and explicit-true cases.
2. UI parseCutoverStatus: defensive fallback when the wire frame omits `state`: derive it from cutoverComplete (default "tethered"). Hostile/typo'd state values (e.g. 'pending', '') still throw via the branded parser. Defends against partial-rollout where a stale catalyst-api Pod is still serving the old shape.
3. bp-self-sovereign-cutover 0.1.16 (chart): new Helm post-install/post-upgrade hook (templates/10-auto-trigger-job.yaml) POSTs /api/v1/sovereign/cutover/start on catalyst-api after the step ConfigMaps + RBAC land. Idempotent via catalyst-api's durable status ConfigMap (200 if already complete, 409 if running, 200 to start). Fails open: a transient catalyst-api unreachability exits 0 so the chart install doesn't block; operator can always re-fire via the manual CTA. Gated on .Values.trigger.auto (default true; per-Sovereign overlays can disable for soak Sovereigns).
Hard rules honoured:
- No contabo Pods touched.
- Existing tethered Sovereigns that have not cut over stay tethered: the auto-trigger Job is in the chart (per-Sovereign), not in the mothership; only fresh Sovereign installs of bp-self-sovereign-cutover 0.1.16+ get it.
- IaC-first: the auto-trigger uses catalyst-api's existing /start endpoint (no bespoke cluster mutation outside the chart).
- Event-driven: post-install hook fires on chart install (no cron).
Verification:
- Go: cutover_test.go +TestBuildCutoverStatusResponse_StateAlwaysDefined +TestHandleCutoverStatus_StateFieldEmittedOnFreshSovereign, both green.
- TS: cutover.test.ts +5 cases for parseCutoverStatus state-fallback; 35/35 green. Sovereignty widget tests 20/20 green.
- Chart: tests/cutover-contract.sh +Case 8/9 (auto-trigger present by default, absent under trigger.auto=false); helm template renders cleanly.
Co-authored-by: Hatice Yildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
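The state-fallback rule from #936 (derive `state` when the frame omits it, still reject invalid values) can be sketched as below. The real parser is TypeScript; this Python version is only an illustration, and the function name and returned dict shape are assumptions.

```python
VALID_STATES = ("tethered", "sovereign")


def parse_cutover_status(frame: dict) -> dict:
    """Normalise a status frame so that `state` is always defined.

    A missing `state` is derived from cutoverComplete (default tethered).
    A present but invalid `state`, including the empty string, raises.
    """
    complete = bool(frame.get("cutoverComplete", False))
    if frame.get("state") is None:
        state = "sovereign" if complete else "tethered"
    else:
        state = frame["state"]
        if state not in VALID_STATES:
            raise ValueError(f"invalid CutoverState: {state!r}")
    return {"state": state, "cutoverComplete": complete}
```

Keeping the fallback narrow (only for an absent field) preserves the strictness of the branded parser while tolerating a stale API Pod during a partial rollout.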
|
|
9077016466
|
feat(bp-stalwart-sovereign): per-Sovereign Stalwart for Console mail (#924) (#931)
Phase-2 follow-up to #883: replace mothership Stalwart relay (mail.openova.io:587) with a Sovereign-local Stalwart so Console PIN/magic-link mail originates from `noreply@<sovereignFQDN>` with per-Sovereign SPF/DKIM/DMARC posture, eliminating the mothership SMTP SPOF for Sovereign Console login.
What ships:
1. NEW blueprint platform/stalwart-sovereign/ (otech-level, distinct from per-tenant bp-stalwart-tenant). Single Stalwart instance per Sovereign cluster, scoped to Sovereign Console system mail. NO Keycloak OIDC, NO webmail UI: Sovereign Console is the only consumer. Auto-provisioned admin + submission Secrets via the lookup-or-generate pattern (#898/#830/#887). Post-install Job:
  - registers the noreply submission principal in Stalwart
  - allows send-as for noreply@<sovereignFQDN>
  - reads DKIM public key, patches dns-records ConfigMap
  - materialises catalyst-system/sovereign-smtp-credentials with Sovereign-local infrastructure addresses + credentials, carrying BOTH key shapes (smtp-user/smtp-pass + legacy user/password) so the consumer chart works either way.
2. NEW bootstrap-kit slot 95 (clusters/_template/bootstrap-kit/95-bp-stalwart-sovereign.yaml). dependsOn: bp-cert-manager, bp-catalyst-platform. Sequenced after bp-catalyst-platform (slot 13) so the chart's post-install Job lands its mirror Secret in an already-existing catalyst-system namespace.
3. bp-catalyst-platform 1.4.19 → 1.4.20: SOURCE-wins precedence extended to (a) non-secret fields smtp-host/smtp-port/smtp-from so Sovereign-local infra addresses (`mail.<sovereignFQDN>`) take over from mothership defaults (`mail.openova.io`) on the next reconcile after slot 95 lands, and (b) canonical key shape `smtp-user`/`smtp-pass` in addition to legacy `user`/`password` source key shape.
4. expected-bootstrap-deps.yaml: declare slot 95 graph edge.
5. catalyst-api handler/sovereign_smtp_seed.go: documentation-only update to note this Phase-1 step is now a graceful fallback. The Phase-2 chart's post-install Job overwrites the mirror Secret on first reconcile so the cutover from mothership relay to Sovereign-local relay is automatic, no operator action.
Verification:
- `helm template smoke ./platform/stalwart-sovereign/chart` clean (smoke-render-safe; per-template gates skip when sovereignFQDN unset).
- `helm template smoke -f operator-values.yaml` emits StatefulSet, LoadBalancer Service, ClusterIP HTTP Service, DKIM-signing config, dns-records ConfigMap, Setup Job + RBAC.
- `chart/tests/sovereign-render.sh` 3 cases all PASS.
- `helm template smoke ./products/catalyst/chart` (1.4.20) clean.
- `helm lint` both charts: clean (only icon-recommended INFO).
- `bash scripts/check-bootstrap-deps.sh` PASSED: bootstrap-kit dependency graph audit, 0 drift, 0 cycles.
- `go test -run TestSeedSovereignSMTP`: Phase-1 seed tests pass.
- `go test -run TestBootstrapKit_TemplateClusterParses`: slot 95 YAML parses cleanly.
Out of scope (sub-PR follow-up under #924):
- DKIM keypair generation in catalyst-api orchestrator + DNS records (MX/A/SPF/DMARC/DKIM-pubkey) registration via PDM dynadot adapter at omani.works.
- Hetzner PTR (rDNS) auto-registration via the Hetzner cloud API.
- Cert-manager Certificate adding mail.<sovereignFQDN> SAN to the Sovereign wildcard cert (chart relies on the existing wildcard cert from bp-catalyst-platform 1.4.0+'s per-zone Certificate template; when that wildcard chain covers the Sovereign FQDN, `mail.<sovereignFQDN>` is already covered).
Acceptance (lands when sub-PR follow-up ships):
- Sovereign Console PIN delivery uses noreply@<sov-fqdn>.
- External mail server (e.g. Gmail) accepts mail with valid SPF + DKIM.
- Mothership SMTP no longer SPOF for Sovereign Console login.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
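The source-wins precedence with both key shapes, as described in item 3 above, can be sketched like this. The real logic lives in a Helm template; this Python version is an illustration only. The function name is hypothetical, and preferring the canonical key over the legacy key when both are present is an assumption.

```python
# Canonical key -> legacy key, per the two source-Secret shapes named above.
CANONICAL_TO_LEGACY = {
    "smtp-host": "host",
    "smtp-port": "port",
    "smtp-from": "from",
    "smtp-user": "user",
    "smtp-pass": "password",
}


def resolve_smtp(source: dict, chart_defaults: dict) -> dict:
    """Resolve SMTP settings: source Secret wins, chart defaults fill gaps."""
    resolved = {}
    for canonical, legacy in CANONICAL_TO_LEGACY.items():
        resolved[canonical] = (
            source.get(canonical)          # canonical shape first (assumed)
            or source.get(legacy)          # then the legacy shape
            or chart_defaults.get(canonical, "")
        )
    return resolved
```

With an empty source the result equals the chart defaults, which is the behaviour that keeps the contabo path unchanged.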
|
|
3fe27f625f
|
feat(bp-wordpress-tenant): wp-cli OIDC bootstrap + oidc.* canonical block (0.2.0, #915) (#927)
Umbrella issue #915 (D1 sub-task). Aligns the chart's post-install OIDC config Job with the canonical wp-cli flow and the bp-keycloak tenant-realm contract C1's PR #918 ships.
Chart 0.2.0
-----------
- templates/oidc-config-job.yaml rewritten to use the official wordpress:cli-2.12.0-php8.3 image (manifest-list digest pinned per Inviolable Principle #4). Replaces direct PHP/SQL UPSERTs against wp_options with:
  * wp core install (idempotent: wp core is-installed)
  * wp plugin install openid-connect-generic --activate (idempotent: wp plugin is-installed)
  * wp option update openid_connect_generic_settings <json>
  * wp option update default_role
  * wp theme install/activate
  * wp option update siteurl/home
  Going through wp-cli (i.e. WordPress core's own PHP API) is more resilient than schema-shape-dependent INSERT statements and survives WordPress minor upgrades.
- values.yaml: new canonical oidc.* block: oidc.{enabled, issuerURL, clientId, clientSecretName, defaultRole, identityKey, roleMapping, cliImage}. Default oidc.clientSecretName = "wordpress-oidc-client-secret" matches the K8s Secret bp-keycloak's PR #918 emits alongside the realm import ConfigMap (so the realm JSON's `secret` field and the Secret bytes never drift).
- Legacy keycloak.{realmURL, clientID, clientSecretName} kept as a back-compat alias. _helpers.tpl folds it into oidc.* when the modern keys are at their values.yaml defaults so chart 0.1.x clusters keep reconciling. Removed in chart 0.3.0.
- oidc.defaultRole=subscriber: newly auto-created SSO users land with subscriber capability (operator overrides via overlay).
- Redirect URIs: the openid-connect-generic plugin's default callback is /wp-admin/admin-ajax.php?action=openid-connect-authorize when alternate_redirect_uri=0 (we set 0). bp-keycloak (PR #918) registers the same URL plus /wp-login.php and a /* wildcard, so the client's allowed-redirect-URI list aligns with what the plugin actually issues.
Orchestrator emit
-----------------
- products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go smeTenantBPWordPress now emits the canonical oidc.* block AND the legacy keycloak.* alias (for chart 0.1.x clusters mid-upgrade).
Tests
-----
- chart/tests/oidc-config.sh: 7 helm-template assertions:
  1. Canonical oidc.* render produces a Job with the required wp-cli command flow + wordpress:cli-2.12.0-php8.3 image.
  2. Legacy keycloak.* fold path (chart 0.1.x compat).
  3. oidc.enabled=false short-circuits the Job.
  4. alternate_redirect_uri=0 (so plugin URL matches the realm-registered redirect URI from PR #918).
  5. defaultRole rendered + propagated.
  6. Render YAML is parseable and contains all required kinds.
  7. wp-content PVC mounted in the Job (so pg4wp's db.php drop-in loads; failure here would silently fall back to mysqli).
- internal/handler/sme_tenant_test.go:
  * TestRenderSMETenantOverlay_WordPressEmitsOIDC: pins the canonical oidc.* block + legacy keycloak.* alias the orchestrator emits for the alice@omantel test fixture.
  * TestRenderSMETenantOverlay_WordPressOIDC_BYOMode: BYO domain mode renders wordpress.<byo-domain> as the ingress host.
Verification
------------
- helm lint clean
- helm template smoke green for: oidc.* canonical, keycloak.* legacy fold, oidc.enabled=false short-circuit
- chart/tests/oidc-config.sh: 7/7 PASS
- chart/tests/observability-toggle.sh: 2/2 PASS (regression)
- go test ./internal/handler/ -run "SMETenant|TestRenderSME": all green (TestAuthHandover_HappyPath failure is pre-existing on main, unrelated to this change)
Closes (D1 sub-task) of #915.
Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
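The legacy-alias fold in #927 (use `keycloak.*` only while the matching `oidc.*` key is still at its default) can be sketched as follows. The chart does this in `_helpers.tpl`; this Python model, its function name, and the default values used in the usage note are assumptions for illustration.

```python
# Legacy key -> canonical key, per the alias described above.
LEGACY_TO_CANONICAL = {
    "realmURL": "issuerURL",
    "clientID": "clientId",
    "clientSecretName": "clientSecretName",
}


def fold_legacy_oidc(values: dict, defaults: dict) -> dict:
    """Return the effective oidc block after folding in legacy keycloak.*.

    A legacy value is used only when the canonical key is still at its
    chart default and the legacy key carries a non-empty value.
    """
    oidc = dict(values.get("oidc", {}))
    legacy = values.get("keycloak", {})
    default_oidc = defaults.get("oidc", {})
    for legacy_key, canonical_key in LEGACY_TO_CANONICAL.items():
        at_default = oidc.get(canonical_key) == default_oidc.get(canonical_key)
        if at_default and legacy.get(legacy_key):
            oidc[canonical_key] = legacy[legacy_key]
    return oidc
```

Under this rule an overlay that sets both blocks keeps its canonical value, so a cluster mid-upgrade can carry the alias without it overriding newer configuration.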
|
|
a1ca1872aa
|
feat(bp-stalwart-tenant): wire Keycloak OIDC SSO end-to-end (#915) (#920)
Closes the C2 sub-task of EPIC #915: alice's Stalwart authenticates SMTP/IMAP/JMAP/webmail logins against her per-tenant Keycloak realm, not a shared otech-level IdP.
Three layered changes (matching the three things broken on otech103):
1. Orchestrator (`smeTenantBPStalwart` in sme_tenant_gitops.go) now emits per-tenant OIDC values matching the bp-wordpress-tenant + bp-openclaw shape:
  keycloak.realmURL = https://keycloak.<sub>.<parent>/realms/sme-<sub>
  keycloak.clientID = stalwart
  keycloak.clientSecretName = stalwart-oidc-client-secret
  keycloak.oidcExternalSecret.remoteRef.key = sovereign/<otech-fqdn>/stalwart/<tenant>/oidc
  plus admin externalSecret + dependsOn bp-keycloak so the SME's three apps (wordpress, openclaw, stalwart) SSO against ONE realm with distinct client IDs (#915 C1 registers all three in the realm bootstrap).
2. Chart bootstrap config.toml drops the pre-0.16 kebab-case `[directory.keycloak] type = "oidc"` block (silently ignored by the upstream registry parser; verified against crates/registry/src/schema/structs.rs in stalwartlabs/stalwart; OidcDirectory serdes camelCase: `@type = "Oidc"`, `issuerUrl`, `claimUsername`, `claimName`, `claimGroups`, `requireScopes`). The `internal` directory stays as the bootstrap fallback so the admin can log in before the post-install Job seeds OIDC.
3. setupJob defaults to enabled (was off in 0.1.1) and POSTs the canonical OIDC directory entry to `/api/settings`:
  directory.keycloak.@type = "Oidc"
  directory.keycloak.issuerUrl = <realm URL>
  directory.keycloak.claimUsername = preferred_username
  directory.keycloak.claimName = name
  directory.keycloak.claimGroups = groups
  directory.keycloak.requireScopes = [openid email profile groups]
  directory.keycloak.usernameDomain = <tenant domain>
  storage.directory = keycloak
  The setting POSTs are idempotent (`assert_empty: false`) so Helm upgrades re-run without breaking existing logins. Re-uses the upstream Stalwart container (ships curl + stalwart-cli); no new image needed.
Tests:
- `chart/tests/oidc-render.sh` (NEW): asserts every settings key is rendered, the [oauth] env block propagates the per-tenant realm URL, and the bootstrap config.toml parses as valid TOML.
- `chart/tests/expression-syntax.sh`: re-passes (Stalwart expression `==` audit per stalwart_expression_syntax.md).
- `TestRenderSMETenantOverlay_StalwartEmitsKeycloakOIDC` (NEW): Go test verifies the orchestrator emits the per-tenant realm URL, client metadata, and ExternalSecret-store remoteRef paths.
- All existing TestRenderSMETenantOverlay_* tests pass.
- `helm template` clean with default values AND with a per-tenant overlay (--api-versions external-secrets.io/v1beta1).
Chart bumps 0.1.1 → 0.1.2; blueprint.yaml spec.version mirrors per issue #817 (chart/blueprint version invariant).
Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
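The per-tenant realm URL and the settings keys listed in #920 can be assembled as in the sketch below. It only builds the key/value map; the wire format the Job uses to POST to `/api/settings` is not shown in the commit and is not modelled here. Both function names are hypothetical, and treating `<sub>.<parent>` as the tenant domain is an assumption.

```python
def realm_url(subdomain: str, parent_domain: str) -> str:
    """Per-tenant realm URL in the shape given in the commit message."""
    return f"https://keycloak.{subdomain}.{parent_domain}/realms/sme-{subdomain}"


def oidc_directory_settings(issuer_url: str, tenant_domain: str) -> dict:
    """Settings keys for the OIDC directory entry, as listed above."""
    return {
        "directory.keycloak.@type": "Oidc",
        "directory.keycloak.issuerUrl": issuer_url,
        "directory.keycloak.claimUsername": "preferred_username",
        "directory.keycloak.claimName": "name",
        "directory.keycloak.claimGroups": "groups",
        "directory.keycloak.requireScopes": ["openid", "email", "profile", "groups"],
        "directory.keycloak.usernameDomain": tenant_domain,
        "storage.directory": "keycloak",
    }
```

For a tenant `alice` under a hypothetical parent `example.test`, `realm_url("alice", "example.test")` yields the issuer that every key in the map is built around.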
|
|
9447d88dfd
|
feat(bp-newapi): auto-seed channel #1 = Qwen3.6 @ BankDhofar (#915) (#919)
Per epic #915 (SME tenant integration DoD: alice → OpenClaw → NewAPI → Qwen3.6@BankDhofar end-to-end), bp-newapi must come up with channel #1 = Qwen3.6 hosted at BankDhofar (https://llm-api.omtd.bankdhofar.com, model qwen3-coder / alias qwen3.6) already wired to its admin API, so the FIRST customer request from an SME's OpenClaw → NewAPI hits a real upstream LLM rather than a 404 / "no channel found" error.
Until now the chart's channels.yaml ConfigMap was a documentation surface only; the upstream NewAPI binary persists channel state to its Postgres `channels` table via its admin API at /api/channel/. This patch bridges that gap.
Discovery:
- Canonical BankDhofar relay reference exists in openova-private/clusters/contabo-mkt/apps/axon/helmrelease.yaml (axon.vllm.baseUrl=https://llm-api.omtd.bankdhofar.com, defaultModel=qwen3-coder, secret=axon-vllm-secret).
- K8s secret confirmed live (axon/axon-vllm-secret, key AXON_VLLM_API_KEY).
- Architecture: bp-newapi is per-Sovereign (one NewAPI per OTECH); SME tenants share it via OpenClaw's newapi.baseURL = https://newapi.<OTECHFQDN>. Channel seeding therefore happens at the Sovereign-level chart install, NOT per-tenant.
Changes:
1. platform/newapi/chart/values.yaml
  - New `defaultChannels.qwenBankDhofar` block (enabled=false by default; per-Sovereign overlay flips it true with the canonical endpoint + commercial-contract attestation).
  - New `channelSeed` block configuring the post-install Helm hook Job (image, resources, backoff, deadline, hook delete policy).
2. platform/newapi/chart/templates/_helpers.tpl
  - effectiveChannels helper composes qwenBankDhofar BEFORE operator-supplied .Values.channels and BEFORE defaultChannels.vllm so it lands as channel #1 in NewAPI's row-insertion order (NewAPI's router resolves `model` lookups in row order).
  - New channelSeedJobName helper (shared by Job + RBAC + ConfigMap).
3. platform/newapi/chart/templates/channel-seed-job.yaml (NEW)
  - post-install/post-upgrade Helm hook Job that:
    * Mounts the operator-supplied master-key Secret (auth.adminUI.masterKeySecret) for one-time admin API auth.
    * Mounts the per-channel upstream API key Secret (defaultChannels.qwenBankDhofar.existingSecret).
    * Polls /api/status until 200 (handles NewAPI startup window).
    * For each default channel: GET /api/channel/?keyword=<name>; if a row whose `name` exactly matches exists, SKIP. Otherwise POST /api/channel/ with the channel definition. Idempotent: re-runs after upgrades are no-ops once channels exist.
    * Bounded RBAC (Role+RoleBinding only on the named Secrets).
    * Skip-render gates: channelSeed.enabled, defaultChannels.* enabled, masterKeySecret supplied. helm template with default values renders no Job (CI smoke clean).
4. clusters/_template/bootstrap-kit/80-newapi.yaml
  - Bumped chart version 1.2.0 → 1.3.0.
  - Added defaultChannels.qwenBankDhofar block to the per-Sovereign overlay shape (still enabled=false in the template; operator supplies endpoint + attestation + Secrets per Sovereign).
5. platform/newapi/chart/Chart.yaml
  - Bumped 1.2.0 → 1.3.0 with changelog comment.
6. products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
  - bp-openclaw per-tenant overlay now emits `newapi.defaultModel: qwen3.6` so OpenClaw's UI surfaces the friendlier alias by default. (Both qwen3.6 and qwen3-coder route to the same channel via the chart's `models` list.)
Verification:
- helm lint . PASS (1 chart linted, 0 failed)
- helm template (defaults) PASS (no Job rendered)
- helm template (qwen enabled) PASS (Job + RBAC + ConfigMap + channels.yaml all render with channel #1 first)
- helm template (endpoint empty) FAIL with helpful message (configurability gate)
- go build ./... PASS
- go test ./internal/handler/... PASS for SME tenant overlay tests (TestRenderSMETenantOverlay_*)
- Pre-existing AuthHandover panic is unrelated to this change
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), every knob is configurable via the per-Sovereign bootstrap-kit overlay. The endpoint default is empty so a fresh `helm template` does not silently wire customers to a third-party host.
Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
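The idempotent seed loop from the channel-seed Job in #919 (search by keyword, skip on exact name match, otherwise create) can be sketched with an injected client. The real Job talks to NewAPI's admin API over HTTP; here `client.search` and `client.create` are hypothetical stand-ins for `GET /api/channel/?keyword=<name>` and `POST /api/channel/`, and `FakeClient` exists only so the example runs without a server.

```python
def seed_channels(client, channels):
    """Create each channel unless one with exactly the same name exists.

    Returns the names that were created on this run.
    """
    created = []
    for channel in channels:
        rows = client.search(channel["name"])
        # Keyword search may return partial matches; only an exact
        # name match counts as "already seeded".
        if any(row.get("name") == channel["name"] for row in rows):
            continue
        client.create(channel)
        created.append(channel["name"])
    return created


class FakeClient:
    """In-memory stand-in for the admin API (illustration only)."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def search(self, keyword):
        return [row for row in self.rows if keyword in row["name"]]

    def create(self, channel):
        self.rows.append(dict(channel))
```

Running `seed_channels` twice against the same client creates the channel once and returns an empty list the second time, which is the no-op-on-upgrade behaviour the commit describes.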
|
|
7f859dbb4b
|
feat(bp-keycloak): tenant-mode realm with wordpress/openclaw/stalwart OIDC clients (1.4.0, #915) (#918)
PR #911 wired the SME tenant orchestrator to emit realmConfig.tenant.enabled=true on the per-tenant bp-keycloak HelmRelease, but the chart had no template that consumed those values, so the WordPress / OpenClaw / Stalwart OIDC integrations had no client registered in the tenant realm and SSO failed end-to-end. This change adds the chart-side template the orchestrator was already emitting for.
When realmConfig.tenant.enabled=true:
* configmap-sovereign-realm.yaml SKIPS (mutual-exclusion guard added on the existing template) so only one realm CM is rendered.
* NEW templates/configmap-tenant-realm.yaml renders a realm import ConfigMap (same name `<release>-sovereign-realm-config` so the upstream keycloak-config-cli existingConfigmap reference still resolves) carrying the tenant realm + 3 OIDC clients:
  - wordpress (confidential, auth-code; redirect URIs cover the openid-connect-generic plugin's admin-ajax.php callback + /wp-login.php fallback)
  - openclaw (confidential, auth-code; redirect URI /oauth/callback per #915 spec)
  - stalwart (confidential, serviceAccountsEnabled=true so the directory.keycloak type=oidc backend can use client_credentials to introspect IMAP/SMTP tokens; standardFlowEnabled=true for webmail UI auth-code)
* NEW per-app Secrets emitted in the same template scope as the realm ConfigMap so the realm JSON's `secret` field and the K8s Secret bytes never drift:
  - wordpress-oidc-client-secret
  - openclaw-oidc-client-secret
  - stalwart-oidc-client-secret (carries BOTH client-secret AND OIDC_CLIENT_SECRET keys for the two consumer paths)
* Each per-app secret persists across helm upgrade via lookup-or-generate (mirrors marketplace-api/secret.yaml pattern from issue #887 and the existing catalyst-api-server secret in configmap-sovereign-realm.yaml). helm.sh/resource-policy: keep so bytes outlive uninstall.
* Fail-closed validation when realmConfig.tenant.enabled=true and any of realmName / parentDomain / subdomain is unset (Inviolable Principle #4).
NEW tests/tenant-realm-oidc-clients.sh covers 6 cases:
1. Sovereign-mode default render unchanged (kubectl + catalyst-ui + catalyst-api-server clients present, no tenant artefacts leak).
2. Tenant-mode render produces exactly ONE realm CM under the expected name + zero leaked Sovereign-only resources.
3. Tenant realm JSON parses + 3 OIDC clients present with the redirect-URI / publicClient / serviceAccountsEnabled shape per #915 spec; Secret bytes match realm JSON's `secret` fields.
4. Fail-closed validation when tenant fields missing.
5. keycloak-config-cli post-install Job projects the realm CM by SAME name in BOTH modes.
6. Operator-supplied per-app clientSecret overrides the lookup-or-generate path.
Existing tests/observability-toggle.sh + tests/oidc-kubectl-client.sh still pass. Sovereign-mode unchanged. The chart now consumes the values the orchestrator (PR #911) was already emitting; no orchestrator change needed.
Closes #915 (C1 sub-task) and unblocks #899 (per-tenant Keycloak realm-config materialisation).
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
61c8d77b58
|
feat(bp-openclaw): per-tenant Keycloak SSO + NewAPI as OpenAI-compatible LLM gateway (#915) (#917)
Wire bp-openclaw to the per-tenant Keycloak realm (OIDC SSO) and the per-tenant NewAPI (OpenAI-compatible LLM endpoint, NOT direct OpenAI), delivering C3 of umbrella epic #915.
Chart changes (bp-openclaw 0.1.0 → 0.2.0):
- Add canonical `oidc.{issuerURL,clientId,clientSecret.{name,key}}` block.
- Add canonical `llm.{baseURL,apiKey.{name,key},defaultModel}` block.
- Controller Deployment now emits OIDC_*, LLM_*, OPENAI_API_{BASE,KEY}, LLM_DEFAULT_MODEL envs (legacy KEYCLOAK_*/NEWAPI_BASE_URL_DEFAULT retained for back-compat with current controller image).
- Per-user pods carry OPENAI_API_BASE / OPENAI_API_KEY / LLM_DEFAULT_MODEL alongside the identity-blind NEWAPI_BASE_URL / NEWAPI_KEY (ADR-0003 §3.3 unchanged).
- Legacy `keycloak.*` / `newapi.*` keys remain accepted as fallbacks; helpers prefer canonical blocks but fall back to the legacy alias when the canonical block is unset (or still at placeholder).
- assertNoPlaceholders guard updated to check resolved canonical values.
- render-toggles.sh smoke test extended: asserts both canonical and legacy code-paths render and that all expected envs reach the rendered Deployment.
Orchestrator changes (catalyst-api smeTenantBPOpenClaw template):
- Emit per-tenant `oidc.issuerURL` = https://keycloak.<sub>.<parent>/realms/sme-<sub>
- Emit per-tenant `oidc.clientId` = openclaw, secret from openclaw-oidc-client-secret/OIDC_CLIENT_SECRET (rendered by bp-keycloak's post-install hook).
- Emit per-tenant `llm.baseURL` = https://api.<sub>.<parent>/v1 (alice's own NewAPI ingress, NOT the otech-wide newapi.<otech-fqdn>); apiKey from openclaw-newapi-controller-token/NEWAPI_KEY.
- Emit `llm.defaultModel: qwen3.6`; NewAPI uses this to select the backing channel; C4 of #915 wires Qwen3.6@BankDhofar at tenant-create.
- Legacy keycloak/newapi blocks still emitted for back-compat with bp-openclaw < 0.2.0.
Tests:
- New TestRenderSMETenantOverlay_OpenClawOIDCAndLLMBlocks asserts the rendered HelmRelease contains the canonical oidc + llm blocks with per-tenant values, and that llm.baseURL is the per-tenant api.<sub>.<parent>/v1 (NOT the otech-wide newapi).
- bp-openclaw render-toggles.sh extended (Case 2b/2c).
Co-authored-by: alierenbaysal <alierenbaysal@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
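The per-tenant URL shapes and the canonical-over-legacy fallback can be sketched as follows. This is an illustrative Python sketch, not the orchestrator's Go template or the chart helpers: the function names and the `PLACEHOLDER` sentinel value are hypothetical, while the URL patterns, the `openclaw` client id, and the `qwen3.6` default model are taken from the commit message.

```python
# Hypothetical sentinel; the chart's real placeholder string is not shown here.
PLACEHOLDER = "REPLACE_ME"


def tenant_openclaw_values(sub: str, parent: str) -> dict:
    """Build the canonical oidc/llm values for one tenant subdomain."""
    return {
        "oidc": {
            "issuerURL": f"https://keycloak.{sub}.{parent}/realms/sme-{sub}",
            "clientId": "openclaw",
        },
        "llm": {
            # Per-tenant NewAPI ingress, not the Sovereign-wide one.
            "baseURL": f"https://api.{sub}.{parent}/v1",
            "defaultModel": "qwen3.6",
        },
    }


def prefer_canonical(canonical, legacy):
    """Use the canonical value unless it is unset or still a placeholder."""
    if canonical and canonical != PLACEHOLDER:
        return canonical
    return legacy
```

For example, `tenant_openclaw_values("alice", "omani.works")` yields an issuer under `keycloak.alice.omani.works` and an LLM base URL under `api.alice.omani.works`.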
||
|
|
368545369b
|
fix(bp-stalwart-tenant): unbootable on fresh tenants — values shape, missing admin Secret, sec ctx (#898) (#904)
Three fixes for bugs that left bp-stalwart-tenant 0.1.0 unable to come up
on a freshly-franchised SME tenant. All three surfaced on the otech103
alice tenant during the Phase-1 DoD sweep.
1. Tenant-domain values shape (HelmRelease render error)
The 0.1.0 chart referenced `.Values.domain.primary` in five
templates. The live HR on otech103 had `values.domain:
acme.omani.works` (a string), emitted by a pre-#897 catalyst-api
build, so every reconcile died with:
can't evaluate field primary in type interface {}
Added `bp-stalwart-tenant.tenantDomain` + `tenantMode` helpers
that resolve in priority order:
1. `tenant.domain` (forward-looking flat shape)
2. `domain.primary` (canonical post-#897 map shape)
3. `domain` (string) (legacy pre-#897 shape — back-compat)
Returns "" when no shape is set (smoke-render-safe); per-template gates skip when empty.
2. Missing stalwart-admin Secret
deployment.yaml + mailbox-provision-job.yaml reference a Secret
key `ADMIN_PASSWORD` on `.Values.admin.secretName`. The 0.1.0
chart only emitted an ExternalSecret, and only when
`admin.externalSecret.remoteRef.key` was non-empty (smoke-render
concession). Fresh tenants land in CreateContainerConfigError.
Added `templates/admin-secret.yaml` mirroring marketplace-api/
secret.yaml (#887): random 32-char ADMIN_PASSWORD generated by
sprig randAlphaNum, persisted across reconcile via lookup,
helm.sh/resource-policy: keep so reinstall picks it back up.
Auto-disabled when an authoritative ExternalSecret is wired —
no double-bind between two controllers.
3. Pod sec ctx vs. upstream image's file capabilities
`getcap docker.io/stalwartlabs/stalwart:v0.16.3 /usr/local/bin/
stalwart` reports `cap_net_bind_service=ep`. The image creates
user `stalwart` at UID 2000 and the binary IS the entrypoint
(no demotion script). The 0.1.0 chart ran as UID 65534 with
`drop: ALL`; the kernel refuses to elevate file caps with an empty
bounding set, so exec failed with `operation not permitted`.
Aligned to image's native UID 2000, kept `drop: ALL` and added
`NET_BIND_SERVICE` explicitly. fsGroup 2000 ensures /opt/stalwart
PVC is writable.
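The priority order from fix 1 can be sketched in Python. This is only an illustration of the resolution order, assuming the Helm values are available as a plain dict; the function name `tenant_domain` is hypothetical and the real logic lives in the chart's `bp-stalwart-tenant.tenantDomain` helper.

```python
def tenant_domain(values: dict) -> str:
    """Resolve the tenant domain across the three accepted value shapes."""
    # 1. Forward-looking flat shape: tenant.domain
    tenant = values.get("tenant")
    if isinstance(tenant, dict) and tenant.get("domain"):
        return tenant["domain"]
    domain = values.get("domain")
    # 2. Canonical post-#897 map shape: domain.primary
    if isinstance(domain, dict):
        return domain.get("primary") or ""
    # 3. Legacy pre-#897 shape: domain is a bare string
    if isinstance(domain, str):
        return domain
    # Nothing set: empty string keeps smoke renders from failing.
    return ""
```

Checking the type before reading `primary` is what avoids the "can't evaluate field primary" class of error when `domain` is a string.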
Other:
- Bumped Chart.yaml + blueprint.yaml to 0.1.1 (#817 alignment).
- configSchema in blueprint.yaml now permits the legacy + tenant
shapes alongside the canonical map.
- mailboxProvisioner.setupJob.enabled defaults to false until the
canonical stalwart-cli image is published (re-uses upstream
stalwart container as fallback CLI host).
Acceptance: targeted at otech103 alice tenant
(sme-789ae512-bc0f-467c-a016-001f5496c403) where 0.1.0 reconciliation
fails with the value-shape error and the pod CrashLoops with `exec
... operation not permitted`. Verification on otech103 in #898.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
93c4b700de
|
fix(bp-keycloak): templatize existingConfigmap reference for per-tenant installs (#899) (#902)
bp-keycloak 1.3.2 hardcoded `keycloak.keycloakConfigCli.existingConfigmap` to
the literal "keycloak-sovereign-realm-config". This worked for the Sovereign-
mothership bootstrap-kit (releaseName=keycloak emits matching ConfigMap) but
broke for every per-tenant install where releaseName=bp-keycloak emits
"bp-keycloak-sovereign-realm-config" — the post-install keycloak-config-cli
Job stuck in ContainerCreating with `MountVolume.SetUp failed for volume
"config-volume" : configmap "keycloak-sovereign-realm-config" not found`,
HelmRelease InstallFailed after 15m timeout, cascading to bp-openclaw and
bp-wordpress-tenant which dependsOn it.
The bitnami/keycloak subchart's `keycloak.keycloakConfigCli.configmapName`
helper (charts/keycloak/templates/_helpers.tpl) applies `tpl` to the
existingConfigmap value, so embedding `{{ .Release.Name }}` inside the
string resolves at chart-render time. With this single-line change:
- Sovereign-mothership (releaseName=keycloak) → keycloak-sovereign-realm-config (unchanged)
- Per-tenant (releaseName=bp-keycloak) → bp-keycloak-sovereign-realm-config (matches actual emitted ConfigMap)
Verified via helm template both modes — backendRef and config-volume
configMap.name match the actual ConfigMap emitted by
templates/configmap-sovereign-realm.yaml.
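The effect of templating the reference can be illustrated with a tiny stand-in for Helm's `tpl`. This Python sketch substitutes only `{{ .Release.Name }}` and is not Helm; the exact value string `TEMPLATED` is an assumption inferred from the two rendered names listed above.

```python
# Assumed shape of the templated values entry (inferred, not quoted from the chart).
TEMPLATED = "{{ .Release.Name }}-sovereign-realm-config"


def render_existing_configmap(value: str, release_name: str) -> str:
    """Minimal stand-in for tpl: expand only the release-name placeholder."""
    return value.replace("{{ .Release.Name }}", release_name)
```

Rendering with `keycloak` and with `bp-keycloak` gives the two ConfigMap names listed in the commit message.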
Chart bumped 1.3.2 → 1.3.3 + bootstrap-kit slot 09 + blueprint.yaml.
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
eddf0e62a4
|
fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter (#891) (#892)
* fix(catalyst-api): SME-tenant orchestrator writes parent kustomization.yaml index (#889)
The Flux Kustomization rendered by bp-catalyst-platform 1.4.13+ at clusters/<sov-fqdn>/sme-tenants/ requires a parent kustomization.yaml that enumerates tenant subdirectories. The orchestrator only wrote per-tenant overlays without the parent index, so on otech103 Flux hit:
kustomization path not found: stat /tmp/kustomization-... /clusters/otech103.omani.works/sme-tenants: no such file or directory
Even after a tenant signup, the parent path lacked a kustomization.yaml so Flux couldn't enumerate subdirs.
Fix: NEW writeParentTenantsIndex helper called from both WriteTenantOverlay and DeleteTenantOverlay. Scans the parent dir for subdirectories that contain kustomization.yaml, sorts them lexically for deterministic output (no spurious diffs), and writes a parent kustomization.yaml listing them under `resources:`. Empty list (no tenants) renders as `resources: []`, still a valid Kustomization root, so Flux stays Ready=True after the last tenant teardown. git add covers both the per-tenant subdir AND the parent index, so a single commit captures the delta.
Live on otech103 post-cutover, 2026-05-05.
* fix(self-sovereign-cutover): Step-5 widens GitRepository ignore filter to include clusters/<sov-fqdn>/ (#891)
After Day-2 cutover, the GitRepository ignore filter excluded the Sovereign's own clusters/<sov-fqdn>/ subtree. This made every Sovereign-specific Flux Kustomization (sme-tenants, future per-Sov overlays) hit "kustomization path not found" because source-controller filtered the path out of the artifact tarball.
Live on otech103 (2026-05-05): sme-tenants Kustomization stuck for 20+ minutes despite the orchestrator successfully committing the overlay to local Gitea.
Fix: Step-5 (flux-gitrepository-patch) now writes the patch as a multi-line YAML strategic-merge file via /tmp emptyDir (since the Pod runs readOnlyRootFilesystem), composing the new ignore filter:
/*
!/clusters/_template
!/clusters/${SOVEREIGN_FQDN}
!/platform
!/products
The SOVEREIGN_FQDN is wired from .Values.sovereign.fqdn (already established in the chart values). Bumps chart 0.1.14 -> 0.1.15. Slot 06a pin bumps in lockstep.
---------
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
|
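The parent-index behaviour from #889 can be sketched in Python. This is an illustrative re-implementation under assumptions, not the Go helper itself: the function name mirrors `writeParentTenantsIndex` but the exact file header the real helper writes is not shown in the commit, so the `apiVersion`/`kind` lines here are assumed.

```python
import os


def write_parent_tenants_index(parent_dir: str) -> list:
    """Write parent kustomization.yaml listing tenant subdirs; return the list."""
    os.makedirs(parent_dir, exist_ok=True)
    # A tenant counts only if its subdirectory carries its own kustomization.yaml.
    tenants = sorted(
        entry for entry in os.listdir(parent_dir)
        if os.path.isfile(os.path.join(parent_dir, entry, "kustomization.yaml"))
    )
    lines = ["apiVersion: kustomize.config.k8s.io/v1beta1", "kind: Kustomization"]
    if tenants:
        lines.append("resources:")
        lines.extend(f"  - {name}" for name in tenants)
    else:
        # An empty list is still a valid Kustomization root.
        lines.append("resources: []")
    with open(os.path.join(parent_dir, "kustomization.yaml"), "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return tenants
```

Sorting before writing keeps the file byte-identical across runs with the same tenant set, which is what avoids spurious diffs in Git.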
||
|
|
8e4c88fd28
|
fix(bp-self-sovereign-cutover): auto-sync local Gitea mirror from upstream GitHub (#870) (#875)
Step-1 gitea-mirror Job replaces the legacy one-shot create-empty-repo + git-push pattern with a single call to Gitea's native /repos/migrate API with mirror=true and mirror_interval=10m0s. Gitea now polls the upstream openova-io/openova repo on a 10-minute interval and replicates branches + tags into the local Sovereign Gitea automatically.
Closes the "Sovereign drifts from upstream main forever after Day-2 cutover" bug, hit twice during the otech103 2026-05-04 overnight DoD session, requiring manual `git fetch` inside the Gitea pod for every chart rollout.
Why /repos/migrate over the previous git push approach:
- Gitea cannot convert a regular repo into a pull-mirror after creation (the mirror flag is set at create-time only). The migrate endpoint creates the repo AS a mirror in one shot.
- The migrate endpoint accepts toggles for issues / pull-requests / wiki / labels / milestones / releases; we set them all to false so Gitea only replicates branches+tags, the only refs the Sovereign's Flux GitRepository needs.
- Recurring sync is a Gitea-native capability; using it avoids a parallel CronJob (which would violate the "event-driven not cron" inviolable principle) or a long-poll sidecar (which would duplicate what Gitea already does).
Idempotency: if the repo already exists from a prior cutover attempt, the script PATCHes mirror_interval to the desired value and POSTs to /mirror-sync to trigger an immediate refresh. Note that PATCH alone cannot convert a legacy non-mirror repo to a mirror; Sovereigns seeded by chart < 0.1.14 would need an operator-driven repo delete + re-migrate to retro-fit auto-sync, but new provisions take the migrate path automatically.
Verification on the rendered ConfigMap:
$ helm template smoke . # renders 16 docs cleanly
$ bash tests/cutover-contract.sh # all 7 gates green
$ sh -n <rendered-script> # POSIX shell syntax OK
Chart bumped 0.1.13 → 0.1.14 (Chart.yaml + blueprint.yaml spec.version aligned per #817 invariant + slot 06a-bp-self-sovereign-cutover.yaml pin lockstep).
Refs #870, #790.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
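A migrate request body along the lines described might look like the sketch below. This is an assumption-laden Python illustration, not the Job's shell script: the JSON field names follow my understanding of Gitea's migrate options and should be checked against the API reference for the deployed Gitea version, and the clone URL and owner/repo arguments are illustrative.

```python
import json


def migrate_payload(clone_addr: str, owner: str, repo: str,
                    interval: str = "10m0s") -> dict:
    """Build a pull-mirror migrate body that replicates only branches and tags."""
    payload = {
        "clone_addr": clone_addr,
        "repo_owner": owner,
        "repo_name": repo,
        "mirror": True,               # must be set at create time
        "mirror_interval": interval,  # Gitea polls upstream on this interval
    }
    # Metadata toggles are all disabled; Flux only needs branches and tags.
    for toggle in ("issues", "pull_requests", "wiki",
                   "labels", "milestones", "releases"):
        payload[toggle] = False
    return payload


# Illustrative values only.
body = json.dumps(migrate_payload(
    "https://github.com/openova-io/openova.git", "openova", "openova"))
```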
||
|
|
9b710049e3
|
fix(self-sovereign-cutover): Step-8 baseline-diff (only NEW regressions count) (#858)
Live otech103: Step-8 survival window failed because infrastructure-config Kustomization had been NotReady for 4h pre-cutover (Crossplane provider CRD ordering, unrelated to sovereignty). The sovereignty proof asks 'did cutover break anything', not 'is the cluster perfect'. Capture the baseline NotReady set before the window and fail only on NEW additions during it. Bumps 0.1.12 → 0.1.13 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
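The baseline-diff idea reduces to a set difference. A minimal Python sketch, assuming the NotReady resources are collected as names; the function name `new_regressions` is hypothetical and the real check runs inside the cutover Job's script.

```python
def new_regressions(baseline_not_ready, observed_not_ready) -> list:
    """Return resources that became NotReady after the baseline was captured."""
    # Anything already NotReady before the window is not blamed on the cutover.
    return sorted(set(observed_not_ready) - set(baseline_not_ready))
```

With a baseline of `{"infrastructure-config"}`, a later observation of the same set yields no regressions, while a newly NotReady resource is reported on its own.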
||
|
|
d5d1d9b2cd
|
fix(self-sovereign-cutover): Step-8 tolerate slot-managed self-ref HelmRepositories (#857)
Live otech103: Step-8 verification flagged 2 HelmRepositories (bp-newapi + bp-self-sovereign-cutover) still on ghcr.io/openova-io. Both are declared in clusters/_template/bootstrap-kit/ slot files which Flux Kustomization re-applies on every reconcile — Step-6's patch is transient for them. Data-plane impact is null because they're not pulled again until the next cutover cycle which would re-apply the patch first. The 38 leaf-bp HelmRepositories ARE patched durably (live in HelmRelease values, not separate slot files). Bumps 0.1.11 → 0.1.12 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
142ea21534
|
fix(self-sovereign-cutover): Step-8 passive architectural verification (Cilium can't egressDeny+toFQDNs) (#856)
Live otech103: Step-8 (egress-block-test) failed because Cilium 1.16's CiliumNetworkPolicy schema doesn't support 'spec.egressDeny[].toFQDNs' — strict-decoding error 'unknown field'. FQDN-based matching in Cilium is only allowed in 'egress' (allow), not 'egressDeny'. Pivot: Step-8 now asserts the architectural pivots from Steps 5-7 are actually live (GitRepository.url + all HelmRepositories + catalyst-api env all point at local Gitea/Harbor) BEFORE entering the durationSeconds survival window during which Flux Kustomization + HelmRelease readiness is polled. Same sovereignty proof, expressed in a form Cilium can evaluate. Bumps 0.1.10 → 0.1.11 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
86ae235804
|
fix(self-sovereign-cutover): catalyst-api namespace catalyst-system not catalyst-platform (#855)
Live otech103: Step-7 (catalyst-api-env-patch) hit 'deployments.apps catalyst-api not found' in catalyst-platform ns. Actual Sovereign-side namespace is catalyst-system. Bumps 0.1.9 → 0.1.10. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
dd84060d05
|
fix(self-sovereign-cutover): switch from bitnami/kubectl to alpine/k8s (#854)
Live otech103 2026-05-04: bitnami/kubectl:1.31.4 404 on Docker Hub. Bitnami deprecated public Docker Hub registry in 2025; their kubectl image stopped getting tags. alpine/k8s is the canonical alpine-based replacement — kubectl + helm + standard k8s CLI surface, actively maintained, :1.31.4 verified present. Bumps 0.1.8 → 0.1.9 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
887ff62200
|
fix(self-sovereign-cutover): bitnami/kubectl tag :1.31 → :1.31.4 (#853)
Live otech103 2026-05-04: Step-5 (flux-gitrepository-patch) Pod DeadlineExceeded after 10m of ImagePullBackOff. bitnami/kubectl on DockerHub doesn't have a floating :1.31 tag — only patch-level :1.31.X. Pin to :1.31.4 (latest of 1.31 minor as of today). Bumps 0.1.7 → 0.1.8 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
e9970db7b6
|
fix(self-sovereign-cutover): proxy-quay adapter type docker-registry (#852)
Live otech103: Harbor rejects project create with metadata.proxy_cache=true on registries with type 'quay' — HTTP 400 'unsupported registry type quay'. Quay speaks plain v2 so docker-registry is the correct adapter (4/7 projects ahead succeeded with the same shape). Bumps 0.1.6 → 0.1.7. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
ea51642092
|
fix(self-sovereign-cutover): proxy-ghcr Harbor adapter type 'github-ghcr' (#851)
Live otech103 2026-05-04: Step-2 harbor-projects POST /api/v2.0/registries returns 500 'adapter factory for github not found'. Harbor 2.x's canonical GHCR proxy-cache adapter is named 'github-ghcr', not 'github'. Bumps 0.1.5 → 0.1.6 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
8f96daeb6f
|
fix(self-sovereign-cutover): harbor service is 'harbor-core' not 'harbor-harbor-core' (#849)
Live failure on otech103 2026-05-04: Step-2 (harbor-projects) Pod exits silently after first echo because curl exit 6 (CURLE_COULDNT_RESOLVE_HOST). The chart's default harborInternalURL was http://harbor-harbor-core.harbor.svc.cluster.local but the actual bitnami harbor chart's service name is harbor-core (release name doesn't double-prefix when targetNamespace == 'harbor' AND releaseName == 'harbor'). Fix: harborInternalURL → http://harbor-core.harbor.svc.cluster.local. Verified via 'kubectl get svc -n harbor' on otech103. Bumps 0.1.4 → 0.1.5 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
ab5681e656
|
fix(self-sovereign-cutover): Step-1 use bare clone + explicit refspec push (#848)
Live failure on otech103 2026-05-04 even after 0.1.3: git push --all in a mirror clone still pushes refs/pull/* because mirror clones store all upstream refs (incl. GitHub PR refs) at the same level as refs/heads/, and --all walks the whole local refstore. Fix: use git clone --bare (not --mirror) which only fetches refs/heads/* and refs/tags/*, then push with explicit refspecs: git push origin 'refs/heads/*:refs/heads/*' git push origin 'refs/tags/*:refs/tags/*' Bumps 0.1.3 → 0.1.4 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
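Which refs the two explicit refspecs select can be illustrated with simple pattern matching. This Python sketch only models ref selection and does not invoke git; the function name and sample refs are hypothetical.

```python
from fnmatch import fnmatchcase

# Source sides of the two refspecs pushed by Step-1.
PUSH_PATTERNS = ("refs/heads/*", "refs/tags/*")


def refs_to_push(refs) -> list:
    """Keep branches and tags; drop everything else, such as refs/pull/*."""
    return [r for r in refs if any(fnmatchcase(r, p) for p in PUSH_PATTERNS)]
```

Given branch, tag, and `refs/pull/<n>/head` refs, only the branch and tag refs are returned.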
||
|
|
6322d82775
|
fix(self-sovereign-cutover): Step-1 push --all + --tags (skip GitHub PR refs) (#847)
Live failure on otech103 2026-05-04: git push --mirror to local Gitea rejected by Gitea's update hook on every refs/pull/<n>/head + refs/pull/<n>/merge ref (those are GitHub-specific metadata refs Gitea doesn't accept). Branches and tags push fine. Fix: split the push into 'git push --all' (branches) + 'git push --tags' (tags). Branches + tags are exactly what Flux GitRepository needs to reconcile from local Gitea — PR refs are upstream-only metadata not referenced by any consumer. Bumps bp-self-sovereign-cutover 0.1.2 → 0.1.3 + slot 06a pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
3015033136
|
fix(self-sovereign-cutover): Step-1 creates Gitea org before repo (#846)
Live failure on otech103 2026-05-04: Step-1 hit 'POST /orgs/openova/repos returns 404 Not Found' because the org openova doesn't exist on a fresh Gitea install. The /user/repos fallback would have created the repo under gitea_admin/openova, but the subsequent git push targets openova/openova so it fails with 'remote: Not found'.
Fix: explicit org-create step before repo-create. POST /orgs with {username, visibility} creates the org idempotently (swallow 422 'already exists'). Then POST /orgs/<org>/repos creates the repo under it. Push URL targets openova/openova as before.
Bumps bp-self-sovereign-cutover 0.1.1 → 0.1.2 + slot 06a pin lockstep.
Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
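The org-before-repo ordering with a tolerated 422 can be sketched as below. This is a Python illustration under assumptions, not the Job's script: `post` is a hypothetical callable that sends a JSON body to the Gitea API and returns the HTTP status code, and the `"public"` visibility value is assumed since the commit only names the field.

```python
def ensure_org_then_repo(post, org: str, repo: str) -> int:
    """Create the org (tolerating 'already exists'), then the repo under it."""
    status = post("/orgs", {"username": org, "visibility": "public"})
    # 201 = created, 422 = org already exists; anything else is a real failure.
    if status not in (201, 422):
        raise RuntimeError(f"org create failed with HTTP {status}")
    return post(f"/orgs/{org}/repos", {"name": repo})
```

Because the org call is tolerant of 422, re-running the step after a partial failure still reaches the repo-create call.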
|
||
|
|
e36089540d
|
fix(self-sovereign-cutover): Step-1 BusyBox-wget Basic auth header (--user not supported) (#845)
* fix(bp-gitea): mirror gitea-admin-secret to catalyst ns via reflector annotations Live failure on otech103 2026-05-04: cutover Step-1 gitea-mirror Job in catalyst ns CrashLoops with 'secret "gitea-admin-secret" not found' because K8s forbids cross-namespace secretKeyRef. The Secret created by bp-gitea 1.2.4 lives in the gitea ns; the cutover Job runs in the catalyst ns. Fix: add reflector.v1.k8s.emberstack.com annotations on the Secret so bp-reflector (already installed at slot 05a) mirrors it into the catalyst namespace. The Job's secretKeyRef then resolves locally. Reflector keeps the mirror in lockstep on password rotation. Bumps bp-gitea 1.2.4 → 1.2.5 + slot 10 pin lockstep. * fix(self-sovereign-cutover): Step-1 gitea-mirror BusyBox-wget compat (Basic auth header) Live failure on otech103 2026-05-04: Step-1 cutover-gitea-mirror Pod exits with 'wget: unrecognized option: password=...' because the alpine/git image bundles BusyBox wget which does NOT recognise --user / --password (those are GNU wget flags). Fix: build a base64'd Authorization: Basic header from $GITEA_USERNAME:$GITEA_PASSWORD and pass it via --header (BusyBox wget supports --header). Same Gitea API call surface, BusyBox-compatible wire. Bumps bp-self-sovereign-cutover 0.1.0 → 0.1.1 + slot 06a pin lockstep. --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
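The header the script builds is standard HTTP Basic authentication: base64 of `username:password`. A minimal Python sketch of the same construction; the function name is hypothetical and the real Job does this in POSIX shell.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the header line to pass to wget via --header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Authorization: Basic " + token.decode("ascii")
```

Passing the result with `--header` avoids the GNU-only `--user` / `--password` flags.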
||
|
|
66abe75b2e
|
fix(bp-gitea): mirror gitea-admin-secret to catalyst ns via reflector annotations (#844)
Live failure on otech103 2026-05-04: cutover Step-1 gitea-mirror Job in catalyst ns CrashLoops with 'secret "gitea-admin-secret" not found' because K8s forbids cross-namespace secretKeyRef. The Secret created by bp-gitea 1.2.4 lives in the gitea ns; the cutover Job runs in the catalyst ns. Fix: add reflector.v1.k8s.emberstack.com annotations on the Secret so bp-reflector (already installed at slot 05a) mirrors it into the catalyst namespace. The Job's secretKeyRef then resolves locally. Reflector keeps the mirror in lockstep on password rotation. Bumps bp-gitea 1.2.4 → 1.2.5 + slot 10 pin lockstep. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
c42e98216c
|
fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (curl -o + readOnlyRootFS) (#843)
* fix(bootstrap-kit,bp-newapi): bump slot pins (gitea 1.2.4, catalyst-platform 1.4.2) + gate Traefik Middleware on Cilium Sovereigns (bp-newapi 1.2.0) Three issues blocking the otech103 verification proof on a freshly merged main, all uncovered while live-driving the Day-2 Independence cutover: 1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pinned 1.4.0 — missed the bumps from PR #839 (1.4.1, RBAC dual-mode render) and PR #841 (1.4.2, POWERDNS env literal). Bumping the slot pin to 1.4.2 lands those fixes on every fresh provision. 2. clusters/_template/bootstrap-kit/10-gitea.yaml pinned 1.2.3 — missed the bump from PR #832 (1.2.4, gitea-admin-secret canonical Secret for cutover Step-1 to mount). Bumping to 1.2.4 unblocks bp-self-sovereign-cutover Step-1 (gitea-mirror Job). 3. platform/newapi/chart/templates/ingress.yaml hard-rendered a traefik.io/v1alpha1 Middleware resource. On a Cilium Gateway Sovereign that CRD does not exist; bp-newapi 1.1.0 install failed with 'no matches for kind Middleware'. Gating the Middleware behind .Values.ingress.middleware.enabled (default false) lets the chart install on Cilium Sovereigns; contabo / Traefik clusters can still flip it on per-overlay. Bumping to 1.2.0 (additive feature, default-off, no breaking change). Slot 80-newapi pin bumped lockstep. Verified live state on otech103.omani.works (deployment id 12dff5098e33053e): - bp-newapi 1.1.0 HR: Status=False 'Helm install failed: ... no matches for kind Middleware in version traefik.io/v1alpha1' - bp-catalyst-platform HR pinned at 1.4.0 (lacks RBAC for cutover-driver) - bp-gitea HR pinned at 1.2.3 (lacks gitea-admin-secret) After this PR merges + Flux reconciles otech103, all three HRs upgrade in place and the cutover proof can be driven to completion. 
* fix(bp-powerdns): zone-bootstrap Job needs /tmp emptyDir (readOnlyRootFS + curl -o) Caught live on otech103 2026-05-04: zone-bootstrap Job exit 23 (curl write error) because curl -o /tmp/zone-resp + readOnlyRootFilesystem=true and no /tmp emptyDir mount. Bumps bp-powerdns 1.2.0 → 1.2.1 + slot 11 pin lockstep. Without /tmp/zone-resp writable the Job CrashLoops every retry, never completes, bp-external-dns dependency stuck, Phase-1 watcher never reaches ready, handover never auto-fires. --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
7de05bab9d
|
fix(bootstrap-kit,bp-newapi): bump slot pins (gitea 1.2.4, catalyst-platform 1.4.2) + gate Traefik Middleware on Cilium Sovereigns (bp-newapi 1.2.0) (#842)
Three issues blocking the otech103 verification proof on a freshly merged main, all uncovered while live-driving the Day-2 Independence cutover: 1. clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml pinned 1.4.0 — missed the bumps from PR #839 (1.4.1, RBAC dual-mode render) and PR #841 (1.4.2, POWERDNS env literal). Bumping the slot pin to 1.4.2 lands those fixes on every fresh provision. 2. clusters/_template/bootstrap-kit/10-gitea.yaml pinned 1.2.3 — missed the bump from PR #832 (1.2.4, gitea-admin-secret canonical Secret for cutover Step-1 to mount). Bumping to 1.2.4 unblocks bp-self-sovereign-cutover Step-1 (gitea-mirror Job). 3. platform/newapi/chart/templates/ingress.yaml hard-rendered a traefik.io/v1alpha1 Middleware resource. On a Cilium Gateway Sovereign that CRD does not exist; bp-newapi 1.1.0 install failed with 'no matches for kind Middleware'. Gating the Middleware behind .Values.ingress.middleware.enabled (default false) lets the chart install on Cilium Sovereigns; contabo / Traefik clusters can still flip it on per-overlay. Bumping to 1.2.0 (additive feature, default-off, no breaking change). Slot 80-newapi pin bumped lockstep. Verified live state on otech103.omani.works (deployment id 12dff5098e33053e): - bp-newapi 1.1.0 HR: Status=False 'Helm install failed: ... no matches for kind Middleware in version traefik.io/v1alpha1' - bp-catalyst-platform HR pinned at 1.4.0 (lacks RBAC for cutover-driver) - bp-gitea HR pinned at 1.2.3 (lacks gitea-admin-secret) After this PR merges + Flux reconciles otech103, all three HRs upgrade in place and the cutover proof can be driven to completion. Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> |
||
|
|
e96741a0ca
|
feat(powerdns,cert-manager): multi-zone bootstrap + per-zone wildcard cert (#827) (#838)
A franchised Sovereign now supports N parent zones, NOT one. The operator brings 1+ parent domains at signup (`omani.works` for own use, `omani.trade` for the SME pool, etc.) and may add more post-handover via the admin console (#829). bp-powerdns 1.2.0 (platform/powerdns/chart): - New `zones: []` values key listing parent domains to bootstrap - New Helm post-install/post-upgrade hook Job (templates/zone-bootstrap-job.yaml) that POSTs each entry to /api/v1/servers/localhost/zones at install time. Idempotent on HTTP 409 — re-runs after upgrades or chart bumps never fail. - Default-values render skips when zones is empty (legacy behavior). bp-catalyst-platform 1.4.0 (products/catalyst/chart): - New `parentZones: []` + `wildcardCert.{enabled,namespace,issuerName}` values - New templates/sovereign-wildcard-certs.yaml renders one cert-manager.io/v1.Certificate per zone (each `*.<zone>` + apex) via the letsencrypt-dns01-prod-powerdns ClusterIssuer. Each cert renews independently. Skips entirely when parentZones is empty so the legacy clusters/_template/sovereign-tls/cilium-gateway-cert.yaml retains ownership of `sovereign-wildcard-tls` (avoids helm-vs-kustomize ownership flap). - New `catalystApi.{powerdnsURL,powerdnsServerID}` values threaded into the catalyst-api Pod as CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_SERVER_ID env vars. catalyst-api (products/catalyst/bootstrap/api): - New internal/powerdns package with typed Client (CreateZone, ZoneExists). Idempotent on HTTP 409/412. - handler.pdmCreatePowerDNSZone (issue #829's stub) now uses the typed client when wired via SetPowerDNSZoneClient — the admin-console "Add another parent domain" flow now creates real zones in the Sovereign's PowerDNS at runtime. - main.go wires the client when CATALYST_POWERDNS_API_URL + CATALYST_POWERDNS_API_KEY are set. - Comprehensive unit tests (client_test.go: 9 cases incl. 201/409/412/500 + custom NS + custom serverID). 
Bootstrap-kit slot integration: - clusters/_template/bootstrap-kit/11-powerdns.yaml: bumps to bp-powerdns 1.2.0 and threads `zones: ${PARENT_DOMAINS_YAML}` from Flux postBuild.substitute. - clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: bumps to bp-catalyst-platform 1.4.0 and threads `parentZones: ${PARENT_DOMAINS_YAML}` (same source-of-truth string so the two slots stay in lockstep). - infra/hetzner: new `parent_domains_yaml` Terraform variable (defaults to single-zone array derived from sovereign_fqdn) → cloud-init renders the PARENT_DOMAINS_YAML Flux substitute. DoD verified end-to-end with helm template + envsubst: - Multi-zone overlay (omani.works + omani.trade) renders 2 PowerDNS zone-create API calls in the bootstrap Job AND 2 Certificate resources (`*.omani.works`, `*.omani.trade`) in bp-catalyst-platform. - Single-zone fallback (PARENT_DOMAINS_YAML defaults to `[{name: "<sov_fqdn>", role: "primary"}]`) keeps legacy provisioning paths working without per-overlay edits. Closes #827. Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
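The per-zone fan-out can be sketched in Python: one certificate per parent zone covering the wildcard and the apex, and a zone-create call that treats "already exists" as success. The function names are hypothetical, and the zone entries are assumed to be dicts with a `name` key as in the single-zone fallback shown above.

```python
def wildcard_cert_dns_names(parent_zones) -> dict:
    """Map each parent zone to the dnsNames of its own Certificate."""
    return {z["name"]: [f"*.{z['name']}", z["name"]] for z in parent_zones}


def zone_create_succeeded(status: int) -> bool:
    """Bootstrap Job view: 201 created, 409 already exists, both are fine."""
    return status in (201, 409)
```

For `omani.works` and `omani.trade` this produces two independent certificates, matching the two Certificate resources mentioned in the DoD check.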
||
|
|
dbbbcfa7dc
|
fix(bp-gitea): ship gitea-admin-secret with random password (#830) (#832)
bp-self-sovereign-cutover Step 1 (gitea-mirror) was stuck in CreateContainerConfigError on otech102 because the cutover PodSpec referenced `gitea-admin-secret` with `username`/`password` keys which no chart materialised. Worse, the upstream gitea subchart fell through to its hardcoded default password `r8sA8CPHD9!bt6d` whenever no existingSecret was set — every fresh Sovereign would have shipped with identical admin credentials. Add templates/admin-secret.yaml: a Catalyst-curated Secret named `gitea-admin-secret` with `username` (default `gitea_admin`) and `password` (32-char random alphanumeric, generated on first install, preserved across reconciles via Helm `lookup`). Wire `gitea.gitea.admin.existingSecret = gitea-admin-secret` so the upstream init container reads its admin creds from this Secret instead of the hardcoded default. The same Secret is consumed by bp-self-sovereign- cutover Step 1. Resource-policy keep + lookup-based persistence guarantees the password bytes are stable across helm upgrade, helm rollback, Flux re- reconciliation, even helm uninstall + reinstall. Bumps bp-gitea 1.2.3 → 1.2.4 (Chart.yaml + blueprint.yaml). Issue: openova-io/openova#830 (Bug 2) Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
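The lookup-or-generate pattern can be sketched in Python. This is an illustration of the idea rather than the Helm template: `existing_secret` stands in for whatever Helm's `lookup` would return (a dict of decoded data, or None on first install), and the function name is hypothetical.

```python
import secrets
import string

ALPHANUM = string.ascii_letters + string.digits


def admin_password(existing_secret) -> str:
    """Reuse the stored password if present, otherwise generate 32 random chars."""
    if existing_secret and existing_secret.get("password"):
        return existing_secret["password"]
    return "".join(secrets.choice(ALPHANUM) for _ in range(32))
```

Reusing the stored value is what keeps the password stable across upgrades, while the first install gets a fresh random value.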
||
|
|
ab67a48fe7
|
fix(blueprints): align blueprint.yaml spec.version with Chart.yaml version (#817) (#819)
TestBootstrapKit_BlueprintCardsHaveRequiredFields was failing on main for
9 blueprints because their platform/<name>/chart/Chart.yaml version had
been bumped without a matching update to platform/<name>/blueprint.yaml
spec.version. The pre-existing failure forced 7 recent PRs to self-merge
with --admin, masking real CI failures.
Aligned spec.version to match Chart.yaml version on:
cert-manager 1.1.1 -> 1.1.2
flux 1.1.3 -> 1.1.4
crossplane 1.1.3 -> 1.1.4
sealed-secrets 1.1.1 -> 1.1.2
spire 1.1.4 -> 1.1.7
nats-jetstream 1.1.1 -> 1.1.2
openbao 1.2.0 -> 1.2.14
keycloak 1.3.1 -> 1.3.2
gitea 1.2.1 -> 1.2.3
Verified locally:
$ go test ./... -run TestBootstrapKit_BlueprintCardsHaveRequiredFields -count=1
--- PASS: TestBootstrapKit_BlueprintCardsHaveRequiredFields (0.01s)
... all 10 sub-tests pass (cilium + the 9 above)
The existing test (tests/e2e/bootstrap-kit/main_test.go:145) is itself
the drift guardrail: it fails CI whenever Chart.yaml is bumped without a
matching blueprint.yaml bump. No additional script needed.
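The drift the test guards against can be expressed as a comparison of two version maps. A minimal Python sketch, assuming the versions have already been read from each Chart.yaml and blueprint.yaml; the function name is hypothetical and the real guardrail is the Go test named above.

```python
def version_drift(chart_versions: dict, blueprint_versions: dict) -> dict:
    """Return {name: (blueprint version, chart version)} for every mismatch."""
    return {
        name: (blueprint_versions.get(name), chart)
        for name, chart in sorted(chart_versions.items())
        if blueprint_versions.get(name) != chart
    }
```

An empty result means every blueprint.yaml spec.version matches its Chart.yaml version.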
Closes #817 once verified on main.
Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
|
||
|
|
9645a9044a
|
feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) (#818)
* feat(metering): NewAPI NATS publisher + sme-billing subscriber + POST /metering/record (#798) Per #795 [Q-mine-3] (NATS not RedPanda) + [Q-mine-4] (one ledger), add the SME-2 metering integration end-to-end. NewAPI is consumed as the upstream image `ghcr.io/openova-io/openova/newapi-mirror` (a pinned mirror, not a fork) — the metering envelope is produced by a Go sidecar that observes the OpenAI-style `usage.total_tokens` field on every 2xx /v1/* response. This avoids forking the upstream binary while still producing the canonical envelope shape on `catalyst.usage.recorded`. A) NewAPI metering sidecar — core/services/metering-sidecar/ - Transparent reverse proxy in front of NewAPI on its own port; the bp-newapi Service routes the cluster-fronting port to the sidecar, which forwards to NewAPI on the pod's loopback. - Observes successful /v1/* JSON responses, parses `usage.{prompt_tokens,completion_tokens,total_tokens}`, computes amount_micro_omr = -tokens * priceMicroOMRPerToken, and publishes one envelope on `catalyst.usage.recorded` per completed request. - Failed (non-2xx), non-JSON, and admin-path requests are NOT billed. - Customer-facing latency is NEVER blocked on metering: the response body is restored before publish; on NATS unreachable the envelope is persisted to disk and retried by a background drain loop. - 14 unit tests (proxy + publisher + safeFilename guards). B) sme-billing NATS subscriber — core/services/billing/handlers/ metering_consumer.go - JetStream durable consumer `sme-billing-metering` on stream `CATALYST_USAGE` (provisioned by sme-billing on startup). - Idempotent on metadata.request_id via a UNIQUE partial index on credit_ledger.external_ref; redelivery from the broker collapses to a single ledger row. - Customer auto-create on cold start (the rbac sme.user.created envelope may land AFTER the first metered request; we don't strand usage waiting for it). 
- 11 unit tests covering happy-path, idempotency, malformed-payload poison-pill, missing-request-id, non-negative amount guard, resolver error → Nak, derive-micro-OMR-from-OMR, DB-error → Nak. C) HTTP handler POST /billing/metering/record — handlers/metering.go - Synchronous validate → INSERT credit_ledger → return {ledger_entry_id, balance_after_omr, balance_after_micro_omr, duplicate}. Same payload + idempotency guard as the NATS path. - Auth: superadmin OR sovereign-admin (operator-admin model; end-user LLM traffic flows through the sidecar, never this URL). - 8 unit tests covering happy-path, idempotency, role gating, malformed-JSON, positive-amount rejection, customer-not-found. D) Schema — core/services/billing/store/store.go - ALTER TABLE credit_ledger ADD COLUMN amount_micro_omr BIGINT (1 OMR = 1,000,000 micro-OMR; -0.000234 OMR = -234 micro-OMR exact integer — preserves precision at metering rates). - ADD COLUMN external_ref TEXT + UNIQUE partial index for idempotency dedup. - ADD COLUMN metadata JSONB for the raw envelope. - GetCreditBalance projects both amount_omr (legacy) and amount_micro_omr (new) into the integer-OMR view. - GetCreditBalanceMicroOMR returns canonical precision. - RecordUsage method: ON CONFLICT DO UPDATE … RETURNING (xmax<>0) distinguishes fresh insert from duplicate without a follow-up SELECT. E) Wiring - core/services/shared/events/nats.go — minimal NATS JetStream publisher + subscriber surface; legacy RedPanda producer/consumer in events.go untouched per [Q-mine-3]. - core/services/billing/main.go — NATS_URL env; subscriber wired in parallel with the existing RedPanda tenant-events consumer. - middleware/jwt.go — exported test helper WithClaims so handler tests can construct an authenticated context without minting a real signed token. - .github/workflows/services-build.yaml — metering-sidecar added to the build matrix; deploy job skips it (image consumed by the bp-newapi chart, not products/catalyst sme-services). 
F) bp-newapi chart (1.0.0 → 1.1.0) - meteringSidecar block in values.yaml: image, port, NATS URL, priceMicroOMRPerToken (default 156 = 0.000156 OMR/token), spool dir, header names, resources, securityContext (read-only-rootfs). - deployment.yaml renders the sidecar container + emptyDir spool volume when meteringSidecar.enabled (default true). - service.yaml routes the cluster-fronting :3000 to the sidecar when enabled, exposes a separate :3001 → NewAPI direct port for bp-catalyst-platform admin-API traffic (ADR-0003 §3.2). - networkpolicy.yaml allows the sidecar's port + nats-system egress for JetStream publish. Tests: 33 new (14 sidecar + 11 subscriber + 8 HTTP handler), all green. Helm template renders cleanly with sidecar enabled and disabled. Closes #798 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(billing/store): cast SUM to BIGINT so lib/pq scans into int64 (#798) Postgres returns `SUM(int) + SUM(bigint)/integer` as `numeric`, which lib/pq presents as a `[]uint8` decimal string ("50.000000000000000000000000") that does NOT scan directly into Go int64 — the integration test TestVoucherLifecycle_IssueRedeemAndCreditApplied caught this in CI on the post-redeem balance read. Wrap the SUM expressions in CAST(... AS BIGINT) so the column type is unambiguously bigint and Scan target stays uniform across pre-#798 rows (amount_omr only) and post-#798 rows (amount_micro_omr present). Affects: - GetCreditBalance - GetCreditBalanceMicroOMR - RecordUsage's running-balance read Test mocks updated to match the new SQL prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
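The micro-OMR arithmetic described in sections A, D and F of this commit can be sketched as follows. This is a minimal illustration in Python rather than the sidecar's Go code; the function names are hypothetical, and only the formula (`amount_micro_omr = -tokens * priceMicroOMRPerToken`), the 1,000,000 scale factor and the default price of 156 come from the message.

```python
from fractions import Fraction

# 1 OMR = 1,000,000 micro-OMR, as stated in the commit message.
MICRO_PER_OMR = 1_000_000


def amount_micro_omr(total_tokens: int, price_micro_omr_per_token: int = 156) -> int:
    """Return the ledger debit (zero or negative) in micro-OMR for one request.

    Hypothetical helper: mirrors amount_micro_omr = -tokens * priceMicroOMRPerToken.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return -total_tokens * price_micro_omr_per_token


def micro_to_omr(amount_micro: int) -> Fraction:
    """Convert micro-OMR to an exact OMR value, avoiding float rounding."""
    return Fraction(amount_micro, MICRO_PER_OMR)
```

Keeping the amount as an integer and converting only for display is what lets a debit such as -234 micro-OMR stay exact, which is the stated reason for the BIGINT column.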
||
|
|
a6d2d25598
|
feat(bp-stalwart-tenant): per-SME dedicated mail server v0.1.0 (#801) (#815)
Adds platform/stalwart-tenant/ Blueprint chart implementing locked decision [Q3] of EPIC #795 — every SME on a Sovereign gets its OWN Stalwart instance in its tenant namespace, with its OWN domain, OWN MTA reputation, and OWN queue. NOT a shared otech-level multi-domain Stalwart. Components shipped: • StatefulSet (single-replica, RocksDB on PVC) • Service x3: SMTP/submission LoadBalancer, IMAP/IMAPS LoadBalancer, webmail/JMAP ClusterIP (fronted by Cilium Gateway HTTPRoute) • HTTPRoute (gateway mode, default) or Ingress (fallback) for webmail UI at https://mail.<sme-domain> • ConfigMap config.toml — Stalwart bootstrap config; OIDC bound to SME-vcluster Keycloak realm; uses == not = in expressions per stalwart_expression_syntax.md memory (incident 2026-04-14) • ConfigMap dns-records-required — MX/SPF/DKIM/DMARC for the SME admin (free-subdomain mode → published to PowerDNS by unified-rbac; BYO mode → surfaced in unified-rbac console UI for SME admin) • ExternalSecret x2 — admin password + OIDC client secret pulled from OpenBao at canonical paths sovereign/<sov>/stalwart/<tenant>/{admin,oidc} • Job (post-install) — bootstraps admin principal with email-receive permission and send-allow row; idempotent; covers stalwart_send_as.md group-permission gotcha (incident 2026-04-20) • NetworkPolicy — default-deny + explicit allows (SMTP/IMAP from anywhere, webmail from gateway namespace, egress to Keycloak/NATS/ PowerDNS/DNS/outbound SMTP) • Tests: chart/tests/expression-syntax.sh — audits rendered config for the `==` rule Per-user mailbox provisioning is event-driven (ADR-0003 §3): unified-rbac POSTs Stalwart's /api/principal admin API on sme.user.created. The continuous NATS subscriber Deployment is OFF by default (chart-level); per-tenant overlay flips it on once the SME vcluster's NATS subject is known. Image SHA-pinned: docker.io/stalwartlabs/stalwart:v0.16.3 @ sha256:5d75cff4e9c6d75e64636e9ef9674b1d877f8f6fb2e11ee8176fbad3faaa5289 (Inviolable Principles #4 + #4a). 
global.imageRegistry rewrite supported for post-handover Sovereign Harbor proxy-cache (ADR-0001 §11.5). Smoke render passes with default values (623 lines, 8 manifests). helm lint clean. Required values gated via per-template render-gates, not fail() at chart root, so the platform-wide blueprint-release.yaml hollow-chart + smoke gates pass (issue #181 + bp-openclaw 2026-05-04 failure mode avoided). Closes #801 (chart published; UAT after smoke-deploy). Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
3e7284de45
|
fix(bp-wordpress-tenant): default-values smoke render must succeed (#800) (#814)
The Blueprint Release workflow runs `helm template <chart>` with NO
overrides as a smoke gate before publishing the OCI artifact. After
#800's initial merge (
|
||
|
|
d6dedb1ecd
|
fix(bp-openclaw): use placeholder defaults so blueprint-release smoke render passes (#803) (#813)
The blueprint-release CI workflow runs `helm template <chart>` with default values as a smoke gate (.github/workflows/blueprint-release.yaml SMOKE step). The original chart shipped empty-string defaults for every required value (keycloak.realmURL, tenant.namespace, etc.) and used `required` / `fail` to abort render, which is correct fail-fast behaviour for real installs but wrongly fails CI's default-values smoke step. Result: bp-openclaw 0.1.0 never published to GHCR (run 25335221500 fail).
Match the bp-self-sovereign-cutover pattern (PR #791): provide placeholder defaults that let smoke render produce valid YAML, gated behind a new `assertNoPlaceholders` toggle that per-cluster Flux overlays MUST set to `true`. With the toggle ON, _helpers.tpl :: assertNoPlaceholders fails render with a clear message identifying any placeholder still in place.
Changes:
- values.yaml: add placeholder defaults for keycloak.realmURL, keycloak.clientSecretName, newapi.baseURL, tenant.namespace, ingress.host, controller.image.tag, perUserPod.image.tag. Add `assertNoPlaceholders: false` flag (overlays set true).
- _helpers.tpl: replace assertRequired with assertNoPlaceholders. Same intent, but it runs only when the toggle is on, so smoke render passes while real installs still get fail-fast on bad overlays.
- serviceaccount.yaml: invoke assertNoPlaceholders instead of assertRequired.
- controller-deployment.yaml + controller-ingress.yaml: drop the `required` calls (defaults are now valid bytes; the assertNoPlaceholders helper enforces real values at install time).
- tests/render-toggles.sh: rewrite Case 1 (now expects success) and Case 2 (asserts assertNoPlaceholders=true fails on placeholders) + Case 2b (assertNoPlaceholders=true with real values succeeds).
All 7 gates pass locally.
Output (post-merge): chart published to oci://ghcr.io/openova-io/bp-openclaw:0.1.0.
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
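The toggle-gated placeholder check described in this commit could look roughly like the sketch below. It is written in Python for illustration only; the real check is a Helm helper in _helpers.tpl, and the `PLACEHOLDER` prefix and both function names are assumptions, since the chart's actual placeholder strings are not shown in the message.

```python
# Hypothetical marker: the chart's real placeholder strings are not shown in the commit.
PLACEHOLDER_PREFIX = "PLACEHOLDER"


def find_placeholders(values, path=""):
    """Return dotted paths of every string value that still looks like a placeholder."""
    found = []
    if isinstance(values, dict):
        for key, val in values.items():
            child = f"{path}.{key}" if path else key
            found.extend(find_placeholders(val, child))
    elif isinstance(values, str) and values.startswith(PLACEHOLDER_PREFIX):
        found.append(path)
    return found


def assert_no_placeholders(values):
    """Fail only when the assertNoPlaceholders toggle is on and placeholders remain."""
    if not values.get("assertNoPlaceholders", False):
        return  # smoke-render path: placeholder defaults are allowed
    leftover = find_placeholders(values)
    if leftover:
        raise ValueError("placeholder values still set: " + ", ".join(sorted(leftover)))
```

With the toggle off (the default-values smoke render) the check is a no-op; an overlay that sets the toggle to true but forgets a value gets an error naming the offending keys.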
||
|
|
20b3c5258a
|
feat(bp-newapi): chart maturation + first-otech deploy + Qwen vLLM channel (#799) (#812)
* feat(bp-newapi): chart maturation — ExternalSecret + first-otech vLLM channel + skip-render gates (#799) Maturation work for the SME-3 turnkey-experience epic (#795). Aligns the bp-newapi scratch chart with ADR-0003 (RBAC ↔ NewAPI user-create hook contract) and gets it past the blueprint-release CI smoke render that has blocked publication since PR #396 (run 25213444992 failed at default-values render of v1.0.0). Changes ------- - templates/external-secret.yaml (NEW). Renders the `catalyst-newapi-admin-token` ExternalSecret consumed by unified-rbac (ADR-0003 §3.2 + §6) for issuing per-user keys against `http://newapi.newapi.svc/api/v1/admin/users`. Sourced from OpenBao via the `vault-region1` ClusterSecretStore (canonical default shipped by bp-external-secrets-stores). Capabilities-gated on `external-secrets.io/v1beta1` so cold installs without ESO don't fail-render. Operator supplies the per-Sovereign OpenBao path via `catalystIntegration.externalSecret.remoteRef.key`; canonical convention is `sovereign/<sovereign-fqdn>/newapi/admin-token` with property `ADMIN_API_TOKEN`. Per Inviolable Principle #4 every knob is operator-overridable in the cluster overlay. - values.yaml. Adds `catalystIntegration.externalSecret.{enabled, refreshInterval, secretStoreRef.{kind,name}, remoteRef.{key,property}}` block (default enabled=true, key="" so a misconfigured overlay fails loudly at render rather than silently skipping). Adds `defaultChannels.vllm` block — first-otech shorthand that composes a vLLM-typed channel into the rendered channels list when enabled. Default endpoint is empty per Inviolable Principle #4; the `clusters/<sovereign>/bootstrap-kit/80-newapi.yaml` overlay supplies the per-Sovereign URL (canonical first-otech reference = `https://llm-api.omtd.bankdhofar.com` model `qwen3-coder`, the same upstream Axon uses on the OpenOva marketing deployment). - templates/_helpers.tpl. 
New `bp-newapi.effectiveChannels` helper composes `.Values.channels` with `defaultChannels.vllm` (when enabled). The `assertChannelAttestation` helper now operates on the effective list so attestation gates apply to defaultChannels composition too. `defaultChannels.vllm.enabled=true` with empty endpoint fails-fast at render with a guided error message. - templates/configmap.yaml. Channels rendering switches to the effectiveChannels helper. OIDC block now skip-renders gracefully when `auth.adminUI.keycloak.issuer` is unset (smoke-render path) instead of `required`-failing; the per-Sovereign overlay sets the issuer. - templates/deployment.yaml. Skip-render gate on Deployment when `database.existingSecret`, `credentials.existingSecret`, or (when Keycloak mode is selected) the OIDC client secret is missing. Removes the four `required` calls that were failing CI smoke render. Service, ServiceAccount, ConfigMap, NetworkPolicy still render so the smoke test gets a non-empty output proving structural soundness; the actual Deployment defers until the per-Sovereign overlay wires the secrets. - templates/ingress.yaml. Same skip-render pattern: when either `ingress.host` or `ingress.adminHost` is empty, the entire ingress block is silently skipped. Matches the bp-keycloak / bp-openbao / bp-external-dns HTTPRoute templates. - Chart.yaml. version 1.0.0 → 1.1.0 (minor bump — additive features; no breaking changes to existing operator overrides). Verification ------------ `helm template` smoke render on default values now succeeds with 4 resources (NetworkPolicy / ServiceAccount / ConfigMap / Service); 168 lines, well above the CI 5-line minimum. With a full per-Sovereign overlay (hosts + secrets + Keycloak issuer + ESO Capabilities + Traefik Capabilities + defaultChannels.vllm.endpoint), 8 resources render including Deployment, both Ingresses, the Traefik allowlist Middleware, and the ExternalSecret. 
The composed qwen channel writes through to `channels.yaml` with the expected endpoint + models + attestation. Refs ---- ADR-0003 §3.2 + §6 — admin-token contract Issue #795 (epic) — locked decisions Issue #796 — hook contract spec (sequential blocker, merged) Inviolable Principles #1, #3, #4 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(bootstrap-kit): slot 80 — bp-newapi default install (#799) Adds the canonical install slot for bp-newapi to every fresh Sovereign's bootstrap-kit. Sequenced after the W2.K1 dependency wave so NewAPI's ExternalSecret + Postgres DSN dependencies resolve on first reconcile. The HelmRelease declares `dependsOn: [bp-openbao, bp-keycloak, bp-cnpg]`: - bp-openbao(08): admin-token ExternalSecret backend - bp-keycloak(09): OIDC issuer for ops-staff admin UI at admin.<fqdn> - bp-cnpg(16): Postgres backing for users/credits/channels/audit Per-Sovereign overlays inherit the slot's defaults and override: - ingress.host api.${SOVEREIGN_FQDN} - ingress.adminHost admin.${SOVEREIGN_FQDN} - auth.adminUI.keycloak.issuer - database.existingSecret (Crossplane-claimed) - credentials.existingSecret - catalystIntegration.externalSecret.remoteRef.key sovereign/${FQDN}/newapi/admin-token - defaultChannels.vllm.enabled true (first-otech) - defaultChannels.vllm.endpoint (operator-supplied) The `_template/` slot keeps `defaultChannels.vllm.enabled: false` so a fresh Sovereign does not silently wire customers to a third-party endpoint; the canonical first-otech reference (Qwen3 Coder via `https://llm-api.omtd.bankdhofar.com`, same relay Axon uses on the OpenOva marketing deployment) is documented in-line for operators adopting the same upstream. 
Refs: #795 (epic), ADR-0003 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bootstrap-deps): register bp-newapi slot 80 in expected DAG (#799) Fixes the dependency-graph-audit drift detection caught at PR #812 CI: the audit script enumerates HelmReleases in clusters/_template/bootstrap-kit/ and compares to scripts/expected-bootstrap-deps.yaml; an HR present on disk but absent from the expected DAG is treated as drift. Adds the canonical entry for bp-newapi at slot 80 with the same depends_on set declared on the HelmRelease itself ([bp-openbao, bp-keycloak, bp-cnpg]). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(bp-newapi): align blueprint.yaml spec.version with Chart.yaml (#799) The TestBootstrapKit_BlueprintCardsHaveRequiredFields static-validation gate asserts Chart.yaml version == blueprint.yaml spec.version. The chart was bumped to 1.1.0 in c63ecd8c; bumping the blueprint metadata to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
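The `effectiveChannels` composition described in the first commit of this squash can be sketched as below. This is a Python illustration under stated assumptions, not the Helm helper itself: the function name and the channel field names (`name`, `type`, `models`) are hypothetical; only the `channels` / `defaultChannels.vllm.{enabled,endpoint}` keys and the fail-fast behaviour on an empty endpoint come from the message.

```python
def effective_channels(values):
    """Compose operator-supplied channels with the optional default vLLM channel.

    Raises ValueError when defaultChannels.vllm is enabled without an endpoint,
    mirroring the fail-fast render described in the commit message.
    """
    channels = list(values.get("channels", []))
    vllm = values.get("defaultChannels", {}).get("vllm", {})
    if vllm.get("enabled", False):
        endpoint = vllm.get("endpoint", "")
        if not endpoint:
            raise ValueError(
                "defaultChannels.vllm.enabled=true requires defaultChannels.vllm.endpoint"
            )
        # Field names below are illustrative assumptions, not the chart's schema.
        channels.append(
            {
                "name": "vllm",
                "type": "vllm",
                "endpoint": endpoint,
                "models": list(vllm.get("models", [])),
            }
        )
    return channels
```

Running any downstream gate (such as the attestation check) on the composed list rather than on `channels` alone is what makes the default channel subject to the same rules as operator-declared ones.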
||
|
|
c141fcd1d3
|
feat(bp-wordpress-tenant): turnkey SSO-wired WordPress per SME (#800) (#811)
New scratch Blueprint chart `bp-wordpress-tenant` v0.1.0 that provisions a turnkey, SSO-pre-wired WordPress instance per SME tenant inside the SME's vcluster, satisfying ticket #800 (SME-5) of the #795 SME-tenant turnkey experience epic. What it provisions: - Deployment of `wordpress:6-php8.3-apache` (manifest-list digest sha256:054e611...196), pulled through the Sovereign Harbor proxy-cache when `global.imageRegistry` is set (per INVIOLABLE-PRINCIPLES #4). - Two initContainers seed wp-content/ from the image onto the PVC and install the openid-connect-generic plugin + pg4wp Postgres drop-in from wordpress.org / GitHub. Idempotent, runs only once per PVC. - Postgres provisioned in-tenant via a `Cluster.postgresql.cnpg.io` (default `wordpress-db`, 1 instance, 10Gi, pg16). The CNPG-emitted `<cluster>-app` Secret is mirrored into `wordpress-database-secret` by Reflector + a post-install sync Job (otech30 race fix carried forward from bp-gitea). - PVC for `/var/www/html/wp-content/` (default 10Gi, RWO, helm.sh/resource-policy: keep so customer content survives `helm uninstall`). - Ingress at `wordpress.<smeDomain>` with cert-manager TLS via operator-supplied ClusterIssuer (default `letsencrypt-prod`). - NetworkPolicy restricting egress to bp-cnpg :5432, Keycloak :8443/:8080, kube-dns, and HTTPS to public IPs (for plugin/theme fetches). - Three post-install Jobs: hook weight 5 — db-secret-sync (PATCHes wordpress-database- secret.password from CNPG <cluster>-app) hook weight 10 — oidc-config (UPSERTs openid_connect_generic_ settings, active_plugins, template/stylesheet, siteurl/home rows in wp_options via PHP+PDO) hook weight 15 — admin-user (INSERT/UPDATE wp_users + wp_usermeta for SME admin's email with administrator role) After all hooks complete, the SME admin's first browser hit lands on /wp-admin authenticated via Keycloak SSO — no install wizard, no manual config. 
Hollow-chart guard (issue #181) satisfied via the `common` library subchart from sigstore, matching bp-newapi's pattern for scratch charts (no first-party WordPress Helm chart exists upstream). Tests: - chart/tests/observability-toggle.sh verifies BLUEPRINT-AUTHORING §11.2 (default render produces no PodMonitor/ServiceMonitor). - `helm template` smoke render with required values produces 11 K8s resources cleanly; `helm lint` zero-failure. Refs: #800, #795 Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> |
||
|
|
93bd3ace5b
|
feat(bp-openclaw): workspace controller + per-user pod chart (#803) (#810)
Implements locked decision [A] of epic #795: per-SME-tenant workspace controller deployment + per-user runtime pod, identity-blind by construction. Consumes the per-user newapi-key-{uuid} Secrets rendered by the unified-rbac user-create hook (ADR-0003 §3.3). What this delivers: - platform/openclaw/chart/ bp-openclaw v0.1.0 (no-upstream) - platform/openclaw/runtime/ Go reference runtime (NEWAPI_BASE_URL + NEWAPI_KEY env contract only) - .github/workflows/openclaw-runtime.yaml Event-driven build for the runtime image (paths-on-push + manual rerun; NO schedule:cron per CLAUDE.md). - platform/openclaw/blueprint.yaml Catalyst registration + configSchema. Chart highlights: - Required values guarded by _helpers.tpl :: assertRequired so missing realmURL/clientSecretName/tenant.namespace/baseURL/host fail render with helpful messages. - RBAC: namespaced Role in tenant ns; create verbs split into separate rules WITHOUT resourceNames per feedback_rbac_create_no_resourcenames.md. Label-based ownership (catalyst.openova.io/openclaw-user) enforced at the controller, not in RBAC. - ingress: cert-manager.io/cluster-issuer annotation triggers ACME auto-issuance for openclaw.<sme-domain>. - per-user pod template ConfigMap holds the pod-spec the controller renders per session, with ${USER_UUID}/${SECRET_NAME} placeholders filled at session-start. - networkPolicy covers controller pod only; per-user pod NetworkPolicy is rendered by the controller at session-start (target hostname is read from the per-user Secret which doesn't exist at chart-render time — documented in README.md). Tests: chart/tests/render-toggles.sh (7 cases) covers required-value enforcement, RBAC create+resourceNames violation guard, ServiceMonitor default-off, networkPolicy toggle, pod-template placeholder presence, cert-manager annotation. All seven gates pass locally. Closes part of #795 (epic still open). 
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
33dc98782b
|
feat(bp-self-sovereign-cutover): chart + bootstrap-kit slot 06a (#791) (#808)
New platform Blueprint at `platform/self-sovereign-cutover/chart/`. Ships
DORMANT — eight step PodSpec ConfigMaps, the registry-pivot DaemonSet, the
mutable cutover-status ConfigMap, plus ServiceAccount/RBAC. The catalyst-api
cutover endpoint (#792, merged at
|
||
|
|
2e981f36a5
|
fix(bp-keycloak): catalyst-kc-sa-credentials addr → in-cluster Service URL (closes #781) (#788)
Sovereign-side catalyst-api Pod's intra-cluster Keycloak calls (token
mint, EnsureUser) were failing with `dial tcp: lookup
auth.<sov-fqdn> on 10.43.0.10:53: no such host`. The Sovereign's
CoreDNS resolves *.<sov-fqdn> via upstream resolvers — it does NOT
forward to the in-cluster PowerDNS that holds those records. Public
DNS works (PowerDNS authoritative), but Pod-side lookups of
auth.<sov-fqdn> return NXDOMAIN.
Live evidence — otech94 2026-05-04: handover URL returned
`{"error":"keycloak error: ensure user"}` from a DNS lookup failure
inside the catalyst-api Pod.
Fix: bp-keycloak chart now writes the in-cluster Service URL
(http://<release>.<namespace>.svc.cluster.local) into the
catalyst-kc-sa-credentials Secret's `addr` key instead of the public
gateway host (https://auth.<sov-fqdn>). This Secret is consumed
EXCLUSIVELY by the in-cluster catalyst-api Pod via reflector mirror
into catalyst-system; it is NEVER exposed to browsers.
The HTTPRoute hostname (.Values.gateway.host) stays at auth.<sov-fqdn>
for operator browsers — only the Pod's intra-cluster OAuth
client_credentials calls switch to the Service URL.
Catalyst-Zero (contabo) is unaffected: it runs `keycloak-zero`
(separate chart in openova-private), not bp-keycloak.
Changes:
- platform/keycloak/chart/templates/configmap-sovereign-realm.yaml:
Secret's $kcAddr unconditionally uses
http://<release>.<namespace>.svc.cluster.local
- platform/keycloak/chart/Chart.yaml: 1.3.1 → 1.3.2
- clusters/_template/bootstrap-kit/09-keycloak.yaml: chart version 1.3.1 → 1.3.2
- products/catalyst/chart/Chart.yaml: 1.3.0 → 1.3.1 (changelog entry only)
- clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml: 1.3.0 → 1.3.1
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
53bc4357ca
|
feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) (#776)
* feat(provisioner): cluster-autoscaler-hcloud + wizard footprint estimate (closes #767) Two-pronged fix for the FailedScheduling pattern that hit otech92 (2x cpx32 workers couldn't fit external-secrets-webhook because the bootstrap-kit ate the full 16 GB): 1. PRE-LAUNCH ESTIMATE — wizard StepReview now surfaces a "Footprint estimate" Section with: bootstrap-kit baseline (sum of mandatory-tier component footprints), selected components delta, control-plane overhead, and a "Recommended N x <SKU>" line that turns amber when the operator's chosen worker count is below the rollup. Backed by per-component RAM/CPU floors in components/wizard/steps/componentFootprints.ts (covered by 12 unit tests including the otech92 reproduction). 2. RUNTIME AUTOSCALING — new bp-cluster-autoscaler-hcloud Blueprint added at bootstrap-kit slot 40. Wraps the upstream kubernetes/autoscaler chart 9.46.6 (appVersion 1.32.0) with the Hetzner cloud-provider. Token wired from the canonical flux-system/cloud-credentials.hcloud-token Secret cloud-init writes (mirrors the velero/harbor object-storage pattern). Pinned to the control-plane node so the autoscaler never schedules onto a worker it could itself terminate. 10-minute scale-down idle as the cost-saving default. Documented in docs/ARCHITECTURE.md sec.14 (Autoscaling) — explains how VPA / HPA / KEDA / cluster-autoscaler compose, why we picked cluster-autoscaler over KEDA for cluster scaling, and the bounds + safety story. Per the issue's MVP scope, this PR ships the blueprint + StepReview estimate WITHOUT the wizard StepProvider min/max pair refactor or the tofu node-pool template restructuring. Those are tracked as a follow-up issue (scope-control rule per docs/INVIOLABLE-PRINCIPLES.md #1). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(provisioner): move cluster-autoscaler to slot 50 + register in expected-bootstrap-deps Slot 40 was already forward-declared for bp-llm-gateway in scripts/expected- bootstrap-deps.yaml — the dependency-graph-audit CI check fired on PR #776 because the file existed without a matching entry in the expected DAG, AND collided with a reserved slot. Move to slot 50 (after the W2.K4 cohort + slot 49 bp-cert-manager-powerdns-webhook) and add the matching entry to the expected-bootstrap-deps.yaml so the audit passes. `scripts/check-bootstrap-deps.sh` runs clean locally now (drift=0, cycles=0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
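The pre-launch footprint rollup described in the first commit of this squash can be sketched as follows. This is a hypothetical Python illustration, not the wizard's TypeScript in componentFootprints.ts; the function name, the RAM-only rollup and every number in the usage note are assumptions, except that a cpx32 worker is taken as 8 GiB, consistent with "2x cpx32 ... 16 GB" in the message.

```python
import math


def footprint_estimate(baseline_mib, selected_mib, control_plane_overhead_mib,
                       node_ram_mib, chosen_workers):
    """Roll up RAM floors and recommend a worker count for one node SKU.

    Returns the total, the recommended node count, and whether the operator's
    chosen worker count is below the rollup (the "amber" condition).
    """
    if node_ram_mib <= 0:
        raise ValueError("node_ram_mib must be positive")
    total = baseline_mib + selected_mib + control_plane_overhead_mib
    recommended = max(1, math.ceil(total / node_ram_mib))
    return {
        "total_mib": total,
        "recommended_workers": recommended,
        "below_recommendation": chosen_workers < recommended,
    }
```

For example, with illustrative floors of 14000 MiB baseline, 3000 MiB selected components and 1000 MiB overhead on 8192 MiB nodes, `footprint_estimate(14000, 3000, 1000, 8192, 2)` recommends three workers and flags the choice of two.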
||
|
|
0dbdf3b327
|
fix(bp-trivy): node-collector tolerates control-plane taint (closes #769) (#772)
PR #755 added `node-role.kubernetes.io/control-plane=true:NoSchedule` to the CP node when worker_count > 0. Two bootstrap-kit charts have pods that MUST land on the CP and lacked the matching toleration:
bp-trivy
• node-collector: Pod pinned to each node via nodeSelector `kubernetes.io/hostname=<node>`. The CP-bound collector reads /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler, /var/lib/kube-controller-manager via hostPath; these only exist on the CP. Without the toleration the collector sat Pending forever on otech93 (live evidence in #769).
• scanJobTolerations: per-workload scan jobs the operator spawns may target pods on CP-only system DaemonSets (kube-system kube-proxy in non-Cilium mode, etc.). Adding the toleration here so reports are produced for those workloads too.
bp-alloy
• DaemonSet: one pod MUST land on every node including the CP, so CP-local kubelet logs + node metrics flow into the LGTM stack. Without the toleration Alloy ran 3/4 nodes (Ready=N-1) on otech93 and CP telemetry was silently lost.
Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP is untainted in solo mode per PR #755's conditional.
Versions bumped:
• bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins)
• bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins)
Out of scope (audited, no change needed):
• bp-cilium: upstream defaults already tolerate everything (verified on otech93: cilium DaemonSet at 4/4 nodes).
• bp-falco: values.yaml already declares NoSchedule + NoExecute Exists tolerations (4/4 on otech93).
• cnpg/harbor: no kubelet-cert-renew Jobs in current charts.
Verified:
• `helm template` on both charts renders the expected toleration (alloy: pod-spec; trivy: trivy-operator-config ConfigMap consumed by the operator at scan-job spawn time).
• `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift).
Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
31784d7ed5
|
fix(bp-external-dns): apiserver Endpoints sync timeout — Cilium kube-apiserver entity required (closes #770) (#771)
* fix(bp-external-dns): grant apiserver egress via CiliumNetworkPolicy (closes #770)
Root cause: ExternalDNS crashloops on every fresh Sovereign provision with `failed to sync *v1.Endpoints: context deadline exceeded`. The companion vanilla NetworkPolicy egress rule `to: ipBlock: 0.0.0.0/0 ports: 443,6443` does NOT match traffic to the kube-apiserver under Cilium with the default `policy-cidr-match-mode: ""`. Cilium models the apiserver as a reserved identity, not a CIDR range, so the ipBlock rule is bypassed and the apiserver call is dropped at the egress hook of the external-dns endpoint.
Fix: render a companion CiliumNetworkPolicy with `toEntities: [kube-apiserver]` scoped to the external-dns Pod selector. This is the canonical Cilium pattern for controllers that watch the apiserver. The existing vanilla NetworkPolicy is preserved verbatim so the Blueprint remains CNI-agnostic per BLUEPRINT-AUTHORING.md.
Live proof on otech93 (2026-05-04): manually applied the rendered CNP to the running cluster, external-dns transitioned from CrashLoopBackOff (8 restarts in 20m) to 1/1 Running within 30s, informer cache sync completed cleanly.
Bumps bp-external-dns 1.1.6 → 1.1.7.
Why not `policy-cidr-match-mode: nodes` cluster-wide on bp-cilium? It silently relaxes EVERY other NetworkPolicy that uses 0.0.0.0/0 in the cluster, which is too broad. Per INVIOLABLE-PRINCIPLES the fix MUST be scoped to the workload that needs it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(_template): bump bp-external-dns 1.1.6 → 1.1.7 to pick up CNP fix
Pairs with the chart bump in the same PR. Every fresh otech provision hydrates clusters/_template/, so this pin is what determines the version installed. Without bumping here, otech94+ would still use 1.1.6 and continue to crashloop with the apiserver-egress symptom.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
69de64ba19
|
fix(cilium): k8sServiceHost 127.0.0.1 → 10.0.1.2 so workers' Cilium can reach apiserver (#738)
Issue #733 follow-up. The default cpx32 multi-node Sovereign (1 CP + 2 workers) provisioned successfully, but worker nodes stuck NotReady because cilium-agent on workers crashloop'd:
Get "https://127.0.0.1:6443/api/v1/namespaces/kube-system": dial tcp 127.0.0.1:6443: connect: connection refused
Root cause: `k8sServiceHost: 127.0.0.1` works on the k3s SERVER node (supervisor binds localhost:6443) but FAILS on every k3s AGENT node (agent does NOT expose apiserver on localhost; only the supervisor on :6444). Pre-#733 every Sovereign was solo (worker_count=0), so this never fired.
Fix: point Cilium at `10.0.1.2`, the CP's stable private IP on the Sovereign's 10.0.1.0/24 subnet (cp1=10.0.1.2 per main.tf network block). No-op on the CP (10.0.1.2 IS its own private IP) and works on workers (which already join the cluster via the same address per cloudinit-worker.tftpl `K3S_URL=https://${cp_private_ip}:6443`).
Files:
- infra/hetzner/cloudinit-control-plane.tftpl: bootstrap helm install values file written to /var/lib/catalyst/cilium-values.yaml
- platform/cilium/chart/values.yaml: Flux bp-cilium HelmRelease values (cilium_values_parity_test.go enforces the two stay aligned)
Verified live on otech50: 3× CPX32 servers running, 1 CP Ready, 2 workers registered with k3s but NotReady due to cilium init failure. After this fix workers should reach Ready, and the Phase-1 watcher sees all components Ready=True across the multi-node cluster.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9a58289786
|
fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (closes #715) (#716)
* fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713) Closes #713 Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing 401 on /auth/handover: 1. SOVEREIGN_FQDN race api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn" with optional:true. On Sovereigns, that ConfigMap is rendered by the sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform HelmRelease. When the Pod starts first, valueFrom collapses to "" and stays empty — audience check rejects every valid token as "invalid audience". Fix: add Reloader annotations so the Pod rolls when the ConfigMap (and the handover-jwt-public Secret) appears. 2. catalyst-api-server SA missing user-level realm-management role mappings bp-keycloak realm import granted roles via clientScopeMappings — wrong level. The actual service-account user had no clientRoles entry, so KC rejected GET /users with 403 when catalyst-api tried to ensure the operator user during handover. Fix: add explicit "users" array binding service-account-catalyst-api-server to realm-management.{impersonation, manage-users, view-users, query-users}. * fix(catalyst-api,bp-reloader): tofu state on PVC + Reloader annotations strategy (#715) Closes #715 Two architectural bugs surfaced live on otech64 (2026-05-03), both leading to a healthy-looking Sovereign that the operator could not reach. 1. catalyst-api tofu workdir on emptyDir CATALYST_TOFU_WORKDIR=/tmp/catalyst/tofu (emptyDir). When contabo's catalyst-api Pod rolled mid-apply (the PR #714 deploy commit triggered a rolling restart 3 minutes into otech64's tofu run), in-progress state was lost. Tofu had created LB/network/server/services but not the hcloud_load_balancer_target.control_plane resource yet — the cluster came up at the k3s level but the public LB had no targets, returning TLS handshake failure for every console.<sov> request. 
Move CATALYST_TOFU_WORKDIR to /var/lib/catalyst/tofu (PVC-backed, fsGroup=65534 already wires write access). tofu apply resumes from where it left off after any Pod restart. 2. bp-reloader env-vars strategy reloadStrategy=env-vars only injects checksum env vars for ConfigMaps referenced via envFrom. Workloads using valueFrom: configMapKeyRef (catalyst-api's SOVEREIGN_FQDN) are silently not reloaded — the configmap.reloader.stakater.com/reload annotation added in PR #714 was a no-op under env-vars. Switch to reloadStrategy=annotations. Reloader bumps a pod-template annotation, triggering rollout regardless of how the CM/Secret is referenced. --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
e96e31a781
|
fix(catalyst-api,bp-keycloak): handover 401 root-causes — Reloader annot + realm SA users array (#713) (#714)
Closes #713 Two distinct chart bugs surfaced live on otech62 (2026-05-03), both producing 401 on /auth/handover: 1. SOVEREIGN_FQDN race api-deployment.yaml reads SOVEREIGN_FQDN from ConfigMap "sovereign-fqdn" with optional:true. On Sovereigns, that ConfigMap is rendered by the sovereign-tls Flux Kustomization concurrently with bp-catalyst-platform HelmRelease. When the Pod starts first, valueFrom collapses to "" and stays empty — audience check rejects every valid token as "invalid audience". Fix: add Reloader annotations so the Pod rolls when the ConfigMap (and the handover-jwt-public Secret) appears. 2. catalyst-api-server SA missing user-level realm-management role mappings bp-keycloak realm import granted roles via clientScopeMappings — wrong level. The actual service-account user had no clientRoles entry, so KC rejected GET /users with 403 when catalyst-api tried to ensure the operator user during handover. Fix: add explicit "users" array binding service-account-catalyst-api-server to realm-management.{impersonation, manage-users, view-users, query-users}. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
c5ffaa2fd7
|
fix(bp-external-dns): livenessProbe.initialDelaySeconds=180 for cold-cluster cache-sync (closes #700) (#707)
PR #679 added --request-timeout=120s but external-dns has TWO timeouts: RequestTimeout (per-API-call, controlled by --request-timeout) and WaitForCacheSync (initial informer sync, hardcoded 60s in upstream binary, NOT exposed as a flag).
On a fresh Sovereign with k3s apiserver CPU-saturated, the cache sync misses 60s -> fatal: failed to sync *v1.Node: context deadline exceeded -> CrashLoopBackOff 5-10 times. Caught live on otech49+ (2026-05-03), 5 restarts before stable.
Bump livenessProbe.initialDelaySeconds from upstream 10s default to 180s so kubelet does NOT restart the Pod while the initial cache sync runs against a CPU-saturated freshly-provisioned k3s apiserver. The Sovereign apiserver reaches steady-state within ~2 min so 3 min comfortably covers cold starts.
Also bumps periodSeconds=30 + failureThreshold=3 so a genuinely-hung pod is still killed within ~90s once steady-state. readinessProbe gets a corresponding initialDelaySeconds=30 so endpoint flapping during sync doesn't churn services.
Helm overrides REPLACE whole maps (not merge), so the override preserves the upstream httpGet.path: /healthz + port: http shape verbatim.
Bumps:
- platform/external-dns/chart/Chart.yaml: 1.1.5 -> 1.1.6
- clusters/_template/bootstrap-kit/12-external-dns.yaml: HelmRelease pin 1.1.5 -> 1.1.6
Co-authored-by: hatiyildiz <hatice@openova.io> |
||
|
|
7ca9541ef9
|
fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup) (#691)
* fix(handover): provision Keycloak service-account credentials zero-touch (Phase-8b followup) Sovereign-side catalyst-api needs Keycloak service-account credentials to provision the operator's user during /auth/handover. Today the chart references K8s Secret `catalyst-kc-sa-credentials` with keys addr/realm/ client-id/client-secret in the catalyst-system namespace — but no zero-touch path materialised it. The dead SealedSecret template at 09a-keycloak-catalyst-api-secret.yaml had a different name AND different keys (CATALYST_KC_*), used PLACEHOLDER_SEALED_VALUE markers no provisioner replaced, and wasn't even listed in the bootstrap-kit kustomization. Symptom on otech48: GET /auth/handover?token=<valid-jwt> returns "server misconfiguration: keycloak not configured" (auth_handover.go:169). Fix: bp-keycloak chart's configmap-sovereign-realm.yaml template now emits the realm-import ConfigMap AND the catalyst-kc-sa-credentials Secret in a single template scope so they share the same generated client secret. Pattern mirrors platform/powerdns/chart/templates/ api-credentials-secret.yaml (canonical seam, ADR-0001 §11.3 anti-duplication). Secret-value resolution order (first match wins): 1. operator-supplied .Values.catalystApiServerClientSecret 2. helm `lookup` of existing Secret in keycloak ns (idempotent) 3. fresh randAlphaNum 32 (zero-touch on first install) The Secret carries the four keys exactly as the catalyst-api Pod's secretKeyRef expects — addr / realm / client-id / client-secret — with addr derived from gateway.host (https://auth.<sovereignFQDN>). Reflector annotations auto-mirror the Secret to catalyst-system as soon as that namespace materialises (bootstrap-kit slot 13). The realm import already creates the catalyst-api-server client with serviceAccountsEnabled + impersonation/manage-users/view-users/ query-users role mappings — so once Keycloak is Ready and the realm imports, the SA is fully provisioned and the K8s Secret carries a matching client secret. 
No post-install Job, no Admin-API script, no out-of-band SealedSecret ceremony. Cleanup: removes the dead 09a SealedSecret template (not in kustomization, never produced a working Secret). Bumps: - bp-keycloak chart 1.3.0 -> 1.3.1 - clusters/_template/bootstrap-kit/09-keycloak.yaml HelmRelease pin 1.3.0 -> 1.3.1 Existing per-Sovereign overlays (clusters/otech.omani.works/, clusters/omantel.omani.works/) intentionally remain on 1.3.0 — fresh otechN provisioning consumes _template at provision time. Will be verified live on otech49 — handover end-to-end without ANY manual Secret creation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(keycloak): bump blueprint.yaml spec.version to match chart 1.3.1 TestBootstrapKit_BlueprintCardsHaveRequiredFields/keycloak asserts Chart.yaml.version == blueprint.yaml.spec.version. Forgot to bump blueprint.yaml in the previous commit. Note: 8 other blueprints (cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, gitea) carry the same pre-existing mismatch and the test fails on main too. Out of scope for this PR; fixing the keycloak case to keep the new chart version internally consistent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
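The "first match wins" resolution order described above can be sketched in plain shell. This is only an illustration under stated assumptions: the chart itself resolves the value with Helm template functions (`lookup`, `randAlphaNum`), and the function name `resolve_client_secret` and its two positional arguments are hypothetical stand-ins for the operator override and the value read from an existing Secret.

```shell
#!/bin/sh
# Hypothetical illustration of the resolution order; not the chart's template.
# $1 = operator-supplied override (may be empty)
# $2 = value found in an already-existing Secret (may be empty)
resolve_client_secret() {
  override=$1
  existing=$2
  if [ -n "$override" ]; then
    # 1. An explicit operator value always wins.
    printf '%s\n' "$override"
  elif [ -n "$existing" ]; then
    # 2. Otherwise reuse what is already stored, so re-runs are idempotent.
    printf '%s\n' "$existing"
  else
    # 3. Otherwise generate 32 alphanumeric characters
    #    (stand-in for Helm's randAlphaNum 32).
    head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | head -c 32
    printf '\n'
  fi
}
```

Reusing the stored value before generating a new one is what keeps an upgrade from rotating the client secret out from under a running consumer.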
||
|
|
684759564e
|
fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager (PR #681 followup) (#686)
* fix(cilium-gateway): listener ports 80/443 → 30080/30443 + LB retarget cilium-envoy refuses to bind privileged ports (80/443) on Sovereigns even with all of: - gatewayAPI.hostNetwork.enabled=true on the Cilium chart - securityContext.privileged=true on the cilium-envoy DaemonSet - securityContext.capabilities.add=[NET_BIND_SERVICE] - envoy-keep-cap-netbindservice=true in cilium-config ConfigMap - Gateway API CRDs at v1.3.0 (matching cilium 1.19.3 schema) Repeatable error from cilium-envoy logs across otech45, otech46, otech47: listener 'kube-system/cilium-gateway-cilium-gateway/listener' failed to bind or apply socket options: cannot bind '0.0.0.0:80': Permission denied The bind() syscall is intercepted by cilium-agent's BPF socket-LB program in a way that does not honour container capabilities. Even PID 1 with CapEff=0x000001ffffffffff (all caps) and uid=0 gets "Permission denied". Cilium 1.19.3 → 1.16.5 made no difference (F1, PR #684 still ships — the version bump is sound for other reasons; the listener bind is just a separate fix). This commit moves the listeners to high ports (30080/30443) and lets the Hetzner LB do the public-facing port translation: HCLB :80 → CP node :30080 (cilium-gateway HTTP listener) HCLB :443 → CP node :30443 (cilium-gateway HTTPS listener) External users still hit `https://console.<sov>.omani.works/auth/handover` on port 443; the high port is invisible. High-port bind succeeds without NET_BIND_SERVICE because the kernel only gates ports below `net.ipv4.ip_unprivileged_port_start` (default 1024). Will be verified on otech48: the next fresh provision should serve console.otech48/auth/handover end-to-end without the 502/timeout chain seen on otech45–47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(powerdns+catalyst-api): zero-touch contabo PowerDNS API key for Sovereign cert-manager PR #681 followup. 
The new bp-cert-manager-powerdns-webhook (PR #681) calls contabo's authoritative PowerDNS at pdns.openova.io to write DNS-01 challenge TXT records for *.otech<N>.omani.works. That webhook needs an X-API-Key Secret in the Sovereign's cert-manager namespace — PR #681 didn't ship the materialization seam, so on otech43..otech47 the Secret was missing and the wildcard cert never issued. This commit closes the seam from contabo to the Sovereign: 1. bp-powerdns chart 1.1.7 to 1.1.8: Reflector annotations on openova-system/powerdns-api-credentials extended from "external-dns" to "external-dns,catalyst" so contabo catalyst-api can mount the API key. 2. bp-powerdns: api.basicAuth.enabled flips default true to false. Layered Traefik basicAuth + PowerDNS X-API-Key was double auth that blocked machine-to-machine API access from Sovereigns. The X-API-Key contract is unchanged. 3. bp-catalyst-platform 1.2.3 to 1.2.4: api-deployment.yaml adds CATALYST_POWERDNS_API_KEY env from powerdns-api-credentials/api-key secret (optional=true so Sovereign-side catalyst-api Pods that don't reflect this still start clean). 4. catalyst-api provisioner.go: new Provisioner.PowerDNSAPIKey field reads from CATALYST_POWERDNS_API_KEY env at New(). Stamps onto every Request before Validate(). Forwards as tofu var powerdns_api_key. 5. infra/hetzner/variables.tf: new var.powerdns_api_key (sensitive, default ""). 6. infra/hetzner/cloudinit-control-plane.tftpl: replaces the defunct dynadot-api-credentials Secret block (PR #681 dropped bp-cert-manager-dynadot-webhook) with a new cert-manager/powerdns-api-credentials Secret block. runcmd applies it BEFORE Flux reconciles bp-cert-manager-powerdns-webhook. End-to-end seam mirrors PR #543 ghcr-pull and PR #680 harbor-robot-token. Will be verified live on otech48 (next provision after this lands). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: hatiyildiz <hatice@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
52b87afa9e
|
fix(bp-cilium): upgrade upstream cilium 1.16.5 → 1.19.3 (1.2.0) (#684)
1.16.x gateway-api hostNetwork mode is buggy on Sovereigns: cilium-envoy
NACKs listeners with "cannot bind '0.0.0.0:80': Permission denied" and
the loaded RDS for the Sovereign vhost only carries the default `/` route
to catalyst-ui — `/auth/*` and `/api/*` HTTPRoute matches defined in CEC
never reach envoy's live config. Result: console.<sov>/auth/handover?token=…
serves the React shell instead of the catalyst-api Go handler, defeating
the Phase-8b seamless handover. Caught live on otech46.
1.18+ ships the Gateway API implementation graduated from beta with the
hostNetwork bind path fixed; 1.19 is the current stable line (1.19.3).
Values shape verified backward-compatible across the keys we set:
gatewayAPI.hostNetwork.enabled, envoy.enabled, envoyConfig.enabled,
encryption.type=wireguard, encryption.nodeEncryption — all unchanged
between 1.16 and 1.19.
Bumps:
- bp-cilium chart 1.1.5 → 1.2.0 (minor — major upstream version jump)
- upstream cilium subchart 1.16.5 → 1.19.3
- blueprint.yaml spec.version 1.1.3 → 1.2.0 (was already drifted from
Chart.yaml; brings them back in sync per manifest-validation gate)
- clusters/_template/bootstrap-kit/01-cilium.yaml HelmRelease pin
1.1.5 → 1.2.0
Per-cluster overlays under clusters/<sovereign>/bootstrap-kit/ keep
their pinned versions until the operator opts in — fresh otechN
provisions render from _template/ and pick up 1.2.0 on first boot.
Will be verified live on the next fresh Sovereign provision (otech47+).
Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
2b60e944e2
|
fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook (#681)
* fix(bp-cert-manager-powerdns-webhook): re-target to contabo PowerDNS, drop dynadot-webhook Caught live on otech43-46: cert-manager DNS-01 challenges for *.otechN.omani.works failed because the Sovereign-side webhook wrote challenge TXT records to the Sovereign's local PowerDNS. omani.works is delegated from Dynadot to ns1/2/3.openova.io which run on contabo's central PowerDNS — the Sovereign's local PowerDNS is INVISIBLE on the public DNS chain until pool-domain-manager seals the per-Sovereign NS delegation. Let's Encrypt resolvers walk the public chain, query contabo, get NXDOMAIN, the cert never issues. Manual workaround was seeding challenge TXT directly in contabo PowerDNS. This PR automates the right write path: - bp-cert-manager-powerdns-webhook chart bumped to 1.0.4. Default powerdns.host flips from "" (skip-render) to https://pdns.openova.io (contabo's public PowerDNS API ingress, authoritative for omani.works). - ClusterIssuer letsencrypt-dns01-prod-powerdns now usable with no per-cluster powerdns.host override for the omani.works pool. apiKeySecretRef.namespace clarified — upstream ignores it; the Secret must live in cert-manager namespace (= ChallengeRequest.ResourceNamespace for ClusterIssuers). - bootstrap-kit slot 49 updated: drops bp-powerdns dependsOn (webhook calls out-of-cluster contabo, not local PowerDNS), bumps chart version, removes inline powerdns.host override (defaults are correct). - bootstrap-kit slot 49b (bp-cert-manager-dynadot-webhook) DELETED entirely — Dynadot is NOT the API-level authority for omani.works subdomains, the dynadot webhook silently fails the same way the Sovereign-local powerdns one did. - clusters/_template/sovereign-tls/cilium-gateway-cert.yaml flips issuerRef from letsencrypt-dns01-prod (was dynadot-backed) to letsencrypt-dns01-prod-powerdns (the new contabo-backed issuer). - bp-cert-manager chart: certManager.issuers.dns01.enabled defaults to false (deprecated dynadot path). 
letsencrypt-http01-prod retained for per-host certs. Cluster overlays MAY flip dns01.enabled=true for non-omani.works pools where Dynadot IS the API-level authority. - scripts/expected-bootstrap-deps.yaml: drops slot 49b, drops bp-powerdns edge from slot 49. - Documentation (README + blueprint.yaml + Chart.yaml description) rewritten to reflect contabo retarget and lifecycle reasoning. Credential plumbing (out of scope here, must be done in cloud-init): - Every Sovereign needs a `powerdns-api-credentials` Secret in the `cert-manager` namespace whose `api-key` value matches contabo's PowerDNS API key. Same seeding pattern as `dynadot-api-credentials` in infra/hetzner/cloudinit-control-plane.tftpl. Caveat — basicAuth on contabo's PowerDNS API ingress: contabo currently fronts pdns.openova.io with Traefik basicAuth (per clusters/contabo-mkt/apps/powerdns/helmrelease.yaml). The upstream zachomedia/cert-manager-webhook-pdns binary supports the X-API-Key header but not HTTP Basic Auth out of the box. To make this end-to-end green, contabo's basicAuth requirement must be relaxed (X-API-Key alone provides the auth posture, and contabo's API endpoint is restricted to operator IPs by other means OR the Sovereign's webhook needs an Authorization header injected via the chart's powerdns.headers map (plaintext password in the ClusterIssuer config — not ideal). This PR ships the chart side; the basicAuth question is a follow-up on the contabo side. Verified locally: - helm lint platform/cert-manager-powerdns-webhook/chart -> PASS - helm template platform/cert-manager-powerdns-webhook/chart -> renders - helm template ... --set clusterIssuer.enabled=true -> renders the ClusterIssuer with host="https://pdns.openova.io" + correct apiKey Secret reference. - helm template platform/cert-manager/chart -> renders ONLY letsencrypt-http01-prod (the dns01 dynadot issuer correctly gated off). 
- scripts/check-bootstrap-deps.sh: net-zero new drift; my branch reduces pre-existing errors from 3 to 2 (the dropped slot 49b removed the only drift my branch was responsible for).
Closes follow-up to #373. Preconditions for handover URL TLS green on otech43-46 lineage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(scripts): repair YAML structure in expected-bootstrap-deps.yaml
Two pre-existing drifts were blocking dependency-graph-audit CI:
1. Slot 5a (bp-reflector) was missing its closing list separator, causing yq to merge the bp-nats-jetstream entry into the bp-reflector map and effectively drop bp-reflector from the expected DAG. Added explicit `- slot: 7` for bp-nats-jetstream and quoted "5a" so yq treats it as a string slot (matches the convention with "49b").
2. bp-powerdns slot 11: actual bootstrap-kit declares dependsOn bp-cnpg (live since otech28; pdns-pg-app secret race) but the expected DAG was missing this edge.
This unblocks merging fix/cert-manager-powerdns-webhook-contabo (PR above). These drifts existed on main but weren't surfaced until the last expected-deps edit forced a re-run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a50ef0ece0
|
fix(bp-external-dns): --request-timeout=120s for cold-cluster initial sync (1.1.5) (#679)
Caught live on otech43–46: external-dns crashloops 10+ times on fresh Sovereign before initial *v1.Pod sync completes. Default 30s timeout insufficient when k3s apiserver is CPU-saturated. Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
dd4148acb6
|
fix(cilium-gateway): hostNetwork mode + Hetzner LB→80/443 (chart 1.1.5) (#674)
The Cilium gateway-api L7LB nodePort chain was silently broken on otech45: TCP to LB:443 succeeds, but the TLS handshake never completes.
Root cause: Cilium 1.16.5's BPF L7LB Proxy Port (12869) doesn't match what cilium-envoy actually listens on (verified via /proc/net/tcp on the cilium-envoy pod: port 12869 is not among the listening sockets). The nodePort indirection (31443→envoy:12869) is broken at the redirect step.
Fix: bind cilium-envoy directly to the host's :80 and :443 via gatewayAPI.hostNetwork.enabled=true. Hetzner LB forwards public 80→private:80 and 443→private:443 directly (no nodePort indirection).
Two coordinated changes:
1. platform/cilium/chart/values.yaml: gatewayAPI.hostNetwork.enabled=true
2. infra/hetzner/main.tf: LB destination_port = 80/443 (was 31080/31443)
bp-cilium chart bumped to 1.1.5.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
1bd2ab1951
|
fix(bp-gitea): use explicit labels in sync-job template (chart 1.2.3 retry) (#670)
Previous attempt referenced 'bp-gitea.labels' helper which doesn't
exist in this chart (bp-gitea has no _helpers.tpl, unlike bp-harbor).
Blueprint Release workflow's helm-template gate caught it:
template: bp-gitea/templates/database-secret-sync-job.yaml:53:8:
error calling include: template: no template 'bp-gitea.labels'
associated with template 'gotpl'
Fix: replace the 4 occurrences of 'include bp-gitea.labels' with
explicit catalyst.openova.io/blueprint + component labels. Same
shape, no helper dependency.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
9eff5530cd
|
fix(bp-gitea): replace Reflector with database-secret-sync-job (chart 1.2.3) (#668)
Same root cause + same fix as bp-harbor (PR #557). The Reflector-based 'gitea-database-secret reflects gitea-pg-app' pattern races with CNPG: Reflector logs once at install time that the source doesn't exist ('Could not update gitea/gitea-database-secret — Source gitea-pg-app not found') and never retries. The destination stays empty (password "") and gitea init container crashloops with 'pq: password authentication failed for user gitea' — caught live on otech43, manually patched at the time but no chart fix shipped, so otech45 hit the exact same failure (founder caught it in k9s). Fix: replicate bp-harbor's sync-job pattern verbatim. - post-install,post-upgrade Helm hook (weight 5) - curlimages/curl image talking to in-cluster apiserver - Polls until gitea-pg-app exists, reads .data.password, PATCHes gitea-database-secret with the password key - Hook-delete-policy: before-hook-creation,hook-succeeded - Idempotent on re-run; CNPG never rotates without operator action Drops the HARBOR_DATABASE_PASSWORD alias (gitea binds the 'password' key directly via secretKeyRef in values.yaml). The existing pre-install database-secret.yaml placeholder stays so the Secret is Found at install time (some tooling assumes presence for the Pod's lifetime). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
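The difference between a one-shot attempt and the sync Job comes down to a retry loop. The following is a minimal sketch of that loop, assuming hypothetical names (`poll_until`, `source_ready`, `POLL_INTERVAL`); the chart's Job talks to the in-cluster apiserver with curl, which is not reproduced here, so a file check stands in for "does the source Secret exist yet?".

```shell
#!/bin/sh
# Hypothetical sketch of a poll-until-present loop; not the chart's Job script.
POLL_INTERVAL="${POLL_INTERVAL:-2}"

# poll_until MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or MAX_ATTEMPTS is exhausted.
poll_until() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$POLL_INTERVAL"
  done
  return 1
}

# Stand-in readiness check: succeeds once the given file exists.
source_ready() {
  [ -f "$1" ]
}
```

A caller would run something like `poll_until 60 source_ready /path/to/marker` and copy the password only after the loop returns 0, so a source that appears late is still picked up.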
||
|
|
a8bcb773c9
|
fix(bp-openbao): add BAO_TOKEN+NAMESPACE env to auth-bootstrap (chart 1.2.14) (#666)
PR #663 added the revoke logic at the bottom of the script but the companion env-block additions (BAO_TOKEN sourced from openbao-root-token Secret, NAMESPACE from fieldRef) somehow never landed in the merged diff — only the trailing revoke + DELETE block did. Result on otech44: openbao-root-token Secret IS being created by init-job (PR #663's other half worked), but auth-bootstrap pod env ends at TOKEN_MAX_TTL with no BAO_TOKEN, so 'bao auth enable kubernetes' hits 403 Forbidden again — the exact same failure that PR #663 was supposed to fix. This PR adds the missing env declarations. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
74921e30f1
|
fix(architecture): drop bp-spire, Cilium WireGuard is the canonical east-west mesh (#665)
Founder direction 2026-05-03: with 100% Cilium mesh enforcement + Envoy where required, bp-spire is redundant for the minimal Sovereign MVP. Reasoning: - Cilium 1.13+ has built-in mutual auth using SPIFFE, but it ships with its own embedded SPIRE server managed by the Cilium operator. External bp-spire is not needed for east-west mTLS. - Our ESO→OpenBao auth uses the K8s ServiceAccount auth method (TokenReview against kube-apiserver), not JWT-SVID. - WireGuard transparent encryption (already enabled in cilium values) encrypts every pod-to-pod connection at the kernel transport layer. - Cross-Sovereign federation and per-workload-fingerprint attestation are not blocking handover; they can be re-introduced as an opt-in blueprint when needed. Changes: - Delete clusters/_template/bootstrap-kit/06-spire.yaml - Remove bp-spire from kustomization.yaml + expected-bootstrap-deps.yaml - Remove bp-spire dependsOn from 07-nats-jetstream.yaml + 08-openbao.yaml - bp-cilium 1.1.4: add encryption.nodeEncryption=true so node-to-node traffic (not just pod-to-pod) is also WireGuard-encrypted; document in values.yaml comment that WireGuard is the canonical east-west mTLS layer. Removes 4 pods (spire-server, spire-agent, spire-spiffe-csi-driver, spire-spiffe-oidc-discovery-provider) from every Sovereign and the recurring CSI mount race that was getting stuck on otech43. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
be6e610093
|
fix: drop bp-langfuse from minimal + bp-mimir 1.0.2 push_grpc fix (#664)
* fix: drop bp-langfuse from minimal bootstrap-kit + bp-mimir push_grpc fix Two independent fixes packaged together: 1. **Drop bp-langfuse** from the SOLO minimal bootstrap-kit. Per founder direction: langfuse is LLM-specific (prompt/completion tracing for AI plane), not platform infrastructure, and belongs to a future 'AI Add-On' template. Its CreateContainerConfigError on every Sovereign provision (missing langfuse-secrets pre-install) was eating Phase-1 reconciliation budget without contributing to handover-ready state. Removed: - clusters/_template/bootstrap-kit/26-langfuse.yaml - kustomization.yaml entry - scripts/expected-bootstrap-deps.yaml slot 26 entry 2. **bp-mimir 1.0.2** — re-enable ingester.push_grpc_method_enabled. Upstream mimir-distributed 6.0.6 disables Push gRPC when ingest-storage is off, but classic-mode ingester REQUIRES it. The combo crashloops with 'cannot disable Push gRPC method in ingester, while ingest storage (-ingest-storage.enabled) is not enabled'. Caught live on otech43 with 17 restarts. Both issues block Phase-1 ready=40/40 from being a clean signal. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-mimir): chart 1.0.2 push_grpc_method_enabled + finalize langfuse drop Follow-up to previous commit which only captured the file deletion. This commit applies: bp-mimir 1.0.2 chart bump, kustomization + expected-deps removal of langfuse, bootstrap-kit version bumps. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
561439b6c2
|
fix(bp-openbao): wire root_token init→auth-bootstrap (chart 1.2.13) (#663)
Caught live on otech43 after chart 1.2.12 fixed the persist gap and auth-bootstrap finally ran: 'Error enabling kubernetes auth ... Code: 403 permission denied'. The auth-bootstrap Job had no BAO_TOKEN and was making unauthenticated bao API calls.
Three coordinated changes:
1. init-job.yaml: after bao operator init succeeds and ROOT_TOKEN is extracted, POST a transient Secret openbao-root-token with the token in data.token. Already-exists (409) is treated as idempotent-re-run, anything else fails the Job loud (was silent before, hid the bug).
2. auth-bootstrap-job.yaml: BAO_TOKEN env sourced via secretKeyRef from openbao-root-token. After running auth enable / secrets enable / policy write / role bind, revoke the token via 'bao token revoke -self' AND attempt DELETE on the Secret. (busybox wget --method=DELETE may silently no-op; the bao-side revoke is the load-bearing acceptance-criterion-6 mechanism.)
3. auto-unseal-rbac.yaml: openbao-root-token added to the mutation rule's resourceNames so the SA can GET/PATCH/UPDATE/DELETE it. Create is already unrestricted from chart 1.2.10's RBAC split.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
be9b5ca5bf
|
fix(bp-openbao): wc -l counts 0 for single-key without trailing newline (1.2.12) — TRUE root cause (#662)
Caught live on otech42 with chart 1.2.11's per-pod logs:
+ bao operator init -key-shares=1 -key-threshold=1 -format=json
[openbao-init] FATAL: extracted 0 unseal key(s) but threshold=1
key-shares=1 → no comma → tr ',' '\n' is no-op → final sed produces single line WITHOUT trailing newline → wc -l counts 0.
Every prior loop attributed to RBAC/wget was a downstream symptom.
Fix: append 'awk 1' for trailing newline, swap wc -l for grep -c .
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
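The miscount is easy to reproduce outside the chart. The sketch below is not the chart's init script: the function names `count_with_wc` and `count_with_grep` and the sample key strings are hypothetical, and only the `tr` / `wc -l` / `awk 1` / `grep -c .` behaviour comes from the commit message.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the real init script is not shown here.

# Old approach: split on commas, then count newline characters.
# A final line with no trailing newline is not counted by wc -l.
count_with_wc() {
  printf '%s' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

# Fixed approach: 'awk 1' re-emits each line with a trailing newline,
# and 'grep -c .' counts the non-empty lines.
# '|| true' keeps a zero count from tripping 'set -e' callers.
count_with_grep() {
  printf '%s' "$1" | tr ',' '\n' | awk 1 | grep -c . || true
}

count_with_wc 'onlykey'      # prints 0
count_with_grep 'onlykey'    # prints 1
```

With several comma-separated keys the `wc -l` variant undercounts by one for the same reason, so the bug is not specific to key-shares=1; it is only fatal there because the count drops to zero.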
||
|
|
7bd9aae89b
|
diag(bp-openbao): restartPolicy: Never (chart 1.2.11) — preserve fresh-init pod logs (#661)
OnFailure restarts the SAME container in the SAME pod, and only the MOST RECENT failed container's logs are kubectl-loggable. The first attempt's logs (where the FRESH path runs and the persist gap lives) are reaped before later restarts can be inspected. Switching to Never makes each retry a separate Pod via Job's backoffLimit replay. Every failed pod is independently inspectable with kubectl logs <pod> until ttlSecondsAfterFinished tears it down. Combined with chart 1.2.9's openbao-init-trace Secret upload (POST now succeeds with 1.2.10's RBAC split), the fresh-path failure point becomes definitively observable. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
b5fee168b5
|
fix(bp-openbao): split RBAC for create verb (chart 1.2.10) — root cause of unseal-keys never persisted (#660)
The openbao-auto-unseal Role granted 'create' on Secrets with
resourceNames set. Kubernetes RBAC doesn't enforce resourceNames on
the create verb (the resource has no name at admission time, so
there's nothing to filter), but the kube-apiserver still REJECTS the
request because the rule's effective verbs[create]+resourceNames combo
doesn't match the bare 'create secrets' permission check. Result:
every init Job POST returned 403 Forbidden.
The script then fell through to the PUT branch, which silently failed
because BusyBox wget (the openbao image's only HTTP client) has no
--method flag. Both calls non-zero → script exited 1 with FATAL
'cannot persist'. The first init's logs got reaped before later
restarts could be inspected, so the FATAL was never visible — the
retries all hit the idempotent FATAL ('vault is sealed but the
unseal-keys Secret is missing') with no record of why.
Caught live on otech40 with chart 1.2.9's trace upload + a wget
auth-can-i probe:
kubectl auth can-i create secrets --as=...openbao-auto-unseal → no
kubectl auth can-i create secret/openbao-unseal-keys ... → yes
Fix: split into two rules per the k8s RBAC pattern.
rule 1: verbs[create] WITHOUT resourceNames (allows POST)
rule 2: verbs[get,patch,update,delete] WITH resourceNames
(mutation stays scoped to known names)
This unblocks every fresh Sovereign provisioning. Each subsequent run
hits the idempotent path (GET on openbao-unseal-keys → 200) and
unseals automatically — no operator intervention.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
09e56f1e47
|
diag(bp-openbao): persist init script trace to Secret across restarts (1.2.9) (#659)
otech38/39 confirmed: openbao reaches Initialized=true on the first init pod attempt but the unseal-keys Secret is never persisted. The fresh-init container's logs are reaped before subsequent restarts' idempotent FATAL allows them to be inspected, so we keep flying blind on the actual failure point. This change tees every line of the init script (set -x trace + every echo) into /tmp/.script.trace and uploads it to a per-namespace Secret 'openbao-init-trace' on EXIT (success OR failure). The Secret survives Pod recreation and any Job retry; the operator can read it with kubectl after the next provision and see exactly where the fresh-path script exited. Adds 'openbao-init-trace' to the openbao-auto-unseal Role's resourceNames so the Job SA can PUT/POST it. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
5f6d1c7d86
|
diag(bp-openbao): add set -x to init script (chart 1.2.8) (#658)
otech37/38 hit the same wall: server reaches Initialized=true but openbao-unseal-keys Secret is never persisted; the FIRST init pod's logs that ran fresh init are reaped by container restart before we can capture what happened. Add 'set -x' to shell-trace every command. Now even if the script crashes mid-run, pod logs show the last command attempted. The captured diagnostic on the next provision will tell us whether the failure is in /tmp/init-output.json parsing, the persist wget, or elsewhere. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
8447930bf7
|
fix(bp-openbao): fail-fast on unseal-keys persist (chart 1.2.7) (#657)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)
* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)
The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):
componentGroups.ts Flux HelmRelease.dependsOn
---------------------- ---------------------------
keycloak: [cnpg] keycloak: [cert-manager, gateway-api]
openbao: [] openbao: [spire, gateway-api, cnpg]
harbor: [cnpg, seaweedfs, harbor: [cnpg, cert-manager,
valkey] gateway-api]
Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03
This commit:
1. Adds scripts/generate-blueprint-deps.sh that parses every
bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
keyed by bare component id (bp- prefix stripped on both source
and target side).
2. Commits the generated JSON.
3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
4. Patches componentGroups.ts so every RAW_COMPONENT's
`dependencies` field is OVERRIDDEN at module load with the
Flux-canonical list (the inline `dependencies: [...]` literals
are now ignored — Flux is canonical).
Follow-ups (not in this PR):
- CI drift check that re-runs the script and diffs the JSON.
- Strip the inline `dependencies: [...]` arrays entirely once the
drift check is green.
- Wire the FlowPage edge-rendering to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
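The derivation in step 1 can be sketched roughly as below. This is a Python illustration, not the shell script itself: it assumes the HelmRelease manifests have already been parsed into dicts, and `blueprint_deps` and `strip_bp` are hypothetical names. Sorting each dependency list is an added assumption to keep the output stable for diffing.

```python
from typing import Dict, List


def strip_bp(name: str) -> str:
    """Drop the 'bp-' prefix so ids match the bare component ids."""
    return name[3:] if name.startswith("bp-") else name


def blueprint_deps(helm_releases: List[dict]) -> Dict[str, List[str]]:
    """Map each component id to the ids listed in its spec.dependsOn.

    The prefix is stripped on both the source and the target side.
    """
    deps: Dict[str, List[str]] = {}
    for hr in helm_releases:
        component = strip_bp(hr["metadata"]["name"])
        depends_on = hr.get("spec", {}).get("dependsOn") or []
        deps[component] = sorted(strip_bp(d["name"]) for d in depends_on)
    return deps
```

Given a release named `bp-openbao` whose `dependsOn` lists `bp-spire`, `bp-gateway-api` and `bp-cnpg`, the result maps `openbao` to `cnpg`, `gateway-api` and `spire`.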
* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT
PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
keycloak: ['cert-manager', 'openbao'] ← FALSE; Flux says no openbao
The reason the founder kept seeing the spurious arrow on the Flow page.
Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(jobs): don't regress status to pending after exec started
helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).
Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.
Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
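The monotonic guard amounts to a small decision function. The sketch below is Python rather than the Go in helmwatch_bridge.go; the status strings and the name `guarded_status` are assumptions made for illustration.

```python
PENDING = "pending"
RUNNING = "running"
SUCCEEDED = "succeeded"
FAILED = "failed"


def guarded_status(status_from_helm: str, active_exec_id: str) -> str:
    """Return the status to store for a job row.

    Once an Execution has been allocated (non-empty id), a Pending
    reading from Flux is reported as Running instead of regressing
    the row. Every other status passes through unchanged.
    """
    if status_from_helm == PENDING and active_exec_id != "":
        return RUNNING
    return status_from_helm
```

Terminal states still pass through, so a job that fails or succeeds after starting is reported as such.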
* fix(jobs): cascade Failed status through dependsOn (fail-fast)
Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.
Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
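A minimal Python sketch of the sweep follows. It is an illustration, not the deriveTreeView code: it assumes statuses are plain strings and that only pending jobs are overwritten. `cascade_failed` is a hypothetical name.

```python
from typing import Dict, List


def cascade_failed(status: Dict[str, str],
                   depends_on: Dict[str, List[str]],
                   max_sweeps: int = 8) -> Dict[str, str]:
    """Return a copy of `status` with Failed propagated to dependents.

    A pending job becomes failed when any of its dependencies is failed.
    Sweeps repeat until nothing changes or `max_sweeps` is reached.
    The input snapshot is never mutated, so a recovered upstream simply
    produces no cascade on the next read.
    """
    result = dict(status)
    for _ in range(max_sweeps):
        changed = False
        for job, deps in depends_on.items():
            if result.get(job) == "pending" and any(
                result.get(dep) == "failed" for dep in deps
            ):
                result[job] = "failed"
                changed = True
        if not changed:
            break
    return result
```

Working on a copy of the live snapshot is what makes the sweep reversible: nothing is stored, so the derived failure disappears as soon as the upstream stops reporting failed.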
* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'
Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.
Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.
Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576
fs.inotify.max_queued_events = 16384
Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
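As a hedged illustration, a check like the one below could confirm on a node that the limits are at or above the values listed above. The helper names are hypothetical; the only system fact relied on is that a sysctl key such as `fs.inotify.max_user_instances` is exposed under `/proc/sys` with dots replaced by slashes.

```python
from typing import Dict, Tuple

# Target values taken from the commit message above.
TARGETS: Dict[str, int] = {
    "fs.inotify.max_user_instances": 8192,
    "fs.inotify.max_user_watches": 1048576,
    "fs.inotify.max_queued_events": 16384,
}


def proc_path(key: str) -> str:
    """Translate a sysctl key into its /proc/sys path."""
    return "/proc/sys/" + key.replace(".", "/")


def below_target(current: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {key: (current, wanted)} for every limit below its target.

    A missing key is treated as 0 so it is always reported.
    """
    return {
        key: (current.get(key, 0), wanted)
        for key, wanted in TARGETS.items()
        if current.get(key, 0) < wanted
    }
```

On a node, `current` could be filled by reading `proc_path(key)` for each key in `TARGETS`; an empty result means every limit meets its target.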
* fix(bp-openbao): fail-fast when unseal-keys persist fails (chart 1.2.7)
otech37 caught: bao operator init succeeded server-side
(Initialized=true), but the script's wget POST to persist
openbao-unseal-keys Secret silently failed (|| true), and the PUT
fallback also silenced. Subsequent Job retries hit Initialized=true
on the idempotent path, found no openbao-unseal-keys Secret, and
FATAL'd with 'manual recovery: wipe data-openbao-0 PVC' — every
retry forever.
Hardening:
1. Capture POST + PUT stdout/stderr to /tmp files instead of
/dev/null so the FATAL path can echo them.
2. PUT no longer || true — if both POST and PUT fail, exit 1.
3. Add read-back verification: GET the persisted Secret and
assert 'unseal-keys-b64' field is present. Catches
partial-write / eventual-consistency cases.
Bumps chart 1.2.6 -> 1.2.7 and bootstrap-kit reference.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
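The three hardening steps can be sketched as below. This is a Python illustration of the control flow only; the chart does this in shell with wget. The `post`, `put` and `get` callables are hypothetical stand-ins for the API calls, and locating 'unseal-keys-b64' under the Secret's `data` map is an assumption.

```python
from typing import Callable, Tuple

Writer = Callable[[dict], Tuple[bool, str]]  # returns (ok, captured output)


def persist_unseal_keys(post: Writer, put: Writer,
                        get: Callable[[], dict], payload: dict) -> None:
    """Persist the Secret, failing fast with captured output on error."""
    ok, post_out = post(payload)
    put_out = ""
    if not ok:
        # PUT is the fallback; its failure is no longer swallowed.
        ok, put_out = put(payload)
    if not ok:
        raise SystemExit(
            f"FATAL: persist failed\nPOST: {post_out}\nPUT: {put_out}"
        )
    # Read-back verification: the stored Secret must carry the field.
    stored = get()
    if "unseal-keys-b64" not in stored.get("data", {}):
        raise SystemExit("FATAL: read-back is missing 'unseal-keys-b64'")
```

Raising `SystemExit` with a message makes the process exit non-zero and print the captured output, matching the intent of echoing the POST and PUT results on the FATAL path.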
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
|
||
|
|
6baf7e56e7
|
fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) (#651)
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
d519dc8ba2
|
fix(bp-harbor): switch sync Job to curl-against-apiserver (chart 1.2.12) (#650)
rancher/kubectl is distroless (no /bin/sh) so the inline shell script can't run. Replace with curlimages/curl which has alpine sh + curl. Talk to k8s API directly via the in-pod ServiceAccount token. The PATCH merges password + HARBOR_DATABASE_PASSWORD into the existing pre-install-hook Secret without touching annotations. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
08432b540e
|
fix(bp-harbor): switch sync Job to rancher/kubectl (chart 1.2.11) (#649)
bitnami/kubectl moved to sha256-only tags; bitnami/kubectl:1.31.4 returns 'not found' from Docker Hub. rancher/kubectl is always available on k3s clusters. Bumps chart 1.2.10 -> 1.2.11. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
de51fa3f7a
|
fix(bp-harbor): post-install Job copies CNPG password (chart 1.2.10) (#648)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)
CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal
Commit
|
||
|
|
da61ecdc79
|
test(bp-openbao): align test expectation with #600 RBAC-hook removal (#647)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)
CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal
Commit
|
||
|
|
a359278b7d
|
fix(bp-spire): disable oidc ClusterSPIFFEID + chart bump (1.1.7) (#645)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service
PR #546 (Closes #542) introduced a dependency cycle:
hcloud_server.control_plane.user_data → local.control_plane_cloud_init
local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during otech23 first-end-to-end provisioning attempt.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra+api): wire handover_jwt_public_key end-to-end
The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:
- main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → empty even when wired
Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with:
Error: Invalid function argument
on main.tf line 170, in locals:
170: control_plane_cloud_init = replace(templatefile(...
Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
Fix:
- main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json
variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(api): cloud-init kubeconfig postback must live outside RequireSession
The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.
Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever.
Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie.
Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io
PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:
Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...
cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24.
Wiring (mirrors the GHCRPullToken pattern):
1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up.
variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)
PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned
Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)
even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt.
Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>
PR #557 wrote registries.yaml with mirror endpoints like
https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6
But Harbor proxy-cache projects expose their API at
https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with:
"unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25.
Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed.
Verified manually:
curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
-> 200 application/vnd.docker.distribution.manifest.list.v2+json
This unblocks every Sovereign image pull through the central Harbor.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it
cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26.
Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640).
Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB
CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule.
CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)
- Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled
Two related Phase-8a stragglers diagnosed live during otech28:
1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.
2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set reflection-allowed; missing reflection-auto-enabled. Reflector races when destination Secret (harbor-database-secret) is created BEFORE CNPG provisions the source (harbor-pg-app). Reflector logs "Source could not be found" once and never retries — leaving harbor-core stuck CreateContainerConfigError. Adding auto-enabled makes Reflector actively watch the source and re-fire when it appears.
Bumps:
bp-harbor 1.2.8 -> 1.2.9
bp-gitea 1.2.1 -> 1.2.2
bp-powerdns 1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)
Bootstrap-kit references updated to pull the new chart versions on the next Sovereign provisioning.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-spire): Chart.lock missing spire-crds → CRDs never installed (chart 1.1.7)
bp-spire 1.1.4 added spire-crds 0.5.0 as a Helm dependency to register the spire.spiffe.io/v1alpha1 CRDs (ClusterSPIFFEID, ClusterStaticEntry, ClusterFederatedTrustDomain) before the spire subchart's controller-manager Deployment starts. But Chart.lock was never regenerated — only contained the original `spire` entry. As a result every Blueprint Release packaged the chart WITHOUT spire-crds, the Sovereign saw no CRDs registered, and Helm install failed with:
no matches for kind "ClusterSPIFFEID" in version "spire.spiffe.io/v1alpha1"
bp-openbao / bp-external-secrets / bp-nats-jetstream all dependsOn bp-spire so this single bug cascades and blocks 5+ HRs from reaching Ready=True. Caught live during otech29.
Fix: ran `helm dependency update` to regenerate Chart.lock + pull both spire and spire-crds tarballs; bumps bp-spire 1.1.6 -> 1.1.7 and bootstrap-kit reference.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
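The URL-shape difference behind the registries.yaml fix in this commit can be made concrete with a small sketch. Both function names are hypothetical; the two shapes are the ones quoted in the commit message, with the project segment placed after `/v2/` in the working form.

```python
def harbor_proxy_manifest_url(host: str, project: str, repo: str, ref: str) -> str:
    """Shape Harbor proxy-cache projects serve: /v2/<project>/<repo>/..."""
    return f"https://{host}/v2/{project}/{repo}/manifests/{ref}"


def prefixed_endpoint_manifest_url(host: str, project: str, repo: str, ref: str) -> str:
    """Shape produced by a project-prefixed mirror endpoint.

    This is the form that got Harbor's UI HTML back instead of a manifest.
    """
    return f"https://{host}/{project}/v2/{repo}/manifests/{ref}"


good = harbor_proxy_manifest_url(
    "harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6"
)
bad = prefixed_endpoint_manifest_url(
    "harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6"
)
```

Here `good` matches the URL verified manually with curl in the commit message, while `bad` is the shape the original mirror endpoints produced.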
||
|
|
8bb66fe43e
|
fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service
PR #546 (Closes #542) introduced a dependency cycle:
hcloud_server.control_plane.user_data → local.control_plane_cloud_init
local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during otech23 first-end-to-end provisioning attempt.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra+api): wire handover_jwt_public_key end-to-end
The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:
- main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → empty even when wired
Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with:
Error: Invalid function argument
on main.tf line 170, in locals:
170: control_plane_cloud_init = replace(templatefile(...
Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
Fix:
- main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json
variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(api): cloud-init kubeconfig postback must live outside RequireSession
The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.
Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever.
Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie.
Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io
PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:
Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...
cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24.
Wiring (mirrors the GHCRPullToken pattern):
1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up.
variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)
PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned
Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)
even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt.
Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>
PR #557 wrote registries.yaml with mirror endpoints like
https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6
But Harbor proxy-cache projects expose their API at
https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with:
"unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25.
Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed.
Verified manually:
curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
-> 200 application/vnd.docker.distribution.manifest.list.v2+json
This unblocks every Sovereign image pull through the central Harbor.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it
cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26.
Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640).
Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB
CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule.
CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)
- Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled
Two related Phase-8a stragglers diagnosed live during otech28:
1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.
2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set reflection-allowed; missing reflection-auto-enabled. Reflector races when destination Secret (harbor-database-secret) is created BEFORE CNPG provisions the source (harbor-pg-app). Reflector logs "Source could not be found" once and never retries — leaving harbor-core stuck CreateContainerConfigError. Adding auto-enabled makes Reflector actively watch the source and re-fire when it appears.
Bumps:
bp-harbor 1.2.8 -> 1.2.9
bp-gitea 1.2.1 -> 1.2.2
bp-powerdns 1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)
Bootstrap-kit references updated to pull the new chart versions on the next Sovereign provisioning.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
---------
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
||
|
|
2e9cfd4a57
|
fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service
PR #546 (Closes #542) introduced a dependency cycle:
hcloud_server.control_plane.user_data → local.control_plane_cloud_init
local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address
`tofu plan` failed with:
Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane
Caught live during otech23 first-end-to-end provisioning attempt.
Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service:
curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4
Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(infra+api): wire handover_jwt_public_key end-to-end
The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it:
- main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan
- provisioner.writeTfvars never set the var → empty even when wired
Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with:
Error: Invalid function argument
on main.tf line 170, in locals:
170: control_plane_cloud_init = replace(templatefile(...
Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32.
Fix:
- main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
- provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON)
- handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
- writeTfvars emits the value into tofu.auto.tfvars.json
variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure).
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(api): cloud-init kubeconfig postback must live outside RequireSession
The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally.
Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever.
Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie.
Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io
PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and:
Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:...
cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24.
Wiring (mirrors the GHCRPullToken pattern):
1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New().
2. Stamped onto every Request in Provision() and Destroy() before writeTfvars.
3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload.
4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up.
variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. 
Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux + cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir + Loki + Tempo + … each request 50-500m vCPU and the node hits 100% allocatable before half the workloads schedule. CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size that fits the bootstrap-kit with VPA-recommendation headroom. Operators can still pick CPX32 explicitly if they trim the component set on StepComponents — but the default SOLO path now provisions a node that actually boots into a steady state. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2) - Replace forbidden `:latest` tag with current short-SHA `942be6f` per docs/INVIOLABLE-PRINCIPLES.md #4. - Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet authenticates against private ghcr.io/openova-io/openova/* via the Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace. Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff on every Sovereign — caught live during otech27. - Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
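The registries.yaml fix in this squash turns on the difference between two URL shapes. The sketch below only restates the two shapes quoted in the commit message; the function names are hypothetical and nothing here is taken from the repository's code.

```python
def harbor_proxy_manifest_url(host: str, project: str, repository: str, reference: str) -> str:
    """Shape the commit reports Harbor proxy-cache projects actually serve:
    /v2/<project>/<repository>/manifests/<reference>."""
    return f"https://{host}/v2/{project}/{repository}/manifests/{reference}"


def path_prefixed_manifest_url(host: str, project: str, repository: str, reference: str) -> str:
    """Shape the earlier endpoint-with-path mirror config was hoping for:
    /<project>/v2/<repository>/manifests/<reference>.
    Per the commit, Harbor answers this with its UI HTML instead of a manifest."""
    return f"https://{host}/{project}/v2/{repository}/manifests/{reference}"


if __name__ == "__main__":
    args = ("harbor.openova.io", "proxy-dockerhub", "rancher/mirrored-pause", "3.6")
    print("served:    ", harbor_proxy_manifest_url(*args))
    print("not served:", path_prefixed_manifest_url(*args))
```

The first function reproduces the URL the commit verified manually with curl; the second reproduces the URL that returned text/html.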
||
|
|
487ebebda2
|
fix(bp-vpa): drop registry.k8s.io/ prefix in repository (upstream prepends it) (#641)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service PR #546 (Closes #542) introduced a dependency cycle: hcloud_server.control_plane.user_data → local.control_plane_cloud_init local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address `tofu plan` failed with: Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane Caught live during otech23 first-end-to-end provisioning attempt. Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON the CP node, so it resolves its own public IPv4 at boot via Hetzner's metadata service: curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4 Same observable behavior as #546 (kubeconfig server: rewritten to CP public IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with no graph cycle. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra+api): wire handover_jwt_public_key end-to-end The OpenTofu cloud-init template references ${handover_jwt_public_key} (infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares the variable, but neither side wires it: - main.tf templatefile() call did not pass the key → "vars map does not contain key handover_jwt_public_key" on tofu plan - provisioner.writeTfvars never set the var → empty even when wired Caught live during otech23 provisioning, immediately after the tofu-cycle fix landed. tofu plan failed with: Error: Invalid function argument on main.tf line 170, in locals: 170: control_plane_cloud_init = replace(templatefile(... Invalid value for "vars" parameter: vars map does not contain key "handover_jwt_public_key", referenced at ./cloudinit-control-plane.tftpl:371,9-32. 
Fix: - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key - provisioner.Request gains a HandoverJWTPublicKey field (json:"-", server-stamped, never accepted from client JSON) - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK() when the signer is configured (CATALYST_HANDOVER_KEY_PATH set) - writeTfvars emits the value into tofu.auto.tfvars.json variables.tf default "" preserves the no-signer path: cloud-init writes an empty handover-jwt-public.jwk and the new Sovereign is provisioned without the handover-validation surface (handover flow simply not wired on that Sovereign — degraded gracefully, not a hard failure). Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(api): cloud-init kubeconfig postback must live outside RequireSession The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the RequireSession-gated chi.Group, so every cloud-init postback was rejected with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run. Cloud-init has no browser session cookie — it authenticates with the SHA-256-hashed bearer token PutKubeconfig already verifies internally. Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth. catalyst-api never received the kubeconfig, Phase 1 helmwatch never started, the wizard's Jobs page stayed in PENDING forever. Fix: register the PUT outside the auth group so cloud-init's bearer-hash auth path is the only gate. The matching GET stays inside session auth — the operator's "Download kubeconfig" button needs the session cookie. Caught live during otech23 first end-to-end provisioning. Per the new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM + PowerDNS + on-disk state) and the next provision will use otech24. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl and declared var.harbor_robot_token in infra/hetzner/variables.tf with a default of "". The catalyst-api side never set it, so every Sovereign so far provisioned with an empty token in registries.yaml — containerd's auth to harbor.openova.io's proxy projects failed silently and pulls fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns rate-limit HTML and: Failed to pull image "rancher/mirrored-pause:3.6": unexpected media type text/html for sha256:... cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux pods stay Pending; no HelmReleases ever land; the wizard's job stream shows everything PENDING because there's nothing to watch. Caught live during otech24. Wiring (mirrors the GHCRPullToken pattern): 1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN env at New(). 2. Stamped onto every Request in Provision() and Destroy() before writeTfvars. 3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted from the wizard payload. 4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json. 5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret (mirrored from openova-harbor — Reflector-managed on Sovereign clusters; copied per-namespace on Catalyst-Zero contabo) as CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths still come up. variables.tf default "" preserves graceful fall-through if the operator hasn't issued a robot token yet, and the architecture rule is now enforced end-to-end: every image on every Sovereign goes through harbor.openova.io. 
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up) PR #638 added Validate() rejection for missing harbor_robot_token, but the handler only stamped req.HarborRobotToken from p.HarborRobotToken inside Provision() — Validate() runs in the handler BEFORE Provision() gets the chance to stamp. Result: every wizard launch returned Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing) even though the env var is set on the Pod. Caught immediately on the otech25 launch attempt. Fix: same env-stamp pattern as GHCRPullToken at the top of the CreateDeployment handler. Provisioner-level stamp in Provision() stays as defense-in-depth. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo> PR #557 wrote registries.yaml with mirror endpoints like https://harbor.openova.io/proxy-dockerhub hoping containerd would build URLs like https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6 But Harbor proxy-cache projects expose their API at https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 (project name lives BEFORE the image-path /v2/, not as a path prefix). Harbor returns its SPA UI HTML (status 200, content-type text/html) for the wrong shape; containerd then errors with: "unexpected media type text/html for sha256:... not found" and pause-image / cilium / coredns pulls fail forever — caught live during otech24 and otech25. Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare Harbor host; per-mirror rewrite re-maps the image path so containerd's final URL is correctly project-prefixed. 
Verified manually: curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6 -> 200 application/vnd.docker.distribution.manifest.list.v2+json This unblocks every Sovereign image pull through the central Harbor. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> * fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry` (default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` — doubled prefix, image-not-found, ImagePullBackOff on every fresh Sovereign. Caught live during otech26. Fix: drop the redundant prefix. Subchart's default `.image.registry` keeps it pointing at registry.k8s.io which the new Sovereign's containerd routes through harbor.openova.io/v2/proxy-k8s/... via registries.yaml rewrite (#640). Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match. Co-authored-by: hatiyildiz <hatiyildiz@openova.io> --------- Co-authored-by: hatiyildiz <hatiyildiz@openova.io> |
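The doubled prefix in the bp-vpa fix can be reproduced with a few lines. This is a minimal sketch that assumes the subchart joins registry, repository and tag in the usual `<registry>/<repository>:<tag>` way, as the commit describes; `render_image` and the repository name `autoscaling/vpa-example` are hypothetical stand-ins, since the commit elides the real component names.

```python
def render_image(registry: str, repository: str, tag: str) -> str:
    """Join the three values the way the commit says the upstream subchart does:
    the registry is always prepended to the repository."""
    return f"{registry}/{repository}:{tag}"


if __name__ == "__main__":
    registry = "registry.k8s.io"  # subchart default named in the commit
    tag = "1.5.0"                 # tag quoted in the commit

    # Override that already carries the registry: the prefix is doubled.
    print(render_image(registry, "registry.k8s.io/autoscaling/vpa-example", tag))

    # Override without the registry: the image reference is well-formed.
    print(render_image(registry, "autoscaling/vpa-example", tag))
```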
||
|
|
737574b19a
|
feat(bp-keycloak): Phase-8b sovereign realm — token-exchange, catalyst-ui/api-server OIDC clients, SMTP, bump 1.2.2 → 1.3.0 (#604) (#609)
Adds the full Phase-8b identity surface required by the seamless handover flow:
- Token exchange enabled on sovereign realm (attributes.token-exchange: true)
- catalyst-ui public PKCE client: redirectUris + webOrigins keyed on console.<sovereignFQDN>, groups + requiredActions in ID token
- catalyst-api-server confidential service-account client: impersonation + manage-users + view-users + query-users roles on realm-management; client secret injected at provisioning time via .Values.catalystApiServerClientSecret
- WebAuthn (webauthn-register + webauthn-register-passwordless) registered as Required Action options on the realm
- UPDATE_PASSWORD set as defaultAction: true for new users
- smtpServer block: pre-handover default = contabo Stalwart relay; fully operator-configurable via .Values.smtp.* (Phase-8c-acceptable)
- required-actions client scope + oidc-usermodel-attribute-mapper for requiredActions claim in ID token (catalyst-ui first-login UX)
Architectural change: realm JSON moved from inline values.yaml (keycloak: subchart key, which has no parent scope access) to a parent-chart template platform/keycloak/chart/templates/configmap-sovereign-realm.yaml, which can read .Values.sovereignFQDN and .Values.smtp.* for per-Sovereign interpolation. The upstream bitnami chart's keycloakConfigCli.existingConfigmap is pointed at this ConfigMap. Anti-duplication seam: configmap-sovereign-realm.yaml.
New values.yaml keys:
- sovereignFQDN: "" (REQUIRED; per-Sovereign overlay supplies it)
- sovereignRealm.enabled: true
- catalystApiServerClientSecret: "" (REQUIRED; provisioner seals and injects)
- smtp.host/port/from/user/password/ssl/starttls/auth
New bootstrap-kit file: 09a-keycloak-catalyst-api-secret.yaml, a SealedSecret template for keycloak-catalyst-api-server-credentials in catalyst-system namespace; provisioner fills encryptedData fields at deploy time.
Bootstrap-kit refs bumped 1.2.x → 1.3.0 in _template, otech, omantel. helm template clean with sovereignFQDN=otech.omani.works.
Co-authored-by: hatiyildiz <hatiyildiz@openova.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
93627ada20
|
fix(bp-harbor): convert harbor-database-secret to Helm pre-install hook (1.2.8) (#603)
The 1.2.7 fix dropped the `data:` block from the chart template, but
Helm's three-way merge still owns the Secret as a release resource and
resets `data: {}` (no keys) on every chart upgrade — verified on otech22
where 1.2.6→1.2.7 reconcile wiped Reflector-populated keys back to nil.
Architectural fix: convert the Secret to a Helm pre-install hook.
- `helm.sh/hook: pre-install` — Secret is created at install time only.
On `helm upgrade`, Helm does NOT touch the Secret (no three-way merge),
so keys populated by Reflector persist across every chart bump.
- `helm.sh/hook-delete-policy: before-hook-creation` — On a re-install,
Helm deletes the previous Secret first so the hook recreates clean.
- `helm.sh/resource-policy: keep` — `helm uninstall` does NOT delete the
Secret (paired with hook means standard upgrade path never sees a delete).
- Hook resources are NOT recorded in the Helm release manifest, so they're
invisible to `helm upgrade`'s three-way merge.
Also drops the inline `data:` block (kept from 1.2.7) — Reflector still
populates everything from harbor-pg-app once CNPG bootstraps the source.
Bumps bp-harbor 1.2.7 → 1.2.8, bootstrap-kit refs (_template, otech, omantel).
Closes #585
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
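The three annotations listed in the 1.2.8 commit above can be checked mechanically. Below is a minimal lint sketch, assuming the Secret manifest has already been parsed into a Python dict; the function name and the sample manifest are hypothetical, and only the annotation keys and values come from the commit message.

```python
# Annotation keys and values exactly as listed in the commit message.
REQUIRED_ANNOTATIONS = {
    "helm.sh/hook": "pre-install",
    "helm.sh/hook-delete-policy": "before-hook-creation",
    "helm.sh/resource-policy": "keep",
}


def hook_secret_problems(manifest: dict) -> list:
    """Return human-readable problems; an empty list means the manifest
    matches the shape the commit describes."""
    annotations = (manifest.get("metadata") or {}).get("annotations") or {}
    problems = []
    for key, want in REQUIRED_ANNOTATIONS.items():
        got = annotations.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    # The commit also drops the inline data block so Reflector owns every key.
    if manifest.get("data"):
        problems.append("data: chart should not declare any keys")
    return problems


if __name__ == "__main__":
    secret = {
        "kind": "Secret",
        "metadata": {
            "name": "harbor-database-secret",
            "annotations": dict(REQUIRED_ANNOTATIONS),
        },
    }
    print(hook_secret_problems(secret))  # no problems for this manifest
```

A check like this could run against rendered chart output in a chart test, though whether the repository does so is not stated in the commit.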
||
|
|
09208ca58f
|
fix(bp-harbor): omit data block in harbor-database-secret — Helm overwrite regression (1.2.7) (#602)
On every helm upgrade, Helm's three-way merge resets `data.password` and `data.HARBOR_DATABASE_PASSWORD` to "" because the chart declares them empty in the template. After Reflector populates them from `harbor-pg-app`, the next bp-harbor upgrade silently empties them again; harbor-core then crashloops on the next pod restart with "password authentication failed".
Observed on otech22 after the 1.2.5→1.2.6 Flux upgrade: harbor-database-secret.password went from 64 bytes back to 0 bytes, harbor-core entered CrashLoopBackOff. Resolved at runtime by touching harbor-pg-app to bump its resourceVersion and re-trigger Reflector, but the architectural fix is needed so it doesn't recur on the next chart upgrade.
Fix: drop the entire `data:` block from templates/database-secret.yaml. The Secret is created by Helm with no data keys (Helm owns nothing in the data field). Reflector adds ALL keys from `harbor-pg-app` (password, HARBOR_DATABASE_PASSWORD, username, host, dbname, jdbc-uri, etc.) on the first SecretWatcher event after CNPG bootstraps the source. On subsequent helm upgrades, Helm's three-way merge has nothing to overwrite in `data:` because the chart no longer declares any keys there.
Bumps bp-harbor 1.2.6 → 1.2.7, bootstrap-kit refs (_template, otech, omantel).
Closes #585 (regression)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
8d50402038
|
fix(bp-harbor): remove cnpg-app-annotator Job — CNPG inheritedMetadata handles annotation (1.2.6) (#601)
The post-install Job `harbor-pg-app-annotator` (with curlimages/curl:8.7.1) is no longer needed: bp-harbor 1.2.5 already uses CNPG's `inheritedMetadata` stanza in cnpg-cluster.yaml to stamp `reflection-allowed: true` onto `harbor-pg-app` at CNPG bootstrap time. The Job was causing ErrImagePull on otech22 because Docker Hub is proxied through Harbor itself (chicken-and-egg).
Removes:
- templates/cnpg-app-annotator-job.yaml
- templates/cnpg-app-annotator-rbac.yaml
- values.yaml cnpgAnnotator section
Updates database-secret.yaml comment to reflect the inheritedMetadata approach.
Bumps Chart.yaml 1.2.5 → 1.2.6, bootstrap-kit refs (_template, otech, omantel).
Closes #585
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
b1a25c4235
|
fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598) (#600)
Bug A (bp-keycloak@1.2.2): HTTPRoute backendService default was `<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak) but bitnami's fullname helper trims the chart-name suffix when Release.Name already contains it, so the Service is just `keycloak`. Changed default to `.Release.Name`. Sovereign realm was already imported (config-cli ran successfully); only the Gateway routing was broken, returning HTTP 500.
Bug B (bp-openbao@1.2.6): auto-unseal-rbac SA/Role/RoleBinding had `helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The `hook-succeeded` clause caused Helm to delete the SA immediately after the weight-0 RBAC hook completed, before the weight-5 init Job pod could mount its SA token and start. Removed all hook annotations from the RBAC resources so they are managed by regular Helm release lifecycle (created before hooks, never deleted mid-install).
Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6.
Verified on otech22 (manual remediation): Keycloak sovereign realm OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
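Bug A comes down to how a chart derives its resource name from the release name. The sketch below mirrors the widely used Helm `fullname` helper convention (use the release name alone when it already contains the chart name, otherwise join the two, then truncate to 63 characters and trim one trailing hyphen). That bitnami's helper behaves exactly this way is an assumption based on the commit's description, and the Python function name is hypothetical.

```python
def fullname(release_name: str, chart_name: str) -> str:
    """Approximate the conventional Helm fullname helper.

    If the release name already contains the chart name, the release name is
    used as-is; otherwise "<release>-<chart>" is used. The result is cut to
    63 characters and a single trailing "-" is trimmed.
    """
    if chart_name in release_name:
        name = release_name
    else:
        name = f"{release_name}-{chart_name}"
    name = name[:63]
    return name[:-1] if name.endswith("-") else name


if __name__ == "__main__":
    # Release "keycloak" of chart "keycloak": the Service is plain "keycloak",
    # so a backend default of "<release>-keycloak" points at nothing.
    print(fullname("keycloak", "keycloak"))
    # A release name that does not contain the chart name gets the suffix.
    print(fullname("sso", "keycloak"))
```

Defaulting the backend to `.Release.Name`, as the commit does, matches the first case only; a release name that does not contain the chart name would presumably need an explicit override.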
||
|
|
cba1b5070a
|
fix(bp-gitea+harbor): use CNPG inheritedMetadata to propagate reflector annotations to pg-app Secret (#595)
The Cluster CR `metadata.annotations` are NOT propagated by CNPG onto the
generated `{name}-app` Secrets. Reflector requires the SOURCE Secret (e.g.
`gitea-pg-app`) to carry `reflection-allowed: "true"` before it will copy
data into the DESTINATION Secret (`gitea-database-secret`). On otech22 this
caused `gitea-database-secret` to stay empty indefinitely — gitea init container
failed auth with "password authentication failed for user gitea".
Fix: use CNPG's `inheritedMetadata.annotations` stanza (v1.24+) to instruct
CNPG to annotate all generated Secrets with the reflector permission annotations.
Applied to both bp-gitea (1.2.0→1.2.1) and bp-harbor (1.2.4→1.2.5) since
harbor-pg-app had the same issue.
Bootstrap-kit: bump bp-gitea chart ref 1.2.0→1.2.1 (template + otech + omantel).
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
fe03b8cc42
|
fix(bp-harbor): use curl for CNPG annotator PATCH + add values defaults (1.2.4) (#594)
busybox wget does not support --method=PATCH (only GET/POST). The harbor-pg-app-annotator Job silently succeeded without actually patching harbor-pg-app, leaving harbor-database-secret empty on fresh install.
Fixes:
1. Switch cnpg-app-annotator-job.yaml from busybox:1.36.1 + wget to curlimages/curl:8.7.1 + curl -X PATCH. curl natively supports all HTTP verbs. HTTP response code checked explicitly; non-2xx exits 1 so the Job retries instead of silently passing as a no-op.
2. Add cnpgAnnotator.image stanza to values.yaml (was missing; prior charts defaulted via nil-safe dict fallback but the section was never actually written to values.yaml). Defaults to curlimages/curl:8.7.1.
3. readOnlyRootFilesystem: false (curl writes /tmp/patch-response.json for error diagnostics).
4. Bump chart 1.2.3 → 1.2.4.
Closes #585
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
97abf9dedb
|
fix(bp-harbor): nil-safe image value extraction in cnpg-app-annotator Job (#593)
`.Values.cnpgAnnotator.image.repository` triggers a nil pointer error when the values tree is partially absent in Helm's default-values render. Use chained `| default dict` assignments to safely extract image repo/tag/pullPolicy. Fixes blueprint-release smoke render failure on 1.2.3.
Closes #585
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|
|
74d526c276
|
fix: bp-gateway-api 5→10 CRDs + bp-gitea CNPG + bp-harbor CNPG race fix + DAG audit (#592)
* fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) The bundled Bitnami postgresql subchart pulls docker.io/bitnamilegacy/postgresql which is unavailable (DH deprecated namespace) — gitea-postgresql-0 stuck in ImagePullBackOff on otech22, cascading to gitea Init:CrashLoopBackOff. Mirrors the bp-harbor pattern (PR #578): provision a CNPG Cluster CR (gitea-pg, namespace gitea, 5Gi, pg16) + a reflector-managed gitea-database-secret, wiring GITEA__database__PASSWD from the CNPG-generated gitea-pg-app Secret. All Bitnami subchart config removed; postgresql.enabled: false. Bootstrap-kit (template + otech + omantel): bump bp-gitea 1.1.2 → 1.2.0, add dependsOn: bp-cnpg so the postgresql.cnpg.io/v1 CRD is registered before the Capabilities gate in cnpg-cluster.yaml fires. omantel overlay migrated from legacy ingress: to gateway: (Cilium Gateway API, issue #387). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(dependency-audit): add bp-reflector (5a) to expected DAG + external-dns dep edge bp-reflector was added to the bootstrap-kit (slot 05a) in issue #543 but was never registered in scripts/expected-bootstrap-deps.yaml, causing the dependency-graph-audit CI gate to error on every PR that includes this branch. Also declare bp-reflector in bp-external-dns's depends_on to match the actual HR file (12-external-dns.yaml dependsOn bp-reflector). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(bp-gateway-api): update CRD-count test 5→10 for experimental channel + DAG audit Two fixes to unblock bp-gateway-api:1.1.0 OCI publish and the dependency-graph-audit CI gate: 1. crd-render.sh: expect 10 CRDs (experimental channel) not 5. Chart 1.1.0 vendors experimental-install.yaml (TLSRoute, TCPRoute, UDPRoute, BackendLBPolicy, BackendTLSPolicy in addition to 5 standard CRDs) because Cilium 1.16.x checks for TLSRoute at operator startup. 
Without this fix the blueprint-release workflow for 1.1.0 fails the chart-test step and never pushes to GHCR — leaving all 13 dependent HRs stuck dependency-not-ready on every Sovereign. 2. expected-bootstrap-deps.yaml: add bp-reflector (slot 5a) and update bp-external-dns depends_on to include bp-reflector. bp-reflector was added to the bootstrap-kit in issue #543 but was missing from the expected DAG, causing dependency-graph-audit ERRORs on every PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: hatiyildiz <hatice@openova.io> |
||
|
|
64de55d72f
|
fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588) (#590)
* fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588) trivy-operator exits 137 (OOM) on startup on a full Sovereign (38 HRs, ~200 pods). The operator initialises watch-cache controllers for every resource kind it manages across all namespaces; at 38 HRs the cache peak exceeds 256Mi before steady-state is reached. Raise the operator container memory limit from 256Mi to 512Mi, which is the stable floor measured on otech22 during Phase-8a handover testing. Bump bp-trivy 1.0.1 → 1.0.2. Bootstrap-kit slots updated for _template, otech.omani.works, omantel.omani.works. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> * fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml The dependency-graph-audit check was failing because: 1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml 2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector] in the HelmRelease but expected-bootstrap-deps.yaml only declared [bp-cert-manager, bp-powerdns] Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update bp-external-dns depends_on to include bp-reflector in the expected DAG. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io> |
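The dependency-graph-audit failure described in this commit is a mismatch between two maps: the dependsOn edges found in the HelmRelease files and the edges declared in scripts/expected-bootstrap-deps.yaml. The following is a minimal sketch of that comparison, not the repository's actual audit script; the function name and message wording are hypothetical, and both inputs are assumed to be already parsed into plain dicts.

```python
def audit_dependencies(expected: dict, actual: dict) -> list:
    """Compare declared edges (expected) with edges found in HelmReleases (actual).

    Both arguments map a blueprint name to the list of blueprints it depends on.
    Returns a sorted list of discrepancy messages; empty means the DAGs agree.
    """
    problems = []
    for name in sorted(set(expected) | set(actual)):
        if name not in expected:
            problems.append(f"{name}: present in bootstrap-kit but missing from expected DAG")
            continue
        if name not in actual:
            problems.append(f"{name}: declared in expected DAG but no HelmRelease found")
            continue
        undeclared = sorted(set(actual[name]) - set(expected[name]))
        stale = sorted(set(expected[name]) - set(actual[name]))
        if undeclared:
            problems.append(f"{name}: undeclared edges {undeclared}")
        if stale:
            problems.append(f"{name}: stale expected edges {stale}")
    return problems


if __name__ == "__main__":
    # State before the fix, using the edges quoted in the commit message.
    actual = {
        "bp-reflector": ["bp-cert-manager"],
        "bp-external-dns": ["bp-cert-manager", "bp-powerdns", "bp-reflector"],
    }
    expected = {"bp-external-dns": ["bp-cert-manager", "bp-powerdns"]}
    for line in audit_dependencies(expected, actual):
        print(line)
```

With the two additions the commit makes to the expected file, the same function returns an empty list.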
||
|
|
4b2ae76cfd
|
fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587) (#589)
* fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587) The native pdns provider in external-dns v0.15.1 does not accept --pdns-api-version; the binary fatals at startup with: 'unknown long flag --pdns-api-version' causing CrashLoopBackOff (53+ restarts on otech22). The provider auto-negotiates the PowerDNS API version — the flag is superfluous and broken. Remove it from extraArgs. Bump bp-external-dns 1.1.3 → 1.1.4. Bootstrap-kit slots updated for _template, otech.omani.works, omantel.omani.works. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> * fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml The dependency-graph-audit check was failing because: 1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml 2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector] in the HelmRelease but expected-bootstrap-deps.yaml only declared [bp-cert-manager, bp-powerdns] Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update bp-external-dns depends_on to include bp-reflector in the expected DAG. Co-Authored-By: alierenbaysal <alierenbaysal@openova.io> --------- Co-authored-by: alierenbaysal <alierenbaysal@openova.io> |
||
|
|
8d2ba0495d
|
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584) |
||
|
|
5a403e66b1
|
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:
FATAL: database "registry" does not exist (SQLSTATE 3D000)
Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.
Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.
Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix
Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:
1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
- values.yaml: `webhook.solverName: powerdns` → `pdns`
- The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
"powerdns" cert-manager gets 404 → "server could not find the resource".
2. cert-manager-dynadot-webhook solver_test.go mock format:
- writeOK() and error injection used old ResponseHeader-wrapped format
- Real api3.json returns ResponseCode/Status directly in SetDnsResponse
- This caused the image build to fail at
|
||
|
|
73ae746637
|
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API CRDs at startup and disables its gateway controller if they are absent (a static, one-shot decision). Cloud-init installs k3s+Cilium first, then Flux reconciles bp-gateway-api minutes later, so the operator always starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.
Three-part permanent fix:
1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl. TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires TLSRoute CRD to be present; without it the operator's capability check fails entirely and disables the gateway controller.
2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true" to force GatewayClass creation regardless of CRD presence at Helm render time. Upstream default "auto" skips GatewayClass when the gateway API CRDs are absent at install time (Capabilities check).
3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0 and ship experimental channel (TLSRoute, TCPRoute, UDPRoute, BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed status.supportedFeatures from string[] to object[]; Cilium 1.16.5 writes the old string format and the v1.2.0 CRD rejects the status patch with "must be of type object: string", leaving GatewayClass permanently Unknown/Pending. v1.1.0 retains string schema.
Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17 adopts the v1.2.0 object schema for supportedFeatures.
Closes #503
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
83ec889f06
|
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)
Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.
Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
|
||
|
|
2adc3a9493
|
fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase (#579)
Harbor upstream always connects to a database named 'registry' (harbor.database.external.coreDatabase default). The CNPG Cluster was initialised with database='harbor', causing:
FATAL: database "registry" does not exist (SQLSTATE 3D000)
Fix: change postgres.cluster.database default from 'harbor' → 'registry' in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap and Harbor's coreDatabase now use 'registry'.
Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run against harbor-pg-1. harbor-core is now 1/1 Running.
Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
b647aa2561
|
fix(bp-harbor): provision harbor-pg CNPG cluster + database-secret (Closes #566) (#578)
Replace Helm lookup in database-secret.yaml with reflector annotation: harbor-database-secret now reflects harbor-pg-app via reflector.v1.k8s.emberstack.com/reflects. This fixes the race between Helm rendering (fresh install) and CNPG cluster bootstrap; reflector is event-driven and propagates the CNPG password within seconds of harbor-pg-app being created, with no operator action required.
Also includes:
- templates/cnpg-cluster.yaml: harbor-pg CNPG Cluster (1 inst, 5Gi, pg16)
- values.yaml: postgres: block + database.external.host = harbor-pg-rw
- Chart 1.2.0 → 1.2.1; bootstrap-kit refs updated (_template, otech, omantel)
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
58cf297800
|
fix(bp-seaweedfs): remove trailing slash in registry — fixes double-slash image ref (Closes #568) (#576)
`registry: "chrislusf/"` in values.yaml produced `chrislusf//seaweedfs:4.22` because the vendored chart's _helpers.tpl renders `printf "%s/%s:%s" $registryName $name $tag`: the trailing slash joined with the separator slash made an invalid image reference.
Fix: `registry: "chrislusf/"` → `registry: "chrislusf"`.
Bump bp-seaweedfs 1.1.0 → 1.1.1. Update bootstrap-kit refs in _template, otech.omani.works, omantel.omani.works (1.0.1 → 1.1.1).
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |
||
|
|
5796de12bc
|
fix(bp-spire): re-enable oidc-discovery-provider ClusterSPIFFEID to fix init stuck (Closes #571) (#575)
The oidc-discovery-provider ClusterSPIFFEID was disabled at bootstrap to work around a CRD-ordering race (spire-controller-manager applying the template before CRDs were registered). That race was fixed in bp-spire 1.1.4 by listing spire-crds as the first Helm dependency.
With all ClusterSPIFFEIDs still disabled the oidc-discovery-provider init container blocks indefinitely with "PermissionDenied: no identity issued"; the controller-manager never creates the registration entry so no SVID is issued.
Re-enable oidc-discovery-provider identity. The default, test-keys, and child-servers identities remain disabled (not needed for bootstrap).
Also carries the global.imageRegistry field added by issue #560 (was 1.1.5 in working tree, now bumped to 1.1.6 for this fix). Bootstrap-kit slot 06 updated from 1.1.4 → 1.1.6.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io> |
||
|
|
b88e98026f
|
fix(bp-falco): rename rules_file → rules_files (Falco 0.36+ canonical key, Closes #570) (#574)
Falco 0.36+ uses `rules_files` (plural) as the canonical multi-file rules key. Setting the deprecated `rules_file` (singular) alongside the upstream subchart's `rules_files` default causes Falco to detect a config conflict and abort startup with CrashLoopBackOff on otech22.
Bump bp-falco 1.0.0 → 1.0.1. Bootstrap-kit slot 31 updated.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io> |
||
|
|
06844d3a70
|
fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569) (#573)
bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but bp-external-dns still had `powerdnsNamespace: openova-system` in its NetworkPolicy egress rule and `--pdns-server=...openova-system...` in extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation.
Fix:
- externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns
- extraArgs --pdns-server: ...openova-system... → ...powerdns...
Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io> |
||
|
|
c59f0496a2
|
fix(bp-mimir): disable ingest_storage to fix Kafka CrashLoop (Closes #567) (#572)
Upstream mimir-distributed 6.0.6 can boot in ingest-storage mode which requires a Kafka endpoint. Setting kafka.enabled:false only disables the bundled Kafka subchart; it does not tell the Mimir process itself to use classic mode.
Adding mimir.structuredConfig.ingest_storage.enabled:false forces the classic blocks-storage ingester path (no Kafka dependency), matching Catalyst's NATS JetStream event bus (ADR-0001).
Bump bp-mimir 1.0.0 → 1.0.1. Bootstrap-kit slot 23 updated.
Co-authored-by: alierenbaysal <alierenbaysal@openova.io> |
||
|
|
ad9cfc0f23
|
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml; Catalyst job images now prefixed with global.imageRegistry when non-empty. Default (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed with global.imageRegistry when non-empty. Verified: dnsdist image rewrites to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.
Subchart-only charts (global.imageRegistry stub added; threading via per-component subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring
vcluster: N/A, no chart directory under platform/vcluster/chart/
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> |