Commit Graph

248 Commits

Author SHA1 Message Date
e3mrah
5f6d1c7d86
diag(bp-openbao): add set -x to init script (chart 1.2.8) (#658)
otech37/38 hit the same wall: server reaches Initialized=true but
openbao-unseal-keys Secret is never persisted; the FIRST init pod's
logs that ran fresh init are reaped by container restart before we
can capture what happened.

Add 'set -x' to shell-trace every command. Now even if the script
crashes mid-run, pod logs show the last command attempted. The
captured diagnostic on the next provision will tell us whether the
failure is in /tmp/init-output.json parsing, the persist wget, or
elsewhere.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 11:09:05 +04:00
e3mrah
8447930bf7
fix(bp-openbao): fail-fast on unseal-keys persist (chart 1.2.7) (#657)
* fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13)

* fix(wizard): blueprint deps from Flux HelmRelease.dependsOn (single source of truth)

The wizard's componentGroups.ts carried hand-maintained `dependencies:
[...]` arrays that deviated from the real Flux install graph in
clusters/_template/bootstrap-kit/*.yaml. Examples (otech34 surfaced
this):

  componentGroups.ts          Flux HelmRelease.dependsOn
  ----------------------      ---------------------------
  keycloak: [cnpg]            keycloak: [cert-manager, gateway-api]
  openbao:  []                openbao:  [spire, gateway-api, cnpg]
  harbor:   [cnpg, seaweedfs, harbor:   [cnpg, cert-manager,
              valkey]                    gateway-api]

Founder's directive: "all the real dependencies are related to real
flux related dependencies, if you are hosting irrelevant hardcoded
baseless wizard catalog dependencies, I dont know where they are
coming from. The single source of truth for the dependencies is
flux!!!" — 2026-05-03

This commit:
  1. Adds scripts/generate-blueprint-deps.sh that parses every
     bootstrap-kit HelmRelease and emits blueprint-deps.generated.json
     keyed by bare component id (bp- prefix stripped on both source
     and target side).
  2. Commits the generated JSON.
  3. Adds products/catalyst/bootstrap/ui/src/data/blueprintDeps.ts
     thin TS wrapper exporting BLUEPRINT_DEPS + depsFor(id).
  4. Patches componentGroups.ts so every RAW_COMPONENT's
     `dependencies` field is OVERRIDDEN at module load with the
     Flux-canonical list (the inline `dependencies: [...]` literals
     are now ignored — Flux is canonical).

Follow-ups (not in this PR):
  - CI drift check that re-runs the script and diffs the JSON.
  - Strip the inline `dependencies: [...]` arrays entirely once the
    drift check is green.
  - Wire the FlowPage edge-rendering to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(flowpage): replace second hardcoded BOOTSTRAP_KIT_DEPS table with Flux SoT

PR #652 fixed the wizard catalog. FlowPage.tsx had a SECOND independent
hardcoded dep map at lines 105-155 that the founder caught — most
visibly:
  keycloak: ['cert-manager', 'openbao']  ← FALSE; Flux says no openbao
The reason the founder kept seeing the spurious arrow on the Flow page.

Replace the local table with an import of BLUEPRINT_DEPS from
data/blueprintDeps.ts (single source of truth — generated from
clusters/_template/bootstrap-kit/*.yaml by
scripts/generate-blueprint-deps.sh).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): don't regress status to pending after exec started

helmwatch_bridge.go's OnHelmReleaseEvent unconditionally overwrote the
Job's Status with jobStatusFromHelmState(state) on every event. Flux
oscillates HelmReleases between Reconciling and DependencyNotReady
while a dependency (e.g. bp-openbao waiting on bp-spire) isn't Ready
— helmwatch maps both back to HelmStatePending. The bridge then flips
the row to status='pending' even though an active Execution is
streaming exec log lines (startedAt + latestExecutionId already set).

Founder caught this on otech34's install-external-secrets job:
status='pending' on the Jobs page while Exec Log was actively
tailing.

Fix: monotonic guard — once activeExecID[component] != "" (Execution
allocated), refuse to regress nextStatus to StatusPending. Treat
ongoing-after-start as Running so the row reflects the live stream.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(jobs): cascade Failed status through dependsOn (fail-fast)

Founder caught on otech34: install-openbao=failed but
install-external-secrets stayed pending forever ('masking it and
waiting unnecessarily'). Flux's HelmRelease for external-secrets is
in DependencyNotReady, helmwatch maps that to StatePending,
bridge writes Status=pending — no signal that the upstream FAILED
rather than 'still installing'.

Add a post-rollup sweep in deriveTreeView that propagates Failed
through the dependsOn graph. Up to 8 sweeps cover the deepest
bootstrap-kit chain. Idempotent on read; reverses if openbao recovers
because it operates on the live snapshot.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): bump kernel inotify limits — bp-openbao init was crashing 'too many open files'

Diagnosed live during otech35: openbao-init pod crash-looped 4×
on 'bao operator init' with:
  failed to create fsnotify watcher: too many open files
Flux mapped to InstallFailed → RetriesExceeded → cascading through
external-secrets and external-secrets-stores. The wizard masked the
OS-level root cause behind a generic InstallFailed.

Hetzner Ubuntu 24.04 ships fs.inotify.max_user_instances=128 — far
too low for a 35-component bootstrap-kit (k3s kubelet + Flux helm-
controller + 11 CNPG operators + Reflector + Cert-Manager + bao +
keycloak-config-cli + ... each grabs instance slots). The instance
count exhausts within minutes; the next process to ask for an
inotify slot gets EMFILE.

Bump well above k8s/k3s production guidance so future blueprints
don't tickle the same wall:
  fs.inotify.max_user_instances = 8192
  fs.inotify.max_user_watches   = 1048576
  fs.inotify.max_queued_events  = 16384

Applied via /etc/sysctl.d/99-catalyst-inotify.conf + 'sysctl --system'
in runcmd. Permanent across reboots.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-openbao): fail-fast when unseal-keys persist fails (chart 1.2.7)

otech37 caught: bao operator init succeeded server-side
(Initialized=true), but the script's wget POST to persist
openbao-unseal-keys Secret silently failed (|| true), and the PUT
fallback also silenced. Subsequent Job retries hit Initialized=true
on the idempotent path, found no openbao-unseal-keys Secret, and
FATAL'd with 'manual recovery: wipe data-openbao-0 PVC' — every
retry forever.

Hardening:
  1. Capture POST + PUT stdout/stderr to /tmp files instead of
     /dev/null so the FATAL path can echo them.
  2. PUT no longer || true — if both POST and PUT fail, exit 1.
  3. Add read-back verification: GET the persisted Secret and
     assert 'unseal-keys-b64' field is present. Catches
     partial-write / eventual-consistency cases.

Bumps chart 1.2.6 -> 1.2.7 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 10:51:21 +04:00
e3mrah
6baf7e56e7
fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) (#651)
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:26:23 +04:00
e3mrah
d519dc8ba2
fix(bp-harbor): switch sync Job to curl-against-apiserver (chart 1.2.12) (#650)
rancher/kubectl is distroless (no /bin/sh) so the inline shell script
can't run. Replace with curlimages/curl which has alpine sh + curl.
Talk to k8s API directly via the in-pod ServiceAccount token. The
PATCH merges password + HARBOR_DATABASE_PASSWORD into the existing
pre-install-hook Secret without touching annotations.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:15:23 +04:00
e3mrah
08432b540e
fix(bp-harbor): switch sync Job to rancher/kubectl (chart 1.2.11) (#649)
bitnami/kubectl moved to sha256-only tags; bitnami/kubectl:1.31.4
returns 'not found' from Docker Hub. rancher/kubectl is always
available on k3s clusters. Bumps chart 1.2.10 -> 1.2.11.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:04:15 +04:00
e3mrah
de51fa3f7a
fix(bp-harbor): post-install Job copies CNPG password (chart 1.2.10) (#648)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)

CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal

Commit b1a25c42 (#600) removed the helm.sh/hook-delete-policy from the
auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install
(the old hook-succeeded clause caused the SA to disappear before the
init Job could mount its token). The chart-test still expected ≥5
before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs).

Result: Blueprint Release for #600 (run 25251129679) failed at the test
gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main
already references it. otech30 caught this live: bp-openbao HR stuck
with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'.

Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao
1.2.6 onto GHCR.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-harbor): replace Reflector race with deterministic post-install Job (chart 1.2.10)

bp-harbor's harbor-database-secret relied on Reflector copying from CNPG-
emitted harbor-pg-app via a 'reflects:' destination annotation. On every
fresh Sovereign Reflector logs once at install:
    Could not update harbor/harbor-database-secret —
    Source harbor/harbor-pg-app could not be found
and never refires when CNPG creates the source ~30s later. Even with
'auto-enabled: true' on the source's inheritedMetadata, Reflector's
auto-reflect copies the SOURCE name (harbor-pg-app), not the explicit
destination harbor-database-secret. Result: harbor-database-secret stays
empty forever; harbor-core CrashLoops with 'couldn't find key password
in Secret harbor/harbor-database-secret'. Caught live on otech26-30.

Replace with a Helm post-install/post-upgrade Job that:
  - polls for harbor-pg-app to exist (CNPG provisions it ~30-60s after
    Cluster Ready)
  - copies password into harbor-database-secret with both 'password'
    and 'HARBOR_DATABASE_PASSWORD' keys
  - exits 0; Helm marks the hook complete

The Job is idempotent (re-running on upgrade overwrites identically)
and deterministic (no event-watcher race). The placeholder Secret stays
in place so kubectl-get returns Found before the Job runs.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:52:54 +04:00
e3mrah
da61ecdc79
test(bp-openbao): align test expectation with #600 RBAC-hook removal (#647)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)

CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal

Commit b1a25c42 (#600) removed the helm.sh/hook-delete-policy from the
auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install
(the old hook-succeeded clause caused the SA to disappear before the
init Job could mount its token). The chart-test still expected ≥5
before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs).

Result: Blueprint Release for #600 (run 25251129679) failed at the test
gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main
already references it. otech30 caught this live: bp-openbao HR stuck
with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'.

Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao
1.2.6 onto GHCR.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:46:31 +04:00
e3mrah
a359278b7d
fix(bp-spire): disable oidc ClusterSPIFFEID + chart bump (1.1.7) (#645)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-spire): Chart.lock missing spire-crds → CRDs never installed (chart 1.1.7)

bp-spire 1.1.4 added spire-crds 0.5.0 as a Helm dependency to register
the spire.spiffe.io/v1alpha1 CRDs (ClusterSPIFFEID, ClusterStaticEntry,
ClusterFederatedTrustDomain) before the spire subchart's controller-
manager Deployment starts. But Chart.lock was never regenerated — only
contained the original `spire` entry. As a result every Blueprint
Release packaged the chart WITHOUT spire-crds, the Sovereign saw no
CRDs registered, and Helm install failed with:

  no matches for kind "ClusterSPIFFEID" in version "spire.spiffe.io/v1alpha1"

bp-openbao / bp-external-secrets / bp-nats-jetstream all dependsOn
bp-spire so this single bug cascades and blocks 5+ HRs from reaching
Ready=True. Caught live during otech29.

Fix: ran `helm dependency update` to regenerate Chart.lock + pull both
spire and spire-crds tarballs; bumps bp-spire 1.1.6 -> 1.1.7 and
bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:27:33 +04:00
e3mrah
8bb66fe43e
fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 00:16:34 +04:00
e3mrah
2e9cfd4a57
fix(bp-cert-manager-dynadot-webhook): pin SHA + add ghcr-pull imagePullSecret (#643)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:52:42 +04:00
e3mrah
487ebebda2
fix(bp-vpa): drop registry.k8s.io/ prefix in repository (upstream prepends it) (#641)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 23:32:35 +04:00
e3mrah
737574b19a
feat(bp-keycloak): Phase-8b sovereign realm — token-exchange, catalyst-ui/api-server OIDC clients, SMTP, bump 1.2.2 → 1.3.0 (#604) (#609)
Adds the full Phase-8b identity surface required by the seamless handover flow:

- Token exchange enabled on sovereign realm (attributes.token-exchange: true)
- catalyst-ui public PKCE client: redirectUris + webOrigins keyed on
  console.<sovereignFQDN>, groups + requiredActions in ID token
- catalyst-api-server confidential service-account client: impersonation +
  manage-users + view-users + query-users roles on realm-management; client
  secret injected at provisioning time via .Values.catalystApiServerClientSecret
- WebAuthn (webauthn-register + webauthn-register-passwordless) registered as
  Required Action options on the realm
- UPDATE_PASSWORD set as defaultAction: true for new users
- smtpServer block: pre-handover default = contabo Stalwart relay; fully
  operator-configurable via .Values.smtp.* (Phase-8c-acceptable)
- required-actions client scope + oidc-usermodel-attribute-mapper for
  requiredActions claim in ID token (catalyst-ui first-login UX)

Architectural change: realm JSON moved from inline values.yaml (keycloak:
subchart key — no parent scope access) to a parent-chart template
platform/keycloak/chart/templates/configmap-sovereign-realm.yaml, which can
read .Values.sovereignFQDN and .Values.smtp.* for per-Sovereign interpolation.
The upstream bitnami chart's keycloakConfigCli.existingConfigmap is pointed at
this ConfigMap. Anti-duplication seam: configmap-sovereign-realm.yaml.

New values.yaml keys:
  sovereignFQDN: "" (REQUIRED — per-Sovereign overlay supplies it)
  sovereignRealm.enabled: true
  catalystApiServerClientSecret: "" (REQUIRED — provisioner seals and injects)
  smtp.host/port/from/user/password/ssl/starttls/auth

New bootstrap-kit file:
  09a-keycloak-catalyst-api-secret.yaml — SealedSecret template for
  keycloak-catalyst-api-server-credentials in catalyst-system namespace;
  provisioner fills encryptedData fields at deploy time

Bootstrap-kit refs bumped 1.2.x → 1.3.0 in _template, otech, omantel.
helm template clean with sovereignFQDN=otech.omani.works.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:05:27 +04:00
e3mrah
93627ada20
fix(bp-harbor): convert harbor-database-secret to Helm pre-install hook (1.2.8) (#603)
The 1.2.7 fix dropped the `data:` block from the chart template, but
Helm's three-way merge still owns the Secret as a release resource and
resets `data: {}` (no keys) on every chart upgrade — verified on otech22
where 1.2.6→1.2.7 reconcile wiped Reflector-populated keys back to nil.

Architectural fix: convert the Secret to a Helm pre-install hook.

  - `helm.sh/hook: pre-install` — Secret is created at install time only.
    On `helm upgrade`, Helm does NOT touch the Secret (no three-way merge),
    so keys populated by Reflector persist across every chart bump.
  - `helm.sh/hook-delete-policy: before-hook-creation` — On a re-install,
    Helm deletes the previous Secret first so the hook recreates clean.
  - `helm.sh/resource-policy: keep` — `helm uninstall` does NOT delete the
    Secret (paired with hook means standard upgrade path never sees a delete).
  - Hook resources are NOT recorded in the Helm release manifest, so they're
    invisible to `helm upgrade`'s three-way merge.

Also drops the inline `data:` block (kept from 1.2.7) — Reflector still
populates everything from harbor-pg-app once CNPG bootstraps the source.

Bumps bp-harbor 1.2.7 → 1.2.8, bootstrap-kit refs (_template, otech, omantel).

Closes #585

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:57:55 +04:00
e3mrah
09208ca58f
fix(bp-harbor): omit data block in harbor-database-secret — Helm overwrite regression (1.2.7) (#602)
On every helm upgrade, Helm three-way merge resets `data.password` and
`data.HARBOR_DATABASE_PASSWORD` to "" because the chart declares them
empty in the template. After Reflector populates them from `harbor-pg-app`,
the next bp-harbor upgrade silently empties them again — harbor-core then
crashloops on the next pod restart with "password authentication failed".

Observed on otech22 after the 1.2.5→1.2.6 Flux upgrade: harbor-database-
secret.password went from 64 bytes back to 0 bytes, harbor-core entered
CrashLoopBackOff. Resolved at runtime by touching harbor-pg-app to bump
its resourceVersion and re-trigger Reflector, but the architectural fix
is needed so it doesn't recur on the next chart upgrade.

Fix: drop the entire `data:` block from templates/database-secret.yaml.
The Secret is created by Helm with no data keys (Helm owns nothing in
the data field). Reflector adds ALL keys from `harbor-pg-app` (password,
HARBOR_DATABASE_PASSWORD, username, host, dbname, jdbc-uri, etc.) on
the first SecretWatcher event after CNPG bootstraps the source. On
subsequent helm upgrades, Helm's three-way merge has nothing to overwrite
in `data:` because the chart no longer declares any keys there.

Bumps bp-harbor 1.2.6 → 1.2.7, bootstrap-kit refs (_template, otech, omantel).

Closes #585 (regression of)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:53:37 +04:00
e3mrah
8d50402038
fix(bp-harbor): remove cnpg-app-annotator Job — CNPG inheritedMetadata handles annotation (1.2.6) (#601)
The post-install Job `harbor-pg-app-annotator` (with curlimages/curl:8.7.1)
is no longer needed: bp-harbor 1.2.5 already uses CNPG's `inheritedMetadata`
stanza in cnpg-cluster.yaml to stamp `reflection-allowed: true` onto
`harbor-pg-app` at CNPG bootstrap time. The Job was causing ErrImagePull on
otech22 because Docker Hub is proxied through Harbor itself (chicken-and-egg).

Removes:
  - templates/cnpg-app-annotator-job.yaml
  - templates/cnpg-app-annotator-rbac.yaml
  - values.yaml cnpgAnnotator section

Updates database-secret.yaml comment to reflect the inheritedMetadata approach.

Bumps Chart.yaml 1.2.5 → 1.2.6, bootstrap-kit refs (_template, otech, omantel).

Closes #585

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:44:55 +04:00
e3mrah
b1a25c4235
fix(bp-keycloak,bp-openbao): HTTPRoute backend wrong name + RBAC hook lifecycle bug (#598) (#600)
Bug A — bp-keycloak@1.2.2: HTTPRoute backendService default was
`<release>-keycloak` (gave `keycloak-keycloak` with releaseName=keycloak)
but bitnami's fullname helper trims the chart-name suffix when Release.Name
already contains it, so the Service is just `keycloak`. Changed default to
`.Release.Name`. Sovereign realm was already imported (config-cli ran
successfully) — only the Gateway routing was broken, returning HTTP 500.

Bug B — bp-openbao@1.2.6: auto-unseal-rbac SA/Role/RoleBinding had
`helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded`. The
`hook-succeeded` clause caused Helm to delete the SA immediately after the
weight-0 RBAC hook completed, before the weight-5 init Job pod could mount
its SA token and start. Removed all hook annotations from the RBAC resources
so they are managed by regular Helm release lifecycle (created before hooks,
never deleted mid-install).

Bootstrap-kit refs bumped: bp-keycloak 1.2.0→1.2.2, bp-openbao 1.2.4→1.2.6.

Verified on otech22 (manual remediation): Keycloak sovereign realm
OIDC endpoint returns valid JSON, openbao-0 Initialized=true Sealed=false.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:43:32 +04:00
e3mrah
cba1b5070a
fix(bp-gitea+harbor): use CNPG inheritedMetadata to propagate reflector annotations to pg-app Secret (#595)
The Cluster CR `metadata.annotations` are NOT propagated by CNPG onto the
generated `{name}-app` Secrets. Reflector requires the SOURCE Secret (e.g.
`gitea-pg-app`) to carry `reflection-allowed: "true"` before it will copy
data into the DESTINATION Secret (`gitea-database-secret`). On otech22 this
caused `gitea-database-secret` to stay empty indefinitely — gitea init container
failed auth with "password authentication failed for user gitea".

Fix: use CNPG's `inheritedMetadata.annotations` stanza (v1.24+) to instruct
CNPG to annotate all generated Secrets with the reflector permission annotations.
Applied to both bp-gitea (1.2.0→1.2.1) and bp-harbor (1.2.4→1.2.5) since
harbor-pg-app had the same issue.

Bootstrap-kit: bump bp-gitea chart ref 1.2.0→1.2.1 (template + otech + omantel).

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:37:48 +04:00
e3mrah
fe03b8cc42
fix(bp-harbor): use curl for CNPG annotator PATCH + add values defaults (1.2.4) (#594)
busybox wget does not support --method=PATCH (only GET/POST). The
harbor-pg-app-annotator Job silently succeeded without actually patching
harbor-pg-app, leaving harbor-database-secret empty on fresh install.

Fixes:
1. Switch cnpg-app-annotator-job.yaml from busybox:1.36.1 + wget to
   curlimages/curl:8.7.1 + curl -X PATCH. curl natively supports all
   HTTP verbs. HTTP response code checked explicitly; non-2xx exits 1
   so the Job retries instead of silently passing with no-op.
2. Add cnpgAnnotator.image stanza to values.yaml (was missing — prior
   charts defaulted via nil-safe dict fallback but the section was
   never actually written to values.yaml). Defaults to curlimages/curl:8.7.1.
3. readOnlyRootFilesystem: false (curl writes /tmp/patch-response.json
   for error diagnostics).
4. Bump chart 1.2.3 → 1.2.4.

Closes #585

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:29:45 +04:00
e3mrah
97abf9dedb
fix(bp-harbor): nil-safe image value extraction in cnpg-app-annotator Job (#593)
.Values.cnpgAnnotator.image.repository triggers nil pointer when the
values tree is partially absent in Helm's default-values render. Use
| default dict chained assignments to safely extract image repo/tag/
pullPolicy. Fixes blueprint-release smoke render failure on 1.2.3.

Closes #585

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:22:54 +04:00
e3mrah
74d526c276
fix: bp-gateway-api 5→10 CRDs + bp-gitea CNPG + bp-harbor CNPG race fix + DAG audit (#592)
* fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584)

The bundled Bitnami postgresql subchart pulls docker.io/bitnamilegacy/postgresql
which is unavailable (DH deprecated namespace) — gitea-postgresql-0 stuck in
ImagePullBackOff on otech22, cascading to gitea Init:CrashLoopBackOff.

Mirrors the bp-harbor pattern (PR #578): provision a CNPG Cluster CR (gitea-pg,
namespace gitea, 5Gi, pg16) + a reflector-managed gitea-database-secret, wiring
GITEA__database__PASSWD from the CNPG-generated gitea-pg-app Secret. All Bitnami
subchart config removed; postgresql.enabled: false.

Bootstrap-kit (template + otech + omantel): bump bp-gitea 1.1.2 → 1.2.0, add
dependsOn: bp-cnpg so the postgresql.cnpg.io/v1 CRD is registered before the
Capabilities gate in cnpg-cluster.yaml fires. omantel overlay migrated from
legacy ingress: to gateway: (Cilium Gateway API, issue #387).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dependency-audit): add bp-reflector (5a) to expected DAG + external-dns dep edge

bp-reflector was added to the bootstrap-kit (slot 05a) in issue #543 but was
never registered in scripts/expected-bootstrap-deps.yaml, causing the
dependency-graph-audit CI gate to error on every PR that includes this branch.
Also declare bp-reflector in bp-external-dns's depends_on to match the actual
HR file (12-external-dns.yaml dependsOn bp-reflector).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(bp-gateway-api): update CRD-count test 5→10 for experimental channel + DAG audit

Two fixes to unblock bp-gateway-api:1.1.0 OCI publish and the
dependency-graph-audit CI gate:

1. crd-render.sh: expect 10 CRDs (experimental channel) not 5.
   Chart 1.1.0 vendors experimental-install.yaml (TLSRoute, TCPRoute,
   UDPRoute, BackendLBPolicy, BackendTLSPolicy in addition to 5 standard
   CRDs) because Cilium 1.16.x checks for TLSRoute at operator startup.
   Without this fix the blueprint-release workflow for 1.1.0 fails the
   chart-test step and never pushes to GHCR — leaving all 13 dependent
   HRs stuck dependency-not-ready on every Sovereign.

2. expected-bootstrap-deps.yaml: add bp-reflector (slot 5a) and update
   bp-external-dns depends_on to include bp-reflector. bp-reflector was
   added to the bootstrap-kit in issue #543 but was missing from the
   expected DAG, causing dependency-graph-audit ERRORs on every PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-02 15:20:05 +04:00
e3mrah
64de55d72f
fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588) (#590)
* fix(bp-trivy): raise operator memory limit 256Mi→512Mi — OOMKilled on 38-HR Sovereign (Closes #588)

trivy-operator exits 137 (OOM) on startup on a full Sovereign (38 HRs,
~200 pods). The operator initialises watch-cache controllers for every
resource kind it manages across all namespaces; at 38 HRs the cache
peak exceeds 256Mi before steady-state is reached.

Raise the operator container memory limit from 256Mi to 512Mi, which
is the stable floor measured on otech22 during Phase-8a handover testing.

Bump bp-trivy 1.0.1 → 1.0.2. Bootstrap-kit slots updated for _template,
otech.omani.works, omantel.omani.works.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

* fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml

The dependency-graph-audit check was failing because:
1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but
   bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml
2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector]
   in the HelmRelease but expected-bootstrap-deps.yaml only declared
   [bp-cert-manager, bp-powerdns]

Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update
bp-external-dns depends_on to include bp-reflector in the expected DAG.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 15:20:03 +04:00
e3mrah
4b2ae76cfd
fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587) (#589)
* fix(bp-external-dns): remove --pdns-api-version flag — unknown in v0.15.1 (Closes #587)

The native pdns provider in external-dns v0.15.1 does not accept
--pdns-api-version; the binary fatals at startup with:
  'unknown long flag --pdns-api-version'
causing CrashLoopBackOff (53+ restarts on otech22).

The provider auto-negotiates the PowerDNS API version — the flag is
superfluous and broken. Remove it from extraArgs.

Bump bp-external-dns 1.1.3 → 1.1.4. Bootstrap-kit slots updated for
_template, otech.omani.works, omantel.omani.works.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

* fix(ci): add bp-reflector slot 5a + bp-external-dns dep to expected-bootstrap-deps.yaml

The dependency-graph-audit check was failing because:
1. 05a-reflector.yaml exists in clusters/_template/bootstrap-kit/ but
   bp-reflector was not declared in scripts/expected-bootstrap-deps.yaml
2. bp-external-dns had dependsOn=[bp-cert-manager, bp-powerdns, bp-reflector]
   in the HelmRelease but expected-bootstrap-deps.yaml only declared
   [bp-cert-manager, bp-powerdns]

Add bp-reflector (slot 5a, depends_on: [bp-cert-manager]) and update
bp-external-dns depends_on to include bp-reflector in the expected DAG.

Co-Authored-By: alierenbaysal <alierenbaysal@openova.io>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 15:20:00 +04:00
e3mrah
8d2ba0495d
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)
2026-05-02 15:18:49 +04:00
e3mrah
5a403e66b1
fix(tls): DNS-01 wildcard TLS chain — solverName pdns, NodePort 30053, dynadot test fix (#582)
* fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase

Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(tls): DNS-01 wildcard TLS chain — solverName, NodePort 30053, dynadot test fix

Five independent fixes that together complete the DNS-01 wildcard TLS chain
for per-Sovereign certificate autonomy:

1. cert-manager-powerdns-webhook solverName mismatch (root cause of #550 echo):
   - values.yaml: `webhook.solverName: powerdns` → `pdns`
   - The zachomedia binary's Name() returns "pdns" (hardcoded). cert-manager
     calls POST /apis/<groupName>/v1alpha1/<solverName>; when solverName is
     "powerdns" cert-manager gets 404 → "server could not find the resource".

2. cert-manager-dynadot-webhook solver_test.go mock format:
   - writeOK() and error injection used old ResponseHeader-wrapped format
   - Real api3.json returns ResponseCode/Status directly in SetDnsResponse
   - This caused the image build to fail at ccc38987 so the dynadot fix
     never shipped; solver tests now pass cleanly (go test ./... OK)

3. PowerDNS NodePort 30053 anycast overlay (bootstrap-kit and template):
   - _template/bootstrap-kit/11-powerdns.yaml: adds anycast NodePort values
   - omantel + otech bootstrap-kit: same NodePort 30053 overlay applied
   - anycast-endpoint.yaml: optional nodePort field rendered in port list

4. Hetzner LB + firewall for DNS port 53 (infra/hetzner/main.tf):
   - hcloud_load_balancer_service.dns: TCP:53 → NodePort 30053
   - Firewall: TCP+UDP :53 from 0.0.0.0/0,::/0

5. dynadot-client JSON parsing fix (core/pkg/dynadot-client):
   - AddRecord + SetFullDNS: struct no longer wraps respHeader in ResponseHeader
   - client_test.go: mock responses updated to real api3.json format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:49:58 +04:00
e3mrah
73ae746637
fix(cloud-init): install Gateway API v1.1.0 CRDs before cilium so operator registers gateway controller (#581)
Root cause (otech22 2026-05-02): Cilium operator checks for Gateway API
CRDs at startup and disables its gateway controller if they are absent —
a static, one-shot decision. Cloud-init installs k3s+Cilium first, then
Flux reconciles bp-gateway-api minutes later, so the operator always
starts without CRDs and never recovers. All 8 HTTPRoutes orphaned.

Three-part permanent fix:

1. cloud-init: apply Gateway API v1.1.0 experimental CRDs (incl.
   TLSRoute) BEFORE the Cilium helm install. Cilium 1.16.x requires
   TLSRoute CRD to be present; without it the operator's capability
   check fails entirely and disables the gateway controller.

2. bp-cilium (1.1.2 → 1.1.3): add gatewayAPI.gatewayClass.create: "true"
   to force GatewayClass creation regardless of CRD presence at Helm
   render time. Upstream default "auto" skips GatewayClass when the
   gateway API CRDs are absent at install time (Capabilities check).

3. bp-gateway-api (1.0.0 → 1.1.0): downgrade CRDs from v1.2.0 to v1.1.0
   and ship experimental channel (TLSRoute, TCPRoute, UDPRoute,
   BackendLBPolicy, BackendTLSPolicy). Gateway API v1.2.0 changed
   status.supportedFeatures from string[] to object[]; Cilium 1.16.5
   writes the old string format and the v1.2.0 CRD rejects the status
   patch with "must be of type object: string", leaving GatewayClass
   permanently Unknown/Pending. v1.1.0 retains string schema.

Upgrade path: bump bp-gateway-api + bp-cilium together when Cilium ≥ 1.17
adopts the v1.2.0 object schema for supportedFeatures.

Closes #503

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:23:32 +04:00
e3mrah
83ec889f06
feat(platform): add global.imageRegistry to remaining bp-* charts + bp-catalyst-platform (PR 3/3, #560) (#580)
Charts bumped:
- bp-keycloak 1.2.0 -> 1.2.1 (subchart stub; per-component image.registry knobs documented)
- bp-crossplane 1.1.3 -> 1.1.4 (subchart stub)
- bp-crossplane-claims 1.1.0 -> 1.1.1 (global.kubectlImage added; kubectl Job image templated; Hetzner ubuntu-24.04 server images intentionally untouched)
- bp-velero 1.2.0 -> 1.2.1 (subchart stub)
- bp-kyverno 1.0.0 -> 1.0.1 (subchart stub; per-controller image.registry knobs documented)
- bp-trivy 1.0.0 -> 1.0.1 (subchart stub; both operator + scanner image.registry knobs documented)
- bp-grafana 1.0.0 -> 1.0.1 (subchart stub)
- bp-flux 1.1.3 -> 1.1.4 (subchart stub; per-controller image.repository knobs documented)
- bp-catalyst-platform 1.1.13 -> 1.1.14 (global.imageRegistry + images.{catalystApi,catalystUi,marketplaceApi,console,smeTag} added; all 14 Catalyst-authored image refs templated: catalyst-api, catalyst-ui, marketplace-api, console + 10 SME services)

Post-handover per-Sovereign overlays set global.imageRegistry to harbor.<sovereign-fqdn> so every container image pull routes through the Sovereign's own Harbor proxy_cache.

Closes (partial): issue #560 — all 23 bp-* charts now carry global.imageRegistry

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:21:53 +04:00
e3mrah
2adc3a9493
fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase (#579)
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:21:36 +04:00
e3mrah
b647aa2561
fix(bp-harbor): provision harbor-pg CNPG cluster + database-secret (Closes #566) (#578)
Replace Helm lookup in database-secret.yaml with reflector annotation:
harbor-database-secret now reflects harbor-pg-app via
reflector.v1.k8s.emberstack.com/reflects. This fixes the race between
Helm rendering (fresh install) and CNPG cluster bootstrap — reflector
is event-driven and propagates the CNPG password within seconds of
harbor-pg-app being created, with no operator action required.

Also includes:
- templates/cnpg-cluster.yaml: harbor-pg CNPG Cluster (1 inst, 5Gi, pg16)
- values.yaml: postgres: block + database.external.host = harbor-pg-rw
- Chart 1.2.0 → 1.2.1; bootstrap-kit refs updated (_template, otech, omantel)

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:14:00 +04:00
e3mrah
58cf297800
fix(bp-seaweedfs): remove trailing slash in registry — fixes double-slash image ref (Closes #568) (#576)
`registry: "chrislusf/"` in values.yaml produced `chrislusf//seaweedfs:4.22`
because the vendored chart's _helpers.tpl renders
`printf "%s/%s:%s" $registryName $name $tag` — the trailing slash joined
with the separator slash made an invalid image reference.

Fix: `registry: "chrislusf/"` → `registry: "chrislusf"`.
Bump bp-seaweedfs 1.1.0 → 1.1.1. Update bootstrap-kit refs in _template,
otech.omani.works, omantel.omani.works (1.0.1 → 1.1.1).

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:02:48 +04:00
e3mrah
5796de12bc
fix(bp-spire): re-enable oidc-discovery-provider ClusterSPIFFEID to fix init stuck (Closes #571) (#575)
The oidc-discovery-provider ClusterSPIFFEID was disabled at bootstrap to
work around a CRD-ordering race (spire-controller-manager applying the
template before CRDs were registered). That race was fixed in bp-spire 1.1.4
by listing spire-crds as the first Helm dependency.

With all ClusterSPIFFEIDs still disabled the oidc-discovery-provider init
container blocks indefinitely with "PermissionDenied: no identity issued" —
the controller-manager never creates the registration entry so no SVID is
issued.

Re-enable oidc-discovery-provider identity. The default, test-keys, and
child-servers identities remain disabled (not needed for bootstrap).

Also carries the global.imageRegistry field added by issue #560 (was 1.1.5
in working tree, now bumped to 1.1.6 for this fix). Bootstrap-kit slot 06
updated from 1.1.4 → 1.1.6.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 13:00:43 +04:00
e3mrah
b88e98026f
fix(bp-falco): rename rules_file → rules_files (Falco 0.36+ canonical key, Closes #570) (#574)
Falco 0.36+ uses `rules_files` (plural) as the canonical multi-file rules
key. Setting the deprecated `rules_file` (singular) alongside the upstream
subchart's `rules_files` default causes Falco to detect a config conflict
and abort startup with CrashLoopBackOff on otech22.

Bump bp-falco 1.0.0 → 1.0.1. Bootstrap-kit slot 31 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:59:29 +04:00
e3mrah
06844d3a70
fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569) (#573)
bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but
bp-external-dns still had `powerdnsNamespace: openova-system` in its
NetworkPolicy egress rule and `--pdns-server=...openova-system...` in
extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation.

Fix:
- externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns
- extraArgs --pdns-server: ...openova-system... → ...powerdns...

Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:58:24 +04:00
e3mrah
c59f0496a2
fix(bp-mimir): disable ingest_storage to fix Kafka CrashLoop (Closes #567) (#572)
Upstream mimir-distributed 6.0.6 can boot in ingest-storage mode which
requires a Kafka endpoint. Setting kafka.enabled:false only disables the
bundled Kafka subchart — it does not tell the Mimir process itself to use
classic mode. Adding mimir.structuredConfig.ingest_storage.enabled:false
forces the classic blocks-storage ingester path (no Kafka dependency),
matching Catalyst's NATS JetStream event bus (ADR-0001).

Bump bp-mimir 1.0.0 → 1.0.1. Bootstrap-kit slot 23 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:57:09 +04:00
e3mrah
ad9cfc0f23
feat(platform): add global.imageRegistry to bp-openbao/external-secrets/cnpg/valkey/nats-jetstream/powerdns/gitea (PR 2/3, #560) (#565)
Charts with template image refs (fully rewritten when registry set):
- bp-openbao 1.2.4→1.2.5: init-job.yaml + auth-bootstrap-job.yaml — Catalyst
  job images now prefixed with global.imageRegistry when non-empty. Default
  (empty) renders identical manifests.
- bp-powerdns 1.1.5→1.1.6: dnsdist.yaml Catalyst companion image prefixed
  with global.imageRegistry when non-empty. Verified: dnsdist image rewrites
  to harbor.openova.io/docker.io/powerdns/dnsdist-19:1.9.14.

Subchart-only charts (global.imageRegistry stub added; threading via per-component
subchart values.yaml keys documented in comments):
- bp-external-secrets 1.1.0→1.1.1
- bp-cnpg 1.0.0→1.0.1  (charts/ missing = pre-existing state, not this PR)
- bp-valkey 1.0.0→1.0.1 (charts/ missing = pre-existing state, not this PR)
- bp-nats-jetstream 1.1.1→1.1.2
- bp-gitea 1.1.2→1.1.3: upstream chart exposes gitea.image.registry for wiring

vcluster: N/A — no chart directory under platform/vcluster/chart/

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:52:43 +04:00
e3mrah
19c06c63bc
fix(bp-cert-manager-dynadot-webhook): dedupe template labels (Closes #561) (#564)
deployment.yaml pod template included both selectorLabels and labels named
templates; since selectorLabels is a strict subset of labels, this produced
duplicate app.kubernetes.io/name and app.kubernetes.io/instance keys in the
rendered pod template metadata — triggering the HelmRelease validation error
"spec.values.metadata.labels has duplicate key". Remove the redundant
selectorLabels include from the pod template (selector.matchLabels still uses
selectorLabels correctly). Bump chart 1.1.0 → 1.1.1.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:50:11 +04:00
e3mrah
a7fa0626b2
feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-pdns-webhook/sealed-secrets (PR 1/3 #560) (#562)
* docs(wbs): Mermaid DAG shows actual Phase-8a dependency cascade

Per founder corrective: existing diagram missed the real blockers
surfaced during otech10..otech22 burns. The image-pull-through gap
(#557) and the cross-namespace secret gap (#543, #544) gate every
workload pull from a public registry — without them, Sovereign hits
DockerHub anonymous rate-limit on first provision and 30+ HRs are
ImagePullBackOff/CreateContainerConfigError.

Adds:
- Phase 0b · Image pull-through (#557 + #557B Sovereign-Harbor swap +
  #557C charts global.imageRegistry templating). Edges to NATS / Gitea
  / Harbor / Grafana / Loki / Mimir / PowerDNS / Crossplane /
  cert-manager-powerdns-webhook / Trivy / Kyverno / SPIRE / OpenBao
- Phase 0c · Cross-namespace secrets (#543 ghcr-pull Reflector + #544
  powerdns-api-credentials reflect). Edges to bp-catalyst-platform and
  bp-cert-manager-powerdns-webhook
- Phase 1 additions: #542 kubeconfig CP-IP fix and #547 helmwatch
  38-HR threshold both gate Phase 8a integration test
- Phase 0b → Phase 8b edge: post-handover Sovereign-Harbor swap is
  what makes "zero contabo dependency" DoD-met possible

WBS now reflects the cascade observed live, not the pre-Phase-8a model.

* feat(platform): add global.imageRegistry to bp-cilium/cert-manager/cert-manager-powerdns-webhook/sealed-secrets (PR 1/3, #560)

- bp-cilium 1.1.1→1.1.2: global.imageRegistry stub added; upstream cilium
  subchart does not expose a single registry knob — per-Sovereign overlays
  wire specific image.repository fields alongside this value.
- bp-cert-manager 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  chart exposes per-component image.registry knobs documented in the comment.
- bp-cert-manager-powerdns-webhook 1.0.2→1.0.3: global.imageRegistry stub
  added + deployment.yaml templated to prefix the webhook image repository
  when the value is non-empty. Verified: helm template with
  --set global.imageRegistry=harbor.openova.io produces
  harbor.openova.io/zachomedia/cert-manager-webhook-pdns:<appVersion>.
- bp-sealed-secrets 1.1.1→1.1.2: global.imageRegistry stub added; upstream
  subchart exposes sealed-secrets.image.registry for overlay wiring.

All four charts render clean with default values (empty imageRegistry).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:48:37 +04:00
e3mrah
ccc38987c2
fix(tls): bp-cert-manager-dynadot-webhook slot 49b + DNS-01 JSON bug (Closes #550) (#558)
Root cause: bootstrap-kit installs bp-cert-manager-powerdns-webhook (slot 49)
but the letsencrypt-dns01-prod ClusterIssuer wires to the dynadot webhook
(groupName: acme.dynadot.openova.io). Without slot 49b the APIService for
acme.dynadot.openova.io does not exist → cert-manager gets "forbidden" on
every ChallengeRequest → sovereign-wildcard-tls stays in Issuing indefinitely
→ HTTPS gateway has no cert → SSL_ERROR_SYSCALL on the handover URL.

Changes:
- core/pkg/dynadot-client: fix SetDnsResponse JSON key (was SetDns2Response,
  API returns SetDnsResponse); change ResponseCode to json.Number (API returns
  integer 0, not string "0"); update tests to match real API response format
- platform/cert-manager-dynadot-webhook/chart:
  - rbac.yaml: add domain-solver ClusterRole + ClusterRoleBinding so
    cert-manager SA can CREATE on acme.dynadot.openova.io (the "forbidden" fix)
  - values.yaml: add certManager.{namespace,serviceAccountName}, clusterIssuer.*
    and privateKeySecretRefName; add rbac.create comment for domain-solver
  - certificate.yaml: trunc 64 on commonName (was 76 bytes, cert-manager rejects >64)
  - clusterissuer.yaml: new template (skip-render default, enabled via overlay)
  - deployment.yaml: add imagePullSecrets support (required for private GHCR)
  - Chart.yaml: bump to 1.1.0
- clusters/_template/bootstrap-kit:
  - 49b-bp-cert-manager-dynadot-webhook.yaml: new slot (PRE-handover issuer)
  - kustomization.yaml: add 49b entry
- infra/hetzner:
  - variables.tf: add dynadot_managed_domains variable
  - main.tf: pass dynadot_{key,secret,managed_domains} to cloud-init template
  - cloudinit-control-plane.tftpl: write cert-manager/dynadot-api-credentials
    Secret + apply it before Flux reconciles bootstrap-kit

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:42:13 +04:00
e3mrah
7d264d9647
fix(bp-powerdns): default cluster.namespace=powerdns not openova-system (Closes #553) (#556)
bp-powerdns HelmRelease upgrade fails on Sovereigns with:
  failed to create resource: namespaces "openova-system" not found

The chart's CNPG Cluster CR template targets postgres.cluster.namespace
which defaulted to openova-system (a contabo-only legacy ns). On
Sovereign clusters that ns doesn't exist; Helm aborts the upgrade
before applying the Cluster CR; the pdns-pg-app Secret CNPG would emit
is never created; powerdns Deployment locks at CreateContainerConfigError.

Default to powerdns (chart targetNamespace per bootstrap-kit overlay).
Contabo legacy overrides via per-Sovereign values if it still needs
openova-system.

Bump bp-powerdns 1.1.4 -> 1.1.5 across template + omantel + otech overlays.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:19:37 +04:00
e3mrah
b2307e290d
fix: bp-reflector + rename ghcr-pull-secret->ghcr-pull (Closes #543) (#554)
Part A — bp-reflector blueprint:
- Add clusters/_template/bootstrap-kit/05a-reflector.yaml (slot 05a,
  dependsOn bp-cert-manager) — installs emberstack/reflector v7.1.288
  via the bp-reflector OCI wrapper chart.
- Register in bootstrap-kit/kustomization.yaml.
- Add platform/reflector/chart/ wrapper (Chart.yaml + values.yaml):
  single replica, 32Mi memory, ServiceMonitor off by default.

Part B — annotate flux-system/ghcr-pull + rename in charts:
- infra/hetzner/cloudinit-control-plane.tftpl: add four Reflector
  annotations to the ghcr-pull Secret written at cloud-init time so
  Reflector auto-mirrors it to every namespace on first boot.
- Rename imagePullSecrets from ghcr-pull-secret to ghcr-pull in:
  api-deployment.yaml, ui-deployment.yaml,
  marketplace-api/deployment.yaml, and all 11 sme-services/*.yaml
  (14 total occurrences).
- Bump bp-catalyst-platform chart 1.1.12->1.1.13; update bootstrap-kit
  HelmRelease version reference to match.

Root cause: the canonical secret name is ghcr-pull (written by
cloud-init as /var/lib/catalyst/ghcr-pull-secret.yaml). Charts were
referencing ghcr-pull-secret (wrong name), causing ImagePullBackOff
on all Catalyst pods on every new Sovereign.

Runtime hotfix applied to otech22: both ghcr-pull and ghcr-pull-secret
propagated to 33 namespaces via kubectl; non-Running pods bounced.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-02 12:17:51 +04:00
e3mrah
902d857702
fix(bp-powerdns): reflect powerdns-api-credentials to external-dns namespace (Closes #544) (#552)
Add reflector.v1.k8s.emberstack.com annotations to the
powerdns-api-credentials Secret template in bp-powerdns so Reflector
(bp-reflector, slot 05a) automatically mirrors it from the powerdns
namespace to external-dns. Bump chart version 1.1.3 → 1.1.4.

Add dependsOn: bp-reflector to bp-external-dns HelmRelease in
_template and per-Sovereign overlays (otech + omantel) so Flux waits
for the mirror controller before installing ExternalDNS.

Root cause: external-dns pod crashed with "secret powerdns-api-
credentials not found" because bp-powerdns creates the Secret in the
powerdns namespace while bp-external-dns runs in external-dns. No
cross-namespace propagation existed. Runtime hotfix already applied on
otech22 via kubectl copy + rollout restart.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 12:11:43 +04:00
e3mrah
8cde771c0f
fix(bp-openbao): unseal on idempotent path + persist keys (Closes #539) (#540)
PR #528 added unseal logic but only on the FRESH-init branch. When a
previous Job pod completed `bao operator init` but exited before the
unseal block (or when openbao-0 simply restarts under shamir seal),
the next reconcile takes the "already initialized" branch and exits
without ever running `bao operator unseal`. Symptom on otech21:
init-job logs end with `auto-unseal init complete`, but
`bao status` reports Initialized=true Sealed=true forever, the
bp-openbao HR stays Unknown/Running for the full 15m install
timeout, and bp-external-secrets/bp-external-secrets-stores block
on the dep.

Fix has two parts:

1. Persist `unseal_keys_b64` on fresh init to a new K8s Secret
   `openbao-unseal-keys` (BEFORE applying the keys, so a unseal
   crash mid-step is recoverable on next retry).
2. Add a Step 2a "idempotent-path unseal" branch: when bao reports
   Initialized=true Sealed=true, fetch the persisted keys Secret
   and apply unseal exactly the same way Step 3a does on fresh
   init. Verify Sealed=false and exit; otherwise FATAL with the
   manual-recovery pointer.

RBAC: extend the openbao-auto-unseal Role to allow create/get/
patch/update on openbao-unseal-keys (alongside openbao-init-marker).

Chart bump 1.2.3 → 1.2.4. HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml updated to match
so cloud-init-templated Sovereigns pick up the new chart.

Co-authored-by: e3mrah <emrah.baysal@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 10:44:46 +04:00
e3mrah
d90abb1e85
fix(bp-openbao): unseal vault after init in chart Job (Closes #527) (#528)
The init Job ran `bao operator init -key-shares=1 -key-threshold=1`
which leaves the cluster Initialized=true but Sealed=true. Without
an explicit `bao operator unseal <key>` call the StatefulSet pod
stays sealed forever, the bp-openbao HelmRelease never reports
Ready=True, and every dependent blueprint (bp-external-secrets,
bp-external-secrets-stores) blocks on this dep.

This was the 5th and final latent bug in the chart's auto-unseal
flow (after PRs #518 #520 #523 #524 #525). On otech17
(6b17518f12d529ea, 2026-05-02) the init Job completed cleanly but
`bao status` reported Sealed=true forever.

Fix: parse `unseal_threshold` and `unseal_keys_b64` from the init
JSON, call `bao operator unseal <key>` $threshold times (1 with
the current key-shares=1 / key-threshold=1 config), then assert
`bao status -format=json | grep '"sealed":false'` before the Job
exits success. Bumps chart 1.2.2 -> 1.2.3 and HR ref in
clusters/_template/bootstrap-kit/08-openbao.yaml.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:24:57 +04:00
e3mrah
ba5a1929f1
fix(bp-openbao): use shamir-compatible init flags + bump 1.2.1→1.2.2 (refs #517) (#525)
The chart's init Job called `bao operator init -recovery-shares=1
-recovery-threshold=1` which only works with auto-unseal seal types
(gcpckms/awskms/transit). The upstream openbao chart's default config
uses `seal "shamir"` (no auto-unseal stanza in
values.standalone.config / values.ha.config), so the OpenBao API
returns 400: "parameters recovery_shares,recovery_threshold not
applicable to seal type shamir".

Switch to -key-shares=1 -key-threshold=1 which is the correct shamir-
seal init flags. Operators wiring auto-unseal seals later will need
to flip back via a chart-values toggle.

Bumps chart 1.2.1→1.2.2 + matches HR ref so Sovereigns pull the new
artifact on next reconcile.

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:14:05 +04:00
e3mrah
6e3d3d281e
fix(bp-openbao): bump chart 1.2.0→1.2.1 + HR ref for busybox-wget fix (refs #517) (#524)
Bumps platform/openbao/chart/Chart.yaml version to 1.2.1 carrying the
busybox-compatible wget flag fix (PR #523). Also bumps the HR's
chart.spec.version in clusters/_template/bootstrap-kit/08-openbao.yaml
so Sovereigns pull the new bytes once blueprint-release publishes
ghcr.io/openova-io/bp-openbao:1.2.1.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:09:06 +04:00
e3mrah
5c0618d920
fix(bp-openbao): use busybox-compatible wget flag in init Job (refs #517) (#523)
The chart's init Job runs inside the openbao image (quay.io/openbao/
openbao:2.1.0) which uses busybox wget. The script's wget calls used
`--ca-certificate=$CACERT` which busybox wget does not support, causing
wget to print its usage page and fail with "seed Secret has no key
recovery-seed" (false negative — the parsing pipeline saw the usage
text instead of JSON).

Replace with `--no-check-certificate`. The Secret still requires the
Bearer token for auth — the lack of CA verification only affects
TLS handshake validation against an in-cluster API server reached via
the well-known kubernetes.default.svc DNS name (out-of-band attack
surface is negligible inside the pod network).

The `--method=DELETE` line for cleaning up the seed Secret remains —
busybox wget doesn't support method override either, but that line
is wrapped in `|| true` so the seed deletion failure doesn't block
the init Job from succeeding. Seed is single-use anyway and harmless
post-init (the recovery key is the OUTPUT of bao operator init, not
this seed).

Refs #517

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:07:52 +04:00
e3mrah
7931e695b0
fix(cert-manager-powerdns-webhook): cap CA Certificate CN at 64 bytes (#509)
The chart's CA Certificate template generated a `spec.commonName` of
`ca.<fullname>.cert-manager` where `<fullname>` is the Helm fullname
(release name + chart name). With the bootstrap-kit's release name
`cert-manager-powerdns-webhook`, the rendered CN landed at 78 bytes:

  ca.cert-manager-powerdns-webhook-bp-cert-manager-powerdns-webhook.cert-manager

cert-manager's admission webhook rejects this against the RFC 5280
ub-common-name-length=64 PKIX upper bound, breaking otech11
(ac90a3ea12954e7d, chart 1.0.1, 2026-05-02) at install time.

Fix: collapse the CN onto the chart `name` helper (always
`bp-cert-manager-powerdns-webhook`, ≤63 chars) instead of the
release-prefixed `fullname`. The CA cert's CN is opaque identity only —
no client validates by hostname against this CN — so the shortening is
behaviour-preserving and stable across any operator-chosen releaseName.

Rendered CN with this fix:

  ca.bp-cert-manager-powerdns-webhook.cert-manager  (48 bytes)

Bumps chart 1.0.1 → 1.0.2 and updates the bootstrap-kit slot reference
in clusters/_template/bootstrap-kit/49-bp-cert-manager-powerdns-webhook.yaml.

Closes #508.
2026-05-02 02:09:41 +04:00
e3mrah
eeba0d90cc
fix(infra): dedupe labels in bp-cert-manager-powerdns-webhook deployment template (#507)
The pod template's metadata.labels block in the upstream Deployment
template included BOTH the `selectorLabels` helper AND the `labels`
helper. Since `labels` already emits app.kubernetes.io/name and
app.kubernetes.io/instance, the rendered YAML had those keys twice in
a single mapping, which Helm v3 post-render rejects with:

  yaml: unmarshal errors:
    line 29: mapping key "app.kubernetes.io/name" already defined at line 26
    line 30: mapping key "app.kubernetes.io/instance" already defined at line 27

Surfaced live on Phase-8a-preflight otech11 (ac90a3ea12954e7d, on
catalyst-api:c148ef3, 2026-05-01).

Fix: drop the redundant `selectorLabels` include — `labels` is a
superset. Bump chart version 1.0.0 → 1.0.1 and update the bootstrap-kit
HR reference accordingly.

Closes openova#506.

Co-authored-by: e3mrah <emrah@openova.io>
2026-05-02 01:52:50 +04:00
e3mrah
e1f7d22f3c
fix(bootstrap-kit): install Gateway API CRDs ahead of HTTPRoute charts (#503) (#505)
Adds bp-gateway-api Blueprint (slot 01a) that vendors the upstream
Kubernetes Gateway API Standard-channel CRDs (v1.2.0) and registers them
ahead of every chart that ships HTTPRoute templates: bp-openbao,
bp-keycloak, bp-gitea, bp-powerdns, bp-catalyst-platform, bp-harbor,
bp-grafana.

Phase-8a-preflight live deployment otech10 (e1a0cd6662872fcb on
catalyst-api:c148ef3, 2026-05-01) reached 21/37 HRs Ready=True before
stalling on bp-harbor / bp-openbao / bp-powerdns reconciling to
InstallFailed with `no matches for kind "HTTPRoute" in version
"gateway.networking.k8s.io/v1"`. Cilium 1.16's chart `gatewayAPI.
enabled=true` flag wires up the cilium gateway controller and creates
the `cilium` GatewayClass, but does NOT install the
gateway.networking.k8s.io CRDs themselves; cilium 1.16 has no
`installCRDs`-equivalent knob for gateway-api so the upstream CRDs must
ship via a separate Blueprint.

Pattern locked in by docs/INVIOLABLE-PRINCIPLES.md and reinforced by
the founder for ALL similar future cases: intra-chart CRD-ordering
breaks → split into two charts + Flux dependsOn. Mirrors the
bp-crossplane/bp-crossplane-claims and bp-external-secrets/
bp-external-secrets-stores splits.

Files:
- platform/gateway-api/{blueprint.yaml,chart/} — new Blueprint with
  per-CRD templates vendored from kubernetes-sigs/gateway-api v1.2.0
  standard-install.yaml; helm.sh/resource-policy: keep on every CRD so
  Helm uninstall does not orphan every HTTPRoute on the cluster
- platform/gateway-api/chart/scripts/regenerate.sh — developer tool
  for re-vendoring on upstream version bump (annotation-driven)
- platform/gateway-api/chart/tests/crd-render.sh — chart integration
  test (5 CRDs, keep annotation, bundle-version matches Chart.yaml pin)
- clusters/_template/bootstrap-kit/01a-gateway-api.yaml — HelmRelease
  + HelmRepository, dependsOn bp-cilium
- clusters/_template/bootstrap-kit/{08-openbao,09-keycloak,10-gitea,
  11-powerdns,13-bp-catalyst-platform,19-harbor,25-grafana}.yaml —
  add `dependsOn: bp-gateway-api`
- clusters/_template/bootstrap-kit/kustomization.yaml — register
  01a-gateway-api.yaml between 01-cilium and 02-cert-manager
- scripts/expected-bootstrap-deps.yaml — declare slot 1a + add
  bp-gateway-api to depends_on of every HTTPRoute-using slot

Closes #503

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:30:50 +04:00
e3mrah
1865ac8975
fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340) (#504)
* fix(bp-seaweedfs): vendor upstream chart, drop fromToml-using template (#340)

The upstream seaweedfs/seaweedfs 4.22.0 chart now ships
templates/shared/security-configmap.yaml which calls fromToml — a Sprig
function added in Helm 3.13. Flux v1.x helm-controller bundles a Helm
SDK older than 3.13 and PARSES every template before any
{{- if .Values.global.seaweedfs.enableSecurity }} gate fires, so the file's
mere presence breaks install on every Sovereign with:

  parse error at (bp-seaweedfs/charts/seaweedfs/templates/shared/security-configmap.yaml:21):
    function "fromToml" not defined

even though enableSecurity defaults to false. Setting the gate value
does NOT skip parsing — only deleting / never-shipping the file does.

Fix shape (per ticket #340):

1. Vendor upstream seaweedfs/seaweedfs 4.22.0 into chart/charts/seaweedfs/
   (committed bytes, not auto-pulled at build time). Required because the
   upstream Helm repo overwrites 4.22.0 in place — re-pulling would
   re-introduce the broken file.
2. Delete charts/seaweedfs/templates/shared/security-configmap.yaml.
   Every other template that references the deleted ConfigMap is gated
   under {{- if enableSecurity }} so removing it is a no-op for our
   default deployment shape (Catalyst SeaweedFS auth happens at the S3
   layer via IAM creds from External Secrets, not via the upstream
   chart's TLS/JWT machinery).
3. Drop the dependencies: block from chart/Chart.yaml; add
   annotations.catalyst.openova.io/no-upstream=true so the
   blueprint-release workflow's hollow-chart guard (issue #181) skips
   the auto-pull/round-trip checks for this chart.
4. Whitelist platform/seaweedfs/chart/charts/ in .gitignore so the
   vendored bytes are tracked.
5. Bump bp-seaweedfs 1.0.1 → 1.1.0 (signal: vendored, not auto-pulled).
6. Add tests/no-fromtoml.sh — chart-test that asserts the offending
   file stays deleted across future re-vendors. Runs in
   .github/workflows/blueprint-release.yaml as a publish-gating check.

Unblocks Phase-8a observability + storage chain on otech (bp-loki,
bp-mimir, bp-tempo, bp-velero, bp-harbor, bp-grafana all dependsOn
bp-seaweedfs).

Closes #340

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): align expected-bootstrap-deps.yaml with bp-harbor's actual deps

The bp-harbor HR at clusters/_template/bootstrap-kit/19-harbor.yaml lines
35-37 already removed `bp-seaweedfs` from its dependsOn (cloud-direct
architecture per ADR-0001 §13 — Harbor writes blobs directly to cloud
Object Storage on Sovereigns, not via SeaweedFS), but the expected DAG
in scripts/expected-bootstrap-deps.yaml was never updated to match.

Pre-existing drift on main; surfaced by the dependency-graph-audit
check on PR #504 (bp-seaweedfs vendoring fix). Fixing it inline so the
audit passes on the same PR — the two changes are both about the
storage chain on Sovereigns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:20:59 +04:00
e3mrah
20b896070f
feat(bp-keycloak + infra): Sovereign K8s OIDC config for kubectl via per-Sovereign Keycloak realm (closes #326) (#448)
Wires the per-Sovereign K8s api-server's --oidc-* validator to the
per-Sovereign Keycloak realm so customer admins can authenticate
kubectl directly against their Sovereign — no static admin-kubeconfig
handoff, no rotated bearer-token exchange.

infra (cloud-init):
  - Add 6 --kube-apiserver-arg=oidc-* flags to the k3s install line in
    infra/hetzner/cloudinit-control-plane.tftpl. Issuer URL composed
    from sovereign_fqdn (https://auth.\${sovereign_fqdn}/realms/sovereign)
    per INVIOLABLE-PRINCIPLES #4 — never hardcoded. Username/groups
    prefixes scope OIDC subjects under "oidc:" so RoleBindings reference
    e.g. subjects[0].name=oidc:alice@org, distinct from local SAs/x509.

Canonical seam (anti-duplication rule, ADR-0001 §11.3):
  - The bp-keycloak chart already bundles bitnami/keycloak's
    keycloakConfigCli post-install Helm hook Job, which imports realms
    declared under values.keycloak.keycloakConfigCli.configuration. We
    enable the existing seam — no bespoke kubectl-exec realm-creation
    script, no custom Admin-API call from catalyst-api.

bp-keycloak chart (1.1.2 → 1.2.0):
  - Enable keycloakConfigCli + ship inline sovereign-realm.json with:
    realm "sovereign" (invariant per Sovereign — Keycloak resolves the
    issuer claim from the request hostname, so no per-FQDN realm
    rename), default groups sovereign-admins/-ops/-viewers, oidc-group
    -membership-mapper emitting "groups" claim, public OIDC client
    "kubectl" with localhost:8000 + OOB redirect URIs (kubectl-oidc
    -login defaults), publicClient=true (kubectl runs locally and
    cannot safely hold a secret), PKCE S256 enforced.
  - Bump version 1.1.2 → 1.2.0 (semver MINOR, additive shape).
  - Bump bootstrap-kit slot 09 in _template/, omantel.omani.works/,
    otech.omani.works/ to version: 1.2.0.
  - New chart test tests/oidc-kubectl-client.sh (4 cases) — all green.
  - Existing tests/observability-toggle.sh — still green.

Documentation:
  - Add §11 "kubectl OIDC for customer admins" runbook to
    docs/omantel-handover-wbs.md with one-time workstation setup
    (kubectl krew install oidc-login + config set-credentials),
    sovereign-admin RBAC binding (oidc:sovereign-admins → cluster
    -admin), and 401-debugging table mapping common symptoms to
    root causes.
  - Carve #326 out of §7 "Out of scope" — it is shipped.
  - Add §9 status row.

Validation:
  - grep -c 'oidc-issuer-url' infra/hetzner/cloudinit-control-plane.tftpl
    → 2 (comment + the actual flag in the curl line)
  - grep -c 'oidc-username-claim' → 2
  - helm template platform/keycloak/chart → renders post-install
    keycloak-config-cli Job + ConfigMap with kubectl client (3 hits
    on grep "kubectl"; 1 hit on "clientId": "kubectl")
  - bash scripts/check-vendor-coupling.sh → exit 0 (HARD-FAIL mode)
  - 4/4 oidc-kubectl-client gates green; 3/3 observability-toggle
    gates green

Out of scope (deferred to follow-up tickets):
  - Per-Sovereign user provisioning UI (#322, #323)
  - Refresh-token revocation on RoleBinding deletion (#324)
  - provider-kubernetes Crossplane ProviderConfig per Sovereign (#321)
  - omantel migration / Phase 8 live execution

NO catalyst-api or UI source files touched (those are #319/#322/#323
agents' territories per agent brief).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:07:52 +04:00
e3mrah
b6810c1940
feat(bp-crossplane-claims): UserAccess CRD + Composition + RBAC ClusterRoles for Sovereign IAM (closes #322) (#446)
Adds the data plane for the Sovereign IAM access plane (epic #320):

- platform/crossplane-claims/chart/templates/xrds/useraccess.yaml
  XUserAccess XRD (access.openova.io/v1alpha1) — cluster-scoped Claim
  carrying user identity (Keycloak subject + groups), Sovereign ref, and
  one or more (application, role, namespaces) grants.

- platform/crossplane-claims/chart/templates/compositions/useraccess.yaml
  Default Composition useraccess.compose.openova.io — materialises one
  RoleBinding per Claim via provider-kubernetes Object against the
  per-Sovereign sovereign-<sovereignRef> ProviderConfig. Multi-grant
  shapes are expanded api-side into N single-grant Claims (avoids the
  Composition-iteration trap; no composition-functions introduced).

- platform/crossplane-claims/chart/templates/clusterroles.yaml
  Three canonical ClusterRoles — openova:application-{admin,editor,viewer}.
  Editor + viewer explicitly omit secrets; admin can manage namespace-
  scoped roles/rolebindings (NOT cluster-scoped).

- userAccess.enabled values toggle (default true), version bumps to 1.1.0
  on chart + blueprint, sample fixture, validation script extended to
  expect 7 XRDs / 7 Compositions / 3 ClusterRoles.

Canonical seam: extends the existing platform/crossplane-claims/chart/
XRD+Composition pattern (compose.openova.io/v1alpha1 family). New API
group access.openova.io is intentional — IAM is a separate concern from
the cloud-resource compose.* family. No catalyst-api or UI code touched
(those are #323's territory; this PR ships the data model #323 consumes).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 19:03:10 +04:00
e3mrah
0511efbdac
feat(bp-harbor): vendor-agnostic Object Storage backend (closes #383) (#437)
Reworks bp-harbor to write blobs DIRECTLY to the cloud-provider's
native S3 endpoint (Hetzner Object Storage on Hetzner Sovereigns)
per ADR-0001 §13. Mirrors the post-#425 vendor-agnostic seam shipped
in bp-velero:1.2.0 (PR #435 / SHA 0172b9a8) 1:1.

Canonical seam used (per anti-duplication rule + docs/omantel-
handover-wbs.md §3a):
  - Sealed Secret name:   flux-system/object-storage  (NOT hetzner-prefixed)
  - Chart values block:   .Values.objectStorage.s3.{enabled,credentialsSecretName,s3.{accessKey,secretKey}}
  - Template filename:    templates/objectstorage-credentials.yaml
  - Reference impl:       platform/velero/chart/ (PR #435)

Chart changes (platform/harbor/chart/):
  - Chart.yaml: 1.0.0 → 1.1.0; description rewritten to emphasise
    cloud-direct architecture + remove SeaweedFS hard-dep claim.
  - values.yaml: REMOVED hardcoded SeaweedFS endpoint
    (http://seaweedfs-s3.seaweedfs.svc.cluster.local:8333) from
    persistence.imageChartStorage.s3.regionendpoint. Default
    type flipped to `filesystem` so contabo/dev render is clean.
    Added vendor-agnostic objectStorage block:
      objectStorage:
        enabled: false
        useExistingSecret: false
        credentialsSecretName: ""
        s3: { accessKey: "", secretKey: "" }
  - templates/objectstorage-credentials.yaml (NEW): synthesises a
    harbor-namespace Secret with REGISTRY_STORAGE_S3_ACCESSKEY +
    REGISTRY_STORAGE_S3_SECRETKEY keys (the upstream chart's
    persistence.imageChartStorage.s3.existingSecret consumption
    shape — envFrom on the registry pod). Skip-render branch
    when objectStorage.enabled=false (default).
  - templates/_helpers.tpl: added bp-harbor.objectStorageCredentialsSecretName
    helper.
  - templates/networkpolicy.yaml: egress rule retargeted from
    SeaweedFS service-namespace selector → external HTTPS:443
    (works for any cloud-native S3 endpoint without vendor coupling).
    Gated on `.Values.objectStorage.enabled`. Removed
    seaweedfsNamespace + seaweedfsS3Port overlay keys.

Per-Sovereign overlays (clusters/{_template,omantel,otech}/bootstrap-
kit/19-harbor.yaml):
  - Chart version reference bumped 1.0.0 → 1.1.0.
  - dependsOn: bp-seaweedfs REMOVED. New dependsOn = bp-cnpg + bp-cert-manager.
  - Added valuesFrom block mapping the 5 keys of flux-system/object-
    storage Secret:
      s3-bucket     → harbor.persistence.imageChartStorage.s3.bucket
      s3-region     → harbor.persistence.imageChartStorage.s3.region
      s3-endpoint   → harbor.persistence.imageChartStorage.s3.regionendpoint
      s3-access-key → objectStorage.s3.accessKey
      s3-secret-key → objectStorage.s3.secretKey
  - Inline values flip objectStorage.enabled=true,
    harbor.persistence.imageChartStorage.type=s3, and
    harbor.persistence.imageChartStorage.s3.existingSecret=harbor-
    objectstorage-credentials.

UI catalog (products/catalyst/bootstrap/ui/src/shared/constants/components.ts):
  - Harbor's `dependencies` array drops `seaweedfs`. Now ['cnpg', 'valkey'].

Validation:
  helm template default render →
    1448 lines, 5 Secrets (Harbor internal: core/jobservice/registry/
    registry-htpasswd/database — NO objectstorage-credentials),
    type=filesystem, 0 SeaweedFS references.
  helm template overlay render with objectStorage.enabled=true +
  type=s3 + bucket=omantel-harbor + region=fsn1 +
  regionendpoint=https://fsn1.your-objectstorage.com +
  existingSecret=harbor-objectstorage-credentials →
    1452 lines, 6 Secrets (5 internal + 1 objectstorage-credentials),
    type=s3 with Hetzner endpoint, registry pod envFrom wired to the
    new Secret, 0 SeaweedFS references.
  scripts/check-vendor-coupling.sh → exit 0 (no violations across
    platform/, clusters/, products/catalyst/bootstrap/{api,ui}/).
  helm lint → 0 failures.

WBS:
  §2 row 18 → 🟢 chart-released (#383).
  §9 #383 row → 🟢 chart-released narrative.
  §6 DAG: T383 moved from `class blocked` → `class done`.

Hetzner-S3 E2E deferred to Phase 8 (first omantel run).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:18:37 +04:00
e3mrah
0172b9a89a
wip(#425): vendor-agnostic OS rename — partial (rate-limited mid-run) (#435)
Files staged from prior agent run before rate-limit. Re-dispatch will
verify, complete missing pieces (Crossplane Provider+ProviderConfig in
cloud-init, grep-zero acceptance, helm/go test runs, WBS row update),
and finalise the PR.

Includes:
- platform/velero/chart/templates/{hetzner-credentials-secret -> objectstorage-credentials}.yaml
- platform/velero/chart/values.yaml (objectStorage.s3.* block)
- platform/velero/chart/Chart.yaml (1.1.0 -> 1.2.0)
- products/catalyst/bootstrap/api/internal/objectstorage/ (NEW package)
- internal/hetzner/objectstorage{,_test}.go DELETED
- credentials handler + StepCredentials.tsx renamed
- infra/hetzner/{main.tf,variables.tf,cloudinit-control-plane.tftpl}
- clusters/{_template,omantel.omani.works,otech.omani.works}/bootstrap-kit/34-velero.yaml
- platform/seaweedfs/* (out-of-scope drift — re-dispatch will revert if not part of #425)

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:05:19 +04:00
e3mrah
92b7db622d
fix(bp-external-secrets-stores): split ClusterSecretStore into separate chart per #247 pattern (closes #331) (#426)
* fix(bp-external-secrets): split ClusterSecretStore into bp-external-secrets-stores chart (resolves CRD ordering, closes #331)

bp-external-secrets@1.0.0 deadlocked on first install on otech.omani.works:

  Helm install failed for release external-secrets-system/external-secrets
  with chart bp-external-secrets@1.0.0:
  failed post-install: unable to build kubernetes object for deleting hook
  bp-external-secrets/templates/clustersecretstore-vault-region1.yaml:
  resource mapping not found for name: "vault-region1" namespace: ""
  no matches for kind "ClusterSecretStore" in version "external-secrets.io/v1beta1"

Root cause: Helm's `helm.sh/hook-delete-policy: before-hook-creation` ran
a kubectl-style lookup of the existing ClusterSecretStore CR before the
upstream `external-secrets` subchart's CRDs finished registration. The
in-line ClusterSecretStore template (templates/clustersecretstore-vault-
region1.yaml) and the upstream subchart's CRDs co-installed in the same
release; admission ordering wasn't deterministic enough to make the
post-install hook safe.

Fix — same pattern as PR #247 (bp-crossplane@1.1.3 ↔ bp-crossplane-claims@1.0.0):
split the chart into controller + stores. Flux dependsOn orders them.

  - bp-external-secrets@1.1.0 — controller-only (just upstream subchart
    + NetworkPolicy + ServiceMonitor toggle). CRDs register here.
  - bp-external-secrets-stores@1.0.0 (NEW) — the default
    ClusterSecretStore CR; depends on bp-external-secrets being Ready.
    No Helm hooks needed: by the time this chart's HelmRelease starts,
    Flux has already verified bp-external-secrets is Ready=True and
    therefore the CRDs are registered.

Files:
  NEW: platform/external-secrets-stores/blueprint.yaml             (1.0.0)
  NEW: platform/external-secrets-stores/chart/Chart.yaml           (1.0.0; no upstream subchart, annotation `catalyst.openova.io/no-upstream: "true"`)
  NEW: platform/external-secrets-stores/chart/values.yaml          (clusterSecretStore.* knobs moved from controller chart)
  MOVED: platform/external-secrets/chart/templates/clustersecretstore-vault-region1.yaml
       → platform/external-secrets-stores/chart/templates/clustersecretstore-vault-region1.yaml
       (Helm hook annotations removed — Flux dependsOn now handles ordering)
  TOUCHED: platform/external-secrets/chart/Chart.yaml              (1.0.0 → 1.1.0; description note appended)
  TOUCHED: platform/external-secrets/blueprint.yaml                (1.0.0 → 1.1.0)
  TOUCHED: platform/external-secrets/chart/values.yaml             (clusterSecretStore block removed; pointer comment added)
  NEW: clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml
       (Flux HelmRelease, dependsOn: [bp-external-secrets, bp-openbao])
  TOUCHED: clusters/_template/bootstrap-kit/15-external-secrets.yaml
       (chart version 1.0.0 → 1.1.0)
  TOUCHED: clusters/_template/bootstrap-kit/kustomization.yaml
       (slot 15a inserted after 15)

Out of scope for this PR (separate tickets):
  - blueprint-release.yaml CI fan-out: verify the path-matrix picks up
    the new platform/external-secrets-stores/ directory automatically;
    if not, add the directory to the matrix in a follow-up.
  - Per-Sovereign cluster directory edits (#257 will delete those).
  - Phase 0 minimum trim (#310 will renumber slots; this PR uses 15a as
    a non-disruptive sub-slot insertion that works with both the current
    35-slot kustomization and the eventual 15-slot canonical layout —
    when #310 renumbers, 15 + 15a become 08 + 09 in the canonical order).

Refs: #331 (this issue), #247 (pattern reference — bp-crossplane split),

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): register bp-external-secrets-stores in expected-bootstrap-deps.yaml

The dependency-graph-audit CI step rejected PR #334 because the new
bp-external-secrets-stores HR was on disk at slot 15a but missing from
the expected DAG. This commit adds it with the same dependsOn shape as
clusters/_template/bootstrap-kit/15a-external-secrets-stores.yaml:
[bp-external-secrets, bp-openbao].

Refs: #331, #310 (Phase 0 minimum), PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(bp-external-secrets): retire CR cases from controller test, add stores-toggle (#331)

After splitting the default ClusterSecretStore into bp-external-secrets-stores
@1.0.0, the controller chart's observability-toggle integration test still
expected the CR to render in the controller chart (Cases 4 + 5). Those
assertions now belong on the new chart.

Changes:
  - platform/external-secrets/chart/tests/observability-toggle.sh:
    Replace Cases 4+5 with a single inverted assertion — the controller
    chart MUST render ZERO ClusterSecretStore CRs (top-level kind:); only
    the upstream subchart's CRD definition (whose spec.names.kind value is
    "ClusterSecretStore" at non-zero indent) is allowed.
  - platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh:
    NEW. Mirrors the retired Cases 4+5 against the stores chart, plus a
    Case 3 that asserts clusterSecretStore.server overrides propagate.

Local smoke:
  bash platform/external-secrets/chart/tests/observability-toggle.sh         → 4/4 PASS
  bash platform/external-secrets-stores/chart/tests/clustersecretstore-toggle.sh → 3/3 PASS

Refs: #331, PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): handle alphanumeric sub-slot suffixes in check-bootstrap-deps.sh

PR #334 (issue #331) added slot 15a-external-secrets-stores as a sub-slot
between numeric slots 15 and 16. The bootstrap-deps audit script's
`printf '%02d'` formatter rejected `15a` with:

  scripts/check-bootstrap-deps.sh: line 390: printf: 15a: invalid number

Fix: detect non-numeric slot tokens and pass them through verbatim. Numeric
slots still render as zero-padded `01..49` for output alignment.

Local smoke:
  $ bash scripts/check-bootstrap-deps.sh
  ...
    [P] slot 15  bp-external-secrets        <-- bp-cert-manager bp-openbao
    [P] slot 15a bp-external-secrets-stores <-- bp-external-secrets bp-openbao
  ...
  OK: bootstrap-kit dependency graph audit PASSED

Refs: #331, PR #334.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wbs): tick #331 chart-released

bp-external-secrets@1.1.0 (controller-only) + bp-external-secrets-stores@1.0.0
(NEW) shipped in PR #426. Helm-template acceptance + both toggle tests +
dependency-graph-audit all green. Sovereign-impact deferred to Phase 8.

Refs: #331, PR #426.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hatice Yildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 17:33:47 +04:00
e3mrah
f7796ef807
feat(bp-velero): Hetzner Object Storage backend wiring (closes #384) (#423)
* feat(bp-velero): Hetzner Object Storage backend wiring (closes #384)

Velero on a Hetzner Sovereign now writes its backups DIRECTLY to Hetzner
Object Storage per ADR-0001 §13 (S3-aware app architecture rule) +
docs/omantel-handover-wbs.md §3 — NOT SeaweedFS, which is reserved as a
POSIX→S3 buffer for legacy POSIX-only writers and is not in the minimal
Sovereign set.

Mirrors the Hetzner-direct backend pattern Agent #383 is wiring for
Harbor; both consume the canonical flux-system/hetzner-object-storage
Secret shipped by issue #371 (cloud-init writes 5 keys: s3-endpoint /
s3-region / s3-bucket / s3-access-key / s3-secret-key, derived from
the operator-issued Hetzner-Console keys + the per-Sovereign bucket
provisioned by OpenTofu's aminueza/minio resource).

platform/velero/chart/ (umbrella chart, bumped to 1.1.0):
  - templates/_helpers.tpl: NEW — bp-velero.fullname / bp-velero.labels
    helpers + bp-velero.hetznerCredentialsSecretName (default
    `velero-hetzner-credentials`).
  - templates/hetzner-credentials-secret.yaml: NEW — synthesises a
    velero-namespace Secret with a single `cloud` key in AWS-CLI INI
    format from .Values.veleroOverlay.hetzner.s3.{accessKey,secretKey}.
    The upstream Velero deployment mounts this at /credentials/cloud
    via existingSecret + AWS_SHARED_CREDENTIALS_FILE. Skip-render path
    when veleroOverlay.hetzner.enabled is false (default — keeps
    contabo render clean) or useExistingSecret is true (operator
    supplied Secret out-of-band).
  - values.yaml: BSL provider/region/s3Url/bucket fields populated as
    placeholders the per-Sovereign HelmRelease overrides via Flux
    valuesFrom; backupsEnabled defaults FALSE so default render emits
    no half-broken BSL; veleroOverlay.hetzner block surfaces the
    operator-overridable fields. Long-form rationale comments inline
    on each value per the chart's existing docstring style.

clusters/_template/bootstrap-kit/34-velero.yaml (+ omantel + otech):
  - dependsOn: bp-seaweedfs REMOVED — Velero is no longer a SeaweedFS
    consumer on Sovereigns (was the old SeaweedFS-tiered architecture
    that minimal-omantel retired in favour of cloud-native S3).
  - chart version bumped 1.0.0 → 1.1.0.
  - valuesFrom block added: 5 Secret-key entries pull each canonical
    s3-* key into the matching umbrella value path. Plaintext
    credentials never appear in the committed manifest; Flux
    dereferences valuesFrom at HelmRelease apply time.
  - values block adds the baseline veleroOverlay.hetzner.enabled=true
    + velero.credentials.{useSecret:true,existingSecret:velero-hetzner-
    credentials} + BSL provider/credential/s3ForcePathStyle scaffolding
    that the valuesFrom entries fill in.

docs/omantel-handover-wbs.md:
  - §2 row 19: " chart needs S3 endpoint rework" → "🟢 chart-released
    v1.1.0 — Hetzner Object Storage backend wired to #371 secret".
  - §9 #384 row: detailed status with smoke evidence.

Smoke evidence (contabo, default values — no Hetzner credentials):
  - helm template t . → renders cleanly (no Hetzner Secret, no BSL).
  - helm template t . --set veleroOverlay.hetzner.enabled=true \
      --set ...accessKey=AK_TEST --set ...secretKey=SK_TEST \
      --set velero.backupsEnabled=true (+ BSL config) →
      Secret/velero-hetzner-credentials with `cloud` INI key emitted +
      BackupStorageLocation/default with provider=aws,
      bucket=omantel-velero, region=fsn1,
      s3Url=https://fsn1.your-objectstorage.com.
  - helm install velero-smoke . -n velero-smoke (defaults) → pod
    velero-69bb84c5-669sh Ready 1/1 in 48s. Smoke torn down clean.

Hetzner-S3 E2E deferred to Phase 8 (first omantel run) — contabo has
no Hetzner Object Storage credentials so end-to-end backup→restore
verification can't run here.

Anti-duplication rule: NO bash scripts authored, NO parallel
implementations of upstream Velero functionality. Upstream Velero +
velero-plugin-for-aws natively support any S3-compatible backend; the
work here is values + a credential-shape adapter Secret, not a fork.

Closes #384.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(scripts): drop bp-seaweedfs dep from bp-velero expected DAG (#384)

Mirrors the dependsOn removal in clusters/_template/bootstrap-kit/34-
velero.yaml from the parent commit. Velero on Hetzner Sovereigns now
writes directly to Hetzner Object Storage (ADR-0001 §13 + WBS §3); no
in-cluster prerequisite Blueprint is required.

Local `bash scripts/check-bootstrap-deps.sh` now passes (0 drift,
0 cycles). The CI failure on the parent commit's PR was the audit
flagging bp-velero as having a missing edge to bp-seaweedfs because
this expected-DAG file still listed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:24:44 +04:00
e3mrah
d2ada908c9
feat(bp-openbao): auto-unseal flow — cloud-init seed + post-install init Job (closes #316) (#408)
Catalyst-curated auto-unseal pipeline for OpenBao on Hetzner Sovereigns
(no managed-KMS available). Selected **Option A — Shamir + cloud-init
seed** because:

  - Hetzner has no managed-KMS service → Cloud-KMS auto-unseal (Option C)
    is structurally unavailable.
  - Transit-seal (Option B) requires a peer OpenBao cluster, only
    applicable to multi-region tier-1; out of scope for single-region
    omantel.
  - Manual unseal (Option D) violates the "first sovereign-admin lands
    on console.<sovereign-fqdn> ready to use" goal in
    SOVEREIGN-PROVISIONING.md §5.

Architecture (per issue #316 spec + acceptance criteria 1-6):

  1. Cloud-init on the control-plane node generates a 32-byte recovery
     seed from /dev/urandom and writes it to a single-use K8s Secret
     `openbao-recovery-seed` in the openbao namespace, with annotation
     `openbao.openova.io/single-use: "true"`. Pre-creates the openbao
     namespace to eliminate the race with Flux's HelmRelease apply.
  2. bp-openbao chart v1.2.0 ships two new Helm post-install hooks:
       - `templates/init-job.yaml` (hook weight 5): consumes the seed,
         calls `bao operator init -recovery-shares=1 -recovery-threshold=1`,
         persists the recovery key inside OpenBao's auto-unseal config,
         deletes the seed Secret on success. Idempotent — re-runs detect
         Initialized=true and exit 0.
       - `templates/auth-bootstrap-job.yaml` (hook weight 10): enables
         the Kubernetes auth method, mounts kv-v2 at `secret/`, writes
         the `external-secrets-read` policy, binds the `external-secrets`
         role to the ESO ServiceAccount in `external-secrets-system`.
  3. `templates/auto-unseal-rbac.yaml` declares the least-privilege SA
     + Role + RoleBinding the Jobs need (Secret get/list/delete in the
     openbao namespace; create/get/patch on the openbao-init-marker).
     Also emits the permanent `system:auth-delegator` ClusterRoleBinding
     bound to the OpenBao ServiceAccount so the Kubernetes auth method
     can call tokenreviews.authentication.k8s.io.
  4. Cluster overlay `clusters/_template/bootstrap-kit/08-openbao.yaml`
     bumps version 1.1.1 → 1.2.0 and flips `autoUnseal.enabled: true`
     per-Sovereign.

Per #402 lesson: skip-render pattern (`{{- if .Values.X }}{{ emit }}
{{- end }}`) used throughout — never `{{ fail }}`. Default `helm
template` render emits NOTHING new; opt-in via autoUnseal.enabled=true.

Acceptance criteria coverage:
  1. Provision fresh Sovereign — cloud-init writes seed, Flux installs
     bp-openbao 1.2.0, post-install Jobs run automatically. 
  2. bp-openbao HR Ready=True without manual intervention — install
     keeps `disableWait: true` (Helm Ready ≠ OpenBao initialised; the
     init Job drives initialisation out-of-band on the same install). 
  3. `bao status` shows Sealed=false, Initialized=true within 5 minutes
     — init Job polls + retries up to 60×5s. 
  4. ESO ClusterSecretStore vault-region1 reaches Status: Valid — the
     auth-bootstrap Job binds the `external-secrets` role to ESO's SA
     before the Job exits. 
  5. Seed Secret deleted post-init — init Job deletes it via K8s API
     after consuming. 
  6. No openbao-root-token Secret in K8s — root token captured to
     /tmp/.root-token in the Job pod's tmpfs only; never written to a
     K8s Secret. The recovery key persists ONLY inside OpenBao's Raft
     state (auto-unseal config). 

Tests:
  - tests/auto-unseal-toggle.sh — 4 cases:
    * default render → no auto-unseal artefacts (skip-render works)
    * autoUnseal.enabled=true → both Jobs + correct hook weights
    * kubernetesAuth.enabled=false → init Job only, no auth-bootstrap
    * idempotency annotations present on all 5 hook objects
  - tests/observability-toggle.sh — unchanged, all 3 cases green.
  - helm lint . — clean.

Files:
  - platform/openbao/chart/Chart.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/blueprint.yaml — version 1.1.1 → 1.2.0
  - platform/openbao/chart/values.yaml — `autoUnseal.*` block
  - platform/openbao/chart/templates/auto-unseal-rbac.yaml — new
  - platform/openbao/chart/templates/init-job.yaml — new
  - platform/openbao/chart/templates/auth-bootstrap-job.yaml — new
  - platform/openbao/chart/tests/auto-unseal-toggle.sh — new
  - platform/openbao/README.md — bootstrap procedure §2-3 expanded;
    auto-unseal alternatives table added.
  - clusters/_template/bootstrap-kit/08-openbao.yaml — chart 1.1.1 →
    1.2.0, autoUnseal.enabled=true.
  - infra/hetzner/cloudinit-control-plane.tftpl — seed-token block
    inserted between ghcr-pull-secret apply and flux-bootstrap apply.
  - docs/omantel-handover-wbs.md §9 — #316 ticked chart-released.

Canonical seam used: extended existing `platform/openbao/chart/` per
the anti-duplication rule. NO standalone scripts. NO bespoke Go cloud
calls. NO `{{ fail }}`. All knobs configurable via values.yaml per
INVIOLABLE-PRINCIPLES.md #4 (never hardcode).

Co-authored-by: hatiyildiz <hat.yil@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:45:44 +04:00
e3mrah
04308af7e9
feat(cert-manager): bp-cert-manager-powerdns-webhook (#373) (#410)
Authors a Catalyst Blueprint for the cert-manager DNS-01 external webhook
backed by PowerDNS, for post-handover wildcard TLS issuance against the
Sovereign's OWN PowerDNS — eliminating the last reachback to openova-
controlled Dynadot credentials per ADR-0001 §9.4.

Structure mirrors bp-cert-manager-dynadot-webhook (canonical seam):
- platform/cert-manager-powerdns-webhook/blueprint.yaml — Blueprint CR
  with depends: [bp-cert-manager, bp-powerdns]
- platform/cert-manager-powerdns-webhook/chart/Chart.yaml — wraps upstream
  zachomedia/cert-manager-webhook-pdns v2.5.5 (chart 3.2.5); declares the
  sigstore/common stub dep to satisfy the hollow-chart guard (#181)
- chart/templates/ — 8 templates (Deployment, Service, APIService, RBAC,
  selfSigned/CA Issuer + serving Certificate, ServiceAccount,
  ClusterIssuer)
- ClusterIssuer (letsencrypt-dns01-prod-powerdns) ships with the chart,
  paired with the webhook's solver. Gated behind clusterIssuer.enabled
  AND powerdns.host (skip-render pattern, lesson from #387 follow-up
  #402 — never use {{ fail }})

Bootstrap-kit slot:
- clusters/_template/bootstrap-kit/36-bp-cert-manager-powerdns-webhook.yaml
  wires the HelmRelease to the per-Sovereign in-cluster PowerDNS endpoint
  (http://powerdns.powerdns:8081) and flips clusterIssuer.enabled=true.
- ${SOVEREIGN_FQDN} envsubst keeps the slot operator-overridable per
  Inviolable Principle #4. Contabo bootstrap path does NOT include this
  template — contabo stays on legacy http01 + Traefik per ADR-0001 §9.4.

Helm-template verification:
  helm template t platform/cert-manager-powerdns-webhook/chart/
    → 14 resources, 0 ClusterIssuer (skip-render works)
  helm template t platform/cert-manager-powerdns-webhook/chart/ \
      --set powerdns.host=http://powerdns.test:8081 \
      --set clusterIssuer.enabled=true \
      --set powerdns.apiKeySecretRef.name=fake
    → 15 resources incl. ClusterIssuer with PowerDNS solver config
  Both renders parse cleanly through python yaml.safe_load_all.

Updates docs/omantel-handover-wbs.md §2 row 4 + §9 row #373 to
chart-released. Sovereign-impact deferred to Phase 8 (handover E2E).

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:44:27 +04:00
e3mrah
a1bd550208
fix(charts): HTTPRoute templates skip-render on missing host (was failing default-values render) (#402)
Blueprint-release for #401 failed because HTTPRoute templates use
{{- fail }} when gateway.host is not set, which trips the chart default-values
render gate in CI. Switched 6 templates from 'fail loud' to 'skip render':

  if .Values.gateway.host  →  emit HTTPRoute
  else                     →  emit nothing

The Gateway API admission already rejects HTTPRoute with empty hostnames,
so the loud-fail wasn't buying anything an operator wouldn't see at apply
time. Default-values render now produces zero HTTPRoute resources, which
is the correct shape for the upstream chart consumers that don't set
the Sovereign-only gateway block.

Files: keycloak, gitea, openbao, grafana, harbor, catalyst-platform.

Verified:
  helm template t products/catalyst/chart/ → 0 HTTPRoutes (clean)
  helm template t products/catalyst/chart/ --set ingress.gateway.enabled=true --set ingress.hosts.console.host=console.test --set ingress.hosts.api.host=api.test → 2 HTTPRoutes

Closes the blueprint-release failure on commit abf01b6f.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:23:58 +04:00
e3mrah
abf01b6f21
feat(platform): Gateway API migration audit (#387) (#401)
Migrates every minimal-Sovereign-set blueprint chart from
networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute,
replacing the legacy Traefik-on-Sovereigns assumption with the canonical
Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2
correction note (#388).

The single per-Sovereign Gateway is added as additional documents in
the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml
(NOT a new top-level slot), since Cilium owns the GatewayClass. It
includes:

  - Certificate `sovereign-wildcard-tls` requesting `*.${SOVEREIGN_FQDN}`
    from `letsencrypt-dns01-prod` (cert-manager + #373 webhook)
  - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS
    terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All

Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's
existing `templates/` directory):

  | Blueprint           | Host pattern                    | Backend port |
  |---------------------|---------------------------------|--------------|
  | bp-keycloak         | auth.<sov>                      | 80           |
  | bp-gitea            | git.<sov>                       | 3000         |
  | bp-openbao          | bao.<sov>                       | 8200         |
  | bp-grafana          | grafana.<sov>                   | 80           |
  | bp-harbor           | registry.<sov>                  | 80           |
  | bp-powerdns         | pdns.<sov>/api  (dual-mode)     | 8081         |
  | bp-catalyst-platform| console.<sov>, api.<sov>         | 80, 8080     |

bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute
(Sovereign) simultaneously — the per-Sovereign overlay sets
`api.gateway.enabled=true` while leaving `api.enabled=true`. The
Ingress object is harmless on Cilium clusters with no Traefik. This
preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4.

bp-harbor flips `expose.type` from `ingress` to `clusterIP` in
platform/harbor/chart/values.yaml so the upstream chart no longer
emits its own Ingress; the HTTPRoute is the sole HTTP exposure.
TLS terminates at the Gateway (wildcard cert) rather than per-host
Certificates inside the chart.

bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by
.helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml,
which remain contabo-only legacy demo infra). The contabo path keeps
serving console.openova.io/sovereign via Traefik unchanged.

Bootstrap-kit slot updates (per-Sovereign hostname interpolation):

  - 08-openbao.yaml      → gateway.host: bao.${SOVEREIGN_FQDN}
  - 09-keycloak.yaml     → gateway.host: auth.${SOVEREIGN_FQDN}
  - 10-gitea.yaml        → gateway.host: gitea.${SOVEREIGN_FQDN}
  - 11-powerdns.yaml     → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true
  - 19-harbor.yaml       → gateway.host: registry.${SOVEREIGN_FQDN}
  - 25-grafana.yaml      → gateway.host: grafana.${SOVEREIGN_FQDN}

Server-side dry-run validation against the live Cilium Gateway API
CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway
+ Certificate apply cleanly via `kubectl apply --dry-run=server`.

Contabo unaffected: clusters/contabo-mkt/* not modified. The legacy
SME ingresses (console-nova, marketplace, admin, axon, talentmesh,
stalwart, ...) continue to serve via Traefik as before. powerdns
on contabo remains on the Ingress path (api.gateway.enabled defaults
to false at the chart level).

Closes #387.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:19:30 +04:00
e3mrah
eb92e0496b
feat(platform): add bp-newapi — multi-tenant LLM marketplace gateway (#394) (#396)
Catalyst Blueprint wrapping the upstream NewAPI
(github.com/Calcium-Ion/new-api, MIT) for Sovereign operators whose
business model is reselling LLM access to their own customers.

Backend-only mode: the OpenAI-compatible API at api.<host>/v1/* is
customer-facing; the upstream's portal UI is disabled at ingress;
Catalyst replaces it as the customer surface; NewAPI's admin UI at
admin.<host> is exposed only to ops staff (IdP-gated).

Compliance posture enforced at the blueprint layer:
- Channel attestation gate (refuses to render if any enabled channel
  lacks verifiable provenance — in-cluster, commercial-contract, or
  byok)
- Geographic AUP enforcement (sanctioned-region block on commercial-
  provider channels; US/EU export-control baseline)
- BYOK isolation (request-scoped, never aggregated)
- Reseller disclosure required
- Audit log on bp-cnpg (metadata-only by default)

ACME placeholder used throughout the README; replace with operator
identity in per-Sovereign overlays at clusters/<sovereign>/bootstrap-
kit/.

Files:
- platform/newapi/README.md (design doc + setup checklist)
- platform/newapi/blueprint.yaml (Catalyst Blueprint CR)
- platform/newapi/chart/{Chart.yaml,values.yaml}
- platform/newapi/chart/templates/{_helpers.tpl,deployment.yaml,
  service.yaml,ingress.yaml,configmap.yaml,serviceaccount.yaml,
  networkpolicy.yaml}

Closes design portion of #394.

Co-authored-by: hatiyildiz <hatice@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:57:06 +04:00
e3mrah
05cb39c042
fix(bp-flux): catalyst-cluster-reconciler ClusterRoleBinding overlay (closes #338) (#393)
PROBLEM
-------
On Sovereign-1 (otech.omani.works, 2026-04-30) every HelmRelease that
transitioned through pending-install/pending-upgrade got stuck because
the helm-controller SA could not UPDATE its own helm-storage Secrets
(sh.helm.release.v1.<name>.<n>) in flux-system. Symptom:

  secrets "sh.helm.release.v1.catalyst-platform.v1" is forbidden:
  User "system:serviceaccount:flux-system:helm-controller" cannot
  update resource "secrets" in API group "" in the namespace "flux-system"

Runtime workaround on otech (added 2026-04-30): manual ClusterRoleBinding
flux-system-helm-controller-admin → cluster-admin → flux-system/helm-controller.
Tracked as the permanent fix in #338.

FIX
---
Add platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml — a
Catalyst-managed ClusterRoleBinding (catalyst-cluster-reconciler) that
binds cluster-admin to helm-controller AND kustomize-controller in
.Values.catalyst.fluxNamespace (default flux-system). Independent from
the upstream subchart's cluster-reconciler binding (different name, no
ownership conflict), so if the upstream binding ever drifts again the
overlay still holds the cluster correct.

WHY cluster-admin (not narrower)
--------------------------------
helm-controller installs arbitrary user-supplied Helm charts which can
ship any K8s resource (CRDs, ClusterRoles, MutatingWebhookConfigurations,
etc.). There is no narrower role that satisfies the full install path.
The Flux project's own bootstrap install.yaml binds cluster-admin for
the same reason (upstream default multitenancy.privileged=true).
Multi-tenancy lockdown is a Sovereign Day-2 hardening choice tracked
separately.

NEVER-HARDCODE COMPLIANCE
-------------------------
Per docs/INVIOLABLE-PRINCIPLES.md #4, the namespace is operator-overridable
via .Values.catalyst.fluxNamespace. Default is flux-system because that's
the canonical Catalyst install namespace (matches cloud-init's flux2
install.yaml + clusters/_template/bootstrap-kit/03-flux.yaml).

VERSION
-------
- bp-flux 1.1.2 → 1.1.3 (Chart.yaml + blueprint.yaml + 3 bootstrap-kit refs).
- The flux2 subchart pin (2.14.1) is unchanged — version-pin replay test
  remains green (cloud-init v2.4.0 == subchart appVersion 2.4.0).

VERIFICATION
------------
- platform/flux/chart/tests/version-pin-replay.sh — all 6 cases PASS.
- platform/flux/chart/tests/observability-toggle.sh — all 3 cases PASS.
- helm template renders the new ClusterRoleBinding with correct subjects
  (flux-system by default; verified --set catalyst.fluxNamespace=custom
  override path).
- scripts/check-bootstrap-deps.sh — 0 drift, 0 cycles.

FILES
-----
- platform/flux/chart/templates/catalyst-cluster-reconciler-rbac.yaml (new)
- platform/flux/chart/Chart.yaml (1.1.2 → 1.1.3)
- platform/flux/chart/values.yaml (catalyst.fluxNamespace default)
- platform/flux/blueprint.yaml (1.1.2 → 1.1.3)
- clusters/{_template,otech.omani.works,omantel.omani.works}/bootstrap-kit/03-flux.yaml (chart version)
- docs/lessons-learned/helm-controller-rbac.md (permanent-fix note)
- docs/omantel-handover-wbs.md (#338 status row)

Refs: #43 #369 #338
Lesson: docs/lessons-learned/helm-controller-rbac.md

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-01 15:56:45 +04:00
e3mrah
b8d7a8b9cf
fix(bp-seaweedfs): disable global.enableSecurity to avoid fromToml on helm-controller v1.1.0 (#339)
Upstream seaweedfs/seaweedfs templates/shared/security-configmap.yaml
uses Helm template fromToml; helm-controller v1.1.0's bundled helm SDK
(v3.x older than 3.13) doesn't define fromToml so the install fails:
  parse error at security-configmap.yaml:21: function fromToml not defined
Setting global.seaweedfs.enableSecurity: false skips the entire template.
Internal SeaweedFS API is cluster-IP only on Sovereign-1; chart-level
security is acceptable to defer until helm-controller is bumped.
Bumped 1.0.0 → 1.0.1.
Unblocks the chain: bp-loki, bp-mimir, bp-tempo, bp-velero, bp-harbor,
bp-grafana all dependsOn bp-seaweedfs.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 23:42:43 +04:00
e3mrah
9554be4a5e
fix(bp-external-secrets): gate ClusterSecretStore on CRD presence + drop delete-policy (#337)
The chart's post-install hook was failing on otech.omani.works:
  failed post-install: unable to build kubernetes object for deleting hook
  bp-external-secrets/templates/clustersecretstore-vault-region1.yaml:
  resource mapping not found for kind ClusterSecretStore in version
  external-secrets.io/v1beta1
Two corrections:
1. Capabilities-gate the entire template — don't render unless the
   ClusterSecretStore CRD is registered (it ships in via the upstream
   ESO subchart but isn't live on first install)
2. Remove 'before-hook-creation' delete-policy (was the actual trigger
   for the 'deleting hook' failure path)
Bumped 1.0.0 → 1.0.1.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 23:31:24 +04:00
e3mrah
5502d9aa48
feat(dns): cert-manager-dynadot-webhook for DNS-01 wildcard TLS (closes #159) (#291)
Activates the previously-templated `letsencrypt-dns01-prod` ClusterIssuer
in bp-cert-manager by shipping the missing piece — a Go binary that
satisfies cert-manager's external webhook contract
(`webhook.acme.cert-manager.io/v1alpha1`) against the Dynadot api3.json.

Architecture
============

* `core/pkg/dynadot-client/` — canonical Dynadot HTTP client (shared with
  pool-domain-manager and catalyst-dns). Encapsulates the api3.json
  transport, command builders, response decoding, and the safe
  read-modify-write semantics required to never accidentally wipe a
  zone (memory: feedback_dynadot_dns.md). Destructive `set_dns2`
  variant is unexported.
* `core/cmd/cert-manager-dynadot-webhook/` — the cert-manager webhook
  binary. Implements `Solver.Present` via the client's append-only
  `AddRecord` path and `Solver.CleanUp` via the read-modify-write
  `RemoveSubRecord` path. Domain allowlist (`DYNADOT_MANAGED_DOMAINS`)
  rejects challenges for unmanaged apexes BEFORE any Dynadot call.
* `platform/cert-manager-dynadot-webhook/` — Catalyst-authored Helm
  wrapper. Templates Deployment + Service + APIService + serving
  Certificate (CA chain via cert-manager Issuer self-signing) +
  RBAC + ServiceAccount. Mirrors the standard cert-manager external-
  webhook deployment shape.
* `platform/cert-manager/chart/` — flips `dns01.enabled: true` so the
  paired ClusterIssuer activates. The interim http01 issuer remains
  templated as the rollback path.

Test results
============

  core/pkg/dynadot-client          — 7 tests PASS  (race-clean)
  core/cmd/cert-manager-dynadot-... — 9 tests PASS  (race-clean)

Test coverage includes a Present/CleanUp round-trip against an
httptest fixture that models Dynadot's zone state, an explicit
unmanaged-domain rejection, a regression preserving a pre-existing
CNAME across the DNS-01 round-trip (the zone-wipe defence), and a
typed-error propagation test that surfaces `ErrInvalidToken` to
cert-manager so the controller will retry.

Helm template smoke render
==========================

`helm template` against the new chart with default values yields 12
resources / 424 lines (APIService, Certificate, ClusterRoleBinding,
Deployment, Issuer, Role, RoleBinding, Service, ServiceAccount). The
modified bp-cert-manager chart still renders both ClusterIssuers
(`letsencrypt-dns01-prod` + `letsencrypt-http01-prod`) with default
values; flipping `certManager.issuers.dns01.enabled=false` is the
clean rollback.

Smoke command (post-deploy)
===========================

  kubectl get apiservices.apiregistration.k8s.io \
    v1alpha1.acme.dynadot.openova.io
  # Issue a *.<sovereign>.<pool> wildcard cert and watch the
  # Order/Challenge progress through cert-manager.

CI
==

`.github/workflows/build-cert-manager-dynadot-webhook.yaml` mirrors the
pool-domain-manager-build pattern (cosign keyless signing, SBOM
attestation, GHCR push at `ghcr.io/openova-io/openova/cert-manager-
dynadot-webhook:<sha>`). Triggered by changes to either the binary or
the shared dynadot-client package.

Closes #159

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:37:47 +04:00
e3mrah
c09109a61a
feat(charts): bp-stunner + bp-knative + bp-kserve wrapper charts (closes #263 #264 #265) (#290)
Edge + serverless + model-serving batch (W2.5.C) — three upstream-
subchart umbrella Blueprints completing the bootstrap-kit slots for
WebRTC media relay (bp-relay → bp-stunner) and the AI/ML serving stack
(bp-cortex → bp-kserve → bp-knative).

Each chart follows the canonical umbrella pattern from
docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream
chart under `dependencies:` so `helm dependency build` bundles the
upstream payload into the OCI artifact, and Catalyst-curated overlay
values + templates sit alongside in chart/values.yaml + chart/templates/.

Per-chart highlights:
- bp-stunner/1.0.0 — wraps stunner/stunner-gateway-operator 1.1.0.
  Ships a Cilium-native GatewayClass (Capabilities-gated on
  gateway.networking.k8s.io/v1) so bp-relay (LiveKit / SFU) can claim
  Gateway CRs without an operator-ordering dance. Default UDP TURN port
  range 30000-32767 matches the range opened at the Sovereign edge
  firewall (Crossplane bp-firewall composition).
- bp-knative/1.0.0 — wraps knative-operator v1.21.1. Ships a
  KnativeServing CR pre-configured for **istio-less mode**
  (ingress.istio.enabled=false, ingress.contour.enabled=false,
  ingress.kourier.enabled=false; config.network.ingress-class=cilium).
  Sovereign FQDN sourced from values, no hardcoded fallback per
  inviolable principle #4 — render fails loudly if cluster overlay
  doesn't set knativeOverlay.knativeServing.sovereignFqdn.
- bp-kserve/1.0.0 — wraps kserve/kserve v0.16.0 (latest version
  published on the official OCI registry as of 2026-04-30). Default
  deploymentMode=RawDeployment (no Knative hop on the hot path) but
  bp-knative is still installed (declared as a hard dep) so per-IS
  annotation `serving.kserve.io/deploymentMode: Serverless` opts in to
  scale-to-zero per tenant. Cilium native Gateway-API ingress
  (enableGatewayApi=true, className=cilium, disableIstioVirtualHost=
  true).

Observability discipline (issue #182): every observability toggle
(ServiceMonitor, HPA, GatewayClass) defaults false and is operator-
tunable via per-cluster overlay once bp-kube-prometheus-stack reconciles.
Each chart ships tests/observability-toggle.sh covering default-off,
opt-in (with `--api-versions monitoring.coreos.com/v1` to simulate
Prometheus Operator CRDs), and explicit-off cases.

Per-chart kind summary (helm template default render):

  bp-stunner: ClusterRole, ClusterRoleBinding, ConfigMap, Dataplane,
              Deployment, Role, RoleBinding, Service, ServiceAccount.
              (+ GatewayClass when --api-versions
              gateway.networking.k8s.io/v1 is passed.)

  bp-knative: ClusterRole, ClusterRoleBinding, ConfigMap,
              CustomResourceDefinition, Deployment, KnativeServing,
              Role, RoleBinding, Secret, Service, ServiceAccount.

  bp-kserve:  Certificate, ClusterRole, ClusterRoleBinding,
              ClusterServingRuntime, ClusterStorageContainer,
              ConfigMap, Deployment, Gateway, Issuer,
              MutatingWebhookConfiguration, Role, RoleBinding,
              Service, ServiceAccount, ValidatingWebhookConfiguration.

`helm lint` clean for all three (single INFO on missing icon — icons
land with marketplace card work).

`bash tests/observability-toggle.sh` green for all three (3 cases each:
default-off, opt-in, explicit-off).

Closes #263 #264 #265

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 19:37:38 +04:00
e3mrah
782d8015c5
feat(charts): bp-openmeter (CH-less) + bp-livekit + bp-matrix wrapper charts (closes #272 #273 #274) (#289)
W2.5.F — three Catalyst Blueprint umbrella charts at platform/{openmeter,
livekit,matrix}/, each declaring its upstream chart under Chart.yaml
`dependencies:` so `helm dependency build` bundles the upstream payload
into the published OCI artifact (per docs/BLUEPRINT-AUTHORING.md §11.1
— hollow charts forbidden, CI-enforced by issue #181).

Per-chart kind summary
======================

bp-openmeter (closes #272)
  default `helm template` kinds: ConfigMap, Deployment, Service, ServiceAccount
  upstream chart: openmeter 1.0.0-beta.213 (oci://ghcr.io/openmeterio/helm-charts)

  ClickHouse-less profile per docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md §6.4.
  The upstream chart's bundled clickhouse / kafka / postgresql / redis /
  svix subcharts are all DISABLED — Catalyst supplies CNPG (postgres),
  JetStream (event bus), and Valkey (redis-compat) at the platform tier.
  Chart-level toggle `catalystBlueprint.backend.kind` (default `cnpg`,
  alt `clickhouse`) records the active profile so observability/audit
  pipelines can report it. The OpenMeter binary's
  `aggregation.clickhouse.address` is left blank — per-Sovereign overlay
  supplies it once a host cluster adds bp-clickhouse and the operator
  re-rolls with `backend.kind: clickhouse`. Catalyst overlay templates
  (NetworkPolicy / ServiceMonitor / HPA) all default OFF per
  docs/BLUEPRINT-AUTHORING.md §11.2.

bp-livekit (closes #273)
  default `helm template` kinds: ConfigMap, Deployment, Service, ServiceAccount
  upstream chart: livekit-server 1.9.0 (https://helm.livekit.io)

  WebRTC SFU. Powers the Huawei iFlytek voice demo. Catalyst defaults
  pair LiveKit with bp-stunner (the upstream chart's bundled co-located
  TURN server is OFF; per-Sovereign overlay points the LiveKit TURN
  config at the stunner UDP-gateway Service). RTC UDP port range is
  50000-60000 (matches the Hetzner firewall rule the per-Sovereign
  overlay opens). Catalyst overlay templates (NetworkPolicy /
  ServiceMonitor / HPA) all default OFF; the chart's NetworkPolicy
  template documents that LiveKit's hostNetwork mode means pod-level
  policies do NOT cover the SFU port range — the firewall rule is the
  load-bearing control. blueprint.yaml `depends:` declares bp-stunner +
  bp-cert-manager + bp-valkey.

bp-matrix (closes #274)
  default `helm template` kinds: ConfigMap, Deployment, Ingress, Job,
  PersistentVolumeClaim, Pod, Role, RoleBinding, Secret, Service,
  ServiceAccount
  upstream chart: matrix-synapse 3.12.25 (https://ananace.gitlab.io/charts)

  Synapse (the Matrix server implementation, NOT the retired OpenOva
  product noun). Federation OFF by default (Catalyst per-Sovereign
  tenancy default — operator overlays flip it on per-Organization).
  Postgres backend via bp-cnpg externalPostgresql; OIDC SSO via
  bp-keycloak; bundled bitnami postgresql + redis subcharts both
  disabled. Catalyst overlay NetworkPolicy gates the federation port
  (8448) on `federation.enabled` — verified by Case 5 of the
  observability-toggle test. Catalyst-overlay ServiceMonitor (upstream
  chart has none) + HPA both default OFF.

Lint
====
All three charts pass `helm lint` clean (only the noisy "icon is
recommended" INFO message).

Observability tests
===================
Each chart's `tests/observability-toggle.sh` enforces the Catalyst
contract from docs/BLUEPRINT-AUTHORING.md §11.2:
  Case 1: default render produces zero monitoring.coreos.com/v1
          resources (no ServiceMonitor / PrometheusRule).
  Case 2: opt-in (--set serviceMonitor.enabled=true --api-versions
          monitoring.coreos.com/v1) renders a ServiceMonitor.
  Case 3: explicit-off render is clean.
  Case 4 (per chart):
    - openmeter: ClickHouse-less profile asserts no
      clickhouse.altinity.com / Kafka subchart resources leak into the
      default render.
    - livekit:   asserts upstream livekit-server.serviceMonitor.create
      defaults false.
    - matrix:    asserts default render carries an empty
      federation_domain_whitelist (the per-Sovereign tenancy default).
  Case 5 (matrix only): `--set federation.enabled=true networkPolicy
          .enabled=true` opens port 8448 in the Catalyst NetworkPolicy.

All gates green for all three charts.

Closes #272 #273 #274

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 19:37:28 +04:00
e3mrah
87d9a4afa7
feat(charts): bp-temporal + bp-llm-gateway + bp-anthropic-adapter wrapper charts (closes #267 #268 #271) (#288)
W2.5.E batch — three Application-tier Blueprints completing the LLM
serving / workflow stack:

- bp-temporal/1.0.0 — wraps temporal/temporal 1.2.0 (the new chart
  rewrite that removed cassandra:/mysql:/postgresql:/elasticsearch:/
  prometheus:/grafana: top-level keys in favour of
  server.config.persistence.datastores). Postgres-only via CNPG-backed
  visibility store (skip Cassandra). Web UI ON. Keycloak OIDC
  integration via --auth-claim-mapper renders auth.yaml ConfigMap
  (operator wires via additionalVolumes once bp-keycloak is
  reconciled, default OFF). dependsOn: bp-cnpg + bp-cert-manager.
  Closes #271.
  Kinds: Cluster (CNPG) + ConfigMap + Deployment + Job + Pod +
  Service.

- bp-llm-gateway/1.0.0 — wraps berriai/litellm-helm 0.1.572 from OCI.
  Subscription-aware proxy for Claude Code: routes to Anthropic (via
  operator OAuth/Max subscription — NEVER an ANTHROPIC_API_KEY,
  per memory/feedback_no_api_key.md), Bedrock, Vertex,
  OpenAI-compatible (via bp-anthropic-adapter), and self-hosted
  vLLM. CNPG-backed audit log (every prompt + response persisted
  for compliance). Bundled bitnami postgresql + redis subcharts
  DISABLED (db.useExisting=true points at the CNPG cluster).
  Keycloak SSO via auth.yaml ConfigMap (default OFF).
  ExternalSecret-backed environmentSecrets brings tokens / IAM
  creds in without inlining plaintext. dependsOn: bp-cnpg +
  bp-keycloak + bp-external-secrets. Closes #267.
  Kinds: Cluster (CNPG audit) + ConfigMap + Deployment + Job +
  Pod + Secret + Service + ServiceAccount.

- bp-anthropic-adapter/1.0.0 — Catalyst-authored scratch chart for
  the OpenAI ↔ Anthropic translation Go service. SHA-pinned image
  ghcr.io/openova-io/openova/anthropic-adapter:<sha> (Inviolable
  Principle #4a — GitHub Actions is the only build path; empty
  default tag fails the render with a clear error instead of
  silently shipping :latest). OAuth/Max subscription token mounted
  from K8s Secret materialized by ESO from bp-openbao —
  ANTHROPIC_OAUTH_TOKEN env var, NEVER an ANTHROPIC_API_KEY.
  Includes OpenAI → Anthropic model-mapping ConfigMap (gpt-4 →
  claude-3-5-sonnet, gpt-4o-mini → claude-3-5-haiku, etc.).
  sigstore/common library subchart included to satisfy the
  hollow-chart gate (matches bp-vllm pattern from #283).
  dependsOn: bp-external-secrets. Closes #268.
  Kinds: ConfigMap + Deployment + Service + ServiceAccount.

CRITICAL — bp-llm-gateway and bp-anthropic-adapter both consume the
operator's Claude OAuth/Max subscription. Per memory/
feedback_no_api_key.md and the user's standing instruction, neither
chart accepts or generates an ANTHROPIC_API_KEY. Tokens flow
exclusively through ExternalSecret-managed K8s Secrets that ESO
materializes from bp-openbao at install time.

Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every
observability toggle defaults `false` (ServiceMonitor / metrics
sidecar / PodMonitor) and is operator-tunable via per-cluster
overlay once bp-kube-prometheus-stack reconciles. Each chart ships
tests/observability-toggle.sh covering default-off, opt-in (with
--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and
explicit-off cases. bp-anthropic-adapter additionally tests the
never-:latest gate via Case 4 (empty image tag must fail render).

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every
upstream version, namespace, server URL, role, secret name, model
default, and toggle is exposed under values.yaml. Cluster overlays
in clusters/<sovereign>/ may override without rebuilding the
Blueprint OCI artifact.

Per docs/BLUEPRINT-AUTHORING.md §11.1 (umbrella shape — hard
contract): bp-temporal and bp-llm-gateway declare their upstream
charts under Chart.yaml dependencies: so helm dependency build
bundles the upstream payload into the OCI artifact. bp-anthropic-
adapter is a scratch chart (no upstream Helm chart exists) and
includes sigstore/common as the obligatory hollow-chart-gate
dependency, matching the bp-vllm precedent from W2.5.D (#283).

Closes #267
Closes #268
Closes #271

helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only)

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 19:37:19 +04:00
e3mrah
a6bf07b0ce
feat(charts): bp-librechat wrapper chart (closes #275) (#287)
W2.5.G — Catalyst-authored scratch chart for LibreChat (slot 48 of the
omantel-1 bootstrap-kit). LibreChat upstream does not publish a Helm
chart, so this chart hand-wires the official ghcr.io/danny-avila/librechat
container as Deployment + Service + Ingress + ConfigMap + ServiceAccount
+ NetworkPolicy + ServiceMonitor + HPA, with the sigstore/common
library subchart declared to satisfy the hollow-chart gate (issue #181).

Per docs/BLUEPRINT-AUTHORING.md §11.2: every observability toggle
(serviceMonitor, hpa) defaults false; opt-in via per-cluster overlay
once kube-prometheus-stack reconciles. The ServiceMonitor template is
double-gated by .Values.serviceMonitor.enabled AND
Capabilities.APIVersions.Has "monitoring.coreos.com/v1" so flipping the
toggle on a too-early Sovereign cannot break the bp-librechat reconcile.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every endpoint
URL, model name, secret reference, namespace selector, and image tag is
operator-tunable via values.yaml. The Sovereign FQDN, Keycloak issuer,
llm-gateway URL, embeddings URL, and TLS ClusterIssuer are all
operator-supplied at install time. The image tag is pinned to v0.7.5
(no :latest).

Connectors:
- Chat completions: bp-llm-gateway (OpenAI-compatible /v1/chat/completions)
  exposed as a "custom" endpoint named "Catalyst LLM"
- Embeddings (RAG): bp-bge — provider=bge maps to EMBEDDINGS_PROVIDER=openai
  + RAG_OPENAI_BASEURL=<bge.svc> at template-render time
- SSO: bp-keycloak (OpenID Connect) — issuer/clientId from values,
  client secret + session secret from ExternalSecret
- Conversation store: FerretDB on bp-cnpg (MongoDB wire protocol over
  Postgres) — operator-supplied connection URI

Hosted at chat-app.<sovereign-fqdn>; the chart `fail`s render if
ingress.host is empty (no platform-wide default).

helm template (default values, --set ingress.host=...):
  ConfigMap, Deployment, Ingress, NetworkPolicy, Service, ServiceAccount

helm template (--set hpa.enabled=true serviceMonitor.enabled=true
              --api-versions monitoring.coreos.com/v1):
  ConfigMap, Deployment, HorizontalPodAutoscaler, Ingress, NetworkPolicy,
  Service, ServiceAccount, ServiceMonitor

helm lint: 1 chart(s) linted, 0 chart(s) failed (single INFO on
missing icon — icons land with the marketplace card work).

tests/observability-toggle.sh: PASS on default-off, opt-in
(--api-versions monitoring.coreos.com/v1 to simulate the CRDs), and
explicit-off cases.

Path isolation: only platform/librechat/ — no HR slot files,
blueprint-release.yaml, or other charts touched. The HR slot files
(clusters/.../48-librechat.yaml) and blueprint-release.yaml will land
in a separate slot-wiring PR per the W2.K4 expansion plan.

Closes #275

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:56:59 +04:00
e3mrah
9dc8506dd9
feat(charts): bp-external-secrets + bp-cnpg + bp-valkey wrapper charts (#285)
Storage-substrate batch (W2.5.A) — closes #254 by shipping the three
upstream-subchart umbrella Blueprints that the Flux HRs at
clusters/_template/bootstrap-kit/{15-external-secrets,16-cnpg,17-valkey}
.yaml (merged via PR #262) target.

Each chart follows the canonical umbrella pattern documented in
docs/BLUEPRINT-AUTHORING.md §11.1: Chart.yaml declares the upstream
chart under `dependencies:` so `helm dependency build` bundles the
upstream payload into the OCI artifact, and Catalyst-curated overlay
values + templates sit alongside in chart/values.yaml + chart/templates/.

Per-chart highlights:
- bp-external-secrets/1.0.0 — wraps external-secrets/external-secrets
  0.10.7. Ships a default `vault-region1` ClusterSecretStore (via Helm
  post-install/post-upgrade hook to defer the CR application until the
  upstream chart's CRDs are registered) wired to the in-cluster
  bp-openbao service. clusterSecretStore.enabled toggle lets cluster
  overlays opt out and author their own multi-region CRs.
- bp-cnpg/1.0.0 — wraps cnpg/cloudnative-pg 0.28.0. Operator-only
  surface (Cluster CRs are per-Application). CRDs ship in-chart so
  bp-powerdns / bp-keycloak / bp-gitea / bp-langfuse / bp-grafana /
  bp-temporal / bp-matrix / bp-llm-gateway / bp-bge / bp-nemo-guardrails
  / bp-openmeter / pool-domain-manager can `dependsOn: bp-cnpg` via
  Flux — closing #254 (bp-powerdns CreateContainerConfigError on
  pdns-pg-app secret).
- bp-valkey/1.0.0 — wraps bitnami/valkey 5.5.1. BSD-3 Redis-compatible
  cache, replication architecture, password auth ON, NetworkPolicy ON,
  replicas 0 by default for solo Sovereigns (cluster overlays bump for
  HA). Application-tier cache only — Catalyst control plane uses NATS
  JetStream KV (per ARCHITECTURE.md §5).

Per docs/BLUEPRINT-AUTHORING.md §11.2 (issue #182): every observability
toggle defaults `false` (ServiceMonitor / PodMonitor / PrometheusRule /
metrics sidecar) and is operator-tunable via per-cluster overlay once
bp-kube-prometheus-stack reconciles. Each chart ships
tests/observability-toggle.sh covering default-off, opt-in (--api-versions
monitoring.coreos.com/v1 to simulate the CRDs), and explicit-off cases.

Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode): every upstream
version, namespace, server URL, role, and password toggle is exposed
under values.yaml. Cluster overlays in clusters/<sovereign>/ may
override without rebuilding the Blueprint OCI artifact.

helm lint: 1 chart(s) linted, 0 chart(s) failed (each, INFO icon-recommended only)
helm template default render kinds:
  bp-external-secrets: ClusterRole, ClusterRoleBinding, ClusterSecretStore, CustomResourceDefinition, Deployment, Role, RoleBinding, Secret, Service, ServiceAccount, ValidatingWebhookConfiguration
  bp-cnpg:             ClusterRole, ClusterRoleBinding, ConfigMap, CustomResourceDefinition, Deployment, MutatingWebhookConfiguration, Service, ServiceAccount, ValidatingWebhookConfiguration
  bp-valkey:           ConfigMap, NetworkPolicy, PodDisruptionBudget, Secret, Service, ServiceAccount, StatefulSet

Closes #254

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 18:39:29 +04:00
e3mrah
ba2ff05292
feat(charts): bp-seaweedfs + bp-harbor + bp-vpa wrapper charts (#284)
W2.5.B — first authoring of the three Catalyst Blueprint wrapper charts
that fill bootstrap-kit slots 18 (seaweedfs), 19 (harbor) and 29 (vpa).
Each wraps an upstream chart as a Helm subchart and ships Catalyst-
curated overlay templates (NetworkPolicy + ServiceMonitor) gated behind
opt-in toggles, per docs/BLUEPRINT-AUTHORING.md §11 and
docs/INVIOLABLE-PRINCIPLES.md.

bp-seaweedfs (slot 18 — storage foundation)
  - Wraps seaweedfs/seaweedfs 4.22.0; Chart name `bp-seaweedfs`.
  - Catalyst defaults: 1 master + 3 volume + 1 filer + 2 s3 replicas.
  - S3 API on 8333 — single S3 surface every consumer talks to per
    docs/PLATFORM-TECH-STACK.md §3.5 (no per-app MinIO).
  - Overlay templates: NetworkPolicy (cross-namespace S3 reachability,
    cold-tier egress allowlist), ServiceMonitor (Capabilities-gated,
    DEFAULT FALSE per §11.2).
  - Default helm template kinds: ClusterRole, ClusterRoleBinding,
    ConfigMap, Deployment, Secret, Service, ServiceAccount, StatefulSet.

bp-harbor (slot 19 — per-Sovereign OCI registry)
  - Wraps goharbor/harbor 1.18.3 (appVersion 2.14.3); Chart name
    `bp-harbor`.
  - Catalyst defaults: blob backend = SeaweedFS S3 (regionendpoint
    seaweedfs-s3.seaweedfs.svc:8333), metadata DB = bp-cnpg external
    Postgres, ingress class `cilium`, expose.tls.enabled true (cert-
    manager-issued Secret).
  - Overlay templates: NetworkPolicy (CNPG/SeaweedFS/Keycloak egress),
    ServiceMonitor (Capabilities-gated, DEFAULT FALSE).
  - Trivy + SSO + pull-mirror are operator-flag opt-ins per per-
    Sovereign overlay (default false; trivy/keycloak/cnpg deps land on
    later slots).
  - Default helm template kinds: ConfigMap, Deployment, Ingress,
    PersistentVolumeClaim, Secret, Service, StatefulSet.

bp-vpa (slot 29 — vertical autoscaling)
  - Wraps cowboysysop/vertical-pod-autoscaler 11.1.1 (appVersion
    1.5.0); Chart name `bp-vpa`.
  - Catalyst defaults: 1 replica each of recommender + updater +
    admission-controller. Default mode `Off` (recommend only).
  - Admission webhook self-signs via init Job (cluster-internal); per-
    Sovereign overlay MAY swap to cert-manager.
  - Overlay templates: NetworkPolicy (apiserver + metrics-server
    egress, admission webhook ingress).
  - Upstream metrics.serviceMonitor / metrics.prometheusRule defaulted
    false per §11.2.
  - Default helm template kinds: ClusterRole, ClusterRoleBinding,
    ConfigMap, Deployment, Job, Pod, Secret, Service, ServiceAccount.

Lint + observability-toggle results
  helm lint: 1 chart(s) linted, 0 chart(s) failed (each)
  tests/observability-toggle.sh: PASS on all three (default render has
  zero monitoring.coreos.com/v1 references; opt-in render produces a
  ServiceMonitor; explicit-off render is clean).

Path isolation: only platform/seaweedfs/, platform/harbor/, and
platform/vpa/ — no HR slot files or other charts touched.

Refs: bootstrap-kit slots 18, 19, 29 reconcile against
ghcr.io/openova-io/bp-seaweedfs:1.0.0, bp-harbor:1.0.0, bp-vpa:1.0.0
which this commit produces on next blueprint-release CI run.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 18:37:50 +04:00
e3mrah
c3c9c0cf27
feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283)
Catalyst-authored umbrella charts for the W2.5.D AI-inference stack.
None of the three upstream projects publish a Helm chart, so each
chart hand-wires the upstream container as Deployment + Service +
ConfigMap + ServiceMonitor + NetworkPolicy + HPA, with the
sigstore/common library subchart declared to satisfy the
hollow-chart gate (issue #181).

bp-vllm (slot 39) — wraps vllm/vllm-openai:v0.6.4. GPU-aware
(nvidia.com/gpu when vllm.gpu.enabled=true; CPU fallback for dev).
Default model meta-llama/Llama-3.1-8B-Instruct, port 8000,
OpenAI-compatible /v1/chat/completions. All engine knobs
(maxModelLen, gpuMemoryUtilization, dtype, quantization,
tensorParallelSize, prefix-caching) overlay-tunable. Closes #266.

bp-bge (slot 42) — wraps ghcr.io/huggingface/text-embeddings-inference:cpu-1.5.
Default model BAAI/bge-small-en-v1.5 + BAAI/bge-reranker-base
sidecar in same Pod. Two-port Service (8080 embed, 8081 rerank)
annotated for bp-llm-gateway discovery. CPU-friendly defaults;
overlay swaps in BAAI/bge-m3 on GPU Sovereigns. Closes #269.

bp-nemo-guardrails (slot 43) — wraps the upstream NVIDIA/NeMo-Guardrails
Dockerfile (nemoguardrails server, FastAPI, port 8000). LLM endpoint
+ model + engine all overlay-tunable; Colang flow bundle mounts via
configMap.externalName for production rails. ConfigMap stub renders
a default rail for smoke testing. Closes #270.

All three charts:
- Default observability toggles to false per BLUEPRINT-AUTHORING.md §11.2
- Pin upstream image tags (no :latest) per INVIOLABLE-PRINCIPLES.md #4
- Non-root securityContext (runAsUser 1000, drop ALL capabilities)
- prometheus.io scrape annotations on the Pod for fallback discovery
- Operator-tunable NetworkPolicy gating ingress to bp-llm-gateway and
  egress to HuggingFace / bp-vllm / bp-bge as appropriate

helm template (default values) per chart:
  bp-vllm:            ConfigMap, Deployment, Service, ServiceAccount
  bp-bge:             ConfigMap, Deployment, Service, ServiceAccount
  bp-nemo-guardrails: ConfigMap, Deployment, Service, ServiceAccount

helm template (--set serviceMonitor.enabled=true networkPolicy.enabled=true hpa.enabled=true):
  All three render ConfigMap + Deployment + Service + ServiceAccount +
  ServiceMonitor + NetworkPolicy + HorizontalPodAutoscaler.

helm lint: 0 chart(s) failed for all three (single INFO on missing icon —
icons land with the marketplace card work).

Closes #266
Closes #269
Closes #270

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:37:07 +04:00
e3mrah
0cfd0defa9
fix(bp-langfuse): drop apostrophe from description to clear GHCR 500 (resolves #215) (#278)
Root cause: Helm's `helm push` collapses the chart `description` field
into a single-line OCI manifest annotation
`org.opencontainers.image.description`. The GHCR manifest-PUT validator
returns a deterministic 500 Internal Server Error when that annotation
is long AND contains an ASCII apostrophe. bp-langfuse 1.0.0 was the
only chart in the observability batch (PR #214) carrying both
characteristics, so it was the only one that failed to publish.

Fix: reword the affected sentence from "Langfuse's persistent state" to
"the Langfuse persistent state" — drops the apostrophe, preserves the
meaning, and crucially preserves every byte of the actual chart payload
(values, templates, all 350 entries of the upstream langfuse-1.5.28
subchart with its 4-level-deep Bitnami vendoring). No runtime
behavioural change; helm template renders the exact same 6 resources
across 490 lines.

The narrowing was done by progressively reducing the Chart.yaml from
the failing version to a passing version while pushing to a scratch
GHCR namespace, with the bp-langfuse repo deleted between attempts
(verified via `DELETE /orgs/openova-io/packages/container/bp-langfuse`
and re-querying). The trigger is reproducible: long description +
apostrophe → 500; long description without apostrophe → push succeeds;
short description with apostrophe → push succeeds.

Added a multi-line WARNING comment immediately above `description:`
documenting the trigger so future authors do not reintroduce a
possessive form. Issue #215 captures the full reproduction.

Closes #215

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 17:31:51 +04:00
e3mrah
ec3821f7e1
fix(bp-*): event-driven HR install -- drop blanket timeout, use disableWait (#250)
Helm install completes when manifests apply, not when pods reach Ready.
Flux dependsOn checks Ready=True on each HR independently, so
spec.install.disableWait + spec.upgrade.disableWait is the correct
shape for slow-Ready workloads. Blanket spec.timeout: Nm watchdogs from
PR #221 were a band-aid that caused cascading HR failures and blocked
downstream HRs (bp-nats-jetstream, bp-openbao depended on bp-spire).

Founder direction (verbatim): "always event driven robust jobs"

Per-HR audit (drop spec.timeout: 15m, add disableWait, with reason):

- bp-cilium:        envoyconfig CRD self-wait — agent crash-loops until
                    its own CRDs land
- bp-cert-manager:  webhook readiness depends on cainjector mutating
                    Secret — multi-minute on cold start
- bp-flux:          adopts cloud-init Flux objects; the helm-controller
                    reconciling THIS HR is itself a chart target — Ready
                    deadlock without disableWait
- bp-sealed-secrets: single-replica controller + CRD — install completes
                    on manifest apply
- bp-spire:         spire-controller-manager waits for CRD informer cache
                    sync — multi-minute legitimate path; chart fix below
- bp-nats-jetstream: JetStream raft quorum formation across N replicas
- bp-openbao:       3-node Raft sealed-by-default; Ready=True only after
                    operator runs `bao operator init` unseal flow
- bp-keycloak:      DB schema migration + 100+ Liquibase changesets on
                    first install
- bp-gitea:         PostgreSQL DB init + admin user + Blueprint catalog
                    mirror seeding
- bp-external-dns:  pod readiness depends on PowerDNS API + pdns-pg CNPG
                    cascade
- bp-catalyst-platform: ~10 services, inter-service NATS/OTel readiness
                    is not Helm's concern

Intentionally NOT touched (other parallel agents own these):
- bp-crossplane (Agent A): chart split for intra-chart CRD-ordering
- bp-powerdns   (Agent D): post-install hook for intra-chart Job-ordering

bp-spire chart fix (1.1.3 -> 1.1.4):

Root cause investigation on otech.omani.works (live):
  spire-controller-manager has restarted 37 times with:
    "failed to wait for clusterstaticentry caches to sync: timed out
     waiting for cache to be synced for Kind *v1alpha1.ClusterStaticEntry"

`kubectl get crd | grep spire` returns nothing — the spire.spiffe.io
v1alpha1 CRDs (ClusterSPIFFEID / ClusterStaticEntry /
ClusterFederatedTrustDomain) are NOT registered. The upstream `spire`
chart does not install its own CRDs; the spiffe maintainers ship them
via the SEPARATE `spire-crds` chart, expected to be installed first.

Fix: platform/spire/chart/Chart.yaml now declares spire-crds 0.5.0 as
the FIRST dependency. Helm installs subcharts in dependency order, so
listing spire-crds first guarantees CRDs are applied before the spire
subchart's controller-manager Deployment starts. blueprint.yaml +
both 06-spire.yaml cluster references bumped to 1.1.4.

Live error this fixes (otech.omani.works, persistent ~5h):
  Helm upgrade failed for release spire-system/spire with chart
  bp-spire@1.1.3: context deadline exceeded
  + downstream cascade: bp-nats-jetstream / bp-openbao stuck at
    "dependency 'flux-system/bp-spire' is not ready"

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:19 +04:00
e3mrah
726af6df81
fix(bp-powerdns): self-generate api-credentials Secret + disable upstream zone-bootstrap Job (#248)
Root cause investigation on otech.omani.works (kubectl, sanitized):

  $ kubectl get pods -n powerdns
  create-zone-if-not-exist-sh-tjtr4   0/1  CreateContainerConfigError  4h
  powerdns-57d7d49f99-{9hrb4,lxlgt,nkmht}  0/1  CreateContainerConfigError  4h
  dnsdist-594dbfc5f-wznsw                  1/1  Running  4h

  $ kubectl get secrets -n powerdns
  powerdns                Opaque  1  4h
  powerdns-api-tls-8kxpx  Opaque  1  4h     (NO `powerdns-api-credentials`, NO `pdns-pg-app`)

  $ kubectl describe pod ... powerdns-57d7d49f99-9hrb4
  Environment:
    PDNS_API_KEY:  <set to the key 'api-key' in secret 'powerdns-api-credentials'>  Optional: false
    PDNS_DB_HOST:  <set to the key 'host' in secret 'pdns-pg-app'>                  Optional: false
    State: Waiting   Reason: CreateContainerConfigError

The handover's chicken-egg-with-secret theory was directionally right but
the cause was more fundamental:

  1. Wrapper chart's api-credentials-secret.yaml (1.1.2) was a no-op
     unless operator set `apiKey` value out-of-band — comment said the
     deployment would "fail to start until the named Secret exists" as
     "the explicit signal we want". On a Sovereign that bootstraps from
     bp-* OCI artifacts, no operator is standing by, so the Secret is
     never created and pods sit in CreateContainerConfigError forever.

  2. The upstream chart's `create-zone-if-not-exists-sh` Job is rendered
     whenever both `zoneName` and `api.key` are set — defaulting
     `zoneName: "example.de."` it ALWAYS rendered and ALWAYS failed
     (same missing Secret). Catalyst doesn't want this Job at all
     because zones are loaded later by pool-domain-manager (PDM).

  3. The chart's CNPG Cluster template is gated behind
     Capabilities.APIVersions.Has "postgresql.cnpg.io/v1" — on a fresh
     Sovereign without bp-cnpg yet (bp-cnpg is on the roadmap, not in
     bootstrap-kit), no Cluster is rendered and `pdns-pg-app` Secret
     never materialises. With Helm `--wait`, install times out
     ("context deadline exceeded") even though the manifests applied
     cleanly.

Fix:

  * api-credentials-secret.yaml: self-generate via Helm `lookup` +
    `randAlphaNum 32`. First install creates fresh randoms; every
    subsequent reconcile reads back the existing values from the
    Secret so the API key never rotates on upgrade. Operator can
    still pin specific values via .Values.powerdns.apiKey /
    .Values.powerdns.webserverPassword, or skip Secret creation
    entirely via .Values.powerdns.useExistingApiSecret. Same pattern
    as bitnami/postgresql, bitnami/keycloak.

  * values.yaml: set `powerdns.zoneName: ""` so upstream chart's
    `{{- if and .Values.powerdns.zoneName .Values.powerdns.api.key }}`
    gate skips the create-zone Job entirely. Catalyst's PDM creates
    zones via the REST API after the cluster comes up; we don't want
    a placeholder `example.de.` zone in production.

  * HelmRelease (both _template and otech.omani.works overlays):
    `install.disableWait: true` + `upgrade.disableWait: true` so the
    HelmRelease reports Ready as soon as manifests apply cleanly,
    rather than gating on powerdns Deployment readiness which depends
    on bp-cnpg landing first to synthesise `pdns-pg-app`. Runtime
    convergence is observed via kubectl, not gated on Helm.

Live error this addresses:
  Helm upgrade failed for release powerdns/powerdns with chart
  bp-powerdns@1.1.2: context deadline exceeded

Verified locally with `helm template`:
  - powerdns-api-credentials Secret renders with random api-key + webserver-password
  - create-zone-if-not-exist-sh Job no longer rendered
  - Deployment env continues to reference powerdns-api-credentials correctly

Bumped 1.1.2 -> 1.1.3 (chart, blueprint, both bootstrap-kit overlays).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:12 +04:00
e3mrah
2d1799d738
fix(bp-crossplane): split XRDs+Compositions into bp-crossplane-claims (#247)
Resolves install ordering on fresh clusters where the apiserver rejects
CompositeResourceDefinition CRs because the apiextensions.crossplane.io
CRDs registered by the crossplane subchart aren't live yet at apply time.

- bp-crossplane bumped 1.1.2 -> 1.1.3 (controller-only payload)
- NEW bp-crossplane-claims@1.0.0 carries XRDs + Compositions
- Flux HelmRelease for crossplane-claims uses dependsOn: [bp-crossplane]
- composition-validate.sh + fixtures relocate to the new chart
- blueprint-release CI: opt-out annotation
  catalyst.openova.io/no-upstream=true permits zero-deps charts that
  legitimately ship only Catalyst-authored CRs (the original hollow-chart
  rule remains in force for every other umbrella chart)

Live error this fixes (from otech.omani.works):
  no matches for kind "CompositeResourceDefinition" in version
  "apiextensions.crossplane.io/v1" -- ensure CRDs are installed first

Pattern: intra-chart CRD-ordering breaks -> split charts + Flux dependsOn.
Apply universally to similar cases going forward.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:05 +04:00
e3mrah
f658757962
fix(bp-crossplane): resolve CHART_DIR to absolute path in composition-validate.sh (#237)
CI invokes the script as `bash <script> "platform/crossplane/chart"` from
the repo root. The script then `cd`s into that relative path, which works,
but every later `"$CHART_DIR/<sub>"` reference (notably FIXTURE_DIR for
Case 6) inherits the now-stale relative prefix and resolves under the
wrong cwd. Fix: resolve CHART_DIR via `(cd ... && pwd)` to an absolute
path BEFORE the chdir.

Local repro before fix:

  $ bash platform/crossplane/chart/tests/composition-validate.sh \
        platform/crossplane/chart
  ...
  Case 6: every fixture XRC kind is matched by an XRD
  FAIL: fixtures dir platform/crossplane/chart/tests/fixtures missing

Local result after fix:

  $ bash platform/crossplane/chart/tests/composition-validate.sh \
        platform/crossplane/chart
  ...
  Case 6: every fixture XRC kind is matched by an XRD
    PASS
  All bp-crossplane Day-2 CRUD Composition gates green.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:36:07 +02:00
e3mrah
8592d20919
feat(bp-crossplane): 6 XRDs + Compositions for Day-2 CRUD (RegionClaim/ClusterClaim/NodePoolClaim/LoadBalancerClaim/PeeringClaim/NodeActionClaim) (#236)
Adds the 6 CompositeResourceDefinitions and matching Compositions that
back the catalyst-api Day-2 CRUD endpoints. catalyst-api writes XRCs of
these kinds; Crossplane materialises them into provider-hcloud (and a
small number of provider-kubernetes) managed resources. Per
docs/INVIOLABLE-PRINCIPLES.md #3, every cloud-side op flows through
provider-hcloud — never bespoke hcloud-go calls or shell-outs to the
hcloud CLI.

XRDs (canonical group: compose.openova.io/v1alpha1):

  - RegionClaim       → composes the Phase-0 quartet via provider-hcloud:
                        Network + NetworkSubnet + Firewall + Server (cp1)
                        + LoadBalancer + LoadBalancerNetwork +
                        LoadBalancerService×2 + LoadBalancerTarget. Mirrors
                        infra/hetzner/main.tf 1:1 so deletion of a
                        RegionClaim cascades the whole slice.
  - ClusterClaim      → composes a provider-kubernetes Object that
                        materialises a cluster-identity ConfigMap. The
                        catalyst-environment-controller reads the CM to
                        template per-server cloud-init.
  - NodePoolClaim     → composes up to 100 provider-hcloud Server
                        resources. UPDATE flow: patching replicas n→m
                        flips the per-index Required-policy gate so
                        Crossplane creates/deletes Server CRs.
  - LoadBalancerClaim → composes provider-hcloud LoadBalancer +
                        LoadBalancerNetwork + up to 50
                        LoadBalancerService entries (per listener) + up
                        to 50 LoadBalancerTarget entries. UPDATE: patch
                        listeners[]/targets[] → composite controller
                        adds/removes services/targets.
  - PeeringClaim      → composes 1 or 2 provider-hcloud Route resources
                        (bidirectional flag toggles the second one
                        through a Required-policy gate).
  - NodeActionClaim   → composes a provider-kubernetes Object that
                        creates a batch/v1 Job running kubectl
                        cordon/drain (k8s-side op, not a cloud op, per
                        the task spec). action=replace additionally
                        composes a provider-hcloud Server for the
                        replacement node.

UPDATE/DELETE summary:

  - UPDATE: every mutable schema field is patched onto the underlying
    managed resource; Crossplane's composite controller drives the diff
    and provider-hcloud reconciles to the new state.
  - DELETE: every composed resource has deletionPolicy: Delete, so a
    cascade delete of the composite tears down the whole resource graph
    in dependency-safe order (Crossplane retries until deps unblock).

New tests:
  - tests/composition-validate.sh — 7 gates: helm renders cleanly,
    exactly 6 XRDs, ≥ 6 Compositions, all 6 expected claim kinds
    present, every rendered doc is valid YAML, every fixture references
    a real XRD, and (when KUBECONFIG + Crossplane CRDs available)
    server-side dry-run for every fixture.
  - tests/fixtures/<kind>-sample.yaml — one XRC fixture per kind.

Version bump:
  - platform/crossplane/chart/Chart.yaml             1.1.1 → 1.1.2
  - platform/crossplane/blueprint.yaml               1.1.1 → 1.1.2
  - clusters/_template/bootstrap-kit/04-crossplane.yaml         → 1.1.2
  - clusters/otech.omani.works/bootstrap-kit/04-crossplane.yaml → 1.1.2

Hard rules respected:
  - provider-hcloud only for cloud ops (never hcloud-go, never CLI).
  - provider-kubernetes Object for k8s-side ops (never raw kubectl).
  - No bespoke kubectl manifests for cloud resources.
  - Frontend + catalyst-api Go code untouched (sibling-owned).
  - Target state, no MVP framing — all 6 Compositions ship.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 09:33:38 +02:00
e3mrah
c747fe2265
fix(bp-gitea): override postgresql to bitnamilegacy (Bitnami evacuated docker.io tags) (#231)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 08:27:49 +02:00
e3mrah
da87fb38c4
fix(bp-spire): disable ALL default-enabled clusterSPIFFEIDs (default+oidc+test-keys) (#230)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 08:13:41 +02:00
e3mrah
719c3bac35
fix(bp-spire): disable default ClusterSPIFFEID — CRD not observable in time on fresh install (#228)
Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 07:51:03 +02:00
e3mrah
1689ffcd1a
fix(bp-coraza,bp-syft-grype): add common library subchart to satisfy hollow-chart gate (#220)
Both charts are scratch (no upstream Helm chart published — Coraza
project + anchore/syft+grype CLIs ship containers only). The
blueprint-release.yaml hollow-chart gate (issue #181) rejects charts
with zero declared dependencies. Adding sigstore/common as a tiny
library subchart satisfies the gate; common is a library type so it
contributes zero runtime resources to either chart's rendered output.

The Catalyst-side templates (Deployment+Service for bp-coraza,
CronJob+PVC for bp-syft-grype) remain entirely in templates/ — the
library dep is purely a CI-gate mechanism, NOT a functional dependency.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 06:15:28 +02:00
e3mrah
3a57e287e5
feat(platform): security umbrellas (falco/kyverno/trivy/sigstore/syft-grype/reloader/coraza/litmus) (#216)
* feat(bp-falco): umbrella chart for security layer

Catalyst Blueprint umbrella chart for falco — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-kyverno): umbrella chart for security layer

Catalyst Blueprint umbrella chart for kyverno — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-trivy): umbrella chart for security layer

Catalyst Blueprint umbrella chart for trivy — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-sigstore): umbrella chart for security layer

Catalyst Blueprint umbrella chart for sigstore — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-syft-grype): umbrella chart for security layer

Catalyst Blueprint umbrella chart for syft-grype — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-reloader): umbrella chart for security layer

Catalyst Blueprint umbrella chart for reloader — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-coraza): umbrella chart for security layer

Catalyst Blueprint umbrella chart for coraza — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

* feat(bp-litmus): umbrella chart for security layer

Catalyst Blueprint umbrella chart for litmus — security/policy layer.

Pinned upstream + appVersion verified against the helm index on
2026-04-30. ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.
Solo-Sovereign defaults; per-Sovereign overlays bump to HA later.

Part of security-stack umbrellas batch 3.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 06:07:38 +02:00
e3mrah
75128781b3
feat(platform): observability stack umbrellas (grafana/loki/mimir/tempo/alloy/otel/langfuse/velero) (#214)
* feat(bp-grafana): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana — visualization layer of the
LGTM observability stack (Loki/Grafana/Tempo/Mimir).

Pinned to grafana/grafana 10.5.15 (appVersion 12.3.1) — current stable
on 2026-04-29. Solo-Sovereign defaults: 1 replica, 10Gi PVC,
ServiceMonitor disabled per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-loki): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Loki — log aggregation backend
of the LGTM stack. SingleBinary mode by default (solo-Sovereign min);
SimpleScalable/Distributed are values toggles.

Pinned to grafana/loki 7.0.0 (appVersion 3.6.7) on 2026-04-29.
Filesystem storage default; SeaweedFS S3 wiring is per-Sovereign overlay
when scaling out. All observability toggles default false per
BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-mimir): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Mimir — metrics storage tier of
the LGTM stack.

Pinned to grafana/mimir-distributed 6.0.6 (appVersion 3.0.4) on
2026-04-29. Solo-Sovereign defaults: every component scaled to 1
replica, zoneAwareReplication disabled, Kafka ingest-storage disabled.
Bundled MinIO kept enabled as a stop-gap so the chart renders;
SeaweedFS S3 wiring is per-Sovereign overlay. All metaMonitoring
toggles default false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-tempo): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Tempo — distributed tracing
backend of the LGTM stack. Single-binary mode by default
(solo-Sovereign min); microservice mode (tempo-distributed) is a chart
swap toggle.

Pinned to grafana/tempo 1.24.4 (appVersion 2.9.0) on 2026-04-29. Local
PVC storage default; SeaweedFS S3 wiring is per-Sovereign overlay.
Metrics generator disabled by default (depends on bp-mimir).
ServiceMonitor default false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-alloy): umbrella chart for observability stack

Catalyst Blueprint umbrella for Grafana Alloy — unified telemetry
collector for the LGTM stack (logs, metrics, traces; OTLP-native).

Pinned to grafana/alloy 1.8.0 (appVersion v1.16.0) on 2026-04-29.
DaemonSet controller default (one Alloy per node) so node + container
telemetry work out of the box. Empty Alloy config by default;
per-Sovereign overlays populate forwarders to bp-loki/bp-mimir/bp-tempo
once those reconcile. ServiceMonitor + ingress + CRDs default false per
BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-opentelemetry): umbrella chart for observability stack

Catalyst Blueprint umbrella for the OpenTelemetry Collector — vendor-
neutral telemetry collector. Sibling to bp-alloy; per-Sovereign overlays
choose one.

Pinned to open-telemetry/opentelemetry-collector 0.152.0 (appVersion
0.150.1) on 2026-04-29. Uses the contrib distribution
(otel/opentelemetry-collector-contrib:0.150.1) so Loki/Mimir/Tempo
exporters are bundled. Deployment mode default (1 replica); DaemonSet
+ StatefulSet are values toggles. All presets default false; ingress
+ ServiceMonitor + PodMonitor + PrometheusRule + NetworkPolicy default
false per BLUEPRINT-AUTHORING.md §11.2.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-langfuse): umbrella chart for observability stack

Catalyst Blueprint umbrella for Langfuse — LLM observability platform.
Complements bp-grafana (infrastructure metrics) with AI-specific
telemetry (traces, evaluations, prompts, cost attribution).

Pinned to langfuse/langfuse 1.5.28 (appVersion 3.171.0) on 2026-04-29.

Catalyst convention: ALL bundled Bitnami subcharts are disabled —
PostgreSQL via cnpg.io/Cluster (bp-cnpg), Redis via bp-valkey,
ClickHouse via bp-clickhouse, S3 via bp-seaweedfs. Per-Sovereign
overlays wire external endpoints + Secret references. Telemetry to
Langfuse Inc. defaulted false; signUpDisabled defaulted true.

Part of issue #204 observability-stack umbrellas batch.

* feat(bp-velero): umbrella chart for observability stack

Catalyst Blueprint umbrella for Velero — Kubernetes-native backup and
disaster recovery. Per platform/velero/README.md, ALL Velero output
goes to SeaweedFS (Catalyst's unified S3 encapsulation), which
transitions to a cloud archival backend on the cold tier.

Pinned to vmware-tanzu/velero 12.0.1 (appVersion 1.18.0) on 2026-04-29.
Bundled velero-plugin-for-aws:v1.14.0 init container so SeaweedFS S3 is
reachable. backupsEnabled/snapshotsEnabled defaulted false at this
layer (placeholders for backupStorageLocation); per-Sovereign overlays
flip on after wiring SeaweedFS endpoint + credentials. ServiceMonitor +
PodMonitor + PrometheusRule default false per BLUEPRINT-AUTHORING.md
§11.2.

Part of issue #204 observability-stack umbrellas batch.

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 22:11:04 +02:00
e3mrah
fa0e3a494b
fix(bp-keycloak): pin to current Bitnami tag (closes #191) (#198)
* fix(bp-keycloak): pin to current Bitnami Keycloak tag (closes #191)

Bitnami consolidated their tag scheme around 2025-09 (see
https://github.com/bitnami/charts/issues/30852). The chart was pinned to
upstream bitnami/keycloak Helm chart 24.7.1, whose default image tag
`bitnami/keycloak:26.2.4-debian-12-r0` now returns 404 in the Docker Hub
registry — installs hit ImagePullBackOff (verified on omantel).

Changes:
- Upstream Bitnami chart: 24.7.1 -> 25.2.0 (latest, appVersion 26.3.3)
- Override image.registry/image.repository for every Bitnami image used
  by the chart (keycloak app, keycloak-config-cli, postgresql,
  postgres-exporter, os-shell) to point at `bitnamilegacy/*`, where the
  historic debian-12 tags are preserved
- Replace deprecated `proxy: edge` with `proxyHeaders: "xforwarded"`
  (chart 25.x renamed the field; Catalyst fronts Keycloak with Cilium
  Gateway which sets X-Forwarded-* headers)
- bp-keycloak chart version: 1.1.1 -> 1.1.2

Verification (registry HEAD via Bearer token):
  bitnami/keycloak:26.2.4-debian-12-r0          -> 404 (broken pin)
  bitnami/keycloak:26.3.3-debian-12-r0          -> 404 (registry move)
  bitnamilegacy/keycloak:26.3.3-debian-12-r0    -> 200
  bitnamilegacy/keycloak-config-cli:6.4.0-...   -> 200
  bitnamilegacy/postgresql:17.6.0-debian-12-r0  -> 200
  bitnamilegacy/postgres-exporter:0.17.1-...    -> 200
  bitnamilegacy/os-shell:12-debian-12-r50       -> 200

`helm template platform/keycloak/chart` renders cleanly; rendered images
all resolve to bitnamilegacy/* tags listed above.

Long-term follow-up (not blocking): bitnamilegacy is explicitly marked
"no longer updated, may be removed in the future" — Catalyst should
either build its own Keycloak image or migrate to the Bitnami Secure
Image (BSI/Photon) catalog when chart support catches up. Tracked in
the bp-keycloak description block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bp-keycloak): bump blueprint.yaml version to match Chart.yaml 1.1.2

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:10:17 +02:00
e3mrah
bcd2e7980a
fix: hide CRD-emitting resources behind Capabilities gates (closes #190) (#200)
* fix(bp-external-dns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Wrap the Catalyst overlay's ServiceMonitor and ExternalSecret templates
in `.Capabilities.APIVersions.Has` checks so a cold install on a fresh
Sovereign — where bp-kube-prometheus-stack and bp-external-secrets have
not yet reconciled — no longer fails with `no matches for kind X in
version Y`. The values toggles (`externalDns.serviceMonitor.enabled`,
`externalDns.externalSecret.enabled`) remain — Capabilities is defense
in depth so an operator flipping the toggle on a Sovereign that hasn't
reached Phase 2 doesn't break the bp-external-dns reconcile.

Verified locally: `helm template` with toggles off renders 0 of these
resources; with toggles ON and `--api-versions monitoring.coreos.com/v1
--api-versions external-secrets.io/v1beta1` both render exactly once.

Bump version 1.1.0 → 1.1.2 to align with the Phase-1 architectural-fix
wave from issue #190.

* fix(bp-powerdns): hide CRD-emitting resources behind Capabilities gates (refs #190)

Three Catalyst overlay templates emit resources whose CRDs ship in OTHER
charts and were unconditionally rendered, causing a cold install of
bp-powerdns to fail with `no matches for kind X` on a Sovereign that
hasn't yet reconciled the upstream chart:

  - cnpg-cluster.yaml          → postgresql.cnpg.io/v1 Cluster
                                 (CRD ships in bp-cnpg)
  - api-ingress.yaml           → traefik.io/v1alpha1 Middleware
                                 (CRD ships with the Traefik controller;
                                  k3s ships it by default but a Sovereign
                                  overlay MAY disable Traefik in favour
                                  of cilium-only ingress)
  - crossplane-floatingip.yaml → compose.openova.io/v1alpha1 HetznerFloatingIP
                                 (CRD ships when the Catalyst Crossplane
                                  composition family lands — see GAP
                                  DISCLOSURE in that template)

Each is wrapped in `.Capabilities.APIVersions.Has "<group>/<version>"`.
The Traefik router-middleware annotation on the Ingress is similarly
gated so the auth posture cleanly moves to the Sovereign's chosen
ingress controller when Traefik is absent.

Verified locally: `helm template` with default values renders 0 of
these resources; with `--api-versions postgresql.cnpg.io/v1
--api-versions traefik.io/v1alpha1 --api-versions compose.openova.io/v1alpha1`
plus `--set crossplane.floatingIP.enabled=true`, all three render
exactly once. Existing tests/observability-toggle.sh still passes.

Bump version 1.1.1 → 1.1.2.

* fix(bp-powerdns): bump blueprint.yaml to match Chart.yaml 1.1.2 after Capabilities gate work

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-29 20:10:14 +02:00
e3mrah
1f5c76def1
fix(platform): sync blueprint.yaml versions with Chart.yaml (#199)
* feat(ui): Playwright cosmetic + step-flow regression guards

15 regression guards in products/catalyst/bootstrap/ui/e2e/cosmetic-
guards.spec.ts that fail HARD when each user-flagged defect class
returns:

  1.  card height drift from canonical 108px
  2.  reserved right padding eating description width
  3.  logo tile drift from per-brand LOGO_SURFACE
  4.  invisible glyph (white-on-white) via luminance proxy
  5.  wizard step order Org/Topology/Provider/Credentials/Components/
      Domain/Review
  6.  legacy "Choose Your Stack" / "Always Included" tab labels
  7.  Domain step reachable before Components
  8.  CPX32 not the recommended Hetzner SKU
  9.  per-region SKU dropdown shows wrong provider catalog
  10. provision page is .html (static) not SPA route
  11. legacy bubble/edge DAG SVG markup on provision page
  12. admin sidebar drift from canonical core/console (w-56 + 7 labels)
  13. AppDetail uses tablist instead of sectioned layout
  14. job rows navigate to /job/<id> instead of expand-in-place
  15. Phase 0 banners (Hetzner infra / Cluster bootstrap) on AdminPage

Each test prints a failure message naming the canonical reference,
the source-of-truth file, and the data-testid PR needed (if any) so
the implementing agent has a precise target. No .skip() — per
INVIOLABLE-PRINCIPLES #2, missing components fail loud.

CI: .github/workflows/cosmetic-guards.yaml runs the suite on every
PR that touches products/catalyst/bootstrap/ui/** or core/console/**.

Docs: docs/UI-REGRESSION-GUARDS.md maps each test to the user's
original complaint, the canonical reference, and the green/red
semantics (5 tests intentionally RED on main today — they stay red
until the companion-agent's UI work lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(platform): sync blueprint.yaml versions with Chart.yaml so manifest-validation passes

---------

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:07:55 +04:00
hatiyildiz
b0c1c07271 fix(bp-flux): align upstream flux2 version with cloud-init's flux install (no double-install destruction)
Live verified on omantel.omani.works (2026-04-29). bp-flux:1.1.1 shipped
the fluxcd-community `flux2` subchart at 2.13.0 (= upstream Flux
appVersion 2.3.0). Cloud-init pre-installed Flux core at v2.4.0 via
`https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml`.
helm-controller's reconcile of bp-flux ran `helm install` on top of the
running v2.4.0 Flux; the chart's v2.3.0 CRD update failed apiserver
admission with `status.storedVersions[0]: Invalid value: "v1": must
appear in spec.versions`; Helm rolled back; the rollback DELETED every
running Flux controller Deployment (helm-controller, source-controller,
kustomize-controller, image-automation-controller,
image-reflector-controller, notification-controller). The cluster lost
its GitOps engine — no further HelmRelease could progress, and the only
recovery was full `tofu destroy` + reprovision.

This is OPTION C of the architectural fix proposed in the incident
memo: version-align cloud-init's flux2 install with the bp-flux umbrella
chart's `flux2` subchart so a single upstream Flux release is installed
and helm-controller adopts it on first reconcile rather than reinstalls
on top with a different version.

Changes:

  * `infra/hetzner/cloudinit-control-plane.tftpl` — kept the install.yaml
    URL pinned at v2.4.0 (deliberate; this is the source of truth) and
    added the CRITICAL VERSION-PIN INVARIANT comment block documenting
    the failure mode.

  * `platform/flux/chart/Chart.yaml` — bumped `flux2` subchart dep from
    2.13.0 to 2.14.1. The community chart 2.14.1 carries appVersion
    2.4.0, matching cloud-init exactly. Bumped chart version
    1.1.1 -> 1.1.2.

  * `platform/flux/chart/values.yaml` — `catalystBlueprint.upstream
    .version` mirror of the dep pin moved from 2.13.0 to 2.14.1.

  * `clusters/_template/bootstrap-kit/03-flux.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/03-flux.yaml` — bumped
    bp-flux HelmRelease to 1.1.2 + added explicit
    `install.disableTakeOwnership: false`,
    `upgrade.disableTakeOwnership: false`, and
    `upgrade.preserveValues: true` so helm-controller adopts the
    cloud-init-installed Flux objects rather than rolling back on
    ownership conflict.

  * `products/catalyst/chart/Chart.yaml` — bumped bp-catalyst-platform
    umbrella 1.1.1 -> 1.1.2, with bp-flux dep bumped to 1.1.2.

  * `clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml` and
    `clusters/omantel.omani.works/bootstrap-kit/13-bp-catalyst-platform.yaml`
    — bumped HelmRelease to 1.1.2.

  * `platform/flux/chart/tests/version-pin-replay.sh` — NEW. Six-case
    catastrophic-failure replay test:
      Case 1: Chart.yaml declares the flux2 subchart with explicit version.
      Case 2: cloud-init pins flux2 install.yaml to an explicit v-tag.
      Case 3: chart's flux2 subchart appVersion equals cloud-init's
              pinned upstream version (the load-bearing invariant).
      Case 4: values.yaml metadata mirrors the Chart.yaml dep pin.
      Case 5: helm template renders cleanly + contains the four core
              Flux controllers.
      Case 6: replay test rejects a planted mismatched fake Chart.yaml
              (the gate's own self-test — proves the gate works).
    All six cases green locally; the new test joins the existing
    observability-toggle test in tests/.

  * `docs/RUNBOOK-PROVISIONING.md` — new section "bp-flux double-install
    — version-pin invariant" documenting the failure mode, the four
    pin-sites, the safe bump procedure, and the existing-Sovereign
    recovery path (full reprovision).

Existing Sovereigns running 1.1.1: no in-place recovery is possible
once the rollback has fired. Reprovision required against 1.1.2.

Per docs/INVIOLABLE-PRINCIPLES.md #3 (architecture as documented) +
#4 (never hardcode) — the version pins remain operator-bumpable via PR,
but BOTH cloud-init's URL AND the chart's subchart MUST move together
in the same PR; CI gate tests/version-pin-replay.sh enforces this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:38:17 +02:00
hatiyildiz
4265884d58 feat(bp-external-dns): umbrella chart + add to bootstrap-kit Kustomization
Convert platform/external-dns/chart/ from a metadata-only wrapper to a
proper Helm umbrella that pulls kubernetes-sigs/external-dns 1.15.2
(appVersion 0.15.1, k8s 1.31-validated) as a Helm subchart, mirroring
the bp-cilium / bp-cert-manager / bp-powerdns shape. Native PowerDNS
provider speaks the bp-powerdns REST API directly via the
EXTERNAL_DNS_PDNS_API_KEY env var sourced from the
powerdns-api-credentials Secret bp-powerdns renders.

Catalyst overlay templates added (default-off where applicable per the
observability-toggle rule for the bp-* family):
  - templates/networkpolicy.yaml      (default ON; egress to powerdns +
                                       cluster DNS + apiserver only)
  - templates/servicemonitor.yaml     (default OFF)
  - templates/externalsecret.yaml     (default OFF; Phase-2 OpenBao path)
  - templates/_helpers.tpl

Bootstrap-kit Kustomization gets a new 12-external-dns.yaml HelmRelease
referencing bp-external-dns:1.1.0 with dependsOn bp-cert-manager +
bp-powerdns, and the legacy 11-bp-catalyst-platform.yaml is renumbered
13- so the install ordering reads in canonical Phase-0 sequence. Mirrored
to clusters/omantel.omani.works/bootstrap-kit/ with the SOVEREIGN_FQDN
substitution applied.

bp-catalyst-platform Chart.yaml drops bp-external-dns from its
dependency block — install ordering for ExternalDNS is now owned by Flux
dependsOn at the Kustomization layer rather than this umbrella's Helm
dependency graph. Bumped 1.1.0 → 1.1.1 to reflect the dep removal, and
the bootstrap-kit HelmRelease references in both clusters bumped in
lockstep.

Wrapper chart version bumped 1.0.0 → 1.1.1 (umbrella shape).

Local gates pass:
  - helm dependency build (pulls external-dns-1.15.2.tgz)
  - helm lint (0 failures)
  - helm template smoke render (245 lines, 6 kinds rendered)
  - helm package + tar-tzf verifies external-dns subchart inside the
    packaged tgz (subchart-guard simulation passes)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 19:29:27 +02:00
e3mrah
31d5911221
Merge pull request #185 from openova-io/fix/bp-charts-observability-toggles-default-false
fix(bp-*): observability toggles default false (v1.1.1)
2026-04-29 21:26:48 +04:00
hatiyildiz
1ddd569789 fix(bp-*): observability toggles default false — break circular CRD dependency
Extends the v1.1.1 hardening that started with cilium / cert-manager /
crossplane to the remaining 8 bootstrap-kit + per-Sovereign Blueprints.
Every observability toggle in every Catalyst-curated Blueprint now ships
`false`/`null` by default; the operator opts in via a per-cluster values
overlay at clusters/<sovereign>/bootstrap-kit/* once
bp-kube-prometheus-stack reconciles.

Live failure mode that prompted this (omantel.omani.works 2026-04-29):
bp-cilium @ 1.1.0 defaulted hubble.relay/ui + prometheus.serviceMonitor
to true. The upstream Cilium 1.16.5 chart renders a
monitoring.coreos.com/v1 ServiceMonitor whose CRD ships with
kube-prometheus-stack — a tier-2 Application Blueprint that depends on
the bootstrap-kit (cilium first). Helm install fails on a fresh
Sovereign with "no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1 — ensure CRDs are installed first" and every
downstream HelmRelease reports `dep is not ready`. The earlier
trustCRDsExist=true mitigation only suppresses Helm's render-time gate;
the apiserver still rejects the resource at install-time.

Per-Blueprint changes:
- bp-cilium: hubble.relay.enabled, hubble.ui.enabled → false;
  hubble.metrics.enabled → null (this is the exact value that disables
  the upstream metrics ServiceMonitor template branch — verified by
  reading cilium 1.16.5's _hubble.tpl); hubble.metrics.serviceMonitor
  .enabled → false. tests/observability-toggle.sh extended with Case 4
  (default render produces no hubble-relay / hubble-ui Deployments).
- bp-flux: flux2.prometheus.podMonitor.create → false.
- bp-sealed-secrets: sealed-secrets.metrics.serviceMonitor.enabled
  → false (explicit lock; upstream already defaults false).
- bp-spire: spire.global.spire.recommendations.enabled +
  recommendations.prometheus → false.
- bp-nats-jetstream: nats.promExporter.enabled +
  promExporter.podMonitor.enabled → false.
- bp-openbao: openbao.injector.metrics.enabled +
  openbao.serviceMonitor.enabled → false.
- bp-keycloak: keycloak.metrics.enabled + metrics.serviceMonitor.enabled
  + metrics.prometheusRule.enabled → false.
- bp-gitea: gitea.gitea.metrics.* and gitea.postgresql.metrics.*
  serviceMonitor + prometheusRule → false.
- bp-powerdns: powerdns.serviceMonitor.enabled + powerdns.metrics.enabled
  → false (forward-compatibility guard; current upstream
  pschichtel/powerdns 0.10.0 has no ServiceMonitor template, but a future
  upstream bump cannot silently regress).

Each chart ships a tests/observability-toggle.sh that asserts the rule
in three cases (default off / explicit on opt-in / explicit off) — runs
under blueprint-release.yaml's chart-test gate (added bdeb0f54 + the
existing wiring) before helm push. A regression that re-introduces a
hardcoded enabled: true in any chart fails CI before the OCI artifact
is published.

Versioning:
- All 11 leaf charts bumped 1.1.0 → 1.1.1.
- products/catalyst/chart (bp-catalyst-platform umbrella) deps updated
  to 1.1.1 across the board.
- clusters/_template/bootstrap-kit/03-flux through 10-gitea bumped to
  1.1.1; clusters/omantel.omani.works/bootstrap-kit/* mirror.

docs/BLUEPRINT-AUTHORING.md §11.2 table extended to enumerate every
toggle disabled across all 11 Blueprints. References
docs/INVIOLABLE-PRINCIPLES.md #4.

GATES (all green):
- helm dep build resolves cleanly post-change for every chart whose
  upstream is published (umbrella waits on per-leaf publish).
- helm lint clean on all 11 leaves.
- helm template . default render produces zero monitoring.coreos.com
  references on every leaf (verified locally).
- tests/observability-toggle.sh PASS on all 11 leaves.

Live verification: with v1.1.1 published the omantel.omani.works
HelmRelease can roll forward without a manual values patch — Flux picks
up the new chart digest automatically (semver: 1.x in OCIRepository).

Refs: issue #182.
2026-04-29 19:23:52 +02:00
hatiyildiz
02b5b6c4c8 fix(bootstrap-kit): override cilium + cert-manager values to disable observability toggles
Live verified on omantel: bp-cilium and bp-cert-manager v1.1.0 fail Helm
install with 'no matches for kind ServiceMonitor in version
monitoring.coreos.com/v1'. Manual kubectl-patch of the live HelmRelease
worked but Flux's 15-min reconcile rolls back the patch because the
HelmRelease CR is owned by the kustomize-controller from git.

Override the values inline in the HelmRelease manifests so the patch is
durable across Flux reconciles. Same pattern as the in-flight observability-
toggle agent will apply to all 12 charts in the next chart bump (v1.1.1).
This is the manifest-level workaround that unblocks the running omantel
cluster TODAY without waiting for v1.1.1 publish.

Mirrors the patches into both clusters/_template/bootstrap-kit/ AND
clusters/omantel.omani.works/bootstrap-kit/ so future Sovereigns inherit.
2026-04-29 19:17:08 +02:00
hatiyildiz
b1638f51ea fix(bp-* tests): skip helm dep build when charts/ already vendored
Earlier rerun failure on the CI workflow (bp-cert-manager 25120060270):

  Error: no repository definition for https://charts.jetstack.io.
  Please add the missing repos via 'helm repo add'

Root cause: blueprint-release.yaml's earlier `helm dependency build`
step (line 181) successfully resolves the upstream chart and populates
chart/charts/ — but it does NOT `helm repo add` the upstream repo
first. Helm 3.20's `helm dep build` succeeds on the first call by
falling back to direct-URL fetch from Chart.yaml `dependencies[].repository`.
A SECOND `helm dep build` (run by the test script) hits a different
code path that requires the repo to be in the helm repo cache.

Fix: tests/observability-toggle.sh now skips `helm dep build` when
chart/charts/ is already populated (which is always the case in CI
since the workflow's own `helm dependency build` step ran first). Local
dev runs from a fresh checkout still resolve subcharts.

Refs #182

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:12:21 +02:00
hatiyildiz
d34facc040 fix(bp-*): observability toggles default false — break circular CRD dependency
bp-cilium@1.1.0 install fails on every fresh Sovereign with:

  no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  — ensure CRDs are installed first

Cascades to all 10 other bp-* HelmReleases ("dep is not ready") since
bp-cilium is the root of the bootstrap dep graph. Verified live on
omantel.omani.works 2026-04-29 (issue #182).

Root cause: platform/cilium/chart/values.yaml and
platform/cert-manager/chart/values.yaml hardcoded
`serviceMonitor.enabled: true`. The monitoring.coreos.com/v1 CRDs ship
with kube-prometheus-stack — an Application-tier Blueprint that itself
depends on the bootstrap-kit. Hardcoding `true` creates a circular CRD
ordering: bp-cilium wants the CRD bp-kube-prometheus-stack provides, but
bp-kube-prometheus-stack cannot install before bp-cilium.

The `trustCRDsExist=true` mitigation only suppresses Helm's render-time
gate; the apiserver still rejects the resource at install-time.

Violates INVIOLABLE-PRINCIPLES.md #4 (never hardcode): observability
toggles MUST be operator-tunable, not chart-level constants assuming an
observability tier exists.

This commit:

A. Defaults every observability toggle false in the affected wrappers:
   - platform/cilium/chart/values.yaml:
     cilium.prometheus.enabled: false
     cilium.prometheus.serviceMonitor.enabled: false
     (trustCRDsExist removed — no longer relevant)
   - platform/cert-manager/chart/values.yaml:
     cert-manager.prometheus.enabled: false
     cert-manager.prometheus.servicemonitor.enabled: false
   - platform/crossplane/chart/values.yaml:
     crossplane.metrics.enabled: false
     (uniformity rule — does not break install but holds the invariant)

B. Bumps affected wrapper charts 1.1.0 → 1.1.1:
   - bp-cilium, bp-cert-manager, bp-crossplane (leaves)
   - bp-catalyst-platform (umbrella; deps repinned to 1.1.1 for the 3)

C. Updates clusters/_template/bootstrap-kit/* and
   clusters/omantel.omani.works/bootstrap-kit/* HelmRelease versions to
   1.1.1 so the live Sovereign picks up the fix on Flux reconcile.

D. Adds platform/<name>/chart/tests/observability-toggle.sh under each
   affected chart. Each script asserts:
     - default render produces zero monitoring.coreos.com refs
     - opt-in render with --set <toggle>=true succeeds and produces a
       ServiceMonitor (proves the toggle is wired)
     - explicit-off render succeeds and produces zero refs
   Wired into .github/workflows/blueprint-release.yaml via a new
   "Run chart integration tests" step that executes every chart/tests/
   *.sh on every publish — a regression that re-introduces a hardcoded
   `true` fails the publish job before the OCI artifact is pushed.

E. Documents the rule in docs/BLUEPRINT-AUTHORING.md §11.2
   "Observability toggles must default false". References Principle #4
   and provides the canonical pattern (default off in wrapper values,
   opt-in via per-cluster overlay at clusters/<sovereign>/...).

Per-chart audit table (which toggle was hardcoded → new default):

| Chart            | Toggle                                                   | Was  | Now   |
|------------------|----------------------------------------------------------|------|-------|
| bp-cilium        | cilium.prometheus.enabled                                | true | false |
| bp-cilium        | cilium.prometheus.serviceMonitor.enabled                 | true | false |
| bp-cert-manager  | cert-manager.prometheus.enabled                          | true | false |
| bp-cert-manager  | cert-manager.prometheus.servicemonitor.enabled           | true | false |
| bp-crossplane    | crossplane.metrics.enabled                               | true | false |
| bp-flux          | (no observability hardcodes)                             | n/a  | n/a   |
| bp-sealed-secrets| (no observability hardcodes)                             | n/a  | n/a   |
| bp-spire         | (no observability hardcodes)                             | n/a  | n/a   |
| bp-nats-jetstream| (no observability hardcodes)                             | n/a  | n/a   |
| bp-openbao       | (no observability hardcodes)                             | n/a  | n/a   |
| bp-keycloak      | (no observability hardcodes)                             | n/a  | n/a   |
| bp-gitea         | (no observability hardcodes)                             | n/a  | n/a   |
| bp-powerdns      | (no observability hardcodes)                             | n/a  | n/a   |
| bp-catalyst-platform | (umbrella, no values overlay)                        | n/a  | n/a   |

Local gates green:
  helm dep build      ✓ all 3 affected charts
  helm lint           ✓ all 3
  helm template       ✓ all 3 — 0 monitoring.coreos.com refs in default
  tests/observability-toggle.sh  ✓ all 9 sub-cases pass

Closes the install path for bp-cilium 1.1.1 on a fresh Sovereign;
unblocks the full bp-* dep graph.

Refs: https://github.com/openova-io/openova/issues/182

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:08:09 +02:00
hatiyildiz
43aff20254 feat(bp-*): convert all 11 bootstrap-kit charts to umbrella charts depending on upstream
Each platform/<name>/chart/Chart.yaml now declares the canonical upstream
chart as a dependencies: entry. helm dependency build pulls the upstream
payload into the OCI artifact at publish time, so Flux helm install of
bp-<name>:1.1.0 actually installs the upstream Helm release alongside the
Catalyst-curated overlays (NetworkPolicy, ServiceMonitor, ClusterIssuer,
ExternalSecret) under templates/.

Pinned upstream chart versions per platform/<name>/blueprint.yaml:
- cilium                 1.16.5  https://helm.cilium.io
- cert-manager           v1.16.2 https://charts.jetstack.io
- flux                   2.4.0   https://fluxcd-community.github.io/helm-charts
- crossplane             1.17.x  https://charts.crossplane.io/stable
- sealed-secrets         2.16.x  https://bitnami-labs.github.io/sealed-secrets
- spire                  ...     https://spiffe.github.io/helm-charts-hardened
- nats-jetstream         ...     https://nats-io.github.io/k8s/helm/charts
- openbao                ...     https://openbao.github.io/openbao-helm
- keycloak               ...     https://charts.bitnami.com/bitnami
- gitea                  ...     https://dl.gitea.com/charts
- catalyst-platform      umbrella over the 10 leaf bp-* charts via
                         helm dependency

values.yaml in each chart adopts the umbrella convention: catalystBlueprint
metadata block (provenance + version) at top level, upstream subchart
values namespaced under the dependency name.

cert-manager specifically: clusterissuer-letsencrypt-dns01.yaml gets the
helm.sh/hook: post-install,post-upgrade annotation so it applies AFTER
cert-manager controllers are running and CRDs registered (the previous
hollow-chart shape ran the ClusterIssuer at install time when CRDs
didn't exist yet, which was the omantel cluster's exact failure mode).

Wrapper chart version bumped 1.0.0 → 1.1.0 across the board (umbrella
conversion is a meaningful structural revision). Cluster manifests in
clusters/_template/bootstrap-kit/ AND clusters/omantel.omani.works/
bootstrap-kit/ updated to reference 1.1.0.

The blueprint-release.yaml workflow's helm package step needs an
explicit helm dependency build before push so the upstream subchart
bytes ship inside the OCI artifact. That CI change is a follow-up
commit on this same branch (separate file scope).
2026-04-29 17:21:36 +02:00
hatiyildiz
67fdecb770 merge: remove k8gb (#171) 2026-04-29 08:51:21 +02:00
hatiyildiz
f5daac52af refactor(platform): remove k8gb — replaced by PowerDNS lua-records (#171)
PowerDNS lua-records (`ifurlup`, `pickclosest`, `ifportup`) cover everything
k8gb was doing — geo-aware response selection, health-checked failover,
weighted round-robin — at the authoritative DNS layer. Eliminates a
separate K8s controller, CRD set, and CoreDNS plugin from every Sovereign.

Changes:
- platform/k8gb/ deleted (Chart.yaml, values.yaml, blueprint.yaml never
  authored — only README existed)
- products/catalyst/bootstrap/ui/public/component-logos/k8gb.svg deleted
- componentGroups.ts: remove k8gb component (PowerDNS already there)
- componentLogos.tsx: drop logo_k8gb + k8gb map entry
- model.ts DEFAULT_COMPONENT_GROUPS spine: replace k8gb with powerdns
- StepInfrastructure.tsx: copy refers to PowerDNS lua-records, not k8gb
- provision.html: replace k8gb tile and edges with powerdns
- catalog.generated.ts regenerated (now includes bp-powerdns)
- docs sweep — every k8gb reference in PLATFORM-TECH-STACK, NAMING-
  CONVENTION, SOVEREIGN-PROVISIONING, SRE, ARCHITECTURE, GLOSSARY,
  COMPONENT-LOGOS, IMPLEMENTATION-STATUS, BUSINESS-STRATEGY,
  TECHNOLOGY-FORECAST, README, infra/hetzner/README, platform READMEs
  (cilium, external-dns, failover-controller, litmus, flux, opentofu)
  rewritten to point at PowerDNS lua-records / MULTI-REGION-DNS.md.
  Historical entries in VALIDATION-LOG.md preserved as audit trail.
- New docs/MULTI-REGION-DNS.md — canonical reference for the lua-record
  patterns (ifurlup all/pickclosest/pickfirst, ifportup, pickwhashed),
  Application Placement → lua-record selector mapping, when to add a
  second Sovereign region, operational checks.

Closes #171.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:51:09 +02:00
hatiyildiz
f4679e2748 fix(powerdns): enable gpgsql-dnssec for DNSSEC API (1.0.6)
Without `gpgsql-dnssec=yes` the gpgsql backend driver does not expose
the DNSSEC API surface — `PUT /zones/<zone>` with `dnssec:true` returns
422 "no DNSSEC-capable backends are loaded". This blocks pool-domain-
manager from enabling DNSSEC on every Sovereign child zone (mandatory
per docs/PLATFORM-POWERDNS.md).

Fix lands in additionalConfig so the directive is rendered alongside
`default-soa-edit-signed=INCEPTION-EPOCH` and `direct-dnskey=yes`. No
schema migration needed — the gpgsql 5.0.3 schema already includes the
cryptokeys table; the missing piece was just the backend feature flag.

Bumps Chart.yaml to 1.0.6. Verified: after this lands the PUT call
returns 204 and POST /cryptokeys mints a usable KSK.

Discovered while bringing up openova#168 (PDM per-Sovereign zones).
2026-04-29 08:42:18 +02:00
hatiyildiz
fa84cac438 fix(powerdns): plain ALTER TABLE in postInitSQL (avoid $$ escape battle, 1.0.5)
The DO block in 1.0.4 rendered with $$ collapsed to $ by the time it
reached CNPG's postInitApplicationSQL — "syntax error at or near $".
Both Helm template processing and the YAML scalar block were chewing on
the dollar signs.

Replaced with explicit ALTER TABLE statements (one per gpgsql table) +
GRANT — same end state, no PL/pgSQL quoting required. Verified at
runtime on contabo-mkt: powerdns Pod went CrashLoopBackOff →
Running 1/1 immediately after the manual ALTER ran by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:17:28 +02:00
hatiyildiz
214a3e1ada fix(powerdns): grant table ownership to pdns user in CNPG bootstrap (1.0.4)
Verified at runtime on Contabo-mkt: postInitApplicationSQL runs as the
postgres superuser, not the application owner, so the schema tables
created by the bootstrap block were owned by postgres. PowerDNS connects
as 'pdns' and got 'permission denied for table domains' on the first
SELECT against the zone cache.

Added a DO block at the end of the schema bootstrap that walks every
table in the public schema and ALTERs OWNER TO {{ .Values.postgres.cluster.owner }}
plus GRANT ALL PRIVILEGES ON SCHEMA public — same shape PDM uses (and
the contabo-mkt cluster verified the fix runtime: powerdns Pod went
from CrashLoopBackOff to 1/1 Ready immediately after the same DDL was
run by hand).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:14:12 +02:00
hatiyildiz
db20e9d42b fix(powerdns): dnsdist backend resolution + drop DnstapLogAction (1.0.3)
dnsdist 1.9.14 runtime errors:
  1. newServer{address='powerdns:5353'} → "Unable to convert presentation
     address" — dnsdist's address parser expects IP[:port], not a DNS
     name. Kubernetes auto-injects POWERDNS_SERVICE_HOST as an env var
     into every pod in the same namespace as the powerdns Service; using
     that gives us the ClusterIP at config-load time without needing an
     init container or runtime DNS resolution.
  2. DnstapLogAction(name, bool, fn) signature changed in 1.9 — the
     2nd parameter now expects a shared_ptr to a RemoteLoggerInterface,
     not a boolean. Rather than wire up a remote dnstap server (which
     adds a moving part for marginal observability gain), drop the line.
     Catalyst observability is the dnsdist /metrics endpoint surfaced
     to Prometheus + the k8s container log.

Bumped chart to 1.0.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:12:27 +02:00
hatiyildiz
20c0543806 fix(powerdns): correct dnsdist image tag + drop readOnlyRootFilesystem (1.0.2)
Two runtime issues caught during first contabo-mkt rollout:

1. dnsdist image tag was "1.9" (default) — that tag doesn't exist in
   docker.io/powerdns/dnsdist-19. The 1.9.x line publishes 1.9.0 .. 1.9.14
   (no rolling "1.9" alias). Pinned to 1.9.14 (current latest).

2. PowerDNS pod crash-looped on Errno 30 (Read-only file system:
   /etc/powerdns/pdns.d/0-api.conf.conf). The upstream pdns_server-startup
   script writes rendered config files to /etc/powerdns/pdns.d/ at
   container start, and the upstream template doesn't expose an emptyDir
   we could redirect that path to. Set readOnlyRootFilesystem=false with
   a verbose comment explaining why; the rest of the security context
   (runAsNonRoot, runAsUser=953, drop ALL caps) stays in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:06:39 +02:00
hatiyildiz
19d926bfeb fix(powerdns): avoid recursive include in dnsdist checksum, bump to 1.0.1
Helm flagged dnsdist.yaml's checksum/config annotation as a recursive
template self-reference (the file included itself). Replaced with a
hash of the rendered .Values.dnsdist.config (post-tpl), which is the
substantive content the annotation is supposed to track anyway.

Bumped Chart.yaml to 1.0.1 so the OCIRepository semver "1.x" picks
up the fix automatically on next reconcile. Blueprint API version stays
at 1.0.0 (Blueprint contract is unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 08:02:53 +02:00
hatiyildiz
0190c60520 feat(powerdns): bp-powerdns wrapper chart + per-Sovereign zone model (#167)
Introduces the bp-powerdns Catalyst Blueprint wrapper as the authoritative
DNS service for every Sovereign zone. Replaces k8gb in componentGroups.ts —
PowerDNS Lua records cover geo + health-checked failover natively, removing
the dedicated GSLB controller.

Wrapper chart (platform/powerdns/chart/):
  - Chart.yaml — bp-powerdns 1.0.0, depends on pschichtel/powerdns 0.10.0
    upstream (verified Artifact Hub publisher, tracks docker.io/powerdns/
    pdns-auth-50 at appVersion 5.0.3 — surveyed Artifact Hub, no official
    PowerDNS chart exists)
  - values.yaml — 3 replicas, gpgsql backend, DNSSEC ECDSAP256SHA256,
    lua-records ON, dnsdist 100 qps default per source IP, REST API at
    pdns.openova.io/api behind Traefik basicAuth
  - blueprint.yaml — Catalyst metadata, visibility=unlisted (mandatory
    infra), section pts-3-2-gitops-and-iac
  - templates/cnpg-cluster.yaml — separate `pdns-pg` Postgres (1 instance,
    5Gi, postgres-16) with PowerDNS auth-5.0.3 schema applied via
    postInitApplicationSQL
  - templates/dnsdist.yaml — companion Deployment + ConfigMap with
    rate-limiting policy (MaxQPSIPRule per source IP)
  - templates/api-ingress.yaml — Traefik Ingress + basicAuth Middleware
  - templates/anycast-endpoint.yaml — placeholder Service of type
    LoadBalancer (Phase-0 stand-in for the anycast Floating IP target state)
  - templates/crossplane-floatingip.yaml — DISCLOSED GAP: target-state
    XHetznerFloatingIP composite, disabled by default until the
    Crossplane composition is authored (the existing compositions cover
    Server/Network/Firewall/LoadBalancer/PoolAllocation only). The
    placeholder anycast Service is the operational stand-in.

Per docs/INVIOLABLE-PRINCIPLES.md:
  - #4 (never hardcode): every value flows from values.yaml or a
    referenced K8s Secret. Image tags come from upstream chart appVersion,
    never duplicated.
  - #8 (disclose every divergence): the XHetznerFloatingIP gap is
    documented in the template + in docs/PLATFORM-POWERDNS.md ("Anycast
    deferral" section).

componentGroups.ts: powerdns added to SPINE group as mandatory (depends on
cnpg). external-dns now lists powerdns as a dependency. k8gb removed.

docs/PLATFORM-POWERDNS.md: per-Sovereign zone model, DNSSEC posture, REST
API contract, lua-records GSLB pattern, dnsdist policy, anycast deferral
runbook, first-deploy procedure for Contabo-mkt.

Closes #167 (Phase 1 of public-repo work; Phase 4 cluster manifest lands
in openova-private feat/powerdns-deploy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 07:49:51 +02:00
hatiyildiz
31b03ce02a ci(pdm)+platform(crossplane): build workflow + XDynadotPoolAllocation composition (Phase 3+4 of #163)
CI workflow (.github/workflows/pool-domain-manager-build.yaml) mirrors
the marketplace-api / catalyst-api shape:

  - Triggers on push to core/pool-domain-manager/** + workflow_dispatch
  - Runs unit tests (reserved + dynadot — the integration suite needs a
    real Postgres which the workflow does not provide; full integration
    runs in test-bootstrap-api.yaml against an ephemeral CNPG)
  - Builds and pushes ghcr.io/openova-io/openova/pool-domain-manager:<sha>
  - Cosign-signs the image via Sigstore keyless OIDC (id-token: write)
  - Emits an SBOM attestation tied to the image digest
  - Manifest deployment is intentionally NOT in this workflow — PDM
    manifests live in the openova-private repo per the issue body, so
    the Flux Kustomization there picks up the new SHA via a follow-up
    private-repo commit (Phase 6 of #163)

Crossplane composition (platform/crossplane/compositions/xrd-pool-
allocation.yaml + composition-pool-allocation.yaml) wraps PDM as a
declarative Crossplane Resource:

  apiVersion: compose.openova.io/v1alpha1
  kind: XDynadotPoolAllocation
  spec:
    parameters:
      poolDomain:    omani.works
      subdomain:     omantel
      sovereignFQDN: omantel.omani.works
      loadBalancerIP: 1.2.3.4
      createdBy:     crossplane

The Composition uses provider-http (crossplane-contrib/provider-http) to
render the XR into a Reserve → Commit sequence of HTTP calls against
PDM's in-cluster service URL. Per docs/INVIOLABLE-PRINCIPLES.md #3 we use
provider-http rather than bespoke Go to keep the day-2 lifecycle
declarative. Operators who want to pre-allocate a name (e.g. reserve
'omantel.omani.works' for a Sovereign that hasn't been provisioned yet)
commit YAML to Git and Flux+Crossplane converge.

Refs: #163
2026-04-29 06:46:11 +02:00
hatiyildiz
8886eff708 Merge branch 'feat/group-g-dns-finish-v3'
Group G DNS finish (v3): #110 (Dynadot multi-domain table-driven tests),
#112 (catalyst-dns httptest-mocked Dynadot coverage), #113 (cert-manager
LE DNS-01 + HTTP-01 ClusterIssuer templates with operator runbook for
the cert-manager-dynadot-webhook gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:45:35 +02:00
hatiyildiz
97e942e0bc feat(cert-manager): #113 — Lets Encrypt DNS-01 + HTTP-01 ClusterIssuers
Adds platform/cert-manager/chart/templates/clusterissuer-letsencrypt-dns01.yaml
with two ClusterIssuers, both Catalyst-curated, rendered conditionally
from values.yaml:

- letsencrypt-dns01-prod (TARGET STATE, default disabled) — ACME DNS-01
  via the cert-manager webhook solver, pointing at a future
  `cert-manager-dynadot-webhook` Catalyst binary that will implement the
  webhook.acme.cert-manager.io/v1alpha1 contract against the existing
  internal/dynadot/ package. Shipping the issuer template ahead of the
  webhook so cluster overlays only need a values flip + secret ref —
  no template edits — once the webhook lands.

- letsencrypt-http01-prod (INTERIM, default enabled) — ACME HTTP-01
  via the cilium ingress class. Issues certs for the explicit hostnames
  (console, gitea, harbor, admin, api) but NOT for wildcards; the
  canonical *.<sub>.<domain> record needs DNS-01.

Header comment explains the gap: the Catalyst external-dns webhook
(products/catalyst/bootstrap/api/cmd/external-dns-dynadot-webhook/)
implements a DIFFERENT RPC contract (records.list/add/delete) than what
cert-manager DNS-01 expects (Present/CleanUp on ChallengeRequest CRD),
so it cannot be reused; a dedicated cmd/cert-manager-dynadot-webhook/
must be built. Operator runbook for cutover is in the file header.

values.yaml gains a `certManager.issuers.{email,acmeServer,dns01,http01}`
section so all knobs are runtime-configurable per
docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode); cluster overlays in
clusters/<sovereign>/ can flip dns01.enabled via the bp-catalyst-platform
umbrella's values without rebuilding the Blueprint OCI artifact.

blueprint.yaml gains a spec.outputs section advertising:
- issuerName: letsencrypt-http01-prod (default)
- wildcardIssuerName: letsencrypt-dns01-prod (target state)
- issuerKind: ClusterIssuer

so dependent Blueprints (cilium-gateway, harbor, gitea) can consume the
issuer name without hardcoding it.

Closes #113.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 19:44:56 +02:00
hatiyildiz
c07e0ad1ee feat(external-dns): #109 — author bp-external-dns leaf chart for OCI publish
The bp-catalyst-platform umbrella (issue #104) declares a dependency on
bp-external-dns:1.0.0 — but the chart didn't exist; only README + Dynadot
multi-domain policy lived under platform/external-dns/. Without this leaf
the umbrella's `helm dependency build` fails (verified in run 25068433765).

This commit authors the minimal target-state leaf:
- Chart.yaml: name=bp-external-dns, version=1.0.0
- values.yaml: catalystBlueprint.upstream metadata (external-dns 1.15.0
  from kubernetes-sigs/external-dns Helm repo) + Catalyst-curated values
  overlay (sources, txtOwnerId, ServiceMonitor, RBAC, resources)

Per BLUEPRINT-AUTHORING.md §3, leaf charts are pure values-overlay wrappers:
no templates dir, just Chart.yaml + values.yaml with the catalystBlueprint
metadata block read by the bootstrap-kit installer at helm-install time.

Per-Sovereign provider/zone/credential overrides are overlaid by the
Crossplane Composition that materializes the HelmRelease — keeping this
chart provider-agnostic (no hardcoded Cloudflare/Dynadot/Hetzner choice
per INVIOLABLE-PRINCIPLES.md §4).

After this lands, blueprint-release.yaml will publish
ghcr.io/openova-io/bp-external-dns:1.0.0 and the next umbrella push will
resolve all 11 leaf deps successfully.
2026-04-28 19:42:23 +02:00
hatiyildiz
f0fe3006ba feat(external-dns): #109 — Catalyst-curated dynadot-multi-domain policy
Adds platform/external-dns/policies/dynadot-multi-domain.yaml — the
canonical external-dns + dynadot webhook deployment that ships in every
Sovereign on an OpenOva pool domain.

Why a webhook: external-dns has no upstream Dynadot provider; the
canonical pattern is the webhook RPC contract, with a sidecar that
implements the provider in our preferred language. We reuse the same
internal/dynadot/ package the catalyst-api uses, so the never-wipe rule,
record encoding, and managed-domain allowlist are identical on both
write paths (per docs/INVIOLABLE-PRINCIPLES.md #2 — no duplicate
implementations of the same concern).

Multi-domain:
- One --domain-filter per zone in the external-dns args; adding a third
  pool domain (e.g. acme.io) is a one-line edit here PLUS a one-key edit
  on dynadot-api-credentials' `domains` field. No webhook rebuild.
- Webhook reads DYNADOT_MANAGED_DOMAINS from the same secret with
  optional=true, preserving backward compatibility with the legacy
  single-`domain` secret shape (pre-#108).

TXT registry:
- --txt-owner-id=$(SOVEREIGN_FQDN), --txt-prefix=_externaldns.<sub>.
- Cluster overlays substitute SOVEREIGN_FQDN via the bp-catalyst-platform
  umbrella so two clusters sharing a parent zone (alpha.omani.works,
  beta.omani.works) cannot collide.

Closes #109.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:45:53 +02:00
hatiyildiz
046e5ebc18 feat(day2-iac): Crossplane Compositions + per-Sovereign Flux cluster tree + catalyst-dns binary
Group F deliverables — completes the day-2 IaC layer that takes over after OpenTofu's Phase 0 hand-off (per docs/SOVEREIGN-PROVISIONING.md §4).

Three artifacts:

1. platform/crossplane/compositions/ — XRDs + Compositions for canonical Hetzner resources
   under the canonical compose.openova.io/v1alpha1 group (per BLUEPRINT-AUTHORING.md §8):
   - XHetznerNetwork + composition-network.yaml — wraps hcloud_network + subnet
   - XHetznerFirewall + composition-firewall.yaml
   - XHetznerServer + composition-server.yaml
   - XHetznerLoadBalancer + composition-loadbalancer.yaml (lb11, 80→31080, 443→31443)
   - README documenting the canonical pattern

2. clusters/_template/ — the canonical per-Sovereign Flux Kustomization tree.
   Copied to clusters/<sovereign-fqdn>/ at provisioning time; cloud-init's
   GitRepository points at the result.
   - kustomization.yaml (root: flux-system + infrastructure + bootstrap-kit)
   - flux-system/ (placeholder for Flux self-config customization)
   - infrastructure/ (provider-hcloud + ProviderConfig referencing hcloud-credentials secret OpenTofu writes)
   - bootstrap-kit/ — 11 HelmRelease manifests in dependency order:
     01-cilium → 02-cert-manager → 03-flux → 04-crossplane → 05-sealed-secrets
     → 06-spire → 07-nats-jetstream → 08-openbao → 09-keycloak → 10-gitea → 11-bp-catalyst-platform
     Each pulls from oci://ghcr.io/openova-io/bp-<name>:1.0.0 — the wrapper charts published by blueprint-release CI.
     dependsOn declarations enforce the canonical install order at runtime.

3. clusters/omantel.omani.works/ — the first concrete Sovereign instance.
   Mirror of _template with SOVEREIGN_FQDN_PLACEHOLDER substituted to omantel.omani.works.
   This is what the wizard's first omantel.omani.works run will actually reconcile.

4. products/catalyst/bootstrap/api/cmd/catalyst-dns/main.go — small Go binary the
   OpenTofu module's null_resource.dns_pool invokes via local-exec at Phase-0 apply time.
   Reads DYNADOT_API_KEY/SECRET/DOMAIN/SUBDOMAIN/LB_IP env vars; calls existing dynadot.Client.AddSovereignRecords. Containerfile already builds + ships it at /usr/local/bin/catalyst-dns.

Architectural compliance (Lesson #24 closed):
- No bespoke Go cloud-API calls (Crossplane Compositions are the canonical day-2 IaC)
- No exec.Command("helm", ...) (Flux HelmReleases are the canonical install unit)
- No kubectl apply from outside (cloud-init kubectl-applies one Flux GitRepository, then Flux owns everything)

After this commit, the path is end-to-end: wizard → catalyst-api → tofu apply (with infra/hetzner/) → cloud-init installs k3s + Flux + applies GitRepository pointing at clusters/omantel.omani.works/ → Flux reconciles bootstrap-kit (11 HelmReleases in dependency order) → Crossplane adopts day-2 management.
2026-04-28 14:09:29 +02:00
hatiyildiz
62d9c7d936 fix(charts): drop dependencies block — wrappers carry values overlay only
The first 2 blueprint-release CI runs failed on `helm package` with containerd permission errors because the wrapper Chart.yaml's `dependencies:` block triggered helm to pull the upstream charts via OCI/containerd at package time, which the GitHub Actions runner blocks.

Architectural fix: each Catalyst Blueprint wrapper carries the values overlay + metadata only. The bootstrap installer reads the upstream chart reference from the wrapper's values.yaml `catalystBlueprint.upstream.{chart,version,repo}` metadata block, points `helm install` at the upstream chart's repo, and overlays our values.

This keeps:
- blueprint-release CI lightweight (no upstream pulls during package; helm package now works without containerd)
- the "bp-<name> wrapper does NOT drift from upstream" property (we ship the overlay, not a fork)
- the single Blueprint contract from BLUEPRINT-AUTHORING §1 (a wrapper is still a Catalyst-curated Helm chart published as bp-<name>:<semver>)

Changes:
- 11 platform/<name>/chart/Chart.yaml: removed dependencies block. Each is now a plain Helm chart with no remote pulls during package.
- 11 platform/<name>/chart/values.yaml: prepended catalystBlueprint.upstream.{chart,version,repo} metadata block at the top. Bootstrap installer parses it to know which upstream chart to install with these values.
- products/catalyst/bootstrap/api/internal/bootstrap/bootstrap.go: installCilium now does `helm repo add cilium https://helm.cilium.io --force-update` then `helm install cilium cilium/cilium --version 1.16.5 --values -` (the cilium/cilium upstream chart, with our overlay values piped from values.yaml). Same pattern needs propagating to the other 10 install functions in a follow-up.

After this commit, blueprint-release CI should green-build all 11 wrappers (helm package now works without containerd access since there's nothing to pull). The bootstrap installer's actual `helm install` calls in production reach upstream chart repos via the runtime k3s cluster's pod network, which has full network access.
2026-04-28 12:57:29 +02:00
hatiyildiz
441ebaebb8 fix(charts): pin upstream chart versions/names to ones that exist in their repos
The first Blueprint Release CI run (commit 8c0f766) failed because four chart wrappers referenced upstream chart versions/names that don't exist in their published repositories:

- platform/flux/chart: name was "flux", repo was OCI; actual is name "flux2" in plain helm repo at https://fluxcd-community.github.io/helm-charts. Pinned to 2.13.0.
- platform/openbao/chart: version 2.1.0 was the binary appVersion, not the chart version. Pinned to 0.16.0 chart (which packages openbao 2.1.0 internally).
- platform/keycloak/chart (Bitnami): chart version 25.0.6 was the appVersion of upstream; Bitnami's chart is at 24.7.1 packaging Keycloak 26.0.x. Pinned to 24.7.1.
- platform/nats-jetstream/chart: name was "nats-jetstream"; the upstream chart is named "nats" (it always was — JetStream is a feature of NATS, not a separate chart). Renamed.

Cilium, cert-manager, crossplane, sealed-secrets, spire wrappers were unaffected; their version pins matched upstream availability.

Containerd permission-denied errors from `helm package` on cilium/cert-manager/crossplane/gitea/sealed-secrets are a separate CI plumbing issue (helm tries to pull OCI base images during package build via containerd, but the GitHub Actions runner blocks containerd socket access). Tracked as a follow-up: switch to `helm package --skip-refresh` or use a runner with containerd permissions.

After this commit lands, the next blueprint-release CI run should green-build at minimum the 4 fixed charts. Successful builds publish bp-{flux,openbao,keycloak,nats-jetstream}:1.0.0 OCI artifacts to ghcr.io/openova-io/.
2026-04-28 12:55:21 +02:00
hatiyildiz
8c0f76640c feat(charts): G2 wrapper Helm charts for 11 bootstrap-kit components + blueprint-release CI
Per docs/PROVISIONING-PLAN.md and tickets [F] chart. Adds Catalyst-curated wrapper Helm charts at platform/<name>/chart/ for every component the bootstrap-kit installer (introduced in commit 07b4bcf) needs. Each chart is the canonical bp-<name> source per BLUEPRINT-AUTHORING.md §1's source-location rule.

11 charts created with Chart.yaml + values.yaml + blueprint.yaml each:

Network + GitOps:
- platform/cilium/chart — wraps cilium 1.16.5; kubeProxyReplacement, WireGuard mTLS, Hubble, Gateway API
- platform/flux/chart — wraps flux 2.4.0
- platform/crossplane/chart — wraps crossplane 1.18.0 + provider-hcloud manifest

Security:
- platform/cert-manager/chart — wraps cert-manager 1.16.2 with CRDs+ServiceMonitor
- platform/sealed-secrets/chart — wraps sealed-secrets 2.16.1 (transient bootstrap-only)
- platform/spire/chart — wraps spiffe/spire 1.10.4 (5-min SVID rotation)

Catalyst control-plane services:
- platform/nats-jetstream/chart — wraps nats 2.10.22 (3-node cluster, JetStream + KV)
- platform/openbao/chart — wraps openbao 2.1.0 (3-node Raft, region-local per SECURITY §5)
- platform/keycloak/chart — wraps keycloak 25.0.6 (Bitnami flavor, edge proxy mode)
- platform/gitea/chart — wraps gitea 10.5.0 (CNPG Postgres backend, no chart-bundled valkey/redis since Catalyst control plane uses JetStream)

New platform/ folders (added per AUDIT-PROCEDURE component-count anchor — was 53, now 55):
- platform/spire/README.md — workload identity Catalyst control plane component
- platform/nats-jetstream/README.md — control-plane event spine
- platform/sealed-secrets/README.md — transient bootstrap-only

Each blueprint.yaml declares:
- catalyst.openova.io/v1alpha1 Blueprint kind (canonical CRD per BLUEPRINT-AUTHORING §3)
- visibility: unlisted (mandatory infra, auto-installed by bootstrap kit, not a marketplace card)
- manifests.chart: ./chart pointer
- depends: [] (foundational components have no Blueprint dependencies; control-plane services depend on each other implicitly via bootstrap order, not via Blueprint depends)

.github/workflows/blueprint-release.yaml:
- New CI workflow per BLUEPRINT-AUTHORING §11 (path-matrix per Blueprint folder)
- Triggers on push to main touching platform/*/chart/** or products/*/chart/**
- detect job: emits matrix of changed Blueprint folders via git diff
- build job (per chart): helm dependency build → helm package → helm push to GHCR → cosign keyless sign (GitHub OIDC) → Syft SBOM attestation
- Output: ghcr.io/openova-io/bp-<name>:<semver> with SLSA-3-style supply-chain provenance

Closes [F] tickets: 11 G2 charts (cilium, cert-manager, flux, crossplane, sealed-secrets, spire, nats-jetstream, openbao, keycloak, gitea, plus the umbrella products/catalyst/chart already exists from Pass 105). blueprint.yaml CRDs added across 11 entries. CI fan-out workflow live.

After this commit lands, the bootstrap-kit installer in commit 07b4bcf has real OCI artifacts to install. The first push to main will trigger 10 build matrix jobs (cilium was created in a separate commit earlier in this session) which produce 10 cosigned bp-<name>:<semver> artifacts on GHCR.

Component-count anchor update follows: 53 → 55 (added spire + nats-jetstream + sealed-secrets — but sealed-secrets was already conceptually counted under "supporting services"). Per AUDIT-PROCEDURE the count needs updating in CLAUDE.md, BUSINESS-STRATEGY, TECHNOLOGY-FORECAST L11. Tracked as separate ticket [K] docs.
2026-04-28 12:51:06 +02:00
hatiyildiz
7cafa3c894 docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay
Component-level architectural correction (two changes):

1. MinIO → SeaweedFS as unified S3 encapsulation layer

The old design used MinIO for in-cluster S3 plus separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (seaweedfs.storage.svc:8333). SeaweedFS internally handles hot tier (in-cluster NVMe), warm tier (in-cluster bulk), and cold tier (transparent passthrough to cloud archival storage — Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). One audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore — Velero, CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, Harbor blob store, Application buckets all share one S3 surface.

2. Apache Guacamole added as Application Blueprint §4.5 Communication

Clientless browser-based RDP/VNC/SSH/kubectl-exec gateway. Keycloak SSO, full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN+native-client distribution for auditable remote access.

Component changes:
- DELETED: platform/minio/
- CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section)
- CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings)

Doc updates: PLATFORM-TECH-STACK §1+§3.5+§4.5+§5+§7.4; TECHNOLOGY-FORECAST L11+mandatory+a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5+§7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count.

Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero — now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric.

UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts — minio entry replaced with seaweedfs; velero+harbor deps updated; new guacamole entry added.

VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer.

Verification: zero remaining MinIO references in canonical docs (one intentional retention in TECHNOLOGY-FORECAST L37 explaining the swap); 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole.
2026-04-28 10:23:46 +02:00
hatiyildiz
b2173ae13c docs(pass-60): valkey REPLICAOF bash example carry-over; NAMING fourth-cycle stable
FIRST drift in the new cycle. 6-consecutive-clean streak (54-59) ends
at Pass 60. However, drift is Pass-35 carry-over, not new architectural
drift — same "incomplete in-file fix" pattern as Pass 31 (openbao
L108 vs L127).

platform/valkey/README.md L79 had:
  REPLICAOF primary-valkey.region1.svc.cluster.local 6379

Pass 35 fixed L147 (StatefulSet --replicaof argument) to canonical
valkey.<env>.<sovereign-domain> per NAMING §5.2 but the bash command
example at L79 retained the older non-canonical form.

Fixed L79 to valkey.<env>.<sovereign-domain> matching L147.

Methodology lesson #18: Pass-N sweep grep patterns can miss carry-over
drift that doesn't match the sweep's specific shape. Pass 35 grep
targeted <domain> placeholders; L79 used a fully-qualified hostname
with no placeholder, evading the sweep.

NAMING-CONVENTION fourth-cycle deep re-read confirmed stable across
§1-§11. §4.1 "hfrp" location-code example is for rtz cluster (vs hfmp
for mgt) — both valid for different cluster types, not drift. §11
already settled across Pass 37, 42, 50.

valkey README banner explicitly establishes "NOT a Catalyst
control-plane component" (Pass 26 framing) — exemplary canonical.

Convergence: Pass 54-59 = 6 consecutive cleans (nirvana approach met).
Pass 60 carry-over fix resets streak but architectural integrity holds.
The new cycle audit is doing its job — surfacing carry-over drift the
old cycle's specific-shape sweeps missed.
2026-04-28 01:28:00 +02:00
hatiyildiz
9c3d370107 docs(pass-51): flink Strimzi namespace drift; SECURITY clean
platform/flink/README.md L137 + L166 used strimzi-kafka-bootstrap.messaging.svc
but canonical Catalyst namespace per strimzi README (L100/146/181/191) and
debezium (L135) is `databases`. Same Helm-default-vs-Catalyst-convention drift
as Pass 41 minio (minio-system → storage). Pass 51 sweep confirmed no other
component uses "messaging" as a Catalyst namespace — only generic English
usage and K8s API group messaging.knative.dev/v1.

Fixed both instances to strimzi-kafka-bootstrap.databases.svc:9093. Port
9093 (TLS) kept — port choice (9092 vs 9093) is a separate architectural
question deferred.

SECURITY.md re-scan with all current methodology lessons:
- §1-§5: clean. Independent-Raft-per-region principle intact.
- §6 Keycloak topology: clean.
- §7 Rotation policy: SecretPolicy uses canonical catalyst.openova.io/v1alpha1.
- §8 Path of a secret: clean.
- §9 Compliance posture: borderline OpenSearch SIEM wording re-evaluated;
  acceptable in context.
- §10 Threat model: clean.

Methodology note: Helm-default-namespace drift now found across 3 instances
(Pass 41 minio, Pass 51 flink). Add cross-component namespace verification
to standard checks.

Drift found. Consecutive-clean count resets from 2 (49→50) to 0.
2026-04-28 00:31:25 +02:00
hatiyildiz
67aab8f6c1 docs(pass-48): crossplane OpenTofu/XRD group drift; PERSONAS clean
platform/crossplane/README.md had three real drift items:

1. §"Terraform vs Crossplane" — Catalyst's canonical bootstrap IaC is
   OpenTofu (PTS §3.2 + SOVEREIGN-PROVISIONING §3), not Terraform.
   Renamed section to "OpenTofu vs Crossplane", added intro paragraph
   clarifying the OSS-fork rationale, updated table rows + Decision.

2. XRD CompositeResourceDefinition example used name: xdatabases.openova.io
   and group: openova.io. Per BLUEPRINT-AUTHORING §8 (Pass 42 verified
   canonical), Crossplane XRDs use compose.openova.io group — separate
   from Catalyst CRDs (catalyst.openova.io). Fixed to
   xdatabases.compose.openova.io / group: compose.openova.io with inline
   pointer to BLUEPRINT-AUTHORING §8.

3. Composition compositeTypeRef.apiVersion was openova.io/v1alpha1, fixed
   to compose.openova.io/v1alpha1. Also corrected Composition metadata.name
   to database.hcloud.compose.openova.io for naming consistency.

Pass 1's API group unification was Catalyst-CRDs-only; Pass 42 verified
the separate Crossplane group; Pass 48 catches a downstream consequence
where the crossplane README defaulted to bare `openova.io` matching
neither canonical form.

PERSONAS-AND-JOURNEYS §1-§7 deep re-scan: clean. Pass 22, 33, 39 fixes
all intact. Three-pass-touched doc reads consistently. Stable.

Banner already correctly enforces "platform plumbing, never user-facing"
per ARCHITECTURE §7.4 / GLOSSARY.
2026-04-28 00:10:48 +02:00
hatiyildiz
2a1d6f5d3f docs(pass-41): SOVEREIGN-PROVISIONING §4 + minio namespace drift across 3 components
SOVEREIGN-PROVISIONING.md §4 (Phase 1 Hand-off) "self-sufficient" list
had 6 items vs PLATFORM-TECH-STACK §2.3's 6 control-plane supporting
services. List was missing SPIRE (5-min rotating SVIDs — critical to
SECURITY model) and observability (Grafana stack — Catalyst's
self-monitoring). Same drift category as Pass 40: summary list drifted
independently from canonical reference. Added both, plus enumerated the
§2.1+§2.2 services in the "Catalyst control plane" bullet.

Mid-pass sweep finding: kserve L217 used minio.minio-system.svc but
canonical minio README declares namespace: storage (L70). Three other
components also used minio-system: milvus L78, harbor L145. Fixed all
three to align with canonical `storage` namespace per PLATFORM-TECH-STACK
§3.5. Drift likely came from Helm-chart upstream defaults.

platform/kserve substantively clean apart from namespace fix.

Pass 41 lesson: union-equality check applies to ALL summary passages in
canonical docs. When a passage enumerates items derived from a canonical
source list, count both and verify equality.
2026-04-27 23:21:19 +02:00
hatiyildiz
5744307027 docs(pass-38): surviving "fuse" namespace in temporal; SECURITY + grafana clean
Acceptance greps with Pass 37's new literal-domain check and case-insensitive
banned-term sweep found one surviving instance: platform/temporal/README.md
L272 Worker Deployment had `namespace: fuse`. Pass 26 renamed fuse → fabric;
Pass 32+35 fixed temporal's image ref and DNS but the namespace YAML key
was missed (eye tracks surrounding structure, skims past `namespace:` value).
Renamed to `fabric`.

docs/SECURITY.md: clean (deep re-scan §6-§10 per Pass 23 lesson). All
sections consistent with canonical model and Pass 7's independent-Raft fix.
§9 OpenSearch SIEM wording acceptable as "default destination when SIEM
is enabled" rather than "default-installed component" — deferred for
optional tightening pass.

platform/grafana/README.md: clean. Banner, tiered storage, and OTel
instrumentation example all consistent with canonical conventions.

Lesson: case-insensitive banned-term grep is non-negotiable. Future
passes should always run \bfuse\b and similar legacy-product-name greps
regardless of surfaced category.
2026-04-27 22:59:17 +02:00
hatiyildiz
76e68e6182 docs(pass-36): flux deep-scrutiny + sweep gap-fill (Pass 35 head -10 cutoff)
Pass 35's sweep grep had `head -10` cutoff that produced a false-clean
signal. Pass 36 ran the same grep without truncation, finding 6 surviving
drift instances:

platform/flux/README.md (5 fixes):
- Mermaid diagram: Tenant[Tenant Repos] -> Organization[Organization Repos].
- GitRepository url gitea.<domain> -> gitea.<location-code>.<sovereign-domain>.
- Bootstrap command --url=https://gitea.<domain>/... -> canonical form.
- Key commands `flux reconcile kustomization tenants` -> `organizations`
  (Pass 34 was uppercase-only and missed lowercase plural).
- Gitea Actions example flux-webhook.<domain> -> location-code form.

platform/kyverno/README.md (1 fix):
- Mermaid subgraph "Tenant Workload" -> "Organization Workload"
  (the priority class names tenant-high/tenant-default remain — those
  are deployed K8s PriorityClass objects requiring recreate-not-rename
  per Pass 9's deferred-migration note).

Methodology lesson: convenience shortcuts in validation produce false-clean
signals. From Pass 37 forward: drift sweeps use full grep output (no
truncation) and case-insensitive banned-term searches.

Validation log Pass 36 entry includes detail on each preserved
"multi-tenant" generic adjective use that survived (acceptable feature
descriptions, not Catalyst entity references).
2026-04-27 22:49:05 +02:00
hatiyildiz
bc9b90d989 docs(pass-35): completion sweep for surviving DNS placeholders (8 components)
Started as gitea + relay atomic check. The gitea fix surfaced surviving
<domain> placeholders across 8 other component READMEs that prior sweeps
(Pass 29: canonical docs, Pass 32: image registries) hadn't covered.

Catalyst control-plane DNS fixes (-> {component}.<location-code>.<sovereign-domain>):
- gitea: GITEA_INSTANCE_URL.
- external-secrets: openbao ClusterSecretStore + gitea Flux GitRepository.

Application DNS fixes (-> {app}.<env>.<sovereign-domain>):
- temporal: had two drift items in one line — temporal.fuse.<domain>
  (old "fuse" product name + wrong placeholder shape). Pass 32 fixed
  the image ref on the same file but missed this. Now fully de-drifted.
- valkey: --replicaof valkey.region1.<domain> (non-canonical region1
  segment — Catalyst encodes regions in location-code).
- strimzi: kafka-kafka-bootstrap.region1.<domain>:9092 — same.
- cnpg: postgres.region1.<domain> cross-region replica host — same.
- stunner: STUN/TURN realm — kept canonical Application form for
  consistency even though STUN realms are nominally opaque.
- k8gb: Gslb ingress host app.gslb.<domain> -> app.gslb.<sovereign-domain>.
  Other illustrative k8gb refs (dnsZone, nslookup examples) preserved
  as they describe behavior generically.

products/relay/README.md: clean.

Preserved as correctly-generic: external-dns illustrative refs,
cert-manager <domain> (customer-supplied cert names), stalwart <domain>
(customer email-receiving domain).

Validation log Pass 35 entry: third end-to-end DNS sweep iteration
(29 -> 32 -> 35). Future passes should grep for bare <domain> early to
catch new instances introduced during edits.
2026-04-27 22:46:16 +02:00
hatiyildiz
70fea3ab8f docs(pass-34): banned-term TENANT sweep + keycloak hostname drift
GLOSSARY's banned term "tenant" survived in Configuration tables and Flux
postBuild substitutions across product READMEs as ${TENANT} (uppercase
ENV var). Prior banned-term greps searched lowercase `tenant` so the
ALL-CAPS form slipped through.

Product README fixes:
- products/cortex: TENANT/DOMAIN → ORGANIZATION/SOVEREIGN_DOMAIN, plus
  two DNS placeholder fixes for llm-gateway and chat URLs (same shape
  Pass 25/31 fixed elsewhere).
- products/fingate: 6 instances (Flux substitution, Configuration table,
  4 URL templates) renamed. URL shape api.openbanking.<org>.<sov-dom>
  flagged as 4-segment FQDN that doesn't match NAMING §5.1 or §5.2 —
  deferred to a deeper architectural pass.
- products/fabric: Configuration table row renamed.

Component README:
- platform/keycloak: shared-sovereign hostname auth.<sovereign-domain>
  and per-organization auth.<org>.<sovereign-domain> both missing
  <location-code> per NAMING §5.1. Fixed.

platform/librechat ${TENANT_ID} preserved — that's Microsoft Azure AD
tenant-ID (external technology, exempted by GLOSSARY).

Validation log Pass 34 entry includes meta-note: always run a global
grep for the surfaced drift category before closing a pass, to avoid
the asymmetric-drift problem Pass 25 warned against.
2026-04-27 22:42:50 +02:00
hatiyildiz
4043e1d51c docs(pass-32): registry-DNS sweep — harbor.<domain> across 9 component READMEs
Pass 25's deferred sweep, executed. Image refs of the form
harbor.<domain>/... (and one registry.<domain>/... in temporal) collapse
the location-code segment. Per NAMING §5.1, Catalyst per-host-cluster
Harbor DNS is harbor.{location-code}.{sovereign-domain} (e.g.
harbor.hfmp.openova.io).

Fixed (11 instances, 9 files):
- anthropic-adapter, bge (×2), debezium, harbor (×2 — ingress + Kyverno
  policy), knative (×2 — serving + traffic-split), llm-gateway, strimzi,
  trivy — all standardized to harbor.<location-code>.<sovereign-domain>.
- temporal had two drift items in one line: registry.<domain> (off-spec
  placeholder — Catalyst's only per-host-cluster registry is Harbor) AND
  legacy "fuse" namespace (renamed to bp-fabric per BUSINESS-STRATEGY
  §16.2 / Pass 26). Rewritten to fabric/order-worker.

Out of scope (deliberate): :latest tag hygiene, and whether Application
Blueprint READMEs should reference ghcr.io/openova-io/bp-<name>:<semver>
vs the Sovereign Harbor mirror. Stalwart customer-email-domain <domain>
placeholders preserved (correct semantics). external-dns illustrative
gslb/api/svc.<domain> preserved (upstream-doc generic).

With Pass 29 (canonical-doc DNS) + Pass 31 (carry-over fixes) + Pass 32
(image registry), the recurring DNS-placeholder collapse drift category
is addressed end-to-end.

Validation log Pass 32 entry added.
2026-04-27 22:36:39 +02:00
hatiyildiz
3993f5fc31 docs(pass-31): openbao + librechat DNS-placeholder carry-over fixes
platform/openbao/README.md ingress hosts (line 108) had `bao.<domain>` while
the same file's ClusterSecretStore example (line 127) used the canonical
`bao.<location-code>.<sovereign-domain>` form. Pass 7's active-active fix
addressed the body but missed the ingress placeholder. Aligned with the
canonical form.

platform/librechat/README.md OAuth callback (line 154) had
`chat.ai-hub.<domain>/oauth/openid/callback` — same Application-endpoint
shape Pass 25 fixed in llm-gateway. Pass 22 marked the file clean and Pass
29 fixed the Keycloak issuer line but didn't re-sweep. Per NAMING §5.2
Application endpoints are `{app}.{environment}.{sovereign-domain}`. Fixed.

docs/GLOSSARY.md verified clean — single-source-of-truth has held across
the loop (Pass 6/7/14/20/22/26/27 all consistent with current GLOSSARY).

Validation log Pass 31 entry includes meta-note: third file (librechat)
that needed re-opening after a "clean" mark — banner scans miss YAML-block
drift. Future passes should default to a full placeholder-shape grep on
every file touched.
2026-04-27 22:34:10 +02:00
hatiyildiz
4793cab8b6 docs(pass-29): DNS-placeholder sweep across canonical docs
The recurring drift: Catalyst control-plane DNS placeholders that omit the
<location-code> segment, producing forms like gitea.<sovereign>,
gitea.<sovereign>.<domain>, gitea.<sovereign-domain>, keycloak.<domain>.
Per NAMING §5.1 the canonical form is
{component}.{location-code}.{sovereign-domain} (e.g. gitea.hfmp.openova.io).
The shorter forms aren't just abbreviations — they collapse the multi-region
location dimension and re-drift every time a reader reads them as obvious
shorthand.

Fixes:
- CLAUDE.md "Customer Sync" — both gitea.<sovereign>/catalog/... lines.
- docs/SOVEREIGN-PROVISIONING.md §3 DNS-records bullet (3 lines) + §5
  Day-1 login line.
- docs/ARCHITECTURE.md §4 write-path Gitea label.
- docs/BLUEPRINT-AUTHORING.md §6.4 private-Blueprint Studio target.
- platform/librechat/README.md Keycloak issuer (Pass 22 marked clean and
  missed this — banner scans miss YAML-block drift).

platform/nemo-guardrails/README.md verified clean.

Final grep confirms only canonical forms remain. Validation log Pass 29
entry added with the recurring-drift-pattern note for future passes.
2026-04-27 22:30:41 +02:00
hatiyildiz
2c886daa52 docs(pass-25): llm-gateway DNS placeholders + IMPLEMENTATION-STATUS clean
platform/llm-gateway/README.md had three malformed DNS placeholders:
- KEYCLOAK_URL collapsed location-code + sovereign-domain into <domain> and
  used Application namespace `ai-hub` as a Keycloak realm name. Per NAMING §7
  and SECURITY §7, Keycloak realms are per-Org in SME-style or per-Sovereign
  in corporate-style — never per-Application-namespace. Fixed to
  `keycloak.<location-code>.<sovereign-domain>/realms/<org>`.
- ANTHROPIC_BASE_URL and `claude config set api_base` examples used
  `llm-gateway.ai-hub.<domain>/v1` — but NAMING §5.2 establishes
  Application endpoints as `{app}.{environment}.{sovereign-domain}`.
  Fixed to `llm-gateway.<env>.<sovereign-domain>/v1`.

docs/IMPLEMENTATION-STATUS.md confirmed clean: CRD list, surfaces, and
control-plane component list all match canonical docs.

Sweep concern logged for `harbor.<domain>` / `:latest` image patterns
appearing across many platform READMEs — to be addressed in a dedicated
sweep pass rather than asymmetrically here.

Validation log Pass 25 entry added.
2026-04-27 22:22:32 +02:00
hatiyildiz
5f028d1b6a docs(pass-20): SOVEREIGN-PROVISIONING placement YAML + Kyverno label drift
Pass 20 — drift-detection on SOVEREIGN-PROVISIONING + platform/kyverno.
Two real findings.

SOVEREIGN-PROVISIONING.md §8:
- "Existing Applications with `placement: active-active: false,
  single-region` do not migrate automatically" — invalid YAML
  mixing a boolean with an enum. The canonical placement model
  (per GLOSSARY) has `placement.mode: single-region | active-
  active | active-hotstandby`, no boolean toggle.
- Rewrote: "Existing Applications with `placement.mode: single-
  region` ... user explicitly switches Placement to active-active
  (or active-hotstandby) and adds the new region to
  placement.regions".

platform/kyverno/README.md:
- Policy V5 (minimum-replicas-production) targeted namespaces
  labeled `openova.io/env: production` — out-of-spec label name
  AND value. NAMING-CONVENTION §6 establishes `openova.io/env-type:
  prod` (hyphen-form, short value).
- Fixed to `openova.io/env-type: prod`.

Both findings show the same pattern: schema-level details that
survive grep-based banned-term checks but contradict the canonical
spec when read in body.

VALIDATION-LOG: Pass 20 entry added.

Refs #37
2026-04-27 22:06:24 +02:00
hatiyildiz
b467dc3f3b docs(pass-18): NAMING DR-as-env_type misexample + Keycloak deployment topology
Pass 18 — drift-detection on NAMING-CONVENTION + platform/keycloak.
Two real findings.

NAMING-CONVENTION §11.1:
- The example list of Catalyst Environments included `bankdhofar-dr`
  — but `dr` is NOT a valid env_type. Canonical values per §2.4 are
  prod / stg / uat / dev / poc. DR is a Placement mode
  (active-active / active-hotstandby across regions inside the
  *-prod Environment), not a separate Environment.
- Replaced `bankdhofar-dr` with `bankdhofar-uat` and added an
  explicit "DR is a Placement, not an Env Type" note.

platform/keycloak/README.md:
- Keycloak Deployment YAML example used `namespace: open-banking`
  with 2 replicas — Fingate-specific narrative that contradicted
  the per-Org / per-Sovereign topology stated in the banner.
  Rewrote with two side-by-side examples:
  * shared-sovereign (3 HA replicas, catalyst-keycloak namespace,
    CNPG-backed)
  * per-organization (1 replica in <org> namespace, optional
    embedded DB for smallest SME tier)
- HA section was a single set of claims (2+ replicas, CNPG, Infinispan)
  that only matched corporate. Now branches on topology — corporate
  gets HA + Infinispan, SME gets single replica with restart-on-
  deploy as acceptable for tier SLAs.

Same kind of drift Pass 17 caught in Harbor: banner says one thing,
body still describes the older model. Both fixed.

VALIDATION-LOG: Pass 18 entry added.

Refs #37
2026-04-27 22:00:42 +02:00
hatiyildiz
eff264b077 docs(pass-17): ARCHITECTURE OAM table pipe-fix + Harbor README de-drift
Pass 17 — drift-detection sweep on ARCHITECTURE + harbor. Two real
findings.

ARCHITECTURE §13 (OAM table):
- `| Trait | Blueprint overlay (`overlays/small|medium|large`) |`
  has pipe chars inside backticks inside a Markdown table cell —
  a known GFM rendering hazard. Replaced with comma-separated
  examples.

platform/harbor/README.md:
- The banner added in Pass 9 said "every host cluster runs a
  Harbor instance" but the body still described an older
  "Harbor Primary / Harbor Replica" cross-region replication
  topology. Same shape of architectural drift Pass 7 caught in
  OpenBao/ESO/Gitea/Flux — banner-add doesn't rewrite the body.
- Three sections rewritten:
  * Overview mermaid: now shows upstream-OCI → multiple
    independent per-cluster Harbors with local Trivy scan + local
    Pod pulls.
  * "Multi-Region Replication" → "Per-host-cluster mirroring (NOT
    primary-replica)". Single source of truth = upstream OCI
    (ghcr.io/openova-io/* for Catalyst+Blueprints, customer CI for
    application images), not a "primary Harbor".
  * Example replication policy: was a `dest_registry` cross-region
    push policy → now a pull-mirror policy from ghcr.io with
    scheduled-cron trigger.
- "Why Mandatory" table reframed in per-host-cluster terms.

VALIDATION-LOG: Pass 17 entry added with the specific drift-detection
lesson — banner-addition passes don't catch body-level drift; need
explicit body re-reads.

Refs #37
2026-04-27 21:58:53 +02:00
hatiyildiz
b6a374df26 docs(pass-15): final banner sweep — 52/52 platform components covered, convergence achieved
Pass 15 swept all 52 platform/*/README.md files for the role-in-
Catalyst banner. 3 still lacked one (cnpg, flux, strimzi) and got
banners added:

- cnpg (§4.1): production Postgres; underlying engine for FerretDB +
  Gitea metadata.
- flux (§3.2): per-vcluster Flux + host-level Flux for Catalyst
  itself; pulls from single per-Sovereign Gitea.
- strimzi (§4.1): Application-tier event streaming; NOT the Catalyst
  control-plane spine (which uses NATS JetStream). Same upstream-
  tech-different-tier disambiguation pattern as Valkey.

CONVERGENCE: 52 / 52 platform components have role-in-Catalyst
banners. All cross-refs resolve. No banned terms. No architectural
drift detected on this pass.

VALIDATION-LOG: Pass 15 entry + "Convergence achieved (initial
banner sweep)" marker added. The validation loop continues per
the standing instruction — but subsequent passes will be brief
drift-detection sweeps rather than systematic rewrites.

Refs #37
2026-04-27 21:53:27 +02:00
hatiyildiz
9b3211fdee docs(pass-14): banners on workflow / analytics / metering / chaos / valkey (7 components)
Seven more Application Blueprint banners landed:

- temporal (§4.3): durable workflow orchestration; bp-fabric.
- flink (§4.3): stream + batch processing; bp-fabric.
- debezium (§4.2): CDC into Strimzi/Kafka; bp-fabric pipeline source.
- iceberg (§4.4): open table format on MinIO + archival S3.
- openmeter (§4.8): API metering for bp-fingate.
- litmus (§4.9): chaos engineering required by DORA / NIS2.
- valkey (§4.1): banner explicitly states NOT a Catalyst control-
  plane component — control plane uses NATS JetStream KV per
  ARCHITECTURE §5 / GLOSSARY event-spine. Valkey is Application-tier
  caching only. This is the disambiguation that PLATFORM-TECH-STACK
  §1 establishes ("same upstream technology can serve in multiple
  categories") — pinned in the per-component README so it can't be
  misread.

VALIDATION-LOG: Pass 14 entry added.

Refs #37
2026-04-27 21:52:03 +02:00
hatiyildiz
b021aaa57e docs(pass-13): role-in-Catalyst banners on 4 Communication Application Blueprints
All 4 communication components (composing under bp-relay) got role-
in-Catalyst banners pointing at PLATFORM-TECH-STACK §4.5:

- stalwart: JMAP/IMAP/SMTP self-hosted email.
- livekit: WebRTC SFU for video/audio/data; pairs with STUNner.
- stunner: K8s-native TURN/STUN for WebRTC NAT traversal.
- matrix: Matrix protocol via Synapse server. Banner explicitly
  disambiguates "Synapse" as the chat-server implementation, NOT
  the deprecated OpenOva product noun (retired in favor of bp-axon).

All 4 are explicitly Application Blueprints, NOT Catalyst control
plane.

VALIDATION-LOG: Pass 13 entry added.

Refs #37
2026-04-27 21:50:05 +02:00
hatiyildiz
9d95043ccc docs(pass-12): role-in-Catalyst banners on 11 AI/ML Application Blueprints
All AI/ML component READMEs got banners pointing at PLATFORM-TECH-
STACK §4.6 (AI/ML) or §4.7 (AI safety + observability), and noting
composition under bp-cortex (composite AI Hub Blueprint):

- knative: serverless for KServe-managed inference.
- kserve: K8s-native model serving for vLLM, BGE, custom.
- vllm: default LLM inference runtime.
- milvus: vector database for RAG retrieval.
- neo4j: knowledge-graph-augmented retrieval alongside Milvus.
- librechat: default chat surface, fronts LLM Gateway via Guardrails.
- bge: embedding generation + reranking.
- llm-gateway: outbound LLM routing (Claude, GPT-4, vLLM, Axon).
- anthropic-adapter: OpenAI-SDK → Anthropic translation.
- nemo-guardrails: AI safety firewall.
- langfuse: LLM observability (latency, tokens, cost, eval).

All 11 are explicitly Application Blueprints — NOT Catalyst control
plane. Catalyst's own observability stack (Grafana/OTel) covers
infrastructure; LangFuse covers AI-specific dimensions
(prompt/response/eval).

VALIDATION-LOG: Pass 12 entry added.

Refs #37
2026-04-27 21:47:45 +02:00
hatiyildiz
e9514b410d docs(pass-11b): retry banners on failover-controller/trivy/clickhouse/ferretdb (Edit needed Read first) 2026-04-27 21:45:56 +02:00
hatiyildiz
ae540269c4 docs(pass-11): banners on 7 more components + MinIO ILM label disambiguation
7 more component READMEs got role-in-Catalyst banners:

Per-host-cluster infrastructure:
- minio (§3.5): S3 fast-tier; tiers cold to cloud archival.
- velero (§3.5): K8s backup to archival S3 (NOT MinIO — that's
  fast-tier; backups land in cloud archival).
- failover-controller (§3.6): lease-based split-brain protection
  layered on k8gb; pointers to SRE §2.4 (witness pattern) +
  SECURITY §5.2 (OpenBao DR promotion).
- trivy (§3.3): CI + registry + runtime scan chain.

Application Blueprints (NOT control plane):
- opensearch (§4.1): explicitly framed as Application Blueprint —
  installed when an Org wants SIEM / full-text search / log analytics.
- clickhouse (§4.1): used by bp-fabric and SIEM cold-storage tier.
- ferretdb (§4.1): replication piggybacks on underlying CNPG.

MinIO ILM disambiguation:
- The Mermaid diagram had `ILM[Lifecycle Manager]` — confusable with
  the rejected Catalyst sub-product (per banned-terms list).
  Relabeled to `ILM[Information Lifecycle Manager - MinIO ILM]` to
  make clear it's MinIO's own feature, not the deprecated Catalyst
  Lifecycle Manager noun.

VALIDATION-LOG: Pass 11 entry added.

Refs #37
2026-04-27 21:45:28 +02:00
hatiyildiz
5834daec14 docs(pass-10): banners on 7 more components + opentofu active-active drift fix
7 more component READMEs got role-in-Catalyst banners:

- vpa, keda, reloader → per-host-cluster scaling/ops layer (§3.4).
  Reloader specifically calls out its role in Catalyst's secret-
  rotation flow (rolling deploy on K8s Secret hash change).
- external-dns → per-host-cluster DNS-sync (§3.1); pairs with k8gb
  for the GSLB zone separation.
- coraza → DMZ-block WAF on every host cluster (§3.1).
- crossplane → per-Sovereign on the management cluster (§3.2);
  banner explicitly emphasizes the agreed "never a user-facing
  surface" rule (Users don't write Compositions in Application
  configs; Blueprint authors and advanced contributors do). Cross-
  references the no-fourth-surface clause in ARCHITECTURE §4/§7
  and the Crossplane Composition section in BLUEPRINT-AUTHORING §8.
- opentofu → repositioned as Phase-0-only, runs on `catalyst-
  provisioner` only, NOT installed on host clusters at runtime.

opentofu drift fixes (uncovered by line-by-line read):
- Section 5 line 182: "Bootstrap Wizard prompts for cloud credentials"
  → "Catalyst Bootstrap (Phase 0) prompts for cloud credentials"
  (banned term).
- Same section line 186: "ESO PushSecrets sync to both regional
  OpenBao instances" — the active-active drift Pass 7 corrected
  elsewhere, still here. Replaced with "writes go to the primary
  OpenBao region only; replicas pick up via async perf replication".

VALIDATION-LOG: Pass 10 entry added.

Refs #37
2026-04-27 21:43:45 +02:00
hatiyildiz
a52bda30cb docs(pass-9b): retry banners on harbor / falco / sigstore / syft-grype
Pass 9's commit ea81c38 only landed banners on grafana + kyverno —
the harbor / falco / sigstore / syft-grype edits failed because the
Edit tool requires a Read pass per file before write. Now Read'd
and applied:

- harbor: per-host-cluster registry, pointer to PLATFORM-TECH-STACK §3.5.
- falco: per-host-cluster runtime security, pointer to §3.3 + SRE §10
  (SIEM/SOAR pipeline).
- sigstore: cosign signing chain on every Blueprint OCI artifact,
  Kyverno admission verifies signatures.
- syft-grype: CI-side SBOM + runtime CVE matching.

Pass 9 now complete.

Refs #37
2026-04-27 21:41:22 +02:00
hatiyildiz
ea81c38e15 docs(pass-9): role-in-Catalyst banners on grafana / harbor / falco / kyverno / sigstore / syft-grype
Pass 9 — six more component READMEs got Catalyst-role banners
matching the rule of thumb in CLAUDE.md (every platform/<x>/README.md
should state its role in Catalyst).

- grafana: observability stack on every host cluster; Catalyst's
  own self-monitoring + Application telemetry flows here.
- harbor: per-host-cluster container registry for Catalyst images,
  mirrored Blueprint OCI artifacts, customer images.
- falco: runtime security on every host cluster; feeds SIEM/SOAR.
- kyverno: policy engine on every host cluster; enforces Catalyst
  policy contracts (cosign on Blueprints, default-deny NetworkPolicies
  on Organization namespaces, priority-class injection).
- sigstore: cosign-signed Blueprint OCI artifacts + admission
  verification chain on every host cluster.
- syft-grype: SBOM generation in CI per Blueprint + runtime CVE scans.

Plus Kyverno priority-class clarification: prose around `tenant-high`
/ `tenant-default` / `tenant-batch` priority class names now reads
"Organization workloads" instead of "tenant workloads", with an
explicit note that the priority class artifact names themselves stay
as-is until a separate migration ticket renames them in deployed
clusters (renaming PriorityClass objects requires recreate, not
in-place rename).

VALIDATION-LOG: Pass 9 entry added.

Refs #37
2026-04-27 21:40:51 +02:00
hatiyildiz
14ed84de41 docs(pass-8): role-in-Catalyst banners + dead-link fix in component READMEs
Pass 8 — line-by-line read of platform/cnpg, platform/strimzi,
platform/k8gb, platform/keycloak, platform/cert-manager, platform/cilium.

CNPG and Strimzi: read in full and confirmed clean — they correctly
position themselves as Application Blueprints and don't drift from
the canonical model. CNPG's `<org>-postgres-dr` cluster name
(Application-tier database role) is acceptable per NAMING-CONVENTION
§1.3 (which only forbids primary/dr in K8s host-cluster names, not
in Application-internal CRD names).

Four READMEs updated:

k8gb:
- Header reframed: per-host-cluster infrastructure pointer to
  PLATFORM-TECH-STACK §3.1 and SRE §2.4 split-brain protection.
- Removed dead link to ../failover-controller/docs/ADR-FAILOVER-
  CONTROLLER.md (the failover-controller folder has no docs/);
  replaced with link to that component's README + SRE §2.4.

keycloak:
- Header reframed from "FAPI Authorization Server for Open Banking"
  (narrow) to "User identity for Catalyst Sovereigns" (broad).
  Keycloak handles ALL user identity in Catalyst, not just FAPI.
- Added per-Org / per-Sovereign topology callout matching SECURITY
  §6. Clarified that "Multi-tenant TPP" refers to PSD2 Third Party
  Providers, not Catalyst's Organization-level multi-tenancy.
- FAPI features kept since Keycloak still serves Fingate as the
  FAPI Authorization Server.

cert-manager:
- Header reframed as per-host-cluster infrastructure with pointer
  to PLATFORM-TECH-STACK §3.3.

cilium:
- Header reframed as per-host-cluster infrastructure with pointer
  to PLATFORM-TECH-STACK §3.1, including the install-first note
  (CNI must come before any other workload during Phase 0).

VALIDATION-LOG: Pass 8 entry added.

Refs #37
2026-04-27 21:39:03 +02:00
hatiyildiz
a5ffa1a716 docs(pass-7): align Gitea + Flux multi-region story; fix broken mermaid id
Continuing Pass 7 cleanup after the OpenBao/ESO rewrite (42aeb62).

Gitea README:
- Was describing "Bidirectional mirroring for multi-region" with two
  Gitea instances mirroring repos cross-region. Wrong: Catalyst's
  agreed model has one Gitea per Sovereign on the management cluster
  (PLATFORM-TECH-STACK §2.3). Replaced the multi-region mirror
  diagram with a single-Gitea + intra-cluster HA topology and added
  a "Why not cross-region bidirectional mirror" explainer (write-
  conflict semantics would break EnvironmentPolicy enforcement).
- Status banner: notes the canonical references.
- Backup section: removed "Repository mirror for redundancy"
  (replaced with Velero scheduled backups).

Flux README:
- "Multi-Region GitOps" section was showing one Gitea per region
  with bidirectional mirror. Replaced with one Gitea per Sovereign
  topology. Per-vcluster Flux pulls from this single Gitea.

Mermaid syntax bug:
- Earlier mass replace_all of "Catalyst IDP" → "Catalyst console"
  had left an invalid mermaid node identifier
  `Catalyst console[Catalyst console]` (mermaid forbids spaces in
  node IDs). Fixed to `Console[Catalyst console]`. Would have
  rendered as a broken diagram on GitHub.

VALIDATION-LOG: Pass 7 entry added documenting the OpenBao/ESO
active-active rewrite (the most consequential drift fix in any pass).

Refs #37
2026-04-27 21:36:20 +02:00
hatiyildiz
42aeb629bb docs(pass-7): rewrite OpenBao + ESO READMEs to match agreed multi-region semantics
Pass 7 — line-by-line read of platform/openbao/README.md and
platform/external-secrets/README.md found a major architectural drift:
both files described an OLD active-active bidirectional sync model
that contradicts docs/SECURITY.md §5 (the canonical reference).

The active-active design was rejected during the architecture session
because it would have been a stretched cluster — a single region's
network blip would block writes everywhere. The agreed model is:

- Independent Raft cluster per region (intra-region quorum only).
- Single-primary writes; replicas accept reads only.
- Async Performance Replication primary → replicas (lag <1s typical).
- Explicit DR promotion (sovereign-admin or failover-controller).

Fixes:

platform/openbao/README.md:
- Overview: removed "active-active deployments" / "either region can
  update secrets". Replaced with "independent Raft cluster per region",
  "asynchronous Performance Replication".
- Architecture diagram: replaced bidirectional-push diagram with the
  primary→replicas async perf replication topology that matches
  SECURITY.md §5.
- ClusterSecretStores: simplified from "two stores (local+remote)" to
  "one local store"; reads always pull locally.
- Renamed "PushSecret (Bidirectional)" → "Writes go to the primary
  region" with a single-target PushSecret pointing at bao-primary.
- Added DR promotion section pointing at SECURITY.md §5.2.
- Status banner: notes that the canonical multi-region reference is
  SECURITY.md.

platform/external-secrets/README.md:
- Header line: repositioned as per-host-cluster infrastructure with
  pointer to PLATFORM-TECH-STACK §3.3.
- Removed broken link to non-existent ../openbao/docs/ADR-OPENBAO.md
  (replaced with link to ../openbao/README.md).
- "Multi-region sync | Push to both OpenBao instances simultaneously"
  → "Multi-region reads | Async perf replication".
- "PushSecret to Multiple OpenBao Instances" example was writing to
  two ClusterSecretStores in parallel — replaced with single-target
  primary write.
- "Multi-region sync via single PushSecret" in Consequences →
  "Cross-region availability via Performance Replication".
- Mermaid sequence diagram: "Bootstrap Wizard" actor → "Catalyst
  Bootstrap (Phase 0)"; "Terraform" → "OpenTofu"; ESO connection
  description "via K8s auth" → "via SPIFFE SVID (workload identity)".

These were the most consequential drift fixes found in any pass —
two READMEs were documenting an architecture explicitly rejected by
the agreed model.

Refs #37
2026-04-27 21:34:09 +02:00
hatiyildiz
d6a51b8a7a docs(pass-2): final entity-noun sweep — external-secrets sequence diagram
Pass 2 — fresh-eyes sweep across the entire docs tree. One residual
entity-noun usage found:

- platform/external-secrets/README.md:75 (in a Mermaid sequence
  diagram): "Note over Wizard: Operator saves unseal keys offline"
  — "Operator" used as person/entity. Renamed to "sovereign-admin"
  to match the role from GLOSSARY.md.

All other banned-term sweeps clean:
- No tenant (architectural) anywhere.
- No Catalyst IDP anywhere.
- No Synapse-as-product anywhere (only the legitimate
  "Matrix/Synapse server" usages).
- No workspace-controller (only the banned-term entries that define
  the rename).
- No capital-W Workspace as Catalyst scope.
- No github.com/openova (without -io).
- All cross-doc Markdown links resolve.
- All §X references resolve to the new section numbering after
  PLATFORM-TECH-STACK reorg.
- API group catalyst.openova.io/v1alpha1 consistent across 6 references.
- OCI artifact prefix `bp-` consistent across README, CLAUDE,
  BLUEPRINT-AUTHORING, IMPLEMENTATION-STATUS.

Other "Operator" mentions intentionally retained (legitimate
technical usage):
- "External Secrets Operator (ESO)", "Trivy Operator" — K8s
  Operator pattern (controllers), explicitly allowed by GLOSSARY.
- "Operator compatibility" in BUSINESS-STRATEGY's OpenShift migration
  table — refers to compatibility with K8s Operators (the technology),
  not as an entity/role.

Refs #37
2026-04-27 21:18:55 +02:00
hatiyildiz
119a1e53a0 docs(components): terminology pass across platform and product READMEs
Bring per-component READMEs in line with the canonical glossary
(docs/GLOSSARY.md). Substantive architectural content unchanged —
this is a terminology + reference correctness pass.

Placeholder rename: <tenant> → <org> in YAML / IaC examples across
- platform/cnpg/README.md           (Cluster + Pooler + ScheduledBackup)
- platform/debezium/README.md       (PostgreSQL connector + topic patterns)
- platform/external-secrets/README.md (ExternalSecret / SecretStore)
- platform/grafana/README.md        (Instrumentation namespace)
- platform/k8gb/README.md           (Gslb + namespace + kubectl examples)
- platform/keda/README.md           (ScaledObject + Kafka triggers + Prometheus)
- platform/opentofu/README.md       (server resource example)
- platform/velero/README.md         (BackupStorageLocation buckets)
- platform/vpa/README.md            (VerticalPodAutoscaler examples)
- platform/flux/README.md           (kustomization name + tenants/ → organizations/)

"Catalyst IDP" → "Catalyst console":
- platform/crossplane/README.md     (integration section retitled and
                                      rewritten — Crossplane is platform
                                      plumbing, not user-facing)
- platform/gitea/README.md          (architecture diagram + integration table)
- platform/kyverno/README.md        (rollout tracking surface)
- products/fingate/README.md        (TPP onboarding portal)

"Bootstrap wizard" → "Catalyst bootstrap":
- platform/openbao/README.md        (bootstrap procedure rewritten —
                                      independent Raft per region clarified;
                                      cross-references docs/SECURITY.md §5)
- platform/opentofu/README.md       (Quick Start)

Kyverno labels & prose:
- openova.io/tenant → openova.io/organization (label rename for
  consistency; deployed clusters will add new label as a co-label
  during migration window)
- "tenant labels" / "tenant namespace" prose updated to
  "Organization labels" / "Organization-labeled namespace"
- Priority class names (tenant-high, tenant-default, tenant-batch)
  retained as deployed artifact names — rename pending in a
  separate migration ticket

No banned-term hits remain in component READMEs (verified by grep
in docs/GLOSSARY.md banned-terms table).

Refs #37
2026-04-27 20:06:51 +02:00
talent-mesh
435f49738d feat: restructure platform to 52 components and 9 products
Technology forecast and strategic review restructure:
- Remove 13 components (backstage, mongodb, activemq, vitess, airflow, camel, dapr, superset, searxng, langserve, trino, lago, rabbitmq)
- Add 10 components (sigstore, syft-grype, nemo-guardrails, langfuse, reloader, matrix, ferretdb, litmus, livekit, coraza)
- Rename product: Synapse → Axon (SaaS LLM Gateway)
- Merge products: Titan + Fuse → Fabric (Data & Integration)
- New product: Relay (Communication)
- Replace Backstage with Catalyst IDP
- Replace MongoDB with FerretDB (MongoDB wire protocol on CNPG)
- Add supply chain security (Sigstore/Cosign, Syft+Grype)
- Add AI safety and observability (NeMo Guardrails, LangFuse)
- Add technology forecast 2027-2030 document
- Full verification pass: zero stale references across all docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 21:00:19 +00:00
talent-mesh
10245dff98 feat: ecosystem expansion to 55 components with license compliance
- Replace BSL-licensed components with open-source alternatives:
  Terraform→OpenTofu (MPL 2.0), Vault→OpenBao (MPL 2.0),
  Redpanda→Strimzi/Kafka (Apache 2.0), n8n→Airflow (Apache 2.0)
- Add 14 new platform components: activemq, camel, clickhouse, dapr,
  debezium, falco, flink, iceberg, opensearch, rabbitmq, superset,
  temporal, trino, vitess
- Rename meta-platforms/ to products/ with new product names:
  Cortex (AI Hub), Fingate (Open Banking), Titan (Data Lakehouse),
  Fuse (Microservices Integration)
- Update all documentation, READMEs, and cross-references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 18:15:11 +00:00
talent-mesh
bb53df55bb docs: comprehensive Kyverno policy matrix for resilience and zero-trust
Cover 44 policies across generate (VPA, PDB, NetworkPolicy, ResourceQuota,
LimitRange), mutate (topology spread, anti-affinity, security context,
seccomp, Harbor image rewrite, priority class), and validate (resource
requests, health probes, min replicas, pod security restricted profile,
image supply chain, network zero-trust, RBAC hardening).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 05:29:05 +00:00
talent-mesh
c9d04a53b4 refactor: flatten platform/ structure (41 components)
Remove hierarchical grouping (networking/, security/, etc.) and use flat
structure for all 41 platform components.

Changes:
- All components now directly under platform/ (no subfolders)
- AI Hub components moved from meta-platforms/ai-hub/components/ to platform/
- Open Banking components (lago, openmeter) moved to platform/
- meta-platforms/ now only contains README files that reference platform/
- Open Banking custom services remain in meta-platforms/open-banking/services/

Structure:
- platform/ (41 components, flat)
- meta-platforms/ai-hub/ (README only, references platform/)
- meta-platforms/open-banking/ (README + 6 custom services)

All documentation links updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:19:48 +00:00
talent-mesh
49f8bbc84d refactor: move harbor to registry/, kyverno to policy/
- Harbor moved from storage/ to registry/ (artifact management, not storage)
- Kyverno moved from security/ to policy/ (policy engine for validation,
  mutation, generation - broader than just security)

Updated structure:
- platform/registry/harbor/
- platform/policy/kyverno/

All documentation links updated accordingly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 11:53:21 +00:00
talent-mesh
535710289c feat: create OpenOva monorepo structure
Consolidate all component repos into a single monorepo:

- core/: Bootstrap + Lifecycle Manager application
- platform/: Individual component blueprints organized by category
  - networking/ (cilium, k8gb, external-dns, stunner)
  - security/ (cert-manager, external-secrets, vault, kyverno, trivy)
  - observability/ (grafana stack)
  - storage/ (minio, harbor, velero)
  - scaling/ (keda, vpa)
  - failover/ (failover-controller)
  - gitops/ (flux, gitea)
  - idp/ (backstage)
  - data/ (cnpg, mongodb, valkey, redpanda)
  - communication/ (stalwart)
  - iac/ (terraform, crossplane)
  - identity/ (keycloak)
- meta-platforms/: Bundled vertical solutions
  - ai-hub/ (enterprise AI platform)
  - open-banking/ (PSD2/FAPI fintech sandbox)
- docs/: Platform documentation (PLATFORM-TECH-STACK.md, SRE.md)

All internal links updated to use relative paths within monorepo.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 10:53:18 +00:00