Runbooks
What this is: operator how-tos for OpenOva. Provisioning, chart bumps, Blueprint authoring, failover recovery, troubleshooting.
Authority: PERMANENT canon. Reviewed PRs only.
Updated: 2026-05-20.
Pointers: see DOD.md for fresh-prov verification, ARCHITECTURE.md for system shape, PRINCIPLES.md for what NOT to do.
This file consolidates five prior runbook documents (BLUEPRINT-AUTHORING.md, CHART-AUTHORING.md, DEMO-RUNBOOK.md, RUNBOOK-OPERATIONS.md, RUNBOOK-PROVISIONING.md) per the lean-doc strategy. Section anchors are stable; older docs are deleted by the orchestrator after this lands.
Table of contents
- §1 — Fresh provisioning
- §2 — Day-2 operations
- §3 — Blueprint authoring
- §4 — Chart-level conventions
- §5 — Demo / operator walks
- §6 — Failover recovery
- §7 — Troubleshooting matrix
- §8 — Doc-integrity audit cadence
- §9 — Bring up a Sovereign (canonical phase walkthrough)
- §10 — UI regression test catalog
- §11 — Phase-by-phase provisioning plan (Catalyst-Zero waterfall)
§1 — Fresh provisioning
Operator-level procedure for provisioning a new Sovereign end-to-end via the wizard at console.<sovereign-fqdn>/sovereign. Read with ARCHITECTURE.md (the architectural contract).
1.1 What you get
A new Sovereign — a self-sufficient deployed Catalyst — provisioned on Hetzner from Catalyst-Zero. At the end:
- k3s cluster on Hetzner Cloud servers in your chosen region
- Cilium CNI + Gateway API ingress, Flux GitOps reconciler, Crossplane day-2 IaC
- 11-component bootstrap kit reconciling cleanly: cilium → cert-manager → flux → crossplane → sealed-secrets → nats-jetstream → openbao → keycloak → gitea → powerdns → bp-catalyst-platform
- Reachable URLs: console.<sovereign-fqdn>, gitea.<sovereign-fqdn>, harbor.<sovereign-fqdn> (TLS via cert-manager + Let's Encrypt)
- Initial sovereign-admin in Keycloak's catalyst-admin realm
- catalyst-provisioner has zero ongoing connection to the new Sovereign; Phase 1 hand-off is complete
1.2 Pre-flight checklist
Walk these top to bottom. The wizard fails fast on missing prerequisites, but most are not visible to the wizard.
A. Hetzner Cloud project + API token
| Item | Required | Where |
|---|---|---|
| Hetzner Cloud project | Yes — separate project per Sovereign | https://console.hetzner.cloud → Projects |
| API token | Read and Write | Project → Security → API Tokens → New Token |
| Token storage | 1Password vault OpenOva — Production, item Catalyst — Hetzner Cloud token (<sovereign-fqdn>) | Tag rotation:per-sovereign |
| Rotation policy | Rotate on leak, on decommission, or every 12 months | See SECURITY.md §11 |
The token is sent once through the wizard, used by catalyst-api for the OpenTofu run, then redacted from the persisted deployment record. It is not copied to the Sovereign cluster.
B. SSH public key
Generate fresh if you don't already have a sovereign-admin keypair:
ssh-keygen -t ed25519 -C "sovereign-admin@<your-org>" -f ~/.ssh/sovereign_admin -N ""
Paste the PUBLIC half (*.pub) — a single unbroken line starting ssh-ed25519 AAAA....
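Before pasting, the key file can be sanity-checked locally. The following is a minimal sketch, not part of the wizard: the helper name is_ed25519_pubkey is hypothetical, and it only checks the shape described above (one line, ssh-ed25519 type, base64 body starting with AAAA).

```shell
# Hypothetical helper (not shipped with Catalyst): returns 0 when the file
# holds exactly one line that looks like an ed25519 public key.
is_ed25519_pubkey() {
  file="$1"
  # A *.pub file is expected to contain exactly one non-empty line.
  [ "$(grep -c . "$file" 2>/dev/null)" = "1" ] || return 1
  # Key type, a base64 blob starting with AAAA, then an optional comment.
  grep -Eq '^ssh-ed25519 AAAA[A-Za-z0-9+/]+=*( .*)?$' "$file"
}

# Usage: is_ed25519_pubkey ~/.ssh/sovereign_admin.pub && echo "looks like a public key"
```

A private key file fails the check because its first line is a PEM-style header, which helps avoid pasting the wrong half.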
C. Pool subdomain reserved
The OpenOva pool zones are omani.works, omani.homes, omani.rest, omani.trade, omantel.biz. Pick one and pick a subdomain (e.g. t42). PDM /v1/reserve checks availability; on commit it (a) creates the per-Sovereign PowerDNS zone, (b) writes the canonical 6-record set, (c) updates the parent-zone NS delegation via the Dynadot registrar adapter.
Forbidden test domains (per DOD.md): openova.io, omantel.openova.io, Nova Cloud, eventforge.io.
D. DNS pool registered + Dynadot credentials
| Item | Required | Where |
|---|---|---|
| K8s Secret dynadot-api-credentials | Namespace openova-system, keys api-key, api-secret, domain | kubectl -n openova-system get secret dynadot-api-credentials |
| PDM running | kubectl -n openova-system get deploy pool-domain-manager shows 1/1 READY | — |
| PDM healthy | kubectl -n openova-system exec deploy/pool-domain-manager -- wget -q -O - http://localhost:8080/healthz returns {"status":"ok"} | — |
E. GHCR pull token
Cloud-init creates flux-system/ghcr-pull Secret on the Sovereign cluster from the catalyst-api Pod's CATALYST_GHCR_PULL_TOKEN env var (sourced from K8s Secret catalyst-ghcr-pull-token).
| Item | Required | Where |
|---|---|---|
| Token type | Fine-grained personal access token, scope packages:read on org openova-io | https://github.com/settings/tokens?type=beta |
| K8s Secret | catalyst/catalyst-ghcr-pull-token, key token | kubectl -n catalyst get secret catalyst-ghcr-pull-token |
| Rotation policy | Yearly | See SECURITY.md §11.1 |
F. PowerDNS pool zones bootstrapped
kubectl -n openova-system exec deploy/powerdns -- \
pdnsutil list-all-zones 2>/dev/null | grep -E '^(omani\.(works|homes|rest|trade)|omantel\.biz)$'
If any line is missing, see PLATFORM-POWERDNS.md §"Pool zone bootstrap".
G. bp-* charts published at target version
Confirm the bootstrap-kit OCI artifacts exist before provisioning (target version is published in clusters/_template/bootstrap-kit/*.yaml).
H. subchart-guard CI green
gh run list --workflow=blueprint-release.yaml --limit 5 \
--json conclusion,headBranch,event --repo openova-io/openova
Every recent run on main must show "conclusion": "success". If any fails, do not provision; fix CI first.
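The "every run must be success" rule can be turned into a pass/fail gate. This is a sketch under assumptions: all_runs_green is a hypothetical helper that reads one conclusion per line, and the suggested gh invocation relies on the gh flags --branch and --jq to produce that list.

```shell
# Hypothetical helper: reads one workflow conclusion per line on stdin and
# succeeds only if every line is exactly "success". An empty list is treated
# as not green, because there is nothing to trust.
all_runs_green() {
  seen=0
  while IFS= read -r conclusion; do
    seen=1
    [ "$conclusion" = "success" ] || return 1
  done
  [ "$seen" -eq 1 ]
}

# Intended use (assumes gh is installed and authenticated):
#   gh run list --workflow=blueprint-release.yaml --branch main --limit 5 \
#     --json conclusion --jq '.[].conclusion' --repo openova-io/openova | all_runs_green
```

Runs that are still in progress report an empty conclusion, so they also fail the gate, which errs on the side of not provisioning.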
1.3 The 7-step wizard
The wizard's canonical step order (from STEPS in products/catalyst/bootstrap/ui/src/pages/wizard/WizardPage.tsx): Org → Topology → Provider → Credentials → Components → Domain → Review.
| Step | What it captures | Notes |
|---|---|---|
| 1. Organisation | Org profile: name, industry, size, HQ, compliance frame | No email or domain here — captured at Step 6 |
| 2. Topology | Regions, building blocks, HA toggle, CP + worker SKU, worker count | Per #176 SKU pickers driven by PROVIDER_NODE_SIZES[provider] |
| 3. Provider | Hetzner (today); AWS / GCP / Azure / OCI / Huawei design-only | |
| 4. Credentials | Provider API token + project ID, SSH public key | Validated read-only via POST /api/v1/credentials/validate; token redacted from SSE stream |
| 5. Components | Single flat marketplace card grid (#162) with family chips + search + product-family chip filter | Per #175 dependency-aware cascades pull transitive deps automatically (Specter → BGE/Milvus/LangFuse/vLLM/KServe; Harbor → cnpg/seaweedfs/valkey) |
| 6. Domain | Pool subdomain OR BYO (manual NS / registrar API) + sovereign-admin email | Pool = PDM /v1/reserve. BYO byo-api = registrar token (Cloudflare/Namecheap/GoDaddy/OVH/Dynadot, #170) |
| 7. Review | Show every captured value, Provision button | Click → catalyst-api accepts the request and starts streaming |
Multi-region topology: canonical = N regions × 1 cpx52 per region, each node = CP AND worker (untainted), workerCount=0 in body. 3 regions = 3 servers, NOT 9.
1.4 Phase timeline
flowchart LR
subgraph PROV["catalyst-provisioner (mothership)"]
W["Wizard / SSE\nUI captures input"] --> A["catalyst-api\n/v1/deployments"]
A --> P0["Phase 0 — OpenTofu\nnetwork+firewall+ssh-key\n+server+LB\n(30–60s plan, 60–120s apply)"]
P0 --> PDM["PDM /v1/commit\nwrites Sovereign DNS\n(~5s)"]
end
subgraph CI["Cloud-init on control-plane (3–5min)"]
CI1["k3s install\n+ Cilium helm install\n(CNI bootstrap)"] --> CI2["Flux v2.4.0 install"]
CI2 --> CI3["create flux-system/ghcr-pull\nfrom CATALYST_GHCR_PULL_TOKEN"]
CI3 --> CI4["apply GitRepository\n+ 2 Kustomizations\n(bootstrap-kit + infra-config)"]
end
subgraph SOV["Sovereign cluster (Flux-owned)"]
F1["bootstrap-kit Kustomization\ninstalls 10 bp-* in dep order\n(10–15min)"] --> F2["bp-catalyst-platform\numbrella reconciles\n(~2min)"]
F2 --> F3["cert-manager issues wildcard\n+ Cilium Gateway online\n+ console URL responds\n(1–2min)"]
end
PDM --> CI1
CI4 --> F1
Total wall-clock: 15–25 minutes for a solo Sovereign (1 cpx52, 0 workers); 25–45 minutes with HA.
Ownership boundaries are load-bearing:
- catalyst-provisioner runs in the catalyst namespace on Catalyst-Zero (the mothership). It does the OpenTofu run, hands the cloud-init template to the new server, calls PDM, then disconnects.
- Cloud-init on the new control-plane is the only one-shot bridge. It installs k3s, Cilium, Flux, and the GHCR pull secret, then commits the cluster to GitOps mode.
- Sovereign cluster owns its outcome from then on. Flux pulls bp-* charts from the public OpenOva monorepo and reconciles steady-state. The provisioner has no privileged access after hand-off.
1.5 Phase-by-phase walkthrough
Phase 0 — OpenTofu (30–60s plan, 60–120s apply)
What gets created in Hetzner Cloud:
| Resource | Hetzner kind | Name pattern |
|---|---|---|
| Network | hcloud_network | catalyst-${slug}-network |
| Firewall | hcloud_firewall | catalyst-${slug}-fw |
| SSH key | hcloud_ssh_key | catalyst-${slug}-ssh |
| Control-plane | hcloud_server | catalyst-${slug}-cp-1 |
| Workers (worker_count) | hcloud_server | catalyst-${slug}-worker-N |
| Load balancer | hcloud_load_balancer | catalyst-${slug}-lb |
Where ${slug} = replace(sovereign_fqdn, ".", "-"). Names are deterministic — that is the basis for idempotent re-runs.
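The naming rule is simple enough to reproduce in a shell when checking the Hetzner console against expectations. The sketch below mirrors the replace(sovereign_fqdn, ".", "-") expression; the function names are hypothetical, and t42.omani.works is just the example subdomain and pool zone used earlier in this section.

```shell
# Hypothetical helpers mirroring the naming rule above.
catalyst_slug() {
  # Replace every dot in the FQDN with a hyphen.
  printf '%s\n' "$1" | tr '.' '-'
}

catalyst_resource_name() {
  # $1 = sovereign FQDN, $2 = role suffix such as network, fw, ssh, cp-1, lb
  printf 'catalyst-%s-%s\n' "$(catalyst_slug "$1")" "$2"
}

catalyst_resource_name t42.omani.works lb   # prints catalyst-t42-omani-works-lb
```

Because the output depends only on the FQDN and the role, two runs for the same Sovereign always target the same resource names.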
PDM /commit writes Sovereign DNS (~5s)
PDM (#163, #167, #168, #170):
- Creates the per-Sovereign authoritative zone <sovereign-fqdn>. on bp-powerdns (CNPG-backed pdns-pg, DNSSEC-signed ECDSAP256SHA256, lua-records enabled)
- Writes the canonical 6-record set: @, *, console, api, gitea, harbor (all A records pointing at the LB IP)
- For pool Sovereigns: writes parent-zone NS delegation into Dynadot via the registrar adapter
- For byo-api: flips NS at the customer's registrar
- For byo-manual: emits OpenOva NS list in the wizard
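To make the 6-record set concrete, here is a sketch that prints what it would look like for a given FQDN and LB IP. The function name and the zone-file-like output format are assumptions for illustration only; PDM's actual API payload is not shown in this document. 203.0.113.10 is a documentation-range address.

```shell
# Hypothetical illustration of the canonical 6-record set.
# Output format (owner, type, value) is an assumption, not PDM's wire format.
canonical_records() {
  fqdn="$1"; lb_ip="$2"
  for name in '@' '*' console api gitea harbor; do
    case "$name" in
      '@') owner="$fqdn." ;;          # zone apex
      *)   owner="$name.$fqdn." ;;    # wildcard and named hosts
    esac
    printf '%s A %s\n' "$owner" "$lb_ip"
  done
}

canonical_records t42.omani.works 203.0.113.10
```

After PDM /commit, each of these names can be spot-checked with dig +short against the Sovereign's nameservers.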
Cloud-init (3–5 min) — strict order:
1. apt-get update + install curl ca-certificates
2. curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.4+k3s1 sh -s - server --flannel-backend=none --disable-network-policy --disable=traefik --disable=servicelb --disable=local-storage --tls-san=<sovereign-fqdn>
3. helm install cilium ... --set k8sServiceHost=127.0.0.1 ... (Cilium before Flux to break the CNI bootstrap deadlock)
4. flux install (Flux v2.4.0 core)
5. kubectl create secret generic ghcr-pull -n flux-system --from-literal=token="$CATALYST_GHCR_PULL_TOKEN" (durable so private bp-* charts pull cleanly)
6. Apply the GitRepository pointing at clusters/<sovereign-fqdn>/ in the public OpenOva monorepo
7. Apply two Kustomizations split for CRD ordering:
   - bootstrap-kit: installs the 10 platform charts
   - infrastructure-config: applies Crossplane Compositions + ProviderConfigs (depends-on bootstrap-kit)
Phase 1 — bootstrap-kit (10–15 min)
Flux pulls 10 bp-* HelmReleases in dependency order:
cilium → cert-manager → flux → crossplane → sealed-secrets
↓
nats-jetstream → openbao → keycloak → gitea → powerdns
Then bp-catalyst-platform (umbrella) reconciles.
cert-manager + Cilium Gateway + console URL (1–2 min)
Once bp-cert-manager is Ready=True and the wildcard *.<sovereign-fqdn> DNS has propagated, cert-manager issues a wildcard cert via DNS-01 (against PowerDNS). The Cilium Gateway picks it up; https://console.<sovereign-fqdn> returns 200.
1.6 Re-runs and idempotency
tofu apply on an existing state is idempotent: rerunning the wizard with the same Sovereign FQDN updates only what changed. To re-run cloud-init on the control-plane (rare), the cleanest path is via Crossplane Compositions in clusters/<sovereign-fqdn>/, NOT direct re-run. Cloud-init runs once per server lifetime by default.
For partial-state recovery, see §2.2 and the operator-recover-sovereign.sh script.
1.7 Canonical wipe endpoint
Burned once on t124 (2026-05-16): DELETE /api/v1/deployments/{id} is record-only — it does NOT destroy Hetzner resources. Use POST /api/v1/deployments/{id}/wipe with hcloud + S3 creds in the body — this is the canonical destructive operation (tofu destroy + hetzner.Purge + S3 delete).
§2 — Day-2 operations
2.1 Decommissioning
DEPLOYMENT_ID=<the deployment ID from Phase 0>
curl -s -X POST "https://console.<mothership-fqdn>/api/v1/deployments/${DEPLOYMENT_ID}/wipe" \
-H "Content-Type: application/json" \
-d '{"hcloud_token":"<token>","s3_credentials":{...}}'
After destroy, verify:
# Hetzner Cloud Console → Servers → empty for the project
# Hetzner Cloud Console → Load balancers → empty for the project
dig +short console.<sovereign-fqdn>
# May resolve until parent-zone NS-delegation TTL expires (~15 min)
2.2 Recovery script — scripts/operator-recover-sovereign.sh
Single-shot return to clean slate. Idempotent.
# Dry-run (default) — prints what WOULD be done, deletes nothing
./scripts/operator-recover-sovereign.sh <sovereign-fqdn>
# Apply — actually purges Hetzner, releases PDM, cancels deployment record
HETZNER_API_TOKEN=<from-1Password> \
./scripts/operator-recover-sovereign.sh <sovereign-fqdn> --apply
What it does, in order:
1. Hetzner Cloud purge. Lists every resource carrying label catalyst.openova.io/sovereign=<fqdn> (servers, LBs, networks, firewalls, volumes, primary IPs, floating IPs) and deletes via Hetzner API. SSH keys are matched by deterministic name slug. After delete, a verification sweep re-queries each resource type and re-deletes any that lingered.
2. PDM allocation release. Calls DELETE http://pool-domain-manager.openova-system.svc.cluster.local:8080/api/v1/pool/<pool-zone>/release?sub=<sub>.
3. catalyst-api deployment record cancel. Rewrites status to cancelled with a recovery event.
Why safe to re-run: every Hetzner resource is named catalyst-${slug}-{role}. Re-running with the same FQDN recreates exactly the same names → no uniqueness_error.
Hetzner DELETE-but-resource-persists workaround: the verification sweep at end of Step 1 catches the well-known quirk where DELETE /v1/<kind>/<id> returns 204 No Content but the resource is still present 5–30s later (firewalls right after a server delete are the worst offender). Skipping the sweep caused exactly the uniqueness_error this script is meant to prevent.
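The sweep is essentially a bounded list, delete, re-list loop. The sketch below shows that shape only; it is not the code in operator-recover-sovereign.sh. sweep_until_gone and its callback arguments are hypothetical, and the caller is assumed to supply a command that lists lingering IDs and one that deletes a single ID.

```shell
# Hypothetical sketch of a verification sweep.
# sweep_until_gone LIST_CMD DELETE_CMD [ATTEMPTS] [PAUSE_SECONDS]
#   LIST_CMD   prints one lingering resource ID per line (empty when clean)
#   DELETE_CMD takes one ID and requests its deletion
sweep_until_gone() {
  list_cmd="$1"; delete_cmd="$2"; attempts="${3:-6}"; pause="${4:-5}"
  i=0
  while :; do
    remaining="$("$list_cmd")"
    if [ -z "$remaining" ]; then return 0; fi        # nothing lingers
    if [ "$i" -ge "$attempts" ]; then return 1; fi   # give up, report failure
    for id in $remaining; do
      "$delete_cmd" "$id"
    done
    sleep "$pause"                                   # let the API catch up
    i=$((i + 1))
  done
}
```

The loop trusts only the re-listing, never the delete response, which is the point of the workaround: a 204 is treated as a request accepted, not as proof the resource is gone.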
2.3 Hetzner orphan-cleanup discipline
After wipe, enumerate EVERY Hetzner endpoint with full listing, never substring-filter. CCM auto-scaler workers + primary_ip-<digits> lack FQDN → name filters miss them. Canonical hetzner.Purge also misses them. Always do a full-enumeration verification sweep.
2.4 Chart-version collision (parallel fix-authors)
When parallel fix-authors bump the same chart, version collisions are inevitable:
- Check the latest chart version on origin/main BEFORE bumping (don't trust the version cited in the dispatch prompt; it may be stale).
- On git push rejection: rebase + bump to the next free version + force-push-with-lease.
- Lockstep bump in the same commit: chart Chart.yaml version + blueprint.yaml spec.version + bootstrap-kit / reconciler pin file. Lockstep CI catches drift.
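"Bump to the next free version" can be sketched as a patch-level increment that skips versions already taken. This is an illustrative helper, not repository tooling: next_free_patch is a hypothetical name, and the list of taken versions is assumed to be gathered by the caller from origin/main and the registry.

```shell
# Hypothetical helper: next_free_patch CURRENT [TAKEN...]
# Increments the patch component of CURRENT until the result is not in TAKEN.
next_free_patch() {
  current="$1"; shift
  major="${current%%.*}"
  rest="${current#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"
  while :; do
    patch=$((patch + 1))
    candidate="$major.$minor.$patch"
    case " $* " in
      *" $candidate "*) ;;                              # taken, keep bumping
      *) printf '%s\n' "$candidate"; return 0 ;;
    esac
  done
}

next_free_patch 1.0.1 1.0.2 1.0.3   # prints 1.0.4
```

The same resulting version then goes into all three lockstep sites in one commit.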
2.5 cert-manager + Let's Encrypt rate limit
If the operator re-provisioned the same FQDN >5 times in 7 days (LE "Duplicate Certificate" limit, 5/week):
- Switch ClusterIssuer to letsencrypt-staging (untrusted cert, works without rate limit): kubectl edit clusterissuer wildcard-issuer and change acme.server.
- Browser will warn; acceptable for in-window operator testing.
- When the limit expires, switch back to letsencrypt-prod; Certificate renews automatically.
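For reference, the two values of acme.server are the public Let's Encrypt ACME v2 directory URLs. The helper below is a hypothetical convenience wrapper, and the commented kubectl patch is a suggested non-interactive equivalent of the edit step; it assumes the issuer's ACME settings live at spec.acme, as in a standard cert-manager ClusterIssuer.

```shell
# Hypothetical helper: prints the Let's Encrypt ACME directory URL for an environment.
acme_server_url() {
  case "$1" in
    staging) echo "https://acme-staging-v02.api.letsencrypt.org/directory" ;;
    prod)    echo "https://acme-v02.api.letsencrypt.org/directory" ;;
    *)       echo "usage: acme_server_url staging|prod" >&2; return 2 ;;
  esac
}

# Suggested non-interactive switch (assumes spec.acme.server on the issuer):
#   kubectl patch clusterissuer wildcard-issuer --type merge \
#     -p "{\"spec\":{\"acme\":{\"server\":\"$(acme_server_url staging)\"}}}"
```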
2.6 StorageClass missing (legacy)
Symptom: fresh Sovereign reaches flux-bootstrap, bootstrap-kit Kustomization stuck Ready=False 10+ min, every PVC Pending with no persistent volumes available for this claim and no storage class is set.
Root cause: pre-2026-04-29 cloud-init passed --disable=local-storage to the k3s installer.
Resolution (current code): cloud-init keeps k3s' built-in local-path-provisioner and marks local-path as the default StorageClass BEFORE applying the Flux bootstrap manifest.
Recovery for pre-fix Sovereigns:
KUBECONFIG=/path/to/sovereign-kubeconfig
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.30/deploy/local-path-storage.yaml
kubectl -n local-path-storage wait --for=condition=Ready pod -l app=local-path-provisioner --timeout=60s
kubectl patch storageclass local-path -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
Local-path is correct for solo-Sovereign target. Multi-node migration to hcloud-csi is a separate, deliberate operation.
2.7 bp-flux double-install — version-pin invariant
Live incident: omantel.omani.works, 2026-04-29. Flux controllers deleted by the FIRST reconcile of bp-flux. Cluster lost its GitOps engine in-place; only recovery is full reprovision.
Root cause: cloud-init's flux2 v<X.Y.Z>/install.yaml URL pin and the bp-flux umbrella's flux2 subchart appVersion drifted. Helm tried to update the existing Flux CRDs to a new schema, the apiserver rejected (storedVersions[0]: Invalid value: "v1"), Helm rolled back, the rollback deleted the existing Flux controller Deployments.
The invariant: cloud-init's install.yaml URL version and the bp-flux umbrella flux2 subchart appVersion MUST be the same upstream Flux release. Enforced at:
- infra/hetzner/cloudinit-control-plane.tftpl: install.yaml URL pin
- platform/flux/chart/Chart.yaml: flux2 subchart dep
- platform/flux/chart/values.yaml: catalystBlueprint.upstream.version
- platform/flux/chart/tests/version-pin-replay.sh: CI gate; replays the catastrophic precondition
To bump Flux safely: pick the target upstream version, find the matching community chart from https://fluxcd-community.github.io/helm-charts/index.yaml, update all four pin sites in one PR, bump Chart.yaml version, update every clusters/<sovereign-fqdn>/bootstrap-kit/03-flux.yaml, run the replay test locally, push.
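A quick local pre-check of the invariant compares the version embedded in the install.yaml URL with the subchart appVersion. This sketch is far simpler than version-pin-replay.sh and does not replace it; the function names are hypothetical, and the URL shape (a /v<X.Y.Z>/install.yaml suffix, as on Flux's GitHub release downloads) is an assumption about the pinned URL.

```shell
# Hypothetical helpers for a local pin comparison.
# Extracts "2.4.0" from a URL ending in /v2.4.0/install.yaml
flux_version_from_url() {
  printf '%s\n' "$1" | sed -n 's|.*/v\([0-9][0-9.]*\)/install\.yaml$|\1|p'
}

# pins_match CLOUD_INIT_URL SUBCHART_APP_VERSION
# Succeeds only when the URL carries a version and it equals the appVersion
# (a leading "v" on the appVersion is ignored).
pins_match() {
  url_version="$(flux_version_from_url "$1")"
  [ -n "$url_version" ] && [ "$url_version" = "${2#v}" ]
}
```

For example, pins_match "https://github.com/fluxcd/flux2/releases/download/v2.4.0/install.yaml" 2.4.0 succeeds, while passing 2.5.0 as the second argument fails.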
2.8 Phase 1 watch shows 0 HelmReleases
Symptom: wizard reaches flux-bootstrap cleanly, then admin banner warns Phase 1 watch saw 0 HelmReleases in 15m0s.
What it means: Phase 0 succeeded (cluster up, Flux installed). Phase-1 watcher never saw a bp-* HelmRelease appear within the first-seen window (CATALYST_PHASE1_FIRST_SEEN_TIMEOUT, default 15 min). Means Flux on the new Sovereign isn't materialising the bootstrap-kit Kustomization.
Operator playbook:
- Confirm catalyst-api Pod env vars are sane (CATALYST_PHASE1_*).
- On the new Sovereign: kubectl get gitrepository -n flux-system -o wide + describe gitrepository openova-public. Look for Conditions[type=Ready].status=True + recent lastAppliedRevision. Common failures: 401/403 (deploy-key missing/wrong scope), 404 (branch/path mismatch), connection refused (DNS/firewall egress).
- kubectl get kustomization -n flux-system + describe kustomization -n flux-system <sovereign-fqdn>-bootstrap-kit. The Message field names the cause: missing CRD, dependsOn unresolved, etc.
- Inspect source-controller and kustomize-controller logs (kubectl -n flux-system logs deploy/source-controller --tail=200).
- Re-run reconciliation manually: flux reconcile source git openova-public -n flux-system + flux reconcile kustomization <sovereign-fqdn>-bootstrap-kit -n flux-system.
If overall CATALYST_PHASE1_WATCH_TIMEOUT of 60m elapsed, start a fresh wizard run (Hetzner side is idempotent).
2.9 Cilium Gateway hostNetwork — world-ingress policy
Cilium's reserved:ingress endpoint is not covered by default-deny NotIn-namespace selector → 403 envoy on all public Sovereign hosts.
Fix: CCNP scoped to reserved.ingress allowing world / cluster / host / remote-node. PR #1482.
2.10 ClusterMesh regionKeyFromSpec off-by-one
regionKeyFromSpec idx+1 mismatched tofu secondary_regions index → empty kc → silent zero peers → fullyMeshed=0 with NO warn logs.
Fix: added "zero peer entries" Warn for future regressions (PR #1525).
2.11 Per-instance verification ledger
Every Sovereign instance carries a docs/ledger/TRUST.md ledger of claimed-done items in 4 states:
- UNVERIFIED (default)
- VERIFIED-PASS (screenshot evidence)
- VERIFIED-FAIL
- VERIFIED-PARTIAL
Every new PR against a surface flips it back to UNVERIFIED. Cron-refreshed alongside docs/ledger/TRACKER.md.
§3 — Blueprint authoring
How to author a Blueprint for Catalyst — the unified unit of installable software (replaces what was previously called "module" + "template"). Defer to GLOSSARY.md for terminology and ARCHITECTURE.md for the broader model.
3.1 What a Blueprint is
A Blueprint is:
- A source location (one of three Gitea-Org-scoped places, all using identical Blueprint shape):
  - Public Blueprints: platform/<name>/ or products/<name>/ in github.com/openova-io/openova (this repository). Per-Blueprint isolation is provided by CI fan-out: each folder publishes its own signed OCI artifact. Visible to every Sovereign via the catalog Gitea Org mirror.
  - Sovereign-curated private Blueprints: a Gitea Repo under the catalog-sovereign Gitea Org on a Sovereign. Authored by the Sovereign owner, visible to every Catalyst Organization on that Sovereign without being public upstream.
  - Org-private Blueprints: a directory inside gitea.<location-code>.<sovereign-domain>/<org>/shared-blueprints/bp-<name>/. Visible only within that Org.
- A CRD manifest (blueprint.yaml) declaring its identity, configSchema, placementSchema, dependencies, manifest pointers
- A set of manifests (Helm chart, Kustomize base + overlays, or raw YAML) applied when the Blueprint is installed as an Application
- A set of Crossplane Compositions (optional) for any non-Kubernetes resources
- A CI pipeline that signs the artifact (cosign), generates SBOM (Syft), publishes to ghcr.io/openova-io/bp-<name>:<semver>
One Blueprint = one card in the marketplace (when visibility: listed).
3.2 Folder layout
platform/<name>/ ← OR products/<name>/ for composite Blueprints
├── blueprint.yaml ← the Blueprint CRD manifest
├── README.md
├── chart/ ← Helm chart (preferred)
│ ├── Chart.yaml
│ ├── values.yaml
│ └── templates/
│ OR
├── manifests/ ← Kustomize base + overlays
├── compositions/ ← (optional) Crossplane Compositions
├── card/ ← marketplace presentation
└── tests/ ← acceptance tests
CI workflow lives once at the monorepo root (.github/workflows/blueprint-release.yaml) with path-based matrix builds.
3.3 Blueprint CRD
Annotated example for bp-wordpress:
apiVersion: catalyst.openova.io/v1alpha1
kind: Blueprint
metadata:
  name: bp-wordpress
  version: 1.3.0
spec:
  card:
    title: WordPress
    tagline: Self-hosted CMS
    category: cms
    icon: ./card/icon.svg
  visibility: listed # listed | unlisted | private
  owner:
    team: apps
    contact: apps@openova.io
  configSchema: # JSON Schema; drives console form
    type: object
    required: [domain, adminEmail]
    properties:
      domain: { type: string, format: hostname }
      adminEmail: { type: string, format: email }
      replicas: { type: integer, default: 2, minimum: 1, maximum: 20 }
  placementSchema:
    modes: [single-region, active-active, active-hotstandby]
    minRegions: 1
    maxRegions: 5
  depends:
    - blueprint: bp-postgres
      version: ^1.4
      alias: db
      when: "{{ .config.postgres.mode == 'embedded' }}"
  manifests:
    source:
      kind: HelmChart
      ref: oci://ghcr.io/openova-io/bp-wordpress:1.3.0
  upgrades:
    from: [ 1.2.x, 1.1.x ]
    blocks: [ 1.0.x ]
  rotation:
    - kind: oauth-client-secret
      name: wp-keycloak-client
      ttl: 90d
  observability:
    metrics: prometheus
    logs: stdout
    traces: otlp
3.4 configSchema design
The console form is generated from configSchema — never hand-written. JSON Schema features supported: type, format, default, enum, minimum, maximum, oneOf/anyOf, dependencies, and x-catalyst-ui-hint for non-trivial widgets (password, domain-picker, application-ref).
3.5 Dependencies
Hard, conditional, and reference dependencies all supported. Catalyst installs hard deps automatically; conditional deps are skipped if the predicate is false; reference deps resolve to a sibling Application in the same Environment.
3.6 Placement and multi-region
placementSchema.modes: single-region (trivial), active-active (stateless trivial, stateful declares replication strategy), active-hotstandby (CNPG WAL streaming, SeaweedFS bucket replication, Valkey REPLICAOF).
3.7 Manifests source types
| manifests.source.kind | When to use |
|---|---|
| HelmChart | Most third-party apps with existing Helm charts |
| Kustomize | Small custom apps; full patch control |
| OAM | (Future, not yet supported) |
3.8 Umbrella shape (HARD contract — CI-enforced)
Every Blueprint chart at platform/<name>/chart/ (and products/<name>/chart/) MUST be an umbrella chart: it MUST declare its upstream chart(s) under dependencies: in Chart.yaml so helm dependency build pulls the upstream payload into the published OCI artifact.
Hollow charts — wrappers that carry only Catalyst overlay templates without an upstream subchart dependency — are forbidden. CI rejects them.
Why this rule exists: earlier this cycle, bp-cert-manager:1.0.0 shipped as a hollow chart — only a ClusterIssuer template, no upstream cert-manager subchart bytes. Flux installed it on every Sovereign. Phase 1 broke on every Sovereign because cert-manager itself was never deployed. The artifact looked legitimate (right name, right version, signed, SBOM-attested) but the upstream payload was simply not there.
Dual-annotation requirement (PR #2087 + #2093)
Two pre-merge guards run on every chart change. BOTH are mandatory.
| Guard | Workflow | Rule | Why |
|---|---|---|---|
| GUARD 1 — no-upstream (pre-merge, PR #2087) | .github/workflows/check-chart-annotations.yaml → scripts/check-chart-annotations.sh | Every changed chart/Chart.yaml MUST EITHER declare a non-empty dependencies: block OR carry annotation catalyst.openova.io/no-upstream: "true" | Catches hollow shape before the chart version is dead-reserved by a failed publish. Pre-2026-05-20 each recurrence needed a follow-up version-bump PR. |
| GUARD 2 — smoke-render (pre-merge, PR #2093) | Same workflow | helm template with default values must produce ≥5 lines OR chart must carry catalyst.openova.io/smoke-render-mode: "default-off" | Catches charts that render empty at defaults (enabled.default: false master gate) without opt-out annotation. |
Charts with enabled.default: false MUST carry BOTH annotations.
Real incident (bp-network-policies:1.0.1, 2026-05-20): chart had no-upstream: true (GUARD 1 satisfied) but was MISSING smoke-render-mode: default-off. Smoke-render check at publish time tripped and dead-reserved version 1.0.1, so a follow-up PR was needed to bump to 1.0.2 with the second annotation. PR #2093 elevated the smoke-render check to pre-merge so this can never recur silently. PRs #2090 + #2091 added the dual annotations.
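The GUARD 1 rule can be approximated locally before opening a PR. The sketch below is a simplified stand-in, not the contents of scripts/check-chart-annotations.sh: chart_passes_guard1 is a hypothetical name, and the parsing is line-based (awk/grep) rather than a real YAML parse, so unusual formatting may fool it.

```shell
# Hypothetical, simplified local version of the GUARD 1 rule.
# Passes when Chart.yaml has a dependencies: key with at least one list item,
# or carries the catalyst.openova.io/no-upstream: "true" annotation.
chart_passes_guard1() {
  chart="$1"
  if awk '
      /^dependencies:/                 { dep = 1; next }   # enter the block
      dep && /^[[:space:]]*-/          { found = 1 }       # a list item
      dep && /^[^[:space:]-]/          { dep = 0 }         # next top-level key
      END                              { exit !found }
    ' "$chart"; then
    return 0
  fi
  grep -Eq '^[[:space:]]+catalyst\.openova\.io/no-upstream:[[:space:]]*"?true"?' "$chart"
}

# Usage: chart_passes_guard1 platform/<name>/chart/Chart.yaml || echo "hollow chart"
```

An empty dependencies: [] block fails the first branch on purpose, matching the "non-empty" wording of the rule.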
The four post-merge guards remain as belt-and-braces structural verification at publish time:
| When | Guard | Failure mode caught |
|---|---|---|
| After helm dependency build | Working-tree chart/charts/<dep>-<ver>.tgz exists for every dependencies: entry | Missing/wrong repo URL, silently-skipped dep |
| After helm package | tar -tzf listing contains <chart_name>/charts/<dep>-<ver>.tgz | .helmignore mishap, packaging-time stripping |
| After helm push | helm pull round-trips the artifact; pulled .tgz listing again contains every declared subchart | Registry-side path mangling, OCI manifest rewriting |
| Always | helm template smoke render produces non-trivial output OR smoke-render-mode: default-off; rendered manifests uploaded as workflow artifact | Render-broken templates, schema violations |
Any single guard failing fails the whole publish job. A hollow Blueprint can never reach a Sovereign through the sanctioned CI path.
Authoring rule
Every umbrella Chart.yaml declares the upstream chart(s) it wraps:
# platform/cilium/chart/Chart.yaml
apiVersion: v2
name: bp-cilium
version: 1.1.0
type: application
dependencies:
  - name: cilium
    version: "1.16.5"
    repository: "https://helm.cilium.io"
The version pinned in dependencies: MUST match the version recorded in platform/<name>/blueprint.yaml and the catalystBlueprint.upstream.version field in values.yaml — all three together via PR + Blueprint release.
Verifying an existing artifact
helm pull oci://ghcr.io/openova-io/bp-cilium --version 1.1.0
tar -tzf bp-cilium-1.1.0.tgz | grep '^bp-cilium/charts/cilium/' | head
A non-empty result proves the upstream subchart is inside the OCI artifact.
3.9 Observability toggles must default false (HARD contract — CI-enforced)
Every observability toggle in a Blueprint's chart/values.yaml — serviceMonitor.enabled, metrics.enabled, prometheusRule.enabled, monitoring.enabled, tracing.enabled, prometheus.enabled and analogues — MUST default to false.
The CRDs that back ServiceMonitor / PrometheusRule (monitoring.coreos.com/v1) ship with kube-prometheus-stack. If bp-cilium defaults cilium.prometheus.serviceMonitor.enabled: true, Helm renders a ServiceMonitor the apiserver immediately rejects:
no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
— ensure CRDs are installed first
Result: bp-cilium's HelmRelease enters InstallFailed, every downstream bp-* HelmRelease (dependsOn: bp-cilium) reports dep is not ready, the whole Sovereign bootstrap stalls. Verified failure on omantel.omani.works 2026-04-29 (issue #182).
Canonical pattern:
# platform/cilium/chart/values.yaml — DEFAULT OFF
cilium:
  prometheus:
    enabled: false
    serviceMonitor:
      enabled: false
# clusters/<sovereign>/bootstrap-kit/01-cilium.yaml — OPERATOR OPT-IN
spec:
  values:
    cilium:
      prometheus:
        enabled: true
        serviceMonitor:
          enabled: true
CI runs tests/observability-toggle.sh (when present under platform/<name>/chart/tests/) on every publish. The script asserts default-render produces zero monitoring.coreos.com/v1 references, opt-in render succeeds AND produces a ServiceMonitor, explicit-off render succeeds AND produces zero references.
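The default-render assertion boils down to "the rendered output must not mention monitoring.coreos.com/v1". A minimal sketch of that single check follows; it is not the repository's observability-toggle.sh, and no_monitoring_refs is a hypothetical name. It reads rendered manifests on stdin so it can sit at the end of a helm template pipeline.

```shell
# Hypothetical helper: succeeds when stdin contains no reference to the
# Prometheus Operator API group/version.
no_monitoring_refs() {
  ! grep -q 'monitoring\.coreos\.com/v1'
}

# Intended use (assumes chart dependencies are already built):
#   helm template platform/cilium/chart | no_monitoring_refs          # default render
#   helm template platform/cilium/chart \
#     --set cilium.prometheus.serviceMonitor.enabled=false | no_monitoring_refs   # explicit off
```

The opt-in case is the mirror image: render with the toggles set to true and expect the same grep to find a ServiceMonitor.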
3.10 Visibility
| Value | Where it appears | Who can install it |
|---|---|---|
| listed | Public marketplace card grid | Everyone in the Sovereign |
| unlisted | Not on cards; reachable by direct URL or search | Anyone who knows the name |
| private | Visible only within the Org that owns the Blueprint repo | Only that Org's users |
3.11 Versioning
- Semver (MAJOR.MINOR.PATCH).
- Each release publishes a signed OCI artifact at ghcr.io/openova-io/bp-<name>:<version> (bp- prefix added to make it self-identifying as a Catalyst Blueprint).
- The Blueprint declares which prior versions are upgrade-compatible (upgrades.from).
- Customers pin to a version in their Application's kustomization.yaml. Upgrades are explicit (one-click console, or git push editing the version pin).
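As a small illustration of the naming rule, the helper below builds the OCI reference for a release and rejects anything that is not plain MAJOR.MINOR.PATCH. The function name bp_oci_ref is hypothetical, and rejecting pre-release or build suffixes is an assumption of this sketch rather than a documented Catalyst rule.

```shell
# Print the OCI reference for a Blueprint release; fail on non-semver input.
bp_oci_ref() (
  name=$1
  version=$2
  # Plain MAJOR.MINOR.PATCH without leading zeros; suffixes are out of scope here.
  if ! printf '%s\n' "$version" |
       grep -Eq '^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)$'; then
    echo "not MAJOR.MINOR.PATCH: $version" >&2
    exit 1
  fi
  printf 'ghcr.io/openova-io/bp-%s:%s\n' "$name" "$version"
)

# Example: bp_oci_ref wordpress 1.2.3
```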
3.12 Hard rules for Blueprint authors
| Rule | Why |
|---|---|
| All container images cosigned | Supply-chain security; Kyverno admission policy denies unsigned. |
| All artifacts SBOMed | Compliance (EU CRA, NIS2). |
| No plaintext secrets; use ExternalSecret references | See SECURITY.md. |
| Workload identity via K8s SA TokenReview + Cilium WireGuard | SPIFFE/SPIRE dropped from bootstrap-kit by PR #665; opt-in for cross-Sovereign federation. See SECURITY.md §2. |
| Health endpoints standardized: /healthz (liveness) + /readyz (readiness) | Catalyst observability assumes them. |
| Metrics on /metrics (Prometheus exposition) | Catalyst Grafana stack scrapes them. |
| Logs to stdout, structured JSON | Loki ingests them. |
| Traces via OTel | Tempo ingests them. |
| app.kubernetes.io/* labels set on every resource | Required for Catalyst projector to track. |
| Acceptance tests in tests/ | CI runs them on every PR. |
| Upgrade tests against previous version | Required to declare upgrade compatibility. |
§4 — Chart-level conventions
Sharp edges in the chart-authoring workflow that have already cost real outages. Read this section before declaring "done" on any chart that mutates a long-lived resource.
4.1 Strategy flips on existing Deployments
What goes wrong: chart declares Deployment.spec.strategy.type: Recreate. The cluster already runs a Deployment of the same name created earlier with default RollingUpdate (so spec.strategy.rollingUpdate.maxSurge=25% and maxUnavailable=25% exist on the live object). Flux SSA submits the new manifest with the kustomize-controller field manager. The API server merges, then validates. Validation rejects:
Deployment.apps "<name>" is invalid:
spec.strategy.rollingUpdate: Forbidden:
may not be specified when strategy `type` is 'Recreate'
The Flux Kustomization parks at Ready=False on every reconcile until operator intervention.
Why SSA does this: SSA's contract is "set the fields you declare." It does NOT remove fields owned by other field managers. The pre-existing Deployment was created via kubectl apply (CSA), so kubectl-client-side-apply owns .spec.strategy.rollingUpdate.*. When kustomize-controller flips .spec.strategy.type to Recreate, those rolling-update fields stay on the object.
Why $patch: replace is NOT the answer:
- API strict-decoding rejects it on CREATE: strict decoding error: unknown field "spec.strategy.$patch". This breaks fresh installs.
- Flux SSA rejects it: field not declared in schema.
- It is a runtime directive, not a chart field.
The canonical fix — annotate the Deployment with the Flux force annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalyst-api
  annotations:
    kustomize.toolkit.fluxcd.io/force: enabled
spec:
  strategy:
    type: Recreate
When kustomize-controller's SSA dry-run fails with Invalid, the controller falls back to delete-and-recreate the SINGLE annotated resource. The recreated Deployment has no residual rollingUpdate.* fields.
When you may use this annotation: only on resources that (a) already declare strategy.type: Recreate, OR (b) carry no client traffic, OR (c) are explicitly designed to lose in-process state on every roll. NEVER add to a RollingUpdate resource serving live traffic.
Reference incident: 2026-04-29 — contabo-mkt cluster — catalyst/catalyst-api. Kustomization stuck Ready=False for hours. Fix: kustomize.toolkit.fluxcd.io/force: enabled on products/catalyst/chart/templates/api-deployment.yaml.
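A chart lint for this failure mode could look like the sketch below: it flags a Deployment manifest that declares strategy type Recreate but lacks the force annotation. The function name is hypothetical, and the check is line-based on a single-document manifest, so it is a rough pre-flight signal rather than a YAML-aware validator.

```shell
# Exit 0 when the manifest on stdin declares "type: Recreate" but does NOT
# carry the Flux force annotation; exit 1 otherwise.
recreate_missing_force() (
  manifest=$(cat)
  # Not a Recreate Deployment: nothing to flag.
  printf '%s\n' "$manifest" |
    grep -Eq '^[[:space:]]*type:[[:space:]]*Recreate[[:space:]]*$' || exit 1
  # Annotation already present: nothing to flag.
  if printf '%s\n' "$manifest" |
       grep -Fq 'kustomize.toolkit.fluxcd.io/force: enabled'; then
    exit 1
  fi
  exit 0
)

# Assumed usage against one rendered template:
#   recreate_missing_force < rendered-api-deployment.yaml && echo "add force annotation"
```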
4.2 Other chart fields that collide on apply
Same fix applies to each — annotate with kustomize.toolkit.fluxcd.io/force: enabled, let Flux recover via delete-and-recreate when SSA dry-run fails.
| Resource kind | Field that triggers an Invalid merge | Notes |
|---|---|---|
| Deployment | spec.strategy.type Recreate ↔ RollingUpdate | §4.1 |
| Deployment | spec.selector.matchLabels change | Selector is immutable post-create. Must recreate. |
| Service | spec.clusterIP (None ↔ value) | Immutable. Must recreate. |
| Service | spec.type ClusterIP ↔ NodePort ↔ LoadBalancer | Some transitions invalid. |
| PersistentVolumeClaim | spec.accessModes change after binding | Immutable post-bind. Recreate would lose data: DO NOT add force annotation; provision a new PVC under a new name and migrate. |
| StatefulSet | spec.serviceName, spec.selector | Immutable. Must recreate (loses pod identity). Plan migrations carefully. |
| Job | spec.template.* after create | Immutable. Recreation is the only path. |
For PVCs and StatefulSets: NEVER add the Flux force annotation as a default. Data loss is the failure mode.
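The rule in this section reduces to a lookup by resource kind. The sketch below encodes it as a shell case statement; the function name and the two result labels (force-ok, no-force) are invented for illustration.

```shell
# Map a resource kind to the default stance on the Flux force annotation.
force_policy() {
  case "$1" in
    Deployment|Service|Job)
      echo "force-ok" ;;      # recreate loses no persistent data
    PersistentVolumeClaim|StatefulSet)
      echo "no-force" ;;      # data loss or lost pod identity: migrate deliberately
    *)
      echo "unknown" ;;       # not covered by the table above
  esac
}

# Example: force_policy PersistentVolumeClaim
```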
4.3 Authoring discipline checklist
Before declaring "done" on any chart that touches a long-lived resource:
- Run the chart's manifest through kubectl apply --dry-run=server against an EMPTY namespace. Must succeed (no $patch: in spec).
- If the resource type appears in §4.2, ALSO run against a namespace where a PRIOR shape exists. Must succeed; if it fails, add the Flux force annotation AND the integration test.
- Verify kustomization.yaml references all template files.
- If the resource carries client traffic, document the recreate blast radius in the chart's leading comment.
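The kustomization.yaml item in the checklist can be automated with a few lines of shell. The sketch below lists YAML files in a directory that kustomization.yaml never mentions. The function name is hypothetical and the match is a plain substring test, so a file named api.yaml would count as referenced if catalyst-api.yaml is listed.

```shell
# Print every *.yaml file in a directory that kustomization.yaml does not mention.
unreferenced_templates() (
  dir=$1
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue                       # directory has no YAML files
    base=$(basename "$f")
    [ "$base" = "kustomization.yaml" ] && continue
    # Substring match: good enough for a sketch, not a YAML-aware check.
    grep -Fq -- "$base" "$dir/kustomization.yaml" || printf '%s\n' "$base"
  done
)

# Assumed usage: fail the check when anything is printed.
#   [ -z "$(unreferenced_templates path/to/dir)" ] || exit 1
```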
4.4 Service-name-mismatch in env-var defaults
When default URL is http://svc.ns.svc... but the real Service is svc-bp-svc.ns.svc...:
Fix: helm template and grep the real Service name. Wire env-var default off the rendered output, not the assumed shape.
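A rough way to do that grep mechanically is sketched below: one helper pulls Service names out of rendered output, another checks whether the first DNS label of a default URL is among them. Both function names are hypothetical, and the awk parsing assumes the usual Helm layout in which kind appears before metadata and metadata.name is indented by two spaces.

```shell
# Print metadata.name of every Service in a rendered multi-document manifest (stdin).
rendered_service_names() {
  awk '
    /^---/       { kind = ""; in_meta = 0; next }   # new YAML document
    /^kind:/     { kind = $2 }
    /^metadata:/ { in_meta = 1; next }
    /^[^ \t]/    { in_meta = 0 }                    # any other top-level key ends metadata
    in_meta && /^  name:/ && kind == "Service" { print $2 }
  '
}

# Exit 0 when the first DNS label of the URL host is a rendered Service name.
url_targets_rendered_service() (
  url=$1
  host=${url#*://}        # strip the scheme
  host=${host%%[:/]*}     # strip port and path
  svc=${host%%.*}         # first label = Service name
  rendered_service_names | grep -Fxq -- "$svc"
)

# Assumed usage:
#   helm template <release> <chart> | url_targets_rendered_service "http://svc.ns.svc.cluster.local"
```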
§5 — Demo / operator walks
The canonical deterministic 2-phase walk that the operator follows, driven by DOD.md. It is the operator-facing companion to tests/dod/dod_test.go (the Go test that drives the same flow non-interactively when HETZNER_TEST_TOKEN is populated).
5.1 Pre-flight
| Item | Notes |
|---|---|
| Hetzner Cloud project + API token (Read+Write) + project ID | ~€31/mo at hourly billing, ~€0.05/h while up |
| SSH public key | Generate fresh if needed: ssh-keygen -t ed25519 -C "sovereign-admin@<sov>" -f ~/.ssh/<sov>_sovereign_admin |
| Pool subdomain reserved | Pick t<NN> under omani.works (or omantel.biz if LE-rate-limited). PDM checks availability, on commit creates per-Sovereign zone + parent-zone NS delegation |
| Catalyst-Zero (mothership) login | Confirm before running. Mothership is the OpenOva-run Catalyst-Zero |
| kubectl context to mothership | For pre-flight verification only |
5.2 The walk — Phase 0 + Phase 1 deterministic test
Per DOD.md, every walk must move at least one of the 5 inseparable pillars:
- Marketplace + voucher onboarding (Phase 0 + Phase 1 a–c)
- Multi-region BCP topology choice at signup (Phase 1 b)
- Two independent CNPG clusters + region-kill failover (Phase 1 b + orthogonal D31)
- Sandbox + auto-mounted openova-sandbox-mcp with full org knowledge (Phase 2 a–e)
- Sovereign independence post-bp-self-sovereign-cutover (Principle #11 + ADR-0002)
Phase 0 — voucher issuance + redeem preview (mothership BSS):
- Sovereign-admin issues voucher: navigate to https://console.<sovereign-fqdn>/bss (the BSS menu lives inside the operator console, NOT at the legacy admin.<sovereign-fqdn> URL, which has been dead since the BSS migration). Sign in with sovereign-admin credentials. Billing → Vouchers → New Voucher:

  | Field | Value |
  |---|---|
  | Code | e.g. T42-DEMO-100 |
  | Credit (OMR) | 100 |
  | Description | DoD demo voucher |
  | Active | true |
  | Max redemptions | 1 |

  Click Save. The UI sends POST /billing/vouchers/issue.

- Redeem preview: open https://marketplace.t<NN>.omani.works/redeem/?code=T42-DEMO-100 in a fresh browser session. The unauthenticated page POSTs to /api/billing/vouchers/redeem-preview and renders the credit metadata. "Sign up to redeem" routes to /plans with the code stashed in localStorage.
Phase 1 — tenant signup + Org creation + first App (tenant-facing):
- Tenant signs up via email/magic-link or Google OAuth
- Catalyst auto-creates an Organization (default slug <orgslug>.omani.homes per DOD.md)
- Voucher applied at first checkout via POST /billing/checkout with promo_code: atomic insert into promo_redemptions, increment of times_redeemed, positive entry in credit_ledger
- Tenant lands in marketplace; credit balance shown in top-right wallet
- Tenant creates an Environment (e.g. production)
- Tenant installs first Application (e.g. bp-wordpress). The App install consumes from credit_ledger; remaining balance shown
- Tenant reaches the App URL (e.g. https://<orgslug>-production-wordpress.omani.homes)
Phase 2 — Sandbox + MCP (Pillar 4):
openova-sandbox-mcp auto-mounts. Agent is claude-code with full Org knowledge. Operator verifies via XHR + screenshot.
5.3 Verification
Verify voucher consumption:
TOKEN='<sovereign-admin JWT>'   # placeholder: substitute the real JWT (quoted so the shell does not treat < > as redirects)
curl -s -H "Authorization: Bearer $TOKEN" \
"https://api.t<NN>.omani.works/billing/vouchers/list" \
| jq '.[] | select(.code=="T42-DEMO-100")'
# Expected: { "times_redeemed":1, "max_redemptions":1, ... }
Verify App reachable:
curl -sI "https://<orgslug>-production-wordpress.omani.homes"
# Expected: HTTP/2 200 (or 302 to login), Let's Encrypt subject CN matching FQDN
5.4 Final step — append VALIDATION-LOG entry
cd /home/openova/repos/openova
git checkout main && git pull origin main
cat >> docs/archive/validation-log.md <<'EOF'
## Pass NNN (YYYY-MM-DD) — DoD MET — t<NN>.omani.works
**Operator:** <name>
**Sovereign FQDN:** t<NN>.omani.works
**Hetzner region:** fsn1
**Total wall-clock:** ~MM minutes
**Voucher exercised:** T<NN>-DEMO-100 (100 OMR, 1/1 redeemed)
**App installed:** bp-wordpress at <orgslug>-production-wordpress.omani.homes
DoD Met:
- [x] Wizard provisioned t<NN>.omani.works in ~12 min
- [x] DNS authoritative on per-Sovereign PowerDNS zone
- [x] TLS auto-issued via cert-manager + Let's Encrypt
- [x] sovereign-admin logged into console.t<NN>.omani.works
- [x] Voucher issued via /bss
- [x] Tenant redeemed at marketplace.t<NN>.omani.works/redeem/?code=...
- [x] Tenant created Org + Env, installed first App, App URL reached HTTP/2 200
EOF
git add docs/archive/validation-log.md
git -c user.name="hatiyildiz" -c user.email="269457768+hatiyildiz@users.noreply.github.com" \
commit -m "docs(validation-log): DoD MET — t<NN>.omani.works"
git push origin main
(Per ~/.claude/CLAUDE.md: NEVER close issues — only the user closes after verification. Use Refs #N in PR bodies, not Closes #N, except for pure CI-gate / docs-only PRs.)
§6 — Failover recovery
For multi-region active-hotstandby Sovereigns and Applications (Pillar 3).
6.1 Region-kill canonical test (Pillar 3)
The deterministic failover test for two independent CNPG clusters:
- Place a write into the primary CNPG cluster (synchronous replication, remote_apply, PR #2071)
- Kill the primary region (Hetzner API: detach LB, drop firewall, terminate CP node)
- Promote the replica via Continuum CR (PR #2072, #2074)
- Verify the write made it across: zero-tx-loss
- Reverse: promote original primary back when region recovers
Test harness lives at the D31 acceptance test (PR #2075).
6.2 Continuum CR + lease witness
Continuum (group dr.openova.io/v1) orchestrates switchover with a Cloudflare-KV or DNS-quorum lease witness (anti-split-brain). Schema in products/catalyst/chart/crds/continuum.yaml. Controller lives in EPIC-6 (#1101).
Required pattern: lease-based failover with cloud-witness. DMZ data plane over public IPs with WireGuard encryption (never RFC1918 tunnels depending on cloud-provider VPC peering).
6.3 cnpg-pair Blueprint (PR #2071)
bp-cnpg-pair ships two independent CNPG clusters across two regions over Cilium ClusterMesh, with synchronous replication (remote_apply). Cross-region pairing via ReplicaCluster over ClusterMesh. CRD: cnpgpair.dr.openova.io/v1 in products/catalyst/chart/crds/cnpgpair.yaml.
Provisioning generalised beyond WP-only by PR #2073 (feat(provisioning): generalize bp-cnpg-pair install path).
6.4 Inter-region transport
Inter-region = DMZ WireGuard over PUBLIC IPs ALWAYS. Cilium ClusterMesh apiserver via LoadBalancer (NEVER NodePort). Provider-mix canonical (different regions can be different providers).
6.5 Existing-Sovereign migration
There is no in-place recovery for a cluster whose Flux controllers have been deleted (see §2.7). For zero-tx-loss claims to hold, validate on the topology you claim: never report a multi-region pass against a single-region provision.
§7 — Troubleshooting matrix
Common failure modes + first-look diagnostics, condensed from 18 documented incidents. Decision-tree shape: walk top-to-bottom, the first match wins.
7.1 Provisioning failures (Phase 0)
| Symptom | Most likely cause | Recovery |
|---|---|---|
| tofu plan fails with `Invalid value for variable \| The given value "cpx32" is not valid for variable "control_plane_size"` | catalyst-api image predates fix c6cbfe68 | |
| tofu apply fails with hcloud_ssh_key: public_key field is invalid | Malformed ed25519 key pasted into wizard | Re-generate (ssh-keygen -t ed25519 ...), copy single line verbatim, re-run wizard |
| tofu apply fails with name is already used (uniqueness_error) | Prior tofu apply partial, state file lost on Pod restart; orphan Hetzner resources remain | Run scripts/operator-recover-sovereign.sh <fqdn> --apply (see §2.2), then re-run wizard with same FQDN |
| tofu apply fails with dynadot API returned ... from a null_resource.dns_pool | Old catalyst-api build with stale null_resource | Deploy newer catalyst-api image at or after 330211d2 |
| tofu plan 403 Forbidden from hcloud | Token has Read scope only, or expired | Generate Read+Write token; re-run wizard |
| tofu plan quota exceeded | Hetzner project default limits (typically 10 servers, 1 LB) | Open Hetzner support ticket; re-run when granted |
| tofu apply hangs at Still creating... >10 min | Hetzner regional capacity transient | Wait 15 min total; if stuck, cancel + re-run in a different region |
| PDM 409 conflict on subdomain check | Another Sovereign holds that subdomain in PDM | Pick a different name OR run §2.2 if leftover from failed run, then re-run with same name |
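The first-match-wins walk over this table maps naturally onto a shell case statement, which also evaluates patterns top to bottom. The sketch below classifies a single error line; the function name and the short labels it prints are invented for illustration, and the match strings are copied from the Symptom column.

```shell
# Classify one provisioning error line; the first matching pattern wins.
classify_provisioning_error() {
  case "$1" in
    *'is not valid for variable "control_plane_size"'*)
      echo "stale catalyst-api image (pre-c6cbfe68)" ;;
    *'public_key field is invalid'*)
      echo "malformed SSH public key" ;;
    *'uniqueness_error'*)
      echo "orphan Hetzner resources (run 2.2 recovery)" ;;
    *'dynadot API returned'*)
      echo "stale null_resource (pre-330211d2)" ;;
    *'403 Forbidden'*)
      echo "token scope or expiry" ;;
    *'quota exceeded'*)
      echo "Hetzner project quota" ;;
    *'Still creating'*)
      echo "regional capacity" ;;
    *)
      echo "no 7.1 match" ;;
  esac
}

# Example: classify_provisioning_error "Error: name is already used (uniqueness_error)"
```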
7.2 Cloud-init failures (Phase 0 → Phase 1 bridge)
| Symptom | Most likely cause | Recovery |
|---|---|---|
| Node up but every pod Pending with 0/1 nodes are available: 1 node(s) had untolerated taint; Flux Kustomizations never go Ready | CNI bootstrap deadlock: cloud-init installed Flux BEFORE Cilium (pre-fix e571ec7a) | Deploy newer catalyst-api at or after 54872009; run §2.2 + re-provision |
| cilium-operator Pending or crashlooping with failed to dial kube-apiserver | k8sServiceHost=<sovereign-fqdn> cannot yet resolve at install time (pre-fix 54872009) | Same: image must be at or after 54872009 |
7.3 Phase 1 failures (Flux + bp-* HelmReleases)
| Symptom | Most likely cause | Recovery |
|---|---|---|
| Flux event: existing namespace "kube-system" is conflicting with another resource that has the same name | Bootstrap-kit kustomize merge had kube-system Namespace declared twice (pre-fix 2022e1af) | Fix is in main; Flux picks up on next reconcile interval. If pinned to old SHA, edit GitRepository spec.ref.branch |
| Flux event: no matches for kind "ProviderConfig" in version "hetzner.crossplane.io/v1beta1" | Single Kustomization tried to apply both Crossplane (CRDs) AND Hetzner Compositions (CRs). Fix 34c8de84 split into two Kustomizations | Confirm cloud-init template post-34c8de84; re-provision |
| HelmRelease: failed to get authentication secret 'flux-system/ghcr-pull': secrets "ghcr-pull" not found | Pre-fix dddbab4b cloud-init didn't create durably | Re-provision against current main. On a still-up Sovereign: kubectl -n flux-system create secret generic ghcr-pull --from-literal=token=... |
| HelmRelease: `failed to authorize: 401 Unauthorized \| ghcr.io/openova-io/bp-cilium` | GHCR token expired or wrong scope | |
| HelmRelease: error validating ... no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" | bp-* chart ships ServiceMonitor ON by default; CRD not yet registered. See §3.9 | Edit bp-* HelmRelease values: observability.enabled: false; flux reconcile helmrelease. Long-term: bp-* chart bumps with default-off (already shipped in current bp-*:1.1.1+) |
| HelmRelease Ready=True but no upstream pods; namespace empty except Helm release secret | Hollow umbrella chart: dependencies declared but upstream subchart not packaged into charts/. Pre-fix 43aff202 | Re-run blueprint-release workflow on the chart's tag; the 4 guards (build/package/push/pull) will fail loudly. Fix the upstream pin + re-tag. See §3.8 |
| Wizard goes blank or "Deployment not found" after catalyst-api Pod restart | Pre-fix 418cead0 catalyst-api wrote deployments to emptyDir; Pod restart wiped them | Confirm catalyst-api image at or after 418cead0; PVC mount in HelmRelease values. Orphans may exist; purge per §2.2 |
| SSE stream closes within seconds; admin UI shows zero components | catalyst-api helmwatch loop terminated at 0 HelmReleases (first-seen-gate bug) | Refresh page after Phase 1 30+s in; wizard falls back to REST poll. Long-term: deploy catalyst-api with the gate fix |
| Wizard SSE shows flux-bootstrap complete but per-component grid stays empty; catalyst-api logs failed to load Sovereign kubeconfig: connection refused | Cloud-init POST-back kubeconfig not implemented (issue #183) | Interim: SSH to CP, replace 127.0.0.1 with LB IP in /etc/rancher/k3s/k3s.yaml, save as sovereign-<fqdn>-kubeconfig Secret in catalyst ns |
| Admin UI shows every app card as "INSTALLED" even when underlying HelmReleases are still reconciling | Admin UI read deployment.status instead of live helmwatch SSE (fix 64d7de97) | Confirm catalyst-ui image at or after 64d7de97 |
| Certificate/wildcard reports too many certificates already issued for "<sovereign-fqdn>" | Let's Encrypt rate limit: 5 per registered domain per week | Switch ClusterIssuer to letsencrypt-staging; wait for rate-limit expiry; switch back |
| Phase 1 watch banner: 0 HelmReleases in 15m0s | Flux on new Sovereign isn't materialising bootstrap-kit | Walk §2.8 playbook (GitRepository, Kustomization, controller logs, manual reconcile) |
7.4 Failure decision tree
flowchart TD
Start[Provisioning failed] --> Q1{Did wizard reach<br/>tofu-plan?}
Q1 -- No --> Q2{Step 6 Domain<br/>failed?}
Q1 -- Yes --> Q3{tofu-apply succeed?}
Q2 -- PDM 409 --> C18[7.1 — PDM subdomain conflict]
Q2 -- Other --> Healthcheck[Re-check pre-flight D PDM]
Q3 -- "Yes" --> Q4{cloud-init<br/>finish 5min?}
Q3 -- "Validation" --> Q5{What?}
Q3 -- "Runtime" --> Q6{What?}
Q5 -- "cpx*" --> C1[7.1 — catalyst-api stale image]
Q5 -- "ssh key" --> C2[7.1 — invalid public key]
Q6 -- "uniqueness" --> C3[7.1 — orphans, run §2.2]
Q6 -- "Dynadot" --> C4[7.1 — null_resource stale image]
Q4 -- "Flux Pending forever" --> C5[7.2 — CNI bootstrap deadlock]
Q4 -- "cilium-operator Pending" --> C6[7.2 — k8sServiceHost wrong]
Q4 -- "Yes" --> Q7{bootstrap-kit<br/>Ready?}
Q7 -- "kube-system conflict" --> C7[7.3 — kustomize merge]
Q7 -- "ProviderConfig CRD missing" --> C8[7.3 — Crossplane CRD ordering]
Q7 -- "Yes" --> Q8{bp-* HelmReleases<br/>Ready?}
Q8 -- "ghcr-pull missing" --> C9[7.3 — cloud-init missed Secret]
Q8 -- "401 from GHCR" --> C10[7.3 — token expired]
Q8 -- "ServiceMonitor kind" --> C11[7.3 — observability toggle]
Q8 -- "Hollow chart" --> C12[7.3 — umbrella conversion]
Q8 -- "Yes" --> Q9{Admin UI renders?}
Q9 -- "Deployment not found" --> C13[7.3 — PVC missing pre-418cead0]
Q9 -- "SSE terminates 0 comp" --> C14[7.3 — helmwatch gate]
Q9 -- "kubeconfig refused" --> C15[7.3 — cloud-init POST-back]
Q9 -- "All INSTALLED falsely" --> C16[7.3 — admin UI fiction]
Q9 -- "ACME rate limit" --> C17[7.3 — LE 5/week]
Q9 -- "Yes" --> Done([Sovereign live — Day-1])
§8 — Doc-integrity audit cadence
Source: previously docs/AUDIT-PROCEDURE.md (merged here on 2026-05-20).
This section is the procedure for performing a documentation-integrity validation pass on the canonical Catalyst docs and component READMEs. It is on-demand only — there is no scheduled audit loop.
For invocation via Claude Code, see the audit-catalyst-docs skill.
8.1 When to run
- After any architectural change that touches multiple docs (component additions/removals, terminology shifts, structural model changes).
- Before tagging a public release of the canonical docs.
- Before adding a new Sovereign-curated catalog (catalog-sovereign Gitea Org), to confirm the upstream canon is consistent.
- On request, ad-hoc, when a contributor questions whether a doc claim is current.
Never run as a scheduled background loop. Past loops over-anchored on incorrect models (see docs/archive/validation-log.md Pass 103); text-shape consistency is not the same as architectural soundness.
8.2 What the audit verifies
The audit cross-checks the canonical docs and component READMEs against five categories of anchors:
- Banned-term hygiene: banned terms in GLOSSARY.md §"Banned terms" must not appear (in non-exempt contexts) anywhere in the canon.
- Naming canonicality: env_type 3-char form, DNS pattern split (control-plane vs Application), API group split (catalyst.openova.io vs compose.openova.io), JetStream subject prefix.
- Structural invariants: App = Gitea Repo (the unified rule from Pass 103), branches develop/staging/main map to envs, 5 Gitea Orgs convention (catalog, catalog-sovereign, per-Catalyst-Organization, system).
- Component-count consistency: number of platform/<x>/ folders matches the count anchored across CLAUDE.md, the technology forecast / roadmap, BUSINESS-STRATEGY.md, and the implicit table sums.
- Defense-in-depth architectural anchors: load-bearing decisions (OpenBao independent-Raft per region, SeaweedFS as unified S3 encapsulation, Catalyst-as-platform / OpenOva-as-company, Valkey-NOT-control-plane, no-bidirectional-Gitea-mirror) must each appear consistently across at least 4 representational levels.
8.3 The 13 acceptance greps
Run from the repo root (/home/openova/repos/openova). All should produce zero output unless an exemption explanation is included.
# 1. Banned terms (excluding contextual exemptions noted in GLOSSARY)
for term in 'tenant' 'Workspace' 'Lifecycle Manager' 'bootstrap wizard' 'Backstage' \
'Synapse' 'Fuse' 'Module' 'Template' 'Operator' 'Client' 'Instance'; do
grep -rni "\\b$term\\b" docs/ platform/*/README.md products/*/README.md core/README.md README.md CLAUDE.md \
| grep -v 'GLOSSARY.md' | grep -v 'validation-log.md'
done
# 2. env_type long-form (must be 0)
grep -rnE 'acme-staging|acme-production|acme-development' docs/ platform/*/README.md products/*/README.md README.md CLAUDE.md \
| grep -v validation-log
# 3. JetStream subject prefix (must show only NAMING §11.2 occurrence)
grep -rnE 'ws\.\{?(env|org)' docs/ARCHITECTURE.md docs/GLOSSARY.md docs/SECURITY.md
# 4. API group split (count must be >=7 across Catalyst CRDs + Crossplane XRDs)
grep -rnE 'compose\.openova\.io/v1alpha1|catalyst\.openova\.io/v1alpha1' \
docs/ARCHITECTURE.md docs/SECURITY.md docs/RUNBOOKS.md \
core/README.md platform/crossplane/README.md | wc -l
# 5. Subsection ordering monotonicity
grep -nE '^### 7\.[0-9]' docs/ARCHITECTURE.md
grep -nE '^### 2\.[0-9]|^### 11\.[0-9]' docs/ARCHITECTURE.md
grep -nE '^### 5\.[0-9]' docs/SECURITY.md
# Manual check: numbers must be strictly increasing.
# 6. Old App-as-folder model (must be 0 outside validation-log)
grep -rnE 'Environment Gitea repo|/{org}/{org}-{env_type}|<org>/<org>-<env_type|per-Environment Gitea repos' \
docs/*.md README.md CLAUDE.md | grep -v validation-log
# 7. Branches-map-to-envs anchor present in 4+ docs
grep -lE 'develop`/`staging`/`main|develop/staging/main|branches.*map.*env' \
docs/GLOSSARY.md docs/ARCHITECTURE.md docs/DOD.md
# 8. 5 Gitea Orgs convention (must be in GLOSSARY + ARCHITECTURE + RUNBOOKS)
grep -lE 'catalog-sovereign|`system` Gitea Org|five conventional Gitea Orgs|5 conventional Gitea Orgs' \
docs/GLOSSARY.md docs/ARCHITECTURE.md docs/RUNBOOKS.md
# 9. Component count consistency across all anchors (no stale "53 components" except validation-log)
grep -rnE '\b53 components\b|\b53 curated\b|\b53-component\b|\ball 53\b|\b53 platform\b|\b53 folders\b' \
docs/*.md README.md CLAUDE.md | grep -v validation-log
ls -d platform/*/ | wc -l # must match the anchor
# 10. SeaweedFS encapsulation (no MinIO except intentional explanation in roadmap/forecast doc)
grep -rinE '\bminio\b' docs/*.md README.md CLAUDE.md core/README.md products/*/README.md platform/*/README.md \
| grep -v validation-log | grep -v 'platform/seaweedfs/'
# 11. OpenBao independent-Raft (must appear in 5+ representational levels)
grep -lE 'INDEPENDENT, NOT STRETCHED|independent Raft cluster|no stretched cluster|Independent OpenBao Raft' \
docs/SECURITY.md docs/ARCHITECTURE.md docs/GLOSSARY.md docs/BUSINESS-STRATEGY.md
# 12. Catalyst-as-platform anchor (must appear in GLOSSARY + README + BUSINESS-STRATEGY)
grep -lE 'Company vs.*Platform|Catalyst is the open|OpenOva.*the company|Catalyst.*the platform itself' \
docs/GLOSSARY.md README.md docs/BUSINESS-STRATEGY.md
# 13. DNS pattern split (NAMING + multiple consumers)
grep -nE '\{component\}\.\{location-code\}\.\{sovereign-domain\}|\{app\}\.\{environment\}\.\{sovereign-domain\}' \
docs/ARCHITECTURE.md
grep -lE '<location-code>\.<sovereign-domain>|<env>\.<sovereign-domain>' \
docs/RUNBOOKS.md platform/llm-gateway/README.md platform/valkey/README.md
8.4 Deep-read rotation
After greps, deep-read one canonical doc + one component README per pass. Rotate through the canon and the 56 platform components + 7 products (catalyst, cortex, axon, fingate, fabric, relay, specter) over time. The next-most-stale entry should be the target.
The deep-read confirms the doc's known anchors are present and consistent with the rest of the canon. For each:
- Read the doc end-to-end.
- Check known fix-trajectory anchors (see docs/archive/validation-log.md for what was previously fixed in that file).
- Cross-check at least 2 other docs the deep-read target references, looking for bidirectional consistency.
- Verify the 5 invariants (§8.2) hold.
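Picking the next-most-stale entry is a one-line sort once last-read dates are written down. The sketch below assumes a hypothetical ledger of "YYYY-MM-DD path" lines on stdin; no such file is described in this runbook, so treat both the format and the function name as placeholders.

```shell
# Print the path whose last deep-read date is oldest.
# stdin: one "YYYY-MM-DD <path>" line per doc or component README.
next_deep_read_target() {
  # ISO dates sort correctly as plain text; LC_ALL=C keeps ordering deterministic.
  LC_ALL=C sort -k1,1 | head -n 1 | cut -d ' ' -f 2-
}

# Example:
#   printf '%s\n' '2026-04-02 docs/SECURITY.md' '2026-03-15 platform/valkey/README.md' |
#     next_deep_read_target
```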
8.5 Output
Append a numbered Pass entry to docs/archive/validation-log.md describing:
- Date, pass number, target doc + target component
- Acceptance grep results (clean / drift)
- Deep-read findings
- Any architectural anchors verified or flagged
- If drift: what was fixed and the new anchor
If clean: short entry confirming clean. If drift: longer entry documenting the fix and a Lesson if the drift represents a recurring pattern.
Commit message format: docs(pass-N): <target-doc> <ordinal>-cycle + <component> <ordinal>-cycle <clean|fixed>. Commit as hatiyildiz per the repo's git-identity convention.
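To keep the format consistent across passes, the message can be assembled by a small helper. This is a sketch with a hypothetical function name; the argument order is an assumption, and ordinals are passed in as already-formatted strings such as 3rd.

```shell
# Build a pass commit message:
#   docs(pass-N): <target-doc> <ordinal>-cycle + <component> <ordinal>-cycle <clean|fixed>
pass_commit_message() (
  pass=$1
  doc=$2
  doc_cycle=$3
  component=$4
  component_cycle=$5
  result=$6
  case "$result" in
    clean|fixed) ;;
    *) echo "result must be clean or fixed: $result" >&2; exit 1 ;;
  esac
  printf 'docs(pass-%s): %s %s-cycle + %s %s-cycle %s\n' \
    "$pass" "$doc" "$doc_cycle" "$component" "$component_cycle" "$result"
)

# Example: pass_commit_message 104 ARCHITECTURE.md 3rd cilium 2nd clean
```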
8.6 What this audit does NOT do
- Architectural review. Text-shape consistency does not validate that the architecture is right. Architectural review is a separate, complementary discipline. See Pass 103 and Lesson #21.
- Code review. Most code is design-stage per STATUS.md. Code review is a separate concern.
- Compliance review. Mappings to PSD2/DORA/NIS2/SOX live in bp-specter's Compliance Agent's runtime evaluation, not in doc audit.
- Security review. Security review is the /security-review skill's domain.
§9 — Bring up a Sovereign (canonical phase walkthrough)
Source: previously docs/SOVEREIGN-PROVISIONING.md (merged here on 2026-05-20).
How to provision a new Sovereign — a self-sufficient deployed instance of Catalyst — from inputs to Day-2 steady state. Defer to GLOSSARY.md for terminology and ARCHITECTURE.md for the model. The operator wizard procedure for the most-tested (Hetzner) path is in §1 above; this section is the complete provider-agnostic phase narrative with multi-region, air-gap, migration, and decommission.
The implementation reflects the deployed shape — the Go provisioner, OpenTofu module, 12 G2 wrapper Helm charts (the original 11 plus bp-powerdns at #167), the per-Sovereign PowerDNS zone model (#167/#168), and the pool-domain-manager (PDM) with registrar adapters (#163/#170) all exist in this monorepo today (per STATUS.md §7). End-to-end DoD against a real Hetzner project tracks Group M of §11 below. Catalyst-Zero (Contabo k3s, namespace catalyst) is the running catalyst-provisioner today.
9.1 Inputs
| Input | Required | Notes |
|---|---|---|
| Cloud provider | Hetzner / AWS / GCP / Azure / OCI / Huawei | Hetzner is the most-tested path. |
| Cloud credentials | Provider API token | Used by OpenTofu (one-shot bootstrap) and Crossplane (ongoing). |
| Sovereign name | e.g. omantel, bankdhofar | Slug, lowercase, 3–32 chars. |
| Sovereign domain | e.g. omantel.omani.works, acme.bank.com | Three modes (#169): pool (subdomain under omani.works / openova.io, allocated by pool-domain-manager); byo-manual (customer pastes OpenOva NS records into their own registrar UI); byo-api (customer pastes a registrar API token, OpenOva flips NS via the registrar adapter). Supported registrars for byo-api: Cloudflare, Namecheap, GoDaddy, OVH, Dynadot (#170). |
| Region(s) | 1+ | Single-region simplest for SME; 2+ for regulated/HA. |
| Building blocks per region | typically mgt + rtz (+ dmz) | At minimum mgt + rtz. |
| Keycloak topology | per-organization (SME) / shared-sovereign (corporate) | Determines Keycloak deployment shape. |
| Federation IdP (optional) | Azure AD / Okta / Google / etc. | For corporate; SME tier defers to per-Org Org-IdP federation. |
| TLS strategy | Let's Encrypt / cert-manager / corporate CA | cert-manager-managed, Let's Encrypt by default. |
| Object storage | Cloud-provider native | Used as the cold-tier backend behind SeaweedFS (which is the in-cluster S3 encapsulation layer that all consumers — Velero, Harbor, CNPG WAL, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg — talk to). |
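The Sovereign name constraint (slug, lowercase, 3–32 chars) is easy to pre-validate before the wizard is run. The sketch below is one possible reading: the allowed character set (letters, digits, hyphen, starting with a letter) is an assumption, since the table only states lowercase and the length bounds, and the function name is hypothetical.

```shell
# Succeed when the name is a lowercase slug of 3 to 32 characters.
# Assumed alphabet: a-z, 0-9, hyphen; first character must be a letter.
valid_sovereign_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{2,31}$'
}

# Example: valid_sovereign_name omantel && echo ok
```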
9.2 Provisioning runs from catalyst-provisioner
The bootstrap is performed by catalyst-provisioner.openova.io, an always-on provisioning service operated by OpenOva. It is not part of any Sovereign at runtime — once a Sovereign is up, it is fully self-sufficient.
Why a permanent provisioner instead of "boot from your laptop":
- OpenTofu state must be durably stored — keeping it on a single person's laptop is fragile and a security risk.
- Provider credentials are scoped, stored in OpenBao on the provisioner, and never leave it.
- New Sovereigns can be created without a manual installer dance — the same machinery serves the next Sovereign provisioning request, regardless of who initiates it.
A self-host route exists for organizations that want zero OpenOva involvement: catalyst-provisioner is itself a Blueprint (bp-catalyst-provisioner) and can be deployed in a customer's own infrastructure. From there it bootstraps further Sovereigns. This is the air-gap path.
9.3 Phase 0 — Bootstrap
The implementation maps cleanly onto two artifacts in this monorepo:
| Step | Lives in | What runs |
|---|---|---|
| 1. Wizard input → tofu vars | products/catalyst/bootstrap/api/internal/provisioner/ | Go service writes tofu.auto.tfvars.json from validated wizard input, runs tofu init && tofu plan && tofu apply -auto-approve against the canonical OpenTofu module, streams stdout/stderr lines to the wizard via SSE. No cloud APIs called from Go (per PRINCIPLES.md #3). |
| 2. Cloud resources | infra/hetzner/main.tf | OpenTofu provisions: hcloud_network (10.0.0.0/16) + subnet (10.0.1.0/24), hcloud_firewall (80/443/6443/ICMP open; 22 closed by default; sovereign-admin adds source-CIDR rule via Crossplane post-bootstrap), hcloud_ssh_key from wizard input, 1 control-plane server (or 3 if ha_enabled) on Ubuntu 24.04 with cloud-init, worker_count worker servers, hcloud_load_balancer (lb11) targeting NodePorts 31080/31443. DNS is authoritative on PowerDNS (#167/#168): the per-Sovereign PowerDNS zone is created by pool-domain-manager (PDM) /v1/commit once the LB IP is known; for pool sovereigns PDM also writes the parent-zone delegation, and for byo-api Sovereigns the matching registrar adapter (Cloudflare / Namecheap / GoDaddy / OVH / Dynadot, #170) flips the NS records at the customer's registrar. byo-manual Sovereigns instead show the OpenOva NS list in the wizard and poll until the customer's own registrar propagates the delegation. |
| 3. k3s + Flux bootstrap | infra/hetzner/cloudinit-control-plane.tftpl | cloud-init on the control-plane node installs k3s v1.31.4+k3s1 with --flannel-backend=none --disable-network-policy --disable=traefik --disable=servicelb --disable=local-storage --tls-san=<sovereign-fqdn>, then installs Flux v2.4.0 core, then applies the Flux GitRepository + Kustomization pointing at clusters/<sovereign-fqdn>/ in the public OpenOva monorepo. From this point Flux owns the cluster. Workers join via cloudinit-worker.tftpl using the project-derived k3s_token. |
| 4. Bootstrap-kit install | clusters/<sovereign-fqdn>/ (Flux-reconciled) | Flux installs the 12 G2 wrapper Helm charts (each a bp-<name>:<semver> OCI artifact published by .github/workflows/blueprint-release.yaml) in dependency order: cilium → cert-manager → flux (host-level reconciler for the cluster's own Kustomizations) → crossplane → sealed-secrets (transient) → spire (server + agent; opt-in post PR #665) → nats-jetstream → openbao (3-node Raft) → keycloak (per topology choice) → gitea (with public Blueprint mirror) → bp-powerdns (per-Sovereign authoritative zone, #167) → bp-catalyst-platform (umbrella). |
| 5. Crossplane adoption | Crossplane Compositions in clusters/<sovereign-fqdn>/ | Crossplane adopts management of all infrastructure created by OpenTofu in step 2; sealed-secrets is decommissioned in favour of ESO + OpenBao for day-2 secret distribution; further DNS records (gitea/admin/api/harbor) are written by external-dns against the per-Sovereign PowerDNS zone via the PowerDNS REST API (NOT against the registrar). Phase 1 begins (see §9.4). |
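Step 1 of the table can be sketched as follows. This is a minimal illustration in Python for brevity (the actual provisioner is a Go service); the function names and the example variable keys are assumptions, and only the file name tofu.auto.tfvars.json and the three tofu invocations come from the table above.

```python
import json
import tempfile
from pathlib import Path


def write_tfvars(workdir: Path, wizard_input: dict) -> Path:
    """Write already-validated wizard input as tofu.auto.tfvars.json.

    OpenTofu auto-loads *.auto.tfvars.json files found in the working directory.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    path = workdir / "tofu.auto.tfvars.json"
    path.write_text(json.dumps(wizard_input, indent=2, sort_keys=True))
    return path


def tofu_commands() -> list:
    """The three invocations named in step 1, in execution order."""
    return [
        ["tofu", "init"],
        ["tofu", "plan"],
        ["tofu", "apply", "-auto-approve"],
    ]


if __name__ == "__main__":
    # Hypothetical input; the real variable names live in the OpenTofu module.
    example = {"ha_enabled": False, "worker_count": 2}
    with tempfile.TemporaryDirectory() as tmp:
        print(write_tfvars(Path(tmp) / "example-sovereign", example).read_text())
    for cmd in tofu_commands():
        print(" ".join(cmd))
```

Keeping the command list as data makes it straightforward to run each invocation in turn and forward its output line by line, which is what the SSE stream to the wizard requires.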
The wizard's progress page polls Flux Kustomizations on the new cluster and renders steady-state to the user when every Kustomization is Ready=True.
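The steady-state check described above amounts to a predicate over Kustomization status conditions. The sketch below assumes the Kustomizations have already been fetched as plain dictionaries shaped like Kubernetes objects; the helper names are hypothetical, and treating an empty list as "not ready" is an assumption of this sketch.

```python
def is_ready(kustomization: dict) -> bool:
    """True when the object carries a condition with type=Ready and status=True."""
    conditions = kustomization.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def all_ready(kustomizations: list) -> bool:
    """Steady state: at least one Kustomization exists and every one is Ready."""
    return bool(kustomizations) and all(is_ready(k) for k in kustomizations)


if __name__ == "__main__":
    sample = [
        {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
    print(all_ready(sample))      # one Kustomization is still reconciling
    print(all_ready(sample[:1]))  # only the Ready one
```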
DNS records written in Phase 0 — into the per-Sovereign PowerDNS zone (<sovereign-fqdn>.), see PLATFORM-POWERDNS.md §"Per-Sovereign zone model":
@ A → load balancer IP
* A → load balancer IP
console A → load balancer IP
api A → load balancer IP
gitea A → load balancer IP
harbor A → load balancer IP
The PDM /v1/commit endpoint writes the canonical 6-record set into the freshly-created Sovereign zone via the PowerDNS REST API. The wildcard A record covers every additional subdomain a Sovereign might add at runtime (axon, umami, langfuse, etc.) without re-issuing certificates. Per NAMING §5.1 the canonical control-plane DNS pattern is {component}.{location-code}.{sovereign-domain} — the wildcard handles per-Application records under per-Environment subdomains.
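As an illustration of the 6-record set, the sketch below builds the records for a given load balancer IP. The dictionary shape and the function name are assumptions made for this example and do not reflect the PDM or PowerDNS payload format; the IP used is from the documentation range.

```python
import ipaddress

# Record names listed for Phase 0; "@" is the zone apex, "*" the wildcard.
RECORD_NAMES = ["@", "*", "console", "api", "gitea", "harbor"]


def canonical_records(lb_ip: str) -> list:
    """Return one A record per canonical name, all pointing at the LB IP."""
    ipaddress.IPv4Address(lb_ip)  # raises ValueError on a malformed address
    return [{"name": name, "type": "A", "content": lb_ip} for name in RECORD_NAMES]


if __name__ == "__main__":
    for record in canonical_records("203.0.113.10"):
        print(record["name"], record["type"], record["content"])
```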
OpenTofu state: kept in the catalyst-api Pod under /tmp/catalyst/tofu/<sovereign-fqdn>/ — pinned via the CATALYST_TOFU_WORKDIR env var on the catalyst-api Deployment (commit 27527e4c) and backed by the Pod's writable /tmp emptyDir (2 Gi sizeLimit; the in-code default /var/lib/catalyst/... is unwritable for UID 65534, hence the override). Re-running with the same FQDN is idempotent (tofu apply on existing state). For air-gap installs the sovereign-admin MUST configure a remote backend with encryption-at-rest so the Hetzner token isn't carried only on Pod ephemeral storage.
Implementation status: the Go wrapper, OpenTofu module, and 12 G2 wrapper charts (the original 11 + bp-powerdns added at #167) all exist today (verified at STATUS.md §7). The pool-domain-manager (core/pool-domain-manager/) and its 5 registrar adapters are deployed and running in openova-system. End-to-end DoD against a real Hetzner project is pending Group M of §11.
Total Phase 0 time: 30–60 minutes for a single-region Hetzner Sovereign once DoD lands.
9.4 Phase 1 — Hand-off
After Phase 0 completes:
- Crossplane in the new Sovereign adopts management of all infrastructure created by OpenTofu. From this point forward, all infrastructure changes go through Crossplane.
- The bootstrap k3s nodes are not "thrown away" — they are claimed by Crossplane via the cloud provider's adoption mechanism.
- OpenTofu state is archived and read-only. It is never touched again.
- catalyst-provisioner no longer has any active connection to the new Sovereign.
The Sovereign is now self-sufficient. It has the full Catalyst control-plane set per ARCHITECTURE.md §2.3:
- Its own Crossplane managing further infrastructure.
- Its own OpenBao for secrets.
- Its own JetStream as event spine.
- Its own Keycloak for users.
- Its own workload identity (Cilium WireGuard + K8s SA TokenReview; SPIFFE/SPIRE opt-in per SECURITY.md §2).
- Its own Gitea (with mirror of the public Blueprint catalog).
- Its own observability stack (Grafana + Alloy + Loki + Mimir + Tempo) for self-monitoring.
- Its own Catalyst control plane (console, marketplace, admin, projector, catalog-svc, provisioning, environment-controller, blueprint-controller, billing).
9.5 Phase 2 — Day-1 setup
The first sovereign-admin logs into console.<location-code>.<sovereign-domain>:
Day-1 actions
──────────────────────────────────────────────────────────────────
1. Configure cert-manager issuers (Let's Encrypt / corporate CA).
2. Configure backup destination (cloud object storage for Velero).
3. Configure Harbor with image-scanning policies.
4. (Optional) Federate Keycloak's catalyst-admin realm to corporate IdP.
5. (Optional) Configure observability exports (SIEM, datadog, etc.).
6. Onboard the first Organization:
Catalyst console → Admin → Organizations → New
Provide: name, contact, plan.
Environment-controller does NOT create vclusters yet.
They are created when the first Environment is provisioned.
7. Create the first Environment in that Organization:
Console → switch to Org context → Environments → New
Environment-controller spins up a vcluster on the chosen host cluster
and bootstraps Flux inside (watching the env-appropriate branch on
every Application repo within this Org's Gitea Org). Apps not yet
installed have no repos yet; repos are created on demand by the
provisioning-service when each App is installed.
Ready in ~60 seconds.
9.6 Phase 3 — Steady-state operation
From here on, the Sovereign runs autonomously. Sovereign-admins use the Catalyst admin UI for:
- Onboarding more Organizations
- Adding host clusters in new regions (Crossplane provisions them, environment-controller adopts them)
- Updating Catalyst itself (umbrella Blueprint version bumps, applied via Flux PR)
- Configuring SecretPolicies and EnvironmentPolicies
- Monitoring the Sovereign's own observability stack
- Reviewing audit logs
Everyday Application installs and configurations are done by org-admins and org-developers within their Organizations — see DOD.md.
9.7 Multi-region topology
9.7.1 Single-region (SME default)
Region A
└── Host cluster: hz-fsn-mgt-prod ← Catalyst control plane + per-Org vclusters
└── all building blocks collapse onto one cluster (mgt + rtz + dmz workloads
in separate namespaces, with Cilium NetworkPolicies enforcing isolation)
Cheapest topology. Single-region failure = Sovereign down. Acceptable for SME tier where customers also accept SME-tier SLAs.
9.7.2 Multi-region (corporate default)
Region A (primary mgt) Region B Region C (DR)
───────────────── ───────────── ─────────────
hz-nbg-mgt-prod hz-fsn-rtz-prod hz-hel-rtz-prod
Catalyst control plane per-Org vclusters per-Org vclusters
Gitea, JetStream, OpenBao, (sibling realizations (sibling realizations
Keycloak, projector, of each Org's Environment) of each Org's Environment)
catalog-svc, marketplace,
console, admin, billing
hz-nbg-dmz-prod hz-fsn-dmz-prod hz-hel-dmz-prod
ingress, WAF, PowerDNS ingress, WAF, PowerDNS ingress, WAF, PowerDNS
The mgt building block is typically NOT replicated (one Catalyst control plane per Sovereign). The rtz and dmz blocks ARE replicated for workload HA.
OpenBao runs in BOTH the mgt cluster (primary) and each rtz region (replica) — see SECURITY.md §5 for replication semantics.
9.8 Adding a region post-provisioning
sovereign-admin in Catalyst admin UI:
Admin → Infrastructure → Add Region
Provider: Hetzner
Region: hel
Building blocks: rtz, dmz
Apply
Catalyst:
- Crossplane provisions the new VPC, hosts, k3s cluster, etc.
- Cluster registered in Catalyst's cluster registry.
- cert-manager + Cilium + Flux + Crossplane + ESO + OpenBao replica deployed via the cluster's Flux Kustomization (SPIRE opt-in only).
- New region available as a Placement target for new and existing Environments.
Existing Applications with placement.mode: single-region do not migrate automatically. To extend an existing Application to the new region, the user explicitly switches Placement to active-active (or active-hotstandby) and adds the new region to placement.regions — that's a one-line edit in the Application's Gitea repo on the appropriate branch (or a click in the Topology tab).
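The edit described above can be sketched as a small transformation. The field names placement.mode and placement.regions come from the text; the dictionary representation of the Application and the function name are assumptions for illustration.

```python
import copy

MULTI_REGION_MODES = {"active-active", "active-hotstandby"}


def extend_placement(app: dict, new_region: str, mode: str = "active-active") -> dict:
    """Return a copy of the Application with the new region added to its Placement."""
    if mode not in MULTI_REGION_MODES:
        raise ValueError(f"unsupported multi-region mode: {mode}")
    updated = copy.deepcopy(app)  # leave the caller's object untouched
    placement = updated.setdefault("placement", {})
    placement["mode"] = mode
    regions = placement.setdefault("regions", [])
    if new_region not in regions:
        regions.append(new_region)
    return updated


if __name__ == "__main__":
    app = {"placement": {"mode": "single-region", "regions": ["fsn"]}}
    print(extend_placement(app, "hel"))
```

Returning a copy rather than mutating in place mirrors the GitOps flow: the change only takes effect once it is committed to the Application's repository.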
9.9 Air-gap deployment
Connected zone (one-time) Air-gapped Sovereign
────────────────────────── ───────────────────────────────
1. Mirror public Blueprint OCI Harbor receives blobs via physical
artifacts to portable media. transfer / data diode.
2. Mirror Catalyst control-plane Sovereign's Gitea adopts blobs as
container images. OCI manifests in local registry.
3. Mirror cert-manager root + cert-manager configured with
organization CA bundle. internal CA only.
4. Configure Keycloak to local LDAP Keycloak federates to internal AD/LDAP.
(no external IdPs).
Catalyst is air-gap-ready by construction: every artifact (Blueprints, Catalyst code, base images) is OCI-signed. Mirror once, run forever.
9.10 Migration and decommission
9.10.1 Migrating an Organization between Sovereigns
Rare but supported. Example: a Bank Dhofar Organization started life on the openova Sovereign (paid SaaS), now wants to move to its own bankdhofar Sovereign (self-host).
1. Provision bankdhofar Sovereign (Phases 0–2).
2. On openova Sovereign: Admin → Organization → Export
Catalyst produces an export bundle:
- Org metadata
- All Application Gitea repos under this Org (cloned + bundled, including all branches)
- The Org's `shared-blueprints` repo
- Keycloak realm export (users, federated identities)
- OpenBao export (sealed secrets only)
3. On bankdhofar Sovereign: Admin → Organization → Import
Environment-controller recreates Environments → vclusters.
Flux pulls manifests, reconciles.
Apps come up.
4. Final cutover: DNS swap.
5. Verify, then decommission on openova side.
Time depends on data volume; typically minutes to hours per Org.
9.10.2 Decommissioning a Sovereign
Reverse of provisioning:
1. Migrate all Organizations off (§9.10.1).
2. Catalyst admin → Sovereign → Decommission
3. Crossplane begins teardown of host clusters.
4. OpenBao final state exported and stored encrypted.
5. DNS records removed.
6. Cloud resources reclaimed.
The customer keeps the OpenBao export and Gitea bundles for whatever retention period their compliance demands.
For Hetzner-specific decommissioning (POST /wipe endpoint and orphan-cleanup discipline), see §1.7 + §2.2 + §2.3.
§10 — UI regression test catalog
Source: previously docs/UI-REGRESSION-GUARDS.md (merged here on 2026-05-20).
Mapping each Playwright cosmetic + step-flow regression guard to the user's original complaint and the source-of-truth file the guard protects.
- Test file: products/catalyst/bootstrap/ui/e2e/cosmetic-guards.spec.ts
- Playwright config: products/catalyst/bootstrap/ui/playwright.config.ts
- CI workflow: .github/workflows/cosmetic-guards.yaml
- Annotation: every test is tagged @cosmetic-guard so the CI step can filter via --grep "@cosmetic-guard".
- Companion suite: tests/e2e/playwright/ (issues #142/#143/#144 and the broader E2E agent #184). The cosmetic-guards suite is intentionally narrower: it covers only the regressions the user has called out repeatedly.
10.1 Running locally
cd products/catalyst/bootstrap/ui
npm install # installs @playwright/test
npx playwright install # one-time browser download
npm run dev # starts vite on http://localhost:5173/sovereign/
# (in a second terminal)
npx playwright test e2e/cosmetic-guards.spec.ts
If something else has already claimed port 5173 (e.g. another vite instance), Vite will auto-bump to 5174/5175/etc. Override the test host accordingly:
PLAYWRIGHT_HOST=http://localhost:5174 npx playwright test e2e/cosmetic-guards.spec.ts
The config reads PLAYWRIGHT_HOST (default http://localhost:5173) and PLAYWRIGHT_BASEPATH (default /sovereign) from the environment, per PRINCIPLES.md #4 (never hardcode).
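The environment-with-defaults lookup can be sketched as below. The real config is the TypeScript file playwright.config.ts; this Python version only illustrates the resolution logic, and the function name and the slash normalisation are assumptions of the sketch.

```python
import os


def base_url(env=None) -> str:
    """Combine PLAYWRIGHT_HOST and PLAYWRIGHT_BASEPATH, falling back to the documented defaults."""
    env = os.environ if env is None else env
    host = env.get("PLAYWRIGHT_HOST", "http://localhost:5173")
    basepath = env.get("PLAYWRIGHT_BASEPATH", "/sovereign")
    # Normalise slashes so host and basepath can be given with or without them.
    return host.rstrip("/") + "/" + basepath.strip("/") + "/"


if __name__ == "__main__":
    print(base_url({}))
    print(base_url({"PLAYWRIGHT_HOST": "http://localhost:5174"}))
```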
10.2 Pass / fail semantics — what "green" means
Regression guards are by design RED while the regression they describe is in the codebase. A test in this suite turns green only when the canonical shape it asserts is the actual shape rendered by the wizard or admin page.
- Tests 1, 2, 4 (StepComponents card geometry / luminance): green on main today — the canonical 108px height + per-brand logoTone + visible-glyph contract is currently honoured. Any future regression in these flips them red.
- Tests 3, 5, 7, 8, 9 (logo brand surfaces, step order, step gating, recommended SKU, per-provider catalog): green on main today.
- Tests 10, 11 (provision SPA route, no DAG): green on main today.
- Test 6 (no "Choose Your Stack" / "Always Included" tab labels): RED on main today and intentionally so, because the legacy tab strip is still in StepComponents.tsx. Flips green when stepComponentsCopy.ts drops tabChooseLabel/tabAlwaysLabel and StepComponents.tsx drops the top-level role="tablist" div.
- Tests 12, 13, 14 (sidebar / AppDetail / JobsPage): RED on main today, because the canonical Sovereign-side Sidebar.tsx/AppDetail.tsx/JobsPage.tsx are in flight on a separate branch (companion agent scope). Flip green when those files land + the data-testids in the table below are present.
- Test 15 (no Phase 0 banners): RED on main today, because PhaseBanners.tsx is still imported by AdminPage.tsx. Flips green when the import + file are removed and per-job cards take over.
A passing local run with all 15 green means every regression class the user has shouted about is currently absent. A failing test names the exact source-of-truth file the implementing agent needs to edit.
10.3 The 15 guards
Every row names: the user's complaint (paraphrased), the canonical reference, and the file that must NOT regress.
| # | User complaint | Canonical reference | Source-of-truth file | Restored by commit |
|---|---|---|---|---|
| 1 | "Card height grew again; should be 108, not 130" | SME marketplace .app-card height | src/pages/wizard/steps/StepComponents.tsx .corp-comp-card { height: 108px } | 691467b4 |
| 2 | "Description text is squished; there's a 70px column wasted on the right" | SME contract minus the .app-body { padding-right: 72px } waste | src/pages/wizard/steps/StepComponents.tsx .corp-comp-body | (cosmetic refactor #175) |
| 3 | "Logo tiles are all white; Temporal/FerretDB/Alloy disappeared" | Each project's homepage / press kit surface | src/pages/wizard/steps/logoTone.ts LOGO_SURFACE | (logoTone introduction) |
| 4 | "Temporal logo isn't visible; looks like a blank blue square" | LOGO_SURFACE brand surface MUST contrast against the glyph | src/pages/wizard/steps/StepComponents.tsx <ComponentLogo> | (logoTone introduction) |
| 5 | "Wizard steps were in the wrong order somehow" | WIZARD_STEPS array | src/app/layouts/WizardLayout.tsx | (wizard step refactor #174) |
| 6 | "Don't show the old Choose-Your-Stack / Always-Included tab labels" | SME marketplace single-grid layout | src/pages/wizard/steps/stepComponentsCopy.ts (tabChooseLabel / tabAlwaysLabel retire) + StepComponents.tsx top-level role="tablist" retire | (in flight, companion agent) |
| 7 | "Domain step came before Components; that's backwards" | Step order: Components precedes Domain | src/app/layouts/WizardLayout.tsx (WIZARD_STEPS, clickable = done) | (#174) |
| 8 | "Hetzner CPX32 is what we sell; make it the recommended SKU" | PROVIDER_NODE_SIZES.hetzner recommended:true exactly on cpx32 | src/shared/constants/providerSizes.ts | (provider catalog refactor) |
| 9 | "Huawei SKUs leaked into the Hetzner dropdown" | Per-provider SKU vocabularies are disjoint | src/pages/wizard/steps/StepProvider.tsx skuOptions(provider) reads PROVIDER_NODE_SIZES[provider] only | (provider refactor) |
| 10 | "Provision page has .html in the URL; looks like a static page" | tanstack-router SPA route /provision/$deploymentId | src/app/router.tsx provisionRoute + vite.config.ts base: '/sovereign/' | (DAG retirement) |
| 11 | "The bubble/edge graph is back; get rid of it" | AdminPage card grid replaces the legacy DAG | src/pages/provision/ProvisionPage.tsx re-exports AdminPage | (DAG retirement) |
| 12 | "Admin sidebar should look exactly like core/console" | core/console/src/components/Sidebar.svelte (<aside class="...w-56..."> + 7-item nav) | src/pages/sovereign/Sidebar.tsx | (in flight, companion agent) |
| 13 | "Per-app page should be sectioned, not tabbed" | core/console/src/components/AppDetail.svelte sections (hero / About / Connection / Bundled / Tenant / Configuration / Jobs) | src/pages/sovereign/AppDetail.tsx | (in flight, companion agent) |
| 14 | "Jobs are expand-in-place cards, not a separate route" | core/console/src/components/JobsPage.svelte (button rows + inline expansion) | src/pages/sovereign/JobsPage.tsx + JobCard.tsx | (in flight, companion agent) |
| 15 | "Get rid of the Hetzner infra + Cluster bootstrap banners" | Per-job cards on AdminPage replace the Phase 0 banners | src/pages/sovereign/AdminPage.tsx (drop <PhaseBanners> import + delete PhaseBanners.tsx) | (in flight, companion agent) |
10.4 Tests that need a data-testid PR first
Per PRINCIPLES.md #2 (never compromise quality), no test is tagged .skip() even when its target component is mid-refactor. Each test fails LOUD with an explicit error message naming the missing data-testid so the implementing agent has a precise target.
The list below is the authoritative set of data-testid attributes the companion-agent's UI work MUST add for the guards to flip green:
| data-testid | Goes on | Required by test |
|---|---|---|
| admin-sidebar | <aside> root of src/pages/sovereign/Sidebar.tsx | #12 |
| job-row-<id> | The <button> row in src/pages/sovereign/JobsPage.tsx | #14 |
| job-expansion-<id> | The inline expansion node sibling to job-row-<id> | #14 |
The data-testid="component-card-<id>" and data-testid="logo-<id>" attributes used by tests #1–#4 already exist in the current StepComponents.tsx.
10.5 Why this lives in products/catalyst/bootstrap/ui/e2e/, not tests/e2e/playwright/
The repo-level tests/e2e/playwright/ is owned by the broader E2E suite (issues #142/#143/#144 + #184) and pulls together the wizard, admin voucher UI, and unified Blueprint card grid. Co-locating the narrower cosmetic guards next to the UI source they protect:
- keeps the import path to canonical references (e.g. LOGO_SURFACE) trivially short,
- lets a UI engineer run the guards via npm run dev + npx playwright test from a single working directory,
- and makes the GitHub Actions path filter (products/catalyst/bootstrap/ui/**) trigger the exact suite that reasons about that tree.
The companion E2E suite agent (#184) and this suite share the /sovereign basepath contract; nothing in either file depends on the other.
§11 — Phase-by-phase provisioning plan (Catalyst-Zero waterfall)
Source: previously docs/PROVISIONING-PLAN.md (merged here on 2026-05-20).
The agreed plan for consolidating the existing nova/console/admin/marketplace code into the public OpenOva Catalyst monorepo, deploying it as Catalyst-Zero (the first Catalyst Sovereign — running on Contabo, the chicken in the chicken-and-egg problem), and then provisioning the first franchised Sovereign on Hetzner via the wizard at console.openova.io/sovereign.
Parent issue: #43. Sub-tickets: A–M groups, #45–#175. Post-Group-M continuation tickets (#161, #162, #163, #167, #168, #169, #170, #171, #173, #174, #175) extend the plan with the per-Sovereign PowerDNS zone model, pool-domain-manager + registrar adapters, three-mode StepDomain (pool/byo-manual/byo-api), the wizard StepComponents redesign, and k8gb retirement.
11.1 Execution status (live)
| Group | Tickets | Status | Commits |
|---|---|---|---|
| A: Code consolidation | 9 | Done | 3c2f7e4 |
| B: SME backend services | 10 | Source migrated; CI workflow live | 7646840 |
| C: Cutover Catalyst-Zero | 8 | Flux is now reconciling Catalyst-Zero from github.com/openova-io/openova (public repo), confirmed via kubectl get gitrepository -A returning openova-public source serving the catalyst-platform Kustomization | 9d93912, dc56854, bd967a7, 61de3da, 9fdfe07, 8c40984 (Group C cutover merge) |
| D: Wizard | 10 | In progress: Domain capture + Hetzner project ID added; AppsStep replacement pending | 854a063 |
| E: Provisioner backend | 13 | In progress: Real Hetzner client + bootstrap installer + Dynadot DNS landed; SSH kubeconfig fetch is stub | 915c467, db4f21a, 07b4bcf |
| F: Bootstrap-kit Helm charts | 14 | Done: All 12 G2 wrapper charts (original 11 + bp-powerdns #167) + blueprint-release CI live | 8c0f766, 0190c605 |
| G: DNS multi-domain | 6 | Superseded by PowerDNS authoritative (#167) + pool-domain-manager (#163) + registrar adapters (#170); Dynadot is now one of five registrar adapters inside PDM, not the authoritative DNS surface | db4f21a, 0190c605 (#167), 2854d652 (#163), 567d7e1f (#170) |
| H: Franchise model | 7 | In progress: docs/FRANCHISE-MODEL.md authored from existing admin impl; cross-Sovereign voucher deferred | this commit |
| I: Wizard UX | 6 | Design: SSE event log pane + step indicator pending | |
| J: Hetzner infra | 6 | In progress: cloud-init in repo; firewall + k3s flags wired into provisioner | 07b4bcf |
| K: Documentation | 8 | In progress: STATUS.md + core/README + products/catalyst/README updated; component-count anchor refreshed 53 → 56 (spire + nats-jetstream + sealed-secrets factored in); reconcile-pass-1 (2026-04-29) refreshed canonical docs against PowerDNS/PDM/registrar-adapter ground truth | 3c2f7e4, 8c0f766, group-k-docs, reconcile-pass-1 |
| L: Testing | 8 | Design: Playwright + integration tests pending | |
| M: End-to-end DoD | 9 | Design: Awaiting Hetzner credentials from sovereign-admin + first OCI-artifact CI runs to complete | |
11.2 The chicken-and-egg problem and its resolution
Catalyst is a Kubernetes-native control plane that provisions other Sovereigns. Provisioning a Sovereign requires a provisioner service (catalyst-provisioner.openova.io per §9.2). That provisioner has to run somewhere. It cannot run inside the Sovereign it is provisioning (chicken-and-egg).
Resolution: the legacy nova/console/admin/marketplace stack currently running on Contabo k3s (in namespaces catalyst, sme, marketplace, website) is Catalyst-Zero — the first Sovereign. It exists today, has running pods today, and is the chicken from which the egg (the first Hetzner-hosted franchised Sovereign) gets provisioned.
The work in this plan consolidates that existing code into the public repo, redeploys it as a public-repo build (CI from github.com/openova-io/openova), and then uses it to provision the first franchised Sovereign. There is no greenfield "build Catalyst from scratch" — the Sovereign already exists; we are aligning it to the canonical Catalyst contract.
11.3 Current state inventory (verified against live cluster + repos, 2026-04-28)
11.3.1 Code locations (today)
| What | Where today | Where it must end up |
|---|---|---|
| Catalyst console UI (Astro+Svelte) | openova-private/apps/console/ | openova/core/console/ |
| Catalyst admin UI (Astro+Svelte) | openova-private/apps/admin/ | openova/core/admin/ |
| Catalyst marketplace UI (Astro+Svelte) | openova-private/apps/marketplace/ | openova/core/marketplace/ |
| marketplace-api (Go backend) | openova-private/website/marketplace-api/ | openova/core/marketplace-api/ |
| Catalyst-zero deployment chart | openova-private/clusters/contabo-mkt/apps/catalyst/ | openova/products/catalyst/chart/templates/ |
| Vite scaffold for sovereign-wizard | openova/products/catalyst/bootstrap/ui/ | merges into openova/core/console/src/pages/sovereign/ |
| CI workflows (6 of them) | openova-private/.github/workflows/{catalyst-build,marketplace-api-build,sme-{admin,console,marketplace,services}-build}.yaml | openova/.github/workflows/ |
| Voucher / billing / tenants admin surface | openova-private/apps/admin/src/{components/BillingPage.svelte, lib/api.ts, pages/{billing,catalog,orders,tenants}.astro} | openova/core/admin/... (carry forward unchanged) |
11.3.2 Live deployment on Contabo (verified via kubectl get all -A)
| Namespace | Pods running | Notes |
|---|---|---|
| catalyst | catalyst-api + catalyst-ui | 39 days uptime |
| sme | console + admin + marketplace | 5–6 days uptime |
| marketplace | marketplace-api | 13 days uptime |
| website | openova-website | live |
These pods are Catalyst-Zero. They stay running through Phases 1–2; Phase 2 is a rolling-update cutover to public-repo image builds.
11.3.3 Existing 5-step wizard (the "Components (5)" page reference)
The "Components (5)" the user referenced is the 5-step marketplace flow at openova-private/apps/marketplace/src/components/:
PlanStep → AppsStep → AddonsStep → CheckoutStep → ReviewStep
AppsStep is what gets replaced with the unified marketplace card grid (driven by the same bp-<x> Blueprint surface every Catalyst Sovereign uses).
11.3.4 Voucher mechanism (already implemented)
Lives in openova-private/apps/admin/:
- src/components/BillingPage.svelte: voucher / billing UI
- src/lib/api.ts: voucher API client
- src/pages/{billing,catalog,orders,tenants}.astro: admin pages
This is the canonical voucher implementation. Do not redesign. Read what's there, propagate to franchised Sovereigns, document in FRANCHISE-MODEL.md.
11.4 Architectural agreements (from the design conversation, durable)
These agreements survive any context compaction and apply to every phase of the work below.
- Catalyst-Zero is the existing Contabo deployment. Not greenfield. The work is consolidate + cutover + extend, not rebuild.
- omani.works is the first Sovereign-provided subdomain pool (registered to the OpenOva Dynadot account). User dynamically picks omantel.omani.works during provisioning. The wizard offers BYO domain (customer's own) or a Sovereign-pool subdomain (default). Multi-region setups are out of scope for the first run.
- Existing admin voucher implementation is the source of truth. Do not propose new CRDs. Read the existing implementation, propagate it to franchised Sovereigns, document it.
- G2 quality only. Catalyst-curated wrapper Helm charts at platform/<x>/chart/ for every component in the bootstrap kit. No upstream-as-is shortcuts. No corner-cutting. The unified Blueprint contract from §3 is the standard.
- No mocks. No iterations. No partial deliveries. Waterfall: every phase produces real, deployed, working artifacts.
- All product code is public. Per the build-minutes constraint, code moves to openova/ (the public monorepo) before any further development. CI runs in the public repo from this point onward.
- The Vite scaffold at products/catalyst/bootstrap/ui/ merges into core/console/src/pages/sovereign/. It does not become its own deployable.
- Sovereign-provisioning wizard target URL: console.openova.io/sovereign. Captured fields include domain (BYO or pool), Hetzner Cloud API token, Hetzner project ID, Hetzner region (runtime parameter, never hardcoded), plus the marketplace-style App selection.
- The Hetzner region is a runtime parameter chosen by the wizard user. Never hardcoded anywhere in code.
- Dynadot is OpenOva's registrar of record for the pool domains. The dynadot-api-credentials K8s secret in openova-system is account-scoped and covers openova.io plus omani.works (and any other domain in the same Dynadot account). Post-#167/#170 Dynadot is not authoritative DNS for any Sovereign zone; bp-powerdns is. Dynadot is one of five registrar adapters PDM uses to (a) keep the OpenOva pool domains' parent-zone NS records pointing at OpenOva PowerDNS and (b) honour byo-api Sovereigns whose customer happens to use Dynadot.
11.5 The 8-phase waterfall
Each phase produces one or more commits to openova/. Each commit is real working code, not scaffold. No phase is skipped, abbreviated, or deferred.
11.5.1 Phase 1 — Code consolidation (openova-private → openova)
What: git mv the 4 apps (console, admin, marketplace, marketplace-api) from openova-private to openova/core/. Move 6 CI workflows to openova/.github/workflows/. Move Catalyst-Zero deployment manifests from openova-private/clusters/contabo-mkt/apps/catalyst/ to openova/products/catalyst/chart/templates/.
Outputs:
- openova/core/{console,admin,marketplace,marketplace-api}/ populated
- openova/.github/workflows/{catalyst-build,marketplace-api-build,sme-*-build}.yaml
- openova/products/catalyst/chart/templates/{api-deployment,api-service,ui-deployment,ui-service,ingress}.yaml
- openova/products/catalyst/chart/Chart.yaml (new)
- All import paths, image refs (ghcr.io/openova-io/openova/{console,admin,marketplace,marketplace-api,catalyst-api,catalyst-ui}:<sha>) updated
- validation-log entry: Pass 105
Commit message: feat(consolidation): move Catalyst-Zero apps + CI from openova-private to public monorepo
11.5.2 Phase 2 — Cutover Catalyst-Zero to public-repo build
What: Trigger first public-repo CI run, get :<sha> images into GHCR, roll the existing Contabo deployment to the new images. Catalyst-Zero is now built from the public repo. Delete legacy paths from openova-private (preserved in git history).
Outputs:
- GHCR images at ghcr.io/openova-io/openova/{console,admin,marketplace,marketplace-api,catalyst-api,catalyst-ui}:<sha>
- Contabo k3s pods rolled to new image SHAs
- openova-private cleaned of legacy paths
- validation-log entry: Pass 106
Commit message: infra(cutover): Catalyst-Zero now built from public repo
Acceptance: kubectl describe pod on each rolled pod shows image: ghcr.io/openova-io/openova/.... Console at console.openova.io still loads. Brief rolling-update window (<60s).
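The image-prefix part of the acceptance check can be sketched as a simple filter. The input mapping of pod name to image reference is assumed to have been collected beforehand (for example from kubectl output); the function name, pod names, and tags below are hypothetical.

```python
EXPECTED_PREFIX = "ghcr.io/openova-io/openova/"


def unrolled_pods(pod_images: dict) -> list:
    """Return the pods whose image is not yet served from the public-repo GHCR path."""
    return sorted(
        pod for pod, image in pod_images.items()
        if not image.startswith(EXPECTED_PREFIX)
    )


if __name__ == "__main__":
    observed = {
        "console-abc": "ghcr.io/openova-io/openova/console:0123abc",
        "admin-def": "registry.example.invalid/legacy/admin:old",
    }
    print(unrolled_pods(observed))  # pods still on a legacy image
```

An empty result means every listed pod satisfies the image criterion; the console reachability check remains a separate step.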
11.5.3 Phase 3 — Sovereign-provisioning wizard
What: Build the wizard at core/console/src/pages/sovereign/ using the Vite scaffold. Replace the legacy 5-step marketplace flow's AppsStep with a unified marketplace card grid (driven by bp-<x> Blueprint surface). Add Sovereign-provisioning-specific fields:
- Domain: BYO (customer's own domain) or pool selection (default omani.works → user picks subdomain like omantel, acme-bank, etc.)
- Hetzner Cloud API token (capture, store via ESO into OpenBao, never log)
- Hetzner project ID
- Hetzner region (dropdown of valid Hetzner regions; runtime parameter)
- Sovereign owner email (becomes initial sovereign-admin)
- Initial App selection (the unified marketplace grid)
Outputs:
- openova/core/console/src/pages/sovereign/index.astro + sub-pages for each wizard step
- openova/core/console/src/components/sovereign/{DomainStep,HetznerStep,AppsStep-unified,ReviewStep}.svelte
- The legacy bootstrap Vite scaffold at openova/products/catalyst/bootstrap/ui/ is merged in and the directory deleted (its content is now part of core/console/)
- validation-log entry: Pass 107
Commit message: feat(console): sovereign-provisioning wizard at /sovereign with domain + Hetzner inputs + unified marketplace App selection
11.5.4 Phase 4 — Provisioner backend
What: Build the wizard's backend at products/catalyst/bootstrap/api/ (the Go service deployed as catalyst-api in the catalyst namespace on Catalyst-Zero). Real backend that takes wizard input → calls OpenTofu → returns Sovereign provisioning state via SSE. Per PRINCIPLES.md #3, no cloud APIs are called from Go directly — OpenTofu owns Phase 0, Crossplane owns day-2, and Hetzner client code is reserved for read-only credential validation.
Outputs:
- products/catalyst/bootstrap/api/internal/provisioner/: thin wrapper around tofu that writes tofu.auto.tfvars.json from validated wizard input, runs tofu init && tofu plan && tofu apply -auto-approve, streams stdout/stderr lines to the wizard via SSE
- products/catalyst/bootstrap/api/internal/hetzner/: read-only Hetzner client for credential validation (POST /api/v1/credentials/validate); never used to mutate cloud state
- products/catalyst/bootstrap/api/internal/pdm/: PDM client (/v1/reserve, /v1/commit, /v1/validate) for pool-subdomain allocation and registrar-token validation
- products/catalyst/bootstrap/api/internal/dynadot/: Dynadot client (used as one registrar adapter inside PDM's adapter set, not for direct DNS writes from this service)
- products/catalyst/bootstrap/api/internal/handler/: REST handlers including POST /api/v1/deployments, GET /api/v1/deployments/{id}/logs (SSE), POST /api/v1/deployments/{id}/phases/{phase}/retry, POST /api/v1/credentials/validate, POST /api/v1/subdomains/check, GET /api/v1/registrars
- infra/hetzner/main.tf: OpenTofu module (network, firewall, ssh-key, control-plane + worker servers, load balancer)
- validation-log entry: Pass 108
Commit message: feat(provisioner): real Hetzner Sovereign provisioning end-to-end
11.5.5 Phase 5 — Bootstrap kit Helm charts (G2 quality)
What: Real Catalyst-curated wrapper Helm charts at platform/<x>/chart/ for every bootstrap-kit component. Each chart wraps upstream OSS with Catalyst-specific values, includes a blueprint.yaml per the unified Blueprint contract from §3, and publishes a bp-<name>:<semver> OCI artifact via CI fan-out.
Components (in dependency order):
- platform/cilium/chart/ (CNI must come first)
- platform/cert-manager/chart/
- platform/flux/chart/ (host-level)
- platform/crossplane/chart/
- platform/sealed-secrets/chart/ (transient bootstrap-only)
- platform/spire/chart/ (opt-in: SPIRE deferred from bootstrap-kit by PR #665; the platform/spire/ folder is retained)
- platform/nats-jetstream/chart/
- platform/openbao/chart/
- platform/keycloak/chart/
- platform/gitea/chart/
- products/catalyst/chart/: the umbrella bp-catalyst-platform
Outputs:
- 11 directories with Chart.yaml, values.yaml, templates/, blueprint.yaml, optional compositions/, policies/, overlays/
- 11 entries in openova/.github/workflows/blueprint-release.yaml (path-matrix CI fan-out)
- 11 OCI artifacts published at ghcr.io/openova-io/bp-<name>:<semver> after first CI run
- One commit per chart (11 commits); incremental review possible
- validation-log entries: Pass 109 through Pass 119
Commit messages: feat(bp-<name>): G2 Catalyst-curated chart for <name> per BLUEPRINT-AUTHORING contract
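To make the artifact naming concrete, the sketch below renders the ghcr.io/openova-io/bp-<name>:<semver> reference for each component in the dependency order listed above. The short names are inferred from the chart directory names (with catalyst-platform taken from the umbrella bp-catalyst-platform), and passing a single shared version is a simplification: each chart carries its own semver.

```go
package main

import "fmt"

// bootstrapKit lists the Phase 5 components in dependency order. The short
// names are inferred from the chart directories and are an assumption here.
var bootstrapKit = []string{
	"cilium", "cert-manager", "flux", "crossplane", "sealed-secrets",
	"spire", "nats-jetstream", "openbao", "keycloak", "gitea",
	"catalyst-platform",
}

// artifactRef renders the OCI reference for one Blueprint chart.
func artifactRef(name, semver string) string {
	return fmt.Sprintf("ghcr.io/openova-io/bp-%s:%s", name, semver)
}

// artifactRefs renders one reference per component, preserving input order.
func artifactRefs(names []string, semver string) []string {
	refs := make([]string, 0, len(names))
	for _, n := range names {
		refs = append(refs, artifactRef(n, semver))
	}
	return refs
}
```

Calling artifactRefs(bootstrapKit, "0.1.0") with a placeholder version yields eleven references, one per chart, in install order.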
11.5.6 Phase 6 — DNS architecture: PowerDNS authoritative + PDM + registrar adapters
What: The DNS architecture has two layers. Authoritative DNS lives on bp-powerdns (#167): every Sovereign zone (pool: omantel.omani.works, BYO: acme.bank.com) gets its own PowerDNS zone with DNSSEC + lua-records. Allocation + registrar control lives on the pool-domain-manager service (#163), which exposes registrar adapters (#170) for the byo-api flow. The three domain flows are:
- Pool subdomains (e.g. <sub>.omani.works, <sub>.openova.io): PDM /v1/reserve checks availability, /v1/commit creates the per-Sovereign PowerDNS zone, writes the canonical 6-record set, and updates the parent zone's NS delegation via the OpenOva Dynadot registrar adapter.
- BYO with manual NS-flip (byo-manual): wizard surfaces the OpenOva NS list; customer pastes them into their own registrar UI; catalyst-api polls until propagation; PDM /v1/commit then writes the canonical record set into the new PowerDNS zone (no parent-zone change from OpenOva).
- BYO with API NS-flip (byo-api): customer picks their registrar from the supported list (Cloudflare, Namecheap, GoDaddy, OVH, Dynadot; see #170), pastes a token; PDM /v1/validate confirms scope read-only; on commit, the matching registrar adapter flips the NS records to OpenOva's NS set.
Outputs:
- core/pool-domain-manager/: Go service deployed at pool-domain-manager in openova-system, CNPG-backed pdm-pg. Modules: internal/allocator, internal/pdns, internal/registrar, internal/dynadot, internal/reserved, internal/store. CI: .github/workflows/pool-domain-manager-build.yaml.
- platform/crossplane/compositions/composition-pool-allocation.yaml + matching XRD: declarative Crossplane wrapper around PDM /v1/reserve so Sovereign provisioning runs through the canonical IaC path.
- platform/powerdns/: bp-powerdns wrapper chart (Chart.yaml, values.yaml, blueprint.yaml, templates) with DNSSEC + lua-records on by default, dnsdist companion for rate-limiting.
- validation-log entry: Pass 120 (component-count refresh + PDM landing).
Commit message: feat(dns): bp-powerdns + pool-domain-manager + registrar adapters for pool/byo flows
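The three domain flows differ mainly in which PDM endpoints are called and who changes the NS records. The sketch below encodes that as a lookup. The type and function names are hypothetical, and "pool" as a mode identifier is an assumption (only byo-manual and byo-api are named in this plan); the endpoint order follows the flow descriptions above.

```go
package main

import "fmt"

// DomainMode is the wizard's domain choice.
type DomainMode string

const (
	ModePool      DomainMode = "pool" // assumed identifier for the pool flow
	ModeBYOManual DomainMode = "byo-manual"
	ModeBYOAPI    DomainMode = "byo-api"
)

// DNSPlan summarises one flow: the PDM endpoints in call order and the
// party that changes the NS records.
type DNSPlan struct {
	PDMCalls  []string
	NSChanger string
}

// planFor returns the summary for a mode, or an error for an unknown mode.
func planFor(mode DomainMode) (DNSPlan, error) {
	switch mode {
	case ModePool:
		return DNSPlan{
			PDMCalls:  []string{"/v1/reserve", "/v1/commit"},
			NSChanger: "OpenOva Dynadot registrar adapter (parent-zone delegation)",
		}, nil
	case ModeBYOManual:
		// catalyst-api polls for propagation before the commit call.
		return DNSPlan{
			PDMCalls:  []string{"/v1/commit"},
			NSChanger: "customer, in their own registrar UI",
		}, nil
	case ModeBYOAPI:
		return DNSPlan{
			PDMCalls:  []string{"/v1/validate", "/v1/commit"},
			NSChanger: "matching registrar adapter, using the customer's token",
		}, nil
	}
	return DNSPlan{}, fmt.Errorf("unknown domain mode %q", mode)
}
```

Keeping the per-mode differences in one table-like function makes it easier to check that a new registrar adapter or flow does not silently skip the validate or reserve step.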
11.5.7 Phase 7 — Franchise model docs + voucher propagation
What: Read existing voucher implementation in admin app. Write FRANCHISE-MODEL.md documenting it as canonical. Ensure the new Sovereign at omantel.omani.works has its own admin surface (the same admin app, deployed inside the Sovereign) where omantel-admin can issue vouchers to omantel's tenants. Update GLOSSARY.md with Voucher and Franchisee definitions if not already present.
Outputs:
- openova/docs/FRANCHISE-MODEL.md: canonical doc
- Updates to GLOSSARY.md if needed
- Updates to BUSINESS-STRATEGY.md revenue model if needed
- validation-log entry: Pass 121
Commit message: docs(franchise): canonical franchise model + voucher propagation, sourced from existing admin impl
11.5.8 Phase 8 — End-to-end provisioning (live demo / DoD)
What: From browser at console.openova.io/sovereign:
- User logs in (Keycloak SSO)
- Picks "New Sovereign"
- Pastes Hetzner Cloud API token + project ID, picks region (any — runtime parameter)
- Picks domain: pool → omani.works → user types omantel (creates omantel.omani.works)
- Picks initial Apps (unified marketplace selection)
- Clicks Provision
- Watches SSE-driven progress for ~10 minutes
- Provisioning completes; new Sovereign at omantel.omani.works is reachable
- omantel-admin (initial sovereign-admin) logs into console.omantel.omani.works
- omantel-admin issues 1 voucher
- A fictional customer redeems the voucher at omantel.omani.works/redeem?code=...
- Customer's Organization + Environment + first App are created on omantel.omani.works
- Customer reaches their App's URL
Acceptance: every step above works without intervention. No mocks, no manual steps beyond the browser clicks.
Outputs:
- validation-log entry: Pass 122 (DoD documented with screenshots / kubectl evidence)
- Optional: this section in RUNBOOKS.md for repeatability
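The hostnames and URLs used in the walk-through follow a simple pattern, sketched below. The helper names are hypothetical, the https scheme is an assumption (the steps above give bare hostnames), and the label check is only a minimal guard, not the wizard's actual validation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sovereignFQDN joins the user-typed label with the chosen pool domain,
// e.g. "omantel" + "omani.works" gives "omantel.omani.works".
func sovereignFQDN(label, pool string) (string, error) {
	label = strings.ToLower(strings.TrimSpace(label))
	if label == "" || strings.ContainsAny(label, ". /") {
		return "", fmt.Errorf("invalid subdomain label %q", label)
	}
	return label + "." + pool, nil
}

// consoleURL returns the console address of a Sovereign (console.<fqdn>).
func consoleURL(fqdn string) string {
	return "https://console." + fqdn
}

// redeemURL builds the voucher redemption link <fqdn>/redeem?code=<code>,
// query-escaping the voucher code.
func redeemURL(fqdn, code string) string {
	u := url.URL{Scheme: "https", Host: fqdn, Path: "/redeem"}
	q := url.Values{}
	q.Set("code", code)
	u.RawQuery = q.Encode()
	return u.String()
}
```

A DoD test could derive all three values from the label and pool the user picked and then probe each URL, rather than hard-coding hostnames.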
11.6 What this plan does NOT change
- The unified Application = Gitea Repo model (Pass 103) is preserved everywhere. The franchised Sovereign at omantel.omani.works will use the same model — one Gitea Org per Catalyst Organization, one Gitea Repo per Application.
- The 5 conventional Gitea Orgs convention (catalog, catalog-sovereign, <org> per Catalyst Organization, system) applies to the new Sovereign exactly as it does to Catalyst-Zero.
- The component-count anchor (Pass 104 set 53; Pass 105 raised it to 56 with spire + nats-jetstream + sealed-secrets) holds. SeaweedFS unified S3 encapsulation, Guacamole in bp-relay, OpenBao independent-Raft per region: all preserved.
- The audit procedure stays on-demand (no scheduled loops). The audit-catalyst-docs skill is the only validation entry point.
11.7 References
- ARCHITECTURE.md: target architecture (the design Catalyst-Zero is being aligned to)
- §9.3 above: bootstrap kit dependency order (canonical reference for Phase 5 of this plan)
- §3 above: unified Blueprint shape (the contract Phase 5 charts must satisfy)
- STATUS.md: gets updated incrementally as each phase lands
- §8 above: how to validate after each phase
- docs/archive/validation-log.md Pass 1–104: historical record; Pass 105+ tracks this plan's execution
See also
- DOD.md: end-user Definition of Done (5 pillars + Phase 0/1/2 deterministic test)
- ARCHITECTURE.md: Catalyst target architecture
- DOD.md: Sovereign / tenant-Org FQDN patterns + forbidden test strings
- GLOSSARY.md: terminology source of truth (incl. banned terms)
- STATUS.md: what's built today vs design
- PRINCIPLES.md: the 15 inviolable engineering principles
- PRINCIPLES.md: theater receipts to watch for in PR review
- SECURITY.md: identity, secrets, rotation
- PLATFORM-POWERDNS.md: per-Sovereign authoritative zone model
- SECURITY.md §11: GHCR pull token, Dynadot credentials, Hetzner tokens (rotation runbook merged from former SECRET-ROTATION.md on 2026-05-20)
- ARCHITECTURE.md §8.8: PowerDNS lua-records for GSLB (folded from former MULTI-REGION-DNS.md on 2026-05-20)
- BUSINESS-STRATEGY.md §10.8: franchise model + voucher mechanism (folded from FRANCHISE-MODEL.md 2026-05-20)
- TRUST.md: verification ledger
- tests/dod/dod_test.go: Go test that drives the §5 walk non-interactively
- scripts/operator-recover-sovereign.sh: §2.2 idempotent recovery