Commit Graph

35 Commits

Author SHA1 Message Date
e3mrah
f6757c7c93
feat(docs): lean documentation strategy — consolidate 16 docs into 7 canonical + 3 subdirs (#2094)
* docs(arch): consolidate ARCHITECTURE + PLATFORM-TECH-STACK + NAMING + EPICS-1-6 + BOOTSTRAP-KIT-EXPANSION → docs/ARCHITECTURE.md (lean doc strategy)

Single canonical "how OpenOva works" doc per founder's lean-doc strategy.
2926 source lines → 1110 consolidated lines, no semantic loss.

Sections:
 §1  High-level model (Catalyst/Sovereign/Org/Env/Application/Blueprint)
 §2  Repo layout
 §3  Tech stack by layer (CNI/GitOps/IaC/event-spine/data/secrets/identity/...)
 §4  Naming conventions (dimensions, patterns, labels, DOMAINS-CANON)
 §5  Catalyst control plane (rules, CRDs, controllers, cutover, identity, surfaces)
 §6  Per-host-cluster infrastructure
 §7  Application Blueprints
 §8  Multi-region topology (1 cpx52/region, WireGuard-over-public-IPs, ClusterMesh)
 §9  Bootstrap-kit slot ordering (full 48-slot canonical list)
 §10 EPIC-level design overview (EPIC-0 through EPIC-6)
 §11 Per-chart DESIGN.md inventory
 §12 OAM influence
 §13 Read further

Stale literal fixes:
 - omantel.openova.io → omantel.biz / <sovereign>.<tld> / t38.omani.works (7 instances)
 - SPIRE marked DEFERRED / opt-in only (PR #665, TBD-V29 #2055)
 - failover-controller marked REPLACED by bp-continuum

New PR refs wired into §3:
 - PR #665   SPIRE deferral
 - PR #2071  bp-cnpg-pair synchronous remote_apply (zero-tx-loss multi-region)
 - PR #2087  bp-cnpg-pair pre-merge guard
 - PR #2093  bp-cnpg-pair pre-merge guard

New stack components added to §3:
 - bp-cnpg-pair  (synchronous remote_apply ReplicaCluster across ClusterMesh)
 - bp-continuum  (lease-based failover orchestrator)
 - bp-self-sovereign-cutover (8-tether pivot, ADR-0002, Principle #11)

Source docs (to be deleted by orchestrator in final PR):
 - docs/PLATFORM-TECH-STACK.md
 - docs/NAMING-CONVENTION.md
 - docs/EPICS-1-6-unified-design.md
 - docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md

* docs(principles): consolidate INVIOLABLE-PRINCIPLES + ANTI-PATTERN-CATALOG → docs/PRINCIPLES.md (lean doc strategy)

* docs(dod): consolidate 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS → docs/DOD.md (lean doc strategy)

* docs(runbooks+status+glossary): consolidate 5 runbooks → RUNBOOKS.md + refresh STATUS.md + fold banned-terms into GLOSSARY.md (lean doc strategy)

Part 1 — Runbook consolidation:
- NEW docs/RUNBOOKS.md with 7 numbered sections (provisioning, day-2 ops,
  Blueprint authoring, chart conventions, demo walk, failover, troubleshooting)
- Folds BLUEPRINT-AUTHORING / CHART-AUTHORING / DEMO-RUNBOOK /
  RUNBOOK-OPERATIONS / RUNBOOK-PROVISIONING into one canonical surface
- Documents dual-annotation requirement for charts with enabled.default: false
  (GUARD 1 #2087 no-upstream + GUARD 2 #2093 smoke-render) with bp-network-policies:1.0.1
  dead-reserve incident as the live evidence
- All admin.<fqdn> legacy URL refs → console.<fqdn>/bss (BSS lives in operator console)
- All openova.io / omantel.omani.works test commands → canonical t<NN>.omani.works
- Cites PRs #2076 (docs migration), #2082 (no-auto-close-keyword), #2087, #2093

Part 2 — STATUS.md refresh (renamed from IMPLEMENTATION-STATUS.md):
- Header dated 2026-05-20 (was 2026-04-29; 22 days stale per audit)
- Adds 🟦 CODE-COMPLETE state for "controllers + CRDs + tests landed,
  awaiting fresh-prov walk" (per 5-pillar DoD)
- Pillar 3 marked CODE-COMPLETE (PRs #2071/#2072/#2073/#2074/#2075/#2053)
- Adds 3 new CRDs verified in products/catalyst/chart/crds/:
  CNPGPair, PDM, Sandbox
- Sandbox controller chain CODE-COMPLETE
  (PRs #1615/#1618/#1621/#1622/#1626/#1631/#1632)
- SPIRE marked DEFERRED — opt-in only (PRs #665, #2056, #2061)
- New §6 CI / supply-chain guards table: hollow-chart (#2087),
  smoke-render (#2093), no-auto-close-keyword (#2082), observability-toggle,
  subchart 4-step, Flux version-pin replay
- New §9 Pillar-status table — Pillars 1/2/3/4 CODE-COMPLETE, Pillar 5 🚧
- Pillar 1 (PRs #2038 V18, #2043 V18-D), Pillar 2 (PR #2029 V20),
  Pillar 3 (per above), Pillar 4 (Sandbox chain)

Part 3 — GLOSSARY.md folded as single source of truth for banned terms:
- Header dated 2026-05-20, notes "single source of truth for banned terms"
  and "no separate BANNED-TERMS.md"
- Existing 11 banned-terms rows rewritten with italicized qualifiers
- NEW Forbidden test domains subsection:
  openova.io (mothership-only), omantel.openova.io (hallucinated),
  Nova Cloud (predecessor brand), eventforge.io (hallucinated),
  admin.<fqdn> (dead BSS URL)
- SPIFFE/SPIRE identity row + acronym row marked deferred per PR #665
  with TBD-V29 (#2055) re-introduction roadmap
- Cross-links updated: IMPLEMENTATION-STATUS → STATUS,
  SOVEREIGN-PROVISIONING + BLUEPRINT-AUTHORING → RUNBOOKS.md

CLAUDE.md NOT touched. Source files NOT deleted (orchestrator owns deletion).
No push, no PR. Manifest at /tmp/merge-D-runbooks-status-glossary-manifest.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: assemble lean doc strategy — delete legacy sources, move ledger/sessions/archive, ADR-0004, rewrite cross-refs

Per founder direction 2026-05-20 + user-global ~/.claude/CLAUDE.md §11.

This is the orchestrator commit on top of the four cherry-picked consolidation
commits (ARCHITECTURE, PRINCIPLES, DOD, RUNBOOKS+STATUS+GLOSSARY). It:

1. Deletes 15 legacy source docs (now folded into the 7 canonical):
   PLATFORM-TECH-STACK, NAMING-CONVENTION, EPICS-1-6-unified-design,
   BOOTSTRAP-KIT-EXPANSION-PLAN, INVIOLABLE-PRINCIPLES, ANTI-PATTERN-CATALOG,
   5-PILLAR-DOD, DOMAINS-CANON, SOVEREIGN-MULTI-REGION-DOD,
   PERSONAS-AND-JOURNEYS, BLUEPRINT-AUTHORING, CHART-AUTHORING,
   DEMO-RUNBOOK, RUNBOOK-OPERATIONS, RUNBOOK-PROVISIONING.

2. Moves transient + historical docs into proper subdirs:
   - docs/ledger/{TRUST,TRACKER}.md (cron-refreshed live state)
   - docs/sessions/{2026-05-17-convergence,2026-05-19-20-trust-recovery,
     2026-05-20-trust-audit,2026-05-20-walk-runbook}.md
   - docs/archive/{validation-log,orchestrator-state,omantel-handover-wbs}.md

3. Adds docs/adr/0004-cnpg-sync-replication.md (Pillar 3 zero-tx-loss decision)
   + docs/adr/README.md index.

4. Updates CLAUDE.md reading-order + repo-structure block to match the
   lean strategy and current core/ tree (controllers/, marketplace/, etc.).

5. Sweeps all .md files + .github/workflows + scripts to repoint old doc
   paths to the new canonical homes. ADR cross-references kept intact
   (ADRs are immutable historical artifacts).

Operator-side cron scripts that still write to the old paths
(/home/openova/bin/refresh-dod-dashboard.sh, refresh-wbs.sh and
openova-private/bin/trust-audit.sh) need a one-line path update —
flagged in the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(bootstrap-kit): update repo-root sentinel to docs/PRINCIPLES.md

The bootstrap-kit Go test used `docs/INVIOLABLE-PRINCIPLES.md` as its
repo-root sentinel; the file no longer exists after the lean-doc
consolidation (it's now `docs/PRINCIPLES.md`). Update the walker to
match the new canonical filename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 14:40:01 +04:00
hatiyildiz
6d38089895 deploy(bp-harbor): bump bootstrap-kit pin -> 1.2.19 + blueprint.yaml lockstep (auto, Refs TBD-A6 + TBD-A20, retry 2) 2026-05-19 04:03:38 +00:00
e3mrah
59980125ed
fix(networkpolicy): egress to CNPG data-plane Pods, not cnpg-system operator NS (TBD-A39, Closes #1901) (#1911)
The CNPG operator runs in the `cnpg-system` namespace, but the actual
Postgres workload Pods reconcile into the same namespace as the CNPG
`Cluster` CR — for the auto-provisioned-DB blueprints that's
`.Release.Namespace` (e.g. `newapi`, `harbor`). A NetworkPolicy egress
rule that namespace-selects on `cnpg-system` reaches the operator pods
only, NOT the Postgres workloads — every 5432 connection times out.

Verified live on t31: `newapi-bp-newapi-newapi-pg-1` runs in `newapi`
ns with label `cnpg.io/cluster=newapi-bp-newapi-newapi-pg`, while
`newapi-bp-newapi-…` is stuck 1/2 Ready with 20 restarts because its
egress NP allows 5432 only to `cnpg-system`.

Fix: every affected NP now selects the Postgres workload Pods by the
operator-emitted `cnpg.io/cluster=<clusterName>` Pod label — namespace-
agnostic, survives the operator namespace being different from the
data-plane namespace.

Charts fixed (4):

  - bp-newapi (1.4.22 → 1.4.23) — auto-provisions CNPG Cluster in
    `.Release.Namespace`. Removed the bogus `namespaceLabel: cnpg-system`
    egress entry from values.yaml; added a podSelector-based rule
    (cnpg.io/cluster=<release>-bp-newapi-newapi-pg) directly in the
    template, gated by `.Values.cnpg.enabled`.

  - bp-harbor (1.2.17 → 1.2.18) — Cluster CR in
    `postgres.cluster.namespace | default .Release.Namespace` (default
    `harbor`). Changed egress from namespaceSelector=cnpg to
    podSelector cnpg.io/cluster=<postgres.cluster.name|default harbor-pg>.

  - bp-matrix (1.0.0 → 1.0.1) — chart points at
    matrix-postgres-rw.matrix.svc.cluster.local (Cluster CR in
    `.Release.Namespace`). Replaced `cnpgNamespace` value with
    `cnpgClusterName` (default `matrix-postgres`) and switched egress
    rule to podSelector.

  - bp-openmeter (1.0.0 → 1.0.1) — operator-supplied CNPG endpoint
    pattern. Replaced `cnpgNamespace` with `cnpgClusterName` (default
    `openmeter-pg`) and switched egress rule to podSelector. Same
    pattern as matrix.

Audited and clean:

  - bp-cnpg-pair: already uses podSelectors throughout.
  - bp-wordpress-tenant: cnpgNamespaceLabel="" path resolves to
    `.Release.Namespace` via the `cnpgNamespace` helper.
  - bp-llm-gateway: already pod-selects on
    `cnpg.io/cluster=bp-llm-gateway-audit`.
  - bp-keycloak / bp-gitea / bp-grafana / bp-mimir: no own
    networkpolicy.yaml template (grafana/mimir pass enabled=false
    to upstream subcharts).

Validation:

  - helm template render clean for all 4 charts.
  - `kubectl apply --dry-run=server` on t31 — all 4 NetworkPolicies
    accepted by the API server.
  - Verbatim render confirms the auto-emitted cluster name matches the
    label on the existing CNPG Pod (newapi-bp-newapi-newapi-pg).

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-05-19 08:02:59 +04:00
e3mrah
0a45a790e7
fix: omit HTTPRoute sectionName across blueprint charts — match PR #1888 pattern (Closes #1902) (#1909)
PR #1888 (TBD-A30) fixed catalyst-system HTTPRoutes for multi-zone
Sovereigns whose Cilium Gateway renames HTTPS listeners from `https` to
`https-<sanitised-zone>` (e.g. `https-omani-works`, `https-omani-homes`)
when more than one parent zone is enabled. Every public HTTPRoute pinned
to `sectionName: https` got `Accepted=False NoMatchingListener` and the
hosted service 404'd / connection-refused.

That fix only touched products/catalyst/chart. Per-blueprint HTTPRoutes
shipped the same `sectionName: https` default in values.yaml, so on a
multi-zone Sovereign every blueprint route — gitea, grafana, harbor,
keycloak, newapi, openbao, powerdns, stalwart-tenant — silently failed
to attach. TBD-A40 / issue #1902.

Sweep verbatim:

  $ git grep -nE 'sectionName:[[:space:]]+(https|"https")[[:space:]]*$' \
      platform/*/chart/ products/ clusters/ core/ 2>/dev/null \
      | grep -v 'platform/gateway-api/chart/templates'
  platform/gitea/chart/values.yaml:168:    sectionName: https
  platform/grafana/chart/values.yaml:124:    sectionName: https
  platform/harbor/chart/values.yaml:437:    sectionName: https
  platform/keycloak/chart/values.yaml:482:    sectionName: https
  platform/newapi/chart/values.yaml:721:      sectionName: https
  platform/openbao/chart/values.yaml:72:    sectionName: https
  platform/powerdns/chart/values.yaml:407:      sectionName: https
  platform/stalwart-tenant/chart/values.yaml:297:      sectionName: https
  products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go:802:        sectionName: https

Fix (Option C — omit sectionName, same as PR #1888):

  - 8 blueprint values.yaml defaults flipped from `sectionName: https` to
    `sectionName: ""`. The chart templates already guard with `{{- with
    .Values.gateway.parentRef.sectionName }}`, so a blank value drops the
    field entirely and Cilium Gateway matches by hostname filter.

  - platform/newapi/chart/templates/httproute.yaml was the outlier: it
    used `default "https" $parent.sectionName` which fell back to `https`
    even when values.yaml said empty. Rewritten to `{{- with
    $parent.sectionName }}` so empty drops the field — same pattern as
    the other 7 blueprints.

  - products/catalyst/bootstrap/api/internal/handler/sme_tenant_gitops.go
    renders a per-tenant bp-keycloak HelmRelease and injected
    `sectionName: https` into spec.values. Flipped to `sectionName: ""`
    so the bp-keycloak chart's `{{- with }}` guard drops the field.

Validation (real `helm template`, default values, gateway enabled, no
sectionName override) — Principle #15:

  gitea            : sectionName lines in rendered output = 0
  grafana          : sectionName lines in rendered output = 0
  harbor           : sectionName lines in rendered output = 0
  keycloak         : sectionName lines in rendered output = 0
  openbao          : sectionName lines in rendered output = 0
  powerdns         : sectionName lines in rendered output = 0
  newapi           : sectionName lines in rendered output = 0
  stalwart-tenant  : sectionName lines in rendered output = 0

Override path preserved — `--set ...parentRef.sectionName=https-omani-works`
on each chart renders `sectionName: "https-omani-works"` correctly,
so operators on single-zone clusters or non-Cilium gateways can still
pin explicitly via bootstrap-kit overlay.

helm lint clean on all 8 blueprint charts (newapi cnpg-cluster.yaml lint
error is pre-existing on origin/main, unrelated to this fix).

Chart bumps (each blueprint also bumps blueprint.yaml spec.version per
#817 lockstep):
  bp-gitea            1.2.7  -> 1.2.8
  bp-grafana          1.0.1  -> 1.0.2
  bp-harbor           1.2.17 -> 1.2.18
  bp-keycloak         1.4.5  -> 1.4.6
  bp-newapi           1.4.22 -> 1.4.23
  bp-openbao          1.2.16 -> 1.2.17
  bp-powerdns         1.2.3  -> 1.2.4
  bp-stalwart-tenant  0.1.2  -> 0.1.3

Refs TBD-A40.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 07:57:12 +04:00
e3mrah
cf35b4a9b6
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
  * New step `bump_blueprint` after `bump_pin` — locates
    ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
    (handles both platform-leaf and products-umbrella conventions),
    filters to kind:Blueprint (defensive against CRD yaml at the
    products/catalyst/chart/crds path), reads current spec.version at
    2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
  * Commit step renamed to "Commit + push bootstrap-kit pin bump +
    blueprint.yaml lockstep"; stages both files, single commit, with
    convergent retry on conflict.
  * Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
  * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
    platform/* and products/*, discovers every Blueprint manifest with
    a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
    Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
    existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
  * Failure messages name the file, drift direction, and the exact sed
    command to fix — drift remediation is mechanical.

Drift cleanup (mandatory companion, same shape as A17/#1855):
  26 Application-Blueprint blueprints whose spec.version had been left
  at 1.0.0 / 0.1.0 while Chart.yaml moved forward — synced down to
  Chart.yaml as authoritative. All currently surface in the new sweep
  test; without the cleanup the test would block this PR (and every
  subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
  cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
  falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
  netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
  trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:22 +04:00
e3mrah
3d929e69d7
fix(httproute): collapse double-prefix when releaseName contains chart name (gitea/harbor/openbao 500/404) (#1483)
* fix(tls): cilium-gateway-cert STAGING/PROD issuer selectable via tofu

clusters/_template/sovereign-tls/cilium-gateway-cert.yaml hardcoded
letsencrypt-dns01-prod-powerdns regardless of qa_test_session_enabled.
On high-cadence QA reprov cycles this hits the LE PROD 5/168h rate
limit (caught on prov #76 at 13:45 UTC, retry-after 16:49 UTC) and
the wildcard Certificate sticks Ready=False — Cilium Gateway has no
valid TLS secret → envoy listener never binds → public TLS handshake
to console.<fqdn> dies with SSL_ERROR_SYSCALL.

Add tofu local.wildcard_cert_issuer = qa_test_session_enabled ?
staging : prod. Thread WILDCARD_CERT_ISSUER through the sovereign-
tls Kustomization postBuild.substitute. cilium-gateway-cert.yaml
references it as ${WILDCARD_CERT_ISSUER}.

Default behaviour unchanged for non-QA (production) Sovereigns —
they still resolve to letsencrypt-dns01-prod-powerdns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cilium-gateway): allow world ingress to Cilium Gateway reserved:ingress endpoint

When Cilium Gateway API runs with gatewayAPI.hostNetwork.enabled=true and
a default-deny CCNP is present, every public request to a Sovereign host
(console, auth, gitea, registry, api, ...) hits the gateway listener and
gets DENIED at envoy's cilium.l7policy filter with:

    cilium.l7policy: Ingress from 1 policy lookup for endpoint X for port 30443: DENY

Public response: HTTP/1.1 403 Forbidden, body "Access denied", server: envoy.

Root cause: Cilium creates a special endpoint with identity reserved:ingress (8)
representing the gateway listener. By default this endpoint has
policy-enabled=both with allowed-ingress-identities=[1 (host)] and empty
L4 rules — so no port is permitted. The default-deny CCNP's NotIn-namespace
endpointSelector does NOT cover this endpoint (it has no
io.kubernetes.pod.namespace label), and our qa-fixtures didn't ship a
matching allow-template for it. Net effect: TLS handshake succeeds, HTTPRoutes
are Programmed, backends are healthy in-cluster, but every request 403s.

Caught live on prov #80 (omantel.biz, 2026-05-14) after the Gateway hostNetwork
fix (#1480) finally activated host-bind on :30443. Verified by:
- envoy debug log: cilium.l7policy DENY for endpoint 10.42.0.201 port 30443
- cilium-dbg endpoint get 3282 -o json: l4.ingress: [] and allowed-ingress-identities: [1]
- transiently applying the same CCNP via kubectl: console.omantel.biz → 200

Fix: ship a CCNP scoped to reserved:ingress that allows ingress from world,
cluster, host, remote-node (multi-region CP-to-CP), and kube-apiserver,
plus egress to all so envoy can forward to any backend service. This is
the canonical Cilium hostNetwork Gateway-API zero-trust pattern.

Chart bump: catalyst 1.4.142 → 1.4.143.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(httproute): match upstream chart fullname-collapse when releaseName contains chart name

Three Sovereign-facing HTTPRoute templates (gitea, harbor, openbao) had
backend defaults hardcoded as `<release>-<chart>-<resource>` (e.g.
`gitea-gitea-http`, `harbor-harbor-core`, `openbao-openbao`). The
upstream subcharts use a `<chart>.fullname` helper that COLLAPSES the
prefix when `.Release.Name` already contains the chart name — i.e. when
the bootstrap-kit releaseName is the chart name (the convention), the
live Service is `<release>-<resource>` (or just `<release>` for openbao),
not `<release>-<chart>-<resource>`.

Effect on prov #80 (omantel.biz):
- gitea/gitea HTTPRoute → backendRef `gitea-gitea-http` (does not exist; live is `gitea-http`) → BackendNotFound → gitea.omantel.biz returns HTTP 500
- harbor/harbor HTTPRoute → `harbor-harbor-core` (live is `harbor-core`) → registry.omantel.biz returns HTTP 500
- openbao/openbao HTTPRoute → `openbao-openbao` (live is `openbao`) → bao.omantel.biz dead

Fix: replicate the upstream chart's `.fullname` collapse logic via
`(ternary .Release.Name (printf "%s-<chart>" .Release.Name) (contains "<chart>" .Release.Name))` so the default backend always matches
the live Service name regardless of releaseName choice. Operators retain
the `gateway.backendService` override for non-standard release names.

Chart bumps: bp-gitea 1.2.6 → 1.2.7, bp-harbor 1.2.16 → 1.2.17, bp-openbao 1.2.14 → 1.2.15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: e3mrah <catalyst@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
2026-05-14 19:00:07 +04:00
e3mrah
74d23ab3dc
fix(charts): explicit harbor.openova.io/proxy-dockerhub prefix on all chart-hook images (#163) (#1367)
Per CLAUDE.md MIRROR-EVERYTHING inviolable rule: every chart-hook
image reference (pre/post-install Jobs, helper Pods) must use the
explicit Harbor proxy-cache form. Fix #158's bitnami → bitnamilegacy
swap was a band-aid; the architecturally correct fix is to defeat
upstream-deletion blast radius entirely by routing through Harbor.

The node-level containerd mirror in infra/hetzner/cloudinit-control-
plane.tftpl (line 706) already redirects docker.io/* →
harbor.openova.io/proxy-dockerhub/* implicitly, but implicit routing:
  - Hides the routing from SBOM scans
  - Bypasses the Kyverno harbor-proxy-pull ClusterPolicy
  - Means a chart audit (`grep docker.io`) misses a real dependency
  - Was the proximate cause of prov #27 wedging when Bitnami deleted
    docker.io/bitnami/kubectl:1.30.4 (Fix #158 had to chase the
    deletion mid-flight instead of being insulated by Harbor cache)

19 chart-hook image: refs + 5 chart values.yaml repository: defaults
now carry the explicit harbor.openova.io/proxy-dockerhub prefix.
Application/subchart images (keycloak, postgresql, mongodb in
keycloak+litmus subcharts) are intentionally out of scope for this
PR — those go through the node-level containerd mirror still.

Affected blueprints + chart version bumps:
  bp-cert-manager            1.2.1  -> 1.2.2
  bp-external-secrets-stores 1.0.4  -> 1.0.5
  bp-crossplane-claims       1.1.4  -> 1.1.5
  bp-flux                    1.2.1  -> 1.2.2
  bp-guacamole               0.1.16 -> 0.1.17
  bp-self-sovereign-cutover  0.1.28 -> 0.1.29
  bp-k8s-ws-proxy            0.1.9  -> 0.1.10
  bp-harbor                  1.2.15 -> 1.2.16
  bp-gitea                   1.2.5  -> 1.2.6
  bp-newapi                  1.4.5  -> 1.4.6
  bp-wordpress-tenant        0.2.0  -> 0.2.1
  catalyst-platform          1.4.138 -> 1.4.139

Co-authored-by: e3mrah <1234567+e3mrah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 11:32:21 +04:00
e3mrah
890fa67eff
fix(bp-harbor): inline labels on admin Secret to drop duplicate keys (#949) (#950)
PR #947 (bp-harbor 1.2.14) added templates/admin-secret.yaml that
included the canonical bp-harbor.labels helper AND re-declared
app.kubernetes.io/name + catalyst.openova.io/component with admin-
credential-specific values. Helm's strict YAML post-render parser
rejected the rendered manifest with `mapping key
"app.kubernetes.io/name" already defined at line 8`, blocking the
upgrade chain on otech113 — bp-self-sovereign-cutover dependsOn
bp-harbor and re-blocked, stalling cutover indefinitely.

Per the issue's recommended Option A, labels are inlined verbatim
on the admin Secret. Every key the helper would emit is reproduced
explicitly, except the two that need a Secret-specific value
(catalyst.openova.io/component=harbor-admin) plus an explicit
admin-credentials sub-component label.

A regression guard (Case 6) is added to tests/admin-secret.sh: the
rendered Secret block is parsed through PyYAML's safe_load_all,
which enforces mapping-key uniqueness the same way Helm's post-
render does. Duplicate keys raise and break the test.

Bumps:
  - platform/harbor/chart/Chart.yaml    1.2.14 → 1.2.15
  - clusters/_template/bootstrap-kit/19-harbor.yaml  slot pin

Verification (all green locally):
  helm template smoke . --namespace harbor   # renders OK
  bash tests/admin-secret.sh                 # 6 gates green
  helm lint .                                # 0 failed

Closes one half of #949 (bp-harbor side); the slot pin update
delivers it to fresh Sovereigns; existing otech113 picks up the
upgrade on next Flux reconcile after the new chart publishes.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-05 15:19:17 +04:00
e3mrah
88a8ecd8bb
fix(cutover): Reflector-mirror harbor-admin Secret + in-cluster trigger endpoint (#935) (#947)
Two bugs surfaced live on otech113 2026-05-05 blocking Self-Sovereignty
Cutover end-to-end. Fix both in lockstep:

Bug 1 — bp-self-sovereign-cutover Step 02 (harbor-projects) Job in
`catalyst` namespace was hitting `secret "harbor-core" not found` for
11+ retries because the upstream Harbor `harbor-core` Secret only
exists in the `harbor` namespace and Kubernetes forbids cross-namespace
secretKeyRef. Step 02 was stuck in CreateContainerConfigError forever.

  Fix: bp-harbor 1.2.13 → 1.2.14 ships a Catalyst-curated `harbor-admin`
  Secret in the `harbor` namespace with Reflector mirror annotations
  (allowed-namespaces=catalyst, auto-enabled). The same Secret name
  auto-materialises in `catalyst` so the cutover Job's secretKeyRef
  resolves natively. Password is randomly generated on first install
  (32-char alphanum, 190 bits entropy per feedback_passwords.md) and
  preserved across reconciles via `lookup`. The upstream Harbor subchart
  consumes it via `existingSecretAdminPassword: harbor-admin`.
  bp-self-sovereign-cutover 0.1.16 → 0.1.17 updates
  `harbor.adminSecretRef.name` from `harbor-core` to `harbor-admin`.

Bug 2 — The 0.1.16 auto-trigger Helm post-install Job (#933) POSTed
/api/v1/sovereign/cutover/start which sits behind RequireSession
middleware. The Job has no human session cookie — every request 401'd
forever and cutover never started.

  Fix: new catalyst-api endpoint POST /api/v1/internal/cutover/trigger
  lives OUTSIDE RequireSession and validates the bearer token via the
  apiserver's TokenReview API + checks the resolved username matches
  the canonical `bp-self-sovereign-cutover-runner` SA. Same engine,
  same idempotency, same state machine — different auth surface.
  The auto-trigger Job now mounts its projected SA token at
  /var/run/secrets/kubernetes.io/serviceaccount/token and sends it
  as `Authorization: Bearer <token>`. SA username + accepted list are
  runtime-overridable per Inviolable Principle #4.

Tests
  - 6 Go unit tests for HandleCutoverInternalTrigger covering happy
    path, missing bearer (401), TokenReview rejection (502), wrong SA
    (403), idempotency (no Jobs created when complete), wrong method
    (405). All pass.
  - bp-harbor admin-secret contract test (5 cases) — Secret renders,
    HARBOR_ADMIN_PASSWORD key present, Reflector annotations, keep
    policy, upstream consumes via existingSecretAdminPassword.
  - bp-self-sovereign-cutover cutover-contract test extended with 3
    new cases — auto-trigger uses /internal/cutover/trigger, sends
    SA bearer token, references harbor-admin (not harbor-core).
  - All 12 cutover-contract gates green; all 4 observability-toggle
    gates green; helm template + helm lint clean on both charts.

Bootstrap-kit slot pins
  - clusters/_template/bootstrap-kit/19-harbor.yaml: 1.2.13 → 1.2.14
  - clusters/_template/bootstrap-kit/06a-bp-self-sovereign-cutover.yaml:
    0.1.16 → 0.1.17

Closes #935

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:12:50 +04:00
e3mrah
6baf7e56e7
fix(bp-harbor): grep-oE for password (multi-line tolerant) (chart 1.2.13) (#651)
Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:26:23 +04:00
e3mrah
d519dc8ba2
fix(bp-harbor): switch sync Job to curl-against-apiserver (chart 1.2.12) (#650)
rancher/kubectl is distroless (no /bin/sh) so the inline shell script
can't run. Replace with curlimages/curl which has alpine sh + curl.
Talk to k8s API directly via the in-pod ServiceAccount token. The
PATCH merges password + HARBOR_DATABASE_PASSWORD into the existing
pre-install-hook Secret without touching annotations.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:15:23 +04:00
e3mrah
08432b540e
fix(bp-harbor): switch sync Job to rancher/kubectl (chart 1.2.11) (#649)
bitnami/kubectl moved to sha256-only tags; bitnami/kubectl:1.31.4
returns 'not found' from Docker Hub. rancher/kubectl is always
available on k3s clusters. Bumps chart 1.2.10 -> 1.2.11.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 09:04:15 +04:00
e3mrah
de51fa3f7a
fix(bp-harbor): post-install Job copies CNPG password (chart 1.2.10) (#648)
* fix(wizard): SOLO default CPX42 → CPX52 (8→12 vCPU / 16→24 GB)

CPX42 fit 30/40 HRs on otech29 but keycloak-keycloak-config-cli
post-upgrade Job sat Pending 8h with 'Insufficient cpu' — 35-component
bootstrap-kit + post-install hooks at peak exceed 8 vCPU. CPX52 (12
vCPU / 24 GB / €36/mo) is the smallest SKU that schedules every default
Pod on one node.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* test(bp-openbao): align Case-4 expectation with #600 RBAC-hook removal

Commit b1a25c42 (#600) removed the helm.sh/hook-delete-policy from the
auto-unseal SA/Role/RoleBinding so Helm does NOT reap them mid-install
(the old hook-succeeded clause caused the SA to disappear before the
init Job could mount its token). The chart-test still expected ≥5
before-hook-creation,hook-succeeded annotations (3 RBAC + 2 Jobs).

Result: Blueprint Release for #600 (run 25251129679) failed at the test
gate — bp-openbao 1.2.6 was NEVER published to GHCR, even though main
already references it. otech30 caught this live: bp-openbao HR stuck
with 'oci://ghcr.io/openova-io/bp-openbao:1.2.6: not found'.

Update the test to expect ≥2 (Jobs only). Re-publish gets bp-openbao
1.2.6 onto GHCR.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-harbor): replace Reflector race with deterministic post-install Job (chart 1.2.10)

bp-harbor's harbor-database-secret relied on Reflector copying from CNPG-
emitted harbor-pg-app via a 'reflects:' destination annotation. On every
fresh Sovereign Reflector logs once at install:
    Could not update harbor/harbor-database-secret —
    Source harbor/harbor-pg-app could not be found
and never refires when CNPG creates the source ~30s later. Even with
'auto-enabled: true' on the source's inheritedMetadata, Reflector's
auto-reflect copies the SOURCE name (harbor-pg-app), not the explicit
destination harbor-database-secret. Result: harbor-database-secret stays
empty forever; harbor-core CrashLoops with 'couldn't find key password
in Secret harbor/harbor-database-secret'. Caught live on otech26-30.

Replace with a Helm post-install/post-upgrade Job that:
  - polls for harbor-pg-app to exist (CNPG provisions it ~30-60s after
    Cluster Ready)
  - copies password into harbor-database-secret with both 'password'
    and 'HARBOR_DATABASE_PASSWORD' keys
  - exits 0; Helm marks the hook complete

The Job is idempotent (re-running on upgrade overwrites identically)
and deterministic (no event-watcher race). The placeholder Secret stays
in place so kubectl-get returns Found before the Job runs.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 08:52:54 +04:00
e3mrah
8bb66fe43e
fix(bp-{harbor,gitea,powerdns}): bp-cnpg dependsOn + Reflector auto-enabled (#644)
* fix(infra): break tofu cycle — resolve CP public IP at boot via metadata service

PR #546 (Closes #542) introduced a dependency cycle:
  hcloud_server.control_plane.user_data → local.control_plane_cloud_init
  local.control_plane_cloud_init → hcloud_server.control_plane[0].ipv4_address

`tofu plan` failed with:
  Error: Cycle: local.control_plane_cloud_init (expand), hcloud_server.control_plane

Caught live during otech23 first-end-to-end provisioning attempt.

Fix: stop templating `control_plane_ipv4` at plan time. cloud-init runs ON
the CP node, so it resolves its own public IPv4 at boot via Hetzner's
metadata service:
  curl http://169.254.169.254/hetzner/v1/metadata/public-ipv4

Same observable behavior as #546 (kubeconfig server: rewritten to CP public
IP, not LB IP — preserves the wizard-jobs-page-not-stuck-PENDING fix), with
no graph cycle.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra+api): wire handover_jwt_public_key end-to-end

The OpenTofu cloud-init template references ${handover_jwt_public_key}
(infra/hetzner/cloudinit-control-plane.tftpl:371) and variables.tf declares
the variable, but neither side wires it:
  - main.tf templatefile() call did not pass the key → "vars map does not
    contain key handover_jwt_public_key" on tofu plan
  - provisioner.writeTfvars never set the var → empty even when wired

Caught live during otech23 provisioning, immediately after the tofu-cycle
fix landed. tofu plan failed with:

  Error: Invalid function argument
    on main.tf line 170, in locals:
      170:   control_plane_cloud_init = replace(templatefile(...
    Invalid value for "vars" parameter: vars map does not contain key
    "handover_jwt_public_key", referenced at
    ./cloudinit-control-plane.tftpl:371,9-32.

Fix:
  - main.tf templatefile() now passes handover_jwt_public_key = var.handover_jwt_public_key
  - provisioner.Request gains a HandoverJWTPublicKey field (json:"-",
    server-stamped, never accepted from client JSON)
  - handler.CreateDeployment stamps it from h.handoverSigner.PublicJWK()
    when the signer is configured (CATALYST_HANDOVER_KEY_PATH set)
  - writeTfvars emits the value into tofu.auto.tfvars.json

variables.tf default "" preserves the no-signer path: cloud-init writes
an empty handover-jwt-public.jwk and the new Sovereign is provisioned
without the handover-validation surface (handover flow simply not wired
on that Sovereign — degraded gracefully, not a hard failure).

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(api): cloud-init kubeconfig postback must live outside RequireSession

The PUT /api/v1/deployments/{id}/kubeconfig route was registered inside the
RequireSession-gated chi.Group, so every cloud-init postback was rejected
with HTTP 401 {"error":"unauthenticated"} before PutKubeconfig could run.
Cloud-init has no browser session cookie — it authenticates with the
SHA-256-hashed bearer token PutKubeconfig already verifies internally.

Result on otech23: Phase 0 finished (Hetzner CP + LB up), but every
cloud-init `curl --retry 60 -X PUT ... /kubeconfig` returned 401 unauth.
catalyst-api never received the kubeconfig, Phase 1 helmwatch never
started, the wizard's Jobs page stayed in PENDING forever.

Fix: register the PUT outside the auth group so cloud-init's
bearer-hash auth path is the only gate. The matching GET stays inside
session auth — the operator's "Download kubeconfig" button needs the
session cookie.

Caught live during otech23 first end-to-end provisioning. Per the
new "punish-back-to-zero" rule, otech23 was wiped (Hetzner + PDM +
PowerDNS + on-disk state) and the next provision will use otech24.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(catalyst-api): wire harbor_robot_token through to tofu — never pull from docker.io

PR #557 added the registries.yaml mirror in cloudinit-control-plane.tftpl
and declared var.harbor_robot_token in infra/hetzner/variables.tf with a
default of "". The catalyst-api side never set it, so every Sovereign so
far provisioned with an empty token in registries.yaml — containerd's
auth to harbor.openova.io's proxy projects failed silently and pulls
fell through to docker.io. On a fresh Hetzner IP, Docker Hub returns
rate-limit HTML and:

  Failed to pull image "rancher/mirrored-pause:3.6":
    unexpected media type text/html for sha256:...

cilium / coredns / local-path-provisioner sit at Init:0/6 forever; Flux
pods stay Pending; no HelmReleases ever land; the wizard's job stream
shows everything PENDING because there's nothing to watch. Caught live
during otech24.

Wiring (mirrors the GHCRPullToken pattern):
  1. Provisioner.HarborRobotToken — read from CATALYST_HARBOR_ROBOT_TOKEN
     env at New().
  2. Stamped onto every Request in Provision() and Destroy() before
     writeTfvars.
  3. Request.HarborRobotToken — server-stamped (json:"-"); never accepted
     from the wizard payload.
  4. writeTfvars emits "harbor_robot_token" into tofu.auto.tfvars.json.
  5. api-deployment.yaml mounts the catalyst/harbor-robot-token Secret
     (mirrored from openova-harbor — Reflector-managed on Sovereign
     clusters; copied per-namespace on Catalyst-Zero contabo) as
     CATALYST_HARBOR_ROBOT_TOKEN, optional=true so degraded paths
     still come up.

variables.tf default "" preserves graceful fall-through if the operator
hasn't issued a robot token yet, and the architecture rule is now
enforced end-to-end: every image on every Sovereign goes through
harbor.openova.io.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(handler): stamp CATALYST_HARBOR_ROBOT_TOKEN before Validate() (#638 follow-up)

PR #638 added Validate() rejection for missing harbor_robot_token, but
the handler only stamped req.HarborRobotToken from p.HarborRobotToken
inside Provision() — Validate() runs in the handler BEFORE Provision()
gets the chance to stamp. Result: every wizard launch returned

  Provisioning rejected: Harbor robot token is required (CATALYST_HARBOR_ROBOT_TOKEN missing)

even though the env var is set on the Pod. Caught immediately on the
otech25 launch attempt.

Fix: same env-stamp pattern as GHCRPullToken at the top of the
CreateDeployment handler. Provisioner-level stamp in Provision() stays
as defense-in-depth.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(infra): registries.yaml needs rewrite — Harbor proxy URL is /v2/<proj>/<repo>, not /<proj>/v2/<repo>

PR #557 wrote registries.yaml with mirror endpoints like
  https://harbor.openova.io/proxy-dockerhub
hoping containerd would build URLs like
  https://harbor.openova.io/proxy-dockerhub/v2/rancher/mirrored-pause/manifests/3.6

But Harbor proxy-cache projects expose their API at
  https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
(project name lives BEFORE the image-path /v2/, not as a path prefix).
Harbor returns its SPA UI HTML (status 200, content-type text/html) for the
wrong shape; containerd then errors with:
  "unexpected media type text/html for sha256:... not found"
and pause-image / cilium / coredns pulls fail forever — caught live during
otech24 and otech25.

Fix: switch to k3s registries.yaml `rewrite` syntax. Endpoint is the bare
Harbor host; per-mirror rewrite re-maps the image path so containerd's
final URL is correctly project-prefixed. Verified manually:

  curl https://harbor.openova.io/v2/proxy-dockerhub/rancher/mirrored-pause/manifests/3.6
  -> 200 application/vnd.docker.distribution.manifest.list.v2+json

This unblocks every Sovereign image pull through the central Harbor.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-vpa): drop registry.k8s.io/ prefix from repository — upstream chart prepends it

cowboysysop/vertical-pod-autoscaler subchart prepends `.image.registry`
(default registry.k8s.io) to `.image.repository`. Catalyst's bp-vpa
overrode `repository: registry.k8s.io/autoscaling/vpa-...` so the rendered
image was `registry.k8s.io/registry.k8s.io/autoscaling/vpa-...:1.5.0` —
doubled prefix, image-not-found, ImagePullBackOff on every fresh
Sovereign. Caught live during otech26.

Fix: drop the redundant prefix. Subchart's default `.image.registry`
keeps it pointing at registry.k8s.io which the new Sovereign's
containerd routes through harbor.openova.io/v2/proxy-k8s/... via
registries.yaml rewrite (#640).

Bumps bp-vpa chart version to 1.0.1 and bootstrap-kit reference to match.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(wizard): SOLO default SKU CPX32 → CPX42 — 35-component bootstrap-kit needs 8 vCPU / 16 GB

CPX32 (4 vCPU / 8 GB) cannot fit the full SOLO bootstrap-kit on a single
node. Caught live during otech26: 38 pods Running, 34 pods stuck Pending
indefinitely with "Insufficient cpu" — Cilium + Crossplane + Flux +
cert-manager + CNPG + Keycloak + OpenBao + Harbor + Gitea + Mimir +
Loki + Tempo + … each request 50-500m vCPU and the node hits 100%
allocatable before half the workloads schedule.

CPX42 (8 vCPU / 16 GB / 320 GB SSD) at €25.49/mo is the smallest size
that fits the bootstrap-kit with VPA-recommendation headroom. Operators
can still pick CPX32 explicitly if they trim the component set on
StepComponents — but the default SOLO path now provisions a node
that actually boots into a steady state.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-cert-manager-dynadot-webhook): pin SHA tag + add ghcr-pull imagePullSecret (chart 1.1.2)

- Replace forbidden `:latest` tag with current short-SHA `942be6f` per
  docs/INVIOLABLE-PRINCIPLES.md #4.
- Add default `webhook.imagePullSecrets: [{name: ghcr-pull}]` so kubelet
  authenticates against private ghcr.io/openova-io/openova/* via the
  Reflector-mirrored `ghcr-pull` Secret in cert-manager namespace.
  Without this, the webhook Pod was stuck ErrImagePull/ImagePullBackOff
  on every Sovereign — caught live during otech27.
- Bumps chart version 1.1.1 -> 1.1.2 and bootstrap-kit reference.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

* fix(bp-{harbor,gitea,powerdns}): add bp-cnpg dependency + Reflector auto-enabled

Two related Phase-8a stragglers diagnosed live during otech28:

1. bp-powerdns missed bp-cnpg in dependsOn. Helm renders BEFORE
   postgresql.cnpg.io/v1 CRD is registered → templates/cnpg-cluster.yaml
   `Capabilities.APIVersions.Has` gate evaluates false → no Cluster CR
   → no pdns-pg-app Secret → powerdns Pods stuck CreateContainerConfigError
   forever ("secret pdns-pg-app not found"). Adds explicit dependsOn.

2. bp-harbor/gitea/powerdns CNPG inheritedMetadata only set
   reflection-allowed; missing reflection-auto-enabled. Reflector races
   when destination Secret (harbor-database-secret) is created BEFORE
   CNPG provisions the source (harbor-pg-app). Reflector logs
   "Source could not be found" once and never retries — leaving harbor-
   core stuck CreateContainerConfigError. Adding auto-enabled makes
   Reflector actively watch the source and re-fire when it appears.

Bumps:
  bp-harbor    1.2.8 -> 1.2.9
  bp-gitea     1.2.1 -> 1.2.2
  bp-powerdns  1.1.5 -> 1.1.7 (skips 1.1.6 which was a non-released bump)

Bootstrap-kit references updated to pull the new chart versions on
the next Sovereign provisioning.

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>

---------

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
2026-05-03 00:16:34 +04:00
e3mrah
93627ada20
fix(bp-harbor): convert harbor-database-secret to Helm pre-install hook (1.2.8) (#603)
The 1.2.7 fix dropped the `data:` block from the chart template, but
Helm's three-way merge still owns the Secret as a release resource and
resets `data: {}` (no keys) on every chart upgrade — verified on otech22
where 1.2.6→1.2.7 reconcile wiped Reflector-populated keys back to nil.

Architectural fix: convert the Secret to a Helm pre-install hook.

  - `helm.sh/hook: pre-install` — Secret is created at install time only.
    On `helm upgrade`, Helm does NOT touch the Secret (no three-way merge),
    so keys populated by Reflector persist across every chart bump.
  - `helm.sh/hook-delete-policy: before-hook-creation` — On a re-install,
    Helm deletes the previous Secret first so the hook recreates clean.
  - `helm.sh/resource-policy: keep` — `helm uninstall` does NOT delete the
    Secret (paired with hook means standard upgrade path never sees a delete).
  - Hook resources are NOT recorded in the Helm release manifest, so they're
    invisible to `helm upgrade`'s three-way merge.

Also drops the inline `data:` block (kept from 1.2.7) — Reflector still
populates everything from harbor-pg-app once CNPG bootstraps the source.

Bumps bp-harbor 1.2.7 → 1.2.8, bootstrap-kit refs (_template, otech, omantel).

Closes #585

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:57:55 +04:00
e3mrah
09208ca58f
fix(bp-harbor): omit data block in harbor-database-secret — Helm overwrite regression (1.2.7) (#602)
On every helm upgrade, Helm three-way merge resets `data.password` and
`data.HARBOR_DATABASE_PASSWORD` to "" because the chart declares them
empty in the template. After Reflector populates them from `harbor-pg-app`,
the next bp-harbor upgrade silently empties them again — harbor-core then
crashloops on the next pod restart with "password authentication failed".

Observed on otech22 after the 1.2.5→1.2.6 Flux upgrade: harbor-database-
secret.password went from 64 bytes back to 0 bytes, harbor-core entered
CrashLoopBackOff. Resolved at runtime by touching harbor-pg-app to bump
its resourceVersion and re-trigger Reflector, but the architectural fix
is needed so it doesn't recur on the next chart upgrade.

Fix: drop the entire `data:` block from templates/database-secret.yaml.
The Secret is created by Helm with no data keys (Helm owns nothing in
the data field). Reflector adds ALL keys from `harbor-pg-app` (password,
HARBOR_DATABASE_PASSWORD, username, host, dbname, jdbc-uri, etc.) on
the first SecretWatcher event after CNPG bootstraps the source. On
subsequent helm upgrades, Helm's three-way merge has nothing to overwrite
in `data:` because the chart no longer declares any keys there.

Bumps bp-harbor 1.2.6 → 1.2.7, bootstrap-kit refs (_template, otech, omantel).

Closes #585 (regression of)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:53:37 +04:00
e3mrah
8d50402038
fix(bp-harbor): remove cnpg-app-annotator Job — CNPG inheritedMetadata handles annotation (1.2.6) (#601)
The post-install Job `harbor-pg-app-annotator` (with curlimages/curl:8.7.1)
is no longer needed: bp-harbor 1.2.5 already uses CNPG's `inheritedMetadata`
stanza in cnpg-cluster.yaml to stamp `reflection-allowed: true` onto
`harbor-pg-app` at CNPG bootstrap time. The Job was causing ErrImagePull on
otech22 because Docker Hub is proxied through Harbor itself (chicken-and-egg).

Removes:
  - templates/cnpg-app-annotator-job.yaml
  - templates/cnpg-app-annotator-rbac.yaml
  - values.yaml cnpgAnnotator section

Updates database-secret.yaml comment to reflect the inheritedMetadata approach.

Bumps Chart.yaml 1.2.5 → 1.2.6, bootstrap-kit refs (_template, otech, omantel).

Closes #585

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:44:55 +04:00
e3mrah
cba1b5070a
fix(bp-gitea+harbor): use CNPG inheritedMetadata to propagate reflector annotations to pg-app Secret (#595)
The Cluster CR `metadata.annotations` are NOT propagated by CNPG onto the
generated `{name}-app` Secrets. Reflector requires the SOURCE Secret (e.g.
`gitea-pg-app`) to carry `reflection-allowed: "true"` before it will copy
data into the DESTINATION Secret (`gitea-database-secret`). On otech22 this
caused `gitea-database-secret` to stay empty indefinitely — gitea init container
failed auth with "password authentication failed for user gitea".

Fix: use CNPG's `inheritedMetadata.annotations` stanza (v1.24+) to instruct
CNPG to annotate all generated Secrets with the reflector permission annotations.
Applied to both bp-gitea (1.2.0→1.2.1) and bp-harbor (1.2.4→1.2.5) since
harbor-pg-app had the same issue.

Bootstrap-kit: bump bp-gitea chart ref 1.2.0→1.2.1 (template + otech + omantel).

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:37:48 +04:00
e3mrah
fe03b8cc42
fix(bp-harbor): use curl for CNPG annotator PATCH + add values defaults (1.2.4) (#594)
busybox wget does not support --method=PATCH (only GET/POST). The
harbor-pg-app-annotator Job silently succeeded without actually patching
harbor-pg-app, leaving harbor-database-secret empty on fresh install.

Fixes:
1. Switch cnpg-app-annotator-job.yaml from busybox:1.36.1 + wget to
   curlimages/curl:8.7.1 + curl -X PATCH. curl natively supports all
   HTTP verbs. HTTP response code checked explicitly; non-2xx exits 1
   so the Job retries instead of silently passing with no-op.
2. Add cnpgAnnotator.image stanza to values.yaml (was missing — prior
   charts defaulted via nil-safe dict fallback but the section was
   never actually written to values.yaml). Defaults to curlimages/curl:8.7.1.
3. readOnlyRootFilesystem: false (curl writes /tmp/patch-response.json
   for error diagnostics).
4. Bump chart 1.2.3 → 1.2.4.

Closes #585

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:29:45 +04:00
e3mrah
97abf9dedb
fix(bp-harbor): nil-safe image value extraction in cnpg-app-annotator Job (#593)
.Values.cnpgAnnotator.image.repository triggers nil pointer when the
values tree is partially absent in Helm's default-values render. Use
| default dict chained assignments to safely extract image repo/tag/
pullPolicy. Fixes blueprint-release smoke render failure on 1.2.3.

Closes #585

Co-authored-by: hatiyildiz <hatiyildiz@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 15:22:54 +04:00
e3mrah
74d526c276
fix: bp-gateway-api 5→10 CRDs + bp-gitea CNPG + bp-harbor CNPG race fix + DAG audit (#592)
* fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584)

The bundled Bitnami postgresql subchart pulls docker.io/bitnamilegacy/postgresql
which is unavailable (DH deprecated namespace) — gitea-postgresql-0 stuck in
ImagePullBackOff on otech22, cascading to gitea Init:CrashLoopBackOff.

Mirrors the bp-harbor pattern (PR #578): provision a CNPG Cluster CR (gitea-pg,
namespace gitea, 5Gi, pg16) + a reflector-managed gitea-database-secret, wiring
GITEA__database__PASSWD from the CNPG-generated gitea-pg-app Secret. All Bitnami
subchart config removed; postgresql.enabled: false.

Bootstrap-kit (template + otech + omantel): bump bp-gitea 1.1.2 → 1.2.0, add
dependsOn: bp-cnpg so the postgresql.cnpg.io/v1 CRD is registered before the
Capabilities gate in cnpg-cluster.yaml fires. omantel overlay migrated from
legacy ingress: to gateway: (Cilium Gateway API, issue #387).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dependency-audit): add bp-reflector (5a) to expected DAG + external-dns dep edge

bp-reflector was added to the bootstrap-kit (slot 05a) in issue #543 but was
never registered in scripts/expected-bootstrap-deps.yaml, causing the
dependency-graph-audit CI gate to error on every PR that includes this branch.
Also declare bp-reflector in bp-external-dns's depends_on to match the actual
HR file (12-external-dns.yaml dependsOn bp-reflector).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(bp-gateway-api): update CRD-count test 5→10 for experimental channel + DAG audit

Two fixes to unblock bp-gateway-api:1.1.0 OCI publish and the
dependency-graph-audit CI gate:

1. crd-render.sh: expect 10 CRDs (experimental channel) not 5.
   Chart 1.1.0 vendors experimental-install.yaml (TLSRoute, TCPRoute,
   UDPRoute, BackendLBPolicy, BackendTLSPolicy in addition to 5 standard
   CRDs) because Cilium 1.16.x checks for TLSRoute at operator startup.
   Without this fix the blueprint-release workflow for 1.1.0 fails the
   chart-test step and never pushes to GHCR — leaving all 13 dependent
   HRs stuck dependency-not-ready on every Sovereign.

2. expected-bootstrap-deps.yaml: add bp-reflector (slot 5a) and update
   bp-external-dns depends_on to include bp-reflector. bp-reflector was
   added to the bootstrap-kit in issue #543 but was missing from the
   expected DAG, causing dependency-graph-audit ERRORs on every PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: hatiyildiz <hatice@openova.io>
2026-05-02 15:20:05 +04:00
e3mrah
8d2ba0495d
fix(bp-gitea): switch to CNPG-managed postgres, drop bitnamilegacy subchart (Closes #584) (#586)
Squash merge: fix(bp-gitea) switch to CNPG-managed postgres (Closes #584)
2026-05-02 15:18:49 +04:00
e3mrah
2adc3a9493
fix(bp-harbor): CNPG database must be 'registry' not 'harbor' — matches coreDatabase (#579)
Harbor upstream always connects to a database named 'registry'
(harbor.database.external.coreDatabase default). The CNPG Cluster was
initialised with database='harbor', causing:

  FATAL: database "registry" does not exist (SQLSTATE 3D000)

Fix: change postgres.cluster.database default from 'harbor' → 'registry'
in values.yaml and cnpg-cluster.yaml template. Both the CNPG bootstrap
and Harbor's coreDatabase now use 'registry'.

Runtime fix on otech22: CREATE DATABASE registry OWNER harbor was run
against harbor-pg-1. harbor-core is now 1/1 Running.

Bump bp-harbor 1.2.1 → 1.2.2. Bootstrap-kit refs updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:21:36 +04:00
e3mrah
b647aa2561
fix(bp-harbor): provision harbor-pg CNPG cluster + database-secret (Closes #566) (#578)
Replace Helm lookup in database-secret.yaml with reflector annotation:
harbor-database-secret now reflects harbor-pg-app via
reflector.v1.k8s.emberstack.com/reflects. This fixes the race between
Helm rendering (fresh install) and CNPG cluster bootstrap — reflector
is event-driven and propagates the CNPG password within seconds of
harbor-pg-app being created, with no operator action required.

Also includes:
- templates/cnpg-cluster.yaml: harbor-pg CNPG Cluster (1 inst, 5Gi, pg16)
- values.yaml: postgres: block + database.external.host = harbor-pg-rw
- Chart 1.2.0 → 1.2.1; bootstrap-kit refs updated (_template, otech, omantel)

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-02 13:14:00 +04:00
e3mrah
06844d3a70
fix(bp-external-dns): point NetworkPolicy egress + pdns-server at powerdns ns (Closes #569) (#573)
bp-powerdns was moved to the `powerdns` namespace in PR #556/#553, but
bp-external-dns still had `powerdnsNamespace: openova-system` in its
NetworkPolicy egress rule and `--pdns-server=...openova-system...` in
extraArgs. Both pointed at the wrong namespace, blocking DNS reconciliation.

Fix:
- externalDns.networkPolicy.powerdnsNamespace: openova-system → powerdns
- extraArgs --pdns-server: ...openova-system... → ...powerdns...

Bump bp-external-dns 1.1.2 → 1.1.3. Bootstrap-kit slot 12 updated.

Co-authored-by: alierenbaysal <alierenbaysal@openova.io>
2026-05-02 12:58:24 +04:00
e3mrah
0511efbdac
feat(bp-harbor): vendor-agnostic Object Storage backend (closes #383) (#437)
Reworks bp-harbor to write blobs DIRECTLY to the cloud-provider's
native S3 endpoint (Hetzner Object Storage on Hetzner Sovereigns)
per ADR-0001 §13. Mirrors the post-#425 vendor-agnostic seam shipped
in bp-velero:1.2.0 (PR #435 / SHA 0172b9a8) 1:1.

Canonical seam used (per anti-duplication rule + docs/omantel-
handover-wbs.md §3a):
  - Sealed Secret name:   flux-system/object-storage  (NOT hetzner-prefixed)
  - Chart values block:   .Values.objectStorage.s3.{enabled,credentialsSecretName,s3.{accessKey,secretKey}}
  - Template filename:    templates/objectstorage-credentials.yaml
  - Reference impl:       platform/velero/chart/ (PR #435)

Chart changes (platform/harbor/chart/):
  - Chart.yaml: 1.0.0 → 1.1.0; description rewritten to emphasise
    cloud-direct architecture + remove SeaweedFS hard-dep claim.
  - values.yaml: REMOVED hardcoded SeaweedFS endpoint
    (http://seaweedfs-s3.seaweedfs.svc.cluster.local:8333) from
    persistence.imageChartStorage.s3.regionendpoint. Default
    type flipped to `filesystem` so contabo/dev render is clean.
    Added vendor-agnostic objectStorage block:
      objectStorage:
        enabled: false
        useExistingSecret: false
        credentialsSecretName: ""
        s3: { accessKey: "", secretKey: "" }
  - templates/objectstorage-credentials.yaml (NEW): synthesises a
    harbor-namespace Secret with REGISTRY_STORAGE_S3_ACCESSKEY +
    REGISTRY_STORAGE_S3_SECRETKEY keys (the upstream chart's
    persistence.imageChartStorage.s3.existingSecret consumption
    shape — envFrom on the registry pod). Skip-render branch
    when objectStorage.enabled=false (default).
  - templates/_helpers.tpl: added bp-harbor.objectStorageCredentialsSecretName
    helper.
  - templates/networkpolicy.yaml: egress rule retargeted from
    SeaweedFS service-namespace selector → external HTTPS:443
    (works for any cloud-native S3 endpoint without vendor coupling).
    Gated on `.Values.objectStorage.enabled`. Removed
    seaweedfsNamespace + seaweedfsS3Port overlay keys.

Per-Sovereign overlays (clusters/{_template,omantel,otech}/bootstrap-
kit/19-harbor.yaml):
  - Chart version reference bumped 1.0.0 → 1.1.0.
  - dependsOn: bp-seaweedfs REMOVED. New dependsOn = bp-cnpg + bp-cert-manager.
  - Added valuesFrom block mapping the 5 keys of flux-system/object-
    storage Secret:
      s3-bucket     → harbor.persistence.imageChartStorage.s3.bucket
      s3-region     → harbor.persistence.imageChartStorage.s3.region
      s3-endpoint   → harbor.persistence.imageChartStorage.s3.regionendpoint
      s3-access-key → objectStorage.s3.accessKey
      s3-secret-key → objectStorage.s3.secretKey
  - Inline values flip objectStorage.enabled=true,
    harbor.persistence.imageChartStorage.type=s3, and
    harbor.persistence.imageChartStorage.s3.existingSecret=harbor-
    objectstorage-credentials.

UI catalog (products/catalyst/bootstrap/ui/src/shared/constants/components.ts):
  - Harbor's `dependencies` array drops `seaweedfs`. Now ['cnpg', 'valkey'].

Validation:
  helm template default render →
    1448 lines, 5 Secrets (Harbor internal: core/jobservice/registry/
    registry-htpasswd/database — NO objectstorage-credentials),
    type=filesystem, 0 SeaweedFS references.
  helm template overlay render with objectStorage.enabled=true +
  type=s3 + bucket=omantel-harbor + region=fsn1 +
  regionendpoint=https://fsn1.your-objectstorage.com +
  existingSecret=harbor-objectstorage-credentials →
    1452 lines, 6 Secrets (5 internal + 1 objectstorage-credentials),
    type=s3 with Hetzner endpoint, registry pod envFrom wired to the
    new Secret, 0 SeaweedFS references.
  scripts/check-vendor-coupling.sh → exit 0 (no violations across
    platform/, clusters/, products/catalyst/bootstrap/{api,ui}/).
  helm lint → 0 failures.

WBS:
  §2 row 18 → 🟢 chart-released (#383).
  §9 #383 row → 🟢 chart-released narrative.
  §6 DAG: T383 moved from `class blocked` → `class done`.

Hetzner-S3 E2E deferred to Phase 8 (first omantel run).

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 18:18:37 +04:00
e3mrah
a1bd550208
fix(charts): HTTPRoute templates skip-render on missing host (was failing default-values render) (#402)
Blueprint-release for #401 failed because HTTPRoute templates use
{{- fail }} when gateway.host is not set, which trips the chart default-values
render gate in CI. Switched 6 templates from 'fail loud' to 'skip render':

  if .Values.gateway.host  →  emit HTTPRoute
  else                     →  emit nothing

The Gateway API admission already rejects HTTPRoute with empty hostnames,
so the loud-fail wasn't buying anything an operator wouldn't see at apply
time. Default-values render now produces zero HTTPRoute resources, which
is the correct shape for the upstream chart consumers that don't set
the Sovereign-only gateway block.

Files: keycloak, gitea, openbao, grafana, harbor, catalyst-platform.

Verified:
  helm template t products/catalyst/chart/ → 0 HTTPRoutes (clean)
  helm template t products/catalyst/chart/ --set ingress.gateway.enabled=true --set ingress.hosts.console.host=console.test --set ingress.hosts.api.host=api.test → 2 HTTPRoutes

Closes the blueprint-release failure on commit abf01b6f.

Co-authored-by: hatiyildiz <hati@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:23:58 +04:00
e3mrah
abf01b6f21
feat(platform): Gateway API migration audit (#387) (#401)
Migrates every minimal-Sovereign-set blueprint chart from
networking.k8s.io/v1.Ingress to gateway.networking.k8s.io/v1.HTTPRoute,
replacing the legacy Traefik-on-Sovereigns assumption with the canonical
Cilium + Envoy + Gateway API path per ADR-0001 §9.4 and the WBS §2
correction note (#388).

The single per-Sovereign Gateway is added as additional documents in
the existing bootstrap-kit slot clusters/_template/bootstrap-kit/01-cilium.yaml
(NOT a new top-level slot), since Cilium owns the GatewayClass. It
includes:

  - Certificate `sovereign-wildcard-tls` requesting `*.${SOVEREIGN_FQDN}`
    from `letsencrypt-dns01-prod` (cert-manager + #373 webhook)
  - Gateway `cilium-gateway` in `kube-system` with HTTPS (443, TLS
    terminate) + HTTP (80) listeners, allowedRoutes.namespaces.from=All

Per-blueprint HTTPRoute templates (canonical seam: each wrapper chart's
existing `templates/` directory):

  | Blueprint           | Host pattern                    | Backend port |
  |---------------------|---------------------------------|--------------|
  | bp-keycloak         | auth.<sov>                      | 80           |
  | bp-gitea            | git.<sov>                       | 3000         |
  | bp-openbao          | bao.<sov>                       | 8200         |
  | bp-grafana          | grafana.<sov>                   | 80           |
  | bp-harbor           | registry.<sov>                  | 80           |
  | bp-powerdns         | pdns.<sov>/api  (dual-mode)     | 8081         |
  | bp-catalyst-platform| console.<sov>, api.<sov>         | 80, 8080     |

bp-powerdns supports both Ingress (contabo legacy) and HTTPRoute
(Sovereign) simultaneously — the per-Sovereign overlay sets
`api.gateway.enabled=true` while leaving `api.enabled=true`. The
Ingress object is harmless on Cilium clusters with no Traefik. This
preserves contabo's existing pdns.openova.io flow per ADR-0001 §9.4.

bp-harbor flips `expose.type` from `ingress` to `clusterIP` in
platform/harbor/chart/values.yaml so the upstream chart no longer
emits its own Ingress; the HTTPRoute is the sole HTTP exposure.
TLS terminates at the Gateway (wildcard cert) rather than per-host
Certificates inside the chart.

bp-catalyst-platform's `templates/httproute.yaml` is NOT excluded by
.helmignore (unlike templates/ingress.yaml + templates/ingress-console-tls.yaml,
which remain contabo-only legacy demo infra). The contabo path keeps
serving console.openova.io/sovereign via Traefik unchanged.

Bootstrap-kit slot updates (per-Sovereign hostname interpolation):

  - 08-openbao.yaml      → gateway.host: bao.${SOVEREIGN_FQDN}
  - 09-keycloak.yaml     → gateway.host: auth.${SOVEREIGN_FQDN}
  - 10-gitea.yaml        → gateway.host: gitea.${SOVEREIGN_FQDN}
  - 11-powerdns.yaml     → api.host: pdns.${SOVEREIGN_FQDN}, api.gateway.enabled: true
  - 19-harbor.yaml       → gateway.host: registry.${SOVEREIGN_FQDN}
  - 25-grafana.yaml      → gateway.host: grafana.${SOVEREIGN_FQDN}

Server-side dry-run validation against the live Cilium Gateway API
CRDs on contabo: every HTTPRoute and the per-Sovereign Gateway
+ Certificate apply cleanly via `kubectl apply --dry-run=server`.

Contabo unaffected: clusters/contabo-mkt/* not modified. The legacy
SME ingresses (console-nova, marketplace, admin, axon, talentmesh,
stalwart, ...) continue to serve via Traefik as before. powerdns
on contabo remains on the Ingress path (api.gateway.enabled defaults
to false at the chart level).

Closes #387.

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:19:30 +04:00
e3mrah
ba2ff05292
feat(charts): bp-seaweedfs + bp-harbor + bp-vpa wrapper charts (#284)
W2.5.B — first authoring of the three Catalyst Blueprint wrapper charts
that fill bootstrap-kit slots 18 (seaweedfs), 19 (harbor) and 29 (vpa).
Each wraps an upstream chart as a Helm subchart and ships Catalyst-
curated overlay templates (NetworkPolicy + ServiceMonitor) gated behind
opt-in toggles, per docs/BLUEPRINT-AUTHORING.md §11 and
docs/INVIOLABLE-PRINCIPLES.md.

bp-seaweedfs (slot 18 — storage foundation)
  - Wraps seaweedfs/seaweedfs 4.22.0; Chart name `bp-seaweedfs`.
  - Catalyst defaults: 1 master + 3 volume + 1 filer + 2 s3 replicas.
  - S3 API on 8333 — single S3 surface every consumer talks to per
    docs/PLATFORM-TECH-STACK.md §3.5 (no per-app MinIO).
  - Overlay templates: NetworkPolicy (cross-namespace S3 reachability,
    cold-tier egress allowlist), ServiceMonitor (Capabilities-gated,
    DEFAULT FALSE per §11.2).
  - Default helm template kinds: ClusterRole, ClusterRoleBinding,
    ConfigMap, Deployment, Secret, Service, ServiceAccount, StatefulSet.

bp-harbor (slot 19 — per-Sovereign OCI registry)
  - Wraps goharbor/harbor 1.18.3 (appVersion 2.14.3); Chart name
    `bp-harbor`.
  - Catalyst defaults: blob backend = SeaweedFS S3 (regionendpoint
    seaweedfs-s3.seaweedfs.svc:8333), metadata DB = bp-cnpg external
    Postgres, ingress class `cilium`, expose.tls.enabled true (cert-
    manager-issued Secret).
  - Overlay templates: NetworkPolicy (CNPG/SeaweedFS/Keycloak egress),
    ServiceMonitor (Capabilities-gated, DEFAULT FALSE).
  - Trivy + SSO + pull-mirror are operator-flag opt-ins per per-
    Sovereign overlay (default false; trivy/keycloak/cnpg deps land on
    later slots).
  - Default helm template kinds: ConfigMap, Deployment, Ingress,
    PersistentVolumeClaim, Secret, Service, StatefulSet.

bp-vpa (slot 29 — vertical autoscaling)
  - Wraps cowboysysop/vertical-pod-autoscaler 11.1.1 (appVersion
    1.5.0); Chart name `bp-vpa`.
  - Catalyst defaults: 1 replica each of recommender + updater +
    admission-controller. Default mode `Off` (recommend only).
  - Admission webhook self-signs via init Job (cluster-internal); per-
    Sovereign overlay MAY swap to cert-manager.
  - Overlay templates: NetworkPolicy (apiserver + metrics-server
    egress, admission webhook ingress).
  - Upstream metrics.serviceMonitor / metrics.prometheusRule defaulted
    false per §11.2.
  - Default helm template kinds: ClusterRole, ClusterRoleBinding,
    ConfigMap, Deployment, Job, Pod, Secret, Service, ServiceAccount.

Lint + observability-toggle results
  helm lint: 1 chart(s) linted, 0 chart(s) failed (each)
  tests/observability-toggle.sh: PASS on all three (default render has
  zero monitoring.coreos.com/v1 references; opt-in render produces a
  ServiceMonitor; explicit-off render is clean).

Path isolation: only platform/seaweedfs/, platform/harbor/, and
platform/vpa/ — no HR slot files or other charts touched.

Refs: bootstrap-kit slots 18, 19, 29 reconcile against
ghcr.io/openova-io/bp-seaweedfs:1.0.0, bp-harbor:1.0.0, bp-vpa:1.0.0
which this commit produces on next blueprint-release CI run.

Co-authored-by: hatiyildiz <hatice.yildiz@openova.io>
2026-04-30 18:37:50 +04:00
hatiyildiz
7cafa3c894 docs(seaweedfs+guacamole): replace MinIO with SeaweedFS as unified S3 encapsulation; add Guacamole to bp-relay
Component-level architectural correction (two changes):

1. MinIO → SeaweedFS as unified S3 encapsulation layer

The old design used MinIO for in-cluster S3 plus separate cold-tier configuration scattered across consumers. The new design positions SeaweedFS as the single S3 encapsulation layer: every Catalyst component talks to one endpoint (seaweedfs.storage.svc:8333). SeaweedFS internally handles hot tier (in-cluster NVMe), warm tier (in-cluster bulk), and cold tier (transparent passthrough to cloud archival storage — Cloudflare R2 / AWS S3 / Hetzner Object Storage / etc., chosen at Sovereign provisioning). One audit/lifecycle/encryption boundary instead of N. No Catalyst component talks to cloud S3 directly anymore — Velero, CNPG WAL archive, OpenSearch snapshots, Loki/Mimir/Tempo, Iceberg, Harbor blob store, Application buckets all share one S3 surface.

2. Apache Guacamole added as Application Blueprint §4.5 Communication

Clientless browser-based RDP/VNC/SSH/kubectl-exec gateway. Keycloak SSO, full session recording to SeaweedFS for compliance evidence (PSD2/DORA/SOX). Composed into bp-relay. Replaces VPN+native-client distribution for auditable remote access.

Component changes:
- DELETED: platform/minio/
- CREATED: platform/seaweedfs/README.md (unified S3 + cold-tier encapsulation; bucket layout; multi-region replication via shared cold backend; migration-from-MinIO section)
- CREATED: platform/guacamole/README.md (clientless remote-desktop gateway; GuacamoleConnection CRD; compliance integration via session recordings)

Doc updates: PLATFORM-TECH-STACK §1+§3.5+§4.5+§5+§7.4; TECHNOLOGY-FORECAST L11+mandatory+a-la-carte counts (52 → 53); ARCHITECTURE §3 topology; SECURITY §4 DB engines; SOVEREIGN-PROVISIONING §1 inputs; SRE §2.5+§7; IMPLEMENTATION-STATUS §3; BLUEPRINT-AUTHORING stateful examples; BUSINESS-STRATEGY 13 component-count anchors + Relay product line; README.md backup row; CLAUDE.md folder count.

Component README updates (S3 endpoint + dependency renames): cnpg, clickhouse, flink, gitea, iceberg, harbor, grafana, livekit, kserve, milvus, opensearch, flux, stalwart, velero (substantive rewrite of velero — now writes exclusively to SeaweedFS with cold-tier auto-routing). Products: relay, fabric.

UI scaffold: products/catalyst/bootstrap/ui/src/shared/constants/components.ts — minio entry replaced with seaweedfs; velero+harbor deps updated; new guacamole entry added.

VALIDATION-LOG entry "Pass 104 — MinIO → SeaweedFS swap + Guacamole add" captures the encapsulation principle and adds Lesson #22: storage tier policy belongs at the encapsulation boundary, not inside every consumer.

Verification: zero remaining MinIO references in canonical docs (one intentional retention in TECHNOLOGY-FORECAST L37 explaining the swap); 53 platform/ folders matching all "53 components" anchors; bp-relay composition includes guacamole.
2026-04-28 10:23:46 +02:00
hatiyildiz
2a1d6f5d3f docs(pass-41): SOVEREIGN-PROVISIONING §4 + minio namespace drift across 3 components
SOVEREIGN-PROVISIONING.md §4 (Phase 1 Hand-off) "self-sufficient" list
had 6 items vs PLATFORM-TECH-STACK §2.3's 6 control-plane supporting
services. List was missing SPIRE (5-min rotating SVIDs — critical to
SECURITY model) and observability (Grafana stack — Catalyst's
self-monitoring). Same drift category as Pass 40: summary list drifted
independently from canonical reference. Added both, plus enumerated the
§2.1+§2.2 services in the "Catalyst control plane" bullet.

Mid-pass sweep finding: kserve L217 used minio.minio-system.svc but
canonical minio README declares namespace: storage (L70). Three other
components also used minio-system: milvus L78, harbor L145. Fixed all
three to align with canonical `storage` namespace per PLATFORM-TECH-STACK
§3.5. Drift likely came from Helm-chart upstream defaults.

platform/kserve substantively clean apart from namespace fix.

Pass 41 lesson: union-equality check applies to ALL summary passages in
canonical docs. When a passage enumerates items derived from a canonical
source list, count both and verify equality.
2026-04-27 23:21:19 +02:00
hatiyildiz
4043e1d51c docs(pass-32): registry-DNS sweep — harbor.<domain> across 9 component READMEs
Pass 25's deferred sweep, executed. Image refs of the form
harbor.<domain>/... (and one registry.<domain>/... in temporal) collapse
the location-code segment. Per NAMING §5.1, Catalyst per-host-cluster
Harbor DNS is harbor.{location-code}.{sovereign-domain} (e.g.
harbor.hfmp.openova.io).

Fixed (11 instances, 9 files):
- anthropic-adapter, bge (×2), debezium, harbor (×2 — ingress + Kyverno
  policy), knative (×2 — serving + traffic-split), llm-gateway, strimzi,
  trivy — all standardized to harbor.<location-code>.<sovereign-domain>.
- temporal had two drift items in one line: registry.<domain> (off-spec
  placeholder — Catalyst's only per-host-cluster registry is Harbor) AND
  legacy "fuse" namespace (renamed to bp-fabric per BUSINESS-STRATEGY
  §16.2 / Pass 26). Rewritten to fabric/order-worker.

Out of scope (deliberate): :latest tag hygiene, and whether Application
Blueprint READMEs should reference ghcr.io/openova-io/bp-<name>:<semver>
vs the Sovereign Harbor mirror. Stalwart customer-email-domain <domain>
placeholders preserved (correct semantics). external-dns illustrative
gslb/api/svc.<domain> preserved (upstream-doc generic).

With Pass 29 (canonical-doc DNS) + Pass 31 (carry-over fixes) + Pass 32
(image registry), the recurring DNS-placeholder collapse drift category
is addressed end-to-end.

Validation log Pass 32 entry added.
2026-04-27 22:36:39 +02:00
hatiyildiz
eff264b077 docs(pass-17): ARCHITECTURE OAM table pipe-fix + Harbor README de-drift
Pass 17 — drift-detection sweep on ARCHITECTURE + harbor. Two real
findings.

ARCHITECTURE §13 (OAM table):
- `| Trait | Blueprint overlay (`overlays/small|medium|large`) |`
  has pipe chars inside backticks inside a Markdown table cell —
  a known GFM rendering hazard. Replaced with comma-separated
  examples.

platform/harbor/README.md:
- The banner added in Pass 9 said "every host cluster runs a
  Harbor instance" but the body still described an older
  "Harbor Primary / Harbor Replica" cross-region replication
  topology. Same shape of architectural drift Pass 7 caught in
  OpenBao/ESO/Gitea/Flux — banner-add doesn't rewrite the body.
- Three sections rewritten:
  * Overview mermaid: now shows upstream-OCI → multiple
    independent per-cluster Harbors with local Trivy scan + local
    Pod pulls.
  * "Multi-Region Replication" → "Per-host-cluster mirroring (NOT
    primary-replica)". Single source of truth = upstream OCI
    (ghcr.io/openova-io/* for Catalyst+Blueprints, customer CI for
    application images), not a "primary Harbor".
  * Example replication policy: was a `dest_registry` cross-region
    push policy → now a pull-mirror policy from ghcr.io with
    scheduled-cron trigger.
- "Why Mandatory" table reframed in per-host-cluster terms.

VALIDATION-LOG: Pass 17 entry added with the specific drift-detection
lesson — banner-addition passes don't catch body-level drift; need
explicit body re-reads.

Refs #37
2026-04-27 21:58:53 +02:00
hatiyildiz
a52bda30cb docs(pass-9b): retry banners on harbor / falco / sigstore / syft-grype
Pass 9's commit ea81c38 only landed banners on grafana + kyverno —
the harbor / falco / sigstore / syft-grype edits failed because the
Edit tool requires a Read pass per file before write. Now Read'd
and applied:

- harbor: per-host-cluster registry, pointer to PLATFORM-TECH-STACK §3.5.
- falco: per-host-cluster runtime security, pointer to §3.3 + SRE §10
  (SIEM/SOAR pipeline).
- sigstore: cosign signing chain on every Blueprint OCI artifact,
  Kyverno admission verifies signatures.
- syft-grype: CI-side SBOM + runtime CVE matching.

Pass 9 now complete.

Refs #37
2026-04-27 21:41:22 +02:00
talent-mesh
c9d04a53b4 refactor: flatten platform/ structure (41 components)
Remove hierarchical grouping (networking/, security/, etc.) and use flat
structure for all 41 platform components.

Changes:
- All components now directly under platform/ (no subfolders)
- AI Hub components moved from meta-platforms/ai-hub/components/ to platform/
- Open Banking components (lago, openmeter) moved to platform/
- meta-platforms/ now only contains README files that reference platform/
- Open Banking custom services remain in meta-platforms/open-banking/services/

Structure:
- platform/ (41 components, flat)
- meta-platforms/ai-hub/ (README only, references platform/)
- meta-platforms/open-banking/ (README + 6 custom services)

All documentation links updated.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 15:19:48 +00:00