feat(bootstrap-kit): wire bp-continuum (failover orchestrator) — Pillar 3 unblock (Refs #2065) (#2072)

* feat(bootstrap-kit): wire bp-continuum (failover orchestrator) — Pillar 3 unblock

Adds bootstrap-kit slot 62 (62-bp-continuum.yaml) so the Continuum DR
controller actually deploys on a fresh Sovereign. Without this slot the
chart at products/continuum/chart/ sat in-tree with no install path —
catalyst-platform's QA fixtures (slot 13 qa-continuum-status-seed-job)
reference a Continuum CR named `cont-omantel` that no controller ever
started up to reconcile, leaving Pillar-3 unverifiable end-to-end.

Pillar-3 of the canonical end-user DoD ("multi-region BCP — region kill
zero-data-loss failover") requires three pieces:

  1. bp-cnpg-pair (Pillar-3 follow-up #2068) — primary + replica CNPG
     with ReplicaCluster sync over Cilium ClusterMesh on the WG-public-
     IP DMZ data plane.
  2. Continuum CR + the per-app HTTPRoute drain hook (follow-up #2066).
  3. THIS controller. Without bp-continuum deployed, every Continuum
     CR sits unhandled and the lua-record flip never fires, so a
     region kill loses every transaction in flight.

This PR ships piece 3 — the controller itself, gated default-OFF.

Files
- NEW clusters/_template/bootstrap-kit/62-bp-continuum.yaml — HelmRepository
  + HelmRelease pinned to bp-continuum 0.1.1, targetNamespace
  catalyst-system, dependsOn [bp-catalyst-platform, bp-nats-jetstream,
  bp-powerdns], default-OFF gate via ${CONTINUUM_ENABLED:-false}.
- UPDATE clusters/_template/bootstrap-kit/kustomization.yaml — slot 62
  appended after slot 60 (bp-vcluster-helmrepo), with a header comment
  explaining the Pillar-3 dependency analysis.
- UPDATE scripts/expected-bootstrap-deps.yaml — slot 62 declared with the
  same dep set so scripts/check-bootstrap-deps.sh stays drift-free.
- UPDATE products/continuum/chart/Chart.yaml — version 0.1.0 → 0.1.1
  (first PUBLISHED version; the previous 0.1.0 sat in-tree but blueprint-
  release.yaml never pushed it to GHCR for lack of a path-change trigger)
  + add `catalyst.openova.io/smoke-render-mode: default-off` annotation
  required by blueprint-release's smoke-render gate for default-OFF charts.

Default-OFF rationale
The chart's own values.yaml ships `continuum.enabled: false` (the chart
fails fast on an empty `image.tag` when enabled=true, per the Inviolable
Principle #4a no-`:latest` guard). We surface a CONTINUUM_ENABLED
envsubst placeholder so per-Sovereign overlays may flip the gate on
once bp-cnpg-pair + bp-powerdns + lease witness are ready. Default
`false` matches the MARKETPLACE_ENABLED / SANDBOX_ENABLED knob shape.
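
The `:-` form in the placeholder mirrors shell default-value expansion: the
fallback applies when the variable is unset or empty. A minimal bash sketch of
that behaviour, assuming the bootstrap-kit substitute layer follows the same
semantics (the `render_gate` helper is hypothetical and exists only for
illustration):

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the values line the slot would render for
# whatever CONTINUUM_ENABLED is set to in the current environment.
render_gate() {
  printf 'enabled: %s\n' "${CONTINUUM_ENABLED:-false}"
}

unset CONTINUUM_ENABLED
render_gate                          # unset: falls back to false
CONTINUUM_ENABLED="" render_gate     # empty: also falls back to false
CONTINUUM_ENABLED=true render_gate   # overlay flipped the gate on
```

Only an explicit, non-empty value flips the gate, so an overlay that omits the
key lands a non-Continuum Sovereign.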

Why dependsOn does NOT include bp-cnpg-pair
The chart ships default-OFF — the controller installs idle and only
exercises bp-cnpg-pair when an operator flips `continuum.enabled=true`.
Adding bp-cnpg-pair to dependsOn today would break the install on every
Sovereign that hasn't shipped #2068 yet. Per-Sovereign cnpg-pair
provisioning is the gating dependency at flip-time, not install-time.

Validation (Principle #15 — fresh state, NOT --dry-run=server)
- `helm package products/continuum/chart` → bp-continuum-0.1.1.tgz
- `helm template smoke products/continuum/chart` → empty (default-OFF,
  matches smoke-render-mode annotation contract).
- `helm template smoke products/continuum/chart --set
  continuum.enabled=true` → 6 resources rendered cleanly (Deployment,
  Service, ServiceAccount, RBAC, NetworkPolicy).
- `bash scripts/check-bootstrap-deps.sh` → "Drift: 0  Cycles: 0  PASSED".
- `bash scripts/check-bootstrap-kit-pin-sync.sh` → "bp-continuum:
  chart=0.1.1 pin=0.1.1  PASS".
- `kubectl kustomize clusters/_template/bootstrap-kit/` → 52 HelmReleases
  rendered (the previous 51 plus bp-continuum); `kubectl apply
  --dry-run=client` on the rendered YAML produces no errors for
  bp-continuum.

GHCR publication path
bp-continuum:0.1.0 was never published — git history shows the chart
committed in-tree but the blueprint-release workflow (which triggers on
`products/*/chart/**` diffs) had no path-change to detect since the
initial commit. Bumping Chart.yaml to 0.1.1 forces a fresh publish on
this PR's merge; the auto-bump-pin hook (TBD-A6) then converges the
slot pin via a no-op (already matches at 0.1.1).

Verified bp-continuum:0.1.1 will publish via blueprint-release.yaml's
detect step (`git diff HEAD~1 HEAD | grep -E
'^(platform|products)/[^/]+/(chart/|blueprint.yaml)'`) which catches
products/continuum/chart/Chart.yaml in this commit's diff.
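
The path filter can be checked in isolation. The sketch below reuses the
pattern quoted above and assumes the detect step feeds it one changed path per
line; the `triggers_publish` helper and the `products/continuum/blueprint.yaml`
sample path are illustrative assumptions, not part of the workflow:

```shell
#!/usr/bin/env bash
# Pattern copied from the detect step quoted in the commit message.
pattern='^(platform|products)/[^/]+/(chart/|blueprint.yaml)'

# Hypothetical helper: succeeds if the given path matches the filter.
triggers_publish() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for path in \
  products/continuum/chart/Chart.yaml \
  products/continuum/blueprint.yaml \
  clusters/_template/bootstrap-kit/62-bp-continuum.yaml
do
  if triggers_publish "$path"; then
    echo "publish  $path"
  else
    echo "skip     $path"
  fi
done
```

The slot file under clusters/ does not match on its own, which is why the
Chart.yaml bump is what drives the publish.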

Refs #2065

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(continuum): bump blueprint.yaml spec.version 0.1.0 → 0.1.1 (lockstep)

TestBootstrapKit_BlueprintVersionLockstepSweep enforces
Chart.yaml.version == blueprint.yaml.spec.version for every
bootstrap-kit blueprint. Previous commit bumped Chart.yaml but missed
the blueprint manifest — this commit closes the lockstep.
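
For a quick local check before pushing, the same invariant can be approximated
in bash. This is only a sketch and not the Go test itself: `yaml_version` and
`check_lockstep` are hypothetical helpers, and the lookup takes the first
`version:` key in each file instead of parsing YAML properly:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the first `version:` key in a file,
# at any indentation, with surrounding double quotes stripped.
yaml_version() {
  awk '$1 == "version:" { gsub(/"/, "", $2); print $2; exit }' "$1"
}

# Succeeds only when both files carry the same non-empty version.
check_lockstep() {
  local chart blueprint
  chart="$(yaml_version "$1")"
  blueprint="$(yaml_version "$2")"
  [ -n "$chart" ] && [ "$chart" = "$blueprint" ]
}

# Demo against throwaway files that mimic the state after the first commit.
tmp="$(mktemp -d)"
printf 'name: bp-continuum\nversion: 0.1.1\nappVersion: "0.1.0"\n' > "$tmp/Chart.yaml"
printf 'spec:\n  version: 0.1.0\n' > "$tmp/blueprint.yaml"
check_lockstep "$tmp/Chart.yaml" "$tmp/blueprint.yaml" || echo "lockstep broken"
```

Because the match is on the whole first field, `appVersion:` and `apiVersion:`
lines are ignored.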

Same Refs #2065 thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <hati.yildiz@openova.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3mrah 2026-05-20 10:10:59 +04:00 committed by GitHub
parent 7b31736482
commit 53f510b983
5 changed files with 212 additions and 2 deletions

@@ -0,0 +1,154 @@
# bp-continuum — Catalyst bootstrap-kit Blueprint slot 62
# (Customer-facing capability / DR orchestration).
#
# OpenOva Continuum — Disaster-Recovery orchestrator for active-hot-
# standby Applications (EPIC-6, slice K-Cont-1 #1101 onward). Reconciles
# Continuum.dr.openova.io/v1 CRs; per-Continuum-CR goroutine maintains a
# lease (10s renew, 30s TTL), watches CNPG replication metrics, and
# executes the switchover sequence on lease loss + replication health
# drop (drain HTTPRoute → flip lua-record on pool-domain-manager →
# flip CNPG primary via bp-cnpg-pair → audit on NATS).
#
# ─── Pillar-3 unblock (#2065, TBD-V14) ─────────────────────────────────
# Pillar-3 of the canonical end-user DoD ("multi-region BCP — region kill
# zero-data-loss failover") requires THREE pieces:
# 1. bp-cnpg-pair (C-DB-1) — primary + replica CNPG with ReplicaCluster
# sync over Cilium ClusterMesh on the WG-public-IP DMZ data plane.
# 2. Continuum CR + the per-app HTTPRoute drain hook.
# 3. THIS controller — without bp-continuum deployed, every Continuum
# CR sits unhandled and the lua-record flip never fires, so a
# region-kill produces TXN-loss on every transaction in-flight.
#
# Before this slot, the chart existed at products/continuum/chart/ and
# the controller image was built by .github/workflows/build-continuum-
# controller.yaml + SHA-pinned in values.yaml — but no bootstrap-kit
# slot deployed it on a fresh Sovereign. catalyst-platform's QA fixtures
# (slot 13, `qa-continuum-status-seed-job`) reference a Continuum CR
# named `cont-omantel` that no controller is ever spinning up to
# reconcile. This slot closes the loop.
#
# ─── Default-OFF gate ──────────────────────────────────────────────────
# The chart's own values.yaml ships `continuum.enabled: false` (chart
# fail-fasts on empty `image.tag` when enabled=true — Inviolable
# Principle #4a no-`:latest` guard). We surface a CONTINUUM_ENABLED
# envsubst placeholder so per-Sovereign overlays may flip the gate on
# once bp-cnpg-pair + bp-powerdns + lease witness are ready. Default
# `false` so a zero-touch provision lands a non-Continuum Sovereign
# (matches the MARKETPLACE_ENABLED / SANDBOX_ENABLED knob shape).
#
# ─── Placement ─────────────────────────────────────────────────────────
# Continuum is itself a single-region controller — it lives on the
# MANAGEMENT cluster (per docs/EPICS-1-6-unified-design.md §9 + the
# chart's blueprint.yaml placementSchema: modes=[single-region]) and
# observes data-plane regions over Cilium ClusterMesh + the witness.
# The Application CRs it reconciles are active-hot-standby; the
# controller itself is single-region.
#
# ─── dependsOn ─────────────────────────────────────────────────────────
# - bp-catalyst-platform (slot 13) — owns the
# `dr.openova.io/v1.Continuum` CRD that the controller watches.
# Without this edge, Helm render-time Capabilities gate fails the
# install (no matches for kind "Continuum"). NB: CRD lives at
# products/catalyst/chart/crds/continuum.yaml.
# - bp-nats-jetstream (slot 7) — catalyst.audit publish target the
# controller emits switchover audit events to.
# - bp-powerdns (slot 11) — the pool-domain-manager Service that
# fronts PowerDNS is what the controller POSTs lua-record commits
# to during the flip step of the switchover sequence.
#
# bp-cnpg-pair is intentionally NOT in dependsOn because the chart ships
# default-OFF — the controller installs and waits idle until a per-
# Sovereign overlay flips `continuum.enabled=true`. Operators must
# install bp-cnpg-pair (Pillar 3 audit follow-up #2068) AND configure
# the lease witness BEFORE flipping the gate.
#
# Wrapper chart: products/continuum/chart/
# Catalyst-curated values: products/continuum/chart/values.yaml
# Reconciled by: Flux on the new Sovereign's k3s control plane.
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: bp-continuum
  namespace: flux-system
spec:
  type: oci
  interval: 15m
  url: oci://ghcr.io/openova-io
  secretRef:
    name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: bp-continuum
  namespace: flux-system
  labels:
    catalyst.openova.io/slot: "62"
    catalyst.openova.io/component: continuum-controller
    openova.io/category: customer-facing-capability
    openova.io/epic: "6"
spec:
  interval: 15m
  releaseName: continuum
  # targetNamespace = catalyst-system to colocate with the other
  # catalyst-platform controllers (per slot 13 convention). The chart
  # uses .Release.Namespace for every templated resource.
  targetNamespace: catalyst-system
  dependsOn:
    - name: bp-catalyst-platform
    - name: bp-nats-jetstream
    - name: bp-powerdns
  chart:
    spec:
      chart: bp-continuum
      # 0.1.1 — first published version. 0.1.0 was never pushed to GHCR
      # despite Chart.yaml claiming so; the chart sat in-tree without a
      # bootstrap-kit slot to pin it, so blueprint-release.yaml never
      # bumped past the initial commit's no-op detect step. Bumping to
      # 0.1.1 in the same PR as this slot forces a fresh publish and
      # the auto-bump-pin hook (TBD-A6) lands the matching pin write.
      version: 0.1.1
      sourceRef:
        kind: HelmRepository
        name: bp-continuum
        namespace: flux-system
  install:
    timeout: 10m
    disableWait: true
    remediation:
      retries: 3
  upgrade:
    timeout: 10m
    disableWait: true
    remediation:
      retries: 3
  # Per-Sovereign overlay surface.
  #
  # enabled — default-OFF via ${CONTINUUM_ENABLED:-false} on the
  # bootstrap-kit Kustomization substitute. Flip true on a per-
  # Sovereign overlay's substitute map ONCE the operator has:
  #   - bp-cnpg-pair installed (Pillar-3 follow-up #2068 — primary +
  #     replica CNPG cluster with ReplicaCluster sync over ClusterMesh)
  #   - bp-powerdns + pool-domain-manager reachable (lua-record commits)
  #   - lease witness configured (Cloudflare KV per K-Cont-3, or DNS
  #     quorum fallback)
  # The chart's own `continuum.enabled: false` default is the
  # defence-in-depth backstop — a stale per-Sovereign overlay that
  # hand-installs the HR without our envsubst layer still default-OFFs
  # gracefully.
  #
  # Image tag — NOT overridden here. The chart's values.yaml carries
  # the canonical SHA-pinned `continuum.image.tag` (auto-bumped on every
  # push to main by .github/workflows/build-continuum-controller.yaml).
  # Day-2 SHA pivots remain available via per-Sovereign overlay patches
  # at spec.values.continuum.image.tag.
  #
  # pdmURL / natsURL — empty defaults route through the in-cluster
  # Service DNS (pool-domain-manager.catalyst-system.svc.cluster.local
  # + nats.openova-system.svc.cluster.local respectively). Per-
  # Sovereign overlays may repoint at Sovereign-local instances.
  values:
    continuum:
      enabled: ${CONTINUUM_ENABLED:-false}

@@ -157,6 +157,16 @@ resources:
# slot-19a comment block + 19a-bp-sandbox.yaml header for full
# diagnostic chain. No functional difference for operators — the
# SANDBOX_ENABLED knob still gates rendering identically.
# bp-continuum (slot 62) — Pillar-3 unblock (#2065, TBD-V14). DR
# orchestrator for active-hot-standby Applications. Reconciles
# Continuum.dr.openova.io/v1 CRs; executes switchover sequence
# (drain HTTPRoute → flip lua-record → flip CNPG primary → audit on
# NATS). Default-OFF via ${CONTINUUM_ENABLED:-false}; operators flip
# on once bp-cnpg-pair + lease witness are configured. See slot-62
# header comment for full Pillar-3 dependency analysis. Sequenced past
# the vCluster cohort (slots 54/58/59/60) so its `bp-catalyst-platform`
# dep + Continuum CRD ordering converge before the controller starts.
- 62-bp-continuum.yaml
# bp-newapi (slot 80) — multi-tenant LLM marketplace gateway. Sequenced
# after the W2.K1 dependency wave (cnpg/keycloak/openbao Ready) so
# NewAPI's ExternalSecret + DSN dependencies resolve on first reconcile.

@@ -7,7 +7,15 @@ description: |
   switchover sequence). Slice K-Cont-1 of EPIC-6 (#1101) ships the
   product skeleton; K-Cont-2 fills the reconcile loop.
 type: application
-version: 0.1.0
+# 0.1.1 (Pillar-3 unblock #2065, 2026-05-20): first PUBLISHED version.
+# 0.1.0 sat in-tree without a bootstrap-kit slot to pin it, so the
+# blueprint-release workflow's `detect changed paths` step never had
+# reason to re-run and the chart was never pushed to GHCR. Bumping the
+# pin in lockstep with the new slot file (clusters/_template/bootstrap-
+# kit/62-bp-continuum.yaml) makes blueprint-release publish the chart
+# on this PR's merge; the auto-bump-pin hook (TBD-A6) then converges
+# the slot pin via a no-op (already matches).
+version: 0.1.1
 appVersion: "0.1.0"
 home: https://openova.io
 sources:
@@ -28,3 +36,16 @@ annotations:
openova.io/category: customer-facing-capability
openova.io/epic: "6"
openova.io/depends-on: bp-cnpg-pair,bp-powerdns,pdm
# smoke-render-mode: default-off — bp-continuum's chart ships
# `continuum.enabled: false` as its default; helm template with
# default values legitimately renders zero resources (per chart
# README "the gate keeps the controller stopped until the operator
# installs bp-cnpg-pair + bp-powerdns and configures the witness").
# Without this annotation the blueprint-release.yaml smoke gate
# (`<5 lines = empty render`) fails publish. The enabled-render path
# is exercised at install time by the bootstrap-kit slot's per-
# Sovereign CONTINUUM_ENABLED flip and by the chart's own
# templates/* unit tests (default-off backstop covered by
# blueprint-release's auto-template step at lines 326-358 of the
# workflow).
catalyst.openova.io/smoke-render-mode: default-off

@@ -6,7 +6,7 @@ metadata:
     catalyst.openova.io/section: pts-9-disaster-recovery
     openova.io/category: customer-facing-capability
 spec:
-  version: 0.1.0
+  version: 0.1.1
   card:
     title: Continuum
     summary: |

@@ -510,6 +510,31 @@ slots:
- bp-harbor
wave: present
# ---- Slot 62 — bp-continuum DR orchestrator (Pillar-3 unblock).
# Issue #2065 (TBD-V14, 2026-05-20). Reconciles Continuum.dr.openova.io/v1
# CRs and executes the switchover sequence on lease loss + replication
# health drop (drain HTTPRoute → flip lua-record → flip CNPG primary →
# audit on NATS). Default-OFF gate via ${CONTINUUM_ENABLED:-false} on
# the bootstrap-kit Kustomization substitute; operators flip on once
# bp-cnpg-pair + lease witness are configured.
#
# dependsOn:
# - bp-catalyst-platform (slot 13) — owns the `dr.openova.io/v1.Continuum`
# CRD that the controller watches. Without this edge, Helm render-time
# Capabilities gate fails the install (no matches for kind "Continuum").
# - bp-nats-jetstream (slot 7) — catalyst.audit publish target the
# controller emits switchover audit events to.
# - bp-powerdns (slot 11) — pool-domain-manager fronts PowerDNS; the
# controller POSTs lua-record commits during the flip step.
#
# bp-cnpg-pair (Pillar-3 follow-up #2068) is intentionally NOT in
# dependsOn — chart ships default-OFF so the controller installs and
# waits idle until operators flip the gate after wiring bp-cnpg-pair.
- slot: 62
name: bp-continuum
depends_on: [bp-catalyst-platform, bp-nats-jetstream, bp-powerdns]
wave: present
# ---- Slot 80 — bp-newapi multi-tenant LLM marketplace gateway. Issue #799.
# Sequenced past the W2.K4 numbering plan (slots 36-48) so it never
# collides with the AI-runtime / observability / livekit cohort. The