feat(docs): lean documentation strategy — consolidate 16 docs into 7 canonical + 3 subdirs (#2094)

* docs(arch): consolidate ARCHITECTURE + PLATFORM-TECH-STACK + NAMING + EPICS-1-6 + BOOTSTRAP-KIT-EXPANSION → docs/ARCHITECTURE.md (lean doc strategy)

Single canonical "how OpenOva works" doc per founder's lean-doc strategy.
2926 source lines → 1110 consolidated lines, no semantic loss.

Sections:
 §1  High-level model (Catalyst/Sovereign/Org/Env/Application/Blueprint)
 §2  Repo layout
 §3  Tech stack by layer (CNI/GitOps/IaC/event-spine/data/secrets/identity/...)
 §4  Naming conventions (dimensions, patterns, labels, DOMAINS-CANON)
 §5  Catalyst control plane (rules, CRDs, controllers, cutover, identity, surfaces)
 §6  Per-host-cluster infrastructure
 §7  Application Blueprints
 §8  Multi-region topology (1 cpx52/region, WireGuard-over-public-IPs, ClusterMesh)
 §9  Bootstrap-kit slot ordering (full 48-slot canonical list)
 §10 EPIC-level design overview (EPIC-0 through EPIC-6)
 §11 Per-chart DESIGN.md inventory
 §12 OAM influence
 §13 Read further

Stale literal fixes:
 - omantel.openova.io → omantel.biz / <sovereign>.<tld> / t38.omani.works (7 instances)
 - SPIRE marked DEFERRED / opt-in only (PR #665, TBD-V29 #2055)
 - failover-controller marked REPLACED by bp-continuum

New PR refs wired into §3:
 - PR #665   SPIRE deferral
 - PR #2071  bp-cnpg-pair synchronous remote_apply (zero-tx-loss multi-region)
 - PR #2087  bp-cnpg-pair pre-merge guard (GUARD 1, no-upstream / hollow-chart)
 - PR #2093  bp-cnpg-pair pre-merge guard (GUARD 2, smoke-render)

New stack components added to §3:
 - bp-cnpg-pair  (synchronous remote_apply ReplicaCluster across ClusterMesh)
 - bp-continuum  (lease-based failover orchestrator)
 - bp-self-sovereign-cutover (8-tether pivot, ADR-0002, Principle #11)

Source docs (to be deleted by orchestrator in final PR):
 - docs/PLATFORM-TECH-STACK.md
 - docs/NAMING-CONVENTION.md
 - docs/EPICS-1-6-unified-design.md
 - docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md

* docs(principles): consolidate INVIOLABLE-PRINCIPLES + ANTI-PATTERN-CATALOG → docs/PRINCIPLES.md (lean doc strategy)

* docs(dod): consolidate 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS → docs/DOD.md (lean doc strategy)

* docs(runbooks+status+glossary): consolidate 5 runbooks → RUNBOOKS.md + refresh STATUS.md + fold banned-terms into GLOSSARY.md (lean doc strategy)

Part 1 — Runbook consolidation:
- NEW docs/RUNBOOKS.md with 7 numbered sections (provisioning, day-2 ops,
  Blueprint authoring, chart conventions, demo walk, failover, troubleshooting)
- Folds BLUEPRINT-AUTHORING / CHART-AUTHORING / DEMO-RUNBOOK /
  RUNBOOK-OPERATIONS / RUNBOOK-PROVISIONING into one canonical surface
- Documents dual-annotation requirement for charts with enabled.default: false
  (GUARD 1 #2087 no-upstream + GUARD 2 #2093 smoke-render) with bp-network-policies:1.0.1
  dead-reserve incident as the live evidence
- All admin.<fqdn> legacy URL refs → console.<fqdn>/bss (BSS lives in operator console)
- All openova.io / omantel.omani.works test commands → canonical t<NN>.omani.works
- Cites PRs #2076 (docs migration), #2082 (no-auto-close-keyword), #2087, #2093

Part 2 — STATUS.md refresh (renamed from IMPLEMENTATION-STATUS.md):
- Header dated 2026-05-20 (was 2026-04-29; 22 days stale per audit)
- Adds 🟦 CODE-COMPLETE state for "controllers + CRDs + tests landed,
  awaiting fresh-prov walk" (per 5-pillar DoD)
- Pillar 3 marked CODE-COMPLETE (PRs #2071/#2072/#2073/#2074/#2075/#2053)
- Adds 3 new CRDs verified in products/catalyst/chart/crds/:
  CNPGPair, PDM, Sandbox
- Sandbox controller chain CODE-COMPLETE
  (PRs #1615/#1618/#1621/#1622/#1626/#1631/#1632)
- SPIRE marked DEFERRED — opt-in only (PRs #665, #2056, #2061)
- New §6 CI / supply-chain guards table: hollow-chart (#2087),
  smoke-render (#2093), no-auto-close-keyword (#2082), observability-toggle,
  subchart 4-step, Flux version-pin replay
- New §9 Pillar-status table — Pillars 1/2/3/4 CODE-COMPLETE, Pillar 5 🚧
- Pillar 1 (PRs #2038 V18, #2043 V18-D), Pillar 2 (PR #2029 V20),
  Pillar 3 (per above), Pillar 4 (Sandbox chain)

Part 3 — GLOSSARY.md folded as single source of truth for banned terms:
- Header dated 2026-05-20, notes "single source of truth for banned terms"
  and "no separate BANNED-TERMS.md"
- Existing 11 banned-terms rows rewritten with italicized qualifiers
- NEW Forbidden test domains subsection:
  openova.io (mothership-only), omantel.openova.io (hallucinated),
  Nova Cloud (predecessor brand), eventforge.io (hallucinated),
  admin.<fqdn> (dead BSS URL)
- SPIFFE/SPIRE identity row + acronym row marked deferred per PR #665
  with TBD-V29 (#2055) re-introduction roadmap
- Cross-links updated: IMPLEMENTATION-STATUS → STATUS,
  SOVEREIGN-PROVISIONING + BLUEPRINT-AUTHORING → RUNBOOKS.md

CLAUDE.md NOT touched. Source files NOT deleted (orchestrator owns deletion).
No push, no PR. Manifest at /tmp/merge-D-runbooks-status-glossary-manifest.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: assemble lean doc strategy — delete legacy sources, move ledger/sessions/archive, ADR-0004, rewrite cross-refs

Per founder direction 2026-05-20 + user-global ~/.claude/CLAUDE.md §11.

This is the orchestrator commit on top of the four cherry-picked consolidation
commits (ARCHITECTURE, PRINCIPLES, DOD, RUNBOOKS+STATUS+GLOSSARY). It:

1. Deletes 15 legacy source docs (now folded into the 7 canonical):
   PLATFORM-TECH-STACK, NAMING-CONVENTION, EPICS-1-6-unified-design,
   BOOTSTRAP-KIT-EXPANSION-PLAN, INVIOLABLE-PRINCIPLES, ANTI-PATTERN-CATALOG,
   5-PILLAR-DOD, DOMAINS-CANON, SOVEREIGN-MULTI-REGION-DOD,
   PERSONAS-AND-JOURNEYS, BLUEPRINT-AUTHORING, CHART-AUTHORING,
   DEMO-RUNBOOK, RUNBOOK-OPERATIONS, RUNBOOK-PROVISIONING.

2. Moves transient + historical docs into proper subdirs:
   - docs/ledger/{TRUST,TRACKER}.md (cron-refreshed live state)
   - docs/sessions/{2026-05-17-convergence,2026-05-19-20-trust-recovery,
     2026-05-20-trust-audit,2026-05-20-walk-runbook}.md
   - docs/archive/{validation-log,orchestrator-state,omantel-handover-wbs}.md

3. Adds docs/adr/0004-cnpg-sync-replication.md (Pillar 3 zero-tx-loss decision)
   + docs/adr/README.md index.

4. Updates CLAUDE.md reading-order + repo-structure block to match the
   lean strategy and current core/ tree (controllers/, marketplace/, etc.).

5. Sweeps all .md files + .github/workflows + scripts to repoint old doc
   paths to the new canonical homes. ADR cross-references kept intact
   (ADRs are immutable historical artifacts).

Operator-side cron scripts that still write to the old paths
(/home/openova/bin/refresh-dod-dashboard.sh, refresh-wbs.sh and
openova-private/bin/trust-audit.sh) need a one-line path update —
flagged in the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(bootstrap-kit): update repo-root sentinel to docs/PRINCIPLES.md

The bootstrap-kit Go test used `docs/INVIOLABLE-PRINCIPLES.md` as its
repo-root sentinel; the file no longer exists after the lean-doc
consolidation (it's now `docs/PRINCIPLES.md`). Update the walker to
match the new canonical filename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-20 14:40:01 +04:00


Catalyst Security Model

Status: Authoritative target architecture. Updated: 2026-05-20. Implementation: Per-component status tracked in STATUS.md. OpenBao, ESO, Keycloak component READMEs exist; Catalyst's integration glue is design-stage. SPIRE/SPIFFE was dropped from the bootstrap-kit by founder PR #665 (2026-05-03, "drop bp-spire — Cilium WireGuard is canonical east-west mesh") — the platform/spire/ chart is retained as opt-in for future re-introduction (see §2 below for re-enable triggers).

Identity, secrets, rotation, and multi-region credential semantics for Catalyst Sovereigns. Defer to GLOSSARY.md for terminology.

1. Identity: two systems, two purposes

Subject	System	Token	Lifetime	What it auths
Workloads (every Pod, every controller)	Cilium WireGuard mesh + K8s ServiceAccount TokenReview	Cilium-managed node-level WireGuard session keys (kernel) + projected SA bound-tokens (1h, kubelet-rotated)	WG session-keys rotate on every Cilium agent restart; bound tokens auto-rotate hourly	Pod ↔ Pod transport encryption (kernel WG); Pod ↔ OpenBao auth (via the `kubernetes` auth method = TokenReview); Pod ↔ NATS / Catalyst APIs (SA token in `Authorization` header, validated server-side)
Users (every human)	Keycloak	OIDC JWT	15 min access / 30 day refresh	UI auth, REST/GraphQL API, Gitea, console SSE

Two systems, never conflated. Workload identity is bound to a Kubernetes ServiceAccount. The spiffe://<sov>/ns/<ns>/sa/<sa> shape is preserved at namespace+SA granularity, but it is verified via TokenReview against the K8s API server rather than via SPIRE-issued SVIDs. User identity is bound to a Keycloak realm subject. The two meet only at boundaries where a service acts on behalf of a user, and even then the workload presents all three pieces together: its SA token for in-band auth, the WireGuard mesh for transport encryption, and the user's JWT in the request body.

2. Workload identity — Cilium WireGuard + K8s ServiceAccount TokenReview

Status: Canonical since founder PR #665 (2026-05-03, "drop bp-spire — Cilium WireGuard is canonical east-west mesh"). The bp-spire slot was removed from clusters/_template/bootstrap-kit/ (Slot 06 deleted). The platform/spire/ chart remains in the repo as opt-in for future re-introduction; see "Re-enable triggers" below.

What protects east-west pod-to-pod traffic today:

┌──────────────────────────────────────────────────────────────────────┐
│ Cilium agent (DaemonSet) on every node                                │
│  - encryption.type = wireguard                                        │
│  - encryption.wireguard.userspaceFallback = false                     │
│  - every pod-to-pod packet that leaves a node is wrapped in a         │
│    WireGuard tunnel keyed per node-pair, at the kernel layer          │
│  - 100% mesh coverage (no exemptions), zero sidecars                  │
│  - L7 policy + identity-aware enforcement via Cilium NetworkPolicy    │
│    and CiliumNetworkPolicy CRs                                        │
└──────────────────────────────────────────────────────────────────────┘
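
As a rough illustration of the settings named in the box above, a Helm values fragment for the Cilium chart might look like the sketch below. The `encryption.type` and `encryption.wireguard.userspaceFallback` keys come from the text; `encryption.enabled` is the upstream chart's switch and is an assumption about how this repo sets it. The authoritative values live in platform/cilium/chart/values.yaml.

```yaml
# Sketch only: values fragment for the upstream Cilium Helm chart.
encryption:
  enabled: true            # assumption: turns transparent encryption on
  type: wireguard          # kernel WireGuard between nodes
  wireguard:
    userspaceFallback: false   # never fall back to a userspace implementation
```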

What proves workload identity today (Pod → service-of-record):

┌──────────────────────────────────────────────────────────────────────┐
│ Every Pod has a projected ServiceAccount token                        │
│  - kubelet rotates the bound token hourly                             │
│  - audience-scoped per consumer (e.g. `https://openbao.catalyst.svc`) │
│  - Pod presents the SA token in Authorization: Bearer                 │
│  - Server (OpenBao, NATS, Catalyst API) validates via the K8s         │
│    TokenReview API → returns the (namespace, ServiceAccount) tuple    │
│  - Authorization decisions are made on that tuple                     │
└──────────────────────────────────────────────────────────────────────┘
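
A minimal sketch of how a Pod could request the audience-scoped, hourly bound token described above, using the standard Kubernetes projected-volume API. The Pod name, image, and mount path are hypothetical; the namespace, ServiceAccount, and audience reuse examples from this document.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: projector-example            # hypothetical name
  namespace: catalyst-projector
spec:
  serviceAccountName: projector
  containers:
    - name: app
      image: registry.example.invalid/projector:dev   # placeholder image
      volumeMounts:
        - name: openbao-token
          mountPath: /var/run/secrets/openbao          # hypothetical path
          readOnly: true
  volumes:
    - name: openbao-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              audience: https://openbao.catalyst.svc   # one audience per consumer
              expirationSeconds: 3600                  # 1h TTL; kubelet refreshes the file before expiry
```

The application reads the token from the mounted file on each request (rather than caching it at startup) so that it always presents the most recently rotated token.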

Identity tuple examples in Catalyst (note the shape parallels SPIFFE ID spiffe://<sov>/ns/<ns>/sa/<sa> — preserved at namespace+SA granularity):

ns=catalyst-projector  sa=projector                ← control-plane microservice
ns=catalyst-gitea      sa=gitea                    ← per-Sovereign Git server
ns=muscatpharmacy      sa=wordpress                ← Application workload
ns=catalyst-openbao    sa=openbao                  ← OpenBao itself

OpenBao auth method: kubernetes (TokenReview-backed). Roles are bound to (namespace, ServiceAccount) tuples. Not the cert auth method, not JWT-SVID. See platform/cilium/chart/values.yaml:107-118 for the canonical comment locking this decision.

NATS JetStream auth: the bp-spire dependsOn was removed from clusters/_template/bootstrap-kit/07-nats-jetstream.yaml in PR #665. NATS no longer needs SVID-based auth; the kernel-level WireGuard encryption between every pod covers in-flight traffic, and JetStream Account-level isolation handles per-Org boundaries.

Catalyst REST API auth: workload calls are authenticated by SA bound-token (TokenReview); user calls by Keycloak-issued JWT.
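
For orientation, the server-side check could be sketched as the TokenReview object below: the server submits `spec`, and the Kubernetes API server fills in `status`. The token value is a placeholder, and the `status` shown is an illustrative successful response rather than captured output.

```yaml
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  token: "<bound token presented by the caller>"   # placeholder, never a real value
  audiences:
    - https://openbao.catalyst.svc                  # must match the token's audience
status:                                             # populated by the API server
  authenticated: true
  user:
    # Standard ServiceAccount username form: system:serviceaccount:<ns>:<sa>
    username: system:serviceaccount:catalyst-projector:projector
  audiences:
    - https://openbao.catalyst.svc
```

The (namespace, ServiceAccount) tuple used for authorization is parsed from `status.user.username`.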

Why this configuration is sufficient today

Concern	How it's met today
In-flight encryption	Cilium WireGuard, kernel-level, 100% mesh, no opt-out
Workload-to-workload authentication	K8s SA tokens validated server-side via TokenReview
Token rotation	Projected SA bound-tokens auto-rotate hourly (kubelet)
Defense against stolen long-lived tokens	Bound tokens are scoped to a single Pod + audience + 1h TTL; the legacy unbound SA secret-tokens are not used
Cross-Org isolation	vcluster boundary + NATS Account boundary + Keycloak realm boundary; SA tokens don't cross vcluster boundaries
Node-level identity	Cilium gives every node a WireGuard public key; CiliumNetworkPolicy + identity labels enforce L3/L7 policy at the eBPF datapath
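
To make the identity-label enforcement in the last row concrete, here is a hedged sketch of a CiliumNetworkPolicy that admits only the projector ServiceAccount to OpenBao. The policy name, the `app.kubernetes.io/name` selector, and port 8200 are assumptions for illustration; the two `fromEndpoints` labels are the standard Cilium identity labels for namespace and ServiceAccount.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: openbao-allow-projector          # hypothetical name
  namespace: catalyst-openbao
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: openbao    # assumed Pod label
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: catalyst-projector
            io.cilium.k8s.policy.serviceaccount: projector
      toPorts:
        - ports:
            - port: "8200"               # assumed OpenBao API port
              protocol: TCP
```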

Re-enable triggers (when to re-introduce SPIRE)

The platform/spire/ chart is retained for the following scenarios. None apply today; re-enable requires founder ruling that overrides PR #665.

1. Cross-Sovereign workload federation. When workloads in Sovereign A need to authenticate to services in Sovereign B without round-tripping through a shared K8s API server, SPIFFE federation (SPIFFE/SPIRE upstream-bundle exchange) is the canonical path. K8s SA TokenReview is local to one cluster.
2. Compliance audit requiring sub-hour cryptographic workload attestation. SOC 2 Type 2, PCI-DSS, or FedRAMP audits that demand (a) cryptographically attested workload identity (not bearer-token), (b) sub-hour rotation, (c) per-Pod fingerprint distinct from (namespace, SA). The SA-bound-token model proves (namespace, SA, audience) but not Pod-fingerprint; SPIRE workload attestation (k8s_psat + parent selectors) proves the fingerprint.
3. Per-workload-fingerprint authorization. When the policy decision requires distinguishing two Pods running the same SA in the same namespace (e.g. canary vs stable, two replicas with different secrets), the SA token alone cannot distinguish them. SPIRE workload attestation can.

If any of (1)/(2)/(3) becomes a hard requirement, the re-introduction roadmap lives in TBD-V29 (#2055) — the 8-PR sketch covers: split platform/spire/ into platform/spire-crds/ + platform/spire/, add bp-spire-crds + bp-spire to clusters/_template/bootstrap-kit/, author ClusterSPIFFEID CRs for the ~6 first-wave services, add go-spiffe/v2 deps + tlsconfig.MTLSClientConfig to outbound HTTP clients, pair server-side tlsconfig.MTLSServerConfig + SPIFFE-ID ACLs, switch OpenBao auth from kubernetes to cert, re-enable oidc-discovery-provider, migrate remaining workloads in waves. Estimate 2000-3500 LOC, 2-4 weeks.

3. Secrets: OpenBao + ESO

Static secrets (API tokens, passwords, signing keys, OAuth client secrets) live in OpenBao. They reach Pods via External Secrets Operator (ESO).

       OpenBao (Raft cluster, region-local)
              │
              │  ┌──────────────────────────────────────────────┐
              │  │  ExternalSecret CR in Git, in the Application │
              │  │  Gitea repo. References path in OpenBao.     │
              │  └──────────────────────────────────────────────┘
              │                          │
              │                          ▼
              │  ┌──────────────────────────────────────────────┐
              │  │  ESO (in vcluster) reads ExternalSecret CR   │
              │  │  Authenticates to OpenBao via the `kubernetes`│
              │  │  auth method (projected SA bound-token →     │
              │  │  TokenReview); transport secured by Cilium WG│
              │  └──────────────────────────────────────────────┘
              │                          │
              │                          ▼
              │  ┌──────────────────────────────────────────────┐
              │  │  K8s Secret (rendered, versioned)             │
              │  │  Reloader watches hash → rolling deploy      │
              │  └──────────────────────────────────────────────┘
              │                          │
              ▼                          ▼
   (audit log + telemetry)         Pod mounts the secret

What's in Git (always):

ExternalSecret CR pointing at an OpenBao path
SecretStore CR pointing at the OpenBao endpoint
SecretPolicy CR (rotation rules)
Public keys, root CA certs (CRDs)

What's NEVER in Git:

Secret values (passwords, tokens, private keys, etc.)
OpenBao root tokens
Static API credentials
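
A minimal sketch of the two Git-resident CRs, assuming ESO's `vault` provider is pointed at OpenBao's Vault-compatible API. The store name, server URL, KV mount, role, and secret path are all hypothetical; note that only a path appears here, never a value.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: openbao                          # hypothetical name
  namespace: muscatpharmacy
spec:
  provider:
    vault:                               # assumption: Vault provider used against OpenBao
      server: https://openbao.catalyst.svc:8200   # hypothetical endpoint
      path: secret                       # hypothetical KV mount
      version: v2
      auth:
        kubernetes:                      # the TokenReview-backed auth method
          mountPath: kubernetes
          role: wordpress                # role bound to (namespace, ServiceAccount)
          serviceAccountRef:
            name: wordpress
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: wordpress-db
  namespace: muscatpharmacy
spec:
  refreshInterval: 1m
  secretStoreRef:
    kind: SecretStore
    name: openbao
  target:
    name: wordpress-db                   # K8s Secret that ESO renders
  data:
    - secretKey: password
      remoteRef:
        key: muscatpharmacy/wordpress/db # hypothetical OpenBao path
        property: password
```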

4. Dynamic credentials

For databases, S3, and other systems supporting short-lived credentials, OpenBao mints them on demand:

Pod                   catalyst-secret-sidecar          OpenBao (DB engine)
 │                          │                                  │
 │ "give me Postgres"      │ authenticates via SA bound-token  │
 │─────────────────────────►│                                   │
 │                          │ mints Postgres user             │
 │                          │ TTL=1h                          │
 │                          │──────────────────────────────────►│
 │                          │ returns user/password           │
 │◄─────────────────────────│◄──────────────────────────────────│
 │
 │ connects to Postgres, opens connection pool
 │
 │ at T+50min: sidecar pre-emptively requests new creds
 │              app drains old pool, swaps to new creds
 │              no downtime
 │
 │ at T+1h: OpenBao revokes the old user

The sidecar is automatic for any Pod whose Blueprint declares dynamicSecrets: true. Apps that prefer in-process can use the Catalyst SDK directly. Apps that can't do either get a rolling restart at the TTL boundary (acceptable for low-tier workloads).

Database engines supported: PostgreSQL (CNPG), FerretDB (MongoDB-compatible), ClickHouse, Valkey, SeaweedFS/S3.

5. Multi-region OpenBao — INDEPENDENT, NOT STRETCHED

Critical: each region runs its own Raft cluster. There is no cross-region Raft quorum. Region failures are independent failure domains.

   Region A (Muscat)              Region B (Salalah)              Region C (Frankfurt DR)
   ┌──────────────────┐           ┌──────────────────┐            ┌──────────────────┐
   │ OpenBao cluster  │           │ OpenBao cluster  │            │ OpenBao cluster  │
   │ 3 Raft nodes     │           │ 3 Raft nodes     │            │ 3 Raft nodes     │
   │ INDEPENDENT      │           │ INDEPENDENT      │            │ INDEPENDENT      │
   │ Raft quorum      │           │ Raft quorum      │            │ Raft quorum      │
   └──────┬───────────┘           └──────────────────┘            └──────────────────┘
          │                                ▲                                ▲
          │ async log shipping             │ async log shipping             │
          │ (Performance Replication)      │                                │
          └────────────────────────────────┴────────────────────────────────┘
                  one-way: primary → secondaries; no cross-region quorum

5.1 Fault domain semantics

Each region has its own self-contained 3-node Raft cluster. Quorum is intra-region only (need 2-of-3 in the same region).
A total Region A failure does NOT require any other region to do anything. Regions B and C continue serving reads from their local replicated data.
Network partition between regions: each region keeps operating independently. Standby regions keep serving reads from their last-replicated state; they accept no writes during the partition because they are read-only by design.
DR promotion is explicit: either approved by a sovereign-admin or automated under strict criteria by bp-continuum, the lease-based failover orchestrator that replaces failover-controller. It is not automatic on every blip.

5.2 Read/write semantics

Writes (rotations, new secrets) → primary OpenBao only.
Reads → local OpenBao replica (sub-10ms latency in same continent).
Replication lag <1s typical. Apps in B and C read post-rotation values without any cross-region call.
Region failure → DR replica promoted by bp-continuum (the successor to failover-controller). New writes are blocked briefly during promotion (~30s). After promotion, the DR region accepts writes.

5.3 Why NOT a stretched cluster

A stretched Raft cluster (5 nodes across 3 regions, single quorum) seems superficially appealing but is fragile:

A single region's network blip can cause loss of quorum if 3 of 5 nodes are in the affected region.
Cross-region latency degrades all writes (every write needs cross-region majority ack).
An entire region failure can leave the cluster without quorum.

We deliberately reject this pattern. Each region is its own failure domain.
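
The quorum arithmetic behind this rejection can be worked through directly. The sketch below uses the standard Raft majority rule; the symbols $n$ (cluster size), $q$ (quorum), and $k$ (nodes in the failed region) are introduced here for illustration.

```latex
\begin{align*}
q(n) &= \left\lfloor \tfrac{n}{2} \right\rfloor + 1
  && \text{majority quorum for } n \text{ Raft nodes} \\
q(5) &= 3, \qquad 5 - k \ge 3 \iff k \le 2
  && \text{stretched: survives only if the failed region held } \le 2 \text{ nodes} \\
k = 3 &\;\Rightarrow\; 5 - 3 = 2 < 3
  && \text{stretched: quorum lost for every region} \\
q(3) &= 2, \qquad 3 \ge 2
  && \text{independent: each surviving region keeps its own quorum}
\end{align*}
```

With independent clusters, a region failure removes only that region's three nodes from its own quorum calculation; the other regions' clusters are unaffected.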

6. Keycloak topology

Set at Sovereign provisioning time:

# In Sovereign CRD spec
keycloakTopology: per-organization      # SME-style: each Org gets its own
# OR
keycloakTopology: shared-sovereign      # Corporate: one Keycloak for the Sovereign

6.1 SME-style (`per-organization`)

Sovereign: omantel
└── Each Organization gets a minimal Keycloak (1 replica, embedded H2/sqlite,
    ~150 MB RAM, no HA)
    │
    ├── Organization muscatpharmacy
    │     Keycloak realm: muscatpharmacy
    │     Federations: Omantel-Mobile-OTP, Google, Apple
    ├── Organization acme-shop
    │     Keycloak realm: acme-shop
    └── …

Why per-Org for SME: blast radius. Muscat-pharmacy's Keycloak outage cannot affect Lulu-Hypermarket. Operationally cheap — minimal Keycloak fits in <200MB. SME tier customers don't need HA; if their Keycloak restarts in 10s during a deploy, that's tolerable.

Larger SMEs can opt into HA via a tier upgrade — same data model, just more replicas + Postgres backend instead of embedded H2.

6.2 Corporate (`shared-sovereign`)

Sovereign: bankdhofar
└── ONE Keycloak (HA, 3 replicas, Postgres backend)
    Federates to Bank Dhofar's corporate Azure AD
    │
    ├── Realm: catalyst-admin (sovereign-admin team)
    ├── Realm: core-banking (Org)
    ├── Realm: digital-channels (Org)
    ├── Realm: analytics (Org)
    └── Realm: corporate-it (Org)

Why shared for corporate: the bank's security perimeter is the entire Sovereign. Every Organization within is a business unit of the same legal entity. Federation to Azure AD is the single auth choke-point anyway. Per-Org Keycloak would mean N times the Azure AD federation config — operational overhead with no security benefit.

6.3 App-level SSO

Every Application Blueprint can declare SSO support:

# in bp-wordpress configSchema
sso:
  enabled: true   # auto-creates a Keycloak client in the Org's realm
                  # injects credentials via OpenBao + ExternalSecret

End users get one-click SSO across all Apps in their Organization without ever seeing OAuth config.

7. Rotation policy

Every credential class has a SecretPolicy that drives automatic rotation.

apiVersion: catalyst.openova.io/v1alpha1
kind: SecretPolicy
metadata:
  name: stricter-rotation
  namespace: catalyst-system
spec:
  appliesTo:
    organizationLabels:
      tier: regulated
  rules:
    - kind: database-credentials
      maxTTL: 1h
      autoRotate: true
    - kind: api-token
      maxTTL: 90d
      autoRotate: true
      rotateBefore: 7d
    - kind: oauth-client-secret
      maxTTL: 90d
      autoRotate: true
    - kind: signing-key
      maxTTL: 365d
      autoRotate: false               # requires explicit approval
      requireApproval: [security-officer]
    - kind: tls-cert
      maxTTL: cert-manager-managed

Class	Default	Notes
Workload identity (K8s SA bound-token)	1 h, auto-rotated by kubelet	Not configurable. Audience-scoped per consumer. SPIRE SVID (5-min, X.509-cert) is the future-state target if a §2 re-enable trigger fires.
Dynamic DB creds	1 h, auto	Per-Blueprint TTL configurable.
API tokens, OAuth client secrets	90 d, auto	rotateBefore: 7d gives apps a refresh window.
Signing keys, root CAs	365 d, manual approval	Auto-rotation possible but disabled by default for high-impact keys.
TLS certs	cert-manager controlled	ACME/Let's Encrypt, ~60 d, automatic.
User passwords (Keycloak)	User-managed + MFA	Min age policy enforced by realm.

A security-officer sees a RotationDashboard view: every credential class, age, next rotation, force-rotate button (RBAC-gated).

8. The path of a secret value (no leakage)

1. Generated:   Crossplane composition or OpenBao auto-generator creates value.
                Never printed. Never echoed. Written directly to OpenBao via API.

2. Referenced:  ExternalSecret CR in Git names the OpenBao path. No value in Git.

3. Materialized: ESO reads OpenBao path (auth via projected SA bound-token +
                TokenReview; transport encrypted by Cilium WireGuard), renders
                K8s Secret. The K8s Secret is base64-encoded; never logged.

4. Consumed:    Pod mounts as env or file. Reloader watches hash; rolls deploy
                on change. Application sees plaintext only via mount or env.

5. Rotated:     SecretPolicy controller invokes rotation API on OpenBao.
                New value generated, replication propagates, ESO re-reads,
                Reloader rolls. Old value retained for grace window (24h),
                then revoked.

6. Audited:     Every step logged to Catalyst audit log. No plaintext.
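
Step 4 can be sketched as a Deployment that consumes the rendered Secret and opts in to hash-driven rollouts. This assumes "Reloader" refers to Stakater Reloader and its `reloader.stakater.com/auto` annotation; the image, labels, and Secret name are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wordpress
  namespace: muscatpharmacy
  annotations:
    reloader.stakater.com/auto: "true"   # assumption: Stakater Reloader rolls Pods when a referenced Secret changes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wordpress
  template:
    metadata:
      labels:
        app: wordpress
    spec:
      serviceAccountName: wordpress
      containers:
        - name: wordpress
          image: registry.example.invalid/wordpress:dev   # placeholder image
          envFrom:
            - secretRef:
                name: wordpress-db       # Secret rendered by ESO; plaintext is seen only inside the Pod
```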

What never happens:

Plaintext secrets in Git.
Plaintext secrets in shell command output.
Plaintext secrets in issues, PRs, comments, or chat.
Plaintext secrets in commit messages, branch names, tag names.

If a secret is ever leaked via terminal output (a misconfigured kubectl describe, a debug log), the leak is treated as a P1 incident: rotate immediately, audit history, communicate.

9. Compliance posture

Standard	Catalyst posture
SOC 2 Type 2	Audit logging in JetStream + OpenSearch SIEM cold storage. SecretPolicy enforces rotation. EnvironmentPolicy enforces approvals.
PSD2 / FAPI	Fingate Blueprint composes Keycloak (FAPI authorization), eIDAS cert verification, ext_authz.
DORA	Resilience testing via Litmus chaos Blueprint. Multi-region by default for regulated tier.
NIS2	Falco runtime detection + OpenSearch SIEM + Kyverno policy + supply-chain (cosign + Syft+Grype).
GDPR	Per-region data residency via Placement spec. Right-to-be-forgotten flow defined per Application Blueprint.
ISO 27001	Mappings published per control; evidence surfaced via Catalyst console audit views and SIEM exports.

Every Sovereign exports its audit log to a customer-specified SIEM. Default: OpenSearch in the Sovereign itself; customers may push to external Splunk, Datadog SIEM, etc.

10. Threat model summary

Threat	Mitigation
Stolen ServiceAccount token	Projected SA bound-tokens are 1h TTL, audience-scoped, Pod-bound (deleted when the Pod terminates) — legacy long-lived Secret-tokens are not used. (Future hardening: SPIRE SVID 5-min mTLS-cert if a §2 re-enable trigger fires.)
Stolen K8s Secret	Encrypted at rest in etcd. Pulled only via ESO with a projected SA bound-token (TokenReview-validated); transport encrypted by Cilium WireGuard.
Compromised Pod	NetworkPolicy (Cilium) + L7 policies limit blast radius. Falco detects anomalous syscalls.
Malicious commit to Environment Gitea	EnvironmentPolicy requires PR approvals. Kyverno admission control denies non-policy-compliant manifests.
Compromised Blueprint upstream	All Blueprints are cosigned. Kyverno verify-signatures policy denies unsigned/wrong-issuer artifacts.
Cross-Org leakage	vcluster isolation. JetStream Account isolation. Keycloak realm isolation (per-Org or shared).
Compromised sovereign-admin account	MFA required at Keycloak. JIT elevation for production-impacting actions. Full audit trail to SIEM.
Compromised OpenBao node	2-of-3 Raft quorum required for writes. Audit log captures every read. Rotate root token + re-shard quarterly.
Region-wide failure	Independent OpenBao Raft per region. PowerDNS lua-records (`ifurlup`) drop the affected regional endpoint from authoritative responses within the health-check window. Apps with `active-active` keep serving from healthy region.
Supply-chain attack on a build	SLSA-3 build provenance, cosign signing, SBOMs generated by Syft and scanned by Grype in CI, plus runtime image scanning by Trivy.
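
For the "Compromised Blueprint upstream" row, a signature gate might look roughly like the Kyverno policy below. This is a sketch under stated assumptions: the policy name, registry pattern, issuer, and subject are placeholders, and `verifyImages` as shown checks container images referenced by Pods. How Catalyst applies verification to Blueprint artifacts themselves is not specified in this document.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-blueprint-signatures      # hypothetical name
spec:
  validationFailureAction: Enforce       # deny, rather than audit, on failure
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "registry.example.invalid/blueprints/*"          # placeholder registry
          attestors:
            - entries:
                - keyless:
                    issuer: "https://issuer.example.invalid"   # placeholder OIDC issuer
                    subject: "https://example.invalid/builds/*" # placeholder signer identity
```

Images that are unsigned, or signed by a different issuer or subject, fail verification and are rejected at admission.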

See ARCHITECTURE.md for the broader platform context.
