Catalyst Security Model
Status: Authoritative target architecture. Updated: 2026-05-20.
Implementation: Per-component status tracked in STATUS.md. OpenBao, ESO, Keycloak component READMEs exist; Catalyst's integration glue is design-stage. SPIRE/SPIFFE was dropped from the bootstrap-kit by founder PR #665 (2026-05-03, "drop bp-spire — Cilium WireGuard is canonical east-west mesh") — the platform/spire/ chart is retained as opt-in for future re-introduction (see §2 below for re-enable triggers).
Identity, secrets, rotation, and multi-region credential semantics for Catalyst Sovereigns. Defer to GLOSSARY.md for terminology.
1. Identity: two systems, two purposes
| Subject | System | Token | Lifetime | What it auths |
|---|---|---|---|---|
| Workloads (every Pod, every controller) | Cilium WireGuard mesh + K8s ServiceAccount TokenReview | Cilium-managed node-level WireGuard keys (kernel) + projected SA bound-tokens (1h, kubelet-rotated) | WireGuard session keys are re-keyed by the protocol handshake every ~2 min (node key pairs are managed by the Cilium agent); bound tokens auto-rotate hourly | Pod ↔ Pod transport encryption (kernel WG); Pod ↔ OpenBao auth (via the kubernetes auth method = TokenReview); Pod ↔ NATS / Catalyst APIs (SA token in Authorization header, validated server-side) |
| Users (every human) | Keycloak | OIDC JWT | 15 min access / 30 day refresh | UI auth, REST/GraphQL API, Gitea, console SSE |
Two systems, never conflated. Workload identity is bound to a Kubernetes ServiceAccount: it keeps the spiffe://<sov>/ns/<ns>/sa/<sa> shape at namespace+SA granularity, but is verified via TokenReview against the K8s API server rather than via SPIRE-issued SVIDs. User identity is bound to a Keycloak realm subject. The two meet only at boundaries where a service acts on behalf of a user. Even then, the workload presents both credentials (its SA token for in-band auth and the user's JWT in the request body), while the WireGuard mesh provides transport encryption.
2. Workload identity — Cilium WireGuard + K8s ServiceAccount TokenReview
Status: Canonical since founder PR #665 (2026-05-03, "drop bp-spire — Cilium WireGuard is canonical east-west mesh"). The bp-spire slot was removed from clusters/_template/bootstrap-kit/ (Slot 06 deleted). The platform/spire/ chart remains in the repo as opt-in for future re-introduction; see "Re-enable triggers" below.
What protects east-west pod-to-pod traffic today:
┌──────────────────────────────────────────────────────────────────────┐
│ Cilium agent (DaemonSet) on every node │
│ - encryption.type = wireguard │
│ - encryption.wireguard.userspaceFallback = false │
│ - every pod-to-pod packet that leaves a node is wrapped in a │
│ WireGuard tunnel keyed per node-pair, at the kernel layer │
│ - 100% mesh coverage (no exemptions), zero sidecars │
│ - L7 policy + identity-aware enforcement via Cilium NetworkPolicy │
│ and CiliumNetworkPolicy CRs │
└──────────────────────────────────────────────────────────────────────┘
What proves workload identity today (Pod → service-of-record):
┌──────────────────────────────────────────────────────────────────────┐
│ Every Pod has a projected ServiceAccount token │
│ - kubelet rotates the bound token hourly │
│ - audience-scoped per consumer (e.g. `https://openbao.catalyst.svc`) │
│ - Pod presents the SA token in Authorization: Bearer │
│ - Server (OpenBao, NATS, Catalyst API) validates via the K8s │
│ TokenReview API → returns the (namespace, ServiceAccount) tuple │
│ - Authorization decisions are made on that tuple │
└──────────────────────────────────────────────────────────────────────┘
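A minimal Go sketch of the last two steps in the box above, assuming the server has already POSTed a `TokenReview` (with `spec.audiences` set) to `/apis/authentication.k8s.io/v1/tokenreviews` and holds the returned `status` JSON. Kubernetes reports ServiceAccount subjects as `system:serviceaccount:<namespace>:<name>`; the type and function names here are illustrative, not Catalyst code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// tokenReviewStatus mirrors the subset of authentication.k8s.io/v1
// TokenReview .status that this sketch reads.
type tokenReviewStatus struct {
	Authenticated bool     `json:"authenticated"`
	Audiences     []string `json:"audiences"`
	User          struct {
		Username string `json:"username"`
	} `json:"user"`
	Error string `json:"error"`
}

// Identity is the (namespace, ServiceAccount) tuple authorization keys on.
type Identity struct {
	Namespace      string
	ServiceAccount string
}

const saPrefix = "system:serviceaccount:"

// identityFromReview turns a TokenReview status into an Identity. It rejects
// unauthenticated tokens, tokens not valid for wantAudience, and non-SA users.
func identityFromReview(raw []byte, wantAudience string) (Identity, error) {
	var st tokenReviewStatus
	if err := json.Unmarshal(raw, &st); err != nil {
		return Identity{}, fmt.Errorf("decode status: %w", err)
	}
	if !st.Authenticated {
		return Identity{}, fmt.Errorf("token not authenticated: %s", st.Error)
	}
	audOK := false
	for _, a := range st.Audiences {
		if a == wantAudience {
			audOK = true
			break
		}
	}
	if !audOK {
		return Identity{}, errors.New("token audience mismatch")
	}
	rest, ok := strings.CutPrefix(st.User.Username, saPrefix)
	if !ok {
		return Identity{}, errors.New("subject is not a ServiceAccount")
	}
	ns, sa, ok := strings.Cut(rest, ":")
	if !ok || ns == "" || sa == "" {
		return Identity{}, errors.New("malformed ServiceAccount username")
	}
	return Identity{Namespace: ns, ServiceAccount: sa}, nil
}
```

Authorization then operates only on the returned tuple, e.g. `{catalyst-projector projector}`.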
Identity tuple examples in Catalyst (note the shape parallels SPIFFE ID spiffe://<sov>/ns/<ns>/sa/<sa> — preserved at namespace+SA granularity):
ns=catalyst-projector sa=projector ← control-plane microservice
ns=catalyst-gitea sa=gitea ← per-Sovereign Git server
ns=muscatpharmacy sa=wordpress ← Application workload
ns=catalyst-openbao sa=openbao ← OpenBao itself
OpenBao auth method: kubernetes (TokenReview-backed). Roles are bound to (namespace, ServiceAccount) tuples. Not the cert auth method, not JWT-SVID. See platform/cilium/chart/values.yaml:107-118 for the canonical comment locking this decision.
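A minimal Go sketch of a Pod logging in through that method, assuming OpenBao's Vault-compatible `POST /v1/auth/kubernetes/login` endpoint with the auth method mounted at the default `kubernetes` path. The role name, base URL and projected-token path are deployment-specific assumptions; the sketch configures neither TLS nor retries.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// loginRequest is the body of the kubernetes auth login endpoint.
type loginRequest struct {
	Role string `json:"role"`
	JWT  string `json:"jwt"`
}

// loginResponse is the subset of the login response this sketch reads.
type loginResponse struct {
	Auth *struct {
		ClientToken   string `json:"client_token"`
		LeaseDuration int    `json:"lease_duration"`
	} `json:"auth"`
	Errors []string `json:"errors"`
}

// buildLoginBody pairs the role (bound in OpenBao to a namespace+SA tuple)
// with the projected SA token read from disk.
func buildLoginBody(role, saToken string) ([]byte, error) {
	return json.Marshal(loginRequest{Role: role, JWT: strings.TrimSpace(saToken)})
}

// parseClientToken extracts the OpenBao client token from a login response.
func parseClientToken(body []byte) (string, error) {
	var resp loginResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", fmt.Errorf("decode login response: %w", err)
	}
	if len(resp.Errors) > 0 {
		return "", fmt.Errorf("openbao: %s", strings.Join(resp.Errors, "; "))
	}
	if resp.Auth == nil || resp.Auth.ClientToken == "" {
		return "", errors.New("openbao: no client_token in response")
	}
	return resp.Auth.ClientToken, nil
}

// login reads the projected token at tokenPath and exchanges it for an
// OpenBao client token.
func login(ctx context.Context, c *http.Client, baseURL, role, tokenPath string) (string, error) {
	raw, err := os.ReadFile(tokenPath)
	if err != nil {
		return "", err
	}
	body, err := buildLoginBody(role, string(raw))
	if err != nil {
		return "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		strings.TrimRight(baseURL, "/")+"/v1/auth/kubernetes/login", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var buf bytes.Buffer
	if _, err := buf.ReadFrom(resp.Body); err != nil {
		return "", err
	}
	return parseClientToken(buf.Bytes())
}
```

OpenBao performs the TokenReview itself; the Pod never holds a long-lived OpenBao credential.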
NATS JetStream auth: the bp-spire dependsOn was removed from clusters/_template/bootstrap-kit/07-nats-jetstream.yaml in PR #665. NATS no longer needs SVID-based auth: kernel-level WireGuard encryption on every node-to-node hop covers pod traffic in flight, and JetStream Account-level isolation handles per-Org boundaries.
Catalyst REST API auth: workload calls are authenticated by SA bound-token (TokenReview); user calls by Keycloak-issued JWT.
Why this configuration is sufficient today
| Concern | How it's met today |
|---|---|
| In-flight encryption | Cilium WireGuard, kernel-level, 100% mesh, no opt-out |
| Workload-to-workload authentication | K8s SA tokens validated server-side via TokenReview |
| Token rotation | Projected SA bound-tokens auto-rotate hourly (kubelet) |
| Defense against stolen long-lived tokens | Bound tokens are scoped to a single Pod + audience + 1h TTL; the legacy unbound SA secret-tokens are not used |
| Cross-Org isolation | vcluster boundary + NATS Account boundary + Keycloak realm boundary; SA tokens don't cross vcluster boundaries |
| Node-level identity | Cilium gives every node a WireGuard public key; CiliumNetworkPolicy + identity labels enforce L3/L7 policy at the eBPF datapath |
Re-enable triggers (when to re-introduce SPIRE)
The platform/spire/ chart is retained for the following scenarios. None applies today; re-enabling it requires a founder ruling that overrides PR #665.
1. Cross-Sovereign workload federation. When workloads in Sovereign A need to authenticate to services in Sovereign B without round-tripping through a shared K8s API server, SPIFFE federation (SPIFFE/SPIRE trust-bundle exchange) is the canonical path. K8s SA TokenReview is local to one cluster.
2. Compliance audit requiring sub-hour cryptographic workload attestation. SOC 2 Type II, PCI-DSS, or FedRAMP audits demanding (a) cryptographically attested workload identity (not a bearer token), (b) sub-hour rotation, and (c) a per-Pod fingerprint distinct from (namespace, SA). The SA-bound-token model proves (namespace, SA, audience) but not a Pod fingerprint; SPIRE attestation (k8s_psat node attestation plus k8s workload-attestor selectors) proves the fingerprint.
3. Per-workload-fingerprint authorization. When the policy decision must distinguish two Pods running the same SA in the same namespace (e.g. canary vs stable, or two replicas with different secrets), the SA token alone cannot tell them apart. SPIRE workload attestation can.

If any of (1)/(2)/(3) becomes a hard requirement, the re-introduction roadmap lives in TBD-V29 (#2055). Its 8-PR sketch:
- split platform/spire/ into platform/spire-crds/ + platform/spire/;
- add bp-spire-crds + bp-spire to clusters/_template/bootstrap-kit/;
- author ClusterSPIFFEID CRs for the ~6 first-wave services;
- add go-spiffe/v2 deps + tlsconfig.MTLSClientConfig to outbound HTTP clients;
- pair them with server-side tlsconfig.MTLSServerConfig + SPIFFE-ID ACLs;
- switch OpenBao auth from kubernetes to cert;
- re-enable oidc-discovery-provider;
- migrate remaining workloads in waves.

Estimated size: 2,000-3,500 LOC over 2-4 weeks.
3. Secrets: OpenBao + ESO
Static secrets (API tokens, passwords, signing keys, OAuth client secrets) live in OpenBao. They reach Pods via External Secrets Operator (ESO).
OpenBao (Raft cluster, region-local)
│
│ ┌──────────────────────────────────────────────┐
│ │ ExternalSecret CR in Git, in the Application │
│ │ Gitea repo. References path in OpenBao. │
│ └──────────────────────────────────────────────┘
│ │
│ ▼
│ ┌──────────────────────────────────────────────┐
│ │ ESO (in vcluster) reads ExternalSecret CR │
│ │ Authenticates to OpenBao via the `kubernetes`│
│ │ auth method (projected SA bound-token → │
│ │ TokenReview); transport secured by Cilium WG│
│ └──────────────────────────────────────────────┘
│ │
│ ▼
│ ┌──────────────────────────────────────────────┐
│ │ K8s Secret (rendered, versioned) │
│ │ Reloader watches hash → rolling deploy │
│ └──────────────────────────────────────────────┘
│ │
▼ ▼
(audit log + telemetry) Pod mounts the secret
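The "Reloader watches hash" step can be illustrated with a deterministic digest over the rendered Secret's data. This Go sketch assumes SHA-256 over sorted, length-prefixed key/value pairs compared against a hypothetical stored per-workload hash; Reloader's own hashing scheme and annotation names may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"io"
	"sort"
)

// writeField writes a length prefix then the bytes, so that "ab"+"c"
// and "a"+"bc" never produce the same stream.
func writeField(w io.Writer, b []byte) {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(b)))
	w.Write(n[:])
	w.Write(b)
}

// secretDataHash returns a deterministic digest of a Secret's data map.
// Keys are sorted so Go's random map iteration order cannot change it.
func secretDataHash(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		writeField(h, []byte(k))
		writeField(h, data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsRollout reports whether the hash recorded on the workload's pod
// template differs from the Secret's current content.
func needsRollout(recordedHash string, data map[string][]byte) bool {
	return recordedHash != secretDataHash(data)
}
```

Only the digest is compared or recorded, never the values, which keeps the trigger consistent with the no-plaintext rule.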
What's in Git (always):
- ExternalSecret CR pointing at an OpenBao path
- SecretStore CR pointing at the OpenBao endpoint
- SecretPolicy CR (rotation rules)
- Public keys, root CA certs (CRDs)
What's NEVER in Git:
- Secret values (passwords, tokens, private keys, etc.)
- OpenBao root tokens
- Static API credentials
4. Dynamic credentials
For databases, S3, and other systems supporting short-lived credentials, OpenBao mints them on demand:
Pod catalyst-secret-sidecar OpenBao (DB engine)
│ │ │
│ "give me Postgres" │ authenticates via SA bound-token │
│─────────────────────────►│ │
│ │ mints Postgres user │
│ │ TTL=1h │
│ │──────────────────────────────────►│
│ │ returns user/password │
│◄─────────────────────────│◄──────────────────────────────────│
│
│ connects to Postgres, opens connection pool
│
│ at T+50min: sidecar pre-emptively requests new creds
│ app drains old pool, swaps to new creds
│ no downtime
│
│ at T+1h: OpenBao revokes the old user
The sidecar is added automatically to any Pod whose Blueprint declares dynamicSecrets: true. Apps that prefer to handle credentials in-process can use the Catalyst SDK directly. Apps that can do neither get a rolling restart at the TTL boundary (acceptable for low-tier workloads).
Supported engines: PostgreSQL (CNPG), FerretDB (MongoDB-compatible), ClickHouse, Valkey, SeaweedFS (S3).
5. Multi-region OpenBao — INDEPENDENT, NOT STRETCHED
Critical: each region runs its own Raft cluster. There is no cross-region Raft quorum. Region failures are independent failure domains.
Region A (Muscat) Region B (Salalah) Region C (Frankfurt DR)
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ OpenBao cluster │ │ OpenBao cluster │ │ OpenBao cluster │
│ 3 Raft nodes │ │ 3 Raft nodes │ │ 3 Raft nodes │
│ INDEPENDENT │ │ INDEPENDENT │ │ INDEPENDENT │
│ Raft quorum │ │ Raft quorum │ │ Raft quorum │
└──────┬───────────┘ └──────────────────┘ └──────────────────┘
│ ▲ ▲
│ async log shipping │ async log shipping │
│ (Performance Replication) │ │
└────────────────────────────────┴────────────────────────────────┘
one-way: primary → secondaries; no cross-region quorum
5.1 Fault domain semantics
- Each region has its own self-contained 3-node Raft cluster. Quorum is intra-region only (need 2-of-3 in the same region).
- A total Region A failure does NOT require any other region to do anything. Region B and C continue serving reads from their local replicated data.
- Network partition between regions: each region keeps operating independently. Writes pause in standby regions, which are read-only by design and cannot reach the primary until the partition heals.
- DR promotion is explicit: either approved by a sovereign-admin or automated by the failover-controller under strict criteria. It is not triggered by every blip.
5.2 Read/write semantics
- Writes (rotations, new secrets) → primary OpenBao only.
- Reads → local OpenBao replica (sub-10ms latency in same continent).
- Replication lag <1s typical. Apps in B and C read post-rotation values without any cross-region call.
- Region failure → DR replica promoted by the failover-controller. New writes are blocked briefly during promotion (~30s). After promotion, the DR region accepts writes.
5.3 Why NOT a stretched cluster
A stretched Raft cluster (5 nodes across 3 regions, single quorum) seems superficially appealing but is fragile:
- A single region's network blip can cause loss of quorum if 3 of 5 nodes are in the affected region.
- Cross-region latency degrades all writes (every write needs cross-region majority ack).
- An entire region failure can leave the cluster without quorum.
We deliberately reject this pattern. Each region is its own failure domain.
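The quorum arithmetic behind 5.1 and 5.3 fits in a few lines of Go. This sketch only counts Raft voters (majority = ⌊n/2⌋+1) and ignores latency; the region names come from the diagram above.

```go
package main

// quorum is the Raft majority for an n-voter cluster.
func quorum(n int) int { return n/2 + 1 }

// survivesRegionLoss reports whether ONE Raft cluster whose voters are
// spread across regions keeps quorum when region `lost` goes dark.
func survivesRegionLoss(votersPerRegion map[string]int, lost string) bool {
	total := 0
	for _, n := range votersPerRegion {
		total += n
	}
	return total-votersPerRegion[lost] >= quorum(total)
}
```

A stretched 3/1/1 layout loses quorum with its large region (2 of 5 voters left, 3 needed). A 2/2/1 layout survives one region loss, but every write still needs 3 acks, which always spans at least two regions. The per-region design only ever needs `quorum(3) == 2` inside one region.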
6. Keycloak topology
Set at Sovereign provisioning time:
# In Sovereign CRD spec (set exactly one)
keycloakTopology: per-organization    # SME-style: each Org gets its own
# OR
# keycloakTopology: shared-sovereign  # Corporate: one Keycloak for the Sovereign
6.1 SME-style (per-organization)
Sovereign: omantel
└── Each Organization gets a minimal Keycloak (1 replica, embedded H2,
~150 MB RAM, no HA)
│
├── Organization muscatpharmacy
│ Keycloak realm: muscatpharmacy
│ Federations: Omantel-Mobile-OTP, Google, Apple
├── Organization acme-shop
│ Keycloak realm: acme-shop
└── …
Why per-Org for SME: blast radius. Muscat-pharmacy's Keycloak outage cannot affect Lulu-Hypermarket. Operationally cheap — minimal Keycloak fits in <200MB. SME tier customers don't need HA; if their Keycloak restarts in 10s during a deploy, that's tolerable.
Larger SMEs can opt into HA via a tier upgrade — same data model, just more replicas + Postgres backend instead of embedded H2.
6.2 Corporate (shared-sovereign)
Sovereign: bankdhofar
└── ONE Keycloak (HA, 3 replicas, Postgres backend)
Federates to Bank Dhofar's corporate Azure AD
│
├── Realm: catalyst-admin (sovereign-admin team)
├── Realm: core-banking (Org)
├── Realm: digital-channels (Org)
├── Realm: analytics (Org)
└── Realm: corporate-it (Org)
Why shared for corporate: the bank's security perimeter is the entire Sovereign. Every Organization within is a business unit of the same legal entity. Federation to Azure AD is the single auth choke-point anyway. Per-Org Keycloak would mean N times the Azure AD federation config — operational overhead with no security benefit.
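A hypothetical Go sketch of what each `keycloakTopology` value implies for instance and realm layout, following the two trees above. The `keycloak-<name>` instance naming is an assumption; the point is that per-organization yields one instance per Org (independent blast radius) while shared-sovereign puts every realm on one instance.

```go
package main

import "fmt"

// Values accepted by the Sovereign spec's keycloakTopology field.
const (
	PerOrganization = "per-organization"
	SharedSovereign = "shared-sovereign"
)

// RealmPlacement says which Keycloak instance hosts which realm.
type RealmPlacement struct {
	Instance string
	Realm    string
}

// planRealms returns the instance/realm layout implied by the topology.
func planRealms(topology, sovereign string, orgs []string) ([]RealmPlacement, error) {
	var out []RealmPlacement
	switch topology {
	case PerOrganization:
		// One minimal Keycloak per Org, each holding only that Org's realm.
		for _, org := range orgs {
			out = append(out, RealmPlacement{Instance: "keycloak-" + org, Realm: org})
		}
	case SharedSovereign:
		// One HA Keycloak: an admin realm plus one realm per Org.
		shared := "keycloak-" + sovereign
		out = append(out, RealmPlacement{Instance: shared, Realm: "catalyst-admin"})
		for _, org := range orgs {
			out = append(out, RealmPlacement{Instance: shared, Realm: org})
		}
	default:
		return nil, fmt.Errorf("unknown keycloakTopology %q", topology)
	}
	return out, nil
}
```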
6.3 App-level SSO
Every Application Blueprint can declare SSO support:
# in bp-wordpress configSchema
sso:
enabled: true # auto-creates a Keycloak client in the Org's realm
# injects credentials via OpenBao + ExternalSecret
End users get one-click SSO across all Apps in their Organization without ever seeing OAuth config.
7. Rotation policy
Every credential class has a SecretPolicy that drives automatic rotation.
apiVersion: catalyst.openova.io/v1alpha1
kind: SecretPolicy
metadata:
name: stricter-rotation
namespace: catalyst-system
spec:
appliesTo:
organizationLabels:
tier: regulated
rules:
- kind: database-credentials
maxTTL: 1h
autoRotate: true
- kind: api-token
maxTTL: 90d
autoRotate: true
rotateBefore: 7d
- kind: oauth-client-secret
maxTTL: 90d
autoRotate: true
- kind: signing-key
maxTTL: 365d
autoRotate: false # requires explicit approval
requireApproval: [security-officer]
- kind: tls-cert
maxTTL: cert-manager-managed
| Class | Default | Notes |
|---|---|---|
| Workload identity (K8s SA bound-token) | 1 h, auto-rotated by kubelet | Not configurable. Audience-scoped per consumer. SPIRE SVID (5-min, X.509-cert) is the future-state target if a §2 re-enable trigger fires. |
| Dynamic DB creds | 1 h, auto | Per-Blueprint TTL configurable. |
| API tokens, OAuth client secrets | 90 d, auto | rotateBefore: 7d gives apps a refresh window. |
| Signing keys, root CAs | 365 d, manual approval | Auto-rotation possible but disabled by default for high-impact keys. |
| TLS certs | cert-manager controlled | ACME (Let's Encrypt); renewed automatically by cert-manager on a ~60 d cadence. |
| User passwords (Keycloak) | User-managed + MFA | Min age policy enforced by realm. |
A security-officer sees a RotationDashboard view: every credential class, age, next rotation, force-rotate button (RBAC-gated).
8. The path of a secret value (no leakage)
1. Generated: Crossplane composition or OpenBao auto-generator creates value.
Never printed. Never echoed. Written directly to OpenBao via API.
2. Referenced: ExternalSecret CR in Git names the OpenBao path. No value in Git.
3. Materialized: ESO reads OpenBao path (auth via projected SA bound-token + TokenReview; transport encrypted by Cilium WireGuard), renders K8s Secret.
The K8s Secret is base64-encoded; never logged.
4. Consumed: Pod mounts as env or file. Reloader watches hash; rolls deploy
on change. Application sees plaintext only via mount or env.
5. Rotated: SecretPolicy controller invokes rotation API on OpenBao.
New value generated, replication propagates, ESO re-reads,
Reloader rolls. Old value retained for grace window (24h),
then revoked.
6. Audited: Every step logged to Catalyst audit log. No plaintext.
What never happens:
- Plaintext secrets in Git.
- Plaintext secrets in shell command output.
- Plaintext secrets in issues, PRs, comments, or chat.
- Plaintext secrets in commit messages, branch names, tag names.
If a secret is ever leaked via terminal output (a misconfigured kubectl describe, a debug log), the leak is treated as a P1 incident: rotate immediately, audit history, communicate.
9. Compliance posture
| Standard | Catalyst posture |
|---|---|
| SOC 2 Type 2 | Audit logging in JetStream + OpenSearch SIEM cold storage. SecretPolicy enforces rotation. EnvironmentPolicy enforces approvals. |
| PSD2 / FAPI | Fingate Blueprint composes Keycloak (FAPI authorization), eIDAS cert verification, ext_authz. |
| DORA | Resilience testing via Litmus chaos Blueprint. Multi-region by default for regulated tier. |
| NIS2 | Falco runtime detection + OpenSearch SIEM + Kyverno policy + supply-chain (cosign + Syft+Grype). |
| GDPR | Per-region data residency via Placement spec. Right-to-be-forgotten flow defined per Application Blueprint. |
| ISO 27001 | Mappings published per control; evidence surfaced via Catalyst console audit views and SIEM exports. |
Every Sovereign exports its audit log to a customer-specified SIEM. Default: OpenSearch in the Sovereign itself; customers may push to external Splunk, Datadog SIEM, etc.
10. Threat model summary
| Threat | Mitigation |
|---|---|
| Stolen ServiceAccount token | Projected SA bound-tokens are 1h TTL, audience-scoped, and Pod-bound (the API server rejects them once their Pod is deleted); legacy long-lived Secret-tokens are not used. (Future hardening: SPIRE SVID 5-min mTLS-cert if a §2 re-enable trigger fires.) |
| Stolen K8s Secret | Encrypted at rest in etcd. Pulled only via ESO with a projected SA bound-token (TokenReview-validated); transport encrypted by Cilium WireGuard. |
| Compromised Pod | NetworkPolicy (Cilium) + L7 policies limit blast radius. Falco detects anomalous syscalls. |
| Malicious commit to Environment Gitea | EnvironmentPolicy requires PR approvals. Kyverno admission control denies non-policy-compliant manifests. |
| Compromised Blueprint upstream | All Blueprints are cosigned. Kyverno verify-signatures policy denies unsigned/wrong-issuer artifacts. |
| Cross-Org leakage | vcluster isolation. JetStream Account isolation. Keycloak realm isolation (per-Org or shared). |
| Compromised sovereign-admin account | MFA required at Keycloak. JIT elevation for production-impacting actions. Full audit trail to SIEM. |
| Compromised OpenBao node | 2-of-3 Raft quorum required for writes. Audit log captures every read. Rotate root token + re-shard quarterly. |
| Region-wide failure | Independent OpenBao Raft per region. PowerDNS lua-records (ifurlup) drop the affected regional endpoint from authoritative responses within the health-check window. Apps with active-active keep serving from healthy region. |
| Supply-chain attack on a build | SLSA-3 build provenance, cosign signing, Syft+Grype SBOM scanned in CI and at runtime by Trivy. |
See ARCHITECTURE.md for the broader platform context.