openova/clusters/_template/bootstrap-kit/30-trivy.yaml
e3mrah 0dbdf3b327
fix(bp-trivy): node-collector tolerates control-plane taint (closes #769) (#772)
PR #755 added `node-role.kubernetes.io/control-plane=true:NoSchedule` to
the CP node when worker_count > 0. Two bootstrap-kit charts have pods
that MUST land on the CP and lacked the matching toleration:

bp-trivy
  • node-collector: Pod pinned to each node via nodeSelector
    `kubernetes.io/hostname=<node>`. The CP-bound collector reads
    /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler,
    /var/lib/kube-controller-manager via hostPath — these only exist
    on the CP. Without the toleration the collector sat Pending forever
    on otech93 (live evidence in #769).
  • scanJobTolerations: per-workload scan jobs the operator spawns may
    target pods on CP-only system DaemonSets (kube-system kube-proxy
    in non-Cilium mode, etc.). Adding the toleration here so reports
    are produced for those workloads too.

bp-alloy
  • DaemonSet — one pod MUST land on every node including the CP, so
    CP-local kubelet logs + node metrics flow into the LGTM stack.
    Without the toleration Alloy ran 3/4 nodes (Ready=N-1) on otech93
    and CP telemetry was silently lost.

Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP
is untainted in solo mode per PR #755's conditional.

Versions bumped:
  • bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins)
  • bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins)

Out of scope (audited, no change needed):
  • bp-cilium — upstream defaults already tolerate everything (verified
    on otech93: cilium DaemonSet at 4/4 nodes).
  • bp-falco — values.yaml already declares NoSchedule + NoExecute
    Exists tolerations (4/4 on otech93).
  • cnpg/harbor — no kubelet-cert-renew Jobs in current charts.

Verified:
  • `helm template` on both charts renders the expected toleration
    (alloy: pod-spec; trivy: trivy-operator-config ConfigMap consumed
     by the operator at scan-job spawn time).
  • `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:38:29 +02:00

72 lines
2.1 KiB
YAML

# bp-trivy — Catalyst bootstrap-kit Blueprint #30 (W2.K3, Tier 7 — Security/Policy).
# Trivy Operator. Static-scanning half of the Catalyst security stack:
# vulnerability + misconfiguration scanning of running workloads, images,
# RBAC, and rendered manifests. Pairs with bp-falco (runtime, slot 31)
# and bp-kyverno (admission, slot 27).
#
# Wrapper chart: platform/trivy/chart/ (umbrella over upstream
# aquasecurity/trivy-operator chart, Catalyst-curated values under the
# `trivy-operator:` key).
# Reconciled by: Flux on the new Sovereign's k3s control plane.
#
# dependsOn:
# - bp-cert-manager — Trivy Operator's admission webhook (and its
# ConfigAuditReport mutating-webhook in HA mode) requires a TLS cert
# from the cluster's letsencrypt-prod / internal CA ClusterIssuer
# before the apiserver will route AdmissionReview traffic. Without
# bp-cert-manager Ready, the Certificate resource sits Pending and
# the webhook serves stale or no certs.
---
apiVersion: v1
kind: Namespace
metadata:
name: trivy-system
labels:
catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
name: bp-trivy
namespace: flux-system
spec:
type: oci
interval: 15m
url: oci://ghcr.io/openova-io
secretRef:
name: ghcr-pull
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: bp-trivy
namespace: flux-system
spec:
interval: 15m
releaseName: trivy
targetNamespace: trivy-system
dependsOn:
- name: bp-cert-manager
chart:
spec:
chart: bp-trivy
version: 1.0.3
sourceRef:
kind: HelmRepository
name: bp-trivy
namespace: flux-system
# Event-driven install: Trivy Operator pulls a multi-hundred-MB
# vulnerability database on first run; pod Ready is dominated by
# initial DB hydration, not manifest apply. disableWait lets Flux
# mark this Ready as soon as manifests apply; runtime convergence
# (DB hydration, first scan reports landing) is observed via kubectl.
install:
disableWait: true
remediation:
retries: 3
upgrade:
disableWait: true
remediation:
retries: 3