PR #755 added `node-role.kubernetes.io/control-plane=true:NoSchedule` to the CP node when worker_count > 0. Two bootstrap-kit charts have pods that MUST land on the CP and lacked the matching toleration: bp-trivy • node-collector: Pod pinned to each node via nodeSelector `kubernetes.io/hostname=<node>`. The CP-bound collector reads /var/lib/etcd, /var/lib/kubelet, /var/lib/kube-scheduler, /var/lib/kube-controller-manager via hostPath — these only exist on the CP. Without the toleration the collector sat Pending forever on otech93 (live evidence in #769). • scanJobTolerations: per-workload scan jobs the operator spawns may target pods on CP-only system DaemonSets (kube-system kube-proxy in non-Cilium mode, etc.). Adding the toleration here so reports are produced for those workloads too. bp-alloy • DaemonSet — one pod MUST land on every node including the CP, so CP-local kubelet logs + node metrics flow into the LGTM stack. Without the toleration Alloy ran 3/4 nodes (Ready=N-1) on otech93 and CP telemetry was silently lost. Both tolerations are no-ops on solo Sovereigns (worker_count=0): the CP is untainted in solo mode per PR #755's conditional. Versions bumped: • bp-trivy 1.0.2 → 1.0.3 (Chart.yaml + 3× HelmRelease pins) • bp-alloy 1.0.0 → 1.0.1 (Chart.yaml + 3× HelmRelease pins) Out of scope (audited, no change needed): • bp-cilium — upstream defaults already tolerate everything (verified on otech93: cilium DaemonSet at 4/4 nodes). • bp-falco — values.yaml already declares NoSchedule + NoExecute Exists tolerations (4/4 on otech93). • cnpg/harbor — no kubelet-cert-renew Jobs in current charts. Verified: • `helm template` on both charts renders the expected toleration (alloy: pod-spec; trivy: trivy-operator-config ConfigMap consumed by the operator at scan-job spawn time). • `bash scripts/check-bootstrap-deps.sh` PASSED (no DAG drift). Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
72 lines
2.1 KiB
YAML
72 lines
2.1 KiB
YAML
# bp-trivy — Catalyst bootstrap-kit Blueprint #30 (W2.K3, Tier 7 — Security/Policy).
|
|
# Trivy Operator. Static-scanning half of the Catalyst security stack:
|
|
# vulnerability + misconfiguration scanning of running workloads, images,
|
|
# RBAC, and rendered manifests. Pairs with bp-falco (runtime, slot 31)
|
|
# and bp-kyverno (admission, slot 27).
|
|
#
|
|
# Wrapper chart: platform/trivy/chart/ (umbrella over upstream
|
|
# aquasecurity/trivy-operator chart, Catalyst-curated values under the
|
|
# `trivy-operator:` key).
|
|
# Reconciled by: Flux on the new Sovereign's k3s control plane.
|
|
#
|
|
# dependsOn:
|
|
# - bp-cert-manager — Trivy Operator's admission webhook (and its
|
|
# ConfigAuditReport mutating-webhook in HA mode) requires a TLS cert
|
|
# from the cluster's letsencrypt-prod / internal CA ClusterIssuer
|
|
# before the apiserver will route AdmissionReview traffic. Without
|
|
# bp-cert-manager Ready, the Certificate resource sits Pending and
|
|
# the webhook serves stale or no certs.
|
|
|
|
---
|
|
apiVersion: v1
|
|
kind: Namespace
|
|
metadata:
|
|
name: trivy-system
|
|
labels:
|
|
catalyst.openova.io/sovereign: ${SOVEREIGN_FQDN}
|
|
---
|
|
apiVersion: source.toolkit.fluxcd.io/v1beta2
|
|
kind: HelmRepository
|
|
metadata:
|
|
name: bp-trivy
|
|
namespace: flux-system
|
|
spec:
|
|
type: oci
|
|
interval: 15m
|
|
url: oci://ghcr.io/openova-io
|
|
secretRef:
|
|
name: ghcr-pull
|
|
---
|
|
apiVersion: helm.toolkit.fluxcd.io/v2
|
|
kind: HelmRelease
|
|
metadata:
|
|
name: bp-trivy
|
|
namespace: flux-system
|
|
spec:
|
|
interval: 15m
|
|
releaseName: trivy
|
|
targetNamespace: trivy-system
|
|
dependsOn:
|
|
- name: bp-cert-manager
|
|
chart:
|
|
spec:
|
|
chart: bp-trivy
|
|
version: 1.0.3
|
|
sourceRef:
|
|
kind: HelmRepository
|
|
name: bp-trivy
|
|
namespace: flux-system
|
|
# Event-driven install: Trivy Operator pulls a multi-hundred-MB
|
|
# vulnerability database on first run; pod Ready is dominated by
|
|
# initial DB hydration, not manifest apply. disableWait lets Flux
|
|
# mark this Ready as soon as manifests apply; runtime convergence
|
|
# (DB hydration, first scan reports landing) is observed via kubectl.
|
|
install:
|
|
disableWait: true
|
|
remediation:
|
|
retries: 3
|
|
upgrade:
|
|
disableWait: true
|
|
remediation:
|
|
retries: 3
|