Commit Graph

8 Commits

Author SHA1 Message Date
e3mrah
cf35b4a9b6
fix(ci): blueprint.yaml spec.version lockstep in auto-bump (Closes #1856) (#1858)
A17 (#1855) hot-patched 6 drifted blueprints (cilium, cert-manager, flux,
openbao, keycloak, gitea) where blueprint.yaml spec.version had silently
fallen behind chart/Chart.yaml version, breaking
TestBootstrapKit_BlueprintCardsHaveRequiredFields. The structural root
cause: the TBD-A6 auto-bump hook in blueprint-release.yaml updated only
clusters/_template/bootstrap-kit/<N>-<chart>.yaml pins on every chart
publish — never the upstream platform/<bp>/blueprint.yaml.

This PR extends the auto-bump hook to lockstep platform/<bp>/blueprint.yaml
spec.version whenever Chart.yaml version bumps. Both file edits land in
the SAME commit (subject becomes `deploy(<chart>): bump bootstrap-kit pin
X -> Y (auto, Refs TBD-A6)` with a secondary line noting the blueprint
lockstep). Idempotent reset-and-rewrite retry preserved for the existing
parallel-matrix race case.

Workflow changes (.github/workflows/blueprint-release.yaml):
  * New step `bump_blueprint` after `bump_pin` — locates
    ${matrix.path}/blueprint.yaml OR ${matrix.path}/chart/blueprint.yaml
    (handles both platform-leaf and products-umbrella conventions),
    filters to kind:Blueprint (defensive against CRD yaml at the
    products/catalyst/chart/crds path), reads current spec.version at
    2-space indent, sed-rewrites to CHART_VERSION, verifies post-write.
  * Commit step renamed to "Commit + push bootstrap-kit pin bump +
    blueprint.yaml lockstep"; stages both files, single commit, with
    convergent retry on conflict.
  * Summary block surfaces both bumps separately.

Regression test (tests/e2e/bootstrap-kit/main_test.go):
  * New TestBootstrapKit_BlueprintVersionLockstepSweep — walks
    platform/* and products/*, discovers every Blueprint manifest with
    a sibling Chart.yaml, asserts spec.version == Chart.yaml version.
    Covers ALL ~70 blueprints, not just the canonical 10 kit ones the
    existing TestBootstrapKit_BlueprintCardsHaveRequiredFields gates.
  * Failure messages name the file, drift direction, and the exact sed
    command to fix — drift remediation is mechanical.

Drift cleanup (mandatory companion, same shape as A17/#1855):
  26 Application-Blueprint blueprints whose spec.version had been left
  at 1.0.0 / 0.1.0 while Chart.yaml moved forward are synced to the
  Chart.yaml version, which is authoritative. All currently surface in the new sweep
  test; without the cleanup the test would block this PR (and every
  subsequent one). Affected: alloy, cert-manager-{dynadot,powerdns}-webhook,
  cluster-autoscaler-hcloud, cnpg, crossplane-claims, external-secrets[-stores],
  falco, grafana, guacamole, harbor, hcloud-csi, k8s-ws-proxy, mimir,
  netbird, newapi, openclaw, powerdns, seaweedfs, self-sovereign-cutover,
  trivy, valkey, velero, vpa, products/dmz-vcluster.

After this lands, the next chart-version bump in any platform/<bp>/ folder
auto-converges all three artifacts (Chart.yaml, blueprint.yaml,
bootstrap-kit pin) in a single bot commit. No more manual collector PRs;
no more silent drift between chart and Blueprint manifest.

Closes #1856.
Refs #1855 (A17 hot-patch this replaces structurally), #1713 (original TBD-A6 auto-bump hook).

Co-authored-by: hatiyildiz <hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 01:04:22 +04:00
e3mrah
b5c9839da7
feat(phase-8b): sovereign wizard auth-gate + handover JWT minting + Playwright CI fixes (#611)
Squash of PR #611 (feat/607) + PR #615 (feat/605) Phase-8b deliverables:

UI:
- AuthCallbackPage: mode-aware dispatch (catalyst-zero → magic-link server
  callback; sovereign → client-side OIDC token exchange via oidc.ts)
- Router: sovereign console routes (/console/*), DETECTED_MODE index redirect,
  authCallbackRoute dedup fix, authHandoverRoute safety net
- StepSuccess: mints RS256 handover JWT via POST /deployments/{id}/mint-handover-token
  before redirecting operator to Sovereign console (falls back to plain URL on error)

API:
- main.go: wires handoverjwt.LoadOrGenerate signer from CATALYST_HANDOVER_KEY_PATH env
- deployments.go: stamps HandoverJWTPublicKey from signer.PublicJWK() at create time
- provisioner.go: injects HandoverJWTPublicKey into Tofu vars JSON
- auth.go: /auth/handover endpoint for seamless single-identity flow

Infra:
- cloudinit-control-plane.tftpl: writes handover JWT public JWK to /var/lib/catalyst/
- variables.tf: handover_jwt_public_key variable (sensitive, default empty)

Chart:
- api-deployment.yaml / ui-deployment.yaml / values.yaml: expose handover JWT env vars

Playwright CI fixes:
- playwright-smoke.yaml / cosmetic-guards.yaml: health-check URL /sovereign/wizard → /wizard
- playwright.config.ts: BASEPATH default /sovereign → / + baseURL construction fix
- cosmetic-guards.spec.ts: provision URL /sovereign/provision/* → /provision/*
- sovereign-wizard.spec.ts: WIZARD_URL /sovereign/wizard → /wizard

Closes #605, #606, #607. Fixes Playwright CI (#142 sovereign wizard smoke tests).

Co-authored-by: e3mrah <e3mrah@openova.io>
2026-05-02 19:17:56 +04:00
e3mrah
1e7d1e67c9
test(e2e): omantel handover Playwright scaffold for Phase 8 (closes #429) (#432)
Phase 8 of the omantel handover (#369) needs an automated E2E that proves
DoD: omantel.omani.works runs as a fully self-sufficient Sovereign with
zero contabo dependency post-handover. Today this is a SCAFFOLD — when
Phase 4/6/7 land, dispatching the new workflow against a live omantel is
the entire Phase 8.

Canonical seam (anti-duplication, per memory/feedback_anti_duplication_seam_first.md):
  - tests/e2e/playwright/tests/  ← mirror of sovereign-wizard.spec.ts shape
    (NOT specs/ as the issue body said — actual repo path is tests/)
  - tests/e2e/playwright/playwright.config.ts (BASE_URL handling, retries,
    workers=1, reporter=list) — reused as-is
  - tests/e2e/playwright/tests/_helpers.ts:reachable() — reused for the
    pre-flight skip-when-unreachable pattern
  - .github/workflows/playwright-smoke.yaml — workflow shape (checkout v4,
    setup-node v4, npm install, playwright install --with-deps chromium,
    upload-artifact on failure) — mirrored, NOT duplicated

What ships:
  - tests/e2e/playwright/tests/omantel-handover.spec.ts (NEW, 6 tests):
      1. sovereign Ready + 23/23 blueprints
      2. all bp-* HelmReleases Ready=True
      3. catalyst-platform self-hosts (healthz + dashboard "23 / 23 ready")
      4. vendor-agnostic Object Storage (post-#425 canonical secret name
         flux-system/object-storage — NOT hetzner-object-storage)
      5. dig +trace omantel.omani.works ends at omantel NS, not contabo
      6. zero contabo dependency (omantel /api/healthz keeps returning 200)
    Self-skips when OMANTEL_BASE_URL/OMANTEL_API_BASE/OPERATOR_BEARER unset.

  - .github/workflows/omantel-e2e-handover.yaml (NEW):
    workflow_dispatch ONLY (no schedule cron — per CLAUDE.md "every workflow
    MUST be event-driven, NEVER scheduled"). Inputs let the operator override
    base URLs at dispatch time.

  - docs/omantel-handover-wbs.md:
    new §10 "Phase 8 acceptance criteria (executable DoD)" — 6 bullets 1:1
    with the spec test() blocks; §9 status row added for #429
    (🟢 scaffold-shipped).

Local verification:
  cd tests/e2e/playwright && npm install && \
    npx playwright test --list tests/omantel-handover.spec.ts
  → 6 tests listed cleanly
  npx playwright test tests/omantel-handover.spec.ts
  → 6 skipped (env vars unset, expected)

Out of scope (per #425 / #428 territory split):
  - internal/hetzner/, infra/hetzner/, platform/velero/chart/,
    clusters/.../34-velero.yaml — #425's vendor-agnostic sweep
  - .github/workflows/check-vendor-coupling.yaml — #428's coupling guard

Co-authored-by: hatiyildiz <hatiyildiz@noreply.github.com>
2026-05-01 17:52:18 +04:00
hatiyildiz
dddbab4b80
fix(cloudinit): create flux-system/ghcr-pull secret on Sovereign so private bp-* charts pull cleanly
Every bootstrap-kit HelmRepository CR carries `secretRef: name: ghcr-pull`
because bp-* OCI artifacts at ghcr.io/openova-io/ are private. Cloud-init
never created the Secret, so every fresh Sovereign's source-controller
logs `secrets "ghcr-pull" not found` and Phase 1 stalls at bp-cilium.
The operator workaround (kubectl apply by hand) is not durable across
reprovisioning. Verified live on omantel.omani.works pre-fix.

Changes:

- provisioner.Request gains GHCRPullToken (json:"-") so it is never
  serialized into persisted deployment records. provisioner.New() reads
  CATALYST_GHCR_PULL_TOKEN at startup; Provision() stamps it onto the
  Request before tofu.auto.tfvars.json. Validate() rejects empty for
  domain_mode=pool with a pointer to docs/SECRET-ROTATION.md.
- handler.CreateDeployment also stamps the env var onto the Request so
  the synchronous validation path returns 400 early on misconfiguration.
- infra/hetzner: variables.tf adds ghcr_pull_token (sensitive=true,
  default=""). main.tf computes ghcr_pull_username + ghcr_pull_auth_b64
  locals and passes both to templatefile().
  cloudinit-control-plane.tftpl emits a kubernetes.io/dockerconfigjson
  Secret manifest into /var/lib/catalyst/ghcr-pull-secret.yaml; runcmd
  applies it AFTER Flux core install but BEFORE flux-bootstrap.yaml so
  the GitRepository + Kustomization land into a cluster that already
  has working GHCR creds.
- products/catalyst/chart/templates/api-deployment.yaml mounts
  CATALYST_GHCR_PULL_TOKEN from the catalyst-ghcr-pull-token Secret in
  the catalyst namespace (key: token, optional: true so the Pod still
  starts on misconfigured installs and Validate() owns the gate).
- docs/SECRET-ROTATION.md: yearly-rotation runbook for the GHCR token,
  Hetzner per-Sovereign tokens, and the Dynadot pool-domain creds.
  Includes the kubectl create secret one-liner with <GHCR_PULL_TOKEN>
  placeholder; the token never lives in git.
- Tests: provisioner unit tests cover New() reading the env var,
  tolerance of missing env, pool-mode validation rejection with
  operator-facing error, BYO acceptance, and the json:"-" serialization
  invariant. tests/e2e/hetzner-provisioning gains a
  TestCloudInit_RendersGHCRPullSecret render-only integration test that
  asserts the rendered cloud-init contains the Secret, applies it
  before flux-bootstrap, and that the dockerconfigjson round-trips the
  sample token through templatefile() correctly. Existing
  pool-mode handler tests now t.Setenv the placeholder token; the
  on-disk redaction test asserts the placeholder never reaches disk.

Gates:
- go vet ./... and go test -race -count=1 ./... in
  products/catalyst/bootstrap/api: PASS.
- helm lint products/catalyst/chart: PASS (warnings pre-existing).
- tofu fmt + tofu validate: deferred to CI (no tofu binary on the
  development host).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 18:07:27 +02:00
hatiyildiz
55b8a18b32
test(e2e): #142, #143, #144 — Playwright UI smoke tests for sovereign wizard, admin vouchers, marketplace bp-<x> grid
Group L closes the three UI smoke-test gaps the verify-sweep flagged:

  #142 sovereign wizard       — tests/e2e/playwright/tests/sovereign-wizard.spec.ts
  #143 admin voucher UI       — tests/e2e/playwright/tests/admin-vouchers.spec.ts
  #144 unified bp-<x> grid    — tests/e2e/playwright/tests/marketplace-cards.spec.ts

Tests target the actual shipped UI shape (Pass 105+):

* Wizard step model is StepOrg → StepTopology → StepProvider →
  StepCredentials → StepComponents → StepReview, not the original ticket's
  StepDomain/StepHetzner draft from before the unified-Blueprints refactor.
* Admin voucher model uses an `active` toggle, not ISSUED/REVOKED status.
* "Marketplace card grid" = the Catalyst wizard's StepComponents (bp-<x>
  Blueprints), NOT the SME marketplace at core/marketplace (which is for
  SaaS Apps). Today every Blueprint is `visibility: unlisted`, so the test
  asserts the data layer (catalog.generated.ts) plus the documented
  EmptyState; once `visibility: listed` lands, the third assertion
  auto-extends to the rendered card grid.

Per principle #4 ("never hardcode"), all URLs come from env vars with
sensible local-dev defaults. Per principle #1 ("never speculate"), tests
self-skip with explicit reasons when their target app isn't reachable
instead of failing noisily.

CI: .github/workflows/playwright-smoke.yaml boots the Catalyst UI in the
background and runs the suite on PRs touching UI sources or tests; admin
and marketplace specs self-skip in that workflow because spinning up all
three Astro apps + catalyst-api + Postgres is the full E2E pipeline's
job, not this smoke.

Local run (Catalyst UI on :4399, admin on :4398): 5 passed, 2 skipped
(skip reasons: marketplace #3 needs StepComponents reachable past
required-field gating; admin #2 needs ADMIN_TEST_COOKIE for an
authenticated session).

Refs: #142, #143, #144
2026-04-28 19:54:04 +02:00
hatiyildiz
a35da929f1
feat(sovereign-route): values-driven /sovereign + /api/v1 routing
Per docs/INVIOLABLE-PRINCIPLES.md #4 (never hardcode), the catalyst-ui
nginx config now flows from values.yaml at chart-render time:

- routing.basePath (/sovereign) — also drives ingress strip-prefix
- routing.catalystApi.serviceDNS — in-cluster reverse-proxy target
- routing.catalystApi.port — upstream port
- dns.resolverIP — CoreDNS for proxy-time resolution (avoids stale
  ClusterIP after catalyst-api restarts)
- ingress.host / ingress.priority / ingress.className

Files:
- products/catalyst/chart/values.yaml — new, documents every default
- products/catalyst/chart/templates/ui-configmap.yaml — new, nginx
  reverse-proxies /api/* to catalyst-api Service DNS
- products/catalyst/chart/templates/ui-deployment.yaml — mounts the
  ConfigMap at /etc/nginx/conf.d/default.conf
- products/catalyst/chart/templates/ingress.yaml — values-driven host
  + path + priority + class
- tests/e2e/sovereign-routing/* — Playwright smoke for the routing

Captured from stalled agent /tmp/agent-sovereign-route-finish — agent
stream watchdog timed out after the work was authored but before commit.
2026-04-28 19:48:40 +02:00
hatiyildiz
7c7c46bc62
test: Hetzner Sovereign end-to-end provisioning test (#141)
Closes the Group L "end-to-end provisioning test on Hetzner test project"
ticket. Per the ticket's exact wording: scaffolding + harness + CI
workflow, gated on HETZNER_TEST_TOKEN, NEVER mocked.

Lifecycle when HETZNER_TEST_TOKEN is set:
  1. Generate unique sovereign FQDN (e2e-<run-id>.openova.io)
  2. Stage canonical infra/hetzner/ OpenTofu module into temp dir
  3. Render tofu.auto.tfvars.json with test inputs (BYO domain mode so
     Dynadot isn't touched; region runtime-configurable; SSH key minted
     by CI per-run)
  4. tofu init && tofu apply -auto-approve (30m timeout)
  5. Assert outputs: control_plane_ip + load_balancer_ip are valid IPv4
  6. Assert TCP/22 reachable on control plane (5m await)
  7. Assert TCP/443 reachable on LB after Cilium + Flux land (15m await,
     soft-failure since the Catalyst control plane install is the long
     tail and partial-bootstrap is acceptable proof of OpenTofu + Flux)
  8. tofu destroy -auto-approve (always — t.Cleanup, runs even on fail)
  9. Verify state list is empty after destroy (no leaked resources)

When HETZNER_TEST_TOKEN is absent, the test SKIPS — does not mock, does
not fall through to a stub. Per docs/INVIOLABLE-PRINCIPLES.md #2,
mocking the cloud would tell us nothing about whether the OpenTofu module,
hcloud provider, cloud-init scripts, or k3s actually work. A second test
(TestHarness_NoHetznerCredsSkips) explicitly verifies the skip semantics
so future refactors don't accidentally land mocking.
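
The skip-not-mock gate can be sketched as below. The `skipper` interface and `requireHetznerToken` are hypothetical; `*testing.T` satisfies the interface, and the recorder exists only so the sketch runs outside `go test`.

```go
package main

import (
	"fmt"
	"os"
)

// skipper is the one method of *testing.T that the gate needs.
type skipper interface {
	Skipf(format string, args ...any)
}

// requireHetznerToken returns the token, or skips when it is absent.
// There is deliberately no mock or stub fallback.
func requireHetznerToken(t skipper, getenv func(string) string) string {
	token := getenv("HETZNER_TEST_TOKEN")
	if token == "" {
		// With a real *testing.T, Skipf stops the test at this point.
		t.Skipf("HETZNER_TEST_TOKEN is unset; skipping real provisioning (no mock fallback)")
	}
	return token
}

// recorder is a stand-in for *testing.T that remembers whether a skip happened.
type recorder struct {
	skipped bool
	reason  string
}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = true
	r.reason = fmt.Sprintf(format, args...)
}

func main() {
	rec := &recorder{}
	token := requireHetznerToken(rec, os.Getenv)
	fmt.Println("skipped:", rec.skipped, "token present:", token != "")
}
```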

CI workflow (.github/workflows/test-hetzner-e2e.yaml):
  - Triggers on workflow_dispatch (operator initiates a real run) or on a
    PR labeled `test/hetzner-e2e`, NOT on every push (each run costs real
    Hetzner minutes, ~EUR 0.005/run).
  - Generates a per-run throwaway SSH ed25519 keypair so no long-lived
    secret key lands in any logs.
  - Installs OpenTofu via opentofu/setup-opentofu@v1.
  - Reads HETZNER_TEST_TOKEN + HETZNER_TEST_PROJECT_ID from repo secrets;
    operator populates them out-of-band (per the ticket: "operator will
    populate later").
  - 55m job timeout; the test itself uses context timeouts of 30m for
    apply and 20m for destroy.

Files:
  - tests/e2e/hetzner-provisioning/main_test.go (the harness)
  - tests/e2e/hetzner-provisioning/go.mod (separate module, stdlib-only)
  - .github/workflows/test-hetzner-e2e.yaml (gated CI)

Refs #141

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:00:29 +02:00
hatiyildiz
3dced3fdda
test: bootstrap-kit Flux Kustomization integration test (#145)
Closes the Group L "integration test — provisioner backend bootstrap-kit
installer — all 11 phases install in sequence on a kind cluster" ticket.

Per the ticket note, the bootstrap installer is now Flux-driven from
clusters/<sovereign-fqdn>/ — NOT the bespoke Go-based installer that was
reverted in commit e668637. The test verifies that Flux reconciles the
right Kustomizations rather than that Go code helm-installs anything.

Two layers of validation:

1. Static manifest layer (runs on every push, cheap)
   - All 11 platform/<x>/blueprint.yaml + chart/Chart.yaml exist
   - Each blueprint.yaml satisfies catalyst.openova.io/v1alpha1 schema
     (apiVersion/kind/metadata.name/spec.version/card.title/card.summary)
   - Chart.yaml name matches "bp-<x>" and version matches blueprint.yaml
     spec.version
   - clusters/_template/ YAMLs parse after SOVEREIGN_FQDN_PLACEHOLDER
     substitution (when the template tree is on the branch — Group J/M
     ticket lands the per-Sovereign template)
   - The dependency order matches the canonical 11-phase sequence from
     SOVEREIGN-PROVISIONING.md §3 (cilium → cert-manager → flux →
     crossplane → sealed-secrets → spire → nats-jetstream → openbao →
     keycloak → gitea → bp-catalyst-platform)

2. Kind-cluster layer (runs on main pushes, gated on
   BOOTSTRAP_KIT_KIND_TEST=1)
   - Brings up kubernetes-in-docker
   - Installs Flux CRDs + source/kustomize controllers
   - Registers a GitRepository pointing at this monorepo
   - Synthesizes the 11 bootstrap-kit Kustomizations and applies them
   - Asserts the API server accepts all 11 (manifests are valid, schema
     satisfied) — this is the test's narrow scope per the ticket
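
The dependency-order check from the static layer could be sketched like this. The canonical sequence is the one listed above; `checkOrder` and the sample dependsOn edges are hypothetical, since this commit message does not say how the real test derives the order.

```go
package main

import "fmt"

// canonical is the 11-phase sequence quoted from SOVEREIGN-PROVISIONING.md §3.
var canonical = []string{
	"cilium", "cert-manager", "flux", "crossplane", "sealed-secrets", "spire",
	"nats-jetstream", "openbao", "keycloak", "gitea", "bp-catalyst-platform",
}

// checkOrder verifies that every dependency appears earlier in the sequence
// than the component that depends on it.
func checkOrder(sequence []string, dependsOn map[string][]string) error {
	position := make(map[string]int, len(sequence))
	for i, name := range sequence {
		position[name] = i
	}
	for _, name := range sequence {
		for _, dep := range dependsOn[name] {
			depPos, ok := position[dep]
			if !ok {
				return fmt.Errorf("%s depends on %s, which is not in the sequence", name, dep)
			}
			if depPos >= position[name] {
				return fmt.Errorf("%s depends on %s, which is installed later", name, dep)
			}
		}
	}
	return nil
}

func main() {
	// Hypothetical edges, for illustration only.
	dependsOn := map[string][]string{
		"cert-manager":         {"cilium"},
		"flux":                 {"cert-manager"},
		"bp-catalyst-platform": {"gitea", "keycloak"},
	}
	fmt.Println(checkOrder(canonical, dependsOn))
}
```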

The test deliberately does NOT wait for the kit to fully install upstream
charts or reach steady-state reconciliation. That belongs to #141 (real
Hetzner E2E with cloud credentials and outbound network), not a kind
cluster test in CI.

Files:
  - tests/e2e/bootstrap-kit/main_test.go (Go test, 11 subtests + 4 main)
  - tests/e2e/bootstrap-kit/go.mod (separate module — keeps test deps
    isolated from the production Go modules)
  - .github/workflows/test-bootstrap-kit.yaml (kind-action + flux2/action)

Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:18 +02:00