fix(catalyst-api): cutover state-machine idempotent + Job-status checkpoint (Refs #2132) (#2135)

TBD-V56 — bp-self-sovereign-cutover state-machine now consults existing Jobs by their durable `cutover.openova.io/step=<stepName>` label before minting a new Job. Survives catalyst-api Pod restarts mid-cutover. Pre-fix t40 (2026-05-21, Sovereign be5b43a8f5033a77): * cutover started 04:58:25Z * `egress-block-test` Job 1 (`cutover-egress-block-test-1779345819`) reached Complete=True at 06:54:24Z * catalyst-api restarted at 07:07Z (operational, separate cache-sync fix) * Pod 2 spawned a NEW Job (`cutover-egress-block-test-1779347242`) and ran the 10-minute deny-egress hold A SECOND TIME * cutover stuck at 63% for ~3 hours; Step 09 (gitea-token-mint) never fired -> `sme/provisioning` init container blocked -> tenant onboarding blocked -> Phase 2 (qwen-code Sandbox) unreachable Root cause: `runCutoverStep` only ever consulted the in-memory runEpoch when checking what Job to attach to. A Pod restart lost the epoch + the engine never looked for Jobs from prior process lifetimes via their durable `cutover.openova.io/step=<stepName>` label. Fix in `products/catalyst/bootstrap/api/internal/handler/cutover.go`: 1. New helpers `findExistingJobsForStep` / `findExistingTerminalJobForStep` / `findExistingRunningJobForStep` list Jobs in the cutover namespace by the `cutover.openova.io/step=<stepName>` label and return the first Complete=True / Failed / Running match. Complete wins over Failed when both exist (retried step's success-eventually path). 2. `runCutoverStep` now does (in order): (a) if a prior Complete Job exists -> write `step.<name>.result=success` + `finishedAt` from Job's CompletionTime + `jobName=<prior-Job>` directly. NO new Job. (b) if a prior Failed Job exists -> write failed + halt. NO re-create. (c) if a non-terminal Job exists -> attach watch to it. (d) otherwise mint a fresh Job under the current runEpoch. 3. `jobCompletionTime` helper extracts canonical timestamp from `status.completionTime` with defensive fallbacks. Validation: * 4 new unit tests in `cutover_test.go`: - TestFindExistingTerminalJobForStep_PrefersCompleteOverFailed - TestRunCutoverStep_SkipsRerunWhenPriorJobComplete (the t40 regression guard: Job creates = 0 for the egress-block-test step when a prior Complete Job exists) - TestRunCutoverStep_SurfacesPriorFailedJob (Failed Job from prior process surfaced, no rerun) - TestHandleCutoverStart_IdempotentReusesPriorCompleteJob (HTTP /start path consumes prior Complete Job) * All 27 pre-existing cutover tests still pass. * `helm template products/catalyst/chart` renders 4208 lines cleanly. Chart 1.4.233 -> 1.4.234; bootstrap-kit pin synced. Walk impact: once Flux auto-rolls the new catalyst-api image to t40, the next reconcile tick finds Job `cutover-egress-block-test-1779347242` Complete=True, writes step.egress-block-test.result=success, advances to Step 09. `sme/provisioning` init container `wait-for-cutover-token` unblocks once Step 09 patches the `provisioning-github-token` Secret. Refs #2132 (NOT Closes — operator walk + screenshot on t40 is the DoD per CLAUDE.md §0). Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
2026-05-21 12:51:32 +04:00 · 2026-05-21 12:51:32 +04:00 · 6cd2c78654
commit 6cd2c78654
parent 21443e5486
4 changed files with 677 additions and 7 deletions
--- a/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
+++ b/clusters/_template/bootstrap-kit/13-bp-catalyst-platform.yaml
@ -855,7 +855,19 @@ spec:
      # tenant_created_wire_test.go (commit 56e04ac8a) keeps guarding
      # the wire shape. Refs #2133 (NOT Closes — operator walk on
      # fresh prov required).
-      version: 1.4.233
+      # 1.4.234 — TBD-V56 / #2132 (2026-05-21): bp-self-sovereign-cutover
+      # state-machine idempotent across catalyst-api Pod restarts. t40
+      # (2026-05-21) reproduced the failure mode: mid-cutover Pod
+      # restart caused the 10-min `egress-block-test` Job to run TWICE
+      # because the engine never consulted the durable
+      # `cutover.openova.io/step=<name>` label on existing Jobs from
+      # the prior process lifetime — it minted a fresh Job under a
+      # new in-memory runEpoch. Fix in cutover.go consults existing
+      # Jobs before minting; Complete=True flips
+      # step.<name>.result=success directly; no rerun. Unblocks Step 09
+      # (gitea-token-mint), unblocks customer-journey Phase 2 (Sandbox).
+      # Refs #2132 (NOT Closes — operator walk on fresh prov required).
+      version: 1.4.234
      sourceRef:
        kind: HelmRepository
        name: bp-catalyst-platform
--- a/products/catalyst/bootstrap/api/internal/handler/cutover.go
+++ b/products/catalyst/bootstrap/api/internal/handler/cutover.go
@ -573,6 +573,129 @@ func cutoverJobName(stepName string, runEpoch int64) string {
 	return fmt.Sprintf("cutover-%s-%d", stepName, runEpoch)
 }

+// cutoverStepLabelKey is the label every cutover Job (created by
+// createCutoverJob) carries with value == stepName. The reconcile
+// loop in runCutoverStep queries by this label to detect Jobs that
+// completed during a prior process lifetime — that's how the
+// state-machine becomes idempotent across catalyst-api Pod restarts
+// (TBD-V56 / #2132).
+const cutoverStepLabelKey = "cutover.openova.io/step"
+
+// findExistingJobsForStep returns every Job in the cutover namespace
+// whose `cutover.openova.io/step` label matches stepName, with the
+// most-recently-created Job first.
+//
+// This is the read-side seam that makes the state-machine idempotent
+// across catalyst-api restarts. Pre-fix the engine only ever consulted
+// the in-memory runEpoch when checking what Job to attach to — a Pod
+// restart lost the epoch and a NEW Job got minted even when the
+// previous Job had already reached Complete=True. That's how t40
+// (2026-05-21) ran the 10-minute `egress-block-test` hold TWICE on
+// the same Sovereign: Job 1 completed at 06:54:24Z, catalyst-api
+// restarted at 07:07Z, the resume path created Job 2 from scratch at
+// 07:07:23Z instead of attaching to the already-Complete Job 1.
+//
+// Post-fix the engine asks "is there already a Job for this step?"
+// via this helper BEFORE creating a fresh one. Any Job stamped by
+// createCutoverJob carries `cutover.openova.io/step=<stepName>`, so
+// the lookup is exact.
+func findExistingJobsForStep(ctx context.Context, deps *cutoverDeps, stepName string) ([]batchv1.Job, error) {
+	selector := fmt.Sprintf("%s=%s", cutoverStepLabelKey, stepName)
+	jobs, err := deps.core.BatchV1().Jobs(deps.ns).List(ctx, metav1.ListOptions{
+		LabelSelector: selector,
+	})
+	if err != nil {
+		return nil, fmt.Errorf("list existing Jobs for step %q in %q: %w", stepName, deps.ns, err)
+	}
+	// Newest first. The fake clientset rarely populates CreationTimestamp
+	// but every real Job has one stamped by the apiserver, so the sort
+	// is meaningful in production. Stable so equal timestamps preserve
+	// input order.
+	sort.SliceStable(jobs.Items, func(i, j int) bool {
+		return jobs.Items[i].CreationTimestamp.After(jobs.Items[j].CreationTimestamp.Time)
+	})
+	return jobs.Items, nil
+}
+
+// findExistingTerminalJobForStep scans existing Jobs for a step and
+// returns the first one that reached a terminal condition (Complete
+// or Failed). Returns (nil, "", false) if no Job has terminated yet
+// — the caller is then free to either attach to a still-running Job
+// or create a fresh one.
+//
+// Prefers Complete over Failed when both exist for the same step,
+// because a re-fired step (chart's auto-trigger Job retries after a
+// transient failure) is a valid "success eventually" path. If only
+// Failed exists, the caller surfaces that as the cutover failure
+// — exactly the engine's existing semantics.
+func findExistingTerminalJobForStep(ctx context.Context, deps *cutoverDeps, stepName string) (*batchv1.Job, batchv1.JobConditionType, bool) {
+	jobs, err := findExistingJobsForStep(ctx, deps, stepName)
+	if err != nil {
+		return nil, "", false
+	}
+	var failedJob *batchv1.Job
+	for i := range jobs {
+		j := &jobs[i]
+		cond, ok := terminalJobCondition(j)
+		if !ok {
+			continue
+		}
+		if cond == batchv1.JobComplete {
+			return j, batchv1.JobComplete, true
+		}
+		// Hold the first Failed in case no Complete exists.
+		if failedJob == nil {
+			failedJob = j
+		}
+	}
+	if failedJob != nil {
+		return failedJob, batchv1.JobFailed, true
+	}
+	return nil, "", false
+}
+
+// findExistingRunningJobForStep returns the first non-terminal Job
+// (still in flight) for a step, or nil. The engine attaches its watch
+// to a running Job instead of minting a fresh one when this returns
+// non-nil — covers the case where a previous process kicked off the
+// Job and the apiserver / kubelet are still progressing it after the
+// catalyst-api Pod restarted.
+func findExistingRunningJobForStep(ctx context.Context, deps *cutoverDeps, stepName string) *batchv1.Job {
+	jobs, err := findExistingJobsForStep(ctx, deps, stepName)
+	if err != nil {
+		return nil
+	}
+	for i := range jobs {
+		j := &jobs[i]
+		if _, terminal := terminalJobCondition(j); terminal {
+			continue
+		}
+		return j
+	}
+	return nil
+}
+
+// jobCompletionTime returns the Job's completion timestamp in RFC3339,
+// falling back to the latest condition's LastTransitionTime, then to
+// time.Now() if the Job is malformed. The fallbacks are defensive — a
+// terminal condition without timing data is a chart bug, but we'd
+// rather emit a slightly-imprecise audit row than drop the success
+// signal entirely.
+func jobCompletionTime(job *batchv1.Job) time.Time {
+	if job == nil {
+		return time.Now().UTC()
+	}
+	if job.Status.CompletionTime != nil {
+		return job.Status.CompletionTime.Time.UTC()
+	}
+	for _, c := range job.Status.Conditions {
+		if c.Type == batchv1.JobComplete && c.Status == corev1.ConditionTrue && !c.LastTransitionTime.IsZero() {
+			return c.LastTransitionTime.Time.UTC()
+		}
+	}
+	return time.Now().UTC()
+}
+
 // createCutoverJob creates a fresh Job from the step's PodSpec.
 //
 // Per docs/INVIOLABLE-PRINCIPLES.md #3 (Crossplane is the ONLY day-2
@ -845,12 +968,90 @@ func (h *Handler) runCutover(ctx context.Context, deps *cutoverDeps, steps []cut

 func (h *Handler) runCutoverStep(ctx context.Context, deps *cutoverDeps, step cutoverStep, runEpoch int64) error {
 	bus := h.cutoverBusFor()
+
+	// ── Idempotency check (TBD-V56 / #2132) ─────────────────────────────
+	//
+	// BEFORE writing `step.<name>.result=running` and minting a new Job,
+	// look for a Job from a prior process lifetime that already settled
+	// the step. This is the seam that survives catalyst-api Pod restarts:
+	//
+	//   1. If a Complete=True Job exists with the cutover.openova.io/step
+	//      label, write success to the durable ConfigMap and return
+	//      WITHOUT creating a new Job (no re-running 10-min holds).
+	//   2. If only a Failed Job exists, surface the failure — the
+	//      engine's halt semantics are unchanged from the pre-fix path.
+	//   3. If a non-terminal Job exists, attach the watch to that Job
+	//      name instead of minting a new one.
+	//   4. If no Job exists, mint a fresh Job under the current runEpoch.
+	//
+	// For DaemonSet-wait steps this is a no-op (the chart owns the
+	// DaemonSet lifecycle; we only wait for ready).
+	if step.mode == cutoverModeJob {
+		if job, cond, terminal := findExistingTerminalJobForStep(ctx, deps, step.stepName); terminal {
+			finishedAt := jobCompletionTime(job)
+			if cond == batchv1.JobComplete {
+				// Recover the step's startedAt from the durable ConfigMap
+				// if present, otherwise stamp it with the Job's CreationTimestamp
+				// as a fallback so audit rows remain populated.
+				priorStatus, _ := readCutoverStatus(ctx, deps)
+				startedAt := priorStatus["step."+step.stepName+".startedAt"]
+				if startedAt == "" {
+					if !job.CreationTimestamp.IsZero() {
+						startedAt = job.CreationTimestamp.UTC().Format(time.RFC3339)
+					} else {
+						startedAt = finishedAt.Format(time.RFC3339)
+					}
+				}
+				if err := patchCutoverStatus(ctx, deps, map[string]string{
+					"step." + step.stepName + ".startedAt":  startedAt,
+					"step." + step.stepName + ".finishedAt": finishedAt.Format(time.RFC3339),
+					"step." + step.stepName + ".result":     "success",
+					"step." + step.stepName + ".jobName":    job.Name,
+				}); err != nil {
+					return fmt.Errorf("status patch (resume-success): %w", err)
+				}
+				h.publishCutoverEvent(bus, cutoverEvent{
+					Time:    finishedAt.Format(time.RFC3339),
+					Phase:   cutoverPhaseStepFinished,
+					Level:   "info",
+					Step:    step.stepName,
+					JobName: job.Name,
+					Message: fmt.Sprintf("step %s already completed by prior Job %s; advancing without re-running", step.stepName, job.Name),
+				})
+				return nil
+			}
+			// cond == batchv1.JobFailed: surface as the step's terminal
+			// failure — the engine halts at this step, identical to a
+			// freshly-failed Job. We DO NOT re-create on Failed because
+			// the chart's PodSpec is idempotent state-ensure logic and
+			// a second attempt usually fails the same way; operator
+			// intervention is required.
+			if err := patchCutoverStatus(ctx, deps, map[string]string{
+				"step." + step.stepName + ".finishedAt": finishedAt.Format(time.RFC3339),
+				"step." + step.stepName + ".result":     "failed",
+				"step." + step.stepName + ".jobName":    job.Name,
+			}); err != nil {
+				return fmt.Errorf("status patch (resume-failed): %w", err)
+			}
+			return fmt.Errorf("Job %s/%s reported Failed condition (carried over from prior cutover attempt)", deps.ns, job.Name)
+		}
+	}
+
 	startedAt := time.Now().UTC()
 	jobOrDS := ""
+	var attachToExisting *batchv1.Job

 	switch step.mode {
 	case cutoverModeJob:
-		jobOrDS = cutoverJobName(step.stepName, runEpoch)
+		// Prefer attaching to a still-running Job from a prior process
+		// lifetime over minting a new one. This avoids the t40 failure
+		// mode where two Jobs ran the SAME 10-min hold back-to-back.
+		attachToExisting = findExistingRunningJobForStep(ctx, deps, step.stepName)
+		if attachToExisting != nil {
+			jobOrDS = attachToExisting.Name
+		} else {
+			jobOrDS = cutoverJobName(step.stepName, runEpoch)
+		}
 	case cutoverModeDaemonSetWait:
 		jobOrDS = step.daemonsetRef
 	}
@ -863,19 +1064,25 @@ func (h *Handler) runCutoverStep(ctx context.Context, deps *cutoverDeps, step cu
 		return fmt.Errorf("status patch (start): %w", err)
 	}

+	startMsg := fmt.Sprintf("step %s started (%s)", step.stepName, step.mode)
+	if attachToExisting != nil {
+		startMsg = fmt.Sprintf("step %s resumed by attaching to in-flight Job %s", step.stepName, attachToExisting.Name)
+	}
 	h.publishCutoverEvent(bus, cutoverEvent{
 		Time:    startedAt.Format(time.RFC3339),
 		Phase:   cutoverPhaseStepStarted,
 		Level:   "info",
 		Step:    step.stepName,
 		JobName: jobOrDS,
-		Message: fmt.Sprintf("step %s started (%s)", step.stepName, step.mode),
+		Message: startMsg,
 	})

 	switch step.mode {
 	case cutoverModeJob:
-		if _, err := createCutoverJob(ctx, deps, step, runEpoch); err != nil {
-			return err
+		if attachToExisting == nil {
+			if _, err := createCutoverJob(ctx, deps, step, runEpoch); err != nil {
+				return err
+			}
 		}
 		cond, err := watchJobToCompletion(ctx, deps, jobOrDS)
 		if err != nil {
--- a/products/catalyst/bootstrap/api/internal/handler/cutover_test.go
+++ b/products/catalyst/bootstrap/api/internal/handler/cutover_test.go
@ -945,3 +945,368 @@ func TestResumeInterruptedCutover_NoOpWhenNeverStarted(t *testing.T) {
 		t.Errorf("resume hook created %d Jobs on never-started cutover, want 0", jobCreates)
 	}
 }
+
+// ── TBD-V56 / #2132 — Job-status checkpoint idempotency ─────────────────────
+//
+// The t40 (2026-05-21) failure mode that motivated TBD-V56:
+//
+//   1. Pod 1 (catalyst-api) creates `cutover-egress-block-test-1779345819`.
+//      The Job runs the 10-minute deny-egress hold and reaches Complete=True
+//      at 06:54:24Z.
+//   2. Pod 1 restarts mid-cutover at 07:07Z (operationally, to fix a
+//      separate cache-sync issue). The success-patch to the status
+//      ConfigMap never lands.
+//   3. Pod 2 boots. ResumeInterruptedCutover reads the ConfigMap; status
+//      shows step.egress-block-test.result=running. Resume resets it
+//      to "", spawns runCutover from scratch.
+//   4. PRE-FIX: runCutoverStep mints a NEW Job
+//      `cutover-egress-block-test-1779347242` and runs the 10-min hold
+//      a SECOND time. Wall-clock waste: 10 minutes per step.
+//      POST-FIX: runCutoverStep consults findExistingTerminalJobForStep
+//      BEFORE minting a new Job. Job 1's Complete=True condition is
+//      observed; the engine writes step.<name>.result=success directly
+//      and advances to the next step without re-running.
+
+// makeCompletedJobForStep builds a Job in the cutover namespace already
+// stamped with a JobComplete=True condition + CompletionTime, labeled
+// for step lookup. Simulates a Job that completed during a prior
+// catalyst-api process lifetime — the canonical Pod-restart scenario.
+func makeCompletedJobForStep(stepName string, completionOffset time.Duration) *batchv1.Job {
+	completedAt := metav1.NewTime(time.Now().UTC().Add(-completionOffset))
+	return &batchv1.Job{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      fmt.Sprintf("cutover-%s-%d", stepName, time.Now().Unix()-int64(completionOffset.Seconds())),
+			Namespace: cutoverTestNS,
+			Labels: map[string]string{
+				cutoverStepPartOfLabel:    cutoverStepPartOfValue,
+				cutoverStepComponentLabel: "cutover-job",
+				cutoverStepLabelKey:       stepName,
+			},
+			CreationTimestamp: metav1.NewTime(completedAt.Time.Add(-5 * time.Minute)),
+		},
+		Status: batchv1.JobStatus{
+			Succeeded:      1,
+			CompletionTime: &completedAt,
+			Conditions: []batchv1.JobCondition{{
+				Type:               batchv1.JobComplete,
+				Status:             corev1.ConditionTrue,
+				LastTransitionTime: completedAt,
+				Reason:             "Test",
+			}},
+		},
+	}
+}
+
+// makeFailedJobForStep mirrors makeCompletedJobForStep but stamps a
+// terminal Failed condition.
+func makeFailedJobForStep(stepName string) *batchv1.Job {
+	completedAt := metav1.NewTime(time.Now().UTC().Add(-30 * time.Second))
+	return &batchv1.Job{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      fmt.Sprintf("cutover-%s-failed-%d", stepName, time.Now().Unix()),
+			Namespace: cutoverTestNS,
+			Labels: map[string]string{
+				cutoverStepPartOfLabel:    cutoverStepPartOfValue,
+				cutoverStepComponentLabel: "cutover-job",
+				cutoverStepLabelKey:       stepName,
+			},
+			CreationTimestamp: metav1.NewTime(completedAt.Time.Add(-5 * time.Minute)),
+		},
+		Status: batchv1.JobStatus{
+			Failed: 4,
+			Conditions: []batchv1.JobCondition{{
+				Type:               batchv1.JobFailed,
+				Status:             corev1.ConditionTrue,
+				LastTransitionTime: completedAt,
+				Reason:             "BackoffLimitExceeded",
+			}},
+		},
+	}
+}
+
+// TestFindExistingTerminalJobForStep_PrefersCompleteOverFailed covers
+// the read-side seam in isolation. A Complete Job MUST win over a
+// Failed Job when both exist for the same step — a retried step is
+// allowed to succeed eventually.
+func TestFindExistingTerminalJobForStep_PrefersCompleteOverFailed(t *testing.T) {
+	objs := []k8sruntime.Object{
+		makeFailedJobForStep("egress-block-test"),
+		makeCompletedJobForStep("egress-block-test", 1*time.Minute),
+	}
+	_, client := fakeHandlerWithCutover(t, objs...)
+
+	job, cond, terminal := findExistingTerminalJobForStep(context.Background(),
+		&cutoverDeps{core: client, ns: cutoverTestNS}, "egress-block-test")
+	if !terminal {
+		t.Fatalf("findExistingTerminalJobForStep returned terminal=false; want true")
+	}
+	if cond != batchv1.JobComplete {
+		t.Errorf("cond = %q, want %q (Complete must win over Failed when both exist)", cond, batchv1.JobComplete)
+	}
+	if job == nil {
+		t.Fatalf("job == nil; want the Complete=True Job")
+	}
+}
+
+// TestRunCutoverStep_SkipsRerunWhenPriorJobComplete is the canonical
+// Pod-restart regression test for TBD-V56 / #2132. Status ConfigMap
+// shows step result=running; a Complete=True Job for that step
+// already exists in the cluster (from the prior process lifetime).
+// The engine MUST NOT mint a new Job — the t40 10-minute-hold-ran-twice
+// bug. It must flip result=success directly off the prior Job and
+// advance.
+func TestRunCutoverStep_SkipsRerunWhenPriorJobComplete(t *testing.T) {
+	preStatus := &corev1.ConfigMap{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      cutoverStatusConfigMapName(),
+			Namespace: cutoverTestNS,
+		},
+		Data: map[string]string{
+			"cutoverComplete":  "false",
+			"cutoverStartedAt": "2026-05-21T04:58:25Z",
+			"totalSteps":       "3",
+			// Step 1 (gitea-mirror) succeeded cleanly in Pod 1.
+			"step.gitea-mirror.result":     "success",
+			"step.gitea-mirror.startedAt":  "2026-05-21T04:58:30Z",
+			"step.gitea-mirror.finishedAt": "2026-05-21T04:59:00Z",
+			// Step 2 (egress-block-test) — Pod 1 crashed AFTER the Job
+			// reached Complete=True but BEFORE the success-patch landed.
+			"step.egress-block-test.result":    "running",
+			"step.egress-block-test.startedAt": "2026-05-21T06:44:24Z",
+			"step.egress-block-test.jobName":   "cutover-egress-block-test-1779345819",
+		},
+	}
+	// The completed Job from Pod 1 is still in the cluster (24h TTL).
+	priorJob := makeCompletedJobForStep("egress-block-test", 13*time.Minute)
+	objs := []k8sruntime.Object{
+		makeCutoverStepCM("cutover-step-01-gitea-mirror", "gitea-mirror", 1, cutoverModeJob, minimalPodSpecYAML, ""),
+		makeCutoverStepCM("cutover-step-02-egress-block-test", "egress-block-test", 2, cutoverModeJob, minimalPodSpecYAML, ""),
+		makeCutoverStepCM("cutover-step-03-mirror-resync", "mirror-resync", 3, cutoverModeJob, minimalPodSpecYAML, ""),
+		preStatus,
+		priorJob,
+	}
+	h, client := fakeHandlerWithCutover(t, objs...)
+	// Auto-complete any Job that DOES get created so the engine can
+	// finish — but we ASSERT that egress-block-test is NOT re-created.
+	installJobReactor(t, client, batchv1.JobComplete)
+
+	// Count Job creates per step.
+	jobCreates := map[string]int{}
+	var muJobs sync.Mutex
+	client.PrependReactor("create", "jobs", func(action clienttesting.Action) (bool, k8sruntime.Object, error) {
+		ca, ok := action.(clienttesting.CreateAction)
+		if !ok {
+			return false, nil, nil
+		}
+		job, ok := ca.GetObject().(*batchv1.Job)
+		if !ok {
+			return false, nil, nil
+		}
+		muJobs.Lock()
+		jobCreates[job.Labels[cutoverStepLabelKey]]++
+		muJobs.Unlock()
+		return false, nil, nil
+	})
+
+	// Fire the on-startup resume path — what cmd/api/main.go calls.
+	h.ResumeInterruptedCutover(context.Background())
+
+	// Wait for engine to terminate.
+	deadline := time.Now().Add(15 * time.Second)
+	for time.Now().Before(deadline) {
+		bus := h.cutoverBusFor()
+		bus.mu.Lock()
+		running := bus.running
+		bus.mu.Unlock()
+		if !running {
+			break
+		}
+		time.Sleep(50 * time.Millisecond)
+	}
+
+	cm, err := client.CoreV1().ConfigMaps(cutoverTestNS).Get(context.Background(),
+		cutoverStatusConfigMapName(), metav1.GetOptions{})
+	if err != nil {
+		t.Fatalf("get status ConfigMap: %v", err)
+	}
+	if cm.Data["cutoverComplete"] != "true" {
+		t.Errorf("cutoverComplete = %q, want true; data=%+v", cm.Data["cutoverComplete"], cm.Data)
+	}
+	if cm.Data["step.egress-block-test.result"] != "success" {
+		t.Errorf("step.egress-block-test.result = %q, want success", cm.Data["step.egress-block-test.result"])
+	}
+	if cm.Data["step.egress-block-test.jobName"] != priorJob.Name {
+		t.Errorf("step.egress-block-test.jobName = %q, want %q (must carry the prior Job's name)",
+			cm.Data["step.egress-block-test.jobName"], priorJob.Name)
+	}
+
+	muJobs.Lock()
+	defer muJobs.Unlock()
+	if jobCreates["gitea-mirror"] != 0 {
+		t.Errorf("gitea-mirror Job creates = %d, want 0 (success in prior run)", jobCreates["gitea-mirror"])
+	}
+	// THE CORE ASSERTION — the t40 regression guard.
+	if jobCreates["egress-block-test"] != 0 {
+		t.Errorf("egress-block-test Job creates = %d, want 0 — TBD-V56 idempotency violated: a new Job was minted even though the prior Job is Complete=True", jobCreates["egress-block-test"])
+	}
+	if jobCreates["mirror-resync"] != 1 {
+		t.Errorf("mirror-resync Job creates = %d, want 1 (never ran before)", jobCreates["mirror-resync"])
+	}
+}
+
+// TestRunCutoverStep_SurfacesPriorFailedJob — when the only existing
+// Job for a step is Failed, the engine MUST NOT re-create it. The
+// cutover halts at that step with failedStep + lastError set. Mirrors
+// the existing FailsHaltAtFailedStep semantics but exercises the
+// Pod-restart code path (the failure happened in a prior process).
+func TestRunCutoverStep_SurfacesPriorFailedJob(t *testing.T) {
+	preStatus := &corev1.ConfigMap{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      cutoverStatusConfigMapName(),
+			Namespace: cutoverTestNS,
+		},
+		Data: map[string]string{
+			"cutoverComplete":             "false",
+			"cutoverStartedAt":            "2026-05-21T04:58:25Z",
+			"totalSteps":                  "2",
+			"step.gitea-mirror.result":    "running",
+			"step.gitea-mirror.startedAt": "2026-05-21T04:58:30Z",
+		},
+	}
+	priorJob := makeFailedJobForStep("gitea-mirror")
+	objs := []k8sruntime.Object{
+		makeCutoverStepCM("cutover-step-01-gitea-mirror", "gitea-mirror", 1, cutoverModeJob, minimalPodSpecYAML, ""),
+		makeCutoverStepCM("cutover-step-02-egress-block-test", "egress-block-test", 2, cutoverModeJob, minimalPodSpecYAML, ""),
+		preStatus,
+		priorJob,
+	}
+	h, client := fakeHandlerWithCutover(t, objs...)
+
+	jobCreates := map[string]int{}
+	var muJobs sync.Mutex
+	client.PrependReactor("create", "jobs", func(action clienttesting.Action) (bool, k8sruntime.Object, error) {
+		ca, ok := action.(clienttesting.CreateAction)
+		if !ok {
+			return false, nil, nil
+		}
+		job, ok := ca.GetObject().(*batchv1.Job)
+		if !ok {
+			return false, nil, nil
+		}
+		muJobs.Lock()
+		jobCreates[job.Labels[cutoverStepLabelKey]]++
+		muJobs.Unlock()
+		return false, nil, nil
+	})
+
+	h.ResumeInterruptedCutover(context.Background())
+
+	deadline := time.Now().Add(15 * time.Second)
+	for time.Now().Before(deadline) {
+		bus := h.cutoverBusFor()
+		bus.mu.Lock()
+		running := bus.running
+		bus.mu.Unlock()
+		if !running {
+			break
+		}
+		time.Sleep(50 * time.Millisecond)
+	}
+
+	cm, err := client.CoreV1().ConfigMaps(cutoverTestNS).Get(context.Background(),
+		cutoverStatusConfigMapName(), metav1.GetOptions{})
+	if err != nil {
+		t.Fatalf("get status ConfigMap: %v", err)
+	}
+	if cm.Data["cutoverComplete"] == "true" {
+		t.Errorf("cutoverComplete = true, want false (prior Failed Job halts the cutover)")
+	}
+	if cm.Data["failedStep"] != "gitea-mirror" {
+		t.Errorf("failedStep = %q, want gitea-mirror", cm.Data["failedStep"])
+	}
+	if cm.Data["step.gitea-mirror.result"] != "failed" {
+		t.Errorf("step.gitea-mirror.result = %q, want failed", cm.Data["step.gitea-mirror.result"])
+	}
+	muJobs.Lock()
+	defer muJobs.Unlock()
+	if jobCreates["gitea-mirror"] != 0 {
+		t.Errorf("gitea-mirror Job creates = %d, want 0 (prior Failed Job must NOT be re-created)", jobCreates["gitea-mirror"])
+	}
+	if jobCreates["egress-block-test"] != 0 {
+		t.Errorf("egress-block-test Job creates = %d, want 0 (engine must halt at the failed step)", jobCreates["egress-block-test"])
+	}
+}
+
+// TestHandleCutoverStart_IdempotentReusesPriorCompleteJob — the same
+// guarantee through the HTTP /start path. Operator/auto-trigger hits
+// /start on a Pod that hasn't yet picked up the prior-process state;
+// the engine must observe the prior Complete=True Job and NOT re-run.
+func TestHandleCutoverStart_IdempotentReusesPriorCompleteJob(t *testing.T) {
+	preStatus := &corev1.ConfigMap{
+		ObjectMeta: metav1.ObjectMeta{
+			Name:      cutoverStatusConfigMapName(),
+			Namespace: cutoverTestNS,
+		},
+		Data: map[string]string{
+			"cutoverComplete":  "false",
+			"cutoverStartedAt": "2026-05-21T04:58:25Z",
+			"totalSteps":       "1",
+			// Empty per-step rows — the orchestrator's perspective on
+			// boot; the Jobs in the cluster are the source of truth.
+		},
+	}
+	priorJob := makeCompletedJobForStep("only-step", 1*time.Minute)
+	objs := []k8sruntime.Object{
+		makeCutoverStepCM("cutover-step-01-only-step", "only-step", 1, cutoverModeJob, minimalPodSpecYAML, ""),
+		preStatus,
+		priorJob,
+	}
+	h, client := fakeHandlerWithCutover(t, objs...)
+	installJobReactor(t, client, batchv1.JobComplete)
+
+	jobCreates := 0
+	var muJobs sync.Mutex
+	client.PrependReactor("create", "jobs", func(action clienttesting.Action) (bool, k8sruntime.Object, error) {
+		muJobs.Lock()
+		jobCreates++
+		muJobs.Unlock()
+		return false, nil, nil
+	})
+
+	rec := httptest.NewRecorder()
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/sovereign/cutover/start", nil)
+	h.HandleCutoverStart(rec, req)
+	if rec.Code != http.StatusOK {
+		t.Fatalf("status = %d, want 200; body=%s", rec.Code, rec.Body.String())
+	}
+
+	deadline := time.Now().Add(15 * time.Second)
+	for time.Now().Before(deadline) {
+		bus := h.cutoverBusFor()
+		bus.mu.Lock()
+		running := bus.running
+		bus.mu.Unlock()
+		if !running {
+			break
+		}
+		time.Sleep(50 * time.Millisecond)
+	}
+
+	muJobs.Lock()
+	defer muJobs.Unlock()
+	if jobCreates != 0 {
+		t.Errorf("Job creates = %d, want 0 (prior Complete Job must be reused, not re-minted)", jobCreates)
+	}
+
+	cm, err := client.CoreV1().ConfigMaps(cutoverTestNS).Get(context.Background(),
+		cutoverStatusConfigMapName(), metav1.GetOptions{})
+	if err != nil {
+		t.Fatalf("get status ConfigMap: %v", err)
+	}
+	if cm.Data["cutoverComplete"] != "true" {
+		t.Errorf("cutoverComplete = %q, want true", cm.Data["cutoverComplete"])
+	}
+	if cm.Data["step.only-step.result"] != "success" {
+		t.Errorf("step.only-step.result = %q, want success", cm.Data["step.only-step.result"])
+	}
+}
--- a/products/catalyst/chart/Chart.yaml
+++ b/products/catalyst/chart/Chart.yaml
@ -2258,8 +2258,94 @@ name: bp-catalyst-platform
 #
 # Refs #1099 (NOT Closes — operator walk + screenshot is the DoD per
 # CLAUDE.md §0).
-version: 1.4.233
-appVersion: 1.4.232
+version: 1.4.234
+appVersion: 1.4.234
+# 1.4.234 — TBD-V56 / #2132 (2026-05-21): fix bp-self-sovereign-cutover
+# state-machine to checkpoint Job-status into the durable ConfigMap on
+# every reconcile tick — survives catalyst-api Pod restarts mid-cutover.
+#
+# Pre-fix t40 (2026-05-21, Sovereign be5b43a8f5033a77): cutover started
+# 04:58:25Z. `egress-block-test` Job 1 (`cutover-egress-block-test-
+# 1779345819`) reached Complete=True at 06:54:24Z, but catalyst-api
+# restarted at 07:07Z (operationally, to fix a separate cache-sync
+# issue) BEFORE the success-patch to the status ConfigMap landed. On
+# Pod 2, `ResumeInterruptedCutover` reset `step.egress-block-test.result`
+# from "running" to "" and spawned `runCutover` from scratch. The
+# engine then minted a fresh Job (`cutover-egress-block-test-
+# 1779347242`) and ran the 10-minute deny-egress hold A SECOND TIME.
+# Wall-clock waste: 10 minutes per step. Worse: cutover stuck at 63%
+# for ~3 hours because the success-write never landed — Step 09
+# (gitea-token-mint) never fired → `sme/provisioning` init container
+# `wait-for-cutover-token` blocked forever → tenant onboarding
+# blocked → Phase 2 (qwen-code Sandbox) unreachable.
+#
+# Root cause (CLAUDE.md §4.16 multi-layer RCA): `runCutoverStep`
+# only ever consulted the in-memory runEpoch when checking what Job
+# to attach to. A Pod restart lost the epoch + the engine never
+# looked for Jobs from prior process lifetimes via their durable
+# `cutover.openova.io/step=<stepName>` label.
+#
+# Fix lives in `products/catalyst/bootstrap/api/internal/handler/
+# cutover.go`:
+#
+#   1. New helper `findExistingTerminalJobForStep` lists every Job
+#      in the cutover namespace labeled `cutover.openova.io/step=
+#      <stepName>` and returns the first Complete=True (preferring
+#      Complete over Failed when both exist for a retried step).
+#   2. New helper `findExistingRunningJobForStep` returns the first
+#      non-terminal Job for a step so the engine attaches its watch
+#      to it instead of minting a fresh Job.
+#   3. `runCutoverStep` now does (in order): (a) if a prior Complete
+#      Job exists, write `step.<name>.result=success` + finishedAt
+#      from the Job's CompletionTime + jobName=<prior-Job's name>
+#      directly and return — NO new Job created; (b) if a prior
+#      Failed Job exists, write failed and halt — NO re-create; (c)
+#      if a non-terminal Job exists, attach watch to it; (d)
+#      otherwise mint a fresh Job under the current runEpoch.
+#   4. `jobCompletionTime` helper extracts the canonical timestamp
+#      from a Job's `status.completionTime` (with defensive fallbacks
+#      to condition.LastTransitionTime / time.Now).
+#
+# Idempotency surface (the founder's TBD-V56 acceptance criteria):
+#   - Pod-restart mid-cutover (step result=running, Complete Job in
+#     cluster) → engine flips result=success, no rerun.
+#   - Pod-restart mid-cutover (step result=running, Failed Job in
+#     cluster) → engine flips result=failed, no rerun. Halts.
+#   - POST /cutover/start while another cutover is running → 409
+#     (unchanged).
+#   - POST /cutover/start when cutoverComplete=true → 200 with
+#     existing state, no engine spawn (unchanged).
+#   - POST /cutover/start when in-flight on cluster but in-memory
+#     state empty (HTTP path) → engine consults existing Jobs, skips
+#     completed steps, attaches to running ones.
+#
+# Validation:
+#   - 4 new unit tests in `cutover_test.go`:
+#     - TestFindExistingTerminalJobForStep_PrefersCompleteOverFailed
+#     - TestRunCutoverStep_SkipsRerunWhenPriorJobComplete (the t40
+#       regression guard — asserts Job creates = 0 for the
+#       egress-block-test step when a prior Complete Job exists)
+#     - TestRunCutoverStep_SurfacesPriorFailedJob (Failed Job from
+#       prior process is surfaced, no rerun)
+#     - TestHandleCutoverStart_IdempotentReusesPriorCompleteJob
+#       (HTTP /start path consumes prior Complete Job)
+#   - All 27 pre-existing cutover tests still pass.
+#
+# Walk impact: once this PR merges + Flux auto-rolls the new
+# catalyst-api image to t40, the next reconcile tick will find Job
+# `cutover-egress-block-test-1779347242` Complete=True, write
+# step.egress-block-test.result=success to the ConfigMap, and
+# advance to step 09 (gitea-token-mint). The `sme/provisioning`
+# init container `wait-for-cutover-token` unblocks once Step 09
+# patches the `provisioning-github-token` Secret with the
+# `catalyst.openova.io/token-source: self-sovereign-cutover-step-09`
+# annotation. Customer journey on `walkdemo40.omani.homes`
+# unblocks. ALL future Sovereign provs benefit from restart-safe
+# cutover semantics.
+#
+# Refs #2132 TBD-V56. Per CLAUDE.md §0, this is NOT `Closes` — the
+# founder closes after operator-walk-with-screenshot verification.
+#
 # 1.4.232 — fix(sovereign-tls): render Cilium Gateway listener YAML in
 # the chart (templates/sovereign-tls-vars-cm.yaml) and feed it into the
 # sovereign-tls Kustomization via Flux postBuild.substituteFrom on