contextfoundry.dev

Checkpoint Cascade Recovery

Surviving mid-pipeline crashes without losing progress.

Context Foundry v0.7.4

The problem

Build loops crash. OOM kills, API rate limits, ctrl-C, power failures -- any of these can terminate the pipeline mid-stage. Context Foundry writes checkpoints after each stage completes so it can resume without re-running everything from scratch.

But checkpoints alone are not enough. A checkpoint says "the builder stage completed." It does not say "and the builder's output file is intact." If the process was killed between writing the checkpoint and flushing the artifact to disk, you get a checkpoint that claims success and an artifact that is missing or truncated.

The next stage reads the checkpoint, sees "builder completed", skips straight to reviewer -- and the reviewer opens build-claims.md to find nothing. It either errors out or, worse, produces a review of an empty file.

Before: trust the checkpoint

The original checkpoint logic was straightforward:

on_stage_complete(stage):
    checkpoint.stage = stage
    checkpoint.timestamp = now()
    checkpoint.save()

on_resume(task_id):
    checkpoint = load_checkpoint(task_id)
    skip_to(checkpoint.stage + 1)

This works perfectly when nothing goes wrong. It fails silently when the artifact is missing. The checkpoint file is small and is written atomically. The artifact (a markdown file that can be several KB) might not flush before the crash.

The cascade fix

On resume, Context Foundry now validates that expected artifacts exist for each completed stage before skipping it. If an artifact is missing, the pipeline rewinds to the stage that should have produced it -- and cascades backward if needed.

Each pipeline stage has a defined set of output artifacts:

Scout -- produces .buildloop/scout-report.md
Planner -- produces .buildloop/current-plan.md
Builder -- produces .buildloop/build-claims.md
Reviewer -- produces .buildloop/review-report.md
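In code, that mapping is just an enum-to-path function. This is a sketch; the type and function names are illustrative, not Context Foundry's actual identifiers:

```rust
// Hypothetical stage/artifact mapping, based on the paths listed above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PipelineStage { Scout, Planner, Builder, Reviewer }

fn artifact_path(stage: PipelineStage) -> &'static str {
    match stage {
        PipelineStage::Scout => ".buildloop/scout-report.md",
        PipelineStage::Planner => ".buildloop/current-plan.md",
        PipelineStage::Builder => ".buildloop/build-claims.md",
        PipelineStage::Reviewer => ".buildloop/review-report.md",
    }
}

fn main() {
    for s in [PipelineStage::Scout, PipelineStage::Planner,
              PipelineStage::Builder, PipelineStage::Reviewer] {
        assert!(artifact_path(s).starts_with(".buildloop/"));
    }
    assert_eq!(artifact_path(PipelineStage::Builder), ".buildloop/build-claims.md");
}
```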

The resume logic walks backward through the checkpoint stages, checking each artifact:

on_resume(task_id):
    checkpoint = load_checkpoint(task_id)
    resume_from = checkpoint.stage + 1

    // Walk backward, validating artifacts
    for stage in [reviewer, builder, planner, scout]:
        if checkpoint.stage >= stage:
            if !artifact_exists(stage):
                resume_from = stage
                log("Checkpoint says {stage} completed but
                     {artifact} missing -- re-running
                     from {stage}")

    run_pipeline_from(resume_from)

The cascade is the key insight. If current-plan.md is missing but the checkpoint says "builder completed", the pipeline does not just re-run the planner. It checks whether scout-report.md also exists, because the planner depends on it. If the scout report is also missing, it goes all the way back to scout.
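Made concrete, the backward walk is a small pure function. This sketch replaces artifact_exists() with a set lookup so it runs standalone; the names and types are assumptions, not Context Foundry's internals:

```rust
use std::collections::HashSet;

// Stage order matters: the derived Ord follows declaration order,
// so Scout < Planner < Builder < Reviewer.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
enum Stage { Scout, Planner, Builder, Reviewer }

// Backward walk from the pseudocode above. `present` stands in for
// artifact_exists(); the real pipeline would stat the file on disk.
// Returns Some(stage) to rewind to, or None if all artifacts are intact.
fn resume_stage(completed: Stage, present: &HashSet<Stage>) -> Option<Stage> {
    let mut resume_from = None;
    for stage in [Stage::Reviewer, Stage::Builder, Stage::Planner, Stage::Scout] {
        if completed >= stage && !present.contains(&stage) {
            // Cascade: keep rewinding to the earliest missing artifact.
            resume_from = Some(stage);
        }
    }
    resume_from
}

fn main() {
    // Checkpoint says builder completed, but build-claims.md and
    // current-plan.md are gone; only scout-report.md survived.
    let present: HashSet<Stage> = [Stage::Scout].into_iter().collect();
    assert_eq!(resume_stage(Stage::Builder, &present), Some(Stage::Planner));
}
```

The example mirrors the TUI log below: builder checkpointed, two artifacts missing, so the walk lands on planner rather than builder.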

What you see

When cascade recovery kicks in, the TUI log pane shows exactly what happened:

[info] Resuming task T23.1 from checkpoint
[info] Checkpoint says builder completed
[warn] build-claims.md missing -- rewinding
[warn] current-plan.md missing -- rewinding
[info] scout-report.md exists -- valid
[info] Re-running from planner stage

No silent failures. No garbage passed to downstream stages. The pipeline recovers to a consistent state and continues.

Implementation details

The checkpoint struct tracks three fields:

struct Checkpoint {
    task_id: String,
    stage: PipelineStage,
    timestamp: DateTime<Utc>,
}

Checkpoints are written to .buildloop/checkpoint.json as atomic file writes (write to temp file, then rename). The checkpoint file itself is always consistent -- the issue is only with the larger artifact files.
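The write-to-temp-then-rename pattern looks roughly like this. A std-only sketch under stated assumptions: the temp-file suffix is an illustrative choice, and the rename's atomicity holds for same-filesystem renames on POSIX systems.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Atomic save: write the full payload to a sibling temp file, flush it,
// then rename over the destination so readers never see a partial file.
fn save_checkpoint_atomically(path: &Path, json: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("json.tmp"); // e.g. checkpoint.json.tmp
    let mut f = fs::File::create(&tmp)?;
    f.write_all(json.as_bytes())?;
    f.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, path)?; // atomic on POSIX when on the same filesystem
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("checkpoint.json");
    save_checkpoint_atomically(&path, "{\"stage\":\"builder\"}")?;
    assert_eq!(fs::read_to_string(&path)?, "{\"stage\":\"builder\"}");
    Ok(())
}
```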

Artifact validation is currently existence-based: the file must exist and have a non-zero size. This catches the common crash scenarios -- file not written, file truncated to zero bytes, file deleted by a partial cleanup.
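The existence-plus-size check fits in a few lines of std-only Rust (a sketch, not the actual validation function):

```rust
use std::fs;
use std::path::Path;

// An artifact passes validation if it exists, is a regular file,
// and has non-zero size. Zero bytes is treated as a crash leftover.
fn artifact_is_valid(path: &Path) -> bool {
    fs::metadata(path)
        .map(|m| m.is_file() && m.len() > 0)
        .unwrap_or(false) // missing file or stat error: invalid
}

fn main() {
    let dir = std::env::temp_dir();
    let empty = dir.join("empty-artifact.md");
    let full = dir.join("full-artifact.md");
    fs::write(&empty, "").unwrap();
    fs::write(&full, "# build claims\n").unwrap();
    assert!(!artifact_is_valid(&empty)); // truncated to zero bytes
    assert!(artifact_is_valid(&full));
    assert!(!artifact_is_valid(&dir.join("never-written.md")));
}
```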

Edge cases

Existence checks do not catch everything. A file that was partially written -- valid markdown header but truncated body -- will pass the existence check. The downstream stage will receive an incomplete artifact.

Content validation (checking for expected sections, valid JSON structure, minimum length) is possible but not currently implemented. The existence check catches the majority of crash-related failures. Partial writes are rarer because most artifact writes happen as a single write_all call, and modern filesystems are good at making those atomic.

A more aggressive approach would be to checksum artifacts and store the hash alongside the checkpoint. On resume, recompute the hash and compare. This would catch bit-rot and partial writes at the cost of added complexity. For now, the simpler approach is sufficient.
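The checksum idea can be sketched in a few lines. FNV-1a is used here only because it fits in std-only code; a production version would likely reach for SHA-256 via a crypto crate:

```rust
// FNV-1a over the artifact bytes. Fast and dependency-free, but not
// cryptographic -- fine for catching truncation, not for tamper detection.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// On resume: recompute the hash and compare against the value that was
// stored alongside the checkpoint when the artifact was written.
fn artifact_matches(bytes: &[u8], stored: u64) -> bool {
    fnv1a(bytes) == stored
}

fn main() {
    let full = b"# build claims\n...";
    let stored = fnv1a(full);
    assert!(artifact_matches(full, stored));
    // A truncated write now fails validation instead of passing a size check.
    assert!(!artifact_matches(&full[..5], stored));
}
```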