contextfoundry.dev · May 2026

An Evolving Harness

Three months ago we added a QUERY stage. Last month we added AI summaries everywhere. Today we shipped v3.2.0.

A retrospective on how the shape of an autonomous-build harness changes as the agents underneath it do -- and why that change is the point.

Context Foundry is a self-hosted Rust binary that runs AI coding agents through a multi-stage pipeline against your codebase. It plans, builds, audits, ships, and learns. The pipeline today has nine stages. The pipeline a year ago had four. Some of the stages that exist now did not exist three months ago. None of that is accidental. It is the central premise.

The most useful thing we have learned shipping Context Foundry is this: as coding agents and their tooling evolve, so should the harness that runs them. A pipeline that was right last quarter is wrong now. A skill format that was sensible last release is the wrong abstraction this release. The harness's job is to be the thing that changes when the agents change underneath it -- not the thing that calcifies around how the agents used to work.

This article is the May snapshot. It walks through the shape of the system as of v3.2.0, the recent changes, and the reasoning behind each one. Some of it will be wrong by November. That is fine. We will write the November snapshot in November.

The problem the harness exists to solve

AI coding agents are remarkable at one-shot tasks. Give a frontier model a clear prompt and a single change to make, and the output is often production-quality on the first attempt. The problem is not the one-shot. The problem is the loop.

Run an agent autonomously across ten tasks in sequence and the failure modes compound:

The harness is the answer to those four failure modes. It is the scaffolding around the agent that makes a sequence of one-shots add up to something coherent. Take the scaffolding away and you are back to one agent in a loop, drifting.

The pipeline today

Every task in TASKS.md flows through nine stages. Each stage is a separate agent invocation with a fresh context window. Each stage writes a single named artifact that the next stage reads.

Q ──> R ──> P ──> P+ ──> B ──> A    SHIP     DISCOVER        SKILLS
                                     │          │              │
                                     ▼          ▼              ▼
                                 git push  scan TASKS.md    extract
                                                            SKILL.md
| Stage | Question it answers | Output |
|---|---|---|
| Q  QUERY | What clarifying questions about this task? | questions.md |
| R  RESEARCH | What's already in the codebase? | research-report.md |
| P  PLAN | What exact edits, what verification? | current-plan.md |
| P+ PLAN-REVIEW | Does this plan stand up to fresh review? | plan iterates |
| B  BUILD | Make the edits, run tests, fix breaks. | build-claims.md |
| A  AUDIT | Does the code match the claims? | review-report.md |
| SHIP | Commit and push. | feat() or WIP() |
| DISCOVER | What's left to do in TASKS.md? | appends tasks |
| SKILLS | What did we learn from this run? | SKILL.md files |

The fresh-context boundary is the key invariant. The planner never sees the builder's diff. The auditor never sees the planner's reasoning. Each agent reads only the artifacts that came before it. The artifacts are the only memory.
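
The boundary can be sketched in a few lines. What follows is a hypothetical Python illustration, not the harness's Rust implementation: `run_pipeline`, `echo_agent`, and the per-stage read lists are assumptions made for the example, and P+ and the post-audit stages are left out. Only the stage and artifact names come from the table above.

```python
from typing import Callable, Dict, List, Tuple

# (stage name, artifacts it may read, artifact it writes).
# The read lists are illustrative assumptions, not the harness's real wiring.
STAGES: List[Tuple[str, List[str], str]] = [
    ("QUERY", [], "questions.md"),
    ("RESEARCH", ["questions.md"], "research-report.md"),
    ("PLAN", ["questions.md", "research-report.md"], "current-plan.md"),
    ("BUILD", ["current-plan.md"], "build-claims.md"),
    ("AUDIT", ["build-claims.md"], "review-report.md"),
]

Agent = Callable[[str, str, Dict[str, str]], str]


def run_pipeline(task: str, agent: Agent) -> Dict[str, str]:
    """Run each stage with a fresh context built only from prior artifacts."""
    artifacts: Dict[str, str] = {}
    for stage, reads, writes in STAGES:
        # Fresh context: a new dict holding only the declared inputs.
        context = {name: artifacts[name] for name in reads}
        artifacts[writes] = agent(stage, task, context)
    return artifacts


def echo_agent(stage: str, task: str, context: Dict[str, str]) -> str:
    """Stand-in for a model call: records which artifacts the stage saw."""
    return f"{stage} saw {sorted(context)}"


result = run_pipeline("add retry logic to the upload handler", echo_agent)
```

Because the context dict is rebuilt per stage, nothing a stage "thought" survives unless it was written into an artifact, which is the invariant the paragraph above describes.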

Why QUERY and RESEARCH come before PLAN

Plans are cheap to write and expensive to live with. The two stages before PLAN exist to make sure the plan is grounded.

QUERY is the cheapest stage in the pipeline and one of the highest-leverage. The agent reads the task description and writes back the clarifying questions it would ask if it could ask. Ambiguity surfaces here, before the plan locks it in. If the task says "add retry logic to the upload handler" and there are three plausible interpretations of what "retry" means, QUERY catches that. The plan does not have to.

RESEARCH is the grounding pass. A fresh-context agent reads the actual codebase -- not what the planner assumes the codebase looks like, but the files as they are right now. It writes research-report.md: tech stack, relevant files, architecture notes, risks, suggested approach. The planner inherits this artifact and writes current-plan.md against known reality.

Without QUERY, ambiguity is resolved by the planner's first guess. Without RESEARCH, the plan is fiction. The builder discovers it the hard way, and the audit catches the gap a stage too late.

QUERY was added in v3.0 (February). RESEARCH was added in v3.1 (April). Neither existed in v0.7. That is the harness changing shape as the agents change underneath it -- the moment frontier models got good enough at conversational clarification, the cheapest place to spend a thousand tokens was on questions instead of guesses.

Doubt in the loop: why we doubt the plan harder than the build

The hardest thing to undo in a codebase is a bad architectural decision. The easiest thing to fix is a bug. So Context Foundry doubts the plan harder than it doubts the build. Two reasons:

Architecture is forever. A badly shaped feature haunts the codebase for months. Refactoring out a wrong abstraction is more expensive than getting it right the first time -- often by an order of magnitude.

Bug fixes are routine. Code defects can be fixed any day of the week. A wrong loop iterator, an off-by-one, a missing null check -- these are 30-minute fixes once spotted. The audit catches them, the builder fixes them, the system moves on.

So PLAN runs through a plan-review stage (P+) before BUILD ever starts. P+ is a fresh-context agent that re-reads the plan, greps the cited files, and rejects the plan if its claims don't match what those files contain. A rejected plan goes back to the planner with the rejection notes appended.

P+ depth scales by task complexity:

Per-task overrides are [fast] (skip P+ entirely; trust BUILD+AUDIT) and [strict] (force the full three iterations even on Simple tasks). The complexity engine reads the task description heuristically; the user can pin it.
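
The override logic is simple enough to sketch. This is an illustrative Python version, not the complexity engine itself: `review_iterations` is a hypothetical helper, and the per-complexity depths in `EXAMPLE_DEPTHS` are placeholder values chosen for the example. The only facts taken from the text are that [fast] skips P+, that [strict] forces the full three iterations, and that tags live in the task description.

```python
import re
from typing import Dict

# Placeholder depths for illustration only; the real defaults may differ.
EXAMPLE_DEPTHS: Dict[str, int] = {"simple": 1, "medium": 2, "complex": 3}


def review_iterations(
    task_line: str,
    complexity: str,
    depth_by_complexity: Dict[str, int],
    max_iterations: int = 3,
) -> int:
    """Return how many P+ iterations a task gets, honoring per-task tags."""
    tags = set(re.findall(r"\[(\w+)\]", task_line.lower()))
    if "fast" in tags:
        return 0  # skip P+ entirely; trust BUILD + AUDIT
    if "strict" in tags:
        return max_iterations  # full depth even on simple tasks
    return depth_by_complexity[complexity]
```

For example, `review_iterations("[strict] Persist one config field", "simple", EXAMPLE_DEPTHS)` returns the full three iterations, while the same task tagged `[fast]` returns zero.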

AUDIT runs after BUILD with the same fresh-context discipline. It reads build-claims.md, greps the diff, runs the verification commands, and decides between a feat() commit (audit passed) and a WIP() commit (audit found gaps). The builder never sees the audit until the run is over.
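
The verification half of that decision reduces to a small rule, sketched here in Python under stated assumptions: `audit_commit_prefix` is a hypothetical helper, the commands stand in for whatever build-claims.md lists, and the real AUDIT stage also greps the diff, which this sketch ignores.

```python
import subprocess
import sys
from typing import List, Sequence


def audit_commit_prefix(verification_cmds: Sequence[List[str]]) -> str:
    """Return "feat" if every verification command exits 0, else "WIP"."""
    for cmd in verification_cmds:
        completed = subprocess.run(cmd, capture_output=True)
        if completed.returncode != 0:
            return "WIP"  # audit found a gap: ship as work in progress
    return "feat"


# Stand-in commands: one that succeeds and one that fails.
passing = [[sys.executable, "-c", "raise SystemExit(0)"]]
failing = passing + [[sys.executable, "-c", "raise SystemExit(1)"]]
```

With these stand-ins, `audit_commit_prefix(passing)` yields `"feat"` and `audit_commit_prefix(failing)` yields `"WIP"`.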

One comparison captures the philosophy: catch a bad plan in thirty seconds, or catch a bad architecture six months from now. The math favors the thirty seconds.

Task composition is the upstream lever

The complexity engine is downstream of how the user writes the task. The same scope can land cheaply or thrash for an hour depending on how it is composed.

Two examples from real runs in the project's own history:

| Task | Description | Cost |
|---|---|---|
| T1.16 | "(1) wire ranker (2) BM25 upgrade (3) telemetry boost" -- three concerns in one task. | $20, 63 min, 4 PLAN attempts, P+ rejected the plan three times. |
| T1.17 | "Persist one config field across restart." -- single change. | $2, 8 min, first-pass through. No rejections. |

The two tasks had the same level of underlying complexity, yet the cost differed tenfold ($20 versus $2). The lever is composition. The rule of thumb is one mental model change per task. Signs that a task is over-bundled and should be split:

The harness can do a lot to absorb a badly composed task, but it cannot do everything. Composition is upstream of every other lever in the system.

Patterns became skills

Context Foundry shipped for two years with a "patterns" abstraction -- JSON blobs at ~/.foundry/patterns/, each pattern a record of {pattern_id, severity, keywords, issue, solution, frequency}. On every run the planner scanned them, did a keyword-overlap match, and injected the top ten into the prompt. It worked, mostly. It also accumulated 2,384 entries before we audited it and found that only seven had ever been cited in a build that passed audit. The rest were debris.

In v3.2.0, the patterns abstraction was retired. The new abstraction is skills, following Anthropic's Agent Skills specification. A skill is a directory with a SKILL.md file containing YAML frontmatter (name, description, metadata) and a free-form Markdown body:

~/.foundry/skills/plan-file-token-overflow-planner/SKILL.md

---
name: plan-file-token-overflow-planner
description: Use when a current-plan.md grows past ~30KB. The planner
  starts emitting noise instead of file:line specs. Split the task.
metadata:
  cf-stage: planner
  cf-keywords: [planning, token-budget, current-plan, overflow]
  cf-citations-pass: 4
  cf-citations-wip: 1
  cf-last-used: 2026-05-11
---

When planning, watch for current-plan.md exceeding 30KB...
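
Reading that file back is straightforward. The sketch below is a deliberately minimal Python splitter that handles only the YAML subset used in the example above (top-level scalars, wrapped continuation lines, one nested mapping, inline lists). `split_skill` and `_scalar` are hypothetical names; a real loader would presumably use a full YAML parser.

```python
from typing import Any, Dict, Tuple

EXAMPLE = """\
---
name: plan-file-token-overflow-planner
description: Use when a current-plan.md grows past ~30KB. The planner
  starts emitting noise instead of file:line specs. Split the task.
metadata:
  cf-stage: planner
  cf-keywords: [planning, token-budget, current-plan, overflow]
  cf-citations-pass: 4
  cf-citations-wip: 1
  cf-last-used: 2026-05-11
---

When planning, watch for current-plan.md exceeding 30KB...
"""


def _scalar(value: str) -> Any:
    """Convert an inline list or integer; leave everything else a string."""
    if value.startswith("[") and value.endswith("]"):
        return [item.strip() for item in value[1:-1].split(",") if item.strip()]
    return int(value) if value.isdigit() else value


def split_skill(text: str) -> Tuple[Dict[str, Any], str]:
    """Split SKILL.md text into (frontmatter dict, Markdown body)."""
    _, header, body = text.split("---\n", 2)
    meta: Dict[str, Any] = {}
    current = ""  # most recent top-level key
    for line in header.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            key, _, value = line.partition(":")
            current = key.strip()
            # An empty value opens a nested mapping (e.g. `metadata:`).
            meta[current] = value.strip() if value.strip() else {}
        elif isinstance(meta[current], dict):
            key, _, value = line.strip().partition(":")
            meta[current][key.strip()] = _scalar(value.strip())
        else:
            # Indented line under a scalar: a wrapped continuation.
            meta[current] += " " + line.strip()
    return meta, body.strip()


meta, body = split_skill(EXAMPLE)
```

Only `name` and `description` need to be in the catalog the agent scans; the body can stay on disk until the skill is judged relevant.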

The format change matters less than the discovery model. Anthropic Skills are progressively disclosed -- the agent reads a short catalog of (name, description) pairs and pulls the body in only when it judges the skill relevant. Context Foundry's planner stage sees roughly 4% of the catalog in any given prompt (by default the top 10 of 271 skills), ranked by a hybrid retriever.

The cross-tool dividend was unexpected and significant. Skills authored for Claude Code work in Context Foundry. Skills authored for Cursor projects work in Context Foundry. AGENTS.md files -- the Linux Foundation cross-vendor standard adopted by Codex, Cursor, Aider, Gemini, and Copilot -- get discovered and surfaced. We dedicate a section of the startup screen to external skills with per-source opt-in. Context Foundry reads them; it never modifies them. The user's existing investment in any of those formats works in this pipeline without conversion.

For the longer write-up on why one big skills.md file would have been a mistake despite being the obvious move, see Skills and Plugins.

Hybrid retrieval, not vibes

The ranker that decides which skills to inject is a three-signal combination computed per skill, per stage, per task:

An optional fourth signal is the cf-stage metadata field. A skill can hint that it is most relevant to the planner, the reviewer, or both. As of v3.2.0 (T1.31) this is a non-binding hint -- the ranker weights it, but does not filter on it. Skills are eligible for every pipeline stage. Set skills_stage_filter_strict: true in ~/.foundry/config.json to restore the legacy gate.

The detail that matters most is the embedding location: everything runs on the user's machine. No embedding API call leaves the laptop. No third party sees the task description. The catalog ships with 271 skills, the retriever ranks all 271 per stage per task, and the top N (default max_pattern_injection = 10, tunable) get injected into that stage's prompt. With one Ollama process and a typical skills directory, the whole retrieval loop adds well under a second per stage.
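
To make the "weights it, but does not filter on it" distinction concrete, here is a hedged Python sketch of a weighted top-N ranker. Everything in it is an assumption for illustration: the `Skill` dataclass, the signal names, the weights, and the `stage_bonus` value are placeholders, not the harness's actual formula. The per-signal scores are taken as already computed; the `strict` flag mimics what skills_stage_filter_strict is described as doing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Skill:
    skill_id: str
    signals: Dict[str, float]         # precomputed per-signal scores in [0, 1]
    stage_hint: Optional[str] = None  # cf-stage metadata, if present


def rank_skills(
    skills: List[Skill],
    stage: str,
    weights: Dict[str, float],
    stage_bonus: float = 0.1,
    strict: bool = False,
    top_n: int = 10,
) -> List[str]:
    """Return the ids of the top_n skills for one stage of one task."""
    scored = []
    for skill in skills:
        if strict and skill.stage_hint not in (None, stage):
            continue  # legacy gate: the hint acts as a hard filter
        score = sum(w * skill.signals.get(name, 0.0) for name, w in weights.items())
        if skill.stage_hint == stage:
            score += stage_bonus  # non-binding hint: boost, never exclude
        scored.append((score, skill.skill_id))
    # Highest score first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [skill_id for _, skill_id in scored[:top_n]]


weights = {"keyword": 0.4, "embedding": 0.4, "success": 0.2}  # placeholder weights
skills = [
    Skill("a-planner", {"keyword": 0.9, "embedding": 0.8, "success": 0.5}, "planner"),
    Skill("b-reviewer", {"keyword": 0.9, "embedding": 0.9, "success": 0.9}, "reviewer"),
    Skill("c-any", {"keyword": 0.1, "embedding": 0.2, "success": 0.5}),
]
```

In this toy catalog the reviewer-hinted skill still ranks first for the planner stage because its signals are stronger; with `strict=True` it is dropped before scoring.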

This is the same architecture that SkillFlow, RAG-MCP, and LangGraph BigTool describe for tool retrieval at scale. The differentiator here is that the embedding step is on-device by default. The user does not pay an embedding bill, and the data does not leave the workstation.

The citation loop -- how the catalog learns

The retriever picks ten skills to inject. The interesting question is which of the ten the agent actually used.

Skill citations close that loop. Each skill has a stable skill_id matching its directory name. The agent prompt instructs each stage that uses a skill to end its artifact with a citation footer:

**Skills referenced:** plan-file-token-overflow-planner, async-lock-while-not-if-planner

That footer is the agent's self-report. The system verifies it post-hoc. After AUDIT runs, a scanner greps every committed artifact -- current-plan.md, build-claims.md, review-report.md -- for **Skills referenced:** footers. Every hit writes a row to a SQLite sidecar at ~/.foundry/skills-telemetry.db:

 1.  INJECT       Retriever ranks 271 skills. Top N injected into PLAN's
                   prompt, weighted by cf-stage.

 2.  CITE         Planner self-reports usage. current-plan.md ends with:

                   **Skills referenced:** plan-file-token-overflow-planner

 3.  BUILD + AUDIT run normally. build-claims.md and review-report.md
                   may also append their own **Skills referenced:** footers.

 4.  SCAN         Post-AUDIT scanner greps every committed artifact for
                   skill_id footers. Verifies the agent's self-report
                   against the actual on-disk text.

 5.  RECORD       Hits write to ~/.foundry/skills-telemetry.db:
                     - commit feat()  →  citations_pass++
                     - commit WIP()   →  citations_wip++
                     - last_used = now()

 6.  RE-RANK      Next task's ranker reads the DB. Success-rate
                   weighting boosts skills that ship and demotes
                   skills that fail. Compounding by the run.
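
The SCAN and RECORD steps above can be sketched with Python's standard library. This is an illustration under assumptions, not the shipped scanner: the `skill_stats` table, its columns, and the helper names are hypothetical, and the sketch counts one citation per footer hit, which is one reading of "every hit writes a row".

```python
import re
import sqlite3
from typing import Dict, List

# Match the footer only when it starts a line; capture the id list after it.
FOOTER = re.compile(r"^\*\*Skills referenced:\*\*[ \t]*(.+)$", re.MULTILINE)


def cited_skill_ids(artifact_text: str) -> List[str]:
    """Extract skill ids from every citation footer in one artifact."""
    ids: List[str] = []
    for match in FOOTER.finditer(artifact_text):
        ids += [part.strip() for part in match.group(1).split(",") if part.strip()]
    return ids


def record_citations(db: sqlite3.Connection, artifacts: Dict[str, str], commit_kind: str) -> None:
    """Bump pass or WIP counters for each cited skill, by commit outcome."""
    column = "citations_pass" if commit_kind == "feat" else "citations_wip"
    for text in artifacts.values():
        for skill_id in cited_skill_ids(text):
            db.execute("INSERT OR IGNORE INTO skill_stats (skill_id) VALUES (?)", (skill_id,))
            db.execute(
                f"UPDATE skill_stats SET {column} = {column} + 1, "
                "last_used = datetime('now') WHERE skill_id = ?",
                (skill_id,),
            )
    db.commit()


db = sqlite3.connect(":memory:")  # the real sidecar lives on disk
db.execute(
    "CREATE TABLE skill_stats (skill_id TEXT PRIMARY KEY, "
    "citations_pass INTEGER NOT NULL DEFAULT 0, "
    "citations_wip INTEGER NOT NULL DEFAULT 0, last_used TEXT)"
)
artifacts = {
    "current-plan.md": "plan text\n\n**Skills referenced:** "
    "plan-file-token-overflow-planner, async-lock-while-not-if-planner\n",
    "build-claims.md": "claims\n\n**Skills referenced:** plan-file-token-overflow-planner\n",
    "review-report.md": "no footer here\n",
}
record_citations(db, artifacts, "feat")
```

Scanning the on-disk text, rather than trusting a count reported by the agent, is what makes the footer a verifiable record instead of a claim.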

The success-rate weighting is the part worth dwelling on. A skill that gets injected into the prompt but never cited gains nothing. A skill that gets cited in a build that passes audit gains rank for next time. A skill that gets cited in a build that ships as WIP() -- because the audit found a gap -- loses rank.

The catalog learns from real outcomes, not from being injected. Skills that ship rise. Skills that fail fall. The harness has its own ground truth.
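
One way to turn those counters into a ranking signal is a smoothed pass rate. The sketch below is an assumption, not the harness's formula: `success_weight` and its `prior` parameter are hypothetical, and the smoothing is a standard add-one style prior chosen so that a never-cited skill sits at a neutral 0.5.

```python
def success_weight(citations_pass: int, citations_wip: int, prior: float = 1.0) -> float:
    """Smoothed share of a skill's citations that ended in a feat() commit."""
    total = citations_pass + citations_wip
    return (citations_pass + prior) / (total + 2 * prior)


# The counters from the SKILL.md example: 4 passing citations, 1 WIP citation.
example_weight = success_weight(4, 1)  # (4 + 1) / (5 + 2) = 5/7
```

Under this sketch an injected-but-never-cited skill stays at 0.5, each passing citation pushes the weight up, and each WIP citation pulls it down, matching the three cases described above.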

The learn loop, in numbers

One overnight session in the first week of May:

5 feat() commits · 0 WIP() commits · 11 new skills learned · 14 citations recorded

$141 over 10h 25m · $28 per [Complex] task average · 0 re-runs

The five tasks ranged from a skills-telemetry honesty fix to a cross-pipeline AI-summary feature -- all rated [C] by the complexity engine. The skill cited in T1.29 (rust-struct-literal-field-explosion) was ranked top-3 for T1.30 and T1.31 by the success-rate boost. Subsequent runs got the lesson without re-deriving it. The catalog learned overnight.

This is the property we cared about when we started. A single passing build produces a tiny amount of evidence. A hundred passing builds, with citations, produce a useful ranking. The harness compounds where one-shot agent runs do not.

AI summaries everywhere

The other v3.2.0 change worth naming explicitly is the AI-summary feature in the TUI. Click anything on the running dashboard -- a pipeline tile, the task queue, the narrative pane, the skill citations panel, the stats meter, the agent output -- and a modal opens with a Claude Haiku summary of what that surface means right now.

The summary is contextual. The summary for the QUERY pipeline tile reads the questions artifact and tells you what the eight questions are about. The summary for the AUDIT pipeline tile reads the review report and tells you what the auditor flagged. The summary for the stats panel reads the run state and tells you whether the project is on track.

The user does not need to read the log. The harness explains itself.

This is the kind of feature that did not exist in the harness six months ago and was not obviously a good idea even three months ago. It became obviously a good idea once Claude Haiku 4.5 priced fast summarization at fractions of a cent and once the modal infrastructure existed to display the result without leaving the dashboard. The harness got a new explanatory surface because the underlying capability got cheap.

That is the recurring loop. When a frontier capability gets cheap, we ask what shape of harness that capability unlocks. We added QUERY when conversational clarification got cheap. We added AI summaries everywhere when fast summarization got cheap. We will add something else when the next capability lands.

What v3.2.0 ships

Ninety-seven commits since v3.1.0. The highlights:

Binaries for macOS (arm64 and x86_64), Linux (x86_64), and Windows (x86_64). Install via cargo install foundry, npm install -g context-foundry, Homebrew, winget, or download from the GitHub release.

What's next

The honest answer is: whatever the next model release makes worth trying.

Three months ago we added QUERY because models got good enough at conversational clarification to make it cheap. Six weeks ago we added the local Ollama embedding loop because nomic-embed-text became a credible drop-in for hosted embedding APIs and we wanted the data to stay on the laptop. Last month we added AI summaries everywhere because fast summarization stopped being a meaningful line item. None of those features were on a roadmap when the previous release shipped. They became obvious as soon as the underlying capability was real.

The next change will follow the same pattern. Some capability will get cheap or get good. We will ask what shape of harness that unlocks. We will ship the change. The harness will look different by November.

That is the point. A harness that does not change as the agents change is a harness that calcified around how the agents used to work. The whole value proposition of building this in the open as a Rust binary with a TUI -- instead of buying into a single vendor's all-in-one product -- is that the shape stays editable. We change the shape. You see the change. The next release ships with it.

If you want to follow along, the repo is at github.com/context-foundry/context-foundry. PRs, issues, and ideas are welcome. The whole system is the project's own dogfood -- every feature in this article was built by Context Foundry building Context Foundry.