An Evolving Harness -- Context Foundry

Context Foundry is a self-hosted Rust binary that runs AI coding agents through a multi-stage pipeline against your codebase. It plans, builds, audits, ships, and learns. The pipeline today has nine stages. The pipeline a year ago had four. Some of the stages that exist now did not exist three months ago. None of that is accidental. It is the central premise.

The most useful thing we have learned shipping Context Foundry is this: as coding agents and their tooling evolve, so should the harness that runs them. A pipeline that was right last quarter is wrong now. A skill format that was sensible last release is the wrong abstraction this release. The harness's job is to be the thing that changes when the agents change underneath it -- not the thing that calcifies around how the agents used to work.

This article is the May snapshot. It walks through the shape of the system as of v3.2.0, the recent changes, and the reasoning behind each one. Some of it will be wrong by November. That is fine. We will write the November snapshot in November.

The problem the harness exists to solve

AI coding agents are remarkable at one-shot tasks. Give a frontier model a clear prompt and a single change to make, and the output is often production-quality on the first attempt. The problem is not the one-shot. The problem is the loop.

Run an agent autonomously across ten tasks in sequence and the failure modes compound:

Task 3 builds on task 2's mistakes, which built on task 1's mistakes. There is no gate that catches a drifting plan before the next task adopts it.
No fresh-context review means the agent grades its own homework. Self-review converges on self-justification.
No learning means the same mistakes repeat every run. The system has no memory of what it did wrong last week.
No task composition discipline means a single bundled "do three things" task pays a thrashing tax that a well-shaped task would not.

The harness is the answer to those four failure modes. It is the scaffolding around the agent that makes a sequence of one-shots add up to something coherent. Take the scaffolding away and you are back to one agent in a loop, drifting.

The pipeline today

Every task in TASKS.md flows through nine stages. Each stage is a separate agent invocation with a fresh context window. Each stage writes a single named artifact that the next stage reads.

Q ──> R ──> P ──> P+ ──> B ──> A ──> SHIP ──> DISCOVER ──> SKILLS
                                      │          │           │
                                      ▼          ▼           ▼
                                   git push    scan       extract
                                             TASKS.md     SKILL.md

Stage	Question it answers	Output
Q QUERY	What clarifying questions about this task?	`questions.md`
R RESEARCH	What's already in the codebase?	`research-report.md`
P PLAN	What exact edits, what verification?	`current-plan.md`
P+ PLAN-REVIEW	Does this plan stand up to fresh review?	plan iterates
B BUILD	Make the edits, run tests, fix breaks.	`build-claims.md`
A AUDIT	Does the code match the claims?	`review-report.md`
SHIP	Commit and push.	`feat()` or `WIP()`
DISCOVER	What's left to do in TASKS.md?	appends tasks
SKILLS	What did we learn from this run?	`SKILL.md` files

The fresh-context boundary is the key invariant. The planner never sees the builder's diff. The auditor never sees the planner's reasoning. Each agent reads only the artifacts that came before it. The artifacts are the only memory.
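
To make the invariant concrete, here is a minimal Python sketch of a pipeline in which artifacts are the only thing passed between stages. It is an illustration, not Context Foundry's source (which is Rust): `run_pipeline`, `Agent`, and `fake_agent` are hypothetical names, and SHIP, DISCOVER, and SKILLS are left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

# (stage name, artifact it writes), following the stage table above.
STAGES: List[Tuple[str, str]] = [
    ("QUERY", "questions.md"),
    ("RESEARCH", "research-report.md"),
    ("PLAN", "current-plan.md"),
    ("BUILD", "build-claims.md"),
    ("AUDIT", "review-report.md"),
]

# Hypothetical agent signature: (task, earlier artifacts) -> artifact text.
Agent = Callable[[str, Dict[str, str]], str]


def run_pipeline(task: str, agent_for: Dict[str, Agent]) -> Dict[str, str]:
    """Run each stage with nothing but the task and the earlier artifacts."""
    artifacts: Dict[str, str] = {}
    for stage, artifact_name in STAGES:
        # Hand over a copy: a stage cannot mutate shared state, and the only
        # thing it passes forward is the text it returns.
        visible = dict(artifacts)
        artifacts[artifact_name] = agent_for[stage](task, visible)
    return artifacts


# Usage: a stand-in agent that reports which artifacts it was shown.
def fake_agent(stage: str) -> Agent:
    return lambda task, visible: f"{stage} saw {sorted(visible)}"


artifacts = run_pipeline("add retry logic", {s: fake_agent(s) for s, _ in STAGES})
print(artifacts["current-plan.md"])
```

Nothing resembling a conversation history crosses a stage boundary in this shape; whatever a stage wants a later stage to know has to be written into its artifact.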

Why QUERY and RESEARCH come before PLAN

Plans are cheap to write and expensive to live with. The two stages before PLAN exist to make sure the plan is grounded.

QUERY is the cheapest stage in the pipeline and one of the highest-leverage. The agent reads the task description and writes back the clarifying questions it would ask if it could ask. Ambiguity surfaces here, before the plan locks it in. If the task says "add retry logic to the upload handler" and there are three plausible interpretations of what "retry" means, QUERY catches that. The plan does not have to.

RESEARCH is the grounding pass. A fresh-context agent reads the actual codebase -- not what the planner assumes the codebase looks like, but the files as they are right now. It writes research-report.md: tech stack, relevant files, architecture notes, risks, suggested approach. The planner inherits this artifact and writes current-plan.md against known reality.

Without QUERY, ambiguity is resolved by the planner's first guess. Without RESEARCH, the plan is fiction. The builder discovers it the hard way, and the audit catches the gap a stage too late.

QUERY was added in v3.0 (February). RESEARCH was added in v3.1 (April). Neither existed in v0.7. That is the harness changing shape as the agents change underneath it -- the moment frontier models got good enough at conversational clarification, the cheapest place to spend a thousand tokens was on questions instead of guesses.

Doubt in the loop: why we doubt the plan harder than the build

The hardest thing to undo in a codebase is a bad architectural decision. The easiest thing to fix is a bug. So Context Foundry doubts the plan harder than it doubts the build. Two reasons:

Architecture is forever. A badly shaped feature haunts the codebase for months. Refactoring out a wrong abstraction is more expensive than getting it right the first time -- often by an order of magnitude.

Bug fixes are routine. Code defects can be fixed any day of the week. A wrong loop iterator, an off-by-one, a missing null check -- these are 30-minute fixes once spotted. The audit catches them, the builder fixes them, the system moves on.

So PLAN runs through a plan-review stage (P+) before BUILD ever starts. P+ is a fresh-context agent that re-reads the plan, greps the cited files, and rejects the plan if claims don't match. The plan goes back to the planner with the rejection notes appended.

P+ depth scales by task complexity:

Simple tasks: 1 P+ pass. Bug fixes, single-file changes, well-specified config tweaks.
Medium tasks: up to 2 passes. New features in known territory.
Complex tasks: up to 3 passes. Cross-cutting changes, new subsystems, anything that touches more than four modules.

Per-task overrides are [fast] (skip P+ entirely; trust BUILD+AUDIT) and [strict] (force the full three iterations even on Simple tasks). The complexity engine reads the task description heuristically; the user can pin it.
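
The tier budgets and the two overrides fit in a few lines. The Python sketch below is illustrative only: `plan_review_passes` is a hypothetical name, the tier is assumed to have been decided already, and letting [fast] win when both tags appear is an assumption the text above does not settle.

```python
import re

# Maximum P+ passes per complexity tier, as listed above.
MAX_PASSES = {"simple": 1, "medium": 2, "complex": 3}


def plan_review_passes(tier: str, description: str) -> int:
    """Return the maximum number of P+ passes for one task."""
    tags = set(re.findall(r"\[(fast|strict)\]", description))
    if "fast" in tags:
        return 0  # skip P+ entirely and rely on BUILD + AUDIT
    if "strict" in tags:
        return 3  # force the full three iterations, even on Simple tasks
    return MAX_PASSES[tier]


print(plan_review_passes("simple", "Fix off-by-one in pager"))
print(plan_review_passes("simple", "[strict] Fix off-by-one in pager"))
print(plan_review_passes("complex", "[fast] Rework upload retries"))
```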

AUDIT runs after BUILD with the same fresh-context discipline. It reads build-claims.md, greps the diff, runs the verification commands, and decides between a feat() commit (audit passed) and a WIP() commit (audit found gaps). The builder never sees the audit until the run is over.

One comparison captures the philosophy: catch a bad plan in thirty seconds, or catch a bad architecture six months from now. The math favors the thirty seconds.

Task composition is the upstream lever

The complexity engine is downstream of how the user writes the task. The same scope can land cheaply or thrash for an hour depending on how it is composed.

Two examples from real runs in the project's own history:

Task	Description	Cost
`T1.16`	"(1) wire ranker (2) BM25 upgrade (3) telemetry boost" -- three concerns in one task.	$20, 63 min, 4 PLAN attempts, P+ rejected the plan three times.
`T1.17`	"Persist one config field across restart." -- single change.	$2, 8 min, first-pass through. No rejections.

Same level of underlying complexity, a tenfold difference in cost. The lever is composition. The rule of thumb is one mental model change per task. Signs that a task is over-bundled and should be split:

Numbered sub-features in the description: "(1) ... (2) ... (3) ..."
Lead sentence contains "and also," "plus," "three layers," "additionally"
Multiple distinct verbs in the opening clause
File references span more than ~6 paths (real blast radius)
Description exceeds ~500 words

The harness can do a lot to absorb a badly composed task, but it cannot do everything. Composition is upstream of every other lever in the system.
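
Most of the signs listed above are mechanical enough to check before a task is queued. The Python sketch below is a hypothetical linter, not a Context Foundry feature: `bundling_signs` is an invented name, the thresholds are the approximate ones from the list, and the "multiple distinct verbs" sign is left out because it needs real language analysis.

```python
import re
from typing import List

BUNDLING_PHRASES = ("and also", "plus", "three layers", "additionally")


def bundling_signs(description: str, file_paths: List[str]) -> List[str]:
    """Return the over-bundling signs that a task description trips."""
    signs: List[str] = []

    # Numbered sub-features such as "(1) ... (2) ... (3) ..."
    if len(re.findall(r"\(\d+\)", description)) >= 2:
        signs.append("numbered sub-features")

    # Bundling phrases in the lead sentence only.
    lead = re.split(r"(?<=[.!?])\s", description.strip(), maxsplit=1)[0].lower()
    if any(re.search(rf"\b{re.escape(p)}\b", lead) for p in BUNDLING_PHRASES):
        signs.append("bundling phrase in lead sentence")

    if len(set(file_paths)) > 6:
        signs.append("more than ~6 file paths")

    if len(description.split()) > 500:
        signs.append("more than ~500 words")

    return signs


print(bundling_signs("(1) wire ranker (2) BM25 upgrade (3) telemetry boost", []))
print(bundling_signs("Persist one config field across restart.", []))
```

Run against the two descriptions from the table, the first trips the numbered-sub-features check and the second trips nothing.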

Patterns became skills

Context Foundry shipped for two years with a "patterns" abstraction -- JSON blobs at ~/.foundry/patterns/, each pattern a tuple of {pattern_id, severity, keywords, issue, solution, frequency}. The planner scanned them every run, ran a keyword-overlap match, and injected the top ten into the prompt. It worked, mostly. It also accumulated 2,384 entries before we audited it and found that only seven had ever been cited in a build that passed audit. The rest were debris.

In v3.2.0, the patterns abstraction was retired. The new abstraction is skills, following Anthropic's Agent Skills specification. A skill is a directory with a SKILL.md file containing YAML frontmatter (name, description, metadata) and a free-form Markdown body:

~/.foundry/skills/plan-file-token-overflow-planner/SKILL.md

---
name: plan-file-token-overflow-planner
description: Use when a current-plan.md grows past ~30KB. The planner
  starts emitting noise instead of file:line specs. Split the task.
metadata:
  cf-stage: planner
  cf-keywords: [planning, token-budget, current-plan, overflow]
  cf-citations-pass: 4
  cf-citations-wip: 1
  cf-last-used: 2026-05-11
---

When planning, watch for current-plan.md exceeding 30KB...

The format change matters less than the discovery model. Anthropic Skills are progressively disclosed -- the agent reads a short catalog of (name, description) pairs and pulls the body in only when it judges the skill relevant. Context Foundry's planner stage sees roughly 2% of the catalog in any given prompt, ranked by a hybrid retriever.
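
Progressive disclosure comes down to splitting a SKILL.md into a small catalog entry and a body that is loaded later. The Python sketch below shows that split for the example file above. It is an assumption-heavy illustration: `split_skill` is a hypothetical helper that understands only top-level `key: value` lines with indented continuations, and a real loader would use a proper YAML parser.

```python
from typing import Dict, Tuple

EXAMPLE = """---
name: plan-file-token-overflow-planner
description: Use when a current-plan.md grows past ~30KB. The planner
  starts emitting noise instead of file:line specs. Split the task.
metadata:
  cf-stage: planner
---

When planning, watch for current-plan.md exceeding 30KB...
"""


def split_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (top-level string fields, Markdown body)."""
    _, frontmatter, body = text.split("---", 2)
    fields: Dict[str, str] = {}
    key = None
    for line in frontmatter.strip("\n").splitlines():
        if line[:1] in (" ", "\t"):
            # Indented line: a continuation of the current scalar, or part of
            # a nested block (key is None), which this sketch skips.
            if key is not None:
                fields[key] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        value = value.strip()
        key = name.strip() if value else None  # e.g. `metadata:` opens a block
        if key:
            fields[key] = value
    return fields, body.strip()


fields, body = split_skill(EXAMPLE)
catalog_entry = (fields["name"], fields["description"])  # what the agent sees first
print(catalog_entry)
```

Only `catalog_entry` needs to sit in the prompt up front; `body` stays on disk until the agent judges the skill relevant.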

The cross-tool dividend was unexpected and significant. Skills authored for Claude Code work in Context Foundry. Skills authored for Cursor projects work in Context Foundry. AGENTS.md files -- the Linux Foundation cross-vendor standard adopted by Codex, Cursor, Aider, Gemini, and Copilot -- get discovered and surfaced. We dedicate a section of the startup screen to external skills with per-source opt-in. CF reads them; it never modifies them. The user's existing investment in any of those formats works in this pipeline without conversion.

For the longer write-up on why one big skills.md file would have been a mistake despite being the obvious move, see Skills and Plugins.

Hybrid retrieval, not vibes

The ranker that decides which skills to inject is a three-signal combination computed per skill, per stage, per task:

BM25 keyword match -- sparse: task description intersected with skill keywords and tech-stack tags. Cheap and exact. Catches "the task says migration and this skill has migration in its description" without needing to embed anything.
Cosine similarity -- dense: the task description and each skill's (name + description) are embedded with nomic-embed-text, a 137M-parameter model served by a local Ollama instance. 768-dimensional vectors, roughly 50ms per call on a laptop CPU, cached to disk so subsequent runs don't re-embed unchanged skill descriptions. The dense signal catches semantic overlap that the sparse signal misses -- "the task is about token budget" matching a skill described as "prompt size overflow".
Telemetry boost -- success-rate weighting: every skill carries cf-citations-pass and cf-citations-wip counters. Skills cited in builds that passed audit rank higher. New skills start neutral and earn rank only by shipping.

An optional fourth signal is the cf-stage metadata field. A skill can hint that it is most relevant to the planner, the reviewer, or both. As of v3.2.0 (T1.31) this is a non-binding hint -- the ranker weights it, but does not filter on it. Skills are eligible for every pipeline stage. Set skills_stage_filter_strict: true in ~/.foundry/config.json to restore the legacy gate.

The detail that matters most is the embedding location: everything runs on the user's machine. No embedding API call leaves the laptop. No third party sees the task description. The catalog ships with 271 skills, the retriever ranks all 271 per stage per task, and the top N (default max_pattern_injection = 10, tunable) get injected into that stage's prompt. With one Ollama process and a typical skills directory, the whole retrieval loop adds well under a second per stage.
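
One way to picture the three signals combining is the Python sketch below. Everything specific in it is an assumption made for illustration: the weights, the simple keyword-overlap score standing in for BM25, the smoothed success rate, and the names `Skill` and `rank`. Embeddings are passed in as plain lists rather than fetched from Ollama.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass
class Skill:
    name: str
    keywords: Set[str]
    embedding: Sequence[float]
    citations_pass: int = 0
    citations_wip: int = 0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(task_terms: Set[str], skill: Skill) -> float:
    # Stand-in for BM25: the fraction of skill keywords present in the task.
    if not skill.keywords:
        return 0.0
    return len(task_terms & skill.keywords) / len(skill.keywords)


def success_rate(skill: Skill) -> float:
    # Add-one smoothing, so a skill with no citations scores a neutral 0.5.
    total = skill.citations_pass + skill.citations_wip
    return (skill.citations_pass + 1) / (total + 2)


def rank(task_terms: Set[str], task_embedding: Sequence[float],
         skills: List[Skill], top_n: int = 10,
         weights=(0.4, 0.4, 0.2)) -> List[Skill]:
    """Score every skill on all three signals and keep the top N."""
    w_sparse, w_dense, w_telemetry = weights

    def score(skill: Skill) -> float:
        return (w_sparse * keyword_overlap(task_terms, skill)
                + w_dense * cosine(task_embedding, skill.embedding)
                + w_telemetry * success_rate(skill))

    return sorted(skills, key=score, reverse=True)[:top_n]


skills = [
    Skill("often-wip", {"token"}, [1.0, 0.0], citations_pass=0, citations_wip=3),
    Skill("proven", {"token"}, [1.0, 0.0], citations_pass=4, citations_wip=1),
    Skill("brand-new", {"token"}, [1.0, 0.0]),
]
print([s.name for s in rank({"token", "budget"}, [1.0, 0.0], skills)])
```

With the sparse and dense signals tied, the telemetry term alone decides the order: the proven skill first, the uncited one in the middle, the one that keeps landing in WIP() builds last.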

This is the same architecture that SkillFlow, RAG-MCP, and LangGraph BigTool describe for tool retrieval at scale. The differentiator here is that the embedding step is on-device by default. The user does not pay an embedding bill, and the data does not leave the workstation.

The citation loop -- how the catalog learns

The retriever picks ten skills to inject. The interesting question is which of the ten the agent actually used.

Skill citations close that loop. Each skill has a stable skill_id matching its directory name. The agent prompt instructs each stage that uses a skill to end its artifact with a citation footer:

**Skills referenced:** plan-file-token-overflow-planner, async-lock-while-not-if-planner

That footer is the agent's self-report. The system verifies it post-hoc. After AUDIT runs, a scanner greps every committed artifact -- current-plan.md, build-claims.md, review-report.md -- for **Skills referenced:** footers. Every hit writes a row to a SQLite sidecar at ~/.foundry/skills-telemetry.db:

 1.  INJECT       Retriever ranks 271 skills. Top N injected into PLAN's
                   prompt, with cf-stage as a ranking hint.

 2.  CITE         Planner self-reports usage. current-plan.md ends with:

                   **Skills referenced:** plan-file-token-overflow-planner

 3.  BUILD + AUDIT run normally. build-claims.md and review-report.md
                   may also append their own **Skills referenced:** footers.

 4.  SCAN         Post-AUDIT scanner greps every committed artifact for
                   skill_id footers. Verifies the agent's self-report
                   against the actual on-disk text.

 5.  RECORD       Hits write to ~/.foundry/skills-telemetry.db:
                     - commit feat()  →  citations_pass++
                     - commit WIP()   →  citations_wip++
                     - last_used = now()

 6.  RE-RANK      Next task's ranker reads the DB. Success-rate
                   weighting boosts skills that ship and demotes
                   skills that fail. Compounding by the run.
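
Steps 4 and 5 can be sketched in Python with only the standard library. The table layout, the names `cited_skills` and `record`, and the choice to count a skill at most once per commit are all assumptions for illustration; only the footer text, the feat()/WIP() distinction, and the three recorded fields come from the description above.

```python
import re
import sqlite3
from typing import Iterable, List

FOOTER = re.compile(r"^\*\*Skills referenced:\*\*[ \t]*(.+)$", re.MULTILINE)

# Hypothetical schema; the real sidecar's layout is not described here.
SCHEMA = """CREATE TABLE IF NOT EXISTS skill_stats (
    skill_id       TEXT PRIMARY KEY,
    citations_pass INTEGER NOT NULL DEFAULT 0,
    citations_wip  INTEGER NOT NULL DEFAULT 0,
    last_used      TEXT
)"""


def cited_skills(artifact_text: str) -> List[str]:
    """Extract skill ids from every citation footer in an artifact."""
    ids: List[str] = []
    for match in FOOTER.finditer(artifact_text):
        ids += [s.strip() for s in match.group(1).split(",") if s.strip()]
    return ids


def record(db: sqlite3.Connection, skill_ids: Iterable[str],
           commit_subject: str, today: str) -> None:
    """Credit each cited skill according to how the commit shipped."""
    if commit_subject.startswith("feat("):
        column = "citations_pass"
    elif commit_subject.startswith("WIP("):
        column = "citations_wip"
    else:
        raise ValueError(f"unexpected commit subject: {commit_subject!r}")
    for skill_id in set(skill_ids):  # count each skill once per commit
        db.execute("INSERT OR IGNORE INTO skill_stats (skill_id) VALUES (?)",
                   (skill_id,))
        db.execute(f"UPDATE skill_stats SET {column} = {column} + 1, "
                   "last_used = ? WHERE skill_id = ?", (today, skill_id))
    db.commit()


plan = ("...plan body...\n\n**Skills referenced:** "
        "plan-file-token-overflow-planner, async-lock-while-not-if-planner\n")
db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
record(db, cited_skills(plan), "feat(ranker): wire hybrid retriever", "2026-05-11")
print(db.execute("SELECT * FROM skill_stats ORDER BY skill_id").fetchall())
```

The column name is interpolated into the SQL only after it has been chosen from two fixed strings; the skill ids and date go through bound parameters.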

The success-rate weighting is the part worth dwelling on. A skill that gets injected into the prompt but never cited gains nothing. A skill that gets cited in a build that passes audit gains rank for next time. A skill that gets cited in a build that ships as WIP() -- because the audit found a gap -- loses rank.

The catalog learns from real outcomes, not from being injected. Skills that ship rise. Skills that fail fall. The harness has its own ground truth.

The learn loop, in numbers

One overnight session in the first week of May:

[Summary tiles: feat() commits, WIP() commits, new skills learned, citations recorded]

$141 over 10h 25m · $28 per [Complex] task average · 0 re-runs

The five tasks ranged from a skills-telemetry honesty fix to a cross-pipeline AI-summary feature -- all rated [C] by the complexity engine. The skill cited in T1.29 (rust-struct-literal-field-explosion) was ranked top-3 for T1.30 and T1.31 by the success-rate boost. Subsequent runs got the lesson without re-deriving it. The catalog learned overnight.

This is the property we cared about when we started. A single passing build produces a tiny amount of evidence. A hundred passing builds, with citations, produces a useful ranking. The harness compounds where one-shot agent runs do not.

AI summaries everywhere

The other v3.2.0 change worth naming explicitly is the AI-summary feature in the TUI. Click anything on the running dashboard -- a pipeline tile, the task queue, the narrative pane, the skill citations panel, the stats meter, the agent output -- and a modal opens with a Claude Haiku summary of what that surface means right now.

The summary is contextual. The summary for the QUERY pipeline tile reads the questions artifact and tells you what the eight questions are about. The summary for the AUDIT pipeline tile reads the review report and tells you what the auditor flagged. The summary for the stats panel reads the run state and tells you whether the project is on track.

The user does not need to read the log. The harness explains itself.

This is the kind of feature that did not exist in the harness six months ago and was not obviously a good idea even three months ago. It became obviously a good idea once Claude Haiku 4.5 priced fast summarization at fractions of a cent and once the modal infrastructure existed to display the result without leaving the dashboard. The harness got a new explanatory surface because the underlying capability got cheap.

That is the recurring loop. When a frontier capability gets cheap, we ask what shape of harness that capability unlocks. We added QUERY when conversational clarification got cheap. We added AI summaries everywhere when fast summarization got cheap. We will add something else when the next capability lands.

What v3.2.0 ships

Ninety-seven commits since v3.1.0. The highlights:

Skills migration complete (T1.12-T1.16) -- patterns retired, Anthropic Skills format shipped, hybrid retriever in place.
AI summary everywhere (T1.24/T1.25/T1.33) -- every dashboard surface clickable; right-click context menu for files in Explore view.
Cross-provider skill discovery (T1.27/T1.28) -- AGENTS.md, .cursorrules, .claude/skills, .github/copilot-instructions.md, all surfaced read-only with per-source opt-in.
Skills at every pipeline stage (T1.31) -- the cf-stage filter is now a ranker hint, not a hard gate. Skills inject into QUERY, RESEARCH, PLAN, P+, BUILD, AUDIT, SHIP, DISCOVER.
Honest skill telemetry (T1.30) -- the post-audit citation scanner closes the learn loop. Citations counted against real commit outcomes.
Complexity-aware P+ (T1.20/T1.23) -- depth scales by tier; [fast] and [strict] overrides for per-task control.
Live-reload TASKS.md (T1.19) -- edit the queue while CF is running, no restart needed.
Plugins rename (T1.22) -- the user-facing label "extensions" became "plugins" everywhere. On-disk directory name preserved for path stability.
Unified TUI conventions (T1.34) -- shared modal renderer, padding, proportional scrollbars, hover-locked focus, single-row pipeline tiles with full-name tooltips.
Eval badge stale-vs-live distinction (T1.29) -- the EVAL meter dims and prefixes with (last) when showing a previous task's grade during in-flight work.
UTF-8 panic fix -- raw byte-slicing of artifact files panicked when byte 1024 fell inside a multi-byte UTF-8 sequence. Now uses the shared truncate_str helper.
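
The UTF-8 item is a general hazard worth a short illustration: cutting text at a fixed byte offset can land inside a multi-byte character. The Python sketch below shows the failure and a boundary-safe cut. It is not the project's `truncate_str` helper, whose implementation is not shown in this article; `truncate_utf8` is a hypothetical name, and the input is assumed to be valid UTF-8.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Return the longest prefix of `data` that is at most `max_bytes` long
    and does not end inside a multi-byte UTF-8 sequence."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the bit pattern 10xxxxxx. If the first byte we
    # are about to drop is one, the cut is mid-character, so back up until the
    # dropped byte is the start of a character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# 1023 ASCII bytes followed by a two-byte character: byte 1024 is mid-sequence.
data = ("a" * 1023 + "é").encode("utf-8")

try:
    data[:1024].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

safe = truncate_utf8(data, 1024).decode("utf-8")
print(naive_ok, len(safe))
```

In Rust, the analogous check is `str::is_char_boundary` on the candidate index before slicing.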

Binaries are available for macOS (arm64 and x86_64), Linux (x86_64), and Windows (x86_64). Install via cargo install foundry, npm install -g context-foundry, Homebrew, winget, or download from the GitHub release.

What's next

The honest answer is: whatever the next model release makes worth trying.

Three months ago we added QUERY because models got good enough at conversational clarification to make it cheap. Six weeks ago we added the local Ollama embedding loop because nomic-embed-text became a credible drop-in for hosted embedding APIs and we wanted the data to stay on the laptop. Last month we added AI summaries everywhere because fast summarization stopped being a meaningful line item. None of those features were on a roadmap when the previous release shipped. They became obvious as soon as the underlying capability was real.

The next change will follow the same pattern. Some capability will get cheap or get good. We will ask what shape of harness that unlocks. We will ship the change. The harness will look different by November.

That is the point. A harness that does not change as the agents change is a harness that calcified around how the agents used to work. The whole value proposition of building this in the open as a Rust binary with a TUI -- instead of buying into a single vendor's all-in-one product -- is that the shape stays editable. We change the shape. You see the change. The next release ships with it.

If you want to follow along, the repo is at github.com/context-foundry/context-foundry. PRs, issues, and ideas are welcome. The whole system is the project's own dogfood -- every feature in this article was built by Context Foundry building Context Foundry.