contextfoundry.dev

Skills and Plugins

From 2384 theatrical patterns to 248 retrievable skills.

How Context Foundry adopted Anthropic's Agent Skills spec, and why one big file is the wrong answer.

What is a skill?

A skill, in Anthropic's sense, is a progressively-disclosed knowledge unit. It is a folder containing a single SKILL.md file (plus any supporting assets) that an agent can choose to read when it judges the contents relevant to the task at hand. The agent does not see the body of every skill on every turn. It sees a short index of skill names and one-line descriptions, and pulls the full body in only when it decides to.

The format is minimal. A skill is a Markdown file with YAML frontmatter:

---
name: doubt-loop
description: Run an adversarial audit pass on builder output. Use when a builder
  agent claims a task is done and the diff has not been independently verified.
---

# Doubt Loop

Read the build claims file. For every claim in DELTA_MANIFEST, verify against
actual code. Re-run every CHECK in VERIFICATION_MATRIX. Report PASS or FAIL.
...

That is the required core of the format: name and description are mandatory, and the body is free-form. The Agent Skills specification calls the loading model "progressive disclosure": the agent's context window holds only the index until a description matches the task, then the body is loaded on demand. Anthropic's own write-up frames skills as the unit you use to equip agents for the real world -- discrete, named, addressable competencies that an agent can pick up the way a contractor picks up a tool from a truck.

The discovery model is the key insight. The agent is not handed a curriculum. It is handed a catalog. It reads the catalog, decides what is relevant, and loads the rest itself.
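The catalog-then-body flow can be illustrated with a small sketch. It is not Anthropic's loader or Context Foundry's code: Skill, parse_skill, and index_entry are hypothetical names, and the parser is an assumption that handles only flat key: value frontmatter with indented continuation lines, not general YAML.

```python
from dataclasses import dataclass

SKILL_TEXT = """\
---
name: doubt-loop
description: Run an adversarial audit pass on builder output. Use when a builder
  agent claims a task is done and the diff has not been independently verified.
---

# Doubt Loop

Read the build claims file.
"""


@dataclass
class Skill:
    name: str
    description: str
    body: str


def parse_skill(text: str) -> Skill:
    """Split a SKILL.md string into frontmatter fields and body.

    Assumption: frontmatter is flat `key: value` pairs, optionally followed
    by indented continuation lines. This is not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter opening '---'")
    try:
        end = lines.index("---", 1)
    except ValueError:
        raise ValueError("missing frontmatter closing '---'") from None

    fields = {}
    key = None
    for line in lines[1:end]:
        if line[:1].isspace() and key is not None:
            # Indented line: continuation of the previous value.
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()

    for required in ("name", "description"):
        if not fields.get(required):
            raise ValueError(f"missing required field: {required}")

    body = "\n".join(lines[end + 1:]).strip()
    return Skill(fields["name"], fields["description"], body)


def index_entry(skill: Skill) -> str:
    """The one-line catalog entry shown before the body is loaded."""
    return f"{skill.name}: {skill.description}"


skill = parse_skill(SKILL_TEXT)
print(index_entry(skill))  # the catalog line; skill.body stays out of it
```

Only the output of index_entry would sit in the prompt by default; skill.body is what gets pulled in once the description is judged relevant.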

Learnings from creating skills for Context Foundry

Context Foundry did not start with skills. It started with patterns: auto-extracted JSON entries written to ~/.foundry/patterns/ at the end of every build. Each pattern was a tuple of {pattern_id, title, severity, keywords, issue, solution, frequency}. The theory was that the system would learn from itself -- every fixed bug becomes a pattern, every pattern becomes a hint for the next build.

The theory was wrong. Not subtly wrong. Theatrically wrong.

A fresh-context audit of the pattern system found:

The auditor's verdict was unsparing:

"You would not rebuild patterns/extensions today. The only reason to keep the current architecture is sunk cost."

That report became tasks T1.12 through T1.16. The migration ran in five stages, each its own RPID cycle:

T1.12 -- Prune ruthlessly. Delete every pattern not cited in a passing build. 2384 entries down to 124 survivors. The frequency=1 ones were never going to fire again; they were debris.

T1.13 -- Migrate the survivors to the Skills format. Each surviving pattern became a SKILL.md file at plugins/<name>/skills/<topic>/SKILL.md. We split each one into two skills -- a planner-<topic> variant and a reviewer-<topic> variant -- because the planner needs to know the trap exists and the reviewer needs to check whether the builder fell into it. 124 patterns became 248 skills.

T1.14 -- Convert extensions to plugins. Each plugins/<name>/ directory got a .claude-plugin/plugin.json manifest matching Anthropic's plugin reference. The transport now matches the ecosystem standard. Plugins are shareable. Skills are addressable. Versioning is a directory rename.

T1.15 -- Replace the matcher with skill discovery plus a Context Foundry per-stage filter. The old keyword matcher is gone. The new retriever is hybrid: BM25 over name + description for lexical matches, cosine similarity over nomic-embed-text embeddings for semantic matches, and a telemetry-driven popularity boost for skills that were cited in passing builds. The top-K results flow into the prompt for the stage that requested them.

T1.31 update: cf-stage is now a non-binding hint, not a filter. Every skill is eligible for every pipeline stage (QUERY, RESEARCH, PLAN, P+, BUILD, AUDIT, DISCOVER), and the ranker decides what surfaces. Set skills_stage_filter_strict: true in ~/.foundry/config.json to restore the legacy per-stage gate. Citations from the query, research, and discovery stages are all recorded under the existing cited_by_scout telemetry column, which remains a union of the three Scout-adjacent stages until a follow-up schema migration splits them.

T1.16 -- Telemetry-driven popularity. Every time a skill is loaded and the build passes audit, the skill's score increments. Skills that get loaded but never cited decay. The retrieval ranking is now a feedback loop, not a static index.
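The feedback loop in T1.16 can be sketched as a single update applied after each audited build. This is an illustration under stated assumptions, not the eval harness itself: update_popularity, the step and decay values, and the choice to leave scores untouched for skills cited in a failing build are all hypothetical.

```python
def update_popularity(
    scores: dict,
    loaded: set,
    cited: set,
    build_passed: bool,
    step: float = 1.0,   # assumed increment per useful citation
    decay: float = 0.9,  # assumed multiplicative decay per unused load
) -> dict:
    """Return new scores after one build; the input mapping is not mutated."""
    updated = dict(scores)
    for name in loaded:
        current = updated.get(name, 0.0)
        if name in cited:
            # Cited in a passing build: reward. Cited in a failing build:
            # left unchanged (an assumption; the source does not say).
            updated[name] = current + step if build_passed else current
        else:
            # Loaded but never cited: decay toward zero.
            updated[name] = current * decay
    return updated


scores = {"doubt-loop": 2.0, "spec-drift-check": 2.0}
scores = update_popularity(
    scores,
    loaded={"doubt-loop", "spec-drift-check", "cframe-not-position"},
    cited={"doubt-loop"},
    build_passed=True,
)
print(scores)
```

Skills that were not loaded at all are left alone, so a skill only loses rank by being retrieved and then ignored.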

The shape of the system flipped. Before: a flat JSON blob the planner scanned every run. After: a catalog the planner queries, a retriever that ranks, and a per-stage filter that picks the few skills that actually matter to the current task. The agent is no longer handed everything -- it is handed what the retriever judges relevant. This is the same architecture SkillFlow and RAG-MCP describe for tool retrieval at scale, and the same pattern LangGraph BigTool implements in production for agents with hundreds of tools.

FAQ: Why not put everything in one big file organized by software stack?

This is the question every observer asks. It is the obvious move. We have more than two hundred skills -- why not concatenate them into skills.md, organize by section (Rust / Python / React / Roblox / Workday), and hand the agent one document?

The answer has five parts, and they compound.

1. Granularity and the discovery model

Anthropic's progressive-disclosure spec works because the agent matches the description field of each skill against the task in front of it. One file is one description. The description has to summarize everything in the file, which means it cannot fire on anything specific. The agent ends up either loading the monolith on every run (defeating the purpose of progressive disclosure) or ignoring it entirely (because no single description matches the current task).

Many small skills with focused descriptions produce many discrete activation triggers. The retriever picks three skills out of a hundred. The monolith forces a binary choice: all or nothing.

2. Token economy

A hybrid retriever (BM25 plus cosine) ranks N skills and pulls the top K into the prompt. With 248 skills and K equal to 5, a typical stage sees roughly two percent of the catalog in its prompt. The monolith inverts this: every prompt either eats the whole file or none of it. The middle ground -- parsing sections out of the monolith based on the task -- means re-inventing the SKILL.md format inside the file, badly.
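A minimal sketch of that ranking step, assuming a textbook Okapi BM25 over name + description, cosine similarity over precomputed vectors, and a popularity term. The weights, the two-dimensional toy embeddings (stand-ins for real model output), and the names SkillEntry and retrieve are illustrative assumptions, not Context Foundry's retriever.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SkillEntry:
    name: str
    description: str
    embedding: list          # toy vector; a real one would come from an embedding model
    popularity: float = 0.0  # assumed to be normalized to [0, 1]


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u: list, v: list) -> float:
    dot = sum(a * c for a, c in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(c * c for c in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query, query_embedding, catalog, k=5, w_lex=0.5, w_sem=0.4, w_pop=0.1):
    """Rank the catalog and return the names of the top-k skills."""
    if not catalog:
        return []
    docs = [tokenize(f"{s.name} {s.description}") for s in catalog]
    lexical = bm25_scores(query, docs)
    top = max(lexical) or 1.0  # scale BM25 into [0, 1] before mixing
    ranked = []
    for s, lex in zip(catalog, lexical):
        score = (w_lex * lex / top
                 + w_sem * cosine(query_embedding, s.embedding)
                 + w_pop * s.popularity)
        ranked.append((score, s.name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:k]]


catalog = [
    SkillEntry("doubt-loop", "Run an adversarial audit pass on builder output.", [1.0, 0.0], 0.5),
    SkillEntry("cframe-not-position", "Use CFrame instead of Position when moving Roblox parts.", [0.0, 1.0], 0.2),
    SkillEntry("spec-drift-check", "Compare the build against the written spec.", [0.6, 0.8], 0.0),
]
top_two = retrieve("audit the builder output", [0.9, 0.1], catalog, k=2)
print(top_two)
```

Only the bodies of the returned names would be loaded into the prompt; everything else stays in the catalog.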

Skills give you per-token-cost control. Monoliths give you a take-it-or-leave-it bill.

3. Curation lifecycle

Skills can be deprecated, versioned, archived, or rolled back independently. A directory rename moves a skill into archive/. A new version is a new directory. Git diff at the skill level is comprehensible.
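As a rough illustration of archiving by rename, the sketch below retires one skill directory inside a temporary tree. The archive/ location under the plugin root and the archive_skill helper are assumptions made for the example; the source does not specify where archive/ lives.

```python
import tempfile
from pathlib import Path


def archive_skill(plugin_root: Path, topic: str) -> Path:
    """Retire a skill by renaming its directory into archive/ (location assumed)."""
    source = plugin_root / "skills" / topic
    if not (source / "SKILL.md").is_file():
        raise FileNotFoundError(f"no SKILL.md under {source}")
    archive_dir = plugin_root / "archive"
    archive_dir.mkdir(exist_ok=True)
    target = archive_dir / topic
    source.rename(target)  # one rename; the skill's files move together
    return target


# Demonstrate on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "plugins" / "roblox"
    skill_dir = root / "skills" / "cframe-not-position"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text(
        "---\nname: cframe-not-position\ndescription: placeholder\n---\n"
    )
    archived = archive_skill(root, "cframe-not-position")
    still_active = skill_dir.exists()
    archived_ok = (archived / "SKILL.md").is_file()

print(still_active, archived_ok)
```

Because the unit of change is a directory, the same operation shows up in version control as a single move rather than an edit buried in a larger file.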

A monolith requires surgical in-place edits. Two skills updated in the same commit produce one diff that looks like a refactor. Bisecting "which change broke the planner stage" turns into a line-level archaeology problem instead of a directory-level one.

4. Marketplace distribution

Anthropic's plugin ecosystem ships skills as installable modules. Inside Claude Code, the user registers a marketplace with /plugin marketplace add owner/repo, installs a plugin from it with /plugin install <plugin>@<marketplace>, and the plugin's skills become available. A monolith is not a unit anyone can install. You cannot subscribe to one section of someone else's skills.md. You can absolutely subscribe to roblox-helpers/skills/cframe-not-position.

This is also why Context Foundry's plugins now match Anthropic's plugin spec exactly. The transport is the marketplace's transport. Anyone with a Claude Code install can consume them without any Context Foundry awareness on their end.

5. Composition

Tasks combine skills dynamically. A single build might activate planner-cframe-not-position, reviewer-cframe-not-position, doubt-loop, and spec-drift-check -- four skills from three different domains, picked by the retriever based on what the task actually touches.

Hardcoding combinations into one file removes that flexibility. The retriever is doing real work: it observes the task and assembles a tailored prompt slice. A monolith has no observation, no assembly -- only a fixed table of contents.

The amplifier risk

There is a sixth reason, structural rather than functional. Patterns and skills are an amplification layer: every future build sees them. When they are atomic, a bad one can be removed without disturbing the good ones. When they are monolithic, a bad section poisons the whole file's relevance score and the agent either trusts the entire monolith or none of it. The blast radius of a mistake scales with the size of the unit that carries it. Skills make mistakes recoverable. Monoliths make mistakes contagious.

This is what the pattern audit was really telling us. The 2384 patterns were not wrong -- they were ungovernable. We could not delete the bad ones without auditing all of them. The migration to skills was, more than anything else, a move to make governance possible at all.

Plugins: Context Foundry vs Anthropic's plugin reference

Context Foundry's domain modules used to be called extensions. After T1.14 they are plugins -- specifically, plugins that conform to Anthropic's plugin reference. The folder layout follows the reference verbatim:

plugins/
  roblox/
    .claude-plugin/
      plugin.json          # Anthropic-spec manifest
    skills/
      cframe-not-position/
        SKILL.md           # planner variant
      cframe-not-position-review/
        SKILL.md           # reviewer variant
    commands/              # optional slash commands
    CLAUDE.md              # plugin-level guidance

The plugin.json manifest carries the metadata Anthropic's loader expects: name, version, description, and an optional commands list for slash-command surfaces. Anthropic's loader can ingest these directly. Context Foundry's orchestrator can also ingest them -- but only as a consumer of the same transport, not as the owner of a competing format.
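For illustration, a manifest carrying the fields named above might be assembled and sanity-checked as follows. The field values are invented, manifest_problems is a hypothetical helper, and treating all three fields as mandatory is this sketch's assumption rather than a statement about Anthropic's schema; the optional commands entry is left out.

```python
import json

# Fields the article says the loader expects; requiring all three is an
# assumption of this sketch.
EXPECTED_FIELDS = ("name", "version", "description")


def manifest_problems(manifest: dict) -> list:
    """Return human-readable problems; an empty list means the check passed."""
    return [
        f"missing or empty field: {key}"
        for key in EXPECTED_FIELDS
        if not manifest.get(key)
    ]


manifest = {
    "name": "roblox",
    "version": "0.1.0",                                   # illustrative value
    "description": "Roblox planning and review skills.",  # illustrative value
}

# Round-trip through JSON, as the file at .claude-plugin/plugin.json would be.
manifest_text = json.dumps(manifest, indent=2)
problems = manifest_problems(json.loads(manifest_text))

print(manifest_text)
print(problems)
```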

The division of labor:

| Concern | Owned by |
| --- | --- |
| Folder layout, plugin.json, SKILL.md format | Anthropic's plugin reference |
| Per-stage injection policy (which skill goes to which agent in which RPID stage) | Context Foundry orchestrator |
| Retrieval ranking (BM25 + cosine + popularity) | Context Foundry retriever |
| Telemetry feedback loop (popularity score updates after audit) | Context Foundry eval harness |

The split keeps the things that need to be standardized standardized, and the things that need to be opinionated opinionated. Anyone holding a Claude Code install can take a Context Foundry plugin and use it without Context Foundry. Anyone running Context Foundry can take a third-party Anthropic plugin and feed it through the retriever without any conversion. The transport is shared; the policy is not.

This is also why the broader AGENTS.md universal standard matters. The transport layer for agent knowledge is converging -- skills, plugins, AGENTS.md, MCP servers -- and the projects that align with the convergence inherit the marketplace. The projects that build their own format become islands.

What we learned

The patterns-to-skills migration was painful because it forced an admission. The original system did not work, had not worked for months, and the only reason it was still there was that nobody had measured it. The audit measured it. The measurement was unambiguous. The migration followed.

The lesson is not "skills are better than patterns." The lesson is that every accumulation layer in an agent system needs a utilization gate. If a learned artifact is not getting cited in passing builds, it is not knowledge -- it is noise. Skills happen to have a better disclosure model and a better marketplace story, but their real win is that they are countable. You can ask "which skill fired on this build, and did the build pass?" and get an answer. Patterns could not answer that question without the rebuild that became T1.12-T1.16.
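A utilization gate of this kind can be sketched as a report over build telemetry. The Build record, the utilization and noise helpers, and the min_loads threshold are hypothetical; the real telemetry schema is not described in this article.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Build:
    passed: bool
    loaded: set = field(default_factory=set)  # skills pulled into a prompt
    cited: set = field(default_factory=set)   # skills the build actually cited


def utilization(builds: list) -> dict:
    """Map skill name -> (times loaded, times cited in a passing build)."""
    loads = Counter()
    useful = Counter()
    for build in builds:
        loads.update(build.loaded)
        if build.passed:
            useful.update(build.cited & build.loaded)
    return {name: (loads[name], useful[name]) for name in loads}


def noise(builds: list, min_loads: int = 3) -> list:
    """Skills loaded at least min_loads times but never cited in a passing build."""
    return sorted(
        name
        for name, (loaded, cited) in utilization(builds).items()
        if loaded >= min_loads and cited == 0
    )


builds = [
    Build(True, {"doubt-loop", "spec-drift-check"}, {"doubt-loop"}),
    Build(False, {"doubt-loop", "spec-drift-check"}, {"spec-drift-check"}),
    Build(True, {"spec-drift-check"}, set()),
]
report = utilization(builds)
flagged = noise(builds)
print(report)
print(flagged)
```

The point of the sketch is that the question "which skill fired, and did the build pass?" reduces to two counters per skill once each skill has a stable name.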

One big file would have been faster to write. It would also have been impossible to measure, impossible to share, and impossible to govern. The boring win of small files with strict frontmatter is that you can do all three.
