contextfoundry.dev · May 2026

Superpowers vs Context Foundry

Same Anthropic Skills format, different layer of the stack.

An honest, in-depth side-by-side of Jesse Vincent's Superpowers Claude Code plugin and Context Foundry, covering architecture, skill systems, methodology coverage, token economics, and gap analysis in both directions.

TL;DR

Superpowers is a Claude Code plugin that bolts a software-engineering methodology onto a turn-based conversational session. Context Foundry is a standalone Rust binary that runs an autonomous build loop against your codebase across many tasks without supervision. They are not competitors. They live at different layers of the same stack and could be run together.

Plugin · v5.1.0 · 645K installs

Superpowers

A methodology layer that lives inside an interactive Claude Code session. 14 composable skills + 1 master dispatcher that fires automatically and routes the agent through TDD, brainstorming, plan-writing, subagent dispatch, code review, and worktree-based parallelism.

MIT · Anthropic Skills format · turn-based collaboration

Binary · v3.3.0 · self-hosted

Context Foundry

A 9-stage autonomous build loop (Q→R→P→P+→B→A→SHIP→DISCOVER→SKILLS) that reads a TASKS.md queue and ships feat() commits unattended. Hybrid skill retriever (BM25 + local Ollama nomic-embed + telemetry) ranks 271 skills per stage. TUI dashboard, multi-provider routing.

MIT · Anthropic Skills format · autonomous pipeline
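To make the 9-stage loop concrete, here is a minimal Python sketch of a queue-driven runner in which every stage starts from a fresh context and only artifacts (standing in for files on disk) carry state forward. The stage order comes from the description above; run_stage, the Task class, and the artifact dictionary are hypothetical stand-ins, not Context Foundry's Rust implementation.

```python
from dataclasses import dataclass, field

# Stage order as described: Q -> R -> P -> P+ -> B -> A -> SHIP -> DISCOVER -> SKILLS
STAGES = ["QUERY", "RESEARCH", "PLAN", "PLAN_REVIEW", "BUILD", "AUDIT",
          "SHIP", "DISCOVER", "SKILLS"]


@dataclass
class Task:
    title: str
    # Hypothetical stand-in for the per-task artifact files written to disk.
    artifacts: dict = field(default_factory=dict)


def run_stage(stage: str, task: Task) -> str:
    """Hypothetical stage agent. Its context is rebuilt from scratch out of
    the task title and existing artifacts, never from a prior conversation."""
    context = {"stage": stage, "task": task.title, **task.artifacts}
    return f"{stage.lower()} output ({len(context)} context keys)"


def run_queue(tasks: list[Task]) -> list[tuple[str, list[str]]]:
    log = []
    for task in tasks:
        for stage in STAGES:
            # Each stage gets a new context; only its artifact persists.
            task.artifacts[stage] = run_stage(stage, task)
        log.append((task.title, list(task.artifacts)))
    return log
```

In the tool itself, DISCOVER appends new tasks to the queue; the sketch leaves that out so the loop stays finite.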

Layer diagram — where each sits

The diagram below is the single most important point of this comparison. Superpowers steers what one agent does on one turn. Context Foundry orchestrates many agents across many tasks, each running in its own fresh context. They overlap only at the skill-format layer.

┌──────────────────────────────────────────────────────────────────────────┐
│ Layer 5: Multi-task autonomy                                             │
│ Reads TASKS.md, runs stages, commits per task, learns across runs.       │
│                                                                          │
│   [ Context Foundry ]  ← pipeline + retriever + telemetry + TUI          │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ Layer 4: Per-stage agent invocation                                      │
│ Spawns Claude / Codex / Copilot in PTY, fresh-context per stage.         │
│                                                                          │
│   [ Context Foundry: Builder, Auditor, etc. ]                            │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ Layer 3: In-session methodology                                          │
│ Inside one Claude session: brainstorm, plan, TDD, review, debug.         │
│                                                                          │
│   [ Superpowers ]  ← dispatcher + composable skills                      │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ Layer 2: Skill / knowledge format                                        │
│                                                                          │
│   [ Anthropic Agent Skills (SKILL.md) ]  ← both build on this            │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ Layer 1: Foundation model + CLI                                          │
│   [ Claude Code / Codex / Copilot / Gemini / Cursor / OpenCode ]         │
└──────────────────────────────────────────────────────────────────────────┘

This is why a fair comparison cannot pick a winner: the two tools run at different layers. The interesting question is not "which is better?" but "where does each contribute most?"

Side-by-side feature matrix

Each cell describes what the tool actually ships, not what it could do in theory. A ° marks a partial capability or a different mechanism; "No ..." and "Not present" entries mark a missing capability.

Capability | Superpowers | Context Foundry
Form factor | Claude Code plugin (markdown skills) | Standalone Rust binary + TUI
Execution model | Turn-based, human in the loop | Autonomous, queue-driven, unattended
Time horizon | One session, minutes to hours | Multiple tasks, hours to days
Skill format | Anthropic Skills SKILL.md | Anthropic Skills SKILL.md
Skill catalog size | 14 curated skills (methodology-focused) | 271 shipped + user-extensible (pitfall-focused)
Skill activation | Master "using-superpowers" dispatcher reads the request and picks skills via when_to_use metadata | Hybrid retriever per stage: BM25 + nomic-embed cosine + telemetry success-rate boost
Master / dispatcher skill | using-superpowers: runs in every conversation, under 2K tokens, routes to skills | No equivalent; routing is done per stage by the retriever
TDD enforcement | test-driven-development skill enforces red/green/refactor; tests must fail before implementation | No dedicated TDD skill in the shipped catalog
Systematic debugging | systematic-debugging: 4-phase root-cause methodology; architectural review triggered after 3 failed fixes | No dedicated debugging-methodology skill
Brainstorming / requirements | brainstorming: Socratic refinement, design sections for validation | QUERY stage: clarifying questions before research ° (different mechanism)
Plan writing | writing-plans skill: structured implementation plan | PLAN stage produces current-plan.md with file:line operations
Plan review | Code review post-implementation only ° | P+ (plan-review) iterates on the plan before BUILD; depth scales by complexity tier (1/2/3)
Plan execution | executing-plans with batched checkpoints | BUILD stage; verification commands per plan
Code review | requesting-code-review + receiving-code-review + code-reviewer agent | AUDIT stage: fresh-context agent reads build-claims, greps the diff, verifies CHECK lines
Fresh-context per stage | Subagent dispatch isolates implementation tokens | Every stage spawns a clean Claude/Codex process
Subagent dispatch | subagent-driven-development + dispatching-parallel-agents | Dual-model arena + per-stage agents; no user-invoked parallel dispatch °
Git worktrees | using-git-worktrees: parallel tasks without clobbering | Worktrees for the dual-model arena and parallel agents
Auto-commit per task | finishing-a-development-branch skill assists; user-driven ° | SHIP stage commits feat() or WIP() based on the audit verdict
Task queue / multi-task | No queue; one task per conversation | TASKS.md queue with QRPBA progress indicators
Discovery / next task | Not present | DISCOVER stage scans the codebase and appends new tasks
Skill authoring | writing-skills: TDD applied to skill docs | Pattern extractor writes SKILL.md after each task
Cross-tool reach | Ships .claude-plugin/, .codex-plugin/, .cursor-plugin/, .opencode/; 7+ tools | Cross-provider runtime (Claude/Codex/Copilot/local); reads external skill formats (AGENTS.md, .cursorrules, .github/copilot-instructions.md)
Vector embedding | Not used; activation is via when_to_use semantic matching by the host model | Local Ollama nomic-embed-text (137M params, 768-dim, ~50 ms/call, on-device)
Telemetry / learning loop | Behavioral pressure-tests keep skills firing under load ° | Citation scanner + SQLite sidecar: skills cited in passing builds gain rank; failures decay
Token economics | "VERY token light": master skill <2K tokens; subagents absorb implementation cost | Heavier per task (~$28 per [Complex] task on the May 2026 overnight run) but unattended °
Live dashboard / TUI | None; output lives in whichever terminal Claude Code runs in | Ratatui TUI with clickable AI summaries on every pane
Per-stage model routing | One model per session | stage_overrides in .foundry.json, e.g. Claude on PLAN, Codex on BUILD
License | MIT | MIT
Install | /plugin install superpowers@claude-plugins-official | cargo install foundry / npm i -g context-foundry / brew / winget
Adoption (May 2026) | 645,146 installs | v3.3.0 just shipped; significantly smaller install base

Methodology coverage

Superpowers' value proposition is methodology: it teaches Claude how to do software development (brainstorming, TDD, debugging, code review) on every turn. Context Foundry's value proposition is orchestration: it spends its tokens on running stages in the right order. Both cover the same methodology surface, but unevenly.

Methodology | Superpowers | Context Foundry | Honest verdict
Brainstorming / clarify requirements | Socratic, multi-turn, design-doc style | QUERY stage writes clarifying questions to disk | Both cover this. Superpowers is more conversational; CF is more artifact-driven.
Plan writing | writing-plans skill | PLAN + P+ stages with a verification matrix | CF is more rigorous here: P+ rejects plans that don't grep against the actual files.
TDD (red/green/refactor) | test-driven-development: tests must fail before code | Not enforced as a discrete skill in the catalog | Real gap in CF. A test-first skill would slot naturally into the BUILD stage.
Systematic debugging | 4-phase: investigate root cause, pattern analysis, hypothesis test, implement; architectural review after 3 failed attempts | The AUDIT stage flags WIP() commits; no proactive debugging methodology | Real gap in CF. Superpowers' debugging discipline is one of its sharpest contributions.
Code review | requesting-code-review + receiving-code-review + code-reviewer agent | AUDIT stage with a fresh-context agent verifying CHECK lines against the diff | Different rhythms. Superpowers' review is conversational; CF's is a binary pass/fail that drives feat() vs WIP().
Worktree-based parallelism | using-git-worktrees: user-invoked branching for parallel tasks | Dual-model arena uses worktrees for A/B model comparison | Both lean on worktrees, for different use cases: parallel tasks (SP) vs parallel models on one task (CF).
Skill authoring | writing-skills: behavioral pressure-tests for skill docs | Pattern extractor agent writes SKILL.md from task artifacts | Different angles. Superpowers teaches authoring discipline; CF auto-extracts skills from completed work.
Multi-task queue + autonomy | Not in scope; one task per session | Full TASKS.md pipeline running unattended overnight | Real gap in Superpowers, but by design: Superpowers is a methodology, not an orchestrator.
Per-task verdict + auto-commit | User commits at end of session | SHIP stage emits feat() or WIP() per task based on AUDIT | Real gap in Superpowers, by the same logic: no autonomous git surface.
Cross-run learning | No mechanism; skills are static | Citation telemetry; skills cited in passing builds gain rank for the next task | Real gap in Superpowers. A static skill library doesn't get better the more you use it.
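The binary rhythm in the code-review and auto-commit rows (an AUDIT verdict driving feat() versus WIP()) can be sketched in a few lines of Python. The CHECK-line format assumed here (a literal string that must appear among the diff's added lines) and the commit-message layout are illustrative assumptions; the source specifies neither.

```python
def added_lines(diff: str) -> list[str]:
    """Lines added by a unified diff, minus the leading '+'.
    '+++' file headers are skipped (a simplification: an added line that
    itself starts with '++' would be skipped too)."""
    return [line[1:] for line in diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")]


def audit(diff: str, checks: list[str]) -> tuple[bool, list[str]]:
    """Hypothetical AUDIT: each CHECK string must appear in some added line."""
    added = added_lines(diff)
    missing = [check for check in checks if not any(check in line for line in added)]
    return (not missing, missing)


def ship_message(scope: str, summary: str, diff: str, checks: list[str]) -> str:
    """Hypothetical SHIP: a passing audit yields feat(), anything else WIP()."""
    passed, missing = audit(diff, checks)
    if passed:
        return f"feat({scope}): {summary}"
    return f"WIP({scope}): {summary} (unverified: {', '.join(missing)})"
```

With a diff that adds a call to check_password, ship_message("auth", "add login", diff, ["check_password"]) returns "feat(auth): add login"; a CHECK string absent from the diff downgrades the same task to a WIP() commit.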

Skill system — the deep dive

Both projects build on Anthropic's Agent Skills specification (SKILL.md with YAML frontmatter, progressive disclosure, name + description as the activation key). The interesting differences are in how the skills are ranked and surfaced.
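A minimal Python sketch of what progressive disclosure means on disk: only the frontmatter (name + description) is kept for activation decisions, and the body is read when a skill is actually selected. The parser understands only flat "key: value" lines between "---" fences, a deliberate simplification of full YAML; the Skill class and its method names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a SKILL.md into (frontmatter fields, body).
    Simplification: only flat 'key: value' lines are parsed, not full YAML."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # unterminated frontmatter: treat the whole file as body


@dataclass
class Skill:
    path: Path
    name: str
    description: str

    @classmethod
    def index(cls, path: Path) -> "Skill":
        # Only the frontmatter is kept in memory for activation decisions.
        meta, _ = parse_frontmatter(path.read_text())
        return cls(path, meta.get("name", path.parent.name), meta.get("description", ""))

    def load_body(self) -> str:
        # Progressive disclosure: the full body is read only once the skill is chosen.
        return parse_frontmatter(self.path.read_text())[1]
```

Splitting on the first colon only keeps descriptions such as "Use when: tests first" intact.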

Superpowers: dispatcher + when_to_use semantics

One master skill, using-superpowers, loads in every conversation. It is under 2K tokens. It reads the user's request and decides which of the 14 skills to invoke. Activation relies on the host model's own pattern-matching against each skill's when_to_use metadata. There is no retriever and no vector store; the orchestration logic lives in the dispatcher's prose. From Simon Willison's writeup:

"VERY token light… one doc of fewer than 2k tokens. It uses subagents to manage token-heavy stuff, including all the actual implementation. As it needs bits of the process, it runs a shell script to search for them."

— Simon Willison on Superpowers

The catalog is small and curated: 14 skills, each manually authored, covering methodology rather than domain knowledge.

Context Foundry: hybrid retriever + per-stage filter

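271 skills ship by default. For each stage of each task, a three-signal retriever ranks every candidate:

  • BM25 lexical match between the current task and the skill;
  • cosine similarity between local nomic-embed-text embeddings;
  • a telemetry boost from the skill's success rate in past builds.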

The top N (default 10, tunable) get injected into that stage's prompt. The retriever runs for every stage (QUERY, RESEARCH, PLAN, P+, BUILD, AUDIT, SHIP, DISCOVER, SKILLS), filtered softly by the cf-stage hint in each skill's metadata.
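The ranking step can be sketched in Python as a weighted blend of the three signals followed by a top-N cut. Everything numeric is an assumption: the weights, the divide-by-max scaling of BM25, the multiplier applied on a cf-stage mismatch, and the toy 3-dimensional vectors standing in for 768-dimensional nomic-embed-text embeddings. The source does not give Context Foundry's actual scoring formula.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each tokenized document against the query tokens."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_skills(query_tokens: list[str], query_vec: list[float], skills: list[dict],
                stage: str, top_n: int = 10,
                w_bm25: float = 0.4, w_dense: float = 0.4, w_telemetry: float = 0.2,
                off_stage_penalty: float = 0.5) -> list[str]:
    """Blend the three signals and return the top_n skill names.
    Weights and penalty are illustrative assumptions, not Foundry's values."""
    if not skills:
        return []
    lexical = bm25_scores(query_tokens, [s["tokens"] for s in skills])
    top = max(lexical) or 1.0  # divide-by-max puts BM25 on a 0..1 scale
    ranked = []
    for skill, lex in zip(skills, lexical):
        score = (w_bm25 * lex / top
                 + w_dense * cosine(query_vec, skill["vec"])
                 + w_telemetry * skill["success_rate"])
        if skill.get("cf_stage") not in (None, stage):
            score *= off_stage_penalty  # soft filter: demote off-stage skills, don't drop them
        ranked.append((score, skill["name"]))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:top_n]]
```

Swapping the toy vectors for real 768-float embeddings changes nothing structurally, since cosine is dimension-agnostic. The soft filter is the design point: an off-stage skill that is a strong lexical and semantic match can still outrank a weak on-stage one.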

Where they meet

Both treat the skill body as the unit of progressive disclosure. Both keep the catalog separate from the activation logic. Both use Anthropic's spec so the on-disk content is portable between tools. Critically: a Superpowers skill could be dropped into ~/.foundry/skills/ and the Context Foundry retriever would pick it up for any stage where its description matched. The portability is real, not theoretical.
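What "picked up" means in practice can be sketched as a directory scan: anything shaped like <root>/<name>/SKILL.md counts as a skill, whichever project authored it. This Python sketch is hypothetical; the multi-root search mirrors the ~/.foundry/skills/ and plugins/<name>/skills/ locations this comparison mentions, and the first-root-wins collision rule is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def discover_skills(*roots: Path) -> dict[str, Path]:
    """Map each skill folder name to its SKILL.md across all roots.
    Earlier roots win on name collisions (an assumed policy, not Foundry's)."""
    found: dict[str, Path] = {}
    for root in roots:
        for skill_md in sorted(root.glob("*/SKILL.md")):
            found.setdefault(skill_md.parent.name, skill_md)
    return found


def demo() -> tuple[list[str], list[str]]:
    """Copy a Superpowers-style skill folder into an empty skills root and rescan."""
    with tempfile.TemporaryDirectory() as tmp:
        skills_root = Path(tmp) / "foundry-skills"
        skills_root.mkdir()
        plugin_skill = Path(tmp) / "superpowers" / "systematic-debugging"
        plugin_skill.mkdir(parents=True)
        (plugin_skill / "SKILL.md").write_text("---\nname: systematic-debugging\n---\n")
        before = sorted(discover_skills(skills_root))
        shutil.copytree(plugin_skill, skills_root / plugin_skill.name)
        after = sorted(discover_skills(skills_root))
    return before, after
```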

Token economics

The two projects make different bets about how to spend tokens.

Superpowers

Bet: minimize the methodology overhead per turn so the user-facing budget is mostly implementation.

Mechanism: master skill is <2K tokens; subagents take the implementation cost; skills load on demand via runtime search.

Implication: it works inside any Claude Code session without materially changing your billing pattern.

Context Foundry

Bet: spend whatever a queue takes overnight, because the alternative is paying a human to babysit the run.

Mechanism: 9 stages per task, each a fresh-context agent invocation; ~$28 average per [Complex] task on the May 2026 overnight run.

Implication: per-task cost is visibly higher, but the time the user spends is approximately zero between invocation and the morning's commits.

This is not a fair fight on raw tokens, and it shouldn't be. Superpowers is optimizing for the unit "how much extra do I pay to get methodology added to a session I was going to have anyway?" Context Foundry is optimizing for "how much do I pay to skip the session entirely?"
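The Context Foundry side of the trade reduces to simple arithmetic. The only figure the source supplies is the roughly $28 average per [Complex] task; this sketch treats queue cost as linear in task count and asks how many hours of human supervision a run must replace to break even. The hourly rate is a parameter you supply, and the $100/hour below is purely illustrative.

```python
def queue_cost(n_tasks: int, cost_per_task: float = 28.0) -> float:
    """Approximate spend for an unattended queue of [Complex] tasks (linear model)."""
    return n_tasks * cost_per_task


def break_even_hours(n_tasks: int, hourly_rate: float, cost_per_task: float = 28.0) -> float:
    """Supervision hours the run must replace before it pays for itself."""
    return queue_cost(n_tasks, cost_per_task) / hourly_rate


# Illustrative inputs only: a 10-task overnight queue, human time valued at $100/hour.
ten_task_cost = queue_cost(10)                      # 280.0 dollars
hours_to_break_even = break_even_hours(10, 100.0)   # 2.8 hours
```

The source gives no per-stage cost breakdown, so the sketch stays at task granularity.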

Cross-tool reach

Both ship beyond a single vendor, but in different ways.

Aspect | Superpowers | Context Foundry
Where the skills live | Ships per-tool wrapper dirs: .claude-plugin/, .codex-plugin/, .cursor-plugin/, .opencode/, .codex-app/, plus Factory Droid, Gemini CLI, and GitHub Copilot CLI integrations | Skills live in ~/.foundry/skills/ and plugins/<name>/skills/. The tools are invoked by Foundry, not the other way around.
Reads external skill formats | Authors its own skills and pushes them cross-tool | Discovers AGENTS.md (Linux Foundation standard), .cursorrules, .claude/skills/<topic>/SKILL.md, and .github/copilot-instructions.md, with per-source opt-in on the startup screen
Direction of integration | "My skills should work in your tool" | "My pipeline can call your tool, and read your skills"
Per-stage model routing | One model per session | Different model per pipeline stage via stage_overrides, e.g. Claude Opus on PLAN, Codex on BUILD, a local model on DISCOVER
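Per-stage routing amounts to a lookup with a fallback. The Python sketch below assumes a .foundry.json shape with a top-level default model and a stage_overrides map from stage name to model; only the stage_overrides key is named in the source, so default_model and the model identifiers are hypothetical.

```python
import json

# Hypothetical .foundry.json contents; only "stage_overrides" is named in the source.
EXAMPLE_CONFIG = """
{
  "default_model": "claude",
  "stage_overrides": {
    "PLAN": "claude-opus",
    "BUILD": "codex",
    "DISCOVER": "local"
  }
}
"""


def model_for_stage(config: dict, stage: str) -> str:
    """Return the override for this stage if one exists, else the project default."""
    return config.get("stage_overrides", {}).get(stage, config["default_model"])


config = json.loads(EXAMPLE_CONFIG)
routing = {stage: model_for_stage(config, stage)
           for stage in ["QUERY", "PLAN", "BUILD", "AUDIT", "DISCOVER"]}
```

Stages without an override fall through to the default, so a project only lists the stages where it wants a cheaper or stronger model.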

What happens if you run both?

This is the most interesting scenario and the one that argues hardest against treating them as competitors.

Hypothesis

Context Foundry's BUILD and AUDIT stages spawn Claude Code processes. If the user has installed Superpowers, those spawned processes would inherit Superpowers' master skill and all 14 of its methodology skills, so Foundry's pipeline would pick up TDD discipline, four-phase debugging, and the code-reviewer agent without any integration work.

Two specific gaps in Context Foundry's catalog, explicit TDD enforcement and a systematic debugging methodology, are exactly where Superpowers is strongest. A user running Foundry on a TDD-shaped codebase, with Superpowers installed in their global Claude Code config, would get something closer to the union of both projects' strengths than either alone.

This is the practical answer to "which should I install?" Both, if your workflow has room for them. Treat the layer diagram at the top as the seating chart: they sit in different chairs.

Honest gaps in Context Foundry (that Superpowers fills)

Where Superpowers is sharper

1. No dedicated TDD skill in the shipped 271. The catalog has many "test failure" pitfalls but no "tests must fail before implementation" enforcement skill. Adding one would slot naturally into BUILD via the cf-stage: build hint.

2. No four-phase debugging methodology. AUDIT catches failures post-hoc; Superpowers' systematic-debugging drives the agent to investigate root cause before patching. CF would benefit from a sibling AUDIT-time skill that scans for symptom-treating fixes.

3. No subagent dispatch from within a stage. Foundry's stages are themselves the subagents; there's no idiom for a stage to fan out further. Superpowers' dispatching-parallel-agents covers a use case (e.g. "review four files in parallel") that Foundry doesn't have a primitive for.

4. No conversational brainstorming. QUERY writes questions to disk; it does not maintain a Socratic dialogue with the user. For ill-specified tasks, Superpowers' multi-turn brainstorming may produce a better starting plan.

5. No skill-author discipline. The pattern extractor writes SKILL.md mechanically; Superpowers' writing-skills applies TDD principles to the skill itself, including behavioral pressure-tests.
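Gap 1 is largely a question of ordering. A red/green gate of the kind a BUILD-stage TDD skill might enforce can be sketched in Python; run_tests and apply_implementation are hypothetical caller-supplied callables, and this is not code from either project.

```python
from typing import Callable


class TddViolation(Exception):
    """Raised when the red-then-green order is not respected."""


def red_green_gate(run_tests: Callable[[], bool],
                   apply_implementation: Callable[[], None]) -> str:
    """Require a failing suite before the change and a passing one after it."""
    if run_tests():
        # Already green: the new tests cannot be exercising the missing behavior.
        raise TddViolation("tests passed before implementation; write a failing test first")
    apply_implementation()
    if not run_tests():
        raise TddViolation("tests still fail after implementation")
    return "red -> green"


def simulate(tests_pass_before: bool) -> str:
    """Toy project whose suite goes green once the implementation flag is set."""
    state = {"implemented": False}

    def run_tests() -> bool:
        return tests_pass_before or state["implemented"]

    def apply_implementation() -> None:
        state["implemented"] = True

    try:
        return red_green_gate(run_tests, apply_implementation)
    except TddViolation as err:
        return f"rejected: {err}"
```

simulate(False) walks the intended path; simulate(True) models tests written after the fact, which the gate rejects before any code is applied.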

Honest gaps in Superpowers (that Context Foundry fills)

Where Context Foundry is sharper

1. No autonomy. Superpowers is bound to the rhythm of a human turn. It cannot run overnight against a queue. This is by design — it is a methodology layer, not an orchestrator — but it means "ship 10 tasks while I sleep" is not in scope.

2. No queue, no DISCOVER stage, no auto-commit verdict. Each session ends when the user closes it. There is no equivalent to TASKS.md, no QRPBA progress indicators, and no programmatic feat() vs WIP() decision based on a fresh-context audit.

3. No retriever. Skill activation relies on the host model's own pattern-matching against when_to_use. This is elegant at 14 skills and probably saturates at a few dozen; it would not scale to a 271-skill catalog. Foundry's BM25 + dense + telemetry combination is the answer to "what happens when the catalog grows to hundreds of skills?"

4. No telemetry feedback loop. Skills don't learn from outcomes. A skill that fires every conversation but never contributes to a passing build will keep firing forever. Foundry's citation scanner is the closed-loop version of the same idea.

5. No per-stage model routing. Superpowers runs against whatever model the host CLI uses. Foundry's stage_overrides mean a project can put Claude Opus on PLAN, Codex on BUILD, and a local model on DISCOVER — routing by what each model is best at.

6. No live dashboard. Output is whatever the host terminal shows. Foundry's TUI surfaces pipeline state, AI summaries on every pane, hover tooltips, click-through to artifacts — the user can watch an autonomous run without reading the raw log.
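Item 4's closed loop can be sketched with Python's built-in sqlite3 module: after each build, every skill cited in it has its score raised by one if the build passed or multiplied by a decay factor if it failed. The table layout, the +1 credit, and the 0.8 decay are assumptions for illustration; the source says only that skills cited in passing builds gain rank and failures decay.

```python
import sqlite3


def open_sidecar(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS skill_stats "
                 "(name TEXT PRIMARY KEY, score REAL NOT NULL DEFAULT 0)")
    return conn


def record_build(conn: sqlite3.Connection, cited: list[str], passed: bool,
                 decay: float = 0.8) -> None:
    """Credit skills cited in a passing build; decay those cited in a failing one."""
    with conn:  # commits the transaction when the block succeeds
        for name in cited:
            conn.execute("INSERT OR IGNORE INTO skill_stats (name) VALUES (?)", (name,))
            if passed:
                conn.execute("UPDATE skill_stats SET score = score + 1 WHERE name = ?", (name,))
            else:
                conn.execute("UPDATE skill_stats SET score = score * ? WHERE name = ?",
                             (decay, name))


def telemetry_ranking(conn: sqlite3.Connection) -> list[tuple[str, float]]:
    return conn.execute("SELECT name, score FROM skill_stats "
                        "ORDER BY score DESC, name").fetchall()
```

A retriever would read these scores at ranking time; turning the raw score into a bounded success rate is left open here.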

When to use which

Reach for Superpowers when…

  • You're at the keyboard, in a Claude Code session, working on one task at a time.
  • You want methodology discipline (TDD, root-cause debugging, plan-first) without changing how you launch Claude.
  • The bottleneck is the quality of one session, not throughput across many sessions.
  • You want a small, curated, hand-authored skill set, not a hundreds-deep catalog.
  • You're using a tool that isn't Claude Code (Cursor, Codex, Copilot, OpenCode) but you still want the methodology.
  • You're new to disciplined AI collaboration and want a learning path baked into the workflow.

Reach for Context Foundry when…

  • You want to leave a queue running unattended — overnight, while you're in meetings, while you sleep.
  • The work decomposes into 5+ tasks that can each be specified at the TASKS.md level.
  • You need a fresh-context audit gate before commits hit your tree.
  • You want learning to compound across runs — skills cited in passing builds gaining rank for next time.
  • You want per-stage model routing (cheap models for cheap stages, frontier models for hard stages).
  • You want a TUI you can glance at to see where the pipeline is.
  • You're running models locally and want the embedding step to stay on the laptop.

Closing

The dominant framing on AI coding tools right now is competitive: which plugin is best, which agent wins, which framework will beat which. That framing is wrong for these two projects. Superpowers and Context Foundry sit at different layers of the same stack, both build on Anthropic's Skills format, both treat fresh-context discipline as load-bearing, and both ship under MIT.

The interesting comparison is not which one but which combination. The honest answer is that Context Foundry is stronger at orchestration and weaker at methodology; Superpowers is the inverse. If your workflow has room for both, run both. If it doesn't, the layer diagram at the top of this page tells you where each one fits.

Two projects, same skill format, complementary scope. The shared Anthropic Skills format is not a coincidence: it is what makes the seating chart legible, and the portability it provides is what lets you combine the two projects instead of choosing between them.

Sources