Stop Putting Best Practices in Skills

Vercel demonstrated that AGENTS.md outperforms skills in their agent evals. AGENTS.md hit 100% pass rate on general framework knowledge. Skills with explicit instructions reached 79%. In 56% of cases, the agent had access to a skill but never invoked it. Their conclusion: skills work for vertical, action-specific workflows, not for general best practices.

Their evals were single-shot, though. One prompt, one response, done. Skills depend on context to be called. The model sees a name and a one-line description and has to decide in a single cold shot whether to invoke. In a real session, you go back and forth, context accumulates, the model picks up patterns. Single-shot penalizes skills by testing them in conditions nobody actually uses them in.

Then there’s Superpowers. People install it, and Claude Code starts following TDD, writing plans before coding, and debugging systematically. It bundles best practices as skills and people swear by it. If skills are supposed to lose to AGENTS.md, why does Superpowers work so well?

I ran 51 multi-turn evals across 4 configurations, replicated Vercel’s experiment in realistic multi-turn sessions, and read Claude Code’s source to understand the mechanics. Skills and CLAUDE.md are both just prompts. Same markdown, same model. The only difference is whether the prompt reaches the model. CLAUDE.md reaches it every time. Skills depend on a chain of decisions that fails 34-94% of the time. And Superpowers works not because of skills, but because its hook bypasses the skill system entirely, approximating what CLAUDE.md does natively.

TL;DR: Skills and CLAUDE.md are both just prompts. When skills get invoked, they work just as well. The problem is they only get invoked 6-66% of the time. CLAUDE.md is always in context. Put guidelines in CLAUDE.md; use skills for on-demand recipes.

How skills actually work in Claude Code

Before the data, a look at how skills work under the hood. I first figured this out by reading OpenCode's source, then confirmed it against Claude Code's leaked source. I reference file paths from that codebase throughout this section; search GitHub and you'll find them.

Discovery: name and description

When Claude Code starts a session, it scans for skills across three levels (src/skills/loadSkillsDir.ts):

  1. Managed: /etc/claude-code/.claude/skills/ (org-wide)
  2. User: ~/.claude/skills/ (your personal skills)
  3. Project: .claude/skills/ (checked into the repo)

For each skill directory, it reads the SKILL.md frontmatter – name, description, and optional fields like context, allowed-tools, arguments. The full markdown body stays on disk.

At session init, only name and description reach the model. formatCommandDescription() in src/tools/SkillTool/prompt.ts produces one line per skill: - {name}: {description}. The listing is built by getSkillListingAttachments() in src/utils/attachments.ts, which formats these lines within a token budget (1% of the context window, max 250 chars per description) and sends them as a <system-reminder> user message:

The following skills are available for use with the Skill tool:

- test-driven-development: Use when implementing any feature or bugfix
- systematic-debugging: Use when encountering any bug or test failure

Two short strings. The model knows skills exist, but it hasn't read them. This is the bottleneck. The model sees what is available, not how to apply it. It has to decide, from a name and a sentence, whether to spend a tool call loading the full content. In my evals, it almost never does.
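
The discovery step can be sketched in shell. This is my approximation, not Claude Code's actual code: it pulls name and description out of each SKILL.md's frontmatter and emits the one-line-per-skill listing, applying the 250-char description cap described above.

```shell
# Rough sketch of skill discovery (helper logic is mine, not the
# actual loadSkillsDir.ts): read name and description from each
# SKILL.md's frontmatter and emit one listing line per skill.
list_skills() {
  local dir name desc
  for dir in "$1"/*/; do
    [ -f "${dir}SKILL.md" ] || continue
    name=$(sed -n 's/^name:[[:space:]]*//p' "${dir}SKILL.md" | head -n 1)
    desc=$(sed -n 's/^description:[[:space:]]*//p' "${dir}SKILL.md" | head -n 1)
    # %.250s caps the description at 250 characters.
    printf -- '- %s: %.250s\n' "$name" "$desc"
  done
}
```

Everything else in the SKILL.md body stays on disk; only these lines reach the model at session init.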

Invocation: full content on demand

When the model decides a skill is relevant, it calls the Skill tool (src/tools/SkillTool/SkillTool.ts):

{ "tool": "Skill", "input": { "skill": "test-driven-development" } }

The SkillTool loads the full SKILL.md content via getPromptForCommand(), substitutes variables like $ARGUMENTS and ${CLAUDE_SKILL_DIR}, and injects it into the conversation. Two execution modes:

  • Inline (default) – content goes directly into the current conversation as a user message. The model reads the instructions and follows them in the same context.
  • Fork (context: fork in frontmatter) – spawns a sub-agent via executeForkedSkill() with isolated context and its own token budget. The result comes back without bloating the parent conversation.

Same mechanism that lets the model read files or run bash commands. It asks, the runtime reads a markdown file, the content comes back.

CLAUDE.md: always in context

CLAUDE.md takes a different path. At session start, Claude Code walks from your working directory up to root, collecting every CLAUDE.md, .claude/CLAUDE.md, and .claude/rules/*.md it finds (src/utils/claudemd.ts). It supports @include directives – one file pulling in others, up to 5 levels deep.

All of this feeds into getUserContext() in src/context.ts, which gets prepended to the conversation as a <system-reminder> user message before the model sees anything else (src/utils/api.ts).
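
The walk-up collection can be sketched like this (my approximation of the behavior, not the actual src/utils/claudemd.ts; @include handling and .claude/rules/*.md globbing are omitted, and the root-most-first ordering is my assumption):

```shell
# Sketch: from the working directory up to /, collect every
# CLAUDE.md and .claude/CLAUDE.md along the way.
collect_claude_md() {
  local dir f found=""
  dir=$(cd "$1" && pwd)
  while :; do
    for f in "$dir/CLAUDE.md" "$dir/.claude/CLAUDE.md"; do
      if [ -f "$f" ]; then
        # Prepend, so files from higher directories end up first.
        found="$f
$found"
      fi
    done
    [ "$dir" = "/" ] && break
    dir=$(dirname "$dir")
  done
  printf '%s' "$found"
}
```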

| Content | When it loads | How it loads |
| --- | --- | --- |
| CLAUDE.md | Every session, automatically | Prepended as first user message |
| Skill listings | Every session, automatically | Name + description only |
| Skill content | On demand, when the model calls the Skill tool | Full markdown injected into conversation |

CLAUDE.md is always there. Skill content waits for the model to decide it’s relevant and call the tool.

What Superpowers actually does

Superpowers registers a SessionStart hook. When a session begins, the hook runs a shell script that reads the using-superpowers skill from disk and outputs it as additionalContext in the hook response. Claude Code injects that into the conversation as a <system-reminder> message.
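
A minimal SessionStart hook in this style might look like the following sketch (paths and layout are illustrative, not the plugin's actual script; the hookSpecificOutput shape follows Claude Code's hook-response format):

```shell
# Hedged sketch of a Superpowers-style SessionStart hook. It reads a
# skill from disk and returns it as additionalContext in the hook's
# JSON response; Claude Code injects that into the conversation.
skill_file="${SKILL_FILE:-$HOME/.claude/skills/using-superpowers/SKILL.md}"
content=$(cat "$skill_file" 2>/dev/null || true)

# jq builds the response object and handles JSON string escaping.
response=$(jq -n --arg ctx "$content" '{
  hookSpecificOutput: {
    hookEventName: "SessionStart",
    additionalContext: $ctx
  }
}')
echo "$response"
```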

The content is aggressive. The skill wraps instructions in <EXTREMELY_IMPORTANT> tags:

IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.

This is not negotiable. This is not optional. You cannot rationalize your way out of this.

It even includes a "Red Flags" table listing thoughts the model might have for skipping skills ("This is just a simple question," "I need more context first," "The skill is overkill") and labels each one as rationalization.

So Superpowers doesn't wait for the model to discover skills. It front-loads instructions into every session via a hook, telling the model to invoke skills before doing anything else. This is basically the same idea as having a CLAUDE.md with a hint ("invoke the relevant skill before coding"), just louder. Better than plain skills, but still not the same as having the actual guidelines in CLAUDE.md from the start. The model still has to invoke the skill, read the content, and follow it. Three steps that can fail. CLAUDE.md skips all three.

The activation gap

I call this the activation gap. The distance between "skill is installed" and "model actually uses the skill."

I ran single-shot evals first to confirm Vercel's numbers: 31 tasks across React and Next.js (react-best-practices-eval, nextjs-agents-md-eval) and a 10-task Superpowers benchmark (superpowers-eval). Similar results. Vanilla skills: 0% invocation. The model never opens the drawer on its own. AGENTS.md: 76-90% pass rate.

But as I mentioned, single-shot isn't how people work. So I built a multi-turn eval suite.

The multi-turn eval

5 scenarios, 3-4 turns each. Plain Node.js/TypeScript so framework knowledge isn't a confounding variable. The prompts are the kind of thing you'd actually type.

Scenario 1: TDD – Email Validator (4 turns)

| Turn | Prompt | Expected workflow |
| --- | --- | --- |
| 1 | "Build a function that validates email addresses. It should handle basic formats like user@domain.com and reject obviously invalid ones like missing @ or empty strings." | TDD: write tests first |
| 2 | "Now add support for international emails – addresses with unicode characters in the local part and IDN domains like user@münchen.de." | TDD: extend tests first |
| 3 | "I found a bug – plus aliases like user+tag@gmail.com are being rejected. Fix it." | Debug: reproduce with failing test |
| 4 | "Refactor to separate the parsing logic from the validation logic." | Refactor: ensure tests pass after |

Scenario 2: Debugging – Broken LRU Cache (3 turns)

Starts with a buggy LRU cache implementation (eviction check uses >= instead of >, causing items to "disappear").

| Turn | Prompt | Expected workflow |
| --- | --- | --- |
| 1 | "This LRU cache is broken – items seem to disappear even when the cache isn't full. Can you figure out what's wrong and fix it?" | Debug: reproduce, find root cause |
| 2 | "It works now but it's really slow when the cache size is large – like 10000 entries. Can you improve the performance?" | Debug: reason about complexity |
| 3 | "Add tests to make sure these bugs don't come back." | TDD: write regression tests |

Scenario 3: Planning – Rate Limiter (3 turns)

| Turn | Prompt | Expected workflow |
| --- | --- | --- |
| 1 | "I need a rate limiter for an API. Limit each client to 100 requests per minute. Give me a plan before coding." | Plan: present approach first |
| 2 | "Actually, fixed window won't work for my use case – requests cluster at window boundaries and burst through. I need sliding window instead." | Plan: revise, explain trade-offs |
| 3 | "Implement it and add tests." | TDD: write tests, then implement |

Scenario 4: Refactoring – Express Middleware (4 turns)

Starts with a 160-line monolithic middleware handling auth, logging, rate limiting, and error handling.

| Turn | Prompt | Expected workflow |
| --- | --- | --- |
| 1 | "This middleware file is 300 lines and handles auth, logging, rate limiting, and error handling all in one. Help me understand what it does." | Analysis: read and explain |
| 2 | "Split it into separate, focused middleware files." | Refactor: restructure safely |
| 3 | "The auth middleware broke after the split – requests that should require auth are passing through without a token." | Debug: reproduce, identify regression |
| 4 | "Add tests for each middleware so we catch this kind of thing." | TDD: write isolated tests |

Scenario 5: Mixed – HTTP Client Retry (3 turns)

Starts with a basic HTTP client without retry logic.

| Turn | Prompt | Expected workflow |
| --- | --- | --- |
| 1 | "Add retry with exponential backoff to this HTTP client. It should retry on 5xx errors and network failures, up to 3 retries." | TDD or plan first |
| 2 | "It's retrying on 400 Bad Request errors too. That's wrong – 4xx should fail immediately without retrying." | Debug: identify status code bug |
| 3 | "Add tests covering the retry logic – success on first try, retry on 5xx, no retry on 4xx, max retries exceeded." | TDD: comprehensive test suite |

Each scenario crosses workflow boundaries. TDD leads to debugging, debugging ends with tests, planning leads to implementation. This is where skills should shine, since they have dedicated workflows for each phase.

Four configurations:

| Config | What the model gets |
| --- | --- |
| Superpowers | SessionStart hook + skills (the real plugin experience) |
| Plain skills | Same skills installed, no hook, no hint |
| CLAUDE.md | Equivalent guidelines written as static rules, always in context |
| CLAUDE.md + hint | One-liner in CLAUDE.md saying "invoke the relevant skill before coding" + skills installed |

Same model (claude-opus-4-6), same tasks, same workspace setup. All runs executed in a clean environment with ~/.claude/plugins, ~/.claude/skills, ~/.claude/settings.json, and ~/.claude/CLAUDE.md temporarily disabled. Only the Superpowers config had plugins restored (it needs them for the hook). Each turn was capped at 15 agentic steps.

Results

Skill invocations

| Config | Invocations | Rate |
| --- | --- | --- |
| Superpowers (hook) | 10/15 | 66% |
| CLAUDE.md + hint | 5/15 | 33% |
| Plain skills | 1/15 | 6% |
| CLAUDE.md (guidelines) | n/a | n/a |

8 of 68 total turns hit the 15 max-turns limit (marked MT in the tables below). That just means the model ran out of agentic steps before finishing, not that it wasn't doing useful work: in most of those turns the model was actively writing tests and implementation and simply needed more steps to complete. Skill invocations on those turns are valid (they happened before the cutoff).

Multi-turn helps Superpowers a lot. From 10% in single-shot to 66% here. The hook fires at session start, and across turns the model builds momentum. Once it invokes TDD on turn 1, it knows the skill exists and reaches for debugging when the task shifts on turn 3.

The CLAUDE.md hint works, but only in a clean environment. This was the Vercel-style config. In my earlier run with global plugins contaminating things, it scored 6% (1/16, wrong skill). Clean run: 33% (5/15, correct skills). The hint is sensitive to noise. Competing global skills and plugins dilute its effect.

Plain skills got one spontaneous invocation out of 15 turns. The model invoked systematic-debugging unprompted on scenario 04, turn 3, after two turns of conversation context. So multi-turn can trigger invocation without a hook, but it's rare.

Clean environment matters more than I expected. Every config did better in the clean run. The earlier local runs (with global plugins present but renamed) showed Superpowers at 41%, CLAUDE.md+hint at 6%, plain skills at 0%. Clean run: 66%, 33%, 6%. Global plugins and skills create noise that suppresses skill invocation.

When Superpowers invokes skills

The pattern is consistent across all runs (local, Docker, and clean):

| Scenario | Turn 1 | Turn 2 | Turn 3 | Turn 4 |
| --- | --- | --- | --- | --- |
| 01 email | TDD | – | debugging | – |
| 02 LRU cache | debugging | – | TDD | |
| 03 rate limiter | brainstorming | – | TDD | |
| 04 middleware | – | – | debugging | TDD |
| 05 HTTP retry | brainstorming | – | verification | |

Skills fire at transitions, when the workflow changes (coding to debugging, debugging to testing). On continuation turns the model doesn't re-invoke. It keeps the momentum from the previous invocation. Which makes sense. You don't re-read the TDD manual every time you write a new test.

TDD compliance

Skill invocations are one metric. Did the agent actually follow the workflow? I checked whether test files were written before implementation files on the key TDD turns.

| Scenario | Superpowers | Plain skills | CLAUDE.md | CLAUDE.md + hint |
| --- | --- | --- | --- | --- |
| 01 email t1 | test first | impl first | test first | test first |
| 02 LRU t1 | test first | test first | test first | test first |
| 03 rate limiter t3 | test first | impl first | test first MT | impl first |
| 05 HTTP retry | test first (t2) | test only (t3) | test first (t1) | test first (t1) |

Here's the thing: Superpowers and CLAUDE.md are basically tied. Both wrote tests first on 4 out of 4 measured scenarios. CLAUDE.md + hint got 3/4. Plain skills got 1/4.

Having the guidelines in CLAUDE.md didn't make the model follow TDD any better than an invoked skill did. When Superpowers fires, the workflow quality is just as good. They're all prompt. Same markdown, same instructions, same model. The only difference is whether the prompt reaches the model. CLAUDE.md reaches it every time. Superpowers reaches it 66% of the time.

The interesting case is scenario 04 (refactor middleware, turn 2). No config wrote tests before refactoring. They all jumped straight to splitting the middleware into files. The "write tests before restructuring" guideline needs to be stronger, regardless of delivery mechanism.

Why this happens

Both CLAUDE.md and skill listings arrive through the same channel: <system-reminder> wrapped user messages. No architectural trust difference. The difference is just presence.

CLAUDE.md content is always in the context window. Every turn, every decision, the guidelines are right there. Skill content requires the model to read the name+description listing, decide the skill is relevant, call the Skill tool, wait for the content, then follow it. Each step can fail.

So the activation gap isn't a quality problem. It's a reliability problem. When skills get invoked, they work. They just don't always get invoked. Superpowers gets to 66% in clean multi-turn. The CLAUDE.md hint gets 33%. Neither reaches 100%.

CLAUDE.md gets 100% presence. No invocation needed.

Skills are recipes, CLAUDE.md is the health code

Think of it like a kitchen.

CLAUDE.md is the health code. Wash your hands, sanitize surfaces, check temperatures. Every cook follows these rules on every shift. They're non-negotiable and always visible, posted on the wall. You don't wait for someone to ask "should I wash my hands before touching food?" It's the baseline.

Skills are recipes. You pull the recipe for bouillabaisse when someone orders bouillabaisse. You don't tape every recipe to the wall next to the health code. That's noise. Recipes have their moment. The health code is constant.

Superpowers tries to turn recipes into the health code by having a hook shout "READ THE RECIPES" at the start of every shift. It works most of the time. But you could just put the important rules on the wall.

CLAUDE.md is for guidelines. Conventions, coding standards, workflow rules, TDD processes, debugging protocols. Anything the agent must follow every session. "Write tests before implementation" is a health code rule. It goes in CLAUDE.md.

Skills are for recipes. Specific, on-demand procedures you invoke when the moment calls for it. "Generate a database migration," "scaffold a component," "run the release checklist." These don't need to be in context all the time. They need to be there when you ask for them. Use context: fork for heavy recipes that would bloat the main context.
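
A recipe-style skill, for contrast, might look like this (name, description, and steps are hypothetical; the frontmatter fields are the ones described in the discovery section):

```markdown
---
name: generate-migration
description: Use when the user asks to create a database migration
context: fork
---

1. Read the current schema files to understand the starting state.
2. Generate the migration with the project's migration tool.
3. Run it against a scratch database and report the outcome.
```

The context: fork line is what keeps a heavy recipe cheap: the sub-agent burns its own token budget and only the result returns to the parent conversation.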

Hooks are for automation, not instruction delivery. Pre-commit validation, linting, notifications. If you're using a hook to inject guidelines (like Superpowers does), it works at 66% in clean multi-turn, but CLAUDE.md would do the same job at 100% with zero activation gap.

| Mechanism | Presence | Invocation needed | Clean multi-turn rate |
| --- | --- | --- | --- |
| CLAUDE.md (health code) | 100% | No | n/a, always there |
| Superpowers (hook + recipes) | Hook: 100%, content: 66% | Yes | 66% |
| CLAUDE.md + hint + skills | 100% (hint), 33% (content) | Yes | 33% |
| Plain skills (recipes on shelf) | Listing only | Yes | 6% |

General guidelines don't belong in skills. Skills are not how you say "always do X." They're how you say "when you need to do Y, here's how."

Full turn-by-turn results

Every turn, every config. MT marks turns that hit the 15 max-turns limit.

Superpowers (hook + skills) – 10/15 invocations (66%)

| Scenario | Turn | Skill invoked | First file written |
| --- | --- | --- | --- |
| 01 email | t1 (tdd) | test-driven-development | validateEmail.test.ts (test first) |
| 01 email | t2 (tdd) | – | – (used Edit) |
| 01 email | t3 (debug) | systematic-debugging | – |
| 01 email | t4 (refactor) | – | validateEmail.ts |
| 02 LRU | t1 (debug) | systematic-debugging | lru-cache.test.ts (test first) |
| 02 LRU | t2 (debug) | – | lru-cache.ts |
| 02 LRU | t3 (tdd) | test-driven-development | – |
| 03 rate limiter | t1 (plan) | brainstorming | – (planning, no code) |
| 03 rate limiter | t2 (plan) | – | – |
| 03 rate limiter | t3 (tdd) | test-driven-development | rate-limiter.test.ts (test first) MT |
| 04 middleware | t1 (analysis) | – | – (reading code) |
| 04 middleware | t2 (refactor) | – | logging.ts (impl first) |
| 04 middleware | t3 (debug) | systematic-debugging | middleware.test.ts |
| 04 middleware | t4 (tdd) | test-driven-development | logging.test.ts (test first) MT |
| 05 HTTP retry | t1 (tdd) | brainstorming | – |
| 05 HTTP retry | t2 (debug) | – | http-client.test.ts (test first) |
| 05 HTTP retry | t3 (tdd) | verification-before-completion | – |

Plain skills (no hook, no hint) – 1/15 invocations (6%)

| Scenario | Turn | Skill invoked | First file written |
| --- | --- | --- | --- |
| 01 email | t1 (tdd) | – | validateEmail.ts (impl first) |
| 01 email | t2 (tdd) | – | – |
| 01 email | t3 (debug) | – | – |
| 01 email | t4 (refactor) | – | validateEmail.ts |
| 02 LRU | t1 (debug) | – | lru-cache.test.ts (test first) |
| 02 LRU | t2 (debug) | – | lru-cache.ts |
| 02 LRU | t3 (tdd) | – | – |
| 03 rate limiter | t1 (plan) | – | scalable-bubbling-lagoon.md |
| 03 rate limiter | t2 (plan) | – | types.ts (impl first) MT |
| 03 rate limiter | t3 (tdd) | – | – |
| 04 middleware | t1 (analysis) | – | – |
| 04 middleware | t2 (refactor) | – | logging.ts (impl first) |
| 04 middleware | t3 (debug) | systematic-debugging | middleware.test.ts |
| 04 middleware | t4 (tdd) | – | logging.test.ts |
| 05 HTTP retry | t1 (tdd) | – | – |
| 05 HTTP retry | t2 (debug) | – | – |
| 05 HTTP retry | t3 (tdd) | – | http-client.test.ts |

CLAUDE.md (guidelines, no skills) – 0/14 invocations (n/a)

| Scenario | Turn | First file written |
| --- | --- | --- |
| 01 email | t1 (tdd) | validateEmail.test.ts (test first) |
| 01 email | t2 (tdd) | – |
| 01 email | t3 (debug) | – |
| 01 email | t4 (refactor) | – |
| 02 LRU | t1 (debug) | lru-cache.test.ts (test first) |
| 02 LRU | t2 (debug) | bench.ts |
| 02 LRU | t3 (tdd) | – |
| 03 rate limiter | t1 (plan) | – |
| 03 rate limiter | t2 (plan) | – |
| 03 rate limiter | t3 (tdd) | rate-limiter.test.ts (test first) MT |
| 04 middleware | t1 (analysis) | – |
| 04 middleware | t2 (refactor) | tests first (4 test files before impl) MT |
| 04 middleware | t3 (debug) | – |
| 04 middleware | t4 (tdd) | – |
| 05 HTTP retry | t1 (tdd) | http-client.test.ts (test first) |
| 05 HTTP retry | t2 (debug) | – |
| 05 HTTP retry | t3 (tdd) | – |

CLAUDE.md + hint (skills installed) – 5/15 invocations (33%)

| Scenario | Turn | Skill invoked | First file written |
| --- | --- | --- | --- |
| 01 email | t1 (tdd) | test-driven-development | validateEmail.test.ts (test first) |
| 01 email | t2 (tdd) | – | – |
| 01 email | t3 (debug) | – | – |
| 01 email | t4 (refactor) | – | validateEmail.ts |
| 02 LRU | t1 (debug) | systematic-debugging | lru-cache.test.ts (test first) |
| 02 LRU | t2 (debug) | – | lru-cache.ts |
| 02 LRU | t3 (tdd) | – | – |
| 03 rate limiter | t1 (plan) | – | snazzy-juggling-glacier.md |
| 03 rate limiter | t2 (plan) | – | rate-limiter.ts (impl first) |
| 03 rate limiter | t3 (tdd) | – | types.ts (impl first) MT |
| 04 middleware | t1 (analysis) | – | – |
| 04 middleware | t2 (refactor) | – | – |
| 04 middleware | t3 (debug) | systematic-debugging | – |
| 04 middleware | t4 (tdd) | test-driven-development | logging.test.ts (test first) MT |
| 05 HTTP retry | t1 (tdd) | test-driven-development | http-client.test.ts (test first) MT |
| 05 HTTP retry | t2 (debug) | – | – |
| 05 HTTP retry | t3 (tdd) | – | – |

Methodology

Execution

Each scenario runs as a multi-turn claude -p session:

# Turn 1: fresh session
claude -p --model claude-opus-4-6 --output-format stream-json \
  --verbose --dangerously-skip-permissions --max-turns 15 \
  "$PROMPT" > turn-1.jsonl

# Extract session ID
sid=$(grep -o '"session_id":"[^"]*"' turn-1.jsonl | head -1 | cut -d'"' -f4)

# Turn 2+: resume same session
claude -p --model claude-opus-4-6 --output-format stream-json \
  --verbose --dangerously-skip-permissions --max-turns 15 \
  --resume "$sid" "$PROMPT" > turn-2.jsonl

Environment isolation

All runs disabled user-level configuration to prevent contamination:

# Disabled at start, restored on exit (trap)
~/.claude/plugins      -> ~/.claude/plugins.eval-disabled
~/.claude/skills       -> ~/.claude/skills.eval-disabled
~/.claude/settings.json -> ~/.claude/settings.json.eval-disabled
~/.claude/CLAUDE.md    -> ~/.claude/CLAUDE.md.eval-disabled

Only the Superpowers config re-enabled ~/.claude/plugins (the hook + skills come from the plugin). OAuth auth stays in the macOS keychain, unaffected by the rename.

Config setup per workspace

Each scenario gets a fresh /tmp workspace with package.json, tsconfig.json, seed files (if any), and npm install. Then:

  • Superpowers: plugin provides the SessionStart hook + skills via .claude/settings.json
  • Plain skills: skills copied into workspace .claude/skills/, no hook
  • CLAUDE.md: CLAUDE.md with equivalent TDD/debugging/planning guidelines
  • CLAUDE.md + hint: CLAUDE.md with "Before writing code, first explore the project structure, then invoke the relevant skill for the task at hand." + skills copied into workspace .claude/skills/

Max turns

Each turn was capped at 15 agentic steps (--max-turns 15). 8 of 68 turns hit this limit; the affected turns are marked MT in the results tables. In most MT turns the model was actively writing tests and code and just needed more steps to finish. A trailing || true prevents a truncated turn's nonzero exit status from killing the runner script.

Measurement

Skill invocations are extracted from stream-json transcripts by searching for "name":"Skill" in assistant messages. TDD compliance is measured by the order of Write tool calls – whether test files (.test.ts, .spec.ts) appear before implementation files.
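
The extraction can be sketched as two small helpers (names are mine, not the eval repo's; the input is the stream-json transcript shown under Execution):

```shell
# Invocations: assistant tool calls named "Skill" in the transcript.
count_skill_invocations() {
  grep -c '"name":"Skill"' "$1" || true
}

# TDD compliance: does the first Write tool call target a test file?
first_file_kind() {
  local first
  first=$(grep -o '"file_path":"[^"]*"' "$1" | head -n 1)
  case "$first" in
    *.test.ts\"*|*.spec.ts\"*) echo "test first" ;;
    *)                         echo "impl first" ;;
  esac
}
```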

Reproducibility

A Dockerfile is included for fully isolated runs (requires ANTHROPIC_API_KEY):

docker build -t multiturn-eval .
docker run -d -e ANTHROPIC_API_KEY=$KEY \
  -v "$(pwd)/results:/home/evaluser/eval/results" \
  multiturn-eval

I validated the eval across three environments: local with plugin rename, Docker with zero user config, and local with full config disabled. Superpowers invocation patterns were identical across all three.

The repos are open if you want to reproduce or poke at the data.

What to do with this

If you're setting up Claude Code for a project:

  • TDD, debugging protocols, code style, naming conventions go in CLAUDE.md. These are rules you want followed every session. No invocation, no activation gap, 100% presence.
  • "Scaffold a service," "generate a migration," "run the release checklist" go in skills. These are procedures you call when you need them. Use context: fork if they're heavy.
  • If you need CLAUDE.md to reference extra documentation, make it an index. Point to files. Same pattern Claude Code uses for its own memory: a root file that links to specifics.
  • If you're using Superpowers and it works for you, keep using it. Now you know why it works (the hook) and where it drops off (34% of turns in multi-turn, more in single-shot). Moving your guidelines to CLAUDE.md would close that gap.
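
An index-style CLAUDE.md, per the third point above, might look like this (file names are illustrative; @include is the directive described in the discovery section):

```markdown
# Project guidelines (always in context)

@include .claude/rules/tdd.md
@include .claude/rules/debugging.md

For architecture details, read docs/architecture.md before large changes.
```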

Skills are not broken. They're just not for guidelines.


Edy Silva

I own a computer
