AUDIT-SKILL-BENCH

Did the skill outperform a one-line prompt?
Empirical benchmark of audit-design-token-drift against three other strategies. Pre-registered hypothesis, frozen sha256-verified inputs, scored against a human-graded reference. Modeled on Max Taylor’s cc-compression-bench.
The four strategies + their headline recall

Each strategy ran the same task three times against the same frozen input snapshot. Recall is scored against the reference audit’s seven findings.

A — Full SKILL.md
The full skill loaded as system prompt
5.33/7
76% · ±0.47
B — One-liner
26-word task description, no skill
6.00/7
86% · ±0.00
C — Vague baseline
“Audit for problems”
5.67/7
81% · ±1.06
D — Lean SKILL.md
A minus its rhetorical sections
6.00/7
86% · ±0.00

The surprise: removing two rhetorical sections from the skill (D) lifted its recall above the full skill (A) and caught a value-drift finding A missed every single time (0/3 → 3/3). The lean skill beat the full skill.

The bench drove a real change in the skill: see the new Step 3.5 (final-value verification) on the audit-design-token-drift skill page. This bench is cited from the case study at petebartsch.com/case-studies/thios.

Per-strategy results — mean across 3 trials each
| Metric | A: Full skill | B: One-liner | C: Baseline | D: Lean |
| --- | --- | --- | --- | --- |
| Recall mean (n / 7) | 5.33 ± 0.47 | 6.00 ± 0.00 | 5.67 ± 1.06 | 6.00 ± 0.00 |
| Recall % | 76% | 86% | 81% | 86% |
| Precision (strict) | 1.00 | 0.89 | 0.30 | 0.82 |
| Precision (generous) | 1.00 | 0.95 | 0.65 | 0.90 |
| Actionability | 1.00 | 0.92 | 0.68 | 1.00 |
| Loop-step routing | 3 / 3 ✓ | 0 / 3 | 0 / 3 | 3 / 3 ✓ |
| Output tokens (est. mean) | 2,527 | ~2,278 | 3,535 | 2,127 |
| Findings reported (mean) | 11 | 11.7 | 38 | 12.7 |

Strict precision treats out-of-scope component-design recommendations as false positives. Generous precision counts any defensible real concern as true positive. B token mean excludes B1 (anomalously short fallback after a sandbox write failure; content captured verbatim, but token count not comparable).
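The difference between the two precision figures can be sketched in a few lines. This is a minimal illustration, assuming each reported finding has been hand-labeled with one of three hypothetical categories (`in_scope`, `out_of_scope`, `spurious`); the label names and the sample run are made up for the example and are not taken from the bench's scoring JSON.

```python
from collections import Counter

# Hypothetical labels a grader might assign to each reported finding.
IN_SCOPE = "in_scope"          # a real issue inside the audit's scope
OUT_OF_SCOPE = "out_of_scope"  # a defensible concern, but outside the audit's scope
SPURIOUS = "spurious"          # not a real issue


def precision(labels, generous=False):
    """Share of reported findings counted as true positives.

    Strict: only in-scope findings count.
    Generous: defensible out-of-scope findings count as well.
    """
    if not labels:
        return 0.0
    counts = Counter(labels)
    true_positives = counts[IN_SCOPE]
    if generous:
        true_positives += counts[OUT_OF_SCOPE]
    return true_positives / len(labels)


# Made-up run with 10 reported findings, for illustration only.
sample_labels = [IN_SCOPE] * 6 + [OUT_OF_SCOPE] * 3 + [SPURIOUS]

print(f"strict:   {precision(sample_labels):.2f}")
print(f"generous: {precision(sample_labels, generous=True):.2f}")
```

Under this reading, a strategy that reports many out-of-scope recommendations (as C appears to, with a mean of 38 findings) shows a wide gap between its strict and generous scores.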

Per-finding recall — did each strategy catch each reference finding?
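| Reference finding | A (A1–A3) | B (B1–B3) | C (C1–C3) | D (D1–D3) |
| --- | --- | --- | --- | --- |
| CRT-1 Brand gold drift inside DESIGN.md | 3 / 3 | 3 / 3 | 3 / 3 | 3 / 3 |
| CRT-2 Premium gold #D4A000 not in tokens.json | 3 / 3 | 3 / 3 | 3 / 3 | 3 / 3 |
| HIGH-1 Auxosphere #909090 ≠ resolved #6c757d | 0 / 3 | 0 / 3 | 2 / 3 | 3 / 3 |
| MED-1 --color-secondary-text missing from HTML | 3 / 3 | 3 / 3 | 3 / 3 | 3 / 3 |
| MED-2 --breakpoint-* missing from HTML | 3 / 3 | 3 / 3 | 1 / 3 | 3 / 3 |
| MED-3 --z-popover/tooltip/skip-link missing | 3 / 3 | 3 / 3 | 2.5 / 3 | 3 / 3 |
| LOW-1 --color-accent-dark #1e7e34 aliasing footgun | 1 / 3 | 3 / 3 | 2.5 / 3 | 0 / 3 |
| Run total per trial (n / 7) | 6, 5, 5 | 6, 6, 6 | 7, 4.5, 5.5 | 6, 6, 6 |

Scoring per trial: Caught (1.0), Partial (0.5), Missed (0). Each cell is the strategy's summed score for that finding across its three trials.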
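The headline recall figures follow directly from the per-trial run totals. A short sketch, reading the run-total row as A: 6, 5, 5; B: 6, 6, 6; C: 7, 4.5, 5.5; D: 6, 6, 6 (the function and variable names are illustrative, not the bench's own scoring code):

```python
from statistics import mean

REFERENCE_FINDINGS = 7  # size of the human-graded reference audit

# Per-trial run totals, as read from the run-total row.
run_totals = {
    "A": [6, 5, 5],
    "B": [6, 6, 6],
    "C": [7, 4.5, 5.5],
    "D": [6, 6, 6],
}


def headline_recall(totals):
    """Return (mean findings caught, recall as a whole-number percent)."""
    avg = mean(totals)
    return round(avg, 2), round(100 * avg / REFERENCE_FINDINGS)


for name, totals in run_totals.items():
    avg, pct = headline_recall(totals)
    print(f"{name}: {avg:.2f}/7 ({pct}%)")
```

The means and percentages this produces match the strategy cards: 5.33 (76%), 6.00 (86%), 5.67 (81%), and 6.00 (86%).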
Pre-registered decision rules — applied mechanically

Thresholds were locked into HYPOTHESIS.md before any run was scored. Each verdict below is a mechanical application of those thresholds to the results table.

| Rule | Verdict | Punchline |
| --- | --- | --- |
| 1. Skill earns its weight vs. one-liner? | FIRED | A doesn’t beat B by ≥30% on recall or actionability. Simplify SKILL.md. |
| 2. Actionability mis-invested? | passes | A still wins actionability (1.00 vs 0.92). Keep the template; trim the prose. |
| 3. Rationalizations decorative? | FIRED (STRONGLY) | D beats A on recall (+12.5%) and tokens (−16%); HIGH-1 went 0/3 → 3/3 just by removing them. Delete them. |
| 4. Bench too easy? | FIRED | C catches ≥4. Headline separates A vs B less cleanly than a harder fixture would. |
| 5. Pareto trade? | N/A | A doesn’t win on every metric, so this rule doesn’t fire. |
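The percentages quoted in rules 1 and 3 can be reproduced from the results table. The sketch below assumes the "≥30%" threshold is a relative margin of A over B; the exact rule wording lives in HYPOTHESIS.md and is not reproduced here, so `rule_1_fires` is a hypothetical reading, not the bench's own check.

```python
def relative_change(new, old):
    """Fractional change from old to new, e.g. 0.125 for +12.5%."""
    return (new - old) / old


# Values from the results table. A's recall is kept unrounded (16 / 3 = 5.33...).
metrics = {
    "A": {"recall": 16 / 3, "actionability": 1.00, "tokens": 2527},
    "B": {"recall": 6.00, "actionability": 0.92, "tokens": 2278},
    "D": {"recall": 6.00, "actionability": 1.00, "tokens": 2127},
}


def rule_1_fires(full, one_liner, margin=0.30):
    """Assumed reading of rule 1: fire unless the full skill beats the
    one-liner by at least `margin` (relative) on recall or actionability."""
    beats = any(
        relative_change(full[key], one_liner[key]) >= margin
        for key in ("recall", "actionability")
    )
    return not beats


recall_lift = relative_change(metrics["D"]["recall"], metrics["A"]["recall"])
token_delta = relative_change(metrics["D"]["tokens"], metrics["A"]["tokens"])

print(f"Rule 1 fires: {rule_1_fires(metrics['A'], metrics['B'])}")
print(f"D vs A recall: {recall_lift:+.1%}")
print(f"D vs A tokens: {token_delta:+.0%}")
```

With these inputs, A's best relative edge over B is about 9% on actionability, well short of the 30% margin, and the D vs A deltas come out at +12.5% recall and −16% tokens.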

Reproducibility & methodology. 4 strategies, 3 trials each (12 total), one frozen sha256-verified input snapshot. Pre-registered hypothesis with decision rules. Full methodology, scoring JSON, per-trial outputs, the HYPOTHESIS.md answer key, and a fork-template for future skills (audit-cad-hygiene, audit-component-consistency) all live at _agents/bench/audit-skill-bench/ in the Thios repo.