audit-design-token-drift against three other strategies. Pre-registered hypothesis, frozen sha256-verified inputs, scored against a human-graded reference. Modeled on Max Taylor’s cc-compression-bench.
Each strategy ran the same task three times against the same frozen input snapshot. Recall is scored against the reference audit’s seven findings.
The surprise: removing two rhetorical sections from the skill (D) lifted mean recall above the full skill (A), from 5.33 to 6.00 of 7, and caught the value-drift finding HIGH-1, which A missed in all three runs (0/3 → 3/3). The lean skill beat the full skill.
The bench drove a real change in the skill — see the new Step 3.5 (final-value verification) on the audit-design-token-drift skill page. Cited from the case study at petebartsch.com/case-studies/thios.
| Metric | A — Full skill | B — One-liner | C — Baseline | D — Lean |
|---|---|---|---|---|
| Recall mean (n / 7) | 5.33 ± 0.47 | 6.00 ± 0.00 | 5.67 ± 1.06 | 6.00 ± 0.00 |
| Recall % | 76% | 86% | 81% | 86% |
| Precision (strict) | 1.00 | 0.89 | 0.30 | 0.82 |
| Precision (generous) | 1.00 | 0.95 | 0.65 | 0.90 |
| Actionability | 1.00 | 0.92 | 0.68 | 1.00 |
| Loop-step routing | 3 / 3 ✓ | 0 / 3 | 0 / 3 | 3 / 3 ✓ |
| Output tokens (est. mean) | 2,527 | ~2,278 | 3,535 | 2,127 |
| Findings reported (mean) | 11 | 11.7 | 38 | 12.7 |
Strict precision treats out-of-scope component-design recommendations as false positives. Generous precision counts any defensible real concern as true positive. B token mean excludes B1 (anomalously short fallback after a sandbox write failure; content captured verbatim, but token count not comparable).
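
The difference between the two precision columns can be sketched as follows. This is only an illustration of the definitions above: the label names and the example counts are hypothetical and are not taken from the bench's scoring JSON.

```python
from collections import Counter

# Hypothetical grader labels for each reported finding (not the bench's real schema).
IN_SCOPE = "in_scope"          # true positive under both views
OUT_OF_SCOPE = "out_of_scope"  # defensible real concern, e.g. a component-design recommendation
SPURIOUS = "spurious"          # not defensible: false positive under both views


def precision(labels, generous=False):
    """Share of reported findings that count as true positives.

    Strict: only in-scope findings count.
    Generous: any defensible real concern counts, including out-of-scope ones.
    """
    if not labels:
        return 0.0
    counts = Counter(labels)
    true_positives = counts[IN_SCOPE]
    if generous:
        true_positives += counts[OUT_OF_SCOPE]
    return true_positives / len(labels)


# Illustrative run: 6 in-scope findings, 3 out-of-scope recommendations, 1 spurious.
example_run = [IN_SCOPE] * 6 + [OUT_OF_SCOPE] * 3 + [SPURIOUS]
strict = precision(example_run)
generous = precision(example_run, generous=True)
print(f"strict={strict:.2f} generous={generous:.2f}")
```

Under this reading, a run that reports many out-of-scope recommendations shows a wide gap between the two columns.
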
| Reference finding | A1 | A2 | A3 | B1 | B2 | B3 | C1 | C2 | C3 | D1 | D2 | D3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRT-1 Brand gold drift inside DESIGN.md | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| CRT-2 Premium gold #D4A000 not in tokens.json | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| HIGH-1 Auxosphere #909090 ≠ resolved #6c757d | × | × | × | × | × | × | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| MED-1 --color-secondary-text missing from HTML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| MED-2 --breakpoint-* missing from HTML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | × | × | ✓ | ✓ | ✓ |
| MED-3 --z-popover/tooltip/skip-link missing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ½ | ✓ | ✓ | ✓ | ✓ |
| LOW-1 --color-accent-dark #1e7e34 aliasing footgun | ✓ | × | × | ✓ | ✓ | ✓ | ✓ | ✓ | ½ | × | × | × |
| Run total (n / 7) | 6 | 5 | 5 | 6 | 6 | 6 | 7 | 4.5 | 5.5 | 6 | 6 | 6 |
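
The recall rows in the summary table follow from these run totals. A minimal sketch of that arithmetic, assuming a ½ mark counts as 0.5 and that the percentage is the mean divided by the seven reference findings (the variable and function names are illustrative):

```python
from statistics import mean

TOTAL_REFERENCE_FINDINGS = 7

# Run totals copied from the "Run total (n / 7)" row; a ½ mark counts as 0.5.
run_totals = {
    "A": [6, 5, 5],
    "B": [6, 6, 6],
    "C": [7, 4.5, 5.5],
    "D": [6, 6, 6],
}


def recall_summary(totals):
    """Return (mean findings caught, recall as a whole-number percent)."""
    avg = mean(totals)
    return round(avg, 2), round(100 * avg / TOTAL_REFERENCE_FINDINGS)


summary = {name: recall_summary(totals) for name, totals in run_totals.items()}
for name, (avg, pct) in summary.items():
    print(f"{name}: {avg:.2f} / 7 ({pct}%)")
```
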
Thresholds were locked into HYPOTHESIS.md before any run was scored. The verdicts below are mechanical applications of those thresholds.
| Rule | Verdict | Punchline |
|---|---|---|
| 1. Skill earns its weight vs. one-liner? | FIRED | A doesn’t beat B by ≥30% on recall or actionability. Simplify SKILL.md. |
| 2. Actionability mis-invested? | passes | A still wins actionability (1.00 vs 0.92). Keep the template; trim the prose. |
| 3. Rationalizations decorative? | FIRED — STRONGLY | D beats A on recall (+12.5%) and tokens (−16%); HIGH-1 went 0/3 → 3/3 just by removing them. Delete them. |
| 4. Bench too easy? | FIRED | C catches ≥4 of the 7 reference findings (run totals 7, 4.5, 5.5). Headline separates A vs B less cleanly than a harder fixture would. |
| 5. Pareto trade? | N/A | A doesn’t win on every metric, so this rule doesn’t fire. |
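
The percentages quoted in rules 1 and 3 are relative changes between strategy means. The sketch below reproduces them from the published numbers. It assumes rule 1 fires when A's lift over B stays under 30% on both recall and actionability; the exact wording lives in HYPOTHESIS.md and is not reproduced here, so treat that reading as an assumption.

```python
def relative_change(candidate, baseline):
    """Fractional change of candidate relative to baseline (0.125 means +12.5%)."""
    return (candidate - baseline) / baseline


# Means from the summary table; recall uses exact run-total means (e.g. 16/3 for A).
metrics = {
    "A": {"recall": (6 + 5 + 5) / 3, "actionability": 1.00, "tokens": 2527},
    "B": {"recall": 6.0, "actionability": 0.92, "tokens": 2278},
    "D": {"recall": 6.0, "actionability": 1.00, "tokens": 2127},
}

LIFT_THRESHOLD = 0.30  # the "≥30%" quoted in rule 1

# Rule 1 (assumed reading): fires if A fails to clear the threshold on both metrics.
a_vs_b = {
    name: relative_change(metrics["A"][name], metrics["B"][name])
    for name in ("recall", "actionability")
}
rule_1_fired = all(lift < LIFT_THRESHOLD for lift in a_vs_b.values())

# Rule 3: lean skill (D) compared with the full skill (A).
d_vs_a_recall = relative_change(metrics["D"]["recall"], metrics["A"]["recall"])
d_vs_a_tokens = relative_change(metrics["D"]["tokens"], metrics["A"]["tokens"])

print(f"rule 1 fired: {rule_1_fired}")
print(f"D vs A recall: {d_vs_a_recall:+.1%}, tokens: {d_vs_a_tokens:+.1%}")
```
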
Reproducibility & methodology. 4 strategies, 3 trials each (12 total), one frozen sha256-verified input snapshot. Pre-registered hypothesis with decision rules. Full methodology, scoring JSON, per-trial outputs, the HYPOTHESIS.md answer key, and a fork-template for future skills (audit-cad-hygiene, audit-component-consistency) all live at _agents/bench/audit-skill-bench/ in the Thios repo.
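
For readers forking the template, a frozen-input check could look like the sketch below. It is not the bench's actual verifier: the manifest shape (relative path mapped to expected hex digest) and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    `manifest` is a hypothetical mapping of relative path -> expected hex digest.
    An empty result means the snapshot matches the frozen inputs.
    """
    mismatched = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            mismatched.append(rel_path)
    return mismatched
```

Running such a check before every trial, and refusing to score a run when the returned list is non-empty, is one way to keep all twelve trials on identical inputs.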