AUDIT-SKILL-BENCH

Did the skill outperform a one-line prompt?

Empirical benchmark of audit-design-token-drift against three other strategies. Pre-registered hypothesis, frozen sha256-verified inputs, scored against a human-graded reference. Modeled on Max Taylor’s cc-compression-bench.

The four strategies and their headline recall

Each strategy ran the same task three times against the same frozen input snapshot. Recall is scored against the reference audit’s seven findings.

A — Full SKILL.md

The full skill loaded as system prompt

5.33/7

76% · ±0.47

B — One-liner

26-word task description, no skill

6.00/7

86% · ±0.00

C — Vague baseline

“Audit for problems”

5.67/7

81% · ±1.06

D — Lean SKILL.md

A minus its rhetorical sections

6.00/7

86% · ±0.00

The surprise: removing two rhetorical sections from the skill (D) lifted mean recall from 5.33 to 6.00 out of 7, above the full skill (A) and level with the one-liner (B). D also caught the value-drift finding HIGH-1, which A missed in all three runs (0/3 → 3/3).

The bench drove a real change in the skill: see the new Step 3.5 (final-value verification) on the audit-design-token-drift skill page. This bench is cited in the case study at petebartsch.com/case-studies/thios.

Per-strategy results — mean across 3 trials each

Metric	A — Full skill	B — One-liner	C — Baseline	D — Lean
Recall mean (n / 7)	5.33 ± 0.47	6.00 ± 0.00	5.67 ± 1.06	6.00 ± 0.00
Recall %	76%	86%	81%	86%
Precision (strict)	1.00	0.89	0.30	0.82
Precision (generous)	1.00	0.95	0.65	0.90
Actionability	1.00	0.92	0.68	1.00
Loop-step routing	3 / 3 ✓	0 / 3	0 / 3	3 / 3 ✓
Output tokens (est. mean)	2,527	~2,278	3,535	2,127
Findings reported (mean)	11	11.7	38	12.7

Strict precision treats out-of-scope component-design recommendations as false positives. Generous precision counts any defensible real concern as a true positive. The B token mean excludes B1, an anomalously short fallback produced after a sandbox write failure; its content was captured verbatim, but its token count is not comparable with the other runs.
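The two precision definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the bench's actual scoring code: the label names (`match`, `defensible`, `spurious`) and the example run are hypothetical and are not taken from the scoring JSON.

```python
from collections import Counter

# Hypothetical labels a grader might assign to each reported finding.
MATCH = "match"            # a real, in-scope problem
DEFENSIBLE = "defensible"  # a real concern, but out of scope (e.g. component-design advice)
SPURIOUS = "spurious"      # not a real problem


def precision(labels, generous=False):
    """Fraction of reported findings that count as true positives."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    true_positives = counts[MATCH]
    if generous:
        # Generous scoring also credits defensible out-of-scope concerns.
        true_positives += counts[DEFENSIBLE]
    return true_positives / len(labels)


# Illustrative run (made-up numbers): 10 findings reported.
run = [MATCH] * 7 + [DEFENSIBLE] * 2 + [SPURIOUS]
strict = precision(run)
generous = precision(run, generous=True)
print(f"strict={strict:.2f} generous={generous:.2f}")
```

Under this reading, the gap between the strict and generous rows measures how much of a strategy's output is defensible but out of scope, which is why it is widest for the vague baseline (C).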

Per-finding recall — did each strategy catch each reference finding?

Reference finding	A1	A2	A3	B1	B2	B3	C1	C2	C3	D1	D2	D3
CRT-1 Brand gold drift inside DESIGN.md	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
CRT-2 Premium gold `#D4A000` not in tokens.json	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
HIGH-1 Auxosphere `#909090` ≠ resolved `#6c757d`	×	×	×	×	×	×	✓	×	✓	✓	✓	✓
MED-1 `--color-secondary-text` missing from HTML	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
MED-2 `--breakpoint-*` missing from HTML	✓	✓	✓	✓	✓	✓	✓	×	×	✓	✓	✓
MED-3 `--z-popover/tooltip/skip-link` missing	✓	✓	✓	✓	✓	✓	✓	½	✓	✓	✓	✓
LOW-1 `--color-accent-dark #1e7e34` aliasing footgun	✓	×	×	✓	✓	✓	✓	✓	½	×	×	×
Run total (n / 7)	6	5	5	6	6	6	7	4.5	5.5	6	6	6

Legend: ✓ Caught (1.0) · ½ Partial (0.5) · × Missed (0)
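As a worked check of how the per-finding marks roll up into the headline numbers, the sketch below rescores strategy C from the table above. The mark strings are copied column by column from the C1, C2, and C3 columns; the variable names are illustrative and are not taken from the bench's scoring files.

```python
from statistics import mean

# Score per mark, following the legend: caught 1.0, partial 0.5, missed 0.
SCORE = {"✓": 1.0, "½": 0.5, "×": 0.0}
REFERENCE_FINDINGS = 7

# Row order: CRT-1, CRT-2, HIGH-1, MED-1, MED-2, MED-3, LOW-1.
c_runs = {
    "C1": "✓✓✓✓✓✓✓",
    "C2": "✓✓×✓×½✓",
    "C3": "✓✓✓✓×✓½",
}

# Per-run total: sum of the scores for the seven reference findings.
totals = {run: sum(SCORE[mark] for mark in marks) for run, marks in c_runs.items()}

# Headline recall: mean of the run totals, also expressed as a percentage of 7.
recall_mean = mean(totals.values())
recall_pct = round(100 * recall_mean / REFERENCE_FINDINGS)

print(totals, round(recall_mean, 2), f"{recall_pct}%")
```

The totals come out to 7, 4.5, and 5.5, matching the "Run total" row, and the mean and percentage match the 5.67 and 81% reported for C.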

Pre-registered decision rules — applied mechanically

Thresholds were locked into HYPOTHESIS.md before any run was scored. Each verdict below is a mechanical application of those thresholds.

Rule	Verdict	Punchline
1. Skill earns its weight vs. one-liner?	FIRED	A doesn’t beat B by ≥30% on recall or actionability. Simplify SKILL.md.
2. Actionability mis-invested?	passes	A still wins actionability (1.00 vs 0.92). Keep the template; trim the prose.
3. Rationalizations decorative?	FIRED — STRONGLY	D beats A on recall (+12.5%) and tokens (−16%); HIGH-1 went 0/3 → 3/3 just by removing them. Delete them.
4. Bench too easy?	FIRED	C catches ≥4. Headline separates A vs B less cleanly than a harder fixture would.
5. Pareto trade?	N/A	A doesn’t win on every metric, so this rule doesn’t fire.
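The arithmetic behind rules 1 and 3 can be reproduced from the results table. The sketch below assumes the "≥30%" threshold in rule 1 is a relative margin over B; the exact wording lives in HYPOTHESIS.md and is not reproduced here, so treat the rule encoding as an assumption. A's recall is entered as 16/3 (runs of 6, 5, and 5) rather than the rounded 5.33.

```python
def relative_gain(candidate, baseline):
    """Relative change of candidate over baseline, e.g. 0.125 for +12.5%."""
    return (candidate - baseline) / baseline


# Means from the per-strategy results table.
results = {
    "A": {"recall": 16 / 3, "actionability": 1.00, "tokens": 2527},
    "B": {"recall": 6.0, "actionability": 0.92, "tokens": 2278},
    "D": {"recall": 6.0, "actionability": 1.00, "tokens": 2127},
}

# Rule 1 (assumed encoding): fires unless A beats B by at least 30%
# on recall or on actionability.
THRESHOLD = 0.30
a_vs_b = {
    metric: relative_gain(results["A"][metric], results["B"][metric])
    for metric in ("recall", "actionability")
}
rule1_fired = not any(gain >= THRESHOLD for gain in a_vs_b.values())

# Rule 3: the deltas quoted in the punchline (D relative to A).
d_recall = relative_gain(results["D"]["recall"], results["A"]["recall"])
d_tokens = relative_gain(results["D"]["tokens"], results["A"]["tokens"])

print(f"rule 1 fired: {rule1_fired}")
print(f"D vs A recall: {d_recall:+.1%}, tokens: {d_tokens:+.1%}")
```

With these inputs A trails B on recall and leads by under 9% on actionability, so rule 1 fires, and the D-versus-A deltas come out at +12.5% recall and roughly −16% output tokens.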

Reproducibility & methodology. 4 strategies, 3 trials each (12 total), one frozen sha256-verified input snapshot. Pre-registered hypothesis with decision rules. Full methodology, scoring JSON, per-trial outputs, the HYPOTHESIS.md answer key, and a fork-template for future skills (audit-cad-hygiene, audit-component-consistency) all live at _agents/bench/audit-skill-bench/ in the Thios repo.
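For readers forking the template, a frozen-input check of the kind described above could look like the following sketch. It uses only Python's standard `hashlib`; the manifest format (relative path mapped to an expected hex digest) and the function names are assumptions for illustration, not the layout actually used in the Thios repo.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_snapshot(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose hash has drifted.

    `manifest` is a hypothetical mapping of relative path -> expected digest.
    An empty result means the snapshot still matches the frozen inputs.
    """
    drifted = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            drifted.append(rel_path)
    return drifted
```

Running a check like this before each trial, and refusing to score if the returned list is non-empty, is one way to make sure all twelve runs saw byte-identical inputs.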

Built with the Thios design system — teal #15635E, gold #E8AF00, Saira.
Bench: 2026-05-05.