Benchmarks
Per-model and per-scenario results from the prompt-adaptation study. Each model was run on eight scenarios, and each scenario was scored as a paired A/B comparison with blind judging.
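As a rough illustration of what a paired A/B comparison with blind judging can look like, here is a minimal Python sketch. It is not the study's harness: the function names (`blind_paired_trial`, `tally`), the arm labels (`baseline`, `adapted`), the length-based toy judge, and the placeholder outputs are all assumptions made for this example, and the numbers it produces are not study results.

```python
import random
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: receives two unlabeled outputs and
# returns 0 if it prefers the first one shown, 1 if the second.
Judge = Callable[[str, str], int]


def blind_paired_trial(baseline: str, adapted: str,
                       judge: Judge, rng: random.Random) -> str:
    """Run one paired comparison and return the preferred arm's label."""
    pair = [("baseline", baseline), ("adapted", adapted)]
    rng.shuffle(pair)  # random presentation order; the judge never sees labels
    choice = judge(pair[0][1], pair[1][1])
    return pair[choice][0]  # unblind only after the judgment is made


def tally(trials: Iterable[Tuple[str, str]], judge: Judge, seed: int = 0) -> Counter:
    """Count wins per arm over (baseline_output, adapted_output) pairs."""
    rng = random.Random(seed)
    return Counter(blind_paired_trial(b, a, judge, rng) for b, a in trials)


# Toy judge (assumption): prefers the longer text. A real judge would be
# a human rater or a grading model.
def toy_judge(first: str, second: str) -> int:
    return 0 if len(first) >= len(second) else 1


# Placeholder outputs for eight scenarios; NOT data from the study.
toy_trials = [("short", "a longer answer")] * 8
result = tally(toy_trials, toy_judge)
```

Shuffling the presentation order per trial keeps the judge from learning which position holds the adapted prompt's output, so a position preference cannot masquerade as a real effect.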
Per-model summary
Per-scenario — Claude Sonnet 4.6
Claude Sonnet 4.6 shows the strongest signal in the dataset: all eight of its scenarios improved.