
Benchmarks

Per-model and per-scenario results from the prompt-adaptation study. Eight scenarios per model; paired A/B with blind judging.

Per-model summary

| Model | Baseline | Enhanced | Gain | Wins | Significance |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 76.1 | 87.9 | +15.4% | 8 / 8 | p < 0.01 |
| Mistral Large 3 | 51.1 | 74.0 | +44.7% | 6 / 8 | p ≈ 0.06 |
| DeepSeek V3.2 | 77.9 | 81.5 | +4.7% | 5 / 8 | n.s. |
| Combined | 68.4 | 81.1 | +18.6% | 19 / 24 | p ≈ 0.003 |
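Gain is the relative improvement of the enhanced mean over the baseline mean. The page does not state which significance test the study used; as a sketch, recomputing the gains from the rounded means and running a one-sided sign test on the combined win count reproduces the published figures to within rounding:

```python
from scipy.stats import binomtest

rows = {
    "Claude Sonnet 4.6": (76.1, 87.9),
    "Mistral Large 3": (51.1, 74.0),
    "DeepSeek V3.2": (77.9, 81.5),
    "Combined": (68.4, 81.1),
}

for model, (baseline, enhanced) in rows.items():
    # Gains recomputed from the rounded means land within ~0.1 points of
    # the published column, which presumably uses unrounded scores.
    gain = (enhanced - baseline) / baseline * 100
    print(f"{model}: +{gain:.1f}%")

# One-sided sign test on the combined win count (19 of 24 paired wins).
# An assumption about the test, but it reproduces the reported p ≈ 0.003.
print(binomtest(19, n=24, p=0.5, alternative="greater").pvalue)  # ≈ 0.0033
```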

Per-scenario — Claude Sonnet 4.6

The strongest signal in the dataset; every scenario improved.

| Scenario | Task | Baseline | Enhanced | Δ |
|---|---|---|---|---|
| Vague function request | code | 75 | 90 | +15 |
| Lazy refactor request | code | 72 | 88 | +16 |
| Minimal email request | writing | 72 | 78 | +6 |
| Terse summary request | writing | 72 | 91 | +19 |
| Casual code review | analysis | 82 | 95 | +13 |
| Vague performance question | analysis | 82 | 92 | +10 |
| Lazy extraction request | extraction | 72 | 82 | +10 |
| Minimal log parsing | extraction | 82 | 87 | +5 |
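These eight pairs are enough to sanity-check the p < 0.01 line in the summary table. A minimal sketch, assuming a paired t-test (the study's exact test is not stated here):

```python
from scipy.stats import ttest_rel

baseline = [75, 72, 72, 72, 82, 82, 72, 82]
enhanced = [90, 88, 78, 91, 95, 92, 82, 87]

# Two-sided paired t-test over the eight scenario pairs; the mean
# per-scenario lift is 11.75 points.
result = ttest_rel(enhanced, baseline)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# t ≈ 6.79, p ≈ 0.0003, comfortably inside the reported p < 0.01.
```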

Scope. These are the only models we measured in this study. Refrase supports many more models in production — the prompt-adaptation system uses each model's official documentation as context, so adaptations work for any model with a published prompting guide. We're running follow-up measurements on additional models; this page will be updated when those land.
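To make that mechanism concrete, here is a minimal, hypothetical sketch of documentation-as-context adaptation. Every name in it is invented for illustration and says nothing about Refrase's internals:

```python
def build_adaptation_prompt(user_prompt: str, model_name: str, guide_excerpt: str) -> str:
    """Hypothetical helper: wrap a model's prompting guide around a raw prompt.

    Nothing here is Refrase's API; it only illustrates the idea of using
    a model's official documentation as context for a rewrite.
    """
    return (
        f"You rewrite prompts for {model_name}.\n"
        "Follow this excerpt from its official prompting guide:\n\n"
        f"{guide_excerpt}\n\n"
        "Rewrite the prompt below to match the guide's conventions "
        "(structure, explicitness, output format) without changing its intent.\n\n"
        f"Prompt to rewrite:\n{user_prompt}"
    )

# The result would be sent to whatever chat-completion API is available.
print(build_adaptation_prompt(
    "fix my sorting code",
    "Claude Sonnet 4.6",
    "Be explicit about the task, give context up front, and state the desired output format.",
))
```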

Browse models without statistical claims. /models has prompting guides for every model Refrase supports — what each model expects, how it differs from others, and how to write prompts that actually work.

Try Refrase on your own prompt.

Same enhancer the research validated — 4–7 seconds end-to-end.