
Benchmarks

Per-model and per-scenario results from the prompt-adaptation study. Eight scenarios per model; paired A/B with blind judging.

Per-model summary

| Model | Baseline | Enhanced | Gain | Wins | Significance |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 76.1 | 87.9 | +15.4% | 8 / 8 | p < 0.01 |
| Mistral Large 3 | 51.1 | 74.0 | +44.7% | 6 / 8 | p ≈ 0.06 |
| DeepSeek V3.2 | 77.9 | 81.5 | +4.7% | 5 / 8 | n.s. |
| Combined | 68.4 | 81.1 | +18.6% | 19 / 24 | p ≈ 0.003 |
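Gain is the relative improvement of the enhanced mean over the baseline mean. The page does not state which significance test the study used; as a sketch, recomputing the gains from the rounded means and running a one-sided sign test on the combined win count reproduces the published figures to within rounding:

```python
from scipy.stats import binomtest

rows = {
    "Claude Sonnet 4.6": (76.1, 87.9),
    "Mistral Large 3": (51.1, 74.0),
    "DeepSeek V3.2": (77.9, 81.5),
    "Combined": (68.4, 81.1),
}

for model, (baseline, enhanced) in rows.items():
    # Gains recomputed from the rounded means land within ~0.1 points of
    # the published column, which presumably uses unrounded scores.
    gain = (enhanced - baseline) / baseline * 100
    print(f"{model}: +{gain:.1f}%")

# One-sided sign test on the combined win count (19 of 24 paired wins).
# An assumption about the test, but it reproduces the reported p ≈ 0.003.
print(binomtest(19, n=24, p=0.5, alternative="greater").pvalue)  # ≈ 0.0033
```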

Per-scenario — Claude Sonnet 4.6

The strongest signal in the dataset; every scenario improved.

| Scenario | Task | Baseline | Enhanced | Δ |
|---|---|---|---|---|
| Vague function request | code | 75 | 90 | +15 |
| Lazy refactor request | code | 72 | 88 | +16 |
| Minimal email request | writing | 72 | 78 | +6 |
| Terse summary request | writing | 72 | 91 | +19 |
| Casual code review | analysis | 82 | 95 | +13 |
| Vague performance question | analysis | 82 | 92 | +10 |
| Lazy extraction request | extraction | 72 | 82 | +10 |
| Minimal log parsing | extraction | 82 | 87 | +5 |
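These eight pairs are enough to sanity-check the p < 0.01 line in the summary table. A minimal sketch, assuming a paired t-test (the study's exact test is not stated here):

```python
from scipy.stats import ttest_rel

baseline = [75, 72, 72, 72, 82, 82, 72, 82]
enhanced = [90, 88, 78, 91, 95, 92, 82, 87]

# Two-sided paired t-test over the eight scenario pairs; the mean
# per-scenario lift is 11.75 points.
result = ttest_rel(enhanced, baseline)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# t ≈ 6.79, p ≈ 0.0003, comfortably inside the reported p < 0.01.
```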

Scope. These are the only models we measured in this study. Refrase supports many more models in production — the prompt-adaptation system uses each model's official documentation as context, so adaptations work for any model with a published prompting guide. We're running follow-up measurements on additional models; this page will be updated when those land.
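To make that mechanism concrete, here is a minimal, hypothetical sketch of documentation-as-context adaptation. Every name in it is invented for illustration and says nothing about Refrase's internals:

```python
def build_adaptation_prompt(user_prompt: str, model_name: str, guide_excerpt: str) -> str:
    """Hypothetical helper: wrap a model's prompting guide around a raw prompt.

    Nothing here is Refrase's API; it only illustrates the idea of using
    a model's official documentation as context for a rewrite.
    """
    return (
        f"You rewrite prompts for {model_name}.\n"
        "Follow this excerpt from its official prompting guide:\n\n"
        f"{guide_excerpt}\n\n"
        "Rewrite the prompt below to match the guide's conventions "
        "(structure, explicitness, output format) without changing its intent.\n\n"
        f"Prompt to rewrite:\n{user_prompt}"
    )

# The result would be sent to whatever chat-completion API is available.
print(build_adaptation_prompt(
    "fix my sorting code",
    "Claude Sonnet 4.6",
    "Be explicit about the task, give context up front, and state the desired output format.",
))
```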

Browse models without statistical claims. /models has prompting guides for every model Refrase supports — what each model expects, how it differs from others, and how to write prompts that actually work.

Try Refrase on your own prompt.

Same enhancer the research validated — 4–7 seconds end-to-end.