Refrase

Benchmarks

Per-model and per-scenario results from the prompt-adaptation study. Eight scenarios per model; paired A/B with blind judging.

Per-model summary

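Model | Baseline | Enhanced | Gain | Wins | Significance
--- | --- | --- | --- | --- | ---
Claude Sonnet 4.6 | 76.1 | 87.9 | +15.4% | 8 / 8 | p < 0.01
Mistral Large 3 | 51.1 | 74.0 | +44.7% | 6 / 8 | p ≈ 0.06
DeepSeek V3.2 | 77.9 | 81.5 | +4.7% | 5 / 8 | n.s.
Combined | 68.4 | 81.1 | +18.6% | 19 / 24 | p ≈ 0.003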
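
As a rough cross-check of the summary table, the sketch below recomputes the Gain column and the Combined row from the rounded per-model means. It assumes that Gain is the relative improvement of Enhanced over Baseline and that the Combined row is the unweighted mean of the three models; neither is stated on this page, although both fit the published numbers. Because the inputs are already rounded, a recomputed gain can differ from the published one by about 0.1 percentage points.

```python
# Rounded per-model values copied from the summary table:
# (baseline mean, enhanced mean, scenarios won out of 8).
summary = {
    "Claude Sonnet 4.6": (76.1, 87.9, 8),
    "Mistral Large 3": (51.1, 74.0, 6),
    "DeepSeek V3.2": (77.9, 81.5, 5),
}


def relative_gain(baseline, enhanced):
    """Percentage improvement of the enhanced score over the baseline score."""
    return (enhanced - baseline) / baseline * 100


# Assumption: the Combined row is an unweighted mean across the three models.
combined_baseline = sum(b for b, _, _ in summary.values()) / len(summary)
combined_enhanced = sum(e for _, e, _ in summary.values()) / len(summary)
total_wins = sum(w for _, _, w in summary.values())

for name, (b, e, w) in summary.items():
    print(f"{name}: {relative_gain(b, e):+.1f}% ({w} / 8 wins)")
print(
    f"Combined: {combined_baseline:.1f} -> {combined_enhanced:.1f}, "
    f"{total_wins} / 24 wins"
)
```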

Per-scenario — Claude Sonnet 4.6

The strongest signal in the dataset; every scenario improved.

Scenario | Task | Baseline | Enhanced | Δ
--- | --- | --- | --- | ---
Vague function request | code | 75 | 90 | +15
Lazy refactor request | code | 72 | 88 | +16
Minimal email request | writing | 72 | 78 | +6
Terse summary request | writing | 72 | 91 | +19
Casual code review | analysis | 82 | 95 | +13
Vague performance question | analysis | 82 | 92 | +10
Lazy extraction request | extraction | 72 | 82 | +10
Minimal log parsing | extraction | 82 | 87 | +5
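
The per-scenario scores are enough to reproduce the Claude Sonnet 4.6 summary row and to get a feel for how strong an 8-of-8 result is. The sketch below is illustrative only: this page does not say which statistical test produced the reported significance values, so the exact two-sided sign test used here is a stand-in chosen because it needs nothing beyond the win count. It ignores the size of each improvement, so it will not generally match the published p-values. The helper name `sign_test_p` is hypothetical.

```python
from math import comb

# (scenario, task, baseline, enhanced) copied from the per-scenario table.
rows = [
    ("Vague function request", "code", 75, 90),
    ("Lazy refactor request", "code", 72, 88),
    ("Minimal email request", "writing", 72, 78),
    ("Terse summary request", "writing", 72, 91),
    ("Casual code review", "analysis", 82, 95),
    ("Vague performance question", "analysis", 82, 92),
    ("Lazy extraction request", "extraction", 72, 82),
    ("Minimal log parsing", "extraction", 82, 87),
]

deltas = [enhanced - baseline for _, _, baseline, enhanced in rows]
mean_baseline = sum(b for _, _, b, _ in rows) / len(rows)
mean_enhanced = sum(e for _, _, _, e in rows) / len(rows)
gain_pct = (mean_enhanced - mean_baseline) / mean_baseline * 100
wins = sum(d > 0 for d in deltas)


def sign_test_p(wins, n):
    """Exact two-sided sign test p-value for `wins` improvements out of n pairs.

    Assumption: this is not necessarily the test used in the study.
    """
    k = max(wins, n - wins)
    # Probability of a split at least this lopsided if each pair were a coin flip.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(f"Baseline {mean_baseline:.1f}, enhanced {mean_enhanced:.1f}, gain {gain_pct:+.1f}%")
print(f"Wins: {wins} / {len(rows)}, sign-test p = {sign_test_p(wins, len(rows)):.4f}")
```

A sign test looks only at the direction of each change, which makes it a conservative sanity check; a test that also uses the magnitudes of the paired differences can reach a different conclusion on the same data.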

Scope. These are the only models we measured in this study. Refrase supports many more models in production: the prompt-adaptation system uses each model's official documentation as context, so adaptations work for any model with a published prompting guide. We're running follow-up measurements on additional models, and this page will be updated when those land.

Browse models without statistical claims. /models has prompting guides for every model Refrase supports — what each model expects, how it differs from others, and how to write prompts that actually work.

Try Refrase on your own prompt.

Same enhancer the research validated — 4–7 seconds end-to-end.


© 2026 Refrase. All rights reserved.