Research
Does Refrase actually make prompts better? We measured it on three frontier models with paired A/B comparison and blind judging. Here's what we found.
Headline result
Across the three measured models, 19 of 24 paired comparisons favored the Refrase-enhanced prompt over the user's raw prompt (p ≈ 0.003). On Claude Sonnet 4.6, all 8 scenarios improved (p < 0.01).
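Both p-values are consistent with a one-sided sign test against a 50/50 null: 19 wins out of 24 gives p ≈ 0.0033, and 8 out of 8 gives p = 1/2⁸ ≈ 0.0039. The exact procedure is specified in the paper; this is just a quick consistency check under that assumption:

```python
from scipy.stats import binomtest

# Null: enhanced and raw prompts are equally likely to win a comparison.
# One-sided alternative: the enhanced prompt wins more often.
print(binomtest(19, n=24, p=0.5, alternative="greater").pvalue)  # ~0.0033
print(binomtest(8, n=8, p=0.5, alternative="greater").pvalue)    # ~0.0039 < 0.01
```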
Measured models
We chose three models that span quality, cost, and provider style: one each from Anthropic, Mistral, and DeepSeek.
Claude Sonnet 4.6
+15.4%
76.1 → 87.9
Statistically significant on every scenario.
Mistral Large 3
+44.7%
51.1 → 74.0
Largest absolute gain. Trends toward significance, without clearing the p < 0.01 bar individually.
DeepSeek V3.2
+4.7%
77.9 → 81.5
Smallest gain — DeepSeek already follows raw prompts well.
How we tested
- Paired A/B with blind judging. For each scenario we ran the user's raw prompt against the Refrase-enhanced version on the same target model. A separate Claude Sonnet 4.6 instance (extended thinking enabled) judged both outputs side by side on a 0–100 scale (see the sketch after this list).
- Realistic prompts. Eight scenarios spanning code, writing, analysis, and extraction — the kind people actually type, not carefully crafted prompts. (Pre-optimized prompts would hit a ceiling effect.)
- Order randomized, condition hidden. The judge doesn't know which response came from which condition, eliminating position bias.
- Five scoring criteria. Correctness, completeness, usefulness, clarity, precision — in that priority order.
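A minimal sketch of how a run like this can be wired up. Everything here is illustrative: the `complete()` helper, the rubric wording, and the model identifier are assumptions, not Refrase's actual harness.

```python
import json
import random

def complete(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a provider SDK call; not a real API.
    raise NotImplementedError

JUDGE_RUBRIC = (
    "Score each response 0-100. Weigh, in priority order: "
    "correctness, completeness, usefulness, clarity, precision. "
    'Reply as JSON: {"score_a": <int>, "score_b": <int>}'
)

def run_scenario(target_model: str, raw_prompt: str, enhanced_prompt: str) -> dict:
    raw_out = complete(target_model, raw_prompt)
    enh_out = complete(target_model, enhanced_prompt)

    # Blind the judge: randomize which condition appears as A vs. B.
    pairs = [("raw", raw_out), ("enhanced", enh_out)]
    random.shuffle(pairs)
    (label_a, out_a), (label_b, out_b) = pairs

    verdict = complete(
        "claude-sonnet-4.6",  # judge model, extended thinking enabled
        f"{JUDGE_RUBRIC}\n\n=== Response A ===\n{out_a}\n\n=== Response B ===\n{out_b}",
    )

    # Un-blind: map the judge's A/B scores back to their conditions.
    scores = json.loads(verdict)
    return {label_a: scores["score_a"], label_b: scores["score_b"]}
```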
Honest limitations
- Three models is a small sample. Per-model claims have wide confidence intervals; only Claude Sonnet hits the conventional p < 0.01 bar individually.
- The judge is itself an LLM. Claude judging Claude outputs has potential self-preference bias; we mitigate this via blind, randomized ordering.
- Single repetition per scenario. LLMs are non-deterministic, so some near-zero deltas are likely run-to-run variance.
- English-only, technical prompts. Results may not generalize to creative writing, multilingual prompts, or specialized domains.
Paper
Validating LLM-Powered Prompt Adaptation: Three-Model Study with Paired A/B Comparison
The full write-up of the experiment summarized above — design, prompts, scenarios, judge configuration, per-scenario results, and limitations.
Try what the research validated.
Paste any prompt — Refrase rewrites it for your target model in 4–7 seconds.