Research

Refrase's adaptation rules are derived from empirical testing across 46 model configurations on structured output tasks.

Methodology

We evaluated 46 model configurations across 12 model families on 8 structured output scenarios spanning extraction, analysis, and generation tasks. Each configuration was tested with multiple thinking-mode and temperature variants.

Outputs were scored on a 3-layer evaluation framework: L1 (service-specific criteria), L2 (universal 10-rule quality rubric), and L3 (binary decision). Two independent LLM judges (Claude Sonnet 4.6 and Claude Haiku 4.5) provided inter-rater reliability via Cohen's Kappa.

Three-Layer Scoring Pipeline

L1: Task-Specific Criteria

Service-specific evaluation criteria loaded from JSON configuration. Assesses domain accuracy, required fields, and format compliance.
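Loading criteria from a JSON configuration might look like the sketch below. The schema is not published, so the field names ("service", "criteria", "id", "description") and the example criteria are assumptions for illustration only.

```python
import json

# Hypothetical L1 criteria file; the real schema is not published, so every
# field name and criterion below is an assumption.
criteria_json = """
{
  "service": "extraction",
  "criteria": [
    {"id": "domain_accuracy",   "description": "Extracted values match the source"},
    {"id": "required_fields",   "description": "All required fields are present"},
    {"id": "format_compliance", "description": "Output parses as valid JSON"}
  ]
}
"""

config = json.loads(criteria_json)
# Build the per-service checklist a judge would score against.
checklist = [c["id"] for c in config["criteria"]]
print(config["service"], checklist)
```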

L2: Universal Quality Rubric

10-rule quality rubric scored 0-30. Evaluates coherence, completeness, instruction adherence, formatting, and relevance across all output types.
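A 10-rule rubric totaling 0-30 presumably means each rule is scored 0-3; that breakdown is an assumption, and only the five rule names listed above come from this page (the rest are placeholders):

```python
# Assumed breakdown: 10 rules, each scored 0-3, summing to the 0-30 L2 total.
# Only the first five rule names appear on the page; the rest are placeholders.
L2_RULES = [
    "coherence", "completeness", "instruction_adherence", "formatting",
    "relevance", "rule_6", "rule_7", "rule_8", "rule_9", "rule_10",
]

def l2_total(scores: dict) -> int:
    """Sum per-rule scores into the 0-30 L2 total."""
    assert set(scores) == set(L2_RULES), "one score per rule"
    assert all(0 <= s <= 3 for s in scores.values()), "each rule scored 0-3"
    return sum(scores.values())

perfect = {rule: 3 for rule in L2_RULES}
print(l2_total(perfect))  # -> 30
```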

L3: Binary Success/Failure

Final pass/fail determination. Would a domain expert accept this output for production use? Synthesizes L1 and L2 signals into an actionable verdict.
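One way the L3 verdict could synthesize the L1 and L2 signals is sketched below. The combination rule (all L1 criteria must pass, L2 must clear a threshold) and the 24/30 threshold are illustrative assumptions, not the published decision rule:

```python
def l3_verdict(l1_criteria_passed: list, l2_score: int,
               l2_threshold: int = 24) -> str:
    """Binary production verdict from L1 checks and the 0-30 L2 score.

    The all-criteria-must-pass rule and the 24/30 threshold are
    illustrative assumptions, not the published decision logic.
    """
    ok = all(l1_criteria_passed) and l2_score >= l2_threshold
    return "pass" if ok else "fail"

print(l3_verdict([True, True, True], 27))   # -> pass
print(l3_verdict([True, False, True], 29))  # -> fail (an L1 criterion failed)
```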

Dual-Judge Inter-Rater Reliability

Every evaluation is independently performed by two LLM judges: Claude Sonnet 4.6 and Claude Haiku 4.5. Each judge applies the same three-layer scoring pipeline to the same model outputs, but makes its assessments independently.

We measure agreement using Cohen's Kappa, a statistical measure that accounts for chance agreement. A Kappa of 1.0 indicates perfect agreement, while 0.0 indicates agreement no better than chance. Our best evaluation run achieved Kappa = 1.0, validating both the rubric design and the judges' reliability.

When judges disagree, we flag the evaluation for manual review. Systematic disagreements help us identify ambiguous criteria and refine the rubric for future runs.

Model Leaderboard

| Rank | Model | Provider | Quality | Gain |
|------|-------|----------|---------|------|
| 1 | Claude Sonnet 4.6 | Anthropic | 94 | +12% |
| 2 | Gemini Pro | Google | 93 | |
| 3 | GPT-4o | OpenAI | 91 | +8% |
| 4 | Qwen3 235B | Alibaba | 88 | +15% |
| 5 | DeepSeek V3 | DeepSeek | 87 | +10% |
| 6 | Mistral Large 3 | Mistral | 86 | +7% |
| 7 | Llama 3.1 405B | Meta | 85 | +9% |
| 8 | Kimi K2 | Moonshot | 84 | +11% |
| 9 | GLM 4.7 | Z.AI | 83 | +13% |
| 10 | Nemotron 30B | NVIDIA | 82 | +6% |

Papers

Published 2026
Multi-Provider Prompt Optimization for Structured Output Tasks

Craig Certo

We present a systematic evaluation of prompt structure effects across 46 model configurations from 12 provider families. Our three-layer scoring framework reveals that optimal prompt structure varies significantly between models, with XML-structured prompts improving Claude outputs by 12-18% while markdown headers yield better results on GPT-4o.

Coming Soon
Service-Level Optimization: Extending Prompt Adaptation Across Five Structured Output Services

Expanding evaluation from single-service benchmarks to five distinct structured output services: extraction, knowledge gap analysis, job posting extraction, job analysis, and resume generation. The top 3-5 models from Paper 1 are tested across all services.

Coming Soon
End-to-End Effectiveness: Baseline vs Enhanced vs Full Pipeline

Measuring real-world effectiveness of the winning model configuration across the complete pipeline. Compares baseline prompts, adapted prompts, and fully optimized pipeline outputs in production scenarios.
