
Validating Refrase: a three-model paired A/B study

5 min read · Research · Methodology

Before we shipped Refrase as a product, we wanted to answer a straightforward question: does LLM-powered prompt adaptation actually produce measurably better output from frontier models compared to the user's raw prompt? "Use XML tags for Claude" and "put the question last" are folklore — empirically untested, inconsistently applied, rarely measured.

So we measured it.

The headline result

Across three target models and eight realistic scenarios, 19 of 24 paired comparisons favored the Refrase-enhanced prompt (Wilcoxon signed-rank test, p ≈ 0.003). Per model:

  • Claude Sonnet 4.6: +15.4% (8/8 scenarios, p < 0.01)
  • Mistral Large 3: +44.7% (6/8 wins, p ≈ 0.06)
  • DeepSeek V3.2: +4.7% (5/8 wins, n.s.)
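
For the statistically inclined: the headline number comes from a paired Wilcoxon signed-rank test over the 24 enhanced-vs-baseline comparisons. A minimal scipy sketch of that computation, using illustrative placeholder scores rather than the study data:

```python
# Sketch of the headline significance test: a paired Wilcoxon signed-rank test
# on per-comparison judge scores. The score lists below are illustrative
# placeholders, not the actual study data.
from scipy.stats import wilcoxon

# One aggregate judge score per (model, scenario) comparison, for each condition.
baseline_scores = [71, 64, 78, 80, 69, 73, 75, 66]   # illustrative values
enhanced_scores = [82, 75, 77, 88, 74, 79, 84, 70]   # illustrative values

stat, p_value = wilcoxon(enhanced_scores, baseline_scores)
print(f"W = {stat}, p = {p_value:.3f}")
```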

How we tested

For each scenario we ran the same prompt two ways: baseline, the user's raw, casual prompt sent straight to the target model, and enhanced, the same intent rewritten by Refrase and then sent to the target model. A separate Claude Sonnet 4.6 instance with extended thinking enabled judged both outputs side by side on a 0–100 scale across five criteria (correctness, completeness, usefulness, clarity, precision). Presentation order was randomized, and the judge didn't know which response came from which condition.
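
In code, the blind judging step looks roughly like this. It's a simplified sketch: the call_judge helper and the exact judge prompt are placeholders, not our evaluation harness.

```python
# Minimal sketch of the blind paired-judging step (illustrative only).
import json
import random

CRITERIA = ["correctness", "completeness", "usefulness", "clarity", "precision"]

def build_judge_prompt(task: str, response_a: str, response_b: str) -> str:
    """Ask the judge to score two anonymized responses 0-100 per criterion."""
    return (
        f"Task given to the model:\n{task}\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        f"Score each response 0-100 on {', '.join(CRITERIA)}. "
        "Reply as JSON: {\"A\": {...}, \"B\": {...}}."
    )

def judge_pair(task: str, baseline: str, enhanced: str, call_judge) -> dict:
    # Randomize presentation order so the judge can't tell which condition is which.
    pairs = [("baseline", baseline), ("enhanced", enhanced)]
    random.shuffle(pairs)
    prompt = build_judge_prompt(task, pairs[0][1], pairs[1][1])
    # Assumes the judge replies with valid JSON; a real harness would validate.
    scores = json.loads(call_judge(prompt))
    # Map the anonymized labels back to their conditions after judging.
    return {pairs[0][0]: scores["A"], pairs[1][0]: scores["B"]}
```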

We used eight scenarios spanning code, writing, analysis, and extraction, two per task type. The prompts were deliberately casual, the kind people actually type into chat interfaces; pre-optimized prompts would hit a ceiling effect and wouldn't reflect real usage.

Why those three models

We needed a sample that spanned the design space: a polished frontier model that follows instructions well (Claude Sonnet 4.6), a reasoning-leaning model with a different instruction-following style (DeepSeek V3.2), and a strong open-weight model with looser baseline behavior (Mistral Large 3). They're all served via the same API surface (AWS Bedrock) with identical inference configuration, so comparisons are clean.
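
Concretely, every baseline and enhanced prompt goes through the same Bedrock call with the same inference settings. A rough sketch of that setup follows; the model IDs and parameter values are placeholders, not the exact configuration we used.

```python
# Sketch of running all three targets through one API surface with identical
# inference settings. Model IDs and config values are placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime")

INFERENCE_CONFIG = {"maxTokens": 4096, "temperature": 0.7, "topP": 0.9}  # assumed values

TARGET_MODELS = [
    "anthropic.claude-sonnet-4-6",   # placeholder ID
    "mistral.mistral-large-3",       # placeholder ID
    "deepseek.v3-2",                 # placeholder ID
]

def run_prompt(model_id: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig=INFERENCE_CONFIG,  # same config for every model
    )
    return response["output"]["message"]["content"][0]["text"]
```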

What drove the wins

The judge's reasoning consistently identified four patterns where Refrase wins:

  • Surfacing implicit constraints: a casual "professional email" becomes explicit about tone, length, and structure.
  • Adding model-native formatting: XML tags for Claude, markdown for GPT, thinking directives for Qwen.
  • Setting quality floors: minimum expectations without capping ambition.
  • Providing task framing: a vague "review this code" becomes a structured analysis with severity levels.
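
To make the first and last patterns concrete, here is a hand-written example of how a throwaway prompt might be expanded for Claude. This is illustrative only, not actual Refrase output.

```python
# Illustrative only: a hand-written example of the patterns above
# (explicit constraints, XML structure, severity framing). Not Refrase output.
RAW = "review this code"

ENHANCED = """<task>
Review the code below for bugs, security issues, and maintainability problems.
</task>
<instructions>
- Group findings by severity: critical, major, minor.
- For each finding, cite the relevant lines and propose a concrete fix.
- End with a short overall assessment.
</instructions>"""
```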

Refrase ties or, less often, loses in three situations: the target model is already excellent and the raw prompt is unambiguous; ordinary run-to-run LLM variance swamps the difference; or the enhanced prompt accidentally over-constrains the model's approach.

Honest limitations

Three models is a small sample. Per-model claims have wide confidence intervals; only Claude Sonnet clears the conventional p < 0.01 bar on its own. The judge is itself an LLM (Claude judging Claude has potential self-preference bias, which we mitigate with blind, randomized ordering). We ran one repetition per scenario, so some near-zero deltas are likely run-to-run variance. And the prompts were English-only and technical.

What this means in practice

Refrase's enhancer doesn't apply hardcoded adaptation rules. It's a single LLM call (Claude Haiku 4.5) that reads each model's official documentation as context and rewrites your prompt accordingly. Adding support for a new model is a matter of adding a curated model card; the enhancer adapts automatically.
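
At a pseudocode level, the enhancer is roughly this. The file layout, prompt wording, and call_haiku helper are assumptions for illustration, not the production implementation.

```python
# Sketch of the enhancer flow described above (illustrative; the real
# implementation may differ).
from pathlib import Path

def enhance_prompt(raw_prompt: str, target_model: str, call_haiku) -> str:
    """Rewrite a raw prompt for a target model using its curated model card."""
    # Each supported model ships with a curated card distilled from its official docs.
    model_card = Path(f"model_cards/{target_model}.md").read_text()  # assumed layout
    system = (
        "You adapt user prompts to a specific target model. "
        "Follow the model card below when rewriting.\n\n" + model_card
    )
    # Single LLM call: the card supplies the model-specific adaptation rules.
    return call_haiku(system=system, user=raw_prompt)
```

Supporting a new target model then means dropping a new card into the directory; no enhancer code changes are needed, which is the design choice the paragraph above describes.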

The full paper, per-scenario tables, dataset download, and reproducibility instructions are at /research.
