Refrase Adaptation Quality Study

Research Question

Does Refrase's LLM-enhanced prompt adaptation produce measurably better output from target models compared to the user's original raw prompt?

TL;DR

Yes. On Claude Sonnet 4.6, Refrase-enhanced prompts produced better output on 8 out of 8 scenarios with an average +15.4% quality improvement (p < 0.01, Wilcoxon signed-rank test). Across all three models tested, 19 out of 24 paired comparisons favored the enhanced prompt.


Methodology

Design: Paired A/B Comparison with Blind Judging

For each test scenario, we ran the same prompt two ways:

  • Baseline: The user's raw, original prompt sent directly to the target model
  • Enhanced: The same intent rewritten by Refrase, then sent to the target model

A third, independent LLM judged both outputs side-by-side without knowing which condition produced which. Presentation order was randomized to eliminate position bias.
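The blinding protocol can be sketched as a small harness: shuffle the two outputs before the judge sees them, then map the blind "Response A"/"Response B" scores back to their conditions. This is a minimal sketch, not the study's actual code; `judge_fn` is a hypothetical stand-in for the judge-model call.

```python
import random

def judge_blind(baseline_out, enhanced_out, judge_fn, rng=random):
    """Show the judge both outputs in random order as 'Response A' /
    'Response B', then map its blind scores back to their conditions."""
    pair = [("baseline", baseline_out), ("enhanced", enhanced_out)]
    rng.shuffle(pair)  # the judge never learns which condition is which
    score_a, score_b = judge_fn(pair[0][1], pair[1][1])
    return {pair[0][0]: score_a, pair[1][0]: score_b}
```

Because the mapping travels with the shuffled pair, the scores always attach to the right condition regardless of presentation order.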

Test Scenarios (8 total)

We deliberately chose realistic, casual prompts — the kind people actually type into AI chat interfaces, not carefully crafted prompts. Two scenarios per task type:

| Task | Scenario | Example prompt |
|---|---|---|
| Code | Write function | "write me an email validator in python" |
| Code | Refactor code | "clean up this code [snippet]" |
| Writing | Compose email | "write an email to my client about a project delay" |
| Writing | Summarize | "turn these into action items [meeting notes]" |
| Analysis | Code review | "anything wrong with this? [code]" |
| Analysis | Performance | "this query is slow, help [SQL]" |
| Extraction | Parse text | "get me the people from this [text]" |
| Extraction | Parse logs | "parse these logs into json [logs]" |

Models Tested

| Model | Provider | Bedrock ID |
|---|---|---|
| Claude Sonnet 4.6 | Anthropic | us.anthropic.claude-sonnet-4-6 |
| DeepSeek V3.2 | DeepSeek | deepseek.v3.2 |
| Mistral Large 3 | Mistral | mistral.mistral-large-3-675b-instruct |

The Enhancer

A single LLM call to Claude Haiku 4.5 (Bedrock) that receives:

  1. The user's original prompt
  2. The target model's complete documentation (capabilities, limitations, prompt patterns, anti-patterns, sources from official provider docs)
  3. The task type (code, writing, analysis, extraction)

The enhancer rewrites the prompt applying universal prompt engineering principles plus model-specific optimizations from the documentation. It returns the rewritten prompt plus a list of changes made.

Configuration: temperature=0.5, max_tokens=2048, no thinking budget (typical latency 5-7s).
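Putting the three inputs and the configuration together, the enhancer call might look like the following Bedrock Converse payload. This is an illustrative sketch: the model ID and the instruction wording are assumptions, not the production values.

```python
def build_enhancer_request(user_prompt, model_docs, task_type):
    """Assemble a Bedrock Converse payload for the single enhancer call.
    Model ID and instruction wording are illustrative, not production values."""
    system_text = (
        "Rewrite the user's prompt for the target model. Apply universal "
        "prompt-engineering principles plus the model-specific guidance "
        "below. Return the rewritten prompt and a list of changes made.\n\n"
        f"Task type: {task_type}\n\n"
        f"Target model documentation:\n{model_docs}"
    )
    return {
        "modelId": "us.anthropic.claude-haiku-4-5",  # assumed Bedrock ID
        "system": [{"text": system_text}],
        "messages": [{"role": "user", "content": [{"text": user_prompt}]}],
        "inferenceConfig": {"temperature": 0.5, "maxTokens": 2048},
    }
```

The resulting dict would be passed to `bedrock_runtime.converse(**request)` on a boto3 `bedrock-runtime` client.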

The Judge

A separate Claude Sonnet 4.6 (Bedrock) instance evaluates both outputs side-by-side on a 0-100 scale.

Scoring criteria (in order of importance):

  1. Correctness — Factually accurate, free of errors
  2. Completeness — Fully addresses what was asked
  3. Usefulness — Immediately actionable, production-ready
  4. Clarity — Well-organized, easy to follow
  5. Precision — Follows stated requirements exactly

Scale:

  • 0-20: Fundamentally broken
  • 21-40: Major gaps
  • 41-60: Adequate
  • 61-80: Good
  • 81-100: Excellent

Configuration: temperature=1.0 (required when thinking is enabled), max_tokens=4096, thinking budget=2048 tokens. Thinking-enabled judging dramatically improved consistency over the non-thinking baseline.
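The judge configuration differs from the enhancer mainly in the extended-thinking settings, which on Bedrock go through `additionalModelRequestFields`. A minimal sketch of the payload, assuming illustrative model ID and rubric wording:

```python
def build_judge_request(response_a, response_b):
    """Bedrock Converse payload for the comparative judge. Extended thinking
    is requested via additionalModelRequestFields and requires temperature=1.0.
    Model ID and rubric wording are illustrative assumptions."""
    rubric = (
        "Compare Response A and Response B side-by-side and score each 0-100, "
        "weighing correctness, completeness, usefulness, clarity, and "
        "precision, in that order."
    )
    return {
        "modelId": "us.anthropic.claude-sonnet-4-6",  # assumed Bedrock ID
        "system": [{"text": rubric}],
        "messages": [{"role": "user", "content": [{
            "text": (f"<response_a>\n{response_a}\n</response_a>\n"
                     f"<response_b>\n{response_b}\n</response_b>")
        }]}],
        "inferenceConfig": {"temperature": 1.0, "maxTokens": 4096},
        "additionalModelRequestFields": {
            "thinking": {"type": "enabled", "budget_tokens": 2048},
        },
    }
```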

Critical Design Decisions

Why realistic prompts, not optimized ones? The product's value is in helping users who don't write expert prompts. Testing with already-optimized prompts would show no improvement (ceiling effect) and wouldn't reflect actual usage.

Why a comparative judge instead of independent scoring? Earlier iterations using independent absolute scoring produced inconsistent results — the judge would penalize enhanced outputs for "verbosity" while rewarding baseline outputs for the same property. Comparative scoring with both outputs visible eliminated this inconsistency.

Why Claude Sonnet as the judge? Claude Sonnet 4.6 with extended thinking demonstrated the most consistent rule-following across our calibration tests. The judge's instructions explicitly reward thoroughness, structure, and going beyond minimums — and with thinking enabled, it consistently applies these criteria.

Why blind, randomized order? The judge sees Response A and Response B in random order. It doesn't know which condition produced which response. This eliminates position bias and any potential preference for known-source outputs.


Results

Per-Model Summary

| Model | Baseline | Enhanced | Gain | Wins | Losses | Significance |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 76.1 | 87.9 | +15.4% | 8/8 | 0 | p < 0.01 |
| Mistral Large 3 | 51.1 | 74.0 | +44.7% | 6/8 | 2 | p ≈ 0.06 |
| DeepSeek V3.2 | 77.9 | 81.5 | +4.7% | 5/8 | 3 | n.s. |
| Combined | 68.4 | 81.1 | +18.6% | 19/24 | 5 | p ≈ 0.003 |

Per-Scenario Results (Claude Sonnet)

| Scenario | Task | Baseline | Enhanced | Δ |
|---|---|---|---|---|
| Vague function request | code | 75 | 90 | +15 |
| Lazy refactor request | code | 72 | 88 | +16 |
| Minimal email request | generation | 72 | 78 | +6 |
| Terse summary request | generation | 72 | 91 | +19 |
| Casual code review | analysis | 82 | 95 | +13 |
| Vague performance question | analysis | 82 | 92 | +10 |
| Lazy extraction request | extraction | 72 | 82 | +10 |
| Minimal log parsing | extraction | 82 | 87 | +5 |

Every scenario improved. The single largest gain was on the summary task (+19), with code tasks close behind (+15, +16); extraction showed the smallest gains (+5, +10).
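The headline p < 0.01 can be checked from the per-scenario deltas alone. A minimal, stdlib-only exact Wilcoxon signed-rank test, assuming the eight deltas above are the full paired sample (zero deltas would need to be dropped first; there are none here):

```python
from itertools import product

def wilcoxon_exact(deltas):
    """Exact two-sided Wilcoxon signed-rank test for small samples.
    Assumes no zero deltas (standard Wilcoxon drops them)."""
    n = len(deltas)
    # Rank |deltas| ascending, giving tied values their average rank.
    order = sorted((abs(d), i) for i, d in enumerate(deltas))
    ranks = [0.0] * n
    j = 0
    while j < n:
        k = j
        while k < n and order[k][0] == order[j][0]:
            k += 1
        avg = (j + 1 + k) / 2  # average of 1-based ranks j+1..k
        for m in range(j, k):
            ranks[order[m][1]] = avg
        j = k
    w_plus = sum(r for r, d in zip(ranks, deltas) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, deltas) if d < 0)
    w = min(w_plus, w_minus)
    # Exact null distribution: each rank carries + or - with prob 1/2.
    count = sum(1 for signs in product([0, 1], repeat=n)
                if sum(r for r, s in zip(ranks, signs) if s) <= w)
    return w, 2 * count / 2 ** n

deltas = [15, 16, 6, 19, 13, 10, 10, 5]  # Claude Sonnet per-scenario deltas
w, p = wilcoxon_exact(deltas)  # w = 0.0, p = 2/256 ≈ 0.0078
```

With all eight deltas positive, the smaller rank sum is 0 and the exact two-sided p-value is 2/2⁸ ≈ 0.0078, consistent with the reported p < 0.01.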

What Drives the Gains

The judge's reasoning consistently identifies these patterns:

Where Refrase wins:

  • Surfaces implicit constraints — "professional email" becomes explicit about tone, structure, length
  • Adds model-native formatting — XML tags for Claude, markdown for GPT, thinking directives for Qwen
  • Sets quality floors — minimum expectations without capping ambition
  • Provides task framing — "review this code" becomes structured analysis with severity levels

Where Refrase ties or loses (less common):

  • Cases where the target model is already excellent and the raw prompt is unambiguous
  • Random LLM variance between runs (the same prompt produces slightly different outputs)
  • Cases where the enhanced prompt accidentally constrained the model's approach

Honest Limitations

  1. Three models, eight scenarios. Statistically significant for Claude Sonnet specifically. Other models had wider variance — more data needed to make per-model claims with high confidence.

  2. Judge is itself an LLM. Claude Sonnet judging Claude Sonnet outputs introduces potential self-preference bias. We mitigate via blind randomized ordering, but acknowledge this limitation.

  3. Single repetition per scenario. LLM outputs are non-deterministic. Some negative results are likely random variance rather than enhancer failures.

  4. English-only, technical prompts. Results may not generalize to creative writing, multilingual prompts, or highly specialized domains.

  5. Claude Sonnet 4.6 is the strongest tested model. Improvement on weaker models would likely be larger in absolute terms but harder to measure against this rubric.


Reproducibility

All experiment code, scenarios, prompts, model outputs, and judge reasoning are saved in: research/results/enhancer_v3_comparative_*.json

Anyone can re-run the experiment by following research/README.md with AWS Bedrock access.

Want to try what the research validated?

Enhance a real prompt using the same methodology described above.

Try Refrase free