Refrase Adaptation Quality Study
Research Question
Does Refrase's LLM-enhanced prompt adaptation produce measurably better output from target models compared to the user's original raw prompt?
TL;DR
Yes. On Claude Sonnet 4.6, Refrase-enhanced prompts produced better output on 8 out of 8 scenarios with an average +15.4% quality improvement (p < 0.01, Wilcoxon signed-rank test). Across all three models tested, 19 out of 24 paired comparisons favored the enhanced prompt.
Methodology
Design: Paired A/B Comparison with Blind Judging
For each test scenario, we ran the same prompt two ways:
- Baseline: The user's raw, original prompt sent directly to the target model
- Enhanced: The same intent rewritten by Refrase, then sent to the target model
A third, independent LLM judged both outputs side-by-side without knowing which output came from which condition. Presentation order was randomized to eliminate position bias.
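The blinding step above can be sketched as a small helper: randomly assign the two outputs to positions A/B, keep the mapping, and use it to de-blind the judge's per-position scores afterwards. The function names are illustrative, not from the study's codebase.

```python
import random

def blind_pair(baseline_output, enhanced_output, rng):
    """Randomly assign the two outputs to positions A/B and return the
    key needed to de-blind the judge's verdict afterwards."""
    flipped = rng.random() < 0.5
    positions = {"A": enhanced_output if flipped else baseline_output,
                 "B": baseline_output if flipped else enhanced_output}
    key = {"A": "enhanced" if flipped else "baseline",
           "B": "baseline" if flipped else "enhanced"}
    return positions, key

def deblind(scores_by_position, key):
    """Map the judge's per-position scores back to experimental conditions."""
    return {key[pos]: score for pos, score in scores_by_position.items()}

positions, key = blind_pair("raw output", "enhanced output", random.Random(7))
scores = deblind({"A": 78, "B": 91}, key)  # judge scored positions, not conditions
```

The judge only ever sees `positions`; `key` stays with the experiment harness until scoring is done.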
Test Scenarios (8 total)
We deliberately chose realistic, casual prompts — the kind people actually type into AI chat interfaces, not carefully crafted prompts. Two scenarios per task type:
| Task | Scenario | Example prompt |
|---|---|---|
| Code | Write function | "write me an email validator in python" |
| Code | Refactor code | "clean up this code [snippet]" |
| Writing | Compose email | "write an email to my client about a project delay" |
| Writing | Summarize | "turn these into action items [meeting notes]" |
| Analysis | Code review | "anything wrong with this? [code]" |
| Analysis | Performance | "this query is slow, help [SQL]" |
| Extraction | Parse text | "get me the people from this [text]" |
| Extraction | Parse logs | "parse these logs into json [logs]" |
Models Tested
| Model | Provider | Bedrock ID |
|---|---|---|
| Claude Sonnet 4.6 | Anthropic | us.anthropic.claude-sonnet-4-6 |
| DeepSeek V3.2 | DeepSeek | deepseek.v3.2 |
| Mistral Large 3 | Mistral | mistral.mistral-large-3-675b-instruct |
The Enhancer
A single LLM call to Claude Haiku 4.5 (Bedrock) that receives:
- The user's original prompt
- The target model's complete documentation (capabilities, limitations, prompt patterns, anti-patterns, sources from official provider docs)
- The task type (code, writing, analysis, extraction)
The enhancer rewrites the prompt applying universal prompt engineering principles plus model-specific optimizations from the documentation. It returns the rewritten prompt plus a list of changes made.
Configuration: temperature=0.5, max_tokens=2048, no thinking budget (5-7s latency).
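A minimal sketch of assembling that single enhancer call as a Bedrock Converse request. The prompt wording and helper name are illustrative, and the Haiku model ID is an assumption (the study does not list it); only the configuration values come from the text above.

```python
def build_enhancer_request(user_prompt: str, model_doc: str, task_type: str) -> dict:
    """Assemble a Bedrock Converse request carrying the enhancer's three
    inputs: the raw prompt, the target model's docs, and the task type."""
    system_text = (
        "Rewrite the user's prompt for the target model. Apply universal "
        "prompt engineering principles plus the model-specific guidance "
        "below. Return the rewritten prompt and a list of changes made.\n\n"
        f"Target model documentation:\n{model_doc}"
    )
    return {
        "modelId": "us.anthropic.claude-haiku-4-5",  # ASSUMED ID, not from the study
        "system": [{"text": system_text}],
        "messages": [{
            "role": "user",
            "content": [{"text": f"Task type: {task_type}\n\nPrompt:\n{user_prompt}"}],
        }],
        # Configuration from the study: temperature=0.5, max_tokens=2048.
        "inferenceConfig": {"temperature": 0.5, "maxTokens": 2048},
    }

# The actual call would be something like:
#   boto3.client("bedrock-runtime").converse(**build_enhancer_request(...))
req = build_enhancer_request("write me an email validator in python",
                             "(model docs here)", "code")
```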
The Judge
A separate Claude Sonnet 4.6 (Bedrock) instance evaluates both outputs side-by-side on a 0-100 scale.
Scoring criteria (in order of importance):
- Correctness — Factually accurate, free of errors
- Completeness — Fully addresses what was asked
- Usefulness — Immediately actionable, production-ready
- Clarity — Well-organized, easy to follow
- Precision — Follows stated requirements exactly
Scale:
- 0-20: Fundamentally broken
- 21-40: Major gaps
- 41-60: Adequate
- 61-80: Good
- 81-100: Excellent
Configuration: temperature=1.0 (required for thinking), max_tokens=4096, thinking budget=2048 tokens. Thinking-enabled judging dramatically improved consistency over non-thinking baseline.
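The judge call can be sketched the same way. The rubric wording is illustrative; the model ID is taken from the models table, and the `thinking` field follows the shape used for Anthropic models behind Bedrock's provider-specific pass-through (`additionalModelRequestFields`), which is an assumption about this study's implementation.

```python
def build_judge_request(original_prompt: str, response_a: str, response_b: str) -> dict:
    """Assemble a Bedrock Converse request for side-by-side comparative
    judging with extended thinking enabled."""
    rubric = (
        "Score Response A and Response B from 0-100 on, in order of "
        "importance: correctness, completeness, usefulness, clarity, "
        "precision. 0-20 fundamentally broken, 21-40 major gaps, "
        "41-60 adequate, 61-80 good, 81-100 excellent."
    )
    user_text = (
        f"Prompt under test:\n{original_prompt}\n\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}"
    )
    return {
        "modelId": "us.anthropic.claude-sonnet-4-6",
        "system": [{"text": rubric}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        # Study configuration: temperature=1.0 (required with thinking),
        # max_tokens=4096, thinking budget=2048 tokens.
        "inferenceConfig": {"temperature": 1.0, "maxTokens": 4096},
        "additionalModelRequestFields": {
            "thinking": {"type": "enabled", "budget_tokens": 2048}
        },
    }

req = build_judge_request("write me an email validator in python", "...", "...")
```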
Critical Design Decisions
Why realistic prompts, not optimized ones? The product's value is in helping users who don't write expert prompts. Testing with already-optimized prompts would show no improvement (ceiling effect) and wouldn't reflect actual usage.
Why a comparative judge instead of independent scoring? Earlier iterations using independent absolute scoring produced inconsistent results — the judge would penalize enhanced outputs for "verbosity" while rewarding baseline outputs for the same property. Comparative scoring with both outputs visible eliminated this inconsistency.
Why Claude Sonnet as the judge? Claude Sonnet 4.6 with extended thinking demonstrated the most consistent rule-following across our calibration tests. The judge's instructions explicitly reward thoroughness, structure, and going beyond minimums — and with thinking enabled, it consistently applies these criteria.
Why blind, randomized order? The judge sees Response A and Response B in random order. It doesn't know which condition produced which response. This eliminates position bias and any potential preference for known-source outputs.
Results
Per-Model Summary
| Model | Baseline | Enhanced | Gain | Wins | Losses | Significance |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 76.1 | 87.9 | +15.4% | 8/8 | 0 | p < 0.01 |
| Mistral Large 3 | 51.1 | 74.0 | +44.7% | 6/8 | 2 | p ≈ 0.06 |
| DeepSeek V3.2 | 77.9 | 81.5 | +4.7% | 5/8 | 3 | n.s. |
| Combined | 68.4 | 81.1 | +18.6% | 19/24 | 5 | p ≈ 0.003 |
Per-Scenario Results (Claude Sonnet)
| Scenario | Task | Baseline | Enhanced | Δ |
|---|---|---|---|---|
| Vague function request | code | 75 | 90 | +15 |
| Lazy refactor request | code | 72 | 88 | +16 |
| Minimal email request | writing | 72 | 78 | +6 |
| Terse summary request | writing | 72 | 91 | +19 |
| Casual code review | analysis | 82 | 95 | +13 |
| Vague performance question | analysis | 82 | 92 | +10 |
| Lazy extraction request | extraction | 72 | 82 | +10 |
| Minimal log parsing | extraction | 82 | 87 | +5 |
Every scenario improved. The largest single gain was on summarization (+19), with the code scenarios close behind (+15, +16); extraction showed the smallest gains.
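The p < 0.01 figure for Claude Sonnet can be reproduced from the table above: with all eight paired differences positive, an exact two-sided Wilcoxon signed-rank test gives p = 2/256 ≈ 0.008. A stdlib-only sketch (it enumerates all 2^8 sign patterns, which is only feasible for small n, and assumes no zero differences, true for this data):

```python
from itertools import product

# Paired scores from the per-scenario table above (Claude Sonnet 4.6).
baseline = [75, 72, 72, 72, 82, 82, 72, 82]
enhanced = [90, 88, 78, 91, 95, 92, 82, 87]

def wilcoxon_exact_p(xs, ys):
    """Two-sided exact Wilcoxon signed-rank p-value by enumerating
    every sign pattern (2^n of them; fine for n = 8)."""
    diffs = [y - x for x, y in zip(xs, ys)]
    # Average ranks of |d|, with tied magnitudes sharing their mean rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    t_obs = sum(r for d, r in zip(diffs, ranks) if d > 0)
    total = sum(ranks)
    # Under H0 all sign patterns are equally likely; count those whose
    # positive-rank sum is at least as far from total/2 as observed.
    extreme = sum(
        1 for signs in product([0, 1], repeat=len(diffs))
        if abs(sum(r for s, r in zip(signs, ranks) if s) - total / 2)
           >= abs(t_obs - total / 2)
    )
    return extreme / 2 ** len(diffs)

p = wilcoxon_exact_p(baseline, enhanced)  # 0.0078125, i.e. p < 0.01
```

Because every difference is positive, only the all-positive and all-negative sign patterns are as extreme as the observed data, hence 2/256.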
What Drives the Gains
The judge's reasoning consistently identifies these patterns:
Where Refrase wins:
- Surfaces implicit constraints — "professional email" becomes explicit about tone, structure, length
- Adds model-native formatting — XML tags for Claude and, per each model's documentation, the preferred conventions of other families (e.g. markdown for GPT, thinking directives for Qwen)
- Sets quality floors — minimum expectations without capping ambition
- Provides task framing — "review this code" becomes structured analysis with severity levels
Where Refrase ties or loses (less common):
- Cases where the target model is already excellent and the raw prompt is unambiguous
- Random LLM variance between runs (the same prompt produces slightly different outputs)
- Cases where the enhanced prompt accidentally constrained the model's approach
Honest Limitations
- Three models, eight scenarios. Statistically significant for Claude Sonnet specifically. Other models had wider variance — more data needed to make per-model claims with high confidence.
- Judge is itself an LLM. Claude Sonnet judging Claude Sonnet outputs introduces potential self-preference bias. We mitigate via blind randomized ordering, but acknowledge this limitation.
- Single repetition per scenario. LLM outputs are non-deterministic. Some negative results are likely random variance rather than enhancer failures.
- English-only, technical prompts. Results may not generalize to creative writing, multilingual prompts, or highly specialized domains.
- Claude Sonnet 4.6 is the strongest model tested. Improvement on weaker models would likely be larger in absolute terms but harder to measure against this rubric.
Reproducibility
All experiment code, scenarios, prompts, model outputs, and judge reasoning are saved in:
research/results/enhancer_v3_comparative_*.json
Anyone can re-run the experiment by following research/README.md with AWS Bedrock access.
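For a quick look at saved runs without re-running anything, the result files can be globbed and aggregated. The JSON field names below (`model`, `comparisons`, `baseline_score`, `enhanced_score`) are ASSUMED for illustration — inspect one results file for the real schema.

```python
import glob
import json
import os

def summarize(results_dir="research/results"):
    """Mean enhanced-minus-baseline score delta per model, computed from
    every saved comparative run in results_dir. Field names are assumed."""
    deltas = {}
    for path in glob.glob(os.path.join(results_dir, "enhancer_v3_comparative_*.json")):
        with open(path) as f:
            run = json.load(f)
        for row in run.get("comparisons", []):      # assumed key
            deltas.setdefault(run.get("model", path), []).append(
                row["enhanced_score"] - row["baseline_score"])  # assumed keys
    return {model: sum(d) / len(d) for model, d in deltas.items()}
```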