Published 2026

Validating LLM-Powered Prompt Adaptation: A Three-Model Study with Paired A/B Comparison

Craig Certo

Abstract

We evaluate whether LLM-powered prompt adaptation produces measurably better output from frontier models compared to the user's original raw prompt. Across three target models (Claude Sonnet 4.6, Mistral Large 3, DeepSeek V3.2) and eight realistic scenarios spanning code, writing, analysis, and extraction, 19 of 24 paired comparisons favored the adapted prompt (p ≈ 0.003). Claude Sonnet 4.6 improved on every scenario (8/8, +15.4%, p < 0.01); Mistral Large 3 saw the largest absolute gain (+44.7%, 6/8 wins, p ≈ 0.06); DeepSeek V3.2 showed a small but directional improvement (+4.7%, 5/8 wins, n.s.). Order was randomized and the LLM judge — a separate Claude Sonnet 4.6 instance with extended thinking enabled — scored both outputs side-by-side on a 0–100 scale across five criteria (correctness, completeness, usefulness, clarity, precision) without knowing which came from which condition.

1. Introduction

Practitioners working across multiple LLM providers face a recurring problem: a prompt that produces excellent output on one model can underperform on another, even when the intent is identical. The standard advice — "write structured prompts," "use XML tags for Claude," "put the question last" — is folklore: empirically untested, inconsistently applied, and rarely measured.

We test whether an LLM-powered adaptation step — rewriting a user's casual prompt into a model-specific structured version using that model's own documentation as context — produces measurably better output. The contribution of this paper is empirical evidence on three frontier models, with a reproducible blind-judging methodology and a complete dataset release.

2. Methodology

Paired A/B with blind judging

For each test scenario we ran the same prompt two ways: Baseline (the user's raw original prompt sent directly to the target model) and Enhanced (the same intent rewritten by Refrase, the prompt-adaptation tool under test, and then sent to the same target model). A third, independent LLM judged both outputs side by side without knowing which came from which condition. Presentation order was randomized to eliminate position bias.
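A minimal sketch of one paired trial under this design, assuming hypothetical enhance, generate, and judge callables that stand in for the enhancer, target-model, and judge calls described below:

```python
import random

def run_paired_trial(scenario, target_model, enhance, generate, judge):
    # Baseline: the raw prompt goes straight to the target model.
    baseline_out = generate(target_model, scenario["prompt"])
    # Enhanced: the prompt is rewritten first, then sent to the same model.
    enhanced_out = generate(target_model, enhance(scenario))

    # Randomize presentation order so the judge cannot infer the condition
    # from position.
    pair = [("baseline", baseline_out), ("enhanced", enhanced_out)]
    random.shuffle(pair)

    scores = judge(scenario, output_a=pair[0][1], output_b=pair[1][1])
    # Map the judge's anonymous A/B scores back to the hidden conditions.
    return {pair[0][0]: scores["A"], pair[1][0]: scores["B"]}
```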

The enhancer

The enhancer is a single LLM call to Claude Haiku 4.5 (via Bedrock). It receives the user's original prompt, the target model's complete documentation (capabilities, limitations, prompt patterns, and anti-patterns, sourced from official provider docs), and the task type (code, writing, analysis, or extraction), and it rewrites the prompt applying universal prompt-engineering principles plus model-specific optimizations drawn from that documentation. Configuration: temperature=0.5, max_tokens=2048, no thinking budget (5–7 s latency).
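A sketch of that single call using the Bedrock Converse API. The model ID, function name, and system-prompt wording below are placeholders for illustration, not the production values:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def enhance_prompt(raw_prompt, model_docs, task_type,
                   model_id="<claude-haiku-4-5-model-id>"):
    system_text = (
        "Rewrite the user's prompt for the target model. Apply universal "
        "prompt-engineering principles plus the model-specific patterns in "
        f"the documentation below. Task type: {task_type}.\n\n{model_docs}"
    )
    response = bedrock.converse(
        modelId=model_id,
        system=[{"text": system_text}],
        messages=[{"role": "user", "content": [{"text": raw_prompt}]}],
        # Matches the reported enhancer configuration; no thinking budget.
        inferenceConfig={"temperature": 0.5, "maxTokens": 2048},
    )
    return response["output"]["message"]["content"][0]["text"]
```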

The judge

The judge is a separate Claude Sonnet 4.6 instance (via Bedrock) with extended thinking enabled. It scored both outputs on a 0–100 scale across five criteria in priority order: correctness, completeness, usefulness, clarity, precision. Configuration: temperature=1.0 (required for thinking), max_tokens=4096, thinking budget=2048 tokens. Thinking-enabled judging dramatically improved scoring consistency over the non-thinking baseline.
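A sketch of the judge call under the same assumptions (placeholder model ID, illustrative rubric wording); extended thinking is passed through Bedrock's additionalModelRequestFields:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_pair(task_prompt, output_a, output_b,
               model_id="<claude-sonnet-4-6-model-id>"):
    rubric = (
        "Score Output A and Output B from 0 to 100 on correctness, "
        "completeness, usefulness, clarity, and precision, in that priority "
        "order. You do not know which output came from which condition."
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": (
            f"{rubric}\n\nTask:\n{task_prompt}\n\n"
            f"Output A:\n{output_a}\n\nOutput B:\n{output_b}"
        )}]}],
        # Extended thinking requires temperature=1.0 on Anthropic models.
        inferenceConfig={"temperature": 1.0, "maxTokens": 4096},
        additionalModelRequestFields={
            "thinking": {"type": "enabled", "budget_tokens": 2048}
        },
    )
    # With thinking enabled, the content list holds a reasoning block followed
    # by the text block carrying the judge's scores.
    blocks = response["output"]["message"]["content"]
    return next(block["text"] for block in blocks if "text" in block)
```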

Test scenarios (8 total)

We deliberately chose realistic, casual prompts — the kind people actually type into AI chat interfaces, not carefully crafted prompts. Two scenarios per task type: code (write function, refactor), writing (compose email, summarize), analysis (code review, performance), extraction (parse text, parse logs). The full scenario list, prompts, and judge reasoning are in the dataset.
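An illustrative registry of the eight scenarios by task type; the labels below are shorthand, and the actual prompts and judge transcripts are in the released dataset:

```python
# Two scenarios per task type; labels only, the real prompts live in the dataset.
SCENARIOS = {
    "code":       ["write function", "refactor"],
    "writing":    ["compose email", "summarize"],
    "analysis":   ["code review", "performance"],
    "extraction": ["parse text", "parse logs"],
}
assert sum(len(v) for v in SCENARIOS.values()) == 8
```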

Models tested

Claude Sonnet 4.6 (Anthropic), Mistral Large 3 (Mistral), DeepSeek V3.2 (DeepSeek). All accessed via AWS Bedrock with identical inference configuration.

3. Results

Model               Baseline   Enhanced   Gain      Wins    p
Claude Sonnet 4.6   76.1       87.9       +15.4%    8/8     < 0.01
Mistral Large 3     51.1       74.0       +44.7%    6/8     ≈ 0.06
DeepSeek V3.2       77.9       81.5       +4.7%     5/8     n.s.
Combined            68.4       81.1       +18.6%    19/24   ≈ 0.003

(Baseline and Enhanced are mean judge scores on the 0–100 scale.)

Every scenario on Claude Sonnet 4.6 improved. The largest gains were on writing and analysis tasks. The combined Wilcoxon signed-rank test across all 24 paired comparisons rejects the null hypothesis at p ≈ 0.003.
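A sketch of that combined significance test, assuming the 24 paired mean scores (one pair per model × scenario) are loaded from the released dataset; the function name is ours:

```python
from scipy.stats import wilcoxon

def combined_p_value(baseline_scores, enhanced_scores):
    # One score per (model, scenario) pair: 3 models x 8 scenarios = 24 pairs.
    stat, p_value = wilcoxon(baseline_scores, enhanced_scores)
    return p_value
```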

4. Discussion

The judge's reasoning consistently identified four patterns where Refrase wins: surfacing implicit constraints (a casual "professional email" becomes explicit about tone, structure, and length), adding model-native formatting (XML tags for Claude, markdown for GPT, thinking directives for Qwen), setting quality floors (minimum expectations without capping ambition), and providing task framing (a vague "review this code" becomes structured analysis with severity levels).

Where Refrase ties or loses (less common): the target model is already excellent and the raw prompt is unambiguous; random LLM variance between runs; or the enhanced prompt accidentally over-constrains the model's approach.

The very large gain on Mistral Large 3 (+44.7%) reflects its baseline score (51.1) being substantially lower than those of the other two models: Mistral has more headroom to gain from structure than already-strong models, and its enhanced score (74.0) is still below Claude's baseline.

5. Limitations

  1. Three models, eight scenarios. The result is statistically significant for Claude Sonnet specifically; the other models showed wider variance, and more data are needed to make per-model claims with high confidence.
  2. Judge is itself an LLM. Claude Sonnet judging Claude Sonnet outputs introduces potential self-preference bias. We mitigate via blind randomized ordering, but acknowledge this limitation.
  3. Single repetition per scenario. LLM outputs are non-deterministic; some near-zero differences are likely random variance rather than enhancer failures.
  4. English-only, technical prompts. Results may not generalize to creative writing, multilingual prompts, or highly specialized domains.
  5. Claude Sonnet 4.6 is the strongest tested model. Improvement on weaker models would likely be larger in absolute terms but harder to measure against this rubric.