Research

Refrase's adaptation rules are derived from empirical testing across 46 model configurations on structured output tasks.

Methodology

We evaluated 46 model configurations across 12 model families on 8 structured output scenarios spanning extraction, analysis, and generation tasks. Each configuration was tested with multiple thinking-mode and temperature variants.

Outputs were scored on a 3-layer evaluation framework: L1 (service-specific criteria), L2 (universal 10-rule quality rubric), and L3 (binary decision). Two independent LLM judges (Claude Sonnet 4.6 and Claude Haiku 4.5) provided inter-rater reliability via Cohen's Kappa.

Three-Layer Scoring Pipeline

L1: Task-Specific Criteria

Service-specific evaluation criteria loaded from JSON configuration. Assesses domain accuracy, required fields, and format compliance.
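Loading criteria from a JSON configuration might look like the sketch below. The schema is not published, so the field names ("service", "criteria", "id", "description") and the example criteria are assumptions for illustration only.

```python
import json

# Hypothetical L1 criteria file; the real schema is not published, so every
# field name and criterion below is an assumption.
criteria_json = """
{
  "service": "extraction",
  "criteria": [
    {"id": "domain_accuracy",   "description": "Extracted values match the source"},
    {"id": "required_fields",   "description": "All required fields are present"},
    {"id": "format_compliance", "description": "Output parses as valid JSON"}
  ]
}
"""

config = json.loads(criteria_json)
# Build the per-service checklist a judge would score against.
checklist = [c["id"] for c in config["criteria"]]
print(config["service"], checklist)
```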

L2: Universal Quality Rubric

10-rule quality rubric scored 0-30. Evaluates coherence, completeness, instruction adherence, formatting, and relevance across all output types.
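A 10-rule rubric totaling 0-30 presumably means each rule is scored 0-3; that breakdown is an assumption, and only the five rule names listed above come from this page (the rest are placeholders):

```python
# Assumed breakdown: 10 rules, each scored 0-3, summing to the 0-30 L2 total.
# Only the first five rule names appear on the page; the rest are placeholders.
L2_RULES = [
    "coherence", "completeness", "instruction_adherence", "formatting",
    "relevance", "rule_6", "rule_7", "rule_8", "rule_9", "rule_10",
]

def l2_total(scores: dict) -> int:
    """Sum per-rule scores into the 0-30 L2 total."""
    assert set(scores) == set(L2_RULES), "one score per rule"
    assert all(0 <= s <= 3 for s in scores.values()), "each rule scored 0-3"
    return sum(scores.values())

perfect = {rule: 3 for rule in L2_RULES}
print(l2_total(perfect))  # -> 30
```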

L3: Binary Success/Failure

Final pass/fail determination. Would a domain expert accept this output for production use? Synthesizes L1 and L2 signals into an actionable verdict.
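One way the L3 verdict could synthesize the L1 and L2 signals is sketched below. The combination rule (all L1 criteria must pass, L2 must clear a threshold) and the 24/30 threshold are illustrative assumptions, not the published decision rule:

```python
def l3_verdict(l1_criteria_passed: list, l2_score: int,
               l2_threshold: int = 24) -> str:
    """Binary production verdict from L1 checks and the 0-30 L2 score.

    The all-criteria-must-pass rule and the 24/30 threshold are
    illustrative assumptions, not the published decision logic.
    """
    ok = all(l1_criteria_passed) and l2_score >= l2_threshold
    return "pass" if ok else "fail"

print(l3_verdict([True, True, True], 27))   # -> pass
print(l3_verdict([True, False, True], 29))  # -> fail (an L1 criterion failed)
```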

Dual-Judge Inter-Rater Reliability

Every evaluation is independently performed by two LLM judges: Claude Sonnet 4.6 and Claude Haiku 4.5. Each judge applies the same three-layer scoring pipeline to the same model outputs, but makes its assessments independently.

We measure agreement using Cohen's Kappa, a statistical measure that accounts for chance agreement. A Kappa of 1.0 indicates perfect agreement, while 0.0 indicates agreement no better than chance. Our best evaluation run achieved Kappa = 1.0, validating both the rubric design and the judges' reliability.

When judges disagree, we flag the evaluation for manual review. Systematic disagreements help us identify ambiguous criteria and refine the rubric for future runs.

Model Leaderboard

| Rank | Model | Provider | Quality | Gain |
|------|-------|----------|---------|------|
| 1 | Claude Sonnet 4.6 | Anthropic | 94 | +12% |
| 2 | Gemini Pro | Google | 93 | |
| 3 | GPT-4o | OpenAI | 91 | +8% |
| 4 | Qwen3 235B | Alibaba | 88 | +15% |
| 5 | DeepSeek V3 | DeepSeek | 87 | +10% |
| 6 | Mistral Large 3 | Mistral | 86 | +7% |
| 7 | Llama 3.1 405B | Meta | 85 | +9% |
| 8 | Kimi K2 | Moonshot | 84 | +11% |
| 9 | GLM 4.7 | Z.AI | 83 | +13% |
| 10 | Nemotron 30B | NVIDIA | 82 | +6% |

Papers

Published 2026
Multi-Provider Prompt Optimization for Structured Output Tasks

Craig Certo

We present a systematic evaluation of prompt structure effects across 46 model configurations from 12 provider families. Our three-layer scoring framework reveals that optimal prompt structure varies significantly between models, with XML-structured prompts improving Claude outputs by 12-18% while markdown headers yield better results on GPT-4o.

Coming Soon
Service-Level Optimization: Extending Prompt Adaptation Across Five Structured Output Services

Expanding evaluation from single-service benchmarks to five distinct structured output services: extraction, knowledge gap analysis, job posting extraction, job analysis, and resume generation. The top 3-5 models from Paper 1 are tested across all services.

Coming Soon
End-to-End Effectiveness: Baseline vs Enhanced vs Full Pipeline

Measuring real-world effectiveness of the winning model configuration across the complete pipeline. Compares baseline prompts, adapted prompts, and fully optimized pipeline outputs in production scenarios.
