How We Tested 46 Model Configurations

4 min read · Methodology · Models

Before releasing Refrase, we needed to answer a fundamental question: does prompt structure actually matter, and if so, how much? To find out, we designed a rigorous evaluation framework and tested 46 model configurations across 8 real-world scenarios.

The Methodology

We selected models from every major provider: Anthropic, OpenAI, Google, Meta, Mistral, Cohere, Alibaba, and Amazon. For models that support reasoning modes, we tested each configuration separately (standard, light thinking, full thinking), giving us 46 distinct configurations.
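The expansion from models to configurations can be sketched as a simple cross of each model with the reasoning modes it supports. This is an illustrative toy example; the model names, mode labels, and counts below are placeholders, not the actual 46-configuration matrix.

```python
# Hypothetical model/mode matrix for illustration only -- not the real
# evaluation set. Models without reasoning modes contribute one
# configuration; models with them contribute one per mode.
MODELS = {
    "claude-sonnet": ["standard", "light-thinking", "full-thinking"],
    "gpt-4o": ["standard"],
    "gemini-pro": ["standard", "full-thinking"],
}

# Each (model, reasoning-mode) pair is evaluated as its own configuration.
configs = [(model, mode) for model, modes in MODELS.items() for mode in modes]
```

With the placeholder matrix above this yields six configurations; the real evaluation applied the same expansion to every provider's lineup to reach 46.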

Each configuration was evaluated against 8 scenarios drawn from real-world use cases: resume extraction, career gap analysis, job posting parsing, interview preparation, and more. Every output was scored by two independent LLM judges using a structured rubric that assessed accuracy, completeness, formatting, and relevance.
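The dual-judge scoring described above can be sketched as a small loop: each judge rates an output on every rubric dimension, and the judges' mean scores are averaged. The function names, the 1-5 scale, and the `judge` callable are assumptions for illustration, not the actual pipeline.

```python
# Hypothetical sketch of dual-judge rubric scoring. The rubric dimensions
# match the post; everything else (scale, judge interface) is illustrative.
RUBRIC = ("accuracy", "completeness", "formatting", "relevance")

def score_output(output: str, judge) -> dict:
    """Ask one judge to rate an output on each rubric dimension (e.g. 1-5)."""
    return {dim: judge(output, dim) for dim in RUBRIC}

def evaluate(output: str, judges) -> float:
    """Average the independent judges' mean rubric scores."""
    per_judge = [
        sum(score_output(output, j).values()) / len(RUBRIC) for j in judges
    ]
    return sum(per_judge) / len(per_judge)
```

In practice each `judge` would wrap an LLM call that returns a numeric score for one dimension; keeping the judges independent is what makes the inter-judge agreement statistic meaningful.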

Key Findings

The results confirmed our hypothesis: prompt structure has a measurable impact on output quality, but the optimal structure varies significantly between models. Three findings stood out.

First, XML-structured prompts improved Claude model outputs by 12-18% on our quality rubric compared to unstructured prompts. This advantage disappeared entirely on GPT-4o, where markdown headers performed better.

Second, reasoning mode configuration mattered more than we expected. For some models, enabling extended thinking actually degraded structured output quality because the model over-reasoned and deviated from format requirements. The optimal thinking configuration was task-dependent.

Third, inter-judge agreement was remarkably high (Cohen's kappa = 1.0 in our best run), validating the reliability of our automated evaluation pipeline.
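For readers unfamiliar with the statistic: Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater's label frequencies. A minimal sketch (labels and function name are illustrative, not from the evaluation pipeline):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[lbl] * counts_b[lbl] for lbl in counts_a) / n**2
    if p_e == 1:
        return 1.0  # degenerate case: both raters always emit one label
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement across mixed labels yields kappa = 1.0.
cohens_kappa(["pass", "fail", "pass"], ["pass", "fail", "pass"])  # -> 1.0
```

A kappa of 1.0 means the judges agreed on every item beyond what chance would predict; a value near 0 would mean their agreement was no better than chance.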

What This Means for Refrase

These findings directly informed the adapter rules that power Refrase. Every rule in the system is backed by empirical evidence from this evaluation. When Refrase adds XML tags for Claude or restructures constraints for Gemini, it is applying optimizations that we have measured and validated.

We are preparing a full research paper with detailed results, statistical analysis, and reproducibility instructions. In the meantime, you can explore our research page for benchmark data and methodology details.
