GPT-4o

OpenAI · openai family · Official docs

GPT-4o is OpenAI's previous-generation multimodal flagship, now superseded by GPT-4.1 (cheaper, 1M context, better benchmarks). Key difference from Claude: GPT takes system instructions as a role-based message inside the messages array rather than as a separate system parameter. GPT-4.1+ follows instructions more literally than GPT-4o, so requirements must be spelled out explicitly rather than left to inference. OpenAI recommends markdown headers for prompt structure (similar to Claude) and also supports XML tags for context wrapping (unlike Llama, which has no strong XML support). For structured outputs, OpenAI enforces strict schema constraints (all fields required, additionalProperties: false, max 100 properties, max depth 5) that differ from Claude's approach. GPT models benefit significantly from the 3-instruction agentic pattern (persistence + tool honesty + planning). The prompt caching strategy (static content first, variable content last) is analogous to Anthropic's caching but uses different mechanics. For Refrase, the GPT adapter should use markdown-structured prompts with explicit section headers, prefer XML over JSON for context wrapping, and rely on the strict structured output mode rather than hoping for JSON compliance.

Try Refrase on a GPT-4o prompt

Paste any prompt — Refrase rewrites it using GPT-4o's documentation as context. 4–7 seconds end-to-end.

Specifications

  • Context window: 128K tokens
  • Max output: 16K tokens
  • Price per 1M tokens (in/out): $2.50 / $10

Strengths

extraction · analysis · generation · code

Key capabilities

  • Multimodal flagship model supporting text, vision, and audio inputs (source: OpenAI GPT-4o Model Page, Overview)
  • Structured Outputs with strict JSON schema enforcement via the response_format parameter, achieving near-100% schema adherence; see the sketch after this list (source: OpenAI Structured Outputs Guide, Introduction)
  • Function/tool calling with strict:true for guaranteed argument conformance (source: OpenAI Structured Outputs Guide, Function Calling)
  • 128K context window for processing long documents and conversations (source: OpenAI Models Page, GPT-4o)
  • Prompt caching with 50% discount on repeated context, reducing cost to $1.25/1M cached input tokens (source: OpenAI Pricing Page, GPT-4o)
  • Image understanding and visual reasoning from uploaded images (source: OpenAI GPT-4o System Card, Section 1)
  • Supports both system messages and developer messages for instruction delivery (source: Azure OpenAI Reasoning Models docs, System Messages)
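
The Structured Outputs bullet above translates to a response_format payload like the minimal sketch below, using the official openai Python SDK. The rewrite_result schema is a hypothetical example, but the strict-mode constraints it illustrates (every property listed in required, additionalProperties set to false) are the ones described in this page.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical schema for illustration. Strict mode requires every property
# to appear in "required" and additionalProperties to be false.
schema = {
    "name": "rewrite_result",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "rewritten_prompt": {"type": "string"},
            "changes": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["rewritten_prompt", "changes"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Rewrite the user's prompt for clarity."},
        {"role": "user", "content": "summarize this doc pls"},
    ],
    response_format={"type": "json_schema", "json_schema": schema},
)

print(response.choices[0].message.content)  # a JSON string matching the schema
```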

Known limitations

  • Hallucination rate of approximately 15.8% on the Vectara FaithJudge Leaderboard; citation fabrication remains common across test conditions (source: Vectara FaithJudge Leaderboard 2025; StudyFinds GPT-4o Hallucination Study, Nov 2025)
  • Default output is capped at 4,096 tokens; developers must explicitly set max_tokens (up to 16,384) to get longer responses (source: OpenAI Developer Community, GPT-4o Output Token Discussion)
  • Superseded by GPT-4.1 which is cheaper ($2/$8 vs $2.50/$10), has 1M token context (vs 128K), and scores higher on coding and instruction-following benchmarks (source: OpenAI GPT-4.1 Announcement, Introducing GPT-4.1)
  • Falls into trial-and-error debugging when attempting autonomous/agentic actions, frequently hallucinating API calls and file paths (source: OpenAI GPT-4o System Card, Autonomous Action Evaluation)
  • Knowledge cutoff of June 2024 means no awareness of events after that date without web search augmentation (source: OpenAI Model Release Notes; otterly.ai LLM Knowledge Cutoff Dates)

How to prompt GPT-4o

Preferred instruction format

OpenAI GPT models use role-based chat completion messages with 'system' (or 'developer' for reasoning models) role. System messages can appear at the start of the conversation and set the model's behavior. Unlike Claude which uses a separate top-level system parameter, GPT embeds system instructions as the first message in the messages array with role='system'. GPT-4.1+ follows instructions more literally than GPT-4o, requiring explicit specification rather than relying on inference. (source: OpenAI GPT-4.1 Prompting Guide, Instruction Hierarchy; Azure OpenAI Reasoning Models docs, Developer Messages)
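
As a concrete illustration, here is a minimal sketch using the official openai Python SDK (the prompt text is invented): the system instruction is simply the first entry in the messages array, not a separate top-level parameter as in Anthropic's API.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # pin a snapshot for consistent behavior
    messages=[
        # System instructions live inside the messages array.
        {"role": "system", "content": "You are a prompt-rewriting assistant. Be concise."},
        {"role": "user", "content": "Rewrite: 'make this email sound better'"},
    ],
)

print(response.choices[0].message.content)
```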

Recommended practices

  • Structure system prompts with clear hierarchical sections: Role and Objective, Instructions, Reasoning Steps, Output Format, Examples, Context, Final Instructions; a sketch of this structure follows this list (source: OpenAI GPT-4.1 Prompting Guide, System Message Structure)
  • Use markdown formatting for prompt structure: H1-H4 titles for sections, inline backticks for code, numbered/bulleted lists for instructions (source: OpenAI GPT-4.1 Prompting Guide, Delimiter Conventions)
  • Use XML tags for wrapping structured data in context -- they perform well for precise start/end wrapping, metadata attributes, and nesting, e.g. <doc id='1' title='Name'>Content</doc> (source: OpenAI GPT-4.1 Prompting Guide, Delimiter Conventions)
  • For long context, place instructions at BOTH the beginning and end of provided context for best performance (source: OpenAI GPT-4.1 Prompting Guide, Long Context Handling)
  • Use the API tools field for function/tool definitions rather than injecting schemas into the prompt manually -- API-parsed tools outperform manual injection by ~2% on SWE-bench; see the tool-definition sketch after this list (source: OpenAI GPT-4.1 Prompting Guide, Tool & Function Calling)
  • For agentic use, include three key system prompt instructions: (1) persistence -- keep going until resolved, (2) tool-calling honesty -- don't guess, use tools, (3) planning -- plan extensively before each function call. These alone increased SWE-bench scores by ~20% (source: OpenAI GPT-4.1 Prompting Guide, Agentic Behavior Patterns)
  • Use few-shot examples for standard GPT models (GPT-4o, GPT-4.1) to demonstrate desired patterns. But AVOID few-shot for reasoning models (o1, o3, o4-mini) where zero-shot performs better (source: OpenAI Help Center, Prompt Engineering Best Practices; OpenAI Reasoning Best Practices)
  • Specify context reliance explicitly: tell the model whether to use only provided documents or also its own knowledge (source: OpenAI GPT-4.1 Prompting Guide, Context Reliance Tuning)
  • Pin production applications to specific model snapshots (e.g., gpt-4o-2024-08-06) for consistent behavior (source: OpenAI Prompt Engineering Guide, Production Recommendations)
  • Structure prompts for caching: place static content first (system instructions, few-shot examples, tool definitions) and variable content last (user messages) (source: OpenAI Prompt Engineering Guide, Caching Strategy)
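
Putting several of these practices together, here is a minimal sketch of caching-friendly prompt assembly: a markdown-sectioned system prompt, the three agentic instructions, XML-wrapped context documents, and a key instruction repeated after the context. All of the prompt wording, the build_messages helper, and the document fields (id, title, text) are illustrative, not taken from OpenAI's docs.

```python
# Illustrative system prompt following the hierarchical section layout above.
SYSTEM_PROMPT = """\
# Role and Objective
You rewrite user prompts so they follow GPT-4o best practices.

# Instructions
- Keep going until the user's request is fully resolved before ending your turn.
- If you are missing information, use your tools to get it; do NOT guess.
  If no tool can help, ask the user instead of inventing an answer.
- Plan extensively before each tool call and reflect on the results afterwards.
- Rely only on the documents provided in <docs>, not on outside knowledge.

# Output Format
Return the rewritten prompt, then a bulleted list of the changes you made.

# Final Instructions
Re-read the Instructions section above before you answer.
"""


def build_messages(documents: list[dict], user_prompt: str) -> list[dict]:
    # XML tags for wrapping context documents, per the delimiter guidance above.
    docs_block = "\n".join(
        f"<doc id='{d['id']}' title='{d['title']}'>{d['text']}</doc>" for d in documents
    )
    # For long context, repeat the key instruction after the documents as well.
    reminder = "Use only the documents above; ask the user if they are insufficient."
    return [
        # Static content first (cache-friendly), variable content last.
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<docs>\n{docs_block}\n</docs>\n\n{reminder}\n\n{user_prompt}"},
    ]
```

Keeping the static system prompt (and tool definitions) at the front and the variable user content at the end is what makes repeated calls eligible for the cached-input discount listed in the specifications.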
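For the tools-field recommendation, a sketch of a strict function definition passed through the API rather than pasted into the prompt; the search_model_docs tool and its parameters are hypothetical.

```python
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_model_docs",  # hypothetical tool
            "description": "Search the target model's documentation for prompting guidance.",
            "strict": True,  # strict mode: arguments must conform to the schema
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer"},
                },
                "required": ["query", "max_results"],
                "additionalProperties": False,
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Answer using the documentation search tool when needed."},
        {"role": "user", "content": "How should I structure a GPT-4o system prompt?"},
    ],
    tools=tools,
)

# If the model chose to call the tool, its arguments are guaranteed to match the schema.
print(response.choices[0].message.tool_calls)
```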

Anti-patterns to avoid

  • Do NOT inject tool/function schemas directly into system prompts -- use the API tools parameter instead (source: OpenAI GPT-4.1 Prompting Guide, Tool & Function Calling)
  • Do NOT use all-caps, bribes, or tips to emphasize instructions -- these are generally unnecessary with GPT-4.1+ (source: OpenAI GPT-4.1 Prompting Guide, Instruction Hierarchy)
  • Do NOT instruct the model to 'must call a tool before responding' without fallback -- this causes hallucinated tool calls when insufficient information exists. Instead instruct: 'if you don't have enough information, ask the user' (source: OpenAI GPT-4.1 Prompting Guide, Common Anti-Patterns)
  • Do NOT use JSON format for wrapping multiple documents in context -- it performs poorly due to verbosity and escaping overhead. Use XML or pipe-delimited format instead (source: OpenAI GPT-4.1 Prompting Guide, Delimiter Conventions)
  • Do NOT add 'think step by step' to reasoning models (o1, o3, o4-mini) -- they already reason internally and adding such instructions can hurt performance (source: OpenAI Reasoning Best Practices; Azure OpenAI Reasoning Models docs)
  • Do NOT use few-shot prompting with reasoning models -- start with zero-shot and add only 1-2 examples if needed (source: OpenAI Reasoning Best Practices)
  • Do NOT include sample phrases without telling the model to vary them -- GPT models tend to repeat exact phrases verbatim (source: OpenAI GPT-4.1 Prompting Guide, Common Anti-Patterns)

Skip the manual application.

Refrase reads everything above and applies it for you. Try it on one of your own prompts.