Gemini 2.5 Flash
Google · gemini family · Official Docs
Gemini 2.5 Flash is the current cost-performance leader, and the Refrase adapter does NOTHING to it — prompts pass through unchanged. The same critical fixes needed for Pro apply here: (1) use native structured output via response_mime_type + response_schema instead of prompt-based JSON, (2) move system instructions into the dedicated API parameter, (3) add XML tag delimiters for prompt structure. Flash additionally benefits from thinking budget control — setting thinkingBudget=0 for simple extraction could roughly halve response time and cost with no quality loss. Flash's free tier makes it ideal for development and testing. The adapter should detect task complexity and set thinking budgets accordingly.
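A complexity-aware budget selector of the kind described above could be sketched as follows. The tier names and mid-range token values are illustrative assumptions — only 0 (thinking disabled), the 24,576 ceiling, and -1 (dynamic allocation, the API default) come from the Gemini docs:

```python
# Hypothetical sketch of complexity-based thinking-budget selection.
# Tier names and mid-range budgets are assumptions; only 0 (thinking off),
# 24,576 (Flash's documented maximum), and -1 (dynamic) are documented values.
FLASH_MAX_THINKING_BUDGET = 24_576

def pick_thinking_budget(task: str) -> int:
    """Map a rough task category to a thinkingBudget value."""
    budgets = {
        "extraction": 0,        # simple extraction: disable thinking entirely
        "formatting": 0,        # reformatting: disable thinking entirely
        "classification": 512,  # light reasoning
        "analysis": 4_096,      # moderate reasoning
        "planning": FLASH_MAX_THINKING_BUDGET,  # hardest tasks: full budget
    }
    return budgets.get(task, -1)  # -1 lets the API allocate dynamically

print(pick_thinking_budget("extraction"))  # 0
```

The unknown-task fallback of -1 defers to the model's dynamic allocation, which is the documented default behavior when no budget is set.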
Specifications
Key Capabilities
- ✓Same 1M token context window as Pro but at ~4x lower cost ($0.30 vs $1.25 per 1M input tokens), making it the cost-performance leader for long-context tasks (source: Google AI pricing, https://ai.google.dev/gemini-api/docs/pricing)
- ✓First Flash model with built-in thinking/reasoning — configurable thinking budget from 0 to 24,576 tokens, with the ability to fully disable thinking (thinkingBudget=0) unlike Pro which cannot disable it (source: Google AI docs, Thinking guide, https://ai.google.dev/gemini-api/docs/thinking)
- ✓Identical structured output support to Pro — response_mime_type='application/json' with response_schema, supporting JSON Schema with anyOf, $ref, enum, format, and property ordering (source: Google AI docs, Structured Output, https://ai.google.dev/gemini-api/docs/structured-output)
- ✓Fast inference: first token in 0.21-0.37 seconds, 163 tokens/second throughput — approximately 3x faster than Pro (source: Artificial Analysis benchmarks, https://artificialanalysis.ai/models/gemini-2-5-flash)
- ✓Free tier available — free of charge for input/output tokens with rate limits, plus 500 free Google Search grounding requests per day (source: Google AI pricing, https://ai.google.dev/gemini-api/docs/pricing)
- ✓Full multimodal input: text, code, images (up to 3,000 per prompt), audio (~8.4 hours), video (~45 min with audio), and documents (1,000 pages per file) (source: Vertex AI docs, Gemini 2.5 Flash, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash)
- ✓OpenAI API compatibility endpoint with reasoning_effort mapping to Gemini thinking levels (source: Google AI docs, OpenAI Compatibility, https://ai.google.dev/gemini-api/docs/openai)
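The capabilities above combine naturally in a single request: native structured output plus a zero thinking budget for fast, cheap extraction. A sketch of the REST request body, assuming an invoice-extraction task (the schema and prompt text are illustrative; field names follow the Gemini REST API):

```python
import json

# Sketch: request body for POST .../models/gemini-2.5-flash:generateContent
# The invoice schema and prompt are illustrative; field names follow the
# Gemini REST API (camelCase), with responseSchema types in OpenAPI style.
body = {
    "systemInstruction": {"parts": [{"text": "You extract invoice fields."}]},
    "contents": [{"role": "user", "parts": [{"text": "<invoice>...</invoice>"}]}],
    "generationConfig": {
        "responseMimeType": "application/json",
        "responseSchema": {
            "type": "OBJECT",
            "properties": {
                "vendor": {"type": "STRING"},
                "total": {"type": "NUMBER"},
            },
            "required": ["vendor", "total"],
        },
        # Simple extraction: disable thinking entirely (Flash-only option).
        "thinkingConfig": {"thinkingBudget": 0},
    },
}
print(json.dumps(body["generationConfig"], indent=2))
```

Because the JSON contract is enforced by responseSchema rather than prompt instructions, the prompt itself stays short — which compounds with thinkingBudget=0 to minimize both latency and cost.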
Known Limitations
- ⚠Same structured output + function calling incompatibility as Pro on Gemini 2.5 — structured outputs fail when tool calls are present in message history (source: GitHub issue googleapis/python-genai#867, https://github.com/googleapis/python-genai/issues/867)
- ⚠Shallower reasoning depth compared to Pro — thinking budget maxes at 24,576 tokens vs Pro's 32,768; default is dynamic allocation that may use fewer thinking tokens for complex tasks (source: Google AI docs, Thinking guide, https://ai.google.dev/gemini-api/docs/thinking)
- ⚠Same JSON schema complexity limits as Pro — overly complex schemas with long names, deep nesting, many optional properties, or many enum values trigger InvalidArgument: 400 errors (source: Vertex AI docs, Structured Output, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output)
- ⚠Less verbose by default than previous Gemini models — may produce overly terse responses for tasks requiring detailed explanation unless explicitly instructed (source: Vertex AI docs, Gemini 3 Prompting Guide, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start/gemini-3-prompting-guide)
- ⚠System instructions do not prevent jailbreaks or information leakage — same limitation as all Gemini models (source: Vertex AI docs, System Instructions, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/system-instructions)
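For the schema-complexity 400s noted above, a pre-flight check before sending a request can fail fast client-side. This is a hypothetical heuristic — the thresholds below are illustrative guesses, since Google does not publish exact limits:

```python
# Hypothetical pre-flight guard against InvalidArgument: 400 errors from
# overly complex response schemas. Thresholds are illustrative assumptions,
# not documented limits.
def schema_too_complex(schema: dict, depth: int = 0,
                       max_depth: int = 5, max_enum: int = 50) -> bool:
    """Recursively flag schemas that are deeply nested or enum-heavy."""
    if depth > max_depth:
        return True
    if len(schema.get("enum", [])) > max_enum:
        return True
    for sub in schema.get("properties", {}).values():
        if schema_too_complex(sub, depth + 1, max_depth, max_enum):
            return True
    items = schema.get("items")
    if items and schema_too_complex(items, depth + 1, max_depth, max_enum):
        return True
    return False
```

A real implementation might also count optional properties and total name length, which the Vertex AI docs list among the complexity triggers.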
Prompt Patterns
Preferred Instruction Format
Identical to Pro — system instructions go in the dedicated systemInstruction API parameter rather than being embedded in user messages. The value can be a single string or an array of strings; it is processed before user prompts and persists across conversation turns. (source: Vertex AI docs, System Instructions, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/system-instructions)
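A sketch of the separation in practice, splitting a combined prompt into the dedicated field plus a user turn. The `---` marker convention and the instruction text are illustrative assumptions, not part of the API:

```python
# Sketch: moving persona/rules out of the user turn into the dedicated
# systemInstruction field (REST body shape). The "---" split marker is a
# hypothetical heuristic, not an API convention.
def split_system_instruction(prompt: str, marker: str = "---") -> dict:
    """Treat everything before `marker` as system-level; the rest is the user turn."""
    system, _, user = prompt.partition(marker)
    return {
        "systemInstruction": {"parts": [{"text": system.strip()}]},
        "contents": [{"role": "user", "parts": [{"text": user.strip()}]}],
    }

body = split_system_instruction(
    "You are a terse SQL tutor. Answer in one paragraph.---Explain JOIN vs UNION."
)
```

Because systemInstruction persists across turns, follow-up user messages in `contents` inherit the persona without repeating it.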
Recommended Practices
- Same XML-tag or Markdown-heading delimiter strategy as Pro — choose one format and use consistently within a prompt (source: Google AI docs, Prompt Design Strategies, https://ai.google.dev/gemini-api/docs/prompting-strategies)
- For cost-sensitive applications, disable thinking entirely with thinkingBudget=0 for simple extraction/formatting tasks — a control Flash supports but Pro does not (source: Google AI docs, Thinking guide, https://ai.google.dev/gemini-api/docs/thinking)
- Use Flash's free tier for development/testing before scaling to paid tier — same API, same capabilities, just rate-limited (source: Google AI pricing, https://ai.google.dev/gemini-api/docs/pricing)
- For structured output, use response_mime_type + response_schema instead of prompt-based JSON instructions — identical to Pro, but more cost-effective per token (source: Google AI docs, Structured Output, https://ai.google.dev/gemini-api/docs/structured-output)
- Place questions at the END of long-context prompts for best retrieval accuracy (source: Google AI docs, Long Context guide, https://ai.google.dev/gemini-api/docs/long-context)
- Include 3-5 few-shot examples with consistent formatting for extraction and classification tasks (source: Google AI docs, Prompt Design Strategies, https://ai.google.dev/gemini-api/docs/prompting-strategies)
- Use context caching for repeated long-context queries — ~4x cost reduction, especially impactful given Flash's already-low per-token pricing (source: Google AI docs, Long Context guide, https://ai.google.dev/gemini-api/docs/long-context)
- Keep temperature at the default 1.0 — same recommendation as Pro; reasoning is optimized for this default (source: Vertex AI docs, Gemini 3 Prompting Guide, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start/gemini-3-prompting-guide)
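Several of the practices above — XML-tag delimiters, consistently formatted few-shot examples, and the question placed last in a long-context prompt — can be combined in one assembly step. A minimal sketch, with illustrative tag names and example data:

```python
# Sketch: assembling a long-context prompt per the recommended practices:
# XML-tag delimiters used consistently, uniformly formatted few-shot
# examples, and the question placed LAST. Tag names and data are illustrative.
def build_prompt(document: str, examples: list[tuple[str, str]], question: str) -> str:
    shots = "\n".join(
        f"<example>\n<input>{i}</input>\n<output>{o}</output>\n</example>"
        for i, o in examples
    )
    return (
        f"<document>\n{document}\n</document>\n"
        f"<examples>\n{shots}\n</examples>\n"
        f"<question>{question}</question>"  # question last for best retrieval
    )

p = build_prompt(
    "Q3 revenue was $12M.",
    [("Q1 revenue?", "$8M")],
    "What was Q3 revenue?",
)
```

Every few-shot example goes through the same template, which satisfies the consistent-formatting rule mechanically rather than by convention.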
Anti-Patterns to Avoid
- DO NOT lower temperature below 1.0 — same looping and degradation risk as Pro (source: Vertex AI docs, Gemini 3 Prompting Guide, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start/gemini-3-prompting-guide)
- DO NOT duplicate JSON schema in prompt when using response_schema — same as Pro, use schema description fields only (source: Vertex AI docs, Structured Output, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output)
- DO NOT use high thinking budgets for simple tasks — wastes tokens and money; Flash's ability to set thinkingBudget=0 should be leveraged for extraction/formatting (source: Google AI docs, Thinking guide, https://ai.google.dev/gemini-api/docs/thinking)
- DO NOT combine structured output with function calling on 2.5 models — same incompatibility as Pro (source: GitHub issue googleapis/python-genai#867)
- DO NOT use inconsistent formatting across few-shot examples (source: Google AI docs, Prompt Design Strategies, https://ai.google.dev/gemini-api/docs/prompting-strategies)
- DO NOT place essential instructions at the end of very long prompts (source: Vertex AI docs, Gemini 3 Prompting Guide, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start/gemini-3-prompting-guide)
Sources
- https://ai.google.dev/gemini-api/docs/prompting-strategies
- https://ai.google.dev/gemini-api/docs/structured-output
- https://ai.google.dev/gemini-api/docs/thinking
- https://ai.google.dev/gemini-api/docs/long-context
- https://ai.google.dev/gemini-api/docs/pricing
- https://ai.google.dev/gemini-api/docs/openai
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/system-instructions
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start/gemini-3-prompting-guide
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash