
Qwen3 32B

Alibaba · qwen family · Official Docs

Qwen3-32B is the sweet spot of the Qwen3 lineup: a dense model that runs on a single GPU when quantized, while delivering reasoning that surpasses the previous-generation QwQ-32B thinking model. It shares the hybrid thinking architecture of the 235B flagship, so prompt patterns are fully portable between the two. For Refrase, this is the ideal default recommendation for users who want strong multilingual reasoning without premium pricing. The dense architecture avoids MoE routing complexity, making latency more predictable. At $0.16/1M input tokens on DashScope (and also available from third-party providers at varying prices), it offers exceptional value.

#12
Rank
84
Quality Score
800ms
Avg Response
+12%
Adaptation Gain

Specifications

131K
Context Window
33K
Max Output
$0.16 / $0.64
Per 1M tokens (in/out)
Alibaba Cloud DashScope international pricing. Same price for thinking and non-thinking modes. Open-weight Apache 2.0 — widely available on Groq ($0.29/1M input), Together AI, and other inference providers at varying prices. Self-hosting eliminates API costs. (source: Alibaba Cloud Model Studio, Model Pricing page)

Key Capabilities

  • Dense 32.8B parameter model with 64 layers and grouped-query attention (64 Q heads, 8 KV heads) (source: Hugging Face, Qwen3-32B Model Card)
  • Hybrid thinking mode identical to 235B: enable_thinking parameter plus /think and /no_think soft switches (source: Hugging Face, Qwen3-32B Model Card)
  • Surpasses QwQ-32B in thinking mode and Qwen2.5-Instruct in non-thinking mode — best-in-class at 32B scale (source: Hugging Face, Qwen3-32B Model Card)
  • 100+ languages and dialects with strong multilingual instruction following and translation (source: Hugging Face, Qwen3-32B Model Card)
  • Extended context from 32K native to 131K tokens via YaRN RoPE scaling with factor 4.0 (source: Hugging Face, Qwen3-32B Model Card)
  • Optimized for creative writing, role-playing, and multi-turn dialogues based on human preference alignment (source: Hugging Face, Qwen3-32B Model Card)
  • Open-weight Apache 2.0 license — runnable locally on consumer hardware with quantization (source: Qwen GitHub Repository)

Known Limitations

  • Greedy decoding causes performance degradation and repetition loops in thinking mode — sampling required (source: Hugging Face, Qwen3-32B Model Card)
  • YaRN scaling is static and may negatively impact shorter text performance when enabled; only activate for inputs exceeding 32K tokens (source: Hugging Face, Qwen3-32B Model Card)
  • Thinking content (<think> blocks) must not be included in multi-turn conversation history — implementation must strip them (source: Hugging Face, Qwen3-32B Model Card)
  • When enable_thinking=True, model always outputs <think> blocks (may be empty if soft-switched to non-thinking), requiring consistent parsing logic (source: Hugging Face, Qwen3-32B Model Card)
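The last two limitations can be handled with one small parsing helper. The sketch below is a minimal illustration, assuming the raw completion text contains a literal `<think>...</think>` block (possibly empty) followed by the final answer; the message-dict shape follows the standard chat format, and none of the function names come from the model card.

```python
import re

# Matches a <think> block (possibly empty) plus trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer) from a raw Qwen3 completion.

    The <think> block may be empty when the model was soft-switched to
    non-thinking mode, so both cases parse through the same code path.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    thinking = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", text).strip()
    return thinking, answer

def strip_thinking_from_history(messages: list[dict]) -> list[dict]:
    """Remove <think> blocks from assistant turns before resending history."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"]).strip()}
        cleaned.append(msg)
    return cleaned
```

Running every assistant turn through `strip_thinking_from_history` before each follow-up request satisfies both constraints at once.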

Prompt Patterns

Preferred Instruction Format

Standard chat format with system/user/assistant roles. Thinking mode controlled via enable_thinking=True/False or /think and /no_think soft switches in user messages. Identical prompt interface to Qwen3-235B.

Recommended Practices

  • Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode; Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode (source: Hugging Face, Qwen3-32B Model Card)
  • Set max output to 32,768 tokens for most queries; 38,912 for competition-level problems (source: Hugging Face, Qwen3-32B Model Card)
  • Strip <think> blocks from conversation history in multi-turn scenarios (source: Hugging Face, Qwen3-32B Model Card)
  • Only enable YaRN when input exceeds 32,768 tokens (source: Hugging Face, Qwen3-32B Model Card)
  • For self-hosted deployment, use SGLang >= 0.4.6.post1 or vLLM >= 0.8.5 with --enable-reasoning flag (source: Hugging Face, Qwen3-32B Model Card)
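The first two practices can be centralized in one parameter table. The sampling values below come from the model card; the key names follow the OpenAI-compatible request shape that DashScope and vLLM expose, which is an assumption to adjust for your client library.

```python
# Sampling settings recommended by the Qwen3-32B model card, per mode.
SAMPLING = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}

def request_params(thinking: bool, competition_level: bool = False) -> dict:
    """Assemble sampling params plus the recommended output budget."""
    params = dict(SAMPLING["thinking" if thinking else "non_thinking"])
    # 32,768 tokens for most queries; 38,912 for competition-level problems.
    params["max_tokens"] = 38_912 if competition_level else 32_768
    return params
```

Keeping both modes in one dict makes it hard to accidentally reuse thinking-mode temperature in non-thinking requests.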

Anti-Patterns to Avoid

  • Never use greedy decoding (temperature=0) — causes endless repetitions (source: Hugging Face, Qwen3-32B Model Card)
  • Do not include <think> blocks in conversation history for multi-turn chats (source: Hugging Face, Qwen3-32B Model Card)
  • Avoid enabling YaRN for short contexts — static scaling factor degrades performance on texts under 32K tokens (source: Hugging Face, Qwen3-32B Model Card)

What Refrase Does

Here is exactly how Refrase optimizes prompts for Qwen3 32B, rule by rule:

Thinking mode control

Refrase adds /think or /no_think directives based on your task type. Reasoning-heavy tasks get thinking mode enabled; simple extraction tasks get it disabled for speed.
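The rule above can be sketched as a small lookup. The task taxonomy here is hypothetical; Refrase's real classification rules are not public, and this only illustrates the directive-selection pattern.

```python
# Hypothetical task categories for illustration only.
REASONING_TASKS = {"math", "code", "analysis", "planning"}

def thinking_directive(task_type: str) -> str:
    """Pick the soft switch: thinking for reasoning-heavy tasks,
    non-thinking for simple tasks where speed matters more."""
    return "/think" if task_type in REASONING_TASKS else "/no_think"
```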

English enforcement

Refrase adds explicit 'respond in English' instructions to prevent the model from switching to other languages, which some multilingual models do by default.

Before / After

See how Refrase transforms a generic prompt for Qwen3 32B.

Original

Extract the key information from this document. Be accurate.

Adapted for Qwen3 32B

Extract the key information from this document.
/think
Respond in English.
