Qwen3 32B
Alibaba · qwen family · Official Docs
Qwen3-32B is the sweet spot of the Qwen3 lineup — a dense model that runs on a single GPU (quantized) while delivering reasoning that surpasses the previous-generation QwQ-32B thinking model. It shares the same hybrid thinking architecture as the 235B flagship, making prompt patterns fully portable between the two. For Refrase, this is the ideal default recommendation for users who want strong multilingual reasoning without premium pricing. The dense architecture avoids MoE routing complexity, making latency more predictable. At $0.16/1M input tokens on DashScope (and even cheaper on third-party providers), it offers exceptional value.
Specifications
Key Capabilities
- ✓Dense 32.8B parameter model with 64 layers and grouped-query attention (64 Q heads, 8 KV heads) (source: Hugging Face, Qwen3-32B Model Card)
- ✓Hybrid thinking mode identical to 235B: enable_thinking parameter plus /think and /no_think soft switches (source: Hugging Face, Qwen3-32B Model Card)
- ✓Surpasses QwQ-32B in thinking mode and Qwen2.5-Instruct in non-thinking mode — best-in-class at 32B scale (source: Hugging Face, Qwen3-32B Model Card)
- ✓100+ languages and dialects with strong multilingual instruction following and translation (source: Hugging Face, Qwen3-32B Model Card)
- ✓Extended context from its native 32K to 131K tokens via YaRN RoPE scaling (factor 4.0) (source: Hugging Face, Qwen3-32B Model Card)
- ✓Optimized for creative writing, role-playing, and multi-turn dialogues based on human preference alignment (source: Hugging Face, Qwen3-32B Model Card)
- ✓Open-weight Apache 2.0 license — runnable locally on consumer hardware with quantization (source: Qwen GitHub Repository)
Known Limitations
- ⚠Greedy decoding causes performance degradation and repetition loops in thinking mode — sampling required (source: Hugging Face, Qwen3-32B Model Card)
- ⚠YaRN scaling is static and may negatively impact shorter text performance when enabled; only activate for inputs exceeding 32K tokens (source: Hugging Face, Qwen3-32B Model Card)
- ⚠Thinking content (<think> blocks) must not be included in multi-turn conversation history — implementation must strip them (source: Hugging Face, Qwen3-32B Model Card)
- ⚠When enable_thinking=True, model always outputs <think> blocks (may be empty if soft-switched to non-thinking), requiring consistent parsing logic (source: Hugging Face, Qwen3-32B Model Card)
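The last two limitations combine into one parsing rule: always split the `<think>` block from the answer, surface only the answer, and store only the answer in history. A minimal sketch, assuming regex-based parsing (the helper names are illustrative, not an official SDK):

```python
import re

# With enable_thinking=True the model always emits a <think>...</think>
# block (possibly empty after a /no_think soft switch), so parsing must
# tolerate both empty and absent blocks.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Separate the <think> block from the final answer."""
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()  # non-thinking output: no block at all
    thinking = match.group(1).strip()
    answer = THINK_RE.sub("", text, count=1).strip()
    return thinking, answer

def strip_thinking(text: str) -> str:
    """Drop thinking content before appending a turn to conversation history."""
    return split_thinking(text)[1]
```

Applied to every assistant turn before it re-enters the message list, this satisfies the stripping requirement regardless of which thinking mode produced the output.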
Prompt Patterns
Preferred Instruction Format
Standard chat format with system/user/assistant roles. Thinking mode is controlled either via the enable_thinking=True/False parameter or via the /think and /no_think soft switches placed in user messages. The prompt interface is identical to Qwen3-235B.
Recommended Practices
- Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode; Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode (source: Hugging Face, Qwen3-32B Model Card)
- Set max output to 32,768 tokens for most queries; 38,912 for competition-level problems (source: Hugging Face, Qwen3-32B Model Card)
- Strip <think> blocks from conversation history in multi-turn scenarios (source: Hugging Face, Qwen3-32B Model Card)
- Only enable YaRN when input exceeds 32,768 tokens (source: Hugging Face, Qwen3-32B Model Card)
- For self-hosted deployment, use SGLang >= 0.4.6.post1 or vLLM >= 0.8.5 with --enable-reasoning flag (source: Hugging Face, Qwen3-32B Model Card)
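The sampling and token-budget recommendations above can be collected into presets. The numeric values are straight from the model card; the dict keys follow the snake_case parameter naming common to OpenAI-compatible vLLM/SGLang servers, which is an assumption to verify against your provider:

```python
# Per-mode sampling presets from the Qwen3-32B model card.
SAMPLING = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}

def max_tokens_for(task: str) -> int:
    """32,768 output tokens covers most queries; competition-level problems get more."""
    return 38_912 if task == "competition" else 32_768
```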
Anti-Patterns to Avoid
- Never use greedy decoding (temperature=0) — causes endless repetitions (source: Hugging Face, Qwen3-32B Model Card)
- Do not include <think> blocks in conversation history for multi-turn chats (source: Hugging Face, Qwen3-32B Model Card)
- Avoid enabling YaRN for short contexts — static scaling factor degrades performance on texts under 32K tokens (source: Hugging Face, Qwen3-32B Model Card)
What Refrase Does
Here is exactly how Refrase optimizes prompts for Qwen3 32B, rule by rule:
Before / After
See how Refrase transforms a generic prompt for Qwen3 32B.