Qwen3 32B
Alibaba · qwen family · Official Docs
Qwen3-32B is the sweet spot of the Qwen3 lineup — a dense model that runs on a single GPU (quantized) while delivering reasoning that surpasses the previous-generation QwQ-32B thinking model. It shares the same hybrid thinking architecture as the 235B flagship, making prompt patterns fully portable between the two. For Refrase, this is the ideal default recommendation for users who want strong multilingual reasoning without premium pricing. The dense architecture avoids MoE routing complexity, making latency more predictable. At $0.16/1M input tokens on DashScope (and even cheaper on third-party providers), it offers exceptional value.
Specifications
Key Capabilities
- ✓Dense 32.8B parameter model with 64 layers and grouped-query attention (64 Q heads, 8 KV heads) (source: Hugging Face, Qwen3-32B Model Card)
- ✓Hybrid thinking mode identical to 235B: enable_thinking parameter plus /think and /no_think soft switches (source: Hugging Face, Qwen3-32B Model Card)
- ✓Surpasses QwQ-32B in thinking mode and Qwen2.5-Instruct in non-thinking mode — best-in-class at 32B scale (source: Hugging Face, Qwen3-32B Model Card)
- ✓100+ languages and dialects with strong multilingual instruction following and translation (source: Hugging Face, Qwen3-32B Model Card)
- ✓Extended context from its native 32K to 131K tokens via YaRN RoPE scaling (factor 4.0) (source: Hugging Face, Qwen3-32B Model Card)
- ✓Optimized for creative writing, role-playing, and multi-turn dialogues based on human preference alignment (source: Hugging Face, Qwen3-32B Model Card)
- ✓Open-weight Apache 2.0 license — runnable locally on consumer hardware with quantization (source: Qwen GitHub Repository)
Known Limitations
- ⚠Greedy decoding causes performance degradation and repetition loops in thinking mode — sampling required (source: Hugging Face, Qwen3-32B Model Card)
- ⚠YaRN scaling is static and may negatively impact shorter text performance when enabled; only activate for inputs exceeding 32K tokens (source: Hugging Face, Qwen3-32B Model Card)
- ⚠Thinking content (<think> blocks) must not be included in multi-turn conversation history — implementation must strip them (source: Hugging Face, Qwen3-32B Model Card)
- ⚠When enable_thinking=True, model always outputs <think> blocks (may be empty if soft-switched to non-thinking), requiring consistent parsing logic (source: Hugging Face, Qwen3-32B Model Card)
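The last two limitations combine into one parsing rule: always split the `<think>` block from the answer, surface only the answer, and store only the answer in history. A minimal sketch, assuming regex-based parsing (the helper names are illustrative, not an official SDK):

```python
import re

# With enable_thinking=True the model always emits a <think>...</think>
# block (possibly empty after a /no_think soft switch), so parsing must
# tolerate both empty and absent blocks.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text: str) -> tuple[str, str]:
    """Separate the <think> block from the final answer."""
    match = THINK_RE.search(text)
    if not match:
        return "", text.strip()  # non-thinking output: no block at all
    thinking = match.group(1).strip()
    answer = THINK_RE.sub("", text, count=1).strip()
    return thinking, answer

def strip_thinking(text: str) -> str:
    """Drop thinking content before appending a turn to conversation history."""
    return split_thinking(text)[1]
```

Applied to every assistant turn before it re-enters the message list, this satisfies the stripping requirement regardless of which thinking mode produced the output.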
Prompt Patterns
Preferred Instruction Format
Standard chat format with system/user/assistant roles. Thinking mode is controlled either via the enable_thinking=True/False parameter or via the /think and /no_think soft switches placed in user messages. The prompt interface is identical to Qwen3-235B.
Recommended Practices
- Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode; Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode (source: Hugging Face, Qwen3-32B Model Card)
- Set max output to 32,768 tokens for most queries; 38,912 for competition-level problems (source: Hugging Face, Qwen3-32B Model Card)
- Strip <think> blocks from conversation history in multi-turn scenarios (source: Hugging Face, Qwen3-32B Model Card)
- Only enable YaRN when input exceeds 32,768 tokens (source: Hugging Face, Qwen3-32B Model Card)
- For self-hosted deployment, use SGLang >= 0.4.6.post1 or vLLM >= 0.8.5 with --enable-reasoning flag (source: Hugging Face, Qwen3-32B Model Card)
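The sampling and token-budget recommendations above can be collected into presets. The numeric values are straight from the model card; the dict keys follow the snake_case parameter naming common to OpenAI-compatible vLLM/SGLang servers, which is an assumption to verify against your provider:

```python
# Per-mode sampling presets from the Qwen3-32B model card.
SAMPLING = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0.0},
}

def max_tokens_for(task: str) -> int:
    """32,768 output tokens covers most queries; competition-level problems get more."""
    return 38_912 if task == "competition" else 32_768
```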
Anti-Patterns to Avoid
- Never use greedy decoding (temperature=0) — causes endless repetitions (source: Hugging Face, Qwen3-32B Model Card)
- Do not include <think> blocks in conversation history for multi-turn chats (source: Hugging Face, Qwen3-32B Model Card)
- Avoid enabling YaRN for short contexts — static scaling factor degrades performance on texts under 32K tokens (source: Hugging Face, Qwen3-32B Model Card)
What Refrase Does
Here is exactly how Refrase optimizes prompts for Qwen3 32B, rule by rule:
Before / After
See how Refrase transforms a generic prompt for Qwen3 32B.