Qwen3 235B
Alibaba · Qwen family · Official docs
Qwen3-235B-A22B is the flagship MoE model from Alibaba's Qwen team. Hybrid thinking is a first-class feature: rather than bolting reasoning onto an existing model, Qwen3 was trained from the ground up to switch between deep step-by-step reasoning and fast direct responses. The /think and /no_think soft switches make this uniquely controllable at the prompt level, with no API parameter changes. With only 22B parameters activated per token, it delivers frontier-class reasoning at a fraction of the per-token compute of dense 200B+ models, and its 119-language support gives it some of the broadest multilingual coverage of any open-weight model. Key trade-off: all 235B parameters must still be held in memory, so self-hosting requires significant VRAM despite the low per-token compute.
Strengths
Key capabilities
- ✓Mixture-of-Experts architecture: 235B total parameters with 22B activated per token, routing 8 of 128 experts (source: Hugging Face, Qwen3-235B-A22B Model Card)
- ✓Hybrid thinking mode: seamless switching between thinking mode (step-by-step reasoning in <think> blocks) and non-thinking mode (fast direct responses) via enable_thinking parameter or /think and /no_think soft switches (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
- ✓119 languages and dialects supported across Indo-European, Sino-Tibetan, Afro-Asiatic, and other language families (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
- ✓Trained on approximately 36 trillion tokens — nearly double Qwen2.5's 18 trillion — including synthetic math and code data (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
- ✓Extended context via YaRN RoPE scaling, from the native 32K window to 131K tokens; see the config sketch after this list (source: Hugging Face, Qwen3-235B-A22B Model Card)
- ✓Strong agentic task performance: leading results on complex agent-based benchmarks among open-source models (source: Hugging Face, Qwen3-32B Model Card)
- ✓Open-weight under Apache 2.0 license enabling full commercial and research use (source: Qwen GitHub Repository)
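The long-context point above needs a setup step. Below is a minimal sketch of enabling YaRN at load time by overriding the Hugging Face config in code; the scaling values follow the model card, but loading through a patched AutoConfig (rather than editing config.json on disk or passing vLLM's --rope-scaling flag, the documented routes) is an assumption:

```python
# Sketch: enable YaRN scaling for long inputs. factor=4.0 stretches the
# native 32,768-token window to roughly 131K tokens.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-235B-A22B"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {                       # values per the model card
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype="auto", device_map="auto"
)
```

Because the scaling factor is static, leave this off for workloads that stay under 32K tokens (see Known limitations below).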
Known limitations
- ⚠Greedy decoding causes performance degradation and endless repetitions — must use sampling with recommended temperature settings (source: Hugging Face, Qwen3-235B-A22B Model Card)
- ⚠YaRN static scaling applies a constant factor regardless of input length, which may negatively impact performance on shorter texts when enabled (source: Hugging Face, Qwen3-235B-A22B Model Card)
- ⚠Higher presence_penalty values (above ~1.5) may cause language mixing in multilingual contexts (source: Hugging Face, Qwen3-235B-A22B Model Card)
- ⚠Quantization below 4-bit causes significant performance degradation, especially in complex reasoning tasks, more pronounced than in previous Qwen generations (source: arXiv:2505.02214, 'An Empirical Study of Qwen3 Quantization')
- ⚠Format-dependent reasoning: strong on pattern-matching benchmarks but weaker on strict logical forms like syllogisms (source: LogiEval benchmark analysis, emergentmind.com)
How to prompt Qwen3 235B
Preferred instruction format
Standard chat format with system/user/assistant roles: the system message sets context and the user message carries the task. Thinking mode is controlled either with enable_thinking=True/False in chat_template_kwargs or with the /think and /no_think soft switches inside user messages, as in the sketch below.
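A minimal end-to-end sketch with Hugging Face transformers, following the model card's recipe (the model ID, enable_thinking kwarg, and the </think> token ID 151668 are as published there; the prompt text is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-235B-A22B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Appending /no_think instead would force a fast, direct reply.
    {"role": "user", "content": "Is 2027 a prime number? /think"},
]

# enable_thinking=True (the default) lets the model emit <think> blocks.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Thinking-mode sampling settings from the model card; never decode greedily.
generated = model.generate(
    **inputs, do_sample=True, temperature=0.6, top_p=0.95, top_k=20,
    max_new_tokens=32768,
)
output_ids = generated[0][len(inputs.input_ids[0]):].tolist()

# Split reasoning from the answer at the last </think> token (ID 151668).
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no thinking block emitted
thinking = tokenizer.decode(output_ids[:index], skip_special_tokens=True)
answer = tokenizer.decode(output_ids[index:], skip_special_tokens=True)
print(answer)
```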
Recommended practices
- Use temperature=0.6, top_p=0.95, top_k=20, min_p=0 for thinking mode; temperature=0.7, top_p=0.8, top_k=20, min_p=0 for non-thinking mode (source: Hugging Face, Qwen3-235B-A22B Model Card)
- Set max output to 32,768 tokens for most queries; use 38,912 for highly complex competition-level problems (source: Hugging Face, Qwen3-235B-A22B Model Card)
- In multi-turn conversations, include only the final output in history and strip <think> blocks from previous turns; see the first sketch after this list (source: Hugging Face, Qwen3-235B-A22B Model Card)
- Use the thinking_budget parameter to cap reasoning-token usage when latency is a concern; see the second sketch after this list (source: Alibaba Cloud Documentation, 'How to use deep thinking models')
- Enable YaRN rope scaling only when input exceeds 32,768 tokens to avoid performance impact on shorter contexts (source: Hugging Face, Qwen3-235B-A22B Model Card)
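For the history-stripping rule, the published chat template already drops prior-turn thinking content when you pass raw messages back through it; if you assemble history yourself, here is a sketch of the stripping step (the regex helper is our illustration, not an official API):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(assistant_text: str) -> str:
    """Drop <think>...</think> reasoning before storing a turn in history."""
    return THINK_BLOCK.sub("", assistant_text).strip()

raw_reply = "<think>\n2027 is not divisible by 2, 3, 5...\n</think>\n\nYes, 2027 is prime."
history = [{"role": "assistant", "content": strip_thinking(raw_reply)}]
# history now carries only the final answer, as the model card requires.
```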
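And for the thinking_budget cap, a sketch against Alibaba Cloud's OpenAI-compatible endpoint per the cited documentation; treat the base URL, model name, and extra_body field placement as assumptions to verify against the current docs:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# thinking_budget caps reasoning tokens; streaming is required when
# thinking is enabled on this endpoint.
stream = client.chat.completions.create(
    model="qwen3-235b-a22b",
    messages=[{"role": "user", "content": "Outline a 3-step refactor plan."}],
    stream=True,
    extra_body={"enable_thinking": True, "thinking_budget": 4096},
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content streams the thinking; content streams the answer.
    print(getattr(delta, "reasoning_content", None) or delta.content or "", end="")
```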
Anti-patterns to avoid
- Never use greedy decoding (temperature=0) — causes endless repetitions and severe quality degradation (source: Hugging Face, Qwen3-235B-A22B Model Card)
- Do not include thinking content (<think> blocks) in multi-turn conversation history — only include final output (source: Hugging Face, Qwen3-235B-A22B Model Card)
- Avoid presence_penalty values above 1.5 in multilingual scenarios — triggers language mixing (source: Hugging Face, Qwen3-235B-A22B Model Card)