
Qwen3 235B

Alibaba · qwen family · Official docs

Qwen3-235B-A22B is the flagship MoE model from Alibaba's Qwen team. Its hybrid thinking mode is a first-class feature: unlike models where reasoning is bolted on, Qwen3 was trained from the ground up to switch between deep reasoning and fast responses, and the /think and /no_think soft switches make this controllable at the prompt level without any API parameter changes. With 22B activated parameters per token, it delivers frontier-class reasoning at a fraction of the compute cost of dense 200B+ models, and its 119-language support makes it one of the strongest multilingual open-weight models available. Key trade-off: the MoE architecture still requires significant VRAM for self-hosting despite the low per-token compute.

Try Refrase on a Qwen3 235B prompt

Paste any prompt — Refrase rewrites it using Qwen3 235B's documentation as context. 4–7 seconds end-to-end.

Specifications

Context window: 131K tokens
Max output: 33K tokens
Pricing: $0.70 input / $2.80 output per 1M tokens
Alibaba Cloud DashScope international pricing. Thinking-mode output: $8.40/1M tokens. Global (US Virginia) pricing is lower: $0.287 input, $1.147 output (non-thinking), $2.868 output (thinking). Open weight under Apache 2.0, so self-hosting eliminates API costs. (source: Alibaba Cloud Model Studio, Model Pricing page)
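To make the per-million-token pricing concrete, here is a small sketch of the request-cost arithmetic, assuming the international DashScope rates quoted above (the function name is illustrative, not part of any SDK):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     thinking: bool = False) -> float:
    """Estimate DashScope (international) cost for one Qwen3-235B request.

    Rates are the per-1M-token prices quoted above: $0.70 input,
    $2.80 output in non-thinking mode, $8.40 output in thinking mode.
    """
    input_rate = 0.70
    output_rate = 8.40 if thinking else 2.80
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 4K-token prompt with a 1K-token non-thinking reply:
cost = request_cost_usd(4_000, 1_000)  # (4000*0.70 + 1000*2.80) / 1e6 = 0.0056
```

The same reply in thinking mode would cost 3x more on the output side, which is why the thinking_budget cap discussed below matters for cost as well as latency.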

Strengths

Tags: analysis · generation · code

Key capabilities

  • Mixture-of-Experts architecture: 235B total params, 22B activated per token, 128 experts with 8 activated per token (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Hybrid thinking mode: seamless switching between thinking mode (step-by-step reasoning in <think> blocks) and non-thinking mode (fast direct responses) via enable_thinking parameter or /think and /no_think soft switches (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
  • 119 languages and dialects supported across Indo-European, Sino-Tibetan, Afro-Asiatic, and other language families (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
  • Trained on approximately 36 trillion tokens — nearly double Qwen2.5's 18 trillion — including synthetic math and code data (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
  • Extended context via YaRN rope scaling from 32K native to 131K tokens (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Strong agentic task performance: leading results on complex agent-based benchmarks among open-source models (source: Hugging Face, Qwen3-32B Model Card)
  • Open-weight under Apache 2.0 license enabling full commercial and research use (source: Qwen GitHub Repository)

Known limitations

  • Greedy decoding causes performance degradation and endless repetitions — must use sampling with recommended temperature settings (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • YaRN static scaling applies a constant factor regardless of input length, which may negatively impact performance on shorter texts when enabled (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Higher presence_penalty values (above ~1.5) may cause language mixing in multilingual contexts (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Quantization below 4-bit causes significant performance degradation, especially in complex reasoning tasks, more pronounced than in previous Qwen generations (source: arXiv:2505.02214, 'An Empirical Study of Qwen3 Quantization')
  • Format-dependent reasoning: strong on pattern-matching benchmarks but weaker on strict logical forms like syllogisms (source: LogiEval benchmark analysis, emergentmind.com)

How to prompt Qwen3 235B

Preferred instruction format

Standard chat format with system/user/assistant roles. System message sets context; user message contains the task. Thinking mode controlled via enable_thinking=True/False in chat_template_kwargs, or via /think and /no_think soft switches in user messages.
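A minimal sketch of the soft-switch convention, assuming a standard chat-messages list (the helper name is hypothetical; the /think and /no_think tokens themselves are from the Qwen docs, and the most recent switch in the conversation wins):

```python
from typing import Optional

def user_message(task: str, thinking: Optional[bool] = None) -> dict:
    """Build a user message, optionally appending a Qwen3 soft switch.

    The chat template reads /think and /no_think from the message text,
    so no API parameters need to change between modes.
    """
    content = task
    if thinking is True:
        content += " /think"
    elif thinking is False:
        content += " /no_think"
    return {"role": "user", "content": content}

messages = [
    {"role": "system", "content": "You are a concise code reviewer."},
    user_message("Review this diff for off-by-one errors.", thinking=True),
]
```

If your serving stack exposes chat_template_kwargs per request (for example, an OpenAI-compatible server such as vLLM accepts it via extra_body), enable_thinking can be toggled there instead; the exact key depends on your deployment, so check your server's documentation.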

Recommended practices

  • Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode; Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Set max output to 32,768 tokens for most queries; use 38,912 for highly complex competition-level problems (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • In multi-turn conversations, include only the final output in history — strip <think> blocks from previous turns (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Use thinking_budget parameter to cap reasoning token usage when latency is a concern (source: Alibaba Cloud Documentation, 'How to use deep thinking models')
  • Enable YaRN rope scaling only when input exceeds 32,768 tokens to avoid performance impact on shorter contexts (source: Hugging Face, Qwen3-235B-A22B Model Card)
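The sampling and history advice above can be sketched in a few lines of Python. The constant and helper names here are my own, not part of any Qwen SDK; the numbers come from the model card:

```python
import re

# Recommended sampling settings from the model card, per mode.
SAMPLING = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0},
}

# Matches a <think>...</think> block and any trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def to_history(assistant_reply: str) -> dict:
    """Strip <think> blocks before storing an assistant turn in history."""
    return {"role": "assistant",
            "content": THINK_BLOCK.sub("", assistant_reply).strip()}

turn = to_history("<think>Check the loop bounds first...</think>\n"
                  "The loop is off by one.")
# turn["content"] == "The loop is off by one."
```

Stripping reasoning from history keeps multi-turn context short and matches how the model was trained to see its own prior turns.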

Anti-patterns to avoid

  • Never use greedy decoding (temperature=0) — causes endless repetitions and severe quality degradation (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Do not include thinking content (<think> blocks) in multi-turn conversation history — only include final output (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Avoid presence_penalty values above 1.5 in multilingual scenarios — triggers language mixing (source: Hugging Face, Qwen3-235B-A22B Model Card)
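As a defensive habit, a request validator can catch these anti-patterns before a call goes out. This is an illustrative sketch under the limits stated above, not part of any official client:

```python
def check_params(params: dict) -> list:
    """Flag request parameters that hit known Qwen3 failure modes."""
    warnings = []
    if params.get("temperature", 1.0) == 0:
        warnings.append("greedy decoding: risks endless repetition; "
                        "use the recommended sampling settings instead")
    if params.get("presence_penalty", 0) > 1.5:
        warnings.append("presence_penalty > 1.5: risks language mixing "
                        "in multilingual output")
    return warnings

# A compliant thinking-mode request produces no warnings:
assert check_params({"temperature": 0.6, "top_p": 0.95}) == []
```

Running such a check in a test suite or request middleware is cheap insurance against the repetition and language-mixing failure modes listed above.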


Skip the manual application.

Refrase reads everything above and applies it for you. Try it on one of your own prompts.