
Qwen3 235B

Alibaba · qwen family · Official docs

Qwen3-235B-A22B is the flagship MoE model from Alibaba's Qwen team. Its hybrid thinking mode is a first-class feature: unlike models where reasoning is bolted on, Qwen3 was trained from the ground up to switch between deep reasoning and fast responses, and the /think and /no_think soft switches make this controllable at the prompt level without any API parameter changes. With 22B activated parameters per token, it delivers frontier-class reasoning at a fraction of the compute cost of dense 200B+ models, and its 119-language support makes it one of the strongest multilingual open-weight models available. Key trade-off: the MoE architecture still requires significant VRAM for self-hosting despite the low per-token compute.

Try Refrase on a Qwen3 235B prompt

Paste any prompt — Refrase rewrites it using Qwen3 235B's documentation as context. 4–7 seconds end-to-end.

Specifications

Context window: 131K tokens
Max output: 33K tokens
Pricing: $0.70 input / $2.80 output per 1M tokens
Alibaba Cloud DashScope international pricing. Thinking-mode output: $8.40/1M tokens. Global (US Virginia) pricing is lower: $0.287 input, $1.147 output (non-thinking), $2.868 output (thinking). Open weight under Apache 2.0, so self-hosting eliminates API costs. (source: Alibaba Cloud Model Studio, Model Pricing page)
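To make the per-million-token pricing concrete, here is a small sketch of the request-cost arithmetic, assuming the international DashScope rates quoted above (the function name is illustrative, not part of any SDK):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     thinking: bool = False) -> float:
    """Estimate DashScope (international) cost for one Qwen3-235B request.

    Rates are the per-1M-token prices quoted above: $0.70 input,
    $2.80 output in non-thinking mode, $8.40 output in thinking mode.
    """
    input_rate = 0.70
    output_rate = 8.40 if thinking else 2.80
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A 4K-token prompt with a 1K-token non-thinking reply:
cost = request_cost_usd(4_000, 1_000)  # (4000*0.70 + 1000*2.80) / 1e6 = 0.0056
```

The same reply in thinking mode would cost 3x more on the output side, which is why the thinking_budget cap discussed below matters for cost as well as latency.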

Strengths

Tags: analysis · generation · code

Key capabilities

  • Mixture-of-Experts architecture: 235B total params, 22B activated per token, 128 experts with 8 activated per token (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Hybrid thinking mode: seamless switching between thinking mode (step-by-step reasoning in <think> blocks) and non-thinking mode (fast direct responses) via enable_thinking parameter or /think and /no_think soft switches (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
  • 119 languages and dialects supported across Indo-European, Sino-Tibetan, Afro-Asiatic, and other language families (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
  • Trained on approximately 36 trillion tokens — nearly double Qwen2.5's 18 trillion — including synthetic math and code data (source: Qwen Blog, 'Qwen3: Think Deeper, Act Faster')
  • Extended context via YaRN rope scaling from 32K native to 131K tokens (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Strong agentic task performance: leading results on complex agent-based benchmarks among open-source models (source: Hugging Face, Qwen3-32B Model Card)
  • Open-weight under Apache 2.0 license enabling full commercial and research use (source: Qwen GitHub Repository)

Known limitations

  • Greedy decoding causes performance degradation and endless repetitions — must use sampling with recommended temperature settings (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • YaRN static scaling applies a constant factor regardless of input length, which may negatively impact performance on shorter texts when enabled (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Higher presence_penalty values (above ~1.5) may cause language mixing in multilingual contexts (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Quantization below 4-bit causes significant performance degradation, especially in complex reasoning tasks, more pronounced than in previous Qwen generations (source: arXiv:2505.02214, 'An Empirical Study of Qwen3 Quantization')
  • Format-dependent reasoning: strong on pattern-matching benchmarks but weaker on strict logical forms like syllogisms (source: LogiEval benchmark analysis, emergentmind.com)

How to prompt Qwen3 235B

Preferred instruction format

Standard chat format with system/user/assistant roles. System message sets context; user message contains the task. Thinking mode controlled via enable_thinking=True/False in chat_template_kwargs, or via /think and /no_think soft switches in user messages.
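A minimal sketch of the soft-switch convention, assuming a standard chat-messages list (the helper name is hypothetical; the /think and /no_think tokens themselves are from the Qwen docs, and the most recent switch in the conversation wins):

```python
from typing import Optional

def user_message(task: str, thinking: Optional[bool] = None) -> dict:
    """Build a user message, optionally appending a Qwen3 soft switch.

    The chat template reads /think and /no_think from the message text,
    so no API parameters need to change between modes.
    """
    content = task
    if thinking is True:
        content += " /think"
    elif thinking is False:
        content += " /no_think"
    return {"role": "user", "content": content}

messages = [
    {"role": "system", "content": "You are a concise code reviewer."},
    user_message("Review this diff for off-by-one errors.", thinking=True),
]
```

If your serving stack exposes chat_template_kwargs per request (for example, an OpenAI-compatible server such as vLLM accepts it via extra_body), enable_thinking can be toggled there instead; the exact key depends on your deployment, so check your server's documentation.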

Recommended practices

  • Use Temperature=0.6, TopP=0.95, TopK=20, MinP=0 for thinking mode; Temperature=0.7, TopP=0.8, TopK=20, MinP=0 for non-thinking mode (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Set max output to 32,768 tokens for most queries; use 38,912 for highly complex competition-level problems (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • In multi-turn conversations, include only the final output in history — strip <think> blocks from previous turns (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Use thinking_budget parameter to cap reasoning token usage when latency is a concern (source: Alibaba Cloud Documentation, 'How to use deep thinking models')
  • Enable YaRN rope scaling only when input exceeds 32,768 tokens to avoid performance impact on shorter contexts (source: Hugging Face, Qwen3-235B-A22B Model Card)
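The sampling and history advice above can be sketched in a few lines of Python. The constant and helper names here are my own, not part of any Qwen SDK; the numbers come from the model card:

```python
import re

# Recommended sampling settings from the model card, per mode.
SAMPLING = {
    "thinking":     {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "min_p": 0},
}

# Matches a <think>...</think> block and any trailing whitespace.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def to_history(assistant_reply: str) -> dict:
    """Strip <think> blocks before storing an assistant turn in history."""
    return {"role": "assistant",
            "content": THINK_BLOCK.sub("", assistant_reply).strip()}

turn = to_history("<think>Check the loop bounds first...</think>\n"
                  "The loop is off by one.")
# turn["content"] == "The loop is off by one."
```

Stripping reasoning from history keeps multi-turn context short and matches how the model was trained to see its own prior turns.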

Anti-patterns to avoid

  • Never use greedy decoding (temperature=0) — causes endless repetitions and severe quality degradation (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Do not include thinking content (<think> blocks) in multi-turn conversation history — only include final output (source: Hugging Face, Qwen3-235B-A22B Model Card)
  • Avoid presence_penalty values above 1.5 in multilingual scenarios — triggers language mixing (source: Hugging Face, Qwen3-235B-A22B Model Card)
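As a defensive habit, a request validator can catch these anti-patterns before a call goes out. This is an illustrative sketch under the limits stated above, not part of any official client:

```python
def check_params(params: dict) -> list:
    """Flag request parameters that hit known Qwen3 failure modes."""
    warnings = []
    if params.get("temperature", 1.0) == 0:
        warnings.append("greedy decoding: risks endless repetition; "
                        "use the recommended sampling settings instead")
    if params.get("presence_penalty", 0) > 1.5:
        warnings.append("presence_penalty > 1.5: risks language mixing "
                        "in multilingual output")
    return warnings

# A compliant thinking-mode request produces no warnings:
assert check_params({"temperature": 0.6, "top_p": 0.95}) == []
```

Running such a check in a test suite or request middleware is cheap insurance against the repetition and language-mixing failure modes listed above.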


Skip the manual application.

Refrase reads everything above and applies it for you. Try it on one of your own prompts.