Nemotron 30B

NVIDIA · nemotron family · Official docs

Nemotron 3 Nano is NVIDIA's answer to efficient agentic AI. Its hybrid Mamba-2 + Transformer MoE architecture is unusual: it combines linear-time sequence modeling (Mamba-2) with sparse expert activation (MoE) and grouped-query attention (GQA). That combination is what makes a 1M-token context window practical at a parameter count where pure Transformer models struggle. The 3.3x throughput advantage over Qwen3-30B is significant for production deployments, and the ThinkingBudgetClient pattern for controlling reasoning depth is a novel inference-time optimization. Trained on 25T tokens (33% synthetic), one of the largest training runs for a model this size. Released December 2025.

Try Refrase on a Nemotron 30B prompt

Paste any prompt — Refrase rewrites it using Nemotron 30B's documentation as context. 4–7 seconds end-to-end.

Specifications

Context window: 1M tokens
Max output: 33K tokens
Pricing: $0.05 / $0.20 per 1M tokens (input / output)

Reasoning variant: $0.06/$0.24. AWS Bedrock: $0.06/$0.24. Open-weight model, so self-hosting is infrastructure cost only. (source: OpenRouter, pricepertoken.com)

Strengths

code generation

Key capabilities

  • 1M token context window — among the largest available (source: Hugging Face, Model Card)
  • Hybrid Mamba-2 + Transformer MoE architecture: 52 layers (23 Mamba-2, 23 MoE, 6 GQA Attention) (source: Hugging Face, Model Card)
  • 30B total / 3.5B active parameters — extremely efficient inference (source: Hugging Face, Model Card)
  • 128 routed experts + 1 shared expert per MoE layer, 6 activated per token; see the routing sketch after this list (source: Hugging Face, Model Card)
  • Configurable reasoning with thinking budget control (source: Hugging Face, Model Card)
  • Strong mathematical reasoning: AIME 2025 89.1% without tools, 99.2% with tools (source: Hugging Face, Model Card)
  • Code generation: LiveCodeBench v6 68.3% (source: Hugging Face, Model Card)
  • Long context: RULER-100@1M 86.3% (source: Hugging Face, Model Card)
  • 20 natural languages + 43 programming languages (source: Hugging Face, Model Card)
  • 3.3x higher inference throughput than Qwen3-30B-A3B on single H200 (source: NVIDIA, Technical Report)
  • Trained on 25 trillion tokens including 33% synthetic data (source: Hugging Face, Model Card)
  • NVIDIA Nemotron Open Model License — commercial use allowed (source: Hugging Face, Model Card)
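
The MoE numbers in the list above follow the standard top-k routing pattern. A minimal per-token sketch using those figures (the hidden width, the softmax-after-top-k gating order, and all names here are illustrative assumptions, not the model's actual implementation):

    import torch
    import torch.nn.functional as F

    # Figures from the model card: 128 routed experts, 1 shared expert,
    # 6 routed experts activated per token. HIDDEN is a placeholder width.
    NUM_ROUTED, TOP_K, HIDDEN = 128, 6, 2048

    def route_token(hidden_state, router_weight):
        """Return the top-k expert indices and gate weights for one token."""
        logits = hidden_state @ router_weight        # (NUM_ROUTED,)
        topk_logits, topk_idx = logits.topk(TOP_K)   # 6 of 128 routed experts
        gate = F.softmax(topk_logits, dim=-1)        # normalize over the winners
        return topk_idx, gate

    x = torch.randn(HIDDEN)
    w = torch.randn(HIDDEN, NUM_ROUTED)
    idx, gate = route_token(x, w)
    # Layer output = shared_expert(x) + sum_i gate[i] * routed_expert[idx[i]](x)
    print(idx.tolist(), [round(g, 3) for g in gate.tolist()])

Only the 6 selected experts (plus the always-on shared expert) run for a given token, which is why 30B total parameters decode like a 3.5B model.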

Known limitations

  • General knowledge (MMLU-Pro 78.3%) slightly behind Qwen3-30B (80.9%) — hybrid architecture trades breadth for reasoning depth (source: Medium, Technical Review)
  • Struggles when a single complex prompt asks for an entire online application in one shot (source: Community testing)
  • Disabling reasoning mode decreases accuracy on harder prompts (source: Hugging Face, Model Card)
  • 1M context requires very high VRAM (~120GB+ for BF16) (source: Hugging Face, Model Card)
  • Default config caps context at 256K; reaching 1M requires the VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 flag, as in the serving sketch after this list (source: Hugging Face, Model Card)
  • Fine-tuning the router layer is not recommended (source: Community testing)
  • Best performance in English — other languages have lower performance (source: Hugging Face, Model Card)
  • Requires trust_remote_code=True for Transformers inference due to hybrid architecture (source: Hugging Face, Model Card)
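
The last few limitations are mechanical and easy to trip over. A minimal vLLM sketch of 1M-context inference (the checkpoint id is a placeholder, and the server command in the trailing comment is an assumption based on the model card's Qwen3-Coder parser recommendation):

    import os

    # Per the model card, vLLM rejects max_model_len above the config
    # default (256K) unless this environment variable is set first.
    os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="nvidia/Nemotron-3-Nano-30B",  # placeholder checkpoint id
        max_model_len=1_000_000,             # raise the 256K default to 1M
        trust_remote_code=True,              # hybrid architecture ships custom code
    )
    params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024)
    outputs = llm.generate(["<your long-context prompt here>"], params)
    print(outputs[0].outputs[0].text)

    # Server equivalent (shell), with the recommended tool-call parser:
    #   VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve <model> \
    #     --max-model-len 1000000 --tool-call-parser qwen3_coder --enable-auto-tool-choice

Budget VRAM accordingly: the full 1M window in BF16 needs roughly 120GB+ per the model card.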

How to prompt Nemotron 30B

Preferred instruction format

Standard OpenAI-compatible chat format with system/user/assistant roles. The chat template supports an enable_thinking flag to toggle reasoning mode on or off, as in the sketch below.
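
A minimal Transformers sketch, assuming the enable_thinking kwarg is forwarded to the chat template the way the model card describes (the checkpoint id is a placeholder):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "nvidia/Nemotron-3-Nano-30B"  # placeholder checkpoint id

    tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )

    messages = [
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Why does the regex (a+)+$ backtrack badly?"},
    ]
    # enable_thinking toggles reasoning mode in the chat template.
    input_ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, enable_thinking=True,
        return_tensors="pt",
    ).to(model.device)

    # Reasoning on: sample at temperature=1.0 / top_p=1.0 as recommended.
    # With enable_thinking=False, switch to greedy decoding (do_sample=False).
    out = model.generate(input_ids, max_new_tokens=512, do_sample=True,
                         temperature=1.0, top_p=1.0)
    print(tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))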

Recommended practices

  • Use temperature=1.0 and top_p=1.0 for reasoning tasks (source: Hugging Face, Model Card)
  • Use temperature=0.6 and top_p=0.95 for tool calling (source: Hugging Face, Model Card)
  • Use greedy decoding (do_sample=False) when reasoning is disabled (source: Hugging Face, Model Card)
  • Control the reasoning budget via the ThinkingBudgetClient pattern to cap thinking tokens; see the sketch after this list (source: Hugging Face, Model Card)
  • Use the Qwen3-Coder tool-call parser with vLLM for tool calling (source: Hugging Face, Model Card)
  • Maintain at least 75% reasoning and 25% non-reasoning examples when fine-tuning (source: Community testing)
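
A sketch of that budget-control pattern against an OpenAI-compatible endpoint (the URL, model id, and </think> marker handling are assumptions; the model card's actual ThinkingBudgetClient may differ in details):

    from openai import OpenAI

    # Sketch of the thinking-budget pattern against an OpenAI-compatible
    # vLLM server. Endpoint and model id are placeholders.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    MODEL = "nvidia/Nemotron-3-Nano-30B"  # placeholder

    def generate_with_budget(prompt, thinking_budget, answer_tokens=512):
        # Phase 1: let the model think, capped at thinking_budget tokens.
        first = client.completions.create(
            model=MODEL, prompt=prompt, max_tokens=thinking_budget,
            temperature=1.0, top_p=1.0,  # recommended reasoning sampling
        )
        text = first.choices[0].text
        # Phase 2: if the cap cut thinking short, close the think block
        # manually and continue decoding the final answer.
        if "</think>" not in text:
            text += "</think>\n"
        final = client.completions.create(
            model=MODEL, prompt=prompt + text, max_tokens=answer_tokens,
            temperature=1.0, top_p=1.0,
        )
        return text + final.choices[0].text

The idea is two-phase decoding: spend at most the budget on thinking, force the think block closed if it ran over, then decode the answer.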

Anti-patterns to avoid

  • Do NOT use low temperature for reasoning tasks — model is calibrated for temperature=1.0 (source: Hugging Face, Model Card)
  • Do NOT fine-tune the router layer — it degrades expert selection (source: Community testing)
  • Do NOT disable reasoning for complex tasks requiring multi-step logic — accuracy drops significantly (source: Hugging Face, Model Card)
  • Do NOT exceed 256K context without explicitly enabling long model length — default config will truncate (source: Hugging Face, Model Card)

Skip the manual application.

Refrase reads everything above and applies it for you. Try it on one of your own prompts.