Kimi K2
Moonshot · kimi family · Official docs
Kimi K2 is a cost-effective frontier model with exceptional agentic capabilities. Its 1T-total / 32B-active MoE architecture delivers strong performance at low inference cost. Its key differentiator is that it is optimized specifically for tool calling and agentic workflows, not just chat. The 0.6 temperature recommendation is strict: deviating significantly degrades output quality. Watch for verbosity (2-2.5x the token usage of peers), which can inflate costs despite low per-token pricing. The vLLM tool-calling compatibility issues have been documented and fixed, but they indicate that the model's tool-call format requires careful parser configuration.
Specifications
Strengths
Key capabilities
- ✓ 1 trillion total parameters with only 32B activated per token via 384-expert MoE (source: Hugging Face, Model Card)
- ✓ State-of-the-art agentic intelligence with native tool calling and autonomous problem-solving (source: Hugging Face, Model Card)
- ✓ Strong coding: 53.7% on LiveCodeBench v6 (SOTA), 65.8% on SWE-bench Verified (single attempt) (source: Hugging Face, Model Card)
- ✓ Mathematical reasoning: 97.4% on MATH-500 (SOTA), 69.6% on AIME 2024 (SOTA) (source: Hugging Face, Model Card)
- ✓ General knowledge: 89.5% on MMLU, 92.7% on MMLU-Redux (SOTA) (source: Hugging Face, Model Card)
- ✓ Instruction following: 89.8% on IFEval, Prompt Strict (SOTA) (source: Hugging Face, Model Card)
- ✓ OpenAI and Anthropic API compatible — drop-in replacement (source: Hugging Face, Model Card)
- ✓ Modified MIT License allowing commercial use (source: Hugging Face, Model Card)
Known limitations
- ⚠ Max output tokens limited to 8K in standard mode, 16K for SWE-bench agentless (source: Hugging Face, Model Card)
- ⚠ No vision/multimodal support in K2 base — requires K2.5 for vision (source: Community documentation)
- ⚠ Extreme verbosity: 2-2.5x token usage compared to other models, impacting cost and latency (source: Skywork.ai analysis)
- ⚠ Initial vLLM tool calling only 18% success rate without custom parser fixes (source: vLLM Blog, debugging report)
- ⚠ Thinking mode adds 15-35% latency and 1.2-1.6x token overhead (source: Skywork.ai, Kimi K2 Thinking Limits)
- ⚠ Can overthink easy tasks in thinking mode and drift on long, rule-heavy prompts (source: Skywork.ai analysis)
- ⚠ 2-5% hallucination rate on highly specific uncited facts even in thinking mode (source: Skywork.ai analysis)
- ⚠ Reflex-grade model without long thinking — not designed for deep extended reasoning (source: Hugging Face, Model Card)
How to prompt Kimi K2
Preferred instruction format
Standard OpenAI-compatible chat format with system/user/assistant roles. Default system prompt: 'You are Kimi, an AI assistant created by Moonshot AI.'
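The chat format described above can be sketched as a plain message list; the system prompt is the default from the model card, and the user message is an illustrative placeholder:

```python
# Minimal sketch of the standard OpenAI-compatible chat format for Kimi K2.
# The system prompt below is the documented default; the user turn is a
# made-up example.
messages = [
    {
        "role": "system",
        "content": "You are Kimi, an AI assistant created by Moonshot AI.",
    },
    {
        "role": "user",
        "content": "Summarize the MoE architecture in one sentence.",
    },
]
```

Replace the default system prompt with a task-specific one when special instructions are needed.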
Recommended practices
- Set temperature to 0.6 for Instruct mode (source: Hugging Face, Model Card)
- Use tool_choice='auto' for autonomous tool selection (source: GitHub, Kimi-K2 README)
- OpenAI-compatible function calling format with tools parameter (source: Hugging Face, Model Card)
- For Anthropic API compatibility, apply temperature mapping: real_temperature = request_temperature * 0.6 (source: Hugging Face, Model Card)
- Provide task-specific system prompts rather than relying on defaults when special instructions are needed (source: GitHub, Kimi-K2 README)
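The practices above can be combined into a single request payload. A minimal sketch follows; the model identifier (`kimi-k2-instruct`) and the `get_weather` tool schema are illustrative assumptions, and the payload is only constructed here, not sent:

```python
# Sketch of an OpenAI-compatible tool-calling request for Kimi K2.
# Assumptions: model name "kimi-k2-instruct" and the get_weather tool
# are hypothetical; pass the dict to any OpenAI-compatible client.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

request = {
    "model": "kimi-k2-instruct",  # assumed model identifier
    "temperature": 0.6,           # recommended setting for Instruct mode
    "messages": [
        {"role": "user", "content": "What's the weather in Berlin?"}
    ],
    "tools": tools,
    "tool_choice": "auto",        # let the model decide when to call tools
}
```

With `tool_choice` set to `"auto"`, the model selects tools autonomously rather than being forced into (or away from) a call.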
Anti-patterns to avoid
- Do NOT set temperature above 0.6 for Instruct mode — model was optimized for this setting (source: Hugging Face, Model Card)
- Do NOT use long rule-heavy prompts — model may drift from instructions (source: Skywork.ai analysis)
- Do NOT assume live data retrieval without explicitly enabling browsing/tools — model produces confident but stale answers otherwise (source: Skywork.ai analysis)
- Do NOT expect vision capabilities from K2 base — use K2.5 for multimodal tasks (source: Community documentation)
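The Anthropic-compatibility temperature rule noted under recommended practices can be sketched as a small helper (the function name is illustrative):

```python
def map_anthropic_temperature(request_temperature: float) -> float:
    """Map an Anthropic-style request temperature to Kimi K2's scale.

    Per the model card, the Anthropic-compatible endpoint applies
    real_temperature = request_temperature * 0.6, so an Anthropic
    default of 1.0 lands on the recommended 0.6 for Instruct mode.
    """
    return request_temperature * 0.6
```

For example, a client sending `temperature=1.0` through the Anthropic-compatible API is served at an effective temperature of 0.6.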