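Large Model Capability Leaderboard / Evaluation date: 2026-01-08

| Model | Organization | Global Average | Reasoning Average | Coding Average | Agentic Coding Average | Mathematics Average | Data Analysis Average | Language Average | Instruction Following Average |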
|---|---|---|---|---|---|---|---|---|---|
| Claude 4.6 Opus Thinking High Effort | Anthropic | 76.33 | 88.67 | 78.18 | 61.67 | 89.32 | 69.89 | 83.27 | 63.31 |
| Claude 4.5 Opus Thinking High Effort | Anthropic | 75.96 | 80.09 | 79.65 | 63.33 | 90.39 | 74.44 | 81.26 | 62.55 |
| GPT-5.2 High | OpenAI | 74.84 | 83.21 | 76.07 | 51.67 | 93.17 | 78.16 | 79.81 | 61.77 |
| GPT-5.2 Codex | OpenAI | 74.30 | 77.71 | 83.62 | 51.67 | 88.77 | 78.20 | 73.68 | 66.45 |
| GPT-5.1 Codex Max High | OpenAI | 73.98 | 83.65 | 80.68 | 53.33 | 83.22 | 70.12 | 76.48 | 70.38 |
| Gemini 3 Pro Preview High | Google | 73.39 | 77.42 | 74.60 | 55.00 | 81.84 | 74.39 | 84.62 | 65.85 |
| Gemini 3 Flash Preview High | Google | 72.40 | 74.55 | 73.90 | 40.00 | 84.17 | 74.77 | 84.56 | 74.86 |
| GPT-5.1 High | OpenAI | 72.04 | 78.79 | 72.49 | 53.33 | 86.90 | 69.61 | 79.26 | 63.90 |
| GPT-5 Pro | OpenAI | 70.48 | 81.69 | 72.11 | 51.67 | 86.17 | 57.04 | 80.69 | 63.96 |
| Kimi K2.5 Thinking | Moonshot AI | 69.07 | 75.96 | 77.86 | 48.33 | 84.87 | 61.36 | 77.67 | 57.41 |
| GPT-5.1 Codex | OpenAI | 68.61 | 81.98 | 71.78 | 53.33 | 79.58 | 60.75 | 69.48 | 63.39 |
| Claude Sonnet 4.5 Thinking | Anthropic | 68.19 | 77.59 | 80.36 | 53.33 | 79.31 | 56.97 | 76.45 | 53.35 |
| GPT-5 Mini High | OpenAI | 65.91 | 68.32 | 68.20 | 46.67 | 82.20 | 55.20 | 75.52 | 65.27 |
| DeepSeek V3.2 Thinking | DeepSeek | 62.20 | 77.17 | 64.62 | 40.00 | 85.03 | 50.00 | 70.41 | 48.19 |
| Grok 4 | xAI | 62.02 | 79.13 | 73.13 | 30.00 | 83.02 | 63.38 | 76.39 | 29.07 |
| Claude 4.1 Opus Thinking | Anthropic | 61.81 | 72.33 | 74.66 | 48.33 | 73.19 | 48.98 | 72.76 | 42.40 |
| Kimi K2 Thinking | Moonshot AI | 61.59 | 63.49 | 67.44 | 38.33 | 81.10 | 52.29 | 66.45 | 62.03 |
| Claude Haiku 4.5 Thinking | Anthropic | 61.32 | 61.68 | 72.81 | 41.67 | 77.53 | 59.30 | 66.45 | 49.78 |
| Claude 4 Sonnet Thinking | Anthropic | 61.27 | 69.01 | 77.48 | 40.00 | 70.50 | 54.63 | 72.91 | 44.34 |
| GPT-5.1 Codex Mini | OpenAI | 60.38 | 64.71 | 69.93 | 40.00 | 76.26 | 49.70 | 63.01 | 59.02 |
| Grok 4.1 Fast | xAI | 59.99 | 80.20 | 69.61 | 31.67 | 83.72 | 52.24 | 74.33 | 28.20 |
| Claude 4.5 Opus Medium Effort | Anthropic | 59.10 | 53.21 | 78.51 | 63.33 | 66.32 | 45.54 | 78.66 | 28.11 |
| DeepSeek V3.2 Exp Thinking | DeepSeek | 58.90 | 64.37 | 70.06 | 31.67 | 82.40 | 51.50 | 71.06 | 41.27 |
| Gemini 2.5 Pro (Max Thinking) | Google | 58.33 | 70.81 | 75.69 | 33.33 | 68.32 | 51.62 | 75.50 | 33.07 |
| GLM 4.7 | Z.AI | 58.09 | 59.73 | 73.13 | 41.67 | 76.02 | 55.17 | 65.23 | 35.66 |
| GLM 4.6 | Z.AI | 55.19 | 62.06 | 71.02 | 35.00 | 81.13 | 51.95 | 58.99 | 26.19 |
| Claude 4.1 Opus | Anthropic | 54.45 | 40.89 | 76.07 | 53.33 | 62.83 | 45.38 | 76.75 | 25.92 |
| Claude Sonnet 4.5 | Anthropic | 53.69 | 42.29 | 76.07 | 48.33 | 62.62 | 47.00 | 76.00 | 23.52 |
| Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 53.09 | 51.45 | 67.50 | 23.33 | 75.35 | 60.98 | 65.34 | 27.68 |
| Qwen 3 235B A22B Thinking 2507 | Alibaba | 52.97 | 59.40 | 68.97 | 6.67 | 73.39 | 52.18 | 69.52 | 40.64 |
| DeepSeek V3.2 | DeepSeek | 51.84 | 44.25 | 75.69 | 46.67 | 63.95 | 45.03 | 64.24 | 23.06 |
| Claude 4 Sonnet | Anthropic | 50.98 | 39.67 | 80.74 | 38.33 | 60.36 | 44.07 | 71.01 | 22.68 |
| Qwen 3 Next 80B A3B Thinking | Alibaba | 50.41 | 58.16 | 60.66 | 8.33 | 74.26 | 53.58 | 56.31 | 41.54 |
| DeepSeek V3.2 Exp | DeepSeek | 49.85 | 45.50 | 73.19 | 36.67 | 64.38 | 44.26 | 65.60 | 19.33 |
| GPT-5.2 No Thinking | OpenAI | 48.91 | 42.80 | 76.45 | 40.00 | 58.25 | 47.68 | 49.97 | 27.20 |
| Qwen 3 235B A22B Instruct 2507 | Alibaba | 48.84 | 58.43 | 69.61 | 13.33 | 68.03 | 44.72 | 66.07 | 21.72 |
| GPT-5 Nano High | OpenAI | 48.62 | 40.29 | 62.39 | 23.33 | 68.41 | 43.41 | 46.84 | 55.70 |
| Qwen 3 Next 80B A3B Instruct | Alibaba | 48.35 | 54.75 | 68.20 | 10.00 | 70.18 | 49.78 | 66.34 | 19.19 |
| Kimi K2 Instruct | Moonshot AI | 48.10 | 42.23 | 74.28 | 31.67 | 58.15 | 43.34 | 66.69 | 20.36 |
| Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 47.74 | 44.64 | 66.03 | 16.67 | 68.75 | 47.31 | 62.27 | 28.50 |
| GPT OSS 120b | OpenAI | 46.09 | 39.21 | 60.21 | 16.67 | 68.87 | 38.80 | 48.59 | 50.29 |
| Claude Haiku 4.5 | Anthropic | 45.33 | 33.94 | 72.17 | 33.33 | 57.97 | 45.13 | 57.05 | 17.75 |
| Grok Code Fast | xAI | 45.13 | 42.30 | 64.44 | 33.33 | 56.01 | 48.99 | 48.56 | 22.27 |
| Qwen 3 32B | Alibaba | 43.56 | 48.25 | 66.03 | 3.33 | 67.44 | 46.54 | 55.54 | 17.77 |
| GPT-5.1 No Thinking | OpenAI | 42.65 | 26.81 | 77.48 | 28.33 | 44.51 | 44.07 | 53.84 | 23.50 |
| Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 42.56 | 43.34 | 66.41 | 5.00 | 61.04 | 47.04 | 51.98 | 23.08 |
| Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 42.39 | 36.16 | 65.39 | 1.67 | 64.90 | 47.88 | 52.60 | 28.11 |
| Devstral 2 | Mistral | 41.24 | 27.74 | 66.79 | 43.33 | 52.52 | 39.14 | 45.67 | 13.50 |
| GLM 4.6V | Z.AI | 40.07 | 37.22 | 64.24 | 3.33 | 62.50 | 46.41 | 49.74 | 17.06 |
| Qwen 3 30B A3B | Alibaba | 39.01 | 36.68 | 48.88 | 1.67 | 65.35 | 44.92 | 54.47 | 21.11 |
| Grok 4.1 Fast (Non-Reasoning) | xAI | 33.45 | 23.35 | 54.26 | 10.00 | 38.92 | 40.61 | 50.01 | 16.98 |
| Trinity Large Preview | Arcee | 32.74 | 20.61 | 65.65 | 3.33 | 44.93 | 40.33 | 42.15 | 12.19 |

Data source: LiveBench AI leaderboard
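
In the rows above, the global average (the first score column) appears to equal the unweighted mean of the seven category averages that follow it. This is an observation from the table, not a statement of the benchmark's official methodology. The Python sketch below checks that relationship on three rows copied from the table. The names `ROWS`, `recomputed_global`, and `find_mismatches` are hypothetical helpers introduced for illustration, and the 0.01 tolerance is an assumption that allows for the two-decimal rounding of the published scores.

```python
from statistics import mean

# Rows copied from the table above:
# (model, reported global average, seven category averages in column order).
ROWS = [
    ("Claude 4.6 Opus Thinking High Effort", 76.33,
     [88.67, 78.18, 61.67, 89.32, 69.89, 83.27, 63.31]),
    ("GPT-5.2 High", 74.84,
     [83.21, 76.07, 51.67, 93.17, 78.16, 79.81, 61.77]),
    ("Gemini 3 Pro Preview High", 73.39,
     [77.42, 74.60, 55.00, 81.84, 74.39, 84.62, 65.85]),
]


def recomputed_global(categories):
    """Unweighted mean of the category averages."""
    return mean(categories)


def find_mismatches(rows, tolerance=0.01):
    """Return the models whose reported global average differs from the
    recomputed mean by more than `tolerance` (assumed rounding slack)."""
    return [
        model
        for model, reported, categories in rows
        if abs(reported - recomputed_global(categories)) > tolerance
    ]


if __name__ == "__main__":
    for model, reported, categories in ROWS:
        print(f"{model}: reported {reported:.2f}, "
              f"recomputed {recomputed_global(categories):.2f}")
    print("Mismatches:", find_mismatches(ROWS))
```

The same check can be extended to every row of the table. It is also a quick way to spot rows whose columns have shifted, because a shifted row will usually fail the comparison.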