LIVE · 2026.03.04 · v1.0

ALL Bench Leaderboard 2026

31 Models × 25 Metrics — The only leaderboard combining Metacognitive (FINAL Bench) + ARC-AGI-2 + 23 standard benchmarks
✓ Scores verified against Artificial Analysis Intelligence Index (2026.03.04) · AA #1 Gemini 3.1 Pro · AA #2 GPT-5.3 Codex · AA OpenSrc #1 GLM-5
Scale AI SEAL · artificialanalysis.ai · arcprize.org · FINAL-Bench/Metacognitive (HF Official) · Chatbot Arena · aimultiple.com

31 Models · 25 Metrics · 10 Providers · 17 Open Source · +6 New in v1.0
Sections: 📊 Leaderboard · 📈 Charts · 📎 Benchmark Info
🏆 ALL Bench Composite Score Ranking
Average over 10 core benchmarks, with unreported scores counted as 0 · MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · Metacog · SWE-Pro · BFCL · IFEval · SWE-V · Colored by provider
Columns: Model · Provider · 🏆 Score · 📅 Release · 📚 MMLU-Pro · 🧠 GPQA◆ · 📐 AIME25 · 🔭 HLE · 🧩 ARC-AGI-2★ · 🧬 Metacog★ · 🏗 SWE-Pro · 🔧 BFCL · 📋 IFEval · 🖥 LCB · 💻 SWE-V⚠ · 🌍 MMMLU · 📥 CtxIn · 📤 CtxOut · ⚡ tok/s · ⏱ TTFT · 👁 Vision · ⚙ Arch · 🏆 ELO · 📄 License · 💰 $/M in
Grade: S ≥ 90% · A ≥ 75% · B ≥ 60% · C < 60%
★ = New in v1.0 · 💚 Green row = open-source value pick · 🧩 ARC-AGI-2 = arcprize.org official · 🧬 Metacog = FINAL-Bench official (9 models measured)
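
As a quick illustration, the grade bands above map a composite percentage onto the S/A/B/C labels. A minimal Python sketch of that mapping (the function name and sample scores are illustrative, not part of the leaderboard's own code):

```python
def grade(score_pct: float) -> str:
    """Map a composite score (0-100) to the S/A/B/C bands listed above."""
    if score_pct >= 90:
        return "S"
    if score_pct >= 75:
        return "A"
    if score_pct >= 60:
        return "B"
    return "C"

# Illustrative values only; see the table for actual composite scores.
for name, pct in [("model-x", 91.2), ("model-y", 78.0), ("model-z", 54.3)]:
    print(name, grade(pct))
```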

🧩 ARC-AGI-2 — Abstract Reasoning Frontier

Official arcprize.org · Vertical bars by score · Contamination-proof visual reasoning benchmark

Key: Gemini 3.1 Pro dominates at 88.1% (#1 on the Feb 2026 leaderboard). GPT-5.2 52.9% · Claude Opus 4.6 ~37.6%. Kimi K2.5 is surprisingly low at 12.1% despite its HLE dominance — a distinct capability axis.

🧬 Metacog: Baseline → Self-Correction Gain (Δ)

FINAL-Bench official · Baseline FINAL Score vs MetaCog condition · Error Recovery drives 94.8% of gains

Key: Claude Opus 4.6 has the lowest baseline (rank 9) but the largest Δ gain (+20.13) — the strongest self-correction. Kimi K2.5 has the highest baseline but the smallest gain. The declarative-procedural gap persists across all models.
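
The Δ plotted here is simply the MetaCog-condition score minus the baseline FINAL Score. A minimal sketch of that arithmetic; the values below are placeholders, not official FINAL-Bench numbers:

```python
# Self-correction gain as charted above: delta = metacog_score - baseline_score.
# All values are illustrative placeholders.
results = {
    #            (baseline FINAL, MetaCog condition)
    "model-a": (56.0, 76.1),   # low baseline, large gain
    "model-b": (68.7, 73.0),   # high baseline, small gain
}

for model, (baseline, metacog) in results.items():
    delta = metacog - baseline
    print(f"{model}: baseline={baseline:.2f}  metacog={metacog:.2f}  delta={delta:+.2f}")
```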

🕸 Capability Radar — TOP 6 Multi-Axis Profile

MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · MMMLU · Each axis normalized to 100

Key: No single model dominates all axes. Gemini leads MMMLU and HLE, GPT-5.2 leads MMLU-Pro, and Kimi K2.5 is exceptional on MMLU-Pro at 92.0. These differing strengths suggest routing strategies that send each query to the model strongest on that axis.
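
The page does not state the exact normalization, so the sketch below assumes one plausible choice: rescale each axis so the best model on that axis sits at 100. Model names and scores are placeholders.

```python
# Per-axis max-normalization for a radar chart (one possible reading of
# "each axis normalized to 100"). All numbers are placeholders.
axes = ["MMLU-Pro", "GPQA", "AIME", "HLE", "ARC-AGI-2", "MMMLU"]
scores = {
    "model-a": [92.0, 71.0, 95.0, 44.9, 12.1, 80.0],
    "model-b": [88.0, 74.0, 93.0, 44.7, 88.1, 88.0],
}

axis_max = [max(vals[i] for vals in scores.values()) for i in range(len(axes))]
normalized = {
    model: [round(100 * v / mx, 1) for v, mx in zip(vals, axis_max)]
    for model, vals in scores.items()
}
print(normalized)
```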

📊 Capability Domains — Reasoning vs Coding vs Language

Grouped bars: Reasoning avg (GPQA+AIME+HLE) · Coding avg (SWE-Pro+LCB) · Language avg (MMLU-Pro+MMMLU+IFEval)

Key: Claude Opus 4.6 leads Coding domain. Gemini 3.1 Pro leads Language. GPT-5.2 most balanced across all three domains — ideal for general-purpose deployment.
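
The three domain scores are plain means of the benchmarks listed above. A minimal sketch with placeholder scores for a single hypothetical model:

```python
# Domain averages as described above: Reasoning = mean(GPQA, AIME, HLE),
# Coding = mean(SWE-Pro, LCB), Language = mean(MMLU-Pro, MMMLU, IFEval).
# Scores are placeholders.
domains = {
    "Reasoning": ["GPQA", "AIME", "HLE"],
    "Coding": ["SWE-Pro", "LCB"],
    "Language": ["MMLU-Pro", "MMMLU", "IFEval"],
}
model_scores = {"GPQA": 74.0, "AIME": 93.0, "HLE": 40.0,
                "SWE-Pro": 55.0, "LCB": 82.0,
                "MMLU-Pro": 88.0, "MMMLU": 85.0, "IFEval": 92.0}

for domain, benches in domains.items():
    avg = sum(model_scores[b] for b in benches) / len(benches)
    print(f"{domain}: {avg:.1f}")
```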

💰 Performance vs Cost — Value Frontier Map

X = Input price log scale ($/M tokens) · Y = Composite Score · Top-left quadrant = elite value zone

Value leaders: DeepSeek V3.2 ($0.14/M, score ~74) and GLM-5 ($0.35/M) offer exceptional open-weight value. GPT-OSS-120B is truly free with competitive performance.

🏭 Provider Strength — Average Score by Company

Average composite score across all models per provider · Shows lab-level consistency

Key: OpenAI strongest average (combining closed+OSS models). Alibaba's Qwen3.5 family shows remarkable breadth. DeepSeek punches above weight with MIT-licensed models.
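
A minimal sketch of the per-provider aggregation behind this chart, assuming one composite score per model; the providers, models, and scores below are illustrative.

```python
from collections import defaultdict

# Provider strength = mean composite score over that provider's models.
models = [
    ("provider-a", "model-1", 82.0),
    ("provider-a", "model-2", 74.5),
    ("provider-b", "model-3", 79.0),
]

by_provider = defaultdict(list)
for provider, _name, score in models:
    by_provider[provider].append(score)

for provider, scores in sorted(by_provider.items()):
    print(f"{provider}: {sum(scores) / len(scores):.1f} ({len(scores)} models)")
```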

📅 Intelligence Timeline — Score vs Release Date

Bubble size = context window (log scale) · Color = provider · Rapid capability gains 2025→2026

Key: ~15-point score jump from Jan 2025 to Feb 2026. Feb 2026 releases (GPT-5.2, Gemini 3.1 Pro) establish new ceiling. Context window growth independent of intelligence score.

⚖ Open vs Closed — Distribution Comparison

Score distribution: Open-weight (15 models) vs Closed-API (6 models) · Box plot style with individual points

Key: Open-weight models now overlap significantly with closed-API. Top open models (Kimi K2.5, Qwen3.5-397B) match or exceed many closed offerings — open-source gap is closing.

📐 Benchmark Score Variance — Consistency Analysis

For each benchmark: show min/max/mean across all models · Reveals benchmark difficulty & discrimination power

Key: HLE shows widest variance (7.0–44.9) = best discrimination. ARC-AGI-2 also highly discriminating (12.1–88.1). AIME25 scores cluster high — many models saturating it.
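
A minimal sketch of the min/max/mean summary, assuming unreported scores are simply skipped; the values below are placeholders, not the leaderboard's data.

```python
# Per-benchmark spread across models: min, max, mean, ignoring unreported (None).
benchmark_scores = {
    "HLE": [7.0, 21.3, 44.9, None, 30.2],
    "ARC-AGI-2": [12.1, 37.6, 52.9, 88.1, None],
    "AIME25": [88.0, 91.5, 93.0, 95.0, 96.7],
}

for bench, raw in benchmark_scores.items():
    vals = [v for v in raw if v is not None]
    print(f"{bench}: min={min(vals):.1f} max={max(vals):.1f} "
          f"mean={sum(vals) / len(vals):.1f} (n={len(vals)})")
```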

🌡 Full Benchmark Heatmap — 31 Models × 11 Benchmarks

Color intensity = score · White/light = unreported · Indigo = high · Reveals capability patterns across the entire landscape

🧩 ARC-AGI-2 ★NEW — Abstract Reasoning

Tests novel visual pattern completion — cannot be solved by memorization. arcprize.org. Gemini 3.1 Pro 88.1% (Feb 2026 leaderboard 1st) · GPT-5.2 52.9% · Claude Opus 4.6 ~37.6% · Kimi K2.5 12.1%. Most contamination-proof benchmark available.

🇰🇷 Korea Sovereign AI — Status of the Independent AI Foundation Model ("Dokpamo") Project

Run by the Ministry of Science and ICT, the Independent AI Foundation Model Project has 4 elite teams as of 2026.02: LG AI Research (K-EXAONE) · SK Telecom (A.X K1) · Upstage (Solar Open 100B) · Motif Technologies.
• 1st evaluation (2026.01.15): 5 teams → 3 teams (Naver Cloud dropped for insufficient independence, NC AI dropped for insufficient scores)
• Second-chance round (2026.02.20): Motif Technologies added → 4-team lineup
• K-EXAONE: 1st place in the 1st evaluation · average of 72 across 13 benchmarks · AA open-weight top 10 · 236B MoE
• Solar Open 100B: AIME 84.3% · 19.7T tokens · 100B MoE · arXiv 2601.07022
• A.X K1: Korea's first 500B-parameter model · Apache 2.0 open source
• Goal: reach at least 95% of global frontier AI model performance · final 2 teams selected in 2027 · ₩530B budget

🧬 Metacognitive ★NEW — FINAL-Bench

Official: HF FINAL-Bench/Metacognitive. 100 tasks, 9 SOTA models tested. Baseline FINAL Score: Kimi K2.5 68.71 · GPT-5.2 62.76 · GLM-5 62.50 · Gemini 59.5 · Opus 4.6 56.04 (rank 9). ER (error recovery) accounts for 94.8% of self-correction gains. 14 models = not evaluated (—).

📊 Composite Score — Fair Weighted Average

Sum of 10 benchmarks (MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · Metacog · SWE-Pro · BFCL · IFEval · SWE-V) ÷ 10. Unreported (null) scores count as 0, so benchmark avoidance is penalized. The n/10 above each bar shows actual coverage. The earlier "null-excluded average" method produced false #1 results, e.g. Grok 4.1 Fast (only 2 benchmarks reported → 86) and DeepSeek R2 (4 reported → 88).
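
A minimal sketch of this rule in Python: sum the 10 benchmarks, divide by 10, count unreported scores as 0, and report n/10 coverage alongside the score. The example inputs are placeholders.

```python
BENCHMARKS = ["MMLU-Pro", "GPQA", "AIME", "HLE", "ARC-AGI-2",
              "Metacog", "SWE-Pro", "BFCL", "IFEval", "SWE-V"]

def composite(scores: dict) -> tuple[float, str]:
    """Null-penalized composite: unreported benchmarks count as 0, not skipped."""
    total = sum(scores.get(b) or 0.0 for b in BENCHMARKS)
    covered = sum(1 for b in BENCHMARKS if scores.get(b) is not None)
    return total / len(BENCHMARKS), f"{covered}/{len(BENCHMARKS)}"

# High scores on only two benchmarks no longer produce a false #1:
print(composite({"GPQA": 90.0, "AIME": 95.0}))    # (18.5, '2/10')
print(composite({b: 70.0 for b in BENCHMARKS}))   # (70.0, '10/10')
```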

📚 MMLU-Pro

HF: TIGER-Lab/MMLU-Pro. Roughly 12,000 expert-level questions across 14 disciplines. Largest sample size of the knowledge benchmarks here → highest statistical reliability. Much harder than the original MMLU. Gold standard general-knowledge benchmark.

🧠 GPQA Diamond ⭐

HF: Idavidrein/gpqa. 198 PhD-level questions in biology, chemistry, physics. Human expert average ~65%. Highest discrimination power among frontier models.

📐 AIME 2025

AoPS: 2025 AIME. American Invitational Mathematics Examination. 2025 problem set minimizes contamination. Tests mathematical reasoning and creative problem solving.

🔭 HLE — Humanity's Last Exam

HF: centerforaisafety/hle. 2,500 expert-submitted questions. Intended to be the final closed-ended academic benchmark. Kimi K2.5 44.9% · Gemini 3.1 Pro 44.7% lead.

🏗 SWE-Pro ⭐ Recommended

scale.com/leaderboard/coding. Scale AI SEAL, 1,865 tasks from real repos. Contamination-free. Scores run ~35 pts lower than SWE-Verified — an honest measure of real coding. OpenAI recommends it over Verified.

💻 SWE-Verified ⚠ Caution

swebench.com. 59.4% of tasks found defective in OpenAI audit. Memorization/contamination risk. Reference only. Prefer SWE-Pro for accurate assessment.

🔧 BFCL v4

gorilla.cs.berkeley.edu. Berkeley Function-Calling Leaderboard. Measures tool use and agent capability. Qwen3.5-122B world #1.

📋 IFEval

HF: google/IFEval. Instruction following evaluation. Verifiable output constraints. Tests precision compliance.

🖥 LiveCodeBench

livecodebench.github.io. Competitive programming from LeetCode, AtCoder, Codeforces. Continuously updated to prevent contamination.

🌍 MMMLU — Multilingual

HF: openai/MMMLU. MMLU professionally translated into 14 languages. Gemini 3.1 Pro leads at ~88%. Qwen3.5 officially supports 201 languages.

⚙ Architecture

MoE = sparse activation (efficient), Dense = full params (quality), Hybrid = DeltaNet+MoE. Parentheses = active/total params. Active params determine inference cost. Qwen3.5-35B: 3B active → 194 tok/s.

⏱ TTFT Latency

Time To First Token (seconds). Lower is faster. Mistral Large 3 0.3s (fastest) · GPT-5.2 0.6s. Reasoning models (DeepSeek R1, 8s) are slower due to chain-of-thought. <2s recommended for real-time apps.

💰 Pricing

Input cost in $/million tokens. 0 = free open-weights. Qwen3.5-35B $0.10/M, DeepSeek V3.2 $0.14/M offer extreme value vs closed models. GPT-5.2 $1.75/M · Claude Opus 4.6 $5/M.
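
The per-request arithmetic is just (input tokens ÷ 1,000,000) × price. A quick sketch using the prices quoted above and a hypothetical 80K-token prompt:

```python
# Input cost = (input_tokens / 1_000_000) * price_per_million_tokens.
# Prices are from the text above; the 80K-token prompt is a hypothetical example.
prices_per_m = {"Qwen3.5-35B": 0.10, "DeepSeek V3.2": 0.14,
                "GPT-5.2": 1.75, "Claude Opus 4.6": 5.00}
input_tokens = 80_000

for model, price in prices_per_m.items():
    print(f"{model}: ${input_tokens / 1_000_000 * price:.4f} per prompt")
```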

ALL Bench Leaderboard v1.0 · Updated March 4, 2026 · Scale AI SEAL · artificialanalysis.ai · arcprize.org · FINAL-Bench/Metacognitive (HF) · Chatbot Arena