31 Models × 25 Metrics — The only leaderboard combining Metacognitive (FINAL Bench) + ARC-AGI-2 + 23 standard benchmarks
✓ Scores verified against Artificial Analysis Intelligence Index (2026.03.04) · AA #1 Gemini 3.1 Pro · AA #2 GPT-5.3 Codex · AA OpenSrc #1 GLM-5
Scale AI SEAL · artificialanalysis.ai · arcprize.org · FINAL-Bench/Metacognitive (HF Official) · Chatbot Arena · aimultiple.com
| Model↕ | Provider | 🏆 Score↕ | 📅 Release↕ | 📚 MMLU-Pro↕ | 🧠 GPQA◆↕ | 📐 AIME25↕ | 🔭 HLE↕ | 🧩 ARC-AGI-2★↕ | 🧬 Metacog★↕ | 🏗 SWE-Pro↕ | 🔧 BFCL↕ | 📋 IFEval↕ | 🖥 LCB↕ | 💻 SWE-V⚠↕ | 🌍 MMMLU↕ | 📥 CtxIn↕ | 📤 CtxOut↕ | ⚡ tok/s↕ | ⏱ TTFT↕ | 👁 Vision | ⚙ Arch | 🏆 ELO↕ | 📄 License | 💰 $/M in↕ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Official arcprize.org · Vertical bars by score · Contamination-proof visual reasoning benchmark
FINAL-Bench official · Baseline FINAL Score vs MetaCog condition · Error Recovery drives 94.8% of gains
MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · MMMLU · Each axis normalized to 100
Grouped bars: Reasoning avg (GPQA+AIME+HLE) · Coding avg (SWE-Pro+LCB) · Language avg (MMLU-Pro+MMMLU+IFEval) · See the averaging sketch after these chart notes
X = Input price log scale ($/M tokens) · Y = Composite Score · Top-left quadrant = elite value zone
Average composite score across all models per provider · Shows lab-level consistency
Bubble size = context window (log scale) · Color = provider · Rapid capability gains 2025→2026
Score distribution: Open-weight (15 models) vs Closed-API (6 models) · Box plot style with individual points
For each benchmark: show min/max/mean across all models · Reveals benchmark difficulty & discrimination power
Color intensity = score · White/light = unreported · Indigo = high · Reveals capability patterns across the entire landscape
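As a concrete illustration of the grouped-bar averaging above, here is a minimal Python sketch; the category-to-benchmark mapping follows the caption, while the model scores in the example are hypothetical, not leaderboard data:

```python
from statistics import mean

# Category -> benchmarks mapping from the grouped-bar caption above.
CATEGORIES = {
    "Reasoning": ["GPQA", "AIME", "HLE"],
    "Coding": ["SWE-Pro", "LCB"],
    "Language": ["MMLU-Pro", "MMMLU", "IFEval"],
}

def category_averages(scores: dict[str, float]) -> dict[str, float]:
    """Average the benchmarks in each category (all scores on a 0-100 scale)."""
    return {
        cat: round(mean(scores[b] for b in benches), 2)
        for cat, benches in CATEGORIES.items()
    }

# Hypothetical per-model scores, for illustration only.
print(category_averages({
    "GPQA": 82.0, "AIME": 90.0, "HLE": 40.0,
    "SWE-Pro": 45.0, "LCB": 75.0,
    "MMLU-Pro": 85.0, "MMMLU": 80.0, "IFEval": 92.0,
}))
# -> {'Reasoning': 70.67, 'Coding': 60.0, 'Language': 85.67}
```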
Tests novel visual pattern completion — cannot be solved by memorization. arcprize.org. Gemini 3.1 Pro 88.1% (Feb 2026 leaderboard 1st) · GPT-5.2 52.9% · Claude Opus 4.6 ~37.6% · Kimi K2.5 12.1%. Most contamination-proof benchmark available.
'Sovereign AI Foundation Model Project' led by Korea's Ministry of Science and ICT. Four elite teams as of Feb 2026: LG AI Research (K-EXAONE) · SK Telecom (A.X K1) · Upstage (Solar Open 100B) · Motif Technologies.
• Round 1 evaluation (2026.01.15): 5 teams → 3 teams (Naver Cloud cut for insufficient independence, NC AI for insufficient scores)
• Wildcard round (2026.02.20): Motif Technologies added → 4-team lineup
• K-EXAONE: Round 1 winner · 72-point average across 13 benchmarks · AA open-weight top 10 · 236B MoE
• Solar Open 100B: AIME 84.3% · 19.7T tokens · 100B MoE · arXiv 2601.07022
• A.X K1: Korea's first 500B-parameter model · Apache 2.0 open source
• Goal: reach ≥95% of global frontier-model performance · final 2 teams selected in 2027 · ₩530 billion budget
Official: HF FINAL-Bench/Metacognitive. 100 tasks, 9 SOTA models tested. Baseline FINAL Score: Kimi K2.5 68.71 · GPT-5.2 62.76 · GLM-5 62.50 · Gemini 59.5 · Opus 4.6 56.04 (rank 9). ER (error recovery) accounts for 94.8% of self-correction gains. 14 models = not evaluated (—).
Sum of 10 benchmarks (MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · Metacog · SWE-Pro · BFCL · IFEval · SWE-V) ÷ 10. Unreported (null) scores count as 0, so benchmark avoidance is penalized. The n/10 above each bar shows actual coverage. The previous "average over reported benchmarks only" method produced false first-place results such as Grok 4.1 Fast (only 2 benchmarks reported → 86) and DeepSeek R2 (4 reported → 88).
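To make the null-penalty rule concrete, here is a minimal Python sketch, assuming each model's results arrive as a dict with None for unreported benchmarks; the example scores are hypothetical, not leaderboard data:

```python
from typing import Optional

# The 10 benchmarks that feed the composite score, per the rule above.
BENCHMARKS = [
    "MMLU-Pro", "GPQA", "AIME", "HLE", "ARC-AGI-2",
    "Metacog", "SWE-Pro", "BFCL", "IFEval", "SWE-V",
]

def composite_score(scores: dict[str, Optional[float]]) -> tuple[float, str]:
    """Sum of the 10 benchmarks ÷ 10; unreported (None) scores count as 0."""
    total = sum(scores.get(b) or 0.0 for b in BENCHMARKS)      # null → 0 penalty
    reported = sum(1 for b in BENCHMARKS if scores.get(b) is not None)
    return total / len(BENCHMARKS), f"{reported}/{len(BENCHMARKS)}"

# Hypothetical model reporting only 2 of 10 benchmarks: the null penalty pulls
# it down instead of letting a high partial average claim a false #1.
score, coverage = composite_score({"GPQA": 88.0, "AIME": 84.0})
print(score, coverage)  # 17.2 2/10
```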
HF: TIGER-Lab/MMLU-Pro. About 12,000 expert-level questions across 14 disciplines. Largest sample size → highest statistical reliability. Much harder than the original MMLU. Gold-standard general knowledge benchmark.
HF: Idavidrein/gpqa. 198 PhD-level questions (Diamond subset) in biology, chemistry, and physics. Human expert average ~65%. Highest discrimination power among frontier models.
AoPS: 2025 AIME. American Invitational Mathematics Examination. 2025 problem set minimizes contamination. Tests mathematical reasoning and creative problem solving.
HF: centerforaisafety/hle. 2,500 expert-submitted questions. Intended to be the final closed-ended academic benchmark. Kimi K2.5 44.9% · Gemini 3.1 Pro 44.7% lead.
scale.com/leaderboard/coding. Scale AI SEAL, 1,865 real repositories. Contamination-free. Scores run roughly 35 points below SWE-Verified, giving a more honest measure of real-world coding. OpenAI recommends it over Verified.
swebench.com. An OpenAI audit found 59.4% of tasks defective. Memorization/contamination risk; reference only. Prefer SWE-Pro for accurate assessment.
gorilla.cs.berkeley.edu. Berkeley Function-Calling Leaderboard. Measures tool use and agent capability. Qwen3.5-122B world #1.
HF: google/IFEval. Instruction following evaluation. Verifiable output constraints. Tests precision compliance.
livecodebench.github.io. Competitive programming from LeetCode, AtCoder, Codeforces. Continuously updated to prevent contamination.
HF: openai/MMMLU. MMLU professionally translated into 14 languages. Gemini 3.1 Pro ~88% leads. Qwen3.5 officially supports 201 languages.
MoE = sparse activation (efficient), Dense = full params (quality), Hybrid = DeltaNet+MoE. Parentheses = active/total params. Active params determine inference cost. Qwen3.5-35B: 3B active → 194 tok/s.
Time To First Token (seconds). Lower is faster. Mistral Large 3 0.3s · GPT-5.2 0.6s fastest. Reasoning models (DeepSeek R1 8s) are slower due to chain-of-thought. <2s recommended for real-time apps.
Input cost in $/million tokens. 0 = free open-weights. Qwen3.5-35B $0.10/M, DeepSeek V3.2 $0.14/M offer extreme value vs closed models. GPT-5.2 $1.75/M · Claude Opus 4.6 $5/M.
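As a quick worked example of the $/M-token arithmetic, a small sketch assuming a hypothetical 50,000-token prompt and the input prices quoted above:

```python
# Input-cost arithmetic: price is quoted per 1M input tokens.
PRICE_PER_M_INPUT = {           # USD per 1M input tokens, as quoted above
    "Qwen3.5-35B": 0.10,
    "DeepSeek V3.2": 0.14,
    "GPT-5.2": 1.75,
    "Claude Opus 4.6": 5.00,
}

prompt_tokens = 50_000          # hypothetical request size
for model, price in PRICE_PER_M_INPUT.items():
    cost = prompt_tokens / 1_000_000 * price
    print(f"{model}: ${cost:.4f} per request")
# e.g. Qwen3.5-35B: $0.0050 · GPT-5.2: $0.0875 · Claude Opus 4.6: $0.2500
```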
ALL Bench Leaderboard v1.0 · Updated March 4, 2026 · Scale AI SEAL · artificialanalysis.ai · arcprize.org · FINAL-Bench/Metacognitive (HF) · Chatbot Arena