AI Benchmarks

A reference table of evaluation benchmarks used to measure AI model performance: task counts, scoring methods, contamination risk, and what each benchmark actually tests.

| Benchmark | Category | Contamination risk | Description | Tasks | Metric |
|---|---|---|---|---|---|
| AgentBench | agentic | medium | Multi-environment agent evaluation (OS, DB, web, games) | n/a | Success rate |
| AIME 2024 | reasoning | low | American Invitational Mathematics Examination problems | 30 | Correct answers out of 30 |
| ARC-AGI-2 | reasoning | low | Abstract reasoning benchmark testing novel pattern recognition | 400 | Accuracy (%) |
| GAIA | agentic | low | General AI Assistants benchmark chaining web browsing, file parsing, and multi-document reasoning | 466 | Accuracy (%) |
| GPQA Diamond | knowledge | low | Graduate-level science questions (physics, chemistry, biology) | 198 | Accuracy (%) |
| HumanEval | coding | high | Code generation from docstrings (Python) | 164 | pass@1 (%) |
| LiveCodeBench | coding | low | Continuously updated competitive programming benchmark | rolling | pass@1 (%) |
| MMLU-Pro | knowledge | medium | Massive Multitask Language Understanding with harder questions | 12,032 | Accuracy (%) |
| SWE-bench Pro | coding | low | Next-generation software engineering benchmark by Scale AI with harder tasks | 300 | % of issues resolved |
| SWE-bench Verified | coding | high | Gold-standard software engineering benchmark with human-verified tasks | 500 | % of issues resolved |
| tau-bench | agentic | low | Agent benchmark for tool-augmented LLMs | 200 | Task completion rate |
| WebVoyager | web | medium | Web navigation and task completion benchmark | 643 | Task completion rate |
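
The coding benchmarks above (HumanEval, LiveCodeBench) report pass@1: the fraction of problems for which a generated solution passes all unit tests. When more than one sample is drawn per problem, the standard unbiased estimator introduced alongside HumanEval (Chen et al., 2021) is pass@k = 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them pass. A minimal sketch in Python; the function name, variable names, and example numbers are illustrative, not taken from any benchmark's reference implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for one problem
    c: samples that pass all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Too few failing samples to fill a size-k subset,
        # so every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1 is the mean over problems.
# With n == k == 1 this reduces to the plain pass rate.
results = [(10, 3), (10, 0), (10, 10)]  # (n, c) per problem, illustrative
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")  # (0.3 + 0.0 + 1.0) / 3 = 0.433
```

The remaining metrics in the table (accuracy, success rate, task completion rate, percentage of issues resolved) are plain proportions over tasks, so no such estimator is needed.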