AI Benchmarks

A reference table of evaluation benchmarks used to measure AI model performance: task counts, scoring methods, contamination risk, and what each benchmark actually tests.

| Benchmark | Category | Contamination risk | Description | Tasks | Metric |
|---|---|---|---|---|---|
| AgentBench | agentic | medium | Multi-environment agent evaluation (OS, DB, web, games) | n/a | Success rate |
| AIME 2024 | reasoning | low | American Invitational Mathematics Examination problems | 30 | Correct answers out of 30 |
| ARC-AGI-2 | reasoning | low | Abstract reasoning benchmark testing novel pattern recognition | 400 | Accuracy (%) |
| GAIA | agentic | low | General AI Assistants benchmark chaining web browsing, file parsing, and multi-document reasoning | 466 | Accuracy (%) |
| GPQA Diamond | knowledge | low | Graduate-level science questions (physics, chemistry, biology) | 198 | Accuracy (%) |
| HumanEval | coding | high | Code generation from docstrings (Python) | 164 | pass@1 (%) |
| LiveCodeBench | coding | low | Continuously updated competitive programming benchmark | rolling | pass@1 (%) |
| MMLU-Pro | knowledge | medium | Massive Multitask Language Understanding with harder questions | 12,032 | Accuracy (%) |
| SWE-bench Pro | coding | low | Next-generation software engineering benchmark by Scale AI with harder tasks | 300 | % of issues resolved |
| SWE-bench Verified | coding | high | Gold-standard software engineering benchmark with human-verified tasks | 500 | % of issues resolved |
| tau-bench | agentic | low | Agent benchmark for tool-augmented LLMs | 200 | Task completion rate |
| WebVoyager | web | medium | Web navigation and task completion benchmark | 643 | Task completion rate |
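
The coding benchmarks above (HumanEval, LiveCodeBench) report pass@1: the fraction of problems for which a generated solution passes all unit tests. When more than one sample is drawn per problem, the standard unbiased estimator introduced alongside HumanEval (Chen et al., 2021) is pass@k = 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them pass. A minimal sketch in Python; the function name, variable names, and example numbers are illustrative, not taken from any benchmark's reference implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for one problem
    c: samples that pass all unit tests
    k: attempt budget being scored
    """
    if n - c < k:
        # Too few failing samples to fill a size-k subset,
        # so every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level pass@1 is the mean over problems.
# With n == k == 1 this reduces to the plain pass rate.
results = [(10, 3), (10, 0), (10, 10)]  # (n, c) per problem, illustrative
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.3f}")  # (0.3 + 0.0 + 1.0) / 3 = 0.433
```

The remaining metrics in the table (accuracy, success rate, task completion rate, percentage of issues resolved) are plain proportions over tasks, so no such estimator is needed.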