AI Benchmarks
Evaluation benchmarks used to measure AI model performance, covering task counts, scoring methods, contamination risks, and what each benchmark actually tests. Short sketches of the pass@k scoring metric and of loading a benchmark's task set follow the table below.
| Category  | Description |
|-----------|-------------|
| agentic   | Multi-environment agent evaluation (OS, DB, web, games) |
| reasoning | American Invitational Mathematics Examination (AIME) problems |
| reasoning | Abstract reasoning benchmark testing novel pattern recognition |
| agentic   | General AI Assistants (GAIA) benchmark chaining web browsing, file parsing, and multi-document reasoning |
| knowledge | Graduate-level science questions (physics, chemistry, biology) |
| coding    | Code generation from docstrings (Python) |
| coding    | Continuously updated competitive programming benchmark |
| knowledge | Massive Multitask Language Understanding (MMLU) with harder questions |
| coding    | Next-generation software engineering benchmark by Scale AI with harder tasks |
| coding    | Gold-standard software engineering benchmark with human-verified tasks |
| agentic   | Agent benchmark for tool-augmented LLMs |
| web       | Web navigation and task completion benchmark |
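As an illustration of the scoring methods mentioned above, many code-generation benchmarks (HumanEval popularized the approach) report pass@k: the probability that at least one of k sampled completions passes a task's unit tests. The snippet below is a minimal sketch of the standard unbiased per-task estimator computed from n samples of which c pass; the function name and the toy numbers are illustrative, not taken from any benchmark's official harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task pass@k: chance that at least one of k completions,
    drawn without replacement from n samples of which c are correct, passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so every k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: 200 completions sampled for one task, 37 of them pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185, the plain per-sample pass rate
print(round(pass_at_k(n=200, c=37, k=10), 3))  # higher, since any of 10 draws may pass
```

The reported benchmark score is the mean of this per-task estimate across all tasks; estimating from n > k samples avoids the variance of simply checking whether any of the first k completions happened to pass.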
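Task counts and task contents are easiest to inspect by loading a benchmark's published dataset. The sketch below does this for SWE-bench Verified via the Hugging Face `datasets` library; the dataset ID and field names are assumptions based on the public release and should be checked against the actual repository before running.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset ID for SWE-bench Verified; verify before use.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(ds))  # SWE-bench Verified contains 500 human-validated instances

# Each instance pairs a real GitHub issue with the repository state to be patched.
example = ds[0]
print(example["instance_id"])               # task identifier (assumed field name)
print(example["repo"])                      # source repository (assumed field name)
print(example["problem_statement"][:200])   # issue text the model must resolve (assumed field name)
```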