## Overview
MMLU (Massive Multitask Language Understanding) is one of the most widely used benchmarks for evaluating large language models. It tests a model's knowledge and reasoning abilities across 57 different subjects.
## What It Measures
MMLU evaluates models with four-option multiple-choice questions grouped into four broad categories:
- Humanities: History, Philosophy, Law
- STEM: Mathematics, Physics, Computer Science
- Social Sciences: Economics, Psychology, Sociology
- Other: Professional exams, General knowledge
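To make the format concrete, here is a minimal sketch of the standard MMLU-style prompt: a subject header, the question, four lettered choices, and an `Answer:` cue the model must complete. The helper name and the example question are illustrative, not part of the official harness or dataset.

```python
# Sketch of the conventional MMLU prompt layout (zero-shot).
# The example question below is made up for illustration.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_mmlu_prompt(subject: str, question: str, choices: list[str]) -> str:
    """Build an MMLU-style prompt for one four-option question."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    body = question + "\n"
    for label, choice in zip(CHOICE_LABELS, choices):
        body += f"{label}. {choice}\n"
    return header + body + "Answer:"

prompt = format_mmlu_prompt(
    "high school physics",
    "What is the SI unit of force?",
    ["Joule", "Newton", "Watt", "Pascal"],
)
print(prompt)
```

In practice, evaluation harnesses often prepend a few solved examples (few-shot prompting) in this same layout before the target question.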
## Scoring
Models are scored on the percentage of questions answered correctly; since each question has four options, random guessing scores about 25%.
| Score Range | Interpretation |
|---|---|
| 85%+ | Excellent (PhD-level) |
| 70-85% | Good (Graduate-level) |
| 50-70% | Fair (Undergraduate-level) |
| <50% | Needs improvement |
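The scoring above can be sketched in a few lines: compute accuracy per subject, then average across subjects. This is a simplified illustration, not the official evaluation code; note that implementations differ on whether they average per subject (macro) or over all questions pooled (micro), so reported scores can vary slightly.

```python
# Minimal sketch of MMLU-style scoring: per-subject accuracy plus an
# unweighted (macro) average across subjects. The data is made up.

from collections import defaultdict

def mmlu_score(results):
    """results: list of (subject, predicted_letter, gold_letter) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, pred, gold in results:
        total[subject] += 1
        correct[subject] += int(pred == gold)
    per_subject = {s: correct[s] / total[s] for s in total}
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, macro

results = [
    ("physics", "B", "B"), ("physics", "A", "C"),
    ("law", "D", "D"), ("law", "B", "B"),
]
per_subject, macro = mmlu_score(results)
print(per_subject)                    # {'physics': 0.5, 'law': 1.0}
print(f"macro-average: {macro:.2%}")  # macro-average: 75.00%
```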
## Top Performers (2024)
- GPT-4o: ~90%
- Claude 3.5 Sonnet: ~88%
- Qwen2.5-72B: ~85%
- Llama 3.1 70B: ~82%
## Why It Matters
MMLU is important because:
- Tests broad knowledge, not just language fluency
- Covers real-world subjects relevant to users
- Widely adopted, enabling fair comparison
- Correlates reasonably well with overall model capability
## Limitations
- Primarily English-focused
- Multiple-choice format only
- Static dataset, so questions can leak into training data (contamination)
- Doesn't test multi-step reasoning or free-form generation
## Related Benchmarks
- MMLU-Pro: Harder variant with ten answer options and more reasoning-heavy questions
- ARC: Science reasoning questions
- HellaSwag: Commonsense reasoning