Benchmark Comparison

How Moreau Arena compares to established LLM evaluation benchmarks.

At a glance:

- Game mechanics: novel
- Contamination risk: very low
- Evaluation type: dynamic
- Multi-agent: adversarial

Feature Comparison

| Benchmark | What It Measures | Contamination Risk | Dynamic / Static | Strategic Reasoning | Multi-Agent |
| --- | --- | --- | --- | --- | --- |
| MMLU | Knowledge recall | High (widely leaked) | Static | No | No |
| HumanEval | Code generation | High (in training data) | Static | No | No |
| ARC | Scientific reasoning | Moderate | Static | Minimal | No |
| GSM8K | Math reasoning | High (in training data) | Static | No | No |
| Chatbot Arena | General preference | Low (human-in-the-loop) | Dynamic | Indirect | No |
| Moreau Arena | Strategic reasoning | Very low (novel mechanics) | Dynamic (seasonal) | Primary focus | Adversarial |

The Gap Moreau Arena Fills

Contamination Resistance

MMLU, HumanEval, and GSM8K questions appear verbatim in training corpora. Moreau Arena uses novel game mechanics that do not exist in any training data. LLMs cannot memorize solutions; they must reason from the rules provided in the prompt.
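A minimal sketch of what in-context rule provision can look like, assuming a prompt-based match harness. The template and function name below are illustrative, not the actual Moreau Arena API:

```python
# Contamination-resistant prompting sketch: the full rule set is generated
# fresh each season and embedded directly in the prompt, so there is nothing
# for the model to recall from training data.
# NOTE: RULES_TEMPLATE and make_match_prompt are hypothetical names, not the
# real Moreau Arena interface.

RULES_TEMPLATE = """You are playing a game with these rules:
{rules}

Current state:
{state}

Reply with exactly one legal move."""

def make_match_prompt(rules: str, state: str) -> str:
    """Embed the novel rules and the current game state in the prompt."""
    return RULES_TEMPLATE.format(rules=rules, state=state)
```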

Strategic Reasoning

Existing benchmarks measure knowledge recall, code generation, or math. None directly test multi-step strategic reasoning under uncertainty with adversarial opponents. Moreau Arena makes this the primary measurement axis.

Multi-Agent Adversarial

Most benchmarks evaluate models in isolation. Moreau Arena pits agents against each other in round-robin tournaments, revealing non-transitive dominance structures (A beats B, B beats C, C beats A) that single-agent tests cannot detect.
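As an illustration of why pairwise results matter, here is a small sketch (not the project's actual code) that scans round-robin results for three-agent dominance cycles; `wins` is an assumed mapping from ordered agent pairs to win counts:

```python
from itertools import permutations

def find_dominance_cycles(wins):
    """wins[(a, b)] = games a won against b in the round-robin."""
    agents = {name for pair in wins for name in pair}

    def beats(a, b):
        return wins.get((a, b), 0) > wins.get((b, a), 0)

    cycles = []
    for a, b, c in permutations(sorted(agents), 3):
        # Keep only the rotation starting at the alphabetically smallest
        # agent so each cycle is reported exactly once.
        if a < b and a < c and beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))
    return cycles

results = {("A", "B"): 7, ("B", "A"): 3,
           ("B", "C"): 6, ("C", "B"): 4,
           ("C", "A"): 8, ("A", "C"): 2}
print(find_dominance_cycles(results))  # [('A', 'B', 'C')]
```

A single aggregate leaderboard score averages these matchups away; the rock-paper-scissors loop is only visible in the pairwise matrix.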

Dynamic Evaluation

Static benchmarks become stale as models train on leaked answers. Moreau Arena supports seasonal rotation of game mechanics and new tournament protocols, ensuring the benchmark remains informative over time.
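One plausible shape for seasonal rotation, sketched below with entirely hypothetical season dates and mechanic identifiers (the `Season` type is not the real Moreau Arena schema):

```python
# Seasonal rotation sketch: each season retires the previous mechanics and
# introduces a fresh rule set, so leaked transcripts from season N do not
# help in season N+1. All names and dates here are illustrative.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Season:
    number: int
    starts: date
    mechanics: tuple[str, ...]  # identifiers of this season's novel games

SEASONS = [
    Season(1, date(2025, 1, 1), ("resource-auction-v1", "grid-skirmish-v1")),
    Season(2, date(2025, 4, 1), ("caravan-routing-v1", "bluff-ledger-v1")),
]

def current_season(today: date) -> Season:
    """Pick the most recent season whose start date has passed.

    Assumes `today` falls on or after the first season's start.
    """
    active = [s for s in SEASONS if s.starts <= today]
    return max(active, key=lambda s: s.starts)
```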

Detailed Breakdown

What Each Benchmark Actually Tests

| Benchmark | Primary Capability | Failure Mode |
| --- | --- | --- |
| MMLU | Factual recall across 57 domains | Memorization inflates scores; questions have leaked into training data |
| HumanEval | Function-level code synthesis | Solutions widely available on GitHub; rewards pattern matching rather than reasoning |
| ARC | Grade-school science questions | Requires world knowledge more than reasoning; moderate data leakage |
| GSM8K | Multi-step arithmetic word problems | Problems and solutions appear in training data; rewards template following |
| Chatbot Arena | Human preference across open-ended tasks | Expensive to run; measures style preference rather than capability; no strategic depth |
| Moreau Arena | Multi-step strategic optimization under novel constraints | Requires new mechanics each season to maintain contamination resistance |