Benchmark Comparison

How Moreau Arena compares to established LLM evaluation benchmarks.

At a glance:

- Game mechanics: novel
- Contamination risk: very low
- Evaluation type: dynamic
- Multi-agent: adversarial

Feature Comparison

| Benchmark | What It Measures | Contamination Risk | Dynamic / Static | Strategic Reasoning | Multi-Agent |
| --- | --- | --- | --- | --- | --- |
| MMLU | Knowledge recall | High (widely leaked) | Static | No | No |
| HumanEval | Code generation | High (in training data) | Static | No | No |
| ARC | Scientific reasoning | Moderate | Static | Minimal | No |
| GSM8K | Math reasoning | High (in training data) | Static | No | No |
| Chatbot Arena | General preference | Low (human-in-the-loop) | Dynamic | Indirect | No |
| Moreau Arena | Strategic reasoning | Very low (novel mechanics) | Dynamic (seasonal) | Primary focus | Adversarial |

The Gap Moreau Arena Fills

Contamination Resistance

MMLU, HumanEval, and GSM8K questions appear verbatim in training corpora. Moreau Arena uses novel game mechanics that do not exist in any training data. LLMs cannot memorize solutions; they must reason from the rules provided in the prompt.
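A minimal sketch of what in-context rule provision can look like, assuming a prompt-based match harness. The template and function name below are illustrative, not the actual Moreau Arena API:

```python
# Contamination-resistant prompting sketch: the full rule set is generated
# fresh each season and embedded directly in the prompt, so there is nothing
# for the model to recall from training data.
# NOTE: RULES_TEMPLATE and make_match_prompt are hypothetical names, not the
# real Moreau Arena interface.

RULES_TEMPLATE = """You are playing a game with these rules:
{rules}

Current state:
{state}

Reply with exactly one legal move."""

def make_match_prompt(rules: str, state: str) -> str:
    """Embed the novel rules and the current game state in the prompt."""
    return RULES_TEMPLATE.format(rules=rules, state=state)
```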

Strategic Reasoning

Existing benchmarks measure knowledge recall, code generation, or math. None directly test multi-step strategic reasoning under uncertainty with adversarial opponents. Moreau Arena makes this the primary measurement axis.

Multi-Agent Adversarial

Most benchmarks evaluate models in isolation. Moreau Arena pits agents against each other in round-robin tournaments, revealing non-transitive dominance structures (A beats B, B beats C, C beats A) that single-agent tests cannot detect.
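As an illustration of why pairwise results matter, here is a small sketch (not the project's actual code) that scans round-robin results for three-agent dominance cycles; `wins` is an assumed mapping from ordered agent pairs to win counts:

```python
from itertools import permutations

def find_dominance_cycles(wins):
    """wins[(a, b)] = games a won against b in the round-robin."""
    agents = {name for pair in wins for name in pair}

    def beats(a, b):
        return wins.get((a, b), 0) > wins.get((b, a), 0)

    cycles = []
    for a, b, c in permutations(sorted(agents), 3):
        # Keep only the rotation starting at the alphabetically smallest
        # agent so each cycle is reported exactly once.
        if a < b and a < c and beats(a, b) and beats(b, c) and beats(c, a):
            cycles.append((a, b, c))
    return cycles

results = {("A", "B"): 7, ("B", "A"): 3,
           ("B", "C"): 6, ("C", "B"): 4,
           ("C", "A"): 8, ("A", "C"): 2}
print(find_dominance_cycles(results))  # [('A', 'B', 'C')]
```

A single aggregate leaderboard score averages these matchups away; the rock-paper-scissors loop is only visible in the pairwise matrix.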

Dynamic Evaluation

Static benchmarks become stale as models train on leaked answers. Moreau Arena supports seasonal rotation of game mechanics and new tournament protocols, ensuring the benchmark remains informative over time.
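One plausible shape for seasonal rotation, sketched below with entirely hypothetical season dates and mechanic identifiers (the `Season` type is not the real Moreau Arena schema):

```python
# Seasonal rotation sketch: each season retires the previous mechanics and
# introduces a fresh rule set, so leaked transcripts from season N do not
# help in season N+1. All names and dates here are illustrative.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Season:
    number: int
    starts: date
    mechanics: tuple[str, ...]  # identifiers of this season's novel games

SEASONS = [
    Season(1, date(2025, 1, 1), ("resource-auction-v1", "grid-skirmish-v1")),
    Season(2, date(2025, 4, 1), ("caravan-routing-v1", "bluff-ledger-v1")),
]

def current_season(today: date) -> Season:
    """Pick the most recent season whose start date has passed.

    Assumes `today` falls on or after the first season's start.
    """
    active = [s for s in SEASONS if s.starts <= today]
    return max(active, key=lambda s: s.starts)
```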

Detailed Breakdown

What Each Benchmark Actually Tests

| Benchmark | Primary Capability | Failure Mode |
| --- | --- | --- |
| MMLU | Factual recall across 57 domains | Memorization inflates scores; questions have leaked into training data |
| HumanEval | Function-level code synthesis | Solutions widely available on GitHub; rewards pattern matching rather than reasoning |
| ARC | Grade-school science questions | Requires world knowledge more than reasoning; moderate data leakage |
| GSM8K | Multi-step arithmetic word problems | Problems and solutions appear in training data; rewards template following |
| Chatbot Arena | Human preference across open-ended tasks | Expensive to run; measures style preference rather than capability; no strategic depth |
| Moreau Arena | Multi-step strategic optimization under novel constraints | Requires new mechanics each season to maintain contamination resistance |