A benchmark designed to be impossible to memorize.
Most LLM benchmarks suffer from data contamination — models have seen the test questions during training. Moreau Arena solves this by presenting a novel game with mechanics that don't exist in any training corpus. LLMs must reason from first principles using only the rules provided in the prompt.
Select from 14 animal types, each with a unique passive ability and two active abilities. Animals create asymmetric matchups: no single choice dominates all others.
Distribute 20 points across HP, ATK, SPD, and WIL (minimum 1 each). The allocation determines max health, damage, dodge chance, and ability resistance.
The deterministic engine runs tick-by-tick combat. Faster creatures attack first. Abilities proc based on WIL. Matches end when one creature reaches 0 HP.
Each matchup is a best-of-7 series to reduce variance. Agents play against all opponents in round-robin format. Rankings use Bradley-Terry scoring.
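The tournament structure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual harness: `play_game` is a hypothetical callable that runs one match and returns the winner's name, and the sketch assumes a series stops as soon as one side has the 4 wins needed to take a best-of-7 (the source does not say whether all 7 games are always played).

```python
from itertools import combinations


def play_series(play_game, a, b, best_of=7):
    """Play games between a and b until one side wins the series.

    play_game(a, b) is a hypothetical callable returning the winner's name.
    Assumption: the series ends once a side has a majority of wins.
    """
    needed = best_of // 2 + 1  # 4 wins for a best-of-7
    wins = {a: 0, b: 0}
    while max(wins.values()) < needed:
        winner = play_game(a, b)
        if winner not in wins:
            raise ValueError(f"unexpected winner: {winner!r}")
        wins[winner] += 1
    return wins[a], wins[b]


def round_robin(agents, play_game, best_of=7):
    """Run one series for every unordered pair of agents."""
    results = {}
    for a, b in combinations(agents, 2):
        results[(a, b)] = play_series(play_game, a, b, best_of)
    return results
```

With `n` agents this schedules `n * (n - 1) / 2` series; the per-game outcomes are what a Bradley-Terry fit would consume.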
14 animals with unique abilities that create non-transitive dynamics.
| Animal | Passive | Ability 1 | Ability 2 | Archetype |
|---|---|---|---|---|
| Bear | Fury Protocol: +50% ATK when HP < 50%. Duration: 3 ticks | Berserker Rage (3.5% per tick): +60% ATK, −40% dodge. Duration: 3 ticks | Last Stand (3.5% per tick): +100% ATK when HP < 15%. One-time use | Tank/Bruiser |
| Tiger | Ambush Wiring: First attack +100% dmg. Requires SPD > opponent | Pounce (4.5% per tick): +70% ATK, skip opponent attack | Hamstring (4.5% per tick): −55% SPD, −10% dodge. Duration: 4 ticks | Assassin |
| Wolf | Pack Sense: Passive scaling damage | Pack Howl (4.5% per tick): +30% ATK. Duration: 4 ticks | Rend (4.5% per tick): 5% max HP DoT. Duration: 3 ticks | Scaling DPS |
| Buffalo | Thick Hide: First hit −50% damage | Thick Hide (Active, 4.5% per tick): Blocks all damage. Duration: 1 tick | Iron Will (3.5% per tick): Heal 12% max HP. One-time use | Sustain Tank |
| Boar | Charge: First attack +50% damage | Stampede (4.5% per tick): +50% ATK, skip opponent attack | Gore (3.5% per tick): −40% ATK, ignores dodge | Berserker |
| Monkey | Primate Cortex: Enhanced cognitive ability | Chaos Strike (4.5% per tick): 0.8×–2.2× damage multiplier | Mimic (3.5% per tick): Copy 75% of opponent ability. Cannot copy: Iron Will, Last Stand, Mimic | Trickster |
| Crocodile | Death Roll (Passive): Grapple damage bonus | Death Roll (Active, 4.5% per tick): Grapple attack | Thick Scales (4.5% per tick): Damage reduction. Duration: 2 ticks | Grappler |
| Eagle | Aerial Strike: Aerial advantage in combat | Dive (3.5% per tick): +100% ATK, ignores dodge | Keen Eye (4.5% per tick): +20% dodge chance. Duration: 3 ticks | Burst DPS |
| Snake | Venom Glands: Passive poison on attacks | Venom (4.5% per tick): 3% max HP DoT, stacks ×3. Duration: 3 ticks per stack | Coil (4.5% per tick): Guaranteed dodge. Duration: 1 tick | DoT |
| Raven | Omen: Foresight passive | Shadow Clone (4.5% per tick): Creates a shadow clone. One-time use | Curse (4.5% per tick): Debuff opponent. Duration: 3 ticks | Utility |
| Shark | Blood Frenzy (Passive): Bonus damage vs low-HP targets | Blood Frenzy (Active, 3.5% per tick): Enhanced finishing damage | Bite (4.5% per tick): Sustained damage. Duration: 2 ticks | Finisher |
| Owl | Night Vision: Enhanced perception in combat | Foresight (4.5% per tick): Predict opponent actions. Duration: 2 ticks | Silent Strike (4.5% per tick): Precision attack | Precision |
| Fox | Cunning: Passive evasion bonus | Evasion (4.5% per tick): +50% dodge chance. Duration: 3 ticks | Trick (4.5% per tick): Negates opponent's next proc. Duration: 1 tick | Trickster |
| Scorpion | Paralytic Sting: Passive stun chance on attacks | Sting (4.5% per tick): Skip opponent attack | Exoskeleton (4.5% per tick): Block 15% damage. Duration: 1 tick | Poison Tank |
HP determines maximum health: max_hp = 50 + hp * 10. Range: 60–220 HP, corresponding to 1 to 17 points (each of the other three stats needs at least 1). Higher HP means more damage absorbed before death.
ATK determines base damage: base_dmg = atk - 1. Range: 0–16 damage per tick. Damage is monotonic in ATK: more ATK always means more damage.
SPD determines dodge chance: dodge = spd * 2.5%. Range: 2.5%–42.5%. It also affects attack order: faster creatures strike first.
WIL determines ability resistance: resist = wil * 3.3%. Range: 3.3%–56.1%. It affects how often abilities proc and how often opponent abilities are resisted.
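The four stat formulas above can be checked with a small Python sketch. The `Build` class and `derived_stats` function are hypothetical names introduced for illustration; the sketch also assumes that all 20 points must be spent, which the rules imply but do not state outright.

```python
from dataclasses import dataclass

TOTAL_POINTS = 20  # points to distribute, per the rules
MIN_PER_STAT = 1   # minimum points in each stat, per the rules


@dataclass(frozen=True)
class Build:
    """A stat allocation (hypothetical helper, not the engine's own type)."""
    hp: int
    atk: int
    spd: int
    wil: int

    def __post_init__(self):
        stats = (self.hp, self.atk, self.spd, self.wil)
        if any(s < MIN_PER_STAT for s in stats):
            raise ValueError("each stat needs at least 1 point")
        # Assumption: the full budget must be spent.
        if sum(stats) != TOTAL_POINTS:
            raise ValueError("stats must sum to 20")


def derived_stats(build):
    """Apply the published formulas; dodge and resist are fractions."""
    return {
        "max_hp": 50 + build.hp * 10,
        "base_dmg": build.atk - 1,
        "dodge": build.spd * 0.025,
        "resist": build.wil * 0.033,
    }
```

For example, `derived_stats(Build(hp=17, atk=1, spd=1, wil=1))` gives the extremes quoted above: 220 max HP and 0 base damage.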
Moreau Arena uses Bradley-Terry (BT) scoring rather than Elo for rankings. BT handles non-transitive dominance structures (where A beats B, B beats C, but C beats A) that break traditional rating systems. Scores are normalized to [0, 1] with bootstrap confidence intervals (N=1000) to quantify uncertainty.
Elo ratings are also computed for reference but should not be used for definitive rankings due to path-dependency issues in non-transitive environments.
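To make the scoring concrete, here is a minimal Bradley-Terry sketch in plain Python. It is not the project's implementation: it uses the standard minorization-maximization (MM) update, assumes that "normalized to [0, 1]" means strengths are rescaled to sum to 1, and assumes the bootstrap resamples individual games with replacement. The function names and the `(winner, loser)` input format are hypothetical.

```python
import random


def fit_bradley_terry(games, players=None, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    Returns a dict of strengths that sum to 1, so each lies in [0, 1].
    """
    if players is None:
        players = sorted({name for game in games for name in game})
    wins = {x: 0 for x in players}
    played = {}  # games played per unordered pair
    for winner, loser in games:
        wins[winner] += 1
        pair = frozenset((winner, loser))
        played[pair] = played.get(pair, 0) + 1

    p = {x: 1.0 / len(players) for x in players}
    for _ in range(iters):
        new = {}
        for i in players:
            denom = 0.0
            for j in players:
                n_ij = played.get(frozenset((i, j)), 0)
                if j != i and n_ij:
                    denom += n_ij / (p[i] + p[j])
            # A player absent from this sample keeps its previous value.
            new[i] = wins[i] / denom if denom else p[i]
        total = sum(new.values())
        p = {x: v / total for x, v in new.items()}
    return p


def bootstrap_ci(games, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap intervals: resample games, refit, take quantiles."""
    rng = random.Random(seed)
    players = sorted({name for game in games for name in game})
    samples = {x: [] for x in players}
    for _ in range(n_boot):
        resample = rng.choices(games, k=len(games))
        fit = fit_bradley_terry(resample, players)
        for x in players:
            samples[x].append(fit[x])
    intervals = {}
    for x, values in samples.items():
        values.sort()
        lo = values[int((alpha / 2) * n_boot)]
        hi = values[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
        intervals[x] = (lo, hi)
    return intervals
```

On a symmetric cycle where A beats B 3-1, B beats C 3-1, and C beats A 3-1, the fit assigns all three the same strength, which is the kind of non-transitive structure the paragraph above refers to.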