Moreau Arena

A contamination-resistant benchmark for evaluating LLM strategic reasoning through creature combat simulation.

• 13 Agents Tested
• 14 Animal Types
• 89 Invariant Tests
• <25% RNG Variance

The Prompt Engineering Effect

The same LLMs, the same simulator, the same opponents — but a redesigned prompt flips the leaderboard.

Track A · T001 · Minimal Prompt
LLM avg win rate: 37.5%
  1. SmartAgent · 1.000
  2. HighVarianceAgent · 0.954
  3. ConservativeAgent · 0.643
  4. gemini-3-flash-preview · 0.503
  5. gemini-3.1-pro-preview · 0.455
  6. grok-4-1-fast-reasoning · 0.437
Track B · T002 · Engineered Prompt
LLM avg win rate: 89.75%
  1. gpt-5.2-codex · 1.000
  2. gemini-3-flash-preview · 0.679
  3. grok-4-1-fast-reasoning · 0.650
  4. claude-opus-4-6 · 0.252
  5. claude-sonnet-4-6 · 0.252
  6. gpt-5.2 · 0.252

Why Moreau Arena?

Contamination Resistant

Novel game mechanics that don't exist in any training corpus. LLMs can't memorize solutions — they must reason from the rules provided.

Non-Transitive Dynamics

Rock-paper-scissors dominance structures emerge naturally. Bradley-Terry scoring handles intransitivity that Elo cannot.
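To illustrate why Bradley-Terry handles cycles gracefully, here is a minimal sketch of fitting Bradley-Terry strengths from a pairwise win matrix with the classic MM (minorization-maximization) update. The win counts below are hypothetical, and this is not the arena's actual scoring pipeline; it only shows the key property: a symmetric rock-paper-scissors cycle resolves to equal strengths rather than oscillating ratings.

```python
# Hypothetical win counts forming a rock-paper-scissors cycle:
# wins[i][j] = number of matches agent i beat agent j.
wins = [
    [0, 9, 1],   # A beats B often, loses to C
    [1, 0, 9],   # B beats C often, loses to A
    [9, 1, 0],   # C beats A often, loses to B
]
n = len(wins)
strength = [1.0] * n  # Bradley-Terry strength parameters

for _ in range(200):
    new = []
    for i in range(n):
        # Total wins of agent i.
        num = sum(wins[i][j] for j in range(n) if j != i)
        # MM denominator: matches played, weighted by current strengths.
        den = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                  for j in range(n) if j != i)
        new.append(num / den)
    total = sum(new)
    strength = [s * n / total for s in new]  # normalize (identifiability)

print([round(s, 3) for s in strength])  # → [1.0, 1.0, 1.0]
```

The perfectly symmetric cycle yields equal strengths for all three agents, which is the correct answer; an Elo-style sequential update on the same data would drift depending on match order.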

Fully Deterministic

Given the same seed, the same combat plays out identically. RNG contributes <25% of outcome variance — strategy dominates.
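The replay property above can be sketched with an isolated, seeded generator. This toy combat loop is hypothetical (the arena's real loop is not shown); it only demonstrates the seed-to-outcome determinism the benchmark relies on.

```python
import random

def simulate_combat(seed, rounds=5):
    """Toy stand-in for a seeded combat: same seed, identical outcome.
    Hypothetical helper; not the arena's actual combat implementation."""
    rng = random.Random(seed)  # isolated generator, no shared global state
    return [rng.randint(1, 6) for _ in range(rounds)]

# Replaying with the same seed reproduces the fight exactly.
assert simulate_combat(42) == simulate_combat(42)
```

Using a per-fight `random.Random(seed)` instance rather than the module-level functions is what makes replays independent of anything else the process has done.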

Multi-Track Evaluation

Track A tests fixed strategy. Track B tests adaptation. Track C tests meta-conditioning. Track D tests tool use. Each isolates a different capability.

Quick Fight

Format: animal hp atk spd wil (stats sum to 20, each stat ≥ 1)
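A spec in that format can be parsed and validated in a few lines. This helper is an illustrative assumption, not the simulator's own parser; only the field names and the two constraints (sum to 20, each ≥ 1) come from the stated format.

```python
def parse_fighter(spec: str) -> dict:
    """Parse an 'animal hp atk spd wil' fighter spec.
    Hypothetical validator; field names follow the stated format."""
    parts = spec.split()
    if len(parts) != 5:
        raise ValueError("expected: animal hp atk spd wil")
    animal, *raw = parts
    hp, atk, spd, wil = (int(s) for s in raw)
    if min(hp, atk, spd, wil) < 1:
        raise ValueError("each stat must be >= 1")
    if hp + atk + spd + wil != 20:
        raise ValueError("stats must sum to 20")
    return {"animal": animal, "hp": hp, "atk": atk, "spd": spd, "wil": wil}

print(parse_fighter("wolf 7 6 5 2"))
# → {'animal': 'wolf', 'hp': 7, 'atk': 6, 'spd': 5, 'wil': 2}
```

Rejecting malformed specs before simulation keeps invalid stat lines out of the combat loop entirely.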

Challenge the Leaderboard

Think you can build a better fighter? Describe your strategy and see where you'd rank against all 13 tournament agents.
