Moreau Arena

Can your LLM outsmart the arena? A benchmark that tests strategic reasoning — not memorization.

View Leaderboard · Try the Simulator · Read the Paper
15 Agents Tested
3 Tournaments
89 Invariant Tests
<25% RNG Variance

The Prompt Engineering Effect

The same LLMs, the same simulator, the same opponents — three prompt designs reveal what actually drives LLM performance.

Season 1: The Arena Expands

The research phase validated the benchmark. Season 1 is now live with 91 completed tournament series across the public arena.

14 Animals
5 Quality Gates
182K Games Tested

The Frozen Model Effect

Without meta-context hints, three models (Gemini Flash, Gemini Pro, and Grok) independently converge on exactly the same build (Boar 8/8/3/1) in every single game, 2,358 in total. Zero adaptation, despite the loser-adapts mechanic.

Meanwhile, Codex-5.2 stays #1 and Opus jumps from #12 to #3: evidence that some models can derive strategy from formulas alone, while others need explicit examples to compete.

Why Moreau Arena?

Named after H.G. Wells' The Island of Doctor Moreau — where creatures were engineered to test the boundaries of nature.

Contamination Resistant

Novel game mechanics that don't exist in any training corpus. LLMs can't memorize solutions — they must reason from the rules provided.

Non-Transitive Dynamics

Rock-paper-scissors dominance structures emerge naturally. Bradley-Terry scoring handles intransitivity that Elo cannot.
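
To make this concrete, here is a minimal Bradley-Terry fit using the classic minorization-maximization update. The arena's actual scoring pipeline is not public, so the agent count and win counts below are hypothetical; the point is only that a perfectly cyclic record is fit gracefully rather than forced onto a single ladder.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] = number of games agent i beat agent j.
    Update: p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j),
    where W_i is i's total wins and n_ij the games between i and j.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for agent i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize strengths to sum to 1
    return p

# A perfectly cyclic (rock-paper-scissors) record: A > B, B > C, C > A.
cyclic = [
    [0, 9, 1],
    [1, 0, 9],
    [9, 1, 0],
]
strengths = bradley_terry(cyclic)
# The symmetric cycle yields near-equal strengths for all three agents,
# instead of the oscillating ratings a sequential Elo update would produce.
```
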

Fully Deterministic

Given the same seed, the same combat plays out identically. RNG contributes <25% of outcome variance — strategy dominates.
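
A sketch of what seeded determinism means in practice. The real combat resolver is not public, so `simulate` below is a hypothetical stand-in with made-up builds; the invariant being illustrated is that when all randomness flows from one seed, the same seed replays the same fight byte for byte.

```python
import random

def simulate(build_a, build_b, seed):
    """Toy combat loop: every random draw comes from one seeded RNG."""
    rng = random.Random(seed)  # the only source of randomness
    log = []
    hp_a, hp_b = build_a["hp"], build_b["hp"]
    while hp_a > 0 and hp_b > 0:
        roll = rng.random()
        if roll < 0.5:
            hp_b -= build_a["atk"]
            log.append(("A hits", roll))
        else:
            hp_a -= build_b["atk"]
            log.append(("B hits", roll))
    return ("A" if hp_b <= 0 else "B", log)

# Hypothetical builds, not the arena's real stat blocks.
boar = {"hp": 8, "atk": 8}
wolf = {"hp": 10, "atk": 6}

# Same seed, identical replay: any difference in outcomes across seeds
# can then be measured and attributed to strategy rather than luck.
replay_1 = simulate(boar, wolf, seed=42)
replay_2 = simulate(boar, wolf, seed=42)
```
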

Multi-Track Evaluation

Track A tests one-shot play. Track B adds formulas, meta-context, and adaptation. Track C keeps formulas and adaptation but removes meta hints, isolating how much strategy models can derive from mechanics alone.

What’s Next: Moreau Island

Moreau Arena tests strategic reasoning. Moreau Island extends that foundation into an alignment resilience framework: what happens as enforcement weakens, information becomes unreliable, and every choice carries a cost. The path runs Shore → Thicket → Caldera, measuring not just whether a model optimizes well, but how much of its policy survives stress.

Shore

Full enforcement, clear rules, structured feedback. This is the live foundation already partially implemented via Arena T002/T003.

Foundation Live

Thicket

Partial observability, uncertain enforcement, alliances, defections, and deception pressure.

Planned

Caldera

Extreme scarcity, mandatory aggression, triage, sacrifice, and pressure-tested value stability.

Planned

Quick Fight

Roster: Benchmark (6) · Season 1 (14)
Animals: bear, buffalo, boar, tiger, wolf, monkey, porcupine, scorpion, vulture, rhino, viper, fox, eagle, panther

Enter: name hp atk spd wil (stats must sum to 20)
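
A minimal validator for the input format described above: four stats (hp, atk, spd, wil) that must sum to 20. The parsing details (whitespace-separated fields, error messages) are assumptions; only the four stat names and the sum-to-20 rule come from the page.

```python
def parse_build(line):
    """Parse 'name hp atk spd wil' and enforce the sum-to-20 budget."""
    parts = line.split()
    name, stats = parts[0], [int(x) for x in parts[1:]]
    if len(stats) != 4:
        raise ValueError("expected: name hp atk spd wil")
    if sum(stats) != 20:
        raise ValueError(f"stats sum to {sum(stats)}, must be 20")
    return dict(zip(["hp", "atk", "spd", "wil"], stats), name=name)

# The frozen build from the research phase: Boar 8/8/3/1 (8+8+3+1 = 20).
build = parse_build("boar 8 8 3 1")
```
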

Challenge the Leaderboard

Think you can build a better fighter? Describe your strategy and see where you'd rank against all 15 tournament agents.
