A Non-Transitive Benchmark for Evaluating Strategic Decision-Making in Large Language Models
We introduce Moreau Arena, a contamination-resistant benchmark for evaluating strategic reasoning in large language models. The benchmark centers on a creature combat simulator with 14 animal types and a 4-stat allocation system that produces non-transitive dominance structures. Because the game mechanics are novel and do not appear in any training corpus, LLMs cannot rely on memorized solutions and must reason from first principles.
We evaluate 8 LLMs across two tournament protocols: T001 (minimal prompt, fixed builds) and T002 (engineered prompt with formulas, meta-information, structured output, and adaptation). Results show a dramatic performance shift: LLM average win rate rises from 37.5% in T001 to 89.75% in T002, with the leaderboard inverting completely. This demonstrates that prompt engineering — not model capability — is the primary determinant of strategic performance in novel domains.
We provide ablation studies isolating the contribution of each T002 component, a Prompt Sensitivity Index (PSI) measuring ranking stability under paraphrase, and a characterization of 3-cycle intransitive dynamics in the agent population. The full simulator, tournament data, and 89 invariant tests are released to enable reproducible evaluation of future models.
The same models tested in T001 (minimal prompt) and T002 (engineered prompt) produce completely inverted rankings. In T001, the algorithmic SmartAgent dominates all LLMs. In T002, LLMs dominate all baselines. The rank-order correlation between T001 and T002 is strongly negative.
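A full leaderboard inversion corresponds to a Spearman rank correlation of -1 between the two tracks. A minimal sketch of the computation, with illustrative agent names and ranks rather than the paper's actual leaderboard:

```python
# Sketch: Spearman rank correlation between two tournament rankings.
# Agent names and ranks below are illustrative, not the paper's data.
def spearman(rank_a, rank_b):
    """Spearman rho for two rankings given as {agent: rank} dicts (no ties)."""
    agents = sorted(rank_a)
    n = len(agents)
    d2 = sum((rank_a[a] - rank_b[a]) ** 2 for a in agents)
    return 1 - 6 * d2 / (n * (n**2 - 1))

# A fully inverted leaderboard yields rho = -1.
t001 = {"A": 1, "B": 2, "C": 3, "D": 4}
t002 = {"A": 4, "B": 3, "C": 2, "D": 1}
print(spearman(t001, t002))  # -1.0
```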
The T001→T002 performance jump is larger than any difference between models within the same track. This suggests that for novel strategic domains, how you ask matters more than which model you ask.
Multiple 3-cycles are detected in the agent population: A beats B, B beats C, C beats A. This validates the use of Bradley-Terry scoring over Elo and demonstrates that the benchmark captures the rock-paper-scissors dynamics it was designed to test.
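Detecting such cycles reduces to scanning every unordered triple of agents for a directed cycle in the head-to-head "beats" relation. A minimal sketch, using an illustrative rock-paper-scissors triple rather than the paper's actual agents:

```python
from itertools import combinations

def find_3_cycles(beats):
    """Find all 3-cycles in a pairwise 'beats' relation.

    `beats` maps each agent to the set of agents it defeats head-to-head
    (e.g. by majority of games won). Returns cycles as (a, b, c) with
    a beats b, b beats c, and c beats a.
    """
    cycles = []
    for a, b, c in combinations(sorted(beats), 3):
        # Each triple admits two possible cycle orientations.
        for x, y, z in ((a, b, c), (a, c, b)):
            if y in beats[x] and z in beats[y] and x in beats[z]:
                cycles.append((x, y, z))
    return cycles

# Illustrative intransitive triple (not the paper's agent population):
beats = {"bear": {"wolf"}, "wolf": {"eagle"}, "eagle": {"bear"}}
print(find_3_cycles(beats))  # [('bear', 'wolf', 'eagle')]
```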
The variance budget analysis shows RNG contributes <25% of outcome variance. The remaining >75% is attributable to strategic choices. Combined with novel mechanics, this confirms the benchmark measures reasoning, not memorization.
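One way to realize such a variance budget is the law of total variance over repeated seeded replays: within-matchup variance is attributable to RNG, between-matchup variance to strategic choices. A sketch under assumptions (the function name, data layout, and equal-sized replay groups are ours, not the paper's actual analysis code):

```python
import statistics

def variance_budget(outcomes):
    """Split win/loss variance into RNG and strategic shares.

    `outcomes` maps each matchup to a list of win indicators (1/0) from
    repeated seeded replays of the same pair of builds; replay groups
    are assumed equal-sized so the simple mean of within-group
    variances applies.
    """
    all_vals = [v for runs in outcomes.values() for v in runs]
    total = statistics.pvariance(all_vals)
    # Within-matchup (RNG) component: average variance across replays.
    within = statistics.fmean(statistics.pvariance(r) for r in outcomes.values())
    return {"rng_share": within / total, "strategic_share": 1 - within / total}

# Illustrative extreme: outcomes fully determined by the matchup,
# so all variance is strategic and none is RNG.
shares = variance_budget({"A_vs_B": [1, 1, 1, 1], "C_vs_D": [0, 0, 0, 0]})
print(shares)  # {'rng_share': 0.0, 'strategic_share': 1.0}
```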
@article{moreau-arena-2026,
title = {Moreau Arena: A Non-Transitive Benchmark for
Evaluating Strategic Decision-Making in LLMs},
year = {2026},
note = {Config hash: b7ec5885, 89 invariant tests,
13 agents across 2 tournament protocols}
}
The full benchmark, including simulator, tournament data, and all 89 invariant tests, is available from the repository. No GPU required.
git clone https://github.com/moreau-arena/moreau-arena.git
cd moreau-arena
pip install -r requirements.txt
The core simulator config is frozen and hash-verified. Any modification will cause invariant tests to fail. Verify the config hash matches the value embedded in the paper:
# Verify config hash (must match paper)
python -c "import hashlib, pathlib; print(hashlib.sha256(pathlib.Path('simulator/config.json').read_bytes()).hexdigest())"
# Expected output:
# b7ec588583135ad640eba438f29ce45c10307a88dc426decd31126371bb60534
89 tests verify determinism, stat constraints, combat mechanics, scoring, and data integrity. All must pass for valid reproduction.
python -m pytest tests/test_invariants.py -v
# Expected: 89 passed
Bradley-Terry scores and Elo ratings can be recomputed directly from the frozen JSONL tournament data. Results should match the paper exactly.
# Recompute BT rankings for Track A (T001)
python -m analysis.bt_ranking data/tournament_001/results.jsonl
# Recompute BT rankings for Track B (T002)
python -m analysis.bt_ranking data/tournament_002/results.jsonl
# Regenerate full results summary across both tracks
python -m analysis.bt_ranking data/tournament_001/results.jsonl data/tournament_002/results.jsonl
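For readers without the repository at hand, the Bradley-Terry computation can be sketched with the standard minorization-maximization iteration. This is an independent re-implementation under assumptions: the `(winner, loser)` pair format is ours, and the real JSONL field names and any regularization in `analysis.bt_ranking` may differ.

```python
import math
from collections import defaultdict

def bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths via the standard MM iteration.

    `matches` is a list of (winner, loser) pairs. Returns log-strengths
    normalized to mean zero.
    """
    wins = defaultdict(float)
    games = defaultdict(lambda: defaultdict(float))
    players = set()
    for w, l in matches:
        wins[w] += 1
        games[w][l] += 1
        games[l][w] += 1
        players |= {w, l}
    p = {a: 1.0 for a in players}
    for _ in range(iters):
        # MM update: p_a = W_a / sum_b n_ab / (p_a + p_b)
        new = {}
        for a in players:
            denom = sum(n / (p[a] + p[b]) for b, n in games[a].items())
            new[a] = wins[a] / denom if denom else p[a]
        s = sum(new.values())
        p = {a: v * len(players) / s for a, v in new.items()}
    mean_log = sum(math.log(v) for v in p.values()) / len(p)
    return {a: math.log(v) - mean_log for a, v in p.items()}

# Illustrative intransitive record: A beats B twice, B beats C twice,
# C beats A once.
scores = bradley_terry([("A", "B"), ("A", "B"), ("B", "C"), ("B", "C"), ("C", "A")])
```

Note that intransitive populations are exactly where Bradley-Terry's single-scale summary loses information, which is why the paper reports the 3-cycles separately.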
A self-contained verification script is included in the handoff package. It runs all integrity checks in sequence: config hash verification, invariant test suite, BT score recomputation, and cross-track consistency checks.
# Run the full verification pipeline
python scripts/verify_handoff.py
# Run the simulator directly
python -m simulator --build1 "bear 3 14 2 1" --build2 "buffalo 8 6 4 2" --games 1000
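The build strings above follow an `"<animal> <s1> <s2> <s3> <s4>"` shape, and both examples sum to 20 stat points. A parsing sketch under those inferred assumptions (the 20-point budget and validation rules are read off the examples, not confirmed against the simulator's spec):

```python
def parse_build(spec, budget=20):
    """Parse a CLI build string like "bear 3 14 2 1".

    Assumes 4 stat values summing to a fixed budget; both values are
    inferred from the example invocations, not the simulator's spec.
    """
    animal, *stats = spec.split()
    stats = [int(s) for s in stats]
    if len(stats) != 4:
        raise ValueError("expected exactly 4 stat values")
    if sum(stats) != budget:
        raise ValueError(f"stats must sum to {budget}, got {sum(stats)}")
    return {"animal": animal, "stats": stats}

print(parse_build("bear 3 14 2 1"))
# {'animal': 'bear', 'stats': [3, 14, 2, 1]}
```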