Two primary Moreau documents live here: the Arena v3 benchmark paper and the Moreau Island vision paper.
Victor Stasiuc — Independent Researcher — March 2026 · ORCID 0009-0003-2064-0486
Zenodo DOIs are pending for both primary documents. Preprints are available below; the DOI badges will be added once the uploads are complete.
Empirical benchmark paper covering T001/T002/T003, 2,609 completed series, PSI validation, and the T003 integrity pass.
DOI pending · preprint live now
Conceptual framework for Shore, Thicket, and Caldera: a staged alignment resilience environment built on top of the current Arena foundation.
DOI pending · preprint live now
We introduce Moreau Arena, a contamination-resistant benchmark for LLM strategic reasoning. Agents design creature builds (animal choice plus stat allocation) for a novel stochastic combat game whose rules exist only in the prompt. Three tournaments over the same benchmark core—13 agents in T001/T002, expanded to 15 in T003—and 2,609 best-of-7 series reveal a sharp boundary condition.
Under vague rules with no feedback (T001), baselines dominate and LLMs win only 37.5% of cross-group series. With exact formulas, meta-context hints, and per-game adaptation (T002), the leaderboard reverses: LLMs win 89.75%. A targeted ablation (T003) removes only the meta-context hints while preserving exact rules and feedback. Performance does not collapse uniformly; instead the field splits. GPT-5.2-Codex and Claude Opus remain strong without hints, while three models from three providers freeze on a single identical build and never adapt.
Meanwhile, GPT-5.4—released during this study—ranks 14th of 15 despite exploring 79 builds, demonstrating that frontier capability does not automatically transfer to novel strategic domains. A prompt sensitivity analysis using a semantically equivalent paraphrase yields Kendall τ = 1.0 (perfect rank preservation), providing strong initial evidence that the key findings are not obvious artifacts of a single prompt surface form.
| Metric | T001 (One-Shot) | T002 (Scaffolded) | T003 (No Meta) |
|---|---|---|---|
| Agents | 13 (8 LLM + 5 baseline) | 13 (8 LLM + 5 baseline) | 15 (10 LLM + 5 baseline) |
| Prompt type | Qualitative, no feedback | Exact formulas + meta + adaptation | Exact formulas + adaptation, no meta |
| LLM vs baseline WR | 37.5% | 89.75% | Field splits |
| #1 Agent | SmartAgent (baseline) | GPT-5.2-Codex | GPT-5.2-Codex |
| Series count | 779 | 780 | 1,050 |
| Non-transitive cycles | 0 | 12 | 12 |
| PSI τ | — | — | 1.0 (prompt-robust) |
Under vague rules (T001), baselines dominate—SmartAgent, HighVarianceAgent, and ConservativeAgent occupy the top 3. LLMs win only 37.5% of cross-group series. With exact formulas, meta hints, and per-game adaptation (T002), the ordering reverses completely: every LLM outranks every baseline, with 89.75% cross-group win rate. GPT-5.2-Codex jumps from rank 9 to rank 1.
T003 removes meta-context while keeping exact rules and feedback. The result is not uniform collapse but a sharp split. GPT-5.2-Codex stays #1 (BT=1.0), Claude Opus rises to #3 (BT=0.895), and GPT-5.2 holds #4. These models reason strategically without hints. Meanwhile, Gemini Flash, Gemini Pro, and Grok all freeze on a single identical build (BOAR 8/8/3/1) and never adapt—0% adaptation rate across hundreds of losses.
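The BT values above are Bradley–Terry strengths fit from pairwise series outcomes. A minimal sketch of such a fit using the standard MM (Zermelo) iteration — the toy win matrix and the mean-1 normalization are illustrative, not the repository's `analysis.bt_ranking` implementation:

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix where
    wins[i][j] = number of series agent i won against agent j.
    Uses the MM (Zermelo) fixed-point iteration."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for agent i
            # Sum over opponents: matches played / combined strength.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        # Normalize so strengths average to 1 (scale is arbitrary in BT).
        s = sum(new_p)
        p = [x * n / s for x in new_p]
    return p

# Toy 3-agent example (not real tournament data):
wins = [[0, 6, 5],
        [1, 0, 4],
        [2, 3, 0]]
print(bradley_terry(wins))  # agent 0 comes out strongest
```

The fixed point is unique (up to scale) whenever the comparison graph is connected, which an all-pairs tournament guarantees.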
GPT-5.4, a state-of-the-art frontier model released during this study, ranks 14th of 15 in T003. It explores 79 unique builds and adapts after 89.8% of losses—yet still loses to every fixed-strategy baseline. Search volume alone is insufficient; what matters is whether search converges toward strategically strong regions. This distinction between search volume and search quality is one of the clearest behavioral findings.
A semantically equivalent paraphrase of the T003 prompt produces identical rank ordering (Kendall τ = 1.0). GPT-5.4 remains last, baselines remain above most LLMs, and the field split persists. This provides strong initial evidence that the result is not an obvious artifact of a single prompt surface form.
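The τ reported here is ordinary Kendall rank correlation between the two leaderboards. A pure-Python sketch — the agent orderings below are hypothetical stand-ins, not the actual T003 leaderboards:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two tie-free rankings of the
    same items: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        a = rank_a.index(x) - rank_a.index(y)
        b = rank_b.index(x) - rank_b.index(y)
        if a * b > 0:
            concordant += 1   # pair ordered the same way in both rankings
        else:
            discordant += 1   # pair flipped between the rankings
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical leaderboards: identical ordering under the paraphrase.
original  = ["codex", "opus", "gpt-5.2", "smart-baseline", "gpt-5.4"]
rephrased = ["codex", "opus", "gpt-5.2", "smart-baseline", "gpt-5.4"]
print(kendall_tau(original, rephrased))  # -> 1.0
```

τ = 1.0 means every pair of agents is ordered identically in both leaderboards; a single swap of adjacent agents would already drop it below 1.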
In T001, Claude Opus repeats wolf 3/8/1/8 in 86% of series, investing heavily in WIL (willpower). But WIL resist applies only to ability procs (3.5–4.5% per tick)—fewer than one blocked proc per match. Mean WIL correlates strongly negatively with BT score (ρ = −0.93). With exact formulas in T002/T003, Opus escapes the trap and jumps to rank 3.
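The ρ here is a Spearman rank correlation. A self-contained sketch using the tie-free shortcut formula — the WIL/BT values below are invented for illustration, not measured data:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data via the shortcut
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Illustrative only: the more an agent invests in WIL, the lower its BT score.
mean_wil = [1, 2, 4, 6, 8]
bt_score = [0.9, 0.8, 0.5, 0.4, 0.1]
print(spearman_rho(mean_wil, bt_score))  # -> -1.0
```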
T001 has 0 strict 3-cycles; T002 and T003 each have 12. Adaptation creates genuine rock-paper-scissors structure. An all-LLM cycle in T003: Claude Opus > GPT-5.2-Codex (80%) > GPT-5.2 (60%) > Claude Opus (60%). No single agent dominates all others.
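A strict 3-cycle can be counted straight from the pairwise win-rate matrix: an unordered triple where each agent beats the next with a win rate strictly above 50%. A sketch on a toy matrix (not tournament data):

```python
from itertools import combinations

def strict_three_cycles(wr):
    """Count strict 3-cycles in a win-rate matrix wr[a][b] = P(a beats b):
    triples {a, b, c} where a beats b, b beats c, and c beats a,
    each pairwise win rate strictly above 0.5."""
    beats = lambda a, b: wr[a][b] > 0.5
    cycles = 0
    for a, b, c in combinations(range(len(wr)), 3):
        # A triple is a cycle in exactly one of the two orientations.
        if (beats(a, b) and beats(b, c) and beats(c, a)) or \
           (beats(b, a) and beats(c, b) and beats(a, c)):
            cycles += 1
    return cycles

# Toy rock-paper-scissors triangle (illustrative win rates):
wr = [[0.5, 0.8, 0.4],
      [0.2, 0.5, 0.6],
      [0.6, 0.4, 0.5]]
print(strict_three_cycles(wr))  # -> 1
```

A nonzero count is exactly the rock-paper-scissors structure described above: no linear ordering of agents can explain all pairwise results.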
The paper covers the frozen 6-animal benchmark (T001–T003). In parallel, Moreau Arena operates as a living competitive layer with seasonal roster expansions. Season 1 introduces 14 animals, WIL regen mechanics, and new cross-animal interactions (anti-heal, proc suppression, charge ignoring dodge). A separate tournament with 9 LLMs and 5 baselines across 91 all-pairs series produces dramatically reshuffled rankings—notably, GPT-5.4 rises from 14th (T003) to 2nd (S1), suggesting that the expanded strategy space unlocks capabilities the narrow benchmark could not measure.
Season 1 results are available on the S1 Leaderboard. Season data is not part of the paper’s frozen benchmark but demonstrates the auto-balancer and roster expansion design described in the paper.
Additional publications by Victor Stasiuc. Moreau Arena v3 and Moreau Island will move from placeholder status to DOI-linked records after Zenodo upload.
The full benchmark—simulator, tournament data, prompts, analysis scripts, and all 89 invariant tests—is open source and available for reproducible evaluation.
View on GitHub

@article{stasiuc2026moreau,
  title  = {Moreau Arena: Not All LLMs Need Hints to Reason Strategically},
  author = {Stasiuc, Victor},
  year   = {2026},
  note   = {Config hash: b7ec5885, 89 invariant tests,
            15 agents across 3 tournament protocols,
            2609 series, PSI tau = 1.0}
}
The full benchmark, including simulator, tournament data, and all 89 invariant tests, is available from the repository. No GPU required.
git clone https://github.com/supervitek/moreau-arena-paper.git
cd moreau-arena-paper
pip install -r requirements.txt
The core simulator config is frozen and hash-verified. Any modification will cause invariant tests to fail.
# Verify config hash (must match paper)
python scripts/verify_config_hash.py
# Expected: b7ec588583135ad640eba438f29ce45c10307a88dc426decd31126371bb60534
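The expected value is a SHA-256 hex digest of the frozen config. A minimal sketch of how such a check might work — the config path is hypothetical, and the repository's `scripts/verify_config_hash.py` may canonicalize the config before hashing:

```python
import hashlib

def config_hash(path):
    """SHA-256 hex digest over the raw bytes of a config file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

EXPECTED = "b7ec588583135ad640eba438f29ce45c10307a88dc426decd31126371bb60534"

# Hypothetical invocation; point it at the repo's frozen config file:
# assert config_hash("config/simulator.json") == EXPECTED
```

Hashing raw bytes means any edit to the config, even whitespace, changes the digest and fails the check.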
# Run the full invariant test suite
python -m pytest tests/test_invariants.py -v
# Expected: 89 passed
# Recompute BT rankings for each tournament
python -m analysis.bt_ranking data/tournament_001/results.jsonl
python -m analysis.bt_ranking data/tournament_002/results.jsonl
python -m analysis.bt_ranking data/tournament_003/results.jsonl