A Non-Transitive Benchmark for Evaluating Strategic Decision-Making in Large Language Models
We introduce Moreau Arena, a contamination-resistant benchmark for evaluating strategic reasoning in large language models. The benchmark centers on a creature combat simulator with 14 animal types and a 4-stat allocation system that produces non-transitive dominance structures. Because the game mechanics are novel and do not appear in any training corpus, LLMs cannot rely on memorized solutions and must reason from first principles.
We evaluate 8 LLMs across two tournament protocols: T001 (minimal prompt, fixed builds) and T002 (engineered prompt with formulas, meta-information, structured output, and adaptation). Results show a dramatic performance shift: LLM average win rate rises from 37.5% in T001 to 89.75% in T002, with the leaderboard inverting completely. This demonstrates that prompt engineering — not model capability — is the primary determinant of strategic performance in novel domains.
We provide ablation studies isolating the contribution of each T002 component, a Prompt Sensitivity Index (PSI) measuring ranking stability under paraphrase, and a characterization of 3-cycle intransitive dynamics in the agent population. The full simulator, tournament data, and 89 invariant tests are released to enable reproducible evaluation of future models.
The same models tested in T001 (minimal prompt) and T002 (engineered prompt) produce completely inverted rankings. In T001, the algorithmic SmartAgent dominates all LLMs. In T002, LLMs dominate all baselines. The rank-order correlation between T001 and T002 is strongly negative.
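A full leaderboard inversion corresponds to a Spearman rank correlation of -1 between the two tracks. A minimal sketch of the computation, with illustrative agent names and ranks rather than the paper's actual leaderboard:

```python
# Sketch: Spearman rank correlation between two tournament rankings.
# Agent names and ranks below are illustrative, not the paper's data.
def spearman(rank_a, rank_b):
    """Spearman rho for two rankings given as {agent: rank} dicts (no ties)."""
    agents = sorted(rank_a)
    n = len(agents)
    d2 = sum((rank_a[a] - rank_b[a]) ** 2 for a in agents)
    return 1 - 6 * d2 / (n * (n**2 - 1))

# A fully inverted leaderboard yields rho = -1.
t001 = {"A": 1, "B": 2, "C": 3, "D": 4}
t002 = {"A": 4, "B": 3, "C": 2, "D": 1}
print(spearman(t001, t002))  # -1.0
```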
The T001→T002 performance jump is larger than any difference between models within the same track. This suggests that for novel strategic domains, how you ask matters more than which model you ask.
Multiple 3-cycles are detected in the agent population: A beats B, B beats C, C beats A. This validates the use of Bradley-Terry scoring over Elo and demonstrates that the benchmark captures the rock-paper-scissors dynamics it was designed to test.
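Detecting such cycles reduces to scanning every unordered triple of agents for a directed cycle in the head-to-head "beats" relation. A minimal sketch, using an illustrative rock-paper-scissors triple rather than the paper's actual agents:

```python
from itertools import combinations

def find_3_cycles(beats):
    """Find all 3-cycles in a pairwise 'beats' relation.

    `beats` maps each agent to the set of agents it defeats head-to-head
    (e.g. by majority of games won). Returns cycles as (a, b, c) with
    a beats b, b beats c, and c beats a.
    """
    cycles = []
    for a, b, c in combinations(sorted(beats), 3):
        # Each triple admits two possible cycle orientations.
        for x, y, z in ((a, b, c), (a, c, b)):
            if y in beats[x] and z in beats[y] and x in beats[z]:
                cycles.append((x, y, z))
    return cycles

# Illustrative intransitive triple (not the paper's agent population):
beats = {"bear": {"wolf"}, "wolf": {"eagle"}, "eagle": {"bear"}}
print(find_3_cycles(beats))  # [('bear', 'wolf', 'eagle')]
```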
The variance budget analysis shows RNG contributes <25% of outcome variance. The remaining >75% is attributable to strategic choices. Combined with novel mechanics, this confirms the benchmark measures reasoning, not memorization.
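One way to realize such a variance budget is the law of total variance over repeated seeded replays: within-matchup variance is attributable to RNG, between-matchup variance to strategic choices. A sketch under assumptions (the function name, data layout, and equal-sized replay groups are ours, not the paper's actual analysis code):

```python
import statistics

def variance_budget(outcomes):
    """Split win/loss variance into RNG and strategic shares.

    `outcomes` maps each matchup to a list of win indicators (1/0) from
    repeated seeded replays of the same pair of builds; replay groups
    are assumed equal-sized so the simple mean of within-group
    variances applies.
    """
    all_vals = [v for runs in outcomes.values() for v in runs]
    total = statistics.pvariance(all_vals)
    # Within-matchup (RNG) component: average variance across replays.
    within = statistics.fmean(statistics.pvariance(r) for r in outcomes.values())
    return {"rng_share": within / total, "strategic_share": 1 - within / total}

# Illustrative extreme: outcomes fully determined by the matchup,
# so all variance is strategic and none is RNG.
shares = variance_budget({"A_vs_B": [1, 1, 1, 1], "C_vs_D": [0, 0, 0, 0]})
print(shares)  # {'rng_share': 0.0, 'strategic_share': 1.0}
```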
@article{moreau-arena-2026,
title = {Moreau Arena: A Non-Transitive Benchmark for
Evaluating Strategic Decision-Making in LLMs},
year = {2026},
note = {Config hash: b7ec5885, 89 invariant tests,
13 agents across 2 tournament protocols}
}
The full benchmark, including simulator, tournament data, and all 89 invariant tests, is available from the repository. No GPU required.
git clone https://github.com/moreau-arena/moreau-arena.git
cd moreau-arena
pip install -r requirements.txt
The core simulator config is frozen and hash-verified. Any modification will cause invariant tests to fail. Verify the config hash matches the value embedded in the paper:
# Verify config hash (must match paper)
python -c "import hashlib, pathlib; print(hashlib.sha256(pathlib.Path('simulator/config.json').read_bytes()).hexdigest())"
# Expected output:
# b7ec588583135ad640eba438f29ce45c10307a88dc426decd31126371bb60534
89 tests verify determinism, stat constraints, combat mechanics, scoring, and data integrity. All must pass for valid reproduction.
python -m pytest tests/test_invariants.py -v
# Expected: 89 passed
Bradley-Terry scores and Elo ratings can be recomputed directly from the frozen JSONL tournament data. Results should match the paper exactly.
# Recompute BT rankings for Track A (T001)
python -m analysis.bt_ranking data/tournament_001/results.jsonl
# Recompute BT rankings for Track B (T002)
python -m analysis.bt_ranking data/tournament_002/results.jsonl
# Regenerate full results summary across both tracks
python -m analysis.bt_ranking data/tournament_001/results.jsonl data/tournament_002/results.jsonl
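For readers without the repository at hand, the Bradley-Terry computation can be sketched with the standard minorization-maximization iteration. This is an independent re-implementation under assumptions: the `(winner, loser)` pair format is ours, and the real JSONL field names and any regularization in `analysis.bt_ranking` may differ.

```python
import math
from collections import defaultdict

def bradley_terry(matches, iters=200):
    """Fit Bradley-Terry strengths via the standard MM iteration.

    `matches` is a list of (winner, loser) pairs. Returns log-strengths
    normalized to mean zero.
    """
    wins = defaultdict(float)
    games = defaultdict(lambda: defaultdict(float))
    players = set()
    for w, l in matches:
        wins[w] += 1
        games[w][l] += 1
        games[l][w] += 1
        players |= {w, l}
    p = {a: 1.0 for a in players}
    for _ in range(iters):
        # MM update: p_a = W_a / sum_b n_ab / (p_a + p_b)
        new = {}
        for a in players:
            denom = sum(n / (p[a] + p[b]) for b, n in games[a].items())
            new[a] = wins[a] / denom if denom else p[a]
        s = sum(new.values())
        p = {a: v * len(players) / s for a, v in new.items()}
    mean_log = sum(math.log(v) for v in p.values()) / len(p)
    return {a: math.log(v) - mean_log for a, v in p.items()}

# Illustrative intransitive record: A beats B twice, B beats C twice,
# C beats A once.
scores = bradley_terry([("A", "B"), ("A", "B"), ("B", "C"), ("B", "C"), ("C", "A")])
```

Note that intransitive populations are exactly where Bradley-Terry's single-scale summary loses information, which is why the paper reports the 3-cycles separately.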
A self-contained verification script is included in the handoff package. It runs all integrity checks in sequence: config hash verification, invariant test suite, BT score recomputation, and cross-track consistency checks.
# Run the full verification pipeline
python scripts/verify_handoff.py
# Run the simulator directly
python -m simulator --build1 "bear 3 14 2 1" --build2 "buffalo 8 6 4 2" --games 1000
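The build strings above follow an `"<animal> <s1> <s2> <s3> <s4>"` shape, and both examples sum to 20 stat points. A parsing sketch under those inferred assumptions (the 20-point budget and validation rules are read off the examples, not confirmed against the simulator's spec):

```python
def parse_build(spec, budget=20):
    """Parse a CLI build string like "bear 3 14 2 1".

    Assumes 4 stat values summing to a fixed budget; both values are
    inferred from the example invocations, not the simulator's spec.
    """
    animal, *stats = spec.split()
    stats = [int(s) for s in stats]
    if len(stats) != 4:
        raise ValueError("expected exactly 4 stat values")
    if sum(stats) != budget:
        raise ValueError(f"stats must sum to {budget}, got {sum(stats)}")
    return {"animal": animal, "stats": stats}

print(parse_build("bear 3 14 2 1"))
# {'animal': 'bear', 'stats': [3, 14, 2, 1]}
```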