Three completed tracks and one planned, each isolating a different LLM capability.
Each agent receives the game rules and must produce a single fixed build. The prompt provides basic mechanics but no formulas, no meta-information, and no structured output format.
Raw strategic reasoning from minimal information. Can the LLM infer good stat allocations from qualitative descriptions alone?
LLM average win rate: 37.5%
Baselines outperform most LLMs. SmartAgent (#1) dominates.
Same game, same simulator, but the prompt includes: exact damage/HP formulas, meta-information about matchup archetypes, structured JSON output format, and adaptation between games in a series.
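As a sketch, the structured output requested in T002 might look like the following (all field and stat names here are assumptions; the source states only that the format is JSON, and the 8/8/3/1 values are borrowed from the Boar build discussed under T003):

```json
{
  "animal": "Boar",
  "stats": { "atk": 8, "hp": 8, "def": 3, "spd": 1 },
  "reasoning": "High ATK/HP trades defense and speed for raw damage."
}
```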
Whether LLMs can leverage explicit formulas and structured guidance to make near-optimal decisions.
LLM average win rate: 89.75%
LLMs dominate baselines. gpt-5.2-codex (#1) at BT 1.0.
T002 minus meta-context. LLMs receive exact formulas, JSON output format, and can adapt between games — but get zero strategy hints about which builds or animals are strong. 15 agents (10 LLMs + 5 baselines), 1050 series.
Can LLMs derive strategy from mechanics alone? Or did T002's improvement come from copying the meta-context's top builds?
LLM average win rate: 53.1%
Intermediate outcome. Most of T002's gain came from the meta-context: removing it drops the LLM average from 89.75% to 53.1% (~36.7pp). Still, exact formulas, structured output, and adaptation alone lift performance from T001's 37.5% to 53.1% (+15.6pp).
Three models (Gemini Flash, Gemini Pro, Grok) chose the exact same Boar 8/8/3/1 build in every single game: 2,358 builds, zero adaptation. Without meta-context, they default to a cached pattern. Meanwhile, gpt-5.2-codex stays #1 and Opus jumps to #3.
| Component | T001 | T002 | T003 |
|---|---|---|---|
| Damage Formulas | "Higher ATK deals more damage" | `base_dmg = atk - 1` | `base_dmg = atk - 1` |
| HP Formula | "More HP means more health" | `max_hp = 50 + hp * 10` | `max_hp = 50 + hp * 10` |
| Meta Information | None | Top-5 builds from T001 | None (removed) |
| Output Format | Free text | Structured JSON | Structured JSON |
| Adaptation | None (fixed build) | Loser adapts | Loser adapts |
| LLM Avg Win Rate | 37.5% | 89.75% | 53.1% |
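The two exact formulas from the table can be applied directly. A minimal sketch, assuming the first two stats of the Boar 8/8/3/1 build are ATK and HP (the stat order is not stated in the source):

```python
# The two exact formulas given verbatim in the T002/T003 prompts.
def base_damage(atk: int) -> int:
    return atk - 1            # base_dmg = atk - 1

def max_hp(hp: int) -> int:
    return 50 + hp * 10       # max_hp = 50 + hp * 10

# Applied to the Boar 8/8/3/1 build (ATK/HP ordering is an assumption):
print(base_damage(8))  # 7
print(max_hp(8))       # 130
```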
Agents can call the simulator as a tool to test builds before committing. Tests whether LLMs can effectively use computational tools for hypothesis-driven optimization.
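A hypothesis-driven loop of the kind T004 would test could be sketched as follows. Here `simulate` is a stand-in stub for the real simulator tool, and the tuple build encoding and candidate/opponent values are assumptions for illustration:

```python
# Sketch of the T004 agent loop: test candidate builds against the
# simulator tool before committing to one.
def simulate(build: tuple, opponent: tuple) -> bool:
    """Stub: declare the build with the higher stat total the winner.
    The real track exposes the actual game simulator as the tool."""
    return sum(build) >= sum(opponent)

def pick_build(candidates: list, opponents: list) -> tuple:
    """Return the candidate build that wins the most test matches."""
    best, best_wins = None, -1
    for build in candidates:
        wins = sum(simulate(build, opp) for opp in opponents)
        if wins > best_wins:
            best, best_wins = build, wins
    return best

# The agent commits only after testing each hypothesis.
chosen = pick_build([(8, 8, 3, 1), (4, 4, 4, 4), (10, 6, 2, 2)],
                    [(6, 6, 4, 4), (9, 9, 1, 1)])
print(chosen)  # (8, 8, 3, 1)
```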
PLANNED: Which T002 components matter most? Five variants isolate individual contributions.
| Variant | Description | Components Included |
|---|---|---|
| formulas-only | Add exact formulas to T001 | +formulas |
| meta-only | Add meta-info to T001 | +meta |
| adaptation-only | Add between-game adaptation | +adapt |
| structured-output-only | Add JSON output format | +json |
| formulas-no-meta = T003 | T002 minus meta-info | +formulas +json +adapt |