Tournament Protocols

Three completed tracks, one planned — each isolating a different LLM capability.

Defined Tracks: 4
Completed: 3
LLMs Tested: 10
Baselines: 5

Track A
T001 — Minimal Prompt

Protocol

Each agent receives the game rules and must produce a single fixed build. The prompt provides basic mechanics but no formulas, no meta-information, and no structured output format.

What It Tests

Raw strategic reasoning from minimal information. Can the LLM infer good stat allocations from qualitative descriptions alone?

Key Result

LLM average win rate: 37.5%
Baselines outperform most LLMs. SmartAgent (#1) dominates.

Track B
T002 — Engineered Prompt

Protocol

Same game, same simulator, but the prompt includes: exact damage/HP formulas, meta-information about matchup archetypes, structured JSON output format, and adaptation between games in a series.

What It Tests

Whether LLMs can leverage explicit formulas and structured guidance to make near-optimal decisions.

Key Result

LLM average win rate: 89.75%
LLMs dominate baselines. gpt-5.2-codex (#1) at BT 1.0.
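
The page does not expand "BT". If it denotes a Bradley-Terry strength (an assumption), a rating like this could be fitted from pairwise win counts roughly as sketched below. The win matrix is hypothetical, and rescaling so that the strongest agent sits at 1.0 is also an assumption, chosen only because it would be consistent with the #1 agent being reported at 1.0.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] is the number of times agent i beat agent j.
    Strengths are rescaled each step so the strongest agent is 1.0 (assumed convention).
    This sketch assumes every agent has at least one win.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents of (games played against j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        top = max(new_p)
        p = [x / top for x in new_p]
    return p


# Hypothetical three-agent round-robin, 10 games per pair.
example_wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(example_wins)
print([round(s, 3) for s in strengths])
```

Under this convention only the ratios between strengths carry meaning; the absolute scale is arbitrary.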

Track C
T003 — Exact-Only Cleanroom

Protocol

The T002 protocol with the meta-context removed. LLMs receive exact formulas and the JSON output format and can adapt between games, but get no strategy hints about which builds or animals are strong. The field is 15 agents (10 LLMs + 5 baselines) playing 1,050 series.
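
As a quick sanity check on the field size, the arithmetic below shows how 15 agents and 1,050 series fit together. It assumes a full round-robin in which every pair plays the same number of series; the page does not state the pairing scheme, so treat that as an assumption.

```python
from math import comb

agents = 10 + 5        # 10 LLMs + 5 baselines, as stated above
total_series = 1050    # as stated above

# Unordered pairings in a full round-robin (assumed pairing scheme).
pairs = comb(agents, 2)
series_per_pair, remainder = divmod(total_series, pairs)

print(pairs, series_per_pair, remainder)  # 105 10 0
```

Under that assumption the totals divide evenly into 10 series per pairing.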

What It Tests

Can LLMs derive strategy from mechanics alone? Or did T002's improvement come from copying the meta-context's top builds?

Key Result

LLM average win rate: 53.1%
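Intermediate outcome. Meta-context helps: by the averages reported for each track, removing it costs about 36.7pp relative to T002 (89.75% → 53.1%). Exact formulas, JSON output, and adaptation still lift LLMs 15.6pp over T001 (37.5% → 53.1%).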

The Frozen Model Effect

Three models (Gemini Flash, Gemini Pro, Grok) chose the exact same Boar 8/8/3/1 build for every single game: 2,358 builds with zero adaptation. Without meta-context, they appear to default to a cached pattern. Meanwhile, gpt-5.2-codex stays #1 and Opus jumps to #3.
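
A check for this kind of frozen behavior could look like the sketch below: group submitted builds by model and flag any model with exactly one distinct build. The log format, the model names, and the second build are hypothetical; only the Boar 8/8/3/1 build comes from the text above.

```python
from collections import defaultdict


def frozen_models(build_log):
    """Return models that submitted exactly one distinct build across all games."""
    builds_by_model = defaultdict(set)
    for model, build in build_log:
        builds_by_model[model].add(build)
    return sorted(m for m, builds in builds_by_model.items() if len(builds) == 1)


# Hypothetical log entries: (model name, (animal, four stat values)).
log = [
    ("model-a", ("Boar", 8, 8, 3, 1)),
    ("model-a", ("Boar", 8, 8, 3, 1)),
    ("model-b", ("Boar", 8, 8, 3, 1)),
    ("model-b", ("Boar", 7, 8, 4, 1)),  # model-b changed its build once
]

print(frozen_models(log))  # ['model-a']
```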

What Changed Across Tournaments

Component        | T001                           | T002                   | T003
Damage Formulas  | "Higher ATK deals more damage" | base_dmg = atk - 1     | base_dmg = atk - 1
HP Formula       | "More HP means more health"    | max_hp = 50 + hp * 10  | max_hp = 50 + hp * 10
Meta Information | None                           | Top-5 builds from T001 | None (removed)
Output Format    | Free text                      | Structured JSON        | Structured JSON
Adaptation       | None (fixed build)             | Loser adapts           | Loser adapts
LLM Avg Win Rate | 37.5%                          | 89.75%                 | 53.1%
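
To make concrete what the exact formulas give an agent that the T001 qualitative descriptions do not, here is a minimal sketch that evaluates the two formulas from the table. The sample point values (1, 3, 8) are just the numbers that appear in the Boar 8/8/3/1 build; which stat each one belongs to is not stated on this page, so none is assumed.

```python
def base_dmg(atk: int) -> int:
    """Damage formula as given in the T002/T003 prompts."""
    return atk - 1


def max_hp(hp: int) -> int:
    """HP formula as given in the T002/T003 prompts."""
    return 50 + hp * 10


# Evaluate both formulas at a few point allocations.
for points in (1, 3, 8):
    print(points, base_dmg(points), max_hp(points))
# 1 0 60
# 3 2 80
# 8 7 130
```

Note that one point in ATK yields zero base damage under this formula, which an agent cannot infer from "Higher ATK deals more damage" alone.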

Planned Tracks

Track D — Tool-Augmented

Agents can call the simulator as a tool to test builds before committing. Tests whether LLMs can effectively use computational tools for hypothesis-driven optimization.
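
Since this track is only planned, the following is purely an illustrative sketch of the "test before committing" loop. The tool interface (`simulate`), the two-stat build shape, and the toy duel rule are all hypothetical stand-ins, not the real simulator; only the two formulas are taken from the tournament prompts.

```python
from math import ceil
from typing import Callable, Sequence, Tuple

Build = Tuple[int, int]  # (atk, hp) only; a simplification for this sketch


def toy_simulate(a: Build, b: Build) -> bool:
    """Toy stand-in, NOT the real simulator: True if `a` needs fewer hits than `b`."""

    def hits_needed(attacker: Build, defender: Build) -> float:
        dmg = attacker[0] - 1            # base_dmg = atk - 1
        hp = 50 + defender[1] * 10       # max_hp = 50 + hp * 10
        return ceil(hp / dmg) if dmg > 0 else float("inf")

    return hits_needed(a, b) < hits_needed(b, a)


def pick_build(
    candidates: Sequence[Build],
    opponents: Sequence[Build],
    simulate: Callable[[Build, Build], bool],
) -> Build:
    """Query the tool for every candidate and commit to the one with the most wins."""
    return max(candidates, key=lambda c: sum(simulate(c, o) for o in opponents))


# Hypothetical candidate and opponent builds.
candidates = [(8, 8), (12, 4), (4, 12)]
opponents = [(6, 10), (10, 6)]
best = pick_build(candidates, opponents, toy_simulate)
print(best)
```

The design point is the separation between proposing candidates and scoring them: the agent forms hypotheses, and the simulator call supplies the evidence before a build is submitted.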

Status: Planned

Ablation Study

Which T002 components matter most? Five variants isolate individual contributions.

Variant                   | Description                 | Components Included
formulas-only             | Add exact formulas to T001  | +formulas
meta-only                 | Add meta-info to T001       | +meta
adaptation-only           | Add between-game adaptation | +adapt
structured-output-only    | Add JSON output format      | +json
formulas-no-meta (= T003) | T002 minus meta-info        | +formulas +json +adapt
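
One way to keep these variants unambiguous is to express each as a set of prompt components, as in the sketch below. The component tags mirror the table; representing T001 as the empty set and T002 as all four components is an assumption drawn from the cross-tournament comparison, not a configuration format the tournament is known to use.

```python
T001 = frozenset()                                        # no added components
T002 = frozenset({"formulas", "meta", "json", "adapt"})   # all four components

variants = {
    "formulas-only": T001 | {"formulas"},
    "meta-only": T001 | {"meta"},
    "adaptation-only": T001 | {"adapt"},
    "structured-output-only": T001 | {"json"},
    "formulas-no-meta": T002 - {"meta"},                  # equals the T003 protocol
}

# Show which T002 components each variant leaves out.
for name, components in variants.items():
    print(name, sorted(components), "missing:", sorted(T002 - components))
```

Written this way, it is easy to see that the four single-component variants build up from T001, while the fifth subtracts from T002.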