Three completed tracks and one planned, each isolating a different LLM capability.
Each agent receives the game rules and must produce a single fixed build. The prompt provides basic mechanics but no formulas, no meta-information, and no structured output format.
Raw strategic reasoning from minimal information. Can the LLM infer good stat allocations from qualitative descriptions alone?
LLM average win rate: 37.5%
Baselines outperform most LLMs. SmartAgent (#1) dominates.
Same game, same simulator, but the prompt includes: exact damage/HP formulas, meta-information about matchup archetypes, structured JSON output format, and adaptation between games in a series.
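As a sketch, the structured output requested in T002 might look like the following (all field and stat names here are assumptions; the source states only that the format is JSON, and the 8/8/3/1 values are borrowed from the Boar build discussed under T003):

```json
{
  "animal": "Boar",
  "stats": { "atk": 8, "hp": 8, "def": 3, "spd": 1 },
  "reasoning": "High ATK/HP trades defense and speed for raw damage."
}
```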
Whether LLMs can leverage explicit formulas and structured guidance to make near-optimal decisions.
LLM average win rate: 89.75%
LLMs dominate baselines. gpt-5.2-codex (#1) at BT 1.0.
T002 minus meta-context. LLMs receive exact formulas, JSON output format, and can adapt between games — but get zero strategy hints about which builds or animals are strong. 15 agents (10 LLMs + 5 baselines), 1050 series.
Can LLMs derive strategy from mechanics alone? Or did T002's improvement come from copying the meta-context's top builds?
LLM average win rate: 53.1%
Intermediate outcome. Most of T002's gain came from the meta-context: removing it drops the LLM average from 89.75% to 53.1% (~36.7pp). Still, exact formulas, structured output, and adaptation alone lift performance from T001's 37.5% to 53.1% (+15.6pp).
Three models (Gemini Flash, Gemini Pro, Grok) chose the exact same Boar 8/8/3/1 build in every single game: 2,358 builds, zero adaptation. Without meta-context, they default to a cached pattern. Meanwhile, gpt-5.2-codex stays #1 and Opus jumps to #3.
| Component | T001 | T002 | T003 |
|---|---|---|---|
| Damage Formulas | "Higher ATK deals more damage" | `base_dmg = atk - 1` | `base_dmg = atk - 1` |
| HP Formula | "More HP means more health" | `max_hp = 50 + hp * 10` | `max_hp = 50 + hp * 10` |
| Meta Information | None | Top-5 builds from T001 | None (removed) |
| Output Format | Free text | Structured JSON | Structured JSON |
| Adaptation | None (fixed build) | Loser adapts | Loser adapts |
| LLM Avg Win Rate | 37.5% | 89.75% | 53.1% |
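The two exact formulas from the table can be applied directly. A minimal sketch, assuming the first two stats of the Boar 8/8/3/1 build are ATK and HP (the stat order is not stated in the source):

```python
# The two exact formulas given verbatim in the T002/T003 prompts.
def base_damage(atk: int) -> int:
    return atk - 1            # base_dmg = atk - 1

def max_hp(hp: int) -> int:
    return 50 + hp * 10       # max_hp = 50 + hp * 10

# Applied to the Boar 8/8/3/1 build (ATK/HP ordering is an assumption):
print(base_damage(8))  # 7
print(max_hp(8))       # 130
```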
Agents can call the simulator as a tool to test builds before committing. Tests whether LLMs can effectively use computational tools for hypothesis-driven optimization.
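A hypothesis-driven loop of the kind T004 would test could be sketched as follows. Here `simulate` is a stand-in stub for the real simulator tool, and the tuple build encoding and candidate/opponent values are assumptions for illustration:

```python
# Sketch of the T004 agent loop: test candidate builds against the
# simulator tool before committing to one.
def simulate(build: tuple, opponent: tuple) -> bool:
    """Stub: declare the build with the higher stat total the winner.
    The real track exposes the actual game simulator as the tool."""
    return sum(build) >= sum(opponent)

def pick_build(candidates: list, opponents: list) -> tuple:
    """Return the candidate build that wins the most test matches."""
    best, best_wins = None, -1
    for build in candidates:
        wins = sum(simulate(build, opp) for opp in opponents)
        if wins > best_wins:
            best, best_wins = build, wins
    return best

# The agent commits only after testing each hypothesis.
chosen = pick_build([(8, 8, 3, 1), (4, 4, 4, 4), (10, 6, 2, 2)],
                    [(6, 6, 4, 4), (9, 9, 1, 1)])
print(chosen)  # (8, 8, 3, 1)
```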
PLANNED: Which T002 components matter most? Five variants isolate individual contributions.
| Variant | Description | Components Included |
|---|---|---|
| formulas-only | Add exact formulas to T001 | +formulas |
| meta-only | Add meta-info to T001 | +meta |
| adaptation-only | Add between-game adaptation | +adapt |
| structured-output-only | Add JSON output format | +json |
| formulas-no-meta = T003 | T002 minus meta-info | +formulas +json +adapt |