Two tracks are complete and two are planned; each isolates a different LLM capability.
Each agent receives the game rules and must produce a single fixed build. The prompt provides basic mechanics but no formulas, no meta-information, and no structured output format.
Tests raw strategic reasoning from minimal information. Can the LLM infer good stat allocations from qualitative descriptions alone?
LLM average win rate: 37.5%
Baselines outperform most LLMs. SmartAgent (#1) dominates.
Same game, same simulator, but the prompt includes: exact damage/HP formulas, meta-information about matchup archetypes, structured JSON output format, and adaptation between games in a series.
Tests whether LLMs can leverage explicit formulas and structured guidance to make near-optimal decisions.
LLM average win rate: 89.75%
LLMs dominate baselines. gpt-5.2-codex (#1) at BT 1.0.
| Component | T001 | T002 |
|---|---|---|
| Damage Formula | "Higher ATK deals more damage" | `base_dmg = atk - 1` |
| HP Formula | "More HP means more health" | `max_hp = 50 + hp * 10` |
| Meta Information | None | Archetype descriptions, matchup tips |
| Output Format | Free text | Structured JSON |
| Adaptation | None (fixed build) | Can adapt between games |
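The exact T002 formulas from the table can be sketched directly. This is a minimal illustration; the function names and the build dict shape are ours, not identifiers from the benchmark:

```python
def base_damage(atk: int) -> int:
    # T002 exact damage formula: base_dmg = atk - 1
    return atk - 1

def max_hp(hp: int) -> int:
    # T002 exact HP formula: max_hp = 50 + hp * 10
    return 50 + hp * 10

# A build expressed in the kind of structured form T002 asks for
# (illustrative field names, not the benchmark's actual JSON schema):
build = {"atk": 5, "hp": 3}
print(base_damage(build["atk"]), max_hp(build["hp"]))  # 4 80
```

Under these formulas every point of ATK translates one-for-one into damage, and every point of HP is worth 10 effective hit points on a base of 50, which is exactly the trade-off T001 agents had to guess at from qualitative wording.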
Agents receive information about the current meta (popular builds, dominant strategies) and must adapt. Tests whether LLMs can reason about population-level dynamics and find counter-strategies.
PLANNED. Agents can call the simulator as a tool to test builds before committing. Tests whether LLMs can effectively use computational tools for hypothesis-driven optimization.
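The simulator-as-tool loop can be sketched as below. Everything here is hypothetical: `simulate` is a stand-in scoring function, not the benchmark's engine, and the probe-then-commit strategy is one possible agent policy, not a prescribed one:

```python
def simulate(build_a: dict, build_b: dict) -> bool:
    # Stand-in for the simulator tool (toy scoring, not the real engine):
    # True if build_a beats build_b.
    score = lambda b: 2 * b["atk"] + b["hp"]
    return score(build_a) >= score(build_b)

def pick_build(candidates: list[dict], opponent: dict) -> dict:
    # Hypothesis-driven optimization: probe the simulator with each
    # candidate build and commit to the first one that wins.
    for build in candidates:
        if simulate(build, opponent):
            return build
    return candidates[0]  # fall back if no candidate wins

chosen = pick_build([{"atk": 1, "hp": 1}, {"atk": 5, "hp": 5}],
                    opponent={"atk": 3, "hp": 3})
```

The point of the track is the loop structure, not the scoring: the agent forms a hypothesis (a candidate build), tests it cheaply against the tool, and only then commits.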
PLANNED. Which T002 components matter most? Five variants isolate individual contributions.
| Variant | Description | Components Included |
|---|---|---|
| formulas-only | Add exact formulas to T001 | +formulas |
| meta-only | Add meta-info to T001 | +meta |
| adaptation-only | Add between-game adaptation | +adapt |
| structured-output-only | Add JSON output format | +json |
| formulas-no-meta | T002 minus meta-info | +formulas +json +adapt |
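The ablation variants in the table above can be written down as component sets, which makes the design explicit: each variant is either T001 plus one component, or T002 minus one. The component names are shorthand for this sketch, not benchmark identifiers:

```python
# Full T002 component set (shorthand names for this sketch).
T002_COMPONENTS = {"formulas", "meta", "json", "adapt"}

# The five ablation variants from the table.
VARIANTS = {
    "formulas-only": {"formulas"},
    "meta-only": {"meta"},
    "adaptation-only": {"adapt"},
    "structured-output-only": {"json"},
    "formulas-no-meta": T002_COMPONENTS - {"meta"},
}
```

Comparing each single-component variant against T001, and `formulas-no-meta` against full T002, attributes the 37.5% to 89.75% win-rate gap to individual components.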