Benchmark vs Season 1 Comparison

Same models, expanded roster, dramatically different rankings.

The original benchmark (T003) tested 15 agents with 6 animals in a best-of-5 format. Season 1 expanded to 14 animals, moved to a best-of-7 adaptive format, and introduced new agents. Seven agents have a rank in both events, and their rank changes show how each one's standing shifted when complexity scaled up. The other seven Season 1 agents have no benchmark rank and are treated as new.

Biggest climber: GPT-5.4 jumped from rank 14 to rank 2 — a +12 position leap. The model that barely survived the benchmark became the runner-up in Season 1, suggesting its reasoning scales better with richer game mechanics.
Biggest fall: SmartAgent dropped from rank 2 to rank 12 — a −10 position slide. As the only heuristic agent ranked highly in the benchmark, its fixed strategy couldn't adapt to Season 1's expanded animal roster and longer series format.
Legend: ↑ rank improved (lower number = better), ↓ rank dropped, N/A = new in S1 (no benchmark rank)
Agent              Type      T003 Rank  S1 Rank  Change
gpt-5.2            LLM       4          1        ↑3
gpt-5.4            LLM       14         2        ↑12
gpt-5.2-codex      LLM       1          3        ↓2
gpt-5.3-codex      LLM       N/A        4        new
ConservativeAgent  Baseline  N/A        5        new
claude-opus        LLM       3          6        ↓3
grok               LLM       N/A        7        new
HighVarianceAgent  Baseline  N/A        8        new
claude-haiku       LLM       N/A        9        new
claude-sonnet      LLM       6          10       ↓4
gemini-flash       LLM       5          11       ↓6
SmartAgent         Baseline  2          12       ↓10
GreedyAgent        Baseline  N/A        13       new
RandomAgent        Baseline  N/A        14       new
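
The Change column is simple arithmetic: benchmark rank minus Season 1 rank, so a positive number means the agent climbed. The sketch below recomputes it from the ranks in the table above. The names RANKS and rank_changes are illustrative, not part of any published tooling for the benchmark.

```python
# Ranks copied from the comparison table as (T003 rank, S1 rank).
# None marks agents with no T003 rank (new in Season 1).
RANKS = {
    "gpt-5.2": (4, 1),
    "gpt-5.4": (14, 2),
    "gpt-5.2-codex": (1, 3),
    "gpt-5.3-codex": (None, 4),
    "ConservativeAgent": (None, 5),
    "claude-opus": (3, 6),
    "grok": (None, 7),
    "HighVarianceAgent": (None, 8),
    "claude-haiku": (None, 9),
    "claude-sonnet": (6, 10),
    "gemini-flash": (5, 11),
    "SmartAgent": (2, 12),
    "GreedyAgent": (None, 13),
    "RandomAgent": (None, 14),
}


def rank_changes(ranks):
    """Return {agent: positions gained} for agents ranked in both events.

    A positive value means the agent moved up (its rank number got smaller).
    """
    return {
        agent: t003 - s1
        for agent, (t003, s1) in ranks.items()
        if t003 is not None
    }


changes = rank_changes(RANKS)
biggest_climber = max(changes, key=changes.get)
biggest_faller = min(changes, key=changes.get)

print(biggest_climber, changes[biggest_climber])  # gpt-5.4 12
print(biggest_faller, changes[biggest_faller])    # SmartAgent -10
```

Agents without a benchmark rank are skipped rather than assigned a placeholder, so they never distort the climber and faller calculations.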

Key Takeaways

  • Complexity rewards reasoning. GPT-5.4 (+12) and GPT-5.2 (+3) were the only returning agents to climb, suggesting their reasoning scales with game complexity. More animals and longer series favored models that could adapt.
  • Heuristics break under expansion. SmartAgent's fixed strategy, dominant in the simpler benchmark, collapsed when the animal roster more than doubled (6 to 14). Hard-coded logic tuned for the smaller game did not generalize.
  • The top is stable-ish. GPT-5.2-Codex stayed near the top (1→3), suggesting that strong benchmark performance carries over but does not guarantee pole position.
  • Claude models shifted down. Both Opus (−3) and Sonnet (−4) dropped, possibly indicating less adaptation to the expanded mechanics relative to GPT variants.
  • Baselines anchor the bottom. GreedyAgent and RandomAgent landed at the bottom of Season 1, providing consistent reference points.

Format Differences

Setting        Benchmark (T003)  Season 1
Animals        6                 14
Series format  Best-of-5         Best-of-7 (adaptive)
Total agents   15                14
WIL regen      No                Yes
Scoring        Bradley-Terry     Bradley-Terry
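
Both events are scored with Bradley-Terry, a model that gives each agent a positive strength p and sets the probability that agent i beats agent j to p_i / (p_i + p_j). The source does not say how the strengths were fitted, so the following is only a minimal sketch using the standard iterative (minorization-maximization) update. The win counts and agent names are made up for illustration and are not benchmark data.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times agent i beat agent j.
    Assumes every agent has at least one win and one loss, so the
    estimates stay finite and strictly positive.
    """
    agents = list(wins)
    strength = {a: 1.0 for a in agents}
    for _ in range(iterations):
        updated = {}
        for i in agents:
            total_wins = sum(wins[i].get(j, 0) for j in agents if j != i)
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0))
                / (strength[i] + strength[j])
                for j in agents
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        # Normalize so the strengths sum to 1 (the model is scale-invariant).
        norm = sum(updated.values())
        strength = {a: s / norm for a, s in updated.items()}
    return strength


# Hypothetical head-to-head results, NOT taken from T003 or Season 1.
example_wins = {
    "agent_a": {"agent_b": 3, "agent_c": 4},
    "agent_b": {"agent_a": 1, "agent_c": 3},
    "agent_c": {"agent_a": 0, "agent_b": 1},
}

strengths = fit_bradley_terry(example_wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Sorting agents by fitted strength yields a ranking of the kind shown in the comparison table. Because both events use the same scoring model, rank differences between them reflect the changed format and roster rather than a change in how results were scored.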