Same models, expanded roster, dramatically different rankings.
The original benchmark (T003) tested 15 agents on 6 animals in a best-of-5 format. Season 1 expanded to 14 animals, switched to a best-of-7 adaptive format, and introduced new agents. Seven agents have a rank in both columns of the table below; their rank changes reveal how well each model adapts when complexity scales up. The other seven have no T003 rank listed (N/A).

| Agent | T003 Rank | S1 Rank | Change |
|---|---|---|---|
| gpt-5.2 LLM | 4 | 1 | ↑3 |
| gpt-5.4 LLM | 14 | 2 | ↑12 |
| gpt-5.2-codex LLM | 1 | 3 | ↓2 |
| gpt-5.3-codex LLM | N/A | 4 | — |
| ConservativeAgent Baseline | N/A | 5 | — |
| claude-opus LLM | 3 | 6 | ↓3 |
| grok LLM | N/A | 7 | — |
| HighVarianceAgent Baseline | N/A | 8 | — |
| claude-haiku LLM | N/A | 9 | — |
| claude-sonnet LLM | 6 | 10 | ↓4 |
| gemini-flash LLM | 5 | 11 | ↓6 |
| SmartAgent Baseline | 2 | 12 | ↓10 |
| GreedyAgent Baseline | N/A | 13 | — |
| RandomAgent Baseline | N/A | 14 | — |
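
The Change column follows from the two rank columns: T003 rank minus Season 1 rank, so a positive value means the agent climbed. A minimal Python sketch of that arithmetic, using a subset of the ranks copied from the table; the names `RANKS`, `rank_change`, and `format_change` are illustrative and not part of the benchmark's own code:

```python
# Ranks copied from the table above as (T003 rank, Season 1 rank).
# None marks an agent with no T003 rank listed (N/A).
RANKS = {
    "gpt-5.2": (4, 1),
    "gpt-5.4": (14, 2),
    "gpt-5.2-codex": (1, 3),
    "gpt-5.3-codex": (None, 4),
    "claude-opus": (3, 6),
    "claude-sonnet": (6, 10),
    "gemini-flash": (5, 11),
    "SmartAgent": (2, 12),
}


def rank_change(t003_rank, s1_rank):
    """Return places gained (positive) or lost (negative); None if unranked in T003."""
    if t003_rank is None:
        return None
    # Lower rank numbers are better, so moving from 4 to 1 is a gain of 3.
    return t003_rank - s1_rank


def format_change(delta):
    """Render a change the way the table does, with N/A for missing T003 ranks."""
    if delta is None:
        return "N/A"
    if delta > 0:
        return f"↑{delta}"
    if delta < 0:
        return f"↓{-delta}"
    return "="


changes = {agent: rank_change(*pair) for agent, pair in RANKS.items()}

# Only agents ranked in both runs can be compared.
returning = {agent: d for agent, d in changes.items() if d is not None}
biggest_climb = max(returning, key=returning.get)
biggest_drop = min(returning, key=returning.get)

for agent, delta in changes.items():
    print(f"{agent:15s} {format_change(delta)}")
```

Sorting `returning` by value gives the movers in order, which is a quick way to check the Change column against the two rank columns.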

| Setting | Benchmark (T003) | Season 1 |
|---|---|---|
| Animals | 6 | 14 |
| Series format | Best-of-5 | Best-of-7 (adaptive) |
| Total agents | 15 | 14 |
| WIL regen | No | Yes |
| Scoring | Bradley-Terry | Bradley-Terry |
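
Both runs score agents with a Bradley-Terry model, which assigns each agent a positive strength so that the probability of A beating B is `strength[A] / (strength[A] + strength[B])`. The source does not say how the model is fitted, so the following is only a sketch under assumptions: it uses the standard iterative (minorization-maximization) update, and the agents `A`, `B`, `C` and their win counts are made up for illustration, not benchmark data.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[a][b] is the number of series a won against b. Every agent must
    appear as a top-level key and have at least one win; otherwise its
    strength collapses to zero in this simple version.
    """
    agents = sorted(wins)
    strength = {a: 1.0 for a in agents}
    for _ in range(iterations):
        updated = {}
        for a in agents:
            total_wins = sum(wins[a].get(b, 0) for b in agents if b != a)
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[a].get(b, 0) + wins[b].get(a, 0)) / (strength[a] + strength[b])
                for b in agents
                if b != a
            )
            updated[a] = total_wins / denom if denom > 0 else strength[a]
        # Strengths are only defined up to scale, so normalize to sum to 1.
        norm = sum(updated.values())
        strength = {a: s / norm for a, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Model probability that a beats b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical round-robin results (not taken from T003 or Season 1).
toy_wins = {
    "A": {"B": 7, "C": 8},
    "B": {"A": 3, "C": 6},
    "C": {"A": 2, "B": 4},
}

strength = fit_bradley_terry(toy_wins)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
print(round(win_probability(strength, "A", "B"), 3))
```

Because both runs use the same scoring model, differences in rank come from the match results themselves rather than from a change in how results are scored.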