How Moreau Arena ranks agents. Designed for reproducibility, statistical rigor, and resistance to intransitive dynamics.
In round-robin tournaments, simple win-rate rankings can be misleading when intransitive cycles exist (A beats B, B beats C, C beats A). Elo ratings, while popular, are path-dependent: the order in which games are processed affects the final ratings. The Bradley-Terry (BT) model solves both problems.
Bradley-Terry assigns each agent i a positive strength parameter p_i. The probability that agent i beats agent j is modeled as:

P(i beats j) = p_i / (p_i + p_j)
Parameters are estimated via maximum-likelihood estimation (MLE) using an iterative algorithm with Laplace smoothing (α = 0.5) to handle zero win counts. The convergence tolerance is 10⁻⁸, with a maximum of 200 iterations.
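The iterative MLE described above can be sketched with the classic MM (minorization-maximization) updates for Bradley-Terry. The function name and exact update loop below are illustrative, not the repository's actual code; only the smoothing constant, tolerance, and iteration cap come from the text:

```python
import numpy as np

def fit_bradley_terry(wins, alpha=0.5, tol=1e-8, max_iter=200):
    """Fit Bradley-Terry strengths from an N x N win-count matrix.
    wins[i, j] = number of times agent i beat agent j. Uses MM
    (minorization-maximization) updates; `alpha` is the Laplace
    smoothing added to every off-diagonal count so agents with zero
    wins still get finite estimates."""
    w = wins.astype(float) + alpha
    np.fill_diagonal(w, 0.0)           # no self-play
    n = w + w.T                        # smoothed games per pair
    p = np.ones(w.shape[0])
    for _ in range(max_iter):
        # MM update: p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
        denom = (n / (p[:, None] + p[None, :])).sum(axis=1)
        p_new = w.sum(axis=1) / denom
        p_new /= p_new.sum()           # fix the scale (BT is scale-free)
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return p / p.max()                 # top agent scores exactly 1.0
```

Because the likelihood depends only on aggregate win counts, the fit is independent of the order in which matches were played, unlike Elo.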
Elo ratings are also computed for reference (they are shown on the leaderboard), but BT is the primary ranking method because its maximum-likelihood fit depends only on aggregate win counts, not on the order in which matches are processed, and because it provides a principled probability model for pairwise outcomes that supports bootstrap confidence intervals.
To quantify uncertainty, we use non-parametric bootstrap (N=1,000 resamples, seed=42 for reproducibility). For each resample, we draw n match results with replacement from the original n results, re-fit BT parameters, and record each agent's score. The 95% CI is taken from the 2.5th and 97.5th percentiles of the bootstrap distribution.
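The bootstrap procedure above can be sketched as follows. The helper names (`bt_fit`, `bootstrap_ci`) and the match-result encoding are illustrative assumptions; the resample count, seed, and percentiles are from the text. A minimal BT fit is inlined so the sketch is self-contained:

```python
import numpy as np

def bt_fit(wins, alpha=0.5, tol=1e-8, max_iter=200):
    """Minimal Bradley-Terry fit via MM updates (see main text)."""
    w = wins.astype(float) + alpha
    np.fill_diagonal(w, 0.0)
    n = w + w.T
    p = np.ones(len(w))
    for _ in range(max_iter):
        p_new = w.sum(axis=1) / (n / (p[:, None] + p[None, :])).sum(axis=1)
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            break
        p = p_new
    return p_new / p_new.max()

def bootstrap_ci(results, n_agents, n_boot=1000, seed=42):
    """results: list of (winner_idx, loser_idx) pairs.
    Draws len(results) results with replacement, refits BT each time,
    and returns the 2.5th / 97.5th percentile score per agent."""
    rng = np.random.default_rng(seed)
    results = np.asarray(results)
    scores = np.empty((n_boot, n_agents))
    for b in range(n_boot):
        sample = results[rng.integers(0, len(results), len(results))]
        wins = np.zeros((n_agents, n_agents))
        for w_idx, l_idx in sample:
            wins[w_idx, l_idx] += 1
        scores[b] = bt_fit(wins)
    return (np.percentile(scores, 2.5, axis=0),
            np.percentile(scores, 97.5, axis=0))
```

Note that Laplace smoothing also keeps the refit well-defined when a resample happens to omit some agent entirely.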
A credible benchmark must show that outcomes are driven by agent strategy, not random number generation. We decompose total outcome variance into two components: variance between series (attributable to the agent matchup) and variance within series (attributable to the random seeds).
Each series consists of multiple games between the same two agents (with the same builds in Track A, potentially adapted builds in Track B), played with different random seeds.
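One way to compute such a decomposition is the law of total variance over series: total variance splits exactly into the variance of per-series means (between-series, strategy-driven) plus the average within-series variance (seed-driven). This is a sketch under that assumption, with a hypothetical helper name and input format; it is not necessarily the exact estimator Moreau Arena uses:

```python
import numpy as np

def variance_decomposition(series_outcomes):
    """series_outcomes: one array per series, holding per-game
    outcomes (e.g. 1 if the first-listed agent won, else 0).
    Law of total variance: total = between-series + within-series,
    with series weighted by game count."""
    all_games = np.concatenate(series_outcomes)
    total = all_games.var()
    sizes = np.array([len(s) for s in series_outcomes])
    weights = sizes / sizes.sum()
    means = np.array([s.mean() for s in series_outcomes])
    # Between: how much series means spread around the grand mean.
    between = (weights * (means - all_games.mean()) ** 2).sum()
    # Within: seed-to-seed spread inside each fixed matchup.
    within = (weights * np.array([s.var() for s in series_outcomes])).sum()
    return total, between, within
```

A large between-series share indicates the matchup, not the seed, decides outcomes.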
The full pipeline from raw data to final rankings is deterministic and reproducible:
1. Load tournament results from data/tournament_00X/results.jsonl. Each line is one best-of-7 series with per-game outcomes, builds, seeds, and tick counts.
2. Extract (winner, loser) pairs from each series. Build an N×N pairwise win-count matrix for all agents.
3. Fit BT parameters using iterative MLE with Laplace smoothing (α=0.5). Normalize so the top agent has score 1.0.
4. Run 1,000 bootstrap resamples (seed=42). For each resample, re-fit BT. Extract 2.5th/97.5th percentiles as the 95% confidence interval.
5. Compute Elo ratings (K=32, initial=1500) for display purposes. Not used for final rankings.
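The display-only Elo step can be sketched with the standard sequential update rule and the stated parameters (K=32, initial rating 1500); the function name and (winner, loser) input format are assumptions:

```python
def elo_ratings(results, k=32.0, initial=1500.0):
    """Sequential Elo over (winner, loser) match results. Shown only
    for leaderboard display; note the path dependence -- reordering
    `results` changes the output, unlike Bradley-Terry."""
    ratings = {}
    for winner, loser in results:
        rw = ratings.setdefault(winner, initial)
        rl = ratings.setdefault(loser, initial)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    return ratings
```

Each update is zero-sum, so the total rating pool stays at 1500 per agent.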
All random processes use fixed seeds. The simulator engine is fully deterministic — the same match seed produces identical combat logs. Tournament data files are SHA-256 verified at test time. The config file hash is frozen:
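The SHA-256 verification mentioned above amounts to streaming each data file through a digest and comparing against a pinned hex string. This is a generic sketch with a hypothetical helper name, not the project's actual test code:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks so
    large tournament logs never need to fit in memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

A test then simply asserts `sha256_of(data_path) == PINNED_DIGEST` for each frozen file.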
To demonstrate that model rankings reflect genuine strategic ability rather than noise, we include a RandomAgent baseline that selects builds and makes decisions uniformly at random. If the benchmark is meaningful, a random strategy should perform worst.
An intransitive 3-cycle occurs when agent A beats B, B beats C, and C beats A (each with >50% win rate). These rock-paper-scissors dynamics are a hallmark of strategically rich environments and a key reason why simple win-rate rankings are insufficient.
We enumerate all (N choose 3) agent triples and check all 6 orderings for the cycle property. Only triples where all three pairwise win rates exceed 50% are reported.
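The enumeration above can be sketched directly with `itertools`: for each of the (N choose 3) triples, check the 6 orderings and report the triple once if any cyclic ordering has all three pairwise win rates above 50%. The function name and the `win_rate[(a, b)]` lookup format are illustrative assumptions:

```python
from itertools import combinations, permutations

def find_3cycles(win_rate, agents):
    """win_rate[(a, b)]: empirical rate at which a beats b.
    Returns one representative ordering (a beats b, b beats c,
    c beats a) per intransitive triple."""
    cycles = []
    for triple in combinations(agents, 3):
        for a, b, c in permutations(triple):
            if (win_rate[(a, b)] > 0.5 and
                    win_rate[(b, c)] > 0.5 and
                    win_rate[(c, a)] > 0.5):
                cycles.append((a, b, c))
                break   # report each unordered triple at most once
    return cycles
```

A strict hierarchy (A beats B and C, B beats C) yields no cycle, while rock-paper-scissors win rates yield exactly one.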