
Elo-MMR: A Rating System for Massive Multiplayer Competitions

Aram Ebtekar
Vancouver, BC, Canada
[email protected]

Paul Liu
Stanford University
Stanford, CA, USA
[email protected]

ABSTRACT

Skill estimation mechanisms, colloquially known as rating systems, play an important role in competitive sports and games. They provide a measure of player skill, which incentivizes competitive performances and enables balanced match-ups. In this paper, we present a novel Bayesian rating system for contests with many participants. It is widely applicable to competition formats with discrete ranked matches, such as online programming competitions, obstacle course races, and video games. The system's simplicity allows us to prove theoretical bounds on its robustness and runtime. In addition, we show that it is incentive-compatible: a player who seeks to maximize their rating will never want to underperform. Experimentally, the rating system surpasses existing systems in prediction accuracy, and computes faster than existing systems by up to an order of magnitude.

CCS CONCEPTS

• Information systems → Learning to rank; • Computing methodologies → Learning in probabilistic graphical models.

KEYWORDS

rating system, skill estimation, mechanism design, competition, bayesian inference, robust, incentive-compatible, elo, glicko

ACM Reference Format:
Aram Ebtekar and Paul Liu. 2021. Elo-MMR: A Rating System for Massive Multiplayer Competitions. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3450091

1 INTRODUCTION

Competitions, in the form of sports, games, and examinations, have been with us since antiquity. Many competitions grade performances along a numerical scale, such as a score on a test or a completion time in a race. In the case of a college admissions exam or a track race, scores are standardized so that a given score on two different occasions carries the same meaning. However, in events that feature novelty, subjectivity, or close interaction, standardization is difficult. The Spartan Races, completed by millions of runners, feature a variety of obstacles placed on hiking trails around the world [10]. Rock climbing, a sport to be added to the 2020 Olympics, likewise has routes set specifically for each competition. DanceSport, gymnastics, and figure skating competitions have a panel of judges who rank contestants against one another; these subjective scores are known to be noisy [31]. In all these cases, scores can only be used to compare and rank participants at the same event. Players, spectators, and contest organizers who are interested in comparing players' skill levels across different competitions will need to aggregate the entire history of such rankings. A strong player, then, is one who consistently wins against weaker players. To quantify skill, we need a rating system.

Good rating systems are difficult to create, as they must balance several mutually constraining objectives. First and foremost, rating systems must be accurate, in that ratings provide useful predictors of contest outcomes. Second, the ratings must be efficient to compute: within video game applications, rating systems are predominantly used for matchmaking in massively multiplayer online games (such as Halo, CounterStrike, League of Legends, etc.) [24, 28, 35]. These games have hundreds of millions of players playing tens of millions of games per day, necessitating certain latency and memory requirements for the rating system [11]. Third, rating systems must be incentive-compatible: a player's rating should never increase had they scored worse, and never decrease had they scored better. This is to prevent players from regretting a win, or from throwing matches to game the system. Rating systems that can be gamed often create disastrous consequences for the playerbase, potentially leading to the loss of players [3]. Finally, the ratings provided by the system must be human-interpretable: ratings are typically represented to players as a single number encapsulating their overall skill, and many players want to understand and predict how their performances affect their rating [20].

Classically, rating systems were designed for two-player games. The famous Elo system [17], as well as its Bayesian successors Glicko and Glicko-2, have been widely applied to games such as Chess and Go [20–22]. Both Glicko versions model each player's skill as a real random variable that evolves with time according to Brownian motion. Inference is done by entering these variables into the Bradley-Terry model [13], which predicts probabilities of game outcomes. Glicko-2 refines the Glicko system by adding a rating volatility parameter. Unfortunately, Glicko-2 is known to be flawed in practice, potentially incentivizing players to lose in what's known as "volatility farming". In some cases, these attacks can inflate a user's rating several hundred points above its natural value, producing ratings that are essentially impossible to beat via honest play. This was most notably exploited in the popular game of Pokemon Go [3]. See Section 5.1 for a discussion of this issue, as well as an application of this attack to the Topcoder rating system.

The family of Elo-like methods just described only utilize the binary outcome of a match. In settings where a scoring system provides a more fine-grained measure of match performance, Kovalchik [26] has shown variants of Elo that are able to take advantage of score information. For competitions consisting of several set tasks, such as academic olympiads, Forišek [18] developed a model in which each task gives a different "response" to the player: the total response then predicts match outcomes. However, such systems are often highly application-dependent and hard to calibrate.

Though Elo-like systems are widely used in two-player settings, one needn't look far to find competitions that involve many more than two players. In response to the popularity of team-based games such as CounterStrike and Halo, many recent works focus on competitions between two teams [14, 23, 25, 27]. Another popular setting is many-player contests such as academic olympiads: notably, programming contest platforms such as Codeforces, Topcoder, and others [6, 8, 9]. As with the aforementioned Spartan races, a typical event attracts thousands of contestants. Programming contest platforms have seen exponential growth over the past decade, collectively boasting millions of users [5]. As an example, Codeforces gained over 200K new users in 2019 alone [2].

In "free-for-all" settings, where $N$ players are ranked individually, Bayesian Approximation Ranking (BAR) [33] models the competition as a series of $\binom{N}{2}$ independent two-player contests. In reality, of course, the pairwise match outcomes are far from independent. Thus, TrueSkill [24] and its variants [16, 28, 30] model a player's performance during each contest as a single random variable. The overall rankings are assumed to reveal the total order among these hidden performance variables, with various methods used to model ties and teams. For a textbook treatment of these methods, see [34]. These rating systems are efficient in practice, successfully rating userbases that number well into the millions (the Halo series, for example, has over 60 million sales since 2001 [4]).

The main disadvantage of TrueSkill is its complexity: originally developed by Microsoft for the popular Halo video game, TrueSkill performs approximate belief propagation, which consists of message passing on a factor graph, iterated until convergence. Aside from being less human-interpretable, this complexity means that, to our knowledge, there are no proofs of key properties such as runtime and incentive-compatibility. Even when these properties are discussed [28], no rigorous justification is provided. In addition, we are not aware of any work that extends TrueSkill to non-Gaussian performance models, which might be desirable to limit the influence of outlier performances (see Section 5.2).

It might be for these reasons that popular platforms such as Codeforces and Topcoder opted for their own custom rating systems. These systems are not published in academia and do not come with Bayesian justifications. However, they retain the formulaic simplicity of Elo and Glicko, extending them to settings with many more than two players. The Codeforces system includes ad hoc heuristics to distinguish top players, while curbing rampant inflation. Topcoder's formulas are more principled from a statistical perspective; however, it has a volatility parameter similar to Glicko-2, and hence suffers from similar exploits [18]. Despite their flaws, these systems have been in place for over a decade, and have more recently gained adoption by additional platforms such as CodeChef and LeetCode [1, 7].

Our contributions. In this paper, we describe the Elo-MMR rating system, obtained by a principled approximation of a Bayesian model similar to Glicko and TrueSkill. It is fast, embarrassingly parallel, and makes accurate predictions. Most interesting of all, its simplicity allows us to rigorously analyze its properties: the "MMR" in the name stands for "Massive", "Monotonic", and "Robust". "Massive" means that it supports any number of players with a runtime that scales linearly; "monotonic" is a synonym for incentive-compatible, ensuring that a rating-maximizing player always wants to perform well; "robust" means that rating changes are bounded, with the bound being smaller for more consistent players than for volatile players. Robustness turns out to be a natural byproduct of accurately modeling performances with heavy-tailed distributions, such as the logistic. TrueSkill is believed to satisfy the first two properties, albeit without proof, but fails robustness. Codeforces only satisfies incentive-compatibility, and Topcoder only satisfies robustness.

Experimentally, we show that Elo-MMR achieves state-of-the-art performance in terms of both prediction accuracy and runtime on industry datasets. In particular, we process the entire Codeforces database of over 300K rated users and 1000 contests in well under a minute, beating the existing Codeforces system by an order of magnitude while improving upon its accuracy. Furthermore, we show that the well-known Topcoder system is severely vulnerable to volatility farming, whereas Elo-MMR is immune to such attacks.

A difficulty we faced was the scarcity of efficient open-source rating system implementations. In an effort to aid researchers and practitioners alike, we provide open-source implementations of all rating systems, dataset mining, and additional processing used in our experiments at https://github.com/EbTech/Elo-MMR.

Organization. In Section 2, we formalize the details of our Bayesian model. We then show how to estimate player skill under this model in Section 3, and develop some intuitions of the resulting formulas. As a further refinement, Section 4 models skill evolution from players training or atrophying between competitions. This modeling is quite tricky, as we choose to retain players' momentum while preserving incentive-compatibility. While our modeling and derivations occupy multiple sections, the system itself is succinctly presented in Algorithms 1 to 3. In Section 5, we perform a volatility farming attack on the Topcoder system and prove that, in contrast, Elo-MMR satisfies several salient properties, the most critical of which is incentive-compatibility. Finally, in Section 6, we present experimental evaluations, showing improvements over industry standards in both accuracy and speed. An extended version of this paper, with additional proofs and experiments, can be found at https://arxiv.org/abs/2101.00400.

2 A BAYESIAN MODEL FOR MASSIVE COMPETITIONS

We now describe the setting formally, denoting random variables by capital letters. A series of competitive rounds, indexed by $t = 1, 2, 3, \ldots$, takes place sequentially in time. Each round has a set of participating players $\mathcal{P}_t$, which may in general overlap between rounds. A player's skill is likely to change with time, so we represent the skill of player $i$ at time $t$ by a real random variable $S_{i,t}$. In round $t$, each player $i \in \mathcal{P}_t$ competes at some performance level $P_{i,t}$, typically close to their current skill $S_{i,t}$. The deviations $\{P_{i,t} - S_{i,t}\}_{i\in\mathcal{P}_t}$ are assumed to be i.i.d. and independent of $\{S_{i,t}\}_{i\in\mathcal{P}_t}$. Performances are not observed directly; instead, a ranking gives the relative order among all performances $\{P_{i,t}\}_{i\in\mathcal{P}_t}$. In particular, ties are modelled to occur when performances are exactly equal,

a zero-probability event when their distributions are continuous.¹ This ranking constitutes the observational evidence $E_t$ for our Bayesian updates. The rating system seeks to estimate the skill $S_{i,t}$ of every player at the present time $t$, given the historical round rankings $E_{\le t} := \{E_1, \ldots, E_t\}$.

We overload the notation Pr for both probabilities and probability densities: the latter interpretation applies to zero-probability events, such as in $\Pr(S_{i,t} = s)$. We also use colons as shorthand for collections of variables differing only in a subscript: for instance, $P_{:,t} := \{P_{i,t}\}_{i\in\mathcal{P}_t}$. The joint distribution described by our Bayesian model factorizes as follows:

$$\Pr(S_{:,:}, P_{:,:}, E_:) = \prod_i \Pr(S_{i,0}) \prod_{i,t} \Pr(S_{i,t} \mid S_{i,t-1}) \prod_{i,t} \Pr(P_{i,t} \mid S_{i,t}) \prod_t \Pr(E_t \mid P_{:,t}), \qquad (1)$$

where $\Pr(S_{i,0})$ is the initial skill prior, $\Pr(S_{i,t} \mid S_{i,t-1})$ is the skill evolution model (Section 4), $\Pr(P_{i,t} \mid S_{i,t})$ is the performance model, and $\Pr(E_t \mid P_{:,t})$ is the evidence model.

For the first three factors, we will specify log-concave distributions (see Definition 3.1). The evidence model, on the other hand, is a deterministic indicator. It equals one when $E_t$ is consistent with the relative ordering among $\{P_{i,t}\}_{i\in\mathcal{P}_t}$, and zero otherwise.

Finally, our model assumes that the number of participants $|\mathcal{P}_t|$ is large. The main idea behind our algorithm is that, in sufficiently massive competitions, from the evidence $E_t$ we can infer very precise estimates for $\{P_{i,t}\}_{i\in\mathcal{P}_t}$. Hence, we can treat these performances as if they were observed directly.

That is, suppose we have the skill prior at round $t$:

$$\pi_{i,t}(s) := \Pr(S_{i,t} = s \mid P_{i,<t}). \qquad (2)$$

Now, we observe $E_t$. By Equation (1), it is conditionally independent of $S_{i,t}$, given $P_{i,\le t}$. By the law of total probability,

$$\Pr(S_{i,t} = s \mid P_{i,<t}, E_t) = \int \Pr(S_{i,t} = s \mid P_{i,<t}, P_{i,t} = p) \Pr(P_{i,t} = p \mid P_{i,<t}, E_t) \,\mathrm{d}p \to \Pr(S_{i,t} = s \mid P_{i,\le t}) \text{ almost surely as } |\mathcal{P}_t| \to \infty.$$

The integral is intractable in general, since the performance posterior $\Pr(P_{i,t} = p \mid P_{i,<t}, E_t)$ depends not only on player $i$, but also on our belief regarding the skills of all $j \in \mathcal{P}_t$. However, in the limit of infinite participants, Doob's consistency theorem [19] implies that it concentrates at the true value $P_{i,t}$. Since our posteriors are continuous, the convergence holds for all $s$ simultaneously.

Indeed, we don't even need the full evidence $E_t$. Let $E^L_{i,t} = \{j \in \mathcal{P}_t : P_{j,t} > P_{i,t}\}$ be the set of players against whom $i$ lost, and $E^W_{i,t} = \{j \in \mathcal{P}_t : P_{j,t} < P_{i,t}\}$ be the set of players against whom $i$ won. That is, we only see who wins, draws, and loses against $i$. $P_{i,t}$ remains identifiable using only $(E^L_{i,t}, E^W_{i,t})$, which will be more convenient for our purposes.

Passing to the limit in which $P_{i,\le t}$ is identified serves to justify several common simplifications made by total-order rating systems. Firstly, since $P_{i,\le t}$ is a sufficient statistic for predicting $S_{i,t}$, it may be said that $(E^L_{i,\le t}, E^W_{i,\le t})$ are "almost sufficient" for $S_{i,t}$: any additional information, such as from domain-specific scoring systems, becomes redundant for the purposes of skill estimation. Secondly, conditioned on $P_{:,\le t}$, the posterior skills $S_{:,t}$ are independent of one another. As a result, there are no inter-player correlations to model, and a player's posterior is unaffected by rounds in which they are not a participant. Finally, if we've truly identified $P_{i,t}$, then rounds later than $t$ should not prompt revisions in our estimate for $P_{i,t}$. This obviates the need for expensive whole-history update procedures [15, 16],² for the purposes of present skill estimation.

Finally, a word on the rate of convergence. Suppose we want our estimate to be within $\epsilon$ of $P_{i,t}$, with probability at least $1 - \delta$. By asymptotic normality of the posterior [19], it suffices to have $O(\frac{1}{\epsilon^2}\log\frac{1}{\delta})$ participants.

When the initial prior, performance model, and evolution model are all Gaussian, treating $P_{i,t}$ as certain is the only simplifying approximation we will make; that is, in the limit $|\mathcal{P}_t| \to \infty$, our method performs exact inference on Equation (1). In the following sections, we focus some attention on generalizing the performance model to non-Gaussian log-concave families, parametrized by location and scale. We will use the logistic distribution as a running example and see that it induces robustness; however, our framework is agnostic to the specific distributions used.

The prior rating $\mu^\pi_{i,t}$ and posterior rating $\mu_{i,t}$ of player $i$ at round $t$ should be statistics that summarize the player's prior and posterior skill distribution, respectively. We'll use the mode: thus, $\mu_{i,t}$ is the maximum a posteriori (MAP) estimate, obtained by setting $s$ to maximize the posterior $\Pr(S_{i,t} = s \mid P_{i,\le t})$. By Bayes' rule,

$$\mu^\pi_{i,t} := \arg\max_s \pi_{i,t}(s), \qquad \mu_{i,t} := \arg\max_s \pi_{i,t}(s) \Pr(P_{i,t} \mid S_{i,t} = s). \qquad (3)$$

This objective suggests a two-phase algorithm to update each player $i \in \mathcal{P}_t$ in response to the results of round $t$. In phase one, we estimate $P_{i,t}$ from $(E^L_{i,t}, E^W_{i,t})$. By Doob's consistency theorem, our estimate is extremely precise when $|\mathcal{P}_t|$ is large, so we assume it to be exact. In phase two, we update our posterior for $S_{i,t}$ and the rating $\mu_{i,t}$ according to Equation (3).

3 SKILL ESTIMATION IN TWO PHASES

3.1 Performance estimation

In this section, we describe the first phase of Elo-MMR. For notational convenience, we assume all probability expressions to be conditioned on the prior context $P_{i,<t}$, and omit the subscript $t$. Our prior belief on each player's skill $S_i$ implies a prior distribution on $P_i$. Let's denote its probability density function (pdf) by

$$f_i(p) := \Pr(P_i = p) = \int \pi_i(s) \Pr(P_i = p \mid S_i = s) \,\mathrm{d}s, \qquad (4)$$

where $\pi_i(s)$ was defined in Equation (2). Let

$$F_i(p) := \Pr(P_i \le p) = \int_{-\infty}^p f_i(x) \,\mathrm{d}x$$

be the corresponding cumulative distribution function (cdf). For the purpose of analysis, we'll also define the following "loss", "draw",

¹ The relevant limiting procedure is to treat performances within $\epsilon$-width buckets as ties, and letting $\epsilon \to 0$. This technicality appears in the proof of Theorem 3.2.
² As opposed to historical skill estimation, which is concerned with $\Pr(S_{i,t} \mid P_{i,\le t'})$ for $t' > t$. Whole-history methods can take advantage of future information.

and "victory" functions:

$$l_i(p) := \frac{\mathrm{d}}{\mathrm{d}p} \ln(1 - F_i(p)) = \frac{-f_i(p)}{1 - F_i(p)},$$
$$d_i(p) := \frac{\mathrm{d}}{\mathrm{d}p} \ln f_i(p) = \frac{f_i'(p)}{f_i(p)},$$
$$v_i(p) := \frac{\mathrm{d}}{\mathrm{d}p} \ln F_i(p) = \frac{f_i(p)}{F_i(p)}.$$

[Figure 1: $L_2$ versus $L_R$ for typical values (left). Gaussian versus logistic probability density functions (right).]

Evidently, $l_i(p) < 0 < v_i(p)$. Now we define what it means for the deviation $P_i - S_i$ to be log-concave.

Definition 3.1. An absolutely continuous random variable on a convex domain is log-concave if its probability density function $f$ is positive on its domain and satisfies

$$f(\theta x + (1-\theta)y) > f(x)^\theta f(y)^{1-\theta}, \quad \forall \theta \in (0,1),\ x \ne y.$$

Log-concave distributions appear widely, and include the Gaussian and logistic distributions used in Glicko, TrueSkill, and many others. We'll see inductively that our prior $\pi_i$ is log-concave at every round. Since log-concave densities are closed under convolution [12], the independent sum $P_i = S_i + (P_i - S_i)$ is also log-concave. Log-concavity is made very convenient by the following lemma, proved in the extended version of this paper:

Lemma 3.1. If $f_i$ is continuously differentiable and log-concave, then the functions $l_i, d_i, v_i$ are continuous, strictly decreasing, and $l_i(p) < d_i(p) < v_i(p)$ for all $p$.

For the remainder of this section, we fix the analysis with respect to some player $i$. As argued in Section 2, $P_i$ concentrates very narrowly in the posterior. Hence, we can estimate $P_i$ by its MAP, choosing $p$ so as to maximize:

$$\Pr(P_i = p \mid E^L_i, E^W_i) \propto f_i(p) \Pr(E^L_i, E^W_i \mid P_i = p).$$

Define $j \succ i$, $j \prec i$, $j \sim i$ as shorthand for $j \in E^L_i$, $j \in E^W_i$, $j \in \mathcal{P} \setminus (E^L_i \cup E^W_i)$ (that is, $P_j > P_i$, $P_j < P_i$, $P_j = P_i$), respectively. The following theorem yields our MAP estimate:

Theorem 3.2. Suppose that for all $j$, $f_j$ is continuously differentiable and log-concave. Then the unique maximizer of $\Pr(P_i = p \mid E^L_i, E^W_i)$ is given by the unique zero of

$$Q_i(p) := \sum_{j \succ i} l_j(p) + \sum_{j \sim i} d_j(p) + \sum_{j \prec i} v_j(p).$$

The proof appears in the extended version of this paper. Intuitively, we're saying that the performance is the balance point between appropriately weighted wins, draws, and losses. Let's look at two specializations of our general model, to serve as running examples in this paper.

Gaussian performance model. If both $S_j$ and $P_j - S_j$ are assumed to be Gaussian with known means and variances, then their independent sum $P_j$ will also be a known Gaussian. It is analytic and log-concave, so Theorem 3.2 applies. We substitute the well-known Gaussian pdf and cdf for $f_j$ and $F_j$, respectively. A simple binary search, or faster numerical techniques such as the Illinois algorithm or Newton's method, can be employed to solve for the unique zero of $Q_i$.

Logistic performance model. Now we assume the performance deviation $P_j - S_j$ has a logistic distribution with mean 0 and variance $\beta^2$. In general, the rating system administrator is free to set $\beta$ differently for each contest. Since shorter contests tend to be more variable, one reasonable choice might be to make $1/\beta^2$ proportional to the contest duration.

Given the mean and variance of the skill prior, the independent sum $P_j = S_j + (P_j - S_j)$ would have the same mean, and a variance that's increased by $\beta^2$. Unfortunately, we'll see that the logistic performance model implies a form of skill prior from which it's tough to extract a mean and variance. Even if we could, the sum does not yield a simple distribution. For experienced players, we expect $S_j$ to contribute much less variance than $P_j - S_j$; thus, in our heuristic approximation, we take $P_j$ to have the same form of distribution as the latter. That is, we take $P_j$ to be logistic, centered at the prior rating $\mu^\pi_j = \arg\max \pi_j$, with variance $\delta_j^2 = \sigma_j^2 + \beta^2$, where $\sigma_j^2$ will be given by Equation (8). This distribution is analytic and log-concave, so the same methods based on Theorem 3.2 apply. Define the scale parameter $\bar\delta_j := \frac{\sqrt{3}}{\pi}\delta_j$. A logistic distribution with variance $\delta_j^2$ has cdf and pdf:

$$F_j(x) = \frac{1}{1 + e^{-(x-\mu^\pi_j)/\bar\delta_j}} = \frac{1}{2}\left(1 + \tanh\frac{x - \mu^\pi_j}{2\bar\delta_j}\right),$$
$$f_j(x) = \frac{e^{(x-\mu^\pi_j)/\bar\delta_j}}{\bar\delta_j \left(1 + e^{(x-\mu^\pi_j)/\bar\delta_j}\right)^2} = \frac{1}{4\bar\delta_j} \operatorname{sech}^2 \frac{x - \mu^\pi_j}{2\bar\delta_j}.$$

The logistic distribution satisfies two very convenient relations:

$$F_j'(x) = f_j(x) = F_j(x)(1 - F_j(x))/\bar\delta_j,$$
$$f_j'(x) = f_j(x)(1 - 2F_j(x))/\bar\delta_j,$$

from which it follows that

$$d_j(p) = \frac{1 - 2F_j(p)}{\bar\delta_j} = \frac{-F_j(p)}{\bar\delta_j} + \frac{1 - F_j(p)}{\bar\delta_j} = l_j(p) + v_j(p).$$

In other words, a tie counts as the sum of a win and a loss. This can be compared to the approach (used in Elo, Glicko, BAR, Topcoder, and Codeforces) of treating each tie as half a win plus half a loss.³

³ Elo-MMR, too, can be modified to split ties into half win plus half loss. It's easy to check that Lemma 3.1 still holds if $d_j(p)$ is replaced by $w_l l_j(p) + w_v v_j(p)$ for some $w_l, w_v \in [0, 1]$ with $|w_l - w_v| < 1$. In particular, we can set $w_l = w_v = 0.5$. The results in Section 5 won't be altered by this change.
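To make phase one concrete, the following sketch solves $Q_i(p) = 0$ by bisection under the logistic performance model, using the identities above ($l_j = -F_j/\bar\delta_j$, $v_j = (1 - F_j)/\bar\delta_j$, $d_j = l_j + v_j$). It is a minimal illustration under our own naming conventions, not the paper's reference implementation.

```python
import math

def logistic_cdf(x, mu, scale):
    """CDF of a logistic distribution with location mu and scale parameter
    `scale` (the paper's delta-bar). Branch on sign to avoid overflow."""
    z = (x - mu) / scale
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def estimate_performance(mu_pi, scale, losses, ties, wins,
                         lo=-10000.0, hi=10000.0):
    """Phase one of Elo-MMR (a sketch): find the zero of Q_i(p) by bisection.

    mu_pi[j], scale[j]: prior rating and logistic scale of opponent j's
    performance distribution. `losses`, `ties`, `wins` list opponents j with
    P_j > P_i, P_j = P_i, P_j < P_i respectively. By Lemma 3.1, Q_i is
    strictly decreasing, so bisection converges to its unique zero.
    """
    def q(p):
        total = 0.0
        for j in losses:   # players who beat i contribute l_j(p) < 0
            total -= logistic_cdf(p, mu_pi[j], scale[j]) / scale[j]
        for j in ties:     # a tie counts as the sum of a win and a loss
            total += (1.0 - 2.0 * logistic_cdf(p, mu_pi[j], scale[j])) / scale[j]
        for j in wins:     # players whom i beat contribute v_j(p) > 0
            total += (1.0 - logistic_cdf(p, mu_pi[j], scale[j])) / scale[j]
        return total

    for _ in range(100):   # 100 halvings of a width-20000 bracket
        mid = 0.5 * (lo + hi)
        if q(mid) > 0:
            lo = mid       # Q is decreasing, so the zero lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that, under the paper's definitions, $i \in \mathcal{P} \setminus (E^L_i \cup E^W_i)$, so each player ties with at least themselves; this guarantees that $Q_i$ changes sign and the bisection bracket is valid.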

Finally, putting everything together:

$$Q_i(p) = \sum_{j \succeq i} l_j(p) + \sum_{j \preceq i} v_j(p) = \sum_{j \succeq i} \frac{-F_j(p)}{\bar\delta_j} + \sum_{j \preceq i} \frac{1 - F_j(p)}{\bar\delta_j}.$$

Our estimate for $P_i$ is the zero of this expression. The terms on the right correspond to probabilities of winning and losing against each player $j$, weighted by $1/\bar\delta_j$. Accordingly, we can interpret $\sum_{j\in\mathcal{P}} (1 - F_j(p))/\bar\delta_j$ as a weighted expected rank of a player whose performance is $p$. Similar to the performance computations in Codeforces and Topcoder, $P_i$ can thus be viewed as the performance level at which one's expected rank would equal $i$'s actual rank.

3.2 Belief update

Having estimated $P_{i,t}$ in the first phase, the second phase is rather simple. Ignoring normalizing constants, Equation (3) tells us that the pdf of the skill posterior can be obtained as the pointwise product of the pdfs of the skill prior and the performance model. When both factors are differentiable and log-concave, then so is their product. Its maximum is the new rating $\mu_{i,t}$; let's see how to compute it for the same two specializations of our model.

Gaussian performance model. When the skill prior and performance model are Gaussian with known means and variances, multiplying their pdfs yields another known Gaussian. Hence, the posterior is compactly represented by its mean $\mu_{i,t}$, which coincides with the MAP and rating; and its variance $\sigma^2_{i,t}$, which is our uncertainty regarding the player's skill.

Logistic performance model. When the performance model is non-Gaussian, the multiplication does not simplify so easily. By Equation (3), each round contributes an additional factor to the belief distribution. In general, we allow it to consist of a collection of simple log-concave factors, one for each round in which player $i$ has participated. Denote the participation history by

$$\mathcal{H}_{i,t} := \{k \in \{1, \ldots, t\} : i \in \mathcal{P}_k\}.$$

Since each player can be considered in isolation, we'll omit the subscript $i$. Specializing to the logistic setting, each $k \in \mathcal{H}_t$ contributes a logistic factor to the posterior, with mean $p_k$ and variance $\beta_k^2$. We still use a Gaussian initial prior, with mean and variance denoted by $p_0$ and $\beta_0^2$, respectively. Postponing the discussion of skill evolution to Section 4, for the moment we assume that $S_k = S_0$ for all $k$. The posterior pdf, up to normalization, is then

$$\pi_0(s) \prod_{k\in\mathcal{H}_t} \Pr(P_k = p_k \mid S_k = s) \propto \exp\left(-\frac{(s - p_0)^2}{2\beta_0^2}\right) \prod_{k\in\mathcal{H}_t} \operatorname{sech}^2\left(\frac{\pi}{\sqrt{12}} \cdot \frac{s - p_k}{\beta_k}\right). \qquad (5)$$

Maximizing the posterior density amounts to minimizing its negative logarithm. Up to a constant offset, this is given by

$$L(s) := L_2\left(\frac{s - p_0}{\beta_0}\right) + \sum_{k\in\mathcal{H}_t} L_R\left(\frac{s - p_k}{\beta_k}\right), \quad \text{where } L_2(x) := \frac{1}{2}x^2 \text{ and } L_R(x) := 2\ln\cosh\frac{\pi x}{\sqrt{12}}.$$

Thus,

$$L'(s) = \frac{s - p_0}{\beta_0^2} + \sum_{k\in\mathcal{H}_t} \frac{\pi}{\beta_k\sqrt{3}} \tanh\frac{(s - p_k)\pi}{\beta_k\sqrt{12}}. \qquad (6)$$

$L'$ is continuous and strictly increasing in $s$, so its zero is unique: it is the MAP $\mu_t$. Similar to what we did in the first phase, we can solve for $\mu_t$ with binary search or other root-solving methods.

We pause to make an important observation. From Equation (6), the rating carries a rather intuitive interpretation: Gaussian factors in $L$ become $L_2$ penalty terms, whereas logistic factors take on a more interesting form as $L_R$ terms. From Figure 1, we see that the $L_R$ term behaves quadratically near the origin, but linearly at the extremities, effectively interpolating between $L_2$ and $L_1$ over a scale of magnitude $\beta_k$.

It is well-known that minimizing a sum of $L_2$ terms pushes the argument towards a weighted mean, while minimizing a sum of $L_1$ terms pushes the argument towards a weighted median. With $L_R$ terms, the net effect is that $\mu_t$ acts like a robust average of the historical performances $p_k$. Specifically, one can check that

$$\mu_t = \frac{\sum_k w_k p_k}{\sum_k w_k}, \quad \text{where } w_0 := \frac{1}{\beta_0^2} \text{ and } w_k := \frac{\pi}{(\mu_t - p_k)\beta_k\sqrt{3}} \tanh\frac{(\mu_t - p_k)\pi}{\beta_k\sqrt{12}} \text{ for } k \in \mathcal{H}_t. \qquad (7)$$

$w_k$ is close to $1/\beta_k^2$ for typical performances, but can be up to $\pi^2/6$ times more as $|\mu_t - p_k| \to 0$, or vanish as $|\mu_t - p_k| \to \infty$. This feature is due to the thicker tails of the logistic distribution, as compared to the Gaussian, resulting in an algorithm that resists drastic rating changes in the presence of a few unusually good or bad performances. We'll formally state this robustness property in Theorem 5.7.

Estimating skill uncertainty. While there is no easy way to compute the variance of a posterior in the form of Equation (5), it will be useful to have some estimate $\sigma_t^2$ of uncertainty. There is a simple formula in the case where all factors are Gaussian. Since moment-matched logistic and normal distributions are relatively close (c.f. Figure 1), we apply the same formula:

$$\frac{1}{\sigma_t^2} := \sum_{k\in\{0\}\cup\mathcal{H}_t} \frac{1}{\beta_k^2}. \qquad (8)$$

4 SKILL EVOLUTION OVER TIME

Factors such as training and resting will change a player's skill over time. If we model skill as a static variable, our system will eventually grow so confident in its estimate that it will refuse to admit substantial changes. To remedy this, we introduce a skill evolution model, so that in general $S_t \ne S_{t'}$ for $t \ne t'$. Now, rather than simply being equal to the previous round's posterior, the skill prior at round $t$ is given by

$$\pi_t(s) = \int \Pr(S_t = s \mid S_{t-1} = x) \Pr(S_{t-1} = x \mid P_{<t}) \,\mathrm{d}x. \qquad (9)$$

The factors in the integrand are the skill evolution model and the previous round's posterior, respectively. Following other Bayesian rating systems (e.g., Glicko, Glicko-2, and TrueSkill [21, 22, 24]), we model the skill diffusions $S_t - S_{t-1}$ as independent zero-mean Gaussians. That is, $\Pr(S_t \mid S_{t-1} = x)$ is a Gaussian with mean $x$ and some variance $\gamma_t^2$. The Glicko system sets $\gamma_t^2$ proportionally to the time elapsed since the last update, corresponding to a continuous Brownian motion.
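The belief update of Section 3.2 admits an equally short sketch: the new rating is the unique zero of the strictly increasing $L'(s)$ in Equation (6), and the uncertainty estimate comes from Equation (8). As before, this is our own minimal illustration (names and bracket bounds are arbitrary), not the paper's reference code.

```python
import math

def update_rating(p0, beta0, perfs, betas, lo=-10000.0, hi=10000.0):
    """Phase two of Elo-MMR (a sketch): maximize the posterior of Eq. (5).

    p0, beta0: mean and standard deviation of the Gaussian initial prior.
    perfs[k], betas[k]: estimated performance and performance deviation for
    each round the player entered. Returns (mu, sigma): the zero of L'(s)
    from Equation (6), and the uncertainty of Equation (8).
    """
    def l_prime(s):
        total = (s - p0) / beta0**2
        for p_k, b_k in zip(perfs, betas):
            total += (math.pi / (b_k * math.sqrt(3.0))) * math.tanh(
                (s - p_k) * math.pi / (b_k * math.sqrt(12.0)))
        return total

    for _ in range(100):   # L' is strictly increasing, so bisection applies
        mid = 0.5 * (lo + hi)
        if l_prime(mid) < 0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    sigma = (1.0 / beta0**2 + sum(1.0 / b**2 for b in betas)) ** -0.5
    return mu, sigma
```

The robustness discussed above is visible here: because $\tanh$ saturates, a single wildly high performance shifts $\mu_t$ by at most a bounded amount, no matter how extreme the outlier.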
Codeforces and Topcoder simply set $\gamma_t$ to a constant when a player participates, and zero otherwise, corresponding to changes that are in proportion to how often the player competes. Now we are ready to complete the two specializations of our rating system.

Gaussian performance model. If both the prior and performance distributions at round $t-1$ are Gaussian, then the posterior is also Gaussian. Adding an independent Gaussian diffusion to our posterior on $S_{t-1}$ yields a Gaussian prior on $S_t$. By induction, the skill belief distribution forever remains Gaussian. This Gaussian specialization of the Elo-MMR framework lacks the R for robustness (see Theorem 5.7), so we call it Elo-MM$\chi$.

Logistic performance model. After a player's first contest round, the posterior in Equation (5) becomes non-Gaussian, rendering the integral in Equation (9) intractable. A very simple approach would be to replace the full posterior in Equation (5) by a Gaussian approximation with mean $\mu_t$ (equal to the posterior MAP) and variance $\sigma_t^2$ (given by Equation (8)). As in the previous case, applying diffusions in the Gaussian setting is a simple matter of adding means and variances.

With this approximation, no memory is kept of the individual performances $P_t$. Priors are simply Gaussian, while posterior densities are the product of two factors: the Gaussian prior, and a logistic factor corresponding to the latest performance. To ensure robustness (see Section 5.2), $\mu_t$ is computed as the argmax of this posterior before replacement by its Gaussian approximation. We call the rating system that takes this approach Elo-MMR($\infty$).

As the name implies, it turns out to be a special case of Elo-MMR($\rho$). In the general setting with $\rho \in [0, \infty)$, we keep the full posterior from Equation (5). Since we cannot tractably compute the effect of a Gaussian diffusion, we seek a heuristic derivation of the next round's prior, retaining a form similar to Equation (5) while satisfying many of the same properties as the intended diffusion.

4.1 Desirable properties of a "pseudodiffusion"

We begin by listing some properties that our skill evolution algorithm, henceforth called a "pseudodiffusion", should satisfy. The first two properties are natural:

• Incentive-compatibility. First and foremost, the pseudodiffusion must not break the incentive-compatibility of our rating system. That is, a rating-maximizing player should never be motivated to lose on purpose (Theorem 5.5).
• Rating preservation. The pseudodiffusion must not alter the argmax of the belief density. That is, the rating of a player should not change: $\mu^\pi_t = \mu_{t-1}$.

In addition, we borrow four properties of Gaussian diffusions:

• Correct magnitude. Pseudodiffusion with parameter $\gamma^2$ must increase the skill uncertainty, as measured by Equation (8), by $\gamma^2$.
• Composability. Two pseudodiffusions applied in sequence, first with parameter $\gamma_1^2$ and then with $\gamma_2^2$, must have the same effect as a single pseudodiffusion with parameter $\gamma_1^2 + \gamma_2^2$.
• Zero diffusion. In the limit as $\gamma \to 0$, the pseudodiffusion must have no effect on the belief distribution.
• Zero uncertainty. In the limit as $\sigma_{t-1} \to 0$ (i.e., when the previous rating $\mu_{t-1}$ is a perfect estimate of $S_{t-1}$), our belief on $S_t$ must become Gaussian with mean $\mu_{t-1}$ and variance $\gamma^2$. Finer-grained information regarding the prior history $P_{\le t}$ must be erased.

In particular, Elo-MMR($\infty$) fails the zero diffusion property because it simplifies the belief distribution, even when $\gamma = 0$. In the proof of Theorem 4.1, we'll see that Elo-MMR(0) fails the zero uncertainty property. Thus, it is in fact necessary to have $\rho$ strictly positive and finite. In Section 5.2, we'll come to interpret $\rho$ as a kind of inverse momentum.

4.2 A heuristic pseudodiffusion algorithm

Each factor in the posterior (see Equation (5)) has a parameter $\beta_k$. Define a factor's weight to be $w_k := 1/\beta_k^2$, which by Equation (8) contributes to the total weight $\sum_k w_k = 1/\sigma_t^2$. Here, unlike in Equation (7), $w_k$ does not depend on $|\mu_t - p_k|$.

The approximation step of Elo-MMR($\infty$) replaces all the logistic factors by a single Gaussian whose variance is chosen to ensure that the total weight is preserved. In addition, its mean is chosen to preserve the player's rating, given by the unique zero of Equation (6). Finally, the diffusion step of Elo-MMR($\infty$) increases the Gaussian's variance, and hence the player's skill uncertainty, by $\gamma_t^2$; this corresponds to a decay in the weight.

To generalize the idea, we interleave the two steps in a continuous manner. The approximation step becomes a transfer step: rather than replace the logistic factors outright, we take away the same fraction from each of their weights, and place the sum of removed weights onto a new Gaussian factor. In order for this operation to preserve ratings, the new factor must be centered at $\mu_{t-1}$. Since Gaussian pdfs compose, the prior Gaussian factor can be combined with the new one. The diffusion step becomes a decay step, reducing each factor's weight by the same fraction, chosen such that the overall uncertainty is increased by $\gamma_t^2$.

To make the idea precise, we generalize the posterior from Equation (5) with fractional multiplicities $\omega_k$; the $k$'th factor is raised to the power $\omega_k$. As a result, Equations (6) and (8) become:

$$L'(s) = \frac{\omega_0(s - p_0)}{\beta_0^2} + \sum_{k\in\mathcal{H}_t} \frac{\omega_k \pi}{\beta_k\sqrt{3}} \tanh\frac{(s - p_k)\pi}{\beta_k\sqrt{12}},$$
$$\frac{1}{\sigma_t^2} := \sum_{k\in\{0\}\cup\mathcal{H}_t} w_k, \quad \text{where } w_k := \frac{\omega_k}{\beta_k^2}. \qquad (10)$$

For $\rho \in [0, \infty]$, the Elo-MMR($\rho$) algorithm continuously and simultaneously performs transfer and decay, with transfer proceeding at $\rho$ times the rate of decay. Holding $\beta_k$ fixed, changes to $\omega_k$ can be described in terms of changes to $w_k$:

$$\dot w_0 = -r(t)w_0 + \rho\, r(t) \sum_{k\in\mathcal{H}_t} w_k,$$
$$\dot w_k = -(1 + \rho)\, r(t)\, w_k \quad \text{for } k \in \mathcal{H}_t,$$
• Zero diffusion. In the limit as 훾 → 0, the effect of pseudodiffusion where the arbitrary decay rate 푟 (푡) can be eliminated by a change must vanish, i.e., not alter the belief distribution. of variable d휏 = 푟 (푡)d푡. After some time Δ휏, the total weight will Elo-MMR: A Rating System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

Algorithm 1 Elo-MMR(휌, 훽,훾) • Incentive-compatibility. This property will be stated in Theo- for all rounds 푡 do rem 5.5. To ensure that its proof carries through, the relevant facts to note here are that the pseudodiffusion algorithm ignores the for all players 푖 ∈ P푡 in parallel do if 푖 has never competed before then performances 푝푘 , and centers the transferred Gaussian weight at 휇 , 휎 ← 휇 , 휎 the rating 휇푡−1, which is trivially monotonic in 휇푡−1. 푖 푖 푛푒푤푐표푚푒푟 푛푒푤푐표푚푒푟 ′ 2 • Rating preservation. Recall that the rating is the unique zero of 퐿 푝푖,푤푖 ← [휇푖 ], [1/휎푖 ] diffuse(푖,훾, 휌) in Equation (10). To see that this zero is preserved, note that the q decay and transfer operations multiply 퐿′ by constants (휅 and 휋 ← 2 + 2 푡 휇푖 , 훿푖 휇푖, 휎푖 훽 휌 휅푡 , respectively), before adding the new Gaussian term, whose for all 푖 ∈ P푡 in parallel do contribution to 퐿′ is zero at its center. update(푖, 퐸 , 훽) 푡 • Correct magnitude. Follows from our derivation for 휅푡 . Algorithm 2 diffuse(푖,훾, 휌) • Composability. Follows from correct magnitude and the fact that every pseudodiffusion follows the same differential equations. ← ( + 2/ 2)−1 휅 1 훾 휎푖 • → → < ∞ 휌 휌 Í Zero diffusion. As 훾 0, 휅푡 1. Provided that 휌 , we also 푤퐺,푤퐿 ← 휅 푤푖,0, (1 − 휅 ) 푘 ≥0 푤푖,푘 휌 → ∈ { } ∪ H 푛푒푤 → have 휅푡 1. Hence, for all 푘 0 푡 , 푤푘 푤푘 . 푝푖,0 ← (푤퐺 푝푖,0 + 푤퐿휇푖 )/(푤퐺 + 푤퐿) • Zero uncertainty. As 휎푡−1 → 0, 휅푡 → 0. The total weight decays 푤 ← 휅(푤 + 푤 ) 휌 푖,0 퐺 퐿 from 1/휎2 to 훾2. Provided that 휌 > 0, we also have 휅 → 0, so for all 푘 > 0 do 푡−1 푡 1+휌 these weights transfer in their entirety, leaving behind a Gaussian 푤푖,푘 ← 휅 푤푖,푘 2 √ with mean 휇푡−1, variance 훾 , and no additional history. 
□ 휎푖 ← 휎푖 / 휅 Algorithm 3 update(푖, 퐸, 훽) 5 THEORETICAL PROPERTIES   푥−휇휋   푥−휇휋  In this section, we see how the simplicity of the Elo-MMR formulas Í 1 푗 Í 1 푗 푝 ← zero푥 푗 ⪯푖 tanh − 1 + 푗 ⪰푖 tanh + 1 훿 푗 2훿¯푗 훿 푗 2훿¯푗 enables us to rigorously prove that the rating system is incentive- compatible, robust, and computationally efficient. 푝푖 .push(푝) 2 5.1 Incentive-compatibility 푤푖 .push(1/훽 )  2  Í 푤푖,푘 훽 푥−푝푖,푘 To demonstrate the need for incentive-compatibility, let’s look at 휇푖 ← zero푥 푤푖,0 (푥 − 푝푖,0) + 푘>0 tanh 훽¯ 2훽¯ the consequences of violating this property in the Topcoder and Glicko-2 rating systems. These systems track a “volatility” for each have decayed by a factor 휅 := 푒−Δ휏 , resulting in the new weights: player, which estimates the variance of their performances. A player   Õ whose recent performance history is more consistent would be 푤푛푒푤 = 휅푤 + 휅 − 휅1+휌 푤 , 0 0 푘 assigned a lower volatility score, than one with wild swings in ∈H 푘 푡 performance. The volatility acts as a multiplier on rating changes; 푛푒푤 = 1+휌 ∈ H 푤푘 휅 푤푘 for 푘 푡 . thus, players with an extremely low or high performance will have In order for the uncertainty to increase from 휎2 to 휎2 + 훾2, we their subsequent rating changes amplified. 푡−1 푡−1 푡 While it may seem like a good idea to boost changes for players must solve 휅/휎2 = 1/(휎2 + 훾2) for the decay factor: 푡−1 푡−1 푡 whose ratings are poor predictors of their performance, this fea- 2 !−1 ture has an exploit. By intentionally performing at a weaker level, a 훾푡 휅푡 = 1 + . player can amplify future increases to an extent that more than com- 휎2 푡−1 pensates for the immediate hit to their rating. A player may even Algorithm1 details the full Elo-MMR( 휌) rating system. The main “farm” volatility by alternating between very strong and very weak loop runs whenever a round of competition takes place. First, new performances. After acquiring a sufficiently high volatility score, players are initialized with a Gaussian prior. 
Then, changes in player the strategic player exerts their honest maximum performance over skill are modeled by Algorithm2; note how the Gaussian prior is a series of contests. The amplification eventually results in a rat- updated into a weighted with the newly created factor. ing that exceeds what would have been obtained via honest play. Given the round rankings 퐸푡 , the first phase of Algorithm3 solves This type of exploit was discovered in Glicko-2 as applied to the an equation to estimate 푃푡 . Finally, the second phase solves another Pokemon Go video game [3]. Table 5.3 of [18] presents a milder equation for the rating 휇푡 . violation in Topcoder competitions. The hyperparameters 휌, 훽,훾 are domain-dependent, and can be To get a realistic estimate of the severity of this exploit, we per- set by standard hyperparameter search techniques. For convenience,√ formed a simple experiment on the first five years of the Codeforces ¯ 3 contest dataset (see Section 6.1). In Figure2, we plot the rating evo- we assume 훽 and 훾 are fixed and use the shorthand 훽푘 := 휋 훽푘 . Whereas our exposition used global round indices, here a subscript lution of the world’s #1 ranked competitive programmer, Gennady 푘 corresponds to the 푘’th round in player 푖’s participation history. Korotkevich, better known as tourist. In the control setting, we plot his ratings according to the Topcoder and Elo-MMR(1) systems. Theorem 4.1. Algorithm2 with 휌 ∈ (0, ∞) meets all of the prop- We contrast these against an adversarial setting, in which we have erties listed in Section 4.1. tourist employ the following strategy: for his first 45 contests, Proof. We go through each of the six properties in order. tourist plays normally (exactly as in the unaltered data). For his WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu
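Before turning to the outcome of this experiment, the decay-and-transfer update of Algorithm 2 is simple enough to render directly in code. The sketch below is our own Python transcription (the paper's implementation is in Rust); it maintains a player's factor means in `p` and weights in `w`, with index 0 holding the Gaussian prior factor.

```python
import math

def diffuse(p, w, mu, sigma, gamma, rho):
    """One pseudodiffusion step, following Algorithm 2.

    p[0], w[0]: mean and weight of the Gaussian factor.
    p[1:], w[1:]: means and weights of the logistic factors.
    mu is the current rating; the updated uncertainty is returned.
    Weight transfer proceeds at rho times the rate of decay.
    """
    kappa = 1.0 / (1.0 + gamma**2 / sigma**2)
    kr = kappa**rho
    w_g = kr * w[0]                # weight kept by the Gaussian factor
    w_l = (1.0 - kr) * sum(w)      # weight transferred from all factors
    # Center the merged Gaussian so the rating (argmax) is preserved.
    p[0] = (w_g * p[0] + w_l * mu) / (w_g + w_l)
    w[0] = kappa * (w_g + w_l)
    for k in range(1, len(w)):
        w[k] *= kappa**(1.0 + rho)  # decay each logistic factor
    return sigma / math.sqrt(kappa)
```

A quick sanity check of the "correct magnitude" property: starting from total weight $\sum_k w_k = 1/\sigma^2$, one step leaves the total at $\kappa/\sigma^2 = 1/(\sigma^2 + \gamma^2)$, i.e., the uncertainty grows by exactly $\gamma^2$.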

For his next 45 contests, tourist purposely falls to last place whenever his Topcoder rating is above 2975. Finally, tourist returns to playing normally for an additional 15 contests. This strategy mirrors the Glicko-2 exploit documented in [3], and does not require unrealistic assumptions (e.g., we don't require that tourist exercise very precise control over his performances). Compared to a consistently honest tourist, the volatility-farming tourist ended up 523 rating points ahead by the end of the experiment, with almost 1000 rating points gained in the last 15 contests alone. Transferring the same sequence of performances to the Elo-MMR(1) system, we see that it not only is immune to such volatility-farming attacks, but it also penalizes the dishonest strategy with a rating loss that decays exponentially once honest play resumes.

[Figure 2: Volatility farming attack on the Topcoder system. Two panels plot Rating against Contest #: one compares Topcoder (honest) with Topcoder (adversarial), the other compares Elo-MMR (honest) with Elo-MMR (adversarial).]

Recall that a key purpose of modeling volatility in Topcoder and Glicko-2 was to boost rating changes for inconsistent players. Remarkably, Elo-MMR achieves the same effect: we'll see in Section 5.2 that, for $\rho \in [0, \infty)$, Elo-MMR($\rho$) also boosts changes to inconsistent players. And yet, we'll now prove that no strategic incentive for purposely losing exists in any version of Elo-MMR.

To this end, we need a few lemmas. Recall that, for the purposes of the algorithm, the performance $p_i$ is defined to be the unique zero of the function

$$Q_i(p) := \sum_{j \succ i} l_j(p) + \sum_{j \sim i} d_j(p) + \sum_{j \prec i} v_j(p),$$

whose terms $l_j, d_j, v_j$ are contributed by opponents against whom $i$ lost, drew, or won, respectively. Wins (losses) are always positive (negative) contributions to a player's performance score:

Lemma 5.1. Adding a win term to $Q_i(\cdot)$, or replacing a tie term by a win term, always increases its zero. Conversely, adding a loss term, or replacing a tie term by a loss term, always decreases it.

Proof. By Lemma 3.1, $Q_i(p)$ is decreasing in $p$. Thus, adding a positive term will increase its zero, whereas adding a negative term will decrease it. The desired conclusion follows by noting that, for all $j$ and $p$, $v_j(p)$ and $v_j(p) - d_j(p)$ are positive, whereas $l_j(p)$ and $l_j(p) - d_j(p)$ are negative. □

While not needed for our main result, a similar argument shows that performance scores are monotonic across the round standings:

Theorem 5.2. If $i \succ j$ (that is, player $i$ beats $j$) in a given round, then player $i$ and $j$'s performance estimates satisfy $p_i > p_j$.

Proof. If $i \succ j$ with $i, j$ adjacent in the rankings, then

$$Q_i(p) - Q_j(p) = \sum_{k \sim i} \left( d_k(p) - l_k(p) \right) + \sum_{k \sim j} \left( v_k(p) - d_k(p) \right) > 0$$

for all $p$. Since $Q_i$ and $Q_j$ are decreasing functions, it follows that $p_i > p_j$. By induction, this result extends to the case where $i, j$ are not adjacent in the rankings. □

What matters for incentives is that performance scores be counterfactually monotonic; meaning, if we were to alter the round standings, a strategic player will always prefer to place higher:

Lemma 5.3. In any given round, holding fixed the relative ranking of all players other than $i$ (and holding fixed all preceding rounds), the performance $p_i$ is a monotonic function of player $i$'s prior rating and of player $i$'s rank in this round.

Proof. Monotonicity in the prior rating follows directly from monotonicity of the self-tie term $d_i$ in $Q_i$. Since an upward shift in the rankings can only convert losses to ties to wins, monotonicity in contest rank follows from Lemma 5.1. □

Having established the relationship between round rankings and performance scores, the next step is to prove that, even with hindsight, players will always prefer their performance scores to be as high as possible:

Lemma 5.4. Holding fixed the set of contest rounds in which a player has participated, their current rating is monotonic in each of their past performance scores.

Proof. The player's rating is given by the zero of $L'$ in Equation (10). The pseudodiffusions of Section 4 modify each of the $\beta_k$ in a manner that does not depend on any of the $p_k$, so they are fixed for our purposes. Hence, $L'$ is monotonically increasing in $s$ and decreasing in each of the $p_k$. Therefore, its zero is monotonically increasing in each of the $p_k$.

This is almost what we wanted to prove, except that $p_0$ is not a performance. Nonetheless, it is a function of the performances: specifically, a weighted average of historical ratings which, using this same lemma as an inductive hypothesis, are themselves monotonic in past performances. By induction, the proof is complete. □

Finally, these lemmas will let us conclude that a rating-maximizing player is always motivated to improve their round rankings, or raw scores.
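To make the role of $Q_i$ concrete, here is a small illustrative solver, our own sketch rather than the paper's implementation. It uses simplified stand-in terms $v_j(p) = 1 - \tanh(\cdot)$, $d_j(p) = -\tanh(\cdot)$, and $l_j(p) = -1 - \tanh(\cdot)$, which are respectively positive, bounded, and negative, and all decreasing in $p$ as Lemma 3.1 requires; the exact terms and scalings in Elo-MMR differ.

```python
import math

def performance(mu_i, opponents, delta=80.0, tol=1e-9):
    """Estimate a performance score as the unique zero of a toy Q_i.

    opponents: list of (rating, result) pairs, result in
    {'win', 'tie', 'loss'} from player i's perspective. A self-tie
    term against mu_i anchors the estimate, mirroring the self-tie
    term d_i in the text. delta is a hypothetical scale parameter.
    """
    def q(p):
        total = -math.tanh((p - mu_i) / (2 * delta))  # self-tie d_i
        for mu_j, res in opponents:
            t = math.tanh((p - mu_j) / (2 * delta))
            if res == 'win':    # v_j: positive, decreasing in p
                total += 1 - t
            elif res == 'tie':  # d_j: bounded, decreasing in p
                total += -t
            else:               # l_j: negative, decreasing in p
                total += -1 - t
        return total

    lo, hi = -1e5, 1e5  # q is decreasing, so bisection brackets the zero
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if q(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With these terms, beating a field of opponents pushes the zero above every opponent's rating, while losing to the same field pulls it below, in line with Lemmas 5.1 and 5.3.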

Theorem 5.5 (Incentive-compatibility). Holding fixed the set 5.2 Robust response of contest rounds in which each player has participated, and the Another desirable property in many settings is robustness: a player’s historical ratings and relative rankings of all players other than 푖, rating should not change too much in response to any one contest, player 푖’s current rating is monotonic in each of their past rankings. no matter how extreme their performance. The Codeforces and TrueSkill systems lack this property, allowing for unbounded rating Proof. Choose any contest round in player 푖’s history, and con- changes. Topcoder achieves robustness by clamping any changes sider improving player 푖’s rank in that round while holding every- that exceed a cap, which is initially high for new players but de- thing else fixed. It suffices to show that player 푖’s current rating creases with experience. would necessarily increase as a result. When 휌 > 0, Elo-MMR(휌) achieves robustness in a natural, In the altered round, by Lemma 5.3, 푝푖 is increased; and by smoother manner. To understand how, we look at the interplay Lemma 5.4, player 푖’s post-round rating is increased. By Lemma 5.3 between Gaussian and logistic factors in the posterior. Recall the again, this increases player 푖’s performance score in the following notation in Equation (10), describing the loss function and weights. round. Proceeding inductively, we find that performance scores and Theorem 5.7. In the Elo-MMR(휌) rating system, let ratings from this point onward are all increased. □ Δ+ := lim 휇푡 − 휇푡−1, Δ− := lim 휇푡−1 − 휇푡 . 푝푡 →+∞ 푝푡 →−∞ In the special cases of Elo-MM휒 or Elo-MMR(∞), the rating sys- Then, for Δ± ∈ {Δ+, Δ−}, tem is “memoryless”: the only data retained for each player are −1 the current rating 휇푖,푡 and uncertainty 휎푖,푡 ; detailed performance 휋 휋2 Õ 휋 1 √ ©푤 + 푤 ª ≤ Δ± ≤ √ . history is not saved. In this setting, we present a natural mono- ­ 0 6 푘 ® 푤 훽푡 3 푘 ∈H 훽푡 3 0 tonicity theorem. 
A similar theorem was previously stated for the « 푡−1 ¬ Codeforces system, albeit in an informal context without proof [8]. 푑 ( ) ≤ Proof. Using the fact that 0 < 푑푥 tanh 푥 1, differentiating ′ Theorem 5.6 (Memoryless Monotonicity). In either the Elo- 퐿 in Equation (10) yields MM휒 or Elo-MMR(∞) system, suppose 푖 and 푗 are two participants of 휋2 Õ ∀푠 ∈ R, 푤 ≤ 퐿′′(푠) ≤ 푤 + 푤 . round 푡. Suppose that the ratings and corresponding uncertainties sat- 0 0 6 푘 푘 ∈H isfy 휇푖,푡−1 ≥ 휇푗,푡−1, 휎푖,푡−1 = 휎푗,푡−1. Then, 휎푖,푡 = 휎푗,푡 . Furthermore: 푡−1 If 푖 ≻ 푗 in round 푡, then 휇푖,푡 > 휇푗,푡 . Now, the performance at round 푡 adds a new term with multi- ′ 휋 (푠−푝푘 )휋 If 푗 ≻ 푖 in round 푡, then 휇푗,푡 − 휇푗,푡−1 > 휇푖,푡 − 휇푖,푡−1. plicity one to 퐿 (푠): its value is √ tanh √ . 훽푘 3 훽푘 12 ′ As a result, for every 푠 ∈ R, in the limit as 푝푡 → ±∞, 퐿 (푠) 휋 ′ Proof. The new contest round will add a rating perturbation increases by ∓ √ . Since 휇푡−1 was a zero of 퐿 without this new 2 2 훽푡 3 with variance 훾푡 , followed by a new performance with variance 훽푡 . ′ 휋 term, we now have 퐿 (휇푡−1) → ∓ √ . Dividing by the former As a result, 훽푡 3 inequalities yields the desired result. □ − 1 − 1 ! 2 ! 2 1 1 1 1 The proof reveals that the magnitude of Δ± depends inversely 휎푖,푡 = + = + = 휎푗,푡 . 2 + 2 2 2 + 2 2 on that of 퐿′′ in the vicinity of the current rating, which in turn is 휎푖,푡−1 훾푡 훽푡 휎푗,푡−1 훾푡 훽푡 related to the derivative of the tanh terms. If a player’s performances The remaining conclusions are consequences of three proper- vary wildly, the tanh terms will be widely dispersed, so any potential ties: memorylessness, incentive-compatibility (Theorem 5.5), and rating value will necessarily be in the tail ends of most of the translation-invariance (ratings, skills, and performances are quanti- terms. Tails contribute very small derivatives, enabling a larger fied on a common interval scale relative to one another). rating change. 
Conversely, the tanh terms of a player with a very Since the Elo-MM휒 or Elo-MMR(∞) systems are memoryless, we consistent performance history will contribute large derivatives, so may replace the initial prior and performance histories of players the bound on their rating change will be small. with any alternate histories of our choosing, as long as our choice is Thus, Elo-MMR naturally caps the rating changes of all players, compatible with their current rating and uncertainty. For example, and the cap is smaller for consistent performers. The cap will in- both 푖 and 푗 can be considered to have participated in the same crease after an extreme performance, providing a similar “momen- set of rounds, with 푖 always performing at 휇푖,푡−1. and 푗 always tum” to the Topcoder and Glicko-2 systems, but without sacrificing performing at 휇푗,푡−1. Round 푡 is unchanged. incentive-compatibility (Theorem 5.5). Suppose 푖 ≻ 푗. Since 푖’s historical performances are all equal or We can compare the lower and upper bound in Theorem 5.7: stronger than 푗’s, Theorem 5.5 implies 휇푖,푡 > 휇푗,푡 . their ratio is on the same order as the fraction of the total weight Suppose 푗 ≻ 푖. By translation-invariance, if we shift each of that is held by the normal term. Recall that 휌 is the weight transfer 푗’s performances, up to round 푡 and including the initial prior, rate: larger 휌 results in more weight being transferred into 푤0; in upward by 휇푖,푡−1 − 휇푗,푡−1, the rating changes between rounds will this case, the lower and upper bound tend to stay close together. be unaffected. Players 푖 and 푗 now have identical histories, except Conversely, the momentum effect is more pronounced when 휌 is that we still have 푗 ≻ 푖 at round 푡. Therefore, 휇푗,푡−1 = 휇푖,푡−1 and, small. In the extreme case 휌 = 0, 푤0 vanishes for experienced play- by Theorem 5.5, 휇푗,푡 > 휇푖,푡 . Subtracting the equation from the ers, so a sufficiently volatile player would be subject to correspond- inequality proves the second conclusion. 
□ ingly large rating updates. In the extended version of this paper, WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Aram Ebtekar and Paul Liu we quantify an asymptotic steady state for the weights, and argue Dataset # contests avg. # participants / contest that 1/휌 can be thought of as a momentum parameter. Codeforces 1087 2999 Topcoder 2023 403 5.3 Runtime analysis and optimizations Reddit 1000 20 Let’s look at the computation time needed to process a round with Synthetic 50 2500 participant set P, where we again omit the round subscript. Each Table 1: Summary of test datasets. player 푖 has a participation history H푖 . Estimating 푃푖 entails finding the zero of a monotonic function In practice, a single bucket might not contain enough participants, with 푂 (|P|) terms, and then obtaining the rating 휇푖 entails finding so we sample enough buckets to yield the desired precision. (|H |) the zero of another monotonic function with 푂 푖 terms. Using Simple parallelism. Since each player’s rating computation is either of the Illinois or Newton methods, solving these equations independent, the algorithm is embarrassingly parallel. Threads can ( 1 ) to precision 휖 takes 푂 log log 휖 iterations. As a result, the total read the same global data structures, so each additional thread runtime needed to process one round of competition is contributes only 푂 (1) memory overhead. ! Õ 1 푂 (|P| + |H푖 |) log log . 6 EXPERIMENTS 휖 푖 ∈P In this section, we compare various rating systems on real-world This complexity is more than adequate for Codeforces-style com- datasets, mined from several sources that will be described in Sec- petitions with thousands of contestants and history lengths up to tion 6.1. The metrics are runtime and predictive accuracy, as de- a few hundred. Indeed, we were able to process the entire history scribed in Section 6.2. Implementations of all rating systems, dataset of Codeforces on a small laptop in less than half an hour. 
Nonethe- mining, and additional processing used in our experiments can be less, it may be cost-prohibitive in truly massive settings, where |P| found at https://github.com/EbTech/Elo-MMR. or |H푖 | number in the millions. Fortunately, it turns out that both We compare Elo-MM휒 and Elo-MMR(휌) against the industry- functions may be compressed down to a bounded number of terms, tested rating systems of Codeforces and Topcoder. For a fairer with negligible loss of precision. comparison, we hand-coded efficient versions of all four in the safe subset of Rust, parellelized using the Rayon crate; as Adaptive subsampling. In Section2, we used Doob’s consistency such, the Rust compiler verifies that they contain no data races[32]. theorem to argue that our estimate for 푃 is consistent. Specifically, 푖 Our implementation of Elo-MMR(휌) makes use of the optimizations we saw that 푂 (1/휖2) opponents are needed to get the typical error in Section 5.3, bounding both the number of sampled opponents below 휖. Thus, we can subsample the set of opponents to include in and the history length by 500. In addition, we test the improved the estimation, omitting the rest. Random sampling is one approach. TrueSkill algorithm of [30], basing our code on an open-source A more efficient approach chooses a fixed number of opponents implementation of the same algorithm. The inherent seqentiality whose ratings are closest to that of player 푖, as these are more likely of its message-passing procedure prevented us from parallelizing it. to provide informative match-ups. On the other hand, if the setting All experiments were run on a 2.0 GHz 24-core Skylake machine requires incentive-compatibility to hold exactly, then one must with 24 GB of memory. avoid choosing different opponents for each player. Hyperparameter search. To ensure fair comparisons, we ran a History compression. 
Similarly, it’s possible to bound the number separate grid search for each triple of algorithm, dataset, and metric, of stored factors in the posterior. Our skill-evolution algorithm over all of the algorithm’s hyperparameters. The hyperparameter decays the weights of old performances at an exponential rate. set that performed best on the first 10% of the dataset, was then ( 1 ) Thus, the contributions of all but the most recent 푂 log 휖 terms are used to test the algorithm on the remaining 90% of the dataset. negligible. Rather than erase the older logistic terms outright, we recommend replacing them with moment-matched Gaussian terms, 6.1 Datasets similar to the transfers in Section4 with 휅푡 = 0. Since Gaussians Due to the scarcity of public domain datasets for rating systems, compose easily, a single term can then summarize an arbitrarily we mined three datasets to analyze the effectiveness of our system. long prefix of the history. 2 1 The datasets were mined using data from each source website’s Substituting 1/휖 and log for |P| and |H푖 |, respectively, the 휖 inception up to October 9th, 2020. We also created a synthetic runtime of Elo-MMR with both optimizations becomes dataset to test our system’s performance when the data generating   |P| 1 process matches our theoretical model. Summary statistics of the 푂 log log . 휖2 휖 datasets are presented in Table1. If the contests are extremely large, so that Ω(1/휖2) opponents Codeforces contest history. This dataset contains the current en- have a rating and uncertainty in the same 휖-width bucket as player tire history of rated contests ever run on codeforces.com, the dom- 푖, then it’s possible to do even better: up to the allowed precision 휖, inant platform for online programming competitions. The Code- the corresponding terms can be treated as duplicates. 
Hence, their forces platform has over 850K users, over 300K of whom are rated, sum can be determined by counting how many of these opponents and has hosted over 1000 contests to date. Each contest has a cou- win, lose, or tie against player 푖. Given the pre-sorted list of ranks of ple thousand participants on average. A typical contest takes 2 to players in the bucket, two binary searches would yield the answer. 3 hours and contains 5 to 8 problems. Players are ranked by total Elo-MMR: A Rating System for Massive Multiplayer Competitions WWW ’21, April 19–23, 2021, Ljubljana, Slovenia points, with more points typically awarded for tougher problems 6.3 Empirical results and for early solves. They may also attempt to “hack” one another’s Recall that Elo-MM휒 has a Gaussian performance model, matching submissions for bonus points, identifying test cases that break their the modeling assumptions of Topcoder and TrueSkill. Elo-MMR(휌), solutions. on the other hand, has a logistic performance model, matching Topcoder contest history. This dataset contains the current entire the modeling assumptions of Codeforces and Glicko. While 휌 was history of algorithm contests ever run on the topcoder.com. Top- included in the hyperparameter search, in practice we found that coder is a predecessor to Codeforces, with over 1.4 million total all values between 0 and 1 produce very similar results. users and a long history as a pioneering platform for programming To ensure that errors due to the unknown skills of new players contests. It hosts a variety of contest types, including over 2000 algo- don’t dominate our metrics, we excluded players who had com- rithm contests to date. The scoring system is similar to Codeforces, peted in less than 5 total contests. In most of the datasets, this re- but its rounds are shorter: typically 75 minutes with 3 problems. duced the performance of our method relative to the others, as our method seems to converge more accurately. 
SubredditSimulator threads. This dataset contains data scraped from the current top-1000 most upvoted threads on the website reddit.com/r/SubredditSimulator/. Reddit is a social news aggregation website with over 300 million users. The site itself is broken down into sub-sites called subreddits. Users then post and comment to the subreddits, where the posts and comments receive votes from other users. In the subreddit SubredditSimulator, users are language generation bots trained on text from other subreddits. Automated posts are made by these bots to SubredditSimulator every 3 minutes, and real users of Reddit vote on the best bot. Each post (and its associated comments) can thus be interpreted as a round of competition between the bots who commented.

Synthetic data. This dataset contains 10K players, with skills and performances generated according to the Gaussian generative model in Section 2. Players' initial skills are drawn i.i.d. with mean 1500 and variance 350². Players compete in all rounds, and are ranked according to independent performances with variance 200². Between rounds, we add i.i.d. Gaussian increments with variance 35² to each of their skills.

6.2 Evaluation metrics

To compare the different algorithms, we define two measures of predictive accuracy. Each metric will be defined on individual contestants in each round, and then averaged:

    aggregate(metric) := ( Σ_t Σ_{i ∈ P_t} metric(i, t) ) / ( Σ_t |P_t| ),

where P_t is the set of participants in round t.

Pair inversion metric [24]. Our first metric computes the fraction of opponents against whom our ratings predict the correct pairwise result, defined as the higher-rated player either winning or tying:

    pair_inversion(i, t) := (# correctly predicted matchups) / (|P_t| − 1) × 100%.

This metric was used in the original evaluation of TrueSkill [24].

Rank deviation. Our second metric compares the rankings with the total ordering that would be obtained by sorting players according to their prior rating. The penalty is proportional to how much these ranks differ for player i:

    rank_deviation(i, t) := |actual rank − predicted rank| / (|P_t| − 1) × 100%.

In the event of ties, among the ranks within the tied range, we use the one that comes closest to the rating-based prediction.

Despite this, we see in Table 2 that both versions of Elo-MMR outperform the other rating systems in both the pairwise inversion metric and the ranking deviation metric.

We highlight a few key observations. First, significant performance gains are observed on the Codeforces and Topcoder datasets, despite these platforms' rating systems having been designed specifically for their needs. Our gains are smallest on the synthetic dataset, for which all algorithms perform similarly. This might be in part due to the close correspondence between the generative process and the assumptions of these rating systems. Furthermore, the synthetic players compete in all rounds, enabling the system to converge to near-optimal ratings for every player. Finally, the improved TrueSkill performed well below our expectations, despite our best efforts to improve it. We suspect that the message-passing numerics break down in contests with a large number of individual participants. The difficulties persisted in all TrueSkill implementations that we tried, including on Microsoft's popular Infer.NET framework [29]. To our knowledge, we are the first to present experiments with TrueSkill on contests where the number of participants is in the hundreds or thousands. In preliminary experiments, TrueSkill and Elo-MMR score about equally when the number of ranks is less than about 60.

Now, we turn our attention to Table 3, which showcases the computational efficiency of Elo-MMR. On smaller datasets, it performs comparably to the Codeforces and Topcoder algorithms. However, the latter suffer from a quadratic time dependency on the number of contestants; as a result, Elo-MMR outperforms them by almost an order of magnitude on the larger Codeforces dataset.

Finally, in comparisons between the two Elo-MMR variants, we note that while Elo-MMR(ρ) is more accurate, Elo-MMχ is always faster. This has to do with the skill drift modeling described in Section 4, as every update in Elo-MMR(ρ) must process O(log(1/ε)) terms of a player's competition history.

7 CONCLUSIONS

This paper introduces the Elo-MMR rating system, which is in part a generalization of the two-player Glicko system, allowing any number of players. By developing a Bayesian model and taking the limit as the number of participants goes to infinity, we obtained simple, human-interpretable rating update formulas. Furthermore, we saw that the algorithm is incentive-compatible, robust to extreme performances, asymptotically fast, and embarrassingly parallel. To our knowledge, our system is the first to rigorously prove all these properties in a setting with more than two individually ranked players.

WWW '21, April 19–23, 2021, Ljubljana, Slovenia. Aram Ebtekar and Paul Liu
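The synthetic dataset described in Section 6.2's preceding paragraphs can be reproduced with a short simulation. The sketch below follows the stated Gaussian generative model (initial skills with mean 1500 and variance 350², performance noise with variance 200², and drift with variance 35² between rounds); the function and parameter names are ours, and we use Python's standard library rather than the paper's Rust implementation.

```python
import random

def simulate_rounds(num_players=10_000, num_rounds=10, seed=0):
    """Simulate the synthetic dataset: Gaussian skills with per-round drift,
    and each round ranking all players by a noisy performance sample."""
    rng = random.Random(seed)
    # initial skills ~ N(1500, 350^2); random.gauss takes a standard deviation
    skills = [rng.gauss(1500, 350) for _ in range(num_players)]
    rankings = []
    for _ in range(num_rounds):
        # independent performances ~ N(skill, 200^2)
        perf = [s + rng.gauss(0, 200) for s in skills]
        # record player indices ordered from best (rank 0) to worst
        rankings.append(sorted(range(num_players), key=lambda i: -perf[i]))
        # skill drift between rounds ~ N(0, 35^2)
        skills = [s + rng.gauss(0, 35) for s in skills]
    return rankings
```

Because every player competes in every round, each round's ranking is a full permutation of the roster, which is what lets the rating systems converge to near-optimal ratings in this setting.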

             Codeforces          Topcoder            TrueSkill           Elo-MMχ             Elo-MMR(ρ)
Dataset      pair inv. rank dev. pair inv. rank dev. pair inv. rank dev. pair inv. rank dev. pair inv. rank dev.
Codeforces   78.3%     14.9%     78.5%     15.1%     61.7%     25.4%     78.5%     14.8%     78.6%     14.7%
Topcoder     72.6%     18.5%     72.3%     18.7%     68.7%     20.9%     73.0%     18.3%     73.1%     18.2%
Reddit       61.5%     27.3%     61.4%     27.4%     61.5%     27.2%     61.6%     27.3%     61.6%     27.3%
Synthetic    81.7%     12.9%     81.7%     12.8%     81.3%     13.1%     81.7%     12.8%     81.7%     12.8%

Table 2: Performance of each rating system on the pairwise inversion and rank deviation metrics. Bolded entries denote the best performances (highest pair inv. or lowest rank dev.) on each metric and dataset.
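The metrics reported in Table 2 follow directly from their definitions in Section 6.2. Below is a minimal sketch; the helper names are ours, and for simplicity it omits the paper's tie-breaking rule (which snaps tied ranks to the value closest to the rating-based prediction).

```python
def pair_inversion(ratings, ranks, i):
    """% of opponents j for whom the higher-rated of {i, j} wins or ties.
    `ratings` are prior ratings; `ranks` are placements (lower is better)."""
    correct = 0
    for j in range(len(ratings)):
        if j == i:
            continue
        if ratings[i] == ratings[j]:
            correct += 1  # equal ratings: no pairwise prediction to violate
        else:
            hi, lo = (i, j) if ratings[i] > ratings[j] else (j, i)
            correct += ranks[hi] <= ranks[lo]  # higher-rated wins or ties
    return 100.0 * correct / (len(ratings) - 1)

def rank_deviation(ratings, ranks, i):
    """% deviation of i's actual placement from the placement predicted by
    sorting contestants in decreasing order of prior rating."""
    predicted = sorted(range(len(ratings)), key=lambda j: -ratings[j]).index(i)
    actual = sorted(range(len(ranks)), key=lambda j: ranks[j]).index(i)
    return 100.0 * abs(actual - predicted) / (len(ratings) - 1)

def aggregate(metric, rounds):
    """Average metric(i, t) over every contestant i in every round t,
    where `rounds` is a list of (ratings, ranks) pairs."""
    total = sum(metric(rs, ks, i) for rs, ks in rounds for i in range(len(rs)))
    return total / sum(len(rs) for rs, _ in rounds)
```

For example, a contestant who places exactly as predicted scores 100% on pair inversion and 0% on rank deviation for that round.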

Dataset      CF      TC      TS      Elo-MMχ   Elo-MMR(ρ)
Codeforces   212.9   72.5    67.2    31.4      35.4
Topcoder     9.60    4.25    16.8    7.00      7.52
Reddit       1.19    1.14    0.44    1.14      1.42
Synthetic    3.26    1.00    2.93    0.81      0.85

Table 3: Total compute time over entire dataset, in seconds.

In terms of practical performance, we saw that it outperforms existing industry systems in both prediction accuracy and computation speed.

This work can be extended in several directions. First, the choices we made in modeling ties, pseudodiffusions, and opponent subsampling are by no means the only possibilities consistent with our Bayesian model of skills and performances. Second, it may be possible to further improve accuracy by fitting more flexible performance and skill evolution models to application-specific data.

Another useful extension would be to team competitions. Given a performance model for teams, Elo-MMR infers each team's performance. To make this useful in settings where teams are frequently reassigned, we must model teams in terms of their individual members; unfortunately, it is not possible to precisely infer an individual's performance from team rankings alone. Therefore, it becomes necessary to condition an individual's skill on their team's performance. In the case where a team's performance is modeled as the sum of its members' independent Gaussian contributions, elementary facts about multivariate Gaussian distributions enable posterior skill inferences at the individual level. Generalizing this approach to other models remains an open challenge.

Over the past decade, online competition communities such as Codeforces have grown exponentially. As such, considerable work has gone into engineering scalable and reliable rating systems. Unfortunately, many of these systems have not been rigorously analyzed in the academic community. We hope that our paper and open-source release will open new explorations in this area.

ACKNOWLEDGEMENTS

The authors are indebted to Daniel Sleator and Danica J. Sutherland for initial discussions that helped inspire this work, and to Nikita Gaevoy for the open-source improved TrueSkill upon which our implementation is based. Experiments in this paper are funded by a Cloud Research Grant. The second author is supported by a VMWare Fellowship and the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] CodeChef Rating Mechanism. codechef.com/ratings
[2] Codeforces: Results of 2019. codeforces.com/blog/entry/73683
[3] Farming Volatility: How a major flaw in a well-known rating system takes over the GBL leaderboard. reddit.com/r/TheSilphRoad/comments/hwff2d/farming_volatility_how_a_major_flaw_in_a/
[4] Halo Xbox video game franchise: in numbers. telegraph.co.uk/technology/video-games/11223730/Halo-in-numbers.html
[5] Kaggle milestone: 5 million registered users! kaggle.com/general/164795
[6] Kaggle Progression System. kaggle.com/progression
[7] LeetCode New Contest Rating Algorithm. leetcode.com/discuss/general-discussion/468851/New-Contest-Rating-Algorithm-(Coming-Soon)
[8] Open Codeforces Rating System. codeforces.com/blog/entry/20762
[9] Topcoder Algorithm Competition Rating System. topcoder.com/community/competitive-programming/how-to-compete/ratings
[10] Why Are Obstacle-Course Races So Popular? theatlantic.com/health/archive/2018/07/why-are-obstacle-course-races-so-popular/565130/
[11] Sharad Agarwal and Jacob R. Lorch. 2009. Matchmaking for online games and other latency-sensitive P2P systems. In SIGCOMM 2009. 315–326.
[12] Mark Yuying An. 1997. Log-concave probability distributions: Theory and statistical testing. (1997).
[13] Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika (1952), 324–345.
[14] Shuo Chen and Thorsten Joachims. 2016. Modeling Intransitivity in Matchup and Comparison Data. In WSDM 2016. 227–236.
[15] Rémi Coulom. 2008. Whole-history rating: A Bayesian rating system for players of time-varying strength. In CG 2008. Springer, 113–124.
[16] Pierre Dangauthier, Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. TrueSkill Through Time: Revisiting the History of Chess. In NeurIPS 2007. 337–344.
[17] Arpad E. Elo. 1961. New USCF rating system. Chess Life (1961), 160–161.
[18] Michal Forišek. 2009. Theoretical and Practical Aspects of Programming Contest Ratings. (2009).
[19] David A. Freedman. 1963. On the asymptotic behavior of Bayes' estimates in the discrete case. The Annals of Mathematical Statistics (1963), 1386–1403.
[20] Mark E. Glickman. 1995. A comprehensive guide to chess ratings. American Chess Journal (1995), 59–102.
[21] Mark E. Glickman. 1999. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics (1999), 377–394.
[22] Mark E. Glickman. 2012. Example of the Glicko-2 system. Boston University (2012), 1–6.
[23] Linxia Gong, Xiaochuan Feng, Dezhi Ye, Hao Li, Runze Wu, Jianrong Tao, Changjie Fan, and Peng Cui. 2020. OptMatch: Optimized Matchmaking via Modeling the High-Order Interactions on the Arena. In KDD 2020. 2300–2310.
[24] Ralf Herbrich, Tom Minka, and Thore Graepel. 2006. TrueSkill™: A Bayesian Skill Rating System. In NeurIPS 2006. 569–576.
[25] Tzu-Kuo Huang, Chih-Jen Lin, and Ruby C. Weng. 2006. Ranking individuals by group comparisons. In ICML 2006. 425–432.
[26] Stephanie Kovalchik. 2020. Extension of the Elo rating system to margin of victory. Int. J. Forecast. (2020).
[27] Yao Li, Minhao Cheng, Kevin Fujii, Fushing Hsieh, and Cho-Jui Hsieh. 2018. Learning from Group Comparisons: Exploiting Higher Order Interactions. In NeurIPS 2018. 4986–4995.
[28] Tom Minka, Ryan Cleven, and Yordan Zaykov. 2018. TrueSkill 2: An improved Bayesian skill rating system. Technical Report MSR-TR-2018-8. Microsoft.
[29] T. Minka, J.M. Winn, J.P. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill. Infer.NET 0.3. Microsoft Research Cambridge. http://dotnet.github.io/infer
[30] Sergey I. Nikolenko and Alexander V. Sirotkin. 2010. Extensions of the TrueSkill™ rating system. In Proceedings of the 9th International Conference on Applications of Fuzzy Systems and Soft Computing. 151–160.
[31] Jerneja Premelč, Goran Vučković, Nic James, and Bojan Leskošek. 2019. Reliability of judging in DanceSport. Front. Psychol. (2019), 1001.
[32] Josh Stone and Nicholas D. Matsakis. The Rayon library (Rust crate). crates.io/crates/rayon
[33] Ruby C. Weng and Chih-Jen Lin. 2011. A Bayesian Approximation Method for Online Ranking. J. Mach. Learn. Res. (2011), 267–300.
[34] John Michael Winn. 2019. Model-based machine learning.
[35] Lin Yang, Stanko Dimitrov, and Benny Mantin. 2014. Forecasting sales of new virtual goods with the Elo rating system. RPM (2014), 457–469.
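As a worked instance of the team extension discussed in Section 7: suppose a team's performance p is the sum of its members' independent Gaussian skill contributions s_k ~ N(μ_k, σ_k²). Since Cov(s_k, p) = σ_k² and Var(p) = Σ_k σ_k², conditioning each member's skill on p is elementary Gaussian algebra. The sketch below is ours, not the paper's implementation, and deliberately omits performance noise.

```python
def team_posterior(mus, vars_, p):
    """Posterior (mean, variance) of each member's contribution s_k ~ N(mu_k, var_k)
    after observing the team performance p = sum_k s_k.
    Uses Gaussian conditioning: Cov(s_k, p) = var_k, Var(p) = sum(vars_)."""
    var_p = sum(vars_)
    resid = p - sum(mus)  # surprise relative to the team's expected total
    post = []
    for mu, v in zip(mus, vars_):
        gain = v / var_p  # members with more uncertainty absorb more of the surprise
        post.append((mu + gain * resid, v * (1 - gain)))
    return post
```

Adding i.i.d. performance noise with variance β² would simply add β² to Var(p) in the gain's denominator; the open challenge noted above is extending this conditioning beyond the additive Gaussian model.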