The Evaluation of Rating Systems in Online Free-For-All Games
Arman Dehpanah, Muheeb Faizan Ghori, Jonathan Gemmell, Bamshad Mobasher
School of Computing, DePaul University, Chicago, USA

Abstract—Online competitive games have become increasingly popular. To ensure an exciting and competitive environment, these games routinely attempt to match players with similar skill levels. Matching players is often accomplished through a rating system. There has been an increasing amount of research on developing such rating systems. However, less attention has been given to the metrics used to evaluate these systems. In this paper, we present an exhaustive analysis of six metrics for evaluating rating systems in online competitive games. We compare traditional metrics such as accuracy. We then introduce other metrics adapted from the field of information retrieval. We evaluate these metrics against several well-known rating systems on a large real-world dataset of over 100,000 free-for-all matches. Our results show stark differences in their utility. Some metrics do not consider deviations between two ranks. Others are inordinately impacted by new players. Many do not capture the importance of distinguishing between errors in higher ranks and lower ranks. Among all metrics studied, we recommend Normalized Discounted Cumulative Gain (NDCG) because it not only resolves the issues faced by the other metrics but also offers the flexibility to adjust the evaluation to the goals of the system.

Index Terms—rating systems, rank prediction, evaluation, free-for-all games

I. INTRODUCTION

Online competitive games pit players against one another in player-versus-player (PvP) matches. A common goal of PvP games is to match players based on their skills. When a new player is matched against an experienced player, neither is likely to enjoy the competition. Therefore, competitive games often use rating algorithms to match players with similar skills.

Rating systems often represent a player with a single number describing the player's skill. For example, Elo [1] assigns new players a default rating of 1500 and updates this value based on the outcomes of the matches they play. Rating systems leverage skill ratings to predict ranks. While researchers have made numerous efforts to improve rank prediction, less attention has been given to how predicted ranks are evaluated.

There are several metrics commonly used to evaluate ratings. However, these metrics often fail to capture important characteristics of the ratings. They might give equal weight to low-tier and top-tier players, even when matching top-tier players is more important for the goals of the system. They might also be hampered by the inclusion of new players, since the system does not possess any knowledge of these players.

In this paper, we consider six evaluation metrics. We include traditional metrics such as accuracy, mean absolute error, and Kendall's rank correlation coefficient. We further include metrics adapted from the domain of information retrieval: mean reciprocal rank (MRR), average precision (AP), and normalized discounted cumulative gain (NDCG). We analyze the ability of these metrics to capture meaningful insights when they are used to evaluate the performance of three popular rating systems: Elo, Glicko, and TrueSkill.

To perform this analysis, we limit our experimentation to free-for-all matches. Free-for-all is a widely used game-play mode in which several players simultaneously compete against one another in the same match. The winner is the "last man standing"; this mode of game-play is more commonly referred to as Battle Royale. Our real-world dataset includes over 100,000 matches and over 2,000,000 unique players from PlayerUnknown's Battlegrounds.

Our evaluation shows that, in free-for-all matches, the metrics adapted from information retrieval evaluate the rating systems better while being more resistant to the influence of new players. NDCG, in particular, captures the predictive power of these systems most precisely. NDCG distinguishes between prediction errors for top-tier players with higher ranks and those for low-tier players with lower ranks by applying a weight to each position. This is particularly important for companies whose business model is based on user interaction and engagement at the top levels of play; top-tier players are the ones most likely to stay in the system and play more games.
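To make this position weighting concrete, the sketch below (our own illustration, not the paper's implementation) scores a single free-for-all match with NDCG. Observed finishing positions are mapped to relevance scores, an assumption made only for this example, and prediction errors are discounted logarithmically by position, so mistakes among the top finishers cost more than mistakes near the bottom.

```python
import math

def ndcg_for_match(predicted_order, observed_order):
    """NDCG for a single free-for-all match.

    predicted_order: player ids sorted by predicted finish (best first).
    observed_order:  player ids sorted by observed finish (best first).
    """
    n = len(observed_order)
    # Map observed finish to a relevance score: winner gets n, last place gets 1.
    # This linear mapping is an illustrative assumption, not the paper's definition.
    relevance = {player: n - i for i, player in enumerate(observed_order)}

    def dcg(order):
        # Logarithmic discount: errors at the top of the ranking cost more.
        return sum(relevance[player] / math.log2(pos + 2)
                   for pos, player in enumerate(order))

    # The ideal DCG is the DCG of the observed (perfectly predicted) order.
    return dcg(predicted_order) / dcg(observed_order)

# A perfect prediction scores 1.0; swapping the predicted winner and runner-up
# lowers the score more than swapping two players near the bottom.
print(ndcg_for_match(["a", "b", "c", "d"], ["b", "a", "c", "d"]))
```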
The rest of this paper is organized as follows. In Section II, related work is reviewed. Rank prediction and its application in rating systems are discussed in Section III. In Section IV, the evaluation metrics are introduced. In Section V, the dataset and the experiments are explained, and the results are discussed in detail in Section VI. Finally, we conclude the paper and outline future work in Section VII.

II. RELATED WORK

The popularity of online competitive games has exploded in the last decade. There are more than 800 million users playing online games, and this number is expected to grow to over 1 billion by the year 2024 [2].

An important mode of game-play is the free-for-all, which can be divided into deathmatch and battle royale. In deathmatch games, many players are pitted against one another and the winner is the one with the most points at the end of the game. In battle royale, players eliminate one another and the winner is the last one standing.

A critical component of these zero-sum games is the ability to rate a player's skill. An accurate skill rating enables the system to generate balanced matches [3, 4]. It allows for the assignment of teams with similar skill levels [3, 5–8]. It serves as feedback for the players so that they can track their performance over time [4, 8, 9].

The most prominent examples of such systems are Elo [1], Glicko [10], and TrueSkill [4]. Elo and Glicko are designed for head-to-head matches with only two players. TrueSkill, developed by Microsoft, extends these approaches to handle multiplayer games with multiple teams.

A rating system like those above can be used to predict the outcome of games. Those predictions can then be evaluated to judge the quality of the rating system. Accuracy is often used for evaluating rating systems. It has been used to evaluate first-person shooter games [4–7, 11–14], real-time strategy games [5, 12, 15, 16], as well as tennis [15, 17, 18], soccer [14, 18], football [14], and board games [19–21].

Other metrics have also been used for rank prediction. Log-likelihood has been used to evaluate first-person shooters [6], real-time strategy games [15], and board games [21]. Information gain has been used to evaluate first-person shooters [14]. Mean squared error has been used to evaluate real-time strategy games [22] and soccer [23]. Mean absolute error has been used to evaluate first-person shooters [14] and real-time strategy games [22]. Root mean squared error has been used to evaluate chess [20]. Spearman's rank correlation coefficient has been used to evaluate first-person shooter games [24]. However, most of these examples are head-to-head games: there are only two sides, as in chess or squad-vs-squad first-person shooters. When there are more than two sides, these metrics are less appropriate.

Information retrieval, the process of obtaining relevant resources from document collections [25], includes several metrics for evaluating the ranking of search results.

Whereas prior research often attempts to improve rank prediction, we seek to better evaluate the predicted ranks. We explore six evaluation metrics, including traditional metrics and those drawn from the domain of information retrieval. We analyze the explanatory power of these metrics using a large real-world dataset and consider how the metrics describe different populations, such as the top-tier and the most frequent players.

III. PREDICTING RANK

Predicting rank in online competitive games is important for several reasons. It can be used to evaluate the performance of the players, to assign players to teams, and to create balanced matches. Often, rank is predicted by first evaluating the skills of the players.

The skill level of a player p can be described as a single number, µ_p. In a head-to-head match between two players, p_1 and p_2, with skill ratings µ_{p_1} and µ_{p_2}, the player with the higher rating can be predicted to win the match. In a free-for-all match with several players p_1, p_2, p_3, ..., p_n, their skill ratings µ_{p_1}, µ_{p_2}, ..., µ_{p_n} can be used to create a rank ordering of the players, R_pred. After the match, the observed ranking R_obs can be used to update the skill levels.
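As a minimal sketch of this prediction step (our own illustration, not code from the paper), R_pred can be obtained by sorting players by their current ratings, assuming each rating is a single number kept in a dictionary:

```python
def predict_ranking(ratings):
    """Return R_pred: player ids ordered from best to worst predicted finish.

    ratings: dict mapping player id -> current skill rating (mu_p).
    """
    return sorted(ratings, key=ratings.get, reverse=True)

# Example with three hypothetical players and Elo-style ratings.
ratings = {"alice": 1620.0, "bob": 1500.0, "carol": 1555.0}
r_pred = predict_ranking(ratings)   # ['alice', 'carol', 'bob']
# After the match, the observed ranking R_obs is compared against r_pred
# by the evaluation metrics and used to update each player's rating.
```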
There are several ways to update a player's skill rating. In the remainder of this section, we describe three common algorithms: Elo, Glicko, and TrueSkill.

A. Elo

Originally developed for ranking chess players, Elo has been used to rank players in many competitive environments [1]. Elo calculates the skill level of players by appraising a set of historical match results. Elo assumes that player skill follows a Gaussian distribution with the same standard deviation for all players.

By convention, the default rating for new players is set to 1500. The rating is updated after each match: the winner is awarded points and the loser surrenders points. The number of points exchanged depends on the probability of the match outcome given the two players' initial ratings.

The probability that player p_i wins a match against player p_j can be calculated by:

P(p_i beats p_j) = (1 + 10^{(µ_{p_j} - µ_{p_i}) / 400})^{-1}
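As a minimal sketch of this calculation and of the point exchange it drives, the snippet below implements the win probability above together with a standard Elo update. The K-factor of 32 is a common convention assumed for the example; it is not a value taken from the paper.

```python
def elo_win_probability(mu_i, mu_j):
    """Probability that player i beats player j, per the formula above."""
    return 1.0 / (1.0 + 10.0 ** ((mu_j - mu_i) / 400.0))

def elo_update(mu_i, mu_j, score_i, k=32.0):
    """Update both ratings after a head-to-head match.

    score_i: 1.0 if player i won, 0.0 if player j won, 0.5 for a draw.
    k (the K-factor) controls how many points change hands; 32 is a common
    convention and is assumed here, not specified by the paper.
    """
    expected_i = elo_win_probability(mu_i, mu_j)
    delta = k * (score_i - expected_i)
    return mu_i + delta, mu_j - delta   # the winner gains what the loser surrenders

# Example: a 1500-rated newcomer upsets a 1620-rated opponent.
print(elo_update(1500.0, 1620.0, score_i=1.0))
```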