Intrinsic Chess Ratings

Kenneth W. Regan∗ (University at Buffalo)          Guy McC. Haworth (University of Reading, UK)

February 12, 2011

Abstract

This paper develops and tests formulas for representing playing strength at chess by the quality of moves played, rather than the results of games. Intrinsic quality is estimated via evaluations given by computer chess programs run to high depth, ideally with playing strength sufficiently far ahead of the best human players to serve as a “relatively omniscient” guide. Several formulas, each having intrinsic skill parameters s for “sensitivity” and c for “competence,” are argued theoretically and tested by regression on large sets of tournament games played by humans of varying strength as measured by the internationally standard Elo rating system. This establishes a correspondence between Elo rating and the parameters. A smooth correspondence is shown between statistical results and the century points on the Elo scale, and ratings are shown to have stayed quite constant over time (i.e., little or no “inflation”). By modeling human players of various strengths, the model also enables distributional prediction to detect cheating by getting computer advice during games. The theory and empirical results are in principle transferable to other rational-choice settings in which the alternatives have well-defined utilities, but bounded information and complexity constrain the perception of the utility values.

1 Introduction

Player strength ratings in chess and other games of strategy are based on the results of games, and are subject both to luck, when opponents blunder, and to drift in the player pool. This paper aims to rate players intrinsically by the quality of their decisions, as refereed by computer programs run to sufficient depth.
We aim to settle controversial questions of import to chess policy and enjoyment, and then extend the answers to other decision-making activities:

1. Has there been “inflation”—or deflation—in the chess Elo rating system over the past forty years?

2. Does a faster time control markedly reduce the quality of play?

3. Was Old Master X stronger than modern master Y?

4. Can game scores from tournaments where high results by a player are suspected to result from fraud reveal the extent to which “luck”—or collusion—played a role?

5. Can we objectively support assertions of the kind: this player is only an average master in the openings and middlegames, but plays at super-grandmaster strength in endgames?

∗Dept. of CSE, 201 Bell Hall, University at Buffalo, Buffalo, NY 14260-2000; (716) 645-3180 x114, fax 645-3464; [email protected]

The most-cited predecessor study, [Guid and Bratko2006], aimed only to compare the chess world champions, used a relatively low depth (12 ply) of a program Crafty [Hyatt2011] generally considered below the elite, and, most importantly for our purposes, evaluated only the played move and Crafty’s preferred move, if different. The departure point for our work is that to model probabilistic move-choice to the needed accuracy and assess skill at chess, it is necessary to evaluate all of the relevant available moves. At a game turn with ℓ-many legal moves, we can list them m0, m1, . . . , mℓ−1 in nonincreasing order of their (composite) evaluations e(m0), e(m1), . . . , e(mℓ−1) by a (jury of) computer program(s), and use these as a measure of the moves’ utilities. (When the convention of stating scores always from White’s point of view is used, the evaluations for Black moves can be negated.) Our work uses the commercial chess program Rybka 3 [Rajlich and Kaufman], which was rated the best program on the CCRL rating list [CCR2010] between its release in August 2007 and the release of Rybka 4 in May 2010.
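As a small illustrative sketch (with hypothetical moves and evaluations, not data from this study), the ordering convention just described looks like this in code:

```python
# Sketch: order legal moves by evaluation as described above.
# Evaluations are assumed to be in pawn units from White's point of
# view; for a Black move choice we negate them, so that larger is
# always better for the player on move.

def order_moves(evals_white_pov, black_to_move):
    """Return (move, value) pairs sorted in nonincreasing order of
    value to the player on move; evals_white_pov maps move -> eval."""
    sign = -1.0 if black_to_move else 1.0
    scored = [(m, sign * e) for m, e in evals_white_pov.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Hypothetical position with Black to move: -0.5 from White's view
# means +0.5 for Black, so Nf6 comes out on top.
ordered = order_moves({"e5": 0.2, "Nf6": -0.5, "h5": 1.3}, black_to_move=True)
print([m for m, _ in ordered])  # ['Nf6', 'e5', 'h5']
```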
Given the evaluations, for each i, 0 ≤ i < ℓ, define δi = e(m0) − e(mi). A vector ∆ = (δ0 = 0, δ1, . . . , δN−1) is called the spread of the top N moves. Often we cap N at a number such as 20 or 50, and if ℓ < N, we can pad out to length N with values of “+∞” if needed. The only game-specific information used by our methods is the spread, the actual move played, and the overall evaluation of the position. This minimal information is common to other games of strategy, and our work aspires to handle any decision-making application with bounded rationality in which utilities (taken already to include risk/reward tradeoffs) must be converted into probabilities.

2 Background and Previous Literature

The Elo rating system (see [Elo2011]) computes an expected score E based on the differences rP − rO between the rating of a player P and the ratings of various opponents O, and adjusts rP according to the difference between the actual score and E. Although different particular formulas for E and (especially) for the change to rP are used by the World Chess Federation (FIDE) and various national federations, they are generally set up so that rP − rO = 200 means that P must score approximately 75% in games versus O in order to avoid losing rating points. (Under the logistic-curve model used by the United States Chess Federation, USCF, this is close to 76%.) Since only rating differences matter, there is no absolute meaning to the numbers produced, but considerable effort has been expended by FIDE and national federations to keep the strength levels indicated by particular numbers stable over time.
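The logistic-curve model mentioned above can be sketched as follows; the standard 400-point logistic form is assumed here (the paper does not spell out the USCF’s exact constants):

```python
# Sketch of the logistic expected-score model, assuming the standard
# form E = 1 / (1 + 10^(-(rP - rO) / 400)).

def expected_score(r_p, r_o):
    """Expected score of player P against opponent O."""
    return 1.0 / (1.0 + 10.0 ** (-(r_p - r_o) / 400.0))

# A 200-point rating edge gives close to the 76% cited in the text.
print(round(expected_score(2400, 2200), 4))  # 0.7597
```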
For four decades all World Champions and their closest contenders have held ratings in the neighborhood of 2800, while 2650 is the most commonly cited threshold for unofficial “super-grandmaster” status, 2500 is typical of Grandmasters, 2400 of International Masters, and 2300 of FIDE Masters (indeed, awardees of the last three titles must have current ratings above those floors), while 2200 is often designated “master” by national federations. The USCF then uses 2000 as the threshold for “Expert,” 1800 for “Class A,” and so on officially down to 1000 for “Class E,” which has also been referred to as the strength of “bright beginners.” Hence we refer to 200 points as a “class unit.”

By fitting to these itemized skill levels, our paper continues work on reference fallible agents in the game context [Reibman and Ballard1983, Korf1987, Korf1988, Haworth2003, Haworth and Andrist2004, Andrist and Haworth2005]. The aim, going beyond these papers and beyond the results reported in this preliminary work, is to fit probabilities and confidence intervals for move selection by thus-calibrated agents. The present work establishes that a reference-fallible model is supported by data taken on a far larger scale than previous studies, and by a more-ramified model than [DiFatta, Haworth, and Regan2009, DiFatta, Haworth, and Regan2010], which are based on Bayesian inference.

3 Basic Model

Our key abstract concept is the probability pi = pi(s, c, . . .) of a player P of skill level corresponding to parameters s, c, . . . choosing move mi at a given turn t. We wish to fit pi as a function of the spread ∆t for that turn. We make three basic modeling assumptions:

(a) A player’s choices at different game turns are independent, even for successive game turns.

(b) There is a relation between the probability pi of selecting the i-th move and the probability p0 of choosing an optimal move that depends simply and mainly on the value δi alone.
(c) For players at all skill levels, the relation has the form r(pi, p0) = g(δi), where r is a ratio, g is a continuous, decreasing function of δi, and g depends only on the parameters s, c, . . . for the skill level.

Assumption (a) is de rigueur for the whole enterprise of regarding pi as a function of information for the given turn t alone. It also enables us to multiply probabilities from different turns, and to add probabilities to project (for instance) the total number of matches to a computer’s first choices over a sequence of moves. We justify it further below.

Note that (b) does not assert that pi depends on δi alone. The quality of alternative moves must factor into the probability of selecting move mi somehow. Our assumption (c) asserts what seems to be the simplest and weakest dependence on alternatives, however, saying all their effect is bundled into p0. We assert no other dependence on δj for j ≠ i. In (b) we allow a smaller dependence on the overall evaluation e, but only to down-weight or filter out cases where e is extreme—i.e., for poor moves or when one side is clearly winning—as detailed in Section 5.

The function r is said to define the model, and the functions g = gs,c,...(δ) are curves used to fit the model. The models and curves are normalized so that g(δ0) = g(0) = 1. Plausible models include:

1. “Shares”: r is pi/p0, so pi = p0 · g(δi).

2. “Powers”: r is log(1/p0)/log(1/pi), so pi = p0^{1/g(δi)}.

3. r(pi, p0) = (pi/p0) · (log(1/p0)/log(1/pi)) = g(δi).

Another way of describing Model (1) is that the curve value si = gs,c,...(δi) is the “share” of move mi, and its probability pi is the ratio of si to the sum of the shares, S = s0 + s1 + · · · + sℓ−1. Hence the name “Shares.” In Model (2), the curve represents an “exponential decay” of probability in going from an optimal move to an inferior one.

Justification of the Assumptions

Assumption (a) is intuitively false when a sequence of move choices constitutes a single plan.
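As a minimal sketch, the spread and the “Shares” model can be combined as follows; the exponential-decay curve g(δ) = 2^(−δ/s) is an illustrative stand-in for the fitted curve families, and centipawn units are assumed:

```python
import math

# Sketch: compute the spread from evaluations already in nonincreasing
# order, then turn it into move probabilities under the "Shares" model,
# p_i = g(delta_i) / S.  The curve g(d) = 2^(-d/s) with a single
# sensitivity parameter s is an illustrative choice, not the paper's
# fitted family.

def spread(sorted_evals, cap=None):
    """delta_i = e(m0) - e(m_i) for evaluations in nonincreasing order,
    padded with +inf up to length `cap` if fewer moves are available."""
    deltas = [sorted_evals[0] - e for e in sorted_evals]
    if cap is not None:
        deltas = deltas[:cap] + [math.inf] * max(0, cap - len(deltas))
    return deltas

def shares_probabilities(deltas, s=50.0):
    g = [2.0 ** (-d / s) for d in deltas]   # g(0) = 1; g(+inf) = 0
    total = sum(g)                          # S, the sum of the shares
    return [gi / total for gi in g]

deltas = spread([30, 10, -40], cap=5)       # [0, 20, 70, inf, inf]
probs = shares_probabilities(deltas)
print(probs[0] > probs[1] > probs[2] > probs[3] == 0.0)  # True
```

Padding with +∞ makes the phantom moves take share zero, so capping at N never distorts the probabilities of the real moves.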