From Ranking of Chess Players to Internet Applications

Ruby Chiu-Hsing Weng

Department of Statistics, National Chengchi University

2010.09.03

A story: US college football 1985

team (win-lose-tie)    winning rate    preliminary rank
Miami (10-2)           10/12           7
Tennessee (9-1-2)      9/12            below 7

However, Tennessee thumped Miami 35-7 in the Sugar Bowl.

Packages of oranges from Tennessee fans arrived!
Final polls ranked Tennessee fourth, with Miami significantly lower.

Outline

- Ranking
- Statistical models
- Online ranking system
- Multiple teams/players
- Experiments
- Concluding remarks

Ranking

Complete ranking of t objects.
- t = 3 objects: 1. Totoro, 2. Superman, 3. Sponge Bob
- Emily’s preference: 1, 3, 2
- Andy’s preference: 2, 3, 1

Incomplete ranking involves k < t objects.
- baseball teams: Lions, Elephants, Dragons, Tigers
- each game involves two teams, k = 2

Examples

- sport: World Cup 2010, baseball, tennis
- games: chess, go

The problems

- A total of t players (or teams)
- Each match involves k = 2 players (or teams)

- Who will win?
- How to obtain a global ranking? Who are the top 10?
- What if k > 2? (e.g. car racing)
- What if t is extremely large?

Game outcomes

Scoring
- 10-6, 4-8, 81-72

Ranking
- (2,1), (1,2,3), (3,1,2,4)

Score vs rank

Chess: a numeric score is usually not available
Sport games:
- scores may be affected by weather, location, etc.
- can transform score to rank
Online rating:
- rating a book, a movie, etc.
- susceptible to variation among raters

Caution: incomplete scored data

Each rater scores 3 items
items         A     B     C
Nice rater    8     7     6
Mean rater    4     3     2
Avg. score    6     5     4

Each rater scores 2 items
items         A     B     C
Nice rater    -     7     6
Mean rater    4*    3     -
Avg. score    4     5*    6

Effect of rater-to-rater variation
- can be eliminated if each rater scores all items
- can lead to the opposite result with incomplete data
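The reversal in the table above can be checked numerically. This is only an illustrative sketch: the dictionary layout and the helper name `item_averages` are assumptions made here, not part of the slides.

```python
from collections import defaultdict

def item_averages(scores):
    """Average score per item; `scores` maps rater -> {item: score}."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ratings in scores.values():
        for item, value in ratings.items():
            totals[item] += value
            counts[item] += 1
    return {item: totals[item] / counts[item] for item in sorted(totals)}

# Both raters score every item: the rater effect cancels out.
complete = {"nice": {"A": 8, "B": 7, "C": 6}, "mean": {"A": 4, "B": 3, "C": 2}}
# Each rater scores only two items: the rater effect leaks into the averages.
incomplete = {"nice": {"B": 7, "C": 6}, "mean": {"A": 4, "B": 3}}

complete_avg = item_averages(complete)
incomplete_avg = item_averages(incomplete)
print(complete_avg)    # A > B > C
print(incomplete_avg)  # A < B < C, the order is reversed
```

With complete data the order is A, B, C; with incomplete data it flips to C, B, A, although no individual score changed.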

Caution: winning rate for ranked data

Winning proportion: ignores opponents’ strength

 #       player   games    win / lose / tie        winning rate
 1    9  A1       34865    16993 / 16926 / 946     49%
 2    7  A2       23300    14823 / 6794 / 1683     64%
 3    1  A3       14954    13693 / 969 / 292       92%
 4    7  A4       27290    12513 / 13392 / 1385    46%
 5   18  A5       24108    10894 / 12736 / 478     45%
 6    6  A6       18706    10465 / 7106 / 1135     56%
 7    6  A7       10856    10345 / 511 / 0         95%
 8    2  A8       14522    10334 / 3385 / 803      71%
 9    7  A9       15838    10138 / 5340 / 360      64%
10    3  A10      14500    9740 / 3771 / 989       67%

Paired comparison on web

Some websites use the winning proportion, #wins / #comparisons:
- http://whoishotter.com (men & women)
- http://kittenwar.com (kittens)
e.g. 70% of people think Vince is cuter than Savannah.

Disadvantages:
- Ignores opponents’ strength
- Not good unless there are enough votes

Statistical models

Thurstone-Mosteller model (1927)

$X_i$: unobserved actual performance of object $i$
$R$: rank of $k$ objects

For $k = 3$: $R = (1, 3, 2)$ if and only if $X_1 > X_3 > X_2$.

Thurstone (1927) invented this model and proposed using $X_i \sim N(\theta_i, \sigma^2)$, where $\theta_i > 0$ is the strength of object $i$.

Normal curve

$p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$

$P(|X| < 2) \approx 0.95$

[Figure: standard normal density $p(x)$ for $x$ from -4 to 4]

Cumulative distribution function

$X_1 - X_2 \sim N(\theta_1 - \theta_2, 2\sigma^2)$, with $\sigma = 1, 0.5$

$P(X_1 - X_2 > 0) = \Phi\left(\frac{\theta_1 - \theta_2}{\sqrt{2\sigma^2}}\right)$ = Prob(player 1 wins)
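The winning probability above can be evaluated with the standard library alone. A minimal sketch, assuming equal variance $\sigma^2$ for both players; the function names are chosen here for illustration.

```python
import math

def normal_cdf(x):
    """Cdf of N(0, 1), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tm_win_prob(theta1, theta2, sigma=1.0):
    """Thurstone-Mosteller probability that player 1 beats player 2."""
    return normal_cdf((theta1 - theta2) / math.sqrt(2.0 * sigma ** 2))

# A smaller sigma makes the same skill gap more decisive.
for sigma in (1.0, 0.5):
    probs = [round(tm_win_prob(d, 0.0, sigma), 3) for d in (-1.0, 0.0, 1.0)]
    print(sigma, probs)
```

Equal strengths always give 0.5, and swapping the two players gives the complementary probability.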

[Figure: winning probability of player 1 (0 to 1) versus $\theta_1 - \theta_2$ (from -3 to 3)]

Bradley-Terry model (1952)

Object $j$’s skill: $\theta_j > 0$, $j = 1, 2, \ldots, k$

$P(\text{object } i \text{ beats } j) = \frac{\theta_i}{\theta_i + \theta_j}$

Does it relate to unobserved actual performance?

More on the two models

Thurstone-Mosteller: $X_1 - X_2 \sim$ normal
$P(X_1 > X_2) = \Phi\left(\frac{\theta_1 - \theta_2}{\sqrt{2\sigma^2}}\right)$, where $\Phi$ is the cdf of $N(0, 1)$.

Bradley-Terry: $X_1 - X_2 \sim$ logistic
$P(X_1 > X_2) = \frac{\theta_1}{\theta_1 + \theta_2}$
- simpler and popular
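To see why the Bradley-Terry ratio corresponds to a logistic distribution, one can rewrite it on a log scale. The symbol $\beta_i = \log \theta_i$ is introduced here only for this derivation and does not appear in the slides.

```latex
\frac{\theta_1}{\theta_1 + \theta_2}
  = \frac{e^{\beta_1}}{e^{\beta_1} + e^{\beta_2}}
  = \frac{1}{1 + e^{-(\beta_1 - \beta_2)}},
  \qquad \beta_i = \log \theta_i .
```

The right-hand side is the standard logistic cdf evaluated at $\beta_1 - \beta_2$. The Bradley-Terry model therefore has the same form as Thurstone-Mosteller, with $\Phi$ replaced by the logistic cdf and the strengths measured on the log scale.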

Normal vs logistic

[Figure: pdf of the normal (black) and logistic (red) distributions, for x from -4 to 4]

Solving the Bradley-Terry model

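$r_{ij}$: number of times $i$ beats $j$

$P(r_{12} = 1, r_{21} = 2, r_{31} = 2) = \frac{\theta_1}{\theta_1 + \theta_2} \left(\frac{\theta_2}{\theta_2 + \theta_1}\right)^2 \left(\frac{\theta_3}{\theta_3 + \theta_1}\right)^2$

Find the $\theta_i$ that maximize the above probability, subject to $\theta_i > 0$ and $\sum_i \theta_i = 1$. This is called maximum likelihood estimation (MLE).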
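One common way to carry out this maximization is a minorization-maximization (MM) iteration. The slides do not say which solver is used, so the sketch below is an assumption. The win counts are hypothetical, and the code assumes every player has at least one win and one loss, since otherwise the maximum lies on the boundary.

```python
def bradley_terry_mle(wins, iterations=1000):
    """MM iteration for Bradley-Terry; wins[i][j] = times i beats j."""
    n = len(wins)
    theta = [1.0 / n] * n
    for _ in range(iterations):
        new_theta = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Games between i and j, weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (theta[i] + theta[j])
                for j in range(n) if j != i
            )
            new_theta.append(total_wins / denom)
        scale = sum(new_theta)
        theta = [t / scale for t in new_theta]  # enforce sum(theta) = 1
    return theta

# Hypothetical win counts for three players.
hypothetical_wins = [[0, 3, 1], [2, 0, 4], [3, 1, 0]]
theta_hat = bradley_terry_mle(hypothetical_wins)
print([round(t, 3) for t in theta_hat])
```

Each pass needs the full table of results. This is why re-solving the problem after every match becomes expensive when matches arrive continuously.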

How does Bradley-Terry work?

Example: $k = 2$ objects

        NCCU   NTU
NCCU     -      8
NTU      2      -
⇒ $(\theta_1, \theta_2) = (0.8, 0.2)$

If $r_{12} \leftarrow 9$, then $(\theta_1, \theta_2) = (0.82, 0.18)$
- $\theta_1$ increases by 0.02 (beats a weaker team, gains little in $\theta$)
If $r_{21} \leftarrow 3$, then $(\theta_1, \theta_2) = (0.73, 0.27)$
- $\theta_1$ decreases by 0.07 (beaten by a weaker team, loses more in $\theta$)
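For $k = 2$ the maximum likelihood estimate has a closed form: maximizing $\theta_1^{r_{12}} \theta_2^{r_{21}}$ subject to $\theta_1 + \theta_2 = 1$ gives $\theta_1 = r_{12} / (r_{12} + r_{21})$. A short sketch reproducing the three cases above; the function name is chosen here for illustration.

```python
def two_player_mle(r12, r21):
    """Closed-form Bradley-Terry MLE for two objects with sum(theta) = 1."""
    total = r12 + r21
    return r12 / total, r21 / total

cases = [(8, 2), (9, 2), (8, 3)]
for r12, r21 in cases:
    theta1, theta2 = two_player_mle(r12, r21)
    print(r12, r21, round(theta1, 2), round(theta2, 2))
```

One extra win moves $\theta_1$ from 8/10 to 9/11, while one extra loss moves it from 8/10 to 8/11. This is the asymmetry noted above.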

Remarks

Both models have been successful.

However, for games on the internet, there are millions of matches per day. Adjusting the strength parameters by solving a maximum likelihood estimation problem can be slow.

Online ranking system

Online learning

An online method learns from data serially.
Procedure:
- predict the next outcome
- soon the outcome is available
- refine the prediction model
Cases are discarded after processing, so a fixed amount of memory suffices.
Online ranking: online method for games
- rank individuals / estimate their skills

Successful online ranking systems

For two-player games:
1. Elo (Elo (1960)): the first chess rating system with probabilistic underpinning (Hungarian-born American physics professor)
- US Chess Federation (USCF)
- World Chess Federation (FIDE)
- World Football Elo Ratings
- The European Go Federation
- Scrabble, table tennis, etc.

2. Glicko (Glickman (1999)): the first Bayesian rating system (Statistics professor at Boston University; chairman of the U.S. Chess Federation (USCF) ratings committee since 1992)
- the Free Internet Chess Server
- commercial games (Glicko-2, free now)

Elo

Player’s skill is characterized by strength $\theta_i$.
Update rule:

$\theta_i \leftarrow \theta_i + K(s_i - p_{ij})$

- $K$: a constant ($K = 32$ in the USCF system for amateur players)
- $s_i = 1$ if $i$ wins and 0 if $i$ loses
- $p_{ij}$: approximate probability that $i$ beats $j$

Elo: example

A ($\theta_1 = 1500$) competes against B ($\theta_2 = 1500$): $p_{12} = 0.5$, $p_{21} = 0.5$
A wins ($s_1 = 1$, $s_2 = 0$).

$\theta_i \leftarrow \theta_i + K(s_i - p_{ij})$
$\theta_1 \leftarrow \theta_1 + 32(1 - 0.5) = 1516$
$\theta_2 \leftarrow \theta_2 + 32(0 - 0.5) = 1484$

Here $p_{12} = \Phi\left(\frac{\theta_1 - \theta_2}{\sqrt{2\sigma^2}}\right)$; one may use the logistic instead of the normal.
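The update above can be sketched in a few lines. The slides leave the exact form of $p_{ij}$ open. The base-10 logistic curve with a scale of 400 used below is the convention commonly associated with Elo ratings and is an assumption here, as are the function names.

```python
def logistic_expected_score(theta_i, theta_j, scale=400.0):
    """Approximate probability that i beats j (assumed logistic form)."""
    return 1.0 / (1.0 + 10.0 ** ((theta_j - theta_i) / scale))

def elo_update(theta_i, theta_j, s_i, k=32.0):
    """Return updated ratings for i and j after one game; s_i is 1 or 0."""
    p_ij = logistic_expected_score(theta_i, theta_j)
    new_i = theta_i + k * (s_i - p_ij)
    new_j = theta_j + k * ((1.0 - s_i) - (1.0 - p_ij))
    return new_i, new_j

# Two equally rated players, A wins.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Only the two current ratings are needed. No past games are stored, which is what makes the rule an online method.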

Why was Glicko invented?

- Suppose A and B (both rated 1500) played and A won.
- Under the USCF version of Elo, A gains 16 and B loses 16.

Suppose A had just returned to play after years away, but B played frequently.

- A’s rating of 1500 is less reliable
- B’s rating of 1500 is more reliable

Elo vs Glicko

Elo: player’s skill characterized by strength $\theta$

Glicko: $\theta \sim \text{Normal}(\mu, \sigma^2)$; player’s skill characterized by $(\mu, \sigma^2)$

Update rule

Elo update rule:

$\theta_i \leftarrow \theta_i + K(s_i - p_{ij})$

Glicko update rule:

$\mu_i \leftarrow \mu_i + \sum_j g(\sigma_i^2, \sigma_j^2, \mu_i, \mu_j)(s_i - p_{ij})$
$\sigma_i^2 \leftarrow$ (omitted)

Ranking system vs model for ranked data

Models for ranked data:
- Thurstone-Mosteller model (Thurstone, 1927)
- Bradley-Terry model (Bradley and Terry, 1952)

Many online ranking systems are based on these models:
- Elo: TM, later shifted to BT
- Glicko: BT

Bayesian method

The Bayesian method makes use of Bayes’ theorem.

Given new pieces of evidence, Bayes’ theorem modifies probabilities:

$P(D \mid +) = \frac{P(D) \, P(+ \mid D)}{P(+)}$
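As a numerical illustration of the formula above, the sketch below reads $D$ as "has the condition" and $+$ as "tests positive". That reading and all three input probabilities are hypothetical, chosen only to show the arithmetic.

```python
def posterior(prior_d, p_pos_given_d, p_pos_given_not_d):
    """P(D | +) from Bayes' theorem, with P(+) by total probability."""
    p_pos = prior_d * p_pos_given_d + (1.0 - prior_d) * p_pos_given_not_d
    return prior_d * p_pos_given_d / p_pos

# Hypothetical inputs: P(D) = 0.01, P(+|D) = 0.95, P(+|not D) = 0.05.
p_d_given_pos = posterior(0.01, 0.95, 0.05)
print(round(p_d_given_pos, 2))  # about 0.16
```

The evidence raises the probability of $D$ from the prior 0.01 to about 0.16. Rating systems such as Glicko revise a player's skill distribution after each game in the same spirit.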

Multiple teams/players

Multiple teams

horse racing, car racing

The 2002 US NASCAR car racing
- 36 races, each involved 43 drivers
- a total of 87 different drivers competed

Online game

Games played on the internet
- millions of matches per day
- each match has $k \geq 2$ teams
- teams may have unequal numbers of players

Games: Microsoft Xbox title Halo 2

Different game types:
1. Head to Head: 2 players in a game
   - each player forms a team
2. Small Teams: (multi-player) up to 12 players in 2 teams
3. Large Teams: (multi-player) up to 16 players in 2 teams
4. Free for All: (multi-team) up to 8 players in a game
   - each player forms a team

Multiple teams/players

Team 1 (θ1) Adam Alan Andy Amy

Team 2 (θ2) Team 3 (θ3) Bob Brian Christine Dick Emma

TrueSkill™

Herbrich, Graepel and Minka (2006, Microsoft Research) for Xbox Live

Why rank individuals?
- more fun (match players; stimulate interest)

TrueSkill leaderboard: uses $\mu - 3\sigma$

Update rule:
- numerical integration required for $k > 2$

Our method

Weng and Lin (2010): simple analytic form available

Team skill update:

$\mu_i \leftarrow \mu_i + \Omega_i$
$\sigma_i^2 \leftarrow \sigma_i^2 \max(1 - \Delta_i, \kappa)$

Individual skill update:

$\mu_{ij} \leftarrow \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_i^2} \Omega_i$
$\sigma_{ij}^2 \leftarrow \sigma_{ij}^2 \max\left(1 - \frac{\sigma_{ij}^2}{\sigma_i^2} \Delta_i, \kappa\right)$

Our method: toy example

$\mu_i \leftarrow \mu_i + \Omega_i$, $\mu_{ij} \leftarrow \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_i^2} \Omega_i$

If $\Omega_1 = 10$, $\sigma_{11} = 3$, $\sigma_{12} = 5$, $\sigma_{13} = 8$, then $\sigma_1^2 = 3^2 + 5^2 + 8^2 = 98$ and

$\mu_{11} \leftarrow \mu_{11} + \frac{9}{98} \cdot 10$
$\mu_{12} \leftarrow \mu_{12} + \frac{25}{98} \cdot 10$
$\mu_{13} \leftarrow \mu_{13} + \frac{64}{98} \cdot 10$
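The individual mean update in this toy example splits the team increment $\Omega_i$ in proportion to each member's variance $\sigma_{ij}^2$. A minimal sketch of that single step follows. The function name is an assumption, and the variance update is left out because $\Delta_i$ and $\kappa$ are not given numerically.

```python
def individual_mean_increments(omega_i, sigmas):
    """Share of the team increment omega_i received by each team member."""
    team_variance = sum(s ** 2 for s in sigmas)  # sigma_i^2 = sum of sigma_ij^2
    return [omega_i * s ** 2 / team_variance for s in sigmas]

# Omega_1 = 10 and member standard deviations 3, 5, 8 (variances 9, 25, 64).
increments = individual_mean_increments(10.0, [3.0, 5.0, 8.0])
print([round(x, 3) for x in increments])
```

The shares are 9/98, 25/98 and 64/98 of the team increment, so they add up to $\Omega_1 = 10$. The member with the most uncertain skill moves the most.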

Experiments

Experiments: Game Data

Data: generated by Bungie Studios during the beta testing of the Xbox title Halo 2.

Game type       # games    # players
Free for All    5,943      60,022
Small Teams     27,539     4,992
Head to Head    6,227      1,672
Large Teams     1,199      2,576

Table: Data summary

Results

Game type       BT          PL          TM          TrueSkill
Free for All    * 30.59%    31.74%      44.65%      30.82%
Small Teams     * 33.97%    * 33.97%    36.46%      35.23%
Head to Head    32.53%      32.53%      * 32.41%    32.44%
Large Teams     * 37.30%    * 37.30%    39.37%      38.15%

*: best among the four methods

Table: Prediction error. The column “TrueSkill” is copied from a table in Herbrich et al. (2006). Prediction errors are calculated in the same way as for TrueSkill.

Compare TM and BT

BT seems better than TM.

Thurstone-Mosteller: $X_1 - X_2$ follows a normal distribution.
Bradley-Terry: $X_1 - X_2$ follows a logistic distribution.

Most currently used Elo variants use the logistic rather than the normal distribution (corresponding to BT and TM, respectively), because it is argued that weaker players have significantly greater winning chances than the normal model predicts.

Normal vs logistic

[Figure: winning probability of player 1 (y-axis, 0 to 1) versus $\theta_1 - \theta_2$ (x-axis, from -6 to 6). Solid line: normal; dashed line: logistic.]

Normal vs logistic

With the logistic, the weaker player has a bigger winning chance.

[Figure: winning probability of player 1 (y-axis, 0 to 0.018) versus $\theta_1 - \theta_2$ (x-axis, from -6 to -4). Solid line: normal; dashed line: logistic.]
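The tail behaviour can be checked numerically. The slides do not state the scaling behind the two curves, so the sketch below assumes $\sigma = 1$ for the normal model and a unit-scale logistic. The function names are also chosen here.

```python
import math

def normal_win_prob(diff, sigma=1.0):
    """Phi(diff / sqrt(2 sigma^2)), written with the error function."""
    return 0.5 * (1.0 + math.erf(diff / (2.0 * sigma)))

def logistic_win_prob(diff, scale=1.0):
    """Logistic cdf evaluated at the skill difference."""
    return 1.0 / (1.0 + math.exp(-diff / scale))

# Winning chances of a much weaker player 1 under both models.
tail = [(d, normal_win_prob(d), logistic_win_prob(d)) for d in (-6.0, -5.0, -4.0)]
for diff, p_normal, p_logistic in tail:
    print(diff, f"{p_normal:.5f}", f"{p_logistic:.5f}")
```

Under these assumptions the logistic probability is larger than the normal one at every listed difference, in line with the heavier logistic tail.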

Running time

- implementation: C and F#
- TrueSkill is written in F#
- competitive accuracy and shorter running time
  - “Free for All”: F# TrueSkill takes 13 seconds, ours 1.2 seconds
- source code: available online, free to use
  http://www.csie.ntu.edu.tw/~cjlin/papers/online_ranking

Concluding remarks


Have we answered the questions?

- Predict who will win: use $\mu$ or $\mu - 3\sigma$
- How to obtain a global ranking? Who are the top ten? Use $\mu - 2\sigma$
- What if t is extremely large? Use an online method
- What if k > 2 or there are multiple players in teams? TrueSkill or our method

Another rank problem

PageRank:

- a link analysis algorithm named after Larry Page

- used by the Google Internet search engine

- Given a query, it returns a ranked list of pages.

Who links to my site?

node: webpage; line: link

[Figure: link graph over pages A, B, C, D, E]

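Original formula for PageRank

$PR(A) = (1 - \alpha) + \alpha \left( \frac{PR(1)}{C(1)} + \cdots + \frac{PR(n)}{C(n)} \right)$, $\alpha = 0.85$

- pages $1, \ldots, n$ are the pages that link to $A$
- $C(i)$: number of outbound links of page $i$

Example with three pages, where A links to B and C, and B and C each link back to A:

$p_A = 0.15 + 0.85 (p_B / 1 + p_C / 1)$
$p_B = 0.15 + 0.85 (p_A / 2)$
$p_C = 0.15 + 0.85 (p_A / 2)$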
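The three equations above can be solved by repeatedly substituting the current values into the right-hand sides. The sketch below uses this simple fixed-point iteration. The function name, the dictionary representation of the links and the iteration count are assumptions made for illustration.

```python
def pagerank(out_links, alpha=0.85, iterations=100):
    """Iterate PR(p) = (1 - alpha) + alpha * sum(PR(q) / C(q)) over q linking to p."""
    pages = sorted(out_links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            incoming = sum(
                rank[q] / len(out_links[q]) for q in pages if p in out_links[q]
            )
            new_rank[p] = (1.0 - alpha) + alpha * incoming
        rank = new_rank
    return rank

# A links to B and C; B and C each link back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
print({page: round(value, 2) for page, value in ranks.items()})
```

Because $\alpha < 1$, each pass shrinks the remaining error by at least a factor of 0.85, so a fixed number of passes is enough for this small graph.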

$p_A = 1.46$, $p_B = p_C = 0.77$

Other challenge: Recommendation System

- Users view books/movies at Amazon
- Next time, Amazon shows books/movies according to your preferences

- Netflix: an online DVD rental company
- The Netflix Prize (one million USD): http://www.netflixprize.com
- Movie rating system

Statistics is everywhere!

You are welcome to join the ranks of statistics!

Thank you for your attention.
