Bayesian Inference for Plackett-Luce Ranking Models
John Guiver joguiver@microsoft.com
Edward Snelson esnelson@microsoft.com
Microsoft Research Limited, 7 J J Thomson Avenue, Cambridge CB3 0FB, UK

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

This paper gives an efficient Bayesian method for inferring the parameters of a Plackett-Luce ranking model. Such models are parameterised distributions over rankings of a finite set of objects, and have typically been studied and applied within the psychometric, sociometric and econometric literature. The inference scheme is an application of Power EP (expectation propagation). The scheme is robust and can be readily applied to large scale data sets. The inference algorithm extends to variations of the basic Plackett-Luce model, including partial rankings. We show a number of advantages of the EP approach over the traditional maximum likelihood method. We apply the method to aggregate rankings of NASCAR racing drivers over the 2002 season, and also to rankings of movie genres.

1. Introduction

Problems involving ranked lists of items are widespread, and are amenable to the application of machine learning methods. An example is the subfield of "learning to rank" at the cross-over of machine learning and information retrieval (see e.g. Joachims et al., 2007). Another example is rank aggregation and meta-search (Dwork et al., 2001). The proper modelling of observations in the form of ranked items requires us to consider parameterised probability distributions over rankings. This has been an area of study in statistics for some time (see Marden, 1995 for a review), but much of this work has not made its way into the machine learning community. In this paper we study one particular ranking distribution, the Plackett-Luce, which has some very nice properties.

Although parameter estimation in the Plackett-Luce can be achieved via maximum likelihood estimation (MLE) using MM methods (Hunter, 2004), we are unaware of an efficient Bayesian treatment. As we will show, MLE is problematic for sparse data due to overfitting, and it cannot even be found for some data samples that do occur in real situations. Sparse data within the context of ranking is a common scenario for some applications and is typified by having a small number of observations and a large number of items to rank (Dwork et al., 2001), or each individual observation may rank only a few of the total items. We therefore develop an efficient Bayesian approximate inference procedure for the model that avoids overfitting and provides proper uncertainty estimates on the parameters.

The Plackett-Luce distribution derives its name from independent work by Plackett (1975) and Luce (1959). The Luce Choice Axiom is a general axiom governing the choice probabilities of a population of 'choosers', choosing an item from a subset of a set of items. The axiom is best described by a simple illustration. Suppose that the set of items is $\{A, B, C, D\}$, and suppose that the corresponding probabilities of choosing from this set are $(p_A, p_B, p_C, p_D)$. Now consider a subset $\{A, C\}$ with choice probabilities $(q_A, q_C)$. Then Luce's choice axiom states that $q_A / q_C = p_A / p_C$. In other words, the choice probability ratio between two items is independent of any other items in the set.

Suppose we consider a set of items, and a set of choice probabilities that satisfy Luce's axiom, and consider picking one item at a time out of the set, according to the choice probabilities. Such samples give a total ordering of items, which can be considered as a sample from a distribution over all possible orderings. The form of such a distribution was first considered by Plackett (1975) in order to model probabilities in a K-horse race.

The Plackett-Luce model is applicable when each observation provides either a complete ranking of all items, or a partial ranking of only some of the items, or a ranking of the top few items (see section 3.5 for the last two scenarios). The applications of the Plackett-Luce distribution and its extensions have been quite varied, including horse-racing (Plackett, 1975), document ranking (Cao et al., 2007), assessing potential demand for electric cars (Beggs et al., 1981), modelling electorates (Gormley & Murphy, 2005), and modelling dietary preferences in cows (Nombekela et al., 1994).

Inferring the parameters of the Plackett-Luce distribution is typically done by maximum likelihood estimation (MLE). Hunter (2004) has described an efficient MLE method based on a minorise/maximise (MM) algorithm. In recent years, powerful new message-passing algorithms have been developed for doing approximate deterministic Bayesian inference on large belief networks. Such algorithms are typically both accurate and highly scalable to large real-world problems. Minka (2005) has provided a unified view of these algorithms, and shown that they differ solely by the measure of information divergence that they minimise. We apply Power EP (Minka, 2004), an algorithm in this framework, to perform Bayesian inference for Plackett-Luce models.

In section 2, we take a more detailed look at the Plackett-Luce distribution, motivating it with some alternative interpretations. In section 3, we describe the algorithm at a level of detail where it should be possible for the reader to implement the algorithm in code, giving derivations where needed. In section 4, we apply the algorithm to data generated from a known distribution, to an aggregation of 2002 NASCAR race results, and also to the ranking of genres in the MovieLens data set. Section 5 provides brief conclusions.

2. Plackett-Luce models

A good source for the material in this section, and for rank distributions in general, is the book by Marden (1995). Consider an experiment where $N$ judges are asked to rank $K$ items, and assume no ties. The outcome of the experiment is a set of $N$ rankings $\{y^{(n)} \equiv (y_1^{(n)}, \ldots, y_K^{(n)}) \mid n = 1, \ldots, N\}$ where a ranking is defined as a permutation of the $K$ rank indices; in other words, judge $n$ ranks item $i$ in position $y_i^{(n)}$ (where highest rank is position 1). Each ranking has an associated ordering $\omega^{(n)} \equiv (\omega_1^{(n)}, \ldots, \omega_K^{(n)})$, where an ordering is defined as a permutation of the $K$ item indices; in other words, judge $n$ puts item $\omega_i^{(n)}$ in position $i$. Rankings and orderings are related by (dropping the judge index) $\omega_{y_i} = i$, $y_{\omega_i} = i$.

The Plackett-Luce (P-L) model is a distribution over rankings $y$ which is best described in terms of the associated ordering $\omega$. It is parameterised by a vector $v = (v_1, \ldots, v_K)$ where $v_i \geq 0$ is associated with item index $i$:

$$\mathrm{PL}(\omega \mid v) = \prod_{k=1,\ldots,K} f_k(v) \tag{1}$$

where

$$f_k(v) \equiv f_k(v_{\omega_k}, \ldots, v_{\omega_K}) \triangleq \frac{v_{\omega_k}}{v_{\omega_k} + \cdots + v_{\omega_K}} \tag{2}$$

2.1. Vase model interpretation

The vase model metaphor is due to Silverberg (1980). Consider a multi-stage experiment where at each stage we are drawing a ball from a vase of coloured balls. The numbers of balls of each colour are in proportion to the $v_{\omega_k}$. A vase differs from an urn only in that it has an infinite number of balls, thus allowing non-rational proportions. At the first stage a ball $\omega_1$ is drawn from the vase; the probability of this selection is $f_1(v)$. At the second stage, another ball is drawn: if it is the same colour as the first, then put it back, and keep on trying until a new colour $\omega_2$ is selected; the probability of this second selection is $f_2(v)$. Continue through the stages until a ball of each colour has been selected. It is clear that equation 1 represents the probability of this sequence. The vase model interpretation also provides a starting point for extensions to the basic P-L model detailed by Marden (1995), for example, capturing the intuition that judges make more accurate judgements at the higher ranks.

2.2. Thurstonian interpretation

A Thurstonian model (Thurstone, 1927) assumes an unobserved random score variable $x_i$ (typically independent) for each item. Drawing from the score distributions and sorting according to sampled score provides a sample ranking, so the distribution over scores induces a distribution over rankings. A key result, due to Yellott (1977), says that if the score variables are independent, and the score distributions are identical except for their means, then the score distributions give rise to a P-L model if and only if the scores are distributed according to a Gumbel distribution.

The CDF $G(x \mid \mu, \beta)$ and PDF $g(x \mid \mu, \beta)$ of a Gumbel distribution are given by

$$G(x \mid \mu, \beta) = e^{-z} \tag{3}$$

$$g(x \mid \mu, \beta) = \frac{z}{\beta} e^{-z} \tag{4}$$

where $z(x) = e^{-\frac{x - \mu}{\beta}}$. For a fixed $\beta$, $g(x \mid \mu, \beta)$ is an exponential family distribution with natural parameter $v = e^{\mu / \beta}$ which has a Gamma distribution conjugate prior. The use of the notation $v$ for this natural parameter is deliberate: it turns out that $v_i = e^{\mu_i / \beta}$ is the P-L parameter for the $i$th item in the ranking distri-

the $v_i$ are positive values, it is natural to assign them Gamma distribution priors, and this is reinforced by the discussion in section 2.2.
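As a numerical illustration of sections 2.1 and 2.2, the sketch below evaluates equation 1 directly and compares it with empirical ordering frequencies obtained from the vase model and from sorted Gumbel scores with $\mu_i = \beta \log v_i$. It is an illustration only, not the inference algorithm of this paper: the function names (`pl_log_prob`, `sample_vase`, `sample_gumbel`, `empirical_freqs`), the parameter vector, the sample size and the seeds are all assumptions chosen for the example, and the largest score is assumed to take position 1.

```python
import itertools
import numpy as np

def pl_log_prob(ordering, v):
    """Log of equation 1: sum over stages k of log f_k(v) from equation 2."""
    w = np.asarray(v, dtype=float)[list(ordering)]  # v_{omega_1}, ..., v_{omega_K}
    tail_sums = np.cumsum(w[::-1])[::-1]            # v_{omega_k} + ... + v_{omega_K}
    return float(np.sum(np.log(w) - np.log(tail_sums)))

def sample_vase(v, rng):
    """Vase model: choose items one at a time with probability proportional
    to v among the items not yet chosen."""
    v = np.asarray(v, dtype=float)
    remaining = list(range(len(v)))
    ordering = []
    while remaining:
        p = v[remaining] / v[remaining].sum()
        pick = rng.choice(len(remaining), p=p)
        ordering.append(remaining.pop(pick))
    return tuple(ordering)

def sample_gumbel(v, rng, beta=1.0):
    """Thurstonian model: independent Gumbel scores with mu_i = beta * log(v_i),
    sorted so that the largest score is placed first (an assumed convention)."""
    mu = beta * np.log(np.asarray(v, dtype=float))
    scores = rng.gumbel(loc=mu, scale=beta)
    return tuple(int(i) for i in np.argsort(-scores))

def empirical_freqs(sampler, v, n_samples, seed):
    """Relative frequency of each ordering over n_samples draws."""
    rng = np.random.default_rng(seed)
    counts = {}
    for _ in range(n_samples):
        o = sampler(v, rng)
        counts[o] = counts.get(o, 0) + 1
    return {o: c / n_samples for o, c in counts.items()}

v = [3.0, 2.0, 1.0]  # illustrative P-L parameters, not taken from the paper
exact = {o: float(np.exp(pl_log_prob(o, v)))
         for o in itertools.permutations(range(len(v)))}
vase = empirical_freqs(sample_vase, v, 10000, seed=0)
gumbel = empirical_freqs(sample_gumbel, v, 10000, seed=1)

# Columns: ordering, exact probability, vase frequency, Gumbel frequency.
for o in sorted(exact):
    print(o, round(exact[o], 3), round(vase.get(o, 0.0), 3), round(gumbel.get(o, 0.0), 3))
```

Under these assumptions the three columns should agree up to Monte Carlo error. The computation is done in log space because a product of many factors $f_k(v)$ can underflow when $K$ is large.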