Factor Analysis for Ranking Data
Philip L.H. Yu, K.F. Lam and S.M. Lo
Department of Statistics and Actuarial Science
The University of Hong Kong
Pokfulam Road, HONG KONG
Suppose that, in a random sample of $n$ individuals, each individual is asked to rank a set of $k$ items according to a certain preference criterion. In other words, individual $i$ provides a ranking of the $k$ items, $r_i = (r_{i1}, \ldots, r_{ik})^T$, where $r_{ij}$ is the rank of item $j$ given by individual $i$. Smaller ranks refer to the more preferred items.
To model the ranking data, we assume that each ranking $r_i$ is generated according to the ordering of $k$ latent utilities $x_{i1}, \ldots, x_{ik}$ assigned by each individual. For example, if $r_i = (2, 3, 1)^T$ is recorded, we have $x_{i2} < x_{i1} < x_{i3}$. The utilities $x_i = (x_{i1}, \ldots, x_{ik})^T$ are assumed to satisfy a $d$-factor model:

\[
x_{ij} = z_i^T a_j + b_j + \varepsilon_{ij} \qquad (i = 1, \ldots, n;\ j = 1, \ldots, k\ (> d)),
\]

where $a_j = (a_{1j}, \ldots, a_{dj})^T$ describes the factor loadings and $z_i = (z_{i1}, \ldots, z_{id})^T$ is the vector of latent common factors, distributed as $N(0, I_d)$. The item mean vector $b = (b_1, \ldots, b_k)^T$ reflects the "importance" of each item. The error term, $\varepsilon_{ij}$, is the unique factor, which is assumed to follow a $N(0, \sigma_j^2)$ distribution, independent of $z_1, \ldots, z_n$. Notationally, denote by $A_{d \times k} = [a_1 \cdots a_k]$ the loading matrix, by $\Psi_{k \times k}$ the diagonal matrix with $\mathrm{diag}(\Psi) = (\sigma_1^2, \ldots, \sigma_k^2)$ and other entries equal to zero, and by $\theta = \{A, b, \Psi\}$ the set of parameters of interest.

1. MONTE CARLO EM ALGORITHM

Let $X_{n \times k}$ and $Z_{n \times d}$ be the matrices of the unobservable response utilities and latent common factors, respectively, with their $i$-th rows corresponding to the $i$-th individual. Denote by $R_{n \times k} = [r_1, \ldots, r_n]^T$ the matrix of the observed rankings. We treat $\{X, Z\}$ as missing data and $R$ as observed data.

1.1 Implementing the E-step via the Gibbs Sampler

The E-step here only involves computation of the conditional expectations of the complete-data sufficient statistics, $\{X^T X,\ Z^T Z,\ Z^T X,\ 1^T X,\ 1^T Z\}$, given $R$ and $\theta$. To find the conditional expectation of $1^T X$, we use the Gibbs sampling algorithm, which consists of drawing samples consecutively from the full conditional posterior distributions: (a) draw $z_i$ from $f(z_i \mid x_i, r_i, \theta)$ and (b) draw $x_i$ from $f(x_i \mid z_i, r_i, \theta)$, for $i = 1, \ldots, n$.

The conditional expectations of $1^T X$ and $X^T X$ can be approximated by taking the average of the random draws of $x_i$ and the average of their product sum $\sum_i x_i x_i^T$, respectively. Finally, the conditional expectations of $1^T Z$, $Z^T Z$ and $Z^T X$ can be obtained similarly, as in Meng and Schilling (1996).
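The alternation between steps (a) and (b) in Section 1.1 can be sketched for a single individual as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it is restricted to the one-factor case ($d = 1$) so that no matrix algebra is needed, it updates the utilities one component at a time from truncated normal distributions (one common way to realise step (b), which the text does not spell out), and the function names, starting values and parameter values are all hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf / inverse cdf


def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Inverse-CDF draw from N(mean, sd^2) restricted to [lower, upper]."""
    lo = STD_NORMAL.cdf((lower - mean) / sd)
    hi = STD_NORMAL.cdf((upper - mean) / sd)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
    x = mean + sd * STD_NORMAL.inv_cdf(u)
    return min(max(x, lower), upper)  # guard against rounding in the tails


def gibbs_expected_utilities(r, a, b, sigma, n_draws=2000, burn_in=200, seed=0):
    """Monte Carlo estimate of E[x_i | r_i, theta] for one individual (d = 1).

    r     : ranks of the k items (1 = most preferred)
    a, b  : factor loadings and item means (assumed one-factor model)
    sigma : standard deviations of the unique factors
    """
    rng = random.Random(seed)
    k = len(r)
    # Hypothetical starting point: any utilities consistent with the ranking.
    x = [-float(rank) for rank in r]
    sums = [0.0] * k
    for sweep in range(burn_in + n_draws):
        # Step (a): z | x, theta is normal (conjugate update, one factor).
        precision = 1.0 + sum(a[j] ** 2 / sigma[j] ** 2 for j in range(k))
        mean_z = sum(a[j] * (x[j] - b[j]) / sigma[j] ** 2 for j in range(k)) / precision
        z = rng.gauss(mean_z, precision ** -0.5)
        # Step (b): x_j | z, r, theta is normal, truncated so that the
        # ordering of the utilities stays consistent with the ranking r.
        for j in range(k):
            lower = max((x[l] for l in range(k) if r[l] > r[j]), default=float("-inf"))
            upper = min((x[l] for l in range(k) if r[l] < r[j]), default=float("inf"))
            x[j] = draw_truncated_normal(a[j] * z + b[j], sigma[j], lower, upper, rng)
        if sweep >= burn_in:
            for j in range(k):
                sums[j] += x[j]
    return [s / n_draws for s in sums]


if __name__ == "__main__":
    # Ranking (2, 3, 1): item 3 is most preferred, item 2 least preferred.
    print(gibbs_expected_utilities([2, 3, 1], [1.0, 0.5, -0.5],
                                   [0.0, 0.2, 0.4], [1.0, 1.0, 1.0]))
```

Averaging the retained draws gives the individual's contribution to the conditional expectation of $1^T X$; the other sufficient statistics would be accumulated in the same loop. For $d > 1$, step (a) would need the multivariate normal analogue of the conjugate update.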
1.2 M-step

By replacing the complete-data sufficient statistics with their corresponding conditional expectations obtained in the E-step, a closed-form maximum likelihood estimate of $\theta$ can be obtained. The new value of $\theta$ is then used to calculate the conditional expectations of the sufficient statistics in the next E-step, and the algorithm is iterated until convergence is attained.

1.3 Determining Convergence of MCEM via Bridge Sampling

Because of the simulation variability introduced by the Gibbs sampler in the E-step, the MCEM estimates may fluctuate around a stationary point $\theta^*$ even on convergence. Detecting convergence by setting an upper bound for the relative differences between consecutive iterates is therefore impractical. To monitor convergence of the MCEM algorithm, we use the bridge sampling criterion discussed by Meng and Wong (1996). The bridge sampling estimate for the $i$-th ratio is given by

\[
\frac{L(\theta^{(t+1)} \mid x_i, z_i)}{L(\theta^{(t)} \mid x_i, z_i)}
= \frac{\displaystyle\sum_{m=1}^{M} \left[ \frac{L(\theta^{(t+1)} \mid x_i^{(t,m)}, z_i^{(t,m)})}{L(\theta^{(t)} \mid x_i^{(t,m)}, z_i^{(t,m)})} \right]^{1/2}}
{\displaystyle\sum_{m=1}^{M} \left[ \frac{L(\theta^{(t)} \mid x_i^{(t+1,m)}, z_i^{(t+1,m)})}{L(\theta^{(t+1)} \mid x_i^{(t+1,m)}, z_i^{(t+1,m)})} \right]^{1/2}},
\]

where $\{x_i^{(t,m)}, z_i^{(t,m)};\ m = 1, \ldots, M\}$ denote the $M$ Gibbs samples from $f(x_i \mid z_i, r_i, \theta^{(t)})$ and $f(z_i \mid x_i, r_i, \theta^{(t)})$, with $\theta^{(t)}$ being the $t$-th iterate of $\theta$. The estimate for the log-likelihood ratio of two consecutive iterates is then given by

\[
\hat{h}(\theta^{(t+1)}, \theta^{(t)}) = \sum_{i=1}^{n} \ln \frac{L(\theta^{(t+1)} \mid x_i, z_i)}{L(\theta^{(t)} \mid x_i, z_i)}.
\]

We plot $\hat{h}(\theta^{(t+1)}, \theta^{(t)})$ against $t$ to determine the convergence of the MCEM algorithm. A curve converging to zero indicates convergence, because EM should increase the likelihood at each step.

2. MODEL SELECTION AND TEST OF FIT

To determine the number of factors required, we adopt the standard likelihood ratio test to test for any significant improvement in the fit of a higher-dimensional factor model.
Evaluating the observed-data likelihood is not a trivial task here because it cannot be computed analytically. We suggest simulating the observed-data log-likelihood with the GHK simulator; the standard likelihood ratio test can then be used for model selection.

To assess the goodness-of-fit of the selected model, we combine subsets of rankings to examine the fit for each subgroup. Let $p_j = P(x_j > x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k)$ be the partial probability of ranking item $j$ as first and $n_j$ be the observed number of individuals with item $j$ ranked as the top item. The estimated partial probabilities, denoted by $\hat{p}_j$, can also be simulated by the GHK simulator given the maximum likelihood estimate of $\theta$. The fit can be examined by calculating the (approximate) standardized residuals

\[
e_j = \frac{n_j - n\hat{p}_j}{\sqrt{n\hat{p}_j (1 - \hat{p}_j)}} \qquad (j = 1, \ldots, k).
\]

These residuals allow us to perform a goodness-of-fit test for the factor model and to identify which part of the data is (or is not) well fitted.

MAIN REFERENCES

Meng, X.L., and Schilling, S. (1996) "Fitting Full-Information Item Factor Models and an Empirical Investigation of Bridge Sampling", JASA, 91, 1254-1267.

Meng, X.L., and Wong, W.H. (1996) "Simulating Ratios of Normalizing Constants via a Simple Identity: a Theoretical Exploration", Statistica Sinica, 6, 831-860.