Factor Analysis for Ranking Data


Philip L.H. Yu, K.F. Lam and S.M. Lo

Department of Statistics and Actuarial Science

The University of Hong Kong

Pokfulam Road, HONG KONG

[email protected]

Suppose in a random sample of $n$ individuals, each individual is asked to rank a set of $k$ items according to a certain preference criterion. In other words, individual $i$ provides a ranking of $k$ items, $r_i = (r_{i1}, \ldots, r_{ik})$, where $r_{ij}$ is the rank of item $j$ from individual $i$. Smaller ranks refer to the more preferred items.

To model the ranking data, we assume that each ranking $r_i$ is generated according to the ordering of $k$ latent utilities $x_{i1}, \ldots, x_{ik}$ assigned by each individual. For example, if $r_i = (2, 3, 1)$ is recorded, we have $x_{i2} < x_{i1} < x_{i3}$. The utilities $x_{ij}$ are assumed to satisfy the factor model:

$$x_{ij} = z_i^T a_j + b_j + \varepsilon_{ij} \qquad (i = 1, \ldots, n;\ j = 1, \ldots, k\ (> d)),$$

where $a_j = (a_{1j}, \ldots, a_{dj})^T$ describes the factor loadings and $z_i = (z_{i1}, \ldots, z_{id})^T$ is the vector of latent common factors distributed as $N(0, I_d)$. The item mean vector $b = (b_1, \ldots, b_k)$ reflects the "importance" of each item. The error term, $\varepsilon_{ij}$, is the unique factor, which is assumed to follow a $N(0, \sigma_j^2)$ distribution, independent of $z_1, \ldots, z_n$. Notationally, denote $A_{d \times k} = [a_1 \cdots a_k]$, $\Psi_{k \times k}$ the diagonal matrix with $\mathrm{diag}(\Psi) = (\sigma_1^2, \ldots, \sigma_k^2)$ and other entries equal to zero, and $\theta = \{A, b, \Psi\}$ the set of parameters of interest.
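To make the data-generating mechanism concrete, the following Python sketch simulates rankings from the factor model above. The function names, the list-of-lists layout for the loadings (`A[q][j]`, factor `q`, item `j`) and the argument name `unique_var` for the unique-factor variances are assumptions made for illustration; only the standard library is used.

```python
import random


def utilities_to_ranks(x):
    """Convert utilities to ranks: the largest utility gets rank 1."""
    order = sorted(range(len(x)), key=lambda j: -x[j])
    ranks = [0] * len(x)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def simulate_rankings(n, A, b, unique_var, seed=0):
    """Draw n rankings from x_ij = z_i^T a_j + b_j + eps_ij.

    A          : d x k loadings, A[q][j] is the loading of item j on factor q
    b          : k item means
    unique_var : k unique-factor variances
    """
    rng = random.Random(seed)
    d, k = len(A), len(b)
    rankings = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]  # common factors ~ N(0, I_d)
        x = [
            sum(z[q] * A[q][j] for q in range(d))
            + b[j]
            + rng.gauss(0.0, unique_var[j] ** 0.5)   # unique factor
            for j in range(k)
        ]
        rankings.append(utilities_to_ranks(x))
    return rankings


# The example from the text: x_i2 < x_i1 < x_i3 corresponds to r_i = (2, 3, 1).
print(utilities_to_ranks([0.5, -1.0, 2.0]))  # [2, 3, 1]
```

Only the ordering of the utilities is observed, which is why the estimation procedure treats the utilities themselves as missing data.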

1. MONTE CARLO EM ALGORITHM

Let $X_{n \times k}$ and $Z_{n \times d}$ be the matrices of the unobservable response utilities and latent common factors, respectively, with their $i$th rows corresponding to the $i$th individual. Denote by $R_{n \times k} = [r_1, \ldots, r_n]$ the matrix of the observed rankings. We treat $\{X, Z\}$ as missing data and $R$ as observed data.

1.1 Implementing the E-step via the Gibbs Sampler

The E-step here only involves computation of the conditional expectations of the complete-data sufficient statistics, $\{X^T X, Z^T Z, Z^T X, 1^T X, 1^T Z\}$, given $R$ and $\theta$. To find the conditional expectation of $1^T X$, we use the Gibbs sampling algorithm, which consists of drawing samples consecutively from the full conditional posterior distributions: (a) draw $z_i$ from $f(z_i \mid x_i, r_i, \theta)$ and (b) draw $x_i$ from $f(x_i \mid z_i, r_i, \theta)$, for $i = 1, \ldots, n$.

The conditional expectations of $1^T X$ and $X^T X$ can be approximated by taking the average of the random draws of $x_i$ and the average of their product sum $x_i x_i^T$, respectively. Finally, the conditional expectations of $1^T Z$, $Z^T Z$ and $Z^T X$ can be obtained similarly, as in Meng and Schilling (1996).
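One sweep of steps (a) and (b) for a single individual might look like the Python sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the common factors are updated one component at a time (a block draw from the joint normal conditional is equally valid), each utility is drawn from a normal distribution truncated to lie between the utilities of its neighbours in the observed ranking, and the names `gibbs_sweep`, `draw_truncated_normal` and `unique_var` are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_normal(rng, mean, sd, lower, upper):
    """Inverse-CDF draw from N(mean, sd^2) restricted to [lower, upper]."""
    lo = STD_NORMAL.cdf((lower - mean) / sd)
    hi = STD_NORMAL.cdf((upper - mean) / sd)
    u = lo + (hi - lo) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf inside (0, 1)
    value = mean + sd * STD_NORMAL.inv_cdf(u)
    return min(max(value, lower), upper)  # guard against round-off


def gibbs_sweep(rng, x, z, ranks, A, b, unique_var):
    """Update (z, x) in place for one individual and return them."""
    d, k = len(A), len(b)
    # (a) z | x: given x the ranking carries no extra information, so each
    # component of z has a normal conditional distribution.
    for q in range(d):
        precision = 1.0 + sum(A[q][j] ** 2 / unique_var[j] for j in range(k))
        mean = 0.0
        for j in range(k):
            others = sum(z[p] * A[p][j] for p in range(d) if p != q)
            mean += A[q][j] * (x[j] - b[j] - others) / unique_var[j]
        z[q] = rng.gauss(mean / precision, precision ** -0.5)
    # (b) x_j | z, r, other utilities: truncated normal that preserves the ranking.
    for j in range(k):
        # Items ranked better than j have larger utilities, so they bound x_j above.
        upper = min((x[l] for l in range(k) if ranks[l] < ranks[j]), default=math.inf)
        lower = max((x[l] for l in range(k) if ranks[l] > ranks[j]), default=-math.inf)
        mean = sum(z[q] * A[q][j] for q in range(d)) + b[j]
        x[j] = draw_truncated_normal(rng, mean, unique_var[j] ** 0.5, lower, upper)
    return x, z
```

The sampler has to start from utilities that agree with the observed ranking, for example `x = [-float(r) for r in ranks]`. Averaging the draws of `x` and `z` and their cross products over sweeps and individuals gives Monte Carlo approximations of the conditional expectations needed in the E-step.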

1.2 M-step

By replacing the complete-data sufficient statistics with their corresponding conditional expectations obtained in the E-step, a closed-form maximum likelihood estimate of $\theta$ can be obtained. The new $\theta$ is then used to calculate the conditional expectations of the sufficient statistics in the next E-step, and the algorithm is iterated until convergence is attained.
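The text does not spell out the closed form. A plausible version, assuming the M-step is the usual complete-data normal regression of each item's utility on the common factors plus an intercept, is sketched below. The symbols $S_{xx}$, $S_{zz}$, $S_{zx}$, $s_x$ and $s_z$ are introduced here for illustration only, $b$ is treated as a column vector, $\sigma_j^2$ denotes the unique-factor variance of item $j$, and any identifiability constraints the ranking model may require are not shown.

```latex
% Conditional expectations supplied by the E-step (notation assumed for this sketch):
%   S_{xx} = E[X^T X | R],  S_{zz} = E[Z^T Z | R],  S_{zx} = E[Z^T X | R],
%   s_x    = E[X^T 1 | R],  s_z    = E[Z^T 1 | R].
\begin{pmatrix} \hat{A} \\ \hat{b}^T \end{pmatrix}
  = \begin{pmatrix} S_{zz} & s_z \\ s_z^T & n \end{pmatrix}^{-1}
    \begin{pmatrix} S_{zx} \\ s_x^T \end{pmatrix},
\qquad
\hat{\sigma}_j^2
  = \frac{1}{n} \left[ (S_{xx})_{jj}
    - \begin{pmatrix} \hat{a}_j^T & \hat{b}_j \end{pmatrix}
      \begin{pmatrix} (S_{zx})_{\cdot j} \\ (s_x)_j \end{pmatrix} \right],
\qquad j = 1, \ldots, k.
```

The first expression is the set of normal equations for regressing the utilities on the factors and a constant; the second is the corresponding expected residual sum of squares for item $j$ divided by $n$.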

1.3 Determining Convergence of MCEM via Bridge Sampling

Because of the simulation variability introduced by the Gibbs sampler in the E-step, the MCEM estimates may fluctuate around a stationary point $\theta^*$ even on convergence. Detecting convergence by setting an upper bound for the relative differences between consecutive iterates is therefore impractical. To monitor convergence of the MCEM algorithm, we use the bridge sampling criterion discussed by Meng and Wong (1996). The bridge sampling estimate for the $i$th likelihood ratio is given by

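$$\frac{L(\theta^{(t+1)} \mid x_i, z_i)}{L(\theta^{(t)} \mid x_i, z_i)} = \frac{\displaystyle \sum_{m=1}^{M} \left[ \frac{L(\theta^{(t+1)} \mid x_i^{(t,m)}, z_i^{(t,m)})}{L(\theta^{(t)} \mid x_i^{(t,m)}, z_i^{(t,m)})} \right]^{1/2}}{\displaystyle \sum_{m=1}^{M} \left[ \frac{L(\theta^{(t)} \mid x_i^{(t+1,m)}, z_i^{(t+1,m)})}{L(\theta^{(t+1)} \mid x_i^{(t+1,m)}, z_i^{(t+1,m)})} \right]^{1/2}},$$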

where $\{x_i^{(t,m)}, z_i^{(t,m)};\ m = 1, \ldots, M\}$ denote the $M$ Gibbs samples from $f(x_i \mid z_i, r_i, \theta^{(t)})$ and $f(z_i \mid x_i, r_i, \theta^{(t)})$, with $\theta^{(t)}$ being the $t$th iterate of $\theta$. The estimate for the log-likelihood ratio of two consecutive iterates is then given by

$$h(\theta^{(t+1)}, \theta^{(t)}) = \sum_{i=1}^{n} \ln \frac{L(\theta^{(t+1)} \mid x_i, z_i)}{L(\theta^{(t)} \mid x_i, z_i)}.$$

We plot $h(\theta^{(t+1)}, \theta^{(t)})$ against $t$ to determine the convergence of the MCEM algorithm. A curve converging to zero indicates convergence, because EM should increase the likelihood at each step.
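The convergence statistic can be computed on the log scale to avoid overflow. The Python sketch below assumes the complete-data log-likelihoods have already been evaluated at the Gibbs draws from iterations $t$ and $t+1$, with the same number of draws $M$ in both; the function and argument names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_bridge_ratio(ll_new_at_old, ll_old_at_old, ll_old_at_new, ll_new_at_new):
    """Log of the bridge sampling estimate of one individual's likelihood ratio.

    ll_new_at_old[m] : log L(theta^(t+1) | draw m generated under theta^(t))
    ll_old_at_old[m] : log L(theta^(t)   | draw m generated under theta^(t))
    ll_old_at_new[m] : log L(theta^(t)   | draw m generated under theta^(t+1))
    ll_new_at_new[m] : log L(theta^(t+1) | draw m generated under theta^(t+1))
    """
    # Square roots of likelihood ratios become halved log differences.
    numerator = [0.5 * (a - b) for a, b in zip(ll_new_at_old, ll_old_at_old)]
    denominator = [0.5 * (a - b) for a, b in zip(ll_old_at_new, ll_new_at_new)]
    return log_sum_exp(numerator) - log_sum_exp(denominator)


def h_statistic(per_individual_inputs):
    """Sum the log bridge ratios over individuals.

    per_individual_inputs: iterable of 4-tuples, one per individual, in the
    argument order of log_bridge_ratio.
    """
    return sum(log_bridge_ratio(*inputs) for inputs in per_individual_inputs)
```

Plotting `h_statistic` against the iteration number gives the convergence curve described above. When the two parameter values coincide every log difference is zero, so the statistic is exactly zero.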

2. MODEL SELECTION AND TEST OF FIT

To determine the number of factors required, we adopt the standard likelihood ratio test to test for any significant improvement in the fit of a higher-dimensional factor model. Evaluating the observed-data likelihood is not a trivial task here because it cannot be computed analytically. We suggest simulating the observed-data log-likelihood by the GHK simulator; the standard likelihood ratio test can then be used for model selection.
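The text names the GHK simulator without giving details. The Python sketch below shows one common way to write it for the probability of a single observed ranking under the factor model, and is not the authors' implementation. It rewrites the ranking event as positivity of the differences between utilities of adjacently ranked items, then simulates those differences sequentially through a Cholesky factor. All function and argument names are assumptions, and only the standard library is used.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist()


def cholesky(S):
    """Lower-triangular C with C C^T = S, for a small positive-definite S."""
    n = len(S)
    C = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = S[r][c] - sum(C[r][p] * C[c][p] for p in range(c))
            C[r][c] = math.sqrt(s) if r == c else s / C[c][c]
    return C


def ghk_ranking_probability(ranks, A, b, unique_var, draws=2000, seed=0):
    """Simulate P(ranking) when utilities are N(b, A^T A + diag(unique_var))."""
    rng = random.Random(seed)
    d, k = len(A), len(b)
    order = sorted(range(k), key=lambda j: ranks[j])  # most preferred first
    sigma = [[sum(A[q][i] * A[q][j] for q in range(d))
              + (unique_var[j] if i == j else 0.0)
              for j in range(k)] for i in range(k)]
    # Differences w_l = x_order[l] - x_order[l+1]; the ranking holds iff all w_l > 0.
    pairs = [(order[l], order[l + 1]) for l in range(k - 1)]
    mean = [b[i] - b[j] for i, j in pairs]
    cov = [[sigma[i][g] - sigma[i][h] - sigma[j][g] + sigma[j][h]
            for g, h in pairs] for i, j in pairs]
    C = cholesky(cov)
    total = 0.0
    for _ in range(draws):
        eta, weight = [], 1.0
        for l in range(k - 1):
            # w_l > 0 is equivalent to eta_l > threshold, given earlier eta's.
            threshold = -(mean[l] + sum(C[l][p] * eta[p] for p in range(l))) / C[l][l]
            p_low = PHI.cdf(threshold)
            weight *= 1.0 - p_low
            u = p_low + (1.0 - p_low) * rng.random()  # truncated-normal draw
            eta.append(PHI.inv_cdf(min(max(u, 1e-12), 1.0 - 1e-12)))
        total += weight
    return total / draws
```

Under these assumptions, a simulated observed-data log-likelihood is the sum over individuals of `math.log(ghk_ranking_probability(ranks_i, A, b, unique_var))`. A simple sanity check is that with no common-factor loadings, equal item means and equal variances, every ranking of three items should have probability close to 1/6.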

To assess the goodness-of-fit of the selected model, we combine subsets of rankings to examine the fit for each subgroup. Let $p_j = P(x_j > x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k)$ be the partial probability of ranking item $j$ first and $n_j$ be the observed number of individuals with item $j$ ranked as the top item. The estimated partial probabilities, denoted by $\hat{p}_j$, can also be simulated by the GHK simulator given the maximum likelihood estimate of $\theta$. The fit can be examined by calculating the (approximate) standardized residuals

$$e_j = \frac{n_j - n\hat{p}_j}{\sqrt{n\hat{p}_j(1 - \hat{p}_j)}} \qquad (j = 1, \ldots, k).$$

These residuals allow us to perform a goodness-of-fit test for the factor model and identify which part of the data is (or is not) well fitted.
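As a small worked illustration of the residuals, the Python sketch below counts how often each item is ranked first and standardizes the difference from its expected count. It assumes the estimated top-choice probabilities are already available (for example from a simulator); the function name and the toy data are hypothetical.

```python
import math


def top_choice_residuals(rankings, p_hat):
    """Approximate standardized residuals for the top-choice counts.

    rankings : list of rank vectors; rank 1 marks the most preferred item
    p_hat    : estimated probability that each item is ranked first
    """
    n, k = len(rankings), len(p_hat)
    counts = [sum(1 for r in rankings if r[j] == 1) for j in range(k)]
    return [
        (counts[j] - n * p_hat[j]) / math.sqrt(n * p_hat[j] * (1.0 - p_hat[j]))
        for j in range(k)
    ]


# Toy data: item 1 is ranked first by 3 of 4 individuals, item 2 by 1, item 3 by none.
toy_rankings = [[1, 2, 3], [1, 3, 2], [1, 2, 3], [2, 1, 3]]
residuals = top_choice_residuals(toy_rankings, [0.5, 0.25, 0.25])
```

With these toy numbers the expected counts are 2, 1 and 1, so the first residual is $(3 - 2)/\sqrt{4 \cdot 0.5 \cdot 0.5} = 1$, the second is 0 and the third is negative.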

MAIN REFERENCES

Meng, X.L., and Schilling, S. (1996) "Fitting Full-Information Item Factor Models and an Empirical Investigation of Bridge Sampling", JASA, 91, 1254-1267.

Meng, X.L., and Wong, W.H. (1996) "Simulating Ratios of Normalizing Constants via a Simple Identity: a Theoretical Exploration", Statistica Sinica, 6, 831-860.