Diagonal Orthant Multinomial Probit Models

James E. Johndrow (Duke University), Kristian Lum (Virginia Tech), David B. Dunson (Duke University)

Abstract

Bayesian classification commonly relies on probit models, with data augmentation algorithms used for posterior computation. By imputing latent Gaussian variables, one can often trivially adapt computational approaches used in Gaussian models. However, MCMC for multinomial probit (MNP) models can be inefficient in practice due to high posterior dependence between latent variables and parameters, and to difficulties in efficiently sampling latent variables when there are more than two categories. To address these problems, we propose a new class of diagonal orthant (DO) multinomial models. The key characteristics of these models include conditional independence of the latent variables given model parameters, avoidance of arbitrary identifiability restrictions, and simple expressions for category probabilities. We show substantially improved computational efficiency and comparable predictive performance to MNP.

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

1 Introduction

This work is motivated by the search for an alternative to the multinomial logit (MNL) and multinomial probit (MNP) models that is more amenable to efficient Bayesian computation while maintaining flexibility. Historically, the MNP has been preferred for Bayesian inference in polychotomous regression, since the data augmentation approach of Albert and Chib [1993] leads to straightforward Gibbs sampling. Efficient methods for Bayesian inference in the MNL are a more recent development. A series of proposed data-augmentation methods for Bayesian inference in the MNL dates at least to O'Brien and Dunson [2004], who use a Student t data augmentation scheme with the latent t variables expressed as scale mixtures of Gaussians. The scale and degrees of freedom of the t are chosen to provide a nearly exact approximation to the logistic density. Holmes and Held [2006] represent the logistic distribution as a scale mixture of normals in which the scales are a transformation of Kolmogorov-Smirnov random variables. Frühwirth-Schnatter et al. [2009] propose an alternative data-augmentation scheme that approximates a log-Gamma distribution with a mixture of normals, resulting in conditionally conjugate updates for regression coefficients. Polson et al. [2012] develop a novel data-augmented representation of the likelihood in a logistic regression using Polya-Gamma latent variables. With a normal prior, the regression coefficients have conditionally conjugate posteriors. Their method has the advantage that the latent variables can be sampled directly via an efficient rejection algorithm, without the need to rely on additional auxiliary variables.

Although the work outlined above has opened the MNL to Gibbs sampling, the distributions of the latent variables are either exotic or complex scale mixtures of normals for which multivariate analogues are not simple to work with. As such, the extension to analysis of multivariate unordered categorical data, nonparametric regression, and other more complex situations is not straightforward. In contrast, the multinomial probit (MNP) model is trivially represented by a set of Gaussian latent variables, allowing access to a wide range of methods developed for Gaussian models. MNP is also a natural choice for Bayesian estimation because of the fully conjugate updates for model parameters and latent variables. However, because the latent Gaussians are not conditionally independent given regression parameters, mixing is poor and computation does not scale well.

Three characteristics of the multinomial probit lead to challenging computation and render it of limited use in complex problems: the need for identifying restrictions, the specification of non-diagonal covariance matrices for the residuals when this is not well motivated, and high dependence between latent variables. Because our proposed method addresses all three of these issues, we review each of them here and summarize the pertinent literature.

1.1 Identifying Restrictions

Many choices of identifying restrictions for the MNP have been suggested, and the choice of identifying restrictions has important implications for Bayesian inference (Burgette and Nordheim [2012]). Consider the standard multinomial probit, where here we assume a setting in which covariates are constant across levels of the response (i.e., a classification application):

$$y_i = j \iff u_{ij} = \bigvee_k u_{ik}, \qquad u_{ij} = x_i \beta_j + \epsilon_{ij}, \qquad \epsilon_i \sim N(0, \Sigma),$$

where $\bigvee$ is the max function. The $u_{ij}$'s are referred to as latent utilities, after the common economic interpretation of them as unobserved levels of welfare in choice models. We make the distinction between a classification application as presented above and the choice model common to the economics literature, in which each category has its own set of covariates (i.e., $u_{ij} = x_{ij}\beta + \epsilon_{ij}$). A common approach to identifying the model is to choose a base category and take differences. Suppose we select category 1 as the base. We then have the equivalent model:

$$\tilde{u}_{i1} = 0, \qquad \tilde{u}_{ij} = x_i(\beta_j - \beta_1) + \epsilon_{ij} - \epsilon_{i1}.$$

The $\tilde{u}$'s are a linear transformation $M$ of the original latent utilities, so $\tilde{u}_{i,2:J} \sim N(x_i(\beta_{2:J} - \beta_1), M^T \Sigma M)$. Early approaches to Bayesian computation in the MNP placed a prior on $\tilde{\Sigma} = M^T \Sigma M$. Even this parametrization is not fully identified, and requires that one of the variances be fixed (usually to one) to fully identify the model. McCulloch and Rossi [1994] ignored this issue and adopted a parameter-expanded Gibbs sampling approach by placing an inverse-Wishart prior on $\tilde{\Sigma}$. McCulloch et al. [2000] developed a prior specification on covariance matrices with the upper left element restricted to one. Imai and Van Dyk [2005] use a working parameter that is resampled at each step of the Gibbs sampler to transform between identified and unidentified parameters, with improved mixing, though the resulting full conditionals are complex and the sampling requires twice the number of steps as the original McCulloch and Rossi sampler. Zhang et al. [2006] used a parameter-expanded Metropolis-Hastings algorithm to obtain samples from a correlation matrix, resulting in an identified model that restricts the scales of the utilities to be the same. Zhang et al. [2008] extended their algorithm to multivariate multinomial probit models.

Whatever prior specification and computational approach is used, the selection of an arbitrary base category is an important modeling decision that may have serious implications for mixing (Burgette and Hahn [2010]). The class probabilities in the MNP are linked to the model parameters by integrating over subsets of $\mathbb{R}^J$. The choice of one category as base results in these regions having different geometries, causing an asymmetry in the effect of translations of the elliptical multivariate normal density on the resulting class probabilities. To address this issue, Burgette and Nordheim [2012] and Burgette and Hahn [2010] rely on an alternative identifying restriction, circumventing the need to select an arbitrary base category. Although mixing is improved and the model has an appealing symmetry in the representation of categorical probabilities as a function of latent variables, this approach does not address the issue of high dependence between latent variables, and the proposed Gibbs sampler is quite complex (though computationally no slower than simpler algorithms).
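Before turning to error dependence, the latent-utility construction above can be made concrete with a short simulation sketch. This is our illustration, not code from the paper; the variable names are ours. It generates classification data from an MNP with identity error covariance and category 1 as the base:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, J = 2000, 2, 5                    # observations, covariates, categories

X = rng.normal(size=(n, p))             # covariates, constant across categories
beta = rng.normal(size=(p, J))          # one coefficient vector per category
beta[:, 0] = 0.0                        # identifying restriction: beta_1 = 0

U = X @ beta + rng.normal(size=(n, J))  # u_ij = x_i beta_j + eps_ij, Sigma = I
y = U.argmax(axis=1)                    # y_i = j  <=>  u_ij = max_k u_ik
```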
1.2 Dependence in Idiosyncratic Errors

The multinomial probit model arose initially in the econometrics literature (see, e.g., Hausman and Wise [1978]), where dependence between the errors in the linear utility models is often considered desirable. In econometrics and marketing, interest often centers on predicting the effects of changes in product prices on product shares. If the errors have a diagonal covariance matrix, the model has the "independence of irrelevant alternatives" (IIA) property (see Train [2003]), which is often considered too restrictive for characterizing the substitution patterns between products in a market. Because early applications of the MNP were largely of the marketing and econometrics flavor, it has become standard to specify a model with dependence in the errors without much attention to whether it is scientifically motivated. We find little compelling reason for this assumption in most applications outside of economics and marketing. If the application does not motivate dependence in errors, the resulting model will certainly have higher variance than a model with independent errors, and the additional dependence between latent variables will negatively impact mixing. The applications we have in mind are not specifically economics/marketing applications, and as such we will generally assume that the errors are conditionally independent given regression parameters.


1.3 Dependence Between Latent Variables

Except in the special case of marketing applications where there are covariates that differ with the level of the response variable (such as the product price), an identified MNP must have $J - 1$ sets of regression coefficients, with one set restricted to be zero. If we assume $\beta_1 = 0$ and an identity covariance matrix, we can actually sample all $J$ latent variables for each observation $i$. Note this is equivalent to setting one class to have utility that is identically zero and sampling the remaining utility differences from the transformed distribution described in section 2.1; however, the intractability of the high dependence between utilities is much clearer using this alternative parametrization.

To implement data augmentation MCMC, we must sample the latent utilities conditional on regression coefficients from a multivariate normal distribution restricted to the space where $u_{i,y_i} = \bigvee_k u_{ik}$. Albert and Chib [1993] suggest rejection sampling, but this tends to be extremely inefficient, particularly as $J$ grows. All subsequent algorithms have sampled $u_{ij} \mid u_{i,-j}$ in sequence. With an identity covariance matrix for latent utilities conditional on regression parameters, we can reduce this to a two-step process:

1. Set $b = \bigvee_{k \neq y_i} u_{ik}$. Sample $u_{i,y_i} \sim N_{[b,\infty)}(x_i \beta_{y_i}, 1)$.

2. Sample $u_{ij} \sim N_{(-\infty, u_{i,y_i}]}(x_i \beta_j, 1)$ independently for all $j \neq y_i$.

Even in simple cases, mixing tends to be poor because the truncation region for each latent Gaussian depends on the current value in the Markov chain for the other latent Gaussians, resulting in high dependence. We reiterate that this issue is not simply a consequence of the choice of a non-diagonal covariance matrix; it is true for any choice of covariance matrix for the $\epsilon$'s. While recent work on developing a Hamiltonian Monte Carlo scheme for jointly sampling from truncated multivariate normal distributions appears promising as an alternative to the standard Gibbs sampling approach (Pakman and Paninski [2012]), the need to update each latent variable conditional on the others has historically been a substantial contributor to the inefficiency of Bayesian computation for MNP.
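The two-step update can be written directly; the sketch below is our illustration (not the authors' code), using scipy's truncnorm for the truncated normal draws. It updates the latent utilities for a single observation and makes explicit that each truncation bound depends on the current values of the other utilities, which is the source of the poor mixing described above.

```python
import numpy as np
from scipy.stats import truncnorm

def update_latent_utilities(u_i, y_i, means, rng):
    """One Gibbs scan over the MNP latent utilities for observation i,
    assuming identity error covariance; means[j] = x_i' beta_j."""
    J = len(u_i)
    others = [k for k in range(J) if k != y_i]
    # Step 1: utility of the observed category, truncated below at the
    # current maximum of the other utilities.
    b = max(u_i[k] for k in others)
    u_i[y_i] = truncnorm.rvs(b - means[y_i], np.inf,
                             loc=means[y_i], scale=1.0, random_state=rng)
    # Step 2: remaining utilities, truncated above at the new u_{i, y_i}.
    for k in others:
        u_i[k] = truncnorm.rvs(-np.inf, u_i[y_i] - means[k],
                               loc=means[k], scale=1.0, random_state=rng)
    return u_i
```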

In this paper we propose Diagonal Orthant Multinomial models (DO models), a novel class of models for unordered categorical response data that admits latent variable representations (including a Gaussian latent variable representation for the DO-probit model), does not require the selection of a base category, and in which the latent variables are conditionally independent given regression parameters and thus may be updated in a block, greatly improving mixing and computational scaling. The remainder of the paper is organized as follows. In section 2, we introduce DO models and show that a unique solution to the likelihood equations can be obtained from independent binary regressions. We also illustrate the relationship of DO-logistic to MNL and of DO-probit to MNP, give several interpretations for regression coefficients in DO models, and explain why the model parameters are often easier to interpret than in the MNP. In section 3, we outline a simple algorithm for Bayesian computation in the DO-probit and discuss extensions to the basic regression setting. In section 4, we compare the DO-probit, MNP, and MNL in simulation studies. In section 5, we apply both methods to a real dataset and show that they are virtually indistinguishable in prediction. In section 6 we conclude and discuss potential future directions for this work.

2 The Diagonal Orthant Multinomial Model

The Diagonal Orthant Multinomial (DO) class of models represents an unordered categorical variable as a set of binary variables. Let $y$ be unordered categorical with $J$ levels and suppose $\gamma_{[1:J]}$ are independent binary variables. Define $y = j \iff \{\gamma_j = 1\} \cap \{\gamma_k = 0 \ \forall\, k \neq j\}$. Binary variables have a well-known latent variable representation. Let $z_j \sim f(\mu_j, \sigma)$, where $f$ is a location-scale density with location parameter $\mu_j$ and common scale $\sigma$, and set $\gamma_j = 1 \iff z_j > 0$. However, for our purposes, we must ensure that exactly one $\gamma_j$ is one, and thus we restrict the $z$'s to belong to the set

$$\Omega = \bigcup_{j=1}^{J} \left\{ z \in \mathbb{R}^J : z_j > 0,\ z_k < 0 \text{ for } k \neq j \right\}.$$

By the Radon-Nikodym theorem, the joint distribution of the $z$'s is

$$f(z) = \frac{\mathbf{1}(z \in \Omega) \prod_{j=1}^{J} f(z_j - \mu_j)}{\int_{\mathbb{R}^J} \mathbf{1}(z \in \Omega) \prod_{j=1}^{J} f(z_j - \mu_j)\, dz}.$$

If we let $f = \phi(\cdot)$, where $\phi(\cdot)$ is the univariate normal pdf, the result is a probit analogue that we refer to as DO-probit.
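As an illustration of the restriction to $\Omega$ (our sketch, not from the paper), one can forward-simulate from the DO-probit by drawing independent normals and rejecting draws that fall outside $\Omega$; the index of the single positive coordinate is the category. Rejection is used here purely for clarity and becomes inefficient as $J$ grows. The empirical category frequencies match the closed-form probabilities $\psi_j$ given below, which are proportional to the odds $(1 - \Phi(-\mu_j))/\Phi(-\mu_j)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def draw_do_probit(mu, rng):
    """Draw z ~ N(mu, I) until z lands in Omega (exactly one positive
    coordinate); return the category, i.e. the positive coordinate's index."""
    while True:
        z = rng.normal(loc=mu, scale=1.0)
        if (z > 0).sum() == 1:
            return int(np.argmax(z > 0))

mu = np.array([0.4, -0.3, -1.0])
draws = np.array([draw_do_probit(mu, rng) for _ in range(50000)])

odds = norm.cdf(mu) / norm.cdf(-mu)        # (1 - F(-mu_j)) / F(-mu_j)
print(np.bincount(draws) / len(draws))     # empirical category frequencies
print(odds / odds.sum())                   # closed-form psi_j, same values
```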

The joint probability density of the $z$'s in DO-probit is that of a $J$-variate normal distribution with identity covariance, restricted to the regions of $\mathbb{R}^J$ with one sign positive and the others negative. This is easy to visualize in $\mathbb{R}^2$, where the density is a bivariate normal with correlation zero and unit variance restricted to the second and fourth quadrants. The density in the two-dimensional case is shown in Figure 1. In higher dimensions, the restriction will define orthants over which the density is nonzero. The marginal distribution of any two latent variables will always be restricted to orthants that are diagonally opposed rather than adjacent, hence the designation Diagonal Orthant Multinomial model.

Figure 1: The joint density of the $z$'s in the DO-probit for the 2-category case with zero means (left panel). The right panel shows approximate semispherical regions covering 95 percent of the total probability in the 3-category case with zero means for all categories.

The probability measure for $z$ induces a probability measure on $y$. The categorical probabilities are easily calculated as

$$\Pr(y = j) = \frac{(1 - F(-\mu_j)) \prod_{k \neq j} F(-\mu_k)}{\sum_{s=1}^{J} (1 - F(-\mu_s)) \prod_{k \neq s} F(-\mu_k)} = \psi_j(\mu_1, \ldots, \mu_J),$$

where $F(\cdot)$ is the CDF corresponding to $f$. Clearly, if $f = \phi$, then $F = \Phi(\cdot)$, the standard normal CDF. While, strictly speaking, the DO model with a full set of $J$ category-specific intercepts and $J \times p$ regression coefficients is not identified, in the next section we suggest a simple identifying restriction that, critically, allows the parameters of the DO model to be estimated via independent binary regressions, providing substantial computational advantages.

2.1 Interpretation of Regression Coefficients and Relationship to Multinomial Logit

In a regression context, DO models define an alternative link function in a GLM for unordered categorical/multinomial data, where

$$y_i \sim \text{Categorical}(p_i), \qquad p_i = \big(\psi_1(x_i, \beta_{[1:J]}),\ \psi_2(x_i, \beta_{[1:J]}),\ \ldots,\ \psi_J(x_i, \beta_{[1:J]})\big).$$

The conditional class probabilities in the DO models have a useful correspondence with the multinomial logit. The class probabilities in the regression problem, $\Pr(y_i = j \mid x_i, \beta_{[1:J]}) = \psi_j(x_i, \beta_{[1:J]})$, are given by

$$\psi_j(x_i, \beta_{[1:J]}) = \frac{(1 - F(-x_i\beta_j)) \prod_{l \neq j} F(-x_i\beta_l)}{\sum_{s=1}^{J} (1 - F(-x_i\beta_s)) \prod_{l \neq s} F(-x_i\beta_l)}.$$

The ratio of probabilities for two classes $j$ and $k$ is therefore

$$\frac{\big[(1 - F(-x_i\beta_j)) \prod_{l \neq j} F(-x_i\beta_l)\big] \big/ \big[\sum_{s=1}^{J} (1 - F(-x_i\beta_s)) \prod_{l \neq s} F(-x_i\beta_l)\big]}{\big[(1 - F(-x_i\beta_k)) \prod_{l \neq k} F(-x_i\beta_l)\big] \big/ \big[\sum_{s=1}^{J} (1 - F(-x_i\beta_s)) \prod_{l \neq s} F(-x_i\beta_l)\big]}.$$

The denominators and all but one of the terms in the products in the numerators cancel, leaving

$$\frac{\psi_j(x_i, \beta_{[1:J]})}{\psi_k(x_i, \beta_{[1:J]})} = \frac{(1 - F(-x_i\beta_j))\, F(-x_i\beta_k)}{(1 - F(-x_i\beta_k))\, F(-x_i\beta_j)} = \frac{(1 - F(-x_i\beta_j))/F(-x_i\beta_j)}{(1 - F(-x_i\beta_k))/F(-x_i\beta_k)},$$

which is a function of only $\beta_j$ and $\beta_k$. Recall that for the multinomial logit model, the ratio of class probabilities for classes $j$ and $k$ also depends only on $\beta_j$ and $\beta_k$ and is given by

$$\log\left(\frac{\Pr(y_i = j)}{\Pr(y_i = k)}\right) = x_i\beta_j - x_i\beta_k.$$

In DO-probit, $F(\cdot) = \Phi(\cdot)$, so the log relative class probability, $\log\big(\Pr(y_i = j)/\Pr(y_i = k)\big)$, is

$$\log\left(\frac{1 - \Phi(-x_i\beta_j)}{\Phi(-x_i\beta_j)}\right) - \log\left(\frac{1 - \Phi(-x_i\beta_k)}{\Phi(-x_i\beta_k)}\right).$$

While not as convenient as the linear expression arising from the multinomial logit, this quantity is nonetheless easily calculated and provides a direct relationship between the coefficients in the multinomial logit and those in the DO-probit. This is a substantial advantage over the multinomial probit model, in which the class probabilities do not have a closed form.

The more interesting case arises when we choose a logistic distribution for the $z_j$'s in the DO model, giving the DO-logistic model. Here, $F(t) = \frac{1}{1+e^{-t}}$ is the logistic CDF, and the log probability ratio for two categories is

$$\log\left(\frac{p_i^{(j)}}{p_i^{(k)}}\right) = \log\left(\frac{\frac{e^{x_i\beta_j}}{1+e^{x_i\beta_j}} \cdot \frac{1}{1+e^{x_i\beta_k}}}{\frac{e^{x_i\beta_k}}{1+e^{x_i\beta_k}} \cdot \frac{1}{1+e^{x_i\beta_j}}}\right) = x_i\beta_j - x_i\beta_k, \qquad (1)$$

where $p_i^{(j)} = \Pr(y_i = j)$. This is identical to the log probability ratios in the multinomial logit. Evidently, the DO-logistic is an alternative form of the MNL.
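The cancellation above is easy to check numerically. The following sketch is ours (not from the paper); $\eta_j$ stands in for $x_i\beta_j$ for a single observation. It computes $\psi$ for the DO-logistic and verifies that the log ratio of class probabilities reduces to $\eta_j - \eta_k$, as in (1):

```python
import numpy as np

def psi(eta, F):
    """DO class probabilities: psi_j proportional to
    (1 - F(-eta_j)) * prod_{l != j} F(-eta_l)."""
    Fm = F(-eta)
    num = (1 - Fm) * np.array([np.prod(np.delete(Fm, j)) for j in range(len(eta))])
    return num / num.sum()

logistic = lambda t: 1.0 / (1.0 + np.exp(-t))
eta = np.array([0.3, -1.2, 0.7])              # eta_j = x_i beta_j
p = psi(eta, logistic)
print(np.log(p[0] / p[1]), eta[0] - eta[1])   # identical, verifying (1)
```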


In the next section, we suggest an identifying restriction for the DO model that is much more convenient than the use of an arbitrary base category in MNL models and treats all of the categories identically. We also note that, using the approach in O'Brien and Dunson [2004], one can easily do computation for the DO-logistic model by introducing a scale parameter for the latent variables that is mixed over a Gamma density. This immediately suggests an alternative Gibbs sampling algorithm for an MNL-like model that involves only one additional sampling step relative to the DO-probit.

2.2 Identification

Closer inspection of (1) reveals something rather striking about the representation of category probabilities in DO models. Consider (1) in an intercepts-only model:

$$\log\left(\frac{p_i^{(j)}}{p_i^{(k)}}\right) = \log\left(\frac{\frac{e^{\mu_j}}{1+e^{\mu_j}} \cdot \frac{1}{1+e^{\mu_k}}}{\frac{e^{\mu_k}}{1+e^{\mu_k}} \cdot \frac{1}{1+e^{\mu_j}}}\right).$$

As presented thus far, DO is an under-identified generalized linear model, and thus there will be multiple sets of parameters $\mu_1, \ldots, \mu_J$ that maximize the likelihood (infinitely many, in fact). However, all of the solutions $\hat{\mu}_1, \ldots, \hat{\mu}_J$ to the likelihood equations must satisfy

$$\log\left(\frac{\hat{p}_j}{\hat{p}_k}\right) = \hat{\mu}_j - \hat{\mu}_k$$

for any $j, k \in \{1, \ldots, J\}$, a simple consequence of MLE invariance. Yet (1) suggests that we might identify the model and obtain a very useful approach to estimation simply by recognizing the connection with binary logistic regression. Consider an intercept-only binary logistic regression with response $\tilde{y}_i^{(j)} = \mathbf{1}(y_i = j)$ and let $p_j^{(B)} = \Pr(\tilde{y}_j = 1)$. We have that

$$\log\left(\frac{\hat{p}_j^{(B)}}{1 - \hat{p}_j^{(B)}}\right) = \hat{\mu}_j^{(B)},$$

and so for two independent logistic regressions of $\tilde{y}_j$ and $\tilde{y}_k$ on an intercept, we get

$$\log\left(\frac{\hat{p}_j^{(B)} (1 - \hat{p}_k^{(B)})}{\hat{p}_k^{(B)} (1 - \hat{p}_j^{(B)})}\right) = \hat{\mu}_j^{(B)} - \hat{\mu}_k^{(B)}.$$

But since

$$\hat{p}_j^{(B)} = \frac{e^{\hat{\mu}_j^{(B)}}}{1 + e^{\hat{\mu}_j^{(B)}}}, \qquad 1 - \hat{p}_j^{(B)} = \frac{1}{1 + e^{\hat{\mu}_j^{(B)}}},$$

we have that

$$\log\left(\frac{\frac{e^{\hat{\mu}_j^{(B)}}}{1+e^{\hat{\mu}_j^{(B)}}} \cdot \frac{1}{1+e^{\hat{\mu}_k^{(B)}}}}{\frac{e^{\hat{\mu}_k^{(B)}}}{1+e^{\hat{\mu}_k^{(B)}}} \cdot \frac{1}{1+e^{\hat{\mu}_j^{(B)}}}}\right) = \hat{\mu}_j^{(B)} - \hat{\mu}_k^{(B)}.$$

Therefore the collection $\hat{\mu}_1^{(B)}, \ldots, \hat{\mu}_J^{(B)}$, that is, the MLEs from independent binary regressions on $\tilde{y}_1, \ldots, \tilde{y}_J$, is a valid solution to the likelihood equations for the DO model. Since the MLEs $\hat{\mu}_j^{(B)}$ for any $j$ are unique, this solution is also unique, and is in fact identical to the solution that results from imposing the restriction $\sum_{j=1}^{J} \hat{p}_j = 1$ in the DO-logistic model. A similar argument goes through for the general DO model. This further hints at the strategy we employ for Bayesian computation, which is identical to that used for $J$ independent binary regressions. Of course, this was by construction; DO models were conceived of and designed expressly to allow for Bayesian computation using a set of $J$ independent latent variables.

Note that while this is conceptually similar to the derivation of the multinomial logit model from independent binary regressions, in that case the response for each binary regression is defined by $\tilde{y}_i^{(j)} = 1$ if $y_i = j$ and $\tilde{y}_i^{(j)} = 0$ if $y_i = b$, where $b$ is the base category, and thus the binary regressions are on a subset of the observations and defined against an arbitrary base category. For DO models, we perform binary regressions on all of the observations and do not require a base category. We also have the advantage of an additional interpretation of the estimated parameters as relating to the marginal probability of each category.

2.3 Relationship of DO-probit to Multinomial Probit

Both MNP and DO-probit link probability vectors to latent Gaussian variables by integrating multivariate normal densities over particular regions. For reasons discussed earlier, we consider only cases in which the latent Gaussian variables are conditionally independent given parameters. To simplify the exposition, we consider an intercepts-only MNP and a corresponding DO-probit model with $J$ category-specific mean parameters and the identifying restriction presented in the previous section. Suppose we have an MNP with identity covariance and restrict the mean of the first utility to be zero.

If we take $\mu_2 = -0.75$, the resulting category probabilities for the trivial two-category model are $(0.7115, 0.2885)$. The equivalent DO-probit model will have parameters $\mu_1 = \Phi^{-1}(0.7115) = 0.5578$ and $\mu_2 = \Phi^{-1}(0.2885) = -0.5578$. The two models give identical category probabilities, but they arrive at them differently. Figure 2 shows the contours of the bivariate normal distribution for the latent utilities in the MNP (left panel) and the truncated bivariate normal distribution in the DO-probit (right panel). The MNP integrates above and below the line $y = x$ (shown on the figure) to calculate probabilities, whereas DO-probit integrates over the second and fourth quadrants.

Figure 2: Contours and integration regions for equivalent two-category MNP and DO-probit models with category probabilities (0.7115, 0.2885).

Of course, we can define a 1-1 function between MNPs with $J$ categories, an identity covariance matrix, and $\mu_1 = 0$, and DO-probit with the marginal MLE restriction outlined in the previous section. The MNP probabilities for such a model are defined by the intractable integrals

$$p_j = \Pr(y = j) = \int_{-\infty}^{\infty} \phi(z_j - \mu_j) \left( \int_{-\infty}^{z_j} \phi(z_J - \mu_J)\, dz_J \cdots \int_{-\infty}^{z_j} \phi(z_1)\, dz_1 \right) dz_j$$

for $j = 1, \ldots, J$. The integral has no analytic form. As such, marginalizing over the latent variables in MNP is computationally costly, and thus the latent variables are usually conditioned on in MCMC algorithms rather than integrated out. This leads to high dependence and inefficient computation. In contrast, one can marginalize out the latent variables in DO-probit and obtain analytic expressions for the category probabilities, a significant benefit of our approach.

One can approximate the MNP integrals by quadrature or simulation, giving a set of probabilities $p_1, \ldots, p_J$. We can then find the equivalent parameters of a DO-probit by simply inverting the probabilities, i.e., $\mu_j = \Phi^{-1}(p_j)$ for all $j$. Note that while category probabilities are increasing in $\mu$ for both models, the MNP does not have the simple interpretation of the regression parameters as relating to the marginal category probabilities, and cannot be estimated from independent binary regressions as can the DO-probit.

3 Computation

Bayesian computation for the DO-probit is very straightforward. With a $N(0, cI)$, $c \in \mathbb{R}^+$, prior on the regression coefficients, the entire algorithm consists of Gibbs steps:

1. Sample $z_{i,y_i} \mid \beta_{y_i} \sim N_+(x_i^T \beta_{y_i}, 1)$ and $z_{i,k} \sim N_-(x_i^T \beta_k, 1)$ for $k \neq y_i$, where $N_+$ and $N_-$ denote normal distributions truncated below and above by zero, respectively.

2. Sample $\beta_k \mid z \sim N(\tilde{\mu}, \tilde{S})$ with $\tilde{S} = (x^T x + (1/c) I)^{-1}$ and $\tilde{\mu} = \tilde{S} x^T z_{\cdot,k}$ for $k \in \{1, \ldots, J\}$.
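The two steps translate into a short sampler. The following is a minimal sketch (our code, assuming labels coded $0, \ldots, J-1$, the $N(0, cI)$ prior above, and scipy's truncnorm for the truncated normal draws); since the posterior covariance $\tilde{S}$ does not depend on the latent variables, it is computed once outside the loop.

```python
import numpy as np
from scipy.stats import truncnorm

def do_probit_gibbs(X, y, J, c=10.0, iters=2000, seed=0):
    """Gibbs sampler for DO-probit with independent N(0, cI) priors on each beta_k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros((J, p))
    S = np.linalg.inv(X.T @ X + np.eye(p) / c)     # posterior covariance S~
    L = np.linalg.cholesky(S)
    onehot = (np.arange(J) == y[:, None])          # True where k == y_i
    lo = np.where(onehot, 0.0, -np.inf)            # z_{i,y_i} truncated below at 0
    hi = np.where(onehot, np.inf, 0.0)             # z_{i,k}, k != y_i, above at 0
    samples = np.empty((iters, J, p))
    for t in range(iters):
        # Step 1: draw all latent variables at once; they are conditionally
        # independent given beta, with fixed truncation at zero.
        M = X @ beta.T                              # (n, J) means x_i' beta_k
        z = truncnorm.rvs(lo - M, hi - M, loc=M, scale=1.0, random_state=rng)
        # Step 2: conjugate update beta_k | z ~ N(S~ X' z_k, S~)
        for k in range(J):
            beta[k] = S @ (X.T @ z[:, k]) + L @ rng.normal(size=p)
        samples[t] = beta
    return samples
```

In contrast to the MNP update sketched in section 1.3, the truncation bounds here are fixed at zero, so the entire $n \times J$ block of latent variables can be drawn in a single vectorized call.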


Moreover, because of the latent Gaussian data augmentation scheme used to estimate the model, there are many alternatives to normal priors on regression coefficients. In classification problems with $p$ covariates, the dimension of the parameter space is $J \times p$ (or $(J-1) \times p$ in the MNL and MNP), where $J$ is the number of possible values of $y$. Also, the larger the $J$, the less information is provided by observing $y_i = j$. Thus, even with modest $p$ and fairly large $n$, one should consider priors on regression coefficients with robust shrinkage properties, particularly when prediction is an important goal. The local-global family of shrinkage priors has the scale-mixture representation

$$\beta_{jl} \sim N(0, \tau^2 \phi_{jl}^2),$$

where we have adapted the notation to the regression classification context with $j \in \{1, \ldots, J\}$ and $l \in \{1, \ldots, p\}$. Here, $\tau$ is a global shrinkage parameter and the $\phi_{jl}$ are local shrinkage parameters corresponding to $\beta_{jl}$. There is a substantial literature on priors for $\tau$ and the $\phi_{jl}$'s that favor aggressively shrinking most of the coefficients toward zero while retaining the sparse signals. For example, choosing independent $C^+(0,1)$ priors on $\tau$ and the $\phi_{jl}$'s gives the horseshoe prior (Carvalho et al. [2010]), where $C^+(0,1)$ is a standard half-Cauchy distribution. Local-global priors can be employed in our model, with the only additional computational burden relative to the continuous response case being the imputation of the latent variables; a sketch of the modified conjugate update appears at the end of this section.

It is equally straightforward to apply other priors on regression parameters. Bayesian variable selection and model averaging can be performed by specifying point-mass mixture priors on $\beta_{jl}$ and employing a stochastic search variable selection algorithm (see Hoeting et al. [1999] for an overview of these methods). A nonparametric prior on regression parameters can be obtained by specifying Gaussian process (GP) priors on the latent variables. The simplest approach is to specify $J$ independent GPs corresponding to the $J$ sets of latent variables (one for each possible value of $y$). In MNP, this leads to difficult computation, and thus it is often necessary to use posterior approximations to make computation tractable (see Girolami and Rogers [2006]). The superior computational properties of DO-probit suggest that exact posterior computation using MCMC may be feasible with GP priors for problems of meaningful scale.
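As a hedged illustration of how step 2 changes under a local-global prior (again our sketch, not the authors' code; the updates for $\tau$ and the $\phi_{jl}$ themselves, e.g. slice or Metropolis-Hastings steps for the horseshoe, are omitted), the only modification is that the fixed prior precision $(1/c)I$ is replaced by a diagonal matrix that changes with the current scales:

```python
import numpy as np

def beta_update_local_global(X, z_k, tau2, phi2_k, rng):
    """Conjugate draw of beta_k given latents z_k and current scales,
    under the prior beta_kl ~ N(0, tau2 * phi2_k[l])."""
    S = np.linalg.inv(X.T @ X + np.diag(1.0 / (tau2 * phi2_k)))
    mean = S @ (X.T @ z_k)
    return mean + np.linalg.cholesky(S) @ rng.normal(size=X.shape[1])
```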

4 Simulation Studies

We conducted a series of simulation studies to assess the performance and properties of DO-probit and to compare it with MNP. We first simulated data from an MNP with $n \times p$ design matrix $x$ and identity covariance matrix. We used $n = 2000$, $p = 2$, and $J = 5$ levels of the response. The $(J-1)$ category-specific intercepts were sampled from $N(0, 0.5)$ and the $(J-1)p$ coefficients were sampled from $N(0, 1)$. We then fit either MNP or DO-probit by Gibbs sampling. The chains were run for 10,000 iterations each, in every case starting from $\beta = 0$. We repeated the simulation and subsequent fitting 10 times. Note that the parameter expansion algorithms designed to improve mixing are irrelevant for fitting this MNP model since we do not need to sample a covariance matrix. We chose category 1 as the base category for fitting.

Figure 3 shows histograms of lag-10, lag-25, and lag-100 autocorrelations for the Markov chains for $\beta$ from MNP and DO-probit across the 10 simulations. Autocorrelations for DO-probit are much lower at all three lag lengths. This is a critical aspect of the performance of Bayesian stochastic algorithms: higher autocorrelations require much longer run times to achieve the same effective sample sizes. In more complex models, the autocorrelations for MNP can be prohibitively high. In the following section, we show that lower autocorrelations are a feature of the DO-probit that persists in real applications with more complex priors on the coefficients.

Figure 3: Distribution of lag-10, -25, and -100 autocorrelations across all simulations and parameters for DO-probit (top) and MNP (bottom).

Table 1 shows the mean in-sample misclassification rate (using the posterior mode as the prediction) from each of the ten simulations. The results are quite comparable for the two models, suggesting that DO-probit and MNP are exchangeable in regard to their performance as classifiers.

Simulation   MNP    DO-probit
1            0.30   0.30
2            0.36   0.37
3            0.51   0.51
4            0.32   0.32
5            0.42   0.42
6            0.49   0.49
7            0.49   0.48
8            0.36   0.37
9            0.42   0.42
10           0.45   0.45

Table 1: Comparison of misclassification rates for each of 10 simulated data sets.

5 Applications

We consider the glass identification data from the UCI machine learning repository (Frank and Asuncion [2010]). There are 214 observations in the data, and the response (class of the glass sample) is unordered categorical with seven possible levels, of which six are observed. There are nine continuous covariates. Thus a DO-probit classifier has sixty parameters (6 x 9 regression coefficients and 6 intercepts). Because $n$ is not large relative to $p$ in this case, we use a horseshoe shrinkage prior on the $\beta$'s (see Carvalho et al. [2010] and the discussion in section 3).

A boxplot of posterior samples for $\beta$ is shown in the left panel of Figure 4, and a corresponding plot for the MNP with identity covariance matrix is presented in the right panel. Red boxes indicate parameters that are considered nonzero on the basis of the criterion suggested in Carvalho et al. [2010] (let $\kappa_{jk} = 1/(1 + \tau^2\phi_{jk}^2)$ denote the shrinkage factor for $\beta_{jk}$, and consider $\beta_{jk}$ nonzero if $\hat{\kappa}_{jk}$, the posterior mean of $\kappa_{jk}$, is less than 0.5). Recent work has shown that this inclusion criterion has optimal properties under a 0-1 loss function (Datta and Ghosh [2012]). Note that of the 60 coefficients in the DO-probit, 56 are effectively shrunk to zero, whereas of the 50 coefficients in the MNP, 39 are effectively shrunk to zero. In addition, of the 9 covariates, 6 have all 6 coefficients shrunk to zero in the DO-probit, which is the equivalent of excluding that covariate entirely. In the MNP, 5 of these 6 covariates also have all 5 coefficients effectively shrunk to zero.

We assessed out-of-sample prediction on the dataset by randomly holding out 10 percent of observations and estimating the model on the remaining 90 percent, a process that was repeated twenty times. We formed out-of-sample predictions by taking the posterior mode of the predicted class for each observation in the test set. The full set of 60 (50) coefficients was used for prediction in the DO-probit (MNP) models, which is standard practice for shrinkage priors. Table 3 shows the best, median, and worst misclassification rates for the two models. The results show that the two classifiers are equivalent. Note that for over half of the test datasets, the misclassification rates for the two models were identical.
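One way to implement the posterior-mode prediction described above is sketched below (our code; beta_samples as returned by the sampler in section 3). Because $\psi_j$ is increasing in $x'\beta_j$ through the same monotone function for every $j$, the most probable class under a given draw is simply the argmax of the linear predictors.

```python
import numpy as np

def predict_posterior_mode(X_test, beta_samples):
    """Out-of-sample prediction: classify each test point under every posterior
    draw, then take the most frequent (modal) class across draws."""
    votes = np.stack([(X_test @ beta.T).argmax(axis=1)   # argmax_j of x' beta_j
                      for beta in beta_samples])          # shape (draws, n_test)
    return np.array([np.bincount(v).argmax() for v in votes.T])
```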

           best   median   worst
Probit     0.22   0.38     0.58
DO-Probit  0.22   0.37     0.60

Table 3: Summary of misclassification rates for 20 random holdouts of glass identification data.

Figure 4: Boxplots of posterior samples for $\beta_{1:p}$ for MNP and DO-probit. Red boxes indicate that the coefficient is considered nonzero by the criterion described above.

Figure 5 shows boxplots of the lag-1 through lag-50 posterior autocorrelations in the Markov chains for $\beta$ from the DO-probit and MNP models. The boxes show the interquartile range and the black dots the medians. This confirms that the much lower autocorrelations of the DO-probit persist in real applications with larger numbers of parameters and more complex hierarchical modeling structures. Note that the residual autocorrelation for DO-probit is due mainly to the Metropolis-Hastings steps used to sample the scale parameters for the horseshoe prior, and that parameter expansion or slice sampling could improve the mixing; see Scott [2010].

Figure 5: Boxplots of lag-1 through lag-50 autocorrelations for MCMC samples of $\beta_{1:p}$ from MNP and DO-probit.

Table 2 shows quantiles of the effective sample size for DO-probit and MNP estimated on the glass data. The median effective sample size for DO-probit is 3.85 times that for MNP, even in this relatively simple modeling context. The run time per iteration for DO-probit was 1.06 times that of MNP, a difference that is entirely due to the larger number of parameters for DO-probit.

            10%   25%   50%    75%    90%
DO-probit   101   250   1204   3176   6085
MNP         78    128   216    464    835

Table 2: Quantiles of effective sample sizes for $\beta$ parameters from DO-probit and MNP using 15,000 MCMC iterations with a 1000-iteration burn-in.

6 Discussion

The DO-probit model provides an attractive alternative to the multinomial logit and multinomial probit models that overcomes the major limitations of each while retaining many of their attractive properties. Like MNP, DO-probit admits a Gaussian latent variable representation, allowing for simple conjugate updates for regression parameters and application of numerous methods designed for multivariate Gaussian data, a feature that MNL lacks. However, our model does not suffer from the poor mixing and high autocorrelations in MCMC samples that make MNP practically infeasible for use in high-dimensional applications. Unlike MNP and MNL, our model does not require a base category for identification. However, class probabilities for DO-probit marginal of the latent variables are easily calculated, and relative class probabilities are functions only of the regression parameters corresponding to the compared classes, an attractive feature shared with MNL.

The DO-probit link provides a number of possible avenues for future work. The Gaussian latent variable representation of the model allows for extension to nonparametric regression via specifying a Gaussian process prior on the latent variables. Another interesting possibility would be to explore a novel class of discrete choice models by allowing dependence between latent variables, the analogue of a non-diagonal covariance matrix in the MNP. Multivariate unordered categorical response data with covariates is a particularly challenging context in which to develop computationally tractable Bayesian methods. One could potentially allow for dependence in multivariate cases by specifying a prior on structured covariance matrices for the latent Gaussian variables. Another possible application would be to time series of polychotomous variables, by specifying AR models on the latent variables. In complex modeling situations such as these, fully Bayesian estimation using DO-probit may be straightforward, whereas the computational challenges of the MNP make these extensions, while theoretically possible, practically infeasible.

7 Acknowledgments

This work was supported by Award Number R01ES017436 from the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health. This work has been partially supported by DTRA CNIMS Contract HDTRA1-11-D-0016-0001 and Virginia Tech internal funds.

References

J.H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669-679, 1993.

L.F. Burgette and P.R. Hahn. Symmetric Bayesian multinomial probit models. Duke University Statistical Science Technical Report, pages 1-20, 2010.

L.F. Burgette and E.V. Nordheim. The trace restriction: An alternative identification strategy for the Bayesian multinomial probit model. Journal of Business & Economic Statistics, 30(3):404-410, 2012.

C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.

J. Datta and J.K. Ghosh. Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 7(4):771-792, 2012.

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

S. Frühwirth-Schnatter, R. Frühwirth, L. Held, and H. Rue. Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Statistics and Computing, 19(4):479-492, 2009.

M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.

J.A. Hausman and D.A. Wise. A conditional probit model for qualitative choice: Discrete decisions recognizing interdependence and heterogeneous preferences. Econometrica: Journal of the Econometric Society, pages 403-426, 1978.

J.A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, pages 382-401, 1999.

C.C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145-168, 2006.

K. Imai and D.A. Van Dyk. A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2):311-334, 2005.

R. McCulloch and P.E. Rossi. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1):207-240, 1994.

R.E. McCulloch, N.G. Polson, and P.E. Rossi. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1):173-193, 2000.

S.M. O'Brien and D.B. Dunson. Bayesian multivariate logistic regression. Biometrics, 60(3):739-746, 2004.

A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv preprint arXiv:1208.4118, 2012.

N.G. Polson, J.G. Scott, and J. Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. arXiv e-prints, 2012.

J.G. Scott. Parameter expansion in local-shrinkage models. arXiv preprint arXiv:1010.5265, 2010.

K.E. Train. Discrete Choice Methods with Simulation. Cambridge University Press, 2003.


X. Zhang, W.J. Boscardin, and T.R. Belin. Sampling correlation matrices in Bayesian models with correlated latent variables. Journal of Computational and Graphical Statistics, 15(4):880-896, 2006.

X. Zhang, W.J. Boscardin, and T.R. Belin. Bayesian analysis of multivariate nominal measures using multivariate multinomial probit models. Computational Statistics & Data Analysis, 52(7):3697-3708, 2008.
