Diagonal Orthant Multinomial Probit Models

James E. Johndrow (Duke University), Kristian Lum (Virginia Tech), David B. Dunson (Duke University)

Abstract

Bayesian classification commonly relies on probit models, with data augmentation algorithms used for posterior computation. By imputing latent Gaussian variables, one can often trivially adapt computational approaches used in Gaussian models. However, MCMC for multinomial probit (MNP) models can be inefficient in practice due to high posterior dependence between latent variables and parameters, and to difficulties in efficiently sampling latent variables when there are more than two categories. To address these problems, we propose a new class of diagonal orthant (DO) multinomial models. The key characteristics of these models include conditional independence of the latent variables given model parameters, avoidance of arbitrary identifiability restrictions, and simple expressions for category probabilities. We show substantially improved computational efficiency and comparable predictive performance to MNP.

Appearing in Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS) 2013, Scottsdale, AZ, USA. Volume 31 of JMLR: W&CP 31. Copyright 2013 by the authors.

1 Introduction

This work is motivated by the search for an alternative to the multinomial logit (MNL) and multinomial probit (MNP) models that is more amenable to efficient Bayesian computation while maintaining flexibility. Historically, the MNP has been preferred for Bayesian inference in polychotomous regression, since the data augmentation approach of Albert and Chib [1993] leads to straightforward Gibbs sampling. Efficient methods for Bayesian inference in the MNL are a more recent development. A series of proposed data-augmentation methods for Bayesian inference in the MNL dates at least to O'Brien and Dunson [2004], who use a Student t data augmentation scheme with the latent t variables expressed as scale mixtures of Gaussians. The scale and degrees of freedom of the t are chosen to provide a nearly exact approximation to the logistic density. Holmes and Held [2006] represent the logistic distribution as a scale mixture of normals in which the scales are a transformation of Kolmogorov-Smirnov random variables. Frühwirth-Schnatter et al. [2009] propose an alternative data-augmentation scheme that approximates a log-Gamma distribution with a mixture of normals, resulting in conditionally conjugate updates for regression coefficients. Polson et al. [2012] develop a novel data-augmented representation of the likelihood in a logistic regression using Polya-Gamma latent variables. With a normal prior, the regression coefficients have conditionally conjugate posteriors. Their method has the advantage that the latent variables can be sampled directly via an efficient rejection algorithm, without the need to rely on additional auxiliary variables.

Although the work outlined above has opened the MNL to Gibbs sampling, the distributions of the latent variables are either exotic or complex scale mixtures of normals for which multivariate analogues are not simple to work with. As such, the extension to analysis of multivariate unordered categorical data, nonparametric regression, and other more complex situations is not straightforward. In contrast, the multinomial probit (MNP) model is trivially represented by a set of Gaussian latent variables, allowing access to a wide range of methods developed for Gaussian models. MNP is also a natural choice for Bayesian estimation because of the fully conjugate updates for model parameters and latent variables. However, because the latent Gaussians are not conditionally independent given regression parameters, mixing is poor and computation does not scale well.

Three characteristics of the multinomial probit lead to challenging computation and render it of limited use in complex problems: the need for identifying restrictions, the specification of non-diagonal covariance matrices for the residuals when this is not well motivated, and high dependence between latent variables. Because our proposed method addresses all three of these issues, we review each of them here and summarize the pertinent literature.

1.1 Identifying Restrictions

Many choices of identifying restrictions for the MNP have been suggested, and the choice of identifying restrictions has important implications for Bayesian inference (Burgette and Nordheim [2012]). Consider the standard multinomial probit, where here we assume a setting in which covariates are constant across levels of the response (i.e., a classification application):

$$y_i = j \iff u_{ij} = \bigvee_k u_{ik}, \qquad u_{ij} = x_i \beta_j + \epsilon_{ij}, \qquad \epsilon_i \sim N(0, \Sigma),$$

where $\bigvee$ is the max function. The $u_{ij}$'s are referred to as latent utilities, after the common economic interpretation of them as unobserved levels of welfare in choice models. We make the distinction between a classification application as presented above and the choice model common to the economics literature, in which each category has its own set of covariates (i.e., $u_{ij} = x_{ij}\beta + \epsilon_{ij}$). A common approach to identifying the model is to choose a base category and take differences. Suppose we select category 1 as the base. We then have the equivalent model:

$$\tilde{u}_{i1} = 0, \qquad \tilde{u}_{ij} = x_i(\beta_j - \beta_1) + \epsilon_{ij} - \epsilon_{i1}.$$

The $\tilde{u}$'s are a linear transformation $M$ of the original latent utilities, so $\tilde{u}_{i,2:J} \sim N(x_i(\beta_{2:J} - \beta_1), M^T \Sigma M)$. Early approaches to Bayesian computation in the MNP placed a prior on $\tilde{\Sigma} = M^T \Sigma M$. Even this parametrization is not fully identified, and requires that one of the variances be fixed (usually to one) to fully identify the model. McCulloch and Rossi [1994] ignored this issue and adopted a parameter-expanded Gibbs sampling approach by placing an inverse-Wishart prior on $\tilde{\Sigma}$. McCulloch et al. [2000] developed a prior specification on covariance matrices with the upper left element restricted to one. Imai and Van Dyk [2005] use a working parameter that is resampled at each step of the Gibbs sampler to transform between identified and unidentified parameters, with improved mixing, though the resulting full conditionals are complex and the sampling requires twice the number of steps as the original McCulloch and Rossi sampler. Zhang et al. [2006] used a parameter-expanded Metropolis-Hastings algorithm to obtain samples from a correlation matrix, resulting in an identified model that restricts the scales of the utilities to be the same. Zhang et al. [2008] extended their algorithm to multivariate multinomial probit models.

Whatever prior specification and computational approach is used, the selection of an arbitrary base category is an important modeling decision that may have serious implications for mixing (Burgette and Hahn [2010]). The class probabilities in the MNP are linked to the model parameters by integrating over subsets of $\mathbb{R}^J$. The choice of one category as base results in these regions having different geometries, causing an asymmetry in the effect of translations of the elliptical multivariate normal density on the resulting class probabilities. To address this issue, Burgette and Nordheim [2012] and Burgette and Hahn [2010] rely on an alternative identifying restriction, circumventing the need to select an arbitrary base category. Although mixing is improved and the model has an appealing symmetry in the representation of categorical probabilities as a function of latent variables, this approach does not address the issue of high dependence between latent variables, and the proposed Gibbs sampler is quite complex (though computationally no slower than simpler algorithms).
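Before turning to error dependence, the latent-utility construction above can be made concrete with a short simulation sketch. This is our illustration, not code from the paper; the variable names are ours. It generates classification data from an MNP with identity error covariance and category 1 as the base:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, J = 2000, 2, 5                    # observations, covariates, categories

X = rng.normal(size=(n, p))             # covariates, constant across categories
beta = rng.normal(size=(p, J))          # one coefficient vector per category
beta[:, 0] = 0.0                        # identifying restriction: beta_1 = 0

U = X @ beta + rng.normal(size=(n, J))  # u_ij = x_i beta_j + eps_ij, Sigma = I
y = U.argmax(axis=1)                    # y_i = j  <=>  u_ij = max_k u_ik
```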
1.2 Dependence in Idiosyncratic Errors

The multinomial probit model arose initially in the econometrics literature (see, e.g., Hausman and Wise [1978]), where dependence between the errors in the linear utility models is often considered desirable. In econometrics and marketing, interest often centers on predicting the effects of changes in product prices on product shares. If the errors have a diagonal covariance matrix, the model has the "independence of irrelevant alternatives" (IIA) property (see Train [2003]), which is often considered too restrictive for characterizing the substitution patterns between products in a market. Because early applications of the MNP were largely of the marketing and econometrics flavor, it has become standard to specify a model with dependence in the errors without much attention to whether it is scientifically motivated. We find little compelling reason for this assumption in most applications outside of economics and marketing. If the application does not motivate dependence in errors, the resulting model will certainly have higher variance than a model with independent errors, and the additional dependence between latent variables will negatively impact mixing. The applications we have in mind are not specifically economics/marketing applications, and as such we will generally assume that the errors are conditionally independent given regression parameters.


1.3 Dependence Between Latent Variables

Except in the special case of marketing applications where there are covariates that differ with the level of the response variable (such as the product price), an identified MNP must have $J - 1$ sets of regression coefficients, with one set restricted to be zero. If we assume $\beta_1 = 0$ and an identity covariance matrix, we can actually sample all $J$ latent variables for each observation $i$. Note this is equivalent to setting one class to have utility that is identically zero and sampling the remaining utility differences from the transformed distribution described in section 2.1; however, the intractability of the high dependence between utilities is much clearer using this alternative parametrization.

To implement data augmentation MCMC, we must sample the latent utilities conditional on regression coefficients from a multivariate normal distribution restricted to the space where $u_{i,y_i} = \bigvee_k u_{ik}$. Albert and Chib [1993] suggest rejection sampling, but this tends to be extremely inefficient, particularly as $J$ grows. All subsequent algorithms have sampled $u_{ij} \mid u_{i,-j}$ in sequence. With an identity covariance matrix for latent utilities conditional on regression parameters, we can reduce this to a two-step process:

1. Set $b = \bigvee_{k \neq y_i} u_{ik}$. Sample $u_{i,y_i} \sim N_{[b,\infty)}(x_i \beta_{y_i}, 1)$.

2. Sample $u_{ij} \sim N_{(-\infty, u_{i,y_i}]}(x_i \beta_j, 1)$ independently for all $j \neq y_i$.

Even in simple cases, mixing tends to be poor because the truncation region for each latent Gaussian depends on the current value in the Markov chain for the other latent Gaussians, resulting in high dependence. We reiterate that this issue is not simply a consequence of the choice of a non-diagonal covariance matrix; it is true for any choice of covariance matrix for the $\epsilon$'s. While recent work on developing a Hamiltonian Monte Carlo scheme for jointly sampling from truncated multivariate normal distributions appears promising as an alternative to the standard Gibbs sampling approach (Pakman and Paninski [2012]), the need to update each latent variable conditional on the others has historically been a substantial contributor to the inefficiency of Bayesian computation for MNP.
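The two-step update can be written directly; the sketch below is our illustration (not the authors' code), using scipy's truncnorm for the truncated normal draws. It updates the latent utilities for a single observation and makes explicit that each truncation bound depends on the current values of the other utilities, which is the source of the poor mixing described above.

```python
import numpy as np
from scipy.stats import truncnorm

def update_latent_utilities(u_i, y_i, means, rng):
    """One Gibbs scan over the MNP latent utilities for observation i,
    assuming identity error covariance; means[j] = x_i' beta_j."""
    J = len(u_i)
    others = [k for k in range(J) if k != y_i]
    # Step 1: utility of the observed category, truncated below at the
    # current maximum of the other utilities.
    b = max(u_i[k] for k in others)
    u_i[y_i] = truncnorm.rvs(b - means[y_i], np.inf,
                             loc=means[y_i], scale=1.0, random_state=rng)
    # Step 2: remaining utilities, truncated above at the new u_{i, y_i}.
    for k in others:
        u_i[k] = truncnorm.rvs(-np.inf, u_i[y_i] - means[k],
                               loc=means[k], scale=1.0, random_state=rng)
    return u_i
```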

In this paper we propose Diagonal Orthant Multinomial models (DO models), a novel class of models for unordered categorical response data that admits latent variable representations (including a Gaussian latent variable representation for the DO-probit model), does not require the selection of a base category, and in which the latent variables are conditionally independent given regression parameters and thus may be updated in a block, greatly improving mixing and computational scaling. The remainder of the paper is organized as follows. In section 2, we introduce DO models and show that a unique solution to the likelihood equations can be obtained from independent binary regressions. We also illustrate the relationship of DO-logistic to MNL and of DO-probit to MNP, give several interpretations for regression coefficients in DO models, and explain why the model parameters are often easier to interpret than in the MNP. In section 3, we outline a simple algorithm for Bayesian computation in the DO-probit and discuss extensions to the basic regression setting. In section 4, we compare the DO-probit, MNP, and MNL in simulation studies. In section 5, we apply both methods to a real dataset and show that they are virtually indistinguishable in prediction. In section 6 we conclude and discuss potential future directions for this work.

2 The Diagonal Orthant Multinomial Model

The Diagonal Orthant Multinomial (DO) class of models represents an unordered categorical variable as a set of binary variables. Let $y$ be unordered categorical with $J$ levels and suppose $\gamma_{[1:J]}$ are independent binary variables. Define $y = j \iff \{\gamma_j = 1\} \cap \{\gamma_k = 0 \ \forall\, k \neq j\}$. Binary variables have a well-known latent variable representation. Let $z_j \sim f(\mu_j, \sigma)$, where $f$ is a location-scale density with location parameter $\mu_j$ and common scale $\sigma$, and set $\gamma_j = 1 \iff z_j > 0$. However, for our purposes, we must ensure that exactly one $\gamma_j$ is one, and thus we restrict the $z$'s to belong to the set

$$\Omega = \bigcup_{j=1}^{J} \left\{ z \in \mathbb{R}^J : z_j > 0,\ z_k < 0 \text{ for } k \neq j \right\}.$$

By the Radon-Nikodym theorem, the joint distribution of the $z$'s is

$$f(z) = \frac{\mathbf{1}(z \in \Omega) \prod_{j=1}^{J} f(z_j - \mu_j)}{\int_{\mathbb{R}^J} \mathbf{1}(z \in \Omega) \prod_{j=1}^{J} f(z_j - \mu_j)\, dz}.$$

If we let $f = \phi(\cdot)$, where $\phi(\cdot)$ is the univariate normal pdf, the result is a probit analogue that we refer to as DO-probit.
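As an illustration of the restriction to $\Omega$ (our sketch, not from the paper), one can forward-simulate from the DO-probit by drawing independent normals and rejecting draws that fall outside $\Omega$; the index of the single positive coordinate is the category. Rejection is used here purely for clarity and becomes inefficient as $J$ grows. The empirical category frequencies match the closed-form probabilities $\psi_j$ given below, which are proportional to the odds $(1 - \Phi(-\mu_j))/\Phi(-\mu_j)$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def draw_do_probit(mu, rng):
    """Draw z ~ N(mu, I) until z lands in Omega (exactly one positive
    coordinate); return the category, i.e. the positive coordinate's index."""
    while True:
        z = rng.normal(loc=mu, scale=1.0)
        if (z > 0).sum() == 1:
            return int(np.argmax(z > 0))

mu = np.array([0.4, -0.3, -1.0])
draws = np.array([draw_do_probit(mu, rng) for _ in range(50000)])

odds = norm.cdf(mu) / norm.cdf(-mu)        # (1 - F(-mu_j)) / F(-mu_j)
print(np.bincount(draws) / len(draws))     # empirical category frequencies
print(odds / odds.sum())                   # closed-form psi_j, same values
```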

The joint probability density of the $z$'s in DO-probit is that of a $J$-variate normal distribution with identity covariance, restricted to the regions of $\mathbb{R}^J$ with one sign positive and the others negative. This is easy to visualize in $\mathbb{R}^2$, where the density is a bivariate normal with correlation zero and unit variance restricted to the second and fourth quadrants. The density in the two-dimensional case is shown in Figure 1. In higher dimensions, the restriction will define orthants over which the density is nonzero. The marginal distribution of any two latent variables will always be restricted to orthants that are diagonally opposed rather than adjacent, hence the designation Diagonal Orthant Multinomial model.

Figure 1: The joint density of the $z$'s in the DO-probit for the 2-category case with zero means (left panel). The right panel shows approximate semispherical regions covering 95 percent of the total probability in the 3-category case with zero means for all categories.

The probability measure for $z$ induces a probability measure on $y$. The categorical probabilities are easily calculated as

$$\Pr(y = j) = \frac{(1 - F(-\mu_j)) \prod_{k \neq j} F(-\mu_k)}{\sum_{s=1}^{J} (1 - F(-\mu_s)) \prod_{k \neq s} F(-\mu_k)} = \psi_j(\mu_1, \ldots, \mu_J),$$

where $F(\cdot)$ is the CDF corresponding to $f$. Clearly, if $f = \phi$, then $F = \Phi(\cdot)$, the standard normal CDF. While, strictly speaking, the DO model with a full set of $J$ category-specific intercepts and $J \times p$ regression coefficients is not identified, in the next section we suggest a simple identifying restriction that, critically, allows the parameters of the DO model to be estimated via independent binary regressions, providing substantial computational advantages.

2.1 Interpretation of Regression Coefficients and Relationship to Multinomial Logit

In a regression context, DO models define an alternative link function in a GLM for unordered categorical/multinomial data, where

$$y_i \sim \text{Categorical}(p_i), \qquad p_i = \big(\psi_1(x_i, \beta_{[1:J]}),\ \psi_2(x_i, \beta_{[1:J]}),\ \ldots,\ \psi_J(x_i, \beta_{[1:J]})\big).$$

The conditional class probabilities in the DO models have a useful correspondence with the multinomial logit. The class probabilities in the regression problem, $\Pr(y_i = j \mid x_i, \beta_{[1:J]}) = \psi_j(x_i, \beta_{[1:J]})$, are given by

$$\psi_j(x_i, \beta_{[1:J]}) = \frac{(1 - F(-x_i\beta_j)) \prod_{l \neq j} F(-x_i\beta_l)}{\sum_{s=1}^{J} (1 - F(-x_i\beta_s)) \prod_{l \neq s} F(-x_i\beta_l)}.$$

The ratio of probabilities for two classes $j$ and $k$ is therefore

$$\frac{\big[(1 - F(-x_i\beta_j)) \prod_{l \neq j} F(-x_i\beta_l)\big] \big/ \big[\sum_{s=1}^{J} (1 - F(-x_i\beta_s)) \prod_{l \neq s} F(-x_i\beta_l)\big]}{\big[(1 - F(-x_i\beta_k)) \prod_{l \neq k} F(-x_i\beta_l)\big] \big/ \big[\sum_{s=1}^{J} (1 - F(-x_i\beta_s)) \prod_{l \neq s} F(-x_i\beta_l)\big]}.$$

The denominators and all but one of the terms in the products in the numerators cancel, leaving

$$\frac{\psi_j(x_i, \beta_{[1:J]})}{\psi_k(x_i, \beta_{[1:J]})} = \frac{(1 - F(-x_i\beta_j))\, F(-x_i\beta_k)}{(1 - F(-x_i\beta_k))\, F(-x_i\beta_j)} = \frac{(1 - F(-x_i\beta_j))/F(-x_i\beta_j)}{(1 - F(-x_i\beta_k))/F(-x_i\beta_k)},$$

which is a function of only $\beta_j$ and $\beta_k$. Recall that for the multinomial logit model, the ratio of class probabilities for classes $j$ and $k$ also depends only on $\beta_j$ and $\beta_k$ and is given by

$$\log\left(\frac{\Pr(y_i = j)}{\Pr(y_i = k)}\right) = x_i\beta_j - x_i\beta_k.$$

In DO-probit, $F(\cdot) = \Phi(\cdot)$, so the log relative class probability, $\log\big(\Pr(y_i = j)/\Pr(y_i = k)\big)$, is

$$\log\left(\frac{1 - \Phi(-x_i\beta_j)}{\Phi(-x_i\beta_j)}\right) - \log\left(\frac{1 - \Phi(-x_i\beta_k)}{\Phi(-x_i\beta_k)}\right).$$

While not as convenient as the linear expression arising from the multinomial logit, this quantity is nonetheless easily calculated and provides a direct relationship between the coefficients in the multinomial logit and those in the DO-probit. This is a substantial advantage over the multinomial probit model, in which the class probabilities do not have a closed form.

The more interesting case arises when we choose a logistic distribution for the $z_j$'s in the DO model, giving the DO-logistic model. Here, $F(t) = \frac{1}{1+e^{-t}}$ is the logistic CDF, and the log probability ratio for two categories is

$$\log\left(\frac{p_i^{(j)}}{p_i^{(k)}}\right) = \log\left(\frac{\frac{e^{x_i\beta_j}}{1+e^{x_i\beta_j}} \cdot \frac{1}{1+e^{x_i\beta_k}}}{\frac{e^{x_i\beta_k}}{1+e^{x_i\beta_k}} \cdot \frac{1}{1+e^{x_i\beta_j}}}\right) = x_i\beta_j - x_i\beta_k, \qquad (1)$$

where $p_i^{(j)} = \Pr(y_i = j)$. This is identical to the log probability ratios in the multinomial logit. Evidently, the DO-logistic is an alternative form of the MNL.
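The cancellation above is easy to check numerically. The following sketch is ours (not from the paper); $\eta_j$ stands in for $x_i\beta_j$ for a single observation. It computes $\psi$ for the DO-logistic and verifies that the log ratio of class probabilities reduces to $\eta_j - \eta_k$, as in (1):

```python
import numpy as np

def psi(eta, F):
    """DO class probabilities: psi_j proportional to
    (1 - F(-eta_j)) * prod_{l != j} F(-eta_l)."""
    Fm = F(-eta)
    num = (1 - Fm) * np.array([np.prod(np.delete(Fm, j)) for j in range(len(eta))])
    return num / num.sum()

logistic = lambda t: 1.0 / (1.0 + np.exp(-t))
eta = np.array([0.3, -1.2, 0.7])              # eta_j = x_i beta_j
p = psi(eta, logistic)
print(np.log(p[0] / p[1]), eta[0] - eta[1])   # identical, verifying (1)
```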


In the next section, we suggest an identifying restriction for the DO model that is much more convenient than the use of an arbitrary base category in MNL models and treats all of the categories identically. We also note that, using the approach in O'Brien and Dunson [2004], one can easily do computation for the DO-logistic model by introducing a scale parameter for the latent variables that is mixed over a Gamma density. This immediately suggests an alternative Gibbs sampling algorithm for an MNL-like model that involves only one additional sampling step relative to the DO-probit.

2.2 Identification

Closer inspection of (1) reveals something rather striking about the representation of category probabilities in DO models. Consider (1) in an intercepts-only model:

$$\log\left(\frac{p_i^{(j)}}{p_i^{(k)}}\right) = \log\left(\frac{\frac{e^{\mu_j}}{1+e^{\mu_j}} \cdot \frac{1}{1+e^{\mu_k}}}{\frac{e^{\mu_k}}{1+e^{\mu_k}} \cdot \frac{1}{1+e^{\mu_j}}}\right).$$

As presented thus far, DO is an under-identified generalized linear model, and thus there will be multiple sets of parameters $\mu_1, \ldots, \mu_J$ that maximize the likelihood (infinitely many, in fact). However, all of the solutions $\hat{\mu}_1, \ldots, \hat{\mu}_J$ to the likelihood equations must satisfy

$$\log\left(\frac{\hat{p}_j}{\hat{p}_k}\right) = \hat{\mu}_j - \hat{\mu}_k$$

for any $j, k \in \{1, \ldots, J\}$, a simple consequence of MLE invariance. Yet (1) suggests that we might identify the model and obtain a very useful approach to estimation simply by recognizing the connection with binary logistic regression. Consider an intercept-only binary logistic regression with response $\tilde{y}_i^{(j)} = \mathbf{1}(y_i = j)$ and let $p_j^{(B)} = \Pr(\tilde{y}_j = 1)$. We have that

$$\log\left(\frac{\hat{p}_j^{(B)}}{1 - \hat{p}_j^{(B)}}\right) = \hat{\mu}_j^{(B)},$$

and so for two independent logistic regressions of $\tilde{y}_j$ and $\tilde{y}_k$ on an intercept, we get

$$\log\left(\frac{\hat{p}_j^{(B)} (1 - \hat{p}_k^{(B)})}{\hat{p}_k^{(B)} (1 - \hat{p}_j^{(B)})}\right) = \hat{\mu}_j^{(B)} - \hat{\mu}_k^{(B)}.$$

But since

$$\hat{p}_j^{(B)} = \frac{e^{\hat{\mu}_j^{(B)}}}{1 + e^{\hat{\mu}_j^{(B)}}}, \qquad 1 - \hat{p}_j^{(B)} = \frac{1}{1 + e^{\hat{\mu}_j^{(B)}}},$$

we have that

$$\log\left(\frac{\frac{e^{\hat{\mu}_j^{(B)}}}{1+e^{\hat{\mu}_j^{(B)}}} \cdot \frac{1}{1+e^{\hat{\mu}_k^{(B)}}}}{\frac{e^{\hat{\mu}_k^{(B)}}}{1+e^{\hat{\mu}_k^{(B)}}} \cdot \frac{1}{1+e^{\hat{\mu}_j^{(B)}}}}\right) = \hat{\mu}_j^{(B)} - \hat{\mu}_k^{(B)}.$$

Therefore the collection $\hat{\mu}_1^{(B)}, \ldots, \hat{\mu}_J^{(B)}$, that is, the MLEs from independent binary regressions on $\tilde{y}_1, \ldots, \tilde{y}_J$, is a valid solution to the likelihood equations for the DO model. Since the MLEs $\hat{\mu}_j^{(B)}$ for any $j$ are unique, this solution is also unique, and is in fact identical to the solution that results from imposing the restriction $\sum_{j=1}^{J} \hat{p}_j = 1$ in the DO-logistic model. A similar argument goes through for the general DO model. This further hints at the strategy we employ for Bayesian computation, which is identical to that used for $J$ independent binary regressions. Of course, this was by construction; DO models were conceived of and designed expressly to allow for Bayesian computation using a set of $J$ independent latent variables.

Note that while this is conceptually similar to the derivation of the multinomial logit model from independent binary regressions, in that case the response for each binary regression is defined by $\tilde{y}_i^{(j)} = 1$ if $y_i = j$ and $\tilde{y}_i^{(j)} = 0$ if $y_i = b$, where $b$ is the base category, and thus the binary regressions are on a subset of the observations and defined against an arbitrary base category. For DO models, we perform binary regressions on all of the observations and do not require a base category. We also have the advantage of an additional interpretation of the estimated parameters as relating to the marginal probability of each category.

2.3 Relationship of DO-probit to Multinomial Probit

Both MNP and DO-probit link probability vectors to latent Gaussian variables by integrating multivariate normal densities over particular regions. For reasons discussed earlier, we consider only cases in which the latent Gaussian variables are conditionally independent given parameters. To simplify the exposition, we consider an intercepts-only MNP and a corresponding DO-probit model with $J$ category-specific mean parameters and the identifying restriction presented in the previous section. Suppose we have an MNP with identity covariance and restrict the mean of the first utility to be zero.

If we take $\mu_2 = -0.75$, the resulting category probabilities for the trivial two-category model are $(0.7115, 0.2885)$. The equivalent DO-probit model will have parameters $\mu_1 = \Phi^{-1}(0.7115) = 0.5578$ and $\mu_2 = \Phi^{-1}(0.2885) = -0.5578$. The two models give identical category probabilities, but they arrive at them differently. Figure 2 shows the contours of the bivariate normal distribution for the latent utilities in the MNP (left panel) and the truncated bivariate normal distribution in the DO-probit (right panel). The MNP integrates above and below the line $y = x$ (shown on the figure) to calculate probabilities, whereas DO-probit integrates over the second and fourth quadrants.

Figure 2: Contours and integration regions for equivalent two-category MNP and DO-probit models with category probabilities (0.7115, 0.2885).

Of course, we can define a 1-1 function between MNPs with $J$ categories, an identity covariance matrix, and $\mu_1 = 0$, and DO-probit with the marginal MLE restriction outlined in the previous section. The MNP probabilities for such a model are defined by the intractable integrals

$$p_j = \Pr(y = j) = \int_{-\infty}^{\infty} \phi(z_j - \mu_j) \left( \int_{-\infty}^{z_j} \phi(z_J - \mu_J)\, dz_J \cdots \int_{-\infty}^{z_j} \phi(z_1)\, dz_1 \right) dz_j$$

for $j = 1, \ldots, J$. The integral has no analytic form. As such, marginalizing over the latent variables in MNP is computationally costly, and thus the latent variables are usually conditioned on in MCMC algorithms rather than integrated out. This leads to high dependence and inefficient computation. In contrast, one can marginalize out the latent variables in DO-probit and obtain analytic expressions for the category probabilities, a significant benefit of our approach.

One can approximate the MNP integrals by quadrature or simulation, giving a set of probabilities $p_1, \ldots, p_J$. We can then find the equivalent parameters of a DO-probit by simply inverting the probabilities, i.e., $\mu_j = \Phi^{-1}(p_j)$ for all $j$. Note that while category probabilities are increasing in $\mu$ for both models, the MNP does not have the simple interpretation of the regression parameters as relating to the marginal category probabilities, and cannot be estimated from independent binary regressions as can the DO-probit.

3 Computation

Bayesian computation for the DO-probit is very straightforward. With a $N(0, cI)$, $c \in \mathbb{R}^+$, prior on the regression coefficients, the entire algorithm consists of Gibbs steps:

1. Sample $z_{i,y_i} \mid \beta_{y_i} \sim N_+(x_i^T \beta_{y_i}, 1)$ and $z_{i,k} \sim N_-(x_i^T \beta_k, 1)$ for $k \neq y_i$, where $N_+$ and $N_-$ denote normal distributions truncated below and above by zero, respectively.

2. Sample $\beta_k \mid z \sim N(\tilde{\mu}, \tilde{S})$ with $\tilde{S} = (x^T x + (1/c) I)^{-1}$ and $\tilde{\mu} = \tilde{S} x^T z_{\cdot,k}$ for $k \in \{1, \ldots, J\}$.
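The two steps translate into a short sampler. The following is a minimal sketch (our code, assuming labels coded $0, \ldots, J-1$, the $N(0, cI)$ prior above, and scipy's truncnorm for the truncated normal draws); since the posterior covariance $\tilde{S}$ does not depend on the latent variables, it is computed once outside the loop.

```python
import numpy as np
from scipy.stats import truncnorm

def do_probit_gibbs(X, y, J, c=10.0, iters=2000, seed=0):
    """Gibbs sampler for DO-probit with independent N(0, cI) priors on each beta_k."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros((J, p))
    S = np.linalg.inv(X.T @ X + np.eye(p) / c)     # posterior covariance S~
    L = np.linalg.cholesky(S)
    onehot = (np.arange(J) == y[:, None])          # True where k == y_i
    lo = np.where(onehot, 0.0, -np.inf)            # z_{i,y_i} truncated below at 0
    hi = np.where(onehot, np.inf, 0.0)             # z_{i,k}, k != y_i, above at 0
    samples = np.empty((iters, J, p))
    for t in range(iters):
        # Step 1: draw all latent variables at once; they are conditionally
        # independent given beta, with fixed truncation at zero.
        M = X @ beta.T                              # (n, J) means x_i' beta_k
        z = truncnorm.rvs(lo - M, hi - M, loc=M, scale=1.0, random_state=rng)
        # Step 2: conjugate update beta_k | z ~ N(S~ X' z_k, S~)
        for k in range(J):
            beta[k] = S @ (X.T @ z[:, k]) + L @ rng.normal(size=p)
        samples[t] = beta
    return samples
```

In contrast to the MNP update sketched in section 1.3, the truncation bounds here are fixed at zero, so the entire $n \times J$ block of latent variables can be drawn in a single vectorized call.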


Moreover, because of the latent Gaussian data augmentation scheme used to estimate the model, there are many alternatives to normal priors on regression coefficients. In classification problems with $p$ covariates, the dimension of the parameter space is $J \times p$ (or $(J-1) \times p$ in the MNL and MNP), where $J$ is the number of possible values of $y$. Also, the larger the $J$, the less information is provided by observing $y_i = j$. Thus, even with modest $p$ and fairly large $n$, one should consider priors on regression coefficients with robust shrinkage properties, particularly when prediction is an important goal. The local-global family of shrinkage priors has the scale-mixture representation

$$\beta_{jl} \sim N(0, \tau^2 \phi_{jl}^2),$$

where we have adapted the notation to the regression classification context with $j \in \{1, \ldots, J\}$ and $l \in \{1, \ldots, p\}$. Here, $\tau$ is a global shrinkage parameter and the $\phi_{jl}$ are local shrinkage parameters corresponding to $\beta_{jl}$. There is a substantial literature on priors for $\tau$ and the $\phi_{jl}$'s that favor aggressively shrinking most of the coefficients toward zero while retaining the sparse signals. For example, choosing independent $C^+(0,1)$ priors on $\tau$ and the $\phi_{jl}$'s gives the horseshoe prior (Carvalho et al. [2010]), where $C^+(0,1)$ is a standard half-Cauchy distribution. Local-global priors can be employed in our model, with the only additional computational burden relative to the continuous response case being the imputation of the latent variables; a sketch of the modified conjugate update appears at the end of this section.

It is equally straightforward to apply other priors on regression parameters. Bayesian variable selection and model averaging can be performed by specifying point-mass mixture priors on $\beta_{jl}$ and employing a stochastic search variable selection algorithm (see Hoeting et al. [1999] for an overview of these methods). A nonparametric prior on regression parameters can be obtained by specifying Gaussian process (GP) priors on the latent variables. The simplest approach is to specify $J$ independent GPs corresponding to the $J$ sets of latent variables (one for each possible value of $y$). In MNP, this leads to difficult computation, and thus it is often necessary to use posterior approximations to make computation tractable (see Girolami and Rogers [2006]). The superior computational properties of DO-probit suggest that exact posterior computation using MCMC may be feasible with GP priors for problems of meaningful scale.
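As a hedged illustration of how step 2 changes under a local-global prior (again our sketch, not the authors' code; the updates for $\tau$ and the $\phi_{jl}$ themselves, e.g. slice or Metropolis-Hastings steps for the horseshoe, are omitted), the only modification is that the fixed prior precision $(1/c)I$ is replaced by a diagonal matrix that changes with the current scales:

```python
import numpy as np

def beta_update_local_global(X, z_k, tau2, phi2_k, rng):
    """Conjugate draw of beta_k given latents z_k and current scales,
    under the prior beta_kl ~ N(0, tau2 * phi2_k[l])."""
    S = np.linalg.inv(X.T @ X + np.diag(1.0 / (tau2 * phi2_k)))
    mean = S @ (X.T @ z_k)
    return mean + np.linalg.cholesky(S) @ rng.normal(size=X.shape[1])
```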

4 Simulation Studies

We conducted a series of simulation studies to assess the performance and properties of DO-probit and to compare it with MNP. We first simulated data from an MNP with $n \times p$ design matrix $x$ and identity covariance matrix. We used $n = 2000$, $p = 2$, and $J = 5$ levels of the response. The $(J-1)$ category-specific intercepts were sampled from $N(0, 0.5)$ and the $(J-1)p$ coefficients were sampled from $N(0, 1)$. We then fit either MNP or DO-probit by Gibbs sampling. The chains were run for 10,000 iterations each, in every case starting from $\beta = 0$. We repeated the simulation and subsequent fitting 10 times. Note that the parameter expansion algorithms designed to improve mixing are irrelevant for fitting this MNP model since we do not need to sample a covariance matrix. We chose category 1 as the base category for fitting.

Figure 3 shows histograms of lag-10, lag-25, and lag-100 autocorrelations for the Markov chains for $\beta$ from MNP and DO-probit across the 10 simulations. Autocorrelations for DO-probit are much lower at all three lag lengths. This is a critical aspect of the performance of Bayesian stochastic algorithms: higher autocorrelations require much longer run times to achieve the same effective sample sizes. In more complex models, the autocorrelations for MNP can be prohibitively high. In the following section, we show that lower autocorrelations are a feature of the DO-probit that persists in real applications with more complex priors on the coefficients.

Figure 3: Distribution of lag-10, -25, and -100 autocorrelations across all simulations and parameters for DO-probit (top) and MNP (bottom).

Table 1 shows the mean in-sample misclassification rate (using the posterior mode as the prediction) from each of the ten simulations. The results are quite comparable for the two models, suggesting that DO-probit and MNP are exchangeable in regard to their performance as classifiers.

Simulation   MNP    DO-probit
1            0.30   0.30
2            0.36   0.37
3            0.51   0.51
4            0.32   0.32
5            0.42   0.42
6            0.49   0.49
7            0.49   0.48
8            0.36   0.37
9            0.42   0.42
10           0.45   0.45

Table 1: Comparison of misclassification rates for each of 10 simulated data sets.

5 Applications

We consider the glass identification data from the UCI machine learning repository (Frank and Asuncion [2010]). There are 214 observations in the data, and the response (class of the glass sample) is unordered categorical with seven possible levels, of which six are observed. There are nine continuous covariates. Thus a DO-probit classifier has sixty parameters (6 x 9 regression coefficients and 6 intercepts). Because $n$ is not large relative to $p$ in this case, we use a horseshoe shrinkage prior on the $\beta$'s (see Carvalho et al. [2010] and the discussion in section 3).

A boxplot of posterior samples for $\beta$ is shown in the left panel of Figure 4, and a corresponding plot for the MNP with identity covariance matrix is presented in the right panel. Red boxes indicate parameters that are considered nonzero on the basis of the criterion suggested in Carvalho et al. [2010] (let $\kappa_{jk} = 1/(1 + \tau^2\phi_{jk}^2)$ denote the shrinkage factor for $\beta_{jk}$, and consider $\beta_{jk}$ nonzero if $\hat{\kappa}_{jk}$, the posterior mean of $\kappa_{jk}$, is less than 0.5). Recent work has shown that this inclusion criterion has optimal properties under a 0-1 loss function (Datta and Ghosh [2012]). Note that of the 60 coefficients in the DO-probit, 56 are effectively shrunk to zero, whereas of the 50 coefficients in the MNP, 39 are effectively shrunk to zero. In addition, of the 9 covariates, 6 have all 6 coefficients shrunk to zero in the DO-probit, which is the equivalent of excluding that covariate entirely. In the MNP, 5 of these 6 covariates also have all 5 coefficients effectively shrunk to zero.

We assessed out-of-sample prediction on the dataset by randomly holding out 10 percent of observations and estimating the model on the remaining 90 percent, a process that was repeated twenty times. We formed out-of-sample predictions by taking the posterior mode of the predicted class for each observation in the test set. The full set of 60 (50) coefficients was used for prediction in the DO-probit (MNP) models, which is standard practice for shrinkage priors. Table 3 shows the best, median, and worst misclassification rates for the two models. The results show that the two classifiers are equivalent. Note that for over half of the test datasets, the misclassification rates for the two models were identical.
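One way to implement the posterior-mode prediction described above is sketched below (our code; beta_samples as returned by the sampler in section 3). Because $\psi_j$ is increasing in $x'\beta_j$ through the same monotone function for every $j$, the most probable class under a given draw is simply the argmax of the linear predictors.

```python
import numpy as np

def predict_posterior_mode(X_test, beta_samples):
    """Out-of-sample prediction: classify each test point under every posterior
    draw, then take the most frequent (modal) class across draws."""
    votes = np.stack([(X_test @ beta.T).argmax(axis=1)   # argmax_j of x' beta_j
                      for beta in beta_samples])          # shape (draws, n_test)
    return np.array([np.bincount(v).argmax() for v in votes.T])
```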

           best   median   worst
Probit     0.22   0.38     0.58
DO-Probit  0.22   0.37     0.60

Table 3: Summary of misclassification rates for 20 random holdouts of glass identification data.

Figure 4: Boxplots of posterior samples for $\beta_{1:p}$ for MNP and DO-probit. Red boxes indicate that the coefficient is considered nonzero by the criterion described above.

Figure 5 shows boxplots of the lag-1 through lag-50 posterior autocorrelations in the Markov chains for $\beta$ from the DO-probit and MNP models. The boxes show the interquartile range and the black dots the medians. This confirms that the much lower autocorrelations of the DO-probit persist in real applications with larger numbers of parameters and more complex hierarchical modeling structures. Note that the residual autocorrelation for DO-probit is due mainly to the Metropolis-Hastings steps used to sample the scale parameters for the horseshoe prior, and that parameter expansion or slice sampling could improve the mixing; see Scott [2010].

Figure 5: Boxplots of lag-1 through lag-50 autocorrelations for MCMC samples of $\beta_{1:p}$ from MNP and DO-probit.

Table 2 shows quantiles of the effective sample size for DO-probit and MNP estimated on the glass data. The median effective sample size for DO-probit is 3.85 times that for MNP, even in this relatively simple modeling context. The run time per iteration for DO-probit was 1.06 times that of MNP, a difference that is entirely due to the larger number of parameters for DO-probit.

            10%   25%   50%    75%    90%
DO-probit   101   250   1204   3176   6085
MNP         78    128   216    464    835

Table 2: Quantiles of effective sample sizes for $\beta$ parameters from DO-probit and MNP using 15,000 MCMC iterations with a 1000-iteration burn-in.

6 Discussion

The DO-probit model provides an attractive alternative to the multinomial logit and multinomial probit models that overcomes the major limitations of each while retaining many of their attractive properties. Like MNP, DO-probit admits a Gaussian latent variable representation, allowing for simple conjugate updates for regression parameters and application of numerous methods designed for multivariate Gaussian data, a feature that MNL lacks. However, our model does not suffer from the poor mixing and high autocorrelations in MCMC samples that make MNP practically infeasible for use in high-dimensional applications. Unlike MNP and MNL, our model does not require a base category for identification. However, class probabilities for DO-probit marginal of the latent variables are easily calculated, and relative class probabilities are functions only of the regression parameters corresponding to the compared classes, an attractive feature shared with MNL.

The DO-probit link provides a number of possible avenues for future work. The Gaussian latent variable representation of the model allows for extension to nonparametric regression via specifying a Gaussian process prior on the latent variables. Another interesting possibility would be to explore a novel class of discrete choice models by allowing dependence between latent variables, the analogue of a non-diagonal covariance matrix in the MNP. Multivariate unordered categorical response data with covariates is a particularly challenging context in which to develop computationally tractable Bayesian methods. One could potentially allow for dependence in multivariate cases by specifying a prior on structured covariance matrices for the latent Gaussian variables. Another possible application would be to time series of polychotomous variables, by specifying AR models on the latent variables. In complex modeling situations such as these, fully Bayesian estimation using DO-probit may be straightforward, whereas the computational challenges of the MNP make these extensions, while theoretically possible, practically infeasible.

7 Acknowledgments

This work was supported by Award Number R01ES017436 from the National Institute of Environmental Health Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences or the National Institutes of Health. This work has been partially supported by DTRA CNIMS Contract HDTRA1-11-D-0016-0001 and Virginia Tech internal funds.

References

J.H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669-679, 1993.

L.F. Burgette and P.R. Hahn. Symmetric Bayesian multinomial probit models. Duke University Statistical Science Technical Report, pages 1-20, 2010.

L.F. Burgette and E.V. Nordheim. The trace restriction: An alternative identification strategy for the Bayesian multinomial probit model. Journal of Business & Economic Statistics, 30(3):404-410, 2012.

C.M. Carvalho, N.G. Polson, and J.G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.

J. Datta and J.K. Ghosh. Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 7(4):771-792, 2012.

A. Frank and A. Asuncion. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

S. Frühwirth-Schnatter, R. Frühwirth, L. Held, and H. Rue. Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Statistics and Computing, 19(4):479-492, 2009.

M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.

J.A. Hausman and D.A. Wise. A conditional probit model for qualitative choice: Discrete decisions recognizing interdependence and heterogeneous preferences. Econometrica: Journal of the Econometric Society, pages 403-426, 1978.

J.A. Hoeting, D. Madigan, A.E. Raftery, and C.T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, pages 382-401, 1999.

C.C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145-168, 2006.

K. Imai and D.A. Van Dyk. A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 124(2):311-334, 2005.

R. McCulloch and P.E. Rossi. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics, 64(1):207-240, 1994.

R.E. McCulloch, N.G. Polson, and P.E. Rossi. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1):173-193, 2000.

S.M. O'Brien and D.B. Dunson. Bayesian multivariate logistic regression. Biometrics, 60(3):739-746, 2004.

A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv preprint arXiv:1208.4118, 2012.

N.G. Polson, J.G. Scott, and J. Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. arXiv e-prints, 2012.

J.G. Scott. Parameter expansion in local-shrinkage models. arXiv preprint arXiv:1010.5265, 2010.

K.E. Train. Discrete Choice Methods with Simulation. Cambridge University Press, 2003.


X. Zhang, W.J. Boscardin, and T.R. Belin. Sampling correlation matrices in Bayesian models with correlated latent variables. Journal of Computational and Graphical Statistics, 15(4):880-896, 2006.

X. Zhang, W.J. Boscardin, and T.R. Belin. Bayesian analysis of multivariate nominal measures using multivariate multinomial probit models. Computational Statistics & Data Analysis, 52(7):3697-3708, 2008.
