Ultimate Pólya Gamma Samplers – Efficient MCMC for possibly imbalanced binary and categorical data

Sylvia Frühwirth-Schnatter (WU Wien), Gregor Zens (WU Wien), Helga Wagner (JKU Linz)

November 18, 2020

Abstract. Modeling binary and categorical data is one of the most commonly encountered tasks of applied statisticians and econometricians. While Bayesian methods in this context have been available for decades now, they often require a high level of familiarity with Bayesian statistics or suffer from issues such as low sampling efficiency. To contribute to the accessibility of Bayesian models for binary and categorical data, we introduce novel latent variable representations based on Pólya Gamma random variables for a range of commonly encountered models. From these latent variable representations, new algorithms for binary, binomial and multinomial models are derived. All models allow for a conditionally Gaussian likelihood representation, rendering extensions to more complex modeling frameworks such as state space models straightforward. However, sampling efficiency may still be an issue in these data augmentation based estimation frameworks. To counteract this, MCMC boosting strategies are developed and discussed in detail. The merits of our approach are illustrated through extensive simulations and a real data application.

Keywords: Bayesian, Data augmentation, Discrete choice, Gibbs sampling, MCMC boosting.

1 Introduction

Applied statisticians and econometricians commonly have to deal with modelling binary or categorical outcome variables. Widely used tools for analyzing such data include logit and probit models, binomial and multinomial logit (MNL) regression-type models as well as more flexible models such as binary state space or random effects models. Bayesian approaches toward inference are very useful in this context, in particular for uncertainty quantification, but as opposed to Gaussian outcome variables their implementation is demanding from a computational viewpoint.

One strategy to implement sampling-based inference relies on importance sampling (Zellner & Rossi 1984) or Metropolis-Hastings (MH) algorithms (Rossi et al. 2005), exploiting directly the non-Gaussian likelihood. However, these algorithms require careful tuning and substantial experience with Bayesian computation for more complex models such as state space models. Routine Bayesian computation for these types of data more often relies on Markov chain Monte Carlo (MCMC) algorithms based on data augmentation. As shown by the seminal paper of Albert & Chib (1993), the binary probit model admits a latent variable representation where the latent variable equation is linear in the unknown parameters with an error term following a standard normal distribution. As the latent variables are easily sampled for known parameters, the latent variable representation admits a straightforward Gibbs sampler using one level of data augmentation, where the unknown parameters are sampled from a conditionally Gaussian model. This strategy works also for more complex models such as probit state space or random effects models.

However, MCMC estimation based on data augmentation is less straightforward for a logit model, which still admits a latent variable representation that is linear in the unknown parameters, but exhibits an error following a logistic distribution. Related latent variable representations with non-Gaussian errors exist for MNL models (see e.g. Frühwirth-Schnatter & Frühwirth (2010)) and binomial models (Fussl et al. 2013). While the latent variables can be sampled straightforwardly for any of these latent variable representations, sampling the unknown parameters is far from straightforward due to the non-Gaussian errors.

A common resolution relies on a scale-mixture representation of the non-Gaussian error distribution and introduces the scales as a second level of data augmentation. Very conveniently, the unknown model parameters are then sampled from a conditionally Gaussian model. Examples include a representation of the logistic distribution involving the Kolmogorov-Smirnov distribution (Holmes & Held 2006) and very accurate finite scale-mixture approximations (Frühwirth-Schnatter & Frühwirth 2007, 2010, Frühwirth-Schnatter et al. 2009). A seminal paper in this context which avoids any explicit latent variable representation and works instead with a single level of data augmentation is Polson et al. (2013). They derive the Pólya-Gamma sampler, using a mixture representation of the non-Gaussian likelihood of the marginal model involving the Pólya-Gamma distribution.

In the present paper, we propose a new sampler involving the Pólya-Gamma distribution introduced by Polson et al. (2013). However, as opposed to working with the marginal model, we introduce a new mixture representation of the logistic distribution based on the Pólya-Gamma distribution in the latent variable representation of the logit model. Similarly as Holmes & Held (2006) and Frühwirth-Schnatter & Frühwirth (2010), we expand data augmentation and introduce the Pólya-Gamma mixing variables as a second set of latent variables. Our new Pólya-Gamma mixture representation of the logit model has the advantage that the joint (conditional) posterior distribution of all data-augmented variables is easy to sample from, as the Pólya-Gamma mixing variable conditional on the latent variable follows a tilted Pólya-Gamma distribution. Conditional on all data-augmented variables, this allows us to sample the unknown model parameters from a conditionally Gaussian model, even for complex settings such as state space or random effects models.

One challenge when working with any MCMC method based on data augmentation is poor mixing, in particular in the case of imbalanced data, where the success probability is either close to zero or one for the majority of the observed data points, see the excellent work of Johndrow et al. (2019). Neither the original Pólya-Gamma sampler of Polson et al. (2013) with a single level of data augmentation nor our new Pólya-Gamma sampler with two levels of data augmentation is an exception to this rule. In the present paper we introduce imbalanced marginal data augmentation (iMDA) as a boosting strategy to make not only our new sampler, but also the original probit sampler of Albert & Chib (1993), robust to possibly imbalanced data. This strategy is inspired by earlier work on marginal data augmentation for binary and categorical data (van Dyk & Meng 2001, Imai & van Dyk 2005, McCulloch et al. 2000). Starting from a latent variable representation of the binary model, we move to an unidentified version of the latent variable representation with the help of two unidentified parameters. One parameter is a global scale parameter for the latent variable which usually takes the value one for the identified model. Introducing such an unidentified parameter has been shown to improve mixing considerably, among others by Imai & van Dyk (2005). However, this strategy does not resolve the situation of highly imbalanced data. To this aim, we introduce in addition an unknown threshold parameter which takes the value zero for the identified model. As will be shown, this parameter restores balance in the unidentified model and in this way improves mixing considerably for unbalanced data, both for the sampler of Albert & Chib (1993) as well as for our new Pólya-Gamma samplers. As iMDA only works in the context of a latent variable representation, this strategy cannot be applied to the original Pólya-Gamma sampler of Polson et al. (2013) due to the lack of such a representation.

The new Pólya-Gamma representation of the logit model is very generic and is easily combined with iMDA, not only for binary regression models, but also for more flexible models such as binary state space models, as the resulting full conditional of the state process is Gaussian, which allows multi-move sampling of the entire state process (Frühwirth-Schnatter 1994, Carter & Kohn 1994). We name any sampling strategy that combines a Pólya-Gamma mixture representation with iMDA an ultimate Pólya-Gamma sampler due to its efficiency.

A further contribution of the present paper is to show that such an ultimate Pólya-Gamma sampler can be derived for more general non-Gaussian problems, including multinomial modelling of categorical data and modelling binomial data. For the MNL model, usually a logit model based on a partial dRUM representation is applied to sample the category specific parameters, see e.g. Holmes & Held (2006), Frühwirth-Schnatter & Frühwirth (2010), Polson et al. (2013). In the present paper, we derive a new latent variable representation of the MNL model involving a partial dRUM representation with two latent variables instead of a single one. One latent variable equation is linear in the unknown parameters and involves a logistic error distribution. As for binary logit models, we use the Pólya-Gamma mixture representation of the logistic distribution and introduce the mixing variables as additional latent variables. For binomial models, a latent variable representation has been introduced earlier by Fussl et al. (2013), but this representation does not provide an explicit choice equation. However, since an explicit choice equation is needed to apply iMDA, we derive a new latent variable representation for binomial data involving two latent variables. Both latent variable equations are linear in the unknown parameters and involve a generalized logistic error distribution. Extending the strategy introduced for binary logit models, we use Pólya-Gamma mixture representations of the generalized logistic distribution and introduce the Pólya-Gamma mixing variables as additional latent variables. Both for MNL models as well as for binomial models, this data augmentation scheme leads to a conditionally Gaussian posterior and allows us to sample all unknowns through highly efficient block moves. Again, we apply iMDA to derive an ultimate Pólya-Gamma sampler which is shown to mix well also for highly unbalanced data.

The rest of the paper is structured as follows. Section 2 introduces the ultimate Pólya-Gamma sampler for binary data. This sampling strategy is extended to multinomial data in Section 3 and to binomial data in Section 4. For all three types of models, the ultimate Pólya-Gamma sampler is compared to alternative algorithms for simulated data in Section 5. Section 6 considers binary state space models. Section 7 concludes.

2 Ultimate Pólya-Gamma samplers for binary data

2.1 Latent variable representations for binary data

Binary models for a set of N binary data y = (y1, . . . , yN) are defined by

Pr(yi = 1 | λi) = Fε(log λi),    (1)

where λi depends on exogenous variables and unknown parameters β, e.g. log λi = xiβ in a standard binary regression model. Choosing the cdf Fε(ε) = Φ(ε) of the standard normal distribution leads to the probit model Pr(yi = 1 | λi) = Φ(log λi), whereas the cdf Fε(ε) = e^ε/(1 + e^ε) of the logistic distribution leads to the logit model

Pr(yi = 1 | λi) = λi/(1 + λi).

A latent variable representation of model (1) involving the latent utility zi is given by:

yi = I{zi > 0},   zi = log λi + εi,   εi ∼ fε(εi),    (2)

where fε(ε) = Fε′(ε) = φ(ε) is equal to the standard normal pdf for a probit model and equal to fε(ε) = e^ε/(1 + e^ε)^2 for a logit model. While MCMC estimation based on (2) is straightforward for the probit model using one level of data augmentation involving the latent utilities (z1, . . . , zN) (Albert & Chib 1993), for the logit model a second level of data augmentation is required, based on a mixture representation of the logistic distribution. In the present paper, we apply a new mixture representation of the logistic distribution,

fε(ε) = e^ε/(1 + e^ε)^2 = (1/4) ∫ e^{−ω ε²/2} p(ω) dω,    (3)

where ω ∼ PG(2, 0) follows a Pólya-Gamma distribution (Polson et al. 2013) with parameters b = 2 and κ = 0, see Appendix A.1.1 and A.1.2 for details. This representation is very convenient, as the conditional posterior of ωi given εi is equal to a tilted Pólya-Gamma distribution, i.e. ωi | εi ∼ PG(2, |εi|), which is easy to sample from, see Polson et al. (2013).
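To make the mixture identity (3) concrete, the following minimal sketch checks it numerically by Monte Carlo. It is not part of the paper's implementation: the helper sample_pg0 only approximates PG(b, 0) by truncating the infinite-sum representation of Polson et al. (2013), so the two columns agree only up to Monte Carlo and truncation error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pg0(b, size, trunc=200):
    """Approximate draws from PG(b, 0) by truncating the sum
    omega = (1/(2 pi^2)) * sum_k g_k / (k - 1/2)^2 with g_k ~ Gamma(b, 1)."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(b, 1.0, size=(size, trunc))
    return (g / (k - 0.5) ** 2).sum(axis=1) / (2.0 * np.pi ** 2)

omega = sample_pg0(b=2, size=20_000)                       # omega ~ PG(2, 0)
for eps in (-2.0, 0.0, 1.5):
    lhs = np.exp(eps) / (1.0 + np.exp(eps)) ** 2            # logistic density
    rhs = 0.25 * np.mean(np.exp(-omega * eps ** 2 / 2.0))   # mixture (3)
    print(f"eps={eps:+.1f}  logistic={lhs:.4f}  PG mixture={rhs:.4f}")
```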

Hence, whenever log λi is linear in the unknown parameters β, such as log λi = xiβ for a binary regression model, this new representation allows us to construct a Pólya-Gamma sampler that (a) samples the unknown parameters β conditional on the latent variables (z1, . . . , zN) and (ω1, . . . , ωN) from a conditionally Gaussian model, and (b) samples zi, ωi | λi, yi independently for each i by sampling the latent variable zi from zi | λi, yi in the original model (2) in the usual manner (see Appendix A.3.1) and sampling the parameter ωi conditional on zi from ωi | zi, λi ∼ PG(2, |zi − log λi|).
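As an illustration of step (b), here is a minimal sketch of drawing (zi, ωi) for a logit model. It assumes the inverse-cdf formula for the truncated logistic that also appears later in Algorithm 1, step (a-1), and it reuses a truncated-series Pólya-Gamma draw; in practice a dedicated Pólya-Gamma sampler would be used instead.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pg(b, c, trunc=200):
    """Approximate draw from the tilted PG(b, c) via its truncated sum
    representation (Polson et al. 2013)."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(b, 1.0, size=trunc)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2))).sum() / (2.0 * np.pi ** 2)

def draw_z_omega(log_lam, y):
    """One draw of (z_i, omega_i) | lambda_i, y_i for the logit model."""
    pi = 1.0 / (1.0 + np.exp(-log_lam))            # success probability
    u = rng.uniform()
    p = y + u * (1.0 - y - pi)                     # uniform on the admissible cdf range
    z = log_lam + np.log(p) - np.log1p(-p)         # truncated logistic via inverse cdf
    omega = sample_pg(2, abs(z - log_lam))         # tilted Polya-Gamma draw
    return z, omega

print(draw_z_omega(log_lam=-0.5, y=1))
```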

2.2 MCMC estimation

While the sampler outlined above is easy to implement, such data augmentation schemes are known to be slowly mixing for imbalanced data sets where only a few cases with yi = 1 or yi = 0 among the N data points are observed, see e.g. Johndrow et al. (2019). To deal with this issue, we combine our new Pólya-Gamma mixture representation with marginal data augmentation (van Dyk & Meng 2001, Imai & van Dyk 2005, McCulloch et al. 2000).

Given a global scale parameter δ and a threshold parameter γ, we transform the latent variable zi such that

z̃i = √δ zi + γ,    (4)

to define the following expanded model:

yi = I{z̃i > γ},   z̃i = log λ̃i + √δ εi.    (5)

Transformation (4) implies that log λ̃i = √δ log λi + γ. For a binary regression model with xi,d ≡ 1, e.g., log λ̃i = xiβ̃, where β̃j = √δ βj + I{j = d}γ. The global scale parameter δ and the threshold parameter γ, however, are not identified, since the marginal version of the expanded model (5), where the latent variables (z̃1, . . . , z̃N) are integrated out, is independent both of γ and δ:

Pr(yi = 1 | λ̃i, γ, δ) = Pr(z̃i > γ | λ̃i, γ, δ) = Fε((log λ̃i − γ)/√δ) = Pr(yi = 1 | λi).    (6)

Various MCMC schemes based on the expanded model (5) are possible. We consider here marginal data augmentation where the following prior p(γ, δ) = p(γ|δ)p(δ) is defined on the parameters γ and δ with fixed hyperparameters d0, D0 and G0:

γ | δ ∼ N(0, δG0),   δ ∼ G^{−1}(d0, D0).    (7)

Note that the joint prior p(γ, δ) can also be written as p(δ|γ)p(γ), where

γ ∼ t_{2d0}(0, G0*),   G0* = G0 D0/d0,   δ | γ ∼ G^{−1}(d0 + 1/2, D0 + γ²/(2G0)).    (8)

The prior for the model parameters is defined by specifying a prior for the parameter β̃ in the expanded model (5), which is allowed to depend on the global scale parameter δ.

Let z̃ summarize all latent variables in the model, i.e. z̃ = (z̃1, . . . , z̃N) for a probit model and z̃ = (z̃1, . . . , z̃N, ω1, . . . , ωN) for a logit model. Marginal data augmentation is based on (a) sampling the latent variables z̃ from p(z̃|β, y) and (b) sampling the unknown parameters β from p(β|z̃, y) without conditioning on δ and γ.

Step (a) is carried out by sampling from the augmented posterior p(z̃, δ, γ|β, y) = p(z̃|δ, γ, β, y)p(δ, γ|β, y) by first drawing (δ, γ) from the conditional prior p(δ, γ|β),[1] then drawing z̃ conditional on (δ, γ), and finally discarding (δ, γ). To sample from z̃|δ, γ, we sample zi for each i from p(zi|λi, yi) in the identified model (2), sample ωi|zi, λi ∼ PG(2, |zi − log λi|) and define z̃i according to transformation (4). In each sweep of the sampler, we are dealing with a different expanded model. This allows much larger moves than in conventional sampling where δ ≡ 1 and γ ≡ 0 for all MCMC draws.

Step (b) is carried out by joint sampling from p(β, δ, γ|z̃, y) and dropping δ and γ. To this aim, we sample (β̃, δnew, γnew) ∼ p(β̃, δnew, γnew|z̃, y) in the expanded model (5) and match back to (2) through the inverse of (4), log λi = (log λ̃i − γnew)/√δnew. Using representation (8) of p(δ, γ), the joint posterior p(β̃, δ, γ|z̃, y) factorizes for any binary model as p(β̃, δ, γ|z̃, y) ∝ p(y|γ, z̃)p(γ)p(δ|γ)p(z̃|β̃, δ)p(β̃|δ). Thus, we first sample γ from the marginal posterior p(γ|y, z̃) ∝ p(y|γ, z̃)p(γ), which is equal to the Student-t prior given

[1] Note that p(δ, γ|β, y) ∝ p(y|δ, γ, β)p(δ, γ|β) ∝ p(δ, γ|β) is independent of y, see again (6), but may depend on the model parameters β.

Algorithm 1 The ultimate Pólya-Gamma sampler for binary data.

Choose starting values for λ = (λ1, . . . , λN) and repeat the following steps:

(a-1) For each i = 1, . . . , N, sample zi = log λi + Fε^{−1}(yi + Ui(1 − yi − πi)) in model (2), where Ui ∼ U[0, 1], πi = Fε(log λi), and Fε^{−1}(p) = Φ^{−1}(p) for the probit and Fε^{−1}(p) = log p − log(1 − p) for the logit model. For a logit model, sample ωi|zi, log λi ∼ PG(2, |zi − log λi|).

(a-2) Sample δ* ∼ G^{−1}(d0, D0) and γ* ∼ γ|δ*, λ. For a binary regression model, γ|δ*, βd is given in (13). Move to the unidentified model by defining z̃i = √δ* zi + γ* for i = 1, . . . , N.

(b-1) Sample γnew from the truncated t-prior p(γ) using (10). Sample δnew ∼ δnew|γnew, z̃. For a binary regression model, δnew|γnew, z̃ ∼ G^{−1}(dN, DN(γnew)) with dN and DN(γnew) given by (15).

(b-2) Sample β̃ from β̃|z̃, δnew. For a binary regression model, β̃|z̃, δnew ∼ Nd(bN, δnew BN), where bN and BN are given by (16). Move back to the identified model using log λi = (log λ̃i − γnew)/√δnew, e.g. transformation (11) for a binary regression model, and define zi^new = (z̃i − γnew)/√δnew.

in (8), however, truncated to a region depending on the latent variables z̃i:

p(γ|y, z̃) ∝ p(γ) ∏_{i: yi=0} I{γ > z̃i} ∏_{i: yi=1} I{γ < z̃i} ∝ p(γ) I{L(z̃) < γ < O(z̃)},    (9)

where L(z̃) = max_{i: yi=0} z̃i and O(z̃) = min_{i: yi=1} z̃i, if 0 < #{i : yi = 1} < N. If #{i : yi = 0} = 0, then L(z̃) = −∞; if #{i : yi = 1} = 0, then O(z̃) = +∞. Hence, γ is sampled as:

γ = √G0* T_{2d0}^{−1}( U × T_{2d0}(O(z̃)/√G0*) + (1 − U) × T_{2d0}(L(z̃)/√G0*) ),    (10)

where U ∼ U[0, 1] and T_ν(·) and T_ν^{−1}(·) are the cdf and the quantile function of a t_ν-distribution. Conditional on γ, we sample (β̃, δ) from p(β̃, δ|γ, z̃) = p(β̃|z̃, δ)p(δ|z̃, γ).

The specifics depend on the chosen model for λi. We consider now in detail a binary regression model, where log λi = xiβ and xi,d ≡ 1 corresponds to the intercept. In the expanded model (5), the linear predictor is log λ̃i = xiβ̃, where β̃j = √δ βj + I{j = d}γ.

We specify the prior β̃|δ ∼ Nd(0, δA0) in the expanded model. Using the inverse of (4),

βj = β̃j/√δ,   j = 1, . . . , d − 1,   βd = (β̃d − γ)/√δ,    (11)

the conditional prior p(β|γ, δ) can be derived:[2]

β|γ, δ ∼ Nd(a0, A0),   a0 = (0_{(d−1)×1}, −γ/√δ)'.    (12)

From (12) we obtain that p(δ, γ|β) ∝ p(βd|δ, γ)p(γ|δ)p(δ) is proportional to p(γ|δ, βd)p(δ), where p(δ) is given by (7) and p(γ|δ, βd) is equal to the following normal distribution:

γ|δ, βd ∼ N(√δ g1, δG1),   G1 = G0 A0,dd/(G0 + A0,dd),   g1 = −βd × G0/(G0 + A0,dd).    (13)

Hence, to sample from the conditional prior p(δ, γ|β), δ is sampled from G^{−1}(d0, D0) and γ|δ, βd is sampled from N(√δ g1, δG1).
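A minimal sketch of this prior draw, i.e. step (a-2) of Algorithm 1 for a binary regression model, following (7) and (13); the hyperparameter values in the example call are placeholders, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_prior_delta_gamma(beta_d, d0, D0, G0, A0_dd):
    """Draw (delta*, gamma*) from the conditional prior p(delta, gamma | beta):
    delta ~ inverted Gamma(d0, D0), gamma | delta, beta_d ~ N(sqrt(delta) g1, delta G1)."""
    delta = 1.0 / rng.gamma(d0, 1.0 / D0)                  # inverted Gamma draw, cf. (7)
    G1 = G0 * A0_dd / (G0 + A0_dd)                         # cf. (13)
    g1 = -beta_d * G0 / (G0 + A0_dd)
    gamma = np.sqrt(delta) * g1 + np.sqrt(delta * G1) * rng.standard_normal()
    return delta, gamma

delta_s, gamma_s = draw_prior_delta_gamma(beta_d=-3.0, d0=2.5, D0=1.5, G0=100.0, A0_dd=4.0)
# the boosted utilities are then z_tilde_i = sqrt(delta_s) * z_i + gamma_s, cf. (4)
```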

Next, for each i the latent variables (zi, ωi) are sampled from the identified model (2) and transformation (4) yields the latent variables (z̃i, ωi) in the expanded model

yi = I{z̃i > γ},   z̃i = xiβ̃ + √δ εi.    (14)

Discarding γ and δ completes Step (a).

Step (b) is carried out in the expanded model (14) conditional on all (z̃i, ωi)'s. A new value of γ is sampled from the marginal posterior (10). The conditionally conjugate prior for β̃|δ allows joint sampling of β̃, δ|z̃, γ. The marginal posterior δ|z̃, γ ∼ G^{−1}(dN, DN(γ)) is inverted Gamma with the following moments:

dN = d0 + 1/2 + N/2,    (15)
DN(γ) = D0 + γ²/(2G0) + (1/2) Σ_{i=1}^N ωi (z̃i − xi bN)² + (1/2) bN' A0^{−1} bN,

where

bN = BN (Σ_{i=1}^N ωi xi' z̃i),   BN = (A0^{−1} + Σ_{i=1}^N ωi xi' xi)^{−1},    (16)

while the conditional posterior β̃|δ, z̃, γ ∼ Nd(bN, δBN) is multivariate normal. The posterior draw of β̃ is transformed back to β using (11).

[2] The marginal prior in the identified model reads β ∼ Nd(0, A0 + G0 ed ed'), where ed = (0_{1×(d−1)}, 1)'.

The ultimate Pólya-Gamma (UPG) sampler for binary data is summarized in Algorithm 1.[3] Note that for the probit model the marginal data augmentation scheme of Imai & van Dyk (2005) would result if γ ≡ 0 was fixed to zero. Our data augmentation scheme is much more general than that. First of all, it works for any binary data model with a latent variable representation. Second, freeing the location of the threshold γ in model (5) restores balance, with γ being considerably larger than zero if only a few cases with yi = 1 are present among the N binary observations, as will be illustrated in Section 2.3. The imbalanced marginal data augmentation (iMDA) scheme introduced in this section leads to an MCMC scheme that is well mixing, even in cases of extremely unbalanced data, see Section 5 for further illustration.
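A minimal sketch of step (b) for the binary logit regression case, following (10), (15), (16) and (11). SciPy's Student-t cdf and quantile functions stand in for T_ν and T_ν^{−1}; the inputs (design matrix X with the intercept as last column, boosted utilities ztilde, Pólya-Gamma draws omega, hyperparameters) are assumed to come from steps (a-1) and (a-2).

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(3)

def step_b(y, X, ztilde, omega, d0, D0, G0, A0):
    """Step (b) of Algorithm 1: draw gamma, delta and beta_tilde in the
    expanded model and map back to the identified parameters beta."""
    N, d = X.shape
    G0_star = G0 * D0 / d0
    # (b-1) gamma from the truncated Student-t prior, cf. (9)-(10)
    lower = ztilde[y == 0].max() if np.any(y == 0) else -np.inf
    upper = ztilde[y == 1].min() if np.any(y == 1) else np.inf
    nu = 2 * d0
    u = rng.uniform()
    p = u * student_t.cdf(upper / np.sqrt(G0_star), nu) \
        + (1 - u) * student_t.cdf(lower / np.sqrt(G0_star), nu)
    gamma = np.sqrt(G0_star) * student_t.ppf(p, nu)
    # posterior moments (16)
    B_N = np.linalg.inv(np.linalg.inv(A0) + X.T @ (omega[:, None] * X))
    b_N = B_N @ (X.T @ (omega * ztilde))
    # (b-1) delta | gamma, ztilde ~ inverted Gamma with moments (15)
    d_N = d0 + 0.5 + N / 2
    resid = ztilde - X @ b_N
    D_N = D0 + gamma ** 2 / (2 * G0) + 0.5 * np.sum(omega * resid ** 2) \
          + 0.5 * b_N @ np.linalg.solve(A0, b_N)
    delta = 1.0 / rng.gamma(d_N, 1.0 / D_N)
    # (b-2) beta_tilde | ztilde, delta ~ N(b_N, delta * B_N), then invert (4)/(11)
    beta_tilde = rng.multivariate_normal(b_N, delta * B_N)
    beta = beta_tilde / np.sqrt(delta)
    beta[-1] = (beta_tilde[-1] - gamma) / np.sqrt(delta)   # intercept assumed last
    return beta, gamma, delta
```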

2.3 Illustration & intuition

As a first illustration of the potential merits of the UPG approach, we compare estimation efficiency of the popular Pólya-Gamma sampler from Polson et al. (2013) with UPG in an imbalanced data setting in Figure 1. It is clearly visible that, while both samplers simulate from the same posterior distribution, the UPG approach outperforms the conventional Pólya-Gamma sampler considerably in terms of efficiency. Note that this efficiency gain is realized in spite of an additional layer of data augmentation to sample the latent utilities zi.

[3] All algorithms outlined in this work will soon be made available in the R package UPG.


Figure 1: MCMC draws and autocorrelation function of βd for the UPG logit sampler (top) and the original Pólya-Gamma logit sampler (bottom); 2 out of N = 10,000 observations are non-zero.

While this second layer of data augmentation allows for straightforward model extensions due to a conditionally Gaussian likelihood representation, it usually increases autocorrelation in the posterior draws significantly. This is counteracted by our novel MCMC boosting strategy based on the working parameters γ and δ.

The interplay of these two working parameters is illustrated for the case of a logistic regression model in Figure 2. In case of imbalanced data where only a few data points are a success, most of the latent utilities zi will be negative. On the contrary, zi will mostly be greater than zero if the majority of data points are successes. In the balanced case, zi is concentrated around zero. The role of γ is to shift the mean of the latent utilities towards zero in order to restore balance in the unidentified model. At the same time, zi is rescaled using δ. Ultimately, this results in an estimation setup where the boosted utilities z̃i = √δ zi + γ are re-centered and re-scaled such that the resulting distribution exhibits fat tails and a zero mean. This leads to an improvement in the signal-to-noise ratio in the boosted model when compared to the unboosted model, effectively allowing for more efficient MCMC estimation.

Figure 2: Posterior densities for latent utilities in the unboosted model (first column), γ (second column), δ (third column) and boosted utilities (fourth column). The rows correspond to different data sets with N = 1,000 and success probabilities of 1% (first row), 50% (second row) and 99% (third row). The figure demonstrates how the auxiliary parameters γ and δ transform the latent utilities zi to pseudo-balanced, boosted utilities z̃i, allowing for efficient estimation.

In addition to this intuitive explanation of our boosting strategy, the role of δ and γ can be explored in a more technical context as well. Variants of the working parameter δ have been discussed for instance in Imai & van Dyk (2005). Put briefly, δ is used to ensure that the posterior of the boosted utilities z̃i becomes more diffuse than the posterior of the unboosted utilities zi.

Through this diffusion, autocorrelation in the posterior draws will decrease as z̃i becomes less dependent on the current values of the remaining model parameters (e.g. the regression coefficients β). This allows the sampler to move through the parameter space more quickly. A formal proof that this strategy improves convergence of data augmentation algorithms can be found in Meng & Van Dyk (1999). Theoretical results on MCMC sampling efficiency from the perspective of computational complexity are provided in Johndrow et al. (2019). One key insight is that data augmentation strategies in imbalanced categorical data sets result in a sampling behavior where the average step size of the sampler is too small to explore the whole posterior distribution efficiently. In our boosting strategy, this issue is effectively offset by restoring pseudo-balance through the working parameter γ, allowing the UPG sampler to efficiently explore the entire posterior distribution.

3 Ultimate Pólya-Gamma samplers for multinomial data

Let {yi} be a sequence of categorical data, i = 1,...,N, where yi is equal to one of at least three unordered categories. The categories are labeled by L = {0, . . . , m}, and for any k the set of all categories but k is denoted by L−k = L \{k}. We assume that the observations are mutually independent and that for each k ∈ L the probability of yi taking the value k depends on covariates xi in the following way:

Pr(yi = k | β0, . . . , βm) = πki(β0, . . . , βm) = exp(xiβk) / Σ_{l=0}^m exp(xiβl),    (17)

where β0, . . . , βm are category specific unknown parameters of dimension d. To make the model identifiable, the parameter βk0 of a baseline category k0 is set equal to 0: βk0 = 0. Thus the parameter βk is interpreted in terms of the change in log-odds relative to the baseline category k0. In the following, we assume without loss of generality that k0 = 0. A more general version of the multinomial logit (MNL) model (again with baseline k0 = 0) reads:

Pr(yi = k | β) = λki / (1 + Σ_{l=1}^m λli),    (18)

where λ1i, . . . , λmi depend on unknown parameters β. For the standard MNL regression model (17), for instance, log λki = xiβk for k = 1, . . . , m. In this section, we consider a new random utility representation of a MNL model with at least 3 categories for the general formulation (18). The starting point is writing the MNL model as a random utility model (RUM), see McFadden (1974):

uki = log λki + εki,   k = 0, . . . , m,    (19)

yi = k  ⇔  uki = max_{l∈L} uli.    (20)

Thus the observed category is equal to the category with maximal utility. If the errors ε0i, . . . , εmi in (19) are i.i.d. random variables from an extreme value (EV) distribution, then the MNL model (18) results as the marginal distribution of the categorical variable yi.

To sample the category specific parameters in log λki conditional on the remaining parameters, it is common practice to reduce the RUM (20) to a binary logit model with observations dki = I{yi = k} (Frühwirth-Schnatter & Frühwirth 2010). However, the UPG sampler introduced in Section 2 cannot be applied, since the parameters for all categories in L_{−k} appear as an offset in the latent variable representation.

Our new data augmentation scheme resolves this problem by reducing the data to three categories: category k, the baseline category, and a category which collapses the remaining categories in A = {l ∈ L : l ∉ {k, 0}}. For all categories in A, we define an aggregated utility ua,i = max_{l∈A} uli as the maximal utility in A. It can be shown that

ua,i = log λa,i + εa,i,   λa,i = Σ_{l∈A} λli,

where εa,i follows an EV distribution. The proof is straightforward, since exp(−uli) ∼ E(λli) and therefore exp(−ua,i) = min_{l∈A} exp(−uli) ∼ E(λa,i). Evidently, for m = 1, A is the empty set as there are no alternative categories and we obtain the binary logit model. For each specific category k = 1, . . . , m, the full RUM (19) can be represented as the following aggregated RUM, based on the latent utilities (uki, u0i, ua,i):

uki = log λki + εki,    (21)
u0i = ε0i,
ua,i = log λa,i + εa,i,    (22)

yi = k, if uki ≥ max(u0i, ua,i);   yi = 0, if u0i ≥ max(uki, ua,i);   yi = a ∈ A, if ua,i ≥ max(uki, u0i),    (23)

where the errors εki, ε0i, εa,i are iid from an EV distribution.

To derive an efficient sampler to estimate the category specific parameters in log λki conditional on ua,i, it is useful to rewrite (21) in the following way:

uki − u0i = log λki + εki,    (24)

where εki ∼ LO follows a logistic distribution, independent of εa,i, and the choice equation is the same as in (23). We term this model the aggregated RUM representation.

Conditional on having observed yi, the posterior distribution p(ui|λki, yi) of the latent variables ui = (uki, u0i, ua,i) is of closed form and easy to sample from, see Theorem 1 which is proven in Appendix A.2.

Theorem 1 Draws from the posterior distribution p(ui|λki, yi) can be represented as:

(e^{−uki}, e^{−u0i}, e^{−ua,i})' = − log(Ui)/(1 + Σ_{l=1}^m λli) · (1, 1, 1)' − ( log(Vki) I{yi ≠ k}/λki,  log(V0i) I{yi ≠ 0},  log(Va,i) I{yi ∉ A}/λa,i )',    (25)

where Ui, Vki, V0i, Va,i, i = 1, . . . , N, are iid uniform random numbers.
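A minimal sketch of sampling (uki, u0i, ua,i) for one observation via Theorem 1. The inputs lam_k, lam_a and lam_sum (= Σ_{l=1}^m λli) are assumed to be available from the current parameter values; in_A flags whether the observed category lies in the collapsed set A.

```python
import numpy as np

rng = np.random.default_rng(4)

def draw_aggregated_utilities(y, k, lam_k, lam_a, lam_sum, in_A):
    """Draw (u_k, u_0, u_a) | lambda, y for one observation, cf. (25).
    y: observed category; k: category currently updated; in_A: y in A (bool)."""
    common = -np.log(rng.uniform()) / (1.0 + lam_sum)
    exp_neg_uk = common - np.log(rng.uniform()) * (y != k) / lam_k
    exp_neg_u0 = common - np.log(rng.uniform()) * (y != 0)
    exp_neg_ua = common - np.log(rng.uniform()) * (not in_A) / lam_a
    return -np.log(exp_neg_uk), -np.log(exp_neg_u0), -np.log(exp_neg_ua)

print(draw_aggregated_utilities(y=2, k=1, lam_k=0.5, lam_a=0.8, lam_sum=1.3, in_A=True))
```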

For the standard MNL regression model, the parameters βk corresponding to category k appear only in a linear regression model with logistic errors, given by (24). The term log λa,i in (22) depends only on the regression parameters β−k of all other categories in A and, as opposed to the original partial dRUM, log λa,i does not appear as an offset in the latent equation (21), from which we estimate the category specific parameters βk. Rather, the information of all other categories is sufficiently summarized in the latent variable ua,i, which affects only the choice equation. Conditional on the aggregated utility ua,i, the logit model (24) is independent of β−k and we can proceed similarly as in Section 2 to derive an ultimate Pólya-Gamma sampler by exploiting the Pólya-Gamma mixture representation of the logistic distribution in (24) with latent variables ωki, i = 1, . . . , N. To protect our sampler against imbalanced data, we apply iMDA with category specific parameters γk and δk, using the transformations

ũki = √δk uki + γk,   ũ0i = √δk u0i,   ũa,i = √δk ua,i.    (26)

Transformation (26) implies that log λ̃ki = √δk log λki + γk. For a MNL regression model, e.g., it follows that log λ̃ki = xiβ̃k, where β̃k,j = √δk βk,j + I{j = d}γk.

Transformation (26) leads to the following unidentified model:

ũki − ũ0i = log λ̃ki + √δk εki,   εki ∼ LO,    (27)

yi = k, if ũki ≥ ũa,i + γk and ũki ≥ ũ0i + γk;
yi = 0, if ũ0i ≥ ũki − γk and ũ0i ≥ ũa,i;    (28)
yi = a ∈ A, if ũa,i ≥ ũki − γk and ũa,i ≥ ũ0i.

The priors for the parameters δk and γk are chosen as for the binary model:

γk | δk ∼ N(0, δkG0),   δk ∼ G^{−1}(d0, D0).

With G0* = G0 D0/d0, the joint prior p(δk, γk) of γk and δk factors as:

γk ∼ t_{2d0}(0, G0*),   δk | γk ∼ G^{−1}(d0 + 1/2, D0 + γk²/(2G0)).    (29)

Comparable to the binary case, we define a prior for β̃k in the boosted model (27). For a MNL regression model, e.g., we assume the prior β̃k|δk ∼ Nd(0, δkA0). It is possible to derive explicitly the conditional prior βk|γk, δk ∼ Nd(−γk/√δk · ed, A0) and the marginal prior βk ∼ Nd(0, A0 + G0 ed ed') in, respectively, model (27) and (24). Based on (27), Algorithm 1 can be extended in a fairly straightforward manner. The ultimate Pólya-Gamma sampler for multinomial data is summarized in Algorithm 2, with latent variables z̃k = (ũ1, . . . , ũN, ωk1, . . . , ωkN), where ũi = (ũki, ũ0i, ũa,i). The choice equation (28) implies the following constraint for γk, conditional on z̃:

L(z̃) < γk ≤ O(z̃),   O(z̃) = min_{i: yi=k} (ũki − max(ũ0i, ũa,i)),    (31)
L(z̃) = max( max_{i: yi=0} (ũki − ũ0i),  max_{i: yi∈A} (ũki − ũa,i) ).

If #{yi = k} = 0, then O(z̃) = +∞; if #{yi = k} = N, then L(z̃) = −∞. Hence, the marginal posterior p(γk|y, z̃k) ∝ p(y|γk, z̃k)p(γk) is equal to the Student-t prior given in

Algorithm 2 The ultimate Pólya-Gamma sampler for multinomial data.

Choose starting values for λ = (λ1, . . . , λm), λk = (λk1, . . . , λkN), and repeat for k = 1, . . . , m:

(a-1) For each i = 1, . . . , N, sample the latent variables uki = (uki, u0i, ua,i) ∼ p(uki|λki, yi) in the aggregated RUM model (21) of category k using Theorem 1. Sample ωki|λki, uki ∼ PG(2, |uki − u0i − log λki|) in the model (24).

(a-2) Sample δk* ∼ G^{−1}(d0, D0) and γk* ∼ γk|δk*, λk. For a MNL regression model, γk|δk*, βk,d ∼ N(√δk* g1, δk* G1) with g1 = −βk,d G0/(G0 + A0,dd) and G1 = G0 A0,dd/(G0 + A0,dd). Define ũki = √δk* uki + γk*, ũ0i = √δk* u0i, and ũa,i = √δk* ua,i for i = 1, . . . , N, to move to the unidentified model (27).

(b-1) Sample γk^new from the truncated t-prior p(γk) using (32). Sample δk^new ∼ δk^new|γk^new, z̃. For the MNL regression model, δk^new|γk^new, z̃ ∼ G^{−1}(d0 + (N + 1)/2, Dk(γk^new)), where

Dk(γk) = D0 + γk²/(2G0) + (1/2) Σ_{i=1}^N ωki (ũki − ũ0i − xibk)² + (1/2) bk' A0^{−1} bk,
bk = Bk (Σ_{i=1}^N ωki xi' (ũki − ũ0i)),   Bk = (A0^{−1} + Σ_{i=1}^N ωki xi' xi)^{−1}.    (30)

(b-2) Sample from β̃k|z̃k, δk^new. For the MNL regression model, β̃k|z̃k, δk^new ∼ Nd(bk, δk^new Bk), where bk and Bk are given by (30). Move back to the full RUM (20) using log λki = (log λ̃ki − γk^new)/√δk^new and discard z̃. For the MNL regression model, e.g. βkj = (β̃kj − I{j = d}γk^new)/√δk^new, j = 1, . . . , d.

(29), however, truncated to a region depending on z̃:

p(γk|y, z̃k) ∝ p(γk) ∏_{i: yi=k} I{γk < ũki − max(ũ0i, ũa,i)} ∏_{i: yi=0} I{γk > ũki − ũ0i} ∏_{i: yi∈A} I{γk > ũki − ũa,i} ∝ p(γk) I{L(z̃) ≤ γk ≤ O(z̃)}.

Given U ∼ U[0, 1], γk is sampled as:

γk^new = √G0* T_{2d0}^{−1}( U × T_{2d0}(O(z̃)/√G0*) + (1 − U) × T_{2d0}(L(z̃)/√G0*) ).    (32)

Conditional on γk, we sample from δk|z̃k, γk and β̃k|z̃k, δk, see Step (b) in Algorithm 2.
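A minimal sketch of the γk draw in step (b-1), computing the truncation bounds (31) and applying (32). The utility arrays and hyperparameter values are assumed inputs, and SciPy's Student-t functions stand in for T_ν and T_ν^{−1}.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(5)

def draw_gamma_k(y, k, in_A, u_k, u_0, u_a, d0, D0, G0):
    """Draw gamma_k from its truncated Student-t posterior, cf. (31)-(32).
    u_k, u_0, u_a are boosted utilities; in_A flags observations with y in A."""
    is_k = (y == k)
    upper = np.min(u_k[is_k] - np.maximum(u_0[is_k], u_a[is_k])) if is_k.any() else np.inf
    cand = []
    if np.any(y == 0):
        cand.append(np.max(u_k[y == 0] - u_0[y == 0]))
    if np.any(in_A):
        cand.append(np.max(u_k[in_A] - u_a[in_A]))
    lower = max(cand) if cand else -np.inf
    nu, G0_star = 2 * d0, G0 * D0 / d0
    u = rng.uniform()
    p = u * student_t.cdf(upper / np.sqrt(G0_star), nu) \
        + (1 - u) * student_t.cdf(lower / np.sqrt(G0_star), nu)
    return np.sqrt(G0_star) * student_t.ppf(p, nu)
```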

4 Ultimate Pólya-Gamma samplers for binomial data

In this section, we consider models with binomial outcomes, i.e. models of the form

yi ∼ BiNom (Ni, πi) , logit πi = log λi, i = 1,...,N, (33)

with log λi = xiβ for a binomial regression model. As shown by Johndrow et al. (2019), Bayesian inference for binomial regression models based on the Pólya-Gamma sampler (Polson et al. 2013) is sensitive to imbalanced data. A latent variable representation of binomial models introduced by Fussl et al. (2013) is also sensitive to unbalanced data, as shown in Section 5. As for a logit model, which results for Ni ≡ 1, iMDA would be an option to boost MCMC; however, Fussl et al. (2013) provide no explicit choice equation, which is needed for iMDA. The goal of this section is to define an ultimate Pólya-Gamma sampler for binomial outcomes by combining a new latent variable representation of binomial models, based on a Pólya-Gamma representation of the generalized logistic distribution, with iMDA to protect the algorithm against imbalanced data.

4.1 A new latent variable representation for binomial data

We introduce in Theorem 2 a new latent variable representation for binomial outcomes involving latent variables zi such that the latent variable equation is linear in log λi and the errors follow a generalized logistic distribution. An explicit choice equation is provided which relates the latent variables z1, . . . , zN to the observations y1, . . . , yN. We show in Theorem 3 that, conditional on yi, the posterior distribution zi|yi, λi of the latent variables zi is of closed form and easy to sample from. See Appendix A.2 for proofs of Theorems 2 and 3.

Theorem 2 Latent variable representation of a binomial model
For 0 < yi < Ni, the binomial model has the following random utility representation:

wi = log λi + εw,i, εw,i ∼ GLII (k), (34)

vi = log λi + εv,i, εv,i ∼ GLI (Ni − k),

yi = k ⇔ wi > 0, vi < 0,

where GLI (ν) and GLII (ν) are, respectively, the generalized logistic distributions of type I and type II. For yi = 0, the model reduces to

vi = log λi + εv,i, εv,i ∼ GLI (Ni), yi = 0 ⇔ vi < 0.

For yi = Ni, the model reduces to

wi = log λi + εw,i, εw,i ∼ GLII (Ni), yi = Ni ⇔ wi > 0.

For Ni = 1, the logistic model results, as both GLI (ν) and GLII (ν) reduce to a logistic distribution for ν = 1. For yi = 0, zi = vi, whereas for yi = 1, zi = wi, and the choice equation reduces to yi = I{zi > 0}.

Theorem 3 Sampling the utilities in the binomial RUM
Given yi = k and holding all model parameters fixed, the latent variables wi|k > 0 and vi|k < Ni are conditionally independent. The distributions of wi|k > 0 and vi|k < Ni are equal in distribution to

wi = log( (1 + λi)/Wi^{1/yi} − λi ),   yi > 0,    (35)
vi = − log( (1 + λi)/(λi Vi^{1/(Ni−yi)}) − 1/λi ),   yi < Ni,    (36)

where Wi and Vi are iid uniform random numbers.
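A minimal sketch of drawing (wi, vi) via Theorem 3 for one observation; the example call uses placeholder values and returns None for a utility that is not defined (yi = 0 or yi = Ni).

```python
import numpy as np

rng = np.random.default_rng(6)

def draw_binomial_utilities(y, N_i, lam):
    """Draw (w_i, v_i) | lambda_i, y_i for a binomial observation, cf. (35)-(36)."""
    W, V = rng.uniform(size=2)
    w = np.log((1.0 + lam) / W ** (1.0 / y) - lam) if y > 0 else None
    v = -np.log((1.0 + lam) / (lam * V ** (1.0 / (N_i - y))) - 1.0 / lam) if y < N_i else None
    return w, v

print(draw_binomial_utilities(y=3, N_i=10, lam=0.4))
```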

4.2 Ultimate Pólya-Gamma samplers for binomial data

The following steps are the main building blocks for the ultimate Pólya-Gamma sampler for binomial data. We use a Gaussian mixture representation of the generalized logistic distribution based on the Pólya-Gamma distribution and apply iMDA to achieve robustness against imbalanced data. A random variable ε following the generalized logistic distribution of type I or type II can be represented as a mixture of normals using the Pólya-Gamma distribution as mixing measure,

fε(ε) = c(a, b) (e^ε)^a/(1 + e^ε)^b = c(a, b)/2^b ∫_0^∞ exp(ωξ²/2) exp(−ω(ε − ξ)²/2) p(ω) dω,    (37)

where ξ = κ/ω and κ = a − b/2, and ω ∼ PG(b, 0) follows the Pólya-Gamma distribution introduced by Polson et al. (2013), see Appendix A.1.1 to A.1.3.

For yi > 0, the type II generalized logistic distribution εw,i ∼ GLII (yi) in (34) has such a representation with, see (A.11):

κw,i = (1 − yi)/2,   ξw,i = κw,i/ωw,i,   ωw,i ∼ PG(yi + 1, 0).

Similarly, for yi < Ni, the type I generalized logistic distribution εv,i ∼ GLI (Ni − yi) in (34) has such a representation with, see (A.7):

κv,i = (Ni − yi − 1)/2,   ξv,i = κv,i/ωv,i,   ωv,i ∼ PG(Ni − yi + 1, 0).

Note that κw,i = 0 for yi = 1 and κv,i = 0 for yi = Ni − 1. Hence, for Ni = 1, the Pólya-Gamma mixture representation of a logistic model involving PG(2, 0) results. For Ni > 1, κv,i > 0 for 0 ≤ yi ≤ Ni − 2 and κw,i < 0 for 2 ≤ yi ≤ Ni.

For each i = 1, . . . , N, we introduce the latent variables zi = (wi, ωw,i, vi, ωv,i) if 0 < yi < Ni, zi = (wi, ωw,i) if yi = Ni, and zi = (vi, ωv,i) if yi = 0. Conditional on

Algorithm 3 The ultimate Pólya-Gamma sampler for binomial data.

Choose starting values for λ = (λ1, . . . , λN) and repeat the following steps:

(a-1) For each i = 1, . . . , N, sample wi|λi, (yi > 0) and vi|λi, (yi < Ni) using Theorem 3 and sample ωw,i|wi, λi, (yi > 0) and ωv,i|vi, λi, (yi < Ni) using (38).

(a-2) Sample δ* ∼ G^{−1}(d0, D0) and γ* ∼ γ|δ*, λ. For a binomial regression model, γ|δ*, βd is given in (A.28). Define w̃i = √δ* wi + γ*, if yi > 0, and ṽi = √δ* vi + γ*, if yi < Ni, for i = 1, . . . , N, to move to the unidentified model.

(b-1) Sample γnew from the truncated Student-t prior using (42). Sample δnew ∼ p(δ|γnew, z̃, y) ∝ (1/δ)^{dI+1} e^{−DI/δ} e^{BI/√δ} using the resampling described in Appendix A.3.3. For a binomial regression model, dI, DI and BI are given by (A.31).

(b-2) Sample β̃ ∼ p(β̃|z̃, δnew). For a binomial regression model, β̃|z̃, δnew ∼ Nd(bN(δnew), δnew BN), where bN(δ) and BN are given by (A.30). Move back to the identified model using log λi = (log λ̃i − γnew)/√δnew and discard z̃. For a binomial regression model, e.g., use transformation (11).

holding all unknown parameters in log λi fixed, the latent variables wi|λi, (yi > 0) and vi|λi, (yi < Ni) are sampled from Theorem 3 without conditioning on ωw,i and ωv,i. Given the latent variables wi and vi, the parameters ωw,i|wi, (yi > 0), λi and ωv,i|vi, (yi < Ni), λi are independent and follow (tilted) Pólya-Gamma distributions:

ωw,i|wi, yi, λi ∼ PG (yi + 1, |wi − log λi|) , yi > 0, (38)

ωv,i|vi, yi, λi ∼ PG (Ni − yi + 1, |vi − log λi|) , yi < Ni.
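A minimal sketch of the tilted Pólya-Gamma draws (38), given the utilities from Theorem 3; as before, the truncated-series PG draw is an approximation, and a dedicated Pólya-Gamma sampler would be used in practice.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_pg(b, c, trunc=200):
    """Approximate PG(b, c) draw via the truncated sum representation."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(b, 1.0, size=trunc)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2))).sum() / (2.0 * np.pi ** 2)

def draw_binomial_pg(y, N_i, log_lam, w=None, v=None):
    """Tilted Polya-Gamma draws (38) for one binomial observation."""
    omega_w = sample_pg(y + 1, abs(w - log_lam)) if y > 0 else None
    omega_v = sample_pg(N_i - y + 1, abs(v - log_lam)) if y < N_i else None
    return omega_w, omega_v
```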

To protect our sampler against imbalanced data, we apply iMDA and transform the latent variables wi and vi as for a logit model:

w̃i = √δ wi + γ,   ṽi = √δ vi + γ,    (39)

which implies log λ̃i = √δ log λi + γ. Transformation (39) leads to an unidentified, boosted version of model (34) with a choice equation involving γ:

w̃i = log λ̃i + √δ εw,i,   yi > 0,    (40)
ṽi = log λ̃i + √δ εv,i,   yi < Ni,

yi = k  ⇔  ṽi < γ < w̃i, if 0 < k < Ni;   γ > ṽi, if k = 0;   γ < w̃i, if k = Ni.    (41)

The parameters δ and γ have the same prior as in the logit model, see (7). The choice equation (41) implies the following constraint for γ, conditional on z̃ = (z̃1, . . . , z̃N), where z̃i = (w̃i, ωw,i, ṽi, ωv,i):

L(z̃) < γ < O(z̃),   L(z̃) = max_i(ṽi),   O(z̃) = min_i(w̃i).    (42)

However, due to the presence of the (fixed) location parameters ξw,i and ξv,i in the likelihood (37), as opposed to a binary or a MNL model, the posterior p(δ|γ, z̃) does not take a closed form. Only if Ni = 1 for all i, all ξw,i's and ξv,i's are zero and an inverted Gamma posterior for δ results. If log λi is a linear function of the unknown model parameters β, then completing squares yields a conditionally Gaussian distribution for p(β̃|γ, δ, z̃) and a closed form expression for p(δ|γ, z̃) is available which, however, is no longer an inverted Gamma posterior as for the logit model, but a more general distribution taking the form

p(δ|γ, z̃) ∝ (1/δ)^{dI+1} e^{−DI/δ} e^{BI/√δ}.

Details on how to sample from such a distribution are provided in Appendix A.3.3. The ultimate Pólya-Gamma sampler for binomial data is summarized in Algorithm 3.
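The paper's own resampling scheme for this non-standard density is given in Appendix A.3.3 and is not reproduced here. As a purely illustrative alternative under the same functional form, one can evaluate the unnormalized density on a log-spaced grid and sample by inverse cdf; the values of dI, DI and BI in the example call are placeholders.

```python
import numpy as np

rng = np.random.default_rng(8)

def draw_delta_grid(d_I, D_I, B_I, n_grid=2000):
    """Approximate draw from p(delta) ∝ (1/delta)^(d_I+1) exp(-D_I/delta) exp(B_I/sqrt(delta))
    via inverse-cdf sampling on a log-spaced grid (illustrative stand-in only)."""
    grid = np.logspace(-4, 4, n_grid)
    logp = -(d_I + 1) * np.log(grid) - D_I / grid + B_I / np.sqrt(grid)
    w = np.exp(logp - logp.max()) * np.gradient(grid)   # density times bin width
    cdf = np.cumsum(w) / np.sum(w)
    return grid[np.searchsorted(cdf, rng.uniform())]

print(draw_delta_grid(d_I=10.0, D_I=8.0, B_I=2.0))
```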

5 Illustrative applications to simulated data

In this section, we discuss a set of systematic simulations carried out to compare the efficiency of the UPG approach to other popular Bayesian samplers for discrete choice models. The main results are based on simulations with varying intercepts in the true data generating process. Large negative and large positive intercepts correspond to simulation setups resulting in highly imbalanced data.[4] For logit and multinomial logit specifications, we also look at a varying sample size when the number of observed successes is kept constant at one. For the multinomial case, we furthermore investigate the scenario of a growing number of categories/choices. Finally, the binomial logit case allows us to vary the number of trials Ni observed in the data. These additional results as well as results for probit models and a number of numerical results can be found in Appendix A.5. Each Markov chain was run for 10,000 iterations after a burn-in phase of 2,000 iterations. To gain some robustness with respect to the computed inefficiency factors, each simulation is repeated 50 times. Below, the median results from these 50 runs are discussed. To compare the efficiency of different samplers, we provide inefficiency factors based on an estimate of the spectral density of the posterior chain evaluated at zero.[5]

For binary logit models, we compare the UPG sampling scheme, the Pólya-Gamma sampler of Polson et al. (2013) (P&S), the adaptive MH step proposed in Johndrow et al. (2019) (AMH) and the auxiliary mixture data augmentation scheme outlined in Frühwirth-Schnatter & Frühwirth (2010) (FS&F). To assess sampling efficiency for the multinomial logit model (MNL), we compare the MNL implementation from the technical appendix of Polson et al. (2013) (P&S), a version of the AMH, the partial dRUM sampler from Frühwirth-Schnatter & Frühwirth (2010) (FS&F) as well as the UPG approach outlined in Section 3. Finally, to illustrate the efficiency gains in the case of regression analysis of binomial data, we compare the UPG approach outlined in Section 4 to a sampler based on an AMH algorithm and the auxiliary mixture sampler for binomial logit models introduced in Fussl et al. (2013) (AuxMix).

The results of the main simulation exercise are summarized in Figure 3. The results show that standard data augmentation techniques exhibit extremely inefficient sampling behavior with increasing data sparsity, as pointed out in Johndrow et al. (2019). While this general pattern can be observed for the UPG approach as well, our novel boosting strategy clearly alleviates this issue and allows for rather efficient estimation also in imbalanced data settings.[6]

[4] In the main simulation results, we fix N = 1000 for probit, logit and multinomial logit models. For the multinomial case, we consider three categories. Finally, for binomial models, we fix N = 500 and Ni = 10.

[5] Estimating the spectral density at 0 is accomplished via the R package coda (Plummer et al. 2006) and is based on fitting an autoregressive process to the posterior draws.

Figure 3: Inefficiency factors across 50 simulation runs for varying intercepts βd. (a) Binary logit. (b) Multinomial logit. (c) Binomial logit.
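For reference, a minimal sketch of how an inefficiency factor of a posterior chain can be estimated. The paper uses the spectral-density-at-zero estimator of the R package coda; the stand-in below uses a simple truncated autocorrelation sum, so numbers will differ slightly.

```python
import numpy as np

def inefficiency_factor(draws, max_lag=200):
    """Inefficiency factor 1 + 2 * sum_k rho_k, with the sum truncated at the
    first non-positive autocorrelation (a simple stand-in for coda's
    spectral-density-at-zero estimator)."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    var = np.mean(x ** 2)
    total = 0.0
    for lag in range(1, min(max_lag, len(x) - 1)):
        rho = np.mean(x[:-lag] * x[lag:]) / var
        if rho <= 0:
            break
        total += rho
    return 1.0 + 2.0 * total

# usage: ineff = inefficiency_factor(beta_draws)
```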

A direct comparison between the UPG approach and the original Pólya-Gamma sampler shows that the former clearly outperforms the latter in terms of efficiency in settings with imbalanced data. That is, UPG samplers are the preferable data augmentation sampler for imbalanced data cases. However, in data sets that are very balanced, the original Pólya-Gamma sampler exhibits the highest observed efficiency levels. This is most likely due to the additional layer of latent variables that the UPG approach utilizes compared to P&S. Nevertheless, note that all data augmentation algorithms under consideration even surpass the performance of the efficiency "gold standard" AMH in balanced scenarios. In other words, these settings correspond to data sets where inefficient estimation plays a minor role to begin with.

In summary, UPG as well as Metropolis-Hastings based estimation allow for efficient sampling across a wide range of settings when compared to standard data augmentation algorithms. This can also be seen through the extensive, additional simulations that are summarized in subsection A.5. It is worth mentioning that the UPG approach has some benefits compared to Metropolis-Hastings based samplers. First, no tuning is necessary. Second, one of the major advantages of the UPG approach is that, for all models discussed in this paper, the conditional posterior of the regression effects is multivariate normal. This enables the researcher to readily rely on the large statistical toolbox of modifications and extensions of the linear regression model that has been developed over the past decades. We will proceed to demonstrate this convenient fact by applying the UPG approach to one of the most general modeling frameworks in Section 6, where we discuss binary state space modeling and present a brief real data application.

[6] Interestingly, all samplers exhibit asymmetric inefficiency factors with respect to baseline sparsity in the MNL case. That is, one has to differentiate between cases with an increasingly sparse baseline (positive intercept range) and increasingly sparse non-baseline groups (negative intercept range). A sparse baseline seems to increase inefficiency in all observed cases. This is rather intuitive given that intercepts are the log of the ratio of the probability of a specific category to the probability of the baseline/reference. Generally, the reference category should be a "common" rather than a "rare" category. However, this problem can be avoided in applied statistical modelling as the baseline can be chosen by the researcher, effectively eliminating this issue.

6 Application to a binary state space model

Let {yt} be a time series of binary observations, observed for t = 1,...,T . Each yt is assumed to take one of two possible values, labelled by {0, 1}. The probability that yt takes the value 1 depends on covariates xt, including a constant (i.e. xtd ≡ 1), through time-varying parameters βt in the following way:

Pr(yt = 1 | β1, . . . , βT) = exp(xtβt) / (1 + exp(xtβt)).    (43)

We assume that conditional on knowing β1, . . . , βT, the observations are mutually independent. A commonly used model for describing the time-variation of βt reads:

βt = βt−1 + wt, wt ∼ Nd (0, Q) , (44)

where Q = Diag(θ1, . . . , θd). The expected initial values β = (β1, . . . , βd)' and the variances θ1, . . . , θd are unknown. To complete the model specification, the distribution of the initial value β0 has to be specified. We will discuss this choice in connection with our boosting strategy.

Markov chain Monte Carlo estimation of such logit-type state space models has been considered by various authors, in particular by Shephard & Pitt (1997) and Gamerman (1998). A characteristic feature of several MCMC approaches for binary state space models is that practical implementation requires the use of an MH algorithm for sampling the state

process, which in turn makes it necessary to define a suitable proposal density in a rather high-dimensional parameter space. Single-move sampling for this type of model is known to be potentially very inefficient; see e.g. Shephard & Pitt (1997). Frühwirth-Schnatter & Frühwirth (2007) implemented an auxiliary mixture sampler for a binary regression model with time-varying parameters. We show in this section that Algorithm 1 can be quite easily extended to binary state space models, as Step (a) is generic conditional on knowing log λt = xtβt.

6.1 The UPG sampler for binary state space models

A latent utility zt of choosing category 1 is introduced for each yt:

βt = βt−1 + wt,   wt ∼ Nd(0, Q),
yt = 1 ⇔ zt > 0,   zt = xtβt + εt.    (45)

For a probit link, (45) is a conditionally Gaussian state space model (SSM), whereas for a logit link, εt follows a logistic distribution and (45) is conditionally non-Gaussian. However, using the Pólya-Gamma mixture representation of the logistic distribution as in Section 2 yields a (high-dimensional) Gaussian posterior for the state process β = {β0, β1, . . . , βT}, which allows multi-move sampling of the entire state process using FFBS (Frühwirth-Schnatter 1994, Carter & Kohn 1994) in a similar fashion as for a probit SSM. To achieve robustness against imbalance, we apply iMDA as introduced in Section 2:

z̃t = √δ zt + γ,    (46)

to transform the SSM (45) to the following unidentified SSM:

β̃jt = β̃j,t−1 + wjt,   wjt ∼ N(0, δθj),   j = 1, . . . , d,
yt = 1 ⇔ z̃t > γ,   z̃t = xtβ̃t + √δ εt,    (47)

where transformation (46) implies

β̃j = √δ βj + I{j = d}γ,   β̃jt = √δ βjt + I{j = d}γ,   t = 0, . . . , T,   j = 1, . . . , d.    (48)

Figure 4: Posterior of local level model fitted to global pandemic data. Solid line is the posterior median, dashed lines are 90/10 posterior quantiles.

The ultimate Pólya-Gamma sampler for binary SSMs is summarized in Appendix A.4, Algorithm 4.
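To make the multi-move step concrete, here is a minimal FFBS sketch for the intercept-only (local level) logit case in the identified model: conditional on the Pólya-Gamma draws ωt, the pseudo-observations zt are Gaussian with variance 1/ωt, so the state path can be drawn by a forward Kalman filter and backward sampler. The state variance theta and the initial moments a0, P0 are placeholders; the boosted sampler of Algorithm 4 would additionally apply the rescaling and shifting in (46)-(48).

```python
import numpy as np

rng = np.random.default_rng(9)

def ffbs_local_level(z, omega, theta, a0=0.0, P0=10.0):
    """FFBS for beta_t = beta_{t-1} + w_t, w_t ~ N(0, theta),
    and z_t | beta_t, omega_t ~ N(beta_t, 1/omega_t)."""
    T = len(z)
    m, P = np.empty(T), np.empty(T)
    m_prev, P_prev = a0, P0
    for t in range(T):                       # forward Kalman filter
        m_pred, P_pred = m_prev, P_prev + theta
        K = P_pred / (P_pred + 1.0 / omega[t])
        m[t] = m_pred + K * (z[t] - m_pred)
        P[t] = (1.0 - K) * P_pred
        m_prev, P_prev = m[t], P[t]
    beta = np.empty(T)
    beta[-1] = m[-1] + np.sqrt(P[-1]) * rng.standard_normal()
    for t in range(T - 2, -1, -1):           # backward sampling
        J = P[t] / (P[t] + theta)
        mean = m[t] + J * (beta[t + 1] - m[t])
        var = P[t] * (1.0 - J)
        beta[t] = mean + np.sqrt(var) * rng.standard_normal()
    return beta
```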

6.2 Illustration using pandemic outbreak data

To illustrate the gains in sampling efficiency for binary state space models, we apply the framework outlined in the previous subsections to a dataset on global pandemics. The data covers 220 years from 1800 to 2020. We define a pandemic as a disease episode that can be

characterized by worldwide spread of the disease and a relatively high death toll (> 75,000 deaths). In addition, we focus on diseases that are characterized by relatively short periods of activity, hence excluding global pandemics such as AIDS. This leaves us with a total of eight pandemic events, starting with a bubonic plague outbreak between 1855 and 1860. With more than 12,000,000 deaths in China and India alone, this is considered the third major outbreak of the plague in human history, following prior pandemics in the 6th and 14th century. As the second event in the 19th century, we consider a major influenza outbreak from 1889-1890, also known as "Russian flu" or "Asiatic flu". The beginning of the 20th century was hit especially hard by pandemics, starting with an outbreak of Encephalitis lethargica from 1915-1926. Within the same time period, the well known Spanish flu pandemic took place between 1918 and 1920. Being one of the deadliest recorded pandemics, the Spanish flu ultimately eradicated around 25% of the world's population. After the second world war, we identify four major global pandemic events: still in the 20th century, two influenza A strains known as "Asian flu" (1957-1958) and "Hong Kong flu" (1968-1969) killed around 3,000,000 people. In more recent history, an H1N1 virus ("swine flu") spread globally in 2009 and 2010, leading to up to 575,000 deaths. Finally, the most recent event is the ongoing global outbreak of COVID-19, starting in 2019 and claiming around 1,294,000 victims as of November 13th, 2020.[7]

For years featuring a global pandemic, the binary indicator yt equals 1. In years without major pandemic events, yt = 0. A pandemic is observed in roughly 1 out of 8 years with extremely high autocorrelation, rendering the data set relatively imbalanced. To illustrate the efficiency gains arising from our boosting algorithm, a logistic local level model is fitted to the data twice, once with and once without boosting, enabling us to directly compare these samplers in terms of efficiency. We iterate the Gibbs sampler 500,000 times after a burn-in period of 20,000 iterations. The resulting posterior distribution is shown in Figure 4. The resulting inefficiency factors are provided in Figure 5.

Figure 5: Inefficiency factors for boosted and unboosted logit local level models fitted to global pandemic data.

Clearly, long periods of persistence in the binary outcome time series lead to an aggravated efficiency loss in the sampler. On the contrary, during periods where yt is likely to change (corresponding to a predicted probability around 0.5), sampling efficiency is rather high. Summarizing, the boosted sampling algorithm described in Section 6.1 is able to drastically reduce sampling inefficiency, even during prolonged periods of no change in the outcome variable. Hence, the UPG approach can be considered a useful tool for efficient discrete choice analysis, not only in simple regression models, but also in more complex settings such as state space frameworks.

[7] Sourced from https://en.wikipedia.org/wiki/List_of_epidemics and sources therein.

7 Concluding Remarks

Due to a wide range of applications in many areas of applied science, much attention has been dedicated to the development of estimation algorithms for generalized linear models. In the past decades, various data augmentation algorithms have been brought forward that have steadily increased the accessibility and popularity of Bayesian estimation techniques in the context of discrete choice models.

In this article, we introduce a number of sampling algorithms based on the data augmentation algorithm of Polson et al. (2013). These algorithms may be used to estimate parameters in the most commonly encountered discrete choice models and are suited to model binary, categorical and binomial outcomes. The samplers are easily implemented, intuitively appealing and allow for a conditionally Gaussian posterior of the regression effects in logit, multinomial logit and binomial logit regression type models. While Gibbs sampling algorithms are well-known and require only minimal tuning, it has been recently noted that estimation based on Metropolis-Hastings algorithms exhibits higher estimation efficiency, especially in applications with imbalanced data (Johndrow et al. 2019). To counteract efficiency loss in such settings, we develop a novel MCMC boosting strategy based on marginal data augmentation and apply it to the introduced sampling algorithms. The proposed UPG algorithms combine the appeal of Gibbs samplers with a comparable level of sampling efficiency, even in unbalanced cases. This is demonstrated via extensive simulation studies and a real data application.

References

Albert, J. H. & Chib, S. (1993), ‘Bayesian analysis of binary and polychotomous response data’, Journal of the American Statistical Association 88, 669–679.

Carter, C. K. & Kohn, R. (1994), ‘On Gibbs sampling for state space models’, Biometrika 81, 541–553.

Frühwirth-Schnatter, S. (1994), ‘Data augmentation and dynamic linear models’, Journal of Time Series Analysis 15, 183–202.

Frühwirth-Schnatter, S. & Frühwirth, R. (2007), ‘Auxiliary mixture sampling with applications to logistic models’, Computational Statistics and Data Analysis 51, 3509–3528.

Frühwirth-Schnatter, S. & Frühwirth, R. (2010), Data augmentation and MCMC for binary and multinomial logit models, in T. Kneib & G. Tutz, eds, ‘Statistical Modelling and Regression Structures – Festschrift in Honour of Ludwig Fahrmeir’, Physica-Verlag, Heidelberg, pp. 111–132.

Frühwirth-Schnatter, S., Frühwirth, R., Held, L. & Rue, H. (2009), ‘Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data’, Statistics and Computing 19, 479–492.

Fussl, A., Frühwirth-Schnatter, S. & Frühwirth, R. (2013), ‘Efficient MCMC for binomial logit models’, ACM Transactions on Modeling and Computer Simulation 22, 3:1–3:21. doi: 10.1145/2414416.2414419.

Gamerman, D. (1998), ‘Markov chain Monte Carlo for dynamic generalized linear models’, Biometrika 85, 215–227.

Holmes, C. C. & Held, L. (2006), ‘Bayesian auxiliary variable models for binary and multinomial regression’, Bayesian Analysis 1, 145–168.

Imai, K. & van Dyk, D. A. (2005), ‘A Bayesian analysis of the multinomial probit model using marginal data augmentation’, Journal of Econometrics 124, 311–334.

Johndrow, J. E., Smith, A., Pillai, N. & Dunson, D. B. (2019), ‘MCMC for imbalanced categorical data’, Journal of the American Statistical Association 114, 1394–1403.

McCulloch, R. E., Polson, N. G. & Rossi, P. E. (2000), ‘A Bayesian analysis of the multinomial probit model with fully identified parameters’, Journal of Econometrics 99, 173–193.

McFadden, D. (1974), Conditional logit analysis of qualitative choice behaviour, in P. Zarembka, ed., ‘Frontiers of Econometrics’, Academic, New York, pp. 105–142.

Meng, X.-L. & Van Dyk, D. (1999), ‘Seeking efficient data augmentation schemes via conditional and marginal data augmentation’, Biometrika 86, 301–320.

Plummer, M., Best, N., Cowles, K. & Vines, K. (2006), ‘CODA: Convergence diagnosis and output analysis for MCMC’, R News 6(1), 7–11.

Polson, N. G., Scott, J. G. & Windle, J. (2013), ‘Bayesian inference for logistic models using Pólya-Gamma latent variables’, Journal of the American Statistical Association 108, 1339–1349.

Robert, C. P. & Casella, G. (1999), Monte Carlo Statistical Methods, Springer Series in Statistics, Springer, New York/Berlin/Heidelberg.

Rossi, P. E., Allenby, G. M. & McCulloch, R. (2005), Bayesian Statistics and Marketing, Wiley, Chichester.

Shephard, N. & Pitt, M. K. (1997), ‘Likelihood analysis of non-Gaussian measurement time series’, Biometrika 84, 653–667.

van Dyk, D. & Meng, X.-L. (2001), ‘The art of data augmentation’, Journal of Computational and Graphical Statistics 10, 1–50.

Zellner, A. & Rossi, P. E. (1984), ‘Bayesian analysis of dichotomous quantal response models’, Journal of Econometrics 25, 365–393.

A Appendix

A.1 Mixture representations

A.1.1 The Pólya-Gamma mixture representation

For all latent variable representations derived in this paper for binary, multinomial or categorical data, the error term in the latent equations arises from a distribution ε ∼ F_ε(ε), for which the density f_ε(ε) can be represented as a mixture of normals using the Pólya-Gamma distribution as mixing measure:

f_ε(ε) = c(a, b) (e^ε)^a / (1 + e^ε)^b = (c(a, b)/2^b) ∫ e^{κε} e^{−ωε²/2} p(ω) dω,   (A.1)

where κ = a − b/2 and ω ∼ PG(b, 0) follows the Pólya-Gamma distribution introduced by Polson et al. (2013) with parameter b. Conditional on ω, the likelihood contribution of ε is proportional to that of a N(κ/ω, 1/ω) observation.^8 This new representation is very convenient, as the conditional posterior of ω|ε can be derived from the following (tilted) Pólya-Gamma distribution with the same parameter b:

ω|ε ∼ PG (b, |ε|) . (A.3)

To simulate from a (tilted) Pólya-Gamma distribution PG(q, c), the following convolution property is exploited:

X_1 ∼ PG(q_1, c), X_2 ∼ PG(q_2, c)  ⇒  X_1 + X_2 ∼ PG(q_1 + q_2, c).

8 Based on rewriting

κε − ωε²/2 = −ω(ε − κ/ω)²/2 + d,   (A.2)

where d is a constant not depending on ε.

Hence, to simulate from Y ∼ PG(q, c), use Y = Σ_{j=1}^{q} X_j, where X_1, ..., X_q are q independent random variables from the PG(1, c) distribution.
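For illustration, the following Python sketch mimics this convolution strategy. The function names are ours, and the PG(1, c) draws are only approximated by truncating the infinite convolution-of-gammas representation of Polson et al. (2013); an exact Pólya-Gamma sampler would be used in practice.

```python
import numpy as np

def sample_pg1_approx(c, n_terms=200, rng=None):
    """Approximate draw from PG(1, c) via the (truncated) infinite sum
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(1, 1); truncation at n_terms is an approximation."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=1.0, scale=1.0, size=n_terms)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def sample_pg(q, c, rng=None):
    """Draw from PG(q, c) for integer q via the convolution property:
    a sum of q independent PG(1, c) variables is PG(q, c)."""
    rng = np.random.default_rng() if rng is None else rng
    return sum(sample_pg1_approx(c, rng=rng) for _ in range(q))
```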

A.1.2 The logistic and the type I generalized logistic distribution

For the type I generalized logistic distribution ε ∼ GL_I(ν) with parameter ν > 0, the density reads

f_ε(ε) = ν e^{−ε} / (1 + e^{−ε})^{ν+1} = ν e^{νε} / (1 + e^{ε})^{ν+1}.   (A.4)

GL_I(ν) reduces to the logistic distribution for ν = 1. The c.d.f. of a type I generalized logistic distribution takes a simple form:

F_ε(ε) = 1 / (1 + e^{−ε})^{ν} = e^{νε} / (1 + e^{ε})^{ν}.   (A.5)

Hence, the quantiles are available in closed form:

ε_p = F_ε^{−1}(p) = −log( 1/p^{1/ν} − 1 ).   (A.6)

The type I generalized logistic distribution ε ∼ GL_I(ν) can be represented as a mixture of normals with a Pólya-Gamma distribution serving as mixing measure, where

ω ∼ PG(ν + 1, 0),   κ = (ν − 1)/2,   (A.7)

see Section A.1.1. For the logistic distribution, ω ∼ PG(2, 0) and κ = 0.

A.1.3 The type II generalized logistic distribution

For the type II generalized logistic distribution ε ∼ GL_II(ν) with parameter ν > 0, the density reads

f_ε(ε) = ν e^{−νε} / (1 + e^{−ε})^{ν+1} = ν e^{ε} / (1 + e^{ε})^{ν+1}.   (A.8)

Also GL_II(ν) reduces to the logistic distribution for ν = 1. The c.d.f. of a type II generalized logistic distribution takes a simple form:

F_ε(ε) = 1 − 1 / (1 + e^{ε})^{ν} = 1 − e^{−νε} / (1 + e^{−ε})^{ν}.   (A.9)

Hence, the quantiles are available in closed form:

ε_p = F_ε^{−1}(p) = log( 1/(1 − p)^{1/ν} − 1 ).   (A.10)

The type II generalized logistic distribution ε ∼ GL_II(ν) can be represented as a mixture of normals with a Pólya-Gamma distribution serving as mixing measure, where

ω ∼ PG(ν + 1, 0),   κ = (1 − ν)/2,   (A.11)

see Section A.1.1. Again, for the logistic distribution, ω ∼ PG(2, 0) and κ = 0 results.
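Since the quantile functions (A.6) and (A.10) are available in closed form, both distributions are easily simulated by inversion. The following Python sketch illustrates this; the function names are ours.

```python
import numpy as np

def sample_gl1(nu, size=1, rng=None):
    """Type I generalized logistic GL_I(nu) draws by inverting the cdf (A.5),
    i.e. eps = -log(1 / p**(1/nu) - 1) with p uniform, cf. (A.6)."""
    rng = np.random.default_rng() if rng is None else rng
    p = rng.uniform(size=size)
    return -np.log(p ** (-1.0 / nu) - 1.0)

def sample_gl2(nu, size=1, rng=None):
    """Type II generalized logistic GL_II(nu) draws by inverting the cdf (A.9),
    i.e. eps = log(1 / (1 - p)**(1/nu) - 1) with p uniform, cf. (A.10)."""
    rng = np.random.default_rng() if rng is None else rng
    p = rng.uniform(size=size)
    return np.log((1.0 - p) ** (-1.0 / nu) - 1.0)
```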

A.2 Proofs

Proof of Theorem 1

The proof of Theorem 1 is straightforward. Depending on the observed category y_i, the corresponding utility is the maximum among all latent utilities and the posterior distribution p(u_{ki}, u_{0i}, u_{a,i}|λ_{ki}, y_i) is equal to one of the following distributions:

e^{−u_{ki}} | y_i = k ∼ E(λ_i^⋆)  ⇒  e^{−u_{0i}} = e^{−u_{ki}} + E(1),   e^{−u_{a,i}} = e^{−u_{ki}} + E(λ_{a,i}),

e^{−u_{0i}} | y_i = 0 ∼ E(λ_i^⋆)  ⇒  e^{−u_{ki}} = e^{−u_{0i}} + E(λ_{ki}),   e^{−u_{a,i}} = e^{−u_{0i}} + E(λ_{a,i}),

e^{−u_{a,i}} | y_i ∈ A ∼ E(λ_i^⋆)  ⇒  e^{−u_{ki}} = e^{−u_{a,i}} + E(λ_{ki}),   e^{−u_{0i}} = e^{−u_{a,i}} + E(1),

where λ_i^⋆ = 1 + Σ_{l=1}^{m} λ_{li}. For efficient joint sampling of all utilities for all i = 1,...,N, this can be rewritten as in (25).

Proof of Theorem 2

A binomial observation y_i can be regarded as the aggregated number of successes among N_i independent binary outcomes z_{1i}, ..., z_{N_i,i}, labelled {0, 1} and each following the binary logit model Pr(z_{ni} = 1|π_i) = π_i. For each individual binary observation z_{ni}, the logit model can be written as a RUM (McFadden 1974):

u_{ni} = log λ_i + ε_{ni},   ε_{ni} ∼ LO,

z_{ni} = I{u_{ni} > 0},

involving a latent variable u_{ni}, where the ε_{ni} are i.i.d. errors following a logistic distribution.

Among the N_i binary experiments, y_i outcomes z_{ni} choose category 1, whereas the remaining N_i − y_i outcomes z_{ni} choose category 0. The challenge is to aggregate the latent variables u_{ni} to a few latent variables in such a way that an explicit choice equation is available. As it turns out, such an aggregation can be based on the order statistics u_{(1),i} < ... < u_{(N_i),i} of u_{1i}, ..., u_{N_i,i}.

Consider first the case that yi = 0. Such an outcome is observed, iff zni = 0 or, equivalently, the latent utility is negative (uni < 0) for all n = 1,...,Ni. Hence, a necessary and sufficient condition for yi = 0 is that the maximum of all utilities is negative, or equivalently,

yi = 0 ⇔ u(Ni),i < 0. (A.12)

Next, consider the case that yi = Ni. Such an outcome is observed, iff zni = 1 or, equiv- alently, the latent utility is positive (uni > 0 ) for all n = 1,...,Ni. Hence, a necessary and sufficient condition for yi = Ni is that the minimum of all utilities is positive, or equivalently,

yi = Ni ⇔ u(1),i > 0. (A.13)

Also for outcomes 0 < y_i < N_i, the order statistics u_{(N_i−k),i} and u_{(N_i−k+1),i} provide necessary and sufficient conditions:

yi = k, 0 < k < Ni ⇔ u(Ni−k),i < 0, u(Ni−k+1),i > 0. (A.14)

Note that (A.12) – (A.14) are choice equations involving either one or two order statistics. Hence, we introduce the corresponding order statistics as aggregated latent variables.

Given yi, we define

vi = u(Ni−k),i, yi > 0,

wi = u(Ni−k+1),i, yi < Ni,

yi = k ⇔ vi < 0, wi > 0,

with obvious modifications for yi = 0 and yi = Ni. It remains to prove that the latent variables can be represented as in the aggregated model (34):

w_i = log λ_i + ε_{w,i},   ε_{w,i} ∼ GL_II(y_i),   y_i > 0,   (A.15)

v_i = log λ_i + ε_{v,i},   ε_{v,i} ∼ GL_I(N_i − y_i),   y_i < N_i.

Note that the order statistics can be represented for j = 1,...,N_i as u_{(j),i} = log λ_i + ε_{(j),i}, where ε_{(1),i}, ..., ε_{(N_i),i} are the order statistics of N_i i.i.d. realisations ε_{1,i}, ..., ε_{N_i,i} of a logistic distribution. Their distribution can be derived from the order statistics X_{(1),i}, ..., X_{(N_i),i} of N_i uniform random numbers X_{1i}, ..., X_{N_i,i} using:

ε_{(j),i} = F^{−1}(X_{(j),i}) = log( X_{(j),i} / (1 − X_{(j),i}) )   ⇔   X_{(j),i} = F(ε_{(j),i}),   (A.16)

where F is the cdf of the logistic distribution.

First, for the special cases where y_i = 0 or y_i = N_i, we use that X_{(j),i} ∼ B(j, N_i − j + 1).

Using (A.16), we can derive the density of ε(Ni),i:

p(ε_{(N_i),i}) = N_i F(ε_{(N_i),i})^{N_i−1} f(ε_{(N_i),i}) = N_i exp(ε_{(N_i),i})^{N_i} / (1 + exp(ε_{(N_i),i}))^{N_i+1},

which is the density of a GL_I(N_i) distribution, see (A.4). Hence, for y_i = 0,

vi = log λi + εv,i, εv,i = ε(Ni),i ∼ GLI (Ni) .

Using (A.16), we can derive the density of ε(1),i:

p(ε_{(1),i}) = N_i (1 − F(ε_{(1),i}))^{N_i−1} f(ε_{(1),i}) = N_i exp(ε_{(1),i}) / (1 + exp(ε_{(1),i}))^{N_i+1},

which is the density of a GL_II(N_i) distribution, see (A.8). Hence, for y_i = N_i:

wi = log λi + εw,i, εw,i = ε(1),i ∼ GLII (Ni) .

Second, for 0 < y_i < N_i we need the joint distribution of (ε_{(N_i−k),i}, ε_{(N_i−k+1),i})^⊤, where ε_{(N_i−k+1),i} = ε_{(N_i−k),i} + Δε_i with Δε_i > 0. Using that (X_{(j),i}, X_{(j+1),i} − X_{(j),i}, 1 − X_{(j+1),i}) ∼ D(j, 1, N_i − j) follows a Dirichlet distribution, see e.g. Robert & Casella (1999), we obtain that (X_{(N_i−k),i}, 1 − X_{(N_i−k+1),i}, X_{(N_i−k+1),i} − X_{(N_i−k),i}) ∼ D(N_i − k, k, 1). To derive p(ε_{(N_i−k),i}, ε_{(N_i−k+1),i}), we consider the transformations

ε_{(N_i−k),i} = F^{−1}(X_{(N_i−k),i}),   ε_{(N_i−k+1),i} = F^{−1}(X_{(N_i−k+1),i}),   (A.17)

and their inverse, X(Ni−k),i = F (ε(Ni−k),i) and X(Ni−k+1),i = F (ε(Ni−k+1),i). We determine

| ∂(X_{(N_i−k),i}, X_{(N_i−k+1),i}) / ∂(ε_{(N_i−k),i}, ε_{(N_i−k+1),i}) | = det( diag( f(ε_{(N_i−k),i}), f(ε_{(N_i−k+1),i}) ) ) = f(ε_{(N_i−k),i}) f(ε_{(N_i−k+1),i}),

where f is the pdf of the logistic distribution. Therefore,

p(ε_{(N_i−k),i}, ε_{(N_i−k+1),i}) = ( Γ(N_i + 1) / (Γ(k)Γ(N_i − k)) ) F(ε_{(N_i−k),i})^{N_i−k−1} (1 − F(ε_{(N_i−k+1),i}))^{k−1} f(ε_{(N_i−k),i}) f(ε_{(N_i−k+1),i})

= ( Γ(N_i + 1) / (Γ(k)Γ(N_i − k)) ) · ( exp(ε_{(N_i−k),i})^{N_i−k} / (1 + exp(ε_{(N_i−k),i}))^{N_i−k+1} ) · ( exp(ε_{(N_i−k+1),i}) / (1 + exp(ε_{(N_i−k+1),i}))^{k+1} ) · I{ε_{(N_i−k+1),i} > ε_{(N_i−k),i}}.

This density can be expressed as

p(ε(Ni−k),i, ε(Ni−k+1),i) = p(ε(Ni−k),i) p(ε(Ni−k+1),i) · C · I{ε(Ni−k+1),i > ε(Ni−k),i}, where

p(ε_{(N_i−k),i}) = (N_i − k) exp(ε_{(N_i−k),i})^{N_i−k} / (1 + exp(ε_{(N_i−k),i}))^{N_i−k+1}   (A.18)

is the density of a GL_I(N_i − k) distribution, see (A.4),

p(ε_{(N_i−k+1),i}) = k exp(ε_{(N_i−k+1),i}) / (1 + exp(ε_{(N_i−k+1),i}))^{k+1}   (A.19)

is the density of a GLII (k) distribution, see (A.8), and

C = Γ(N_i + 1) / ( Γ(k + 1)Γ(N_i − k + 1) )

is a normalising constant. It is possible to verify that

∫_{−∞}^{+∞} ∫_{ε_{(N_i−k),i}}^{+∞} p(ε_{(N_i−k),i}, ε_{(N_i−k+1),i}) dε_{(N_i−k+1),i} dε_{(N_i−k),i} = 1.

Defining (ε_{v,i}, ε_{w,i})^⊤ = (ε_{(N_i−k),i}, ε_{(N_i−k+1),i})^⊤ yields (A.15).

Proof of Theorem 3

Knowing that yi = k, with 0 < k < Ni, wi|yi and vi|yi are conditionally independent and the following holds:

wi = log λi + εw,i, vi = log λi + εv,i,

where ε_{w,i}|y_i = k ∼ GL_II(k) is truncated to [−log λ_i, +∞), since w_i > 0, and ε_{v,i}|y_i = k ∼ GL_I(N_i − k) is truncated to (−∞, −log λ_i], since v_i < 0. Since both F_ε and F_ε^{−1} are available in closed form for both types of generalized logistic distributions, we obtain:

ε_{w,i} = log( 1/(1 − p)^{1/y_i} − 1 ),   1 − p = W_i (1 − F_ε(−log λ_i)) = W_i ( λ_i/(1 + λ_i) )^{y_i},

where W_i is a uniform random number, see (A.10). This proves equation (35):

ε_{w,i} = log( (1 + λ_i)/(λ_i W_i^{1/y_i}) − 1 )   ⇒   w_i = log λ_i + ε_{w,i} = log( (1 + λ_i)/W_i^{1/y_i} − λ_i ).

Furthermore,

ε_{v,i} = −log( 1/p^{1/(N_i−y_i)} − 1 ),   p = V_i F_ε(−log λ_i) = V_i / (1 + λ_i)^{N_i−y_i},

where V_i is a uniform random number, see (A.6). This proves equation (36):

ε_{v,i} = −log( (1 + λ_i)/V_i^{1/(N_i−y_i)} − 1 )   ⇒   v_i = log λ_i + ε_{v,i} = −log( (1 + λ_i)/(V_i^{1/(N_i−y_i)} λ_i) − 1/λ_i ).

For yi = 0, only vi|yi is sampled; for yi = Ni, only wi|yi is sampled. It is easy to verify that indeed wi > 0 and vi < 0.
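As an illustration of equations (35) and (36), the following Python sketch draws the aggregated utilities; the vectorized layout and the function name are ours, and entries where w_i or v_i is not defined are returned as NaN.

```python
import numpy as np

def sample_binomial_utilities(y, N, lam, rng=None):
    """Draw the aggregated utilities (w_i, v_i) of the binomial model via
    equations (35) and (36):
        w_i =  log((1 + lam_i) / W_i**(1/y_i) - lam_i),                      y_i > 0,
        v_i = -log((1 + lam_i) / (V_i**(1/(N_i - y_i)) * lam_i) - 1/lam_i),  y_i < N_i,
    with W_i, V_i i.i.d. uniform; undefined entries are NaN."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.atleast_1d(np.asarray(y, dtype=float))
    N = np.broadcast_to(np.asarray(N, dtype=float), y.shape)
    lam = np.broadcast_to(np.asarray(lam, dtype=float), y.shape)
    W, V = rng.uniform(size=y.shape), rng.uniform(size=y.shape)
    w = np.full(y.shape, np.nan)
    v = np.full(y.shape, np.nan)
    pos = y > 0
    w[pos] = np.log((1 + lam[pos]) / W[pos] ** (1 / y[pos]) - lam[pos])
    low = y < N
    v[low] = -np.log((1 + lam[low]) / (V[low] ** (1 / (N[low] - y[low])) * lam[low])
                     - 1 / lam[low])
    return w, v
```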

A.3 Computational details

A.3.1 Sampling the utilities in a binary model

Consider the latent variable representation of a binary model

Pr(yi = 1|λi) = Fε(log λi), (A.20)

involving the latent variables zi:

yi = I{zi > 0}, zi = log λi + εi, εi ∼ fε(εi), (A.21)

where f_ε(ε) is the pdf corresponding to the cdf F_ε(ε); f_ε(ε) = φ(ε) is equal to the standard normal pdf for a probit model and equal to f_ε(ε) = e^{ε}/(1 + e^{ε})² for a logit model. Provided that the quantile function F_ε^{−1}(p) is available in closed form, it is easy to sample z_i from the marginalized density z_i|λ_i, y in the latent variable model (A.21). The latent variables z_i are pairwise independent and the posterior z_i|y_i, λ_i is given by

p(zi|yi, λi) ∝ p(yi|zi)fε(zi − log λi).

The posterior of z_i is f_ε(z_i − log λ_i) truncated to (−∞, 0], if y_i = 0, and truncated to [0, ∞), if y_i = 1. Hence, z_i = log λ_i + ε_i, where ε_i ∼ f_ε(ε_i)I{ε_i > −log λ_i}, if y_i = 1, and ε_i ∼ f_ε(ε_i)I{ε_i < −log λ_i}, if y_i = 0.^9 Therefore:

z_i = log λ_i + F_ε^{−1}( y_i + U_i(1 − y_i − π_i) ),   (A.22)

where U_i ∼ U[0, 1] and π_i = Pr(y_i = 1|λ_i) = F_ε(log λ_i), with F_ε^{−1}(p) = Φ^{−1}(p) for the probit model and F_ε^{−1}(p) = log p − log(1 − p) for the logit model.

9 To simulate ε from a distribution F_ε(ε) truncated to [a, b] we simulate a uniform random number U and define either ε = F_ε^{−1}( F_ε(a) + U(F_ε(b) − F_ε(a)) ) or ε = F_ε^{−1}( F_ε(b) − U(F_ε(b) − F_ε(a)) ).
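A direct Python implementation of (A.22) could look as follows; the function name is ours and scipy's normal quantile function is used for the probit case.

```python
import numpy as np
from scipy.stats import norm

def sample_latent_utilities(y, lam, model="logit", rng=None):
    """Draw z_i = log(lam_i) + F_eps^{-1}(y_i + U_i * (1 - y_i - pi_i)) as in (A.22).
    'model' selects the link: standard normal (probit) or logistic (logit)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    loglam = np.log(lam)
    if model == "probit":
        cdf, quantile = norm.cdf, norm.ppf
    else:  # logit
        cdf = lambda x: 1.0 / (1.0 + np.exp(-x))
        quantile = lambda p: np.log(p) - np.log(1.0 - p)
    pi = cdf(loglam)                    # pi_i = F_eps(log lam_i)
    U = rng.uniform(size=y.shape)
    return loglam + quantile(y + U * (1.0 - y - pi))
```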

A.3.2 Adaptive Metropolis-Hastings

Johndrow et al. (2019) suggest using a simplified adaptive MH algorithm based on the ideas outlined in ?. Corresponding to their sampling scheme, we update the probit/logit/multinomial coefficients θ using a time-inhomogeneous proposal θ_i^⋆ for each component θ_i of θ with the proposal kernel

q_k(θ_{ik}, θ_i^⋆) = φ( (θ_i^⋆ − θ_{ik})²/σ_{ik} ),   σ_{ik} = s_k Σ_{j=0}^{k−1} (θ_{ij} − θ̄_{ik})²,   (A.23)

where φ is the standard univariate Gaussian density and θ̄_{ik} is the average posterior draw across all previous Gibbs iterations. That is, they use the variance of all previous Gibbs iterations to construct on-the-fly proposal densities. For the first 100 iterations, we use a proposal density with unit variance.
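The proposal variance can be tracked with running moments, as in the following Python sketch; the class name, the Welford-style updates and the exact warm-up rule are our own illustration of the scheme described above, not the authors' implementation.

```python
import numpy as np

class AdaptiveProposal:
    """Per-coordinate random-walk proposal whose variance is built from the
    empirical variance of all previous draws, with unit variance during the
    first `warmup` iterations."""

    def __init__(self, dim, warmup=100):
        self.k = 0
        self.warmup = warmup
        self.mean = np.zeros(dim)
        self.ssq = np.zeros(dim)   # running sum of squared deviations

    def update(self, theta):
        """Record a new draw theta (Welford update of mean and deviations)."""
        self.k += 1
        delta = theta - self.mean
        self.mean += delta / self.k
        self.ssq += delta * (theta - self.mean)

    def variance(self):
        """Unit variance during warm-up, empirical variance afterwards."""
        if self.k <= self.warmup or self.k < 2:
            return np.ones_like(self.mean)
        return self.ssq / (self.k - 1)

    def propose(self, theta, rng):
        """Random-walk proposal theta* ~ N(theta, variance())."""
        return theta + rng.normal(scale=np.sqrt(self.variance()), size=theta.shape)
```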

A.3.3 Sampling the scale parameter δ for binomial models

The likelihood contribution for each y_i in the boosted Pólya-Gamma mixture representation of a binomial model (40) reads:

p(w̃_i|δ, log λ̃_i) ∝ (1/√δ) exp{ −ω_{w,i}w̃_i²/(2δ) + w̃_iκ_{w,i}/√δ − ω_{w,i}(log λ̃_i)²/(2δ) + (log λ̃_i/δ)( w̃_iω_{w,i} − √δ κ_{w,i} ) },   y_i > 0,   (A.24)

p(ṽ_i|δ, log λ̃_i) ∝ (1/√δ) exp{ −ω_{v,i}ṽ_i²/(2δ) + ṽ_iκ_{v,i}/√δ − ω_{v,i}(log λ̃_i)²/(2δ) + (log λ̃_i/δ)( ṽ_iω_{v,i} − √δ κ_{v,i} ) },   y_i < N_i.

The likelihood contributions (A.24) have to be combined with the prior p(δ|γ) and with the prior distributions of the parameters in log λ̃_i. We provide details for a binomial regression model.

For a binomial regression model, log λ̃_i = x_iβ̃, where the parameter β̃ is defined exactly as for a binary regression model:

β̃_j = √δ β_j for 1 ≤ j < d,   and   β̃_j = √δ β_j + γ for j = d.   (A.25)

As in the binary case, we assume the following prior for β̃ in the boosted model:

β̃|δ ∼ N_d(0, δA_0),   (A.26)

where A_{0,dd} is possibly larger than A_{0,jj}, for 1 ≤ j < d, which implies the following joint prior of β|γ, δ in the identified model:

β|γ, δ ∼ N_d(a_0, A_0),   a_0 = ( 0_{(d−1)×1}, −γ/√δ )^⊤.   (A.27)

This dependence on δ and γ has to be taken into account when we move from the identified to the unidentified model and sample δ^⋆ ∼ G^{−1}(d_0, D_0) and γ^⋆|δ^⋆, β_d from

γ^⋆|δ^⋆, β_d ∼ N( √δ^⋆ g_1, δ^⋆ G_1 ),   (A.28)

G_1 = G_0A_{0,dd}/(G_0 + A_{0,dd}),   g_1 = −β_d G_0/(G_0 + A_{0,dd}).

The boosted model reads

w̃_i = x_iβ̃ + √δ κ_{w,i}/ω_{w,i} + ε̃_{w,i},   y_i > 0,

ṽ_i = x_iβ̃ + √δ κ_{v,i}/ω_{v,i} + ε̃_{v,i},   y_i < N_i,

with ε̃_{w,i} = √δ ε_{w,i} and ε̃_{v,i} = √δ ε_{v,i}. The choice equation is defined in (41). To obtain the joint posterior p(β̃, δ|z̃, γ), we combine the likelihood contributions (A.24) with the conditional prior p(δ|γ) given in (8) and the prior distribution p(β̃|δ) given in (A.26):

p(β̃, δ|z̃, γ) ∝ ∏_{i: y_i>0} p(w̃_i|δ, β̃) ∏_{i: y_i<N_i} p(ṽ_i|δ, β̃) p(β̃|δ) p(δ|γ).   (A.29)

It can be shown that the joint posterior p(β̃, δ|z̃, γ) factorizes in the following way:

p(β̃, δ|z̃, γ) ∝ f_N(β̃; b_N(δ), δB_N) (1/δ)^{d_I+1} exp( −D_I/δ ) exp( B_I/√δ ),

with

b_N(δ) = B_N m_N(δ),   m_N(δ) = Σ_{i=1}^{N} m_i x_i^⊤,   B_N^{−1} = A_0^{−1} + Σ_{i=1}^{N} M_i x_i^⊤ x_i,   (A.30)

m_i = I{y_i > 0}( w̃_iω_{w,i} − √δ κ_{w,i} ) + I{y_i < N_i}( ṽ_iω_{v,i} − √δ κ_{v,i} ),

M_i = I{y_i > 0}ω_{w,i} + I{y_i < N_i}ω_{v,i},

d_I = d_0 + 1/2 + (1/2) Σ_{i=1}^{N} ( I{y_i > 0} + I{y_i < N_i} ),   (A.31)

D_I = D_0 + γ²/(2G_0) + (1/2) Σ_{i=1}^{N} ( I{y_i > 0} w̃_i²ω_{w,i} + I{y_i < N_i} ṽ_i²ω_{v,i} ) − (1/2) m_a^⊤ B_N m_a,

B_I = Σ_{i=1}^{N} ( I{y_i > 0} w̃_iκ_{w,i} + I{y_i < N_i} ṽ_iκ_{v,i} ) − m_a^⊤ B_N m_b,

where f_N(β̃; b_N(δ), δB_N) is the density of a N_d(b_N(δ), δB_N) distribution and m_a and m_b are defined below. The last terms in D_I and B_I result from completing the squares in β̃. Note that

f_N(β̃; b_N(δ), δB_N) = (2πδ)^{−d/2} |B_N|^{−1/2} exp{ −(1/(2δ)) (β̃ − b_N(δ))^⊤ B_N^{−1} (β̃ − b_N(δ)) }

= (2πδ)^{−d/2} |B_N|^{−1/2} exp{ −(1/(2δ)) β̃^⊤ B_N^{−1} β̃ + (1/δ) m_N(δ)^⊤ β̃ } exp{ −(1/(2δ)) m_N(δ)^⊤ B_N m_N(δ) }.

Hence, the factor exp{ (1/(2δ)) m_N(δ)^⊤ B_N m_N(δ) } has to be added to (A.29) to compensate for completing the squares in f_N(β̃; b_N(δ), δB_N). Using that

m_N(δ) = m_a − √δ m_b,   (A.32)

m_a = Σ_{i=1}^{N} ( I{y_i > 0} w̃_iω_{w,i} + I{y_i < N_i} ṽ_iω_{v,i} ) x_i^⊤,

m_b = Σ_{i=1}^{N} ( I{y_i > 0} κ_{w,i} + I{y_i < N_i} κ_{v,i} ) x_i^⊤,

we obtain:

exp{ (1/(2δ)) m_N(δ)^⊤ B_N m_N(δ) } ∝ exp{ (1/(2δ)) m_a^⊤ B_N m_a } exp{ −(1/√δ) m_a^⊤ B_N m_b }.

The marginal density p(δ|γ, z̃, y) ∝ (1/δ)^{d_I+1} e^{−D_I/δ} e^{B_I/√δ} is not available in closed form, unless all N_i = 1. In this case κ_{w,i} = κ_{v,i} = 0 for all i = 1,...,N and hence B_I = 0 and p(δ|γ, z̃, y) follows the inverted gamma distribution G^{−1}(d_I, D_I). It can be shown that in any case D_I > 0, provided that D_0 > 0. To sample from

p(δ|γ, z̃, y) ∝ (1/δ)^{d_I+1} e^{−D_I/δ} e^{B_I/√δ}

in Algorithm 3, the following resampling technique is used: choose an “auxiliary prior” π(δ) for resampling such that the mode and the curvature of π(δ) coincide with the mode

δ_M and the curvature I_p of the posterior p(δ|γ, z̃, y), which are given by:

δ_M = 16D_I² / [ B_I + √(B_I² + 16D_I(d_I + 1)) ]²,   I_p = −√(B_I² + 16D_I(d_I + 1)) / ( 4 (δ_M)^{5/2} ).

Resampling works as follows. L draws δ^{(l)} ∼ π(δ), l = 1,...,L, from the auxiliary prior are resampled using weights proportional to the “auxiliary likelihood” ℓ(δ) = p(δ|γ, z̃, y)/π(δ),

Figure A.1: Approximating p(δ|·) by the inverted gamma densities δ ∼ G^{−1}(d_I^⋆, D_I^⋆) (left-hand side) and √δ ∼ G^{−1}(b_I^⋆, B_I^⋆) (right-hand side) for a case where B_I > 0 and a case where B_I < 0 (bottom).

given by:

log ℓ(δ) ∝ log p(δ|γ, z̃, y) − log π(δ)

∝ −(d_I + 1) log δ − D_I/δ + B_I/√δ − log π(δ).

The desired draw from p(δ|·) is given by δ^{(l^⋆)}, where l^⋆ ∼ MulNom(1; w_1, ..., w_L) and the weights w_l ∝ ℓ(δ^{(l)}) are normalized to 1. The auxiliary likelihood ℓ(δ) is expected to be rather flat over the support of π(δ), see Figure A.1 for an illustration. Hence, w_1, ..., w_L is expected to be close to a uniform distribution and L can be pretty small (L = 5 or L = 10 should be enough). The factorization of the posterior suggests two distribution families as auxiliary prior

π(δ). First, the inverted Gamma prior δ ∼ G^{−1}(d_I^⋆, D_I^⋆) with mode δ_IG and curvature I_IG of the pdf given by:

δ_IG = D_I^⋆ / (d_I^⋆ + 1),   I_IG = −(d_I^⋆ + 1)³ / (D_I^⋆)².

Matching the mode, i.e. δ_IG = δ_M, and the curvature, i.e. I_IG = I_p, to the posterior, i.e.

I_IG = −(d_I^⋆ + 1)³/(D_I^⋆)² = −(d_I^⋆ + 1)/(δ_IG)² = −(d_I^⋆ + 1)/(δ_M)² = I_p,

δ_IG = D_I^⋆/(d_I^⋆ + 1) = δ_M,

yields the following optimal choice for the parameters (d_I^⋆, D_I^⋆):

d_I^⋆ = −I_p · (δ_M)² − 1,   D_I^⋆ = δ_M(d_I^⋆ + 1).

The log likelihood ratio reads:

log ℓ(δ) ∝ −(d_I − d_I^⋆) log δ − (D_I − D_I^⋆)/δ + B_I/√δ.

Second, provided that B_I > 0, the translated inverted Gamma prior^{10} √δ ∼ G^{−1}(2b_I^⋆, B_I^⋆) with mode δ_SQ and curvature I_SQ of the pdf given by:

δ_SQ = (B_I^⋆)² / (4(b_I^⋆ + 1)²),   I_SQ = −8(b_I^⋆ + 1)⁵ / (B_I^⋆)⁴.

Matching the mode, i.e. δ_SQ = δ_M, and the curvature, i.e. I_SQ = I_p, to the posterior, i.e.

I_SQ = −8(b_I^⋆ + 1)⁵/(B_I^⋆)⁴ = −(b_I^⋆ + 1)/(2(δ_SQ)²) = −(b_I^⋆ + 1)/(2(δ_M)²) = I_p,

δ_SQ = (B_I^⋆)²/(4(b_I^⋆ + 1)²) = δ_M,

10 As follows from the law of transformation of densities, a random variable δ, where the transformed variable √δ ∼ G^{−1}(2a, b), has density

p_SQ(δ; 2a, b) = ( b^{2a} / (2Γ(2a)) ) (1/δ)^{a+1} e^{−b/√δ}.   (A.33)

To sample from such a density, we sample X ∼ G^{−1}(2a, b) and take the square, i.e. δ = X².

yields the following optimal choice for the parameters (b_I^⋆, B_I^⋆):

b_I^⋆ = −2I_p · (δ_M)² − 1,   B_I^⋆ = 2(b_I^⋆ + 1)√δ_M.

The log likelihood ratio reads:

log ℓ(δ) ∝ −(d_I − b_I^⋆) log δ − D_I/δ + (B_I + B_I^⋆)/√δ.

See again Figure A.1 for the optimal values for two examples, one with BI > 0 and one with BI < 0.
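A compact Python sketch of this resampling step, using the inverted gamma auxiliary prior derived above, might look as follows. The function and variable names are ours, and only the G^{−1}(d_I^⋆, D_I^⋆) variant is shown.

```python
import numpy as np
from scipy.stats import invgamma

def resample_delta(d_I, D_I, B_I, L=10, rng=None):
    """One draw from p(delta|.) ∝ delta^{-(d_I+1)} exp(-D_I/delta + B_I/sqrt(delta)):
    draw L candidates from a moment-matched inverted gamma auxiliary prior and
    resample with weights proportional to the auxiliary likelihood."""
    rng = np.random.default_rng() if rng is None else rng
    # Mode and curvature of the target posterior.
    root = np.sqrt(B_I ** 2 + 16.0 * D_I * (d_I + 1.0))
    delta_M = 16.0 * D_I ** 2 / (B_I + root) ** 2
    I_p = -root / (4.0 * delta_M ** 2.5)
    # Matched auxiliary prior parameters (d_I*, D_I*).
    d_star = -I_p * delta_M ** 2 - 1.0
    D_star = delta_M * (d_star + 1.0)
    # Candidates and log auxiliary-likelihood weights.
    cand = invgamma.rvs(a=d_star, scale=D_star, size=L, random_state=rng)
    log_w = (-(d_I - d_star) * np.log(cand)
             - (D_I - D_star) / cand
             + B_I / np.sqrt(cand))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return rng.choice(cand, p=w)
```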

A.4 More Details on the UPG sampler for binary state space models

As for binary regression models, we use the priors γ|δ ∼ N(0, δG_0), δ ∼ G^{−1}(d_0, D_0) and specify the following priors in the boosted SSM (47):

β̃_{j0}|β̃_j, θ_j, δ ∼ N( β̃_j, δP_{0,jj}θ_j ),   β̃_j|δ ∼ N(0, δA_{0,jj}),   θ_j ∼ G^{−1}(c_0, C_0),   (A.34)

where A_{0,jj}, j = 1,...,d, and c_0 as well as C_0 are fixed hyperparameters. Marginalizing with respect to β̃_j yields the following prior for the initial value β̃_{j0} in (47):

β̃_{j0}|δ, θ_j ∼ N( 0, δ(A_{0,jj} + P_{0,jj}θ_j) ).   (A.35)

Matching back through transformation (48) yields the prior in the original SSM (45):

β_j ∼ N(0, A_{0,jj}),  j < d,      β_d|γ, δ ∼ N( −γ/√δ, A_{0,dd} ).   (A.36)

We observe the same prior dependence between γ, δ and β_d as for the simple binary regression model:

γ|δ, β_d ∼ N( √δ g_1, δG_1 ),   G_1 = G_0A_{0,dd}/(G_0 + A_{0,dd}),   g_1 = −β_d G_0/(G_0 + A_{0,dd}).   (A.37)

Note that in Step (b-2a) of Algorithm 4, δ^{new} is sampled from the marginalized posterior δ^{new}|γ^{new}, z̃, θ_1, ..., θ_d, where the entire state process and the initial values are integrated out. This is possible since all process variances, the observation variance, and the variance of the initial state vector are scaled with δ, see Appendix A.4.1.

A.4.1 Linear Gaussian state space models with unknown scaling

A linear Gaussian state space model with unknown scaling is defined as the following dynamic stochastic system for t = 1,...,T :

βt = Ftβt−1 + wt, wt ∼ Nd (0, δQt) , (A.38)

y_t = H_tβ_t + ε_t,   ε_t ∼ N_r(0, δR_t),   (A.39)

where β_0 ∼ N_d( β̂_{0|0}, δP_{0|0} ). The key variables in these formulations are the state variable β_t and the observation variable y_t. The state variable β_t is a latent d-dimensional random vector which is observed only indirectly through the effect it has on the distribution of the r-dimensional time series y_t. Note that the system (A.38) and (A.39) is a linear Gaussian state space model where all (co)variances depend on the same unknown scaling factor δ.

Let ϑ be the vector containing all unknown parameters in the system matrices Ft, Ht, Qt and Rt apart from δ.

In filtering, the goal is to estimate β_0, ..., β_T|y, δ, ϑ given data y = (y_1, ..., y_T) for fixed model parameters ϑ. For a linear Gaussian state space model with unknown scaling, the filter density p(β_t|δ, ϑ, y^t) at time t, where y^t = (y_1, ..., y_t), is equal to a normal distribution, scaled with δ, i.e.

β_t|δ, ϑ, y^t ∼ N_d( β̂_{t|t}, δP_{t|t} ),

Algorithm 4 The ultimate Pólya-Gamma sampler for binary state space models.

Choose starting values for (θ_1, ..., θ_d) and λ_1, ..., λ_T and repeat the following steps:

(a-1) For each t = 1,...,T, sample z_t = log λ_t + F_ε^{−1}( y_t + U_t(1 − y_t − π_t) ) in the SSM (45), where U_t ∼ U[0, 1] and π_t = F_ε(log λ_t). Sample ω_t|z_t, λ_t, y_t ∼ PG(2, |z_t − log λ_t|) for the logit SSM.

(a-2) Sample δ^⋆ ∼ G^{−1}(d_0, D_0) and γ^⋆|δ^⋆ from (A.37). Move to the boosted SSM (47) using transformation (48) to define z̃_t, t = 1,...,T.

(b) Sampling in the boosted conditionally Gaussian SSM (47):

(b-1) Sample γ^{new} = √(G_0^⋆) T_{2d_0}^{−1}( U × T_{2d_0}( O(z̃)/√(G_0^⋆) ) + (1 − U) × T_{2d_0}( L(z̃)/√(G_0^⋆) ) ), where U ∼ U[0, 1], T_ν(·) is the cdf of a t_ν distribution, L(z̃) = max_{t: y_t=0} z̃_t and O(z̃) = min_{t: y_t=1} z̃_t.

(b-2) Sample δ^{new} and β̃_0, ..., β̃_T jointly conditional on θ_1, ..., θ_d based on the marginal prior β̃_{j0}|δ, θ_j given in (A.35), using the methods described in Appendix A.4.1 with the process variances being equal to δθ_j and heteroscedastic observation errors σ_t² = δ/ω_t (where ω_t is known):

(b-2a) Sample δ^{new} from δ|γ^{new}, z̃, θ_1, ..., θ_d ∼ G^{−1}(d_T, D_T), where d_T and D_T are given by:

d_T = d_0 + (T + 1)/2,   D_T = D_0 + (γ^{new})²/(2G_0) + (1/2) Σ_{t=1}^{T} (z̃_t − ẑ_{t|t−1})²/S_{t|t−1},

and ẑ_{t|t−1} and S_{t|t−1} are obtained from a Kalman filter with δ = 1, see (A.43).

(b-2b) Sample β̃_0, ..., β̃_T | δ^{new}, z̃, θ_1, ..., θ_d using forward-filtering-backward-sampling.

(b-3) Update (β̃_j, θ_j)|δ^{new} under the prior β̃_{j0}|β̃_j, θ_j, δ^{new} ∼ N( β̃_j, δ^{new} P_{0,jj}θ_j ) given by (A.34):

(b-3a) Sample β̃_j|θ_j, β̃_{j0}, δ^{new}, for j = 1,...,d:

β̃_j|θ_j, δ^{new}, β̃_{j0} ∼ N( A_{0,jj} β̃_{j0}/(A_{0,jj} + θ_j P_{0,jj}),  δ^{new} A_{0,jj} θ_j P_{0,jj}/(A_{0,jj} + θ_j P_{0,jj}) ).

(b-3b) Sample θ_j|δ^{new}, β̃_j, {β̃_{jt}} ∼ G^{−1}( c_0 + (T + 1)/2, C_j ) for j = 1,...,d, where

C_j = C_0 + (1/(2δ^{new})) ( (β̃_{j0} − β̃_j)²/P_{0,jj} + Σ_{t=1}^{T} (β̃_{jt} − β̃_{j,t−1})² ).

(b-4) Move back to the identified model through the inverse of transformation (48), i.e. for j = 1,...,d, β_j = (β̃_j − I{j = d}γ^{new})/√δ^{new} and β_{jt} = (β̃_{jt} − I{j = d}γ^{new})/√δ^{new}, t = 0,...,T.

t filter which is run with δ = 1, and where all covariances in the filter density βt|δ, ϑ, y , the t−1 t−1 propagated density βt|δ, ϑ, y and the predictive density yt|δ, ϑ, y are multiplied by δ after running the filter.

We provide further details for TVP models for univariate time series yt, where r = 1, 2 ˆ 0 Ft = Id, Ht = xt, Qt = Q = Diag (θ1, . . . , θd), Rt = σt , β0|0 = (β1, . . . , βd) and P0|0 =

Diag (θ1P0,11, . . . , θdP0,dd):   t−1 ˆ (a) Propagation: βt|δ, ϑ, y ∼ Nd βt|t−1, δPt|t−1 , where ˆ ˆ βt|t−1 = βt−1|t−1, Pt|t−1 = Pt−1|t−1 + Q.

t−1  (b) Prediction: yt|δ, ϑ, y ∼ N yˆt|t−1, δSt|t−1 ,

ˆ > 2 yˆt|t−1 = xtβt|t−1,St|t−1 = xtPt|t−1xt + σt . (A.40)   t ˆ (c) Correction: βt|δ, ϑ, y ∼ Nd βt|t, δPt|t , ˆ ˆ βt|t = βt|t−1 + Kt(yt − yˆt|t−1), Pt|t = (I − Ktxt)Pt|t−1,

> −1 Kt = Pt|t−1xt St|t−1.

An advantage of writing the Kalman filter in this way is that the integrated likelihood p(y|δ, ϑ) is available in closed form. If r = 1, then exploiting (A.40) we obtain:

T T ( T )   2 2 Y 1 1 X (yt − yˆt|t−1) p(y|δ, ϑ) = p(y |yt−1, δ, ϑ) ∝ exp − , (A.41) t δ 2δ S t=1 t=1 t|t−1 wherey ˆt|t−1 and St|t−1 are the moments given in (A.40). This allows to sample the scaling factor δ jointly with the state process β0,..., βT conditional on ϑ and y,

(a) by sampling δ|ϑ, y from the inverted Gamma posterior that results from combining the likelihood (A.41) with the inverted gamma prior δ ∼ G^{−1}(d_0, D_0),

δ|ϑ, y ∼ G^{−1}(d_T, D_T),   (A.42)

d_T = d_0 + T/2,   D_T = D_0 + (1/2) Σ_{t=1}^{T} (y_t − ŷ_{t|t−1})²/S_{t|t−1};

(b) by sampling β_0, ..., β_T|y, δ, ϑ conditional on δ using FFBS (Frühwirth-Schnatter 1994).

As mentioned above, the moments ŷ_{t|t−1} and S_{t|t−1} in Step (a) are obtained from running a Kalman filter with δ = 1. Note that we need not rerun the Kalman filter in Step (b); all covariances are simply multiplied by the scaling factor δ sampled in Step (a).
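A minimal Python sketch of these two steps for the TVP special case is given below; the array layout, the function names and the assumption that the observation variances are supplied as a vector sigma2 are ours.

```python
import numpy as np

def kalman_filter_unit_scale(y, X, theta, sigma2, beta0, P0):
    """Kalman filter for the TVP model run with delta = 1:
        beta_t = beta_{t-1} + w_t,  w_t ~ N(0, diag(theta)),
        y_t    = x_t beta_t + e_t,  e_t ~ N(0, sigma2[t]).
    Returns the prediction errors y_t - yhat_{t|t-1} and variances S_{t|t-1}
    needed in (A.42)."""
    T, d = X.shape
    Q = np.diag(theta)
    beta, P = beta0.astype(float), P0.astype(float)
    err, S = np.empty(T), np.empty(T)
    for t in range(T):
        x = X[t]
        P = P + Q                               # (a) propagation
        yhat = x @ beta                         # (b) prediction
        S[t] = x @ P @ x + sigma2[t]
        err[t] = y[t] - yhat
        K = P @ x / S[t]                        # (c) correction
        beta = beta + K * err[t]
        P = P - np.outer(K, x) @ P
    return err, S

def sample_delta(err, S, d0, D0, rng=None):
    """Draw delta | . ~ G^{-1}(d_T, D_T) as in (A.42), via 1/Gamma."""
    rng = np.random.default_rng() if rng is None else rng
    d_T = d0 + len(err) / 2.0
    D_T = D0 + 0.5 * np.sum(err ** 2 / S)
    return D_T / rng.gamma(shape=d_T)
```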

Application to boosting binary state space models. This result is useful for the boosted state space model (47), because the working parameter δ^{new} can be sampled without conditioning on the state process β̃_0, ..., β̃_T and the initial values β̃_1, ..., β̃_d. In this case, ϑ = (θ_1, ..., θ_d) and the Kalman filter described above is applied with β̂_{0|0} = 0_d and P_{0|0} = Diag( A_{0,11} + θ_1P_{0,11}, ..., A_{0,dd} + θ_dP_{0,dd} ). The Kalman filter reads:

β̃_t|δ, ϑ, z̃^{t−1} ∼ N_d( β̂_{t−1|t−1}, δP_{t|t−1} ),   P_{t|t−1} = P_{t−1|t−1} + Diag(θ_1, ..., θ_d),   (A.43)

z̃_t|δ, ϑ, z̃^{t−1} ∼ N( ẑ_{t|t−1}, δS_{t|t−1} ),   ẑ_{t|t−1} = x_tβ̂_{t−1|t−1},   S_{t|t−1} = x_tP_{t|t−1}x_t^⊤ + 1/ω_t,

β̃_t|δ, ϑ, z̃^t ∼ N_d( β̂_{t|t}, δP_{t|t} ),   β̂_{t|t} = β̂_{t−1|t−1} + K_t(z̃_t − ẑ_{t|t−1}),   P_{t|t} = (I − K_tx_t)P_{t|t−1},   K_t = P_{t|t−1}x_t^⊤ S_{t|t−1}^{−1}.

Therefore, the likelihood (A.41) takes the following form:

p(z̃|δ, ϑ) = ∏_{t=1}^{T} p(z̃_t|z̃^{t−1}, δ, ϑ) ∝ (1/δ)^{T/2} exp{ −(1/(2δ)) Σ_{t=1}^{T} (z̃_t − ẑ_{t|t−1})²/S_{t|t−1} }.

A.5 Additional simulation results

This section provides some additional simulation results for probit, logit, multinomial and binomial regression models. We investigate a number of simulation scenarios. First, we vary the sample size N with only one instance of y_i = 1, implying increasing imbalance with a larger sample size. Second, we fix N = 1000 and vary the intercept in the data generating process, where large absolute intercepts imply greater imbalance than intercepts that are close to zero (corresponding to the fully balanced case). In addition, we vary the number of choices {0, ..., m} in the multinomial case and the number of successes N_i in the binomial case. The results for the probit model are discussed in detail in Section A.5.1. Graphical summaries of the additional simulation results of the UPG models for the binary logit model, the multinomial logit model and the binomial logit model are provided in Section A.5.2. Put briefly, the additional simulation results suggest that the proposed UPG approach can compete with the efficiency “gold standard” of the Metropolis-Hastings algorithm. At the same time, several extensions of the UPG approach are easily implemented due to the conditionally Gaussian posterior of the regression coefficients.

A.5.1 Probit results

For binary probit models, we compare the marginal data augmentation approach from Imai & van Dyk (2005) (I&vD),^{11} the UPG approach outlined in this article, the adaptive Metropolis-Hastings step proposed in Johndrow et al. (2019) (AMH) as well as the original probit data augmentation scheme outlined in Albert & Chib (1993) (A&C). The results for the probit model are summarized graphically in Figure A.2 and provided numerically in Table A.1. It is obvious that the commonly used standard data augmentation approach of A&C has severe efficiency problems with increasing sparsity in the dependent variable and hence increasing imbalance of the data. Increasing inefficiency can also be observed for the I&vD algorithm, as seen in Figure A.2 and Table A.1. On the other hand, consistently high levels of efficiency are observed for UPG and the AMH algorithm. In the probit case, the extreme intercept values of −3 and 3 imply success counts close to 0 and close to N, respectively. In general, AMH and UPG exhibit efficient estimation behavior over the full range of intercepts. In contrast, I&vD as well as A&C show increasing inefficiency with increasing imbalance.

11 We use the UPG algorithm with γ deterministically fixed at 0, which corresponds to the marginal data augmentation algorithm in Imai & van Dyk (2005).

Figure A.2: Inefficiency factors across 50 simulation runs for binary probit models. Panel (a): varying sample sizes with one success. Panel (b): varying intercepts with N = 1000. Both panels show the log median inefficiency factor for the samplers UPG, I&vD, AMH and A&C.

Figure A.3: Inefficiency factors for varying sample size with one success in binary logit models (log median inefficiency factor for the samplers UPG, FS&F, AMH and P&S).

A.5.2 Additional results

Additional results are provided in Figure A.3 for the binary logit model, in Figure A.5 and Figure A.4 for the multinomial model and in Figure A.6 for the binomial logit model. Selected numerical results for these models can be found in Table A.2 to Table A.5.

Figure A.4: Inefficiency factors for varying number of categories K = m + 1 in multinomial logit models with one success and N = 1000 (log median inefficiency factor for the samplers UPG, FS&F, AMH and P&S).

Figure A.5: Inefficiency factors for varying sample size in multinomial logit models with one success and three categories (log median inefficiency factor for the samplers UPG, FS&F, AMH and P&S).

Figure A.6: Varying number of successes N_i in binomial logit models with intercept β_d = −3 and N = 500 (log median inefficiency factor for the samplers UPG, AuxMix and AMH).

A.5.3 Tabulated simulation results

Table A.1: Varying Sample Sizes for Binary Probit Models

Probit               N = 10      N = 100     N = 1000    N = 10000   N = 20000   N = 50000
I&vD   IF^B          1.8676      4.8926      21.1759     109.6955    182.3035    364.5669
       ESS^B         5354.4923   2043.9223   472.2356    91.1618     54.8541     27.4309
UPG    IF / IF^B     0.7283      0.5512      0.2164      0.0499      0.0308      0.0153
       ESS / ESS^B   1.3730      1.8143      4.6216      20.0215     32.5028     65.2238
AMH    IF / IF^B     2.9028      1.1569      0.2763      0.0535      0.0315      0.0157
       ESS / ESS^B   0.3445      0.8644      3.6191      18.7057     31.7257     63.6286
A&C    IF / IF^B     3.0332      7.0097      10.8610     9.8597      9.8830      5.3714
       ESS / ESS^B   0.3297      0.1427      0.0921      0.1014      0.1012      0.1864

Note: Values are medians across 50 simulation runs. Every simulated dependent variable has exactly one success. IF = inefficiency factor, ESS = effective sample size. IF^B and ESS^B correspond to the simulation results of the benchmark sampler I&vD. Results for all other samplers are reported relative to this benchmark.

Table A.2: Varying sample sizes for binary logit models.

Logit                N = 10      N = 100     N = 1000    N = 10000   N = 20000   N = 50000
P&S    IF^B          4.0487      28.6732     187.1806    1003.4837   1627.8337   1898.2638
       ESS^B         2469.9174   348.7592    53.4272     9.9668      6.1431      5.2682
UPG    IF / IF^B     0.5097      0.1301      0.0268      0.0054      0.0033      0.0030
       ESS / ESS^B   1.9619      7.6858      37.2430     184.3834    299.3352    337.9056
AMH    IF / IF^B     1.5706      0.2329      0.0350      0.0063      0.0039      0.0032
       ESS / ESS^B   0.6367      4.2931      28.5437     157.7139    256.6996    307.8137
FS&F   IF / IF^B     3.4093      4.4482      4.3992      2.1613      1.7815      2.4249
       ESS / ESS^B   0.2933      0.2248      0.2273      0.4626      0.5613      0.4124

Note: Values are medians across 50 simulation runs. Every simulated dependent variable has exactly one success. IF = inefficiency factor, ESS = effective sample size. IF^B and ESS^B correspond to the simulation results of the benchmark sampler P&S. Results for all other samplers are reported relative to this benchmark.

Table A.3: Varying sample sizes for multinomial logit models.

MNL                  N = 10      N = 100     N = 1000    N = 10000   N = 20000   N = 50000
P&S    IF^B          4.2483      28.1323     195.4187    1089.3182   1453.2866   2658.0921
       ESS^B         2361.8450   360.7564    54.3959     11.1845     8.7454      5.2080
UPG    IF / IF^B     0.5263      0.1318      0.0261      0.0051      0.0038      0.0021
       ESS / ESS^B   1.8979      7.4951      36.1844     162.5113    207.3269    347.4167
AMH    IF / IF^B     1.5154      0.2324      0.0336      0.0059      0.0044      0.0024
       ESS / ESS^B   0.6622      4.2820      28.3397     140.8530    182.0504    308.7848
FS&F   IF / IF^B     3.4130      4.4907      5.0743      2.4723      2.0918      1.5335
       ESS / ESS^B   0.2977      0.2369      0.2313      0.4644      0.4803      0.5711

Note: Values are medians across 50 simulation runs. Every simulated dependent variable has exactly one success. IF = inefficiency factor, ESS = effective sample size. IF^B and ESS^B correspond to the simulation results of the benchmark sampler P&S. Results for all other samplers are reported relative to this benchmark.

Table A.4: Varying number of categories K = m + 1 for multinomial logit models.

MNL                  K = 5       K = 10      K = 15      K = 20      K = 25      K = 30
P&S    IF^B          173.1453    170.0451    173.8775    182.2487    180.7854    192.1254
       ESS^B         57.7559     58.8080     57.5210     54.8745     55.3162     52.0566
UPG    IF / IF^B     0.0292      0.0298      0.0291      0.0277      0.0279      0.0263
       ESS / ESS^B   34.2630     33.6757     34.3930     36.2290     35.8619     38.1432
AMH    IF / IF^B     0.0373      0.0383      0.0377      0.0358      0.0362      0.0341
       ESS / ESS^B   27.0810     26.3595     26.8156     28.1783     27.9553     29.5801
FS&F   IF / IF^B     5.4401      5.6107      5.1288      5.3679      5.3210      4.9568
       ESS / ESS^B   0.2184      0.2154      0.2357      0.2362      0.2358      0.2483

Note: Values are medians across 50 simulation runs. Every simulated group except the baseline has exactly one success. IF = inefficiency factor, ESS = effective sample size. IF^B and ESS^B correspond to the simulation results of the benchmark sampler P&S. Results for all other samplers are reported relative to this benchmark.

Table A.5: Varying number of trials for binomial regression models.

Binomial              N_i = 5    N_i = 10   N_i = 20   N_i = 25   N_i = 30   N_i = 35   N_i = 40
AuxMix  IF^B          17.0848    16.8418    17.3633    17.5455    17.4283    17.4665    17.0705
        ESS^B         585.3150   593.7698   575.9287   569.9499   573.7801   572.5258   585.8053
UPG     IF / IF^B     0.2396     0.2467     0.2542     0.2577     0.2662     0.2660     0.2741
        ESS / ESS^B   4.1739     4.0531     3.9333     3.8801     3.7561     3.7589     3.6480
AMH     IF / IF^B     0.2743     0.2724     0.2650     0.2702     0.2796     0.2812     0.2969
        ESS / ESS^B   3.6462     3.6714     3.7729     3.7013     3.5763     3.5559     3.3683

Note: Values are medians across 50 simulation runs. Each data set has 500 observations and a true intercept value of β_d = −3. IF = inefficiency factor, ESS = effective sample size. IF^B and ESS^B correspond to the simulation results of the benchmark sampler AuxMix. Results for all other samplers are reported relative to this benchmark.