Örebro University, School of Business
Master in Applied Statistics
Sune Karlsson, Thomas Laitila
VT14 (2014/07/04)
Bayesian Inference in the Multinomial Probit Model: A case study
Oleksandr Dyklevych, 82/02/21

Abstract
Binary, multinomial and multivariate response models play an important role in a variety of scientific fields. The most commonly encountered response models are the Multinomial Logit model (MNL) and the Multinomial Probit model (MNP). While easier to deal with, MNL suffers from the independence of irrelevant alternatives (IIA) assumption. The MNP model does not have the IIA issue, but imposes its own difficulties: direct estimation of the coefficients of an MNP model is computationally burdensome. The Bayesian approach allows one to avoid the complex calculation of the maximum likelihood function for the MNP model, but it should be applied with caution because of the identification problem. Response models are often applied to predict customers' behavior and preferences. This paper is also concerned with using the MNP model to predict customers' retention behavior and lifetime value at the individual level. A Bayesian approach is used to estimate the parameters of interest, and the models' adequacy is tested. The results are discussed, leaving some space for future work.

Contents
1 Introduction
1.1 Motivation of the model selection
1.2 Outline
2 Theoretical concepts
2.1 Setting up the model
2.2 Bayesian framework
2.2.1 Binary case
2.2.2 Multinomial case
3 Model estimation and inference: binary case
3.1 Target population
3.2 Dataset description
3.3 Model estimation results
4 Model estimation and inference: multinomial case
4.1 Target population
4.2 Dataset description
4.3 Model estimation results
4.4 Applying results for CLV calculation
5 Conclusion
5.1 Summary
5.2 Extensions, applications, and future work
A Posterior distribution of beta, binary case
B Posterior distribution of beta, Sigma, multinomial case
C Poisson-Binomial distribution
D Source code documentation
List of Tables
3.1 Customers related covariates
3.2 Product related covariates
3.3 The correlation matrix for variables LAM, LF, LT, and LI
3.4 The summary table of covariates, model 1
3.5 Posterior results for model 1
3.6 Misclassification table for model 1
4.1 The distribution of the outcomes for model 2
4.2 The summary table of covariates, model 2
4.3 Posterior results for model 2, β̃
4.4 Comparison of observed and estimated frequencies, model 2
4.5 Posterior predictive probabilities example, model 2
List of Figures
3.1 Imaginary time-lines for customers, model 1
3.2 Contingency table and mosaic plot for variable GEN, model 1
3.3 Contingency table and mosaic plot for (grouped) variable AGE, model 1
3.4 Contingency table and mosaic plot for variable ZIP, model 1
3.5 Contingency table and mosaic plot for variable DP, model 1
3.6 Contingency table and mosaic plot for variable DW, model 1
3.7 Contingency table and mosaic plot for variable PU, model 1
3.8 Contingency table and mosaic plot for (grouped) variable LAM, model 1
3.9 Contingency table and mosaic plot for variable LF, model 1
3.10 Contingency table and mosaic plot for variable ZF, model 1
3.11 Contingency table and mosaic plot for variable DS, model 1
3.12 Posterior densities for β, model 1
3.13 Sampled output for β, model 1
3.14 Autocorrelation plots of sampled output for β, model 1
4.1 Contingency table and mosaic plot for variable GEN, model 2
4.2 Contingency tables for variables AGE and ZIP, model 2
4.3 Contingency tables for variables DP and DW, model 2
4.4 Contingency tables for variables PU and LAM, model 2
4.5 Contingency tables for variables ZF and DS, model 2
4.6 Contingency tables for variable LF, model 2
4.7 Posterior densities for β̃, model 2, (part 1)
4.8 Posterior densities for β̃, model 2, (part 2)
4.9 Posterior densities for β̃, model 2, (part 3)
4.10 Sampled output for β̃, model 2, (part 1)
4.11 Sampled output for β̃, model 2, (part 2)
4.12 Sampled output for β̃, model 2, (part 3)
4.13 Autocorrelation plots of output for β̃, model 2, (part 1)
4.14 Autocorrelation plots of output for β̃, model 2, (part 2)
4.15 Autocorrelation plots of output for β̃, model 2, (part 3)
4.16 Dependence of CLV on LAM
Listings
D.1 Source C-code for MNPBayes.dll
D.2 Source R-code for bp.bayes function
D.3 Source R-code for mnp.bayes function
D.4 Source R-code for data.reshape function
D.5 Source R-code for bp.bayes.predict function
D.6 Source R-code for mnp.bayes.predict function
D.7 Source R-code for tab.data function
D.8 Source C-code for poissbin.dll
D.9 Source R-code for poiss.binom.pmf function
Chapter 1
Introduction
Multinomial Probit and Multinomial Logit (hereinafter MNP and MNL, respectively) are extensions of the simple linear regression model. They target a similar set of problems: prediction of the individual probabilities of a certain outcome. Most readers will be familiar with the simplest special cases of MNL and MNP, namely the binary logit model, often called logistic regression, and the binary probit model. A variable of interest in such binary models can take two possible values; in other words, we observe two possible outcomes for the dependent variable. Under the assumption that the outcomes are influenced by a set of regressors, the probabilities of the outcomes are estimated by means of maximum likelihood estimation.

When there are more than two possible outcomes for the response variable, the situation is modeled by means of MNL or MNP. Usually, the outcomes are unordered, i.e. the order of the outcomes does not matter. For example, consider the following problem: there are four options to get from one city to another, by car, by bus, by train, or by plane. Different people certainly prefer different modes of transportation, and the choice is influenced by many factors: price, travel time, comfort, etc. We can arrange the outcomes in 24 different ways, but the arrangement does not affect the estimation process in any way. In many experiments, however, the order of the outcomes does matter. To deal with ordered outcomes, one should use other extensions of the binary outcome models, namely Ordered Logit and Ordered Probit regressions.

Multinomial models should not be confused with multivariate models. Multivariate regression applies when one wants to estimate several binary models jointly, assuming that their outcomes are correlated.

As noted above, both MNP and MNL target the same set of problems. Nonetheless, there are several fundamental differences between them. MNL relies on the so-called independence of irrelevant alternatives assumption (hereinafter, IIA), which states that the relative probability of each pair of outcomes does not depend on the presence (or absence) of the other alternatives. In many cases such an assumption is unrealistic. In contrast, the formulation of MNP allows correlation between the outcomes. Apart from MNP, there are other classes of choice models developed to solve the IIA problem; some of the most renowned are Nested Logit and Mixed Logit.

On the other hand, MNP imposes its own limitations. The estimation of the model coefficients is computationally burdensome when the number of outcomes for the variable of interest is greater than 3. In this case, the expression for the maximum likelihood function has no closed form and cannot be maximized directly. The Bayesian approach allows one to estimate the coefficients of the model while avoiding the complex computation of the maximum likelihood. But the application of the Bayesian technique is restricted by another issue, namely multiplicative redundancy. A more detailed explanation of these problems and the ways to deal with them is given in chapter 2.
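The IIA property of MNL is easy to demonstrate numerically. In the logit model the choice probabilities take the softmax form $P_j = e^{v_j} / \sum_k e^{v_k}$, so the ratio of any two probabilities depends only on the two utilities involved. The following sketch, with made-up utilities for the transportation example above, illustrates this:

```python
import numpy as np

def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities from a vector of
    deterministic utilities (the standard softmax form)."""
    expu = np.exp(utilities - np.max(utilities))  # stabilized softmax
    return expu / expu.sum()

# Hypothetical utilities for: car, bus, train, plane
u = {"car": 1.0, "bus": 0.5, "train": 0.8, "plane": 1.2}

p_all = mnl_probabilities(np.array(list(u.values())))
ratio_all = p_all[0] / p_all[1]          # P(car) / P(bus), all 4 modes

# Remove the plane alternative and recompute
u_no_plane = {k: v for k, v in u.items() if k != "plane"}
p_sub = mnl_probabilities(np.array(list(u_no_plane.values())))
ratio_sub = p_sub[0] / p_sub[1]          # P(car) / P(bus), 3 modes

# IIA: the relative odds of car vs. bus are identical in both cases,
# equal to exp(u_car - u_bus) regardless of the remaining alternatives.
assert np.isclose(ratio_all, ratio_sub)
```

This invariance is exactly what is often unrealistic: removing the plane option should plausibly shift demand toward the train more than toward the car, but MNL cannot represent that.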
1.1 Motivation of the model selection
The main goal of this paper is to explore the implementation, performance and adequacy of the Bayesian approach in a real-world application by modeling the retention behavior of customers.

In a service-based environment, establishing long-term relationships with customers plays a vital role in maximizing a company's profit. A quantitative measure of the efficiency of such relationships is the customer lifetime value, or CLV. A retrospective look at the customer database reveals the influence of individual factors on customer loyalty and therefore allows one to forecast and influence it for new and existing customers.

An important component of CLV is the retention rate. The choice of a modeling approach for CLV and retention behavior depends on the business environment and the target population1. A variety of models have been proposed for estimation and forecasting of retention behavior. One can find a comprehensive overview of CLV modeling in Gupta et al. (2006).

The dataset under study comes from a continuous and non-contractual business environment. This means that customers are allowed to make purchases whenever they want, and they are not bound by any subscription agreement. The target population includes new customers only, i.e. those who
1A short but nice description of the types of relationships with customers can be found in Fader and Hardie (2009).
made a purchase for the first time2.

Referring to Gupta et al. (2006), a probit model is found to be suitable for retention rate analysis and inference. It is also pointed out by Verhoef (2004) that a univariate binomial probit model is a good starting point for the problem of estimating retention behavior. The univariate probit model can be extended to a multinomial target, and serves as a basis for other approaches to modeling CLV, such as survival models and Pareto/NBD models (see, for example, Fader and Hardie (2009)).

The selection of the Bayesian approach to model estimation is driven by several reasons. Purely statistical advantages can be found in Albert and Chib (1993). The author's personal interest in studying Bayesian statistics and its application to real-life problems is another. Overall business benefits from the model are also at stake: revealing the factors that affect the retention rate will lead to more efficient allocation of marketing expenses, more robust budgeting processes and customer segmentation, and shall therefore help to maximize the company's profit.
1.2 Outline
This thesis will be structured as follows. In chapter 2, the notation and theoretical basis will be introduced, which will then be used throughout the other chapters. An overview of the model and sampling techniques will be given.

In chapter 3 and chapter 4, the theory will be applied to real-world data. A data description will be given, together with graphical representations. The chapters will introduce different approaches to the target definition, and models will be formalized, estimated and tested.

In chapter 5, the results will be summarized and possible extensions, applications and future work will be discussed.

Finally, Appendix C will present supplementary theory on the Poisson-Binomial distribution, and in Appendix D the source code for the model estimation will be listed.
2A more detailed description of the dataset and the business environment is given in chapter 3 and chapter 4.
Chapter 2
Theoretical concepts
2.1 Setting up the model
We start with an introduction of a general form of the multinomial probit (hereinafter referred to as MNP) model. The following notation for the MNP model is based on Chib et al. (1998). The form in which the MNP model is introduced here will be useful later, when we move to its application in practice. Throughout the notation we assume that scalars are written in italic, vectors are written in boldface, and matrices are written in BOLDFACE.

Let $y_1, y_2, \ldots, y_n$, with $y_i \in \{1, \ldots, p\}$ for all $i = \overline{1, n}$, be a set of observed unordered outcomes for some population. Each outcome is associated with a set of covariates $(\mathbf{v}_{ij}, \mathbf{w}_i)$, $j = \overline{1, p}$, where $\mathbf{v}_{ij}$ is a covariate vector that varies across both subjects from the population and outcomes, and $\mathbf{w}_i$ of size $q \times 1$ contains characteristics of subject $i$. Let $\mathbf{z}_i^* = \left( z_{i1}^*, \ldots, z_{ip}^* \right)'$ be a $p \times 1$ vector of continuous unobserved latent variables, such that
$$y_i = j, \quad \text{if } \max(\mathbf{z}_i^*) = z_{ij}^* \tag{2.1}$$

and every latent variable $z_{ij}^*$ is a utility for an alternative $j$:

$$z_{ij}^* = \mathbf{v}_{ij}' \boldsymbol{\delta} + \mathbf{w}_i' \boldsymbol{\gamma}_j + \xi_{ij} \tag{2.2}$$

In the real world one does not observe $z_{ij}^*$; only $y_i$ and the covariates are known to the researcher. Writing (2.2) simultaneously for all $j = \overline{1, p}$ we obtain

$$\begin{aligned}
z_{i1}^* &= \mathbf{v}_{i1}' \boldsymbol{\delta} + \mathbf{w}_i' \boldsymbol{\gamma}_1 + \xi_{i1} \\
z_{i2}^* &= \mathbf{v}_{i2}' \boldsymbol{\delta} + \mathbf{w}_i' \boldsymbol{\gamma}_2 + \xi_{i2} \\
&\cdots \\
z_{ip}^* &= \mathbf{v}_{ip}' \boldsymbol{\delta} + \mathbf{w}_i' \boldsymbol{\gamma}_p + \xi_{ip}
\end{aligned} \tag{2.3}$$
By denoting
$$\mathbf{V}_i = \begin{pmatrix} \mathbf{v}_{i1}' \\ \mathbf{v}_{i2}' \\ \vdots \\ \mathbf{v}_{ip}' \end{pmatrix}, \quad
\mathbf{W}_i = \begin{pmatrix} \mathbf{w}_i' & \mathbf{0}' & \cdots & \mathbf{0}' \\ \mathbf{0}' & \mathbf{w}_i' & \cdots & \mathbf{0}' \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{w}_i' \end{pmatrix}, \quad
\boldsymbol{\Upsilon} = \begin{pmatrix} \boldsymbol{\gamma}_1 \\ \boldsymbol{\gamma}_2 \\ \vdots \\ \boldsymbol{\gamma}_p \end{pmatrix} \tag{2.4}$$

we obtain

$$\mathbf{z}_i^* = \mathbf{V}_i \boldsymbol{\delta} + \mathbf{W}_i \boldsymbol{\Upsilon} + \boldsymbol{\xi}_i \tag{2.5}$$
Note that $\mathbf{V}_i$ has the same number of rows as $\mathbf{W}_i$. The notation can be simplified further by introducing:
$$\mathbf{R}_i = (\mathbf{V}_i, \mathbf{W}_i) = \begin{pmatrix}
\mathbf{v}_{i1}' & \mathbf{w}_i' & \mathbf{0}' & \cdots & \mathbf{0}' \\
\mathbf{v}_{i2}' & \mathbf{0}' & \mathbf{w}_i' & \cdots & \mathbf{0}' \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathbf{v}_{ip}' & \mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{w}_i'
\end{pmatrix}, \quad
\boldsymbol{\beta}^* = \begin{pmatrix} \boldsymbol{\delta} \\ \boldsymbol{\Upsilon} \end{pmatrix} \tag{2.6}$$
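To make the stacking in (2.4)-(2.6) concrete, here is a small numerical sketch of how $\mathbf{R}_i$ is assembled for one subject; the sizes, values, and the function name are purely illustrative:

```python
import numpy as np

def design_matrix(V_i, w_i):
    """Builds R_i = (V_i, W_i) as in (2.6) for one subject:
    V_i stacks the alternative-specific covariate rows v_ij',
    and W_i places the subject-specific vector w_i' block-diagonally."""
    p = V_i.shape[0]
    # kron(I_p, w_i') puts w_i' in the j-th diagonal block of row j
    W_i = np.kron(np.eye(p), w_i.reshape(1, -1))   # p x (p*q)
    return np.hstack([V_i, W_i])

# Hypothetical sizes: p = 3 alternatives, 2 alternative-varying
# covariates per alternative, q = 2 subject characteristics
V_i = np.arange(6.0).reshape(3, 2)
w_i = np.array([1.0, -1.0])
R_i = design_matrix(V_i, w_i)

# R_i has p rows and (2 + p*q) columns, matching the length of
# beta* = (delta', Upsilon')'
assert R_i.shape == (3, 2 + 3 * 2)
```

Each row of $\mathbf{R}_i$ corresponds to one alternative's utility equation, so the single matrix product $\mathbf{R}_i \boldsymbol{\beta}^*$ reproduces the whole system (2.3).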
Summarizing (2.1)-(2.6), and assuming that the $\xi_{ij}$ are distributed according to a $p$-dimensional normal distribution with zero mean vector and covariance matrix $\boldsymbol{\Xi}$, i.e. $\boldsymbol{\xi}_i = (\xi_{i1}, \xi_{i2}, \ldots, \xi_{ip})' \sim N_p(\mathbf{0}, \boldsymbol{\Xi})$, the MNP model is formalized as
$$\begin{aligned}
y_i &= j, \quad \text{if } \max(\mathbf{z}_i^*) = z_{ij}^*, \; \forall j = \overline{1, p} \\
\mathbf{z}_i^* &= \mathbf{R}_i \boldsymbol{\beta}^* + \boldsymbol{\xi}_i \\
\boldsymbol{\xi}_i &\sim N_p(\mathbf{0}, \boldsymbol{\Xi})
\end{aligned} \tag{2.7}$$

The definition of MNP as given in (2.7) lacks unique identification of the stated preference. Suppose one adds a positive constant $a$ to every equation in (2.3). This does not introduce any change to (2.1), in the sense that the
stated preference remains unaffected; thus, the value of $a$ cannot be estimated. This problem is known as additive redundancy. The usual way of dealing with additive redundancy is to subtract the $p$-th equation in (2.3) from the others, so we have

$$\begin{aligned}
z_{i1}^* - z_{ip}^* &= (\mathbf{v}_{i1} - \mathbf{v}_{ip})' \boldsymbol{\delta} + \mathbf{w}_i' (\boldsymbol{\gamma}_1 - \boldsymbol{\gamma}_p) + (\xi_{i1} - \xi_{ip}) \\
z_{i2}^* - z_{ip}^* &= (\mathbf{v}_{i2} - \mathbf{v}_{ip})' \boldsymbol{\delta} + \mathbf{w}_i' (\boldsymbol{\gamma}_2 - \boldsymbol{\gamma}_p) + (\xi_{i2} - \xi_{ip}) \\
&\cdots \\
z_{i(p-1)}^* - z_{ip}^* &= (\mathbf{v}_{i(p-1)} - \mathbf{v}_{ip})' \boldsymbol{\delta} + \mathbf{w}_i' (\boldsymbol{\gamma}_{p-1} - \boldsymbol{\gamma}_p) + (\xi_{i(p-1)} - \xi_{ip})
\end{aligned} \tag{2.8}$$

By introducing $z_{ij} = z_{ij}^* - z_{ip}^*$ and $\varepsilon_{ij} = \xi_{ij} - \xi_{ip}$, $\forall j = \overline{1, p-1}$, into (2.8) and following the same logic as in (2.4)-(2.6), we re-parameterize the model in (2.7) as
$$\mathbf{X}_i = (\mathbf{A}_i, \mathbf{B}_i)$$

$$\mathbf{A}_i = \begin{pmatrix} (\mathbf{v}_{i1} - \mathbf{v}_{ip})' \\ (\mathbf{v}_{i2} - \mathbf{v}_{ip})' \\ \vdots \\ (\mathbf{v}_{i(p-1)} - \mathbf{v}_{ip})' \end{pmatrix}, \quad
\mathbf{B}_i = \begin{pmatrix} \mathbf{w}_i' & \mathbf{0}' & \cdots & \mathbf{0}' \\ \mathbf{0}' & \mathbf{w}_i' & \cdots & \mathbf{0}' \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}' & \mathbf{0}' & \cdots & \mathbf{w}_i' \end{pmatrix} \tag{2.9}$$

and, finally,

$$\begin{aligned}
y_i &= \begin{cases} j, & \text{if } \max(\mathbf{z}_i) = z_{ij} > 0 \\ 0, & \text{if } \max(\mathbf{z}_i) < 0 \end{cases}, \quad j = \overline{1, p-1} \\
\mathbf{z}_i &= \mathbf{X}_i \boldsymbol{\beta} + \boldsymbol{\varepsilon}_i \\
\boldsymbol{\varepsilon}_i &\sim N_{p-1}(\mathbf{0}, \boldsymbol{\Sigma})
\end{aligned} \tag{2.10}$$

The covariance matrix $\boldsymbol{\Sigma}$ of the new error terms satisfies
$$\boldsymbol{\Sigma} = [\mathbf{I}_{p-1}, -\mathbf{1}_{p-1}] \, \boldsymbol{\Xi} \, [\mathbf{I}_{p-1}, -\mathbf{1}_{p-1}]' \tag{2.11}$$
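Relation (2.11) can be checked numerically: each element of the product must equal the covariance of the corresponding differenced errors, $\operatorname{Cov}(\xi_{ij} - \xi_{ip}, \xi_{ic} - \xi_{ip}) = \Xi_{jc} - \Xi_{jp} - \Xi_{cp} + \Xi_{pp}$. A minimal sketch with an arbitrary positive-definite $\boldsymbol{\Xi}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

# A random positive-definite covariance matrix Xi for the original errors
A = rng.normal(size=(p, p))
Xi = A @ A.T

# The differencing matrix M = [I_{p-1}, -1_{p-1}]
M = np.hstack([np.eye(p - 1), -np.ones((p - 1, 1))])

# Sigma as in (2.11)
Sigma = M @ Xi @ M.T

# Element-wise check against the covariance of differenced errors:
#   Cov(xi_j - xi_p, xi_c - xi_p) = Xi[j,c] - Xi[j,p] - Xi[c,p] + Xi[p,p]
for j in range(p - 1):
    for c in range(p - 1):
        expected = Xi[j, c] - Xi[j, p - 1] - Xi[c, p - 1] + Xi[p - 1, p - 1]
        assert np.isclose(Sigma[j, c], expected)
```

The check also makes explicit that $\boldsymbol{\Sigma}$ inherits positive definiteness from $\boldsymbol{\Xi}$, since $\mathbf{M}$ has full row rank.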
Let us denote
$$\begin{aligned}
P_{ij} &= \Pr\left[ y_i = j \mid \mathbf{X}_i, \boldsymbol{\beta}, \boldsymbol{\Sigma} \right] \\
&= \Pr\left[ (z_{ij} > z_{ic}, \forall c \neq j) \cap (z_{ij} > 0) \right] \\
&= \Pr\left[ \left( \mathbf{x}_{ij}'\boldsymbol{\beta} + \varepsilon_{ij} > \mathbf{x}_{ic}'\boldsymbol{\beta} + \varepsilon_{ic}, \forall c \neq j \right) \cap \left( \mathbf{x}_{ij}'\boldsymbol{\beta} + \varepsilon_{ij} > 0 \right) \right] \\
&= \Pr\left[ \left( \varepsilon_{ij} - \varepsilon_{ic} > (\mathbf{x}_{ic} - \mathbf{x}_{ij})'\boldsymbol{\beta}, \forall c \neq j \right) \cap \left( \varepsilon_{ij} > -\mathbf{x}_{ij}'\boldsymbol{\beta} \right) \right] \\
&= \int_{E_j} f_{\boldsymbol{\varepsilon}_i}(\mathbf{0}, \boldsymbol{\Sigma}) \, d\boldsymbol{\varepsilon}_i, \quad \forall j = \overline{1, p-1}
\end{aligned} \tag{2.12}$$

where

$$E_j = \left\{ \varepsilon_{ij} > -\mathbf{x}_{ij}'\boldsymbol{\beta} \right\} \bigcap_{c \neq j} \left\{ \varepsilon_{ij} - \varepsilon_{ic} > (\mathbf{x}_{ic} - \mathbf{x}_{ij})'\boldsymbol{\beta} \right\} \tag{2.13}$$

and

$$f_{\boldsymbol{\varepsilon}_i}(\mathbf{0}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{\frac{p-1}{2}} \sqrt{|\boldsymbol{\Sigma}|}} \, e^{-\frac{1}{2} \boldsymbol{\varepsilon}_i' \boldsymbol{\Sigma}^{-1} \boldsymbol{\varepsilon}_i} \tag{2.14}$$

Now the likelihood function for individual $i$ becomes
$$l_i(\boldsymbol{\beta}, \boldsymbol{\Sigma}) = \sum_{j=0}^{p-1} \Pr[y_i = j] \, I(y_i = j) = \prod_{j=0}^{p-1} \Pr[y_i = j]^{d_{ij}} \tag{2.15}$$

where $d_{ij} = 1$ if $y_i = j$, and $d_{ij} = 0$ otherwise. The likelihood function for the full sample is
$$L(\boldsymbol{\beta}, \boldsymbol{\Sigma}) = \prod_{i=1}^{n} \prod_{j=0}^{p-1} \Pr[y_i = j]^{d_{ij}} \tag{2.16}$$
From the definition of $E_j$ in (2.13) we conclude that the additive redundancy problem has vanished. However, the model in (2.10) still lacks identification. If both sides of every equation in (2.8) are multiplied by some positive constant $c$, the data generating process changes while the probabilities $P_{ij}$ in (2.12) remain unchanged. Thus $L(\boldsymbol{\beta}, \boldsymbol{\Sigma}) = L(c\boldsymbol{\beta}, c^2\boldsymbol{\Sigma})$, and we face multiplicative redundancy. A standard solution to multiplicative redundancy is to normalize one of the diagonal elements $\sigma_{jj}$ of $\boldsymbol{\Sigma}$ to 1 (the general rule is to normalize $\sigma_{11}$). One can find a more extensive explanation of both additive redundancy and multiplicative redundancy (commonly called the identification problem)
of MNP in Train (2009), Xiao et al. (2008) and McCulloch, Polson, et al. (2000).

The expression for the probabilities in (2.12) captures the practical obstacles to applying MNP: there is no closed-form expression for the high-dimensional integral when $p > 4$. This is where the Bayesian approach comes in very handy. While some academic works were dedicated to simulation-based estimation of the maximum likelihood in (2.16), Bayesian methods for inference about MNP were shown to be more efficient, as they completely avoid the burdensome computation of the maximum likelihood. One can find references to comparisons of both methods in Nobile (1998).
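The scale invariance $L(\boldsymbol{\beta}, \boldsymbol{\Sigma}) = L(c\boldsymbol{\beta}, c^2\boldsymbol{\Sigma})$ can be verified by simulation: multiplying $\boldsymbol{\beta}$ by $c > 0$ and the error draws by $c$ (equivalently, $\boldsymbol{\Sigma}$ by $c^2$) rescales every latent utility by $c$, which changes neither the arg-max nor the sign of the maximum in (2.10). A minimal sketch with arbitrary simulated values:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 4, 3                      # p alternatives -> p-1 latent utilities
n_draws = 10_000
c = 2.5                          # an arbitrary positive scaling constant

beta = rng.normal(size=k)
X_i = rng.normal(size=(p - 1, k))        # covariates of one individual
L = rng.normal(size=(p - 1, p - 1))
Sigma = L @ L.T                          # a positive-definite covariance

def outcome(z):
    """Observed choice from latent utilities, as in (2.10): the
    maximizing alternative if positive, otherwise the base outcome 0."""
    j = np.argmax(z)
    return j + 1 if z[j] > 0 else 0

chol = np.linalg.cholesky(Sigma)
eps = rng.normal(size=(n_draws, p - 1)) @ chol.T   # eps ~ N(0, Sigma)

# For every draw, (beta, Sigma) and (c*beta, c^2*Sigma) yield the same
# choice, because z -> c*z preserves the arg-max and the sign of the max.
y1 = [outcome(X_i @ beta + e) for e in eps]
y2 = [outcome(X_i @ (c * beta) + c * e) for e in eps]  # c*e ~ N(0, c^2*Sigma)
assert y1 == y2
```

Since the simulated choice frequencies coincide draw by draw, the implied probabilities $P_{ij}$, and hence the likelihood, are identical for the two parameterizations, which is exactly why a normalization such as $\sigma_{11} = 1$ is needed.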
2.2 Bayesian framework
As one may note, multiplicative redundancy imposes certain difficulties on the estimation of the MNP model in its general form by Bayesian inference. The reason is simple: it is difficult to define a prior distribution on the set of covariance matrices whose first diagonal element is 1.
2.2.1 Binary case

Let us first consider a special case of the MNP model in which the target variable takes only two values, i.e. $p = 2$. This special case is nothing but the binary probit model, which takes the form

$$\begin{aligned}
y_i &= \begin{cases} 1, & \text{if } z_i > 0 \\ 0, & \text{if } z_i < 0 \end{cases} \\
z_i &= \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i \\
\varepsilon_i &\sim N(0, 1)
\end{aligned} \tag{2.17}$$

Note that in the case of the binary probit we drop the indexing by $j$, since there is only one latent variable $z_i$, representing the utility for each individual $i = \overline{1, n}$. The binary probit model is estimable by frequentist techniques as well. That approach is not of main interest for this paper, but we can use it to compare the frequentist results with the Bayesian-estimated model, especially in the case of a non-informative prior distribution for $\boldsymbol{\beta}$.

To estimate the parameters of the binary probit model we follow Albert and Chib (1993). Let us briefly depict the sampling scheme for the estimation. Note that we assume a non-informative prior for $\boldsymbol{\beta}$:
1. given $\mathbf{X}$, $\mathbf{y}$ and $\boldsymbol{\beta}$, sample $z_i$, $\forall i = \overline{1, n}$, according to

$$z_i \mid \mathbf{X}, \mathbf{y}, \boldsymbol{\beta} \sim N^{+0}(\mathbf{x}_i'\boldsymbol{\beta}, 1), \quad \text{if } y_i = 1$$
$$z_i \mid \mathbf{X}, \mathbf{y}, \boldsymbol{\beta} \sim N^{-0}(\mathbf{x}_i'\boldsymbol{\beta}, 1), \quad \text{if } y_i = 0$$

where $N^{+0}$ denotes the normal distribution truncated by 0 from the left, and $N^{-0}$ the normal distribution truncated by 0 from the right;

2. given $\mathbf{X}$ and $\mathbf{z}$ from step 1, sample $\boldsymbol{\beta}$ according to (see (A.2))
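The two-step scheme above can be sketched in a few lines of code. This is an illustrative implementation, not the thesis's own R code from Appendix D; the $\boldsymbol{\beta}$ step assumes the standard flat-prior conditional of Albert and Chib (1993), $\boldsymbol{\beta} \mid \mathbf{z}, \mathbf{X} \sim N\left( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{z}, (\mathbf{X}'\mathbf{X})^{-1} \right)$, which is the result (A.2) derives:

```python
import numpy as np
from scipy.stats import truncnorm

def bp_gibbs(X, y, n_iter=2000, seed=0):
    """Gibbs sampler for the binary probit model under a flat prior,
    following the two-step data-augmentation scheme of Albert and
    Chib (1993).  Returns the matrix of sampled beta draws."""
    n, k = X.shape
    rng = np.random.default_rng(seed)
    XtX_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(XtX_inv)
    beta = np.zeros(k)
    draws = np.empty((n_iter, k))
    for t in range(n_iter):
        mu = X @ beta
        # Step 1: z_i ~ N(mu_i, 1) truncated at 0 from the left if
        # y_i = 1, from the right if y_i = 0 (truncnorm's bounds are
        # standardized around loc, hence -mu).
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = truncnorm.rvs(lo, hi, loc=mu, scale=1.0, random_state=rng)
        # Step 2: beta | z, X ~ N((X'X)^{-1} X'z, (X'X)^{-1})
        beta = XtX_inv @ X.T @ z + chol @ rng.normal(size=k)
        draws[t] = beta
    return draws
```

On simulated probit data the posterior mean of the retained draws (after discarding a burn-in) should land close to the data-generating coefficients, which gives a quick correctness check before moving to real data.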