Bayesian Approach to Binary Data Analysis in Economics: Modelling and Parameter Estimation

BAYESIAN APPROACH TO BINARY DATA ANALYSIS IN ECONOMICS: MODELLING AND PARAMETER ESTIMATION

Jantre Sanket Rajendra(12309), Arushi Jain(12162), Kartik Agarwal(12344), Srajal Nayak(12728) Indian Institute of Technolgy, Kanpur 20/04/2016

ABSTRACT In the areas of statistics and econometrics the analysis of binary response data is widely used. A huge array of literature is dedicated to the same purpose. In classical statistics, the maximum likelihood method is used to model this data and inferences about the model are based on the associated asymptotic theory. However, the inferences based on the classical approach are not accurate if the sample size is small. Albert and Chib proposed a Bayesian method to model categorical response data which incorporated Gibbs sampling and the data augmentation algorithm together. Values of latent data is easily sampled from truncated normal distribution. In this article, Albert and Chib’s approach is used to estimate the parameters in the binary logit and probit models. A dataset on contraception use in Indonesia is used to model the binary outcome using former models with the help of number of explanatory variables. Gibbs Sampler has been used to sample the values of coefficients. Bayesian inference on this model is done by calculating posterior means of the coefficients on the given covariates, standard deviation, credible interval and average covariate effects. KEY WORDS: Binary probit; Data Augmentation; Logit model; Maximum Likelihood Estimates; Latent data; Gibbs Sampling; Credible interval; Posterior mean.

INTRODUCTION If the response variable in a dataset is categorical, Generalized Linear Models (GLMs) are used instead of linear regression models to estimate the model parameters and to model the given data. GLMs can be denoted in the form 퐸(푦) = 푔(푥′훽), where 푦 is dependent variable, 푥 is the vector of explanatory variables, 훽 is the vector of parameters and 푔 is the link function. GLMs depend on a probability model that describes an event’s probability as the distribution function of the independent variables. The model is considered binary if the response variable takes two values (푦 = 0,1). On the other hand, model is multinomial if the dependent variable takes more than two values.

In the Bayesian inference of linear regression models conjugate prior distributions exist for the model parameters. So, it makes inferences for these models easier. However, making inferences in Bayesian GLMs is complicated because there are no conjugate prior distributions for the model parameters. So, it is hard to make a simulation in Bayesian GLMs. Albert and Chib [1] proposed an approach for binary probit regression models to tackle this problem. In this approach the conditional distributions of the model parameters in GLMs become the same as those in linear regression models by adding a latent variable to the model [4].

In this article a simulation based approach proposed by Albert and Chib [1] is introduced for computing the exact posterior distribution of 훽. In this method, the data set in the logit and probit models can be augmented by adding a set of latent variables (푧) into the model. Latent variables have a continuous distribution such that 푦푖 = 1 when 푧푖 > 1 and 푦푖 = 0 otherwise. Thus, the conditional distribution of 훽 given 푧 is a normal distribution whose mean can be easily computed, and the conditional distribution of 푧 given 훽 is a truncated normal distribution which is easy to compute. Since the conditional distributions can be calculated using this approach, Gibbs sampling can easily be used to calculate the exact posterior distribution of 훽.

GIBBS SAMPLING AND DATA AUGMENTATION ALGORITHM The Gibbs sampling and the data augmentation algorithm are both used to compute the posterior distribution of 훽 in the case of binary logit and probit models [1].

Gibbs sampling is used to generate sample from a multivariate distribution using full conditional distributions of parameters. A full conditional distribution is the conditional distribution of a parameter given all of the other parameters in the model. After applying the Gibbs sampler to the model the dataset converges to the joint posterior distribution of the parameters [2].

(0) (0) For the parameters 훽푖 (푖 = 1,2, … , 푝) the initial values are taken as 훽1 , … , 훽푝 . After the initial or seed values of the parameters are determined, Gibbs sampling is implemented by sampling from the full conditional distributions successively. (1) (0) 훽1 푓푟표푚 휋(훽1 /{훽푗 , 푗 ≠ 1}) (1) (1) (0) 훽2 푓푟표푚 휋(훽2 /훽1 , {훽푗 , 푗 > 2}) …………………………………………………… (1) (1) 훽푝 푓푟표푚 휋(훽푝 /{훽푗 , 푗 < 푝}) (푡) (푡) The cycle (1) is iterated 푡 times and so the sample 훽(푡) = (훽1 , … , 훽푝 ) is generated. As 푡 approaches ∞, the joint distribution of 훽(푡) approaches the joint distribution of 훽.

We generate the sample values and till certain iterations 푡∗ we discard the generated sample often termed as burn-in sample. After 푡∗, the draws for iteration 푡 > 푡∗ are described to be generated from posterior distribution of 훽. The Gibbs cycle is used to produce the sample of size 푚 which then is used compute estimations for the posterior moments and density.

A difficulty with Gibbs sampling is that the sequences produced by this method are not independently identically distributed. Besides, convergence of the sequence to the posterior distribution is an important point in Gibbs sampling. Theoretically, as the number of draws (푛) in a Gibbs sequence approaches to infinity, the sequence converges to the posterior distribution. However, a sufficient number of draws for convergence of the sequence must be determined in practice through certain diagnostics.

The second method besides Gibbs sampling which is used to compute the posterior distributions is the data augmentation algorithm. In this algorithm, the observed data 푦 is augmented by latent data 푧. Therefore, it is easy to analyse the observed data 푦. To implement this algorithm, samples are drawn from the conditional distributions 푝(훽 /푦, 푧) and 푝(푧 /훽, 푦).

푝(훽 /푦) = 푝(훽 /푧, 푦) 푝(훽 /푧) dz ….(1) ∫푧

To compute the posterior density in above equation, the predictive density 푝(푧 /푦) of 푧 must be computed using below equation.

푝(푧 /푦) = 푝(푧 /훽, 푦) 푝(훽 /푦) d훽 ….(2) ∫퐵

The data augmentation algorithm to compute the posterior density of 훽 can be implemented by applying the following steps. In the 푖푡ℎ step of the algorithm, the current approximation to 푝(훽 /푦) is taken to be 푔푖(훽), (a) Draw a sample 푧1, … , 푧푚 from the current approximation to the predictive density 푝(푧 /푦). (b) Update the current approximation to 푝(훽 /푦) to give 푔푖+1(훽) as a combination of conditional densities of 훽 given the augmented data which is generated in step (e) Specifically, 푔푖+1(훽) is computed as given below.

푚 −1 푗 푔푖+1(훽) = 푚 ∑ 푝(훽 /푧 , 푦) 푗=1

In steps (a1) and (a2) below, it is shown how the latent data 푧 is generated from 푝(푧 /푦) in step (a). Equation (2) allows us to generate 푧 from(푧 /푦) in these steps when the current approximation to 푝(훽 /푦) is 푔푖(훽). (a1) Generate 훽 from 푔푖(훽). (a2) Generate 푧 from(푧 /푦) (훽 is the value generated in the above step) If the number of draws (푚) is sufficiently large, a good approximation to 푝(훽 /푦) can be achieved by implementing the steps (a1), (a2), (a) and (b) [7, 8].

BAYESIAN ANALYSIS OF BINARY LOGIT AND PROBIT MODELS The common representation for the binary model is given below,

푦푖 ~ 퐵푒푟푛표푢푙푙푖(푔(휂푖))

휂푖 ~ 푥푖′훽 훽 ~ 휋(훽)

In the above representation, 푦푖 is a binary response variable that takes only two values 0 and 1 for ′ 푛 observations; 푥푖 = (푥푖1, … , 푥푖푝) are 푝 covariate measurements; 훽 is ( 푝 x 1) vector of regression coefficients and 휋(. ) is the prior distribution of 훽. Here 푔(. ) is a link function and 휂푖 a linear predictor. If the link function is the standard Gaussian cumulative distribution function (cdf), a probit model is obtained. If the link function is the logistic cdf, the logit model is obtained.

The Ordinary Least Squares (OLSs) method can be used to analyse the binary regression model, but using this method with a binary response variable causes some problems. One of the problems is that the errors are heteroscedastic and these heteroscedastic errors are a function of the parameter vector 훽. Another problem is that the predicted values can take values outside the interval (0, 1).

Albert and Chib [1] proposed a Bayesian method to analyse regression models where the response variable is categorical. This approach depends on the idea that the posterior distribution of 훽 given the latent variable 푧 in categorical regression models, and the posterior distribution of 훽 in normal linear regression models (푧 = 푥′훽 + 휀) are the same.

Let us analyse the binary model representation given at the start of this section with the help of Bayesian method for probit framework. After the latent variables are added to the model, it can be represented as given below. 1 푖푓 푧 > 0 푦 = { 푖 푖 0 표푡ℎ푒푟푤푖푠푒 ′ 푧푖 = 푥푖 훽 + 휀푖, 휀푖 = 푁(0,1), 훽 ~ 휋(훽)

It is easy to find the full conditional distributions of 훽 and 푧 by using above representation. Since these full conditional distributions can be calculated, Gibbs sampling can easily be performed to analyse the binary probit model.

If the prior distribution of 훽 is normal, that is 휋(훽) = 푁(푏, 푣), the full conditional distribution of 훽 is also normal. 훽 /푧 ~ 푁(퐵, 푉), 퐵 = 푉(푣−1푏 + 푥′푧), 푉 = (푣−1 + 푥′푥)−1

The full conditional distribution of each latent variable 푧푖 is a truncated normal distribution which is easy to simulate. ′ 푁(푥푖 훽, 1)퐼(푧푖 > 0) 푖푓 푦푖 = 1 푧푖 /훽, 푥푖, 푦푖 = { ′ 푁(푥푖 훽, 1)퐼(푧푖 ≤ 0) 표푡ℎ푒푟푤푖푠푒

The method proposed by Albert and Chib [1] to analyse binary probit models can be developed by generalizing the probit link function as a family of 푡 distributions. Thus, the degree of freedom of the 푡 distribution which is best suited to the data, can be determined. The most popular link functions for binary data are logit and probit. If the degree of freedom is taken as seven, then the link function is logit. Moreover if the degree of freedom is taken to be a very large number (in practice it is taken to be 100), then the link function is probit.

If 휀푖 has a standard logistic distribution, the binary logit model is obtained. When the variables 휆푖, 푖 = 1,2, … , 푛 are introduced into the model, then the binary logit model can be represented as below. 1 푖푓 푧 > 0 푦 = { 푖 푖 0 표푡ℎ푒푟푤푖푠푒 ′ 2 푧푖 = 푥푖 훽 + 휀푖, 휀푖 = 푁(0, 휆푖), 휆푖 = 4휓푖 , 휓푖~ 퐾푆, 훽 ~ 휋(훽)

where, the 휓푖, 푖 = 1,2, … , 푛 are independent and have the Kolmogorov-Smirnov(KS) distribution

If the prior distribution of 훽 is normal, that is 휋(훽) = 푁(푏, 푣), the full conditional distribution of 훽 given 푧, 푦 푎푛푑 휆 is also normal as represented below.

훽 /푦, 푧, 휆~ 푁(퐵, 푉), −1 ′ −1 ′ −1 −1 −1 퐵 = 푉(푣 푏 + 푥 푊푧), 푉 = (푣 + 푥 푊푥) , 푊 = 푑푖푎푔(휆1 , … , 휆푛 )

The full conditional distribution of each latent variable 푧푖 is truncated normal with variance 휆푖.

′ 푁(푥푖 훽, 휆푖)퐼(푧푖 > 0) 푖푓 푦푖 = 1 푧푖 /훽, 푥푖, 푦푖, 휆푖 = { ′ 푁(푥푖 훽, 휆푖)퐼(푧푖 ≤ 0) 표푡ℎ푒푟푤푖푠푒

The full conditional distribution of 휆푖 does not have a standard form, but sampling from this distribution is easy using rejection sampling. In Bayesian analysis of the binary logit model, simulations from the full conditional distributions 푝(훽 /휆, 푧), 푝(푧 /훽, 푦) and 푝(휆) are implemented using Gibbs sampling.

The speed of the simulation in the logit model is slower than that in the probit model because in the logit model it is necessary to simulate from 휆 in addition to 푧 and 훽, and the posterior variance- covariance matrix V in Logit model changes for each update of 휆 [4].

DATA FOR EXPERIMENT The dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey created by Tjen-Sien Lim and maintained by UCI Machine Learning Repository [5]. The samples are married women who were either not pregnant or do not know if they were at the time of interview. The problem is to predict the current contraceptive method choice (no use or use of some methods) of a woman based on her demographic and socio-economic characteristics.

The task chosen here is related to Development Economics field. Dataset is multivariate which has total of 1473 observations of 10 attributes. There are no missing values for any attribute in the given data. The information of these attributes is given in the below table.

Attribute Attribute Information Wife’s Age Numerical Wife’s Education Categorical (1=low,2,3,4=High) Husband’s Education Categorical (1=low,2,3,4=High) No. of Children Ever Born Numerical Wife’s Religion Binary (0=Non-Islam, 1=Islam) Wife Is Working Now? Binary (0=No, 1=Yes) Husband’s Occupation Categorical (1,2,3,4) Standard of Living Index Categorical (1=low,2,3,4=High) Media Exposure Binary (0=Not Good, 1=Good) Contraceptive Method Used Binary (0=Not Used, 1=Some Method Used) Table1: Attributes and their information Based on the above 10 attributes we are creating a total of 18 variables which are explanatory in kind and 1 response variable. We are taking numerical and binary variables as they are and we are creating 3 dummy variables for each categorical variable viz., wife’s education with 1(=low) as base category, husband’s education with 1(=low) as base category, Husband’s occupation with 1 as base category and standard of living index with 1(=low) as base category. In this dataset, the husband’s occupation attribute as four values which indicate different kinds of occupation and not quality of occupation, but data repository at UCI doesn’t mention which are those occupations they are taking into account.

The aim of this study is to find the impact of these various attributes on the wife’s choice of use of any methods of contraception. Some descriptive statistics for numerical variables are given in the table below.

Attribute Median Mean SD Min Max Wife’s Age 32 32.5384 8.2272 16 49 No. of Children Ever Born 3 3.2614 2.3585 0 16 Table1: Descriptive statistics for numerical variables Also, after checking the proportions for categorical and binary variables we found that data has more number of lower and middle class families than higher class. 57% of women used some contraceptive method. 85% of the women in the data were Islamic which shows that sample is biased towards Islam. 25% women were working during the survey. 92.6% women had good media exposure. Fertility rate of the data is close to 3.3 which shows that there is above replacement level fertility is present.

RESULTS The parameter estimates for the binary probit model are presented in the table below. It also includes credible intervals which are 2.5th and 97.5th quatiles of 10000 observations. Also average covarite effect is also calculated for each covariate.

Significantly Avg Covariate Variables Post_Mean SD Credible Interval Different from 0 Effect Constant 0.06528 0.33434 -0.5966 0.7061 no 0.0223 Wife age -0.0511 0.00575 -0.0622 -0.0397 yes -0.0174 Children Born 0.20657 0.019952 0.1673 0.2453 yes 0.0705 Wife Religion -0.2747 0.10455 -0.479 -0.0713 yes -0.0937 wife_now_working -0.080693 0.08205 -0.243 0.0783 no -0.0275 Media Exposure 0.32827 0.14907 0.0357 0.6252 yes 0.112 w_educ_2 0.17691 0.13774 -0.0941 0.4433 no 0.0604 w_educ_3 0.45289 0.146 0.1678 0.7405 yes 0.1545 w_educ_4 0.91175 0.15933 0.6025 1.2261 yes 0.3111 h_educ_2 0.18331 0.22253 -0.2553 0.6218 no 0.0625 h_educ_3 0.26227 0.217 -0.1634 0.698 no 0.0895 h_educ_4 0.16923 0.21961 -0.2655 0.6027 no 0.0577 std_liv_2 0.23492 0.14346 -0.0515 0.5172 no 0.0801 std_liv_3 0.3318 0.13578 0.0649 0.5978 yes 0.1132 std_liv_4 0.49373 0.13724 0.2224 0.7629 yes 0.1685 h_occ_2 -0.11612 0.098625 -0.3044 0.0786 no -0.0396 h_occ_3 0.041433 0.09802 -0.1485 0.2362 no 0.0141 h_occ_4 0.27217 0.25999 -0.2339 0.7854 no 0.0929 Table3: Parameter Estimates for binary probit model The parameter estimates for the binary logit model with credible interval and average covariate effect are presented in the table below.

Significantly Avg Covariate Variables Post_Mean SD Credible Interval Different from 0 Effect Constant 0.37066 0.17331 0.0309 0.7121 yes 0.0776 Wife age -0.080937 0.0068817 -0.0942 -0.0676 yes -0.0169 Children Born 0.34042 0.028284 0.2863 0.3962 yes 0.0713 Wife Religion -0.34229 0.11709 -0.5769 -0.1161 yes -0.0717 wife_now_working -0.096079 0.10362 -0.3005 0.1056 no -0.0201 Media Exposure 0.62978 0.15011 0.3388 0.9302 yes 0.1319 w_educ_2 0.0021096 0.12988 -0.2533 0.2601 no 0.0004 w_educ_3 0.40556 0.12676 0.1626 0.6567 yes 0.0849 w_educ_4 1.1138 0.13476 0.8545 1.3825 yes 0.2332 h_educ_2 0.092846 0.1416 -0.1789 0.3761 no 0.0194 h_educ_3 0.31297 0.1319 0.0589 0.5775 yes 0.0655 h_educ_4 0.29291 0.1284 0.0419 0.547 yes 0.0613 std_liv_2 0.1673 0.13495 -0.0994 0.4361 no 0.035 std_liv_3 0.33402 0.12648 0.086 0.58 yes 0.0699 std_liv_4 0.61268 0.12654 0.3663 0.862 yes 0.1283 h_occ_2 -0.15962 0.11232 -0.3738 0.0612 no -0.0334 h_occ_3 0.066898 0.10941 -0.1461 0.2823 no 0.014 h_occ_4 0.15638 0.18553 -0.2102 0.5223 no 0.0327 Table4: Parameter Estimates for binary logit model

The estimates in Table 3 and 4 are consistent with the predictions of economic theory. With the help of credible interval we can observe that in both models increase in wife’s age causes less contraception use. As number of children born increases the use of contraception increases. If wife is Islamic then it has strong negative impact on contraception use. It may be attributed to the opposition to the use of contraceptives in Islam, as contraceptive is seen as against the nature. Current employment status of woman is not significantly affecting the contraception use in both the models. Good media exposure helps to increase the use of contraception. Wife’s education matters only if she is highly educated which has strong impact on choice to use contraceptives. Also in the case of husband the use of contraceptive is increasing with high education only in case of logit model. High standard of living helps to increase use of contraceptives. As families may want to maintain their high standard of living having less number of children. Husband’s occupation is not significantly affecting contraception use.

Having estimated the parameters of a model, one is typically interested in the practical implications of those estimates. However, interpretation of the parameter estimates is complicated by the nonlinearity of the models we have considered. In binary data models, for ′ example, 퐸(푦푖|푥푖, 훽) = Pr(푦푖 = 1|푥푖, 훽) = 퐹(푥푖 훽), Therefore, the marginal effect of changing some continuous covariate in 푥푖, say 푥ℎ, is not simply given by 훽ℎ. This can be easily seen by taking the derivative of Pr(푦푖 = 1|푥푖, 훽) with respect to 푥ℎ.

′ 휕 Pr(푦푖 = 1|푥푖, 훽) 휕퐹(푥𝑖 훽) ′ = = 푓(푥푖 훽)훽ℎ, 휕푥ℎ 휕푥ℎ

Hence the marginal effect of 푥ℎ depends on 훽ℎ, but also on all of the covariates in 푥푖, all of the parameters in 훽, and will differ with the choice of cdf 퐹(. ) and respective pdf 푓(. ). Table 5 gives the choice probability and marginal effects for the above binary data models [3].

Table4: Marginal Effects in binary data models

We have used the following formula to calculate average covariate effects presented in the Tables 3 and 4 above. 푛 ′ ̂ ̂ −1 ′ ̂ ̂ 푓(푥푖 훽)훽ℎ = 푛 ∑ 푓(푥푖 훽)훽ℎ 푖=1 By use of the above expression the observed results are given in last columns of both the table 3 and 4. We can see that average covariate effects for wife’s age and children born are almost equal. Other statistically significant variables also showing similarity in their average covariate effects in both the models. If we compare these effects with their respective posterior means we find significant difference and also these average covariate effects are less than posterior means.

CONCLUSIONS The aim of this paper is to use the Bayesian method proposed by Albert and Chib [1] to analyse binary logit and probit models. The most important point of this Bayesian approach is that the binary probit model is obtained from the normal linear regression model by introducing latent variables into the model. This approach has many advantages. Firstly, in small samples this method provides accurate inferences about the model, whereas the classical inferences in small samples are not accurate. Secondly, in this Bayesian method Gibbs sampling is used to calculate the posterior distributions of the model parameters by simulating from standard distributions. Thus, this method can be implemented easily in many computer softwares and programming languages.

The illustration presented in this paper shows us the decision to use contraceptive method by women in Indonesia is affected by various factors in positive and negative way. Age of women has negative impact whereas number of children born, good media exposure, high standard of living index, high educational attainment by women and men (in logit case) have significant positive effect on contraceptive use. At last being Islamic is producing negative impact on choice of contraception use.

REFERENCES 1. Albert, J.H. and Chib, S. Bayesian Analysis of Binary and Polychotomous Response Data, Journal of the American Statistical Association 422, 669–679, 1993. 2. Gelfand, A.E. and Smith, A.F.M. Sampling-Based Approaches to Calculating Marginal Densities, Journal of the American Statistical Association 85, 398–409, 1990. 3. I Jeliazkov, MA Rahman, Binary and Ordinal Data Analysis in Economics: Modeling and Estimation, Mathematical Modeling with Multidisciplinary Applications, edited by X.S. Yang, 123-150, 2012. 4. John Wiley & Sons Inc, Hoboken, New Jersey.Holmes, C.C. and Held, L. Bayesian Auxiliary Variable Models for Binary and Multinomial Regression, Bayesian Analysis 1, 145–168, 2006. 5. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

APPENDIX – CODES

% Probit Model clc tic; [data,text]= xlsread('C:\Users\Sanket\Downloads\ECO543A-Bayesian Data Analysis\Project\DATA.xlsx'); contraceptive_used = data(:,5); N = numel(contraceptive_used); const = ones(N,1); wife_age = data(:,1); children_born = data(:,2); wife_religion = data(:,3); wife_now_working = data(:,6); media_exposure = data(:,4); w_educ_2 = data(:,7); w_educ_3 = data(:,8); w_educ_4 = data(:,9); h_educ_2 = data(:,10); h_educ_3 = data(:,11); h_educ_4 = data(:,12); std_liv_2 = data(:,16); std_liv_3 = data(:,17); std_liv_4 = data(:,18); h_occ_2 = data(:,13); h_occ_3 = data(:,14); h_occ_4= data(:,15); n= 12500; burn = 2500; beta = zeros(n,18); z = zeros(n,numel(contraceptive_used)); X = [const wife_age children_born wife_religion wife_now_working media_exposure w_educ_2 w_educ_3 w_educ_4 h_educ_2 h_educ_3 h_educ_4 std_liv_2 std_liv_3 std_liv_4 h_occ_2 h_occ_3 h_occ_4]; mu0=[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]'; V0=eye(18); z(1,:)=contraceptive_used'; I = eye(18); InvV0 = inv(V0); InvV0mu0 = inv(V0)*mu0; V1=(InvV0+X'*X)\I; sigma=1; h=waitbar(0,'Probit Simulation In Progress'); % starting the chain for i=2:n; mu1=V1*(InvV0mu0+X'*z(1,:)'); beta(i-1,:) = mu1'+(chol(V1,'lower')*mvnrnd(mu0,V0)')'; for j=1:N; mu=X(j,:)*beta(i-1,:)'; u=unifrnd(0,1); if contraceptive_used(j,1)==0 k=u*normcdf(-mu/sigma); z(1,j)=norminv(k,0,1)*sigma+mu; else k=normcdf(-mu/sigma)+u*(1-normcdf(-mu/sigma)); z(1,j)=norminv(k,0,1)*sigma+mu; end end; waitbar(i/n); end; close(h) CI = zeros(18,2); Post_Mean = mean(beta(burn+1:n,:))'; Std_Dev = std(beta(burn+1:n,:))'; for i=1:18 CI(i,:) = quantile(beta(burn+1:n,i),[0.025,0.975]); end Rows = {'Constant','Wife age','Children Born','Wife Religion','wife_now_working ','Media Exposure ','w_educ_2 ','w_educ_3 ','w_educ_4 ','h_educ_2 ','h_educ_3 ','h_educ_4 ','std_liv_2 ','std_liv_3 ','std_liv_4',' h_occ_2','h_occ_3','h_occ_4'}; T_Probit = table(Post_Mean,Std_Dev,'RowNames',Rows) CI toc

% Average Covarite Effect - Probit Model clc tic; [data,text]= xlsread('C:\Users\Sanket\Downloads\ECO543A-Bayesian Data Analysis\Project\DATA.xlsx'); contraceptive_used = data(:,5); N = numel(contraceptive_used); const = ones(N,1); wife_age = data(:,1); children_born = data(:,2); wife_religion = data(:,3); wife_now_working = data(:,6); media_exposure = data(:,4); w_educ_2 = data(:,7); w_educ_3 = data(:,8); w_educ_4 = data(:,9); h_educ_2 = data(:,10); h_educ_3 = data(:,11); h_educ_4 = data(:,12); std_liv_2 = data(:,16); std_liv_3 = data(:,17); std_liv_4 = data(:,18); h_occ_2 = data(:,13); h_occ_3 = data(:,14); h_occ_4= data(:,15);

X = [const wife_age children_born wife_religion wife_now_working media_exposure w_educ_2 w_educ_3 w_educ_4 h_educ_2 h_educ_3 h_educ_4 std_liv_2 std_liv_3 std_liv_4 h_occ_2 h_occ_3 h_occ_4]; beta1=[0.06528,-0.0511,0.20657,-0.2747,-0.080693,0.32827,... 0.17691,0.45289,0.91175,0.18331,0.26227,0.16923,0.23492,... 0.3318,0.49373,-0.11612,0.041433,0.27217]; sum=zeros(18,1); Cov_eff=zeros(18,1); mu = 0; sigma = 1; pd = makedist('Normal',mu,sigma); for j=1:N; x = X(j,:)*beta1'; for k=1:18; sum(k)=sum(k)+pdf(pd,x)*beta1(1,k); end end for k=1:18; Cov_eff(k)=sum(k)/N; end format short Cov_eff

% Logit Model clc tic; [data,text]= xlsread('C:\Users\Sanket\Downloads\ECO543A-Bayesian Data Analysis\Project\DATA.xlsx'); contraceptive_used = data(:,5); N = numel(contraceptive_used); const = ones(N,1); wife_age = data(:,1); children_born = data(:,2); wife_religion = data(:,3); wife_now_working = data(:,6); media_exposure = data(:,4); w_educ_2 = data(:,7); w_educ_3 = data(:,8); w_educ_4 = data(:,9); h_educ_2 = data(:,10); h_educ_3 = data(:,11); h_educ_4 = data(:,12); std_liv_2 = data(:,16); std_liv_3 = data(:,17); std_liv_4 = data(:,18); h_occ_2 = data(:,13); h_occ_3 = data(:,14); h_occ_4= data(:,15); n= 12500; burn = 2500; beta = zeros(n,18); z = zeros(1,N); X = [const wife_age children_born wife_religion wife_now_working media_exposure w_educ_2 w_educ_3 w_educ_4 h_educ_2 h_educ_3 h_educ_4 std_liv_2 std_liv_3 std_liv_4 h_occ_2 h_occ_3 h_occ_4]; mu0=[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]'; V0=0.1*eye(18); z(1,:)=contraceptive_used'; lambda = zeros(N,1); W = zeros(N); I = eye(18); msg = 'Error occurred: NAN'; InvV0 = inv(V0); InvV0mu0 = inv(V0)*mu0; h=waitbar(0,'Logit Simulation In Progress'); % starting the chain for i=2:n; for j=1:N; k=KSrnd(); l=4*k^2; lambda(j)=l; W(j,j)=1/l; end V1=(InvV0+X'*W*X)\I; mu1=V1*(InvV0mu0+X'*W*z(1,:)'); beta(i-1,:)=mu1'+(chol(V1,'lower')*mvnrnd(mu0,V0)')'; for j=1:N; mu=X(j,:)*beta(i-1,:)'; sigma=sqrt(lambda(j)); u=unifrnd(0,1); if contraceptive_used(j,1)==0 k=u*normcdf(-mu/sigma); z(1,j)=norminv(k,0,1)*sigma+mu; else k=normcdf(-mu/sigma)+u*(1-normcdf(-mu/sigma)); z(1,j)=norminv(k,0,1)*sigma+mu; end end; waitbar(i/n); end; close(h) CI = zeros(18,2); Post_Mean = mean(beta(burn+1:n,:))'; Std_Dev = std(beta(burn+1:n,:))'; for i=1:18 CI(i,:) = quantile(beta(burn+1:n,i),[0.025,0.975]); end Rows = {'Constant','Wife age','Children Born','Wife Religion','wife_now_working ','Media Exposure ','w_educ_2 ','w_educ_3 ','w_educ_4 ','h_educ_2 ','h_educ_3 ','h_educ_4 ','std_liv_2 ','std_liv_3 ','std_liv_4',' h_occ_2','h_occ_3','h_occ_4'}; T_Logit = table(Post_Mean,Std_Dev,'RowNames',Rows) CI toc

% Average Covarite Effect - Logit Model clc tic; [data,text]= xlsread('C:\Users\Sanket\Downloads\ECO543A-Bayesian Data Analysis\Project\DATA.xlsx'); contraceptive_used = data(:,5); N = numel(contraceptive_used); const = ones(N,1); wife_age = data(:,1); children_born = data(:,2); wife_religion = data(:,3); wife_now_working = data(:,6); media_exposure = data(:,4); w_educ_2 = data(:,7); w_educ_3 = data(:,8); w_educ_4 = data(:,9); h_educ_2 = data(:,10); h_educ_3 = data(:,11); h_educ_4 = data(:,12); std_liv_2 = data(:,16); std_liv_3 = data(:,17); std_liv_4 = data(:,18); h_occ_2 = data(:,13); h_occ_3 = data(:,14); h_occ_4= data(:,15);

X = [const wife_age children_born wife_religion wife_now_working media_exposure w_educ_2 w_educ_3 w_educ_4 h_educ_2 h_educ_3 h_educ_4 std_liv_2 std_liv_3 std_liv_4 h_occ_2 h_occ_3 h_occ_4]; beta1=[0.37066,-0.080937,0.34042,-0.34229,-0.096079,... 0.62978,0.0021096,0.40556,1.1138,0.092846,0.31297,... 0.29291,0.1673,0.33402,0.61268,-0.15962,0.066898,0.15638]; sum=zeros(18,1); Cov_eff=zeros(18,1); mu = 0; sigma = 1; pd = makedist('Logistic',mu,sigma); for j=1:N; x = X(j,:)*beta1'; for k=1:18; sum(k)=sum(k)+pdf(pd,x)*beta1(1,k); end display(j); end for k=1:18; Cov_eff(k)=sum(k)/N; end format short Cov_eff

% Kolmogorov Smirnov Random Number Generator function SMIR_f = KSrnd() % THIS SUBPROGRAM PRODUCES VARIATES FROM THE KOLMOGOROV-SMIRNOV % LIMIT DISTRIBUTTON FUNCTION. % % SOURCE: LUC DEVROYE "THE SERIES METHOD FOR RANDOM VARIATE % GENERATION AND ITS APPLICATION TO THE KOLMOGOROV-SMIRNOV % DISTRIBUTION", AMERICAN JOURNAL OF MATHEMATICAL % AND MANAGEMENT SCIENCES. % % AT EACH CALL, THE ARGUMENT CAN BE GIVEN THE VALUE 0 AS IN % THE STATEMENT X= SMIR(0). % THE PROGRAM USES THE SUBPROGRANS UNI AND REXP FOR THE GENERATION % OF UNIFORM (0,1) AND EXPONENTIAL RANDON VARIATES. % % THE CONSTANTS IN THE PROGRAM DEPEND UPON C. ALL THE VALUES IN % THE PROGRAM ARE FOR C=0.75. FOR OTHER VALUES OF C IN THE % RANGE SQRT(l/3)

% C=0.75; P=0.3728330; P1=2.682166; P2=1.594471; CSQRE=0.5625; PTC=2.193245; PINV=0.4559454; ALPHA=0.1368725; % BETA=0.2279727; B=1.295291;

DEC=rand(); if DEC > P V0=(DEC-P)*P2; SMIR_f=Algo_1(V0); else V0=DEC*P1; SMIR_f=Algo_2(V0); end function SMIR_1 = Algo_1(V1) SMIR_1 = function_21(V1); function SMIR = function_21(V) SMIR1=CSQRE+0.5*exprnd(1); SMIRSQ=SMIR1+SMIR1; if V < ALPHA SMIR=function_4(SMIR1,SMIRSQ,V); else SMIR=function_3(SMIR1); end end function SMIR = function_2() V2 = rand(); SMIR = function_21(V2); end function SMIR = function_3(SMIR1) SMIR = sqrt(SMIR1); end function SMIR = function_4(SMIR2,SMIRSQ,V1) if SMIRSQ > 174 SMIR = function_3(SMIR2); else T=exp(-SMIRSQ); K=1; NUM=0; SUM=0; SMIR = function_5(NUM,K,SUM,T,SMIRSQ,SMIR2,V1); end end function SMIR = function_5(NUM1,K1,SUM1,T1,SMIRSQ1,SMIR3,V2) NUM1=NUM1+K1+K1+1; K1=K1+1; if NUM1*SMIRSQ1 > 174 SMIR = function_3(SMIR3); else SUM1=SUM1+(NUM1+1)*T1^NUM1; if V2 >= SUM1 SMIR = function_3(SMIR3); else NUM1=NUM1+K1+K1+1; K1=K1+1; if NUM1*SMIRSQ1 > 174 SMIR = function_3(SMIR3); else SUM1=SUM1+(NUM1+1)*T1^NUM1; if V2 < SUM1 SMIR = function_2(); else SMIR = function_5(NUM1,K1,SUM1,T1,SMIRSQ1,SMIR3,V2); end end end end end end function SMIR_2 =Algo_2(V1) % GENERATE VARIATE DISTRIBUTED AS INVERSE OF SQUARE ROOT OF TAIL OF % CHI2 DENISTY WITH 3 DEGREES OF FREEDOM SMIR_2 = function_7(V1); function SMIR = function_7(V) E=exprnd(1)*B; E1=E+E; Y=PTC+E; if E*E < E1*PTC*(Y+PTC) SMIR = function_8(Y,V); else if PINV*Y-1-ln(PINV*Y) > E1 SMIR = function_6(); else SMIR = function_8(Y,V); end end end function SMIR = function_6() V = rand(); SMIR = function_7(V); end % CONSECUTIVE ACCEPTANCE/REJECTION STEPS function SMIR = function_8(Y1,V) SMIR_IN = 1.1107206/sqrt(Y1); Z=0.5/Y1; SUM1=1-Z; if V < SUM1 SMIR = SMIR_IN; else K=3; KSQRE=8; if Y1 > 21.7 SMIR = SMIR_IN; else T = exp(-Y1); SMIR = function_9(T,SMIR_IN,K,KSQRE,V,SUM1); end end end function SMIR=function_9(T1,SMIR_IN,K1,KSQRE1,V,SUM2) TU=T1^KSQRE1; SUM2=SUM2+(KSQRE1+1)*TU; if V > SUM2 SMIR = function_6(); else SUM2=SUM2-Z*TU; if V <= SUM2 SMIR = SMIR_IN; else K1=K1+2; KSQRE1=K1*K1-1; if Y*KSQRE1 > 174 SMIR = SMIR_IN; else SMIR = function_9(T1,SMIR_IN,K1,KSQRE1,V,SUM2); end end end end end end

Published with MATLAB® R2015a