Inference in Semiparametric Dynamic Models for Binary Longitudinal Data

Siddhartha CHIB and Ivan JELIAZKOV

Siddhartha Chib is the Harry C. Hartkopf Professor of Econometrics, John M. Olin School of Business, Washington University, St. Louis, MO 63130 (E-mail: [email protected]). Ivan Jeliazkov is Assistant Professor, Department of Economics, University of California, Irvine, CA 92697 (E-mail: [email protected]). The authors thank the editor, associate editor, and two referees, along with Edward Greenberg, Dale Poirier, and Justin Tobias, for helpful comments.

This article deals with the analysis of a hierarchical semiparametric model for dynamic binary longitudinal responses. The main complicating components of the model are an unknown covariate function and serial correlation in the errors. Existing estimation methods for models with these features are of O(N³), where N is the total number of observations in the sample. Therefore, nonparametric estimation is largely infeasible when the sample size is large, as is typical in the longitudinal setting. Here we propose a new O(N) Markov chain Monte Carlo based algorithm for estimation of the nonparametric function when the errors are correlated, thus contributing to the growing literature on semiparametric and nonparametric mixed-effects models for binary data. In addition, we address the problem of model choice to enable the formal comparison of our semiparametric model with competing parametric and semiparametric specifications. The performance of the methods is illustrated with detailed studies involving simulated and real data.

KEY WORDS: Average covariate effect; Bayes factor; Bayesian model comparison; Clustered data; Correlated binary data; Labor force participation; Longitudinal data; Marginal likelihood; Markov chain Monte Carlo; Markov process priors; Nonparametric estimation.

1. INTRODUCTION

This article discusses techniques for analyzing semiparametric models for dynamic binary longitudinal data. A hierarchical Bayesian approach is adopted to combine the main regression components: nonparametric functional form, dynamic dependence through lags of the response variable and serial correlation in the errors, and multidimensional heterogeneity. In this context, we pursue several objectives. First, we propose new computationally efficient estimation techniques to carry out the analysis. Computational efficiency is key, because longitudinal (panel) data pose special computational challenges compared with cross-sectional or time series data, so that "brute force" estimation becomes infeasible in many settings. Second, we address the problem of model choice by computing marginal likelihoods and Bayes factors to determine the posterior probabilities of competing models. This allows for the formal comparison of semiparametric versus parametric models, and addresses the problems of variable selection and lag determination. Third, we propose a simulation-based approach to calculate the average covariate effects, which provides interpretability of the estimates despite the nonlinearity and intertemporal dependence in the model. We examine the methods in a simulation study and then apply them in the analysis of women's labor force participation.

To illustrate the model, let y_it be the binary response of interest, where the indices i and t (i = 1,...,n; t = 1,...,T_i) refer to units (e.g., individuals, firms, countries) and time. We consider a dynamic partially linear binary choice model where y_it depends parametrically on the covariate vectors x̃_it and w_it (containing two disjoint sets of covariates) and nonparametrically on the covariate s_it in the form

$$ y_{it} = 1\{ \tilde{x}_{it}'\delta + w_{it}'\beta_i + g(s_{it}) + \phi_1 y_{i,t-1} + \cdots + \phi_J y_{i,t-J} + \varepsilon_{it} > 0 \}, \qquad (1) $$

where 1{·} is an indicator function, δ and β_i are vectors of common (fixed) and unit-specific (random) effects, φ_1,...,φ_J are lag coefficients, g(·) is an unknown function, and ε_it is a serially correlated error term. We assume that g(·) is a smooth but otherwise unrestricted function that is estimated nonparametrically. For this reason, x̃_it does not contain an intercept or the covariate s_it, although those may be included in w_it; identification issues are discussed in Section 2.2. The model in (1) aims to exploit the panel structure to distinguish among three important sources of intertemporal dependence in the observations (y_i1,...,y_iT_i). One source is due to the lags y_{i,t-1},...,y_{i,t-J}, which capture the notion of "state dependence," where the probability of response may depend on past occurrences because of altered preferences, trade-offs, or constraints. A second source of dependence is the presence of serial correlation in the errors (ε_i1,...,ε_iT_i). Finally, the observations (y_i1,...,y_iT_i) can also be correlated because of heterogeneity; these differences are captured through the individual effects β_i. Addressing the differences among units is also essential in guarding against the emergence of "spurious state dependence," because temporal pseudodependence could occur due to the fact that history may simply serve as a proxy for these unobserved differences (Heckman 1981).

The presence of the nonparametric function in the binary response model in (1) raises a number of challenges for estimation because of the intractability of the likelihood function. Many of these problems have been largely overcome in the Bayesian context by Wood and Kohn (1998) and Shively, Kohn, and Wood (1999), based on the framework of Albert and Chib (1993). One open problem, however, is the analysis of semiparametric models with serially correlated errors. This issue has been studied by Diggle and Hutchinson (1989), Altman (1990), Smith, Wong, and Kohn (1998), Wang (1998), and Opsomer, Wang, and Yang (2001). These studies conclude that serial correlation, if ignored, poses fundamental problems that can have substantial adverse consequences for estimation of the nonparametric function. The ability to estimate models with serial correlation is limited in practice, however, because of the computational intensity of existing algorithms. In these cases the estimation algorithms are O(N³), where N = Σ_{i=1}^n T_i is the total number of observations in the sample.

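As a concrete illustration of the data-generating process in (1), the following minimal sketch simulates a dynamic binary panel with one lag, a single fixed-effect covariate, a random intercept, AR(1) errors, and g(s) = sin(2πs). All parameter values are illustrative choices, not taken from the article.

```python
# Minimal simulation from a model of the form (1); parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, T = 250, 7                       # clusters and observations per cluster
delta, phi1, rho = 1.0, 0.5, 0.3    # fixed effect, lag coefficient, AR(1) coefficient
D = 0.2                             # variance of the random intercept

def g(s):
    return np.sin(2 * np.pi * s)

y = np.zeros((n, T), dtype=int)
s = rng.uniform(0.6, 1.4, size=(n, T))
x = rng.standard_normal((n, T))
for i in range(n):
    beta_i = rng.normal(0.0, np.sqrt(D))          # unit-specific random intercept
    eps_prev, y_prev = 0.0, 0                      # presample values
    for t in range(T):
        eps = rho * eps_prev + rng.standard_normal()              # AR(1) error
        z = x[i, t] * delta + beta_i + g(s[i, t]) + phi1 * y_prev + eps
        y[i, t] = int(z > 0)                       # binary response via the indicator in (1)
        eps_prev, y_prev = eps, y[i, t]

print("proportion of ones:", round(y.mean(), 3))
```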
This feature renders nonparametric estimation infeasible in panel data settings where sample sizes are generally large. Here we propose a new O(N) algorithm for estimating the nonparametric function when the errors are correlated, thus contributing to the growing literature on static semiparametric and nonparametric mixed-effects models for binary data (see, e.g., Lin and Zhang 1999; Lin and Carroll 2001; Karcher and Wang 2001). As far as we know, in the literature there are no O(N) methods for (binary or continuous) panel data with correlated errors.

A second open problem is the question of model comparison in the semiparametric setting. Because the marginal likelihood, which is used to compare two models on the basis of their Bayes factor, is a large-dimensional integral, previous work (e.g., DiMatteo, Genovese, and Kass 2001; Wood, Kohn, Shively, and Jiang 2002; Hansen and Kooperberg 2002) has relied on measures such as the Akaike information criterion (AIC) and the Bayes information criterion (BIC). However, computation of these criteria is infeasible in our context. We extend that literature by describing an approach, based on Chib (1995), for calculating marginal likelihoods and Bayes factors. We use this approach in Section 7 to compare several competing parametric and semiparametric models.

The article is organized as follows. Section 2 completes the hierarchical model specification, and Section 3 presents the Markov chain Monte Carlo (MCMC) fitting method. Section 4 shows how the average effects of the covariates on the probability of response are calculated, and Section 5 is concerned with model comparison. Section 6 presents a detailed simulation study of the performance of the estimation method. Section 7 studies the intertemporal labor force participation of a panel of married women. Finally, Section 8 presents concluding remarks.

2. HIERARCHICAL MODELING AND PRIORS

This section presents the hierarchical structure used in modeling the regression components in (1). For simplicity, and to keep the discussion focused on our main topics—semiparametric estimation with correlated errors and Bayesian model comparison—we present the model in detail only under the assumptions of this section. A number of modifications, extensions, and generalizations are possible, and we mention some of them as we proceed.

2.1 The Smoothness Prior on g(·)

Suppose that the N observations in the covariate vector s, whose effect is modeled nonparametrically, determine the m × 1 design point vector v, m ≤ N, with entries equal to the unique ordered values of s, with v_1 < ··· < v_m and with g = (g(v_1),...,g(v_m))' = (g_1,...,g_m)' as the corresponding function evaluations. The idea is to model the function evaluations as a stochastic process that controls the degree of local variation between neighboring states in g to strike a balance between a good fit and a smooth regression function (Whittaker 1923). Following Fahrmeir and Lang (2001), we model the function evaluations as resulting from the realization of a second-order Markov process, with the specification aimed at penalizing rough functions g(·). A range of similar smoothness priors, with some discussion and comparisons, can be found in the literature on nonparametric modeling (for some specific examples, see Wahba 1978; Shiller 1984; Silverman 1985; Besag, Green, Higdon, and Mengersen 1995; Koop and Poirier 2004). Defining h_t = v_t − v_{t−1}, the second-order random-walk specification is given by

$$ g_t = \left(1 + \frac{h_t}{h_{t-1}}\right) g_{t-1} - \frac{h_t}{h_{t-1}}\, g_{t-2} + u_t, \qquad u_t \sim \mathcal{N}(0, \tau^2 h_t), \qquad (2) $$

where τ² is a smoothness parameter. Small values of τ² produce smoother functions, whereas larger values allow the function to be more flexible and interpolate the data more closely. The weight h_t adjusts the variance to account for possibly irregular spacing between consecutive points in the design point vector. Other possibilities are conceivable for the weights (see, e.g., Shiller 1984; Besag et al. 1995; Fahrmeir and Lang 2001); the one given here implies that the variance grows linearly with the distance h_t, a property satisfied by random walks. This linearity is appealing because it implies that, conditional on g_{t−1} and g_{t−2}, the variance of g_{t+k}, k ≥ 0, will depend only on the distance v_{t+k} − v_{t−1}, and not on the number of points k that lie in between.

We now deviate from the prior of Fahrmeir and Lang (2001), and, in fact, from much of the literature on nonparametric functional modeling, by working with a version of the prior in (2) that is proper. The prior in (2) is improper because it incorporates information only about deviations from linearity, but says nothing about the linearity itself. To rectify this problem, we complete the specification of the smoothness prior by providing a distribution for the initial states of the random-walk process,

$$ \begin{pmatrix} g_1 \\ g_2 \end{pmatrix} \Big|\, \tau^2 \sim \mathcal{N}\!\left( \begin{pmatrix} g_{10} \\ g_{20} \end{pmatrix}, \; \tau^2 G_0 \right), \qquad (3) $$

where G_0 is a 2 × 2 symmetric positive definite matrix. The prior on the initial conditions (3) induces a prior on linear functions of v that is equivalent to the usual priors placed on the intercept and slope parameters in univariate linear regression. This can be seen more precisely by iterating (2) in expectation (to eliminate u_t, which is the source of the nonlinearity), starting with initial states as specified in (3). Thus, conditional on g_1 and g_2, the mean of the Markov process in (2) is a straight line that goes through g_1 and g_2. As a consequence, the intercept and slope of that line will have a distribution directly related to the distribution in (3) in a one-to-one mapping. This is useful in setting the prior parameters g_{10}, g_{20}, and G_0 based on the same information that would be used in a corresponding linear model.

The directed Markovian structure in (2) and (3) implies a joint density for the elements of g, which can be obtained by rewriting the Markov process in a random field form. After defining

$$ H = \begin{pmatrix} 1 & & & & \\ & 1 & & & \\ \frac{h_3}{h_2} & -\left(1+\frac{h_3}{h_2}\right) & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & \frac{h_m}{h_{m-1}} & -\left(1+\frac{h_m}{h_{m-1}}\right) & 1 \end{pmatrix}
\quad \text{and} \quad
\Sigma = \begin{pmatrix} G_0 & & & \\ & h_3 & & \\ & & \ddots & \\ & & & h_m \end{pmatrix}, $$

where the off-diagonal 0's in H and Σ have been suppressed in the notation, the prior on g, which is equivalent to the second-order Markov process prior in (2) and (3), becomes

$$ g \,|\, \tau^2 \sim \mathcal{N}(g_0, \tau^2 K^{-1}), \qquad (4) $$

where g_0 = H^{-1} g̃, with g̃ = (g_{10}, g_{20}, 0,...,0)', and the penalty matrix K is given by K = H'Σ^{-1}H. Equivalently, g_0 can be derived by taking recursive expectations of (2) starting with the mean in (3), and, as argued earlier, the points in g_0 will form a straight line.

Several points deserve emphasis. First, a key feature of the prior in (4), due to the information in (3), is that it is proper. In contrast, priors in the literature are generally specified with a reduced-rank penalty matrix K and thus are improper. Because improper priors preclude the possibility of formal finite-sample model comparison using marginal likelihoods and Bayes factors (O'Hagan 1994, chap. 7; Kass and Raftery 1995), our prior removes an important impediment to Bayesian model selection. Second, it is important to note that the m × m penalty matrix K is banded, which has considerable practical value because manipulations involving banded matrices take O(m) operations, rather than O(m³) for inversions or O(m²) for multiplication by a vector (and m may be as large as the total number of data points N). Third, Markov process priors are conceptually simple and adaptable to different orders, so as to meet problem-specific tasks (Besag et al. 1995; Fahrmeir and Lang 2001). For example, a first-order prior g_t = g_{t−1} + u_t penalizes abrupt jumps between successive states of the Markov process, whereas higher-order priors embody more subtle notions of "smoothness" related to the rates of change in the function.

Because the prior on g is defined conditional on the hyperparameter τ², in the next level of the hierarchy we specify the prior distribution τ² ∼ IG(ν_0/2, δ_0/2). In setting the parameters ν_0 and δ_0, it is helpful to use the well-known mapping between the mean and variance of the inverse gamma distribution and the parameters ν_0 and δ_0 (e.g., Gelman, Carlin, Stern, and Rubin 1995, app. A). The choice of these parameters will affect the estimated g depending on the other sources of variance in the model. Some intuition can be gained by considering the sampling algorithm that we present in the next section, where in the sampling of g, the inverse of τ² weighs the components of the smoothness prior K and g_0 and competes with the inverse of the error variance, which weighs the function g that maximizes the fit in the likelihood.

2.2 Priors on the Linear Effects and Dynamic Parameters

Turning attention to the parametric facets of the model, and in anticipation of the subsequent estimation of the model by the approach of Albert and Chib (1993), we begin by rewriting the model in (1) in terms of the latent variables {z_it} as

$$ z_{it} = \tilde{x}_{it}'\delta + w_{it}'\beta_i + g(s_{it}) + \phi_1 1\{z_{i,t-1} > 0\} + \cdots + \phi_J 1\{z_{i,t-J} > 0\} + \varepsilon_{it}, $$

where y_it = 1{z_it > 0}. [In the presample (t = −J + 1,...,0), the latent data {z_it} are not modeled and for our purposes can simply be taken to equal the presample {y_it}.] Suppose that ε_it is a serially correlated error term that follows a mean-0 stationary pth-order autoregressive [AR(p)] process parameterized in terms of ρ = (ρ_1,...,ρ_p)', that is,

$$ \varepsilon_{it} = \rho_1 \varepsilon_{i,t-1} + \cdots + \rho_p \varepsilon_{i,t-p} + v_{it}, \qquad (5) $$

where the v_it are independent N(0, 1). We can express the process in (5) in terms of a polynomial in the backshift operator L as ρ(L)ε_it = v_it, where ρ(L) = 1 − ρ_1 L − ··· − ρ_p L^p, and stationarity is maintained by requiring that all roots of ρ(L) lie outside the unit circle. Estimation with this and several other correlation structures is discussed in Section 3.

Stacking the data for each cluster, let y_i ≡ (y_i1,...,y_iT_i)' denote the T_i observations in the ith cluster, and similarly define the lag vectors

$$ y_{i,-j} \equiv (y_{i,1-j},\ldots,y_{i,T_i-j})' = \big(1\{z_{i,1-j} > 0\},\ldots,1\{z_{i,T_i-j} > 0\}\big)', \qquad j = 1,\ldots,J. $$

Then, for the observations in the ith cluster, we have

$$ z_i = \tilde{X}_i \delta + W_i \beta_i + g_i + L_i \phi + \varepsilon_i, \qquad (6) $$

where z_i = (z_i1,...,z_iT_i)', X̃_i = (x̃_i1,...,x̃_iT_i)', W_i = (w_i1,...,w_iT_i)', g_i = (g(s_i1),...,g(s_iT_i))', s_i = (s_i1,...,s_iT_i)', L_i = (y_{i,−1},...,y_{i,−J}), φ = (φ_1,...,φ_J)', and the errors ε_i = (ε_i1,...,ε_iT_i)' follow the distribution ε_i ∼ N(0, Ω_i). The covariance matrix Ω_i is the T_i × T_i Toeplitz matrix implied by the AR process, the construction of which we will discuss later in this article. But first consider the modeling of the unobserved effects.

In the spirit of Mundlak (1978), we assume that the distribution of the q-vector β_i is Gaussian with mean value depending on the initial observations y_i0 ≡ (y_{i,−J+1},...,y_i0)' and the covariates for subject i. In particular, we let

$$ \beta_i \,|\, y_{i0}, \tilde{X}_i, W_i, s_i, \gamma, D \sim \mathcal{N}(A_i\gamma, D), \qquad i = 1,\ldots,n, \qquad (7) $$

or, equivalently,

$$ \beta_i = A_i\gamma + b_i, \qquad b_i \sim \mathcal{N}(0, D), \qquad i = 1,\ldots,n, \qquad (8) $$

where the matrix A_i can be defined quite flexibly, given the specifics of the problem at hand.

In the simplest case, in which W_i does not include an intercept and β_i is independent of the covariates, a parsimonious way to model the dependence of β_i on y_i0 is to let A_i be a q × 2q matrix given by A_i = I ⊗ (1, ȳ_i0), where ȳ_i0 is the mean of the entries in y_i0. [More generally, but less parsimoniously, A_i can also be given by I ⊗ (1, y_i0').] Moreover, A_i also may contain within-cluster means (or entire covariate sequences) of a subset of covariates—those suspected of being correlated with the random effects for each cluster. If r̄_ij (j = 1,...,q) denotes the vector of such covariate means (or entire sequence of covariates), then A_i may be written as

$$ A_i = \begin{pmatrix} (1,\ \bar{y}_{i0},\ \bar{r}_{i1}') & & \\ & \ddots & \\ & & (1,\ \bar{y}_{i0},\ \bar{r}_{iq}') \end{pmatrix}. $$

We later adjust this specification for the case when a random intercept is present. An example of the matrix A_i is illustrated in the application of Section 7.
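Returning to the smoothness prior of Section 2.1, the following is a minimal sketch (not the authors' code) of how the banded matrices in (2)–(4) can be assembled from a design-point vector v: it builds H and Σ, the penalty K = H'Σ^{-1}H, and the prior mean g_0 = H^{-1}g̃. The values of g_{10}, g_{20}, and G_0 are illustrative.

```python
# Sketch of the proper second-order random-walk prior: H, Sigma, K, and g0.
import numpy as np

def smoothness_prior(v, g10=0.0, g20=0.0, G0=None):
    m = len(v)
    h = np.diff(v)                          # h_t = v_t - v_{t-1}
    H = np.eye(m)
    Sigma = np.zeros((m, m))
    Sigma[:2, :2] = np.eye(2) if G0 is None else G0
    for t in range(2, m):                   # rows 3,...,m in the article's notation
        w = h[t - 1] / h[t - 2]             # h_t / h_{t-1}
        H[t, t - 2] = w
        H[t, t - 1] = -(1.0 + w)
        Sigma[t, t] = h[t - 1]              # Var(u_t) proportional to the spacing h_t
    K = H.T @ np.linalg.solve(Sigma, H)     # penalty matrix K = H' Sigma^{-1} H (banded)
    g_tilde = np.zeros(m)
    g_tilde[:2] = [g10, g20]
    g0 = np.linalg.solve(H, g_tilde)        # prior mean: straight line through (v1,g10),(v2,g20)
    return K, g0

v = np.linspace(0.0, 1.0, 6)
K, g0 = smoothness_prior(v, g10=0.0, g20=0.1)
print(np.round(K, 2))                       # K has bandwidth 2
print(np.round(g0, 2))                      # equally spaced points on a line
```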

Using (8), equation (6) can be equivalently expressed as

$$ z_i = X_i\beta + g_i + W_i b_i + \varepsilon_i, \qquad (9) $$

with X_i = (X̃_i, W_iA_i, L_i) and β = (δ', γ', φ')'.

The presence of the individual effects b_i induces correlation among the observations in a cluster, but the clusters are modeled as independent. Therefore, even though the columns of W_i can form a subset of the columns of X_i (as is usual in parametric longitudinal models) and W_i can contain the covariate s whose effect is modeled through g(s), the parameters of the model are likelihood-identified through the presence of intercluster and intracluster sources of variation. However, there is an important caveat for the semiparametric model considered here. We note that neither s_i nor an intercept should be present in the matrix X_i, even though they can be in W_i, in order to resolve the identification problem that would otherwise emerge under a general, unrestricted g(s). Hence, in more general specifications containing a random intercept in the model, one must adjust A_i for identification purposes. If the random intercept is the ith column of W_i, then the column of A_i that is an ith unit vector should be dropped, so that W_iA_i does not include an intercept. Similarly, if s_i is the jth column of W_i, then the column of A_i that is a jth unit vector should be dropped, so that the product W_iA_i does not contain s_i. It should also be noted that the presence of an unrestricted g(·) does not prevent the inclusion of temporally invariant covariates (e.g., gender, race, various dummies) in either X̃_i or W_i, as long as these vary among clusters. One should be aware, however, that their simultaneous inclusion into A_i to model correlation with a random intercept leaves the likelihood unidentified (because W_iA_i will cause X_i to contain two or more identical columns across all i).

The hierarchical structure of the model is completed by the introduction of (semiconjugate) prior densities for the model parameters β, ρ, and D. Gaussian priors are used to summarize the prior information about the k-vector β, whereas a Wishart prior is used for the q × q matrix D^{-1}; namely, β ∼ N(β_0, B_0) and D^{-1} ∼ W(r_0, R_0). The prior on ρ is specified as ρ ∼ N(ρ_0, P_0) I_{S_ρ}, where I_{S_ρ} is an indicator of the set S_ρ containing the ρ that satisfy stationarity. We clarify that, in contrast with the serially correlated errors, stationarity is not an issue for the state-dependence coefficients φ (which are part of β), because these multiply the binary lags and thus serve simply as intercept shifts. The foregoing semiconjugate priors are useful for computational reasons, but the analysis of models with other general priors can be conducted by weighted resampling of the MCMC draws obtained from a model using the priors mentioned here.

3. ESTIMATION

This section presents a new estimation method for longitudinal models with unobserved heterogeneity and serial correlation in the errors. As far as we know, other O(N) methods for estimating this model are not yet available. A main difficulty is that even a single evaluation of the likelihood function requires evaluation of a multiple integral, and hence penalized likelihood estimation is impractical. Moreover, because m, the number of unique values of s, can be as large as the sample size N, any maximization over the values g can be extremely difficult to apply. Of course, any such maximization needs to be conditioned on a smoothness parameter, but determining that parameter is problematic by current methods such as cross-validation, which are designed for continuous data but not for the case of longitudinal binary data.

Our estimation algorithm takes advantage of the approach of Albert and Chib (1993) to simplify simulation of the posterior distribution by MCMC methods. In the longitudinal data setup, the latent data representation of the model with AR(p) serial correlation was given in (9), where the errors ε_i = (ε_i1,...,ε_iT_i)' follow the distribution ε_i ∼ N(0, Ω_i) and Ω_i is the T_i × T_i Toeplitz matrix implied by the AR process. For the general AR(p) case, the matrix Ω_i can be determined as follows. Let ϕ_j = E(ε_it ε_{i,t−j}) be the jth autocovariance (satisfying ϕ_j = ϕ_{−j}). It can be shown (cf. Hamilton 1994, sec. 3.4) that the autocovariances follow the same pth-order difference equation as the process itself, that is, ϕ_j = ρ_1ϕ_{j−1} + ··· + ρ_pϕ_{j−p}. The first p values (ϕ_0, ϕ_1,...,ϕ_{p−1}) are given by the first p elements of the first column of the p² × p² matrix [I − F ⊗ F]^{-1}, where ⊗ denotes the Kronecker product and F is the p × p matrix

$$ F \equiv \begin{pmatrix} \rho' \\ I_{p-1} \quad 0_{(p-1)\times 1} \end{pmatrix}. $$

Using the sequence of autocovariances ϕ_j obtained in this way, the matrix Ω_i can be constructed using Ω_i[j, k] = ϕ_{j−k}. For example, in the AR(1) case, we have Ω_i[j, k] = ρ^{|j−k|}/(1 − ρ²).

After marginalizing b_i using the distribution of the random effects, the latent z_i can be expressed as

$$ z_i = X_i\beta + g_i + u_i, \qquad (10) $$

where the error vector is normal with variance matrix V_i = Ω_i + W_iDW_i'. This implies that the contribution of the ith cluster to the likelihood function (conditioned on g_i),

$$ \Pr(y_i \,|\, \beta, g_i, D, \rho) = \int_{B_{iT_i}} \cdots \int_{B_{i1}} \mathcal{N}(z_i \,|\, X_i\beta + g_i, V_i)\, dz_i, \qquad (11) $$

where B_it is the interval (0, ∞) if y_it = 1 or the interval (−∞, 0] if y_it = 0, is, in general, difficult to calculate. This is true even in the case of uncorrelated errors, when V_i = I + W_iDW_i', making it impractical to obtain smoothness parameter estimates by cross-validation and to subsequently use penalized likelihood estimation to obtain estimates of g in generalized mixed-effects models.

Previous studies by Diggle and Hutchinson (1989), Altman (1990), and Smith et al. (1998) have drawn attention to the fact that when the errors are treated as independent when they are not, the correlation in the errors can adversely affect the nonparametric estimate of the regression function. For example, when the covariate s is in temporal order, the unknown function g(s) can be confounded with the autocorrelated error process, because both are stochastic processes in time. If the serial correlation in the errors is ignored, then the estimate of g(s) can become too rough as it attempts to mimic the errors. The undersmoothing can be visible even for mild serial correlation.

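The AR(p) covariance construction described above can be coded directly. The sketch below (not the authors' code) obtains the first p autocovariances from the first column of [I − F ⊗ F]^{-1}, extends them by the Yule–Walker recursion, and assembles the Toeplitz matrix Ω_i for a cluster of length T.

```python
# Toeplitz covariance of a stationary AR(p) process with N(0,1) innovations.
import numpy as np
from scipy.linalg import toeplitz

def ar_covariance(rho, T):
    p = len(rho)
    F = np.zeros((p, p))
    F[0, :] = rho                           # companion matrix: first row is rho'
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)
    vecV = np.linalg.solve(np.eye(p * p) - np.kron(F, F), np.eye(p * p)[:, 0])
    acov = list(vecV[:p])                   # phi_0, ..., phi_{p-1}
    for j in range(p, T):                   # phi_j = rho_1 phi_{j-1} + ... + rho_p phi_{j-p}
        acov.append(sum(rho[k] * acov[j - k - 1] for k in range(p)))
    return toeplitz(acov[:T])

Omega = ar_covariance([0.5], T=5)           # AR(1): Omega[j,k] = rho^{|j-k|} / (1 - rho^2)
print(np.round(Omega, 3))
```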
Smith et al. (1998) pointed out that even if the independent variable is not time, modeling the autocorrelation in the errors gives more efficient nonparametric estimates, because it reduces the effective error variance similarly to the case of parametric regression.

To describe the approach, we stack the observations in (9) for all n clusters and write

$$ z = X\beta + Qg + Wb + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \Omega), \qquad (12) $$

after defining the vectors z = (z_1',...,z_n')' and b = (b_1',...,b_n')' and the matrices X = (X_1',...,X_n')', W = diag(W_1,...,W_n), and Ω = diag(Ω_1,...,Ω_n). In the foregoing, Q is an incidence matrix of dimension N × m, with entries Q_ij = 1 if s_i = v_j and 0 otherwise. In other words, the ith row of Q contains a 1 in the position where the observation on s for that row matches the design point from the vector v, with all remaining elements 0's, so that s = Qv. The fact that the errors are not orthogonal (and Ω is not diagonal) requires only minor adjustments to the sampling of β, z, D, τ², and b; standard Bayes updates for models with serial correlation can be applied to obtain and simulate the posterior densities (given in Algorithm 1 later in this section). But the sampling of g is problematic.

To understand the difficulty, note that with uncorrelated errors (i.e., when Ω = I), we have the following full-conditional distribution: g | y, β, b, τ², z ∼ N(ĝ, G). In this case, G = (K/τ² + Q'Q)^{-1} and ĝ = G(Kg_0/τ² + Q'(z − Xβ − Wb)). Remark 1 presents a computationally efficient method for sampling this density.

Remark 1. In sampling g, one should note that Q'Q is a diagonal matrix with jth diagonal entry equal to the number of values in s corresponding to the design point v_j. Because K and Q'Q are banded, G^{-1} is banded as well. Thus sampling of g need not include an inversion to obtain G and ĝ. The mean ĝ can be found instead by solving G^{-1}ĝ = (Kg_0/τ² + Q'(z − Xβ − Wb)), which is done in O(m) operations by back substitution. Also, let P'P = G^{-1}, where P is the Cholesky decomposition of G^{-1} and is also banded. To efficiently obtain a random draw from N(ĝ, G), sample u ∼ N(0, I), and solve Px = u for x by back substitution. It follows that x ∼ N(0, G). Adding the mean ĝ to x yields a draw g ∼ N(ĝ, G).

Unfortunately, after accounting for the autocorrelated errors, the full-conditional distribution g | y, β, b, τ², z, Ω ∼ N(ĝ, G) will involve the matrix G^{-1} = (K/τ² + Q'Ω^{-1}Q), which is no longer banded (even though Ω^{-1} is banded). Hence the computational shortcuts discussed in Remark 1 are inapplicable. Intuitively, bandedness fails because serial correlation introduces dependence between observations that are neighbors on the basis of the ordering of the covariate s, whereas the function evaluations g depend on neighbors determined according to the ordering in v, the vector of unique and ordered values of s (with s = Qv).

Diggle and Hutchinson (1989) and Altman (1990) considered a special case that can still result in O(N) estimation. In this case, attention was restricted to univariate models for nonclustered data where the independent variable is time. Then, because the elements in s are already unique and ordered, we have v = s or, in other words, Q = I. This implies that G^{-1} = (K/τ² + Q'Ω^{-1}Q) = (K/τ² + Ω^{-1}) is banded, and estimation can be done in O(N) operations as outlined in Remark 1. Unfortunately, in panel data settings Q is unlikely to be an identity matrix even when s is time, because repeating values in s will tend to emerge across clusters. Even when all values in s are unique, Q will be some permutation (not necessarily an identity) matrix, because the ordering of s need not be consistent with the cluster structure. The general case, where s is allowed to be any covariate (not necessarily time), was considered by Smith et al. (1998), but their algorithm is O(N³) because it works with the nonbanded matrix G^{-1}. Thus the applicability of that method is limited to only small datasets, and it is infeasible in panel data settings, where the sample size N can run into the thousands. Finally, we note that the method of orthogonalizing the errors by working with the transformed data ρ(L)z_it, ρ(L)x_it, ρ(L)g(s_it), and ρ(L)w_it (cf. Harvey 1981, chap. 6; Chib 1993) works well in parametric models but is not a solution here, because it is equivalent to premultiplying the matrices X, W, and Q by the Cholesky decomposition of Ω^{-1}, which still leaves G^{-1} nonbanded.

Here we propose a different approach to orthogonalizing the errors that exploits the longitudinal nature of the data. In particular, the idea is to decompose the errors into correlated and orthogonal parts, and to deal with the correlated part in much the same way in which we deal with the random effects. Once the correlated part is given, the nonparametric estimation of g can proceed as efficiently as before. To illustrate, decompose the matrix Ω_i = R_i + κI, where R_i is a symmetric positive definite matrix and κI is a diagonal matrix with κ > 0. Furthermore, let C_i be the Cholesky decomposition of R_i such that C_iC_i' = R_i. Then Ω_i = C_iC_i' + κI, and the model can be rewritten as

$$ z_i = X_i\beta + W_ib_i + g_i + \varepsilon_i = X_i\beta + W_ib_i + g_i + C_iu_i + \xi_i, \qquad (13) $$

where u_i ∼ N(0, I) and ξ_i ∼ N(0, κI) are mutually independent. Stacking the observations in (13) for all n clusters, in analogy with (12), we have

$$ z = X\beta + Qg + Wb + Cu + \xi, \qquad \xi \sim \mathcal{N}(0, \kappa I), \qquad (14) $$

where u = (u_1',...,u_n')' and C = diag(C_1,...,C_n). Because the covariance matrix of ξ is diagonal, conditional on Cu, we have obtained an orthogonalization of the serially correlated errors that can be used to sample g efficiently. It now remains to prove that a decomposition of Ω_i into the sum of a symmetric positive definite matrix R_i and a (positive definite) diagonal matrix κI exists, and to show how it can be found. The details are formalized here and rely on results given by Ortega (1987, pp. 31–32).

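Before the formal statement below, here is a numerical sketch (under stated assumptions, not the authors' code) of the two computational devices just described: the decomposition Ω_i = C_iC_i' + κI, with the choice κ = min eigenvalue/2 anticipating the discussion following Lemma 1, and the O(m) draw of g from N(ĝ, G) in Remark 1, where G^{-1} = K/τ² + Q'Q/κ is banded. For clarity, the penalty K is passed as a dense array; in practice it would be stored directly in banded form.

```python
# Sketch: Omega = C C' + kappa*I decomposition and banded sampling of g.
import numpy as np
from scipy.linalg import cholesky, cholesky_banded, solve_banded, solveh_banded

def decompose(Omega):
    """Omega = C C' + kappa*I with R = C C' symmetric positive definite."""
    kappa = np.linalg.eigvalsh(Omega).min() / 2.0
    C = cholesky(Omega - kappa * np.eye(len(Omega)), lower=True)
    return C, kappa

def to_upper_banded(A, u):
    """Store the banded symmetric matrix A in scipy's upper-banded format."""
    m = len(A)
    ab = np.zeros((u + 1, m))
    for k in range(u + 1):
        ab[u - k, k:] = np.diag(A, k)
    return ab

def sample_g(K, g0, counts, resid, tau2, kappa, rng):
    """One draw from g | y, ... ~ N(g_hat, G) without ever forming G."""
    m = len(g0)
    Ginv = K / tau2
    Ginv[np.arange(m), np.arange(m)] += counts / kappa   # Q'Q is diagonal: counts per design point
    ab = to_upper_banded(Ginv, u=2)                      # G^{-1} has bandwidth 2
    rhs = K @ g0 / tau2 + resid / kappa                  # K g0/tau^2 + Q'(z - X*beta - W*b - C*u)/kappa
    g_hat = solveh_banded(ab, rhs)                       # banded solve for the mean
    U = cholesky_banded(ab)                              # upper factor: U'U = G^{-1}
    x = solve_banded((0, 2), U, rng.standard_normal(m))  # x = U^{-1} w  ~  N(0, G)
    return g_hat + x

# toy illustration with an AR(1) cluster covariance and a simple proper banded penalty
rng = np.random.default_rng(1)
rho, T = 0.6, 5
Omega = np.array([[rho ** abs(j - k) / (1 - rho ** 2) for k in range(T)] for j in range(T)])
C, kappa = decompose(Omega)
print(np.allclose(C @ C.T + kappa * np.eye(T), Omega))   # True

m = 8
D2 = np.diff(np.eye(m), n=2, axis=0)                     # second-difference penalty
K = D2.T @ D2 + np.diag(np.r_[1.0, 1.0, np.zeros(m - 2)])  # made full rank, bandwidth 2
g_draw = sample_g(K, np.zeros(m), np.ones(m), rng.standard_normal(m), 0.1, kappa, rng)
print(g_draw.shape)
```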
Theorem 1. Let the matrix A have eigenvalues {λ_i}, and let p(Z) ≡ κ_0I + κ_1Z + κ_2Z² + ··· + κ_mZ^m be some (scalar- or matrix-valued) polynomial of degree m. Then {p(λ_i)} are eigenvalues of the matrix p(A), and A and p(A) have the same eigenvectors.

The proof follows from first principles by writing the matrices in the polynomial in terms of their spectral decomposition and collecting terms. The result that we originally sought to show can now be derived using a special case of the polynomial in the theorem, that is, p(Z) = Z − κI.

Lemma 1. A T_i × T_i symmetric positive definite matrix Ω_i can always be written as the sum of a symmetric positive definite matrix R_i and a positive definite diagonal matrix κI.

Proof. Let p(Ω_i) = Ω_i − κI, which is the definition of R_i. Then, if the eigenvalues of Ω_i are {λ_ij}_{j=1}^{T_i}, which are strictly positive and real because Ω_i is symmetric and positive definite, it follows from Theorem 1 that the eigenvalues of R_i will be {λ_ij − κ}_{j=1}^{T_i}. Then choosing κ such that min{λ_ij} > κ > 0 guarantees that Lemma 1 holds.

Because the preceding decomposition is not unique, various values of κ will correspond to the same model and the same dynamics. Therefore, in practice the choice of κ will be based on convenience and numerical stability. One simple choice that has performed well in our simulations is to set κ = min{λ_ij}/2. Also note that Ω_i depends only on ρ and not on the data, so the decomposition need not be performed for every i.

Based on the preceding discussion, we present the estimation algorithm. For the sampling of the vector of AR coefficients ρ = (ρ_1,...,ρ_p)', it is useful to define the following quantities. Let e_it = z_it − x_it'β − w_it'b_i − g(s_it), e_i = (e_{i,p+1},...,e_{i,T_i})', e = (e_1',...,e_n')', and let E denote the (N − np) × p matrix with rows containing p lags of e_it (i = 1,...,n; t ≥ p + 1), that is, (e_{i,t−1},...,e_{i,t−p}). Finally, let the initial p values of e_it in each cluster be given by e_{i1} = (e_{i1},...,e_{ip})', and let Ω_p be the p × p stationary covariance matrix of the AR(p) error process, which is a function of ρ and is constructed identically to {Ω_i}. Simulation of ρ is by the Metropolis–Hastings (M–H) algorithm (Hastings 1970; Tierney 1994; Chib and Greenberg 1995).

Algorithm 1: Model with state dependence and AR(p) serial correlation.

1. Sample {z_i} | y, D, g, β, ρ, marginal of {b_i}, by drawing, for i ≤ n, t ≤ T_i,

$$ z_{it} \sim \begin{cases} \mathcal{TN}_{(0,\infty)}(\mu_{it}, v_{it}) & \text{if } y_{it} = 1, \\ \mathcal{TN}_{(-\infty,0]}(\mu_{it}, v_{it}) & \text{if } y_{it} = 0, \end{cases} $$

   where TN_{(a,b)}(μ_it, v_it) is a normal distribution truncated to the interval (a, b) with mean μ_it = E(z_it | z_{i\t}, β, g_i, V_i) and variance v_it = var(z_it | z_{i\t}, β, g_i, V_i), with V_i = Ω_i + W_iDW_i' and Ω_i determined by ρ as discussed earlier.
2. Sample β, {b_i}, {u_i} | y, D, {z_it}, g_i, ρ in one block by drawing the following:
   (a) β | y, D, {z_it}, g, ρ ∼ N(β̂, B), where β̂ = B(B_0^{-1}β_0 + Σ_{i=1}^n X_i'V_i^{-1}(z_i − g_i)) and B = (B_0^{-1} + Σ_{i=1}^n X_i'V_i^{-1}X_i)^{-1}.
   (b) b_i | y, D, {z_it}, β, g, ρ ∼ N(b̂_i, B_i), with b̂_i = B_iW_i'Ω_i^{-1}(z_i − X_iβ − g_i) and B_i = (D^{-1} + W_i'Ω_i^{-1}W_i)^{-1}, for i = 1,...,n.
   (c) u_i | y, {z_it}, {b_i}, β, g, ρ ∼ N(û_i, U_i), where û_i = U_iC_i'(z_i − X_iβ − g_i − W_ib_i)/κ and U_i = (I + C_i'C_i/κ)^{-1}, for i = 1,...,n.
3. Sample D^{-1} | {b_i} ∼ W{r_0 + n, (R_0^{-1} + Σ_{i=1}^n b_ib_i')^{-1}}.
4. Sample g | y, β, {b_i}, τ², {z_it}, {u_i} ∼ N(ĝ, G), where G = (K/τ² + Q'Q/κ)^{-1} and ĝ = G(Kg_0/τ² + Q'(z − Xβ − Wb − Cu)/κ). Because G^{-1} is banded, estimation can proceed efficiently as discussed in Remark 1.
5. Sample τ² | g ∼ IG((ν_0 + m)/2, (δ_0 + (g − g_0)'K(g − g_0))/2).
6. Sample ρ | y, g, β, {b_i}, {z_it} ∝ Ψ(ρ) × N(ρ̂, P) × I_{S_ρ}, where ρ̂ = P(P_0^{-1}ρ_0 + E'e), P = (P_0^{-1} + E'E)^{-1}, and Ψ(ρ) = |Ω_p|^{-n/2} exp(−½ Σ_{i=1}^n e_{i1}'Ω_p^{-1}e_{i1}).

In the M–H step of Algorithm 1, a proposal draw ρ' is generated from the density N(ρ̂, P)I_{S_ρ} and is subsequently accepted as the next sample value with probability min{Ψ(ρ')/Ψ(ρ), 1} (see Chib and Greenberg 1994). If the candidate value ρ' is rejected, then the current value ρ is repeated as the next value of the MCMC sample.

We note several aspects of this algorithm. First, because β, {b_i}, and {u_i} are correlated by construction, they are sampled in one block to speed up mixing of the chain. This is done by using (10) to sample β from a conditional density that does not depend on {b_i} and {u_i}, followed by drawing {b_i} from a conditional density that depends on β but not on {u_i}, and finally drawing {u_i} from its full-conditional density (Chib and Carlin 1999). Second, the MCMC approach to estimating τ² in this hierarchical model is an alternative to cross-validation that can be applied to both continuous and discrete data and fully accounts for parameter uncertainty, unlike plug-in approaches, which do not account for the variability due to estimating the smoothing parameter. An important alternative to these methods is the maximum integrated likelihood approach to determining τ² (Kohn, Ansley, and Tharm 1991), but this is not feasible in this binary data setting because of the N-dimensional integration that would be required.

Several extensions are possible. The approach for dealing with dependent errors can accommodate other correlation structures, such as exponentially correlated error sequences Ω_i[t, s] = exp{−α|t − s|^r} for scalars α and r (e.g., Diggle and Hutchinson 1989). The method can also handle estimation of non-Toeplitz Ω_i using the algorithms of Chib and Greenberg (1998). Yet other extensions can be pursued in the latent variable step; for example, the methods can be adapted to models for polychotomous data and models with t-links (Albert and Chib 1993) or to mixture-of-normals link functions, including an approximation to the logit link (Wood and Kohn 1998). Of course, Algorithm 1 subsumes an algorithm for the estimation of a simpler, fully parametric version of the model. Applications to continuous-data settings are also immediate.

4. AVERAGE COVARIATE EFFECTS

We now turn to the question of finding the effect of a change in a given covariate x_j. This is important for understanding the model and for determining the impact of an intervention on one or more of the covariates. However, a change in a covariate affects both the contemporaneous response and its future values. This effect also depends on all other covariates and model parameters. The impact is quite complex and nonlinear because of the nonlinear link function, the state dependence, the serial correlation, the unknown function, and the random effects. Hence it is calculated marginalized over the remaining covariates and the parameters.
Suppose that the model for a new individual i is given by

$$ z_{it} = x_{it}'\beta + w_{it}'b_i + g(s_{it}) + \varepsilon_{it}, $$

where the definitions of x_it and β are as in Sections 2 and 3, and we are interested in the effect of a particular x, say x_1, on contemporaneous and future y_it. Splitting x_it and β accordingly, we rewrite the foregoing model as

$$ z_{it} = x_{1it}'\beta_1 + x_{2it}'\beta_2 + w_{it}'b_i + g(s_{it}) + \varepsilon_{it}. $$

The average covariate effect can then be analyzed from a predictive perspective applied to this new individual i. Given the specific context, we may consider various scenarios of interest; examples of economic policy interventions may include increasing or decreasing income by some percentage, or imposing an additional year of education. These interventions may affect the covariate values in a single period (e.g., a one-time tax break) or in multiple periods (e.g., a permanent tax reduction or increasing the minimum mandatory level of education).

For specificity, suppose that one thinks of setting {x_1it}_{t=1}^{T_i} to the values {x†_1it}_{t=1}^{T_i}. (Again, only a subset of these values could be affected by the intervention, whereas the others can remain unchanged.) For a predictive horizon of t = 1, 2,...,T_i (where T_i is the smallest of the cluster sizes in the observed data), one is now interested in the distribution of y_i1, y_i2,...,y_iT_i marginalized over {x_2it}, b_i, and θ = (β, φ, g, D, τ², ρ), given the current data y = (y_1,...,y_n). A practical procedure is to marginalize out the covariates as a Monte Carlo average using their empirical distribution, while also integrating out θ with respect to the posterior distribution π(θ|y). Of course, b_i is independent of y and hence can be integrated out of the joint distribution of {z_i1,...,z_iT_i} analytically using the distribution N(0, D), without recourse to Monte Carlo. Therefore, the goal is to obtain a sample of draws from the distribution

$$ \big[z_{i1},\ldots,z_{iT_i} \,\big|\, y, \{x^{\dagger}_{1it}\}\big] = \int \big[z_{i1},\ldots,z_{iT_i} \,\big|\, y, \{x^{\dagger}_{1it}\}, \{x_{2it}\}, \{w_{it}\}, \{s_{it}\}, \theta\big]\; \pi(\{x_{2it}\}, \{w_{it}\}, \{s_{it}\})\, \pi(\theta\,|\,y)\; d\{x_{2it}\}\, d\{w_{it}\}\, d\{s_{it}\}\, d\theta. $$

A sample from this predictive distribution can be obtained by the method of composition applied in the following way. Randomly draw an individual and extract the sequence of covariate values. Sample a value for θ from the posterior density, and sample {z_i1,...,z_iT_i} jointly from [z_i1,...,z_iT_i | y, {x†_1it}, {x_2it}, {w_it}, {s_it}, θ], constructing the {y_it} in the usual way. Repeat this for other individuals and other draws from the posterior distribution to obtain the predictive probability mass function of (y_i1,...,y_iT_i). Repeat this analysis for a different {x_1it}, say {x‡_1it}. The difference in the computed pointwise probabilities then gives the effect of x_1 as the values {x†_1it} are changed to {x‡_1it}.

This approach can be similarly applied to other elements of x_it and w_it. Quite significantly, it can be applied in determining the effect of the nonparametric component g(s), because the error bands that are usually reported in the estimation of g(·) are pointwise, not joint, and do not provide sufficient information on which to make probabilistic statements about the functional shape (Hastie and Tibshirani 1990, p. 62), differences such as g(s†) − g(s‡), and the effect of s on the probability of response. In addition, we can condition on certain variables (gender, race, and specific initial conditions) that might determine a particular subsample of interest, in which case our procedures are applied only to observations in the subsample. An application of the techniques is considered in Section 7, where Figure 8 shows the estimated average covariate effects for husband's income and two child variables and Figure 9 shows these effects for two specific subsamples.

5. MODEL COMPARISON

A central issue in the analysis of statistical data is model formulation, because the appropriate specification is rarely known and is subject to uncertainty. Among other considerations, the uncertainty may be due to the problem of variable selection (i.e., the specific covariates and lags to be included in the model), the functional specification (a parametric versus a semiparametric model), or the distributional assumptions.

In general, given the data y, interest centers on a collection of models {M_1,...,M_L} representing competing hypotheses about y. Each model M_l is characterized by a model-specific parameter vector θ_l and sampling density f(y|M_l, θ_l). Bayesian model selection proceeds by comparing the models in {M_l} through their posterior odds ratio, which for any two models M_i and M_j is written as

$$ \frac{\Pr(\mathcal{M}_i\,|\,y)}{\Pr(\mathcal{M}_j\,|\,y)} = \frac{\Pr(\mathcal{M}_i)}{\Pr(\mathcal{M}_j)} \times \frac{m(y\,|\,\mathcal{M}_i)}{m(y\,|\,\mathcal{M}_j)}, \qquad (15) $$

where m(y|M_l) = ∫ f(y|M_l, θ_l)π(θ_l|M_l) dθ_l is the marginal likelihood of M_l. The first fraction on the right side of (15) is known as the prior odds; the second is known as the Bayes factor.

To date, model comparisons in the semiparametric context have been based on such criteria as the AIC and BIC (e.g., Shively et al. 1999; Wood et al. 2002; DiMatteo et al. 2001; Hansen and Kooperberg 2002), because direct evaluation of the integral defining m(y|M_l) is generally infeasible. But the AIC and BIC cannot be computed for the model in this article because of the difficulty (discussed in Sec. 3) of maximizing the likelihood in the semiparametric binary panel setting, whereas both the AIC and the BIC require the maximized value of the likelihood as input. Here we revisit the question of calculating the marginal likelihood of the semiparametric model and show that it can be managed through careful application of existing methods.

In particular, Chib (1995) provided a method based on the recognition that the marginal likelihood can be expressed as m(y|M_l) = f(y|M_l, θ_l)π(θ_l|M_l)/π(θ_l|y, M_l). This identity holds for any point θ_l, so that calculation of the marginal likelihood is reduced to finding an estimate of the posterior ordinate π(θ*_l|y, M_l) at a single point θ*_l. In what follows, we suppress the model index for notational convenience. Suppose that the parameter vector θ is split into B conveniently specified components or blocks (usually done on the basis of the natural groupings that emerge in constructing the MCMC sampler), so that θ = (θ_1,...,θ_B).
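Returning to the average covariate effect calculation of Section 4, the following schematic sketch shows the method-of-composition logic. The helper `sample_latent_path` and the container `panel` are hypothetical placeholders (not part of the article): the former is assumed to simulate (y_1,...,y_T) for one individual given a parameter draw and the covariate sequences, with the random effects integrated out over N(0, D).

```python
# Schematic method-of-composition estimate of the average effect of
# changing x1 from x1_dagger to x1_ddagger on Pr(y_t = 1), t = 1,...,T.
import numpy as np

def average_effect(posterior_draws, panel, x1_dagger, x1_ddagger,
                   sample_latent_path, T, rng):
    diff = np.zeros(T)
    for theta in posterior_draws:                 # draws of theta = (beta, phi, g, D, tau2, rho)
        i = rng.integers(len(panel))              # randomly draw an individual
        covs = panel[i]                           # its observed x2, w, s sequences
        y_dag = sample_latent_path(theta, covs, x1_dagger, rng)    # simulate under x1 = x1_dagger
        y_ddag = sample_latent_path(theta, covs, x1_ddagger, rng)  # simulate under x1 = x1_ddagger
        diff += np.asarray(y_dag) - np.asarray(y_ddag)
    return diff / len(posterior_draws)            # pointwise differences in predictive probabilities
```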

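A minimal numerical sketch of the marginal likelihood identity above: on the log scale, log m(y) = log f(y|θ*) + log π(θ*) − log π(θ*|y), where the posterior ordinate is accumulated block by block from reduced MCMC runs. The ordinate values used here are illustrative placeholders, not output from the article's models.

```python
# Chib (1995) identity on the log scale with placeholder ordinate values.
import numpy as np

log_likelihood = -1432.7                         # log f(y | theta*), e.g., estimated by simulation
log_prior = -18.4                                # log pi(theta*)
log_block_ordinates = [-3.1, -2.2, -4.0, -6.5]   # log pi(theta_i* | y, theta_1*, ..., theta_{i-1}*)

log_marglik = log_likelihood + log_prior - np.sum(log_block_ordinates)
print("log marginal likelihood:", round(log_marglik, 2))

# Log Bayes factor of model 1 vs. model 2 from two such calculations (second value is a placeholder).
print("log Bayes factor:", round(log_marglik - (-1445.0), 2))
```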
Let ψ*_i = (θ*_1,...,θ*_i) denote the blocks up to i, fixed at their values in θ*, and let ψ^{i+1} = (θ_{i+1},...,θ_B) denote the blocks beyond i. Then, by the law of total probability, we have

$$ \pi(\theta_1^*,\ldots,\theta_B^*\,|\,y) = \prod_{i=1}^{B} \pi(\theta_i^*\,|\,y, \theta_1^*,\ldots,\theta_{i-1}^*) = \prod_{i=1}^{B} \pi(\theta_i^*\,|\,y, \psi_{i-1}^*). $$

When the full-conditional densities are known, each ordinate π(θ*_i | y, ψ*_{i−1}) can be estimated by Rao–Blackwellization as π(θ*_i | y, ψ*_{i−1}) ≈ J^{-1} Σ_{j=1}^J π(θ*_i | y, ψ*_{i−1}, ψ^{i+1,(j)}), where ψ^{i,(j)} ∼ π(ψ^i | y, ψ*_{i−1}), j = 1,...,J, come from a reduced run for 1 < i < B, in which sampling is only over ψ^i, with the blocks ψ*_{i−1} held fixed. The ordinate π(θ*_1 | y) for the first block of parameters θ_1 is estimated with draws θ ∼ π(θ|y) from the main MCMC run, whereas the ordinate π(θ*_B | y, ψ*_{B−1}) is available directly.

When one or more of the full conditional densities are not of a standard form and sampling requires the M–H algorithm, Chib and Jeliazkov (2001) used the local reversibility of the M–H chain to show that

$$ \pi(\theta_i^*\,|\,y,\psi_{i-1}^*) = \frac{E_1\{\alpha(\theta_i, \theta_i^*\,|\,y, \psi_{i-1}^*, \psi^{i+1})\, q(\theta_i, \theta_i^*\,|\,y, \psi_{i-1}^*, \psi^{i+1})\}}{E_2\{\alpha(\theta_i^*, \theta_i\,|\,y, \psi_{i-1}^*, \psi^{i+1})\}}, \qquad (16) $$

where E_1 is the expectation with respect to the conditional posterior π(ψ^i | y, ψ*_{i−1}) and E_2 is that with respect to the conditional product measure π(ψ^{i+1} | y, ψ*_i) q(θ*_i, θ_i | y, ψ*_{i−1}, ψ^{i+1}). In the preceding, q(θ, θ'|y) denotes the candidate generating density of the M–H chain for moving from the current value θ to a proposed value θ', and α(θ_i, θ'_i | y, ψ*_{i−1}, ψ^{i+1}) denotes the M–H probability of moving from θ_i to θ'_i. Each of these expectations can be computed from the output of appropriately chosen reduced runs.

Three new and important considerations emerge when applying these methods to the model considered in this article. Our implementation, using Algorithm 1 to simulate the blocks {z_i}, β, {b_i}, {u_i}, D, g, τ², and ρ, is based on the following posterior decomposition (marginalized over the latent data {z_i}, {b_i}, {u_i}):

$$ \pi(D^*, \tau^{2*}\,|\,y)\; \pi(\beta^*\,|\,y, D^*, \tau^{2*})\; \pi(\rho^*\,|\,y, D^*, \tau^{2*}, \beta^*)\; \pi(g^*\,|\,y, D^*, \tau^{2*}, \beta^*, \rho^*). $$

The first consideration is that the ordinate of g is estimated last, because this tends to improve the efficiency of the ordinate estimation in the Rao–Blackwellization step, where

$$ \pi(g^*\,|\,y, D^*, \tau^{2*}, \beta^*, \rho^*) \approx J^{-1} \sum_{j=1}^{J} \pi\big(g^*\,|\,y, D^*, \tau^{2*}, \beta^*, \rho^*, \{z_i\}^{(j)}, \{b_i\}^{(j)}, \{u_i\}^{(j)}\big), $$

so that marginalization is only with respect to the conditional distribution of the latent data π({z_i}, {b_i}, {u_i} | y, D*, τ^{2*}, β*, ρ*), with all parameter blocks in the conditioning set fixed. Because the simulation algorithm for g is O(m), this particular choice comes at a manageable computational cost (from having to simulate g in all of the preceding reduced runs), whereas the statistical efficiency benefits may be substantial when m is large. Second, it should also be noted that the ordinate π(D*, τ^{2*}|y) can be estimated jointly because, conditional on {b_i} and g, the full conditional densities of D and τ² are independent. This observation saves a reduced run. Third, in Algorithm 1 the proposal density q(ρ, ρ'|y, ·) = q(ρ'|y, ·) is a truncated normal density with an unknown normalizing constant except in the AR(1) case. Specifically, over the region of stationarity S_ρ, we have q(ρ'|y, ·) = h(ρ'|y, ·)/∫_{S_ρ} h(ρ|y, ·) dρ, where h(ρ|y, ·) is an unrestricted normal density for ρ. We avoid the need to compute this unknown constant of integration by noting that when the reversibility condition used by Chib and Jeliazkov (2001) to obtain (16) is written in terms of q(ρ|y, ·), its unknown normalizing constant (being the same on both sides) will cancel out, so that, on integration, we have

$$ \pi(\rho^*\,|\,y, \psi_{i-1}^*) = \frac{E_1\{\alpha(\rho, \rho^*\,|\,y, \psi_{i-1}^*, \psi^{i+1})\, h(\rho^*\,|\,y, \psi_{i-1}^*, \psi^{i+1})\}}{E_2\{\alpha(\rho^*, \rho\,|\,y, \psi_{i-1}^*, \psi^{i+1})\}}, $$

where ψ*_{i−1} = (D*, τ^{2*}, β*), ψ^{i+1} = (g, {z_i}, {b_i}, {u_i}), E_1 is the expectation with respect to π(ρ, ψ^{i+1} | y, ψ*_{i−1}), and E_2 is the expectation with respect to π(ψ^{i+1} | y, ψ*_i) h(ρ | y, ψ*_{i−1}, ψ^{i+1}). It is also useful to note that because h(ρ'|y, ·) does not depend on the current value of ρ in the sampler, estimating the denominator quantity is done with draws available from the same run in which the numerator is estimated.

Implementing these methods requires the likelihood ordinate f(y|D*, β*, g*, ρ*), which we obtain by the method of Geweke, Hajivassiliou, and Keane (GHK) (Geweke 1991; Börsch-Supan and Hajivassiliou 1993; Keane 1994), using 10,000 Monte Carlo iterations.

6. SIMULATION STUDY

We carried out a simulation study to examine the performance of the techniques proposed in Algorithm 1. For the estimates of the nonparametric function, we report mean squared errors (MSEs) for several designs. The posterior mean estimates, E{g(v)|y}, are found from MCMC runs of length 5,000 after burn-ins of 1,000 cycles. For the parametric components of the model, we report the autocorrelations and inefficiency factors under alternative specifications and sample sizes. We find that the overall performance of the MCMC algorithm improves with larger sample sizes (increasing either the number of clusters n or the cluster sizes {T_i}), and that the random effects are simulated better when the increase in sample size comes from larger {T_i}.

Data are simulated from the model in (1) and (7), without serial correlation, using one, two, and three state-dependence lags, a single fixed-effect covariate X̃, and one or two individual-effect covariates W (including a random intercept) that are correlated with (the average of) the initial conditions. X̃ and W contain independent standard normal random variables, and we use δ = 1, γ = 1, φ = .5 × 1, and D = .2 × I. We generate panels with 250, 500, and 1,000 clusters and with 10 time periods, using only the last 7 for estimation (T_i = 7, i = 1,...,n), because our largest models contain 3 lags. We consider three functional forms for g(s), presented in Figure 1, which capture a range of possible specifications in the literature. For the simulations, each function is evaluated on a regular grid of m = 51 points.

We gauge the performance of the method in fitting these functions using MSE = (1/m) Σ_{j=1}^m {ĝ(v_j) − g(v_j)}². The average MSE, together with the standard errors based on 20 data samples, is reported in Table 1 for alternative designs and sample sizes. In all cases, we have used comparable mild priors β ∼ N(0, 10I), (g_1, g_2)'|τ² ∼ N(0, τ² × 100 I), D^{-1} ∼ W(q + 4, 1.67 × I_q), and τ² ∼ IG(3, .1); the priors on the variance parameters imply that E(D) = .2 × I, SD(diag(D)) = .283 × 1, E(τ²) = .05, SD(τ²) = .05, and E_{τ²}(var((g_1, g_2)')) = 5 × I. (We discuss the role of the smoothness prior in more detail later.) Table 1 also shows that as the sample size grows, the functions are estimated more precisely, as expected. Also, in line with conventional wisdom, the general trend seems to be that fitting models with fewer parameters results in lower MSE estimates. We clarify that under this setup, increasing the number of lags J affects the simulation study in two ways: first, the dimension of the parameter space increases, and second, it affects the proportion of 1's among the responses (because all elements in φ are positive). It is well known that the degree of asymmetry in the proportion of the responses affects the estimation precision. The proportion of 1's is between .62 and .67 across the three functional specifications for our one-lag models, between .67 and .72 for the two-lag models, and between .71 and .76 for the three-lag models. As Table 1 shows, however, the method recovers the true functions well despite this asymmetry.

Table 1. Average Mean Squared Errors (with estimated standard errors in parentheses)

                                         Average MSE (×10⁻²)
Clusters     Lags    Random effects      g1            g2            g3
n = 250      J = 1   q = 1          1.60 (.24)    1.40 (.14)    1.77 (.32)
                     q = 2          1.92 (.26)    1.85 (.23)    1.41 (.19)
             J = 2   q = 1          2.21 (.43)    1.61 (.21)    1.78 (.37)
                     q = 2          2.72 (.48)    2.82 (.40)    2.62 (.48)
             J = 3   q = 1          2.89 (.99)    1.98 (.54)    2.93 (.41)
                     q = 2          3.42 (.67)    2.12 (.60)    3.02 (.75)
n = 500      J = 1   q = 1           .68 (.08)     .88 (.14)    1.04 (.21)
                     q = 2           .98 (.15)    1.16 (.13)     .98 (.09)
             J = 2   q = 1           .96 (.15)     .91 (.23)    1.05 (.18)
                     q = 2          1.43 (.17)    1.69 (.28)    1.57 (.25)
             J = 3   q = 1          2.57 (.41)    1.65 (.38)    1.81 (.26)
                     q = 2          1.82 (.31)    2.49 (.58)    1.76 (.22)
n = 1,000    J = 1   q = 1           .64 (.12)     .63 (.11)     .60 (.10)
                     q = 2           .56 (.07)     .58 (.08)     .68 (.10)
             J = 2   q = 1           .83 (.13)     .52 (.09)     .72 (.11)
                     q = 2           .71 (.11)     .69 (.17)     .57 (.08)
             J = 3   q = 1          1.27 (.29)     .60 (.09)     .65 (.11)
                     q = 2           .94 (.19)    1.19 (.28)     .84 (.12)

NOTE: The three functions used in generating data for the simulation study were g1(s) = sin(2πs), s ∈ [.6, 1.4]; g2(s) = −1 + s + 1.6s² + sin(5s), s ∈ [0, 1.1]; g3(s) = −.8 + s + exp{−30(s − .5)²}, s ∈ [0, 1].

The computational cost per 1,000 MCMC draws is approximately 16 seconds in the simplest case (n = 250, J = 1, q = 1). Adding more fixed effects (e.g., lags) does not increase the costs by more than 1/10 of a second. However, adding an additional random effect increases the computational cost to 58 seconds per 1,000 draws, because of the expense of simulating a larger random-effects vector in each cluster. Finally, the computational times increased linearly in relation to the sample size n. Figure 2 presents three particular function fits for the case (n = 500, J = 2, q = 1).

We now comment on the influence of the prior on our function estimates. The model as a whole has three important variance structures, and the relative informativeness of the priors for each of these structures should be viewed in the context of the other two structures, not in isolation. As already discussed, the variance of the errors and the variance τ² of the Markov process prior determine the trade-off between a good fit and a smooth function g(·). Similarly, the variance of the errors and the variance of the random effects D determine a balance between intracluster and intercluster variation.

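As background for the likelihood ordinate f(y|D*, β*, g*, ρ*) of Section 5, the sketch below is a textbook GHK probability simulator, not the authors' implementation: it estimates the orthant probability of a multivariate normal vector whose orthant is determined by the signs implied by a binary outcome sequence.

```python
# Generic GHK estimate of Pr(z in orthant | z ~ N(mu, V)), orthant set by y.
import numpy as np
from scipy.stats import norm

def ghk_orthant_prob(mu, V, y, n_draws=10000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    T = len(mu)
    L = np.linalg.cholesky(V)                  # z = mu + L * eta, eta ~ N(0, I)
    prob = np.ones(n_draws)
    eta = np.zeros((n_draws, T))
    for t in range(T):
        drift = mu[t] + eta[:, :t] @ L[t, :t]  # conditional mean given earlier etas
        cut = -drift / L[t, t]                 # z_t > 0 iff eta_t > cut
        if y[t] == 1:
            lo, hi = norm.cdf(cut), 1.0
        else:
            lo, hi = 0.0, norm.cdf(cut)
        prob *= hi - lo                        # sequential conditioning probability
        u = lo + (hi - lo) * rng.uniform(size=n_draws)
        eta[:, t] = norm.ppf(np.clip(u, 1e-12, 1 - 1e-12))
    return prob.mean()

# toy check against the exact value when V = I
p = ghk_orthant_prob(np.array([0.3, -0.2]), np.eye(2), y=[1, 0])
print(round(p, 3), round(norm.cdf(0.3) * norm.cdf(0.2), 3))
```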
Figure 1. The Three Functions Used in Generating Data for the Simulation Study: g(s) = sin(2πs), s ∈ [.6, 1.4]; g(s) = −1 + s + 1.6s² + sin(5s), s ∈ [0, 1.1]; g(s) = −.8 + s + exp{−30(s − .5)²}, s ∈ [0, 1].

Figure 2. Simulation Study. Three examples of estimated functions (solid), true functions (dash-dot), and 95% pointwise confidence bands (dashed).

Whereas in large samples the effects of the assumed priors on the parameter estimates are small (vanishing asymptotically), informative priors do matter in small samples. Figure 3 illustrates this point by using two exaggeratedly different informative priors on τ². In one case the prior on τ² is such that E(τ²) = .5 and SD(τ²) = .1; in the second case E(τ²) = .001 and SD(τ²) = .001. Figure 3 illustrates that the first prior leads to a function that becomes more wiggly as it curves to interpolate the data more closely, whereas the second prior leads to oversmoothing. When the sample size is increased from n = 250 to n = 1,000, the difference in the function estimates becomes smaller.

Figure 3. Effect of τ² on the Estimates of g: n = 250 [(a) and (c)] and n = 1,000 [(b) and (d)]. A larger τ² is used in (a) and (b) than in (c) and (d). The dashed lines represent pointwise confidence bands.

Next we consider the performance of the MCMC algorithm in fitting the parametric part of the model. For example, the case where n = 500 (with T_i = 7) is illustrated in Figure 4, which shows histograms and kernel-smoothed marginal posterior densities of the parameters together with the corresponding autocorrelations from the sampled output. The linear effects, together with τ², appear to be estimated well, and the output is characterized by low autocorrelations. Although it can be seen that D is estimated well, its higher autocorrelation indicates slower mixing than that of the remaining parameters, so that longer MCMC runs may be needed to accurately describe the marginal posterior density of D. The slower mixing occurs because D is a parameter at the second level of the modeling hierarchy and depends on the data only indirectly through {b_i}. Because the {b_i} are not well identified in smaller clusters, when only a few observations are available to identify the cluster-specific effects, and because learning about D occurs from the intercluster variation of {b_i}, D also suffers from weak identification when cluster sizes are small. To measure the efficiency of the MCMC parameter sampling scheme, we use the measures 1 + 2 Σ_{l=1}^L ρ_k(l), where ρ_k(l) is the sample autocorrelation at lag l for the kth parameter in the sampling, with the summation truncated at a value L at which the correlations taper off. The latter quantity, called the inefficiency factor, may be interpreted as the ratio of the numerical variance of the posterior mean from the MCMC chain to the variance of the posterior mean from hypothetical independent draws.
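A small sketch of this inefficiency factor computation follows; the taper-off truncation rule used here (stop once the autocorrelation falls below a threshold) is one simple choice, not necessarily the one used by the authors.

```python
# Inefficiency factor 1 + 2 * sum_{l=1}^{L} rho_k(l) for one parameter's MCMC draws.
import numpy as np

def inefficiency_factor(draws, max_lag=200, cutoff=0.05):
    x = np.asarray(draws) - np.mean(draws)
    den = np.dot(x, x)
    total = 0.0
    for l in range(1, min(max_lag, len(x) - 1)):
        rho_l = np.dot(x[:-l], x[l:]) / den      # sample autocorrelation at lag l
        if rho_l < cutoff:                       # taper-off truncation
            break
        total += rho_l
    return 1.0 + 2.0 * total

rng = np.random.default_rng(2)
chain = np.zeros(5000)
for t in range(1, 5000):                         # AR(1) chain with lag-1 correlation .8
    chain[t] = 0.8 * chain[t - 1] + rng.standard_normal()
print(round(inefficiency_factor(chain), 1))      # close to (1 + .8)/(1 - .8) = 9
```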

Figure 4. Posterior Samples and Autocorrelations for the Parameters of a Semiparametric Model With One Fixed Effect, One Random Effect, and One Lag (T_i = 7, i = 1,...,n).

Table 2. Examples of Estimated Inefficiency Factors (autocorrelation times) for the Parameters of a Model With One Lag, One Random Effect, and One Fixed Effect for n = 500

                 Inefficiency factor
Parameter   Ti = 7    Ti = 12    Ti = 17
δ            13.68      10.18      12.55
γ             9.62       6.16       7.25
φ            12.11       8.31       9.03
D            24.00      13.28       9.41
τ²           10.65       5.26       8.74

Table 2 gives the inefficiency factors corresponding to the parameters for the same model as before, but now with different cluster sizes (7, 12, and 17 observations per cluster). In this setup the larger cluster sizes serve to identify {bi} better, allowing more precise capture of intercluster variation. Table 2 shows that the inefficiency factor for D drops considerably (whereas the other inefficiency factors stay within a similar range). The improved sampling of D is evident from comparing Figures 4 and 5, with the latter summarizing the MCMC output when Ti = 17.

Similar to the cases given in Table 2, Table 3 presents the inefficiency factors for a model with two lags and two random effects. Because now not one but two random effects are estimated from the limited observations in each cluster, the elements of D are sampled with somewhat higher inefficiency factors. Here again, however, Table 3 shows that as the cluster sizes increase, resulting in better identification of {bi}, the inefficiency factors for the elements of the heterogeneity matrix D drop noticeably.

In the rest of this section, we report results from experiments involving a model with AR(1) correlated errors. Table 4 presents the inefficiency factors for three cluster sizes and two values of ρ. The parameters ρ and D are sampled well in all cases, but it is interesting to see that when ρ is positive, both ρ and D have higher inefficiency factors than otherwise. This is because both are estimated from the covariance of the errors, and decomposing that matrix into an equicorrelated part (with positive elements implied by the random intercept) and a Toeplitz part [implied by the AR(1) process, which also has positive elements when ρ > 0] is difficult in small samples. As the cluster sizes increase, D is identified better, so both ρ and D are estimated better. This does not appear to be a problem when ρ < 0, because then the two correlation structures implied by ρ and D are quite different. For the samplers and values of ρ considered here, the M–H acceptance rate in the sampling of ρ is in the range (.87, .98).

These results show how serial correlation in the errors affects the performance of the sampler, but it is also of interest to note that misspecifying the correlation structure has a definite impact on the remaining components of the model. As mentioned earlier, several studies, including those of Diggle and Hutchinson (1989), Altman (1990), and Smith et al. (1998), have pointed out that serial correlation, if incorrectly ignored, can have substantial adverse consequences for the estimation of the nonparametric function.
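The identification issue described above can be seen from the within-cluster covariance implied by a random intercept plus AR(1) noise. The following sketch is a simplified illustration with a scalar random intercept (not the authors' sampler): for ρ > 0 both components add positive off-diagonal mass and are hard to separate in small clusters, whereas for ρ < 0 the AR(1) part alternates in sign.

import numpy as np

def cluster_covariance(T, d, rho):
    # Covariance of u_t = b + e_t over t = 1,...,T, where b ~ N(0, d) is a
    # random intercept and e_t is a stationary AR(1) with unit innovation variance.
    equi = d * np.ones((T, T))                              # equicorrelated part (from b)
    lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    toeplitz = rho ** lags / (1.0 - rho ** 2)               # AR(1) Toeplitz part
    return equi + toeplitz

print(cluster_covariance(4, d=.5, rho=.5).round(2))    # both parts positive off-diagonal
print(cluster_covariance(4, d=.5, rho=-.5).round(2))   # AR(1) part alternates in sign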

Figure 5. Posterior Samples and Autocorrelations for the Parameters of a Semiparametric Model With One Fixed Effect, One Random Effect, and One Lag (Ti = 17, i = 1, ..., n).

Table 3. Examples of Estimated Inefficiency Factors for the Parameters of a Model With Two Lags, Two Random Effects, and One Fixed Effect

                                          Inefficiency factor
                 n = 250                      n = 500                      n = 1,000
Parameter   Ti = 7  Ti = 12  Ti = 17    Ti = 7  Ti = 12  Ti = 17    Ti = 7  Ti = 12  Ti = 17
δ            21.16    18.09    18.88     19.89    16.07    14.72     20.79    22.46    20.65
γ1           12.19    12.20    11.51     14.73    13.00    12.82     12.16    13.57    12.34
γ2           11.40     8.48     9.61     15.64     8.84     7.33     14.64    10.07     9.56
γ3            9.06    10.36     8.41     11.75    10.48     9.19     12.59    12.90     8.44
φ1            9.29     9.12     8.00      7.01    10.96    11.07     14.36    10.23    10.81
φ2            8.39     7.60    10.14      7.69     7.52     8.48      9.15    11.14    10.01
D11          27.01    22.46    18.87     31.56    23.88    18.72     34.10    25.95    18.08
D12          29.73    23.16    18.16     29.88    25.81    14.91     26.31    20.25    18.38
D22          26.24    26.37    17.63     34.72    27.87    18.09     34.56    24.67    21.80
τ²           10.58    12.45    10.17     14.74     9.47     7.80      8.02     6.05     6.02

is that ignoring the serial correlation can lead to biases in the estimates of the heterogeneity matrix D. In line with the discussion above, we have found that ignoring the serial correlation distorts the estimates of D, especially with smaller cluster sizes or with high positive serial correlation. Ignoring the serial correlation in the errors also produces estimates of the lag coefficients φ that differ widely from the true values used in generating the data (in some settings even having the opposite sign). One should also keep in mind that differences can also occur because of the usual identification restriction in binary data models, namely that the error variance is fixed. For example, if the errors follow the AR(1) process ε_{it} = ρε_{i,t−1} + v_{it} with var(v_{it}) = 1, then the unconditional variance of ε_{it} is inflated to 1/(1 − ρ²). This is one of the points raised by Smith et al. (1998) in the context of a continuous-data model, where this larger error variance was shown to produce less efficient nonparametric function estimates. In the context of a binary data problem, ignoring serial correlation and setting var(ε_{it}) = 1 then corresponds to implicitly rescaling the regression parameters by a factor of (1 − ρ²)^{1/2}.
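To make the rescaling explicit, let z_{it} denote the systematic part of the latent index (our shorthand; this restates the identity above rather than adding a new result):

\[
\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + v_{it}, \quad \operatorname{var}(v_{it}) = 1
\;\;\Longrightarrow\;\;
\operatorname{var}(\varepsilon_{it}) = \frac{1}{1-\rho^{2}},
\]
\[
y_{it} = \mathbf{1}\{z_{it} + \varepsilon_{it} > 0\}
       = \mathbf{1}\bigl\{(1-\rho^{2})^{1/2} z_{it} + (1-\rho^{2})^{1/2}\varepsilon_{it} > 0\bigr\},
\]

and the rescaled error (1 − ρ²)^{1/2} ε_{it} has unit variance, so a specification that fixes the error variance at 1 while ignoring the serial correlation identifies the regression coefficients only up to the factor (1 − ρ²)^{1/2}.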
We take all of these results together as a strong signal not to ignore the modeling of the three causes of intertemporal dependence discussed in Section 1: state dependence due to lags of the dependent variable, serial correlation in the errors, and heterogeneity among clusters.

In terms of computational intensity, our method was quite efficient. For example, in one of our smaller datasets (n = 250, q = 1, J = 1, and Ti = 7, i = 1, ..., n), Algorithm 1 took approximately 22 seconds to produce 1,000 MCMC draws, varying by less than .2 second as we increased the dimension of g from m = 50 to m = 200. For comparison, we also tried a "brute-force" algorithm that did not exploit banded matrices or our approach for dealing with correlated errors. In this case the computational cost of 1,000 MCMC draws was 35 seconds for m = 50, 103 seconds for m = 100, and 405 seconds for m = 200. We suspect that the brute-force algorithm would become largely infeasible in even higher-dimensional problems because of its computational intensity and storage requirements.

To summarize, the results suggest that the MCMC algorithm performs well and that the estimation method recovers the parameters and functions used to generate the data. The performance of the method in recovering the nonparametric function g(·) and the model parameters improves with the sample size, when the model is better identified. Most noticeably, the sampling of D benefits strongly from larger cluster sizes.

Table 4. Examples of Estimated Inefficiency Factors for the Parameters of the Model With One Lag, One Random Effect, One Fixed Effect, and AR(1) Serial Correlation for n = 500

             Inefficiency factor (ρ = −.5)       Inefficiency factor (ρ = .5)
Parameter   Ti = 7    Ti = 12    Ti = 17        Ti = 7    Ti = 12    Ti = 17
δ            19.06      17.43      20.19         20.75      14.06      17.72
γ             9.34       7.30       7.86         11.42      11.30      12.05
φ            23.05      25.63      18.24         17.00      20.24      20.11
D            21.48      16.21      12.75         39.47      28.02      15.19
τ²            9.68       7.08       6.01         11.84       7.97       7.97
ρ            20.50      21.35      20.28         32.40      27.12      26.45

7. INTERTEMPORAL LABOR FORCE PARTICIPATION OF MARRIED WOMEN

In this section we consider an application to the annual labor force participation decisions of 1,545 married women in the age range 17–66. The dataset, obtained from the PSID, is based on the work of Hyslop (1999) and contains a panel of women's working status indicators (1, working during the year; 0, not working) over a 7-year period (1979–1985), together with a set of seven covariates given in Table 5. The sample consists of continuously married couples where the husband is a labor force participant (reporting both positive earnings and hours worked) in each of the sample years. Similar data have been analyzed by Chib and Greenberg (1998), Avery, Hansen, and Hotz (1983), and Hyslop (1999) using other models and estimation techniques.

A key feature of our modeling is that the effect of age on the conditional probability of working is specified nonparametrically. There are compelling reasons for doing this. Nonlinearities arise because of changes in tastes, trade-offs, and age-related health over a woman's life cycle; because age is indicative of the expected timing of events (e.g., graduation from school or college, planning for children); and because age may be revealing of social values and education type (cohort effect) or experience as a homemaker and in the labor market (productivity effect).

Table 5. Variables in the Women’s Labor Force Participation Application

Variable   Explanation                                                Mean      SD
WORK       Wife's labor force status (1, working; 0, not working)     .7097     .4539
INT        An intercept term (a column of 1's)
AGE        The woman's age in years                                   36.0262   9.7737
RACE       1 if black, 0 otherwise                                    .1974     .3981
EDU        Attained education (in years) at time of survey            12.4858   2.1105
CH2        Number of children age 0–2 in that year                    .2655     .4981
CH5        Number of children age 3–5 in that year                    .3120     .5329
INC        Total annual labor income of head of household             31.7931   22.6417

NOTE: INC (in thousands of dollars) is measured as nominal earnings adjusted by the Consumer Price Index (base year 1987).

To account for model uncertainty, we evaluated several competing models that differ in their state dependence, serial correlation, and heterogeneity using the following baseline specification:

\[
y_{it} = \mathbf{1}\bigl\{\tilde{x}_{it}'\delta + w_{it}'\beta_i + g(s_{it})
       + \phi_1 y_{i,t-1} + \cdots + \phi_J y_{i,t-J} + \varepsilon_{it} > 0\bigr\},
\]
\[
\beta_i = A_i\gamma + b_i, \qquad b_i \sim N_3(0, D),
\]

where y_it = WORK_it, x̃_it = (RACE_i, EDU_it, ln(INC_it)), s_it = AGE_it, w_it = (1, CH2_it, CH5_it), the errors ε_it are potentially serially correlated, and the effects of CH2 and CH5 are allowed to depend on the husband's earnings and the initial conditions through

\[
A_i = \begin{pmatrix}
\bar{y}_{i0} & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & \bar{y}_{i0} & \ln(\mathrm{INC}_i) & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & \bar{y}_{i0} & \ln(\mathrm{INC}_i)
\end{pmatrix}.
\]

Such issues as variable selection, lag determination, and correlation between the unobserved effects and covariates are handled as model selection problems by computing the marginal likelihoods of competing models. The semiparametric models are also compared against two parametric alternatives. The more important competing models are presented in Table 6.

Models M1–M4 have differing dynamics. Table 6 shows that allowing for AR(1) errors improves the performance of the one-lag state dependence model; serial correlation is present in the errors of that model (with ρ estimated as −.288, with a posterior standard deviation of .048). However, the data in this application strongly favor a model in which state dependence is incorporated through two lags of the dependent variable. Single-lag models (M1 and M2) have marginal likelihoods much lower than those of the corresponding two-lag versions (M3 and M4). We also see that inclusion of the second lag removes the serial correlation in the errors and results in a marginal likelihood for M3 that is the highest among the models considered in Table 6. For model M4, the estimated value of ρ is −.047, with a 95% confidence interval of (−.224, .113). Models with higher-order dynamic dependence were considered but received less support than M3. Regarding the heterogeneity in the model, we see from Table 6 that the single unobserved effect model (M5) is decisively less supported than M3.

We now discuss models M6 and M7 in the context of the estimate of the nonparametric function g(AGE) from model M3. Figure 6 shows that the estimated effect of age departs notably from linearity. To examine whether a parametric model can adequately fit the data, we consider two parametric models (M6 and M7), where g(AGE) is restricted to be linear or quadratic. [Because now g(·) is not general, Ai is not restricted for identification purposes.] The comparisons are shown in Figure 7. The estimates suggest that the linear model can be deceiving, because it produced a negative coefficient estimate for age of −.0105, with 95% credibility region (−.018, −.003). The quadratic model comes closer to the semiparametric fit, but still leaves some excess nonlinearity undetected. Both parametric models miss the large increase in the probability of working in the early twenties and the nonlinearity around age 30.
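As an illustration of the heterogeneity design above, the short Python sketch below (our own code; the 3 × 7 layout and ordering are inferred from the display of A_i and from Table 6) assembles A_i for one woman from her presample participation rate ȳ_i0 and her husband's log earnings:

import numpy as np

def make_Ai(y_bar_i0, log_inc_i):
    # Rows correspond to the three random effects in w_it = (1, CH2, CH5);
    # columns correspond to the seven elements of gamma reported in Table 7.
    A = np.zeros((3, 7))
    A[0, 0] = y_bar_i0                         # intercept block: y_bar_i0
    A[1, 1:4] = [1.0, y_bar_i0, log_inc_i]     # CH2 block: 1, y_bar_i0, ln(INC_i)
    A[2, 4:7] = [1.0, y_bar_i0, log_inc_i]     # CH5 block: 1, y_bar_i0, ln(INC_i)
    return A

print(make_Ai(y_bar_i0=1.0, log_inc_i=np.log(31.8)))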

Table 6. Log Marginal Likelihoods for Alternative Models in the Women’s Labor Force Participation Application (estimated from MCMC runs of length 15,000)

Model   Fixed effect                           Random effect   Nonzero elements in Ai                             ln(marginal likelihood)

Model with 1-lag state dependence and independent errors:
M1      x̃it, yi,t−1                           wit             (ȳi0; 1, ȳi0, ln(INCi); 1, ȳi0, ln(INCi))          −2,610.7

Model with 1-lag state dependence and AR(1) errors:
M2      x̃it, yi,t−1                           wit             Ai as in M1                                        −2,595.5

Model with 2-lag state dependence and independent errors:
M3      x̃it, yi,t−1, yi,t−2                   wit             Ai as in M1                                        −2,563.8

Model with 2-lag state dependence and AR(1) errors:
M4      x̃it, yi,t−1, yi,t−2                   wit             Ai as in M1                                        −2,579.6

Model with random intercept only:
M5      x̃it, yi,t−1, yi,t−2                   1               (ȳi0)                                              −2,580.1

Parametric versions of M3:
M6      x̃it, yi,t−1, yi,t−2, AGEit            wit             (1, ȳi0; 1, ȳi0, ln(INCi); 1, ȳi0, ln(INCi))       −2,581.2
M7      x̃it, yi,t−1, yi,t−2, AGEit, AGE²it    wit             Ai as in M6                                        −2,574.6
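Because Table 6 reports log marginal likelihoods, Bayes factors and posterior model probabilities (under equal prior model probabilities) follow directly. A minimal Python sketch using the tabulated values:

import numpy as np

log_ml = np.array([-2610.7, -2595.5, -2563.8, -2579.6,    # M1-M4
                   -2580.1, -2581.2, -2574.6])            # M5-M7

# Posterior model probabilities under equal prior odds (log-sum-exp shift for stability).
shifted = log_ml - log_ml.max()
post_prob = np.exp(shifted) / np.exp(shifted).sum()

# Log10 Bayes factors of M3 against each competitor.
log10_bf_M3 = (log_ml[2] - log_ml) / np.log(10.0)

print(post_prob.round(4))
print(log10_bf_M3.round(1))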

The marginal likelihoods of the parametric models, especially that of the linear model, are lower, confirming the conjecture from Figure 7 that these models do not fully account for the nonlinear effect of age.

Figure 6. The Effect of Age on the Probability of Working: Nonparametric Estimate and Pointwise Confidence Bands (E{g(AGE)|y}; upper; lower).

Figure 7. Comparison of the Linear, Quadratic, and Nonparametric Estimates of g(AGE), With Pointwise Confidence Bands.

Table 7 contains the parameter estimates of the best-fitting model (M3). We see that, conditional on the covariates, black women, better-educated women, and women whose husbands have low earnings are more likely to work. The results also indicate strong state dependence on two lags of a woman's employment status. After controlling for state dependence and the remaining covariates, we see that the presence of children has a larger effect on the probability of working when the husband's earnings increase. Finally, the positive correlation between the random intercept and the initial conditions is consistent with the notion that the initial observations are indicative of a woman's tastes and human capital. The inefficiency factors in Table 7 (defined in Sec. 6) indicate good overall performance of the MCMC sampler. But because of the large number of random effects and small cluster sizes, the elements of D are sampled less efficiently than the other parameters.

Interpretation of the estimates beyond the broad direction of impact, however, is complicated by the nonlinearity in the model and the interactions between the covariates. For example, the income and child variables enter the model in such a way as to make it difficult to disentangle and evaluate their effects. For this reason, Figure 8 presents the average effects of certain changes in the child and income covariates. More specifically, the figure presents the average effects of three hypothetical scenarios: first, an additional birth in period 1 (i.e., having an additional child age 0–2 in periods 1–3, who grows to become a child age 3–5 in periods 4 and 5); second, an additional child age 3–5 in periods 1–3; and third, a doubling of the husband's earnings.
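The average covariate effects in Figure 8 are computed with the simulation-based approach of Section 4, which accounts for the lags, the random effects, and the nonparametric term. The stylized sketch below conveys only the basic idea, namely averaging the difference in predicted probabilities between a counterfactual and a baseline covariate configuration over the sample and over posterior draws; it uses a simplified static probit-style predictor, and all names are hypothetical:

import numpy as np
from scipy.stats import norm

def average_covariate_effect(coef_draws, X_base, X_new):
    # coef_draws: (G, k) posterior draws of the linear-index coefficients.
    # X_base, X_new: (N, k) covariate matrices under the baseline and the
    #                counterfactual scenario (e.g., ln(INC) doubled).
    diff = norm.cdf(X_new @ coef_draws.T) - norm.cdf(X_base @ coef_draws.T)  # (N, G)
    return diff.mean()   # average over observations and posterior draws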

Table 7. Parameter Estimates for Model M3 , Along With 95% Confidence Intervals and Inefficiency Factors From 15,000 MCMC Iterations

Parameter   Covariate            Mean      SD       Median    Lower     Upper     Ineff
δ           RACE                  .170      .080      .169      .014      .329     7.012
            EDU                   .087      .015      .086      .057      .117    23.189
            ln(INC)              −.190      .048     −.189     −.286     −.098    16.484
γ           ȳi0                  1.371      .173     1.365     1.047     1.724    27.802
            CH2                   .142      .312      .144     −.479      .747     5.414
            (CH2)(ȳi0)           −.245      .161     −.248     −.556      .077    19.356
            (CH2)(ln(INCi))      −.135      .093     −.135     −.318      .046     6.230
            CH5                   .868      .273      .867      .339     1.416     8.358
            (CH5)(ȳi0)           −.351      .127     −.350     −.606     −.103    14.530
            (CH5)(ln(INCi))      −.221      .081     −.221     −.380     −.063     8.139
φ           yi,t−1               1.213      .071     1.213     1.072     1.348    15.863
            yi,t−2                .445      .071      .445      .308      .581    11.470
vech(D)                           .540      .129      .528      .319      .828    38.481
                                 −.043      .096     −.043     −.243      .133    45.999
                                  .137      .071      .119      .046      .319    45.617
                                 −.151      .085     −.138     −.347     −.019    43.454
                                  .017      .049      .011     −.066      .136    45.551
                                  .158      .086      .135      .047      .366    46.355
τ²                                .017      .006      .016      .009      .030     5.473


Figure 8. The Average Effect of (a) Doubling a Husband's Permanent Income, (b) Having an Additional Child Age 0–2 in Periods 1–3 and Age 3–5 in Periods 4 and 5, and (c) an Additional Child Age 3–5 in Periods 1–3.

Figure 8 shows that there is a negative overall effect of preschool children on labor supply, which is noticeably stronger for children age 0–2 than for children age 3–5 (cf. Hyslop 1999). The figure also shows that although a husband's earnings affect a woman's probability of employment in a theoretically predictable direction, increases in earnings must be quite large to induce any economically significant reduction in participation. In some situations, we are interested in the effect of a given policy intervention on different segments of the population. To explore this issue, we can compute and compare the average covariate effects for various subsamples, as discussed in Section 4. Figure 9 shows the average covariate effects for two subsamples based on their initial conditions: women who worked in both presample periods and women who worked in the first, but not the second, presample period (the other two categories are not shown). The figure reveals considerable differences in the impact of the covariates on labor force choices in these two groups and raises some interesting questions for future research.

8. CONCLUDING REMARKS

This article has considered the Bayesian analysis of hierarchical semiparametric models for binary panel data with state dependence, serially correlated errors, and multidimensional heterogeneity correlated with the covariates and initial conditions. New, computationally efficient MCMC algorithms have been developed for simulating the posterior distribution, estimating the marginal likelihood, and evaluating the average covariate effects. The techniques rely on the framework of Albert and Chib (1993) and a proper Markov process smoothness prior on the unknown function. A simulation study has shown that the methods perform well. An application involving a dynamic semiparametric model of women's labor force participation illustrated that the model and the estimation methods are practical and can uncover interesting features in the data. Formal Bayesian model choice methods allowed us to distinguish among the effects of state dependence, serial correlation, and heterogeneity, and to compare parametric and semiparametric models.

One benefit of the model considered herein is that it can be inserted as a component in a larger hierarchical model (e.g., a treatment model or a model with incidental truncation). The general method is also applicable to panels of continuous and censored data. We intend to explore the effectiveness of such approaches in future work.

[Received November 2004. Revised June 2005.]

Figure 9. Average Covariate Effects for Two Subsamples: Women Who Worked in Both Presample Periods (first row) and Women Who Worked in the First, but Not the Second, Presample Period (second row).

REFERENCES

Albert, J., and Chib, S. (1993), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669–679.
Altman, N. S. (1990), "Kernel Smoothing of Data With Correlated Errors," Journal of the American Statistical Association, 85, 749–759.
Avery, R., Hansen, L., and Hotz, V. (1983), "Multiperiod Probit Models and Orthogonality Condition Estimation," International Economic Review, 24, 21–35.
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), "Bayesian Computation and Stochastic Systems" (with discussion), Statistical Science, 10, 3–66.
Börsch-Supan, A., and Hajivassiliou, V. (1993), "Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models," Journal of Econometrics, 58, 347–368.
Chib, S. (1993), "Bayes Estimation of Regressions With Autoregressive Errors: A Gibbs Sampling Approach," Journal of Econometrics, 58, 275–294.
——— (1995), "Marginal Likelihood From the Gibbs Output," Journal of the American Statistical Association, 90, 1313–1321.
Chib, S., and Carlin, B. (1999), "On MCMC Sampling in Hierarchical Longitudinal Models," Statistics and Computing, 9, 17–26.
Chib, S., and Greenberg, E. (1994), "Bayes Inference in Regression Models With ARMA(p, q) Errors," Journal of Econometrics, 64, 183–206.
——— (1995), "Understanding the Metropolis–Hastings Algorithm," The American Statistician, 49, 327–335.
——— (1998), "Analysis of Multivariate Probit Models," Biometrika, 85, 347–361.
Chib, S., and Jeliazkov, I. (2001), "Marginal Likelihood From the Metropolis–Hastings Output," Journal of the American Statistical Association, 96, 270–281.
Diggle, P., and Hutchinson, M. (1989), "On Spline Smoothing With Autocorrelated Errors," Australian Journal of Statistics, 31, 166–182.
DiMatteo, I., Genovese, C. R., and Kass, R. E. (2001), "Bayesian Curve-Fitting With Free-Knot Splines," Biometrika, 88, 1055–1071.
Fahrmeir, L., and Lang, S. (2001), "Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors," Journal of the Royal Statistical Society, Ser. C, 50, 201–220.
Gelman, A., Carlin, B., Stern, H., and Rubin, D. (1995), Bayesian Data Analysis, New York: Chapman & Hall.
Geweke, J. (1991), "Efficient Simulation From the Multivariate Normal and Student-t Distributions Subject to Linear Constraints," in Computing Science and Statistics: Proceedings of the Twenty-Third Symposium on the Interface, ed. E. M. Keramidas, Fairfax, VA: Interface Foundation of North America, pp. 571–578.
Hamilton, J. D. (1994), Time Series Analysis, Princeton, NJ: Princeton University Press.
Hansen, M. H., and Kooperberg, C. (2002), "Spline Adaptation in Extended Linear Models" (with discussion), Statistical Science, 17, 2–51.
Harvey, A. C. (1981), The Econometric Analysis of Time Series, Oxford, U.K.: Philip Allan.
Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, New York: Chapman & Hall.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Heckman, J. (1981), "Heterogeneity and State Dependence," in Studies in Labor Markets, ed. S. Rosen, Chicago: University of Chicago Press, pp. 91–131.
Hyslop, D. (1999), "State Dependence, Serial Correlation and Heterogeneity in Intertemporal Labor Force Participation of Married Women," Econometrica, 67, 1255–1294.
Kass, R., and Raftery, A. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773–795.
Karcher, P., and Wang, Y. (2001), "Generalized Nonparametric Mixed Effects Models," Journal of Computational and Graphical Statistics, 10, 641–655.
Keane, M. (1994), "A Computationally Practical Simulation Estimator for Panel Data," Econometrica, 62, 95–116.
Kohn, R., Ansley, C. F., and Tharm, D. (1991), "The Performance of Cross-Validation and Maximum Likelihood Estimators of Spline Smoothing Parameters," Journal of the American Statistical Association, 86, 1042–1050.
Koop, G., and Poirier, D. J. (2004), "Bayesian Variants of Some Classical Semiparametric Regression Techniques," Journal of Econometrics, 123, 259–282.
Lin, X., and Carroll, R. (2001), "Semiparametric Regression for Clustered Data Using Generalized Estimating Equations," Journal of the American Statistical Association, 96, 1045–1056.
Lin, X., and Zhang, D. (1999), "Inference in Generalized Additive Mixed Models Using Smoothing Splines," Journal of the Royal Statistical Society, Ser. B, 61, 381–400.
Mundlak, Y. (1978), "On the Pooling of Time Series and Cross-Section Data," Econometrica, 46, 69–85.
O'Hagan, A. (1994), Kendall's Advanced Theory of Statistics: Bayesian Inference, Vol. 2B, London: Edward Arnold.
Opsomer, J. D., Wang, Y., and Yang, Y. (2001), "Nonparametric Regression With Correlated Errors," Statistical Science, 16, 134–153.
Ortega, J. (1987), Matrix Theory, New York: Plenum Press.
Shiller, R. (1984), "Smoothness Priors and Nonlinear Regression," Journal of the American Statistical Association, 79, 609–615.
Shively, T. S., Kohn, R., and Wood, S. (1999), "Variable Selection and Function Estimation in Additive Nonparametric Regression Using a Data-Based Prior" (with discussion), Journal of the American Statistical Association, 94, 777–806.
Silverman, B. (1985), "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting" (with discussion), Journal of the Royal Statistical Society, Ser. B, 47, 1–52.
Smith, M., Wong, C.-M., and Kohn, R. (1998), "Additive Nonparametric Regression With Autocorrelated Errors," Journal of the Royal Statistical Society, Ser. B, 60, 311–331.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), The Annals of Statistics, 22, 1701–1762.
Wahba, G. (1978), "Improper Priors, Spline Smoothing and the Problem of Guarding Against Model Errors in Regression," Journal of the Royal Statistical Society, Ser. B, 40, 364–372.
Wang, Y. (1998), "Smoothing Spline Models With Correlated Random Errors," Journal of the American Statistical Association, 93, 341–348.
Wood, S., and Kohn, R. (1998), "A Bayesian Approach to Robust Binary Nonparametric Regression," Journal of the American Statistical Association, 93, 203–213.
Wood, S., Kohn, R., Shively, T., and Jiang, W. (2002), "Model Selection in Spline Nonparametric Regression," Journal of the Royal Statistical Society, Ser. B, 64, 119–139.
Whittaker, E. (1923), "On a New Method of Graduation," Proceedings of the Edinburgh Mathematical Society, 41, 63–75.