WHO Multi-Country Studies unit Working Paper 4

Self-reported health and anchoring vignettes in SAGE Wave 1: Applying the bivariate hierarchical ordered probit model and anchoring vignette methodologies in SAGE to improve cross-country comparability of self-reported health

Márton Ispány, Emese Verdes, Ajay Tandon, Somnath Chatterji

March 2012

Introduction

In many areas of social science research and in econometric applications, data that arise from measuring discrete outcomes or discrete choices among a set of alternatives take the form of ordinal, or ordered categorical, data. Examples include self-report responses in household surveys, models of labor force participation, and decisions about which product to choose or which candidate to elect. A number of discrete response (choice) models have been developed in econometric theory to analyze such data; see Greene [,]. A simple class among these is the ordered univariate response model, where the dependent variable has more than two categories, i.e. there are several possible outcomes or choices and they are ordered according to the preferences of the respondent.

The same data structure also arises in the analysis of repeated measurements, where the response of each respondent, experimental unit or subject is observed on multiple occasions to record the level of a specific event. Responses of this type are known as multivariate or correlated categorical responses. The theory of univariate ordered models is relatively well developed, and these models have been applied extensively in biostatistics, economics, political science and sociology, whereas estimation of the joint probability distribution of two or more ordered categorical variables is less common in the literature.
The relatively new bivariate (multivariate) ordered probit (BIOPROBIT) models can be treated as an extension of the standard bivariate (multivariate) probit model to the case where the dependent variables have more than two categories. The estimation procedures for the BIOPROBIT model and their statistical properties are studied in Greene [, Section 11.5.2], Sajaia [] and others. Applications of BIOPROBIT models include modeling the educational level of married couples (Magee et al. []), educational attainment in France and Germany (Lauer []), family size (Calhoun [,]), fertility outcomes and fertility motivations of Danish twins (Kohler and Rodgers []), ownership of cats and dogs or dogs and televisions (Butler and Chatterjee [,]), and the household-level decision between the number of seasonal tickets and the number of cars (Scott and Axhausen []).

This paper puts forth a new approach to modeling correlated categorical responses in a heterogeneous population where the subjective scale changes across segments of the population, such as countries. We use a vignette methodology to evaluate and correct subjective correlated responses. This methodology is widely applied in areas that rely on subjective scales, e.g. health, health care, school community strength, HIV risk, state effectiveness, and corruption (see, for example, http://gking.harvard.edu/vign/eg/).

WHO's Study on global AGEing and adult health (SAGE) used a number of methods to improve the reliability, validity and comparability of its self-reported health measures, including the use of anchoring vignettes. The anchoring vignette technique presents respondents with a set of hypothetical stories, which they rate using the same questions and response categories as for the self-assessment of health.
The vignettes are used to fix the level of ability on a given health domain, in order to better distinguish between differences in self-ratings due to actual health differences and those due to varying norms or expectations for health (Hopkins and King, 2010; Salomon, 2004).

Measuring the health state of individuals is important for evaluating health and social policies and for monitoring and measuring the health of populations. Self-reported health is a common method for assessing health status in household health surveys, and single-question versions of self-reported health predict a range of health outcomes from disability to death (Cutler 2009; Singh-Manoux et al., Psychosomatic Medicine 2007;69:138-43).

Methods

The models

Categorical data on health are usually described by discrete choice latent variable models, which assume that the observed categorical variables, e.g. the self-reported health responses, are discrete transformations of an underlying, unobservable, continuous true level of health. For a detailed introduction to discrete choice models see Chapter 21 of Greene [] and the recent survey in Greene [, Chapter 11]. If this discrete transformation is the same for all individuals, we say that reporting behavior is homogeneous. In contrast, reporting heterogeneity means that the mappings between the latent variables and the observed categorical variables differ across categories of respondents. In this paper we consider the multivariate, and especially the bivariate, case, i.e. we allow more than one categorical response for each individual.

Let $y_{ij}$, $i = 1, \dots, N$, $j = 1, \dots, M$, be a self-reported categorical health measure, where $i$ indexes the respondent and $j$ the question, and $N$ and $M$ denote the numbers of respondents and questions, respectively. Latent variable models assume that there is an unobserved continuous latent variable $y_{ij}^*$ for the $i$-th respondent at the $j$-th question.
These latent variables are supposed to depend on observable covariates and are modeled by latent equations. Using the linear regression model as one of the simplest specifications, $y_{ij}^*$ is specified as

$$y_{ij}^* = x_i^T \beta_j + \varepsilon_{ij}. \tag{1}$$

Here $x_i$ is a vector of covariates for the $i$-th respondent, $\beta_j$ is a vector of regression coefficients, and the error vector $\varepsilon_i = (\varepsilon_{ij})$ has an $M$-dimensional normal distribution with mean 0 and variance matrix $\Sigma$. The diagonal elements of $\Sigma$ are set to 1 so that the model is identified. In the homogeneous reporting case it is assumed that the observed categorical responses $y_{ij}$ of the $i$-th individual depend on the latent variables in the following way:

$$y_{ij} = k \iff \tau_{k-1}^j \le y_{ij}^* < \tau_k^j, \qquad k = 1, \dots, K, \tag{2}$$

where $\tau_0^j < \tau_1^j < \dots < \tau_{K-1}^j < \tau_K^j$ with $\tau_0^j = -\infty$, $\tau_K^j = \infty$, and $K$ denotes the number of different answers to the self-report questions. Model (1) with the cut-point definition (2) is called the multivariate ordered probit model. In the special case $M = 2$ we speak of the bivariate ordered probit (BIOPROBIT) model; see Section 11.5.2 in Greene [] or Sajaia []. In the one-dimensional case $M = 1$ we obtain the standard ordered probit (OPROBIT) model; see Chapter 21.8 in Greene [].

There are several extensions of the ordered probit model that follow the logic of bivariate models using two latent equations with correlated error terms; see e.g. Butler et al. [] and Tobias and Li []. Our setup follows the latter. Seemingly unrelated and simultaneous specifications of the two-equation ordered probit model are considered e.g. in Sajaia [].

The parameters, namely the regression coefficients $\beta_j$, the cut-points $\tau_k^j$ and the independent off-diagonal elements of $\Sigma$, are estimated by full information maximum likelihood (FIML); see supplement S1. To ensure that the cut-points are increasing we use an exponential parametrization, i.e. $\tau_k^j = \tau_{k-1}^j + \exp(\gamma_k^j)$, $\tau_1^j = \gamma_1^j$, and the new parameters $\gamma_k^j$ are estimated.
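The exponential parametrization and the likelihood contribution of a single bivariate response can be illustrated with a minimal sketch. It is not the estimation code used in the paper. The function names, the example parameter values, and the use of one-dimensional Simpson integration in place of a library bivariate normal CDF are all assumptions made for illustration. The cell probability uses the fact that, for standard bivariate normal errors with correlation `rho`, the second error given the first is normal with mean `rho * z` and variance `1 - rho**2`.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def cutpoints_from_gamma(gamma):
    """Increasing cut-points from unconstrained parameters.

    First cut-point = gamma[0]; each later cut-point adds exp(gamma[k]) > 0.
    The result is padded with -inf and +inf, so it has K + 1 entries.
    """
    taus = [gamma[0]]
    for g in gamma[1:]:
        taus.append(taus[-1] + math.exp(g))
    return [-math.inf] + taus + [math.inf]


def cell_probability(k1, k2, mu1, mu2, taus1, taus2, rho, n=400, bound=8.0):
    """P(y_i1 = k1, y_i2 = k2) in the bivariate ordered probit model.

    mu1, mu2 are the linear predictors of the two latent equations and
    taus1, taus2 are padded cut-point lists. Illustrative sketch only.
    """
    # Rectangle for the error terms implied by the two categories.
    a1, b1 = taus1[k1 - 1] - mu1, taus1[k1] - mu1
    a2, b2 = taus2[k2 - 1] - mu2, taus2[k2] - mu2
    lo, hi = max(a1, -bound), min(b1, bound)  # clip infinite limits
    if lo >= hi:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)

    def integrand(z):
        # Second error given first error z is N(rho * z, 1 - rho^2).
        upper = PHI.cdf((b2 - rho * z) / s)
        lower = PHI.cdf((a2 - rho * z) / s)
        return PHI.pdf(z) * (upper - lower)

    # Composite Simpson rule with n (even) subintervals.
    h = (hi - lo) / n
    total = integrand(lo) + integrand(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    return total * h / 3.0
```

Summing `math.log(cell_probability(...))` over respondents would give the log-likelihood that a full information maximum likelihood routine maximizes. In practice a dedicated bivariate normal CDF from statistical software would replace the numerical integration.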
The FIML technique can be applied in any statistical or econometric software in which the cumulative distribution function of the standard bivariate normal distribution is implemented, e.g. Stata; see Sajaia []. Butler et al. [] proposed a two-step estimation approach based on fitting univariate ordered probit models. Tobias and Li [] suggested a Bayesian alternative.

The two latent equations of the BIOPROBIT model can be rewritten in two-dimensional vector form as a standard linear model:

$$\begin{pmatrix} y_{i1}^* \\ y_{i2}^* \end{pmatrix} = \begin{pmatrix} x_i^T & 0 \\ 0 & x_i^T \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \end{pmatrix}.$$

Here the error terms $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are distributed as standard bivariate normal with correlation coefficient $\rho$. In summary, the BIOPROBIT model is a two-dimensional latent model with an ordered probit link function and bivariate normally distributed latent variables.

In the heterogeneous reporting case the ordered probit models are no longer appropriate for describing the data. However, it is possible to generalize these models by allowing the cut-points to depend on covariates as

$$\tau_{i,1}^j = x_i^T \gamma_1^j, \qquad \tau_{i,k}^j = \tau_{i,k-1}^j + \exp\left(x_i^T \gamma_k^j\right), \quad k = 2, \dots, K-1. \tag{3}$$

Hence the dependence between the observed categorical and the continuous latent variables takes the form

$$y_{ij} = k \iff \tau_{i,k-1}^j \le y_{ij}^* < \tau_{i,k}^j. \tag{4}$$

Here the $\gamma_k^j$ are parameter vectors that measure the impact of the covariates on the cut-points; see Terza [] and Pudney and Shields []. To identify the effect of the covariates on the cut-points we use vignette ratings as exogenous information that fixes different levels of the respondents' categories. This technique was suggested by Tandon et al. []; see also King et al. [] and Salomon et al. []. Suppose that there are $L$ vignettes for each question, and denote by $y_{ij\ell}^v$ the rating given by the $i$-th respondent to the $\ell$-th vignette of the $j$-th question. It is also assumed that the possible vignette values are $1, 2, \dots, K$, i.e. they coincide with the possible values of the self-reports.
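To make the covariate-dependent cut-points of (3) and the mapping (4) concrete, the following minimal sketch computes respondent-specific cut-points and shows how two respondents with the same latent health can report different categories. The function names, the two-element covariate vector (an intercept and a hypothetical 0/1 country indicator), the coefficient values, and the choice of which inequality in (4) is strict are all illustrative assumptions and are not taken from the paper.

```python
import math


def dot(x, g):
    """Inner product of a covariate vector and a coefficient vector."""
    return sum(a * b for a, b in zip(x, g))


def individual_cutpoints(x, gammas):
    """Respondent-specific finite cut-points as in (3).

    The first cut-point is linear in the covariates; each later one adds
    exp(x' gamma_k) > 0, so the cut-points are increasing for every x.
    """
    taus = [dot(x, gammas[0])]
    for g in gammas[1:]:
        taus.append(taus[-1] + math.exp(dot(x, g)))
    return taus


def reported_category(latent, taus):
    """Category k with lower cut-point <= latent < upper cut-point, as in (4).

    The outermost cut-points are -inf and +inf, so counting the finite
    cut-points at or below the latent value gives k - 1.
    """
    return 1 + sum(1 for t in taus if latent >= t)


# Hypothetical coefficients: only the first cut-point shifts with country.
GAMMAS = [[-1.0, 0.5], [0.0, 0.0], [0.0, 0.0]]

taus_a = individual_cutpoints([1.0, 0.0], GAMMAS)  # [-1.0, 0.0, 1.0]
taus_b = individual_cutpoints([1.0, 1.0], GAMMAS)  # [-0.5, 0.5, 1.5]

# Same latent health, different reported category.
print(reported_category(0.2, taus_a), reported_category(0.2, taus_b))
```

With these illustrative values the two respondents share the latent value 0.2 but report categories 3 and 2. This is the kind of reporting heterogeneity that the covariate-dependent cut-points are meant to capture.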
In the latent trait model approach it is supposed that there is an unobservable continuous variable $y_{ij\ell}^{v*}$ behind each $y_{ij\ell}^v$ for all $i$, $j$ and $\ell$. We assume that these latent variables are fixed over the whole population, i.e. they do not depend on the covariates. In mathematical terms,

$$y_{ij\ell}^{v*} = \theta_{j\ell} + \varepsilon_{ij\ell}^v, \qquad i = 1, \dots, N, \quad j = 1, \dots, M, \quad \ell = 1, \dots, L, \tag{5}$$

where $\theta_{j\ell}$ denotes the mean of the $\ell$-th vignette of the $j$-th question and the error vector $\varepsilon_i^v = (\varepsilon_{ij\ell}^v)$ has an $ML$-dimensional normal distribution with mean 0 and variance matrix $\Sigma^v$. The observed vignette ratings depend on the latent vignette variables in the following way:

$$y_{ij\ell}^v = k \iff \tau_{i,k-1}^j \le y_{ij\ell}^{v*} < \tau_{i,k}^j. \tag{6}$$

It should be emphasized that we use the same cut-points as in the self-report part.
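Because the vignette level is common to all respondents, differences in vignette ratings carry information about the cut-points alone. The following minimal sketch illustrates this by computing rating probabilities for one vignette under two sets of cut-points. It assumes a single vignette with a unit-variance normal error, which is a simplification of the general variance matrix in (5), and the function name and numerical values are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def vignette_rating_probs(theta, taus):
    """P(vignette rating = k), k = 1..K, as implied by (5) and (6).

    theta is the vignette mean and taus are the respondent's K - 1 finite
    cut-points, the same ones used for the self-report.
    Assumes a unit-variance normal error (illustrative simplification).
    """
    bounds = [-math.inf] + list(taus) + [math.inf]
    return [
        PHI.cdf(bounds[k] - theta) - PHI.cdf(bounds[k - 1] - theta)
        for k in range(1, len(bounds))
    ]


# Same vignette (theta = 0), two respondents with different cut-points.
probs_a = vignette_rating_probs(0.0, [-1.0, 0.0, 1.0])
probs_b = vignette_rating_probs(0.0, [-0.5, 0.5, 1.5])
print([round(p, 3) for p in probs_a])
print([round(p, 3) for p in probs_b])
```

Since `theta` is identical in both calls, any difference between the two probability vectors is attributable to the cut-points. This is why the vignette ratings can identify the effect of covariates on the cut-points separately from their effect on latent health.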