Learn About Ordered Probit in Stata With Data From the Behavioral Risk Factor Surveillance System (2013)

© 2020 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This dataset example introduces ordered probit. This technique allows researchers to evaluate whether a categorical variable with three or more ordered categories is a function of one or more independent variables. The ordered probit model is most commonly estimated via maximum likelihood estimation (MLE). This example describes ordered probit, discusses the assumptions underlying it, and shows how to estimate and interpret ordered probit models. We illustrate ordered probit using a subset of data from the 2013 Behavioral Risk Factor Surveillance System (BRFSS) operated by the U.S. Centers for Disease Control and Prevention (http://www.cdc.gov/brfss/). Specifically, we test whether a 3-category measure of the strenuousness of recent activity is predicted by gender, age, income, and education level. An analysis like this allows researchers to evaluate factors that predict activity levels, which may be useful in designing fitness plans.

What Is Ordered Probit?

Ordered probit models explain variation in a categorical variable that consists of three or more ordered categories as a function of one or more independent variables. Categories need only be ordered (e.g., lowest to highest, weakest to strongest, strongly agree to strongly disagree); the method does not require that the distance between the categories be equal.
Typically the values of such variables are scored sequentially starting at 0 or 1, but the method only requires that the scoring follow some recognizable order. Ordered probit models are typically used when the dependent variable has 3 to 7 ordered categories. With more categories than that, researchers often turn to OLS regression; if the dependent variable has only two categories, the ordered probit model reduces to simple probit.

Ordered probit is one example from the family of Generalized Linear Models (GLMs). GLMs connect a linear combination of independent variables and estimated parameters (often called the linear predictor) to a dependent variable using a link function. The link function typically involves some sort of non-linear transformation, which in the case of ordered probit means that the probabilities that a given observation in the dataset falls into each of the categories of the dependent variable are non-linear functions of the independent variables. The parameters of GLMs are typically estimated using Maximum Likelihood Estimation (MLE).

Because ordered probit models are estimated via MLE, it is best if the dataset has a sufficiently large number of observations. Just how many is open to debate, but in his book Regression Models for Categorical and Limited Dependent Variables (SAGE, 1997), J. Scott Long suggests trying to meet two criteria: (1) have at least 100 observations total, and (2) have at least 10 observations for each coefficient estimated in the model.

In simple terms, MLE is an iterative process that searches for the coefficient estimates that maximize the fit of the model to the sample of data. Maximizing fit in this way is analogous to minimizing unexplained variance, so MLE accomplishes for ordered probit the same objective that ordinary least squares (OLS) accomplishes for standard regression.
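As a quick arithmetic check, the two sample-size criteria attributed to Long above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and the sample sizes and coefficient counts below are hypothetical and do not come from the BRFSS example.

```python
def meets_long_criteria(n_obs, n_coefficients):
    """Check the two sample-size rules of thumb attributed to Long (1997)."""
    # Criterion 1: at least 100 observations in total.
    has_minimum_total = n_obs >= 100
    # Criterion 2: at least 10 observations per estimated coefficient.
    has_enough_per_coefficient = n_obs >= 10 * n_coefficients
    return has_minimum_total and has_enough_per_coefficient


# Hypothetical sample sizes and coefficient counts, for illustration only.
print(meets_long_criteria(n_obs=250, n_coefficients=6))
print(meets_long_criteria(n_obs=120, n_coefficients=15))
```

The second call fails the check because 120 observations fall short of the 150 implied by 15 coefficients, even though the total exceeds 100.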
When computing statistical tests, it is customary to define the null hypothesis (H0) to be tested. In ordered probit, the standard null hypothesis is that each coefficient is equal to zero. The actual coefficient estimates will not be exactly equal to zero in any particular sample of data, simply due to random chance in sampling. The tests conducted on each individual coefficient (reported as z-tests in Stata's ordered probit output) are designed to help determine if the coefficients are different enough from zero to be declared statistically significant. “Different enough” is typically defined as producing a test statistic with a level of statistical significance, or p-value, that is less than 0.05. This would lead us to reject the null hypothesis (H0) that the coefficient in question equals zero.

Estimating an Ordered Probit Model

One way to understand the ordered probit model is to imagine that there is a continuous, but unobserved, dependent variable that is a linear function of an independent variable. Let’s call that latent dependent variable Y* and the observed independent variable X. We do not observe Y*, but we do observe Y as a categorical variable. Which category of Y a case falls into depends on whether Y* crosses a given threshold. Figure 1 illustrates this in the simple case where Y only takes on two values, coded 0 for those voting for Romney for U.S. President in 2012 and 1 for those voting for Obama. Figure 1 shows a latent, or unobserved, propensity of voting for Obama on the y-axis as Y*. This propensity is a linear function of X. At any given value for X, there is a probability distribution representing possible values for Y*.
Because the area under any probability distribution sums to 1, at any given value of X, the average probability of someone voting for Obama is captured by the proportion of the probability distribution that falls above the threshold dividing Y* into those two groups. That proportion of the distribution is shown in blue in Figure 1. For a probit model, the distribution is assumed to be normal. (Note: If the distribution were assumed to be logistic, we would estimate the model using logit rather than probit.)

[Figure 1 description: The vertical axis is labeled “Latent Propensity (Y*)” and the horizontal axis is labeled “Values of X.” Five normal distribution curves are drawn next to each other diagonally from bottom-left to top-right, with a positively sloped line passing through the mean of each curve. A horizontal dotted line labeled “Threshold” passes through the curves. The region of each curve above the dotted line is shaded and labeled “Voted for Obama (Y equals 1)”; the region below is labeled “Voted for Romney (Y equals 0).”]

Figure 1 Illustration of the latent variable interpretation of a simple probit model with a single threshold and two categories for the dependent variable.

Figure 2 shows what happens when we have three ordered categories rather than two. Figure 2 still represents a latent variable Y* on the y-axis as a linear function of X.
However, Y* is now divided by two thresholds into three observed categories. At any given value of X, the proportion of the probability distribution that falls below the first threshold (shown as white) represents the probability of falling into the first category. The proportion of the probability distribution that falls between the two thresholds (shown in red) represents the probability of falling into the second category. The proportion of the probability distribution that falls above the second threshold (shown in blue) represents the probability of falling into the third category. At any given value of X, all of these probabilities will (and must) sum to 1. For an ordered probit model, the distribution is assumed to be normal. (Note: If the distribution were assumed to be logistic, we would estimate the model using ordered logit rather than ordered probit.)

[Figure 2 description: The vertical axis is labeled “Latent Propensity (Y*)” and is marked “Outcome 1,” “Outcome 2,” and “Outcome 3” from bottom to top. The horizontal axis is labeled “Values of X.” Five normal distribution curves are drawn next to each other diagonally from bottom-left to top-right, with a positively sloped line passing through the mean of each curve. Two horizontal dotted lines labeled “Threshold 1” and “Threshold 2” pass through the curves. The region of each curve between the two thresholds and the region above the second threshold are shaded.]

Figure 2 Illustration of the latent variable interpretation of an ordered probit model with two thresholds and three ordered categories for the dependent variable.
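The category probabilities described for Figure 2 can be sketched numerically. The minimal Python sketch below assumes, for illustration only, that Y* equals a single coefficient times X plus a standard normal error (no intercept, error variance fixed at 1), and it uses made-up values for the coefficient and the two thresholds; none of these numbers come from the BRFSS data, and the function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordered_probit_probs(x, beta, thresholds):
    """Return the probability of each ordered category at a given x.

    Assumes Y* = beta * x + e with e ~ Normal(0, 1), and that the
    thresholds are sorted from lowest to highest.
    """
    mean = beta * x
    # Probability that Y* falls at or below each threshold.
    cumulative = [normal_cdf(t - mean) for t in thresholds]
    # Pad with 0 and 1 so that adjacent differences give each category's share.
    bounds = [0.0] + cumulative + [1.0]
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]


# Hypothetical coefficient and thresholds, for illustration only.
probs = ordered_probit_probs(x=1.0, beta=0.5, thresholds=[0.0, 1.0])
print([round(p, 3) for p in probs])
print(round(sum(probs), 6))
```

Each probability is the area of the normal curve between two adjacent cut points, so the three values necessarily sum to 1, and raising x shifts probability away from the lowest category and toward the highest whenever the coefficient is positive.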
Because the latent variable Y* is unobserved, it has no scale. In order to estimate the model illustrated in Figure 1 or 2, we need to impose a restriction on either