Learn About Ordered Logit Regression in R with Data from the General Social Survey (2016)
SAGE Research Methods Datasets Part 1
© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This dataset example introduces ordered logit regression. This technique allows researchers to evaluate whether a categorical variable with three or more ordered categories is a function of one or more independent variables. This example describes ordered logit, discusses the assumptions underlying it, and shows how to estimate and interpret ordered logit models. We illustrate ordered logit using a subset of data from the 2016 General Social Survey (GSS) (http://gss.norc.org/). Specifically, we test whether having children influences the employment status of women. An analysis like this allows researchers to evaluate factors that influence labor force status, which may be useful in policy design.

What Is Ordered Logit?

Ordered logit models explain variation in a categorical variable that consists of three or more ordered categories as a function of one or more independent variables. The categories need only be ordered (e.g., lowest to highest, weakest to strongest, strongly agree to strongly disagree); the method does not require that the distance between the categories be equal. Typically, the values of such variables are scored sequentially starting at 0 or 1, but the method only requires that the values follow some recognizable order.

Ordered logit models are typically used when the dependent variable has three to seven ordered categories. With more categories than that, researchers often turn to ordinary least squares (OLS) regression, while if the dependent variable has only two categories, the ordered logit model reduces to a simple binary logit model.

Ordered logit is one example from the family of generalized linear models (GLMs). GLMs connect a linear combination of independent variables and estimated parameters, often called the linear predictor, to a dependent variable using a link function. The link function typically involves some sort of nonlinear transformation, which in the case of ordered logit means that the probabilities that a given observation in the dataset falls into each of the categories of the dependent variable are nonlinear functions of the independent variables. For example, in binary logistic regression, we model the probability of the outcome variable falling into one of the two categories as a function of the linear combination of the independent variables. A probability must lie between 0 and 1, however, while the linear combination of the independent variables can take any real value, so we need a nonlinear function to compress the value of the linear combination into the interval between 0 and 1. In the logit model, that function is the logistic function.

The parameters of GLMs are typically estimated using maximum likelihood estimation (MLE). Because ordered logit models are estimated via MLE, it is best if the dataset has a sufficiently large number of observations. Just how many is open to debate, but in his book Regression Models for Categorical and Limited Dependent Variables (SAGE, 1997), J. Scott Long suggests trying to meet two criteria: (1) have at least 100 observations total and (2) have at least 10 observations for each coefficient estimated in the model.

In simple terms, MLE accomplishes the same objective as OLS does for standard regression. MLE is an iterative process that converges on the coefficient estimates that maximize the fit of the model to the sample of data.
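The two ideas above, compressing the linear predictor into a probability and judging coefficients by how well they fit the sample, can be sketched in a few lines. The example below is a minimal illustration in Python using only the standard library (the guide itself works in R); the function names, the toy data, and the candidate coefficient values are all hypothetical and chosen only for illustration. It does not run an actual MLE routine; it only evaluates the binary logit log-likelihood at two candidate sets of coefficients.

```python
import math


def logistic(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(intercept, slope, xs, ys):
    """Log-likelihood of a binary logit model for 0/1 outcomes ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(intercept + slope * x)
        # Each case contributes log(p) if y is 1 and log(1 - p) if y is 0.
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical toy data: larger x tends to go with y = 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

# The linear predictor can be any real number; the probability stays in (0, 1).
for eta in (-5.0, 0.0, 5.0):
    print(eta, round(logistic(eta), 4))

# Compare the fit of two candidate coefficient sets (both hypothetical).
ll_zero = log_likelihood(0.0, 0.0, xs, ys)
ll_positive = log_likelihood(0.0, 1.0, xs, ys)
print(ll_zero, ll_positive)
```

On these toy data the log-likelihood is higher with a positive slope than with a slope of zero, which is the sense in which MLE prefers one set of estimates over another. Real software searches over candidate values iteratively rather than comparing just two.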
By maximizing fit, MLE in effect minimizes the unexplained variation in the dependent variable.

When computing statistical tests, it is customary to define the null hypothesis (H0) to be tested. In ordered logit, the standard null hypothesis is that each coefficient is equal to zero. The actual coefficient estimates will not be exactly equal to zero in any particular sample of data, simply due to random chance in sampling. The t-tests conducted on each individual coefficient are designed to help determine whether the coefficients are different enough from zero to be declared statistically significant. “Different enough” is typically defined as producing a test statistic with a p-value that is less than .05. This would lead us to reject the null hypothesis (H0) that the coefficient in question equals zero.

Assumptions Behind the Method

Nearly every statistical model or test relies on some underlying assumptions, and they are all affected by the mix of data you happen to have. Different textbooks present the assumptions for an ordered logit model in different ways. Here are the key factors to consider when estimating an ordered logit:

• The dependent variable must consist of ordered categories.
• The model is correctly specified (e.g., we have the right independent variables in the model, properly measured).
• The individual residuals are independent of each other and follow a logistic distribution.
• The effect of a given independent variable on the latent dependent variable is the same across all thresholds. This is sometimes called the parallel regression assumption or the proportional odds assumption.
It simply means that we estimate only one coefficient for each independent variable rather than letting that coefficient change as we move from one category of the dependent variable to another.
• Because it is generally estimated via MLE, ordered logit regression requires moderate to large sample sizes.

Estimating an Ordered Logit Model

One way to understand the ordered logit model is to imagine that there is a continuous, but unobserved, dependent variable that is a linear function of an independent variable. Let’s call that latent dependent variable Y* and the observed independent variable X. We do not observe Y*, but we do observe Y as a categorical variable. Which category of Y a case falls into depends on which thresholds Y* crosses.

Suppose we have a dependent variable Y with ordered categories labeled A, B, and C that we believe is affected by values of an independent variable named X. If we estimate an ordered logit model, we will estimate a single slope that describes how X is related to Y*, along with two thresholds: one that separates Category A from Categories B and C, and another that separates Categories A and B from Category C. In other words, an ordered logit model where Y takes on three values is similar to simultaneously estimating two simple binary logit models, one for each of these splits. We say “similar” because the parameter estimates of an ordered logit model are constrained (appropriately so): the slope is the same for both splits, as the parallel regression assumption requires, and the probabilities of an observation being in Categories A, B, or C must sum to 1.
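One way to write the latent-variable setup described at the start of this section is given below. The notation is not in the original guide and is assumed here for illustration only: β for the slope, τ1 < τ2 for the two thresholds, Λ for the logistic cumulative distribution function, and categories ordered A < B < C.

```latex
\begin{aligned}
Y^{*} &= \beta X + \varepsilon, \qquad \varepsilon \sim \text{standard logistic},\\
\Pr(Y = A \mid X) &= \Pr(Y^{*} \le \tau_1 \mid X) = \Lambda(\tau_1 - \beta X),\\
\Pr(Y = B \mid X) &= \Lambda(\tau_2 - \beta X) - \Lambda(\tau_1 - \beta X),\\
\Pr(Y = C \mid X) &= 1 - \Lambda(\tau_2 - \beta X),
\qquad \Lambda(z) = \frac{1}{1 + e^{-z}}.
\end{aligned}
```

Under this sketch the same slope β appears in every line and only the thresholds differ, and the three probabilities sum to 1 at any value of X by construction.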
Because the three probabilities must sum to 1, once we know the probabilities for any two of the categories, the probability for the third follows by definition.

Figure 1 illustrates the latent variable idea in the simple case where Y only takes on two values, coded 0 for those voting for Romney for U.S. President in 2012 and 1 for those voting for Obama. Figure 1 shows a latent, or unobserved, propensity of voting for Obama on the y-axis as Y*. This propensity is a linear function of X. At any given value for X, there is a probability distribution representing possible values for Y*. Because the total area under any probability distribution equals 1, at any given value of X, the estimated probability of someone voting for Obama is captured by the proportion of the probability distribution that falls above the threshold dividing Y* into those two groups. That proportion of the distribution is shown in blue in Figure 1. For a logit model, the distribution is assumed to be logistic. If the distribution were assumed to be normal, we would estimate the model using probit rather than logit.

Figure 1: Illustration of the Latent Variable Interpretation of a Simple Logit Model With a Single Threshold and Two Categories for the Dependent Variable.

Figure 2 shows what happens when we have three ordered categories rather than two. Figure 2 still represents a latent variable Y* on the y-axis as a linear function of X. However, Y* is now divided by two thresholds into three observed categories.
At any given value of X, the proportion of the probability distribution that falls below the first threshold (shown as white) represents the probability of falling into the first category. The proportion of the probability distribution that falls between the two thresholds (shown in red) represents the probability of falling into the second category.
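The areas described for Figure 2 can be computed directly. The sketch below, in Python with only the standard library (the guide itself works in R), assumes a single predictor, a standard logistic distribution for Y*, and hypothetical slope and threshold values that are not taken from the guide or from the GSS data. It returns the probability of each category as the share of the distribution lying between adjacent thresholds.

```python
import math


def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(x, slope, thresholds):
    """Return P(Y = each category) for an ordered logit with one predictor.

    thresholds must be sorted in increasing order;
    k thresholds define k + 1 ordered categories.
    """
    linear_predictor = slope * x
    # P(Y* <= threshold) at each threshold, padded with 0 and 1 at the ends.
    cumulative = (
        [0.0]
        + [logistic_cdf(t - linear_predictor) for t in thresholds]
        + [1.0]
    )
    # Each category's probability is the area between adjacent cut points.
    return [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]


# Hypothetical values chosen only for illustration.
slope = 0.8
thresholds = [-1.0, 1.0]

for x in (-2.0, 0.0, 2.0):
    probs = category_probabilities(x, slope, thresholds)
    print(x, [round(p, 3) for p in probs])
```

With a positive slope, raising X shifts the distribution of Y* upward, so the probability of the lowest category shrinks and the probability of the highest category grows, while the three probabilities always sum to 1.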