Learn About Ordered Logit Regression in R With Data From the General Social Survey (2016)

© 2019 SAGE Publications, Ltd. All Rights Reserved. This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction

This dataset example introduces ordered logit regression. This technique allows researchers to evaluate whether a categorical variable with three or more ordered categories is a function of one or more independent variables.

This example describes ordered logit, discusses the assumptions underlying it, and shows how to estimate and interpret ordered logit models. We illustrate ordered logit using a subset of data from the 2016 General Social Survey (GSS) (http://gss.norc.org/). Specifically, we test whether having children influences the employment status of women. An analysis like this allows researchers to evaluate factors that influence labor force status, which may be useful in policy designs.

What Is Ordered Logit?

Ordered logit models explain variation in a categorical variable that consists of three or more ordered categories as a function of one or more independent variables. Categories must only be ordered (e.g., lowest to highest, weakest to strongest, strongly agree to strongly disagree); the method does not require that the distance between the categories be equal. Typically, the values of such variables are scored sequentially starting at 0 or 1, but the method only requires that the values follow some recognizable order. Ordered logit models are typically used when the dependent variable has three to seven ordered categories. With more than that, researchers often turn to ordinary least squares (OLS) regression, while if the dependent variable has only two categories, the ordered logit model reduces to a simple binary logit model.

Ordered logit is one example from the family of generalized linear models (GLMs). GLMs connect a linear combination of independent variables and estimated parameters, often called the linear predictor, to a dependent variable using a link function. The link function typically involves some sort of nonlinear transformation, which in the case of ordered logit means that the probabilities that a given observation in the dataset falls into each of the categories of the dependent variable are nonlinear functions of the independent variables. For example, in the binary logit model, we model the probability of the outcome variable falling into one of the two categories as a function of the linear combination of the independent variables. A probability must lie between 0 and 1, whereas the linear combination of the independent variables can take any real value, so we need a nonlinear function that compresses the value of the linear combination into the region between 0 and 1. In the logit model, that function is the logistic function.
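To make the compression into the region between 0 and 1 concrete, here is a minimal Python sketch of the logistic function (the accompanying guide works in R; Python is used here only for illustration). The function name `logistic` and the linear-predictor values are assumptions chosen for the example, not part of the original analysis.

```python
import math


def logistic(eta):
    """Map a linear predictor (any real number) to a probability in (0, 1)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # Algebraically identical form that avoids overflow for very negative eta.
    z = math.exp(eta)
    return z / (1.0 + z)


# Hypothetical linear-predictor values, chosen only for illustration.
for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"linear predictor = {eta:+.1f} -> probability = {logistic(eta):.3f}")
```

However large or small the linear predictor becomes, the output stays strictly inside the unit interval for moderate values, and a linear predictor of zero maps to a probability of one half.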

The parameters of GLMs are typically estimated using maximum likelihood estimation (MLE). Because ordered logit models are estimated via MLE, it is best if the dataset has a sufficiently large number of observations. Just how many is open to debate, but in his book Regression Models for Categorical and Limited Dependent Variables (SAGE, 1997), J. Scott Long suggests trying to meet two criteria: (1) have at least 100 observations total and (2) have at least 10 observations for each coefficient estimated in the model.
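The two sample-size criteria can be checked with simple arithmetic. The short Python sketch below is one way to do so; the function name is hypothetical, and counting the two thresholds alongside the three slope coefficients in the later example is a conservative assumption made here, not something the criteria themselves specify.

```python
def meets_long_guidelines(n_obs, n_coefficients):
    """Check the two rules of thumb: at least 100 observations in total and
    at least 10 observations per estimated coefficient."""
    at_least_100 = n_obs >= 100
    ten_per_coefficient = n_obs >= 10 * n_coefficients
    return at_least_100 and ten_per_coefficient


# The example later in this guide has 1,189 observations and three independent
# variables. ASSUMPTION: the two thresholds are counted as parameters as well.
print(meets_long_guidelines(n_obs=1189, n_coefficients=3 + 2))
```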

In simple terms, MLE accomplishes the same objective as OLS does for standard regression. MLE is an iterative process that searches for the coefficient estimates that maximize the likelihood of observing the sample of data, that is, the estimates that maximize the fit of the model to the sample. Maximizing the likelihood plays the role here that minimizing the unexplained variance in the dependent variable plays in OLS.


When computing statistical tests, it is customary to define the null hypothesis (H0) to be tested. In ordered logit, the standard null hypothesis is that each coefficient is equal to zero. The actual coefficient estimates will not be exactly equal to zero in any particular sample of data, simply due to random chance in sampling. The t-tests conducted to test each individual coefficient are designed to help determine whether the coefficients are different enough from zero to be declared statistically significant. “Different enough” is typically defined as producing a test statistic with a level of statistical significance or p-value that is less than .05. This would lead us to reject the null hypothesis (H0) that the coefficient in question equals zero.
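As a rough illustration of how a test statistic translates into a p-value, the Python sketch below divides a coefficient estimate by its standard error and looks up a two-sided p-value. It uses the standard normal distribution as a large-sample approximation to the t-distribution, which is an assumption of this sketch; the estimate and standard error are hypothetical numbers, not results from the GSS model.

```python
import math


def two_sided_p_value(test_statistic):
    """Two-sided p-value under a standard normal reference distribution.

    ASSUMPTION: the normal distribution is used as a large-sample
    approximation to the t-distribution.
    """
    return math.erfc(abs(test_statistic) / math.sqrt(2.0))


# Hypothetical coefficient estimate and standard error (illustration only).
estimate, standard_error = -0.30, 0.10
statistic = estimate / standard_error
p_value = two_sided_p_value(statistic)
print(f"test statistic = {statistic:.2f}, p-value = {p_value:.4f}")
print("reject H0 at .05" if p_value < 0.05 else "fail to reject H0 at .05")
```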

Assumptions Behind the Method

Nearly every statistical model or test relies on some underlying assumptions, and they are all affected by the mix of data you happen to have. Different textbooks present the assumptions for an ordered logit model in different ways. Here are the key factors to consider when estimating an ordered logit:

• The dependent variable must consist of ordered categories.
• The model is correctly specified (e.g., we have the right independent variables in the model properly measured).
• The individual residuals are independent of each other and follow a logistic distribution.
• The effect of a given independent variable on the latent dependent variable is the same across all thresholds. This is sometimes called the parallel regression assumption or the proportional odds assumption. It simply means that we only estimate one coefficient for each independent variable rather than having that coefficient estimate change as we move from one category of the dependent variable to another.
• Because it is generally estimated via MLE, ordered logit regression requires moderate to large sample sizes.
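The parallel regression (proportional odds) assumption listed above can be seen numerically. The Python sketch below computes, at each threshold, the odds of falling at or below that threshold and shows that a one-unit increase in X multiplies those odds by the same factor everywhere. The coefficient and threshold values are hypothetical, chosen only for illustration.

```python
import math


def logistic(eta):
    """Logistic CDF; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def cumulative_odds(threshold, beta, x):
    """Odds of falling at or below the category bounded by `threshold`."""
    p_at_or_below = logistic(threshold - beta * x)
    return p_at_or_below / (1.0 - p_at_or_below)


# Hypothetical values for illustration only (not estimates from the GSS data).
beta = 0.8
thresholds = [-1.0, 0.5, 2.0]


def odds_ratios_for_unit_increase(x):
    """Ratio of cumulative odds at x + 1 to cumulative odds at x, per threshold."""
    return [cumulative_odds(t, beta, x + 1) / cumulative_odds(t, beta, x)
            for t in thresholds]


for x in (0.0, 1.0, 3.0):
    print(x, [round(r, 4) for r in odds_ratios_for_unit_increase(x)])
```

Because a single coefficient applies at every threshold, the ratio is identical across thresholds and across starting values of X; a model that let the coefficient vary by threshold would not have this property.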


Estimating an Ordered Logit Model

One way to understand the ordered logit model is to imagine that there is a continuous, but unobserved, dependent variable that is a linear function of an independent variable. Let’s call that latent dependent variable Y* and the observed independent variable X. We do not observe Y*, but we do observe Y as a categorical variable. Which category of Y a case falls into depends on whether Y* crosses a given threshold.

Suppose we have a dependent variable Y with ordered categories labeled A, B, and C that we believe is affected by values of an independent variable named X. If we estimate an ordered logit model, we will estimate a single slope for X along with two thresholds: one threshold describes how X is related to the probability of an observation being in Category A versus Category B or C, and the other describes how X is related to the probability of an observation being in Category A or B versus Category C. In other words, an ordered logit model where Y takes on three values is similar to simultaneously estimating two simple binary logit models. We say “similar” because the parameter estimates of an ordered logit model are constrained (appropriately so) by the requirements that the slope be the same in both comparisons and that the probabilities of an observation being in Categories A, B, or C sum to 1. Because of these restrictions, once we know the probability of being in Category A and the probability of being in Category A or B, we know by definition the probability of being in each individual category.

Figure 1 illustrates this in the simple case where Y only takes on two values, coded 0 for those voting for Romney for U.S. President in 2012 and 1 for those voting for Obama. Figure 1 shows a latent, or unobserved, propensity of voting for Obama on the y-axis as Y*. This propensity is a linear function of X. At any given value for X, there is a probability distribution representing possible values for Y*. Because the area under any probability distribution sums to 1, at any given value of X, the estimated probability of someone voting for Obama is captured by the proportion of the probability distribution that falls above the threshold dividing Y* into those two groups. That proportion of the distribution is shown in blue in Figure 1. For a logit model, the distribution is assumed to be logistic. If the distribution were assumed to be normal, we would estimate the model using probit rather than logit.

Figure 1: Illustration of the Latent Variable Interpretation of a Simple Logit Model With a Single Threshold and Two Categories for the Dependent Variable.

Figure 2 shows what happens when we have three ordered categories rather than two. Figure 2 still represents a latent variable Y* on the y-axis as a linear function of X. However, Y* is now divided by two thresholds into three observed categories. At any given value of X, the proportion of the probability distribution that falls below the first threshold (shown in white) represents the probability of falling into the first category. The proportion of the probability distribution that falls between the two thresholds (shown in red) represents the probability of falling into the second category. The proportion of the probability distribution that falls above the second threshold (shown in blue) represents the probability of falling into the third category. At any given value of X, all of these probabilities will (and must) sum to 1. For an ordered logit model, the distribution is assumed to be logistic. (Note: If the distribution were assumed to be normal, we would estimate the model using ordered probit rather than ordered logit.)

Figure 2: Illustration of the Latent Variable Interpretation of an Ordered Logit Model With Two Thresholds and Three Ordered Categories for the Dependent Variable.

Because the latent variable Y* is unobserved, it has no scale. In order to estimate the model illustrated in Figure 1 or 2, we need to impose a restriction on either the intercept of the regression line or one of the thresholds. In the case of simple logit, where there is only one threshold and thus two categories on the dependent variable, nearly every statistical software package fixes the threshold at zero and estimates the intercept. In the case of ordered logit, where there are two or more thresholds and thus three or more categories, nearly every statistical software package fixes the intercept at zero and estimates the thresholds. In fact, many software programs refer to the set of thresholds as intercepts. These restrictions are necessary, but the choice between them is not consequential. Regardless of which restriction is imposed, the estimated impact of each independent variable on the dependent variable will be unchanged.
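The reason the choice of restriction is inconsequential can be checked numerically: adding the same constant to the intercept and to every threshold leaves all category probabilities unchanged, so only one of the two can be estimated. The Python sketch below illustrates this with a logistic distribution for the latent variable; the function name and all parameter values are hypothetical.

```python
import math


def logistic(eta):
    """Logistic CDF; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probabilities(x, intercept, slope, thresholds):
    """Category probabilities when latent Y* = intercept + slope * x + error."""
    linear_predictor = intercept + slope * x
    # Cumulative probabilities of falling at or below each threshold.
    cumulative = [logistic(t - linear_predictor) for t in thresholds] + [1.0]
    probabilities, previous = [], 0.0
    for c in cumulative:
        probabilities.append(c - previous)
        previous = c
    return probabilities


# Hypothetical parameter values for illustration only.
slope = 0.7
intercept_fixed_at_zero = category_probabilities(2.0, 0.0, slope, [-0.5, 1.0])
# Shift the intercept AND every threshold by the same constant (here 1.5).
shifted_by_constant = category_probabilities(2.0, 1.5, slope, [1.0, 2.5])

print([round(p, 4) for p in intercept_fixed_at_zero])
print([round(p, 4) for p in shifted_by_constant])
```

Both parameterizations print the same three probabilities, and the slope is identical in each, which is why the estimated impact of X does not depend on which restriction the software imposes.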

Ordered logit models express the latent variable Y* as a function of one or more independent variables as shown in Equation 1:

$$Y_i^* = X_i\beta + \varepsilon_i \tag{1}$$

where:

• $Y_i^*$ is the latent dependent variable for subject i.
• $X_i$ is a row vector of individual values of the independent variables for subject i.
• $\beta$ is a column vector of coefficients that link those independent variables to the dependent variable.
• $\varepsilon_i$ is a random variable with a standard logistic distribution.

Because we only observe the categorical version Y of Y*, we need a way to link Xβ to the probability that an observation falls into each given category of the observed dependent variable Y. Suppose there are J categories on the dependent variable. We need a way to transform Xβ into the probability that each observation falls into each of the J categories. This will also require that we estimate values for the thresholds. In short, we need what is called a link function to perform the following operation as shown in Equation 2:

$$g(p_{ij}) = \tau_j - Y_i^* \tag{2}$$

where:

• $g()$ is a link function we have yet to define.
• $p_{ij} = \Pr(\tau_{j-1} < Y_i^* \le \tau_j)$ is the probability of observation i falling into category j of the dependent variable.
• $\tau_j$ is the estimated threshold separating category j from category j + 1.

For the ordered logit model, g() comes from the logistic link function. Thus, we can calculate values for pij using the inverse of the logit link function as shown in Equation 3:

$$p_{ij} = \frac{\exp(\tau_j - X_i\beta)}{1 + \exp(\tau_j - X_i\beta)} - \frac{\exp(\tau_{j-1} - X_i\beta)}{1 + \exp(\tau_{j-1} - X_i\beta)} \tag{3}$$
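Equation 3 can be sketched in a few lines of Python. The sketch assumes the usual convention that the lowest category has no lower threshold (its cumulative probability starts at 0) and the highest category has no upper threshold (its cumulative probability ends at 1). The function names, the thresholds, and the value of Xβ are hypothetical and serve only to illustrate the calculation.

```python
import math


def logistic(eta):
    """exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probabilities(xb, thresholds):
    """Equation 3: p_ij = F(tau_j - X_i beta) - F(tau_{j-1} - X_i beta).

    `xb` is the linear predictor X_i beta; `thresholds` holds the J - 1
    estimated thresholds in increasing order.
    """
    # Cumulative probabilities Pr(Y <= j), padded with 0 and 1 at the ends.
    cumulative = [0.0] + [logistic(t - xb) for t in thresholds] + [1.0]
    return [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]


# Hypothetical thresholds and linear predictor (illustration only).
probabilities = category_probabilities(xb=0.3, thresholds=[-1.0, 0.5])
for j, p in enumerate(probabilities, start=1):
    print(f"category {j}: {p:.3f}")
print(f"sum: {sum(probabilities):.3f}")
```

With J − 1 thresholds the function returns J probabilities, and they sum to 1 by construction because each one is a difference of adjacent cumulative probabilities.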

Researchers have values for Y and the independent variables X in their datasets. They use MLE to estimate the coefficients β as well as the thresholds, the τj values (sometimes called the intercepts). Unlike in standard multiple regression, the β coefficients cannot be directly interpreted as slope coefficients that describe the marginal effect of each independent variable on the probability that Y falls into any particular category. Interpreting the coefficient estimates of an ordered logit model is more complicated and is described below in the context of a specific example.
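To show what MLE is maximizing, the Python sketch below evaluates the ordered logit log-likelihood, the sum over observations of the log of the probability assigned to the category each observation actually falls in, for two candidate coefficient values. It is only an evaluation of the objective, not an optimizer, and the tiny dataset, thresholds, and candidate coefficients are all made up for illustration.

```python
import math


def logistic(eta):
    """Logistic CDF; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def category_probabilities(xb, thresholds):
    """Probability of each ordered category given the linear predictor xb."""
    cumulative = [0.0] + [logistic(t - xb) for t in thresholds] + [1.0]
    return [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]


def log_likelihood(beta, thresholds, rows, outcomes):
    """Sum of log probabilities of the observed categories (coded 0, 1, 2, ...)."""
    total = 0.0
    for row, observed in zip(rows, outcomes):
        xb = sum(b * value for b, value in zip(beta, row))
        total += math.log(category_probabilities(xb, thresholds)[observed])
    return total


# Made-up data in which higher X tends to go with a higher category.
rows = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
outcomes = [0, 0, 1, 1, 2, 2]
thresholds = [0.5, 1.5]  # hypothetical thresholds, held fixed here

ll_positive = log_likelihood([1.0], thresholds, rows, outcomes)
ll_negative = log_likelihood([-1.0], thresholds, rows, outcomes)
print(f"log-likelihood with beta = +1: {ll_positive:.2f}")
print(f"log-likelihood with beta = -1: {ll_negative:.2f}")
```

The candidate whose sign matches the pattern in the data yields the larger (less negative) log-likelihood. An MLE routine searches over the coefficients and the thresholds together for the values that make this quantity as large as possible.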

Illustrative Example: Working Status of Women With Children

This analysis examines whether having children influences the working status of women. The specific research question is:

Controlling for age and education, are women with more children less likely to be employed?

The research question could be stated in the form of a null hypothesis:

H0: Controlling for age and education, the number of children a woman has does not affect her employment status.

Note that “controlling for” simply means that those variables are included in the model; it should not be confused with controlled experiments.

The Data

This example uses a subset of data from the 2016 GSS (http://gss.norc.org/). We use several variables:

• Employment status (WRKSTAT): possible values are Not working, Working part-time, and Working full-time.
• Number of children: the number of children the respondent has.
• Age (AGE): age in years, treated as a continuous variable ranging from 18 to 89.
• Education (DEGREE): highest degree earned, an ordinal variable with possible values 1 = Less than high school, 2 = High school, 3 = Junior college, 4 = Bachelor, 5 = Graduate.
• Gender (SEX): Male or Female.

We consider female subjects only in this example. There are 1,189 female subjects. Responses for the dependent variable (WRKSTAT) are recorded on a 3-level scale that follows an order from not working to working full-time, making this example appropriate for ordered logit.

Analyzing the Data

Before proceeding to the ordered logit model, it is a good idea to examine the distribution of the dependent variable. Its frequency distribution is shown in Figure 3:

Figure 3: Frequency Distribution of the Employment Status.

In the figure we can see that 626 subjects were working full-time, 231 were working part-time, and 332 were not working. Ordered logit models do not perform as well if there are small numbers of observations in one or more of the categories of the dependent variable or if there is a substantial skew in the distribution of observations across the categories. We have neither of those problems here.
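The counts reported above can be turned into shares with a few lines of Python, which makes it easy to confirm that no category is sparse. Only the three counts come from the text; the variable names are arbitrary.

```python
# Category counts as reported in the text for the 1,189 female respondents.
counts = {
    "Not working": 332,
    "Working part-time": 231,
    "Working full-time": 626,
}

total = sum(counts.values())
shares = {category: n / total for category, n in counts.items()}

print(f"total observations: {total}")
for category, share in shares.items():
    print(f"{category}: {counts[category]} ({share:.1%})")
```

Even the smallest category, part-time work, contains a little under one fifth of the sample, which is consistent with the statement that small cell sizes are not a concern here.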

It would also be valuable to produce summary statistics for, and explore the distributions of, each of the independent variables. However, in the interest of space, we forgo doing so here.

The results of the ordered logit model itself are presented in Figure 4:

Figure 4: Summary of the Ordered Logit Model.


The top portion of Figure 4 reports the individual parameter estimates linking the independent variables to the dependent variable, their estimated standard errors, and t values. The middle portion of Figure 4 reports the estimated values for the thresholds or intercepts. Researchers generally do not have predictions about the thresholds or their level of statistical significance.

The bottom portion of Figure 4 reports two measures of relative model fit. Neither measure follows any particular scale, so neither can be interpreted as “large” or “small” in absolute terms. They become relevant only if we estimate additional models using exactly the same dataset and dependent variable.

In this example, we focus our attention on the individual coefficient estimates linking the independent variables to the dependent variable and their corresponding level of statistical significance.

Using their t values, we can calculate approximate p-values for the coefficients from the t-distribution (which is available in most statistical software). All of the p-values are less than .01, so each coefficient estimate is statistically significantly different from zero. This leads us to reject the null hypothesis that the coefficient equals zero for all of the estimates. Here are several typical observations to be made from Figure 4:

• Controlling for age and education level, number of children is significantly and negatively associated with employment, suggesting that women with more children are less likely to be working full-time and more likely to be not working.
• Controlling for number of children and education level, age is significantly and negatively associated with employment, suggesting that older women are less likely to be working full-time and more likely to be not working.
• Controlling for number of children and age, education level is significantly and positively associated with employment, suggesting that women with higher education are more likely to be working full-time.

However, just looking at ordered logit coefficients and tests of statistical significance does not tell the whole story. We explore some of the findings in greater detail through computing predicted probabilities.

Predicted Probabilities

We can compute the predicted probability of a respondent falling into the various employment categories based on the results in Figure 4 using the inverse link function shown previously in Equation 3. Because the relationship between all of the independent variables and the probability that Y falls into a particular category is nonlinear, you can only compute a predicted probability by setting every independent variable in the model to some specific value.

For example, to compare the predicted probabilities of falling into each of the three categories on the dependent variable between women with at most high school degrees and women with at most college degrees, we need to set the value for the degree variable to the appropriate value, and we need to set all of the other independent variables to some fixed values as well. The most common strategy is to set the remaining variables to central measures such as their means, medians, or modes. An alternative is to compute the predicted probability for each observation based on its own values for its independent variables, but this makes it harder to isolate the potential effect of any one independent variable. To keep it simple, we consider women with at most high school degrees and women with at most college degrees separately while keeping the other independent variables at their means.
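The strategy just described can be sketched as follows in Python. Every number in the sketch is hypothetical: the coefficients, thresholds, and sample means are placeholders with plausible signs, not the estimates reported in Figure 4, and treating DEGREE = 2 and DEGREE = 4 as "at most high school" and "at most college" is an assumption based on the variable coding listed earlier.

```python
import math


def logistic(eta):
    """Logistic CDF; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# HYPOTHETICAL coefficients, thresholds, and means (illustration only).
COEFFICIENTS = {"children": -0.20, "age": -0.03, "degree": 0.30}
THRESHOLDS = [-2.0, -1.0]  # not working | part-time, part-time | full-time
MEANS = {"children": 1.9, "age": 49.0}
CATEGORIES = ["Not working", "Working part-time", "Working full-time"]


def predicted_probabilities(degree):
    """Category probabilities with degree set by hand and other variables at their means."""
    values = dict(MEANS, degree=degree)
    xb = sum(COEFFICIENTS[name] * values[name] for name in COEFFICIENTS)
    cumulative = [0.0] + [logistic(t - xb) for t in THRESHOLDS] + [1.0]
    return [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]


for label, degree in (("High school", 2), ("Bachelor", 4)):
    probabilities = predicted_probabilities(degree)
    summary = ", ".join(f"{c}: {p:.3f}" for c, p in zip(CATEGORIES, probabilities))
    print(f"{label} -> {summary}")
```

Only the degree value changes between the two calls, so any difference in the printed probabilities is attributable to education under the assumed coefficients.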

Figure 5 reports the results of estimating these predicted probabilities using postestimation simulation. A full discussion of this process is beyond the scope of this example, but briefly, the process computes 1,000 sets of predicted probabilities by simulating values for the model coefficients based on their estimated values, variances, and covariances. For more information, see “Making the most of statistical analyses: improving interpretation and presentation” by King, Tomz, and Wittenberg (American Journal of Political Science, 44 (2): 341–355).
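The general idea of postestimation simulation can be sketched in Python as follows. This is not the procedure or software used to produce Figure 5; it is a simplified illustration with one independent variable, and the parameter estimates and covariance matrix are hypothetical. Coefficient sets are drawn from a multivariate normal distribution centered on the estimates, and each draw is converted into category probabilities.

```python
import math
import random


def logistic(eta):
    """Logistic CDF; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def cholesky(matrix):
    """Lower-triangular L with L L^T = matrix (symmetric, positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


# HYPOTHETICAL estimates (slope, threshold 1, threshold 2) and covariance matrix.
ESTIMATES = [0.5, -1.0, 0.5]
COVARIANCE = [[0.010, 0.002, 0.002],
              [0.002, 0.020, 0.010],
              [0.002, 0.010, 0.020]]


def simulate_probabilities(x, n_draws=1000, seed=2016):
    """Draw parameter sets and return one list of category probabilities per draw."""
    rng = random.Random(seed)
    lower = cholesky(COVARIANCE)
    draws = []
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in ESTIMATES]
        # Correlated draw: estimate + L z.
        beta, tau1, tau2 = [est + sum(lower[i][k] * z[k] for k in range(i + 1))
                            for i, est in enumerate(ESTIMATES)]
        below_1 = logistic(tau1 - beta * x)
        below_2 = logistic(tau2 - beta * x)
        draws.append([below_1, below_2 - below_1, 1.0 - below_2])
    return draws


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, max(0, math.ceil(q * len(sorted_values)) - 1))
    return sorted_values[index]


def summarize(draws):
    """Mean, median, and 2.5/97.5 percentiles of each category's probability."""
    summary = []
    for j in range(len(draws[0])):
        values = sorted(d[j] for d in draws)
        summary.append({"mean": sum(values) / len(values),
                        "p2.5": percentile(values, 0.025),
                        "median": percentile(values, 0.5),
                        "p97.5": percentile(values, 0.975)})
    return summary


for j, row in enumerate(summarize(simulate_probabilities(x=1.0)), start=1):
    print(j, {name: round(value, 3) for name, value in row.items()})
```

The spread between the 2.5 and 97.5 percentiles reflects the uncertainty in the estimated coefficients and thresholds, which is what a table of simulated predicted probabilities is meant to convey.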

Figure 5: Estimated Predicted Probability of Falling Into Each of Three Employment Categories for Women With at Most High School Degrees (Top) and Women With at Most College Degrees (Bottom), Holding the Remaining Independent Variables Constant.


The results include estimated means, standard deviations, medians (50%), and the 2.5 and 97.5 percentiles for the 1,000 simulated expected values of the dependent variable. This is generally where researchers focus their attention. At the bottom of each table, the results also include a mean for the predicted value of Y and those values of Y that represent the 50.0, 2.5, and 97.5 percentiles.

The tables in Figure 5 are helpful when you only need to compute a small number of predicted probabilities to interpret the findings of the model. The best way to explore the impact of a continuous independent variable or an independent variable that takes on many values is to compute the predicted probability of falling into one of the employment categories based on values of the independent variable in question and present the results graphically. Figure 6 presents the predicted probability of “Not working,” along with confidence intervals, as a function of the number of children one has, holding all other independent variables constant. The solid blue curve at the center of the plot shows the predicted probabilities, and the lighter grey curves around it are upper and lower confidence limits at various confidence levels (0.8, 0.95, and 0.99). We can see that the probability of “Not working” increases with the number of children between zero and five and starts to decrease as the number of children continues to increase.

Figure 6: Predicted Probability of “Not working,” Along With Confidence Intervals, as a Function of the Number of Children One Has, Holding All Other Independent Variables Constant.

Complete interpretation of the results of an ordered logit model would present similar tables or figures for every independent variable in the model.


Presenting Results

The results of an ordered logit model can be presented in a variety of ways. Here we offer one example.

“We used a subset of data from the 2016 GSS to test the following null hypothesis:

H0: Controlling for age and education, the number of children a woman has does not affect her employment status.

There are 1,189 female subjects in this sample. Results from the ordered logit model are presented in Figure 4. Controlling for age and education level, the number of children a woman has is significantly and negatively associated with employment, suggesting that women with more children are less likely to be working full-time and more likely to be not working. Figure 6 shows that the probability of “Not working” increases with the number of children between zero and five and starts to decrease as the number of children continues to increase. Further interpretation and diagnostic testing should be explored to evaluate the robustness of these findings.”

Review

Ordered logit expresses an ordered categorical dependent variable as a function of one or more independent variables. Ordered logit models are estimated via MLE. Direct interpretation of the coefficient estimates is limited to whether they are positive, negative, or not statistically significant. Really understanding the results of an ordered logit model requires calculating predicted probabilities.

The ordered logit model is very similar to the ordered probit model. Ordered logit simply assumes the residuals of the latent variable model follow a logistic distribution, whereas the ordered probit model assumes they follow a normal distribution. Ordered logit is also somewhat similar to the multinomial logit model, which is a model where the dependent variable takes on three or more categorical values, but for the multinomial logit model, those categories are not assumed to follow any order. There is also a multinomial probit model that shares some similarities with ordered logit.

You should know:

• What types of variables are suitable for an ordered logit model.
• The basic assumptions behind the ordered logit model.
• How to estimate and interpret the results of an ordered logit model.
• How to report the results from an ordered logit model.

Your Turn

You can download the sample dataset along with a guide showing how to estimate an ordered logit model using statistical software. See whether you can reproduce the results presented here; then repeat the analysis by considering only male subjects.
