Analysis of Binary Dependent Variables Using Linear Probability Model and Logistic Regression: A Replication Study



Submitted by Lutendo Vele. A thesis submitted to the Department of Statistics in partial fulfilment of the requirements for the Master degree in Statistics in the Faculty of Social Sciences. Supervisor: Harry J. Khamis. Spring, 2019.

ABSTRACT

The Linear Probability Model (LPM) is commonly used because it is easier to compute and interpret than logits and probits, even though the estimated probabilities may fall outside the [0, 1] interval and the linearity concept does not make much sense when dealing with probabilities. This paper extends upon the results of Luca, Owens, and Sharma (2015), reviewing the use of LPM to examine whether alcohol prohibition reduces domestic violence. The regular LPM resulted in inconclusive estimates, since prohibition was omitted due to collinearity as controls were added. However, Luca et al. (2015) reported results, and further inspection of their regression commands showed that they ran a linear regression, then a post-estimation on the residuals, and then used the residuals as a dependent variable; hence their results differed from the regular LPM. Their method still resulted in unbounded predicted probabilities and heteroscedastic residuals, showing that OLS was inefficient and that a non-linear binary choice model such as logistic regression would be a better option. Logistic regression predicts the probability of an outcome that can take only two values and was therefore used in this paper. Unlike LPM, logistic regression uses a non-linear function, which results in a sigmoid bounding the predicted outcome between 0 and 1. Logistic regression presented no such complications; thus logistic regression (or another non-linear model for dichotomous dependent variables) should have been used for the final analysis, while LPM is used at a preliminary stage to get quick results.

Keywords: binary choice models, logistic regression, linear probability model, forbidden regression, binary dependent variables, dichotomous variables, residuals as dependent variables

Contents

1 Introduction
2 Data
  2.1 Descriptive Statistics
  2.2 Treatment of Missing Data
3 Methodology
  3.1 Linear Probability Model
    3.1.1 Assumptions of the Linear Probability Model
    3.1.2 Criticisms of the Linear Probability Model
  3.2 Logistic Regression
    3.2.1 Assumptions of Logistic Regression
    3.2.2 Criticisms of Logistic Regression
4 Results
  4.1 Linear Probability Model Results
  4.2 Logistic Regression Results
    4.2.1 Sample Size Analysis
    4.2.2 Examining the odds that the husband drinks
    4.2.3 Examining the odds that the husband beats his wife
5 Conclusion
References

1 Introduction

Data analysis, which is grounded in statistics with a long history, has played an important role in different domains through a process that begins with data collection and ends with analysis to answer the research question(s). To answer a research question, one needs to study different factors: the dependent variable and the independent variables. Regression analysis is a mathematical process that helps answer questions such as which variables are most significant and how those variables interact with each other.
Variables come in two main groups, each with further classifications: categorical variables, also known as qualitative or discrete variables, which are further classified into nominal, dichotomous and ordinal; and continuous variables, also known as quantitative variables, which are classified into interval and ratio. Therefore, different analysis or modelling methods are needed for different dependent variable types. This paper focuses on regression models for dichotomous dependent variables, that is, binary choice models. Dichotomous variables take the value 1, which may represent success, or 0, representing failure. When modelling such a variable, one quickly thinks in terms of probabilities. For example, what is the probability that a married man who lives in a state with alcohol prohibition, is religious, has a certain level of education, and works a white-collar job drinks alcohol?

The Linear Probability Model (LPM) and logistic regression are among the models estimated when the regression model has a dichotomous dependent variable. Regardless of the criticisms discussed by Maddala (1983), LPM is one of the most widely applied statistical models in the social sciences because of its easy interpretability and computation speed. Logistic regression, introduced in the late 1960s and early 1970s, addresses the criticisms discussed by Maddala (1983), but because its model is fit by an iterative maximum likelihood process it was expensive to estimate; as a result, LPM remained favourable, especially in the early years. With the improvement of computer technology, however, the linear probability model lost favour. Since LPM violates probability boundaries and thus results in somewhat meaningless predictions, LPM can be used as the first step in a dichotomous dependent variable analysis. Amemiya (1981, pp. 1486-1487), in his survey on qualitative response models, stated concerning LPM: "it has frequently been used in econometric applications, especially in the early years, because of its computational simplicity. Though I do not recommend its use in the final stage of a study, it may be used for the purpose of obtaining quick estimates in a preliminary stage".

This paper extends upon the results of Luca et al. (2015). The authors used the linear probability model to examine both the effect of prohibition on the drinking behaviour of husbands and the impact of prohibition on domestic violence. However, nothing is mentioned regarding the two main problems with the LPM: unbounded probability predictions are possible, and linearity does not make much sense conceptually. This provided enough curiosity to replicate their study. Therefore, the objectives of this thesis are to give a brief overview of the linear probability model and logistic regression and to review, using applications, whether the logistic model would be preferable over LPM.

This paper contains an overview of LPM and logistic regression along with applications of each. Section 2 contains the data description. Section 3 covers the basic preliminary ideas surrounding LPM and logistic regression, their assumptions and criticisms, including how they differ. This is followed by a summary of the results in Section 4 and, finally, concluding remarks in Section 5.

2 Data

This thesis investigates whether the use of LPM in the article by Luca et al. (2015) was the best binary choice regression model to answer the research questions, given that LPM has weaknesses which may result in meaningless predictions.
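The weakness referred to here can be illustrated with a small simulation. The sketch below is not from the thesis: it uses simulated data and the Python statsmodels package (both assumptions of this illustration) to fit an LPM and a logistic regression to the same binary outcome, showing that only the logit keeps every predicted probability inside [0, 1].

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: a binary outcome driven by one continuous covariate.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))   # true probabilities
y = rng.binomial(1, p_true)
X = sm.add_constant(x)

# Linear probability model: plain OLS on the 0/1 outcome.
lpm_prob = sm.OLS(y, X).fit().predict(X)

# Logistic regression: maximum likelihood with a sigmoid link.
logit_prob = sm.Logit(y, X).fit(disp=0).predict(X)

print("LPM fitted values outside [0, 1]:",
      np.sum((lpm_prob < 0) | (lpm_prob > 1)))      # typically > 0
print("Logit fitted values outside [0, 1]:",
      np.sum((logit_prob < 0) | (logit_prob > 1)))  # always 0
```

In the replication, this out-of-range behaviour is exactly the symptom that motivates moving from LPM to logistic regression for the final analysis.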
The same datasets used by the authors were provided when they published their paper and were therefore used in this thesis. Several datasets were used. First, a panel dataset was compiled containing the evolution of alcohol prohibition, that is, the precise laws pertaining to the prohibition of alcohol sales and/or consumption and their changes for 17 major states in India from 1980 to 2010. Rich microdata were also collected from the 1998-1999 and 2005-2006 Indian National Family Health Survey (NFHS) to investigate the impact of alcohol prohibition on individual behaviour. Finally, data from the Indian National Crime Records Bureau (NCRB) for the years 1980-2010 were used to complement the individual-level analysis with state-level administrative crime data, with a focus on crimes targeted towards women. The 17 major states investigated in the study were not listed, and with the information given it is also hard to identify them, as India has 29 states. The NFHS has approximately 3% missing observations, while the NCRB has approximately 31% missing observations, which exceeds the 25% rule of thumb proposed by Demirtas et al., as cited in Enders (2010, p. 260). The NCRB was used to investigate the effect of prohibition on other types of crimes targeted towards women at the state level, which are continuous variables and are therefore excluded from this thesis.

2.1 Descriptive Statistics

This subsection contains descriptive statistics of the NFHS variables. The National Family Health Survey will be used to investigate the impact of alcohol prohibition, and each variable's description is shown in Table 1 below. Household members' alcohol consumption and women's experiences of and attitudes towards intimate partner violence are contained in this dataset. The husband's and wife's demographic characteristics, such as age, education, household size, religion, whether he or she works a white-collar job, and whether their household is in an urban area, are part of the dataset and are used collectively during the analysis as husband controls and wife controls to see if their inclusion in the model has an effect on the estimate. The number of children in the house, whether or not the husband is justified in beating his wife if he suspects her of cheating, and whether or not the wife has money of her own were grouped as bargaining controls. Lastly, the age categories of both husband and wife, as well as their age gap and educational gap, were also used. The information provided by these variables suggests that they may affect the husband's drinking and/or violent behaviour.

Table 1: Indian National Family Health Survey

Variable    Description
year        Year of Interview
rep age     Respondent's Current Age
rep educ    Education in Single Years
hhsize      Number of Household Members
children    Number of Living Children
stwt        State Individual Weight
husb beat
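The outcome variables and controls above feed the models reported in Section 4. A minimal sketch of that kind of fit, using the statsmodels formula interface, is shown below; the file name and column names (husb_drinks, prohibition, and so on) are hypothetical stand-ins, not the actual NFHS variable names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical NFHS extract: one row per couple (file and column names are illustrative).
df = pd.read_csv("nfhs_extract.csv")

# Odds that the husband drinks, modelled as a function of state prohibition
# plus a few husband controls.
result = smf.logit(
    "husb_drinks ~ prohibition + husb_age + husb_educ + hhsize + urban",
    data=df,
).fit()

print(result.summary())
print(np.exp(result.params))  # coefficients expressed as odds ratios
```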
Recommended publications
  • Lecture Notes 7
    ECON 497: Research and Forecasting, Metropolitan State University. Lecture Notes 13: Dummy Dependent Variable Techniques (Studenmund, Chapter 13). Basically, if you have a dummy dependent variable you will be estimating a probability. Probabilities are necessarily restricted to fall in the range [0, 1], and this puts special conditions on the regression. Just doing a linear regression can result in estimated probabilities that are either negative or greater than 1, which are a bit nonsensical. As a result, there are other techniques for estimating these relationships that can generate better results. The Linear Probability Model. The second or third best way to estimate models with dummy dependent variables is to simply estimate the model as you normally might: Di = β0 + β1X1i + β2X2i + εi. For example, if you have a sample of the U.S. adult population and you're trying to determine the probability that a person is incarcerated, you might estimate the equation: Di = β0 + β1AGEi + β2GENDERi + εi, where Di is a dummy variable taking the value 1 if a person is incarcerated and 0 if not, AGEi is the person's age, and GENDERi is a dummy variable equal to 1 if the person is male and 0 otherwise. Imagine that the estimated coefficients are: D̂i = 0.0043 − 0.0001*AGEi + 0.0052*GENDERi. Interpretation of the estimated coefficients is straightforward. If there are two women, one of whom is one year older than the other, the estimated probability that the older one will be incarcerated will be 0.0001 less than the estimated probability that the younger one will be.
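A short sketch (in Python, an assumption here; the lecture notes themselves supply no code) that evaluates the fitted equation above at a few covariate values makes this interpretation concrete.

```python
def lpm_prob(age: float, male: int) -> float:
    """Predicted incarceration probability from the estimated LPM above."""
    return 0.0043 - 0.0001 * age + 0.0052 * male

print(round(lpm_prob(40, 0), 4))                      # 0.0003: a 40-year-old woman
print(round(lpm_prob(41, 0), 4))                      # 0.0002: one year older, 0.0001 lower
print(round(lpm_prob(40, 1) - lpm_prob(40, 0), 4))    # 0.0052: the male/female gap at any age
```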
  • Logistic Regression, Part I: Problems with the Linear Probability Model
    Logistic Regression, Part I: Problems with the Linear Probability Model (LPM). Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/. Last revised February 22, 2015. This handout steals heavily from Linear Probability, Logit, and Probit Models, by John Aldrich and Forrest Nelson, paper #45 in the Sage series on Quantitative Applications in the Social Sciences.
    INTRODUCTION. We are often interested in qualitative dependent variables:
    • Voting (does or does not vote)
    • Marital status (married or not)
    • Fertility (have children or not)
    • Immigration attitudes (opposes immigration or supports it)
    In the next few handouts, we will examine different techniques for analyzing qualitative dependent variables; in particular, dichotomous dependent variables. We will first examine the problems with using OLS, and then present logistic regression as a more desirable alternative.
    OLS AND DICHOTOMOUS DEPENDENT VARIABLES. While estimates derived from regression analysis may be robust against violations of some assumptions, other assumptions are crucial, and violations of them can lead to unreasonable estimates. Such is often the case when the dependent variable is a qualitative measure rather than a continuous, interval measure. If OLS regression is done with a qualitative dependent variable,
    • it may seriously misestimate the magnitude of the effects of IVs
    • all of the standard statistical inferences (e.g. hypothesis tests, construction of confidence intervals) are unjustified
    • regression estimates will be highly sensitive to the range of particular values observed (thus making extrapolations or forecasts beyond the range of the data especially unjustified)
    OLS REGRESSION AND THE LINEAR PROBABILITY MODEL (LPM). The regression model places no restrictions on the values that the independent variables take on.
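One of the problems listed above, invalid standard inference, stems from the heteroskedasticity the LPM builds in: when the outcome is 0/1, Var(ε|x) = P(x)(1 − P(x)). A minimal sketch of the usual partial remedy, heteroskedasticity-robust standard errors, using simulated data and statsmodels (both assumptions of this illustration):

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary outcome; under the LPM, Var(e | x) = p(x)(1 - p(x)),
# so the errors are heteroskedastic by construction.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=2000)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
X = sm.add_constant(x)

ols_plain  = sm.OLS(y, X).fit()                  # classical standard errors
ols_robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust (HC1)

print(ols_plain.bse)    # can misstate uncertainty under heteroskedasticity
print(ols_robust.bse)   # valid large-sample standard errors for the LPM
```

Robust standard errors fix the inference problem but not the out-of-range predictions, which is why the handouts treat them as a patch rather than a solution.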
  • Introduction to Generalized Linear Models for Dichotomous Response Variables Edps/Psych/Soc 589
    Introduction to Generalized Linear Models for Dichotomous Response Variables. Edps/Psych/Soc 589. Carolyn J. Anderson, Department of Educational Psychology. © Board of Trustees, University of Illinois.
    Outline: Introduction to GLMs for binary data. Primary example: High School & Beyond. The problem. Linear model for π. Modeling the relationship between π(x) and x. Logistic regression. Probit models. Trivia. Graphing: jitter and loess.
    The Problem. Many variables have only 2 possible outcomes. Recall: Bernoulli random variables Y = 0, 1. π is the probability of Y = 1. E(Y) = µ = π. Var(Y) = µ(1 − µ) = π(1 − π). When we have n independent trials and take the sum of the Y's, we have a Binomial distribution with mean nπ and variance nπ(1 − π). We are typically interested in π. We will consider models for π, which can vary according to the values of the explanatory variable(s) (i.e., x1, ..., xk). To emphasize that π changes with the x's, we write π(x).
    Example: High School & Beyond. Data from seniors (N = 600). Consider whether students attend an academic high school program type or a non-academic program type (Y).
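The GLM framing in these notes maps directly onto standard software. A minimal sketch, assuming the statsmodels GLM interface and simulated data in place of the High School & Beyond file:

```python
import numpy as np
import statsmodels.api as sm

# Bernoulli outcome with success probability pi(x); E(Y) = pi, Var(Y) = pi(1 - pi).
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.2 * x))))
X = sm.add_constant(x)

# Logistic regression expressed as a GLM: Binomial family with its default logit link
# (a probit link could be swapped in to get the probit model from the outline).
logit_glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()

print(logit_glm.summary())
```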
  • 1. Linear Probability Model Vs. Logit (Or Probit) We Have Often Used Binary ("Dummy") Variables As Explanatory Variables in Regressions
    EEP/IAS 118, Andrew Dustan, Section Handout 13. 1. Linear Probability Model vs. Logit (or Probit). We have often used binary ("dummy") variables as explanatory variables in regressions. What about when we want to use binary variables as the dependent variable? It's possible to use OLS: y = β0 + β1x1 + ⋯ + βkxk + u, where y is the dummy variable. This is called the linear probability model. Estimating the equation gives P̂(y = 1|x) = ŷ = β̂0 + β̂1x1 + ⋯ + β̂kxk, where ŷ is the predicted probability of having y = 1 for the given values of x1, ..., xk. Problems with the linear probability model (LPM): 1. Heteroskedasticity: can be fixed by using the "robust" option in Stata. Not a big deal. 2. Possible to get ŷ < 0 or ŷ > 1. This makes no sense: you can't have a probability below 0 or above 1. This is a fundamental problem with the LPM that we can't patch up. Solution: use the logit or probit model. These models are specifically made for binary dependent variables and always result in 0 < ŷ < 1. Let's leave the technicalities aside and look at a graph of a case where LPM goes wrong and the logit works. [Figure: two panels plotting the predicted probability P̂(y = 1|x) against the linear index β̂0 + β̂1x1 + ⋯ + β̂kxk; left panel: linear probability model, a straight line that falls below 0 and rises above 1; right panel: logit (probit looks similar), an S-shaped curve bounded between 0 and 1.] This is the main feature of a logit/probit that distinguishes it from the LPM: the predicted probability of y = 1 is never below 0 or above 1, and the shape is always like the one on the right rather than a straight line.
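That S-shape is simply the logistic function applied to the linear index. A small numerical sketch (illustrative coefficients only, not estimates from any dataset) shows how it maps an unbounded index into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """Logistic CDF: maps any linear index b0 + b1*x1 + ... + bk*xk into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logit coefficients and a grid of x values.
b0, b1 = -1.0, 2.5
x = np.linspace(-5, 5, 11)

linear_index = b0 + b1 * x        # unbounded, like an LPM fitted value
p_hat = sigmoid(linear_index)     # always strictly between 0 and 1

print(linear_index.min(), linear_index.max())   # far outside [0, 1]
print(p_hat.min(), p_hat.max())                 # roughly 1e-6 to 0.99999
```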
  • Logistic or Linear? Estimating Causal Effects of Experimental Treatments on Binary Outcomes Using Regression Analysis
    Logistic or Linear? Estimating Causal Effects of Experimental Treatments on Binary Outcomes Using Regression Analysis. Robin Gomila, Princeton University. Currently in production: Journal of Experimental Psychology: General. © 2020, American Psychological Association. This paper is not the copy of record and may not exactly replicate the final, authoritative version of the article. Please do not copy or cite without the author's permission. The final article will be available, upon publication, via its DOI: 10.1037/xge0000920. Corresponding Author: Correspondence concerning this article should be addressed to Robin Gomila, Department of Psychology, Princeton University. E-mail: [email protected]. Materials and Code: Simulations and analyses reported in this paper were computed in R. The R code can be found on the Open Science Framework (OSF): https://osf.io/ugsnm/. Author Contributions: Robin Gomila generated the idea for the paper. He wrote the article, simulation code, and analysis code. Conflict of Interest: The author declares that there were no conflicts of interest with respect to the authorship or the publication of this article.
    Abstract. When the outcome is binary, psychologists often use nonlinear modeling strategies such as logit or probit. These strategies are often neither optimal nor justified when the objective is to estimate causal effects of experimental treatments. Researchers need to take extra steps to convert logit and probit coefficients into interpretable quantities, and when they do, these quantities often remain difficult to understand. Odds ratios, for instance, are described as obscure in many textbooks (e.g., Gelman & Hill, 2006, p.
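The practical upshot of this argument can be sketched in a few lines: in a randomized experiment with a binary outcome, the LPM coefficient on treatment is directly a difference in probabilities, while a logit coefficient sits on the log-odds scale and must be converted, for example to an average marginal effect. The sketch below uses simulated data and statsmodels (Python rather than the paper's R, and not the paper's own code).

```python
import numpy as np
import statsmodels.api as sm

# Simulated randomized experiment with a binary outcome.
rng = np.random.default_rng(3)
treat = rng.binomial(1, 0.5, size=4000)
y = rng.binomial(1, np.where(treat == 1, 0.55, 0.40))   # true effect: +0.15
X = sm.add_constant(treat)

# LPM: the treatment coefficient is directly the difference in probabilities.
lpm = sm.OLS(y, X).fit(cov_type="HC1")
print(lpm.params[1])                        # roughly 0.15

# Logit: the raw coefficient is a log-odds ratio and needs conversion;
# the average marginal effect recovers a comparable probability difference.
logit = sm.Logit(y, X).fit(disp=0)
print(logit.params[1])                      # log-odds scale, not 0.15
print(logit.get_margeff().margeff)          # roughly 0.15 again
```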