Truncation and Censoring


Laura Magazzini (@univr.it)

Truncation and censoring

Truncation: sample data are drawn from a subset of a larger population of interest.
- Characteristic of the distribution from which the sample data are drawn
- Example: studies of income based on incomes above or below the poverty line (of limited usefulness for inference about the whole population)

Censoring: values of the dependent variable in a certain range are all transformed to (or reported at) a single value.
- Defect in the sample data
- Example: in studies of income, people below the poverty line are reported at the poverty line

Truncation and censoring introduce similar distortions into conventional statistical results.

Truncation

Aim: infer the characteristics of a full population from a sample drawn from a restricted population.
- Example: characteristics of people with income above $100,000

Let \(Y\) be a continuous random variable with pdf \(f(y)\).
The conditional distribution of \(y\) given \(y > a\) (\(a\) a constant) is:

\[ f(y \mid y > a) = \frac{f(y)}{\Pr(y > a)} \]

If \(y\) is normally distributed:

\[ f(y \mid y > a) = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{y-\mu}{\sigma}\right)}{1 - \Phi(\alpha)}, \qquad \alpha = \frac{a-\mu}{\sigma} \]

Moments of truncated distributions

\[ E(Y \mid y < a) < E(Y), \qquad E(Y \mid y > a) > E(Y), \qquad V(Y \mid \text{truncation}) < V(Y) \]

Moments of the truncated normal distribution

Let \(y \sim N(\mu, \sigma^2)\) and \(a\) a constant.

\[ E(y \mid \text{truncation}) = \mu + \sigma\lambda(\alpha), \qquad \mathrm{Var}(y \mid \text{truncation}) = \sigma^2\,[1 - \delta(\alpha)] \]

- \(\alpha = (a - \mu)/\sigma\)
- \(\phi(\alpha)\) is the standard normal density
- \(\lambda(\alpha)\) is called the inverse Mills ratio:
  \(\lambda(\alpha) = \phi(\alpha)/[1 - \Phi(\alpha)]\) if truncation is \(y > a\)
  \(\lambda(\alpha) = -\phi(\alpha)/\Phi(\alpha)\) if truncation is \(y < a\)
- \(\delta(\alpha) = \lambda(\alpha)[\lambda(\alpha) - \alpha]\), where \(0 < \delta(\alpha) < 1\) for any \(\alpha\)

Example: a truncated log-normal income distribution

From the New York Post (1987): "The typical upper affluent American... makes $142,000 per year... The people surveyed had household income of at least $100,000"
- Does this tell us anything about the typical American? "... only 2 percent of Americans make the grade"
- Degree of truncation in the sample: 98%
- The $142,000 is probably quite far from the mean in the full population

Assuming lognormally distributed income in the population (the log of income has a normal distribution), the information can be employed to deduce the population mean income.

Let \(x\) = income and \(y = \ln x\):

\[ E[y \mid y > \log 100] = \mu + \frac{\sigma\,\phi(\alpha)}{1 - \Phi(\alpha)} \]

Solving for \(\mu\) and \(\sigma\) and substituting into \(E[x] = E[e^{y}] = e^{\mu + \sigma^2/2}\), we get \(E[x]\) = $22,087.
- The 1987 Statistical Abstract of the US listed average household income of about $25,000 (a relatively good estimate based on little information!)
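The income example above can be retraced numerically. The following is a minimal sketch, not the slides' own calculation: it assumes income is measured in thousands of dollars, reads "only 2 percent make the grade" as \(\Phi(\alpha) = 0.98\), and treats \(\ln 142\) as the truncated mean \(E[y \mid y > \ln 100]\). The variable names are illustrative only.

```python
from math import exp, log
from statistics import NormalDist

std_normal = NormalDist()      # standard normal: pdf, cdf, inv_cdf

share_above = 0.02             # 2 percent of households exceed the threshold
a = log(100.0)                 # truncation point on the log scale (income in $1000s)
mean_log_truncated = log(142.0)  # ASSUMPTION: ln(142) stands in for E[y | y > a]

# Pr(y > a) = 1 - Phi(alpha) = 0.02 pins down alpha = (a - mu) / sigma
alpha = std_normal.inv_cdf(1.0 - share_above)

# Inverse Mills ratio for truncation from below, and the variance factor delta
lam = std_normal.pdf(alpha) / share_above
delta = lam * (lam - alpha)

# Two equations in two unknowns:
#   a                  = mu + alpha * sigma
#   mean_log_truncated = mu + lam   * sigma
sigma = (mean_log_truncated - a) / (lam - alpha)
mu = a - alpha * sigma

# Mean of a lognormal variable: E[x] = exp(mu + sigma^2 / 2)
mean_income = exp(mu + sigma ** 2 / 2.0)

print(f"alpha = {alpha:.3f}, lambda = {lam:.3f}, delta = {delta:.3f}")
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}")
print(f"implied population mean income (in $1000s) = {mean_income:.1f}")
```

Under these assumptions the implied mean should land close to the $22,087 figure quoted above; small differences come from rounding in the intermediate steps.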
The truncated regression model

\[ y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \mid x_i \sim N(0, \sigma^2) \]

Unit \(i\) is observed only if \(y_i^*\) crosses a threshold:

\[ y_i = \begin{cases} \text{n.a.} & \text{if } y_i^* \le a \\ y_i^* & \text{if } y_i^* > a \end{cases} \]

\[ E[y_i \mid y_i^* > a] = x_i'\beta + \sigma\lambda(\alpha_i), \qquad \alpha_i = (a - x_i'\beta)/\sigma \]

OLS estimation

OLS of \(y\) on \(x\) leads to inconsistent estimates.
- The model is \( y_i \mid y_i^* > a = E(y_i \mid y_i^* > a) + \varepsilon_i = x_i'\beta + \sigma\lambda(\alpha_i) + \varepsilon_i \)
- By construction, the error term is heteroskedastic
- Omitted variable bias (\(\lambda_i\) is not included in the regression)
- In applications, it is usually found that the OLS estimates are biased toward zero. The marginal effect in the subpopulation is:

\[ \frac{\partial E[y_i \mid y_i^* > a]}{\partial x_i} = \beta + \sigma\,\frac{d\lambda(\alpha_i)}{d\alpha_i}\,\frac{\partial \alpha_i}{\partial x_i} = \dots = \beta\,(1 - \delta(\alpha_i)) \]

  Since \(0 < \delta(\alpha_i) < 1\), the marginal effect in the subpopulation is smaller in absolute value than the corresponding coefficient.

Maximum likelihood estimation

Under the normality assumption, maximum likelihood provides a consistent estimator.
- For each observation:

\[ f(y_i \mid y_i^* > a) = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)}{1 - \Phi(\alpha_i)}, \qquad \alpha_i = \frac{a - x_i'\beta}{\sigma} \]
The log-likelihood can be written as

\[ \log L = \sum_{i=1}^{N} \log\left[\sigma^{-1}\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right] - \sum_{i=1}^{N} \log\left[1 - \Phi\!\left(\frac{a - x_i'\beta}{\sigma}\right)\right] \]

Example: simulated data
- If \(y^*\) is fully observed, OLS can be applied
- However, only \(y^* > a\) is included in the sample
- OLS on the observed sample is biased
- MLE (truncreg) yields a consistent estimate of \(\beta\)

Censored data

Censored regression models generally apply when the variable to be explained is partly continuous but has positive probability mass at one or more points.

Assume there is a variable with quantitative meaning \(y^*\) and we are interested in \(E[y^* \mid x]\).

If \(y^*\) and \(x\) were observed for everyone in the population, standard regression methods (ordinary or nonlinear least squares) could be applied.

In the case of censored data, \(y^*\) is not observable for part of the population.
- Conventional regression methods fail to account for the qualitative difference between limit (censored) and nonlimit (continuous) observations
- Top coding / corner solution outcome

Top coding: example

Data generating process

Let \(wealth^*\) denote actual family wealth, measured in thousands of dollars.

Suppose that \(wealth^*\) follows the linear regression model \(E[wealth^* \mid x] = x'\beta\).

Censored data: we observe actual wealth only when \(wealth^* \le 200\).
- When \(wealth^*\) is greater than 200 we know that it is, but we do not know the actual value of wealth

Therefore observed wealth can be written as

\[ wealth = \min(wealth^*, 200) \]

Estimation of \(\beta\)

We assume that \(wealth^*\) given \(x\) has a homoskedastic normal distribution:

\[ wealth^* = x'\beta + \varepsilon, \qquad \varepsilon \mid x \sim N(0, \sigma^2) \]

Recorded wealth is:

\[ wealth = \min(wealth^*, 200) = \min(x'\beta + \varepsilon, 200) \]

\(\beta\) is estimated via maximum likelihood using a mixture of discrete and continuous distributions (details later...).

Example: seats demanded and tickets sold

The censored normal distribution

\[ y^* \sim N(\mu, \sigma^2) \]

Observed data are censored at \(a = 0\):

\[ y = \begin{cases} 0 & \text{if } y^* \le 0 \\ y^* & \text{if } y^* > 0 \end{cases} \]

The distribution is a mixture of a discrete and a continuous distribution.
- If \(y^* \le 0\): \( f(y) = \Pr(y = 0) = \Pr(y^* \le 0) = \Phi(-\mu/\sigma) = 1 - \Phi(\mu/\sigma) \)
- If \(y^* > 0\): \( f(y) = \frac{1}{\sigma}\,\phi\!\left(\frac{y - \mu}{\sigma}\right) \)

\[ E[y] = 0 \times \Pr(y = 0) + E[y \mid y > 0] \times \Pr(y > 0) = (\mu + \sigma\lambda)\,\Phi\!\left(\frac{\mu}{\sigma}\right), \qquad \lambda = \frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)} \]

The censored regression model

Tobit model (Tobin, 1958)

Let \(y^*\) be a continuous variable (latent variable):

\[ y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \mid x_i \sim N(0, \sigma^2) \]

The observed data \(y\) are

\[ y_i = \max(0, y_i^*) = \begin{cases} 0 & \text{if } y_i^* \le 0 \\ y_i^* & \text{if } y_i^* > 0 \end{cases} \]

Why not OLS? Why not OLS on positive \(y^*\)?

MLE estimation

As we assume \(\varepsilon \mid x \sim N(0, \sigma^2)\), the likelihood function can be written down explicitly.

The distribution is a mixture of a discrete and a continuous distribution.
- A positive probability is assigned to the observations \(y_i = 0\):

\[ \begin{aligned} \Pr(y_i = 0 \mid x_i) &= \Pr(y_i^* \le 0 \mid x_i) \\ &= \Pr(x_i'\beta + \varepsilon_i \le 0) \\ &= \Pr(\varepsilon_i \le -x_i'\beta) \\ &= 1 - \Pr(\varepsilon_i < x_i'\beta) \\ &= 1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right) \end{aligned} \]

- For \(y_i^* > 0\): \( f(y_i) = \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) \)

The likelihood can be written as:

\[ \begin{aligned} L(\beta, \sigma^2 \mid y) &= \prod_{y_i = 0}\left[1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right] \prod_{y_i > 0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right) \\ &= \prod_{y_i = 0}\left[1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right] \prod_{y_i > 0} \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left[-\frac{1}{2}\left(\frac{y_i - x_i'\beta}{\sigma}\right)^{2}\right] \end{aligned} \]

In the case of censored data, \(\beta\) estimated from the Tobit model can be employed to study the effect of \(x\) on \(E[y^* \mid x]\).

Example: simulated data
- If \(y^*\) is fully observed, OLS can be applied
- However, if \(y^* \le a\), data are recorded as \(a\)
- OLS on the observed sample is biased
- MLE (tobit) yields a consistent estimate of \(\beta\)

Corner solution outcomes

Still labeled "censored regression models".

Pioneering work by Tobin (1958): household purchase of durable goods.

Let \(y\) be an observable choice or outcome describing some economic agent, such as an individual or a firm, with the following characteristics: \(y\) takes on the value zero with positive probability but is a continuous random variable over strictly positive values.
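The Tobit likelihood and the simulated-data comparison described earlier can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the slides: the data-generating values (intercept 0.5, slope 1, \(\sigma = 1\), censoring at 0, 1000 draws) are made up, and `tobit_loglik`, `ols` and `compass_search` are hypothetical helper names. The optimizer is a deliberately crude coordinate search, used here only to keep the example free of external libraries.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def log_norm_cdf(z):
    """log Phi(z), computed via erfc so the lower tail does not round to zero."""
    return math.log(max(0.5 * math.erfc(-z / SQRT2), 1e-300))

def log_norm_pdf(z):
    """log of the standard normal density at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def tobit_loglik(params, x, y):
    """Tobit log-likelihood, censoring at 0; params = [b0, b1, log(sigma)]."""
    b0, b1, log_sigma = params
    sigma = math.exp(log_sigma)
    total = 0.0
    for xi, yi in zip(x, y):
        xb = b0 + b1 * xi
        if yi <= 0.0:
            # censored observation: log[1 - Phi(xb / sigma)] = log Phi(-xb / sigma)
            total += log_norm_cdf(-xb / sigma)
        else:
            # uncensored observation: log[(1 / sigma) * phi((y - xb) / sigma)]
            total += log_norm_pdf((yi - xb) / sigma) - log_sigma
    return total

def ols(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def compass_search(f, start, step=0.5, tol=1e-4):
    """Crude derivative-free maximizer: move along one coordinate at a time."""
    point, best = list(start), f(start)
    while step > tol:
        improved = False
        for i in range(len(point)):
            for move in (step, -step):
                cand = point[:]
                cand[i] += move
                value = f(cand)
                if value > best:
                    point, best, improved = cand, value, True
        if not improved:
            step /= 2.0
    return point

# Simulated data (all values below are illustrative assumptions)
random.seed(0)
n = 1000
true_b0, true_b1, true_sigma = 0.5, 1.0, 1.0
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y_star = [true_b0 + true_b1 * xi + random.gauss(0.0, true_sigma) for xi in x]
y = [max(0.0, v) for v in y_star]          # censoring at zero

ols_b0, ols_b1 = ols(x, y)                  # OLS on the censored outcome
start = [ols_b0, ols_b1, 0.0]
tobit_b0, tobit_b1, tobit_log_sigma = compass_search(
    lambda p: tobit_loglik(p, x, y), start)

print(f"OLS slope on censored y: {ols_b1:.3f}")
print(f"Tobit slope:             {tobit_b1:.3f}")
print(f"Tobit sigma:             {math.exp(tobit_log_sigma):.3f}")
```

Parameterizing the scale as \(\log \sigma\) keeps \(\sigma\) positive without constraints. In practice a packaged estimator or a proper numerical optimizer would replace the coordinate search.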