Thesis

Robust methods for personal income distribution models

VICTORIA-FESER, Maria-Pia


Reference

VICTORIA-FESER, Maria-Pia. Robust methods for personal income distribution models. Thèse de doctorat : Univ. Genève, 1993, no. SES 384

URN : urn:nbn:ch:unige-64509 DOI : 10.13097/archive-ouverte/unige:6450

Available at: http://archive-ouverte.unige.ch/unige:6450


Robust Methods for Personal Income Distribution Models

Maria-Pia Victoria Feser

Submitted for the degree of Ph.D in Econometrics and

Faculty of Economic and Social Sciences, University of Geneva, Switzerland

Accepted on the recommendation of Dr. A.C. Atkinson, professor, London, Dr. P. Balestra, professor, Geneva, Dr. U. Kohli, professor, Geneva, Dr. E. Ronchetti, professor, Geneva, supervisor, Dr. P. Rousseeuw, professor, Brussels.

Thesis No. 384

May 1993

To Johannes, with love.

Abstract

In the present thesis, robust statistical techniques are applied and developed for the economic problem of the analysis of personal income distributions and inequality measures. We follow the approach based on influence functions in order to develop robust estimators for the parametric models describing personal income distributions when the data are censored and when they are grouped. We also build a robust procedure for a test of choice between two models and analyse the robustness properties of goodness-of-fit tests. The link between economic and robustness properties is studied through the analysis of inequality measures. We begin our discussion by presenting the economic framework from which the statistical developments are made, namely the study of the personal income distribution and inequality measures. We then discuss the robust concepts that serve as a basis for the following steps and compute optimal bounded-influence estimators for different personal income distribution models when the data are continuous and complete. In a third step, we study the case of censored data and propose a generalization of the EM algorithm with robust estimators. For grouped data, Hampel's theorem is extended in order to build optimal bounded-influence estimators. We then focus on tests for model choice and develop a robust generalized Cox-type statistic. We also analyse the robustness properties of a wide class of goodness-of-fit statistics by computing their level influence functions. Finally, we study the robustness properties of inequality measures and relate our findings to some economic properties these measures should fulfil.

Our motivation for the development of these new robust procedures comes from our interest in the field of income distribution and inequality measurement. However, it should be stressed that the new estimators and test procedures we propose do not apply only in this particular field: they can be used in or extended to any parametric problem in which density estimation, incomplete information, grouped or discrete data, model choice, goodness-of-fit or concentration indices are key issues.

Résumé

Dans cette thèse, nous développons et appliquons certaines techniques de la statistique robuste au problème économique de l'analyse de la distribution du revenu personnel et des mesures d'inégalité. Nous utilisons l'approche basée sur les fonctions d'influence afin de développer des estimateurs robustes pour les modèles paramétriques décrivant la distribution du revenu personnel lorsque les données sont censurées et lorsqu'elles sont groupées. Nous construisons aussi des procédures robustes pour tester le choix entre deux modèles et analysons les propriétés de robustesse de tests d'adéquation. Le lien entre certaines propriétés économiques et de robustesse est étudié au moyen des mesures d'inégalité. Nous commençons notre discussion par une présentation du cadre économique dans lequel nous nous situons, à savoir l'étude de la distribution du revenu personnel et des mesures d'inégalité associées. Nous exposons ensuite les concepts de la statistique robuste qui nous sont utiles par la suite et calculons des estimateurs optimaux à influence bornée pour différents modèles de distribution de revenu personnel lorsque les données sont continues et complètes, simulées ou réelles. Dans un troisième temps, nous étudions le cas des données censurées et proposons une généralisation de l'algorithme EM avec des estimateurs robustes. Le théorème de Hampel est ensuite étendu au cas des données groupées et des estimateurs robustes à influence bornée de façon optimale sont proposés. Plus tard, nous nous concentrons sur les procédures de choix de modèle et développons une statistique de test robuste de type Cox. Nous analysons aussi les propriétés de robustesse d'une large classe de statistiques de test d'adéquation en calculant les fonctions d'influence sur le niveau correspondantes.

Finalement, nous étudions les propriétés de robustesse de mesures d'inégalité en fonction des propriétés économiques que ces dernières doivent satisfaire. Le développement de nouveaux estimateurs et de nouvelles procédures de test a été motivé par notre intérêt pour le problème de l'étude des distributions de revenu personnel et des mesures d'inégalité. Cependant, il est utile de mettre en évidence le fait que les nouveaux estimateurs et les nouvelles procédures de test que nous proposons ne sont pas seulement applicables dans ce domaine particulier. En effet, ils peuvent être appliqués ou étendus à des problèmes paramétriques dans lesquels des termes comme estimation de densité, information incomplète, données groupées ou discrètes, choix de modèle, tests d'adéquation, indices de concentration sont des mots-clés.

Acknowledgements

I would like to express my gratitude to Prof. E. Ronchetti for his valuable suggestions and his generous guidance throughout the course of this research. His encouragement, as an expert and as a friend, has made this work possible. I am also grateful to Prof. A. C. Atkinson and Dr. F. Cowell for their support during my research at the London School of Economics, and to Prof. P. Balestra, Prof. U. Kohli and Prof. P. Rousseeuw for their comments during the defense. My thanks also go to my friends and colleagues of the Faculty of Economic and Social Sciences of the University of Geneva for their stimulating discussions and their moral support, especially to S. Héritier for his helpful comments during the preparation of the defense. Finally, I would like to express my grateful thanks to my parents, for their love, encouragement and support during most of my student life.

Contents

1 Introduction

2 Income Distribution and Inequality
   2.1 Introduction
   2.2 The generation and distribution of income
   2.3 The Lorenz curve and analysis of inequality
      2.3.1 Definition and construction
      2.3.2 An ordering tool
   2.4 Income inequality measures
   2.5 Parametric models for income distributions
      2.5.1 Generating systems
      2.5.2 Properties of income distribution models
      2.5.3 Most commonly used models
   2.6 Statistical aspects of the analysis of income

3 OBRE with Complete Information
   3.1 Introduction
   3.2 Robustness concepts
      3.2.1 Definitions
      3.2.2 Continuity and qualitative robustness
      3.2.3 The influence function
      3.2.4 The influence function and robustness measures
      3.2.5 The breakdown point
   3.3 Optimal robust estimators
      3.3.1 Optimality results
      3.3.2 Computational aspects
      3.3.3 How to choose the bound c
   3.4 Application to two income distribution models
      3.4.1 Simulation results
      3.4.2 Application to real data
   3.5 Properties of robust estimators
      3.5.1 Efficiency
      3.5.2 Sensitivity
      3.5.3 Breakdown point

4 OBRE with Incomplete Information
   4.1 The EM algorithm
      4.1.1 Introduction
      4.1.2 Definition of the EM algorithm
      4.1.3 Discussion and example
      4.1.4 Generalization
   4.2 OBRE with incomplete information
      4.2.1 The EMM algorithm
      4.2.2 The EMM algorithm and robust estimators
      4.2.3 Comparison with the classical approach
   4.3 Empirical results
      4.3.1 Robust estimates
      4.3.2 Comparison with the MLE
      4.3.3 Conclusion

5 Robust Estimators for Grouped Data
   5.1 The problem
   5.2 Classical estimators and their IF
      5.2.1 Minimum power divergence estimators
      5.2.2 Influence function
   5.3 A general class of estimators
   5.4 Influence function of MGP-estimators
   5.5 Robust estimators with grouped data
      5.5.1 Optimality problem
      5.5.2 Computation
      5.5.3 Efficiency
   5.6 Simulation results and conclusion
      5.6.1 Simulation results
      5.6.2 Local shift sensitivity
      5.6.3 Conclusion

6 Robust Tests for Model Choice
   6.1 Introduction
   6.2 Classical tests
      6.2.1 The Cox statistic
      6.2.2 The Atkinson statistic
      6.2.3 Other approaches
   6.3 Small sample properties
      6.3.1 Introduction
      6.3.2 Simulation study
      6.3.3 A Cox-type statistic with a parametric bootstrap
   6.4 Robustness properties
      6.4.1 Robustness and tests
      6.4.2 Level influence function
      6.4.3 Simulation results
   6.5 Robust model choice tests
      6.5.1 Some ad hoc robust tests
      6.5.2 Robust bounded-influence LM test
      6.5.3 Robust Cox-type statistic
      6.5.4 Simulation study
   6.6 Conclusion

7 Robustness and Goodness-of-Fit Tests
   7.1 Introduction
   7.2 Robustness and goodness-of-fit techniques
   7.3 LIF of the Cressie and Read statistic
   7.4 Simulation study

8 Robustness and Inequality Measures
   8.1 Introduction
   8.2 Decomposability and mean-preserving cont.
      8.2.1 General properties
      8.2.2 Robustness properties
      8.2.3 Kolm's index
      8.2.4 Generalized entropy index
      8.2.5 The Gini index
   8.3 Arbitrary contaminations
      8.3.1 Simulation study
   8.4 Parametric estimation approach
      8.4.1 Influence function of generalized entropy indexes
      8.4.2 Simulation results
   8.5 Conclusion

9 Conclusion

A Functional Forms for PID
   A.1 Terminology and notations
   A.2 Pareto type I
   A.3 Pareto type II
   A.4 Pareto type III
   A.5 Gamma distribution
   A.6 Benini distribution
   A.7 Vinci distribution
   A.8 Generalized Gamma distribution
   A.9 Lognormal type I
   A.10 Davis distribution
   A.11 Weibull distribution
   A.12 Fisk logistic distribution
   A.13 Generalized Beta distribution I
   A.14 Generalized Beta distribution II
   A.15 Singh-Maddala distribution
   A.16 Lognormal type II
   A.17 Dagum model type I
   A.18 Dagum model type II
   A.19 Dagum model type III
   A.20 Log-Gompertz distribution
   A.21 Majumder-Chakravarty distribution

B Functional Forms for the Lorenz Curve
   B.1 Model of Kakwani and Podder
   B.2 Model of Rasche et al.
   B.3 Model of Gupta
   B.4 Model of Villasenor and Arnold
   B.5 Model of Basmann et al.

C Income Inequality Measures
   C.1 Coefficient of variation
   C.2 Relative mean deviation
   C.3 Relative median deviation
   C.4 Variance of the logarithm of income
   C.5 Bonferroni inequality measure
   C.6 Hirschman's index
   C.7 Theil indexes
   C.8 Eltetö and Frigyes's inequality measures
   C.9 Kakwani inequality measure
   C.10 Basmann-Slottje inequality measure (WGM)
   C.11 Dalton's inequality measure
   C.12 Atkinson's inequality measure
   C.13 Kolm's inequality measure
   C.14 Generalized entropy family

D Equations System for Robust Tests

List of Figures

2.1 A typical representation of the Lorenz curve
2.2 The Gamma density as a model for PID
3.1 Value of the mean IF around the true parameter θ
3.2 MLE and OBRE of the Gamma model on PSID data
3.3 MLE and OBRE of the Dagum I model on PSID data
3.4 Gamma (OBRE) and Dagum (MLE) fit on PSID data
3.5 OBRE of the Gamma and Dagum I model on FES data
3.6 Efficiency of the OBRE for the Gamma model
3.7 Efficiency of the OBRE for the Pareto model
3.8 Sensitivity of the MLE and the OBRE to outliers for the Pareto model
3.9 Sensitivity of MLE and OBRE to different proportions of contamination
3.10 Bias of Theil index estimates when the data are contaminated
4.1 Weights given by the OBRE with 10% of information loss
4.2 Weights given by the OBRE with 30% of information loss
7.1 Behaviour of goodness-of-fit statistics with model contamination

List of Tables

3.1 Some examples of occurrence and frequency of gross errors
3.2 MLE and OBRE for the Gamma model 1 (non-contaminated)
3.3 MLE and OBRE for the Gamma model 2 (1% of 'bad' contamination)
3.4 MLE and OBRE for the Gamma model 3 (3% of contamination)
3.5 MLE and OBRE for the Gamma model 4 (5% of contamination)
3.6 MLE and OBRE for the Pareto model 1 (non-contaminated)
3.7 MLE and OBRE for the Pareto model 2 (2% of contamination)
3.8 MLE and OBRE for the Pareto model 3 (5% of contamination)
3.9 MLE and OBRE for the Gamma and Dagum models on PSID data
3.10 MLE and OBRE for the Gamma and Dagum models on FES data
4.1 OBRE on non-contaminated data, with the EMM algorithm and the CD estimation
4.2 OBRE on contaminated data at 1%, with the EMM algorithm and the CD estimation
4.3 OBRE on contaminated data at 3%, with the EMM algorithm and the CD estimation
4.4 OBRE and MLE on non-contaminated data, with the EMM algorithm
4.5 OBRE and MLE on contaminated data at 1%, with the EMM algorithm
4.6 OBRE and MLE on non-contaminated data, with the EMM algorithm, when we ignore truncation
5.1 MLE and OBRE (c = 5.0) for the Pareto model with grouped data
6.1 Finite sample level of Cox and Atkinson statistics (Gamma against Lognormal)
6.2 Finite sample level of Cox and Atkinson statistics (Exponential against Pareto)
6.3 Finite sample level of Cox and Atkinson statistics (Pareto against Exponential)
6.4 Finite sample level of LKR statistic (Gamma against Lognormal)
6.5 Finite sample level of LKR statistic (Exponential against Pareto)
6.6 Finite sample level of LKR statistic (Pareto against Exponential)
6.7 Power (in %) of the Cox statistic (Exponential against Pareto)
6.8 Power (in %) of the LKR statistic (Exponential against Pareto)
6.9 Actual levels (in %) of Cox and Atkinson statistics under model contamination (ε = 1%) (Gamma against Lognormal)
6.10 Actual levels (in %) of Cox and Atkinson statistics under model contamination (ε = 2%) (Gamma against Lognormal)
6.11 Actual levels (in %) of the robust Atkinson statistic (c = 2.0) with contamination (Pareto against Exponential)
6.12 Actual levels (in %) of the Atkinson statistic with contamination (Pareto against Exponential)
8.1 Empirical Theil index when a random proportion of data are multiplied by 10
8.2 Empirical Theil index when a random proportion of data are multiplied by 4
8.3 MLE and Theil index with and without data contamination
8.4 OBRE and Theil index with and without data contamination

Chapter 1

Introduction

A great number of philosophers, scientists, politicians, economists, writers, humanists and religious people through the ages have spent a great deal of energy trying to understand the reasons for human inequality. It is hard to believe that an answer will one day be found. Our work, however, was motivated by this kind of question: why are there such great differences in people's wealth? The present dissertation is not a philosophical essay, but a modest scientific contribution to the study of one of the several aspects of human wealth, the distribution of income among people. Moreover, its aim is not even to try to give some elements of an answer to the question, but to provide the economist with new statistical tools, developed especially for the purpose. The distribution of income among people is also called the personal income distribution (PID). In economics, its study has several scopes. One of them is to understand how the total income in a given society is distributed among the people, or the households, or economic units, that is, to determine which economic and social factors influence the distribution of income. Another aim is to provide a measure which represents a judgement of the degree of inequality in the distribution of personal income, not only by itself but also when compared with the same measure computed on the basis of data from different populations. The space for the statistician is then wide. There are (a) stochastic models to build (for explaining how the PID is generated), (b) econometric models to define and estimate (for determining the factors influencing the generation and distribution of income), (c) statistical distributions to define and estimate (for describing the PID) and (d) inequality indexes to build and estimate (for measuring income inequality). In the present work, we

concentrate on the last two aspects.

The regularities displayed by observed PID over time and space provide sufficient justification to describe them with the help of some statistical distribution functions. This provides not only a useful summary of the phenomenon, but also a technique to study the effects of alternative redistributive policies. In particular, the estimated distribution can serve as a basis for the computation of inequality measures. The phenomenon of income inequality has been a source of world-wide social upheaval. It has become a weapon in the hands of social reformers and a point of intellectual debate among academics. It is therefore necessary to invest energy in the development of appropriate statistical tools. The two main aspects of this debate, ethical evaluation and statistical measurement, are not always clearly distinguishable. In our work, however, the statistical tools we propose do not apply only to the study of the PID or inequality measures but to a wider range of similar problems. On the other hand, our work was directed by the specificity of the economic problem; that is, the developments we made were motivated by their usefulness in the study of the PID and income inequality measures. Moreover, in the case of inequality measures we take a closer look at the relations between the economic and statistical properties. This is important because drawing inferences about economic inequality plays an important part in political debates about economic and social trends, and in a variety of applied studies in the field of welfare economics. However, the statistical basis on which the inferences are drawn is not always spelt out, and so the relationship between the numbers observed in a particular sample and the supposed underlying concept of inequality within the target population may be different from that suggested by superficial appearances.
The statistical innovation we propose in this field is the use of robust methods. Robustness is a statistical concept which in a sense measures a "qualitative" aspect of any estimator, more precisely its stability under non-standard conditions. It also conveys the idea that theoretical models, be they simple or very complex, are only able to reflect the behaviour of the majority of the data. That is, a robust statistical tool is built in such a way that the influence of data that may not belong to the stated theoretical model is limited. It is well known that economic data in particular are far from clean; this usually means that some observations may be present which in a sense have nothing to do with the majority of the data. These rogue data can be a result of the collection procedures. A simple example is the "decimal point error": the coder inadvertently puts the decimal point in the wrong place and thus multiplies an observation by a factor of 10. More subtle is the week-month confusion, where data are supposedly collected on weekly income but some respondents actually report income per month. If those observations, which we also call contaminations or outliers, have negligible impact upon the analysis, then obviously there is nothing to worry about. Unfortunately, in most cases they are extreme and can therefore drive the value of the estimators by themselves. It is arguable that an outlier of this sort should be treated as exceptional and dropped from the sample. Such extreme values may of course be picking up true information; but very often in empirical work a case can be made for dropping an "obviously" inappropriate or suspect observation that may be the result of recording error or other contamination. This type of ad hoc procedure is unsatisfactory, but if it is not done then the result of the analysis may be seriously biased.
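The "decimal point error" just described is easy to quantify. The following sketch (illustrative only, with made-up income figures; not code from the thesis) shows how a single observation multiplied by 10 drags the non-robust sample mean while barely moving the median, whose influence is bounded:

```python
import statistics

# Hypothetical weekly incomes (made-up figures for illustration)
incomes = [310, 325, 340, 355, 360, 372, 380, 395, 410, 425,
           440, 455, 470, 485, 500, 515, 530, 545, 560, 575]

# One "decimal point error": the coder shifts the decimal point,
# multiplying a single observation by a factor of 10
contaminated = incomes.copy()
contaminated[7] = incomes[7] * 10  # 395 -> 3950

print(statistics.mean(incomes), statistics.mean(contaminated))
# -> 437.35 615.1  (the mean jumps by about 40%)
print(statistics.median(incomes), statistics.median(contaminated))
# -> 432.5 447.5   (the median moves by about 3%)
```

A bounded-influence estimator behaves like the median here: one gross error cannot move it arbitrarily far.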
The robust methods we propose automatically take into account the presence of extreme observations during the estimation procedure. Indeed, these robust estimation procedures are built in such a way that they provide, at the same time and in an optimal way, robust estimates and weights corresponding to each observation according to its 'distance' from the bulk of the data. There is therefore no need for a preliminary subjective data screening.

The developments we make are organized in the following way. In chapter 2 we present the framework of PID and inequality measures. We first discuss very briefly the different theories explaining the generation and distribution of income. We also present the Lorenz curve, a statistical tool for representing and comparing inequality in the PID, and review the best-known income inequality measures. The different parametric models proposed in the literature to describe the PID are then analysed. A discussion of the statistical aspects involved in the analysis of the PID concludes this chapter.

In chapter 3, we compute robust estimators for PID models when the data are continuous and complete. We begin by presenting the robustness concepts needed to develop the theory later, in particular the influence function (IF). The IF is the main robustness tool we use in our developments. It gives the influence of an infinitesimal amount of contamination introduced in the data on the value of any statistic (e.g. estimator, test statistic, inequality measure). Since the case of continuous and complete data is simple, we use optimal robust bounded-influence estimators already developed in the literature. However, our contribution is the application of the general theory to the case of PID models, in particular with real data.

In chapter 4 we widen the framework to censored data. To compute robust estimators in this case we propose a generalization of the EM algorithm, namely the EMM algorithm. The former allows one to compute maximum likelihood estimators (MLE) when the data are censored; the EMM algorithm allows one to compute robust estimators in the same situation. After a presentation of the EM algorithm, we discuss its generalization. We then compare the EMM algorithm, when the data are truncated, with the classical approach which considers the conditional distribution.

Since data on the PID are numerous, they are often presented in grouped form. In chapter 5 we build robust estimators for this case. We first present a large class of classical estimators and compute their IF. Although we find that the IF is bounded, that is, the influence of infinitesimal amounts of contamination on the value of the estimators is limited, we show that it can nevertheless be large. Therefore, after defining a more general class of estimators, we find estimators which are less influenced by contamination. We conclude chapter 5 with a simulation study in which we compare the classical and robust estimators for grouped data.

At this stage, we will have provided the necessary tools to compute robust estimators for a given parametric PID model. However, this work would be incomplete without the development of a robust procedure for choosing one PID model. This is the subject of chapter 6. We concentrate on tests between non-nested hypotheses. We begin by presenting the best-known test statistics, particularly the Cox-type statistics. We then highlight one of their disadvantages, namely the inaccuracy of approximating their distribution by their asymptotic distribution, even when the sample sizes are relatively large. We also study their robustness properties, that is, the influence on the asymptotic level of the test due to infinitesimal amounts of contamination.
Finally, we propose a robust test statistic based on a robust parametric test, which avoids the problems raised previously, and show its performance when computed from contaminated samples.

In order to be as complete as possible, we also study goodness-of-fit tests in chapter 7. There, we only show that the classical goodness-of-fit tests can be badly influenced by infinitesimal amounts of contamination among the data. In particular, we compute the influence of such contaminations on the asymptotic level of the test and present a numerical example.

Finally, in chapter 8, we study in more detail the case of inequality measures. Indeed, these measures can be thought of as estimators of the true underlying inequality, which depends on the distribution of income. Therefore, we can compute their IF. However, the aim is not the same as with PID models, in that this time we want to relate the behaviour of the IF to the economic properties the different inequality measures fulfil.
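For reference, the IF invoked repeatedly above is Hampel's influence function: for a statistical functional $T$ at a distribution $F$,

```latex
\mathrm{IF}(x; T, F) \;=\; \lim_{\varepsilon \downarrow 0}
  \frac{T\bigl((1-\varepsilon)F + \varepsilon\,\Delta_x\bigr) - T(F)}{\varepsilon},
```

where $\Delta_x$ denotes the point mass at $x$. A bounded-influence estimator keeps $\sup_x \lVert \mathrm{IF}(x; T, F) \rVert$ below a prescribed bound while losing as little efficiency as possible at the model.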

We find that in some special cases some inequality measures have a bounded IF, but unfortunately we must conclude that in the more realistic cases the IF of a very large class of inequality measures is unbounded. We conclude this chapter by proposing robust inequality measures based on robust estimates of parametric PID models and present a simulation study. Finally, chapter 9 concludes the present work.

Chapter 2

Income Distribution and Inequality Measures

2.1 Introduction

The topic of the income distribution can be stated in the general economic framework as did Harold Lydall: 'The essential problem of economics is how to increase economic welfare. In a broad sense, this problem can be divided into two parts: how to increase total output from given resources; and how to distribute the resulting goods and services in such a way as to give the community the most benefit from them. These two aspects are sometimes described as the problem of 'production' and the problem of 'distribution', respectively. The two parts are not, of course, independent; and many of the most difficult questions arise out of the interdependence of production and distribution. Nevertheless, it is possible to identify some influences which bear primarily on the side of production and others which primarily affect distribution. No progress could be made in the discussion unless we abstracted, at least temporarily, from some of the considerations which might eventually be shown to be relevant to one or other side' (Lydall 1968).

Research on income distribution has followed two main directions. The first deals with factor price formation and the corresponding factor shares, i.e. the distribution of income among the factors of production. This approach was initiated by Ricardo (1819), and further developed by several schools of economic thought. The second approach deals with the distribution of a mass of income among the members of a set of economic units (family, household, individual), considering either the total income of each economic unit or its disaggregation by source of income, such as wages

and salaries, property income, self-employment income, transfers, etc. The related topic of the latter approach is commonly called the size distribution of income or personal income distribution (PID). This chapter is concerned only with this topic.

According to Slottje (1989), the theory of the PID can be divided into three major categories. Models explaining the generation of income distributions are one of the important aspects of the theory. Why are incomes in a given society at a given time different? What are the determinants influencing the particular aspects of the income distribution? These are the main questions that researchers interested in the generation of income have tried to answer. In section 2.2 we briefly present the different theories. Another important aspect regards the measurement of inequality given the income distribution. Since it can be argued that the principal indicator of social welfare is given by the income level and the PID, it is important to develop tools to compare different societies on a social welfare basis. This is the role of the Lorenz curve and inequality measures, developed respectively in sections 2.3 and 2.4. Finally, an alternative way of studying the PID is to describe it by means of statistical tools. As will be argued below, this approach has many advantages. Moreover, it can serve as a basis for the study of the effect of policies on the PID. This is why a number of authors have concentrated their research on modelling the PID by means of parametric models. This approach is developed in section 2.5 and serves as the basis of the research presented in the following chapters.
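As a concrete preview of the tools developed in sections 2.3 and 2.4, the Lorenz curve ordinates and the Gini index can be computed from a sample in a few lines. The sketch below is purely illustrative (the function names are ours, not the thesis's) and uses the standard empirical definitions:

```python
def lorenz_ordinates(incomes):
    """Points (cumulative population share, cumulative income share),
    incomes sorted in increasing order."""
    xs = sorted(incomes)
    total = sum(xs)
    n = len(xs)
    cum = 0.0
    points = [(0.0, 0.0)]
    for i, x in enumerate(xs, start=1):
        cum += x
        points.append((i / n, cum / total))
    return points

def gini(incomes):
    """Gini index: twice the area between the Lorenz curve and the
    45-degree line of perfect equality (trapezoidal rule)."""
    pts = lorenz_ordinates(incomes)
    area_under = sum((p1 - p0) * (l0 + l1) / 2
                     for (p0, l0), (p1, l1) in zip(pts, pts[1:]))
    return 1 - 2 * area_under

print(gini([1, 1, 1, 1]))  # perfect equality -> 0.0
print(gini([0, 0, 0, 4]))  # one unit holds everything -> 0.75
```

With perfect equality the Lorenz curve coincides with the diagonal and the index is 0; when a single unit holds all income the empirical index reaches its finite-sample maximum (n-1)/n.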

2.2 The generation and distribution of income

The various theories that have been proposed to explain the distribution of income among individuals have emerged from two main schools of thought. The first may be called the stochastic theory of distribution and is represented by such authors as Gibrat (1931), Champernowne (1937, 1953), Aitchison and Brown (1954), Rutherford (1955), Mandelbrot (1960, 1961) and Steindl (1965). These authors explain the generation of income with the help of stochastic processes; that is, the actual form of the distribution is the stationary state of a stochastic process. For example, Gibrat (1931) formulated his theory based on the law of proportionate effect, and proposed a model which generates a positively skewed distribution. Gibrat's model is a first-order Markov chain model. The variables are expressed in their logarithms, with the log of income dependent on the log of income lagged a period and random events. The theory shows that, as time goes by, the distribution of income approaches the distribution of the random disturbance, which tends toward normality. Hence, he proposed the lognormal distribution as a suitable income distribution. In 1937, Champernowne based his stochastic theory on Markov chains which generated a Pareto distribution. Later, in 1953, he showed that under suitable conditions his stochastic process tends toward a unique equilibrium distribution dependent upon the transition matrix but not on the initial distribution. Typically, his models specify transitional probabilities, that is, the probabilities that units belonging to a certain class will move up or down to another class in the following period of time. The income distribution at a certain time is then linked to the income distribution one period before by a transition equation through transition probabilities. Champernowne made some quite strong assumptions which considerably simplified his models.
He supposed firstly that no income unit moves up by more than one income class in a period of time, and secondly that the transitional probabilities are constant with respect to time and independent of the income level. He then proposed different models able to generate the Pareto distribution, a two-tailed distribution obeying the Pareto law (see section 2.5) and other distributions, by relaxing his initial hypotheses. Finally, similar variations of these stochastic models were compiled by the authors mentioned above. We should also say that one main criticism of the simple stochastic models is that the process requires an incredibly long period of time to attain an equilibrium or a stationary state distribution (Shorrocks 1973).

The second school of thought, which may be called the socioeconomic school, seeks the explanation of PID by means of economic and institutional factors, such as sex, age, occupation, education, geographical differences, and the distribution of wealth. Three groups of authors belong to this school. The first follows the human capital approach, based on the hypothesis of lifetime income maximization. The authors concentrate their attention on the supply side of labour, which is the result of the maximization of the personal utility function. Typically, they specify a utility function in which variables such as the general price level, the salary, education, family size, etc., are entered. This approach was initiated by Mincer (1958) and Becker (1962, 1967) and subsequently developed by Chiswick (1968, 1971, 1974). The second group of authors, who mainly link the PID to education levels, is referred to as the education planning school by Tinbergen (1975). It is represented by such authors as Bowles (1969), Dougherty (1971, 1972) and Psacharopoulos and Hinchliffe (1972). They concentrate their attention on the demand side, deriving various types of labour from production functions.
Hence, in this case a production function is specified, and it includes not only the classical production factors such as land, labour and capital, but also technical development, different types of labour often being measured by the degree of schooling. Finally, the third group of authors is called the supply and demand school. The major contribution to this approach is represented by Tinbergen (1975), who considers PID a result of the supply and demand of different kinds of labour.

2.3 The Lorenz curve and analysis of income inequality

The Lorenz curve, widely used to represent and analyse the size distribution of income and wealth, is defined as the relationship between the cumulative proportion of income units and the cumulative proportion of income received when units are arranged in ascending order of their income. Lorenz (1905) proposed this curve in order to compare and analyse, in a non-parametric way, inequalities of wealth in a country during different epochs, or in different countries during the same epoch. The curve has been used principally as a convenient graphical device to represent size distributions of income and wealth. It is a very useful tool not only for the nonparametric description of the observed PID but also for the comparison in terms of social welfare between income distribution states. In the first subsection we present the technical aspects of the Lorenz curve and in the second subsection we explain the link between the Lorenz curve and the concept of social welfare.

2.3.1 Definition and construction

The graph of the curve is represented in a unit square (see figure 2.1). The straight line joining the points (0,0) and (1,1) is called the egalitarian line, because along this line 10% of income receivers get 10% of income, 20% get 20% of income, and so on. Typically, actual distributions lie below the egalitarian line: the greater the convexity, the greater the degree of inequality.

More formally, the Lorenz curve can be defined by means of the probability distribution function associated to the income variable. Let F be this function, and define p = F(x) = ∫_0^x dF(t), where p can be interpreted as the proportion of units having an income less than or equal to x. Pietra (1915) (see also Gastwirth (1971)) presented a definition of the Lorenz curve


Figure 2.1: A typical representation of the Lorenz curve

in terms of the inverse of the cumulative distribution function given by F^{-1}(t) = inf{x : F(x) ≥ t}. The Lorenz curve is then written as

L(p) = (1/µ) ∫_0^p F^{-1}(t) dt,   0 ≤ p ≤ 1,   (2.1)

where the mean µ is given by µ = ∫ x dF(x) (see also Kakwani 1980 for another definition).

The Lorenz curve can also be interpreted as a tool to measure the concentration of the income variable. It represents a point comparison measure, as opposed to the synthetic comparison measure given by inequality measures (see Zenga 1989). The Lorenz curve satisfies the following conditions (Kakwani 1980):

1. if p = 0 then L(p) = 0

2. if p = 1 then L(p) = 1

3. L'(p) = x/µ ≥ 0 and L''(p) = 1/(µ f(x)) > 0

4. L(p) ≤ p

where f(x) is the density function associated to F(x).

It is possible to use the Lorenz curve as a parametric tool. Two approaches have been considered. The more obvious is simply to choose a parametric distribution for the income variable, and derive analytically the corresponding Lorenz curve (see Dagum 1980a for the derivation of the Lorenz curve corresponding to some well known parametric models for PID). The second approach was formulated by Kakwani and Podder (1973, 1976). They specified the functional form of the Lorenz curve directly (for the general case) as

L(p) = p^α e^{−β(1−p)}   (2.2)

where 0 ≤ p ≤ 1, a curve which satisfies the properties described above. Other authors proposed new parametric families as models for the Lorenz curve. The list includes models due to Rasche et al. (1980), Gupta (1984), Villaseñor and Arnold (1989), and more recently Basmann et al. (1990). We present these functions in appendix B.

As we see in the next section, the Lorenz curve is a very useful tool for the analysis of inequality. It can be used (a) to measure income inequality (see Gini 1910), (b) to perform a partial ordering of social welfare states (see Atkinson AB 1970), (c) to study the effect of income taxes (see Latham 1988), (d) to derive goodness-of-fit tests for functions, as well as upper and lower bounds for the Gini ratio (see Gastwirth 1972 and Gail and Gastwirth 1978b).

Finally, it should be stressed that recently another measure of point concentration has been proposed in the literature: the Z(p) concentration curve (see Zenga 1984). It is based on the comparison of the inverse cumulative distribution function F^{-1}(p) with the inverse first incomplete moment function F_1^{-1}(p), where F_1(x) = (1/µ) ∫_0^x t dF(t). The Z(p) concentration curve is given by

Z(p) = 1 − F^{-1}(p) / F_1^{-1}(p)   (2.3)

According to Zenga (1989), its advantage is that it does not have a 'forced behaviour', that is, it does not fulfil property 3 of the Lorenz curve. Indeed, this property implies that the difference function p − L(p) assumes its maximum value for p = F(µ), and there is no reason why the relative inequality must be greater for the middle classes than for the richest or the poorest classes. (For examples of application see Dancelli 1989 and Salvaterra 1989.)
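To illustrate definition (2.1), the empirical Lorenz curve can be computed directly from a sample by sorting the incomes and accumulating shares. The following sketch (in Python with numpy; the function name is ours, not from the thesis) returns the points (p, L(p)):

```python
import numpy as np

def lorenz_curve(incomes):
    """Empirical Lorenz curve of definition (2.1): cumulative income
    share L(p) against cumulative population share p, with incomes
    arranged in ascending order."""
    x = np.sort(np.asarray(incomes, dtype=float))
    p = np.arange(1, len(x) + 1) / len(x)   # proportion of income receivers
    L = np.cumsum(x) / x.sum()              # proportion of income received
    # prepend the origin so the curve starts at (0, 0)
    return np.concatenate(([0.0], p)), np.concatenate(([0.0], L))

p, L = lorenz_curve([1, 2, 3, 4, 10])
```

For a perfectly equal sample the curve coincides with the egalitarian line L(p) = p; any inequality pushes the curve below that line, in agreement with property 4 above.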

2.3.2 An ordering tool

As we have seen, the Lorenz curve displays the deviation of each individual income from perfect equality, and hence it captures the essence of inequality. The nearer the Lorenz curve is to the egalitarian line, the more equal the distribution of income will be. Consequently, the Lorenz curve can be used as a criterion for ranking PID.

The income ranking uses the concept of "Lorenz-domination". An income profile is said to Lorenz dominate another income profile in the weak sense if the Lorenz curve of the former lies nowhere below the Lorenz curve of the latter. We have strict Lorenz domination if we add the restriction that at some places the Lorenz curve is above. However, this ordering is a quasi-ordering, since if the Lorenz curves intersect, neither of the income profiles is said to be preferred.

There is also a link between the Lorenz curve ranking and social welfare. Atkinson AB (1970) proved a theorem which implies that if the Lorenz curve corresponding to one PID is above the Lorenz curve of another PID (and both have the same mean), then the social welfare function (or social evaluation function, see Chakravarty 1990 for a definition) is greater for the first population, regardless of the form of the utility function except that it is increasing and concave. Later, Dasgupta, Sen and Starrett (1973) and Rothschild and Stiglitz (1973) generalized Atkinson's theorem by weakening the hypotheses. (For a relation between social welfare functions and PID, see also Dagum 1990.)

In practice, international comparisons usually involve different population sizes and different means, as do intertemporal comparisons for the same country. Therefore, Shorrocks (1983) generalized the Lorenz curve by scaling it up by the mean income. He also proved that an unambiguous ranking of income profiles (provided some suitable conditions on the social welfare function) can be obtained if and only if the generalized Lorenz curves do not intersect.
This is likely to be true in many important practical situations, since differences between Lorenz curves tend to be relatively small compared with variations in mean incomes. However, the welfare judgement captured by generalized Lorenz domination may come into conflict with the social desire for more equally distributed income. For example, if we increase the income of the richest individual, keeping all other incomes fixed, the total income increases, as does inequality. It is nevertheless possible to have a PID ranking which avoids this kind of conflict. As Shorrocks (1983) showed, if we assume that an improvement of each income by a constant improves the social welfare function, then the comparison of the generalized

Lorenz curves discounted for each individual by the mean income, gives a quasi-ordering of the income profiles.

2.4 Income inequality measures

While the Lorenz criterion provides only a quasi-ordering of income profiles, an alternative statistic that completely orders the set of all income profiles is an inequality index. An inequality index is a scalar representation of interpersonal income differences within a given population, and hence can be considered as a measure of inequality (if inequality is defined in terms of income).

One of the first inequality indexes was developed by Corrado Gini. It is still one of the most widely used to analyse the size distribution of income and wealth. Gini (1912) specified the Gini mean difference, which is by definition

∆ = Σ_{j=1}^n Σ_{i=1}^n |x_j − x_i| / (n(n − 1))   (2.4)

where 0 ≤ ∆ ≤ 2µ. The last expression can also be written

∆ = ∫∫ |x − y| dF(x) dF(y)   (2.5)

where X and Y are independently and identically distributed variables. When all the incomes are equal, then ∆ = 0, and when one income is equal to the total income, then ∆ = 2µ. Since ∆ is a monotonic increasing function of the degree of income inequality, Gini defined

I_G = ∆/(2µ)   (2.6)

with 0 ≤ I_G ≤ 1, as an income inequality measure, known as the Gini ratio or Gini index. Gini also proved that his index is equal to twice the area between the egalitarian line and the Lorenz curve L(p), and hence it can be given by

I_G = 1 − (2/µ) ∫_0^1 ∫_{−∞}^{F^{-1}(p)} u dF(u) dp
    = 1 − (2/µ) ∫_{−∞}^{∞} ∫_{−∞}^{x} u dF(u) dF(x)   (2.7)

A number of intuitive interpretations of the Gini index are possible. For example, Pyatt (1976) interpreted the Gini index as equal to the average gain expected by an individual from the option of being someone else in the population. Sen (1973) suggested another interesting interpretation. In any pairwise comparison, the individual with the lower income may suffer from depression upon discovering that his income is lower. If it is assumed that this depression is proportional to the difference in incomes, the average of all such depressions in all possible pairwise comparisons leads to the Gini index.

The Gini index can be expressed in several forms. For example, Dorfman (1979) showed that

I_G = 1 − (1/µ) ∫_0^∞ (1 − F(x))^2 dx   (2.8)

Stuart (1954) and Lerman and Yitzhaki (1984) showed that

I_G = 2 Cov(x, F(x)) / µ   (2.9)

and Gastwirth (1972) pointed out that

I_G = (1/µ) ∫ F(x)(1 − F(x)) dx   (2.10)
    = (2/µ) ∫ x (F(x) − 1/2) dF(x)   (2.11)

The above result is obtained by making use of Fubini's theorem (Apostol 1975):

∫_0^∞ [ ∫_x^∞ dF(u) ] x dF(x) = ∫_0^∞ [ ∫_0^x u dF(u) ] dF(x)   (2.12)

Moreover, many authors have suggested generalizations of the Gini index. For a discussion, see Chakravarty (1990).

Following Gini's work, a considerable number of income inequality indexes have been proposed. Measures of inequality can be divided into normative and positive categories. The normative measures are concerned with measuring inequality in terms of social welfare, so that a higher degree of inequality corresponds to a lower level of social welfare. The main contributors to this approach are Dalton (1920), Aigner and Heinz (1967), Atkinson AB (1970), Sen (1973), Kolm (1976), Blackorby and Donaldson (1978, 1980), Ebert (1987), Bossert and Pfingsten (1989) and Chakravarty (1983, 1990). On the other hand, the positive measures are statistical devices that measure either the relative dispersion or the relative entropy of the observed distribution, without reference to the normative notion of social welfare. Among the most used indexes are the Gini index, the coefficient of variation, the relative mean deviation, the relative median deviation, the Bonferroni (1930) index, the variance of the logarithm of income (Gibrat 1931), the Hirschman (1945) index, the Theil (1967) indexes, the Éltető and Frigyes (1968) inequality measures and the Kakwani (1980) inequality measure. In appendix B we present the mathematical forms of these indexes as well as some normative indexes. It should be noted, as Sen (1973) argued, that the distinction between normative and positive measures is not a firm one, and that every positive measure embodies some form of the social welfare function. For example, the welfare implications of the Gini index have been debated by a large number of authors (for a brief review, see Kakwani 1980).
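As a numerical check of the equivalence of these expressions, the sketch below (Python with numpy; the function names are ours) computes the Gini index both from the mean difference (2.4) and from the covariance form (2.9), estimating F by the empirical distribution function; on a sample the two estimates agree up to a term of order 1/n:

```python
import numpy as np

def gini_mean_difference(x):
    """Gini index I_G = Delta / (2 mu), with the mean difference
    Delta of (2.4) normalised by n(n - 1)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    delta = np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))
    return delta / (2.0 * x.mean())

def gini_covariance(x):
    """Gini index as 2 Cov(x, F(x)) / mu, formula (2.9), with F
    replaced by the empirical distribution function i/n."""
    x = np.sort(np.asarray(x, dtype=float))
    F = np.arange(1, len(x) + 1) / len(x)
    cov = np.mean(x * F) - x.mean() * F.mean()
    return 2.0 * cov / x.mean()
```

With the n(n − 1) normalisation, a sample in which one unit receives the total income attains the upper bound I_G = 1 exactly, and equal incomes give I_G = 0.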
The main problem that arises in a concrete income inequality analysis is that the different inequality measures may lead to conflicting results. This is especially the case with normative measures, where the form of the welfare function plays a crucial role. (For a detailed simulation study to judge the relative merits of various inequality measures, see Champernowne 1974.) These conflicts have led to the axiomatic approach to analyzing inequality. Among the pioneering works are Atkinson AB (1970), Cowell (1977), Foster (1985) and Sen (1973). This group considers that welfare should serve as a basis for inequality analysis: an inequality measure (or more specifically an income inequality measure) should fulfil certain criteria, and a measure should then be found that is consistent with these criteria. We list briefly some of these properties (for a discussion of these properties, see the reference list given in Chakravarty 1990, p. 32). Denoting the income inequality measure by I, we have:

1. Pigou-Dalton Principle of transfers (Pigou 1912 and Dalton 1920): the transfer of a pound from a richer person to a poorer one decreases inequality.

2. Principle of proportional addition to incomes (scale independence): proportional additions or subtractions to all incomes should leave I unaffected.

3. Principle of equal addition to incomes: equal additions to all incomes should diminish I and equal subtractions should increase it.

4. Principle of proportional addition to persons: I should be invariant to proportional additions to the population of income receivers.

5. Principle of symmetry: I should be invariant with respect to any permutation of incomes among the income receivers.

6. Principle of normalization: the range of I should be in the interval [0,1], with zero (one) for perfect equality (inequality).

The problem then becomes, of course, one of finding a class of inequality measures that satisfies these criteria. Of all the forms proposed, only one class comes close. This class is known as the generalized entropy family. The relationship between the axioms and entropy can be traced to Shorrocks (1980, 1983), Cowell (1980) and Maasoumi (1986). The concept of using entropy to measure inequality, however, goes back to Theil (1967). This family can be defined very generally by

I_E^β = (1/(β(β + 1))) ∫ [ (x/µ)^{β+1} − 1 ] dF(x)   (2.13)

where F denotes either the empirical distribution function or a parametric model. (Theil's indexes are the limiting cases corresponding to β = −1 and β = 0.) The entropy class of measures not only satisfies most welfare-based axioms, but the class also has members with powerful decomposition properties (see Theil 1989).

In the same line of thought, Slottje, Basmann and Nieswiadomy (1989) developed the weighted geometric mean (WGM) measure first proposed by Basmann and Slottje (1987). This inequality measure, based on the underlying Lorenz curve, actually represents a class of inequality measures because it depends on a parameter vector of weights given to income classes. According to the choice of these weights, several well-known measures of inequality can be approximated. Their argument is the following: when using one measure of income inequality it is important to know the relative importance or weights that use of the measure assigns to the various quantiles or shares.
Indeed, inequality measures are frequently appealed to in connection with policy questions concerning distributive justice, and when one of them is used as a policy target, its implicit weight structure is more likely to be causally related to the resulting levels of social and political unrest than the mere magnitude of the inequality measure itself. However, for most inequality measures, the weights given to the different income quantiles do not appear explicitly in their formulation. This can be solved by the WGM measures.

To conclude, it should be stressed that inequality indexes have also been used by different authors interested in the decomposition of the overall degree of inequality. In this context, two particular applications stand out. The first concerns a partition of the population into disjoint subgroups (age, sex, race, religion, etc.), where the researcher is interested in examining how the overall degree of inequality can be appropriately resolved into contributions due to inequality within each group and inequality between groups (see e.g. Silber 1989). The other main application disaggregates the total income of each individual into amounts earned from different sources and examines the impact of each of these sources on the overall degree of inequality. (For a review and a reference list, see Chakravarty 1990, section 2.6.)
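The generalized entropy family (2.13) evaluated at the empirical distribution can be sketched as follows (Python with numpy; the function name is ours), with the two Theil indexes handled as the limiting cases β = 0 and β = −1:

```python
import numpy as np

def generalized_entropy(x, beta):
    """Generalized entropy index I_E^beta of (2.13) at the empirical
    distribution, with s = x / mean(x) the incomes relative to the mean."""
    s = np.asarray(x, dtype=float)
    s = s / s.mean()
    if beta == 0:        # limit beta -> 0: Theil index, mean of s log s
        return np.mean(s * np.log(s))
    if beta == -1:       # limit beta -> -1: mean logarithmic deviation
        return -np.mean(np.log(s))
    return np.mean(s ** (beta + 1) - 1.0) / (beta * (beta + 1.0))
```

Equal incomes give I_E^β = 0 for every β, and the index is continuous in β, so values of β near zero approximate the Theil index.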

2.5 Parametric models for income distributions

The idea of hypothesizing and then estimating the size distribution of income was initiated by Vilfredo Pareto in 1895. This research was motivated by his controversy with the French and Italian socialists who were pressing for institutional reforms to reduce the inequality in the PID. Pareto (1896) specified his model as a quantitative tool to support the viewpoint that income inequality can only be reduced by sustaining economic growth and not by redistributing actual income levels. In this way Pareto analyzed the characteristic regularity in the upper tail of the observed income distribution. He showed that the income elasticity of the survival distribution function was constant, i.e. there was a decreasing linear relation between the income-power (i.e. the logarithm of the income variable) and the logarithm of the backward cumulative distribution function. Formally, let n be the number of economic units (households) and n(x) the number of households having an income greater than x; then the relation observed by Pareto is:

log n(x) = b − a log(x)   (2.14)

with a > 0.
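The linear log-log relation (2.14) is easy to verify by simulation. In the sketch below (Python with numpy; the details are ours, not Pareto's), a sample is drawn from a Pareto type I distribution and the logarithm of the empirical survival proportion n(x)/n is regressed on log x over the upper tail; the fitted slope approximately recovers −a:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 2.0                                  # Pareto exponent assumed for the simulation
x = rng.pareto(a, 50_000) + 1.0          # Pareto type I sample with scale 1

# empirical survival proportion n(x)/n at the sorted sample points
xs = np.sort(x)
surv = 1.0 - np.arange(1, len(xs) + 1) / len(xs)

# fit log n(x)/n = b - a log x over the upper half of the sample,
# dropping the largest point, where the survival proportion is zero
tail = slice(len(xs) // 2, -1)
slope, intercept = np.polyfit(np.log(xs[tail]), np.log(surv[tail]), 1)
```

Using n(x)/n instead of n(x) only shifts the intercept b, not the slope, so the estimate of a is unaffected.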

2.5.1 Generating systems

According to Dagum (1980b), all PID models except the Champernowne (1953) model can be deduced from the following three generating systems:

1. Pearson's system (Elderton and Johnson 1969)

2. D’Addario’s system (D’Addario 1949)

3. Dagum's system (Dagum 1980b, 1980c)

Pearson (1894) specified a differential equation from which an important family of functions is derived. All members of the Pearson family are determined by means of the first four moments of the distribution. Although the Pearson system was not designed with the specific purpose of generating PID models, several of them belong to this system, such as the Pareto types I-II, the Pearson type I or Beta distributions of the first and second kind (Thurow 1970, McDonald and Ransom 1979, McDonald 1984), the Gamma distribution or Pearson type III (March 1898, Amoroso 1925, Salem and Mount 1974, Bartels 1977), the generalized four-parameter Gamma distribution (Amoroso 1925), the Pearson type V or Vinci (1921) distribution, the Pearson type IV (Bartels 1977, Kloek and Van Dijk 1977, 1978), the Pearson type VI (McDonald 1984) and the Student distribution or Pearson type VII (Kloek and Van Dijk 1977, 1978).

Following the idea of transformation functions applied by Edgeworth (1898) and Fréchet (1939), D'Addario specified his system by means of a probability generating function and a transformation function (see Dagum 1980a, 1980b). Among the well known income distributions derived from this system, we have the Pareto types I-II, the Lognormal type I (Gibrat 1931, Aitchison and Brown 1957), the Lognormal type II (Dagum 1980c), the Amoroso (1925) distribution, the Gamma distribution, the Vinci (1921) distribution, the Davis (1941) distribution and the Weibull (1951) distribution.

For the third system, Dagum (1980c) observed that the income elasticity of the cumulative distribution function was a decreasing and, in general, concave function of the distribution function. This pattern led Dagum to the specification of a new generating system of income and wealth distributions.
Among the well known PID derived from this system, we have the three Pareto types, the Benini (1906) distribution, the Weibull (1951) distribution, the logistic distribution (Fisk 1961), the Singh-Maddala (1976) distribution, the log-Gompertz distribution (Dagum, 1980c) and the three Dagum type distributions (Dagum 1977, 1980c).

2.5.2 Properties of income distribution models

The models developed for PID are not based on a causal explanation: they are simply univariate models that purport to achieve an accurate statistical description of the phenomenon. The choice of a particular mathematical form for the model can be made dependent on the set of properties it is supposed to fulfil. While Edgeworth (1898) was the first to propose some desired properties for income distribution models, further developments can be found in Fréchet (1939), Aitchison and Brown (1957), Mandelbrot (1960) and Dagum (1977, 1980b). The following properties are widely accepted in the literature (see Dagum, 1980b):

• Model foundation

• Convergence to the Pareto distribution

• Principle of parsimony

• Relation to income inequality measures

• Good fit over the whole income range

By model foundation is meant the extent to which the mathematical form of a PID model is derived from realistic elementary assumptions. Accordingly, the PID models can be grouped into the following three main classes: (a) stochastic, (b) logico-empirical, (c) ad hoc. A PID model has a stochastic foundation when its mathematical form is the outcome of an a priori set of probability assumptions (for example the Singh-Maddala distribution). The model has a logico-empirical foundation if it is the theoretical counterpart of observed regularities, that is, if the functional form is the solution of a specified differential equation which captures the characteristics of regularity observed in an empirical PID (for example those specified by Pareto). Finally, an ad hoc model is a model with the sole purpose of fitting observed PID, providing neither a plausible probability theory basis nor a logico-empirical foundation (for example the Gamma distribution). This property is very controversial, since it may be important for some authors to have a model which has a theoretical basis, while for others the fit to the data is the only important property.

Convergence to the Pareto law is a property that has emerged from empirical evidence (see Davis 1941, Mandelbrot 1960, Budd 1970). This property is called the weak form of the Pareto law and means that for large values of income the income distribution can be approximated by Pareto's formula.

The principle of parsimony is a well-known statistical property. A model with few parameters is to be preferred to a complicated one. It always costs time and money to include new parameters in the model, even if they might improve the goodness-of-fit over the whole income range. Moreover, simple models can be easily interpreted. The models commonly used contain two or three parameters.

The property 'relation to income inequality measures' is useful for analytical and statistical purposes.
The explicit mathematical expression of the Gini ratio or other inequality measures becomes an important tool of analysis. These can be obtained directly from the fitted model and used in the specification of macroeconomic models of income determination and income policy. The simpler the relation between the distribution and the inequality measure, the more the model is to be preferred.

For a long time, the lack of a model which could describe the whole range of incomes was a stumbling block in the study of income distribution. It has been frequently suggested that several different models may be required to explain the empirical distribution. However, models have been proposed recently which provide a 'reasonable' description of observed income distributions. The term 'reasonable' has to be taken cautiously, as in the literature the quality of fit is measured by means of goodness-of-fit tests. These tests, according for example to Ransom and Cramer (1983), are in most cases not appropriate because of their dependence on data set size. Moreover, the large choice between the different statistics can be very confusing

(see chapter 7). There is also the problem of robustness of both the estimators calculated with classical methods and the classical tests (see chapter 3). Moreover, it is well known that the more parameters the model has, the better the fit of the estimated distribution is. But, on the other hand, a model with many parameters is not only difficult to estimate; the variability of the estimators is also very large. The property of 'good fit over the whole income range' is then an a priori subjective property, even if some models are always preferred to others.

We believe that although these properties are very important, the choice of a PID model should be made in accordance with the data. Therefore, it is very important to choose first an appropriate model choice statistic, then a good technique of estimation and appropriate goodness-of-fit tests, in order to be able to check whether the model under consideration does reflect the behaviour of the data at hand. The main purpose of our research is precisely to develop and analyse such statistical tools.

2.5.3 Most commonly used models

In this subsection, we discuss the properties of the models cited in the previous section. The functional forms are presented in appendix A. Historically these models can be classified as follows:

1. Three Pareto types (Pareto 1896)

2. Pearson type III or Gamma distribution (March 1898, Amoroso 1925, Salem and Mount 1974, Bartels 1977)

3. Benini (1906) distribution

4. Pearson type V or Vinci (1921) distribution

5. The generalized four-parameter Gamma distribution (Amoroso 1925)

6. Lognormal type I (Gibrat 1931, Aitchison and Brown 1957)

7. Davis (1941) distribution

8. Weibull (1951) distribution

9. Logistic distribution (Fisk 1961)

10. Pearson type I or Beta of first and second kind (Thurow 1970, McDonald and Ransom 1979, McDonald 1984)

11. Singh-Maddala (1976) distribution

12. Lognormal type II (Dagum 1980b)

13. Three Dagum types (Dagum 1977, 1980b)

14. Log-Gompertz distribution (Dagum 1980b)

15. Majumder and Chakravarty (1990) distribution

Among these models, the most frequently used in applied work are the Pareto, the Lognormal, the Gamma, and recently the Dagum and Singh-Maddala models. The generalised Beta distribution of the second kind is also used in practice as a PID model (Slottje 1989), partly because it includes as special cases the Gamma and the Singh-Maddala distributions. In Figure 2.2 we give a graphical representation of the Gamma model as a typical unimodal and asymmetric density used as a PID model.


Figure 2.2: The Gamma density as a model for PID

All these functions constitute a set of possible statistical models for describing the PID. However, the problem of the choice of the functional form still remains. As said before, one can choose the model because of its theoretical foundation, or because it has worked well in previous analyses. In our opinion, the problem is not so simple, and it has to be taken very seriously. We have first to make distinctions between models having a different number of parameters. It has been noted by Metcalf (1972) that while a two-parameter model may be too simple to reflect the impact of economic fluctuations, a three- or four-parameter model may be appropriate. This remark also applies to any parametric model. The question is then: is the cost of an additional parameter worthwhile, especially knowing that the variability of the estimators can be very large?

In the literature, we find that among models with one parameter, only the Pareto law provides in general an excellent fit to the upper tail of the income distribution, but the fit over the whole range of incomes is poor (see e.g. Mandelbrot 1960). Among the two-parameter models, empirical evidence indicates that the Gamma distribution seems to fit data better than the Lognormal, especially in the tails, but both overestimate skewness (see Salem and Mount 1974). Among the three-parameter models, Dagum (1977) showed in an application that his model (type I) performs better than the Singh-Maddala one. Finally, Majumder and Chakravarty (1990) showed in an empirical example that their four-parameter model performs better than the Dagum (type II-III) models and the generalized Beta distribution of the second kind.

The development of new functional forms for PID will continue for a long time, because no model is perfectly adapted to every set of data. We believe that this will always be the case.
Hence, another approach in the search for a better model is to improve the estimation techniques and diagnostic checking. An economist could say that this is a statistician's work and that this technical research can be done independently of the subject. We answer, however, that because of the specificity of the problems associated with the study of PID, the development of statistical tools can only be done with knowledge of the subject. For example, the particular forms of the data imply different estimation techniques. In the next section, we discuss all the statistical aspects related to PID that are relevant for the improvement of any analysis.

2.6 Statistical aspects of the analysis of income distributions

As mentioned above, our work is mainly concerned with the characterization of the PID by means of parametric models. Given this choice, the questions a statistician may ask are:

• Which model to choose?

• How to estimate the parameters?

• Do the estimates or the derived measures reflect the reality (or at least a part of it)?

• How to use the estimates to answer the questions an economist may ask?

Moreover, since we are also interested in income inequalities, another question may be asked:

• Is income inequality correctly estimated?

To describe the size distribution of income, we have to choose one among all the functional forms proposed in section 2.5. From a statistical point of view, this means choosing the model which best suits the data. Moreover, since we do not believe that the proposed model necessarily reflects the behaviour of all the data, one has to consider possible model misspecifications. That is, we want to build a test statistic that takes into account the specificity of the data set, knowing that some of the observations may not belong to the model. This is done in chapter 6. But before developing such a tool, we consider the second point, i.e. the choice of the estimation technique. This problem is discussed at length in chapters 3, 4 and 5. However, we can give here some of the main reasons that have motivated our choice of estimation technique. Firstly, the classical techniques (such as maximum likelihood, least squares, minimum chi-squared, ...) depend on the hypothesis that the parametric model holds exactly, or in other words, that to a given data set there corresponds exactly a given parametric model. In our opinion, this is not a realistic hypothesis. Indeed, in our empirical study we will show how deviations between the theoretical model and the observed one can ruin the classical estimators. Secondly, the methodology which deals with model deviations, namely robust statistics, can be applied to any parametric model. Hence, we can compute robust estimators not only for the PID parameters, but also for income inequality measures or Lorenz curves. Unfortunately, with personal income data, one cannot simply apply the robust estimation techniques available in the literature. Indeed, in this field the personal income data typically have the following forms:

• continuous data with complete information

• continuous and truncated data

• grouped data with complete information

• grouped and truncated data

The first case is an ideal but rare one, and robust estimators can be computed without new theoretical developments. For the second case, we propose two methods which deal with incomplete information. One of them is the EM algorithm (see chapter 4), of which we actually propose an extension to robust estimation. The case of grouped data is treated in chapter 5. Finally, the case of grouped and truncated data is not treated in this thesis, but given the previous results, its development should not require too much work and is left for further research. Once the model is chosen and estimated, should one test it against the data, i.e. should one test whether the estimated parametric model fits the data? Moreover, should one take into account that some data may not belong to the model? These two fundamental questions, related to the goodness-of-fit of the estimated model, are discussed in chapter 7. As we will see, a great number of test statistics have been developed, but the study of their robustness properties has often been left aside. Once these three steps have been achieved, the estimates are ready to be used. For example, they can be used to compute inequality measures. For the sake of completeness, we also study the relation between the economic and the statistical properties an inequality measure should fulfil, before concluding that if one wants robust inequality measures, then robust estimates of the parametric PID should be used. This is done in chapter 8.

Chapter 3

Optimal B-Robust Estimators with Complete Information

3.1 Introduction

In this chapter we present the basic notions of robust statistics that we will need in order to develop the estimators and the test statistics for the analysis of PID models. We first give a definition of robustness and highlight some of its advantages. We then present the elementary robustness concepts, especially the influence function (IF), the statistical tool on which most of our results are based. We also present robust bounded-influence estimators, discuss how they are built and give indications on how to compute them. We then apply the results to two PID models and run some simulations in order to study the behaviour of the robust estimators compared to classical ones. We also apply the results to real data. Finally, we study in more detail some properties of these robust estimators. Robust methods probably date back to the prehistory of statistics, but it is only recently that statisticians and mathematicians have put this idea into a statistical and mathematical framework: see Huber (1964, 1981), Hampel (1968, 1974) and Hampel et al. (1986). Robust statistics belongs to parametric statistics. It is an extension of classical parametric statistics, taking into account that parametric models are only approximations to reality. It supplements all classical fields of statistics which use parametric models by adding measures of stability. It studies the behaviour of statistical procedures not only under strict parametric models, but also in both smaller and larger neighbourhoods of such parametric models. Hampel et al. (1986) give the following definition:

Robust statistics, as a collection of related theories, is the statis- tics of approximate parametric models.

The main aims of robust statistics are (see Hampel et al. 1986):

• To describe the structure best fitting the bulk of the data,

• To identify deviating data points (outliers) or deviating substructures for further treatment, if desired, and

• To identify and give warning about highly influential data points.

When a parametric model is assumed and the parameters of the model are estimated and tested, robust statistics explicitly takes into account that the model may be distorted, that is, the model may deviate from reality. One source of deviation is outliers: a minority of data that do not belong to the model at all. Another source is influential points: observations that belong to the model but are distant from the bulk of the data. The residuals from a robust fit automatically show these outliers or influential points as well as the proper random variability of the 'good' data. The outliers which have been found should be studied separately. According to Hampel et al. (1986), we may distinguish four main types of deviations from strict parametric models:

• The occurrence of gross errors.

• Rounding and grouping.

• The model may have been conceived as an approximation anyway, for example, by virtue of the central limit theorem.

• Apart from the distributional assumptions, the assumption of independence (or of some other specific correlation structure) may only be approximately fulfilled.

Some reasons for gross errors (namely values which deviate from the pattern set by the majority of the data) are copying errors, inadvertent observations of a member of a different population, equipment failure, etc. Their effects on the data are usually much bigger than those of permanently acting causes of random variability. A single huge unnoticed gross error can spoil a statistical analysis completely, several percent of gross errors are rather common, and modern robust techniques can deal with outliers better than classical methods for objective or subjective rejection of outliers. With sufficient care, gross errors can often be prevented, and indeed there appear to exist, though probably rarely, high-quality data sets of thousands of data without gross errors. However, the necessary care cannot always be afforded. Moreover, even with fully automatic data recording and properly working equipment, there may be large transient effects (cf. the examples in table 3.1). When data are rounded, grouped or classified (as often in PID models), deviations appear when a small percentage of data are shifted from one class to another. However, these local granularity effects appear to be the most harmless type of deviation from idealized models, and in fact they can often be neglected, but they should not be forgotten. Large sets of measurement data of very high quality, which to all knowledge contain no gross errors, still tend to show small but noticeable deviations from the normal model, especially in the tails.

  Area                   Data                     Frequency        Source
  Economics              US census data 1950      0.01%            Coale & Stephan (1962)
  Geophysics             seismograph readings     5-7%             Freedman (1966)
  Industry, Science,     various published        ca. 1-10%        Daniel & Wood (1980),
  Agriculture etc.       data sets                (0-20%)          Daniel (1976)
  Agriculture            wheat data               6%               Cochran (1947)
  Historical data        'Bills of Mortality'     7-10%            Zabell (1976)
                         Babylonian Venus         20-40%           Huber (1974)
                         observations

Table 3.1: Some examples of occurrence and frequency of gross errors

General robustness theory could be applied in principle without any problem. Gross errors may appear in many fields, with frequencies ranging from 0.01% in the smallest case up to 20% or more (see table 3.1 and Hampel et al. 1986). Nevertheless, even if the fraction of outliers is small, it can produce very large deviations in the estimates. One can argue that there are other methods which can detect outliers.

For example, especially with one-dimensional data (such as PID data), one can in some cases detect outlying points by looking at the histogram; these points can then be removed or treated separately. However, firstly, this method (i.e. maximum likelihood estimation after removing some observations) is less efficient than the robust procedure we propose (see below) and often it is also not robust (see Welsh and Ronchetti 1992). Secondly, detecting outliers by looking directly at the data is possible only if they are considerably distant from the bulk of the data; if they are "not so far", it is impossible to decide whether to keep or to remove them. Thirdly, classical estimators applied after rejection of outliers lead to wrong confidence intervals for the parameters, whereas (asymptotically) correct confidence intervals can be computed automatically with robust estimators. Finally, outliers are just one source of model deviations. Robust techniques can also deal with other types of deviations, such as some types of model misspecification.

3.2 Robustness concepts

3.2.1 Definitions

Suppose we have observations x1,...,xn belonging to some sample space X and a parametric model Fθ on the sample space, where the unknown parameter θ belongs to some parameter space Θ ⊆ ℝ^p. Assume that Fθ has a density f(·; θ). In classical statistics, one then assumes that the observations are distributed according to Fθ, and undertakes to estimate θ based on the data at hand. In robustness theory, the model Fθ is considered as a mathematical abstraction which is only an ideal approximation of reality, and statistical procedures are developed which still behave fairly well under deviations from the assumed model. We identify the sample x1,...,xn with its empirical distribution F^(n) given by

$$F^{(n)}(x) = \frac{1}{n} \sum_{i=1}^{n} \Delta_{x_i}(x) \qquad (3.1)$$

where ∆x is the point mass 1 at x. As estimators of θ we consider statistics Tn = Tn(x1,...,xn), which can be represented (at least asymptotically) as a functional Tn(x1,...,xn) = T(F^(n)). As a neighborhood of the parametric model Fθ, we consider the set

$$\{G_\varepsilon \mid G_\varepsilon = (1-\varepsilon)F_\theta + \varepsilon W\} \qquad (3.2)$$

where W is an arbitrary distribution function. The distribution Gε can be considered as a mixture of Fθ and W, lying in a neighborhood of radius ε around Fθ. When we put W = ∆z, i.e. the distribution which puts point mass 1 at any z, we get

$$\{F_\varepsilon \mid F_\varepsilon = (1-\varepsilon)F_\theta + \varepsilon\Delta_z\} \qquad (3.3)$$

Fε is often referred to as an ε-contamination distribution. Its interpretation is easy: Fε generates data from Fθ with probability (1 − ε) and gross errors (z) with probability ε. In the following subsections, we consider three different ways of assessing the robustness of the estimator T(F^(n)), namely the continuity property, the influence function (IF) and the breakdown point. These notions are often described as qualitative robustness, infinitesimal robustness and quantitative robustness.
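The ε-contamination model (3.3) is easy to simulate. The following sketch (our illustration, not code from the thesis) draws data from a contaminated normal model and shows how the sample mean is dragged towards the gross errors while the sample median is barely affected:

```python
import random
import statistics

def contaminated_sample(n, eps, z, mu=0.0, sigma=1.0, seed=42):
    """Draw n points from the eps-contamination model
    F_eps = (1 - eps) * N(mu, sigma^2) + eps * (point mass at z)."""
    rng = random.Random(seed)
    return [z if rng.random() < eps else rng.gauss(mu, sigma)
            for _ in range(n)]

clean = contaminated_sample(1000, eps=0.0, z=0.0)
dirty = contaminated_sample(1000, eps=0.05, z=100.0)  # 5% gross errors at z = 100

# The mean is pulled towards z; the median barely moves.
print(statistics.mean(clean), statistics.median(clean))
print(statistics.mean(dirty), statistics.median(dirty))
```

With 5% contamination at z = 100, the mean shifts by roughly εz = 5 while the median stays near 0, which is exactly the kind of instability the robust estimators below are designed to avoid.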

3.2.2 Continuity and qualitative robustness

Hampel (1971) introduced the concept of qualitative robustness of a sequence of estimators {Tn}; roughly speaking, it requires that a slight change in the model Fθ should result in only a small change in the distribution of Tn, uniformly in n. The main idea is to complement the notion of differentiability with a continuity condition. When the estimators are defined by a functional T, one can give the following definition of the continuity of the estimator (see Hampel et al. 1986):

For any sequence of distributions G^(n) (in the neighborhood of the empirical distribution F^(n)) such that a suitable distance between G^(n) and Fθ tends to 0 as n increases, T is said to be continuous at Fθ if |T(G^(n)) − T(Fθ)| → 0 as n → ∞.

Qualitative robustness is then defined as (see also Hampel et al. 1986):

The estimator T is qualitatively robust at the model Fθ if for any sequence of distributions G^(n) (in the neighborhood of the empirical distribution F^(n)) such that a suitable distance between G^(n) and F^(n) tends to 0 as n increases, the same distance between the distributions of T(F^(n)) and T(G^(n)) tends to 0 as n increases.

3.2.3 The influence function

The IF was introduced by Hampel (1968, 1974) and further developed by Hampel et al. (1986). It describes how the estimator responds to a small amount of contamination at any point. The IF at z can be thought of as an approximation to the relative change in an estimate caused by the addition of a small proportion of spurious observations at z. It is defined at the model Fθ by

$$\mathrm{IF}(z; T, F_\theta) = \lim_{\varepsilon \to 0} \frac{T((1-\varepsilon)F_\theta + \varepsilon\Delta_z) - T(F_\theta)}{\varepsilon} \qquad (3.4)$$

Hence, it describes the effect of a small contamination at the point z (ε∆z) on the estimate, standardized by the mass of the contamination. It can be viewed as a linearization of the asymptotic bias of the estimator caused by the contamination in the observations. The influence function appears in the first-order term of a von Mises expansion of T at H, a distribution in the neighborhood of Fθ. If we choose H = F^(n), the expansion of T is given by

$$T(F^{(n)}) = T(F_\theta) + \frac{1}{n}\sum_{i=1}^{n} \mathrm{IF}(x_i; T, F_\theta) + \text{remainder} \qquad (3.5)$$

If the remainder becomes negligible for n → ∞, this expression shows that the estimator is asymptotically normal, with asymptotic variance given by

$$V(T, F_\theta) = \int \mathrm{IF}(x; T, F_\theta)\, \mathrm{IF}(x; T, F_\theta)^T \, dF_\theta(x) \qquad (3.6)$$

The above definition of the IF is asymptotic in nature, but there exist some simple finite-sample versions. The simplest is the so-called sensitivity curve (Tukey 1970-1971). Suppose we have an estimator {Tn; n ≥ 1} and a sample of n − 1 observations. Then the sensitivity curve SCn is given by

$$SC_n(z \mid x_1,\ldots,x_{n-1}; T_n) = n\,[T_n(x_1,\ldots,x_{n-1}, z) - T_n(x_1,\ldots,x_{n-1})] \qquad (3.7)$$

In many situations, SCn will converge to the IF as n → ∞. The sensitivity curve is a useful diagnostic tool in the sense that it is computed for a given sample. However, it cannot be used to compare the sensitivities of several estimators, because these sensitivities depend on the given sample.
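As an illustration of (3.7) (our own sketch, not part of the thesis), the sensitivity curve of the sample mean grows without bound in z, while that of the sample median stays bounded:

```python
import statistics

def sensitivity_curve(estimator, sample, z):
    """SC_n(z) = n * [T_n(x_1,...,x_{n-1}, z) - T_n(x_1,...,x_{n-1})], eq. (3.7)."""
    n = len(sample) + 1
    return n * (estimator(sample + [z]) - estimator(sample))

base = [0.1 * i for i in range(-10, 11)]  # a fixed symmetric sample around 0

for z in (1.0, 10.0, 1000.0):
    sc_mean = sensitivity_curve(statistics.mean, base, z)
    sc_median = sensitivity_curve(statistics.median, base, z)
    print(z, sc_mean, sc_median)
```

For this sample the mean's sensitivity curve equals z itself (unbounded), whereas the median's stays at 1.1 no matter how extreme z is, mirroring the bounded IF of the median.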

3.2.4 The influence function and robustness measures

From the robustness point of view, there are at least three important summary values of the IF. The most important is the supremum of its absolute value (in the one-dimensional case, i.e. when p = 1). One defines the gross-error sensitivity (in the one-dimensional case) of T at Fθ by

$$\gamma^*(T, F_\theta) = \sup_x |\mathrm{IF}(x; T, F_\theta)| \qquad (3.8)$$

The gross-error sensitivity measures the worst influence which a small amount of contamination of fixed size can have on the value of the estimator. Therefore, it may be regarded as an upper bound on the asymptotic bias of the estimator. In the multidimensional case, there are several ways to generalize the gross-error sensitivity. One defines the unstandardized gross-error sensitivity of T at Fθ by

$$\gamma_u^*(T, F_\theta) = \sup_x \|\mathrm{IF}(x; T, F_\theta)\| \qquad (3.9)$$

where ‖·‖ denotes the Euclidean norm. This is the most direct way to define the gross-error sensitivity. Another way is to measure the IF in the metric given by the asymptotic covariance matrix of the estimator, which defines the self-standardized sensitivity

$$\gamma_s^*(T, F_\theta) = \sup_x \mathrm{IF}(x; T, F_\theta)^T\, V(T, F_\theta)^{-1}\, \mathrm{IF}(x; T, F_\theta) \qquad (3.10)$$

The second summary value has to do with small fluctuations (rounding and grouping) in the observations. The standardized effect of shifting an observation slightly from the point x to some neighbouring point y is approximately described by a normalized difference, or simply the slope of the IF at that point. One defines the local-shift sensitivity by

$$\lambda^* = \sup_{x \neq y} \frac{\|\mathrm{IF}(y; T, F_\theta) - \mathrm{IF}(x; T, F_\theta)\|}{\|y - x\|} \qquad (3.11)$$

When the derivative of the IF exists, we can write

$$\lambda^* = \sup_x \left\| \frac{\partial}{\partial x}\, \mathrm{IF}(x; T, F_\theta) \right\| \qquad (3.12)$$

3.2.5 The breakdown point

In contrast to the IF, which is a local concept, the breakdown point is a measure of the global reliability of an estimator: it describes up to what distance (in probability) from the model distribution the estimator still gives some relevant information (i.e. remains near the true value of the parameters). The simplest definition of the breakdown point ε* is the minimum proportion ε of contamination at a point z for which the estimator T((1 − ε)Fθ + ε∆z) is unbounded in z. The formal definition is given in Hampel et al. (1986, p. 97). We consider here the finite-sample version of the breakdown point, which is simpler and easily computable (see Donoho and Huber 1983). Let Tn(x1,...,xn) be the estimator calculated with the sample x1,...,xn. We construct an estimator Tn(z1,...,zn) where the sample z1,...,zn is obtained by replacing m data points xi by arbitrary values. We define the maximum bias by

$$b(T_n; m, x_1,\ldots,x_n) = \sup_{z_1,\ldots,z_n} \|T_n(x_1,\ldots,x_n) - T_n(z_1,\ldots,z_n)\| \qquad (3.13)$$

Then the finite-sample breakdown point is the maximum proportion m/n of points one can change such that the bias remains bounded:

$$\varepsilon_n^* = \max_m \left\{ \frac{m}{n} \;\middle|\; b(T_n; m, x_1,\ldots,x_n) < \infty \right\} \qquad (3.14)$$

In general, taking the limit of ε*_n for n → ∞ yields the asymptotic breakdown point ε*. The breakdown point should serve as a complement to the local robustness concept given by the IF. Indeed, when the number p of parameters is high, the breakdown point of most bounded-influence estimators is low (see Maronna 1976, Maronna, Bustos and Yohai 1979). In the literature, especially for econometric models, two kinds of robust estimators have been proposed: bounded-influence and high breakdown point estimators. Robust estimators with bounded IF in both the explanatory and response variables have been proposed by Mallows (1975), Hampel (1978), Krasker (1980), Krasker and Welsch (1982) and De Jongh, De Wet and Welsch (1987). High breakdown point estimators (that achieve breakdown points near 1/2) include the least median of squares estimator of Rousseeuw (1984), the S-estimators of Rousseeuw and Yohai (1984), and the estimators of Yohai (1987) and Yohai and Zamar (1988). Donoho and Liu (1988) have also highlighted the automatic robustness of minimum distance estimators. Recently, however, Simpson, Ruppert and Carroll (1992) proposed a bounded-influence high breakdown point estimator. They followed the strategy which starts with a high breakdown point estimator and then performs one iteration of a Newton-Raphson-type algorithm toward the solution of a bounded-influence estimator.
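The finite-sample notion (3.13)-(3.14) can be illustrated with a small sketch (ours, with the supremum over replacements approximated by a single very bad value): the mean breaks down as soon as one point is replaced, while the median resists until about half the sample is corrupted:

```python
import statistics

def bias_after_replacing(estimator, sample, m, bad=1e12):
    """Replace m observations by an arbitrarily bad value and report the
    shift of the estimate, in the spirit of the maximum bias (3.13)."""
    corrupted = [bad] * m + sample[m:]
    return abs(estimator(corrupted) - estimator(sample))

data = list(range(1, 22))  # 21 clean observations

# The mean breaks down with one replaced point (eps* = 1/n);
# the median stays bounded until about half the points are replaced.
print(bias_after_replacing(statistics.mean, data, 1))    # enormous
print(bias_after_replacing(statistics.median, data, 10)) # still bounded
print(bias_after_replacing(statistics.median, data, 11)) # broken down
```

This matches the asymptotic breakdown points ε* = 0 for the mean and ε* = 1/2 for the median.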

3.3 Optimal robust estimators

3.3.1 Optimality results

The class of estimators we will be working with is the class of M-estimators. They were introduced by Huber (1964) as a generalization of the maximum likelihood estimators (MLE). The latter are the solution TMLE in θ of the minimization problem

$$\sum_{i=1}^{n} [-\log f(x_i; \theta)] = \min_\theta! \qquad (3.15)$$

The M-estimators are defined as the solution Tn in θ of the minimization problem

$$\sum_{i=1}^{n} \rho(x_i; \theta) = \min_\theta! \qquad (3.16)$$

where ρ is some function on X × Θ. Suppose that ρ has a derivative

$$\psi(x; \theta) = \left( \frac{\partial}{\partial\theta_1}\rho(x;\theta), \ldots, \frac{\partial}{\partial\theta_p}\rho(x;\theta) \right)^T \qquad (3.17)$$

then the estimate Tn satisfies the implicit equation

$$\sum_{i=1}^{n} \psi(x_i; T_n) = 0 \qquad (3.18)$$

It is worth mentioning that this class is rich and includes a variety of well-known estimators. For example,

$$\psi(x; \theta) = s(x; \theta) = \left( \frac{\partial}{\partial\theta_1}\log f(x;\theta), \ldots, \frac{\partial}{\partial\theta_p}\log f(x;\theta) \right)^T \qquad (3.19)$$

gives the MLE. Under regularity conditions (see Huber 1981), an M-estimator is asymptotically normal with asymptotic covariance matrix

$$V(T, F_\theta) = M(\psi, F_\theta)^{-1}\, Q(\psi, F_\theta)\, M(\psi, F_\theta)^{-T} \qquad (3.20)$$

where

$$M(\psi, F_\theta) = -\int \frac{\partial}{\partial\theta}\, \psi(x; \theta) \, dF_\theta(x) \qquad (3.21)$$

and

$$Q(\psi, F_\theta) = \int \psi(x; \theta)\,\psi(x; \theta)^T \, dF_\theta(x) \qquad (3.22)$$

Moreover, to any asymptotically normal estimator there corresponds an asymptotically equivalent M-estimator (see Hampel et al. 1986). Hence there is no loss, at least asymptotically, in confining attention to the class of M-estimators.
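As a simple, self-contained illustration of an M-estimator (our sketch, not the thesis's estimator for PID models), consider Huber's location estimator with the scale fixed at the normalized MAD, computed by iteratively reweighted means; letting c → ∞ would reproduce the sample mean, i.e. the Gaussian MLE:

```python
import statistics

def huber_location(x, c=1.345, tol=1e-8, max_iter=100):
    """M-estimator of location with Huber's psi(r) = max(-c, min(c, r)),
    solved by iteratively reweighted means; scale fixed to the MAD."""
    med = statistics.median(x)
    scale = statistics.median(abs(v - med) for v in x) / 0.6745
    if scale == 0:
        return med
    t = med
    for _ in range(max_iter):
        # weights w(r) = psi(r)/r = min(1, c/|r|) for standardized residuals r
        w = [min(1.0, c / abs((v - t) / scale)) if v != t else 1.0 for v in x]
        t_new = sum(wi * vi for wi, vi in zip(w, x)) / sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

sample = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 50.0]  # one gross error
print(statistics.mean(sample))   # pulled towards 50
print(huber_location(sample))    # stays near 2
```

The reweighting step is exactly the fixed-point form of the estimating equation (3.18) with a bounded ψ.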

The corresponding functional T is defined by the equation

$$\int \psi(x; T(F)) \, dF(x) = 0 \qquad (3.23)$$

and Fisher consistency (i.e. T(Fθ) = θ) implies

$$\int \psi(x; \theta) \, dF_\theta(x) = 0, \quad \forall\theta. \qquad (3.24)$$

One of the most important properties of the IF is that it is proportional to the function ψ defining the M-estimator. Indeed, for M-estimators,

$$\mathrm{IF}(x; T, F_\theta) = M(\psi, F_\theta)^{-1}\, \psi(x; T) \qquad (3.25)$$

Therefore, the properties of ψ (in particular its boundedness) are directly related to the properties of the IF of the M-estimator. Note that for the MLE, the IF is proportional to the score function. For PID models, this score function is mostly unbounded (see appendix A). In the classical case, the optimality problem consists, for a given parametric model, in finding the most efficient estimator, that is, the one which minimizes the asymptotic covariance matrix. The MLE solves this problem. However, if one supposes that the postulated parametric model does not hold exactly, one has to introduce a constraint in order to improve stability in a neighborhood of the model. The most natural one is that the gross-error sensitivity of the estimator should not exceed a given real-valued bound c. Hence the robust problem can be viewed as a generalization of the classical one. Given the parametric model {Fθ | θ ∈ Θ}, we try to find the best trade-off between efficiency and robustness, knowing that

• for robustness reasons (if the model holds only approximately), we want a bounded asymptotic bias (B-robustness) in a neighborhood of the model, and

• if the model holds exactly, we lose efficiency by bounding this bias.

This means that

• we want a bounded gross-error sensitivity, i.e. γ*(T, Fθ) ≤ c, and

• we want to minimize the asymptotic covariance matrix¹

$$V(T, F_\theta) = \int \mathrm{IF}(x; T, F_\theta)\,\mathrm{IF}(x; T, F_\theta)^T \, dF_\theta(x). \qquad (3.26)$$

¹ If the problem is stated this way, such an estimator does not exist (see Hampel et al. 1986). A solution can be found if one minimizes instead the trace of the covariance matrix, which is asymptotically equivalent to minimizing the mean squared error (MSE).

Moreover, the estimator has to be Fisher consistent, that is, T(Fθ) = θ. Hence, the robust optimality problem consists in minimizing the trace of the asymptotic covariance matrix among all Fisher consistent estimators, subject to the condition that the gross-error sensitivity is bounded. Such estimators are called optimal B-robust estimators (OBRE). There exist in fact several versions of the OBRE, which depend on the choice of the norm in the definition of the gross-error sensitivity (see Hampel et al. 1986). Before giving the optimality theorem, we introduce the Huber function, which is used to define the OBRE. This function, given by

$$h_c(z) = z \cdot \min\left(1, \frac{c}{\|z\|}\right) \qquad (3.27)$$

transforms each point outside a hypersphere of radius c to its nearest point on the hypersphere and leaves those inside unchanged. Intuitively, for efficiency reasons the OBRE should be as similar as possible to the MLE, that is, its ψ function should be as close as possible to the likelihood score in the centre. On the other hand, since the influence is proportional to the ψ function, in order to obtain a bounded IF one has to truncate the ψ function. This can be achieved by means of the Huber function. Formally, the following theorem gives the OBRE in the unstandardized case, that is, the best trade-off between efficiency and robustness for a general parametric model.

Theorem 1 Assume a parametric model {Fθ | θ ∈ Θ} with score function s(·; θ) and a real positive c, and denote by T_c^u the solution in θ of

$$\sum_{i=1}^{n} \psi_c^{A,a}(x_i; \theta) = 0 \qquad (3.28)$$

where

$$\psi_c^{A,a}(x; \theta) = h_c\big(A\,[s(x; \theta) - a]\big) \qquad (3.29)$$

and A and a (respectively a p × p matrix and a p-dimensional vector) are determined by the equations

$$\int \psi_c^{A,a}(x; \theta)\, s(x; \theta)^T \, dF_\theta(x) = I \qquad (3.30)$$

$$\int \psi_c^{A,a}(x; \theta) \, dF_\theta(x) = 0 \qquad (3.31)$$

Then T_c^u is optimal B_u-robust in the sense that it minimizes the trace of the covariance matrix subject to an upper bound c on the unstandardized gross-error sensitivity.

Proof: see Hampel et al. (1986), p. 242-243.
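The Huber function (3.27) that truncates the score in the theorem is straightforward to implement; the sketch below (ours) projects a vector onto the hypersphere of radius c when its norm exceeds c and leaves it unchanged otherwise:

```python
import math

def huber_function(z, c):
    """h_c(z) = z * min(1, c / ||z||), eq. (3.27): points outside the
    hypersphere of radius c are projected onto it, points inside are kept."""
    norm = math.sqrt(sum(zi * zi for zi in z))
    if norm <= c:
        return list(z)
    factor = c / norm
    return [zi * factor for zi in z]

print(huber_function([1.0, 2.0], c=5.0))  # inside: unchanged
print(huber_function([3.0, 4.0], c=1.0))  # norm 5 -> rescaled to norm 1
```

Applied to A[s(x; θ) − a] as in (3.29), this is precisely what bounds the norm of ψ, and hence the IF, by c (in the metric induced by A).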

It is important to note that the matrix A and the vector a depend in general on the parameter θ and should be viewed as A(θ) and a(θ). The vector a(θ), as given by (3.31), is the shift of the score function necessary to obtain a Fisher consistent estimator. Concerning the upper bound c, it is easy to show that (see also Krishnakumar and Ronchetti 1993)

$$c \geq \frac{\dim(\theta)}{E_\theta\big(\|s(x; \theta)\|\big)} \qquad (3.32)$$

A similar theorem can also be given in the standardized case. In that case, only a weaker property can be demonstrated (which holds for the unstandardized case too). The following theorem gives this estimator.

Theorem 2 Assume again a parametric model {Fθ | θ ∈ Θ} with score function s(·; θ) and a real positive c, and denote by T_c^s the solution in θ of

$$\sum_{i=1}^{n} \psi_c^{A,a}(x_i; \theta) = 0 \qquad (3.33)$$

where

$$\psi_c^{A,a}(x; \theta) = h_c\big(A\,[s(x; \theta) - a]\big) \qquad (3.34)$$

and A and a (respectively a p × p matrix and a p-dimensional vector) are determined by the equations

$$\int \psi_c^{A,a}(x; \theta)\, \psi_c^{A,a}(x; \theta)^T \, dF_\theta(x) = I \qquad (3.35)$$

$$\int \psi_c^{A,a}(x; \theta) \, dF_\theta(x) = 0 \qquad (3.36)$$

Then T_c^s is "admissible" B_s-robust in the sense that there is no M-estimator with standardized gross-error sensitivity bounded by c and a smaller asymptotic covariance matrix.

Proof: see Hampel et al. (1986), p. 245.

The bound on c is given by

$$c \geq \dim(\theta) \qquad (3.37)$$

(see 3.49). In the applications we have chosen to calculate the OBRE in the standardized case, and will simply refer to it as the OBRE. However, for PID models like the Gamma or the Pareto we could also have chosen the unstandardized case, since the estimators are invariant to scale transformations. Nevertheless, there is no strong argument in favour of the unstandardized case either. The relative advantage of the standardized version is that we have a lower bound for c that does not depend on the parameter. Moreover, it appears to be numerically more stable. In the next subsections we discuss the choice of c and propose an algorithm to find the optimal solution.

3.3.2 Computational aspects

In this subsection we discuss the computation of the OBRE in the case of continuous data and complete information. We explain how to obtain the OBRE by presenting a general algorithm proposed by Hampel et al. (1986), to which we add a numerical method based on the Newton-Raphson expansion to solve the equations of the minimization problem. In order to simplify the derivation of the OBRE for any parametric model, Hampel et al. (1986) proposed a general step-by-step algorithm. This method gives a way to calculate the matrix A and the vector a for a given θ, and thus avoids solving simultaneously the implicit equations defining this matrix and vector and the first-order condition. Indeed, a cleaner way to obtain the OBRE would be to find the root of Σ_{i=1}^n ψ_c^{A,a}(x_i; θ) = 0 under the constraints given by (3.35) and (3.36). However, this would be computationally very intensive and would probably not improve the results obtained by our algorithm. Since this optimality problem is in fact a minimization problem, the basic idea of our algorithm is to use the Newton-Raphson method and find the solution by iterating the two following steps:

• For a given parameter θ0, we find the values of A and a by solving the implicit equations (3.35) and (3.36).

• We compute the Newton-Raphson step (for A and a given by the previous step) and add it to θ0.

More practically, the complete algorithm consists of the five following steps:

Step 1: Fix a precision threshold ε and an initial value for the parameter θ.

Step 2: Compute the initial matrix A = J^{1/2}(θ)^{-T} and vector a = 0, where J(θ) is the Fisher information matrix given by

$$J(\theta) = \int s(x; \theta)\, s(x; \theta)^T \, dF_\theta(x) \qquad (3.38)$$

Step 3: Determine A and a iteratively by means of

$$A^T A = M_2^{-1}(\theta)$$

and

$$a = \frac{\int s(x; \theta)\, W_c(x; \theta)\, dF_\theta(x)}{\int W_c(x; \theta)\, dF_\theta(x)}$$

Step 4: Compute M_1(θ) and

$$\Delta\theta = M_1^{-1}(\theta)\, \frac{1}{n} \sum_{i=1}^{n} [s(x_i; \theta) - a]\, W_c(x_i; \theta)$$

where

$$W_c(x; \theta) = \min\left\{1;\; \frac{c}{\|A\,[s(x; \theta) - a]\|}\right\}$$

and

$$M_k(\theta) = \int [s(x; \theta) - a][s(x; \theta) - a]^T\, W_c^k(x; \theta)\, dF_\theta(x)$$

Step 5: If ‖∆θ‖ > ε, set θ = θ + ∆θ and return to step 2; otherwise stop.

For the first step we can take, for instance, the MLE as initial value of the parameter, or a high breakdown estimator if available. In the second and third steps we find the solution of the implicit equations defining A and a. The third step is iterative, since A and a are recomputed by means of the proposed equations until convergence. This simplification has been proposed by Hampel et al. (1986). For the second step, we choose a = 0 and A = J^{1/2}(θ)^{-T}, since these values solve the equations for c = ∞ (corresponding to the MLE). In step 4, we compute the Newton-Raphson step. The corresponding algorithm can also be viewed as an algorithm based on the IF (IF algorithm). It is based on (3.5), and θ^{k+1}, the value of the estimator obtained at step k + 1, is given by

$$\theta^{k+1} = \theta^k + \frac{1}{n} \sum_{i=1}^{n} \mathrm{IF}(x_i; \theta^k, F_{\theta^k}) \qquad (3.39)$$

On the other hand, the Newton-Raphson algorithm allows us to find the min- imum of a function by searching a stationary point. The Newton-Raphson step (∆θ = θk+1 − θk)isgivenby∆θ = −g (θ)/g (θ)whereg(θ)isthe function to minimize. In our problem, the first derivative of the function n to minimize is given by (1/n) i=1 ψ(xi; θ). If we approximate the average over the sample by the integral over the tentatively estimated distribution and make use of the Fisher consistency we get: ∂ ψ(x; θ)dF (x)=− ψ(x; θ)s(x; θ)dF (x) (3.41) ∂θ θ θ The Newton-Raphson step is then given by −1 n − ∂ 1 k ∆θ = ψ(x; θ) dFθk (x) ψ(xi; θ ) ∂θ k n θ=θ i=1 n 1 k = IF(x ; θ ,F k ) (3.42) n i θ i=1 It is easy to see that for a large sample size (n →∞), the second derivative of the objective function converges to the left hand side of (3.41). In our simulations, we found that with samples sizes of 200, the algorithm works very well. With the MLE as starting point, the convergence to the OBRE is quite fast. What often takes a long time is actually the computation of the matrix A and vector a for which we have to compute integrals. One could avoid this problem by taking the empirical version for the parametric distribution, if the sample size is large enough. To illustrate the convergence to the OBRE, figure 3.1 shows the plot of

(1/n) Σ_{i=1}^n IF(x_i; θ, F_θ)

as a function of the parameter to estimate. In this example (Pareto sample with 200 observations, α = 3, c = 5) we can see that the mean influence is always smaller in absolute value than the difference between the parameter value and the solution. Thus, if we start with a value for θ in the neighborhood of the solution, the algorithm makes the parameter converge rapidly to the solution.


Figure 3.1: Value of the mean IF around the true parameter θ
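To make iteration (3.39) concrete, the following sketch applies it to the simplest case, the MLE of the Pareto parameter α with x_0 known, where ψ equals the score s and the influence function reduces to IF = J(α)^{-1} s = α² s. This is an illustrative stdlib-only sketch (the function names and the simulated sample are ours, not the thesis's Fortran code); its fixed point coincides with the closed-form MLE n/Σ log(x_i/x_0), used in the last lines as a check.

```python
import math, random

def pareto_sample(n, alpha, x0, seed=0):
    # Inverse-CDF sampling: F(x) = 1 - (x0/x)^alpha for x >= x0.
    rng = random.Random(seed)
    return [x0 * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def score(x, alpha, x0):
    # s(x; alpha) = 1/alpha + log(x0) - log(x)
    return 1.0 / alpha + math.log(x0) - math.log(x)

def if_step_mle(data, alpha, x0):
    # For the MLE, psi = s and the Fisher information is J(alpha) = 1/alpha^2,
    # so the IF of (3.40) reduces to alpha^2 * s(x; alpha).
    return alpha ** 2 * sum(score(x, alpha, x0) for x in data) / len(data)

def fit_pareto_if(data, x0, alpha0=1.0, tol=1e-10, max_iter=100):
    # Iteration (3.39): alpha_{k+1} = alpha_k + mean IF at alpha_k.
    alpha = alpha0
    for _ in range(max_iter):
        delta = if_step_mle(data, alpha, x0)
        alpha += delta
        if abs(delta) < tol:
            break
    return alpha

data = pareto_sample(200, alpha=3.0, x0=0.5, seed=42)
alpha_hat = fit_pareto_if(data, x0=0.5)
# Closed-form MLE for comparison: n / sum(log(x_i / x0))
alpha_mle = len(data) / sum(math.log(x / 0.5) for x in data)
```

In this MLE special case the iteration is exactly a Newton-Raphson step on the likelihood, which is why the convergence from a starting value below the solution is so fast.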

3.3.3 How to choose the bound c

Knowing that an OBRE results from a compromise between efficiency and robustness, it is easy to see that the way to control "more or less robustness" is by means of the bound c. Indeed, for c = ∞ there is no bound on the IF and thus the OBRE is equivalent to the MLE, with the same asymptotic covariance matrix. On the other hand, when c decreases the estimator is more B-robust but it loses efficiency. Hence, one way to choose the bound c is to take the minimum value of c that keeps a minimum efficiency, for example 95% at the model. We measure the asymptotic efficiency of an OBRE by the ratio between the traces of the asymptotic covariance matrices of the MLE and the OBRE. For a given parametric model, it depends analytically on both the value of the parameter and the bound c. However, in numerical applications we found that the value of the parameter does not influence the efficiency of

the estimator. The asymptotic efficiency is given by

tr{V(T_MLE, F_θ)} / tr{V(T_c, F_θ)}    (3.43)

with V(T_MLE, F_θ) = J(θ)^{-1}, where J(θ) is the Fisher information matrix. For the estimator T_c we have

V(T_c, F_θ) = ∫ IF(x; T_c, F_θ) IF(x; T_c, F_θ)^T dF_θ(x)    (3.44)

where

IF(x; θ, F_θ) = M_1(θ)^{-1} [s(x; θ) − a] W_c(x; θ)    (3.45)

and

M_k(θ) = ∫ [s(x; θ) − a][s(x; θ) − a]^T W_c^k(x; θ) dF_θ(x)    (3.46)

Thus we can write

V(T_c, F_θ) = M_1^{-1}(θ) M_2(θ) M_1^{-T}(θ)    (3.47)

Denote by ν_1 and ν_2 respectively the numerator and the denominator of the ratio given by (3.43); then the following algorithm computes the efficiency Eff(c; θ) of the OBRE (in the standardized case) as a function of the bound c and for a given θ.

Step 1: For a given θ, compute

a = 0
A = J^{1/2}(θ)^{-T}
ν_1 = tr[J(θ)^{-1}]

Step 2: Determine A and a iteratively by means of

A^T A = M_2^{-1}(θ)

a = ∫ s(x; θ) W_c(x; θ) dF_θ(x) / ∫ W_c(x; θ) dF_θ(x)

Step 3: Compute

ν_2 = tr[M_1^{-1}(θ) M_2(θ) M_1^{-T}(θ)]

Step 4: The efficiency of the OBRE is given by

Eff(c; θ) = ν_1 / ν_2
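For a one-parameter model the four steps above can be sketched directly. The code below (an illustrative sketch of ours, not the thesis's implementation) computes Eff(c; α) for the Pareto law: with u = log(x/x_0) ~ Exp(α) the score is s = 1/α − u and J = 1/α², and all the integrals defining a, A, M_1 and M_2 are evaluated by a midpoint rule.

```python
import math

def pareto_efficiency(c, alpha=3.0, n_grid=4000, u_max=15.0):
    # One-parameter Pareto law: with u = log(x/x0) ~ Exp(alpha) the score
    # is s(u) = 1/alpha - u and the Fisher information is J = 1/alpha^2.
    h = u_max / n_grid
    us = [(i + 0.5) * h for i in range(n_grid)]           # midpoint rule
    w = [alpha * math.exp(-alpha * u) * h for u in us]    # Exp(alpha) weights
    s = [1.0 / alpha - u for u in us]
    J = 1.0 / alpha ** 2
    nu1 = 1.0 / J                                         # Step 1: a = 0, A = J^{1/2}
    a, A = 0.0, math.sqrt(J)
    for _ in range(100):                                  # Step 2: iterate for A and a
        W = [min(1.0, c / (A * abs(si - a) + 1e-300)) for si in s]
        a = sum(si * Wi * wi for si, Wi, wi in zip(s, W, w)) \
            / sum(Wi * wi for Wi, wi in zip(W, w))
        M2 = sum((si - a) ** 2 * Wi ** 2 * wi for si, Wi, wi in zip(s, W, w))
        A = 1.0 / math.sqrt(M2)                           # A^T A = M2^{-1}, scalar case
    W = [min(1.0, c / (A * abs(si - a) + 1e-300)) for si in s]
    M1 = sum((si - a) ** 2 * Wi * wi for si, Wi, wi in zip(s, W, w))
    M2 = sum((si - a) ** 2 * Wi ** 2 * wi for si, Wi, wi in zip(s, W, w))
    nu2 = M2 / M1 ** 2                                    # Step 3
    return nu1 / nu2                                      # Step 4: Eff(c; alpha)
```

For a large bound the weights are identically one, a returns to 0 and M_1 = M_2 = J, so the ratio is 1, recovering the MLE case; as c decreases the efficiency falls, as in figures 3.6 and 3.7.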

Before computing the efficiency versus the bound c for a given parametric model, it is important to know the lower bound for c; the corresponding estimator is the most B-robust estimator. In the unstandardized case the lower bound for c is given by

c ≥ dim(θ) / E[||s(x; θ)||]    (3.48)

In the standardized case we obtain

c ≥ [dim(θ)]^{1/2}    (3.49)
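For the Pareto law with u = log(x/x_0) ~ Exp(α), the expectation in (3.48) is available in closed form: E||s(x; α)|| = E|1/α − u| = 2/(αe), so the unstandardized lower bound is c ≥ αe/2. A small Monte Carlo check (illustrative code of ours, not from the thesis):

```python
import math, random

def pareto_score_norm_mc(alpha, n=200000, seed=1):
    # Monte Carlo estimate of E||s(x; alpha)|| for the Pareto law,
    # using u = log(x/x0) ~ Exp(alpha) and s = 1/alpha - u.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = rng.expovariate(alpha)
        total += abs(1.0 / alpha - u)
    return total / n

alpha = 3.0
c_min_mc = 1.0 / pareto_score_norm_mc(alpha)   # dim(theta) = 1
c_min_exact = alpha * math.e / 2.0             # closed form: 1 / (2 / (alpha * e))
```

For α = 3 this unstandardized lower bound is about 4.08; in the standardized case used in the simulations below, the lower bound is simply 1 for the Pareto law and √2 for the Gamma distribution.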

In subsection 3.5.1 we will give the plots of the efficiency versus the bound c for two PID models, in order to give an idea of the value of c for which the estimator is still reasonably efficient.

3.4 Application to two income distribution models

In this section and the following we treat the case of continuous data with complete information. The case of 'complete information' occurs when incomes are observed over the whole income range. Practically, when the data are continuous, the information is complete if we can observe incomes from the lowest to the highest, which means that incomes under a minimum taxable income are also reported. In the following chapters, we will discuss the estimation techniques in the most general cases, i.e. when the data are continuous and incomplete and when the data are grouped. In the complete-information case, we can directly apply the optimality theorem given previously, by using the algorithms given in subsection 3.3.2. The techniques can be applied to any PID model, but in the following subsection we present the results for two important models. In a simulation study, we then compute the OBRE with simulated and contaminated data and compare them to the MLE. We end this section with an application to real data samples.

3.4.1 Simulation results

For the simulations, we have chosen a PID model with one parameter and another with two parameters: the Pareto law and the Gamma distribution respectively. The density and scores functions are given by:

• Pareto law:

f(x; α) = α x_0^α x^{-(α+1)}

where 0 < x_0 ≤ x and α > 0, with scores function

s(x; α) = ∂/∂α log f(x; α) = 1/α + log(x_0) − log(x)

• Gamma distribution:

f(x; α, λ) = [λ^α / Γ(α)] x^{α−1} exp(−λx)

where 0 < x < ∞ and (α, λ) > 0, with scores functions

s_1(x; α, λ) = ∂/∂α log f(x; α, λ) = log(λ) + log(x) − Γ̃(α)
s_2(x; α, λ) = ∂/∂λ log f(x; α, λ) = α/λ − x

where Γ̃(α) is the digamma function given by

Γ̃(α) = ∂/∂α log Γ(α)    (3.50)

Since the IF of the MLE is proportional to the scores function (see (3.25)), we can see that the MLE for the Pareto law and the Gamma distribution have an unbounded IF. In appendix A we give the scores functions of the PID models discussed in section 2.5 and study their IF.

In order to give a more realistic interpretation of the results, we also calculated an income inequality measure. We chose the Theil index I^0_E based on the entropy of the distribution (see (8.28)). However, the results also apply to other income inequality measures. In chapter 8 we study the case of income inequality measures in more detail.

For the Pareto law, knowing that E(x) = x_0 α/(α − 1), the Theil index I^0_E is given by

I^0_E = 1/(α − 1) − log[α/(α − 1)]    (3.51)

For the Gamma distribution, knowing that E(x) = α/λ, the Theil index is given by

I^0_E = 1/α + Γ̃(α) − log(α)    (3.52)
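Formulas (3.51) and (3.52) are straightforward to evaluate; the sketch below (ours, standard library only) approximates the digamma function Γ̃ by a central difference of log Γ. For α = 3 the Pareto index is about 0.095 and the Gamma index about 0.157, consistent with the simulation tables below.

```python
import math

def digamma(a, h=1e-5):
    # Central-difference approximation of d/da log Gamma(a).
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)

def theil_pareto(alpha):
    # (3.51): valid for alpha > 1 (finite mean); independent of x0.
    return 1.0 / (alpha - 1.0) - math.log(alpha / (alpha - 1.0))

def theil_gamma(alpha):
    # (3.52): note that the index does not depend on the scale lambda.
    return 1.0 / alpha + digamma(alpha) - math.log(alpha)
```

A smaller Pareto α means a heavier tail, so the index correctly increases as α decreases; this is the mechanism by which contamination that biases α̂ downward inflates the estimated Theil index.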

Gamma distribution

In order to measure the effect of the contamination on the MLE and the stability of the OBRE in the presence of different percentages of contamination, we simulated 50 samples of 200 observations generated by the Gamma distribution with parameter values α = 3 and λ = 1. We contaminated the samples in three different ways:

Model 1: This model is the non-contaminated model F_{α,λ}.

Model 2: This model contains 1% of very 'bad' contamination, that is, we took the 1% highest observations and multiplied them by 10.

Model 3: This model contains 3% of contamination and is given by 0.97 F_{α,λ} + 0.03 F_{α,0.1λ}.

Model 4: This model contains 5% of contamination and is given by 0.95 F_{α,λ} + 0.05 F_{α,0.1λ}.

We chose these models because we wanted to highlight the performance of the estimators in such artificial situations. Thus, if the OBRE works well in these cases, it should also work well in relatively better cases. In tables 3.2, 3.3, 3.4 and 3.5 we give the MLE and the OBRE (in the standardized case) for different bounds c (see footnote 2). As the estimators are biased in the contaminated models, we have to consider not only the variance of the estimators, but also the bias. Thus we computed their mean squared error:

MSE(T) = VAR(T) + BIAS(T)^2
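The Model 2 experiment is easy to replicate. The sketch below (illustrative, standard library only; not the thesis's Fortran program) simulates Gamma(α = 3, λ = 1) samples, multiplies the top 1% of observations by 10, and fits α by solving the Gamma likelihood equations log α − Γ̃(α) = log x̄ − log x̃ and λ = α/x̄ by bisection; the contaminated estimates of α collapse in the way the tables below report.

```python
import math, random

def digamma(a, h=1e-5):
    # Numerical derivative of log Gamma.
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)

def gamma_mle(xs):
    # Solve log(a) - digamma(a) = log(mean) - mean(log) by bisection,
    # then lambda = alpha / mean.
    xbar = sum(xs) / len(xs)
    logbar = sum(math.log(x) for x in xs) / len(xs)
    target = math.log(xbar) - logbar          # > 0 by Jensen's inequality
    lo, hi = 1e-3, 1e3
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > target:
            lo = mid                          # log(a) - digamma(a) decreases in a
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return alpha, alpha / xbar

def contaminate_top(xs, frac=0.01, factor=10.0):
    # Model 2: multiply the top `frac` of observations by `factor`.
    xs = sorted(xs)
    k = max(1, int(round(frac * len(xs))))
    return xs[:-k] + [x * factor for x in xs[-k:]]

rng = random.Random(7)
reps, n = 50, 200
est_clean, est_bad = [], []
for _ in range(reps):
    xs = [rng.gammavariate(3.0, 1.0) for _ in range(n)]  # alpha = 3, lambda = 1
    est_clean.append(gamma_mle(xs)[0])
    est_bad.append(gamma_mle(contaminate_top(xs))[0])
mean_clean = sum(est_clean) / reps
mean_bad = sum(est_bad) / reps
```

The decimal-point-error contamination inflates the arithmetic mean far more than the geometric mean, which is exactly why the shape estimate α̂ falls so sharply.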

Estimator     Parameter  Estimate      MSE    Theil index
MLE           α          3.05 (0.03)   0.07   0.155
              λ          1.01 (0.01)   0.01
OBRE c=4.0    α          3.06 (0.04)   0.07   0.155
              λ          1.02 (0.01)   0.01

Table 3.2: MLE and OBRE for the Gamma model 1 (non-contaminated)

Let us now interpret the results. As expected, the MLE is ruined by only a small amount of contamination introduced in the model: the estimate α̂ falls from 3.05 (in the non-contaminated case) to 1.38 with only 1% of contamination (see tables 3.2 and 3.3). The MSE in the latter case is 2.65. The same phenomenon can be observed for the Theil index: it rises from 0.155 to 0.32, thus twice its true value! This is of course the worst contamination type for the MLE, because we systematically multiply 1% of the highest observations by a relatively big value. However, in the more

^2 The values in brackets are the estimated standard errors of the estimated means. The Theil index is calculated with the mean value of the estimates.

Estimator     Parameter  Estimate      MSE    Theil index
MLE           α          1.38 (0.02)   2.65   0.320
              λ          0.36 (0.01)   0.41
OBRE c=5.0    α          3.01 (0.04)   0.07   0.157
              λ          1.01 (0.01)   0.01
OBRE c=1.5    α          3.12 (0.06)   0.17   0.152
              λ          1.04 (0.02)   0.02

Table 3.3: MLE and OBRE for the Gamma model 2 (1% of 'bad' contamination)

Estimator     Parameter  Estimate      MSE    Theil index
MLE           α          1.67 (0.04)   1.89   0.270
              λ          0.44 (0.01)   0.32
OBRE c=4.0    α          2.59 (0.03)   0.22   0.181
              λ          0.82 (0.01)   0.04
OBRE c=2.0    α          2.80 (0.04)   0.11   0.168
              λ          0.91 (0.01)   0.02
OBRE c=1.5    α          2.85 (0.05)   0.12   0.165
              λ          0.93 (0.02)   0.02

Table 3.4: MLE and OBRE for the Gamma model 3 (3% of contamination)

Estimator     Parameter  Estimate      MSE    Theil index
MLE           α          1.28 (0.04)   3.09   0.342
              λ          0.30 (0.01)   0.50
OBRE c=4.0    α          2.37 (0.05)   0.51   0.196
              λ          0.72 (0.02)   0.09
OBRE c=2.0    α          2.68 (0.04)   0.20   0.175
              λ          0.86 (0.02)   0.03
OBRE c=1.5    α          2.78 (0.05)   0.16   0.169
              λ          0.89 (0.02)   0.03

Table 3.5: MLE and OBRE for the Gamma model 4 (5% of contamination)

realistic cases, namely models 3 and 4 (see tables 3.4 and 3.5), the MLE of α also falls, to the values 1.67 and 1.28, and the Theil index rises to 0.27 and 0.342. The MSE of the MLE is equal to 1.89 and 3.09! These cases are more realistic in the sense that, in practice, it is possible that a random percentage of observations are badly recorded, that is, recorded as 10 times their true values (decimal point error). On the other hand, we can observe that when the model is non-contaminated the OBRE gives the same values for the parameters and the Theil index as the MLE. When an amount of contamination is introduced, the OBRE are more or less stable and much less influenced than the MLE, which are completely ruined. In the second model (see table 3.3) they have an MSE (for the parameter α) equal to only 0.07 with c = 5.0. This value for the bound c corresponds, as we will see below, to 97% efficiency. With 3% and 5% of contamination, the OBRE has an MSE of 0.12 and 0.16 respectively, when the bound c is at its minimum. We cannot hope to have the same results as in the less contaminated cases, but when we compare the OBRE with the MLE, the MSE of the latter is about 20 times the MSE of the OBRE.

It is interesting at this point to make some remarks on the computing aspect. We have written the programs in Fortran and executed them on a VAX (VMS system). The CPU time needed to calculate the estimators depends on the choice of the bound c. The smallest CPU time was of course for the calculation of the MLE, which only requires a very simple minimization procedure. For the OBRE, when the bound c is high, the CPU time is about 30 seconds per sample. But when we lower the bound to its lowest value (c = 1.5) the CPU time increases to about 17 minutes per sample.
The computing time does not seem to depend on the underlying model, that is, it does not depend on the degree of contamination introduced in the true model.

Pareto law

For the Pareto law, we simulated 100 samples of 200 observations generated by the Pareto distribution with parameter values α = 3.0 and x_0 = 0.5. We also contaminated the samples in two different ways in order to obtain the following models:

Model 1: This model is the non-contaminated model F_{α,x_0}.

Model 2: This model contains 2% of contamination and is given by 0.98 F_{α,x_0} + 0.02 F_{α,10x_0}.

Model 3: This model contains 5% of contamination and is given by 0.95 F_{α,x_0} + 0.05 F_{α,10x_0}.

In tables 3.6, 3.7 and 3.8, we give the MLE and the OBRE for different bounds c.

Estimator     α (s.e.)      MSE    Theil index
MLE           2.99 (0.02)   0.05   0.095
OBRE c=2.0    2.99 (0.02)   0.04   0.095

Table 3.6: MLE and OBRE for the Pareto model 1 (non-contaminated)

Estimator     α (s.e.)      MSE    Theil index
MLE           2.68 (0.02)   0.15   0.128
OBRE c=3.0    2.83 (0.02)   0.07   0.110
OBRE c=1.5    2.94 (0.02)   0.06   0.100

Table 3.7: MLE and OBRE for the Pareto model 2 (2% of contamination)

Estimator     α (s.e.)      MSE    Theil index
MLE           2.25 (0.02)   0.61   0.212
OBRE c=3.0    2.53 (0.02)   0.28   0.151
OBRE c=1.5    2.76 (0.02)   0.11   0.118

Table 3.8: MLE and OBRE for the Pareto model 3 (5% of contamination)

The interpretation of the results is almost the same as for the Gamma distribution. The MLE deviates markedly from the true value of the parameter with 2% of contamination (see table 3.7). With 5% of contamination, the Theil index rises from 0.095 (in the non-contaminated case, see table 3.6) to 0.212, and thus becomes more than twice its initial value (see table 3.8). For the OBRE, we can say that it performs better than the MLE, considering that when the model is not contaminated it takes the same value as the MLE. However, when a small amount of contamination is introduced, the OBRE has a smaller MSE. With 2% of contamination the OBRE has an MSE equal to 0.06 (almost as in the non-contaminated case) when the bound c is near its minimum. With 5% of contamination the OBRE has an MSE of 0.11 (for the smallest value of the bound c), which is considerably better than the MSE of the MLE (equal to 0.61).

On the computing side, we noticed that, as for the Gamma distribution, the CPU time needed to calculate the estimators depends on the choice of the value for the bound c. When the bound is high the CPU time is about 15 seconds per sample. However, when we lower the bound to its lowest value (c = 1.5), the CPU time increases to about 5 minutes per sample.

3.4.2 Application to real data

In order to illustrate the usefulness of OBRE in the study of PID, we apply the techniques presented above to real data sets. The first set is the total family income in 1981^3 in the USA from the 'Panel Study of Income Dynamics' (PSID). The second set comes from the 'Family Expenditure Survey' (FES) in 1985 in the United Kingdom. Moreover, in order to be more complete in our analysis, we fit not only the Gamma distribution to the observations, but also the Dagum model type I. The cumulative distribution function F_{β,λ,δ}(x) of the Dagum model type I is given by

F_{β,λ,δ}(x) = (1 + λ x^{-δ})^{-β}    (3.53)

where 0 ≤ x < ∞, (β, λ) > 0 and δ > 1. The corresponding density function f(x; β, λ, δ) is given by

f(x; β, λ, δ) = β λ δ x^{-δ-1} (1 + λ x^{-δ})^{-β-1}    (3.54)

The scores functions are then given by

s_1(x; β, λ, δ) = ∂/∂β log f(x; β, λ, δ) = 1/β − log(1 + λ x^{-δ})    (3.55)

s_2(x; β, λ, δ) = ∂/∂λ log f(x; β, λ, δ) = 1/λ − (β + 1) x^{-δ}/(1 + λ x^{-δ})    (3.56)

^3 We define the total family income as the sum of the total taxable income of head and wife, the total transfer income of head and wife, the total taxable income of all others in the family unit and the total transfer income of all others in the family unit. Since the Gamma variable is positive, we consider only the positive incomes.

s_3(x; β, λ, δ) = ∂/∂δ log f(x; β, λ, δ) = 1/δ + (β + 1) λ x^{-δ} log(x)/(1 + λ x^{-δ}) − log(x)    (3.57)

The study of the limiting cases reveals that

lim_{x→0} s_1(x; β, λ, δ) = −∞   but   lim_{x→∞} s_1(x; β, λ, δ) = constant

lim_{x→0} s_2(x; β, λ, δ) = constant   and   lim_{x→∞} s_2(x; β, λ, δ) = constant    (3.58)

lim_{x→0} s_3(x; β, λ, δ) = ∞   and   lim_{x→∞} s_3(x; β, λ, δ) = −∞    (3.59)

Hence,

sup_x ||s(x; β, λ, δ)|| = ∞    (3.60)

The scores function is therefore unbounded. As with the simulations, in order to give a more realistic interpretation of the results, we also calculated an income inequality measure. This time we chose the Gini index. For the Gamma distribution^4 and the Dagum model type I^5, the Gini index is given respectively by

I_G = 2 B[0.5; α, α + 1] / B[α, α + 1] − 1    (3.61)

and

I_G = Γ(β) Γ(2β + 1/δ) / [Γ(2β) Γ(β + 1/δ)] − 1    (3.62)

where B[α, β] is the Beta function given by

B[α, β] = ∫_0^1 t^{α−1} (1 − t)^{β−1} dt = Γ(α)Γ(β)/Γ(α + β)    (3.63)

and where B[t_0; α, β] is the incomplete Beta function given by

B[t_0; α, β] = ∫_0^{t_0} t^{α−1} (1 − t)^{β−1} dt,   0 ≤ t_0 < 1    (3.64)

The first sample we analyse is the PSID data sample. In table 3.9 we give the MLE and OBRE for the Gamma and Dagum models. For the Gamma distribution, we can see that the difference between the MLE and

^4 See Salem and Mount (1974). ^5 See e.g. Dagum (1985).

Model   Estimator     Parameter  Estimate     Gini index
Gamma   MLE           α          0.32         0.7205
                      λ          1.4·10^-5
        OBRE c=3.0    α          1.67         0.4055
                      λ          7.8·10^-5
Dagum   MLE           β          0.41         0.4120
                      λ          1.18·10^14
                      δ          3.15
        OBRE c=2.0    β          0.36         0.4337
                      λ          1.22·10^14
                      δ          3.13

Table 3.9: MLE and OBRE for the Gamma and Dagum models on PSID data.

the OBRE is considerably large. Actually, the MLE estimates a zeromodal distribution while the data fit a unimodal distribution. Figure 3.2 gives the histogram of the empirical distribution and the plots of the Gamma distribution estimated by the MLE and the OBRE. It is obvious that the MLE leads to a very bad fit: the estimated distribution does not reflect the data at all. The reason in this case is that the MLE is almost completely determined by the highest observations, which represent only a very small proportion of the data set. On the other hand, the density estimated by means of the OBRE reflects the nature of the data very well.

For the Dagum model, the difference between the MLE and the OBRE is very small. In the plot comparing the two estimated distributions with the histogram of the empirical distribution (see figure 3.3), we can remark that the two estimators lead to the same fit. If we compare the estimated densities corresponding either to the Gamma (OBRE) or the Dagum (MLE) distributions (see figure 3.4), we can see that they both provide a good fit and that they are hardly distinguishable. Moreover, it should be stressed that the Gamma model is more parsimonious (one parameter fewer) than the Dagum model type I.

The second sample is the FES data sample. Table 3.10 contains the MLE and OBRE for the parameters of the Gamma distribution and the Dagum model type I. We can remark that the difference between the MLE and the OBRE for the two models does not seem to be significant. However, what is more relevant is the comparison between the estimated Gamma distribution and the estimated Dagum model by means of the OBRE (see figure 3.5): there is no visible difference.

With these two examples, we showed that the MLE can lead a statis-

Figure 3.2: MLE (-.-) and OBRE (__) fits of the Gamma distribution on PSID data

Model   Estimator     Parameter  Estimate     Gini index
Gamma   MLE           α          2.67         0.3296
                      λ          1.62·10^-2
        OBRE c=2.0    α          2.59         0.3341
                      λ          1.64·10^-2
Dagum   MLE           β          0.71         0.3504
                      λ          78.83·10^5
                      δ          3.11
        OBRE c=2.0    β          0.664        0.3980
                      λ          15.74·10^5
                      δ          2.77

Table 3.10: MLE and OBRE for the Gamma and Dagum models on FES data.

Figure 3.3: MLE (-.-) and OBRE (__) fits of the Dagum I model on PSID data

Figure 3.4: Gamma (OBRE, -.-) and Dagum (MLE, __) fits on PSID data


Figure 3.5: OBRE fits of the Gamma (-.-) and Dagum I (__) models on FES data

tical analysis to false conclusions (especially, in our case, with the Gamma distribution). However, with the samples we have analysed, it seems that the estimated (MLE) Dagum model fits the data quite well. A number of economists would argue that a model which fulfils a set of economic properties is to be preferred to an 'ad hoc' mathematical formula (see section 2.5). The Dagum model is in this sense better than the Gamma distribution. It also has one more parameter, which permits the model to accommodate the large observations. But is it worthwhile to estimate one more parameter to accommodate a few observations? Nevertheless, it should not be forgotten that the MLE is based on the assumption that the underlying model is exact, which is almost never the case in reality, so that robust techniques are necessary. Hence, even a model with the best theoretical properties can in practice fail to reflect the behaviour of the data, and often this is due not to theoretical reasons but to some 'bad' observations.

3.5 Properties of robust estimators

3.5.1 Efficiency

To compute the efficiency of the OBRE at the model F_θ, we used the algorithm presented in section 3.3. We first fixed the value of the bound c and calculated the efficiency of the estimator for different values of the parameters (we did this for the Gamma distribution and the Pareto law). We found a constant value for the efficiency, depending only on the fixed value of the bound c. In a second step, we fixed the value of the parameters and calculated the efficiency when the bound c varies. As expected, we found an efficiency increasing with the bound c. Figures 3.6 and 3.7 give the plots of the efficiency versus the bound c for the Gamma distribution (with α = 3.0 and λ = 1.0) and for the Pareto law (with α = 3.0 and x_0 = 0.5) respectively. For the Gamma distribution, we can see that the OBRE has 95% efficiency for a bound equal to about 4. For a bound equal to 1.5, we gain robustness but the asymptotic efficiency falls below 60%. For the Pareto distribution, the OBRE has 95% efficiency for a bound equal to about 3. It has less than 75% efficiency when the bound is equal to 1.5.

3.5.2 Sensitivity

In this subsection we study the sensitivity of the estimators to the distance of an outlier from the underlying model. This can be measured

Figure 3.6: Efficiency of the OBRE for the Gamma model


Figure 3.7: Efficiency of the OBRE for the Pareto model

by means of the sensitivity curve (see (3.7)). For the OBRE, we know that the bias is bounded and that this bound depends on the choice of c and on the percentage ε of contamination present in the data. Therefore, contrary to the MLE, the OBRE should not be influenced by the distance between the bulk of the data and the outlier, but only by the percentage of outliers. To measure the sensitivities of the MLE and the OBRE, we took 4% of the observations and varied them from their 'true' value to a very large value. In figure 3.8, we computed the MLE and the OBRE of the parameter α of the Pareto distribution. The sample (of 200 observations) is generated from the Pareto distribution with parameter values α = 3 and x_0 = 0.5. The contaminated model is 0.96 F_{α,x_0} + 0.04 F_{α,k·x_0}, where k = 2, 3, ..., 10. As expected, the bias of the MLE increases with the distance (measured by k). For c = 2.0 the OBRE has a constant value when the distance increases (it is not influenced at all by 4% of contamination).


Figure 3.8: Sensitivity of the MLE and the OBRE to outliers for the Pareto model

3.5.3 Breakdown point

In this section we try to determine qualitatively the breakdown point of an OBRE. We could also find it analytically, but we only want to get an idea of what the breakdown point of an OBRE could be. Since the breakdown point measures the minimum proportion of the worst contamination (introduced in the model) for which the estimator becomes unbounded, we should find the worst type of contamination in order to compute the breakdown point. The contamination type we have considered until now is probably not the worst type, because it in fact lowers the value of the estimators; as these estimators are positive real valued, their bias cannot become infinite. Nevertheless, since our aim is just to make a qualitative analysis of the breakdown point of the OBRE, we only study the bias (T − θ) of the estimators (with one sample) when the percentage of contamination varies, for one income distribution and one income inequality measure.

In a first step, we studied the bias of the estimators of the Pareto distribution when a fixed value of contamination is introduced in the data. Thus, we generated a sample of 1000 observations from the Pareto distribution with parameter values α = 3.0 and x_0 = 0.5 and built the contamination model

(1 − p/n) F_{α,x_0} + (p/n) F_{α,10000x_0}

where p/n is a fixed proportion. It is an extreme type of contamination, but the aim is to observe the effect on the bias of the estimators in such an extreme situation. In figure 3.9 we plot the bias of the different estimators (MLE and OBRE with different bounds c) versus the proportion p/n of contamination. The MLE is immediately influenced by a small amount of contamination. The OBRE with relatively large bounds c are also badly influenced by small amounts of contamination, but their bias is smaller than that of the MLE. The most B-robust estimator (corresponding to c = 1.5) has a reasonable bias up to about 5% of contamination. We also studied the bias of the estimators for the Gamma distribution and found the same results as with the Pareto distribution.

In a second step, since the impact of contamination on an income inequality measure is easier to interpret, we studied the behaviour of the Theil index when the percentage of contamination varies. To build figure 3.10, we computed the estimators of α from a sample of 1000 observations generated by the Pareto distribution with parameter values


Figure 3.9: Sensitivity of MLE and OBRE to different proportions of contamination

α = 3 and x_0 = 0.5, when the model is contaminated as

(1 − p_n) F_{α,x_0} + p_n F_{α,10x_0}

We then calculated the Theil index and plotted it versus the percentage p_n of contamination. As expected, the Theil index corresponding to the MLE is very sensitive to a small amount of contamination; it has a very small breakdown point. On the other hand, the Theil index corresponding to the OBRE is relatively stable (the bias does not increase rapidly with the percentage of contamination) when the bound c is small. Even if the value of the Theil index increases, it increases only slightly compared to the index corresponding to the MLE. These breakdown points can be improved by using a high breakdown estimator (see Rousseeuw 1984, Rousseeuw and Leroy 1987) as starting point for the computation of the B-robust estimator and then performing only one Newton-Raphson step in the direction of the OBRE (see Simpson, Ruppert and Carroll 1992). However, in our examples, this is still an open problem.


Figure 3.10: Bias of Theil index estimates when the data are contaminated

Chapter 4

Optimal B-Robust Estimators with Incomplete Information

4.1 The EM algorithm

4.1.1 Introduction

Income data are typically truncated. Truncation occurs either in the low incomes because of taxation rules, or in the high incomes for confidentiality reasons. To compute the MLE (and later the OBRE) in this situation we use the EM algorithm. This algorithm was developed by Dempster, Laird and Rubin (1977) in order to calculate the MLE iteratively with censored data. Since each iteration of the algorithm consists of an expectation step followed by a maximization step, Dempster, Laird and Rubin (1977) called it the EM algorithm.

The term 'censored data' in its general form implies the existence of two sample spaces Y and X and a mapping from X to Y. The observed data y = (y_1, ..., y_n)^T are a realization from Y. The corresponding x in X is not observed directly, but only indirectly through y. More specifically, it is assumed that there is a mapping x → y(x) from X to Y, and that x is known only to lie in X(y), the subset of X determined by the equation y = y(x), where y is the observed data.

How are the complete and incomplete densities related? Let f(x; θ) be the sampling density corresponding to the complete-data specification. Then we obtain its corresponding incomplete-data specification g(y; θ) by

g(y; θ) = ∫_{X(y)} f(x; θ) dx    (4.1)

The EM algorithm is directed at finding a value of θ which maximizes g(y; θ) given an observed y, but it does so by making essential use of the associated family f(x; θ).

4.1.2 Definition of the EM algorithm

As Dempster, Laird and Rubin (1977) did, we first restrict ourselves to the case of an exponential family, to which most PID models belong. However, as we will see further on, the theory applies at the most general level. Suppose first that f(x; θ) has the regular exponential-family form

f(x; θ) = b(x) exp[t(x)θ] / d(θ)    (4.2)

where t(x) denotes a 1 × p vector of complete-data sufficient statistics (p = dim(θ)) and d(θ) is a constant which makes the integral of f(x; θ) equal to unity:

d(θ) = ∫_X b(x) exp[t(x)θ] dx    (4.3)

For instance, for the Gamma distribution we have

b(x) = Π_i x_i^{-1}    (4.4)

d(θ) = [λ^α / Γ(α)]^{-n}    (4.5)

t(x) = (Σ_i log(x_i), −Σ_i x_i)    (4.6)

Intuitively, when one has to estimate the parameters of a model with censored data, one would try to complete the information through the estimation procedure. In a first step, one can compute the MLE ignoring the fact that the data set is censored. In a second step, as these estimators depend on statistics calculated by means of the censored data, one has to transform these statistics in order to push them towards the values they would have taken if the data set were complete. The EM algorithm finds these values by computing the mathematical expectation of the corresponding statistics, conditionally on the data and the estimates of the previous step, until convergence.

Thus, in the framework of exponential families, the characterization of the EM algorithm can be written as follows. Suppose that θ^{(k)} denotes the current value of θ after k cycles of the algorithm. The next cycle can be described in two steps, as follows:

• E-step: Estimate the complete-data sufficient statistics t(x) by finding

t^{(k)} = E[t(x) | y, θ^{(k)}]    (4.7)

• M-step: Determine θ^{(k+1)} as the solution of the equation

E[t(x) | θ] = t^{(k)}    (4.8)

The last equations are the familiar form of the likelihood equations for the MLE given data from a regular exponential family. That is, if we were to suppose that t^{(k)} represents the sufficient statistics computed from an observed x obtained in the E-step, then the M-step defines the MLE of θ.

4.1.3 Discussion and example

Formal convergence properties of the EM algorithm are given (in the general case) in the paper of Dempster, Laird and Rubin (1977). The latter also explain why repeated application of the E- and M-steps leads ultimately to the value θ̂ of θ that maximizes log g(y; θ). They also show why the MLE obtained when one considers the conditional density (CD) leads to the same estimator as the one proposed by the EM algorithm. However, this is not true for other estimates such as robust estimates (see section 4.2).

As an example, we take the case of data truncated below a minimum value s, generated by a Gamma density. This example is commonly encountered in the study of PID. The relation between the complete and the incomplete data set can be written: for x > s, y(x) = x (observed); for x ≤ s, x is not observed. The EM cycle is then:

• E-step: Estimate the complete-data sufficient statistics t(x) by finding t^{(k)} = E[t(x) | y, θ^{(k)}]

and hence,

t^{(k)} = ∫_0^s t(x) f(x; θ^{(k)}) dx + (1 − F_{θ^{(k)}}(s)) t(y)    (4.10)

• M-step: Determine θ^{(k+1)} as the solution of the equation

E[t(x) | θ] = t^{(k)}    (4.11)

For the Gamma distribution F_{α,λ}, the first-order conditions of the maximization of the log-likelihood are given by the two following equations:

λ = α / x̄    (4.12)

log(α) − Γ̃(α) = log(x̄) − log(x̃)    (4.13)

where Γ̃(α) is the digamma function, x̄ = (1/n) Σ_i x_i is the arithmetic mean and x̃ = (Π_i x_i)^{1/n} is the geometric mean. As the equations defining the MLE depend on x̄ and x̃, we calculate their mathematical expectations in the E-step. For x̄, we have

E[x̄ | y, α, λ] = ∫_0^s x f(x; α, λ) dx + (1 − F_{α,λ}(s)) ȳ    (4.14)

where ȳ is the observed arithmetic mean. For log(x̃) = (1/n) Σ_i log(x_i), we have

E[log(x̃) | y, α, λ] = ∫_0^s log(x) f(x; α, λ) dx + (1 − F_{α,λ}(s)) log(ỹ)    (4.15)

where ỹ is the observed geometric mean. Hence, to find the MLE of the Gamma density when the data are truncated at a minimum value s, we begin by calculating the estimators with the equations defining the first-order conditions (calculating the arithmetic mean and the logarithm of the geometric mean with the observed data), and then transform these statistics in the E-step and recalculate the estimators in the M-step. The algorithm converges very fast and is easy to compute.
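This Gamma example can be sketched in a few lines (an illustrative stdlib implementation of ours, not the thesis's code): the E-step integrals over [0, s] and the truncation probability F_{α,λ}(s) are evaluated by Simpson's rule, and the M-step solves (4.13) by bisection.

```python
import math, random

def digamma(a, h=1e-5):
    # Central-difference approximation of d/da log Gamma(a).
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)

def gamma_pdf(x, alpha, lam):
    return math.exp(alpha * math.log(lam) - math.lgamma(alpha)
                    + (alpha - 1.0) * math.log(x) - lam * x)

def simpson(f, a, b, n=400):
    # Composite Simpson's rule (n even).
    h = (b - a) / n
    total = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h) for i in range(1, n))
    return total * h / 3.0

def solve_alpha(target):
    # Solve log(a) - digamma(a) = target; the left side is decreasing in a.
    lo, hi = 0.2, 1e3
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def em_truncated_gamma(ys, s, n_iter=60):
    # ys: observed incomes (all > s); incomes below s are never observed.
    ybar = sum(ys) / len(ys)                       # observed arithmetic mean
    logy = sum(math.log(y) for y in ys) / len(ys)  # log of observed geometric mean
    alpha = solve_alpha(math.log(ybar) - logy)     # start: MLE ignoring truncation
    lam = alpha / ybar
    for _ in range(n_iter):
        # E-step, (4.14)-(4.15): complete the sufficient statistics.
        Fs = simpson(lambda x: gamma_pdf(x, alpha, lam), 1e-9, s)
        xbar = simpson(lambda x: x * gamma_pdf(x, alpha, lam), 1e-9, s) \
               + (1.0 - Fs) * ybar
        logx = simpson(lambda x: math.log(x) * gamma_pdf(x, alpha, lam), 1e-9, s) \
               + (1.0 - Fs) * logy
        # M-step, (4.12)-(4.13): recompute the estimates from completed statistics.
        alpha = solve_alpha(math.log(xbar) - logx)
        lam = alpha / xbar
    return alpha, lam

# Truncated sample: Gamma(3, 1) incomes, only those above s = 1 are observed.
rng = random.Random(11)
s = 1.0
ys = [x for x in (rng.gammavariate(3.0, 1.0) for _ in range(20000)) if x > s]
alpha_hat, lam_hat = em_truncated_gamma(ys, s)
```

At the true parameter values the completed statistics equal the full-sample expectations, so the true (α, λ) is (asymptotically) a fixed point of the cycle, which is why the iterates recover parameters near (3, 1) despite the missing lower tail.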

4.1.4 Generalization

The final generalization proposed by Dempster, Laird and Rubin (1977) omits all reference to exponential families and concerns the MLE. They introduced the function

    Q(θ* | θ) = E[log f(x; θ*) | y, θ]    (4.16)

which is assumed to exist for all pairs (θ*, θ). The EM iteration θ^(k) → θ^(k+1) is then defined as follows:

• E-step: Compute Q(θ | θ^(k)).

• M-step: Choose θ^(k+1) to be a value of θ which maximizes Q(θ | θ^(k)).

The heuristic idea here is that we would like to choose θ̂ to maximize log f(x; θ). Since we do not know log f(x; θ), we instead maximize its current expectation given the data y and the current fit θ^(k). In the next section, we will generalize (4.16) by replacing log f(x; θ) (which corresponds to the MLE) with a more general function ρ(x; θ) (for robust estimators).

To conclude, note that in their paper, Dempster, Laird and Rubin (1977) gave a heuristic illustration of the convergence of the EM algorithm in the case of exponential families of distributions, by showing that the EM algorithm solves the same problem as the MLE with the CD. In the general case, i.e. when f(·; θ) is not restricted to the class of exponential distributions, we can also illustrate it. We can rewrite the EM algorithm in the following way:

• E-step: Compute E[s(x; θ) | y, θ^(k)]

• M-step: Choose θ^(k+1) as the solution in θ of E[s(x; θ) | y, θ^(k)] = 0

where

    s(x; θ) = (∂/∂θ) log f(x; θ)    (4.17)

With the CD, for any density function, the first-order conditions of the maximization problem are given by

    (∂/∂θ) log f(y; θ) − n (∂/∂θ) log k(θ) = 0    (4.18)

where

    k(θ) = ∫_{X(y)} f(x; θ) dx    (4.19)

This is equivalent to

    k(θ) Σ_{i=1}^n s(y_i; θ) − n ∫_{X(y)} s(x; θ) f(x; θ) dx = 0    (4.20)

or

    n ∫_{X̄(y)} s(x; θ) f(x; θ) dx + k(θ) Σ_{i=1}^n s(y_i; θ) = 0    (4.21)

where X̄(y) = X \ X(y). Thus, we see that the solution for θ given by the EM algorithm is equal to the MLE when one considers the CD (we will call this procedure CD estimation).

4.2 Optimal B-robust estimators with incomplete information

4.2.1 The EMM algorithm

In this subsection we extend the EM algorithm to M-estimators and compare this approach with CD estimation. We begin with the EM algorithm. An M-estimator is defined as the solution in θ of

    Σ_{i=1}^n ψ(x_i; θ) = 0    (4.22)

where ψ can be any function. The estimators we consider have the property of being Fisher consistent, that is

    ∫ ψ(x; θ) dF_θ(x) = 0    (4.23)

In many examples, the ψ-function is a function K of the scores, so we can write

    ψ(x; θ) = K[s(x; θ)]    (4.24)

As examples we have

    K[s(x; θ)] = s(x; θ)    (4.25)

for the MLE, and

    K[s(x; θ)] = h_c(A[s(x; θ) − a])    (4.26)

for the OBRE. To apply the EM algorithm, we need the generalized form, and we use the same transformation in the E-step and the M-step as we used for the MLE (i.e. differentiating the expressions with respect to the parameter). Thus, the solution for the M-estimator is given by the following algorithm:

For a given θ^(k), find the solution in θ of

    Σ_{i=1}^n E[ψ(x_i; θ) | y, θ^(k)] = 0

⇔

    n ∫_{X̄(y)} ψ(x; θ) f(x; θ^(k)) dx + k(θ) Σ_{i=1}^n ψ(y_i; θ) = 0    (4.27)

We will call this algorithm the EMM algorithm, since it is based on the idea of the EM algorithm extended to M-estimators. We can interpret the idea of the EMM algorithm in the same way as that of the EM algorithm for the MLE: the value of the ψ-function on the missing observations is replaced by its mathematical expectation. The EMM algorithm thus completes the information through the estimation procedure.

For the CD estimation approach, we first give the following definition. We denote by s̃(·; θ) the derivative of the logarithm of the conditional density with respect to the parameter vector, so that

    s̃(x; θ) = s(x; θ) − [(∂/∂θ) k(θ)] / k(θ)    (4.28)

We also have the following property:

    (∂/∂θ) k(θ) = ∫_{X(y)} s(x; θ) f(x; θ) dx = − ∫_{X̄(y)} s(x; θ) f(x; θ) dx    (4.29)

We then define ψ̃(·; θ) as the function corresponding to the M-estimator with the conditional density. We can write

    ψ̃(x; θ) = K[s̃(x; θ)] = K[s(x; θ) − ((∂/∂θ) k(θ)) / k(θ)]    (4.30)

An M-estimator is then defined in that case by

    Σ_{i=1}^n ψ̃(y_i; θ) = 0    (4.31)

The conclusion is that the two approaches are equivalent if and only if K[s(·; θ)] = B · s(·; θ), i.e. in the case of the MLE. In the next subsection, we apply the EMM algorithm to B-robust estimators. We will see that in that case we can give an interpretation of the difference between the two approaches.
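To make the fixed-point equation (4.27) concrete, here is a hypothetical one-parameter illustration of our own construction (not from the thesis): a normal location model N(θ, 1) with a Huber ψ-function, data truncated below s. The value of ψ over the unobserved region is replaced by its expectation under the current model, exactly as in (4.27) after division by n; SciPy is assumed for integration and root finding.

```python
import numpy as np
from scipy import integrate, optimize, stats

def huber_psi(r, c=1.345):
    # Huber's psi: identity in the centre, clipped at +/- c
    return np.clip(r, -c, c)

def emm_location(y, s, c=1.345):
    """Solve the EMM fixed-point equation (4.27), divided by n, for a N(theta, 1)
    location model when the data y are truncated below s."""
    def g(theta):
        # expectation of psi over the unobserved region (-inf, s)
        missing = integrate.quad(
            lambda x: huber_psi(x - theta, c) * stats.norm.pdf(x - theta),
            -np.inf, s)[0]
        k = 1.0 - stats.norm.cdf(s - theta)   # probability of the observed region
        return missing + k * np.mean(huber_psi(y - theta, c))
    # bracket around the sample median to select the consistent root
    m = np.median(y)
    return optimize.brentq(g, m - 1.0, m + 3.0)
```

At the true θ the two terms add up to the full-model expectation of ψ, which is zero by Fisher consistency at the complete model, so the equation has the right root; the Huber clipping keeps gross errors in y from dominating the sum.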

4.2.2 The EMM algorithm and robust estimators

As we saw above, the EMM algorithm and CD estimation do not always give the same estimators. With the OBRE, we can expect the two approaches to differ because the underlying statistical model depends on the chosen approach. With CD estimation, the underlying model is transformed: we no longer consider the density over the complete data set but the density over the incomplete data set. With the EMM algorithm, we keep the initial model (the density over the complete data set) by replacing the truncated interval with its corresponding mathematical expectation. However, the empirical results will show that with the OBRE the two approaches give nearly the same estimates, at least for low truncation points. The difference between the two approaches is due to the computed weights on the observations, which depend on the underlying model.

As we have seen before, for a given θ^(k) the EMM algorithm attempts to find the solution in θ of (4.27). To do this, we need two iterative steps: the external step, which transforms θ^(k) into θ^(k+1), and the internal step, which finds the solution of (4.27). At the end of the iterative procedure, we find θ̂ such that θ̂ = θ^(k) = θ^(k+1). Thus the EMM procedure is equivalent to the following problem:

Find θ̂ as the solution of

    (1/n) Σ_{i=1}^n E[ψ(x_i; θ) | y] = 0    (4.32)

The advantage of this formulation is that it needs fewer iterations, and thus convergence is faster. We can write

    (1/n) Σ_{i=1}^n E[ψ(x_i; θ) | y] = ∫_{X̄(y)} ψ(x; θ) dF_θ(x) + k(θ) (1/n) Σ_{i=1}^n ψ(y_i; θ)
                                     = (1/n) Σ_{i=1}^n ψ̃(y_i; θ)    (4.33)

In order to find the solution of this equation, we develop a Newton–Raphson step which can also be applied in this case. Making the following approximation (justified by the fact that the mean over the sample can be approximated by the integral over the CD and that we want the M-estimator to be Fisher consistent at the complete model, i.e. ∫_X ψ(x; θ) dF_θ(x) = 0),

    E[(∂/∂θ) (1/n) Σ_{i=1}^n ψ(x_i; θ) | y] ≈ − ∫_X ψ(x; θ) s(x; θ)^T dF_θ(x)    (4.34)

the Newton–Raphson step Δθ is given by

    Δθ = [∫_X ψ(x; θ) s(x; θ)^T dF_θ(x)]^{-1} (1/n) Σ_{i=1}^n ψ̃(y_i; θ)    (4.35)

The complete algorithm for computing the standardized OBRE with incomplete information, when the EMM algorithm is applied, is given by the following five steps:

• Step 1: As initial value for the parameter, take the MLE, θ = T_MLE. Fix also a precision threshold ε.

• Step 2: Take as initial values for the matrix A and the vector a

    A = (J^{1/2}(θ))^{-T},    a = 0

• Step 3: Determine A and a iteratively from

    (A^T A)^{-1} = M_2(θ),    a = [∫_X s(x; θ) W_c(x; θ) dF_θ(x)] / [∫_X W_c(x; θ) dF_θ(x)]

• Step 4: Compute M_1(θ) and

    Δθ = M_1^{-1}(θ) A^{-1} (1/n) Σ_{i=1}^n ψ̃(y_i; θ)

with

    W_c(x; θ) = min{1; c / ‖A[s(x; θ) − a]‖}

and

    M_k(θ) = ∫_X [s(x; θ) − a][s(x; θ) − a]^T W_c^k(x; θ) dF_θ(x)

• Step 5: If ‖Δθ‖ ≥ ε, set θ^(k+1) = θ^(k) + Δθ and return to step 2; else stop.

The function ψ is defined for the OBRE in the standardized case by (3.34) and the following equations.
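For intuition, steps 2 and 3 can be illustrated in a simplified one-parameter setting. The sketch below is our own reduction, not the thesis's computations: a Gamma model with λ = 1 held fixed, so that θ = α, the score is s(x; α) = log x − Ψ(α), and A and a are scalars; SciPy is assumed for the special functions and the integrals.

```python
import numpy as np
from scipy import integrate, special, stats

def obre_weights(alpha=3.0, c=2.0, n_iter=30):
    """Iterate steps 2-3 of the algorithm for a Gamma(alpha, 1) model with
    theta = alpha as the only parameter; return the converged weight function W_c."""
    dist = stats.gamma(alpha)
    def score(x):                                  # s(x; alpha) = log x - digamma(alpha)
        return np.log(x) - special.digamma(alpha)
    A = special.polygamma(1, alpha) ** -0.5        # step 2: A = J(theta)^{-1/2}
    a = 0.0                                        # step 2: a = 0
    for _ in range(n_iter):
        def W(x, A=A, a=a):                        # freeze the current A and a
            return np.minimum(1.0, c / (np.abs(A * (score(x) - a)) + 1e-12))
        num = integrate.quad(lambda x: score(x) * W(x) * dist.pdf(x), 0, np.inf)[0]
        den = integrate.quad(lambda x: W(x) * dist.pdf(x), 0, np.inf)[0]
        a = num / den                              # step 3: update a
        M2 = integrate.quad(lambda x: (score(x) - a) ** 2 * W(x) ** 2 * dist.pdf(x),
                            0, np.inf)[0]
        A = M2 ** -0.5                             # step 3: (A^T A)^{-1} = M_2(theta)
    return lambda x: np.minimum(1.0, c / np.abs(A * (score(x) - a)))
```

The resulting weight function has the shape shown in figures 4.1 and 4.2: observations near the centre of the Gamma(3, 1) model keep weight 1, while observations in the tails are downweighted in proportion to the size of their standardized score.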

4.2.3 Comparison with the classical approach

The application of the optimality theorem to the incomplete information case is straightforward: we only have to replace the complete density by the conditional density. Thus, the standardized OBRE is given by

    (1/n) Σ_{i=1}^n ψ̃(y_i; θ) = (1/n) Σ_{i=1}^n [s̃(y_i; θ) − a] W̃_c(y_i; θ) = 0    (4.36)

where by definition

    s̃(y; θ) = s(y; θ) + (1/k(θ)) ∫_{X̄(y)} s(x; θ) dF_θ(x)    (4.37)

and

    W̃_c(y; θ) = min{1; c / ‖A[s̃(y; θ) − a]‖}    (4.38)

The matrix A and the vector a are implicitly defined by

    (1/k(θ)) ∫_{X(y)} ψ̃(x; θ) ψ̃(x; θ)^T dF_θ(x) = I    (4.39)

    (1/k(θ)) ∫_{X(y)} ψ̃(x; θ) dF_θ(x) = 0    (4.40)

It is difficult to compare the two approaches by looking at the equations defining the problems. However, we can notice that in the EMM approach the OBRE is consistent at the complete distribution, whereas in the CD estimation approach the OBRE is consistent at the incomplete distribution. Thus the two solutions will in general be different, and the reason for their difference can be seen in the weights (see below). This result is not surprising: equivalent classical procedures often become different when they are robustified (see e.g. Héritier and Ronchetti 1993 for robust parametric tests). We can also observe that if we choose a bound c large enough, the two approaches give the same result; in fact, in this case we are using the MLE.

Empirically, we computed the difference in the weights calculated with the two approaches. We computed the matrix A and the vector a when the distribution is a Gamma distribution with parameters α = 3 and λ = 1, and we chose c = 2.0. In order to make this difference as clear as possible, we chose two different truncation points. In figure 4.1, we can see that with a loss of 10% of the information, the difference in the computation of the weights by the two approaches is very small.

[Figure 4.1: Weights given by the OBRE with 10% of information loss. The weights (between 0 and 1) are plotted against the observation x; solid line: EM(M) algorithm, dotted line: conditional approach.]

In figure 4.2, we intentionally took a high truncation point (30% of information loss) and observed that the difference between the two approaches is larger. In fact, the weights associated with the EMM algorithm remain the same even if we choose a truncation point corresponding to more than 50% of information loss; the difference is thus due to the weights calculated with the CD estimation. This confirms the hypothesis we made in the introduction: the underlying model modifies the computation of the weights and hence, when the model is contaminated, also the OBRE.

[Figure 4.2: Weights given by the OBRE with 30% of information loss. Same legend as figure 4.1.]

The question which then arises is: which approach should one choose? In income distribution models we know that the unreported lower incomes really do exist. It is therefore more realistic to postulate that the true model is the complete one and to choose the EMM algorithm to calculate the estimators. Moreover, since incomplete information does not only correspond to the case of truncated data, the EMM algorithm is more widely applicable than the CD approach. However, as we will see, the differences in the estimates between the two approaches are small, at least in the case of a truncation corresponding to an information loss of less than 10%.

4.3 Empirical results

4.3.1 Robust estimates

In order to compare the standardized OBRE under the two approaches, we simulated 50 samples of 200 observations generated by a Gamma distribution with parameter values α = 3.0 and λ = 1.0. In a second step, we contaminated the model with the mixture model (1 − ε)F_{α,λ} + εF_{α,0.1λ}, taking ε = 1% and ε = 3%. We also truncated the data below a minimum value s, corresponding to an information loss of between 0.8% and 8%.

As expected, in the non-contaminated case, since the weights on the observations are nearly all equal to unity, the two approaches give the same estimators, which match the simulation values for the parameters (see table 4.1).

Truncation      α, λ   OBRE (EMM, c = 2.0)   MSE    Theil index   OBRE (CD, c = 2.0)   MSE    Theil index
s = 1.0 (8%)    α      3.12 (0.07)           0.24   0.152         3.12 (0.06)          0.22   0.152
                λ      1.03 (0.02)           0.03                 1.03 (0.02)          0.02
s = 0.8 (4.7%)  α      3.10 (0.06)           0.21   0.153         3.10 (0.06)          0.17   0.153
                λ      1.03 (0.02)           0.02                 1.03 (0.02)          0.02
s = 0.6 (2.3%)  α      3.09 (0.06)           0.17   0.153         3.09 (0.05)          0.14   0.153
                λ      1.02 (0.02)           0.02                 1.03 (0.02)          0.02
s = 0.4 (0.8%)  α      3.07 (0.05)           0.12   0.154         3.06 (0.04)          0.10   0.155
                λ      1.02 (0.02)           0.02                 1.02 (0.02)          0.01

Table 4.1: OBRE on non-contaminated data, with the EMM algorithm and the CD estimation

In the case of 1% contamination (see table 4.2), we cannot say that the estimators are different; only the CPU time differs, the CD estimation being about 70% faster than the EMM algorithm. In the case of 3% contamination, the difference is still small (see table 4.3). The reason is that, as we saw in the previous section, the difference between the two approaches becomes clear only when the loss of information is relatively large.

Looking at the MSE of the estimators, we can first say that it is larger than in the complete information case. With the non-contaminated data, since the bias is equal to zero, the MSE measures the variability of the estimator, which is larger when the truncation point is higher. In the contaminated cases, the MSEs have roughly the same values as in the non-contaminated case and, moreover, they are the same when we compare the estimators under the two approaches.

4.3.2 Comparison with the MLE

In this subsection we compare the OBRE to the MLE. For the reasons presented above, we chose the EMM algorithm for the OBRE (for the MLE, both approaches lead to the same results). We took the same samples as before and contaminated 1% of the highest observations of each sample by multiplying them by 10. The results are very surprising. First, as we can see in table 4.4, when the samples are not contaminated, the OBRE and the MLE give the expected results, with a

Truncation      α, λ   OBRE (EMM, c = 2.0)   MSE    Theil index   OBRE (CD, c = 2.0)   MSE    Theil index
s = 1.0 (8%)    α      3.07 (0.07)           0.24   0.154         3.11 (0.07)          0.24   0.152
                λ      1.02 (0.02)           0.03                 1.03 (0.02)          0.03
s = 0.8 (4.7%)  α      3.07 (0.06)           0.21   0.154         3.08 (0.07)          0.22   0.154
                λ      1.02 (0.02)           0.02                 1.03 (0.02)          0.03
s = 0.6 (2.3%)  α      3.06 (0.06)           0.17   0.155         3.08 (0.06)          0.18   0.154
                λ      1.02 (0.02)           0.02                 1.02 (0.02)          0.02
s = 0.4 (0.8%)  α      3.06 (0.05)           0.12   0.155         3.05 (0.05)          0.12   0.155
                λ      1.02 (0.02)           0.02                 1.02 (0.02)          0.02

Table 4.2: OBRE on data contaminated at 1%, with the EMM algorithm and the CD estimation

Truncation      α, λ   OBRE (EMM, c = 2.0)   MSE    Theil index   OBRE (CD, c = 2.0)   MSE    Theil index
s = 1.0 (8%)    α      2.59 (0.06)           0.31   0.181         2.64 (0.06)          0.30   0.178
                λ      0.85 (0.02)           0.04                 0.87 (0.02)          0.04
s = 0.8 (4.7%)  α      2.65 (0.05)           0.26   0.177         2.69 (0.05)          0.24   0.175
                λ      0.86 (0.02)           0.03                 0.88 (0.02)          0.03
s = 0.6 (2.3%)  α      2.71 (0.05)           0.20   0.173         2.73 (0.05)          0.19   0.172
                λ      0.88 (0.02)           0.03                 0.89 (0.02)          0.03
s = 0.4 (0.8%)  α      2.75 (0.04)           0.15   0.171         2.75 (0.04)          0.15   0.171
                λ      0.90 (0.02)           0.02                 0.90 (0.01)          0.02

Table 4.3: OBRE on data contaminated at 3%, with the EMM algorithm and the CD estimation

Truncation      α, λ   OBRE (EMM, c = 2.0)   MSE    Theil index   MLE            MSE    Theil index
s = 1.0 (8%)    α      3.12 (0.07)           0.24   0.152         3.13 (0.06)    0.18   0.151
                λ      1.03 (0.02)           0.03                 1.04 (0.02)    0.02
s = 0.8 (4.7%)  α      3.10 (0.06)           0.21   0.153         3.09 (0.05)    0.15   0.153
                λ      1.03 (0.02)           0.02                 1.03 (0.02)    0.02
s = 0.6 (2.3%)  α      3.09 (0.06)           0.17   0.153         3.08 (0.05)    0.13   0.154
                λ      1.02 (0.02)           0.02                 1.02 (0.02)    0.01
s = 0.4 (0.8%)  α      3.07 (0.05)           0.12   0.154         3.04 (0.04)    0.09   0.156
                λ      1.02 (0.02)           0.02                 1.01 (0.02)    0.01

Table 4.4: OBRE and MLE on non-contaminated data, with the EMM algorithm

comparable MSE. On the other hand, when we introduce the contamination (see table 4.5), the behaviour of the MLE is catastrophic: the estimates are even worse than those of the non-truncated case (see table 3.3).

Truncation      α, λ   OBRE (EMM, c = 2.0)   MSE    Theil index   MLE            MSE    Theil index
s = 1.0 (8%)    α      3.07 (0.07)           0.24   0.154         0.61 (0.00)    5.71   0.579
                λ      1.02 (0.02)           0.03                 0.25 (0.00)    0.57
s = 0.8 (4.7%)  α      3.07 (0.06)           0.21   0.154         0.61 (0.00)    5.51   0.579
                λ      1.02 (0.02)           0.02                 0.23 (0.00)    0.59
s = 0.6 (2.3%)  α      3.06 (0.06)           0.17   0.155         0.61 (0.00)    5.71   0.579
                λ      1.02 (0.02)           0.02                 0.22 (0.00)    0.61
s = 0.4 (0.8%)  α      3.06 (0.05)           0.12   0.155         0.65 (0.02)    5.60   0.579
                λ      1.02 (0.02)           0.02                 0.21 (0.01)    0.62

Table 4.5: OBRE and MLE on data contaminated at 1%, with the EMM algorithm

Another interesting comparison between the OBRE and the MLE arises when we ignore the loss of information. The missing information can also be viewed as a contamination, since the truncation transforms the statistics in the score function (the arithmetic and geometric means for the Gamma distribution). The results presented in table 4.6 show that the MSE of the OBRE is smaller than the MSE of the MLE for the highest truncation point, but there seems to be no significant difference between the two approaches.

Truncation      α, λ   OBRE (EMM, c = 2.0)   MSE    Theil index   MLE            MSE    Theil index
s = 1.0 (8%)    α      3.84 (0.08)           0.83   0.125         4.20 (0.05)    1.53   0.114
                λ      1.22 (0.03)           0.06                 1.31 (0.02)    0.11
s = 0.8 (4.7%)  α      3.50 (0.08)           0.36   0.136         3.80 (0.04)    0.74   0.126
                λ      1.13 (0.03)           0.03                 1.22 (0.02)    0.06
s = 0.6 (2.3%)  α      3.31 (0.07)           0.19   0.144         3.48 (0.05)    0.31   0.137
                λ      1.07 (0.03)           0.02                 1.13 (0.02)    0.03
s = 0.4 (0.8%)  α      3.12 (0.07)           0.11   0.152         3.21 (0.04)    0.12   0.148
                λ      1.02 (0.03)           0.01                 1.06 (0.02)    0.01

Table 4.6: OBRE and MLE on non-contaminated data, with the EMM algorithm, when we ignore truncation

4.3.3 Conclusion

In PID models, the data are often censored. Moreover, the observation set contains contaminations which drive the MLE to very bad values. To treat the case of incomplete information or, more particularly, of truncated data, two approaches have been developed for classical estimation methods: the EM algorithm and CD estimation.

In order to extend these two approaches to the more realistic setting in which the model is only an approximation to reality, we proposed in this chapter a new algorithm based on the idea of the EM algorithm, namely the EMM algorithm. It makes it possible to compute M-estimators with incomplete information. We showed that this algorithm is equivalent to CD estimation only in particular cases, such as the MLE. For the OBRE, as the underlying model is not the same, we have shown that the estimators corresponding to the two approaches differ, and that the difference grows with the loss of information. However, for realistic cases in which the information loss is less than 10%, we found no significant difference between the two approaches.

From a philosophical point of view, the EMM algorithm is perhaps to be preferred, because we know a priori that the censored data exist but are not available. They should not be ignored during the analysis if we want our conclusions to be relevant to the real population.

Chapter 5

Robust Estimators for Grouped Data

5.1 The problem

Although there has been considerable progress in data collection systems, personal income data are still mainly available in grouped form. This is especially the case in less developed countries. Moreover, since the original (continuous) observations are numerous, and because of confidentiality problems, it can happen that the statistical institutions only provide "censored" information, in the form of the number of economic units falling within given income classes. We call it censored because the information contained in the initial data set (thousands of observations) is shrunk to a few observations (the number of economic units in each income class and the bounds of the income classes). Typically, grouped data are a simplified representation of a large number of observations.

We consider here the case where the point of interest is the underlying distribution of these univariate data. An outlier or a model misspecification then appears first among the continuous observations and is detectable only by an analysis of the grouped data. With this kind of representation, as pointed out by Barnett (1992), an outlier is intuitively detected in a group whose corresponding probability is small while the observed frequency is relatively large. Let π_j, j = 1, ..., J, be the probabilities corresponding to the J groups and M̂_j the corresponding observed frequencies, and suppose we have n observations. If the probabilities π_j are known, we can compare them with the observed frequencies M̂_j.


Actually, Fuchs and Kennet (1980) define the 'adjusted residuals'

    t_j = √n (M̂_j − π_j) / √(π_j (1 − π_j))    (5.1)

and consider as outlying the group that yields t = max_j |t_j|. There are two main disadvantages to this simple approach. Firstly, very often the probabilities π_j are unknown and depend on the parameters of the underlying model. If these probabilities are estimated, it must be done in a robust fashion, otherwise the above measure cannot be used. Secondly, once an outlying group has been detected, what can one do with it? We cannot just remove it from the sample, as could be done with continuous observations. Therefore it is necessary to develop a method which is able, all at once, to detect outliers, to estimate the parameters of interest and to deal with outlying groups in an optimal way. The aim of this chapter is to propose such a method.

Let us now present the model. With grouped data, what we observe is the number of individuals from a given population having one and only one characteristic among J exclusive possibilities. When studying the PID, the individuals are the economic units and the characteristics are the different income groups. The basic model then takes the following form: we observe a vector [N̂_1 ... N̂_J]^T where the N̂_j, j = 1, ..., J, are the numbers of economic units having an income belonging to the class I_j, with Σ_{j=1}^J N̂_j = n, the sample size. This observed vector is a realization of the random variable N = [N_1 ... N_J]^T, which is distributed according to a multinomial distribution with true cell probabilities π = [π_1 ... π_J]^T. The probability of a realization is given by

    P(N = [N̂_1 ... N̂_J]^T) = (n! / (N̂_1! ... N̂_J!)) π_1^{N̂_1} ... π_J^{N̂_J}    (5.2)

We want to describe the distribution of the underlying continuous data by means of the parametric model F_θ. The relation with the multinomial distribution is given through the cell probabilities. We suppose that there exist J functions k_j of θ given by

    k_j(θ) = ∫_{I_j} dF_θ(x)    (5.3)

such that

    Σ_{j=1}^J k_j(θ) = Σ_{j=1}^J ∫_{I_j} dF_θ(x) = 1    (5.4)

and, when θ is the true parameter,

    k(θ) = π    (5.5)

Therefore, the multinomial model is defined by the functions k_j and θ. Note that I_1 ∪ I_2 ∪ ... ∪ I_J gives the range of the underlying continuous variable X.

Another derivation of the above multinomial distribution is as follows. Suppose N_1, ..., N_J are independent Poisson random variables with means µ_1, ..., µ_J; the conditional joint distribution of N_1, ..., N_J given that Σ_j N_j = n is given by (5.2) with π_j = µ_j / Σ_j µ_j.

The first two moments of the multinomial distribution are given by (see e.g. Johnson and Kotz 1969)

    E(N) = n [π_1 ... π_J]^T = nπ    (5.6)

    Cov(N) = n (Diag(π) − ππ^T)    (5.7)

It should be stressed that Cov(N) has rank J − 1; if one needs an inverse for Cov(N), McCullagh and Nelder (1989) propose the generalized inverse

    Cov(N)^− = Diag(1 / (n π_j))    (5.8)

We can verify that Cov(N) Cov(N)^− Cov(N) = Cov(N).

In order to be able to derive asymptotic properties of estimators T of θ, we need to approximate the multinomial distribution. Asymptotically, the multinomial distribution becomes a multivariate normal distribution (see e.g. Bishop, Fienberg and Holland 1975). Indeed, let M = n^{-1} N (M_j = n^{-1} N_j, ∀j) be the vector of frequencies, and set U_n = √n (M̂ − π). Then¹

    U_n →_d N(0, Σ)    (5.9)

where

    Σ = Diag(π) − ππ^T    (5.10)
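The generalized-inverse property Cov(N) Cov(N)^− Cov(N) = Cov(N) from (5.7)–(5.8) is easy to verify numerically; the short sketch below (our own check, with an arbitrary probability vector) does exactly that with NumPy.

```python
import numpy as np

# Multinomial moments (5.6)-(5.7) and the McCullagh-Nelder generalized inverse (5.8)
n = 100
pi = np.array([0.1, 0.2, 0.3, 0.4])
cov = n * (np.diag(pi) - np.outer(pi, pi))       # Cov(N)
ginv = np.diag(1.0 / (n * pi))                   # Cov(N)^-

# Defining property of a generalized inverse: Cov(N) Cov(N)^- Cov(N) = Cov(N)
assert np.allclose(cov @ ginv @ cov, cov)
# Cov(N) is singular, of rank J - 1
assert np.linalg.matrix_rank(cov) == len(pi) - 1
```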

It should be stressed that N̂_j and M̂_j are linked to the empirical distribution F^(n) of the underlying variable X. We know the latter only by parts, i.e. we have

    n^{-1} N̂_j = M̂_j = ∫_{I_j} dF^(n)    (5.11)

As n → ∞, F^(n) → F_θ, and thus

    lim_{n→∞} M̂_j = E(M_j) = π_j    (5.12)

¹The symbol →_d means convergence in distribution as n → ∞.
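The adjusted residuals (5.1) introduced earlier in this section are straightforward to compute. The sketch below is our own code, not the thesis's: it simulates a multinomial sample, inflates the count of a low-probability cell, and flags the group with the largest |t_j|.

```python
import numpy as np

def adjusted_residuals(counts, pi):
    """Fuchs-Kennet adjusted residuals (5.1) for observed cell counts and
    known cell probabilities pi."""
    n = counts.sum()
    m_hat = counts / n                          # observed frequencies M_j
    return np.sqrt(n) * (m_hat - pi) / np.sqrt(pi * (1.0 - pi))

rng = np.random.default_rng(0)
pi = np.array([0.05, 0.25, 0.40, 0.25, 0.05])
counts = rng.multinomial(1000, pi)
counts[0] += 60                                 # contaminate the low-probability cell
t = adjusted_residuals(counts, pi)
outlying_group = int(np.argmax(np.abs(t)))      # the group with the largest |t_j|
```

As the text notes, this diagnostic only works when the π_j are known (or robustly estimated), and it offers no guidance on what to do with the flagged group once it is found.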

5.2 Classical estimators and their influence function

5.2.1 Minimum power divergence estimators

With personal income observations, we are interested in modelling the distribution of the underlying continuous variable; therefore, we want to compute estimators of the parameter θ. Among the best known classical estimators of θ for the multinomial model is the MLE. It is given by the solution T_MLE in θ of

    G(M̂; k(θ)) = Σ_{j=1}^J M̂_j (∂/∂θ) log k_j(θ) = 0    (5.13)

Under some regularity conditions (see Birch 1964), the MLE is asymptotically normal. This is shown (see Bishop, Fienberg and Holland 1975) by making use of a useful theorem (see Birch 1964) which states that, as M → π, the solution θ(M) of G(M; k(θ)) = 0 can be expanded as

    θ(M) = θ + (M − π) Diag(π)^{-1/2} A (A^T A)^{-1} + o(‖M − π‖)    (5.14)

where A is the J × p (p = dim(θ)) matrix whose (i, l) element is

    π_i^{-1/2} (∂/∂θ_l) k_i(θ)    (5.15)

Therefore, applying this result to the MLE, we have that

    √n (T_MLE − θ) →_d N(0, (A^T A)^{-1})    (5.16)

We note that the (i, l) element of the matrix A^T A can be written as

    Σ_{j=1}^J k_j(θ) (∂/∂θ_i) log k_j(θ) (∂/∂θ_l) log k_j(θ)    (5.17)

so that A^T A is indeed the Fisher information matrix for the multinomial model defined by k(θ). The regularity conditions under which the asymptotic normality of the MLE (and, later, of a more general estimator) can be stated are the ones provided by Birch (1964). They ensure that the function k of θ is smooth, that at the true model k(θ) = π, that the number of groups is really J, and that the p parameters are identifiable.

It should be stressed that there exists a large class of asymptotically equivalent estimators. This class was introduced by Cressie and Read (1984) and Read and Cressie (1988) as the class of minimum I^λ-discrepancy estimators based on the power divergence statistic I^λ. These estimators are defined as the solution in θ which minimizes

    I^λ(M̂; k(θ)) = (1 / (λ(1 + λ))) Σ_{j=1}^J M̂_j [(M̂_j / k_j(θ))^λ − 1]    (5.18)

where −∞ < λ < ∞ is a fixed parameter. They are called MPE, for minimum power divergence estimators. By taking derivatives with respect to θ, the MPE are defined, for a fixed λ, as the solution in θ of

    G^λ(M̂; k(θ)) = Σ_{j=1}^J (M̂_j / k_j(θ))^{λ+1} (∂/∂θ) k_j(θ) = 0    (5.19)

We can see that we get the MLE when λ = 0. Cressie and Read (1984) showed that any MPE is asymptotically normal with mean θ and covariance matrix n^{-1} (A^T A)^{-1}. They use the result of Birch (1964) and define as best asymptotically normal (BAN) an estimator (denoted here θ̂) which satisfies

    θ̂ = θ + (M̂ − π) Diag(π)^{-1/2} A (A^T A)^{-1} + o_p(n^{-1/2})    (5.20)

and they show that T_MPE is BAN.
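As a numerical illustration (our own sketch, not from the thesis), an MPE can be computed by directly minimizing (5.18) for grouped data from an Exponential(θ) model, where the k_j(θ) are the interval probabilities; the class boundaries are illustrative, and λ = 2/3 is the value recommended by Cressie and Read.

```python
import numpy as np
from scipy import optimize

breaks = np.array([0.0, 0.5, 1.0, 2.0, np.inf])   # illustrative class boundaries

def cell_probs(theta):
    """k_j(theta): class probabilities under an Exponential(theta) model."""
    cdf = 1.0 - np.exp(-theta * breaks)
    cdf[-1] = 1.0
    return np.diff(cdf)

def mpe(counts, lam=2/3):
    """Minimum power divergence estimator: minimize (5.18) over theta."""
    m_hat = counts / counts.sum()
    def divergence(theta):
        k = cell_probs(theta)
        return np.sum(m_hat * ((m_hat / k) ** lam - 1.0)) / (lam * (1.0 + lam))
    return optimize.minimize_scalar(divergence, bounds=(1e-3, 20.0),
                                    method="bounded").x
```

Different choices of λ trade off the weight given to cells where M̂_j and k_j(θ) disagree, but by the BAN result above all members of the class share the same first-order asymptotics.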

5.2.2 Influence function

We now want to study the robustness properties of the MPE through its IF. In order to do that, we need to express the MPE as a functional of the distribution, which might be contaminated. If errors or model misspecification occur, they occur before grouping the observations. Hence, the perturbation is introduced in the underlying distribution, i.e. the distribution of the x's. That is to say, we suppose that the misspecified distribution, say G_ε, lies in a neighborhood of F_θ, i.e.

    G_ε = (1 − ε) F_θ + ε W    (5.21)

where W is any distribution. The estimator T_MPE of θ can then be written as a functional of the underlying distribution. Indeed, T_MPE is determined by k(θ), which in turn is determined by the underlying distribution. Therefore, we have T_MPE := T_MPE(G_ε) (for notational simplicity, we shall write

T_MPE(G_ε) := T(G_ε)). Given that the observations are grouped, the functional T(G_ε) is defined by

    0 = Σ_{j=1}^J [∫_{I_j} dG_ε / k_j(T(G_ε))]^{λ+1} (∂/∂θ) k_j(θ) |_{θ=T(G_ε)}
      = Σ_{j=1}^J [((1 − ε) k_j(θ) + ε ∫_{I_j} dW) / k_j(T(G_ε))]^{λ+1} (∂/∂θ) k_j(θ) |_{θ=T(G_ε)}    (5.22)

By taking the derivative with respect to ε at ε = 0, we obtain

    0 = −(λ + 1) Σ_{j=1}^J (∂/∂θ) k_j(θ) + (λ + 1) Σ_{j=1}^J (∂/∂θ) log k_j(θ) ∫_{I_j} dW
        − (λ + 1) Σ_{j=1}^J (1/k_j(θ)) (∂/∂θ) k_j(θ) (∂/∂θ^T) k_j(θ) (∂/∂ε) T(G_ε) |_{ε=0}
        + Σ_{j=1}^J (∂²/∂θ∂θ^T) k_j(θ) (∂/∂ε) T(G_ε) |_{ε=0}    (5.23)

Thus,

    (∂/∂ε) T(G_ε) |_{ε=0} = [Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) (∂/∂θ^T) log k_j(θ)]^{-1} Σ_{j=1}^J (∂/∂θ) log k_j(θ) ∫_{I_j} dW    (5.24)

How can we interpret this result? First of all, we note that the influence on the MPE of an infinitesimal model deviation does not depend on λ; this influence is therefore the same for all the members of the class. Secondly, ∫_{I_j} dW is the frequency M̂_j under the distribution W. Of course, at the model,

∫_{I_j} dW is k_j(θ). In the extreme case, when W is the point mass 1 at an arbitrary point z, we obtain the IF of T_MPE at the model F_θ:

    IF(z; T_MPE, F_θ) = [Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) (∂/∂θ^T) log k_j(θ)]^{-1} Σ_{j=1}^J δ_j(z) (∂/∂θ) log k_j(θ)    (5.25)

where

    δ_j(z) = 1 if z ∈ I_j, 0 otherwise    (5.26)

We note that the IF is bounded because

    max_z ‖IF(z; T_MPE, F_θ)‖ = max_j ‖[Σ_{j'=1}^J k_{j'}(θ) (∂/∂θ) log k_{j'}(θ) (∂/∂θ^T) log k_{j'}(θ)]^{-1} (∂/∂θ) log k_j(θ)‖ < ∞    (5.27)

However, this maximum value, which describes the bias in the value of the MPE, can be very large. Indeed,

    (∂/∂θ) log k_j(θ) = [∫_{I_j} s(x; θ) dF_θ(x)] / k_j(θ)    (5.28)

and if k_j(θ) is small and s(x; θ) is large in the interval I_j, then the IF can be very large. It should be stressed that this happens very frequently with PID models, since the classes with low probability correspond to the lowest or highest incomes where, at the same time, the score function is large. Finally, we can verify that

    E_{F_θ}[IF(z; T_MPE, F_θ)] = [Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) (∂/∂θ^T) log k_j(θ)]^{-1} Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) = 0    (5.29)

and

    E_{F_θ}[IF(z; T_MPE, F_θ) IF^T(z; T_MPE, F_θ)]
      = [Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) (∂/∂θ^T) log k_j(θ)]^{-1} [Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) (∂/∂θ^T) log k_j(θ)] [Σ_{j=1}^J k_j(θ) (∂/∂θ) log k_j(θ) (∂/∂θ^T) log k_j(θ)]^{-T}
      = V(T_MPE)    (5.30)
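The behaviour described by (5.25)–(5.28) can be checked numerically in a one-parameter case. The sketch below is a hypothetical grouped Exponential(θ) example of our own: the IF is piecewise constant in z (one value per class), it has expectation zero under the model, and its largest value sits on the class with the smallest probability; the derivative of log k_j is taken by central differences for simplicity.

```python
import numpy as np

def if_mpe(theta, breaks, eps=1e-6):
    """IF (5.25) of the MPE for an Exponential(theta) model grouped on `breaks`:
    returns one value per class, since the IF is constant on each I_j."""
    def k(t):
        cdf = np.where(np.isinf(breaks), 1.0, 1.0 - np.exp(-t * breaks))
        return np.diff(cdf)
    kj = k(theta)
    # d/dtheta log k_j by central differences
    dlogk = (np.log(k(theta + eps)) - np.log(k(theta - eps))) / (2 * eps)
    fisher = np.sum(kj * dlogk ** 2)     # bracketed term in (5.25), scalar here
    return dlogk / fisher                # IF value on each class I_j

breaks = np.array([0.0, 0.5, 1.0, 2.0, 8.0, np.inf])
iv = if_mpe(1.0, breaks)
```

With these boundaries, the last class (8, ∞) has probability e⁻⁸ ≈ 0.0003 but score derivative −8, so it carries by far the largest influence: a bounded but very large IF, exactly the weakness of the MPE that motivates the next section.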

5.3 A general class of estimators

In the previous section we have seen that, although the IF of an MPE is bounded, it can be large. This is a property we would like to avoid. In this section, we generalize the class of estimators of the parameter θ in order to be able to build a more robust estimator, i.e. an estimator with a smaller IF. This is important because, as we will see with a numerical example, the robust estimator derived from this more general class has improved stability when the underlying model is not exact.

For continuous data, Huber (1964) defined the class of M-estimators, which generalize the MLE. With grouped data, we propose the following generalization, which defines MGP-estimators (i.e. generalized MLE with grouped data). An MGP-estimator T_ψ(M̂_1, ..., M̂_J), defined through a function ψ, is given by the solution in θ of

    Σ_{j=1}^J ψ_j(θ) M̂_j^γ = Σ_{j=1}^J ψ_j(θ) [∫_{I_j} dF^(n)(x)]^γ = 0    (5.31)

where −∞ < γ < ∞ and ψ is any function. To be able to derive some properties of MGP-estimators, we have to impose some conditions on the function ψ. For simplicity of derivation, we assume differentiability of ψ; however, the results we will show hold under a weaker condition such as piecewise differentiability. This class includes the MLE when

    ψ_j(θ) = (∂/∂θ) k_j(θ)    (5.32)

and γ = 1, and it includes the MPE when

    ψ_j(θ) = (1 / k_j(θ)^γ) (∂/∂θ) k_j(θ)    (5.33)

with γ = λ + 1.

Let us now derive the asymptotic distribution of an MGP-estimator. First, we want the MGP-estimator to be Fisher consistent, i.e. T_ψ(F_θ) = θ. That means

    Σ_{j=1}^J ψ_j(θ) [∫_{I_j} dF_θ(x)]^γ = Σ_{j=1}^J k_j(θ)^γ ψ_j(θ) = 0    (5.34)

By a Taylor series expansion of

    Σ_{j=1}^J ψ_j(T_ψ) [∫_{I_j} dF^(n)(x)]^γ    (5.35)

at T_ψ = θ, we get

    0 = Σ_{j=1}^J ψ_j(T_ψ) [∫_{I_j} dF^(n)(x)]^γ
      = Σ_{j=1}^J ψ_j(θ) [∫_{I_j} dF^(n)(x)]^γ + (T_ψ − θ) Σ_{j=1}^J [∫_{I_j} dF^(n)(x)]^γ (∂/∂θ) ψ_j(θ) + O(‖T_ψ − θ‖²)    (5.36)

Hence,

    T_ψ = θ − [Σ_{j=1}^J (∂/∂θ) ψ_j(θ) M̂_j^γ]^{-1} Σ_{j=1}^J ψ_j(θ) M̂_j^γ + O(‖T_ψ − θ‖²)    (5.37)

Expanding

    ψ_j(θ) M̂_j^γ    (5.38)

in a Taylor series around k_j(θ), we get

    ψ_j(θ) M̂_j^γ = ψ_j(θ) k_j(θ)^γ + (M̂_j − k_j(θ)) γ ψ_j(θ) k_j(θ)^{γ−1} + O(‖M̂_j − k_j(θ)‖²)    (5.39)

Summing over j gives

\[
\sum_{j=1}^J \psi_j(\theta)\,\hat M_j^{\gamma}
 = \gamma \sum_{j=1}^J (\hat M_j - k_j(\theta))\,\psi_j(\theta)\,k_j(\theta)^{\gamma-1}
 + o(\|\hat M - k(\theta)\|) \tag{5.40}
\]

the term \(\sum_j \psi_j(\theta)k_j(\theta)^{\gamma}\) vanishing by Fisher consistency (5.34). This is equal to the expression of the numerator in (5.37). For the denominator of (5.37), we have

\[
\sum_{j=1}^J \frac{\partial}{\partial\theta}\psi_j(\theta)\,\hat M_j^{\gamma}
 = \sum_{j=1}^J \frac{\partial}{\partial\theta}\psi_j(\theta)\,k_j(\theta)^{\gamma} + O(\|\hat M - k(\theta)\|)
 = -\gamma \sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma-1}\,\frac{\partial}{\partial\theta^T}k_j(\theta) + O(\|\hat M - k(\theta)\|) \tag{5.41}
\]

Knowing that \(\hat M = k(\theta) + O_p(n^{-1/2})\) and that

\[
\sqrt n\,(\hat M - k(\theta)) \xrightarrow{d} N\bigl(0,\ \mathrm{Diag}(k_j(\theta)) - k(\theta)k(\theta)^T\bigr) \tag{5.42}
\]

from (5.37) we deduce that

\[
\sqrt n\,(T_\psi - \theta) \xrightarrow{d} N(0, \Sigma(\theta)) \tag{5.43}
\]

where

\[
\Sigma(\theta) = \Bigl[\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}
 \Bigl[\sum_{j=1}^J k_j(\theta)^{2\gamma-1}\,\psi_j(\theta)\psi_j(\theta)^T\Bigr]
 \Bigl[\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-T} \tag{5.44}
\]

We can verify that if we replace ψj by the expression corresponding to MPE given in (5.33), then Σ(θ) is equal to the inverse of the Fisher information matrix.
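As a numerical illustration of this remark, the following sketch evaluates (5.44) for the MLE choice ψ_j(θ) = ∂/∂θ log k_j(θ) with γ = 1 in a one-parameter Pareto model with known scale. The class limits and parameter values are arbitrary choices for the example, not values from the thesis, and the derivative of k_j is taken numerically.

```python
import numpy as np

# One-parameter Pareto model with known scale x0; the class limits and
# the parameter value are illustrative choices.
x0, alpha, gamma = 0.5, 3.0, 1
bounds = np.array([0.5, 0.7, 1.0, 1.5, 2.5, np.inf])

def k(a):
    # class probabilities k_j(theta) from the Pareto cdf
    F = lambda x: np.where(np.isinf(x), 1.0, 1.0 - (x0 / x) ** a)
    return F(bounds[1:]) - F(bounds[:-1])

def dk(a, h=1e-6):
    # numerical derivative of k_j with respect to theta
    return (k(a + h) - k(a - h)) / (2 * h)

kj, dkj = k(alpha), dk(alpha)
psi = dkj / kj                       # MLE choice: psi_j = d/dtheta log k_j
fisher = np.sum(kj * psi ** 2)       # grouped-data Fisher information

# Sigma(theta) from (5.44), scalar case
bracket = np.sum(kj ** gamma * psi * (dkj / kj))
middle = np.sum(kj ** (2 * gamma - 1) * psi ** 2)
sigma = middle / bracket ** 2

print(sigma, 1.0 / fisher)           # the two values coincide
```

With γ = 1 and the MLE score, both bracketed sums reduce to the Fisher information, so Σ(θ) = 1/I(θ), as stated above.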

5.4 Influence function of MGP-estimators

As argued for the MPE, we need to write the estimator T_ψ as a functional T(F_θ) of the underlying distribution F_θ. If, as is realistic to suppose, the model is misspecified or there are some gross errors, then the underlying distribution actually lies in a neighborhood of the model F_θ given by

\[
\{G_\varepsilon \mid G_\varepsilon = (1-\varepsilon)F_\theta + \varepsilon W\} \tag{5.45}
\]

The estimator T_ψ is then a functional T(G_ε) of the distribution G_ε and is given by

\[
0 = \sum_{j=1}^J \psi_j(T(G_\varepsilon))\Bigl[\int_{I_j} dG_\varepsilon(x)\Bigr]^{\gamma}
  = \sum_{j=1}^J \psi_j(T(G_\varepsilon))\Bigl[(1-\varepsilon)k_j(\theta) + \varepsilon\int_{I_j} dW(x)\Bigr]^{\gamma} \tag{5.46}
\]

Taking the derivative with respect to ε at ε = 0, we get

\[
0 = -\gamma \sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma-1}\,k_j(\theta)
  + \gamma \sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma-1}\int_{I_j} dW(x)
  + \sum_{j=1}^J \frac{\partial}{\partial\theta}\psi_j(\theta)\,k_j(\theta)^{\gamma}\;
    \frac{\partial}{\partial\varepsilon}T(G_\varepsilon)\Big|_{\varepsilon=0} \tag{5.47}
\]

Knowing that

\[
\sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma} = 0 \tag{5.48}
\]

and consequently

\[
\sum_{j=1}^J \frac{\partial}{\partial\theta}\psi_j(\theta)\,k_j(\theta)^{\gamma}
 + \gamma \sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma-1}\,\frac{\partial}{\partial\theta^T}k_j(\theta) = 0 \tag{5.49}
\]

we finally get

\[
\frac{\partial}{\partial\varepsilon}T(G_\varepsilon)\Big|_{\varepsilon=0}
 = \Bigl[\sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma}\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}
   \sum_{j=1}^J \psi_j(\theta)\,k_j(\theta)^{\gamma-1}\int_{I_j} dW(x) \tag{5.50}
\]

We see that if we put γ = 1 and ψ_j(θ) = ∂/∂θ log k_j(θ), we get the expression for the MLE. The IF is defined by replacing the distribution W by the distribution Δ_z which puts a point mass 1 at an arbitrary point z. We then get

\[
IF(z; T_\psi, F_\theta)
 = \Bigl[\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}
   \sum_{j=1}^J \delta_j(z)\,k_j(\theta)^{\gamma-1}\,\psi_j(\theta) \tag{5.51}
\]

where δ_j(z) is the indicator that z falls in the class I_j.

We can verify that

\[
E[IF(z; T_\psi, F_\theta)]
 = \Bigl[\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}
   \sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta) = 0 \tag{5.52}
\]

and

\[
E[IF(z; T_\psi, F_\theta)\,IF^T(z; T_\psi, F_\theta)]
 = \Bigl[\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}
   \Bigl[\sum_{j=1}^J k_j(\theta)^{2\gamma-1}\,\psi_j(\theta)\psi_j(\theta)^T\Bigr]
   \Bigl[\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-T}
 = V(T_\psi) \tag{5.53}
\]

In the next section, we will derive a robust estimator in an optimal way, in the sense that we will find the MGP-estimator which minimizes the variance, given a bound on its IF.

5.5 Robust estimators with grouped data

5.5.1 Optimality problem

The first question to answer is how to impose a bound on the IF. The most natural way is to impose, for a given θ,

\[
\max_z \|IF(z; T_\psi, F_\theta)\|
 = \max_j \Bigl\| \Bigl[\sum_{l=1}^J k_l(\theta)^{\gamma}\,\psi_l(\theta)\,\frac{\partial}{\partial\theta^T}\log k_l(\theta)\Bigr]^{-1}
   k_j(\theta)^{\gamma-1}\,\psi_j(\theta) \Bigr\| \le c \tag{5.54}
\]

By imposing the constraint (5.54) with an appropriately chosen value of c, the influence of any observation on the estimator is kept below c in the worst case, thus leading to a less biased estimator under contamination. The condition (5.54) is equivalent to

\[
\forall j,\quad \Bigl\| \Bigl[\sum_{l=1}^J k_l(\theta)^{\gamma}\,\psi_l(\theta)\,\frac{\partial}{\partial\theta^T}\log k_l(\theta)\Bigr]^{-1}
 k_j(\theta)^{\gamma-1}\,\psi_j(\theta) \Bigr\| \le c \tag{5.55}
\]

The optimality problem is then stated in terms of finding the best trade-off between efficiency and robustness. For efficiency reasons we want to minimize the asymptotic covariance² of the estimator, and for robustness reasons we want an IF as small as possible. Hence, the optimality problem is given by

\[
\min_{\psi_1,\ldots,\psi_J} \mathrm{tr}\, E\bigl[IF(z; T_\psi, F_\theta)\,IF^T(z; T_\psi, F_\theta)\bigr] \tag{5.56}
\]

²As in the continuous case, we minimize the trace of the asymptotic covariance matrix.

subject to

\[
\forall j,\quad \Bigl\| \Bigl[\sum_{l=1}^J k_l(\theta)^{\gamma}\,\psi_l(\theta)\,\frac{\partial}{\partial\theta^T}\log k_l(\theta)\Bigr]^{-1}
 k_j(\theta)^{\gamma-1}\,\psi_j(\theta) \Bigr\| \le c \tag{5.57}
\]

among all Fisher consistent estimators, i.e.

\[
\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta) = 0 \tag{5.58}
\]

This problem is similar to Hampel's theorem for continuous data (see Hampel et al. 1986). The solution of the problem is given by the solution in θ of

\[
\sum_{j=1}^J \Bigl(\frac{\hat M_j}{k_j(\theta)}\Bigr)^{\gamma}
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]
 \cdot W_j(A(\theta), a(\theta)) = 0 \tag{5.59}
\]

where

\[
W_j(A(\theta), a(\theta)) = \min\Bigl\{1;\;
 \frac{c}{\bigl\|A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\bigr\|}\Bigr\} \tag{5.60}
\]

and where the p×p matrix A(θ) and the p-vector a(θ) are determined implicitly by the equations

\[
\sum_{j=1}^J \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]
 \cdot W_j(A(\theta), a(\theta)) = 0 \tag{5.61}
\]

\[
\sum_{j=1}^J \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]
 \frac{\partial}{\partial\theta^T}\log k_j(\theta)\cdot W_j(A(\theta), a(\theta)) = I \tag{5.62}
\]

To prove the above result, we need to write the problem in the following way:

\[
\min_{\psi_1,\ldots,\psi_J} \mathrm{tr}\Bigl\{\sum_{j=1}^J k_j(\theta)^{2\gamma-1}\,\psi_j(\theta)\psi_j(\theta)^T\Bigr\} \tag{5.63}
\]

subject to

\[
\forall j,\quad \|k_j(\theta)^{\gamma-1}\,\psi_j(\theta)\| \le c \tag{5.64}
\]

\[
\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta) = 0 \tag{5.65}
\]

\[
\sum_{j=1}^J k_j(\theta)^{\gamma}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}\log k_j(\theta) = I \tag{5.66}
\]

Solving the problem (5.63) under the constraints (5.65) and (5.66) is equivalent to minimizing the Lagrangian given by

\[
\sum_{j=1}^J \int_{I_j}
 \Bigl[k_j(\theta)^{\gamma-1}\psi_j(\theta) - a^{*} - A^{*}\frac{\partial}{\partial\theta}\log k_j(\theta)\Bigr]^T
 \Bigl[k_j(\theta)^{\gamma-1}\psi_j(\theta) - a^{*} - A^{*}\frac{\partial}{\partial\theta}\log k_j(\theta)\Bigr] dF_\theta(x)
 + Q\Bigl(A^{*}, a^{*}, k_j(\theta), \frac{\partial}{\partial\theta}k_j(\theta)\Bigr) \tag{5.67}
\]

obtained by completing the square, where a* (p×1) and A* (p×p) are the Lagrange multipliers and Q collects the terms that do not involve the ψ_j. We see then that the optimal functions ψ_j are of the form

\[
\frac{1}{k_j(\theta)^{\gamma}}\Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr] \tag{5.68}
\]

where A(θ) and a(θ) are determined implicitly by (5.65) and (5.66). If we add the constraint on the IF (5.64), we can see that we have to multiply these functions by the weights given by

\[
\forall j,\quad \min\Bigl\{1;\;
 \frac{c}{\bigl\|A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\bigr\|}\Bigr\} \tag{5.69}
\]

The optimal functions ψ_j are then given by

\[
\psi_j(\theta) = \frac{1}{k_j(\theta)^{\gamma}}
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]
 \cdot \min\Bigl\{1;\;
 \frac{c}{\bigl\|A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\bigr\|}\Bigr\} \tag{5.70}
\]

We can see that replacing (5.70) in (5.63), (5.65) and (5.66) leads to the solution given in (5.59), (5.61) and (5.62). We first note that we actually found a set of solutions depending on a parameter γ. The functions are actually weighted linear combinations of the 'scores' functions s_j of the MPE given by

\[
s_j(\hat M_j, \theta) = \Bigl(\frac{\hat M_j}{k_j(\theta)}\Bigr)^{\gamma}\frac{\partial}{\partial\theta}k_j(\theta) \tag{5.71}
\]

Therefore, when c = ∞, we recover the MPE. This is not surprising because we know that they are all asymptotically efficient.

5.5.2 Computation

To compute the robust estimators, we use the same methodology as in the continuous case. We first find, for a given θ, the matrix A(θ) and the vector a(θ) by means of the equations (5.61) and (5.62), and then compute a Newton-Raphson step for θ. The algorithm we propose is given by the following 5 steps:

Step 1: Fix a precision threshold ε and an initial value for the parameter θ.

Step 2: Compute the initial matrix A(θ) and vector a(θ) with

\[
A(\theta) = \Bigl[\sum_{j=1}^J k_j(\theta)\,\frac{\partial}{\partial\theta}\log k_j(\theta)\,
 \frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1} \tag{5.72}
\]

and a(θ) = 0.

Step 3: Determine A(θ) and a(θ) iteratively by means of

\[
A(\theta) = \Bigl[\sum_{j=1}^J \Bigl(\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr)
 \frac{\partial}{\partial\theta^T}\log k_j(\theta)\,W_j(A(\theta), a(\theta))\Bigr]^{-1} \tag{5.73}
\]

and

\[
a(\theta) = \Bigl[\sum_{j=1}^J k_j(\theta)\,W_j(A(\theta), a(\theta))\Bigr]^{-1}
 \sum_{j=1}^J \frac{\partial}{\partial\theta}k_j(\theta)\,W_j(A(\theta), a(\theta)) \tag{5.74}
\]

Step 4: Compute the Newton-Raphson step for θ given by

\[
\Delta\theta = \sum_{j=1}^J \Bigl(\frac{\hat M_j}{k_j(\theta)}\Bigr)^{\gamma}
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]
 \cdot W_j(A(\theta), a(\theta)) \tag{5.75}
\]

Step 5: If ‖Δθ‖ > ε, then θ ← θ + Δθ and go to step 2, else stop.

It should be stressed that the initial value for θ is very important, since the convergence of the algorithm is guaranteed only for an initial θ near the solution.

5.5.3 Efficiency

The efficiency of the robust estimator can be expressed by the ratio of the traces of the asymptotic covariance matrix of the MPE and the asymptotic covariance matrix of the OBRE. Because of (5.62), we have

\[
\begin{aligned}
V(T_{\psi_c}) &= \sum_{j=1}^J \frac{1}{k_j(\theta)}
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}k_j(\theta) - a(\theta)k_j(\theta)\Bigr]^T
 W_j^2(A(\theta), a(\theta)) \\
 &= \sum_{j=1}^J k_j(\theta)
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\Bigr]
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\Bigr]^T
 W_j^2(A(\theta), a(\theta))
\end{aligned} \tag{5.76}
\]

We can see that for c = ∞,

\[
\forall j,\quad W_j(A(\theta), a(\theta)) = 1, \qquad a(\theta) = 0, \qquad
A(\theta) = \Bigl[\sum_{j=1}^J k_j(\theta)\,\frac{\partial}{\partial\theta}\log k_j(\theta)\,
 \frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}
\]

and

\[
V(T_{\psi_\infty}) = V(T_{MLE}) = V(T_{MPE})
\]

The asymptotic efficiency of the robust estimator can be measured by the ratio of the traces of the asymptotic variance of the MLE and the asymptotic variance of the robust estimator, i.e.

\[
\mathrm{tr}\Bigl\{\Bigl[\sum_{j=1}^J k_j(\theta)\,\frac{\partial}{\partial\theta}\log k_j(\theta)\,
 \frac{\partial}{\partial\theta^T}\log k_j(\theta)\Bigr]^{-1}\Bigr\}
\Big/
\mathrm{tr}\Bigl\{\sum_{j=1}^J k_j(\theta)
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\Bigr]
 \Bigl[A(\theta)\frac{\partial}{\partial\theta}\log k_j(\theta) - a(\theta)\Bigr]^T
 W_j^2(A(\theta), a(\theta))\Bigr\} \tag{5.77}
\]

One can choose the bound c as a function of the asymptotic efficiency at the model one wants to preserve. Indeed, when c = ∞ we have 100% efficiency, but the estimator is not robust. When we lower c, we gain robustness but lose efficiency. What is the minimum value we can give to c? We have

\[
I = \sum_{j=1}^J k_j(\theta)^{\gamma-1}\,\psi_j(\theta)\,\frac{\partial}{\partial\theta^T}k_j(\theta) \tag{5.78}
\]

Taking the traces, we get

\[
\dim(\theta) = \mathrm{tr}(I)
 = \sum_{j=1}^J k_j(\theta)^{\gamma-1}\,\psi_j(\theta)^T\,\frac{\partial}{\partial\theta}k_j(\theta) \tag{5.79}
\]

The Cauchy-Schwarz inequality, together with the bound (5.64), implies

\[
\dim(\theta) \le \sum_{j=1}^J \|k_j(\theta)^{\gamma-1}\,\psi_j(\theta)\|\,
 \Bigl\|\frac{\partial}{\partial\theta}k_j(\theta)\Bigr\|
 \le c \sum_{j=1}^J \Bigl\|\frac{\partial}{\partial\theta}k_j(\theta)\Bigr\| \tag{5.80}
\]

Hence,

\[
c > \frac{\dim(\theta)}{\sum_{j=1}^J \bigl\|\frac{\partial}{\partial\theta}k_j(\theta)\bigr\|} \tag{5.81}
\]
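For a one-parameter model dim(θ) = 1 and the bound (5.81) can be evaluated directly. The sketch below does so for an illustrative Pareto grouping; the class limits and parameter value are arbitrary choices.

```python
import numpy as np

# Lower bound (5.81) for c in a one-parameter Pareto model; the class
# limits and parameter value are illustrative choices.
x0, alpha = 0.5, 3.0
bounds = np.array([0.5, 0.7, 1.0, 1.5, 2.5, np.inf])

def k(a):
    F = lambda x: np.where(np.isinf(x), 1.0, 1.0 - (x0 / x) ** a)
    return F(bounds[1:]) - F(bounds[:-1])

dk = (k(alpha + 1e-6) - k(alpha - 1e-6)) / 2e-6   # numerical dk_j/dtheta
c_min = 1.0 / np.sum(np.abs(dk))                  # dim(theta) = 1
print(c_min)                                      # any usable c must exceed this
```

For this particular grouping the bound comes out at roughly 4.1, so a choice such as c = 5.0 is admissible; other groupings or parameter values change the bound accordingly.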

5.6 Simulation results and conclusion

5.6.1 Simulation results

In order to measure the performance of the robust estimator, we computed the MLE and the OBRE of the parameter α of the Pareto distribution from 50 simulated samples of 1000 observations with parameter values α = 3.0 and x0 = 0.5, which we contaminated by means of

\[
(1-\varepsilon)F_{\alpha, x_0} + \varepsilon F_{\alpha, 10\cdot x_0} \tag{5.82}
\]

The results are presented in Table 5.1.

  ε     OBRE          MSE     MLE           MSE
  0%    3.02 (0.02)   0.02    3.01 (0.01)   0.01
  1%    2.97 (0.02)   0.02    2.81 (0.01)   0.05
  3%    2.87 (0.02)   0.04    2.45 (0.01)   0.27
  5%    2.75 (0.02)   0.08    2.23 (0.01)   0.61
  7%    2.61 (0.02)   0.17    2.01 (0.01)   0.99

Table 5.1: MLE and OBRE (c = 5.0) for the Pareto model with grouped data

We can see that although the IF of the MLE is bounded, when the underlying model is contaminated the MLE has a large bias. With the robust estimator, on the other hand, this bias is considerably smaller. For example, with 7% of contamination, the MSE of the MLE is more than 5 times greater than the MSE of the OBRE. Therefore, with the OBRE we improve the stability of the estimation.
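The mechanism behind Table 5.1 can be reproduced in a few lines. The sketch below computes only the grouped-data MLE (not the OBRE) under the contamination (5.82); the class limits, sample size and random seed are illustrative choices.

```python
import numpy as np

# Grouped-data MLE of the Pareto shape under the contamination (5.82);
# class limits, sample size and seed are illustrative choices.
rng = np.random.default_rng(0)
x0, alpha, n = 0.5, 3.0, 1000
bounds = np.array([0.5, 0.7, 1.0, 1.5, 2.5, np.inf])

def k(a):
    F = lambda x: np.where(np.isinf(x), 1.0, 1.0 - (x0 / x) ** a)
    return F(bounds[1:]) - F(bounds[:-1])

def sample(n, eps):
    u = rng.uniform(size=n)
    scale = np.where(rng.uniform(size=n) < eps, 10 * x0, x0)
    return scale * u ** (-1.0 / alpha)       # inverse-cdf Pareto draws

def grouped_mle(x, a=2.0):
    M = np.histogram(x, bounds)[0] / len(x)  # observed class proportions
    for _ in range(100):                     # scoring iterations
        kj = k(a)
        s = (k(a + 1e-6) - k(a - 1e-6)) / 2e-6 / kj
        a += np.sum(M * s) / np.sum(kj * s ** 2)
    return a

mle_clean = grouped_mle(sample(n, 0.00))
mle_cont = grouped_mle(sample(n, 0.05))
print(mle_clean, mle_cont)   # contamination pulls the MLE downwards
```

The contaminated draws inflate the proportion in the upper class, which biases the estimated shape parameter downwards, as in the MLE column of Table 5.1.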

5.6.2 Local shift sensitivity

Before concluding, we would like to discuss briefly the problem of the local shift sensitivity (λ*). With grouped data, it is easy to imagine that grouping errors can occur, and therefore a robust estimator should be robust to this kind of model deviation. The λ* measures the influence of these errors on the value of the estimator. With continuous data it is given by the slope of the IF at the point z. With grouped data, it is actually given by the maximum difference between the values of the IF when the contamination is shifted between two consecutive classes, that is, the maximum difference between (5.51) evaluated at two consecutive classes j and j+1. For MGP-estimators, we have

\[
\lambda^{*} = \max_j \Bigl\|
 \Bigl[\sum_{l=1}^J k_l(\theta)^{\gamma}\,\psi_l(\theta)\,\frac{\partial}{\partial\theta^T}\log k_l(\theta)\Bigr]^{-1}
 \bigl(k_j(\theta)^{\gamma-1}\,\psi_j(\theta) - k_{j+1}(\theta)^{\gamma-1}\,\psi_{j+1}(\theta)\bigr)
 \Bigr\| \tag{5.83}
\]

In the particular case of the MPE, we have

\[
\lambda^{*} = \max_j \Bigl\|
 \Bigl[\sum_{l=1}^J k_l(\theta)\,\frac{\partial}{\partial\theta}\log k_l(\theta)\,
 \frac{\partial}{\partial\theta^T}\log k_l(\theta)\Bigr]^{-1}
 \Bigl(\frac{\partial}{\partial\theta}\log k_j(\theta) - \frac{\partial}{\partial\theta}\log k_{j+1}(\theta)\Bigr)
 \Bigr\| \tag{5.84}
\]

Note that

\[
\frac{\partial}{\partial\theta}\log k_j(\theta)
 = \frac{\int_{I_j} s(x;\theta)\, dF_\theta(x)}{\int_{I_j} dF_\theta(x)} \tag{5.85}
\]

which can be approximated by

\[
\frac{\partial}{\partial\theta}\log k_j(\theta)
 \approx \frac{s(x^{(j)};\theta)\,f(x^{(j)};\theta)\,(x_{j+1}-x_j)}{f(x^{(j)};\theta)\,(x_{j+1}-x_j)}
 = s(x^{(j)};\theta) \tag{5.86}
\]

where x_j = inf I_j, x_{j+1} = sup I_j and x^{(j)} = (x_j + x_{j+1})/2. Therefore, λ* is approximately proportional to

\[
\max_j \|s(x^{(j)};\theta) - s(x^{(j+1)};\theta)\| \tag{5.87}
\]

For example, for the Pareto distribution, (5.87) is proportional to

\[
\max_j \bigl|\log(x^{(j+1)}) - \log(x^{(j)})\bigr| \tag{5.88}
\]

which, according to the length of the classes and their positions (in high or low incomes), can be large. However, with robust estimators from the class of MGP-estimators, since by construction

\[
\forall j,\quad \|k_j(\theta)^{\gamma-1}\,\psi_j(\theta)\| \le c
\]

the local shift sensitivity is itself bounded (by 2c).
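The following sketch illustrates the quantity appearing in (5.88): the largest log-gap between consecutive class midpoints stays small for a fine grouping but grows when the upper-tail classes are wide. Both groupings below are arbitrary illustrative choices.

```python
import numpy as np

# Largest log-gap between consecutive class midpoints, the quantity
# driving (5.88); the two groupings are illustrative choices.
def max_log_gap(bounds):
    mid = (bounds[:-1] + bounds[1:]) / 2        # class midpoints x^(j)
    return np.max(np.abs(np.diff(np.log(mid))))

fine = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])    # narrow classes
coarse = np.array([0.5, 1.0, 2.0, 10.0, 100.0])    # wide upper-tail classes

g_fine, g_coarse = max_log_gap(fine), max_log_gap(coarse)
print(g_fine, g_coarse)   # the coarse grouping gives a much larger value
```

For the coarse grouping the largest log-gap exceeds 2, against about 0.5 for the fine one, so the (unbounded) MPE is far more sensitive to grouping errors in the upper tail.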

5.6.3 Conclusion

In this chapter we have computed the IF of a general class of estimators for grouped data, namely the class of MPE. We found that this IF, although bounded, can be large. Therefore, we proposed a more general class of estimators, the MGP-estimators, which includes the MPE and makes it possible to define robust estimators. By analogy with Hampel's theorem, we defined optimal bounded-IF estimators and, by a simulation study, showed that under small model contaminations they are considerably more stable than the classical MPE.

Chapter 6

Robust Tests for Model Choice

6.1 Introduction

In this chapter we study the problem of choosing a model which, according to certain criteria, best represents the data. With PID models, the problem is restricted to the choice between several functional forms (given in section 2.5). Test procedures for this typical problem and for more general ones already exist. We give a review of the best known statistical procedures in this section and in section 6.2. However, some of the procedures suffer from notable disadvantages. One of them is the lack of accuracy of the approximation of their distribution by means of the asymptotic distribution when the sample size is not extremely large. In section 6.3 we analyse this problem for two notable statistics and present an improvement to the procedure. The other disadvantage is that the test statistics are so far based on the assumption that the stated model is exact, and thus they are not robust to small model deviations. We present the concept of robustness in tests and study the robustness properties of one prominent test statistic in section 6.4. Finally, in section 6.5 we propose a robust model choice procedure and present a simulation study.

The problem of model choice plays an important role in statistical analysis. It should precede the step of model estimation and hence has to be studied carefully. Very often this step is omitted and the model choice procedure becomes merely an a posteriori test of the adequacy of the estimated model (chosen 'intuitively') for the data. Model choice includes a large number of different statistical techniques, such as exploratory data analysis, non-parametric estimation, Bayesian statistics... However, we focus here on model choice procedures as tests between two or several hypotheses. We can separate model choice procedures into two main categories which depend on the nature of the problem. When the models to be compared are nested, the classical theory can be applied. This includes, for example, the classical F test, Lagrange-multiplier (LM) test, Wald (W) test, likelihood ratio test... On the other hand, when the models are separate (or the corresponding hypotheses are non-nested), the classical theory no longer holds. In this chapter we treat the latter case. The first development of a test of separate families of hypotheses is due to Cox (1961, 1962). He stated the definition of non-nested models in the context of separate families of distributions as: 'Two hypotheses are separate if an arbitrary simple hypothesis in one cannot be obtained as a limit of simple hypotheses in the other'. In the context of separate non-linear regression models this definition becomes that two models are separate if neither may be obtained from the other by the imposition of appropriate parametric restrictions (Fisher and McAleer 1981). The areas of application are wide (see Loh 1985). In biology, Cox (1962) studied the problem of testing between two quantal response curves, and Williams (1970) the problem of choosing between a segmented linear regression and a smooth regression for enzyme synthesis. In medicine, Efron (1984) compared non-nested linear models for heart transplant data. In economics we can quote the work of Pesaran (1982) on testing non-nested economic models. In engineering, Lawless (1982) studied the discrimination between the normal and extreme value distributions for lifetime data.
Finally, in literature, Cox and Brandwood (1959) studied the choice of a model for dating the works of Plato. There are various alternative ways to approach the problem of choice between non-nested models. A common practice among econometricians is to rely on discrimination criteria such as Akaike's (1973, 1974) information criterion or Schwarz's (1978) Bayesian information criterion. For PID models, the κ-criterion (Elderton 1938) has been used by Hirschberg, Molina and Slottje (1989). Such criteria measure how well different models fit the data (for some, with an adjustment being made for parsimony). A model is then evaluated on the basis of its own performance, regardless of the alternative. A second alternative for comparing non-nested models is to nest the separate models in a comprehensive (general) model and use a parametric specification to test the competing separate models as special cases. For example, when choosing the distribution of the income variable, one can fit to the data one or both of the four-parameter generalized beta distributions, which include nearly all the models actually used as special or limiting cases (see McDonald and Richards 1987). With linear models, one combines the regressors of two separate models into a general one. There are, however, several well-known problems with this approach. It can be very difficult to estimate the general model because the number of parameters can be large, leading to few residual degrees of freedom; in linear models there may be high collinearity between competing regressors (McAleer 1986). In many cases of interest in econometrics, the structural parameters under the combined hypothesis are not identified and the log-likelihood function of the artificially constructed model has singularities under both the null and the alternative hypotheses.
Moreover, different econometric models often correspond to rival economic theories, and therefore a combination of them is not considered meaningful. Finally, comprehensive models do not use information regarding the competing separate models in an optimal way. However, a number of investigators have advocated that comprehensive models are still appropriate as a basis for significance tests of non-nested hypotheses. In particular, LM tests on comprehensive models have been developed and their relation to the Cox test has been shown (see Atkinson AC 1970, Breusch and Pagan 1980, Dastoor 1985). Moreover, since classical tests for nested hypotheses have attracted the attention of many statisticians, it is sometimes convenient to transform the test between non-nested hypotheses into a parametric test in order to use the theory already developed for this kind of procedure. In particular, we will see that this is a very useful way to develop robust procedures. Finally, the approach first developed by Cox (1961, 1962) of a modified likelihood ratio test statistic has retained most of the researchers' attention. In the last decade, several Cox-type statistics have been developed, mainly in order to simplify the procedure when dealing with particular models. These statistics include the J, JA, W1 and W2 statistics to be discussed in section 6.2. There are two main disadvantages of the Cox and Cox-type statistics. First of all, their asymptotic distribution is not an accurate approximation to the finite sample distribution. Secondly, when dealing with parametric models it is well known that model deviations, due for example to some 'bad' observations, can considerably falsify a statistical analysis. This is true not only for parametric estimation, but also for tests (see section 6.4). In our case, it is then important to analyse the robustness properties of model choice statistics.
However, it should be stressed that the disadvantages highlighted here for Cox-type statistics also apply to most of the classical model choice statistics. In section 6.2, we first present the Cox statistic, which initiated research in the field of testing separate families of hypotheses. We also present the different approaches developed following Cox, from the Atkinson (Atkinson AC 1970) statistic, through the J and JA statistics and Monte-Carlo based tests, to more recent developments. In section 6.5, we propose some robust procedures for model choice, analyse their properties and develop a modified robust Cox statistic that we test by means of simulation. Finally, we conclude this chapter with a discussion of further developments in the field of model choice procedures. Before proceeding, we would like to discuss an approach developed for the case of PID models. Hirschberg, Molina and Slottje (1989) proposed a selection criterion for choosing a PID model. They argue that the choice of an appropriate statistical distribution to approximate the empirical PID has often been somewhat arbitrary, and propose the use of Elderton's (1938) κ-criterion. This criterion permits an a priori elimination of certain members of the Pearson family as candidates for describing the actual empirical distribution. It can only be used for the members of the Pearson family, but it should be stressed that a large proportion of PID models actually belong to this family. The κ-criterion is based on the first four moments of the distribution. Simply put, the skewness and kurtosis of the empirical distribution and of the hypothetical distribution are compared by means of the κ-criterion, which maps each of them onto the real line. Consequently, if the κ-criteria of the empirical and the hypothetical distribution do not intersect on the real line, then in that particular instance the hypothetical distribution under consideration does not properly describe the empirical distribution.
Hirschberg, Molina and Slottje (1989) give the range of the κ-criterion for the PID models that belong to the Pearson family. The procedure then consists in estimating the κ-criterion by means of the moments computed directly from the data. However, the nonlinear nature of higher-order moments can make the variability of the κ-criterion very large. It is therefore important, firstly, to have a large number of observations and, secondly, to be prepared to compute an estimate of the variability in order to be more precise about which forms to eliminate a priori. The first point is nearly always satisfied with income data. For the second point, the authors propose to rely either on a first-order approximation of the covariance matrix of the first four moments (delta method), or on a non-parametric method such as the jackknife or the bootstrap. The κ-criterion is attractive for its simplicity but, in our opinion, suffers from very important disadvantages. First of all, the Pearson family does not cover the range of all possible models for the PID. Secondly, some models, like the Gamma distribution, can never be rejected a priori. Thirdly, the variance of the estimates tends to be very large, so that only a few models are rejected a priori. Hence, the problem of choosing between the remaining models is still not solved. Actually, the advantage that the κ-criterion offers is to restrict the set of possible models for the PID when one wants to systematically test each model separately, one against the other.

6.2 Classical tests

The testing of non-nested or separate families of hypotheses was initiated by Cox (1961, 1962). Let H0 and H1 be two hypotheses for a random variable X. Under H0, it is assumed that X ∼ F^0_α with corresponding density function f^0(x; α), and under H1, it is assumed that X ∼ F^1_β with corresponding density function f^1(x; β), where α ∈ Ω_α and β ∈ Ω_β are p×1 and q×1 parameter vectors. In this section we first present the Cox statistic, then give an interpretation and a link with the general information theory approach, and finally present Atkinson's approach to the problem of model choice.

6.2.1 The Cox statistic

Let

\[
L_0(\hat\alpha) = \frac{1}{n}\sum_{i=1}^n \log f^0(x_i; \hat\alpha)
\qquad\text{and}\qquad
L_1(\hat\beta) = \frac{1}{n}\sum_{i=1}^n \log f^1(x_i; \hat\beta)
\]

be the observed log-likelihood functions corresponding respectively to the hypotheses H0 and H1, where α̂ and β̂ are the corresponding MLE. Define L(α̂; β̂) = L_0(α̂) − L_1(β̂), and consider H0 as the hypothesis under test and H1 as the alternative. Cox proposed the following test statistic:

\[
U_{Cox} = L(\hat\alpha; \hat\beta) - \hat E[L(\alpha; \beta)] \tag{6.1}
\]

where the expectation is taken under the estimated model f^0(·; α̂). This statistic compares L(α̂; β̂) with the best estimate of the value we would expect under H0. A negative Cox statistic indicates that the actual performance of the model under H1 is better than expected. A significantly negative value for the Cox statistic therefore leads to the rejection of H0, because the model under H1 is performing too well for the model under H0 to be regarded as the true one. Similarly, a significantly positive Cox statistic also leads to the rejection of H0, because the model under H1 is performing worse than expected under H0. The asymptotic distribution of √n U_Cox under H0 is the normal distribution with zero mean and variance given by

\[
\mathrm{Var}[L - E(L)]
 - \mathrm{Cov}[L - E(L); S_0^T]\,\mathrm{Var}[S_0]^{-1}\,\mathrm{Cov}[S_0; L - E(L)] \tag{6.2}
\]

where L = L(α_0; β_{α_0}), α_0 being the true parameter under H0 and β_{α_0} being defined to maximize the expected log-likelihood ∫ log{f^1(x; β)} f^0(x; α_0) dx, and S_0 = [∂ log f^0(x; α)/∂α]_{α=α_0} = s_0(x; α_0). The variances and covariances are taken with respect to the distribution under H0. It is usually necessary in applications to estimate this variance consistently from the data. For example, α_0 can be estimated by α̂ and β_{α_0} by β_α̂. Moreover, for numerical simplification, one could also replace the integrals over the distribution under H0 by means over the sample. In his paper, Kent (1986) interprets the Cox statistic as the difference between the 'observed' and the 'fitted' information gain. Indeed, the difference between the models H0 and H1 can be summarized by the Kullback and Leibler (1951) information gain

\[
\Delta = \Delta(\alpha; \beta) = \int \log\{f^0(x; \alpha)/f^1(x; \beta)\}\, f^0(x; \alpha)\, dx \tag{6.3}
\]

Kullback (1959) describes this quantity as the mean information per observation from f^0(x; α) for discriminating in favour of f^0(x; α) against f^1(x; β). To estimate α, β and Δ from the sample, one can use the usual 'observed MLE' α̂ and β̂. Alternatively, since it can be argued that all information about f^0(x; α) is contained in the MLE α̂, one can estimate β by β_α̂, where β_α̂ maximizes the 'fitted' (or expected) log-likelihood. The two natural estimates of the information gain Δ are then given by the 'observed information gain'

\[
\hat\Delta = \int \log\{f^0(x; \hat\alpha)/f^1(x; \hat\beta)\}\, dF^{(n)}(x) \tag{6.4}
\]

and the 'fitted information gain'

\[
\tilde\Delta = \int \log\{f^0(x; \hat\alpha)/f^1(x; \beta_{\hat\alpha})\}\, f^0(x; \hat\alpha)\, dx \tag{6.5}
\]

It is easy to see that the Cox statistic U_Cox is equal to Δ̂ − Δ̃. It is interesting to point out that if H0 is nested in H1, then Δ̃ ≡ 0 and we get the parametric scores test. Moreover, replacing β̂ by β_α̂ in Δ̂ (Atkinson AC 1970), or replacing β_α̂ by β̂ in Δ̃ (White 1982), does not affect the asymptotic null distribution of U_Cox under H0.
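The information-gain reading of U_Cox suggests a direct computation: estimate Δ̂ from the sample and Δ̃ by simulation under the fitted null model. The sketch below does this for H0: Exponential against H1: Lognormal; the Monte-Carlo approximation of the fitted gain (replacing β_α̂ by the MLE on a large simulated sample), the sample sizes and the seed are all illustrative choices, and the standardization by (6.2) is omitted.

```python
import numpy as np

# Unstandardized Cox statistic as observed minus fitted information gain,
# (6.4)-(6.5), for H0: exponential against H1: lognormal; the Monte-Carlo
# size m and the seed are illustrative choices.
rng = np.random.default_rng(1)

def log_f0(x, lam):                    # exponential log-density
    return np.log(lam) - lam * x

def log_f1(x, mu, sig):                # lognormal log-density
    z = (np.log(x) - mu) / sig
    return -np.log(x * sig * np.sqrt(2 * np.pi)) - 0.5 * z ** 2

def fit1(x):                           # lognormal MLE
    lx = np.log(x)
    return lx.mean(), lx.std()

def cox(x, m=100_000):
    lam = 1.0 / x.mean()               # exponential MLE
    mu, sig = fit1(x)
    gain_obs = np.mean(log_f0(x, lam) - log_f1(x, mu, sig))      # (6.4)
    z = rng.exponential(1.0 / lam, size=m)   # draws from fitted H0 model
    mu0, sig0 = fit1(z)                      # approximates beta_alpha-hat
    gain_fit = np.mean(log_f0(z, lam) - log_f1(z, mu0, sig0))    # (6.5)
    return gain_obs - gain_fit

u_exp = cox(rng.exponential(1.0, size=500))    # data consistent with H0
u_ln = cox(rng.lognormal(0.0, 1.5, size=500))  # data far from H0
print(u_exp, u_ln)   # strongly negative for the lognormal data
```

As expected, the statistic stays near zero when the data are generated under H0 and becomes clearly negative when H1 fits much better than H0 would predict.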

6.2.2 The Atkinson statistic

Following Cox's work, Atkinson AC (1970) proposed a test statistic similar to the Cox statistic by stating a comprehensive mixture model. The general idea, which has since been re-used by econometricians, is the following. In order to choose between a set of probability distribution functions f^0(x; θ_0), ..., f^m(x; θ_m), one constructs the mixture model

\[
\frac{\prod_{j=0}^m \{f^j(x; \theta_j)\}^{\lambda_j}}
     {\int \prod_{j=0}^m \{f^j(z; \theta_j)\}^{\lambda_j}\, dz} \tag{6.6}
\]

where Σ_j λ_j = 1, and makes inferences about the λ_j. For example, if we are interested in testing for departures from one model, f^0(x; α), in the direction of another, f^1(x; β), we would construct the combined probability distribution function

\[
f^c(x; \lambda, \alpha, \beta)
 = \frac{\{f^0(x; \alpha)\}^{\lambda}\{f^1(x; \beta)\}^{1-\lambda}}
        {\int \{f^0(z; \alpha)\}^{\lambda}\{f^1(z; \beta)\}^{1-\lambda}\, dz} \tag{6.7}
\]

and test the hypothesis that λ = 1. The log-likelihood function is

\[
L^c(x; \lambda, \alpha, \beta)
 = \lambda \log f^0(x; \alpha) + (1-\lambda)\log f^1(x; \beta)
 - \log \int f^0(z; \alpha)^{\lambda} f^1(z; \beta)^{1-\lambda}\, dz \tag{6.8}
\]

Differentiating L^c(x; λ, α, β) with respect to λ, α and β at λ = 1 and α = α̂, and taking the mean over the sample, one gets

\[
\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\lambda} L^c(x; \lambda, \alpha, \beta)\Big|_{\lambda=1,\,\alpha=\hat\alpha}
 = \frac{1}{n}\sum_{i=1}^n \log\frac{f^0(x_i; \hat\alpha)}{f^1(x_i; \beta)}
 - \int \log\Bigl(\frac{f^0(z; \hat\alpha)}{f^1(z; \beta)}\Bigr) f^0(z; \hat\alpha)\, dz \tag{6.9}
\]

\[
\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\alpha} L^c(x; \lambda, \alpha, \beta)\Big|_{\lambda=1,\,\alpha=\hat\alpha} = 0 \tag{6.10}
\]

\[
\frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\beta} L^c(x; \lambda, \alpha, \beta)\Big|_{\lambda=1,\,\alpha=\hat\alpha} = 0 \tag{6.11}
\]

Note that when λ = 1, the likelihood is independent of β. As estimator of β, Atkinson (1970) proposes β_α̂, also called the pseudo-MLE, leading to the following test statistic:

\[
U_{Atk} = \sqrt n\;
 \frac{L(\hat\alpha; \beta_{\hat\alpha}) - \int L(\hat\alpha; \beta_{\hat\alpha})\, f^0(x; \hat\alpha)\, dx}
      {V\bigl[F^0_{\hat\alpha}\bigr]^{1/2}} \tag{6.12}
\]

where V[F^0_α̂] is given by (6.2). Under the null hypothesis, the statistic is asymptotically distributed according to the standard normal distribution. Hence, U_Atk is similar to U_Cox, except that β̂ in U_Cox is replaced by β_α̂.

6.2.3 Other approaches

Davidson and MacKinnon (1981) developed the J statistic, and Fisher and McAleer (1981) the JA statistic, in the context of linear and nonlinear econometric models. For their construction, the authors follow the same approach, which is based on a 't-statistic' from a comprehensive model. They consider an artificial compound model in which the models under the null hypothesis and under the alternative hypothesis are represented. The parameter under the alternative hypothesis is estimated a priori by means of the least squares estimator under H1 for the J statistic, and by means of the least squares estimator under H0 for the JA statistic (see also MacKinnon, White and Davidson 1983). Gourieroux, Monfort and Trognon (1983) present another approach to the model selection problem, which can be applied to both nested and non-nested models. Their test statistics are based on the estimator β_α̂ of the parameter under the alternative hypothesis. In the nested case this test is the Wald test. Mizon and Richard (1986) present a single framework that unifies the alternative ways of generating and analysing non-nested test statistics, by means of the encompassing principle. The generated test statistics are generalizations of the Cox statistic. They call the resulting test a Wald encompassing test. Aguirre-Torres and Gallant (1983) also propose a generalization of the Cox statistic which has the advantage of promising further developments for robustness considerations. They worked in the framework of non-nested multivariate nonlinear regression models. Their generalization comes from the idea of generalizing the method of estimation for the parameters. If one computes M-estimators instead of MLE, characterized as the solutions α̃ in α and β̃ in β which respectively maximize

\[
\rho_n^0(\alpha) = \frac{1}{n}\sum_{i=1}^n \rho^0(y_i; x_i, \alpha) \tag{6.13}
\]

\[
\rho_n^1(\beta) = \frac{1}{n}\sum_{i=1}^n \rho^1(y_i; x_i, \beta) \tag{6.14}
\]

this leads to an obvious generalization of the Cox test. One rejects H0 in favour of H1 when the absolute value of the statistic

\[
U_R = \rho_n^0(\tilde\alpha) - \rho_n^1(\tilde\beta)
 - \frac{1}{n}\sum_{i=1}^n \int_y \bigl[\rho^0(y; x_i, \tilde\alpha) - \rho^1(y; x_i, \tilde\beta)\bigr]
   f^0(y; x_i, \tilde\alpha)\, dy \tag{6.15}
\]

exceeds an appropriate critical point. They use the same idea as Hampel et al. (1986, chap. 7) for the τ-tests. The question which still remains open is how to define the functions ρ^0 and ρ^1. To conclude, we present the work of Dastoor (1985), who showed that the Cox procedure for non-nested hypotheses can also be interpreted as an LM test. This result will be used later in order to develop a robust version of the Cox test. The idea is to construct a comprehensive model (as suggested by Cox 1961 and further developed by Atkinson 1970) of the type

\[
f^c(x; \lambda, \alpha, \beta)
 = \{f^0(x; \alpha)\}^{\lambda}\{f^1(x; \beta)\}^{1-\lambda}
 \Bigl[\int \{f^0(y; \alpha)\}^{\lambda}\{f^1(y; \beta)\}^{1-\lambda}\, dy\Bigr]^{-1} \tag{6.16}
\]

where testing H0 would be equivalent to testing the null hypothesis H0^λ: λ = 1 against the alternative H1^λ: λ ≠ 1. The comprehensive model f^c(x; λ, α, β) contains both H0 and H1 as special cases. However, according to Dastoor (1985), this need not be the case. His argument is that the null hypothesis should be the hypothesis of interest and that the role of the alternative is to provide the test with very high power 'in a particular direction'. Hence, to test H0, a general model should contain H0 as a particular case and some relevant information about H1. For example, the relevant information about H1 is given either by f^1(x; β̂) or by f^1(x; β_α̂). The comprehensive model is then only a function of the observations and the parameters α and λ. The LM, likelihood ratio (LR) and Wald (W) statistics can now be used to test for λ = 1 on the comprehensive model f^c(x; λ, α, β̃), where β̃ is an appropriate estimator of β.
Dastoor (1985) shows that the LM statistic for testing $\lambda = 1$ is equal to the square of the standardized Cox statistic or to the square of the standardized Atkinson statistic, depending on the choice of $\tilde\beta$.

6.3 Small sample properties

6.3.1 Introduction

Atkinson, A.C. (1970) noted that the finite sample distribution of the Cox statistic, when the sample size is not large, is sometimes far from normal. Williams (1970) suggested a parametric bootstrap to improve the approximation (see Efron 1979). Loh (1985) reported theoretical and empirical evidence showing that the Cox statistic tends to overreject the null hypothesis. Godfrey and Pesaran (1983) proposed an adjusted statistic in the particular case of linear regression models; their statistic has a better size than the Cox statistic. It seems also that the other statistics considered in section 6.2 have 'small sample problems' (see Godfrey and Pesaran 1983).

What can be done to avoid this kind of problem? A natural remedy is to bootstrap the distribution of the statistic; in that case, we do not need the standardized version of the Cox or Cox-type statistics. However, Loh (1985) criticizes both the asymptotic approximation and the parametric bootstrap, and proposes a computer-intensive method to compute the critical points for a given nominal level, such that the test size cannot much exceed the nominal level.

In the next subsection, we present a simulation study testing the accuracy of Cox-type tests when choosing a PID model. Then, in subsection 6.3.3, we present an alternative procedure based on the parametric bootstrap.

6.3.2 Simulation study

In order to illustrate some of the remarks stated above, we present a simulation study. We want to see if these remarks apply in the particular case of PID models, where the number of observations is in general large and the distribution functions have a special shape. The aim is not to give a general answer, but at least to analyse some typical cases. We study the finite sample level of the tests by simulating samples from a Gamma, Pareto or Exponential distribution and then computing the Cox and Atkinson statistics¹ to test the Gamma against the Lognormal distribution, the Pareto against the Exponential and the Exponential against the Pareto (the mathematical formulas for the distributions are given in appendix A).

We simulated 1000 samples of 200 observations from a Gamma distribution with parameter values $\alpha_1 = 3.0$ (shape parameter) and $\alpha_2 = 1.0$ (scale parameter), and tested the Gamma against the Lognormal distribution. We computed the proportion of estimated statistics falling into the critical region² for different nominal levels $\omega$. We chose $\omega = 1\%$ ($\kappa_\omega = \pm 2.57$),

¹We actually standardize the statistics by the variance in which the parameter under $H_0$ is estimated by means of the MLE and the parameter under $H_1$ by means of the pseudo MLE under $H_0$.

²$\kappa_\omega = \Phi^{-1}\left(1 - \frac{\omega}{2}\right)$, where $\Phi$ is the cumulative distribution of the standard normal variable.

$\omega = 3\%$ ($\kappa_\omega = \pm 2.17$), $\omega = 5\%$ ($\kappa_\omega = \pm 1.96$) and $\omega = 10\%$ ($\kappa_\omega = \pm 1.64$). We can compute 95% confidence intervals for these proportions through a normal approximation. With 1000 simulated samples they are (in percentage points) of $\pm 0.62$ for the 1% level, $\pm 1.06$ for the 3% level, $\pm 1.35$ for the 5% level and $\pm 1.86$ for the 10% level. The results of the simulation are given in table 6.1. Looking at these results, it seems that with 200 observations the Cox and Atkinson statistics have finite sample levels close to the nominal levels, i.e. the confidence interval contains the true level, except perhaps at the 10% level.

Nominal level    Finite sample level
                 Cox      Atk
  1%             1.0%     0.9%
  3%             3.0%     2.3%
  5%             4.1%     3.7%
 10%             7.6%     7.5%

Table 6.1: Finite sample level of Cox and Atkinson statistics (Gamma against Lognormal)
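The confidence half-widths quoted above follow from the usual normal approximation for a binomial proportion. A minimal sketch, assuming the half-widths are computed over the 1000 simulated samples and expressed in percentage points:

```python
import math

def ci_half_width(p, n_rep=1000, z=1.96):
    """95% normal-approximation half-width (in percentage points) for a
    rejection proportion p estimated from n_rep simulated samples."""
    return 100 * z * math.sqrt(p * (1 - p) / n_rep)

for p in (0.01, 0.03, 0.05, 0.10):
    print(f"{p:.0%}: +/-{ci_half_width(p):.2f}")
```

This reproduces the values $\pm 0.62$, $\pm 1.06$, $\pm 1.35$ and $\pm 1.86$ quoted in the text.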

In a second study, we did the same simulations for testing the Exponential against the Pareto distribution. Since the latter is defined for a variable such that $x_0 \leq X$, we have to consider the truncated Exponential distribution, which is given by

$$ f^0(x;\alpha) = \frac{\alpha e^{-\alpha x}}{\int_{x_0}^{\infty} \alpha e^{-\alpha x}\,dx} = \alpha e^{-\alpha(x - x_0)} \qquad (6.17) $$

We also have the following simplifications:

$$ \hat\Delta - \tilde\Delta = \log(\beta_{\hat\alpha}) - \log(\hat\beta) + \frac{1}{n}\sum_i \log(x_i) - \int \log(x)\, f^0(x;\hat\alpha)\,dx \qquad (6.18) $$

for the Cox statistic and

$$ \hat\Delta - \tilde\Delta = \beta_{\hat\alpha}/\hat\beta - 1 + \frac{1}{n}\sum_i \log(x_i) - \int \log(x)\, f^0(x;\hat\alpha)\,dx \qquad (6.19) $$

for the Atkinson statistic. We simulated 1000 samples of 200 observations from an Exponential distribution with parameter $\alpha = 12$. The finite sample levels of the Cox and Atkinson tests are given in table 6.2. Once again, good agreement is found.

Nominal level    Finite sample level
                 Cox      Atk
  1%             1.7%     0.4%
  3%             3.3%     2.2%
  5%             6.1%     4.1%
 10%            10.4%     9.1%

Table 6.2: Finite sample level of Cox and Atkinson statistics (Exponential against Pareto)
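Samples from the shifted Exponential density in (6.17) can be drawn by inversion of the cdf; a small sketch, with the values of $\alpha$ and $x_0$ chosen for illustration:

```python
import math
import random

def r_shifted_exp(n, alpha, x0, seed=0):
    """Draw n values from f0(x; alpha) = alpha * exp(-alpha*(x - x0)), x >= x0,
    by inversion: X = x0 - log(1 - U)/alpha with U uniform on (0, 1)."""
    rng = random.Random(seed)
    return [x0 - math.log(1.0 - rng.random()) / alpha for _ in range(n)]

sample = r_shifted_exp(10_000, alpha=12.0, x0=0.5)
mean = sum(sample) / len(sample)   # should be close to x0 + 1/alpha
```

The support and mean checks (every draw is at least $x_0$, and the sample mean is close to $x_0 + 1/\alpha$) confirm the inversion formula.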

In a third study, we did the same simulations for testing the Pareto against the Exponential distribution. We have the following simplifications:

$$ \hat\Delta - \tilde\Delta = (\hat\alpha+1)\left[\frac{1}{\hat\alpha} - \frac{1}{n}\sum_i \left(\log(x_i) - \log(x_0)\right)\right] - \left[\log(\hat\beta) - \log(\beta_{\hat\alpha})\right] \qquad (6.20) $$

for the Cox statistic, and

$$ \hat\Delta - \tilde\Delta = (\hat\alpha+1)\left[\frac{1}{\hat\alpha} - \frac{1}{n}\sum_i \left(\log(x_i) - \log(x_0)\right)\right] + \beta_{\hat\alpha}\left(\frac{1}{n}\sum_i x_i - x_0\right) - 1 \qquad (6.21) $$

for the Atkinson statistic, where $f^0$ is the Pareto density. We simulated 1000 samples of 200 observations from a Pareto distribution with parameter $\alpha = 3.0$. The finite sample levels of the Cox and Atkinson statistics are given in table 6.3. This time we can see that the normal approximation for both statistics is not really accurate. A researcher who believes to be working at the 5% level is actually working at the 2.7% level, so that the nominal level overestimates the actual level when one considers the asymptotic distribution of the test statistic. It is difficult to give an explanation for this phenomenon, but as Atkinson, A.C. (1988) noticed, it is certainly due to extreme legitimate values. One way to avoid this problem is by means of a parametric bootstrap (see below). Moreover, in the next section we will see that with a robust version of the Cox or Atkinson statistic, this problem disappears.

6.3.3 A Cox-type statistic with a parametric bootstrap

As pointed out above, in order to avoid the distributional problems of large sample approximations, as Williams (1970) first suggested, one can

Nominal level    Finite sample level
                 Cox      Atk
  1%             1.2%     2.0%
  3%             2.0%     2.4%
  5%             2.7%     2.9%
 10%             4.1%     4.5%

Table 6.3: Finite sample level of Cox and Atkinson statistics (Pareto against Exponential)

estimate the distribution of the Cox or the Atkinson statistic by computing these statistics from simulated samples. This method is also very convenient when the distribution of the statistic is very complicated to derive, or simply when it is unknown. For example, if for robustness reasons we would like to transform the Cox statistic, then a parametric bootstrap can be used in order to take a decision about the null hypothesis.

In an unpublished paper, Atkinson (1988) proposed to consider the statistic given by the ratio of the observed likelihood functions and to determine its distribution by means of a parametric bootstrap. We will call this statistic the LKR statistic. It is given by

$$ U_{LKR} = L(\hat\alpha; \hat\beta) = \int \log\left[\frac{f^0(x;\hat\alpha)}{f^1(x;\hat\beta)}\right] dF^{(n)}(x) \qquad (6.22) $$

The procedure is the following: first compute the MLEs $\hat\alpha$ and $\hat\beta$ from the observed sample and calculate the observed LKR; then simulate samples³ from $F^0_\alpha\big|_{\alpha=\hat\alpha}$ and compute the corresponding simulated LKRs. The observed LKR is ranked among the simulated LKRs, and if, for example, the rank falls in the upper or lower 2.5% extremes, then $H_0$ is rejected at the 5% level.

In order to study this procedure, we computed the finite sample levels of the test when comparing different models. We first tested the Gamma distribution against the Lognormal distribution: we simulated 1000 samples of 200 observations from a Gamma ($\alpha_1 = 3.0$, $\alpha_2 = 1.0$) and computed the proportion of samples for which the rank was in the 0.5%, 1.5%, 2.5% and 5% extremes. The results are presented in table 6.4. We also tested the Exponential distribution against the Pareto distribution by simulating 1000 samples of 200 observations from an Exponential ($\alpha = 12.0$), and the Pareto against the Exponential by simulating 1000 samples of 200 observations from a Pareto ($\alpha = 3.0$). The results are presented in table 6.5 and table 6.6.
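The ranking procedure can be sketched for the Exponential ($H_0$) against Pareto ($H_1$) case, where both MLEs have closed forms. This is a hedged sketch, not the thesis's own code; the value of $x_0$ and the number of bootstrap samples are chosen for illustration:

```python
import math
import random

def mle_exp(xs, x0):
    # shifted-exponential MLE: alpha_hat = 1 / (mean(x) - x0)
    return 1.0 / (sum(xs) / len(xs) - x0)

def mle_pareto(xs, x0):
    # Pareto MLE: beta_hat = n / sum(log(x_i / x0))
    return len(xs) / sum(math.log(x / x0) for x in xs)

def lkr(xs, x0):
    """Observed likelihood-ratio statistic sum_i [log f0(x_i) - log f1(x_i)]
    evaluated at the MLEs under each model."""
    a, b = mle_exp(xs, x0), mle_pareto(xs, x0)
    ll0 = sum(math.log(a) - a * (x - x0) for x in xs)
    ll1 = sum(math.log(b) + b * math.log(x0) - (b + 1) * math.log(x) for x in xs)
    return ll0 - ll1

def lkr_rank(xs, x0, n_boot=199, seed=1):
    """Rank fraction of the observed LKR among LKRs from samples simulated
    under the fitted null; extreme fractions lead to rejecting H0."""
    rng = random.Random(seed)
    a_hat, obs = mle_exp(xs, x0), lkr(xs, x0)
    sims = []
    for _ in range(n_boot):
        ys = [x0 - math.log(1.0 - rng.random()) / a_hat for _ in range(len(xs))]
        sims.append(lkr(ys, x0))
    return sum(s < obs for s in sims) / (n_boot + 1)

rng = random.Random(7)
data = [1.0 - math.log(1.0 - rng.random()) / 12.0 for _ in range(200)]
frac = lkr_rank(data, x0=1.0)   # data generated under H0
```

With data generated under the null, the rank fraction should typically not fall in the extreme tails.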

³In our simulation study, we chose to simulate 1000 samples.

What we remark is that, unlike the Cox or Atkinson statistic, the LKR statistic does not underreject the null hypothesis. Indeed, we have seen that when we test the Pareto against the Exponential by means of the Cox or Atkinson statistic, the finite sample levels are significantly smaller than the nominal ones. This is not the case with the LKR statistic (see table 6.6).

Nominal level    Finite sample level
  1%             0.8%
  3%             2.5%
  5%             4.5%
 10%             9.1%

Table 6.4: Finite sample level of LKR statistic (Gamma against Lognormal)

Nominal level    Finite sample level
  1%             0.8%
  3%             2.7%
  5%             4.1%
 10%             9.1%

Table 6.5: Finite sample level of LKR statistic (Exponential against Pareto)

Nominal level    Finite sample level
  1%             0.6%
  3%             2.6%
  5%             4.8%
 10%             8.6%

Table 6.6: Finite sample level of LKR statistic (Pareto against Exponential)

We also compare the power of the LKR statistic with the power of the Cox statistic. When the models are nested, consideration of type II errors leads to a satisfactory definition of power. But for tests of non-nested models, a more useful concept of power is the probability of making the correct decision, that is, of accepting the true model and rejecting the false one. Table 6.7 and table 6.8 give the power of, respectively, the Cox statistic and the LKR statistic when testing the Exponential distribution against the Pareto distribution. We simulated 1000 samples of 200 observations from a Pareto distribution with parameter values $\beta = 2, \ldots, 10$ and computed the proportion of samples which lead to the decision of rejecting the null hypothesis and accepting the alternative. We can see that the powers are roughly the same for the two approaches.

The conclusion that can be drawn from these results is that even if the LKR statistic with a parametric bootstrap is more computer intensive, in some cases it performs better than the Cox or Atkinson statistic, in that its finite sample level stays close to the nominal one. Moreover, it seems that we do not lose power with this approach. It would be interesting to determine exactly what the causes of the 'bad behaviour' of the Cox and Atkinson statistics with respect to some null hypotheses are, or even to try to define a family of null hypotheses for which these statistics behave nicely. However, we will see in the next section that with a robustified version of the Cox or Atkinson statistic, we avoid the problems highlighted in this section.

Parameter        Nominal levels
values β          1%      3%      5%     10%
  2.0            97.5    96.8    95.9    90.5
  3.0            93.6    94.5    94.2    93.7
  4.0            79.0    84.9    86.7    87.9
  5.0            64.8    71.8    76.4    78.1
  6.0            49.2    56.7    60.7    66.4
  7.0            36.4    45.4    50.2    57.9
  8.0            28.6    36.3    41.0    48.1
  9.0            23.5    32.8    37.5    43.1
 10.0            18.2    25.6    28.2    36.0

Table 6.7: Power (in %) of the Cox statistic (Exponential against Pareto)

6.4 Robustness properties

Tests with robustness properties are desirable. Hall (1985), in the context of regression models, noting that the full information MLE under normality is extremely sensitive to misspecifications of the error distribution, stated that '... the development of non-nested selection techniques based on more robust estimators would appear extremely desirable ...'. The aim of this section is to present a mathematical technique which is then used to assess the robustness properties of the Cox statistic.

Parameter        Nominal levels
values β          1%      3%      5%     10%
  2.0            99.3    98.1    96.0    91.0
  3.0            93.2    94.2    93.9    90.4
  4.0            77.4    83.0    84.8    87.7
  5.0            59.4    70.6    73.1    77.4
  6.0            43.0    53.6    58.7    66.9
  7.0            33.9    43.8    49.3    57.0
  8.0            23.5    35.5    40.2    48.3
  9.0            20.2    28.5    34.1    40.8
 10.0            16.3    24.4    29.6    36.5

Table 6.8: Power (in %) of the LKR statistic (Exponential against Pareto)

6.4.1 Robustness and tests

The investigation of the robustness of testing procedures goes back to Pearson (1931), who discovered the nonrobustness of the test of variances. Pearson's work can indeed be seen as the beginning of systematic research into the robustness of tests. Later, the word 'robustness' was first used by Box (1953), again in connection with tests. However, although much more has since been done on robust estimation, robust test procedures are still needed. The reason is quite obvious: we cannot estimate robustly the parameters of a model and then use unmodified procedures to test hypotheses about the parameters or about the estimated model.

There are two fundamental goals in robust testing. The level of a test should be stable under small, arbitrary departures from the null hypothesis (robustness of validity), and the test should retain good power under small, arbitrary departures from specified alternatives (robustness of efficiency).

The development of robust testing procedures follows the two main approaches to robustness, that is, the minimax approach introduced by Huber (1964) and the infinitesimal approach (Hampel 1968). Huber's minimax approach (see also Huber 1981, chap. 10) is based on the following idea: in the problem of testing a simple hypothesis against a simple alternative, find the test which maximizes the minimum power over a neighborhood of the alternative, under the side condition that the maximum level over a neighborhood of the hypothesis is bounded. The solution to this problem is the censored likelihood ratio test, which is based on a truncated likelihood ratio statistic (Huber 1965). This ensures that outlying points will have a bounded influence on the level and on the power of the test. Huber's idea led to an approach using shrinking neighborhoods (see Huber-Carol 1970, Rieder 1978, Bednarski 1982, Ronchetti 1987).

Hampel's infinitesimal approach led to testing procedures based on M-estimators.
These have been developed mainly for linear models (see Ronchetti 1982, Hampel et al. 1986 chap. 7, Markatou and Hettmansperger 1990, Markatou, Stahel and Ronchetti 1991, Silvapulle 1992, Markatou and Tsai 1992). Ronchetti (1979, 1982) and Rousseeuw and Ronchetti (1979, 1981) were the first to adapt Hampel's optimality problem for estimators to testing procedures, in the case of testing a null hypothesis about a one-dimensional parameter. Hampel's optimality problem for testing procedures can be stated as: under a bound on the influence of small contamination on the test's level and power (robustness requirement), maximize the power of the test at the ideal model (efficiency requirement). More recently, Héritier and Ronchetti (1992) have extended the existing theory on robust parametric tests to general parametric models. However, to our knowledge nothing has been done for model choice or goodness-of-fit tests. The aim of this chapter is to provide a robust model choice statistic based, as we will see, on the theory developed by Héritier and Ronchetti (1992). In the next chapter we discuss the problem of robustness and goodness-of-fit tests.

Our first step is to study the robustness properties of the Cox statistic. The tool we use is the level influence function (LIF) (Rousseeuw and Ronchetti 1979, Ronchetti 1982). It gives the influence of an infinitesimal amount of contamination in the data on the true level of the test, i.e. the level of the test at the model. In practice, this level is often approximated by the asymptotic level. The general technique consists of assuming a distribution in the neighborhood of the model under the null hypothesis and then studying the effect on the asymptotic level of the test. For tests, the neighborhood of the model at $H_0$ is given by

$$ G^0_{\varepsilon,n} = (1 - \varepsilon_n)F^0_\alpha + \varepsilon_n W \qquad (6.23) $$

where $W$ is any distribution.
To compute the LIF, we choose $W = \Delta_z$ and we define

$$ F^0_{\varepsilon,n} = (1 - \varepsilon_n)F^0_\alpha + \varepsilon_n \Delta_z \qquad (6.24) $$

With $F^0_{\varepsilon,n}$, the LIF determines the worst bias on the level (see Hampel et al. 1986). To study the effect of this model deviation, it is necessary to choose a contamination $\varepsilon_n$ that tends to zero at the rate $n^{-1/2}$, or its effects will soon dominate everything and give divergence (see Hampel et al. 1986, chap. 3). Therefore, $\varepsilon_n = n^{-1/2}\varepsilon$. With parametric tests, one has to choose the rate of convergence of the test statistic so as to avoid overlapping neighborhoods between the hypotheses under $H_0$ and $H_1$.

In the next subsection we first compute the asymptotic level of the test under $F^0_{\varepsilon,n}$ and then derive the LIF by taking the derivative of this level with respect to $\varepsilon$ at $\varepsilon = 0$ as $n \to \infty$.

6.4.2 Level influence function

As we have seen before, with the model choice test first proposed by Cox, if the null hypothesis $H_0$ is that the observations are distributed according to $F^0_\alpha$, we have under $H_0$ that⁴

$$ U_n = \sqrt{n}\, U_{Cox} = \frac{\sqrt{n}\left[ E_{F^{(n)}}(\hat\alpha; \hat\beta) - E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha} \right]}{\left\{ V(F^0_\alpha)\big|_{\alpha=\hat\alpha} \right\}^{1/2}} \;\overset{\sim}{\to}\; N[0;1] \qquad (6.25) $$

where

$$ E_{F^{(n)}}(\hat\alpha, \hat\beta) = \int \left[\log f^0(x;\hat\alpha) - \log f^1(x;\hat\beta)\right] dF^{(n)}(x) \qquad (6.26) $$

$$ E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha} = \int \left[\log f^0(x;\hat\alpha) - \log f^1(x;\beta_{\hat\alpha})\right] f^0(x;\hat\alpha)\,dx \qquad (6.27) $$

and where $V(F^0_\alpha)\big|_{\alpha=\hat\alpha}$ is a consistent estimator of the asymptotic variance, which depends on the empirical distribution $F^{(n)}$ only through the estimator $\hat\alpha$ of $\alpha$. We can also write $\hat\alpha$ and $\hat\beta$ as functionals of the empirical distribution, i.e. $\hat\alpha(F^{(n)})$ and $\hat\beta(F^{(n)})$.

The test decision will be to reject $H_0$ if $|U_n| > \kappa_\omega$ ($\kappa_\omega = \Phi^{-1}(1-\omega/2)$, where $\omega$ is the nominal level). This decision is taken by using the asymptotic distribution of $U_n$. Therefore, we define the true asymptotic level of the test as the probability that $|U_n|$ exceeds the critical value $\kappa_\omega$ at the model $F^0_\alpha$, when considering the asymptotic distribution of $U_n$. It is given by

$$ \omega = \omega(0) = 1 - \Phi(\kappa_\omega) \qquad (6.28) $$

Since the model is not always exactly true and the underlying distribution lies in a neighborhood of the model distribution $F^0_\alpha$, the actual asymptotic level will in general be biased. This can lead the test to false decisions. We consider the $\varepsilon$-contamination distribution given by

$$ F^0_{\varepsilon,n} = (1 - \varepsilon_n)F^0_\alpha + \varepsilon_n \Delta_z \qquad (6.29) $$

where $\varepsilon_n = \varepsilon/\sqrt{n}$ and $\Delta_z$ is the distribution which puts a point mass 1 at $z$.

⁴Again, the symbol $\overset{\sim}{\to}$ means convergence in distribution as $n \to \infty$.

Assuming $F^0_{\varepsilon,n}$ under $H_0$, we then have that

$$ U_n \;\overset{\sim}{\to}\; N\left[\mu(\varepsilon), \sigma^2(\varepsilon)\right] \qquad (6.30) $$

where

$$ \mu(\varepsilon) = \lim_{n\to\infty} \frac{\sqrt{n}\left[ E_{F^0_{\varepsilon,n}}\big(\hat\alpha(F^0_{\varepsilon,n}), \hat\beta(F^0_{\varepsilon,n})\big) - E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha(F^0_{\varepsilon,n})} \right]}{\left\{ V(F^0_\alpha)\big|_{\alpha=\hat\alpha(F^0_{\varepsilon,n})} \right\}^{1/2}} \qquad (6.31) $$

and

$$ \sigma^2(\varepsilon) = \lim_{n\to\infty} \frac{V(F^0_{\varepsilon,n})}{V(F^0_\alpha)\big|_{\alpha=\hat\alpha(F^0_{\varepsilon,n})}} \qquad (6.32) $$

Note that $\hat\alpha(F^0_{\varepsilon,n})$ and $\hat\beta(F^0_{\varepsilon,n})$ are the MLE (functional) solutions of

$$ \int \frac{\partial}{\partial\alpha}\log f^0(x;\alpha)\,dF^0_{\varepsilon,n}(x) = 0 \qquad (6.33) $$

and

$$ \int \frac{\partial}{\partial\beta}\log f^1(x;\beta)\,dF^0_{\varepsilon,n}(x) = 0 \qquad (6.34) $$

The actual asymptotic level under the hypothetical model $F^0_{\varepsilon,n}$ is then given by

$$ \omega(\varepsilon) = 1 - \Phi\left(\frac{\kappa_\omega - \mu(\varepsilon)}{\sigma(\varepsilon)}\right) = 1 - \Phi(z(\varepsilon)) \qquad (6.35) $$

The bias on the asymptotic level can be approximated by a first order Taylor expansion of $\omega(\varepsilon)$ around $\omega(0)$, i.e.

$$ \omega(\varepsilon) - \omega(0) = \varepsilon \cdot \frac{\partial\omega(\varepsilon)}{\partial\varepsilon}\bigg|_{\varepsilon=0} + O(\varepsilon^2) \qquad (6.36) $$

We have that

$$ \frac{\partial\omega(\varepsilon)}{\partial\varepsilon}\bigg|_{\varepsilon=0} = -\frac{\partial}{\partial\varepsilon}\Phi\left(\frac{\kappa_\omega - \mu(\varepsilon)}{\sigma(\varepsilon)}\right)\bigg|_{\varepsilon=0} = \frac{\partial}{\partial z}\Phi(z(\varepsilon))\bigg|_{\varepsilon=0}\cdot\frac{\partial\mu(\varepsilon)}{\partial\varepsilon}\bigg|_{\varepsilon=0} + \frac{\partial}{\partial z}\Phi(z(\varepsilon))\bigg|_{\varepsilon=0}\cdot\frac{\partial\sigma^2(\varepsilon)}{\partial\varepsilon}\bigg|_{\varepsilon=0} \qquad (6.37) $$

It is easy to show that

$$ \frac{\partial}{\partial\varepsilon}\sigma^2(\varepsilon)\bigg|_{\varepsilon=0} = 0 \qquad (6.38) $$

Hence, since the variance is of order $n^{-1}$ and $\varepsilon_n$ of order $n^{-1/2}$, the effect of $\varepsilon$ on the variance vanishes when $n \to \infty$. Therefore, the bias on the asymptotic level will be proportional to the effect on the mean of the asymptotic distribution of $U_n$. We have that

$$ \frac{\partial}{\partial\varepsilon}\mu(\varepsilon)\bigg|_{\varepsilon=0} = \lim_{n\to\infty}\left\{ \frac{\sqrt{n}\,\frac{\partial}{\partial\varepsilon}\left[ E_{F^0_{\varepsilon,n}}\big(\hat\alpha(F^0_{\varepsilon,n}),\hat\beta(F^0_{\varepsilon,n})\big) - E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha(F^0_{\varepsilon,n})}\right]_{\varepsilon=0}}{[V(F^0_\alpha)]^{1/2}} - \frac{\sqrt{n}\left[ E_{F^0_\alpha}\big(\hat\alpha(F^0_\alpha),\hat\beta(F^0_\alpha)\big) - E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha(F^0_\alpha)}\right]\cdot\frac{\partial}{\partial\varepsilon} V(F^0_\alpha)\big|_{\alpha=\hat\alpha(F^0_{\varepsilon,n})}\Big|_{\varepsilon=0}}{[V(F^0_\alpha)]^{3/2}} \right\} \qquad (6.39) $$

Since

$$ E_{F^0_\alpha}\big(\hat\alpha(F^0_\alpha), \hat\beta(F^0_\alpha)\big) - E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha(F^0_\alpha)} = 0 \qquad (6.40) $$

the second term on the right hand side of (6.39) vanishes. The bias then depends on the derivatives of $E_{F^0_{\varepsilon,n}}\big(\hat\alpha(F^0_{\varepsilon,n}), \hat\beta(F^0_{\varepsilon,n})\big)$ and of $E_{F^0_\alpha}(\alpha,\beta)\big|_{\alpha=\hat\alpha(F^0_{\varepsilon,n})}$ at $\varepsilon = 0$ and when $n \to \infty$. We have

$$ \frac{\partial}{\partial\varepsilon} E_{F^0_{\varepsilon,n}}\big(\hat\alpha(F^0_{\varepsilon,n}), \hat\beta(F^0_{\varepsilon,n})\big)\bigg|_{\varepsilon=0} = -\frac{1}{\sqrt{n}}\int \left[\log f^0(x;\alpha) - \log f^1(x;\beta_\alpha)\right] dF^0_\alpha(x) + \frac{1}{\sqrt{n}}\left[\log f^0(z;\alpha) - \log f^1(z;\beta_\alpha)\right] + \int s^0(x;\alpha)\,dF^0_\alpha(x)\cdot\frac{\partial}{\partial\varepsilon}\hat\alpha(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} - \int s^1(x;\beta_\alpha)\,dF^0_\alpha(x)\cdot\frac{\partial}{\partial\varepsilon}\hat\beta(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} \qquad (6.41) $$

In the same way we get

$$ \frac{\partial}{\partial\varepsilon} E_{F^0_\alpha}(\alpha,\beta)\bigg|_{\alpha=\hat\alpha(F^0_{\varepsilon,n}),\,\varepsilon=0} = \int s^0(x;\alpha)\,dF^0_\alpha(x)\cdot\frac{\partial}{\partial\varepsilon}\hat\alpha(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} - \int s^1(x;\beta_\alpha)\,dF^0_\alpha(x)\cdot\frac{\partial\beta_\alpha}{\partial\alpha}\cdot\frac{\partial}{\partial\varepsilon}\hat\alpha(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} + \int \left[\log f^0(x;\alpha) - \log f^1(x;\beta_\alpha)\right] s^0(x;\alpha)\,dF^0_\alpha(x)\cdot\frac{\partial}{\partial\varepsilon}\hat\alpha(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} \qquad (6.42) $$

Moreover, since $\frac{\partial}{\partial\varepsilon}\hat\alpha(F^0_{\varepsilon,n})\big|_{\varepsilon=0} = \frac{\partial}{\partial\varepsilon_n}\hat\alpha(F^0_{\varepsilon,n})\big|_{\varepsilon=0}\cdot\frac{1}{\sqrt{n}}$ and $\int s^1(x;\beta_\alpha)\,dF^0_\alpha(x) = 0$, we finally get

$$ \frac{\partial}{\partial\varepsilon}\mu(\varepsilon)\bigg|_{\varepsilon=0} = \frac{1}{\sqrt{V(F^0_\alpha)}} \left\{ -\int \left[\log f^0(x;\alpha) - \log f^1(x;\beta_\alpha)\right] dF^0_\alpha(x) + \log f^0(z;\alpha) - \log f^1(z;\beta_\alpha) - \int \left[\log f^0(x;\alpha) - \log f^1(x;\beta_\alpha)\right] s^0(x;\alpha)\,dF^0_\alpha(x)\cdot\frac{\partial}{\partial\varepsilon_n}\hat\alpha(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} \right\} \qquad (6.43) $$

The bias on the asymptotic level is then given by

$$ \omega(\varepsilon) - \omega(0) = -\frac{2\varepsilon}{\sqrt{V(F^0_\alpha)}}\int_{-\infty}^{\kappa_\omega} y\,d\Phi(y)\cdot \left\{ -\int \left[\log f^0(x;\alpha) - \log f^1(x;\beta_\alpha)\right] dF^0_\alpha(x) + \log f^0(z;\alpha) - \log f^1(z;\beta_\alpha) - \int \left[\log f^0(x;\alpha) - \log f^1(x;\beta_\alpha)\right] s^0(x;\alpha)\,dF^0_\alpha(x)\cdot IF^*(z,\hat\alpha,F^0_\alpha) \right\} + O(\varepsilon^2) \qquad (6.44) $$

where

$$ IF^*(z,\hat\alpha,F^0_\alpha) = \lim_{n\to\infty}\frac{\partial}{\partial\varepsilon_n}\hat\alpha(F^0_{\varepsilon,n})\bigg|_{\varepsilon=0} = \frac{s^0(z;\alpha)}{-\int \frac{\partial}{\partial\alpha} s^0(x;\alpha)\,dF^0_\alpha(x)} \qquad (6.45) $$

This means that a single observation $z$ for which $\{\log f^0(z;\alpha) - \log f^1(z;\beta_\alpha)\}$ or $s^0(z;\alpha)$ is large can make the bias on the asymptotic level very large and lead the test to false decisions. Indeed, while the second term is, up to a multiplicative constant, the IF of the estimator of the parameter under the null hypothesis, the first term is directly related to the influence on the test statistic. To be more precise, suppose that $\alpha$ is estimated robustly; then $IF^*(z,\hat\alpha,F^0_\alpha)$ is bounded and the influence on the asymptotic level is proportional to $\{\log f^0(z;\alpha) - \log f^1(z;\beta_\alpha)\}$, which can be very large. For example, if we want to test the Gamma against the Lognormal, the difference between the log-likelihood functions evaluated at any point $z$ is given by

$$ \left\{\alpha_1\log(\alpha_2) - \log(\Gamma(\alpha_1)) + (\alpha_1-1)\log(z) - \alpha_2 z\right\} - \left\{-\log(z) - \frac{1}{2}\log(2\pi\beta_2^2) - \frac{1}{2\beta_2^2}\left[\log(z) - \beta_1\right]^2\right\} = \left(\alpha_1 - \frac{\beta_1}{\beta_2^2}\right)\log(z) + \frac{1}{2\beta_2^2}\log^2(z) - \alpha_2 z + Q(\alpha_1,\alpha_2,\beta_1,\beta_2) \qquad (6.46) $$

which can take very large values. With the Atkinson statistic, it is easy to show that we obtain a similar result.
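To see numerically that the difference in (6.46) is unbounded in the observation, one can evaluate it for increasingly large $z$; the parameter values below are illustrative:

```python
import math

def loglik_diff(z, a1, a2, b1, b2):
    """log f_Gamma(z; a1, a2) - log f_Lognormal(z; b1, b2), as in (6.46)."""
    lg = a1 * math.log(a2) - math.lgamma(a1) + (a1 - 1) * math.log(z) - a2 * z
    ln = -math.log(z) - 0.5 * math.log(2 * math.pi * b2 ** 2) \
         - (math.log(z) - b1) ** 2 / (2 * b2 ** 2)
    return lg - ln

# |difference| grows without bound as z increases (the term -a2*z dominates)
vals = [abs(loglik_diff(z, 3.0, 1.0, 1.0, 0.5)) for z in (10.0, 100.0, 1000.0)]
```

A single gross error far in the tail therefore contributes an arbitrarily large term to the level bias.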

6.4.3 Simulation results

In order to illustrate the above results, we performed a simulation and computed the sample level when the model is not exact. We contaminated 1000 samples of 200 observations from a Gamma distribution with parameters $\alpha_1 = 3.0$ (shape parameter) and $\alpha_2 = 1.0$ (scale parameter) with the model contamination

$$ \left(1 - \frac{\varepsilon}{\sqrt{n}}\right) F_{\alpha_1,\alpha_2} + \frac{\varepsilon}{\sqrt{n}}\, F_{\alpha_1,\,0.1\cdot\alpha_2} \qquad (6.47) $$

where $\varepsilon = 1\%, 2\%$, and computed the sample levels. The results are shown in tables 6.9 and 6.10.

Nominal level    Actual levels (in %)
                 Cox      Atk
  1%             4.3%     4.4%
  3%             6.2%     5.8%
  5%             8.0%     7.8%
 10%            11.4%    11.3%

Table 6.9: Actual levels (in %) of Cox and Atkinson statistics under model contamination (ε =1%) (Gamma against Lognormal)

The actual levels are the probabilities that the test statistics computed from the simulated and contaminated samples exceed the critical value $\kappa_\omega$,

Nominal level    Actual levels (in %)
                 Cox      Atk
  1%            12.5%    12.4%
  3%            15.2%    14.4%
  5%            17.2%    16.9%
 10%            20.8%    20.3%

Table 6.10: Actual levels (in %) of Cox and Atkinson statistics under model contamination ($\varepsilon = 2\%$) (Gamma against Lognormal)

where $\omega$ is the nominal level (we actually estimated each probability by the corresponding frequency among the 1000 samples). As the amounts of contamination are very small, one would expect the actual level to be close to the nominal one. However, the results show that the difference can be large: a researcher who believes to be working at the 5% level might actually be working at a level of 17.2%! The tests are very sensitive to small amounts of gross errors: it is therefore important to develop robust procedures.
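The contaminated samples used here can be generated as a two-component mixture. A sketch, treating $\alpha_2$ as the inverse scale (consistent with the Gamma density in (6.46)) and writing p for the contaminated fraction $\varepsilon_n$:

```python
import random

def contaminated_gamma(n, p, a1, a2, seed=0):
    """Draw n points: with prob. 1-p from Gamma(shape=a1, rate=a2), with
    prob. p from Gamma(shape=a1, rate=0.1*a2), i.e. a 10x larger scale,
    mimicking the mixture in (6.47)."""
    rng = random.Random(seed)
    return [rng.gammavariate(a1, 1.0 / (0.1 * a2) if rng.random() < p else 1.0 / a2)
            for _ in range(n)]

xs = contaminated_gamma(200, p=0.01, a1=3.0, a2=1.0)
```

Note that `random.gammavariate` is parametrized by shape and scale, so the rate $\alpha_2$ enters as its reciprocal; the contaminating component produces occasional large observations.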

6.5 Robust model choice tests

In this section, we first present some 'ad hoc' robust versions of Cox-type statistics and discuss their properties. Secondly, since the Cox statistic can be viewed as a parametric LM test, we present an optimal robust version by using the work of Héritier and Ronchetti (1992) on robust bounded-influence tests in general parametric models. We end the section with some simulation results.

6.5.1 Some ad hoc robust tests

As pointed out above when we investigated the robustness properties of the Cox statistic, one should not only robustify the statistic itself, but also the estimators of the parameters. As a first step, we propose to separate the parameter estimation problem from the computation of the test statistic. That is, we first compute robust estimators for the parameters and then, given these estimators, compute a robust Cox-type statistic, namely a robust LKR statistic.

The argument is the following. Suppose we know the parameters $\alpha$ and $\beta$; then the test reduces to one of a simple hypothesis against a simple alternative. A robust test has already been developed for this case by Huber (1965): the censored likelihood ratio test. The corresponding statistic is given by

$$ \prod_{i=1}^{n} \left[\frac{f^0(x_i;\alpha)}{f^1(x_i;\beta)}\right]_{c'}^{c''} \qquad (6.48) $$

where $[\,\cdot\,]_{c'}^{c''}$ denotes truncation at the lower and upper bounds $c'$ and $c''$. In our case, if we have robust estimators for the parameters $\alpha$ and $\beta$, we can consider that we are in the case studied by Huber (if the proportion of outliers is not too large). This is a very simple and restrictive hypothesis, but since the LIF of the Cox statistic depends only on the IF of the estimator of $\alpha$ and on the value of $\log f^0(x_i;\alpha) - \log f^1(x_i;\beta)$, and since the former is bounded when $\alpha$ is robustly estimated, the bias on the level will only depend on the latter expression. It should be stressed that the parameter $\beta$ should also be robustly estimated in order to have greater power. The first robust test statistic we propose is therefore given by

$$ U_{LKR1} = \sum_{i=1}^{n} \left[\log\frac{f^0(x_i;\tilde\alpha)}{f^1(x_i;\tilde\beta)}\right]_{c'}^{c''} \qquad (6.49) $$

where $\tilde\alpha$ and $\tilde\beta$ are OBRE. One of the main problems with this approach is that we do not know the distribution of $U_{LKR1}$. This problem can be solved by a parametric bootstrap. However, it should be stressed that this approach is very computer intensive, since we have to compute OBRE not only for the observed sample, but also for the simulated ones. We tried to avoid this by computing the MLE from the simulated samples (because these samples are not contaminated), but the results were very poor.

Another main problem is how to choose the bounds $c'$ and $c''$. In Huber (1965), they are defined by the neighborhood structure and by the percentage of contamination. However, in a practical case, we do not know how the data are contaminated. One idea is to choose them such that the probability $\delta$ under $F^0_\alpha$ that the log-likelihood ratio exceeds these bounds is small. This way, only a small proportion of 'good observations' is rejected, and if the sample contains gross errors, then the resulting statistic will exceed the fixed bounds anyway. To be more precise, $c'$ and $c''$ are determined implicitly by means of

$$ \int_{-\infty}^{c'} \log\left[\frac{f^0(x;\tilde\alpha)}{f^1(x;\tilde\beta)}\right] f^0(x;\tilde\alpha)\,dx = \frac{\delta}{2}\cdot\int \log\left[\frac{f^0(x;\tilde\alpha)}{f^1(x;\tilde\beta)}\right] f^0(x;\tilde\alpha)\,dx $$

$$ \int_{c''}^{\infty} \log\left[\frac{f^0(x;\tilde\alpha)}{f^1(x;\tilde\beta)}\right] f^0(x;\tilde\alpha)\,dx = \frac{\delta}{2}\cdot\int \log\left[\frac{f^0(x;\tilde\alpha)}{f^1(x;\tilde\beta)}\right] f^0(x;\tilde\alpha)\,dx \qquad (6.50) $$

This method is very subjective, and it could be argued that it is more reasonable to guess the percentage of contamination and compute $c'$ and $c''$ as in Huber (1965).

The second approach we propose exploits the properties of the OBRE. One of their advantages is that during the computation procedure, weights corresponding to each observation are calculated. If an observation has a large influence on the estimator, or if it is far from the bulk of the data, or if it is a gross error, then the corresponding weight is near 0. Hence, we could use these weights in the LKR statistic, because if an observation is a gross error for the parameter estimator, it should also be a gross error for the test statistic. More precisely, the procedure for the computation of a robust LKR statistic ($U_{LKR2}$) is the following:

• Compute the OBRE $\tilde\alpha$ for the parameters of the model under $H_0$ and keep the weights $W^0_c(x_i)$

• Compute the OBRE $\tilde\beta$ for the parameters of the model under $H_1$

• Compute

$$ U_{LKR2} = \sum_{i=1}^{n} \log\left[\frac{f^0(x_i;\tilde\alpha)}{f^1(x_i;\tilde\beta)}\right] W^0_c(x_i) \qquad (6.51) $$

Then we use the bootstrap to approximate the distribution of $U_{LKR2}$.

In a third approach we propose to use the same idea as in the second approach, but more drastically. Instead of computing a weighted LKR statistic for which the weights are not computed in an optimal way, we could simply remove the observations for which the weights $W^0_c(x_i)$ are small. We could then compute the LKR statistic with the 'screened' sample. However, it is difficult to decide only by means of the weights whether an observation is an outlier.

We have studied the second and third procedures proposed here by means of simulations. Since we do not know the distribution of these test statistics, we have to use a parametric bootstrap in order to take a decision. The procedure for both test statistics is then the following:

• from the sample of observations, compute the OBRE $\tilde\alpha$ and $\tilde\beta$ of $\alpha$ and $\beta$, then compute the "observed" test statistic

• simulate a large enough number (say 1000) of samples of the same size as the observed one from $F^0_{\tilde\alpha}$

• compute $\tilde\alpha$ and $\tilde\beta$ for each of the simulated samples and compute the "simulated" test statistics

• rank the “observed” test statistic among the “simulated” test statistics and take a decision
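The final ranking step amounts to a two-sided decision rule of the following kind (a sketch; the cutoff convention used here is one of several possible):

```python
def bootstrap_decision(observed, simulated, level=0.05):
    """Reject H0 if the observed statistic falls in the lower or upper
    level/2 tail of the simulated (bootstrap) distribution."""
    sims = sorted(simulated)
    # number of extreme ranks per tail
    k = max(1, int((level / 2) * (len(sims) + 1)))
    lo, hi = sims[k - 1], sims[len(sims) - k]
    return observed < lo or observed > hi

sims = list(range(1000))                 # stand-in for 1000 simulated statistics
print(bootstrap_decision(1200, sims))    # upper tail -> True
print(bootstrap_decision(500.5, sims))   # central -> False
```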

This is a very computer intensive procedure, especially if we want to study the test statistics by simulation (in which case we have several simulated "samples of observations"). We tried to reduce the computation time by computing the MLE for the simulated samples, but our results were very poor. We did not investigate any further because we found a better alternative, presented in subsection 6.5.3.

6.5.2 Robust bounded-influence LM test

In this subsection, we briefly present the robust version of the parametric LM test developed by Héritier and Ronchetti (1992). Consider a general parametric model $\{F_\theta\}$ and partition the parameter as $\theta = (\theta_{(1)}^T, \theta_{(2)}^T)^T$, where $\theta_{(2)}$ has dimension $q$. We are interested in testing the null hypothesis $H_0: \theta_{(2)} = 0$. The test statistic is

$$ R^2 = U_{GLM}^T\, C^{-1}\, U_{GLM} \qquad (6.53) $$

where

$$ U_{GLM} = \frac{1}{n}\sum_{i=1}^{n} \psi(x_i; T^*_\psi)_{(2)} \qquad (6.54) $$

$\psi(\cdot;\cdot)_{(2)}$ is the second part, of dimension $q \times 1$, of the vector $\psi(\cdot;\cdot)$, and $T^*_\psi$ is the M-estimator in the reduced model, i.e. the solution of the equation

$$ \sum_{i=1}^{n} \psi(x_i; T^*_\psi)_{(1)} = 0, \quad \text{with } T^*_{\psi(2)} = 0 \qquad (6.55) $$

The matrix $C$ depends on the asymptotic covariance matrix of the M-estimator, given by

$$ V(\psi, F_\theta) = M(\psi, F_\theta)^{-1} \int \psi(x;\theta)\,\psi^T(x;\theta)\,dF_\theta(x)\; M(\psi, F_\theta)^{-T} \qquad (6.56) $$

where

$$ M(\psi, F_\theta) = -\int \frac{\partial}{\partial\theta}\psi(x;\theta)\,dF_\theta(x) \qquad (6.57) $$

$C$ is given by

$$ C = M_{(22.1)}\, V_{(22)}\, M_{(22.1)}^T \qquad (6.58) $$

where

$$ M_{(22.1)} = M_{(22)} - M_{(21)} M_{(11)}^{-1} M_{(12)} \qquad (6.59) $$

and $C$ can be estimated consistently by replacing $\theta$ by $T^*_\psi$. Héritier and Ronchetti (1992) show that under the null hypothesis, $R^2$ is asymptotically distributed according to a $\chi^2_q$ distribution. They compute the IF of the level and find that it is proportional to the square of the IF of the statistic $U_{GLM}$. The latter is equal to the self-standardized IF of the estimator $T_{\psi(2)}$, given by

$$ IF(z; T_{\psi(2)}, F_\theta)^T\; V(\psi, F_\theta)_{(22)}^{-1}\; IF(z; T_{\psi(2)}, F_\theta) \qquad (6.60) $$

The robust version of the LM test is based on the optimally robust self-standardized estimator $T_{\psi(2)}$ with partitioned parameters given in Hampel et al. (1986). For a given bound $c \geq \sqrt{q}$, it is based on the following $\psi$-function:

$$ \psi_c(x;\theta)_{(1)} = A_{(11)}\, s(x;\theta)_{(1)} $$
$$ \psi_c(x;\theta)_{(2)} = h_c\!\left( \left[A\,(s(x;\theta) - a)\right]_{(2)} \right) \qquad (6.61) $$

where $h_c(x)$ is the Huber function, $a_{(1)} = 0$, and the $q$-dimensional vector $a_{(2)}$ and the lower triangular matrix $A$ are determined by the equations

\int \psi_c(x;\theta)_{(2)}\, dF_\theta(x) = 0, \qquad \int \psi_c(x;\theta)\, \psi_c^T(x;\theta)\, dF_\theta(x) = I \qquad (6.62)
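To make the role of h_c concrete, here is a minimal sketch (in Python, chosen purely for illustration; the function name is ours) of the multivariate Huber function, which leaves a vector unchanged inside the ball of radius c and shrinks it radially onto the ball otherwise:

```python
import math

def huber(v, c):
    """Multivariate Huber function: h_c(v) = v * min(1, c / ||v||)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= c:
        return list(v)
    return [x * c / norm for x in v]

print(huber([3.0, 4.0], 10.0))  # norm 5 <= 10: vector unchanged
print(huber([3.0, 4.0], 2.5))   # norm 5 > 2.5: rescaled to [1.5, 2.0]
```

This is exactly the truncation that bounds the influence of any single observation on the score.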

The solution for ψ leads to a simplification in the expression of the statistic. Indeed, we have

V(\psi, F_\theta) = M(\psi, F_\theta)^{-1} M(\psi, F_\theta)^{-T} \qquad (6.63)

where

M(\psi, F_\theta) = \begin{pmatrix} A_{(11)}^{-1} & \int \psi_c(x;\theta)_{(1)}\, s^T(x;\theta)_{(2)}\, dF_\theta(x) \\ 0 & \int \psi_c(x;\theta)_{(2)}\, s^T(x;\theta)_{(2)}\, dF_\theta(x) \end{pmatrix} \qquad (6.64)

The inverse is given by

M^{-1}(\psi, F_\theta) = \begin{pmatrix} M_{(11)}^{-1} & -M_{(11)}^{-1} M_{(12)} M_{(22)}^{-1} \\ 0 & M_{(22)}^{-1} \end{pmatrix} \qquad (6.65)

and therefore

C = M_{(22)} M_{(22)}^{-1} M_{(22)}^{-T} M_{(22)}^T = I \qquad (6.66)

The robust LM statistic is then given by

R^2 = U_{GLM}^T U_{GLM} = \left[ \frac{1}{n} \sum_{i=1}^n \psi_c\big(x_i; (T_{MLE(1)}, 0)\big)_{(2)} \right]^T \cdot \left[ \frac{1}{n} \sum_{i=1}^n \psi_c\big(x_i; (T_{MLE(1)}, 0)\big)_{(2)} \right] \qquad (6.67)
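Since C = I, evaluating (6.67) reduces to the squared norm of the average \psi-score. A minimal sketch, assuming q = 1 and using hypothetical scalar scores in place of \psi_c(x_i; (T_{MLE(1)}, 0))_{(2)}:

```python
def robust_lm_statistic(psi_scores):
    """R^2 = U_GLM^T U_GLM with U_GLM the average psi score.

    For q = 1 the scores are scalars; for q > 1 replace the square
    below by the dot product of the mean score vector with itself.
    """
    u = sum(psi_scores) / len(psi_scores)
    return u * u

# hypothetical scores for a sample of size 4
print(robust_lm_statistic([0.2, -0.1, 0.3, 0.0]))  # mean 0.1, so ~0.01
```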

6.5.3 Robust Cox-type statistic

In this subsection we apply the results of Heritier and Ronchetti (1992) to the Cox and Atkinson statistics when they are interpreted as a LM test (Dastoor 1985). In subsection 6.2.3, we saw that if we construct the comprehensive model

f^c(x;\theta) = \{f^0(x;\alpha)\}^\lambda \{f^1(x;\tilde\beta)\}^{1-\lambda} \left[ \int \{f^0(y;\alpha)\}^\lambda \{f^1(y;\tilde\beta)\}^{1-\lambda}\, dy \right]^{-1} \qquad (6.68)

where \theta = (\alpha, \lambda)^T and \tilde\beta is an estimator of \beta, then testing the null hypothesis H_0 : \lambda = 1 against the alternative H_1 : \lambda \neq 1 leads to the construction of the Cox, Atkinson or White statistic, depending on the choice of \tilde\beta. Without loss of generality, in the following we will consider the particular case of the Atkinson statistic. Therefore, \tilde\beta is the pseudo-MLE \tilde\beta_{\hat\alpha}.

Under H_0 the scores function corresponding to f^c is given by

s^c(x;\theta) = \frac{\partial}{\partial\theta} \log f^c(x;\theta) \Big|_{\lambda=1} = \begin{pmatrix} s^c(x;\theta)_{(1)} \\ s^c(x;\theta)_{(2)} \end{pmatrix} \qquad (6.69)

where

s^c(x;\theta)_{(1)} = \frac{\partial}{\partial\alpha} \log f^c(x;\theta) \Big|_{\lambda=1} = \frac{\partial}{\partial\alpha} \log f^0(x;\alpha) = s^0(x;\alpha) \qquad (6.70)

and

s^c(x;\theta)_{(2)} = \frac{\partial}{\partial\lambda} \log f^c(x;\theta) \Big|_{\lambda=1} = \log f^0(x;\alpha) - \log f^1(x;\tilde\beta) - \int \left[ \log f^0(y;\alpha) - \log f^1(y;\tilde\beta) \right] f^0(y;\alpha)\, dy = s_{Cox}(x;\alpha,\tilde\beta) \qquad (6.71)

The corresponding \psi_c-function is then given by

\psi_c(x;\theta) = \begin{pmatrix} A_{(11)}\, s^0(x;\alpha) \\ \big( A_{(21)} s^0(x;\alpha) + A_{(22)} [s_{Cox}(x;\alpha,\tilde\beta) - a_{(2)}] \big)\, W_c(x;\alpha,\tilde\beta) \end{pmatrix} \qquad (6.72)

where

W_c(x;\alpha,\tilde\beta) = \min\left\{ 1; \; \frac{c}{\left\| A_{(21)} s^0(x;\alpha) + A_{(22)} [s_{Cox}(x;\alpha,\tilde\beta) - a_{(2)}] \right\|} \right\} \qquad (6.73)

The robust Atkinson statistic is finally given by

U_{GLM} = \frac{1}{n} \sum_{i=1}^n \left( A_{(21)} s^0(x_i;\hat\alpha) + A_{(22)} [s_{Cox}(x_i;\hat\alpha,\tilde\beta) - a_{(2)}] \right) W_c(x_i;\hat\alpha,\tilde\beta) \qquad (6.74)

where \hat\alpha is the MLE of \alpha and \tilde\beta is the pseudo-MLE of \beta, \tilde\beta_{\hat\alpha}. Knowing that the IF on the level (see (6.44)) is proportional to s^0(x;\alpha) and to \log f^0(x;\alpha) - \log f^1(x;\beta), we see that by using the robust version of the LM test with the comprehensive model (6.68), we bound exactly the right quantity. Note that U_{GLM} includes the Atkinson statistic as a limiting case, when c \to \infty.

To compute the robust LM statistic, we propose to use the same algorithm as for the OBRE. That is:

Step 1: Compute the MLE for \alpha and set \theta = (\hat\alpha, 1)^T.

Step 2: Compute the initial matrix A and vector a by means of

a = 0 \qquad (6.75)

A^{-1} A^{-T} = \int s^c(x;\theta)\, s^c(x;\theta)^T f^0(x;\hat\alpha)\, dx \qquad (6.76)

Step 3: Solve for A_{(21)}, A_{(22)} and a_{(2)} the following implicit equations:

\int \psi_c(x;\theta)_{(2)}\, f^0(x;\hat\alpha)\, dx = 0 \qquad (6.77)

\int \psi_c(x;\theta)_{(2)}\, \psi_c(x;\theta)_{(1)}^T\, f^0(x;\hat\alpha)\, dx = 0 \qquad (6.78)

\int \psi_c(x;\theta)_{(2)}\, \psi_c(x;\theta)_{(2)}^T\, f^0(x;\hat\alpha)\, dx = 1 \qquad (6.79)

where A_{(12)} = a_{(1)} = 0.

Step 4: Compute U_{GLM} given in (6.74).

Step 5: At the nominal level \omega, accept H_0 if |\sqrt{n}\, U_{GLM}| < \kappa_\omega, where \kappa_\omega = \Phi^{-1}(1 - \omega/2).

It should be stressed that the third step is not straightforward to compute. Unlike with parameter estimation, one cannot solve equations (6.77), (6.78) and (6.79) by means of an iterative process. One has instead to use a routine to find the roots of a system of nonlinear equations. In appendix D we give a more detailed expression of this system.
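Step 5 is a two-sided test based on the standard normal quantile. A minimal sketch of the decision rule (the value of u_glm is assumed to have been obtained from (6.74); the function name is ours):

```python
import math
from statistics import NormalDist

def accept_h0(u_glm, n, omega=0.05):
    """Accept H0 at nominal level omega if |sqrt(n) * U_GLM| < kappa_omega,
    with kappa_omega = Phi^{-1}(1 - omega / 2)."""
    kappa = NormalDist().inv_cdf(1.0 - omega / 2.0)
    return abs(math.sqrt(n) * u_glm) < kappa

print(accept_h0(0.05, 200))  # sqrt(200)*0.05 ~ 0.71 < 1.96: True
print(accept_h0(0.20, 200))  # sqrt(200)*0.20 ~ 2.83 > 1.96: False
```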

6.5.4 Simulation study

In order to study the robustness properties of U_{GLM} compared to the classical Atkinson statistic, we tested the Pareto against the Exponential distribution on contaminated and non-contaminated Pareto samples. We simulated 1000 samples of 200 observations from a Pareto distribution with parameters \alpha = 3.0 and x_0 = 0.5, which we contaminated by means of

\left(1 - \frac{\varepsilon}{\sqrt{200}}\right) F_{\alpha, x_0} + \frac{\varepsilon}{\sqrt{200}}\, F_{\alpha, 10 \cdot x_0} \qquad (6.80)

For amounts of contamination from \varepsilon = 0% to \varepsilon = 7%, tables 6.11 and 6.12 give the actual levels of the robust (c = 2.0) and classical Atkinson statistics when testing the Pareto against the Exponential distribution.

Amount of        Nominal levels
contamination    1%     3%     5%     10%
0%               1.3    3.5    5.5    10.2
1%               1.8    3.1    5.3    10.3
2%               0.6    2.5    4.6    10.1
3%               1.2    3.3    5.1    10.3
4%               1.7    4.2    6.0    12.6
5%               0.5    2.3    4.5    10.3
6%               1.4    3.6    5.4    10.7
7%               0.9    3.2    5.4     9.4

Table 6.11: Actual levels (in %) of the robust Atkinson statistic (c = 2.0) with contamination (Pareto against Exponential)
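The contaminated samples of (6.80) can be generated by the inversion method, X = x_0 U^{-1/\alpha} for U uniform on (0, 1), drawing each observation from the contaminating Pareto F_{\alpha, 10 \cdot x_0} with the mixture probability. A sketch (function and variable names are ours; the mixture weight is passed in directly as p):

```python
import random

def contaminated_pareto(n, alpha, x0, p, rng):
    """Draw n observations from (1 - p) F_{alpha, x0} + p F_{alpha, 10*x0},
    where p plays the role of the mixture weight eps/sqrt(200) in (6.80)."""
    sample = []
    for _ in range(n):
        scale = 10.0 * x0 if rng.random() < p else x0
        # Pareto inversion: F(x) = 1 - (scale / x)^alpha for x >= scale
        sample.append(scale * rng.random() ** (-1.0 / alpha))
    return sample

rng = random.Random(0)
p = 0.05 / 200 ** 0.5  # one reading of eps = 5% in (6.80)
sample = contaminated_pareto(200, 3.0, 0.5, p, rng)
print(len(sample), min(sample) >= 0.5)  # 200 True: Pareto support is [x0, inf)
```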

As expected, we find that the robustified Atkinson statistic is not influenced by small departures from the model under the null hypothesis; that is, the actual level when computing the robustified Atkinson statistic is very

Amount of        Nominal levels
contamination    1%     3%     5%     10%
0%               2.1    3.1    3.5     5.2
1%               2.9    3.7    5.4    10.3
2%               5.2    7.3    8.9    10.9
3%               6.3    8.7   10.3    14.7
4%               9.4   12.5   14.5    18.3
5%              10.3   15.4   17.1    21.9
6%              13.1   18.5   22.5    27.6
7%              15.1   20.7   23.5    29.9

Table 6.12: Actual levels (in %) of the classical Atkinson statistic with contamination (Pareto against Exponential)

stable. On the other hand, the classical statistic is heavily influenced already by small amounts of contamination. The next step will be to investigate the power of this test. This would enable us to find the estimator of \beta which maximizes the power, and also to quantify the power loss due to the robustification of the Atkinson statistic. However, this work is left to future research.

6.6 Conclusion

In this chapter we have focused on model choice procedures for separate models. This problem is not particular to the study of PID models, although particular procedures are used in this context; indeed, most of the attention seems to have been devoted to econometric models. The first proposed statistic, and perhaps still the most widely used, is the Cox statistic together with what we call Cox-type statistics. We have seen that these statistics can be interpreted as parametric tests when the models under the null and the alternative hypothesis are compounded into one.

The disadvantages of such statistics, as of other model choice statistics, are their poor test size with relatively small sample sizes and their nonrobustness. The first disadvantage has been highlighted by means of a simulation study, particularly for PID models. We found that with a sample size of 200 observations (which is quite large), some particular testing hypotheses (e.g. Pareto against Exponential) tend to underreject the null hypothesis. On the other hand, when one uses a parametric bootstrap this problem disappears.

The fact that Cox and Cox-type statistics are not robust is not surprising, for most test statistics suffer from the same problem. We investigated the robustness properties of the Cox statistic by means of the LIF and found that it was unbounded. We then proposed a robust version based on robust parametric tests for general parametric models. We also made a simulation study in the particular case of PID models.

In future research we intend to make the following developments. Firstly, we would like to investigate the power of the robust statistic, especially with different types of estimators for the parameter under the alternative hypothesis. Secondly, we would like to apply our results to a real sample of personal income data.
Thirdly, since model choice procedures have been mostly used in the econometric literature, we would like to consider generalized linear models as an application.

Chapter 7

Robustness and Goodness-of-Fit Tests

7.1 Introduction

Goodness-of-fit techniques are methods for examining how well a sample of data agrees with a given distribution as its population. The null hypothesis H_0 is that a given random variable X follows a stated probability law F(x). This setup covers a broad range of different problems, but we are interested in some particular cases. Firstly, we deal with univariate data, for which there is a vast literature. Secondly, we consider for the null hypothesis parametric models F_\theta, where the nuisance parameter has to be estimated; in other words, the null hypothesis is composite.

The goodness-of-fit techniques applied to test H_0 are based on measuring in some way the discrepancy between the sample data and the hypothesized distribution. The alternative hypothesis H_1 often gives little or no information on the distribution of the data, and simply states that H_0 is false. Therefore, the major focus is on the measure of agreement of the data with the null hypothesis; in fact, it is usually hoped to accept that H_0 is true. Consequently, in a test of fit, where the alternative is very vague, the appropriate statistical test will often be by no means clear, and no general theory of Neyman-Pearson type appears applicable in these situations. Even when concepts such as the statistical power of the procedure are considered, it rarely happens that one testing procedure emerges as superior. Therefore, a very large number of goodness-of-fit techniques have been developed, including formal tests and graphical techniques.

In this section we briefly present the different techniques available, including the most recent contributions. Robustness and goodness-of-fit techniques are discussed in section 7.2. In section 7.3, we compute the LIF of a general class of classical statistics. Finally, section 7.4 concludes. It should be stressed that the aim of this chapter is merely to debate the problem of the robustness of goodness-of-fit tests in general. This has already been done for some particular test statistics, and we extend the theory to a general class of statistics for discrete and multivariate data. Then we will propose a robust procedure for testing the goodness-of-fit, but leave some open problems for further research.

In 1900, Pearson (1948) invented the chi-squared test. This statistic and its derived versions are among the most widely used. As this test is based on a comparison between observed cell counts and their expected values under the hypothesis, it is often less powerful than other classes of tests of fit. However, it is the most generally applicable test of fit, since it applies to discrete, continuous, univariate and multivariate data. With PID, the data are univariate and mostly grouped, with given classes. Hence, the chi-squared test seems to be an appropriate test in this particular field.

To test the null hypothesis that a random sample x_1, \ldots, x_n has the distribution F_\theta = F_\theta^0, we have to partition (if the data are continuous) the range of x into J cells, say I_1, \ldots, I_J (for a discussion about the number of cells see e.g. Kallenberg, Oosterhoff and Schriever 1985). \hat N_1, \ldots, \hat N_J are the observed numbers of x_i's in these cells and are a realization of the random vector N = (N_1, \ldots, N_J)^T, which has the multinomial distribution with parameters n = \sum_j \hat N_j and probabilities k^0(\theta) = (k_1^0(\theta), \ldots, k_J^0(\theta))^T, where

k_j^0(\theta) = \int_{I_j} dF_\theta^0 \qquad (7.1)

The Pearson statistic is given by

X^2 = \sum_{j=1}^J U_j^2 \qquad (7.2)

where

U_j = \sqrt{n}\, \frac{\hat M_j - k_j^0(\theta)}{\sqrt{k_j^0(\theta)}} \qquad (7.3)

and \hat M_j = n^{-1} \hat N_j is the observed jth class frequency. If \theta is known, X^2 asymptotically has a \chi^2_{(J-1)} distribution. The problem of testing the goodness-of-fit is thus reduced to the problem of testing whether a multinomial variable has

cell probabilities k^0(\theta). In other words, testing H_0 : F_\theta = F_\theta^0 is reduced to testing H_0 : E[\hat N] = n k^0(\theta), or equivalently, H_0 : E[\hat M] = k^0(\theta).

In 1924, Fisher (1924) showed that the limiting null distribution of X^2 depends on the method of estimation used for the parameter \theta. He argued that the appropriate estimator is the MLE. He further proposed an estimator asymptotically equivalent to the MLE, the one which minimizes X^2 for the observed \hat M_j. With either of these two estimators, the statistic has a \chi^2_{(J-p-1)} distribution under the null hypothesis (p = \dim(\theta)).

More recently, Cressie and Read (1984) extended the theory of classical chi-squared procedures by introducing a class of test statistics based on measures of divergence between discrete distributions on J points. If q = (q_1, \ldots, q_J) and q^* = (q_1^*, \ldots, q_J^*) are such probability distributions, the directed divergence of order \lambda of q from q^* is given by the power divergence statistic

I^\lambda[q; q^*] = \frac{1}{\lambda(\lambda+1)} \sum_{j=1}^J q_j \left[ \left( \frac{q_j}{q_j^*} \right)^\lambda - 1 \right] \qquad (7.4)

I^\lambda is a metric only for \lambda = -0.5, but it is a useful general information measure of 'distance' for all real \lambda. The Cressie and Read statistic is given by

R^\lambda = 2n I^\lambda[\hat M; k^0(\theta)], \qquad \lambda \in (-\infty; \infty) \qquad (7.5)

In the limits \lambda \to 0 and \lambda \to -1 we obtain respectively

I^0(\hat M; k^0(\theta)) = \sum_{j=1}^J \hat M_j \log\left( \frac{\hat M_j}{k_j^0(\theta)} \right) \qquad (7.6)

I^{-1}(\hat M; k^0(\theta)) = \sum_{j=1}^J k_j^0(\theta) \log\left( \frac{k_j^0(\theta)}{\hat M_j} \right) \qquad (7.7)
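The family (7.4)-(7.5), together with the limiting cases (7.6)-(7.7), can be sketched as follows (a pure-Python illustration with names of our own choosing; m holds the observed cell frequencies and k the null cell probabilities):

```python
import math

def power_divergence(m, k, lam, n):
    """R^lambda = 2n I^lambda(m; k); lam = 0 and lam = -1 are the
    limiting cases (7.6) and (7.7), and lam = 1 gives Pearson's X^2."""
    if lam == 0:
        i = sum(mj * math.log(mj / kj) for mj, kj in zip(m, k) if mj > 0)
    elif lam == -1:
        i = sum(kj * math.log(kj / mj) for mj, kj in zip(m, k))
    else:
        i = sum(mj * ((mj / kj) ** lam - 1.0) for mj, kj in zip(m, k)) / (lam * (lam + 1.0))
    return 2.0 * n * i

m = [0.5, 0.3, 0.2]        # observed cell frequencies
k = [1 / 3, 1 / 3, 1 / 3]  # equiprobable cells under H0
print(round(power_divergence(m, k, 1, 100), 6))  # Pearson X^2 = 14.0
```

One can check that lam = 1 reproduces n \sum_j (\hat M_j - k_j)^2 / k_j, and that values of lam near 0 approach the lam = 0 limit continuously.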

This class includes the Pearson statistic X^2 (\lambda = 1), the Neyman (1949) statistic NM^2 (\lambda = -2), the log likelihood ratio statistic G^2 (\lambda = 0), the Freeman-Tukey (1950) statistic F^2 (\lambda = -1/2) and the modified log likelihood ratio statistic GM^2 (\lambda = -1).

Cressie and Read (1984) and Read and Cressie (1988) show that, under suitable regularity conditions, if we choose an estimator for \theta, say \tilde\theta, such that k^0(\tilde\theta) = k^0(\theta) + O_p(n^{-1/2}), then 2n I^\lambda(\hat M; k^0(\tilde\theta)) is asymptotically distributed according to a chi-squared distribution with J - p - 1 degrees of freedom. In particular, an MPE could be chosen (see (5.19)).

The Cressie and Read family of statistics is very useful for discrete or grouped data. However, when the data are continuous, grouping them reduces the information. Therefore, goodness-of-fit statistics have also been developed for continuous data. They are called empirical distribution function (EDF) based statistics. They are measures of the discrepancy between the EDF and a given distribution function; to be more precise, they measure the difference between F^{(n)}(x) and F_\theta(x). The ones which have attracted most attention are based on the vertical difference between F^{(n)}(x) and F_\theta(x). Certainly the best known is the Kolmogorov (1933) D statistic. A closely related statistic, V, has been developed by Kuiper (1960). A wide class of measures of discrepancy is given by the Cramér-von Mises family

Q = n \int \left[ F^{(n)}(x) - F_\theta(x) \right]^2 \phi(x)\, dF_\theta(x) \qquad (7.8)

where \phi(x) is a suitable function which gives weights to the squared difference [F^{(n)}(x) - F_\theta(x)]^2. When \phi(x) = 1 the statistic is the Cramér-von Mises statistic, now usually denoted by W^2. When \phi(x) = [F_\theta(x)(1 - F_\theta(x))]^{-1} the statistic is the Anderson-Darling (1954) statistic, denoted by A^2. A modification of W^2 has been proposed by Watson (1961).
To compute these statistics, it is convenient to use a transformation of the data, namely the probability integral transformation (see Stephens 1984). When the null hypothesis is simple, the distribution theory of EDF statistics is well developed, even for finite samples, and tables have been available for some time (Stephens 1984). When \theta has to be estimated, one can refer to tables for particular cases or simulate the distribution of the statistic by means of a parametric bootstrap (most of the available tables seem to have been built in this way anyway).

Another approach, which has not been systematized so far, is to rely on measures of distance between the estimated parametric model and the estimated empirical distribution based on the observations. Such distances are also used in the context of parameter estimation by means of minimum distance estimators (see Donoho and Liu 1988). One of the best known is the Hellinger distance (see e.g. Beran 1977, Tamura and Boos 1986 and Simpson 1987, 1989). Finally, in the same line of thought, Bickel and Rosenblatt (1973) proposed another procedure for testing the goodness of fit (see also Ghosh and Wei-Min Huang 1991 for an improved version).

Cressie and Read (1984) included the Hellinger distance and the Bickel and Rosenblatt statistic in a more general statistic based on estimated empirical density functions. They consider for the latter a simple kernel estimator given by

f_n(x) = \frac{F^{(n)}(x) - F^{(n)}(x - h)}{h} \qquad (7.9)

and defined

2n H^\lambda(f_n, f_0) = \frac{2nh}{\lambda(\lambda+1)} \int f_n(x) \left[ \left( \frac{f_n(x)}{f_0(x)} \right)^\lambda - 1 \right] dx \qquad (7.10)

which can be thought of as a continuous version of their statistic R^\lambda. When \lambda = 1 we have the Bickel and Rosenblatt (1973) statistic; when \lambda = -1/2 we have the statistic based on the Hellinger distance (Beran 1977). Finally, we cannot conclude without mentioning Moran's (1951) statistic, which has recently received attention (see Cheng and Stephens 1989).

To conclude, it should be stressed that some developments have been made for the particular case of PID models. Indeed, Gail and Gastwirth (1978a, 1978b) developed two goodness-of-fit tests for the exponential distribution based on the Lorenz curve and the Gini index. Although these test statistics do not depend on the unknown parameter, they are very limited because they apply only to the exponential distribution.

7.2 Robustness and goodness-of-fit techniques

In this section we report the discussions found in the literature about the robustness properties of some goodness-of-fit tests. The null hypothesis states that the observations belong exactly to a particular parametric model, but in our opinion it is more sensible to consider that the underlying model is not exact. A robust test is then a test which takes small model deviations into account and whose decision is not influenced by them. As we will see below, this robust procedure is in conflict with the beliefs of some researchers who measure the power of a particular test statistic when the alternative hypothesis lies in a neighborhood of the parametric model, and who therefore prefer test statistics which are very sensitive to small model deviations. Our aim is to draw attention to the fact that with goodness-of-fit statistics, as with parameter estimators, a single observation can lead the test to a false decision. We do not believe that this is preferable to a loss of power.

Among the first statisticians to have investigated robustness properties of goodness-of-fit statistics are Gail and Ware (1978). They studied, by means of simulations, the robustness to rounding and truncation measurement error of several scale-free goodness-of-fit tests for exponentiality. They found, for instance, that while Moran's statistic is very sensitive to such errors, the tests based on the sample Lorenz curve and Gini's statistic (Gail and Gastwirth 1978a, 1978b) are relatively robust. They agree on the fact that it is not desirable to reject the null hypothesis merely because of a small proportion of bad observations.

This section is a short review of recent work on robustness and goodness-of-fit tests. Since, to our knowledge, nothing has been done for X^2-type statistics, the next section presents our analysis of the robustness properties of the power divergence family of statistics.
In order to study the sensitivity of some of the best known EDF statistics, Michael and Schucany (1985) compute their IF. They argue that, unlike with parameter estimators, it is desirable for a small amount of contamination to have a large effect on the value of the statistic. They find that the IFs of the Kolmogorov-Smirnov statistic, the Kuiper statistic, the Cramér-von Mises statistic and the Watson statistic are bounded. The Anderson-Darling statistic, however, has an unbounded IF.

In the same line of thought, Drost, Kallenberg and Oosterhoff (1990) showed that EDF goodness-of-fit tests of composite hypotheses have greater power when the parameter estimators are non-robust under local contamination alternatives. Non-robust is used here to say that the estimator \hat\theta_n of \theta is not consistent under local contamination alternatives. If such an estimator is used, then Drost, Kallenberg and Oosterhoff (1990) showed that EDF statistics tend to infinity in probability. They considered the Kolmogorov-Smirnov statistic, the Kuiper statistic, the Cramér-von Mises statistic and the Anderson-Darling statistic. They also show this result for the Cressie-Read statistic.

We do not believe that it is more important for the goodness-of-fit test to have great power than for the test statistic to be robust to a small amount of contamination in the model. We are here stressing an important dilemma: how can we at the same time require robustness against small deviations and a test statistic able to detect small departures from the model? One could argue that very small deviations should not influence the test decision, whereas greater deviations should make the test statistic vary a lot.

To conclude, there is maybe a way to make a compromise between robustness (i.e. a small amount of contamination is acceptable) and power (i.e. a large amount of contamination should lead to the rejection of the null hypothesis).
The following was suggested by Rousseeuw (oral communication): when testing H_0 : x \sim F_\theta, find the smallest value of \varepsilon such that the \varepsilon-version of

F_\theta can be accepted at the 5% level, and if this \varepsilon exceeds a given value (fixed a priori), then reject H_0. We leave the investigation of this idea for future research.

7.3 Level-influence function of the Cressie and Read statistic

In this section we investigate the robustness properties of the Cressie and Read goodness-of-fit statistic. As we have seen in section 6.4, a test is robust when its level and power are not influenced by small deviations from the model under the null hypothesis. Our approach is the following: we first derive an approximation of the bias on the asymptotic level caused by model contaminations. This way, we find out which part of the test statistic has to be robustified. We actually follow the same approach as for the model choice test studied in chapter 6, but as we will see, because the problem is different, small changes are needed. First, we give more details about the R^\lambda statistic.

In order to clarify the notation, let us define again the R^\lambda goodness-of-fit procedure. The continuous (original) data are supposed to come from a parametric model, i.e. x_1, \ldots, x_n \sim F_\theta. We want to test if F_\theta = F_\theta^0 (null hypothesis H_0). In order to build the R^\lambda-statistic, the data must be in grouped form. Let \hat N_j = \#\{x_i \mid x_i \in I_j\}, with \sum_j \hat N_j = n, j = 1, \ldots, J, where I_j defines the jth cell, and suppose that the bounds of the cells are fixed. Denote by \hat M_j = n^{-1} \hat N_j the observed cell frequencies and by k_j^0(\theta) = \int_{I_j} dF_\theta^0(x) the corresponding probabilities under the null hypothesis. Then the R^\lambda-statistic is given by

R^\lambda = 2n I^\lambda[\hat M; k^0(\theta)] = \frac{2n}{\lambda(\lambda+1)} \sum_{j=1}^J \hat M_j \left[ \left( \frac{\hat M_j}{k_j^0(\theta)} \right)^\lambda - 1 \right] \qquad (7.11)

The test decision will be to reject H_0 if R^\lambda > \kappa_\omega^{J,p}. The critical value \kappa_\omega^{J,p} is determined, for a nominal level \omega, by the asymptotic distribution of the statistic R^\lambda under H_0, given by a \chi^2_{(J-1)} if \theta is known; i.e. \kappa_\omega^{J,p} is the (1 - \omega)-quantile of \chi^2_{(J-1)}. The asymptotic level of the test, \omega, is defined as

\omega = 1 - H^0_{(J-1)}(\kappa_\omega^{J,p}) \qquad (7.12)

where H^0_q is the cumulative distribution function of a \chi^2_q. This is the true asymptotic level of the test at the model F_\theta^0. To be more precise, we should write this asymptotic level as \omega = \omega(0).

In order to study the asymptotic distribution and the robustness properties of these tests, it is convenient to write the test statistic in the form

R^\lambda = \sum_{j=1}^J U_j^2 + o_p(1) \qquad (7.13)

We have that

2n I^\lambda(\hat M; k^0(\theta)) = \frac{2n}{\lambda(\lambda+1)} \sum_{j=1}^J k_j^0(\theta) \left[ \left( 1 + \frac{\hat M_j - k_j^0(\theta)}{k_j^0(\theta)} \right)^{\lambda+1} - 1 \right] \qquad (7.14)

Writing

v_j(\theta) = \frac{\hat M_j - k_j^0(\theta)}{k_j^0(\theta)} \qquad (7.15)

and expanding in a Taylor series, we see that under H_0, (7.14) equals (see Cressie and Read 1984)

\frac{2n}{\lambda(\lambda+1)} \sum_{j=1}^J k_j^0(\theta) \left[ (\lambda+1) v_j(\theta) + \frac{\lambda(\lambda+1)}{2} v_j^2(\theta) + O_p(v_j^3) \right] = 2n \left[ \sum_{j=1}^J k_j^0(\theta) \frac{v_j^2(\theta)}{2} + o_p(n^{-1}) \right] \qquad (7.16)

(because \sum_{j=1}^J k_j^0(\theta) v_j(\theta) = 0). Hence,

2n I^\lambda(\hat M; k^0(\theta)) = 2n I^1(\hat M; k^0(\theta)) + o_p(1) \qquad (7.17)

which shows that R^\lambda = R^1 + o_p(1) has asymptotically the \chi^2_{(J-1)} distribution (the conditions under which this is true are provided by Birch 1964; see Read and Cressie 1988). Therefore, we can write

R^\lambda = \sum_{j=1}^J U_j^2 + o_p(1) \qquad (7.18)

where

U_j = \sqrt{n}\, \frac{\hat M_j - k_j^0(\theta)}{\sqrt{k_j^0(\theta)}} \qquad (7.19)

and, when \theta is known,

U_j \xrightarrow{d} N(0, 1) \qquad (7.20)
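The identity \sum_j k_j^0(\theta) v_j(\theta) = 0 invoked in (7.16) follows directly from the definitions, since both the observed frequencies and the null probabilities sum to one:

```latex
\sum_{j=1}^{J} k_j^0(\theta)\, v_j(\theta)
  = \sum_{j=1}^{J} k_j^0(\theta)\,
    \frac{\hat{M}_j - k_j^0(\theta)}{k_j^0(\theta)}
  = \sum_{j=1}^{J} \hat{M}_j - \sum_{j=1}^{J} k_j^0(\theta)
  = 1 - 1 = 0 .
```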

Since the null hypothesis H0 is not always simple, we have to estimate θ. Read and Cressie (1984) proved that

2n I^\lambda(\hat M; k(T)) = 2n I^1(\hat M; k(T)) + o_p(1) \qquad (7.21)

where T is a best asymptotically normal (BAN) estimator of \theta. In particular, any MPE, T_{MPE}, is BAN. Therefore we have that

R^\lambda(T_{MPE}) = \sum_{j=1}^J U_j^2(T_{MPE}) + o_p(1) \qquad (7.22)

where

U_j(T_{MPE}) = \sqrt{n}\, \frac{\hat M_j - k_j^0(T_{MPE})}{\sqrt{k_j^0(T_{MPE})}} \qquad (7.23)

and R^\lambda(T_{MPE}) converges in distribution to a \chi^2_{(J-p-1)}.

All these results are correct if the data are correctly specified by the model F_\theta^0. However, as it is more realistic, we suppose that the model is not exactly true, i.e. that the underlying distribution lies in a neighborhood of the model distribution F_\theta^0, given by

F_{\varepsilon,n}^0 = (1 - \varepsilon_n) F_\theta^0 + \varepsilon_n \Delta_z \qquad (7.24)

The asymptotic level will then in general be biased, which can lead the test to false decisions. Note that, as argued in chapter 6, we must choose \varepsilon_n = \varepsilon/\sqrt{n}. Let T be any BAN estimator of \theta. We have that

U_j(T) = \sqrt{n}\, \frac{\hat M_j - k_j^0(T)}{\sqrt{k_j^0(T)}} \xrightarrow{d} N(0, 1) \qquad (7.25)

where

\hat M_j = \int_{I_j} dF^{(n)} \qquad (7.26)

Assuming F_{\varepsilon,n}^0 under H_0, we then have

U_j(T(F_{\varepsilon,n}^0)) \xrightarrow{d} N(\mu_j(\varepsilon), \sigma_j^2(\varepsilon)) \qquad (7.27)

where

\mu_j(\varepsilon) = \lim_{n\to\infty} \frac{\sqrt{n} \left[ \int_{I_j} dF_{\varepsilon,n}^0 - k_j^0(T(F_{\varepsilon,n}^0)) \right]}{\sqrt{k_j^0(T(F_{\varepsilon,n}^0))}} \qquad (7.28)

and

\sigma_j^2(\varepsilon) = \lim_{n\to\infty} \frac{\int_{I_j} dF_{\varepsilon,n}^0}{k_j^0(T(F_{\varepsilon,n}^0))} \qquad (7.29)

Let us denote the asymptotic distribution of R^\lambda(T) under F_{\varepsilon,n}^0 by

D^0(\,\cdot\,; \mu_1(\varepsilon), \ldots, \mu_J(\varepsilon); \sigma_1^2(\varepsilon), \ldots, \sigma_J^2(\varepsilon)) \qquad (7.30)

where D^0 is the cumulative distribution of

\sum_{j=1}^J \left( \frac{Z_j - \mu_j(\varepsilon)}{\sigma_j(\varepsilon)} \right)^2 \qquad (7.31)

and the Z_j, for all j, are standard normal variables. The asymptotic level of the test under F_{\varepsilon,n}^0 is then given by

\omega(\varepsilon) = 1 - D^0(\kappa_\omega^{J,p}; \mu_1(\varepsilon), \ldots, \mu_J(\varepsilon); \sigma_1^2(\varepsilon), \ldots, \sigma_J^2(\varepsilon)) \qquad (7.32)

The bias on the asymptotic level can be approximated by a first-order Taylor expansion of \omega(\varepsilon) around \omega(0), i.e.

\omega(\varepsilon) - \omega(0) = \varepsilon \cdot \frac{\partial}{\partial\varepsilon} \omega(\varepsilon) \Big|_{\varepsilon=0} + O(\varepsilon^2) \qquad (7.33)

We have that

\frac{\partial}{\partial\varepsilon} \omega(\varepsilon) \Big|_{\varepsilon=0} = - \sum_{j=1}^J \left[ \frac{\partial}{\partial\mu_j} D^0(\kappa_\omega^{J,p}; 0, \ldots, \mu_j, \ldots, 0; 1, \ldots, 1) \Big|_{\mu_j=0} \right] \cdot \frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} - \sum_{j=1}^J \left[ \frac{\partial}{\partial\sigma_j^2} D^0(\kappa_\omega^{J,p}; 0, \ldots, 0; 1, \ldots, \sigma_j^2, \ldots, 1) \Big|_{\sigma_j^2=1} \right] \cdot \frac{\partial}{\partial\varepsilon} \sigma_j^2(\varepsilon) \Big|_{\varepsilon=0} \qquad (7.34)

It is easy to show that

\frac{\partial}{\partial\varepsilon} \sigma_j^2(\varepsilon) \Big|_{\varepsilon=0} = \lim_{n\to\infty} \frac{\partial}{\partial\varepsilon} \frac{\int_{I_j} dF_{\varepsilon,n}^0}{k_j^0(T(F_{\varepsilon,n}^0))} \Big|_{\varepsilon=0} = 0 \qquad (7.35)

The reason is that the variance is of order n^{-1} and \varepsilon_n of order n^{-1/2}, so that when n \to \infty the effect of \varepsilon vanishes. Therefore, the bias on the asymptotic level will be proportional to the effect on the \mu_j's. Moreover, D^0(\,\cdot\,; \mu_1, \ldots, \mu_J; 1, \ldots, 1) becomes the cumulative distribution function of a \chi^2_{(J-p-1)}(\tau) with non-centrality parameter \tau given by

\tau = \sum_{j=1}^J \mu_j^2 \qquad (7.36)

Putting D^0(u; \mu_1(\varepsilon), \ldots, \mu_J(\varepsilon); 1, \ldots, 1) = H^0_{(J-p-1)}(u; \tau(\varepsilon)), we have

\frac{\partial}{\partial\varepsilon} \omega(\varepsilon) \Big|_{\varepsilon=0} = - \frac{\partial}{\partial\tau} H^0_{(J-p-1)}(\kappa_\omega^{J,p}; \tau) \Big|_{\tau=0} \cdot \frac{\partial}{\partial\varepsilon} \tau(\varepsilon) \Big|_{\varepsilon=0} \qquad (7.37)

where

\frac{\partial}{\partial\varepsilon} \tau(\varepsilon) \Big|_{\varepsilon=0} = 2 \sum_{j=1}^J \mu_j(0) \cdot \frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} \qquad (7.38)

However,

\mu_j(0) = \lim_{n\to\infty} \frac{\sqrt{n} \left[ \int_{I_j} dF_\theta^0 - k_j^0(T(F_\theta^0)) \right]}{\sqrt{k_j^0(T(F_\theta^0))}} = 0 \qquad (7.39)

The bias on the asymptotic level therefore cannot be approximated by its linear term, which vanishes; we have to develop the expansion up to the quadratic term. We have

\omega(\varepsilon) - \omega(0) = \frac{\varepsilon^2}{2} \frac{\partial^2}{\partial\varepsilon^2} \omega(\varepsilon) \Big|_{\varepsilon=0} + O(\varepsilon^3) \qquad (7.40)

where

\frac{\partial^2}{\partial\varepsilon^2} \omega(\varepsilon) \Big|_{\varepsilon=0} = - \frac{\partial}{\partial\tau} H^0_{(J-p-1)}(\kappa_\omega^{J,p}; \tau) \Big|_{\tau=0} \cdot \sum_{j=1}^J \left[ 2 \left( \frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} \right)^2 + \mu_j(0)\, \frac{\partial^2}{\partial\varepsilon^2} \mu_j(\varepsilon) \Big|_{\varepsilon=0} \right] = - \frac{\partial}{\partial\tau} H^0_{(J-p-1)}(\kappa_\omega^{J,p}; \tau) \Big|_{\tau=0} \cdot 2 \sum_{j=1}^J \left( \frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} \right)^2 \qquad (7.41)

We have that

\frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} = \lim_{n\to\infty} \left\{ \frac{\sqrt{n}\, \frac{\partial}{\partial\varepsilon} \left[ \int_{I_j} dF_{\varepsilon,n}^0 - k_j^0(T(F_{\varepsilon,n}^0)) \right] \Big|_{\varepsilon=0}}{\sqrt{k_j^0(\theta)}} - \frac{\sqrt{n} \left[ \int_{I_j} dF_\theta^0 - k_j^0(T(F_\theta^0)) \right] \frac{\partial}{\partial\varepsilon} k_j^0(T(F_{\varepsilon,n}^0)) \Big|_{\varepsilon=0}}{2\, k_j^0(\theta)^{3/2}} \right\} \qquad (7.42)

Since

\int_{I_j} dF_\theta^0 - k_j^0(T(F_\theta^0)) = 0 \qquad (7.43)

the second term on the right-hand side disappears. Thus

\frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} = \lim_{n\to\infty} \sqrt{n}\, \frac{1}{\sqrt{k_j^0(\theta)}} \left[ -\frac{1}{\sqrt{n}}\, k_j^0(\theta) + \frac{1}{\sqrt{n}}\, \delta_j(z) - \frac{\partial}{\partial\theta^T} k_j^0(\theta) \cdot \frac{\partial}{\partial\varepsilon} T(F_{\varepsilon,n}^0) \Big|_{\varepsilon=0} \right] \qquad (7.44)

where

\delta_j(z) = \begin{cases} 1 & \text{if } z \in I_j \\ 0 & \text{otherwise} \end{cases} \qquad (7.45)

Knowing that

\frac{\partial}{\partial\varepsilon} T(F_{\varepsilon,n}^0) \Big|_{\varepsilon=0} = \frac{\partial}{\partial\varepsilon_n} T(F_{\varepsilon,n}^0) \Big|_{\varepsilon=0} \cdot \frac{1}{\sqrt{n}} = \frac{1}{\sqrt{n}}\, IF^*(z; T, F_\theta) \qquad (7.46)

we finally get

\frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} = \frac{1}{\sqrt{k_j^0(\theta)}} \left[ -k_j^0(\theta) + \delta_j(z) - \frac{\partial}{\partial\theta^T} k_j^0(\theta) \cdot IF^*(z; T, F_\theta) \right] \qquad (7.47)

The bias on the asymptotic level is thus approximated by

\omega(\varepsilon) - \omega(0) = -\varepsilon^2\, \frac{\partial}{\partial\tau} H^0_{(J-p-1)}(\kappa_\omega^{J,p}; \tau) \Big|_{\tau=0} \cdot \sum_{j=1}^J \frac{1}{k_j^0(\theta)} \left[ k_j^0(\theta) - \delta_j(z) + \frac{\partial}{\partial\theta^T} k_j^0(\theta) \cdot IF^*(z; T, F_\theta^0) \right]^2 + O(\varepsilon^3) \qquad (7.48)

We can then see that the bias on the asymptotic level depends on the IF of the estimator. If T = T_{MPE}, then

IF(z; T_{MPE}, F_\theta^0) = \left[ \sum_{j=1}^J k_j^0(\theta)\, \frac{\partial}{\partial\theta} \log k_j^0(\theta)\, \frac{\partial}{\partial\theta^T} \log k_j^0(\theta) \right]^{-1} \sum_{j=1}^J \delta_j(z)\, \frac{\partial}{\partial\theta} \log k_j^0(\theta) = IF^*(z; T_{MPE}, F_\theta^0) \qquad (7.49)

In order to understand the value of the bias on the asymptotic level we have just computed, let us consider the model contamination (7.24), but with \Delta_z replaced by an arbitrary distribution W. It is then easy to show that

\frac{\partial}{\partial\varepsilon} \mu_j(\varepsilon) \Big|_{\varepsilon=0} = \frac{1}{\sqrt{k_j^0(\theta)}} \left[ -k_j^0(\theta) + \int_{I_j} dW - \frac{\partial}{\partial\theta^T} k_j^0(\theta) \cdot \frac{\partial}{\partial\varepsilon_n} T_{MPE}(F_{\varepsilon,n}) \Big|_{\varepsilon=0} \right] \qquad (7.50)

where

\frac{\partial}{\partial\varepsilon_n} T_{MPE}(F_{\varepsilon,n}) \Big|_{\varepsilon=0} = \left[ \sum_{j=1}^J k_j^0(\theta)\, \frac{\partial}{\partial\theta} \log k_j^0(\theta)\, \frac{\partial}{\partial\theta^T} \log k_j^0(\theta) \right]^{-1} \sum_{j=1}^J \int_{I_j} dW\, \frac{\partial}{\partial\theta} \log k_j^0(\theta) \qquad (7.51)

We then see that if W is near F_\theta^0, \int_{I_j} dW is near k_j^0(\theta), and therefore \omega(\varepsilon) - \omega(0) is near 0. When W = \Delta_z, the bias on the asymptotic level is maximal. However, it is difficult to determine for which class with \delta_j(z) = 1 this bias is maximal. We can only say that the bias is bounded, because \max \delta_j(z) = 1, \max_z IF^*(z; T_{MPE}, F_\theta^0) < \infty and the level lies in [0, 1]; nevertheless, a large change in the value of the level caused by a small amount of contamination implies non-robustness. That is, the level can break down from 0.05 to 1 because of only a few observations.

7.4 Simulation study

In this section we present a simulation study and then discuss an 'ad hoc' robust version of the Cressie and Read statistic. We tested the stability of the level of the goodness-of-fit test with different test statistics when a small percentage of contamination is introduced in the model.

We simulated a sample of 1000 observations from the Lognormal distribution with parameters \mu = 1.0 and \sigma = 0.8 and contaminated the model by multiplying by 10 a percentage \varepsilon of observations chosen at random. We then grouped the observations in 14 classes. We tested the simple hypothesis H_0 : F = F_{\mu,\sigma^2}, where \mu and \sigma are known, by computing the Pearson statistic, the Neyman-Pearson statistic, the log likelihood statistic and the Freeman-Tukey statistic. All these test statistics belong to the class of power divergence statistics.

In figure 7.1, the values of these statistics for different percentages of contamination are compared with \kappa_\omega^{(J-1)} at the nominal 5%-level, i.e. the 5% critical value of a \chi^2_{13}. We can see that with less than 2% of contamination, the four statistics already lead the test to a false decision. This confirms our results about the LIF of the Cressie and Read statistic: although the LIF is bounded, it can be very large. Thus, working with a biased level means that we overreject the null hypothesis simply because of a small percentage of outliers.

How can we improve the stability of the level of the test? What already seems to exist in the literature (without having been developed systematically for goodness-of-fit tests) is a set of robust distances between distributions. In particular, the Hellinger distance seems to offer good prospects. Another strategy would be to use a model choice approach for testing the goodness-of-fit; in this case, a robust procedure already exists (see chapter 6).
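The mechanics of this experiment can be sketched as follows (an illustrative pure-Python version with 7 cells instead of 14 and helper names of our own choosing; the null cell probabilities follow from P(X <= b) = \Phi((\log b - \mu)/\sigma) under the lognormal model):

```python
import math
import random
from statistics import NormalDist

def lognormal_cell_probs(bounds, mu, sigma):
    """Null probabilities k_j for the cells (0,b1], (b1,b2], ..., (b_{J-1}, inf)."""
    cdf = [NormalDist(mu, sigma).cdf(math.log(b)) for b in bounds]
    return [cdf[0]] + [b - a for a, b in zip(cdf, cdf[1:])] + [1.0 - cdf[-1]]

def pearson_statistic(data, bounds, probs):
    """X^2 = sum_j (N_j - n k_j)^2 / (n k_j) for fixed cell bounds."""
    n = len(data)
    counts = [0] * len(probs)
    for x in data:
        counts[sum(1 for b in bounds if x > b)] += 1
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))

rng = random.Random(1)
mu, sigma = 1.0, 0.8
sample = [rng.lognormvariate(mu, sigma) for _ in range(1000)]
bad = set(rng.sample(range(1000), 50))  # contaminate 5% of the observations
contaminated = [x * 10.0 if i in bad else x for i, x in enumerate(sample)]

bounds = [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]  # illustrative cell bounds
probs = lognormal_cell_probs(bounds, mu, sigma)
x2_clean = pearson_statistic(sample, bounds, probs)
x2_dirty = pearson_statistic(contaminated, bounds, probs)
print(x2_dirty > x2_clean)
```

Multiplying observations by 10 pushes them into the upper tail cell, which inflates the statistic and hence the actual rejection rate, in line with figure 7.1.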
However, in this section we also present an ad hoc robust procedure as a test of goodness-of-fit when the parameters of the model have to be estimated. It is based on the X^2 statistic and the OBRE for the parameters. The idea is to use in the test statistic the weights generated during the parameter estimation. Indeed, under the null hypothesis, we expect the class frequency \hat M_j to be as near as possible to k_j^0(\theta), and hence, if we want a robust statistic, we can put weights w_j on the difference \hat M_j - k_j^0(\theta). In other words, if w_j < 1, it is because during the estimation procedure the OBRE found that the scores function with grouped data was far from its expected value, that is, \hat M_j is far from k_j^0(\theta). Hence, the statistic we define is

\tilde I_R^\lambda[\hat M; k^0(\tilde\theta)] = \frac{1}{\lambda(\lambda+1)} \sum_{j=1}^J \hat M_j \left[ \left( 1 + w_j\, \frac{\hat M_j - k_j^0(\tilde\theta)}{k_j^0(\tilde\theta)} \right)^\lambda - 1 \right] \qquad (7.52)

where \tilde\theta is the OBRE of \theta. Since the derivation of the distribution of this test statistic under the null hypothesis is probably unfeasible, a bootstrap could be used. We leave this investigation for future research.
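A minimal sketch of one possible form of such a weighted statistic, in which each OBRE weight w_j in (0, 1] shrinks the relative deviation (\hat M_j - k_j^0)/k_j^0 before it enters the power divergence formula (with w_j = 1 for all j the ordinary I^\lambda is recovered); this is an illustration of the idea, not a definitive transcription:

```python
def weighted_divergence(m, k, w, lam):
    """Weighted power divergence: the relative deviation of each cell
    is multiplied by its weight w_j before aggregation."""
    total = sum(
        mj * ((1.0 + wj * (mj - kj) / kj) ** lam - 1.0)
        for mj, kj, wj in zip(m, k, w)
    )
    return total / (lam * (lam + 1.0))

m = [0.5, 0.3, 0.2]
k = [1 / 3, 1 / 3, 1 / 3]
print(weighted_divergence(m, k, [1.0, 1.0, 1.0], 1))  # unweighted: I^1 ~ 0.07
print(weighted_divergence(m, k, [0.5, 0.5, 0.5], 1))  # downweighted: ~ 0.035
```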


Figure 7.1: Behaviour of goodness-of-fit statistics (Pearson, λ = 1; Neyman-Pearson, λ = −2; log likelihood, λ = 0; Freeman-Tukey, λ = −0.5) with model contamination

Chapter 8

Robustness and Inequality Measures

8.1 Introduction

In this chapter, we study the robustness properties of the estimators of inequality measures (see also Cowell and Victoria-Feser 1993). There are various approaches in the literature to the definition of an inequality measure. For our purposes we see an inequality measure simply as a member of a class of functions that is defined by a set of essential characteristics. The class of functions defines in turn a set of statistics that can be used to characterise the PID, the distribution of wealth, and so on (see section 2.4). The essential characteristics of an inequality measure are the subject of some debate. However, many would accept that the principal property which should be possessed by an inequality measure is respect for the principle of transfers (Dalton 1920). In addition we may require that the class of admissible inequality measures should satisfy properties such as scale irrelevance or decomposability (see below). The purpose of this chapter is to examine the formal relationship between the economic properties that the inequality measure should fulfil and the statistical property of robustness. In particular, we study the link between the principle of transfers and the IF of the members of some very wide classes of inequality measures. A principal question which we wish to address is whether there is a systematic relationship between the properties that are taken to be defining characteristics of the class of inequality measures and the behaviour of the IF.


We believe that a robust approach to the estimation of inequality measures is particularly important. It provides the researcher with useful supplementary information concerning the true underlying structure of inequality for a given data sample. Indeed, the effect of data contamination in the tails of the distribution can result in serious confusion between quite different underlying income distributions. In practical work, some researchers have noticed that the data are far from being clean, that is, observations that in a sense have nothing to do with the majority of the data are present in the sample¹. Moreover, it is well known that Atkinson inequality measures, for example, for values of the inequality aversion parameter greater than unity, are extraordinarily sensitive to abnormally small incomes (see for example Cowell 1977, p. 132-134, Pudney and Sutherland 1992). We first introduce some notation. Let I be the (true) inequality measure that we estimate by means of a sample x₁, ..., x_n, where the x_i are realizations of a random variable X. For example, X is the income variable when measuring income inequality. We denote I by I(F), i.e. as a functional of the distribution F. An estimator of I(F) is given by replacing F by either F^{(n)}, the empirical distribution of X, or F_θ̂, an estimated parametric model such that X ∼ F_θ. Let us also define the following mixture distribution

G_ε = (1 − ε)F + εH   (8.1)

where 0 < ε < 1 and H is some perturbation distribution. For example, H is the distribution ∆_z which puts a point mass 1 at any point z. In the case of income inequality measures, as we will see later, it is more convenient to consider a distribution H with corresponding probability distribution

dH(x) = α_i if x = z_i, i = 1, ..., m   (8.2)

where α_i ≥ 0 for all i and Σ_i α_i = 1. The influence of an infinitesimal model deviation (or model contamination) on the estimate is then given by

lim_{ε→0} [I(G_ε) − I(F)] / ε   (8.3)

¹For example, in the 1981 and 1986 family expenditure surveys for the UK the top-most income in the sample has a value two times that of the next-highest income. It is arguable that an outlier of this sort should be treated as exceptional and dropped from the sample (see Jenkins 1992).

or, when the derivative exists, by

(∂/∂ε) I(G_ε) |_{ε=0}   (8.4)

It should be stressed that (8.4) is a slightly different definition than the usual IF. Indeed, when H = ∆_z, then it is equal to the IF, denoted by IF(z, I, F). If H is any distribution, then (8.4) can be called an integrated IF (IIF) because it is equal to ∫ IF(z, I, F) dH(z). However, in our examples, H will be a distribution which puts different point masses at different points depending on a common point z. Therefore, by extension of the definition of the IF, we will consider that with such a distribution H, (8.4) is still an IF. As said above, in this chapter we analyse the formal relationship between the principle of transfers and the IF of inequality measures. Is it possible to draw a general conclusion about all the inequality measures fulfilling this economic property? Unfortunately not, as we now show. We define a general class of income inequality measures I(F) by the set of all I(F) which satisfy the principle of transfers. In other words, if, say, I*(F) belongs to this class, then the transfer of an arbitrary positive amount of income from a poorer income receiver to a richer income receiver (such that the mean of the distribution is preserved) increases the value of I*(F). In order to study the effect on the estimator of an infinitesimal amount of contamination, we suppose that the underlying distribution lies in a neighbourhood of the model as defined in (8.1). H is in principle any perturbation distribution. In our case, an appropriate perturbation distribution is given by H^{(z)} defined as follows

dH^{(z)}(x) = 0.5 if x = µ − z = x₁(z);  0.5 if x = µ + z = x₂(z)   (8.5)

where 0 ≤ z < µ. Let us denote by G_ε^{(z)} the mixture distribution

G_ε^{(z)} := (1 − ε)F + εH^{(z)}

then

lim_{ε→0} [I(G_ε^{(z)}) − I(F)] / ε = IF(z, I, F)

This kind of perturbation distribution is convenient because it preserves the mean of the distribution. Indeed,

µ(G_ε^{(z)}) = ∫ x dG_ε^{(z)}(x) = (1 − ε)µ(F) + (1/2)ε(µ(F) − z) + (1/2)ε(µ(F) + z) = µ(F)   (8.6)
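The mean-preservation property (8.6) can also be checked by simulation. The sketch below assumes, for illustration only, F = Gamma(3, 1) with mean µ = 3:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_F = 3.0          # mean of the uncontaminated distribution F (Gamma(3, 1) here)
eps, z = 0.05, 2.0  # contamination amount and spread, 0 <= z < mu

n = 2_000_000
from_F = rng.gamma(3.0, 1.0, n)                # draws from F
from_H = mu_F + rng.choice([-z, z], n)         # draws from H^(z), eq. (8.5)
is_H = rng.random(n) < eps
g = np.where(is_H, from_H, from_F)             # draws from the mixture G_eps^(z)

print(g.mean())   # stays close to mu_F = 3.0: the contamination is mean-preserving
```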

Suppose now that z* > z; then by the principle of transfers we have

I(G_ε^{(z*)}) > I(G_ε^{(z)})   (8.7)

The IF of these inequality measures are then obtained by subtracting I(F), dividing by ε and taking the limit as ε → 0. We have

lim_{ε→0} [I(G_ε^{(z*)}) − I(F)] / ε ≥ lim_{ε→0} [I(G_ε^{(z)}) − I(F)] / ε   (8.8)

i.e.

IF(z*, I, F) ≥ IF(z, I, F)   (8.9)

Hence, we cannot conclude that the IF of an inequality measure which satisfies the principle of transfers is unbounded. Therefore, at this stage we need to consider a restricted class of inequality measures. We then hope to find that in these restricted classes, the principle of transfers implies an unbounded IF. In section 8.2 we analyse the robustness properties of a general class of inequality measures satisfying the principle of transfers. There, we make the assumption that the type of contamination preserves the mean of the distribution. In section 8.3 we extend our results to the case where the type of contamination is arbitrary and present a simulation study. In section 8.4 we consider inequality measures estimated through parametric models. In particular, we show that their IF is proportional to the IF of the estimators of the parameters, so that robust inequality measures are obtained through robust estimates of parametric models.

8.2 Decomposability and mean-preserving contamination

In this section we show that with mean-preserving contaminations, any inequality measure belonging to the class of decomposable inequality measures and satisfying the principle of transfers has an unbounded IF. Although we have restricted the class of inequality measures, the results found here are applicable to a great number of inequality measures, such as the coefficient of variation, the relative mean and median deviation, the variance of the logarithm of income (Gibrat 1931), members of the generalized entropy family (Cowell 1980), Hirschman's index (Hirschman 1945), Atkinson's index (Atkinson 1970), Kolm's index (Kolm 1976), etc. A notable exception to this class is the Gini (1912) coefficient. However, we will see below that the same conclusions can be drawn for it.

8.2.1 General properties

An inequality measure fulfilling the property of decomposability can be written in the following form

I(F) = ϕ[J(F, µ(F)), µ(F)]   (8.10)

where

J(F, µ) = ∫ φ(x, µ) dF(x)   (8.11)

φ is a function R → R and ϕ is a monotonic increasing function in its first argument. As said before, one of the most important properties that an income inequality measure has to fulfil is the principle of transfers. In the case of decomposable measures given by (8.10) and (8.11), where φ is a differentiable function, this implies

φ₁(x₂, µ) − φ₁(x₁, µ) > 0 for any x₁ < x₂

where φ₁ denotes the derivative of φ with respect to its first argument; in other words, φ must be strictly convex in x. Indeed, consider the transfer of an amount dx* > 0 from a poorer income receiver to a richer income receiver. If before the transfer the income distribution is represented by F, then after the transfer, the new distribution F* is the same as F except that ∃ x₁*, x₂* such that x₁* = x₁ − dx*, x₂* = x₂ + dx* and x₁ < x₂. Since such a transfer preserves the mean, to first order in dx* the change in the inequality measure is

I(F*) − I(F) = ϕ₁[J, µ] · [J(F*, µ) − J(F, µ)] ∝ ϕ₁[J, µ] · [φ₁(x₂, µ) − φ₁(x₁, µ)] dx*

where ϕ₁ > 0 is the derivative of ϕ with respect to its first argument. If I satisfies the principle of transfers, then we have that

φ₁(x₂, µ) − φ₁(x₁, µ) > 0 for any x₁ < x₂

8.2.2 Robustness properties

Let us now derive the IF of any I(F) satisfying the principle of transfers for the perturbation distribution defined in (8.5). Given that

I(G_ε^{(z)}) = ϕ[J(G_ε^{(z)}, µ(G_ε^{(z)})), µ(G_ε^{(z)})] = ϕ[J(G_ε^{(z)}, µ(F)), µ(F)]   (8.16)

where

J(G_ε^{(z)}, µ) = (1 − ε) ∫ φ(x, µ) dF(x) + (1/2)ε φ(x₁(z), µ) + (1/2)ε φ(x₂(z), µ)   (8.17)

the IF of the inequality measure is given by

IF(z, I, F) = lim_{ε→0} [I(G_ε^{(z)}) − I(F)] / ε
            = (∂/∂J) ϕ[J, µ] · (∂/∂ε) J(G_ε^{(z)}, µ) |_{ε=0}
            = ϕ₁[J, µ] · [ −J(F, µ) + (1/2)φ(x₁(z), µ) + (1/2)φ(x₂(z), µ) ]   (8.18)

The behaviour of the IF is directly related to the properties of the function φ. Since φ is strictly convex, φ(x, ·) is unbounded either when x → −∞ or when x → ∞. Therefore, as z → ∞, unless φ is symmetric about µ, the IF of I is unbounded. However, it should be stressed that the above statement is valid only if we consider a random variable which takes its values on the whole set of the real numbers. It could be argued that in the case of PID an additional a priori restriction on the values of the random variable is appropriate, namely x ≥ 0. If so, then the above analysis needs to be revised. Consider instead of (8.5) the following definition

dH^{(z)}(x) = 1/(1+z) if x = z·µ;  z/(1+z) if x = µ/z   (8.19)

then the IF of I is given by
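Formula (8.18) can be illustrated numerically. The sketch below assumes, for illustration only, F = N(3, 1) (a real-line model, so z is unrestricted) and the generalized-entropy-type φ with β = 1, for which ϕ is the identity and so ϕ₁[J, µ] = 1; the finite-difference quotient agrees with the IF, which grows without bound:

```python
import numpy as np

# GE-type phi with beta = 1 (half the squared relative deviation)
phi = lambda x, mu: ((x / mu) ** 2 - 1.0) / 2.0

mu = 3.0
J = (10.0 / mu**2 - 1.0) / 2.0    # J(F, mu) = E[phi(X, mu)]; E X^2 = 10 for N(3, 1)

def IF_mean_preserving(z):
    # eq. (8.18) with phi_1[J, mu] = 1: minus J plus the two contamination terms
    return -J + 0.5 * phi(mu - z, mu) + 0.5 * phi(mu + z, mu)

def finite_difference(z, eps=1e-6):
    # I(G_eps^(z)) is exactly linear in eps, so the quotient matches the IF
    I_G = (1 - eps) * J + eps * (0.5 * phi(mu - z, mu) + 0.5 * phi(mu + z, mu))
    return (I_G - J) / eps

for z in [0.5, 1.0, 2.0, 10.0, 100.0]:
    print(z, IF_mean_preserving(z), finite_difference(z))
# here IF(z) = (z**2 - 1)/18: unbounded in z, as the convexity of phi predicts
```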

IF(z, I, F) = ϕ₁[J, µ] · [ −J(F, µ) + (1/(1+z)) φ(µ·z, µ) + (z/(1+z)) φ(µ/z, µ) ]   (8.20)

The behaviour of the IF as z → ∞ depends on the behaviour of φ(x, µ) as x → 0 or as x → ∞. We have the following situations. If

lim_{x→∞} φ(x, µ) < ∞ and φ(0, µ) < ∞

then the IF is bounded. But if

lim_{x→∞} φ(x, µ) = ∞ or lim_{x→0} φ(x, µ) = ∞

then the IF is unbounded. In the next subsections, we analyse in particular two inequality measures belonging to the class of decomposable measures through their function φ.

8.2.3 Kolm's index

Kolm's (1976) index belongs to the class of decomposable inequality measures and is given by

I_K(F) = (1/κ) log ∫ e^{κ(µ(F) − x)} dF(x)   (8.21)

where the parameter κ > 0 measures the degree of inequality aversion. Its function φ is given by

φ(x, µ) = e^{κ(µ − x)}   (8.22)

Since lim_{x→∞} φ(x, µ) = 0 and φ(0, µ) < ∞, the IF of Kolm's index with mean-preserving contaminations is bounded. This is also easily shown analytically. Consider again the mixture model G_ε^{(z)} with the perturbation function given in (8.19). We have

I_K(G_ε^{(z)}) = (1/κ) log [ (1 − ε) ∫ e^{κ(µ−x)} dF(x) + (1/(1+z)) ε e^{κµ(1−z)} + (z/(1+z)) ε e^{κµ(1−1/z)} ]   (8.23)

The IF of the Kolm index is then given by

IF(z, I_K, F) = [ κ ∫ e^{κ(µ−x)} dF(x) ]⁻¹ · [ −∫ e^{κ(µ−x)} dF(x) + (1/(1+z)) e^{κµ(1−z)} + (z/(1+z)) e^{κµ(1−1/z)} ]   (8.24)

Therefore, lim_{z→∞} IF(z, I_K, F) < ∞.
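The boundedness of (8.24) can be checked numerically. The sketch below assumes, for illustration, F = Gamma(3, 1) — so that ∫e^{κ(µ−x)}dF(x) is available in closed form from the Gamma moment generating function — and κ = 0.5:

```python
import numpy as np

kappa, mu, alpha = 0.5, 3.0, 3.0   # Gamma(3, 1) has mean 3; aversion parameter kappa
# integral of e^{kappa(mu - x)} dF(x): e^{kappa mu} * E[e^{-kappa X}] via the Gamma mgf
A = np.exp(kappa * mu) * (1.0 + kappa) ** (-alpha)

def IF_kolm(z):
    # eq. (8.24) with the x >= 0 perturbation H^(z) of eq. (8.19)
    t1 = np.exp(kappa * mu * (1.0 - z)) / (1.0 + z)
    t2 = z / (1.0 + z) * np.exp(kappa * mu * (1.0 - 1.0 / z))
    return (-A + t1 + t2) / (kappa * A)

for z in [2.0, 10.0, 1e3, 1e6]:
    print(z, IF_kolm(z))
# converges to ((1 + kappa)**alpha - 1)/kappa = 4.75: the IF of Kolm's index is bounded
```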

8.2.4 Generalized entropy index

Members of the 'generalized entropy' family are defined by

I_E^β(F) = (1/(β(β+1))) ∫ [ (x/µ)^{β+1} − 1 ] dF(x)   (8.25)

where β ∈ (−∞; +∞). This family has very convenient properties for the study of inequality: each inequality measure derived from it can be interpreted as a measure of the distance between the distribution of the income and the distribution in which every economic unit receives the mean income µ (see Cowell 1980). The Theil (1967) inequality measures correspond to the limiting cases when β → −1 and β → 0. They are obtained by using the fact that

log(t) = lim_{α→0} (t^α − 1)/α   (8.26)

and are given by

I_E^{−1}(F) = −∫ log(x/µ) dF(x)   (8.27)

I_E^0(F) = ∫ (x/µ) log(x/µ) dF(x)   (8.28)

The corresponding φ functions are given by

φ(x, µ) = (1/(β(β+1))) [ (x/µ)^{β+1} − 1 ]  for β ≠ −1, 0
φ(x, µ) = −log(x/µ)                          for β = −1   (8.29)
φ(x, µ) = (x/µ) log(x/µ)                     for β = 0

For the IF, we have the following situation:

1. β ≥ 0: lim_{x→∞} φ(x, µ) = ∞; then the IF is unbounded.

2. −1 < β < 0: lim_{x→∞} φ(x, µ) = −∞ and φ(0, µ) < ∞; then the IF is bounded.

3. β ≤ −1: lim_{x→0} φ(x, µ) = ∞; then the IF is unbounded.

These results can also be found by analysing the IF with mean-preserving contaminations. If we consider the perturbation function given by (8.19), the IF becomes

IF(z, I_E^{β≠−1,0}, F) = −I_E^β(F) + (1/(β(β+1))) [ z^{β+1}/(1+z) + z/((1+z) z^{β+1}) − 1 ]   (8.30)

IF(z, I_E^{−1}, F) = −I_E^{−1}(F) − ((1−z)/(1+z)) log(z)   (8.31)

IF(z, I_E^0, F) = −I_E^0(F) − ((1−z)/(1+z)) log(z)   (8.32)

We can see then that the IF with mean-preserving contaminations is bounded for the members of the generalized entropy family having parameter −1 < β < 0. However, we should emphasize that this result is only valid for contaminations that leave the mean of the distribution unaffected. As we shall see in section 8.3, the possibility that the contamination affects the mean can have a dramatic impact on the IF.
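Evaluating (8.30) numerically makes the contrast between the two regimes visible. The sketch below prints IF values for a bounded case (β = −0.5) and an unbounded one (β = 0.5); the constant I_E^β(F) is set to zero since it only shifts the curve:

```python
import numpy as np

def IF_ge(z, beta, I_beta=0.0):
    # eq. (8.30): IF of the generalized entropy index under the mean-preserving,
    # x >= 0 perturbation (8.19); I_beta = I_E^beta(F) only shifts the curve
    c = 1.0 / (beta * (beta + 1.0))
    return -I_beta + c * (z ** (beta + 1) / (1 + z) + z / ((1 + z) * z ** (beta + 1)) - 1.0)

for beta in [-0.5, 0.5]:
    print(beta, [round(IF_ge(z, beta), 3) for z in [10.0, 1e2, 1e4]])
# beta = -0.5: the values settle down (bounded IF); beta = 0.5: they blow up
```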

8.2.5 The Gini index

Although the Gini index does not belong to the class of decomposable inequality measures, we will show that with mean-preserving contaminations, the IF of this statistic is also unbounded. The Gini concentration ratio is given by

I_G(F) = 1 − 2R(F) = 1 − 2 ∫₀¹ q_F(α) dα   (8.33)

where q_F(α) is given by

q_F(α) = µ⁻¹(F) ∫_{−∞}^{F⁻¹(α)} u dF(u)   (8.34)

and can be interpreted as the proportion of total income held by the proportion α of the poorest income receivers. We consider here the mixture distribution (8.1), which ensures that µ(G_ε) = µ(F) := µ. The influence on I_G of this misspecification is then given by

lim_{ε→0} [I_G(G_ε) − I_G(F)] / ε = 2 · lim_{ε→0} [R(F) − R(G_ε)] / ε   (8.35)

We have

R(G_ε) = ∫₀¹ q_{G_ε}(α) dα

     = ∫₀¹ (1/µ) ∫_{−∞}^{G_ε⁻¹(α)} u dG_ε(u) dα
     = (1/µ) ∫_{−∞}^{∞} ∫_{−∞}^{x} u dG_ε(u) dG_ε(x)
     = (1/µ) ∫_{−∞}^{∞} [ (1−ε) ∫_{−∞}^{x} u dF(u) + ε ∫_{−∞}^{x} u dH(u) ] dG_ε(x)
     = ((1−ε)²/µ) ∫_{−∞}^{∞} ∫_{−∞}^{x} u dF(u) dF(x)
       + (ε(1−ε)/µ) ∫_{−∞}^{∞} ∫_{−∞}^{x} u dF(u) dH(x)
       + (ε(1−ε)/µ) ∫_{−∞}^{∞} ∫_{−∞}^{x} u dH(u) dF(x)
       + (ε²/µ) ∫_{−∞}^{∞} ∫_{−∞}^{x} u dH(u) dH(x)   (8.36)

Let us now consider the perturbation distribution H^{(z)} given by (8.5). We then have

R(G_ε^{(z)}) = (1−ε)² R(F)
  + (ε(1−ε)/µ) [ (1/2) ∫_{−∞}^{µ−z} u dF(u) + (1/2) ∫_{−∞}^{µ+z} u dF(u) ]
  + (ε(1−ε)/µ) [ ((µ−z)/2) ∫_{µ−z}^{∞} dF(x) + ((µ+z)/2) ∫_{µ+z}^{∞} dF(x) ]
  + (ε²/µ) [ (1/2) ∫_{−∞}^{µ−z} u dH^{(z)}(u) + (1/2) ∫_{−∞}^{µ+z} u dH^{(z)}(u) ]   (8.37)

The IF is then given by

IF(z, I_G, F) = 4R(F) − (1/µ) [ ∫_{−∞}^{µ−z} u dF(u) + ∫_{−∞}^{µ+z} u dF(u) ]
  − ((µ−z)/µ) (F(µ+z) − F(µ−z)) − ((µ+z)/µ) (1 − F(µ+z))   (8.38)

How does the IF behave as z varies? We have

(∂/∂z) IF(z, I_G, F) = (1/µ) [ 2F(µ+z) − F(µ−z) − 1 ] − ((µ−z)/µ) f(µ+z)   (8.39)

As z → ∞, (∂/∂z) IF(z, I_G, F) → 1/µ and therefore IF(z, I_G, F) → ∞.
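Formula (8.38) can be evaluated in closed form for simple models. The sketch below assumes, for illustration, F = Exponential(1), for which µ = 1 and R(F) = 0.25 (Gini = 0.5); the IF grows roughly like z/µ for large z:

```python
import numpy as np

# F = Exponential(1): mu = 1, Gini = 0.5, so R(F) = (1 - I_G)/2 = 0.25
mu, R = 1.0, 0.25
F = lambda x: np.where(x > 0, 1.0 - np.exp(-x), 0.0)              # cdf
m = lambda a: np.where(a > 0, 1.0 - (1.0 + a) * np.exp(-a), 0.0)  # int_0^a u dF(u)

def IF_gini(z):
    # eq. (8.38) evaluated for the exponential model
    return (4.0 * R
            - (m(mu - z) + m(mu + z)) / mu
            - (mu - z) / mu * (F(mu + z) - F(mu - z))
            - (mu + z) / mu * (1.0 - F(mu + z)))

for z in [0.5, 1.0, 5.0, 50.0]:
    print(z, IF_gini(z))
# grows without bound, consistent with the slope 1/mu found in (8.39)
```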

8.3 Arbitrary contaminations

So far we have considered the issue of robustness of inequality measures given a very special kind of contamination: one which preserves the mean of the distribution. This approach would be relevant to cases in which inequality was defined in terms of income shares rather than incomes, and observations were available on shares. Under such circumstances the population mean would be known by definition. But this sort of situation is exceptional. Otherwise we have to assume either that the impact of the contamination on the mean is negligible, or that the impact on inequality of variability in the mean is negligible, neither of which is satisfactory. However, when we allow for arbitrary perturbations to a distribution, the resulting impact on inequality measures is going to yield an expression involving both transfer effects and a change in the mean, which are difficult to interpret analytically without more restrictions on the class of inequality measures. In this subsection, we derive the IF of an inequality measure belonging to the class of decomposable measures. We consider here the mixture model G_ε where the perturbation distribution is the point mass one at any point z. The IF is given by

IF(z, I, F) = ϕ₁(J(F, µ), µ) · (∂/∂ε) J(G_ε, µ(G_ε)) |_{ε=0} + ϕ₂(J(F, µ), µ) · [z − µ]   (8.40)

where ϕ_j, j = 1, 2, is the derivative of ϕ with respect to its jth argument. For most inequality measures, the relevant part is given by the first term in (8.40). We have

(∂/∂ε) J(G_ε, µ(G_ε)) |_{ε=0} = (∂/∂ε) [ (1−ε) ∫ φ(x, µ(G_ε)) dF(x) + ε φ(z, µ(G_ε)) ] |_{ε=0}
  = −∫ φ(x, µ) dF(x) + [ (∂/∂µ) ∫ φ(x, µ) dF(x) ] · [z − µ] + φ(z, µ)   (8.41)

We may make the following remarks: (a) the first term is independent of z; (b) the second term is due to the effect of the contamination on the mean; and (c) the third term is also present if we suppose that the mean is given. If a decomposable measure has an unbounded IF in the mean-preserving contamination case, then it must also be unbounded in the arbitrary contamination case by virtue of (c) above.

Some inequality measures which have a bounded IF in the mean-preserving contamination case will be unbounded in the arbitrary contamination case because of (b). This can easily be shown in the case of the generalized entropy family. Indeed, consider the same perturbation function as above and suppose that the mean µ is given. The IF of the members of the generalized entropy family with parameter −1 < β < 0 (for which the IF with mean-preserving contamination is bounded) is given by

IF(z, I_E^β, F) = −I_E^β(F) + (1/(β(β+1))) [ (z/µ)^{β+1} − 1 ]   (8.42)

We can see that unfortunately the IF is unbounded. Finally, for the case of the Gini index, unboundedness of the IF with arbitrary contaminations has been shown by Monti (1992). What is the conclusion? If we try to take account of the case where contamination affects the mean (which is probably more realistic in most applications), the problem of an unbounded IF arises. This presents a serious problem for the empirical analysis of income inequality from micro-data, since it means that a few "rogue" low observations (in the case of bottom-sensitive inequality measures) or high observations (for the other types of inequality measure) would drive the estimated value of the inequality index on their own. The situation may not be so bad for the rare cases where the mean does not itself have to be estimated, but this is small comfort. What can be done about this situation? One approach is to screen pragmatically by eye. This is frequently done and usually for good reasons. However, although the researcher's judgment may be very good in a particular instance, the procedure is very arbitrary. Moreover, since the only option is whether or not to drop the questionable observation from the sample, there can be situations in which this procedure is regarded as too drastic. We will suggest an alternative procedure in the next section. First let us take a look at the importance of contamination on empirical estimates of inequality.

8.3.1 Simulation study

For this first simulation study, we computed the Theil index I_E^0 (see equation (8.28)) which, like many other inequality measures, may be determined empirically by only a few observations. This is shown in table 8.1. We computed samples of 200 observations generated by a Lognormal distribution given by

F_{µ,σ}(t) = ∫₀ᵗ (1/(x √(2πσ²))) e^{−(1/(2σ²))(log(x) − µ)²} dx   (8.43)

with parameters µ = 1.0 and σ = 0.8. We contaminated a percentage of those observations by simply multiplying them by 10. The differences between the Theil indexes are then due to those small proportions of observations.

Degree of contamination    Theil Index (SD)
0%                         0.312 (0.002)
1%                         0.430 (0.024)
2%                         0.515 (0.030)
3%                         0.586 (0.027)
4%                         0.649 (0.029)
5%                         0.700 (0.031)

Table 8.1: Empirical Theil index when a random proportion of data are multiplied by 10

We can see that with only 5% of "rogue" data, the Theil index becomes more than twice its initial value! However, one could argue that this type of contamination, i.e. a random proportion of observations multiplied by 10, is too extreme. But this is the type of error that can occur during the recording process, that is, a "decimal point error". The other type of error we consider here is the week-month confusion: the data are supposedly collected on weekly income, but some respondents actually report income per month. Although this type of error is smaller, we can see by means of a simulation study that its effect on inequality measures is still quite important. In table 8.2 we show the computed Theil index when a random proportion of data are multiplied by 4.
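Both experiments are easy to reproduce. The sketch below is not the thesis code (seeds and sample paths differ, so the figures only approximate tables 8.1 and 8.2); it averages the empirical Theil index over 100 replications of 200 Lognormal(1.0, 0.8) observations:

```python
import numpy as np

rng = np.random.default_rng(2)

def theil(x):
    # empirical Theil index I_E^0: mean of (x/mu) log(x/mu), eq. (8.28)
    r = x / x.mean()
    return np.mean(r * np.log(r))

def contaminated_theil(factor, eps, reps=100, n=200, mu=1.0, sigma=0.8):
    vals = []
    for _ in range(reps):
        x = rng.lognormal(mu, sigma, n)
        idx = rng.choice(n, int(eps * n), replace=False)
        x[idx] *= factor                 # decimal-point (10) or week/month (4) error
        vals.append(theil(x))
    return np.mean(vals)

print(contaminated_theil(10, 0.00))   # about sigma**2/2 = 0.32 for the clean model
print(contaminated_theil(10, 0.05))   # decimal-point errors: the index roughly doubles
print(contaminated_theil(4, 0.05))    # week/month errors: smaller but clear upward bias
```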

8.4 Parametric estimation approach

8.4.1 Influence function of generalized entropy indexes

As an alternative to direct estimation of inequality, we will consider here the parametric approach, i.e. the analysis of inequality measures estimated through a parametric model (F = F_θ). Although the following results apply to any type of inequality measure, we will examine the particular case of the generalized entropy family.

Degree of contamination    Theil Index (SD)
0%                         0.312 (0.002)
1%                         0.334 (0.003)
2%                         0.357 (0.004)
3%                         0.379 (0.010)
4%                         0.400 (0.010)
5%                         0.429 (0.012)

Table 8.2: Empirical Theil index when a random proportion of data are multiplied by 4

An estimator of one of the members of this family (henceforth called GEFE) is then obtained by replacing F in (8.25) by Fθ where θ is estimated from the sample. The parametric approach has several advantages, but the most important for our purpose is that it allows one to build robust estimators. Let T (Fθ) be an estimator (functional) of θ for the parametric model Fθ. When considering the usual contaminated model Gε

G_ε = (1 − ε)F_θ + ε∆_z   (8.44)

the IF of a GEFE is given by

IF(z, I_E^β, F_θ) = A(β, F_θ) · IF(z, T, F_θ)   (8.45)

where

A(β, F_θ) = (1/(β(β+1))) ∫ (x/µ(F_θ))^{β+1} s(x, θ)^T dF_θ(x)
  − (1/(β µ(F_θ))) ∫ (x/µ(F_θ))^{β+1} dF_θ(x) · ∫ x s(x, θ)^T dF_θ(x)   (8.46)

For the particular cases when β = 0 and β = −1, we have

A(0, F_θ) = ∫ (x/µ(F_θ)) log(x/µ(F_θ)) s(x, θ)^T dF_θ(x)
  − ∫ (x/µ(F_θ)) s(x, θ)^T dF_θ(x) · [ ∫ (x/µ(F_θ)) log(x/µ(F_θ)) dF_θ(x) + 1 ]   (8.47)

and

A(−1, F_θ) = −∫ log(x/µ(F_θ)) s(x, θ)^T dF_θ(x) + ∫ (x/µ(F_θ)) s(x, θ)^T dF_θ(x)   (8.48)

Since A(β, F_θ) does not depend on z, the IF of the GEFE is proportional to the IF of the estimator of θ. Therefore, if IF(z, T, F_θ) is unbounded, then so is IF(z, I_E^β, F_θ). A reasonable way of obtaining robust estimators of an income inequality measure is then through the specification of a parametric model where the parameters are estimated robustly. For the estimator of θ, we propose taking the OBRE.

8.4.2 Simulation results

In order to show how large the bias of an income inequality measure estimated through parametric models can be when the data are contaminated, we computed the Theil index I_E^0 for different samples. The results come from 100 simulated samples of 200 observations from a Gamma distribution. The mixture distribution is

G_ε = (1 − ε)F_{α,λ} + εF_{α,0.1·λ}   (8.49)

where α = 3.0 and λ = 1.0. The results are presented in table 8.3 and table 8.4.² We can see that only 3% of contamination pushes the value of the Theil index from 0.155 to 0.270, and hence determines almost alone the value of the estimator. This is certainly not a desirable property. However, it is possible to avoid these situations with robustified inequality measures.

                       Bias of the MLE   MSE of the MLE   Theil Index
No contamination   α        0.05              0.07           0.155
                   λ        0.01              0.01
3% contamination   α       -1.33              1.89           0.270
                   λ       -0.54              0.32
5% contamination   α       -1.72              3.09           0.342
                   λ       -0.70              0.50

Table 8.3: MLE and Theil index with and without data contamination

In table 8.4 we present the same computations as in table 8.3, except that instead of using the MLE, we calculated the OBRE (c = 1.5).

²These results were already presented in another form in tables 3.2, 3.4 and 3.5.

                       Bias of the OBRE   MSE of the OBRE   Theil Index
No contamination   α        0.06               0.07            0.155
                   λ        0.02               0.01
3% contamination   α        0.15               0.12            0.165
                   λ        0.07               0.02
5% contamination   α       -0.22               0.16            0.169
                   λ       -0.11               0.03

Table 8.4: OBRE and Theil index with and without data contamination

The results show that with OBRE, the value of the Theil index is almost unaffected by the contaminations, even with an amount of 5%.
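The MLE column of table 8.3 can be approximated with standard tools. The sketch below is not the thesis code: it fits single samples (rather than 100 replications) by Gamma MLE via scipy, and uses the closed form I_E^0 = ψ(α + 1) − log(α) for the Theil index of a Gamma(α) model; the OBRE column has no off-the-shelf equivalent here.

```python
import numpy as np
from scipy import stats
from scipy.special import digamma

rng = np.random.default_rng(3)

def theil_gamma(a):
    # Theil index I_E^0 of a Gamma(a, lambda) model: digamma(a + 1) - log(a),
    # free of the scale; at a = 3 this is about 0.157, as in table 8.3
    return digamma(a + 1.0) - np.log(a)

alpha, lam, n = 3.0, 1.0, 200
for eps in [0.0, 0.03, 0.05]:
    # mixture (8.49): a fraction eps of the sample comes from F_{alpha, 0.1 lambda},
    # i.e. a Gamma with ten times the scale (lambda being the rate parameter)
    k = int(eps * n)
    x = np.concatenate([rng.gamma(alpha, 1.0 / lam, n - k),
                        rng.gamma(alpha, 10.0 / lam, k)])
    a_hat, _, scale_hat = stats.gamma.fit(x, floc=0)   # MLE, location fixed at 0
    print(f"eps={eps:.2f}  alpha_hat={a_hat:.2f}  Theil={theil_gamma(a_hat):.3f}")
```

The fitted shape drops sharply under contamination, dragging the implied Theil index upward, in line with the MLE rows of table 8.3.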

8.5 Conclusion

Inequality measures should convey practical information about income distributions. Whether or not they do this effectively depends of course upon the reliability of income data; it also depends upon the method of estimation. It is possible that under certain estimation procedures the apparent picture of inequality is strongly influenced by data contamination. The conventional properties of inequality measures also play a rôle in determining the extent to which the picture of the PID is distorted by data contamination: the IF is a useful device for quantifying this effect. One might have supposed that the fundamental property of mainstream inequality analysis - the transfer principle - is a sufficient condition for unboundedness of the IF. We have shown that this is not the case. However, this negative result is not actually very encouraging, because we have also shown that as soon as one introduces other standard features of inequality measures (scale invariance, decomposability) or a more realistic specification of the estimation problem (where the mean itself has to be estimated rather than being specified a priori), the problem of unboundedness of the IF re-emerges. As we have shown, the impact upon measured inequality of quite small amounts of contamination in the tails of the distribution can be disastrous. Controlling for this contamination in practice can be tricky, and ad hoc methods are likely to be unreliable. One way of dealing with this problem is to adopt a parametric approach to PID analysis and inequality measurement, and to estimate inequality through robust estimates of the parameters of the income distribution model. Even if the robust estimates of inequality are not used on their own, they should provide a useful supplementary check against estimates of inequality computed by classical methods.
Where discrepancies between the results for the two approaches emerge and are attributable to a small number of observations, this information should be taken into account in drawing conclusions about the "true" picture of the underlying PID.

Chapter 9

Conclusion

In this thesis, we developed several statistical tools for analysing the PID and for computing inequality measures. These tools can also be used for other models or other problems. We concentrated on robust statistical procedures because, as we often argued, economic data and especially income data are likely to be contaminated by some extreme observations. These procedures offer the advantage of taking into account the occurrence of such observations during the statistical analysis, mostly in an optimal way. We based our results on the IF. Our work can be divided into six different steps:

1. Robust estimators for PID models when the data are continuous and complete

2. Robust estimators for PID models when the data are continuous and censored

3. Robust estimators for PID models when the data are grouped

4. Robust model choice test (for choosing in particular a PID model)

5. Robustness properties of goodness-of-fit tests

6. Robustness and inequality measures

In step (1) we applied the general theory of OBRE to the case of PID. We showed by means of simulations and real data that robust estimators can be very useful when fitting a PID model to the data. Indeed, when using parsimonious models such as the Gamma distribution (as compared with, e.g., Dagum models), the classical estimators (MLE) can completely ruin

the analysis (see figure 3.2). On the other hand, OBRE reflect the behaviour of the majority of the data. In step (2) we widened the framework to censored data. We proposed a generalization of the EM algorithm, namely the EMM algorithm. The former allows one to compute the MLE when the data are censored, whereas the EMM algorithm allows one to compute robust estimators in the same situation. We compared this new approach with the classical approach, in which one considers the conditional distribution of the data, in the case of truncated data. The difference lies in the computation of the weights of the robust estimators, but this difference is detectable empirically only when the information loss is very large. We found no evidence for preferring one or the other approach, but we argued that the EMM algorithm is of wider applicability. In future research, we would like to investigate more deeply the theoretical differences between the two approaches. In step (3) we treated the case of grouped data. We first computed the IF of the general class of classical estimators for grouped data given by the MPE. We found that although the IF of any MPE is bounded, the bias on the estimator due to infinitesimal amounts of contamination can be large. Therefore, we defined a larger class of estimators and computed their properties (asymptotic distribution, IF, ...). We then applied Hampel's optimality problem and computed optimally bounded-influence robust estimators for grouped data. We showed by means of a numerical example that they are less influenced by contaminations than the MPE. In future research, we intend to analyse real income data available only in grouped form. These first three steps constitute the necessary tools to compute robust estimators for a given parametric PID model. In step (4) we completed the work with the development of a robust procedure for choosing one PID model.
We concentrated on tests between non-nested hypotheses and used the results available in the literature on robust bounded-influence tests for parametric models. Indeed, the Cox and Cox-type statistics can be interpreted as parametric LM tests. After computing the LIF of the Cox statistic, we showed that when we apply robust parametric tests to this statistic, we bound exactly the right quantity. We then proposed this new robust statistic as a robust model choice statistic. Finally, through a numerical application, we showed that the latter has at least two important advantages compared with the classical Cox statistic: the test is not influenced by small amounts of contamination and the asymptotic distribution is an accurate approximation. In future research we would like to investigate the power of the robust statistic and apply our results to real income data and to other models. In order to be as complete as possible, in step (5) we studied goodness-of-fit statistics. We were able to show, by computing the LIF of the Cressie and Read class of test statistics and by means of simulations, that classical goodness-of-fit tests can be badly influenced by infinitesimal amounts of contamination. We proposed an ad hoc robust statistic based on OBRE for grouped data. In future research, we would like to analyse this statistic through simulations. Finally, in step (6) we studied in more detail the case of inequality measures. We computed the IF of several classes of inequality measures in order to relate its behaviour to the economic properties the different inequality measures fulfil. We found that the principle of transfers alone does not imply an unbounded IF, but when combined with decomposability, it does (except in some very special cases). Therefore, we proposed to compute inequality measures via robust estimates of parametric PID models.
Since there can be a bias due to the choice of an inadequate PID model, some would prefer to compute nonparametric inequality measures. In future research, we would like to investigate this approach in a robust fashion.

This concludes our thesis. Our hope is that it will at least stimulate discussion among researchers working in applied fields involving the analysis of data about the usefulness of considering a robust approach.

Appendix A

Functional Forms for PID

A.1 Terminology and notations

• $X$ = income variable
• $F(x)$ = income distribution function
• $f(x)$ = income density function
• $\mu = \int x \, dF(x)$ = distribution mean
• $\bar{m}$ = the median of the distribution
• $m(x) = \frac{\int_0^x t \, dF(t)}{F(x)}$ = the incomplete first moment
• $k_j = \#\{x_i \in I_j\}$ = income quantiles or income shares (ordered in ascending order), with $\sum_j k_j = 1$
• $\tilde{x} = \exp\left(\int \log(x)\, dF(x)\right)$ is the geometric mean
• $\Gamma(\alpha) = \int_0^\infty e^{-u} u^{\alpha-1}\, du$ = Gamma function
• $\tilde{\Gamma}(\alpha) = \frac{\partial}{\partial\alpha}\log\Gamma(\alpha)$
• $B(\alpha,\beta) = \int_0^1 t^{\alpha-1}(1-t)^{\beta-1}\, dt = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ = Beta function
• $\tilde{B}_1(\alpha,\beta) = \frac{\partial}{\partial\alpha}\log B(\alpha,\beta)$
• $\tilde{B}_2(\alpha,\beta) = \frac{\partial}{\partial\beta}\log B(\alpha,\beta)$
• $\xi(\alpha) = \sum_{n=1}^{\infty} \frac{1}{n^\alpha}$, $\alpha > 1$, is the Riemann zeta function
• $\tilde{\xi}(\alpha) = \frac{\partial}{\partial\alpha}\log\xi(\alpha)$


A.2 Pareto type I

The Pareto type I model is commonly called the Pareto law. The associated distribution function is given by
\[
F_\alpha(x) = 1 - \left(\frac{x}{x_0}\right)^{-\alpha} \tag{A.1}
\]
with $0 < x_0 \le x < \infty$ and $\alpha > 1$. The density function is given by
\[
f(x;\alpha) = \alpha x^{-(\alpha+1)} x_0^{\alpha} \tag{A.2}
\]
The scores function is given by
\[
s(x;\alpha) = \frac{1}{\alpha} - \log(x) + \log(x_0) \tag{A.3}
\]
We have
\[
\lim_{x\to\infty} s(x;\alpha) = -\infty \tag{A.4}
\]
Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_\alpha) \right\| = \infty \tag{A.5}
\]
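The practical meaning of this unbounded score can be sketched numerically. The following Python fragment is an illustration of ours, not part of the thesis; it uses the closed-form MLE of $\alpha$, $\hat\alpha = n/\sum_i \log(x_i/x_0)$, and shows how a single gross outlier drags the estimate away from the value supported by the bulk of the data:

```python
import math

# Illustration (not from the thesis): the unbounded score of the Pareto
# type I model means one extreme observation can drag the MLE.
# For this model the MLE has the closed form
#   alpha_hat = n / sum(log(x_i / x0)).

def pareto1_mle(sample, x0=1.0):
    """Closed-form MLE of alpha for the Pareto type I model."""
    n = len(sample)
    return n / sum(math.log(x / x0) for x in sample)

# A deterministic "clean" sample: inverse-CDF quantiles for alpha = 2,
# x = x0 * (1 - u)^(-1/alpha) at midpoints u = (i - 0.5)/n.
alpha, n = 2.0, 100
clean = [(1.0 - (i - 0.5) / n) ** (-1.0 / alpha) for i in range(1, n + 1)]

a_clean = pareto1_mle(clean)           # close to the true alpha = 2
a_contam = pareto1_mle(clean + [1e6])  # one gross outlier added

print(a_clean, a_contam)  # the single outlier pulls alpha_hat down
```

The OBRE discussed in the thesis are designed precisely to limit this kind of effect.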

A.3 Pareto type II

The associated distribution function is given by
\[
F_{\alpha,\lambda}(x) = 1 - \left(\frac{x-\lambda}{x_0-\lambda}\right)^{-\alpha} \tag{A.6}
\]
with $\lambda < x_0 \le x < \infty$ and $\alpha > 1$. The density function is given by
\[
f(x;\alpha,\lambda) = \alpha (x-\lambda)^{-(\alpha+1)} (x_0-\lambda)^{\alpha} \tag{A.7}
\]
The scores function is given by
\[
s(x;\alpha,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\alpha} - \log(x-\lambda) + \log(x_0-\lambda) \\
\frac{\alpha+1}{x-\lambda} - \frac{\alpha}{x_0-\lambda}
\end{pmatrix} \tag{A.8}
\]
We have
\[
\lim_{x\to\infty} s(x;\alpha,\lambda) = \begin{pmatrix} -\infty \\ -\frac{\alpha}{x_0-\lambda} \end{pmatrix} \tag{A.9}
\]
Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\lambda}) \right\| = \infty \tag{A.10}
\]

A.4 Pareto type III

The distribution function is given by
\[
F_{\alpha,\beta,\lambda}(x) = 1 - \left(\frac{x-\lambda}{x_0-\lambda}\right)^{-\alpha} \exp(-\beta x) \tag{A.11}
\]
with $\lambda < x_0 \le x < \infty$ and $\alpha, \beta > 0$. The density function is given by
\[
f(x;\alpha,\beta,\lambda) = \left(\frac{\alpha}{x-\lambda} + \beta\right) \left(\frac{x-\lambda}{x_0-\lambda}\right)^{-\alpha} e^{-\beta x} \tag{A.12}
\]
The scores function is given by
\[
s(x;\alpha,\beta,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\lambda) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\left(\alpha + \beta(x-\lambda)\right)^{-1} - \log\frac{x-\lambda}{x_0-\lambda} \\
\left(\frac{\alpha}{x-\lambda} + \beta\right)^{-1} - x \\
\frac{\alpha(x-\lambda)^{-1}}{\alpha+\beta(x-\lambda)} + \frac{\alpha}{x-\lambda} - \frac{\alpha}{x_0-\lambda}
\end{pmatrix} \tag{A.13}
\]
We have
\[
\lim_{x\to\infty} s(x;\alpha,\beta,\lambda) = \begin{pmatrix} -\infty \\ -\infty \\ -\frac{\alpha}{x_0-\lambda} \end{pmatrix} \tag{A.14}
\]
Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\lambda}) \right\| = \infty \tag{A.15}
\]

A.5 Gamma distribution

The Gamma distribution from Pearson's type III family may be defined by
\[
f(x;\alpha,\lambda) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x} \tag{A.16}
\]
with $0 \le x < \infty$ and $\alpha, \lambda > 0$. The scores function is given by
\[
s(x;\alpha,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\log(\lambda) - \tilde{\Gamma}(\alpha) + \log(x) \\
\frac{\alpha}{\lambda} - x
\end{pmatrix} \tag{A.17}
\]

We have
\[
\lim_{x\to\infty} s(x;\alpha,\lambda) = \begin{pmatrix} \infty \\ -\infty \end{pmatrix} \tag{A.18}
\]

Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\lambda}) \right\| = \infty \tag{A.19}
\]
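The score expressions above can be checked mechanically. The sketch below is ours, not from the thesis (the helper names are arbitrary); it compares (A.17) with numerical derivatives of $\log f$, approximating $\tilde{\Gamma}$ by differencing `math.lgamma`:

```python
import math

# Illustration (not from the thesis): check the Gamma scores (A.17)
# against numerical derivatives of log f. The digamma function
# (Gamma-tilde) is itself approximated by differencing math.lgamma.

def log_f(x, a, lam):
    """Log-density of the Gamma distribution (A.16)."""
    return a * math.log(lam) - math.lgamma(a) + (a - 1) * math.log(x) - lam * x

def digamma(a, h=1e-5):
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def score(x, a, lam):
    """Scores function (A.17)."""
    return (math.log(lam) - digamma(a) + math.log(x), a / lam - x)

x, a, lam, h = 3.7, 2.5, 1.3, 1e-6
s_alpha, s_lambda = score(x, a, lam)
num_alpha = (log_f(x, a + h, lam) - log_f(x, a - h, lam)) / (2 * h)
num_lambda = (log_f(x, a, lam + h) - log_f(x, a, lam - h)) / (2 * h)

print(abs(s_alpha - num_alpha), abs(s_lambda - num_lambda))  # both tiny
```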

A.6 Benini distribution

The distribution function is given by
\[
F_\delta(x) = 1 - \left(\frac{x}{x_0}\right)^{-1-\delta(\log(x)-\log(x_0))} \tag{A.20}
\]
with $0 < x_0 \le x < \infty$ and $\delta > 0$. The density function is given by
\[
f(x;\delta) = x^{-1}\left(\frac{x}{x_0}\right)^{-1-\delta\log\frac{x}{x_0}}\left(1 + 2\delta\log\frac{x}{x_0}\right) \tag{A.21}
\]
The scores function is given by
\[
s(x;\delta) = \left[\left(2\log\frac{x}{x_0}\right)^{-1} + \delta\right]^{-1} - \log^2\frac{x}{x_0} \tag{A.22}
\]
We have
\[
\lim_{x\to\infty} s(x;\delta) = -\infty \tag{A.23}
\]
Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_\delta) \right\| = \infty \tag{A.24}
\]

A.7 Vinci distribution

The density function may be defined by
\[
f(x;\alpha,\lambda) = \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{-\alpha-1} e^{-\lambda x^{-1}} \tag{A.25}
\]
with $0 \le x < \infty$ and $\alpha, \lambda > 0$. The scores function is given by
\[
s(x;\alpha,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\log(\lambda) - \tilde{\Gamma}(\alpha) - \log(x) \\
\frac{\alpha}{\lambda} - x^{-1}
\end{pmatrix} \tag{A.26}
\]

We have
\[
\lim_{x\to 0} s(x;\alpha,\lambda) = \begin{pmatrix} \infty \\ -\infty \end{pmatrix} \tag{A.27}
\]

Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\lambda}) \right\| = \infty \tag{A.28}
\]

A.8 Generalized Gamma distribution

The density function may be written as

\[
f(x;\alpha,\beta,\lambda,\gamma) = \frac{|\beta|\,\lambda^{\alpha}}{\Gamma(\alpha)} (x-\gamma)^{\alpha\beta-1} e^{-\lambda(x-\gamma)^{\beta}} \tag{A.29}
\]
with $0 < \gamma \le x < \infty$, $\alpha, \lambda > 0$ and $\beta \ne 0$. The scores function is given by
\[
s(x;\alpha,\beta,\lambda,\gamma) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\lambda,\gamma) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\lambda,\gamma) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\lambda,\gamma) \\
\frac{\partial}{\partial\gamma}\log f(x;\alpha,\beta,\lambda,\gamma)
\end{pmatrix}
=
\begin{pmatrix}
\log(\lambda) - \tilde{\Gamma}(\alpha) + \beta\log(x-\gamma) \\
\frac{1}{\beta} + \log(x-\gamma)\left(\alpha - \lambda(x-\gamma)^{\beta}\right) \\
\frac{\alpha}{\lambda} - (x-\gamma)^{\beta} \\
-\frac{\alpha\beta-1}{x-\gamma} + \beta\lambda(x-\gamma)^{\beta-1}
\end{pmatrix} \tag{A.30}
\]
We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\lambda,\gamma) =
\begin{cases} \infty & \text{if } \beta > 0 \\ -\infty & \text{if } \beta < 0 \end{cases} \tag{A.31}
\]

Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\lambda,\gamma}) \right\| = \infty \tag{A.32}
\]

A.9 Lognormal type I

The density function is given by
\[
f(x;\mu,\sigma^2) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}\left(\log(x)-\mu\right)^2\right) \tag{A.33}
\]
with $0 \le x < \infty$ and $\sigma > 0$. The scores function is given by
\[
s(x;\mu,\sigma^2) =
\begin{pmatrix}
\frac{\partial}{\partial\mu}\log f(x;\mu,\sigma^2) \\
\frac{\partial}{\partial\sigma^2}\log f(x;\mu,\sigma^2)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\sigma^2}\left(\log(x)-\mu\right) \\
-\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\left(\log(x)-\mu\right)^2
\end{pmatrix} \tag{A.34}
\]
We have
\[
\lim_{x\to\infty} s(x;\mu,\sigma^2) = \begin{pmatrix} \infty \\ \infty \end{pmatrix} \tag{A.35}
\]

Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\mu,\sigma^2}) \right\| = \infty \tag{A.36}
\]

A.10 Davis distribution

The density function is given by
\[
f(x;\alpha,\lambda) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)\xi(\alpha)} (x-x_0)^{-\alpha-1} \left[\exp\left(\frac{\lambda}{x-x_0}\right) - 1\right]^{-1} \tag{A.37}
\]
with $0 \le x_0 < x < \infty$, $\lambda > 0$ and $\alpha > 1$. The scores function is given by
\[
s(x;\alpha,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\log(\lambda) - \tilde{\Gamma}(\alpha) - \tilde{\xi}(\alpha) - \log(x-x_0) \\
\frac{\alpha}{\lambda} - \frac{(x-x_0)^{-1}}{1-\exp\left(-\frac{\lambda}{x-x_0}\right)}
\end{pmatrix} \tag{A.38}
\]

We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\alpha}\log f(x;\alpha,\lambda) = -\infty \tag{A.39}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\lambda}) \right\| = \infty \tag{A.40}
\]

A.11 Weibull distribution

The distribution function is given by
\[
F_\alpha(x) = 1 - \exp\left(-\frac{(x-x_0)^{\alpha}}{x_0}\right) \tag{A.41}
\]
where $0 \le x_0 < x < \infty$ and $\alpha > 0$. The associated density function is given by
\[
f(x;\alpha) = \frac{\alpha}{x_0}(x-x_0)^{\alpha-1} \exp\left(-\frac{(x-x_0)^{\alpha}}{x_0}\right) \tag{A.42}
\]
The scores function is given by
\[
s(x;\alpha) = \frac{1}{\alpha} + \log(x-x_0)\left(1 - \frac{(x-x_0)^{\alpha}}{x_0}\right) \tag{A.43}
\]
We have
\[
\lim_{x\to x_0} s(x;\alpha) = -\infty \tag{A.44}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_\alpha) \right\| = \infty \tag{A.45}
\]

A.12 Fisk logistic distribution

The distribution function is given by

\[
F_{\lambda,\delta}(x) = \left(1 + \lambda x^{-\delta}\right)^{-1} \tag{A.46}
\]
with $0 \le x < \infty$, $\delta > 1$ and $\lambda > 0$. The associated density function is given by
\[
f(x;\lambda,\delta) = \lambda\delta x^{-\delta-1}\left(1+\lambda x^{-\delta}\right)^{-2} \tag{A.47}
\]

The scores function is given by
\[
s(x;\lambda,\delta) =
\begin{pmatrix}
\frac{\partial}{\partial\lambda}\log f(x;\lambda,\delta) \\
\frac{\partial}{\partial\delta}\log f(x;\lambda,\delta)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\lambda} - \frac{2}{x^{\delta}+\lambda} \\
\frac{1}{\delta} - \log(x)\left(1 - \frac{2\lambda}{x^{\delta}+\lambda}\right)
\end{pmatrix} \tag{A.48}
\]

We have
\[
\lim_{x\to\infty} s(x;\lambda,\delta) = \begin{pmatrix} \frac{1}{\lambda} \\ -\infty \end{pmatrix} \tag{A.49}
\]

Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\lambda,\delta}) \right\| = \infty \tag{A.50}
\]

A.13 Generalized Beta distribution I

The density function is given by
\[
f(x;\alpha,\beta,\gamma,\lambda) = \frac{\alpha x^{\alpha\gamma-1}}{\beta^{\alpha\gamma} B(\gamma,\lambda)} \left(1 - \left(\frac{x}{\beta}\right)^{\alpha}\right)^{\lambda-1} \tag{A.51}
\]
with $0 \le x < \beta$, $0 < \alpha < 1$ and $\gamma, \lambda > 0$. The scores function is given by
\[
s(x;\alpha,\beta,\gamma,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\gamma,\lambda) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\gamma,\lambda) \\
\frac{\partial}{\partial\gamma}\log f(x;\alpha,\beta,\gamma,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\gamma,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\alpha} + \gamma\log(x) - \gamma\log(\beta) - \frac{(\lambda-1)\log\frac{x}{\beta}}{\left(\frac{x}{\beta}\right)^{-\alpha}-1} \\
-\frac{\alpha\gamma}{\beta} + \frac{(\lambda-1)\frac{\alpha}{\beta}}{\left(\frac{x}{\beta}\right)^{-\alpha}-1} \\
\alpha\log(x) - \alpha\log(\beta) - \tilde{B}_1(\gamma,\lambda) \\
-\tilde{B}_2(\gamma,\lambda) + \log\left(1-\left(\frac{x}{\beta}\right)^{\alpha}\right)
\end{pmatrix} \tag{A.52}
\]
We have
\[
\lim_{x\to\beta} \frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\gamma,\lambda) = -\infty \tag{A.53}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\gamma,\lambda}) \right\| = \infty \tag{A.54}
\]

A.14 Generalized Beta distribution II

The density function is given by
\[
f(x;\alpha,\beta,\gamma,\lambda) = \frac{\alpha x^{\alpha\gamma-1}}{\beta^{\alpha\gamma} B(\gamma,\lambda)} \left(1 - \left(\frac{x}{\beta}\right)^{\alpha}\right)^{-(\gamma+\lambda)} \tag{A.55}
\]
with $0 \le x < \beta$, $0 < \alpha < 1$ and $\gamma, \lambda > 0$. The scores function is given by
\[
s(x;\alpha,\beta,\gamma,\lambda) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\gamma,\lambda) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\gamma,\lambda) \\
\frac{\partial}{\partial\gamma}\log f(x;\alpha,\beta,\gamma,\lambda) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\gamma,\lambda)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\alpha} + \gamma\log(x) - \gamma\log(\beta) + \frac{(\gamma+\lambda)\log\frac{x}{\beta}}{\left(\frac{x}{\beta}\right)^{-\alpha}-1} \\
-\frac{\alpha\gamma}{\beta} - \frac{(\gamma+\lambda)\frac{\alpha}{\beta}}{\left(\frac{x}{\beta}\right)^{-\alpha}-1} \\
\alpha\log(x) - \alpha\log(\beta) - \tilde{B}_1(\gamma,\lambda) - \log\left(1-\left(\frac{x}{\beta}\right)^{\alpha}\right) \\
-\tilde{B}_2(\gamma,\lambda) - \log\left(1-\left(\frac{x}{\beta}\right)^{\alpha}\right)
\end{pmatrix} \tag{A.56}
\]
We have
\[
\lim_{x\to\beta} \frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\gamma,\lambda) = \infty \tag{A.57}
\]
Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\gamma,\lambda}) \right\| = \infty \tag{A.58}
\]

A.15 Singh-Maddala distribution

The distribution function is given by
\[
F_{\alpha,\beta,\gamma}(x) = 1 - \frac{1}{(1+\alpha x^{\beta})^{\gamma}} \tag{A.59}
\]
with $0 \le x < \infty$, $\alpha, \beta, \gamma > 0$ and $\beta\gamma > 1$. The corresponding density function is given by
\[
f(x;\alpha,\beta,\gamma) = \alpha\beta\gamma x^{\beta-1}\left(1+\alpha x^{\beta}\right)^{-\gamma-1} \tag{A.60}
\]
The scores function is given by
\[
s(x;\alpha,\beta,\gamma) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\gamma) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\gamma) \\
\frac{\partial}{\partial\gamma}\log f(x;\alpha,\beta,\gamma)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\alpha} - \frac{(\gamma+1)x^{\beta}}{1+\alpha x^{\beta}} \\
\frac{1}{\beta} + \log(x)\left(1 - \frac{\alpha(\gamma+1)x^{\beta}}{1+\alpha x^{\beta}}\right) \\
\frac{1}{\gamma} - \log\left(1+\alpha x^{\beta}\right)
\end{pmatrix} \tag{A.61}
\]
We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\gamma}\log f(x;\alpha,\beta,\gamma) = -\infty \quad \text{if } \beta > 0 \tag{A.62}
\]
and
\[
\lim_{x\to\infty} \frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\gamma) = \infty \quad \text{if } \beta < 0 \tag{A.63}
\]
Therefore
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\gamma}) \right\| = \infty \tag{A.64}
\]
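Since (A.59) inverts in closed form, the Singh-Maddala model is easy to simulate by inversion. The following small sketch is ours, not part of the thesis; solving $u = F(x)$ gives the quantile function used below:

```python
# Illustration (not from the thesis): the Singh-Maddala distribution
# function (A.59) inverts in closed form, which gives a simple
# inverse-CDF sampler: solving u = F(x) yields
#   Q(u) = (((1 - u)^(-1/gamma) - 1) / alpha)^(1/beta).

def sm_cdf(x, alpha, beta, gamma):
    """Singh-Maddala distribution function (A.59)."""
    return 1.0 - (1.0 + alpha * x ** beta) ** (-gamma)

def sm_quantile(u, alpha, beta, gamma):
    """Inverse of (A.59) for 0 < u < 1."""
    return (((1.0 - u) ** (-1.0 / gamma) - 1.0) / alpha) ** (1.0 / beta)

a, b, g = 1.5, 2.0, 1.2
qs = [sm_quantile(u, a, b, g) for u in (0.1, 0.5, 0.9)]
roundtrip = [sm_cdf(x, a, b, g) for x in qs]
print(roundtrip)  # recovers (0.1, 0.5, 0.9) up to rounding
```

Feeding uniform draws through `sm_quantile` produces Singh-Maddala distributed incomes.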

A.16 Lognormal type II

The density function is given by
\[
f(x;\mu,\sigma^2) = \frac{1}{\sigma(x-x_0)\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}\left(\log(x-x_0)-\mu\right)^2\right) \tag{A.65}
\]
with $0 \le x_0 < x < \infty$ and $\sigma > 0$. The scores function is given by
\[
s(x;\mu,\sigma^2) =
\begin{pmatrix}
\frac{\partial}{\partial\mu}\log f(x;\mu,\sigma^2) \\
\frac{\partial}{\partial\sigma^2}\log f(x;\mu,\sigma^2)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\sigma^2}\left(\log(x-x_0)-\mu\right) \\
-\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\left(\log(x-x_0)-\mu\right)^2
\end{pmatrix} \tag{A.66}
\]
We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\mu}\log f(x;\mu,\sigma^2) = \infty \tag{A.67}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\mu,\sigma^2}) \right\| = \infty \tag{A.68}
\]

A.17 Dagum model type I

The density function is given by
\[
f(x;\beta,\lambda,\delta) = \beta\lambda\delta x^{-\delta-1}\left(1+\lambda x^{-\delta}\right)^{-\beta-1} \tag{A.69}
\]
with $0 \le x < \infty$, $\beta, \lambda > 0$ and $\delta > 1$. The scores function is given by
\[
s(x;\beta,\lambda,\delta) =
\begin{pmatrix}
\frac{\partial}{\partial\beta}\log f(x;\beta,\lambda,\delta) \\
\frac{\partial}{\partial\lambda}\log f(x;\beta,\lambda,\delta) \\
\frac{\partial}{\partial\delta}\log f(x;\beta,\lambda,\delta)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\beta} - \log\left(1+\lambda x^{-\delta}\right) \\
\frac{1}{\lambda} - \frac{\beta+1}{x^{\delta}+\lambda} \\
\frac{1}{\delta} + \log(x)\left(\frac{\lambda(\beta+1)}{x^{\delta}+\lambda} - 1\right)
\end{pmatrix} \tag{A.70}
\]
We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\delta}\log f(x;\beta,\lambda,\delta) = -\infty \tag{A.71}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\beta,\lambda,\delta}) \right\| = \infty \tag{A.72}
\]

A.18 Dagum model type II

The density function is given by
\[
f(x;\alpha,\beta,\lambda,\delta) =
\begin{cases}
(1-\alpha)\,\beta\lambda\delta x^{-\delta-1}\left(1+\lambda x^{-\delta}\right)^{-\beta-1} & \text{if } x > 0 \\
\alpha & \text{if } x = 0
\end{cases} \tag{A.73}
\]
with $0 \le x < \infty$, $\beta, \lambda > 0$, $\delta > 1$ and $0 < \alpha < 1$. The scores function is given by (for $x > 0$)
\[
s(x;\alpha,\beta,\lambda,\delta) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\lambda,\delta) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\lambda,\delta) \\
\frac{\partial}{\partial\lambda}\log f(x;\alpha,\beta,\lambda,\delta) \\
\frac{\partial}{\partial\delta}\log f(x;\alpha,\beta,\lambda,\delta)
\end{pmatrix}
=
\begin{pmatrix}
-\frac{1}{1-\alpha} \\
\frac{1}{\beta} - \log\left(1+\lambda x^{-\delta}\right) \\
\frac{1}{\lambda} - \frac{\beta+1}{x^{\delta}+\lambda} \\
\frac{1}{\delta} + \log(x)\left(\frac{\lambda(\beta+1)}{x^{\delta}+\lambda} - 1\right)
\end{pmatrix} \tag{A.74}
\]
We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\delta}\log f(x;\alpha,\beta,\lambda,\delta) = -\infty \tag{A.75}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\lambda,\delta}) \right\| = \infty \tag{A.76}
\]

A.19 Dagum model type III

The density function is given by
\[
f(x;\beta,\lambda,\delta) = \beta\lambda\delta (x-x_0)^{-\delta-1}\left(1+\lambda (x-x_0)^{-\delta}\right)^{-\beta-1} \tag{A.77}
\]
with $0 \le x_0 < x < \infty$, $\beta, \lambda > 0$ and $\delta > 1$. The scores function is given by
\[
s(x;\beta,\lambda,\delta) =
\begin{pmatrix}
\frac{\partial}{\partial\beta}\log f(x;\beta,\lambda,\delta) \\
\frac{\partial}{\partial\lambda}\log f(x;\beta,\lambda,\delta) \\
\frac{\partial}{\partial\delta}\log f(x;\beta,\lambda,\delta)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\beta} - \log\left(1+\lambda(x-x_0)^{-\delta}\right) \\
\frac{1}{\lambda} - \frac{\beta+1}{(x-x_0)^{\delta}+\lambda} \\
\frac{1}{\delta} + \log(x-x_0)\left(\frac{\lambda(\beta+1)}{(x-x_0)^{\delta}+\lambda} - 1\right)
\end{pmatrix} \tag{A.78}
\]

We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\delta}\log f(x;\beta,\lambda,\delta) = -\infty \tag{A.79}
\]
Therefore,

\[
\sup_x \left\| IF(x; T_{MLE}, F_{\beta,\lambda,\delta}) \right\| = \infty \tag{A.80}
\]

A.20 Log-Gompertz distribution

The distribution function is given by
\[
F_{\lambda,\delta}(x) = \exp\left(-\lambda x^{-\delta}\right) \tag{A.81}
\]
with $0 \le x < \infty$, $\lambda > 0$ and $\delta > 1$. The density function is given by
\[
f(x;\lambda,\delta) = \lambda\delta x^{-\delta-1} \exp\left(-\lambda x^{-\delta}\right) \tag{A.82}
\]

The scores function is given by
\[
s(x;\lambda,\delta) =
\begin{pmatrix}
\frac{\partial}{\partial\lambda}\log f(x;\lambda,\delta) \\
\frac{\partial}{\partial\delta}\log f(x;\lambda,\delta)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\lambda} - x^{-\delta} \\
\frac{1}{\delta} - \log(x)\left(1 - \lambda x^{-\delta}\right)
\end{pmatrix} \tag{A.83}
\]

We have
\[
\lim_{x\to\infty} s(x;\lambda,\delta) = \begin{pmatrix} \frac{1}{\lambda} \\ -\infty \end{pmatrix} \tag{A.84}
\]

Therefore,

\[
\sup_x \left\| IF(x; T_{MLE}, F_{\lambda,\delta}) \right\| = \infty \tag{A.85}
\]

A.21 Majumder-Chakravarty distribution

The density function is given by

\[
f(x;\alpha,\beta,\gamma,\delta) = \frac{\beta\,\delta^{\alpha/\beta}\,\gamma^{\frac{\beta}{\delta}-\alpha}\,x^{\frac{\beta}{\delta}-\alpha-1}}{B\!\left(\frac{1}{\delta}-\frac{\alpha}{\beta},\,\frac{\alpha}{\beta}\right)}\left((\gamma x)^{\beta}+\delta\right)^{-1/\delta} \tag{A.86}
\]
with $0 < \alpha\delta < \beta$ and $x > 0$. The scores function is given by
\[
s(x;\alpha,\beta,\gamma,\delta) =
\begin{pmatrix}
\frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\gamma,\delta) \\
\frac{\partial}{\partial\beta}\log f(x;\alpha,\beta,\gamma,\delta) \\
\frac{\partial}{\partial\gamma}\log f(x;\alpha,\beta,\gamma,\delta) \\
\frac{\partial}{\partial\delta}\log f(x;\alpha,\beta,\gamma,\delta)
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{\beta}\log(\delta) - \log(\gamma) - \log(x) + \frac{1}{\beta}\tilde{B}_1\!\left(\tfrac{1}{\delta}-\tfrac{\alpha}{\beta},\tfrac{\alpha}{\beta}\right) - \frac{1}{\beta}\tilde{B}_2\!\left(\tfrac{1}{\delta}-\tfrac{\alpha}{\beta},\tfrac{\alpha}{\beta}\right) \\
\frac{1}{\beta} - \frac{\alpha}{\beta^2}\log(\delta) + \frac{1}{\delta}\log(\gamma x) - \frac{\alpha}{\beta^2}\tilde{B}_1\!\left(\tfrac{1}{\delta}-\tfrac{\alpha}{\beta},\tfrac{\alpha}{\beta}\right) + \frac{\alpha}{\beta^2}\tilde{B}_2\!\left(\tfrac{1}{\delta}-\tfrac{\alpha}{\beta},\tfrac{\alpha}{\beta}\right) - \frac{1}{\delta}\,\frac{(\gamma x)^{\beta}\log(\gamma x)}{(\gamma x)^{\beta}+\delta} \\
\frac{\beta-\delta\alpha}{\delta\gamma} - \frac{\beta}{\delta}\,\frac{\gamma^{\beta-1}x^{\beta}}{(\gamma x)^{\beta}+\delta} \\
\frac{\alpha}{\beta\delta} - \frac{\beta}{\delta^2}\log(\gamma x) + \frac{1}{\delta^2}\tilde{B}_1\!\left(\tfrac{1}{\delta}-\tfrac{\alpha}{\beta},\tfrac{\alpha}{\beta}\right) + \frac{1}{\delta^2}\log\!\left((\gamma x)^{\beta}+\delta\right) - \frac{1}{\delta}\,\frac{1}{(\gamma x)^{\beta}+\delta}
\end{pmatrix} \tag{A.87}
\]

We have
\[
\lim_{x\to\infty} \frac{\partial}{\partial\alpha}\log f(x;\alpha,\beta,\gamma,\delta) = -\infty \tag{A.88}
\]
Therefore,
\[
\sup_x \left\| IF(x; T_{MLE}, F_{\alpha,\beta,\gamma,\delta}) \right\| = \infty \tag{A.89}
\]

Appendix B

Functional Forms for the Lorenz Curve

B.1 Model of Kakwani and Podder

Their model is given by

\[
L_{\alpha,\beta}(p) = p^{\alpha} e^{-\beta(1-p)} \tag{B.1}
\]
where $0 \le p \le 1$.

B.2 Model of Rasche et al.

The proposed functional form for the Lorenz curve is given by

\[
L_{\alpha,\beta}(p) = \left[1 - (1-p)^{\alpha}\right]^{1/\beta} \tag{B.2}
\]
where $0 \le p \le 1$, $\alpha > 0$ and $\beta \le 1$. It includes the Lorenz curve specification corresponding to the Pareto distribution as a special case when $\beta = 1$ and $\alpha < 1$.
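Any parametric Lorenz curve such as (B.2) yields the Gini coefficient through $G = 1 - 2\int_0^1 L(p)\,dp$. A small Python sketch of ours (not part of the thesis), using a plain midpoint rule for the integral:

```python
# Illustration (not from the thesis): the Gini coefficient from a
# Lorenz curve, G = 1 - 2 * integral of L over [0, 1], evaluated
# by a midpoint rule on the Rasche et al. form (B.2).

def rasche(p, alpha, beta):
    """Rasche et al. Lorenz curve (B.2)."""
    return (1.0 - (1.0 - p) ** alpha) ** (1.0 / beta)

def gini_from_lorenz(lorenz, n=20000):
    area = sum(lorenz((i - 0.5) / n) for i in range(1, n + 1)) / n
    return 1.0 - 2.0 * area

g_equal = gini_from_lorenz(lambda p: rasche(p, 1.0, 1.0))   # L(p) = p
g_pareto = gini_from_lorenz(lambda p: rasche(p, 0.5, 1.0))  # more unequal

print(g_equal, g_pareto)
```

With $\alpha = \beta = 1$ the curve is the diagonal and the Gini coefficient is zero; with $\alpha = 0.5$, $\beta = 1$ the integral is $1/3$, so $G = 1/3$.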

B.3 Model of Gupta

The model is given by
\[
L_\alpha(p) = p\,\alpha^{p-1} \tag{B.3}
\]
where $0 \le p \le 1$ and $\alpha > 1$. Gupta proposed this model because it is easily estimable by the linear least squares method (using its log-linear form).
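The log-linear estimation mentioned above can be sketched as follows (our illustration, not from the thesis): taking logs of (B.3) gives $\log L(p) - \log p = (p-1)\log\alpha$, so $\log\alpha$ is the slope of a least-squares line through the origin.

```python
import math

# Illustration (not from the thesis): Gupta's model (B.3) is log-linear,
#   log L(p) - log p = (p - 1) * log(alpha),
# so log(alpha) is the slope of a least-squares fit through the origin.

def fit_gupta(ps, ls):
    """Least-squares estimate of alpha from Lorenz ordinates (p, L)."""
    z = [p - 1.0 for p in ps]
    y = [math.log(l) - math.log(p) for p, l in zip(ps, ls)]
    slope = sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)
    return math.exp(slope)

alpha_true = 2.0
ps = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
ls = [p * alpha_true ** (p - 1.0) for p in ps]  # exact ordinates

alpha_hat = fit_gupta(ps, ls)
print(alpha_hat)  # recovers 2.0 on noiseless data
```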


B.4 Model of Villasenor and Arnold

They proposed the following model
\[
L_{\alpha,\beta,\gamma,\delta}(p) = \frac{1}{2}\left[-(\gamma p + \delta) + \left(\alpha p^2 + \beta p + \delta^2\right)^{1/2}\right] \tag{B.4}
\]

For the conditions on the parameters, see Villaseñor and Arnold (1989).

B.5 Model of Basmann et al.

The model is given by

\[
L_{\alpha,\beta,\gamma,\lambda}(p) = p^{\alpha p+\beta} e^{-\gamma(1-p^2)-\lambda(1-p)+\varepsilon} \tag{B.5}
\]
where $0 \le p \le 1$ and $\varepsilon \sim N(0,\sigma^2)$. This model also includes the model proposed by Kakwani and Podder (1973).

Appendix C

Income Inequality Measures

For some inequality measures, we give the properties they fulfil (see section 2.4). We also give the range with continuous distributions ($R$), and with discrete distributions ($R_n$, where $n$ is the number of observations) (see also Cowell 1993).

C.1 Coefficient of variation

\[
I_{CV} = \frac{1}{\mu}\left[\int (x-\mu)^2\, dF(x)\right]^{1/2} \tag{C.1}
\]
It has the properties (1), (2), (3), (4) and (5). $R = (0,\infty)$, $R_n = (0, \sqrt{n-1})$.
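The discrete upper bound $\sqrt{n-1}$ is approached when a single person holds all the income, which a few lines of Python make concrete (our illustration, not part of the thesis):

```python
import math

# Illustration (not from the thesis): the discrete range (0, sqrt(n-1))
# of the coefficient of variation; the upper bound is attained when a
# single person holds all the income.

def cv(xs):
    """Sample coefficient of variation (C.1)."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n) / mu

n = 50
extreme = [0.0] * (n - 1) + [n * 1.0]  # one person holds everything
print(cv(extreme), math.sqrt(n - 1))   # the two values coincide
```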

C.2 Relative mean deviation

Pietra (1915) proposed the following inequality measure
\[
I_{MN} = \frac{1}{2}\int \left|\frac{x}{\mu} - 1\right| dF(x) \tag{C.2}
\]
It has the properties (2), (3), (4), (5) and (6). $R = (0, 1)$, $R_n = \left(0, 1 - \frac{1}{n}\right)$.

C.3 Relative median deviation

\[
I_{MD} = \frac{1}{2}\int \left|\frac{x}{\bar{m}} - 1\right| dF(x) \tag{C.3}
\]
It has the properties (2), (3), (4) and (5).


C.4 Variance of the logarithm of income

\[
I_{VL} = \int \log^2\left(\frac{x}{\tilde{x}}\right) dF(x) \tag{C.4}
\]

It has the properties (2), (3), (4) and (5). $R = R_n = (0,\infty)$.

C.5 Bonferroni inequality measure

Bonferroni (1930) proposed the following inequality measure
\[
I_{BF} = \frac{1}{\mu}\int \left(\mu - m(x)\right) dF(x) \tag{C.5}
\]

C.6 Hirschman’s index

Hirschman (1945) proposed the following inequality measure
\[
I_{HM} = \left[\int \left(\frac{x}{\mu}\right)^2 dF(x)\right]^{1/2} - 1 \tag{C.6}
\]

It has the properties (1), (2), (3), (4) and (5).

C.7 Theil indexes

Theil (1967) proposed the following inequality measures
\[
I_E^{-1} = -\int \log\left(\frac{x}{\mu}\right) dF(x) \tag{C.7}
\]
\[
I_E^{0} = \int \frac{x}{\mu}\log\left(\frac{x}{\mu}\right) dF(x) \tag{C.8}
\]

They have the properties (1), (2), (3), (4) and (5). For $I_E^0$, $R = (0,\infty)$, $R_n = (0, \log n)$.
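Both Theil measures are simple to compute on a sample; the following Python sketch (ours, not part of the thesis) evaluates (C.7) and (C.8) and shows that both vanish under perfect equality:

```python
import math

# Illustration (not from the thesis): the two Theil measures (C.7)-(C.8)
# computed for a sample; both vanish when all incomes are equal.

def theil_indexes(xs):
    """Return (I_E^{-1}, I_E^0): mean log deviation and Theil index."""
    n = len(xs)
    mu = sum(xs) / n
    mld = -sum(math.log(x / mu) for x in xs) / n
    theil = sum((x / mu) * math.log(x / mu) for x in xs) / n
    return mld, theil

mld_eq, theil_eq = theil_indexes([3.0] * 8)        # perfect equality
mld_uneq, theil_uneq = theil_indexes([1.0, 2.0, 4.0, 8.0])

print(mld_eq, theil_eq, mld_uneq, theil_uneq)
```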

C.8 Eltet¨o and Frigyes’s inequality measures

Eltető and Frigyes (1968) proposed the following inequality measures
\[
I_{EF1} = 1 - \frac{\int_0^{\mu} x\, dF(x)}{\mu} \tag{C.9}
\]
\[
I_{EF2} = 1 - \frac{\int_0^{\mu} x\, dF(x)}{\int_{\mu}^{\infty} x\, dF(x)} \tag{C.10}
\]
\[
I_{EF3} = 1 - \frac{\mu}{\int_{\mu}^{\infty} x\, dF(x)} \tag{C.11}
\]

They have the properties (1), (5) and (6).

C.9 Kakwani inequality measure

Kakwani (1980) proposed the following inequality measure
\[
I_{KW} = \frac{1}{2-\sqrt{2}}\left[\frac{1}{\mu}\int_0^{\infty} \left(\mu^2 + x^2\right)^{1/2} dF(x) - \sqrt{2}\right] \tag{C.12}
\]
It has the properties (1), (5) and (6).

C.10 Basmann-Slottje inequality measure (WGM)

Basmann and Slottje (1987) proposed the following inequality measure

\[
I_{WGM} = 1 - J \prod_{j=1}^{J} k_j^{a_j} \tag{C.13}
\]
where the $a_j$'s ($\sum_j a_j = 1$) are weights given to the income quantiles. Two subclasses are deduced: the class I WGM inequality measures is such that
\[
a_j = 1 + b\left(k_j - \frac{1}{J}\right) \tag{C.14}
\]
and the class II WGM inequality measures is such that

\[
a_j = c_j + b_J\left(k_{J-j+1} - k_j\right) \tag{C.15}
\]

C.11 Dalton’s inequality measure

Dalton (1920) proposed the following inequality measure
\[
I_{DT} = 1 - \frac{\int U(x)\, dF(x)}{U(\mu)} \tag{C.16}
\]
where $U(x)$ is a utility function. It has the properties (1), (3), (4) and (5).

C.12 Atkinson’s inequality measure

Atkinson (1970) proposed the following inequality measure

\[
I_A = 1 - \frac{1}{\mu}\left[\int x^{1-\epsilon}\, dF(x)\right]^{\frac{1}{1-\epsilon}} \tag{C.17}
\]
where the parameter $\epsilon > 0$ measures the degree of inequality aversion. It has the properties (1), (2), (3), (4), (5) and (6). The particular case when $\epsilon = 1$ is given by
\[
1 - \frac{1}{\mu}\exp\left(\int \log(x)\, dF(x)\right) \tag{C.18}
\]
$R = (0,1)$.
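A sample version of the index, with the $\epsilon = 1$ limit handled separately, can be sketched as follows (our illustration, not part of the thesis):

```python
import math

# Illustration (not from the thesis): Atkinson's index (C.17) for a
# sample, with the epsilon = 1 limit (C.18) handled separately.

def atkinson(xs, eps):
    """Atkinson inequality index with aversion parameter eps > 0."""
    n = len(xs)
    mu = sum(xs) / n
    if eps == 1.0:
        return 1.0 - math.exp(sum(math.log(x) for x in xs) / n) / mu
    mean_pow = sum(x ** (1.0 - eps) for x in xs) / n
    return 1.0 - mean_pow ** (1.0 / (1.0 - eps)) / mu

sample = [1.0, 2.0, 4.0, 8.0]
i_half = atkinson(sample, 0.5)
i_one = atkinson(sample, 1.0)
i_near_one = atkinson(sample, 1.0 + 1e-9)

print(i_half, i_one, i_near_one)  # indices in (0, 1); the last two agree
```

The index lies in $(0,1)$ and grows with the aversion parameter, and the general formula approaches the $\epsilon = 1$ case continuously.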

C.13 Kolm’s inequality measure

Kolm (1976) proposed the following inequality measure
\[
I_K = \frac{1}{\epsilon}\log\left(\int e^{\epsilon(\mu-x)}\, dF(x)\right) \tag{C.19}
\]
where the parameter $\epsilon > 0$ measures the degree of inequality aversion. It has the properties (1), (4) and (5).

C.14 Generalized entropy family

This family of inequality measures is given by
\[
I_E^{\beta} = \frac{1}{\beta(\beta+1)}\int \left[\left(\frac{x}{\mu}\right)^{\beta+1} - 1\right] dF(x) \tag{C.20}
\]
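The family nests the earlier measures, which a quick numerical check makes visible (our illustration, not part of the thesis): at $\beta = 1$ the measure equals half the squared coefficient of variation, and as $\beta \to 0$ it approaches the Theil index $I_E^0$ of (C.8).

```python
import math

# Illustration (not from the thesis): the generalized entropy family
# (C.20) for a sample. At beta = 1 it reduces to half the squared
# coefficient of variation, and as beta -> 0 it approaches the Theil
# index I_E^0 of (C.8).

def gen_entropy(xs, beta):
    """Generalized entropy measure (C.20), beta not in {0, -1}."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x / mu) ** (beta + 1.0) - 1.0 for x in xs) / (n * beta * (beta + 1.0))

xs = [1.0, 2.0, 4.0, 8.0]
mu = sum(xs) / len(xs)
half_cv2 = sum((x / mu) ** 2 - 1.0 for x in xs) / (2 * len(xs))
theil = sum((x / mu) * math.log(x / mu) for x in xs) / len(xs)

print(gen_entropy(xs, 1.0), half_cv2)  # identical
print(gen_entropy(xs, 1e-7), theil)    # nearly identical
```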

$R = R_n = (0,\infty)$.

Appendix D

Equations System for Robust Tests

We have to find $A_{(21)}$ ($1\times p$), $A_{(22)}$ ($1\times 1$) and $a_{(2)}$ ($1\times 1$) solutions of the following implicit equations:

\[
\int \psi(x;\theta)_{(2)}\, f(x;\hat\alpha)\, dx = 0 \tag{D.1}
\]
\[
\int \psi(x;\theta)_{(2)}\, \psi(x;\theta)_{(1)}^{T}\, f(x;\hat\alpha)\, dx = 0 \tag{D.2}
\]
\[
\int \psi(x;\theta)_{(2)}\, \psi(x;\theta)_{(2)}^{T}\, f(x;\hat\alpha)\, dx = 1 \tag{D.3}
\]

where $\theta = (\hat\alpha, 1)^{T}$ and $A_{(12)} = a_{(1)} = 0$. We rewrite this system in a more computational form. Let
\[
B_1 = \int s^{0}(x;\hat\alpha)\, W_c(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.5}
\]

\[
B_2 = \int s_{Cox}(x;\hat\alpha,\tilde\beta)\, W_c(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.6}
\]

\[
B_3 = \int W_c(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.7}
\]
\[
B_4 = \int s^{0}(x;\hat\alpha)\, s^{0}(x;\hat\alpha)^{T}\, W_c(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.8}
\]
\[
B_5 = \int s_{Cox}(x;\hat\alpha,\tilde\beta)\, s^{0}(x;\hat\alpha)^{T}\, W_c(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.9}
\]
\[
B_6 = \int s^{0}(x;\hat\alpha)\, s^{0}(x;\hat\alpha)^{T}\, W_c^{2}(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.10}
\]

\[
B_7 = \int s^{0}(x;\hat\alpha)\, s_{Cox}(x;\hat\alpha,\tilde\beta)\, W_c^{2}(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.11}
\]
\[
B_8 = \int s^{0}(x;\hat\alpha)\, W_c^{2}(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.12}
\]
\[
B_9 = \int s_{Cox}^{2}(x;\hat\alpha,\tilde\beta)\, W_c^{2}(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.13}
\]
\[
B_{10} = \int s_{Cox}(x;\hat\alpha,\tilde\beta)\, W_c^{2}(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.14}
\]
\[
B_{11} = \int W_c^{2}(x;\hat\alpha,\tilde\beta)\, f(x;\hat\alpha)\, dx \tag{D.15}
\]
and solve for $A_{(21)}$, $A_{(22)}$ and $a_{(2)}$ the following equations system:
\[
A_{(21)} B_1 + A_{(22)} B_2 - a_{(2)} B_3 = 0 \tag{D.16}
\]
\[
A_{(21)} B_4 + A_{(22)} B_5 - a_{(2)} B_1^{T} = 0 \tag{D.17}
\]
\[
A_{(21)} B_6 A_{(21)}^{T} + 2 A_{(22)} A_{(21)} B_7 - 2 a_{(2)} A_{(21)} B_8 + A_{(22)}^{2} B_9 - 2 a_{(2)} A_{(22)} B_{10} + a_{(2)}^{2} B_{11} = 1 \tag{D.18}
\]

References

Aguirre-Torres, V. and A. R. Gallant, 1983. The Null and Non-Null Asymptotic Distribution of the Cox Test for Multivariate Nonlinear Regression: Alternatives and a New Distribution-Free Cox Test. Journal of Econometrics 21, p. 5-33.

Aigner, D. J. and A. J. Heinz, 1967. On the Determinants of Income Inequality. American Economic Review 57, p. 175-84.

Aitchison, J. and J. C. Brown, 1954. On Criteria for Description of Income Distribution. Metroeconomica 6, p. 88-107.

Aitchison, J. and J. C. Brown, 1957. The Log-Normal Distribution. Cambridge: Cambridge University Press.

Akaike, H., 1973. Information Theory and an Extension of the Maximum Likelihood Principle. In Proceedings of the Second International Symposium on Information Theory, B. N. Petrov and F. Csaki (eds), p. 267-81. Budapest: Akademiai Kiado.

Akaike, H., 1974. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, AC-19, p. 716-23.

Amoroso, L., 1925. Ricerche Intorno alla Curva dei Redditi. Annali di Matematica Pura ed Applicata 4-21, p. 123-57.

Anderson, T. W. and D. A. Darling, 1954. A Test of Goodness-of-Fit. Journal of the American Statistical Association 49, p. 765-9.

Apostol, T., 1975. Mathematical Analysis. London: Addison-Wesley.

Atkinson, A. B., 1970. On the Measurement of Inequality. Journal of Economic Theory 2, p. 244-63.


Atkinson, A. C., 1970. A Method for Discriminating Between Models. Journal of the Royal Statistical Society B 32, p. 323-53.

Atkinson, A. C., 1988. Monte-Carlo Tests of Separate Families of Hypotheses. Unpublished Paper, Imperial College, London.

Barnett, V., 1992. Unusual Outliers. In Data Analysis and Statistical Inference, Festschrift in Honour of Prof. Dr. Friedhelm Eicker, S. Schach and G. Trenkler (eds). Bergisch Gladbach, Germany: Joseph Eul Verlag GmbH.

Bartels, C. P. A., 1977. Economic Aspects of Regional Welfare. Leiden: Martinus Nijhoff Social Sciences Division.

Basmann, R. L., K. J. Hayes, J. D. Johnson and D. J. Slottje, 1990. A General Functional Form for Approximating the Lorenz Curve. Journal of Econometrics 43, p. 77-90.

Basmann, R. L. and D. J. Slottje, 1987. A New Index of Income Inequality: The B Measure. Economics Letters 24, p. 385-9.

Becker, G. S., 1962. Investment in Human Capital: A Theoretical Analysis. Journal of Political Economy 70, p. 9-49.

Becker, G. S., 1967. Human Capital and the Personal Distribution of Income. Ann Arbor: University of Michigan Press.

Bednarski, T., 1982. Binary Experiments, Minimax Tests and 2-Alternative Capacities. The Annals of Statistics 10, p. 226-32.

Benini, R., 1906. Principii di Statistica Metodologica. Torino: UTET.

Beran, R., 1977. Minimum Hellinger Distance Estimates for Parametric Models. The Annals of Statistics 5, p. 445-63.

Bickel, P. J. and M. Rosenblatt, 1973. On Some Global Measures of the Deviations of Density Function Estimates. The Annals of Statistics 1, p. 1071-95. [Correction 3 (1975), p. 1370].

Birch, M. W., 1964. A New Proof of the Pearson-Fisher Theorem. Annals of Mathematical Statistics 35, p. 817-24.

Bishop, Y. M. M., S. E. Fienberg and P. W. Holland, 1975. Discrete Multivariate Analysis: Theory and Practice. Cambridge, Massachusetts: MIT Press.

Blackorby, C. and D. Donaldson, 1978. Measures of Relative Equality and their Meaning in Terms of Social Welfare. Journal of Economic Theory 18.

Blackorby, C. and D. Donaldson, 1980. A Theoretical Treatment of Indices of Absolute Inequality. International Economic Review 21.

Bonferroni, C. E., 1930. Elementi di Statistica Generale. Firenze: Libreria Seber.

Bossert, W. and A. Pfingsten, 1989. Intermediate Inequality: Concepts, Indices and Welfare Implications. Mathematical Social Sciences 19, p. 117-34.

Bowles, S., 1969. Planning Educational Systems for Economic Growth. Harvard Economic Studies 133. Cambridge, Massachusetts: Harvard University Press.

Box, G. E. P., 1953. Non-Normality and Tests of Variances. Biometrika 40, p. 318-35.

Breusch, T. S. and A. R. Pagan, 1980. The Lagrange Multiplier Test and Its Application to Model Specification in Econometrics. Review of Economic Studies 47, p. 239-53.

Budd, E. C., 1970. Distribution Issues: Trends and Policies. Postwar Changes in the Size Distribution of Income Using Transformation Functions. Economie Appliquée 30, p. 369-90.

Chakravarty, S. R., 1983. Ethically Flexible Measures of Poverty. Canadian Journal of Economics 16.

Chakravarty, S. R., 1990. Ethical Social Index Numbers. Berlin: Springer-Verlag.

Champernowne, D. G., 1937. The Distribution of Income Between Persons. New York: Cambridge University Press.

Champernowne, D. G., 1953. A Model of Income Distribution. Economic Journal 63, p. 318-51.

Champernowne, D. G., 1974. A Comparison of Measures of Inequality of Income Distribution. Economic Journal 84, p. 787-816.

Cheng, R. C. H. and M. A. Stephens, 1989. A Goodness-of-Fit Test Using Moran’s Statistic with Estimated Parameters. Biometrika 76, p. 385-92.

Chiswick, B. R., 1968. The Average Level of Schooling and the Intraregional Inequality of Income: A Clarification. American Economic Review 58, p. 495-501.

Chiswick, B. R., 1971. Earning Inequality and Economic Development. Quarterly Journal of Economics 85, p. 21-32.

Chiswick, B. R., 1974. Income Inequality: Regional Analysis with a Human-Capital Framework. New York: National Bureau of Economic Research.

Coale, A. J. and F. F. Stephan, 1962. The Case of the Indians and the Teen-Age Widows. Journal of the American Statistical Association 57, p. 338-47.

Cochran, W. G., 1947. Some Consequences when the Assumptions for the Analysis of Variance are not Satisfied. Biometrics 3, p. 22-38.

Cowell, F. A., 1977, 1993 (second edition). Measuring Inequality. Oxford: Philip Allan.

Cowell, F. A., 1980. Generalized Entropy and the Measurement of Distributional Change. European Economic Review 13, p. 147-59.

Cowell, F. A. and M.-P. Victoria-Feser, 1993. Robustness Properties of Inequality Measures. Manuscript. London School of Economics, UK.

Cox, D. R., 1961. Tests of Separate Families of Hypotheses. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1, p. 105-23. Berkeley: University of California Press.

Cox, D. R., 1962. Further Results on Tests of Separate Families of Hypotheses. Journal of the Royal Statistical Society B 24, p. 406-24. 199

Cox, D. R. and L. Brandwood, 1959. On a Discriminatory Problem Connected With the Works of Plato. Journal of the Royal Statistical Society B 21, p. 195-200.

Cressie, N. A. C. and T. R. C. Read, 1984. Multinomial Goodness-of-Fit Tests. Journal of the Royal Statistical Society B 46, p. 440-64.

D’Addario, R., 1949. Ricerche sulla Curva dei Redditi. Giornale degli Economisti ed Annali di Economia 8, p. 91-114.

Dagum, C., 1977. A New Model of Personal Income Distribution: Specification and Estimation. Economie Appliquée 30, p. 413-36.

Dagum, C., 1980a. The Generation and Distribution of Income, the Lorenz Curve and the Gini Ratio. Economie Appliquée 33, p. 327-67.

Dagum, C., 1980b. Generating Systems and Properties of Income Distribution Models. Metron 38 (3-4), p. 3-26.

Dagum, C., 1980c. Sistemas Generadores de Distribución del Ingreso y la Ley de Pareto. El Trimestre Economico 188, p. 877-917.

Dagum, C., 1985. Analyses of Income Distribution and Inequality by Education and Sex in Canada. Advances in Econometrics 4, p. 167-227.

Dagum, C., 1990. On the Relationship Between Income Inequality Measures and Social Welfare Functions. Journal of Econometrics 43, p. 91-102.

Dalton, H., 1920. The Measurement of the Inequality of Incomes. Economic Journal 30, p. 348-61.

Dancelli, L., 1989. On the Behaviour of the Zp Concentration Curve. In Income and Wealth Distribution, Inequality and Poverty, C. Dagum and M. Zenga (eds), p. 111-27. New York: Springer-Verlag.

Daniel, C., 1976. Applications of Statistics to Industrial Experimentation. New York: Wiley.

Daniel, C. and F. S. Wood, 1980. Fitting Equations to Data. New York: Wiley.

Dasgupta, P., A. K. Sen and D. Starret, 1973. Notes on the Measurement of Inequality. Journal of Economic Theory 6, p. 180-7. 200 APPENDIX D. EQUATIONS SYSTEM FOR ROBUST TESTS

Dastoor, N. K., 1985. A Classical Approach to Cox’s Test for Non-Nested Hypotheses. Journal of Econometrics 27, p. 363-70.

Davidson, R. and J. G. MacKinnon, 1981. Several Tests for Model Specification in Presence of Alternative Hypotheses. Econometrica 49, p. 781-93.

Davis, H. T., 1941. The Analysis of Economic Time Series. Bloomington, Indiana: The Principia Press.

De Jongh, P. J., T. De Wet and A. H. Welsh, 1987. Mallows-Type Bounded-Influence Trimmed Means. Journal of the American Statistical Association 84, p. 805-10.

Dempster, A. P., M. N. Laird and D. B. Rubin, 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B 39, p. 1-22.

Donoho, D. L. and P. J. Huber, 1983. The Notion of Breakdown Point. In A Festschrift for Erich L. Lehmann, P. J. Bickel, K. A. Doksum and J. L. Hodges (eds), p. 157-184. Belmont, California: Wadsworth.

Donoho, D. L. and R. C. Liu, 1988. The “Automatic” Robustness of Minimum Distance Functionals. The Annals of Statistics 16, p. 552-86.

Dorfman, R., 1979. A Formula of the Gini Coefficient. Review of Economics and Statistics 61, p. 146-9.

Dougherty, C. R. S., 1971. Estimates of Labour Aggregates Functions. Harvard Center for International Affairs, Economic Development Report 190. Cambridge: Development Research Group.

Dougherty, C. R. S., 1972. Substitution and the Structure of Labour Force. Economic Journal 82, p. 170-82.

Drost, F. C., W. C. M. Kallenberg and J. Oosterhoff, 1990. The Power of EDF Tests of Fit under Non-Robust Estimation of Nuisance Parameters. Statistics and Decisions 8, p. 167-82.

Ebert, U., 1987. Size and Distribution of Incomes as Determinants of Social Welfare. Journal of Economic Theory 41, p. 23-33. 201

Edgeworth, F. Y., 1898. On the Representation of Statistics by Mathematical Formulae. Journal of the Royal Statistical Society 1, p. 670-700.

Efron, B., 1979. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, p. 1-26.

Efron, B., 1984. Comparing Non-Nested Linear Models. Journal of the American Statistical Association 79, p. 791-803.

Elderton, W. P., 1938. Frequency Curves and Correlation. Cambridge: Cambridge University Press.

Elderton, W. P. and N. L. Johnson, 1969. Systems of Frequency Curves. Cambridge: Cambridge University Press.

Eltető, O. and E. Frigyes, 1968. New Inequality Measures as Efficient Tools for Causal Analysis and Planning. Econometrica 36, p. 383-96.

Fisher, G. R. and M. McAleer, 1981. Alternative Procedures and Associated Tests of Significance for Non-Nested Hypotheses. Journal of Econometrics 16, p. 103-19.

Fisher, R. A., 1924. The Conditions Under which χ2 Measures the Discrepancy Between Observation and Hypothesis. Journal of the Royal Statistical Society B 87, p. 442-50.

Fisk, P. R., 1961. The Graduation of Income Distribution. Econo- metrica 29, p. 171-85.

Foster, J. E., 1985. Inequality Measurement. In Fair Allocation, H. P. Young (ed). American Mathematical Society, Providence.

Fréchet, M., 1939. Sur les Formules de Répartition des Revenus. Revue de l'Institut International de Statistique 7(1), p. 32-8.

Freedman, H. W., 1966. The “Little Variable Factor”. A Statistical Discussion of the Reading of Seismograms. Bulletin of the American Seismological Society 56, p. 593-604.

Freeman, M. F. and J. W. Tukey, 1950. Transformations Related to the Angular and the Square Root. Annals of Mathematical Statistics 21, p. 607-11. 202 APPENDIX D. EQUATIONS SYSTEM FOR ROBUST TESTS

Fuchs, C. and R. Kennet, 1980. A Test for Outlying Cells in the Multinomial Distribution and Two-Way Contingency Tables. Journal of the American Statistical Association 75, p. 395-8.

Gail, M. H. and J. L. Gastwirth, 1978a. A Scale-Free Goodness-of-Fit Test for Exponential Distribution Based on the Gini Statistic. Journal of the Royal Statistical Society B 40, p. 350-7.

Gail, M. H. and J. L. Gastwirth, 1978b. A Scale-Free Goodness- of-Fit Test for Exponential Distribution Based on the Lorenz Curve. Journal of the American Statistical Association 73, p. 787-93.

Gail, M. H. and J. Ware, 1978. On the Robustness to Measurement Error of Tests for Exponentiality. Biometrika 65, p. 305-9.

Gastwirth, J. L., 1971. A General Definition of the Lorenz Curve. Econometrica 39, p. 1037-9.

Gastwirth, J. L., 1972. The Estimation of the Lorenz Curve and Gini Index. Review of Economics and Statistics 54, p. 306-16.

Ghosh, B. K. and Wei-Min Huang, 1991. The Power and Optimal Kernel of the Bickel-Rosenblatt Test for Goodness-of-Fit. The Annals of Statistics 19, p. 999-1009.

Gibrat, R., 1931. Les Inégalités Economiques. Paris: Sirey.

Gini, C., 1910. Indici di Concentrazione e di Dipendenza. Atti della III Riunione della Società Italiana per il Progresso delle Scienze, in Gini 1955, p. 3-120.

Gini, C., 1912. Variabilità e Mutabilità. Bologna.

Gini, C., 1955. Memorie di Metodologia Statistica. Vol. 1: Variabilità e Concentrazione. Roma: Pizetti and Salvemini, Libreria Eredi Virgilio Veschi.

Godfrey, L. G. and M. H. Pesaran, 1983. Tests of Non-Nested Regression Models: Small Sample Adjustments and Monte Carlo Evidence. Journal of Econometrics 21, p. 133-54.

Gourieroux, C., A. Monfort and A. Trognon, 1983. Testing Nested or Non-Nested Hypotheses. Journal of Econometrics 21, p. 83-115. 203

Gupta, M. R., 1984. Functional Forms for Estimating the Lorenz Curve. Econometrica 52, p. 1313-4.

Hall, A., 1985. A Simplified Method of Calculating the Distribution Free Cox Test. Economics Letters 18, p. 149-51.

Hampel, F. R., 1968. Contribution to the Theory of Robust Estimation. Ph.D Thesis. University of California, Berkeley.

Hampel, F. R., 1971. A General Qualitative Definition of Robustness. Annals of Mathematical Statistics 42, p. 1887-96.

Hampel, F. R., 1974. The Influence Curve and its Role in Robust Estimation. Journal of the American Statistical Association 69, p. 383-93.

Hampel, F. R., 1978. Optimally Bounding the Gross-Error-Sensitivity and the Influence Position in Factor Space. Proceedings of the Statistical Computing Section, American Statistical Association, p. 59-64.

Hampel, F. R., E. Ronchetti, P. J. Rousseeuw and W. A. Stahel, 1986. Robust Statistics: the Approach Based on Influence Functions. New York: Wiley.

Heritier, S. and E. Ronchetti, 1992. Robust Bounded-Influence Tests in General Parametric Models. Working paper, University of Geneva, Switzerland.

Hirschberg, J. G., D. J. Molina and D. J. Slottje, 1989. A Selection Criterion for Choosing Between Functional Forms of Income Distribution. Econometric Review 7(2), p. 183-97.

Hirschman, A. O., 1945. National Power and the Structure of Foreign Trade. Berkeley: University of California Press.

Huber, P. J., 1964. Robust Estimation of a Location Parameter. Annals of Mathematical Statistics 35, p. 73-101.

Huber, P. J., 1965. A Robust Version of the Probability Ratio Test. Annals of Mathematical Statistics 36, p. 1753-8.

Huber, P. J., 1974. Early Cuneiform Evidence for the Planet Venus. AAAS Annual Meeting, San Francisco, California.

Huber, P. J., 1981. Robust Statistics. New York: Wiley.

Huber-Carol, C., 1970. Etude Asymptotique de Tests Robustes. Ph.D Thesis, ETH Zürich, Switzerland.

Jenkins, S. P., 1992. Accounting for Inequality Trends: Decomposition Analyses for the UK, 1971-1986. Mimeo, University College of Swansea, UK.

Johnson, N. L. and S. Kotz, 1969. Discrete Distributions. Boston: Houghton Mifflin.

Kakwani, N. C., 1980. Inequality and Poverty: Methods of Estimation and Policy Applications. A World Bank Research Publication. Oxford University Press.

Kakwani, N. C. and N. Podder, 1973. On the Estimation of the Lorenz Curve from Grouped Observations. International Economic Review 14, p. 278-92.

Kakwani, N. C. and N. Podder, 1976. Efficient Estimation of the Lorenz Curve and Associated Inequality Measures from Grouped Observations. Econometrica 44, p. 137-48.

Kallenberg, W. C. M., J. Oosterhoff and B. F. Schriever, 1985. The Number of Classes in Chi-Squared Goodness-of-Fit Tests. Journal of the American Statistical Association 80, p. 959-68.

Kent, J. T., 1986. The Underlying Structure of Non-Nested Hypothesis Tests. Biometrika 73, p. 33-43.

Kloek, T. and H. K. Van Dijk, 1977. Further Results on Efficient Estimation of Income Distribution Parameters. Economie Appliquée 30(3), p. 439-59.

Kloek, T. and H. K. Van Dijk, 1978. Efficient Estimation of Income Distribution Parameters. Journal of Econometrics 8, p. 61-74.

Kolm, S. C., 1976. Unequal Inequalities I and II. Journal of Economic Theory 12, p. 416-42 and 13, p. 82-111.

Kolmogorov, A. N., 1933. Sulla Determinazione Empirica di una Legge di Distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, p. 83-91.

Krasker, W. S., 1980. Estimation in Linear Regression Models with Disparate Data Points. Econometrica 48, p. 1333-46.

Krasker, W. S. and R. E. Welsch, 1982. Efficient Bounded-Influence Regression Estimation. Journal of the American Statistical Association 77, p. 595-604.

Krishnakumar, J. and E. Ronchetti, 1993. Optimal Robust Estimators for Simultaneous Equations Models. Manuscript. University of Geneva, Switzerland.

Kuiper, N. H., 1960. Tests Concerning Random Points on a Circle. Proceedings Koninkl. Neder. Akad. van Wetenschappen A 63, p. 38-47.

Kullback, S., 1959. Information Theory and Statistics. New York: Wiley.

Kullback, S. and R. A. Leibler, 1951. On Information and Sufficiency. Annals of Mathematical Statistics 22, p. 79-86.

Latham, R., 1988. Lorenz-Dominating Income Tax Functions. International Economic Review 29, p. 185-98.

Lawless, J. F., 1982. Statistical Models and Methods for Lifetime Data. New York: Wiley.

Lerman, R. and S. Yitzhaki, 1984. A Note on the Calculation and Interpretation of the Gini Index. Economics Letters 15, p. 363-8.

Loh, W.-Y., 1985. A New Method for Testing Separate Families of Hypotheses. Journal of the American Statistical Association 80, p. 362-8.

Lorenz, M. O., 1905. Methods for Measuring Concentration of Wealth. Journal of the American Statistical Association 9, p. 209-19.

Lydall, H., 1968. The Structure of Earnings. Oxford: Clarendon Press.

Maasoumi, E., 1986. The Measurement and Decomposition of Multi-Dimensional Inequality. Econometrica 54, p. 991-7.

MacKinnon, J. G., H. White and R. Davidson, 1983. Tests for Model Specification in the Presence of Alternative Hypotheses: Some Further Results. Journal of Econometrics 21, p. 53-70.

Majumder, A. and S. R. Chakravarty, 1990. Distribution of Personal Income: Development of a New Model and its Application to U.S. Income Data. Journal of Applied Econometrics 5, p. 189-96.

Mallows, C. L., 1975. On Some Topics in Robustness. Technical Memorandum, Bell Telephone Laboratories, Murray Hill, NJ.

Mandelbrot, B., 1960. The Pareto-Lévy Law and the Distribution of Income. International Economic Review 1(2), p. 79-106.

Mandelbrot, B., 1961. Stable Random Functions and the Multiplicative Variation of Income. Econometrica 29, p. 517-43.

March, L., 1898. Quelques Exemples de Distribution de Salaires. Journal de la Société Statistique de Paris, p. 193-206.

Markatou, M. and T. P. Hettmansperger, 1990. Robust Bounded Influence Tests in Linear Models. Journal of the American Statistical Association 85, p. 187-90.

Markatou, M., W. A. Stahel and E. Ronchetti, 1991. Robust M-Type Testing Procedures for Linear Models. In Directions in Robust Statistics and Diagnostics, Part I, W. A. Stahel and S. Weisberg (eds), p. 201-20. New York: Springer-Verlag.

Markatou, M. and C. L. Tsai, 1992. Robust Tests in Nonlinear Regression Models. Manuscript, University of Iowa, Iowa.

Maronna, R. A., 1976. Robust M-Estimators of Multivariate Location and Scatter. The Annals of Statistics 4, p. 51-67.

Maronna, R. A., O. H. Bustos and V. J. Yohai, 1979. Bias- and Efficiency-Robustness of General M-Estimators for Regression with Random Carriers. In Smoothing Techniques for Curve Estimation, T. Gasser and M. Rosenblatt (eds), p. 91-116. New York: Springer-Verlag.

McAleer, M., 1986. Specification Tests for Separate Models: A Survey. In Specification Analysis in the Linear Model, M. L. King and D. E. A. Giles (eds). London: Routledge and Kegan Paul.

McCullagh, P. and J. A. Nelder, 1989. Generalized Linear Models (second edition). London: Chapman and Hall.

McDonald, J. B., 1984. Some Generalized Functions for the Size Distribution of Income. Econometrica 52, p. 647-64.

McDonald, J. B. and M. R. Ransom, 1979. Functional Forms, Estimation Techniques and the Distribution of Income. Econometrica 47, p. 1513-25.

McDonald, J. B. and D. O. Richards, 1987. Model Selection: Some Generalized Distributions. Communications in Statistics, Theory and Methods 16, p. 1049-74.

Metcalf, C. E., 1972. An Econometric Model of the Income Distribution. Chicago: Markham.

Michael, J. R. and W. R. Schucany, 1985. The Influence Curve and Goodness-of-Fit. Journal of the American Statistical Association 80, p. 678-82.

Mincer, J., 1958. Investment in Human Capital and Personal Income Distribution. Journal of Political Economy 66, p. 281-302.

Mizon, G. E. and J.-F. Richard, 1986. The Encompassing Principle and its Application to Non-Nested Hypotheses. Econometrica 54, p. 657-78.

Monti, A. C., 1992. The Study of the Gini Concentration Ratio by Means of the Influence Function. Statistica (to appear).

Moran, P., 1951. The Random Division of an Interval, Part II. Journal of the Royal Statistical Society B 13, p. 147-50.

Neyman, J., 1949. Contribution to the Theory of the χ2 Test. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, p. 239-73.

Pareto, V., 1896. Ecrits sur la Courbe de la Répartition de la Richesse. Oeuvres complètes de Vilfredo Pareto publiées sous la direction de Giovanni Busino, Librairie Droz, Genève, 1965.

Pareto, V., 1897. Cours d'Economie Politique. Vol. 2, part I, chapter 1, Lausanne, Switzerland.

Pearson, E. S., 1931. The Analysis of Variance in Cases of Non-Normal Variation. Biometrika 23, p. 114-33.

Pearson, K., 1894. Contributions to the Mathematical Theory of Evolution. Philosophical Transactions of the Royal Society 184, reprinted in K. Pearson, 1948, p. 1-40.

Pearson, K., 1948. Karl Pearson Early Statistical Papers. Cambridge: Cambridge University Press.

Pesaran, M. H., 1982. Comparison of Local Power of Alternative Tests of Non-Nested Regression Models. Econometrica 50, p. 1287-305.

Pietra, G., 1915. Delle Relazioni tra gli Indici di Variabilità, Note I e II. Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti LXXIV, p. 775-804.

Pigou, A. C., 1912. Wealth and Welfare. London: Macmillan.

Psacharopoulos, G. and K. Hinchliffe, 1972. Further Evidence on the Elasticity of Substitution among Different Types of Educated Labour. Journal of Political Economy 80, p. 786-96.

Pudney, S. and H. Sutherland, 1992. The Statistical Reliability of Micro-Simulation Estimates: Results for a UK Tax-Benefit Model. Mimeo, Department of Applied Economics, Cambridge, UK.

Pyatt, G., 1976. On the Interpretation and Disaggregation of Gini Coefficients. Economic Journal 86, p. 243-55.

Ransom, M. R. and J. S. Cramer, 1983. Income Distribution Functions with Disturbances. European Economic Review 22, p. 363-72.

Rasche, R. H., J. Gaffney, A. Y. C. Koo and N. Obst, 1980. Functional Forms for Estimating the Lorenz Curve. Econometrica 48, p. 1061-2.

Read, T. R. C. and N. A. C. Cressie, 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data. New York: Springer-Verlag.

Ricardo, D., 1819. Principles of Political Economy. London: Cambridge Academic Press.

Rieder, H., 1978. A Robust Asymptotic Testing Model. The Annals of Statistics 6, p. 1080-94.

Ronchetti, E., 1979. Robustheitseigenschaften von Tests. Diploma Thesis, ETH Zürich, Switzerland.

Ronchetti, E., 1982. Robust Testing in Linear Models: The Infinitesimal Approach. Ph.D Thesis, ETH Zürich, Switzerland.

Ronchetti, E., 1987. Robust C(α)-Type Tests for Linear Models. Sankhya A 49, p. 1-6.

Rothschild, M. and J. E. Stiglitz, 1973. Some Further Results on the Measurement of Inequality. Journal of Economic Theory 6, p. 188-204.

Rousseeuw, P. J., 1984. Least Median of Squares Regression. Journal of the American Statistical Association 79, p. 871-80.

Rousseeuw, P. J. and A. M. Leroy, 1987. Robust Regression and Outlier Detection. New York: Wiley.

Rousseeuw, P. J. and E. Ronchetti, 1979. The Influence Curve for Tests. Research Report 21, ETH Zürich, Switzerland.

Rousseeuw, P. J. and E. Ronchetti, 1981. Influence Curves for General Statistics. Journal of Computational and Applied Mathematics 7, p. 161-66.

Rousseeuw, P. J. and V. J. Yohai, 1984. Robust Regression by Means of S-Estimators. In Robust and Nonlinear Time Series Analysis, J. Franke, W. Härdle and R. D. Martin (eds), p. 256-72. New York: Springer-Verlag.

Rutherford, R. S. G., 1955. Income Distribution: a New Model. Econometrica 23, p. 277-94.

Salem, A. B. Z. and T. D. Mount, 1974. A Convenient Descriptive Model of Income Distribution: The Gamma Density. Econometrica 42, p. 1115-27.

Salvaterra, T., 1989. Comparison Among Concentration Curves and Indexes in Some Empirical Distributions. In Income and Wealth Distribution, Inequality and Poverty, C. Dagum and M. Zenga (eds), p. 194-214. New York: Springer-Verlag.

Schwarz, G., 1978. Estimating the Dimension of a Model. The Annals of Statistics 6, p. 461-4.

Sen, A., 1973. On Economic Inequality. New York: W. W. Norton.

Shorrocks, A. F., 1973. Aspects of the Distribution of Personal Wealth. Ph.D dissertation, London School of Economics.

Shorrocks, A. F., 1980. The Class of Additively Decomposable Inequality Measures. Econometrica 48, p. 613-26.

Shorrocks, A. F., 1983. The Impact of Income Components on the Distribution of Family Income. Quarterly Journal of Economics 98, p. 311-26.

Silber, J., 1989. Factor Components, Population Subgroups and the Computation of the Gini Index of Inequality. The Review of Economics and Statistics 71, p. 107-15.

Silvapulle, M. J., 1992. Robust Wald-Type Tests of One-Sided Hypotheses in the Linear Model. Journal of the American Statistical Association 87, p. 156-61.

Simpson, D. G., 1987. Minimum Hellinger Distance Estimation for the Analysis of Count Data. Journal of the American Statistical Association 82, p. 802-7.

Simpson, D. G., 1989. Hellinger Deviance Tests: Efficiency, Breakdown Points, and Examples. Journal of the American Statistical Association 84, p. 107-13.

Simpson, D. G., D. Ruppert and R. J. Carroll, 1992. One-Step GM Estimates and Stability of Inferences in Linear Regression. Journal of the American Statistical Association 87, p. 439-50.

Singh, S. K. and G. S. Maddala, 1976. A Function for the Size Distribution of Incomes. Econometrica 44, p. 963-70.

Slottje, D. J., 1989. The Structure of Earnings and the Measurement of Income Inequality in the U.S. Amsterdam: North-Holland.

Slottje, D. J., R. L. Basmann and M. Nieswiadomy, 1989. On the Empirical Relationship Between Several Well-Known Inequality Measures. Journal of Econometrics 42, p. 49-66.

Steindl, J., 1965. Random Processes and the Growth of Firms: A Study of the Pareto Law. New York: Hafner Press.

Stephens, M. A., 1984. Tests Based on EDF Statistics. In Goodness-of-Fit Techniques, R. B. D’Agostino and M. A. Stephens (eds), p. 97-193. New York, Basel: Marcel Dekker.

Stuart, A., 1954. The Correlation Between Variate-Values and Ranks in Samples From a Continuous Distribution. British Journal of Statistical Psychology.

Tamura, R. and D. Boos, 1986. Minimum Hellinger Distance Estimation for Multivariate Location and Covariance. Journal of the American Statistical Association 81, p. 223-9.

Theil, H., 1967. Economics and Information Theory. Amsterdam: North-Holland.

Theil, H., 1989. The Development of International Inequality. Journal of Econometrics 42, p. 145-55.

Thurow, L. C., 1970. Analysing the American Income Distribution. American Economic Review 60, p. 261-9.

Tinbergen, J., 1975. Income Distribution: Analysis and Policy. Amsterdam: North-Holland.

Tukey, J. W., (1970-1971), 1977. Exploratory Data Analysis (1970-71: Preliminary edition). Reading, Massachusetts: Addison-Wesley.

Villaseñor, J. A. and B. C. Arnold, 1989. Elliptical Lorenz Curves. Journal of Econometrics 40, p. 327-38.

Vinci, F., 1921. Nuovi Contributi allo Studio della Distribuzione dei Redditi. Giornale degli Economisti e Rivista di Statistica 18, p. 309-48.

Watson, G. S., 1961. Goodness-of-Fit Tests on a Circle. Biometrika 48, p. 109-14.

Weibull, W., 1951. A Statistical Distribution Function of Wide Applicability. Journal of Applied Mechanics 18, p. 293-7.

Welsh, A. H. and E. Ronchetti, 1992. A Failure of Intuition: Outlier Deletion in Linear Regression. Working paper, University of Geneva, Switzerland.

White, H., 1982. Regularity Conditions for Cox’s Test of Non- Nested Hypotheses. Journal of Econometrics 19, p. 301-18.

Williams, D. A., 1970. Discrimination Between Regression Models to Determine the Pattern of Enzyme Synthesis in Synchronous Cell Cultures. Biometrics 26, p. 23-32.

Yohai, V. J., 1987. High Breakdown Point and High Efficiency Robust Estimates for Regression. The Annals of Statistics 15, p. 642-56.

Yohai, V. J. and R. H. Zamar, 1988. High Breakdown Point Estimates of Regression by Means of the Minimization of an Efficient Scale. Journal of the American Statistical Association 83, p. 406-13.

Zabell, S., 1976. Arbuthnot, Heberden and the Bills of Mortality. Technical Report 40. Department of Statistics, University of Chicago.

Zenga, M., 1984. Proposta per un Indice di Concentrazione Basato sui Rapporti fra Quantili di Popolazione e Quantili di Reddito. Giornale degli Economisti ed Annali di Economia XLIII 5-6, p. 301-26.

Zenga, M., 1989. Concentration Curves and Concentration Indexes Derived From them. In Income and Wealth Distribution, Inequality and Poverty, C. Dagum and M. Zenga (eds), p. 149-70. New York: Springer-Verlag.