ESTIMATION OF THE MEAN FROM CENSORED INCOME DATA

Sandra A. West, Bureau of Labor Statistics

1. INTRODUCTION

In recent years economists and sociologists have become increasingly interested in the study of income attainment and income inequality, as is reflected by the growing number of articles which treat an individual's income as the phenomenon to be explained. Research on income has included the investigation of the effects of status group membership on income attainment and inequality: the determinants of racial or ethnic inequality; the determinants of gender inequality; and the determinants of income and wealth attainment in old age.

Although many of the studies used census data, research on income has come increasingly to rely on data from social surveys. The use of survey data in the study of income can present a serious measurement problem, since respondents' incomes most often are not measured in their true dollar amounts but in categories, with the last category being open-ended.

The problem with this categorical measurement is most acute when the researcher wants to estimate income through the application of a statistical technique, such as regression, which assumes specific measurements. The most common way of changing categorical income variables into measurement variables has been to assign all incomes within a specific category to the midpoint of that category. However, this procedure still presents the problem of how to estimate the mean (or midpoint) for the upper income category, which is open-ended. In this paper the problem of estimating the mean of the open-ended interval is considered, but the problem arose in a different manner.

The project underlying this paper began in connection with a program at the Bureau of Labor Statistics (BLS) that reports weekly earnings of wage and salary workers. The data on usual weekly earnings are collected in the Current Population Survey (CPS), a national survey of households conducted for the Bureau of Labor Statistics by the Bureau of the Census. Estimates published quarterly include usual weekly earnings of full-time wage and salary workers by age, race, and sex. In addition, comparisons are often made over time. A function of these estimates that is often considered is the ratio of women's earnings to men's earnings at any given point in time and the change of the ratio over time. Due to problems inherent in operating with medians, see West (1985a), it was decided to also compute means. However, the exact mean could not be computed since, due to operational procedures, the data received by the BLS are censored at $999; that is, any person making $999 or more per week is coded as making $999.

The problem will be considered from the point of view of computing a population mean from censored data. One way to approach this problem is to fit a theoretical distribution to the upper portion of the observed income distribution and then determine the conditional mean of the upper tail. An estimator of this conditional mean will be referred to as a tail estimator. In this paper a truncated Pareto distribution is used and a modified maximum likelihood estimator is developed for the Pareto parameter. Other estimators of the Pareto parameter are also considered. In particular it is shown that the estimator most used in applications can lead to very misleading results. In order to see the effect of the tail estimator, six actual populations, consisting of income data that were not censored, were considered. Different estimates of the tail and overall mean were compared to the true values for twenty subpopulations.

In Section 2 the problem is formulated. In Section 3 the Pareto distribution is discussed and methods of estimating the parameter are considered. In particular, a maximum likelihood estimator using truncated and censored data is developed. In Section 4, the two leading candidates for the overall mean, using a tail estimator, are computed on six real populations (twenty subpopulations), where the true tail is known. Section 5 contains the conclusions. In addition to the usual properties of the mean, there is another nice feature that is discussed in the Appendix. It is shown that under a mild assumption the percent difference between two means is bounded relative to the percent difference between subgroup means.

2. FORMULATION OF PROBLEM

True population data are: $X_1, X_2, \ldots, X_N$. Let $Y_i$ be the ordered X's up to and including some fixed number U. Assume there are (N-t) observations of value U and over. The data actually observed are:

$Y_1, Y_2, \ldots, Y_t$, (N-t) U's.

The mean of the complete population is desired; that is,

$$\bar{X} = \sum_{i=1}^{N} X_i / N. \qquad (2.1)$$

An obvious estimator is:

$$\hat{\bar{X}} = \Big[\sum_{i=1}^{t} Y_i + U(N-t)\Big] / N, \qquad (2.2)$$

which is known as the Winsorized mean. This estimator underestimates the true mean and overestimates the ratio of women's earnings to men's earnings. A natural extension of the Winsorized mean is discussed in this paper. A theoretical distribution is fit to the censored data using the observed income data and then an overall mean is determined. Since much of the income data available are not in exact dollar amounts but in categories, the problem will be reformulated in terms of grouped data. The observed data are in the following form.

Income Intervals                 f = frequency
$U_0 \le y < U_1$                $f_1$
...                              ...
$U_{r-2} \le y < U_{r-1}$        $f_{r-1}$
$y \ge U_{r-1} = U$              $f_r$

where $\sum_{i=1}^{r} f_i = N$, and $f_r = N - t$.

Let $M_i$ be the mid-point of the i-th interval, for i = 1, 2, ..., r-1, and $M_r$ be an estimator of the mid-point of the r-th interval; then an estimator of the mean is:

$$\hat{\bar{X}} = \Big[\sum_{i=1}^{r-1} M_i f_i + M_r f_r\Big] / N. \qquad (2.3)$$

One possibility for $M_r$ is to fit a theoretical distribution to the (r-1) mid-points, and then $M_r$ would be the mean of the conditional distribution, $P(X \le x \mid X \ge U)$. That is,

$$M_r = \int_U^{\infty} x f(x \mid U)\, dx,$$

where $f(x \mid U)$ is the conditional density of X given that X is greater than the fixed number U. Another possibility for $M_r$ is the median of the conditional density. Parker and Fenwick (1983) found that this estimator performed better than using the mean, but this was not the case with the method developed in this paper.

Many distributions have been proposed for income data, see Theil (1967), Arnold (1983). From the literature it seems that researchers are satisfied with the Pareto distribution as a fit to the upper portion of the income curve. In this paper only the Pareto density will be considered, but different methods of estimating the parameters will be investigated.

3. PARETO DISTRIBUTION

Consider the classical Pareto distribution

$$F(x) = P(X \le x) = 1 - (K/x)^{\alpha} \ \text{ for } x \ge K \ge 0,\ \alpha > 0; \qquad F(x) = 0 \ \text{ for } x < K. \qquad (3.1)$$

Noting that $P(X \ge x \mid X \ge U) = P(X \ge x)/P(X \ge U) = U^{\alpha} x^{-\alpha}$, then $f(x \mid U) = -\,dP(X \ge x \mid X \ge U)/dx = \alpha U^{\alpha} x^{-\alpha-1}$, $x > U$. Thus,

$$M_r = \int_U^{\infty} x f(x \mid U)\, dx = \frac{\alpha U}{\alpha - 1}. \qquad (3.2)$$

In practice, it is frequently assumed that $\alpha > 1$, so that the distribution has a positive finite mean. Note that the sum of two independent Pareto random variables has a distribution which has a Paretian tail. Specifically, let $X_1$ and $X_2$ be independent Pareto random variables with parameters $\alpha_1$ and $\alpha_2$; then

$$P(X_1 + X_2 > x) \sim x^{-\min(\alpha_1, \alpha_2)} H(x), \qquad (3.3)$$

where H is a slowly varying function of x. This result gives insight into the persistence of the Paretian tail.

Parameter estimation for the classical Pareto distribution has been fairly well investigated from the point of view of point estimation. Interval estimation has not been extensively investigated. A new generalized Bayesian approach for interval estimation of the Pareto parameters has been developed by West (1986). In this paper three popular methods for point estimation will be considered: 1) Least squares; 2) Quantiles; and 3) Maximum likelihood.

Since the Pareto distribution is considered a good fit for the income distribution over the higher portion, the parameter will be estimated from the left truncated distribution. Let $M_s, M_{s+1}, \ldots, M_{r-1}$ denote the truncated and non-censored data. For the least squares estimator, note that if $M_s \le M_i < U$ and $F(M_i) = 1 - (M_s/M_i)^{\alpha}$, then

$$\ln[1 - F(M_i)] = \alpha \ln M_s - \alpha \ln M_i. \qquad (3.4)$$

Letting $Y_i = \ln[1 - F(M_i)]$ and $X_i = \ln M_i$, where $F(M_i) = \sum_{j=s}^{i} f_j / \sum_{j=s}^{r} f_j$ for i = s, s+1, ..., (r-1), the least squares estimator for $\alpha$ is:

$$\hat{\alpha} = \frac{-(r-s)\sum_{i=s}^{r-1} X_i Y_i + \sum_{i=s}^{r-1} X_i \sum_{i=s}^{r-1} Y_i}{(r-s)\sum_{i=s}^{r-1} X_i^2 - \big(\sum_{i=s}^{r-1} X_i\big)^2}. \qquad (3.5)$$

Equation (3.4) implies that as $M_i$, the level of income under consideration, increases, the number of people in the population who have incomes greater than $M_i$ decreases. However, the Pareto curve as represented in (3.4) is linear only in the upper portions of an income distribution. In the literature the least squares method is one of the two principal methods for the estimation of $\alpha$.

The second is estimation from quantiles, described below. Let $M_p$ and $M_q$ denote the p-th and q-th quantile respectively; that is,

$$F(M_p) = P(M \le M_p) = 1 - (K/M_p)^{\alpha} = p. \qquad (3.6)$$

Letting $\hat{M}_p$ and $\hat{M}_q$ be estimators of $M_p$ and $M_q$ respectively leads to the following estimator of $\alpha$:

$$\hat{\alpha}_{pq} = \ln\big((1-p)/(1-q)\big) / \ln\big(\hat{M}_q/\hat{M}_p\big). \qquad (3.7)$$

Most researchers seem to use this method with either the mid-points of the last two closed intervals or the last closed interval and the open interval. Specifically, if the mid-points of the last two closed intervals are used, (3.7) becomes

$$\hat{\alpha}_c = \ln\big(f_r/(f_r + f_{r-1})\big) / \ln\big(M_{r-2}/M_{r-1}\big). \qquad (3.8)$$

If the lower bounds of the last closed interval and the open interval are used, then (3.7) becomes

$$\hat{\alpha}_o = \ln\big((f_{r-1} + f_r)/f_r\big) / \ln\big(U/L_{r-1}\big). \qquad (3.9)$$

In the literature the estimator in (3.9) seems to be the one most recommended; for example see Shryock (1975), Parker and Fenwick (1983).

Consistency is easily verified for the quantile estimator and it is resistant to outliers. Quandt (1966) found that the performance of the quantile estimates was not much inferior to that of the maximum likelihood estimates. A Monte Carlo study reported by Koutrouvelis (1981) supports that view. However, it will be seen, empirically, in Section 4 that this method depends very much on the classification of the population. It can lead to gross errors and at best it does as well as the maximum likelihood estimator. Also, it can be shown theoretically that the quantile estimator is inferior to other available estimators. This is done by rewriting (3.7) in the following form:

$$\Big(c + \sum_{i=1}^{m} d_i V_i\Big)^{-1}, \qquad (3.10)$$

where the $V_i$'s are independent exponential variables and c and the $d_i$'s are constants.

This representation is arrived at by using the following well known theorem, David (1970). Let $V_1, V_2, \ldots, V_m$ be a sample of size m from the exponential distribution ($F(V) = 1 - \exp(-\lambda V)$, $V > 0$) with corresponding order statistics $W_i$, i = 1, 2, ..., m, and spacings $Z_i = W_i - W_{i-1}$, i = 1, 2, ..., m ($W_0 = 0$ by definition). The spacings are independent exponential random variables. Moreover, the random variables $\{(m-i+1) Z_i,\ i = 1, 2, \ldots, m\}$ are independent identically distributed random variables with the common exponential distribution. The direct relationship between the Pareto distribution and the standard exponential distribution ($X_i = K \exp(V_i/\alpha)$) renders this theorem an important tool in discussing the distribution of Pareto order statistics. Moments of random variables in the form of (3.10) can be derived. With these formulas it can be demonstrated that the quantile estimators are inferior to other available estimators.

In the rest of this section the maximum likelihood estimator will be considered. First consider a random variable M that can take on the following values with corresponding probabilities:

$$f(M) = \alpha K^{\alpha} M^{-(\alpha+1)} \ \text{ if } K < M < U,\ K > 0; \qquad f(M) = 1 - F(U) = (K/U)^{\alpha} \ \text{ if } M = U, \qquad (3.11)$$

where U is a fixed constant.

Suppose we observe r distinct values for M:

M = $M_i$ with frequency $f_i$, where $K < M_i < U$, for i = 1, 2, ..., (r-1);
M = U with frequency $f_r$,

where $\sum_{i=1}^{r} f_i = N$.

The likelihood function L is:

$$L = \prod_{i=1}^{r-1} \big[\alpha K^{\alpha} M_i^{-(\alpha+1)}\big]^{f_i} \big[(K/U)^{\alpha}\big]^{f_r}, \qquad (3.12)$$

which leads to the maximum likelihood estimator of $\alpha$:

$$\hat{\alpha} = (N - f_r) \Big/ \Big[\sum_{i=1}^{r-1} f_i \ln M_i - (N - f_r) \ln K - f_r \ln(K/U)\Big]. \qquad (3.13)$$

Note that the maximum likelihood estimator of K is the minimum of $M_i$ for i = 1, ..., r.

If the distribution is truncated on the left at $M_s$, the estimator becomes:

$$\hat{\alpha} = \sum_{i=s}^{r-1} f_i \Big/ \Big[\sum_{i=s}^{r-1} f_i \ln M_i - \ln M_s \sum_{i=s}^{r-1} f_i - f_r \ln(M_s/U)\Big]. \qquad (3.14)$$

Note that in the case of a Pareto distribution, truncation is equivalent to rescaling, $K = M_s$. In the next section six different truncation points are considered. It is shown that the smallest errors occur when the starting point is near the truncated mean.

In determining (3.14) the fact that there is an interval of data was not taken into account. The likelihood for grouped data leads to a maximum likelihood estimator that can only be obtained through iterative techniques. Specifically, suppose for the N observations, $X_1, X_2, \ldots, X_N$, all that is reported about a particular $X_i$ is the interval into which it falls, $[L_i, U_i]$, i = 1, 2, ..., r-1; and if $X_i \ge U$ all that is reported is $[U, \infty)$. Noting that

$$P\{L_i < X < U_i\} = K^{\alpha}\big[L_i^{-\alpha} - U_i^{-\alpha}\big], \qquad P\{X > U\} = K^{\alpha} U^{-\alpha},$$

and that it is observed that the interval $(L_i, U_i)$ contains $f_i$ observations for i = 1, 2, ..., r-1, and the interval $(U, \infty)$ contains $f_r$ observations, then the likelihood is:

$$L = N!\, K^{\alpha N} U^{-\alpha f_r} \prod_{i=1}^{r-1} \big[L_i^{-\alpha} - U_i^{-\alpha}\big]^{f_i} \Big/ \prod_{i=1}^{r} f_i!\,.$$

The maximum likelihood estimator can be obtained in closed form only if $U_i = c L_i$, for some constant c; this is not the case in the
current situation. It will be seen in the next section that the maximum likelihood estimator in (3.14) leads to fairly accurate estimates.

4. EVALUATION OF MEAN ESTIMATORS ON SIX REAL POPULATIONS

In this section the two leading candidates for tail estimators will be evaluated on six real populations, where the true tail is known. The two candidates are the modified maximum likelihood estimator (3.14) derived in the previous section and the quantile estimator (3.9) that is recommended in the literature. The six populations are briefly described.

Two years of income data were obtained from the 1982 and 1983 Consumer Expenditure Interview Survey, CE, conducted by the Bureau of Labor Statistics. The Interview Survey is one component of the Consumer Expenditure Survey Program. It uses a national probability sample of households and is designed to collect data on the types of expenditures which respondents can be expected to recall for a period of 3 months. Information is collected on demographic and family characteristics and on the inventory of major durable goods of each consumer unit. In the fifth and final interview, an annual supplement is used to obtain a financial profile of the consumer unit. Only people who worked full time (35 or more hours per week and 50 or more weeks per year) were used. The yearly income recorded was transformed into weekly earnings. Five groups are constructed: all, men, women, age 16-24, and age 25 and over.

One year of income data was obtained from the 1979 wave of the Panel Study of Income Dynamics, PSID, conducted by the Institute for Social Research. The data were divided into the same five groups as the CE data.

One year of income data was obtained from a special study conducted in 1977 to gauge the accuracy of the earnings data derived from the Current Population Survey, CPS. Wage and salary workers in one-eighth of the CPS sample were asked in January to supply information on how they were paid, how much they earned and how many hours they worked. With their permission, the same information was then obtained by mail from their employers. Again only full time workers were used. The data were divided into three groups: all, men and women.

The last two sets of data were obtained from an article by Parker and Fenwick (1983). The authors use the 1976 and 1977 waves of the Panel Study of Income Dynamics. The PSID '76 and '77 data are joint husband and wife yearly income. All the distributions are displayed in West (1985b).

The studies performed on the three populations, CE '82, '83 and PSID '79, will be described first. For each of the five categories in the three populations the data were grouped in five different ways: $1, $10 centered and uncentered intervals, and $50 centered and uncentered intervals. The uncentered intervals are the usual intervals of equal width starting with zero. The centered intervals are centered around multiples of ten. After the true means were computed on the entire set of data, the data were truncated at $999 and overall means with estimated tails were computed for each of the possibilities.

Of the two estimating techniques, the maximum likelihood, (3.14), has a second parameter, which will be referred to as the starting point. In order to determine how much of the income data should be used in estimating the Pareto parameter $\alpha$, six starting points were considered. The parameter $\alpha$ was estimated using data starting with $200, $300, $400, $500, $600 and $700. Empirically it was found that if the earnings distribution was truncated at the mid-point, $M_s$, of the interval containing the truncated mean, then the resulting estimate of $\alpha$ led to the estimate of the mean that was the closest to the true mean.

The overall mean reported in the tables is the mean computed on $1 intervals for data up to $999 combined with the estimated tail mean; that is, letting $S = \{i \mid X_i < 999\}$, then

$$\hat{\bar{X}} = \Big(\sum_{i \in S} X_i + M_r f_r\Big) / N = \mathrm{TRM}\,(N - f_r)/N + M_r f_r / N, \qquad (4.1)$$

where $\mathrm{TRM} = \sum_{i \in S} X_i / (N - f_r)$. The percent error of the estimator is determined by

$$PE = 100(\bar{X} - \hat{\bar{X}})/\bar{X}, \qquad (4.2)$$

where $\bar{X}$ is the mean computed on $1 intervals for the complete population. The percent error of the tail contribution to the mean (or the tail mean) is defined as:

$$100\Big(\sum_{i \in S'} X_i - M_r f_r\Big) \Big/ \sum_{i \in S'} X_i, \qquad (4.3)$$

where S' is the complement of S.

Due to space limitations, the empirical results will be summarized and four tables will be displayed as an example of the results. Additional tables are in West (1985b). In almost every case the centered intervals led to smaller errors than the uncentered intervals. For the maximum likelihood estimator, out of 30 cases 83% had absolute percent errors between 0 and 1; the remaining 17% had absolute percent errors between 1.1 and 5.2. The quantile method could not be used in several cases because it led to an $\alpha < 1$, which yields a negative mean. This method depends on the classification of the data, can lead to gross errors, such as 236%, and at best does as well as the modified maximum likelihood method.

In Tables 1-3, the mean using the modified maximum likelihood method and the mean using the quantile method are recorded for the 5 subgroups, using data from the 1982 and 1983 Consumer Expenditure Survey and the 1979 Panel Study of Income Dynamics. The maximum likelihood estimates were computed by choosing $M_s$ to be in the interval that contains the truncated mean. (The data were grouped into $10 centered intervals.) In Table 4 the means computed by the two methods are recorded for three subgroups using 1977 CPS special data.

Table 1. Mean Weekly Earnings Computed From 1982 Consumer Expenditure Survey

            MLE        Percent    Quantile   Percent
            $\bar{X}$  Error      $\bar{X}$  Error
ALL         385        -.3        434        -13.1
MEN         449        -.1        486        -8.4
WOMEN       282        4.6        275        7.0
16-24       230        .1         227        1.5
25+         420        -.6        457        -9.3

Table 2. Mean Weekly Earnings Computed From 1983 Consumer Expenditure Survey

            MLE        Percent    Quantile   Percent
            $\bar{X}$  Error      $\bar{X}$  Error
ALL         392        1.0        439        -10.9
MEN         461        -.9        *
WOMEN       292        2.8        288        4.4
16-24       229        .2         224        2.4
25+         424        1.0        467        8.9

* Method led to a negative mean.

Table 3. Mean Weekly Earnings Computed From 1979 Panel Study of Income Dynamics

            MLE        Percent    Quantile   Percent
            $\bar{X}$  Error      $\bar{X}$  Error
ALL         284        0.2        315        -10.6
MEN         343        -0.1       388        13.0
WOMEN       194        0.0        194        0.2
16-24       207        0.3        206        0.5
25+         302        0.0        336        11.1

Table 4. Mean Weekly Earnings Computed From 1977 CPS Special Data

            MLE        Percent    Quantile   Percent
            $\bar{X}$  Error      $\bar{X}$  Error
ALL         389        -1.0       *
MEN         441        -4.0       *
WOMEN       288        -1.0       980        -236.8

* Method led to a negative mean.

5. CONCLUSIONS

In practice, estimates of $\alpha$ have been used in a variety of ways to characterize the inequality of an observed income distribution. For example, Bowman (1945) used $\alpha$ directly for comparisons of inequality. More common is the procedure of transforming $\alpha$ to obtain an inequality measure. The Lorenz measure of income concentration will equal $1/(2\alpha - 1)$ if the income distribution is Pareto (with $\alpha > 1$). The Gini measure of income concentration will equal $\alpha/(\alpha - 1)$ if the income distribution is Pareto (with $\alpha > 1$). Whichever function of $\alpha$ is the object of investigation, its best estimate (in the sense of asymptotic efficiency) will be obtained by inserting the best estimate of $\alpha$ in the function.

From the theoretical and empirical investigations, it is clear that the maximum likelihood estimator of $\alpha$ is the best. Also, the variance of a function of the maximum likelihood estimator is easily determined. The quantile estimator is inferior to the maximum likelihood estimator and can lead to misleading results.

From the empirical investigation, the estimator of the population mean which uses a Pareto tail with the modified maximum likelihood estimator for the parameter $\alpha$ does very well and is the mean recommended. If instead of the entire population a random sample was drawn, then the estimates from the sample data would be subject to variation. Confidence intervals could easily be computed for the population mean.

APPENDIX

A simple geometric argument is given for the fact that the percent difference between two means is bounded relative to the percent difference between subgroup means, if the proportion of people in each subgroup remains the same over time.

Let $X_{t1}$ and $X_{t2}$ denote at time t the average income of two mutually exclusive groups (say men and women) of size $n_{t1}$ and $n_{t2}$ respectively. Let $X_{tT}$ denote the average income of the combined group at time t; that is,

$$X_{tT} = (n_{t1} X_{t1} + n_{t2} X_{t2})/(n_{t1} + n_{t2}) = U_t X_{t1} + (1 - U_t) X_{t2},$$

where $U_t = n_{t1}/(n_{t1} + n_{t2})$.

For each group it is often of interest to look at the change over two time periods. That is, for group 1 the change between time periods 1 and 2 is defined as:

$$(X_{21} - X_{11})/X_{11} = (X_{21}/X_{11}) - 1.$$

Similarly, the percent change for group 2 and the combined group are respectively $(X_{22}/X_{12}) - 1$ and $(X_{2T}/X_{1T}) - 1$. It will be shown geometrically that if $U_1 = U_2$ then the percent change for the combined group is bounded by the percent changes for the subgroup means.

Assuming $U_1 = U_2$ and $X_{21}/X_{11} \le X_{2T}/X_{1T}$, it will be shown that $X_{2T}/X_{1T} \le X_{22}/X_{12}$. The reverse inequalities are shown in a similar way.

Let A be the point with coordinates $(U_1 X_{11},\ U_2 X_{21})$ and B the point with coordinates $(-(1-U_1) X_{12},\ -(1-U_2) X_{22})$. Letting O be the origin (0,0), the slopes of the lines AB, OA and OB are respectively $X_{2T}/X_{1T}$, $X_{21}/X_{11}$ and $X_{22}/X_{12}$, since $U_1 = U_2$.

Letting $\phi$, $\theta$, and $\omega$ be the respective angles that the lines AB, OA and OB' make with the X-axis, as indicated in Figure 1, then

slope (AB) = $X_{2T}/X_{1T} = \tan \phi$
slope (OA) = $X_{21}/X_{11} = \tan \theta$
slope (OB) = $X_{22}/X_{12} = \tan \omega$.

Given that $\theta < \phi$ it follows that $\phi < \omega$. From Figure 1 it is clear that since $\theta < \phi$ then $\phi < \omega$ for a triangle to be formed from OA and OB. Thus slope (AB) < slope (OB), that is, $X_{2T}/X_{1T} \le X_{22}/X_{12}$.

Figure 1. Geometrical proof that if $U_1 = U_2 = U$ then the percent change between group means is bounded by the percent changes for the subgroup means. Assuming that $\theta < \phi$ it is shown that $\phi < \omega$. (The figure shows the origin O and the points A, B, B', C and D, with B at $(-(1-U) X_{12},\ -(1-U) X_{22})$.)

Thanks are due to Janice Shack-Marquez for her computer work and to Darlene King and Daphne Van Buren for typing this paper.

REFERENCES

Arnold, B.C. (1983). Pareto Distributions, International Co-operative Publishing House, Fairland, Md.

Bowman, M.J. (1945). "A Graphical Analysis of Personal Income Distribution in the United States," American Economic Review, 35, 607-628.

David, H.A. (1970). Order Statistics, Wiley, New York.

Koutrouvelis, I.A. (1981). "Large Sample Quantile Estimation in Pareto Laws," Communications in Statistics, Theory and Methods, A10, 189-201.

Parker, R.N. and Fenwick, R. (1983). "The Pareto Curve and Its Utility for Open-Ended Income Distributions in Survey Research," Social Forces, 61, 872-885.

Quandt, R.E. (1966). "Old and New Methods of Estimation and the Pareto Distribution," Metrika, 10, 55-82.

Sands, S. (1973). "Estimating Data Withheld in Grouped Size Distributions," Journal of American Statistical Association, 68, 306-311.

Shryock, H. and Siegel, J. (1975). The Methods and Materials of Demography, Washington, D.C., U.S. Government Printing Office.

Theil, H. (1967). Economics and Information Theory, Amsterdam, North Holland Publishing Company.

West, S.A. (1985a). "Standard Measures of Censored Earnings Data, From the Current Population Survey", Bureau of Labor Statistics Report.

West, S.A. (1985b). "Estimation of the Mean From Censored Income Data," Bureau of Labor Statistics Report.

West, S.A. (1986). "Generalized Bayesian Inference for Pareto Parameter," (to be submitted to the Journal of the Royal Statistical Society).