Measures of Central Tendency for Censored Wage Data
Total Page:16
File Type:pdf, Size:1020Kb
MEASURES OF CENTRAL TENDENCY FOR CENSORED WAGE DATA Sandra West, Diem-Tran Kratzke, and Shail Butani, Bureau of Labor Statistics Sandra West, 2 Massachusetts Ave. N.E. Washington, D.C. 20212 Key Words: Mean, Median, Pareto Distribution medians can lead to misleading results. This paper tested the linear interpolation method to estimate the median. I. INTRODUCTION First, the interval that contains the median was determined, The research for this paper began in connection with the and then linear interpolation is applied. The occupational need to measure the central tendency of hourly wage data minimum wage is used as the lower limit of the lower open from the Occupational Employment Statistics (OES) survey interval. If the median falls in the uppermost open interval, at the Bureau of Labor Statistics (BLS). The OES survey is all that can be said is that the median is equal to or above a Federal-State establishment survey of wage and salary the upper limit of hourly wage. workers designed to produce data on occupational B. MEANS employment for approximately 700 detailed occupations by A measure of central tendency with desirable properties is industry for the Nation, each State, and selected areas within the mean. In addition to the usual desirable properties of States. It provided employment data without any wage means, there is another nice feature that is proven by West information until recently. Wage data that are produced by (1985). It is that the percent difference between two means other Federal programs are limited in the level of is bounded relative to the percent difference between occupational, industrial, and geographic coverage. To subgroup means, if the proportion of units in each subgroup address this critical void in the Federal statistical efforts, the remains the same for the two groups. Since the problem OES program tested the feasibility of incorporating wage considered in this paper deals with grouped data that have questions into the survey in 1989 and 1990 pilot studies. lower and upper open intervals, it is not possible to compute The 1992 OES survey collects data in 15 States on an exact mean. The problem will be considered from the occupational hourly wage by industry in nonagricultural point of view of computing a population mean from fight establishments. The data are collected in eleven intervals, and left censored data. First the problem will be formulated rather than in exact dollar amounts, with the lowest and and two methods for computing the mean will be discussed. uppermost intervals open. One method results in the Winsorized mean and the other Research was conducted to f'md a suitable estimate of method uses a classical Pareto distribution. central tendency for the occupational wage data of the OES The population of true earnings data are denoted by survey. It was determined that both mean and median x,,x~ ..... x~. would be measured. Each has advantages and disadvantages The mean of the population is desired; that is, which will be discussed. The first part of the research focused on the problem of estimating the overall occupational wage mean for each X = industry, according to the OES survey's objective. The second part of the research explores the best method for estimating the wage mean of an upper open interval. This The data actually observed are the frequencies of each of would be useful for analyses such as regression where the intervals 11, 12 ..... It, which are mutually exclusive, and interval wages are used as dependent variables. The two measures of central tendency are described in exhaustive of the earnings scale. I~ and I r are open Section II. The empirical studies and results are given in intervals where 11 contains all xj less than some fixed Section III. The conclusions are presented in Section IV. Section V contains plans for future research. number U~ and Ir contains all xj greater than or equal to IL MEASURES OF CENTRAL TENDENCY A. MEDIANS some fixed number U(~_1). For i=2,3 ..... (r-l), I i are In elementary theory the median has considerable claims bounded intervals containing all xj between some fixed to be used as a measure of location for unimodal distributions. It is readily interpretable in terms of ordinary numbers U(i_l) and U i . ideas. What gives the arithmetic mean the greater importance in advanced theory is its superior mathematical Let f denote the observed frequency of interval I i for tractability and certain sampling properties. The median has i = 1,2 ..... r and let M~ be the mid-point of the i-th interval, a compensating advantage in that it is less sensitive to the configuration of the outlying parts of the frequency then the usual estimator of X is the grouped mean, Xg • distribution than is the mean. This is especially important with earnings data and in particular, with the censored data, _ r M/I//lV the median is a logical choice. However, the median is X = Z sensitive to the way the data are grouped and operating with g i=1 643 Since intervals 11 and I r are not bounded, M~ and M r will need to be estimated. 1 ! has a natural lower bound, M r = ~x f(xl Uir_,))dx = U(,._I)o~l(a- 1). either 0 or the minimum wage; for this study the Federal U{,-i) minimum occupational wage , We1) , was used. Thus the Note that for the mean to be positive, a > 1. Many methods exist for estimating the parameter tZ. The estimate for M I is: one most used and recommended in the literature is the quantile method, which is described next. /~1 ----(W(1) +Vl )/2. Let Mp and Mq denote the p-th and q- th quantile An obvious estimate for M r is: respectively; that is, ~'Ir.w = U(r_l)' F(Mv) = P(M <_ My)= 1-(KiMv) °< = p. which would lead to the Winsorized group mean: Similarly, for Mq. r--] /N Letting My and it~/q be estimators of Mp and M q Xg,w -- {/~lJ~ + Z Mifi + U(r-1)fr } " respectively, leads to the following estimator of a: i=2 Clearly,/¢/,.w will underestimate M, . With the Winsorized q ll,.I,aql ,l. mean a straight line is used for the missing values; a natural Most researchers seem to use this method with either the extension, now to be considered, is to fit a curve for the mid-points of the last two bounded intervals or the last missing values. bounded interval and the upper open interval. Specifically, if the mid-points of the last two bounded intervals are used, For the estimator Of M, , consider fitting a theoretical the estimator of t~ becomes: distribution to the (r-1) mid-points, and take/17/r as the mean d< = ln[(f,)/(fr + ir_l)]/ln[Mr_~lMr_,]. of the conditional distribution, P( X < xl X > U(r_,) ). ^ This will lead to Mr,q, 2 and Xg,q,2 as estimators for M r That is, and Xg, respectively. That is, M, = x dP(X < xlX >_ U(r_l)) = f(xlU(,_,)) dx f X A r-! /N = + Z M,,, + 7=2 where f(xlU(r_~) ) is the conditional density of X given that where X is greater than or equal to the fixed number U(r_~). = t4r-,) 1). Another possibility for /~/, is the median of the This method is referred to as the quantile II method, and conditional density. Parker and Fenwiek (1983) found that ^ Xg,q,2 is referred to as the Qnt II estimator in this paper. If this estimator performed better than the mean, but this was not the ease with the method and data considered in West the lower bounds of the last bounded interval and the open (1985). interval are used, then the estimator of ~ becomes: Many distributions have been proposed for earnings data, but the literature indicates that the researchers are satisfied 07o = ~l(/,_, + i,)l(i,)Vl,l u<,_,)lU(,_~> i. with the Pareto distribution as a fit to the upper portion of the earnings curve. Consider the Pareto distribution: In the literature the estimator ,d o seems to be the one most recommended, for example, see Shryock (1975), F(x) = P(X < x) = 1 -(K[x) ~ for x > K > O, a > 0 Parker and Fenwick (1983). This will lead to 34r,q.~ and =0 for x<K. ^ Noting that, Xg.q,l as estimators for M r and Xg, respectively. The P(X >_ xlX >_ V(,_1)) = P(X >_ x)/P(X >_ U(r_1)) = (V(r_l))tt x -t* method is referred to as the quantile I method and Xg,q.1 is then referred to as the Qnt I estimator in this paper. f(xlU(,._,)) = -dP(X > xlX > U(r_,))ldx An alternative estimator for t;t is a modified maximum = a(u(,_,)yx-:-', for x )_ U(r_l ) . likelihood estimator developed in West (1985). A brief THUS, description of the estimator will be given here. Since the Pareto distribution is considered a good fit for the 644 distribution of higher earnings, the parameter will be We define the error of estimation to be the difference estimated from the left truncated distribution. Let between the estimated value and the true value, which is M., M.+, ..... M._, assumed to be known. The relative error is defined as the ratio of the error of estimation to the true value. We be the left tnmcated mid-points of the bounded intervals, compare different estimators by looking at the absolute then the modified maximum likelihood estimator is: values of the relative errors of their estimates.