EVALUATION OF GENERALIZED VARIANCE FUNCTION ESTIMATORS FOR THE U.S. CURRENT EMPLOYMENT SURVEY

Moon J. Cho, John L. Eltinge, Julie Gershunskaya, Larry Huff, U.S. Bureau of Labor Statistics
Moon J. Cho, 2 Massachusetts Avenue NE, Washington, DC 20212 (Cho [email protected])

Key Words: Complex sample design, Covered Employment and Wages Program (ES-202), Design-based inference, Generalized least squares, Model-based inference, Superpopulation model, Variance estimator stability.

The views expressed in this paper are those of the authors and do not necessarily reflect the policies of the U.S. Bureau of Labor Statistics.

1. Introduction

In applied work with generalized variance function models for sample survey data, one generally seeks to develop and validate a model that is relatively parsimonious and that produces variance estimators that are approximately unbiased and relatively stable. This development and validation work often begins with regression of initial variance estimators (computed through standard design-based methods) on one or more candidate explanatory variables. Evaluation of the adequacy of the resulting regression fit depends heavily on the relative magnitudes of error terms associated, respectively, with pure sampling variability of the initial design-based variance estimators; the deterministic lack of fit in the proposed generalized variance function model; and the random equation error associated with the generalized variance function model. This paper presents some simple methods of evaluating the relative magnitudes of the sampling error and equation error terms. Both parametric and nonparametric regression methods are used in producing smoothed estimators of the mean squared equation error in the underlying generalized variance function model fit. Some of the proposed diagnostics are applied to data from the U.S. Current Employment Survey.

2. Variance Function Model

Define $\hat{\theta}_j$ as a point estimator of $\theta_j$, a finite population mean or total. Let $\theta_{\xi j}$ be a superpopulation analogue of $\theta_j$, where $j$ is the domain index. For example, in the CES survey, domains are the combinations of industries, areas and time. Define $V_{pj} = V_p(\hat{\theta}_j)$ as the design variance of $\hat{\theta}_j$, and $\hat{V}_{pj} = \hat{V}_p(\hat{\theta}_j)$ as an estimator of $V_{pj}$. Throughout this paper, the subscript "p" denotes an expectation or variance evaluated with respect to the sample design.

The generalized variance function method models the variance of a survey estimator, $V_{pj}$, as a function of the estimate and possibly other variables (Wolter 1985). The common specification is

\[ V_{pj} = f(\theta_j, X_j, \gamma_j) + q_j \tag{1} \]

where $X_j$ is a vector of predictor variables potentially relevant to estimators of $V_{pj}$, $q_j$ is a univariate "equation error" with mean 0, and $\gamma_j$ is a vector of variance function parameters which we need to estimate. Note especially that $q_j$ represents the deviation of $V_{pj}$ from its modeled value $f(\theta_j, X_j, \gamma_j)$.

3. Current Employment Survey Data

The CES survey collects data on employment, hours, and earnings from 400,000 nonfarm establishments monthly. Employment is the total number of persons employed full or part time in a nonfarm establishment during a specified payroll period. An establishment, which is an economic unit, is generally located at a single location, and is engaged predominantly in one type of economic activity (BLS Handbook, 1997). This paper will focus only on total employment in the reporting establishment.

One important feature of the CES program is that complete universe employment counts of the previous year become available from the Unemployment Insurance tax records on a lagged basis (Butani, Stamas and Brick, 1997). The quarterly unemployment insurance files are generally transmitted five months after the end of the quarter by the states to BLS. It takes BLS an additional 3 months to process
these files through various edits as well as perform record linkage to previous quarters before making them available as the sampling frame (Butani, Stamas and Brick, 1997). These data, known as ES202 data, are used annually to benchmark the CES sample estimates to these universe counts (Werking, 1997).

Using the benchmark data, $x_{ia0}$, at the base period from ES202 data, the CES program obtains a weighted link relative estimator, $\hat{y}_{iat}$, to estimate the total employment, $x_{iat}$, within industry $i$, area $a$ and month $t$,

\[ \hat{y}_{iat} = x_{ia0} \hat{R}_{iat} \]

where $\hat{R}_{iat}$ is the growth ratio estimate from benchmark month 0 to current month $t$.

The CES sample design uses stratified sampling of Unemployment Insurance (UI) accounts with strata defined by state, industry and employment size class (BLS Handbook, 1997). CES aims primarily at meeting the requirements for the national estimates. As for finer domains, which are defined by geographic characteristics and industrial classifications, effective sample sizes within occupational classifications become so small that the standard design-based estimators are not precise enough to satisfy the needs of prospective data users (Eltinge, Fields, Fisher, Gershunskaya, Getz, Huff, Tiller and Waddington, 2001). It is necessary to have stable estimates of $V(\hat{y}_{iat})$ for the finer domains.

4. Model Fitting

We used the direct variance estimators from the survey as the dependent variables in GVF models. In the CES survey, we have direct estimators, $\hat{V}_{pj}$ of $V_{pj}$, from Fay's method, which is a variant of the balanced half-samples replication methods. Each replicate half-sample estimate is formed based on a Hadamard matrix. In the standard balanced half-samples replication methods, only the selected units are used to estimate the variance, and the weights for the selected units are multiplied by a factor 2 to form the weights for the replicate estimate (Wolter, 1985). However, in Fay's method, one-half of the sample is weighted down by a factor $K$ ($0 \le K < 1$) and the remaining half is weighted up by a compensating factor $2 - K$ (Judkins, 1990). In our CES example, $K = 0.5$.

We assume that $\hat{V}_{pj}$ is a design unbiased estimator for $V_{pj}$, i.e., $E_p(\hat{V}_{pj}) = V_{pj}$. Our sample consists of Unemployment Insurance accounts which report nonzero employment for previous and current months. Let $n_{iat}$ be the number of responding UI accounts within industry $i$, area $a$ and month $t$. In this paper, we consider only domains with at least 12 reporting UI accounts. There are 430 industry-area combinations in our CES data. Each industry-area combination has data from January to December of the year 2000. Hence we have 5160 industry-area-time combinations. For the current analysis, we considered data from the following six industries: Mining, Construction and Mining, Construction, Manufacturing Durable Goods, Manufacturing Nondurable Goods, Wholesale Trade. Consider the GVF model

\[ \log(\hat{V}_{iat}) = \gamma_0 + \gamma_1 \log(x_{ia0}) + \gamma_2 \log(n_{iat}) + \gamma_3 \log(t_{ia0}) + e. \tag{2} \]

In this model, we assume that both intercepts and slopes are constant across the industries and areas. Various modified models can be considered from (2). For example, we may allow the intercepts to vary across industries. Further, we may allow the intercepts and the slopes to vary across industries.

5. Residual Decomposition

Suppose that a model fitting method (e.g., ordinary least squares, perhaps on a transformed scale; or nonlinear least squares) leads to the point estimator $\hat{\gamma}_j$. This in turn leads to the estimated variances,

\[ V^*_{pj} \overset{\text{def}}{=} f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j). \tag{3} \]

Note that $V^*_{pj}$ is the variance estimator based on the model, which is transformed back onto the original variance scale.

From the definition of the direct variance estimator $\hat{V}_{pj}$,

\[ \hat{V}_{pj} = V_{pj} + \epsilon_j \tag{4} \]

where $\epsilon_j$ has mean 0 and a constant variance. Recall the variance function model in (1),

\[ V_{pj} = f(\theta_j, X_j, \gamma_j) + q_j. \]

Then the resulting residuals are

\begin{align}
\hat{V}_{pj} - V^*_{pj} &= (\hat{V}_{pj} - V_{pj}) - (V^*_{pj} - V_{pj}) \notag \\
&= \epsilon_j - \{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - f(\theta_j, X_j, \gamma_j) - q_j \} \tag{5} \\
&= \epsilon_j + \{ q_j - E(q_j) \} + E(q_j) - \{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - f(\theta_j, X_j, \gamma_j) \}. \tag{6}
\end{align}

In the equation above, $\epsilon_j$ is a pure estimation error in the original $\hat{V}_{pj}$ estimates with $E(\epsilon_j) = 0$, $\{ q_j - E(q_j) \}$ is random equation error, and $E(q_j)$ is deterministic lack-of-fit in our model attributable, e.g., to omitted regressors or misspecified functional form. The last term in (6), $\{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - f(\theta_j, X_j, \gamma_j) \}$, is a parameter estimation error attributable to errors $\{ (\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - (\theta_j, X_j, \gamma_j) \}$.

Exploratory analysis of the adequacy of our estimated values, $V^*_{pj}$, may focus on the magnitude of the prediction errors, $V^*_{pj} - V_{pj}$, relative to the errors, $\hat{V}_{pj} - V_{pj}$, in the original estimators $\hat{V}_{pj}$. If $E(V^*_{pj} - V_{pj})^2$ is smaller than the variance of $\hat{V}_{pj}$, then we would prefer $V^*_{pj}$. In some cases, we may find that

\[ \delta(\theta_j, X_j, \gamma_j) \overset{\text{def}}{=} E\left[ \{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - V_{pj} \}^2 \mid \theta_j, X_j, \gamma_j \right] \]

varies across values of $\theta_j$ or $X_j$, with $\delta(\theta_j, X_j, \gamma_j) \ll V_p(\hat{V}_{pj} - V_{pj})$ only in some cases. In this case, we might prefer $V^*_{pj}$ for some, but not all, values of $X_j$.

In this section, we obtain and model the conditional squared error, $E\{ (V^*_{pj} - V_{pj})^2 \mid X_j \}$. Consider

\[ E\{ (\hat{V}_{pj} - V^*_{pj})^2 \mid X_j \} = E\left[ \{ (\hat{V}_{pj} - V_{pj}) + (V_{pj} - V^*_{pj}) \}^2 \mid X_j \right]. \]

Recall the equation (5),

\[ \hat{V}_{pj} - V^*_{pj} = \epsilon_j + \left[ q_j - \{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - f(\theta_j, X_j, \gamma_j) \} \right]. \]

From the variance function model in (1), and the definition of $V^*_{pj}$ in (3), we have

\[ V_{pj} - V^*_{pj} = q_j - \{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - f(\theta_j, X_j, \gamma_j) \}. \]

We are now assuming that for all $X_j$,

\[ E\left[ \epsilon_j \left( q_j - \{ f(\hat{\theta}_j, \hat{X}_j, \hat{\gamma}_j) - f(\theta_j, X_j, \gamma_j) \} \right) \mid X_j \right] \]

is much smaller than $E(\epsilon_j^2 \mid X_j)$ and $E\{ (V_{pj} - V^*_{pj})^2 \mid X_j \}$. Generally this will be true provided that the number of domains is
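
The log-linear fit in (2) and the residual identity in (5) can be illustrated with a small numerical sketch. The Python code below is not the authors' implementation: the function name `fit_log_gvf`, the two stand-in predictors, the coefficient values, the multiplicative equation error, and the chi-square sampling error are all assumptions chosen only for illustration, and the data are synthetic rather than CES data. The back-transformation uses a plain exponential with no bias correction.

```python
import numpy as np

def fit_log_gvf(v_hat, predictors):
    """OLS fit of log(v_hat) on an intercept and the logs of the predictors.

    Returns the coefficient vector and the fitted variances V* on the
    original scale (plain exp back-transform, no bias correction).
    """
    v_hat = np.asarray(v_hat, dtype=float)
    predictors = np.asarray(predictors, dtype=float)
    design = np.column_stack([np.ones(len(v_hat)), np.log(predictors)])
    gamma_hat, *_ = np.linalg.lstsq(design, np.log(v_hat), rcond=None)
    v_star = np.exp(design @ gamma_hat)
    return gamma_hat, v_star

# Synthetic domains (hypothetical values, not CES data).
rng = np.random.default_rng(0)
n_domains = 400
x0 = rng.uniform(1e3, 1e5, n_domains)      # stand-in for benchmark employment
n = rng.integers(12, 200, n_domains)       # stand-in for responding UI accounts
gamma_true = np.array([2.0, 1.2, -0.8])    # assumed coefficients
log_f = gamma_true[0] + gamma_true[1] * np.log(x0) + gamma_true[2] * np.log(n)

# "True" design variances: modeled value times an assumed equation error.
v_true = np.exp(log_f + rng.normal(0.0, 0.2, n_domains))

# Direct estimators: unbiased for v_true, with chi-square sampling noise.
df = 8
v_hat = v_true * rng.chisquare(df, n_domains) / df

gamma_hat, v_star = fit_log_gvf(v_hat, np.column_stack([x0, n]))

residual = v_hat - v_star            # left-hand side of (5)
sampling_error = v_hat - v_true      # epsilon_j in (4)
prediction_error = v_star - v_true   # V*_pj - V_pj

# Relative squared errors, averaged over domains.
rel_mse_direct = np.mean((sampling_error / v_true) ** 2)
rel_mse_model = np.mean((prediction_error / v_true) ** 2)
print(gamma_hat, rel_mse_direct, rel_mse_model)
```

In a simulation like this the true variances are known, so the two relative mean squared errors can be compared directly. With survey data only `residual` is observable, which is why the decomposition into sampling error and equation error matters.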