Power-Shrinkage and Trimming: Two Ways to Mitigate Excessive Weights
ASA Section on Survey Research Methods

Chihnan Chen (1), Nanhua Duan (2), Xiao-Li Meng (3), Margarita Alegria (4)
(1) Boston University; (2) UCLA; (3) Harvard University; (4) Cambridge Health Alliance and Harvard Medical School

Abstract

Large-scale surveys often produce raw weights with very large variations. A standard approach is to perform some form of trimming, as a way to reduce potentially large variances in various survey estimators. The amount of trimming is usually determined by considerations of bias-variance trade-off. While bias-variance trade-off is a sound principle, the trimming method itself is popular only because of its simplicity, not because it has statistically desirable properties. In this paper we investigate a more principled method by shrinking the variability of the log of the weights. We use the same mean squared error (MSE) criterion to determine the amount of shrinking. The shrinking on the log scale implies shrinking weights by a power parameter p in [0, 1]. Our investigation suggests an empirical way to predict the optimal choice. This power-shrinkage method provides a natural way to deal with measurement errors and outliers in the raw weights, while preserving the ranking of the raw weights. We demonstrate the use of this method with the National Latino and Asian American Study (NLAAS).

Keywords: Selection probability, unequal selections, selection biases, self-weighting.

1 Introduction

In survey studies, the weights are often used to mitigate the bias from the unequal selection probability. Estimation bias is reduced by sampling weight at the expense of increased variance. As an example, Kish (1992) provided an authoritative account of the weighted and unweighted estimators. He discussed the weight trimming approach to reduce the mean square error through the bias-variance trade-off. We propose below a new approach, the power-shrinkage method, to deal with this problem. The power-shrinkage approach we propose is a continuous transformation and preserves the ranking of weights. We investigate the properties of both trimming and power-shrinkage approaches via simulation studies.

We organize this paper in five sections. Section 1 is the introduction. Section 2 provides the formulas of shrinkage weights. Section 3 describes the NLAAS data set used in our simulation study. Section 4 describes the simulation design. Section 5 summarizes the results and concludes.

2 Power-Shrinkage and Trimming

The survey weights are constructed as the reciprocals of the selection probabilities. Let $w$ be the survey weight, $n$ be the sample size, and $y$ be the variable of interest. Let $p \in [0, 1]$ be the power-shrinkage parameter and $T \in [0, 1]$ be the trimming threshold defined in terms of percentile. The original weighted estimator, power-shrinkage estimator and the trimmed estimator for the expectation, $E(y)$, are given by

• Original weighted estimator:
$$\bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$

• Power-shrinkage estimator:
$$\bar{y}_w^{(p)} = \frac{\sum_{i=1}^{n} w_i^{p} y_i}{\sum_{i=1}^{n} w_i^{p}}$$

• Trimmed weighted estimator:
$$\bar{y}_w^{(T)} = \frac{\sum_{i=1}^{n} w_i(T)\, y_i}{\sum_{i=1}^{n} w_i(T)}$$

where $w_i(T) = \min(w_i, Z(T))$, and $Z(T)$ is the $(100 \times T)$th percentile of $w$.

It is well known that while $\bar{y}_w$ is unbiased, its variance can be very large when the variance of the weights is large. The trimming estimate $\bar{y}_w^{(T)}$ addresses this problem by reducing those excessively large weights via $w_i(T) = \min(w_i, Z(T))$, that is, the top weights above $Z(T)$ are trimmed down to $Z(T)$. This trimming will introduce bias, but the hope is that the variance of $\bar{y}_w^{(T)}$ is much smaller than that of $\bar{y}_w$, such that the overall mean square error of $\bar{y}_w^{(T)}$ is smaller than that of $\bar{y}_w$.

Although trimming is a very useful and popular method, it is nevertheless an ad-hoc method, and has a number of undesirable properties. For example, it does not preserve the (strict) ranks of the original weights. Furthermore, it does not address problems in the construction of the weights that are likely to affect all or most weights even though most of them did not become excessive.

We therefore propose the power-shrinkage method based on the motivation described below. In most survey studies, the distribution of the weights is reasonably approximated by the log-normal distribution (see Section 3 for a real data example). So if we want to reduce the impact of excessive weights, or if we believe that there is noise in the weights (e.g. due to measurement error in the construction of weights), we should rescale the log of the weights to reduce its variance. This is the same as raising the weight by a power $p \in [0, 1]$, which gives the power-shrinkage estimate $\bar{y}_w^{(p)}$.

[Figure 1: Distribution of Survey Weight w. Horizontal axis: w, in units of 10^4; vertical axis: Density.]

When $p = 1$ and $T = 1$, there is no shrinkage or trimming; all three estimators are the same: $\bar{y}_w = \bar{y}_w^{(p)} = \bar{y}_w^{(T)}$. On the other hand, when $p = 0$ and $T = 0$, both power-shrinkage and trimmed estimators are equivalent to the unweighted estimator. When $p$ and $T$ are close to one, the estimators are less biased but with larger variances. When $p$ and $T$ are close to zero, they behave like the unweighted estimator, biased but with a smaller variance.

The optimal choice of $p$ depends on a number of factors. The challenge is therefore to find a practical way to find a good (not necessarily optimal) choice of $p$, as well as some common choices of $p$ that will work reasonably well in a variety of situations. To explore these and to compare power-shrinkage and trimming approaches, we designed a simulation study using the NLAAS dataset as the template.
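The three estimators of Section 2 can be illustrated with a minimal sketch, assuming NumPy and strictly positive weights. The function names are hypothetical and are not taken from the paper. The percentile Z(T) is computed with NumPy's default linear interpolation, which is an assumption, since the paper does not state which percentile definition it uses.

```python
import numpy as np


def weighted_mean(y, w):
    """Original weighted estimator: sum(w_i * y_i) / sum(w_i)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * y) / np.sum(w)


def power_shrinkage_mean(y, w, p):
    """Power-shrinkage estimator: each weight is replaced by w_i ** p, 0 <= p <= 1.

    Equivalent to multiplying log(w_i) by p, so the ordering of the
    (positive) weights is preserved for p > 0.
    """
    w = np.asarray(w, dtype=float)
    return weighted_mean(y, w ** p)


def trimmed_mean(y, w, T):
    """Trimmed estimator: weights above Z(T) are cut down to Z(T).

    Z(T) is the (100 * T)-th percentile of w. NumPy's default linear
    interpolation is an assumption made for this sketch.
    """
    w = np.asarray(w, dtype=float)
    z = np.percentile(w, 100.0 * T)
    return weighted_mean(y, np.minimum(w, z))


# Toy data (hypothetical): one weight is much larger than the others.
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
w_demo = np.array([1.0, 1.0, 1.0, 7.0])

original = weighted_mean(y_demo, w_demo)
shrunk = power_shrinkage_mean(y_demo, w_demo, p=0.5)
trimmed = trimmed_mean(y_demo, w_demo, T=0.75)
```

At the boundaries the sketch reproduces the limiting cases stated above: p = 1 or T = 1 leaves the weights unchanged, while p = 0 or T = 0 makes all weights equal and yields the unweighted mean.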
3 NLAAS

The National Latino and Asian American Study (NLAAS) is a nationally representative survey of White, Latino and Asian American household residents (aged 18 and older) in the non-institutionalized population of the US. A total of 4864 individuals, including Latinos, Asians, and Whites, were interviewed. The sample includes an NLAAS Core, designed to be nationally representative of all Latino origin groups regardless of geographic patterns; and NLAAS-HD supplements, designed to oversample geographic areas with moderate to high density (HD) of Latino and Asian households. Weighting reflecting the joint probability of selection from the pooled Core and HD samples provides sample-based coverage of the national Latino population.

The properties of weights in NLAAS are summarized in Table 1, Figure 1 and Figure 2. As we can see, the variance of the weight is large, and the distribution of log weight is roughly Gaussian.

Statistic   w          Percentile   w
MIN         80         5            368
MEAN        7,340      25           1,325
MAX         136,011    50           2,861
                       75           8,625
                       95           28,385

Table 1: Summary Statistics for w

[Figure 2: Distribution of log10(w). Vertical axis: Density.]

4 Simulation Design

To investigate the properties of power-shrinkage and trimming approaches, we designed a simulation study using the NLAAS dataset. We constructed a hypothetical population consisting of $I$ clusters, where $I = 4{,}864$. Assume there are $j = 1, \ldots, w_i$ individuals in cluster $i$, with $y_{i,j} \sim N(y_i, \sigma_y^2)$, where $\sigma_y$ is the standard deviation of $y$ within each cluster, for which we use the unweighted standard error of $y$ in the NLAAS dataset. Our artificial population consisted of $N = \sum_{i=1}^{I} w_i = 35{,}705{,}416$ individuals. If $y$ is restricted to be positive, we use the log-normal distribution instead of the normal distribution. If $y$ is a binary variable other than gender, we use a logistic regression on age and gender to predict the outcomes in each cluster. The variable gender is predicted with a logistic regression model on age.

We applied a two-stage sampling design. First, we draw $q$ clusters by simple random sampling without replacement. Second, we draw $s$ cases within each cluster by simple random sampling without replacement. The simulated sample size is $n = q \times s$. The observation from cluster $i$ is assigned the weight $w_i^* = w_i / s$.

Given the sample design, we are interested in the following estimators:

• The sample estimator:
$$\bar{y}_w = \frac{\sum_{i=1}^{q} \sum_{j=1}^{s} w_i^* \, y_{i,j}}{\sum_{i=1}^{q} \sum_{j=1}^{s} w_i^*}$$

• The power-shrinkage estimator:
$$\bar{y}_w^{(p)} = \frac{\sum_{i=1}^{q} \sum_{j=1}^{s} (w_i^*)^{p} \, y_{i,j}}{\sum_{i=1}^{q} \sum_{j=1}^{s} (w_i^*)^{p}}$$

• The trimmed estimator:
$$\bar{y}_w^{(T)} = \frac{\sum_{i=1}^{q} \sum_{j=1}^{s} w_i^*(T) \, y_{i,j}}{\sum_{i=1}^{q} \sum_{j=1}^{s} w_i^*(T)}$$

[...] the variable agepluswgt is constructed as

$$\text{agepluswgt} = \text{age} + 0.001 \times w,$$

which has a correlation of 0.5726 with $w$. By investigating agepluswgt and survey weight $w$, we attempt to learn the performance of power-shrinkage and trimming for variables with high correlation with survey weight. Note that the correlation between variables and weights in NLAAS, $\rho$, is different from the correlation in our artificial population, which is denoted as $\hat{\rho}$ and is typically lower in magnitude because of the random noise we introduced within each cluster.

To investigate the relationship between the optimal power-shrinkage parameter $p^*$, trimming threshold $T^*$, simulated correlation $\hat{\rho}$ and sample size $n$, we fit the following regression models:

$$\log\left(\frac{p^*}{1 - p^*}\right) = \beta_0 + \beta_1 \log\left(\frac{|\hat{\rho}|}{1 - |\hat{\rho}|}\right) + \beta_2 \log(n) + \varepsilon \qquad (1)$$

$$\log\left(\frac{T^*}{1 - T^*}\right) = \gamma_0 + \gamma_1 \log\left(\frac{|\hat{\rho}|}{1 - |\hat{\rho}|}\right) + \gamma_2 \log(n) + \varepsilon \qquad (2)$$

where $p^*$ or $T^*$ is replaced by 0.99 if it equals 1, and by 0.01 if it equals 0.
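The transformation and fit behind models (1) and (2) can be sketched as follows, assuming NumPy and ordinary least squares. The paper does not state which fitting procedure or software was used, so the use of `np.linalg.lstsq` and the function names `clamped_logit` and `fit_logit_regression` are assumptions made for illustration. The sketch also assumes 0 < |rho_hat| < 1 so that the predictor is finite.

```python
import numpy as np


def clamped_logit(x):
    """Logit of an optimal parameter (p* or T*) after the boundary replacement.

    Values equal to 1 are replaced by 0.99 and values equal to 0 by 0.01,
    as described for models (1) and (2), so the logit stays finite.
    """
    x = np.asarray(x, dtype=float)
    x = np.where(x >= 1.0, 0.99, np.where(x <= 0.0, 0.01, x))
    return np.log(x / (1.0 - x))


def fit_logit_regression(opt_param, rho_hat, n):
    """Least-squares fit of logit(opt_param) on logit(|rho_hat|) and log(n).

    Returns the coefficients (intercept, slope on logit|rho_hat|, slope on
    log n). Ordinary least squares is an assumption of this sketch.
    Requires 0 < |rho_hat| < 1.
    """
    abs_rho = np.abs(np.asarray(rho_hat, dtype=float))
    design = np.column_stack([
        np.ones_like(abs_rho),               # intercept
        np.log(abs_rho / (1.0 - abs_rho)),   # logit of |rho_hat|
        np.log(np.asarray(n, dtype=float)),  # log sample size
    ])
    coef, *_ = np.linalg.lstsq(design, clamped_logit(opt_param), rcond=None)
    return coef
```

The same function serves both models: passing the simulated optimal p* values gives estimates of the beta coefficients in (1), and passing the optimal T* values gives the gamma coefficients in (2).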