A Pooled Two-Sample Median Test Based on Density Estimation, by Vadim Y. Bichutskiy

Journal of Modern Applied Statistical Methods, Volume 10, Issue 2, Article 28, 11-1-2011
November 2011, Vol. 10, No. 2, 692-698. Copyright © 2011 JMASM, Inc. 1538-9472/11/$95.00

Recommended Citation: Bichutskiy, Vadim Y. (2011) "A Pooled Two-Sample Median Test Based on Density Estimation," Journal of Modern Applied Statistical Methods: Vol. 10 : Iss. 2 , Article 28. DOI: 10.22237/jmasm/1320121620. Available at: http://digitalcommons.wayne.edu/jmasm/vol10/iss2/28

Emerging Scholars

Vadim Y. Bichutskiy
George Mason University, Fairfax, Virginia

A new method based on density estimation is proposed for comparing the medians of two independent samples. The test controls the probability of Type I error and is at least as powerful as methods widely used in statistical practice. The method can be implemented using existing libraries in R.

Key words: Sample median, two-sample hypothesis test, adaptive kernel density estimation.

Vadim Y. Bichutskiy is a Ph.D. student in the Department of Statistics. This work was completed when he was an M.S. student in the Department of Statistics and Biostatistics at California State University, East Bay (Hayward). Email him at: [email protected].

Introduction

Let $X_1, X_2, \ldots, X_n$ be iid having cdf $F$ and pdf $f$ with $F(\eta) = 1/2$, so that $\eta$ is the population median. Suppose $f$ is continuous at $\eta$ with $f(\eta) > 0$. Denote the sample median by $H$. It is known that $H$ is asymptotically normal with mean $\eta$ and variance $1/\{4 n f^2(\eta)\}$. Estimating the asymptotic standard error of the sample median therefore requires an estimate of the population density at the median. Besides being a challenging problem, density estimation was difficult to apply in practice prior to the computer revolution; due to this, several alternative methods for estimating the standard error of the sample median have been developed (Maritz & Jarrett, 1978; McKean & Schrader, 1984; Price & Bonett, 2001; Sheather & Maritz, 1983; Sheather, 1986).

Comparing medians based on two independent samples is a well-studied problem (see Wilcox & Charlin, 1986; Wilcox, 2005; Wilcox, 2006; Wilcox, 2010 also has a good discussion). The methods fall into two main categories. The first uses the bootstrap (Efron, 1979), and the second assumes the sample median or some other estimator of the population median is approximately normal and uses one of several methods for estimating the standard error of the sample median. Virtually all methods are very conservative, particularly for heavy-tailed populations.

A new two-sample test is proposed for comparing medians. When population shapes can be assumed to be the same, a pooled test statistic, analogous to a pooled two-sample Student's t statistic for comparing means, is derived. Computer-intensive Monte Carlo simulations in R (R Development Core Team, 2009) are used to study the properties of the test and compare it to other methods. The method offers several additional benefits to practitioners: (1) a parameter that controls the trade-off between making the test conservative and liberal, with a suitable value of the parameter producing a test with a nominal significance level; (2) the test is easy to implement in R using the quantreg (Koenker, 2009) library.

Methodology

Two-Sample Test Statistic for Difference in Medians

Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be two independent random samples of sizes $n$ and $m$ from populations with densities $f_x$, $f_y$ that are continuous at the medians $\eta_x$, $\eta_y$ with $f_x(\eta_x) > 0$, $f_y(\eta_y) > 0$, respectively. Denote the sample medians by $H_x$, $H_y$. The test hypotheses are:

$$H_0: \eta_x - \eta_y = \Delta \quad \text{vs.} \quad H_1: \eta_x - \eta_y \neq \Delta,$$

where $\Delta$ is a specified difference in medians, and is often 0.

For sufficiently large $n$ and $m$:

$$H_x \sim N\!\left(\eta_x,\ \frac{1}{4 n f_x^2(\eta_x)}\right), \qquad H_y \sim N\!\left(\eta_y,\ \frac{1}{4 m f_y^2(\eta_y)}\right),$$

$$H_x - H_y \sim N\!\left(\eta_x - \eta_y,\ \frac{1}{4}\left[\frac{1}{n f_x^2(\eta_x)} + \frac{1}{m f_y^2(\eta_y)}\right]\right),$$

$$\frac{(H_x - H_y) - (\eta_x - \eta_y)}{\frac{1}{2}\sqrt{\dfrac{1}{n f_x^2(\eta_x)} + \dfrac{1}{m f_y^2(\eta_y)}}} \sim N(0, 1).$$

Assuming the normal approximation holds when the standard error of the difference in medians is estimated, then under the null hypothesis, the V statistic is:

$$V = \frac{H_x - H_y - \Delta}{\frac{1}{2}\sqrt{\dfrac{1}{n \hat{f}_x^2(H_x)} + \dfrac{1}{m \hat{f}_y^2(H_y)}}} \sim N(0, 1),$$

where $\hat{f}_x(H_x)$ and $\hat{f}_y(H_y)$ are the respective population density estimates at the median. Further, if it is assumed that the two populations have the same shape, possibly with a difference in location, then $f_x(\eta_x) = f_y(\eta_y)$, and the density estimates can be pooled to obtain a pooled test statistic:

$$V_p = \frac{H_x - H_y - \Delta}{\dfrac{1}{2 \hat{f}_p(H)}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} \sim N(0, 1), \qquad (1)$$

where

$$\hat{f}_p(H) = \sqrt{\frac{n \hat{f}_x^2(H_x) + m \hat{f}_y^2(H_y)}{n + m}}$$

is the pooled estimate of the population density at the median.

Simulations

The software R was used to simulate the power of the pooled test statistic (1). Two cases were considered: (i) population shapes are assumed to be known, and (ii) population shapes are unknown. The assumption of known population shapes is analogous to the assumption of known population variances in the z-test for comparing the means of two normal populations, since the variance determines the shape of the normal distribution. The goal was to see how the test would perform for samples of moderate size from symmetric heavy-tailed populations. Parent populations investigated were Cauchy, Laplace and Student's t distributions with 2 and 3 degrees of freedom. In all settings, the parent populations were of the same shape, shifted under the alternative, and a two-sided test $H_0: \eta_x = \eta_y$ versus $H_1: \eta_x \neq \eta_y$ was performed.

Adaptive Kernel Density Estimation

When population shapes are unknown, $f_x(\eta_x)$ and $f_y(\eta_y)$ are estimated with $\hat{f}_x(H_x)$ and $\hat{f}_y(H_y)$, respectively, using adaptive kernel density estimation (AKDE).

Let $X_1, X_2, \ldots, X_n \in \mathbb{R}^d$ be a sample from unknown density $f$. The AKDE is a three-step procedure:

1. Find a pilot estimate $\tilde{f}(X)$ that satisfies $\tilde{f}(X_i) > 0$, $i = 1, 2, \ldots, n$.

2. Define local bandwidth factors $\lambda_i = \{\tilde{f}(X_i)/g\}^{-\gamma}$, where $g$ is the geometric mean of the $\tilde{f}(X_i)$ and $0 \le \gamma \le 1$ is the sensitivity parameter.

3. The adaptive kernel estimate is defined by

$$\hat{f}(X) = n^{-1} \sum_{i=1}^{n} h^{-d} \lambda_i^{-d} K\{h^{-1} \lambda_i^{-1} (X - X_i)\},$$

where $K(\cdot)$ is a kernel function and $h$ is the bandwidth.

The AKDE method varies the bandwidth among data points and is better suited for heavy-tailed populations than ordinary KDE (Silverman, 1998, pp. 100-110). Intuitively, the AKDE is based on the idea that for heavy-tailed populations a larger bandwidth is needed for data points in the tails of the distribution (i.e., for outliers). In R, the function akj in the quantreg library implements AKDE. Obtaining the pilot estimate requires the use of another density estimation method, such as ordinary KDE. The general view in the literature is that AKDE is fairly robust to the method used for the pilot estimate (Silverman, 1998) and that the choice of the sensitivity parameter $\gamma$ is more critical. When using AKDE with a Gaussian kernel, if the parent population has tails close to normal then $\gamma < 0.5$ should be used; if the parent population is heavy-tailed then $\gamma > 0.5$ should be used. Thus, $\gamma = 0.5$ is a good default choice and has been shown to reduce bias (Abramson, 1982).

Results

Case 1: Known Population Shapes

Figure 1 shows the power curves for the pooled test when population shapes are assumed to be known at the 5% level of significance. Each point on the curves is based on 10,000 ...

... (Efron & Tibshirani, 1993, p. 221); and (iv) permutation test. Figure 3 shows the receiver operating characteristic (ROC) curves for a balanced design with n = m = 30. The parent populations were of the same shape in each case and the difference in population medians was set to 1. For the bootstrap and the permutation test, the difference in medians was used as the metric. Each point on the curves is based on 10,000 simulated samples.

Conclusion

Tests for comparing medians tend to be very conservative. The proposed test is able to control the probability of Type I error. It is as powerful as the permutation test and the bootstrap and is more powerful than the Mann-Whitney-Wilcoxon (MWW) test for heavy-tailed populations. The more heavy-tailed the parent population, the greater the power advantage of the proposed test over the MWW test; when the parent population is light-tailed, the MWW test is more powerful than the proposed test.

A key precept of the method is that AKDE provides a better estimate of the population density at the median, especially for heavy-tailed populations, than ordinary KDE. As expected, using ordinary KDE makes the test very conservative: the Type I error rate can be as low as 0.02 at the 5% significance level.

These experiments show that the sensitivity parameter $\gamma$ in AKDE controls the trade-off between making the test conservative and liberal, with a suitable value of $\gamma$ producing a test with a nominal significance level.
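As an illustration of the pooled statistic (1) from the Methodology section, the following Python sketch computes V_p and a two-sided normal p-value from two samples and two density estimates supplied by the caller. The function name `pooled_median_test` and its arguments are hypothetical, and the pooling rule (a sample-size-weighted root mean square of the two estimates) is an assumption about how the pooled density is formed; the paper's own computations were done in R.

```python
import math
from statistics import NormalDist, median


def pooled_median_test(x, y, f_x_hat, f_y_hat, delta=0.0):
    """Return (V_p, two-sided p-value) for H0: eta_x - eta_y = delta.

    f_x_hat and f_y_hat are density estimates at the two sample medians;
    how they are obtained (known shape, KDE, AKDE) is left to the caller.
    """
    n, m = len(x), len(y)
    # Assumed pooling rule: weighted root mean square of the two estimates.
    f_p = math.sqrt((n * f_x_hat ** 2 + m * f_y_hat ** 2) / (n + m))
    # Estimated standard error of H_x - H_y under the equal-shape assumption.
    se = math.sqrt(1.0 / n + 1.0 / m) / (2.0 * f_p)
    v_p = (median(x) - median(y) - delta) / se
    # Two-sided p-value from the standard normal reference distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(v_p)))
    return v_p, p_value


# Toy data; 0.5 is the standard Laplace density at its median (known-shape case).
x = [0.3, -1.2, 2.5, 0.9, 1.7]
y = [-0.4, 0.1, -2.2, 0.6, -0.9]
v_p, p = pooled_median_test(x, y, f_x_hat=0.5, f_y_hat=0.5)
print(f"V_p = {v_p:.3f}, p = {p:.3f}")
```

Passing the density estimates in as arguments keeps the statistic separate from the estimation method, so the same function can serve both the known-shape and unknown-shape cases.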
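The three AKDE steps can be sketched for the one-dimensional case (d = 1) with a Gaussian kernel. This is a minimal illustration and not the quantreg implementation: the helper names `fixed_kde` and `akde_at` are hypothetical, the pilot estimate is an ordinary fixed-bandwidth KDE, and the default bandwidth is a normal-reference rule of thumb chosen here only for convenience.

```python
import math
import random
from statistics import median, stdev


def gauss(u):
    """Standard normal kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def fixed_kde(t, data, h):
    """Ordinary (fixed-bandwidth) KDE at point t; used as the pilot estimate."""
    return sum(gauss((t - xi) / h) for xi in data) / (len(data) * h)


def akde_at(t, data, gamma=0.5, h=None):
    """Adaptive kernel density estimate at point t for 1-D data."""
    n = len(data)
    if h is None:
        # Assumed default: normal-reference rule-of-thumb bandwidth.
        h = 1.06 * stdev(data) * n ** (-0.2)
    # Step 1: pilot estimate at each data point (positive for a Gaussian kernel).
    pilot = [fixed_kde(xi, data, h) for xi in data]
    # Step 2: local bandwidth factors lambda_i = (pilot_i / g) ** (-gamma),
    # where g is the geometric mean of the pilot values.
    g = math.exp(sum(math.log(p) for p in pilot) / n)
    lam = [(p / g) ** (-gamma) for p in pilot]
    # Step 3: kernel sum with point-specific bandwidth h * lambda_i.
    return sum(gauss((t - xi) / (h * li)) / (h * li)
               for xi, li in zip(data, lam)) / n


# Example: estimate the density at the sample median of simulated Laplace data.
rng = random.Random(0)
sample = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(200)]
print("AKDE at sample median:", akde_at(median(sample), sample, gamma=0.5))
```

Setting `gamma=0` makes every local factor equal to 1, so the estimate reduces to the fixed-bandwidth pilot; larger values of `gamma` widen the kernels placed on points in low-density regions.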
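The known-shape simulation design can be imitated on a small scale. The sketch below estimates the rejection rate of the two-sided pooled test for two standard Laplace populations, using the known density at the median (0.5) in place of an estimate. The function names, the number of replications, and the seed are illustrative assumptions, and the output is not intended to reproduce the figures reported in the paper.

```python
import math
import random
from statistics import NormalDist, median


def laplace_sample(rng, size, loc=0.0):
    """Draw from a Laplace(loc, scale=1) population.

    The difference of two independent unit exponentials is standard Laplace.
    """
    return [loc + rng.expovariate(1.0) - rng.expovariate(1.0)
            for _ in range(size)]


def rejection_rate(n, m, shift, reps=2000, alpha=0.05, seed=1):
    """Monte Carlo rejection rate of the pooled test with known shape."""
    rng = random.Random(seed)
    f_at_median = 0.5  # standard Laplace density at its median
    se = math.sqrt(1.0 / n + 1.0 / m) / (2.0 * f_at_median)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(reps):
        x = laplace_sample(rng, n, loc=shift)  # shifted under the alternative
        y = laplace_sample(rng, m)
        v_p = (median(x) - median(y)) / se     # H0: difference in medians is 0
        if abs(v_p) > z_crit:
            rejections += 1
    return rejections / reps


print("estimated Type I error:", rejection_rate(30, 30, shift=0.0))
print("estimated power at shift 1:", rejection_rate(30, 30, shift=1.0))
```

With `shift=0.0` the rejection rate estimates the Type I error; sweeping `shift` over a grid would trace out a power curve for this one parent population.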
