
Open Journal of Statistics, 2012, 2, 281-296 http://dx.doi.org/10.4236/ojs.2012.23034 Published Online July 2012 (http://www.SciRP.org/journal/ojs)

Subsampling Method for Robust Estimation of Regression Models

Min Tsao, Xiao Ling
Department of Mathematics and Statistics, University of Victoria, Victoria, Canada
Email: [email protected]

Received March 29, 2012; revised April 30, 2012; accepted May 10, 2012

ABSTRACT

We propose a subsampling method for robust estimation of regression models which is built on classical methods such as the least squares method. It makes use of the non-robust nature of the underlying classical method to find a good sample from regression data contaminated with outliers, and then applies the classical method to the good sample to produce robust estimates of the regression model parameters. The subsampling method is a computational method rooted in the bootstrap methodology which trades analytical treatment for intensive computation; it finds the good sample through repeated fitting of the regression model to many random subsamples of the contaminated data instead of through an analytical treatment of the outliers. The subsampling method can be applied to all regression models for which non-robust classical methods are available. In the present paper, we focus on the basic formulation and robustness property of the subsampling method that are valid for all regression models. We also discuss variations of the method and apply it to three examples involving three different regression models.

Keywords: Subsampling Algorithm; Robust Regression; Outliers; Bootstrap; Goodness-of-Fit

1. Introduction

Robust estimation and inference for regression models is an important problem with a long history in statistics. Earlier work on this problem is discussed in [1] and [2]. The first book focusing on robust regression is [3], which gives a thorough coverage of robust regression methods developed prior to 1987. There have been many new developments in the last two decades. Reference [4] provides a good coverage of many recent robust regression methods. Although there are now different robust methods for various regression models, most existing methods involve a quantitative measure of the outlyingness of individual observations which is used to formulate robust estimators. That is, contributions from individual observations to the estimators are weighted depending on their degrees of outlyingness. This weighting by outlyingness is done either explicitly as in, for example, the GM-estimators of [5], or implicitly as in the MM-estimators of [6] through the use of ψ functions.

In this paper, we introduce an alternative method for robust regression which does not involve any explicit or implicit notion of outlyingness of individual observations. Our alternative method focuses instead on the presence or absence of outliers in a subset (subsample) of a sample, which does not require a quantitative characterisation of outlyingness of individual observations, and attempts to identify the subsamples which are free of outliers. Our method makes use of standard non-robust classical regression methods for both identifying the outlier-free subsamples and then estimating the regression model with the outlier-free subsamples. Specifically, suppose we have a sample consisting of mostly "good data points" from an ideal regression model and some outliers which are not generated by the ideal model, and we wish to estimate the ideal model. The basic idea of our method is to consider subsamples taken without replacement from the contaminated sample and to identify, among possibly many subsamples, "good subsamples" which contain only good data points. Then estimate the ideal regression model using only the good subsamples through a simple classical method. The identification of good subsamples is accomplished through fitting the model to many subsamples with the classical method, and then using a criterion, typically a goodness-of-fit measure that is sensitive to the presence of outliers, to determine whether the subsamples contain outliers. We will refer to this method as the subsampling method. The subsampling method has three attractive aspects: 1) it is based on elements of classical methods, and as such it can be readily constructed to handle all regression models for which non-robust classical methods are available; 2) under certain conditions, it provides unbiased estimators for the ideal regression model parameters; and 3) it is

easy to implement as it does not involve the potentially difficult task of formulating the outlyingness of individual observations and their weighting.

Point 3) above is particularly interesting as evaluating the outlyingness of individual observations is traditionally at the heart of robust methods, yet in the regression context this task can be particularly difficult. To further illustrate this point, denote by (X_i, Y_i) an observation where Y_i ∈ R^1 is the response and X_i ∈ R^q is the corresponding covariates vector. The outlyingness of observation (X_i, Y_i) here is with respect to the underlying regression model, not with respect to a fixed point in R^(q+1) as is the case in the location problem. It may be an outlier due to the outlyingness in either X_i or Y_i or both. In simple regression models where the underlying models have nice geometric representations, such as the linear or multiple linear regression models, the outlyingness of an (X_i, Y_i) may be characterized by extending measures of outlyingness for the location problem through, for example, the residuals. But in cases where Y_i is discrete, such as a binary, Poisson or multinomial response, the geometry of the underlying models is complicated and the outlyingness of (X_i, Y_i) may be difficult to formulate. With the subsampling method, we avoid the need to formulate the outlyingness of individual observations and instead focus on the consequence of outliers, that is, they typically lead to a poor fit. We take advantage of this observation to remove the outliers and hence achieve robust estimation of regression models.

It should be noted that traditionally the notion of an "outlier" is often associated with some underlying measure of outlyingness of individual observations. In the present paper, however, by "outliers" we simply mean data points that are not generated by the ideal model and will lead to a poor fit. Consequently, the removal of outliers is based on the quality of fit of subsamples, not a measure of outlyingness of individual points.

The rest of the paper is organized as follows. In Section 2, we set up notation and introduce the subsampling method. In Section 3, we discuss asymptotic and robustness properties of the subsampling estimator under general conditions not tied to a specific regression model. We then apply the subsampling method to three examples involving three different regression models in Section 4. In Section 5, we discuss variations of the subsampling method which may improve the efficiency and reliability of the method. We conclude with a few remarks in Section 6. Proofs are given in the Appendix.

2. The Subsampling Method

To set up notation, let z_i = (x_i, y_i) be a realization of a random vector Z_i = (X_i, Y_i) satisfying the regression model

E(Y_i) = g(x_i, β),  (1)

where Y_i ∈ R^1 is the response variable, X_i ∈ R^q is the corresponding covariates vector, g(x_i, β): R^q → R^1 is the regression function and β ∈ R^p is the regression parameter vector. To accommodate different regression models, the distributions of Y_i and X_i are left unspecified here. They are also not needed in our subsequent discussions.

Denote by S_N = {z_1, z_2, ..., z_N} a contaminated sample of N observations containing n "good data" generated by model (1) and m "bad data" which are outliers not from the model. Here n and m are unknown integers that add up to N. Let S_n and S_m be the (unknown) partition of S_N such that S_n contains the n good data and S_m contains the m bad data. To achieve robust estimation of β with S_N, the subsampling method first constructs a sample S_g to estimate the unknown good data set S_n, and then applies a (non-robust) classical estimator to S_g. The resulting estimator for β will be referred to as the subsampling estimator or SUE for β. Clearly, a reliable and efficient S_g which captures a high percentage of the good data points in S_n but none of the bad points in S_m is the key to the robustness and the efficiency of the SUE. The subsampling algorithm that we develop below is aimed at generating a reliable and efficient S_g.

2.1. The Subsampling Algorithm

Let A be a random sample of size n_s taken without replacement from S_N, which we will refer to as a subsample of S_N. The key idea of the subsampling method is to construct the estimator S_g for S_n by using a sequence of subsamples from S_N.

To fix the idea, for some n_s ≤ n, let A_1, A_2, A_3, ... be an infinite sequence of independent random subsamples each of size n_s. Let A*_1, A*_2, A*_3, ... be the subsequence of good subsamples, that is, subsamples which do not contain any outliers. Each of these sequences contains only a finite number of distinct subsamples. We choose to work with a repeating but infinite sequence instead of a finite sequence of distinct subsamples as that finite number may be very large, and the infinite sequence set-up provides the most convenient theoretical framework, as we will see below. Consider using the partial union

B_j = ∪_{i=1}^{j} A*_i  (2)

to estimate the good data set S_n. Clearly, B_j is a subset of S_n. The following theorem gives the consistency of B_j as an estimator for S_n.

Theorem 1. With probability one, {A*_1, A*_2, A*_3, ...} has infinitely many elements and


PB ==1.Sn (3)   AA,,, A*  and forms Sg as follows. 12    Proof: See the Appendix.  * ALGORITHM SAL ( nrks ,,)—Subsampling al- To measure the efficiency of B j as an estimator for S , let W be the number of points in B . Since gorithm based on  and  n j j * For chosen ( , ) and nrks ,, where nns  : BSj  n , it is an efficient estimator of Sn if the ratio Step 1: Randomly draw a subsample A1 of size ns Wnj is close to one. But for any finite j , Wj is a . Hence we use the expectation of the from data set SN . Step 2: Using method  , fit the regression model (1) ratio, which we denote by EBF j , to measure the to the subsample A1 obtained in Step 1 and compute the efficiency of B j . We have corresponding goodness-of-fit score 1 . EWj Step 3: Repeat Steps 1 and 2 for k times. Each time EBFj=,j=1,2,, record A , , the subsample taken and the associated n  j j  goodness-of-fit score at the jth repeat, for where EWj is the expected number of good data jk=1,2, , . points picked up by Bj . The following theorem gives a Step 4: Sort the k subsamples by their associated  simple expression of EB in terms of n and n . F  j  s values; denote by  12,,,  k  the ordered values of Theorem 2 The efficiency of B in recovering good j  j where ii 1 , and denote by A12,,,AA   k  data points is the correspondingly ordered subsamples. This puts the EW j subsamples in the order of least likely to contain outliers j nn s EBFj==1 ,=1,2,j  (4) to most likely to contain outliers according to the  nn  criterion. * Proof: See the Appendix. Step 5: Form Sg using the r subsamples with the smallest  values, that is Theorem 2 indicates that EBF j converges to 1 as j goes to infinity. The convergence is very fast when r* nn n is not close to 1, that is, the subsample size s SAg =. ()i (5) i=1 ns is not too small relative to n . Hence for properly chosen ns and j , Bj is an excellent estimator for We will refer to Sg as the combined sample. It esti- Sn . Nevertheless, in practice while we can generate the mates S using the pseudo-good subsamples sequence AAA,,, easily, the subsequence of n 123    *** good subsamples AA,,,A and hence B are not AA12,,,    A*  which is an approximate version of 123 j r available as we do not have the to determine A**,,,AA *. Formally, we now define the subsam-  12 r*  whether a subsample Ai is a good subsample or a bad pling estimator SUE for β as the estimator generated subsample containing one or more outliers. To deal with by applying method  to the combined sample Sg . *  this, for a fixed r  , the subsampling algorithm Although the method applied to Sg to generate the SUE described below finds a sequence of r* pseudo-good sub- does not have to be the same as the method used in the

 algorithm to generate Sg , for convenience we will use samples AA,,, A* , and takes their union to 12  r the same method in both places. The SUE does not have an analytic expression. It is an implicit function of form the estimator SSg for n . * nrks ,,,,  and SN , and as such we may write SUE Denote by  a classical method for regression * analysis, for example, the method of least squares. De- ( nrs ,,,,,k SN ) when we need to highlight the in- note by  an associated quantitative goodness-of-fit gredients. It should be noted that while B , the union of criterion, such as the mean squared error, AIC or BIC, r* A**,,,AA *, may be viewed as a random sample taken which may be sensitive to the presence of outliers on a 12 r* relative basis; that is, given two samples of the same size, without replacement from Sn , the combined sample Sg may not be viewed as such even when it contains no one contains outliers and another does not, upon fitting * regression model (1) with method  to the two sam- outliers. This is because the r pseudo-good subsam- ples,  can be used to effectively identify the one that   ples AA,,, A*  are not independent due to the contains outliers. Denote by  the numerical score 12   r given by the criterion  upon fitting the model (1) to a dependence in their  -scores, which are the smallest r* subsample A using method  , and suppose that a order statistics of the k  -scores. Consequently, Sg small  value means a good fit of model (1) to A . The is not a simple random sample. This complicates the subsampling algorithm finds pseudo-good subsamples distributional theory of the SUE but fortunately it has
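Because the algorithm only requires a fitting routine and a goodness-of-fit score, it is easy to prototype. The following is a minimal R sketch of SAL(n_s, r*, k) for a linear model, assuming least squares (lm) as the method η and the MSE as the γ score; the function name sal_lm and its interface are our own illustration, not code from the paper.

```r
# Minimal sketch of SAL(n_s, r*, k) for a linear model.
# Assumes least squares (lm) as the classical method eta and the MSE as the
# goodness-of-fit score gamma; 'formula' and 'data' are supplied by the user.
sal_lm <- function(formula, data, ns, r_star, k) {
  N <- nrow(data)
  scores  <- numeric(k)
  indices <- vector("list", k)
  for (j in 1:k) {
    idx <- sample(N, ns)                       # Step 1: subsample without replacement
    fit <- lm(formula, data = data[idx, ])     # Step 2: fit the model ...
    p   <- length(coef(fit))
    scores[j]    <- sum(residuals(fit)^2) / (ns - p)  # ... and record the gamma score
    indices[[j]] <- idx
  }                                            # Step 3: repeated k times
  ord <- order(scores)                         # Step 4: order subsamples by gamma
  good_idx <- sort(unique(unlist(indices[ord[1:r_star]])))   # Step 5: combine best r*
  list(sue = lm(formula, data = data[good_idx, ]),  # SUE: refit on the combined sample
       combined_index = good_idx)
}
```

For the contaminated sample of Example 1 below, a call such as sal_lm(y ~ x, data, ns = 17, r_star = 2, k = 200) would typically recover at least 17 of the 18 good points.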

We conclude this subsection with an example typical of situations where we expect the algorithm to be successful in finding a good combined sample S_g for estimating S_n.

Example 1: Consider the simple linear model

y = 3 + 5x + ε,  ε ~ N(0, σ² = 4).  (6)

We generated n = 18 observations from model (6) and then added m = 2 outliers to form a data set of size N = n + m = 20. See Figure 1(a) for the scatter plot of this data set. To recover the 18 good data points, consider using the subsampling algorithm with the method of least squares as η and the MSE (SS_residual/(n_s − 2)) as the γ criterion. For n_s = 17, there are 1140 such subsamples, 18 of which contain no outliers. We fitted the simple linear model using the method of least squares to all 1140 subsamples and computed the MSE for each fit. Figure 1(b) shows the plot of the 1140 MSEs versus the corresponding subsample labels. The MSEs of the 18 good subsamples are at the bottom, and they are substantially smaller than those of the other subsamples. If we are to run SAL(n_s = 17, r* = 2, k) for a suitably large k, the resulting combined sample S_g is likely the union of two good subsamples, recovering at least 17 of the 18 good data points.

For this example, the total number of distinct subsamples of size n_s = 17 is only 1140, which is not large. Instead of randomly generating subsamples in Steps 1-3 of SAL(n_s, r*, k), we can modify it to go through all 1140 distinct subsamples. The combined sample S_g given by such a modified SAL(n_s = 17, r* = 2, k = 1140) will recover all 18 good data points in S_n. Consequently, the SUE reduces to the least squares estimator based on S_n and hence existing theory for the latter applies to the SUE.

Figure 1. (a) Scatter plot of a contaminated sample of 20 from model (6) and the true regression line; (b) MSE from a least squares fit of a simple linear model to a subsample versus its label.

2.2. Parameter Selection for the Subsampling Algorithm

In real applications, the number of distinct subsamples of size n_s may be too large for going through all of them to be feasible. Thus we need to use random sampling in Steps 1-3 of SAL(n_s, r*, k). The parameter selection discussed below is for SAL(n_s, r*, k) with random sampling only. Also, the discussion below is independent of the underlying regression model. Thus we will not deal with the selection of η and γ (which is tied to the model) here but will address this for some regression models in Section 4 when we discuss specific applications.

The objective of SAL(n_s, r*, k) is to generate a combined sample S_g to estimate the good data set S_n. To this end, we want to carefully select the parameter values to improve the chances that 1) the r* pseudo-good subsamples Ã*_1, Ã*_2, ..., Ã*_{r*} are indeed good subsamples from S_n, and 2) S_g is efficient, that is, it captures a high percentage of points in S_n. The parameter selection strategy below centres around meeting these conditions.

1) Selecting n_s—The subsample size

Proper selection of n_s is crucial for meeting condition 1). The first rule for selecting n_s is that it must satisfy m < n_s ≤ n. Under this condition, the A_1, A_2, ..., A_k generated by Steps 1-3 are either good subsamples containing no outliers or subsamples containing a mix of good data points and outliers. The γ-score is the most effective in identifying good subsamples from such a sequence, as they are the ones with small γ-scores. If n_s > n, there will not be any good subsamples in the sequence. If n_s < m, there could be bad subsamples consisting entirely of outliers. When the outliers follow the same model as the good data but with a different β, the γ-scores for bad subsamples of only outliers may also be very small. This could cause the subsampling algorithm to mistake such bad subsamples for good subsamples, resulting in violations of condition 1). In real applications, values of m and n may be unknown but one may have estimates of these which can be used to select n_s. In the absence of such estimates, we recommend a default value of "half plus one", i.e. n_s = int[0.5N] + 1, where int[x] is the integer part of x. This default choice is based on the assumption that m < n, which is standard in many robustness studies. Under this assumption and without further information, the only choice of n_s that is guaranteed to satisfy m < n_s ≤ n is the default value.


nNs = int 0.5  1 where int[x] is the integer part of x . for this probability. We now consider selecting k with * * This default choice is based on the assumption that given ns , r and p . m< n which is standard in many robustness studies. Let pg be the probability that A , a random sub-

Under this assumption and without further information, sample of size ns from SN , is a good subsample con- the only choice of n that is guaranteed to satisfy s taining no outliers. Under the condition that nns  , we mn< s  n is the default value. have Note that for each application, ns must be above a certain minimum, i.e., np1 . For example, if we let n m s  n =2 when model (1) is a simple linear model where n 0 s p =>0.s  (9) p =2, we will get trivially perfect fit for all subsamples. g nm In this case, the  -score is a constant and hence cannot  n be used to differentiate good subsamples from the bad s ones. Now let T be the total number of good subsamples * 2) Selecting r —The number of subsamples to be in A12,,,AA k  . Then T is a binomial random combined in Step 5 variable with TBinkp , g . Let pki,= PT  i . *   The selection of r is tied to condition (2). For Then simplicity, here we ignore the difference between the pki, = P at least i good subsamples in k      pseudo-good subsamples AA,,,A identified 12  * i1 (10) r k j kj =1ppgg 1 . by the subsampling algorithm and the real good sub- j=0j samples A**,,,AA *. This allows us to assume that * 12r * Since the value of r has been determined in step [b] above, pkr, * is only a function of k and the S is the same as a B * and use the efficiency measure   g r * of B for S . By (4), desired k value is determined by p through r* g r* kplrp=argmin ,** =0.99 . (11) nn s l     ESFg==EBF* 1  . (7) r n In real applications where m and hence n are un- known, we choose a working proportion   0.0.5 , For a given (desired) value of the efficiency ESF  g  , 0   we find the r* value needed to achieve this by solving representing either our estimate of the proportion of (7) for r* ; in view of that r* must be an integer, we outliers or the maximum proportion we will consider. have Then use mN=int 0  to determine k . We will use 0.1 as the default value for 0 which represents a  moderate level of contamination. log 1 ESFg  r* =int1. (8) * Finally, k is a function of r and pg , both of which log nns log n  are functions of ns . Thus k is ultimately a function of only n . We may link the selection of all three para- When n is the default value of int 0.5N  1 , the s s   meters together by looking for the optimal n which maximum r* required to achieve a 99% efficiency s will minimize k subject to the (estimated) constraint ( ESFg= 99% ) is 7. This maximum is reached when mn< s  n, where m and n are estimates based on nN= .  . Such an optimal n can be found by simply com- In simulation studies, when r* is chosen by (8), the 0 s puting the k values corresponding to all ns satisfying observed efficiency, as measured by the actual number of the constraint. We caution, however, that this optimal n points in S divided by n , tends to be lower than the s g may be smaller than 0.5N  1 . As such, it may not expected value used in (8) to find r* . This is likely the satisfy the real but unknown constraint of mn<  n. consequence of the dependence among the pseudo-good s subsamples. We will further comment on this depen- To give examples of parameter values determined dence in the next section. 
through the above three steps and to prepare for the 3) Selecting k —The total number of subsamples applications of the subsampling algorithm in subsequent * to be generated examples, we include in Table 1 the r values and k values required in order to achieve an efficiency of For a finite k , the sequence A12,,,AA k gene- * rated by Steps 1-3 may or may not contain r good ESFg  = 99% for various combinations of m and n * subsamples. But we can set k sufficiently large so that values. To compute this table, both p and ns are set the probability of having at least r* good subsamples, to their default values of 0.99 and 0.5N  1, respec- * * p , is high. We will use p =0.99 as the default value tively.
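The R program mentioned above is not reproduced in the paper; the following is our own sketch of the calculation implied by (8)-(11), with default arguments matching the stated defaults.

```r
# Sketch of the parameter calculation in (8)-(11): returns ns, r*, p_g and k
# for sample size N, working outlier proportion alpha0, target efficiency EF
# and probability p* of obtaining at least r* good subsamples.
sal_parameters <- function(N, alpha0 = 0.1, EF = 0.99, p_star = 0.99,
                           ns = floor(0.5 * N) + 1) {
  m <- floor(alpha0 * N)                  # assumed number of outliers
  n <- N - m                              # assumed number of good points
  stopifnot(m < ns, ns <= n)
  # Equation (8): smallest integer r* with 1 - ((n - ns)/n)^r* >= EF
  r_star <- floor(log(1 - EF) / (log(n - ns) - log(n))) + 1
  # Equation (9): probability that a random subsample of size ns is good
  p_g <- choose(n, ns) / choose(N, ns)
  # Equations (10)-(11): smallest k with P(at least r* good subsamples) >= p*
  k <- r_star
  while (pbinom(r_star - 1, size = k, prob = p_g) > 1 - p_star) k <- k + 1
  list(ns = ns, r_star = r_star, p_g = p_g, k = k)
}
# Example: sal_parameters(60, alpha0 = 0.2) gives ns = 31, r* = 5 and a value
# of k close to the n = 48, m = 12 row of Table 1.
```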


Table 1. The number of subsamples k and the number of good subsamples r* required to achieve an efficiency of EF(S_g) = 99%. Exact values of m and n and default values of p* and n_s were used for computing these k and r* values.

Sample size N    Number of good data n    Number of outliers m    Subsample size n_s    r*        Number of subsamples k
N = 20           n = 20                   m = 0                   n_s = 11              r* = 6    k = 6
                 n = 18                   m = 2                   n_s = 11              r* = 5    k = 58
                 n = 16                   m = 4                   n_s = 11              r* = 4    k = 383
N = 60           n = 60                   m = 0                   n_s = 31              r* = 7    k = 7
                 n = 54                   m = 6                   n_s = 31              r* = 6    k = 1378
                 n = 48                   m = 12                  n_s = 31              r* = 5    k = 312,912

The computationally intensive nature of SAL(n_s, r*, k) can be seen in the case given by (N, m, n) = (60, 12, 48), where to achieve an efficiency of 99% we need to fit the model in question to 312,912 subsamples of size 31 each. This could be a problem if it is time consuming to fit the model to each subsample. To visualize the relationship among various parameters of the algorithm, for the case of (m, n) = (12, 48), Figure 2 shows the theoretical efficiency of the algorithm EF(S_g) versus r*. The dashed black line in the plot represents the 99% efficiency line. With subsample size n_s = 31, we see that the efficiency curve rises rapidly as r* increases. At r* = 5, the efficiency reaches 99%. For this example, we also plotted the probability p_i = p(k, i) of having at least i good subsamples versus i when the total number of subsamples is set at k = 7000 in Figure 3(a). We see that at this k value, the probability of having at least r* = 5 (required for 99% efficiency) is only about 0.96. To ensure a high probability of having at least 5 good subsamples, we need to increase k. In Figure 3(b), we plotted the same curve but at k = 312,912 as was computed in Table 1. The probability of having at least 5 good subsamples is now at 0.99.

Figure 2. The efficiency of the subsampling algorithm SAL(n_s, r*, k) as a function of the number of good subsamples r* that form the combined sample S_g. The n_s is set at the default value. The dashed line is the 99% efficiency line.

Figure 3. (a) The probability (p_i) of having at least i good subsamples in k = 7000 subsamples; (b) The probability (p_i) of having at least i good subsamples in k = 312,912 subsamples. The horizontal black line is the p_i = 0.99 line.

To summarize, to select the parameters n_s, r* and k for SAL(n_s, r*, k), we need the following information: 1) a desired efficiency EF(S_g) (default = 0.99), 2) a working proportion of outliers α_0 (default = 0.1) and 3) a probability p* (default = 0.99) of having at least r* good subsamples in k random subsamples of size n_s. We have developed an R program which computes the values of r* and k for any combination of (N, n_s, EF(S_g), α_0, p*). The default value of this input vector is (N, int[0.5N] + 1, 99%, 0.1, 0.99). Note that the determination of the algorithm parameters does not depend on the actual model being fitted.

3. Asymptotic and Robustness Properties of the Subsampling Estimator

In this section, we discuss the asymptotic and robustness properties of the SUE under conditions not tied to a specific regression model.

3.1. The Asymptotic Distribution of the Subsampling Estimator

We first briefly discuss the asymptotic distribution of the SUE with respect to n_e, the size of the combined sample S_g. We will refer to n_e as the effective sample size. Although n_e is random, it is bounded between n_s and N. Under the default setting of n_s = int[0.5N] + 1, n_e will approach infinity if and only if N approaches infinity. Also, when the proportion of outliers and hence the ratio n/N are fixed, n_e will approach infinity if and only if n does. So asymptotics with respect to n_e is equivalent to that with respect to N or n.

Let β̂ be the SUE for β and consider its asymptotic distribution under the assumption that S_g may be viewed as a random sample from the good data set S_n. Since S_n is a random sample from the ideal model (1), under the assumption S_g is also a random sample from model (1). Hence β̂ is simply the (classical) estimator given by method η and the random sample S_g.


Its asymptotic distribution is then given by the existing asymptotic theory for method η. For example, for a linear model such as that in Example 1, the SUE β̂ generated by the least squares method will have the usual asymptotic normal distribution under the assumption. The asymptotic normal distribution can then be used to make inference about β. Thus in this case, there is no need for new asymptotic theory for the SUE.

In some cases, such as when it captures all the good data points, S_g may indeed be considered as a random sample from the good data. Nevertheless, as we have noted in Section 2.1, in general S_g is not a random sample due to a selection bias of the subsampling algorithm; the r* subsamples forming S_g are the subsamples which fit the model the best according to the γ criterion, and as such they tend to contain only those good data points which are close to the underlying regression curve. For the simple linear model, for example, good data points at the boundary of the good data band are more likely to be missed by S_g. Consequently, there may be less variation in S_g than in a typical random sample from the model. This may lead to an under-estimation of the model variance although the SUE for β may still be unbiased. The selection bias depends on the underlying model and the method η. Its impact on the asymptotic distribution of the SUE β̂ needs to be studied on a case by case basis.

3.2. The Breakdown Robustness and the Breakdown Probability Function of the Subsampling Estimator

While a unified asymptotic theory for the SUE is elusive due to its dependence on the underlying model (1), the breakdown properties of the SUE presented below do not depend on the model and thus apply to SUEs for all models.

Consider an SUE(n_s, r*, k, η, γ, S_N) where n_s, r*, k and N are fixed. Denote by α the proportion of outliers in S_N and hence α = m/N. Due to the non-robust nature of the classical method η, the SUE will break down if there are one or more outliers in S_g. Thus it may break down whenever α is not zero, as there is a chance that S_g contains one or more outliers. It follows that the traditional notion of a breakdown point, the threshold α value below which an estimator will not break down, cannot be applied to measure the robustness of the SUE. The breakdown robustness of the SUE is better characterized by the breakdown probability as a function of α. This breakdown probability function, denoted by BP(α), is just the probability of not having at least r* good subsamples in a random sequence of k subsamples of size n_s when the proportion of outliers is α. That is

BP(α) = P(fewer than r* good subsamples in k).  (12)

A trivial case arises when n_s > n. In this case, BP(α) = 1 regardless of the α value. For the case of interest where n_s ≤ n, we have from (9)

p_g(α) = C(N(1 − α), n_s) / C(N, n_s) > 0,  (13)

where m = Nα and n = N(1 − α). By (10), the breakdown probability function is

BP(α) = 1 − p(k, r*) = Σ_{j=0}^{r*−1} C(k, j) p_g(α)^j (1 − p_g(α))^{k−j}.  (14)

In deriving (14), we have assumed implicitly that the γ criterion will correctly identify good subsamples without outliers. This is reasonable in the context of discussing the breakdown of the SUE, as we can assume the outliers are arbitrarily large and hence any reasonable γ criterion would be accurate in separating good subsamples from bad subsamples.
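Equations (13) and (14) translate directly into a few lines of R; the function below is our own sketch, not code from the paper.

```r
# Breakdown probability BP(alpha) from (13)-(14): the probability of obtaining
# fewer than r* good subsamples among k random subsamples of size ns when the
# proportion of outliers is alpha.
breakdown_probability <- function(alpha, N, ns, r_star, k) {
  n <- round(N * (1 - alpha))               # number of good data points
  if (ns > n) return(1)                     # trivial case: no good subsample exists
  p_g <- choose(n, ns) / choose(N, ns)      # equation (13)
  pbinom(r_star - 1, size = k, prob = p_g)  # equation (14)
}
# Example: breakdown_probability(0.1, N = 60, ns = 31, r_star = 6, k = 1378)
# is about 0.01 by the construction of k in Table 1.
```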

The concept of the breakdown probability function can also be applied to traditional robust estimators. Let ε* be the breakdown point of some traditional robust estimator. Then its breakdown probability function is the following step function

BP(α) = 0 if α < ε*, and BP(α) = 1 if α ≥ ε*.  (15)

Figure 4 contains the breakdown probability functions of three SUEs for the case of N = 60. These SUEs are defined by the default values of EF(S_g), n_s and p* with working proportions of outliers of α_0 = 0.1, 0.2, 0.4, respectively. Their breakdown functions (in dotted lines) are identified by their working proportions. The breakdown function for the SUE with α_0 = 0.4 is uniformly smaller than the other two and hence this SUE is the most robust. This is not surprising as it is designed to handle 40% outliers. For comparison, the breakdown probability function of a hypothetical traditional robust estimator (non-SUE) with a breakdown point of 0.3 is also plotted (in solid line) in Figure 4. Here we see that the SUE and the traditional robust estimator are complementary to each other; whereas the latter will never break down so long as the proportion of outliers is less than its breakdown point but will for sure break down otherwise, an SUE has a small probability of breaking down even when the proportion is lower than what it is designed to handle, but this is compensated by the positive probability that it may not break down even when the proportion is higher. Incidentally, everything else being fixed, the k value associated with the SUE increases rapidly as the working proportion increases. The excellent robustness of the SUE for α_0 = 0.4, for example, comes at the price of a huge amount of computation.

Figure 4. Breakdown probability functions (BPs) of three SUEs and one non-SUE for a data set of size N = 60. BPs for the SUEs designed to handle α_0 = 10%, 20% and 40% outliers are indicated by their associated α_0 value.

The breakdown probability function may be applied to select parameter k for the subsampling algorithm. Recall that in Section 2.2, after n_s and r* are chosen and the working proportion of outliers α_0 is fixed, we find the value of k by using a predetermined p*, the probability of having at least r* good subsamples in k. In view of the breakdown probability function, this amounts to selecting k by requiring BP(α_0) = 1 − p*, which is a condition imposed on BP(α) at a single point α_0. An alternative way of selecting k is to impose a stronger condition on BP(α) over some interval of interest.

Note that everything else being equal, we can get a more robust SUE by using a larger k. For practical applications, however, we caution that a very large k will compromise both the computational efficiency of the subsampling algorithm and the efficiency of the combined sample S_g as an estimator of the good data set. The latter point is due to the fact that in practice, the subsamples forming S_g are not independent random samples from the good data set; in the extreme case where k goes to infinity, the subsample with the smallest γ-score will appear infinitely many times, and thus all r* subsamples in the union forming S_g are repeats of this same subsample. This leads to the lowest efficiency for S_g with EF(S_g) = n_s/n. Thus when selecting the k value, it is necessary to balance the robustness and the efficiency of the SUE.

To conclude Section 3, we note that although the selection bias problem associated with the combined sample S_g can make the asymptotic theory of the SUE difficult, it has little impact on the breakdown robustness of the SUE. This is due to the fact that to study the breakdown of the SUE, we are only concerned with whether S_g contains any outliers. As such, the size of S_g and the possible dependence structure of points within are irrelevant. Whereas in Section 3.1 we had to make the strong assumption that S_g is a random sample from the good data in order to borrow the asymptotic theory from the classical method η, here we do not need this assumption. Indeed, as we have pointed out after (14), the breakdown results in this section are valid under only a weak assumption, that is, the γ criterion employed is capable of separating good subsamples from subsamples containing one or more arbitrarily large outliers. Any reasonable γ should be able to do so.

4. Applications of the Subsampling Method

In this section, we apply the subsampling method to three real examples through which we demonstrate its usefulness and discuss issues related to its implementation. For the last example, we also include a small simulation study on the finite sample behaviour of the SUE.

An important issue concerning the implementation of the subsampling method which we have not considered in Section 2 is the selection of the classical method η and the goodness-of-fit criterion γ.

For linear regression and non-linear regression models, the least squares estimation (LSE) method and the mean squared error (MSE) are good choices for η and γ, respectively, as the LSE and MSE are very sensitive to outliers in the data. Outliers will lead to a poor fit by the LSE, resulting in a large MSE. Thus a small value of the MSE means a good fit. For logistic regression and other generalized linear models, the maximum likelihood estimation (MLE) method and the deviance (DV) can be used as η and γ, respectively. The MLE and DV are also sensitive to outliers. A good fit should have a small DV. If the ratio DV/(n_e − p) is much larger than 1, then it is not a good fit.

Another important issue is the proper selection of the working proportion of outliers or, equivalently, the (estimated) number of outliers m in the sample. This is needed to determine the r* and k to run the subsampling algorithm. Ideally, the selected m value should be slightly above the true number of outliers as this will lead to a robust and efficient SUE. If we have some information about the proportion of outliers in the sample, such as a tight upper bound, we can use this information to select m. In the absence of such information, we may use several values for m to compute the SUE and identify the most proper value for the data set in question. For m values above the true number of outliers, the SUE will give consistent estimates for the model parameters. Residual plots will also look consistent in terms of which points on the plots appear to be outliers. We now further illustrate these issues in the examples below.

Example 2: Linear model for stackloss data

The well-known stackloss data from Brownlee [7] has 21 observations on four variables concerning the operation of a plant for the oxidation of ammonia to nitric acid. The four variables are stackloss (y), air flow rate (x_1), cooling water inlet temperature (x_2) and acid concentration (x_3). We wish to fit a multiple linear regression model,

y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + ε,

to this data. We use the LSE and MSE for η and γ, respectively, in the SUE. We also try three m values, m = 2, 4 and 6, which represent roughly 10%, 20% and 30% working proportions of outliers in the data. The subsample size is chosen to be the default size of n_s = 11. The corresponding values for r* and k in the SAL and the estimates for the regression parameters are given in Table 2. For comparison, Table 2 also includes the estimates given by the LSE and MME, a robust estimator introduced in [6]. The residual versus fitted value plots for the LSE and SUE are in Figure 5. Since the regression parameter estimates for the SUE with m = 4 and m = 6 are consistent and the corresponding residual plots identify the same 4 outliers, m = 4 is the most reasonable choice.

Figure 5. Residual versus fitted value plots for Example 2: (a) LSE; (b) SUE with m = 2; (c) SUE with m = 4; (d) SUE with m = 6. The dashed lines are ±3σ̂, which are used to identify outliers.
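Example 2 can be reproduced approximately with the stackloss data set that ships with R, assuming it matches the data analysed here, together with the sal_lm sketch given after the algorithm in Section 2.1; the call below illustrates the m = 4 setting and is our own, hedged reconstruction rather than the authors' code.

```r
# Hedged sketch of the m = 4 setting of Example 2 using R's built-in stackloss
# data and the sal_lm sketch from Section 2.1.
set.seed(1)
fit <- sal_lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
              data = stackloss, ns = 11, r_star = 5, k = 327)
summary(fit$sue)                       # SUE coefficients from the combined sample
setdiff(1:21, fit$combined_index)      # observations left out: candidate outliers
```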


Table 2. Regression parameter estimates with standard errors in brackets for Example 2.

Parameter      LSE               MME              SUE (m = 2)         SUE (m = 4)         SUE (m = 6)
                                                  r* = 6, k = 57      r* = 5, k = 327     r* = 4, k = 2593
β_0            −39.92 (11.90)    −41.52 (5.30)    −38.59 (10.93)      −37.65 (4.73)       −36.72 (3.65)
β_1            0.72 (0.13)       0.94 (0.12)      0.77 (0.13)         0.80 (0.07)         0.84 (0.05)
β_2            1.30 (0.37)       0.58 (0.26)      1.09 (0.35)         0.58 (0.17)         0.45 (0.13)
β_3            −0.15 (0.16)      −0.11 (0.07)     −0.16 (0.14)        −0.07 (0.06)        −0.08 (0.05)
σ̂              3.24              1.91             2.97                1.25                0.93
Sample size    n = 21            n = 21           n_e = 20            n_e = 17            n_e = 15

The effective sample size for S_g when m = 4 is n_e = 17, and hence this S_g includes all the good data points in the data and the SUE is the most efficient. It is clear from Table 2 and Figure 5 that the LSE and the SUE with m = 2 fail to identify any outliers and their estimates are influenced by the outliers. The robust MME identifies two outliers, and its estimates for β_1, β_2 and β_3 are slightly different from those given by the SUE with m = 4. Since the MME is usually biased in the estimation of the intercept β_0, the estimate of β_0 from the MME is quite different. This data set has been analysed by many authors, for example, Andrews [8], Rousseeuw and Leroy [3] and Montgomery et al. [9]. Most of these authors concluded that there are four outliers in the data (observations 1, 3, 4 and 21), which is consistent with the result of the SUE with m = 4.

Note that the SUE is based on the combined sample which is a trimmed sample. A large m value assumes more outliers and leads to heavier trimming and hence a smaller combined sample. This is seen from the SUE with m = 6 where the effective sample size n_e is 15 instead of 17 for m = 4. Consequently, the resulting estimate for the variance is lower than that for m = 4. However, the estimates for the regression parameters under m = 4 and m = 6 are comparable, reflecting the fact that under certain conditions the SUEs associated with different parameter settings of the SAL algorithm are all unbiased.

Example 3: Logistic regression for coal miners data

Ashford [10] gives a data set concerning incidents of severe pneumoconiosis among 371 coal miners. The 371 miners are divided into 8 groups according to the years of exposure at the coal mine. The values of three variables, "years" of exposure (denoted by x) for each group, "total number" of miners in each group, and the number of "severe cases" of pneumoconiosis in each group, are given in the data set. The response variable of interest is the proportion of miners who have symptoms of severe pneumoconiosis (denoted by Y). The 8 group proportions of severe pneumoconiosis are plotted in Figure 6(a) with each circle representing one group. Since it is reasonable to assume that the corresponding number of severe cases for each group is a binomial random variable, on page 432 of [9] the authors considered a logistic regression model for Y, i.e.,

E(Y) = exp(β_0 + β_1 x) / (1 + exp(β_0 + β_1 x)).

To apply the subsampling method for logistic regression, we choose the MLE method and the deviance DV as η and γ, respectively. With N = 8 groups, we set m = 1 and 2 in the computation, and set the subsample size to n_s = 5. The corresponding values for r* and k are (r*, k) = (4, 23) and (r*, k) = (3, 76) for m = 1 and 2, respectively. The original data set has no apparent outliers. In order to demonstrate the robustness of the SUE, we created one outlier group by changing the number of severe cases for the 27.5 years of exposure group from the original 8 to 18. Consequently, the sample proportion of severe pneumoconiosis cases for this group has been changed from the initial 8/48 to 18/48. Outliers such as this can be caused, for example, by a typo in practice. The sample proportions with this one outlier are plotted in Figure 6(b). The regression parameter estimates from the various estimators are given in Table 3, where the M method is the robust estimator from [11]. The fitted lines given by the MLE, the SUE and the M method are also plotted in Figure 6. For both data sets, the SUE with m = 1 and 2 gives the same result. The SUE does not find any outliers for the original data set, while it correctly identifies the one outlier for the modified data set. For the original data set, the three methods give almost the same estimates for the parameters, and this is reflected by their fitted lines (of proportions), which are nearly the same as can be seen in Figure 6(a). For the modified data set, the SUE and the M method are robust, and their fitted lines are in Figure 6(b). The outlier has little or no influence on these fitted lines.


Figure 6. Sample proportions and fitted proportions for Example 3: (a) The original data set with no outlier group; (b) Modified data with one outlier group. Note that in (a) the fitted lines are all the same for the MLE, SUE and M methods, while in (b) they are different.
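For the grouped logistic regression of Example 3, the deviance reported by glm serves as the γ score. The sketch below is our own illustration (the function name and the assumed column names years, severe and total are not from the paper); it shows the score for one subsample of groups, while the rest of the SAL algorithm is unchanged.

```r
# Deviance-based gamma score for a subsample of groups in a grouped logistic
# regression; assumes columns 'severe', 'total' and 'years' in the subsample.
gamma_logistic <- function(sub) {
  fit <- glm(cbind(severe, total - severe) ~ years,
             family = binomial, data = sub)
  deviance(fit)                        # a small deviance indicates a good fit
}
# In the SAL algorithm one would draw subsamples of n_s = 5 of the 8 groups,
# score them with gamma_logistic(), and refit glm() on the union of the r*
# best subsamples to obtain the SUE.
```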

Table 3. Regression parameter estimates with standard errors in brackets for Example 3.

               Original data     With one outlier group
Parameter      SUE / MLE         MLE              SUE              M
β_0            −4.80 (0.57)      −4.07 (0.46)     −4.80 (0.59)     −5.24 (0.70)
β_1            0.09 (0.02)       0.08 (0.01)      0.09 (0.02)      0.10 (0.02)

Example 4: Non-linear model for enzymatic reaction data

To analyse an enzymatic reaction data set, one of the models that Bates and Watts [12] considered is the well-known Michaelis-Menten model, a non-linear model given by

y = θ_0 x / (θ_1 + x) + ε,  (16)

where θ_0 and θ_1 are regression parameters, and the error ε has variance σ². The data set has N = 12 observations of treated cases. The response variable y is the velocity of an enzymatic reaction and x is the substrate concentration in parts per million. To compute the SUE, we use the LSE method and the MSE for η and γ, respectively, and set m = 2 and n_s = 7, which lead to r* = 4 and k = 63. Figure 7(a) shows the scatter plot of y versus x and the fitted lines for the LSE (dotted) and SUE (solid), and Figure 7(b) shows the residual plot from the SUE fit. Since there is only one mild outlier in the data, the estimates from the LSE and SUE are similar and they are reported in Table 4.

We also use model (16) to conduct a small simulation study to examine the finite sample distribution of the SUE. We generate 1000 samples of size N = 12 where, for each sample, 10 observations are generated from the model with θ_0 = 215, θ_1 = 0.07 and a normally distributed error with mean 0 and σ = 8. The other 2 observations are outliers generated from the same model but with a different error distribution: a normal error distribution with mean 30 and σ = 1. The two outliers are y outliers, and Figure 7(c) shows a typical sample with different fitted lines for the LSE and SUE. The estimated parameter values are also reported in Table 4. For each sample, the SUE estimates are computed with n_s = 7, r* = 4 and k = 63. Figure 8 shows the histograms for θ̂_0, θ̂_1, σ̂ and the sample size of S_g. The dotted vertical lines are the true values of θ_0, θ_1, σ and n = 10. Table 5 shows the biases and standard errors for the LSE and SUE based on the simulation study. The distributions of the SUE estimators θ̂_0 and θ̂_1 are approximately normal, and the biases are much smaller than those of the LSE. That of the estimated variance also looks like a χ² distribution. The average effective sample size of S_g is 9.93, which is very close to the number of good data points n = 10. There are a small number of cases where the effective sample size is 12. These are likely cases where the "outliers" generated were mild or benign outliers and are thus included in the combined sample.


Figure 7. Fitted regression lines for Example 4: (a) The LSE and SUE lines for the real data set; (b) SUE fit residual plot for the real data set; (c) The LSE and SUE lines for the simulated data with outliers; (d) SUE fit residual plot for the simulated data. Dotted lines in plots (b) and (d) are the ±3σ̂ lines.

Figure 8. Histograms for the SUE: (a) Estimator θ̂_0; (b) Estimator θ̂_1; (c) Estimator σ̂; (d) The sample size of S_g. The vertical dotted lines are the true values of θ_0, θ_1, σ and n.
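One replication of the simulation in Example 4 can be sketched as follows; the substrate values and starting values are our own assumptions, and nls plays the role of the least squares method η inside the SAL algorithm.

```r
# One simulated sample from the set-up of Example 4 and the nls fit whose MSE
# is used as the gamma score; the x values are an assumed range, not the
# original design points.
set.seed(2)
x <- runif(12, 0.02, 1.10)                               # substrate concentrations
y <- 215 * x / (0.07 + x) + rnorm(12, 0, 8)              # 10 good observations ...
y[11:12] <- 215 * x[11:12] / (0.07 + x[11:12]) + rnorm(2, 30, 1)  # ... and 2 outliers
dat <- data.frame(x = x, y = y)
fit <- nls(y ~ t0 * x / (t1 + x), data = dat, start = list(t0 = 200, t1 = 0.1))
sum(residuals(fit)^2) / (nrow(dat) - 2)                  # MSE used as the gamma score
```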


Table 4. Regression estimates with the standard errors in brackets for Example 4.

               Original data set                  Simulated data set
Parameter      LSE (s.e.)       SUE (s.e.)        LSE (s.e.)       SUE (s.e.)
θ_0            212.68 (6.95)    216.62 (4.79)     210.35 (9.03)    217.30 (4.55)
θ_1            0.064 (0.008)    0.072 (0.006)     0.073 (0.014)    0.069 (0.007)
σ̂              10.93            7.10              15.23            6.47

Table 5. Results for the simulation study.

Parameter      LSE bias (s.e.)      SUE bias (s.e.)
θ_0            −3.64 (8.84)         1.64 (8.93)
θ_1            0.0059 (0.0238)      0.0052 (0.0266)
σ              5.70 (1.51)          −0.20 (2.71)

5. Secondary Criterion and Other Variations of the Subsampling Algorithm

The 5-step subsampling algorithm SAL(n_s, r*, k) introduced in Section 2 is the basic version which is straightforward to implement. In this section, we discuss modifications and variations which can improve its efficiency and reliability.

5.1. Alternative Stopping Criteria for Improving the Efficiency of the Combined Subsample S_g

In Step 5 of SAL(n_s, r*, k), the first r* subsamples in the sequence A_(1), A_(2), ..., A_(k) are identified as good subsamples and their union is taken to form S_g. However, it is clear from the discussion on parameter selection in Section 2.2 that there are likely more than r* good subsamples among the k generated by SAL(n_s, r*, k). When there are more than r* good subsamples, we want to use them all to form a larger and thus more efficient S_g. We now discuss two alternatives to the original Step 5 (referred to as Step 5a and Step 5b, respectively) that can take advantage of the additional good subsamples.

Step 5a: Suppose there is a cut-off point for the γ_(i) scores, say γ_C, such that the jth subsample is good if and only if γ_(j) ≤ γ_C. Then we define the combined subsample as

S_g = ∪_{j: γ_(j) ≤ γ_C} A_(j).  (17)

Step 5b: Instead of a cut-off point, we can use γ_(1) as a reference point and take the union of all subsamples whose γ scores are comparable to γ_(1). That is, for a pre-determined constant δ > 1, we define the combined subsample as

S_g = ∪_{j: γ_(j) ≤ δγ_(1)} A_(j).  (18)

In both Steps 5a and 5b, the number of subsamples in the union is not fixed. The values of γ_C and δ depend on n, n_s, η, γ and the underlying model. Selection of γ_C and δ may be based on the distribution of γ-scores of good subsamples.

If either Step 5a or 5b is used instead of the original Step 5, we need to ensure that the number of subsamples whose union is taken in (17) or (18) is no less than r*. If it is less than r*, then the criterion based on γ_C or δ may be too restrictive and it is not having the desired effect of improving the efficiency of the original Step 5. It may also be that the number of good subsamples is less than r*, and in this case a re-run of SAL with a larger k is required.

Finally, noting that the r* subsamples making up S_g in Step 5 may not be distinct, another way to increase efficiency is to use r* distinct subsamples. The number of good subsamples in a sequence of k distinct subsamples follows a Hypergeometric(k, L_g, L_T) distribution, where L_g is the total number of good subsamples of size n_s and L_T the total number of subsamples of this size. Since L_T is usually much larger than L_g, the hypergeometric distribution is approximately a binomial distribution. Hence the required k for having r* good subsamples with probability p* is approximately the same as before.
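Step 5b is a one-line change to the combination step of the sal_lm sketch in Section 2.1; the fragment below is our own illustration, with δ = 1.5 chosen arbitrarily and the vectors scores and indices assumed to come from Steps 1-4.

```r
# Step 5b variant: combine all subsamples whose gamma score is within a factor
# delta of the smallest score gamma_(1); delta = 1.5 is an arbitrary choice.
combine_step5b <- function(scores, indices, data, delta = 1.5) {
  keep <- which(scores <= delta * min(scores))
  good_idx <- sort(unique(unlist(indices[keep])))
  data[good_idx, ]                     # the combined sample S_g
}
```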

5.2. Consistency of Subsamples and Secondary Criterion for Improved Reliability of the Subsampling Algorithm

Let β̃_i and β̃_j be the estimates given by (method η applied to) the ith and jth subsamples, respectively. Let d(β̃_i, β̃_j) be a distance measure. We say that these two subsamples are inconsistent if d(β̃_i, β̃_j) > d_c, where d_c is a fixed positive constant. Conversely, we say that the two subsamples are consistent if d(β̃_i, β̃_j) ≤ d_c. Inconsistent subsamples may not be pooled into the combined sample S_g to estimate the unknown β.

Step 5 of SAL(n_s, r*, k) relies only on the γ-ordering of the subsamples to construct the combined sample S_g. In this and the two variations above, S_g is the union of (r* or a random number of) subsamples with the smallest γ-scores. However, an S_g constructed in this manner may fail to be a union containing only good subsamples, as a small γ-score is not a sufficient condition for a good subsample. One case of bad subsamples with small γ-scores we have mentioned previously is that of a bad subsample consisting entirely of outliers that also follow the same model as the good data but with a different β. In this case, the γ-score of this bad subsample may be very small. When it is pooled into the S_g, outliers will be retained and the resulting SUE will not be robust. Another case is when a bad subsample consists of some good data points and some outliers but the model happens to fit this bad subsample well, resulting in a very small γ-score. In both cases, the γ criterion is unable to identify the bad subsamples, but the estimated β based on these bad subsamples can be expected to be inconsistent with that given by a good subsample.

To guard against such failures of the γ criterion, we use the consistency as a secondary criterion to increase the reliability of the SAL(n_s, r*, k). Specifically, we pool only subsamples that are consistent through a modified Step 5, Step 5c, given below.

Step 5c: Denote by β̃_(i) the estimated value of β based on A_(i), where i = 1, 2, ..., k, and let d(β̃_(i), β̃_(j)) be a distance measure between two estimated values. Take the union

S_g = ∪_{j=1}^{r*} A_(i_j),  (19)

where the A_(i_j) (i_j = i_1, i_2, ..., i_{r*}) are the first r* consistent subsamples in {A_(1), A_(2), ..., A_(k)} satisfying

max_{i_j, i_l} d(β̃_(i_j), β̃_(i_l)) ≤ d_c

for some predetermined constant d_c.

The γ criterion is the primary criterion of the subsampling algorithm as it provides the first ordering of the k subsamples. A secondary criterion divides the γ-ordered sequence into consistent subsequences and performs a grouping action (instead of an ordering action) on the subsamples. In principle, we can switch the roles of the primary and secondary criteria, but this may substantially increase the computational difficulties of the algorithm. Additional criteria based on the elements of β̃ may also be added.

With the secondary criterion, the first r* subsamples whose union is taken in Step 5c have an increased chance of all being good subsamples. While it is possible that they are actually all bad subsamples of the same kind (consistent bad subsamples with small γ-scores), this is improbable in most applications. Empirical experience suggests that for a suitably chosen metric d(β̃_i, β̃_j), threshold d_c and a reasonably large subsample size n_s (say n_s = int[0.5N] + 1), the first r* subsamples are usually consistent, making the consistency check redundant. However, when the first r* subsamples are not consistent, and in particular when i_{r*} > r*, it is important to look for an explanation. Besides a poorly chosen metric or threshold, it is also possible that the data set actually comes from a mixture model. Apart from the original Step 5, a secondary criterion can also be incorporated into Step 5a or Step 5b.

Finally, there are other variations of the algorithm which may be computationally more efficient. One such variation whose efficiency is difficult to measure but is nevertheless worth mentioning here requires only r* = 1 good subsample to start with. With r* = 1, the total number of subsamples required (k) is substantially reduced, leading to less computation. Once identified, the one good subsample is used to test points outside of it. All additional good points identified through the test are then combined with the good subsample to form a combined sample. The efficiency of such a combined sample is difficult to measure and it depends on the test. But computationally, this approach is in general more efficient since the testing step is computationally inexpensive. This approach is equivalent to a partial depth function based approach proposed in [13].
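A greedy version of the consistency screen in Step 5c is sketched below; it is our own illustration, using the Euclidean distance between coefficient vectors as d and assuming the subsample fits are already sorted by their γ scores.

```r
# Greedy sketch of Step 5c: scan the gamma-ordered coefficient vectors and keep
# the first r* fits that are mutually within distance d_c of each other.
combine_step5c <- function(beta_list, indices, data, r_star, d_c) {
  chosen <- 1                                      # start from the best-scoring subsample
  for (i in 2:length(beta_list)) {
    if (length(chosen) == r_star) break
    dists <- sapply(chosen, function(j)
      sqrt(sum((beta_list[[i]] - beta_list[[j]])^2)))
    if (max(dists) <= d_c) chosen <- c(chosen, i)  # consistent with all kept fits
  }
  data[sort(unique(unlist(indices[chosen]))), ]    # the combined sample S_g
}
```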

6. Concluding Remarks

It is of interest to note the connections among our subsampling method, the bootstrap method and the method of trimming outliers. With the subsampling method, we substitute analytical treatment of the outliers (such as the use of the ψ functions in the M-estimator) with computing power through the elimination of outliers by repeated fitting of the model to the subsamples. From this point of view, our subsampling method and the bootstrap method share a common spirit of trading analytical treatment for intensive computing. Nevertheless, our subsampling method is not a bootstrap method, as our objective is to identify and then make use of a single combined sample instead of making inference based on all bootstrap samples. The subsampling based elimination of outliers is also a generalization of the method of trimming outliers. Instead of relying on some measure of outlyingness of the data to decide which data points to trim, the subsampling method uses a model based trimming by identifying subsamples containing outliers through fitting the model to the subsamples. The outliers are in this case outliers with respect to the underlying regression model instead of some measure of central location, and they are only implicitly identified as being part of the data points not in the final combined sample.

The subsampling method has two main advantages: it is easy to use and it can be used for any regression model to produce robust estimation for the underlying ideal model. There are, however, important theoretical issues that remain to be investigated. These include the characterization of the dependence structure among observations in the combined sample, the finite sample distribution and the asymptotic distribution of the SUE. Unfortunately, there does not seem to be a unified approach which can be used to tackle these issues for SUEs for all regression models; the dependence structure and the distributions of the SUE will depend on the underlying regression model as well as the method η and criterion γ chosen. This is yet another similarity to the bootstrap method in that although the basic ideas of such computation intensive methods have wide applications, there is no unified theory covering all applications and one needs to investigate the validity of the methods on a case by case basis. We have made some progress on the simple case of a linear model. We continue to examine these issues for this and other models, and hope to report our findings once they become available.

7. Acknowledgements

The authors would like to thank Dr. Julie Zhou for her generous help and support throughout the development of this work. This work was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] P. J. Huber, "Robust Statistics," Wiley, New York, 1981.
[2] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel, "Robust Statistics: The Approach Based on Influence Functions," Wiley, New York, 1986.
[3] P. J. Rousseeuw and A. M. Leroy, "Robust Regression and Outlier Detection," Wiley, New York, 1987.
[4] R. A. Maronna, R. D. Martin and V. J. Yohai, "Robust Statistics: Theory and Methods," Wiley, New York, 2006.
[5] D. G. Simpson, D. Ruppert and R. J. Carroll, "On One-Step GM-Estimates and Stability of Inferences in Linear Regression," Journal of the American Statistical Association, Vol. 87, No. 418, 1992, pp. 439-450.
[6] V. J. Yohai, "High Breakdown-Point and High Efficiency Robust Estimates for Regression," Annals of Statistics, Vol. 15, No. 2, 1987, pp. 642-656. doi:10.1214/aos/1176350366
[7] K. A. Brownlee, "Statistical Theory and Methodology in Science and Engineering," 2nd Edition, Wiley, New York, 1965.
[8] D. F. Andrews, "A Robust Method for Multiple Linear Regression," Technometrics, Vol. 16, No. 4, 1974, pp. 523-531.
[9] D. C. Montgomery, E. A. Peck and G. G. Vining, "Introduction to Linear Regression Analysis," 4th Edition, Wiley, New York, 2006.
[10] J. R. Ashford, "An Approach to the Analysis of Data for Semi-Quantal Responses in Biological Assay," Biometrics, Vol. 15, No. 4, 1959, pp. 573-581. doi:10.2307/2527655
[11] E. Cantoni and E. Ronchetti, "Robust Inference for Generalized Linear Models," Journal of the American Statistical Association, Vol. 96, No. 455, 2001, pp. 1022-1030. doi:10.1198/016214501753209004
[12] D. M. Bates and D. G. Watts, "Nonlinear Regression Analysis and Its Applications," Wiley, New York, 1988.
[13] M. Tsao, "Partial Depth Functions for Multivariate Data," Manuscript in Preparation, 2012.


Appendix: Proofs of Theorems 1 and 2

Proof of Theorem 1:

Let p_g be the probability that a random subsample of size n_s from S_N is a good subsample containing no outliers. Since n_s ≤ n, we have p_g > 0. With probability 1, the subsequence A*_1, A*_2, ... is an infinite sequence. This is so because the event that this subsequence contains only finitely many subsamples is equivalent to the event that there exists a finite l such that {A_l, A_{l+1}, ...} contains no good subsamples. Since p_g > 0, the probability of this latter event, and hence that of the former event, are both zero. Further, under the condition that A*_j ⊆ S_n, A*_j may be viewed as a random sample of size n_s taken directly without replacement from S_n. Hence with probability 1, A*_1, A*_2, ... is an infinite sequence of random samples from the finite population S_n. It follows that for any z_j ∈ S_n, the probability that it is in at least one of the A*_i is 1. Hence P(S_n ⊆ B_∞) = 1. This and the fact that B_∞ ⊆ S_n imply the theorem.

Proof of Theorem 2:

We prove (4) by induction. For j = 1, since B_1 = A*_1 which contains exactly n_s points, W_1 is a constant and

EF(B_1) = E(W_1)/n = n_s/n = 1 − (n − n_s)/n.

Hence (4) is true for j = 1.

To find EF(B_2), denote by Ā*_1 the complement of A*_1 with respect to S_n. Then Ā*_1 contains n − n_s points. Since A*_2 is a random sample of size n_s taken without replacement from S_n, we may assume it contains U_1 data points from Ā*_1 and n_s − U_1 data points from A*_1. It follows that U_1 has a hypergeometric distribution

U_1 ~ Hyperg(n, n − n_s, n_s)  (20)

with expected value

E(U_1) = n_s (n − n_s)/n.

Since B_2 = A*_1 ∪ A*_2, its number of data points is W_2 = n_s + U_1. Hence

E(W_2) = n_s + E(U_1) = n_s + n_s (n − n_s)/n = n[1 − ((n − n_s)/n)²].

It follows that

EF(B_2) = 1 − ((n − n_s)/n)²,

and formula (4) is also true for j = 2.

Now assume (4) holds for some j − 1 ≥ 2. We show that it also holds for j. Denote by #(A) the number of points in A. Then W_j = #(B_j) = #(B_{j−1} ∪ A*_j) = W_{j−1} + U_{j−1}, where U_{j−1} is the number of good data points in B_j but not in B_{j−1}. The distribution of U_{j−1} conditioning on W_{j−1} = w_{j−1} is

U_{j−1} | W_{j−1} = w_{j−1} ~ Hyperg(n, n − w_{j−1}, n_s).  (21)

By (21) and the assumption that (4) holds for j − 1, we have

E(W_j) = E(W_{j−1}) + E[E(U_{j−1} | W_{j−1})]
       = E(W_{j−1}) + E[n_s (n − W_{j−1})/n]
       = n_s + (1 − n_s/n) E(W_{j−1})
       = n_s + (1 − n_s/n) n [1 − ((n − n_s)/n)^{j−1}]
       = n [1 − ((n − n_s)/n)^j].

It follows that

EF(B_j) = E(W_j)/n = 1 − ((n − n_s)/n)^j.

Thus (4) also holds for j, which proves Theorem 2.
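The efficiency formula (4) is easy to verify numerically; the short R experiment below (our own) draws good subsamples directly from a set of n points and compares the observed coverage of B_j with the formula.

```r
# Monte Carlo check of the efficiency formula (4) for EF(B_j).
n <- 50; ns <- 26; j <- 3
cover <- replicate(10000, {
  Bj <- unique(unlist(replicate(j, sample(n, ns), simplify = FALSE)))
  length(Bj) / n
})
mean(cover)                  # observed efficiency of B_j
1 - ((n - ns) / n)^j         # efficiency predicted by (4)
```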
