A Unified Approach to Data Transformation and Outlier
Detection using Penalized Assessment
A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Ph.D. in Statistics of the McMicken College of Arts and Sciences
November 2013 by Wei Guo
B.S., TongJi University, Shanghai, China, 2006
M.S., University of Cincinnati, Cincinnati, OH, USA, 2010
Committee Chair: Seongho Song Ph.D.
Abstract
In many statistical applications, a normally distributed sample free of outliers is desired. In practice, however, the normality assumption is often violated, for example when highly influential outliers exist in the dataset, which adversely impacts the validity of the statistical analysis. In this dissertation, a Unified Approach is proposed that handles outlier detection and the Box-Cox transformation at the same time, using a penalized information criterion.
This research started by investigating the performance of the Box-Cox transformation on uncontaminated samples, and suggests that the sample should be anchored to 1 before the Box-Cox transformation is applied when the sample minimum is greater than 1. Simulation results showed that the anchor-to-1 method works well, enhancing the accuracy of the Box-Cox transformation by decreasing the variance of λ and eliminating extremely large or small values of λ. The efficacy of the Unified Approach is also verified on clean samples, both normal and lognormal, where the Unified Approach correctly determines that no Box-Cox transformation is needed and no outliers are present. Simulations on contaminated normal samples then demonstrated that the Unified Approach can balance good model fit (closeness to a normal sample) against the complexity of the data analysis by penalizing anchoring-to-1, outlier exclusion, and the Box-Cox transformation in the form of an adjusted information criterion. Through precise outlier detection and appropriate Box-Cox transformation, the efficacy of the Unified Approach is verified on contaminated samples.
Acknowledgement
I would like to take this opportunity to extend my deepest gratitude and sincere thanks to my adviser, Dr. Paul S. Horn, for his priceless guidance, insightful feedback, warm encouragement, and constant support through the past few years. I appreciate the valuable advice and support of the other committee members, Drs. Seongho Song, Siva Sivaganesan, Xia Wang, and Emily Kang. I thank the Department of Mathematics for providing financial support for my graduate studies at the University of Cincinnati.
I thank my parents, Zhanjun Guo and Yuexia Zhang, for all their love, support, and faith in me, and for allowing me to be as ambitious as I wanted. I would also like to thank my mother-in-law, Conghui Wang, for helping us take care of my child while I was working toward my degree. Without her help I could not have finished my degree or pursued my dream career.
Lastly, I would like to thank my wife, Xuejiao Diao. Her enduring love, encouragement, patience, and caring have been the greatest support throughout my life. Without her companionship I could not have succeeded in my graduate study in a foreign country. I am fortunate to have had my daughter Angie arrive during my graduate study; she brings me so much joy and makes me laugh, although I did not have much time for her every day. Without such a warm family none of this would have been possible.
Table of Contents
Abstract
Acknowledgement
Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Outliers
  1.3 Data transformation
    1.3.1. Mathematical expressions
    1.3.2. Estimation of parameters
    1.3.3. Hypothesis tests and inference on transformation parameter
    1.3.4. Impact of outliers on data transformation
  1.4 Information Criterion
    1.4.1 Akaike Information Criterion
    1.4.2 Bayesian Information Criterion
  1.5 Research Gap in Literature
2. Unified Approach
  2.1. Anchoring-to-1
  2.2. Penalized Assessment
  2.3. Implement of unified approach
  2.4. Advantages of new method
3. No Outliers: The Uncontaminated Sample Case
  3.1. Overview
  3.2. Anchoring-to-1
  3.3. Unified Approach-Normal Distribution
  3.4. Assessment-Normal Distribution
  3.5. Unified Approach-Lognormal Distribution
  3.6. Assessment-Lognormal Distribution
4. Outliers: The Contaminated Sample Case
  4.1. Samples with Outliers
  4.2. Simulation on N(10,1) samples
  4.3. Simulation on Lognormal samples
5. Discussion and future research
Bibliography
Appendix
List of Figures
Figure 1-1 Research Gap
Figure 2-1 Anchor-to-1, N(10,1), Sample size=100
Figure 2-2 Anchor-to-5, N(10,1), Sample size=100
Figure 2-3 LogNormal, Anchor-to-1, sample size=100
Figure 2-4 LogNormal, Anchor-to-5, sample size=100
Figure 2-5 Lambdas of Different Anchoring
Figure 2-6 95% Confidence Interval of λ at different anchor points
Figure 2-7 Distribution of Skewed Raw Sample
Figure 2-8 Comparison of Skewness and Kurtosis for different anchoring
Figure 2-9 Histogram of transformed sample with different anchoring: SQRT Transformation
Figure 2-10 Histogram of transformed sample with different anchoring: LOG Transformation
Figure 2-11 Histogram of transformed sample with different anchoring: INVERSE Transformation
Figure 2-12 Outlier Candidates
Figure 2-13 Flowchart of overall data processing
Figure 2-14 Flow Chart of the Unified Approach
Figure 2-15 Flowchart of outlier drop
Figure 3-1 Box plot shows the variation of λ using regular BC on original sample
Figure 3-2 95% Confidence interval of lambda in baseline case
Figure 3-3 Histogram of original sample
Figure 3-4 Histogram of transformed data with lambda 9.13
Figure 3-5 Histogram of original sample
Figure 3-6 Histogram of transformed sample using lambda=-6.91
Figure 3-7 Distribution of lambda after anchoring-to-1 the original sample
Figure 3-8 The mean and 95% confidence interval of lambda after anchoring-to-1
Figure 3-9 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted
Figure 3-10 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted, cont.
Figure 3-11 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda
Figure 3-12 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda, cont.
Figure 3-13 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda
Figure 3-14 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.
Figure 3-15 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda
Figure 3-16 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.
Figure 3-17 Outliers Detected in Clean N(10,1) Samples, 1000 Repetitions
Figure 3-18 Location of detected outliers, N(10,1), Clean Sample
Figure 3-19 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, no outliers excluded, no penalty for Box-Cox
Figure 3-20 Box-plot of lambda: Unified Approach, N(10,1), Clean Data, outliers excluded (>=1)
Figure 3-21 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, no outliers detected, no penalty for Box-Cox
Figure 3-22 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, outliers excluded (>=1)
Figure 3-23 Percent of four cases in unified approach with penalty to BC
Figure 3-24 Location of outliers, NO Box-Cox involved
Figure 3-25 Location of outliers, BC=1 and DP=1
Figure 3-26 Box-plot of Lambda: Unified Approach, Clean Data, BC=1, outliers excluded
Figure 3-27 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, BC=1, no outliers excluded
Figure 3-28 Comparison of MSE, BC=0 DP=0, N(10,1) Clean Data
Figure 3-29 Comparison of MSE, BC=0 DP=1, N(10,1) Clean Data
Figure 3-30 Comparison of MSE, BC=1 DP=0, N(10,1) Clean Data
Figure 3-31 Comparison of MSE, BC=1 DP=1, N(10,1) Clean Data
Figure 3-32 Outliers Detected in Clean LogN(0,1) Samples, 1000 Repetitions
Figure 3-33 Location of outliers detected, LogN(0,1), Clean Data
Figure 3-34 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox
Figure 3-35 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox
Figure 3-36 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox
Figure 3-37 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox
Figure 3-38 Percent of four cases in unified approach with penalty to BC, Clean LogN(0,1)
Figure 3-39 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, no outliers excluded
Figure 3-40 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, outliers excluded (>=1)
Figure 3-41 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=0
Figure 3-42 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=1
Figure 4-1 Distributions of Normal samples with outliers
Figure 4-2 Distributions of Lognormal samples with outliers
Figure 4-3 Distribution of Lambda in Baseline with outliers, N(10,1)
Figure 4-4 Confidence Interval of Lambda in Baseline with outliers, N(10,1)
Figure 4-5 Distribution of Lambda, Anchor-to-1, with outliers, N(10,1)
Figure 4-6 Confidence Interval of Lambda, Anchor-to-1, with outliers, N(10,1)
Figure 4-7 Accuracy of outlier detection, N(10,1), 5% outliers, no penalty for BC
Figure 4-8 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy1=1, no extra penalty for Box-Cox
Figure 4-9 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy2=1, no extra penalty for Box-Cox
Figure 4-10 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy3=1, no extra penalty for Box-Cox
Figure 4-11 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy4=1, no extra penalty for Box-Cox
Figure 4-12 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy5=1, no extra penalty for Box-Cox
Figure 4-13 Percent of different handling methods, 5% outliers, N(10,1), penalty for BC
Figure 4-14 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-15 Distribution of lambda, Unified Approach, N(10,1), 5% outlier with penalty for BC, accuracy5=1 cases
Figure 4-16 Accuracy of outlier detection, N(10,1), 10% outliers, no penalty for BC
Figure 4-17 Percent of different handling methods, 10% outliers, penalty for BC
Figure 4-18 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-19 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 10% outlier
Figure 4-20 Accuracy of outlier detection, N(10,1), 15% outliers, no penalty for BC
Figure 4-21 Percent of different handling methods, 15% outliers, penalty for BC
Figure 4-22 Accuracy of outlier detection, N(10,1), 15% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-23 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 15% outlier
Figure 4-24 Accuracy of outlier detection, N(10,1), 20% outliers, no penalty for BC
Figure 4-25 Percent of different handling methods, 20% outliers, penalty for BC
Figure 4-26 Accuracy of outlier detection, N(10,1), 20% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-27 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 20% outlier
Figure 4-28 Percent of different handling methods, 5% outliers, LogN(0,1), penalty for BC
Figure 4-29 Accuracy of outlier detection, LogN(0,1), 5% outliers, no penalty for BC
Figure 4-30 Accuracy of outlier detection, LogN(0,1), 5% outliers, penalty for BC
Figure 4-31 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC
Figure 4-32 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
Figure 4-33 Percent of different handling methods, 10% outliers, LogN(0,1), penalty for BC
Figure 4-34 Accuracy of outlier detection, LogN(0,1), 10% outliers, no penalty for BC
Figure 4-35 Accuracy of outlier detection, LogN(0,1), 10% outliers, penalty for BC
Figure 4-36 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC
Figure 4-37 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
Figure 4-38 Percent of different handling methods, 15% outliers, LogN(0,1), penalty for BC
Figure 4-39 Accuracy of outlier detection, LogN(0,1), 15% outliers, no penalty for BC
Figure 4-40 Accuracy of outlier detection, LogN(0,1), 15% outliers, penalty for BC
Figure 4-41 Distribution of lambda, LogN(0,1), 15% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC
Figure 4-42 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
Figure 4-43 Percent of different handling methods, 20% outliers, LogN(0,1), penalty for BC
Figure 4-44 Accuracy of outlier detection, LogN(0,1), 20% outliers, no penalty for BC
Figure 4-45 Accuracy of outlier detection, LogN(0,1), 20% outliers, penalty for BC
Figure 4-46 Distribution of lambda, LogN(0,1), 20% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC
Figure 4-47 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
List of Tables
Table 2-1 Comparison of Skewness and Kurtosis for different anchoring
Table 3-1 An example where BC gives extreme lambda
Table 3-2 An example where BC gives extreme lambda, cont.
Table 3-3 An example shows the bomb of Box-Cox, which gives lambda=-6.91
Table 3-4 An example shows the bomb of Box-Cox, which gives lambda=-6.91, cont.
Table 3-5 Comparison of MSE
Table 3-6 Location of outliers excluded, LogN(0,1), Clean Data
Table 4-1 Goodness of Fit statistics of Normal samples with/without outliers
Table 4-2 Goodness of Fit statistics of Lognormal samples with/without outliers
1 Introduction
1.1 Background
It is an unfortunate fact of statistical analysis that the data used to build models and draw inferences are not always well-behaved. Outliers are a typical example of aberrant data. They appear in almost all research projects, especially in observational studies where the dataset contains unusual observations and/or the distribution of the collected data shows extreme skewness and kurtosis.
Outliers can have an adverse impact on statistical analyses. First, they usually increase the error variance and reduce the power of statistical tests. Second, outliers may violate the normality assumption required by many statistical models, inflating both Type I and Type II error rates. Third, they can seriously bias the estimates of the parameters of interest. If a researcher's approach is to do nothing about the presence of outliers, the resulting statistical models will essentially describe none of the data: neither the bulk of the data nor the outliers. Even when the outliers are processed, the conclusions drawn from the data might still be biased, or even point in the opposite direction, if the outlier-handling strategy is not carefully designed.
Outliers need to be identified before they can be handled appropriately. Many outlier detection strategies have been proposed and applied in practice. Grubbs' test (Grubbs, 1969), Dixon's Q test (Dixon, 1951), the Tietjen-Moore test (Tietjen and Moore, 1972), and the generalized ESD test (Rosner, 1983) are popular methods for detecting outliers. Other methods flag observations using the interquartile range, such as Tukey's outlier filter (Hoaglin, Mosteller, and Tukey, 1983). The remaining approaches are either distance-based (Knorr et al., 1998) or density-based (Breunig, 2000), both of which use the distance to the k-nearest neighbors to label outliers.
There is a great deal of literature on how to handle identified outliers, and the prevailing approaches fall into the following categories. The simplest is to exclude the suspicious observations and analyze the remaining clean observations, although deletion of outliers is still debated. For example, in regression, deleted outliers may exhibit a large degree of influence on the parameter estimates (Cook, 1979).
The second method is to use data transformation to alleviate the impact of outliers; natural log and square root transformations are popular choices for handling outliers or skewed distributions. Through transformation, extreme observations can be kept in the data set without loss of information and the relative ranking of the observations is preserved, while the skewness and error variance present in the data can be largely reduced (Hamilton, 1992). In particular, Box and Cox's power transformation (Box and Cox, 1964) is widely used to accommodate outliers and skewed distributions. The transformation parameter λ can be found by maximum likelihood estimation (MLE) (Box and Cox, 1964). However, the MLE of the transformation parameter may itself be adversely affected by outliers, as shown in Andrews (1971). Consequently, the transformed data set may still contain outliers or exhibit skewness and kurtosis, indicating the failure of the Box-Cox transformation.
The third way is to accommodate outliers, that is, to use various "robust" procedures that protect the data from being distorted by the presence of outliers. These techniques "accommodate the outliers at no serious inconvenience-or are robust against the presence of outliers" (Barnett and Lewis, 1994, p. 35). Certain parameter estimates, especially the mean (and variance) and least squares estimates, are particularly vulnerable to outliers, i.e., have "low breakdown" values (Osborne, 2004). For this reason, researchers turn to robust or "high breakdown" methods to provide alternative estimates of these important aspects of the data. Huber's M-estimation (Huber, 1981), least trimmed squares estimation (Rousseeuw, 1984), S-estimation (Rousseeuw and Yohai, 1984), and MM-estimation (Yohai, 1987) are of this type.
On the other hand, if we know a priori why an outlier exists, we can simply delete it, for example when there is a known misplacement of a decimal point. In other situations, however, outliers can contain important information that cannot be ignored, such as in medical studies where the underlying distribution of values is naturally skewed to the right (e.g., cholesterol in adults who are not on statins). In such situations, should we treat those extremely large observations as outliers, or simply as regular observations in the long tail? One researcher's outlier may not be another's, and tail observations in one study may be judged outliers in another.
If we decide those suspicious observations are outliers, we are faced with the question of whether they should be excluded. If we decide they are not outliers, we would like a way to accommodate them. Data transformation is of great help in this case: a suitable transformation preserves all relevant information while easing the impact of extreme observations.
Another issue is how to assess the end result if we delete some outliers and/or transform the data. If outliers are dropped before the transformation, we bear the risk of losing information from the deleted observations, as in the cholesterol example above.
We need to make a trade-off between the improved fit of the data and the loss of information through the exclusion of outliers. Information criteria such as the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC, also known as the Schwarz criterion, SBC, or SBIC; Schwarz, 1978) are popular goodness-of-fit statistics that can measure this trade-off; in particular, they penalize over-fitting the data. In our problem, they can be used to penalize the exclusion of outliers.
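To make this trade-off concrete, the following sketch compares the AIC of a normal fit with and without an extreme point. This is a hypothetical illustration only, not the dissertation's adjusted criterion: charging one extra "parameter" for each excluded observation is an assumption made here for the sake of the example.

```python
import math

def normal_loglik(xs):
    """Maximized normal log-likelihood: plug in the MLEs of mu and sigma^2."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # MLE (divide by n)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def aic(xs, extra_params=0):
    """AIC = -2 logL + 2k, with k = 2 (mu, sigma^2) plus one penalty unit
    per excluded outlier -- the per-exclusion charge is an assumption."""
    k = 2 + extra_params
    return -2 * normal_loglik(xs) + 2 * k

# Nine well-behaved points plus one gross outlier (25.0).
sample = [9.5, 10.1, 9.8, 10.4, 9.9, 10.2, 9.7, 10.0, 10.3, 25.0]
full = aic(sample)                          # fit everything
trimmed = aic(sample[:-1], extra_params=1)  # drop the outlier, pay a penalty
```

Here the variance reduction from dropping the outlier dwarfs the penalty, so the trimmed fit wins; with a milder extreme point the penalty can tip the balance the other way.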
In this dissertation, I consider the Box-Cox transformation, outlier detection, and information criteria simultaneously to address the issues mentioned above. A unified approach to handling outliers in the process of Box-Cox transformation using penalized assessment will be proposed and examined through simulation studies. An additional method for finding an appropriate Box-Cox transformation parameter is also studied. Applications of the new method to real data will be demonstrated as well.
1.2 Outliers
In statistics, an outlier is an observation that is numerically distant from the rest of the data (Barnett and Lewis, 1994). Moore and McCabe (1999) further state that an outlier lies outside the overall pattern of a distribution.
Outliers can have many causes. One possible cause is a change in the measurement system; for example, outliers may appear when a physical measurement apparatus breaks down. Errors in data transmission or transcription can likewise lead to outliers. A second cause is human manipulation or, sometimes, fraudulent behavior. A third possibility is that outliers are simply a reflection of reality; in this case, the natural deviations should be flagged for further investigation (Hodge and Austin, 2004).
For centuries there was no standard mathematical definition of an outlier, and recognition was based on subjective judgment. Usually, anomalous observations were detected and, where appropriate, removed to keep the dataset clean. System defects and human fraud could be identified during the outlier detection process, effectively preventing catastrophic consequences in the first place. With modern applications of computer science and statistics, outlier detection methods have become much more rigorous and systematic.
The following is a brief review of models commonly used for outlier detection. All of these models assume that the data follow a normal distribution, so that large deviations from the mean are very "unlikely" (Rousseeuw and Leroy, 1996).
- Grubbs' test for outliers (Grubbs, 1969)
In Grubbs' test, the hypotheses are defined as follows. H0: there are no outliers in the data. Ha: there is at least one outlier. The Grubbs' test statistic is defined as the largest absolute deviation from the sample mean in units of the sample standard deviation (Grubbs, 1969):

\[
G = \max_{i=1,\dots,N} \frac{|Y_i - \bar{Y}|}{s}
\]

where \(\bar{Y}\) and \(s\) are the sample mean and standard deviation, respectively. In the two-sided test, the null hypothesis is rejected at a significance level \(\alpha\) if

\[
G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}
\]

where \(t_{\alpha/(2N),\,N-2}\) is the upper critical value of the t distribution at significance level \(\alpha/(2N)\) with \(N-2\) degrees of freedom.
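A minimal sketch of the statistic G, implemented directly from the definition above. The critical value requires a t quantile, which Python's standard library does not provide (it is available as, e.g., `scipy.stats.t.ppf`), so it is passed in by the caller here.

```python
from statistics import mean, stdev

def grubbs_statistic(sample):
    """G = max_i |Y_i - Ybar| / s: the largest standardized deviation."""
    m, s = mean(sample), stdev(sample)
    return max(abs(y - m) for y in sample) / s

def grubbs_rejects(sample, critical):
    """Two-sided Grubbs' test: reject H0 (no outliers) when G exceeds the
    supplied critical value, which must be computed from the t distribution."""
    return grubbs_statistic(sample) > critical
```

For example, `grubbs_statistic([1, 2, 3, 4, 100])` is about 1.79, reflecting how far the value 100 sits from the bulk of the data.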
- Dixon's Q test (Dixon, 1951)

To screen bad data with a Q test, the first step is to arrange the data in ascending order and calculate

\[
Q = \frac{\text{gap}}{\text{range}}
\]

where the gap is the absolute difference between the suspected outlier and the value closest to it, and the range is the difference between the largest and smallest observations. The questionable point can be rejected when \(Q > Q_{\text{table}}\), where \(Q_{\text{table}}\) is a reference value corresponding to the confidence level and the sample size.
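A sketch of the Q ratio for the most extreme observation. This implements the simple gap/range ratio for the nearest neighbor of the suspect point (the so-called r10 ratio); Dixon defined further ratios for larger samples, which are omitted here.

```python
def dixon_q(sample):
    """Dixon's Q = gap/range for whichever end of the sorted data is more
    extreme; compare the result against a tabulated critical value."""
    xs = sorted(sample)
    rng = xs[-1] - xs[0]
    gap_low = xs[1] - xs[0]      # gap if the minimum is the suspect point
    gap_high = xs[-1] - xs[-2]   # gap if the maximum is the suspect point
    return max(gap_low, gap_high) / rng
```

For `[1, 2, 3, 4, 10]` the gap to the suspect maximum is 6 and the range is 9, giving Q = 2/3.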
Other methods flag observations according to the interquartile range. For example, with Q1 and Q3 the lower and upper quartiles respectively, any observation outside the interval [Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)] can be labeled an outlier (Tukey, 1977); k = 1.5 is the conventional choice.
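The fence rule above can be sketched as follows; using the default "exclusive" quartile method of Python's `statistics` module is an implementation choice, and other quartile conventions shift the fences slightly.

```python
from statistics import quantiles

def tukey_outliers(sample, k=1.5):
    """Flag observations outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey, 1977)."""
    q1, _, q3 = quantiles(sample, n=4)  # default method='exclusive'
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in sample if x < lo or x > hi]
```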
In addition to these model-based outlier detection methods, there are also distance-based methods (Knorr, 1998) and density-based methods (Breunig, 2000). Both frequently use the distance to the k-nearest neighbors to gauge outliers.

In the process of identifying outliers, the masking effect and the swamping effect are problems that demand attention. Masking means that, in the successive use of a test for detecting a single outlier, we may fail to detect any outlier because of the effect of other outliers (Tietjen and Moore, 1972). For example, when there are multiple outliers rather than one or two, the additional outliers may influence the value of the test statistic so that it shows no sign of outliers. Swamping, on the other hand, occurs when too many outliers are flagged (Barnett and Lewis, 1984). For example, when there is only a single outlier but we are looking for two or more, the test results would usually indicate either none or all of the tested points as outliers. Because of masking and swamping, many tests require that the number of outliers being tested be specified in advance (Kitagawa, 1979).
1.3 Data transformation
Data transformation should align with the particular statistical analysis method. Some statistical analyses offer guidance on how a transformation should be performed, or on whether one is necessary at all. For instance, to construct a 95% confidence interval, an easy way is to take the sample mean plus and minus two standard deviation units. However, this two-standard-deviation rule applies only to normally distributed data and is not applicable to non-normal data. The central limit theorem states that the sample mean in most cases does vary normally when the sample size is reasonably large. However, if the sample size is not large enough and the sample is substantially skewed, the central limit theorem, together with the resulting confidence interval, fails to provide a good approximation of the population. Therefore, when dealing with skewed data, it is advisable to transform them to a symmetric distribution and then construct the confidence interval. To report results on the original scale, the confidence interval constructed on the transformed symmetric scale can be transformed back using the inverse of the transformation.
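The transform-then-back-transform recipe can be sketched for the log transformation. Note a caveat this sketch makes explicit: after exponentiating the endpoints, the interval refers to the median (equivalently the geometric mean) of the original right-skewed distribution, not its arithmetic mean.

```python
import math
from statistics import mean, stdev

def back_transformed_ci(sample):
    """Approximate 95% CI built as mean +/- 2 standard errors on the log
    scale, then mapped back with exp(); valid for strictly positive data."""
    logs = [math.log(x) for x in sample]
    m = mean(logs)
    se = stdev(logs) / math.sqrt(len(logs))
    return math.exp(m - 2 * se), math.exp(m + 2 * se)
```

For the geometric sequence `[1, 2, 4, 8, 16]` the log-scale mean is 2 log 2, so the interval straddles 4, the sample's geometric mean.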
1.3.1. Mathematical expressions
Tukey (1957) proposed a series of power transformations. The transformed values are a monotonic function of the observations over some admissible range and are indexed by the constant λ; for positive values:

\[
x^{(\lambda)} =
\begin{cases}
x^{\lambda}/\lambda, & \lambda \neq 0 \\
\log x, & \lambda = 0
\end{cases}
\]

However, the transformation was modified by Box and Cox (1964) to take into account the discontinuity at \(\lambda = 0\), so that

\[
x^{(\lambda)} =
\begin{cases}
(x^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\
\log x, & \lambda = 0
\end{cases}
\]
for some unknown \(\lambda\). To accommodate negative values, the transformation was modified by Box and Cox (1964) with a shift parameter:

\[
x^{(\lambda)} =
\begin{cases}
[(x + \mu)^{\lambda} - 1]/\lambda, & \lambda \neq 0 \\
\log(x + \mu), & \lambda = 0
\end{cases}
\]

where \(\lambda\) is the power transformation parameter and \(\mu\) is chosen so that \(x + \mu\) is greater than zero for all of the data.
Box and Cox (1964) also proposed the geometric mean version of the transformation, which has the form:

\[
x^{(\lambda)} =
\begin{cases}
(x^{\lambda} - 1)/\bigl[\lambda\, \mathrm{GM}(x)^{\lambda - 1}\bigr], & \lambda \neq 0 \\
\mathrm{GM}(x) \log x, & \lambda = 0
\end{cases}
\]

where \(\mathrm{GM}(x) = (x_1 x_2 \cdots x_n)^{1/n}\), i.e., the geometric mean. It can also be extended to a shifted version:

\[
x^{(\lambda)} =
\begin{cases}
[(x + \mu)^{\lambda} - 1]/\bigl[\lambda\, \mathrm{GM}(x)^{\lambda - 1}\bigr], & \lambda \neq 0 \\
\mathrm{GM}(x) \log(x + \mu), & \lambda = 0
\end{cases}
\]
Manly (1976) suggested an alternative exponential transformation that is suitable for negative values and is claimed to be effective at turning skewed unimodal distributions into nearly symmetric, normal-like distributions:

\[
x^{(\lambda)} =
\begin{cases}
[\exp(\lambda x) - 1]/\lambda, & \lambda \neq 0 \\
x, & \lambda = 0
\end{cases}
\]
John and Draper (1980) introduced the modulus transformation, which is intended to normalize distributions that already possess some measure of approximate symmetry, and has the following form:

\[
x^{(\lambda)} =
\begin{cases}
\operatorname{sign}(x)\,[(|x| + \mu)^{\lambda} - 1]/\lambda, & \lambda \neq 0 \\
\operatorname{sign}(x) \log(|x| + \mu), & \lambda = 0
\end{cases}
\]
Bickel and Doksum (1981) suggested a further version of the transformation, so that distributions of \(x^{(\lambda)}\) with unbounded support, such as the normal distribution, can be included: for \(\lambda > 0\),

\[
x^{(\lambda)} = \bigl[\operatorname{sign}(x)\,|x|^{\lambda} - 1\bigr]/\lambda.
\]
The inverse hyperbolic sine transformation proposed by Johnson (1949) is also applied to accommodate skewed data:

\[
g_t(y_t, \theta) = \sinh^{-1}(\theta y_t)/\theta = \log\bigl(\theta y_t + (\theta^2 y_t^2 + 1)^{1/2}\bigr)/\theta.
\]
Yeo and Johnson (2000) proposed a modification of the Box-Cox transformation that is defined for all real \(x\):

\[
\psi(\lambda, x) =
\begin{cases}
[(x + 1)^{\lambda} - 1]/\lambda, & x \geq 0,\ \lambda \neq 0 \\
\log(x + 1), & x \geq 0,\ \lambda = 0 \\
-[(-x + 1)^{2 - \lambda} - 1]/(2 - \lambda), & x < 0,\ \lambda \neq 2 \\
-\log(-x + 1), & x < 0,\ \lambda = 2
\end{cases}
\]
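For concreteness, the basic Box-Cox and Yeo-Johnson forms above can be written out directly. This is a sketch for scalar inputs; library implementations such as `scipy.stats.boxcox` additionally estimate λ from data.

```python
import math

def box_cox(x, lam):
    """Box-Cox (1964): (x^lam - 1)/lam for lam != 0, log x at lam = 0; x > 0."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def yeo_johnson(x, lam):
    """Yeo and Johnson (2000): a Box-Cox-style transform defined for all real x."""
    if x >= 0:
        return math.log(x + 1) if lam == 0 else ((x + 1) ** lam - 1) / lam
    return (-math.log(1 - x) if lam == 2
            else -((1 - x) ** (2 - lam) - 1) / (2 - lam))
```

As a sanity check, `box_cox(x, 1)` is just `x - 1`, and as λ approaches 0 the Box-Cox value tends to `log x`, confirming the continuity the (x^λ - 1)/λ form was designed for.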
1.3.2. Estimation of parameters
In terms of finding the transformation parameter λ and the shift parameter, Box and Cox (1964) proposed maximum likelihood as well as Bayesian methods for the estimation of λ. For the maximum likelihood method, it is assumed that for some unknown λ the transformed observations satisfy the normal-theory assumptions, i.e., they are independently normally distributed with mean μ and constant variance σ².
By multiplying the normal density by the Jacobian of the transformation, we obtain the probability density of the transformed observations, together with the likelihood of the original observations. Maximizing the likelihood function with respect to λ, μ, and σ² provides the corresponding parameter estimates.
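A minimal sketch of this procedure: μ and σ² can be maximized out analytically, leaving a profiled log-likelihood in λ alone (the -n/2 log s² term plus the log-Jacobian (λ - 1) Σ log xᵢ), which is then maximized here by a coarse grid search. The grid range and resolution are arbitrary choices for illustration.

```python
import math

def profile_loglik(xs, lam):
    """Box-Cox profiled log-likelihood of lambda for positive data xs:
    -n/2 * log s^2(lambda) + (lambda - 1) * sum(log x_i)."""
    n = len(xs)
    z = [math.log(x) if lam == 0 else (x ** lam - 1) / lam for x in xs]
    zbar = sum(z) / n
    s2 = sum((v - zbar) ** 2 for v in z) / n  # MLE variance of transformed data
    return -0.5 * n * math.log(s2) + (lam - 1) * sum(math.log(x) for x in xs)

def best_lambda(xs):
    """Grid-search MLE of lambda over [-2, 2] in steps of 0.01."""
    grid = [i / 100 for i in range(-200, 201)]
    return max(grid, key=lambda lam: profile_loglik(xs, lam))
```

For a geometric sequence such as `[1, 2, 4, 8, 16, 32]`, whose logarithms are equally spaced, the search settles near λ = 0, i.e., the log transformation.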
As for the Bayesian method, it is assumed that the prior distributions of μ and log σ can be taken as essentially constant over the region in which the likelihood is appreciable, and we integrate over the parameters to obtain a posterior distribution of λ. A review of further Bayesian considerations can be found in Pericchi (1981) and Sweeting (1984). Robust adaptations of the estimation procedures have been studied by a group of researchers, including Carroll (1980, 1982a), Bickel and Doksum (1981), Carroll and Ruppert (1984), Taylor (1983, 1985a, b, 1987), and Carroll and Ruppert (1987).
Andrews et al. (1971, 1973), Dunn and Tubbs (1980), and Beauchamp and Robson (1986) extended the Box-Cox procedure to multivariate data. Draper and Cox (1969) proposed an approximation for the precision of the maximum likelihood estimate, which was later corrected by Hinkley (1975). Cressie (1978) suggested a simple graphical procedure to estimate the transformation parameter by utilizing the principle of one degree of freedom for non-additivity in a two-way table without replication, while Hernandez and Johnson (1980) proposed estimating the transformation by minimizing the Kullback-Leibler information (Kullback and Leibler, 1951) when examining the large-sample behavior of transformations to normality.
Hinkley (1985) proposed an estimate of the transformation parameter based on a likelihood analysis for local deviations from a normal-theory linear model. This estimate simultaneously corrects for non-additivity and heterogeneity of residuals and is considered a more analytical procedure. Later, based on Kendall's rank correlation, Han (1987) proposed a nonparametric estimate of the transformation parameter; Han's approach was considered more consistent and efficient than the maximum likelihood estimator. Solomon (1985) suggested applying the Box-Cox technique to simple random-effects models, and the technique was later applied to all types of mixed models by Sakia (1988). Chang compiled computer programs for λ estimation, and the programs were later refined by Huang et al. (1978).
1.3.3. Hypothesis tests and inference on transformation parameter
Researchers are often interested in whether the estimate of the Box-Cox transformation parameter conforms to a hypothesized value. Box and Cox (1964) employed the asymptotic distribution of the likelihood ratio to test such hypotheses. Later, Andrews (1971) proposed another test in which the null distribution of the parameter is known; this test ignores the Jacobian of the transformation and is easier to calculate. Atkinson (1973) incorporated this approach in a comparison of the powers of the three tests and concluded that the tests derived from the likelihood method have more power. Lawrance (1987a) later standardized Atkinson's score statistic and compared the two tests, finding that the standardized statistic has improved standard normal behavior compared to Atkinson's test. Hinkley (1988) subsequently extended Lawrance's statistic to models in which both the response and the mean may be transformed.
Furthermore, as an alternative to the likelihood ratio confidence interval for testing hypotheses about the transformation parameter, Lawrance (1987b) gave an asymptotically justifiable expression for the estimated variance of λ, which leads to more efficient hypothesis tests on λ than Atkinson's (1985, p. 100), which is based on a regression analogy for constructed variables. A more recent improvement of both Atkinson's test and Lawrance's standardized score statistic was proposed by Wang (1987) and is said to give a more accurate approximation to the standard normal. A simulated comparison of the test statistics by Atkinson and Lawrance (1989) concluded that, in general, Atkinson's test is very similar to Lawrance's test. However, a caution should be made that the small samples used for Lawrance's test could have masked its superiority over Atkinson's test.
While Draper and Cox (1969) showed that the estimation of λ is fairly robust to non-normality as long as the variable has a reasonably symmetric distribution, this may not be the case when skewness is encountered. Thus, Poirier (1978) investigated the effect of the transformation on limited dependent variables, i.e., variables which may have been censored or truncated, thus introducing some skewness. The procedure maximizes the likelihood function of the truncated normal distribution. Although the asymptotic properties of the maximum likelihood estimators are known, little is known about their small-sample properties. Spitzer (1978) employed the Box-Cox transformation approach and examined its small-sample properties; the results suggested that the Box-Cox approach met the assumption of approximate normality. For forecasting purposes, the forecasts were unbiased and their variances were remarkably low.
Carroll and Ruppert (1981), Carroll (1982a), and Hinkley and Runger (1984) provided further responses to the work of Bickel and Doksum (1981). Doksum and Wong (1983) concluded that tests based on Box-Cox transformed data have good power properties; it is therefore acceptable to use the standard methods for the normal linear model on the transformed variables.
Wood (1974) and Carroll and Ruppert (1984, 1988) transformed both the response and the theoretical model in order to examine the Box-Cox transformation when the theoretical regression function is given. The Box-Cox transformation approach was also examined in a Monte Carlo study, which concluded that it matters little when the correct transformation is not known a priori. Later, Ruppert et al. (1989) studied transforming both the theoretical and empirical models in the Michaelis-Menten model, together with its error structure. Rudemo et al. (1989) used the power transformation for the logistic model in bioassay, with positive conclusions. Wixley (1986) applied a Box-Cox power transformation in linear models, and the results suggested that unconditional likelihood ratio tests maintain a more accurate level.
1.3.4. Impact of outliers on data transformation.
The selection of a transformation may properly be viewed as model selection, and in this initial phase of analysis, influential cases can have particularly important and lasting effects that are difficult to uncover in subsequent analysis. Thus, an observation outlying on the original scale may conform on the transformed scale. It is therefore necessary to determine whether the evidence for a particular transformation is spread evenly throughout the data or concentrated in just a few cases.
Andrews (1971) showed that the maximum likelihood estimator (MLE) of the transformation parameter may be adversely influenced by one or several outlying observations. Atkinson (1982) introduced diagnostic displays of outlying and influential observations in multiple regression and their possible reduction by a transformation. Criteria for estimating the transformation through the use of constructed variables were suggested by Atkinson (1983, 1985), and this has been extended to the power transformation after a shift in location. Carroll (1980, 1982b) proposed robust estimators of the transformation parameter by replacing the normal likelihood with an objective function that is less sensitive to outlying responses. To measure the influence of cases on the Box-Cox likelihood estimate of the response transformation parameter in linear regression, Cook and Wang (1983) proposed a method superior to Atkinson's in detecting influential cases for a transformation. Atkinson (1986) further extended this work by deriving expressions for estimating the effect of deleting observations on the estimate of the transformation parameter and on subsequent hypothesis tests.
1.4 Information Criterion
1.4.1 Akaike Information Criterion
The Akaike information criterion, AIC (Akaike, 1974), is a measure of the goodness of fit of a statistical model and is often used for model selection. It describes the tradeoff between the accuracy and the complexity of a model: in choosing between candidate models, the improved fit of a more complex model must be weighed against the increased number of parameters such a model requires.
AIC = −2 ln(L) + 2k,
where k is the number of parameters and L is the maximized value of the likelihood function. Given a dataset, the model with the minimum value of AIC should be preferred. Hence
AIC not only rewards goodness of fit but also includes a penalty that increases with the number of parameters, i.e., with the complexity of the model. Because the AIC is a simple expression, it is also convenient for dealing with outliers.
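As a small illustration of the formula above, the following Python fragment fits a normal model to simulated data by maximum likelihood and computes its AIC (the sample, seed, and variable names are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.0, size=100)

# Maximum likelihood fit of a normal model: mu-hat = sample mean,
# sigma2-hat = biased (1/n) sample variance.
mu_hat = x.mean()
sigma2_hat = x.var()
loglik = stats.norm.logpdf(x, loc=mu_hat, scale=np.sqrt(sigma2_hat)).sum()

k = 2                          # two free parameters: mu and sigma
aic = -2.0 * loglik + 2.0 * k  # AIC = -2 ln(L) + 2k
print(round(aic, 2))
```

Given several candidate models for the same data, the one with the smallest AIC would be preferred.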
Kitagawa (1979) was the first to apply AIC to detect outliers. In his work, two models were proposed to describe the distribution of a data set consisting of clean data and outlying observations. Observations on the low side and the high side are treated as outlier candidates, while the middle part of the data is treated as the main group, or clean sample. In Model 1, outliers are assumed to be normally distributed with the same variance σ² as the main group but with different means; in Model 2, they are assumed to be normally distributed with the same mean as the main group but with different variances. The AIC of models with different numbers of outliers is computed, and the minimum AIC decides which observations are outliers, corresponding to the "best" model.
Model 1:

f(x_i) = φ(x_i; μ1, σ²),  i = 1, …, n1
f(x_i) = φ(x_i; μ, σ²),   i = n1 + 1, …, n − n2
f(x_i) = φ(x_i; μ2, σ²),  i = n − n2 + 1, …, n

Model 2:

f(x_i) = φ(x_i; μ, τ1²),  i = 1, …, n1
f(x_i) = φ(x_i; μ, σ²),   i = n1 + 1, …, n − n2
f(x_i) = φ(x_i; μ, τ2²),  i = n − n2 + 1, …, n

where φ(·; μ, σ²) denotes the normal density and n1 and n2 are the numbers of low-side and high-side outlier candidates.
Kadota (2003a, 2003b, 2006) applied the U-statistic (Ueda, 1996), an approximation of AIC, to detect genes whose expression profiles differ considerably in some tissues from the others. Gene expression ratios are sorted and normalized, and the U-statistic is calculated for different combinations of outlier candidates; the combination with the minimum U-statistic identifies the tissue-specific genes, which are in fact outliers.
1.4.2 Bayesian Information Criterion
Based on the likelihood function, the Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC) (Schwarz, 1978) provides a criterion for model selection among a finite set of models. The derivation of the BIC assumes that the model belongs to the exponential family. The formula is

BIC = −2 ln(L) + k ln(n) ≈ −2 ln(p(x|k)),

where ln(L) is the maximized log-likelihood of the observations, k is the number of free parameters to be estimated, and n is the sample size. ln(p(x|k)) is the log marginal probability of the observed data given the number of parameters.
In the model fitting process, adding parameters increases the likelihood, but it can also cause over-fitting. BIC penalizes the number of parameters and thus guards against over-fitting. The penalty is larger in BIC than in AIC for all but the smallest samples.
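The difference between the two penalties is easy to see numerically; the following sketch (parameter count and sample sizes are illustrative) compares the AIC penalty 2k with the BIC penalty k ln(n) as n grows:

```python
import numpy as np

k = 3  # number of free parameters (illustrative)
for n in (20, 50, 200):
    aic_pen = 2 * k              # AIC penalty: constant in n
    bic_pen = k * np.log(n)      # BIC penalty: grows with sample size
    print(n, aic_pen, round(bic_pen, 2))
```

The BIC penalty exceeds the AIC penalty once ln(n) > 2, i.e., for any sample of eight or more observations.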
1.5 Research Gap in Literature
Much research has been done on detecting outliers, transforming data, and applying information criteria (AIC and BIC). However, several problems remain unaddressed.
First, the Box-Cox transformation is not resistant in the presence of outliers, due to the non-robustness of the MLE. Whether to transform the data with or without the outliers needs further consideration. The transformed data might not be as normal as expected and may still contain outliers if they are retained; if they are dropped, we bear the risk of losing information at the expense of a better model fit. Moreover, the transformation parameter λ found through optimization is not always reasonable, due to the flatness of the likelihood function.
The second question of interest is how suspicious observations are treated. There are two ways suspicious observations are viewed. They can be treated either as real outliers or as regular extreme observations in a skewed dataset. Identified outliers can be dropped to improve model fitting and skewness of data can be alleviated through data transformation. It is a critical issue that needs to be addressed before any analysis can be conducted in the presence of extreme observations.
Third, the advantage of using an information criterion such as AIC to detect outliers is quite obvious, as shown in the work of Kitagawa (1979) and Kadota (2003b), but its limitations cannot be overlooked. Outliers can be identified through AIC, but how to handle them, i.e., whether to exclude the identified outliers or keep them, is still a problem. How to evaluate the goodness of fit of a Box-Cox transformation when outliers are dropped is unsolved. Additionally, in Kitagawa's (1979) paper AIC is used for a small sample (15 observations); as the sample size grows, the computational complexity increases dramatically, because the AIC of all possible combinations of outlier candidates must be calculated. Moreover, we do not know exactly how many outliers are in the sample, which makes the problem more difficult. The assumption that the outliers are normally distributed is quite strong and sometimes unrealistic; for example, if there are in fact only two or three outliers, it is not reasonable to use a normal distribution to describe them.
Last but not least, there is no research that integrates the Bayesian Information Criterion (BIC), the Box-Cox transformation, and outlier detection so as to robustly detect outliers and find the appropriate transformation at the same time without excluding too many extreme observations. I will find the best compromise between removing outliers and/or transforming the data, using the BIC as a guide. In this dissertation I propose a unified approach that combines the Box-Cox transformation and outlier detection using penalized assessment, which tackles the above-mentioned problems. The research gap in the literature is shown in Figure
1-1.
Figure 1-1 Research Gap
2. Unified Approach
2.1. Anchoring-to-1
The Box-Cox transformation can be directly applied to positive observations; for samples containing non-positive values, one more step is needed: shift the whole sample into the positive range by adding a constant to each observation. This is stated by Box & Cox (1964):
x_i = [(y_i + μ)^λ − 1] / λ,  λ ≠ 0
x_i = ln(y_i + μ),            λ = 0

where y_1, …, y_n is the raw sample of size n, x_1, …, x_n is the transformed sample, λ is the transformation parameter, and μ is chosen so that y_i + μ is greater than zero for all of the data. Now the question is how to choose μ. One may choose μ = −y_(1) + ε to ensure that the minimum of the shifted sample is always greater than zero, which means that the sorted shifted sample becomes

ε, y_(2) + ε − y_(1), y_(3) + ε − y_(1), …, y_(n) + ε − y_(1),

where ε can theoretically be any positive number.
The following two figures compare the distributions for different choices of ε. The sample here is Normal(10, 1) with sample size 100. The shape of the histogram does not change much, although the minimum of the distribution is shifted to 1 (Figure 2-1) and 5 (Figure 2-2) respectively.

The next two figures show the shift of a skewed distribution, Log-Normal; the sample size is also 100. Before the shift, the sample has quite a long right tail, while after shifting to 1 and 5 the symmetry of the distribution is improved, although it does not look normal yet (Figure 2-3 and Figure 2-4).
Now two questions arise here:
1) Does the choice of ε affect the resulting λ?

2) Does y_(1) influence the efficacy of the Box-Cox transformation?
Figure 2-1 Anchor-to-1, N(10,1), Sample size=100
Figure 2-2 Anchor-to-5, N(10,1), Sample size=100
It is noted that adding a constant to a variable changes only the mean, not the standard deviation, variance, skewness, or kurtosis. However, the constant ε, and thus the amount by which the distribution is shifted, can influence subsequent data transformations, as shown in the following examples.
A simple simulation shows the effect of different choices of ε on the Box-Cox transformation parameter λ. Samples from a normal distribution are generated with sample sizes ranging from 20 to 200. Five choices of ε are used to shift the minimum of the original samples to 0.05, 0.1, 1, 2, and 3 respectively. For each case, 1000 repetitions are produced. Since the samples are generated from a normal distribution, we expect the Box-Cox transformation to yield λ = 1, meaning that no transformation is needed. It can be seen that in the baseline case, where the raw sample is transformed without any shift, the resulting λ has large variation, especially in small samples. When the other values of ε are applied, the variation of λ is alleviated, its mean is quite close to 1, and the confidence interval is narrower than in the baseline case (Figure 2-5 and Figure 2-6).
Figure 2-3 Log-Normal, Anchor-to-1, sample size=100
Figure 2-4 Log-Normal, Anchor-to-5, sample size=100
Figure 2-5 Lambdas of Different Anchoring
Figure 2-6 95% Confidence Interval of λ at different anchor points
Another example also shows the impact of anchoring on transformation. We generated a skewed random sample of size 100; the raw sample is generated from Exp(X)+10, where X ~ N(0,1). The histogram is plotted in Figure 2-7. It is desired to find a transformation that will make the sample behave normally. Three frequently used transformations, the square root, the natural logarithm, and the inverse, are applied to the raw data. We also anchored the raw sample to different values: 1, 2, 3, 5, and 10. The skewness and kurtosis of the raw sample are 2.1 and 4.8 respectively, which indicates a large departure from normality. If the three transformations are applied directly to the raw data, the skewness and kurtosis become closer to zero (the normal distribution). If the raw sample is first anchored to one of the five values, the skewness and kurtosis both become even closer to zero, indicating a better fit to normality compared to the non-anchored case (Table 2-1 and Figure 2-8). The histograms of transformed samples using different anchoring values are shown in Figure 2-9, Figure 2-10, and Figure 2-11.
Figure 2-7 Distribution of Skewed Raw Sample
Table 2-1 Comparison of Skewness and Kurtosis for different anchoring
                  Y        Original  Anchor_1  Anchor_2  Anchor_3  Anchor_5  Anchor_10
SQRT  Skewness    2.1053   1.9236    1.4504    1.6104    1.7020    1.8074    1.9224
      Kurtosis    4.7690   3.8745    2.0144    2.5720    2.9235    3.3575    3.8688
LOG   Skewness    2.1053   1.7506    0.8505    1.1570    1.3303    1.5294    1.7483
      Kurtosis    4.7690   3.0917    0.2596    1.0066    1.5277    2.2145    3.0818
INV   Skewness    2.1053  -1.4316    0.0682   -0.4215   -0.7039   -1.0404   -1.4274
      Kurtosis    4.7690   1.8216   -0.9297   -0.6234   -0.1728    0.6024    1.8066

(Y denotes the untransformed raw sample; Original denotes the transformation applied without anchoring.)
(Three chart panels: comparison of anchoring for the square root, log, and inverse transformations; each panel plots skewness and kurtosis against the anchoring choice.)
Figure 2-8 Comparison of Skewness and Kurtosis for different anchoring
Figure 2-9 Histogram of transformed sample with different anchoring: SQRT Transformation
Figure 2-10 Histogram of transformed sample with different anchoring: LOG Transformation
Figure 2-11 Histogram of transformed sample with different anchoring: INVERSE Transformation
To find an outlier in a dataset, one simple way is to sort the observations from smallest to largest and then examine those at the low and high ends to see whether they are extreme enough to be treated as outliers. Usually a transformation is applied to the raw observations without any modification (for positive data).

In this thesis, it is proposed that one extra step is taken before the transformation procedure: the sorted raw data should be anchored to 1. Osborne (2002) also mentioned this point, although no simulation was provided in his work. The reason the raw data are anchored to 1 is that the minimum value of a sample affects the efficacy of transformations, as shown in the examples and simulations above. Further details of the benefits of anchoring-to-1 will be given in later sections, especially when the objective is to look for outliers and transform the data at the same time. The derivation of the Box-Cox transformation parameter λ is given below.
• Non-Anchoring
The raw data, sorted in order of increasing magnitude, are denoted y_1, y_2, …, y_n, and the transformed data are denoted x_1, x_2, …, x_n. The transformed data are expected to follow a normal distribution with unknown mean and variance. The Box-Cox transformation parameter is obtained by maximizing the log-likelihood function of the transformed data.

x_i = (y_i^λ − 1) / λ,  λ ≠ 0
x_i = ln(y_i),          λ = 0

x_1, x_2, …, x_n ~ N(μ, σ²)

The likelihood function of x_1, x_2, …, x_n is

L_x = n! ∏_{i=1}^{n} f(x_i) = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²))

To find the likelihood function of the raw data y_1, y_2, …, y_n, the Jacobian of the Box-Cox transformation is

J = |dx_i / dy_i| = y_i^{λ−1}

so that

L_y = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²)) ∏_{i=1}^{n} y_i^{λ−1}

The log-likelihood function is

LL_y = ln(n!) − (n/2) ln(2πσ²) − Σ_{i=1}^{n} (x_i − μ)² / (2σ²) + (λ − 1) Σ_{i=1}^{n} ln(y_i)

Through MLE,

μ̂ = x̄,  σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Substituting μ̂ and σ̂² into the log-likelihood function,

LL_y = ln(n!) − (n/2) ln(2πe) − (n/2) ln(σ̂²) + (λ − 1) Σ_{i=1}^{n} ln(y_i)

λ can be found iteratively through the Newton-Raphson algorithm.
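The maximization of this profile log-likelihood can also be sketched with a simple bounded scalar search in place of Newton-Raphson (the seed, sample, and search bounds below are illustrative), and cross-checked against scipy's own Box-Cox estimator:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100)   # true lambda is 0 (log transform)

def profile_loglik(lam, y):
    """Profile log-likelihood of lambda (additive constants dropped)."""
    x = np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam
    n = len(y)
    s2 = x.var()                                   # MLE of sigma^2 (1/n divisor)
    return -0.5 * n * np.log(s2) + (lam - 1.0) * np.log(y).sum()

# Maximize over lambda by minimizing the negative profile log-likelihood.
res = optimize.minimize_scalar(lambda l: -profile_loglik(l, y),
                               bounds=(-3.0, 3.0), method="bounded")
lam_hat = res.x
print(round(lam_hat, 3))        # close to 0 for log-normal data

# Cross-check against scipy's maximum-likelihood Box-Cox estimate.
_, lam_scipy = stats.boxcox(y)
```

Both optimizers maximize the same profile likelihood, so the two estimates should agree closely.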
• Anchoring-to-1

x_i = [(y_i + m)^λ − 1] / λ,  λ ≠ 0
x_i = ln(y_i + m),            λ = 0

where m = −y_(1) + 1, which shifts the minimum of the anchored sample to 1.

The likelihood function of x_1, x_2, …, x_n remains the same:

L_x = n! ∏_{i=1}^{n} f(x_i) = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²))

To find the likelihood function of the raw sample y_1, y_2, …, y_n, the Jacobian of the Box-Cox transformation is

J = |dx_i / dy_i| = (y_i + m)^{λ−1}

so that

L_y = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²)) ∏_{i=1}^{n} (y_i + m)^{λ−1}

The log-likelihood function is

LL_y = ln(n!) − (n/2) ln(2πσ²) − Σ_{i=1}^{n} (x_i − μ)² / (2σ²) + (λ − 1) Σ_{i=1}^{n} ln(y_i + m)

Through MLE,

μ̂ = x̄,  σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Substituting μ̂ and σ̂² into the log-likelihood function,

LL_y = ln(n!) − (n/2) ln(2πe) − (n/2) ln(σ̂²) + (λ − 1) Σ_{i=1}^{n} ln(y_i + m)

λ can be found iteratively through the Newton-Raphson algorithm, just as in the non-anchored case.
2.2. Penalized Assessment
Information criteria such as AIC and BIC are measures of the goodness of fit of a statistical model. They describe the tradeoff between the accuracy and the complexity of the model and are often used as model selection criteria: in choosing between possible models, the improved fit of a more complex model must be traded off against the increased number of parameters such a model requires. Information criteria (AIC and BIC) can be used to detect outliers, as seen in Kitagawa (1979) and Kadota (2003a, 2003b, 2006). In my unified approach, an information criterion is used to assess the goodness of fit of the transformed data, which are supposed to follow a normal distribution. The information criterion can be easily computed from the sum of squared errors of the transformed data.
When a skewed sample or a sample with outlying observations is to be analyzed, one may have the following options to make the data behave normally, or close to normally:
a) Box-Cox the whole sample to get a normal transformed sample;
b) Exclude some extreme observations that are in the tails to make the remaining sample
look normal;
c) Exclude some extreme observations and Box-Cox transform the remaining sample to get transformed data that are closer to normal than in case b.
The purpose of the above options is to obtain a better model fit, meaning that the transformed sample is close to normal. However, one runs the risk of over-fitting the data, that is, of excluding too many extreme observations (causing loss of information) or implementing unnecessary transformations (over-analyzing) to get a better fit. In this sense, I would like to utilize the information criterion's attractive feature of penalizing over-fitting to decide the trade-off and answer the questions: how many extreme observations should be excluded, whether a Box-Cox transformation is needed, and whether the transformation should be based on the whole sample or a partial sample.
It is known that AIC = −2 ln(L) + 2k and BIC = −2 ln(L) + k ln(n). In the unified approach, we treat the Box-Cox transformation parameter λ as an extra parameter in the penalty function, in addition to the two parameters μ and σ. Another way to penalize over-fitting is to treat the excluded outliers as extra parameters. If the number of observations excluded is denoted by dp, then the number of parameters in the penalty function is 3 + dp, namely λ, μ, σ, and the dp excluded observations. Therefore the adjusted information criteria of the transformed data become

AIC = −2 LL_y + 2(3 + dp)
    = −2 [ln((n − dp)!) − ((n − dp)/2) ln(2πe) − ((n − dp)/2) ln(σ̂²) + (λ − 1) Σ_{i=1}^{n−dp} ln(y_i)] + 2(3 + dp)

BIC = −2 LL_y + (3 + dp) ln(n − dp)
    = −2 [ln((n − dp)!) − ((n − dp)/2) ln(2πe) − ((n − dp)/2) ln(σ̂²) + (λ − 1) Σ_{i=1}^{n−dp} ln(y_i)] + (3 + dp) ln(n − dp)
Based on the above adjusted information criteria, we can decide how many observations to exclude and whether the Box-Cox transformation is worth implementing: the smaller the value of the criterion, the better.
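A minimal sketch of the adjusted BIC (function name and the illustrative data below are my own; the ln(n!) term is omitted since it is constant across candidates of the same retained size) can be written as:

```python
import numpy as np
from scipy import stats

def adjusted_bic(y, lam, dp):
    """Adjusted BIC of a Box-Cox fit with 3 + dp penalized parameters
    (lambda, mu, sigma, plus one per excluded observation).
    `y` is the retained sample (after exclusions); `dp` counts exclusions.
    The constant ln(n!) term is dropped."""
    n = len(y)
    x = np.log(y) if abs(lam) < 1e-8 else (y**lam - 1.0) / lam
    s2 = x.var()                               # MLE of sigma^2
    loglik = (-0.5 * n * np.log(2.0 * np.pi * s2) - 0.5 * n
              + (lam - 1.0) * np.log(y).sum())  # includes the Jacobian term
    return -2.0 * loglik + (3 + dp) * np.log(n)

rng = np.random.default_rng(4)
y = rng.lognormal(0.0, 1.0, size=100)
lam = stats.boxcox(y)[1]
# Same fitted data, larger dp: only the penalty term grows.
print(adjusted_bic(y, lam, dp=0), adjusted_bic(y, lam, dp=5))
```

Holding the fit fixed, each additional excluded observation raises the criterion by ln(n), so an exclusion is only worthwhile if it improves the log-likelihood by more than ln(n)/2.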
2.3. Implementation of the unified approach
For a given data set, our proposed unified approach for dealing with a skewed sample or outliers works as follows:

a) Sort the observations from smallest to largest: y_(1), y_(2), …, y_(n).
b) Anchor to 1: if the minimum of the sample is larger than 1, e.g., for an N(10,1) sample, use the transformation y* = y + 1 − y_min, so that the minimum of the raw data is shifted to 1. If the minimum of the sample is smaller than 1, e.g., for LogN(0,1), it is unnecessary to anchor to 1.

c) The original data anchored to 1 now become

1, y_(2) + 1 − y_(1), y_(3) + 1 − y_(1), …, y_(n) + 1 − y_(1).
d) Sequentially drop observations at the high end and the low end; the numbers of observations to be excluded are denoted by hdp and ldp respectively, each ranging from zero to one fifth of the total sample size. The percentage can be set to any number below 50%, because if more than 50% of the sample were true outliers, the rest of the sample could equally be treated as the outliers. The total number of observations excluded is dp, which is equal to ldp plus hdp.
e) Re-Anchor remaining sample to 1 when observation in lower end is excluded.
f) Apply Box-Cox transformation to the remaining n-dp data.
g) Each time one observation is excluded at the high end or the low end, find the Box-Cox transformation parameter λ by maximizing the likelihood of the transformed data.
h) Compute the information criterion BIC (or AIC) for the above cases, in which different numbers of observations are dropped.
i) Find the minimum BIC value and the corresponding number of dropped observations and λ for this sample.
j) Conduct a normality test, such as the Anderson-Darling test, on the transformed data.
k) Back-transform to original scale if needed.
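The core of the procedure above can be sketched as follows. This is a simplified illustration, not the dissertation's actual implementation: it searches over drop counts at both ends, re-anchors to 1, Box-Cox transforms the remainder via scipy, and scores each candidate with a BIC that penalizes 3 + dp parameters (the anchoring penalty and the no-transformation option are omitted for brevity):

```python
import numpy as np
from scipy import stats

def unified_approach(y, max_drop_frac=0.2):
    """Return (bic, ldp, hdp, lambda) for the lowest-BIC candidate, where
    ldp/hdp observations are dropped at the low/high ends."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    max_dp = int(n * max_drop_frac)
    best = None
    for ldp in range(max_dp + 1):
        for hdp in range(max_dp + 1 - ldp):     # total drops capped at max_dp
            z = y[ldp:n - hdp]                  # remaining sample
            if z.min() > 1:
                z = z + (1.0 - z.min())         # (re-)anchor the minimum to 1
            x, lam = stats.boxcox(z)            # MLE of the transformation parameter
            m = len(z)
            s2 = x.var()
            loglik = (-0.5 * m * np.log(2.0 * np.pi * s2) - 0.5 * m
                      + (lam - 1.0) * np.log(z).sum())
            bic = -2.0 * loglik + (3 + ldp + hdp) * np.log(m)
            if best is None or bic < best[0]:
                best = (bic, ldp, hdp, lam)
    return best

rng = np.random.default_rng(5)
bic, ldp, hdp, lam = unified_approach(rng.normal(10.0, 1.0, size=50))
print(ldp, hdp, round(lam, 2))
```

The returned drop counts and λ then feed into the normality check and, if needed, the back-transformation of steps j) and k).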
The above procedure is summarized in Figure 2-13, Figure 2-14, and Figure 2-15. In Figure 2-13, given a sample that needs transformation, if the normality test is passed, then no transformation is needed. If the sample does not pass the normality test or contains outliers, extra data processing such as outlier detection or data transformation is needed. Figure 2-14 shows the procedure of the Unified Approach: if the sample minimum is larger than 1, the minimum is anchored to 1. Then the decision regarding outlier dropping and Box-Cox transformation is made based on the penalized assessment, i.e., the adjusted BIC. Given a sample, four choices are available to make the sample behave normally: BC=0 DP=0 (no Box-Cox and no outliers dropped), BC=0 DP=1 (no Box-Cox, drop outliers), BC=1 DP=0 (Box-Cox, no outliers dropped), and BC=1 DP=1 (Box-Cox and outlier dropping at the same time).
For each sample, we need to make three decisions: how many outliers are there? Is a Box-Cox transformation needed to make the sample normal? Are outlier exclusion and Box-Cox transformation both needed to achieve normality? To answer those questions, BIC is utilized in the unified approach. Observations at the low end and the high end of the sorted sample may be outliers, and the combinations of low-end and high-end observations provide many candidate models to choose from, as shown in Figure 2-12. This is similar to Kitagawa (1979). In addition, the Box-Cox transformation may help to achieve normality after some suspicious observations are excluded. We need to decide whether it is worth Box-Cox transforming the sample, which is measured as an extra penalty in BIC. Therefore, we have quite a few candidate models, distinguished by the number of observations to be excluded and by whether a Box-Cox transformation is involved. The point is that achieving normality by dropping some observations and Box-Cox transforming the sample might amount to over-fitting. Another extra penalty comes from anchoring-to-1: if the minimum of the original sample is larger than 1, it is anchored to 1 before any Box-Cox transformation, which yields an extra penalty in BIC. For some samples, we need to re-anchor the remaining sample to 1 after an observation at the low end is dropped, if the minimum of the retained sample is still larger than 1, and then redo the Box-Cox transformation on the remaining anchored sample. The exclusion of low-end observations may happen several times; every time an observation is dropped, it is treated as an extra parameter in the BIC penalty, and each time the sample is anchored to 1, an extra parameter is added as a penalty in BIC. Excluding observations, anchoring the sample to 1, and Box-Cox transforming the sample are all treated as model fitting. BIC is a good measure of the trade-off between model fit (normality) and model complexity (outlier exclusion, Box-Cox transformation, and anchoring-to-1) and hence is used in the unified approach to choose the appropriate solution. The reason BIC is chosen rather than AIC is that in our simulation study BIC was found to outperform AIC and other information criteria such as AICc. The penalty term of AIC is 2k, which does not depend on the sample size, whereas BIC has a penalty of k ln(n), so the penalty for over-analyzing the sample (a bigger k) is magnified by the corresponding sample size. When some observations are excluded, the sample size becomes n − dp, which cannot be reflected in AIC.
(Diagram: candidate outlier sets formed by sequentially removing observations from the high end, x(n), x(n−1), …, x(n−10), and from the low end, x(1), x(2), …, x(10).)
Figure 2-12 Outlier Candidates
2.4. Advantages of the new method

First, the new method can handle small to medium-sized data sets; our simulation study shows that the method is applicable to sample sizes ranging from 20 to 200 with acceptable results. Previous research, such as Kitagawa (1979) and Kadota (2003), could only handle small samples, usually fewer than 50 observations.

Second, the new method is able to detect proportions of outliers ranging from 0% to 20%. In previous research the proportion of outliers considered is quite low; only two or three suspicious observations at the low end and the high end are considered.

Third, the new method balances the tradeoff between excluding too many extreme observations and obtaining a better model fit. Excluding all the suspicious observations helps reduce the skewness and kurtosis and hence improves the normality of the data, but it bears the risk of losing useful information. Our new method employs BIC to make this tradeoff, using an appropriate penalty to avoid over-fitting and loss of information.

Fourth, after applying the new method to data sets with outliers, the transformed data pass the normality test at a high rate, which means that our transformation is effective.

Fifth, our new method is effective in correctly identifying outliers, as shown for both clean and contaminated data. This feature is attractive because it can avoid the masking and swamping effects in outlier detection.

Lastly, outlier detection usually requires a significance level for the testing procedure, such as 1%, 5%, or 10%. The new method does not need a significance level, which avoids the subjectivity issue in statistical tests.
Figure 2-13 Flowchart of overall data processing
Figure 2-14 Flow Chart of the Unified Approach
Figure 2-15 Flowchart of outlier drop
3. No Outliers: The Uncontaminated Sample Case
3.1. Overview
To investigate the impact of outliers on the Box-Cox transformation and to evaluate the performance of the unified approach, a simulation study is first conducted on clean data, that is, samples generated from specified distributions with no outliers added. The effectiveness of the unified approach lies in its ability to provide the most accurate or appropriate Box-Cox transformation parameter λ; however, in most cases it is not easy to know the true value of λ. Therefore, distributions that can be transformed to a normal distribution with known λ are used in the simulation. The first is the normal distribution itself, where λ = 1. This means that if one transforms a normally distributed sample, a reasonable method should yield a λ (approximately) equal to 1. The other distribution used is the Log-Normal distribution. If X is a random variable with a normal distribution, then Y = exp(X) has a log-normal distribution; likewise, if Y is log-normally distributed, then X = Log(Y) has a normal distribution, which indicates that λ is 0 according to the definition of the Box-Cox transformation.
Through the simulation on clean data, we would like to answer the following three questions:

a) Outlier detection: is the unified approach able to tell that the data are clean?

b) What is the distribution of λ found through the new approach? Since we know the true value, is the λ provided by the unified approach as expected?

c) If the unified approach detects outliers in clean data and excludes them, what does the distribution of the transformed sample look like?
To answer the above questions, the simulation strategy is set up in the following way:
a) Simulate samples from N(10,1) and the Log-Normal distribution.
b) The sample size ranges from 20 to 200 by 10.
c) For each sample size, generate 1000 repetitions.
d) In each repetition, apply the new approach to identify outliers and find the appropriate λ.
e) Test the normality of the transformed sample.
f) Proceed to the next repetition and repeat steps d) and e).
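The strategy above can be sketched as a loop; `scipy.stats.boxcox` stands in for the dissertation's λ search, and the repetition count is reduced for speed (20 rather than 1000):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lambdas = []
for size in range(20, 201, 10):          # b) sample sizes 20 to 200 by 10
    for _ in range(20):                  # c) repetitions (1000 in the study; fewer here)
        y = rng.normal(10.0, 1.0, size)  # a) clean N(10,1) sample
        t, lam = stats.boxcox(y)         # d) maximum-likelihood lambda
        _ = stats.anderson(t, dist="norm")  # e) normality check of transformed sample
        lambdas.append(lam)              # f) next repetition
print(len(lambdas), round(float(np.median(lambdas)), 2))
```

Collecting the fitted λ values across sizes and repetitions is what produces the box plots and confidence intervals reported below.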
For each sample, the unified approach provides the number of outliers detected, the λ that transforms the sample to (near) normality, and normality test statistics of the transformed sample such as the Anderson-Darling statistic, skewness, and kurtosis. For comparison, results from the following cases are considered:
a) Null case: the original sample; no anchoring, no outlier detection/exclusion, and no Box-Cox transformation.
b) Baseline case: Box-Cox transform the whole sample; no anchoring, no outlier detection/exclusion.
c) Anchoring-only case: anchor the original sample to 1 and Box-Cox transform the anchored sample; no outlier detection/exclusion.
d) Unified Approach: apply the unified approach to the original sample.
Given each original sample, the unified approach will yield an optimal solution based on the penalized information criterion, which includes the following four situations:
a) No outliers, no Box-Cox transformation is needed
b) No outliers, Box-Cox transformation is needed
c) Outliers are excluded and no Box-Cox transformation is needed
d) Outliers are excluded and Box-Cox is also needed
3.2. Anchoring-to-1
The first step is to look at the null case: given a random sample from N(10,1) without outliers, apply the Box-Cox transformation; λ is expected to be one. Figure 3-1 shows the box plot of λ across different sample sizes. The variation of λ is quite large, especially when the sample size is relatively small. Some outlying values of λ can be as large as 8 or as small as −7 (for example, when the sample size is 20), although the median is around one and the true value λ = 1 is covered between the first and third quartiles. If those extreme λ are used to transform the data, the resulting transformed sample may not follow the normal distribution as expected and can cause trouble for subsequent analysis. Figure 3-2 shows the 95% confidence interval of λ based on 1000 repetitions. For most sample sizes one is covered by the 95% confidence interval; for some cases, such as sample sizes 60, 70, and 80, one is not in the 95% confidence interval, indicating that for those normal samples Box-Cox will generate a λ that is not equal to one.
Table 3-1 provides an example where Box-Cox does not work well. The original sample is randomly drawn from N(10,1) with sample size 20; its histogram is shown in Figure 3-3. The λ found through Box-Cox is 9.13, which is far from one. The histogram of the transformed sample is shown in Figure 3-4. The magnitude of the transformed sample is greatly increased, to the order of 10^8, although the normality is improved in terms of skewness, kurtosis, the Anderson-Darling test, and the Shapiro-Wilk test (Table 3-2). This huge increase in the magnitude of the sample might make the analysis more complicated relative to the benefit of the improved fit to normality.
Figure 3-1 Box plot shows the variation of λ using regular BC on original sample
Figure 3-2 95% Confidence interval of lambda in baseline case
Table 3-1 An example where BC gives extreme lambda
λ = 9.13

Original     Transformed
10.5281      236356218.86
10.4717      225057637.76
9.0009       56513694.63
9.8508       128804488.19
10.0498      154599860.43
10.3015      193786523.46
10.1181      164464813.53
8.9294       52543192.84
10.3289      198538190.09
9.9043       135332303.71
8.9114       51585550.71
10.5156      233817390.30
10.8465      310252851.26
9.7124       113194033.18
10.1058      162650494.65
10.3895      209430291.92
10.8041      299366619.31
10.3644      204850365.23
10.3134      195840818.68
10.23        181836784.30
Table 3-2 An example where BC gives extreme lambda, cont.

λ = 9.13       Skewness   Kurtosis   Anderson-Darling   P-Value (A-D)   Shapiro-Wilk   P-Value (S-W)
Original       -1.0797    0.5257     1.0034             0.0095          0.8705         0.0120
Transformed    -0.1337    -0.1458    0.3206             >0.2500         0.9557         0.4613
Figure 3-3 A histogram shows the histogram of original sample
Figure 3-4 Histogram shows the histogram of transformed data with lambda 9.13
The Box-Cox λ is found by maximizing the likelihood of the transformed sample, which is supposed to follow a normal distribution. The profile log-likelihood function is concave and has a maximum point. However, sometimes the log-likelihood function can be quite flat, which makes it difficult to locate the λ that achieves the maximum; the maximizing λ may then be extreme or quite unstable. Such an example is given in Table 3-3 and Figure 3-5. The original sample is slightly skewed, although not too far from normality, while the λ found through Box-Cox is −6.91, rendering the transformed sample points nearly identical, all equal to about 0.1447 (Table 3-4). This transformation is not applicable in practice since it collapses the sample almost to a single value (Figure 3-6).
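The flatness problem can be inspected by tracing the profile log-likelihood over a grid of λ values; this sketch uses `scipy.stats.boxcox_llf` and a hypothetical grid, not the dissertation's implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(10.0, 1.0, 20)         # small clean N(10,1) sample

grid = np.linspace(-10.0, 10.0, 201)  # hypothetical search grid for lambda
llf = np.array([stats.boxcox_llf(l, y) for l in grid])

# When the profile is nearly flat, many lambdas fit almost equally well,
# so the maximizer can swing wildly from sample to sample.
best = grid[np.argmax(llf)]
print(round(float(best), 2), round(float(llf.max() - llf.min()), 2))
```

A small spread between the maximum and minimum of the profile over a wide λ range is the symptom behind extreme estimates such as 9.13 or −6.91.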
Table 3-3 An example where Box-Cox breaks down, giving λ = −6.91

Original     Transformed
10.8945      0.14471779
10.1425      0.144717784
12.23        0.144717796
10.3208      0.144717786
9.7381       0.144717779
10.9127      0.144717791
10.5673      0.144717788
12.5883      0.144717797
9.5804       0.144717776
9.7774       0.144717779
9.3805       0.144717773
9.5144       0.144717775
9.6365       0.144717777
9.4778       0.144717774
10.9863      0.144717791
10.2257      0.144717785
10.2774      0.144717786
9.6361       0.144717777
10.5105      0.144717788
9.5349       0.144717776
Table 3-4 An example where Box-Cox breaks down, giving λ = −6.91, cont.

λ = −6.91      Skewness     Kurtosis   Anderson-Darling   P-Value (A-D)   Shapiro-Wilk   P-Value (S-W)
Original       1.37857132   1.632773   0.998438           0.0097          0.84657        0.0047
Transformed    .            .          .                  .               .              .
(The transformed sample is essentially constant, so the statistics for it are undefined.)
Figure 3-5 Histogram of original sample
Figure 3-6 Histogram of transformed sample using lambda=-6.91
All the above examples show that, even in clean samples, Box-Cox may not perform well in some cases; the λ obtained by direct use of Box-Cox has quite large variance, especially at small sample sizes. If outliers are present in the sample, applying Box-Cox directly without proper handling brings even more variability.
When the original sample is anchored to 1 before the Box-Cox transformation is applied, Figure 3-7 shows that, compared to the null case, the variation of λ decreases dramatically; the mean value of λ based on 1000 repetitions is quite close to one, although the 95% confidence interval of λ does not always cover one (Figure 3-8). This result confirms the effectiveness of the anchor-to-1 idea, which can dramatically decrease the variance of λ.
Note that whether to anchor to 1 depends on the smallest value of the original sample. If the minimum of the original sample is larger than 1, then the sample needs to be anchored to obtain a better estimate of λ; if the minimum is between zero and one, the sample should not be anchored. The following two examples show this scenario. Figure 3-9 and Figure 3-10 show the box plot of λ after the regular Box-Cox transformation of a random normal sample with minimum less than 1; from Figure 3-11 and Figure 3-12 it is noticed that the distribution of λ does not improve after anchoring the sample to 1, and the variance even becomes larger. This particular sample is drawn in two steps: first draw a sample from N(10,1), then shift it through y − min(y) + 0.5; the minimum of the new sample is 0.5, which is between 0 and 1.
When the minimum of a sample is less than 1, it is not necessary to anchor to 1. The following example shows this. A random sample is drawn from exp(Z), where Z follows a N(0,1) distribution; around half of the observations are between zero and one. The following four figures (Figure 3-13 to Figure 3-16) show the difference between anchoring to 1 and not anchoring to 1. It is noticed that after anchoring to 1, the variance of λ becomes larger. Therefore we propose that if the minimum of a sample is larger than 1, it needs to be anchored to 1; if the minimum is between 0 and 1, we do not need to anchor it to 1.
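The proposed rule can be stated as a small helper; this is an illustrative sketch of the anchor-to-1 decision, with the shift y − min(y) + 1 assumed from the description above:

```python
import numpy as np

def anchor_to_1(y):
    """Shift the sample so its minimum is 1, but only when min(y) > 1."""
    y = np.asarray(y, dtype=float)
    if y.min() > 1.0:
        return y - y.min() + 1.0   # anchored: new minimum is exactly 1
    return y                       # minimum already in (0, 1]: leave the sample alone

rng = np.random.default_rng(3)
a = rng.normal(10.0, 1.0, 50)          # minimum well above 1 -> anchored
b = np.exp(rng.normal(0.0, 1.0, 50))   # LogN(0,1): minimum typically below 1
print(anchor_to_1(a).min())            # 1.0
print(np.array_equal(anchor_to_1(b), b))
```

The lognormal sample passes through unchanged, matching the recommendation not to anchor when the minimum is between 0 and 1.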
Figure 3-7 Distribution of lambda after anchoring-to-1 the original sample
Figure 3-8 The mean and 95% confidence interval of lambda after anchoring-to-1
Figure 3-9 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted
Figure 3-10 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted, cont.
Figure 3-11 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda
Figure 3-12 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda, cont.
Figure 3-13 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda
Figure 3-14 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.
Figure 3-15 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda
Figure 3-16 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.
3.3. Unified Approach-Normal Distribution
The unified approach is applied to clean N(10,1) samples. Outlier detection, anchoring-to-1, Box-Cox transformation, and the penalized information criterion (BIC) are considered simultaneously here. In terms of anchoring-to-1, since the original sample is from N(10,1) and hence has a minimum value larger than 1, the original sample is anchored to 1 before the Box-Cox transformation. In the unified approach, outlier exclusion is implemented along with the Box-Cox transformation: extreme observations at the lower and upper ends of the sample can be dropped to achieve a better model fit, and the exclusion of observations is penalized in the BIC. Each time an observation at the lower end is dropped, the sample again has a minimum value greater than 1; hence the sample is re-anchored to 1 before the Box-Cox transformation. The anchoring-to-1, the exclusion of observations, and the Box-Cox transformation are all penalized in the information criterion to obtain a trade-off between better model fitting (normality) and the complexity of the model (over-analyzing).
Figure 3-17 shows the outliers detected by the unified approach in different sample sizes.
For each sample size, 1000 repetitions are generated and the unified approach is applied to each repetition; it reports how many outliers are in the sample, whether those outliers should be excluded, and, if they are excluded, whether it is necessary to Box-Cox transform the remaining sample to achieve normality. For example, when the sample size is 20, among the 1000 repetitions, 774 repetitions have no outliers, 172 have one outlier, 34 have two outliers, 10 have three outliers, and 20 have six outliers. As the sample size becomes larger, the percentage of repetitions detected as having no outliers increases, from 92.1% at sample size 60 to 95.4% at sample size 200. This shows that the unified approach does well in identifying outliers; in the clean-data situation, it is able to tell that the samples being analyzed are clean in most cases.
In terms of the locations of the detected outliers, it is noticed from Figure 3-18 that they are located at both the lower and upper ends of the sample; in small sample sizes in particular, outliers are detected at both ends.
As for the performance of the Box-Cox transformation in the clean samples, λ is supposed to be 1 since the sample is normally distributed and hence no transformation is needed. Figure 3-19 shows the distribution of λ when the unified approach detects zero outliers in the sample, and Figure 3-20 shows the distribution of λ when at least one outlier is detected. It is noticed that λ in Figure 3-19 has smaller variation than in Figure 3-20, although Figure 3-20 contains fewer outlying values of λ.
In terms of model fitting (normality tests), there is not much difference between the cases where no outliers are detected and where one or more outliers are detected. Figure 3-21 and Figure 3-22 provide the P-values of the Anderson-Darling statistic.
Figure 3-17 Outliers Detected in Clean N(10,1) Samples, 1000 Repetitions
Figure 3-18 Location of detected outliers, N(10,1), Clean Sample
Figure 3-19 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, no outliers excluded, no penalty for Box-Cox
Figure 3-20 Box-plot of lambda: Unified Approach, N(10,1), Clean Data, outliers excluded(>=1)
Figure 3-21 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, no outliers detected, no penalty for Box-Cox
Figure 3-22 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, outliers excluded(>=1)
In the above analysis, no extra penalty is applied for performing the Box-Cox transformation. After anchoring the sample to 1 there are two choices: the first is to Box-Cox transform the sample to achieve a better model fit but pay an extra penalty in the BIC; the second is not to transform. It must therefore be considered whether Box-Cox is really needed, that is, whether it is worth transforming the sample to achieve normality at the cost of a penalty. The BIC values in the two situations are compared, and the decision is made based on the smaller BIC. When this scenario is taken into account, the unified approach generates quite different results from the previous analysis. When the penalty for Box-Cox is considered, there are four ways of handling each sample:
a) 0-0 case (leave it alone): no outlier detection/exclusion, no Box-Cox transformation.
b) 0-1 case: no outlier exclusion, do Box-Cox transformation.
c) 1-0 case: exclude outliers, no Box-Cox transformation.
d) 1-1 case: exclude outliers, do Box-Cox transformation.
Figure 3-23 shows the percentage of each case among the 1000 repetitions. The majority of repetitions fall into the first case, which means that, given a clean N(10,1) sample, nothing needs to be done to make the sample behave normally. When the sample size is larger than 80, this percentage is always above 90%. This agrees with our expectation since the sample is normally distributed. In the 0-0 case, no outliers are excluded and no Box-Cox transformation is implemented, so effectively λ = 1. In the cases where outliers are excluded but no Box-Cox is implemented, the outliers are detected at the lower end or at both the lower and upper ends of the sample (Figure 3-24).
Figure 3-23 Percent of four cases in unified approach with penalty to BC
Figure 3-24 Location of outliers, NO Box-Cox involved
Before Box-Cox is implemented, there may be outliers that need to be excluded. When outliers are detected, they are located either at the lower end or at both the lower and upper ends of the sample (Figure 3-25). The distribution of λ in this case is shown in Figure 3-26. When no outliers are found, the distribution of λ is shown in Figure 3-27. Since the number of repetitions in these cases is quite small, the box plots look sparse and some of them consist of only a single point.
Figure 3-25 Location of outliers, BC=1 and DP=1
Figure 3-26 Box-plot of Lambda: Unified Approach, Clean Data, BC=1, outliers Excluded
Figure 3-27 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, BC=1, no outliers excluded
3.4. Assessment-Normal Distribution
To evaluate the effectiveness of the unified approach, we conduct a simple assessment of the resulting mean by looking at the Mean Square Error (MSE) of the transformed sample. Based on the discussion in the previous section, the four cases have to be examined separately.
Null case: no outlier detection/exclusion, no Box-Cox transformation

The original sample is $Y_1, Y_2, \ldots, Y_n \sim N(10,1)$; then $MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}(\bar{Y}_k - 10)^2$, where $\bar{Y}_k = \frac{1}{SS}\sum_{i=1}^{SS} Y_{ki}$, $SS$ is the sample size (20 to 200 by 10), and $Y_{ki}$ is the $i$th observation in the $k$th repetition. $Rep$ is the number of repetitions for each sample size; here we take 1000 repetitions. Since the original sample is drawn from N(10,1), the Mean Square Error is based on the squared difference between the observed mean and the expected value 10. Because no Box-Cox transformation is involved and $Var(\bar{Y}_k) = 1/SS$, the MSE scaled by the sample size is expected to be one.
In the Unified Approach

a) If BC=0 (DP=0 or DP>0): $MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}(\bar{Y}_k - 10)^2$, where $\bar{Y}_k = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} Y_{ki}$ is the mean of the retained sample and $DP$ is the number of outliers excluded. As no Box-Cox transformation is involved, the Mean Square Error is based on the squared difference between the mean of the retained sample and the expected mean, 10. If outliers are excluded, the dropped observations are taken into account by subtracting $DP$ from the sample size; if no outliers are dropped, then $DP=0$. In this case the MSE calculation is the same as in the null case except that the sample size changes when outliers are dropped.

b) If BC=1 (DP=0 or DP>0): $MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(f_k^{-1}(\bar{Y}_k^{*}) - 10\right)^2$, where $\bar{Y}_k^{*} = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} f_k(Y_{ki})$ is the mean of the transformed sample in the $k$th repetition. Since Box-Cox is involved in the process, the mean of the transformed sample has to be back-transformed to the original scale using the transformation $f_k$ found in that repetition. If the sample was anchored to 1 before the Box-Cox transformation, it also needs to be un-anchored back to the original scale through $\bar{Y} = \bar{Y}^{*} + Y_{\min,k} - 1$, where $Y_{\min,k}$ is the sample minimum in the $k$th repetition. If outliers are excluded, the dropped observations are taken into account by subtracting $DP$ from the sample size; if no outliers are dropped, then $DP=0$.
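The BC=1 computation (transform, average, back-transform, un-anchor) can be sketched as follows; the repetition count is reduced for speed, and the scaling of the MSE by the sample size is an assumption made for comparability:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(5)
ss, rep, mu = 50, 200, 10.0           # reduced repetition count (200, not 1000)
sq_err = []
for _ in range(rep):
    y = rng.normal(mu, 1.0, ss)
    ymin = y.min()
    anchored = y - ymin + 1.0         # anchor-to-1 (N(10,1) minima exceed 1 here)
    t, lam = stats.boxcox(anchored)   # f_k: MLE Box-Cox transform
    back = inv_boxcox(t.mean(), lam)  # f_k^{-1} applied to the transformed-scale mean
    back = back + ymin - 1.0          # un-anchor: Ybar = Ybar* + Y_min - 1
    sq_err.append((back - mu) ** 2)
mse = ss * np.mean(sq_err)            # scaled by sample size for comparability
print(round(float(mse), 3))
```

For a clean sample this scaled quantity should land near 1, mirroring the Null column of Table 3-5.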
The comparison between the unified approach and the null case is shown in Table 3-5 and Figure 3-28 to Figure 3-31.
In the BC=0 and DP=0 case, where no outliers are dropped and no Box-Cox transformation is involved, the MSE is the same for the unified approach and the null case. Among the 1000 repetitions, the majority of cases fall into this category.
Table 3-5 Comparison of MSE
Sample   BC=0 DP=0              BC=0 DP=1             BC=1 DP=0               BC=1 DP=1             Null
size     U      N      Freq     U      N      Freq    U      N        Freq    U      N      Freq    (all reps)
20       1.094  1.094  743      1.376  1.003  215     1.380  295.883  31      2.377  1.648  11      1.081
30       1.033  1.033  818      1.246  0.879  140     1.095  174.384  32      0.605  0.611  10      1.003
40       0.972  0.972  851      1.257  1.044  116     1.312  145.789  22      2.516  0.878  11      0.970
50       0.998  0.998  876      1.096  0.930  96      1.019  105.216  20      2.160  1.439  8       0.988
60       1.050  1.050  891      1.329  1.154  76      1.534  101.004  30      1.103  2.546  3       1.077
70       1.125  1.125  893      1.128  0.952  72      1.074  81.245   30      1.068  1.270  5       1.128
80       0.930  0.930  905      1.093  0.981  63      1.073  68.846   29      0.498  0.085  3       0.946
90       1.035  1.035  919      0.787  0.764  64      0.601  50.342   15      3.848  0.833  2       1.016
100      1.047  1.047  903      1.566  1.421  68      2.225  106.488  27      3.388  1.849  2       1.082
110      0.963  0.963  924      1.066  0.921  42      0.982  44.768   30      4.193  1.826  4       0.957
120      0.970  0.970  926      0.754  1.099  50      0.828  37.680   22      2.934  0.008  2       0.971
130      0.971  0.971  931      1.663  1.242  52      2.066  86.460   12      2.288  0.637  5       0.979
140      0.990  0.990  943      1.229  0.688  35      0.845  43.022   17      1.621  1.163  5       0.980
150      0.925  0.925  933      1.145  0.881  49      1.009  56.081   16      3.081  1.201  2       0.925
160      1.023  1.023  935      1.477  0.834  51      1.232  75.317   12      2.126  0.580  2       1.013
170      1.049  1.049  938      1.136  1.061  49      1.205  55.644   13      .      .      0       1.049
180      1.047  1.047  939      1.010  0.908  46      0.917  46.451   14      2.066  1.110  1       1.041
190      1.053  1.053  939      1.117  0.968  49      1.082  54.743   12      .      .      0       1.052
200      1.077  1.077  943      1.148  0.721  45      0.828  51.660   11      1.080  0.001  1       1.055
U: Unified Approach; N: Null Case
Figure 3-28 Comparison of MSE, BC=0 DP=0, N(10,1) Clean Data
Figure 3-29 Comparison of MSE, BC=0 DP=1, N(10,1) Clean Data
Figure 3-30 Comparison of MSE, BC=1 DP=0, N(10,1) Clean Data
Figure 3-31 Comparison of MSE, BC=1 DP=1, N(10,1) Clean Data
3.5. Unified Approach-Lognormal Distribution
The sample is randomly drawn from the LogN(0,1) distribution, i.e., exp(Z) where Z follows N(0,1), and the unified approach is applied to this sample. Outlier detection, anchoring-to-1, Box-Cox transformation, and the penalized information criterion (BIC) are considered simultaneously here. In terms of anchoring-to-1, since the original sample is from LogN(0,1) and hence has a minimum value smaller than 1, the original sample does not need to be anchored to 1 before the Box-Cox transformation. Even if observations at the lower end are excluded during the process, it is still unnecessary to anchor the samples to 1: the samples are drawn from exp(N(0,1)), so around 50% of the observations are smaller than one, and at most 20% of observations are suspected to be outliers, so almost all the samples in the unified approach keep a minimum value smaller than 1. In the unified approach, outlier exclusion is implemented along with the Box-Cox transformation: extreme observations at the lower and upper ends of the sample can be dropped to achieve a better model fit, and the exclusion of observations is penalized in the BIC. The exclusion of observations and the Box-Cox transformation are penalized in the information criterion to obtain a trade-off between better model fitting (normality) and the complexity of the model (over-analyzing).
Figure 3-32 shows the outliers detected by the unified approach for different sample sizes.
For each sample size, 1000 repetitions are generated and the unified approach is applied to each repetition; it reports how many outliers are in the sample, whether those outliers should be excluded, and, if they are excluded, whether it is necessary to Box-Cox transform the remaining sample to achieve normality. For example, when the sample size is 20, among the 1000 repetitions, 315 repetitions have no outliers, 325 have one outlier, 162 have two outliers, 125 have three outliers, and 0 repetitions have six outliers. As the sample size becomes larger, the percentage of repetitions detected as having no outliers increases, from 51.8% at sample size 70 to 65% at sample size 200. This shows that the unified approach does well in identifying outliers; in the clean-data situation, it is able to tell that the samples being analyzed are clean in most cases.
In terms of the locations of the detected outliers, it is noticed from Table 3-6 that almost all of them are located at the upper end of the sample, which differs from the normal samples. This is due to the skewness of the lognormal distribution, which has a long tail on the right.
As for the performance of the Box-Cox transformation in the clean samples, λ is supposed to be 0 according to the definition of the Box-Cox transformation since the sample is lognormally distributed. Figure 3-34 shows the distribution of λ when the unified approach detects zero outliers in the sample, and Figure 3-35 shows the distribution of λ when at least one outlier is detected. It is noticed that λ in Figure 3-34 has much smaller variation than in Figure 3-35.
Figure 3-32 Outliers Detected in Clean LogN(0,1) Samples, 1000 Repetitions
Table 3-6 Location of outliers excluded, LogN(0,1), Clean Data

SAMPLE_SIZE   Low   High   Both
20            0     68.2   0.3
30            0.1   59.2   0.2
40            0     55.6   0.0
50            0     51.6   0.0
60            0     51.6   0.0
70            0     48.2   0.0
80            0     50.2   0.0
90            0     43.6   0.0
100           0     44.8   0.0
110           0     42.9   0.0
120           0     40.8   0.0
130           0     39.4   0.0
140           0     41.0   0.0
150           0     39.3   0.1
160           0     37.6   0.0
170           0     38.0   0.0
180           0     36.7   0.0
190           0     37.3   0.1
200           0     35.0   0.0
Figure 3-33 Location of outliers detected, LogN(0,1), Clean Data
Figure 3-34 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox
Figure 3-35 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox
In terms of model fitting (normality tests), there is not much difference between the cases where no outliers are detected and where one or more outliers are detected. Figure 3-36 and Figure 3-37 provide the P-values of the Anderson-Darling statistic.
Figure 3-36 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox
Figure 3-37 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox
In the above analysis, no extra penalty was given for performing the Box-Cox transformation. This means that, given a sample, there are two choices: the first is to Box-Cox transform the sample to achieve a better model fit but pay an extra penalty in the BIC; the second is not to transform. It must be considered whether Box-Cox is really needed, that is, whether it is worth transforming the sample to achieve normality at the cost of a penalty. The BIC values in the two situations are compared, and the decision is made based on the smaller BIC. When this scenario is taken into account, the unified approach generates different results from the previous analysis. When the penalty for Box-Cox is considered, there are four ways of handling each sample:
a) 0-0 case (leave it alone): no outlier exclusion, no Box-Cox transformation.
b) 0-1 case: no outlier exclusion, do Box-Cox transformation.
c) 1-0 case: exclude outliers, no Box-Cox transformation.
d) 1-1 case: exclude outliers, do Box-Cox transformation.
Figure 3-38 shows the percentage of each case among the 1000 repetitions. The majority of repetitions fall into the second and fourth cases, which means that, given a clean LogN(0,1) sample, the sample needs to be Box-Cox transformed, with appropriate exclusion of extreme observations at the high end, to make it behave normally. This agrees with the expectation since the sample is log-normally distributed. In the 0-1 case, no outliers are excluded and the Box-Cox transformation is implemented (Figure 3-39). In the 1-1 case, where Box-Cox is implemented and outliers are excluded, the distribution of λ is shown in Figure 3-40.
Figure 3-38 Percent of four cases in unified approach with penalty to BC, Clean LogN(0,1)
Figure 3-39 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, no outliers excluded
Figure 3-40 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, outliers excluded(>=1)
3.6. Assessment-Lognormal Distribution
To evaluate the effectiveness of the unified approach, we need to look at the Mean Square Error of the transformed sample. Based on the discussion in the previous section, the four cases have to be examined separately.
Null case: no outlier detection/exclusion, no Box-Cox transformation

The original sample is $Y_1, Y_2, \ldots, Y_n$ with $Y_i = \exp(Z_i)$, $Z_i \sim N(0,1)$; then $MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}(\bar{Z}_k - 0)^2$, where $\bar{Z}_k = \frac{1}{SS}\sum_{i=1}^{SS}\mathrm{Log}(Y_{ki})$, $SS$ is the sample size (20 to 200 by 10), and $Y_{ki}$ is the $i$th observation in the $k$th repetition. $Rep$ is the number of repetitions for each sample size; here we take 1000 repetitions. Since the original sample is drawn from LogNormal(0,1), the Mean Square Error is based on the squared difference between the mean of the logarithms of the original values and the expected value 0, as we have perfect knowledge of the samples. As $\bar{Z}_k$ is the average of the transformed observations, the computed MSE needs to be multiplied by the corresponding sample size for the purpose of comparison.
In the Unified Approach

a) If BC=0 (DP=0 or DP>0): $MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}(\bar{Y}_k - 0)^2$, where $\bar{Y}_k = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP}\mathrm{Log}(Y_{ki})$, $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the $i$th observation in the $k$th repetition, and $DP$ is the number of observations excluded from the sample. $Rep$ is the number of repetitions for each sample size; here we take 1000 repetitions. Since no Box-Cox transformation is involved here, the only difference from the null case is the sample size: the Mean Square Error is based on the squared difference between the mean of the logarithms of the remaining original values and the expected value 0.

b) If BC=1 (DP=0 or DP>0): $MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(f_k^{-1}(\bar{Y}_k^{*}) - 0\right)^2$, where $\bar{Y}_k^{*} = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} f_k(Y_{ki})$ is the mean of the retained transformed sample in the $k$th repetition after $DP$ observations are dropped. Since Box-Cox is involved in the process, the mean of the transformed sample needs to be back-transformed to the original scale using the transformation found in that repetition. As anchor-to-1 is not conducted here, there is no need to un-anchor the back-transformed sample. If outliers are excluded, the dropped observations are taken into account by subtracting $DP$ from the sample size; if no outliers are dropped, then $DP=0$.
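For the lognormal case, the null-case formula can be checked directly, along with the expectation that the fitted λ concentrates near 0; the repetition count here is reduced, and the scaled-by-sample-size comparison follows the text:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
ss, rep = 50, 300                       # reduced repetition count (300, not 1000)
lams, sq_err = [], []
for _ in range(rep):
    y = np.exp(rng.normal(0.0, 1.0, ss))   # LogN(0,1); minimum < 1, so no anchoring
    _, lam = stats.boxcox(y)               # fitted lambda should sit near 0 (the log)
    lams.append(lam)
    sq_err.append(np.log(y).mean() ** 2)   # null-case MSE term: (Zbar_k - 0)^2
print(round(float(np.median(lams)), 2))
print(round(ss * float(np.mean(sq_err)), 2))   # scaled MSE, expected near Var(Z) = 1
```

Since the log of a LogN(0,1) observation has variance 1, the sample-size-scaled MSE settles near 1, which is the benchmark the unified approach is compared against.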
The comparison between the unified approach and the null case is shown in Figure 3-41 and Figure 3-42. It is seen that the MSE of the unified approach is quite close to one, as expected.
In the case where both outlier exclusion and the Box-Cox transformation are applied, the MSE is even closer to 1 than in the null case. This is because the Log-Normal sample is drawn from exp(Z), where Z follows N(0,1): when the sample is back-transformed (the logarithm is taken), the variance of the back-transformed sample is expected to be 1.
Figure 3-41 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=0
Figure 3-42 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=1
4. Outliers: The Contaminated Sample Case
4.1. Samples with Outliers
To study the effectiveness of Box-Cox transformation in the presence of outliers and verify the usefulness of the unified approach, it is helpful to perform simulations in samples with outliers. To begin with, random samples from known distributions will be generated. Initially, random samples from a Normal distribution with mean 10 and variance 1, with outliers, will be generated. The sample sizes examined will range from 20 to 200 (by 10) and the percentage of outliers in the samples will be 5%, 10%, 15%, and 20%. Potential outliers will be obtained by randomly selecting observations from the sample and then multiplying them by a factor of 10.
Here the probability of each observation being selected to be an outlier is 5% to 20%. This means that for a specific sample of size, say, 100, the expected percentage of generated outliers is 5%; the actual number of observations multiplied by 10 is not necessarily equal to 5, but over many repetitions the mean number of outliers generated will be 5 when the sample size is 100 and the outlier percentage is 5%. Random samples from other distributions (e.g. Lognormal) will be generated in a similar manner. Here only observations at the high end, i.e. the extremely large observations, are considered to be suspicious outliers.
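The contamination mechanism (each observation independently multiplied by 10 with probability p) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(7)

def contaminate(y, p, factor=10.0):
    """Multiply each observation by `factor` independently with probability p."""
    y = np.asarray(y, dtype=float)
    mask = rng.random(y.size) < p      # each point becomes an outlier with prob p
    out = y.copy()
    out[mask] *= factor
    return out, int(mask.sum())

y = rng.normal(10.0, 1.0, 100)         # clean N(10,1) sample
contaminated, n_out = contaminate(y, 0.05)
print(n_out)   # close to 5 on average, but varies from repetition to repetition
```

Because selection is per observation, the realized outlier count is binomial with mean n·p rather than fixed, exactly as described above.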
Here is an example of a sample with outliers added. The clean data are generated from Normal(10, 1), and 5% to 30% of the original observations are randomly chosen and multiplied by 10 to produce outliers. The difference between the clean and contaminated samples is evident in the figures and tables: samples with outliers are right-skewed, and the normality test statistics, histogram, and normal density plot all indicate that the contaminated sample deviates substantially from the normal distribution. For example, when the sample size is 100, even with only 5% outliers the Anderson-Darling statistic jumps from 0.36 to 31, the kurtosis from -0.04 to 46.75, and the skewness from -0.11 to 6.89. As the percent of outliers increases from 10% to 30%, the Anderson-Darling statistic decreases from 31.73 to 20.03, the kurtosis drops from 29.61 to -0.51, and the skewness falls from 5.56 to 1.18. The Shapiro-Wilk, Kolmogorov-Smirnov, and Cramer-von Mises normality test statistics show a similar pattern as the percent of outliers increases. This is because, as the percent of outliers grows, the multiplied observations become the majority of the sample and can no longer be treated as outliers. In our settings, for example, if 90% of the observations are multiplied by 10, the sample is equivalent to one drawn from a normal distribution with mean 100 and variance 100 contaminated by 10% extremely small outliers.
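The effect of contamination on these statistics is easy to reproduce in miniature with SciPy. This is illustrative only; the exact numbers depend on the random seed and will not match the table below.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(10, 1, size=100)
dirty = clean.copy()
dirty[rng.choice(100, size=5, replace=False)] *= 10   # 5% outliers

for label, s in [("clean", clean), ("contaminated", dirty)]:
    ad = stats.anderson(s, dist="norm").statistic     # Anderson-Darling
    print(f"{label}: skew={stats.skew(s):.2f} "
          f"kurt={stats.kurtosis(s):.2f} AD={ad:.2f}")
```

Even a single run shows the qualitative pattern in the table: the contaminated sample's skewness, excess kurtosis, and Anderson-Darling statistic all explode relative to the clean sample.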
Figure 4-1 Distributions of Normal samples with outliers
Table 4-1 Goodness of Fit statistics of Normal samples with/without outliers
(A-D = Anderson-Darling, S-W = Shapiro-Wilk, K-S = Kolmogorov-Smirnov)

N=20
% of Outlier  Skewness  Kurtosis  A-D Stat  A-D P    S-W Stat  S-W P    K-S Stat  K-S P
0%             1.4589    1.3617    1.3932  <0.0050    0.8142   0.0014    0.2291  <0.0100
5%             4.4472   19.8416    6.3656  <0.0050    0.2791  <0.0001    0.4821  <0.0100
10%            4.4472   19.8416    6.3656  <0.0050    0.2791  <0.0001    0.4821  <0.0100
15%            2.1329    2.8700    5.2783  <0.0050    0.4715  <0.0001    0.4745  <0.0100
20%            1.0381   -0.8658    3.5196  <0.0050    0.6469  <0.0001    0.4116  <0.0100
30%            0.5466   -1.7131    2.7270  <0.0050    0.7169  <0.0001    0.3630  <0.0100

N=50
% of Outlier  Skewness  Kurtosis  A-D Stat  A-D P    S-W Stat  S-W P    K-S Stat  K-S P
0%            -0.5493    0.6696    0.6930   0.0698    0.9620   0.1081    0.1231   0.0571
5%             6.9953   49.2761   15.5805  <0.0050    0.1843  <0.0001    0.4757  <0.0100
10%            6.9953   49.2761   15.5805  <0.0050    0.1843  <0.0001    0.4757  <0.0100
15%            3.2552    9.1514   15.2254  <0.0050    0.3493  <0.0001    0.5032  <0.0100
20%            1.9472    1.9381   12.9844  <0.0050    0.4852  <0.0001    0.4847  <0.0100
30%            0.9763   -0.9886    9.0427  <0.0050    0.6385  <0.0001    0.4234  <0.0100

N=100
% of Outlier  Skewness  Kurtosis  A-D Stat  A-D P    S-W Stat  S-W P    K-S Stat  K-S P
0%            -0.1137   -0.0381    0.3600  >0.2500    0.9914   0.7743    0.0600  >0.1500
5%             6.8897   46.7486   31.0268  <0.0050    0.1806  <0.0001    0.4543  <0.0100
10%            5.5553   29.6115   31.7335  <0.0050    0.2130  <0.0001    0.4712  <0.0100
15%            3.1855    8.4605   30.5585  <0.0050    0.3463  <0.0001    0.4927  <0.0100
20%            2.3078    3.5654   27.5557  <0.0050    0.4443  <0.0001    0.4851  <0.0100
30%            1.1783   -0.5115   20.0336  <0.0050    0.6077  <0.0001    0.4428  <0.0100

N=150
% of Outlier  Skewness  Kurtosis  A-D Stat  A-D P    S-W Stat  S-W P    K-S Stat  K-S P
0%             0.1587    0.5661    0.3677  >0.2500    0.9929   0.6623    0.0514  >0.1500
5%             5.9031   33.4386   48.7243  <0.0050    0.1928  <0.0001    0.4704  <0.0100
10%            4.3321   17.0561   48.8724  <0.0050    0.2540  <0.0001    0.4874  <0.0100
15%            2.4920    4.3382   43.9501  <0.0050    0.4046  <0.0001    0.4883  <0.0100
20%            1.8499    1.5209   38.8027  <0.0050    0.4906  <0.0001    0.4725  <0.0100
30%            0.9334   -1.0686   28.0832  <0.0050    0.6307  <0.0001    0.4227  <0.0100

N=200
% of Outlier  Skewness  Kurtosis  A-D Stat  A-D P    S-W Stat  S-W P    K-S Stat  K-S P
0%             0.0207   -0.4445    0.2900  >0.2500    0.9935   0.5267    0.0373  >0.1500
5%             6.8671   46.0285   62.5194  <0.0050    0.1763  <0.0001    0.4605  <0.0100
10%            5.1181   24.7308   63.4635  <0.0050    0.2292  <0.0001    0.4799  <0.0100
15%            2.6245    5.0245   58.7372  <0.0050    0.3950  <0.0001    0.4944  <0.0100
20%            1.9086    1.7491   52.1481  <0.0050    0.4863  <0.0001    0.4804  <0.0100
30%            0.9560   -1.0169   37.4026  <0.0050    0.6327  <0.0001    0.4244  <0.0100
Here is another example of samples with outliers. The clean sample is generated from a Lognormal distribution by exponentiating a N(0,1) random variable. The following table shows that the skewness and kurtosis vary considerably as the percent of outliers increases. The histograms of the contaminated samples show that outliers are harder to find in Lognormal data: an observation randomly chosen from the lower part of the sample, once multiplied by 10, lands in the long right tail of the sample and blends in with the original large values, making outlier detection difficult.
Figure 4-2 Distributions of Lognormal samples with outliers
Table 4-2 Goodness of Fit statistics of Lognormal samples with/without outliers
% of                      Skewness                                 Kurtosis
Outlier   N=20   N=50   N=100  N=150  N=200     N=20    N=50    N=100   N=150   N=200
0%        2.750  1.931  3.713  5.075  2.734     6.642   3.191   18.568  34.853   9.342
5%        2.058  2.596  3.222  3.728  3.445     2.776   7.947   11.612  15.334  15.571
10%       2.058  2.596  3.058  3.549  3.312     2.776   7.947    9.788  13.662  14.438
15%       1.588  4.657  5.748  3.661  4.166     1.026  25.724   40.634  15.251  22.926
20%       3.964  3.089  6.717  3.260  6.048    16.508  10.224   51.309  11.699  43.810
30%       3.972  3.329  5.567  8.727  5.091    16.571  11.545   35.153  92.166  29.964
4.2. Simulation on N(10,1) samples
• Baseline
In the baseline case, where Box-Cox is applied to the sample directly, the transformation is adversely affected by the outliers and λ deviates from 1, as shown in the following figure. The variation of λ is large when the sample size is small and the percent of outliers is low. With only 5% outliers, the median of λ is around -2 while some outlying values of λ are near 1; as more outliers are added, λ stabilizes around -2.
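This pull of λ away from 1 is easy to reproduce with SciPy's maximum-likelihood Box-Cox estimate. A sketch only: the exact λ depends on the seed, and the figure's values come from the dissertation's own simulations, not this snippet.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(10, 1, size=100)
x[rng.choice(100, size=5, replace=False)] *= 10   # 5% high-end outliers

_, lam = stats.boxcox(x)   # MLE of lambda on the raw (baseline) sample
# For a clean N(10,1) sample lambda should be near 1; the heavy right
# tail drags the estimate far below that, typically strongly negative.
print(lam)
```

The likelihood prefers a negative λ here because negative powers compress the inflated right tail toward the body of the sample.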
Figure 4-3 Distribution of Lambda in Baseline with outliers, N(10,1)
[Four panels: 95% confidence intervals of λ versus sample size, Baseline case, N(10,1), with 5%, 10%, 15%, and 20% outliers.]
Figure 4-4 Confidence Interval of Lambda in Baseline with outliers, N(10,1)
• Anchor-to-1
If the sample is anchored to 1, the variation of λ is greatly alleviated compared to the baseline case. Most of the outlying values of λ are gone, although some large values of λ remain in small samples. When the percent of outliers is 20%, the mean of λ lies between -1 and -0.5, with a few values of λ larger than 1.
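A minimal sketch of the anchor-to-1 step (the function name is ours): shift the sample so its minimum is exactly 1 whenever the minimum exceeds 1, then estimate λ as usual.

```python
import numpy as np
from scipy import stats

def anchored_boxcox(x):
    """If min(x) > 1, shift the sample so its minimum is exactly 1,
    then estimate the Box-Cox lambda on the shifted sample."""
    x = np.asarray(x, dtype=float)
    if x.min() > 1:
        x = x - x.min() + 1.0       # anchor-to-1 shift
    transformed, lam = stats.boxcox(x)
    return transformed, lam

rng = np.random.default_rng(3)
y = rng.normal(10, 1, size=100)
y[rng.choice(100, size=5, replace=False)] *= 10
_, lam_anchored = anchored_boxcox(y)
```

Anchoring fixes the scale of the smallest observation, which is what tames the extreme λ values seen in the baseline case.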
Figure 4-5 Distribution of Lambda, Anchor-to-1, with outliers, N(10,1)
[Four panels: 95% confidence intervals of λ versus sample size, Anchor-to-1, N(10,1), with 5%, 10%, 15%, and 20% outliers.]
Figure 4-6 Confidence Interval of Lambda, Anchor-to-1, with outliers, N(10,1)
• Accuracy of outlier detection
To evaluate the efficacy of the Unified Approach in detecting outliers, it is desirable to check how many outliers are found in each sample. The following steps are used to compute this accuracy. For each sample, the true number of observations multiplied by 10 (the generated outliers) is known. The Unified Approach reports how many outliers are in each sample, how many of them should be dropped, and whether Box-Cox is needed to transform the remaining sample to achieve normality. In addition, the Box-Cox transformation parameter λ is computed by the Unified Approach. Since the Unified Approach detects outliers at both ends of the sample, it reports how many outliers are detected at each end.
Denote by num_outlier the true number of outliers in the sample, by upper_drop the number of outliers detected at the high end, by low_drop the number of outliers detected at the low end, and by outlier_drop the total number of outliers detected (upper_drop plus low_drop). For each sample, the following measurements are used to evaluate the accuracy of outlier detection:
a) If abs(upper_drop - num_outlier) <= 1 then accurate_1 = 1; else accurate_1 = 0.
b) If abs(low_drop - 0) <= 1 then accurate_2 = 1; else accurate_2 = 0.
c) If abs(low_drop - 0) <= 1 and abs(upper_drop - num_outlier) <= 1 then accurate_3 = 1; else accurate_3 = 0.
d) If low_drop = 0 and abs(upper_drop - num_outlier) <= 1 then accurate_4 = 1; else accurate_4 = 0.
e) If low_drop = 0 and upper_drop = num_outlier then accurate_5 = 1; else accurate_5 = 0.
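The five indicators can be written directly in code; this sketch simply transcribes the definitions a) through e), with variable names taken from the text.

```python
def accuracy_flags(num_outlier, upper_drop, low_drop):
    """Return (accurate_1, ..., accurate_5) per criteria a) through e)."""
    a1 = int(abs(upper_drop - num_outlier) <= 1)       # high end, tolerance 1
    a2 = int(abs(low_drop - 0) <= 1)                   # low end, tolerance 1
    a3 = int(a1 == 1 and a2 == 1)                      # both, with tolerance
    a4 = int(low_drop == 0 and abs(upper_drop - num_outlier) <= 1)
    a5 = int(low_drop == 0 and upper_drop == num_outlier)  # exact match
    return a1, a2, a3, a4, a5

# Example: 5 true outliers; 4 detected high, 1 detected low.
# accurate_1..3 pass (within tolerance 1) but accurate_4..5 fail.
```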
The first criterion measures the accuracy of outlier detection at the high end. The second measures the accuracy at the low end; the true number of low-end outliers is zero because outliers are generated by multiplying observations by 10. The third measures low-end and high-end detection jointly with some flexibility; the fourth is stricter, requiring the low-end detection to be exactly accurate; the fifth is the strictest, requiring both low-end and high-end detection to match the true case exactly. The Unified Approach may find outliers at the low end even though outliers are generated only at the high end. This is probably due to the masking/swamping effects, which are among the major difficulties in outlier detection.
For each sample size, there are 1000 repetitions from which the five accuracy measurements of outlier detection are obtained. The percentages of "accurate" outlier detections based on these measurements are plotted in the following figures.
• Unified Approach without penalty for Box-Cox
When the Unified Approach is applied to the samples with outliers, no extra penalty for conducting the Box-Cox transformation is included in the BIC at first. The λ found through the Unified Approach has much smaller variance. Cases where outliers are detected and excluded only at the high end yield a narrower confidence interval for λ than cases where outliers are also detected at the low end. The correct rates for the high-end and both-end cases are almost the same, since the two sets of repetitions coincide most of the time.
The correct rate of the Unified Approach in detecting outliers is above 95% for the first four accuracy criteria. When the sample size is larger than 50, even accurate_5 (the most stringent criterion) achieves an accuracy rate above 95%. The following figures show the distribution of λ found through the Unified Approach under the different accuracy criteria. The variance of λ becomes smaller and the number of extremely large and small values of λ decreases.
Figure 4-7 Accuracy of outlier detection, N(10,1), 5% outliers, no penalty for BC
Figure 4-8 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy1=1, no extra penalty for Box-Cox
Figure 4-9 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy2=1, no extra penalty of Box-Cox
Figure 4-10 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy3=1, no extra penalty of Box-Cox
Figure 4-11 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy4=1, no extra penalty of Box-Cox
Figure 4-12 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy5=1, no extra penalty of Box-Cox
• Unified Approach with penalty for Box-Cox
When the Unified Approach is applied to the samples with outliers and an extra penalty is charged for conducting the Box-Cox transformation, in most cases it decides that outlier exclusion alone is sufficient to achieve normality, rather than performing an additional Box-Cox transformation. When the sample size is larger than 50, more than 90% of the 1000 repetitions require only exclusion of outliers. Since these are contaminated normal samples, it is expected that excluding the outliers leaves a normal sample, and this is verified by the simulation results. Applying the same accuracy measurements used in the previous (no-penalty) section, the accuracy of outlier detection remains promising when the penalty for Box-Cox is considered.
The following box plots show the distribution of λ in the cases where accurate_5 = 1, including all four handling methods: BC=1 DP=1, BC=1 DP=0, BC=0 DP=1, and BC=0 DP=0.
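How the penalized criterion arbitrates between the handling methods can be sketched as follows. This is an illustration only: the dissertation's exact adjusted BIC is not reproduced here, and the equal unit penalty per excluded observation, per Box-Cox application, and per anchoring is an assumption stated in the comments.

```python
import numpy as np
from scipy import stats

def adjusted_bic(x, n_dropped, did_boxcox, did_anchor):
    """Gaussian BIC of the handled sample, with one extra 'parameter'
    charged per excluded observation and one each for Box-Cox and
    anchor-to-1 (illustrative penalty weights, not the exact form)."""
    n = len(x)
    loglik = np.sum(stats.norm.logpdf(x, x.mean(), x.std()))
    k = 2 + n_dropped + int(did_boxcox) + int(did_anchor)
    return -2.0 * loglik + k * np.log(n)

rng = np.random.default_rng(5)
x = rng.normal(10, 1, size=100)
x[rng.choice(100, size=5, replace=False)] *= 10
xs = np.sort(x)

bic_raw = adjusted_bic(xs, 0, False, False)         # DP=0 BC=0: keep everything
bic_drop = adjusted_bic(xs[:-5], 5, False, False)   # DP=1 BC=0: drop top 5
# For a contaminated normal sample, dropping the outliers alone should win
# despite the per-observation penalty, matching the simulation finding.
```

The tradeoff is visible in the two terms: excluding the five inflated observations costs 5 extra penalty units but improves the Gaussian log-likelihood enormously.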
Figure 4-13 Percent of different handling methods, 5% outliers, N(10,1), penalty for BC
Figure 4-14 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-15 Distribution of lambda, Unified Approach, N(10,1), 5% outlier with penalty for BC, accuracy5=1 cases
10% outliers:
Figure 4-16 Accuracy of outlier detection, N(10,1), 10% outliers, no penalty for BC
Figure 4-17 Percent of different handling methods, 10% outliers, penalty for BC
Figure 4-18 Accuracy of outlier detection, N(10,1), 10% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-19 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 10% outlier
15% outliers
Figure 4-20 Accuracy of outlier detection, N(10,1), 15% outliers, no penalty for BC
Figure 4-21 Percent of different handling methods, 15% outliers, penalty for BC
Figure 4-22 Accuracy of outlier detection, N(10,1), 15% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-23 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 15% outlier
20% outliers
Figure 4-24 Accuracy of outlier detection, N(10,1), 20% outliers, no penalty for BC
Figure 4-25 Percent of different handling methods, 20% outliers, penalty for BC
Figure 4-26 Accuracy of outlier detection, N(10,1), 20% outliers, penalty for BC, DP=1 and BC=0 cases
Figure 4-27 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 20% outlier
4.3. Simulation on Lognormal samples
To further test the Unified Approach, Lognormal samples are used to verify the accuracy of both outlier detection and the Box-Cox transformation. A Lognormal sample can be transformed to normality through the log transformation, i.e., λ = 0. In the presence of outliers, the Unified Approach is expected to detect the outliers accurately and to log-transform the remaining sample. The random samples used here are generated as exp(Z), where Z is from N(0,1). Since the minimum of such a sample is typically less than 1, the Unified Approach does not anchor the sample to 1 before conducting the Box-Cox transformation.
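The expected behavior on a clean Lognormal sample is easy to check with SciPy (a sketch; the estimated λ varies with the seed but should sit near 0 for a moderate sample size):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
z = rng.normal(0, 1, size=200)
x = np.exp(z)                 # clean Lognormal(0,1); the minimum is usually < 1
_, lam = stats.boxcox(x)      # MLE of lambda; the log transform corresponds to 0
print(round(lam, 3))
```

Because λ = 0 reduces the Box-Cox family to the logarithm, and log(x) = z is exactly N(0,1) here, the MLE concentrates near 0 as n grows.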
Given such a sample, there are four possible conclusions: outliers are dropped and Box-Cox is needed (DP=1 and BC=1); outliers are dropped and no Box-Cox is needed (DP=1 and BC=0); no outliers are detected and Box-Cox is needed (DP=0 and BC=1); and no outliers are detected and no Box-Cox is needed (DP=0 and BC=0).
It is noticed from the following figures that, out of the 1000 repetitions, most samples choose to drop outliers and Box-Cox transform the remaining observations even when the penalty for doing Box-Cox is considered (DP=1 and BC=1) (Figures 4-28, 4-33, 4-38, 4-43). The value of λ in the cases where accuracy5=1, BC=1, and DP=1 is close to the expected value 0 (Figures 4-31, 4-32, 4-36, 4-37, 4-41, 4-42, 4-46, 4-47). The variance of λ is small and the confidence interval for the mean of λ is quite narrow, covering 0 most of the time. The accuracy of outlier detection, however, is not as good as in the normal samples, and the accuracy rate drops dramatically as the sample size increases. This is probably due to the skewness of the Lognormal sample: an outlier with value 15, generated by multiplying 1.5 by 10, is not easy to distinguish from an original value of 15 in the right tail of the sample. The regular outlier-detection accuracy measurements therefore may not reflect this situation well.
5%
Figure 4-28 Percent of different handling methods, 5% outliers, LogN(0,1), penalty for BC
Figure 4-29 Accuracy of outlier detection, LogN(0,1), 5% outliers, no penalty for BC
Figure 4-30 Accuracy of outlier detection, LogN(0,1), 5% outliers, penalty for BC
Figure 4-31 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC
Figure 4-32 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
10%
Figure 4-33 Percent of different handling methods, 10% outliers, LogN(0,1), penalty for BC
Figure 4-34 Accuracy of outlier detection, LogN(0,1), 10% outliers, no penalty for BC
Figure 4-35 Accuracy of outlier detection, LogN(0,1), 10% outliers, penalty for BC
Figure 4-36 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC
Figure 4-37 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
15%
Figure 4-38 Percent of different handling methods, 15% outliers, LogN(0,1), penalty for BC
Figure 4-39 Accuracy of outlier detection, LogN(0,1), 15% outliers, no penalty for BC
Figure 4-40 Accuracy of outlier detection, LogN(0,1), 15% outliers, penalty for BC
Figure 4-41 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC
Figure 4-42 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
20%
Figure 4-43 Percent of different handling methods, 20% outliers, LogN(0,1), penalty for BC
Figure 4-44 Accuracy of outlier detection, LogN(0,1), 20% outliers, no penalty for BC
Figure 4-45 Accuracy of outlier detection, LogN(0,1), 20% outliers, penalty for BC
Figure 4-46 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC
Figure 4-47 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, penalty for BC
5. Discussion and future research
In this dissertation, a Unified Approach is proposed to handle outlier detection and the Box-Cox transformation simultaneously, using a penalized information criterion. Extensive simulation studies have been conducted, and the major findings are summarized below:
1) The Box-Cox transformation does not always work well in terms of finding an appropriate value of λ. In the simulations of chapter 2, for N(10,1) samples, the values of λ found through Box-Cox include extreme values such as 9 and -6, far from the expected value 1. With such extreme λ the transformed values can be very large or very small, which is undesirable even when the transformed sample passes a normality test such as Anderson-Darling. For Lognormal samples, λ is expected to be close to zero, yet the simulations show that Box-Cox does not always provide a λ close to zero. The variance of λ is found to be quite large, especially in small samples.
2) The anchor-to-1 method is proposed to solve this problem. A sample whose minimum is larger than 1 should be shifted so that its minimum equals 1 before the Box-Cox transformation. With anchor-to-1, the variance of the Box-Cox λ is much smaller and the extreme values of λ disappear, compared to not anchoring. The confidence interval for the mean of λ covers the true value most of the time for sample sizes ranging from 20 to 200.
3) Building on the idea of anchor-to-1, a Unified Approach is proposed to handle outliers and the Box-Cox transformation at the same time. For a given sample, with or without outliers, the Unified Approach reports how many outliers are in the sample, how many extreme observations should be excluded, and what Box-Cox λ should be used to transform the remaining sample to achieve normality. The idea behind the Unified Approach is to find a tradeoff between better model fitting (normality) and less information loss (observations excluded). The criterion employed is an adjusted BIC, which penalizes outlier exclusion, anchor-to-1, and the Box-Cox transformation simultaneously.
4) The efficacy of the Unified Approach is first verified on clean samples. According to the simulation results in chapter 3, the Unified Approach works satisfactorily on clean normal samples: it accurately detects that there are no outliers, and based on the penalized BIC it finds a λ close to the true value 1, indicating that neither Box-Cox nor outlier exclusion is needed. Simulations have also been conducted on clean Lognormal samples. The number of outliers found by the Unified Approach is no more than 2 in more than 90% of the 1000 repetitions, and the identified outliers are mostly located at the high end of the sample. For each clean Lognormal sample, the Unified Approach reaches one of only two solutions: no outlier exclusion with a Box-Cox transformation (DP=0 and BC=1), or exclusion of some high-end observations with a Box-Cox transformation of the remaining sample (DP=1 and BC=1). The value of λ found through the Unified Approach is quite close to 0; its variance becomes smaller and its confidence interval covers zero once the sample size exceeds 50.
5) The Unified Approach has also been tested on normal and Lognormal samples with outliers. Outliers are randomly generated in clean samples by multiplying the original observations by 10. To evaluate the accuracy of outlier detection, five measurements are created (accuracy1 to accuracy5), of which accuracy5 is the most stringent. On the normal samples, the Unified Approach works satisfactorily. Without an extra penalty for the Box-Cox transformation, accuracy5 is larger than 90% when the sample size exceeds 50, and the value of λ is close to 1 with variance decreasing as the sample size increases. With the penalty for Box-Cox, when the sample size exceeds 50, in more than 90% of the repetitions the Unified Approach chooses to exclude only some extremely large observations and not to apply Box-Cox, which is consistent with the true case. For samples with more outliers added, similar conclusions can be drawn from the simulations. On the Lognormal samples, for most of the repetitions (around 90%), the Unified Approach chooses to exclude extreme observations and to Box-Cox transform the remaining sample (DP=1 and BC=1), and the value of λ is close to 0 as expected. The accuracy of outlier detection, however, is not as good as on the normal samples, because the Lognormal sample has a long right tail and it is difficult to distinguish outliers from large original values; for example, an outlier generated by multiplying 1.2 by 10 looks no different from an original value of 12.
This dissertation has explored the Box-Cox transformation for dealing with skewed samples and outliers, and a Unified Approach has been proposed for handling samples with outliers. A few directions might be considered in future research:
1) The penalty function used in the Unified Approach treats each of the following as an extra parameter: the Box-Cox transformation, each excluded observation, and anchor-to-1. They are penalized equally; for instance, in a case where one observation is excluded, Box-Cox is conducted, and anchor-to-1 is used, the total penalty count is 3. It might be worthwhile to allocate different weights. For example, among excluded observations, the largest observation dropped could be penalized by 1, the second largest by 2, and so forth. The penalties for doing Box-Cox, anchor-to-1, and observation exclusion could also differ from one another. Outliers dropped at both the high and low ends could be penalized more heavily than outliers dropped at only one end; for example, two outliers dropped at opposite ends (one at each end) could be penalized by 5 (3 for the high-end drop and 2 for the low-end drop) rather than 4 (2 for each drop at either end). Another consideration is the penalty term itself: in BIC the penalty is k*log(n), where k is the number of parameters; other forms, such as the small-sample correction term 2k(k+1)/(n-k-1) used in AICc, might be considered.
2) Outliers at the low end could also be considered in outlier detection. In this dissertation, outliers are generated only at the high end for simplicity, but extremely small observations at the low end of a sample are not rare in practice. The impact of low-end observations might be alleviated by shifting the sample to the right; however, the shifted sample may still be skewed and fail a normality test. Considering outliers at both the high and low ends would therefore provide a more general solution for outlier detection and data transformation.
3) Outliers are detected by checking all combinations of suspicious observations. In this dissertation, only 20% of the observations are considered suspicious outliers. For small samples this is not a problem, but for large samples, say n = 1000, 201 x 201 = 40401 combinations need to be checked with the penalized BIC, which imposes a substantial computational burden.
4) In deciding how many observations should be excluded and what λ should be used to achieve normality, the adjusted BIC is used as the criterion: the solution with the smallest adjusted BIC is selected. A potential problem is that the difference in adjusted BIC between candidate solutions may be very small. For example, BICc = 201 may suggest λ = 0.8 with 5 outliers excluded, while BICc = 200.8 suggests λ = 1 with 6 outliers excluded. Does using λ = 0.8 instead of λ = 1 make a real difference given the 0.2 gap in the criterion? Some flexibility should be allowed in choosing the parameters, since in practice choices are often limited to a few values of λ, such as integers between -3 and 3.
Bibliography
[1]. Grubbs, Frank (February 1969), Procedures for Detecting Outlying Observations in
Samples, Technometrics, 11(1), pp. 1-21.
[2]. R. B. Dean and W. J. Dixon (1951) "Simplified Statistics for Small Numbers of
Observations". Anal. Chem., 1951, 23 (4), 636–638.
[3]. Peirce, Benjamin, "Criterion for the Rejection of Doubtful Observations", Astronomical
Journal II 45 (1852)
[4]. David Hoaglin, Frederick Mosteller, and John Tukey (editors), Understanding Robust and
Exploratory Data Analysis, New York, John Wiley & Sons, 1983, pp. 39, 54, 62, 223.
[5]. Knorr, E. M. and Ng, R. T.: 1998, Algorithms for Mining Distance-Based Outliers in
Large Datasets. In: Proceedings of the VLDB Conference. New York, USA, pp. 392–403
[6]. Markus Breunig and Hans-Peter Kriegel and Raymond T. Ng and Jörg Sander: 2000,
LOF: Identifying Density-Based Local Outliers. In: Proceedings of the ACM SIGMOD
Conference. pp. 93-104
[7]. Cook, R. Dennis (Feb 1977). "Detection of Influential Observations in Linear
Regression". Technometrics (American Statistical Association) 19 (1): 15–18.
[8]. Barnett, V. and Lewis, T.: 1994, Outliers in Statistical Data. John Wiley & Sons., 3rd
edition.
[9]. Osborne, Jason W. & Amy Overbay (2004). The power of outliers (and why researchers
should always check for them). Practical Assessment, Research & Evaluation, 9(6).
[10]. P.J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.
[11]. Rousseeuw, P. J. (1984) "Least Median of Squares Regression" Journal of the American
Statistical Association, 79, 871–880.
169
[12]. Rousseeuw, P.J. and Yohai, V. (1984), “Robust Regression by Means of S estimators”,
in Robust and Nonlinear Time Series Analysis, edited by J. Franke, W. Härdle, and R.D.
Martin, Lecture Notes in Statistics 26, Springer Verlag, New York, 256-274.
[13]. Yohai V.J. (1987), “High Breakdown Point and High Efficiency Robust Estimates for
Regression,” Annals of Statistics, 15, 642-656.
[14]. Hamilton, L.C. (1992). Regressions with graphics: A second course in applied statistics.
Monterey, CA.: Brooks/Cole.
[15]. Box, G. E. P. & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society, 26(2), 211-252.
[16]. Andrews, D. F. (1971). A note on the selection of data transformations. Biometrika,
58(2), 249-254.
[17]. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions
on Automatic Control, 19(6), 716-723.
[18]. Moore, D.S. & McCabe G.P. (1999), Introduction to the Practice of Statistics. , Freeman
&Company.
[19]. Hodge, V. & Austin, Journal of Artificial Intelligence Review, 22-2, 85-126
[20]. Tukey, J. W. (1957) “The comparative anatomy of transformations”. Annals of
Mathematical Statistics, 28, pp. 602-632.
[21]. John W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.
[22]. Tietjen, G & Moore, R (1972). Some Grubbs- type statistics for the detection of several
outliers. Tech- nometrics, 14, 583-597.
[23]. Barnett, B. and Lewis, T. (1984). Outliers in statistical data, second edition. New York:
Wiley.
170
[24]. Manly, B. F. (1976) “Exponential data transformation”. The Statistician, 25, pp. 37-42.
[25]. John, J. A. & Draper, N. R. (1980) “An alternative family of transformations”. Applied
Statistics, 29, pp. 190-197.
[26]. Bickel, P. J. & Robson, D. S. (1981) “An analysis of transformations revisited”. Journal
of the American Statistical Association, 76, pp. 296-311.
[27]. Yeo, In-Kwon and Johnson, Richard (2000). A new family of power transformations to
improve normality or symmetry. Biometrika, 87, 954-959.
[28]. Pericchi, L. R. (1981) “A Bayesian approach to transformations to
normality”.Biometrika, 68, 35-43.
[29]. Sweeting, T. J. (1984) “On the choice of the prior distribution for the Box-Cox
transformed linear model”. Biometrika, 71, 127-134.
[30]. Carroll, R. J. (1980) “A robust method for testing transformation to achieve approximate
normality”. Journal of the Royal Statistical Society, Series B, 42, 71-78.
[31]. Carroll, R. J. (1982 a) “Tests for regression parameters in power transformation models”.
Scandinavian Journal of Statistics, 9, 217-222.
[32]. Carroll, R. J. & Ruppert, D. (1984) “Power transformations when fitting theoretical
models to data”. Journal of the American Statistical Association, 79, 321-615.
[33]. Taylor, M. J. G. (1983) “Power transformation to symmetry”. Unpublished Dissertation,
Department of Statistics, University of California, Berkeley.
[34]. Taylor, M. J. G. (1985 a) “Power transformation to symmetry”. Biometrika, 72, 145-152.
[35]. Taylor, M. J. G. (1985 b) “Measure of location of skew distributions obtained through
Box-Cox transformations”. Journal of the American Statistical Association, 80, 427-432.
171
[36]. Taylor, M. J. G. (1987) “Using a generalized mean as a measure of location”.
Biometrical Journal, 6, 731-738.
[37]. Andrews, D. F., Gnanadesikan, R. & Warner, J. L. (1973) “Method for assessing
multivariate normality, in: P. R. Krishnaiah (Ed.)”. Multivariate Analysis III, pp. 95-115,
New York, Academic Press.
[38]. Dunn, J. E. & Tubbs, J. D. (1980) “A procedure for determining homoscedastic
transformations of multivariate normal populations”. Communications in Statistics-
Simulation and Computation B, 9(6), 589-598.
[39]. Beauchamp, J. J. & Robson, D. S. (1986) “Transformation considerations in
discriminant analysis”. Communications in Statistics-Simulation and Computation, 15(1),
147-179.
[40]. Rigby, R.A. & Stasinopoulos, D.M. (2004) “Smooth centile curves for skew and
kurtotic data modelled using the Box–Cox power exponential distribution”. Stat Med,
23(19), 3053-76.
[41]. Bozdogan, H. & Ramirez, D.E. (1988) “UTRANS and MTRANS: Marginal and Joint
Box-Cox Transformations of Multivariate Data to 'Near' Normality”. Multivariate
Behavioral Research, 23, 131-132.
[42]. Lachtermacher, G. & Fuller, J.D. (1995). “Back propagation in time-series forecasting”.
Journal of forecasting (0277-6693), 14 (4), p. 381.
[43]. Draper, N. R. & Cox, D. R. (1969) “On distributions and their transformations to
normality”. Journal of the Royal Statistical Society, Series B, 31, 472-476.
[44]. Hinkley, D. V. (1975) “On power transformation to symmetry”. Biometrika, 62, 101-
111.
[45]. Cressie, N. A. C. (1978) “Removing nonadditivity from two-way tables with one
observation per cell”. Biometrics, 34, 505-513.
[46]. Hernandez, F. & Johnson, R. A. (1980) “The large sample behaviour of transformations
to normality”. Journal of the American Statistical Association, 75, 855-861.
[47]. Kullback, S. & Leibler, R. A. (1951) “On Information and Sufficiency”. The Annals of
Mathematical Statistics, 22(1), 79-86.
[48]. Hinkley, D. V. (1985) “Transformation diagnostics for linear models”. Biometrika, 72,
487-496.
[49]. Han, A. K. (1987) “A non-parametric analysis of transformation”. Journal of
Econometrics, 35, 191-209.
[50]. Solomon, P. J. (1985) “Transformations for components of variance and
covariance”. Biometrika, 72, 233-239.
[51]. Sakia, R. M. (1988) “Application of the Box-Cox transformation technique to linear
balanced mixed analysis or variance models with a multi-error structure”. Unpublished
PhD Thesis, Universitaet Hohenheim, FRG.
[52]. Chang, H. S. (1977 a) “A computer program for Box-Cox transformation and
estimation technique”. Econometrica, 45, 1741.
[53]. Huang, C.L. & Moon, L.C. & Chang, H.S (1978). A computer program using the Box-
Cox transformation technique for the specification of functional form, The American
Statistician, 32, 144.
[54]. Atkinson, A.C. (1973) Testing transformations to normality, Journal of the Royal
Statistical Society, Series B, 35, 473-479
[55]. Carroll, R. J. (1980) A robust method for testing transformation to achieve approximate
normality, Journal of the Royal Statistical Society, Series B, 42, 71-78.
[56]. Lawrance, A. J. (1987 a) The score statistic for regression transformation, Biometrika, 74,
275-279.
[57]. Hinkley, D. V. (1988) More on score tests for transformation in regression, Biometrika,
75, 366-369.
[58]. Lawrance, A. J. (1987 b) A note on the variance of the Box-Cox regression transformation
estimate, Applied Statistics, 36, 221-223.
[59]. Atkinson, A. C. (1985) Plots, Transformation and Regression: An Introduction to
Graphical Methods of Diagnostic Regression Analysis (Oxford, Clarendon Press)
[60]. Wang, S. (1987) Improved approximation for transformation diagnostics,
Communications in Statistics Theory and Methods, 16(6), 1797-1819.
[61]. Atkinson, A.C. & Lawrance, A. J. (1989) A comparison of asymptotic equivalent test
statistics for regression transformation, Biometrika, 76, 223-229.
[62]. Draper, N. R. & Cox, D. R. (1969) On distributions and their transformations to
normality, Journal of the Royal Statistical Society, Series B, 31, 472-476.
[63]. Poirier, D. J. (1978) The use of the Box-Cox transformation in limited dependent
variable models, Journal of the American Statistical Association, 73, 285-287.
[64]. Spitzer, J. J.(1978) A Monte Carlo investigation of the Box-Cox transformation in small
samples, Journal of the American Statistical Association, 73, 488-495
[65]. Bickel, P. J. and Doksum, K. A. (1981) An analysis of transformations revisited, Journal
of the American Statistical Association, 76, 296-311.
[66]. Box, G. E. P. & Cox, D. R. (1982) An analysis of transformation revisited, rebutted,
Journal of the American Statistical Association, 77, 209-210.
[67]. Carroll, R. J. & Ruppert, D. (1981) On prediction and the power transformation family,
Biometrika, 68, 609-615.
[68]. Hinkley, D. V. & Runger, G. (1984). The analysis of transformed data, Journal of the
American Statistical Association, 79, 302-320.
[69]. Doksum, K. A. & Wong, C. W. (1983) Statistical tests based on transformed data,
Journal of the American Statistical Association, 78, 411-417.
[70]. Wood, J. T. (1974) An extension of transformations of Box and Cox, Applied Statistics,
23, 278-283.
[71]. Carroll, R. J. & Ruppert, D. (1984) Power transformations when fitting theoretical
models to data, Journal of the American Statistical Association, 79, 321-328.
[72]. Ruppert, D., Cressie, N. & Carroll, R. J. (1989) A transformation/weighting model for
estimating Michaelis-Menten parameters, Biometrics, 45, 637-656.
[73]. Rudemo, M., Ruppert, D. & Streibig, J. C. (1989) Random effects models in nonlinear
regression with application to bioassay, Biometrics, 45, 349-362.
[74]. Wixley, R. A. J.(1986). Unconditional likelihood tests for the linear model following
Box-Cox transformation. South African Statistical Journal, 20, 67-83.
[75]. Atkinson, A. C. (1982). Regression diagnostics, transformation and constructed
variables, Journal of the Royal Statistical Society, Series B, 44, 1-36.
[76]. Atkinson, A. C. (1983) Diagnostic regression analysis and shifted power transformation.
Technometrics, 25, 23-33.
[77]. Carroll, R. J. (1982 b) Two examples of transformation when there are possible outliers,
Applied Statistics, 31, 149-152.
[78]. Cook, R. D. & Wang, P. C. (1983) Transformation and influential cases in regression,
Technometrics, 25, 337-343.
[79]. Atkinson, A. C. (1986) Diagnostic tests for transformation, Technometrics, 28, 29-37.
[80]. Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics
6 (2): 461–464.
[81]. Osborne, Jason (2002). Notes on the use of data transformations. Practical Assessment,
Research & Evaluation, 8(6).
[82]. Kadota, K., Nishimura, S. I., Bono, H. et al. (2003a) “Detection of genes with tissue-specific
expression patterns using Akaike’s Information Criterion (AIC) procedure”. Physiol. Genomics,
12, 251-259.
[83]. Kadota, K., Tominaga, D., Akiyama, Y., & Takahashi, K. (2003). Detecting outlying samples in
microarray data: A critical assessment of the effect of outliers on sample classification. Chem-Bio
Informatics Journal, 3, 30-45.
[84]. Kadota, K., Ye, J., Nakai, Y., Terada, T., & Shimizu, K. (2006). ROKU: a novel method for
identification of tissue-specific genes. BMC Bioinformatics, 7:294.
[85]. Ueda, T. (1996). Simple method for the detection of outliers. Japanese Journal of
Applied Statistics, 25(1), 17-26.
Appendix
The following SAS/IML programs were used to generate the simulation results in Chapters 2-4.
Only the simulations for normal samples are provided here due to limited space. For lognormal samples, the only change needed is the distribution specified at the beginning of each simulation.
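Every program below contaminates the sample the same way: draw n observations from N(10,1), then multiply each one by 10 with probability eps, which is the `call rannor`/`call ranuni` block with `y=y#(10*(u<=eps)+(u>eps))` in the SAS code. For readers without SAS/IML, a minimal Python sketch of this step (the function name is mine):

```python
import random

def contaminated_normal_sample(n, eps, seed=None):
    """Draw n observations from N(10,1); with probability eps each one is
    multiplied by 10, creating a high-end outlier. Returns the sample sorted
    ascending, matching the rank-and-reorder step in the SAS programs."""
    rng = random.Random(seed)
    y = []
    for _ in range(n):
        v = rng.gauss(10.0, 1.0)
        if eps > 0 and rng.random() <= eps:
            v *= 10.0            # contaminate: shift far into the right tail
        y.append(v)
    return sorted(y)
```

With eps = 0 this reproduces the clean N(10,1) samples of programs A-C; with eps in {0.05, 0.1, 0.15, 0.20} it reproduces the contaminated samples of programs D-F.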
A. N(10,1) sample, no Box-Cox transformation or outlier detection involved
*original sample, compute AIC BIC MSE AD AD_Pvalue; proc iml; pi = constant("pi"); e = constant("e"); ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; epsilon={0.0 0.05 0.1 0.15 0.20}; out1=shape(.,95000,9);
/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;
neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];
%let rep=1000; do i=1 to &rep; **1000 repetition loop;
seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999;
AIC=shape(.,&rep,1); BIC=shape(.,&rep,1); AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1); pvalue=shape(.,&rep,1);
call rannor(seedz,z);
y=10+1*z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end;
c=y; d=rank(y); y[d]=c;
xsl=ssq(y-y[:])/n; z2=sum(log(1:n));
AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*z2+4;*AIC; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*z2+2*log(n);*BIC; SE[i]=(y[:]-10)##2; *Square Error;
*anderson-darling; len=nrow(y); sse=ssq(y-y[:])/(len-1); uu=(y-y[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; ad2[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937- 5.709#ad2[i]+0.0186#ad2[i]##2); else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]- 1.38#ad2[i]##2); else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]- 59.938#ad2[i]##2); else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2); end; out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n; out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps; out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = BIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = AD[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = AD2[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = pvalue[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = SE[i];
end; *i; end; *j; end; *k;
create temp2 var{Sample_Size pct_outlier Rep AIC BIC AD AD2 AD_pvalue SE}; append from out1; close temp2; quit; data temp2; set temp2; where sample_size ne .; run; data out.normal_null; set temp2; rename AIC=AIC_null BIC=BIC_null AD=AD_null AD2=AD2_null SE=SE_null ad_Pvalue=ad_Pvalue_null; run; ods rtf file="F:\April_May_2013_Chapter\Chapter_3\MSE_null.doc"; proc means data=temp2 noprint; var SE; by sample_size; output out=mse(drop=_type_) mean=MSE_null; run; proc print data=mse; run; ods rtf close;
B. N(10,1) sample, Box-Cox transformation only
*original data,do BC, check lambda, large variance;
libname out "F:\April_May_2013_Chapter\new_data_04302013";
proc iml;
/*constants used in the calculation*/ pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y,n,pi,e); x=(y##lam-1)/lam; xsl=ssq(x-x[:])/n; z1=sum(log(1:n)); f=0.5*n*log(2#pi#e)+0.5*n*log(xsl)-(lam-1)*sum(log(y))-z1; return (f); finish BoxCox; ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; /*ss={20 80 150 };*/ epsilon={0.0 0.05 0.1 0.15 0.20};
/*out1 and out2 are the output datasets*/ out1=shape(.,95000,11);
/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;
neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];
%let rep=1000; do i=1 to &rep; **1000 repetition loop;
seedz=21273*k+430856*j+4487*i+9999; *!!! attention: if 9999 is changed to 999, xsl=0 and the run fails;
seedu=9789*k+1723*j+1532+4487*i+9999;
lambda=shape(.,&rep,1); AIC=shape(.,&rep,1); BIC=shape(.,&rep,1);
AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1); pvalue=shape(.,&rep,1); btran=shape(.,&rep,1); x_bar=shape(.,&rep,1);
call rannor(seedz,z); y=10+1*z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; c=y; d=rank(y); y[d]=c;
/*use nlpnra call to find the optimized log-likelihood, calculate AIC*/ l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0);
if q<=0 then do; lambda[i]=. ; AIC[i]=.; BIC[i]=.; AD[i]=.; AD2[i]=.; pvalue[i]=.; SE[i]=.; btran[i]=.; x_bar[i]=.; end;
else do; lambda[i]=lammy; x=(y##lammy-1)/lammy; x_bar[i]=x[:]; xsl=ssq(x-x[:])/n; z2=sum(log(1:n));
AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+6;*raw; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+3*log(n);*BIC; btran[i]=(lammy#x_bar[i]+1)##(1/lammy); SE[i]=(btran[i]-10)##2;*back transformed square error;
*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937- 5.709#ad2[i]+0.0186#ad2[i]##2); else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]- 1.38#ad2[i]##2); else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]- 59.938#ad2[i]##2); else pvalue[i]=1-exp(- 13.436+101.14#ad2[i]-223.73#ad2[i]##2); end;*max(f)=1; end;*q>0; out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n; out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps; out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i; out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = lambda[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = q; out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = BIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = AD[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = AD2[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,10] = pvalue[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,11] = SE[i];
end;*i;
end;*j; end;*k; create temp1 var{Sample_Size pct_outlier Rep AIC Lambda Q BIC AD AD2 AD_pvalue SE}; append from out1;
close temp1;
quit;
data temp1; set temp1; where sample_size ne .; run;
data out.normal_baseline_BC; set temp1;
rename lambda=lambda_baseline BIC=BIC_baseline AIC=AIC_baseline AD=AD_baseline AD2=AD2_baseline AD_pvalue=AD_pvalue_baseline;
drop q;
run;
ods rtf file="F:\April_May_2013_Chapter\Chapter_3\MSE_baseline.doc";
proc means data=temp1 noprint; var SE; by sample_size;
output out=mse(drop=_type_) mean=MSE_baseline; run;
proc print data=mse; run;
ods rtf close;
ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_3\lambda_baseline.doc";
proc sgplot data=out.normal_baseline_BC(where=(pct_outlier=0));
vbox lambda_baseline / category=sample_size clusterwidth=0.5;
xaxis display=(nolabel);
title 'Box-Plot of Lambda: Null case, N(10,1), Clean data';
YAXIS LABEL = 'Lambda' GRID VALUES = (-8 TO 8 BY 2);
refline 1/;
XAXIS LABEL = 'Sample Size';
run; title;
proc means data=out.normal_baseline_BC(where=(pct_outlier=0)) noprint;
var lambda_baseline; by sample_size;
output out=CI mean=lambda_mean_null uclm=Upper lclm=Lower; run;
data CI_plot; set CI; drop lower upper lambda_mean_null;
bound = lower; output;
bound = upper; output;
bound = lambda_mean_null; output;
run;
goptions reset=all;
symbol1 interpol=hiloctj cv=red ci=blue width=2 value=dot;
axis1 label=('Lambda');
axis2 label=('Sample Size');
proc gplot data=CI_plot;
plot bound*sample_size/vaxis=axis1 haxis=axis2;
title '95% CI of Lambda: Null Case, N(10,1), Clean data';
run; title; quit;
ods rtf close;
data extreme; set out.normal_baseline_BC;
where pct_outlier=0 and abs(lambda_baseline)>5;
run;
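The `BoxCox` IML module above returns the negative profile log-likelihood that `nlpnra` minimizes over λ (the log(n!) term comes from the ordered-sample density and is constant in λ); the mean of the transformed data is then mapped back to the original scale via btran = (λ·x̄ + 1)^(1/λ). A Python sketch of both steps, substituting a simple grid search (my own stand-in) for the Newton-Raphson call:

```python
import math

def boxcox_negloglik(lam, y):
    """Negative Box-Cox profile log-likelihood, mirroring the SAS BoxCox
    module; y must be strictly positive."""
    n = len(y)
    x = ([math.log(v) for v in y] if abs(lam) < 1e-12
         else [(v ** lam - 1.0) / lam for v in y])
    xbar = sum(x) / n
    xsl = sum((v - xbar) ** 2 for v in x) / n          # MLE of the variance
    log_nfact = sum(math.log(k) for k in range(1, n + 1))
    return (0.5 * n * math.log(2.0 * math.pi * math.e)
            + 0.5 * n * math.log(xsl)
            - (lam - 1.0) * sum(math.log(v) for v in y)
            - log_nfact)

def boxcox_lambda_hat(y, grid=None):
    """Grid-search stand-in for the nlpnra Newton-Raphson optimizer."""
    grid = grid or [l / 100.0 for l in range(-300, 301)]
    return min(grid, key=lambda l: boxcox_negloglik(l, y))

def backtransform_mean(x_bar, lam):
    """Invert x = (y**lam - 1)/lam at the transformed-scale mean
    (btran[i] in the SAS code); lam -> 0 gives the log-transform limit."""
    return math.exp(x_bar) if abs(lam) < 1e-12 else (lam * x_bar + 1.0) ** (1.0 / lam)
```

For lognormal-shaped data the minimizer lands near λ = 0 (the log transform), and the back-transformation recovers the original-scale center used in the squared-error comparisons.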
C. N(10,1) sample, anchor to 1 and Box-Cox transformation only
libname out "F:\April_May_2013_Chapter\new_data_04302013";
*anchor to 1;
*MSE here is not correct: it needs to be back-transformed and un-anchored;
proc iml;
/*constants used in the calculation*/ pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y,n,pi,e); x=(y##lam-1)/lam; xsl=ssq(x-x[:])/n; z1=sum(log(1:n)); f=0.5*n*log(2#pi#e)+0.5*n*log(xsl)-(lam-1)*sum(log(y))-z1; return (f); finish BoxCox;
ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; /*ss={20 80 150 };*/ epsilon={0.0 0.05 0.1 0.15 0.20};
/*out1 and out2 are the output datasets*/ out1=shape(.,95000,11);
/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;
neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];
%let rep=1000;
do i=1 to &rep; **1000 repetition loop;
seedz=21273*k+430856*j+4487*i+9999; *!!! attention: if 9999 is changed to 999, xsl=0 and the run fails;
seedu=9789*k+1723*j+1532+4487*i+9999;
lambda=shape(.,&rep,1); AIC=shape(.,&rep,1); BIC=shape(.,&rep,1);
AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1);
pvalue=shape(.,&rep,1); btran=shape(.,&rep,1); x_bar=shape(.,&rep,1);
call rannor(seedz,z);
y=10+1*z;
if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end;
c=y; d=rank(y); y[d]=c;
y=y+1-y[1];
/*use nlpnra call to find the optimized log-likelihood, calculate AIC*/ l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0); if q<=0 then do; lambda[i]=. ; AIC[i]=.; BIC[i]=.; AD[i]=.; AD2[i]=.; pvalue[i]=.; SE[i]=.; btran[i]=.; x_bar[i]=.; end; else do; lambda[i]=lammy; x=(y##lammy-1)/lammy; x_bar[i]=x[:]; xsl=ssq(x-x[:])/n; z2=sum(log(1:n));
AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+6;*raw; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+3*log(n);*BIC; btran[i]=(lammy#x_bar[i]+1)##(1/lammy); SE[i]=(btran[i]-10)##2;*back transformed square error;
*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937- 5.709#ad2[i]+0.0186#ad2[i]##2);
else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2);
else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2);
else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2);
end; *max(f)=1;
end; *q>0;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = lambda[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = q;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = BIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = AD[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = AD2[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,10] = pvalue[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,11] = SE[i];
end;*i; end;*j; end;*k; create temp1 var{Sample_Size pct_outlier Rep AIC Lambda Q BIC AD AD2 AD_pvalue SE}; append from out1; close temp1; quit; data temp1; set temp1; where sample_size ne .; run; data out.normal_baseline_anchor_BC; set temp1; rename lambda=lambda_baseline_anchor BIC=BIC_baseline_anchor AIC=AIC_baseline_anchor AD=AD_baseline_anchor AD2=AD2_baseline AD_pvalue=AD_pvalue_baseline_anchor; drop q; run; ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_3\lambda_baseline_anchor.doc";
proc sgplot data=out.normal_baseline_anchor_BC(where=(pct_outlier=0));
vbox lambda_baseline_anchor / category=sample_size clusterwidth=0.5;
xaxis display=(nolabel);
title 'Box-Plot of Lambda: Anchor-to-1, N(10,1), Clean data';
YAXIS LABEL = 'Lambda' GRID VALUES = (-8 TO 8 BY 2);
refline 1/;
XAXIS LABEL = 'Sample Size';
run; title;
proc means data=out.normal_baseline_anchor_BC(where=(pct_outlier=0)) noprint;
var lambda_baseline_anchor; by sample_size;
output out=CI mean=lambda_mean_anchor lclm=Lower uclm=Upper; run;
data CI_plot; set CI; drop lower upper lambda_mean_anchor;
bound = lower; output;
bound = upper; output;
bound = lambda_mean_anchor; output;
run;
goptions reset=all;
symbol1 interpol=hiloctj cv=red ci=blue width=2 value=dot;
axis1 label=('Lambda');
axis2 label=('Sample Size');
proc gplot data=CI_plot;
plot bound*sample_size/vaxis=axis1 haxis=axis2;
title '95% CI of Lambda: Anchor-to-1, N(10,1), Clean data';
run; title; quit;
ods rtf close;
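Program C differs from program B only in the anchoring step `y=y+1-y[1]`, which shifts the sorted sample so its minimum is exactly 1 before the Box-Cox transformation. A one-line Python equivalent (the helper name is mine):

```python
def anchor_to_one(y):
    """Shift the sample so its minimum is exactly 1, the anchor-to-1 step
    applied before Box-Cox when the sample minimum exceeds 1
    (y = y + 1 - y[1] in the SAS code, where y is sorted ascending)."""
    m = min(y)
    return [v + 1.0 - m for v in y]
```

Because the shift is a location change only, the spacing of the observations is unchanged; the effect studied in Chapter 2 is that the likelihood for λ becomes much better behaved on the anchored scale, shrinking the variance of the estimated λ.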
D. Unified Approach: detect outliers at both ends, anchor to 1 if the minimum is smaller than 1, and apply the Box-Cox transformation; the penalty for Box-Cox is not considered
options linesize=200 pagesize=200 nocenter; options nosource nonotes;
libname out "C:\wg_05282013"; *p1; proc iml; pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y2,n,dp,pi,e); x=(y2##lam-1)/lam; xsl=ssq(x-x[:])/(n-dp); z1=sum(log(1:(n-dp))); f=0.5*(n-dp)*log(2#pi#e)+0.5*(n-dp)*log(xsl)-(lam-1)*sum(log(y2))-z1; return (f); finish BoxCox; ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; epsilon={0.05 0.1 0.15 0.20}; out1=shape(.,2290000,17); do k=1 to 10; n=ss[k]; z=shape(.,n,1); u=shape(.,n,1); l_max_drop=n/5; u_max_drop=n/5; max_drop=(l_max_drop+1)*(u_max_drop+1);
*find a constant used later; if k=1 then do; cc=0; end; else do; cc=0; do index=1 to k-1; cc=cc+(0.2*ss[index]+1)##2; end; end;
neps=ncol(epsilon); neps=1;*change accordingly; do j=1 to 1; eps=epsilon[j];
%let rep=1000; do i=1 to &rep; *rep; seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999; call rannor(seedz,z); y=10+z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end;
do h=1 to l_max_drop+1; ldp=h-1; y22=shape(.,n-ldp,1); y33=shape(.,n-ldp,1); y44=shape(.,n-ldp,1); c=y; d=rank(y); y[d]=c; y_min=y[1];*the minimum of raw obs; y44=y+1-y_min;*anchor to 1; y22=y44[ldp+1:n]; y33=y22+1-y22[1];*anchor every time when lower end is dropped; do g=1 to u_max_drop+1; udp=g-1; dp=ldp+udp; y2=shape(.,n-dp,1); y_kp=shape(.,n-dp,1); lambda=shape(.,max_drop,1); AIC=shape(.,max_drop,1); AIC2=shape(.,max_drop,1); BIC=shape(.,max_drop,1); AIC4=shape(.,max_drop,1); AIC5=shape(.,max_drop,1); ad=shape(.,max_drop,1); ad2=shape(.,max_drop,1); pvalue=shape(.,max_drop,1); SE=shape(.,max_drop,1); y_kp_bar=shape(.,max_drop,1); SE0=shape(.,max_drop,1); x_bar=shape(.,max_drop,1);
btran=shape(.,max_drop,1); SE2=shape(.,max_drop,1);
**drop outliers; c=y33; d=rank(y33); y33[d]=c; y2=y33[1:n-ldp-udp]; y_kp=y[ldp+1:n-udp];
l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0);
if q<=0 then do; lambda[g]=. ; AIC[g]=.; AIC2[g]=.; BIC[g]=.; AIC4[g]=.; ad[g]=.; ad2[g]=.; pvalue[g]=.; SE[g]=.; y_kp_bar[g]=.; btran[g]=.; x_bar[g]=.; SE0[g]=.; SE2[g]=.;
end;
else do; lambda[g]=lammy; x=(y2##lammy-1)/lammy;
z2=sum(log(1:(n-dp))); z3=sum(log(1:n));
xsl=ssq(x-x[:])/(n-dp); AIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy- 1)*sum(log(y2))-2*z2+6+2*dp;*raw; BIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy- 1)*sum(log(y2))-2*z2+(3+dp)*log(n-dp);*BIC;
x_bar[g]=x[:];
SE0[g]=(x_bar[g]-10)##2;*transformed square error;
y_kp_bar[g]=y_kp[:];
SE2[g]=(y_kp_bar[g]-10)##2;
btran[g]=(lammy#x_bar[g]+1)##(1/lammy)+y[h]-1; SE[g]=(btran[g]-10)##2;*back transformed square error; /* print dp ldp h udp y y2 y_kp x btran lammy se y_kp_bar x_bar SE0 SE2;*/
*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[g]=999; ad2[g]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[g]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[g]=ad[g]#(1+0.75/len+2.25/len##2); end; if ad2[g]>=0.6 & ad2[g]<13 then pvalue[g]=exp(1.2937- 5.709#ad2[g]+0.0186#ad2[g]##2); else if ad2[g]>=0.34 & ad2[g]<0.6 then pvalue[g]=exp(0.9177- 4.297#ad2[g]-1.38#ad2[g]##2); else if ad2[g]>=0.2 & ad2[g]<0.34 then pvalue[g]=1-exp(- 8.318+42.796#ad2[g]-59.938#ad2[g]##2); else pvalue[g]=1-exp(-13.436+101.14#ad2[g]-223.73#ad2[g]##2);;
end;
out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,1] = n; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,2] = eps; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,3] = i;
out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,4] = dp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,5] = ldp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,6] = udp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,7] = AIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,8] = lambda[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,9] = q; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,10] = ad[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,11] = ad2[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,12] = pvalue[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,13] = BIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,14] = SE[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,15] = SE0[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,16] = btran[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,17] = SE2[g];
end; *g; end; *h; end; *i; end; *j; end; *k; create p1 var{Sample_Size pct_outlier Rep outlier_drop low_drop upper_drop AIC Lambda Q AD AD2 pvalue BIC SE SE0 btran SE2}; append from out1; close p1; quit; data out.bc_p1; set p1; where sample_size ne .; run;
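The criteria in program D penalize outlier exclusion on top of the Box-Cox fit: for dp dropped observations, AIC adds 6 (for μ, σ, λ) plus 2·dp, and BIC adds (3 + dp)·log(n − dp). A Python sketch of the AIC form, assuming the transformed retained values x and the retained (anchored) raw values y2 are already computed, with dp passed separately as in the SAS code:

```python
import math

def unified_aic(x, y2, lam, dp):
    """Penalized AIC of the Unified Approach: the ordered-sample Box-Cox
    log-likelihood on the n-dp retained points, plus 6 for the estimated
    parameters (mu, sigma, lambda) and 2 per excluded outlier.
    x: transformed retained values; y2: retained raw values; dp: # dropped."""
    m = len(y2)                                        # m = n - dp
    xbar = sum(x) / m
    xsl = sum((v - xbar) ** 2 for v in x) / m          # MLE variance
    log_mfact = sum(math.log(k) for k in range(1, m + 1))
    return (m * math.log(2.0 * math.pi * math.e) + m * math.log(xsl)
            - 2.0 * (lam - 1.0) * sum(math.log(v) for v in y2)
            - 2.0 * log_mfact
            + 6.0 + 2.0 * dp)                          # parameter + drop penalty
```

The +2·dp term is what lets the criterion trade goodness of fit against the number of observations declared outliers: dropping a point is only accepted when it improves the log-likelihood by more than the penalty.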
E. Unified Approach: detect outliers at both ends, anchor to 1 if the minimum is smaller than 1, and apply the Box-Cox transformation; the penalty for Box-Cox is considered
options nosource nonotes;
libname out "C:\wg_05282013";
****p1;
proc iml;
*no box-cox is done in this program, this result is compared with anchor_every_step_drop_BC.sas;
*lambda is set to one throughout, indicating no BC is done;
pi = constant("pi"); e = constant("e");
/*ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200};*/ ss={20 30 40 50 60 70 80 90 100 110}; epsilon={0.05 0.1 0.15 0.20}; out1=shape(.,2290000,14); do k=1 to 10; n=ss[k]; z=shape(.,n,1); u=shape(.,n,1); l_max_drop=n/5; u_max_drop=n/5; max_drop=(l_max_drop+1)*(u_max_drop+1);
*find a constant used later; if k=1 then do; cc=0; end; else do; cc=0; do index=1 to k-1; cc=cc+(0.2*ss[index]+1)##2; end; end;
neps=ncol(epsilon); neps=1;*change accordingly; do j=1 to 1; eps=epsilon[j];
%let rep=1000; do i=1 to &rep; *rep; seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999; call rannor(seedz,z); y=10+z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; do h=1 to l_max_drop+1; ldp=h-1; y22=shape(.,n-ldp,1); y33=shape(.,n-ldp,1); y44=shape(.,n-ldp,1); c=y; d=rank(y); y[d]=c; y_min=y[1];*the minimum of raw obs; y44=y+1-y_min;*anchor to 1; y22=y44[ldp+1:n]; y33=y22+1-y22[1]; do g=1 to u_max_drop+1; udp=g-1; dp=ldp+udp; y2=shape(.,n-dp,1); y_kp=shape(.,n-dp,1); lambda=shape(.,max_drop,1); AIC=shape(.,max_drop,1); AIC2=shape(.,max_drop,1); BIC=shape(.,max_drop,1); AIC4=shape(.,max_drop,1); AIC5=shape(.,max_drop,1); ad=shape(.,max_drop,1); ad2=shape(.,max_drop,1); pvalue=shape(.,max_drop,1); SE=shape(.,max_drop,1);
**drop outliers; c=y33; d=rank(y33);
y33[d]=c; y2=y33[1:n-ldp-udp]; y_kp=y[ldp+1:n-udp];
q=999; lammy=1; lambda[g]=lammy; x=(y2##lammy-1)/lammy;
z2=sum(log(1:(n-dp))); z3=sum(log(1:n));
xsl=ssq(x-x[:])/(n-dp); AIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy- 1)*sum(log(y2))-2*z2+4+2*dp;*raw AIC; BIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy- 1)*sum(log(y2))-2*z2+(2+dp)*log(n-dp);*BIC; SE[g]=(y_kp[:]-10)##2; /* print dp ldp udp y y2 y_kp x SE;*/
*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[g]=999; ad2[g]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[g]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[g]=ad[g]#(1+0.75/len+2.25/len##2); end; if ad2[g]>=0.6 & ad2[g]<13 then pvalue[g]=exp(1.2937- 5.709#ad2[g]+0.0186#ad2[g]##2); else if ad2[g]>=0.34 & ad2[g]<0.6 then pvalue[g]=exp(0.9177- 4.297#ad2[g]-1.38#ad2[g]##2); else if ad2[g]>=0.2 & ad2[g]<0.34 then pvalue[g]=1-exp(- 8.318+42.796#ad2[g]-59.938#ad2[g]##2); else pvalue[g]=1-exp(-13.436+101.14#ad2[g]- 223.73#ad2[g]##2);;
/* end;*/
out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,1] = n; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,2] = eps; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,3] = i; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,4] = dp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,5] = ldp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,6] = udp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,7] = AIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,8] = lambda[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,9] = q; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,10] = ad[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,11] = ad2[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,12] = pvalue[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,13] = BIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,14] = SE[g]; end; *g; end; *h; end; *i; end; *j; end; *k; create p1 var{Sample_Size pct_outlier Rep outlier_drop low_drop upper_drop AIC Lambda Q AD AD2 pvalue BIC SE}; append from out1; close p1; quit; data out.no_bc_p1; set p1; where sample_size ne .;
run;
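Programs D and E together supply the penalized criteria for every candidate: each (low-drop, high-drop) pattern with the Box-Cox fit (+6 + 2·dp, dataset out.bc_p1) and with λ fixed at 1 (+4 + 2·dp, dataset out.no_bc_p1). The Unified Approach then keeps the candidate with the smallest criterion, which is the min-BIC merge performed in program F. A hedged sketch of that selection step (the candidate field names are mine):

```python
def select_unified_model(candidates):
    """Pick the (low-drop, high-drop, Box-Cox-or-not) candidate with the
    smallest penalized criterion. Each candidate is a dict such as
    {'ldp': 0, 'udp': 1, 'boxcox': True, 'bic': 101.2}."""
    return min(candidates, key=lambda c: c["bic"])
```

Because the λ = 1 candidates carry a penalty two units smaller, the Box-Cox transformation is selected only when it buys enough likelihood to overcome its extra parameter, which is how the "penalty for Box-Cox" enters the comparison.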
F. Evaluation of the outlier detection accuracy of the Unified Approach
libname out "F:\April_May_2013_Chapter\outlier_sample_05282013\normal\count_number_of_outlier";
%macro min2(var=, dat=);
proc summary data=&dat nway; class sample_size pct_outlier Rep; var &var;
output out=aa(drop=_type_ _freq_) min(&var)=min_&var; run;
proc sort data=&dat; by sample_size pct_outlier Rep; run;
proc sort data=aa; by sample_size pct_outlier Rep; run;
data bb; merge &dat aa; by sample_size pct_outlier Rep; if &var=min_&var; run;
proc freq data=bb noprint; by sample_size pct_outlier;
table outlier_drop/nocol norow nopercent out=cc_&var; run;
%mend;
%min2(var=BIC,dat=out.bc_5_pct);
data accuracy; set bb;
if abs(upper_drop-num_outlier)<=1 then accurate=1; else accurate=0;
if abs(low_drop-0)<=1 then accurate2=1;
else accurate2=0;
if abs(outlier_drop-num_outlier)<=1 then accurate3=1; else accurate3=0;
if abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1; else accurate4=0;
if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1; else accurate5=0;
if low_drop=0 and upper_drop=num_outlier then accurate6=1; else accurate6=0;
run;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\06222013_accuracy_reply_to_horn.doc";
proc freq data=bb noprint;
table outlier_drop/norow nocol nopercent out=one;
table low_drop/norow nocol nopercent out=two;
table upper_drop/norow nocol nopercent out=three;
by sample_size; run;
proc print data=one; run;
proc print data=two; run;
proc print data=three; run;
ods rtf close;
ods html image_dpi=300; ods rtf file="F:\April_May_2013_Chapter\Chapter_4\plot_06222013_lambda_outlier_detect ion_accuracy.doc"; proc freq data=accuracy noprint; table accurate/norow nocol nopercent out=acc; table accurate2/norow nocol nopercent out=acc2; table accurate3/norow nocol nopercent out=acc3; table accurate4/norow nocol nopercent out=acc4; table accurate5/norow nocol nopercent out=acc5; table accurate6/norow nocol nopercent out=acc6; by sample_size; run;
proc print data=acc; title 'if abs(upper_drop-num_outlier)<=1 then accurate=1;'; run; title;
proc print data=acc2; title 'if abs(low_drop-0)<=1 then accurate2=1;'; run; title;
proc print data=acc3; title 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;'; run; title;
proc print data=acc4; title 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1'; run; title;
proc print data=acc5; title 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1'; run; title;
proc print data=acc6; title 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;'; run; title;
**plot;
symbol1 color=vibg interpol=join value=dot height=1.5;
proc gplot data=acc(where=(accurate=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: high end outlier detection';
  title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
run; title; quit;
proc gplot data=acc2(where=(accurate2=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: low end outlier detection';
  title2 'if abs(low_drop-0)<=1 then accurate2=1;';
run; title; quit;
proc gplot data=acc3(where=(accurate3=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: both ends outlier detection';
  title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
run; title; quit;
proc gplot data=acc4(where=(accurate4=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: no outlier in low end and outlier in high end';
  title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
run; title; quit;
proc gplot data=acc5(where=(accurate5=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
  title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
run; title; quit;
proc gplot data=acc6(where=(accurate6=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
  title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
run; title; quit;

proc sgplot data=accuracy(where=(accurate=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in high end, no penalty for BC';
  title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy(where=(accurate2=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in low end, no penalty for BC';
  title2 'if abs(low_drop-0)<=1 then accurate2=1;';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy(where=(accurate3=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in both ends, no penalty for BC';
  title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy(where=(accurate4=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end, no penalty for BC';
  title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy(where=(accurate5=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (low end detected=0), no penalty for BC';
  title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy(where=(accurate6=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (exact, and low end detected=0), no penalty for BC';
  title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
ods rtf close;
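The PROC FREQ / GPLOT pipeline above tabulates, for each sample size, the percentage of simulated samples meeting each accuracy criterion, then plots that percentage against sample size. As a reading aid only, the same aggregation can be sketched in Python; the toy `records` data below are hypothetical stand-ins for the SAS dataset `accuracy`, and the real numbers come from the simulation:

```python
from collections import defaultdict

# Hypothetical (sample_size, accurate_flag) pairs, one per simulated sample,
# standing in for the SAS dataset `accuracy`.
records = [(20, 1), (20, 0), (20, 1), (50, 1), (50, 1), (50, 1)]

counts = defaultdict(lambda: [0, 0])   # sample_size -> [hits, total]
for n, flag in records:
    counts[n][0] += flag
    counts[n][1] += 1

# Percent accurate per sample size, analogous to the PERCENT column that
# PROC FREQ writes to the OUT= dataset and GPLOT then plots.
percent = {n: 100.0 * hits / total for n, (hits, total) in sorted(counts.items())}
print(percent)  # n=20 -> ~66.7%, n=50 -> 100%
```

This is an illustrative re-expression, not part of the original SAS program.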
* plot_06222013_lambda_outlier_detection_accuracy.doc;
*****************************************************;
*****************************************************;
*****penalty for BC;
data out.bc_no_bc_5_pct;
  set out.bc_5_pct out.no_bc_5_pct;
run;

%min2(var=BIC,dat=out.bc_no_bc_5_pct);
data leave_raw_alone;
  set bb;
  if lambda=1 then BC=0; else BC=1;
  if outlier_drop=0 then DP=0; else DP=1;
run;
proc freq data=leave_raw_alone noprint;
  table DP*BC/norow nocol nopercent out=leave_raw_alone_summary;
  table outlier_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_both;
  table upper_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_up;
  table low_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_low;
  by sample_size;
run;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\06232013_leave_raw_alone.doc";
proc print data=leave_raw_alone_summary; run;
proc print data=leave_raw_alone_summary_both; run;
proc print data=leave_raw_alone_summary_up; run;
proc print data=leave_raw_alone_summary_low; run;
ods rtf close;

data accuracy;
  set leave_raw_alone;
  if abs(upper_drop-num_outlier)<=1 then accurate=1; else accurate=0;
  if abs(low_drop-0)<=1 then accurate2=1; else accurate2=0;
  if abs(outlier_drop-num_outlier)<=1 then accurate3=1; else accurate3=0;
  if abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1; else accurate4=0;
  if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1; else accurate5=0;
  if low_drop=0 and upper_drop=num_outlier then accurate6=1; else accurate6=0;
run;
data accuracy_dp_only;
  set accuracy;
  where DP=1 and BC=0; *only look at cases where DP=1;
run;
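The six accuracy flags defined in the data step above grade how well the Unified Approach recovered the planted outliers. For readers less familiar with SAS, they can be sketched in Python; variable names mirror the SAS dataset, and the assumption that `outlier_drop` is the sum of the low- and high-end removals is mine, inferred from how the flags are used:

```python
def accuracy_flags(low_drop, upper_drop, num_outlier):
    """Re-expression of the SAS accuracy criteria (illustrative only).

    low_drop / upper_drop: outliers removed at the low / high end;
    num_outlier: number of outliers actually planted in the sample.
    Assumes outlier_drop = low_drop + upper_drop, as in the simulation.
    """
    outlier_drop = low_drop + upper_drop
    return {
        # high-end count within 1 of the true count
        "accurate":  abs(upper_drop - num_outlier) <= 1,
        # at most 1 spurious low-end removal
        "accurate2": abs(low_drop - 0) <= 1,
        # total count within 1 of the true count
        "accurate3": abs(outlier_drop - num_outlier) <= 1,
        # low end clean (within 1) AND high end within 1
        "accurate4": abs(low_drop - 0) <= 1 and abs(upper_drop - num_outlier) <= 1,
        # no low-end removals at all, high end within 1
        "accurate5": low_drop == 0 and abs(upper_drop - num_outlier) <= 1,
        # exact detection: no low-end removals, high end exactly right
        "accurate6": low_drop == 0 and upper_drop == num_outlier,
    }

flags = accuracy_flags(low_drop=0, upper_drop=5, num_outlier=5)
print(flags)  # all six criteria satisfied in this exact-detection case
```

The criteria are nested in strictness, from accurate3 (total count roughly right) down to accurate6 (exact high-end detection with a clean low end).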
ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\plot_06232013_lambda_outlier_detection_accuracy_BC_penalty.doc";
proc freq data=accuracy_dp_only noprint;
  table accurate/norow nocol nopercent out=acc;
  table accurate2/norow nocol nopercent out=acc2;
  table accurate3/norow nocol nopercent out=acc3;
  table accurate4/norow nocol nopercent out=acc4;
  table accurate5/norow nocol nopercent out=acc5;
  table accurate6/norow nocol nopercent out=acc6;
  by sample_size;
run;

proc print data=acc;
  title 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc2;
  title 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc3;
  title 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc4;
  title 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc5;
  title 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc6;
  title 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
run; title;
**plot;
symbol1 color=vibg interpol=join value=dot height=1.5;
proc gplot data=acc(where=(accurate=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: high end outlier detection';
  title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;
proc gplot data=acc2(where=(accurate2=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: low end outlier detection';
  title2 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;
proc gplot data=acc3(where=(accurate3=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: both ends outlier detection';
  title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;
proc gplot data=acc4(where=(accurate4=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: no outlier in low end and outlier in high end';
  title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;
proc gplot data=acc5(where=(accurate5=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
  title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;
proc gplot data=acc6(where=(accurate6=1));
  plot percent*sample_size;
  title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
  title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc sgplot data=accuracy_dp_only(where=(accurate=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in high end, penalty for BC';
  title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy_dp_only(where=(accurate2=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in low end, penalty for BC';
  title2 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy_dp_only(where=(accurate3=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in both ends, penalty for BC';
  title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy_dp_only(where=(accurate4=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end, penalty for BC';
  title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy_dp_only(where=(accurate5=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (low end detected=0), penalty for BC';
  title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
proc sgplot data=accuracy_dp_only(where=(accurate6=1));
  vbox lambda / category=sample_size clusterwidth=0.5;
  xaxis display=(nolabel);
  title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (exact, and low end detected=0), penalty for BC';
  title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
  YAXIS LABEL = 'Lambda' GRID VALUES = (-1.5 TO 3 BY 0.5);
  refline 1/;
  XAXIS LABEL = 'Sample Size';
run; title;
ods rtf close;