
A Unified Approach to Data Transformation and Outlier

Detection using Penalized Assessment

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Ph.D. in the McMicken College of Arts and Sciences

November 2013 by Wei Guo

B.S., TongJi University, Shanghai, China, 2006; M.S., University of Cincinnati, Cincinnati, OH, USA, 2010

Committee Chair: Seongho Song Ph.D.

Abstract

In many statistical applications, a normally distributed sample free of outliers is desired. In practice, however, the normality assumption is often violated, for example when highly influential outliers exist in the dataset, which adversely impacts the validity of the statistical analysis. In this dissertation, a Unified Approach is proposed that handles outlier detection and Box-Cox transformation simultaneously using a penalized information criterion.

This research began by investigating the performance of the Box-Cox transformation in uncontaminated samples and suggested that the sample should be anchored to 1 before the Box-Cox transformation is applied when the sample minimum is larger than 1. Simulation results showed that the anchor-to-1 method works well in enhancing the accuracy of the Box-Cox transformation by decreasing the variability of λ and eliminating extremely large or small values of λ. The efficacy of the Unified Approach is also verified in clean samples, both normal and lognormal, where the Unified Approach is able to determine that no Box-Cox transformation is needed and no outliers are present. Simulations in contaminated normal samples then demonstrated that the Unified Approach can achieve a balance between a good model fit (close to a normal sample) and the complexity of the data analysis by penalizing anchoring-to-1, outlier exclusion, and Box-Cox transformation in the form of an adjusted information criterion. Through precise outlier detection and appropriate Box-Cox transformation, the efficacy of the Unified Approach is verified in the contaminated samples.



Acknowledgement

I would like to take this opportunity to extend my deepest gratitude and sincere thanks to my adviser, Dr. Paul S. Horn, for his priceless guidance, insightful feedback, warm encouragement, and constant support over the past few years. I appreciate the valuable advice and support of the other committee members, Drs. Seongho Song, Siva Sivaganesan, Xia Wang, and Emily Kang. I thank the Department of Mathematics for providing the financial support for my graduate studies at the University of Cincinnati.

I thank my parents, Zhanjun Guo and Yuexia Zhang, for all their love, support, and faith in me, and for allowing me to be as ambitious as I wanted. I also would like to thank my mother-in-law, Conghui Wang, for helping us take care of my child while I was working toward my degree. Without her help I could not have finished my degree or pursued my dream career.

Lastly, I would like to thank my wife, Xuejiao Diao. Her enduring love, encouragement, patience, and caring have been the greatest support throughout my life. Without her companionship I could not have succeeded in my graduate study in a foreign country. I was fortunate to have my daughter Angie during my graduate study; she brings me so much joy and makes me laugh, although I did not have much time for her every day. Without such a warm family none of this would have been possible.


Table of Contents

Abstract ...... i
Acknowledgement ...... iii
Table of Contents ...... iv
List of Figures ...... vi
List of Tables ...... x
1 Introduction ...... 1
1.1 Background ...... 1
1.2 Outliers ...... 5
1.3 Data transformation ...... 8
1.3.1. Mathematical expressions ...... 8
1.3.2. Estimation of parameters ...... 10
1.3.3. Hypothesis tests and inference on transformation parameter ...... 12
1.3.4. Impact of outliers on data transformation ...... 14
1.4 Information Criterion ...... 16
1.4.1 Akaike Information Criterion ...... 16
1.4.2 Bayesian Information Criterion ...... 17
1.5 Research Gap in Literature ...... 19
2. Unified Approach ...... 22
2.1. Anchoring-to-1 ...... 22
2.2. Penalized Assessment ...... 41
2.3. Implementation of the unified approach ...... 43
2.4. Advantages of the new method ...... 47
3. No Outliers: The Uncontaminated Sample Case ...... 51
3.1. Overview ...... 51
3.2. Anchoring-to-1 ...... 54
3.3. Unified Approach-Normal Distribution ...... 75
3.4. Assessment-Normal Distribution ...... 91
3.5. Unified Approach-Lognormal Distribution ...... 98
3.6. Assessment-Lognormal Distribution ...... 113


4. Outliers: The Contaminated Sample Case ...... 116
4.1. Samples with Outliers ...... 116
4.2. Simulation on N(10,1) samples ...... 123
4.3. Simulation on Lognormal samples ...... 151
5. Discussion and future research ...... 164
Bibliography ...... 169
Appendix ...... 177


List of Figures

Figure 1-1 Research Gap ...... 21
Figure 2-1 Anchor-to-1, N(10,1), Sample size=100 ...... 24
Figure 2-2 Anchor-to-5, N(10,1), Sample size=100 ...... 25
Figure 2-3 LogNormal, Anchor-to-1, sample size=100 ...... 27
Figure 2-4 LogNormal, Anchor-to-5, sample size=100 ...... 28
Figure 2-5 Lambdas of Different Anchoring ...... 29
Figure 2-6 95% CI of λ at different anchor points ...... 30
Figure 2-7 Distribution of Skewed Raw Sample ...... 32
Figure 2-8 Comparison of Skewness and Kurtosis for different anchoring ...... 34
Figure 2-9 Histogram of transformed sample with different anchoring: SQRT Transformation ...... 35
Figure 2-10 Histogram of transformed sample with different anchoring: LOG Transformation ...... 36
Figure 2-11 Histogram of transformed sample with different anchoring: INVERSE Transformation ...... 37
Figure 2-12 Outlier Candidates ...... 46
Figure 2-13 Flowchart of overall data processing ...... 48
Figure 2-14 Flowchart of the Unified Approach ...... 49
Figure 2-15 Flowchart of outlier drop ...... 50
Figure 3-1 Variation of λ using regular BC on original sample ...... 55
Figure 3-2 95% Confidence interval of lambda in baseline case ...... 56
Figure 3-3 Histogram of original sample ...... 58
Figure 3-4 Histogram of transformed data with lambda 9.13 ...... 59
Figure 3-5 Histogram of original sample ...... 61
Figure 3-6 Histogram of transformed sample using lambda=-6.91 ...... 62
Figure 3-7 Distribution of lambda after anchoring-to-1 the original sample ...... 65
Figure 3-8 Mean and 95% confidence interval of lambda after anchoring-to-1 ...... 66
Figure 3-9 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted ...... 67
Figure 3-10 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted, cont. ...... 68
Figure 3-11 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda ...... 69
Figure 3-12 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda, cont. ...... 70
Figure 3-13 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda ...... 71
Figure 3-14 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont. ...... 72
Figure 3-15 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda ...... 73
Figure 3-16 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont. ...... 74
Figure 3-17 Outliers Detected in Clean N(10,1) Samples, 1000 Repetitions ...... 77
Figure 3-18 Location of detected outliers, N(10,1), Clean Sample ...... 78
Figure 3-19 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, no outliers excluded, no penalty for Box-Cox ...... 79
Figure 3-20 Box-plot of lambda: Unified Approach, N(10,1), Clean Data, outliers excluded (>=1) ...... 80


Figure 3-21 Box-plot of Anderson-Darling Statistic: Unified Approach, N(10,1), Clean Data, no outliers detected, no penalty for Box-Cox ...... 81
Figure 3-22 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, outliers excluded (>=1) ...... 82
Figure 3-23 Percent of four cases in unified approach with penalty to BC ...... 86
Figure 3-24 Location of outliers, NO Box-Cox involved ...... 86
Figure 3-25 Location of outliers, BC=1 and DP=1 ...... 88
Figure 3-26 Box-plot of Lambda: Unified Approach, Clean Data, BC=1, outliers excluded ...... 89
Figure 3-27 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, BC=1, no outliers excluded ...... 90
Figure 3-28 Comparison of MSE, BC=0 DP=0, N(10,1) Clean Data ...... 96
Figure 3-29 Comparison of MSE, BC=0 DP=1, N(10,1) Clean Data ...... 96
Figure 3-30 Comparison of MSE, BC=1 DP=0, N(10,1) Clean Data ...... 97
Figure 3-31 Comparison of MSE, BC=1 DP=1, N(10,1) Clean Data ...... 97
Figure 3-32 Outliers Detected in Clean LogN(0,1) Samples, 1000 Repetitions ...... 100
Figure 3-33 Location of outliers detected, LogN(0,1), Clean Data ...... 102
Figure 3-34 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox ...... 103
Figure 3-35 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox ...... 104
Figure 3-36 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox ...... 106
Figure 3-37 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox ...... 107
Figure 3-38 Percent of four cases in unified approach with penalty to BC, Clean LogN(0,1) ...... 110
Figure 3-39 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, no outliers excluded ...... 111
Figure 3-40 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, outliers excluded (>=1) ...... 112
Figure 3-41 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=0 ...... 115
Figure 3-42 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=1 ...... 116
Figure 4-1 Distributions of Normal samples with outliers ...... 118
Figure 4-2 Distributions of Lognormal samples with outliers ...... 121
Figure 4-3 Distribution of Lambda in Baseline with outliers, N(10,1) ...... 124
Figure 4-4 Confidence Interval of Lambda in Baseline with outliers, N(10,1) ...... 125
Figure 4-5 Distribution of Lambda, Anchor-to-1, with outliers, N(10,1) ...... 127
Figure 4-6 Confidence Interval of Lambda, Anchor-to-1, with outliers, N(10,1) ...... 128
Figure 4-7 Accuracy of outlier detection, N(10,1), 5% outliers, no penalty for BC ...... 132
Figure 4-8 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy1=1, no extra penalty for Box-Cox ...... 133
Figure 4-9 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy2=1, no extra penalty for Box-Cox ...... 134
Figure 4-10 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy3=1, no extra penalty for Box-Cox ...... 135


Figure 4-11 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy4=1, no extra penalty for Box-Cox ...... 136
Figure 4-12 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy5=1, no extra penalty for Box-Cox ...... 137
Figure 4-13 Percent of different handling methods, 5% outliers, N(10,1), penalty for BC ...... 139
Figure 4-14 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases ...... 140
Figure 4-15 Distribution of lambda, Unified Approach, N(10,1), 5% outlier with penalty for BC, accuracy5=1 cases ...... 141
Figure 4-16 Accuracy of outlier detection, N(10,1), 10% outliers, no penalty for BC ...... 142
Figure 4-17 Percent of different handling methods, 10% outliers, penalty for BC ...... 142
Figure 4-18 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases ...... 143
Figure 4-19 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 10% outlier ...... 144
Figure 4-20 Accuracy of outlier detection, N(10,1), 15% outliers, no penalty for BC ...... 145
Figure 4-21 Percent of different handling methods, 15% outliers, penalty for BC ...... 145
Figure 4-22 Accuracy of outlier detection, N(10,1), 15% outliers, penalty for BC, DP=1 and BC=0 cases ...... 146
Figure 4-23 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 15% outlier ...... 147
Figure 4-24 Accuracy of outlier detection, N(10,1), 20% outliers, no penalty for BC ...... 148
Figure 4-25 Percent of different handling methods, 20% outliers, penalty for BC ...... 148
Figure 4-26 Accuracy of outlier detection, N(10,1), 20% outliers, penalty for BC, DP=1 and BC=0 cases ...... 149
Figure 4-27 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 20% outlier ...... 150
Figure 4-28 Percent of different handling methods, 5% outliers, LogN(0,1), penalty for BC ...... 152
Figure 4-29 Accuracy of outlier detection, LogN(0,1), 5% outliers, no penalty for BC ...... 152
Figure 4-30 Accuracy of outlier detection, LogN(0,1), 5% outliers, penalty for BC ...... 153
Figure 4-31 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC ...... 153
Figure 4-32 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 154
Figure 4-33 Percent of different handling methods, 10% outliers, LogN(0,1), penalty for BC ...... 155
Figure 4-34 Accuracy of outlier detection, LogN(0,1), 10% outliers, no penalty for BC ...... 155
Figure 4-35 Accuracy of outlier detection, LogN(0,1), 10% outliers, penalty for BC ...... 156
Figure 4-36 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC ...... 156
Figure 4-37 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 157
Figure 4-38 Percent of different handling methods, 15% outliers, LogN(0,1), penalty for BC ...... 158


Figure 4-39 Accuracy of outlier detection, LogN(0,1), 15% outliers, no penalty for BC ...... 158
Figure 4-40 Accuracy of outlier detection, LogN(0,1), 15% outliers, penalty for BC ...... 159
Figure 4-41 Distribution of lambda, LogN(0,1), 15% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC ...... 159
Figure 4-42 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 160
Figure 4-43 Percent of different handling methods, 20% outliers, LogN(0,1), penalty for BC ...... 161
Figure 4-44 Accuracy of outlier detection, LogN(0,1), 20% outliers, no penalty for BC ...... 161
Figure 4-45 Accuracy of outlier detection, LogN(0,1), 20% outliers, penalty for BC ...... 162
Figure 4-46 Distribution of lambda, LogN(0,1), 20% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC ...... 162
Figure 4-47 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, penalty for BC ...... 163


List of Tables

Table 2-1 Comparison of Skewness and Kurtosis for different anchoring ...... 33
Table 3-1 An example where BC gives extreme lambda ...... 57
Table 3-2 An example where BC gives extreme lambda, cont. ...... 57
Table 3-3 An example showing the bomb of Box-Cox, which gives lambda=-6.91 ...... 60
Table 3-4 An example showing the bomb of Box-Cox, which gives lambda=-6.91, cont. ...... 60
Table 3-5 Comparison of MSE ...... 95
Table 3-6 Location of outliers excluded, LogN(0,1), Clean Data ...... 101
Table 4-1 Goodness of Fit statistics of Normal samples with/without outliers ...... 119
Table 4-2 Goodness of Fit statistics of Lognormal samples with/without outliers ...... 122


1 Introduction

1.1 Background

It is an unfortunate fact of statistical analysis that the data used to build models and draw inferences are not always well-behaved. Outliers are a typical example of aberrant data. They appear in almost all research projects, especially in observational studies, where the dataset contains unusual observations and/or the distribution of the collected data shows extreme skewness and kurtosis.

Outliers can have an adverse impact on statistical analyses. First, they usually increase the error variance and reduce the power of statistical tests. Second, outliers may violate the normality assumption required by many statistical models, inflating both Type I and Type II errors. Third, they can seriously bias the estimates of the parameters of interest. If a researcher's approach is to do nothing about the presence of outliers, the resulting statistical models will essentially describe none of the data, neither the bulk of the data nor the outliers. Even if the outliers are processed, the conclusions drawn from the data might still be biased, or even point in the opposite direction, if the outlier handling strategy is not carefully designed.

Outliers first need to be identified before they can be handled appropriately. Many outlier detection strategies have been proposed and applied in practice. Grubbs' test (Grubbs, 1969), Dixon's Q test (Dixon, 1951), the Tietjen-Moore test (Tietjen and Moore, 1972), and the Generalized ESD test (Rosner, 1983) are popular methods for detecting outliers. Other methods flag observations using the interquartile range, such as Tukey's outlier filter (Hoaglin, Mosteller, and Tukey, 1983). The rest of the approaches are either distance-based (Knorr et al., 1998) or density-based (Breunig, 2000), and both use the distance to the k-nearest neighbors to label outliers.

There is a great deal of literature on how to handle identified outliers. The prevailing methods fall into the following categories. The simplest is to exclude the suspicious observations and then analyze the remaining clean observations, although the deletion of outliers is still debated; in regression, for example, deleted outliers may exhibit a large degree of influence on the parameter estimates (Cook, 1979).

The second method is to use data transformation to alleviate the impact of outliers; natural log and square root transformations are popular choices for handling outliers or skewed distributions. Through transformation, extreme observations can be kept in the dataset without loss of information and the relative ordering of the observations is preserved, while the skewness and error variance present in the data set can be largely reduced (Hamilton, 1992). In particular, Box and Cox's power transformation (Box and Cox, 1964) is widely used to accommodate outliers and skewed distributions. The transformation parameter λ can be found using the maximum likelihood estimator (MLE) (Box and Cox, 1964). However, the MLE of the transformation parameter might be adversely affected by outliers, as shown in Andrews (1971). Conversely, the transformed data set may still contain outliers or exhibit skewness/kurtosis, indicating the failure of the Box-Cox transformation.

The third way is to accommodate outliers, that is, to use various "robust" procedures to protect the data from being distorted by the presence of outliers. These techniques "accommodate the outliers at no serious inconvenience-or are robust against the presence of outliers" (Barnett and Lewis, 1994, p. 35). Certain parameter estimates, especially the mean (variance) and least squares estimates, are particularly vulnerable to outliers, or have "low breakdown" values (Osborne, 2004). For this reason, researchers turn to robust or "high breakdown" methods to provide alternative estimates for these important aspects of the data. Huber's M-estimation (Huber, 1981), least trimmed squares estimation (Rousseeuw, 1984), S-estimation (Rousseeuw and Yohai, 1984), and MM-estimation (Yohai, 1987) are of this type.

On the other side, if we know a priori why an outlier exists, we can simply delete it, for example, when there is a known misplacement of a decimal point. In other situations, however, outliers can contain important information that cannot be ignored, such as in medical studies where the underlying distribution of values is naturally skewed to the right (e.g., cholesterol in adults who are not on statins). In this situation, should we treat those extremely large observations as outliers or simply as regular observations in the long tail? One researcher's outlier may not be another's. Tail observations in one study may be judged as outliers in another.

If we decide those suspicious observations are outliers, we face the question of whether they should be excluded. If we judge them to be non-outliers, we would like to find a way to accommodate them. Data transformation is of great help in this case. A suitable transformation preserves all relevant information while easing the impact of extreme observations.

Another issue is that if we delete some outliers and/or transform the data, how should we assess the end result? If outliers are dropped before the transformation, we bear the risk of losing the information in those deleted observations, as in the cholesterol example above.

We need to make a trade-off between the improved fit of the data and the loss of information through the exclusion of outliers. Information criteria such as the Akaike Information Criterion (AIC) (Akaike, 1974) and the Bayesian Information Criterion (BIC; also the Schwarz criterion, SBC, or SBIC) (Schwarz, 1978) are popular goodness-of-fit statistics that can measure this trade-off; in particular, they penalize over-fitting the data. In our problem, they can be used to penalize the exclusion of outliers.

In this dissertation, I consider the Box-Cox transformation, outlier detection, and an information criterion simultaneously to address the issues mentioned above. A unified new approach to handling outliers in the process of Box-Cox transformation, using penalized assessment, is proposed and examined through simulation studies. Another method for finding an appropriate Box-Cox transformation parameter is also studied, and applications of the new method to real data are demonstrated.


1.2 Outliers

In statistics, an outlier is an observation that is numerically distant from the rest of the data (Barnett and Lewis, 1994). Moore and McCabe (1999) further state that an outlier lies outside the overall pattern of a distribution.

Outliers can have many causes. One possible cause is a change in the measurement system; for example, outliers may appear when a physical measurement apparatus breaks down, and errors in data transmission or transcription can also lead to outliers. A second cause is human manipulation or, sometimes, fraudulent behavior. A third possibility is that the outliers are simply a reflection of reality; in this case, the natural deviations should be flagged for further investigation (Hodge and Austin, 2004).

For centuries there was no standard mathematical definition of an outlier, and recognition was based on subjective judgment. Usually, anomalous observations were detected and, where appropriate, removed to preserve the cleanliness of the dataset. System defects and human fraud could be identified during the outlier detection process, effectively preventing catastrophic consequences in the first place. With today's advanced applications of computer science and statistics, outlier detection methods are much more rigorous and systematic.

The following is a brief review of the models commonly used for outlier detection. All of these models assume that the data follow a normal distribution, so that large deviations from the mean are very "unlikely" (Rousseeuw and Leroy, 1996).

• Grubbs' test for outliers (Grubbs, 1969)


In Grubbs' test, the hypotheses are defined as follows: H0: there are no outliers; Ha: there is at least one outlier. The Grubbs' statistic is defined as the largest absolute deviation from the sample mean in units of the sample standard deviation (Grubbs, 1969):

$$G = \max_{i=1,\ldots,N} \frac{|Y_i - \bar{Y}|}{s}$$

where $\bar{Y}$ and $s$ are the sample mean and standard deviation, respectively. In the two-sided test, the null hypothesis is rejected at significance level $\alpha$ if

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$

where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the $t$ distribution at significance level $\alpha/(2N)$ with $N-2$ degrees of freedom.
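As an illustration, the statistic and critical value above can be computed directly. The following is a minimal sketch (not the dissertation's code; the function names are illustrative), using SciPy only for the t-distribution quantile:

```python
import math
from scipy.stats import t

def grubbs_statistic(y):
    # G = max |Y_i - Ybar| / s, with s the sample standard deviation
    n = len(y)
    ybar = sum(y) / n
    s = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return max(abs(v - ybar) for v in y) / s

def grubbs_critical(n, alpha=0.05):
    # two-sided critical value built from the upper t quantile
    # at level alpha/(2N) with N-2 degrees of freedom
    t2 = t.ppf(1 - alpha / (2 * n), n - 2) ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))
```

An observation far from the bulk of a small sample (e.g. a value of 30 among values near 10) pushes G above the critical value, so the null hypothesis of no outliers is rejected.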

• Dixon's Q test (Dixon, 1951)

To screen bad data with a Q test, the first step is to arrange the data in ascending order and calculate Q as:

$$Q = \frac{\text{gap}}{\text{range}}$$

where the gap is the absolute difference between the outlier and the value closest to it. The questionable point can be rejected when $Q > Q_{\text{table}}$, where $Q_{\text{table}}$ is a reference value corresponding to the confidence level and the sample size.

Other methods flag observations according to the interquartile range, for example the interval [Q1 - k(Q3-Q1), Q3 + k(Q3-Q1)], where Q1 and Q3 are the lower and upper quartiles, respectively. Any observation outside this range can be labeled an outlier (Tukey, 1977).
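This quartile-based filter can be sketched as follows, using the conventional k = 1.5. The sketch is illustrative (not from the dissertation), and the quartile convention used here, medians of the lower and upper halves, is one of several in common use:

```python
def tukey_fences(data, k=1.5):
    """Return the observations outside [Q1 - k*IQR, Q3 + k*IQR]."""
    xs = sorted(data)
    n = len(xs)

    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2

    q1 = median(xs[: n // 2])          # median of the lower half
    q3 = median(xs[(n + 1) // 2 :])    # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]
```

Unlike Grubbs' or Dixon's tests, the fences make no normality assumption and require no reference table, which is why they are popular for quick exploratory screening.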


In addition to the common model-based outlier detection methods, there are also distance-based methods (Knorr, 1998) and density-based methods (Breunig, 2000). Both of these frequently use the distance to the k-nearest neighbors to gauge outliers.

In the process of identifying outliers, the masking effect and the swamping effect are problems we need to pay attention to. The masking effect means that in the successive use of a test for detecting a single outlier, we may fail to detect any outlier because of the effect of other outliers (Tietjen and Moore, 1972). For example, when there are multiple outliers rather than one or two, the additional outliers may influence the value of the test statistic so that it shows no sign of outliers. The swamping effect, on the other hand, occurs when too many outliers are flagged during detection (Barnett and Lewis, 1984). For example, when there exists only a single outlier while we are looking for two or more, the test results will usually indicate either none or all of the tested points as outliers. Because of masking and swamping effects, many tests require that the number of outliers being tested be clearly specified before the test (Kitagawa, 1979).


1.3 Data transformation

Data transformation should align with the particular statistical analysis method. Some statistical analyses provide guidance on how data transformation should be performed, or whether it is necessary at all. For instance, to construct a 95% confidence interval, an easy way is to take the sample mean plus and minus two standard deviation units. However, this two-standard-deviation rule can only be used for data with a normal distribution and is not applicable to non-normal data. The central limit theorem states that the sample mean in most cases does vary normally when the sample size is reasonably large. However, if the sample size is not large enough and the sample is substantially skewed, the central limit theorem, together with the resulting confidence interval, fails to be a good approximation for the population. Therefore, it is advised that skewed data be transformed to a symmetric distribution before a confidence interval is constructed. To preserve the original scale of the data, the confidence interval constructed on the transformed symmetric scale can be transformed back with the inverse of the transformation.
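The back-transformation idea can be sketched for right-skewed data. This illustration (not from the dissertation) uses a log transform and the two-standard-deviation rule on the transformed scale, then inverts with exp():

```python
import math
import statistics

def back_transformed_interval(data):
    """Build mean +/- 2 SD on the log scale, then map the endpoints back."""
    logs = [math.log(x) for x in data]   # transform to (approximate) symmetry
    m = statistics.mean(logs)
    s = statistics.stdev(logs)
    # invert the transformation to return to the original scale
    return math.exp(m - 2 * s), math.exp(m + 2 * s)
```

Note that the resulting endpoints are asymmetric about the raw-scale center, which is exactly what a right-skewed distribution calls for.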

1.3.1. Mathematical expressions

Tukey (1957) proposed a family of power transformations. The transformed values are a monotonic function of the observations over some admissible range, indexed by the constant λ for positive values:

$$x_i^{(\lambda)} = \begin{cases} x_i^{\lambda}/\lambda, & \lambda \neq 0 \\ \log x_i, & \lambda = 0 \end{cases}$$

However, the transformation was modified by Box and Cox (1964) to take into account the discontinuity at λ = 0, so that

$$x_i^{(\lambda)} = \begin{cases} (x_i^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log x_i, & \lambda = 0 \end{cases}$$

for some unknown λ. To accommodate negative values, the transformation was further modified by Box and Cox (1964) with a shift parameter:

$$x_i^{(\lambda)} = \begin{cases} [(x_i + \mu)^{\lambda} - 1]/\lambda, & \lambda \neq 0 \\ \log(x_i + \mu), & \lambda = 0 \end{cases}$$

where λ is the power transformation parameter and μ is chosen so that $x_i + \mu$ is greater than zero for all of the data.
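The shifted Box-Cox transform above can be written as a small helper. A minimal sketch follows (the function name and the positivity check are illustrative, not the dissertation's code):

```python
import math

def box_cox(x, lam, shift=0.0):
    """Shifted Box-Cox transform of one value; requires x + shift > 0."""
    y = x + shift
    if y <= 0:
        raise ValueError("x + shift must be positive")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam
```

As λ approaches 0, the λ ≠ 0 branch approaches log(y), which is why the λ = 0 case removes the discontinuity of the raw power family.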

Box and Cox (1964) also proposed a normalized version of the transformation, scaled by the geometric mean $\mathrm{GM}(x) = (x_1 \cdots x_n)^{1/n}$:

$$x_i^{(\lambda)} = \begin{cases} (x_i^{\lambda} - 1)\big/\big(\lambda\, \mathrm{GM}(x)^{\lambda-1}\big), & \lambda \neq 0 \\ \mathrm{GM}(x) \log x_i, & \lambda = 0 \end{cases}$$

It can also be extended to a shifted version:

$$x_i^{(\lambda)} = \begin{cases} [(x_i + \mu)^{\lambda} - 1]\big/\big(\lambda\, \mathrm{GM}(x)^{\lambda-1}\big), & \lambda \neq 0 \\ \mathrm{GM}(x) \log(x_i + \mu), & \lambda = 0 \end{cases}$$

Manly (1976) suggested an alternative transformation that is suitable for negative values and is claimed to be effective at turning skewed unimodal distributions into nearly symmetric, normal-like distributions:

$$x_i^{(\lambda)} = \begin{cases} [\exp(\lambda x_i) - 1]/\lambda, & \lambda \neq 0 \\ x_i, & \lambda = 0 \end{cases}$$

John and Draper (1980) introduced the modulus transformation, intended to normalize distributions that already possess some degree of approximate symmetry; it has the following form:

$$x_i^{(\lambda)} = \begin{cases} \operatorname{sign}(x_i)\,[(|x_i| + \mu)^{\lambda} - 1]/\lambda, & \lambda \neq 0 \\ \operatorname{sign}(x_i) \log(|x_i| + \mu), & \lambda = 0 \end{cases}$$

Bickel and Doksum (1981) suggested another version of the transformation so that distributions of $x_i^{(\lambda)}$ with unbounded support, such as the normal distribution, can be included: for λ > 0,

$$x_i^{(\lambda)} = \big[\operatorname{sign}(x_i)\,|x_i|^{\lambda} - 1\big]\big/\lambda$$

The inverse hyperbolic sine transformation proposed by Johnson (1949) is also applied to accommodate skewed data:

$$g_t = \sinh^{-1}(\theta y_t)/\theta = \log\!\big(\theta y_t + (\theta^2 y_t^2 + 1)^{1/2}\big)\big/\theta$$

Yeo and Johnson (2000) proposed a modification of the Box-Cox transformation that is defined for any real x:

$$\psi(\lambda, x) = \begin{cases} [(x+1)^{\lambda} - 1]/\lambda, & x \ge 0,\ \lambda \neq 0 \\ \log(x+1), & x \ge 0,\ \lambda = 0 \\ -[(-x+1)^{2-\lambda} - 1]/(2-\lambda), & x < 0,\ \lambda \neq 2 \\ -\log(-x+1), & x < 0,\ \lambda = 2 \end{cases}$$
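Because the Yeo-Johnson family handles negative values directly, it is straightforward to sketch; the following illustration follows the four cases above (the function name is illustrative):

```python
import math

def yeo_johnson(x, lam):
    """Yeo & Johnson (2000) transform, defined for any real x."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log(-x + 1)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)
```

A useful sanity check: at λ = 1 the transform reduces to the identity, ψ(1, x) = x for all x, so λ measures the departure from "no transformation".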

1.3.2. Estimation of parameters

To find the transformation parameter λ (and a shift parameter, where present), Box and Cox (1964) proposed maximum likelihood as well as Bayesian methods. For the maximum likelihood method, it is assumed that for some unknown λ the transformed observations satisfy the normal theory assumptions, i.e., they are independently normally distributed with mean μ and constant variance σ². By multiplying the normal density by the Jacobian of the transformation, we obtain the probability density of the transformed observations, and hence the likelihood in terms of the original observations. Maximizing this likelihood with respect to λ, μ, and σ² provides the corresponding parameter estimates.
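This maximization is commonly carried out by profiling out μ and σ² and scanning λ over a grid. The following sketch (illustrative, not the dissertation's implementation) uses the standard profile log-likelihood, $-(n/2)\log\hat{\sigma}^2_{\lambda} + (\lambda-1)\sum_i \log x_i$, for positive data:

```python
import math

def box_cox_lambda_mle(data, grid=None):
    """Grid search of the Box-Cox profile log-likelihood over lambda."""
    if grid is None:
        grid = [i / 100 for i in range(-200, 201)]  # lambda in [-2, 2]
    n = len(data)
    sum_log = sum(math.log(x) for x in data)  # Jacobian term
    best_lam, best_llf = None, -math.inf
    for lam in grid:
        if lam == 0:
            z = [math.log(x) for x in data]
        else:
            z = [(x ** lam - 1) / lam for x in data]
        mean = sum(z) / n
        var = sum((v - mean) ** 2 for v in z) / n  # profiled sigma^2
        llf = -(n / 2) * math.log(var) + (lam - 1) * sum_log
        if llf > best_llf:
            best_lam, best_llf = lam, llf
    return best_lam
```

For data that grow roughly geometrically, the maximizer lands near λ = 0, i.e., close to a log transform, which matches the intuition that logging such data symmetrizes it.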

For the Bayesian method, it is assumed that the prior distributions of μ and log σ can be taken as essentially uniform over the region in which the likelihood is appreciable, and we integrate over these parameters to obtain a posterior distribution of λ. A review of further Bayesian considerations can be found in Pericchi (1981) and Sweeting (1984). Robust adaptations of the estimation procedures have been studied by a group of researchers, including Carroll (1980, 1982a), Bickel and Doksum (1981), Carroll and Ruppert (1984), Taylor (1983, 1985a, b, 1987), and Carroll and Ruppert (1987).

Andrews et al. (1971, 1973), Dunn and Tubbs (1980), and Beauchamp and Robson (1986) have extended the Box-Cox procedure to multivariate data. Draper and Cox (1969) proposed an approximation for the precision of the maximum likelihood estimate, which was later corrected by Hinkley (1975). Cressie (1978) suggested a simple graphical procedure to estimate the transformation parameter by utilizing the principle of one degree of freedom for non-additivity in a two-way table without replication, while Hernandez and Johnson (1980) proposed to estimate the transformation by minimizing the Kullback-Leibler information (Kullback and Leibler, 1951) when examining the large-sample behavior of transformation to normality.

Hinkley (1985) proposed an estimation of the transformation parameter based on the likelihood analysis for local deviations from a normal theory linear model. This estimation simultaneously corrects for non-additivity and heterogeneity of residuals and is considered a more analytical procedure. Later, based on Kendall's rank correlation, Han (1987) proposed a nonparametric estimation of the transformation parameter; Han's approach was considered to be more consistent and efficient than the maximum likelihood estimator. Solomon (1985) suggested applying the Box-Cox technique to simple random effects models, and the technique was later applied to all types of mixed models by Sakia (1988). Chang compiled computer programs for the estimation of λ, and the programs were subsequently refined by Huang et al. (1978).

1.3.3. Hypothesis tests and inference on transformation parameter

Researchers are always interested in finding out whether the estimate of the Box-Cox transformation parameter conforms to a hypothesized value. Box and Cox (1964) employed the asymptotic distribution of the likelihood ratio to test the hypotheses. Later, Andrews (1971) proposed another test in which the null distribution of the parameter is known. This test for the value of the parameter ignores the Jacobian of the transformation and is easier to calculate. Atkinson (1973) incorporated this approach in a comparison of the powers of the three tests and concluded that the tests derived from the likelihood method have more power. Lawrance (1987a) later standardized Atkinson's score statistic and conducted a comparison of the two tests. It was found that the standardized statistic has improved standard normal behavior compared to Atkinson's test. Lawrance's statistic has since been extended by Hinkley (1988) to models where both the response and the mean may be transformed.

Furthermore, as an alternative to the likelihood ratio confidence interval for testing hypotheses about the transformation parameter, Lawrance (1987b) has given an asymptotically justifiable expression for the estimated variance of λ, which leads to more efficient hypothesis tests on λ than Atkinson's (1985, p. 100), which is based on a regression analogy for constructed variables. A more recent improvement of both Atkinson's test and Lawrance's standardized score statistic has been proposed by Wang (1987) and is said to give a more accurate approximation to the standard normal. A simulated comparison of the test statistics by Atkinson and Lawrance (1989) concluded that, in general, Atkinson's test is very similar to Lawrance's test. However, a caution should be made that the small samples used in Lawrance's test could have masked its superiority over Atkinson's test.

While Draper and Cox (1969) have shown that the estimation of λ is fairly robust to non-normality as long as the variable has a reasonably symmetric distribution, this may not be the case when skewness is encountered. Thus, Poirier (1978) investigated the effect of the transformation in limited dependent variables, i.e., variables which have possibly been censored or truncated, thus introducing some skewness. The procedure maximizes the likelihood function of the truncated normal distribution. Although the asymptotic properties of the maximum likelihood estimators are known, little is known about their small-sample properties. Spitzer (1978) employed the Box-Cox transformation approach and examined the small-sample properties. The results suggested that the Box-Cox approach met the assumption of approximate normality. For forecasting purposes, the forecasts were unbiased and their variances were remarkably low.

Carroll and Ruppert (1981), Carroll (1982a), and Hinkley and Runger (1984) have provided further responses to the work of Bickel and Doksum (1981). Doksum and Wong (1983) concluded that tests on Box-Cox transformed data have good power properties. Therefore, it is acceptable to use the standard methods for the normal linear model on the transformed variables.

Wood (1974) and Carroll and Ruppert (1984, 1988) made transformations on both the response and the theoretical model in order to examine the Box-Cox transformation when the theoretical regression function was given. The Box-Cox transformation approach was also examined by a Monte Carlo study, which concluded that it matters little when the correct transformation is not known a priori. Later, Ruppert et al. (1989) studied whether transforming theoretical and empirical models fits the Michaelis-Menten model, together with its error structure. Rudemo et al. (1989) used the power transformation for the logistic model in bioassay and reached positive conclusions. Wixley (1986) applied a Box-Cox power transformation in linear models, and the results suggested that unconditional likelihood ratio tests maintain a more accurate level.

1.3.4. Impact of outliers on data transformation.

The selection of a transformation may properly be viewed as part of model formulation and, in this initial phase of analysis, influential cases can have particularly important and lasting effects that are difficult to uncover in the subsequent analysis. Thus, an outlying observation in the original scale may conform to the rest of the data in the transformed scale. It is therefore necessary to find out whether the evidence for a particular transformation is spread evenly throughout the data or comes from just a few cases.

Andrews (1971) showed that the maximum likelihood estimator (MLE) of the transformation parameter may be adversely influenced by one or several outlying observations. Atkinson (1982) introduced some diagnostic displays of outlying and influential observations in multiple regression and their possible reduction by a transformation. Some criteria for estimating the transformation through the use of constructed variables are suggested by Atkinson (1983, 1985), and this has been extended to the power transformation after a shift in location. Carroll (1980, 1982b) proposed robust estimators of the transformation parameter by replacing the normal likelihood with an objective function that is less sensitive to outlying responses. In order to measure the influence of cases on the Box-Cox likelihood estimate of the response transformation parameter in regression, Cook and Wang (1983) proposed a method superior to Atkinson's in regard to detecting influential cases for a transformation. Atkinson (1986) further extended his work by deriving some expressions for estimating the effect of deletion of observations on the estimate of the transformation parameter and some subsequent tests of hypotheses.


1.4 Information Criterion

1.4.1 Akaike Information Criterion

The Akaike information criterion, AIC (Akaike, 1974), is a measure of the goodness of fit of a statistical model and is used for variable selection. It describes the tradeoff between the accuracy and the complexity of the model and is often used as a model selection index. The idea behind the AIC is that in choosing between possible models there needs to be a trade-off between the improved fit of a more complex model and the increased number of parameters such a model requires.

AIC = -2 ln(L) + 2k,

where k is the number of parameters and L is the maximized value of the likelihood function. Given a dataset, the model with the minimum value of AIC should be preferred. Hence AIC not only rewards goodness-of-fit, but also includes a penalty that is an increasing function of the number of parameters, i.e., the complexity of the model. The simplicity of the AIC expression and of its calculation makes it appropriate for dealing with outliers.

Kitagawa (1979) was the first to apply AIC to detect outliers. In his work, two models were proposed to describe the distribution of a data set consisting of clean data and outlying observations. Observations on the low side and the high side are treated as outlier candidates, while the middle part of the data is treated as the main group, or clean sample. In Model 1, the outliers are assumed to be normally distributed with the same variance σ² as the main group but with different means. In Model 2, they are assumed to be normally distributed with the same mean as the main group but with different variances. The AIC of models with different numbers of outliers is computed, and the minimum AIC decides which observations are outliers, corresponding to the "best" model.


Model 1:

f_i(x) = φ(x; μ1, σ²),   i = 1, …, n1
f_i(x) = φ(x; μ, σ²),    i = n1 + 1, …, n − n2
f_i(x) = φ(x; μ2, σ²),   i = n − n2 + 1, …, n

Model 2:

f_i(x) = φ(x; μ, τ1²),   i = 1, …, n1
f_i(x) = φ(x; μ, σ²),    i = n1 + 1, …, n − n2
f_i(x) = φ(x; μ, τ2²),   i = n − n2 + 1, …, n

where n1 and n2 are the numbers of outlier candidates at the low end and the high end, respectively.
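The two-model setup above can be sketched in code. The following is a simplified toy illustration of the Model 1 idea (my own construction, not Kitagawa's exact formulation): the lowest n1 and highest n2 sorted observations receive their own group means, all groups share one variance estimated by its pooled MLE, and the (n1, n2) pair with the smallest AIC is selected.

```python
import math

def kitagawa_aic(xs, n1, n2):
    """AIC of a Model-1-style fit on the sorted data xs: the lowest n1
    and highest n2 points get their own group means, and all groups
    share one variance, estimated by its pooled MLE."""
    n = len(xs)
    groups = [xs[:n1], xs[n1:n - n2], xs[n - n2:]]
    sse = 0.0
    for g in groups:
        if g:
            m = sum(g) / len(g)
            sse += sum((v - m) ** 2 for v in g)
    k = 2 + (n1 > 0) + (n2 > 0)   # mu, sigma, plus one mean per outlier group
    sigma2 = sse / n
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * loglik + 2 * k

def detect_outliers(x, max_per_side=3):
    """Scan all (n1, n2) pairs and return the pair with the smallest AIC."""
    xs = sorted(x)
    scored = [(kitagawa_aic(xs, n1, n2), n1, n2)
              for n1 in range(max_per_side + 1)
              for n2 in range(max_per_side + 1)]
    _, n1, n2 = min(scored)
    return n1, n2
```

On toy data this search reliably isolates a gross outlier, but with AIC's fixed penalty and a tiny sample it also tends to flag some genuine extremes; this small-sample weakness of AIC is one motivation for the BIC-based penalty adopted later in this dissertation.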

Kadota (2003a, 2003b, 2006) applied the U-statistic (Ueda, 1996), an approximation of AIC, to detect genes whose expression profile is considerably different in some tissues than in others. Gene expression ratios are sorted and normalized, and then an approximation to AIC called the "U-statistic" is calculated for different combinations of outlier candidates. The combination with the minimum U-statistic identifies the tissue-specific genes, which are in fact outliers.

1.4.2 Bayesian Information Criterion

Based on the likelihood function, the Bayesian information criterion (BIC) or Schwarz criterion (also SBC, SBIC) (Schwarz, 1978) provides a criterion for model selection among a finite set of models. The BIC is derived under an exponential family distribution assumption. The formula is:

BIC = -2 ln(L) + k ln(n) ≈ -2 ln(p(x|k)),

where ln(L) is the maximized log-likelihood of the observations, k is the number of free parameters to be estimated, and n is the sample size of the data. ln(p(x|k)) is the natural log probability of the observed data given the number of parameters.


In the model fitting process, adding parameters can help increase the likelihood. However, this can cause overfitting of the model. BIC penalizes the number of parameters and resolves the overfitting issue properly. The penalty is larger in BIC than in AIC.
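As a concrete illustration of the two formulas, the sketch below computes AIC and BIC for a normal fit with k = 2 free parameters; the function name and setup are illustrative, not from any particular package.

```python
import math

def normal_fit_criteria(x):
    """Return (AIC, BIC) of a normal fit with k = 2 free parameters
    (mean and variance), using the log-likelihood maximized at the MLEs."""
    n = len(x)
    mu = sum(x) / n
    sigma2 = sum((v - mu) ** 2 for v in x) / n       # MLE of the variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = 2
    return -2 * loglik + 2 * k, -2 * loglik + k * math.log(n)
```

Since the two criteria share the -2 ln(L) term, their difference is exactly k ln(n) - 2k; the BIC penalty exceeds the AIC penalty once ln(n) > 2, i.e., for n ≥ 8.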


1.5 Research Gap in Literature

A lot of research work has been done on detecting outliers, transforming data, and applying information criteria (AIC and BIC). However, many problems remain unaddressed.

First, the Box-Cox transformation is not resistant in the presence of outliers due to the non-robustness of the MLE. Whether to transform data with or without the outliers needs further consideration. The transformed data might not be as normal as expected and may still contain outliers if they are retained. If they are dropped, we bear the risk of losing information in exchange for a better model fit. Moreover, the transformation parameter λ found through optimization is not always reasonable due to the flatness of the likelihood function.

The second question of interest is how suspicious observations are treated. There are two ways to view them: either as real outliers or as regular extreme observations in a skewed dataset. Identified outliers can be dropped to improve model fitting, and the skewness of the data can be alleviated through data transformation. This is a critical issue that needs to be addressed before any analysis can be conducted in the presence of extreme observations.

Third, the advantage of using an information criterion such as AIC to detect outliers is quite obvious, as shown in the work of Kitagawa (1979) and Kadota (2003b), but its limitations cannot be overlooked. Outliers can be identified through AIC, but how to handle them, i.e., whether to exclude the identified outliers or keep them, is still a problem. How to evaluate the goodness of fit of the Box-Cox transformation when outliers are dropped is unsolved. Additionally, in Kitagawa's (1979) paper AIC is used for a small sample (15 observations); when the sample size becomes larger, the computational complexity increases dramatically, because the AIC of all possible combinations of outlier candidates needs to be calculated. Moreover, we do not know exactly how many outliers are in the sample, which makes the case more difficult. The assumption that the outliers are normally distributed is quite strong and sometimes not realistic; for example, if there are in fact only two or three outliers, it is not reasonable to use a normal distribution to describe them.

Last but not least, there is no research that integrates the Bayesian Information Criterion (BIC), the Box-Cox transformation, and outlier detection in a way that can robustly detect outliers and find the appropriate transformation at the same time without excluding too many extreme observations. I will find the best compromise between removing outliers and/or transforming the data, using the BIC as a guide. In this dissertation I propose a unified approach that combines the Box-Cox transformation and outlier detection using penalized assessment, which will tackle the above-mentioned problems. The research gap in the literature is shown in Figure 1-1.


Figure 1-1 Research Gap


2. Unified Approach

2.1. Anchoring-to-1

The Box-Cox transformation can be directly applied to positive observations; for samples containing negative values, one more step is needed, which is to shift the whole sample to positive values by adding a constant to each of the observations. This is stated by Box and Cox (1964):

x_i = [(y_i + μ)^λ − 1] / λ,   λ ≠ 0
x_i = log(y_i + μ),            λ = 0

where y_i is the raw sample of size n, x_i is the transformed sample, λ is the transformation parameter, and μ is chosen so that y_i + μ is greater than zero for all of the data. Now the question is how to choose μ. One may choose μ = −y_min + ε to ensure that the minimum of the shifted sample is always greater than zero, which means that the sorted shifted sample becomes

ε, y_(2) + ε − y_min, y_(3) + ε − y_min, …, y_(n) + ε − y_min,

where ε can theoretically be any positive number.

The following two figures show the comparison of distributions for different choices of ε. The sample here is Normal(10, 1) with sample size 100. The shape of the histogram does not change much, although the minimum of the distribution is shifted to 1 (Figure 2-1) and 5 (Figure 2-2), respectively.

The next two figures show the shift of a skewed distribution, Log-Normal; the sample size is also 100. Before the shift, the sample has quite a long right tail, while after shifting to 1 and 5, the symmetry of the distribution is improved, although it does not look normal yet (Figure 2-3 and Figure 2-4).


Now two questions arise here:

1) Does the choice of ε affect the resulting λ?

2) Does y_min influence the efficacy of the Box-Cox transformation?


Figure 2-1 Anchor-to-1, N(10,1), Sample size=100


Figure 2-2 Anchor-to-5, N(10,1), Sample size=100


It is noted that adding a constant to a variable changes only the mean, not the standard deviation or variance, skewness, or kurtosis. However, the constant ε, and thus the amount of the shift of the distribution, can influence subsequent data transformations. This will be shown in the following examples.

A simple simulation here shows the effect of different choices of ε on the Box-Cox transformation parameter λ. Samples from a normal distribution are generated with sample sizes ranging from 20 to 200. Five choices of ε are used to shift the minimum of the original samples to 0.05, 0.1, 1, 2, and 3, respectively. For each case, 1000 repetitions are produced. Since those samples are generated from a normal distribution, we expect to find λ = 1 using the Box-Cox transformation, which means that no transformation is needed. It can be seen that in the baseline case, where the raw sample without any shift is transformed using Box-Cox, the resulting λ has large variation, especially in small samples. When the other values of ε are applied, the variation of λ is alleviated, its mean is quite close to 1, and the confidence interval is narrower than in the baseline case (Figure 2-5 and Figure 2-6).
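A miniature version of this simulation can be sketched as follows, using a crude grid search in place of the Newton-Raphson step; the repetition count, grid, and seed are scaled-down illustrative choices, not those used in the dissertation's simulations.

```python
import math, random, statistics

def boxcox_lambda(y, grid=None):
    """Profile-likelihood estimate of the Box-Cox lambda via grid search
    (a simple stand-in for Newton-Raphson)."""
    if grid is None:
        grid = [l / 10 for l in range(-50, 51)]   # lambda in [-5, 5]
    n = len(y)
    sumlog = sum(math.log(v) for v in y)
    best_lam, best_ll = grid[0], -math.inf
    for lam in grid:
        if abs(lam) < 1e-12:
            x = [math.log(v) for v in y]
        else:
            x = [(v ** lam - 1) / lam for v in y]
        m = sum(x) / n
        s2 = sum((v - m) ** 2 for v in x) / n
        ll = -0.5 * n * math.log(s2) + (lam - 1) * sumlog
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return best_lam

# Scaled-down simulation: lambda estimated on raw N(10,1) samples
# versus the same samples anchored to 1.
random.seed(1)
raw_lams, anchored_lams = [], []
for _ in range(100):
    y = [random.gauss(10, 1) for _ in range(30)]
    raw_lams.append(boxcox_lambda(y))
    anchored = [v + 1 - min(y) for v in y]        # shift the minimum to 1
    anchored_lams.append(boxcox_lambda(anchored))

raw_sd = statistics.pstdev(raw_lams)
anchored_sd = statistics.pstdev(anchored_lams)
```

The anchored estimates should spread far less than the raw ones, in line with Figures 2-5 and 2-6.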


Figure 2-3 LogNormal, Anchor-to-1, sample size=100


Figure 2-4 LogNormal, Anchor-to-5, sample size=100


Figure 2-5 Lambdas of Different Anchoring


Figure 2-6 95% Confidence Interval of λ at different anchor points


Another example also shows the impact of anchoring on transformation. We generated a skewed random sample of size 100; the raw sample is generated from Exp(X)+10, where X ~ N(0,1), and the histogram is plotted in Figure 2-7. It is desired to find a transformation that will make it behave normally. Three of the most frequently used transformations, square root, natural logarithm, and inverse, are applied to the raw data. We also anchored the raw sample to different values: 1, 2, 3, 5, and 10. The skewness and kurtosis of the raw sample are 2.1 and 4.8, respectively, which indicates a large deviation from normality. If the three transformations are directly applied to the raw data, the skewness and kurtosis become closer to zero (normal distribution). If the raw sample is first anchored to one of the five values, the skewness and kurtosis both become even closer to zero, indicating a better fit to normality compared to no anchoring (Table 2-1 and Figure 2-8). The histograms of transformed samples using different anchoring values are shown in Figure 2-9, Figure 2-10, and Figure 2-11.
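The anchoring effect in this example can be reproduced in a few lines. The sketch below uses an assumed seed and its own random sample, so the exact numbers will differ from Table 2-1; it compares the skewness of the log-transformed sample with and without anchoring to 1.

```python
import math, random

def skewness(x):
    """Population skewness: third central moment over sigma cubed."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return sum((v - m) ** 3 for v in x) / (n * s ** 3)

# Skewed sample in the spirit of Figure 2-7: Exp(X) + 10 with X ~ N(0,1).
random.seed(7)
y = [math.exp(random.gauss(0, 1)) + 10 for _ in range(200)]
anchored = [v + 1 - min(y) for v in y]           # shift the minimum to 1

sk_log_raw = skewness([math.log(v) for v in y])
sk_log_anchored = skewness([math.log(v) for v in anchored])
```

As in the LOG row of Table 2-1, the log transform applied after anchoring to 1 should leave noticeably less skewness than the log transform applied directly to the raw data.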


Figure 2-7 Distribution of Skewed Raw Sample


Table 2-1 Comparison of Skewness and Kurtosis for different anchoring

                Y        Original   Anchor_1   Anchor_2   Anchor_3   Anchor_5   Anchor_10
SQRT  Skewness  2.1053   1.9236     1.4504     1.6104     1.7020     1.8074     1.9224
      Kurtosis  4.7690   3.8745     2.0144     2.5720     2.9235     3.3575     3.8688
LOG   Skewness  2.1053   1.7506     0.8505     1.1570     1.3303     1.5294     1.7483
      Kurtosis  4.7690   3.0917     0.2596     1.0066     1.5277     2.2145     3.0818
INV   Skewness  2.1053  -1.4316     0.0682    -0.4215    -0.7039    -1.0404    -1.4274
      Kurtosis  4.7690   1.8216    -0.9297    -0.6234    -0.1728     0.6024     1.8066


[Figure 2-8 consists of three panels, "Comparison of anchoring for Sqrt Transformation," "Comparison of anchoring for Log Transformation," and "Comparison of anchoring for Inverse Transformation," each plotting the skewness and kurtosis at the different anchor values.]

Figure 2-8 Comparison of Skewness and Kurtosis for different anchoring


Figure 2-9 Histogram of transformed sample with different anchoring: SQRT Transformation


Figure 2-10 Histogram of transformed sample with different anchoring: LOG Transformation


Figure 2-11 Histogram of transformed sample with different anchoring: INVERSE Transformation


To find an outlier in a dataset, one simple way is to sort the observations from smallest to largest and then look at those on the lower and higher sides to see whether they are extreme enough to be treated as outliers. Usually the transformation is applied to the raw observations without any modification (for positive data). In this thesis, it is proposed that one extra step is needed before the transformation procedure: the sorted raw data should be anchored to 1. Osborne (2002) also mentioned this point, although no simulation was provided in his work. The reason why the raw data is anchored to 1 is that the minimum value of a sample has an effect on the efficacy of transformations, which is shown in the above examples and simulations. Further details of the benefits of anchoring-to-1 will be given in later sections, especially when the objective is to look for outliers and transform the data at the same time. The derivation of the Box-Cox transformation parameter λ is given below.

• Non-Anchoring

The raw data are denoted by y_1, y_2, …, y_n and the transformed data by x_1, x_2, …, x_n, in order of increasing magnitude. The transformed data are expected to follow a normal distribution with unknown mean and variance. The Box-Cox transformation parameter is obtained by maximizing the log-likelihood function of the transformed data.

x_i = (y_i^λ − 1) / λ,   λ ≠ 0
x_i = log(y_i),          λ = 0

x_1, x_2, …, x_n ~ N(μ, σ²)

The likelihood function of x_1, x_2, …, x_n is

L_x = n! ∏_{i=1}^{n} f(x_i) = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²))

To find the likelihood function of the raw data y_1, y_2, …, y_n, the Jacobian of the Box-Cox transformation is

J = |dx_i / dy_i| = y_i^{λ−1}

L_y = n! ∏_{i=1}^{n} f(x_i) ∏_{i=1}^{n} y_i^{λ−1} = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²)) ∏_{i=1}^{n} y_i^{λ−1}

The log-likelihood function is

LL_y = log(n!) − (n/2) log(2πσ²) − Σ_{i=1}^{n} (x_i − μ)² / (2σ²) + (λ − 1) Σ_{i=1}^{n} log(y_i)

Through MLE,

μ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Substituting μ̂ and σ̂² into the log-likelihood function,

LL_y = log(n!) − (n/2) log(2πe) − (n/2) log(σ̂²) + (λ − 1) Σ_{i=1}^{n} log(y_i)

λ can be found iteratively through the Newton-Raphson algorithm.

• Anchoring-to-1

x_i = [(y_i + m)^λ − 1] / λ,   λ ≠ 0
x_i = log(y_i + m),            λ = 0

where m = −y_(1) + 1, which shifts the minimum of the anchored sample to 1.

The likelihood function of x_1, x_2, …, x_n remains the same:

L_x = n! ∏_{i=1}^{n} f(x_i) = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²))

To find the likelihood function of the raw sample y_1, y_2, …, y_n, the Jacobian of the Box-Cox transformation is

J = |dx_i / dy_i| = (y_i + m)^{λ−1}

L_y = n! (2πσ²)^{−n/2} exp(−Σ_{i=1}^{n} (x_i − μ)² / (2σ²)) ∏_{i=1}^{n} (y_i + m)^{λ−1}

The log-likelihood function is

LL_y = log(n!) − (n/2) log(2πσ²) − Σ_{i=1}^{n} (x_i − μ)² / (2σ²) + (λ − 1) Σ_{i=1}^{n} log(y_i + m)

Through MLE,

μ̂ = x̄,   σ̂² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²

Substituting μ̂ and σ̂² into the log-likelihood function,

LL_y = log(n!) − (n/2) log(2πe) − (n/2) log(σ̂²) + (λ − 1) Σ_{i=1}^{n} log(y_i + m)

λ can be found iteratively through the Newton-Raphson algorithm, similar to the non-anchored case.
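The final anchored log-likelihood can be implemented directly. The sketch below evaluates LL_y at the MLEs of μ and σ², with math.lgamma(n + 1) supplying the log(n!) term; the function name is illustrative. One convenient consequence of anchoring, which the first assertion below exploits, is that m absorbs any location shift, so the profile likelihood is unchanged when a constant is added to every raw observation.

```python
import math

def anchored_loglik(y, lam):
    """Profile log-likelihood LL_y of the raw data under the
    anchored-to-1 Box-Cox transformation, evaluated at the MLEs of
    mu and sigma^2."""
    n = len(y)
    m = 1 - min(y)                       # anchor the sample minimum to 1
    z = [v + m for v in y]
    if abs(lam) < 1e-12:
        x = [math.log(v) for v in z]
    else:
        x = [(v ** lam - 1) / lam for v in z]
    mu = sum(x) / n
    sigma2 = sum((v - mu) ** 2 for v in x) / n
    return (math.lgamma(n + 1)                       # log(n!)
            - 0.5 * n * math.log(2 * math.pi * math.e)
            - 0.5 * n * math.log(sigma2)
            + (lam - 1) * sum(math.log(v) for v in z))
```

Maximizing this function over λ (by Newton-Raphson, or by a simple grid) gives the anchored Box-Cox estimate.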


2.2. Penalized Assessment

Information criteria such as AIC and BIC are measures of the goodness of fit of a statistical model. They describe the tradeoff between the accuracy and the complexity of the model and are often used as model selection criteria. The idea behind the AIC is that in choosing between possible models there needs to be a trade-off between the improved fit of a more complex model and the increased number of parameters such a model requires. Information criteria (AIC and BIC) can be used to detect outliers, as seen in Kitagawa (1979) and Kadota (2003a, 2003b, 2006). In my unified approach, an information criterion is used to assess the goodness of fit of the transformed data, which are supposed to follow a normal distribution. The information criterion can be easily computed from the sum of squared errors of the transformed data.

When a skewed sample or a sample with outlying observations is to be analyzed, one may have the following options to make the data behave normally, or close to normally:

a) Box-Cox transform the whole sample to get a normal transformed sample;

b) Exclude some extreme observations in the tails to make the remaining sample look normal;

c) Exclude some extreme observations and Box-Cox transform the remaining sample to get transformed data that are closer to normal than in case b).

The purpose of the above options is to obtain a better model fit, meaning a transformed sample close to normal. However, one may run the risk of over-fitting the data, in the sense that one tends to exclude more extreme observations (causing loss of information) or implement unnecessary transformations (over-analyzing) to get a better fit. In this sense, I would like to utilize the information criterion's attractive feature of penalizing over-fitting to decide the trade-off and answer the questions: how many extreme observations should be excluded, whether a Box-Cox transformation is needed, and whether the transformation should be based on the whole sample or a partial sample.

ሺሻ. In theŽכሺሻ ൅Ž כ ʹand  ൌ െ כʹሺሻ ൅Ž כ ʹIt is known that   ൌ െ unified approach, we treat Box-Cox transformation parameter λ as an extra parameter in the penalty function, in addition to the two parameters μ and σ. Another way to penalize over-fitting is to treat outliers excluded as extra parameters. If the number of observations excluded is denoted by dp, then the number of parameters in the penalty function is 3+dp, namely λ, μ, σ, and the excluded dp observations. Therefore the adjusted information criterion of transformation data becomes

כʹ୷ ൅ כ ʹൌ െ  

ʹ൫ɐ෢ଶ൯െ‰‘Žכ ሺʹɎ‡ሻ ൅ ሺെ†’ሻ‰‘Žכ ൫ሺെ†’ሻǨ൯ ൅ ሺെ†’ሻ‰‘Žכʹൌെ

ሺ͵൅†’ሻכʹሺɉെͳሻ ෍Ž‘‰ሺ›୧ሻ ൅ כ ୧ୀଵ

ሺሻ‰‘Žכ୷ ൅ כ ʹൌ െ  

ʹ൫ɐ෢ଶ൯െ‰‘Žכ ሺʹɎ‡ሻ ൅ ሺെ†’ሻ‰‘Žכ ൫ሺെ†’ሻǨ൯ ൅ ሺെ†’ሻ‰‘Žכʹൌെ

ሺെ†’ሻ‰‘Žכሺɉെͳሻ ෍Ž‘‰ሺ›୧ሻ ൅ሺ͵൅†’ሻ כ ୧ୀଵ

Based on the above adjusted information criteria, we can decide how many observations to exclude and whether the Box-Cox transformation is worth implementing, according to the value of the information criterion: the smaller, the better.
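A direct implementation of the adjusted BIC above might look as follows; the function signature is illustrative, and y_kept is assumed to be the already-anchored retained subsample of size n − dp.

```python
import math

def adjusted_bic(y_kept, lam, dp):
    """Adjusted BIC of a Box-Cox-transformed subsample: lambda, mu and
    sigma plus the dp excluded observations give 3 + dp penalized
    parameters, and the penalty uses the retained sample size."""
    n = len(y_kept)                      # this is n - dp in the text
    if abs(lam) < 1e-12:
        x = [math.log(v) for v in y_kept]
    else:
        x = [(v ** lam - 1) / lam for v in y_kept]
    mu = sum(x) / n
    sigma2 = sum((v - mu) ** 2 for v in x) / n
    loglik = (math.lgamma(n + 1)                     # log((n - dp)!)
              - 0.5 * n * math.log(2 * math.pi * math.e)
              - 0.5 * n * math.log(sigma2)
              + (lam - 1) * sum(math.log(v) for v in y_kept))
    return -2 * loglik + (3 + dp) * math.log(n)
```

Dropping a gross outlier typically lowers this criterion despite the extra penalized parameter, because the fitted variance of the retained sample shrinks dramatically.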


2.3. Implementation of the Unified Approach

For a given data set, our proposed unified approach for dealing with a skewed sample or outliers works as follows:

a) Sort the observations from smallest to largest: y_(1), y_(2), …, y_(n).

b) Anchoring-to-1: if the minimum of the sample is larger than 1, e.g., for an N(10,1) sample, then use the transformation y* = y + 1 − y_min, so that the minimum of the raw data is shifted to 1. If the minimum of the sample is smaller than 1, e.g., for LogN(0,1), it is unnecessary to anchor to 1.

c) The original data anchored to 1 now become 1, y_(2) + 1 − y_min, y_(3) + 1 − y_min, …, y_(n) + 1 − y_min.

d) Sequentially drop observations at the high end and low end; the numbers of observations to be excluded are denoted by hdp and ldp respectively, each ranging from zero to one fifth of the total sample size. The percentage can be set to any number below 50%, because if more than 50% of the sample were true outliers, then the rest of the sample could instead be treated as the outliers. The total number of observations excluded is dp, which is equal to ldp plus hdp.

e) Re-anchor the remaining sample to 1 when an observation at the lower end is excluded.

f) Apply the Box-Cox transformation to the remaining n − dp data.

g) Each time one observation is excluded at the high end or low end, find the Box-Cox transformation parameter λ by maximizing the likelihood of the transformed data.

h) Compute the information criterion BIC (or AIC) for the above cases where different numbers of observations are dropped.

i) Find the minimum BIC value and the corresponding number of observations dropped and λ for this sample.

j) Conduct a normality test, such as the Anderson-Darling statistic, to test the normality of the transformed data.

k) Back-transform to the original scale if needed.

The above procedure is summarized in Figure 2-13, Figure 2-14, and Figure 2-15. In Figure 2-13, given a sample that needs transformation, if the normality test is passed, then no transformation is needed. If the sample does not pass the normality test or contains outliers, extra data processing is needed, such as outlier detection or data transformation. Figure 2-14 shows the procedure of the unified approach: if the sample minimum is larger than 1, then the minimum is anchored to 1. Then the decision regarding outlier dropping and Box-Cox transformation needs to be made based on the penalized assessment, which is the adjusted BIC. Given a sample, four choices are available to make the sample behave normally: BC=0 DP=0 (no Box-Cox and no outlier drop), BC=0 DP=1 (no Box-Cox and drop outliers), BC=1 DP=0 (Box-Cox and no outliers need to be dropped), and BC=1 DP=1 (Box-Cox and outlier dropping are needed at the same time).

For each sample, we need to make three decisions: How many outliers are there? Is a Box-Cox transformation needed to make the sample normal? Are outlier exclusion and Box-Cox transformation both needed to achieve normality? To answer those questions, BIC is utilized in the unified approach to make the decisions. Observations at the low end and high end of the sorted sample may be outliers, and the combinations of observations at the low end and high end provide many candidate models to choose from, as shown in Figure 2-12. This is similar to Kitagawa (1979). In addition, the Box-Cox transformation may help to achieve normality after some suspicious observations are excluded. We need to decide whether it is worth Box-Cox transforming the sample, which is measured as an extra penalty in BIC. Therefore, we have quite a few candidate models, consisting of different numbers of observations to be excluded and whether Box-Cox needs to be involved. The point is that normality can be achieved by dropping some observations and Box-Cox transforming the sample, which might be over-fitting. Another extra penalty comes from anchoring-to-1: if the minimum of the original sample is larger than 1, then it will be anchored to 1 before any Box-Cox transformation, which yields an extra penalty in BIC. For some samples, we need to re-anchor the remaining sample to 1 after an observation at the low end is dropped, if the minimum of the retained sample is still larger than 1, and then redo the Box-Cox transformation on the remaining anchored-to-1 sample. The exclusion of observations at the low end may happen several times; every time one observation is dropped, it is treated as an extra parameter in the BIC penalty part, and each time the sample is anchored to 1, an extra parameter is added as a penalty in BIC. Excluding observations, anchoring the sample to 1, and Box-Cox transforming the sample are all treated as model fitting. BIC is a good measure of the trade-off between model fit (normality) and model complexity (outlier exclusion, Box-Cox transformation, and anchoring-to-1) and hence is used in the unified approach to choose the appropriate solution. The reason why BIC is chosen rather than AIC is that in our simulation study BIC is found to outperform AIC and other information criteria such as AICc. The penalty part of AIC is 2k, which does not depend on the sample size, while BIC has a penalty of k·ln(N), so the penalty for over-analyzing the sample (a bigger k) is magnified by the corresponding sample size. When some observations are excluded, the sample size becomes n − dp, which cannot be reflected in AIC.


[Figure 2-12 diagram: nested sets of outlier candidates, from x(1) up to x(1), x(2), x(3), …, x(10) at the low end, and from x(n) up to x(n−10), …, x(n−1), x(n) at the high end.]

Figure 2-12 Outlier Candidates


2.4. Advantages of new method

First, the new method can handle small to medium-sized data sets; our simulation study shows that the method is applicable to sample sizes ranging from 20 to 200 with acceptable results. Previous research such as Kitagawa (1979) and Kadota (2003a, 2003b) could only handle small samples, usually fewer than 50 observations.

Second, the new method is able to detect proportions of outliers ranging from 0% to 20%. In previous research the proportion of outliers considered is quite low; only two or three suspicious observations at the low end and high end are considered.

Third, the new method balances the tradeoff between excluding too many extreme observations and achieving a better model fit. Excluding all the suspicious observations will help reduce the skewness/kurtosis and hence improve the normality of the data; however, it bears the risk of losing useful information. Our new method employs BIC to make this tradeoff, using an appropriate penalty to avoid over-fitting and loss of information.

Fourth, after applying the new method to data sets with outliers, the transformed data pass the normality test at a high rate, which means that our transformation is effective.

Fifth, our new method is effective in correctly identifying outliers, as shown in both the clean and contaminated data. This feature is attractive because it can avoid the masking and swamping effects in outlier detection.

Lastly, outlier detection usually requires setting a significance level for the testing procedure, such as 1%, 5%, or 10%. The new method does not need a significance level, which avoids the subjectivity issue in statistical tests.


Figure 2-13 Flowchart of overall data processing


Figure 2-14 Flow Chart of the Unified Approach


Figure 2-15 Flowchart of outlier drop


3. No Outliers: The Uncontaminated Sample Case

3.1. Overview

To investigate the impact of outliers on the Box-Cox transformation and evaluate the performance of the unified approach, a simulation study is conducted on clean data. This means that samples are generated from specified distributions without outliers added. The effectiveness of the unified approach lies in its ability to provide the most accurate or appropriate Box-Cox transformation parameter λ; however, for most cases it is not easy to know the true value of λ. Therefore, distributions that can be transformed to the normal distribution with known λ are used in the simulation. The first one is the normal distribution itself, where λ = 1. This means that if one transforms a normally distributed sample, then a reasonable method should yield a λ (approximately) equal to 1. Another distribution used is the Log-Normal distribution. If X is a random variable with a normal distribution, then Y = exp(X) has a log-normal distribution; likewise, if Y is log-normally distributed, then X = log(Y) has a normal distribution, which indicates that λ is 0 according to the definition of the Box-Cox transformation.

Through the simulation on clean data the following three questions would like to be answered:

a) The ability to detect outliers: is the unified approach able to tell that the data is clean?

b) What is the distribution of λ found through the new approach? Since the true value is known, is the λ provided by the unified approach as expected?

c) If the unified approach detects outliers in the clean data and excludes them, what does the distribution of the transformed sample look like?


To answer the above questions, the simulation strategy is set up in the following way:

a) Simulate samples from N(10,1) and Log-Normal distribution

b) The sample size ranges from 20 to 200 by 10.

c) For each sample size, generate 1000 repetitions.

d) In each repetition, apply the new approach to identify outliers and find the appropriate λ

e) Test the normality of transformed sample

f) Proceed to the next repetition and repeat steps d) and e).
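Steps a) through f) for a single cell of the simulation grid (one distribution and one sample size) can be sketched as follows. This is an illustrative Python sketch, not the dissertation's code: `simulate_cell` is a hypothetical helper, scipy's maximum-likelihood Box-Cox stands in for the λ search, and the repetition count is reduced for speed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_cell(sample_size, reps=1000):
    """For each repetition, draw a clean N(10,1) sample, fit the Box-Cox
    lambda by maximum likelihood, and record the Anderson-Darling
    statistic of the transformed sample."""
    lams, ads = [], []
    for _ in range(reps):
        y = rng.normal(10.0, 1.0, size=sample_size)
        z, lam = stats.boxcox(y)                      # lambda estimated by ML
        lams.append(lam)
        ads.append(stats.anderson(z, dist="norm").statistic)
    return np.array(lams), np.array(ads)

lams, ads = simulate_cell(20, reps=200)   # one small cell, for illustration
```

Repeating this over sample sizes 20 to 200 reproduces the grid of the simulation strategy above.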

For each sample, the unified approach will provide the number of outliers detected, λ that can transform the sample to (near) normality, as well as normality test statistics of the transformed sample such as Anderson-Darling statistic, skewness, and kurtosis. For the purpose of comparison, we would like to compare the results from the following cases:

a) Null case: the original sample; no anchoring, no outlier detection/exclusion, and no Box-Cox transformation.

b) Baseline case: Box-Cox transform the whole sample; no anchoring, no outlier detection/exclusion.

c) Anchoring-only case: anchor the original sample to 1 and Box-Cox transform the anchored sample; no outlier detection/exclusion.

d) Unified Approach: Apply the unified approach to the original sample.

Given each original sample, the unified approach will yield an optimal solution based on the penalized information criterion, which includes the following four situations:

a) No outliers, no Box-Cox transformation is needed

b) No outliers, Box-Cox transformation is needed

c) Outliers are excluded and no Box-Cox transformation is needed

d) Outliers are excluded and Box-Cox transformation is also needed


3.2. Anchoring-to-1

The first step is to look at the null case: given a random sample from N(10,1) without outliers, apply the Box-Cox transformation; λ is expected to be one. Figure 3-1 shows the Box-plot of λ across different sample sizes. It is shown that the variation of λ is quite large, especially when the sample size is relatively small. Some outlying values of λ reach positive 8 or negative 7 (for example, when the sample size is 20), although the median is around one and the true value λ=1 is covered between the first and third quartiles. If those extreme λ are used to transform the data, the resulting transformed sample may not follow the normal distribution as expected, causing trouble for subsequent analysis. Figure 3-2 shows the 95% confidence interval of λ based on 1000 repetitions. For most sample sizes one is covered by the 95% confidence interval, but for some cases, such as sample sizes 60, 70, and 80, one is not in the 95% confidence interval, indicating that for those normal samples Box-Cox generates a λ that is not equal to one.

Table 3-1 provides an example where Box-Cox does not work well. The original sample is randomly drawn from N(10,1) with sample size 20; its histogram is shown in Figure 3-3. The λ found through Box-Cox is 9.13, which is far from one. The histogram of the transformed sample is shown in Figure 3-4. The magnitude of the transformed sample is greatly increased, to the order of 10E+08, although normality is improved in terms of skewness, kurtosis, the Anderson-Darling test, and the Shapiro-Wilk test (Table 3-2). This huge increase in the magnitude of the sample may make the analysis more complicated than the improved fit to normality is worth.


Figure 3-1 Box plot shows the variation of λ using regular BC on original sample


[Figure: 95% CI of Lambda, Null Case, N(10,1), Clean data; y-axis Lambda from 0.75 to 1.05; x-axis Sample Size from 20 to 200]

Figure 3-2 95% Confidence interval of lambda in baseline case


Table 3-1 An example where BC gives extreme lambda

λ = 9.13

Original    Transformed       Original    Transformed
10.5281     236356218.86      8.9114      51585550.71
10.4717     225057637.76      10.5156     233817390.30
9.0009      56513694.63       10.8465     310252851.26
9.8508      128804488.19      9.7124      113194033.18
10.0498     154599860.43      10.1058     162650494.65
10.3015     193786523.46      10.3895     209430291.92
10.1181     164464813.53      10.8041     299366619.31
8.9294      52543192.84       10.3644     204850365.23
10.3289     198538190.09      10.3134     195840818.68
9.9043      135332303.71      10.23       181836784.30

Table 3-2 An example where BC gives extreme lambda, cont.

λ=9.13        Skewness   Kurtosis   Anderson-Darling   PValue-Anderson-Darling   Shapiro-Wilk   PValue-Shapiro-Wilk
Original      -1.0797    0.5257     1.0034             0.0095                    0.8705         0.0120
Transformed   -0.1337    -0.1458    0.3206             >0.2500                   0.9557         0.4613


Figure 3-3 Histogram of the original sample


Figure 3-4 Histogram of the transformed data with lambda 9.13


The Box-Cox transformation parameter λ is found by maximizing the likelihood of the transformed sample, which is supposed to follow a normal distribution. The profile log-likelihood function is concave and has a maximum point. However, sometimes the log-likelihood function can be quite flat, which makes it difficult to find a λ that achieves the maximum; therefore λ may not exist or may be quite unstable. Such an example is given in Table 3-3 and Figure 3-5. The original sample is slightly skewed, not too far from normality, yet the λ found through Box-Cox is -6.91, making the transformed sample points nearly identical, all equal to about 0.1447 (Table 3-4). This transformation is not applicable in practice since it collapses the sample almost to a single value (Figure 3-6).

Table 3-3 An example showing the breakdown of Box-Cox, which gives lambda=-6.91

λ = -6.91

Original    Transformed      Original    Transformed
10.8945     0.14471779       9.3805      0.144717773
10.1425     0.144717784      9.5144      0.144717775
12.23       0.144717796      9.6365      0.144717777
10.3208     0.144717786      9.4778      0.144717774
9.7381      0.144717779      10.9863     0.144717791
10.9127     0.144717791      10.2257     0.144717785
10.5673     0.144717788      10.2774     0.144717786
12.5883     0.144717797      9.6361      0.144717777
9.5804      0.144717776      10.5105     0.144717788
9.7774      0.144717779      9.5349      0.144717776

Table 3-4 An example showing the breakdown of Box-Cox, which gives lambda=-6.91, cont.

λ=-6.91       Skewness     Kurtosis   Anderson-Darling   PValue-Anderson-Darling   Shapiro-Wilk   PValue-Shapiro-Wilk
Original      1.37857132   1.632773   0.998438           0.0097                    0.84657        0.0047
Transformed   .            .          .                  .                         .              .


Figure 3-5 Histogram of original sample


Figure 3-6 Histogram of transformed sample using lambda=-6.91


All the above examples show that, even in clean samples, Box-Cox may not perform well in some cases: the λ provided by direct use of Box-Cox has quite a large variance, especially for small sample sizes. If outliers are present in the sample, applying Box-Cox directly without proper handling brings even more variability.

When the original sample is anchored to 1 before the Box-Cox transformation is applied, it can be seen from Figure 3-7 that, compared to the null case, the variation of λ decreases dramatically; the mean value of λ based on 1000 repetitions is quite close to one, although the 95% confidence interval of λ does not always cover one (Figure 3-8). The result confirms the effectiveness of the anchor-to-1 idea, which can dramatically decrease the variance of λ.

Note that whether or not to anchor to 1 depends on the smallest value of the original sample. If the minimum of the original sample is larger than 1, then it needs to be anchored to get a better estimate of λ; if it is between zero and one, it should not be anchored.

The following two examples show this scenario. Figure 3-9 and Figure 3-10 show the Box-plot of λ after the regular Box-Cox transformation of a random normal sample with minimum less than 1; from Figure 3-11 and Figure 3-12 it is noticed that the distribution of λ does not improve after anchoring the sample to 1, and the variance even becomes larger. This particular sample is drawn in two steps: first draw a sample from N(10,1), then shift it so its minimum is 0.5 through y − y_min + 0.5; the minimum of the new sample is then 0.5, which is between 0 and 1.

When the minimum of a sample is less than 1, it is not necessary to anchor to 1. The following example shows this. A random sample is drawn from exp(Z), where Z follows a N(0,1) distribution, so around half of the observations are between zero and one. The following four figures (Figure 3-13 to Figure 3-16) show the difference between anchoring to 1 and not anchoring. It is noticed that after anchoring to 1, the variance of lambda becomes larger. Therefore we propose that if the minimum of a sample is larger than 1, it needs to be anchored to 1; if the minimum is between 0 and 1, we do not need to anchor it to 1.
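This anchoring rule can be sketched as follows; `anchor_to_1` is a hypothetical helper name, and scipy's maximum-likelihood Box-Cox stands in for whichever λ estimator is used:

```python
import numpy as np
from scipy import stats

def anchor_to_1(y):
    """Shift the sample so its minimum becomes 1, but only when min(y) > 1;
    samples whose minimum lies in (0, 1) are left unchanged."""
    y = np.asarray(y, dtype=float)
    return y - y.min() + 1.0 if y.min() > 1.0 else y

rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.0, size=20)              # clean N(10,1) sample, min > 1

_, lam_raw = stats.boxcox(y)                    # regular Box-Cox on the raw sample
_, lam_anchored = stats.boxcox(anchor_to_1(y))  # Box-Cox after anchoring to 1
```

Over many repetitions, `lam_anchored` is expected to vary far less than `lam_raw`, in line with Figures 3-7 and 3-8.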


Figure 3-7 Distribution of lambda after anchoring-to-1 the original sample


[Figure: 95% CI of Lambda, Anchor-to-1, N(10,1), Clean data; y-axis Lambda from 0.87 to 1.01; x-axis Sample Size from 20 to 200]

Figure 3-8 The mean and 95% confidence interval of lambda after anchoring-to-1


Figure 3-9 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted


Figure 3-10 When the sample minimum is between 0 and 1, the regular Box-Cox is conducted, cont.


Figure 3-11 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda


Figure 3-12 Anchoring the previous sample (minimum is less than 1) will NOT change the lambda, cont.


Figure 3-13 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda


Figure 3-14 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.


Figure 3-15 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda


[Figure: 95% CI of Lambda, Anchor-to-1, LogN(0,1), Clean data; y-axis Lambda from -0.81 to -0.65; x-axis Sample Size from 20 to 200]

Figure 3-16 LogN(0,1) sample, anchoring-to-1 does NOT improve lambda, cont.


3.3. Unified Approach-Normal Distribution

The unified approach is applied to clean N(10,1) samples. Outlier detection, anchoring-to-1, Box-Cox transformation, and the penalized information criterion (BIC) are considered simultaneously here. In terms of anchoring-to-1, since the original sample is from N(10,1) and thus has a minimum value larger than 1, the original sample is anchored to 1 before the Box-Cox transformation. In the unified approach, outlier exclusion is implemented along with the Box-Cox transformation: extreme observations at the lower and upper ends of the sample may be dropped to achieve a better model fit, and the exclusion of observations is penalized in the BIC. Every time an observation at the lower end is dropped, the remaining sample has a minimum value greater than 1; hence the sample needs to be re-anchored to 1 before Box-Cox transformation. The anchoring-to-1, exclusion of observations, and Box-Cox transformation are all penalized in the information criterion to obtain a trade-off between better model fitting (normality) and complexity of the model (over-analyzing).
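The selection logic can be sketched as a penalized criterion evaluated for each candidate processing of the sample. The exact penalty bookkeeping below (one BIC-style term per action) is an assumption for illustration, not the dissertation's exact criterion:

```python
import numpy as np
from scipy import stats

def penalized_bic(y, drop_low=0, drop_high=0, do_boxcox=True):
    """Normal log-likelihood of the processed sample plus a BIC-style
    penalty for every action taken: anchoring, each dropped observation,
    and the Box-Cox transformation.  Illustrative bookkeeping only."""
    z = np.sort(np.asarray(y, dtype=float))
    if drop_high > 0:
        z = z[:-drop_high]                # drop suspected upper-end outliers
    if drop_low > 0:
        z = z[drop_low:]                  # drop suspected lower-end outliers
    anchored = bool(z.min() > 1.0)
    if anchored:
        z = z - z.min() + 1.0             # re-anchor to 1 before transforming
    if do_boxcox:
        z, _ = stats.boxcox(z)            # requires positive data
    n = z.size
    loglik = stats.norm.logpdf(z, z.mean(), z.std(ddof=0)).sum()
    k = 2 + int(anchored) + int(do_boxcox) + drop_low + drop_high
    return -2.0 * loglik + k * np.log(n)

rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.0, size=60)

# Choose the cheaper of transform vs. no transform (dropping omitted for brevity)
best_bic, best_bc = min((penalized_bic(y, do_boxcox=b), b) for b in (False, True))
```

Scanning `drop_low`, `drop_high`, and `do_boxcox` jointly and taking the minimum mimics the trade-off between fit and over-analysis described above.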

Figure 3-17 shows the outliers detected by the unified approach in different sample sizes.

For each sample size, 1000 repetitions are generated and the unified approach is applied to each: it reports how many outliers are in the sample, whether those outliers should be excluded, and, if they are excluded, whether it is necessary to Box-Cox transform the remaining sample to achieve normality. For example, when the sample size is 20, among the 1000 repetitions, 774 repetitions have no outliers, 172 have one outlier, 34 have two outliers, 10 have three outliers, and 20 have six outliers. As the sample size becomes larger, the percentage of repetitions detected as “no outliers” increases, from 92.1% at sample size 60 to 95.4% at sample size 200. This shows that the unified approach does well in identifying outliers; in the clean-data situation, it is able to tell that the samples being analyzed are clean in most cases.

In terms of the locations of the detected outliers, it is noticed from Figure 3-18 that the detected outliers are located at both the lower and upper ends of the sample, including for small sample sizes.

As for the performance of the Box-Cox transformation in the clean samples, λ is supposed to be 1, since the sample is normally distributed and hence no transformation is needed. Figure 3-19 shows the distribution of λ when the unified approach detects zero outliers in the sample, and Figure 3-20 shows the distribution of λ when at least one outlier is detected. It is noticed that λ in Figure 3-19 has smaller variation than in Figure 3-20, although Figure 3-20 has fewer outlying values of λ.

In terms of model fitting (normality tests), there is not much difference between the cases where no outliers are detected and where one or more outliers are detected. Figure 3-21 and Figure 3-22 provide the P-values of the Anderson-Darling statistic.


Figure 3-17 Outliers Detected in Clean N(10,1) Samples, 1000 Repetitions


Figure 3-18 Location of detected outliers, N(10,1), Clean Sample


Figure 3-19 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, no outliers excluded, no penalty for Box-Cox


Figure 3-20 Box-plot of lambda: Unified Approach, N(10,1), Clean Data, outliers excluded(>=1)


Figure 3-21 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, no outliers detected, no penalty for Box-Cox


Figure 3-22 Box-plot of Anderson-Darling statistic: Unified Approach, N(10,1), Clean Data, outliers excluded(>=1)


In the above analysis, no extra penalty is considered for doing the Box-Cox transformation. This means that after anchoring the sample to 1 there are two choices: the first is to Box-Cox transform the sample to achieve a better model fit but pay an extra penalty in the BIC; the second is not to do Box-Cox. It must be considered whether Box-Cox is really needed, that is, whether it is worth transforming the sample to achieve normality at the cost of the penalty. The BICs of the two situations are compared and the decision is made based on the smaller BIC. When this scenario is taken into account, the unified approach generates quite different results compared to the previous analysis. When considering the penalty of doing Box-Cox, there are four ways of handling each sample:

a) 0-0 case (leave it alone): no outlier detection/exclusion, no Box-Cox transformation.

b) 0-1 case: no outlier exclusion, do Box-Cox transformation.

c) 1-0 case: exclude outliers, no Box-Cox transformation.

d) 1-1 case: exclude outliers, do Box-Cox transformation.


Figure 3-23 shows the percentage of each case over the 1000 repetitions. The majority of repetitions fall into the first case, which means that, given a clean N(10,1) sample, nothing needs to be done to make it behave normally. When the sample size is larger than 80, this percentage is always above 90%. This agrees with our expectation since the sample is normally distributed. In the 0-0 case, no outliers are excluded and no Box-Cox transformation is implemented, so λ is taken to be 1. In the cases where outliers are excluded but no Box-Cox is implemented, the outliers are detected at the lower end or at both the lower and upper ends of the sample (Figure 3-24).


Figure 3-23 Percent of four cases in unified approach with penalty to BC

Figure 3-24 Location of outliers, NO Box-Cox involved


Before Box-Cox is implemented, there might be outliers that need to be excluded. When outliers are detected, they are located either at the lower end or at both the lower and upper ends of the sample (Figure 3-25). The distribution of λ in this case is shown in Figure 3-26; when no outliers are found, the distribution of λ is shown in Figure 3-27. Since the number of repetitions in those cases is quite small, the Box-plots look odd and some of them contain only one diamond point.


Figure 3-25 Location of outliers, BC=1 and DP=1


Figure 3-26 Box-plot of Lambda: Unified Approach, Clean Data, BC=1, outliers Excluded


Figure 3-27 Box-plot of Lambda: Unified Approach, N(10,1), Clean Data, BC=1, no outliers excluded


3.4. Assessment-Normal Distribution

To evaluate the effectiveness of the unified approach, a simple estimation of the resulting mean is conducted, and the Mean Square Error (MSE) of the transformed sample is examined. Based on the discussion in the previous section, the four cases have to be examined separately.

Null case: no outlier detection/exclusion, no Box-Cox transformation

The original sample is $Y_1, Y_2, \ldots, Y_n \sim N(10,1)$; then

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 10\right)^2, \qquad \bar{Y}_k = \frac{1}{SS}\sum_{i=1}^{SS} Y_{ki},$$

where $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the ith observation in the kth repetition, and $Rep$ is the number of repetitions for each sample size, here 1000. Since the original sample is drawn from N(10,1), the Mean Square Error is based on the squared difference between the observed mean and the expected value 10. Since no Box-Cox transformation is involved, the MSE is expected to be one (after multiplying by the sample size, as described for the lognormal case in Section 3.6).

In the Unified Approach:

a) If BC=0 (DP=0 or DP>0):

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 10\right)^2, \qquad \bar{Y}_k = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} Y_{ki},$$

where $\bar{Y}_k$ is the mean of the retained sample and DP is the number of outliers excluded. As no Box-Cox is involved, the Mean Square Error is based on the squared difference between the mean of the retained sample and the expected mean, which is 10. If outliers are excluded, the dropped observations are taken into account by subtracting DP from the sample size; if no outliers are dropped, then DP=0. In this case the MSE calculation is the same as in the null case, except that the sample size changes when outliers are dropped.

b) If BC=1 (DP=0 or DP>0):

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(f_k^{-1}(\bar{Y}_k^{*}) - 10\right)^2, \qquad \bar{Y}_k^{*} = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} f_k(Y_{ki}),$$

where $\bar{Y}_k^{*}$ is the mean of the transformed sample in the kth repetition and $f_k$ is the Box-Cox transformation found in that repetition. Since Box-Cox is involved in the process, the mean of the transformed sample has to be back-transformed to the original scale using $f_k^{-1}$. If the sample was anchored to 1 before the Box-Cox transformation, the back-transformed mean also needs to be un-anchored to the original scale through $\bar{Y}_k = f_k^{-1}(\bar{Y}_k^{*}) + Y_{k,\min} - 1$, where $Y_{k,\min}$ is the sample minimum in the kth repetition. If outliers are excluded, the dropped observations are taken into account by subtracting DP from the sample size; if no outliers are dropped, then DP=0.
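Under the assumption that the raw (unscaled) MSE is intended, the null-case and BC=1 calculations can be sketched with scipy's `boxcox`/`inv_boxcox` pair; variable names are illustrative:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(42)
rep, ss = 1000, 50

# Null case: MSE of the plain sample mean around the true mean 10.
means = rng.normal(10.0, 1.0, size=(rep, ss)).mean(axis=1)
mse_null = np.mean((means - 10.0) ** 2)     # raw (unscaled) MSE, roughly 1/SS

# BC=1, DP=0 sketch: anchor each sample to 1, Box-Cox it, back-transform the
# mean of the transformed sample with inv_boxcox, then un-anchor it.
errs = []
for _ in range(rep):
    y = rng.normal(10.0, 1.0, size=ss)
    y_min = y.min()
    z, lam = stats.boxcox(y - y_min + 1.0)  # anchor-to-1, then transform
    back = inv_boxcox(z.mean(), lam) + y_min - 1.0
    errs.append((back - 10.0) ** 2)
mse_bc = np.mean(errs)
```

Multiplying these raw MSEs by the sample size puts them on the order-one scale reported in Table 3-5.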

The comparison between the unified approach and the null case is shown in Table 3-5 and Figures 3-28 to 3-31.




In the BC=0 and DP=0 case, where no outliers are dropped and no Box-Cox transformation is involved, the MSE is the same for the unified approach and the null case. Among the 1000 repetitions, the majority of cases fall into this category.


Table 3-5 Comparison of MSE

SAMPLE | BC=0 DP=0           | BC=0 DP=1           | BC=1 DP=0             | BC=1 DP=1           | All reps
SIZE   | U     N     Freq    | U     N     Freq    | U     N       Freq    | U     N     Freq    | Null
20     | 1.094 1.094 743     | 1.376 1.003 215     | 1.380 295.883 31      | 2.377 1.648 11      | 1.081
30     | 1.033 1.033 818     | 1.246 0.879 140     | 1.095 174.384 32      | 0.605 0.611 10      | 1.003
40     | 0.972 0.972 851     | 1.257 1.044 116     | 1.312 145.789 22      | 2.516 0.878 11      | 0.970
50     | 0.998 0.998 876     | 1.096 0.930 96      | 1.019 105.216 20      | 2.160 1.439 8       | 0.988
60     | 1.050 1.050 891     | 1.329 1.154 76      | 1.534 101.004 30      | 1.103 2.546 3       | 1.077
70     | 1.125 1.125 893     | 1.128 0.952 72      | 1.074 81.245  30      | 1.068 1.270 5       | 1.128
80     | 0.930 0.930 905     | 1.093 0.981 63      | 1.073 68.846  29      | 0.498 0.085 3       | 0.946
90     | 1.035 1.035 919     | 0.787 0.764 64      | 0.601 50.342  15      | 3.848 0.833 2       | 1.016
100    | 1.047 1.047 903     | 1.566 1.421 68      | 2.225 106.488 27      | 3.388 1.849 2       | 1.082
110    | 0.963 0.963 924     | 1.066 0.921 42      | 0.982 44.768  30      | 4.193 1.826 4       | 0.957
120    | 0.970 0.970 926     | 0.754 1.099 50      | 0.828 37.680  22      | 2.934 0.008 2       | 0.971
130    | 0.971 0.971 931     | 1.663 1.242 52      | 2.066 86.460  12      | 2.288 0.637 5       | 0.979
140    | 0.990 0.990 943     | 1.229 0.688 35      | 0.845 43.022  17      | 1.621 1.163 5       | 0.980
150    | 0.925 0.925 933     | 1.145 0.881 49      | 1.009 56.081  16      | 3.081 1.201 2       | 0.925
160    | 1.023 1.023 935     | 1.477 0.834 51      | 1.232 75.317  12      | 2.126 0.580 2       | 1.013
170    | 1.049 1.049 938     | 1.136 1.061 49      | 1.205 55.644  13      | .     .     0       | 1.049
180    | 1.047 1.047 939     | 1.010 0.908 46      | 0.917 46.451  14      | 2.066 1.110 1       | 1.041
190    | 1.053 1.053 939     | 1.117 0.968 49      | 1.082 54.743  12      | .     .     0       | 1.052
200    | 1.077 1.077 943     | 1.148 0.721 45      | 0.828 51.660  11      | 1.080 0.001 1       | 1.055
U: Unified Approach; N: Null Case


Figure 3-28 Comparison of MSE, BC=0 DP=0, N(10,1) Clean Data

Figure 3-29 Comparison of MSE, BC=0 DP=1, N(10,1) Clean Data


Figure 3-30 Comparison of MSE, BC=1 DP=0, N(10,1) Clean Data

Figure 3-31 Comparison of MSE, BC=1 DP=1, N(10,1) Clean Data


3.5. Unified Approach-Lognormal Distribution

The sample is randomly drawn from the LogN(0,1) distribution, i.e., exp(Z) where Z follows N(0,1), and the unified approach is applied to it. Outlier detection, anchoring-to-1, Box-Cox transformation, and the penalized information criterion (BIC) are considered simultaneously here. In terms of anchoring-to-1, since the original sample is from LogN(0,1) and has a minimum value smaller than 1, the original sample does not need to be anchored to 1 before the Box-Cox transformation. Even if observations at the lower end are excluded during the process, the samples still do not need to be anchored: because the samples are drawn from exp(N(0,1)), around 50% of the observations are smaller than one, and at most 20% of observations are suspected to be outliers, so almost all the samples in the unified approach keep a minimum value smaller than 1. In the unified approach, outlier exclusion is implemented along with the Box-Cox transformation: extreme observations at the lower and upper ends of the sample may be dropped to achieve a better model fit. The exclusion of observations and the Box-Cox transformation are penalized in the information criterion to obtain a trade-off between better model fitting (normality) and complexity of the model (over-analyzing).

Figure 3-32 shows the outliers detected by the unified approach at different sample sizes.

For each sample size, 1000 repetitions are generated and the unified approach is applied to each: it reports how many outliers are in the sample, whether those outliers should be excluded, and, if they are excluded, whether it is necessary to Box-Cox transform the remaining sample to achieve normality. For example, when the sample size is 20, among the 1000 repetitions, 315 repetitions have no outliers, 325 have one outlier, 162 have two outliers, 125 have three outliers, and 0 have six outliers. As the sample size becomes larger, the percentage of repetitions detected as “no outliers” increases, from 51.8% at sample size 70 to 65% at sample size 200. This shows that the unified approach does well in identifying outliers; in the clean-data situation, it is able to tell that the samples being analyzed are clean in most cases.

In terms of the locations of the detected outliers, it is noticed from Table 3-6 that almost all of the detected outliers are located at the upper end of the sample, which differs from the normal samples. This is due to the skewness of the lognormal distribution, which has a long right tail.

As for the performance of the Box-Cox transformation in the clean samples, λ is supposed to be 0 according to the definition of the Box-Cox transformation, since the sample is lognormally distributed. Figure 3-34 shows the distribution of λ when the unified approach detects zero outliers in the sample, and Figure 3-35 shows the distribution of λ when at least one outlier is detected. It is noticed that λ in Figure 3-34 has much smaller variation than in Figure 3-35.


Figure 3-32 Outliers Detected in Clean LogN(0,1) Samples, 1000 Repetitions


Table 3-6 Location of outliers excluded, LogN(0,1), Clean Data

SAMPLE_SIZE   Low   High   Both
20            0     68.2   0.3
30            0.1   59.2   0.2
40            0     55.6   0.0
50            0     51.6   0.0
60            0     51.6   0.0
70            0     48.2   0.0
80            0     50.2   0.0
90            0     43.6   0.0
100           0     44.8   0.0
110           0     42.9   0.0
120           0     40.8   0.0
130           0     39.4   0.0
140           0     41     0.0
150           0     39.3   0.1
160           0     37.6   0.0
170           0     38     0.0
180           0     36.7   0.0
190           0     37.3   0.1
200           0     35     0.0


Figure 3-33 Location of outliers detected, LogN(0,1), Clean Data


Figure 3-34 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox


Figure 3-35 Box-plot of Lambda: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox


In terms of model fitting (normality tests), there is not much difference between the cases where no outliers are detected and where one or more outliers are detected. Figure 3-36 and Figure 3-37 provide the P-values of the Anderson-Darling statistic.


Figure 3-36 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, no outliers excluded, no penalty for Box-Cox


Figure 3-37 Box-plot of Anderson-Darling Statistic: Unified Approach, LogN(0,1), Clean Data, outliers excluded (>=1), no penalty for Box-Cox


In the above analysis, no extra penalty was given for doing the Box-Cox transformation. This means that, given a sample, there are two choices: the first is to Box-Cox transform the sample to achieve a better model fit but pay an extra penalty in the BIC; the second is not to do Box-Cox. It must be considered whether Box-Cox is really needed, that is, whether it is worth transforming the sample to achieve normality at the cost of the penalty. The BICs of the two situations are compared and the decision is made based on the smaller BIC. When this scenario is taken into account, the unified approach generates different results compared to the previous analysis. When considering the penalty of doing Box-Cox, there are four ways of handling each sample:

a) 0-0 case (leave it alone): no outlier exclusion, no Box-Cox transformation.

b) 0-1 case: no outlier exclusion, do Box-Cox transformation.

c) 1-0 case: exclude outliers, no Box-Cox transformation.

d) 1-1 case: exclude outliers, do Box-Cox transformation.


Figure 3-38 shows the percentage of each case over the 1000 repetitions. Notice that the majority of repetitions fall into the second and fourth cases, which means that, given a clean LogN(0,1) sample, the sample needs to be Box-Cox transformed, with appropriate exclusion of extreme observations at the high end, to make it behave normally. This agrees with the expectation since the sample is log-normally distributed. In the 0-1 case, no outliers are excluded and the Box-Cox transformation is implemented (Figure 3-39). In the 1-1 case, where Box-Cox is implemented and outliers are excluded, the distribution of λ is shown in Figure 3-40.


Figure 3-38 Percent of four cases in unified approach with penalty to BC, Clean LogN(0,1)


Figure 3-39 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, no outliers excluded


Figure 3-40 Box-plot of Lambda: Unified Approach, LogN(0,1) clean data, outliers excluded(>=1)


3.6. Assessment-Lognormal Distribution

To evaluate the effectiveness of the unified approach, the Mean Square Error (MSE) of the transformed sample is examined. Based on the discussion in the previous section, the four cases have to be examined separately.

Null case: no outlier detection/exclusion, no Box-Cox transformation

The original sample is $Y_1, Y_2, \ldots, Y_n$ with $Y_i = \exp(Z_i)$, $Z_i \sim N(0,1)$; then

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 0\right)^2, \qquad \bar{Y}_k = \frac{1}{SS}\sum_{i=1}^{SS} \mathrm{Log}(Y_{ki}),$$

where $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the ith observation in the kth repetition, and $Rep$ is the number of repetitions for each sample size, here 1000. Since the original sample is drawn from LogNormal(0,1), the Mean Square Error is based on the squared difference between the mean of the logarithm of the original values and the expected value 0, as we have perfect knowledge of the samples. As $\bar{Y}_k$ is an average of the transformed observations, the computed MSE needs to be multiplied by the corresponding sample size for the purpose of comparison.
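The null-case computation above, including the sample-size scaling, can be sketched as follows (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
rep, ss = 1000, 50

# Null case for LogN(0,1): per-repetition mean of Log(Y), squared deviation from 0.
ybar = np.log(rng.lognormal(0.0, 1.0, size=(rep, ss))).mean(axis=1)
mse = np.mean(ybar ** 2)        # raw MSE, roughly 1/SS
scaled = ss * mse               # after the sample-size scaling, roughly 1
```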

In the Unified Approach:

a) If BC=0 (DP=0 or DP>0)

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(\bar{Y}_k - 0\right)^2, \qquad \bar{Y}_k = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} \mathrm{Log}(Y_{ki}),$$

where $SS$ is the sample size (20 to 200 by 10), $Y_{ki}$ is the ith observation in the kth repetition, $Rep$ is the number of repetitions (1000 here), and DP is the number of observations excluded from the sample. Since no Box-Cox transformation is involved here, the only difference from the null case is the sample size. The original sample is drawn from LogNormal(0,1), so the Mean Square Error is based on the squared difference between the mean of the logarithm of the remaining original values and the expected value 0. When Box-Cox is not implemented, the MSE is computed the same way as in the null case except for the sample size.

b) If BC=1 (DP=0 or DP>0)

$$MSE = \frac{1}{Rep}\sum_{k=1}^{Rep}\left(f_k^{-1}(\bar{Y}_k^{*}) - 0\right)^2, \qquad \bar{Y}_k^{*} = \frac{1}{SS-DP}\sum_{i=1}^{SS-DP} f_k(Y_{ki}),$$

where $\bar{Y}_k^{*}$ is the mean of the retained transformed sample in the kth repetition after DP observations are dropped and $f_k$ is the Box-Cox transformation found in that repetition. Since Box-Cox is involved in the process, the mean of the transformed sample needs to be back-transformed to the original scale using the transformation found in that repetition. As anchor-to-1 is not conducted here, there is no need to un-anchor the back-transformed sample. If outliers are excluded, the dropped observations are taken into account by subtracting DP from the sample size; if no outliers are dropped, then DP=0.

The comparison between the unified approach and the null case is shown in Figure 3-41 and Figure 3-42. It is seen that the MSE of the unified approach is quite close to one, as expected. In the case where both outlier exclusion and Box-Cox transformation are applied, the MSE is even closer to 1 than in the null case. This is because the Log-Normal sample is drawn from exp(Z), where Z follows N(0,1); when it is back-transformed (the logarithm taken), the variance of the back-transformed sample is expected to be 1.

Figure 3-41 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=0


Figure 3-42 Comparison of MSE in LogN(0,1), Clean data, BC=1 DP=1

4. Outliers: The Contaminated Sample Case

4.1. Samples with Outliers

To study the effectiveness of Box-Cox transformation in the presence of outliers and verify the usefulness of the unified approach, it is helpful to perform simulations in samples with outliers. To begin with, random samples from known distributions will be generated. Initially, random samples from a Normal distribution with mean 10 and variance 1, with outliers, will be generated. The sample sizes examined will range from 20 to 200 (by 10) and the percentage of outliers in the samples will be 5%, 10%, 15%, and 20%. Potential outliers will be obtained by randomly selecting observations from the sample and then multiplying them by a factor of 10.

Here the probability of each observation being selected to be an outlier ranges from 5% to 20%. This means that for a specific sample of size 100, say, the expected percentage of generated outliers is 5%; the true number of observations multiplied by 10 is not necessarily equal to 5. Over a large number of repetitions, however, the mean number of outliers generated will be 5 when the sample size is 100 and the outlier percentage is 5%. Random samples from other distributions (e.g. Lognormal) will be generated in a similar manner. Here only observations in the high end, i.e. the extremely large observations, are considered to be suspicious outliers.
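The contamination mechanism described above can be sketched as follows (an illustrative implementation; the function name and seed handling are mine):

```python
import random

def contaminate(clean, pct, factor=10, seed=0):
    """Flag each observation independently with probability pct and
    multiply it by `factor` (10 in the text).  The realized number of
    outliers therefore varies around len(clean) * pct across samples."""
    rng = random.Random(seed)
    contaminated, n_outliers = [], 0
    for y in clean:
        if rng.random() < pct:
            contaminated.append(y * factor)
            n_outliers += 1
        else:
            contaminated.append(y)
    return contaminated, n_outliers
```

Because each observation is flagged independently, the realized outlier count is binomial: a "5% outlier" sample of size 100 contains 5 outliers only on average, as the text notes.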

Here is an example of a sample with outliers added. The clean data are generated from Normal(10, 1), and 5% to 30% of the original observations are randomly chosen and multiplied by 10 to produce outliers. The difference between the clean and contaminated samples can be seen in the figures and tables: samples with outliers are right-skewed, and the normality test statistics, histograms, and normal density plots all indicate that the contaminated sample deviates substantially from the normal distribution. For example, when the sample size is 100, even with only 5% outliers the Anderson-Darling statistic drastically changes from 0.36 to 31, the kurtosis changes from -0.04 to 46.75, and the skewness changes from -0.11 to 6.89. As the percentage of outliers increases from 10% to 30%, the Anderson-Darling statistic decreases from 31.73 to 20.03, the kurtosis drops from 29.61 to -0.51, and the skewness falls from 5.56 to 1.18. The Shapiro-Wilk, Kolmogorov-Smirnov, and Cramer-von Mises normality test statistics show a similar pattern as the percentage of outliers increases. Intuitively, as the percentage of outliers grows, the outliers can eventually be treated as non-outliers once they become the majority of the sample: in our settings, for example, if 90% of the observations are multiplied by 10, the sample is equivalent to one with 10% extremely small-valued outliers in which the remaining observations follow a normal distribution with mean 100 and standard deviation 10.
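The moment-based skewness and excess kurtosis used in these comparisons can be computed as below (a plain central-moment version; SAS reports small-sample-corrected values, so table entries for small n differ slightly):

```python
def moment_stats(xs):
    """Sample skewness and excess kurtosis from central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3   # 0 for a normal distribution
    return skewness, excess_kurtosis
```

Multiplying even one observation in ten by 10 already produces strongly positive skewness, which is the pattern the tables below show.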


Figure 4-1 Distributions of Normal samples with outliers


Table 4-1 Goodness of Fit statistics of Normal samples with/without outliers

(AD = Anderson-Darling, SW = Shapiro-Wilk, KS = Kolmogorov-Smirnov)

N=20   % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%            1.4589    1.3617   1.3932  <0.0050   0.8142   0.0014   0.2291  <0.0100
        5%            4.4472   19.8416   6.3656  <0.0050   0.2791  <0.0001   0.4821  <0.0100
       10%            4.4472   19.8416   6.3656  <0.0050   0.2791  <0.0001   0.4821  <0.0100
       15%            2.1329    2.8700   5.2783  <0.0050   0.4715  <0.0001   0.4745  <0.0100
       20%            1.0381   -0.8658   3.5196  <0.0050   0.6469  <0.0001   0.4116  <0.0100
       30%            0.5466   -1.7131   2.7270  <0.0050   0.7169  <0.0001   0.3630  <0.0100

N=50   % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%           -0.5493    0.6696   0.6930   0.0698   0.9620   0.1081   0.1231   0.0571
        5%            6.9953   49.2761  15.5805  <0.0050   0.1843  <0.0001   0.4757  <0.0100
       10%            6.9953   49.2761  15.5805  <0.0050   0.1843  <0.0001   0.4757  <0.0100
       15%            3.2552    9.1514  15.2254  <0.0050   0.3493  <0.0001   0.5032  <0.0100
       20%            1.9472    1.9381  12.9844  <0.0050   0.4852  <0.0001   0.4847  <0.0100
       30%            0.9763   -0.9886   9.0427  <0.0050   0.6385  <0.0001   0.4234  <0.0100

N=100  % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%           -0.1137   -0.0381   0.3600  >0.2500   0.9914   0.7743   0.0600  >0.1500
        5%            6.8897   46.7486  31.0268  <0.0050   0.1806  <0.0001   0.4543  <0.0100
       10%            5.5553   29.6115  31.7335  <0.0050   0.2130  <0.0001   0.4712  <0.0100
       15%            3.1855    8.4605  30.5585  <0.0050   0.3463  <0.0001   0.4927  <0.0100
       20%            2.3078    3.5654  27.5557  <0.0050   0.4443  <0.0001   0.4851  <0.0100
       30%            1.1783   -0.5115  20.0336  <0.0050   0.6077  <0.0001   0.4428  <0.0100

N=150  % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%            0.1587    0.5661   0.3677  >0.2500   0.9929   0.6623   0.0514  >0.1500
        5%            5.9031   33.4386  48.7243  <0.0050   0.1928  <0.0001   0.4704  <0.0100
       10%            4.3321   17.0561  48.8724  <0.0050   0.2540  <0.0001   0.4874  <0.0100
       15%            2.4920    4.3382  43.9501  <0.0050   0.4046  <0.0001   0.4883  <0.0100
       20%            1.8499    1.5209  38.8027  <0.0050   0.4906  <0.0001   0.4725  <0.0100
       30%            0.9334   -1.0686  28.0832  <0.0050   0.6307  <0.0001   0.4227  <0.0100

N=200  % of Outlier  Skewness  Kurtosis  AD Stat  AD P     SW Stat  SW P     KS Stat  KS P
        0%            0.0207   -0.4445   0.2900  >0.2500   0.9935   0.5267   0.0373  >0.1500
        5%            6.8671   46.0285  62.5194  <0.0050   0.1763  <0.0001   0.4605  <0.0100
       10%            5.1181   24.7308  63.4635  <0.0050   0.2292  <0.0001   0.4799  <0.0100
       15%            2.6245    5.0245  58.7372  <0.0050   0.3950  <0.0001   0.4944  <0.0100
       20%            1.9086    1.7491  52.1481  <0.0050   0.4863  <0.0001   0.4804  <0.0100
       30%            0.9560   -1.0169  37.4026  <0.0050   0.6327  <0.0001   0.4244  <0.0100


Here is another example of samples with outliers. The clean sample is generated from a Lognormal distribution by taking the exponent of a random variable from the N(0,1) distribution. It can be seen from the following table that the skewness and kurtosis vary considerably as the percentage of outliers increases. The histograms of the contaminated samples show that in the Lognormal data it is not easy to find outliers: an observation randomly chosen from the lower side of the sample and multiplied by 10 may land in the right tail of the sample, making outlier detection difficult.


Figure 4-2 Distributions of Lognormal samples with outliers


Table 4-2 Goodness of Fit statistics of Lognormal samples with/without outliers

                         Skewness                                 Kurtosis
% of Outlier  N=20   N=50   N=100  N=150  N=200      N=20    N=50    N=100   N=150   N=200
  0%          2.750  1.931  3.713  5.075  2.734      6.642   3.191  18.568  34.853   9.342
  5%          2.058  2.596  3.222  3.728  3.445      2.776   7.947  11.612  15.334  15.571
 10%          2.058  2.596  3.058  3.549  3.312      2.776   7.947   9.788  13.662  14.438
 15%          1.588  4.657  5.748  3.661  4.166      1.026  25.724  40.634  15.251  22.926
 20%          3.964  3.089  6.717  3.260  6.048     16.508  10.224  51.309  11.699  43.810
 30%          3.972  3.329  5.567  8.727  5.091     16.571  11.545  35.153  92.166  29.964


4.2. Simulation on N(10,1) samples

• Baseline

In the baseline case, where Box-Cox is applied to the sample directly, the presence of outliers adversely affects the transformation and λ deviates from 1, as shown in the following figure. The variation of λ is large when the sample size is small and the percentage of outliers is low. With only 5% outliers, the median of λ is around -2 while some outlying λ's are around 1; as more outliers are present, the value of λ stabilizes around -2.
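A grid-search version of the Box-Cox maximum-likelihood estimate of λ, as used conceptually in the baseline case, might look like this (the grid, range, and function names are illustrative choices, not the dissertation's exact procedure):

```python
import math

def boxcox_loglik(y, lam):
    """Profile log-likelihood of the Box-Cox model (constants dropped)."""
    n = len(y)
    z = [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in y]
    zbar = sum(z) / n
    s2 = sum((v - zbar) ** 2 for v in z) / n
    # Jacobian term (lam - 1) * sum(log y) makes likelihoods comparable
    return -0.5 * n * math.log(s2) + (lam - 1) * sum(math.log(v) for v in y)

def boxcox_lambda(y):
    """Grid-search MLE of lambda over -3..3 in steps of 0.01
    (an illustrative grid, not the dissertation's exact search)."""
    grid = [i / 100 for i in range(-300, 301)]
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))
```

Because a handful of tenfold outliers dominate the likelihood, the maximizing λ for contaminated N(10,1) samples is pulled toward strongly negative values such as -2, as the figures show.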


Figure 4-3 Distribution of Lambda in Baseline with outliers, N(10,1)


[Four panels: 95% CI of Lambda, Baseline Case, N(10,1), with 5%, 10%, 15%, and 20% outliers; Lambda plotted against sample sizes 20-200.]

Figure 4-4 Confidence Interval of Lambda in Baseline with outliers, N(10,1)


• Anchor-to-1

If the sample is anchored to 1, the variation of λ is greatly alleviated compared to the baseline case. Most of the outlying values of λ are gone, although a number of large λ's remain in small samples. When the percentage of outliers is 20%, the mean of λ lies between -1 and -0.5, with a few λ's larger than 1.
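The anchor-to-1 step itself is a simple shift; a sketch (an assumed helper, returning the shift so the sample can be un-anchored after back-transformation):

```python
def anchor_to_1(y):
    """Shift the sample so its minimum equals 1, applied only when
    min(y) > 1 per the anchor-to-1 rule.  Returns the shifted sample
    and the shift amount (0.0 means no anchoring was done)."""
    shift = min(y) - 1
    if shift <= 0:
        return list(y), 0.0   # minimum <= 1: leave the sample unchanged
    return [v - shift for v in y], shift
```

After back-transforming Box-Cox results, adding `shift` back restores the original scale.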


Figure 4-5 Distribution of Lambda, Anchor-to-1, with outliers, N(10,1)


[Four panels: 95% CI of Lambda, Anchor-to-1, N(10,1), with 5%, 10%, 15%, and 20% outliers; Lambda plotted against sample sizes 20-200.]

Figure 4-6 Confidence Interval of Lambda, Anchor-to-1, with outliers, N(10,1)


• Accuracy of outlier detection

To evaluate the efficacy of the unified approach in detecting outliers, it is desirable to check how many outliers are found in each sample. The following steps are conducted to compute this accuracy. For each sample, the true number of observations multiplied by 10 (the generated outliers) is known. The Unified Approach reports how many outliers are in each sample, how many of them should be dropped, and whether Box-Cox is needed to transform the remaining sample to achieve normality. In addition, the Box-Cox transformation parameter λ is computed by the Unified Approach. As the Unified Approach detects outliers at both ends of the sample, it reports how many outliers are detected at each end.

Denote num_outlier the true number of outliers in the sample, upper_drop the number of outliers detected in the high end, low_drop the number of outliers detected in the low end, and outlier_drop the total number of outliers detected (upper_drop plus low_drop). To evaluate the accuracy of outlier detection, the following measurements are used for each sample:

a) If abs(upper_drop - num_outlier) <= 1 then accurate_1=1; else accurate_1=0.

b) If abs(low_drop - 0) <= 1 then accurate_2=1; else accurate_2=0.

c) If abs(low_drop - 0) <= 1 and abs(upper_drop - num_outlier) <= 1 then accurate_3=1; else accurate_3=0.

d) If low_drop=0 and abs(upper_drop - num_outlier) <= 1 then accurate_4=1; else accurate_4=0.

e) If low_drop=0 and upper_drop=num_outlier then accurate_5=1; else accurate_5=0.

The first criterion measures the accuracy of outlier detection in the high end. The second measures the accuracy in the low end, where the true number of outliers is zero because outliers are generated by multiplying observations by 10. The third measures low-end and high-end detection jointly with some flexibility; the fourth imposes a higher standard by requiring low-end detection to be exactly accurate; the fifth is the strictest, requiring both low-end and high-end detection to match the true case exactly. The Unified Approach may find outliers in the low end even though outliers are generated only in the high end. This is probably due to the masking/swamping effect, one of the major difficulties in outlier detection.
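The indicators a)-e) can be coded directly (a direct transcription of the definitions above; the function name is mine):

```python
def accuracy_flags(num_outlier, upper_drop, low_drop):
    """Return the five accuracy indicators [accurate_1..accurate_5].
    True outliers are generated only in the high end, so the low-end
    truth is 0."""
    a1 = abs(upper_drop - num_outlier) <= 1          # high end within 1
    a2 = abs(low_drop - 0) <= 1                      # low end within 1
    a3 = a1 and a2                                   # both, with slack
    a4 = (low_drop == 0) and a1                      # low end exact
    a5 = (low_drop == 0) and (upper_drop == num_outlier)  # both exact
    return [int(f) for f in (a1, a2, a3, a4, a5)]
```

Averaging these flags over the 1000 repetitions for each sample size gives the accuracy rates plotted in the figures.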

For each sample size, there are 1000 repetitions from which the accuracy measurements of outlier detection can be obtained. The percentages of "accurate" outlier detection based on these measurements are plotted in the following figures.


• Unified Approach without penalty for Box-Cox

When the unified approach is applied to the samples with outliers, the extra penalty for conducting the Box-Cox transformation is not included in the BIC at first. The λ found through the unified approach has a much smaller variance. Cases where outliers are detected and excluded only in the high end yield a narrower confidence interval for λ than cases where outliers are also detected in the low end. The correct rates for the high-end and both-end cases are almost the same, since the repetitions match most of the time.

It is seen that the correct rate of the Unified Approach in detecting outliers is above 95% for the first four accuracy criteria. When the sample size is larger than 50, Accurate_5 and Accurate_6 (the most stringent criteria) also achieve an accuracy rate above 95%. The following figures show the distribution of λ found through the Unified Approach under the different accuracy criteria. The variance of λ becomes smaller and the number of extremely large and small values of λ decreases.


Figure 4-7 Accuracy of outlier detection, N(10,1), 5% outliers, no penalty for BC


Figure 4-8 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy1=1, no extra penalty for Box-Cox


Figure 4-9 Distribution of Lambda, N(10,1), 5% outlier, cases where accuracy2=1, no extra penalty of Box-Cox


Figure 4-10 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy3=1, no extra penalty of Box-Cox


Figure 4-11 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy4=1, no extra penalty of Box-Cox


Figure 4-12 Distribution of Lambda, N(10,1), 5% outlier, outlier detection accuracy5=1, no extra penalty of Box-Cox


• Unified Approach with penalty for Box-Cox

When the unified approach is applied to the samples with outliers and an extra penalty is paid for conducting the Box-Cox transformation, in most cases the approach decides that outlier exclusion alone is needed to achieve normality rather than an additional Box-Cox transformation; when the sample size is larger than 50, this holds for more than 90% of the 1000 repetitions. Since this is a contaminated normal sample, the expectation is that only the outliers need to be excluded and that the remaining sample is normal, and the simulation results verify this. Applying the accuracy measurements of the previous (no-penalty) section, the accuracy of outlier detection remains promising when the penalty for the Box-Cox transformation is included.

The following box plots show the distribution of λ in the cases where Accurate_6=1 and Accurate_5=1, including all four handling methods: BC=1 DP=1, BC=1 DP=0, BC=0 DP=1, and BC=0 DP=0.


Figure 4-13 Percent of different handling methods, 5% outliers, N(10,1), penalty for BC


Figure 4-14 Accuracy of outlier detection, N(10,1), 5% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-15 Distribution of lambda, Unified Approach, N(10,1), 5% outlier with penalty for BC, accuracy5=1 cases


10% outliers:

Figure 4-16 Accuracy of outlier detection, N(10,1), 10% outliers, no penalty for BC

Figure 4-17 Percent of different handling methods, 10% outliers, penalty for BC


Figure 4-18 Accuracy of outlier detection, N(10,1), 10% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-19 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 10% outlier


15% outliers

Figure 4-20 Accuracy of outlier detection, N(10,1), 15% outliers, no penalty for BC

Figure 4-21 Percent of different handling methods, 15% outliers, penalty for BC


Figure 4-22 Accuracy of outlier detection, N(10,1), 15% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-23 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 15% outlier


20% outliers

Figure 4-24 Accuracy of outlier detection, N(10,1), 20% outliers, no penalty for BC

Figure 4-25 Percent of different handling methods, 20% outliers, penalty for BC


Figure 4-26 Accuracy of outlier detection, N(10,1), 20% outliers, penalty for BC, DP=1 and BC=0 cases


Figure 4-27 Distribution of lambda in the cases where accuracy5=1, no penalty for BC, N(10,1), 20% outlier


4.3. Simulation on Lognormal samples

To test the Unified Approach further, Lognormal samples are chosen to verify the accuracy of outlier detection and the Box-Cox transformation. It is known that a Lognormal sample can be transformed to normality through the log transformation, i.e. λ=0. In the presence of outliers, it is expected that the Unified Approach can detect the outliers accurately and transform the remaining sample using the logarithm. The random sample used here is generated through exp(Z), where Z is from N(0,1). As the minimum of such a sample is less than 1, according to the Unified Approach the sample will not be anchored to 1 before the Box-Cox transformation is conducted.

Given such a sample, there are four possible conclusions: outliers are dropped and Box-Cox is needed (DP=1 and BC=1); outliers are dropped and no Box-Cox is needed (DP=1 and BC=0); no outliers are detected and Box-Cox is needed (DP=0 and BC=1); no outliers are detected and no Box-Cox is needed (DP=0 and BC=0).

It is noticed from the following figures that, over the 1000 repetitions, most samples choose to drop outliers and Box-Cox transform the remaining observations even with the penalty for doing Box-Cox (DP=1 and BC=1) (Figures 4-30, 4-35, 4-40, 4-45). The value of λ in the cases where accuracy5=1, BC=1, and DP=1 is close to the expected value 0 (Figures 4-33, 4-34, 4-38, 4-39, 4-43, 4-44, 4-48, 4-49). The variance of λ is small, and the confidence interval for the mean of λ is quite narrow and covers 0 most of the time. The accuracy of outlier detection, however, is not as good as in the normal samples, and the accuracy rate drops dramatically as the sample size increases. This is probably due to the skewness of the lognormal sample: for example, an outlier with value 15, generated by multiplying 1.5 by 10, is not easy to distinguish from an original value of 15 in the right tail of the sample. The regular outlier detection accuracy measurements therefore may not reflect this situation.


5%

Figure 4-28 Percent of different handling methods, 5% outliers, LogN(0,1), penalty for BC

Figure 4-29 Accuracy of outlier detection, LogN(0,1), 5% outliers, no penalty for BC


Figure 4-30 Accuracy of outlier detection, LogN(0,1), 5% outliers, penalty for BC

Figure 4-31 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC


Figure 4-32 Distribution of lambda, LogN(0,1), 5% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


10%

Figure 4-33 Percent of different handling methods, 10% outliers, LogN(0,1), penalty for BC

Figure 4-34 Accuracy of outlier detection, LogN(0,1), 10% outliers, no penalty for BC


Figure 4-35 Accuracy of outlier detection, LogN(0,1), 10% outliers, penalty for BC

Figure 4-36 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, no penalty for BC


Figure 4-37 Distribution of lambda, LogN(0,1), 10% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


15%

Figure 4-38 Percent of different handling methods, 15% outliers, LogN(0,1), penalty for BC

Figure 4-39 Accuracy of outlier detection, LogN(0,1), 15% outliers, no penalty for BC


Figure 4-40 Accuracy of outlier detection, LogN(0,1), 15% outliers, penalty for BC

Figure 4-41 Distribution of lambda, LogN(0,1), 15% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC


Figure 4-42 Distribution of lambda, LogN(0,1), 15% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


20%

Figure 4-43 Percent of different handling methods, 20% outliers, LogN(0,1), penalty for BC

Figure 4-44 Accuracy of outlier detection, LogN(0,1), 20% outliers, no penalty for BC


Figure 4-45 Accuracy of outlier detection, LogN(0,1), 20% outliers, penalty for BC

Figure 4-46 Distribution of lambda, LogN(0,1), 20% outlier, accuracy6=1, BC=1 DP=1, no penalty for BC


Figure 4-47 Distribution of lambda, LogN(0,1), 20% outlier, accuracy5=1, BC=1 DP=1, penalty for BC


5. Discussion and future research

In this dissertation, a unified approach is proposed to handle outlier detection and Box-Cox transformation simultaneously using a penalized information criterion. Extensive simulation studies have been conducted, and the major findings are summarized below:

1) The Box-Cox transformation does not always work well in terms of finding an appropriate value of λ. In the simulations in chapter 2, for N(10,1) samples, the values of λ found through Box-Cox include extreme values such as 9 and -6, far from the expected value 1. The transformed samples can take very large or small values under such extreme λ's, which is undesirable even though the transformed sample might pass a normality test such as Anderson-Darling. For Lognormal samples, a value of λ close to zero is expected, while the simulation shows that Box-Cox does not always provide a λ close to zero. The variance of λ is found to be quite large, especially in small samples.

2) The anchor-to-1 method is proposed to solve the above problem. A sample with minimum value larger than 1 should be anchored so that its minimum is 1 before the Box-Cox transformation. With anchor-to-1, the variance of the Box-Cox λ is much smaller and the extreme values of λ disappear compared to not anchoring. The confidence interval for the mean of λ covers the true value most of the time for sample sizes ranging from 20 to 200.

3) Based on the idea of anchor-to-1, a Unified Approach is proposed to handle outliers and the Box-Cox transformation at the same time. For a given sample, with or without outliers, the unified approach can tell how many outliers are in the sample, how many extreme observations should be excluded, and what Box-Cox λ should be used to transform the remaining sample to achieve normality. The idea behind the Unified Approach is to find a tradeoff between better model fitting (normality) and less information loss (observations excluded). The criterion employed is an adjusted BIC, which penalizes outlier exclusion, anchor-to-1, and the Box-Cox transformation at the same time.
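A minimal sketch of an adjusted BIC of this form, assuming one penalty unit per excluded observation plus one each for Box-Cox and anchor-to-1 on top of the two Gaussian parameters (the exact parameter counting here is this sketch's assumption, not a quote of the dissertation's formula):

```python
import math

def adjusted_bic(loglik, n, dp, used_bc, used_anchor):
    """Adjusted BIC: -2*loglik + k*log(n), where k counts the mean and
    variance (2) plus one unit for each excluded observation (dp), one
    for applying Box-Cox, and one for applying anchor-to-1."""
    k = 2 + dp + int(used_bc) + int(used_anchor)
    return -2 * loglik + k * math.log(n)
```

The candidate (λ, DP, anchoring) combination with the smallest adjusted BIC is then selected, trading normality of the fit against the information lost by dropping observations.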

4) The efficacy of the Unified Approach is verified in clean samples first. According to the simulation results in chapter 3, the unified approach works satisfactorily in clean normal samples: it accurately detects that there are no outliers in a given sample, and based on the penalized BIC it finds a λ close to the true value 1, indicating that neither Box-Cox nor outlier exclusion is needed for a clean normal sample. Simulations have also been conducted in clean lognormal samples. The number of outliers found by the Unified Approach is no more than 2 in more than 90% of the 1000 repetitions, and the identified outliers are mostly located in the high end of the sample. For each clean lognormal sample, the Unified Approach has only two solutions: no outlier exclusion with a Box-Cox transformation (DP=0 and BC=1), or exclusion of some high-end observations with a Box-Cox transformation of the remaining sample (DP=1 and BC=1). The value of λ found through the Unified Approach is quite close to 0; its variance becomes smaller and its confidence interval covers zero when the sample size exceeds 50.

5) The Unified Approach has also been tested on normal and lognormal samples with outliers. Outliers are randomly generated in clean samples by multiplying the original observations by 10. To evaluate the accuracy of outlier detection, five measurements are created (accuracy1 to accuracy5), of which accuracy5 is the most stringent. In the normal samples, the Unified Approach works satisfactorily. Without the extra penalty for doing the Box-Cox transformation, the most stringent accuracy rate is larger than 90% when the sample size is larger than 50, and the value of λ is close to 1, with its variance shrinking as the sample size increases. With the penalty for doing Box-Cox, when the sample size is larger than 50 the Unified Approach chooses, in more than 90% of the repetitions, to only exclude some extremely large observations and not to do Box-Cox, which is consistent with the true case. For samples with more outliers added, similar conclusions can be drawn from the simulations. In the lognormal samples, for most of the repetitions (around 90%), the Unified Approach chooses to exclude extreme observations and Box-Cox transform the remaining sample (DP=1 and BC=1). The value of λ is close to 0 as expected. As for the accuracy of outlier detection, the Unified Approach does not work as well as in the normal samples. This is because the lognormal sample has a long high-end tail and it is difficult to distinguish outliers from large original values, i.e., an outlier generated by multiplying 1.2 by 10 looks no different from an original value of 12.

This dissertation has explored the Box-Cox transformation for dealing with skewed samples and outliers, and a Unified Approach has been proposed for handling samples with outliers. A few more things might be considered in future research:

1) The penalty function used in the Unified Approach treats the following as extra parameters: the Box-Cox transformation, each excluded observation, and anchor-to-1. They are penalized equally, i.e., in a case where one observation is excluded, Box-Cox is conducted, and anchor-to-1 is used, the total penalty count is 3. It might be a good idea to allocate different weights to them. For example, among the excluded observations, the largest observation dropped could be penalized by 1, the second largest by 2, and so forth. The penalties for doing Box-Cox, anchor-to-1, and observation exclusion can also differ. The penalty for outliers dropped at both the high and low ends can be higher than for outliers dropped at only one end: for example, two outliers dropped at both ends (one at each end) could be penalized by 5 (3 for the drop in the high end and 2 in the low end) rather than 4 (2 for each drop at either end). Another consideration is that the penalty part of BIC is k*log(n), where k is the number of parameters; other forms of penalty function might be considered, such as the corrected AIC, \( AICc = AIC + \frac{2k(k+1)}{n-k-1} \).

2) Outliers in the low end can also be considered in outlier detection. In this dissertation, outliers are generated only in the high end for simplicity, while extremely small observations in the low end of a sample are not rare. The impact of low-end observations might be alleviated by shifting the sample to the right; however, the shifted sample still shows skewness and may not pass a normality test. Considering outliers in both the high end and the low end would therefore provide a more general solution for outlier detection and data transformation.

3) Outliers are detected by checking all combinations of suspicious observations. In this dissertation, only 20% of the observations are considered suspicious outliers. For small samples this is not a problem, while for large samples, such as n=1000, 40401 combinations (201*201) need to be checked using the penalized BIC, which causes a heavy computational burden.

4) In deciding how many observations should be excluded and what λ should be used to achieve normality, the adjusted BIC is used as the criterion: the solution with the smaller adjusted BIC is chosen. The potential problem is that the difference in adjusted BIC may be small; for example, BICc=201 might suggest λ=0.8 with 5 outliers excluded while BICc=200.8 suggests λ=1 with 6 outliers excluded. In this situation, does using λ=0.8 instead of λ=1 make a big difference, given the 0.2 difference in BICc? Some flexibility should be allowed in choosing the parameters, since in practice choices are often limited to a few values of λ, such as integers between -3 and 3.


Bibliography

[1]. Grubbs, Frank (February 1969), Procedures for Detecting Outlying Observations in

Samples, Technometrics, 11(1), pp. 1-21.

[2]. R. B. Dean and W. J. Dixon (1951) "Simplified Statistics for Small Numbers of

Observations". Anal. Chem., 1951, 23 (4), 636–638.

[3]. Peirce, Benjamin, "Criterion for the Rejection of Doubtful Observations", Astronomical

Journal II 45 (1852)

[4]. David Hoaglin, Frederick Mosteller, and John Tukey (editors), Understanding Robust and

Exploratory Data Analysis, New York, John Wiley & Sons, 1983, pp. 39, 54, 62, 223.

[5]. Knorr, E. M. and Ng, R. T.: 1998, Algorithms for Mining Distance-Based Outliers in

Large Datasets. In: Proceedings of the VLDB Conference. New York, USA, pp. 392–403

[6]. Markus Breunig and Hans-Peter Kriegel and Raymond T. Ng and Jörg Sander: 2000,

LOF: Identifying Density-Based Local Outliers. In: Proceedings of the ACM SIGMOD

Conference. pp. 93-104

[7]. Cook, R. Dennis (Feb 1977). "Detection of Influential Observations in Linear

Regression". Technometrics (American Statistical Association) 19 (1): 15–18.

[8]. Barnett, V. and Lewis, T.: 1994, Outliers in Statistical Data. John Wiley & Sons., 3rd

edition.

[9]. Osborne, Jason W. & Amy Overbay (2004). The power of outliers (and why researchers

should always check for them). Practical Assessment, Research & Evaluation, 9(6).

[10]. P.J. Huber. . John Wiley & Sons, New York, 1981.

[11]. Rousseeuw, P. J. (1984) "Least Median of Squares Regression" Journal of the American

Statistical Association, 79, 871–880.

169

[12]. Rousseeuw, P.J. and Yohai, V. (1984), “ by Means of S estimators”,

in Robust and Nonlinear Analysis, edited by J. Franke, W. Härdle, and R.D.

Martin, Lecture Notes in Statistics 26, Springer Verlag, New York, 256-274.

[13]. Yohai V.J. (1987), “High Breakdown Point and High Robust Estimates for

Regression,” Annals of Statistics, 15, 642-656.

[14]. Hamilton, L.C. (1992). Regressions with graphics: A second course in applied statistics.

Monterey, CA.: Brooks/Cole.

[15]. Box, G. E. P. & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal

Statistical Society, 26(2), 211-252.

[16]. Andrews, D. F. (1971). A note on the selection of data transformations. Biometrika,

58(2), 249-254.

[17]. Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions

on Automatic Control, 19(6), 716-723.

[18]. Moore, D.S. & McCabe G.P. (1999), Introduction to the Practice of Statistics. , Freeman

&Company.

[19]. Hodge, V. & Austin, Journal of Artificial Intelligence Review, 22-2, 85-126

[20]. Tukey, J. W. (1957) “The comparative anatomy of transformations”. Annals of

Mathematical Statistics, 28, pp. 602-632.

[21]. John W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.

[22]. Tietjen, G & Moore, R (1972). Some Grubbs- type statistics for the detection of several

outliers. Tech- nometrics, 14, 583-597.

[23]. Barnett, B. and Lewis, T. (1984). Outliers in statistical data, second edition. New York:

Wiley.

170

[24]. Manly, B. F. (1976) “Exponential data transformation”. The Statistician, 25, pp. 37-42.

[25]. John, J. A. & Draper, N. R. (1980) “An alternative family of transformations”. Applied

Statistics, 29, pp. 190-197.

[26]. Bickel, P. J. & Robson, D. S. (1981) “An analysis of transformations revisited”. Journal

of the American Statistical Association, 76, pp. 296-311.

[27]. Yeo, In-Kwon and Johnson, Richard (2000). A new family of power transformations to

improve normality or symmetry. Biometrika, 87, 954-959.

[28]. Pericchi, L. R. (1981) “A Bayesian approach to transformations to

normality”. Biometrika, 68, 35-43.

[29]. Sweeting, T. J. (1984) “On the choice of the prior distribution for the Box-Cox

transformed linear model”. Biometrika, 71, 127-134.

[30]. Carroll, R. J. (1980) “A robust method for testing transformation to achieve approximate

normality”. Journal of the Royal Statistical Society, Series B, 42, 71-78.

[31]. Carroll, R. J. (1982 a) “Tests for regression parameters in power transformation models”.

Scandinavian Journal of Statistics, 9, 217-222.

[32]. Carroll, R. J. & Ruppert, D. (1984) “Power transformations when fitting theoretical

models to data”. Journal of the American Statistical Association, 79, 321-328.

[33]. Taylor, M. J. G. (1983) “Power transformation to symmetry”. Unpublished Dissertation,

Department of Statistics, University of California, Berkeley.

[34]. Taylor, M. J. G. (1985 a) “Power transformation to symmetry”. Biometrika, 72, 145-152.

[35]. Taylor, M. J. G. (1985 b) “Measure of location of skew distributions obtained through

Box-Cox transformations”. Journal of the American Statistical Association, 80, 427-432.


[36]. Taylor, M. J. G. (1987) “Using a generalized mean as a measure of location”.

Biometrical Journal, 6, 731-738.

[37]. Andrews, D. F., Gnanadesikan, R. & Warner, J. L. (1973) “Method for assessing

multivariate normality, in: P. R. Krishnaiah (Ed.)”. Multivariate Analysis III, pp. 95-115,

New York, Academic Press.

[38]. Dunn, J. E. & Tubbs, J. D. (1980) “A procedure for determining homoscedastic

transformations of multivariate normal populations”. Communications in Statistics-

Simulation and Computation B, 9(6), 589-598.

[39]. Beauchamp, J. J. & Robson, D. S. (1986) “Transformation considerations in

discriminant analysis”. Communications in Statistics-Simulation and Computation, 15(1),

147-179.

[40]. Rigby, R.A. & Stasinopoulos, D.M. (2004) “Smooth centile curves for skew and

kurtotic data modelled using the Box–Cox power exponential distribution”. Stat Med,

23(19), 3053-76.

[41]. Bozdogan, H. & Ramirez, D.E. (1988) “UTRANS and MTRANS: Marginal and Joint

Box-Cox Transformations of Multivariate Data to 'Near' Normality”. Multivariate

Behavioral Research, 23, 131-132.

[42]. Lachtermacher, G. & Fuller, J.D. (1995). “Back propagation in time-series forecasting”.

Journal of forecasting (0277-6693), 14 (4), p. 381.

[43]. Draper, N. R. & Cox, D. R. (1969) “On distributions and their transformations to

normality”. Journal of the Royal Statistical Society, Series B, 31, 472-476.

[44]. Hinkley, D. V. (1975) “On power transformation to symmetry”. Biometrika, 62, 101-

111.


[45]. Cressie, N. A. C. (1978) “Removing nonadditivity from two-way tables with one

observation per cell”. Biometrics, 34, 505-513.

[46]. Hernandez, F. & Johnson, R. A. (1980) “The large sample behaviour of transformations

to normality”. Journal of the American Statistical Association, 75, 855-861.

[47]. Kullback, S. & Leibler, R.A. (1951). "On Information and Sufficiency". The Annals of

Mathematical Statistics 22 (1): 79–86.

[48]. Hinkley, D. V. (1985) “Transformation diagnostics for linear models”. Biometrika, 72,

487-496.

[49]. Han, A. K. (1987) “A non-parametric analysis of transformation”. Journal of

Econometrics, 35, 191-209.

[50]. Solomon, P. J. (1985) “Transformations for components of variance and

covariance”. Biometrika, 72, 233-239.

[51]. Sakia, R. M. (1988) “Application of the Box-Cox transformation technique to linear

balanced mixed analysis of variance models with a multi-error structure”. Unpublished

PhD Thesis, Universitaet Hohenheim, FRG.

[52]. Chang, H. S. (1977 a). A computer program for Box-Cox transformation and estimation technique. Econometrica, 45, 1741.

[53]. Huang, C.L. & Moon, L.C. & Chang, H.S (1978). A computer program using the Box-

Cox transformation technique for the specification of functional form, The American

Statistician, 32, 144.

[54]. Atkinson, A.C. (1973) Testing transformations to normality, Journal of the Royal

Statistical Society, Series B, 35, 473-479


[55]. Carroll, R. J. (1980) A robust method for testing transformation to achieve approximate

normality, Journal of the Royal Statistical Society, Series B, 42, 71-78.

[56]. Lawrance, A. J. (1987 a) The score statistic for regression transformation, Biometrika, 74,

275-279.

[57]. Hinkley, D. V. (1988) More on score tests for transformation in regression, Biometrika,

75, 366-369.

[58]. Lawrance, A. J. (1987 b) A note on the variance of the Box-Cox regression transformation

estimate, Applied Statistics, 36, 221-223.

[59]. Atkinson, A. C. (1985) Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis (Oxford, Clarendon Press).

[60]. Wang, S. (1987) Improved approximation for transformation diagnostics,

Communications in Statistics Theory and Methods, 16(6), 1797-1819.

[61]. Atkinson, A.C. & Lawrance, A. J. (1989) A comparison of asymptotic equivalent test

statistics for regression transformation, Biometrika, 76, 223-229.

[62]. Draper, N. R. & Cox, D. R. (1969) On distributions and their transformations to

normality, Journal of the Royal Statistical Society, Series B, 31, 472-476.

[63]. Poirier, D. J. (1978) The use of the Box-Cox transformation in limited dependent

variable models, Journal of the American Statistical Association, 73, 285-287.

[64]. Spitzer, J. J.(1978) A Monte Carlo investigation of the Box-Cox transformation in small

samples, Journal of the American Statistical Association, 73, 488-495

[65]. Bickel, P. J. and Doksum, K. A. (1981) An analysis of transformations revisited, Journal

of the American Statistical Association, 76, 296-311.


[66]. Box, G. E. P. & Cox, D. R. (1982) An analysis of transformation revisited, rebutted,

Journal of the American Statistical Association, 77, 209-210.

[67]. Carroll, R. J. & Ruppert, D. (1981) On prediction and the power transformation family,

Biometrika, 68, 609-615.

[68]. Hinkley, D. V. & Runger, G. (1984). The analysis of transformed data, Journal of the

American Statistical Association, 79, 302-320.

[69]. Doksum, K. A. & Wong, C. W. (1983) Statistical tests based on transformed data,

Journal of the American Statistical Association, 78, 411-417.

[70]. Wood, J. T. (1974) An extension of transformations of Box and Cox, Applied Statistics,

23, 278-283.

[71]. Carroll, R. J. & Ruppert, D. (1984) Power transformations when fitting theoretical

models to data, Journal of the American Statistical Association, 79, 321-328.

[72]. Ruppert, D., Cressie, N. & Carroll, R. J. (1989) A transformation/weighting model for

estimating Michaelis-Menten parameters, Biometrics, 45, 637-656.

[73]. Rudemo, M., Ruppert, D. & Streibig, J. C. (1989) Random effects models in nonlinear

regression with application to bioassay, Biometrics, 45, 349-362.

[74]. Wixley, R. A. J.(1986). Unconditional likelihood tests for the linear model following

Box-Cox transformation. South African Statistical Journal, 20, 67-83.

[75]. Atkinson, A. C. (1982). Regression diagnostics, transformation and constructed

variables, Journal of the Royal Statistical Society, Series B, 44, 1-36.

[76]. Atkinson, A. C. (1983) Diagnostic regression analysis and shifted power transformation.

Technometrics, 25, 23-33.


[77]. Carroll, R. J. (1982 b) Two examples of transformation when there are possible outliers,

Applied Statistics, 31, 149-152.

[78]. Cook, R. D. & Wang, P. C. (1983) Transformation and influential cases in regression,

Technometrics, 25, 337-343.

[79]. Atkinson, A. C. (1986) Diagnostic tests for transformation, Technometrics, 28, 29-37.

[80]. Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics

6 (2): 461–464.

[81]. Osborne, Jason (2002). Notes on the use of data transformations. Practical Assessment,

Research & Evaluation, 8(6).

[82]. Kadota, K., Nishimura, S.I., Bono, H. et al. 2003a. Detection of genes with tissue-specific

expression patterns using Akaike’s Information Criterion (AIC) procedure. Physiol. Genomics,

12:251–259.

[83]. Kadota, K., Tominaga, D., Akiyama, Y., & Takahashi, K. (2003). Detecting outlying samples in

microarray data: A critical assessment of the effect of outliers on sample classification. Chem-Bio

Informatics Journal, 3, 30-45.

[84]. Kadota, K., Ye, J., Nakai, Y., Terada, T., & Shimizu, K. (2006). ROKU: a novel method for

identification of tissue-specific genes. BMC Bioinformatics, 7:294.

[85]. Ueda, T. (1996). Simple method for the detection of outliers. Japanese Journal of

Applied Statistics, 25(1), 17-26.


Appendix

The following SAS/IML programs were used to generate the simulation results in Chapters 2-4.

Due to limited space, only the simulations for normal samples are provided here. For lognormal samples, the only change needed is the distribution generated at the beginning of each simulation.

A. N(10,1) sample, no Box-Cox transformation or outlier detection involved

*original sample, compute AIC BIC MSE AD AD_Pvalue;
proc iml;
pi = constant("pi");
e = constant("e");
ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200};
epsilon={0.0 0.05 0.1 0.15 0.20};
out1=shape(.,95000,9);

/*Normal data*/
do k=1 to 19; **sample size loop;
n=ss[k];
z=shape(.,n,1);
u=z;

neps=ncol(epsilon);
do j=1 to 1; **percent of outliers loop;
eps=epsilon[j];

%let rep=1000;
do i=1 to &rep; **1000 repetition loop;

seedz=21273*k+430856*j+4487*i+9999;
seedu=9789*k+1723*j+1532+4487*i+9999;

AIC=shape(.,&rep,1);
BIC=shape(.,&rep,1);
AD=shape(.,&rep,1);
AD2=shape(.,&rep,1);
SE=shape(.,&rep,1);
pvalue=shape(.,&rep,1);

call rannor(seedz,z);


y=10+1*z;
if eps>0 then do;
call ranuni(seedu,u);
y=y#(10*(u<=eps)+(u>eps));
end;

c=y; d=rank(y); y[d]=c; *sort y into ascending order;

xsl=ssq(y-y[:])/n; *MLE of the variance;
z2=sum(log(1:n)); *log(n!), from using the ordered sample;

AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*z2+4; *AIC;
BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*z2+2*log(n); *BIC;
SE[i]=(y[:]-10)##2; *squared error;

*anderson-darling;
len=nrow(y);
sse=ssq(y-y[:])/(len-1);
uu=(y-y[:])/sqrt(sse);
p=rank(uu);
f=cdf('NORMAL',uu[p]);
if abs(max(f)-1)<0.00001 then do;
ad[i]=999; ad2[i]=999;
end;
else do;
lnf1=log(f);
lnf2=log(1-f);
ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2);
ad2[i]=ad[i]#(1+0.75/len+2.25/len##2);
if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937-5.709#ad2[i]+0.0186#ad2[i]##2);
else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2);
else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2);
else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2);
end;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps;
out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i;

out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = BIC[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = AD[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = AD2[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = pvalue[i];
out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = SE[i];

end; *i; end; *j; end; *k;

create temp2 var{Sample_Size pct_outlier Rep AIC BIC AD AD2 AD_pvalue SE};
append from out1;
close temp2;
quit;
data temp2; set temp2; where sample_size ne .; run;
data out.normal_null;
set temp2;
rename AIC=AIC_null BIC=BIC_null AD=AD_null AD2=AD2_null SE=SE_null ad_Pvalue=ad_Pvalue_null;
run;
ods rtf file="F:\April_May_2013_Chapter\Chapter_3\MSE_null.doc";
proc means data=temp2 noprint;
var SE;
by sample_size;
output out=mse(drop=_type_) mean=MSE_null;
run;
proc print data=mse; run;
ods rtf close;
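The Anderson-Darling computation used throughout these programs (statistic A², the small-sample adjustment A²* = A²(1 + 0.75/n + 2.25/n²), and the piecewise p-value approximation) can be sketched outside SAS/IML as follows. This is an illustrative Python version of the same formulas, not part of the original simulation code; the function name `anderson_darling` is ours.

```python
import math

def anderson_darling(x):
    """A-D statistic for normality on sample x: returns (A2*, approximate p-value),
    mirroring the formulas in the SAS/IML programs above."""
    n = len(x)
    mean = sum(x) / n
    s2 = sum((v - mean) ** 2 for v in x) / (n - 1)  # sample variance, n-1 denominator as in the SAS code
    z = sorted((v - mean) / math.sqrt(s2) for v in x)
    # standard normal CDF of the ordered standardized values
    F = [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]
    # A^2 = -n - (1/n) * sum_i (2i-1) [ln F_(i) + ln(1 - F_(n+1-i))]
    a2 = -n - sum(((2 * i - 1) / n) * (math.log(F[i - 1]) + math.log(1 - F[n - i]))
                  for i in range(1, n + 1))
    a2_star = a2 * (1 + 0.75 / n + 2.25 / n ** 2)
    # piecewise p-value approximation, same coefficients as the SAS code
    if 0.6 <= a2_star < 13:
        p = math.exp(1.2937 - 5.709 * a2_star + 0.0186 * a2_star ** 2)
    elif 0.34 <= a2_star < 0.6:
        p = math.exp(0.9177 - 4.297 * a2_star - 1.38 * a2_star ** 2)
    elif 0.2 <= a2_star < 0.34:
        p = 1 - math.exp(-8.318 + 42.796 * a2_star - 59.938 * a2_star ** 2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * a2_star - 223.73 * a2_star ** 2)
    return a2_star, p
```

The SAS code additionally guards against F reaching 1 (setting AD = 999); the sketch omits that guard for brevity.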

B. N(10,1) sample, Box-Cox transformation only

*original data: do Box-Cox, check lambda for large variance;

libname out "F:\April_May_2013_Chapter\new_data_04302013";
proc iml;

/*constants used in the calculation*/ pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y,n,pi,e); x=(y##lam-1)/lam; xsl=ssq(x-x[:])/n; z1=sum(log(1:n)); f=0.5*n*log(2#pi#e)+0.5*n*log(xsl)-(lam-1)*sum(log(y))-z1; return (f); finish BoxCox; ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; /*ss={20 80 150 };*/ epsilon={0.0 0.05 0.1 0.15 0.20};

/*out1 and out2 are the output datasets*/ out1=shape(.,95000,11);

/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;

neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];

%let rep=1000; do i=1 to &rep; **1000 repetition loop;

seedz=21273*k+430856*j+4487*i+9999; *!!!attention: when 9999 is changed to 999, xsl=0 occurs and the program fails; seedu=9789*k+1723*j+1532+4487*i+9999;

lambda=shape(.,&rep,1); AIC=shape(.,&rep,1); BIC=shape(.,&rep,1);


AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1); pvalue=shape(.,&rep,1); btran=shape(.,&rep,1); x_bar=shape(.,&rep,1);

call rannor(seedz,z); y=10+1*z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; c=y; d=rank(y); y[d]=c;

/*use nlpnra call to find the optimized log-likelihood, calculate AIC*/ l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0);

if q<=0 then do; lambda[i]=. ; AIC[i]=.; BIC[i]=.; AD[i]=.; AD2[i]=.; pvalue[i]=.; SE[i]=.; btran[i]=.; x_bar[i]=.; end;

else do; lambda[i]=lammy; x=(y##lammy-1)/lammy; x_bar[i]=x[:]; xsl=ssq(x-x[:])/n; z2=sum(log(1:n));

AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+6;*raw; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+3*log(n);*BIC; btran[i]=(lammy#x_bar[i]+1)##(1/lammy); SE[i]=(btran[i]-10)##2;*back transformed square error;


*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937-5.709#ad2[i]+0.0186#ad2[i]##2); else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2); else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2); else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2); end;*max(f)=1; end;*q>0; out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n; out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps; out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i; out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = lambda[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = q; out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = BIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = AD[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = AD2[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,10] = pvalue[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,11] = SE[i];

end;*i;

end;*j; end;*k; create temp1 var{Sample_Size pct_outlier Rep AIC Lambda Q BIC AD AD2 AD_pvalue SE}; append from out1;

close temp1; quit; data temp1; set temp1; where sample_size ne .; run; data out.normal_baseline_BC; set temp1; rename lambda=lambda_baseline BIC=BIC_baseline AIC=AIC_baseline AD=AD_baseline AD2=AD2_baseline AD_pvalue=AD_pvalue_baseline; drop q; run; ods rtf file="F:\April_May_2013_Chapter\Chapter_3\MSE_baseline.doc"; proc means data=temp1 noprint; var SE; by sample_size; output out=mse(drop=_type_) mean=MSE_baseline; run; proc print data=mse; run; ods rtf close; ods html image_dpi=300; ods rtf file="F:\April_May_2013_Chapter\Chapter_3\lambda_baseline.doc"; proc sgplot data=out.normal_baseline_BC(where=(pct_outlier=0)); vbox lambda_baseline / category=sample_size clusterwidth=0.5 ; xaxis display=(nolabel); title 'Box-Plot of Lambda: Null case, N(10,1), Clean data'; YAXIS LABEL = 'Lambda' GRID VALUES = (-8 TO 8 BY 2); refline 1/; XAXIS LABEL = 'Sample Size'; run; title; proc means data=out.normal_baseline_BC(where=(pct_outlier=0)) noprint; var lambda_baseline; by sample_size; output out=CI mean=lambda_mean_null uclm=Upper lclm=Lower; run; data CI_plot; set CI; drop lower upper lambda_mean_null;

bound = lower; output; bound = upper; output; bound = lambda_mean_null; output; run; goptions reset=all; symbol1 interpol=hiloctj cv=red ci=blue width=2 value=dot; axis1 label=('Lambda'); axis2 label=('Sample Size') ; proc gplot data=CI_plot; plot bound*sample_size/vaxis=axis1 haxis=axis2; title '95% CI of Lambda: Null Case, N(10,1), Clean data'; run; title; quit; ods rtf close; data extreme; set out.normal_baseline_BC; where pct_outlier=0 and abs(lambda_baseline)>5; run;
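The BoxCox module above minimizes the profile negative log-likelihood f(λ) = (n/2) log(2πe) + (n/2) log σ̂²(λ) − (λ−1) Σ log yᵢ − log n!, where σ̂²(λ) is the MLE variance of the transformed sample. A minimal Python sketch of the same objective, with a simple grid search standing in for the Newton-Raphson call `nlpnra` (function names are ours):

```python
import math

def boxcox_negloglik(lam, y):
    """Profile negative log-likelihood of the Box-Cox model, mirroring the
    BoxCox module in the SAS/IML code above (y must be positive)."""
    n = len(y)
    if abs(lam) < 1e-8:
        x = [math.log(v) for v in y]          # limit of the transform as lam -> 0
    else:
        x = [(v ** lam - 1.0) / lam for v in y]
    xbar = sum(x) / n
    s2 = sum((v - xbar) ** 2 for v in x) / n  # MLE variance, n denominator as in the SAS code
    z1 = sum(math.log(i) for i in range(1, n + 1))  # log(n!), constant in lam
    return (0.5 * n * math.log(2 * math.pi * math.e) + 0.5 * n * math.log(s2)
            - (lam - 1.0) * sum(math.log(v) for v in y) - z1)

def boxcox_lambda(y, lo=-5.0, hi=5.0, steps=2001):
    """Grid-search MLE of lambda; a crude stand-in for the nlpnra
    Newton-Raphson optimization used in the SAS programs."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: boxcox_negloglik(lam, y))
```

For lognormal-type data (y = exp(z) with z roughly symmetric), the minimizer lands near λ = 0, i.e. the log transform, as expected.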

C. N(10,1) sample, anchor to 1 and Box-Cox transformation only

libname out "F:\April_May_2013_Chapter\new_data_04302013";

*anchor to 1; *MSE here is not correct: it needs to be back-transformed and un-anchored; proc iml;

/*constants used in the calculation*/ pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y,n,pi,e); x=(y##lam-1)/lam; xsl=ssq(x-x[:])/n; z1=sum(log(1:n)); f=0.5*n*log(2#pi#e)+0.5*n*log(xsl)-(lam-1)*sum(log(y))-z1; return (f); finish BoxCox;


ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; /*ss={20 80 150 };*/ epsilon={0.0 0.05 0.1 0.15 0.20};

/*out1 and out2 are the output datasets*/ out1=shape(.,95000,11);

/*Normal data*/ do k=1 to 19; **sample size loop; n=ss[k]; z=shape(.,n,1); u=z;

neps=ncol(epsilon); do j=1 to 1; **percent of outliers loop; eps=epsilon[j];

%let rep=1000; do i=1 to &rep; **1000 repetition loop; seedz=21273*k+430856*j+4487*i+9999; *!!!attention: when 9999 changes to 999 xsl=0 bombs; seedu=9789*k+1723*j+1532+4487*i+9999; lambda=shape(.,&rep,1); AIC=shape(.,&rep,1); BIC=shape(.,&rep,1); AD=shape(.,&rep,1); AD2=shape(.,&rep,1); SE=shape(.,&rep,1); pvalue=shape(.,&rep,1); btran=shape(.,&rep,1); x_bar=shape(.,&rep,1); call rannor(seedz,z); y=10+1*z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; c=y; d=rank(y); y[d]=c; y=y+1-y[1];


/*use nlpnra call to find the optimized log-likelihood, calculate AIC*/ l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0); if q<=0 then do; lambda[i]=. ; AIC[i]=.; BIC[i]=.; AD[i]=.; AD2[i]=.; pvalue[i]=.; SE[i]=.; btran[i]=.; x_bar[i]=.; end; else do; lambda[i]=lammy; x=(y##lammy-1)/lammy; x_bar[i]=x[:]; xsl=ssq(x-x[:])/n; z2=sum(log(1:n));

AIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+6;*raw; BIC[i]=n*log(2*pi*e)+n*log(xsl)-2*(lammy-1)*sum(log(y))-2*z2+3*log(n);*BIC; btran[i]=(lammy#x_bar[i]+1)##(1/lammy); SE[i]=(btran[i]-10)##2;*back transformed square error;

*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[i]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[i]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[i]=ad[i]#(1+0.75/len+2.25/len##2); if ad2[i]>=0.6 & ad2[i]<13 then pvalue[i]=exp(1.2937-5.709#ad2[i]+0.0186#ad2[i]##2); else if ad2[i]>=0.34 & ad2[i]<0.6 then pvalue[i]=exp(0.9177-4.297#ad2[i]-1.38#ad2[i]##2); else if ad2[i]>=0.2 & ad2[i]<0.34 then pvalue[i]=1-exp(-8.318+42.796#ad2[i]-59.938#ad2[i]##2); else pvalue[i]=1-exp(-13.436+101.14#ad2[i]-223.73#ad2[i]##2); end;*max(f)=1; end;*q>0; out1[&rep*neps*(k-1)+&rep*(j-1)+i,1] = n; out1[&rep*neps*(k-1)+&rep*(j-1)+i,2] = eps; out1[&rep*neps*(k-1)+&rep*(j-1)+i,3] = i; out1[&rep*neps*(k-1)+&rep*(j-1)+i,4] = AIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,5] = lambda[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,6] = q; out1[&rep*neps*(k-1)+&rep*(j-1)+i,7] = BIC[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,8] = AD[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,9] = AD2[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,10] = pvalue[i]; out1[&rep*neps*(k-1)+&rep*(j-1)+i,11] = SE[i];

end;*i; end;*j; end;*k; create temp1 var{Sample_Size pct_outlier Rep AIC Lambda Q BIC AD AD2 AD_pvalue SE}; append from out1; close temp1; quit; data temp1; set temp1; where sample_size ne .; run; data out.normal_baseline_anchor_BC; set temp1; rename lambda=lambda_baseline_anchor BIC=BIC_baseline_anchor AIC=AIC_baseline_anchor AD=AD_baseline_anchor AD2=AD2_baseline AD_pvalue=AD_pvalue_baseline_anchor; drop q; run; ods html image_dpi=300;

ods rtf file="F:\April_May_2013_Chapter\Chapter_3\lambda_baseline_anchor.doc"; proc sgplot data=out.normal_baseline_anchor_BC(where=(pct_outlier=0)); vbox lambda_baseline_anchor / category=sample_size clusterwidth=0.5 ; xaxis display=(nolabel); title 'Box-Plot of Lambda: Anchor-to-1, N(10,1), Clean data'; YAXIS LABEL = 'Lambda' GRID VALUES = (-8 TO 8 BY 2); refline 1/; XAXIS LABEL = 'Sample Size'; run; title; proc means data=out.normal_baseline_anchor_BC(where=(pct_outlier=0)) noprint; var lambda_baseline_anchor; by sample_size; output out=CI mean=lambda_mean_anchor lclm=Lower uclm=Upper; run; data CI_plot; set CI; drop lower upper lambda_mean_anchor; bound = lower; output; bound = upper; output; bound = lambda_mean_anchor; output; run; goptions reset=all; symbol1 interpol=hiloctj cv=red ci=blue width=2 value=dot; axis1 label=('Lambda'); axis2 label=('Sample Size') ; proc gplot data=CI_plot; plot bound*sample_size/vaxis=axis1 haxis=axis2; title '95% CI of Lambda: Anchor-to-1, N(10,1), Clean data'; run; title; quit; ods rtf close;
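The anchor-to-1 step in the program above is the shift y + 1 − min(y), applied before Box-Cox so the sample minimum is exactly 1; the same shift must be undone after back-transformation when reporting estimates on the original scale (this is why the MSE in this program is flagged as not yet correct). A small illustrative Python sketch, not the original SAS code:

```python
def anchor_to_1(y):
    """Shift the sample so its minimum is exactly 1 (the anchor-to-1 step);
    returns the shifted sample and the shift that was applied."""
    shift = 1.0 - min(y)
    return [v + shift for v in y], shift

def un_anchor(value, shift):
    """Undo the anchor shift, e.g. after back-transforming a mean, so the
    estimate is on the original scale."""
    return value - shift
```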

D. Unified Approach: detect outliers at both ends, anchor to 1 if the minimum is smaller than 1, then Box-Cox transform; penalty for Box-Cox is not considered

options linesize=200 pagesize=200 nocenter; options nosource nonotes;


libname out "C:\wg_05282013"; *p1; proc iml; pi = constant("pi"); e = constant("e"); start BoxCox(lam) global(y2,n,dp,pi,e); x=(y2##lam-1)/lam; xsl=ssq(x-x[:])/(n-dp); z1=sum(log(1:(n-dp))); f=0.5*(n-dp)*log(2#pi#e)+0.5*(n-dp)*log(xsl)-(lam-1)*sum(log(y2))-z1; return (f); finish BoxCox; ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200}; epsilon={0.05 0.1 0.15 0.20}; out1=shape(.,2290000,17); do k=1 to 10; n=ss[k]; z=shape(.,n,1); u=shape(.,n,1); l_max_drop=n/5; u_max_drop=n/5; max_drop=(l_max_drop+1)*(u_max_drop+1);

*find a constant used later; if k=1 then do; cc=0; end; else do; cc=0; do index=1 to k-1; cc=cc+(0.2*ss[index]+1)##2; end; end;

neps=ncol(epsilon); neps=1;*change accordingly; do j=1 to 1; eps=epsilon[j];


%let rep=1000; do i=1 to &rep; *rep; seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999; call rannor(seedz,z); y=10+z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end;

do h=1 to l_max_drop+1; ldp=h-1; y22=shape(.,n-ldp,1); y33=shape(.,n-ldp,1); y44=shape(.,n-ldp,1); c=y; d=rank(y); y[d]=c; y_min=y[1];*the minimum of raw obs; y44=y+1-y_min;*anchor to 1; y22=y44[ldp+1:n]; y33=y22+1-y22[1];*anchor every time when lower end is dropped; do g=1 to u_max_drop+1; udp=g-1; dp=ldp+udp; y2=shape(.,n-dp,1); y_kp=shape(.,n-dp,1); lambda=shape(.,max_drop,1); AIC=shape(.,max_drop,1); AIC2=shape(.,max_drop,1); BIC=shape(.,max_drop,1); AIC4=shape(.,max_drop,1); AIC5=shape(.,max_drop,1); ad=shape(.,max_drop,1); ad2=shape(.,max_drop,1); pvalue=shape(.,max_drop,1); SE=shape(.,max_drop,1); y_kp_bar=shape(.,max_drop,1); SE0=shape(.,max_drop,1); x_bar=shape(.,max_drop,1);


btran=shape(.,max_drop,1); SE2=shape(.,max_drop,1);

**drop outliers; c=y33; d=rank(y33); y33[d]=c; y2=y33[1:n-ldp-udp]; y_kp=y[ldp+1:n-udp];

l0=1; lammy=.; call nlpnra(q,lammy,"BoxCox",l0);

if q<=0 then do; lambda[g]=. ; AIC[g]=.; AIC2[g]=.; BIC[g]=.; AIC4[g]=.; ad[g]=.; ad2[g]=.; pvalue[g]=.; SE[g]=.; y_kp_bar[g]=.; btran[g]=.; x_bar[g]=.; SE0[g]=.; SE2[g]=.;

end;

else do; lambda[g]=lammy; x=(y2##lammy-1)/lammy;

z2=sum(log(1:(n-dp))); z3=sum(log(1:n));

xsl=ssq(x-x[:])/(n-dp); AIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+6+2*dp; *raw; BIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+(3+dp)*log(n-dp); *BIC;

x_bar[g]=x[:];


SE0[g]=(x_bar[g]-10)##2;*transformed square error;

y_kp_bar[g]=y_kp[:];

SE2[g]=(y_kp_bar[g]-10)##2;

btran[g]=(lammy#x_bar[g]+1)##(1/lammy)+y[h]-1; SE[g]=(btran[g]-10)##2;*back transformed square error; /* print dp ldp h udp y y2 y_kp x btran lammy se y_kp_bar x_bar SE0 SE2;*/

*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[g]=999; ad2[g]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[g]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[g]=ad[g]#(1+0.75/len+2.25/len##2); end; if ad2[g]>=0.6 & ad2[g]<13 then pvalue[g]=exp(1.2937-5.709#ad2[g]+0.0186#ad2[g]##2); else if ad2[g]>=0.34 & ad2[g]<0.6 then pvalue[g]=exp(0.9177-4.297#ad2[g]-1.38#ad2[g]##2); else if ad2[g]>=0.2 & ad2[g]<0.34 then pvalue[g]=1-exp(-8.318+42.796#ad2[g]-59.938#ad2[g]##2); else pvalue[g]=1-exp(-13.436+101.14#ad2[g]-223.73#ad2[g]##2);

end;

out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,1] = n; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,2] = eps; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,3] = i;


out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,4] = dp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,5] = ldp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,6] = udp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,7] = AIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,8] = lambda[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,9] = q; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,10] = ad[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,11] = ad2[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,12] = pvalue[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,13] = BIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,14] = SE[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,15] = SE0[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,16] = btran[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,17] = SE2[g];

end; *g; end; *h; end; *i; end; *j; end; *k; create p1 var{Sample_Size pct_outlier Rep outlier_drop low_drop upper_drop AIC Lambda Q AD AD2 pvalue BIC SE SE0 btran SE2}; append from out1; close p1; quit; data out.bc_p1; set p1; where sample_size ne .; run;
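The adjusted criterion in this program charges each excluded observation a penalty of 2 on the AIC scale (and log(n−dp) on the BIC scale), so dropping a point must improve the normal fit enough to pay for itself. A compact Python sketch of that selection, with λ held at 1 for simplicity (i.e., the no-Box-Cox variant of the criterion, as in program E); the helper names `penalized_aic` and `best_drop` are ours:

```python
import math

def penalized_aic(y_sorted, ldp, udp, lam=1.0):
    """Adjusted AIC after dropping ldp points from the low end and udp from
    the high end of the sorted sample; each dropped point adds 2 to the
    penalty, mirroring the 2*dp term in the SAS code (lam fixed at 1 here)."""
    dp = ldp + udp
    kept = y_sorted[ldp: len(y_sorted) - udp]
    m = len(kept)
    xbar = sum(kept) / m
    s2 = sum((v - xbar) ** 2 for v in kept) / m          # MLE variance
    logfact = sum(math.log(i) for i in range(1, m + 1))  # log(m!)
    return (m * math.log(2 * math.pi * math.e) + m * math.log(s2)
            - 2 * (lam - 1.0) * sum(math.log(v) for v in kept)
            - 2 * logfact + 4 + 2 * dp)

def best_drop(y, max_drop):
    """Search all (low, high) drop counts up to max_drop each and return the
    pair minimizing the penalized AIC."""
    ys = sorted(y)
    return min(((l, u) for l in range(max_drop + 1) for u in range(max_drop + 1)),
               key=lambda p: penalized_aic(ys, p[0], p[1]))
```

On a sample of points near 10 with two gross high-end outliers, the criterion prefers dropping exactly those two points: the variance reduction outweighs the per-point penalty, while dropping a third (clean) point does not.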


E. Unified Approach: detect outliers at both ends, anchor to 1 if the minimum is smaller than 1, then Box-Cox transform; penalty for Box-Cox is considered

options nosource nonotes; libname out "C:\wg_05282013"; ****p1; proc iml; *no Box-Cox is done in this program; this result is compared with anchor_every_step_drop_BC.sas; *lambda is set to one throughout, indicating no BC is done; pi = constant("pi"); e = constant("e");

/*ss={20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200};*/ ss={20 30 40 50 60 70 80 90 100 110}; epsilon={0.05 0.1 0.15 0.20}; out1=shape(.,2290000,14); do k=1 to 10; n=ss[k]; z=shape(.,n,1); u=shape(.,n,1); l_max_drop=n/5; u_max_drop=n/5; max_drop=(l_max_drop+1)*(u_max_drop+1);

*find a constant used later; if k=1 then do; cc=0; end; else do; cc=0; do index=1 to k-1; cc=cc+(0.2*ss[index]+1)##2; end; end;

neps=ncol(epsilon); neps=1;*change accordingly; do j=1 to 1; eps=epsilon[j];


%let rep=1000; do i=1 to &rep; *rep; seedz=21273*k+430856*j+4487*i+9999; seedu=9789*k+1723*j+1532+4487*i+9999; call rannor(seedz,z); y=10+z; if eps>0 then do; call ranuni(seedu,u); y=y#(10*(u<=eps)+(u>eps)); end; do h=1 to l_max_drop+1; ldp=h-1; y22=shape(.,n-ldp,1); y33=shape(.,n-ldp,1); y44=shape(.,n-ldp,1); c=y; d=rank(y); y[d]=c; y_min=y[1];*the minimum of raw obs; y44=y+1-y_min;*anchor to 1; y22=y44[ldp+1:n]; y33=y22+1-y22[1]; do g=1 to u_max_drop+1; udp=g-1; dp=ldp+udp; y2=shape(.,n-dp,1); y_kp=shape(.,n-dp,1); lambda=shape(.,max_drop,1); AIC=shape(.,max_drop,1); AIC2=shape(.,max_drop,1); BIC=shape(.,max_drop,1); AIC4=shape(.,max_drop,1); AIC5=shape(.,max_drop,1); ad=shape(.,max_drop,1); ad2=shape(.,max_drop,1); pvalue=shape(.,max_drop,1); SE=shape(.,max_drop,1);

**drop outliers; c=y33; d=rank(y33);


y33[d]=c; y2=y33[1:n-ldp-udp]; y_kp=y[ldp+1:n-udp];

q=999; lammy=1; lambda[g]=lammy; x=(y2##lammy-1)/lammy;

z2=sum(log(1:(n-dp))); z3=sum(log(1:n));

xsl=ssq(x-x[:])/(n-dp); AIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+4+2*dp; *raw AIC; BIC[g]=(n-dp)*log(2*pi*e)+(n-dp)*log(xsl)-2*(lammy-1)*sum(log(y2))-2*z2+(2+dp)*log(n-dp); *BIC; SE[g]=(y_kp[:]-10)##2; /* print dp ldp udp y y2 y_kp x SE;*/

*anderson-darling; len=nrow(x); sse=ssq(x-x[:])/(len-1); uu=(x-x[:])/sqrt(sse); p=rank(uu); f=cdf('NORMAL',uu[p]); if abs(max(f)-1)<0.00001 then do; ad[g]=999; ad2[g]=999; end; else do; lnf1=log(f); lnf2=log(1-f); ad[g]=-len-sum((2#p/len-1/len)#lnf1)-sum((2-2#p/len+1/len)#lnf2); ad2[g]=ad[g]#(1+0.75/len+2.25/len##2); end; if ad2[g]>=0.6 & ad2[g]<13 then pvalue[g]=exp(1.2937-5.709#ad2[g]+0.0186#ad2[g]##2); else if ad2[g]>=0.34 & ad2[g]<0.6 then pvalue[g]=exp(0.9177-4.297#ad2[g]-1.38#ad2[g]##2); else if ad2[g]>=0.2 & ad2[g]<0.34 then pvalue[g]=1-exp(-8.318+42.796#ad2[g]-59.938#ad2[g]##2); else pvalue[g]=1-exp(-13.436+101.14#ad2[g]-223.73#ad2[g]##2);


/* end;*/

out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,1] = n; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,2] = eps; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,3] = i; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,4] = dp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,5] = ldp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,6] = udp; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,7] = AIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,8] = lambda[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,9] = q; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,10] = ad[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,11] = ad2[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,12] = pvalue[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,13] = BIC[g]; out1[&rep*neps*cc+(max_drop)*&rep*(j-1)+(max_drop)*(i-1)+(h- 1)*(u_max_drop+1)+g,14] = SE[g]; end; *g; end; *h; end; *i; end; *j; end; *k; create p1 var{Sample_Size pct_outlier Rep outlier_drop low_drop upper_drop AIC Lambda Q AD AD2 pvalue BIC SE}; append from out1; close p1; quit; data out.no_bc_p1; set p1; where sample_size ne .;

run;

F. Evaluation of outlier detection accuracy of the Unified Approach

libname out "F:\April_May_2013_Chapter\outlier_sample_05282013\normal\count_number_of_outlier"; %macro min2(var=, dat=); proc summary data=&dat nway; class sample_size pct_outlier Rep; var &var; output out=aa(drop=_type_ _freq_) min(&var)=min_&var; run; proc sort data=&dat; by sample_size pct_outlier Rep; run; proc sort data=aa; by sample_size pct_outlier Rep; run; data bb; merge &dat aa; by sample_size pct_outlier Rep; if &var=min_&var; run; proc freq data=bb noprint; by sample_size pct_outlier; table outlier_drop/nocol norow nopercent out=cc_&var; run; %mend; %min2(var=BIC,dat=out.bc_5_pct); data accuracy; set bb; if abs(upper_drop-num_outlier)<=1 then accurate=1; else accurate=0; if abs(low_drop-0)<=1 then accurate2=1;

else accurate2=0; if abs(outlier_drop-num_outlier)<=1 then accurate3=1; else accurate3=0; if abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1; else accurate4=0; if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1; else accurate5=0; if low_drop=0 and upper_drop=num_outlier then accurate6=1; else accurate6=0; run; ods rtf file="F:\April_May_2013_Chapter\Chapter_4\06222013_accuracy_reply_to_horn.doc"; proc freq data=bb noprint; table outlier_drop/norow nocol nopercent out=one; table low_drop/norow nocol nopercent out=two; table upper_drop/norow nocol nopercent out=three; by sample_size; run; proc print data=one;run; proc print data=two;run; proc print data=three;run; ods rtf close;

ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\plot_06222013_lambda_outlier_detection_accuracy.doc";
proc freq data=accuracy noprint;
    table accurate/norow nocol nopercent out=acc;
    table accurate2/norow nocol nopercent out=acc2;
    table accurate3/norow nocol nopercent out=acc3;
    table accurate4/norow nocol nopercent out=acc4;
    table accurate5/norow nocol nopercent out=acc5;
    table accurate6/norow nocol nopercent out=acc6;
    by sample_size;
run;

proc print data=acc;
    title 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
run; title;
proc print data=acc2;
    title 'if abs(low_drop-0)<=1 then accurate2=1;';
run; title;
proc print data=acc3;
    title 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
run; title;
proc print data=acc4;
    title 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
run; title;
proc print data=acc5;
    title 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
run; title;
proc print data=acc6;
    title 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
run; title;

**plot;
symbol1 color=vibg interpol=join value=dot height=1.5;

proc gplot data=acc(where=(accurate=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: high end outlier detection';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
run; title; quit;

proc gplot data=acc2(where=(accurate2=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: low end outlier detection';
    title2 'if abs(low_drop-0)<=1 then accurate2=1;';
run; title; quit;

proc gplot data=acc3(where=(accurate3=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: both ends outlier detection';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
run; title; quit;

proc gplot data=acc4(where=(accurate4=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low end and outlier in high end';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
run; title; quit;

proc gplot data=acc5(where=(accurate5=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
run; title; quit;

proc gplot data=acc6(where=(accurate6=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
run; title; quit;

proc sgplot data=accuracy(where=(accurate=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in high end, no penalty for BC';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate2=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in low end, no penalty for BC';
    title2 'if abs(low_drop-0)<=1 then accurate2=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate3=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in both ends, no penalty for BC';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate4=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end, no penalty for BC';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate5=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (low end detected=0), no penalty for BC';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy(where=(accurate6=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (exact, and low end detected=0), no penalty for BC';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1;';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

ods rtf close;

* plot_06222013_lambda_outlier_detection_accuracy.doc;
*****************************************************;
***** penalty for BC;
data out.bc_no_bc_5_pct;
    set out.bc_5_pct out.no_bc_5_pct;
run;

%min2(var=BIC, dat=out.bc_no_bc_5_pct);

data leave_raw_alone;
    set bb;
    if lambda=1 then BC=0;
    else BC=1;
    if outlier_drop=0 then DP=0;
    else DP=1;
run;

proc freq data=leave_raw_alone noprint;
    table DP*BC/norow nocol nopercent out=leave_raw_alone_summary;
    table outlier_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_both;
    table upper_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_up;
    table low_drop*BC/norow nocol nopercent out=leave_raw_alone_summary_low;
    by sample_size;
run;

ods rtf file="F:\April_May_2013_Chapter\Chapter_4\06232013_leave_raw_alone.doc";
proc print data=leave_raw_alone_summary;      run;
proc print data=leave_raw_alone_summary_both; run;
proc print data=leave_raw_alone_summary_up;   run;
proc print data=leave_raw_alone_summary_low;  run;
ods rtf close;

data accuracy;
    set leave_raw_alone;
    if abs(upper_drop-num_outlier)<=1 then accurate=1;
    else accurate=0;
    if abs(low_drop-0)<=1 then accurate2=1;
    else accurate2=0;
    if abs(outlier_drop-num_outlier)<=1 then accurate3=1;
    else accurate3=0;
    if abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1;
    else accurate4=0;
    if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1;
    else accurate5=0;
    if low_drop=0 and upper_drop=num_outlier then accurate6=1;
    else accurate6=0;
run;

data accuracy_dp_only;
    set accuracy;
    where DP=1 and BC=0;  * only cases where outliers were dropped (DP=1) and no Box-Cox was applied (BC=0);
run;

ods html image_dpi=300;
ods rtf file="F:\April_May_2013_Chapter\Chapter_4\plot_06232013_lambda_outlier_detection_accuracy_BC_penalty.doc";
proc freq data=accuracy_dp_only noprint;
    table accurate/norow nocol nopercent out=acc;
    table accurate2/norow nocol nopercent out=acc2;
    table accurate3/norow nocol nopercent out=acc3;
    table accurate4/norow nocol nopercent out=acc4;
    table accurate5/norow nocol nopercent out=acc5;
    table accurate6/norow nocol nopercent out=acc6;
    by sample_size;
run;

proc print data=acc;
    title 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc2;
    title 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc3;
    title 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc4;
    title 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc5;
    title 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
run; title;
proc print data=acc6;
    title 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
run; title;

**plot;
symbol1 color=vibg interpol=join value=dot height=1.5;

proc gplot data=acc(where=(accurate=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: high end outlier detection';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;


proc gplot data=acc2(where=(accurate2=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: low end outlier detection';
    title2 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc3(where=(accurate3=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: both ends outlier detection';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc4(where=(accurate4=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low end and outlier in high end';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc5(where=(accurate5=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc gplot data=acc6(where=(accurate6=1));
    plot percent*sample_size;
    title1 'Accuracy of outlier detection: no outlier in low (exact 0) end and outlier in high end';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
run; title; quit;

proc sgplot data=accuracy_dp_only(where=(accurate=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in high end, penalty for BC';
    title2 'if abs(upper_drop-num_outlier)<=1 then accurate=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate2=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in low end, penalty for BC';
    title2 'if abs(low_drop-0)<=1 then accurate2=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate3=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outlier detected accurately in both ends, penalty for BC';
    title2 'if abs(outlier_drop-num_outlier)<=1 then accurate3=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate4=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end, penalty for BC';
    title2 'abs(low_drop-0)<=1 and abs(upper_drop-num_outlier)<=1 then accurate4=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate5=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (low end detected=0), penalty for BC';
    title2 'if low_drop=0 and abs(upper_drop-num_outlier)<=1 then accurate5=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

proc sgplot data=accuracy_dp_only(where=(accurate6=1));
    vbox lambda / category=sample_size clusterwidth=0.5;
    xaxis display=(nolabel);
    title1 'Box-Plot of Lambda: Unified Approach, N(10,1), 5% outlier, outliers detected only in high end (exact, and low end detected=0), penalty for BC';
    title2 'if low_drop=0 and upper_drop=num_outlier then accurate6=1, penalty for BC, cases DP=1 and BC=0';
    yaxis label='Lambda' grid values=(-1.5 to 3 by 0.5);
    refline 1/;
    xaxis label='Sample Size';
run; title;

ods rtf close;
