The Pennsylvania State University
The Graduate School

TWO TOPICS: A JACKKNIFE MAXIMUM LIKELIHOOD

APPROACH TO STATISTICAL MODEL SELECTION AND A

CONVEX HULL PEELING DEPTH APPROACH TO

NONPARAMETRIC MASSIVE MULTIVARIATE ANALYSIS

WITH APPLICATIONS

A Thesis in Statistics
by
Hyunsook Lee

© 2006 Hyunsook Lee

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2006

The thesis of Hyunsook Lee was reviewed and approved* by the following:

G. Jogesh Babu Professor of Statistics Thesis Advisor, Chair of Committee

Thomas P. Hettmansperger Professor of Statistics

Bing Li Professor of Statistics

W. Kenneth Jenkins Professor of Electrical Engineering

James L. Rosenberger Professor of Statistics Head of Statistics Department

*Signatures are on file in the Graduate School.

Abstract

This dissertation presents two topics from opposite disciplines: one is from a parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection, and the second is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications on massive astronomical data. First, we present a model selection criterion that minimizes the Kullback-Leibler distance by using the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model selection problems, arises from estimating the distance between an unknown true model and an estimated model. We show that (a) the jackknife maximum likelihood estimator is consistent for the parameter of interest, (b) the jackknife estimate of the log likelihood is asymptotically unbiased, and (c) the stochastic order of the jackknife log likelihood estimate is O(log log n). Because of these properties, the jackknife information criterion is applicable to problems of choosing a model from non-nested candidates, especially when the true model is unknown. Compared to popular information criteria, which are only applicable to nested models, the jackknife information criterion is more robust in filtering various types of candidate models to choose the best approximating model. This robust method has the demerit, however, that the jackknife criterion is unable to discriminate between nested models.

Next, we explore the convex hull peeling process to develop empirical tools for statistical inference on massive multivariate data. The convex hull and its peeling process have an intuitive appeal for robust location estimation. We define the convex hull peeling depth, which enables the ordering of multivariate data. This ordering process based on data depth provides ways to obtain multivariate quantiles, including the multivariate median. Based on the generalized quantile process, we define a convex hull peeling central region, a convex hull level set, and a volume functional, which lead us to one-dimensional mappings describing the shapes of multivariate distributions along data depth. We define empirical skewness and kurtosis measures based on the convex hull peeling process. In addition to these empirical descriptive statistics, we present several methodologies for finding multivariate outliers in massive data sets. These outlier detection algorithms are (1) estimating multivariate quantiles up to the level α, (2) detecting changes in a measure sequence of convex hull level sets, and (3) constructing a balloon to exclude outliers. The convex hull peeling depth is a robust estimator, so the existence of outliers does not affect the properties of inner convex hull level sets. We illustrate all of these characteristics of the convex hull peeling process with bivariate synthetic data sets, and we demonstrate that these empirical procedures are applicable to real massive data sets by employing quasars and galaxies from the Sloan Digital Sky Survey. Interesting scientific results from the convex hull peeling process on these massive data sets are also provided.

KEYWORDS: Kullback-Leibler distance, information criterion, statistical model selection, jackknife, jackknife information criterion (JIC), bias reduction, maximum likelihood, unbiased estimation, non-nested model, computational geometry, convex hull, convex hull peeling, statistical data depth, generalized quantile process, skewness, kurtosis, volume functional, convex hull level set, massive data, outlier detection, balloon plot, multivariate analysis

Table of Contents

List of Figures

List of Tables

List of Algorithms

Acknowledgments

Chapter 1  Introduction: Motivations
1.1  Model Selection
1.2  Convex Hull Peeling
1.3  Synopsis

Chapter 2  Literature Reviews
2.1  Introduction
2.2  Model Selection Under Maximum Likelihood Principle
2.2.1  Kullback-Leibler Distance
2.2.2  Preliminaries
2.2.3  Akaike Information Criterion
2.2.4  Bayes Information Criterion
2.2.5  Minimum Description Length
2.2.6  Resampling in Model Selection
2.2.7  Other Criteria and Comments
2.3  Convex Hull and Algorithms
2.3.1  Definitions
2.3.2  Convex Hull Algorithms

2.3.3  Other Algorithms and Comments
2.4  Data Depth
2.4.1  Ordering Multivariate Data
2.4.2  Data Depth Functions
2.4.3  Statistical Data Depth
2.4.4  Multivariate Symmetric Distributions
2.4.5  General Structures for Statistical Depth
2.4.6  Depth of a Single Point
2.4.7  Comments
2.5  Descriptive Statistics of Multivariate Distributions
2.5.1  Multivariate Median
2.5.2  Location
2.5.3  Scatter
2.5.4  Skewness
2.5.5  Kurtosis
2.5.6  Comments
2.6  Outliers
2.6.1  What Are Outliers?
2.6.2  Where Do Outliers Occur?
2.6.3  Outliers and Normal Distributions
2.6.4  Nonparametric Outlier Detection
2.6.5  Convex Hull vs. Outliers
2.6.6  Comments: Outliers in Massive Data
2.7  Streaming Data
2.7.1  Univariate Sequential Quantile Estimation
2.7.2  ε-Approximation
2.7.3  Sliding Windows
2.7.4  Streaming Data in Astronomy
2.7.5  Comments

Chapter 3  Jackknife Approach to Model Selection
3.1  Introduction
3.2  Preliminaries
3.3  Jackknife Maximum Likelihood Estimator
3.4  Jackknife Information Criterion (JIC)
3.5  When Candidate Models are Nested
3.6  Discussions

Chapter 4  Convex Hull Peeling Depth
4.1  Introduction
4.2  Preliminaries
4.3  Statistical Depth
4.4  Convex Hull Peeling Functionals
4.4.1  Center of Data Cloud: Median
4.4.2  Location Functional
4.4.3  Scale Functional
4.4.4  Skewness Functional
4.4.5  Kurtosis Functional
4.5  Simulation Studies on CHPD
4.6  Robustness of CHPM - Finite Sample Relative Efficiency Study
4.7  Statistical Outliers
4.8  Discussions

Chapter 5  Convex Hull Peeling Algorithms
5.1  Introduction
5.2  Data Description: SDSS DR4
5.3  Multivariate Median
5.4  Quantile Estimation
5.5  Density Estimation
5.6  Outlier Detection
5.6.1  Quantile Approach
5.6.2  Detecting Shape Changes Caused by Outliers
5.6.3  Balloon Plot
5.7  Estimating the Shapes of Multidimensional Distributions
5.7.1  Skewness
5.7.2  Kurtosis
5.8  Discoveries from Convex Hull Peeling Analysis on SDSS
5.9  Summary and Conclusions

Chapter 6  Concluding Remarks
6.1  Summary and Discussions
6.1.1  Jackknife Model Selection
6.1.2  Unspecified True Model and Non-Nested Candidate Models
6.1.3  Convex Hull Peeling Process
6.1.4  Multivariate Descriptive Statistics
6.1.5  Multivariate Outliers
6.1.6  Color Distributions of Extragalactic Populations
6.2  Future Works
6.2.1  Clustering with Convex Hull Peeling
6.2.2  Voronoi Tessellation
6.2.3  Additional Methodologies for Outlier Detection
6.2.4  Testing Similarity of Distributions
6.2.5  Breakdown Point
6.2.6  CHPD is Not Continuous
6.2.7  Sloan Digital Sky Survey Data Release Five
6.2.8  Additional Topics

Bibliography

List of Figures

3.1  An illustration of model parameter space: g(x) is the unknown true model, and ΘA, ΘB, Θ1, and Θ2 are the parameter spaces of different models. Note that Θ1 ⊂ Θ2 and θg is the parameter that minimizes the KL distance more than the parameters of the other models (θa, θb, and θ1). The black solid line and the blue dashed line indicate Θ1; details are given in the text.
4.1  Bivariate samples of different shapes put into the convex hull peeling process. Clockwise from the upper left panel, samples of size 1000 were generated from a bivariate uniform distribution on a square, a bivariate uniform distribution on a triangle, a bivariate Cauchy distribution with correlation 0.5, and a bivariate normal distribution with correlation 0.5.

5.1  The sample size is 10^6 from a standard bivariate normal distribution and the target quantiles are {0.99, 0.95, 0.9, , , 0.1, 0.5, 0.01}. For the sequential method, we took 1000 data points at a time and estimated quantiles. Each quantile has 1000 corresponding convex hull peeling level sets, which are combined and peeled until about half of the data points are left. Blue indicates quantiles from the non-sequential method and red indicates quantiles from the sequential method.
5.2  The sample size of both examples is 10^4, from a standard bivariate normal distribution and a bivariate t5 distribution with ρ = −0.5. The target quantiles are {0.99, 0.95, 0.9, , , 0.1, 0.5, 0.01}. Densities are estimated by means of eq. (5.2), from which the empirical density profile is built.

ix 5.3 Samples of size 2000 are generated. The primary distribution is the standard bivariate normal but the panels of the second column show 1% contamination by the standard bivariate normal distribution centered at (4,4). As the convex hull peeling process proceeds, the area of each convex hull layer is calculated...... 130 5.4 Same samples from Figure 5.3. The blue line indicated the IQR and the red line indicated the balloon expanded IQR by 1.5 times lengthwise. Points outside this balloon are considered to be outliers. 133 5.5 Three different bivariate samples are compared in term of the skew- ness measure Rj. A sample of the first column was generated from a standard bivariate normal distribution; a sample of the sec- ond column from a bivariate normal distribution with correlation 0.5; a sample of the third column was generated from (X,Y ) ∼D 2 (N(0, 1), χ10), where X and Y areindependent...... 134 5.6 Tailweight is calculated based on eq. (5.3). Samples of size 2000 are generated from a uniform, normal, t10 and Cauchy distribution. . . 136 5.7 Volume sequence of convex hull peeling central regions from 70204 Quasars in four dimension color magnitude space...... 138 5.8 Volume sequence of convex hull peeling central regions from 499043 galaxies in four dimension color magnitude space...... 139 5.9 Sequence of Rj skewness measure from 70204 Quasars in four di- mensioncolormagnitudespace ...... 139 5.10 Sequence of Rj skewness measure from 499043 galaxies in four di- mensioncolormagnitudespace ...... 140

List of Tables

4.1  Empirical square error and relative efficiency

5.1  Comparison among the mean, coordinate-wise median, and convex hull peeling median of synthetic data sets of size 10^4 and 10^6 from the standard bivariate normal distribution. For the larger sample, we apply the sequential CHPM method.
5.2  Different center locations of the quasar and galaxy distributions in the space of color magnitudes based on the mean, coordinate-wise median, CHPM, and sequential CHPM. The sequential method adopted d = 0.05 and m = 10^4.

List of Algorithms

1  Graham scan algorithm
2  Gift wrapping algorithm
3  Beneath and beyond algorithm
4  Divide and conquer algorithm
5  Incremental algorithm
6  Quickhull algorithm
7  Convex hull peeling median I
8  Sequential convex hull peeling median I
9  Sequential convex hull peeling median II
10  pth convex hull peeling quantile
11  Sequential pth convex hull peeling quantile

Acknowledgments

One baby step of a lifelong journey is about to be finished. I was not alone when I made this step. There are people who have constantly encouraged me. Among them, above all, I am very thankful to my advisor, Prof. Babu, who has been very patient with me. He gave me a freedom to explore the sea of statistics that no graduate student can imagine, which led me to confront a few barren islands. A small part of this expedition is recorded in this dissertation. I am also very grateful for his brilliant advice and precise comments, which opened my eyes to the endless possibilities in statistics that I would like to pursue for many years. I am very much indebted to my committee members, Professors Hettmansperger and Li, for their advice and support, and to my minor and external committee member, Prof. Jenkins, who offered the most interesting course, Adaptive Filters, and inspired my creative thinking. Beyond their support, Prof. Li showed me the beauty of asymptopia through his course, and Prof. Hettmansperger provided very stimulating and encouraging conversations about my research and classical music. I would also like to thank Prof. Rosenberger, who was supportive in every aspect as department head; Dr. van den Berk, who showed me how to use SQL for acquiring data from the Sloan Digital Sky Survey; Prof. Rao, whose encouragement and interest in my work helped me to last until my thesis defense; Prof. Harkness, who has encouraged me with his warm heart; and the people at SAMSI, who gave me hope that there is something I can contribute to both the astronomical and statistical communities. Among the people I met at SAMSI, I am especially thankful to Pablo, who acknowledged my ability and helped me rebuild the self-esteem I had lost for years. There are also many good friends who deserve more than a few words of gratitude. Without my parents' understanding, patience, and love, I would never have made it through this step. No words can appreciate their love enough. I have never failed to thank God for having the best parents in the world, who always stand by my side and believe in me.

Among my dear pieces of music, Ich habe genug (BWV 82) is flowing in my mind. J.S. Bach, a humble servant of God, transferred his pious mind into this beautiful music. His piety resides in me, whispering to me to be humble and awake for the future. I wish that God sees my dissertation as my gratitude to Him and knows that I am ready for the next step.

xiv Dedication

To my mother, who has taught me women’s sacrifice.

Chapter 1

Introduction: Motivations

This dissertation contains two topics. The first topic concerns parametric model selection with the delete-one jackknife method under the maximum likelihood principle. The second topic focuses on the statistical properties and algorithms of the convex hull peeling process as a nonparametric multivariate data analysis method. Reviews and general introductions related to these topics are given in later chapters. In this chapter, the motivations behind each subject are discussed.

1.1 Model Selection

One of the challenges in statistical model selection is assessing candidate models when the true data generating model is unspecified. Even though the problem of the unspecified true model is well recognized, in practice traditional information criteria such as AIC, BIC, and their modifications are routinely applied to real problems as if the true model were identified. The misuse of these traditional model selection criteria likely leads to an improper choice of models. Quite a number of studies have addressed the problems of an unspecified or misspecified true model; for example, Huber, 1967[97]; Jennrich, 1969[109]; Nishii, 1988[155]; Render, 1981[169]; Sen, 1979[195]; Sin and White, 1996[210]; Vuong, 1989[225]; and White, 1982[231]. These studies focused on the idea that the model estimation procedure ultimately yields the best approximation to the unknown true model under various circumstances. However, there has been no previous study of model selection, based on minimizing the Kullback-Leibler distance (Kullback and Leibler, 1951), for the case in which the candidates are not nested and the true model is unspecified. When the true model is known, the Kullback-Leibler distance between the true model and a candidate model is estimable. Developing a proper estimation method to minimize this distance has been a popular topic in model selection studies. In general, estimating the Kullback-Leibler distance introduces an estimation bias, and the penalty terms that accompany the likelihood of the candidate model in the equations of well known selection criteria are byproducts of this estimation bias. Bozdogan (1987[34]) criticized the generalization of those penalty terms under the unspecified true model condition. The status of the true model is very ambiguous. Even though the true model is unspecified, one of the candidate models may coincide with the true model by accident. At the least, we may consider one of the candidate models as the best model when it is closer to the unknown true model, in terms of the Kullback-Leibler distance, than the other models. A model minimizing the distance from the unspecified true model describes the true model at least as well as the other candidates. Note that these candidate models need not be nested in order to search for a best model. The parameter space of each model can be different, and we investigate this parameter space to find a model that minimizes the distance from the true model. However, bias arises in estimating the distance, since the parameters of a model and the distance itself are estimated under this unspecified true model. Only a known true model allows us to assess this bias, which is very unlikely in reality. Therefore, instead of estimating the bias of the Kullback-Leibler distance under the unknown true model, we choose a bias reducing technique to find a model that minimizes the Kullback-Leibler distance. The jackknife resampling method has long been known for its bias reduction property; however, it has lost popularity to other resampling schemes such as the bootstrap and cross-validation. Because of this lower popularity, no application of the jackknife method to developing a model selection criterion, especially a criterion minimizing the Kullback-Leibler distance, can be found. We expect that a candidate model minimizing the Kullback-Leibler distance can be chosen without explicitly estimating the bias, since the jackknife method will reduce it.
The jackknife method will provide the bias reduced maximum likelihood estimate for a model of minimum distance. The procedure for assessing the bias reduced maximum likelihood based on the jackknife method will be presented.
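To make the role of bias reduction concrete, the following is a minimal sketch of a delete-one jackknife correction to the plug-in (maximized) mean log likelihood, assuming a simple Gaussian location model with known unit variance. The function names, the toy data, and the Quenouille-type correction shown here are illustrative only; they are not the exact JIC construction developed in Chapter 3.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def jackknife_loglik(x, fit, loglik):
    """Quenouille-type delete-one jackknife correction of the plug-in
    mean log likelihood.

    fit(s)       -> parameter estimate (e.g. an MLE) from sample s
    loglik(s, t) -> mean log likelihood of sample s under parameter t
    """
    n = len(x)
    full = loglik(x, fit(x))                    # plug-in estimate (biased upward)
    loo = np.array([loglik(np.delete(x, i), fit(np.delete(x, i)))
                    for i in range(n)])         # delete-one replicates
    return n * full - (n - 1) * loo.mean()      # bias-corrected estimate

# Toy use: Gaussian model with unknown mean and known unit variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=200)
fit = lambda s: s.mean()
loglik = lambda s, mu: norm.logpdf(s, loc=mu).mean()
print(jackknife_loglik(x, fit, loglik))
\end{verbatim}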

1.2 Convex Hull Peeling

The notion of convex hull peeling as a tool for location estimation (Tukey, 1974) has been recognized in the statistical community for more than three decades. Yet until Zuo and Serfling's (2000a) discussion of statistical data depth, the convex hull peeling process was hardly used for statistical inference. In Barnett's extensive study (1976) on multivariate data, convex hull peeling was introduced to resolve the complexity of ordering multivariate data. From time to time, the convex hull of a given sample has been used for robust estimation and outlier detection. Compared to its long history, relatively little work has been done on convex hull peeling. Compared to other statistical depths, such as the simplicial depth or the half-space depth, the convex hull peeling depth does not have a distribution theory. The technicalities involved in the development of limiting distributions for convex hull peeling are quite difficult. Even when samples are generated from a continuous distribution, no continuous functional of convex hull peeling estimators has been reported, which prevents the convex hull peeling process from being adopted for formal statistical inference. The main advantage of the convex hull peeling depth is its intuitive appeal: it is rather intuitive and straightforward compared to other data depths. This characteristic of the convex hull peeling depth will be illustrated in later chapters with synthetic data and massive astronomical data from the Sloan Digital Sky Survey. The picturesque property of the convex hull peeling depth has inspired various simulation studies, such as checking whether the convex hull peeling process measures the scatter, skewness, and kurtosis of a distribution; testing whether the convex hull peeling process catches outliers; and examining whether the convex hull peeling process delineates the density profile. These empirical studies are mostly related to finding descriptive statistics and developing exploratory data analysis tools for multivariate distributions. Since the convex hull peeling depth is defined through a counting process, linking every point with a depth value is simple. In addition, estimating a quantile is simply a matter of collecting points of the same depth. The contour formed by these points of the same depth is convex, and its volume is easily calculated. Because estimating a quantile leads directly to estimating the volume of the corresponding quantile, a set of points of the same depth collected through the convex hull peeling process allows us to deduce the density profile of a given sample. This volume functional of random points of the same depth also provides a mapping from a multidimensional quantile to a single value, so that descriptive statistics in the form of one-dimensional curves mapped from multivariate data arise from the convex hull peeling process. Additionally, quantile estimation based on the convex hull peeling process can incorporate streaming data algorithms. This flexibility of data management is extremely helpful, especially when massive multivariate data need to be summarized while the memory space for data processing is limited. Sequential quantile estimates based on streaming algorithms and regular quantile estimates do not show much discrepancy. Therefore, we aim to develop space efficient multivariate descriptive statistics based on an empirical convex hull peeling process.
In conclusion, the convex hull peeling process can easily be turned into various empirical descriptive statistics that serve as nonparametric tools for massive multivariate data analysis.
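To illustrate how directly the counting-based definition translates into computation, the following is a minimal sketch of assigning convex hull peeling depths to a bivariate sample. It assumes the scipy.spatial.ConvexHull routine and a simple layer-index convention (outermost hull = depth 1, deeper layers = larger depth); degenerate final layers are handled crudely, so this is an illustration rather than the algorithms developed in Chapter 5.

\begin{verbatim}
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_peeling_depth(points):
    """Assign each point the index of the convex hull layer it lies on."""
    pts = np.asarray(points, dtype=float)
    depth = np.zeros(len(pts), dtype=int)
    remaining = np.arange(len(pts))
    layer = 0
    while len(remaining) > pts.shape[1]:        # need more than d points for a hull
        layer += 1
        hull = ConvexHull(pts[remaining])
        on_hull = remaining[hull.vertices]      # extreme points of the current layer
        depth[on_hull] = layer
        remaining = np.setdiff1d(remaining, on_hull)
    depth[remaining] = layer + 1                # leftover interior points
    return depth

rng = np.random.default_rng(1)
sample = rng.standard_normal((1000, 2))
d = convex_hull_peeling_depth(sample)
print("number of peeling layers:", d.max())
\end{verbatim}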

1.3 Synopsis

The remainder of this dissertation is structured as follows. In Chapter 2, literature reviews relevant to the work performed in this dissertation are presented, including the Kullback-Leibler distance, the Akaike information criterion and its variations, the Bayes information criterion, model selection with data resampling methods, convex hull algorithms, statistical depths and their characteristics, multivariate descriptive statistics (median, scatter, skewness, and kurtosis), outliers, and massive streaming data. Chapter 3 discusses model selection with the jackknife method based on the maximum likelihood principle. We prove the consistency of the jackknife maximum likelihood estimator and the asymptotic unbiasedness of the jackknife likelihood. Chapter 4 addresses the convex hull peeling depth in various aspects, beginning with its definition and proposing associated tools for multivariate nonparametric descriptive statistics.

Chapter 5 introduces various methods illustrating applications of the convex hull peeling process, with further emphasis on actual applications to massive multivariate data from an astronomical survey. Finally, the last chapter provides a summary of our studies with concluding remarks, together with suggestions for additional and future work related to this dissertation.

Chapter 2

Literature Reviews

2.1 Introduction

This chapter presents highlights of some of the research that has guided the studies performed in this thesis. It is divided into two parts: parametric model selection with a resampling method, and nonparametric multivariate analysis with computational algorithms. First, some well known information criteria derived from the maximum likelihood principle are discussed, in addition to model selection criteria based on the bootstrap and cross-validation. Then, reviews of nonparametric multivariate analysis follow. Since our primary concern is convex hull peeling for analyzing multivariate data, we present additional discussions on convex sets and convex hull algorithms. We also explain data depth, which allows the ordering of multivariate data. The possibility of ordering multivariate data leads to nonparametric multivariate location, quantile, and distributional shape estimation, as well as outlier detection. Last, owing to our interest in massive data and quick analysis tools for them, tools for analyzing streaming data are addressed.

2.2 Model Selection Under Maximum Likelihood Principle

Model selection is a very broad subject in statistics, ranging from variable selection to functional space classification. Model selection problems are also found in disciplines other than statistics and have received attention because of their broad applications. Here, we confine our discussion of the statistical model selection framework to the maximum likelihood principle, which is connected to Boltzmann's maximum entropy (1877[32]) and Shannon's information theory (1948[201]). These approaches from physics and information theory are related to the Kullback-Leibler distance, which will be discussed later. Diverse discussions on model selection from a statistical point of view are provided in a volume edited by Lahiri (2001[120]) and in works by Kuha (2004[118]), Linhart and Zucchini (1986[126]), McQuarrie and Tsai (1998[145]), and Burnham and Anderson (2002[39]).

2.2.1 Kullback-Leibler Distance

The Kullback-Leibler (KL) distance (Kullback and Leibler, 1951[119]), also called relative entropy (Cover and Thomas, 1991[55]), is a measure of the distance between two distributions. Consider samples X1, X2, ..., Xn drawn from an unknown distribution G whose density function is denoted by g, and let F be the distribution of interest, with density function f. Then the KL distance is

$$D_{KL}[g; f] = \int g(x)\log g(x)\,dx - \int g(x)\log f(x)\,dx = E_g\left[\log \frac{g(x)}{f(x)}\right], \qquad (2.1)$$

the discrepancy of assuming F when the true distribution is G. The objective of employing the KL distance in the model selection problem is to find a model of minimum $D_{KL}$. Note that the first term on the right-hand side of eq. (2.1) is a constant. Searching for the f that maximizes the second term,

$$L_G = \int \log f(x)\,dG(x), \qquad (2.2)$$

plays a key role in model selection. Among the many model selection criteria reflecting KL distance minimization, the Akaike information criterion (AIC, Akaike, 1973[7]), minimum description length (MDL, Rissanen, 1978[176]), and the Bayesian information criterion (BIC, Schwarz, 1978[192]) are the most popular criteria for choosing the best approximating model. AIC results from approximating the bias of maximum likelihood estimation, MDL from minimizing the description length of binary coding models, and BIC from maximizing the posterior distribution.
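As a small numerical illustration of eqs. (2.1) and (2.2), the sketch below estimates the KL distance by Monte Carlo for an assumed pair of normal densities (the particular densities are hypothetical and only serve as an example). It also evaluates the term corresponding to eq. (2.2), the only part of (2.1) that varies with the candidate f.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of D_KL[g; f] = E_g[log g(X) - log f(X)],
# with g = N(0, 1) playing the "true" model and f = N(0.5, 1.5^2) a candidate.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)          # draws from g

log_g = norm.logpdf(x, 0.0, 1.0)
log_f = norm.logpdf(x, 0.5, 1.5)

d_kl = np.mean(log_g - log_f)                   # eq. (2.1), estimated
expected_ll = np.mean(log_f)                    # eq. (2.2), varies with f

# Closed form for the KL distance between two normals, for comparison.
mu0, s0, mu1, s1 = 0.0, 1.0, 0.5, 1.5
d_exact = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5
print(d_kl, d_exact, expected_ll)
\end{verbatim}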

2.2.2 Preliminaries

In practice, model selection is performed on sets of candidate models whose functional characteristics are known. In other words, we can adopt candidate models with known properties for statistical model selection. Such candidate models are subject to regularity conditions so that we can develop model selection methods. Prior to addressing model selection information criteria, we state some regularity conditions.

Consider an iid sample $X_1, \ldots, X_n$ from an unknown distribution $G$. Let $\{f_\theta : \theta \in \Theta^p \subset \mathbb{R}^p\}$ be a family of distributions. Denote the log likelihood of $X_i$ by $l_i(\theta) = \log f_\theta(X_i)$ for $i = 1, \ldots, n$, and let $L(\theta) = \sum_{i=1}^n l_i(\theta)$.

Assumptions:

(A1) The parameter space $\Theta^p$ is a $p$-dimensional space, $\Theta^p \subset \mathbb{R}^p$, such that $l_i(\theta)$ is distinct for each $\theta \in \Theta^p$; i.e., if $\theta_1 \neq \theta_2$, then $l_i(\theta_1) \neq l_i(\theta_2)$.

(A2) There exists $\theta_o$ in the interior of $\Theta^p$ which is the unique solution to $E_g[\frac{\partial}{\partial\theta} L(\theta)] = 0$, such that $\sup_{\theta \in \Theta^p : \|\theta - \theta_o\| > \epsilon} (L(\theta) - L(\theta_o))$ diverges to $-\infty$ a.s. for any $\epsilon > 0$.

(A3) Within a neighborhood of $\theta_o$, the first and second derivatives of $l_i(\theta)$ with respect to $\theta$, $\frac{\partial}{\partial\theta} l_i(\theta)$ and $\frac{\partial^2}{\partial\theta\,\partial\theta^T} l_i(\theta)$, are well defined and bounded; i.e., $|\frac{\partial}{\partial\theta_l} l_i(\theta)| < h(x)$ and $|\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} l_i(\theta)| < h(x)$ for all $x$, where $l, m = 1, \ldots, p$. Here, $\frac{\partial}{\partial\theta_l} l_i(\theta)$ is the $l$th component of the gradient vector and $-\frac{\partial^2}{\partial\theta_l\,\partial\theta_m} l_i(\theta)$ is the $(l, m)$th component of the Hessian matrix.

(A4) For $\psi_\theta(x) = \frac{\partial}{\partial\theta} l_i(\theta)$ or $\psi_\theta(x) = \frac{\partial}{\partial\theta^T} l_i(\theta)$,
$$\frac{\partial}{\partial\theta} \int \psi_\theta(x)\,dx = \int \frac{\partial}{\partial\theta} \psi_\theta(x)\,dx.$$

(A5) For $\theta \in \Theta^p$, $-E_g[\frac{\partial^2}{\partial\theta\,\partial\theta^T} l_i(\theta)]$ and $E_g[\frac{\partial}{\partial\theta} l_i(\theta)\,\frac{\partial}{\partial\theta^T} l_i(\theta)]$ are positive definite matrices.

In addition, the following lemma will be used quite often in this study.

Lemma 2.1 (Taylor Expansion). For any function $h(\theta, \cdot)$ that is twice differentiable in a neighborhood of $\theta_o$,
$$h(\theta, \cdot) = h(\theta_o, \cdot) + \frac{\partial}{\partial\theta} h(\theta_o, \cdot)(\theta - \theta_o) + \frac{1}{2}(\theta - \theta_o)^T \frac{\partial^2}{\partial\theta\,\partial\theta^T} h(\eta, \cdot)(\theta - \theta_o),$$
where $\eta$ satisfies $\eta = (1 - r)\theta + r\theta_o$ for some $0 < r < 1$.

2.2.3 Akaike Information Criterion

Akaike (1973[7], 1974[8]) first introduced a model selection criterion that chooses the model of smallest KL distance. His criterion is based on estimating eq. (2.2) and choosing the model with the maximum estimate of eq. (2.2). This expected log likelihood, however, depends on the unknown true distribution $G$. The candidate parametric model has an unknown parameter $\theta$ that is replaced by some estimate $\hat\theta$. Among many parameter estimation methods, Akaike used the maximum likelihood estimator $\hat\theta_{ML}$, which maximizes the log likelihood. Note that this maximum likelihood estimation enforces the additional condition that the true model $g$ belongs to the set of candidate models, so that $\hat\theta_{ML}$ is a consistent estimator of $\theta_o$. Still, we need to estimate $G$ to obtain $L_G$ in (2.2). We estimate $L_G$ empirically through the empirical distribution $\hat G$ of $G$.

$$L_{\hat G}(\hat\theta_{ML}) = \int \log f_{\hat\theta_{ML}}(x)\,d\hat G(x) = \frac{1}{n}\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i). \qquad (2.3)$$

The empirical log likelihood usually overestimates the expected log likelihood

(Konishi and Kitagawa, 1996[116]). This excess is the bias,

$$b_G = E_g\left[L_{\hat G}(\hat\theta_{ML}) - L_G(\hat\theta_{ML})\right] = E_g\left[\frac{1}{n}\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i) - \int \log f_{\hat\theta_{ML}}(x)\,g(x)\,dx\right]. \qquad (2.4)$$

The bias-corrected log likelihood becomes

$$\frac{1}{n}\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i) - \hat b_G, \qquad (2.5)$$

where $\hat b_G$ is an estimate of the bias $b_G$. In general, a model that minimizes the Kullback-Leibler distance (2.1) or maximizes the bias-corrected log likelihood (2.5) (Konishi, 1999[115]) is chosen as the best approximating model. The Akaike type information criterion (IC) for a sample of size n is defined as follows:

$$IC(X_n; \hat\theta_{ML}) = -2n\left\{\frac{1}{n}\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i) - \hat b_G\right\} = -2\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i) + 2n\hat b_G, \qquad (2.6)$$

where $\hat b_G$ and $\hat\theta_{ML}$ are estimators of $b_G$ and $\theta_o$, respectively, and various estimators exist under various conditions (e.g., Akaike, 1973; Hurvich and Tsai, 1989[104]; Konishi and Kitagawa, 2003[114]; Konishi and Kitagawa, 1996[116]). Akaike chose $\hat\theta_{ML}$ to replace $\theta_o$ and then estimated the bias as follows. Under assumptions (A1)-(A5), the bias in eq. (2.6) can be estimated along with the model parameters. From eq. (2.4),

$$b_G = E_g\left[L_{\hat G}(\hat\theta_{ML}) - L_G(\hat\theta_{ML})\right] = E_g\left[\frac{1}{n}\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i) - \int \log f_{\hat\theta_{ML}}(x)\,g(x)\,dx\right] = \frac{1}{n}\,\mathrm{tr}[J_g^{-1} I_g] + O(n^{-2}),$$

where Jg and Ig are p × p matrices defined by

$$J_g = -E_g\left[\frac{\partial^2 \log f_\theta(X_i)}{\partial\theta\,\partial\theta^T}\right]_{\theta=\hat\theta_{ML}} \quad\text{and}\quad I_g = E_g\left[\frac{\partial \log f_\theta(X_i)}{\partial\theta}\,\frac{\partial \log f_\theta(X_i)}{\partial\theta^T}\right]_{\theta=\hat\theta_{ML}}$$

(Konishi, 1999[115]). The information criterion defined in eq. (2.6) becomes

$$TIC = -2\sum_{i=1}^n \log f_{\hat\theta_{ML}}(X_i) + 2\,\mathrm{tr}[\hat J^{-1}\hat I] \qquad (2.7)$$

by replacing $J_g$ and $I_g$ with their estimators, $\hat J$ and $\hat I$. This information criterion (TIC) was addressed in Takeuchi (1976[218]), which was written in Japanese and to which Kitagawa (1997[111]) and Shibata (1989[208]) referred. If the parametric family of distributions contains the true distribution, that is, $G = F_{\theta_o}$ (i.e., $g = f_{\theta_o}$) for some $\theta_o$ in $\Theta^p$, then $\mathrm{tr}[J(F_{\theta_o})^{-1} I(F_{\theta_o})] = p$, since

$$\int \frac{\partial \log f_\theta(x)}{\partial\theta}\,\frac{\partial \log f_\theta(x)}{\partial\theta^T}\, f_\theta(x)\,dx = -\int \frac{\partial^2 \log f_\theta(x)}{\partial\theta\,\partial\theta^T}\, f_\theta(x)\,dx.$$

Thus the KL distance based information criterion simplifies to AIC,

$$AIC = -2\sum_{i=1}^n \log f(X_i \mid \hat\theta_{ML}) + 2p = -2L(\hat\theta_{ML}) + 2p. \qquad (2.8)$$

AIC is well known for its simplicity. Under the regularity conditions (A1)-(A5), AIC is defined as an estimate of minus twice the expected log likelihood of a specific model whose distribution family contains the true model. By the maximum likelihood principle, AIC takes a very simple form based on the sample log likelihood and the number of parameters. Nonetheless, drawbacks of AIC such as overfitting and inconsistency have been reported (Hurvich and Tsai, 1989[104]; McQuarrie et al., 1997[146]), and quite a few modifications have been developed under certain circumstances. Some of them are discussed below.
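Before turning to these modifications, the following is a minimal numerical sketch of eq. (2.8), comparing two hypothetical Gaussian candidates fitted by maximum likelihood; the data and candidate models are illustrative only.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def aic(loglik, k):
    """AIC = -2 * maximized log likelihood + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * k

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=500)

# Candidate 1: N(mu, sigma^2) with both parameters fitted by maximum likelihood.
mu, sigma = x.mean(), x.std()                   # MLE of sigma uses the 1/n form
ll_full = norm.logpdf(x, mu, sigma).sum()

# Candidate 2: N(0, sigma^2) with the mean fixed at zero.
sigma0 = np.sqrt(np.mean(x**2))
ll_zero = norm.logpdf(x, 0.0, sigma0).sum()

print("AIC, full model:     ", aic(ll_full, k=2))
print("AIC, zero-mean model:", aic(ll_zero, k=1))
\end{verbatim}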

Consider a regression model

$$y = X\beta + \epsilon, \qquad (2.9)$$

where $X$ is an $n \times p$ design matrix and $\beta$ is a $p \times 1$ vector of unknown parameters,

$\beta = (\beta_1, \ldots, \beta_p)$. Here $p$ and $n$ indicate the number of predictor variables and the sample size, respectively. The errors in the model, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$, are iid random variables such that $\epsilon \sim N(0, \sigma^2 I)$. The log likelihood of this model is

$$L(\beta, \sigma^2) = -\frac{1}{2}\, n\log\sigma^2 - \frac{(y - X\beta)^T(y - X\beta)}{2\sigma^2}.$$

The maximum likelihood estimates are

$$\hat\beta = (X^T X)^{-1} X^T y \quad\text{and}\quad \hat\sigma^2 = (y - X\hat\beta)^T(y - X\hat\beta)/n.$$

Suppose that the true model of (2.9) is

$$y = X_o\beta_o + \epsilon_o. \qquad (2.10)$$

In eq. (2.10), $X_o$ is an $n \times p_o$ design matrix, $\beta_o = (\beta_{1,o}, \ldots, \beta_{p_o,o})$ has $p_o$ parameters, and $\epsilon_o = (\epsilon_{o1}, \ldots, \epsilon_{on})^T$ is a vector of iid normal random variables with mean zero and covariance $\sigma_o^2 I$. Let the parameter vectors be $\theta = (\beta, \sigma)$ and $\theta_o = (\beta_o, \sigma_o)$. The measure of eq. (2.1) between the true model (2.10) and a candidate model (2.9) is given by

$$\begin{aligned} D_{KL}[\theta_o; \theta] &= 2E_o[L(\beta_o, \sigma_o^2) - L(\beta, \sigma^2)] \\ &= n(\log\sigma^2 - \log\sigma_o^2) + n\sigma_o^2/\sigma^2 + (X_o\beta_o - X\beta)^T(X_o\beta_o - X\beta)/\sigma^2 - n \\ &= n\{\sigma_o^2/\sigma^2 - \log(\sigma_o^2/\sigma^2) - 1\} + \|X_o\beta_o - X\beta\|^2/\sigma^2. \end{aligned} \qquad (2.11)$$

The property of an approximating model that minimizes $E_o[L(\beta_o, \sigma_o^2) - L(\beta, \sigma^2)]$ does not change when this quantity is multiplied by the positive constant 2 to define an information criterion.

Hurvich and Tsai (1989[104]) derived the corrected AIC for the above regression model (2.9), particularly when the true model is nested in the fitted models. A model is nested when it is a subset of another model. For the above model, nesting is equivalent to $\{\beta_o\} \subset \{\beta\}$, where $\{\beta_o\}$ is the parameter space of $\beta_o$ and $\{\beta\}$ is the parameter space of $\beta$. The model selection criterion of eq. (2.11) then amounts to minimizing the following criterion.

$$\begin{aligned} IC &= E_o\left[n\log\frac{\hat\sigma^2}{\sigma_o^2} + \frac{n\sigma_o^2}{\hat\sigma^2} + \frac{\|X_o\beta_o - X\beta\|^2}{\hat\sigma^2}\right] \\ &= E_o[n\log\hat\sigma^2] + \frac{n^2}{n - p - 2} + \frac{np}{n - p - 2} \\ &= E_o[n\log\hat\sigma^2] + 2\,\frac{n(p+1)}{n - p - 2} + 1. \end{aligned} \qquad (2.12)$$

Here, the fact that IC takes a simpler form than $D_{KL}[\theta_o; \theta]$ in (2.11) is attributed to $n\log\sigma_o^2$ and $n$ being treated as constants; the constant terms are omitted intentionally. Therefore, the corrected AIC (AICc) is

$$AICc = n\log\hat\sigma^2 + \frac{2n(p+1)}{n - p - 2}. \qquad (2.13)$$

The last term in (2.13) serves as a penalty for model selection. Hurvich and Tsai (1989) have shown that AICc works better than AIC in small samples and is an unbiased estimator of the Kullback-Leibler information. Their AICc nonetheless overfits as the sample size increases, since AIC and AICc have asymptotically equivalent penalty terms. It is also known that AICc is valid only when the candidate models are nested (Murata et al., 1994[150]). Cavanaugh (1997[45]) presented a derivation that unifies AIC and AICc under the regression framework. McQuarrie et al. (1997) proposed AICu as an approximately unbiased estimator of the Kullback-Leibler information to correct the overfitting tendency of AICc. The term $n\log\hat\sigma^2$ in AICc (2.13) is replaced by $n\log\tilde\sigma^2$ to obtain an improved alternative model selection criterion,

$$AICu = n\log\tilde\sigma^2 + \frac{2n(p+1)}{n - p - 2},$$

where $\tilde\sigma^2 = n\hat\sigma^2/(n - p - 1)$. In practice, $\tilde\sigma^2$ is replaced by $s^2 = (n - p - 1)\tilde\sigma^2/(n - p)$,

which is an unbiased estimator of $\sigma_o^2$. This modified model selection criterion becomes

$$AICu = n\log s^2 + \frac{2n(p+1)}{n - p - 2}. \qquad (2.14)$$

Various AIC type information criteria have been developed since, although their application is limited to model selection among linear models or similar model families. These candidate models are nested within one another and contain the true model, a fact that limits their applicability, because in practice the true model may be unknown and the candidate models may not be nested.
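A minimal sketch of the AICc form in eq. (2.13) for a Gaussian linear regression follows; the design matrices and data are hypothetical, and p is taken to be the number of columns of the design matrix.

\begin{verbatim}
import numpy as np

def aicc_regression(y, X):
    """AICc of a Gaussian linear model in the n*log(sigma^2) form of eq. (2.13)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS = MLE of beta
    resid = y - X @ beta
    sigma2 = resid @ resid / n                     # MLE of the error variance
    return n * np.log(sigma2) + 2 * n * (p + 1) / (n - p - 2)

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

X1 = np.column_stack([np.ones(n), x])              # correctly specified model
X2 = np.column_stack([np.ones(n), x, x**2, x**3])  # over-parameterized model
print(aicc_regression(y, X1), aicc_regression(y, X2))
\end{verbatim}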

2.2.4 Bayes Information Criterion

In Bayesian statistics, a model is chosen when it has the maximum posterior probability compared to other models. On the other hand, if choosing a single model is not the goal, Bayesian model averaging serves as a model selection method in Bayesian statistics (Kass and Wasserman, 1995[110]; Wasserman, 2000[229]). Compared to model averaging or Bayes factors, the Bayesian information criterion (BIC) did not come from a standard Bayesian methodology, but it carries its name because it chooses the model of maximum posterior probability. The Bayesian information criterion (BIC) was initiated from a very different point of view than AIC. As Reschenhofer (1996[172]) pointed out, AIC pursues an optimal criterion for prediction, while BIC is designed to find the model with the highest probability. Schwarz (1978[192]) first proposed BIC by approaching model selection from the Bayesian perspective. The criterion chooses the model with the largest posterior probability under certain assumptions. One of the assumptions is that the data were generated from an exponential family of distributions. Equal priors are additionally assumed, so that the criterion has a very simple form.

Suppose that we observe data D and have candidate models Mk with parameter

θk. Then the posterior model probability is

$$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k, \qquad (2.15)$$

where Schwarz (1978) assumed the Koopman-Darmois family for the likelihood $p(D \mid \theta_k, M_k)$ and equal values for $p(\theta_k \mid M_k)$. When we take the

Laplace approximation of eq. (2.15) around $\hat\theta_{ML}$, the maximum likelihood estimator of the parameter vector in the model $M_k$, and drop terms that are constant with respect to the given model $M_k$ (e.g., the number of parameters), we obtain BIC as follows:

$$BIC = -2L(\hat\theta_{ML}) + p_k\log n, \qquad (2.16)$$

where $p_k$ is the number of parameters in $M_k$ and $n$ is the sample size. The log likelihood $L(\hat\theta_{ML})$ is the posterior log likelihood of the given data. Detailed discussions of the BIC derivation are found in part in Cavanaugh and Neath (1999b[44]), Neath and Cavanaugh (1997[153]), and Kuha (2004[118]). Zhang (1993[237]) showed that BIC has a better convergence rate than AIC for linear regression models with normally distributed errors. Additionally, he argued that BIC has a nearly optimal convergence rate. Because of this strong convergence in choosing a correct model, BIC is preferred to AIC; however, this strong convergence holds only when the models are linear and the error distribution is normal. Despite the popularity due to its parsimony and consistency in model selection, McQuarrie (1999[143]) reported that BIC has a tendency to overfit for small samples. As with the small sample correction of AIC (Hurvich and Tsai, 1989[104]), a similar small sample correction has been attempted for BIC. McQuarrie (1999[143]) proposed a signal-to-noise corrected BIC (BICc) with a stronger small sample penalty term. The proposed BICc is

$$BICc = n\log\hat\sigma^2 + \frac{np\log n}{n - p - 2},$$

where $n$ is the sample size, $p$ is the number of parameters, and $\hat\sigma^2$ is the maximum likelihood estimator of the variance of the error distribution. Needless to say, this criterion was developed under the linear regression framework with iid normal errors. Asymptotically, BICc is equivalent to BIC but is more consistent for small samples.
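The sketch below compares the AIC and BIC penalties of eqs. (2.8) and (2.16) on a hypothetical Gaussian example; it only illustrates how the log n penalty behaves and does not reproduce any result in this chapter.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """BIC = -2 * maximized log likelihood + k * log(n), eq. (2.16)."""
    return -2.0 * loglik + k * np.log(n)

rng = np.random.default_rng(4)
n = 2000
x = rng.normal(0.0, 1.0, size=n)                   # the true mean is exactly zero

# Zero-mean model (1 parameter) versus free-mean model (2 parameters).
ll_zero = norm.logpdf(x, 0.0, np.sqrt(np.mean(x**2))).sum()
ll_free = norm.logpdf(x, x.mean(), x.std()).sum()

# BIC's log(n) penalty favors the smaller (correct) model more strongly than AIC.
print("AIC:", aic(ll_zero, 1), aic(ll_free, 2))
print("BIC:", bic(ll_zero, 1, n), bic(ll_free, 2, n))
\end{verbatim}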

2.2.5 Minimum Description Length

Another information theoretic approach to model selection is minimum description length (MDL). MDL was first proposed by Rissanen (1978[176], 1986[174]) to obtain an optimal coding algorithm. MDL incorporates principles very similar to those of AIC type model selection criteria. Mainly, MDL finds a model that minimizes the description length of codewords through the maximum likelihood principle (Rissanen, 1983[175]). It measures the log likelihood and the stochastic complexity of a model. When the source distribution is known, Shannon's coding theorem dictates that we use the source distribution for the optimal code, and its code length is the negative logarithm of the data probability. When the source distribution is unknown, the search for a universal modelling procedure is the search for a universal coding scheme.

Let $F_\theta$ be a candidate distribution function with parameter $\theta$ and $G$ be the true distribution. Since MDL was developed from coding theory, $F_\theta$ and $G$ are generally taken to be discrete distributions, especially binary models. In coding theorems, the length of a codeword $x$ is written as $l(x) = -\log G(x)$, where log denotes the base-two logarithm (the base two comes from binary coding; the base can be another integer). Yet the true codeword generating model is hidden from signal decoders. One wishes to obtain a criterion that gives a minimal length discrepancy from the true distribution by estimating a candidate model $F_\theta(x)$. The Shannon source coding theorem (Shannon, 1948[201]) shows that the expected code length is bounded below, which can be proved easily with the Kraft inequality,

$$\sum_{\forall x} 2^{-l(x)} \le 1, \qquad (2.17)$$

where $\forall x$ indicates the set of all codewords and $l(x)$ is the length of each codeword. This theorem illustrates that no coding algorithm is able to achieve a shorter expected codeword length than the entropy, $H(G) = E_G[-\log G(X)]$ (Cover and Thomas, 1990[55]). Hence, the aim of MDL is to reduce the distance between

$-\log F_\theta(x)$ and $-\log G(x)$ (Barron et al., 1998[20]). Furthermore, it is important to notice that the expected distance between the estimated code length and the

true code length, $\sum_x G(x)[\log G(x) - \log F_\theta(x)]$, is the same as the Kullback-Leibler distance $I(G; F_\theta)$ or the relative entropy $D(G\|F_\theta)$ (Cover and Thomas, 1991[55]). Akaike type information criteria and MDL share the commonality of minimizing the Kullback-Leibler distance, or relative entropy. Rissanen (1978[176], 1983[175]) developed the criterion to find a model that produces the shortest code length. The total code length is

$$\mathrm{Length}(x, \theta) = -\log F_\theta(x) = -\log F(x \mid \theta) + \mathrm{Length}(\theta), \qquad (2.18)$$

where $F(x \mid \theta)$ is the likelihood of a given model with parameter $\theta$ and $\mathrm{Length}(\theta)$ is the total number of bits needed to encode the information of the given model. This total, $\mathrm{Length}(\theta)$, also represents the stochastic complexity of the model. Then eq. (2.18) simplifies to

$$\mathrm{Length}(x, \hat\theta_{ML}) = -\log F_{\hat\theta_{ML}}(x) + \frac{1}{2}\log|H(\hat\theta_{ML})| + C_n, \qquad (2.19)$$

where $\hat\theta_{ML}$ is the maximum likelihood estimator of $\theta$, $H(\hat\theta_{ML})$ is the Hessian matrix (minus the second derivative of the log likelihood), and $C_n$ is a constant with respect to $\theta$. The second term, involving the Hessian matrix, is interpreted as a complexity measure. It is known that $n^{-p}|H(\hat\theta)| \to \alpha$ as $n \to \infty$, where $\alpha$ is a constant and $p$ denotes the dimension of $\theta$. As the sample size $n$ increases, the sensitivity to the functional form of candidate models gradually subsides (Myung, 2000[151]; Qian and Künsch, 1998[166]). Hence, MDL is sensitive to the number of parameters in a model as well as to the model complexity, whose value depends upon the functional form of the model. As a result, MDL becomes almost equivalent to BIC,

$$MDL = -\log F_{\hat\theta_{ML}}(x) + \frac{1}{2}\, p\log n. \qquad (2.20)$$

Further rigorous studies of MDL are found in Barron et al. (1998[20]), Clarke and Barron (1990[54]), Hansen and Yu (2001[84]), Qian and Künsch (1998[166]), Rissanen (1986[174], 1987[173]), and Yang and Barron (1998[234]).
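As a small numerical check of the coding-theoretic quantities above, the sketch below verifies the Kraft inequality (2.17) for an assumed binary prefix code and confirms that Shannon code lengths -log2 p(x) attain the entropy bound on average; both the code and the source distribution are hypothetical.

\begin{verbatim}
import numpy as np

# Codeword lengths of a toy binary prefix code: {0, 10, 110, 111}.
lengths = np.array([1, 2, 3, 3])
print("Kraft sum:", np.sum(2.0 ** -lengths))       # eq. (2.17): must be <= 1

# Shannon code lengths -log2 p(x) achieve the entropy bound on average.
p = np.array([0.5, 0.25, 0.125, 0.125])
shannon_lengths = -np.log2(p)
entropy = -(p * np.log2(p)).sum()
print("expected length:", (p * shannon_lengths).sum(), "entropy:", entropy)
\end{verbatim}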

2.2.6 Resampling in Model Selection

Alternatively, data resampling methods have been adopted in model selection studies to estimate the KL distance. We leave the jackknife method to the next chapter. In summary, the cross-validation method has been proved to be asymptotically equivalent to AIC (Stone, 1977[215]; Amari et al., 1997[11]), and the bootstrap methods are neither efficient nor successful in terms of consistency of model selection (Shao, 1996[202]; Chung et al., 1997[53]; Ishiguro et al., 1997[107]; Shibata, 1997[207]). To show the asymptotic equivalence between AIC and cross-validation, we replace assumption (A2) by (A2'). We follow the notation of the Assumptions

and denote the log likelihood of all observations by $L(\theta) = \sum_{i=1}^n l_i(\theta)$ and the log likelihood without the $i$th observation by $L_{-i}(\theta) = \sum_{j \neq i} l_j(\theta)$, where $l_i(\theta) = \log f_\theta(X_i)$.

Assumption:

(A2') For $\hat\theta$ and $\hat\theta_{-i}$ such that $\frac{\partial}{\partial\theta} L(\hat\theta) = \frac{\partial}{\partial\theta} L_{-i}(\hat\theta_{-i}) = 0$,

$$\|\hat\theta - \theta_o\| \xrightarrow{p} 0 \quad\text{and}\quad \max_{1\le i\le n} \|\hat\theta_{-i} - \theta_o\| \xrightarrow{p} 0,$$

if a unique $\theta_o$ satisfies $E_g[\frac{\partial}{\partial\theta} l_i(\theta_o)] = 0$.

Then the cross-validation information criterion with one validation sample (CV1) is defined as

$$CV1 = -2\sum_{i=1}^n l_i(\hat\theta_{-i}). \qquad (2.21)$$

Like AIC, CV1 measures the log likelihood under the maximum likelihood principle. Under conditions (A1), (A2'), (A3)-(A5), the asymptotic equivalence between CV1 and AIC is proved by the following.

From the definition of θˆ and θˆ−i, we have

$$\begin{aligned} \frac{\partial}{\partial\theta} l_i(\hat\theta_{-i}) &= \frac{\partial}{\partial\theta} l_i(\hat\theta) + \frac{\partial^2}{\partial\theta\,\partial\theta^T} l_i(\hat\theta)(\hat\theta_{-i} - \hat\theta)(1 + o_p(1)), \\ \frac{\partial}{\partial\theta} l_i(\hat\theta_{-i}) &= \frac{\partial}{\partial\theta} L(\hat\theta_{-i}) \\ &= \frac{\partial}{\partial\theta} L(\hat\theta) + \left[\frac{\partial^2}{\partial\theta\,\partial\theta^T} L(\hat\theta)\right](\hat\theta_{-i} - \hat\theta)(1 + o_p(1)) \\ &= -\hat J(\hat\theta_{-i} - \hat\theta)(1 + o_p(1)). \end{aligned} \qquad (2.22)$$

In eq. (2.22), $\hat J$ refers to the Hessian matrix $-\frac{\partial^2}{\partial\theta\,\partial\theta^T} L(\hat\theta)$. The above results are substituted into the following:

$$\begin{aligned} l_i(\hat\theta_{-i}) &= l_i(\hat\theta) + (\hat\theta_{-i} - \hat\theta)^T \frac{\partial}{\partial\theta} l_i(\hat\theta) + (\hat\theta_{-i} - \hat\theta)^T \frac{\partial^2}{\partial\theta\,\partial\theta^T} l_i(\hat\theta)(\hat\theta_{-i} - \hat\theta)(1 + o_p(1)) \\ &= l_i(\hat\theta) - \left[\frac{\partial}{\partial\theta^T} l_i(\hat\theta_{-i})\, \hat J^{-1}\, \frac{\partial}{\partial\theta} l_i(\hat\theta)\right](1 + o_p(1)) \\ &= l_i(\hat\theta) - \left[\frac{\partial}{\partial\theta^T} l_i(\hat\theta)\, \hat J^{-1}\, \frac{\partial}{\partial\theta} l_i(\hat\theta)\right](1 + o_p(1)) \\ &= l_i(\hat\theta) - \mathrm{tr}\left[\frac{\partial}{\partial\theta} l_i(\hat\theta)\, \frac{\partial}{\partial\theta^T} l_i(\hat\theta)\, \hat J^{-1}\right](1 + o_p(1)). \end{aligned}$$

Therefore,

$$CV1 = -2\sum_i l_i(\hat\theta_{-i}) = -2L(\hat\theta) + 2\,\mathrm{tr}(\hat I\hat J^{-1})(1 + o_p(1)), \qquad (2.23)$$

where $\hat I$ denotes $\sum_{i=1}^n \frac{\partial}{\partial\theta} l_i(\hat\theta)\,\frac{\partial}{\partial\theta^T} l_i(\hat\theta)$. If the condition that the true model is contained in the set of candidate models holds, then $\mathrm{tr}(\hat I\hat J^{-1}) = p$, where $p$ is the number of parameters, and (2.23) is asymptotically equivalent to AIC (Stone, 1977[215]). Without knowledge of the true model, (2.23) still shares its asymptotic identity with TIC (Shibata, 1989[208]). Shao (1993[204]) showed that the popular leave-one-out cross-validation method is asymptotically inconsistent in terms of the probability of choosing the best predicting model: the probability does not converge to 1 as the sample size increases. Also, CV1 is conservative enough to pick an unnecessarily large model. Instead of leave-one-out, leave-$n_v$-out cross-validation ($n_v > 1$) with $n_v/n \to 1$ as $n \to \infty$ was suggested to rectify the inconsistency of CV1. In addition to cross-validation, the bootstrap resampling principle has quite often been studied for model selection. Here the focus is on measuring the Kullback-Leibler discrepancy, although the bootstrapping argument can be extended to other types of discrepancy (to name a few of those extensions, see Breiman, 1992[35]; Efron, 1983[67]; Hjorth, 1994[93]; Linhart and Zucchini, 1986[126]; and Shao, 1996[202]). The asymptotic approximation of the bootstrap is too complicated to be evaluated analytically. A few studies showed that bootstrap versions of model selection criteria are functionally equivalent to AIC despite their computational complexity (Chung et al., 1996[53]; Shibata, 1997[207]; Ishiguro, 1997[107]). According to Chung et al. (1996[53]), the bootstrap method has a downward bias which is roughly equivalent to the number of parameters of a candidate model, as in AIC. This statement implies that bootstrap model selection gives a reasonable bias correction. Shibata (1997) confirmed this idea more rigorously through an investigation of several bootstrap type estimates. In addition, the advantage of a bootstrap estimate is that it is free from expansions, while AIC and similar criteria are based on an expansion with respect to the parameters; the calculation of the analytic estimates, the gradient $\hat I$ and the Hessian $\hat J$, often becomes complicated. According to Ishiguro et al. (1997[107]), the advantages of bootstrap model selection are unbiasedness and its independence of models.
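A minimal sketch of CV1 in eq. (2.21) for a hypothetical Gaussian location-scale model follows; for moderate sample sizes its value tracks AIC, in line with the asymptotic equivalence sketched above.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def cv1(x):
    """CV1 = -2 * sum_i log f(X_i | theta_hat_{-i}) for a Gaussian model, eq. (2.21)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        rest = np.delete(x, i)
        mu, sigma = rest.mean(), rest.std()        # MLE without the i-th point
        total += norm.logpdf(x[i], mu, sigma)
    return -2.0 * total

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=300)

ll = norm.logpdf(x, x.mean(), x.std()).sum()
aic = -2.0 * ll + 2 * 2                            # two fitted parameters
print("CV1:", cv1(x), "AIC:", aic)                 # close for moderate n
\end{verbatim}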

2.2.7 Other Criteria and Comments

The model selection criteria discussed so far are the most popular criteria for finding a model that minimizes the KL distance. On the other hand, they have shortcomings since they were developed under specific conditions. There are other notable model selection criteria that choose a model minimizing the KL distance, such as the H-Q criterion (Hannan and Quinn, 1979[83]) and model selection with a data oriented penalty (Rao and Wu, 1989[167]; Bai et al., 1999[16]). The information criterion of Hannan and Quinn (1979) was developed for choosing the correct order of a time series model by adapting the law of the iterated logarithm. Rao and Wu (1989) presented a general form of penalties to obtain a consistent process for model selection in the regression framework. Other directions have also been taken: Konishi and Kitagawa (1996[116]) developed the generalized information criterion (GIC) to loosen the assumptions of popular information criteria like AIC and BIC; Irizarry (2001[106]) and Agostinelli (2002[6]) introduced weighted versions of these popular information criteria; and Cavanaugh (1999a[43]) used the symmetric Kullback-Leibler distance, which makes the resulting information criterion more consistent because the number in the penalty term changes from 2 to 3.

In spite of the numerous model selection methods, as Burnham and Anderson (2002[39]) emphasized, no model selection method is superior to the others. The most important step is understanding the data prior to applying any statistical model selection criterion.

2.3 Convex Hull and Algorithms

Since Rényi and Sulanke (1963[170], 1964[171]) and Efron (1965[68]), convex hulls have received attention for decades. This section provides foundations and basic definitions related to the convex hull and its algorithms for statistical multivariate data analysis, which will be used in Chapters 4 and 5. Frequently cited computational convex hull algorithms are introduced to assist the understanding of the convex hull process.

2.3.1 Definitions

Prior to presenting the definition of the convex hull, we need the definition of a convex set. Wikipedia¹ provides a good summary of convex sets. Here, we give a simple definition.

Definition 2.1 (Convex Set). A set C ⊆ Rd is convex if for every two points x, y ∈ C the whole segment xy is also contained in C; in other words, the point tx + (1 − t)y belongs to C for every t ∈ [0, 1]. (Klee, 1971[112]).

Note that the intersection of an arbitrary family of convex sets is a convex set as well. Extensive studies on convex sets are found in Rockafellar (1970[177]) and Valentine (1976[222]). Here, we give the definition of the convex hull.

Definition 2.2 (Convex Hull). The convex hull of a set of points X in Rd, denoted by conv(X), is the intersection of all convex sets in Rd containing X. In other words, conv(X), the convex hull of X, is the smallest convex domain in Rd containing X. In descriptions of algorithms, a convex hull indicates the points of a shape invariant minimal subset of conv(X), i.e., the extreme points of X. Connecting these points produces a d-dimensional wrap around conv(X).

¹ http://en.wikipedia.org/wiki/Convex_set

There are numerous studies related to the convex hull. Regarding the fundamental and basic notions of the convex hull, readers may refer to the content from Wikipedia². Modern mathematical and probabilistic studies on the convex hull were initiated by Rényi and Sulanke (1963, 1964) and Efron (1965), who derived the expected number of vertices, the areas, and the perimeters of convex hulls formed by samples generated from well known distributions. Their works were based on the relation among vertices, edges, and areas in two and three dimensional geometry discovered by Euler. Afterwards, convex hull studies focused on assessing limiting distributions by estimating the expected number of vertices and volumes, together with the quantities needed for central limit theorems, when samples are generated from convex body distributions. In a series of works by Hueter (1994[102], 1999a[100], 1999b[101]), central limit theorems were derived for the number of vertices, perimeter, and area (Nn, Ln, and An, respectively) of the convex hull of n iid random vectors sampled from spherical distributions including the normal distribution. Her main tools are martingales and a Poisson approximation of the point process of convex hull vertices, in addition to a geometric application of the renewal theorem. Hsing (1994[95]) studied the limiting behavior of convex hull areas from samples of uniform planar distributions. Approximating the process of convex hull vertices by the process of extreme points of a Poisson point process led Groeneboom (1988[80]) to a central limit theorem for the number of convex hull vertices of a uniform sample from the interior of a convex polygon. Cabo and Groeneboom (1994[40]) derived limit theorems for the boundary length and for the area of the convex hull, extending the results of Rényi and Sulanke (1963). A very useful and extensive review of the characteristics of limiting distributions of the convex hull is found in Schneider (1988[194]), despite the years that have passed since its publication. Moreover, results from Massé (1999[141], 2000[140]) provide interesting insights into convex hull studies. Suppose that a sequence of iid random variables in Rd is defined on a probability space (Ω, A, P). Let Cn be the convex hull of {X1, ..., Xn},

let Nn be the number of vertices of Cn, and let Mn be 1 − P(Cn). Note that Nn and Mn are random variables as well. Massé (2000) stated that

1. Most of the properties of Cn, Nn, and Mn depend tightly on P, and

2. Many studies of the expected values of Mn and Nn (E(Mn) and E(Nn)) exist, but their variances are hardly known.

² http://en.wikipedia.org/wiki/Convex_hull

He showed that a traditionally accepted conjecture, that E(Nn) → ∞ implies Nn/E(Nn) → 1 in probability, is wrong. Massé (1999) pointed out that there is no trivial distribution free bound for the variance of Nn. There is a different approach to obtaining the limiting behavior of convex hull functionals by adapting support functions (Eddy and Gale, 1981[62]). Eddy (1980[63]) showed that convex bodies can be embedded via a support function into the space of continuous functions on the unit sphere, so that the distribution of the convex hull can be described by the distribution of a continuous function. Based on the mapping through support functions, Eddy and Gale (1981) characterized the limiting behavior of the support function of the convex hull from samples in Rd. However, Groeneboom (1988) criticized using support functions because characterization by support functions does not give the type of information one needs to develop a distribution theory for the point process of vertices of a convex hull. Some other directions in convex hull studies, deviating from obtaining limiting distributions, have been taken for statistical inference, although such studies are rarely found. Bebbington (1978) utilized the convex hull to trim data for robust correlation estimation. Tukey, in Huber (1972), introduced the convex hull for robust estimation of a multivariate data center. Barnett (1976) introduced the ordering of multivariate data by peeling convex hulls. Chakraborty (2001[47]), Zani et al. (1998[236]), and Miller et al. (2003[147]) discussed the possibility of outlier detection with the convex hull. Before we move to convex hull algorithms, we want to recall the notion of a convex hull in computer science. In geometry, the convex hull is the smallest convex set containing given points. A point x of a convex set C is called an extreme point of C provided that x is not an inner point of any segment in C. In computational science, the collection of these extreme points indicates the convex hull, rather than the whole supporting space. These dual notions of the convex hull are used interchangeably without any prior notification. The two notions of the convex hull are:
• In geometry, the smallest convex set containing given points

• In computer science, extreme points that comprise the boundary of the con- vex hull

Next, we will discuss algorithms for finding convex hulls.

2.3.2 Convex Hull Algorithms

In mathematics and computer science, various algorithms for computing the convex hull of given data have been developed. Books by de Berg et al. (2000[28]), Edelsbrunner (1987[66]), Goodman and O'Rourke (2004[76]), Laszlo (1996[121]), O'Rourke (2000[159]), Preparata and Shamos (1985[163]), and others have illustrated various convex hull algorithms. In this section, we introduce most of the well known convex hull algorithms. Some algorithms are restricted to two dimensional data. Some algorithms come with studies of computational complexity, a critical factor in the analysis of large data sets. Constructing a convex hull involves sorting, which costs at least O(n log n) operations. In two dimensions, it is reported that n log n is a lower bound on the complexity of any algorithm that finds the hull (O'Rourke, 2001[159]). As the data dimension increases, we can expect the computational complexity to grow faster than n log n. If data sets are huge, say tera- or peta-bytes in multivariate form, the sorting process delays any fast computation. This relative slowness in computation is the sacrifice of using nonparametric methods in statistical analysis. The convex hull of a set of points is the smallest convex set that contains all the points. In the plane, we can visualize the convex hull as a stretched rubber band surrounding the points that, when released, takes a polygonal shape. This convex polygon is the convex hull, and the extreme points on the hull are the vertices of the convex polygon. Convex hull algorithms collect the points touching this rubber band, so designing such rubber bands has been the primary interest in developing algorithms. Additional terminology has to be introduced to state convex hull algorithms. A two dimensional and a three dimensional convex hull is called a polygon and a polyhedron, respectively. A higher dimensional convex hull is a polytope, but the visualization of a high dimensional (d > 3) convex hull is very restricted compared to two and three dimensional convex hulls. For description purposes, the (d−1)-dimensional faces of a d-dimensional polytope are called facets, the (d−2)-dimensional faces of a d-dimensional polytope are ridges, and points are called vertices; for example, in the familiar three dimensional tetrahedron (simplex), triangles are facets and edges are ridges, so that in total we have 4 facets, 6 ridges, and 4 vertices. From de Berg et al. (2000[28]) the following relation among vertices, edges, and facets is found:

Let P be a convex polytope with n vertices. The number of edges of P is at most 3n − 6 and the number of facets of P is at most 2n − 4.

This statement only provides the upper bounds, and no studies are found that provide exact numbers and relationships of vertices, ridges, and facets when the data come from higher dimensions (d > 2).

Graham Scan Algorithm

Graham published this algorithm in 1972[77]. The Graham scan algorithm only requires arithmetic operations and comparisons. It is also known that no explicit conversion to polar coordinates (for sorting) is necessary (Bayer, 1999[25]). Eddy (1977a[64], 1977b[65]) showed that this algorithm requires at least O(n log n) operations to determine the vertices due to sorting points. These discoveries are based upon two dimensional data. The algorithm stated in Algorithm 1 is adapted from O'Rourke (2001[159]).

Algorithm 1 Graham scan algorithm
Find the interior point x; label it p0
Sort all other points angularly about x; label them p1, ..., pn−1
Stack S = (p1, p2) = (pt−1, pt); t indexes the top
i ← 3
while i < n do
    if pi is left of (pt−1, pt) then
        Push(S, i) and increment i
    else
        Pop(S)
    end if
end while
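To make the scan concrete, the sketch below is a minimal Python rendering of the Graham scan idea, written for this discussion rather than taken from the cited references. It assumes planar points in general position and anchors the angular sort at the lowest point instead of an arbitrary interior point; the function names are ours.

```python
import numpy as np

def cross(o, a, b):
    # z-component of (a - o) x (b - o); positive for a counterclockwise ("left") turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    """Return the vertices of the 2-d convex hull in counterclockwise order."""
    pts = np.asarray(points, dtype=float)
    # anchor at the lowest point (smallest y, then smallest x)
    anchor = pts[np.lexsort((pts[:, 0], pts[:, 1]))][0]
    # sort the remaining points by polar angle about the anchor
    rest = pts[np.any(pts != anchor, axis=1)]
    angles = np.arctan2(rest[:, 1] - anchor[1], rest[:, 0] - anchor[0])
    rest = rest[np.argsort(angles)]
    stack = [anchor, rest[0]]
    for p in rest[1:]:
        # pop while the turn (stack[-2], stack[-1], p) is not a left turn
        while len(stack) > 1 and cross(stack[-2], stack[-1], p) <= 0:
            stack.pop()
        stack.append(p)
    return np.array(stack)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(graham_scan(rng.standard_normal((200, 2))))
```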

Gift Wrapping Algorithm

Chand and Kapur (1970[49]) introduced this algorithm for two dimensional polygons, but the algorithm was soon extended to higher dimensional data (Laszlo, 1996[121] and O'Rourke, 2001[159]). For two dimensional data, the complexity is O(nh) if the hull has h edges, so the worst case complexity is O(n²). The statement of the gift wrapping algorithm in Algorithm 2 is adapted from O'Rourke (2001), and this algorithm works for both two and three dimensional data.

Algorithm 2 Gift Wrapping Algorithm
Find the lowest point (smallest y-coordinate)
Let i0 be its index, and set i ← i0
repeat
    for each j ≠ i do
        Compute the counterclockwise angle θ from the previous hull edge
    end for
    Let k be the index of the point with the smallest θ
    Output (pi, pk) as a hull edge
    i ← k
until i = i0
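The following Python sketch illustrates the two dimensional wrapping step (equivalently, Jarvis' march, described next). It is a schematic implementation under the assumption of points in general position and is not the code used elsewhere in this thesis.

```python
import numpy as np

def gift_wrap_2d(points):
    """Gift wrapping (Jarvis' march) for 2-d points; O(n h) with h hull vertices."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    start = int(np.argmin(pts[:, 1]))       # the lowest point is certainly on the hull
    hull = [start]
    current = start
    while True:
        candidate = (current + 1) % n
        for j in range(n):
            if j == current:
                continue
            # replace the candidate whenever point j lies to the right of the
            # directed edge current -> candidate (so the final edge keeps all
            # points on its left, giving a counterclockwise hull)
            o, a, b = pts[current], pts[candidate], pts[j]
            turn = (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
            if turn < 0:
                candidate = j
        current = candidate
        if current == start:
            break
        hull.append(current)
    return pts[hull]
```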

Jarvis’ March Algorithm

Jarvis (1972[108]) reported this algorithm, which is the special case of gift wrapping limited to two dimensional data. This algorithm calculates angles from a point and connects it to the next point of smallest angle. The reported running time is O(nN), where N is the number of vertices on the convex hull, so that the worst case running time of the algorithm is O(n²). Eddy (1977a[64]) commented that for some configurations of points in the plane with a small number of hull vertices N, Jarvis' march is faster than the Graham scan algorithm.

Beneath-and-Beyond algorithm

This algorithm was introduced by Chazelle (1985[48]), who claimed it to be optimal. For two dimensions, the space and time complexity is O(n) and O(n log n), respectively. Additionally, the worst case complexity is O(n^⌊(d+1)/2⌋), so that the algorithm is optimal for even dimensions. The name comes from the fact that one point at a time is added to an already constructed convex hull. The algorithm described in Algorithm 3 is adapted from Edelsbrunner (1987).

Algorithm 3 Beneath and beyond algorithm
Sort the n points with respect to their x1-coordinates
Relabel the points so that (p1, p2, ..., pn) is the sorted sequence
Construct P = conv{p1, p2, p3}
[Iteration:] Complete the construction point by point:
for i := 4 to n do
    Add point pi to the current convex hull,
    updating the representation of Pi−1 = conv{p1, p2, ..., pi−1} to Pi = conv{p1, ..., pi−1, pi}
end for

Divide-and-conquer algorithm

This algorithm was first introduced by Preparata and Hong (1977[164]). The divide and conquer algorithm works for data of general dimension. The time complexity for two and three dimensional data is reported as O(n log n) (Edelsbrunner, 1987[66]). The time complexity for 4 dimensional data is known to be O((n + N) log² N) and for 5 dimensional data O((n + N) log³ N), where n is the sample size and N is the number of vertices. However, it is known that the merging process is very difficult when the data dimension is more than two. The divide-and-conquer algorithm for two dimensions can be stated simply as follows.

1. Sort the points by x coordinate

2. Divide the points into two sets A and B, A containing the left upper n/2 points and B the right lower n/2 points

3. Compute the convex hulls A = conv(A) and B = conv(B) recursively.

4. Merge A and B: compute conv(A ∪ B)

The higher dimensional (d > 2) divide-and-conquer algorithm stated in Algorithm 4 is adapted from O'Rourke (2001).

Algorithm 4 Divide and conquer algorithm
a ← rightmost point of A
b ← leftmost point of B
while T = ab is not a lower tangent to both A and B do
    while T is not a lower tangent to A do
        a ← a + 1
    end while
    while T is not a lower tangent to B do
        b ← b + 1
    end while
end while

Incremental Algorithm

This algorithm is described by O'Rourke (2001[159]) and stated in Algorithm 5 for three dimensional samples. The total complexity is O(n²), since the complexity of finding faces and edges is proportional to n and this linear loop of searching faces and edges is embedded inside a loop that iterates n times (Thm 4.1.1, O'Rourke, 2001).

Algorithm 5 Incremental Algorithm

Initialize H4 to the tetrahedron (p0, p1, p2, p3).
for i = 4, ..., n − 1 do
    for each face f of Hi−1 do
        Compute the volume of the tetrahedron determined by f and pi
        Mark f visible iff volume < 0
    end for
    if no faces are visible then
        Discard pi (it is inside Hi−1)
    else
        for each border edge e of Hi−1 do
            Construct the cone face determined by e and pi
        end for
        for each visible face f do
            delete f
        end for
        Update Hi
    end if
end for

Quick Hull Algorithm

The initial development of the quick hull algorithm was by Preparata and Hong (1977) for convex polygons in two dimensional space. Barber et al. (1996) reinvented the quick hull algorithm by combining this two dimensional quick hull algorithm and the general dimension beneath-beyond algorithm. Barber et al. (1996) also provided empirical evidence that their algorithm uses less memory and runs faster. Software for this algorithm is found at http://www.qhull.org. The algorithm stated in Algorithm 6 is adapted from Barber et al. (1996[19]).

Algorithm 6 Quick hull algorithm
create a simplex of d+1 points
for each facet F do
    for each unassigned point q do
        if q is above F then
            assign q to F's outside set
        end if
    end for
end for
for each facet F with a non-empty outside set do
    select the furthest point p of F's outside set
    initialize the visible set V to F
    for all unvisited neighbors N of facets in V do
        if p is above N then
            add N to V
        end if
    end for   {the boundary of V is the set of horizon ridges H}
    for each ridge R in H do
        create a new facet from R and p
        link the new facet to its neighbors
    end for
    for each new facet F′ do
        for each unassigned point q in an outside set of a facet in V do
            if q is above F′ then
                assign q to F′'s outside set
            end if
        end for
    end for
    delete the facets in V
end for

2.3.3 Other Algorithms and Comments

In addition to the algorithms discussed so far, Böhm and Kriegel (2001[31]) presented the depth-first algorithm and the distance priority algorithm for large multidimensional databases. Nevertheless, their algorithms were designed for two dimensional data, and their illustrations did not utilize data sets large enough to count as massive by today's standards. Many algorithms exist, and most have been shown to have lower computational complexity and shorter computation time than previously published algorithms, regardless of data dimension. Some of them are proved to be extremely simple and fast under particular conditions, such as planar data. Even though some inventors of convex hull algorithms argue that their algorithms can be extended to general dimensions, applications of these algorithms to higher dimensional data are hardly found. Hence, choosing an algorithm for a statistical analysis becomes a nontrivial problem. Instead of looking for a fast and efficient algorithm with conditions, we choose one that serves general purposes. The QuickHull algorithm by Barber et al. (1996) will be employed throughout the rest of the study. The fact that numerous algorithms can be traced in the computational geometry literature proves that retrieving the representative points of a convex hull from a given data set is an important problem mathematically and computationally. This overflowing number of algorithms also illuminates the lack of statistical functionals leading to statistical inferences and of universal mathematical justification of convex hulls for quantification. This thesis may contribute to overcoming these shortcomings.
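Since the Qhull library implements the quick hull algorithm of Barber et al. (1996), a convenient way to experiment with it is through the scipy.spatial.ConvexHull wrapper, as in the hedged sketch below. This is merely one possible interface for calling Qhull from Python; it is not necessarily the toolchain used to produce the results in this thesis.

```python
import numpy as np
from scipy.spatial import ConvexHull  # wraps the Qhull library (http://www.qhull.org)

rng = np.random.default_rng(1)
X = rng.standard_normal((10_000, 3))   # a moderately large 3-d sample

hull = ConvexHull(X)
print("number of hull vertices:", len(hull.vertices))
print("number of facets       :", len(hull.simplices))
print("hull volume            :", hull.volume)
print("hull surface area      :", hull.area)

# indices into X of the extreme points (the computer-science notion of the hull)
extreme_points = X[hull.vertices]
```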

2.4 Data Depth

Since Barnett (1976[24]), data depth has emerged as a powerful tool for ordering multivariate data as well as for exploring multivariate data with various applications. A data depth is known to be a device for measuring the centrality of a multivariate data point with respect to a given data cloud and an indicator of how deeply a data point is embedded in the data cloud (Liu and Singh, 1993[128]), so that it characterizes multivariate distributions. Understanding distributions of multivariate data asks for defining an order of points. In contrast to the simplicity of assigning orders to a univariate sample, multivariate data undergo various methods of ordering, and no consensus on multivariate ordering has been reached yet. In this section, we summarize well known data depths connected to ordering multivariate data and some notions for describing multivariate distributions.

2.4.1 Ordering Multivariate Data

A natural way of defining an ordering of multivariate data is that points on the surface of the data cloud have shallow depth and points far from the data cloud surface in any direction have deeper depth. Barnett (1976[24]) provided an extensive study on ordering multivariate data in a more quantitative way. Four different ordering principles were suggested.

Marginal ordering: This ordering is called M-ordering and as the name sug- gests, the ordering is based on marginal sample. M-ordering is related to transformation of the data set.

Reduced ordering: This ordering is called R-ordering and starts from reducing each multivariate observation to a single value by some combination of the components of given sample based on chosen metrics. The Mahalanobis distance is one of the most popular examples.

Partial ordering: This ordering is called P-ordering. Deviated from M-ordering and R-ordering, P-ordering considers overall inter-relational properties. The important principle of P-ordering is related to assigning ranks to data layers. Tukey’s peeling process belongs to P-ordering.

Conditional ordering: This ordering is called C-ordering or sequential ordering. C-ordering is conducted on one of the marginal sets of observations conditional on ordering within the marginal data, which stems from works about determining distribution free tolerance limits by hyper-rectangular coverage.

The studies on ordering principles for multivariate data are an effort to seek a direct counterpart of ordering univariate data. By means of these ordering principles, Barnett (1976) discussed the multivariate median, multivariate extreme points, multivariate quantiles, and outliers in multivariate data. Also, testing processes, modelling, classification, and clustering of multivariate samples were put on the table. Some details on these topics will be presented in upcoming sections.

2.4.2 Data Depth Functions

The various ideas of ordering multivariate data led to the coining of various depth functions. A comprehensive list of data depth functions with their definitions and properties is given in Liu et al. (1999[127]), Zuo and Serfling (2000a[241]), and Chenouri (2004[52]). Well known data depths are Mahalanobis Depth (Mahalanobis, 1936[133]), Convex Hull Peeling Depth (Barnett, 1976[24]), Half Space Depth (Tukey, 1974[221]), Simplicial Depth (Liu, 1990[130]), Oja Depth (Oja, 1983[158]), Majority Depth (Singh, 1991[212]), Projection Depth (Donoho and Gasko, 1992[58]), and Regression Depth (Rousseeuw and Hubert, 1999[182]). Here, we discuss the data depths that appear frequently in the literature. Let F be a distribution function in Rd (d ≥ 2). Throughout this section, unless stated otherwise, we assume that F is absolutely continuous and X =

{X1, ..., Xn} is a random sample from F .

Half Space Depth (Tukey, 1974) It is also called Tukey depth. The half space depth of x is defined as the minimal number of data points in any closed half space determined by a hyperplane passing through x

$$HD(x; X) = \min_{\|u\|=1} \#\{i : u^T X_i \geq u^T x\}.$$

The median from the half space depth is called the Tukey median, which is constructed uniquely as a centroid. This depth process builds nested polytopes, illustrated in Rousseeuw and Struyf (2004[180]) with samples from various distributions. Donoho and Gasko (1992) proved the continuity of the half space depth

and the almost sure convergence of the empirical half space depth to the theoretical depth. They showed that the half space depth breaks down at 1/3. The computational complexity of the half space depth has been studied; in particular, Miller et al. (2003[147]) reported that their half space depth algorithm (O(n²)) is faster than the O(n² log² n) algorithm of Rousseeuw and Ruts (1998[187]), an extension of Tukey's univariate construction. The computational complexity was compared in a two dimensional framework.
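For intuition, the half space depth can be approximated numerically by replacing the minimum over all unit vectors with a minimum over randomly sampled directions, which yields an upper bound on the exact depth. The sketch below follows this idea; the function name and the number of directions are our own choices, not part of the cited algorithms.

```python
import numpy as np

def halfspace_depth(x, X, n_dir=1000, seed=0):
    """Approximate Tukey (half space) depth of x in the sample X (n x d).

    The minimum over all unit vectors u is replaced by a minimum over
    randomly sampled directions, so the result is an upper bound on the
    exact depth and approaches it as n_dir grows.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    U = rng.standard_normal((n_dir, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions
    proj_X = X @ U.T                                # (n, n_dir) projected sample
    proj_x = np.asarray(x, dtype=float) @ U.T       # (n_dir,) projections of x
    counts = (proj_X >= proj_x).sum(axis=0)         # points in each closed half space
    return int(counts.min())
```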

Mahalanobis Depth (Mahalanobis, 1936) This depth is based on the Mahalanobis distance;
$$MhD(x) = [1 + (x - \mu_F)^T \Sigma_F^{-1} (x - \mu_F)]^{-1},$$

where $\mu_F$ and $\Sigma_F$ denote the mean and covariance matrix of F, respectively. The sample version of this depth is obtained by replacing the parameters with estimators $\hat{\mu}_{F_n}$ and $\hat{\Sigma}_{F_n}$. The covariance matrix is assumed to be non-singular. Despite its frequent application and popularity, this depth has been criticized for its lack of robustness.
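A direct sample version of the Mahalanobis depth is easy to compute from the sample mean and covariance, assuming the covariance matrix is non-singular. The following sketch is illustrative only; the function name is ours.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Sample Mahalanobis depth of a point x relative to the data matrix X (n x d)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)                       # estimator of mu_F
    Sigma = np.cov(X, rowvar=False)           # estimator of Sigma_F, assumed non-singular
    diff = np.asarray(x, dtype=float) - mu
    md2 = diff @ np.linalg.inv(Sigma) @ diff  # squared Mahalanobis distance
    return 1.0 / (1.0 + md2)
```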

Oja’s Depth (Oja, 1983) Let S[x, X1, ..., Xd] be the closed simplex with vertices

of x and d random observations X1, ..., Xd from F . The Oja depth is

$$OD(x; F) = [1 + E_F\{Vol(S[x, X_1, \ldots, X_d])\}]^{-1}.$$

The sample version of OD(x; F ) is

$$OD(x; F_n) = \left[1 + \binom{n}{d}^{-1} \sum_{*} Vol(S[x, X_{i_1}, \ldots, X_{i_d}])\right]^{-1},$$

where ∗ indicates all d-plets $(i_1, \ldots, i_d)$ such that $1 \leq i_1 < \cdots < i_d \leq n$. It is known that the corresponding median is affine equivariant, but the Oja depth has zero breakdown value.
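Because the sample Oja depth averages simplex volumes over all d-point subsets, it can be computed by brute force only for small samples. The sketch below does exactly that, using the determinant formula for simplex volumes; the averaging normalization follows the interpretation above and the helper names are ours.

```python
import numpy as np
from itertools import combinations
from math import factorial

def simplex_volume(vertices):
    """Volume of the simplex spanned by d+1 points in R^d."""
    V = np.asarray(vertices, dtype=float)
    edges = V[1:] - V[0]                       # d x d edge matrix
    return abs(np.linalg.det(edges)) / factorial(V.shape[1])

def oja_depth(x, X):
    """Brute-force sample Oja depth; only practical for small n and d."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    vols = [simplex_volume(np.vstack([x, X[list(idx)]]))
            for idx in combinations(range(n), d)]
    return 1.0 / (1.0 + np.mean(vols))
```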

Projection Depth (Donoho and Gasko, 1992) Outlyingness of any point x

relative to the data set X = {X1, ..., Xn} is defined as

$$O(x; X) = \max_{\|u\|=1} \frac{|u^T x - \mathrm{med}\{u^T X\}|}{\mathrm{MAD}\{u^T X\}}.$$

The univariate counterpart of the median absolute deviation (MAD) is

$\mathrm{MAD}\{y_i\} = \mathrm{med}_i |y_i - \mathrm{med}_j\{y_j\}|$ for a univariate data set $\{y_1, \ldots, y_n\}$. Based on this outlyingness O(x; X), the projection depth is defined as

$$PD(x; X) = (1 + O(x; X))^{-1}.$$

Zuo and Serfling (2000a) showed that the projection depth is affine equi- variant, maximal at center, and monotonic relative to the deepest point. In addition, the projection depth median has large sample breakdown point 1/2 (Zuo et.al., 2004[238]).
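Like the half space depth, the projection depth can be approximated by maximizing the outlyingness over a finite set of random directions; with finitely many directions the outlyingness is underestimated and the depth overestimated. The sketch below assumes continuous data so that the MAD of the projections is nonzero; the function name is ours.

```python
import numpy as np

def projection_depth(x, X, n_dir=1000, seed=0):
    """Approximate projection depth by maximizing outlyingness over random directions."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    U = rng.standard_normal((n_dir, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj_X = X @ U.T                              # (n, n_dir)
    proj_x = np.asarray(x, dtype=float) @ U.T     # (n_dir,)
    med = np.median(proj_X, axis=0)
    mad = np.median(np.abs(proj_X - med), axis=0) # assumed nonzero for continuous data
    outlying = np.max(np.abs(proj_x - med) / mad)
    return 1.0 / (1.0 + outlying)
```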

Simplicial Depth (Liu, 1990) The simplicial depth of x is equivalent to the number of simplices formed by d+1 data points that contain x. The simplicial depth is defined as

$$SD(x; F) = P_F\{x \in S[X_1, \ldots, X_{d+1}]\},$$

where $S_{i,d} = S[X_{i_1}, \ldots, X_{i_{d+1}}]$ is a closed simplex formed by (d + 1) random observations from F. The sample version of this depth is

$$SD(x; F_n) = \binom{n}{d+1}^{-1} \sum_{*} I_{\{x \in S_{i,d}\}}.$$
It is known that the simplicial median is affine equivariant with a breakdown value bounded above by 1/(d + 2). It has also been discussed that the simplicial depth does not always attain maximality at the center and that the depth function does not always have monotonicity relative to the deepest point (Zuo and Serfling, 2000b[242]). Liu (1990) proved the uniform consistency of the simplicial depth function when F is an absolutely continuous distribution on Rd with a bounded density

f, i.e.,
$$\sup_{x \in R^d} |SD(x; F_n) - SD(x; F)| \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty.$$
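The sample simplicial depth can be evaluated by brute force as the fraction of (d+1)-point simplices containing x, using a barycentric point-in-simplex test. The sketch below is feasible only for small n and is provided purely for illustration; the function names are ours.

```python
import numpy as np
from itertools import combinations
from math import comb

def in_simplex(x, vertices, tol=1e-12):
    """Check whether x lies in the closed simplex spanned by d+1 vertices in R^d."""
    V = np.asarray(vertices, dtype=float)
    A = (V[1:] - V[0]).T                       # edge matrix
    try:
        lam = np.linalg.solve(A, np.asarray(x, dtype=float) - V[0])
    except np.linalg.LinAlgError:
        return False                           # degenerate simplex, treat as a miss
    return lam.min() >= -tol and lam.sum() <= 1 + tol

def simplicial_depth(x, X):
    """Brute-force sample simplicial depth: fraction of (d+1)-point simplices containing x."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    hits = sum(in_simplex(x, X[list(idx)]) for idx in combinations(range(n), d + 1))
    return hits / comb(n, d + 1)
```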

Majority Depth (Singh, 1991) By denoting a major side as the half space

bounded by the hyperplane containing $(X_{i_1}, \ldots, X_{i_d})$ that has probability ≥ 0.5, the majority depth is defined as

MjD(F ; x)= PF {x is in a major side determined by (X1, ..., Xd)}.

It is known that the majority depth is affine equivariant, maximal at center, and monotonic relative to the deepest point (Zuo and Serfling, 2000a).

Liu and Singh (1993[128]) found that under proper conditions, the uniform consistency of the sample version of depth functions holds almost surely for the following depths: Mahalanobis depth, half space depth, simplicial depth, and majority depth. For the Mahalanobis depth, the uniform consistency holds if $E_F\|X\|^2 < \infty$; for the half space depth and the simplicial depth, the uniform consistency holds if F is absolutely continuous; and for the majority depth, the uniform consistency holds if F is an . These results are connected to the fact that these depths are applicable for statistical inference on multivariate distributions.

2.4.3 Statistical Data Depth

While the definitions and descriptions of various data depths were given in the previous section, we also discussed some of their properties, such as maximality and monotonicity. In addition to these two properties, affine invariance and vanishing at infinity comprise the properties for statistical depths, allowing depth functions to be applied to statistical inference. Depth functions satisfying the four properties are called statistical depth functions. Zuo and Serfling (2000a[241]) explained the notion of statistical data depth. Although various measures of depth exist, not all have the desired properties for statistical analysis. In short, those properties of statistical data depth are affine invariance, maximality at the center, monotonicity relative to the deepest point, and vanishing at infinity.

Definition 2.3. (Statistical depth function) Let the mapping D(·; ·) : Rd ×F → R be bounded, non-negative, where F is the class of distributions on the Borel sets of Rd and F is the distribution of random sample X. If D(·; F ) satisfies the following:

(P1) (Affine invariance) $D(Ax + b; F_{AX+b}) = D(x; F_X)$ holds for all x, for any random vector X in Rd, any d × d non-singular matrix A, and any d-vector b;

(P2) (Maximality at center) $D(\theta; F) = \sup_{x \in R^d} D(x; F)$ holds for any F ∈ F having center θ;

(P3) (Monotonicity) for any F ∈ F having deepest point θ, D(x; F ) ≤ D(θ + α(x − θ); F ) holds for α ∈ [0, 1]; and

(P4) D(x; F ) → 0 as ||x||→∞, for each F ∈F.

Then D(·; F ) is called a statistical depth function.

To understand properties of these functions related to multivariate distribu- tions, we will need further definitions on types of distributions.

2.4.4 Multivariate Symmetric Distributions

Ordering highly depends on a symmetric distribution and a point about which the symmetry is defined. This point, which we will call the multivariate median, will be discussed in the next section with estimating methods. Here, we will focus on general notions of multivariate symmetric distributions.

Spherical Symmetric: For any orthogonal transformation A : Rd → Rd, the distribution of a random vector X ∈ Rd is spherically symmetric about θ provided that X − θ and A(X − θ) are identically distributed.

Elliptical Symmetric: For some affine transformation T : Rd → Rd of full rank, the distribution of a random vector X ∈ Rd is elliptically symmetric about θ if Y = T(X) is spherically symmetric about θ.

Central Symmetric: The distribution of a random vector X ∈ Rd is centrally symmetric about θ ∈ Rd if

X − θ =d −(X − θ),

where =d implies that two random vectors are identically distributed.

Angular Symmetric: The distribution of a random vector X ∈ Rd is angular symmetric about θ ∈ Rd provided that the standardized vector (X −θ)/||X − θ|| is centrally symmetric about the origin. Note that conventionally (X − θ)/||X − θ|| =0 if ||X − θ|| = 0.

Half Space Symmetric: The distribution of a random vector X ∈ Rd is half space symmetric about θ ∈ Rd if P (X ∈ H) ≥ 1/2 for any closed half space H with θ on the boundary.

Note that central symmetry is a subset of angular symmetry, which is a subset of half space symmetry.

2.4.5 General Structures for Statistical Depth

Zuo and Serfling (2000a) presented general structures for statistical depth functions, according to which various depth functions are classified into four types. We summarize the four general structural types for constructing statistical depth functions and indicate how well each type satisfies the properties of statistical depth functions.

Type A depth functions: Let h(x; x1, ..., xn) be any bounded non-negative func-

tion which measures in some sense the closeness of x to the points x1, ..., xn. A corresponding type A depth function is then defined by the average close- ness of x to a random sample of size n:

D(x, F )= Eh(x; X1, ..., Xn),

where X1, ..., Xn is a random sample from F . For such depth functions

the corresponding sample versions $D(x; \hat{F}_n)$ turn out to be U-statistics or

V-statistics. The majority depth (Singh, 1991[212]) belongs to this category provided the distribution is half space symmetric.

Type B depth functions: Let h(x; x1, ..., xn) be an unbounded non-negative function which measures in some sense the distance of x from the points

x1, ..., xn. A corresponding type B depth function is then defined as

$$D(x, F) = (1 + Eh(x; X_1, \ldots, X_n))^{-1},$$

where X1, ..., Xn are a random sample from F. This type B function may not satisfy the affine invariance property, but it matches well with the other properties of a statistical depth function. A type B depth function defined on any circularly symmetric distribution satisfies the property of maximality at the center.

Type C depth functions: Let O(x; F) be a measure of the outlyingness of the point x in Rd with respect to the center or the deepest point of the distribution F. Usually O(x; F) is unbounded, but a corresponding type C depth function is defined to be bounded as

D(x; F )=(1+ O(x; F ))−1.

Type C functions possess the properties of statistical depth functions, and the projection depth and the Mahalanobis depth are examples of type C functions.

Type D depth functions: Let C be a class of closed subsets of Rd and P a probability measure on Rd. A corresponding type D depth function is defined as
$$D(x; P, \mathcal{C}) = \inf_{C \in \mathcal{C}} \{P(C) : x \in C\}.$$
The depth of a point x with respect to a probability measure P on Rd is defined to be the minimum probability mass carried by a set C in C that contains x. The half space depth and the convex hull peeling depth are known to be type D functions.

Zuo and Serfling (2000a) favored the half space depth function because it is a statistical depth function and its breakdown point is relatively higher than the breakdown points of other depth functions (≤ 1/3).

2.4.6 Depth of a Single Point

Initial depth calculations for a given sample by means of a depth function provide a set of nested convex polygons formed by equi-depth points composing depth contours (equi-level sets). The depth contours subdivide the space into depth regions so that the region between contours k and k + 1 has depth k. Therefore, the depth of a new point is simply given by the depth region in which it lies. This depth allocation process led to studies on hypothesis testing of the similarity between two distributions (Liu and Singh, 1993[128] and Li and Liu, 2004[123]) and on classification (Ghosh and Chaudhuri, 2005[72]).
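The depth-region assignment described above can be sketched computationally with convex hull peeling layers standing in for the depth contours (anticipating the convex hull peeling depth of Chapter 4). The code below peels a sample with Qhull via scipy and reports the deepest peeled hull that still contains a new point; the function names and the containment tolerance are our own choices, and degenerate configurations are not handled.

```python
import numpy as np
from scipy.spatial import ConvexHull

def peel_layers(X):
    """Convex hull peeling: return the hulls of successively peeled layers."""
    pts = np.asarray(X, dtype=float)
    layers = []
    while len(pts) > pts.shape[1]:            # need at least d+1 points for a hull
        hull = ConvexHull(pts)
        layers.append(hull)
        keep = np.ones(len(pts), dtype=bool)
        keep[hull.vertices] = False           # peel off the current extreme points
        pts = pts[keep]
    return layers

def depth_of_point(x, layers, tol=1e-9):
    """Depth of a new point = index of the deepest peeled hull that still contains it."""
    x = np.asarray(x, dtype=float)
    depth = 0
    for k, hull in enumerate(layers, start=1):
        # a point is inside the hull when it satisfies all facet inequalities
        inside = np.all(hull.equations[:, :-1] @ x + hull.equations[:, -1] <= tol)
        if inside:
            depth = k
        else:
            break
    return depth
```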

2.4.7 Comments

Ideas presented in this section are the foundation of developing nonparametric statistics and methods for multivariate data. As an initial step toward developing nonparametric statistics for multivariate data, based on notions and definitions provided in this section, we will discuss some multivariate estimation problems such as multivariate median, skewness, and kurtosis in the next section.

2.5 Descriptive Statistics of Multivariate Distributions

This section presents nonparametric estimators for multivariate distributions based on the data depths discussed in the previous section. In fact, only a small number of works related to nonparametric estimators of multivariate distributions are found (e.g. Liu et al., 1999 and Serfling, 2004[196]). Methods of estimating the multivariate median, location, scatter, skewness, and kurtosis are presented. Where multivariate estimators are not found, univariate counterparts are introduced.

2.5.1 Multivariate Median

The median of univariate data has the universal definition that one selects the point in the middle after sorting the data. It is denoted as $X_{(\frac{n+1}{2})}$ or $\frac{1}{2}(X_{(\frac{n}{2})} + X_{(\frac{n}{2}+1)})$ depending on whether the sample size is odd or even. On the contrary, there is no predominant definition of the multivariate median. The simple coordinate-wise median and mean are adopted for estimating the multivariate median, but they are not robust with respect to data distributions. Another notion of the multivariate median is the deepest point in the data cloud. Yet being the deepest point in the data cloud requires defining directions and metrics corresponding to ordering the data. Aloupis (2005[10]) gave a good summary of multivariate medians from the computer scientific point of view. Small (1990[213]) surveyed multivariate medians dating back to the early 1900s and presented a statistical perspective on these multivariate medians. The list of popular multivariate medians is:

L1 Median: the point which minimizes the sum of Euclidean distances to all points in a given data set (Chaudhuri, 1996[42]). The breakdown point is known to be 1/2.

Half Space Median: based on the half space depth, this median is the point that maximizes the minimum number of data points contained in any closed half space determined by a hyperplane through it. Donoho and Gasko (1992) are credited with this median; they studied its breakdown property. The breakdown point is bounded below by 1/(d + 1) and is at most 1/3.

Convex Hull Peeling Median: the point(s) left after a sequential peeling process (a small computational sketch is given after this list). The visual interpretation is very straightforward and intuitive. The number of points left after exhaustive peeling is not enough to comprise a simplex, so the coordinate-wise average of these points is taken as the location of the multivariate median.

Oja's Simplex Median: a point that minimizes the sum of the volumes of simplices formed with all subsets of d points from the data set in Rd. This median does not have the uniqueness property but has the property of affine equivariance.

Simplicial Median: a point in Rd which is contained in the most simplices formed by subsets of any d + 1 data points. This median is known to be affine transformation invariant.

Hyperplane Median: a point θ which minimizes the number of hyperplanes that a ray emanating from θ crosses. Rousseeuw and Hubert (1999[181]) defined this median while introducing the maximum hyperplane depth, which is affine invariant.
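As promised above, the convex hull peeling median can be sketched directly from its description: peel off hull vertices until too few points remain to form a simplex, then average what is left. The sketch below uses scipy's Qhull wrapper, ignores degenerate (e.g. collinear) configurations, and is illustrative rather than the implementation used later in this thesis.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_peeling_median(X):
    """Peel convex hulls until too few points remain to form a simplex,
    then return the coordinate-wise average of the surviving points."""
    pts = np.asarray(X, dtype=float)
    d = pts.shape[1]
    while len(pts) > d + 1:                  # a simplex in R^d needs d+1 points
        hull = ConvexHull(pts)
        keep = np.ones(len(pts), dtype=bool)
        keep[hull.vertices] = False          # discard the current extreme points
        if not keep.any():                   # everything is on the hull: stop
            break
        pts = pts[keep]
    return pts.mean(axis=0)
```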

According to the discussion by Aloupis (2005), studies on the computational complexity of these multivariate medians are confined to bivariate data. Thus, we do not intend to add computational complexity results. However, our naive guess is that the simplicial median and Oja's simplex median are more complex than the other medians as the data dimension increases3, since these medians require constructing all possible simplices, which takes O(n^{d+1}) computation time. In addition, the half space median, the hyperplane median, and the L1 median ask for an iterative optimization process, which makes computation less efficient.

3 Numerous algorithms developed in the computational geometry community are geared toward efficient and less complex algorithms; however, those algorithms only work under particular conditions. We focus on algorithms that function under a general data configuration although those algorithms are computationally expensive.

2.5.2 Location

Estimating the median is a specific example of estimating location. Traditionally, the location of multivariate data has been estimated under the normal distribution assumption. The Mahalanobis distance is the prime example of location estimation based on the normality of a given sample, but it is well known for its lack of robustness. Recently, constructing robust multivariate location and scatter estimators has become a very interesting problem. Furthermore, depth induced ordering promises new tools in multivariate data analysis and statistical inference, just as order and rank statistics did for univariate data. The depth induced ordering is strongly related to generalized quantile based location estimation, which has gained popularity due to its robustness compared to the Mahalanobis distance. According to Einmahl and Mason (1992[69]), the empirical quantile function

is defined as

Un(t)= inf{λ(A) : Pn(A) ≥ t, A ∈ A} when 0

Un(t) was illustrated by showing that Un(t) is a random element in the space of left continuous functions with right limits (Einmahl and Mason, 1992). The general quantile process may not provide a straightforward prescription for ordering multivariate data, but it suggests a proper recipe for estimating locations of multidimensional data by means of a function resembling the univariate quantile function.

2.5.3 Scatter

Traditionally, measuring the degree of scatterness has been treated as the same problem as estimating the second moment, an idea established by Mardia (1970[138]). Not only spreadness but also other distribution shape measures, such as skewness and kurtosis, were based on estimating higher moments. However, estimating moments rests on a particular distribution family, such as the multivariate normal distribution. Diverting away from Mardia's approach, it becomes clear that not many ways are found to measure the scatterness of a multidimensional data distribution. When two univariate distributions are symmetric, the relative degree of scatterness of one distribution to the other can be assigned. According to Bickel and Lehmann (1976[30]), a symmetric univariate distribution P is considered to be more dispersed about µ than another symmetric distribution Q about ν if

$$P([\mu - a, \mu + a]) \leq Q([\nu - a, \nu + a]) \quad \text{for all } a > 0.$$
This definition can be extended to the multivariate case as in Eaton (1982[61]) by stating that P is more concentrated about µ than Q about ν if

P (C + µ) ≥ Q(C + ν) for all convex symmetrical sets C ⊂ Rd. Oja (1983[158]) provided a generalized version of relative scatterness between two distributions.

Definition 2.4. Suppose φ : Rd → Rd is a function such that

V ol(φ(x1), ..., φ(xd+1)) ≥ V ol(x1, ..., xd+1)

for all $x_1, \ldots, x_{d+1} \in R^d$. Here, $Vol(x_1, \ldots, x_{d+1})$ indicates the volume of a simplex comprised of $x_1, \ldots, x_{d+1}$. Then, if P is the distribution of a random vector X and Q is the distribution of φ(X), we say that Q is more scattered than P, in short

P ≺S Q.

This volume based definition of the scatterness of a sample will be adopted to develop a measure of the relative scatterness of distributions by means of the convex hull peeling process. In addition to the spread ordering, Oja (1983) stipulated the enhanced concept of a scatter measure.

Definition 2.5. The function Ψ : P → R+ is a measure of scatter if

(1) P ≺S Q implies Ψ(P) ≤ Ψ(Q) for all P, Q ∈ P;
(2) for any affine transformation L : Rd → Rd of the form L(x) = Ax + µ and any P ∈ P such that P L−1 ∈ P,

$$\Psi(PL^{-1}) = \mathrm{abs}(|A|)\,\Psi(P).$$

Equivalently, Ψ(AX + µ)= abs(|A|)Ψ(X).

Note that Ψ(AX)=0 for a singular A.

A convex hull can be interpreted as a scatter set, and the volume of the convex hull, a generalized range, is a measure of scatter. Therefore, the notion of scatter by Oja (1983) will be very useful when we define the degree of scatterness of affine invariant multidimensional distributions by means of the convex hull process. Additionally, a volume functional, a mapping from a generalized multivariate quantile function, allows one to allocate the order of scatterness among distributions.

At a given data depth r, the volume of a more scattered distribution vF (r) will be larger than the volume of a less scattered distribution vG(r), i.e. vF (r) > vG(r) (Liu et.al., 1999). Consequently, a suitable generalized quantile function and its mapping to volume function will be the key to measure the degree of scatterness of multivariate distributions.

2.5.4 Skewness

Skewness is considered a measure of asymmetry, an innate characteristic of a distribution. The skewness of a univariate distribution accommodates two directions from the median: a positive span and a negative span from the center of the data distribution. When it comes to multivariate data, there are more than two directions that streak from the data center. These numerous directional rays from the data center represent the skewness of the multivariate data distribution through their absolute values, since a symmetric distribution will share the same values regardless of direction; however, there exist only a few proper functionals based on multivariate quantiles that project these numerous rays into a one dimensional curve for quantizing the degree of skewness. The classical definition of skewness was stipulated by Mardia (1970[138]). Let X and Y be random vectors from some multivariate distribution function F on Rd with mean µ and covariance matrix Σ. Then skewness is defined as

$$\beta_1 = E[\{(X - \mu)' \Sigma^{-1} (Y - \mu)\}^3],$$
where X and Y are iid random vectors with mean µ and covariance matrix Σ. The sample version of β1 can be obtained under the assumptions that Σ is not singular and the third moment exists. These assumptions are rather strong and have drawn some criticism. Ferreira and Steel (2006[71]) summarized a few other generalized skewness measures for unimodal distributions, beyond the joint moment approach (Mardia, 1970). One is based on projections of the random variable onto a line by Malkovich and Afifi (1973[135]), and the other is based on volumes of simplices by Oja (1983). Measured from the unique mode of a unimodal distribution, considered to be the center of the multivariate distribution, the projection based skewness is considered a measure of skewness onto principal axes, and the volume based skewness is called directional skewness. Unlike multivariate skewness, univariate skewness is well defined. Avérous and Meste (1997[14]) proposed a skewness ordering ≺. A given skewness ordering ≺ is defined by comparing F with F̄ (where F̄(x) = 1 − F(x)): F is said to be skewed to the right if and only if F̄(x) ≺ F(x). By the same token, F is called less skewed to the right than G (F ≺ G) if and only if G−1 ◦ F is convex. According to Oja (1983[158]), measures of skewness and kurtosis are usually selected by using intuition, and it is required that if Φ is a measure of skewness or kurtosis then Φ(P L−1) = Φ(P), where L is an affine transformation L(X) = AX + b (A non-singular). Studies that particularly focus on multivariate skewness are sparsely found, especially ones that define skewness based on a generalized quantile process, which allows a one dimensional mapping to characterize a multi-dimensional data distribution. Nonetheless, we can conclude that any changes along the rays going away from the data center can be treated as some measure of skewness in multivariate data. This idea of searching for directional changes will lead us to define skewness based on the convex hull peeling process, without calculating the associated moments of a multivariate sample.
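For reference, a common sample version of Mardia's β1 replaces the expectation over independent copies X and Y by a double average over the sample, using the (biased) sample covariance matrix. The sketch below is a minimal illustration under the assumption that the sample covariance is non-singular; the function name is ours.

```python
import numpy as np

def mardia_skewness(X):
    """Sample version of Mardia's multivariate skewness beta_1."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    diff = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))  # assumes non-singular S
    G = diff @ S_inv @ diff.T        # n x n matrix of generalized inner products
    return (G ** 3).sum() / n ** 2   # double average of the cubed inner products
```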

2.5.5 Kurtosis

After the exploration of nonparametric descriptive measures of a distribution such as location, scatter, and skewness, the natural next step is to characterize kurtosis. The univariate kurtosis, or the standardized fourth central moment, has been construed to represent either heavy peakedness or a heavy tailed distribution. The notion of univariate kurtosis can be easily extended to multivariate kurtosis. Regardless of the number of dimensions, kurtosis enables one to distinguish a heavy tailed distribution or a heavy peaked distribution along the rays from the center of the data. The classical definition of kurtosis is

$$\kappa = E\{[(X - \mu)' \Sigma^{-1} (X - \mu)]^2\}$$
for a distribution in Rd with mean µ and covariance matrix Σ (Mardia, 1970[138]), so that kurtosis is an estimate of the 4th moment of X about µ. The kurtosis κ measures the dispersion of the squared Mahalanobis distance of X from µ. For an explicit discussion of kurtosis, the authentic nomenclature of kurtosis is necessary. One group of terms is shoulder, peakedness, and tailweight. The other group is mesokurtic, platykurtic, and leptokurtic. First of all, a shoulder of the distribution indicates about the half way point from the data center to the data surface, so that higher kurtosis arises when probability mass is diminished near the shoulders and raised either near µ (greater peakedness) or in the tails (greater tailweight) or both. Here, peakedness indicates how fat a distribution can be around µ, and tailweight indicates how thick a distribution can be beyond the shoulder and away from µ. The other nomenclature group is strongly related to estimates of Mardia's κ. According to Balanda and MacGillivray (1998[17]), platykurtic curves are squat with short tails (κ < 3), leptokurtic curves are high with long tails (κ > 3), and mesokurtic distributions (κ = 3) are comparable to the normal distribution. Alternative measures to the moment based kurtosis κ are quantile-based kurtosis measures, which are robust, more widely defined, and better at discriminating distributional shapes (Wang and Serfling, 2005[228]). Quantile based kurtosis measures provide shape information about the data distribution; on the other hand, classical moment based kurtosis quantifies the dispersion of probability mass. Depth measures enable one to connect quantile estimators to volume estimators

CF (r) can be a continuous function of r. A convex hull peeling central region will be discussed in Chapter 4. Note that the central regions are nested about MF , the center based on any notion of multivariate median, and are reduced to MF as r → 0. A corresponding volume functional is defined as

$v_{F,C}(r) = Vol(C_F(r))$, so that it is a decreasing function of the variable r. The boundary of the central region $C_F(\tfrac{1}{2})$ represents the shoulder of the distribution and separates the designated central part from the corresponding tail part of the distribution. One of the volume based spatial kurtosis functionals (Serfling, 2004) is

$$\kappa_{F,C}(r) = \frac{v_{F,C}(\tfrac{1}{2} - \tfrac{r}{2}) + v_{F,C}(\tfrac{1}{2} + \tfrac{r}{2}) - 2\, v_{F,C}(\tfrac{1}{2})}{v_{F,C}(\tfrac{1}{2} - \tfrac{r}{2}) - v_{F,C}(\tfrac{1}{2} + \tfrac{r}{2})}.$$

This functional is invariant under shift, orthogonal, and homogeneous scale transformations (Serfling, 2004[196]). Within certain typical classes of distributions we can attach intuitive interpretations to the values of κF,C(·). For some unimodal distribution F, for any fixed r, a value of κF,C(r) near +1 suggests peakedness, a value near −1 suggests a bowl shaped distribution, and a value near 0 suggests uniformity. According to Bickel and Lehmann (1975[29]), measures of kurtosis, peakedness, and tailweight should be ratios of scale measures in the sense that both numerator and denominator should preserve their spread ordering. This ratio of scale measures is called a measure of tailweight, whose definition is as follows:

$$t_F(r, s) = \frac{v_F(r)}{v_F(s)},$$
where r and s are depths ($0 < r, s < 1$).
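The two functionals above are straightforward to evaluate once a volume functional v(·) is available. In the sketch below the functionals are coded exactly as written, while v(·) is a crude stand-in built from the convex hull of the points nearest the coordinate-wise median (with r read as a depth level, so deeper levels give smaller regions). That stand-in is an assumption made only so the example runs; it is not the convex hull peeling central region developed in Chapter 4.

```python
import numpy as np
from scipy.spatial import ConvexHull

def kurtosis_functional(v, r):
    """Volume-based kurtosis functional kappa_{F,C}(r) for a volume functional v(.)."""
    num = v(0.5 - r / 2) + v(0.5 + r / 2) - 2 * v(0.5)
    den = v(0.5 - r / 2) - v(0.5 + r / 2)
    return num / den

def tailweight(v, r, s):
    """Ratio-of-scale tailweight measure t_F(r, s) = v_F(r) / v_F(s)."""
    return v(r) / v(s)

# Placeholder central-region family (an assumption, not the thesis construction):
# v(r) is the volume of the convex hull of the proportion (1 - r) of points
# closest to the coordinate-wise median, so v is decreasing in the depth level r.
rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 2))
center = np.median(X, axis=0)
order = np.argsort(np.linalg.norm(X - center, axis=1))

def v(r):
    k = max(3, int(round((1 - r) * len(X))))
    return ConvexHull(X[order[:k]]).volume

print(kurtosis_functional(v, 0.5), tailweight(v, 0.25, 0.75))
```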

Another approach to multivariate kurtosis is found in studies of influence functionals by Wang and Serfling (2006[227]). Given a multivariate distribution F, a corresponding depth function orders points according to their centrality in the distribution F. The depth functions then generate one dimensional curves for a convenient and practical description of specific features of a multivariate distribution, like kurtosis, and Wang and Serfling explored the robustness of the sample versions of these curves by means of influence functionals. The structural representation by influence functionals is consistent with the curves from generalized quantile functions or depth functions. The aforementioned works successfully identified general shape properties of multivariate distributions, and one of the byproducts is a kurtosis measure. Although there is no consensus on the definition of kurtosis, we prefer a kurtosis definition that is free of estimating moments. With no designated formula for a kurtosis measure of a distribution, the peakedness or tailweight of a probability distribution can still be measured to some extent based on data depths, hinting at the kurtosis property of the distribution. A suitable quantification of these measures of peakedness and tailweight shall induce the definition of kurtosis.

2.5.6 Comments

Classical descriptive statistics are defined by means of calculating moments. The drawback of this moment based process is that some probability distributions do not have the expected moments to be compared with empirical moments. Due to the lack of studies in nonparametric descriptive statistics, we are aware of deficiencies in descriptive measures of multivariate distributions. Furthermore, for some nonparametric descriptive measures based on particular depths, only empirical versions of the measures are available because the distribution of the descriptive functionals expected from the population is unknown. Many possibilities remain buried in the field of nonparametric descriptive measures of multivariate data distributions.

2.6 Outliers

As of the writing of this thesis, Google Scholar (http://scholar.google.com) gave 5,760 articles about outlier detection. Detecting outliers is an important technique not only in statistics and data mining, but also in various fields such as natural science, engineering, business, and industry. However, most outlier detection methods rely on small sample statistical analysis tools developed from estimating the Mahalanobis distance (1936[133]) under the multivariate normal distribution assumption. Comprehensive studies on classical outliers are collected in the books by Barnett and Lewis (1994[21]), Rousseeuw and Leroy (1987[188]), and Hawkins (1980[89]). Numerous works related to classical outlier detection were also performed by Atkinson (1994[12]), Barnett (1978[23]), Hadi (1992[81]), Knorr (2000[113]), Kosinski (1999[117]), Penny (1995[161]), Rocke and Woodruff (1996[178]), Wilks (1963[232]), and so on. In addition, Werner (2003[230]) gave an interesting point of view on detecting outliers with influence functionals in his doctoral thesis. A recent and extensive survey of outlier detection methodologies is found in Hodge and Austin (2004[94]). In our survey, we focus on nonparametric outlier detection methods instead of parametric approaches such as hypothesis testing and distance calculations based on particular parametric models, which take normally distributed data for granted and initiate the detection process by estimating covariance matrices. Here, we describe multivariate outliers and their detection methods without considering the second moment.

2.6.1 What are Outliers?

First of all, outliers have to be defined before their detection methods can be developed. For univariate data, outliers are relatively simple to detect. If the distribution of the univariate data is known, hypothesis testing leads to declaring outliers with a stated probability. When the distribution is unknown, we sort the data and declare data of the lowest or the highest ranks as outliers. On the other hand, multivariate outliers have very vague definitions, so that they cannot be treated as an extension of univariate outliers. Huber (1972[96]) stated that outliers are defined as observations unlikely to belong to the

main population and that order statistics yield an easy criterion to detect outliers. The unresolved ambiguity is that the ordering of multivariate data is not clearly defined for declaring outliers. In their quite extensive outlier studies, Barnett and Lewis (1994[21]) adopted the notion that outliers have been informally defined as observations which appear to be inconsistent with the remainder of a data set.

An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Barnett and Lewis, 1994).

In other words, outliers are a subset of observations which appear to be inconsistent with the remainder of the data set, whether or not those outliers are genuine members of the main population. A similar definition appears in Hawkins (1980[89]) as follows:

an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.

This statement implies that contaminants arise from some other distribution. Alternatively, Beckman and Cook (1983[27]), in their wide ranging review paper on outliers, described an outlier as any observation that appears surprising or discrepant to the investigator. They suggested the following notions to clear some ambiguity about outliers:

• Discordant observation: any observation that appears surprising or discrepant to the investigator

• Contaminant: any observation that is not a realization from the target dis- tribution

• Outlier: a collective term to refer to either a contaminant or a discordant observation.

Beckman and Cook (1983) also gave an extensive history about outliers but put a major emphasis on outliers in linear models.

For the moment, we accept that outliers deviate from the majority of data points, lie far from the center of the data in any direction, and reside above the surface of the data cloud, without considering the various statistical and computer scientific definitions of outliers. Techniques that determine the position of a point relative to the others (Rohlf, 1975[179]) or measures that quantify a non-Euclidean distance from the center seem to be useful for indicating the degree of outlyingness with respect to the majority of the data. Popular approaches to detecting outliers assume a multivariate normal distribution and calculate Mahalanobis distances with appropriate covariance matrices. Mahalanobis distances allow one to find the location of the data center and to order multivariate data by estimating the sample mean and sample covariance. Without assuming multivariate normality, we still need some metric that quantifies an ordering of the data, but there is no consensus on this subject. Multivariate data can be ordered from left to right, from bottom to top, or from front to back based on the coordinates of the data. They can be ordered outward from the center. On the contrary, the center of multivariate data is not universally defined compared to the center of univariate data. An extensive survey on ordering multivariate data is found in Barnett (1976). In Rocke and Woodruff (1996), insights are given into why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of the data. Rocke and Woodruff (1996[178]) suggested a very useful framework to identify multivariate normal outliers when the main distribution is N(O, Id), where O and

Id indicate a zero vector of length d and a d × d identity matrix, respectively.

Shift outliers: the outliers have distribution N(µ, Id) where µ is a length d vector.

Point outliers: the outliers form a densely concentrated cluster.

Symmetric linear outliers: given a unit vector u, the outliers have distribution uN(O, Σ) for an arbitrary Σ.

Asymmetric linear outliers: given a unit vector u and its coordinate-wise ab- solute value |u|, the outliers have distribution |u|N(0, Σ)

Radial Outliers: outliers have distribution N(0, kId) with k > 1.

This categorization of outliers based on different types of normal distributions aids in separating outliers from the main N(O, Id) distribution; however, if there is no transformation of the main data to N(O, Id), segregating outliers based on one of the categories will lead to classifying outliers inconsistently. Overall, outliers in multidimensional space are ambiguous compared to univariate outliers. The only fact that we are sure of about outliers is that they deviate from the primary data distribution, regardless of dimension.

2.6.2 Where Do Outliers Occur?

Barnett (1978[23]) summarized the causes of outliers: (1) measurement errors, (2) execution faults, and (3) intrinsic variability. Because of these various causes, outliers are treated in different ways: if a model is more plausible without outliers, outliers are detected in order to be eliminated; otherwise, the model is modified to cope with the existing outliers. Assertively, the space in which outliers reside is most likely distinct from that of the main distribution. This unique spatial occupation of outliers is clearly displayed by simple graphical methods when the data are univariate. In the multivariate case, graphical visualization to locate outliers is quite complicated and sometimes impossible. From a knowledge discovery standpoint, rare events are more interesting than common ones. An interesting list has been made (Hodge and Austin, 2004[94]) of occasions in which outlier detection is applied. Some of those occasions are (1) fraudulent applications for credit cards and mobile phones, (2) loan application processing, (3) intrusion detection, (4) activity monitoring, (5) network performance, (6) fault diagnosis, (7) unexpected entries in databases, and so on. These occasions clearly involve events that deviate from the main population. Therefore, the key to outlier detection is to develop methods of measuring the degree of deviation among points.

2.6.3 Outliers and Normal Distributions

Although there is no clear statement of a normal distribution assumption, most outlier detection techniques rely on this assumption and require estimates of the location and scale parameters of the distribution (Pena and Prieto, 2001

[160]). Multivariate outlier detection methods have been developed toward acquiring robust estimates of the mean and covariance matrix, for which the Mahalanobis distance
$$D(x) = (x - \bar{X})^T \hat{\Sigma}^{-1} (x - \bar{X})$$
becomes indispensable. Points with large Mahalanobis distances, computed from robust estimates of the population scatter and location, are declared multivariate outliers (Kosinski, 1999[117]). Kosinski (1999) proposed a new multivariate outlier detection method that performs well under high contamination conditions. Even though the Mahalanobis distance is affine invariant, it is well known that identification of multivariate outliers based on the Mahalanobis distance is not robust (Rocke and Woodruff, 1996[178]). Hardin and Rocke (2004[85]) provided a survey of studies about robust outlier detection methods by means of the Mahalanobis distance, including Atkinson (1994[12]), Barnett and Lewis (1994[21]), Gnanadesikan and Kettenring (1972[74]), Hadi (1992[81]), Hawkins (1980[89]), Maronna and Yohai (1995[139]), Penny (1995[161]), and Rocke and Woodruff (1996[178]). Yet, problematically, the outliers themselves are included in calculating these location and scale estimators. In addition, the Mahalanobis distance is computationally expensive for large high dimensional data sets compared to the Euclidean distance, because the computing process has to pass through the entire data set for the covariance matrix and its inverse (Hodge and Austin, 2004[94]), which might be sparse or singular. Another approach to outlier detection based on a multivariate normal distribution assumption is the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) (Rousseeuw and Leroy, 1994). Since data from a single population tend to form a convex set, distributional covariance structures are represented by ellipsoids, and outliers are separated beyond these ellipsoidal contours. Rocke and Woodruff (1996) specified that all known affine equivariant outlier identification methods consist of the following two stages (an illustrative sketch follows the outline below):

Phase I Estimate the location and the shape.

Phase II Calibrate the scale estimate, suggesting which points are far enough from the location estimate to be pronounced outliers.
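A minimal, classical (and deliberately non-robust) rendering of this two-phase scheme uses the sample mean and covariance in Phase I and a chi-square calibration in Phase II; robust variants would substitute MVE or MCD estimates in the first phase. The cutoff level and function name below are our own choices for illustration.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Phase I: estimate location/shape by the sample mean and covariance
    (non-robust, for illustration only). Phase II: flag points whose squared
    Mahalanobis distance exceeds a chi-square(d) quantile (calibration under
    an assumed normal main population)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # squared distances
    cutoff = chi2.ppf(1 - alpha, df=d)
    return np.where(d2 > cutoff)[0]                    # indices of flagged points
```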

Rocke and Woodruff (1996) added that there is no guarantee that observed data follow a normal distribution in practice and urged that new algorithms and tests on non-normal data should be developed. We end this discussion of outlier detection with the Mahalanobis distance by quoting Barnett (1983[22]), who provides a considerable warning about unconsciously assuming multivariate normal distributions for data analysis.

Beckman and Cook (1983) survey is limited to normal model. Surely I'm not alone in frequently (even predominantly) having to deal with nonnormal data. For univariate data we have a reasonable heritage of methodology. For multivariate data there is much to be done and great demand for appropriate model and method. I cannot agree that outliers in multivariate samples merely require straightforward generalizations of univariate techniques. The very notion of order is ambiguous and ill-defined (Barnett, 1976). Proper study of multivariate outliers involves some thorny issues.

2.6.4 Nonparametric Outlier Detection

Simplicity and the affine invariance property make the Mahalanobis distance dominate over other methods when it comes to detecting multivariate outliers. As a matter of fact, without imposing normality, there is almost no way to detect outliers in higher dimensional data (the only exception is outlier detection in bivariate samples). Shekhar (2001[206]) delivered the idea that methods for detecting outliers in multi-dimensional Euclidean space have a few limitations: first, multi-dimensional approaches assume that the data items are embedded in an isometric space and do not capture the spatial graphic structure; secondly, they do not explore a priori information about the statistical data distribution; last, they seldom provide a confidence measure for the discovered outliers. The first and the last issues are somewhat related to the discussion in Hodge and Austin (2004) about the curse of dimensionality. Lu et al. (2003[132]) also made the point that for detecting outliers with multiple attributes, traditional outlier detection approaches cannot be used properly due to the sparsity of observations in high dimensional space. They urged the redesign of distance functions to define a precise degree of proximity between points. At the time of writing this thesis, the author is not sure whether there are any attempts to draw a multivariate box plot for outlier detection, as Tukey did for univariate data, without problems caused by the curse of dimensionality. However, some nonparametric box plot type attempts exist in the bivariate case, which are presented below.

• Sunburst plot: By means of any data depth, a contour enclosing 50% of the points (a boundary of the 50% central region) is obtained. Denote this contour as a .5th level set. Then, lines are drawn from the boundary of the .5th level set to any points outside the .5th level set. Liu et.al. (1999) proposed the sunburst plot for bivariate outlier detection. Nonetheless, this method suffers from a serious problem with massive data sets: those rays become almost indistinguishable.

• Bag plot: Miller et.al. (2003[147]) suggested the bag plot. The steps of constructing a bag are as follows. First, identify the two consecutive contours $D_k$ and $D_{k-1}$ containing respectively $n_k$ and $n_{k-1}$ data points, with $n_k \le \lfloor n/2 \rfloor < n_{k-1}$. Second, calculate the value of a parameter $\lambda$, which determines the relative distance from the bag to each of the two contours: $\lambda = (50 - J)/(L - J)$, where $D_k$ contains $J\%$ of the original points and $D_{k-1}$ contains $L\%$. Last, construct the bag by interpolating between the two contours using $\lambda$: for each vertex $v$ on $D_k$ or $D_{k-1}$, let $l$ be the line going through $v$ and the Tukey median $T^*$, and let $u$ be the intersection of $l$ with the other contour ($D_{k-1}$ or $D_k$). Each vertex $w$ of the bag lies on $l$, interpolated between $v$ and $u$, i.e. $w = \lambda v + (1-\lambda)u$. The value of $\lambda$ weights the position of the bag towards $D_k$ or $D_{k-1}$.

• Pairwise bivariate box plot with B-spline smoothing: Zani et.al. (1998[236]) suggested a simple way of constructing a bivariate box plot based on convex hull peeling and B-spline smoothing. Their approach leads to a natural inner region which is completely nonparametric and retains the correlation in the observations. This inner region adapts its spread to the data distribution and defines the box of Tukey's box-and-whisker plot. The counterpart of the whisker from the box plot is the linear expansion of this inner region, based on the distance from the center to the inner region. To smooth the angular outer region caused by expanding the angular inner region, a B-spline is adopted. Zani et.al. (1998) explored their box plot with pairwise bivariate samples. Instead of pairwise bivariate box plots, a single multivariate box plot seems possible, which they did not explore. We assert that the reason for exploring pairwise bivariate box plots instead of a single multivariate box plot is that the visualization of multivariate data is not feasible. Furthermore, if the sample is large, it seems that the B-spline is not necessary because the contour of the outer region will be smooth enough.

• Generalized gap test: This test was proposed by Rohlf (1975[179]). Unlike the methods introduced previously, this method does not look for contours to set a boundary. Still, outliers can be characterized by the fact that they are somewhat isolated from the main cloud of points. Edge lengths among points are determined by the minimum spanning tree, and the presence of outliers that stick out from the main cloud of data shows up as a gap in the distribution of lengths.
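A rough sketch of this gap idea, based on the edge lengths of a minimum spanning tree, is given below. The cutoff rule (a multiple of the median edge length) and the function name are ad hoc illustrations, not Rohlf's original test statistic.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_gap_outliers(X, gap_factor=3.0):
    """Rough sketch of a gap test on minimum-spanning-tree edge lengths.

    Points incident to MST edges much longer than the typical edge (here,
    more than `gap_factor` times the median edge length) are flagged as
    candidate outliers.
    """
    D = squareform(pdist(np.asarray(X, dtype=float)))
    mst = minimum_spanning_tree(D).tocoo()          # n-1 edges of the MST
    lengths = mst.data
    cutoff = gap_factor * np.median(lengths)
    suspects = set()
    for i, j, w in zip(mst.row, mst.col, lengths):
        if w > cutoff:
            # Flagging both endpoints is a simplification; Rohlf's test
            # examines which side of the long edge is the detached cluster.
            suspects.update((int(i), int(j)))
    return sorted(suspects)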

Note that the interpolation parameter $\lambda$ of the bag plot (Miller et.al., 2003) goes to zero as the sample size increases. Without this interpolating process, the bag plot by Miller et.al. (2003) is similar to the bivariate bag plot invented by Rousseeuw et.al. (1999a[183]). Based on contours of the halfspace depth, this less sophisticated bivariate box plot (Rousseeuw et.al., 1999a) requires a .5th level set; extensions are drawn from the vertices of this .5th level set (hull) away from the median by a factor of three. Connecting the end points of these extensions creates another hull, which provides a boundary for determining outliers, just as univariate points located beyond 3 times the interquartile range are assigned as extreme outliers. Similarly, a hull extended by a factor of 1.5 (rather than 3) in distance from the vertices of the .5th level set away from the median makes it possible to assign mild outliers, as in Tukey's univariate box plot.
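The following sketch illustrates the fence construction just described: vertices of a 50% level set are pushed away from the depth median by a chosen factor, and points outside the resulting hull are flagged. The function names are illustrative, and obtaining the median and the .5th level set from some data-depth routine is assumed rather than shown.

import numpy as np
from scipy.spatial import ConvexHull

def bivariate_fence(level_set_vertices, median, factor=3.0):
    """Build an outlier fence by pushing each .5th level set vertex away from
    the depth median by `factor`, in the spirit of the box plot sketched above."""
    v = np.asarray(level_set_vertices, dtype=float)
    m = np.asarray(median, dtype=float)
    fence_vertices = m + factor * (v - m)
    return ConvexHull(fence_vertices)

def outside_fence(points, fence):
    """Flag points lying outside the fence hull."""
    A, b = fence.equations[:, :-1], fence.equations[:, -1]
    # A point x is inside the hull when A @ x + b <= 0 for every facet.
    return np.any(points @ A.T + b > 1e-12, axis=1)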

We did not include the bivariate box plot study by Goldberg and Iglewicz (1992[75]), since that box plot construction involves estimating shape parameters. Another genuinely nonparametric approach to outlier detection is the bootlier-plot by Singh and Xie (2003[211]), but difficulties arise in applications to large samples with multiple attributes. Even though the above methods have only been applied to bivariate samples, they are very likely applicable to any multivariate sample. It is not surprising that most of these methods adopt data depths to obtain the hull containing 50% of the data. The hulls from data depth seem to be less affected by outliers and preserve the structural shape of the original distribution when samples are generated from a smooth convex body distribution. Without a comprehensive discussion of multivariate outlier detection based on data depth, it is perilous to conclude that data depth is the best way to detect outliers. Nevertheless, a few studies on data depth suggested that depth based outlier detection methods are robust. He and Wang (1997[90]) pointed out that contours of depth often provide a good geometrical understanding of the structure of a multivariate data set and are not influenced by outliers. They conjectured that data depths can track the elliptic contours if the data were generated from an elliptical distribution. They added, however, that the technicalities needed to support the conjecture that depth contours delineate the original structure of the distribution are not trivial.

2.6.5 Convex Hull vs. Outliers

Although there is no study in which convex hull peeling is directly applied to outlier detection, some studies entail the idea that the convex hull process provides robust estimators by identifying possible outliers. Barnett suggested, in his celebrated 1976 paper, that the convex hull peeling process allows ordering multivariate data and detecting outliers, an idea credited to Tukey (1974[221]). Hodge and Austin (2004[94]) commented that convex hull peeling removes the records beyond the boundaries of the distributional convex hull; thus, the peeling process abrades the outliers. The peeling process can be repeated on the remaining points until it has removed a pre-specified number of records. However, Hodge and Austin (2004) were skeptical of this peeling process, which strips away points quickly since at least d+1 points are ripped off on every iteration.

Not a convex hull peeling process, but similar in terms of contouring the shape of the distribution, the minimum volume ellipsoid has been adopted for eliminating outliers (Titterington, 1978[220] and Silverman and Titterington, 1980[209]). Nolan (1992[157]) studied the asymptotic properties of multidimensional α-trimmed regions based on ellipsoids and various functionals of the convex set for multivariate location estimators. He also pointed out that the peeling process may remove too many non-outlying observations. Without involving data depth, a successful application of the convex hull process to outlier detection can be found. Bebbington (1978[26]) used a convex hull to trim outliers for a robust estimator of correlation, called the convex hull trimmed correlation coefficient. He trimmed out the points defining the vertices of the convex hull of a given sample without disturbing the general shape of the bivariate distribution, and he showed empirically that the correlation coefficient is not adversely affected. Despite its intuitive appeal and its success in applications, the analytic complexity of the convex hull peeling process and its asymptotics has hindered more advanced studies.
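In the spirit of Bebbington's convex hull trimming, the sketch below removes the vertices of the outermost convex hull and recomputes the correlation on the remaining points; the function name and the single-peel choice are illustrative assumptions, not details taken from the 1978 paper.

import numpy as np
from scipy.spatial import ConvexHull

def hull_trimmed_correlation(x, y):
    """Correlation after deleting the vertices of the outermost convex hull.

    A single peel is removed here; Bebbington-style trimming could repeat
    the peel if a heavier trim is desired.
    """
    pts = np.column_stack([x, y]).astype(float)
    hull = ConvexHull(pts)
    keep = np.setdiff1d(np.arange(len(pts)), hull.vertices)   # drop hull vertices
    trimmed = pts[keep]
    return np.corrcoef(trimmed[:, 0], trimmed[:, 1])[0, 1]

# Usage: compare plain and trimmed correlation after planting a few outliers.
rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.6, size=500)
x[0], y[0] = 12.0, -12.0
x[1], y[1] = -12.0, 12.0
x[2], y[2] = 12.0, 12.0
print(np.corrcoef(x, y)[0, 1], hull_trimmed_correlation(x, y))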

2.6.6 Comments: Outliers in Massive Data

We live in the era of massive multivariate data, in contrast to classical statistics, which mainly relies on univariate, small samples. Most outlier detection studies were based on a univariate, small sample environment. A small number of methodologies for detecting outliers in multivariate data exist, but most happen to be confined to the two dimensional world. That applications appear only for bivariate data might be attributed to limitations in graphical display. Strangely, coping with outliers in massive multivariate data still seems remote. Outliers in multivariate massive data are no longer a handful of observations, and testing outlyingness one observation at a time is an unrealistic task. In general, these massive data come with no a priori information, so imposing a multivariate normal distribution may mislead the results. Graphical visualization is impractical with more than two dimensional data since reducing dimensions disguises important pieces of information5. No simple and suitable nonparametric method exists

for detecting outliers in multivariate massive data. While massive data are being collected, we need a way to flag peculiar or unnecessary observations in order to manage data streaming efficiently. This alarming process strongly depends on how outliers are defined, which varies with the scientific interest. Despite the apparent complexity of the problem, we can still characterize outliers by the fact that they are somewhat isolated from the main clouds of points, regardless of the size and the dimension of the data. Potential outliers are points which are not internal to the main cloud of data but lie somewhere beyond its surface. Simple techniques that determine the relative order of observations with respect to other observations are what is needed.

5 This statement is not right. The author later heard about parallel coordinates by Inselberg and Dimsdale (1990[105]). Parallel coordinates do not require dimension reduction to display data of multiple attributes. The author was not able to find studies on statistical outlier detection with parallel coordinates.

2.7 Streaming Data

The main challenge in data analysis, recently, is that data volumes are much larger than the memory size of a typical computer (Bagchi et.al., 2004[15]). In addition, since data are archived over a long time period, preliminary analysis tools for redesigning data storage in the middle of the archiving process are in demand. Whether the issue is memory overflow or interruption of the data flow, such needs can be addressed suitably and adaptively with streaming data analysis. Streaming data analysis confronts particular difficulties when the target of the analysis is ordering data: ordering requires sorting the whole sample, but physical limits, namely lack of memory and waiting time, prohibit proper ordering. This section summarizes studies of sequential estimation methods that have appeared in statistics for ordering univariate data. We also present algorithmic methods from an engineering perspective, which are more empirical than statistical approaches. These empirical methods will eventually be utilized later in developing algorithms for multivariate data analysis.

2.7.1 Univariate Sequential Quantile Estimation

Getting quantiles requires ordering data, but it is not necessary to use all points at once. A space efficient recursive procedure for quantile estimation was proposed by Tierney (1983[219]). The concern was estimating a quantile of an unknown distribution with accuracy similar to the whole sample quantile while using a fixed amount of storage space; the discrepancy between the quantile estimate from parts of the sample and the sample quantile vanishes asymptotically. This method is dubbed the simple stochastic approximation (Hurley and Modarres, 1995). Chen et.al. (2000[50]) developed the exponentially weighted stochastic approximation (EWSA) estimator as an extension of the simple stochastic approximation. Hurley and Modarres (1995[103]) surveyed methods of low storage quantile estimation beyond the stochastic approximation. They introduced a further quantile estimation method, which they compared to the stochastic approximation as well as to minimax trees and the remedian by simulation studies; they examined the statistical and computational properties of these methods for quantile estimation from a data set of n observations while using far fewer than n memory locations. Another notable stochastic approximation method for sequential quantile estimation is the single pass, low storage, arbitrary quantile estimation for massive data sets proposed by Liechty et.al. (2003). This method updates quantiles by interpolation between new inputs and selected stored data (low storage), which is very attractive because it produces any quantiles of interest in a single pass. When these univariate sequential quantile estimation methods are applied, it often suffices to provide empirically reasonable approximations to sample quantiles instead of precise values, a direction commonly found in the engineering literature.
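A minimal single-pass sketch in the stochastic approximation spirit is shown below. It is a bare Robbins-Monro style update, not Tierney's full procedure (which also adaptively estimates the density at the quantile to tune the step size), and the tuning constant is an arbitrary choice.

import numpy as np

def sa_quantile(stream, p, c=1.0):
    """Single-pass stochastic-approximation estimate of the p-th quantile.

    The estimate moves up when an observation exceeds it and down otherwise,
    with a step size shrinking as 1/n; `c` is a tuning constant.
    """
    q = None
    for n, x in enumerate(stream, start=1):
        if q is None:
            q = float(x)                    # initialize at the first observation
            continue
        step = c / n
        q += step * (p - (x <= q))          # indicator-based update
    return q

# Usage: the estimate approaches the true median of a standard normal stream.
print(sa_quantile(np.random.default_rng(0).normal(size=100000), 0.5))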

2.7.2 ε-Approximation

In univariate streaming data analysis, the ε-approximation approach is the most frequently adopted method to control the size of the data packet under processing when all data from the stream cannot be incorporated. Numerous studies exist on the problem of computing quantiles of large streaming online or disk-resident data using small main memory (e.g. Manku et.al., 1998[137]; Lin et.al., 2004[125]; and Greenwald and Khanna, 2001[78]). Manku et.al. (1998) proposed desirable properties of ε-approximate algorithms. Those properties are:

1. an algorithm must not require prior knowledge of new inputs

2. an algorithm guarantees explicit and tunable approximation

3. an algorithm computes results in a single pass

4. an algorithm produces multiple quantiles at no extra cost

5. an algorithm uses as little memory as possible

6. and an algorithm must be simple to code.

Here we adopt the definition of ε-approximate from Lin et.al. (2004).

Definition 2.6 (ε-approximate). A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank is guaranteed to be within the interval [r − εN, r + εN].

Lin et.al. (2004[125]) studied the problem of continuously maintaining a quantile summary of the most recently observed N elements of a stream so that quantile queries can be answered with a guaranteed precision of εN. They developed a space efficient algorithm, for a pre-defined ε, that requires only a single pass over the data stream. Their computational performance study indicated that not only is the actual quantile estimation error far below the guaranteed precision, but the space requirement is also much less than the given theoretical bound. Greenwald and Khanna (2001[78]) added that the ε-approximate quantile summary does not require a priori knowledge of the length of the input sequence.
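The guarantee of Definition 2.6 can be checked offline once the full sequence is available. The small helper below does exactly that for a single query; it is a verification aid with hypothetical names, not one of the summary algorithms cited above.

import numpy as np

def is_eps_approximate(data, queried_rank, returned_value, eps):
    """Check the Definition 2.6 guarantee for one query against the full data.

    `queried_rank` is the requested rank r (1-based), `returned_value` is the
    summary's answer, and `eps` is the allowed rank error as a fraction of N.
    """
    x = np.sort(np.asarray(data))
    n = len(x)
    # Range of ranks at which `returned_value` can sit (ties included).
    lo = np.searchsorted(x, returned_value, side="left") + 1
    hi = np.searchsorted(x, returned_value, side="right")
    return (lo <= queried_rank + eps * n) and (hi >= queried_rank - eps * n)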

2.7.3 Sliding Windows

Adapting low storage methods to streaming data implies sampling from the data stream and fitting the filtered data within a fixed low storage space. This sampling scheme is called a sliding window in the data management literature, and a plethora of sliding window algorithms for streaming data exists in computer science. Xu et.al. (2004[233]) classified diverse sliding window algorithms into two categories.

• Nf Window (Natural form window): takes the most recent N points at a time.

• T Window: takes the data that arrived in the last T time units.

By definition, the Nf window may overlap or skip some portion of the streaming data. On the other hand, when the data arrival rate is known, the T window removes concerns about overlapping or skipping data. In scientific studies, the data archiving rates are controlled, so the T window scheme is more frequently used.
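A minimal sketch of the two window types, using a double-ended queue for each, is given below; the class names and the use of wall-clock seconds for the T window are illustrative assumptions.

from collections import deque
import time

class NfWindow:
    """Natural-form window: keep the most recent N points."""
    def __init__(self, n):
        self.buf = deque(maxlen=n)
    def push(self, x):
        self.buf.append(x)
    def items(self):
        return list(self.buf)

class TWindow:
    """Time-based window: keep points that arrived within the last T seconds."""
    def __init__(self, t_seconds):
        self.t = t_seconds
        self.buf = deque()                     # (timestamp, value) pairs
    def push(self, x, now=None):
        now = time.time() if now is None else now
        self.buf.append((now, x))
        while self.buf and now - self.buf[0][0] > self.t:
            self.buf.popleft()                 # expire old points
    def items(self):
        return [x for _, x in self.buf]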

2.7.4 Streaming Data in Astronomy

Astronomy is the oldest science of data analysis. Vast amounts of archived astronomical data have led to new discoveries that changed our view of the world. As the new century emerged, the pattern of astronomical data archiving began experiencing rapid changes in both the temporal and spatial dimensions. The rapid growth rate of data arrival in time and the capability of simultaneously recording many variables changed the strategy of analyzing data. These rapid changes are attributed to new scientific and computational technology. Virtual observatories, capable of maintaining all collected astronomical data catalogs on top of database management systems from computer science, have become popular worldwide (the National Virtual Observatory6 and the European Virtual Observatory7 are the best known). These collected data, ranging from historic data catalogs to new observations from recent sky survey projects,8 which typically produce terabytes of data a day, have mostly become public only recently, but proper statistics to confront such huge data are not available. The data archives are growing at a steep exponential rate. Detailed descriptions of the properties of astronomical massive data are found in Brunner et.al. (2001[36]), Djorgovski et.al. (2001[57]), and Longo et.al. (2001[131]).

6http://www.virtualobservatory.org/
7http://www.euro-vo.org/pub/index.html
8A list of survey projects is found at http://www.cv.nrao.edu/fits/www/yp survey.html

As a matter of fact, traditional statistics is unable to cope with such huge data and is giving way to computer science when it comes to analyzing massive data. Nonetheless, traditional statistical methods are still practiced by taking parts of the sample from survey data. Even though taking a small segment from huge data catalogs can be justified by prior scientific knowledge, astronomers should be aware that new statistics for massive data has to replace traditional small sample statistics. Astronomers are urged to change their perspective on analyzing data to cope with data growing quickly in time and space, since they cannot rely on their eyes for massive data sets with many attributes. They should also recognize that taking only small parts from huge data catalogs wastes massive database management technologies. Furthermore, statisticians are asked to open their eyes and develop statistics for massive data to confront the needs of the 21st century, rather than yielding to computer scientists who brute force data into machines without careful thought about the properties of statistics and of the data set itself.

2.7.5 Comments

Determining changes in the distribution of a large, high-dimensional data set is a difficult problem. Coordinate-wise naive approaches will not detect many types of distributional changes because coordinate oriented methods lack robustness. Classical statistical tests usually apply to small data sets and now face three problems with large high-dimensional data. First, it is likely impossible to fit all the data into memory at once. Partial sampling is a common refuge, but the sampling, however well designed, comes with pitfalls, particularly where outliers are concerned. Second, even if the samples are small in size, dimensionality becomes an obstacle. Existing techniques for nonparametric analysis of multivariate data, such as clustering, multidimensional scaling, principal components analysis, and others, become expensive as the number of dimensions increases. Third, classical multivariate statistics has centered heavily on the assumption of normality, primarily because of the convenience of closed form solutions and easy applications. When the data sets are large, assumptions of homogeneity (e.g. iid observations) that underlie classical statistical inference might not be valid. Therefore, success over the challenges of multivariate massive data is expected from developing new nonparametric multivariate analysis tools that incorporate streaming data.

Streaming data have emerged recently in the fields of databases and data management because of the avalanches of data. Disappointingly, streaming data algorithms have only been developed for univariate data, and these algorithms for controlling data streams are very empirical. Gilbert et.al. (2005[73]) stated that in applications, quantile estimation algorithms for streaming data suffice to provide reasonable approximations to quantile estimates and need not produce precise values. Methodologies for controlling the traffic of streaming data depend entirely on the experience of data miners. Additionally, the lack of universal metrics on multivariate data causes the deficiency of algorithms for multivariate streaming data. In Chapter 5, we will propose a few empirical methods for multivariate streaming data quantile estimation, which might serve general purposes for massive data analysis.

Chapter 3

Jackknife Approach to Model Selection

3.1 Introduction

Model selection, in general, is a procedure for exploring candidate models and choosing the one among the candidates that minimizes the distance from the true model. One of the well known distance measures is the Kullback-Leibler (KL) distance (Kullback and Leibler, 1951[119]),

$$D_{KL}[g; f_\theta] := \int g(x)\log g(x)\,dx - \int g(x)\log f_\theta(x)\,dx = E_g[\log g(X)] - E_g[\log f_\theta(X)] \ge 0,$$

where $g(X)$ is the unknown true model, $f_\theta(X)$ is the estimating model with parameter $\theta$, and $X$ is a random variable of observations. The purpose of an information criterion is to find, from a set of candidates $\{f_\theta : \theta \in \Theta^p \subset \mathbb{R}^p\}$, the model that describes the data best. It is not necessary that this parametric family of distributions contain the true density function $g$, but it is assumed that $\theta_g$ exists such that $f_{\theta_g}$ is closest to $g$ in Kullback-Leibler distance. A fitted model $f_{\hat\theta}$ will be estimated and one wishes to assess the distance between $f_{\hat\theta}$ and $g$. Since

$E_g[\log g(X)]$ is unknown but constant, we only need to obtain the estimator of $\theta_g$ that maximizes the likelihood $E_g[\log f_\theta(X)]$. Consequently, the maximum likelihood principle plays a role in choosing a model that minimizes the KL distance estimate. Akaike (1973[7], 1974[8]) first introduced a model selection criterion (AIC) that leads to the model of minimum KL distance. For an unknown density $g$, AIC chooses the model that maximizes the estimate of $E_g[\log f_\theta(X)]$. Yet, the bias from estimating the likelihood as well as the parameters with the same data cannot be avoided. This bias appears as a penalty term in AIC, in proportion to the number of parameters in a model. Despite its simplicity, a few drawbacks have been reported: the tendency to pick an overfitted model (Hurvich and Tsai, 1989[104]), the lack of consistency in choosing the correct model (McQuarrie et.al., 1997[146]), and the limited application to models from the same parametric family of distributions (Konishi and Kitagawa, 1996[116]). As remedies for these shortcomings of AIC, several information criteria with improved bias estimation were proposed. To adjust the inconsistency of AIC, especially for small sample sizes, Hurvich and Tsai (1989[104]) and McQuarrie et.al. (1997[146]) introduced the corrected Akaike information criterion (AICc) and the unbiased Akaike information criterion (AICu), respectively. The penalty terms in their information criteria estimate the bias more consistently, so AICc and AICu select the correct model more often than AIC; on the other hand, AICc and AICu apply only to regression models with normal error distributions. To loosen the assumptions on bias estimation, other information criteria were presented, such as the Takeuchi information criterion (TIC, Takeuchi, 1976[218]), the generalized information criterion (GIC, Konishi and Kitagawa, 1996[116]), and the information complexity (ICOMP, Bozdogan, 2000[33]). Apart from bias estimation of the expected log likelihood, bias reduction through statistical resampling methods has produced model selection criteria without assumptions such as a parametric family of distributions. Popular resampling methods are cross validation, jackknife, and bootstrap. Stone (1977[215]) showed that estimating the expected log likelihood by cross validation is asymptotically equivalent to AIC. Several attempts with the bootstrap method were made for model selection; however, the bootstrap information criterion was no better than AIC (Chung et.al., 1996[53]; Shibata, 1997[207]; and Ishiguro et.al., 1997[107]). While the asymptotic properties of cross validation and bootstrap model selection, and their applications, have been well studied, applications of the jackknife resampling method to model selection are hardly found in the literature. The jackknife estimator is expected to reduce the bias of the expected log likelihood incurred in estimating the KL distance. Assume that the bias $b_n$ based on $n$ observations arises from estimating the KL distance and satisfies the expansion

$$b_n = \frac{1}{n}b_1 + \frac{1}{n^2}b_2 + O(n^{-3}).$$

Let $b_{n,-i}$ be the estimate of the bias without the $i$th observation. Denoting the jackknife bias estimate by $b_J$, the order of the jackknife bias estimate is $O(n^{-2})$, reduced from the order $O(n^{-1})$ of the ordinary bias, since

$$b_J = \frac{1}{n}\sum_{i=1}^{n}\bigl(n b_n - (n-1)b_{n,-i}\bigr) = b_1 + \frac{1}{n}b_2 - \Bigl(b_1 + \frac{1}{n-1}b_2\Bigr) + O(n^{-3}) = -\frac{1}{n(n-1)}b_2 + O(n^{-3}).$$

Furthermore, this bias reduction allows us to relax model constraints on estimating the bias, such as requiring the same parametric family of distributions. Very few studies of the jackknife principle are found in model selection compared to the bootstrap and cross-validation methods, possibly because of the perception that the jackknife is asymptotically similar to either the bootstrap or cross-validation, as similarities among these resampling methods have been reported in diverse statistical problems. In this chapter, we develop the jackknife information criterion (JIC) for model selection based on the KL distance measure as a competing model selection criterion. Section 2 describes the necessary assumptions and discusses the strong convergence of the maximum likelihood estimator when the true model is unspecified. Section 3 presents the uniform strong convergence of the jackknife maximum likelihood estimator toward the new model selection criterion with the jackknife resampling scheme. Section 4 provides the definition of JIC and studies its asymptotic properties under the regularity conditions. Section 5 provides a discussion and some comparisons between JIC and popular information criteria when the true model is unknown. Section 6 reconsiders the consistency of JIC with minimal assumptions on the likelihood in model selection.
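As a small numerical illustration of the delete-one bias reduction sketched above, the following assumed helper jackknife-corrects an arbitrary scalar estimator; the variance example at the end is a standard toy case in which the O(1/n) bias of the maximum likelihood estimator is largely removed.

import numpy as np

def jackknife_bias_corrected(estimator, x):
    """Delete-one jackknife bias estimate and bias-corrected estimate.

    `estimator` maps a 1-D sample to a scalar; the bias estimate is
    (n-1) * (mean of leave-one-out estimates - full-sample estimate),
    and the corrected estimate subtracts it.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_full = estimator(x)
    theta_loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    bias_jack = (n - 1) * (theta_loo.mean() - theta_full)
    return theta_full - bias_jack, bias_jack

# Toy example: the MLE of the variance, x.var(), has bias -sigma^2/n.
sample = np.random.default_rng(0).normal(size=50)
corrected, bias = jackknife_bias_corrected(lambda a: a.var(), sample)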

3.2 Preliminaries

Suppose that the data points $X_1, X_2, \ldots, X_n$ are iid with common density $g$ and $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ is a class of candidate models. Let the log likelihood $\log f_\theta(X_i)$ of $X_i$ be denoted by $l_i(\theta)$. The log likelihood function of all the observations and the log likelihood function without the $i$th observation are denoted by $L(\theta) = \sum_{i=1}^{n} l_i(\theta)$ and $L_{-i}(\theta) = \sum_{j\neq i} l_j(\theta)$, respectively. Unless otherwise noted, we assume the following throughout our study:

(J0) The $p$-dimensional parameter space $\Theta^p$ is a compact subset of $\mathbb{R}^p$.

(J1) For any $\theta \in \Theta^p$, each $f_\theta \in \mathcal{M}$ is distinct, i.e. if $\theta_1 \neq \theta_2$, then $f_{\theta_1} \neq f_{\theta_2}$.

p (J2) A unique parameter θg exists in the interior of Θ and satisfies

Eg[log fθg (Xi)] = max Eg[log fθ(Xi)]. θ∈Θp

(J3) The log likelihood $l_i(\theta)$ is thrice continuously differentiable with respect to $\theta \in \Theta^p$, and the derivatives are denoted by $\nabla_\theta l_i(\theta)$, $\nabla^2_\theta l_i(\theta)$, and $\nabla^3_\theta l_i(\theta)$, in that order. Also, $|l_i(\theta)|$, $|\frac{\partial}{\partial\theta_k} l_i(\theta)|$, $|\frac{\partial^2}{\partial\theta_k\partial\theta_l} l_i(\theta)|$, and $|\frac{\partial^3}{\partial\theta_k\partial\theta_l\partial\theta_m} l_i(\theta)|$ are dominated by $h(X_i)$, which is non-negative, does not depend on $\theta$, and satisfies $E[h(X_i)] < \infty$ ($k, l, m = 1, \ldots, p$).

(J4) For any $\theta \in \Theta^p$, $E_g[\nabla_\theta l_i(\theta)\nabla_\theta^T l_i(\theta)]$ and $-E_g[\nabla^2_\theta l_i(\theta)]$ are finite and positive definite $p \times p$ matrices. If $g = f_{\theta_g}$, then $E_g[\nabla_\theta l_i(\theta)\nabla_\theta^T l_i(\theta)] = -E_g[\nabla^2_\theta l_i(\theta)]$.

(J5) Let $\Lambda$ be a $p \times p$ non-singular matrix such that $\Lambda = -E_g[\nabla^2_\theta l_i(\theta_g)]$.

Remark on (J1). Admittedly, the identifiability constraint on the parameters is very restrictive. However, this strong condition eliminates long disputes on the problems caused by parameters near the boundary. When candidate models are nested, for example in mixtures and regression models, it is hard to tell whether the proportion of the additional component is close to zero or the parameters of the additional components are close to those of the smaller models. Due to this ambiguity in parameter space, traditional model selection cannot be compared directly to the jackknife criterion, which will be discussed later. Discussions of these breakdowns of parameters near the boundary with regard to identifiability problems are easily found (e.g. Protassov et.al., 2002[165] and Bozdogan, 2000[33]).

Remark on (J2). If $\mathcal{M}$ contains $g$, not only does $\theta_g$ maximize $E_g[\log f_\theta(X_i)]$ but the KL distance $D_{KL}[g; f_{\theta_g}]$ also becomes zero. However, $g$ is likely to be unknown, so the estimate of the density $f_{\theta_g}$ that minimizes the KL distance is sought as a surrogate model for $g$. Also, note that the parameter $\theta_g$ satisfies $E_g[\nabla_\theta l_i(\theta_g)] = 0$ and $E_g[\log f_\theta(X_i)] < E_g[\log f_{\theta_g}(X_i)]$ for any $\theta \in \Theta^p$, where the strict inequality $<$ holds instead of $\le$ because of the uniqueness of $\theta_g$ (by J2) and the dominated convergence theorem (by J3, $E_g[|\log f_\theta(X_i)|] < \infty$).

Additional Remark. Candidate models for model selection criteria are not necessarily from the same parametric family. We recommend candidate models $f_{\theta_1}$ and $f_{\theta_2}$ from different parametric families, so that $f_{\theta_1} \in \mathcal{M}_1$ and $f_{\theta_2} \in \mathcal{M}_2$ with $\mathcal{M}_1 \neq \mathcal{M}_2$.

We now show that the maximum likelihood estimator $\hat\theta_n$ of $\theta_g$ converges almost surely to the parameter of interest $\theta_g$ that minimizes the KL distance.

Theorem 3.1. If $\hat\theta_n$ is a function of $X_1, \ldots, X_n$ such that $\nabla_\theta L(\hat\theta_n) = 0$, then
$$\hat\theta_n \xrightarrow{a.s.} \theta_g.$$
In other words, the maximum likelihood estimator $\hat\theta_n$ is the minimum discrepancy estimator of $\theta_g$.

Proof. It is sufficient to prove the statement $B: P\{\lim_{n\to\infty}\|\hat\theta_n - \theta_g\| < \epsilon\} = 1$ for all $\epsilon > 0$, where $\hat\theta_n$ satisfies
$$A: \prod_{i=1}^{n} f_{\hat\theta_n}(X_i) \ge \prod_{i=1}^{n} f_\theta(X_i), \quad \forall\, \theta \in \Theta^p,\ \forall\, n. \qquad (3.1)$$
We will prove this almost sure convergence by negating $A \Rightarrow B$. Suppose $B$ does not hold. Then there exists a subsequence $\{\hat\theta_{n_k}\}$ of $\hat\theta_n$ such that $\hat\theta_{n_k} \to \bar\theta$ a.e. for some $\bar\theta \neq \theta_g$ with $\bar\theta \in \Theta^p$. Let $\epsilon > 0$ be such that $\|\bar\theta - \theta_g\| \ge \epsilon$. In addition, from the assumptions

(J0)-(J2), for some $\delta > 0$,
$$E[l_1(\bar\theta) - l_1(\theta_g)] = -\delta < 0. \qquad (3.2)$$

By the strong law of large numbers (SLLN) and (3.2),
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\theta_g)] \xrightarrow{a.s.} -\delta.$$
Thus, for all large $n$,

$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\theta_g)] < -\delta/2 \quad a.s. \qquad (3.3)$$
As $\hat\theta_{n_k} \to \bar\theta$ a.s., for all large $k$ we have, almost surely,
$$\|\bar\theta - \hat\theta_n\| < \frac{\delta}{8E[h(X_1)]},$$
where $E[h(X_1)]$ is assumed to be bounded from (J3). By the SLLN and (J3),
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\hat\theta_n)] \le \|\bar\theta - \hat\theta_n\|\,\frac{1}{n}\sum_{i=1}^{n}h(X_i) < \frac{\delta}{8E[h(X_1)]}\,\frac{1}{n}\sum_{i=1}^{n}h(X_i) < \frac{\delta}{4} \quad i.o.\ a.s. \qquad (3.4)$$

Now, (3.3) and (3.4) imply that
$$P\left(\sum_{i=1}^{n}[l_i(\hat\theta_n) - l_i(\theta_g)] < 0,\ i.o.\right) = 1.$$
Therefore,
$$P\left(\prod_{i=1}^{n} f_{\hat\theta_n}(X_i) < \prod_{i=1}^{n} f_{\theta_g}(X_i)\ i.o.\right) = 1,$$
which violates (3.1) for $\theta = \theta_g$. This completes the proof. $\Box$

Under the assumptions (J0)-(J2), the maximum likelihood estimator $\hat\theta_n$ converges almost surely to $\theta_g$ even if the true model is unspecified. This strong consistency is generally assumed in statistical model selection studies with an unspecified or misspecified true model (e.g. Nishii, 1988[155]; Sin and White, 1996[210]).

For the particular case $g = f_{\theta_g}$, this strong convergence is easily proved with Wald's approach (1949[226]). When $\theta_g$ is the true model parameter, Stone (1977[215]) showed that the maximum likelihood estimate $\hat\theta_n$ converges to $\theta_g$ in probability and proved that cross validation is equivalent to AIC. AIC evaluates a fitted model against the true model in terms of minimizing the KL distance through maximum likelihood estimation, so AIC can only be used under the circumstance that the target parametric family of distributions contains the true model, which is not likely to happen in scientific problems. Another popular model selection criterion, BIC, also assumes equal priors for candidate models as well as an exponential family of distributions.

Although the maximum likelihood estimator $\hat\theta_n$ is strongly consistent for the parameter of interest $\theta_g$ (the parameter that minimizes the KL distance), there is no guarantee that this $\hat\theta_n$ produces an unbiased likelihood estimate. Similar problems arise in other well known model selection criteria. Unlike other information criteria that took the bias estimation approach to the likelihood estimation, we will take the bias reduction approach based on the jackknife resampling method.

3.3 Jackknife Maximum Likelihood Estimator

The jackknife is well known for bias correction and is computationally cheap relative to the bootstrap. To begin developing a model selection criterion from the jackknife, we first show that the jackknife maximum likelihood estimator $\hat\theta_{-i}$ converges strongly to $\theta_g$, uniformly in $i$.

Theorem 3.2. Let $\hat\theta_{-i}$ be a function of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$ satisfying $\nabla_\theta L_{-i}(\hat\theta_{-i}) = 0$ for any $i \in [1, n]$. Then, uniformly in $i$, $\hat\theta_{-i}$ converges to $\theta_g$ almost surely:
$$P\left(\lim_{n\to\infty}\max_{1\le i\le n}\|\hat\theta_{-i} - \theta_g\| \le \epsilon\right) = 1, \quad \forall\,\epsilon > 0.$$

Proof. Let $\{\tau_n\}$ be a sequence such that $\tau_n = \max_{1\le i\le n}\|\hat\theta_{-i} - \theta_g\|$, where $\hat\theta_{-i}$ satisfies
$$A: \prod_{j=1, j\neq i}^{n} f_{\hat\theta_{-i}}(X_j) \ge \prod_{j=1, j\neq i}^{n} f_\theta(X_j), \quad \forall\, \theta \in \Theta^p,\ \forall\, i \in [1, n]. \qquad (3.5)$$
It is sufficient to show that $\tau_n$ converges to zero almost surely (statement $B$). We prove this uniform almost sure convergence in a similar way to Theorem 3.1, by negating the statement $A \Rightarrow B$ into $\neg B \Rightarrow \neg A$.

Suppose $\tau_n \not\to 0$ a.e., so that $\|\hat\theta_{-i(n)} - \theta_g\| \not\to 0$ a.e. for some $i(n): 1 \le i(n) \le n$. Let $\bar\theta$ be a limit point of $\{\hat\theta_{-i(n)}\}$. For the same $\delta$ as in (3.2) and $E[h(X_1)] < \infty$ as in (J3),
$$\|\bar\theta - \hat\theta_{-i(n)}\| < \frac{\delta}{8E[h(X_1)]} \quad i.o.$$
Then,
$$\frac{1}{n-1}\sum_{j=1, j\neq i(n)}^{n}[l_j(\bar\theta) - l_j(\hat\theta_{-i(n)})] \le \|\bar\theta - \hat\theta_{-i(n)}\|\,\frac{1}{n-1}\sum_{j=1, j\neq i(n)}^{n}h(X_j) \le \|\bar\theta - \hat\theta_{-i(n)}\|\,\frac{1}{n-1}\sum_{j=1}^{n}h(X_j) < \frac{\delta}{8E[h(X_1)]}\,\frac{n}{n-1}\,\frac{1}{n}\sum_{j=1}^{n}h(X_j) < \frac{\delta}{4} \quad i.o.\ a.s.$$
Hence,
$$P\left(\frac{1}{n-1}\sum_{j=1, j\neq i(n)}^{n}[l_j(\hat\theta_{-i(n)}) - l_j(\theta_g)] < 0 \ i.o.\right) = 1$$
and
$$P\left(\prod_{j=1, j\neq i(n)}^{n} f_{\hat\theta_{-i(n)}}(X_j) < \prod_{j=1, j\neq i(n)}^{n} f_{\theta_g}(X_j)\ i.o.\right) = 1. \qquad (3.6)$$
This result violates (3.5), which completes the proof. $\Box$

We have shown that the jackknife maximum likelihood estimator converges uniformly to $\theta_g$, the parameter that minimizes the KL distance. The strong consistency of maximum likelihood estimators has been discussed extensively in the literature (e.g. Wald, 1949[226]; LeCam, 1953[122]; and Huber, 1967[97]). Based on the regularity conditions and the strong convergence of the maximum likelihood estimators discussed so far, we take the jackknife approach to a model selection criterion in the next section.

3.4 Jackknife Information Criterion (JIC)

In this section, we define the jackknife information criterion (JIC) and investigate the asymptotic characteristics of the jackknife model selection principle. In contrast to the traditional theoretical formulas, the jackknife does not explicitly require assumptions on candidate models (Shao, 1995[203]). We only assume regularity conditions to ensure the strong consistency of the maximum likelihood estimators and the asymptotic behavior of the jackknife model selection. Deriving an information criterion starts from estimating the expected log likelihood. Before deriving JIC, we first define the jackknife estimator of the log likelihood function and show that this jackknife estimator is asymptotically unbiased for the expected log likelihood function. When the bias corrected jackknife estimator $\bar\tau$ is defined as
$$\bar\tau = \hat\tau - b_J,$$
$b_J$ is a jackknife bias estimator and $\hat\tau$ is the target estimator. The jackknife bias estimator is then obtained by
$$b_J = (n-1)\,\frac{1}{n}\sum_{i=1}^{n}(\hat\tau_{-i} - \hat\tau),$$
with $\hat\tau_{-i}$ the same estimator as $\hat\tau$ computed without the $i$th observation. Similarly, we formulate the bias corrected jackknife estimator of the log likelihood.

Definition 3.1 (Bias corrected jackknife log likelihood). Let the jackknife estimator of the log likelihood $J_n$ be
$$J_n = nL(\hat\theta_n) - \sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}). \qquad (3.7)$$

The strong convergence of $\hat\theta_n$ and $\hat\theta_{-i}$ leads to the following theorem.

Theorem 3.3. Let $X_1, X_2, \ldots, X_n$ be iid random variables from a density function $g$ of an unknown distribution and let $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ be a class of models. If the pair $(g, \mathcal{M})$ meets (J0)-(J5), then the jackknife log likelihood estimator $J_n$ satisfies
$$J_n = nL(\hat\theta) - \sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}) = L(\theta_g) + \frac{1}{n}\sum_{i\neq j}\nabla l_i(\theta_g)^T \Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n, \qquad (3.8)$$
where $\epsilon_n$ is a random variable satisfying $\lim_{n\to\infty} E_g[\epsilon_n] = 0$ and $\|\epsilon_n\| \xrightarrow{a.s.} 0$. Additionally, the fact that $E_g[\frac{1}{n}\sum_{i\neq j}\nabla l_i(\theta_g)^T \Lambda^{-1}\nabla l_j(\theta_g)] = 0$ implies that $J_n$ is an asymptotically unbiased estimator of the log likelihood.

Proof. We begin by grouping the components of $J_n$:
$$J_n = \sum_{i=1}^{n}[L(\hat\theta) - L_{-i}(\hat\theta_{-i})] = \sum_{i=1}^{n}[L_{-i}(\hat\theta) + l_i(\hat\theta) - L_{-i}(\hat\theta_{-i}) - l_i(\theta_g) + l_i(\theta_g)] = \sum_{i=1}^{n}[L_{-i}(\hat\theta) - L_{-i}(\hat\theta_{-i})] + \sum_{i=1}^{n}[l_i(\hat\theta) - l_i(\theta_g)] + \sum_{i=1}^{n} l_i(\theta_g).$$
By Taylor expansion, the first and second terms respectively become
$$\sum_{i=1}^{n}[L_{-i}(\hat\theta) - L_{-i}(\hat\theta_{-i})] = \sum_{i=1}^{n}\nabla L_{-i}(\hat\theta_{-i})(\hat\theta - \hat\theta_{-i}) + \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) = \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i})$$
and

$$\sum_{i=1}^{n}[l_i(\hat\theta) - l_i(\theta_g)] = L(\hat\theta) - L(\theta_g) = -(\theta_g - \hat\theta)^T\nabla L(\hat\theta) - \frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta),$$
where $\xi = \theta_g + \delta(\hat\theta - \theta_g)$ and $\eta_i = \hat\theta + \gamma_i(\hat\theta_{-i} - \hat\theta)$ for $\delta$ and $\gamma_i$ such that $0 < \delta, \gamma_i < 1$. Hence,
$$J_n = L(\theta_g) + \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) - \frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta). \qquad (3.9)$$
First, consider $\sum_i(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i})$ in eq. (3.9). From Theorems 3.1 and 3.2, the maximum likelihood estimates $\hat\theta_n$ and $\hat\theta_{-i}$ are confined such that $\max\{\|\hat\theta_n - \theta_g\|, \max_i\|\hat\theta_{-i} - \theta_g\|\} \le \epsilon$ for any $\epsilon > 0$. Also note that $\max_i\|\eta_i - \theta_g\| \le \max(\|\hat\theta_n - \theta_g\|, \max_i\|\hat\theta_{-i} - \theta_g\|)$. Thus, the following is established.

$$\max_{1\le i\le n}\left\|\frac{1}{n}\nabla^2 L_{-i}(\eta_i) + \Lambda\right\| \le \max_{1\le i\le n}\frac{1}{n}\left\|\nabla^2 L_{-i}(\eta_i) - \nabla^2 L(\theta_g)\right\| + \left\|\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\right\|.$$
Here $\max_{1\le i\le n}\frac{1}{n}\|\nabla^2 L_{-i}(\eta_i) - \nabla^2 L(\theta_g)\| \xrightarrow{a.s.} 0$ and $\|\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\| \xrightarrow{a.s.} 0$. Then,
$$\max_{1\le i\le n}\left\|\frac{1}{n}\nabla^2 L_{-i}(\eta_i) + \Lambda\right\| \xrightarrow{a.s.} 0. \qquad (3.10)$$

In addition,
$$-\nabla l_i(\hat\theta) = \nabla L(\hat\theta) - \nabla L_{-i}(\hat\theta_{-i}) - \nabla l_i(\hat\theta) = \nabla L_{-i}(\hat\theta) - \nabla L_{-i}(\hat\theta_{-i}) = \nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}).$$
By using (3.10), the above equation gives
$$(\hat\theta - \hat\theta_{-i}) = \frac{1}{n}\Lambda^{-1}\nabla l_i(\hat\theta) + o(n^{-1}) \quad a.s., \qquad (3.11)$$

uniformly in $i$. Let $\alpha\epsilon_n = \|\hat\theta - \theta_g\|$, where $\alpha$ is an arbitrary constant and $\epsilon_n$ is a random variable satisfying $\lim_{n\to\infty} E[\epsilon_n] = 0$ and $\epsilon_n \xrightarrow{a.s.} 0$. Since $|\nabla^2 l_i(\eta)|$ is bounded,
$$\nabla l_i(\hat\theta) = \nabla l_i(\theta_g) + \nabla^2 l_i(\eta)(\hat\theta - \theta_g)$$
is equivalent to
$$\nabla l_i(\hat\theta) = \nabla l_i(\theta_g) + \alpha^*\epsilon_n, \qquad (3.12)$$
for some constant $\alpha^*$. Therefore, combining eq. (3.10), eq. (3.11), and eq. (3.12) gives

$$\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) = -\frac{1}{n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \epsilon_n.$$

At this point, we note that, without loss of generality, $a\epsilon_n$ for any constant $a$ is denoted by $\epsilon_n$, since $\lim_{n\to\infty} E[a\epsilon_n] = \lim_{n\to\infty} E[\epsilon_n] = 0$ and $a\epsilon_n = o(1)$ a.s. Similarly, the last term of eq. (3.9) becomes

$$(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + \epsilon_n.$$

As $\|\xi - \theta_g\| \le \|\hat\theta - \theta_g\| = o(1)$ a.s.,

$$\frac{1}{n}\|\nabla^2 L(\xi) - \nabla^2 L(\theta_g)\| \xrightarrow{a.s.} 0.$$

With the SLLN,

$$\nabla^2 L(\xi) = \nabla^2 L(\theta_g) + n\Lambda - n\Lambda - \nabla^2 L(\theta_g) + \nabla^2 L(\xi) = n\left(\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\right) + n\left(\frac{\nabla^2 L(\xi) - \nabla^2 L(\theta_g)}{n}\right) - n\Lambda = -n(\Lambda + o(1)) \quad a.s. \qquad (3.13)$$

Additionally, we have

$$\nabla L(\theta_g) = \nabla L(\theta_g) - \nabla L(\hat\theta) = \nabla^2 L(\xi)(\theta_g - \hat\theta) \qquad (3.14)$$
and substituting eq. (3.13) into eq. (3.14) gives

$$(\theta_g - \hat\theta) = -\frac{1}{n}\Lambda^{-1}\nabla L(\theta_g) + o(n^{-1}) \quad a.s.$$

Thus,

$$(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + o(n^{-1}) \quad a.s.$$

Consequently, eq.(3.9) becomes

$$J_n = L(\theta_g) - \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + \epsilon_n$$
$$= L(\theta_g) - \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n$$
$$= L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n. \qquad (3.15)$$

Note that $J_n$ is an asymptotically unbiased estimator of the expected log likelihood:

$$E\left[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)\right] = E\left[E\left[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)\,\Big|\,X_i\right]\right] = \frac{1}{2n}\sum_{i\neq j}E\left[\nabla l_i(\theta_g)^T\Lambda^{-1}\right]E\left[\nabla l_j(\theta_g)\right] = 0.$$

This completes the proof. $\Box$

Asymptotically, no bias term is involved in $J_n$, in contrast to the cross validation and bootstrap estimators of the log likelihood (Shao, 1993[204]; Chung et.al., 1996[53]; and Shibata, 1997[207]). Therefore, the model selection criterion based on the jackknife method is expected to behave differently in its asymptotics from the selection criteria of other resampling methods. Here we investigate the jackknife principle further as a model selection criterion by proposing the jackknife information criterion (JIC) as minus twice $J_n$, since the popular information criteria involve estimators of negative twice the expected log likelihood.

Definition 3.2. The jackknife information criterion (JIC) is defined as minus twice $J_n$:
$$\mathrm{JIC} = -2J_n = -2nL(\hat\theta) + 2\sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}).$$
Note that multiplying by $-2$ does not change the asymptotic unbiasedness of $J_n$. Furthermore, the actual behavior of an estimator can be predicted through its convergence rate. The convergence rate of a maximum likelihood type estimator is generally obtained from the Taylor expansion and the central limit theorem (CLT). The regularity conditions guarantee the Taylor expansion.
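A minimal numerical sketch of Definition 3.2 is given below; the function names, the numerical optimizer, and the normal candidate model are illustrative assumptions, and a practical implementation would exploit model-specific closed forms for the delete-one fits.

import numpy as np
from scipy import stats
from scipy.optimize import minimize

def jic(neg_loglik, theta0, X):
    """Compute JIC = -2 J_n for one candidate model.

    `neg_loglik(theta, X)` is the negative log likelihood of the sample `X`;
    the full-sample MLE and the n delete-one MLEs are obtained numerically,
    reusing the full fit as a warm start.
    """
    n = len(X)
    theta_hat = minimize(neg_loglik, theta0, args=(X,)).x
    L_full = -neg_loglik(theta_hat, X)                     # L(theta_hat)
    L_loo = []
    for i in range(n):
        X_i = np.delete(X, i, axis=0)
        th_i = minimize(neg_loglik, theta_hat, args=(X_i,)).x
        L_loo.append(-neg_loglik(th_i, X_i))               # L_{-i}(theta_hat_{-i})
    return -2.0 * n * L_full + 2.0 * np.sum(L_loo)

# Example candidate: a normal model with unknown mean and log-scale.
def nll_normal(theta, x):
    mu, log_sigma = theta
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

x = np.random.default_rng(1).normal(size=40)
print(jic(nll_normal, np.array([0.0, 0.0]), x))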

The convergence rate of $J_n$ could be specified by its limiting distribution. Nonetheless, Miller (1974[148]) commented that no one has succeeded in assessing the limiting distribution of the delete-one jackknife estimator. While the CLT requires a consistent variance estimator, the delete-one jackknife is known to give inconsistent variance estimators for non-smooth estimators (Shao and Wu, 1989[205]). Theorem 3.3 assures the existence of the asymptotically unbiased delete-one jackknife estimator of the expected log likelihood. Instead of obtaining the limiting distribution of $J_n$ through the CLT, the stochastic rate of $J_n$ is presented by means of the law of the iterated logarithm (LIL) to understand the behavior of $J_n$ in the asymptopia.

Theorem 3.4. Let $X_1, X_2, \ldots, X_n$ be iid random variables from the unknown distribution with density $g$ and let $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ be a class of models. Under (J0)-(J5), the stochastic orders relating to $J_n$ are:

(1) $J_n = L(\theta_g) + O(\log\log n)$ a.s.

(2) $\frac{1}{n}J_n = \int\log f_{\theta_g}(x)\,g(x)\,dx + O(\sqrt{n^{-1}\log\log n})$ a.s.

Proof. By the LIL, we have
$$\nabla l_i(\theta_g) = O(\sqrt{n^{-1}\log\log n}) \quad a.s.$$

so that the stochastic order of $J_n$ is
$$J_n = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n = L(\theta_g) + O(\log\log n) \quad a.s. \qquad (3.16)$$
where $\epsilon_n = o(1)$ a.s. from Theorem 3.3. The result (3.16) and the LIL lead to
$$\frac{1}{n}J_n = \frac{1}{n}L(\theta_g) + \frac{1}{n}\{J_n - L(\theta_g)\} = \int\log f_{\theta_g}(x)\,g(x)\,dx + O(\sqrt{n^{-1}\log\log n}) \quad a.s. \qquad (3.17)$$
since $\frac{1}{n}L(\theta_g) = \int\log f_{\theta_g}(x)\,g(x)\,dx + O(\sqrt{n^{-1}\log\log n})$ (Nishii, 1988). This completes the proof. $\Box$

Although Theorem 3.3 shows that $J_n$ is asymptotically unbiased, JIC might not pick a good model, particularly among nested models whose $L(\theta_g)$ are asymptotically identical. The stochastic rates of the estimators investigated by Nishii (1988[155]) imply that the randomness of the estimators is bounded by a function of the sample size. This randomness can be diluted by adopting penalty terms, as in other information criteria, which may enhance the consistency of jackknife model selection. We suggest a slight modification of JIC.

Proposition 3.1. The corrected JIC (JICc), defined as
$$\mathrm{JICc} = \mathrm{JIC} + C_n,$$
where $C_n$ satisfies $C_n/n \to 0$ and $C_n/\log\log n \to \infty$, selects a model consistently.

Remark on Proposition 3.1: The penalty term $C_n$ was adapted from the data oriented penalty of Rao and Wu (1989[167]) and Bai et.al. (1999[16]). Penalty terms can be estimated when the true model is known. However, in the current study we consider cases where the true model is unspecified or misspecified, so the penalty is not estimable; instead, bias reduction methods have been employed. On the other hand, no model complexity term is reflected in JIC, which would further penalize a model with many parameters. Therefore, the order of the penalty term $C_n$ should be larger than the stochastic rate of $J_n$ and is required to be smaller than the sample size. As with MDL, it seems reasonable to build the degree of model complexity into $C_n$. If some candidate models are nested and those nested models have complexity proportional to the number of parameters, then $p\log n$ would be a good choice for the penalty term, since $p\log n$ is higher than $\log\log n$ and lower than $n$ in order. A careful investigation of the choice of $C_n$ is left for the future.

3.5 When Candidate Models are Nested

Consider $k$ candidate models $M_1, \ldots, M_k$, where $M_1$ is the smallest model and $M_k$ is the largest. These models are nested in the given order. Let the corresponding parameter spaces be $\Omega_1, \ldots, \Omega_k$; obviously $\Omega_i \subset \Omega_j$ for $i < j$. It is possible that $\theta_g$, the parameter of interest satisfying $E[\nabla L(\theta_g)] = 0$, lies in $\Omega_m$ for a given $m$ but does not lie in $\Omega_i$ for $i < m$. A consistent model selection method is one that correctly identifies $m$, neither $m-1$ nor $m+1$. Traditional model selection criteria have penalty terms that prevent an underestimated or an overestimated model from being chosen. However, the jackknife model selection criterion does not have such a penalty term.

Unless $\theta_g \in \Omega_k \setminus \Omega_{k-1}$, the jackknife information criterion is unable to discriminate the best model among nested models. The event $\theta_g \in \Omega_m$ implies $\theta_g \in \Omega_i$ for $m \le i \le k$, so that the jackknife maximum likelihood estimates of $M_m, \ldots, M_k$ are stochastically equivalent ($J_n(M_i) - J_n(M_j) = o(1)$ for $m \le i, j \le k$ and $i \neq j$).

On the other hand, if $\theta_g$ resides only in the largest parameter space, the maximum likelihood approach estimates, for each model $M_i$, the parameter of interest $\theta_i$ that minimizes the KL distance (maximizing the empirical likelihood based on the maximum likelihood estimator of $\theta_i$) within its own parameter space. Yet the resulting KL distance achieved by $\theta_i$ is larger than the KL distance achieved by $\theta_g$. When the true model is unspecified, the bias due to model estimation cannot be assessed, and the penalty terms in popular model selection criteria, such as $2p$ in AIC and $p\log n$ in BIC, do not take such simple forms. On the other hand, the bias reduction approach allows us to measure the relative distances among candidate models with respect to the KL distance. When the parameter space of each candidate model occupies a unique space, regardless of the true model, JIC will identify a model with relatively short KL distance. A difficulty arises only when some candidate models are nested but the true model is unknown. The relationships among candidate models and their parameter spaces are illustrated in Fig. 3.1.

[Figure 3.1 appears here.]

Figure 3.1. An illustration of model parameter spaces: $g(x)$ is the unknown true model and $\Theta_A$, $\Theta_B$, $\Theta_1$, and $\Theta_2$ are parameter spaces of different models. Note that $\Theta_1 \subset \Theta_2$ and $\theta_g$ is the parameter of the model that minimizes the KL distance more than the parameters of the other models ($\theta_A$, $\theta_B$, and $\theta_1$). The black solid line and the blue dashed line indicate $\Theta_1$; details are given in the text.

In Figure 3.1, $g(x)$ denotes the unknown true model that generated the observations. We can expect that there are parameters $\theta_g$, $\theta_A$, and $\theta_B$ of candidate models that minimize the KL distances from the unknown true model $g$, and these parameters reside in $\Theta_2$, $\Theta_A$, and $\Theta_B$, respectively. Let $\theta_g$ be the parameter of the target model that minimizes the KL distance the most among the candidate models. Then the jackknife log likelihood estimate is the asymptotically unbiased log likelihood, and the jackknife maximum likelihood estimator of $\theta_g$ is a consistent estimator of $\theta_g$. The model whose parameter space is $\Theta_2$ is the best approximating model for the true model in terms of the KL distance, compared

to the other candidate models of $\Theta_A$ and $\Theta_B$. However, trouble arises when we have nested candidate models.

Suppose $M_1$ and $M_2$ are candidate models whose parameter spaces are $\Theta_1$ and $\Theta_2$, with $\Theta_1 \subset \Theta_2$ the parameter space of the nested candidate model $M_1$. Obviously, $M_1$ is nested in $M_2$. In Figure 3.1, we indicate this subspace $\Theta_1$ as a line that goes across the space of $\Theta_2$. If $\theta_g$, the parameter that minimizes the KL distance the most, resides on $\Theta_1$, the jackknife log likelihoods of both models ($M_1$ and $M_2$) are asymptotically equal, and the jackknife method does not provide any additional penalty term for choosing the parsimonious model $M_1$. On the contrary, when $\Theta_1$ is the blue dashed line, on which $\theta_g$ does not lie, $\theta_1$ is the parameter of interest that minimizes the KL distance along $\Theta_1$. Then only $M_2$ is the best approximating model for $g$. Disappointingly, there is no guarantee that the jackknife maximum likelihood estimator attributes the uniqueness of $\theta_g$ only to the larger model $M_2$. Fortunately, in practice we are more interested in choosing a model among several different parametric models, or non-nested models (Bozdogan, 2000[33]), while the true data generating function is not specified. Of course, the distribution of the parameter vectors of the data generating function is unknown, and the average of the maximized log likelihood is not guaranteed to converge to the expected value of the parameterized log likelihood under the true model. Therefore, the consistency of model selection with the jackknife method cannot be tested in the way the consistency of AIC or BIC is tested (it is known that BIC is consistent and parsimonious and that AIC tends to choose larger models). Furthermore, we would like to emphasize that traditional model selection methods should not be applied when the model selection process concerns non-nested candidate models or an unspecified true model.

3.6 Discussions

We have shown that the jackknife information criterion (JIC) is an asymptotically unbiased estimator of the maximized log likelihood, so that the model with the minimum JIC value indicates, without bias, the model of minimum KL distance. Such a model is the best approximating model among the candidates, the one that explains the unknown true model best.

On the other hand, JIC cannot be applied, and its values cannot be compared, when candidate models are nested and the true model is specified. This traditional model selection setting has been the primary driving force behind the development of well known information criteria such as AIC, AICc, HQ, BIC, MDL and many variable selection methods. For example, in the field of time series, determining the correct and parsimonious order is the goal of model selection, instead of adopting the complex full order time series model. Choosing a parsimonious model is the result of a trade-off between model bias and model variance, both of which are estimable when the true model is specified.

Nowadays, statistical model selection is applied in various disciplines, and often the candidate models are not nested in each other. These candidate models take various functional forms, as their parameters have different topologies based on expertise in the subject matter. In addition, the true model is likely inaccessible, so choosing a working model closest to the true model has been the best strategy. Therefore, model selection methods that minimize the distance between the unknown true model and the candidates depend strongly on the choice of distance metric and rely on recipes for unbiased estimation. Such methodologies show less interest in estimating or approximating the distributions of candidate models and their parameters, which has been the practice in traditional information criterion studies. We have chosen the most celebrated KL distance to approach model selection problems because of its informational perspective; the KL distance is equivalent to Boltzmann's entropy and Shannon's information. For the purpose of unbiased estimation, we choose the jackknife, which is known for its bias reduction. With a different distance measure, Herzberg and Tsukanov (1985[92], 1986[91]) concluded that the jackknife model selection approach provides a better criterion than Mallows' Cp (1973[134]) when the correct model does not belong to the set of models under verification.

The most popular statistical resampling method is the bootstrap, which has often been studied in statistical model selection (Chung et.al., 1996[53]; Ishiguro et.al., 1997[107]; Shibata, 1997[207]; Shao, 1996[202]; Cavanaugh and Shumway, 1997[46]; Djurić, 1997[56]; Feng and McCulloch, 1996[70]; Hjorth, 1994[93]; and Neath and Cavanaugh, 2000[152]). These studies did not consider the unspecified true model, so we cannot compare their results to our jackknife information criterion. Yet, a few studies showed that the bias of bootstrap model selection is proportional to the number of parameters in a model (Chung et.al., 1997; Ishiguro et.al., 1997; and Shibata, 1997). Notably, Chung et.al. (1997) added that bootstrap after bootstrap provides an asymptotically unbiased estimate of the maximum log likelihood. It is worth pointing out that JIC is simpler and computationally inexpensive, as well as asymptotically unbiased, compared to bootstrap after bootstrap.

After the debut of AIC and other model selection criteria, consistency in selecting a parsimonious model has been the primary interest. When the candidate models are nested, JIC cannot achieve the consistency expected of a model selection criterion. Nonetheless, from eq. (3.8), JIC could be improved by (a) finding an estimator of $\frac{1}{n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)$, (b) considering the 2nd order bias1, or (c) adding a data oriented penalty term. The virtue of JIC is that, through the resampling technique, asymptotically unbiased estimates of the maximum log likelihood can be obtained, so that more general candidate models from various topological spaces can be tested to assess the true data generating model.

No single model selection criterion has the most desirable properties compared to the others. There are theoretical results showing that some model selection criteria become optimal under proper conditions (Kuha, 2004[118]). Different conditions lead to different conclusions, and arguably none of them capture the full complexity of real model selection problems. In addition, Sin and White (1996[210]) pointed out that a smaller value of a model selection criterion does not imply a correctly specified model; nonetheless, a correct model attains the smallest value of the information criterion. The information criterion for model selection is not a sufficient but a necessary condition for choosing a right model. Thus, sole reliance on information criterion estimates without investigating the nature of the data set may mislead the conclusion. This weakness in model selection criteria was accentuated by Burnham and Anderson (2002[39]), who state in their famous textbook that model selection methods succeed only after better understanding of the data sets and restriction of the possible candidates.

1Adams et.al. (1971[5]) studied the properties of the delete-one jackknife and delete-two jackknife methods.

The jackknife information criterion may suffer from inconsistency when candidate models are nested, but it has wider applicability than traditional model selection criteria: it applies to non-nested models with an unspecified true model. Moreover, we are more likely to confront situations in which (a) the true model is unknown, or (b) the best approximating model minimizing a given distance metric is preferred over pretending that the true model can be specified. Therefore, the bias reducing jackknife approach to estimating the KL distance is worth recognizing as a model selection criterion.

Chapter 4

Convex Hull Peeling Depth

4.1 Introduction

The convex hull has a long history of endeavors to understand its mathematical characteristics, as well as a long list of efforts to extend its statistical applications. The lack of computational technology is partially to blame for the slow development, while statistically and mathematically notable works were performed by Rényi and Sulanke (1963, 1964) and Efron (1965), which prompted many mathematicians and statisticians to expand their research on subjects related to the convex hull. Classical multivariate statistical methods assume that the data come from a multivariate normal distribution, and the derivations are based on the sample covariance matrix, even though we are aware that imposing a multivariate normal distribution with some allegedly robust covariance matrix is not robust at all. Problematically, no robust method that avoids parametric distributional assumptions has emerged for reliable applications. Only recently have a few data depth based methods appeared for multidimensional data analysis. One of those data depths is the convex hull peeling depth. In contrast to the rich history of its investigation, the convex hull peeling depth has attracted less attention from the statistical community than other data depths. Some commentaries, including criticisms, are as follows: in Huber's publication on robustness (Huber, 1972[96]), a peeling process is proposed for a robust location estimator. This peeling process and its resulting location estimator are affine invariant, so that the peeling estimator can be treated as a univariate problem, but the peeling estimator has received the criticism that little is known about its statistical properties. According to Liu et.al. (1999[127]), although the convex hull peeling approach is intuitively appealing, its use is somewhat limited by the lack of an associated distribution theory. Since Rényi and Sulanke (1963[170], 1964[171]) and Efron (1965[68]) addressed the expected number of vertices of a convex hull from samples of various distributions, studies of convex hulls by later generations have focused more on proving central limit theorems for the numbers of vertices, areas, volumes, and perimeters of convex hulls. On the contrary, works on the convex hull process as an estimator of distributional properties are hardly found, even though possibilities for statistical inference have been presented (Tukey, 1974[221]; Barnett, 1976[24], and others). From the stochastic geometry point of view, the convex hull process has been an interesting topic, and a long list of contributors from the theoretical probability community is easily assembled. Publications by Buchta (2005[37]), Cabo and Groeneboom (1994[40]), Carnal (1970[41]), Eddy (1980[63]), Eddy and Gale (1981[62]), Efron (1965[68]), Groeneboom (1988[80]), Hsing (1994[95]), Hueter (1994[102], 1999a[100], 1999b[101]), Massé (1999[141], 2000[140]), Rényi and Sulanke (1963, 1964), Schneider (1988[194]) and others are frequently cited in other convex hull related studies. These studies focused on estimating expected values and variances of the number of vertices and volumes when samples were generated from simple convex body distributions. According to Massé (2000), the properties of the numbers of vertices and areas from the convex hull process are strongly tied to the sampling probability space, so that the difficulties in obtaining central limit theorems come from estimating the variances of the numbers of vertices and the areas.
Additionally, Hueter (2004a[98], 2004b[99]) recently extended the problem of convex hull limiting distributions to the distributional properties of convex hull peels. She expressed the expected values and variances of the number of vertices on each peel and of the peeled convex hull areas as sequences and carefully investigated the characteristics of these sequences. She emphasized that the asymptotic properties are preserved only for the first few elements of these sequences. Although this statement may seem to lack robustness, it is easily justified: large sample properties cannot remain true for the small set of points that remains after the peeling process. The loss of asymptotic properties will be illustrated later, when convex hull peeling algorithms for descriptive statistics are presented.

The observation that contours from data depth often provide a good geometrical understanding of the structure of a multivariate data set (He and Wang, 1997[90]) is very promising for linking convex hull peeling to multivariate descriptive statistics. Intuitively, convex hull peeling provides contours of equal density level, although the technicalities of a mathematical justification are not always trivial. We take the view that if a convex hull peeling depth measure is constructed for a random sample from a regular convex distribution, the depth measure should track the natural shape of the distribution and serve as a tool for empirical multivariate data analysis. In addition to reflecting the natural shape of a distribution, convex hull peeling provides, without reference to any particular distance metric, a natural center of the data: the point left after the exhaustive peeling process. Properties of the convex hull peeling process, such as independence from the choice of distance metric and natural contouring connected to the underlying density, provide convenient methods not only for estimating the median but also for estimating quantiles of interest as location estimators. Quantile estimation from samples of simple convex body distributions leads to supplementary descriptive statistical measures such as skewness and kurtosis.

Skewness and kurtosis describe properties of distributions. The definition of skewness begins with identifying the center of the distribution, which places an emphasis on obtaining a multivariate median; skewness measures the degree of distributional symmetry. Without calculating fourth moments, the definition of kurtosis asks for locating the shoulders, which separate the central half of the sample from the tails, and for measuring any probability mass shifts away from the shoulders. Traditionally, studies of skewness and kurtosis are associated with influence functions, since these functionals project onto probability space regardless of the data dimension. However, there are cases in which these functionals are not identifiable, and there are distributions whose moments are not defined. Alternatively, without calculating moments or considering influence functions, quantile based descriptive statistics have assisted studies of the shape of a distribution.

Nested depth contours delineate any variation that occurs in the density profile of a sample. Not all distributions preserve their axes of symmetry: the directions of these axes can rotate as the data depth changes, and the ratio between the axes can change as well. Nonparametric multivariate descriptive statistics should be able to detect changes in the principal axes within a distribution, in contrast to parametric methods that begin statistical analysis by fixing the principal axes of the distribution. The algorithm that computes the depth contours counts the number of points that fall inside each contour and measures the shape of each contour, so that we can diagnose variations in the density profile as we peel off consecutive convex hull layers.

Last, depth contours can be used to construct a scalable partition of multivariate data space for data aggregation. We will define a depth equivalence class, or de-class, as a set of points that fall within a certain depth range. It is then likely that outliers reside beyond some de-classes of small data depth. Peripheral points are usually considered to be outliers, and their behavior with respect to data depth differs from that of the majority of data within a de-class contour. Besides, since various definitions of outliers exist, diverse ways of detecting outliers will be introduced. Relatively easy ways to detect multivariate outliers without assuming a parametric family of distributions, particularly for streaming massive data, are in demand to raise a red flag whenever a peculiarity is observed, which is evidently important for quick decision making.

Here, we propose the convex hull and its peeling process as tools for nonparametric multivariate data analysis. We aim these tools, in particular, at massive multivariate data, since the convex hull peeling process has large sample properties. The process of convex hull peeling can be sketched as follows: first, by the definition of the convex hull, we retrieve the set of vertices that forms the boundary of the convex hull of the d dimensional sample. Second, we discard these vertices from the original set of points and construct another convex hull on the remaining points. We repeat this discarding of vertices and construction of hulls until fewer than d + 1 points are left, assigning a depth value to each convex hull before discarding it. At the end, we have a sequence of nested contours, each assigned a proper data depth value.
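A minimal two-dimensional sketch of this peeling loop in base R (the software adopted in the next chapter); the function name chp_peel and the handling of the leftover points are ours, introduced only for illustration:

    chp_peel <- function(xy) {
      # xy: an n x 2 matrix of observations.
      # Returns the peel index of each row (1 = outermost hull); the points left
      # over after exhaustive peeling (fewer than d + 1 = 3 of them) keep NA.
      n <- nrow(xy)
      peel <- rep(NA_integer_, n)
      remaining <- seq_len(n)
      j <- 0L
      while (length(remaining) >= 3L) {
        j <- j + 1L
        v <- chull(xy[remaining, , drop = FALSE])   # vertices of the current hull
        peel[remaining[v]] <- j                     # assign the peel index, then discard
        remaining <- remaining[-v]
      }
      peel
    }

For example, table(chp_peel(cbind(rnorm(1000), rnorm(1000)))) reports how many vertices each peel contains.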

We will show the statistical properties that the convex hull peeling depth can achieve. We make only one assumption: samples are generated from simple convex body distributions1. The rest of this chapter is organized as follows. Section 2 gives definitions that connect the convex hull peeling process to descriptive statistics. Section 3 proves that the convex hull peeling depth is a statistical depth. Section 4 discusses descriptive statistics induced from the convex hull peeling process. Instead of presenting theoretical properties of a convex hull peeling limiting distribution, Section 5 provides simulation results to illustrate the behavior of convex hull peeling descriptive statistics. Section 6 shows the robustness of the convex hull peeling depth measure, and Section 7 describes multivariate outliers and their relationship with the convex hull peeling process. Last, we summarize our studies on the connection between statistical inference and convex hull peeling in the final section. The methodological aspects of our empirical studies on convex hull peeling statistics are deferred to the next chapter.

4.2 Preliminaries

First of all, we start by defining the convex hull peeling depth for further statistical multivariate analysis.

Definition 4.1 (Convex Hull Peeling Depth, CHPD). For a point x ∈ Rd and the data set X = {X1, ..., XN−1}, let C1 = conv{x, X} and denote the set of its vertices by V1. We obtain Cj = Cj−1\Vj−1 through the convex hull peeling process until x ∈ Vj (j = 2, ...). Then, for k = minj{j : x ∈ Vj}, the convex hull peeling depth (CHPD) of x is

CHPD(x) = #(V1 ∪ · · · ∪ Vk) / N;

otherwise CHPD(x) is 0 (x is located outside of conv{X}).
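Under Definition 4.1, the depth of a point on the jth peel is the cumulative fraction of points removed through that peel. A small sketch that converts the peel indices from the chp_peel helper above into depths; the treatment of the at most d leftover points, which the definition does not cover, is our own convention:

    chpd <- function(peel) {
      # peel: vector of peel indices from chp_peel(); NA marks the few core
      # points left after exhaustive peeling.
      n <- length(peel)
      sizes <- tabulate(peel[!is.na(peel)])       # number of vertices on each peel
      cumfrac <- cumsum(sizes) / n                # cumulative fraction peeled through peel j
      ifelse(is.na(peel), 1, cumfrac[peel])       # leftover core points treated as deepest (our convention)
    }

    # example: depths of all rows of a bivariate sample xy
    # depth <- chpd(chp_peel(xy))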

Data depth denotes how deeply a data point is embedded in the data cloud. Therefore, the convex hull peeling depth of x is zero when x is located outside of conv{X1, ..., XN−1}. The hull of depth 1 − p approximately contains 100p percent of the sample and serves as the pth quantile estimator. Next, we define the central region by the convex hull peeling depth, which provides a mapping from the d dimensional space to a one dimensional curve characterizing multidimensional data distributions.

1For any points x ∈ A and y ∈ A, where A is a measurable set, the line segment xy satisfies xy ∈ A and xy ∉ ∂A. Here, A is a set on which a probability measure can be defined. For example, the uniform distribution on a closed rectangle is not a simple convex body distribution, since any two points on one side of the rectangle create a line segment that lies on the boundary of the rectangle.

Definition 4.2. The convex hull peeling central region RCH(t) is the collection of points whose depth is at least t, such that

RCH(t) = {x ∈ Rd : CHPD(x) ≥ t},

the smallest region enclosed by the depth contour of the (1 − t)th quantile.

The central region is a d dimensional hyper-polygon and comprises the smallest such compact set. Next, the level set is defined; it indicates a de-class, a collection of points of the same depth. As t varies, RCH(t) provides a nested sequence of hyper-polygons.

Definition 4.3 (Convex hull level set). The boundary of the convex hull peeling central region is the collection of points whose depth is exactly t, such that

BCH(t) = ∂RCH(t) = {x ∈ Rd : CHPD(x) = t},

a collection of vertices from the tth convex hull peel and the (1 − t)th quantile estimator.

The convex hull level set is associated with constructing the density profile. Since the number of points on every convex hull peel is tracked and the central region is the smallest compact set whose volume can be calculated, the density profile becomes empirically estimable. The empirical process will be presented in the next chapter. Last, we define the volume of the central region.

Definition 4.4 (Volume of the central region). The volume of the convex hull peeling central region is

VCH(t) = Volume(RCH(t)).

This volume functional has diverse applications because it provides a mapping VCH : Rd → R. The volume of the central region is also used to show the affine invariance of the convex hull peeling depth. In contrast to its diverse uses, the limiting distribution of this volume functional has not been discovered yet. Without knowing the limiting distribution, much of the following study relies on empirical methods.
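As a concrete illustration, a minimal two-dimensional sketch of the empirical volume functional in base R: the area of the hull of all points of depth at least t is computed with the shoelace formula (the helper names hull_area and v_ch, and the reuse of the chpd sketch from Section 4.2, are our own illustration):

    hull_area <- function(xy) {
      # Area of the 2-D convex hull of the rows of xy (shoelace formula);
      # chull() returns the hull vertices in order, as the formula requires.
      v <- xy[chull(xy), , drop = FALSE]
      if (nrow(v) < 3) return(0)
      i2 <- c(2:nrow(v), 1)
      abs(sum(v[, 1] * v[i2, 2] - v[i2, 1] * v[, 2])) / 2
    }

    v_ch <- function(xy, depth, t) {
      # Empirical V_CH(t): area of the central region R_CH(t) = {x : CHPD(x) >= t}
      inside <- depth >= t
      if (sum(inside) < 3) return(0)
      hull_area(xy[inside, , drop = FALSE])
    }

For example, with depth <- chpd(chp_peel(xy)) from the sketches above, v_ch(xy, depth, 0.5) approximates the volume at the shoulders used by the kurtosis functional of Section 4.4.5.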

4.3 Statistical Depth

Any data depth should possess the properties of a statistical depth function: affine invariance, monotonicity from the center outward, maximality at the center, and vanishing at infinity. These properties allow data depths to be used for statistical inference about sample distributions. The convex hull peeling depth also satisfies these conditions, which will be addressed after the definition of a statistical depth function (Zuo and Serfling, 2000a[241]).

Definition 4.5 (Statistical depth function). Let the mapping D(·; ·) : Rd × F → R be bounded and non-negative, where F is the class of distributions on the Borel sets of Rd and F is the distribution of the random sample X. If D(·; ·) satisfies the following:

(P1) (Affine invariance) D(Ax + b; FAX+b) = D(x; FX) holds for any random vector X in Rd, any d × d non-singular matrix A, and any d-vector b;

(P2) (Maximality at center) D(θ; F ) = supx∈Rd D(x; F ) holds for any F ∈ F having center θ;

(P3) (Monotonicity) for any F ∈ F having deepest point θ, D(x; F ) ≤ D(θ + α(x − θ); F ) holds for α ∈ [0, 1]; and

(P4) D(x; F ) → 0 as ||x||→∞, for each F ∈F.

Then D(·; F ) is called a statistical depth function.

First, we show the quasi-concavity of the convex hull peeling depth. Rousseeuw and Ruts (1999b[184]) discussed the quasi-concavity of a depth function to ensure monotonicity and centrality.

Definition 4.6 (Quasi-concavity). For any positive measure µ, the depth function Dµ is quasi-concave, i.e., for θ1 and θ2 in Rd and 0 ≤ γ ≤ 1:

Dµ(γθ1 + (1 − γ)θ2) ≥ min{Dµ(θ1),Dµ(θ2)}.

Then, the convex hull peeling depth is a quasi-concave function.

Theorem 4.1. For any θ1, θ2 ∈ Rd and 0 ≤ α ≤ 1, the convex hull peeling depth is quasi-concave, such that

CHPD(αθ1 + (1 − α)θ2) ≥ min{CHPD(θ1),CHPD(θ2)}.

Proof. Let MF be the convex hull peeling median. If ||θ1 − MF|| ≤ ||θ2 − MF||, then CHPD(θ1) ≥ CHPD(θ2). Provided that τ = αθ1 + (1 − α)θ2 for 0 ≤ α ≤ 1, we have ||τ − MF|| ≤ ||θ2 − MF||, so that CHPD(τ) ≥ CHPD(θ2), where CHPD(θ2) = min{CHPD(θ1), CHPD(θ2)}. By the same token, CHPD(τ) ≥ min{CHPD(θ1), CHPD(θ2)} holds when ||θ2 − MF|| ≤ ||θ1 − MF||. This completes the proof.

With Theorem 4.1, it is easily seen that the convex hull peeling depth satisfies (P2)-(P4). By definition, each convex hull peeling depth has a central region. Likewise, the center of the data cloud MF can be written as ∩∀t RCH(t). Thus CHPD(·) attains its maximum at MF, since all points are counted as the central region collapses to MF. In other words, directly from the quasi-concavity, CHPD(x) ≤ CHPD(MF + γ(x − MF)) holds for any x and γ ∈ [0, 1], and MF satisfies CHPD(MF) = supx CHPD(x). This maximality at the center, combined with the quasi-concavity, naturally provides the monotonicity of the convex hull peeling depth. This monotone non-increasing behavior will later lead to estimating the density with the convex hull peeling process. Last, it is clear that any point x outside the support of a distribution F, or outside the convex hull of the sample, has zero depth.

Apart from the direct approach of showing that the convex hull peeling depth function satisfies (P2)-(P4) by means of its quasi-concavity, the affine invariance can be understood intuitively. For any affine transformation of X to AX + b (A a d × d non-singular matrix and b a d-vector), it is obvious that the convex hull peeling center MF is mapped to AMF + b and that VCH(t; AX + b) is equivalent to |det A| · VCH(t; X) for any t. Therefore, the convex hull peeling depth does not depend on the underlying coordinate system, and the properties (P2)-(P4) remain valid after an affine transformation. Consequently, estimating convex hull peeling depths can be linked to statistical inference.
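To make the volume scaling step explicit, a short worked calculation; it assumes, as the argument above does, that each peel of the transformed sample AX + b is the image of the corresponding peel of X, so that RCH(t; AX + b) = A RCH(t; X) + b:

Vol(A RCH(t; X) + b) = ∫_{A RCH(t;X)+b} dy = ∫_{RCH(t;X)} |det A| dx = |det A| · Vol(RCH(t; X)),

where the middle equality is the change of variables y = Ax + b. Hence VCH(t; AX + b) = |det A| · VCH(t; X).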

4.4 Convex Hull Peeling Functionals

Given a sample from a multivariate probability distribution F, the corresponding convex hull peeling depth function orders points according to their centrality in F. One important role of this depth function is to generate one dimensional curves for the convenient and practical description of particular features of a multivariate distribution, such as skewness and kurtosis. Here, we present the nonparametric descriptive statistics that the convex hull peeling process can offer.

4.4.1 Center of Data Cloud: Median

Convex hull peeling has long been considered for estimating the multivariate median (Tukey, 1974), unaffected by the underlying coordinate system. The convex hull peeling process reaches the deepest point in the data cloud through an exhaustive peeling process. Even though the limiting distributions of the convex hull peeling median and other location estimators are still behind the veil, we suggest empirical and applicable definitions of these convex hull peeling estimators. If a multivariate median is the innermost point of a sample, then a median can be defined through the convex hull peeling process; we call it the convex hull peeling median.

Definition 4.7 (Convex hull peeling median, CHPM). Given a point set, convex hulls are recursively constructed by peeling off the vertices of every convex hull until at most d points are left. If a single point is left, this point is the convex hull peeling median. If multiple points are left, their coordinate-wise mean is the median. If no point is left, the convex hull peeling median is the mean of the vertices of the last, innermost convex hull.
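A minimal two-dimensional sketch of Definition 4.7 in base R (the function name chp_median is ours):

    chp_median <- function(xy) {
      # Peel hulls until at most d = 2 points remain (Definition 4.7).
      pts <- xy
      while (nrow(pts) > 2) {
        v <- chull(pts)
        if (nrow(pts) == length(v))                   # no point would be left:
          return(colMeans(pts[v, , drop = FALSE]))    # mean of the last hull's vertices
        pts <- pts[-v, , drop = FALSE]
      }
      colMeans(pts)   # single remaining point, or coordinate-wise mean of the leftovers
    }

For a roughly symmetric sample such as cbind(rnorm(5000), rnorm(5000)), chp_median should return a point close to the origin.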

Like the convex hull peeling depth, the convex hull peeling median is affine invariant. The convex hull peeling median provides an easy way to assess the center of multivariate data. Nonetheless, it is not the only way; there are various methods for estimating the center of a data cloud, and Small (1990[213]) provides a survey of multidimensional medians. Even if multivariate medians are defined in various and casual ways, identifying the true data center is a critical problem for understanding multivariate data distributions. A properly defined median helps estimate the degree of symmetry of a multivariate distribution correctly. In addition, estimating the correct median (data center) plays a critical role in detecting outliers, since outliers are the observations that deviate most from the center of the data relative to the majority of observations. Hardin and Rocke (2004[85]) summarized numerous outlier detection methods. Under the condition that the main data cloud forms one cluster in multidimensional space, Atkinson (1994[12]), Barnett and Lewis (1994[21]), Gnanadesikan and Kettenring (1972[74]), Hadi (1992[81]), Hawkins (1980[89]), Maronna and Yohai (1995[139]), Penny (1995[161]), Rocke and Woodruff (1996[178]), and Rousseeuw and Van Zomeren (1999c[185]) developed methods that detect outliers by identifying the center of the data and measuring the distance between observations and that center. Because of its intuitive appeal, we adopt the convex hull peeling median as the data center.

4.4.2 Location Functional

The purpose of peeling is not only to find the center of the data set but also to assign ranks to observations properly. These ranks correspond to multivariate quantiles based on the notion of a generalized quantile process. Here we introduce the generalized quantile process and convex hull peeling quantile estimators. According to EinMahl and Mason (1992[69]), a generalized quantile function is defined as follows:

Un(t) = inf{λ(A) : Pn(A) ≥ t, A ∈ A},  0 < t < 1.

The natural choice of λ is the Lebesgue measure on Rd. In the convex hull peeling case, λ is roughly the volume of the central region, as is Un(t), since the central region is the smallest compact set in A. By the definition of the central region, Un(t) contains at least a 1 − t fraction of the data in the support space A. Particular choices of the measure λ yield a scale curve for dispersion or a Lorenz curve for tailweight (see Liu et al., 1999[127]; Serfling, 2002a[198], 2002b[199], 2002c[200]). Depending on the smoothness of the sets in A, other feasible choices of λ for the generalized quantile process are the length of the perimeter of a set, its diameter, or a (probability) measure evaluated at the set. These possibilities will be discussed in the future along with further discoveries about the distributional properties of convex hull peeling estimators. For convenience, instead of adopting the volume of a central region, we define the convex hull level set as the convex hull peeling quantile function:

Un(t) = BCH (t) = {x ∈ Rd : CHPD(x)= t}.

Note that each level set is assigned probability weight 1 − t. Additionally, the volume of BCH(t) is the measure of the (1 − t)th quantile; therefore, a two dimensional plot of t against the corresponding volume provides a convenient way to explore key features of a multivariate F.

EinMahl and Mason (1992[69]) showed that Un(t) is measurable under some assumptions for each t ∈ (0, 1). It is clear that Un(t) is constant on ((i − 1)/n, i/n] for i = 1, ..., n − 1 and on (1 − 1/n, 1). There have been attempts to prove the continuity of a depth function under some conditions on the multivariate distribution F (see EinMahl and Mason, 1992[69]; He and Wang, 1997[90]; and Wang and Serfling, 2006[227]). Despite the limited understanding of the properties of its quantile process, the convex hull peeling process provides various descriptive measures that characterize the features of distributions. Serfling (2004[196]) emphasized that quantile measures are very attractive approaches to studying probability distributions, especially in nonparametric inference, because such descriptive measures characterize interesting features of distributions.

We would like to add a definition of the interquartile range (IQR) for multivariate data distributions. In the univariate case, the IQR is the range between the first quartile and the third quartile, equivalently the range between the .25th quantile and the .75th quantile, which contains the central half of the observations. Since the IQR includes the center and shows the degree of concentration of the data around the center, for multivariate observations we denote the 50% convex hull peeling central region RCH(.5) as the IQR, without calculating the .25th or .75th quantiles. Obtaining RCH(.5) as the IQR is also applicable to other types of depths when the data dimension is greater than one.

4.4.3 Scale Functional

The scale of a distribution represents its degree of dispersion. Therefore, the volume of the convex hull central region is a direct indicator of the degree of dispersion. As discussed above, this volume function originated from the search for a generalized quantile function. Liu et al. (1999[127]) called the function that quantifies the expansion of the central region with increasing probability weight the scale curve. Relative orders of the degrees of dispersion between two distributions are assigned by estimating volumes at depth t. We say that distribution G has larger scale than distribution F when

VCH,G(t) ≥ VCH,F(t)

after regularization, since the supports of the two samples are different. We will present some reasonable regularization methods in the next chapter.

Studies on estimating the covariance matrix are tightly related to measuring the scale of a multivariate distribution. Titterington (1978[220]) adapted ellipsoid trimming for robust estimation of the covariance, and Bebbington (1978[26]) showed that data trimmed by a convex hull provide a more robust covariance matrix. These studies reflect that a convex hull represents the spread of a distribution from the outside: removing points on the boundary (ellipsoid or convex hull) of the sample space is likely to tighten the scale of a distribution. Their estimated covariance matrices, however, do not display any changes in scale as the data depth changes. The volume functional induced from the convex hull peeling process offers a quantification of the degree of spread along the data depths. This quantification produces a spatial scale curve that yields a convenient one dimensional characterization of a multivariate distribution of any dimension. When this scale curve remains steady along the data depth, the estimated covariance matrix will represent the spread of the multivariate data distribution. The volume functional, linked to the degree of spread along the data depth, will also assist in estimating a robust covariance matrix by discarding the few level sets with small depth values whose scales are not consistent with the rest of the level sets, much as Bebbington (1978) and Titterington (1978) eliminated points for robust covariance matrix estimates. Conclusively, the volume functional as well as the location functional play key roles in the formulation of the spatial skewness and kurtosis functionals, which are addressed in the following sections.
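Before moving on, a minimal two-dimensional sketch of the empirical scale curve in base R: it records, peel by peel, the hull area of the points not yet removed (the helper name scale_curve is ours, and the regularization mentioned above is deliberately omitted):

    scale_curve <- function(xy) {
      # Returns, for each peel, the depth value of that peel and the area
      # (volume) of the central region formed by the points not yet removed.
      n <- nrow(xy); pts <- xy
      depth <- area <- numeric(0)
      while (nrow(pts) >= 3) {
        v <- chull(pts)
        h <- pts[v, , drop = FALSE]
        i2 <- c(2:nrow(h), 1)
        a <- abs(sum(h[, 1] * h[i2, 2] - h[i2, 1] * h[, 2])) / 2  # shoelace area
        depth <- c(depth, (n - nrow(pts) + length(v)) / n)        # depth of this peel (Definition 4.1)
        area  <- c(area, a)
        pts <- pts[-v, , drop = FALSE]
      }
      data.frame(depth = depth, volume = area)
    }

Plotting volume against depth for two samples, e.g. scale_curve(matrix(rnorm(2000), ncol = 2)) and scale_curve(matrix(2 * rnorm(2000), ncol = 2)), shows the more dispersed sample's curve lying above the other at comparable depths.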

4.4.4 Skewness Functional

According to Oja (1983[158]), no partial ordering exists for multivariate skewness and kurtosis. However, the convex hull peeling depth is classified as a partial ordering (Barnett, 1976[24]), and there are ways to quantify the skewness and kurtosis of multivariate distributions with the convex hull peeling process. Under the assumption that the data are generated from a convex body distribution, some measures of skewness and kurtosis are introduced below: first skewness, then kurtosis by means of convex hull peeling. Measures of skewness were primarily, and often exclusively, developed for testing lack of symmetry. In the univariate case, only two directions from the center of a data set are available to describe the skewness of a distribution, a positive span and a negative span from the data center, and a nonparametric skewness measure is easily obtained by comparing values from the two directions. For multivariate data, numerous rays can be defined from the center of the data.

Consider a level set BCH(t) at depth t containing vt vertices; then vt rays can be drawn from the center of the data cloud. In univariate distributions, two oppositely directed rays from the data center comprise the skewness measure. Likewise, choosing appropriate rays in certain directions from the data center will represent the shape of a multivariate distribution. Let xj,i be the ith vertex in the level set BCH,j produced by the jth peeling process. We define the quantity

Rj = [ maxi ||xj,i − CHPM|| − mini ||xj,i − CHPM|| ] / mini ||xj,i − CHPM||

as a measure to display the degree of skewness of a multivariate distribution.

The sequence of Rj not only visualizes but also quantifies the skewness along data depths. For symmetric distributions, Rj will remain constant along the level sets BCH,j, which correspond to particular depth values through the point counts. For skewed distributions, Rj will show fluctuating trends. The denominator serves as a regularization and makes Rj affine invariant.
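A minimal two-dimensional sketch of the sequence {Rj} in base R; the function name skewness_curve is ours, and the center is the CHPM of Definition 4.7:

    skewness_curve <- function(xy) {
      # Collect the vertices of each peel, locate the CHPM, and compute R_j per peel.
      pts <- xy
      hulls <- list()
      while (nrow(pts) >= 3) {
        v <- chull(pts)
        hulls[[length(hulls) + 1]] <- pts[v, , drop = FALSE]
        pts <- pts[-v, , drop = FALSE]
      }
      center <- if (nrow(pts) > 0) colMeans(pts) else colMeans(hulls[[length(hulls)]])
      sapply(hulls, function(h) {
        d <- sqrt(rowSums(sweep(h, 2, center)^2))   # vertex distances from the CHPM
        (max(d) - min(d)) / min(d)                  # R_j
      })
    }

One would expect the resulting curve to stay roughly flat for a symmetric sample such as cbind(rnorm(3000), rnorm(3000)) and to fluctuate for a skewed sample such as cbind(rexp(3000), rexp(3000)).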

The sequence {Rj}, a convex hull peeling skewness measure, does not match any of the three skewness categories that Ferreira and Steel (2006[71]) identified. They categorized the multivariate skewness measures appearing in the literature into three groups:

1. joint moments of the random variable,

2. projections of the random variable onto a line (projection maximizes some value of univariate skewness),

3. volumes of simplices (directional skewness).

Inspired by Oja's (1983[158]) comment that measures of skewness are intuitively selected, and by the connection between convex hull level sets and volumes of simplices, we have designed the skewness measure Rj. We emphasize that Rj is neither the volume of simplices nor a projection onto a line, and it does not require calculating moments. We draw on the notion of skewness as a degree of asymmetry to coin skewness measures using convex hull peeling methods. Our skewness measure provides a one dimensional curve mapped from a multidimensional distribution and displays changes in the shape of a distribution along data depths. Any fluctuation in shape orientation will be caught by estimating {Rj}. Moreover, Rj is obtained from a quantile process, since quantiles correspond to convex hull peeling level sets. In addition, Rj is affine invariant, fulfilling Oja's (1983) statement that intuitive skewness measures must be affine invariant.

4.4.5 Kurtosis Functional

Quantile based kurtosis measures have been proposed by Wang and Serfling (2005[228], 2006[227]), Serfling (2004[196]), Zuo and Serfling (2000a[241]), Liu et al. (1999[127]), Oja (1983[158]), and Groeneveld and Meeden (1984[79]) based on other types of data depths. Similarly, it is feasible to define kurtosis measures based on the convex hull peeling process, since this process provides estimable quantiles. Kurtosis is understood as a measure that quantifies the structure of a distribution by assessing its central and tail parts; the boundary between the center and the tails represents the shoulder of the distribution. Investigating the kurtosis of a distribution can take three different views of the distribution:

• mesokurtic property - measures the overall shape of distribution;

• platykurtic property - measures the centrality of data;

• leptokurtic property - measures tail weights.

The quantile based kurtosis functional has intuitive appeal since it provides a simple one dimensional curve for evaluating characteristics of multidimensional data. For elliptically symmetric distributions, under suitable conditions on the chosen multivariate quantile function, the kurtosis functional possesses very useful properties that actually determine the form of the distribution up to affine transformation (Wang and Serfling, 2005). In addition to describing a multivariate distribution with a single curve, the quantile based kurtosis functional allows us to compare properties such as the degree of concentration at the center or the intensity of mass in the tails among a number of multivariate distributions simultaneously, which we will illustrate in the next chapter via kurtosis measures from the convex hull peeling process. The quantile based kurtosis functional also reduces a distribution to a single value and discriminates between highly peaked and heavy tailed distributions. Here, we define a few quantile based kurtosis measures built on the convex hull peeling process. These measures are not definitive or universal kurtosis measures, but they serve well enough for various needs in characterizing multivariate distributions.

Without calculating the 4th central moment, a univariate kurtosis is obtained by

K = (Q3 − 2Q2 + Q1) / (Q3 − Q1),

where Qr indicates the rth quartile. The single quantity K indicates whether the data are platykurtic (flat) or leptokurtic (peaked). A univariate kurtosis measure is a simple discriminator between acutely centered and heavy tailed distributions. This type of quantile based kurtosis is easily extended to multivariate kurtosis by volume functionals. Thus, kurtosis based on the volume functional from the convex hull peeling process is defined as

KCH = [ VCH(.25) + VCH(.75) − 2 VCH(.5) ] / [ VCH(.25) − VCH(.75) ]

for multidimensional data. Note that VCH(r) indicates the volume functional of the rth depth, or the (1 − r)th quantile; in particular, VCH(.5) is the volume at the shoulders. The shoulders are the level set at the interquartile range of a distribution, which contains about half of the observations and divides the center from the tails. Instead of taking a single kurtosis value, we can define a kurtosis curve along the data depth away from the shoulders.

KCH(r) = [ VCH(1/2 − r/2) + VCH(1/2 + r/2) − 2 VCH(1/2) ] / [ VCH(1/2 − r/2) − VCH(1/2 + r/2) ].

This kurtosis provides a curve along the data depth and enables us to measure the shift of probability mass away from the shoulders of the distribution. Serfling (2004[196]) showed that the kurtosis functional KCH(r) is invariant under shift, orthogonal, and homogeneous scale transformations. Instead of investigating the curve given by the kurtosis functional, measuring the tailweight of a distribution is another approach to obtaining a kurtosis measure. The tailweight based on a volume functional from the convex hull peeling process is defined as

t(r, s) = VCH(r) / VCH(s)        (4.1)

for 0 < r, s < 1. Since VCH(s) regularizes the volume functionals of distributions, this tailweight enables us to compare the kurtoses of more than two distributions simultaneously. Additional kurtosis measures can be found, such as a Lorenz curve, a shrinkage plot, and a fan plot (Liu et al., 1999). For some types of data depth, the distribution of the quantile functional is known, so that kurtosis can be defined analytically. On the other hand, little is known about the limiting distribution of convex hull peeling quantiles, just as it is unknown for other estimators from the convex hull peeling process. Therefore, we only consider empirical versions of kurtosis based on volume functionals.
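A minimal two-dimensional sketch of the volume based kurtosis in base R (the helper names v_ch_peel and kurtosis_chp are ours; the depth assigned to each peel follows the counting convention of Definition 4.1):

    v_ch_peel <- function(xy, t) {
      # Area of the central region R_CH(t): peel away hulls whose depth is below t,
      # keep the first peel whose depth reaches t, and take the shoelace area of
      # the region it bounds.
      n <- nrow(xy); pts <- xy
      repeat {
        if (nrow(pts) < 3) return(0)
        v <- chull(pts)
        if ((n - nrow(pts) + length(v)) / n >= t) break
        pts <- pts[-v, , drop = FALSE]
      }
      h <- pts[v, , drop = FALSE]
      i2 <- c(2:nrow(h), 1)
      abs(sum(h[, 1] * h[i2, 2] - h[i2, 1] * h[, 2])) / 2
    }

    kurtosis_chp <- function(xy) {
      v25 <- v_ch_peel(xy, .25); v50 <- v_ch_peel(xy, .5); v75 <- v_ch_peel(xy, .75)
      (v25 + v75 - 2 * v50) / (v25 - v75)   # K_CH from the quartile volumes
    }

Evaluating v_ch_peel at other depths gives the curve KCH(r) and the tailweight t(r, s) of equation (4.1) in the same way.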

4.5 Simulation Studies on CHPD

The primary goal of this section is to investigate the behavior of the convex hull peeling process when it is applied to various samples. According to He and Wang (1997[90]), contours of data depth provide a good geometrical understanding of the structure of a multivariate sample; under proper conditions, they asserted, the convergence of data depth implies the convergence of contours to density levels. Samples from an elliptical distribution show that contours acquired from the peeling process delineate the shape of the multivariate density profile; however, this picturesque idea is not appropriate for every multivariate distribution.

Figure 4.1 illustrates the shapes of convex hull level sets from various distributions. As predicted, the bivariate normal and Cauchy distributions preserve the symmetric shape of their contours as the peeling process proceeds. On the other hand, uniform distributions on a square and on a triangular sample space lose their shapes as the peeling process continues. As the layers proceed toward the center of these uniform samples, they form smooth convex contours and lose any trace of the natural angles of the rectangular or triangular geometry of the sample space. The contours of the few innermost layers take rather arbitrary shapes regardless of the overall shape of the sample space.

Figure 4.1. Bivariate samples of different shapes put into the convex hull peeling process. Clockwise, from the upper left panel, samples of size 1000 were generated from a bivariate uniform distribution on a square, a bivariate uniform distribution on a triangle, a bivariate Cauchy distribution with correlation 0.5, and a bivariate normal distribution with correlation 0.5.

Besides, Figure 4.1 reveals the interesting fact that the behavior of the convex hull peeling depth is very different from that of the half space depth, because the patterns of the depth contours have different properties. According to Rousseeuw and Ruts (1999b[184]), for triangular, rectangular, or any regular polygon samples from uniform distributions, the depth contours retrieved from the half space depth peeling process have ridges that correspond to an inversion (a rotation for planar data) of the depth contours, so that the major axis of each depth contour obtained by peeling appears orthogonal to the major axes of the polygon formed by the sample. In Figure 4.1, the contours of the rectangular sample gradually become smooth ellipsoids as convex hull peeling proceeds, but in Rousseeuw and Ruts (1999b) the contours of a rectangular sample under half space depth peeling slowly change into a smooth diamond shape, as if the sample had been rotated by 45 degrees.

It is tempting to expect that depth contours from a peeling process delineate the probability contours of any sample. So far this expectation looks correct when the sample is generated from an elliptical distribution (He and Wang, 1997), but in general depth contours differ from probability contours (Rousseeuw and Ruts, 1999b): it is not always true that contours from a peeling process depict equal density levels. The relationship between level sets created by peeling processes based on statistical data depths and equal probability contours under various types of distributions remains an open question.

4.6 Robustness of CHPM - Finite Sample Relative Efficiency Study

When a new statistic appears, it is usually tested for its robustness. We wish to know whether the convex hull peeling process provides robust estimators. In particular, we would like to know the robustness of the convex hull peeling median so that we can design other estimators built upon it, such as the skewness and kurtosis measures; these estimators should not be strongly affected by discordant observations or contamination. The median of a data set arises mainly in nonparametric problems as a natural robust estimate of the center of a distribution. As such, it has a different character from the sample mean, as is illustrated by their different breakdown properties.

Studies of the breakdown points of multivariate median estimators seem inevitable when the primary interest in the estimators is their robustness. The breakdown point is the largest proportion of the data that can be moved to infinity while the estimator remains bounded (Aloupis, 2005[10]). It is obvious that the breakdown point of the sample mean goes to zero as the sample size goes to infinity, which leads to the conclusion that the sample mean is not a robust estimator. On the other hand, some data depths show non zero breakdown points. Donoho (1982[59]) performed extensive studies on the breakdown points of various multivariate data depths. For example, the breakdown point of the half space median is 1/3, known to be the best among data depth based medians for bivariate samples. Unfortunately, the breakdown point of the convex hull peeling median is reported to be inferior to those of other depth based medians. Donoho (1982) showed that the breakdown point of the convex hull peeling median is bounded above by (n + d + 1)/[(n + 2)(d + 1)] for a sample of size n in Rd, and that the breakdown point goes to zero as n goes to infinity. The computation of the convex hull peeling median breakdown point is very complicated because the convex hull peeling median is not a continuous function of the data.

On the other hand, it is not necessary to drag points to the extreme boundary to check the robustness of an estimator; the relative efficiency can also show the degree of robustness. Therefore, instead of sending points to an infinite boundary, we contaminated the main data cloud with a small cluster of outliers to investigate the robustness of the convex hull peeling median. Fundamentally, the finite sample relative efficiency measures the degree of outlier tolerance, which may vary with the method of contamination and the type of sample distribution including the outliers. Rocke (1996[178]) provided a summary of outlier types, which guided the design of our simulation study of the finite sample relative efficiency. Among the numerous ways of creating the main sample and the outliers, we chose shift outliers, one of the five types of outliers discussed in Rocke (1996), as the method of contamination for checking the sensitivity of the convex hull peeling median to outliers. Here, we conducted a simulation study to see how resistant the convex hull peeling median is to outliers in finite samples generated from contaminated normal distributions.
Since our aim is to understand the behavior of convex hull peeling with massive data, we generated 500 samples of size 5000 from the model (1 − ε)N((0, 0), I) + εN(µ, 4I), where ε indicates the proportion of contamination caused by shift outliers. We contaminated the data by 0.5%, 5%, and 20% with shift outliers from N(µ, σ²I), where µ = (5, 5)t for a near neighborhood contamination and µ = (10, 10)t for a remote contamination, with σ = 2 in both cases. Then, for each estimator Tj (CHPM, coordinate-wise median, and mean) based on the jth sample, we calculated the empirical mean squared error

EMSE = (1/m) Σ_{j=1}^{m} ||Tj − µT||²,

where m = 500 and µT = (0, 0). The relative efficiency (RE) of the CHPM or the coordinate-wise median is then obtained by dividing the EMSE of the sample mean by the EMSE of the CHPM or the coordinate-wise median. The results are provided in Table 4.1.

Table 4.1. Empirical mean square error and relative efficiency

                         N((5, 5)t, 4I)                       N((10, 10)t, 4I)
  ε            CHPM        Median      Mean        CHPM        Median      Mean
  0     EMSE   0.002178    0.000647    0.000417
        RE     0.191689    0.644964    1.00
  0.005 EMSE   0.0028521   0.007397    0.001682    0.002891    0.0007399   0.005444
        RE     0.589961    2.274761    1.00        1.88291     7.35752     1.00
  0.05  EMSE   0.016842    0.009496    0.125522    0.017824    0.00950     0.500610
        RE     7.45262     13.21783    1.00        28.08597    52.6738     1.00
  0.2   EMSE   0.139214    0.20300     2.00109     0.1435910   0.2033708   8.0017
        RE     14.37412    9.85738     1.00        55.7256     39.3454     1.00
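A rough base R sketch of one cell of this experiment (ε = 0.05, shift to (5, 5)t, σ = 2), reusing the chp_median helper sketched in Section 4.4.1; the helper one_sample and the reduced number of replications are our own, so the resulting figures will only roughly track Table 4.1:

    one_sample <- function(n, eps, mu, sigma) {
      k <- round(eps * n)                                  # number of shift outliers
      good <- matrix(rnorm(2 * (n - k)), ncol = 2)         # N((0,0), I)
      bad  <- matrix(rnorm(2 * k, sd = sigma), ncol = 2) +
              matrix(mu, k, 2, byrow = TRUE)               # N(mu, sigma^2 I)
      rbind(good, bad)
    }

    set.seed(1)
    m <- 100                                               # 500 replications in the thesis
    sq_err <- t(replicate(m, {
      x <- one_sample(5000, 0.05, c(5, 5), 2)
      c(CHPM   = sum(chp_median(x)^2),                     # squared distance from the true center (0,0)
        Median = sum(apply(x, 2, median)^2),
        Mean   = sum(colMeans(x)^2))
    }))
    emse <- colMeans(sq_err)                               # empirical mean squared errors
    re   <- emse["Mean"] / emse                            # relative efficiencies w.r.t. the sample mean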

From Table 4.1, it is clear that the convex hull peeling median is highly competitive and remains insensitive to outliers as the contamination rate increases. On the contrary, the coordinate-wise median seems less sensitive to outliers under small contamination, but its sensitivity increases under relatively high contamination. This result is attributed to the fact that the coordinate-wise median is not affine invariant; the relative efficiency of the coordinate-wise median therefore also varies with the direction of the shift outliers.

Other finite sample relative efficiency studies based on different data depths can be found, in which the sensitivity of these depths to outliers has been measured. Rousseeuw and Ruts (1998[186]) investigated the relative efficiency of their HALFMED algorithm on bivariate samples and compared the half space depth median with coordinate-wise medians. Zuo (2003[240]) and Zuo et al. (2004[238]) examined the finite sample relative efficiency of the projection depth and the Stahel-Donoho estimator (half space depth) relative to the sample mean on bivariate standard normal samples. Zuo (2003) found that the projection depth produced results similar to the HALFMED of Rousseeuw and Ruts (1998). In addition, the finite sample relative efficiency study showed that the Stahel-Donoho estimator can achieve a good balance between robustness and efficiency (Zuo et al., 2004). The half space depth and the projection depth are quite different types of data depth from the convex hull peeling depth, so a direct comparison between the relative efficiencies of the half space median and the convex hull peeling median may not provide useful information. Nonetheless, it is very informative that the convex hull peeling median is as insensitive to outliers as the half space median, one of the most popular and best studied data depth based multivariate medians.

4.7 Statistical Outliers

So far, we have noted some concerns about the robustness of estimators built from the convex hull peeling process. This robustness is strongly related to the problem of outliers, since a robust estimator is one that is not sensitive to outliers. Two properties of the convex hull peeling process, namely that (a) orders are assigned from the outside inward and (b) the inner regions are unaffected by the outer contours, imply that building convex hull level sets is connected to detecting outliers in a robust and nonparametric way. These level sets make it possible to segregate outer observations from inner observations without any distortion caused by outliers, since the convex hull peeling process is quite robust. In other words, the convex hull peeling process leads to statistics resistant to outliers.

We take the position that outliers are set apart from the majority of data points, regardless of the various definitions of outliers coined in different communities, including computer science. Outliers are cumbersome observations but cannot be ignored, because rare events are more interesting than the majority from a knowledge discovery standpoint. In order to identify outliers, techniques which determine the position of a point relative to the others would seem to be useful (Rohlf, 1975[179]). Methods that convey the general definition of outliers and discriminate them are therefore desirable.

Understanding the types of outliers helps in developing outlier detection methods. Barnett (1976[24]) summarized the causes of outliers as (1) measurement errors, (2) execution faults, and (3) intrinsic variability. Outlier detection methods are expected to cope with any of these causes, and we believe that the convex hull peeling process is capable of reflecting them. Once depths are assigned to observations, outliers are likely to have very small convex hull peeling depth values compared to the main body of the data. These small depth values reflect the causes of outliers, and we discuss the connection between small data depth and outliers in detail below. Here we suggest two possible scenarios of convex hull peeling depth outliers.

First, observations far away from the data center in any direction are classified as outliers. Naturally, such outliers will have very small data depths, and detecting them calls for two stages, similar to the steps presented by Rocke (1996[178]) for all known affine equivariant outlier identification methods. The first step is to estimate the depth of each point and the volume of each convex hull level set. The next step is to calibrate the degree of dispersion in volumes caused by observations with small data depths, in order to decide how far observations with small depths must be from a location estimate (e.g., the data center) to be considered outliers. The possibility of detecting outliers away from the location estimator with the convex hull peeling process was noted by Chakraborty (2001[47]): observations outside some small depth are treated as outliers. Depending on the degree of dispersion of observations with small convex hull peeling depths, if the volume measure indicates unusual dispersion, we exclude those observations.

Second, observations that stick out from the main data cloud are treated as outliers. The main data can be contaminated with a cluster of outliers, and this cluster disturbs the shape of the convex hull level sets. As seen from Table 4.1, the inner regions, whose points are assigned larger depth values by the convex hull peeling process, are robust against outliers. Because of outliers that stick out from the main data cloud, the shape of the outer convex hull peeling level sets differs from the shape of the level sets formed by the main data. We then search for the depth up to which the corresponding level sets contain outliers. Investigating a proper shape functional of the level sets shows the changes in level sets caused by outliers. Identifying the presence of a cluster of outliers requires detecting the influence of the outliers through a proper depth based functional; any influence of outliers appearing in depth functionals of level sets will disappear as the convex hull peeling process continues.

Lu et al. (2003[132]) and Hodge and Austin (2004[94]) worried about the sparsity of data in high dimensional space, where traditional outlier detection methods are not applicable. This curse of dimensionality, however, seems not to have a significant effect on outlier detection methods based on the convex hull peeling depth when they are applied primarily to massive data. In the worst case, d + 1 points suffice to construct one convex hull level set and assign a data depth value, and d + 1 is practically a small fraction of massive data. Therefore, some outlier detection tools can be developed by measuring functionals based on the convex hull peeling depth.
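As a rough illustration of the first scenario, one may flag every point whose convex hull peeling depth falls below some small level. The cutoff alpha below is an arbitrary illustration rather than a calibrated rule, and chp_peel and chpd are the sketches from Sections 4.1 and 4.2:

    flag_by_depth <- function(xy, alpha = 0.05) {
      depth <- chpd(chp_peel(xy))   # convex hull peeling depths of all points
      which(depth <= alpha)         # indices of outlier candidates on the outermost peels
    }

A comparable sketch of the second scenario would track the level set volumes from scale_curve (Section 4.4.3) and flag the outer peels whose volumes shrink abruptly once they are removed.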
We will demonstrate methodologies that detect outliers with the convex hull peeling process in the next chapter.

4.8 Discussions

The convex hull peeling process seems to preserve nice properties for estimating characteristics of distributions. Under the assumption that the data are generated from a smooth unimodal convex set, the convex hull peeling process is useful for developing nonparametric statistical inference methods for massive data. Because the data depth is defined through a counting process, we were able to link convex hull peeling location estimators to the generalized quantile process, which is the key to performing statistical inference based on data depths. We showed that the convex hull peeling depth is a statistical data depth as defined by Zuo and Serfling (2000a). In addition, we conducted a simulation study to illustrate that the convex hull peeling median is robust and not sensitive to outliers.

Illustrating the convex hull peeling central regions and level sets based on the generalized quantile process leads to a volume functional, in other words, the volume of a depth d (0 < d ≤ 1) convex hull. This volume functional guides the definitions of the scatter, skewness, and kurtosis functionals by providing a mapping from multi-dimensional data to a univariate value. Based on these empirical functionals, we can quantify multivariate distributions.

We calculated the empirical mean squared errors of the convex hull peeling median, the coordinate-wise median, and the mean, and observed that the convex hull peeling median is not sensitive to outliers. The fact that the convex hull peeling estimators are independent of the underlying coordinate system, together with this insensitivity to outliers, will assist in developing outlier detection methods. Classical moment based skewness and kurtosis (Mardia, 1970[138]) only provide single valued discriminators. On the contrary, the convex hull peeling skewness and kurtosis functionals provide one dimensional curves that convey information such as how much probability mass has shifted or to what degree the probability mass has moved away from the shoulders. The empirical skewness and kurtosis functionals based on the convex hull volume functional offer a better picture for describing multivariate distributions.

Nonetheless, the descriptive statistical measures that we obtained from the convex hull peeling process offer only empirical results. The distributional properties of the convex hull peeling functionals remain a very pristine realm of statistics. In the future, the asymptotic properties of convex hull peeling estimators and the theory of their limiting distributions are to be discovered, presumably based on the theory of martingales and renewal processes.

Another drawback of the convex hull peeling process is its computational slowness compared to parametric methods. Even if the convex hull peeling depth is relatively more efficient than the simplicial depth or the half space depth, the computational complexity of the convex hull peeling process grows quickly as the dimension of the data increases. This slowness is an inevitable sacrifice of using nonparametric methods for multivariate data analysis. Parametric methods normally assume that the data come from a multivariate normal distribution and require covariance matrices to be estimated; their relatively fast computation comes from estimating the small number of parameters of the alleged model that generated the multivariate data. On the contrary, imposing a particular family of distributions is likely to produce an unexpected conclusion when the model assumption is far from the truth.
In reality, there are not many nonparametric approaches to multivariate data analysis. Even without rigorous and analytic distribution theories, the empirical approaches based on the convex hull peeling process provide interesting information that is worth pursuing. In the next chapter, we pursue nonparametric descriptive statistics further based on empirical measures. We will illustrate the descriptive statistics from the convex hull peeling process by describing algorithms and procedures, presenting simulation results, and showing applications to real data sets.

Chapter 5

Convex Hull Peeling Algorithms

5.1 Introduction

Tools for visualizing descriptive statistics of multivariate distributions have emerged as very appealing methods in the last few decades. These tools allow us to understand characteristics of multivariate distributions via one dimensional curves without sacrificing any information in the given sample. These one dimensional curves are simply constructed by means of depth functions. Unlike univariate data, which can be ordered simply by sorting, there is no universal method for ordering multivariate data. Barnett (1976) presented extensive ideas on different ways of ordering multivariate data and discussed which methods order them properly in a statistical sense. Liu et al. (1999) and Zuo and Serfling (2000a) offered nearly comprehensive lists of methods for ordering multivariate data. These orderings are called data depths. From data depths, the location, scatter, skewness, and kurtosis, in addition to the data center, tailweight, and peakedness of distributions, can be estimated. According to Serfling (2002[197]), the distribution theory of the sample versions of the one dimensional curves for these descriptive measures of multivariate data is left open. On the other hand, the empirical estimators of these descriptive measures and their good statistical properties have been delineated by simulation studies in which the empirical measures produce estimates similar to their analytical counterparts. This fact has motivated simulation studies of multivariate data analysis approached through data depths.

Among data depths, we choose the convex hull peeling depth to develop tools for obtaining descriptive statistics of multivariate distributions. These tools provide values for empirical estimation or one dimensional curves for statistical interpretation. Here we offer some reasons for choosing the convex hull peeling process.

• Few related studies compared to other depths. Other depths such as the half space depth and the simplicial depth have been studied extensively and comprehensively, and are proven to possess very good statistical characteristics (Liu, 1990[130]; Liu et al., 1999[127]; Zuo and Serfling, 2000a[241]; Zuo, 2004[238]; Zuo and Cui, 2004[239]; and references therein). Only a few studies have adopted the convex hull and the convex hull peeling process in statistical estimation problems; one of them is the robust correlation coefficient of Bebbington (1978[26]). Other directions have been taken to utilize the convex hull and its peeling for statistical inference, such as robust estimation of the data center (Tukey, 1974[221]) and ordering by the peeling process (Barnett, 1976[24]). However, applications of the convex hull peeling process seem to be limited because of the difficulty of obtaining the distributional characteristics of the peeling process. A few early statistical studies of the convex hull were geared toward data trimming or outlier elimination (Silverman and Titterington, 1980[209]; Titterington, 1978[220]; and Nolan, 1992[157]).

• Faster than other depths. With some exceptions, such as bivariate samples or prior knowledge of particular alignments of the sample points1, the simplicial depth, Oja's depth, and the majority depth appear to have computational complexity O(n^(d+1)) for n points in d dimensional space, so the cost of computing these depths is very high. The complexity of the half space depth and the regression depth is reported as O(n^(d−1) log n) (Rousseeuw and Struyf, 2004[180]). In contrast, the complexity of the convex hull depth is O(n^⌊(d+1)/2⌋ log n). The complexity of convex hull algorithms still seems expensive, but the growth rate of the exponent is about half that of the other depths. Therefore, the convex hull peeling depth is relatively cheap to calculate compared to other data depths; moreover, the convex hull peeling depth is optimal for odd dimensional data. Our current information about computational complexity may be partial and incomplete, which is acceptable since complexity studies have mostly focused on two dimensional data and higher dimensional results were generally induced from the two dimensional results. A direct comparison of computational complexities among different depths may be unfair, but if our conjecture is reasonable, it would be better to use fast algorithms such as convex hull peeling to estimate data depth, especially when we are interested in massive multivariate data.

1Some efficient algorithms are presented under specific data alignments (Cheng et al., 2001[51]).

• Intuitive appeals for depicting distributions and detecting outliers. Beyond bivariate data, outlier detection methods without assuming multi- variate normal distribution are hardly found in the statistic literature. Ac- cording to He and Wang (1997[90]), a generalized quantile process that de- picts contours of multivariate distributions, leads to statistical inferences. From our experience and intuition, the convex hull peeling process simply provides the contours at various levels. The half space depth also gives nested polytopes along with but other depths, simplicial depth, Oja’s depth, majority depth are hard to imagine their looks. Convex hull peels are connected to quantiles of a distribution in a nonparametric fashion, when the data were generated from a unimodal convex body distri- bution. If outliers reside around the main distribution, this contouring will detect the existence of outliers without disturbing the shape of inner hulls of the main distribution. This idea was already practiced in Bebbington (1978[26]) and Rousseeuw et.al. (1999a[183]). The convex hull peeling process accesses points that have same depth and lie on the same depth level set. However, we have to keep in mind the fact that this equi-depth contour level is not necessarily equivalent to the equi- probability contour. The latter represents area of equal density, while the equi-depth convex hull contours pertain to relative locations. Theoretic and 115

probabilistic studies that connect location and density (equi-depth contours and equi-probability contours) have not been finalized, despite numerous studies on the limiting distributions of the number of vertices, areas, and perimeters of a convex hull. Empirical studies will help to clarify the relationship between the equi-depth contours and the equi-probability contours. Additionally, we would like to emphasize that outliers are usually peripheral points. Outliers are different from the majority of the data points even though no obvious connection exists between location and density. The outermost convex hull level sets (depth values close to 0) and beyond are likely to contain multivariate outliers.

• Statistical Depth. Needless to say, the fact that the convex hull peeling depth satisfies the conditions of a statistical data depth allows the convex hull peeling process to be extended to statistical inference problems.

• Robustness. The breakdown point of the convex hull peeling median is small compared to the half space median and the simplicial median (Donoho, 1982[59] and Liu, 1990[130]). This fact can mislead one to the perception that convex hull peeling location estimators are not robust. In contrast, the studies of the empirical mean squared errors and the relative efficiencies of the convex hull peeling median, compared with the coordinate-wise median and mean, clearly show that the convex hull peeling median is not affected much by actual outliers. These outliers are not imaginary ones residing at the infinite boundary.

• Applicable to sequential methods. Last, convex hull peeling is applicable with sequential methods. Estimating quantiles involves ordering data, but when a data set is too large to be stored in primary memory, standard algorithms for computing sample quantiles become infeasible. McDermott and Lin (2003[142]) studied applying sequential methods to estimating bivariate quantiles with the convex hull peeling process and showed that their sequential convex hull peeling process satisfies low storage constraints. Motivated by the results from McDermott and Lin (2003[142]), we adopt sliding window algorithms to extend tools for statistical inference sequentially by means of the convex hull peeling process within the limits of storage space.

These reasons encourage further investigation of the empirical convex hull peeling process. The main target of the investigation in this chapter is developing algorithms and illustrating tools to understand properties of multivariate massive data distributions based on the convex hull peeling process. To serve the aim of the algorithm development study, we use the QHULL2 algorithm for obtaining vertices of a convex hull and adopt R3 as the primary software. Algorithms are verified and illustrated via bivariate samples through Monte Carlo simulations. We emphasize that this convex hull peeling process will work in any moderate number of dimensions; the only reason for choosing bivariate samples is the ease of visualization. We take samples from a bivariate uniform distribution, a bivariate normal distribution, and a bivariate t distribution. Actual applications are performed on colors of celestial objects mined from the Sloan Digital Sky Survey Data Release Four, which has more than two attributes. Due to the limits of displaying this multivariate data set, we carefully coined one dimensional mappings based on the convex hull peeling quantile functionals and use these one dimensional mappings as surrogates of a multi-dimensional display without any low dimensional projection. Since our algorithmic approaches are more empirical than theoretical, and more computationally intensive than mathematically rigorous, the sequential methods on streaming data are introduced without mathematical justifications. As a matter of fact, algorithms on multivariate streaming data have not been established yet and we are unable to compare the results mathematically. Our intention is to know the degree of similarity and accuracy between a sequential method and a non-sequential method. We adopt the T-window scheme (Xu et.al., 2004[233]) to cope with the streaming data and propose sequential methods that preserve memory and computational efficiency. This chapter mostly focuses on how the convex hull peeling process is applied to delineate characteristics of multivariate distributions, by proposing algorithms and visualization methods for multivariate massive data analysis. Here, we will extend the notion of the convex hull peeling process to actual applications with astronomical data. We begin Section 2 by describing the target data set, the Sloan Digital Sky Survey Data Release Four, prior to introducing algorithms.

2 The software is acquired from http://www.qhull.org.
3 http://www.r-project.org

Section 3 discusses algorithms for multivariate median estimation, including sequential methods. Section 4 presents the quantile estimation methods. Section 5 presents density estimation. Section 6 introduces outlier detection methods; we propose some practical ways to detect outliers depending on the definition of multivariate outliers. Section 7 addresses the distribution shape measures of multivariate data related to skewness and kurtosis. Section 8 gives a scientific interpretation of the data applications and the last section summarizes the overall characteristics of convex hull peeling algorithms.

5.2 Data Description: SDSS DR4

Astronomical data sets have experienced unprecedented and continuing growth in volume, quality, and complexity over the past few years. Such massive data sets have motivated new statistical methods to be developed, which is the main goal of our study. One prominent astronomical survey data set available is from the Sloan Digital Sky Survey (SDSS hereafter, http://www.sdss.org). This section briefly introduces SDSS; convex hull peeling algorithms will be applied to SDSS data in later sections. The SDSS began official operations in April 2000. Prior to that date, as commissioning data were released, the technical aspects of the early stage of the project were described in York et.al. (2000[235]) along with preliminary studies on data properties. The SDSS comprises both imaging and spectroscopic surveys of the sky (York et.al., 2000[235]) using a wide-field 2.5m telescope. Images are collected in five broad bands, u, g, r, i, z, and astronomically processed. Then, objects are selected from these imaging data for spectroscopy, which enables calibration of one of the most important astronomical parametric values, redshift (the distance indicator). The SDSS software automatically reduces the 20Gb/hour data stream to useful object catalogs (Schlegel et.al., 2000[191]). General properties of SDSS data, including parameters, data formats, pipeline design, data processing, data quality, and calibration, are given in Stoughton et.al. (2000[216]) with the Early Data Release. The data pipeline incorporates statistical screening and classification algorithms for selecting the right objects.

As of May 2006, SDSS has publicized Data Release Four (DR4, hereafter), which is the target of our algorithm applications for statistical multivariate analysis. The SDSS DR4 was first released in July 2005 as part of the series of data publications from the Sloan Digital Sky Survey. Here, we give a brief explanation of SDSS and DR4 to get a glimpse of the characteristics of the data sets used in the upcoming analyses. The imaging sky coverage of DR4 reaches 6670 square degrees and DR4 contains about 180 million objects. The spectroscopic catalogue of DR4 covers 4783 square degrees and contains 849,920 spectra of classified objects; among these objects, 565,715 are galaxies, 67,382 are Quasars (redshift < 2.3), 9,101 are high redshift Quasars, 102,714 are stars, and 50,373 are M or later type stars. The volumes of images and spectra reach 7.5TB and 140GB, respectively. Attributes of photometric data are the color indices u, g, r, i, z along with coordinates. Object identification comes from spectroscopy, but spectroscopic data are more limited than photometric data. Adelman-McCarthy et.al. (2006[4]) described the overall properties of the SDSS DR4. (Information on previous data releases, DR1, DR2, and DR3, is available; Abazajian, 2003[3], 2004[2], 2005[1]). Specific details are also found on the SDSS website (http://www.sdss.org/dr4) and references therein. Among the catalogues, we chose SpecPhotoAll and extracted values under the variables z (redshift, the distance indicator in astronomy), psfMag u, psfMag g, psfMag r, psfMag i, and psfMag z. Astronomy catalogues contain various variables, but mainly they are locations and magnitudes with errors from different calibration processes. As mentioned above, we obtained psf magnitudes to get the multi-color distribution and z for the physical distance reference. The Quasar sample size is 70204 and the galaxy sample size is 499043 after removing objects with missing magnitude values. These numbers are smaller than the numbers publicized on the website since redshifts can be obtained only from the spectroscopic data catalogue, which is smaller than the photometric image data catalogue. Nonetheless, the number of target objects, along with the number of variables per object, does not suit conventional statistical analysis.

5.3 Multivariate Median

The convex hull peeling process allows natural estimation of the center of a multivariate data cloud. The by-product after consecutive peeling is called the convex hull peeling median (CHPM). The convex hull peeling median is affine invariant (Chenouri, 2004[52]) while the coordinate-wise median is not affine equivariant (Rousseeuw and Ruts, 1998[186]). By a convex hull algorithm, vertices are determined and taken away from the d dimensional sample of size n before the next convex hull construction. Ripping off the points of every convex hull is repeated until no more than d + 1 points are left, and each hull construction assigns statistical depths to the extreme points of that hull. The left over point(s) after exhaustive peeling is defined to be the convex hull peeling median, the maximum depth point. Taking an average of the left over points is analogous to the univariate median when there is an even number of points to be sorted. Algorithm 7 describes how to obtain a convex hull peeling median.

Algorithm 7 Convex hull peeling median I
Require: sample of size n ≫ dimension d
repeat
    peel the d dimensional data with QHULL.
until not more than (d + 1) points are left.
CHPM = average of the points left.
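For concreteness, the following is a minimal Python sketch of Algorithm 7 using scipy.spatial.ConvexHull, which wraps the same Qhull library used here; the function name chp_median and the synthetic usage example are our illustration, not code from the thesis.

```python
import numpy as np
from scipy.spatial import ConvexHull


def chp_median(points):
    """Convex hull peeling median (cf. Algorithm 7): peel convex hulls with Qhull
    until no more than d + 1 points remain, then average the points left."""
    pts = np.asarray(points, dtype=float)
    d = pts.shape[1]
    while pts.shape[0] > d + 1:
        hull = ConvexHull(pts)                 # Qhull call (scipy wraps qhull.org)
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False            # drop the current hull's vertices
        if not keep.any():                     # everything lies on one hull: stop
            break
        pts = pts[keep]
    return pts.mean(axis=0)                    # CHPM = average of the left-over points


# usage: CHPM of 10,000 points from a standard bivariate normal distribution
rng = np.random.default_rng(0)
print(chp_median(rng.standard_normal((10000, 2))))
```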

When the multivariate data are streaming or too large to fit in memory (recall that assigning depth is analogous to sorting univariate data), we would like to adopt some sequential methods to estimate the median at a given time. The key idea of sequential methods for bivariate samples was introduced by McDermott and Lin (2003[142]). Here, we expand the method to multivariate examples. First, we illustrate the sequential method on synthetic bivariate data purely for easy visualization. Later, we provide the results from actual higher dimensional data to which our sequential methods are applied. Let the total size of the data be n and take m as a portion of n. For example, instead of waiting for data to accumulate through a year (n), we take daily data (m) and process them daily. At the end of the year, we combine the daily results into the final yearly result so that we do not suffer from wasted storage space and computational delay

due to huge data. Denoting the sample by $X_1, ..., X_n$, a sequential method repeatedly peels $X_{(N-1)m+1}, ..., X_{Nm}$ for $N = 1, ..., n/m$. Such an algorithm is described in Algorithm 8.

Algorithm 8 Sequential convex hull peeling median I
Require: n, m ≫ d, and empty S
for i = 1, ..., ⌈n/m⌉ do
    repeat
        peel the given m points with QHULL.
    until not more than pm points are left (0 < p < 1).
    Store the points left in S.
end for
repeat
    peel S with QHULL.
until not more than (d + 1) points are left.
Sequential CHPM = average of the points left.

This sequential CHPM algorithm can be modified slightly, giving Algorithm 9.

Algorithm 9 Sequential convex hull peeling median II
Require: n, m ≫ d, empty S
for i = 1, ..., ⌈n/m⌉ do
    Add the given m points to S.
    repeat
        peel S with QHULL.
    until not more than pm points are left (0 < p < 1).
end for
repeat
    peel S with QHULL.
until not more than (d + 1) points are left.
Sequential CHPM = average of the points left.
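The sketch below, again using scipy's Qhull wrapper, follows the spirit of Algorithm 8: each arriving batch is peeled down to roughly a fraction p of its points, the survivors are pooled in S, and S is peeled to the median at the end. The function names and the stream of synthetic batches are our own illustration, not the thesis implementation.

```python
import numpy as np
from scipy.spatial import ConvexHull


def peel_until(pts, target):
    """Peel convex hulls until no more than `target` points remain."""
    while pts.shape[0] > target:
        hull = ConvexHull(pts)
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False
        if not keep.any():
            break
        pts = pts[keep]
    return pts


def sequential_chpm(batches, p=0.05):
    """Sketch in the spirit of Algorithm 8: peel each batch of m points down to
    about p*m survivors, pool the survivors in S, then peel S to the median."""
    survivors, d = [], None
    for batch in batches:                        # batches arrive one at a time
        batch = np.asarray(batch, dtype=float)
        d = batch.shape[1]
        survivors.append(peel_until(batch, max(int(p * len(batch)), d + 1)))
    pooled = np.vstack(survivors)                # S: roughly p*n points in total
    return peel_until(pooled, d + 1).mean(axis=0)


# usage: 10^6 bivariate normal points processed in 100 batches of m = 10^4
rng = np.random.default_rng(1)
stream = (rng.standard_normal((10000, 2)) for _ in range(100))
print(sequential_chpm(stream, p=0.05))
```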

Both sequential Algorithms 8 and 9 only need pn and (1 + p)m storage space, respectively, so the space shortage problem can be eliminated. Note the cases p → 0 and p → 1 in Algorithms 8 and 9: the first case rips off all points except the presumed median every m points; the second case keeps all data, as in the non-sequential convex hull peeling process, so that the medians from the non-sequential and the sequential methods will be the same. Also, m can vary depending on data stream rates, so that p can be adjusted. We prefer to choose a very small p, which keeps S at a reasonable size for the final peeling process, in order to resolve the memory problem.
Example 1. Two synthetic data sets of sizes $10^4$ and $10^6$ were created from the standard bivariate normal distribution. In order to compare the results from both the non-sequential and the sequential method, we chose the sequential CHPM Algorithm 8 with p = 0.05 to obtain the median of the $10^6$ points by taking m = $10^4$. Because of the memory allocation problem on our working computer, we were not able to increase the sample size beyond $10^6$ for the non-sequential CHPM. For comparison purposes, we add the coordinate-wise mean and median. Table 5.1 shows that for a symmetric distribution, any of these methods of estimating the data center works.

Table 5.1. Comparison among the mean, coordinate-wise median, and convex hull peeling median (CHPM) of synthetic data sets of sizes $10^4$ and $10^6$ from the standard bivariate normal distribution. For the larger sample, we also apply the sequential CHPM method.

n        mean                    median                    CHPM
$10^4$   (0.001338, -0.02232)    (-0.005305, -0.01643)     (0.000918, -0.010589)
$10^6$   (0.000072, 0.000114)    (0.001185, -0.000717)     (0.002455, -0.000456)
$10^6$   Sequential CHPM →                                 (0.004741, -0.004111)

Example 2. We apply the non-sequential method and the sequential method to estimate the center of the four dimensional color space of Quasars and galaxies. We illustrate the difference between simple coordinate-wise methods and the CHPM by processing SDSS DR4 Quasars (sample size 70204) and galaxies (sample size 499043) with four color attributes. As seen from Table 5.2, the coordinate-wise mean and the CHPM are very different. The range of colors is in general very narrow, so the observed discrepancies between the coordinate-wise mean and the CHPM indicate that the distributions of these objects in the 4 dimensional color space cannot be investigated by imposing a multivariate normal distribution. Importantly, the mean and covariance matrix should not be the parameters of interest. More warnings against using a multivariate normal distribution to describe the data will come as we further apply the convex hull peeling processes to these data sets.

Table 5.2. Different center locations of the Quasar and galaxy distributions in the space of color magnitudes based on the mean, coordinate-wise median, CHPM, and sequential CHPM. The sequential method used p = 0.05 and m = $10^4$.

Quasars     u-g      g-r      r-i      i-z
Mean        0.4619   0.2484   0.1649   0.1008
Median      0.2520   0.1750   0.1520   0.0770
CHPM        0.2530   0.1640   0.1913   0.0700

Galaxies    u-g      g-r      r-i      i-z
Mean        1.622    0.9211   0.4226   0.3439
Median      1.680    0.8930   0.4200   0.3540
CHPM        1.790    0.957    0.424    0.367
Seq. CHPM   1.772    0.950    0.4228   0.363

Unless the data distribution is symmetric, the results in Table 5.2 are expected, since the mean is not a robust estimator and the coordinate-wise median is not an affine invariant estimator. Quasars and galaxies have skewed distributions in the multidimensional color space. In addition, these samples of Quasars and galaxies clearly demonstrate that the coordinate-wise mean and median are not robust.

5.4 Quantile Estimation

In the previous chapter, we discussed the generalized quantile process, which leads to multivariate statistical inferences. Simulation studies suggest that quantile estimation via the convex hull peeling process provides reasonable sample quantiles when the data generating distribution is convex and unimodal. Algorithm 10 describes how quantiles are estimated empirically by means of the convex hull peeling process.

Algorithm 10 pth Convex Hull Peeling Quantile
Require: sample of size n ≫ dimension d
repeat
    peel the d dimensional data with QHULL.
until less than pn points are left, but no less than pn when the points of the last convex hull level set are counted.
The pth quantile is the set of vertices on the last peel.

As with the multivariate median, a convex hull peeling depth or a quantile can easily be estimated by peeling the corresponding number of points off. Without any technical difficulties, Algorithm 10 is extended to a sequential pth convex hull peeling algorithm. If target quantiles are given in order, say from the smallest corresponding depth to the largest one, L different quantiles are simultaneously estimated. For univariate and bivariate samples, this simultaneous estimation based on sequential methods was presented in Liechty et.al. (2003) and McDermott and Lin (2003). This simultaneous calculation is achievable due to the directional characteristic that the convex hull peeling process only works from the outside to the inside. Algorithm 11 describes how the sequential quantile estimation works.

Algorithm 11 Sequential pth Convex Hull Peeling Quantile
Require: n, m ≫ d
for i = 1, ..., ⌈n/m⌉ do
    repeat
        peel the given m points with QHULL.
    until less than pm points are left, but no less than pm when the points of the last convex hull level set are counted.
    Store the vertices of the last level set in S.
end for {The last patch of data may have fewer than m points.}
repeat
    peel S with QHULL.
until less than a half of S is left, but no less than a half when the points of the last convex hull level set are counted.
The sequential pth quantile is the set of vertices on the last level set.
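As a concrete reference, here is a minimal Python sketch of Algorithm 10; the function name chp_quantile and the usage example are our own, and the sketch assumes p * n is well above d + 1 so that Qhull always has enough points.

```python
import numpy as np
from scipy.spatial import ConvexHull


def chp_quantile(points, p):
    """Sketch of Algorithm 10: peel hulls until removing the next hull would leave
    fewer than p*n points; the vertices of that hull form the p-th quantile."""
    pts = np.asarray(points, dtype=float)
    target = p * len(pts)
    while True:
        hull = ConvexHull(pts)
        if pts.shape[0] - hull.vertices.size < target:
            return pts[hull.vertices]           # vertices of the last peel
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False
        pts = pts[keep]


# usage: the 0.95-th convex hull peeling quantile of a bivariate normal sample
rng = np.random.default_rng(2)
level_set = chp_quantile(rng.standard_normal((5000, 2)), 0.95)
print(level_set.shape)                          # (number of vertices, 2)
```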

In Figure 5.1, we compare the results from a non-sequential and a sequential method. A sample size of one million might not be called massive data, but in order to compare a non-sequential and a sequential method we adopted a sample size of one million to avoid the restrictions of the personal computer. It may be an obstacle to compute multivariate quantiles from more than millions of points in a single process, but there seems to be almost no limit on the sample size for a sequential method to estimate quantiles. As shown in Figure 5.1, the convex hull peeling level sets from the two methods are hardly separable. When the data set comes from a smooth convex distribution, the sequential methods for estimating quantiles will overcome the computational memory limits.


Figure 5.1. The sample size is $10^6$ from a standard bivariate normal distribution and the target quantiles are {0.99, 0.95, 0.9, ..., 0.1, 0.05, 0.01}. For the sequential method, we took 1000 data points at a time and estimated quantiles. Each quantile has 1000 corresponding convex hull peeling level sets, which are combined and peeled until about a half of the data points are left. Blue indicates quantiles of the non-sequential method and red indicates quantiles of the sequential method.

Besides, we would like to comment that the inter quartile range (IQR) is determined from the .5th quantile, containing about half of the sample points. Conceptually, the IQR contains the center of the data and shows the degree of compactness of the central portion of the multivariate data. In summary, estimating quartiles in a nonparametric way attracts particular interest. Empirical quantiles provide summary statistics with which to understand the distributions of Quasars and galaxies in the multi-dimensional color diagrams, which will be discussed in later sections.

5.5 Density Estimation

Since multiple quantiles are simultaneously estimable, the empirical density profile can be estimated from these quantiles. Assume that the sample is generated from a multivariate probability distribution and that a sequence of quantile estimates is ordered from the quantile of the largest percentile to the quantile of the smallest percentile. The choice of quantiles depends on the data. For simplicity, we choose

$\{Q_i\} = \{0.99, 0.95, 0.9, 0.8, ..., 0.1, 0.05, 0.01\}$. Let the convex hull volume of the corresponding quantile estimator (Def. 4.4) be $V_{CH}(1 - Q_i) = V_{Q_i}$ and the density at the given quantile $Q_i$ be $h_{Q_i}$, with $h_{Q_0} = 0$. Then,

$$\sum_i (h_{Q_i} - h_{Q_{i-1}})\, V_{Q_i} \;=\; \int h(q)\, dV(q) \;=\; 1, \qquad (5.1)$$

where $q$ is the generalized quantile, and $h(q)$ and $V(q)$ are a density functional and a volume functional corresponding to $q$, respectively. The continuity of convex hull peeling functionals is an open question, but the sum over the $h_{Q_i}$ and $V_{Q_i}$ has to be one to get a legitimate density profile. Estimating the density function can be illustrated easily with samples from any bivariate distribution. We used the following calculation to estimate $h_{Q_i}$ at each $Q_i$. For simplicity, $Q_i$ is denoted by $i$ and $V_{CH}(1 - Q_i)$ by $V_i$; therefore, $V_i$ denotes the volume inside the $i$th convex hull level set (in 2D, $V_i = A_i$, the area), $C_{p_i}$ denotes the convex hull central region containing the $p_i$ fraction of the data, and $h_i$ designates the estimated height, i.e., the density for the given $C_{p_i}$. From the condition (5.1), the $h_i$'s are estimated from the following:

$$1 - p_1 = \tfrac{1}{2}(V_0 - V_1)\,h_1 \;\Rightarrow\; h_1 = \frac{2(1 - p_1)}{V_0 - V_1}$$
$$1 - p_2 = 1 - p_1 + \tfrac{1}{2}(h_1 + h_2)(V_1 - V_2) \;\Rightarrow\; h_2 = \frac{2(p_1 - p_2)}{V_1 - V_2} - h_1 \qquad (5.2)$$
$$\cdots$$
$$1 - p_i = 1 - p_{i-1} + \tfrac{1}{2}(h_{i-1} + h_i)(V_{i-1} - V_i) \;\Rightarrow\; h_i = \frac{2(p_{i-1} - p_i)}{V_{i-1} - V_i} - h_{i-1}$$
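The recursion of eq. (5.2) can be coded directly once the central region volumes are available. The following Python sketch, built on the quantile peeling idea above, is our own illustration; it assumes the volumes at successive levels are strictly decreasing (no zero denominators) and uses ConvexHull.volume, which equals the area for 2D data.

```python
import numpy as np
from scipy.spatial import ConvexHull


def density_profile(points, levels=(0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5,
                                     0.4, 0.3, 0.2, 0.1, 0.05, 0.01)):
    """Sketch of eq. (5.2): estimate the heights h_i of the empirical density
    profile from the volumes V_i of the central regions at levels p_i."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    vols = [ConvexHull(pts).volume]              # V_0: hull of the whole sample
    for p in levels:                              # peel down to the p-level central region
        while True:
            hull = ConvexHull(pts)
            if pts.shape[0] - hull.vertices.size < p * n:
                break
            keep = np.ones(pts.shape[0], dtype=bool)
            keep[hull.vertices] = False
            pts = pts[keep]
        vols.append(ConvexHull(pts).volume)      # V_i (area in 2D)
    heights, h_prev, p_prev = [], 0.0, 1.0        # h_0 = 0, p_0 = 1
    for p, v_prev, v in zip(levels, vols[:-1], vols[1:]):
        h = 2.0 * (p_prev - p) / (v_prev - v) - h_prev   # recursion (5.2)
        heights.append(h)
        h_prev, p_prev = h, p
    return list(levels), heights


# usage: density heights at the chosen levels for a bivariate normal sample
rng = np.random.default_rng(3)
levels, heights = density_profile(rng.standard_normal((10000, 2)))
print(levels[:3], np.round(heights[:3], 4))
```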


Figure 5.2. The sample size of both examples is $10^4$, from the standard bivariate normal distribution and the bivariate $t_5$ distribution with $\rho = -0.5$. The target quantiles are {0.99, 0.95, 0.9, ..., 0.1, 0.05, 0.01}. Densities are estimated by means of eqs. (5.2), from which the empirical density profile is built.

For bivariate normal samples, an approach similar to the result in the left panels of Figure 5.2 is found in McDermott and Lin (2003[142]). We extend the density profile estimation method of McDermott and Lin (2003), based on the convex hull volume functionals and the convex hull center region, from bivariate normal distributions to other types of distributions. Figure 5.2 clearly shows that the empirical density profile estimation through the convex hull peeling process is functioning

properly for both the bivariate normal and the correlated bivariate $t_5$ distribution. Because the sample distributions are smooth convex body distributions, the convex hull level sets at {0.99, 0.95, 0.9, 0.8, ..., 0.1, 0.05} delineate circular/elliptical contours, and the reconstructed density profiles reflect the original bell shapes of the standard bivariate normal and correlated bivariate $t_5$ distributions. Although the visualization is limited, this density profile estimation process is applicable to higher dimensional data as long as the convex hull volume functionals and convex hull peeling center regions are estimable. Additionally, without affecting the shapes of the inner regions, any deviation in the pattern of contour shapes from the inner regions to the outer regions in the empirical density profiles indicates some irregularity in the distribution, such as outlier contamination. The visualization of such irregularity is rather confined to bivariate examples. Yet, shape quantization by means of skewness and kurtosis measures is available to detect such irregularity in the density profile.

5.6 Outlier Detection

Outliers are observations relatively remote from the center of the data compared to the majority of the data, where identifying the center becomes a critical issue to initiate the analysis of outliers. Small (1990) offered various definitions of the multivariate data center, i.e., the median. Among those definitions, we chose the convex hull peeling median to obtain the center of the data. From the median, we can order multivariate data to define quantiles and diagnose the existence of outliers. The idea that the convex hull peeling process may work for detecting outliers originated with Chakraborty (2001[47]). The relevant distance from this median based on the convex hull peeling process will set the criteria for outlier discrimination. Most outlier detection methods in multivariate analysis assume a multivariate normal distribution (Hawkins, 1980; Barnett and Lewis, 1994). Some parametric robust methods have been developed, but very few nonparametric approaches are found. Recently, decision theoretic outlier detection without knowing the data distribution has become a popular and familiar problem in machine learning, but the development of nonparametric decision theoretic detection methods is only making a snail's walk. This lack of methods is due to the challenges in ordering multivariate data. Here, we order data based on convex hull peeling depths and propose non-distance-based, nonparametric outlier detection methods. Following the convention that outliers deviate from the main data population and live on the surface of the data cloud, we propose three plausible methods of detecting outliers by means of the convex hull peeling process. The first is a method based on empirical quantiles; the second is a method that recognizes shape distortion due to a cluster of outliers; the third is the balloon plot, analogous to Tukey's box plot for univariate data. These methods are particularly designed to detect outliers in massive data.

5.6.1 Quantile Approach

Our quantile approach to outlier detection is similar to hypothesis testing for an outlier, where the level α depends on the subject matter. We have no intention of fixing the level α, which should be suggested by the experts on the data. In Figure 5.2, some scattered points are located outside of the .99th quantile. These points become outliers if we designate the data depth 0.01 as the level α where outlier candidates reside. The level can be changed depending on research interests. Ultimately, this process allows removal of influential points far away from the data center, without using any distance metric, by assigning a proper influence level α. This quantile approach is designed to raise an alarm when possible outliers arrive in streaming data. Since the sequential method enables updating the quantiles without loading all historic data into memory for ordering, future outliers are identified efficiently based on the estimated level α quantile. The following example is given to help in understanding the quantile approach to outlier detection. Consider the following case: every day the data archive center receives millions or more multivariate data points and has to report summary statistics at the end of the month. In addition, on a daily basis, the center wants to report outliers and to interrupt the data archiving process if frequent atypical observations are discovered. Instead of calling up a whole month's worth of data, which is likely beyond the available processing capacity, we can adopt sequential methods for ordering data and detecting outliers. First, choose a proper level α (multiple α's are allowed) for a convex hull peeling level set(s) and estimate quantiles matching these α's. Second, with newly arrived data, check whether any data fall outside of yesterday's α level set. Meanwhile, new α level sets are acquired and stored. Since the inner level sets are not influenced by outliers, we expect that the inner contours do not change as new data arrive. Repeat the first and second steps until the end of the month, unless data from a particular day contain too many outliers, identified based on the α level sets. Therefore, either the data processing is stopped in the middle of the month due to the peculiar behavior of the data streams, or new summary statistics are already prepared to be reported at the end of the month. This quantile approach to outlier detection is flexible and easily modified based on the characteristics of the streaming data and the level α. Although it has not been proven mathematically, the sequential approach to estimating quantiles has shown success in our simulation studies. Empirical sequential methods are commonly adopted in practice because they relieve the pain of loading a month's worth of huge multivariate data just for ordering (sorting) and efficiently keep only a small fraction of the data for outlier detection. A few trimming methods have been reported for robust estimation (Nolan, 1992[157], 1999[156]; Titterington, 1978[220]; Bebbington, 1978[26]). These trimming-based robust estimation methods can be enhanced by estimating α level sets and trimming outliers based on the sets. The quantile based outlier detection method allows removal of observations that deviate more from the center of the data cloud in terms of data depth, independent of underlying coordinate systems and distance metrics.
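The daily monitoring loop just described can be sketched compactly in Python; the helper names chp_level_set and outside_level_set below are ours, and the containment test relies on scipy.spatial.Delaunay, whose find_simplex returns -1 for points outside the convex hull of the stored level set vertices.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay


def chp_level_set(points, p):
    """Level set at depth alpha = 1 - p (cf. Algorithm 10): peel hulls until removing
    the next hull would leave fewer than p*n points; return that hull's vertices."""
    pts = np.asarray(points, dtype=float)
    target = p * len(pts)
    while True:
        hull = ConvexHull(pts)
        if pts.shape[0] - hull.vertices.size < target:
            return pts[hull.vertices]
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False
        pts = pts[keep]


def outside_level_set(level_set_vertices, new_points):
    """Flag arrivals outside the hull of the stored level set vertices."""
    tri = Delaunay(np.asarray(level_set_vertices, dtype=float))
    new_points = np.asarray(new_points, dtype=float)
    return new_points[tri.find_simplex(new_points) < 0]


# daily monitoring sketch: yesterday's 0.99 level set screens today's batch
rng = np.random.default_rng(4)
level_set = chp_level_set(rng.standard_normal((10000, 2)), 0.99)
suspects = outside_level_set(level_set, rng.standard_normal((10000, 2)))
print(len(suspects), "of today's points fall outside yesterday's 0.99 level set")
```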

5.6.2 Detecting Shape Changes caused by Outliers

From the previous chapter, we know that the convex hull peeling process provides one dimensional curves that characterize and quantize multivariate distributions. If there are outliers away from the center, such a curve is expected to show a prominent feature corresponding to the outliers. To generate such a curve, we use the volume of the convex hull central regions. Since the central regions are nested as the depth increases, we expect that the volumes of the central regions will decrease monotonically.


Figure 5.3. Samples of size 2000 are generated. The primary distribution is the standard bivariate normal, but the panels of the second column show 1% contamination by a standard bivariate normal distribution centered at (4,4). As the convex hull peeling process proceeds, the area of each convex hull layer is calculated.

Figure 5.3 illustrates two samples. The first column shows plots of a sample from a bivariate normal distribution and the second column shows plots of a contaminated sample. We can clearly see that a small number of outer convex hull level sets are stretched out because of outliers, and this elongated shape is detected through estimating volumes. Convex hull level sets provide a sequence of volume measures, which naturally forms a one dimensional curve. From the pattern of this one dimensional curve, any changes in a distribution along data depth can be identified. Due to the outliers in this particular example from Figure 5.3, we observe a gap in the curve. The sequence of volumes has a sudden change when a convex hull level set reaches the first layer without outliers. A gap observed in a scale curve built from the convex hull level set volumes is thus a sign of the existence of a cluster of outliers. Detecting a gap as a tool for outlier detection has been used previously by Rohlf (1975[179]), in the generalized gap tests. For bivariate data, Rohlf (1975) constructed a minimum spanning tree and confirmed a gap in the distribution of distances between points under the presence of a cluster of outliers. The difference between Rohlf's gap test and the convex hull peeling process is that detecting shape distortion with convex hull peeling volume functionals does not require a distance metric to measure the distance between points, which the gap test adopted. The presence of possible outliers is detected not only from a gap in the curve formed by the sequence of volume measures but also from a peculiar shape in the early part of the sequence. Identifying a gap in the curve is one realization of changes in the distribution caused by outliers. Sensing changes of a distribution by means of a sequence of volume estimates is proposed as a quick outlier detection method during preliminary data analysis, without presuming any parametric distribution or projecting onto a smaller dimensional space. After admitting the presence of outliers, a more expensive analysis of individual points can be undertaken carefully to treat the potential outliers.
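A minimal sketch of this volume-sequence diagnostic follows; the names and the gap threshold (three times the median drop) are illustrative choices of ours, not a criterion stated in the thesis.

```python
import numpy as np
from scipy.spatial import ConvexHull


def volume_sequence(points):
    """Volume (area in 2D) of each successive convex hull peeling central region."""
    pts = np.asarray(points, dtype=float)
    d = pts.shape[1]
    vols = []
    while pts.shape[0] > d + 1:
        hull = ConvexHull(pts)
        vols.append(hull.volume)
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False
        if not keep.any():
            break
        pts = pts[keep]
    return np.array(vols)


def gap_layers(vols, factor=3.0):
    """Flag level sets whose volume drop exceeds `factor` times the median drop."""
    drops = -np.diff(vols)                       # volumes decrease monotonically
    return np.where(drops > factor * np.median(drops))[0]


# usage: 1% contamination centered at (4, 4), as in Figure 5.3
rng = np.random.default_rng(5)
sample = np.vstack([rng.standard_normal((1980, 2)),
                    rng.standard_normal((20, 2)) + 4.0])
print(gap_layers(volume_sequence(sample)))
```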

5.6.3 Balloon Plot

At the time of writing this thesis, the author is not aware of any attempt to draw a multivariate box plot for outlier detection as Tukey did for univariate data. However, some attempts exist in the bivariate case; for example, Goldberg and Iglewicz (1992[75]), Zani et.al. (1998[236]), Liu et.al. (1999[127]), Rousseeuw et.al. (1999a[183]), Ruts and Rousseeuw (1998[189]), and Miller et.al. (2003[147]). The limitation to bivariate examples, without expanding to higher dimensions, is partially due to the constraints of graphical visualization and small sample sizes. Exceptionally, Zani (1998) discussed applications in multivariate cases by pairing bivariates, not a true multi dimensional application. Here we propose a multidimensional balloon plot, inspired by Tukey's box plot. As discussed, the convex hull peeling process produces the IQR, the shoulder of the distribution. Tukey's box plot attaches whiskers adjacent to the IQR on both sides, of length 1.5 times (or 3.0 times, depending on the characteristics of the outliers) the length of the IQR. Likewise, we create multidimensional whiskers added to the convex hull peeling central region of the IQR by blowing up this IQR hull by 1.5 times lengthwise. The IQR balloon can be blown up either by length or by volume, depending on experience. As the dimension increases, a small increment in the length of a balloon is more likely to contain outliers due to the curse of dimensionality. The main idea of this balloon plot is to exclude points that do not belong to the main population. Therefore, the definition of the balloon plot is:

Definition 5.1 (Balloon plot). A balloon plot is created by blowing up the .5th convex hull peeling central region by 1.5 times (lengthwise). Let $v_{.5}$ be the set of vertices of the .5th convex hull peeling depth level set. The balloon for outlier detection is

$$B_{1.5} = \{\, y_i : y_i = x_i + 1.5\,(x_i - \mathrm{CHPM}),\ x_i \in v_{.5} \,\}.$$

The contour of the balloon is obtained by connecting $\{y_i\}$.
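The construction in Definition 5.1 translates directly into code. In the sketch below, v_half and chpm would in practice come from the peeling routines sketched earlier; the unit-circle level set and zero center used in the usage example are stand-ins, not SDSS results.

```python
import numpy as np
from scipy.spatial import Delaunay


def balloon_outliers(points, v_half, chpm, blow=1.5):
    """Sketch of Definition 5.1: y_i = x_i + blow * (x_i - CHPM) for x_i in the
    .5-th level set v_half; points outside the hull of {y_i} are flagged."""
    v_half = np.asarray(v_half, dtype=float)
    chpm = np.asarray(chpm, dtype=float)
    balloon = v_half + blow * (v_half - chpm)    # inflated IQR vertices
    tri = Delaunay(balloon)                      # hull of the balloon vertices
    pts = np.asarray(points, dtype=float)
    return pts[tri.find_simplex(pts) < 0]        # outside the balloon


# usage with stand-in inputs (placeholder .5 level set and center)
rng = np.random.default_rng(6)
data = np.vstack([rng.standard_normal((1980, 2)),
                  rng.standard_normal((20, 2)) + 4.0])
theta = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)
radii = 1.0 + 0.05 * rng.standard_normal(32)     # roughly circular stand-in IQR
v_half = np.c_[radii * np.cos(theta), radii * np.sin(theta)]
print(len(balloon_outliers(data, v_half, chpm=np.zeros(2))), "points flagged")
```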

Figure 5.4 illustrates how the balloon plot functions. The blue line is the IQR, which captures the structure of the main distribution unaffected by outliers. The red line indicates the boundary of the balloon, so points located outside this balloon are classified as outliers. In Figure 5.4, the blue line excludes the cluster of outliers located in the upper right corner. Instead of expanding by 1.5 times, the balloon can be stretched by 3 times or $3^{1/d}$ times lengthwise. The latter expansion, especially, minimizes the problem of dimensionality. The scale of the balloon expansion requires further study to provide somewhat universal numbers.


Figure 5.4. The same samples as in Figure 5.3. The blue line indicates the IQR and the red line indicates the balloon, the IQR expanded by 1.5 times lengthwise. Points outside this balloon are considered to be outliers.

5.7 Estimating the Shapes of Multidimensional Distributions

According to Oja (1983), skewness and kurtosis measures are designed intuitively and satisfy the affine invariance property. We would like to show these characteristics for the convex hull peeling skewness and kurtosis; however, the limiting distributions of convex hull peeling estimators are relatively unexplored. Instead, we present graphical diagnostic tools to measure the skewness and kurtosis of multivariate data.

5.7.1 Skewness

With numerous rays radiating in all directions from the center of the data, measuring multivariate skewness has been a difficult task in multivariate data analysis. Although various qualitative, quantitative, and comparative approaches to measuring skewness have been adopted, and the convex hull peeling process can produce multivariate skewness measures based on these various approaches, we only illustrate the convex hull peeling skewness of a distribution with

$R_j$ defined in the previous chapter,

$$R_j = \frac{\max_i \|x_{j,i} - \mathrm{CHPM}\| - \min_i \|x_{j,i} - \mathrm{CHPM}\|}{\min_i \|x_{j,i} - \mathrm{CHPM}\|},$$

where $x_{j,i}$ denotes the $i$th vertex of the $j$th convex hull peeling level set.
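The sequence $\{R_j\}$ can be computed level set by level set during the peeling. The Python sketch below is ours; note that it takes the center as an argument, and the usage example substitutes the coordinate-wise median as a stand-in for the CHPM.

```python
import numpy as np
from scipy.spatial import ConvexHull


def r_sequence(points, chpm):
    """R_j for each convex hull peeling level set, following the formula above."""
    pts = np.asarray(points, dtype=float)
    chpm = np.asarray(chpm, dtype=float)
    d = pts.shape[1]
    rs = []
    while pts.shape[0] > d + 1:
        hull = ConvexHull(pts)
        dist = np.linalg.norm(pts[hull.vertices] - chpm, axis=1)
        rs.append((dist.max() - dist.min()) / dist.min())
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False
        if not keep.any():
            break
        pts = pts[keep]
    return np.array(rs)


# usage: skewed sample (X, Y) ~ (N(0,1), chi^2_10), as in the third panel of Figure 5.5;
# the coordinate-wise median is only a stand-in for the CHPM.
rng = np.random.default_rng(7)
sample = np.c_[rng.standard_normal(2000), rng.chisquare(10, 2000)]
print(np.round(r_sequence(sample, np.median(sample, axis=0))[:5], 2))
```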


Figure 5.5. Three different bivariate samples are compared in terms of the skewness measure $R_j$. The sample of the first column was generated from a standard bivariate normal distribution; the sample of the second column from a bivariate normal distribution with correlation 0.5; the sample of the third column was generated from $(X, Y) \sim (N(0,1), \chi^2_{10})$, where X and Y are independent.

Figure 5.5 illustrates how the skewness of a multivariate distribution can be measured along data depth (a sequence of convex hull level sets). The samples in the first two columns represent symmetric distributions and the sample in the last column illustrates an asymmetric distribution. As predicted in the previous chapter, for the symmetric distributions the sequences of $R_j$ maintain a curve almost parallel to the abscissa, and for the asymmetric distribution the sequence of $R_j$ shows a decreasing pattern along depth. This decreasing trend reflects the fact that the inner regions are more circular than the outer regions.

Nevertheless, the trend in a sequence of $R_j$ is very likely inconsistent when the values of the sequence are estimated from the inner convex hull peeling central regions. The values of the latter part of the sequence $\{R_j\}$ may not reflect the actual shape of the distribution. This phenomenon is a drawback of the convex hull peeling process. As the peeling process continues and the number of estimated level sets increases, the process suffers from discrepancies due to small samples, since the peeling process rips off points and leaves a smaller sample for the next peel. The distribution of points inside the convex hull peeling center region left after the peeling process is not independent any more. This lack of reliability at the central part of the distribution, in measuring its shape as the convex hull level sets march toward the data center, is also observed in Figure 4.1. As the sample size gets smaller due to the peeling process, the shape of the remaining sample becomes smoother and the inner convex hull level sets lose the information of the original shape of the distribution. In addition, the very inner layers even lose the smoothness of the level sets, a large sample property, since an exhaustive peeling process leaves only a small fraction of the sample near the data center for the peeling process. As seen from Figure 5.5, the later part of the one dimensional curve formed by the skewness measures could mislead the interpretation of skewness in the central part of the distribution. Figure 5.5 literally says that the central parts of all three examples have quite fluctuating skewness measures, which is unlikely to be observed in a single distribution. Instead of using the whole curve to understand the skewness of the distribution along data depth, it is recommended to discard the later part of the skewness measures. A few studies on the convex hull peeling process (Hueter, 2004a[98] and 2004b[99]) support our observation of the lack of reliability of the one dimensional curve for the skewness information of a distribution. Those studies clearly state that the asymptotic relationships between the convex hull level sets are only established for a small number of external convex hull level sets. Nevertheless, the overall skewness along data depth (level sets) is well represented in a one dimensional curve for easy interpretation, and the information from the beginning part of the curve remains reliable. Estimating $R_j$ along data depth provides a measure of the skewness of a distribution viewed from outside the data cloud and reflects any existing asymmetry of the distribution.

5.7.2 Kurtosis

Oja (1983) hinted that sample statistics for kurtosis can easily be found by using convex hulls, although a specific kurtosis statistic was not suggested. Here, we suggest a preliminary graphical method of measuring the kurtosis of a distribution based on the tail weight approach. Among the kurtosis measures introduced in the previous chapter, the tailweight approach is easy as well as versatile.


Figure 5.6. Tailweight is calculated based on eq. (5.3). Samples of size 2000 are generated from a uniform, normal, $t_{10}$, and Cauchy distribution.

As discussed, Figure 5.6 illustrates that the kurtosis measure based on tailweights (Eq. 4.1) allows simultaneous comparison among all distributions of interest. The calculation in Figure 5.6 is based on

$$t(r, r_{min}) = \frac{V_{CH}(r)}{V_{CH}(r_{min})}, \qquad (5.3)$$

where $r$ denotes the data depth satisfying $0 < r \le 1$ and $r_{min}$ indicates the smallest depth. The result is as expected, since the uniform distribution is the flattest among these distributions and the Cauchy distribution has the fattest tail. Similar plots based on different data depths and different kurtosis measures are found in Wang and Serfling (2005 and 2006), Serfling (2004), and Liu et.al. (1999).
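The tailweight curve of eq. (5.3) can be traced numerically during the peeling. In the sketch below the depth of each level set is taken as the fraction of points peeled before it, which is our own working convention for the illustration (the Chapter 4 depth definition may differ in detail); everything else follows eq. (5.3) directly.

```python
import numpy as np
from scipy.spatial import ConvexHull


def tailweight_curve(points):
    """Sketch of eq. (5.3): t(r, r_min) = V_CH(r) / V_CH(r_min), with each level
    set's depth r taken as the fraction of points already peeled."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    depths, vols = [], []
    while pts.shape[0] > d + 1:
        hull = ConvexHull(pts)
        depths.append(1.0 - pts.shape[0] / n)    # fraction peeled before this level set
        vols.append(hull.volume)
        keep = np.ones(pts.shape[0], dtype=bool)
        keep[hull.vertices] = False
        if not keep.any():
            break
        pts = pts[keep]
    vols = np.asarray(vols)
    return np.asarray(depths), vols / vols[0]    # normalize by the outermost volume


# usage: compare tail weights of normal and Cauchy samples of size 2000 (cf. Figure 5.6)
rng = np.random.default_rng(8)
for name, sample in [("normal", rng.standard_normal((2000, 2))),
                     ("cauchy", rng.standard_cauchy((2000, 2)))]:
    r, t = tailweight_curve(sample)
    print(name, np.round(t[:5], 3))
```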

5.8 Discoveries from Convex Hull Peeling Analysis on SDSS

Applying the convex hull peeling process to data provides a great deal of information about the characteristics of a multi-dimensional data distribution without imposing any parametric distribution on the data except the convexity of the sample distribution. This nonparametric approach based on the convex hull peeling process assists efficient data analysis for new massive data archives. Nonparametric methods have been applied to stars in three dimensional color space to build a stellar locus to segregate non stellar objects (Newberg and Yanny, 1989[154]). The number of dimensions strongly depends on the visualization limits. Although they did not incorporate all colors, the studies of the stellar locus explained that the star distribution in color magnitude space is not convex, which was the primary reason we excluded the database of stars from our convex hull peeling descriptive statistics study. On the contrary, the stellar locus study illuminated the importance of understanding the distributions of objects in color space for classification purposes. We perform nonparametric descriptive statistics by means of the convex hull peeling process to apprehend the properties of the galaxy and Quasar distributions in four dimensional color magnitude space. First, the convex hull peeling process revealed a big discrepancy between the sample mean and the convex hull peeling median, as discussed in Table 5.2. This result warns not to use the multivariate normal distribution for advanced data analysis on the colors of Quasars and galaxies. Particularly, principal component analysis and the Mahalanobis distance will severely mislead scientific interpretation. Here we investigate the shapes of the distributions of Quasars and galaxies with the convex hull peeling functionals.


Figure 5.7. Volume sequence of convex hull peeling central regions from 70204 Quasars in four dimensional color magnitude space.

Figure 5.7 shows the convex hull peeling volume measures for the four dimensional Quasar color magnitude distribution. At every convex hull level set, we measure the volume of the corresponding convex hull peeling central region. To avoid the curse of dimensionality, we took the one-fourth power of the volume scale, which is shown in the right panel of Figure 5.7. A few outer convex hulls indicate that the Quasar distribution is fat tailed, since a few gaps appear at the beginning of the convex hull peeling volume curve. Observations occupying a few outer hulls need to be investigated since they contain possible outliers; those outliers seem inconsistent with the rest of the objects. Similar to the volume plot of the Quasar data, we investigate the volume functionals of the galaxy color magnitude distribution. Figure 5.8 clearly indicates that galaxies also show a fat tailed distribution, which implies that some galaxies residing in several outer hulls need to receive attention as outlier candidates. Next, we investigate the convex hull peeling skewness measures.


Figure 5.8. Volume sequence of convex hull peeling central regions from 499043 galaxies in four dimensional color magnitude space.

Figure 5.9 is a realization of the measure $R_j$ on Quasars. It shows a very interesting trend, which might explain the internal structure of the Quasar color magnitude distribution. Obviously, Quasars have an asymmetric distribution, since the curve is not flat. The $R_j$ measures of the outer convex hulls have an increasing tendency and the middle part has a decreasing tendency. On the other hand, the central part of the distribution shows symmetry when several innermost convex hull level sets are ignored. These inner $R_j$ measures are intentionally discarded because of the small sample properties discussed in the previous section. This phenomenon can be interpreted as evidence that Quasars do not come from a single component distribution; the Quasar distribution seems like a mixture of a few components. Another interpretation is that asymmetric weights reside in the four dimensional distribution until about the 58th convex hull level set, which causes skewness. It is possible that the simple convex distribution assumption is violated. From the simple plot of a one dimensional curve, much scientific information can be gleaned and many questions can be raised to comprehend the characteristics of the Quasar population. It is certain that Quasars have a complex shape in the four dimensional color magnitude space.

Figure 5.9. Sequence of the $R_j$ skewness measure from 70204 Quasars in four dimensional color magnitude space.

Unlike the skewness curve of Quasars, galaxies show a nice smooth trend in the skewness measures along the convex hull level sets. Figure 5.10 is the realization of the measure $R_j$ on galaxies. As discussed, the center part of the distribution suffers from small sample properties, which is displayed as scatter near the end of the curve.


Figure 5.10. Sequence of the $R_j$ skewness measure from 499043 galaxies in four dimensional color magnitude space.

Figure 5.10 indicates that the center part of the distribution is more elongated than the outer part of the distribution. Very interesting properties of the distributions in color magnitude space are found from the volume plot and the skewness plot, based on the volume measure of convex hull peeling central regions and the skewness measure $R_j$, respectively. These one dimensional curves allow us to look for outliers as well as the characteristics of the distribution shape. Neither galaxies nor Quasars seem to have symmetric distributions at first glance, based on all available data. Different samples will lead to different results; astronomers practice volume limited sampling (Budavári et.al., 2003[38]) or magnitude limited sampling (Ball et.al., 2004[18]) to get good quality samples. Astronomical sampling schemes are beyond our study, but we are sure that these sampling practices will lead to different curves for the volume and skewness measures. We encourage astronomers to investigate survey data with the convex hull peeling process as an exploratory data analysis tool.

5.9 Summary and Conclusions

We have achieved the goal of investigating statistical aspects of the convex hull peeling process by developing one dimensional curves to explore multivariate distributions. We successfully acquired convex hull peeling descriptive statistics such as measures of scatter, skewness, kurtosis, and outlyingness, in addition to convex hull peeling location estimators. The study of the convex hull peeling process was initiated from estimating the data center in a robust and nonparametric way. The peeling process does not assume any distribution or any distance metric, and it is intuitively appealing that the center of the data is located at the core, left after the exhaustive peeling process. From locating the data center, we are able to extend convex hull computational algorithms to estimating general locations or quantiles, scatter, skewness, and kurtosis, as well as the density profile of the multivariate data. Outlier detection methods such as the quantile based approach, the distribution shape change detection approach, and the balloon plot are developed. In addition, we show that convex hull peeling estimators are capable of adapting to streaming data algorithms for efficient computation and savings in memory storage.

We observe success in designing outlier detection methods based on the convex hull peeling process, although Hodge and Austin (2004) expressed their concerns about the curse of dimensionality in the convex hull peeling outlier detection method. Note that in general the convex hull peeling process will be carried out with sample size $n \gg d$, and as few as d + 1 points are enough to comprise a hull, so that worries about the curse of dimensionality can be ignored. Another concern, about data sparsity due to the increase in dimension, can be compensated by controlling the blow-up rate of the balloon when the balloon plot is employed for outlier detection. The size of the balloon can be controlled empirically so that the balloon plot practically excludes outliers occupying one corner of the d dimensional space, unaffected by data sparsity. We came to the conclusion that the convex hull peeling process is quite effective for outlier detection since the inner regions are insensitive to outliers. Even though our studies show that the convex hull peeling process works well as an exploratory multivariate analysis tool, the convex hull peeling depth has drawbacks. First, the direction of estimation always has to be inward: the estimating process begins at the very outside of the data cloud and the inside cannot be reached until points above the data depth of interest are peeled off. Second, for high dimensional data, computation becomes infeasible since the computational complexity grows as a power of roughly half the dimension. It is recommended to stay within 3 to 10 dimensional data sets. Third, the breakdown point is low. Last but not least, the sample is assumed to be generated from a smooth convex body distribution, whose density profile is delineated by convex hull peeling level sets. In contrast, the convex hull peeling depth is highly recommended because of (1) no particular distribution assumption, (2) affine invariance, so that estimators are not affected by underlying coordinate systems, (3) compatibility with streaming data algorithms, and (4) more efficient computation than other data depth measures. Convex hull peeling functionals enable analysis of the galaxy and Quasar distributions in multidimensional color magnitude space. We found the very interesting fact that no multivariate normal distribution should be imposed on the color magnitude distributions of galaxies and Quasars, since the location estimator deviates so much from the sample mean and the skewness curves from both types of objects do not show any symmetry in the distributions. The results obtained by the convex hull peeling process and its functionals show the effectiveness of the nonparametric approach for explaining multivariate distributions without any prior information about the data. In fact, our intention in developing convex hull peeling algorithms is to acquire prior knowledge about data distributions without imposing misleading parametric models. If an observed sample is believed to come from one population, then the sample tends to form a unimodal distribution, so that assuming the convexity of the sample distribution is no longer a strong assumption but a reasonable one. Therefore, the convex hull peeling process and its empirical functionals will serve as good exploratory data analysis tools.
In the future, not only sample versions of descriptive statistics based on the convex hull peeling process but also distributional versions will be developed to perform more sophisticated statistical inference beyond descriptive statistics.

Chapter 6

Concluding Remarks

6.1 Summary and Discussions

So far, we have presented comprehensive theories, interpretations, algorithms, definitions, simulations, and applications related to model selection with the jackknife method and nonparametric multivariate data analysis with the convex hull peeling process. This section provides summaries of all discussions related to the two main topics of this dissertation, given in the previous chapters.

6.1.1 Jackknife Model Selection

A typical model selection method begins with estimating the given candidate models under various principles such as the maximum likelihood principle or the least squares method. Then, we estimate the distance between these estimated models and the true model, and choose a model of minimum distance; therefore, it is expected that a certain bias arises from estimating each candidate model and the distance between this candidate model and the true model with the same sample. Popular model selection criteria provide simple ways to quantify this dual estimation bias. The jackknife method is known for reducing estimation bias. When the jackknife method is applied to model selection, we propose a jackknife model selection criterion that reduces the bias so that one does not have to worry about specific penalty terms. In general, these penalty terms in model selection criteria depend on strong assumptions about the true and candidate models. As with other popular model selection criteria, we choose the Kullback-Leibler information as the distance measure between the true and the candidate model. Thus, the candidate model minimizing this Kullback-Leibler distance, or maximizing the likelihood, is chosen as the best approximating model. We showed that minimizing the Kullback-Leibler distance is equivalent to maximizing the likelihood of a given model. This maximum likelihood principle provides the method of estimating candidate models, where the dual estimation process is introduced and the bias arises. The jackknife maximum log likelihood, however, is asymptotically unbiased for the log likelihood of a model with the parameter of interest, so that concerns about bias disappear. The jackknife estimator of the log likelihood is defined as

$$J_n = n\,L(\hat{\theta}_n) - \sum_{i=1}^{n} L_{-i}(\hat{\theta}_{-i}),$$

where $L(\cdot)$ and $L_{-i}(\cdot)$ denote the log likelihood of the whole sample $X_1, X_2, ..., X_n$ and of the sample without the $i$th observation, and $\hat{\theta}_n$ and $\hat{\theta}_{-i}$ denote the maximum likelihood estimators corresponding to $L(\cdot)$ and $L_{-i}(\cdot)$, respectively. Under some regularity conditions, $J_n$ is, therefore, an asymptotically unbiased estimator of the log likelihood $L(\theta_g)$, where $\theta_g$ is the parameter of interest. Note that we did not indicate $\theta_g$ as the parameter of the true model. This parameter is just the parameter of interest in the parameter space of a candidate model for which the Kullback-Leibler distance becomes minimized.

In order to understand it further, we investigate the stochastic order of $J_n$ from the law of the iterated logarithm. The results are

$$J_n = L(\theta_g) + O(\log\log n) \quad \text{a.s.}$$

and

$$\frac{1}{n} J_n = \int \log f_{\theta_g}(x)\, g(x)\, dx + O\!\left(\sqrt{n^{-1}\log\log n}\right) \quad \text{a.s.}$$

In particular, we can see that $J_n$ has some degree of randomness of higher order than $o(1)$. This uncertainty causes trouble when the candidate models are nested and their parameters of interest overlap, since $J_n$ does not have penalty terms, estimates of the bias under assumptions on the models, to discriminate a parsimonious model from a full model. Next, we discuss the mechanism of the jackknife model selection criterion depending on whether the candidate models are nested or non nested.

6.1.2 Unspecified True Model and Non-Nested Candidate Models

Traditional model selection methods assume that (1) the candidate models are nested and (2) the true model is known. It is not necessary that the true model belong to the set of candidate models; however, the functional form of the true model is assumed to be somewhat apparent so that candidate models can be estimated under this true model. A set of nested candidate models, in other words, implies that the functional structure of the true model is known and that this true model shares a similarity with this set of nested candidate models. These two assumptions are the keys to estimating the bias of a candidate model and produced the familiar penalty terms in famous information criteria. Occasionally, we confront model selection problems where no information describes the true data generating model. In such cases, we may propose candidate models from different parametric families to guess the true model or to find the best approximating model to the truth. This scenario is more realistic in physical science than considering nested models and choosing variables to reach a parsimonious model. Under this circumstance, where the true data generating model is unknown and the possible candidate models are not nested, the jackknife information criterion will be ideal for finding the best approximating model, the closest model to the unknown true model in terms of the Kullback-Leibler distance. On the other hand, when there are some nested candidate models, the uncertainty discussed earlier discourages discriminating the full model from the parsimonious model. The jackknife model selection criterion, therefore, is not suitable for nested candidate models. For such a situation, we need a modified jackknife information criterion which allows discriminating against, or penalizing, the full model relative to the parsimonious model. We suggest the data oriented penalty terms proposed by Rao and Wu (1989).

As Burnham and Anderson (2002) indicated, prior to applying an information criterion it is important to understand the properties of the data and to acquire a priori information, so as to select a proper information criterion and sensible candidate models. Statistics certainly helps in deciding among suitable possibilities, but it is easily misled if the supplied possibilities are far from the truth. Providing sensible candidates comes from understanding the given data and the disciplines that create such data sets.

6.1.3 Convex Hull Peeling Process

Applications of the convex hull peeling process are rarely found in statistical estimation and inference problems, even though it was introduced as a robust location estimator for multivariate data more than three decades ago (Tukey, 1974; Barnett, 1976). The main obstacles to incorporating the convex hull peeling process into statistics were the limited knowledge of multivariate data analysis and the lack of computational power. Despite its intuitive appeal, theoretical studies on limiting distributions of convex hull peeling estimators are rarely found. Instead of taking the mathematical, stochastic geometric view of the convex hull peeling depth, the focus of the current study is to investigate the statistical properties of the convex hull peeling process empirically. We demonstrated that the convex hull peeling process can be adapted to nonparametric multivariate data analysis.

We begin our empirical study by developing location estimators. We define the convex hull peeling depth, which preserves the properties of a statistical data depth (Zuo and Serfling, 2000a). By estimating convex hull peeling depths we are able to assign orders to multivariate data, a task that has long been challenging because of the ambiguity in ordering multivariate observations. The convex hull peeling process is not the unique ordering method; there are various ways of ordering multivariate data. After assigning orders to the multivariate sample, we formalize the convex hull peeling quantile estimator. In addition, the convex hull peeling central region, the convex hull level set, and the volume functional of the central region are established upon the quantile estimator. Together with the generalized quantile process (EinMahl and Mason, 1992), we extend these quantile estimators to volume functionals that provide a one dimensional curve along data depth, from which the structure of a multivariate distribution can be interpreted.

The convex hull peeling process roughly sketches the structure of multivariate data, and we developed methods to interpret the one dimensional curves it delineates. The process is geared toward purely nonparametric multivariate analysis and is known for its robustness; we showed through a relative efficiency study that the convex hull peeling median is relatively insensitive to outliers. Next, we summarize how the convex hull peeling process helps describe the structure of multivariate data and detect outliers in a massive data set.
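A minimal sketch of the peeling itself is given below (an illustration only, assuming scipy's ConvexHull; the function name and the way the innermost level is summarized are hypothetical, not the thesis code): each point is tagged with the peel at which it leaves the sample, and the points surviving to the deepest level play the role of the convex hull peeling median region.

import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_peeling_depth(points):
    # Integer peel index for every point: 1 = outermost hull, larger = deeper.
    pts = np.asarray(points, dtype=float)
    depth = np.zeros(len(pts), dtype=int)
    remaining = np.arange(len(pts))
    level = 0
    while len(remaining) > pts.shape[1]:        # enough points left to form a hull
        level += 1
        hull = ConvexHull(pts[remaining])
        on_hull = remaining[hull.vertices]
        depth[on_hull] = level
        remaining = np.setdiff1d(remaining, on_hull)
    depth[remaining] = level + 1                # leftover interior points
    return depth

rng = np.random.default_rng(2)
sample = rng.normal(size=(500, 2))
d = convex_hull_peeling_depth(sample)
print(d.max(), sample[d == d.max()].mean(axis=0))   # deepest level and its centroid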

6.1.4 Multivariate Descriptive Statistics

We develop tools for nonparametric descriptive statistics from the convex hull peeling process. Consequently, descriptive measures derived from the convex hull peeling procedure depend neither on parametric models nor on distance metrics. Furthermore, there is no need to calculate moments to measure the scatter, skewness, and kurtosis of a multivariate data distribution.

The convex hull peeling process sketches the structure of multivariate data distributions. Each convex hull level set associated with a convex hull peeling depth can be mapped to the profile of the density. We developed functionals of convex hull level sets that quantify structural properties of multivariate distributions at a given depth. These functionals are primarily obtained by measuring the volume of a convex hull peeling central region, and algebraic combinations of these volume functionals provide scatter, skewness, and kurtosis measures of multidimensional distributions.

Visualization of multidimensional data is always a major obstacle in multivariate data analysis, and multivariate quantiles are no exception. Our empirical descriptive statistics from the convex hull peeling process use the volume functional to describe the structure of distributions. The sequence of volume functionals over the convex hull level sets forms a one dimensional curve that describes the shape of a multidimensional distribution along data depth, under the very weak assumption that the sample has been generated from a smooth convex body distribution. This assumption is reasonable in science, since a sample from a single population tends to cluster tightly.
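The sketch below (same assumptions as the peeling sketch above, with a hypothetical helper name) records the volume of each successive convex hull level set from the outermost peel inward; this sequence is the one dimensional curve from which scatter, skewness, and kurtosis type summaries can be read off.

import numpy as np
from scipy.spatial import ConvexHull

def peel_volume_curve(points):
    # Volumes of successive convex hull level sets, outermost first.
    pts = np.asarray(points, dtype=float)
    remaining = np.arange(len(pts))
    volumes = []
    while len(remaining) > pts.shape[1]:
        hull = ConvexHull(pts[remaining])
        volumes.append(hull.volume)             # area in 2-D, volume in higher dimensions
        remaining = np.setdiff1d(remaining, remaining[hull.vertices])
    return np.array(volumes)

rng = np.random.default_rng(3)
vols = peel_volume_curve(rng.normal(size=(1000, 2)))
# Ratios of volume functionals give crude shape summaries (illustrative only;
# the thesis defines its own functionals of the central regions).
print(vols[:5], vols[len(vols) // 2] / vols[0])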

6.1.5 Multivariate Outliers

Defining multivariate outliers is a complicated problem. First of all, the definition of a multivariate outlier is quite obscure compared to that of a univariate outlier, which can be separated easily by looking at the ranks of the data. We define multivariate outliers as observations that reside near the surface of the data cloud and away from its center. Since the convex hull peeling process locates both the surface of the data cloud and the data center, the convex hull peeling depth provides ways to segregate outliers from the main population.

We suggest three different ways of detecting outliers based on the convex hull peeling process. First, outliers can be discriminated by estimating multivariate convex hull peeling quantiles: for a given level α, if that portion of the data is regarded as outlying, simply estimating the quantile through the convex hull peeling process removes the outliers, which reside beyond the estimated quantile. Second, outliers can be identified by detecting shape changes in the convex hull level sets: as the peeling process continues, at some level set the outliers are no longer included, so that the shape of its contour changes drastically from the shapes of the contours of the previous level sets, which presumably contain outliers. Last, we created the balloon plot, a multivariate extension of Tukey's box plot, so that points outside the balloon are considered outliers, just as points beyond the whiskers are in Tukey's box plot. The balloon is formed by inflating the convex hull peeling central region containing 50% of the data (the interquartile range, IQR) away from the data center accordingly. All of these outlier detection methods are purely nonparametric and independent of distance metrics.
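A minimal sketch of the first approach follows (illustrative only; the helper name, the choice of α, and the use of scipy are assumptions): hulls are peeled until roughly a fraction α of the sample has been stripped, and the stripped points are flagged as outliers.

import numpy as np
from scipy.spatial import ConvexHull

def peel_alpha_outliers(points, alpha=0.05):
    # Boolean mask marking the points stripped in the outermost peels.
    pts = np.asarray(points, dtype=float)
    outlier = np.zeros(len(pts), dtype=bool)
    remaining = np.arange(len(pts))
    target = int(np.ceil(alpha * len(pts)))
    removed = 0
    while removed < target and len(remaining) > pts.shape[1]:
        hull = ConvexHull(pts[remaining])
        on_hull = remaining[hull.vertices]
        outlier[on_hull] = True
        removed += len(on_hull)
        remaining = np.setdiff1d(remaining, on_hull)
    return outlier

rng = np.random.default_rng(4)
data = np.vstack([rng.normal(size=(500, 2)), rng.normal(loc=8.0, size=(10, 2))])
print(peel_alpha_outliers(data, alpha=0.03).sum(), "points flagged as outliers")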

6.1.6 Color Distributions of Extragalactic Populations

The primary discovery we made is that it makes no sense to impose multivariate normal distributions on the Quasar and galaxy populations in color magnitude space for statistical analysis. For massive multivariate data few analysis tools exist, so imposing a multivariate normal distribution at the start of a statistical analysis has become a gold standard.

The inadequacy of adopting the multivariate normal distribution is verified empirically by estimating the center of the data cloud with the convex hull peeling process. We observed that the convex hull peeling medians of Quasars and galaxies are quite discrepant from the sample means. If these objects followed a multivariate normal distribution, no discrepancy between the convex hull peeling median and the sample mean would be expected. We found that both Quasars and galaxies have fat tailed distributions in four dimensional color magnitude space. It will be worthwhile to investigate the small fraction of observations composing the fat tails of these distributions; such objects are easily obtained by peeling the data cloud a few times, since they reside near its surface. Additionally, the Quasar population appears to be a mixture of a few components, since a smaller component residing inside the main component causes a strange pattern in the one dimensional curve of skewness measures. Our work may contribute to the automated photometric morphological classification of galaxies and to the photometric redshift prediction problem by providing descriptive statistics on the multidimensional distributions of numerous galaxies.

6.2 Future Works

I realize that this dissertation is just the beginning of what lies ahead. Model selection is a relatively new paradigm, as is the convex hull peeling process. Here we add a few ideas to pursue in the future.

6.2.1 Clustering with Convex Hull Peeling

The initial intention of studying the convex hull peeling process was to develop empirical clustering algorithms. There are few purely nonparametric clustering algorithms apart from K-means, K-medians, and their hybrids. The K-means type clustering algorithms rely on distance metrics, whereas the convex hull peeling process is coordinate invariant. If a clustering algorithm were invented based on the convex hull peeling process, we could therefore claim that this convex hull clustering algorithm is purely nonparametric and does not depend on a distance metric.

Unfortunately, the algorithm we have invented so far is very expensive. This convex hull clustering algorithm has computational complexity O(Kn), where K and n denote the initial number of clusters and the sample size. The idea is similar to K-means, in that the algorithm begins by assigning an initial label k ∈ [1, K] to each data point. In general, nonparametric clustering algorithms are computationally very expensive; a reason for this high cost is given by Mangiameli et.al. (1996[136]).

According to Mangiameli et.al. (1996[136]), the cluster definition problem is NP-complete; therefore, no efficient and optimal algorithm exists for constructing a clustering method. A number of heuristic methods have been proposed, but such heuristics, hierarchical ones in particular, require strong conditions that the data form compact and isolated clusters. Hartigan and Wong (1979[87]) and Hartigan (1987[86]) discussed the high cost of computing clustering algorithms like K-means. They proposed locally optimal algorithms, whose spirit might be adopted in developing a convex hull peeling clustering algorithm conditioned on local optima in the near future.

A maximum entropy approach might help in inventing locally optimal algorithms. Rasson and Granville (1996[168]) introduced the concept of maximum entropy as a dissimilarity measure to solve classification problems. Data depth classifiers utilizing maximum entropy are studied in Ghosh and Chaudhuri (2005[72]), who investigated the possible use of different notions of data depth in nonparametric discriminant analysis.

The convex hull peeling process only evaluates data depth from the smallest to the largest; no reverse ordering is possible. This one directional estimation of depths, a unique property of the convex hull peeling process, might make clustering algorithms more expensive than K-means or similar partitioning style algorithms. Convex hull depth based clustering algorithms also place more restrictions on designing a local optimality criterion, which Hartigan and Wong (1979) discussed as a cost saving device for developing clustering algorithms. In fact, Hartigan (1987) raised concerns about the high cost of the convex hull classifier even for bivariate samples. It would be a great challenge to create a cost effective convex hull peeling classifier or clustering algorithm.

6.2.2 Voronoi Tessellation

While playing with convex hull algorithms and exploring computational methods, I realized that the computational procedure of Voronoi tessellation is closely related to the convex hull peeling process. Fortunately, Voronoi tessellation makes it possible to weight each point according to the distribution of points in space by calculating the volume (area) of its Voronoi cell. Estimating weights is directly linked to estimating density profiles, so Voronoi tessellation may provide a new nonparametric density estimation methodology apart from kernel density estimation, which employs parametric kernels for nonparametric analysis.

Some statistically relevant applications of Voronoi tessellation exist. Allard and Fraley (1997[9]) used Voronoi tessellation for nonparametric mixture component decomposition, and Scargle et.al. (2003[190]) introduced Voronoi tessellation for Bayesian estimation problems. We foresee the application of tessellation to revealing the density structure of a multivariate sample: we will perform Voronoi tessellation on a given sample and use thresholding techniques to separate the main structure embedded in the multivariate density profile from noise fluctuations. Choosing a proper threshold will also allow clusters to be found. Voronoi tessellation seems to have great potential in nonparametric statistical analysis, especially nonparametric density estimation, as well as in scientific discovery.
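As a rough illustration of the idea (a sketch only, assuming scipy's Voronoi and ConvexHull; the thresholding rule and helper name are hypothetical), the inverse volume of each bounded Voronoi cell can serve as a crude density estimate that is then thresholded to isolate the dense structure.

import numpy as np
from scipy.spatial import Voronoi, ConvexHull

def voronoi_density(points):
    # Inverse cell volume per point; NaN for boundary points with unbounded cells.
    pts = np.asarray(points, dtype=float)
    vor = Voronoi(pts)
    dens = np.full(len(pts), np.nan)
    for i, region_idx in enumerate(vor.point_region):
        region = vor.regions[region_idx]
        if len(region) == 0 or -1 in region:    # unbounded cell on the data boundary
            continue
        dens[i] = 1.0 / ConvexHull(vor.vertices[region]).volume
    return dens

rng = np.random.default_rng(5)
sample = rng.normal(size=(2000, 2))
d = voronoi_density(sample)
core = sample[np.nan_to_num(d) > np.nanquantile(d, 0.5)]   # simple threshold
print(len(core), "points above the density threshold")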

6.2.3 Additional Methodologies for Outlier Detection

According to Hautamaki et.al. (2004[88]), the following different approaches can be taken to multivariate outlier detection:

1. Depth-based (outliers are expected to be detected among data objects with shallow depth values),

2. Deviation based,

3. Distance based (knn),

4. Density based,

5. Distribution based,

6. Clustering based,

7. Subspace based (due to the curse of dimensionality),

8. Support vector based,

9. Neural network based, and

10. Statistics based (hypothesis testing approaches to outlier detection, as seen in Li and Liu (2004[123]), Liu and Singh (1992[129]), and Liu (1990[130])).

Petrovskiy (2003[162]) discussed advantages and disadvantages of some of these approaches. We used only a few of them, namely the depth based, distribution based, and clustering based approaches, although they are not directly tied to our quantile based method, our detection of distributional shape changes, and the balloon plot. These approaches could be married to the convex hull peeling process to yield new outlier detection methodologies.

6.2.4 Testing Similarity of Distributions

Li and Liu (2004[123]) and Liu and Singh (1993[128], 1992[129]) suggested or outlined methods for testing the similarity between two distributions based on the simplicial data depth. Such a notion can be extended to other data depths to develop tools for measuring similarity between distributions, and the convex hull peeling depth should be no exception. We would like to develop a method of testing similarity between distributions with the convex hull peeling process.

6.2.5 Breakdown Point

To demonstrate the robustness of an estimator, the idea of the breakdown point is almost always introduced. According to Small (1990[213]), the breakdown point of an estimator is a useful property for understanding its robustness. The breakdown point indicates the degree of robustness, so a higher breakdown point implies a more robust estimator. The breakdown point can be formalized as follows.

Definition 6.1. Let X = {X1, ..., Xn} be a sample and µ̂(X) an estimator computed from that sample. Let X0 be a corrupted sample obtained by replacing m of the original points by arbitrary ones, and let µ̂(X0) be the corresponding estimator. The maximum bias caused by such contamination is

\mathrm{Bias}(m; \hat{\mu}) = \sup_{X_0} \left\| \hat{\mu}(X_0) - \hat{\mu}(X) \right\|.

Infinite values of the bias indicate that the estimator has broken down. The breakdown point can be defined as

BD(\hat{\mu}; X) = \inf\left\{ \frac{m}{n} : \mathrm{Bias}(m; \hat{\mu}) = \infty \right\}.

Note that the largest possible value of the breakdown point for any sensible location estimator is 1/2 (for example, the univariate median).

Regarding breakdown points of data depths, Donoho and Gasko (1992) and Donoho (1982) provided extensive studies on the subject, from which we take the upper bound (n − d + 2)/((n + 1)(d + 1)) on the breakdown point of the convex hull peeling median. It would be useful to know the exact form of this breakdown point. It may not be as bad as one might fear, since we have shown through the relative efficiency studies that the convex hull peeling median is insensitive to outliers.
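The following sketch (illustrative only, not part of the thesis) probes Definition 6.1 empirically by replacing m points with a single very remote value and recording the induced bias for the sample mean and the coordinate-wise median: the mean breaks down at m = 1, while the median resists until about half of the points are corrupted.

import numpy as np

def corruption_bias(estimator, x, m, magnitude=1e8):
    # Bias induced by replacing m points with one very remote value
    # (a single, fixed contamination rather than the supremum in the definition).
    x_corrupt = x.copy()
    x_corrupt[:m] = magnitude
    return np.linalg.norm(estimator(x_corrupt) - estimator(x))

rng = np.random.default_rng(6)
data = rng.normal(size=(100, 2))
for m in (1, 25, 49, 51):
    print(m,
          "mean:", corruption_bias(lambda z: z.mean(axis=0), data, m),
          "median:", corruption_bias(lambda z: np.median(z, axis=0), data, m))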

6.2.6 CHPD is Not Continuous

Almost sure convergence of the sample depth function to the population depth function, together with semi-continuity or continuity of both, has been proved for the simplicial depth, the half space depth, and the projection depth (He and Wang, 1997[90]; Zuo and Serfling, 2000a[241]; Zuo and Serfling, 2000c[243]; Zuo, 2003[240]; among others). These convergence and continuity properties have led to many theoretical and applied studies of the half space, simplicial, and projection depths. On the other hand, no similar results for the convex hull peeling depth have been found. We attribute this lack of convergence and continuity results to the proposition that the convex hull peeling depth is not continuous. There are clues supporting the discontinuity of the convex hull peeling depth.

The convex hull peeling depth is built upon a counting process, and counting itself is discrete. Also, the nearest neighbor of a point need not share its depth value, whereas a point on the same convex hull level set that lies much farther away does. We believe there is evidence supporting this proposition in the stochastic geometry community.

6.2.7 Sloan Digital Sky Survey Data Release Five

Data Release Five (DR5; http://www.sdss.org/dr5) was published on June 28, 2006, about ten days after my defense (June 16, 2006); information on the schedule of data releases is available at http://www.sdss.org/science/DRschedule.html. This latest data release provides larger data sets than Data Release Four (DR4). We wish to perform a similar analysis with DR5 as with DR4.

First, we give a brief explanation of SDSS DR5 to glimpse the characteristics of the data sets to be used in the upcoming analysis. In DR5, the imaging coverage reaches 8000 square degrees of the sky, and DR5 contains about 215 million objects. The spectroscopic catalog of DR5 covers 5740 square degrees and includes 1,048,960 spectra of classified objects; among these, 675,749 are galaxies, 79,394 are Quasars (redshift < 2.3), 11,217 are high redshift Quasars, 154,925 are stars, and 60,808 are M or later type stars. The volumes of images and spectra reach 9.0TB and 170GB, respectively. Attributes of the photometric data are the magnitudes in the u, g, r, i, z bands along with coordinates. Object identification comes from spectroscopy, but the spectroscopic data are more limited than the photometric data. The overall properties of SDSS DR5 are available from http://www.sdss.org/DR5.

Among the catalogs, we choose SpecPhotoAll and extract values of variables such as z (redshift, the distance indicator in astronomy), psfMag_u, psfMag_g, psfMag_r, psfMag_i, and psfMag_z. Astronomy catalogs contain various variables, but mainly these are locations and magnitudes with errors from different calibration processes. We obtain the standard psf magnitudes to get the multicolor distribution of celestial objects and z for the physical distance reference. We will trim objects with missing magnitudes.

Data Release Five has only just been released, so the huge catalog is still under

some extra calibrations, which might change the size of the catalog slightly. In the coming months, when these extra calibrations are finalized, we will apply our convex hull peeling procedures to this new massive data set to help astronomers look for outliers, errors, and peculiar values. Such oddities alert astronomers to pay more attention to the small fraction of objects occupying regions of multidimensional space away from the main populations, instead of trimming those objects by brute force. Statistics and machine learning must contribute to astronomical data processing procedures, since human eyes can no longer make decisions on data of this size; it is a matter of efficiency. We believe that, as part of multivariate analysis, the convex hull peeling process can help make the most of the data from the Sloan Digital Sky Survey. As a first step, it will be very interesting to know whether these new data sets provide the same information as DR4 when the convex hull peeling multivariate data analysis methods are applied.

6.2.8 Additional Topics

Several relevant research topics can be added as extensions for my future work. These topics are briefly addressed in the following.

• analytical counterparts of all the empirical convex hull peeling measures: this project seems unpromising if the functional from the convex hull peeling process can be proven not to be continuous.

• importance sampling: whenever my work related to the convex hull peeling process was presented, the same question was always raised, namely whether the convex hull peeling process can be incorporated into importance sampling.

• concrete examples for the jackknife information criterion: so far, I have had difficulty creating synthetic data, or finding an example, with various types of candidate models from different parametric families under an unknown true model.

• marked point processes with astronomical survey data to reveal the large scale structure of the universe: this work is an extension of the tessellation study.

• automated classification methods for massive data to replace human classifiers: the Hubble tuning fork needs some revolutionary changes.

• mixture model approach to detecting signals: modeling and statistical significance lend confidence to automated signal detection methods.

• improving the KMM algorithm (Ashman et.al., 1994[13]) in astronomy: I would like to replace the hypothesis test for choosing the right mixture model with a proper statistical model selection criterion.

Bibliography

[1] Abazajian, K. et.al. (2005) The Third Data Release of the Sloan Digital Sky Survey. AJ, 129, 1755-1759.

[2] Abazajian, K. et.al. (2004) The Second Data Release of the Sloan Digital Sky Survey. AJ, 128, 502-512.

[3] Abazajian, K. et.al. (2003) The First Data Release of the Sloan Digital Sky Survey. AJ, 126, 2081-2086.

[4] Adelman-McCarthy, J.K. et.al. (2006). The Fourth Data Release of the Sloan Digital Sky Survey. ApJS, 162, 38-48.

[5] Adams, J.E., Gray, H.L, and Watkins, T.A. (1971). An Asymptotic Character- ization of bias reduction by Jackknifing. Ann. Math. Stat, 42(5), 1606-1612.

[6] Agostinelli, C. (2002). Robust Model Selection in Regression via Weighted Likelihood Methodology Statistics & Probability Letters, 56 (3), 289-300.

[7] Akaike, H. (1973). Information Theory and an Extension of the Likelihood Ratio Principle Proceedings of the Second International Symposium of Infor- mation Theory, ed. by B.N. Petrov and F. Csaki. Budapest: Akademiai Kiado, 257-281.

[8] Akaike, H. (1974). A New Look at Statistical Model Identification. IEEE Trans. Auto. Cont’l, 19, 716-723.

[9] Allard, D. and Fraley, C. (1997). Nonparametric Maximum Likelihood Esti- mation of Features in Spatial Point Processes Using Voronoi Tessellation. J. Am. Stat. Ass., 92, 1485-1493.

[10] Aloupis, G. (2005). Geometric Measures of Data Depth. DIMACS Ser. Disc. Math. Theo. Comp. Sci., accepted.

[11] Amari, S. et.al. (1997). Asymptotic of Overtraining and Cross-Validation. IEEE Trans. Neural Networks, 68, 337-404.

[12] Atkinson, A.C. (1994). Fast Very Robust Methods for the Detection of Mul- tiple Outliers. J. Amer. Stat. Assoc., 89, 1329-1339.

[13] Ashman, K.M. et.al. (1994). Detecting Bimodality in Astronomical Datasets. AJ., 108, 2348-2361.

[14] Av´erous, J. and Meste, M. (1997). Skewness for Multivariate Distributions: Two Approaches. Ann. Stat., 25(5), 1984-1996.

[15] Bagchi, A. et.al. (2004). Deterministic Sampling and Range Counting in Geometric Data Streams. Proc. 20th Ann. Symp. Comp. Geom., 144-151.

[16] Bai, Z.D., Rao, C.R., and Wu, Y. (1999). Model Selection with Data-Oriented Penalty. Journal of Statistical Planning and Inference, 77, 103-117.

[17] Balanda, K.P. and MacGillivray, H.L. (1988). Kurtosis: A Critical Review. Am. Stat, 42(2), 111-119.

[18] Ball, N.M. et.al. (2004). Galaxy Types in the Sloan Digital Sky Survey Using Supervised Artificial Neural Networks. MNRAS, 348, 1038-1046.

[19] Barber, C.B. et.al. (1996). The Quickhull Algorithms for Convex Hulls. ACM Trans. Math. Software, 22(4), 469-483.

[20] Barron, A. et.al. (1998). The Minimum Description Length Principle in Cod- ing and Modeling. IEEE Trans. Info. Theory, 44(6), 2743-2760.

[21] Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. John Wiley & Sons, 3rd Ed.,New York.

[22] Barnett, V. (1983). Discussion on Outlier...... s. Technometrics, 25(2), 150- 152.

[23] Barnett, V. (1978). The Study of Outliers: Purpose and Model. App. Stat., 27(3), 242-250.

[24] Barnett, V. (1976). The ordering of Multivariate Data. J. Royal Stat. Soc. A, 139(3), 318-355.

[25] Bayer, V. (1999). Survey of Algorithms for the Convex Hull Problem. preprint

[26] Bebbington, A.C. (1978). A Method of Bivariate Trimming for Robust Estimation of the Correlation Coefficient. Appl. Stat., 27(3), 221-226.

[27] Beckman, R.J. and Cook, R.D. (1983). Outlier...... s. Technometrics, 25(2), 119-149.

[28] de Berg, M. et.al. (2000). Computational Geometry: Algorithms and Appli- cations. Springer, New York

[29] Bickel, P.J. and Lehmann, E.L. (1975). Descriptive Statistics for Nonparametric Models (I. Introduction, II. Location). Ann. Stat., 3, 1038-1069.

[30] Bickel, P.J. and Lehmann, E.L. (1976). Descriptive Statistics for Nonparametric Models (III. Dispersion). Ann. Stat., 4, 1139-1158.

[31] Böhm, C. and Kriegel, H. (2001). Determining the Convex Hull in Large Multidimensional Databases. Int. Conf. Data Warehousing and Knowledge Discovery.

[32] Boltzmann, L. (1877). Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte, 76, 373-435.

[33] Bozdogan, H. (2000). Akaike's Information Criterion and Recent Developments in Information Complexity. J. Math. Psychology, 44(1), 62-91.

[34] Bozdogan, H. (1987). Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions. Psychometrika, 52(3), 345-370.

[35] Breiman, L. (1992). The Little Bootstrap and Other Methods for Dimension- ality Selection in Regression: X-fixed Prediction Error. J. Amer. Stat. Ass., 87, 738-754

[36] Brunner, R.J. et.al. (2001). Massive Datasets in Astronomy. Handbook of Mas- sive Datasets, Ch. 1, Eds. Abello, Pardalos, and Resende, 931-977.

[37] Buchta, C. (2005). An Identity Relating Moments of Functonals of Convex Hulls. Disc. Comp. Geom., 33, 125-142.

[38] Budav´ari, R. et.al. (2003). Angular Clustering with Photometric Redshifts in the Sloan Digital Sky Survey: Bimodality in the Clustering Properties of Galaxies. ApJ, 595, 59-70.

[39] Burnham, K.P. and Anderson, D.R., (2002). Model Selection and Inference, A practical Information-Theoretic Approach, 2nd Ed.,Springer-Verlag, New York.

[40] Cabo, A.J. and Groeneboom, P. (1994). Limit Theorems for Functionals of Convex Hulls. Prob. Th. Rel. Fields, 100, 31-55.

[41] Carnal, H. (1970). Die konvexe H¨ulle von n rotationssymmetrisch verteilten Punkten. Zeitschrift f¨ur Wahrscheinlichkeitstheorie verwandte Gebiete, 15, 168-175.

[42] Chaudhuri, P. (1996). On a Geometric Notion of Quantiles for Multivariate Data. J. Am. Stat. Ass., 91, 862-872.

[43] Cavanaugh, J.E. (1999). A Large-Sample Model Selection Criterion based on Kullback’s Symmetric Statistics & Probability Letters, 42, 333-343.

[44] Cavanaugh, J. E. and Neath, A. A. (1999). Generalizing the Derivation of the Schwarz Information Criterion Communications in Statistics, Part A – Theory and Methods,28, 49-66

[45] Cavanaugh, J.E. (1997). Unifying the Derivations for the Akaike and Cor- rected Akaike Information Criteria Statistics & Probability Letters, 33, 201- 208

[46] Cavanaugh, J. E. and Shumway, R. H. (1997) A Bootstrap Variant of AIC for State-space Model Selection Statistica Sinica, 7, 473-496

[47] Chakraborty, B. (2001). On Affine Equivariant Multivariate Quantiles. Ann. Inst. Statist. Math., 53(2), 380-403.

[48] Chazelle, B. (1985). On the Convex Layers of a Planar Set, IEEE Trans. Info. Theo., 31(4), 509-517

[49] Chand, D.R. and Kapur, S.S. (1970). An Algorithm for Convex Polytopes, J. ACM., 17, 78-86

[50] Chen, F., Lambert, D., Pinheiro, J.C. (2000). Incremental Quantile Estimation for Massive Tracking. ACM SIGKDD Int. Conf. KDDM, 516-522.

[51] Cheng, A.Y. and Ouyang, M. (2001). On Algorithms for Simplicial Depth. Proc. 13th Can. Conf. Comp. Geometry, 53-56

[52] Chenouri, S. (2004). Multivariate Robust Nonparametric Inference based on Data Depth. Ph.D thesis, University of Waterloo.

[53] Chung, H., Lee, K., and Koo, J. (1996). A note on Bootstrap Model Selection Criterion Stat. & Prob. Letters, 26, 35-41

[54] Clarke, B.S. and Barron, A.R. (1990). Information-Theoretic Asymptotics of Bayes Methods IEEE Trans. Info. Theory, 36(3), 453-471

[55] Cover, T. and Thomas, J.A. (1991). Elements of Information Theory. Wiley, New York.

[56] Djuri´c, P.M. (1997). Using the Bootstrap to Select Models. ICASSP’97, 5, 3729-3732.

[57] Djorgovski, S.G. et.al. (2001). Exploration of Large Digital Sky Surveys. Mining the Sky, Eds. Banday et.al., Berlin, Springer-Verlag, 305-323.

[58] Donoho, D. L. and Gasko, M. (1992). Breakdown Properties of Location Es- timates Based on Halfspace Depth and Projected Outlyingness Ann. Stat., 20(4), 1803-1827.

[59] Donoho, D.L. (1982). Breakdown Properties of Multivariate Location Estima- tors PhD Qualifying Paper, Stanford Univ.

[60] Du, Q., Faber, V. and Gunzberger, M. (1999). Centroidal Voroni Tessellations: Applications and Algorithms SIAM Review, 41(4), 637-676.

[61] Eaton, M.L. (1982). A Review of Selected Topics in Multivariate Probability Inequalities. Ann. Stat., 10, 11-43

[62] Eddy, W.F. and Gale, J.D. (1981). The Convex Hull of a Spherically Sym- metric Sample, Adv. Appl. Prob., 13, 751-763

[63] Eddy, W.F. (1980). The Distribution of the Convex Hull of a Gaussian Sample, J. Appl. Prob., 17, 686-695

[64] Eddy, W.F. (1977). A New Convex Hull Algorithm for Planar Sets, ACM Trans. Math. Software, 3(4), 398-403

[65] Eddy, W.F. (1977). ALGORITHM 523 CONVEX, A New Convex Hull Algo- rithm for Planar Sets [Z], ACM Trans. Math. Software, 3(4), 411-412

[66] Edelsbrunner, H. (1987). Algorithms in Combinatorial Geometry, Springer- Verlag, New York.

[67] Efron, B. (1983). Estimating the Error Rate of a Prediction Rule: Improved- ment on Cross-Validation, and the Repeated Learning-Testing Methods. J. Amer. Stat. Ass., 78, 503-514.

[68] Efron, B. (1965). The Convex Hull of a Random Set of Points, Biometrika, 52(3), 331-343

[69] EinMahl, J.H.J. and Mason, D.M. (1992). Generalized Quantile Processes, Ann. Stat., 20(2), 1062-1078

[70] Feng, Z.D. and McCulloch, C.E. (1996). Using Bootstrap Likelihood Ratios in Finite Mixture Models. J.R. Statisti. Soc. B, 58, 609-617.

[71] Ferreira, J.T.A.S. and Steel, M.F.J. (2006). On Describing Multivariate Skewed Distributions: A Directional Approach. Can. J. Stat., 31(?), accepted.

[72] Ghosh, A.K. and Chaudhuri, P. (2005). On Maximum Depth and Related Classifiers. Scan. J. Stat., 32, 327-350.

[73] Gilbert, A.C. et.al. (2005). Domain-Driven Data Synopses for Dynamic Quan- tiles. IEEE Trans. Know. Data Eng., 17(7), 927-938.

[74] Gnandeskikan, R. and Kettenring, J. (1972). Robust Estimates, Residuals, and Outlier Detection with Multiresponse Data. Biometrics, 28, 81-124.

[75] Goldberg, K.M. and Iglewicz, B. (1992). Bivariate Extensions of the Boxplot. Technometrics, 34, 307-320.

[76] Goodman, J.E. and O’Rourke, J. (2004) Handbook of Discrete and Computa- tional Geometry, Chapman & Hall/CRC, 2nd Ed.

[77] Graham, R.L. (1972). Finding the Convex Hull of a Simple Polygon, J. Algo- rithm, 4, 324-331

[78] Greenwald, M. and Khanna, S. (2001). Space-Efficient Online Computation of Quantile Summaries. SIGMOD RECORD, 30(2), 58-66.

[79] Groeneveld, R.A. and Meeden, G. (1985). Measuring Skewness and Kurtosis, The , 33, 391-399.

[80] Groeneboom, P. (1988). Limit Theorems for Convex Hulls. Prob. Th. Rel. Fields, 79, 327-368.

[81] Hadi, A. (1992). Identifying multiple Outliers in Multivariate data. J. Roy. Stat. Soc. B, 54, 761-771.

[82] Hampel, F. R., Ronchetti, E., Rousseeuw, P. J. and Stahel, W. (1986). Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, New York.

[83] Hannan, E.J. and Quinn, B.G. (1979). The Determination of the Order of an Autoregression. J.R. Statisti. Soc. B, 41(2), 190-195.

[84] Hansen, M.H. and Yu, B. (2001). Model Selection and the Principle of Mini- mum Description Length Journal of the American Statistical Association, 96, 746-774.

[85] Hardin, J. and Rocke, D.M. (2004). Outlier Detection in the Multiple Cluster Setting Using the Minimum Covariance Determinant Estimator. Comp. Stat. Data. Ana., 44, 625-638.

[86] Hartigan, J.A. (1987). Estimation of a Convex Density Contour in Two Di- mensions. J. Am. Stat. Ass., 82(397), 267-270.

[87] Hartigan, J.A. and Wong, M.A. (1979). Algorithm AS 136: A K-Means Clus- tering Algorithm App. Stat., 28(1), 100-108.

[88] Hautamaki, V. et.al. (2004). Outlier Detection Using k-Nearest Neighbour Graph. Proc Pat Recog, 17th Intl. Conf.

[89] Hawkins, D. (1980) Identification of Outliers. Chapman and Hall, London

[90] He, X. and Wang, G. (1997). Convergence of Depth Contours for Multivariate Datasets Ann. Stat., 25(2), 495-504.

[91] Herzberg, A.M and Tsukanov, A.V. (1986). A Note on Modifications of The Jackknife Criterion for Model Selection Utilitas Mathematica, 29, 209-216

[92] Herzberg, A.M. and Tsukanov, A.V. (1985). The Monte-Carlo Comparison of Two Criteria for the Selection of Models J. Stat. Comput. Simul.,22, 113-126

[93] Hjorth, J.S.U. (1994). Computer Intensive Statistical Methods: Validataion, Model Selection and Bootstrap, Chapman and Hall, London

[94] Hodge, V.J. and Austin, J. (2004). A Survey of Outlier Detection Methodolo- gies. A.I. Rev., 22, 85-126.

[95] Hsing, T. (1994). On the Asymptotic Distribution of the Area Outside a Ran- dom Convex Hull in a Disk, Ann. App. Prob., 4(2), 478-493.

[96] Huber, P.J. (1972). Robust Statistics: a Review. Ann. Math. Stat., 43, 1041- 67.

[97] Huber, P.J. (1967). The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions. Proc. 5th Berkeley Symp. Math. Stat. Prob., 1, 221- 233.

[98] Hueter, I. (2004a). Gaussian Convex Hull Peels. preprint.

[99] Hueter, I. (2004b). Convex Peels of a Random Point Set. preprint.

[100] Hueter, I. (1999a) Limit Theorems for the Convex Hull of Random Points in Higher Dimensions. Trans. Amer. Math. Soc., 351(11), 4337-4363.

[101] Hueter, I. (1999b) The Convex Hull of Samples from Self-similar Distribu- tions. Adv. App. Prob., 31, 34-47.

[102] Hueter, I. (1994). The Convex Hull of a Normal Sample. Adv. App. Prob., 26, 855-875.

[103] Hurley, C. and Modarres, R. (1995). Low-Storage Quantile Estimation. Comp. Stat., 10, 311-325.

[104] Hurvich, C.M. and Tsai, C. (1989). Regression and Time Series Model Se- lection in Small Samples. Biometrika, 76(2), 297-307.

[105] Inselberg, A. and Dimsdale, B. (1990). Parallel Coordinates: a Tool for Visualizing Multi-dimensional Geometry. Proc. Visualization 90, 361-378.

[106] Irizarry, R.A. (2001). Information and Posterior Probability Criteria for Model Selection in Local Likelihood Estimation. J. Amer. Stat. Ass., 96(453), 303-315.

[107] Ishiguro, M. et.al. (1997). Bootstrapping Log Likelihood and EIC, an Ex- tension of AIC. Annals of the Institute of Statistical Mathemathics, 49(3), 411-434.

[108] Jarvis, R.A. (1972). An Efficient Algorithm for Determining the Convex Hull of a Finite Planar Set, Info. Process. Letters, 1, 132-133.

[109] Jennrich, R.I. (1969). Asymptotic Properties of Non-Linear Estimators. Ann. Math. Stat., 40(2), 633-643.

[110] Kass, R.E. and Wasserman, L. (1995). A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion. J. Am. Stat. Ass., 90, 928-934.

[111] Kitagawa, G. (1997). Information Criteria for the Predictive Evaluation of Bayesian Models. Commu. Statist.-Theory Meth., 26(9), 2223-2246.

[112] Klee, V. (1971). What is a Convex Set? Amer. Math. Monthly, 78(6), 616-631.

[113] Knorr, E.M. et.al. (2000). Distance-based Outliers: Algorithms and Applications. VLDB Journal, 8(3-4), 237-253.

[114] Konishi, S. and Kitagawa, G. (2003). Asymptotic Theory for Information Criteria in Model Selection: Functional Approach. J. Stat. Plan. Inf., 114, 45-61.

[115] Konishi, S. (1999). Statistical Model Evaluation and Information Criteria. STATISTICS TEXTBOOKS AND MONOGRAPHS vol. 159 - Multivariate Analysis, Design of Experiments, and Survey Sampling, ed. Ghosh, MARCEL DEKKER AG, USA, pp. 369-400.

[116] Konishi, S. and Kitagawa, G. (1996). Generalised Information Criteria in Model Selection. Biometrika, 83(4), 875-890.

[117] Kosinski, A.S. (1999). A Procedure for the Detection of Multivariate Outliers. Comp. Stat. Data. Anal., 29, 145-161.

[118] Kuha, J. (2004). AIC and BIC. Soc. Meth. Res., 33(2), 188-229.

[119] Kullback, S. and Leibler,R.A. (1951). On Information and Sufficiency. Ann. Math. Stat., 22(1), 79-86.

[120] Lahiri, P. (2001) Ed. Model Selection. Inst. Math. Stat., Ohio.

[121] Laszlo, M.J. (1996). Computational Geometry and Computer Graphics in C++. Prentice Hall, Upper Saddle River, N.J.

[122] LeCam, L. (1953). On Some Asymptotic Properties of Maximum Likelihood estimates and Related Bayes Estimates. Univ. Calif. Pub. Stat., 1, 277-330.

[123] Li, J. and Liu, R.Y. (2004). New Nonparametric Tests of Multivariate Loca- tions and Scales Using Data Depth. Stat. Sci., 19(4), 686-696.

[124] Liechty, et.al. (2003). Single-pass Low-storage Arbitrary Quantile Estima- tion for Massive Datasets. Stat. Comp., 13, 91-100.

[125] Lin, X., Lu, H., Xu, J., and Yu, J.X. (2004). Continuously Maintaining Quan- tile Summaries of the Most Recent N Elements over a Data Stream. Proceed- ings. 20th International Conference on Data Engineering, 362 - 373

[126] Linhart, H. and Zucchini, W (1986). Model Selection. John Wiley & Sons, New York.

[127] Liu, R.Y., Parelius, J.M., and Singh, K. (1999). Multivariate Analysis by Data Depth: Descriptive Statistics, Graphics, and Inference, Ann. Stat., 27(3), 783-840.

[128] Liu, R.Y. and Singh, K. (1993). A Quality Index Based on Data Depth and Multivariate Rank Tests. J. Am. Stat. Ass., 88, 252-260.

[129] Liu, R.Y. and Singh, K. (1992). Ordering Directional Data: Concepts of Data Depth on Circles and Spheres. Ann. Stat., 20(3), 1468-1484.

[130] Liu, R.Y., (1990). On a Notion of Data Depth Based on Random Simplices. Ann. Stat., 18(1), 408-414.

[131] Longo, G. et.al. (2001). Advanced Data Mining Tools for Exploring Large Astronomical Databases. Astronomical Data Analysis, Eds. Stark and Murtagh, Proc. of SPIE, 4477, 61-75.

[132] Lu, C., Chen, D., and Kou, Y. (2003). Detectin Spatial Outliers with Multiple Attributes. Proc. 15th IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI03).

[133] Mahalanobis, P.C. (1936). On the Generalized Distance in Statistics. Proc. Nat. Acad. Sci. India, 12, 49-55.

[134] Mallows, C.L. (1973). Some Comments on Cp. Technometrics, 15, 661-675.

[135] Malkovich, J.F. and Afifi, A.A. (1973). On Tests for Multivariate Normality. J. Am. Stat. Ass., 68, 176-179.

[136] Mangiameli, P., Chen, S.K., and West, D. (1996) A Comparision of SOM Neural Network and Hierarchical Clustering Methods. Euro. J. Oprt’al Res., 93, 402-417.

[137] Manku, G.S., Rajagopalan, S., and Lindsay, B.G. (1998). Approximate Me- dians and Other Quantiles in One Pass and with Limited Memory. ACM SIGMOD Record, 27(2), 426-435.

[138] Mardia, K.V. (1970). Measures of Multivariate Skewness and Kurtosis with Applications. Biometrika, 57, 519-530.

[139] Maronna, R. and Yohai, V. (1995). The Behavior of the Stahel-Donoho ro- bust Multivariate Estimator. J. Amer. Stat. Assoc., 90, 330-341.

[140] Mass´e, B. (2000). On the LLN for the Number of Vertices of a Random Convex Hull, Adv. Appl. Prob., 32, 675-681.

[141] Mass´e, B. (1999). On the Variance of the Number of Extreme Points of a Random Convex Hull, Stat. & Prob. Letters, 44, 123-130.

[142] McDermott, J. P. and Lin, D.K.J. (2003). Multivariate Density Estimation for Massive Datasets via Sequential Convex Hull Peeling. preprint

[143] McQuarrie, A. D. (1999). A Small-Sample Correction for the Schwarz SIC Model Selection Criterion Stat & Prob Letters, 44, 79-86.

[144] McQuarrie, A.D., and Tsai, C. (1999). Model Selection in Orthogonal Re- gression Stat & Prob Letters, 45, 341-349.

[145] McQuarrie, A.D. and Tsai, C. (1998). Regression and Time series Model Selection. World Scientific, Singapore.

[146] McQuarrie, A. et.al. (1997). The Model Selection Criterion AICu. Stat & Prob Letters, 34, 285-292.

[147] Miller, K. et.al. (2003). Efficient Computation of Location Depth Contours by Methods of Computational Geometry. Stat. Comp., 13, 153-162.

[148] Miller, G. (1974). The Jackknife—A Review. Biometrika, 61(1), 1-15.

[149] Minnotte, M.C. (1997). Nonparametric Testing of the Existence of Modes. Ann. Stat., 25(4), 1646-1660.

[150] Murata, N., Yoshizawa, S., and Amari, S. (1994). Network Information Criterion: Determining the Number of Hidden Units for an Artificial Neural Network Model. IEEE Transactions on Neural Networks, 5(6), 865-872.

[151] Myung, I.J. (2000). The Importance of Complexity in Model Selection. Journal of Mathematical Psychology, 44, 190-204.

[152] Neath, A.A. and Cavanaugh, J.E. (2000). A Regression Model Selection Criterion based on Bootstrap Bumping for Use with Resistant Fitting. Computational Statistics and Data Analysis, 35(2), 155-169.

[153] Neath, A.A. and Cavanaugh, J.E. (1997). Regression and Time Series Model Selection Using Variants of the Schwarz Information Criterion. Commun. Statist.-Theory Meth., 26(3), 559-580.

[154] Newberg, H.J. and Yanny, B. (1997). Three-dimensional Parameterization of the Stellar Locus with Application to QSO Color Selection. ApJS, 113, 89-104.

[155] Nishii, R. (1988). Maximum Likelihood Principle and Model Selection When the True Model is Unspecified. J. Multi. Anal., 27(2), 392-403.

[156] Nolan, D. (1999). On Min-Max Majority and Deepest Points. Stat. & Prob. Lett., 43, 325-333.

[157] Nolan, D. (1992). Asymptotics for Multivariate Trimming. Stoch. Proc. and Appl., 42, 157-169.

[158] Oja, H. (1983). Descriptive Statistics for Multivariate Distributions. Stat. & Probab. Lett., 1, 327-332.

[159] O'Rourke, J. (2001). Computational Geometry in C. Cambridge Univ. Press, Cambridge.

[160] Pena, D. and Prieto, F.J. (2001). Multivariate Outlier Detection and Robust Covariance Matrix Estimation. Technometrics, 43(3), 286-310.

[161] Penny, K. (1995). Appropriate Critical Values When Testing for a Single Multivariate Outlier by Using the Mahalanobis Distance. App. Stat., 45, 73-81.

[162] Petrovskiy, M.I. (2003). Outlier Detection Algorithms in Data Mining Sys- tems. Prog. Comp. S/W, 29(4), 228-237.

[163] Preparata, F.P. and Shamos, M.I. (1985). Computational Geometry: an In- troduction. Springer-Verlag, New York.

[164] Preparata, F.P. and Hong, S.J. (1977). Convex Hulls of Finite Sets of Points in Two and Three Dimensions, Comm. ACM, 20(2), 88-93.

[165] Protassov, R. et.al. (2002). Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio. ApJ, 571, 545-559.

[166] Qian, G. and K¨unsch, H. R. (1998) Some Notes onf Rissanen’s Stochastic Complexity. IEEE Trans. Info. Theo., 44(2), 782-786.

[167] Rao, C.R. and Wu, Y. (1989) A Strongly Consistent Procedure for Model Selection in a Regression Problem. Biometrika, 76, 369-374.

[168] Rasson, J.P. and Granville, V. (1996). Geometrical Tools in Classification. Comp. Stat. & Data Anal., 23, 105-123.

[169] Render, R. (1981). Note on the Consistency of the Maximum Likelihood Estimate for Nonidentifiable Distributions. Ann. Stat., 9(1), 225-228.

[170] R´enyi, A. and Sulanke, R. (1963). Uber¨ die konvexe H¨ulle von n zuf¨ahlten Punkten. Zeitschrift f¨ur Wahrscheinlichkeitstheorie verwandte Gebiete, 2, 75- 84.

[171] R´enyi, A. and Sulanke, R. (1964). Uber¨ die konvexe H¨ulle von n zuf¨ahlten Punkten. Zeitschrift f¨ur Wahrscheinlichkeitstheorie verwandte Gebiete, 3, 138- 147.

[172] Reschenhofer, E. (1996). On the Asymptotic Behavior of Akaike’s BIC. Stat & Prob Letters, 27, 171-175.

[173] Rissanen, J. (1987). Stochastic Complexity J. Roy. Stat. Soc. B, 49, 223-265.

[174] Rissanen, J. (1986). Stochastic Complexity and Modeling. Ann. Stat., 14, 1080-1100.

[175] Rissanen, J. (1983). A Universal Prior for Integers and Estimation by Mini- mum Description Length. Ann. Stat., 11, 416-431.

[176] Rissanen, J. (1978). Modeling by Shortest Data Description. Automatica, 14, 465-471.

[177] Rockafellar, R.T. (1970). Convex Analysis. Princeton Univ. Press, N.J.

[178] Rocke, D. and Woodruff, D. (1996). Identification of Outliers in Multivariate Data. J. Amer. Stat. Assoc., 91, 1047-1061.

[179] Rohlf, R.J. (1975). Generalization of the Gap Test for the Detection of Mul- tivariate Outliers. Biometrics, 31, 93-101.

[180] Rousseeuw, P.J. and Struyf, A. (2004). Computation of Robust Statistics: Depth, Median, and Related Measures. Handbook of Discrete and Computa- tional Geometry, 2nd Ed., eds. E. Goodman and J. O’Rourke, CRC Press, 1279-1292.

[181] Rousseeuw, P.J. and Hubert, M. (1999). Depth in an arrangement of hyper- planes. Disc. Comp. Geom., 22, 167-176.

[182] Rousseeuw, P.J. and Hubert, M. (1999A). Regression Depth. J. Am. Stat. Ass., 94(446), 388-402.

[183] Rousseeuw, P.J. et.al. (1999a). The Bagplot: A Bivariate Boxplot. Amer. Statist., 53, 382-387.

[184] Rousseeuw, P.J. and Ruts, I. (1999b). The Depth Function of a Population Distribution. Metrika, 49, 213-244.

[185] Rousseeuw, P.J. and VanDriessen, K. (1999c). A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, 41, 212-223.

[186] Rousseeuw, P.J. and Ruts, I. (1998). Constructiong the Bivariate Tukey Me- dian. Stat. Sinica, 8, 827-839.

[187] Rousseeuw, P.J. and Struyf, A. (1998a). Computing Location Depth and Regression Depth in Higher Dimensions. Stat. Comp., 8, 193-203.

[188] Rousseeuw, P.J. and Leroy, A.M. (1987). and Outlier De- tection. Wiley, USA

[189] Ruts, I. and Rousseeuw, P.J. (1996). Computing Depth Contours of Bivariate Point Clouds. Comp. Stat. & Data Anal., 23, 153-168.

[190] Scargle, J. et.al. (2003). Adaptive Piecewise-constant Modeling of Signals in Multidimensional Spaces. PHYSTAT2003, 157-161.

[191] Schlegel, D. (2000). Commissioning Data from the Sloan Digital Sky Survey. Lecture Notes in Physics, 548, 385-393.

[192] Schwarz, G. (1978). Estimating the Dimension of a Model. Ann. Stat., 6(2), 461-464.

[193] Schneider, D.P. et.al. (2002). The Sloan Digital Sky Survey Quasar Catalog. I. Early Data Release. AJ, 123, 567-577.

[194] Schneider, R. (1988). Random Approximation of Conex Sets. J. Microscopy, 151(3), 211-227.

[195] Sen, P.K. (1979) Asymptotic Properties of Maximum Likelihood Esitmators based on Conditional Specification. The Annals of Statistics, 7(5), 1019-1033.

[196] Serfling, R. (2004). Nonparametric Multivariate Descriptive Measures Based on Spatial Quantiles. J. Stat. Plan. Inf., 123, 259-278.

[197] Serfling, R. (2002). Quantile Functions for Multivariate Analysis: Approaches and Applications, Statistica Neerlandica, 56, 214-232

[198] Serfling, R. (2002a). A depth Function and a scale curve based on spatial Quantiles. http://www.utdallas.edu/ serfling/papers/4l102.pdf

[199] Serfling, R. (2002b). Quantile Functions for Multivariate Analysis: Ap- proaches and Applications. Statistica Neerlandica, 56(2), 214-232.

[200] Serfling, R. (2002c). Generalized Quantile Processes Based on Multivariate Depth Functions, with Applications in Nonparametric Multivariate Analysis. J. Multi. Anal., 83, 232-247.

[201] Shannon, L. (1948). A Mathematical Theory of Communication. Bell Sys. Tech. Journal, 27, 379-423.

[202] Shao, J. (1996) Bootstrap Model Selection. J. Amer. Stat. Ass., 91, 655-665.

[203] Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap, Springer-Verlag, New York.

[204] Shao, J. (1993). Selection by Cross-Validation. J. Amer. Stat. Ass., 88, 486-494.

[205] Shao, J. and Wu, C.F.J. (1989). A General Theory for Jackknife Variance Estimation Ann. Stat.,17(3), 1176-1197

[206] Shekhar, S. Lu, C., and Zhang, P. (2001). Detecting Graph-Based Spatial Outliers: Algorithms and Applications. KDD 2001, 371-376.

[207] Shibata, R. (1997). Bootstrap Estimate of Kullback-Leibler Information for Model Selection. Statistica Sinica, 7, 375-394.

[208] Shibata, R. (1989). Statistical Aspects of Model Selection. From Data to Model, ed. Willems, Springer, 215-240.

[209] Silverman, B.W. and Titterington, D.M. (1980). Minimum Covering Ellipses. Siam J. Sci. Stat. Comp.,1(4), 401-409.

[210] Sin, C. and White, H. (1996) Information Criteria for Selecting Possibly Misspecified Parametric Models J. , 71, 207-225

[211] Singh, K. and Xie, M. (2003). Bootlier-plot – Bootstrap Based Outlier De- tection Plot, Sankhya, 65(3), 532-559

[212] Singh, K. (1991). Majority Depth. Unpublished manuscript.

[213] Small, C.G. (1990). A Survey of Multidimensional Medians. Int. Stat. Rev., 58(3), 263-277.

[214] Stone, M. (1974) Cross-Validatory Choice and Assessment of Statistical Pre- dictions (with discussion). J.R. Statist. Soc. B, 36, 111-147.

[215] Stone, M. (1977). An Asymptotic Equivalence of Choice of Model by Cross- Validation and Akaike’s Criterion. J Roy. Stat. Soc. B, 39(1), 44-47.

[216] Stoughton, C. et.al. (2002). Sloan Digital Sky Survey: Early Data Release. AJ, 123, 485-548.

[217] Stoyan, D. (1998). Random Sets: Models and Statistics. Int. Stat. Rev., 66(1), 1-27.

[218] Takeuchi, K. (1976). Distribution of Information Statistics and a Criterion of Model Fitting. Suri-Kagaku (in Japanese), 153, 12-18 Y. Dodge, Ed., 285-301.

[219] Tierney, L. (1983). A Space-Efficient Recursive Procedure for Estimating a Quantile of an Unknown Distribution. SIAM J. Sci. Stat. Comp., 4(4), 706.

[220] Titterington, D.M. (1978) Estimation of Correlation Coefficients by Ellip- soidal Trimming. App. stat., 27(3), 227-234.

[221] Tukey, J.W. (1974). Mathematics and the Picturing of Data. Proc. Interna- tional Congress of Mathematicians: Vancouver, 523-531.

[222] Valentine, F.A. (1976). Convex Sets. Krieger Pub. Co, Huntington, N.Y.

[223] Vanden Berk, D.E. et.al. (2005). An Empirical Calibration of the Complete- ness of the SDSS Quasar Survey. AJ, 129, 2047-2061.

[224] Vardi, Y. and Zhang, C. (2000). The Multivariate L1-median and Associated Data Depth. PNAS, 97(2), 1423-1426.

[225] Vuong, Q.H. (1989) Likelihood Ratio Tests for Model Selection and Non- Nested Hypotheses Econometrica, 57(2), 307-333.

[226] Wald, A. (1949). Note on the Consistency of the Maximum Likelihood Esti- mate. Ann. Math. Stat., 20(4), 595-601.

[227] Wang, J. and Serfling, R. (2006). Influence Functions for a General Class of Depth-Based Generalized Quantile Functions. to appear.

[228] Wang, J. and Serfling, R. (2005). Nonparametric Multivariate Kurtosis and Tailweight Measures J. Nonparametric Stat., 17(4), 441-456.

[229] Wasserman, L. (2000). Bayesian Model Selection and Model Averaging. J Math. Psy., 44(1) , 92-107.

[230] Werner, M. (2003). Identification of Multivariate Outliers in Large Data Sets. Ph.D dissertation, U. Colorado at Denver.

[231] White, H. (1982). Maximum Likelihood Estimation of Misspecified Models. Econometrica, 50(1), 1-25.

[232] Wilks, (1963). Multivariate Statistical Outliers. Sankhy˜aA, 25, 405-426.

[233] Xu, J., Lin, X., and Zhou, X. (2004). Space Efficient Quantile Summary for Constrained Sliding Windows on a Data Stream. Proc. WAIM, LNCS 3129, 34-44.

[234] Yang, Y. and Barron, A.R. (1998). An Asymptotic Property of Model Selec- toin Criteria. IEEE Trans. Info. Theo., 44(1), 95-116.

[235] York, D.G. et.al. (2000). The Sloan Digital Sky Survey: Technical Summary. AJ, 120, 1579-1587.

[236] Zani, S. et.al. (1998). Robust Bivariate Boxplots and Multiple Outlier De- tection. Comp. Stat. Data Anal., 28, 257-270.

[237] Zhang, P. (1993). On The Convergence Rate of Model Selection Criteria. Commun. Statist.-Theory Meth., 22(10), 2765-2775.

[238] Zuo, Y. et.al. (2004). On the Stahel-Donoho Estimator and Depth-Weighted Means of Multivariate Data. Ann. Stat., 32(1), 167-188.

[239] Zuo, Y. and Cui, H. (2004a). Statistical Depth Functions and Some Appli- cations. Ad. Math (China), 33(1), 1-26.

[240] Zuo, Y. (2003). Projection Based Depth Functions and Associated Medians. Ann. Stat., 31(5), 1460-1490.

[241] Zuo, Y. and Serfling, R. (2000a). General Notions of Statistical Depth Func- tions. Ann. Stat., 28(2), 461-482.

[242] Zuo, Y. and Serfling, R. (2000b). On the Performance of Some Robust Non- parametric Location Measures Relative to a General Notion of Multivariate Symmetry. J. Stat. Plan. Infer., 84(1), 55-79.

[243] Zuo, Y. and Serfling, R. (2000c). Structural Properties and Convergence Results for Contours of Sample Statistical Depth Functions. Ann. Stat., 28(2), 483-499.

Vita

Hyunsook Lee

Hyunsook Lee was born in Daejon, Korea on March 15, 1975. She received her Bachelor's and Master's degrees in astronomy from Seoul National University in February 1997 and August 1999, respectively. She then enrolled at the Pennsylvania State University to pursue a Ph.D. in statistics and received her Master's degree in statistics in May 2004. She is currently a Ph.D. candidate.