The Pennsylvania State University
The Graduate School
TWO TOPICS: A JACKKNIFE MAXIMUM LIKELIHOOD
APPROACH TO STATISTICAL MODEL SELECTION AND A
CONVEX HULL PEELING DEPTH APPROACH TO
NONPARAMETRIC MASSIVE MULTIVARIATE DATA ANALYSIS
WITH APPLICATIONS
A Thesis in Statistics by Hyunsook Lee
© 2006 Hyunsook Lee
Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
August 2006
The thesis of Hyunsook Lee was reviewed and approved∗ by the following:
G. Jogesh Babu Professor of Statistics Thesis Advisor, Chair of Committee
Thomas P. Hettmansperger Professor of Statistics
Bing Li Professor of Statistics
W. Kenneth Jenkins Professor of Electrical Engineering
James L. Rosenberger Professor of Statistics Head of Statistics Department
∗Signatures are on file in the Graduate School.
Abstract
This dissertation presents two topics from opposite disciplines: one is from a parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection, and the second is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications on massive astronomical data. First, we present a model selection criterion that minimizes the Kullback-Leibler distance by using the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model selection problems, arises from estimating the distance between an unknown true model and an estimated model. We show that (a) the jackknife maximum likelihood estimator is consistent for the parameter of interest, (b) the jackknife estimate of the log likelihood is asymptotically unbiased, and (c) the stochastic order of the jackknife log likelihood estimate is O(log log n). Because of these properties, the jackknife information criterion is applicable to problems of choosing a model from non-nested candidates, especially when the true model is unknown. Compared to popular information criteria, which are only applicable to nested models, the jackknife information criterion is more robust in terms of filtering various types of candidate models to choose the best approximating model. However, this robust method has the demerit that the jackknife criterion is unable to discriminate nested models. Next, we explore the convex hull peeling process to develop empirical tools for statistical inferences on multivariate massive data. The convex hull and its peeling
process has intuitive appeal for robust location estimation. We define the convex hull peeling depth, which enables ordering of multivariate data. This ordering process based on data depth provides ways to obtain multivariate quantiles, including the median. Based on the generalized quantile process, we define a convex hull peeling central region, a convex hull level set, and a volume functional, which lead us to one dimensional mappings describing the shapes of multivariate distributions along data depth. We define empirical skewness and kurtosis measures based on the convex hull peeling process. In addition to these empirical descriptive statistics, we present a few methodologies for finding multivariate outliers in massive data sets. These outlier detection algorithms are (1) estimating multivariate quantiles up to the level α, (2) detecting changes in a measure sequence of convex hull level sets, and (3) constructing a balloon to exclude outliers. The convex hull peeling depth is a robust estimator, so the existence of outliers does not affect the properties of inner convex hull level sets. Overall, we demonstrate all these good characteristics of the convex hull peeling process through bivariate synthetic data sets that illustrate the procedures. We show that these empirical procedures are applicable to real massive data sets by employing Quasars and galaxies from the Sloan Digital Sky Survey. Interesting scientific results from the convex hull peeling process on these massive data sets are also provided.
KEYWORDS: Kullback-Leibler distance, information criterion, statistical model selection, jackknife, jackknife information criterion (JIC), bias reduction, maximum likelihood, unbiased estimation, non-nested model, computational geometry, convex hull, convex hull peeling, statistical data depth, nonparametric statistics, generalized quantile process, descriptive statistics, skewness, kurtosis, volume functional, convex hull level set, massive data, outlier detection, balloon plot, multivariate analysis
Table of Contents
List of Figures
List of Tables
List of Algorithms
Acknowledgments
Chapter 1  Introduction: Motivations
  1.1 Model Selection
  1.2 Convex Hull Peeling
  1.3 Synopsis
Chapter 2  Literature Reviews
  2.1 Introduction
  2.2 Model Selection Under Maximum Likelihood Principle
    2.2.1 Kullback-Leibler Distance
    2.2.2 Preliminaries
    2.2.3 Akaike Information Criterion
    2.2.4 Bayes Information Criterion
    2.2.5 Minimum Description Length
    2.2.6 Resampling in Model Selection
    2.2.7 Other Criteria and Comments
  2.3 Convex Hull and Algorithms
    2.3.1 Definitions
    2.3.2 Convex Hull Algorithms
    2.3.3 Other Algorithms and Comments
  2.4 Data Depth
    2.4.1 Ordering Multivariate Data
    2.4.2 Data Depth Functions
    2.4.3 Statistical Data Depth
    2.4.4 Multivariate Symmetric Distributions
    2.4.5 General Structures for Statistical Depth
    2.4.6 Depth of a Single Point
    2.4.7 Comments
  2.5 Descriptive Statistics of Multivariate Distributions
    2.5.1 Multivariate Median
    2.5.2 Location
    2.5.3 Scatter
    2.5.4 Skewness
    2.5.5 Kurtosis
    2.5.6 Comments
  2.6 Outliers
    2.6.1 What Are Outliers?
    2.6.2 Where Do Outliers Occur?
    2.6.3 Outliers and Normal Distributions
    2.6.4 Nonparametric Outlier Detection
    2.6.5 Convex Hull vs. Outliers
    2.6.6 Comments: Outliers in Massive Data
  2.7 Streaming Data
    2.7.1 Univariate Sequential Quantile Estimation
    2.7.2 ε-Approximation
    2.7.3 Sliding Windows
    2.7.4 Streaming Data in Astronomy
    2.7.5 Comments
Chapter 3  Jackknife Approach to Model Selection
  3.1 Introduction
  3.2 Preliminaries
  3.3 Jackknife Maximum Likelihood Estimator
  3.4 Jackknife Information Criterion (JIC)
  3.5 When Candidate Models are Nested
  3.6 Discussions
Chapter 4  Convex Hull Peeling Depth
  4.1 Introduction
  4.2 Preliminaries
  4.3 Statistical Depth
  4.4 Convex Hull Peeling Functionals
    4.4.1 Center of Data Cloud: Median
    4.4.2 Location Functional
    4.4.3 Scale Functional
    4.4.4 Skewness Functional
    4.4.5 Kurtosis Functional
  4.5 Simulation Studies on CHPD
  4.6 Robustness of CHPM - Finite Sample Relative Efficiency Study
  4.7 Statistical Outliers
  4.8 Discussions
Chapter 5  Convex Hull Peeling Algorithms
  5.1 Introduction
  5.2 Data Description: SDSS DR4
  5.3 Multivariate Median
  5.4 Quantile Estimation
  5.5 Density Estimation
  5.6 Outlier Detection
    5.6.1 Quantile Approach
    5.6.2 Detecting Shape Changes Caused by Outliers
    5.6.3 Balloon Plot
  5.7 Estimating the Shapes of Multidimensional Distributions
    5.7.1 Skewness
    5.7.2 Kurtosis
  5.8 Discoveries from Convex Hull Peeling Analysis on SDSS
  5.9 Summary and Conclusions
Chapter 6  Concluding Remarks
  6.1 Summary and Discussions
    6.1.1 Jackknife Model Selection
    6.1.2 Unspecified True Model and Non-Nested Candidate Models
    6.1.3 Convex Hull Peeling Process
    6.1.4 Multivariate Descriptive Statistics
    6.1.5 Multivariate Outliers
    6.1.6 Color Distributions of Extragalactic Populations
  6.2 Future Works
    6.2.1 Clustering with Convex Hull Peeling
    6.2.2 Voronoi Tessellation
    6.2.3 Additional Methodologies for Outlier Detection
    6.2.4 Testing Similarity of Distributions
    6.2.5 Breakdown Point
    6.2.6 CHPD Is Not Continuous
    6.2.7 Sloan Digital Sky Survey Data Release Five
    6.2.8 Additional Topics
Bibliography
List of Figures
3.1 An illustration of model parameter space: g(x) is the unknown true model and ΘA, ΘB, Θ1, and Θ2 are parameter spaces of different models. Note that Θ1 ⊂ Θ2 and θg is the parameter of the model that minimizes the KL distance more than the parameters of the other models (θa, θb, and θ1). The black solid line and the blue dashed line indicate Θ1, of which detail will be given in the text.
4.1 Bivariate samples of different shapes put into the convex hull peeling process. Clockwise, from the upper left panel, samples of size 1000 were generated from a bivariate uniform distribution on a square, a bivariate uniform distribution on a triangle, a bivariate Cauchy distribution with correlation 0.5, and a bivariate normal distribution with correlation 0.5.
5.1 The sample size is 10^6 from a standard bivariate normal distribution and the target quantiles are {0.99, 0.95, 0.9, ..., 0.1, 0.05, 0.01}. For the sequential method, we took 1000 data points at a time and estimated quantiles. Each quantile has 1000 corresponding convex hull peeling level sets, which are combined and peeled until about half of the data points are left. Blue indicates quantiles of the non-sequential method and red indicates quantiles of the sequential method.
5.2 The sample size of both examples is 10^4 from a standard bivariate normal distribution and a bivariate t5 distribution with ρ = −0.5. The target quantiles are {0.99, 0.95, 0.9, ..., 0.1, 0.05, 0.01}. Densities are estimated by means of eq. (5.2), from which the empirical density profile is built.
5.3 Samples of size 2000 are generated. The primary distribution is the standard bivariate normal, but the panels of the second column show 1% contamination by a standard bivariate normal distribution centered at (4,4). As the convex hull peeling process proceeds, the area of each convex hull layer is calculated.
5.4 Same samples as in Figure 5.3. The blue line indicates the IQR and the red line indicates the balloon expanding the IQR by 1.5 times lengthwise. Points outside this balloon are considered to be outliers.
5.5 Three different bivariate samples are compared in terms of the skewness measure Rj. The sample of the first column was generated from a standard bivariate normal distribution; the sample of the second column from a bivariate normal distribution with correlation 0.5; the sample of the third column was generated from (X, Y) ∼ (N(0, 1), χ²_10), where X and Y are independent.
5.6 Tailweight is calculated based on eq. (5.3). Samples of size 2000 are generated from a uniform, normal, t10, and Cauchy distribution.
5.7 Volume sequence of convex hull peeling central regions from 70204 Quasars in four-dimensional color magnitude space.
5.8 Volume sequence of convex hull peeling central regions from 499043 galaxies in four-dimensional color magnitude space.
5.9 Sequence of the Rj skewness measure from 70204 Quasars in four-dimensional color magnitude space.
5.10 Sequence of the Rj skewness measure from 499043 galaxies in four-dimensional color magnitude space.
List of Tables
4.1 Empirical mean square error and relative efficiency
5.1 Comparison among the mean, coordinate-wise median, and convex hull peeling median of synthetic data sets of size 10^4 and 10^6 from the standard bivariate normal distribution. For the larger sample, we apply the sequential CHPM method.
5.2 Different center locations of the Quasar and galaxy distributions in the space of color magnitudes based on the mean, coordinate-wise median, CHPM, and sequential CHPM. The sequential method has adopted d = 0.05 and m = 10^4.
List of Algorithms
1 Graham scan algorithm
2 Gift wrapping algorithm
3 Beneath-and-beyond algorithm
4 Divide-and-conquer algorithm
5 Incremental algorithm
6 Quickhull algorithm
7 Convex hull peeling median I
8 Sequential convex hull peeling median I
9 Sequential convex hull peeling median II
10 pth convex hull peeling quantile
11 Sequential pth convex hull peeling quantile
Acknowledgments
One baby step of a lifelong journey is about to finish. I was not alone when I made this step. There are people who have constantly encouraged me. Among them, above all, I am very thankful to my advisor Prof. Babu, who has been very patient with me. He gave me the great freedom no graduate student can imagine to explore the sea of statistics, which led me to confront a few barren islands. A small part of this expedition is recorded in this dissertation. I am also very grateful for his brilliant advice and precise comments, which have opened my eyes to the endless possibilities in statistics that I would like to pursue in the years to come. I am very much indebted to my committee members, Professors Hettmansperger and Li, for their advice and support, and to my minor and external committee member, Prof. Jenkins, who offered the most interesting course, Adaptive filter, and inspired my creative thinking. In addition to their support, Prof. Li showed me the beauty of asymptopia through his course, and Prof. Hettmansperger provided very stimulating and encouraging conversations about my research and classical music. Also, I would like to thank Prof. Rosenberger, who was supportive in every aspect as department head; Dr. van den Berk, who showed me how to use SQL for acquiring data from the Sloan Digital Sky Survey; Prof. Rao, whose encouragement and interest in my work helped me to last until my thesis defense; Prof. Harkness, who has encouraged me with his warm heart; and the people at SAMSI, who gave me hope that there is something I can contribute to both the astronomical and statistical communities. Among the people I met at SAMSI, I am very thankful to Pablo, who acknowledged my ability and helped me rebuild the self-esteem I had lost for years. There are also many good friends who deserve more than a few words of gratitude. Without my parents' understanding, patience, and love, I would never have made it through this step. No words can express my appreciation of their love enough. I have never failed
to thank God for having the best parents in the world, who always stand by my side and believe in me.
Among my dear music pieces, Ich habe genug (BWV 82) is flowing in my mind. J.S. Bach, a humble servant to God, transferred his pious mind into this beautiful music. His piety resides in me, whispering to me to be humble and awake for the future. I wish that God sees my dissertation as my gratitude to Him and knows that I am ready for the next step.
Dedication
To my mother, who has taught me women’s sacrifice.
Chapter 1
Introduction: Motivations
This dissertation contains two topics. The first topic is about parametric model selection with the delete-one jackknife method under the maximum likelihood principle. The second topic focuses on statistical properties and algorithms of the convex hull peeling process as a nonparametric multivariate data analysis method. Reviews and general introductions related to these topics will be given in later chapters. In this chapter, the motivations for each subject are discussed.
1.1 Model Selection
One of the challenges in statistical model selection is assessing candidate models when the true data generating model is unspecified. Even though the problem of the unspecified true model is well recognized, in practice traditional information criteria such as AIC, BIC, and their modifications are automatically applied to real problems as if the true model were identified. The misuse of these traditional model selection criteria likely leads to an improper choice of models.

Quite a number of studies have presented the problems of the unspecified or misspecified true model; for example, Huber, 1967[97]; Jennrich, 1969[109]; Nishii, 1988[155]; Render, 1981[169]; Sen, 1979[195]; Sin and White, 1996[210]; Vuong, 1989[225]; and White, 1982[231]. These studies focused on the idea that the model estimation procedure ultimately yields a best approximating model to the unknown true model under various circumstances. However, there has been no previous study of model selection based on minimizing the Kullback-Leibler distance (Kullback and Leibler, 1951) when the candidates are not nested and the true model is unspecified.

When the true model is known, the Kullback-Leibler distance between the true model and a candidate model is estimable. Developing a proper estimation method to minimize this distance has been popular in model selection studies. In general, estimating the Kullback-Leibler distance carries an estimation bias, so the penalty terms beside the likelihood of the candidate model in the equations of well known selection criteria are byproducts of this estimation bias. Bozdogan (1987[34]) criticized the generalization of those penalty terms under the unspecified true model condition.

The status of the true model is very ambiguous. Even though the true model is unspecified, by accident one of the candidate models can coincide with the true model. At least, we might consider one of the candidate models as the best model when it is closer to the unknown true model than the other models in terms of the Kullback-Leibler distance. A model minimizing the distance from the unspecified true model describes the true model at least as well as the other candidates. Note that these candidate models need not be nested in the search for a best model. The parameter space of each model can be different, and we investigate these parameter spaces to find a model that minimizes the distance from the true model. However, bias arises from estimating the distance, since the parameters of a model and the distance are estimated under this unspecified true model. Only when the true model is known can this bias be assessed, which is very unlikely to occur in reality. Therefore, instead of estimating the bias of the Kullback-Leibler distance under an unknown true model, we choose a bias reducing technique to find a model that minimizes the Kullback-Leibler distance.

The jackknife resampling method has been known for its bias reduction property; however, it has lost its fame to other popular resampling schemes such as the bootstrap and cross-validation. Because of this relative unpopularity, no previous application of the jackknife method to developing a model selection criterion, especially a criterion minimizing the Kullback-Leibler distance, can be found. We expect that a candidate model minimizing the Kullback-Leibler distance can be chosen without explicitly estimating the bias, since the jackknife method will reduce it.
The jackknife method will provide the bias-reduced maximum likelihood estimator of a model of minimum distance. The procedure for assessing the bias-reduced maximum likelihood based on the jackknife method will be presented.
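To make the bias reduction idea concrete before the formal development in Chapter 3, the following is a minimal sketch of a generic delete-one jackknife correction of a model's average log likelihood; the Gaussian candidate family, the use of scipy.stats.norm for the maximum likelihood fit, and the function name are illustrative assumptions, and the sketch is not the JIC defined later.

```python
import numpy as np
from scipy.stats import norm

def jackknife_loglik(x):
    """Generic delete-one jackknife of the average log likelihood of a
    Gaussian candidate model fitted by maximum likelihood (a sketch,
    not the JIC of Chapter 3)."""
    n = len(x)
    mu, sigma = norm.fit(x)                        # full-sample MLE
    theta_full = norm.logpdf(x, mu, sigma).mean()  # full-sample statistic
    theta_loo = np.empty(n)
    for i in range(n):
        xi = np.delete(x, i)
        mu_i, sigma_i = norm.fit(xi)               # leave-one-out MLE
        theta_loo[i] = norm.logpdf(xi, mu_i, sigma_i).mean()
    # Standard jackknife bias correction of the plug-in statistic.
    return n * theta_full - (n - 1) * theta_loo.mean()

rng = np.random.default_rng(0)
print(jackknife_loglik(rng.normal(size=200)))
```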
1.2 Convex Hull Peeling
The notion of convex hull peeling as a tool for a location estimator (Tukey, 1974) has been recognized in the statistical community for more than three decades. Until Zuo and Serfling's (2000a) discussion of statistical data depth, however, the convex hull peeling process was hardly used for statistical inference. In Barnett's extensive study (1976) on multivariate data, convex hull peeling was introduced to resolve the complexity of ordering multivariate data. From time to time the convex hull of a given sample has been used for robust estimation and outlier detection. Compared to its long history, relatively little work has been done regarding convex hull peeling.

Compared to other statistical depths, such as the simplicial depth or the half-space depth, the convex hull peeling depth does not have any distribution theory. The technicality involved in the development of limiting distributions of convex hull peeling estimators is quite difficult. Even when samples are generated from a continuous distribution, no continuous functional of convex hull peeling estimators has been reported, which prevents the convex hull peeling process from being adapted for statistical inference. The main advantage of using the convex hull peeling depth is its intuitive appeal. The convex hull peeling depth is rather intuitive and straightforward compared to other data depths. This characteristic of the convex hull peeling depth will be illustrated in later chapters with synthetic data and massive astronomical data from the Sloan Digital Sky Survey. The picturesque property of the convex hull peeling depth has inspired various simulation studies: checking whether the convex hull peeling process measures scatter, skewness, and kurtosis of a distribution; testing whether the convex hull peeling process catches outliers; and checking whether the convex hull peeling process delineates the density profile. These empirical studies are mostly related to finding descriptive statistics and developing exploratory data analysis tools for multivariate distributions.

Since the convex hull peeling depth is defined by a counting process, linking every point with a depth value is simple. In addition, estimating a quantile is just collecting points of the same depth. The contour formed by these points of the same depth is convex, and its volume is easily calculated. Because estimating a quantile leads directly to estimating the volume of the corresponding quantile region, the set of points of the same depth collected through the convex hull peeling process makes it possible to deduce the density profile of a given sample. Also, this volume functional of random points of the same depth provides a mapping from a multi-dimensional quantile to a single value, so that descriptive statistics in the form of one dimensional curves mapped from multivariate data are born from the convex hull peeling process. Additionally, quantile estimation based on the convex hull peeling process can incorporate streaming data algorithms. This flexibility of data management is extremely helpful, especially when massive multivariate data need to be summarized while the memory space for data processing is limited. Sequential quantile estimates based on streaming algorithms and regular quantile estimates do not show much discrepancy. Therefore, we would like to develop space-efficient multivariate descriptive statistics based on an empirical convex hull peeling process.
In conclusion, the convex hull peeling process can easily be transformed into various empirical descriptive statistics as nonparametric tools for multivariate massive data analysis.
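As an illustration of the counting process just described, the following is a minimal sketch of convex hull peeling built on scipy.spatial.ConvexHull; the function name and the convention that depth equals the index of the peeled layer are assumptions for illustration, not the formal definitions of Chapter 4.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_peeling_depths(points):
    """Assign each point the index of the convex hull layer on which it
    lies: layer 0 is the outermost hull, deeper layers get larger indices.
    A sketch for d-dimensional point clouds; degenerate leftovers with
    fewer than d+1 points are handled crudely."""
    pts = np.asarray(points, dtype=float)
    depth = np.full(len(pts), -1)
    remaining = np.arange(len(pts))
    layer = 0
    while len(remaining) > pts.shape[1]:       # need > d points for a hull
        hull = ConvexHull(pts[remaining])
        on_hull = remaining[hull.vertices]     # extreme points of this layer
        depth[on_hull] = layer
        remaining = np.setdiff1d(remaining, on_hull)
        layer += 1
    depth[remaining] = layer                   # innermost leftovers
    return depth

rng = np.random.default_rng(1)
sample = rng.normal(size=(1000, 2))
d = convex_hull_peeling_depths(sample)
print("number of layers:", d.max() + 1)
```

Points of equal depth form the convex contours mentioned above, and their hull volumes give the volume functional used later for density and shape estimation.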
1.3 Synopsis
The remainder of this dissertation is organized as follows. In Chapter 2, literature reviews relevant to the work performed in this dissertation are presented, including the Kullback-Leibler distance, the Akaike information criterion and its variations, the Bayes information criterion, model selection with data resampling methods, convex hull algorithms, statistical depths and their characteristics, multivariate descriptive statistics (median, scatter, skewness, and kurtosis), outliers, and massive streaming data. Chapter 3 discusses model selection with the jackknife method based on the maximum likelihood principle. We prove the consistency of the jackknife maximum likelihood estimator and the asymptotic unbiasedness of the jackknife likelihood. Chapter 4 addresses the convex hull peeling depth in various aspects, beginning with its definition, and proposes associated tools for multivariate nonparametric descriptive statistics.
Chapter 5 introduces various methods illustrating applications of the convex hull peeling process. Further emphasis is given to actual applications to multivariate massive data from an astronomical survey. Finally, the last chapter provides a summary of our studies with concluding remarks, additional work that may follow from this dissertation, and future research suggestions relevant to our current studies.

Chapter 2
Literature Reviews
2.1 Introduction
This chapter presents highlights of some of the research that has guided the studies performed in this thesis. It is divided into two parts: parametric model selection with a resampling method and nonparametric multivariate analysis with computational algorithms. First, some well known information criteria from the maximum likelihood principle are discussed, in addition to model selection criteria based on the bootstrap and cross validation. Then, reviews of nonparametric multivariate analysis follow. Since our primary concern is convex hull peeling for analyzing multivariate data, we present additional discussions of convex sets and convex hull algorithms. We also explain data depth, which allows ordering multivariate data. The possibility of ordering multivariate data leads to nonparametric multivariate location, quantile, and distributional shape estimation, and to outlier detection. Last, due to our interest in massive data and quick analysis tools for them, tools for analyzing streaming data are addressed.
2.2 Model Selection Under Maximum Likelihood Principle
Model selection is a very broad subject in statistics, ranging from variable selection to functional space classification. Model selection problems are found in disciplines other than statistics and have received attention because of their broad applications. Here, we confine our discussion of the statistical model selection framework to the maximum likelihood principle, which is connected to Boltzmann's maximum entropy (1877[32]) and Shannon's information theory (1948[201]). These approaches from physics and information theory are related to the Kullback-Leibler distance, which will be discussed later. Diverse discussions of model selection from a statistical point of view are provided in a manuscript edited by Lahiri (2001[120]) and in books by Kuha (2004[118]), Linhart and Zucchini (1986[126]), McQuarrie and Tsai (1998[145]), and Burnham and Anderson (2002[39]).
2.2.1 Kullback-Leibler Distance
The Kullback-Leibler (KL) distance (Kullback and Leibler, 1951[119]), also called the relative entropy (Cover and Thomas, 1991[55]), is a measure of distance between two distributions. Consider samples X1, X2, ..., Xn drawn from an unknown distribution G whose density function is denoted by g, and let F be the distribution of interest, whose density function is f. Then the KL distance is
D_KL[g; f] = ∫ g(x) log g(x) dx − ∫ g(x) log f(x) dx = E_g[log(g(x)/f(x))],   (2.1)

the discrepancy of assuming F when the true distribution is G. The objective of employing the KL distance in the model selection problem is finding a model of minimum D_KL. Note that the first term on the right-hand side of eq. (2.1) is a constant. Searching for f maximizing the second term,
L_G = ∫ log f(x) dG(x),   (2.2)

plays a key role in model selection. Among the many model selection criteria reflecting KL distance minimization, the Akaike information criterion (AIC, Akaike, 1973[7]), minimum description length (MDL, Rissanen, 1978[176]), and the Bayesian information criterion (BIC, Schwarz, 1978[192]) are the most popular criteria for choosing the best approximating model. AIC is the result of approximating the bias of maximum likelihood estimation, MDL maximizes the entropy of binary coding models, and BIC maximizes the posterior distribution.
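To make eqs. (2.1) and (2.2) concrete, the following is a minimal Monte Carlo sketch that approximates L_G and the KL distance when the true density g is known; the Gaussian choices for g and for the two candidate models f are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # draws from the "true" g

# L_G = E_g[log f(X)] approximated by a sample average (eq. 2.2);
# the candidate with the larger L_G has the smaller KL distance (eq. 2.1).
candidates = {
    "N(0, 1)": norm(0.0, 1.0),
    "N(1, 2)": norm(1.0, 2.0),
}
entropy_term = norm(1.0, 2.0).logpdf(x).mean()      # E_g[log g(X)], a constant
for name, f in candidates.items():
    L_G = f.logpdf(x).mean()
    print(f"{name}: L_G = {L_G:.3f}, KL(g, f) = {entropy_term - L_G:.3f}")
```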
2.2.2 Preliminaries
In practice, model selection is performed on a set of candidate models whose functional characteristics are known. In other words, we can adopt candidate models with known properties for statistical model selection. Such candidate models are subject to regularity conditions so that we can develop model selection methods. Prior to addressing the model selection information criteria, we present some regularity conditions.
Consider an iid sample X1, ..., Xn from an unknown distribution G. Let {f_θ : θ ∈ Θ^p ⊂ R^p} be a family of distributions. Denote the log likelihood of Xi by l_i(θ) = log f_θ(Xi) for i = 1, ..., n, and let L(θ) = Σ_{i=1}^n l_i(θ).

Assumptions:

(A1) The parameter space Θ^p is a p-dimensional space, Θ^p ⊂ R^p, such that l_i(θ) is distinct for each θ ∈ Θ^p; i.e., if θ1 ≠ θ2, then l_i(θ1) ≠ l_i(θ2).

(A2) There exists θ_o in the interior of Θ^p which is the unique solution to E_g[(∂/∂θ)L(θ)] = 0, and such that sup_{θ∈Θ^p : ||θ−θ_o||>ε} (L(θ) − L(θ_o)) diverges to −∞ a.s. for any ε > 0.

(A3) Within a neighborhood of θ_o, the first and second derivatives of l_i(θ) with respect to θ, (∂/∂θ)l_i(θ) and (∂²/∂θ∂θ^T)l_i(θ), are well defined and bounded, i.e. |(∂/∂θ_l)l_i(θ)| < h(x) and |(∂²/∂θ_l∂θ_m)l_i(θ)| < h(x) for all x, where l, m = 1, ..., p. Here, (∂/∂θ_l)l_i(θ) is the lth component of the gradient vector and −(∂²/∂θ_l∂θ_m)l_i(θ) is the (l, m)th component of the Hessian matrix.

(A4) For ψ_θ(x) = (∂/∂θ)l_i(θ) or ψ_θ(x) = (∂²/∂θ∂θ^T)l_i(θ),

(∂/∂θ) ∫ ψ_θ(x) dx = ∫ (∂/∂θ) ψ_θ(x) dx.

(A5) For θ ∈ Θ^p, −E_g[(∂²/∂θ∂θ^T)l_i(θ)] and E_g[(∂/∂θ)l_i(θ)(∂/∂θ^T)l_i(θ)] are positive definite matrices.
In addition, the following lemma will be used quite often in this study.
Lemma 2.1 (Taylor Expansion). For any function h(θ, ·) which is twice differentiable in the neighborhood of θ_o,
h(θ, ·) = h(θ_o, ·) + (∂/∂θ)h(θ_o, ·)(θ − θ_o) + (1/2)(θ − θ_o)^T (∂²/∂θ∂θ^T)h(η, ·)(θ − θ_o),

where η satisfies η = (1 − r)θ + rθ_o for 0 < r < 1.

2.2.3 Akaike Information Criterion

Akaike (1973[7], 1974[8]) first introduced the model selection criterion that chooses a model of the smallest KL distance. His criterion is based on estimating eq. (2.2) and choosing the model with the maximum estimate of eq. (2.2). This expected log likelihood, however, depends on the unknown true distribution G. The candidate parametric model has an unknown parameter θ that is replaced by some estimate θ̂. Among many parameter estimation methods, Akaike used the maximum likelihood estimator θ̂_ML, which maximizes the log likelihood. Note that this maximum likelihood estimation enforces another condition, namely that the true model g belongs to the set of candidate models, so that θ̂_ML is a consistent estimator of θ_o. Still, we need to estimate G to get L_G in (2.2). We estimate L_G empirically by the empirical distribution Ĝ of G:

L_Ĝ(θ̂_ML) = ∫ log f_{θ̂_ML}(x) dĜ(x) = (1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi).   (2.3)

The empirical log likelihood usually overestimates the expected log likelihood (Konishi and Kitagawa, 1996[116]). This excess is the bias,

b_G = E_g[L_Ĝ(θ̂_ML) − L_G(θ̂_ML)] = E_g[(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − ∫ log f_{θ̂_ML}(x) g(x) dx].   (2.4)

The bias-corrected log likelihood becomes

(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − b̂_G,   (2.5)

where b̂_G is the estimate of the bias b_G. In general, a model that minimizes the Kullback-Leibler distance (2.1) or maximizes the bias-corrected log likelihood (2.5) (Konishi, 1999[115]) is chosen as the best approximating model. The Akaike type information criterion (IC) from a sample of size n is defined as follows:

IC(X_n; θ̂_ML) = −2n{(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − b̂_G} = −2 Σ_{i=1}^n log f_{θ̂_ML}(Xi) + 2n b̂_G,   (2.6)

where b̂_G and θ̂_ML are estimators of b_G and θ_o, respectively, and various estimators exist under various conditions (e.g. Akaike, 1973; Hurvich and Tsai, 1989[104]; Konishi and Kitagawa, 2003[114]; and Konishi and Kitagawa, 1996[116]).

Akaike chose θ̂_ML to replace θ_o and then estimated the bias as follows. Under assumptions (A1)-(A5), the bias in eq. (2.6) can be estimated along with the model parameters. From eq. (2.4),

b_G = E_g[L_Ĝ(θ̂_ML) − L_G(θ̂_ML)] = E_g[(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − ∫ log f_{θ̂_ML}(x) g(x) dx] = (1/n) tr[J_g^{−1} I_g] + O(n^{−2}),

where J_g and I_g are p × p matrices defined by

J_g = −E_g[∂² log f_θ(Xi)/∂θ∂θ^T]_{θ=θ̂_ML}   and   I_g = E_g[(∂ log f_θ(Xi)/∂θ)(∂ log f_θ(Xi)/∂θ^T)]_{θ=θ̂_ML}

(Konishi, 1999[115]). The information criterion defined in eq. (2.6) becomes

TIC = −2 Σ_{i=1}^n log f_{θ̂_ML}(Xi) + 2 tr[Ĵ^{−1} Î]   (2.7)

by replacing J_g and I_g with their estimators, Ĵ and Î. This information criterion (TIC) was addressed in Takeuchi (1976[218]), which was written in Japanese and to which Kitagawa (1997[111]) and Shibata (1989[208]) referred. If the parametric family of distributions contains the true distribution, that is, G = F_{θ_o} (i.e. g = f_{θ_o}) for some θ_o in Θ^p, then tr[J(F_{θ_o})^{−1} I(F_{θ_o})] = p, since

∫ (∂ log f_θ(x)/∂θ)(∂ log f_θ(x)/∂θ^T) f_θ(x) dx = −∫ (∂² log f_θ(x)/∂θ∂θ^T) f_θ(x) dx.

Thus the KL distance based information criterion is simplified to AIC,

AIC = −2 Σ_{i=1}^n log f(Xi|θ̂_ML) + 2p = −2L(θ̂_ML) + 2p.   (2.8)

AIC is well known for its simplicity. Under the regularity conditions (A1)-(A5), AIC is defined as an estimate of minus twice the expected log likelihood of a specific model whose distribution family contains the true model.
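As a concrete illustration of eq. (2.8), the following is a minimal sketch that computes AIC for two non-nested candidate families fitted by maximum likelihood; the gamma and lognormal candidates and the scipy.stats fitting calls are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gamma, lognorm

def aic(dist, data, n_params, **fit_kwargs):
    """AIC = -2 * maximized log likelihood + 2p (eq. 2.8)."""
    params = dist.fit(data, **fit_kwargs)
    loglik = dist.logpdf(data, *params).sum()
    return -2.0 * loglik + 2.0 * n_params

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=500)

# Two non-nested candidates; the smaller AIC is preferred.
print("gamma    :", aic(gamma, data, n_params=2, floc=0))
print("lognormal:", aic(lognorm, data, n_params=2, floc=0))
```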
By the maximum likelihood principle, AIC takes a very simple form based on the sample log likelihood and the number of parameters. Nonetheless, drawbacks of AIC such as overfitting and inconsistency have been reported (Hurvich and Tsai, 1989[104]; McQuarrie et al., 1997[146]). Quite a few modifications have been developed under certain circumstances. Some of them are discussed below.

Consider a regression model

y = Xβ + ε,   (2.9)

where X is an n × p design matrix and β is a p × 1 vector of unknown parameters, β = (β1, ..., βp). Respectively, p and n indicate the number of predicting variables and the sample size. The errors in the model, ε = (ε1, ..., εn)^T, are iid random variables such that ε ∼ N(0, σ²I). The log likelihood of this model is

L(β, σ²) = −(1/2) n log σ² − (y − Xβ)^T(y − Xβ)/(2σ²).

The maximum likelihood estimates are β̂ = (X^T X)^{−1} X^T y and σ̂² = (y − Xβ̂)^T(y − Xβ̂)/n. Suppose that the true model of (2.9) is

y = X_o β_o + ε_o.   (2.10)

In eq. (2.10), X_o is an n × p_o design matrix, β_o is (β_{1,o}, ..., β_{p_o,o}) with p_o parameters, and ε_o = (ε_{o1}, ..., ε_{on})^T are iid normal random variables with mean zero and variance σ_o²I. Let the vectors of parameters be θ = (β, σ) and θ_o = (β_o, σ_o). The measure of eq. (2.1) between the true model (2.10) and a candidate model (2.9) is given by

D_KL[θ_o; θ] = 2E_o[L(β_o, σ_o²) − L(β, σ²)]
            = n(log σ² − log σ_o²) + nσ_o²/σ² + (X_oβ_o − Xβ)^T(X_oβ_o − Xβ)/σ² − n
            = n{σ_o²/σ² − log(σ_o²/σ²) − 1} + ||X_oβ_o − Xβ||²/σ².   (2.11)

The property of an approximating model that minimizes E_o[L(β_o, σ_o²) − L(β, σ²)] does not change when it is multiplied by the positive constant 2 for defining an information criterion.

Hurvich and Tsai (1989[104]) derived the corrected AIC from the above regression model (2.9), especially when the true model is nested in the fitted models. A model is nested when it is a subset of the other model. For the above linear regression model, nesting is equivalent to {β_o} ⊂ {β}, where {β_o} is the parameter space of β_o and {β} is the parameter space of β. The model selection criterion from eq. (2.11) turns out to minimize the following:

IC = E_o[n log(σ̂²/σ_o²) + nσ_o²/σ̂² + ||X_oβ_o − Xβ̂||²/σ̂²]
   = E_o[n log σ̂²] + n²/(n − p − 2) + np/(n − p − 2)
   = E_o[n log σ̂²] + 2n(p + 1)/(n − p − 2) + n.   (2.12)

Here, the result that IC takes a simpler form compared to D_KL[θ_o; θ] (2.11) is attributed to the fact that n log σ_o² and n are treated as constants. Intentionally, the constant terms are omitted. Therefore, the corrected AIC (AICc) is

AICc = n log σ̂² + 2n(p + 1)/(n − p − 2).   (2.13)

The last term in (2.13) serves as a penalty for model selection. Hurvich and Tsai (1989) have shown that AICc works better than AIC in small samples and is an unbiased estimator of the Kullback-Leibler information. Their AICc, nonetheless, overfits as the sample size increases, since AIC and AICc have asymptotically equivalent penalty terms. It is also known that AICc is valid only when candidate models are nested (Murata et al., 1994[150]). Cavanaugh (1997[45]) presented a derivation that unifies AIC and AICc under the regression framework. McQuarrie et al. (1997) proposed AICu as an approximately unbiased estimator of the Kullback-Leibler information to correct the overfitting tendency of AICc. The term n log σ̂² in AICc (2.13) is substituted by n log σ̃² to obtain an improved alternative model selection criterion,

AICu = n log σ̃² + 2n(p + 1)/(n − p − 2),

where σ̃² = nσ̂²/(n − p − 1).
In practice, σ̃² is replaced by s² = (n − p − 1)σ̃²/(n − p), which is an unbiased estimator of σ_o². This modified model selection criterion becomes

AICu = n log s² + 2n(p + 1)/(n − p − 2).   (2.14)

Various AIC type information criteria have been developed since, although their application is limited to model selection in linear models or time series models. These models are nested within other candidate models and contain the true model, which imposes a limitation in application because of the chance that the true model is unknown and the candidate models are not nested.

2.2.4 Bayes Information Criterion

In Bayesian statistics, a model is chosen when it attains the maximum Bayes factor compared to other models. On the other hand, if choosing one model is not of interest, Bayesian model averaging serves as a model selection method in Bayesian statistics (Kass and Wasserman, 1995[110]; Wasserman, 2000[229]). Compared to model averaging or the Bayes factor, the Bayesian information criterion (BIC) did not come from a standard Bayesian methodology, but it retains its name because it chooses a model of maximum posterior probability. BIC was initiated from a very different point of view than AIC. As Reschenhofer (1996[172]) pointed out, AIC pursues an optimal criterion for prediction, whereas BIC is designed to find the model with the highest probability. Schwarz (1978[192]) first proposed BIC by approaching model selection from the Bayesian perspective. The criterion chooses the model with the largest posterior probability under certain assumptions. One of the assumptions is that the data were generated from an exponential family of distributions. Equal priors, additionally, are assumed so that the criterion has a very simple form. Suppose that we observe data D and have candidate models M_k with parameters θ_k. Then the posterior model probability is

p(D|M_k) = ∫ p(D|θ_k, M_k) p(θ_k|M_k) dθ_k,   (2.15)

where Schwarz (1978) assumed the Koopman-Darmois family for the likelihood p(D|θ_k, M_k) and equal values for the prior probability p(θ_k|M_k). When we take the Laplace approximation of eq. (2.15) around θ̂_ML, the maximum likelihood estimator of the parameter vector in the likelihood function, and drop terms that are constant with respect to the given model M_k (e.g. the number of parameters), we obtain BIC as follows,

BIC = −2L(θ̂_ML) + p_k log n,   (2.16)

where p_k is the number of parameters in M_k and n is the sample size. The log likelihood L(θ̂_ML) is the posterior log likelihood of the given data. Detailed discussions of the BIC derivation are partly found in Cavanaugh and Neath (1999b[44]), Neath and Cavanaugh (1997[153]), and Kuha (2004[118]). Zhang (1993[237]) showed that BIC has a better convergence rate than AIC for linear regression models with normally distributed errors. Additionally, he argued that BIC has a nearly optimal convergence rate. Because of this strong convergence in choosing a correct model, BIC is preferred to AIC; however, this strong convergence only holds when the models are linear and the error distribution is normal. Despite its popularity due to its parsimony and consistency in model selection, McQuarrie (1999[143]) reported that BIC has a tendency to overfit for small samples. As with the small sample correction of AIC (Hurvich and Tsai, 1989[104]), a similar small sample correction has been attempted on BIC. McQuarrie (1999[143]) proposed a signal-to-noise corrected BIC (BICc) with a stronger small sample penalty term.
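To illustrate eqs. (2.13) and (2.16) side by side, the following is a minimal sketch that evaluates AIC, AICc, and BIC for Gaussian linear regression models of increasing size; the simulated design, the nested candidate subsets, and the parameter counting convention (adding one for σ²) are assumptions made for illustration.

```python
import numpy as np

def gaussian_regression_criteria(X, y):
    """Return (AIC, AICc, BIC) for a Gaussian linear model y = X beta + e,
    using the ML variance estimate sigma2_hat = RSS / n."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n
    # -2 * maximized log likelihood, dropping the 2*pi constant.
    neg2loglik = n * np.log(sigma2) + n
    aic = neg2loglik + 2 * (p + 1)                               # +1 for sigma^2
    aicc = n * np.log(sigma2) + 2 * n * (p + 1) / (n - p - 2)    # eq. (2.13)
    bic = neg2loglik + (p + 1) * np.log(n)                       # eq. (2.16)
    return aic, aicc, bic

rng = np.random.default_rng(0)
n = 50
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

for k in range(1, 6):          # nested candidates with the first k columns
    print(k, gaussian_regression_criteria(X_full[:, :k], y))
```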
The proposed BICc is

BICc = n log σ̂² + np log n/(n − p − 2),

where n is the sample size, p is the number of parameters, and σ̂² is the maximum likelihood estimator of the variance of the error distribution. Needless to say, this criterion was developed under the linear regression framework with iid normal errors. Asymptotically, BICc is equivalent to BIC but is more consistent for small samples.

2.2.5 Minimum Description Length

Another approach to model selection from the information theoretic point of view is minimum description length (MDL). MDL was first proposed by Rissanen (1978[176], 1986[174]) to obtain an optimal coding algorithm. MDL incorporates principles very similar to those of AIC type model selection criteria. Mainly, MDL finds a model that minimizes the description length of codewords through the maximum likelihood principle (Rissanen, 1983[175]). It measures the log likelihood and the stochastic complexity of a model. When the source distribution is known, Shannon's coding theorem dictates that we use the source distribution for the optimal code, and its code length is the negative logarithm of the data probability. When the source distribution is unknown, the search for a universal modelling procedure is the search for a universal coding scheme. Let F_θ be a candidate distribution function with parameter θ and G be the true distribution. Since MDL was developed from coding theory, F_θ and G are generally considered to be models of discrete distributions, especially binary models. In coding theorems, the length of a codeword x is written as l(x) = −log G(x), where log denotes the logarithm base two (here, two comes from binary coding, and the base of the log can be another integer). Yet, the true codeword generating model is hidden from signal decoders. One wishes to obtain a criterion that gives a minimal length discrepancy from the true distribution by estimating a candidate model F_θ(x). The Shannon source coding theorem (Shannon, 1948[201]) proved that the expected code length is bounded below, which can easily be proved by the Kraft inequality,

Σ_{∀x} 2^{−l(x)} ≤ 1,   (2.17)

where ∀x indicates the set of all codewords and l(x) is the length of each codeword. This theorem illustrates that no coding algorithm is able to achieve a shorter expected codeword length than the entropy, H(G) = E_G[−log G(X)] (Cover and Thomas, 1991[55]). Hence, the aim of MDL is minimizing the distance between −log F_θ(x) and −log G(x) (Barron et al., 1998[20]). Furthermore, it is important to notice that the expected distance between the estimated code length and the
(2.18) is simplified to 1 Length(x, θˆML)= −log Fˆ (x)+ log |H(θˆML)| + Cn, (2.19) θML 2 where θˆML is the maximum likelihood estimator of θ, H(θˆML) is the Hessian matrix nd (minus of the 2 derivative of the log likelihood), and Cn(x) is a some constant with respect to θ. The second term, the Hessian matrix is interpreted as a complexity measure. It is known that n−p|H(θˆ)| → α as n → ∞. Note that α is a constant and p denotes the dimension of θ. As sample size n increases, the sensitivity of functional form of candidate models gradually subsides (Myung, 2000[151]; Qian and K¨unsch, 1998[166]). Hence, MDL is sensitive to the number of parameters in a model as well as the model complexity whose value depends upon the functional form of the model. As a result, MDL becomes almost equivalent to BIC, 1 MDL = −log Fˆ (x)+ plog n. (2.20) θML 2 Further rigorous studies on MDL are found in Barron et.al. (1998[20]), Clarke and Barron (1990[54]), Hansen and Yu (2001[84]), Qian and K¨unsch (1998[166]), Rissanen (1986[174], 1987[173]) and Yang and Barron (1998[234]). 18 2.2.6 Resampling in Model Selection Alternatively, data resampling methods were adopted in model selection studies to estimate the KL distance. We will leave the jackknife method to the next chapter. In summary, the cross validation method has been proved to be equivalent to AIC asymptotically (Stone, 1977[215] and Amari et.al., 1997[11]) and the bootstrap methods are not efficient nor successful in terms of consistency of selecting models (Shao ,1996[202]; Chung et.al.,1997[53]; Ishiguro et.al., 1997[107]; and Shibata, 1997[207]). To show the asymptotic equivalence between AIC and cross-validation, we re- place the assumption (A2) by (A2’). We follow the notations from Assumptions and denote the log likelihood of all observation L(θ) = i=1 li(θ) and the log th likelihood without i observation L−i(θ) = j6=i lj(θ), whereP li(θ) = log fθ(Xi). Assumption: P ˆ ˆ ∂ ˆ ∂ ˆ (A2’) For θ and θ−i such that ∂θ L(θ)= ∂θ L−i(θ−i)=0, p p ||θˆ− θo|| −→ 0 and max ||θˆ−i − θo|| −→ 0, 1≤i≤n ∂ if a unique θo satisfies Eg[ ∂θ li(θo)] = 0. Then, cross validation information criterion with one validation sample (CV1) is defined as n CV 1= −2 li(θˆ−i). (2.21) i=1 X As AIC, CV1 measures log likelihood with maximum likelihood principle. Under the conditions (A1),(A2’),(A3)-(A5), asymptotic equivalence between CV1 and AIC is proved by the following; From the definition of θˆ and θˆ−i, we have ∂ ∂ ∂2 l (θˆ ) = l (θˆ)+ l (θˆ)(θˆ − θˆ)(1 + o (1)) ∂θ i −i ∂θ i ∂θ∂θT i −i p ∂ ∂ l (θˆ ) = L(θˆ ) ∂θ i −i ∂θ −i ∂ ∂2 = L(θˆ)+ L(θˆ) (θˆ − θˆ)(1 + o (1)) ∂θ ∂θ∂θT −i p ˆ ˆ = −Jˆ(θ−i − θ)(1 + op(1)) (2.22) 19 ˆ ∂2 ˆ In eq. (2.22), J refers to the Hessian matrix, − ∂θ∂θT L(θ). Above results are substituted in the following; ∂ ∂2 l (θˆ ) = l (θˆ)+(θˆ − θˆ)T l (θˆ)+(θˆ − θˆ)T l (θˆ)(θˆ − θˆ)(1 + o (1)) i −i i −i ∂θ i −i ∂θ∂θT i −i p ∂ ∂ = l (θˆ) − l (θˆ )Jˆ−1 l (θˆ) (1 + o (1)) i ∂θT i −i ∂θ i p ∂ ∂ = l (θˆ) − l (θˆ)Jˆ−1 l (θˆ) (1 + o (1)) i ∂θT i ∂θ i p ∂ ∂ = l (θˆ) − tr l (θˆ) l (θˆ)Jˆ−1 (1 + o (1)). i ∂θ i ∂θT i p Therefore, CV 1 = −2 li(θˆ−i) i X −1 = −2L(θˆ)+2tr(IˆJˆ )(1 + op(1)) (2.23) ˆ n ∂ ˆ ∂ ˆ where I indicates i=1 ∂θ li(θ) ∂θT li(θ). If the conditionP that the true model is contained in the set of candidate models holds, then tr(IˆJˆ−1)= p. In this case, p is the number of parameters, then (2.23) is asymptotically equivalent to AIC (Stone, 1977[215]). Without knowing the true model, still (2.23) shares the asymptotical identity with TIC (Shibata, 1989[208]). 
Shao (1993[204]) showed that the popular leave-one-out cross validation method is asymptotically inconsistent in terms of the probability of choosing the best predicting model. The probability does not converge to 1 as sample size increases. Also, CV1 was conservative enough to pick an unnecessarily large model. Instead of leave-one-out, leave-nv-out cross validation (nv > 1) with nv/n → 1 as n →∞ was suggested to rectify the inconsistency of CV1. In addition to cross validation, bootstrap resampling principle was quite often studied for model selection. Here, the focus of the study is measuring Kullback- Leibler discrepancy although the argument of bootstrapping can be extended to other types of discrepancy (to name a few of those extensions, see Breiman, 1992[35]; Efron, 1983[67]; Hjorth 1994[93]; Linhart and Zucchini, 1986[126]; and Shao, 1996[202]). Asymptotic approximation of bootstrap is too complicated to be evaluated ana- 20 lytically. A few studies showed that bootstrap version of model selection criterion is functionally equivalent to AIC despite its computational complexity (Chung et.al., 1996[53]; Shibata, 1997[207]; and Ishiguro, 1997[107]). According to Chung et.al. (1996[53]), the bootstrap method has a downward bias which is roughly equivalent to the number of parameters of a candidate model like AIC. This state- ment implies that bootstrap model selection gives reasonable bias correction. Shi- bata (1997) confirmed this idea more rigorously through the investigation of several bootstrap type estimates. In addition, the advantage of using a bootstrap estimate is that it is free from expansion while AIC or other similar criteria are based on an expansion with respect to parameters. Often the calculation of analytic estimates, the gradient Iˆ and the Hessian Jˆ, becomes complicated. According to Ishiguro et.al. (1997[107]), the advantages of bootstrap model selection are unbiasedness and its independence of models. 2.2.7 Other Criteria and Comments The model selection criteria discussed so far are most popular model selection criteria in terms of finding a model that minimizes the KL distance. On the other hand, they have shortcomings since they are developed under specific conditions. There are more notable model selection criteria that allow to choose a model, minimizing the KL distance, like H-Q criterion (Hannan and Quinn, 1979[83]) and model selection with data oriented penalty (Rao and Wu, 1989[167] and Bai et.al., 1999[16]). The information criterion by Hannan and Quinn (1979) was developed from choosing a correct order of a time series model by adapting the law of iterative logarithm. Rao and Wu (1989) presented a general form of penalties to obtain a consistent process for model selection in the regression framework. Other directions have been taken like Konishi and Kitagawa (1996[116]), who developed the generalized information criterion (GIC) to loose assumptions on popular information criteria like AIC and BIC. Irizarry (2001[106]) and Agostinelli (2002[6]) introduced a weighted version of these popular information criteria. Ca- vanaugh (1999a[43]) used the symmetric Kullback-Leibler distance, which makes their information criterion more consistent because the number in the penalty term has changed from 2 to 3. 21 In spite of numerous model selection methods, as Burnham and Anderson (2002[39]) emphasized, no model selection method is superior to the other meth- ods. 
The most important process is understanding data prior to applying any statistical model selection criterion. 2.3 Convex Hull and Algorithms Since R´enyi and Sulanke (1963[170], 1964[171]) and Efron (1965[68]), convex hulls have received spotlights for decades. This section provides foundations and basic definitions related to convex hull and its algorithms for statistical multivariate data analysis, which will be introduced in Chapter 4 and Chapter 5. Frequently referred computational convex hull algorithms are introduced to assist the understanding of the convex hull process. 2.3.1 Definitions Prior to presenting the definition of the convex hull, we need the definition on convex sets. Wikipedia1 provides a good summary on convex sets. At the moment, we give a simple definition. Definition 2.1 (Convex Set). A set C ⊆ Rd is convex if for every two points x, y ∈ C the whole segment xy is also contained in C; in other words, the point tx + (1 − t)y belongs to C for every t ∈ [0, 1]. (Klee, 1971[112]). Note that the intersection of an arbitrary family of convex sets is a convex set as well. Extensive studies on convex sets are found in Rockafellar(1970[177]) and Valentine(1976[222]). Here, we give the definition of the convex hull. Definition 2.2 (Convex Hull). The convex hull of a set of points X in Rd denoted by conv(X), is the intersection of all convex sets in Rd containing X. In other words, conv(X), the convex hull of X is the smallest convex domain in Rd con- taining X. In descriptions of algorithms, a convex hull indicates points of a shape invariant minimal subset of conv(X), or extreme points of X. Connecting these points produces a d dimensional wrap around conv(X). 1http://en.wikipedia.org/wiki/Convex Set 22 There are numerous studies related to convex hull. Regarding the fundamen- tal and basic notions of the convex hull, readers may refer to the content from Wikipedia2. Modern mathematical and probabilistic studies on the convex hull were initiated by R´enyi and Sulanke (1963, 1964) and Efron (1965), who derived the expected number of vertices, areas, and perimeters of convex hulls, formed by samples generated from well known distributions. Their works were based on the relation among vertices, edges, and areas in 2 and 3 dimensional geometry dis- covered by Euler. Afterwards, convex hull studies have been focused on assessing limiting distributions by estimating expected number of vertices and volumes, and their variances for central limit theorems when sample were generated from convex body distributions. In a series of works by Hueter (1994[102], 1999a[100], 1999b[101]), central limit theorems for the number of vertices, perimeter, and area (Nn, Ln, An by notation respectively) of the convex hull of n iid random vectors, sampled from spherical distributions including the normal distribution. Her main tools are martingales and poisson approximation of the point process of convex hull vertices in addition to a geometric application of the renewal theorem. Hsing (1994[95]) studied the central limit theorem of convex hull areas from samples of uniform planar distributions. Approximating the process of convex hull vertices by the process of extreme points of a poisson point process lead Groeneboom (1988[80]) to a central limit theorem for the number of convex hull vertices of a uniform sample from the interior of a convex polygon. 
Cabo and Groeneboom (1994[40]) derived limit theorems for the boundary length and for the area of the convex hull, extending results of R´enyi and Sulanke (1963). A very useful and extensive review of the characteristics of limiting distributions of convex hull is found in Schneider (1988[194]) despite years passed since its publication. Moreover, results from Mass´e(1999[141], 2000[140]) provide interesting insights of convex hull studies. Suppose that a sequence of iid random variables in Rd is defined on a probability space (Ω, A,P ). Let Cn be the convex hull of {X1, ..., Xn}, Nn be the number of vertices of Cn, and Mn be 1 − P (Cn). Note that Nn and Mn are random variables as well. Mass´e(2000) stated that 1. Most of the properties of Cn,Nn, and Mn depend tightly on P and 2http://en.wikipedia.org/wiki/Convex hull 23 2. Lots of studies on expected values of Mn and Nn (E(Mn) and E(Nn)) exist but their variances are hardly known. He showed that a traditionally accepted conjecture: E(Nn) →∞ implies Nn/E(Nn) → 1 in probability, is wrong. Mass´e(1999) pointed out no trivial distribution free bound for the variance of Nn. There is a different approach to get the limiting behavior of convex hull func- tionals by adapting support functions (Eddy and Gale, 1981[62]). Eddy (1980[63]) showed that the convex bodies can be embedded via a support function into the space of continuous functions on the unit sphere so that the distribution of the convex hull can be described by the distribution of a continuous function. Based on the mapping through support functions, Eddy and Gale (1981) characterized the limiting behavior the support function of the convex hull from samples in Rd. However, Groeneboom (1988) criticized using support functions because of the fact that characterization by support functions does not give the type of information one needs to develop a distribution theory for the point process of vertices of a convex hull. Some other directions in convex hull studies, deviated from obtaining limiting distributions, have been taken for statistical inferences although these studies are rarely found. Bebbington (1978) utilized the convex hull to trim data for robust correlation estimation. Tukey in Huber (1972) introduced the convex hull for robust estimation of multivariate data center. Barnett (1976) made a debut of ordering multivariate data by peeling convex hulls. Chakraborty (2001[47]), Zani et.al. (1998[236]), and Miller et.al. (2003[147]) discussed the possibility of outlier detection with the convex hull. Before we move to convex hull algorithms, we want to remind the notion of a convex hull in computer science. In geometry, the convex hull is the smallest convex set containing given points. Consider a point x of a convex set C which is called an extreme point of C provided that x is not an inner point of any segment in C. In computational science, the collection of these extreme points indicates a convex hull instead of the whole supporting space. This dual notions of a convex hull are alternatively used without any prior notification. Two notions of the convex hull: • In geometry, the smallest convex set containing given points 24 • In computer science, extreme points that comprise the boundary of the con- vex hull Next, we will discuss algorthms finding convex hulls. 2.3.2 Convex Hull Algorithms In mathematics and computer science, various algorithms, pawning the convex hull of given data, have developed. Books by de Berg et.al. 
(2000[28]), Edelsbrun- ner (1987[66]), Goodman and O’Rourke (2004[76]), Laszlo (1996[121]), O’Rourke (2000[159]), Preparata and Shamos (1985[163]), and others have illustrated various convex hull algorithms. In this section, we introduce most of well known convex hull algorithms. Some algorithms are restricted to two dimensional data. Some al- gorithms contain studies on computational complexity, a critical factor in analysis of large data sets. Constructing a convex hull involves sorting, causing at least O(nlog n) time for operation. In two dimension, it is reported that nlog n is a lower bound on the complexity of any algorithm that find the hull (O’Rourke, 2001[159]). As the data dimension increases, we can expect that the computation complexity will grow faster than nlog n. If data sets are huge, say tera- or peta-bytes in multivariate forms, the sorting process delays any fast computation. This relative slowness in computation is the sacrifice of using nonparametric methods in statistical analysis. The convex hull of a set of points is the smallest convex set that contains all the points. In the plane, we can visualize the convex hull as a stretched rubber band surrounding the points that, when released, takes a polygonal shape. This convex polygon is the convex hull and the extreme points in the hull are the vertices of the convex polygon. The convex hull algorithms collect points touching this rubber band so that the designing such rubber bands has been the primary interest of developing algorithms. Additional terminologies have to be introduced to state convex hull algorithms. The two dimensional and three dimensional convex hull is called a polygon and polyhedron, respectively. A higher dimension convex hull is a polytope but the visualization of a high dimensional (d> 3) convex hull is very restricted compared to two to three dimensional convex hulls. For description purposes, the (d−1)-dim 25 faces of a d-dimensional polytope are called facets, the (d − 2)-dim faces of a d dimensional polytope are ridges, and points are called vertices; for example, in the familiar three dimensional simple tetrahedron (simplex), triangles are facets and edges are ridges so that in total, we have 4 facets, 6 ridges, and 4 vertices. From Berg et.al. (2000[28]) the following relation among vertices, edges, and facets is found, Let P be a convex polytope with n vertices. The number of edges of P is at most 3n − 6 and the number of facets of P is at most 2n − 4. This statement only provides the upper bounds and no studies are found that provide exact numbers and relationships of vertices, ridges, and facets when the data come from higher dimension (d> 2). Graham Scan Algorithm Graham published this algorithms in 1972[77]. The Graham scan algorithm only requires arithmetic operations and comparisons. It is also known that no explicit conversion to polar coordinates (for sorting) is necessary (Bayer, 1999[25]). Eddy (1977a[64], 1977b[65]) showed that this algorithm requires at least O(nlog n) oper- ations to determine the vertices due to sorting points. These discoveries are based upon two dimensional data. The algorithm stated in Algorithm 1 is adapted from O’Rourke (2001[159]). Algorithm 1 Graham scan algorithm Find the interior point x; label it po Sort all other points angularly about x; label p1, ..., pn−1 Stack S =(p1,p2)=(pt−1,pt); t indexes top. 
i ← 3
while i < n do
  if pi is left of (pt−1, pt) then
    Push(S, i) and increment i
  else
    Pop(S)
  end if
end while

Gift Wrapping Algorithm
Chand and Kapur (1970[49]) introduced this algorithm for two dimensional polygons, but it was soon extended to higher dimensional data (Laszlo, 1996[121] and O'Rourke, 2001[159]). For two dimensional data, the complexity is O(nh) if the hull has h edges, so that the worst case complexity is O(n^2). The statement of the gift wrapping algorithm in Algorithm 2 is adapted from O'Rourke (2001); this version applies to both two and three dimensional data.

Algorithm 2 Gift Wrapping Algorithm
  Find the lowest point (smallest y-coordinate)
  Let i0 be its index, and set i ← i0
  repeat
    for each j ≠ i do
      Compute the counterclockwise angle θ from the previous hull edge.
    end for
    Let k be the index of the point with the smallest θ
    Output (pi, pk) as a hull edge
    i ← k
  until i = i0

Jarvis' March Algorithm
Jarvis (1972[108]) reported this algorithm, a special case of gift wrapping limited to two dimensional data. The algorithm calculates angles from a point and connects it to the next point with the smallest angle. The reported running time is O(nN), where N is the number of vertices on the convex hull, so that the worst case running time is O(n^2). Eddy (1977a[64]) commented that for some configurations of points in the plane with a small number of vertices N, Jarvis' march is faster than the Graham scan algorithm.

Beneath-and-Beyond Algorithm
This algorithm was introduced by Chazelle (1985[48]) with a claim of optimality. In two dimensions, the space and time complexities are O(n) and O(n log n), respectively. Additionally, the worst case complexity is $O(n^{\lfloor (d+1)/2 \rfloor})$, so the algorithm is optimal for even dimensions. The name comes from the fact that one point at a time is added to an already constructed convex hull. The algorithm described in Algorithm 3 is adopted from Edelsbrunner (1987).

Algorithm 3 Beneath and beyond algorithm
  Sort the n points with respect to their x1-coordinates
  Relabel the points; (p1, p2, ..., pn) is the sorted sequence.
  Construct P = conv{p1, p2, p3}
  [Iteration:] Complete the construction point by point.
  for i := 4 to n do
    Add point pi to the current convex hull,
    Update the representation of Pi−1 = conv{p1, p2, ..., pi−1} to Pi = conv{p1, ..., pi−1, pi}.
  end for

Divide-and-Conquer Algorithm
This algorithm was first introduced by Preparata and Hong (1977[164]). The divide and conquer algorithm works for data of general dimension. The time complexity for two and three dimensional data is reported as O(n log n) (Edelsbrunner, 1987[66]). The time complexity for 4 dimensional data is known to be $O((n+N)\log^2 N)$ and for 5 dimensional data $O((n+N)\log^3 N)$, where n is the sample size and N is the number of vertices. However, it is known that the merging process is very difficult when the data dimension is more than two. The divide-and-conquer algorithm for two dimensions can be stated simply as follows.
1. Sort the points by x coordinate
2. Divide the points into two sets A and B, A containing the left upper n/2 points and B the right lower n/2 points
3. Compute the convex hulls A = conv(A) and B = conv(B) recursively.
4. Merge A and B: compute conv(A ∪ B)
The higher dimensional (d > 2) divide-and-conquer algorithm stated in Algorithm 4 is adopted from O'Rourke (2001).
Algorithm 4 Divide and conquer algorithm
  a ← rightmost point of A
  b ← leftmost point of B
  while T = ab not lower tangent to both A and B do
    while T not lower tangent to A do
      a ← a + 1
    end while
    while T not lower tangent to B do
      b ← b + 1
    end while
  end while

Incremental Algorithm
This algorithm is presented by O'Rourke (2001[159]) and stated in Algorithm 5 for a three dimensional sample. The total complexity is O(n^2), since the complexity of finding faces and edges is proportional to n and these linear loops of searching faces and edges are embedded inside a loop that iterates n times (Thm 4.1.1., O'Rourke, 2001).

Algorithm 5 Incremental Algorithm
  Initialize H4 to the tetrahedron (p0, p1, p2, p3).
  for i = 4, ..., n − 1 do
    for each face f of Hi−1 do
      Compute the volume of the tetrahedron determined by f and pi.
      Mark f visible iff volume < 0
    end for
    if no faces are visible then
      Discard pi (it is inside Hi−1)
    else
      for each border edge e of Hi−1 do
        Construct the cone face determined by e and pi
      end for
      for each visible face f do
        delete f
      end for
      Update Hi
    end if
  end for

Quick Hull Algorithm
The initial development of the quick hull algorithm was performed by Preparata and Hong (1977) for convex polygons in two dimensional space. Barber et.al. (1996) reinvented the quick hull algorithm by combining this two dimensional quick hull algorithm with the general dimension beneath-beyond algorithm. Barber et.al. (1996) also provided empirical evidence that their algorithm uses less memory and runs faster. The software for this algorithm is found at http://www.qhull.org. The algorithm stated in Algorithm 6 is adopted from Barber et.al. (1996[19]).

Algorithm 6 Quick hull algorithm
  create a simplex of d+1 points
  for each facet F do
    for each unassigned point p do
      if p is above F then assign p to F's outside set
      end if
    end for
  end for
  for each facet F with a non-empty outside set do
    select the furthest point p of F's outside set
    initialize the visible set V to F
    for all unvisited neighbors N of facets in V do
      if p is above N then add N to V
      end if
    end for
    {the boundary of V is the set of horizon ridges H}
    for each ridge R in H do
      create a new facet from R and p
      link the new facet to its neighbors
    end for
    for each new facet F′ do
      for each unassigned point q in an outside set of a facet in V do
        if q is above F′ then assign q to F′'s outside set
        end if
      end for
    end for
    delete the facets in V
  end for

2.3.3 Other Algorithms and Comments
In addition to the algorithms discussed so far, Böhm and Kriegel (2001[31]) presented the depth-first algorithm and the distance priority algorithm for large multidimensional databases. Nevertheless, their algorithms were designed for two dimensional data, and their illustrations did not utilize data sets large enough to count as massive by today's standards. Many algorithms exist, and most have been shown to have lower computational complexity and shorter computation times than previously published algorithms, regardless of the data dimension. Some of them are proved to be extremely simple and fast under particular conditions such as planar data. Even though some inventors of convex hull algorithms argue that their algorithms can be extended to general dimensions, applications of their algorithms to higher dimensional data are hardly found. Hence, choosing an algorithm for a statistical analysis becomes a nontrivial problem.
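In practice, the Qhull library cited above is directly accessible from common scientific computing environments. As a minimal illustration (a sketch only; it assumes Python with NumPy and SciPy, whose ConvexHull routine wraps Qhull, and uses an arbitrary simulated sample), the extreme points, facets, and volume of a convex hull can be obtained as follows.

    import numpy as np
    from scipy.spatial import ConvexHull

    # Simulated bivariate sample; any (n, d) array of points would do.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 2))

    hull = ConvexHull(X)            # calls Qhull (quick hull algorithm)
    print(len(hull.vertices))       # number of extreme points
    print(len(hull.simplices))      # number of facets (edges in two dimensions)
    print(hull.volume)              # enclosed area in two dimensions
    print(hull.area)                # perimeter in two dimensions

The same call applies to higher dimensional samples, where the facets become (d−1)-dimensional simplices and the volume and area attributes refer to the d- and (d−1)-dimensional content, respectively.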
Instead of looking for a fast and efficient algorithm with conditions, we choose one that serves the general purposes. QuickHull algorithm by Barber et.al. (1996) will be employed through the rest of the study. The fact that numerous algorithms are traced from the literature in computa- tional geometry proves that just retrieving representative points of a convex hull from a given data set is an important problem mathematically and computation- ally. This overflowing number of algorithms also illuminates that lack of statistical functionals leading to statistical inferences and universal mathematical justifica- tion of convex hulls for quantification. This thesis may contribute to overcoming these shortcomings. 2.4 Data Depth Since Barnett (1976[24]), data depth has emerged as a powerful tool of ordering multivariate data as well as exploring multivariate data with various applications. A data depth is known to be a device for measuring the centrality of a multivariate data point with respect to a given data cloud and a indicator of how deeply a data point is embedded in the data cloud (Liu and Singh, 1993[128]) so that it 31 characterizes the multivariate distributions. Understanding distributions of multivariate data asks for defining the order of points. In contrast to the simplicity of assigning orders for univariate sample, multivariate data undergo various methods of ordering and no consensus in multi- variate ordering has reached yet. In this section, we summarized well known data depths connected to ordering multivariate data and some notions for describing multivariate distributions. 2.4.1 Ordering Multivariate Data A natural way of defining ordering of multivariate data is that points on the surface of data cloud have shallow depth and points far from the data cloud surface in any direction have deeper depth. Barnett (1976[24]) provided an extensive study on ordering multivariate data in a more quantitive way. Four different ordering principles were suggested. Marginal ordering: This ordering is called M-ordering and as the name sug- gests, the ordering is based on marginal sample. M-ordering is related to transformation of the data set. Reduced ordering: This ordering is called R-ordering and starts from reducing each multivariate observation to a single value by some combination of the components of given sample based on chosen metrics. The Mahalanobis distance is one of the most popular examples. Partial ordering: This ordering is called P-ordering. Deviated from M-ordering and R-ordering, P-ordering considers overall inter-relational properties. The important principle of P-ordering is related to assigning ranks to data layers. Tukey’s peeling process belongs to P-ordering. Conditional ordering: This ordering is called C-ordering or sequential ordering. C-ordering is conducted on one of the marginal sets of observations condi- tional on ordering within the marginal data, which stems from works about determining distribution free tolerance limits by hyper-rectangular coverage. 32 The studies on ordering principles in multivariate data is an effort of seeking a direct counterpart of ordering univariate data. By means of these ordering princi- ples, Barnett (1976) discussed multivariate median, multivariate range, multivari- ate extreme points, multivariate quantiles, and outliers in multivariate data. Also, testing processes, modelling, classification and clustering in multivariate sample were put on the table. 
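As a small numerical illustration of reduced (R-) ordering, the sketch below (an illustration only, assuming Python with NumPy and a simulated bivariate sample) orders observations by their Mahalanobis distance to the sample mean, one common choice of reducing metric.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]], size=500)

    # Reduce each observation to a single value: its squared Mahalanobis distance.
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - center
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    r_order = np.argsort(d2)       # R-ordering: most central observations first
    print(X[r_order[:5]])          # five most central points
    print(X[r_order[-5:]])         # five most outlying points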
Some details on these topics will be presented in upcoming sections.

2.4.2 Data Depth Functions
The various ideas of ordering multivariate data have led to the coining of various depth functions. A comprehensive list of data depth functions with their definitions and properties is given in Liu et.al. (1999[127]), Zuo and Serfling (2000a[241]), and Chenouri (2004[52]). Well known data depths are Mahalanobis Depth (Mahalanobis, 1936[133]), Convex Hull Peeling Depth (Barnett, 1976[24]), Half Space Depth (Tukey, 1974[221]), Simplicial Depth (Liu, 1990[130]), Oja Depth (Oja, 1983[158]), Majority Depth (Singh, 1991[212]), Projection Depth (Donoho and Gasko, 1992[58]), and Regression Depth (Rousseeuw and Hubert, 1999[182]). Here, we discuss the data depths appearing most frequently in the literature. Let F be a probability distribution in R^d (d ≥ 2). Throughout this section, unless stated otherwise, we assume that F is absolutely continuous and X = {X1, ..., Xn} is a random sample from F.

Half Space Depth (Tukey, 1974) It is also called the Tukey depth. The half space depth of x is defined as the minimal number of data points in any closed half space determined by a hyperplane passing through x,
$$HD(x; X) = \min_{\|u\|=1} \#\{i : u^T X_i \geq u^T x\}.$$
The median from the half space depth is called the Tukey median, which is constructed uniquely as a centroid. This depth process builds nested polytopes, illustrated in Rousseeuw and Struyf (2004[180]) with samples from various distributions. Donoho and Gasko (1992) proved the continuity of the half space depth and the almost sure convergence of the empirical half space depth to the theoretical depth. They showed that the half space depth breaks down at 1/3. The computational complexity of the half space depth has been studied; in particular, Miller et.al. (2003[147]) reported that their half space depth algorithm is more efficient (O(n^2)) than the algorithm by Rousseeuw and Ruts (1998[187]), who presented BAGPLOT, an extension of Tukey's univariate box plot, in $O(n^2 \log^2 n)$. The computational complexity was compared in a two dimensional framework.

Mahalanobis Depth (Mahalanobis, 1936) This depth is based on the Mahalanobis distance;
$$MhD(x) = [1 + (x - \mu_F)^T \Sigma_F^{-1} (x - \mu_F)]^{-1},$$
where $\mu_F$ and $\Sigma_F$ denote the mean and covariance matrix of F, respectively. The sample version of this depth is obtained by replacing the parameters with the estimators $\hat{\mu}_{F_n}$ and $\hat{\Sigma}_{F_n}$. The covariance matrix is assumed to be non-singular. Despite its frequent applications and popularity, it has been criticized for its lack of robustness.

Oja's Depth (Oja, 1983) Let $S[x, X_1, ..., X_d]$ be the closed simplex with vertices x and the d random observations $X_1, ..., X_d$ from F. The Oja depth is
$$OD(x; F) = [1 + E_F\{Vol(S[x, X_1, ..., X_d])\}]^{-1}.$$
The sample version of OD(x; F) is
$$OD(x; F_n) = \Big[1 + \binom{n}{d}^{-1} \sum_{*} Vol(S[x, X_{i_1}, ..., X_{i_d}])\Big]^{-1},$$
where * indicates all d-plets $(i_1, ..., i_d)$ such that $1 \leq i_1 < \cdots < i_d \leq n$. It is known that the corresponding median is affine equivariant, but the Oja depth has zero breakdown value.

Projection Depth (Donoho and Gasko, 1992) The outlyingness of any point x relative to the data set X = {X1, ..., Xn} is defined as
$$O(x; X) = \max_{\|u\|=1} \frac{|u^T x - \mathrm{med}\{u^T X\}|}{\mathrm{MAD}\{u^T X\}}.$$
Here the median absolute deviation (MAD) of a univariate data set {y1, ..., yn} is $\mathrm{MAD}\{y_i\} = \mathrm{med}_i |y_i - \mathrm{med}_j\{y_j\}|$. Based on this outlyingness O(x; X), the projection depth is defined as
$$PD(x; X) = (1 + O(x; X))^{-1}.$$
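The maximization over all unit directions in the projection outlyingness is rarely computed exactly beyond low dimensions. The sketch below is a crude approximation only (not an exact algorithm), searching over a finite set of random directions; it assumes Python with NumPy and an arbitrary simulated sample.

    import numpy as np

    def projection_depth(x, X, n_dir=2000, rng=None):
        # Approximate PD(x; X) = 1 / (1 + O(x; X)) by maximizing the
        # outlyingness O over n_dir random unit directions.
        rng = np.random.default_rng() if rng is None else rng
        U = rng.standard_normal((n_dir, X.shape[1]))
        U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions
        proj = X @ U.T                                  # projected sample, one column per direction
        med = np.median(proj, axis=0)
        mad = np.maximum(np.median(np.abs(proj - med), axis=0), 1e-12)
        out = np.max(np.abs(U @ x - med) / mad)         # approximate outlyingness
        return 1.0 / (1.0 + out)

    rng = np.random.default_rng(2)
    X = rng.standard_normal((500, 3))
    print(projection_depth(np.zeros(3), X, rng=rng))                # deep point: depth near its maximum
    print(projection_depth(np.array([4.0, 4.0, 4.0]), X, rng=rng))  # outlying point: depth near zero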
Zuo and Serfling (2000a) showed that the projection depth is affine equivariant, maximal at the center, and monotonic relative to the deepest point. In addition, the projection depth median has a large sample breakdown point of 1/2 (Zuo et.al., 2004[238]).

Simplicial Depth (Liu, 1990) The simplicial depth of x is, empirically, the number of simplices formed by d+1 data points that contain x. The simplicial depth is defined as
$$SD(x; F) = P_F\{x \in S[X_1, ..., X_{d+1}]\},$$
where $S_{i,d} = S[X_{i_1}, ..., X_{i_{d+1}}]$ denotes a closed simplex formed by (d+1) random observations from F. The sample version of this depth is
$$SD(x; F_n) = \binom{n}{d+1}^{-1} \sum_{*} I_{\{x \in S_{i,d}\}},$$
where * indicates all (d+1)-plets of observations. It is known that the simplicial median is affine equivariant with a breakdown value bounded above by 1/(d+2). It has also been noted that the simplicial depth does not always attain its maximum at the center and is not always monotone relative to the deepest point (Zuo and Serfling, 2000b[242]). Liu (1990) proved the uniform consistency of the simplicial depth function when F is an absolutely continuous distribution on R^d with a bounded density f, i.e.,
$$\sup_{x \in \mathbb{R}^d} |SD(x; F_n) - SD(x; F)| \xrightarrow{a.s.} 0$$
as n → ∞.

Majority Depth (Singh, 1991) Denote as a major side the half space, bounded by the hyperplane containing $(X_{i_1}, ..., X_{i_d})$, which has probability at least 0.5. Then
$$MjD(F; x) = P_F\{x \text{ is in a major side determined by } (X_1, ..., X_d)\}.$$
It is known that the majority depth is affine equivariant, maximal at the center, and monotonic relative to the deepest point (Zuo and Serfling, 2000a).

Liu and Singh (1993[128]) found that, under proper conditions, the uniform consistency of the sample version of the depth function holds almost surely for the following depths: the Mahalanobis depth, the half space depth, the simplicial depth, and the majority depth. For the Mahalanobis depth, uniform consistency holds if $E_F\|X\|^2 < \infty$; for the half space depth and the simplicial depth, uniform consistency holds if F is absolutely continuous; and for the majority depth, uniform consistency holds if F is an elliptical distribution. These results are connected to the fact that these depths are applicable to statistical inference on multivariate distributions.

2.4.3 Statistical Data Depth
In the descriptions of the various data depths in the previous section, we discussed some of their properties, such as maximality and monotonicity. In addition to these two properties, affine invariance and vanishing at infinity comprise the properties for statistical depths, allowing depth functions to be applied to statistical inference. Depth functions satisfying the four properties are called statistical depth functions. Zuo and Serfling (2000a[241]) explained the notion of statistical data depth. Although various measures of depth exist, not all have properties desirable for statistical analysis. In short, the properties of a statistical data depth are affine invariance, maximality at the center, monotonicity relative to the deepest point, and vanishing at infinity.

Definition 2.3 (Statistical depth function). Let the mapping D(·; ·) : R^d × F → R be bounded and non-negative, where F is the class of distributions on the Borel sets of R^d and F is the distribution of the random sample X.
If D(·; F ) satisfies the following: (P1) (Affine invariance) D(Ax + b; FAX+b)= D(x; FX ), for all x, holds for any random vector X in Rd, any d × d non-singular matrix A, and any d-vector b; (P2) (Maximality at center) D(θ; F ) = supx∈Rd D(x; F ) holds for any F ∈ F having center θ; (P3) (Monotonicity) for any F ∈ F having deepest point θ, D(x; F ) ≤ D(θ + α(x − θ); F ) holds for α ∈ [0, 1]; and (P4) D(x; F ) → 0 as ||x||→∞, for each F ∈F. Then D(·; F ) is called a statistical depth function. To understand properties of these functions related to multivariate distribu- tions, we will need further definitions on types of distributions. 2.4.4 Multivariate Symmetric Distributions Ordering highly depends on a symmetric distribution and a point about which the symmetry is defined. This point, which we will call the multivariate median, will be discussed in the next section with estimating methods. Here, we will focus on general notions of multivariate symmetric distributions. Spherical Symmetric: For any orthogonal transformation A : Rd → Rd, the distribution of a random vector X ∈ Rd is spherical symmetric about θ provided that X − θ and A(X − θ) are identically distributed. Elliptical Symmetric: For some affine transformation T : Rd → Rd of full rank, the distribution of a random vector X ∈ Rd is elliptical symmetric about θ if Y = T (X) and Y is spherical symmetric about θ. 37 Central Symmetric: The distribution of a random vector X ∈ Rd is centrally symmetric about θ ∈ Rd if X − θ =d −(X − θ), where =d implies that two random vectors are identically distributed. Angular Symmetric: The distribution of a random vector X ∈ Rd is angular symmetric about θ ∈ Rd provided that the standardized vector (X −θ)/||X − θ|| is centrally symmetric about the origin. Note that conventionally (X − θ)/||X − θ|| =0 if ||X − θ|| = 0. Half Space Symmetric: The distribution of a random vector X ∈ Rd is half space symmetric about θ ∈ Rd if P (X ∈ H) ≥ 1/2 for any closed half space H with θ on the boundary. Note that central symmetry is a subset of angular symmetry, which is a subset of half space symmetry. 2.4.5 General Structures for Statistical Depth Zuo and Serfling (2000a) presented general structures for statistical depth func- tions, according to which various depth functions are classified into four types. We summarize the four general structural types for constructing statistical depth functions and investigate the satisfaction degrees of these types with the properties of statistical depth functions. Type A depth functions: Let h(x; x1, ..., xn) be any bounded non-negative func- tion which measures in some sense the closeness of x to the points x1, ..., xn. A corresponding type A depth function is then defined by the average close- ness of x to a random sample of size n: D(x, F )= Eh(x; X1, ..., Xn), where X1, ..., Xn is a random sample from F . For such depth functions the corresponding sample versions D(x; Fˆn) turn out to be U-statistics or 38 V-statistics. Majority depth (Singh, 1991[212]) belongs to this category pro- vided half space symmetric distributions. Type B depth functions: Let h(x; x1, ..., xn) be an unbounded non-negative func- tion which measures in some sense that the distance of x from the points x1, ..., xn. A corresponding type B depth function is then defined as −1 D(x, F )=(1+ Eh(x; X1, ..., Xn)) , where X1, ..., Xn are random sample from F . 
This type B function may not satisfy the affine invariant property but it matches well with the other prop- erties of a statistical depth function. A type B depth function defined on any circular symmetric distributions satisfies the property of center maximality. Type C depth functions: Let O(x; F ) be a measure of the outlyingness on the point x in Rd with respect to the center or the deepest point of the distri- bution F . Usually O(x; F ) is unbounded, but a corresponding type C depth function is defined to be bounded so as D(x; F )=(1+ O(x; F ))−1. Type C functions pertain properties of statistical depth functions and the projection depth and Mahalanobis depth are the examples of type C func- tions. Type D depth functions: Let C be a class of closed subsets of Rd and P a probability measure on Rd. A corresponding type D depth function is defined as D(x; P, C) = inf{P (C)|x ∈ C ∈ C}. C The depth of a point x to a probability measure P on Rd is defined to be the minimum probability mass carried by a set C in C that contains x. The half space depth and the convex hull peeling depth are known to be type D functions. Zuo and Serfling (2000a) favored the half space depth function because it is a statistical depth function and its breakdown point is relatively higher than the 39 breakdown points of other depth functions (≤ 1/3). 2.4.6 Depth of a Single Point Initial depth calculations of given sample by means of a depth function provide a set of nested convex polygons formed by equi-depth points composing depth contours (equi-level sets). The depth contours subdivide the space into depth regions so that the region between contours k and k + 1 has depth k. Therefore, the depth of a new point is simply the depth region in which it lies. This depth allocation process led studies on the hypothesis testing of the similarity between two distributions (Liu and Singh, 1993 [128] and Li and Liu, 2004[123]) and classification (Ghosh and Chaudhuri, 2005[72]). 2.4.7 Comments Ideas presented in this section are the foundation of developing nonparametric statistics and statistical inference methods for multivariate data. As an initial step toward developing nonparametric statistics for multivariate data, based on notions and definitions provided in this section, we will discuss some multivariate estimation problems such as multivariate median, skewness, and kurtosis in the next section. 2.5 Descriptive Statistics of Multvariate Distri- butions This section presents nonparametric estimators of multivariate distributions based on the data depth discussed in the previous section. In fact, small number of works related to the nonparametric estimators of multivariate distributions are found (e.g. Liu et.al., 1999 and Serfling, 2004[196]). Methods of estimating multivariate median, location, scatter, skewness, and kurtosis are presented. If multivariate estimators are not found, univariate counterparts are introduced. 40 2.5.1 Multivarite Median The median of univariate data pertains the universal definition that one selects a point in the middle after sorting data. It is denoted as X n+1 or .5(X n + X n+1 ) 2 2 2 depending on the odd or even number of sample size. On the contrary, there is no predominant definition of multivariate median. Simple coordinate-wise median and mean are adopted for estimating multivariate median but they are not robust with respect to data distributions. Another notion of multivariate median is the deepest point in the data cloud. 
Yet being the deepest point in the data cloud requires defining directions and met- rics corresponding to ordering data. Aloupis (2005[10]) gave a good summary on multivariate median from the computer scientific point of view. Small (1990[213]) expedited multivariate medians traced back to early 1900s and presented statis- tical perspective of these multivariate medians. The list of popular multivariate medians is: L1 Median: the point which minimizes the sum of Euclidean distances to all points in a given data set (Chaudhuri, 1996[42]). The breakdown point is known to be 1/2. Half Space Median: based on half space depth, this median provides the point of which half space minimizes the maximum number of points. Donoho and Gasko (1992) are credited to this median. They studied its breakdown property. The breakdown point is bounded below between 1/(d+1) and 1/3. Convex Hull Peeling Median: a point(s) left after sequential peeling process. The visual interpretation is very straight forward and intuitive. The number of left points after exhaustive peeling is not enough to comprise a simplex so that the coordinate wise average of these points are considered to be the location of multivariate median. Oja’s Simplex Median: a point that minimizes the sum of the volumes of sim- plices formed with all subsets of d points from the data set in Rd. This median does not have the uniqueness property but has the property of affine equivariance. 41 Simplicial Median: a point in Rd which is contained in the most simplices formed by subsets of any d + 1 data points. This median is known to be affine transformation invariant. Hyperplane Median: a point θ which minimizes the number of hyperplanes that a ray illuminating from θ crosses. Rousseeuw and Hubert (1999[181]) defined this median while introducing the maximum hyperplane depth, which is affine invariant. According to the discussion by Aloupis (2005), studies on computational complex- ity of these multivariate medians are confined to bivariate data. Thus, we intend not to add computational complexity results. However, our naive guess concludes that the simplicial median and the Oja’s simplex median are more complex than other medians as the data dimension increases3 since these medians requires all possible simplex construction requiring O(nd+1) time computational complexity. In addition, the half space median, the hyperplane median, and L1 median ask for an iterative optimizing process, which makes computation less efficient. 2.5.2 Location Estimating median is a specific example of estimating location. Traditionally, location of multivariate data has been estimated with the normal distribution as- sumption. The Mahalanobis distance is the very example of location estimation based on the normality of a given sample but it is well known for the lack of robust- ness. Recently, constructing robust multivariate location and scatter estimators became a very interesting problem. Furthermore, depth induced ordering promises new tools in multivariate data analysis and statistical inferences, just as order and rank statistics did for univariate data. The depth induced ordering is strongly related to generalized quantile based location estimation, which have gained its popularity due to robustness compared to Mahalanobis distance. 
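Among the medians listed above, the L1 median admits a particularly simple iterative estimator, the Weiszfeld algorithm. The following is a minimal sketch (assuming Python with NumPy and a simulated contaminated sample; the safeguard against iterates landing exactly on a data point is simplified).

    import numpy as np

    def l1_median(X, n_iter=200, tol=1e-8):
        # Weiszfeld iteration for the L1 (spatial) median: the point minimizing
        # the sum of Euclidean distances to all observations.
        m = X.mean(axis=0)                      # start from the coordinate-wise mean
        for _ in range(n_iter):
            dist = np.linalg.norm(X - m, axis=1)
            dist = np.maximum(dist, 1e-12)      # crude guard against a zero distance
            w = 1.0 / dist
            m_new = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(m_new - m) < tol:
                break
            m = m_new
        return m

    rng = np.random.default_rng(3)
    X = rng.standard_normal((1000, 2))
    X[:20] += 10.0                              # plant a small cluster of outliers
    print(X.mean(axis=0))                       # coordinate-wise mean is pulled toward the outliers
    print(l1_median(X))                         # L1 median stays near the origin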
(Footnote 3: Numerous algorithms developed in the computational geometry community are geared toward efficiency and lower complexity; however, those algorithms work only under particular conditions. We focus on algorithms that function under a general data configuration, although such algorithms are computationally expensive.)

According to Einmahl and Mason (1992[69]), the empirical quantile function is defined as
$$U_n(t) = \inf\{\lambda(A) : P_n(A) \geq t, \ A \in \mathcal{A}\}$$
for 0 < t < 1. The behavior of $U_n(t)$ was illustrated by showing that $U_n(t)$ is a random element in the space of left continuous functions with right limits (Einmahl and Mason, 1992). The general quantile process may not provide a straightforward instruction for ordering multivariate data, but it suggests a proper recipe for estimating the location of multidimensional data by means of a function resembling the univariate quantile function.

2.5.3 Scatter
Traditionally, measuring the degree of scatter has been treated as the same problem as estimating the second moment, an idea established by Mardia (1970[138]). Not only spread but also other distributional shape measures such as skewness and kurtosis were based on estimating higher moments. However, estimating moments presumes a particular distribution family, such as the multivariate normal distribution. Once one diverts from Mardia's approach, it becomes clear that not many ways are available to measure the scatter of a multidimensional data distribution.

When two univariate distributions are symmetric, the relative degree of scatter of one distribution to the other can be assigned. According to Bickel and Lehmann (1976[30]), a symmetric univariate distribution P is considered to be more dispersed about µ than another symmetric distribution Q about ν if
$$P([\mu - a, \mu + a]) \leq Q([\nu - a, \nu + a]) \quad \text{for all } a > 0.$$
This definition can be extended to the multivariate case as in Eaton (1982[61]) by stating that P is more concentrated about µ than Q about ν if $P(C + \mu) \geq Q(C + \nu)$ for all convex symmetric sets $C \subset \mathbb{R}^d$. Oja (1983[158]) provided a generalized version of relative scatter between two distributions.

Definition 2.4. Suppose $\phi : \mathbb{R}^d \to \mathbb{R}^d$ is a function such that
$$Vol(\phi(x_1), ..., \phi(x_{d+1})) \geq Vol(x_1, ..., x_{d+1})$$
for all $x_1, ..., x_{d+1} \in \mathbb{R}^d$. Here, $Vol(x_1, ..., x_{d+1})$ indicates the volume of the simplex comprised of $x_1, ..., x_{d+1}$. Then, if P is the distribution of a random vector X and Q is the distribution of $\phi(X)$, we say that Q is more scattered than P, in short $P \prec_S Q$.

This volume based definition of scatter will be adopted to measure the relative scatter of distributions by means of the convex hull peeling process. In addition to the spread ordering, Oja (1983) stipulated an enhanced concept of a scatter measure.

Definition 2.5. The function $\Psi : \mathcal{P} \to \mathbb{R}^+$ is a measure of scatter if
(1) $P \prec_S Q$ implies $\Psi(P) \leq \Psi(Q)$ for all $P, Q \in \mathcal{P}$;
(2) for any affine transformation $L : \mathbb{R}^d \to \mathbb{R}^d$ of the form $L(x) = Ax + \mu$ and $P \in \mathcal{P}$ such that $PL^{-1} \in \mathcal{P}$, $\Psi(PL^{-1}) = \mathrm{abs}(|A|)\Psi(P)$. Equivalently, $\Psi(AX + \mu) = \mathrm{abs}(|A|)\Psi(X)$.
Note that $\Psi(AX) = 0$ for a singular A. A convex hull can be interpreted as a scatter set, and the volume of the convex hull, a generalized range, is a measure of scatter. Therefore, the notion of scatter by Oja (1983) will be very useful when we define the degree of scatter of affine invariant multidimensional distributions by means of the convex hull process.
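As a small numerical check of property (2) in Definition 2.5, with the convex hull volume playing the role of Ψ, the sketch below (an illustration only, assuming Python with NumPy and SciPy's Qhull-based ConvexHull) applies the affine map x → 2x to a simulated sample in R^3 and confirms that the hull volume is multiplied by |det A| = 2^3.

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(4)
    P = rng.standard_normal((2000, 3))      # a sample in R^3
    Q = 2.0 * P                             # image under L(x) = Ax with A = 2I, |det A| = 8

    vol_P = ConvexHull(P).volume
    vol_Q = ConvexHull(Q).volume
    print(vol_P, vol_Q, vol_Q / vol_P)      # the ratio equals 2**3 = 8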
Additionally, a volume functional, a mapping from a generalized multivariate quantile function, allows one to allocate the orders of scatter among distributions. At a given data depth r, the volume $v_F(r)$ of a more scattered distribution will be larger than the volume $v_G(r)$ of a less scattered distribution, i.e. $v_F(r) > v_G(r)$ (Liu et.al., 1999). Consequently, a suitable generalized quantile function and its mapping to a volume functional will be the key to measuring the degree of scatter of multivariate distributions.

2.5.4 Skewness
Skewness is considered a measure of asymmetry, an innate characteristic of a distribution. The skewness of a univariate distribution accommodates two directions from the median: a positive span and a negative span from the center of the data distribution. When it comes to multivariate data, there are more than two directions that radiate from the data center. These numerous directional rays from the data center represent the skewness of the multivariate data distribution by their absolute values, since a symmetric distribution will share the same values regardless of direction; however, there exist only a few proper functionals based on multivariate quantiles that project these numerous rays into a one dimensional curve quantifying the degree of skewness. The classical definition of skewness was stipulated by Mardia (1970[138]). Let X and Y be iid random vectors from some multivariate distribution function F on R^d with mean µ and covariance matrix Σ. Then skewness is defined as
$$\beta_1 = E[\{(X - \mu)' \Sigma^{-1} (Y - \mu)\}^3].$$
The sample version of $\beta_1$ can be obtained under the assumptions that Σ is non-singular and the third moment exists. These assumptions are strong enough to have drawn criticism.

Ferreira and Steel (2006[71]) summarized a few other generalized skewness measures beyond the joint moments of the random variable (Mardia, 1970) for unimodal distributions. One is based on projections of the random variable onto a line by Malkovich and Afifi (1973[135]), and the other is based on volumes of simplices by Oja (1983). With the unique mode of a unimodal distribution considered to be the center of the multivariate distribution, the projection based skewness is regarded as a measure of skewness onto principal axes, and the volume based skewness is called directional skewness.

Unlike multivariate skewness, univariate skewness is well defined. Avérous and Meste (1997[14]) proposed a skewness ordering ≺. A given skewness ordering ≺ is defined by comparing F with F̄ (where F̄(x) = 1 − F(x)): F is said to be skew to the right if and only if F̄(x) ≺ F(x). By the same token, F is called less skew to the right than G (F ≺ G) if and only if $G^{-1} \circ F$ is convex. According to Oja (1983[158]), measures of skewness and kurtosis are usually selected by intuition, and it is required that if Φ is a measure of skewness or kurtosis then $\Phi(PL^{-1}) = \Phi(P)$, where L is an affine transformation $L(X) = AX + b$ with A non-singular.

Studies focusing particularly on multivariate skewness are sparse, especially ones that define skewness based on a generalized quantile process, which allows a one dimensional mapping to characterize a multidimensional data distribution. Nonetheless, we can conclude that any changes along the rays going away from the data center can be treated as some measure of skewness in multivariate data.
This idea of searching for directional changes will lead us to define skewness based on the convex hull peeling process, without calculating any associated moments of a multivariate sample.

2.5.5 Kurtosis
After the exploration of nonparametric descriptive measures of a distribution such as location, scatter, and skewness, the natural next step is to characterize kurtosis. The univariate kurtosis, or the standardized fourth central moment, has been construed to represent either heavy peakedness or heavy tails of a distribution. The notion of univariate kurtosis can be easily extended to multivariate kurtosis. Regardless of the number of dimensions, kurtosis makes it possible to distinguish a heavy tailed distribution from a heavily peaked one along the rays from the center of the data. The classical definition of kurtosis is
$$\kappa = E\{[(X - \mu)' \Sigma^{-1} (X - \mu)]^2\}$$
for a distribution in R^d with mean µ and covariance matrix Σ (Mardia, 1970[138]), so that kurtosis estimates the fourth moment of X about µ. The kurtosis κ measures the dispersion of the squared Mahalanobis distance of X from µ.

For an explicit discussion of kurtosis, the proper nomenclature is necessary. One group of terms is shoulder, peakedness, and tailweight; the other group is mesokurtic, platykurtic, and leptokurtic. First of all, a shoulder of the distribution indicates the region about halfway from the data center to the data surface, so that higher kurtosis arises when probability mass is diminished near the shoulders and raised either near µ (greater peakedness) or in the tails (greater tailweight) or both. Here, peakedness indicates how fat a distribution can be around µ, and tailweight indicates how thick a distribution can be beyond the shoulder and away from µ. The other nomenclature group is strongly related to estimates of Mardia's κ. According to Balanda and MacGillivray (1998[17]), platykurtic curves are squat with short tails (κ < 3), leptokurtic curves are high with long tails (κ > 3), and mesokurtic distributions (κ = 3) are comparable to the normal distribution.

Alternatives to the moment based kurtosis κ are quantile-based kurtosis measures, which are robust, more widely defined, and better at discriminating distributional shapes (Wang and Serfling, 2005[228]). Quantile based kurtosis measures provide shape information about the data distribution; on the other hand, classical moment based kurtosis quantifies the dispersion of probability mass.

Depth measures make it possible to connect quantile estimators to volume estimators, and quantile based volume estimators have been developed to measure the kurtosis of a distribution. Let any family of central regions be $\mathcal{C} = \{C_F(r) : 0 < r \leq 1\}$; $C_F(r)$ can be a continuous function of r. A convex hull peeling central region will be discussed in Chapter 4. Note that the central regions are nested about $M_F$, the center based on any notion of multivariate median, and are reduced to $M_F$ as r → 0. A corresponding volume functional is defined as $v_{F,\mathcal{C}}(r) = Vol(C_F(r))$, so that it is a decreasing function of the variable r. The boundary of the central region $C_F(\tfrac{1}{2})$ represents the shoulder of the distribution and separates the designated central part from the corresponding tail part of the distribution. One of the volume based spatial kurtosis functionals (Serfling, 2004) is
$$\kappa_{F,\mathcal{C}}(r) = \frac{v_{F,\mathcal{C}}(\frac{1}{2} - \frac{r}{2}) + v_{F,\mathcal{C}}(\frac{1}{2} + \frac{r}{2}) - 2\, v_{F,\mathcal{C}}(\frac{1}{2})}{v_{F,\mathcal{C}}(\frac{1}{2} - \frac{r}{2}) - v_{F,\mathcal{C}}(\frac{1}{2} + \frac{r}{2})}.$$
This functional is invariant under shift, orthogonal, and homogeneous scale transformations (Serfling, 2004[196]).
Within certain typical classes of distributions, we can attach intuitive interpretations to the values of $\kappa_{F,\mathcal{C}}(\cdot)$. For a unimodal distribution F and any fixed r, a value of $\kappa_{F,\mathcal{C}}(r)$ near +1 suggests peakedness, a value near −1 suggests a bowl shaped distribution, and a value near 0 suggests uniformity.

According to Bickel and Lehmann (1975[29]), measures of kurtosis, peakedness, and tailweight should be ratios of scale measures, in the sense that both the numerator and the denominator should preserve their spread ordering. This ratio of scale measures is called the measure of weight, whose definition is as follows:
$$t_F(r, s) = \frac{v_F(r)}{v_F(s)},$$
where r and s are depths ($0 < r, s \leq 1$).

Another approach to multivariate kurtosis is found in the studies of influence functionals by Wang and Serfling (2006[227]). Given a multivariate distribution F, a corresponding depth function orders points according to their centrality in the distribution F. The depth functions then generate one dimensional curves for a convenient and practical description of specific features of a multivariate distribution, such as kurtosis, and Wang and Serfling explored the robustness of the sample versions of these curves by means of influence functionals. Structural representation by influence functionals is consistent with the curves obtained from generalized quantile functions or depth functions.

The aforementioned works successfully identified general shape properties of multivariate distributions, and one of the byproducts is a kurtosis measure. Although there is no consensus on the definition of kurtosis, we prefer a kurtosis definition that is free of estimating moments. Even without a designated formula for a kurtosis measure of a distribution, the peakedness or tailweight of a probability distribution can be measured to some extent based on data depths and hint at the kurtosis property of the distribution. A suitable quantification of these measures of peakedness and tailweight shall induce the definition of kurtosis.

2.5.6 Comments
Classical descriptive statistics are defined by means of calculating moments. The drawback of this moment based process is that some probability distributions do not have the expected moments to be compared with empirical moments. Due to the lack of studies in nonparametric multivariate statistics, we are aware of deficiencies in descriptive measures of multivariate distributions. Furthermore, for some nonparametric descriptive measures based on particular depths, only empirical versions of the measures are available because the distribution of the descriptive functionals expected from the population is unknown. Many possibilities lie buried in the field of nonparametric descriptive measures of multivariate data distributions.

2.6 Outliers
As of the writing of this thesis, Google Scholar gave 5,760 articles about outlier detection. Detecting outliers is an important technique not only in statistics and data mining, but also in various fields such as natural science, engineering, business, and industry. However, most outlier detection methods rely on small sample statistical analysis tools developed from estimating the Mahalanobis distance (Mahalanobis, 1936[133]) under the multivariate normal distribution assumption. Comprehensive studies on classical outliers are collected in the books by Barnett and Lewis (1994[21]), Rousseeuw and Leroy (1987[188]), and Hawkins (1980[89]).
Numerous works re- lated to classical outlier detection also performed by Atkinson (1994[12]), Bar- nett (1978[23]), Hadi (1992[81]), Knorr (2000[113]), Kosinski (1999[117]), Penny (1995[161]), Rocke and Woodruff (1996[178]), Wilks (1963[232]), and so on. In addition, Werner (2003[230]) gave an interesting point of view to detect outliers with influence functionals in his doctoral thesis. A recent and extensive survey on outlier detection methodologies is found in Hodge and Austin (2004[94]). In our survey, we focus on non parametric outlier detection methods instead of parametric approaches such as hypothesis testings and distance calculations based on particular parametric models, which take normally distributed data for granted and initiate the detection process with estimating covariance matrices. Here, we describe multivariate outliers and their detection methods without considering the 2nd moment. 2.6.1 What is Outliers? First of all, outliers have to be defined so as their detection methods to be de- veloped. For univariate data, outliers are relatively simple to be detected. If the distribution of univariate data is known, the hypothesis testing leads to decide outliers with probability. When the distribution is unknown, we sort the data and declare data of the lowest or the highest ranks as outliers. On the other hand, multivariate outliers have very vague definitions so that they cannot be treated as an extension of univariate outliers. Huber’s (1972[96]) stated that outliers are defined as unlikely to belong to 4http://scholar.google.com 50 main population and order statistics yields easy criterion to detect outliers. The unresolved ambiguity is that ordering multivariate data is not clearly defined to declare outliers. In their quite extensive outlier studies, Barnett and Leroy (1994[21]) adopted the notion that outliers have been informally defined as observations which appear to be inconsistent with the remainder of a data set. An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Barnett and Leroy, 1994). In other words, outliers are a subset of observations which appears to be inconsis- tent with the remainder of the data set whether or not those outliers are genuine members of the main population. A similar definition appears in Hawkins (1980 [89]) as follows; an outlier is an observation which deviates so much from other ob- servations as to arouse suspicions that is was generated by a different mechanism. This statement implies that contaminants arise from some other distribution. Al- ternatively, Beckman and Cook (1983[27]), in their wide ranging review paper on outliers, described an outlier as any observation that appears surprising or dis- crepant to the investigator. They suggested the following notions to clear some ambiguity about outliers: • Discordant observation: any observation that appears surprising or discrepant to the investigator • Contaminant: any observation that is not a realization from the target dis- tribution • Outlier: a collective to refer to either a contaminant or a discordant obser- vation. Beckman and Cook (1983) also gave an extensive history about outliers but put a major emphasis on outliers in linear models. 
51 For the moment, we accept the fact that outliers are deviated from the majority of data points, far away from the center of data in any direction, and residents above the surface of data cloud without considering various statistical and computer scientific definitions of outliers. Techniques that determine the position of a point relative to the others (Rohlf, 1975[179]) or some measures that quantify the non euclidean distance from the center seem to be useful to indicate the degree of outlyingness of outliers with respect to the majority of data. Popular approaches to detecting outliers assume multivariate normal distri- bution and calculate Mahalobis distances with appropriate covariance matrices. Mahalobis distances allows to find the location of data center and to order multi- variate data by estimating sample mean and sample covariance. Without assuming multivariate normal, we still need some metric that quantifies ordering data but there is no consensus on this subject. Multivariate data can be ordered from left to right, from bottom to top, from front to back based on coordinates of data. They can be ordered outward from the center. On the contrary, the center of mul- tivariate data is not universally defined compared to the center of univariate data. An extensive survey on ordering multivariate data is found in Barnett (1976). In Rocke and Woodruff(1996), insights regarding outlier detection are given to be understood why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of data. Rocke and Woodruff (1996[178]) suggested a very useful framework to identify multivariate normal outliers when the main distribution is N(O, Id) where O and Id indicate a zero vector of length d and a d × d identity matrix, respectively. Shift outliers: the outliers have distribution N(µ, Id) where µ is a length d vector. Point outliers: the outliers form a densely concentrated cluster. Symmetric linear outliers: given a unit vector u, the outliers have distribution uN(O, Σ) for an arbitrary Σ. Asymmetric linear outliers: given a unit vector u and its coordinate-wise ab- solute value |u|, the outliers have distribution |u|N(0, Σ) Radial Outliers: outliers have distribution N(0, kId) with k > 1. 52 This categorization of outliers based on different types of normal distributions aids to separate outliers from the main N(O, Id) distribution; however, if there is no transformation of main data to N(O, Id), segregating outliers based on one of the catagories will lead to classifying inconsistent outliers. Overall, outliers in multi dimensional space is ambiguous compared to univariate outliers. The only fact that we are sure about outliers is that outliers are deviated from the primary data distribution regardless of dimensions. 2.6.2 Where Outliers Occur? Barnett (1978[23]) summarized the causes of outliers: (1) measurement errors, (2) execution faults, and (3) intrinsic variability. Because of these various causes, outliers are treated in different ways: If a model is more plausible without outliers, outliers are detected to be eliminated; otherwise, a model is modified to cope with the existing outliers. Assertively, the residing space of outliers are most likely distinctive to the main distribution. This unique spatial occupation of outliers is clearly displayed by simple graphical methods when the data is univariate. In multivariate case, graphical visualization to locate outliers is quite complicated and sometimes impossible. 
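To make this taxonomy concrete, one can simulate a main N(O, I_d) sample contaminated by outliers of a chosen type; the sketch below (an illustration of the categories only, assuming Python with NumPy) generates shift, radial, and point outliers.

    import numpy as np

    rng = np.random.default_rng(5)
    d, n_main, n_out = 3, 950, 50

    main = rng.standard_normal((n_main, d))                 # main N(O, I_d) population
    shift = rng.standard_normal((n_out, d)) + 5.0           # shift outliers: N(mu, I_d)
    radial = 3.0 * rng.standard_normal((n_out, d))          # radial outliers: N(O, k*I_d) with k = 9
    point = 0.05 * rng.standard_normal((n_out, d)) + 8.0    # point outliers: a dense, concentrated cluster

    contaminated = np.vstack([main, shift])                 # one contaminated sample for experiments
    print(contaminated.shape)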
From a knowledge discovery standpoint, rare events are more interesting than common ones. An interesting list has been made (Hodge and Austin, 2004[94]) of occasions on which outlier detection is applied. Some of those occasions are (1) fraudulent applications for credit cards and mobile phones, (2) loan application processing, (3) intrusion detection, (4) activity monitoring, (5) network performance, (6) fault diagnosis, (7) unexpected entries in a database, and so on. These occasions clearly involve observations that deviate from the main population. Therefore, the key to outlier detection is to develop methods of measuring the degree of deviation among points.

2.6.3 Outliers and Normal Distributions
Although there is no clear statement of a normal distribution assumption, most outlier detection techniques rely on this assumption and require estimates of the location and scale parameters of the distribution (Pena and Prieto, 2001[160]). Multivariate outlier detection methods have been developed toward acquiring robust estimates of the mean and covariance matrix, for which the Mahalanobis distance
$$D(x) = (x - \bar{X})^T \hat{\Sigma}^{-1} (x - \bar{X})$$
becomes indispensable. Points with large Mahalanobis distances, based on robust estimates of the population scatter and location, are multivariate outliers (Kosinski, 1999[117]). Kosinski (1999) proposed a new multivariate outlier detection method that performs well under high contamination conditions. Even though the Mahalanobis distance is affine invariant, it is well known that identification of multivariate outliers based on the Mahalanobis distance is not robust (Rocke and Woodruff, 1996[178]). Hardin and Rocke (2004[85]) provided a survey of studies on robust outlier detection methods by means of the Mahalanobis distance, including Atkinson (1994[12]), Barnett and Lewis (1994[21]), Gnanadesikan and Kettenring (1972[74]), Hadi (1992[81]), Hawkins (1980[89]), Maronna and Yohai (1995[139]), Penny (1995[161]), and Rocke and Woodruff (1996[178]). Yet, problematically, the outliers themselves are included in the calculation of these location and scale estimators. In addition, the Mahalanobis distance is computationally expensive for large high dimensional data sets compared to the Euclidean distance, because the computing process has to pass through the entire data set for the covariance matrix and its inverse (Hodge and Austin, 2004[94]), which might be sparse or singular.

Another approach to outlier detection based on a multivariate normal distribution assumption is the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) (Rousseeuw and Leroy, 1994). Since data from a single population tend to form a convex set, distributional covariance structures are represented by ellipsoids, and outliers are separated beyond these ellipsoidal contours. Rocke and Woodruff (1996) specified that all known affine equivariant outlier identification methods consist of the following two stages:
Phase I: Estimate the location and the shape.
Phase II: Calibrate the scale estimate, suggesting which points are far enough from the location estimate to be pronounced outliers.
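A minimal sketch of this two-stage recipe under the normality assumption follows (an illustration only, assuming Python with NumPy and SciPy and a simulated contaminated sample): Phase I estimates location and shape, here non-robustly by the sample mean and covariance, and Phase II flags points whose squared Mahalanobis distance exceeds a chi-square quantile, an approximate calibration for large normal samples.

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_outliers(X, alpha=0.001):
        # Phase I: estimate location and shape (non-robustly, for illustration).
        center = X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        diff = X - center
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances
        # Phase II: calibrate the scale with an approximate chi-square cutoff.
        cutoff = chi2.ppf(1.0 - alpha, df=X.shape[1])
        return np.flatnonzero(d2 > cutoff)

    rng = np.random.default_rng(6)
    X = np.vstack([rng.standard_normal((980, 3)),
                   rng.standard_normal((20, 3)) + 6.0])      # 20 planted shift outliers
    print(mahalanobis_outliers(X))                           # indices of the flagged points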
We end this discussion about outlier detection with the Mahalanobis distance by quoting Barnett (1983[22]) which provides a considerable warning on uncon- sciousness of assuming multivariate normal distributions for data analysis. Beckman and Cook (1983) survey is limited to normal model. Surely I’m not alone in frequently (even predominantly) having to deal with nonnormal data. For univariate data we have a reasonable heritage of methodology. For multivariate data there is much to be done and great demand for appropriate model and method. I cannot agree that outliers in multivariate samples merely require straight forward generalizations of univariate techinques. The very notion of order is ambiguous and ill- defined (Barnett, 1976). Proper study of multivariate outliers involves some thorny issues. 2.6.4 Nonparametric Outlier Detection Simplicity and affine invariant property makes the Mahalanobis distance rule over other methods when it comes to detecting multivariate outliers. As a matter of fact, without imposing normality, there is almost no way to detect outliers in higher dimensional data (only exception is outlier detection in bivariate samples). Shekhar (2001[206]) delivered the idea that methods for detecting outliers in multi-dimensional Euclidean space have a few limitations: First, multi-dimensional approaches assume that the data items are embedded in an isometric space and do not capture the spatial graphic structure. Secondly, they do not explore a pri- ori information about statistical data distribution. Last, they seldom provide a confidence measure of the discovered outliers. The first and the last issues some- what related to the discussion in Hodge and Austin (2004) about the curse of dimensionality. Lu et.al. (2003[132]) also made a point that for detecting outliers with multiple attributes, traditional outlier detection approaches could not be used properly due to the sparsity of observations in high dimensional space. They urged for redesigning distance functions to define the precise degree of proximity between 55 points. At the time of writing this thesis, the author is not sure if there is any attempts to draw multivariate box plot for outlier detections as Tukey did for univariate data without problems caused by the curse of dimensionality. However, some nonparametric box plot type attempts exist in bivariate case, which are presented below. • Sunburst plot: By means of any data depth, a contour enclosing 50% points (a boundary of 50% central region) is obtained. Denote this coutour as a .5th level set. Then, lines are drawn from the boundary of the .5th level set to any points outside the .5th level set. Liu et.al. (1999) proposed Sunburst plot for a bivariate outlier detection problem. Nonetheless, this method suffers from a serious unidentifiability problem with massive data sets where those rays become almost indistinguishable. • Bag plot: Miller et.al. (2003[147]) suggested the bag plot. The steps of constructing a bag are: First, identify these two contours, Dk and Dk−1, which are consecutive contours containing respectively nk and nk−1 data points (nk ≤ bn/2c < nk−1). Second, calculate the value of a parameter λ, which determines the relative distance from the bag to each of the two contours. λ = (50 − J)/(L − J) where Dk contains J% of the original points and Dk−1 contains L% of the original points. Last, construct the bag by interpolating between the two contours by using the λ value. 
For each vertex $v$ on $D_k$ or $D_{k-1}$, let $l$ be the line through $v$ and the Tukey median $T^*$. Let $u$ be the intersection of $l$ with the other contour ($D_{k-1}$ or $D_k$). Each vertex $w$ of the bag lies on $l$, interpolated between $v$ and $u$, i.e. $w = \lambda v + (1-\lambda)u$. The value of $\lambda$ weights the position of the bag towards $D_k$ or $D_{k-1}$. (A small sketch of this interpolation step appears at the end of this subsection.)

• Pairwise bivariate box plot with B-spline smoothing: Zani et al. (1998[236]) suggested a simple way of constructing a bivariate box plot based on convex hull peeling and B-spline smoothing. Their approach defines a natural inner region which is completely nonparametric and retains the correlation in the observations. This inner region adapts its spread to the data distribution and plays the role of the box in Tukey's box-and-whisker plot. The counterpart of the whisker is a linear expansion of this inner region, based on the distance from the center to the inner region. To smooth the angular outer region produced by expanding the angular inner region, a B-spline is adopted. Zani et al. (1998) explored their box plot with pairwise bivariate samples. Instead of pairwise bivariate box plots, a single multivariate box plot seems possible, which they did not explore. We assert that the reason for exploring pairwise bivariate box plots instead of a single multivariate box plot is that the visualization of multivariate data is not feasible. Furthermore, if the sample is large, the B-spline seems unnecessary because the contour of the outer region will already be smooth enough.

• Generalized gap test: This test was proposed by Rohlf (1975[179]). Unlike the methods introduced above, this method does not look for contours to set a boundary. Still, outliers can be characterized by the fact that they are somewhat isolated from the main cloud of points. The minimum spanning tree determines edge lengths among points, and the presence of outliers that stick out from the main cloud of data shows up as a gap in the distribution of these lengths.

Note that the interpolation parameter $\lambda$ of the bag plot (Miller et al., 2003) goes to zero as the sample size increases. Without the interpolating process, the bag plot of Miller et al. (2003) is similar to the bivariate bag plot invented by Rousseeuw et al. (1999a[183]). Based on contours of the halfspace depth, this less sophisticated bivariate box plot (Rousseeuw et al., 1999a) requires a .5th level set and extends the segments from the median to the vertices of this .5th level set (hull) by a factor of three. Connecting the end points of these extensions creates another hull, which provides a boundary for declaring extreme outliers, just as univariate points located beyond 3 times the interquartile range are labeled extreme outliers. Similarly, using a hull extended by 1.5 times the distance from the median to the vertices of the .5th level set, in place of the 3 times extension, identifies mild outliers, as in Tukey's univariate box plot.

We did not include the bivariate box plot study by Goldberg and Iglewicz (1992[75]) since that box plot construction involves estimating shape parameters. Another genuinely nonparametric approach to outlier detection is the bootlier plot by Singh and Xie (2003[211]), but difficulties arise in applications to large samples with multiple attributes.

Even though the above methods have only been applied to bivariate samples, they are very likely applicable to any multivariate sample.
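The vertex interpolation of the bag plot, referenced above, is simple enough to sketch. The following Python fragment is a minimal, hypothetical illustration of the step $w = \lambda v + (1-\lambda)u$; it assumes the two depth contours and the depth median have already been computed by some other routine (the hard part in practice), and it approximates $u$ by the outer-contour vertex in the closest direction rather than by an exact ray intersection.

    import numpy as np

    def interpolate_bag(contour_inner, contour_outer, median, lam):
        """Sketch of the bag-plot interpolation w = lam*v + (1-lam)*u.

        contour_inner, contour_outer : (m, 2) arrays of contour vertices (assumed given)
        median                       : (2,) depth median T*
        lam                          : weight, e.g. (50 - J) / (L - J)
        """
        bag = []
        for v in contour_inner:
            dv = (v - median) / np.linalg.norm(v - median)
            du = contour_outer - median
            du = du / np.linalg.norm(du, axis=1, keepdims=True)
            u = contour_outer[np.argmax(du @ dv)]   # vertex whose direction best matches v
            bag.append(lam * v + (1.0 - lam) * u)
        return np.asarray(bag)

    # Toy usage: two concentric squares with the origin playing the role of the median.
    inner = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
    outer = 2.0 * inner
    print(interpolate_bag(inner, outer, np.zeros(2), lam=0.5))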
It is not surprising that most of these methods adopt data depths to obtain the hull containing 50% of the data. Hulls obtained from data depth appear to be less affected by outliers and to preserve the structural shape of the original distribution when the sample is generated from a smooth convex body distribution. Without a comprehensive discussion of multivariate outlier detection based on data depth, it would be perilous to conclude that data depth is the best way to detect outliers. Nevertheless, a few studies on data depth suggest that depth based outlier detection methods are robust. He and Wang (1997[90]) pointed out that depth contours often provide a good geometrical understanding of the structure of a multivariate data set and are not influenced by outliers. They conjectured that data depths can track the elliptic contours if the data were generated from an elliptical distribution. They added, however, that the technicalities needed to support the conjecture that depth contours delineate the original structure of the distribution are not trivial.

2.6.5 Convex Hull vs. Outliers

Although no study has applied convex hull peeling directly to outlier detection, several studies entail the idea that the convex hull process provides robust estimators by identifying possible outliers. Barnett suggested, in his celebrated 1976 paper, that the convex hull peeling process allows ordering multivariate data and detecting outliers, an idea he credited to Tukey (1974[221]). Hodge and Austin (2004[94]) commented that convex hull peeling removes the records beyond the boundaries of the convex hull of the distribution; thus the peeling process abrades the outliers, and it can be repeated on the remaining points until a pre-specified number of records has been removed. However, Hodge and Austin (2004) were skeptical of this peeling process, which strips away points quickly, since at least d+1 points are removed in every iteration.

Not a convex hull peeling process, but similar in the sense of contouring the shape of the distribution, the minimum volume ellipsoid has also been adopted for eliminating outliers (Titterington, 1978[220] and Silverman and Titterington, 1980[209]). Nolan (1992[157]) studied the asymptotic properties of the multidimensional α-trimmed region based on ellipsoids and of various functionals of the convex set for multivariate location estimators. He also pointed out that the peeling process may remove too many non-outlying observations.

A successful application of the convex hull process to outlier detection, without reference to data depth, can be found in Bebbington (1978[26]), who used a convex hull to trim outliers for a robust estimator of correlation, called the convex hull trimmed correlation coefficient. He trimmed the points defining the vertices of the convex hull of the given sample without disturbing the general shape of the bivariate distribution, and he showed empirically that the correlation coefficient is not affected. Despite its intuitive appeal and this success in applications, the analytic complexity of the convex hull peeling process and its asymptotics has hindered more advanced studies.

2.6.6 Comments: Outliers in Massive Data

We live in the era of massive multivariate data, in contrast to classical statistics, which relies mainly on univariate, small samples. Most outlier detection studies were based on a univariate, small sample environment.
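Looking back at Bebbington's convex hull trimming from Section 2.6.5, the following is a minimal Python sketch of the idea: points on the outermost convex hull layer(s) are removed and the correlation is recomputed. It uses scipy's ConvexHull and is an illustration under these stated assumptions, not Bebbington's original procedure.

    import numpy as np
    from scipy.spatial import ConvexHull

    def hull_trimmed_corr(x, y, n_peels=1):
        """Correlation after peeling off the outermost convex hull layer(s).

        x, y    : 1-d arrays of equal length (a bivariate sample)
        n_peels : how many hull layers to strip before computing the correlation
        """
        pts = np.column_stack([x, y])
        for _ in range(n_peels):
            if len(pts) < 4:            # stop when too few points remain to peel
                break
            hull = ConvexHull(pts)
            keep = np.setdiff1d(np.arange(len(pts)), hull.vertices)
            pts = pts[keep]
        return np.corrcoef(pts[:, 0], pts[:, 1])[0, 1]

    rng = np.random.default_rng(1)
    x = rng.normal(size=300)
    y = 0.8 * x + 0.6 * rng.normal(size=300)
    x = np.append(x, [10, -9]); y = np.append(y, [-10, 9])   # two gross outliers
    print(np.corrcoef(x, y)[0, 1], hull_trimmed_corr(x, y, n_peels=2))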
A small number of methodologies for detecting outliers in multivariate data have been developed, but most are confined to the two dimensional world. That only bivariate applications are found might be attributed to the limitations of graphical display. Strangely, coping with outliers in massive multivariate data still seems remote. Outliers in massive multivariate data are no longer a handful of observations, and testing outlyingness one observation at a time is an unrealistic task. In general, such massive data come with no a priori information, so imposing a multivariate normal distribution may mislead results. Graphical visualization is impractical with more than two dimensional data, since reducing dimensions disguises important pieces of information.5 No simple and suitable nonparametric method exists for detecting outliers in multivariate massive data.

5 This statement is not right. The author later heard about parallel coordinates by Inselberg and Dimsdale (1990[105]). Parallel coordinates do not require dimension reduction to display data with multiple attributes. The author was not able to find studies on statistical outlier detection with parallel coordinates.

While massive data are being collected, we need a way to flag peculiar or unnecessary observations so as to manage data streaming efficiently. This flagging process strongly depends on how outliers are defined, which varies with the scientific interest. Despite the apparent complexity of the problem, we can still characterize outliers by the fact that they are somewhat isolated from the main clouds of points, regardless of the size and the dimension of the data. Potential outliers are points which are not internal to the main cloud of data, but lie somewhere above the surface of the cloud. Simple techniques that determine the relative order of observations with respect to other observations are therefore needed.

2.7 Streaming Data

The main challenge in data analysis, recently, is that data volumes are much larger than the memory of a typical computer (Bagchi et al., 2004[15]). In addition, since data are archived over long time periods, preliminary analysis tools for redesigning data storage in the middle of the archiving process are in demand. Whether the issue is memory overflow or interruption of the data flow, such needs can be accommodated suitably and adaptively by streaming data analysis. Streaming data analysis faces particular difficulties when the goal of the analysis is ordering data. Ordering requires sorting the whole sample; however, physical limits - lack of memory and waiting time - prohibit proper ordering. This section summarizes studies of sequential estimation methods that appeared in statistics for ordering univariate data. We also present algorithmic methods from an engineering perspective, which are more empirical than the statistical approaches. These empirical methods will eventually be utilized in developing algorithms for multivariate data analysis later.

2.7.1 Univariate Sequential Quantile Estimation

Computing quantiles requires ordering data, but it is not necessary to use all points at once. A space efficient recursive procedure for quantile estimation was proposed by Tierney (1983[219]). The concern was estimating a quantile of an unknown distribution with accuracy similar to that of the whole-sample quantile while using a fixed amount of storage. The discrepancy between the quantile estimate from parts of the sample and the sample quantile vanishes asymptotically.
This method has been dubbed simple stochastic approximation (Hurley and Modarres, 1995). Chen et al. (2000[50]) developed the exponentially weighted stochastic approximation (EWSA) estimator as an extension of the simple stochastic approximation. Hurley and Modarres (1995[103]) surveyed methods of low storage quantile estimation beyond stochastic approximation. They introduced histogram quantile estimation, which they compared, by simulation studies, to stochastic approximation as well as to the minimax trees and the remedian methods. They examined the statistical and computational properties of these methods for quantile estimation from a data set of n observations using far fewer memory locations than n.

Another notable stochastic approximation method for sequential quantile estimation is the single pass, low storage, arbitrary quantile estimation for massive data sets proposed by Liechty et al. (2003). This method updates quantiles by interpolation between new inputs and selected stored data (low storage), which is very attractive because the method produces any quantile of interest in a single pass. When these univariate sequential quantile estimation methods are applied, it often suffices to provide empirically reasonable approximations to sample quantiles instead of precise values, a direction commonly found in the engineering literature.

2.7.2 ε-Approximation

In univariate streaming data analysis, the ε-approximation approach is the most frequently adopted method of controlling the size of the data packet under processing when not all data from the stream can be incorporated. Numerous studies exist on the problem of computing quantiles of large streaming online or disk-resident data using a small main memory (e.g. Manku et al., 1998[137]; Lin et al., 2004[125]; and Greenwald and Khanna, 2001[78]).

Manku et al. (1998) proposed desirable properties for ε-approximate algorithms:

1. an algorithm must not require prior knowledge of new inputs,
2. an algorithm guarantees explicit and tunable approximation,
3. an algorithm computes results in a single pass,
4. an algorithm produces multiple quantiles at no extra cost,
5. an algorithm uses as little memory as possible,
6. and an algorithm must be simple to code.

Here we adopt the definition of ε-approximate from Lin et al. (2004).

Definition 2.6 (ε-approximate). A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank is guaranteed to be within the interval [r − εN, r + εN].

Lin et al. (2004[125]) studied the problem of continuously maintaining a quantile summary of the most recently observed N elements over a stream, so that quantile queries can be answered with a guaranteed precision of εN. They developed a space efficient algorithm for pre-defined N that requires only a single pass over the data stream. Their computational performance study indicated that not only is the actual quantile estimation error far below the guaranteed precision, but the space requirement is also much less than the given theoretical bound. Greenwald and Khanna (2001[78]) added that the ε-approximate quantile summary does not require a priori knowledge of the length of the input sequence.

2.7.3 Sliding Windows

Adapting low storage methods for streaming data implies sampling from the data and fitting the filtered data within a fixed low-storage space. This sampling scheme is called a sliding window in the data management literature.
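Before turning to sliding windows, a brief Python sketch may make the flavor of these low-storage quantile schemes concrete. It implements a toy single-pass, stochastic-approximation style quantile tracker in the spirit of Section 2.7.1; it is not the estimator of Tierney (1983) or the EWSA of Chen et al. (2000), and the decaying step-size rule is an assumption chosen only for illustration.

    import numpy as np

    def streaming_quantile(stream, p, c=1.0):
        """Single-pass, O(1)-memory tracker of the p-th quantile.

        Each observation nudges the current estimate up or down; the step size
        decays like c/n, a Robbins-Monro style choice made here for illustration.
        """
        q = None
        for n, x in enumerate(stream, start=1):
            if q is None:
                q = float(x)                  # initialize at the first observation
                continue
            step = c / n
            # Move up if x exceeds the estimate, down otherwise, weighted by p.
            q += step * (p - (x <= q))
        return q

    rng = np.random.default_rng(2)
    data = rng.normal(size=100_000)
    print(streaming_quantile(data, 0.9), np.quantile(data, 0.9))

The same single-pass structure is what the ε-approximate summaries formalize, with the added guarantee of Definition 2.6 on the rank error.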
A plethora of sliding window algorithms for streaming data exists in computer science. Xu et al. (2004[233]) classified sliding window algorithms into two categories:

• The Nf window (natural form window) takes the most recent N points at a time.
• The T window takes the data that arrived in the last T time units.

By this definition, the Nf window may overlap or skip portions of the streaming data. On the other hand, when the data arrival rate is known, the T window removes the concern about overlapping or skipping data. In scientific studies the data archiving rates are controlled, so the T window scheme is more frequently used.

2.7.4 Streaming Data in Astronomy

Astronomy is the oldest science of data analysis. Vast astronomical data archives have led to new discoveries that changed our view of the world. With the new century, the pattern of astronomical data archiving is experiencing rapid changes in both the temporal and spatial dimensions. The rapid growth of the data arrival rate and the capability of recording many variables simultaneously have changed the strategy of analyzing data. These rapid changes are attributable to new scientific and computational technology.

Virtual observatories, which maintain all collected astronomical data catalogs on top of database management systems from computer science, have become popular worldwide (the National Virtual Observatory6 and the European Virtual Observatory7 are the best known). These collected data, ranging from historic data catalogs to new observations from recent sky survey projects,8 which typically produce terabytes of data a day, have mostly become public only recently, but proper statistics to confront these huge data sets are not available. The growth of the data archives is steeply exponential. Detailed descriptions of the properties of massive astronomical data are found in Brunner et al. (2001[36]), Djorgovski et al. (2001[57]), and Longo et al. (2001[131]).

6 http://www.virtualobservatory.org/
7 http://www.euro-vo.org/pub/index.html
8 A list of survey projects is found at http://www.cv.nrao.edu/fits/www/yp survey.html

As a matter of fact, traditional statistics is unable to cope with such huge data and is giving way to computer science when it comes to analyzing massive data. Nonetheless, traditional statistical methods are still practiced by taking parts of the sample from survey data. Even though taking a small segment from huge data catalogs may be justified by prior scientific knowledge, statisticians and astronomers should be aware that new statistics for massive data must replace traditional small sample statistics. Astronomers are urged to change their perspective on analyzing data to cope with data growing rapidly in time and space, since they cannot rely on their eyes for massive data sets with many attributes. They should also recognize that taking small parts from huge data catalogs wastes massive database management technologies. Furthermore, statisticians are asked to open their eyes and develop statistics for massive data to meet the needs of the 21st century, rather than yield to computer scientists who forcefully plug data into machines without careful thought about the properties of the statistics and of the data set itself.

2.7.5 Comments

Determining changes in the distribution of a large, high-dimensional data set is a difficult problem.
Coordinate-wise naive approaches will not detect many types of distributional changes because coordinate-oriented methods lack robustness. Classical statistical tests usually apply to small data sets and now face three problems with large, high-dimensional data. First, it is likely impossible to fit all the data into memory at once. Partial sampling is the common refuge; however well designed, the sampling comes with pitfalls, particularly where outliers are concerned. Second, even if the samples are small in size, dimensionality becomes an obstacle. Existing techniques for nonparametric analysis of multivariate data, such as clustering, multidimensional scaling, principal components analysis, and others, become expensive as the number of dimensions increases. Third, classical multivariate statistics has centered heavily on the assumption of normality, primarily for the convenience of closed form solutions and easy applications. When data sets are large, assumptions of homogeneity (e.g. iid observations) that underlie classical statistical inference might not be valid. Therefore, success over the challenges of multivariate massive data is expected to come from developing new nonparametric multivariate analysis tools that incorporate streaming data.

Streaming data have emerged recently in the fields of databases and data management because of the avalanche of data collection. Disappointingly, streaming data algorithms have been developed only for univariate data, and these algorithms for controlling data streams are very empirical. Gilbert et al. (2005[73]) stated that in applications, quantile estimation algorithms for streaming data suffice if they provide reasonable approximations to quantile estimates and need not produce precise values. Methodologies for controlling the traffic of streaming data depend entirely on the experience of data miners. Additionally, the lack of universal metrics on multivariate data causes the deficiency of algorithms for multivariate streaming data. In Chapter 5, we will propose a few empirical methods for multivariate streaming data quantile estimation, which might serve general purposes in massive data analysis.

Chapter 3

Jackknife Approach to Model Selection

3.1 Introduction

Model selection, in general, is a procedure for exploring candidate models and choosing the one among the candidates that minimizes the distance from the true model. One of the best known distance measures is the Kullback-Leibler (KL) distance (Kullback and Leibler, 1951[119]),
$$D_{KL}[g; f_\theta] := \int g(x)\log g(x)\,dx - \int g(x)\log f_\theta(x)\,dx = E_g[\log g(X)] - E_g[\log f_\theta(X)] \ge 0,$$
where $g$ is the unknown true model, $f_\theta$ is the estimating model with parameter $\theta$, and $X$ is a random variable of observations. The purpose of an information criterion is to find, from a set of candidates $\{f_\theta : \theta \in \Theta^p \subset \mathbb{R}^p\}$, the model that describes the data best. This parametric family of distributions need not contain the true density function $g$, but it is assumed that there exists $\theta_g$ such that $f_{\theta_g}$ is closest to $g$ in Kullback-Leibler distance. A fitted model $f_{\hat\theta}$ is estimated, and one wishes to assess the distance between $f_{\hat\theta}$ and $g$. Since $E_g[\log g(X)]$ is unknown but constant, we only need an estimator of $\theta_g$ that maximizes the likelihood term $E_g[\log f_\theta(X)]$. Consequently, the maximum likelihood principle plays a role in choosing a model that minimizes the KL distance estimate.
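As a concrete illustration of this setup, the following Python sketch estimates $E_g[\log f_\theta(X)]$ by a Monte Carlo average and compares two hypothetical candidate families, each fitted by maximum likelihood, on data from a true model outside both families; the particular distributions are assumptions chosen only for the example.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=3.0, scale=2.0, size=5000)      # "true" model g: a gamma law

    # Candidate family 1: exponential, fitted by maximum likelihood.
    scale_exp = x.mean()                                 # MLE of the exponential scale
    loglik_exp = stats.expon.logpdf(x, scale=scale_exp).mean()

    # Candidate family 2: lognormal, fitted by maximum likelihood on the log scale.
    mu, sigma = np.log(x).mean(), np.log(x).std()
    loglik_lnorm = stats.lognorm.logpdf(x, s=sigma, scale=np.exp(mu)).mean()

    # A larger empirical mean log-likelihood corresponds to a smaller estimated KL
    # distance to g; the unknown constant E_g[log g(X)] cancels in the comparison.
    print(loglik_exp, loglik_lnorm)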
Akaike (1973[7], 1974[8]) first introduced a model selection criterion (AIC) that leads to the model of minimum KL distance. For an unknown density g, AIC chooses the model that maximizes an estimate of $E_g[\log f_\theta(X)]$. Yet the bias incurred by estimating the likelihood and the parameters from the same data cannot be avoided. This bias appears as a penalty term in AIC, proportional to the number of parameters in the model. Despite its simplicity, a few drawbacks have been reported: a tendency to pick overfitted models (Hurvich and Tsai, 1989[104]), a lack of consistency in choosing the correct model (McQuarrie et al., 1997[146]), and applicability limited to models from the same parametric family of distributions (Konishi and Kitagawa, 1996[116]).

As remedies for these shortcomings of AIC, several information criteria with improved bias estimation have been proposed. To adjust the inconsistency of AIC, especially for small sample sizes, Hurvich and Tsai (1989[104]) and McQuarrie et al. (1997[146]) introduced the corrected Akaike information criterion (AICc) and the unbiased Akaike information criterion (AICu), respectively. The penalty terms in these criteria estimate the bias more consistently, so that AICc and AICu select the correct model more often than AIC; on the other hand, AICc and AICu apply only to regression models with normal error distributions. To loosen the assumptions behind the bias estimation, other information criteria were presented, such as the Takeuchi information criterion (TIC, Takeuchi, 1976[218]), the generalized information criterion (GIC, Konishi and Kitagawa, 1996[116]), and the information complexity (ICOMP, Bozdogan, 2000[33]).

Apart from direct bias estimation of the expected log likelihood, bias reduction through statistical resampling methods has produced model selection criteria free of assumptions such as a common parametric family of distributions. Popular resampling methods are cross validation, the jackknife, and the bootstrap. Stone (1977[215]) showed that estimating the expected log likelihood by cross validation is asymptotically equivalent to AIC. Several attempts at model selection with the bootstrap were made; however, the bootstrap information criterion was no better than AIC (Chung et al., 1996[53]; Shibata, 1997[207]; and Ishiguro et al., 1997[107]). While the asymptotic properties of cross validation and bootstrap model selection, and their applications, have been well studied, applications of the jackknife resampling method to model selection are hardly found in the literature.

The jackknife estimator is expected to reduce the bias of the expected log likelihood incurred in estimating the KL distance. Assume that the bias $b_n$ based on n observations arises from estimating the KL distance and satisfies the expansion
$$b_n = \frac{1}{n} b_1 + \frac{1}{n^2} b_2 + O(n^{-3}).$$
Let $b_{n,-i}$ be the estimate of the bias without the $i$th observation. Denoting by $b_J$ the jackknife bias estimate, the order of the jackknife bias estimate is $O(n^{-2})$, reduced from the order $O(n^{-1})$ of the ordinary bias, since
$$b_J = \frac{1}{n}\sum_{i=1}^{n}\bigl(n b_n - (n-1) b_{n,-i}\bigr) = b_1 + \frac{1}{n} b_2 - \Bigl(b_1 + \frac{1}{n-1} b_2\Bigr) + O(n^{-3}) = -\frac{1}{n(n-1)} b_2 + O(n^{-3}).$$
Furthermore, this bias reduction relaxes the model constraints, such as a common parametric family of distributions, otherwise needed to estimate the bias.
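The bias reduction above is easy to verify numerically. The sketch below applies the delete-one jackknife to a deliberately biased plug-in estimator (the uncorrected sample variance, used here only as a familiar stand-in whose O(1/n) bias is known in closed form); the jackknife-corrected estimate removes the leading bias term, as the expansion predicts.

    import numpy as np

    def jackknife_bias_correct(x, estimator):
        """Delete-one jackknife bias estimate and bias-corrected estimate."""
        n = len(x)
        theta_full = estimator(x)
        loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
        bias_J = (n - 1) * (loo.mean() - theta_full)     # jackknife bias estimate
        return theta_full - bias_J, bias_J

    biased_var = lambda x: np.mean((x - x.mean()) ** 2)  # bias is -sigma^2/n = O(1/n)

    rng = np.random.default_rng(4)
    x = rng.normal(scale=2.0, size=50)                   # true variance is 4
    corrected, bias_J = jackknife_bias_correct(x, biased_var)
    print(biased_var(x), corrected, bias_J)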
Very few studies of the jackknife principle are found in model selection compared to the bootstrap and cross-validation methods, possibly because of the perception that the jackknife is asymptotically similar to either the bootstrap or cross-validation, as similarities among these resampling methods have been reported in diverse statistical problems. In this chapter, we develop the jackknife information criterion (JIC), a competing model selection criterion based on the KL distance measure. Section 2 describes the necessary assumptions and discusses the strong convergence of the maximum likelihood estimator when the true model is unspecified. Section 3 establishes the uniform strong convergence of the jackknife maximum likelihood estimator as a step toward the new model selection criterion with the jackknife resampling scheme. Section 4 provides the definition of JIC and studies its asymptotic properties under the regularity conditions. Section 5 provides a discussion and some comparison between JIC and popular information criteria when the true model is unknown. Section 6 reconsiders the consistency of JIC with minimal assumptions on the likelihood in model selection.

3.2 Preliminaries

Suppose that the data points $X_1, X_2, \ldots, X_n$ are iid with common density $g$ and that $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ is a class of candidate models. Let the log likelihood $\log f_\theta(X_i)$ of $X_i$ be denoted by $l_i(\theta)$. The log likelihood function of all the observations and the log likelihood function without the $i$th observation are denoted by $L(\theta) = \sum_{i=1}^{n} l_i(\theta)$ and $L_{-i}(\theta) = \sum_{j \neq i} l_j(\theta)$, respectively. Unless stated otherwise, we assume the following throughout our study:

(J0) The $p$ dimensional parameter space $\Theta^p$ is a compact subset of $\mathbb{R}^p$.

(J1) For any $\theta \in \Theta^p$, each $f_\theta \in \mathcal{M}$ is distinct, i.e. if $\theta_1 \neq \theta_2$, then $f_{\theta_1} \neq f_{\theta_2}$.

(J2) A unique parameter $\theta_g$ exists in the interior of $\Theta^p$ and satisfies $E_g[\log f_{\theta_g}(X_i)] = \max_{\theta \in \Theta^p} E_g[\log f_\theta(X_i)]$.

(J3) The log likelihood $l_i(\theta)$ is thrice continuously differentiable with respect to $\theta \in \Theta^p$, and the derivatives are denoted by $\nabla_\theta l_i(\theta)$, $\nabla^2_\theta l_i(\theta)$, and $\nabla^3_\theta l_i(\theta)$, in that order. Also, $|l_i(\theta)|$, $|\frac{\partial}{\partial\theta_k} l_i(\theta)|$, $|\frac{\partial^2}{\partial\theta_k\partial\theta_l} l_i(\theta)|$, and $|\frac{\partial^3}{\partial\theta_k\partial\theta_l\partial\theta_m} l_i(\theta)|$ are dominated by $h(X_i)$, which is non-negative, does not depend on $\theta$, and satisfies $E[h(X_i)] < \infty$ ($k,l,m = 1,\ldots,p$).

(J4) For any $\theta \in \Theta^p$, $E_g[\nabla_\theta l_i(\theta)\nabla_\theta^T l_i(\theta)]$ and $-E_g[\nabla^2_\theta l_i(\theta)]$ are finite and positive definite $p \times p$ matrices. If $g = f_{\theta_g}$, then $E_g[\nabla_\theta l_i(\theta)\nabla_\theta^T l_i(\theta)] = -E_g[\nabla^2_\theta l_i(\theta)]$.

(J5) Let $\Lambda$ be a $p \times p$ non-singular matrix such that $\Lambda = -E_g[\nabla^2_\theta l_i(\theta_g)]$.

Remark on (J1). Admittedly, the identifiability constraint on the parameters is very restrictive. However, this strong condition eliminates long disputes over the problems caused by parameters near the boundary. When candidate models are nested, for example in mixture and regression models, it is hard to tell whether the proportion of the additional component is close to zero or the parameters of the additional components are close to those of the smaller model. Due to this ambiguity in the parameter space, traditional model selection cannot be compared directly to the jackknife criterion, as will be discussed later. Discussions of these breakdowns of parameters near the boundary with regard to identifiability problems are easily found (e.g. Protassov et al., 2002[165] and Bozdogan, 2000[33]).

Remark on (J2). If $\mathcal{M}$ contains $g$, then not only does $\theta_g$ maximize $E_g[\log f_\theta(X_i)]$, but the KL distance $D_{KL}[g; f_{\theta_g}]$ also becomes zero.
However, $g$ is likely to be unknown, so the estimate of the density $f_{\theta_g}$ that minimizes the KL distance is sought as a surrogate model for $g$. Also note that the parameter $\theta_g$ satisfies $E_g[\nabla_\theta l_i(\theta_g)] = 0$ and $E_g[\log f_\theta(X_i)] < E_g[\log f_{\theta_g}(X_i)]$ for any $\theta \neq \theta_g$ in $\Theta^p$, where the inequality is strict rather than $\le$ because of the uniqueness of $\theta_g$ (according to J2) and the dominated convergence theorem (according to J3, $E_g[|\log f_\theta(X_i)|] < \infty$).

Additional Remark. Candidate models for model selection criteria need not come from the same parametric family. We recommend considering candidate models $f_{\theta_1}$ and $f_{\theta_2}$ from different parametric families, so that $f_{\theta_1} \in \mathcal{M}_1$ and $f_{\theta_2} \in \mathcal{M}_2$ with $\mathcal{M}_1 \neq \mathcal{M}_2$.

We now show that the maximum likelihood estimator $\hat\theta_n$ of $\theta_g$ converges almost surely to the parameter of interest $\theta_g$ that minimizes the KL distance.

Theorem 3.1. If $\hat\theta_n$ is a function of $X_1, \ldots, X_n$ such that $\nabla_\theta L(\hat\theta_n) = 0$, then $\hat\theta_n \xrightarrow{a.s.} \theta_g$. The maximum likelihood estimator $\hat\theta_n$ is, in other words, the minimum discrepancy estimator of $\theta_g$.

Proof. It is sufficient to prove that
B: $P\{\lim_{n\to\infty} \|\hat\theta_n - \theta_g\| < \epsilon\} = 1$ for all $\epsilon > 0$,
where $\hat\theta_n$ satisfies
$$\text{A:}\quad \prod_{i=1}^{n} f_{\hat\theta_n}(X_i) \ge \prod_{i=1}^{n} f_\theta(X_i), \quad \forall \theta \in \Theta^p, \ \forall n. \quad (3.1)$$
We prove this almost sure convergence by negating A ⇒ B. Suppose B does not hold. Then there exists a subsequence $\{\hat\theta_{n_k}\}$ of $\hat\theta_n$ such that $\hat\theta_{n_k} \to \bar\theta$ a.e. for some $\bar\theta \neq \theta_g$ with $\bar\theta \in \Theta^p$. Let $\epsilon > 0$ be such that $\|\bar\theta - \theta_g\| \ge \epsilon$. In addition, from assumptions (J0)-(J2), for some $\delta > 0$,
$$E[l_1(\bar\theta) - l_1(\theta_g)] = -\delta < 0. \quad (3.2)$$
By the strong law of large numbers (SLLN) and (3.2),
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\theta_g)] \xrightarrow{a.s.} -\delta.$$
Thus, for all large n,
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\theta_g)] < -\delta/2 \quad \text{a.s.} \quad (3.3)$$
As $\hat\theta_{n_k} \to \bar\theta$ a.s., for all large $k$ we have, almost surely,
$$\|\bar\theta - \hat\theta_n\| < \frac{\delta}{8E[h(X_1)]},$$
where $E[h(X_1)]$ is bounded by (J3). By the SLLN and (J3),
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\hat\theta_n)] \le \|\bar\theta - \hat\theta_n\|\,\frac{1}{n}\sum_{i=1}^{n} h(X_i) < \frac{\delta}{8E[h(X_1)]}\,\frac{1}{n}\sum_{i=1}^{n} h(X_i) < \frac{\delta}{4} \quad \text{i.o. a.s.} \quad (3.4)$$
Now, (3.3) and (3.4) imply
$$P\left(\sum_{i=1}^{n}[l_i(\hat\theta_n) - l_i(\theta_g)] < 0 \ \ \text{i.o.}\right) = 1.$$
Therefore,
$$P\left(\prod_{i=1}^{n} f_{\hat\theta_n}(X_i) < \prod_{i=1}^{n} f_{\theta_g}(X_i) \ \ \text{i.o.}\right) = 1,$$
which violates (3.1) for $\theta = \theta_g$. This completes the proof.

Under assumptions (J0)-(J2), the maximum likelihood estimator $\hat\theta_n$ converges almost surely to $\theta_g$ even if the true model is unspecified. This strong consistency is generally assumed in statistical model selection studies with an unspecified or misspecified true model (e.g. Nishii, 1988[155]; Sin and White, 1996[210]). For the particular case $g = f_{\theta_g}$, this strong convergence is easily proved with Wald's approach (1949[226]). When $\theta_g$ is the true model parameter, Stone (1977[215]) showed that the maximum likelihood estimate $\hat\theta_n$ converges to $\theta_g$ in probability and proved that cross validation is equivalent to AIC.

AIC evaluates a fitted model against the true model in terms of minimizing the KL distance through maximum likelihood estimation, so AIC can only be used when a target parametric family of distributions contains the true model, which is unlikely in scientific problems. Another popular model selection criterion, BIC, assumes equal priors for the candidate models as well as an exponential family of distributions. Although the maximum likelihood estimator $\hat\theta_n$ is strongly consistent for the parameter of interest $\theta_g$ (the parameter that minimizes the KL distance), there is no guarantee that this $\hat\theta_n$ produces an unbiased likelihood estimate. Similar problems arise in other well known model selection criteria.
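The content of Theorem 3.1 can be illustrated numerically: when the candidate family is misspecified, the MLE still settles down to the pseudo-true parameter $\theta_g$ that minimizes the KL distance. The Python sketch below fits an exponential model to lognormal data; the choice of distributions is an assumption made only for illustration, and for the exponential family the KL-minimizing scale can be written in closed form (it equals the true mean), which is what the MLE approaches.

    import numpy as np

    rng = np.random.default_rng(5)

    def exp_mle(x):
        """MLE of the exponential scale; also the KL-minimizing value theta_g = E_g[X]."""
        return x.mean()

    theta_g = np.exp(0.5)            # E[X] for a lognormal(0, 1) "true" model g
    for n in (100, 10_000, 1_000_000):
        x = rng.lognormal(mean=0.0, sigma=1.0, size=n)
        print(n, exp_mle(x), theta_g)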
Unlike other information criteria, which take a bias estimation approach to the likelihood estimation, we take a bias reduction approach based on the jackknife resampling method.

3.3 Jackknife Maximum Likelihood Estimator

The jackknife is well known for bias correction and is relatively cheap computationally compared to the bootstrap. To begin developing a model selection criterion from the jackknife, we first show that the jackknife maximum likelihood estimator $\hat\theta_{-i}$ converges strongly to $\theta_g$, uniformly in $i$.

Theorem 3.2. Let $\hat\theta_{-i}$ be a function of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$ satisfying $\nabla_\theta L_{-i}(\hat\theta_{-i}) = 0$ for any $i \in [1, n]$. Then, uniformly in $i$, $\hat\theta_{-i}$ converges to $\theta_g$ almost surely:
$$P\left(\lim_{n\to\infty} \max_{1\le i\le n} \|\hat\theta_{-i} - \theta_g\| \le \epsilon\right) = 1, \quad \forall \epsilon > 0.$$

Proof. Let $\{\tau_n\}$ be the sequence $\tau_n = \max_{1\le i\le n}\|\hat\theta_{-i} - \theta_g\|$, where $\hat\theta_{-i}$ satisfies
$$\text{A:}\quad \prod_{j=1, j\neq i}^{n} f_{\hat\theta_{-i}}(X_j) \ge \prod_{j=1, j\neq i}^{n} f_\theta(X_j), \quad \forall\theta\in\Theta^p, \ \forall i \in [1, n]. \quad (3.5)$$
It is sufficient to show that $\tau_n$ converges to zero almost surely (B). We prove this uniform almost sure convergence along the lines of the proof of Theorem 3.1, negating the statement A ⇒ B into ¬B ⇒ ¬A.

Suppose $\tau_n \not\to 0$ a.e., so that $\|\hat\theta_{-i(n)} - \theta_g\| \not\to 0$ a.e. for some $i(n)$ with $1 \le i(n) \le n$. Let $\bar\theta$ be a limit point of $\{\hat\theta_{-i(n)}\}$. For the same $\delta$ as in (3.2) and with $E[h(X_1)] < \infty$ as in (J3),
$$\|\bar\theta - \hat\theta_{-i(n)}\| < \frac{\delta}{8E[h(X_1)]} \quad \text{i.o.}$$
Then
$$\frac{1}{n-1}\sum_{j\neq i(n)}[l_j(\bar\theta) - l_j(\hat\theta_{-i(n)})] \le \|\bar\theta - \hat\theta_{-i(n)}\|\,\frac{1}{n-1}\sum_{j\neq i(n)} h(X_j) \le \|\bar\theta - \hat\theta_{-i(n)}\|\,\frac{1}{n-1}\sum_{j=1}^{n} h(X_j) < \frac{\delta}{8E[h(X_1)]}\,\frac{n}{n-1}\,\frac{1}{n}\sum_{j=1}^{n} h(X_j) < \frac{\delta}{4} \quad \text{i.o. a.s.}$$
Hence
$$P\left(\frac{1}{n-1}\sum_{j\neq i(n)}[l_j(\hat\theta_{-i(n)}) - l_j(\theta_g)] < 0 \ \ \text{i.o.}\right) = 1$$
and
$$P\left(\prod_{j\neq i(n)} f_{\hat\theta_{-i(n)}}(X_j) < \prod_{j\neq i(n)} f_{\theta_g}(X_j) \ \ \text{i.o.}\right) = 1. \quad (3.6)$$
This result violates (3.5). This completes the proof.

We have shown that the jackknife maximum likelihood estimator converges uniformly to $\theta_g$, the parameter that minimizes the KL distance. The strong consistency of the maximum likelihood estimator has been discussed extensively in the literature (e.g. Wald, 1949[226]; LeCam, 1953[122]; and Huber, 1967[97]). Based on the regularity conditions and the strong convergence of the maximum likelihood estimators discussed so far, we take the jackknife approach to a model selection criterion in the next section.

3.4 Jackknife Information Criterion (JIC)

In this section, we define the jackknife information criterion (JIC) and investigate the asymptotic behavior of the jackknife model selection principle. In contrast to the traditional theoretical formulas, the jackknife does not explicitly require assumptions on the candidate models (Shao, 1995[203]). We assume only the regularity conditions that ensure the strong consistency of the maximum likelihood estimators and the asymptotic behavior of the jackknife model selection.

Deriving an information criterion starts from estimating the expected log likelihood. Before defining JIC, we first define the jackknife estimator of the log likelihood function and show that this jackknife estimator is asymptotically unbiased for the expected log likelihood. The bias corrected jackknife estimator $\bar\tau$ of a target estimator $\hat\tau$ is defined as $\bar\tau = \hat\tau - b_J$, where $b_J$ is the jackknife bias estimator
$$b_J = (n-1)\,\frac{1}{n}\sum_{i=1}^{n}(\hat\tau_{-i} - \hat\tau),$$
with $\hat\tau_{-i}$ the same estimator as $\hat\tau$ computed without the $i$th observation. Similarly, we formulate the bias corrected jackknife estimator of the log likelihood.

Definition 3.1 (Bias corrected jackknife log likelihood).
Let the jackknife estimator of the log likelihood, $J_n$, be
$$J_n = nL(\hat\theta_n) - \sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}). \quad (3.7)$$

The strong convergence of $\hat\theta_n$ and $\hat\theta_{-i}$ leads to the following theorem.

Theorem 3.3. Let $X_1, X_2, \ldots, X_n$ be iid random variables from a density function $g$ of an unknown distribution and let $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ be a class of models. If $(g, \mathcal{M})$ satisfies (J0)-(J5), then the jackknife log likelihood estimator $J_n$ satisfies
$$J_n = nL(\hat\theta) - \sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}) = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n, \quad (3.8)$$
where $\epsilon_n$ is a random variable satisfying $\lim_{n\to\infty} E_g[\epsilon_n] = 0$ and $\|\epsilon_n\| \xrightarrow{a.s.} 0$. Additionally, the fact that $E_g[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)] = 0$ makes $J_n$ an asymptotically unbiased estimator of the log likelihood.

Proof. We begin by grouping the components of $J_n$:
$$J_n = \sum_{i=1}^{n}[L(\hat\theta) - L_{-i}(\hat\theta_{-i})] = \sum_{i=1}^{n}[L_{-i}(\hat\theta) + l_i(\hat\theta) - L_{-i}(\hat\theta_{-i}) - l_i(\theta_g) + l_i(\theta_g)] = \sum_{i=1}^{n}[L_{-i}(\hat\theta) - L_{-i}(\hat\theta_{-i})] + \sum_{i=1}^{n}[l_i(\hat\theta) - l_i(\theta_g)] + \sum_{i=1}^{n} l_i(\theta_g).$$
By the Taylor expansion, the first and the second terms respectively become
$$\sum_{i=1}^{n}[L_{-i}(\hat\theta) - L_{-i}(\hat\theta_{-i})] = \sum_{i=1}^{n}\nabla L_{-i}(\hat\theta_{-i})(\hat\theta - \hat\theta_{-i}) + \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) = \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i})$$
and
$$\sum_{i=1}^{n}[l_i(\hat\theta) - l_i(\theta_g)] = L(\hat\theta) - L(\theta_g) = -(\theta_g - \hat\theta)^T\nabla L(\hat\theta) - \frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta),$$
where $\xi = \theta_g + \delta(\hat\theta - \theta_g)$ and $\eta_i = \hat\theta + \gamma_i(\hat\theta_{-i} - \hat\theta)$ for some $\delta$ and $\gamma_i$ with $0 < \delta, \gamma_i < 1$. Hence,
$$J_n = L(\theta_g) + \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) - \frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta). \quad (3.9)$$

First, consider $\sum_i(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i})$ in (3.9). From Theorems 3.1 and 3.2, the maximum likelihood estimates $\hat\theta_n$ and $\hat\theta_{-i}$ are eventually confined, almost surely, so that $\max\{\|\hat\theta_n - \theta_g\|, \max_i\|\hat\theta_{-i} - \theta_g\|\} \le \epsilon$ for any $\epsilon > 0$. Also note that $\max_i\|\eta_i - \theta_g\| \le \max(\|\hat\theta_n - \theta_g\|, \max_i\|\hat\theta_{-i} - \theta_g\|)$. Thus,
$$\max_{1\le i\le n}\left\|\frac{1}{n}\nabla^2 L_{-i}(\eta_i) + \Lambda\right\| \le \max_{1\le i\le n}\frac{1}{n}\left\|\nabla^2 L_{-i}(\eta_i) - \nabla^2 L(\theta_g)\right\| + \left\|\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\right\|.$$
Here $\max_{1\le i\le n}\frac{1}{n}\|\nabla^2 L_{-i}(\eta_i) - \nabla^2 L(\theta_g)\| \xrightarrow{a.s.} 0$ and $\|\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\| \xrightarrow{a.s.} 0$. Then
$$\max_{1\le i\le n}\left\|\frac{1}{n}\nabla^2 L_{-i}(\eta_i) + \Lambda\right\| \xrightarrow{a.s.} 0. \quad (3.10)$$
In addition,
$$-\nabla l_i(\hat\theta) = \nabla L(\hat\theta) - \nabla L_{-i}(\hat\theta_{-i}) - \nabla l_i(\hat\theta) = \nabla L_{-i}(\hat\theta) - \nabla L_{-i}(\hat\theta_{-i}) = \nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}).$$
Using (3.10), the above equation gives
$$(\hat\theta - \hat\theta_{-i}) = \frac{1}{n}\Lambda^{-1}\nabla l_i(\hat\theta) + o(n^{-1}) \quad \text{a.s.}, \quad (3.11)$$
uniformly in $i$. Let $\alpha\epsilon_n = \|\hat\theta - \theta_g\|$, where $\alpha$ is an arbitrary constant and $\epsilon_n$ is a random variable satisfying $\lim_{n\to\infty} E[\epsilon_n] = 0$ and $\epsilon_n \xrightarrow{a.s.} 0$. Since $|\nabla^2 l_i(\eta)|$ is bounded,
$$\nabla l_i(\hat\theta) = \nabla l_i(\theta_g) + \nabla^2 l_i(\eta)(\hat\theta - \theta_g)$$
is equivalent to
$$\nabla l_i(\hat\theta) = \nabla l_i(\theta_g) + \alpha^*\epsilon_n, \quad (3.12)$$
for some constant $\alpha^*$. Therefore, combining eq. (3.10), eq. (3.11), and eq. (3.12) gives
$$\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) = -\frac{1}{n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \epsilon_n.$$
At this point we remind the reader that, without loss of generality, $a\epsilon_n$ for any constant $a$ is denoted by $\epsilon_n$, since $\lim_{n\to\infty} E[a\epsilon_n] = \lim_{n\to\infty} E[\epsilon_n] = 0$ and $a\epsilon_n = o(1)$ a.s.

Similarly, the last term of eq. (3.9) becomes
$$(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + \epsilon_n.$$
As $\|\xi - \theta_g\| \le \|\hat\theta - \theta_g\| = o(1)$ a.s.,
$$\frac{1}{n}\|\nabla^2 L(\xi) - \nabla^2 L(\theta_g)\| \xrightarrow{a.s.} 0.$$
With the SLLN,
$$\nabla^2 L(\xi) = \nabla^2 L(\theta_g) + n\Lambda - n\Lambda - \nabla^2 L(\theta_g) + \nabla^2 L(\xi) = n\left(\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\right) + \left(\nabla^2 L(\xi) - \nabla^2 L(\theta_g)\right) - n\Lambda = -n(\Lambda + o(1)) \quad \text{a.s.} \quad (3.13)$$
Additionally, we have
$$\nabla L(\theta_g) = \nabla L(\theta_g) - \nabla L(\hat\theta) = \nabla^2 L(\xi)(\theta_g - \hat\theta), \quad (3.14)$$
and substituting eq. (3.13) into eq. (3.14) gives
$$(\theta_g - \hat\theta) = -\frac{1}{n}\Lambda^{-1}\nabla L(\theta_g) + o(n^{-1}) \quad \text{a.s.}$$
Hence,
$$(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + o(n^{-1}) \quad \text{a.s.}$$
Consequently, eq. (3.9) becomes
$$J_n = L(\theta_g) - \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + \epsilon_n = L(\theta_g) - \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n. \quad (3.15)$$
Note that $J_n$ is an asymptotically unbiased estimator of the expected log likelihood:
$$E\left[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)\right] = E\left[E\left[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)\,\Big|\,X_i\right]\right] = \frac{1}{2n}\sum_{i\neq j}E[\nabla l_i(\theta_g)^T\Lambda^{-1}]\,E[\nabla l_j(\theta_g)] = 0.$$
This completes the proof.

Asymptotically, no bias term is involved in $J_n$, in contrast to the cross validation and the bootstrap estimators of the log likelihood (Shao, 1993[204]; Chung et al., 1996[53]; and Shibata, 1997[207]). Therefore, the model selection criterion built on the jackknife method is expected to behave differently, asymptotically, from the selection criteria built on other resampling methods. We investigate the jackknife principle further as a model selection criterion by proposing the jackknife information criterion (JIC) as minus twice $J_n$, since the popular information criteria involve estimators of negative twice the expected log likelihood.

Definition 3.2. The jackknife information criterion (JIC) is defined as minus twice $J_n$:
$$\mathrm{JIC} = -2J_n = -2nL(\hat\theta) + 2\sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}).$$
(A small numerical sketch of this definition appears below, after Theorem 3.4.)

Note that multiplying by $-2$ does not change the asymptotic unbiasedness of $J_n$. Furthermore, the actual behavior of an estimator can be predicted through its convergence rate. The convergence rate of a maximum likelihood type estimator is generally obtained from the Taylor expansion and the central limit theorem (CLT), and the regularity conditions guarantee the Taylor expansion. The convergence rate of $J_n$ could be specified by its limiting distribution. Nonetheless, Miller (1974[148]) commented that no one has succeeded in deriving the limiting distribution of the delete-one jackknife estimator. While the CLT requires a consistent variance estimator, the delete-one jackknife is known to give inconsistent variance estimators for non-smooth estimators (Shao and Wu, 1989[205]). Theorem 3.3 assures the existence of the asymptotically unbiased delete-one jackknife estimator of the expected log likelihood. Instead of obtaining the limiting distribution of $J_n$ by the CLT, the stochastic rate of $J_n$ is presented by means of the law of the iterated logarithm (LIL) for understanding the behavior of $J_n$ in the asymptopia.

Theorem 3.4. Let $X_1, X_2, \ldots, X_n$ be iid random variables from an unknown distribution with density $g$ and let $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ be a class of models. Under (J0)-(J5), the stochastic orders relating to $J_n$ are:
(1) $J_n = L(\theta_g) + O(\log\log n)$ a.s.
(2) $\frac{1}{n}J_n = \int \log f_{\theta_g}(x)\,g(x)\,dx + O\bigl(\sqrt{n^{-1}\log\log n}\bigr)$ a.s.

Proof. By the LIL, we have
$$\frac{1}{n}\sum_{i=1}^{n}\nabla l_i(\theta_g) = O\bigl(\sqrt{n^{-1}\log\log n}\bigr) \quad \text{a.s.},$$
so that the stochastic order of $J_n$ is
$$J_n = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n = L(\theta_g) + O(\log\log n) \quad \text{a.s.}, \quad (3.16)$$
where $\epsilon_n = o(1)$ a.s. from Theorem 3.3. The result (3.16) and the LIL lead to
$$\frac{1}{n}J_n = \frac{1}{n}L(\theta_g) + \frac{1}{n}\{J_n - L(\theta_g)\} = \int \log f_{\theta_g}(x)\,g(x)\,dx + O\bigl(\sqrt{n^{-1}\log\log n}\bigr) \quad \text{a.s.}, \quad (3.17)$$
since $\frac{1}{n}L(\theta_g) = \int \log f_{\theta_g}(x)\,g(x)\,dx + O\bigl(\sqrt{n^{-1}\log\log n}\bigr)$ (Nishii, 1988). This completes the proof.

Although Theorem 3.3 shows that $J_n$ is asymptotically unbiased, JIC might not pick a good model, particularly among nested models whose $L(\theta_g)$ are asymptotically identical.
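As referenced after Definition 3.2, the following Python sketch computes $J_n$ and JIC for two hypothetical non-nested candidate families (exponential and lognormal, both fitted by closed-form maximum likelihood). The data-generating model and the candidates are assumptions made only for illustration, and the data-oriented penalty $C_n$ of Proposition 3.1 in the next section is only indicated in a comment rather than derived.

    import numpy as np
    from scipy import stats

    def jic(x, fit, loglik):
        """JIC = -2*J_n with J_n = n*L(theta_hat) - sum_i L_{-i}(theta_hat_{-i})."""
        n = len(x)
        L_full = loglik(x, fit(x)).sum()
        L_loo = np.array([loglik(np.delete(x, i), fit(np.delete(x, i))).sum()
                          for i in range(n)])
        J_n = n * L_full - L_loo.sum()
        return -2.0 * J_n

    # Candidate 1: exponential with MLE scale = sample mean.
    fit_exp = lambda x: x.mean()
    ll_exp = lambda x, s: stats.expon.logpdf(x, scale=s)

    # Candidate 2: lognormal with MLE (mu, sigma) on the log scale.
    fit_ln = lambda x: (np.log(x).mean(), np.log(x).std())
    ll_ln = lambda x, p: stats.lognorm.logpdf(x, s=p[1], scale=np.exp(p[0]))

    rng = np.random.default_rng(6)
    x = rng.lognormal(mean=0.5, sigma=0.8, size=200)      # toy data
    print("JIC exp:", jic(x, fit_exp, ll_exp), "JIC lognormal:", jic(x, fit_ln, ll_ln))
    # A corrected criterion in the spirit of Proposition 3.1 would add a penalty C_n
    # with C_n/n -> 0 and C_n/log log n -> infinity, for example C_n = p*log(n).

The model with the smaller JIC value is preferred, mirroring the convention of the popular information criteria.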
The stochastic rates of the estimators investigated by Nishii (1988[155]) imply that the randomness of the estimators is bounded by a function of the sample size. This randomness can be diluted by adopting penalty terms, as in other information criteria, which may enhance the consistency of jackknife model selection. We therefore suggest a slight modification of JIC.

Proposition 3.1. The corrected JIC (JICc) selects a model consistently, where
$$\mathrm{JICc} = \mathrm{JIC} + C_n$$
and $C_n$ satisfies $\frac{C_n}{n} \to 0$ and $\frac{C_n}{\log\log n} \to \infty$.

Remark on Proposition 3.1. The penalty term $C_n$ was adapted from the data oriented penalty of Rao and Wu (1989[167]) and Bai et al. (1999[16]). Penalty terms can be estimated when the true model is known. In the current study, however, we consider cases when the true model is unspecified or misspecified, so the penalty is not estimable; instead, bias reduction methods have been employed. On the other hand, JIC reflects no model complexity, which would further penalize models with many parameters. Therefore, the order of the penalty term $C_n$ should be larger than the stochastic rate of $J_n$ and smaller than the sample size. As with MDL, it seems reasonable to build the degree of model complexity into $C_n$. If some of the candidate models are nested and those nested models have complexity proportional to the number of parameters, then $p\log n$ would be a good choice for the penalty term, since $p\log n$ is of higher order than $\log\log n$ and of lower order than $n$. A more careful investigation of the choice of $C_n$ is left for future work.

3.5 When Candidate Models are Nested

Consider k candidate models $M_1, \ldots, M_k$, where $M_1$ is the smallest model and $M_k$ is the largest, nested in the given order. Let the corresponding parameter spaces be $\Omega_1, \ldots, \Omega_k$; clearly $\Omega_i \subset \Omega_j$ for $i < j$. It may happen that $\theta_g$, the parameter of interest satisfying $E[\nabla L(\theta_g)] = 0$, lies in $\Omega_m$ for a given m but does not lie in any space $\Omega_i$ with $i < m$. The virtue of a consistent model selection method is that it correctly identifies m, neither m − 1 nor m + 1. Traditional model selection criteria have penalty terms to keep an underestimated or an overestimated model from being chosen. The jackknife model selection criterion, however, has no such penalty term.

Unless $\theta_g \in \Omega_k \setminus \Omega_{k-1}$, the jackknife information criterion is unable to discriminate the best model among nested models. The event $\theta_g \in \Omega_m$ implies $\theta_g \in \Omega_i$ for $m \le i \le k$, so the jackknife maximum likelihood estimates of $M_m, \ldots, M_k$ are stochastically equivalent ($J_n(M_i) - J_n(M_j) = o(1)$ for $m \le i, j \le k$ and $i \neq j$). On the other hand, if $\theta_g$ resides only in the largest parameter space, the maximum likelihood approach estimates the parameter of interest $\theta_i$ of each model $M_i$ that minimizes the KL distance (maximizing the empirical likelihood based on the maximum likelihood estimator of $\theta_i$) within its own parameter space; yet the resulting KL distance attained by $\theta_i$ is larger than the KL distance attained by $\theta_g$. When the true model is unspecified, the bias due to model estimation cannot be assessed, and the penalty terms of popular model selection criteria, such as $2p$ in AIC and $p\log n$ in BIC, no longer take such simple forms. The bias reduction approach, on the other hand, allows measuring the relative distances among candidate models with respect to the KL distance. When the parameter space of each candidate model occupies a unique region, the JIC will, regardless of the true model, select a model with a relatively short KL distance.
A difficulty arises only when some candidate models are nested and the true model is unknown. The relationships among candidate models and their parameter spaces are illustrated in Fig. 3.1.