The Pennsylvania State University
The Graduate School
TWO TOPICS: A JACKKNIFE MAXIMUM LIKELIHOOD
APPROACH TO STATISTICAL MODEL SELECTION AND A
CONVEX HULL PEELING DEPTH APPROACH TO
NONPARAMETRIC MASSIVE MULTIVARIATE DATA ANALYSIS
WITH APPLICATIONS
A Thesis in Statistics by Hyunsook Lee
© 2006 Hyunsook Lee
Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
August 2006
The thesis of Hyunsook Lee was reviewed and approved∗ by the following:
G. Jogesh Babu Professor of Statistics Thesis Advisor, Chair of Committee
Thomas P. Hettmansperger Professor of Statistics
Bing Li Professor of Statistics
W. Kenneth Jenkins Professor of Electrical Engineering
James L. Rosenberger Professor of Statistics Head of Statistics Department
∗Signatures are on file in the Graduate School.
Abstract
This dissertation presents two topics from opposite disciplines: one is from a parametric realm and the other is based on nonparametric methods. The first topic is a jackknife maximum likelihood approach to statistical model selection, and the second is a convex hull peeling depth approach to nonparametric massive multivariate data analysis. The second topic includes simulations and applications on massive astronomical data. First, we present a model selection criterion that minimizes the Kullback-Leibler distance by using the jackknife method. Various model selection methods have been developed to choose a model of minimum Kullback-Leibler distance to the true model, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), minimum description length (MDL), and the bootstrap information criterion. Likewise, the jackknife method chooses a model of minimum Kullback-Leibler distance through bias reduction. This bias, which is inevitable in model selection problems, arises from estimating the distance between an unknown true model and an estimated model. We show that (a) the jackknife maximum likelihood estimator is consistent for the parameter of interest, (b) the jackknife estimate of the log likelihood is asymptotically unbiased, and (c) the stochastic order of the jackknife log likelihood estimate is O(log log n). Because of these properties, the jackknife information criterion is applicable to problems of choosing a model from non-nested candidates, especially when the true model is unknown. Compared to popular information criteria, which are only applicable to nested models, the jackknife information criterion is more robust in terms of filtering various types of candidate models to choose the best approximating model. However, this robust method has the demerit that the jackknife criterion is unable to discriminate nested models. Next, we explore the convex hull peeling process to develop empirical tools for statistical inferences on multivariate massive data. The convex hull and its peeling
process has intuitive appeal for robust location estimation. We define the convex hull peeling depth, which enables ordering of multivariate data. This ordering process based on data depth provides ways to obtain multivariate quantiles, including the median. Based on the generalized quantile process, we define a convex hull peeling central region, a convex hull level set, and a volume functional, which lead us to one dimensional mappings describing the shapes of multivariate distributions along data depth. We define empirical skewness and kurtosis measures based on the convex hull peeling process. In addition to these empirical descriptive statistics, we present a few methodologies for finding multivariate outliers in massive data sets. These outlier detection algorithms are (1) estimating multivariate quantiles up to the level α, (2) detecting changes in a measure sequence of convex hull level sets, and (3) constructing a balloon to exclude outliers. The convex hull peeling depth is a robust estimator, so the existence of outliers does not affect the properties of inner convex hull level sets. Overall, we demonstrate all these good characteristics of the convex hull peeling process through bivariate synthetic data sets that illustrate the procedures. We show that these empirical procedures are applicable to real massive data sets by employing Quasars and galaxies from the Sloan Digital Sky Survey. Interesting scientific results from the convex hull peeling process on these massive data sets are also provided.
KEYWORDS: Kullback-Leibler distance, information criterion, statistical model selection, jackknife, jackknife information criterion (JIC), bias reduction, maximum likelihood, unbiased estimation, non-nested model, computational geometry, convex hull, convex hull peeling, statistical data depth, nonparametric statistics, generalized quantile process, descriptive statistics, skewness, kurtosis, volume functional, convex hull level set, massive data, outlier detection, balloon plot, multivariate analysis
Table of Contents
List of Figures
List of Tables
List of Algorithms
Acknowledgments
Chapter 1  Introduction: Motivations
  1.1 Model Selection
  1.2 Convex Hull Peeling
  1.3 Synopsis
Chapter 2  Literature Reviews
  2.1 Introduction
  2.2 Model Selection Under Maximum Likelihood Principle
    2.2.1 Kullback-Leibler Distance
    2.2.2 Preliminaries
    2.2.3 Akaike Information Criterion
    2.2.4 Bayes Information Criterion
    2.2.5 Minimum Description Length
    2.2.6 Resampling in Model Selection
    2.2.7 Other Criteria and Comments
  2.3 Convex Hull and Algorithms
    2.3.1 Definitions
    2.3.2 Convex Hull Algorithms
    2.3.3 Other Algorithms and Comments
  2.4 Data Depth
    2.4.1 Ordering Multivariate Data
    2.4.2 Data Depth Functions
    2.4.3 Statistical Data Depth
    2.4.4 Multivariate Symmetric Distributions
    2.4.5 General Structures for Statistical Depth
    2.4.6 Depth of a Single Point
    2.4.7 Comments
  2.5 Descriptive Statistics of Multivariate Distributions
    2.5.1 Multivariate Median
    2.5.2 Location
    2.5.3 Scatter
    2.5.4 Skewness
    2.5.5 Kurtosis
    2.5.6 Comments
  2.6 Outliers
    2.6.1 What Are Outliers?
    2.6.2 Where Do Outliers Occur?
    2.6.3 Outliers and Normal Distributions
    2.6.4 Nonparametric Outlier Detection
    2.6.5 Convex Hull vs. Outliers
    2.6.6 Comments: Outliers in Massive Data
  2.7 Streaming Data
    2.7.1 Univariate Sequential Quantile Estimation
    2.7.2 ε-Approximation
    2.7.3 Sliding Windows
    2.7.4 Streaming Data in Astronomy
    2.7.5 Comments
Chapter 3  Jackknife Approach to Model Selection
  3.1 Introduction
  3.2 Preliminaries
  3.3 Jackknife Maximum Likelihood Estimator
  3.4 Jackknife Information Criterion (JIC)
  3.5 When Candidate Models are Nested
  3.6 Discussions
Chapter 4  Convex Hull Peeling Depth
  4.1 Introduction
  4.2 Preliminaries
  4.3 Statistical Depth
  4.4 Convex Hull Peeling Functionals
    4.4.1 Center of Data Cloud: Median
    4.4.2 Location Functional
    4.4.3 Scale Functional
    4.4.4 Skewness Functional
    4.4.5 Kurtosis Functional
  4.5 Simulation Studies on CHPD
  4.6 Robustness of CHPM - Finite Sample Relative Efficiency Study
  4.7 Statistical Outliers
  4.8 Discussions
Chapter 5  Convex Hull Peeling Algorithms
  5.1 Introduction
  5.2 Data Description: SDSS DR4
  5.3 Multivariate Median
  5.4 Quantile Estimation
  5.5 Density Estimation
  5.6 Outlier Detection
    5.6.1 Quantile Approach
    5.6.2 Detecting Shape Changes Caused by Outliers
    5.6.3 Balloon Plot
  5.7 Estimating the Shapes of Multidimensional Distributions
    5.7.1 Skewness
    5.7.2 Kurtosis
  5.8 Discoveries from Convex Hull Peeling Analysis on SDSS
  5.9 Summary and Conclusions
Chapter 6  Concluding Remarks
  6.1 Summary and Discussions
    6.1.1 Jackknife Model Selection
    6.1.2 Unspecified True Model and Non-Nested Candidate Models
    6.1.3 Convex Hull Peeling Process
    6.1.4 Multivariate Descriptive Statistics
    6.1.5 Multivariate Outliers
    6.1.6 Color Distributions of Extragalactic Populations
  6.2 Future Works
    6.2.1 Clustering with Convex Hull Peeling
    6.2.2 Voronoi Tessellation
    6.2.3 Additional Methodologies for Outlier Detection
    6.2.4 Testing Similarity of Distributions
    6.2.5 Breakdown Point
    6.2.6 CHPD Is Not Continuous
    6.2.7 Sloan Digital Sky Survey Data Release Five
    6.2.8 Additional Topics
Bibliography
List of Figures
3.1 An illustration of model parameter space: g(x) is the unknown true model and ΘA, ΘB, Θ1, and Θ2 are parameter spaces of different models. Note that Θ1 ⊂ Θ2 and θg is the parameter of the model that minimizes the KL distance more than the parameters of the other models (θa, θb, and θ1). The black solid line and the blue dashed line indicate Θ1, of which detail will be given in the text.
4.1 Bivariate samples of different shapes put into the convex hull peeling process. Clockwise, from the upper left panel, samples of size 1000 were generated from a bivariate uniform distribution on a square, a bivariate uniform distribution on a triangle, a bivariate Cauchy distribution with correlation 0.5, and a bivariate normal distribution with correlation 0.5.
5.1 The sample size is 10^6 from a standard bivariate normal distribution and the target quantiles are {0.99, 0.95, 0.9, ..., 0.1, 0.05, 0.01}. For the sequential method, we took 1000 data points at a time and estimated quantiles. Each quantile has 1000 corresponding convex hull peeling level sets, which are combined and peeled until about half of the data points are left. Blue indicates quantiles of the non-sequential method and red indicates quantiles of the sequential method.
5.2 The sample size of both examples is 10^4 from a standard bivariate normal distribution and a bivariate t5 distribution with ρ = −0.5. The target quantiles are {0.99, 0.95, 0.9, ..., 0.1, 0.05, 0.01}. Densities are estimated by means of eq. (5.2), from which the empirical density profile is built.
5.3 Samples of size 2000 are generated. The primary distribution is the standard bivariate normal, but the panels of the second column show 1% contamination by a standard bivariate normal distribution centered at (4,4). As the convex hull peeling process proceeds, the area of each convex hull layer is calculated.
5.4 Same samples as in Figure 5.3. The blue line indicates the IQR and the red line indicates the balloon expanding the IQR by 1.5 times lengthwise. Points outside this balloon are considered to be outliers.
5.5 Three different bivariate samples are compared in terms of the skewness measure Rj. The sample of the first column was generated from a standard bivariate normal distribution; the sample of the second column from a bivariate normal distribution with correlation 0.5; the sample of the third column was generated from (X, Y) ∼ (N(0, 1), χ²_10), where X and Y are independent.
5.6 Tailweight is calculated based on eq. (5.3). Samples of size 2000 are generated from a uniform, normal, t10, and Cauchy distribution.
5.7 Volume sequence of convex hull peeling central regions from 70204 Quasars in four-dimensional color magnitude space.
5.8 Volume sequence of convex hull peeling central regions from 499043 galaxies in four-dimensional color magnitude space.
5.9 Sequence of the Rj skewness measure from 70204 Quasars in four-dimensional color magnitude space.
5.10 Sequence of the Rj skewness measure from 499043 galaxies in four-dimensional color magnitude space.
List of Tables
4.1 Empirical mean square error and relative efficiency
5.1 Comparison among the mean, coordinate-wise median, and convex hull peeling median of synthetic data sets of size 10^4 and 10^6 from the standard bivariate normal distribution. For the larger sample, we apply the sequential CHPM method.
5.2 Different center locations of the Quasar and galaxy distributions in the space of color magnitudes based on the mean, coordinate-wise median, CHPM, and sequential CHPM. The sequential method has adopted d = 0.05 and m = 10^4.
List of Algorithms
1 Graham scan algorithm
2 Gift wrapping algorithm
3 Beneath-and-beyond algorithm
4 Divide-and-conquer algorithm
5 Incremental algorithm
6 Quickhull algorithm
7 Convex hull peeling median I
8 Sequential convex hull peeling median I
9 Sequential convex hull peeling median II
10 pth convex hull peeling quantile
11 Sequential pth convex hull peeling quantile
Acknowledgments
One baby step of a lifelong journey is about to finish. I was not alone when I made this step. There are people who have constantly encouraged me. Among them, above all, I am very thankful to my advisor Prof. Babu, who has been very patient with me. He gave me the great freedom no graduate student can imagine to explore the sea of statistics, which led me to confront a few barren islands. A small part of this expedition is recorded in this dissertation. I am also very grateful for his brilliant advice and precise comments, which have opened my eyes to the endless possibilities in statistics that I would like to pursue in the years to come. I am very much indebted to my committee members, Professors Hettmansperger and Li, for their advice and support, and to my minor and external committee member, Prof. Jenkins, who offered the most interesting course, Adaptive filter, and inspired my creative thinking. In addition to their support, Prof. Li showed me the beauty of asymptopia through his course, and Prof. Hettmansperger provided very stimulating and encouraging conversations about my research and classical music. Also, I would like to thank Prof. Rosenberger, who was supportive in every aspect as department head; Dr. van den Berk, who showed me how to use SQL for acquiring data from the Sloan Digital Sky Survey; Prof. Rao, whose encouragement and interest in my work helped me to last until my thesis defense; Prof. Harkness, who has encouraged me with his warm heart; and the people at SAMSI, who gave me hope that there is something I can contribute to both the astronomical and statistical communities. Among the people I met at SAMSI, I am very thankful to Pablo, who acknowledged my ability and helped me rebuild the self-esteem I had lost for years. There are also many good friends who deserve more than a few words of gratitude. Without my parents' understanding, patience, and love, I would never have made it through this step. No words can express my appreciation of their love enough. I have never failed
to thank God for having the best parents in the world, who always stand by my side and believe in me.
Among my dear music pieces, Ich habe genug (BWV 82) is flowing in my mind. J.S. Bach, a humble servant to God, transferred his pious mind into this beautiful music. His piety resides in me, whispering to me to be humble and awake for the future. I wish that God sees my dissertation as my gratitude to Him and knows that I am ready for the next step.
Dedication
To my mother, who has taught me women’s sacrifice.
Chapter 1
Introduction: Motivations
This dissertation contains two topics. The first topic is about parametric model selection with the delete-one jackknife method under the maximum likelihood principle. The second topic focuses on statistical properties and algorithms of the convex hull peeling process as a nonparametric multivariate data analysis method. Reviews and general introductions related to these topics will be given in later chapters. In this chapter, the motivations for each subject are discussed.
1.1 Model Selection
One of the challenges in statistical model selection is assessing candidate models when the true data generating model is unspecified. Even though the problem of the unspecified true model is well recognized, in practice traditional information criteria such as AIC, BIC, and their modifications are automatically applied to real problems as if the true model were identified. The misuse of these traditional model selection criteria likely leads to an improper choice of models.

Quite a number of studies have presented the problems of the unspecified or misspecified true model; for example, Huber, 1967[97]; Jennrich, 1969[109]; Nishii, 1988[155]; Render, 1981[169]; Sen, 1979[195]; Sin and White, 1996[210]; Vuong, 1989[225]; and White, 1982[231]. These studies focused on the idea that the model estimation procedure ultimately yields a best approximating model to the unknown true model under various circumstances. However, there has been no previous study of model selection based on minimizing the Kullback-Leibler distance (Kullback and Leibler, 1951) when the candidates are not nested and the true model is unspecified.

When the true model is known, the Kullback-Leibler distance between the true model and a candidate model is estimable. Developing a proper estimation method to minimize this distance has been popular in model selection studies. In general, estimating the Kullback-Leibler distance carries an estimation bias, so the penalty terms beside the likelihood of the candidate model in the equations of well known selection criteria are byproducts of this estimation bias. Bozdogan (1987[34]) criticized the generalization of those penalty terms under the unspecified true model condition.

The status of the true model is very ambiguous. Even though the true model is unspecified, by accident one of the candidate models can coincide with the true model. At least, we might consider one of the candidate models as the best model when it is closer to the unknown true model than the other models in terms of the Kullback-Leibler distance. A model minimizing the distance from the unspecified true model describes the true model at least as well as the other candidates. Note that these candidate models need not be nested in the search for a best model. The parameter space of each model can be different, and we investigate these parameter spaces to find a model that minimizes the distance from the true model. However, bias arises from estimating the distance, since the parameters of a model and the distance are estimated under this unspecified true model. Only when the true model is known can this bias be assessed, which is very unlikely to occur in reality. Therefore, instead of estimating the bias of the Kullback-Leibler distance under an unknown true model, we choose a bias reducing technique to find a model that minimizes the Kullback-Leibler distance.

The jackknife resampling method has been known for its bias reduction property; however, it has lost its fame to other popular resampling schemes such as the bootstrap and cross-validation. Because of this relative unpopularity, no previous application of the jackknife method to developing a model selection criterion, especially a criterion minimizing the Kullback-Leibler distance, can be found. We expect that a candidate model minimizing the Kullback-Leibler distance can be chosen without explicitly estimating the bias, since the jackknife method will reduce it.
The jackknife method will provide the bias-reduced maximum likelihood estimator of a model of minimum distance. The procedure for assessing the bias-reduced maximum likelihood based on the jackknife method will be presented.
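To make the bias reduction idea concrete before the formal development in Chapter 3, the following is a minimal sketch of a generic delete-one jackknife correction of a model's average log likelihood; the Gaussian candidate family, the use of scipy.stats.norm for the maximum likelihood fit, and the function name are illustrative assumptions, and the sketch is not the JIC defined later.

```python
import numpy as np
from scipy.stats import norm

def jackknife_loglik(x):
    """Generic delete-one jackknife of the average log likelihood of a
    Gaussian candidate model fitted by maximum likelihood (a sketch,
    not the JIC of Chapter 3)."""
    n = len(x)
    mu, sigma = norm.fit(x)                        # full-sample MLE
    theta_full = norm.logpdf(x, mu, sigma).mean()  # full-sample statistic
    theta_loo = np.empty(n)
    for i in range(n):
        xi = np.delete(x, i)
        mu_i, sigma_i = norm.fit(xi)               # leave-one-out MLE
        theta_loo[i] = norm.logpdf(xi, mu_i, sigma_i).mean()
    # Standard jackknife bias correction of the plug-in statistic.
    return n * theta_full - (n - 1) * theta_loo.mean()

rng = np.random.default_rng(0)
print(jackknife_loglik(rng.normal(size=200)))
```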
1.2 Convex Hull Peeling
The notion of convex hull peeling as a tool for a location estimator (Tukey, 1974) has been recognized in the statistical community for more than three decades. Until Zuo and Serfling's (2000a) discussion of statistical data depth, however, the convex hull peeling process was hardly used for statistical inference. In Barnett's extensive study (1976) on multivariate data, convex hull peeling was introduced to resolve the complexity of ordering multivariate data. From time to time the convex hull of a given sample has been used for robust estimation and outlier detection. Compared to its long history, relatively little work has been done regarding convex hull peeling.

Compared to other statistical depths, such as the simplicial depth or the half-space depth, the convex hull peeling depth does not have any distribution theory. The technicality involved in the development of limiting distributions of convex hull peeling estimators is quite difficult. Even when samples are generated from a continuous distribution, no continuous functional of convex hull peeling estimators has been reported, which prevents the convex hull peeling process from being adapted for statistical inference. The main advantage of using the convex hull peeling depth is its intuitive appeal. The convex hull peeling depth is rather intuitive and straightforward compared to other data depths. This characteristic of the convex hull peeling depth will be illustrated in later chapters with synthetic data and massive astronomical data from the Sloan Digital Sky Survey. The picturesque property of the convex hull peeling depth has inspired various simulation studies: checking whether the convex hull peeling process measures scatter, skewness, and kurtosis of a distribution; testing whether the convex hull peeling process catches outliers; and checking whether the convex hull peeling process delineates the density profile. These empirical studies are mostly related to finding descriptive statistics and developing exploratory data analysis tools for multivariate distributions.

Since the convex hull peeling depth is defined by a counting process, linking every point with a depth value is simple. In addition, estimating a quantile is just collecting points of the same depth. The contour formed by these points of the same depth is convex, and its volume is easily calculated. Because estimating a quantile leads directly to estimating the volume of the corresponding quantile region, the set of points of the same depth collected through the convex hull peeling process makes it possible to deduce the density profile of a given sample. Also, this volume functional of random points of the same depth provides a mapping from a multi-dimensional quantile to a single value, so that descriptive statistics in the form of one dimensional curves mapped from multivariate data are born from the convex hull peeling process. Additionally, quantile estimation based on the convex hull peeling process can incorporate streaming data algorithms. This flexibility of data management is extremely helpful, especially when massive multivariate data need to be summarized while the memory space for data processing is limited. Sequential quantile estimates based on streaming algorithms and regular quantile estimates do not show much discrepancy. Therefore, we would like to develop space-efficient multivariate descriptive statistics based on an empirical convex hull peeling process.
In conclusion, the convex hull peeling process can easily be transformed into various empirical descriptive statistics as nonparametric tools for multivariate massive data analysis.
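As an illustration of the counting process just described, the following is a minimal sketch of convex hull peeling built on scipy.spatial.ConvexHull; the function name and the convention that depth equals the index of the peeled layer are assumptions for illustration, not the formal definitions of Chapter 4.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_peeling_depths(points):
    """Assign each point the index of the convex hull layer on which it
    lies: layer 0 is the outermost hull, deeper layers get larger indices.
    A sketch for d-dimensional point clouds; degenerate leftovers with
    fewer than d+1 points are handled crudely."""
    pts = np.asarray(points, dtype=float)
    depth = np.full(len(pts), -1)
    remaining = np.arange(len(pts))
    layer = 0
    while len(remaining) > pts.shape[1]:       # need > d points for a hull
        hull = ConvexHull(pts[remaining])
        on_hull = remaining[hull.vertices]     # extreme points of this layer
        depth[on_hull] = layer
        remaining = np.setdiff1d(remaining, on_hull)
        layer += 1
    depth[remaining] = layer                   # innermost leftovers
    return depth

rng = np.random.default_rng(1)
sample = rng.normal(size=(1000, 2))
d = convex_hull_peeling_depths(sample)
print("number of layers:", d.max() + 1)
```

Points of equal depth form the convex contours mentioned above, and their hull volumes give the volume functional used later for density and shape estimation.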
1.3 Synopsis
The remainder of this dissertation is organized as follows. In Chapter 2, literature reviews relevant to the work performed in this dissertation are presented, including the Kullback-Leibler distance, the Akaike information criterion and its variations, the Bayes information criterion, model selection with data resampling methods, convex hull algorithms, statistical depths and their characteristics, multivariate descriptive statistics (median, scatter, skewness, and kurtosis), outliers, and massive streaming data. Chapter 3 discusses model selection with the jackknife method based on the maximum likelihood principle. We prove the consistency of the jackknife maximum likelihood estimator and the asymptotic unbiasedness of the jackknife likelihood. Chapter 4 addresses the convex hull peeling depth in various aspects, beginning with its definition, and proposes associated tools for multivariate nonparametric descriptive statistics.
Chapter 5 introduces various methods illustrating applications of the convex hull peeling process. Further emphasis is given to actual applications to multivariate massive data from an astronomical survey. Finally, the last chapter provides a summary of our studies with concluding remarks, additional work that may follow from this dissertation, and future research suggestions relevant to our current studies.

Chapter 2
Literature Reviews
2.1 Introduction
This chapter presents highlights of some of the research that has guided the studies performed in this thesis. It is divided into two parts: parametric model selection with a resampling method and nonparametric multivariate analysis with computational algorithms. First, some well known information criteria from the maximum likelihood principle are discussed, in addition to model selection criteria based on the bootstrap and cross validation. Then, reviews of nonparametric multivariate analysis follow. Since our primary concern is convex hull peeling for analyzing multivariate data, we present additional discussions of convex sets and convex hull algorithms. We also explain data depth, which allows ordering multivariate data. The possibility of ordering multivariate data leads to nonparametric multivariate location, quantile, and distributional shape estimation, and to outlier detection. Last, due to our interest in massive data and quick analysis tools for them, tools for analyzing streaming data are addressed.
2.2 Model Selection Under Maximum Likelihood Principle
Model selection is a very broad subject in statistics, ranging from variable selection to functional space classification. Model selection problems are found in disciplines other than statistics and have received attention because of their broad applications. Here, we confine our discussion of the statistical model selection framework to the maximum likelihood principle, which is connected to Boltzmann's maximum entropy (1877[32]) and Shannon's information theory (1948[201]). These approaches from physics and information theory are related to the Kullback-Leibler distance, which will be discussed later. Diverse discussions of model selection from a statistical point of view are provided in a manuscript edited by Lahiri (2001[120]) and in books by Kuha (2004[118]), Linhart and Zucchini (1986[126]), McQuarrie and Tsai (1998[145]), and Burnham and Anderson (2002[39]).
2.2.1 Kullback-Leibler Distance
The Kullback-Leibler (KL) distance (Kullback and Leibler, 1951[119]), also called the relative entropy (Cover and Thomas, 1991[55]), is a measure of distance between two distributions. Consider samples X1, X2, ..., Xn drawn from an unknown distribution G whose density function is denoted by g, and let F be the distribution of interest, whose density function is f. Then the KL distance is
D_KL[g; f] = ∫ g(x) log g(x) dx − ∫ g(x) log f(x) dx = E_g[log(g(x)/f(x))],   (2.1)

the discrepancy of assuming F when the true distribution is G. The objective of employing the KL distance in the model selection problem is finding a model of minimum D_KL. Note that the first term on the right-hand side of eq. (2.1) is a constant. Searching for f maximizing the second term,
L_G = ∫ log f(x) dG(x),   (2.2)

plays a key role in model selection. Among the many model selection criteria reflecting KL distance minimization, the Akaike information criterion (AIC, Akaike, 1973[7]), minimum description length (MDL, Rissanen, 1978[176]), and the Bayesian information criterion (BIC, Schwarz, 1978[192]) are the most popular criteria for choosing the best approximating model. AIC is the result of approximating the bias of maximum likelihood estimation, MDL maximizes the entropy of binary coding models, and BIC maximizes the posterior distribution.
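To make eqs. (2.1) and (2.2) concrete, the following is a minimal Monte Carlo sketch that approximates L_G and the KL distance when the true density g is known; the Gaussian choices for g and for the two candidate models f are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # draws from the "true" g

# L_G = E_g[log f(X)] approximated by a sample average (eq. 2.2);
# the candidate with the larger L_G has the smaller KL distance (eq. 2.1).
candidates = {
    "N(0, 1)": norm(0.0, 1.0),
    "N(1, 2)": norm(1.0, 2.0),
}
entropy_term = norm(1.0, 2.0).logpdf(x).mean()      # E_g[log g(X)], a constant
for name, f in candidates.items():
    L_G = f.logpdf(x).mean()
    print(f"{name}: L_G = {L_G:.3f}, KL(g, f) = {entropy_term - L_G:.3f}")
```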
2.2.2 Preliminaries
In practice, model selection is performed on a set of candidate models whose functional characteristics are known. In other words, we can adopt candidate models with known properties for statistical model selection. Such candidate models are subject to regularity conditions so that we can develop model selection methods. Prior to addressing the model selection information criteria, we present some regularity conditions.
Consider an iid sample X1, ..., Xn from an unknown distribution G. Let {f_θ : θ ∈ Θ^p ⊂ R^p} be a family of distributions. Denote the log likelihood of Xi by l_i(θ) = log f_θ(Xi) for i = 1, ..., n, and let L(θ) = Σ_{i=1}^n l_i(θ).

Assumptions:

(A1) The parameter space Θ^p is a p-dimensional space, Θ^p ⊂ R^p, such that l_i(θ) is distinct for each θ ∈ Θ^p; i.e., if θ1 ≠ θ2, then l_i(θ1) ≠ l_i(θ2).

(A2) There exists θ_o in the interior of Θ^p which is the unique solution to E_g[(∂/∂θ)L(θ)] = 0, and such that sup_{θ∈Θ^p : ||θ−θ_o||>ε} (L(θ) − L(θ_o)) diverges to −∞ a.s. for any ε > 0.

(A3) Within a neighborhood of θ_o, the first and second derivatives of l_i(θ) with respect to θ, (∂/∂θ)l_i(θ) and (∂²/∂θ∂θ^T)l_i(θ), are well defined and bounded, i.e. |(∂/∂θ_l)l_i(θ)| < h(x) and |(∂²/∂θ_l∂θ_m)l_i(θ)| < h(x) for all x, where l, m = 1, ..., p. Here, (∂/∂θ_l)l_i(θ) is the lth component of the gradient vector and −(∂²/∂θ_l∂θ_m)l_i(θ) is the (l, m)th component of the Hessian matrix.

(A4) For ψ_θ(x) = (∂/∂θ)l_i(θ) or ψ_θ(x) = (∂²/∂θ∂θ^T)l_i(θ),

(∂/∂θ) ∫ ψ_θ(x) dx = ∫ (∂/∂θ) ψ_θ(x) dx.

(A5) For θ ∈ Θ^p, −E_g[(∂²/∂θ∂θ^T)l_i(θ)] and E_g[(∂/∂θ)l_i(θ)(∂/∂θ^T)l_i(θ)] are positive definite matrices.
In addition, the following lemma will be used quite often in this study.
Lemma 2.1 (Taylor Expansion). For any function h(θ, ·) which is twice differentiable in the neighborhood of θ_o,
h(θ, ·) = h(θ_o, ·) + (∂/∂θ)h(θ_o, ·)(θ − θ_o) + (1/2)(θ − θ_o)^T (∂²/∂θ∂θ^T)h(η, ·)(θ − θ_o),

where η satisfies η = (1 − r)θ + rθ_o for 0 < r < 1.

2.2.3 Akaike Information Criterion

Akaike (1973[7], 1974[8]) first introduced the model selection criterion that chooses a model of the smallest KL distance. His criterion is based on estimating eq. (2.2) and choosing the model with the maximum estimate of eq. (2.2). This expected log likelihood, however, depends on the unknown true distribution G. The candidate parametric model has an unknown parameter θ that is replaced by some estimate θ̂. Among many parameter estimation methods, Akaike used the maximum likelihood estimator θ̂_ML, which maximizes the log likelihood. Note that this maximum likelihood estimation enforces another condition, namely that the true model g belongs to the set of candidate models, so that θ̂_ML is a consistent estimator of θ_o. Still, we need to estimate G to get L_G in (2.2). We estimate L_G empirically by the empirical distribution Ĝ of G:

L_Ĝ(θ̂_ML) = ∫ log f_{θ̂_ML}(x) dĜ(x) = (1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi).   (2.3)

The empirical log likelihood usually overestimates the expected log likelihood (Konishi and Kitagawa, 1996[116]). This excess is the bias,

b_G = E_g[L_Ĝ(θ̂_ML) − L_G(θ̂_ML)] = E_g[(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − ∫ log f_{θ̂_ML}(x) g(x) dx].   (2.4)

The bias-corrected log likelihood becomes

(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − b̂_G,   (2.5)

where b̂_G is the estimate of the bias b_G. In general, a model that minimizes the Kullback-Leibler distance (2.1) or maximizes the bias-corrected log likelihood (2.5) (Konishi, 1999[115]) is chosen as the best approximating model. The Akaike type information criterion (IC) from a sample of size n is defined as follows:

IC(X_n; θ̂_ML) = −2n{(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − b̂_G} = −2 Σ_{i=1}^n log f_{θ̂_ML}(Xi) + 2n b̂_G,   (2.6)

where b̂_G and θ̂_ML are estimators of b_G and θ_o, respectively, and various estimators exist under various conditions (e.g. Akaike, 1973; Hurvich and Tsai, 1989[104]; Konishi and Kitagawa, 2003[114]; and Konishi and Kitagawa, 1996[116]).

Akaike chose θ̂_ML to replace θ_o and then estimated the bias as follows. Under assumptions (A1)-(A5), the bias in eq. (2.6) can be estimated along with the model parameters. From eq. (2.4),

b_G = E_g[L_Ĝ(θ̂_ML) − L_G(θ̂_ML)] = E_g[(1/n) Σ_{i=1}^n log f_{θ̂_ML}(Xi) − ∫ log f_{θ̂_ML}(x) g(x) dx] = (1/n) tr[J_g^{−1} I_g] + O(n^{−2}),

where J_g and I_g are p × p matrices defined by

J_g = −E_g[∂² log f_θ(Xi)/∂θ∂θ^T]_{θ=θ̂_ML}   and   I_g = E_g[(∂ log f_θ(Xi)/∂θ)(∂ log f_θ(Xi)/∂θ^T)]_{θ=θ̂_ML}

(Konishi, 1999[115]). The information criterion defined in eq. (2.6) becomes

TIC = −2 Σ_{i=1}^n log f_{θ̂_ML}(Xi) + 2 tr[Ĵ^{−1} Î]   (2.7)

by replacing J_g and I_g with their estimators, Ĵ and Î. This information criterion (TIC) was addressed in Takeuchi (1976[218]), which was written in Japanese and to which Kitagawa (1997[111]) and Shibata (1989[208]) referred. If the parametric family of distributions contains the true distribution, that is, G = F_{θ_o} (i.e. g = f_{θ_o}) for some θ_o in Θ^p, then tr[J(F_{θ_o})^{−1} I(F_{θ_o})] = p, since

∫ (∂ log f_θ(x)/∂θ)(∂ log f_θ(x)/∂θ^T) f_θ(x) dx = −∫ (∂² log f_θ(x)/∂θ∂θ^T) f_θ(x) dx.

Thus the KL distance based information criterion is simplified to AIC,

AIC = −2 Σ_{i=1}^n log f(Xi|θ̂_ML) + 2p = −2L(θ̂_ML) + 2p.   (2.8)

AIC is well known for its simplicity. Under the regularity conditions (A1)-(A5), AIC is defined as an estimate of minus twice the expected log likelihood of a specific model whose distribution family contains the true model.
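As a concrete illustration of eq. (2.8), the following is a minimal sketch that computes AIC for two non-nested candidate families fitted by maximum likelihood; the gamma and lognormal candidates and the scipy.stats fitting calls are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gamma, lognorm

def aic(dist, data, n_params, **fit_kwargs):
    """AIC = -2 * maximized log likelihood + 2p (eq. 2.8)."""
    params = dist.fit(data, **fit_kwargs)
    loglik = dist.logpdf(data, *params).sum()
    return -2.0 * loglik + 2.0 * n_params

rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=3.0, size=500)

# Two non-nested candidates; the smaller AIC is preferred.
print("gamma    :", aic(gamma, data, n_params=2, floc=0))
print("lognormal:", aic(lognorm, data, n_params=2, floc=0))
```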
By the maximum likelihood principle, AIC takes a very simple form based on the sample log likelihood and the number of parameters. Nonetheless, drawbacks of AIC such as overfitting and inconsistency have been reported (Hurvich and Tsai, 1989[104]; McQuarrie et al., 1997[146]). Quite a few modifications have been developed under certain circumstances. Some of them are discussed below.

Consider a regression model

y = Xβ + ε,   (2.9)

where X is an n × p design matrix and β is a p × 1 vector of unknown parameters, β = (β1, ..., βp). Respectively, p and n indicate the number of predicting variables and the sample size. The errors in the model, ε = (ε1, ..., εn)^T, are iid random variables such that ε ∼ N(0, σ²I). The log likelihood of this model is

L(β, σ²) = −(1/2) n log σ² − (y − Xβ)^T(y − Xβ)/(2σ²).

The maximum likelihood estimates are β̂ = (X^T X)^{−1} X^T y and σ̂² = (y − Xβ̂)^T(y − Xβ̂)/n. Suppose that the true model of (2.9) is

y = X_o β_o + ε_o.   (2.10)

In eq. (2.10), X_o is an n × p_o design matrix, β_o is (β_{1,o}, ..., β_{p_o,o}) with p_o parameters, and ε_o = (ε_{o1}, ..., ε_{on})^T are iid normal random variables with mean zero and variance σ_o²I. Let the vectors of parameters be θ = (β, σ) and θ_o = (β_o, σ_o). The measure of eq. (2.1) between the true model (2.10) and a candidate model (2.9) is given by

D_KL[θ_o; θ] = 2E_o[L(β_o, σ_o²) − L(β, σ²)]
            = n(log σ² − log σ_o²) + nσ_o²/σ² + (X_oβ_o − Xβ)^T(X_oβ_o − Xβ)/σ² − n
            = n{σ_o²/σ² − log(σ_o²/σ²) − 1} + ||X_oβ_o − Xβ||²/σ².   (2.11)

The property of an approximating model that minimizes E_o[L(β_o, σ_o²) − L(β, σ²)] does not change when it is multiplied by the positive constant 2 for defining an information criterion.

Hurvich and Tsai (1989[104]) derived the corrected AIC from the above regression model (2.9), especially when the true model is nested in the fitted models. A model is nested when it is a subset of the other model. For the above linear regression model, nesting is equivalent to {β_o} ⊂ {β}, where {β_o} is the parameter space of β_o and {β} is the parameter space of β. The model selection criterion from eq. (2.11) turns out to minimize the following:

IC = E_o[n log(σ̂²/σ_o²) + nσ_o²/σ̂² + ||X_oβ_o − Xβ̂||²/σ̂²]
   = E_o[n log σ̂²] + n²/(n − p − 2) + np/(n − p − 2)
   = E_o[n log σ̂²] + 2n(p + 1)/(n − p − 2) + n.   (2.12)

Here, the result that IC takes a simpler form compared to D_KL[θ_o; θ] (2.11) is attributed to the fact that n log σ_o² and n are treated as constants. Intentionally, the constant terms are omitted. Therefore, the corrected AIC (AICc) is

AICc = n log σ̂² + 2n(p + 1)/(n − p − 2).   (2.13)

The last term in (2.13) serves as a penalty for model selection. Hurvich and Tsai (1989) have shown that AICc works better than AIC in small samples and is an unbiased estimator of the Kullback-Leibler information. Their AICc, nonetheless, overfits as the sample size increases, since AIC and AICc have asymptotically equivalent penalty terms. It is also known that AICc is valid only when candidate models are nested (Murata et al., 1994[150]). Cavanaugh (1997[45]) presented a derivation that unifies AIC and AICc under the regression framework. McQuarrie et al. (1997) proposed AICu as an approximately unbiased estimator of the Kullback-Leibler information to correct the overfitting tendency of AICc. The term n log σ̂² in AICc (2.13) is substituted by n log σ̃² to obtain an improved alternative model selection criterion,

AICu = n log σ̃² + 2n(p + 1)/(n − p − 2),

where σ̃² = nσ̂²/(n − p − 1).
In practice, σ̃² is replaced by s² = (n − p − 1)σ̃²/(n − p), which is an unbiased estimator of σ_o². This modified model selection criterion becomes

AICu = n log s² + 2n(p + 1)/(n − p − 2).   (2.14)

Various AIC type information criteria have been developed since, although their application is limited to model selection in linear models or time series models. These models are nested within other candidate models and contain the true model, which imposes a limitation in application because of the chance that the true model is unknown and the candidate models are not nested.

2.2.4 Bayes Information Criterion

In Bayesian statistics, a model is chosen when it attains the maximum Bayes factor compared to other models. On the other hand, if choosing one model is not of interest, Bayesian model averaging serves as a model selection method in Bayesian statistics (Kass and Wasserman, 1995[110]; Wasserman, 2000[229]). Compared to model averaging or the Bayes factor, the Bayesian information criterion (BIC) did not come from a standard Bayesian methodology, but it retains its name because it chooses a model of maximum posterior probability. BIC was initiated from a very different point of view than AIC. As Reschenhofer (1996[172]) pointed out, AIC pursues an optimal criterion for prediction, whereas BIC is designed to find the model with the highest probability. Schwarz (1978[192]) first proposed BIC by approaching model selection from the Bayesian perspective. The criterion chooses the model with the largest posterior probability under certain assumptions. One of the assumptions is that the data were generated from an exponential family of distributions. Equal priors, additionally, are assumed so that the criterion has a very simple form. Suppose that we observe data D and have candidate models M_k with parameters θ_k. Then the posterior model probability is

p(D|M_k) = ∫ p(D|θ_k, M_k) p(θ_k|M_k) dθ_k,   (2.15)

where Schwarz (1978) assumed the Koopman-Darmois family for the likelihood p(D|θ_k, M_k) and equal values for the prior probability p(θ_k|M_k). When we take the Laplace approximation of eq. (2.15) around θ̂_ML, the maximum likelihood estimator of the parameter vector in the likelihood function, and drop terms that are constant with respect to the given model M_k (e.g. the number of parameters), we obtain BIC as follows,

BIC = −2L(θ̂_ML) + p_k log n,   (2.16)

where p_k is the number of parameters in M_k and n is the sample size. The log likelihood L(θ̂_ML) is the posterior log likelihood of the given data. Detailed discussions of the BIC derivation are partly found in Cavanaugh and Neath (1999b[44]), Neath and Cavanaugh (1997[153]), and Kuha (2004[118]). Zhang (1993[237]) showed that BIC has a better convergence rate than AIC for linear regression models with normally distributed errors. Additionally, he argued that BIC has a nearly optimal convergence rate. Because of this strong convergence in choosing a correct model, BIC is preferred to AIC; however, this strong convergence only holds when the models are linear and the error distribution is normal. Despite its popularity due to its parsimony and consistency in model selection, McQuarrie (1999[143]) reported that BIC has a tendency to overfit for small samples. As with the small sample correction of AIC (Hurvich and Tsai, 1989[104]), a similar small sample correction has been attempted on BIC. McQuarrie (1999[143]) proposed a signal-to-noise corrected BIC (BICc) with a stronger small sample penalty term.
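To illustrate eqs. (2.13) and (2.16) side by side, the following is a minimal sketch that evaluates AIC, AICc, and BIC for Gaussian linear regression models of increasing size; the simulated design, the nested candidate subsets, and the parameter counting convention (adding one for σ²) are assumptions made for illustration.

```python
import numpy as np

def gaussian_regression_criteria(X, y):
    """Return (AIC, AICc, BIC) for a Gaussian linear model y = X beta + e,
    using the ML variance estimate sigma2_hat = RSS / n."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n
    # -2 * maximized log likelihood, dropping the 2*pi constant.
    neg2loglik = n * np.log(sigma2) + n
    aic = neg2loglik + 2 * (p + 1)                               # +1 for sigma^2
    aicc = n * np.log(sigma2) + 2 * n * (p + 1) / (n - p - 2)    # eq. (2.13)
    bic = neg2loglik + (p + 1) * np.log(n)                       # eq. (2.16)
    return aic, aicc, bic

rng = np.random.default_rng(0)
n = 50
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

for k in range(1, 6):          # nested candidates with the first k columns
    print(k, gaussian_regression_criteria(X_full[:, :k], y))
```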
The proposed BICc is

BICc = n log σ̂² + np log n/(n − p − 2),

where n is the sample size, p is the number of parameters, and σ̂² is the maximum likelihood estimator of the variance of the error distribution. Needless to say, this criterion was developed under the linear regression framework with iid normal errors. Asymptotically, BICc is equivalent to BIC but is more consistent for small samples.

2.2.5 Minimum Description Length

Another approach to model selection from the information theoretic point of view is minimum description length (MDL). MDL was first proposed by Rissanen (1978[176], 1986[174]) to obtain an optimal coding algorithm. MDL incorporates principles very similar to those of AIC type model selection criteria. Mainly, MDL finds a model that minimizes the description length of codewords through the maximum likelihood principle (Rissanen, 1983[175]). It measures the log likelihood and the stochastic complexity of a model. When the source distribution is known, Shannon's coding theorem dictates that we use the source distribution for the optimal code, and its code length is the negative logarithm of the data probability. When the source distribution is unknown, the search for a universal modelling procedure is the search for a universal coding scheme. Let F_θ be a candidate distribution function with parameter θ and G be the true distribution. Since MDL was developed from coding theory, F_θ and G are generally considered to be models of discrete distributions, especially binary models. In coding theorems, the length of a codeword x is written as l(x) = −log G(x), where log denotes the logarithm base two (here, two comes from binary coding, and the base of the log can be another integer). Yet, the true codeword generating model is hidden from signal decoders. One wishes to obtain a criterion that gives a minimal length discrepancy from the true distribution by estimating a candidate model F_θ(x). The Shannon source coding theorem (Shannon, 1948[201]) proved that the expected code length is bounded below, which can easily be proved by the Kraft inequality,

Σ_{∀x} 2^{−l(x)} ≤ 1,   (2.17)

where ∀x indicates the set of all codewords and l(x) is the length of each codeword. This theorem illustrates that no coding algorithm is able to achieve a shorter expected codeword length than the entropy, H(G) = E_G[−log G(X)] (Cover and Thomas, 1991[55]). Hence, the aim of MDL is minimizing the distance between −log F_θ(x) and −log G(x) (Barron et al., 1998[20]). Furthermore, it is important to notice that the expected distance between the estimated code length and the
(2.18) is simplified to 1 Length(x, θˆML)= −log Fˆ (x)+ log |H(θˆML)| + Cn, (2.19) θML 2 where θˆML is the maximum likelihood estimator of θ, H(θˆML) is the Hessian matrix nd (minus of the 2 derivative of the log likelihood), and Cn(x) is a some constant with respect to θ. The second term, the Hessian matrix is interpreted as a complexity measure. It is known that n−p|H(θˆ)| → α as n → ∞. Note that α is a constant and p denotes the dimension of θ. As sample size n increases, the sensitivity of functional form of candidate models gradually subsides (Myung, 2000[151]; Qian and K¨unsch, 1998[166]). Hence, MDL is sensitive to the number of parameters in a model as well as the model complexity whose value depends upon the functional form of the model. As a result, MDL becomes almost equivalent to BIC, 1 MDL = −log Fˆ (x)+ plog n. (2.20) θML 2 Further rigorous studies on MDL are found in Barron et.al. (1998[20]), Clarke and Barron (1990[54]), Hansen and Yu (2001[84]), Qian and K¨unsch (1998[166]), Rissanen (1986[174], 1987[173]) and Yang and Barron (1998[234]). 18 2.2.6 Resampling in Model Selection Alternatively, data resampling methods were adopted in model selection studies to estimate the KL distance. We will leave the jackknife method to the next chapter. In summary, the cross validation method has been proved to be equivalent to AIC asymptotically (Stone, 1977[215] and Amari et.al., 1997[11]) and the bootstrap methods are not efficient nor successful in terms of consistency of selecting models (Shao ,1996[202]; Chung et.al.,1997[53]; Ishiguro et.al., 1997[107]; and Shibata, 1997[207]). To show the asymptotic equivalence between AIC and cross-validation, we re- place the assumption (A2) by (A2’). We follow the notations from Assumptions and denote the log likelihood of all observation L(θ) = i=1 li(θ) and the log th likelihood without i observation L−i(θ) = j6=i lj(θ), whereP li(θ) = log fθ(Xi). Assumption: P ˆ ˆ ∂ ˆ ∂ ˆ (A2’) For θ and θ−i such that ∂θ L(θ)= ∂θ L−i(θ−i)=0, p p ||θˆ− θo|| −→ 0 and max ||θˆ−i − θo|| −→ 0, 1≤i≤n ∂ if a unique θo satisfies Eg[ ∂θ li(θo)] = 0. Then, cross validation information criterion with one validation sample (CV1) is defined as n CV 1= −2 li(θˆ−i). (2.21) i=1 X As AIC, CV1 measures log likelihood with maximum likelihood principle. Under the conditions (A1),(A2’),(A3)-(A5), asymptotic equivalence between CV1 and AIC is proved by the following; From the definition of θˆ and θˆ−i, we have ∂ ∂ ∂2 l (θˆ ) = l (θˆ)+ l (θˆ)(θˆ − θˆ)(1 + o (1)) ∂θ i −i ∂θ i ∂θ∂θT i −i p ∂ ∂ l (θˆ ) = L(θˆ ) ∂θ i −i ∂θ −i ∂ ∂2 = L(θˆ)+ L(θˆ) (θˆ − θˆ)(1 + o (1)) ∂θ ∂θ∂θT −i p ˆ ˆ = −Jˆ(θ−i − θ)(1 + op(1)) (2.22) 19 ˆ ∂2 ˆ In eq. (2.22), J refers to the Hessian matrix, − ∂θ∂θT L(θ). Above results are substituted in the following; ∂ ∂2 l (θˆ ) = l (θˆ)+(θˆ − θˆ)T l (θˆ)+(θˆ − θˆ)T l (θˆ)(θˆ − θˆ)(1 + o (1)) i −i i −i ∂θ i −i ∂θ∂θT i −i p ∂ ∂ = l (θˆ) − l (θˆ )Jˆ−1 l (θˆ) (1 + o (1)) i ∂θT i −i ∂θ i p ∂ ∂ = l (θˆ) − l (θˆ)Jˆ−1 l (θˆ) (1 + o (1)) i ∂θT i ∂θ i p ∂ ∂ = l (θˆ) − tr l (θˆ) l (θˆ)Jˆ−1 (1 + o (1)). i ∂θ i ∂θT i p Therefore, CV 1 = −2 li(θˆ−i) i X −1 = −2L(θˆ)+2tr(IˆJˆ )(1 + op(1)) (2.23) ˆ n ∂ ˆ ∂ ˆ where I indicates i=1 ∂θ li(θ) ∂θT li(θ). If the conditionP that the true model is contained in the set of candidate models holds, then tr(IˆJˆ−1)= p. In this case, p is the number of parameters, then (2.23) is asymptotically equivalent to AIC (Stone, 1977[215]). Without knowing the true model, still (2.23) shares the asymptotical identity with TIC (Shibata, 1989[208]). 
Shao (1993[204]) showed that the popular leave-one-out cross validation method is asymptotically inconsistent in terms of the probability of choosing the best predicting model. The probability does not converge to 1 as sample size increases. Also, CV1 was conservative enough to pick an unnecessarily large model. Instead of leave-one-out, leave-nv-out cross validation (nv > 1) with nv/n → 1 as n →∞ was suggested to rectify the inconsistency of CV1. In addition to cross validation, bootstrap resampling principle was quite often studied for model selection. Here, the focus of the study is measuring Kullback- Leibler discrepancy although the argument of bootstrapping can be extended to other types of discrepancy (to name a few of those extensions, see Breiman, 1992[35]; Efron, 1983[67]; Hjorth 1994[93]; Linhart and Zucchini, 1986[126]; and Shao, 1996[202]). Asymptotic approximation of bootstrap is too complicated to be evaluated ana- 20 lytically. A few studies showed that bootstrap version of model selection criterion is functionally equivalent to AIC despite its computational complexity (Chung et.al., 1996[53]; Shibata, 1997[207]; and Ishiguro, 1997[107]). According to Chung et.al. (1996[53]), the bootstrap method has a downward bias which is roughly equivalent to the number of parameters of a candidate model like AIC. This state- ment implies that bootstrap model selection gives reasonable bias correction. Shi- bata (1997) confirmed this idea more rigorously through the investigation of several bootstrap type estimates. In addition, the advantage of using a bootstrap estimate is that it is free from expansion while AIC or other similar criteria are based on an expansion with respect to parameters. Often the calculation of analytic estimates, the gradient Iˆ and the Hessian Jˆ, becomes complicated. According to Ishiguro et.al. (1997[107]), the advantages of bootstrap model selection are unbiasedness and its independence of models. 2.2.7 Other Criteria and Comments The model selection criteria discussed so far are most popular model selection criteria in terms of finding a model that minimizes the KL distance. On the other hand, they have shortcomings since they are developed under specific conditions. There are more notable model selection criteria that allow to choose a model, minimizing the KL distance, like H-Q criterion (Hannan and Quinn, 1979[83]) and model selection with data oriented penalty (Rao and Wu, 1989[167] and Bai et.al., 1999[16]). The information criterion by Hannan and Quinn (1979) was developed from choosing a correct order of a time series model by adapting the law of iterative logarithm. Rao and Wu (1989) presented a general form of penalties to obtain a consistent process for model selection in the regression framework. Other directions have been taken like Konishi and Kitagawa (1996[116]), who developed the generalized information criterion (GIC) to loose assumptions on popular information criteria like AIC and BIC. Irizarry (2001[106]) and Agostinelli (2002[6]) introduced a weighted version of these popular information criteria. Ca- vanaugh (1999a[43]) used the symmetric Kullback-Leibler distance, which makes their information criterion more consistent because the number in the penalty term has changed from 2 to 3. 21 In spite of numerous model selection methods, as Burnham and Anderson (2002[39]) emphasized, no model selection method is superior to the other meth- ods. 
The most important process is understanding data prior to applying any statistical model selection criterion. 2.3 Convex Hull and Algorithms Since R´enyi and Sulanke (1963[170], 1964[171]) and Efron (1965[68]), convex hulls have received spotlights for decades. This section provides foundations and basic definitions related to convex hull and its algorithms for statistical multivariate data analysis, which will be introduced in Chapter 4 and Chapter 5. Frequently referred computational convex hull algorithms are introduced to assist the understanding of the convex hull process. 2.3.1 Definitions Prior to presenting the definition of the convex hull, we need the definition on convex sets. Wikipedia1 provides a good summary on convex sets. At the moment, we give a simple definition. Definition 2.1 (Convex Set). A set C ⊆ Rd is convex if for every two points x, y ∈ C the whole segment xy is also contained in C; in other words, the point tx + (1 − t)y belongs to C for every t ∈ [0, 1]. (Klee, 1971[112]). Note that the intersection of an arbitrary family of convex sets is a convex set as well. Extensive studies on convex sets are found in Rockafellar(1970[177]) and Valentine(1976[222]). Here, we give the definition of the convex hull. Definition 2.2 (Convex Hull). The convex hull of a set of points X in Rd denoted by conv(X), is the intersection of all convex sets in Rd containing X. In other words, conv(X), the convex hull of X is the smallest convex domain in Rd con- taining X. In descriptions of algorithms, a convex hull indicates points of a shape invariant minimal subset of conv(X), or extreme points of X. Connecting these points produces a d dimensional wrap around conv(X). 1http://en.wikipedia.org/wiki/Convex Set 22 There are numerous studies related to convex hull. Regarding the fundamen- tal and basic notions of the convex hull, readers may refer to the content from Wikipedia2. Modern mathematical and probabilistic studies on the convex hull were initiated by R´enyi and Sulanke (1963, 1964) and Efron (1965), who derived the expected number of vertices, areas, and perimeters of convex hulls, formed by samples generated from well known distributions. Their works were based on the relation among vertices, edges, and areas in 2 and 3 dimensional geometry dis- covered by Euler. Afterwards, convex hull studies have been focused on assessing limiting distributions by estimating expected number of vertices and volumes, and their variances for central limit theorems when sample were generated from convex body distributions. In a series of works by Hueter (1994[102], 1999a[100], 1999b[101]), central limit theorems for the number of vertices, perimeter, and area (Nn, Ln, An by notation respectively) of the convex hull of n iid random vectors, sampled from spherical distributions including the normal distribution. Her main tools are martingales and poisson approximation of the point process of convex hull vertices in addition to a geometric application of the renewal theorem. Hsing (1994[95]) studied the central limit theorem of convex hull areas from samples of uniform planar distributions. Approximating the process of convex hull vertices by the process of extreme points of a poisson point process lead Groeneboom (1988[80]) to a central limit theorem for the number of convex hull vertices of a uniform sample from the interior of a convex polygon. 
Cabo and Groeneboom (1994[40]) derived limit theorems for the boundary length and for the area of the convex hull, extending results of R´enyi and Sulanke (1963). A very useful and extensive review of the characteristics of limiting distributions of convex hull is found in Schneider (1988[194]) despite years passed since its publication. Moreover, results from Mass´e(1999[141], 2000[140]) provide interesting insights of convex hull studies. Suppose that a sequence of iid random variables in Rd is defined on a probability space (Ω, A,P ). Let Cn be the convex hull of {X1, ..., Xn}, Nn be the number of vertices of Cn, and Mn be 1 − P (Cn). Note that Nn and Mn are random variables as well. Mass´e(2000) stated that 1. Most of the properties of Cn,Nn, and Mn depend tightly on P and 2http://en.wikipedia.org/wiki/Convex hull 23 2. Lots of studies on expected values of Mn and Nn (E(Mn) and E(Nn)) exist but their variances are hardly known. He showed that a traditionally accepted conjecture: E(Nn) →∞ implies Nn/E(Nn) → 1 in probability, is wrong. Mass´e(1999) pointed out no trivial distribution free bound for the variance of Nn. There is a different approach to get the limiting behavior of convex hull func- tionals by adapting support functions (Eddy and Gale, 1981[62]). Eddy (1980[63]) showed that the convex bodies can be embedded via a support function into the space of continuous functions on the unit sphere so that the distribution of the convex hull can be described by the distribution of a continuous function. Based on the mapping through support functions, Eddy and Gale (1981) characterized the limiting behavior the support function of the convex hull from samples in Rd. However, Groeneboom (1988) criticized using support functions because of the fact that characterization by support functions does not give the type of information one needs to develop a distribution theory for the point process of vertices of a convex hull. Some other directions in convex hull studies, deviated from obtaining limiting distributions, have been taken for statistical inferences although these studies are rarely found. Bebbington (1978) utilized the convex hull to trim data for robust correlation estimation. Tukey in Huber (1972) introduced the convex hull for robust estimation of multivariate data center. Barnett (1976) made a debut of ordering multivariate data by peeling convex hulls. Chakraborty (2001[47]), Zani et.al. (1998[236]), and Miller et.al. (2003[147]) discussed the possibility of outlier detection with the convex hull. Before we move to convex hull algorithms, we want to remind the notion of a convex hull in computer science. In geometry, the convex hull is the smallest convex set containing given points. Consider a point x of a convex set C which is called an extreme point of C provided that x is not an inner point of any segment in C. In computational science, the collection of these extreme points indicates a convex hull instead of the whole supporting space. This dual notions of a convex hull are alternatively used without any prior notification. Two notions of the convex hull: • In geometry, the smallest convex set containing given points 24 • In computer science, extreme points that comprise the boundary of the con- vex hull Next, we will discuss algorthms finding convex hulls. 2.3.2 Convex Hull Algorithms In mathematics and computer science, various algorithms, pawning the convex hull of given data, have developed. Books by de Berg et.al. 
(2000[28]), Edelsbrun- ner (1987[66]), Goodman and O’Rourke (2004[76]), Laszlo (1996[121]), O’Rourke (2000[159]), Preparata and Shamos (1985[163]), and others have illustrated various convex hull algorithms. In this section, we introduce most of well known convex hull algorithms. Some algorithms are restricted to two dimensional data. Some al- gorithms contain studies on computational complexity, a critical factor in analysis of large data sets. Constructing a convex hull involves sorting, causing at least O(nlog n) time for operation. In two dimension, it is reported that nlog n is a lower bound on the complexity of any algorithm that find the hull (O’Rourke, 2001[159]). As the data dimension increases, we can expect that the computation complexity will grow faster than nlog n. If data sets are huge, say tera- or peta-bytes in multivariate forms, the sorting process delays any fast computation. This relative slowness in computation is the sacrifice of using nonparametric methods in statistical analysis. The convex hull of a set of points is the smallest convex set that contains all the points. In the plane, we can visualize the convex hull as a stretched rubber band surrounding the points that, when released, takes a polygonal shape. This convex polygon is the convex hull and the extreme points in the hull are the vertices of the convex polygon. The convex hull algorithms collect points touching this rubber band so that the designing such rubber bands has been the primary interest of developing algorithms. Additional terminologies have to be introduced to state convex hull algorithms. The two dimensional and three dimensional convex hull is called a polygon and polyhedron, respectively. A higher dimension convex hull is a polytope but the visualization of a high dimensional (d> 3) convex hull is very restricted compared to two to three dimensional convex hulls. For description purposes, the (d−1)-dim 25 faces of a d-dimensional polytope are called facets, the (d − 2)-dim faces of a d dimensional polytope are ridges, and points are called vertices; for example, in the familiar three dimensional simple tetrahedron (simplex), triangles are facets and edges are ridges so that in total, we have 4 facets, 6 ridges, and 4 vertices. From Berg et.al. (2000[28]) the following relation among vertices, edges, and facets is found, Let P be a convex polytope with n vertices. The number of edges of P is at most 3n − 6 and the number of facets of P is at most 2n − 4. This statement only provides the upper bounds and no studies are found that provide exact numbers and relationships of vertices, ridges, and facets when the data come from higher dimension (d> 2). Graham Scan Algorithm Graham published this algorithms in 1972[77]. The Graham scan algorithm only requires arithmetic operations and comparisons. It is also known that no explicit conversion to polar coordinates (for sorting) is necessary (Bayer, 1999[25]). Eddy (1977a[64], 1977b[65]) showed that this algorithm requires at least O(nlog n) oper- ations to determine the vertices due to sorting points. These discoveries are based upon two dimensional data. The algorithm stated in Algorithm 1 is adapted from O’Rourke (2001[159]). Algorithm 1 Graham scan algorithm Find the interior point x; label it po Sort all other points angularly about x; label p1, ..., pn−1 Stack S =(p1,p2)=(pt−1,pt); t indexes top. 
i ← 3
while i < n do
  if pi is left of (pt−1, pt) then
    Push(S, i) and increment i
  else
    Pop(S)
  end if
end while

Gift Wrapping Algorithm
Chand and Kapur (1970[49]) introduced this algorithm for two dimensional polygons, but it was soon extended to higher dimensional data (Laszlo, 1996[121] and O'Rourke, 2001[159]). For two dimensional data, the complexity is O(nh) if the hull has h edges, so that the worst case complexity is O(n^2). The statement of the gift wrapping algorithm in Algorithm 2 is adapted from O'Rourke (2001); this version applies to both two and three dimensional data.

Algorithm 2 Gift Wrapping Algorithm
  Find the lowest point (smallest y-coordinate)
  Let i0 be its index, and set i ← i0
  repeat
    for each j ≠ i do
      Compute the counterclockwise angle θ from the previous hull edge.
    end for
    Let k be the index of the point with the smallest θ
    Output (pi, pk) as a hull edge
    i ← k
  until i = i0

Jarvis' March Algorithm
Jarvis (1972[108]) reported this algorithm, a special case of gift wrapping limited to two dimensional data. The algorithm calculates angles from a point and connects it to the next point with the smallest angle. The reported running time is O(nN), where N is the number of vertices on the convex hull, so that the worst case running time is O(n^2). Eddy (1977a[64]) commented that for some configurations of points in the plane with a small number of vertices N, Jarvis' march is faster than the Graham scan algorithm.

Beneath-and-Beyond Algorithm
This algorithm was introduced by Chazelle (1985[48]) with a claim of optimality. In two dimensions, the space and time complexities are O(n) and O(n log n), respectively. Additionally, the worst case complexity is $O(n^{\lfloor (d+1)/2 \rfloor})$, so the algorithm is optimal for even dimensions. The name comes from the fact that one point at a time is added to an already constructed convex hull. The algorithm described in Algorithm 3 is adopted from Edelsbrunner (1987).

Algorithm 3 Beneath and beyond algorithm
  Sort the n points with respect to their x1-coordinates
  Relabel the points; (p1, p2, ..., pn) is the sorted sequence.
  Construct P = conv{p1, p2, p3}
  [Iteration:] Complete the construction point by point.
  for i := 4 to n do
    Add point pi to the current convex hull,
    Update the representation of Pi−1 = conv{p1, p2, ..., pi−1} to Pi = conv{p1, ..., pi−1, pi}.
  end for

Divide-and-Conquer Algorithm
This algorithm was first introduced by Preparata and Hong (1977[164]). The divide and conquer algorithm works for data of general dimension. The time complexity for two and three dimensional data is reported as O(n log n) (Edelsbrunner, 1987[66]). The time complexity for 4 dimensional data is known to be $O((n+N)\log^2 N)$ and for 5 dimensional data $O((n+N)\log^3 N)$, where n is the sample size and N is the number of vertices. However, it is known that the merging process is very difficult when the data dimension is more than two. The divide-and-conquer algorithm for two dimensions can be stated simply as follows.
1. Sort the points by x coordinate
2. Divide the points into two sets A and B, A containing the left upper n/2 points and B the right lower n/2 points
3. Compute the convex hulls A = conv(A) and B = conv(B) recursively.
4. Merge A and B: compute conv(A ∪ B)
The higher dimensional (d > 2) divide-and-conquer algorithm stated in Algorithm 4 is adopted from O'Rourke (2001).
Algorithm 4 Divide and conquer algorithm
  a ← rightmost point of A
  b ← leftmost point of B
  while T = ab not lower tangent to both A and B do
    while T not lower tangent to A do
      a ← a + 1
    end while
    while T not lower tangent to B do
      b ← b + 1
    end while
  end while

Incremental Algorithm
This algorithm is presented by O'Rourke (2001[159]) and stated in Algorithm 5 for a three dimensional sample. The total complexity is O(n^2), since the complexity of finding faces and edges is proportional to n and these linear loops of searching faces and edges are embedded inside a loop that iterates n times (Thm 4.1.1., O'Rourke, 2001).

Algorithm 5 Incremental Algorithm
  Initialize H4 to the tetrahedron (p0, p1, p2, p3).
  for i = 4, ..., n − 1 do
    for each face f of Hi−1 do
      Compute the volume of the tetrahedron determined by f and pi.
      Mark f visible iff volume < 0
    end for
    if no faces are visible then
      Discard pi (it is inside Hi−1)
    else
      for each border edge e of Hi−1 do
        Construct the cone face determined by e and pi
      end for
      for each visible face f do
        delete f
      end for
      Update Hi
    end if
  end for

Quick Hull Algorithm
The initial development of the quick hull algorithm was performed by Preparata and Hong (1977) for convex polygons in two dimensional space. Barber et.al. (1996) reinvented the quick hull algorithm by combining this two dimensional quick hull algorithm with the general dimension beneath-beyond algorithm. Barber et.al. (1996) also provided empirical evidence that their algorithm uses less memory and runs faster. The software for this algorithm is found at http://www.qhull.org. The algorithm stated in Algorithm 6 is adopted from Barber et.al. (1996[19]).

Algorithm 6 Quick hull algorithm
  create a simplex of d+1 points
  for each facet F do
    for each unassigned point p do
      if p is above F then assign p to F's outside set
      end if
    end for
  end for
  for each facet F with a non-empty outside set do
    select the furthest point p of F's outside set
    initialize the visible set V to F
    for all unvisited neighbors N of facets in V do
      if p is above N then add N to V
      end if
    end for
    {the boundary of V is the set of horizon ridges H}
    for each ridge R in H do
      create a new facet from R and p
      link the new facet to its neighbors
    end for
    for each new facet F′ do
      for each unassigned point q in an outside set of a facet in V do
        if q is above F′ then assign q to F′'s outside set
        end if
      end for
    end for
    delete the facets in V
  end for

2.3.3 Other Algorithms and Comments
In addition to the algorithms discussed so far, Böhm and Kriegel (2001[31]) presented the depth-first algorithm and the distance priority algorithm for large multidimensional databases. Nevertheless, their algorithms were designed for two dimensional data, and their illustrations did not utilize data sets large enough to count as massive by today's standards. Many algorithms exist, and most have been shown to have lower computational complexity and shorter computation times than previously published algorithms, regardless of the data dimension. Some of them are proved to be extremely simple and fast under particular conditions such as planar data. Even though some inventors of convex hull algorithms argue that their algorithms can be extended to general dimensions, applications of their algorithms to higher dimensional data are hardly found. Hence, choosing an algorithm for a statistical analysis becomes a nontrivial problem.
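In practice, the Qhull library cited above is directly accessible from common scientific computing environments. As a minimal illustration (a sketch only; it assumes Python with NumPy and SciPy, whose ConvexHull routine wraps Qhull, and uses an arbitrary simulated sample), the extreme points, facets, and volume of a convex hull can be obtained as follows.

    import numpy as np
    from scipy.spatial import ConvexHull

    # Simulated bivariate sample; any (n, d) array of points would do.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 2))

    hull = ConvexHull(X)            # calls Qhull (quick hull algorithm)
    print(len(hull.vertices))       # number of extreme points
    print(len(hull.simplices))      # number of facets (edges in two dimensions)
    print(hull.volume)              # enclosed area in two dimensions
    print(hull.area)                # perimeter in two dimensions

The same call applies to higher dimensional samples, where the facets become (d−1)-dimensional simplices and the volume and area attributes refer to the d- and (d−1)-dimensional content, respectively.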
Instead of looking for a fast and efficient algorithm with conditions, we choose one that serves the general purposes. QuickHull algorithm by Barber et.al. (1996) will be employed through the rest of the study. The fact that numerous algorithms are traced from the literature in computa- tional geometry proves that just retrieving representative points of a convex hull from a given data set is an important problem mathematically and computation- ally. This overflowing number of algorithms also illuminates that lack of statistical functionals leading to statistical inferences and universal mathematical justifica- tion of convex hulls for quantification. This thesis may contribute to overcoming these shortcomings. 2.4 Data Depth Since Barnett (1976[24]), data depth has emerged as a powerful tool of ordering multivariate data as well as exploring multivariate data with various applications. A data depth is known to be a device for measuring the centrality of a multivariate data point with respect to a given data cloud and a indicator of how deeply a data point is embedded in the data cloud (Liu and Singh, 1993[128]) so that it 31 characterizes the multivariate distributions. Understanding distributions of multivariate data asks for defining the order of points. In contrast to the simplicity of assigning orders for univariate sample, multivariate data undergo various methods of ordering and no consensus in multi- variate ordering has reached yet. In this section, we summarized well known data depths connected to ordering multivariate data and some notions for describing multivariate distributions. 2.4.1 Ordering Multivariate Data A natural way of defining ordering of multivariate data is that points on the surface of data cloud have shallow depth and points far from the data cloud surface in any direction have deeper depth. Barnett (1976[24]) provided an extensive study on ordering multivariate data in a more quantitive way. Four different ordering principles were suggested. Marginal ordering: This ordering is called M-ordering and as the name sug- gests, the ordering is based on marginal sample. M-ordering is related to transformation of the data set. Reduced ordering: This ordering is called R-ordering and starts from reducing each multivariate observation to a single value by some combination of the components of given sample based on chosen metrics. The Mahalanobis distance is one of the most popular examples. Partial ordering: This ordering is called P-ordering. Deviated from M-ordering and R-ordering, P-ordering considers overall inter-relational properties. The important principle of P-ordering is related to assigning ranks to data layers. Tukey’s peeling process belongs to P-ordering. Conditional ordering: This ordering is called C-ordering or sequential ordering. C-ordering is conducted on one of the marginal sets of observations condi- tional on ordering within the marginal data, which stems from works about determining distribution free tolerance limits by hyper-rectangular coverage. 32 The studies on ordering principles in multivariate data is an effort of seeking a direct counterpart of ordering univariate data. By means of these ordering princi- ples, Barnett (1976) discussed multivariate median, multivariate range, multivari- ate extreme points, multivariate quantiles, and outliers in multivariate data. Also, testing processes, modelling, classification and clustering in multivariate sample were put on the table. 
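As a small numerical illustration of reduced (R-) ordering, the sketch below (an illustration only, assuming Python with NumPy and a simulated bivariate sample) orders observations by their Mahalanobis distance to the sample mean, one common choice of reducing metric.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]], size=500)

    # Reduce each observation to a single value: its squared Mahalanobis distance.
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - center
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    r_order = np.argsort(d2)       # R-ordering: most central observations first
    print(X[r_order[:5]])          # five most central points
    print(X[r_order[-5:]])         # five most outlying points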
Some details on these topics will be presented in upcoming sections.

2.4.2 Data Depth Functions
The various ideas of ordering multivariate data have led to the coining of various depth functions. A comprehensive list of data depth functions with their definitions and properties is given in Liu et.al. (1999[127]), Zuo and Serfling (2000a[241]), and Chenouri (2004[52]). Well known data depths are Mahalanobis Depth (Mahalanobis, 1936[133]), Convex Hull Peeling Depth (Barnett, 1976[24]), Half Space Depth (Tukey, 1974[221]), Simplicial Depth (Liu, 1990[130]), Oja Depth (Oja, 1983[158]), Majority Depth (Singh, 1991[212]), Projection Depth (Donoho and Gasko, 1992[58]), and Regression Depth (Rousseeuw and Hubert, 1999[182]). Here, we discuss the data depths appearing most frequently in the literature. Let F be a probability distribution in R^d (d ≥ 2). Throughout this section, unless stated otherwise, we assume that F is absolutely continuous and X = {X1, ..., Xn} is a random sample from F.

Half Space Depth (Tukey, 1974) It is also called the Tukey depth. The half space depth of x is defined as the minimal number of data points in any closed half space determined by a hyperplane passing through x,
$$HD(x; X) = \min_{\|u\|=1} \#\{i : u^T X_i \geq u^T x\}.$$
The median from the half space depth is called the Tukey median, which is constructed uniquely as a centroid. This depth process builds nested polytopes, illustrated in Rousseeuw and Struyf (2004[180]) with samples from various distributions. Donoho and Gasko (1992) proved the continuity of the half space depth and the almost sure convergence of the empirical half space depth to the theoretical depth. They showed that the half space depth breaks down at 1/3. The computational complexity of the half space depth has been studied; in particular, Miller et.al. (2003[147]) reported that their half space depth algorithm is more efficient (O(n^2)) than the algorithm by Rousseeuw and Ruts (1998[187]), who presented BAGPLOT, an extension of Tukey's univariate box plot, in $O(n^2 \log^2 n)$. The computational complexity was compared in a two dimensional framework.

Mahalanobis Depth (Mahalanobis, 1936) This depth is based on the Mahalanobis distance;
$$MhD(x) = [1 + (x - \mu_F)^T \Sigma_F^{-1} (x - \mu_F)]^{-1},$$
where $\mu_F$ and $\Sigma_F$ denote the mean and covariance matrix of F, respectively. The sample version of this depth is obtained by replacing the parameters with the estimators $\hat{\mu}_{F_n}$ and $\hat{\Sigma}_{F_n}$. The covariance matrix is assumed to be non-singular. Despite its frequent applications and popularity, it has been criticized for its lack of robustness.

Oja's Depth (Oja, 1983) Let $S[x, X_1, ..., X_d]$ be the closed simplex with vertices x and the d random observations $X_1, ..., X_d$ from F. The Oja depth is
$$OD(x; F) = [1 + E_F\{Vol(S[x, X_1, ..., X_d])\}]^{-1}.$$
The sample version of OD(x; F) is
$$OD(x; F_n) = \Big[1 + \binom{n}{d}^{-1} \sum_{*} Vol(S[x, X_{i_1}, ..., X_{i_d}])\Big]^{-1},$$
where * indicates all d-plets $(i_1, ..., i_d)$ such that $1 \leq i_1 < \cdots < i_d \leq n$. It is known that the corresponding median is affine equivariant, but the Oja depth has zero breakdown value.

Projection Depth (Donoho and Gasko, 1992) The outlyingness of any point x relative to the data set X = {X1, ..., Xn} is defined as
$$O(x; X) = \max_{\|u\|=1} \frac{|u^T x - \mathrm{med}\{u^T X\}|}{\mathrm{MAD}\{u^T X\}}.$$
Here the median absolute deviation (MAD) of a univariate data set {y1, ..., yn} is $\mathrm{MAD}\{y_i\} = \mathrm{med}_i |y_i - \mathrm{med}_j\{y_j\}|$. Based on this outlyingness O(x; X), the projection depth is defined as
$$PD(x; X) = (1 + O(x; X))^{-1}.$$
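The maximization over all unit directions in the projection outlyingness is rarely computed exactly beyond low dimensions. The sketch below is a crude approximation only (not an exact algorithm), searching over a finite set of random directions; it assumes Python with NumPy and an arbitrary simulated sample.

    import numpy as np

    def projection_depth(x, X, n_dir=2000, rng=None):
        # Approximate PD(x; X) = 1 / (1 + O(x; X)) by maximizing the
        # outlyingness O over n_dir random unit directions.
        rng = np.random.default_rng() if rng is None else rng
        U = rng.standard_normal((n_dir, X.shape[1]))
        U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit directions
        proj = X @ U.T                                  # projected sample, one column per direction
        med = np.median(proj, axis=0)
        mad = np.maximum(np.median(np.abs(proj - med), axis=0), 1e-12)
        out = np.max(np.abs(U @ x - med) / mad)         # approximate outlyingness
        return 1.0 / (1.0 + out)

    rng = np.random.default_rng(2)
    X = rng.standard_normal((500, 3))
    print(projection_depth(np.zeros(3), X, rng=rng))                # deep point: depth near its maximum
    print(projection_depth(np.array([4.0, 4.0, 4.0]), X, rng=rng))  # outlying point: depth near zero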
Zuo and Serfling (2000a) showed that the projection depth is affine equivariant, maximal at the center, and monotonic relative to the deepest point. In addition, the projection depth median has a large sample breakdown point of 1/2 (Zuo et.al., 2004[238]).

Simplicial Depth (Liu, 1990) The simplicial depth of x is, empirically, the number of simplices formed by d+1 data points that contain x. The simplicial depth is defined as
$$SD(x; F) = P_F\{x \in S[X_1, ..., X_{d+1}]\},$$
where $S_{i,d} = S[X_{i_1}, ..., X_{i_{d+1}}]$ denotes a closed simplex formed by (d+1) random observations from F. The sample version of this depth is
$$SD(x; F_n) = \binom{n}{d+1}^{-1} \sum_{*} I_{\{x \in S_{i,d}\}},$$
where * indicates all (d+1)-plets of observations. It is known that the simplicial median is affine equivariant with a breakdown value bounded above by 1/(d+2). It has also been noted that the simplicial depth does not always attain its maximum at the center and is not always monotone relative to the deepest point (Zuo and Serfling, 2000b[242]). Liu (1990) proved the uniform consistency of the simplicial depth function when F is an absolutely continuous distribution on R^d with a bounded density f, i.e.,
$$\sup_{x \in \mathbb{R}^d} |SD(x; F_n) - SD(x; F)| \xrightarrow{a.s.} 0$$
as n → ∞.

Majority Depth (Singh, 1991) Denote as a major side the half space, bounded by the hyperplane containing $(X_{i_1}, ..., X_{i_d})$, which has probability at least 0.5. Then
$$MjD(F; x) = P_F\{x \text{ is in a major side determined by } (X_1, ..., X_d)\}.$$
It is known that the majority depth is affine equivariant, maximal at the center, and monotonic relative to the deepest point (Zuo and Serfling, 2000a).

Liu and Singh (1993[128]) found that, under proper conditions, the uniform consistency of the sample version of the depth function holds almost surely for the following depths: the Mahalanobis depth, the half space depth, the simplicial depth, and the majority depth. For the Mahalanobis depth, uniform consistency holds if $E_F\|X\|^2 < \infty$; for the half space depth and the simplicial depth, uniform consistency holds if F is absolutely continuous; and for the majority depth, uniform consistency holds if F is an elliptical distribution. These results are connected to the fact that these depths are applicable to statistical inference on multivariate distributions.

2.4.3 Statistical Data Depth
In the descriptions of the various data depths in the previous section, we discussed some of their properties, such as maximality and monotonicity. In addition to these two properties, affine invariance and vanishing at infinity comprise the properties for statistical depths, allowing depth functions to be applied to statistical inference. Depth functions satisfying the four properties are called statistical depth functions. Zuo and Serfling (2000a[241]) explained the notion of statistical data depth. Although various measures of depth exist, not all have properties desirable for statistical analysis. In short, the properties of a statistical data depth are affine invariance, maximality at the center, monotonicity relative to the deepest point, and vanishing at infinity.

Definition 2.3 (Statistical depth function). Let the mapping D(·; ·) : R^d × F → R be bounded and non-negative, where F is the class of distributions on the Borel sets of R^d and F is the distribution of the random sample X.
If D(·; F ) satisfies the following: (P1) (Affine invariance) D(Ax + b; FAX+b)= D(x; FX ), for all x, holds for any random vector X in Rd, any d × d non-singular matrix A, and any d-vector b; (P2) (Maximality at center) D(θ; F ) = supx∈Rd D(x; F ) holds for any F ∈ F having center θ; (P3) (Monotonicity) for any F ∈ F having deepest point θ, D(x; F ) ≤ D(θ + α(x − θ); F ) holds for α ∈ [0, 1]; and (P4) D(x; F ) → 0 as ||x||→∞, for each F ∈F. Then D(·; F ) is called a statistical depth function. To understand properties of these functions related to multivariate distribu- tions, we will need further definitions on types of distributions. 2.4.4 Multivariate Symmetric Distributions Ordering highly depends on a symmetric distribution and a point about which the symmetry is defined. This point, which we will call the multivariate median, will be discussed in the next section with estimating methods. Here, we will focus on general notions of multivariate symmetric distributions. Spherical Symmetric: For any orthogonal transformation A : Rd → Rd, the distribution of a random vector X ∈ Rd is spherical symmetric about θ provided that X − θ and A(X − θ) are identically distributed. Elliptical Symmetric: For some affine transformation T : Rd → Rd of full rank, the distribution of a random vector X ∈ Rd is elliptical symmetric about θ if Y = T (X) and Y is spherical symmetric about θ. 37 Central Symmetric: The distribution of a random vector X ∈ Rd is centrally symmetric about θ ∈ Rd if X − θ =d −(X − θ), where =d implies that two random vectors are identically distributed. Angular Symmetric: The distribution of a random vector X ∈ Rd is angular symmetric about θ ∈ Rd provided that the standardized vector (X −θ)/||X − θ|| is centrally symmetric about the origin. Note that conventionally (X − θ)/||X − θ|| =0 if ||X − θ|| = 0. Half Space Symmetric: The distribution of a random vector X ∈ Rd is half space symmetric about θ ∈ Rd if P (X ∈ H) ≥ 1/2 for any closed half space H with θ on the boundary. Note that central symmetry is a subset of angular symmetry, which is a subset of half space symmetry. 2.4.5 General Structures for Statistical Depth Zuo and Serfling (2000a) presented general structures for statistical depth func- tions, according to which various depth functions are classified into four types. We summarize the four general structural types for constructing statistical depth functions and investigate the satisfaction degrees of these types with the properties of statistical depth functions. Type A depth functions: Let h(x; x1, ..., xn) be any bounded non-negative func- tion which measures in some sense the closeness of x to the points x1, ..., xn. A corresponding type A depth function is then defined by the average close- ness of x to a random sample of size n: D(x, F )= Eh(x; X1, ..., Xn), where X1, ..., Xn is a random sample from F . For such depth functions the corresponding sample versions D(x; Fˆn) turn out to be U-statistics or 38 V-statistics. Majority depth (Singh, 1991[212]) belongs to this category pro- vided half space symmetric distributions. Type B depth functions: Let h(x; x1, ..., xn) be an unbounded non-negative func- tion which measures in some sense that the distance of x from the points x1, ..., xn. A corresponding type B depth function is then defined as −1 D(x, F )=(1+ Eh(x; X1, ..., Xn)) , where X1, ..., Xn are random sample from F . 
This type B function may not satisfy the affine invariant property but it matches well with the other prop- erties of a statistical depth function. A type B depth function defined on any circular symmetric distributions satisfies the property of center maximality. Type C depth functions: Let O(x; F ) be a measure of the outlyingness on the point x in Rd with respect to the center or the deepest point of the distri- bution F . Usually O(x; F ) is unbounded, but a corresponding type C depth function is defined to be bounded so as D(x; F )=(1+ O(x; F ))−1. Type C functions pertain properties of statistical depth functions and the projection depth and Mahalanobis depth are the examples of type C func- tions. Type D depth functions: Let C be a class of closed subsets of Rd and P a probability measure on Rd. A corresponding type D depth function is defined as D(x; P, C) = inf{P (C)|x ∈ C ∈ C}. C The depth of a point x to a probability measure P on Rd is defined to be the minimum probability mass carried by a set C in C that contains x. The half space depth and the convex hull peeling depth are known to be type D functions. Zuo and Serfling (2000a) favored the half space depth function because it is a statistical depth function and its breakdown point is relatively higher than the 39 breakdown points of other depth functions (≤ 1/3). 2.4.6 Depth of a Single Point Initial depth calculations of given sample by means of a depth function provide a set of nested convex polygons formed by equi-depth points composing depth contours (equi-level sets). The depth contours subdivide the space into depth regions so that the region between contours k and k + 1 has depth k. Therefore, the depth of a new point is simply the depth region in which it lies. This depth allocation process led studies on the hypothesis testing of the similarity between two distributions (Liu and Singh, 1993 [128] and Li and Liu, 2004[123]) and classification (Ghosh and Chaudhuri, 2005[72]). 2.4.7 Comments Ideas presented in this section are the foundation of developing nonparametric statistics and statistical inference methods for multivariate data. As an initial step toward developing nonparametric statistics for multivariate data, based on notions and definitions provided in this section, we will discuss some multivariate estimation problems such as multivariate median, skewness, and kurtosis in the next section. 2.5 Descriptive Statistics of Multvariate Distri- butions This section presents nonparametric estimators of multivariate distributions based on the data depth discussed in the previous section. In fact, small number of works related to the nonparametric estimators of multivariate distributions are found (e.g. Liu et.al., 1999 and Serfling, 2004[196]). Methods of estimating multivariate median, location, scatter, skewness, and kurtosis are presented. If multivariate estimators are not found, univariate counterparts are introduced. 40 2.5.1 Multivarite Median The median of univariate data pertains the universal definition that one selects a point in the middle after sorting data. It is denoted as X n+1 or .5(X n + X n+1 ) 2 2 2 depending on the odd or even number of sample size. On the contrary, there is no predominant definition of multivariate median. Simple coordinate-wise median and mean are adopted for estimating multivariate median but they are not robust with respect to data distributions. Another notion of multivariate median is the deepest point in the data cloud. 
Yet being the deepest point in the data cloud requires defining directions and met- rics corresponding to ordering data. Aloupis (2005[10]) gave a good summary on multivariate median from the computer scientific point of view. Small (1990[213]) expedited multivariate medians traced back to early 1900s and presented statis- tical perspective of these multivariate medians. The list of popular multivariate medians is: L1 Median: the point which minimizes the sum of Euclidean distances to all points in a given data set (Chaudhuri, 1996[42]). The breakdown point is known to be 1/2. Half Space Median: based on half space depth, this median provides the point of which half space minimizes the maximum number of points. Donoho and Gasko (1992) are credited to this median. They studied its breakdown property. The breakdown point is bounded below between 1/(d+1) and 1/3. Convex Hull Peeling Median: a point(s) left after sequential peeling process. The visual interpretation is very straight forward and intuitive. The number of left points after exhaustive peeling is not enough to comprise a simplex so that the coordinate wise average of these points are considered to be the location of multivariate median. Oja’s Simplex Median: a point that minimizes the sum of the volumes of sim- plices formed with all subsets of d points from the data set in Rd. This median does not have the uniqueness property but has the property of affine equivariance. 41 Simplicial Median: a point in Rd which is contained in the most simplices formed by subsets of any d + 1 data points. This median is known to be affine transformation invariant. Hyperplane Median: a point θ which minimizes the number of hyperplanes that a ray illuminating from θ crosses. Rousseeuw and Hubert (1999[181]) defined this median while introducing the maximum hyperplane depth, which is affine invariant. According to the discussion by Aloupis (2005), studies on computational complex- ity of these multivariate medians are confined to bivariate data. Thus, we intend not to add computational complexity results. However, our naive guess concludes that the simplicial median and the Oja’s simplex median are more complex than other medians as the data dimension increases3 since these medians requires all possible simplex construction requiring O(nd+1) time computational complexity. In addition, the half space median, the hyperplane median, and L1 median ask for an iterative optimizing process, which makes computation less efficient. 2.5.2 Location Estimating median is a specific example of estimating location. Traditionally, location of multivariate data has been estimated with the normal distribution as- sumption. The Mahalanobis distance is the very example of location estimation based on the normality of a given sample but it is well known for the lack of robust- ness. Recently, constructing robust multivariate location and scatter estimators became a very interesting problem. Furthermore, depth induced ordering promises new tools in multivariate data analysis and statistical inferences, just as order and rank statistics did for univariate data. The depth induced ordering is strongly related to generalized quantile based location estimation, which have gained its popularity due to robustness compared to Mahalanobis distance. 
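Among the medians listed above, the L1 median admits a particularly simple iterative estimator, the Weiszfeld algorithm. The following is a minimal sketch (assuming Python with NumPy and a simulated contaminated sample; the safeguard against iterates landing exactly on a data point is simplified).

    import numpy as np

    def l1_median(X, n_iter=200, tol=1e-8):
        # Weiszfeld iteration for the L1 (spatial) median: the point minimizing
        # the sum of Euclidean distances to all observations.
        m = X.mean(axis=0)                      # start from the coordinate-wise mean
        for _ in range(n_iter):
            dist = np.linalg.norm(X - m, axis=1)
            dist = np.maximum(dist, 1e-12)      # crude guard against a zero distance
            w = 1.0 / dist
            m_new = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(m_new - m) < tol:
                break
            m = m_new
        return m

    rng = np.random.default_rng(3)
    X = rng.standard_normal((1000, 2))
    X[:20] += 10.0                              # plant a small cluster of outliers
    print(X.mean(axis=0))                       # coordinate-wise mean is pulled toward the outliers
    print(l1_median(X))                         # L1 median stays near the origin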
(Footnote 3: Numerous algorithms developed in the computational geometry community are geared toward efficiency and lower complexity; however, those algorithms work only under particular conditions. We focus on algorithms that function under a general data configuration, although such algorithms are computationally expensive.)

According to Einmahl and Mason (1992[69]), the empirical quantile function is defined as
$$U_n(t) = \inf\{\lambda(A) : P_n(A) \geq t, \ A \in \mathcal{A}\}$$
for 0 < t < 1. The behavior of $U_n(t)$ was illustrated by showing that $U_n(t)$ is a random element in the space of left continuous functions with right limits (Einmahl and Mason, 1992). The general quantile process may not provide a straightforward instruction for ordering multivariate data, but it suggests a proper recipe for estimating the location of multidimensional data by means of a function resembling the univariate quantile function.

2.5.3 Scatter
Traditionally, measuring the degree of scatter has been treated as the same problem as estimating the second moment, an idea established by Mardia (1970[138]). Not only spread but also other distributional shape measures such as skewness and kurtosis were based on estimating higher moments. However, estimating moments presumes a particular distribution family, such as the multivariate normal distribution. Once one diverts from Mardia's approach, it becomes clear that not many ways are available to measure the scatter of a multidimensional data distribution.

When two univariate distributions are symmetric, the relative degree of scatter of one distribution to the other can be assigned. According to Bickel and Lehmann (1976[30]), a symmetric univariate distribution P is considered to be more dispersed about µ than another symmetric distribution Q about ν if
$$P([\mu - a, \mu + a]) \leq Q([\nu - a, \nu + a]) \quad \text{for all } a > 0.$$
This definition can be extended to the multivariate case as in Eaton (1982[61]) by stating that P is more concentrated about µ than Q about ν if $P(C + \mu) \geq Q(C + \nu)$ for all convex symmetric sets $C \subset \mathbb{R}^d$. Oja (1983[158]) provided a generalized version of relative scatter between two distributions.

Definition 2.4. Suppose $\phi : \mathbb{R}^d \to \mathbb{R}^d$ is a function such that
$$Vol(\phi(x_1), ..., \phi(x_{d+1})) \geq Vol(x_1, ..., x_{d+1})$$
for all $x_1, ..., x_{d+1} \in \mathbb{R}^d$. Here, $Vol(x_1, ..., x_{d+1})$ indicates the volume of the simplex comprised of $x_1, ..., x_{d+1}$. Then, if P is the distribution of a random vector X and Q is the distribution of $\phi(X)$, we say that Q is more scattered than P, in short $P \prec_S Q$.

This volume based definition of scatter will be adopted to measure the relative scatter of distributions by means of the convex hull peeling process. In addition to the spread ordering, Oja (1983) stipulated an enhanced concept of a scatter measure.

Definition 2.5. The function $\Psi : \mathcal{P} \to \mathbb{R}^+$ is a measure of scatter if
(1) $P \prec_S Q$ implies $\Psi(P) \leq \Psi(Q)$ for all $P, Q \in \mathcal{P}$;
(2) for any affine transformation $L : \mathbb{R}^d \to \mathbb{R}^d$ of the form $L(x) = Ax + \mu$ and $P \in \mathcal{P}$ such that $PL^{-1} \in \mathcal{P}$, $\Psi(PL^{-1}) = \mathrm{abs}(|A|)\Psi(P)$. Equivalently, $\Psi(AX + \mu) = \mathrm{abs}(|A|)\Psi(X)$.
Note that $\Psi(AX) = 0$ for a singular A. A convex hull can be interpreted as a scatter set, and the volume of the convex hull, a generalized range, is a measure of scatter. Therefore, the notion of scatter by Oja (1983) will be very useful when we define the degree of scatter of affine invariant multidimensional distributions by means of the convex hull process.
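As a small numerical check of property (2) in Definition 2.5, with the convex hull volume playing the role of Ψ, the sketch below (an illustration only, assuming Python with NumPy and SciPy's Qhull-based ConvexHull) applies the affine map x → 2x to a simulated sample in R^3 and confirms that the hull volume is multiplied by |det A| = 2^3.

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(4)
    P = rng.standard_normal((2000, 3))      # a sample in R^3
    Q = 2.0 * P                             # image under L(x) = Ax with A = 2I, |det A| = 8

    vol_P = ConvexHull(P).volume
    vol_Q = ConvexHull(Q).volume
    print(vol_P, vol_Q, vol_Q / vol_P)      # the ratio equals 2**3 = 8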
Additionally, a volume functional, a mapping from a generalized multivariate quantile function, allows one to allocate the orders of scatter among distributions. At a given data depth r, the volume $v_F(r)$ of a more scattered distribution will be larger than the volume $v_G(r)$ of a less scattered distribution, i.e. $v_F(r) > v_G(r)$ (Liu et.al., 1999). Consequently, a suitable generalized quantile function and its mapping to a volume functional will be the key to measuring the degree of scatter of multivariate distributions.

2.5.4 Skewness
Skewness is considered a measure of asymmetry, an innate characteristic of a distribution. The skewness of a univariate distribution accommodates two directions from the median: a positive span and a negative span from the center of the data distribution. When it comes to multivariate data, there are more than two directions that radiate from the data center. These numerous directional rays from the data center represent the skewness of the multivariate data distribution by their absolute values, since a symmetric distribution will share the same values regardless of direction; however, there exist only a few proper functionals based on multivariate quantiles that project these numerous rays into a one dimensional curve quantifying the degree of skewness. The classical definition of skewness was stipulated by Mardia (1970[138]). Let X and Y be iid random vectors from some multivariate distribution function F on R^d with mean µ and covariance matrix Σ. Then skewness is defined as
$$\beta_1 = E[\{(X - \mu)' \Sigma^{-1} (Y - \mu)\}^3].$$
The sample version of $\beta_1$ can be obtained under the assumptions that Σ is non-singular and the third moment exists. These assumptions are strong enough to have drawn criticism.

Ferreira and Steel (2006[71]) summarized a few other generalized skewness measures beyond the joint moments of the random variable (Mardia, 1970) for unimodal distributions. One is based on projections of the random variable onto a line by Malkovich and Afifi (1973[135]), and the other is based on volumes of simplices by Oja (1983). With the unique mode of a unimodal distribution considered to be the center of the multivariate distribution, the projection based skewness is regarded as a measure of skewness onto principal axes, and the volume based skewness is called directional skewness.

Unlike multivariate skewness, univariate skewness is well defined. Avérous and Meste (1997[14]) proposed a skewness ordering ≺. A given skewness ordering ≺ is defined by comparing F with F̄ (where F̄(x) = 1 − F(x)): F is said to be skew to the right if and only if F̄(x) ≺ F(x). By the same token, F is called less skew to the right than G (F ≺ G) if and only if $G^{-1} \circ F$ is convex. According to Oja (1983[158]), measures of skewness and kurtosis are usually selected by intuition, and it is required that if Φ is a measure of skewness or kurtosis then $\Phi(PL^{-1}) = \Phi(P)$, where L is an affine transformation $L(X) = AX + b$ with A non-singular.

Studies focusing particularly on multivariate skewness are sparse, especially ones that define skewness based on a generalized quantile process, which allows a one dimensional mapping to characterize a multidimensional data distribution. Nonetheless, we can conclude that any changes along the rays going away from the data center can be treated as some measure of skewness in multivariate data.
This idea of searching for directional changes will lead us to define skewness based on the convex hull peeling process, without calculating any associated moments of a multivariate sample.

2.5.5 Kurtosis
After the exploration of nonparametric descriptive measures of a distribution such as location, scatter, and skewness, the natural next step is to characterize kurtosis. The univariate kurtosis, or the standardized fourth central moment, has been construed to represent either heavy peakedness or heavy tails of a distribution. The notion of univariate kurtosis can be easily extended to multivariate kurtosis. Regardless of the number of dimensions, kurtosis makes it possible to distinguish a heavy tailed distribution from a heavily peaked one along the rays from the center of the data. The classical definition of kurtosis is
$$\kappa = E\{[(X - \mu)' \Sigma^{-1} (X - \mu)]^2\}$$
for a distribution in R^d with mean µ and covariance matrix Σ (Mardia, 1970[138]), so that kurtosis estimates the fourth moment of X about µ. The kurtosis κ measures the dispersion of the squared Mahalanobis distance of X from µ.

For an explicit discussion of kurtosis, the proper nomenclature is necessary. One group of terms is shoulder, peakedness, and tailweight; the other group is mesokurtic, platykurtic, and leptokurtic. First of all, a shoulder of the distribution indicates the region about halfway from the data center to the data surface, so that higher kurtosis arises when probability mass is diminished near the shoulders and raised either near µ (greater peakedness) or in the tails (greater tailweight) or both. Here, peakedness indicates how fat a distribution can be around µ, and tailweight indicates how thick a distribution can be beyond the shoulder and away from µ. The other nomenclature group is strongly related to estimates of Mardia's κ. According to Balanda and MacGillivray (1998[17]), platykurtic curves are squat with short tails (κ < 3), leptokurtic curves are high with long tails (κ > 3), and mesokurtic distributions (κ = 3) are comparable to the normal distribution.

Alternatives to the moment based kurtosis κ are quantile-based kurtosis measures, which are robust, more widely defined, and better at discriminating distributional shapes (Wang and Serfling, 2005[228]). Quantile based kurtosis measures provide shape information about the data distribution; on the other hand, classical moment based kurtosis quantifies the dispersion of probability mass.

Depth measures make it possible to connect quantile estimators to volume estimators, and quantile based volume estimators have been developed to measure the kurtosis of a distribution. Let any family of central regions be $\mathcal{C} = \{C_F(r) : 0 < r \leq 1\}$; $C_F(r)$ can be a continuous function of r. A convex hull peeling central region will be discussed in Chapter 4. Note that the central regions are nested about $M_F$, the center based on any notion of multivariate median, and are reduced to $M_F$ as r → 0. A corresponding volume functional is defined as $v_{F,\mathcal{C}}(r) = Vol(C_F(r))$, so that it is a decreasing function of the variable r. The boundary of the central region $C_F(\tfrac{1}{2})$ represents the shoulder of the distribution and separates the designated central part from the corresponding tail part of the distribution. One of the volume based spatial kurtosis functionals (Serfling, 2004) is
$$\kappa_{F,\mathcal{C}}(r) = \frac{v_{F,\mathcal{C}}(\frac{1}{2} - \frac{r}{2}) + v_{F,\mathcal{C}}(\frac{1}{2} + \frac{r}{2}) - 2\, v_{F,\mathcal{C}}(\frac{1}{2})}{v_{F,\mathcal{C}}(\frac{1}{2} - \frac{r}{2}) - v_{F,\mathcal{C}}(\frac{1}{2} + \frac{r}{2})}.$$
This functional is invariant under shift, orthogonal, and homogeneous scale transformations (Serfling, 2004[196]).
Within certain typical classes of distributions, we can attach intuitive interpretations to the values of $\kappa_{F,\mathcal{C}}(\cdot)$. For a unimodal distribution F and any fixed r, a value of $\kappa_{F,\mathcal{C}}(r)$ near +1 suggests peakedness, a value near −1 suggests a bowl shaped distribution, and a value near 0 suggests uniformity.

According to Bickel and Lehmann (1975[29]), measures of kurtosis, peakedness, and tailweight should be ratios of scale measures, in the sense that both the numerator and the denominator should preserve their spread ordering. This ratio of scale measures is called the measure of weight, whose definition is as follows:
$$t_F(r, s) = \frac{v_F(r)}{v_F(s)},$$
where r and s are depths ($0 < r, s \leq 1$).

Another approach to multivariate kurtosis is found in the studies of influence functionals by Wang and Serfling (2006[227]). Given a multivariate distribution F, a corresponding depth function orders points according to their centrality in the distribution F. The depth functions then generate one dimensional curves for a convenient and practical description of specific features of a multivariate distribution, such as kurtosis, and Wang and Serfling explored the robustness of the sample versions of these curves by means of influence functionals. Structural representation by influence functionals is consistent with the curves obtained from generalized quantile functions or depth functions.

The aforementioned works successfully identified general shape properties of multivariate distributions, and one of the byproducts is a kurtosis measure. Although there is no consensus on the definition of kurtosis, we prefer a kurtosis definition that is free of estimating moments. Even without a designated formula for a kurtosis measure of a distribution, the peakedness or tailweight of a probability distribution can be measured to some extent based on data depths and hint at the kurtosis property of the distribution. A suitable quantification of these measures of peakedness and tailweight shall induce the definition of kurtosis.

2.5.6 Comments
Classical descriptive statistics are defined by means of calculating moments. The drawback of this moment based process is that some probability distributions do not have the expected moments to be compared with empirical moments. Due to the lack of studies in nonparametric multivariate statistics, we are aware of deficiencies in descriptive measures of multivariate distributions. Furthermore, for some nonparametric descriptive measures based on particular depths, only empirical versions of the measures are available because the distribution of the descriptive functionals expected from the population is unknown. Many possibilities lie buried in the field of nonparametric descriptive measures of multivariate data distributions.

2.6 Outliers
As of the writing of this thesis, Google Scholar gave 5,760 articles about outlier detection. Detecting outliers is an important technique not only in statistics and data mining, but also in various fields such as natural science, engineering, business, and industry. However, most outlier detection methods rely on small sample statistical analysis tools developed from estimating the Mahalanobis distance (Mahalanobis, 1936[133]) under the multivariate normal distribution assumption. Comprehensive studies on classical outliers are collected in the books by Barnett and Lewis (1994[21]), Rousseeuw and Leroy (1987[188]), and Hawkins (1980[89]).
Numerous works re- lated to classical outlier detection also performed by Atkinson (1994[12]), Bar- nett (1978[23]), Hadi (1992[81]), Knorr (2000[113]), Kosinski (1999[117]), Penny (1995[161]), Rocke and Woodruff (1996[178]), Wilks (1963[232]), and so on. In addition, Werner (2003[230]) gave an interesting point of view to detect outliers with influence functionals in his doctoral thesis. A recent and extensive survey on outlier detection methodologies is found in Hodge and Austin (2004[94]). In our survey, we focus on non parametric outlier detection methods instead of parametric approaches such as hypothesis testings and distance calculations based on particular parametric models, which take normally distributed data for granted and initiate the detection process with estimating covariance matrices. Here, we describe multivariate outliers and their detection methods without considering the 2nd moment. 2.6.1 What is Outliers? First of all, outliers have to be defined so as their detection methods to be de- veloped. For univariate data, outliers are relatively simple to be detected. If the distribution of univariate data is known, the hypothesis testing leads to decide outliers with probability. When the distribution is unknown, we sort the data and declare data of the lowest or the highest ranks as outliers. On the other hand, multivariate outliers have very vague definitions so that they cannot be treated as an extension of univariate outliers. Huber’s (1972[96]) stated that outliers are defined as unlikely to belong to 4http://scholar.google.com 50 main population and order statistics yields easy criterion to detect outliers. The unresolved ambiguity is that ordering multivariate data is not clearly defined to declare outliers. In their quite extensive outlier studies, Barnett and Leroy (1994[21]) adopted the notion that outliers have been informally defined as observations which appear to be inconsistent with the remainder of a data set. An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Barnett and Leroy, 1994). In other words, outliers are a subset of observations which appears to be inconsis- tent with the remainder of the data set whether or not those outliers are genuine members of the main population. A similar definition appears in Hawkins (1980 [89]) as follows; an outlier is an observation which deviates so much from other ob- servations as to arouse suspicions that is was generated by a different mechanism. This statement implies that contaminants arise from some other distribution. Al- ternatively, Beckman and Cook (1983[27]), in their wide ranging review paper on outliers, described an outlier as any observation that appears surprising or dis- crepant to the investigator. They suggested the following notions to clear some ambiguity about outliers: • Discordant observation: any observation that appears surprising or discrepant to the investigator • Contaminant: any observation that is not a realization from the target dis- tribution • Outlier: a collective to refer to either a contaminant or a discordant obser- vation. Beckman and Cook (1983) also gave an extensive history about outliers but put a major emphasis on outliers in linear models. 
51 For the moment, we accept the fact that outliers are deviated from the majority of data points, far away from the center of data in any direction, and residents above the surface of data cloud without considering various statistical and computer scientific definitions of outliers. Techniques that determine the position of a point relative to the others (Rohlf, 1975[179]) or some measures that quantify the non euclidean distance from the center seem to be useful to indicate the degree of outlyingness of outliers with respect to the majority of data. Popular approaches to detecting outliers assume multivariate normal distri- bution and calculate Mahalobis distances with appropriate covariance matrices. Mahalobis distances allows to find the location of data center and to order multi- variate data by estimating sample mean and sample covariance. Without assuming multivariate normal, we still need some metric that quantifies ordering data but there is no consensus on this subject. Multivariate data can be ordered from left to right, from bottom to top, from front to back based on coordinates of data. They can be ordered outward from the center. On the contrary, the center of mul- tivariate data is not universally defined compared to the center of univariate data. An extensive survey on ordering multivariate data is found in Barnett (1976). In Rocke and Woodruff(1996), insights regarding outlier detection are given to be understood why the problem of detecting multivariate outliers can be difficult and why the difficulty increases with the dimension of data. Rocke and Woodruff (1996[178]) suggested a very useful framework to identify multivariate normal outliers when the main distribution is N(O, Id) where O and Id indicate a zero vector of length d and a d × d identity matrix, respectively. Shift outliers: the outliers have distribution N(µ, Id) where µ is a length d vector. Point outliers: the outliers form a densely concentrated cluster. Symmetric linear outliers: given a unit vector u, the outliers have distribution uN(O, Σ) for an arbitrary Σ. Asymmetric linear outliers: given a unit vector u and its coordinate-wise ab- solute value |u|, the outliers have distribution |u|N(0, Σ) Radial Outliers: outliers have distribution N(0, kId) with k > 1. 52 This categorization of outliers based on different types of normal distributions aids to separate outliers from the main N(O, Id) distribution; however, if there is no transformation of main data to N(O, Id), segregating outliers based on one of the catagories will lead to classifying inconsistent outliers. Overall, outliers in multi dimensional space is ambiguous compared to univariate outliers. The only fact that we are sure about outliers is that outliers are deviated from the primary data distribution regardless of dimensions. 2.6.2 Where Outliers Occur? Barnett (1978[23]) summarized the causes of outliers: (1) measurement errors, (2) execution faults, and (3) intrinsic variability. Because of these various causes, outliers are treated in different ways: If a model is more plausible without outliers, outliers are detected to be eliminated; otherwise, a model is modified to cope with the existing outliers. Assertively, the residing space of outliers are most likely distinctive to the main distribution. This unique spatial occupation of outliers is clearly displayed by simple graphical methods when the data is univariate. In multivariate case, graphical visualization to locate outliers is quite complicated and sometimes impossible. 
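To make this taxonomy concrete, one can simulate a main N(O, I_d) sample contaminated by outliers of a chosen type; the sketch below (an illustration of the categories only, assuming Python with NumPy) generates shift, radial, and point outliers.

    import numpy as np

    rng = np.random.default_rng(5)
    d, n_main, n_out = 3, 950, 50

    main = rng.standard_normal((n_main, d))                 # main N(O, I_d) population
    shift = rng.standard_normal((n_out, d)) + 5.0           # shift outliers: N(mu, I_d)
    radial = 3.0 * rng.standard_normal((n_out, d))          # radial outliers: N(O, k*I_d) with k = 9
    point = 0.05 * rng.standard_normal((n_out, d)) + 8.0    # point outliers: a dense, concentrated cluster

    contaminated = np.vstack([main, shift])                 # one contaminated sample for experiments
    print(contaminated.shape)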
From a knowledge discovery standpoint, rare events are more interesting than common ones. An interesting list has been made (Hodge and Austin, 2004[94]) of occasions on which outlier detection is applied. Some of those occasions are (1) fraudulent applications for credit cards and mobile phones, (2) loan application processing, (3) intrusion detection, (4) activity monitoring, (5) network performance, (6) fault diagnosis, (7) unexpected entries in a database, and so on. These occasions clearly involve observations that deviate from the main population. Therefore, the key to outlier detection is to develop methods of measuring the degree of deviation among points.

2.6.3 Outliers and Normal Distributions
Although there is no clear statement of a normal distribution assumption, most outlier detection techniques rely on this assumption and require estimates of the location and scale parameters of the distribution (Pena and Prieto, 2001[160]). Multivariate outlier detection methods have been developed toward acquiring robust estimates of the mean and covariance matrix, for which the Mahalanobis distance
$$D(x) = (x - \bar{X})^T \hat{\Sigma}^{-1} (x - \bar{X})$$
becomes indispensable. Points with large Mahalanobis distances, based on robust estimates of the population scatter and location, are multivariate outliers (Kosinski, 1999[117]). Kosinski (1999) proposed a new multivariate outlier detection method that performs well under high contamination conditions. Even though the Mahalanobis distance is affine invariant, it is well known that identification of multivariate outliers based on the Mahalanobis distance is not robust (Rocke and Woodruff, 1996[178]). Hardin and Rocke (2004[85]) provided a survey of studies on robust outlier detection methods by means of the Mahalanobis distance, including Atkinson (1994[12]), Barnett and Lewis (1994[21]), Gnanadesikan and Kettenring (1972[74]), Hadi (1992[81]), Hawkins (1980[89]), Maronna and Yohai (1995[139]), Penny (1995[161]), and Rocke and Woodruff (1996[178]). Yet, problematically, the outliers themselves are included in the calculation of these location and scale estimators. In addition, the Mahalanobis distance is computationally expensive for large high dimensional data sets compared to the Euclidean distance, because the computing process has to pass through the entire data set for the covariance matrix and its inverse (Hodge and Austin, 2004[94]), which might be sparse or singular.

Another approach to outlier detection based on a multivariate normal distribution assumption is the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD) (Rousseeuw and Leroy, 1994). Since data from a single population tend to form a convex set, distributional covariance structures are represented by ellipsoids, and outliers are separated beyond these ellipsoidal contours. Rocke and Woodruff (1996) specified that all known affine equivariant outlier identification methods consist of the following two stages:
Phase I: Estimate the location and the shape.
Phase II: Calibrate the scale estimate, suggesting which points are far enough from the location estimate to be pronounced outliers.
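A minimal sketch of this two-stage recipe under the normality assumption follows (an illustration only, assuming Python with NumPy and SciPy and a simulated contaminated sample): Phase I estimates location and shape, here non-robustly by the sample mean and covariance, and Phase II flags points whose squared Mahalanobis distance exceeds a chi-square quantile, an approximate calibration for large normal samples.

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_outliers(X, alpha=0.001):
        # Phase I: estimate location and shape (non-robustly, for illustration).
        center = X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        diff = X - center
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared Mahalanobis distances
        # Phase II: calibrate the scale with an approximate chi-square cutoff.
        cutoff = chi2.ppf(1.0 - alpha, df=X.shape[1])
        return np.flatnonzero(d2 > cutoff)

    rng = np.random.default_rng(6)
    X = np.vstack([rng.standard_normal((980, 3)),
                   rng.standard_normal((20, 3)) + 6.0])      # 20 planted shift outliers
    print(mahalanobis_outliers(X))                           # indices of the flagged points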
We end this discussion about outlier detection with the Mahalanobis distance by quoting Barnett (1983[22]) which provides a considerable warning on uncon- sciousness of assuming multivariate normal distributions for data analysis. Beckman and Cook (1983) survey is limited to normal model. Surely I’m not alone in frequently (even predominantly) having to deal with nonnormal data. For univariate data we have a reasonable heritage of methodology. For multivariate data there is much to be done and great demand for appropriate model and method. I cannot agree that outliers in multivariate samples merely require straight forward generalizations of univariate techinques. The very notion of order is ambiguous and ill- defined (Barnett, 1976). Proper study of multivariate outliers involves some thorny issues. 2.6.4 Nonparametric Outlier Detection Simplicity and affine invariant property makes the Mahalanobis distance rule over other methods when it comes to detecting multivariate outliers. As a matter of fact, without imposing normality, there is almost no way to detect outliers in higher dimensional data (only exception is outlier detection in bivariate samples). Shekhar (2001[206]) delivered the idea that methods for detecting outliers in multi-dimensional Euclidean space have a few limitations: First, multi-dimensional approaches assume that the data items are embedded in an isometric space and do not capture the spatial graphic structure. Secondly, they do not explore a pri- ori information about statistical data distribution. Last, they seldom provide a confidence measure of the discovered outliers. The first and the last issues some- what related to the discussion in Hodge and Austin (2004) about the curse of dimensionality. Lu et.al. (2003[132]) also made a point that for detecting outliers with multiple attributes, traditional outlier detection approaches could not be used properly due to the sparsity of observations in high dimensional space. They urged for redesigning distance functions to define the precise degree of proximity between 55 points. At the time of writing this thesis, the author is not sure if there is any attempts to draw multivariate box plot for outlier detections as Tukey did for univariate data without problems caused by the curse of dimensionality. However, some nonparametric box plot type attempts exist in bivariate case, which are presented below. • Sunburst plot: By means of any data depth, a contour enclosing 50% points (a boundary of 50% central region) is obtained. Denote this coutour as a .5th level set. Then, lines are drawn from the boundary of the .5th level set to any points outside the .5th level set. Liu et.al. (1999) proposed Sunburst plot for a bivariate outlier detection problem. Nonetheless, this method suffers from a serious unidentifiability problem with massive data sets where those rays become almost indistinguishable. • Bag plot: Miller et.al. (2003[147]) suggested the bag plot. The steps of constructing a bag are: First, identify these two contours, Dk and Dk−1, which are consecutive contours containing respectively nk and nk−1 data points (nk ≤ bn/2c < nk−1). Second, calculate the value of a parameter λ, which determines the relative distance from the bag to each of the two contours. λ = (50 − J)/(L − J) where Dk contains J% of the original points and Dk−1 contains L% of the original points. Last, construct the bag by interpolating between the two contours by using the λ value. 
For each vertex $v$ on $D_k$ or $D_{k-1}$, let $l$ be the line through $v$ and the Tukey median $T^*$. Let $u$ be the intersection of $l$ with the other contour ($D_{k-1}$ or $D_k$). Each vertex $w$ of the bag lies on $l$, interpolated between $v$ and $u$, i.e. $w = \lambda v + (1-\lambda)u$. The value of $\lambda$ weights the position of the bag towards $D_k$ or $D_{k-1}$. (A small sketch of this interpolation step appears at the end of this subsection.)

• Pairwise bivariate box plot with B-spline smoothing: Zani et al. (1998[236]) suggested a simple way of constructing a bivariate box plot based on convex hull peeling and B-spline smoothing. Their approach defines a natural inner region which is completely nonparametric and retains the correlation in the observations. This inner region adapts its spread to the data distribution and plays the role of the box in Tukey's box-and-whisker plot. The counterpart of the whisker is a linear expansion of this inner region, based on the distance from the center to the inner region. To smooth the angular outer region produced by expanding the angular inner region, a B-spline is adopted. Zani et al. (1998) explored their box plot with pairwise bivariate samples. Instead of pairwise bivariate box plots, a single multivariate box plot seems possible, which they did not explore. We assert that the reason for exploring pairwise bivariate box plots instead of a single multivariate box plot is that the visualization of multivariate data is not feasible. Furthermore, if the sample is large, the B-spline seems unnecessary because the contour of the outer region will already be smooth enough.

• Generalized gap test: This test was proposed by Rohlf (1975[179]). Unlike the methods introduced above, this method does not look for contours to set a boundary. Still, outliers can be characterized by the fact that they are somewhat isolated from the main cloud of points. The minimum spanning tree determines edge lengths among points, and the presence of outliers that stick out from the main cloud of data shows up as a gap in the distribution of these lengths.

Note that the interpolation parameter $\lambda$ of the bag plot (Miller et al., 2003) goes to zero as the sample size increases. Without the interpolating process, the bag plot of Miller et al. (2003) is similar to the bivariate bag plot invented by Rousseeuw et al. (1999a[183]). Based on contours of the halfspace depth, this less sophisticated bivariate box plot (Rousseeuw et al., 1999a) requires a .5th level set and extends the segments from the median to the vertices of this .5th level set (hull) by a factor of three. Connecting the end points of these extensions creates another hull, which provides a boundary for declaring extreme outliers, just as univariate points located beyond 3 times the interquartile range are labeled extreme outliers. Similarly, using a hull extended by 1.5 times the distance from the median to the vertices of the .5th level set, in place of the 3 times extension, identifies mild outliers, as in Tukey's univariate box plot.

We did not include the bivariate box plot study by Goldberg and Iglewicz (1992[75]) since that box plot construction involves estimating shape parameters. Another genuinely nonparametric approach to outlier detection is the bootlier plot by Singh and Xie (2003[211]), but difficulties arise in applications to large samples with multiple attributes.

Even though the above methods have only been applied to bivariate samples, they are very likely applicable to any multivariate sample.
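The vertex interpolation of the bag plot, referenced above, is simple enough to sketch. The following Python fragment is a minimal, hypothetical illustration of the step $w = \lambda v + (1-\lambda)u$; it assumes the two depth contours and the depth median have already been computed by some other routine (the hard part in practice), and it approximates $u$ by the outer-contour vertex in the closest direction rather than by an exact ray intersection.

    import numpy as np

    def interpolate_bag(contour_inner, contour_outer, median, lam):
        """Sketch of the bag-plot interpolation w = lam*v + (1-lam)*u.

        contour_inner, contour_outer : (m, 2) arrays of contour vertices (assumed given)
        median                       : (2,) depth median T*
        lam                          : weight, e.g. (50 - J) / (L - J)
        """
        bag = []
        for v in contour_inner:
            dv = (v - median) / np.linalg.norm(v - median)
            du = contour_outer - median
            du = du / np.linalg.norm(du, axis=1, keepdims=True)
            u = contour_outer[np.argmax(du @ dv)]   # vertex whose direction best matches v
            bag.append(lam * v + (1.0 - lam) * u)
        return np.asarray(bag)

    # Toy usage: two concentric squares with the origin playing the role of the median.
    inner = np.array([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
    outer = 2.0 * inner
    print(interpolate_bag(inner, outer, np.zeros(2), lam=0.5))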
It is not surprising that most of these methods adopt data depths to obtain the hull containing 50% of the data. Hulls obtained from data depth appear to be less affected by outliers and to preserve the structural shape of the original distribution when the sample is generated from a smooth convex body distribution. Without a comprehensive discussion of multivariate outlier detection based on data depth, it would be perilous to conclude that data depth is the best way to detect outliers. Nevertheless, a few studies on data depth suggest that depth based outlier detection methods are robust. He and Wang (1997[90]) pointed out that depth contours often provide a good geometrical understanding of the structure of a multivariate data set and are not influenced by outliers. They conjectured that data depths can track the elliptic contours if the data were generated from an elliptical distribution. They added, however, that the technicalities needed to support the conjecture that depth contours delineate the original structure of the distribution are not trivial.

2.6.5 Convex Hull vs. Outliers

Although no study has applied convex hull peeling directly to outlier detection, several studies entail the idea that the convex hull process provides robust estimators by identifying possible outliers. Barnett suggested, in his celebrated 1976 paper, that the convex hull peeling process allows ordering multivariate data and detecting outliers, an idea he credited to Tukey (1974[221]). Hodge and Austin (2004[94]) commented that convex hull peeling removes the records beyond the boundaries of the convex hull of the distribution; thus the peeling process abrades the outliers, and it can be repeated on the remaining points until a pre-specified number of records has been removed. However, Hodge and Austin (2004) were skeptical of this peeling process, which strips away points quickly, since at least d+1 points are removed in every iteration.

Not a convex hull peeling process, but similar in the sense of contouring the shape of the distribution, the minimum volume ellipsoid has also been adopted for eliminating outliers (Titterington, 1978[220] and Silverman and Titterington, 1980[209]). Nolan (1992[157]) studied the asymptotic properties of the multidimensional α-trimmed region based on ellipsoids and of various functionals of the convex set for multivariate location estimators. He also pointed out that the peeling process may remove too many non-outlying observations.

A successful application of the convex hull process to outlier detection, without reference to data depth, can be found in Bebbington (1978[26]), who used a convex hull to trim outliers for a robust estimator of correlation, called the convex hull trimmed correlation coefficient. He trimmed the points defining the vertices of the convex hull of the given sample without disturbing the general shape of the bivariate distribution, and he showed empirically that the correlation coefficient is not affected. Despite its intuitive appeal and this success in applications, the analytic complexity of the convex hull peeling process and its asymptotics has hindered more advanced studies.

2.6.6 Comments: Outliers in Massive Data

We live in the era of massive multivariate data, in contrast to classical statistics, which relies mainly on univariate, small samples. Most outlier detection studies were based on a univariate, small sample environment.
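Looking back at Bebbington's convex hull trimming from Section 2.6.5, the following is a minimal Python sketch of the idea: points on the outermost convex hull layer(s) are removed and the correlation is recomputed. It uses scipy's ConvexHull and is an illustration under these stated assumptions, not Bebbington's original procedure.

    import numpy as np
    from scipy.spatial import ConvexHull

    def hull_trimmed_corr(x, y, n_peels=1):
        """Correlation after peeling off the outermost convex hull layer(s).

        x, y    : 1-d arrays of equal length (a bivariate sample)
        n_peels : how many hull layers to strip before computing the correlation
        """
        pts = np.column_stack([x, y])
        for _ in range(n_peels):
            if len(pts) < 4:            # stop when too few points remain to peel
                break
            hull = ConvexHull(pts)
            keep = np.setdiff1d(np.arange(len(pts)), hull.vertices)
            pts = pts[keep]
        return np.corrcoef(pts[:, 0], pts[:, 1])[0, 1]

    rng = np.random.default_rng(1)
    x = rng.normal(size=300)
    y = 0.8 * x + 0.6 * rng.normal(size=300)
    x = np.append(x, [10, -9]); y = np.append(y, [-10, 9])   # two gross outliers
    print(np.corrcoef(x, y)[0, 1], hull_trimmed_corr(x, y, n_peels=2))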
A small number of methodologies for detecting outliers in multivariate data have been developed, but most are confined to the two dimensional world. That only bivariate applications are found might be attributed to the limitations of graphical display. Strangely, coping with outliers in massive multivariate data still seems remote. Outliers in massive multivariate data are no longer a handful of observations, and testing outlyingness one observation at a time is an unrealistic task. In general, such massive data come with no a priori information, so imposing a multivariate normal distribution may mislead results. Graphical visualization is impractical with more than two dimensional data, since reducing dimensions disguises important pieces of information.5 No simple and suitable nonparametric method exists for detecting outliers in multivariate massive data.

5 This statement is not right. The author later heard about parallel coordinates by Inselberg and Dimsdale (1990[105]). Parallel coordinates do not require dimension reduction to display data with multiple attributes. The author was not able to find studies on statistical outlier detection with parallel coordinates.

While massive data are being collected, we need a way to flag peculiar or unnecessary observations so as to manage data streaming efficiently. This flagging process strongly depends on how outliers are defined, which varies with the scientific interest. Despite the apparent complexity of the problem, we can still characterize outliers by the fact that they are somewhat isolated from the main clouds of points, regardless of the size and the dimension of the data. Potential outliers are points which are not internal to the main cloud of data, but lie somewhere above the surface of the cloud. Simple techniques that determine the relative order of observations with respect to other observations are therefore needed.

2.7 Streaming Data

The main challenge in data analysis, recently, is that data volumes are much larger than the memory of a typical computer (Bagchi et al., 2004[15]). In addition, since data are archived over long time periods, preliminary analysis tools for redesigning data storage in the middle of the archiving process are in demand. Whether the issue is memory overflow or interruption of the data flow, such needs can be accommodated suitably and adaptively by streaming data analysis. Streaming data analysis faces particular difficulties when the goal of the analysis is ordering data. Ordering requires sorting the whole sample; however, physical limits - lack of memory and waiting time - prohibit proper ordering. This section summarizes studies of sequential estimation methods that appeared in statistics for ordering univariate data. We also present algorithmic methods from an engineering perspective, which are more empirical than the statistical approaches. These empirical methods will eventually be utilized in developing algorithms for multivariate data analysis later.

2.7.1 Univariate Sequential Quantile Estimation

Computing quantiles requires ordering data, but it is not necessary to use all points at once. A space efficient recursive procedure for quantile estimation was proposed by Tierney (1983[219]). The concern was estimating a quantile of an unknown distribution with accuracy similar to that of the whole-sample quantile while using a fixed amount of storage. The discrepancy between the quantile estimate from parts of the sample and the sample quantile vanishes asymptotically.
This method has been dubbed simple stochastic approximation (Hurley and Modarres, 1995). Chen et al. (2000[50]) developed the exponentially weighted stochastic approximation (EWSA) estimator as an extension of the simple stochastic approximation. Hurley and Modarres (1995[103]) surveyed methods of low storage quantile estimation beyond stochastic approximation. They introduced histogram quantile estimation, which they compared, by simulation studies, to stochastic approximation as well as to the minimax trees and the remedian methods. They examined the statistical and computational properties of these methods for quantile estimation from a data set of n observations using far fewer memory locations than n.

Another notable stochastic approximation method for sequential quantile estimation is the single pass, low storage, arbitrary quantile estimation for massive data sets proposed by Liechty et al. (2003). This method updates quantiles by interpolation between new inputs and selected stored data (low storage), which is very attractive because the method produces any quantile of interest in a single pass. When these univariate sequential quantile estimation methods are applied, it often suffices to provide empirically reasonable approximations to sample quantiles instead of precise values, a direction commonly found in the engineering literature.

2.7.2 ε-Approximation

In univariate streaming data analysis, the ε-approximation approach is the most frequently adopted method of controlling the size of the data packet under processing when not all data from the stream can be incorporated. Numerous studies exist on the problem of computing quantiles of large streaming online or disk-resident data using a small main memory (e.g. Manku et al., 1998[137]; Lin et al., 2004[125]; and Greenwald and Khanna, 2001[78]).

Manku et al. (1998) proposed desirable properties for ε-approximate algorithms:

1. an algorithm must not require prior knowledge of new inputs,
2. an algorithm guarantees explicit and tunable approximation,
3. an algorithm computes results in a single pass,
4. an algorithm produces multiple quantiles at no extra cost,
5. an algorithm uses as little memory as possible,
6. and an algorithm must be simple to code.

Here we adopt the definition of ε-approximate from Lin et al. (2004).

Definition 2.6 (ε-approximate). A quantile summary for a data sequence of N elements is ε-approximate if, for any given rank r, it returns a value whose rank is guaranteed to be within the interval [r − εN, r + εN].

Lin et al. (2004[125]) studied the problem of continuously maintaining a quantile summary of the most recently observed N elements over a stream, so that quantile queries can be answered with a guaranteed precision of εN. They developed a space efficient algorithm for pre-defined N that requires only a single pass over the data stream. Their computational performance study indicated that not only is the actual quantile estimation error far below the guaranteed precision, but the space requirement is also much less than the given theoretical bound. Greenwald and Khanna (2001[78]) added that the ε-approximate quantile summary does not require a priori knowledge of the length of the input sequence.

2.7.3 Sliding Windows

Adapting low storage methods for streaming data implies sampling from the data and fitting the filtered data within a fixed low-storage space. This sampling scheme is called a sliding window in the data management literature.
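Before turning to sliding windows, a brief Python sketch may make the flavor of these low-storage quantile schemes concrete. It implements a toy single-pass, stochastic-approximation style quantile tracker in the spirit of Section 2.7.1; it is not the estimator of Tierney (1983) or the EWSA of Chen et al. (2000), and the decaying step-size rule is an assumption chosen only for illustration.

    import numpy as np

    def streaming_quantile(stream, p, c=1.0):
        """Single-pass, O(1)-memory tracker of the p-th quantile.

        Each observation nudges the current estimate up or down; the step size
        decays like c/n, a Robbins-Monro style choice made here for illustration.
        """
        q = None
        for n, x in enumerate(stream, start=1):
            if q is None:
                q = float(x)                  # initialize at the first observation
                continue
            step = c / n
            # Move up if x exceeds the estimate, down otherwise, weighted by p.
            q += step * (p - (x <= q))
        return q

    rng = np.random.default_rng(2)
    data = rng.normal(size=100_000)
    print(streaming_quantile(data, 0.9), np.quantile(data, 0.9))

The same single-pass structure is what the ε-approximate summaries formalize, with the added guarantee of Definition 2.6 on the rank error.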
A plethora of sliding window algorithms for streaming data exists in computer science. Xu et al. (2004[233]) classified sliding window algorithms into two categories:

• The Nf window (natural form window) takes the most recent N points at a time.
• The T window takes the data that arrived in the last T time units.

By this definition, the Nf window may overlap or skip portions of the streaming data. On the other hand, when the data arrival rate is known, the T window removes the concern about overlapping or skipping data. In scientific studies the data archiving rates are controlled, so the T window scheme is more frequently used.

2.7.4 Streaming Data in Astronomy

Astronomy is the oldest science of data analysis. Vast astronomical data archives have led to new discoveries that changed our view of the world. With the new century, the pattern of astronomical data archiving is experiencing rapid changes in both the temporal and spatial dimensions. The rapid growth of the data arrival rate and the capability of recording many variables simultaneously have changed the strategy of analyzing data. These rapid changes are attributable to new scientific and computational technology.

Virtual observatories, which maintain all collected astronomical data catalogs on top of database management systems from computer science, have become popular worldwide (the National Virtual Observatory6 and the European Virtual Observatory7 are the best known). These collected data, ranging from historic data catalogs to new observations from recent sky survey projects,8 which typically produce terabytes of data a day, have mostly become public only recently, but proper statistics to confront these huge data sets are not available. The growth of the data archives is steeply exponential. Detailed descriptions of the properties of massive astronomical data are found in Brunner et al. (2001[36]), Djorgovski et al. (2001[57]), and Longo et al. (2001[131]).

6 http://www.virtualobservatory.org/
7 http://www.euro-vo.org/pub/index.html
8 A list of survey projects is found at http://www.cv.nrao.edu/fits/www/yp survey.html

As a matter of fact, traditional statistics is unable to cope with such huge data and is giving way to computer science when it comes to analyzing massive data. Nonetheless, traditional statistical methods are still practiced by taking parts of the sample from survey data. Even though taking a small segment from huge data catalogs may be justified by prior scientific knowledge, statisticians and astronomers should be aware that new statistics for massive data must replace traditional small sample statistics. Astronomers are urged to change their perspective on analyzing data to cope with data growing rapidly in time and space, since they cannot rely on their eyes for massive data sets with many attributes. They should also recognize that taking small parts from huge data catalogs wastes massive database management technologies. Furthermore, statisticians are asked to open their eyes and develop statistics for massive data to meet the needs of the 21st century, rather than yield to computer scientists who forcefully plug data into machines without careful thought about the properties of the statistics and of the data set itself.

2.7.5 Comments

Determining changes in the distribution of a large, high-dimensional data set is a difficult problem.
Coordinate-wise naive approaches will not detect many types of distributional changes because coordinate-oriented methods lack robustness. Classical statistical tests usually apply to small data sets and now face three problems with large, high-dimensional data. First, it is likely impossible to fit all the data into memory at once. Partial sampling is the common refuge; however well designed, the sampling comes with pitfalls, particularly where outliers are concerned. Second, even if the samples are small in size, dimensionality becomes an obstacle. Existing techniques for nonparametric analysis of multivariate data, such as clustering, multidimensional scaling, principal components analysis, and others, become expensive as the number of dimensions increases. Third, classical multivariate statistics has centered heavily on the assumption of normality, primarily for the convenience of closed form solutions and easy applications. When data sets are large, assumptions of homogeneity (e.g. iid observations) that underlie classical statistical inference might not be valid. Therefore, success over the challenges of multivariate massive data is expected to come from developing new nonparametric multivariate analysis tools that incorporate streaming data.

Streaming data have emerged recently in the fields of databases and data management because of the avalanche of data collection. Disappointingly, streaming data algorithms have been developed only for univariate data, and these algorithms for controlling data streams are very empirical. Gilbert et al. (2005[73]) stated that in applications, quantile estimation algorithms for streaming data suffice if they provide reasonable approximations to quantile estimates and need not produce precise values. Methodologies for controlling the traffic of streaming data depend entirely on the experience of data miners. Additionally, the lack of universal metrics on multivariate data causes the deficiency of algorithms for multivariate streaming data. In Chapter 5, we will propose a few empirical methods for multivariate streaming data quantile estimation, which might serve general purposes in massive data analysis.

Chapter 3

Jackknife Approach to Model Selection

3.1 Introduction

Model selection, in general, is a procedure for exploring candidate models and choosing the one among the candidates that minimizes the distance from the true model. One of the best known distance measures is the Kullback-Leibler (KL) distance (Kullback and Leibler, 1951[119]),
$$D_{KL}[g; f_\theta] := \int g(x)\log g(x)\,dx - \int g(x)\log f_\theta(x)\,dx = E_g[\log g(X)] - E_g[\log f_\theta(X)] \ge 0,$$
where $g$ is the unknown true model, $f_\theta$ is the estimating model with parameter $\theta$, and $X$ is a random variable of observations. The purpose of an information criterion is to find, from a set of candidates $\{f_\theta : \theta \in \Theta^p \subset \mathbb{R}^p\}$, the model that describes the data best. This parametric family of distributions need not contain the true density function $g$, but it is assumed that there exists $\theta_g$ such that $f_{\theta_g}$ is closest to $g$ in Kullback-Leibler distance. A fitted model $f_{\hat\theta}$ is estimated, and one wishes to assess the distance between $f_{\hat\theta}$ and $g$. Since $E_g[\log g(X)]$ is unknown but constant, we only need an estimator of $\theta_g$ that maximizes the likelihood term $E_g[\log f_\theta(X)]$. Consequently, the maximum likelihood principle plays a role in choosing a model that minimizes the KL distance estimate.
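As a concrete illustration of this setup, the following Python sketch estimates $E_g[\log f_\theta(X)]$ by a Monte Carlo average and compares two hypothetical candidate families, each fitted by maximum likelihood, on data from a true model outside both families; the particular distributions are assumptions chosen only for the example.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.gamma(shape=3.0, scale=2.0, size=5000)      # "true" model g: a gamma law

    # Candidate family 1: exponential, fitted by maximum likelihood.
    scale_exp = x.mean()                                 # MLE of the exponential scale
    loglik_exp = stats.expon.logpdf(x, scale=scale_exp).mean()

    # Candidate family 2: lognormal, fitted by maximum likelihood on the log scale.
    mu, sigma = np.log(x).mean(), np.log(x).std()
    loglik_lnorm = stats.lognorm.logpdf(x, s=sigma, scale=np.exp(mu)).mean()

    # A larger empirical mean log-likelihood corresponds to a smaller estimated KL
    # distance to g; the unknown constant E_g[log g(X)] cancels in the comparison.
    print(loglik_exp, loglik_lnorm)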
Akaike (1973[7], 1974[8]) first introduced a model selection criterion (AIC) that leads to the model of minimum KL distance. For an unknown density g, AIC chooses the model that maximizes an estimate of $E_g[\log f_\theta(X)]$. Yet the bias incurred by estimating the likelihood and the parameters from the same data cannot be avoided. This bias appears as a penalty term in AIC, proportional to the number of parameters in the model. Despite its simplicity, a few drawbacks have been reported: a tendency to pick overfitted models (Hurvich and Tsai, 1989[104]), a lack of consistency in choosing the correct model (McQuarrie et al., 1997[146]), and applicability limited to models from the same parametric family of distributions (Konishi and Kitagawa, 1996[116]).

As remedies for these shortcomings of AIC, several information criteria with improved bias estimation have been proposed. To adjust the inconsistency of AIC, especially for small sample sizes, Hurvich and Tsai (1989[104]) and McQuarrie et al. (1997[146]) introduced the corrected Akaike information criterion (AICc) and the unbiased Akaike information criterion (AICu), respectively. The penalty terms in these criteria estimate the bias more consistently, so that AICc and AICu select the correct model more often than AIC; on the other hand, AICc and AICu apply only to regression models with normal error distributions. To loosen the assumptions behind the bias estimation, other information criteria were presented, such as the Takeuchi information criterion (TIC, Takeuchi, 1976[218]), the generalized information criterion (GIC, Konishi and Kitagawa, 1996[116]), and the information complexity (ICOMP, Bozdogan, 2000[33]).

Apart from direct bias estimation of the expected log likelihood, bias reduction through statistical resampling methods has produced model selection criteria free of assumptions such as a common parametric family of distributions. Popular resampling methods are cross validation, the jackknife, and the bootstrap. Stone (1977[215]) showed that estimating the expected log likelihood by cross validation is asymptotically equivalent to AIC. Several attempts at model selection with the bootstrap were made; however, the bootstrap information criterion was no better than AIC (Chung et al., 1996[53]; Shibata, 1997[207]; and Ishiguro et al., 1997[107]). While the asymptotic properties of cross validation and bootstrap model selection, and their applications, have been well studied, applications of the jackknife resampling method to model selection are hardly found in the literature.

The jackknife estimator is expected to reduce the bias of the expected log likelihood incurred in estimating the KL distance. Assume that the bias $b_n$ based on n observations arises from estimating the KL distance and satisfies the expansion
$$b_n = \frac{1}{n} b_1 + \frac{1}{n^2} b_2 + O(n^{-3}).$$
Let $b_{n,-i}$ be the estimate of the bias without the $i$th observation. Denoting by $b_J$ the jackknife bias estimate, the order of the jackknife bias estimate is $O(n^{-2})$, reduced from the order $O(n^{-1})$ of the ordinary bias, since
$$b_J = \frac{1}{n}\sum_{i=1}^{n}\bigl(n b_n - (n-1) b_{n,-i}\bigr) = b_1 + \frac{1}{n} b_2 - \Bigl(b_1 + \frac{1}{n-1} b_2\Bigr) + O(n^{-3}) = -\frac{1}{n(n-1)} b_2 + O(n^{-3}).$$
Furthermore, this bias reduction relaxes the model constraints, such as a common parametric family of distributions, otherwise needed to estimate the bias.
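The bias reduction above is easy to verify numerically. The sketch below applies the delete-one jackknife to a deliberately biased plug-in estimator (the uncorrected sample variance, used here only as a familiar stand-in whose O(1/n) bias is known in closed form); the jackknife-corrected estimate removes the leading bias term, as the expansion predicts.

    import numpy as np

    def jackknife_bias_correct(x, estimator):
        """Delete-one jackknife bias estimate and bias-corrected estimate."""
        n = len(x)
        theta_full = estimator(x)
        loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
        bias_J = (n - 1) * (loo.mean() - theta_full)     # jackknife bias estimate
        return theta_full - bias_J, bias_J

    biased_var = lambda x: np.mean((x - x.mean()) ** 2)  # bias is -sigma^2/n = O(1/n)

    rng = np.random.default_rng(4)
    x = rng.normal(scale=2.0, size=50)                   # true variance is 4
    corrected, bias_J = jackknife_bias_correct(x, biased_var)
    print(biased_var(x), corrected, bias_J)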
Very few studies of the jackknife principle are found in model selection compared to the bootstrap and cross-validation methods, possibly because of the perception that the jackknife is asymptotically similar to either the bootstrap or cross-validation, as similarities among these resampling methods have been reported in diverse statistical problems. In this chapter, we develop the jackknife information criterion (JIC), a competing model selection criterion based on the KL distance measure. Section 2 describes the necessary assumptions and discusses the strong convergence of the maximum likelihood estimator when the true model is unspecified. Section 3 establishes the uniform strong convergence of the jackknife maximum likelihood estimator as a step toward the new model selection criterion with the jackknife resampling scheme. Section 4 provides the definition of JIC and studies its asymptotic properties under the regularity conditions. Section 5 provides a discussion and some comparison between JIC and popular information criteria when the true model is unknown. Section 6 reconsiders the consistency of JIC with minimal assumptions on the likelihood in model selection.

3.2 Preliminaries

Suppose that the data points $X_1, X_2, \ldots, X_n$ are iid with common density $g$ and that $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ is a class of candidate models. Let the log likelihood $\log f_\theta(X_i)$ of $X_i$ be denoted by $l_i(\theta)$. The log likelihood function of all the observations and the log likelihood function without the $i$th observation are denoted by $L(\theta) = \sum_{i=1}^{n} l_i(\theta)$ and $L_{-i}(\theta) = \sum_{j \neq i} l_j(\theta)$, respectively. Unless stated otherwise, we assume the following throughout our study:

(J0) The $p$ dimensional parameter space $\Theta^p$ is a compact subset of $\mathbb{R}^p$.

(J1) For any $\theta \in \Theta^p$, each $f_\theta \in \mathcal{M}$ is distinct, i.e. if $\theta_1 \neq \theta_2$, then $f_{\theta_1} \neq f_{\theta_2}$.

(J2) A unique parameter $\theta_g$ exists in the interior of $\Theta^p$ and satisfies $E_g[\log f_{\theta_g}(X_i)] = \max_{\theta \in \Theta^p} E_g[\log f_\theta(X_i)]$.

(J3) The log likelihood $l_i(\theta)$ is thrice continuously differentiable with respect to $\theta \in \Theta^p$, and the derivatives are denoted by $\nabla_\theta l_i(\theta)$, $\nabla^2_\theta l_i(\theta)$, and $\nabla^3_\theta l_i(\theta)$, in that order. Also, $|l_i(\theta)|$, $|\frac{\partial}{\partial\theta_k} l_i(\theta)|$, $|\frac{\partial^2}{\partial\theta_k\partial\theta_l} l_i(\theta)|$, and $|\frac{\partial^3}{\partial\theta_k\partial\theta_l\partial\theta_m} l_i(\theta)|$ are dominated by $h(X_i)$, which is non-negative, does not depend on $\theta$, and satisfies $E[h(X_i)] < \infty$ ($k,l,m = 1,\ldots,p$).

(J4) For any $\theta \in \Theta^p$, $E_g[\nabla_\theta l_i(\theta)\nabla_\theta^T l_i(\theta)]$ and $-E_g[\nabla^2_\theta l_i(\theta)]$ are finite and positive definite $p \times p$ matrices. If $g = f_{\theta_g}$, then $E_g[\nabla_\theta l_i(\theta)\nabla_\theta^T l_i(\theta)] = -E_g[\nabla^2_\theta l_i(\theta)]$.

(J5) Let $\Lambda$ be a $p \times p$ non-singular matrix such that $\Lambda = -E_g[\nabla^2_\theta l_i(\theta_g)]$.

Remark on (J1). Admittedly, the identifiability constraint on the parameters is very restrictive. However, this strong condition eliminates long disputes over the problems caused by parameters near the boundary. When candidate models are nested, for example in mixture and regression models, it is hard to tell whether the proportion of the additional component is close to zero or the parameters of the additional components are close to those of the smaller model. Due to this ambiguity in the parameter space, traditional model selection cannot be compared directly to the jackknife criterion, as will be discussed later. Discussions of these breakdowns of parameters near the boundary with regard to identifiability problems are easily found (e.g. Protassov et al., 2002[165] and Bozdogan, 2000[33]).

Remark on (J2). If $\mathcal{M}$ contains $g$, then not only does $\theta_g$ maximize $E_g[\log f_\theta(X_i)]$, but the KL distance $D_{KL}[g; f_{\theta_g}]$ also becomes zero.
However, $g$ is likely to be unknown, so the estimate of the density $f_{\theta_g}$ that minimizes the KL distance is sought as a surrogate model for $g$. Also note that the parameter $\theta_g$ satisfies $E_g[\nabla_\theta l_i(\theta_g)] = 0$ and $E_g[\log f_\theta(X_i)] < E_g[\log f_{\theta_g}(X_i)]$ for any $\theta \neq \theta_g$ in $\Theta^p$, where the inequality is strict rather than $\le$ because of the uniqueness of $\theta_g$ (according to J2) and the dominated convergence theorem (according to J3, $E_g[|\log f_\theta(X_i)|] < \infty$).

Additional Remark. Candidate models for model selection criteria need not come from the same parametric family. We recommend considering candidate models $f_{\theta_1}$ and $f_{\theta_2}$ from different parametric families, so that $f_{\theta_1} \in \mathcal{M}_1$ and $f_{\theta_2} \in \mathcal{M}_2$ with $\mathcal{M}_1 \neq \mathcal{M}_2$.

We now show that the maximum likelihood estimator $\hat\theta_n$ of $\theta_g$ converges almost surely to the parameter of interest $\theta_g$ that minimizes the KL distance.

Theorem 3.1. If $\hat\theta_n$ is a function of $X_1, \ldots, X_n$ such that $\nabla_\theta L(\hat\theta_n) = 0$, then $\hat\theta_n \xrightarrow{a.s.} \theta_g$. The maximum likelihood estimator $\hat\theta_n$ is, in other words, the minimum discrepancy estimator of $\theta_g$.

Proof. It is sufficient to prove that
B: $P\{\lim_{n\to\infty} \|\hat\theta_n - \theta_g\| < \epsilon\} = 1$ for all $\epsilon > 0$,
where $\hat\theta_n$ satisfies
$$\text{A:}\quad \prod_{i=1}^{n} f_{\hat\theta_n}(X_i) \ge \prod_{i=1}^{n} f_\theta(X_i), \quad \forall \theta \in \Theta^p, \ \forall n. \quad (3.1)$$
We prove this almost sure convergence by negating A ⇒ B. Suppose B does not hold. Then there exists a subsequence $\{\hat\theta_{n_k}\}$ of $\hat\theta_n$ such that $\hat\theta_{n_k} \to \bar\theta$ a.e. for some $\bar\theta \neq \theta_g$ with $\bar\theta \in \Theta^p$. Let $\epsilon > 0$ be such that $\|\bar\theta - \theta_g\| \ge \epsilon$. In addition, from assumptions (J0)-(J2), for some $\delta > 0$,
$$E[l_1(\bar\theta) - l_1(\theta_g)] = -\delta < 0. \quad (3.2)$$
By the strong law of large numbers (SLLN) and (3.2),
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\theta_g)] \xrightarrow{a.s.} -\delta.$$
Thus, for all large n,
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\theta_g)] < -\delta/2 \quad \text{a.s.} \quad (3.3)$$
As $\hat\theta_{n_k} \to \bar\theta$ a.s., for all large $k$ we have, almost surely,
$$\|\bar\theta - \hat\theta_n\| < \frac{\delta}{8E[h(X_1)]},$$
where $E[h(X_1)]$ is bounded by (J3). By the SLLN and (J3),
$$\frac{1}{n}\sum_{i=1}^{n}[l_i(\bar\theta) - l_i(\hat\theta_n)] \le \|\bar\theta - \hat\theta_n\|\,\frac{1}{n}\sum_{i=1}^{n} h(X_i) < \frac{\delta}{8E[h(X_1)]}\,\frac{1}{n}\sum_{i=1}^{n} h(X_i) < \frac{\delta}{4} \quad \text{i.o. a.s.} \quad (3.4)$$
Now, (3.3) and (3.4) imply
$$P\left(\sum_{i=1}^{n}[l_i(\hat\theta_n) - l_i(\theta_g)] < 0 \ \ \text{i.o.}\right) = 1.$$
Therefore,
$$P\left(\prod_{i=1}^{n} f_{\hat\theta_n}(X_i) < \prod_{i=1}^{n} f_{\theta_g}(X_i) \ \ \text{i.o.}\right) = 1,$$
which violates (3.1) for $\theta = \theta_g$. This completes the proof.

Under assumptions (J0)-(J2), the maximum likelihood estimator $\hat\theta_n$ converges almost surely to $\theta_g$ even if the true model is unspecified. This strong consistency is generally assumed in statistical model selection studies with an unspecified or misspecified true model (e.g. Nishii, 1988[155]; Sin and White, 1996[210]). For the particular case $g = f_{\theta_g}$, this strong convergence is easily proved with Wald's approach (1949[226]). When $\theta_g$ is the true model parameter, Stone (1977[215]) showed that the maximum likelihood estimate $\hat\theta_n$ converges to $\theta_g$ in probability and proved that cross validation is equivalent to AIC.

AIC evaluates a fitted model against the true model in terms of minimizing the KL distance through maximum likelihood estimation, so AIC can only be used when a target parametric family of distributions contains the true model, which is unlikely in scientific problems. Another popular model selection criterion, BIC, assumes equal priors for the candidate models as well as an exponential family of distributions. Although the maximum likelihood estimator $\hat\theta_n$ is strongly consistent for the parameter of interest $\theta_g$ (the parameter that minimizes the KL distance), there is no guarantee that this $\hat\theta_n$ produces an unbiased likelihood estimate. Similar problems arise in other well known model selection criteria.
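The content of Theorem 3.1 can be illustrated numerically: when the candidate family is misspecified, the MLE still settles down to the pseudo-true parameter $\theta_g$ that minimizes the KL distance. The Python sketch below fits an exponential model to lognormal data; the choice of distributions is an assumption made only for illustration, and for the exponential family the KL-minimizing scale can be written in closed form (it equals the true mean), which is what the MLE approaches.

    import numpy as np

    rng = np.random.default_rng(5)

    def exp_mle(x):
        """MLE of the exponential scale; also the KL-minimizing value theta_g = E_g[X]."""
        return x.mean()

    theta_g = np.exp(0.5)            # E[X] for a lognormal(0, 1) "true" model g
    for n in (100, 10_000, 1_000_000):
        x = rng.lognormal(mean=0.0, sigma=1.0, size=n)
        print(n, exp_mle(x), theta_g)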
Unlike other information criteria, which take a bias estimation approach to the likelihood estimation, we take a bias reduction approach based on the jackknife resampling method.

3.3 Jackknife Maximum Likelihood Estimator

The jackknife is well known for bias correction and is relatively cheap computationally compared to the bootstrap. To begin developing a model selection criterion from the jackknife, we first show that the jackknife maximum likelihood estimator $\hat\theta_{-i}$ converges strongly to $\theta_g$, uniformly in $i$.

Theorem 3.2. Let $\hat\theta_{-i}$ be a function of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$ satisfying $\nabla_\theta L_{-i}(\hat\theta_{-i}) = 0$ for any $i \in [1, n]$. Then, uniformly in $i$, $\hat\theta_{-i}$ converges to $\theta_g$ almost surely:
$$P\left(\lim_{n\to\infty} \max_{1\le i\le n} \|\hat\theta_{-i} - \theta_g\| \le \epsilon\right) = 1, \quad \forall \epsilon > 0.$$

Proof. Let $\{\tau_n\}$ be the sequence $\tau_n = \max_{1\le i\le n}\|\hat\theta_{-i} - \theta_g\|$, where $\hat\theta_{-i}$ satisfies
$$\text{A:}\quad \prod_{j=1, j\neq i}^{n} f_{\hat\theta_{-i}}(X_j) \ge \prod_{j=1, j\neq i}^{n} f_\theta(X_j), \quad \forall\theta\in\Theta^p, \ \forall i \in [1, n]. \quad (3.5)$$
It is sufficient to show that $\tau_n$ converges to zero almost surely (B). We prove this uniform almost sure convergence along the lines of the proof of Theorem 3.1, negating the statement A ⇒ B into ¬B ⇒ ¬A.

Suppose $\tau_n \not\to 0$ a.e., so that $\|\hat\theta_{-i(n)} - \theta_g\| \not\to 0$ a.e. for some $i(n)$ with $1 \le i(n) \le n$. Let $\bar\theta$ be a limit point of $\{\hat\theta_{-i(n)}\}$. For the same $\delta$ as in (3.2) and with $E[h(X_1)] < \infty$ as in (J3),
$$\|\bar\theta - \hat\theta_{-i(n)}\| < \frac{\delta}{8E[h(X_1)]} \quad \text{i.o.}$$
Then
$$\frac{1}{n-1}\sum_{j\neq i(n)}[l_j(\bar\theta) - l_j(\hat\theta_{-i(n)})] \le \|\bar\theta - \hat\theta_{-i(n)}\|\,\frac{1}{n-1}\sum_{j\neq i(n)} h(X_j) \le \|\bar\theta - \hat\theta_{-i(n)}\|\,\frac{1}{n-1}\sum_{j=1}^{n} h(X_j) < \frac{\delta}{8E[h(X_1)]}\,\frac{n}{n-1}\,\frac{1}{n}\sum_{j=1}^{n} h(X_j) < \frac{\delta}{4} \quad \text{i.o. a.s.}$$
Hence
$$P\left(\frac{1}{n-1}\sum_{j\neq i(n)}[l_j(\hat\theta_{-i(n)}) - l_j(\theta_g)] < 0 \ \ \text{i.o.}\right) = 1$$
and
$$P\left(\prod_{j\neq i(n)} f_{\hat\theta_{-i(n)}}(X_j) < \prod_{j\neq i(n)} f_{\theta_g}(X_j) \ \ \text{i.o.}\right) = 1. \quad (3.6)$$
This result violates (3.5). This completes the proof.

We have shown that the jackknife maximum likelihood estimator converges uniformly to $\theta_g$, the parameter that minimizes the KL distance. The strong consistency of the maximum likelihood estimator has been discussed extensively in the literature (e.g. Wald, 1949[226]; LeCam, 1953[122]; and Huber, 1967[97]). Based on the regularity conditions and the strong convergence of the maximum likelihood estimators discussed so far, we take the jackknife approach to a model selection criterion in the next section.

3.4 Jackknife Information Criterion (JIC)

In this section, we define the jackknife information criterion (JIC) and investigate the asymptotic behavior of the jackknife model selection principle. In contrast to the traditional theoretical formulas, the jackknife does not explicitly require assumptions on the candidate models (Shao, 1995[203]). We assume only the regularity conditions that ensure the strong consistency of the maximum likelihood estimators and the asymptotic behavior of the jackknife model selection.

Deriving an information criterion starts from estimating the expected log likelihood. Before defining JIC, we first define the jackknife estimator of the log likelihood function and show that this jackknife estimator is asymptotically unbiased for the expected log likelihood. The bias corrected jackknife estimator $\bar\tau$ of a target estimator $\hat\tau$ is defined as $\bar\tau = \hat\tau - b_J$, where $b_J$ is the jackknife bias estimator
$$b_J = (n-1)\,\frac{1}{n}\sum_{i=1}^{n}(\hat\tau_{-i} - \hat\tau),$$
with $\hat\tau_{-i}$ the same estimator as $\hat\tau$ computed without the $i$th observation. Similarly, we formulate the bias corrected jackknife estimator of the log likelihood.

Definition 3.1 (Bias corrected jackknife log likelihood).
Let the jackknife estimator of the log likelihood, $J_n$, be
$$J_n = nL(\hat\theta_n) - \sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}). \quad (3.7)$$

The strong convergence of $\hat\theta_n$ and $\hat\theta_{-i}$ leads to the following theorem.

Theorem 3.3. Let $X_1, X_2, \ldots, X_n$ be iid random variables from a density function $g$ of an unknown distribution and let $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ be a class of models. If $(g, \mathcal{M})$ satisfies (J0)-(J5), then the jackknife log likelihood estimator $J_n$ satisfies
$$J_n = nL(\hat\theta) - \sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}) = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n, \quad (3.8)$$
where $\epsilon_n$ is a random variable satisfying $\lim_{n\to\infty} E_g[\epsilon_n] = 0$ and $\|\epsilon_n\| \xrightarrow{a.s.} 0$. Additionally, the fact that $E_g[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)] = 0$ makes $J_n$ an asymptotically unbiased estimator of the log likelihood.

Proof. We begin by grouping the components of $J_n$:
$$J_n = \sum_{i=1}^{n}[L(\hat\theta) - L_{-i}(\hat\theta_{-i})] = \sum_{i=1}^{n}[L_{-i}(\hat\theta) + l_i(\hat\theta) - L_{-i}(\hat\theta_{-i}) - l_i(\theta_g) + l_i(\theta_g)] = \sum_{i=1}^{n}[L_{-i}(\hat\theta) - L_{-i}(\hat\theta_{-i})] + \sum_{i=1}^{n}[l_i(\hat\theta) - l_i(\theta_g)] + \sum_{i=1}^{n} l_i(\theta_g).$$
By the Taylor expansion, the first and the second terms respectively become
$$\sum_{i=1}^{n}[L_{-i}(\hat\theta) - L_{-i}(\hat\theta_{-i})] = \sum_{i=1}^{n}\nabla L_{-i}(\hat\theta_{-i})(\hat\theta - \hat\theta_{-i}) + \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) = \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i})$$
and
$$\sum_{i=1}^{n}[l_i(\hat\theta) - l_i(\theta_g)] = L(\hat\theta) - L(\theta_g) = -(\theta_g - \hat\theta)^T\nabla L(\hat\theta) - \frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta),$$
where $\xi = \theta_g + \delta(\hat\theta - \theta_g)$ and $\eta_i = \hat\theta + \gamma_i(\hat\theta_{-i} - \hat\theta)$ for some $\delta$ and $\gamma_i$ with $0 < \delta, \gamma_i < 1$. Hence,
$$J_n = L(\theta_g) + \frac{1}{2}\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) - \frac{1}{2}(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta). \quad (3.9)$$

First, consider $\sum_i(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i})$ in (3.9). From Theorems 3.1 and 3.2, the maximum likelihood estimates $\hat\theta_n$ and $\hat\theta_{-i}$ are eventually confined, almost surely, so that $\max\{\|\hat\theta_n - \theta_g\|, \max_i\|\hat\theta_{-i} - \theta_g\|\} \le \epsilon$ for any $\epsilon > 0$. Also note that $\max_i\|\eta_i - \theta_g\| \le \max(\|\hat\theta_n - \theta_g\|, \max_i\|\hat\theta_{-i} - \theta_g\|)$. Thus,
$$\max_{1\le i\le n}\left\|\frac{1}{n}\nabla^2 L_{-i}(\eta_i) + \Lambda\right\| \le \max_{1\le i\le n}\frac{1}{n}\left\|\nabla^2 L_{-i}(\eta_i) - \nabla^2 L(\theta_g)\right\| + \left\|\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\right\|.$$
Here $\max_{1\le i\le n}\frac{1}{n}\|\nabla^2 L_{-i}(\eta_i) - \nabla^2 L(\theta_g)\| \xrightarrow{a.s.} 0$ and $\|\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\| \xrightarrow{a.s.} 0$. Then
$$\max_{1\le i\le n}\left\|\frac{1}{n}\nabla^2 L_{-i}(\eta_i) + \Lambda\right\| \xrightarrow{a.s.} 0. \quad (3.10)$$
In addition,
$$-\nabla l_i(\hat\theta) = \nabla L(\hat\theta) - \nabla L_{-i}(\hat\theta_{-i}) - \nabla l_i(\hat\theta) = \nabla L_{-i}(\hat\theta) - \nabla L_{-i}(\hat\theta_{-i}) = \nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}).$$
Using (3.10), the above equation gives
$$(\hat\theta - \hat\theta_{-i}) = \frac{1}{n}\Lambda^{-1}\nabla l_i(\hat\theta) + o(n^{-1}) \quad \text{a.s.}, \quad (3.11)$$
uniformly in $i$. Let $\alpha\epsilon_n = \|\hat\theta - \theta_g\|$, where $\alpha$ is an arbitrary constant and $\epsilon_n$ is a random variable satisfying $\lim_{n\to\infty} E[\epsilon_n] = 0$ and $\epsilon_n \xrightarrow{a.s.} 0$. Since $|\nabla^2 l_i(\eta)|$ is bounded,
$$\nabla l_i(\hat\theta) = \nabla l_i(\theta_g) + \nabla^2 l_i(\eta)(\hat\theta - \theta_g)$$
is equivalent to
$$\nabla l_i(\hat\theta) = \nabla l_i(\theta_g) + \alpha^*\epsilon_n, \quad (3.12)$$
for some constant $\alpha^*$. Therefore, combining eq. (3.10), eq. (3.11), and eq. (3.12) gives
$$\sum_{i=1}^{n}(\hat\theta - \hat\theta_{-i})^T\nabla^2 L_{-i}(\eta_i)(\hat\theta - \hat\theta_{-i}) = -\frac{1}{n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \epsilon_n.$$
At this point we remind the reader that, without loss of generality, $a\epsilon_n$ for any constant $a$ is denoted by $\epsilon_n$, since $\lim_{n\to\infty} E[a\epsilon_n] = \lim_{n\to\infty} E[\epsilon_n] = 0$ and $a\epsilon_n = o(1)$ a.s.

Similarly, the last term of eq. (3.9) becomes
$$(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + \epsilon_n.$$
As $\|\xi - \theta_g\| \le \|\hat\theta - \theta_g\| = o(1)$ a.s.,
$$\frac{1}{n}\|\nabla^2 L(\xi) - \nabla^2 L(\theta_g)\| \xrightarrow{a.s.} 0.$$
With the SLLN,
$$\nabla^2 L(\xi) = \nabla^2 L(\theta_g) + n\Lambda - n\Lambda - \nabla^2 L(\theta_g) + \nabla^2 L(\xi) = n\left(\frac{1}{n}\nabla^2 L(\theta_g) + \Lambda\right) + \left(\nabla^2 L(\xi) - \nabla^2 L(\theta_g)\right) - n\Lambda = -n(\Lambda + o(1)) \quad \text{a.s.} \quad (3.13)$$
Additionally, we have
$$\nabla L(\theta_g) = \nabla L(\theta_g) - \nabla L(\hat\theta) = \nabla^2 L(\xi)(\theta_g - \hat\theta), \quad (3.14)$$
and substituting eq. (3.13) into eq. (3.14) gives
$$(\theta_g - \hat\theta) = -\frac{1}{n}\Lambda^{-1}\nabla L(\theta_g) + o(n^{-1}) \quad \text{a.s.}$$
Hence,
$$(\theta_g - \hat\theta)^T\nabla^2 L(\xi)(\theta_g - \hat\theta) = -\frac{1}{n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + o(n^{-1}) \quad \text{a.s.}$$
Consequently, eq. (3.9) becomes
$$J_n = L(\theta_g) - \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\nabla^T L(\theta_g)\Lambda^{-1}\nabla L(\theta_g) + \epsilon_n = L(\theta_g) - \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\sum_{i=1}^{n}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_i(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n. \quad (3.15)$$
Note that $J_n$ is an asymptotically unbiased estimator of the expected log likelihood:
$$E\left[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)\right] = E\left[E\left[\frac{1}{2n}\sum_{i\neq j}\nabla l_i(\theta_g)^T\Lambda^{-1}\nabla l_j(\theta_g)\,\Big|\,X_i\right]\right] = \frac{1}{2n}\sum_{i\neq j}E[\nabla l_i(\theta_g)^T\Lambda^{-1}]\,E[\nabla l_j(\theta_g)] = 0.$$
This completes the proof.

Asymptotically, no bias term is involved in $J_n$, in contrast to the cross validation and the bootstrap estimators of the log likelihood (Shao, 1993[204]; Chung et al., 1996[53]; and Shibata, 1997[207]). Therefore, the model selection criterion built on the jackknife method is expected to behave differently, asymptotically, from the selection criteria built on other resampling methods. We investigate the jackknife principle further as a model selection criterion by proposing the jackknife information criterion (JIC) as minus twice $J_n$, since the popular information criteria involve estimators of negative twice the expected log likelihood.

Definition 3.2. The jackknife information criterion (JIC) is defined as minus twice $J_n$:
$$\mathrm{JIC} = -2J_n = -2nL(\hat\theta) + 2\sum_{i=1}^{n} L_{-i}(\hat\theta_{-i}).$$
(A small numerical sketch of this definition appears below, after Theorem 3.4.)

Note that multiplying by $-2$ does not change the asymptotic unbiasedness of $J_n$. Furthermore, the actual behavior of an estimator can be predicted through its convergence rate. The convergence rate of a maximum likelihood type estimator is generally obtained from the Taylor expansion and the central limit theorem (CLT), and the regularity conditions guarantee the Taylor expansion. The convergence rate of $J_n$ could be specified by its limiting distribution. Nonetheless, Miller (1974[148]) commented that no one has succeeded in deriving the limiting distribution of the delete-one jackknife estimator. While the CLT requires a consistent variance estimator, the delete-one jackknife is known to give inconsistent variance estimators for non-smooth estimators (Shao and Wu, 1989[205]). Theorem 3.3 assures the existence of the asymptotically unbiased delete-one jackknife estimator of the expected log likelihood. Instead of obtaining the limiting distribution of $J_n$ by the CLT, the stochastic rate of $J_n$ is presented by means of the law of the iterated logarithm (LIL) for understanding the behavior of $J_n$ in the asymptopia.

Theorem 3.4. Let $X_1, X_2, \ldots, X_n$ be iid random variables from an unknown distribution with density $g$ and let $\mathcal{M} = \{f_\theta : \theta \in \Theta^p\}$ be a class of models. Under (J0)-(J5), the stochastic orders relating to $J_n$ are:
(1) $J_n = L(\theta_g) + O(\log\log n)$ a.s.
(2) $\frac{1}{n}J_n = \int \log f_{\theta_g}(x)\,g(x)\,dx + O\bigl(\sqrt{n^{-1}\log\log n}\bigr)$ a.s.

Proof. By the LIL, we have
$$\frac{1}{n}\sum_{i=1}^{n}\nabla l_i(\theta_g) = O\bigl(\sqrt{n^{-1}\log\log n}\bigr) \quad \text{a.s.},$$
so that the stochastic order of $J_n$ is
$$J_n = L(\theta_g) + \frac{1}{2n}\sum_{i\neq j}\nabla^T l_i(\theta_g)\Lambda^{-1}\nabla l_j(\theta_g) + \epsilon_n = L(\theta_g) + O(\log\log n) \quad \text{a.s.}, \quad (3.16)$$
where $\epsilon_n = o(1)$ a.s. from Theorem 3.3. The result (3.16) and the LIL lead to
$$\frac{1}{n}J_n = \frac{1}{n}L(\theta_g) + \frac{1}{n}\{J_n - L(\theta_g)\} = \int \log f_{\theta_g}(x)\,g(x)\,dx + O\bigl(\sqrt{n^{-1}\log\log n}\bigr) \quad \text{a.s.}, \quad (3.17)$$
since $\frac{1}{n}L(\theta_g) = \int \log f_{\theta_g}(x)\,g(x)\,dx + O\bigl(\sqrt{n^{-1}\log\log n}\bigr)$ (Nishii, 1988). This completes the proof.

Although Theorem 3.3 shows that $J_n$ is asymptotically unbiased, JIC might not pick a good model, particularly among nested models whose $L(\theta_g)$ are asymptotically identical.
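As referenced after Definition 3.2, the following Python sketch computes $J_n$ and JIC for two hypothetical non-nested candidate families (exponential and lognormal, both fitted by closed-form maximum likelihood). The data-generating model and the candidates are assumptions made only for illustration, and the data-oriented penalty $C_n$ of Proposition 3.1 in the next section is only indicated in a comment rather than derived.

    import numpy as np
    from scipy import stats

    def jic(x, fit, loglik):
        """JIC = -2*J_n with J_n = n*L(theta_hat) - sum_i L_{-i}(theta_hat_{-i})."""
        n = len(x)
        L_full = loglik(x, fit(x)).sum()
        L_loo = np.array([loglik(np.delete(x, i), fit(np.delete(x, i))).sum()
                          for i in range(n)])
        J_n = n * L_full - L_loo.sum()
        return -2.0 * J_n

    # Candidate 1: exponential with MLE scale = sample mean.
    fit_exp = lambda x: x.mean()
    ll_exp = lambda x, s: stats.expon.logpdf(x, scale=s)

    # Candidate 2: lognormal with MLE (mu, sigma) on the log scale.
    fit_ln = lambda x: (np.log(x).mean(), np.log(x).std())
    ll_ln = lambda x, p: stats.lognorm.logpdf(x, s=p[1], scale=np.exp(p[0]))

    rng = np.random.default_rng(6)
    x = rng.lognormal(mean=0.5, sigma=0.8, size=200)      # toy data
    print("JIC exp:", jic(x, fit_exp, ll_exp), "JIC lognormal:", jic(x, fit_ln, ll_ln))
    # A corrected criterion in the spirit of Proposition 3.1 would add a penalty C_n
    # with C_n/n -> 0 and C_n/log log n -> infinity, for example C_n = p*log(n).

The model with the smaller JIC value is preferred, mirroring the convention of the popular information criteria.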
The stochastic rates of the estimators investigated by Nishii (1988[155]) imply that the randomness of the estimators is bounded by a function of the sample size. This randomness can be diluted by adopting penalty terms, as in other information criteria, which may enhance the consistency of jackknife model selection. We therefore suggest a slight modification of JIC.

Proposition 3.1. The corrected JIC (JICc) selects a model consistently, where
$$\mathrm{JICc} = \mathrm{JIC} + C_n$$
and $C_n$ satisfies $\frac{C_n}{n} \to 0$ and $\frac{C_n}{\log\log n} \to \infty$.

Remark on Proposition 3.1. The penalty term $C_n$ was adapted from the data oriented penalty of Rao and Wu (1989[167]) and Bai et al. (1999[16]). Penalty terms can be estimated when the true model is known. In the current study, however, we consider cases when the true model is unspecified or misspecified, so the penalty is not estimable; instead, bias reduction methods have been employed. On the other hand, JIC reflects no model complexity, which would further penalize models with many parameters. Therefore, the order of the penalty term $C_n$ should be larger than the stochastic rate of $J_n$ and smaller than the sample size. As with MDL, it seems reasonable to build the degree of model complexity into $C_n$. If some of the candidate models are nested and those nested models have complexity proportional to the number of parameters, then $p\log n$ would be a good choice for the penalty term, since $p\log n$ is of higher order than $\log\log n$ and of lower order than $n$. A more careful investigation of the choice of $C_n$ is left for future work.

3.5 When Candidate Models are Nested

Consider k candidate models $M_1, \ldots, M_k$, where $M_1$ is the smallest model and $M_k$ is the largest, nested in the given order. Let the corresponding parameter spaces be $\Omega_1, \ldots, \Omega_k$; clearly $\Omega_i \subset \Omega_j$ for $i < j$. It may happen that $\theta_g$, the parameter of interest satisfying $E[\nabla L(\theta_g)] = 0$, lies in $\Omega_m$ for a given m but does not lie in any space $\Omega_i$ with $i < m$. The virtue of a consistent model selection method is that it correctly identifies m, neither m − 1 nor m + 1. Traditional model selection criteria have penalty terms to keep an underestimated or an overestimated model from being chosen. The jackknife model selection criterion, however, has no such penalty term.

Unless $\theta_g \in \Omega_k \setminus \Omega_{k-1}$, the jackknife information criterion is unable to discriminate the best model among nested models. The event $\theta_g \in \Omega_m$ implies $\theta_g \in \Omega_i$ for $m \le i \le k$, so the jackknife maximum likelihood estimates of $M_m, \ldots, M_k$ are stochastically equivalent ($J_n(M_i) - J_n(M_j) = o(1)$ for $m \le i, j \le k$ and $i \neq j$). On the other hand, if $\theta_g$ resides only in the largest parameter space, the maximum likelihood approach estimates the parameter of interest $\theta_i$ of each model $M_i$ that minimizes the KL distance (maximizing the empirical likelihood based on the maximum likelihood estimator of $\theta_i$) within its own parameter space; yet the resulting KL distance attained by $\theta_i$ is larger than the KL distance attained by $\theta_g$. When the true model is unspecified, the bias due to model estimation cannot be assessed, and the penalty terms of popular model selection criteria, such as $2p$ in AIC and $p\log n$ in BIC, no longer take such simple forms. The bias reduction approach, on the other hand, allows measuring the relative distances among candidate models with respect to the KL distance. When the parameter space of each candidate model occupies a unique region, the JIC will, regardless of the true model, select a model with a relatively short KL distance.
A difficulty arises only when some candidate models are nested and the true model is unknown. The relationships among candidate models and their parameter spaces are illustrated in Fig. 3.1.