
A geometric perspective on some topics in statistical learning

by

Yuting Wei

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Statistics

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Martin Wainwright, Co-chair
Professor Adityanand Guntuboyina, Co-chair
Professor Peter Bickel
Professor Venkat Anantharam

Spring 2018

Copyright 2018 by Yuting Wei

Abstract

A geometric perspective on some topics in statistical learning

by Yuting Wei

Doctor of Philosophy in Statistics

University of California, Berkeley

Professor Martin Wainwright, Co-chair
Professor Adityanand Guntuboyina, Co-chair

Modern science and engineering often generate data sets with a large sample size and a comparably large dimension, which puts classic asymptotic theory into question in many ways. The main focus of this thesis is therefore to develop a fundamental understanding of statistical procedures for estimation and hypothesis testing from a non-asymptotic point of view, where the sample size and the problem dimension grow hand in hand. A range of different problems are explored in this thesis, including work on the geometry of hypothesis testing, adaptivity to local structure in estimation, effective methods for shape-constrained problems, and early stopping with boosting algorithms. Our treatment of these different problems shares the common theme of emphasizing the underlying geometric structure.

To be more specific, in our hypothesis testing problem, the null and alternative are specified by a pair of convex cones. This cone structure makes possible a sharp characterization of the behavior of the generalized likelihood ratio test (GLRT) and of its optimality properties. The problem of estimating a planar set based on noisy measurements of its support function is non-parametric in nature. It is interesting to see that estimators can be constructed that are more efficient when the underlying set has a simpler structure, even without knowing the set beforehand. Moreover, when we consider applying boosting algorithms to estimate a function in a reproducing kernel Hilbert space (RKHS), the optimal stopping rule and the resulting estimator turn out to be determined by the localized complexity of the space.

These results demonstrate that, on the one hand, one can benefit from respecting and making use of the underlying structure (the optimal early stopping rule differs across RKHSs); on the other hand, some procedures (such as the GLRT or local smoothing estimators) can achieve better performance when the underlying structure is simpler, without prior knowledge of the structure itself. To evaluate the behavior of any statistical procedure, we follow the classic minimax framework and also discuss a more refined notion of local minimaxity.

To my parents and grandmother.

Contents

List of Figures

I Introduction and background

1 Introduction
  1.1 Geometry of high-dimensional hypothesis testing
  1.2 Shape-constrained problems
  1.3 Optimization and early-stopping
  1.4 Thesis overview

2 Background
  2.1 Evaluating statistical procedures
  2.2 Non-parametric estimation

II Statistical inference and estimation

3 Hypothesis testing over convex cones
  3.1 Introduction
  3.2 Background on conic geometry and the GLRT
  3.3 Main results and their consequences
  3.4 Discussion
  3.5 Proofs of main results

4 Adaptive estimation of planar convex sets
  4.1 Introduction
  4.2 Estimation procedures
  4.3 Main results
  4.4 Examples
  4.5 Numerical results
  4.6 Discussion
  4.7 Proofs of the main results

III Optimization

5 Early stopping for kernel boosting algorithms
  5.1 Introduction
  5.2 Background and problem formulation
  5.3 Main results
  5.4 Consequences for various kernel classes
  5.5 Discussion
  5.6 Proof of main results

6 Future directions

A Proofs for Chapter 3
  A.1 The GLRT sub-optimality
  A.2 Distances and their properties
  A.3 Proofs for Proposition 3.3.1 and 3.3.2
  A.4 Completion of the proof of Theorem 3.3.1(a)
  A.5 Completion of the proof of Theorem 3.3.1(b)
  A.6 Completion of the proof of Theorem 3.3.2
  A.7 Completion of the proof of Proposition 3.3.2 and the monotone cone

B Proofs for Chapter 4
  B.1 Additional proofs and technical results
  B.2 Additional Simulation Results

C Proofs for Chapter 5
  C.1 Proof of Lemma 1
  C.2 Proof of Lemma 2
  C.3 Proof of Lemma 3
  C.4 Proof of Lemma 4

Bibliography

List of Figures

3.1 (a) A 3-dimensional circular cone with angle α. (b) Illustration of a cone versus its polar cone.
3.2 Illustration of the product cone defined in equation (3.37).

4.1 Point estimation error when K∗ is a ball.
4.2 Point estimation error when K∗ is a segment.
4.3 Set estimation when K∗ is a ball.
4.4 Set estimation when K∗ is a segment.

5.1 Plots of the squared error $\|f^t - f^*\|_n^2 = \frac{1}{n}\sum_{i=1}^n (f^t(x_i) - f^*(x_i))^2$ versus the iteration number t for (a) LogitBoost using a first-order Sobolev kernel, and (b) AdaBoost using the same first-order Sobolev kernel $K(x, x') = 1 + \min(x, x')$, which generates a class of Lipschitz functions (splines of order one). Both plots correspond to a sample size n = 100.
5.2 The mean-squared errors for the stopped iterates $\bar{f}^T$ at the Gold standard, i.e. the iterate with the minimum error among all unstopped updates (blue), and at $T = (7n)^\kappa$ (with the theoretically optimal κ = 0.67 in red, κ = 0.33 in black and κ = 1 in green) for (a) L2-Boost and (b) LogitBoost.
5.3 Logarithmic plots of the mean-squared errors at the Gold standard in blue and at $T = (7n)^\kappa$ (with the theoretically optimal rule for κ = 0.67 in red, κ = 0.33 in black and κ = 1 in green) for (a) L2-Boost and (b) LogitBoost.

B.1 Point estimation error when K∗ is a square.
B.2 Point estimation error when K∗ is an ellipsoid.
B.3 Point estimation error when K∗ is a random polytope.
B.4 Set estimation when K∗ is a square.
B.5 Set estimation when K∗ is an ellipsoid.
B.6 Set estimation when K∗ is a random polytope.

Acknowledgments

Before entering college, I never dreamt that I would fly to the other side of the world, complete a Ph.D. in statistics, and be so accepted, understood, supported, and loved in the way that people within Berkeley and across a greater academic community have shown me. I cannot begin to adequately thank those who helped me in the preparation of this thesis and made the past five years probably the most wonderful journey of my life.

First and foremost, I am grateful to have the two most amazing advisors a graduate student could ever hope for, Martin Wainwright and Adityanand Guntuboyina. I first met Aditya by taking a graduate class with him on theoretical statistics. His class greatly sparked my interest and equipped me with tools to work on statistical theory, primarily due to the extraordinary clarity of his teaching, as well as his passion for the material (who would have known I came to Berkeley with the intention of working on applied statistics). After that we started to work together and I wrote my first real paper with him. As an advisor, Aditya is incredibly generous with his ideas and time, and has influenced me greatly with his genuine humility, despite his great talent and expertise. I also started to talk to Martin more frequently during my second year and was fortunate enough to visit him for three months in my third year when he was on sabbatical at ETH Zürich. During my interactions with Martin, I was (and still am) constantly amazed by his mathematical sharpness; his ability to distill the essence of a problem so rapidly; his broad knowledge and deep understanding of so many subjects—statistics, optimization, and computing; and by his care, his humor, and his aesthetic appreciation of coffee. Having worked with both of them over an intensive period of time was one of the best things that could ever have happened to me. Over these years, they guided me in how to approach research, give talks, and write; taught me what good research is; and helped me to believe in my potential and make the most of it. It changed me completely.

I also benefited a lot from interactions with other faculty members in both the statistics and EECS departments. Prof. Peter Bickel's knowledge and kindness are unparalleled; Prof. Bin Yu is a source of life wisdom; Prof. Noureddine El Karoui's research and appreciation of music have been an inspiration. I also thank Prof. Michael Jordan for introducing me to non-parametric statistics through the weekly reading group on a book by Tsybakov. I thank Prof. Peng Ding for teaching me everything I know about causal inference and for being so supportive of me when I was reluctant about being on the job market. I am also thankful to Prof. Venkat Anantharam for serving on my committee and providing me with very helpful feedback during my qualifying exam and in our subsequent interactions. Besides, I was also lucky enough to have some wonderful teachers from whom I learned a lot at Berkeley—Steve Evans, Allan Sly, Bin Yu, Noureddine El Karoui, Peng Ding, Peter Bartlett, Ben Recht, Ravi Kannan, Fraydoun Rezakhanlou, Alessandro Chiesa, Aditya Guntuboyina, Martin Wainwright—who gifted me with oars for sailing the ocean of research.

In my earlier graduate years, I was very fortunate to collaborate with Prof. Tony Cai through my advisor Aditya. The problem that we worked on together got me into the field of shape-constrained methods, where a lot of beautiful mathematical theory lies.
Besides being an extraordinary researcher and a statistics encyclopedia, Tony has been very inspiring and supportive of young researchers. I would also like to thank the members of the Seminar für Statistik at ETH Zürich for their warm welcome during my visit in my third year of grad school; in particular, many thanks to Peter Bühlmann for making my stay so enjoyable. I am also indebted to Martin for getting me an invitation to the workshop in Oberwolfach on "Statistical Recovery of Discrete, Geometric and Invariant Structures", where I met many great scholars of our field for the first time. During this workshop, I received much encouraging feedback on my work and was overwhelmed, in a good way, by enlightening talks, so for me that was a hugely eye-opening and inspiring experience, for which I will always be grateful. I also want to thank Liza Levina for continuous support, Bodhisattva Sen for helpful discussions, Richard Nickl for sharing classical music, and Sivaraman Balakrishnan for the many confused and aha moments we have had together. Before coming to Berkeley, I spent a summer at City University of Hong Kong working on bioinformatics with Prof. Steven Smale, from whom I learned my first steps of research, which was an exceedingly helpful foundation for me.

To all the friends that I made throughout my journey at Berkeley: it has been a luxury to have you in my life, and without you my 20s would lose half their color. I must thank Po-Ling Loh for being an extremely supportive and caring academic sister and friend, to whom I always felt safe turning. And thanks to my older academic brothers, in Martin's group or from China: Nihar Shah, Yuchen Zhang, Mert Pilanci, Xiaodong Li, Zhiyu Wang, Ruixiang Zhang, Jiantao Jiao—your encouragement and advice at each critical step in my grad career are invaluable to me. I thank Yumeng Zhang, who kept me company through those ups and downs and fed me with her perfect cooking; and Hye Soo Choi, who had the magic of turning my sad moments into smiley days. I thank all my friends in the statistics department, in particular Siqi Wu, Lihua Lei, Yuansi Chen, Xiao Li, and Billy Fang, for the academic and non-academic conversations. I am also very thankful for all my friends in the Wi-Fo/Bliss lab, as I moved to Cory Hall during my last two years of grad school. I thank Fanny Yang, for being a great roommate and an inspiring figure who is always on her way to perfection; Orhan Ocal, for the wonderful lunch/coffee/boba times with me; and Raaz Dwivedi, with whom I visited beautiful Prague and shared a lot of laughter. Special thanks to Varun Jog, Rashmi Korlakai Vinayak, Vasuki Narasimha Swamy, Reinhard Heckel, Vidya Muthukumar, Ashwin Pananjady, Sang Min Han, and Soham Phade, among others. I am also grateful for all the care and support from you during these years—thanks to Jiequn Han, Jiajun Tong, Song Mei, Ze Xu, Jingxue Fu, Shiman Ding, Ben Zhang, Kyle Yang, Haoran Tang, Ruoxi Jia, Qian Zhong, Chang Liu and Zhe Ji—with whom I have shared some of my fondest memories, whether they were moments of sadness, tears, sickness, happiness, or silliness.

Above all, I owe the most to my family, in particular to my parents and grandmother, to whom this thesis is dedicated, for your unconditional love and heroic reserves of patience. Every achievement of mine, past and future, if there is any, is all because of you.

Part I

Introduction and background

Chapter 1

Introduction

With vast amounts of data being collected every day in modern science and engineering, statistics has entered a new era. Whereas earlier scientific studies were constrained by the cost and time of data collection, advanced technology now allows for extremely large and high-dimensional data sets. These data sets often have dimension of the same order as, or even larger than, the sample size, which calls classic asymptotic theory into question and motivates a non-asymptotic point of view in modern statistics. The main focus of this thesis is to develop a fundamental understanding of statistical procedures for high-dimensional testing and estimation, bringing together techniques from statistics, optimization, and information theory. A range of different problems are explored in this thesis, including work on the geometry of hypothesis testing, adaptivity to local structure in estimation, effective methods for shape-constrained problems, and early stopping for boosting algorithms. A common theme underlying much of this work is the underlying geometric structure of the problem. In the following sections, we outline some of the core problems and key ideas that will be developed in the remainder of this thesis.

1.1 Geometry of high-dimensional hypothesis testing

Hypothesis testing, along with the closely associated notion of a confidence region, has long played a central role in statistical inference. While research on hypothesis testing dates back to the seminal work of Neyman and Pearson, high-dimensional and structured testing problems have drawn attention in recent years, motivated by the large amounts of data generated by experimental sciences and technological applications.

The generalized likelihood ratio test (GLRT) is a standard approach to composite testing problems. Despite the wide-spread use of the GLRT, its properties have yet to be fully understood. When is it optimal, and when can it be improved upon? How does its performance depend on the null and alternative hypotheses? In this thesis, we provide answers to these and other questions for the case where the null and alternative are specified by a pair of closed, convex cones. Such cone testing problems arise in various applications, including detection of treatment effects, trend detection in econometrics, signal detection in radar processing, and shape-constrained inference in non-parametric statistics.

The main contribution of this study is to provide a sharp characterization of the GLRT testing radius purely in terms of the geometric structure of the underlying convex cones. When applied to concrete examples, our result reveals some fundamental phenomena that do not arise in the analogous problem of estimation under convex constraints. In particular, in contrast to the estimation error, the testing error no longer depends on the problem instance only via a volume-based measure such as metric entropy or Gaussian complexity; instead, other geometric properties of the cones also play an important role. In order to address the issue of optimality, we prove information-theoretic lower bounds for the minimax testing radius, again in terms of geometric quantities. These lower bounds apply to any test function, thus providing a sufficient condition for the GLRT to be an optimal test. These general theorems are illustrated by examples including the cases of the monotone and orthant cones, and involve some results of independent interest. It is worthwhile to note that these newfound connections between the hardness of hypothesis testing and the local geometry of the underlying structures have many implications. In particular, as we point out, they reveal the intrinsic similarities and differences between estimation and hypothesis testing.

1.2 Shape-constrained problems

Research on estimation and testing under shape constraints started in the 1950s. A non-parametric problem is said to be shape-constrained if the underlying density or function is required to satisfy constraints such as monotonicity, unimodality, or convexity (e.g., [70]). Shape-constrained methods have their own merits in many ways: first, being non-parametric, these methods are more robust than standard parametric approaches; second, although these methods deal with infinite-dimensional models, shape constraints may be implemented without tuning parameters (such as a bandwidth or a penalization parameter).

Recent years have witnessed renewed interest in shape-constrained problems, motivated by applications in areas such as medical research and econometrics. Here, in the second part, we consider the problem of estimating an unknown planar convex set from noisy measurements of its support function. For a given direction, the support function of a convex set measures the distance between the origin and the supporting hyperplane that is perpendicular to that direction. Set recovery from support functions is used in areas such as computational tomography, tactical sensing in robotics, and projection magnetic resonance imaging [115].

For this problem, we construct a local smoothing estimator with an explicit data-driven choice of bandwidth parameter. The main contribution is to establish the interesting fact that, in every direction, this estimator adapts to the local geometry of the underlying set, and

it does so without any prior knowledge of the set itself. Using a decision-theoretic framework tailored to specific functions, first introduced by Cai and Low [29], we establish the optimality of our estimator in a strong pointwise sense. From these point estimators, we also construct a set estimator that is both adaptive to polytopes with a bounded number of extreme points and achieves the globally optimal minimax rate. Similarly to other shape-constrained problems, the results developed for this problem exhibit a form of adaptivity to local problem structure, with methods performing better for certain instances than suggested by a global minimax analysis. We make these points more concrete in a later chapter.

In this general area, many problems remain open. For example, there is only very limited theory on estimating multivariate functions under shape constraints. The absence of a natural order structure in R^d for d > 1 presents a significant obstacle to such a generalization. Moreover, relative to estimation, it is less clear how one can construct optimal and adaptive confidence intervals or regions (in the multi-dimensional case) in these scenarios.
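To make the measurement model concrete, the following sketch (not taken from the thesis; the square-shaped set and all numerical values are illustrative assumptions) simulates noisy observations of the support function of a planar convex polygon, which is exactly the kind of data the estimators of Chapter 4 take as input.

```python
import numpy as np

# A minimal sketch of the support-function measurement model of Chapter 4:
# noisy observations of the support function of a planar convex set K*.
# Here K* is taken to be a polygon with known vertices, purely for illustration.

def support_function(vertices, angles):
    """Support function h_K(theta) = max_{v in K} <v, (cos theta, sin theta)>.

    For a polygon, the maximum over K is attained at a vertex.
    """
    directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (m, 2)
    return (directions @ vertices.T).max(axis=1)                     # (m,)

rng = np.random.default_rng(0)
# Hypothetical example: K* is the unit square centered at the origin.
vertices = np.array([[0.5, 0.5], [-0.5, 0.5], [-0.5, -0.5], [0.5, -0.5]])

n, sigma = 200, 0.1
angles = np.sort(rng.uniform(0.0, 2 * np.pi, size=n))   # measurement directions
y = support_function(vertices, angles) + sigma * rng.standard_normal(n)

# y[i] is a noisy measurement of h_{K*}(angles[i]); the estimators of Chapter 4
# recover h_{K*} (and hence K*) from such pairs (angles[i], y[i]).
print(y[:5])
```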

1.3 Optimization and early-stopping

Many methods for statistical estimation and testing, including maximum likelihood and the generalized likelihood ratio test, are based on optimizing a suitable data-dependent objective function. It is well-understood that procedures for fitting non-parametric models must involve some form of regularization to prevent overfitting to the noisy data. The classical approach is to add a penalty term to the objective function, leading to the notion of a penalized estimator. An alternative approach is to apply an iterative optimization algorithm to the original objective, and then stop it after a pre-specified number of steps, thereby terminating it prior to convergence. To be more specific, suppose that, based on the observations, we construct an empirical loss function $\mathcal{L}_n(f)$. An optimization algorithm based on gradient steps $f^{t+1} = f^t - \alpha^t g^t$ is used to minimize this loss function. We want to specify the number of steps T such that $f^T$ is as close to the minimizer of the population loss as possible.

Relative to our rich and detailed understanding of regularization via penalization (e.g., [138, 63]), our understanding of early stopping regularization is not as well-developed. In particular, for penalized estimators, it is now well-understood that complexity measures such as the localized Gaussian width, or its Rademacher analogue, can be used to characterize their achievable rates. In this part, we show that such sharp characterizations can also be obtained for a broad class of boosting algorithms with early stopping, including L2-boost, LogitBoost, and AdaBoost, among others. This result, to the best of our knowledge, is the first to establish a precise connection between early stopping and regularized estimation in a general setting. Since boosting algorithms are used broadly in data analysis, understanding this connection provides direct guidance in many applications for obtaining more generalizable and stable statistical estimates.
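As a concrete illustration of this trade-off, the sketch below (an illustrative toy example, not the algorithm analyzed in Chapter 5; the step size, sample size and stopping horizon are arbitrary choices) runs kernel gradient descent on the empirical least-squares loss with the first-order Sobolev kernel and tracks the in-sample error of the iterates, whose typical U-shape is what makes the choice of the stopping time T act as a regularization parameter.

```python
import numpy as np

# A minimal sketch of regularization by early stopping: gradient descent on the
# empirical least-squares loss L_n(f), stopped after T steps.

rng = np.random.default_rng(1)
n = 100
x = np.sort(rng.uniform(0, 1, n))
f_star = np.sin(2 * np.pi * x)                 # true regression function values
y = f_star + 0.5 * rng.standard_normal(n)      # noisy responses

# Sobolev-one kernel K(x, x') = 1 + min(x, x'), as in Figure 5.1 of the thesis.
K = 1.0 + np.minimum.outer(x, x)

def boosted_iterates(K, y, step, n_steps):
    """Kernel gradient descent on L_n(f) = (1/2n)||y - f||^2; returns all iterates f^t."""
    f = np.zeros_like(y)
    iterates = [f.copy()]
    for _ in range(n_steps):
        grad = (f - y) / len(y)          # gradient of the empirical loss at f
        f = f - step * K @ grad          # gradient step in the RKHS (kernel boosting)
        iterates.append(f.copy())
    return np.array(iterates)

iterates = boosted_iterates(K, y, step=1.0, n_steps=500)
errors = ((iterates - f_star) ** 2).mean(axis=1)   # ||f^t - f*||_n^2 for each t

# Early stopping: the error typically decreases and then increases (overfitting),
# so a well-chosen T acts like a regularization parameter.
print("best t:", errors.argmin(), "error at best t:", errors.min(),
      "error at t = 500:", errors[-1])
```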

1.4 Thesis overview

We want to note that although the emphasis to date has been primarily methodological and theoretical, all of this work is motivated by applications arising from areas such as computational imaging, statistical signal processing, and treatment effects, which will be further pursued in the future.

The remainder of this thesis is organized as follows. We begin with basic statistical notation and terminology in Chapter 2, which introduces important criteria for evaluating both hypothesis testing and estimation procedures that will be used throughout the thesis. Chapter 3 is devoted to a hypothesis testing problem where the null and alternative are both specified by convex cones; it is based on my joint work with A. Guntuboyina and M. Wainwright [149]. In Chapter 4, we consider the problem of estimating a planar set based on noisy measurements of its support function. The estimators are constructed based on local smoothing, and we focus on their adaptive behavior when the underlying geometry varies. This part is based on joint work with T. Cai and A. Guntuboyina [28]. In Chapter 5, we explore a type of algorithmic regularization, where an optimal early stopping rule is proposed for boosting algorithms applied to reproducing kernel Hilbert spaces. The results of this chapter are based on joint work with F. Yang and M. Wainwright [150]. Finally, we close in Chapter 6 with a discussion of possible future directions and open problems, as a supplement to the discussions in each chapter. Proofs of the more technical lemmas are deferred to the appendices.

Chapter 2

Background

Understanding the fundamental limits of estimation and testing problems is worthwhile for multiple reasons. First, it provides insight into the hardness of these tasks, regardless of what procedures we are using. From a mathematical point of view, it often reveals some intrinsic properties of the problems themselves. On the other hand, exhibiting fundamental limits of performance also makes it possible to guarantee that an estimator or testing procedure is optimal, so that there are limited pay-offs in searching for another procedure with lower statistical error, although it might still be interesting to study other procedures with better performance in other metrics.

In this chapter, our first goal is to set up the basic minimax frameworks for both estimation and hypothesis testing, which serve as the standards for discussing the optimality of estimation and testing procedures in later chapters. Our second goal is to introduce the standard setting of non-parametric estimation, within which we discuss an important class of functions, namely reproducing kernel Hilbert spaces. It is worth noting that this chapter includes only some basic statistical notions and terminology; for more detailed descriptions, we refer the reader to the introductory material of the individual chapters.

2.1 Evaluating statistical procedures

Our first step here is to establish the minimax framework we use throughout the thesis. Depending on the problem at hand, we use either the minimax risk or the minimax testing radius to evaluate the optimality of our statistical procedures. Our treatment here is essentially standard, and further references can be found in, e.g., [153, 156, 135, 81, 82, 49, 132, 96]. Throughout, let P denote a class of distributions, and let θ denote a functional on the space P—a mapping from every distribution P to a parameter θ(P) taking values in some space Θ. In some scenarios, the underlying distribution P is uniquely determined by the quantity θ(P), namely, θ(P0) = θ(P1) if and only if P0 = P1. In these cases, θ provides a parameterization of the family of distributions, and we write P = {Pθ | θ ∈ Θ} for such classes.

2.1.1 Minimax estimation framework

Suppose now that we are given i.i.d. observations $X_i$ drawn from a distribution $\mathbb{P} \in \mathcal{P}$ for which $\theta(\mathbb{P}) = \theta^*$. From these observations $X_1^n \equiv \{X_i\}_{i=1}^n$, our goal is to estimate the unknown parameter $\theta^*$; an estimator $\widehat{\theta}$ for doing so is a measurable function $\widehat{\theta} : \mathcal{X}^n \to \Theta$. In order to evaluate the quality of any estimator, let $\rho : \Theta \times \Theta \to [0, \infty)$ be a semi-metric, and consider the quantity $\rho(\widehat{\theta}, \theta^*)$. Note that here $\theta^*$ is a fixed but unknown quantity, whereas $\widehat{\theta} \equiv \widehat{\theta}(X_1^n)$ is a random quantity. We therefore assess the quality of the estimator by taking expectations over the randomness in the $X_i$, which gives us

$$\mathbb{E}_{\mathbb{P}}\big[\rho\big(\widehat{\theta}(X_1, \ldots, X_n), \theta^*\big)\big]. \qquad (2.1)$$

As the parameter $\theta^*$ varies, this quantity changes accordingly, and it is referred to as the risk function associated with the parameter. Of course, for any $\theta^*$, we can always estimate it by ignoring the data completely and simply returning $\theta^*$. This estimator will have zero loss when evaluated at $\theta^*$, but it is likely to behave badly for other choices of the parameter. In order to deal with the risk in a more uniform sense, let us look at the minimax principle, first suggested by Wald [145]. For any estimator $\widehat{\theta}$, its behavior is evaluated in an

adversarial manner, meaning that we compute its worst-case behavior $\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\rho\big(\widehat{\theta}, \theta(\mathbb{P})\big)\big]$ and compare estimators according to this criterion. The optimal estimator in this sense defines the minimax risk—

$$\mathfrak{M}(\theta(\mathcal{P}), \rho) = \inf_{\widehat{\theta}} \sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\rho\big(\widehat{\theta}(X_1^n), \theta(\mathbb{P})\big)\big], \qquad (2.2)$$

where the infimum is taken over all possible estimators. Often, we are interested in evaluating the risk through some function of a norm: letting $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function with $\Phi(0) = 0$ (for example, $\Phi(t) = t^2$), a generalization of the $\rho$-minimax risk can be defined as

$$\mathfrak{M}(\theta(\mathcal{P}), \Phi \circ \rho) = \inf_{\widehat{\theta}} \sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}\big[\Phi\big(\rho\big(\widehat{\theta}(X_1^n), \theta(\mathbb{P})\big)\big)\big]. \qquad (2.3)$$

For instance, if $\rho(\theta, \theta') = \|\theta - \theta'\|_2$ and $\Phi(t) = t^2$, this corresponds to the minimax risk for the mean-squared error.
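The following sketch (purely illustrative; the Gaussian mean model, the shrinkage factor 0.5, and the parameter grid are assumptions made for this example, not part of the thesis) shows the minimax principle in action: two estimators are compared through Monte Carlo estimates of their worst-case risk, as in (2.2).

```python
import numpy as np

# Compare estimators by their worst-case (sup over theta) risk, as in (2.2)-(2.3).
# Model: X ~ N(theta, I_d) in R^d, squared-error loss; estimators are the MLE and
# a hypothetical shrinkage estimator 0.5 * X.

rng = np.random.default_rng(0)
d, n_mc = 10, 20_000

def risk(estimator, theta):
    """Monte Carlo estimate of E||estimator(X) - theta||_2^2 at a fixed theta."""
    X = theta + rng.standard_normal((n_mc, d))
    return ((estimator(X) - theta) ** 2).sum(axis=1).mean()

mle = lambda X: X                  # risk = d for every theta
shrink = lambda X: 0.5 * X         # low risk near 0, but risk grows with ||theta||^2

# Worst-case risk, approximated over a grid of parameter norms.
thetas = [np.full(d, r / np.sqrt(d)) for r in (0.0, 1.0, 2.0, 5.0, 10.0)]
for est, name in [(mle, "MLE"), (shrink, "0.5 * X")]:
    worst = max(risk(est, th) for th in thetas)
    print(f"{name}: worst-case risk over the grid = {worst:.2f}")

# The MLE has constant risk d; the shrinkage estimator beats it near theta = 0 but
# is much worse for large ||theta||, so over an unbounded parameter space its
# worst-case risk is unbounded -- the minimax comparison favors the MLE.
```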

2.1.2 Minimax testing framework

Suppose again that we are given an observation X from P. A goodness-of-fit testing problem is to decide whether the null hypothesis θ(P) ∈ Θ0 holds, or instead the alternative θ(P) ∈ Θ1 holds. Here both sets Θ0 and Θ1 are subsets of Θ. Usually the set Θ0 corresponds to some desirable properties of the object of study. When both Θ0 and Θ1 consist of only one point, we call the hypotheses simple; otherwise they are called composite. We want to construct a decision rule that takes the value 1 when the null hypothesis is rejected, and 0 when the null hypothesis is accepted. The decision rule ψ : X → {0, 1} is a measurable

function of an observation and is called a test. Two types of errors are considered in the hypothesis testing literature. A type I error is made if the null is rejected when it is true, and a type II error is made if the null is accepted when it does not hold. We refer the reader to Lehmann and Romano [94] for more details. For any test function ψ, the two types of error are clearly defined when the testing problem is simple; for a composite testing problem, however, we measure its performance in terms of its uniform error

$$\mathcal{E}(\psi; \Theta_0, \Theta_1, \epsilon) := \sup_{\theta \in \Theta_0} \mathbb{E}_\theta[\psi(y)] + \sup_{\theta \in \Theta_1 \setminus B_2(\Theta_0; \epsilon)} \mathbb{E}_\theta[1 - \psi(y)], \qquad (2.4)$$

which controls the worst-case error over both null and alternative. Here, for a given $\epsilon > 0$, we define the $\epsilon$-fattening of the set $\Theta_0$ as

$$B_2(\Theta_0; \epsilon) := \big\{\theta \in \mathbb{R}^d \mid \min_{u \in \Theta_0} \|\theta - u\|_2 \le \epsilon\big\}, \qquad (2.5)$$

corresponding to the set of vectors in Θ that are at most Euclidean distance $\epsilon$ from some element of Θ0. The reason for doing so is that our formulation of the testing problem allows for the possibility that θ lies in the set Θ1\Θ0, but is arbitrarily close to some element of Θ0. Under such a formulation, it is not possible to make any non-trivial assertions about the power of any test in a uniform sense. Accordingly, so as to be able to make quantitative statements about the performance of different tests, we exclude a certain $\epsilon$-ball from the alternative. This procedure leads to the notion of the minimax testing radius associated with this composite decision problem. This minimax formulation was introduced in the seminal work of Ingster and co-authors [81, 82]; since then, it has been studied by many authors (e.g., [49, 132, 96, 97, 7]). For a given error level ρ ∈ (0, 1), we are interested in the smallest setting of $\epsilon$ for which some test ψ has uniform error at most ρ. More precisely, we define

$$\epsilon_{\mathrm{OPT}}(\Theta_0, \Theta_1; \rho) := \inf\big\{\epsilon \mid \inf_\psi \mathcal{E}(\psi; \Theta_0, \Theta_1, \epsilon) \le \rho\big\}. \qquad (2.6)$$

When the sets (Θ0, Θ1) are clear from the context, we occasionally omit this dependence and write $\epsilon_{\mathrm{OPT}}(\rho)$ instead. We refer to this quantity as the minimax testing radius. By definition, the minimax testing radius $\epsilon_{\mathrm{OPT}}$ corresponds to the smallest separation $\epsilon$ at which there exists some test that distinguishes between the hypotheses H0 and H1 with uniform error at most ρ. Thus, it provides a fundamental characterization of the statistical difficulty of the hypothesis testing problem. Similar to the minimax estimation risk defined in (2.2), the minimax testing radius characterizes the best possible worst-case guarantee.

2.2 Non-parametric estimation

In this section, we move beyond the parametric setting, in which P is uniquely determined by a lower-dimensional functional θ(P). We instead consider the problem of non-parametric regression, in which the goal is to estimate a (possibly non-linear) function on the basis of noisy observations. Suppose we are given covariates x ∈ X along with a response variable y ∈ Y. Throughout this thesis, unless mentioned otherwise, we focus our attention on the case of real-valued response variables, where the space Y is the real line or some subset of it. Given a class of functions F, our goal is to find a function f : X → Y in F such that the error between y and f(x) is as small as possible. Consider a cost function φ : R × R → [0, ∞), where the non-negative scalar φ(y, θ) denotes the cost associated with predicting θ when the true response is y. Some common examples of loss functions φ that we consider in later sections include:

• the least-squares loss $\phi(y, \theta) := \frac{1}{2}(y - \theta)^2$,

• the logistic regression loss $\phi(y, \theta) = \ln(1 + e^{-y\theta})$, and

• the exponential loss φ(y, θ) = exp(−yθ).

In the fixed design version of regression, only the response is a random quantity, in which case it is reasonable to measure the quality of any f in terms of its error

$$\mathcal{L}(f) := \mathbb{E}_{Y_1^n}\Big[\frac{1}{n}\sum_{i=1}^n \phi\big(Y_i, f(x_i)\big)\Big]. \qquad (2.7)$$

Accordingly, we can define $\mathcal{L}(f)$ for the random design case, where the expectation is taken over both the responses and the covariates. Note that with the covariates $\{x_i\}_{i=1}^n$ fixed, the functional $\mathcal{L}$ is a non-random object. Within the function space F, the optimal function minimizes the population cost functional—that is,

$$f^* \in \arg\min_{f \in \mathcal{F}} \mathcal{L}(f). \qquad (2.8)$$

As a standard example, when we adopt the least-squares loss $\phi(y, \theta) = \frac{1}{2}(y - \theta)^2$, the population minimizer $f^*$ corresponds to the conditional expectation $x \mapsto \mathbb{E}[Y \mid x]$. Since we do not have access to the population distribution of the responses, however, the computation of $f^*$ is impossible. Given our samples $\{Y_i\}_{i=1}^n$, we consider instead some procedure applied to the empirical loss

$$\mathcal{L}_n(f) := \frac{1}{n}\sum_{i=1}^n \phi\big(Y_i, f(x_i)\big), \qquad (2.9)$$

where the population expectation has been replaced by an empirical expectation. For example, when $\mathcal{L}_n$ corresponds to the negative log-likelihood of the samples, with $\phi(Y_i, f(x_i)) = -\log \mathbb{P}(Y_i; f(x_i))$, direct unconstrained minimization of $\mathcal{L}_n$ would yield the maximum likelihood estimator.
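As a small worked illustration (an assumption-laden toy sketch, not code from the thesis), the snippet below evaluates the empirical loss $\mathcal{L}_n$ in (2.9) for the three cost functions listed above, with the candidate function represented simply by its values at the design points.

```python
import numpy as np

# Empirical loss L_n in (2.9) for the three cost functions listed above;
# f is represented by its values f(x_i) at the design points.

def least_squares(y, theta):
    return 0.5 * (y - theta) ** 2

def logistic(y, theta):          # labels y in {-1, +1}
    return np.log1p(np.exp(-y * theta))

def exponential(y, theta):       # labels y in {-1, +1}
    return np.exp(-y * theta)

def empirical_loss(phi, y, f_values):
    """L_n(f) = (1/n) * sum_i phi(Y_i, f(x_i))."""
    return np.mean(phi(y, f_values))

# Hypothetical toy data: binary labels and a candidate function's fitted values.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=8)
f_values = rng.standard_normal(8)

for phi in (least_squares, logistic, exponential):
    print(phi.__name__, empirical_loss(phi, y, f_values))
```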

2.2.1 Adaptive minimax risk

In this section, let us consider the case where the responses $\{y_i\}_{i=1}^n$ are generated through

$$y_i = f^*(x_i) + w_i \quad \text{for } i = 1, 2, \ldots, n, \qquad (2.10)$$

where $w_i$ is a zero-mean random variable characterizing the noise in the measurements. Based on these noisy responses, our goal is to find a function $f : \mathcal{X} \to \mathbb{R}$ in the function class $\mathcal{F}$ that is as close to $f^*$ as possible. For each estimator $\widehat{f}$, recall that its performance is measured by the loss function (2.7), where

$$\mathcal{L}(\widehat{f}, f^*) = \mathbb{E}_{Y_1^n}\Big[\frac{1}{n}\sum_{i=1}^n \phi\big(Y_i, \widehat{f}(x_i)\big)\Big].$$

Note that here the responses are generated from the model (2.10), so the loss is also a function of $f^*$. Of course, for each $f^* \in \mathcal{F}$, we can always estimate it by ignoring the data and simply returning $f^*$. This will give us zero loss at $f^*$ but possibly a huge loss for other choices of functions. So, analogous to Section 2.1.1, we compare estimators of $f^*$ by their worst-case behavior, namely

$$\mathfrak{R}(\mathcal{F}, \mathcal{F}_0, \phi) = \inf_{\widehat{f} \in \mathcal{F}} \sup_{f^* \in \mathcal{F}_0} \mathcal{L}(\widehat{f}, f^*). \qquad (2.11)$$

Here the infimum is taken over all possible estimators in the function class $\mathcal{F}$, and the supremum is taken over the space $\mathcal{F}_0$ in which $f^*$ lies. If there is no side knowledge about $f^*$, we may take $\mathcal{F}_0$ to be the class of all possible functions.

Note that in this classic minimax framework, estimators are compared via their worst-case behavior, as measured by performance over the entire problem class. When the risk function is nearly constant over the set, the global minimax risk is reflective of the typical behavior. If not, then one is motivated to seek more refined ways of characterizing the hardness of different problems, and the performance of different estimators. One way of doing so is by studying the notion of an adaptive estimator, meaning one whose performance automatically adapts to some (unknown) property of the underlying function being estimated. For instance, estimators using wavelet bases are known to be adaptive to an unknown degree of smoothness [44, 45]. Similarly, in the context of shape-constrained problems, there is a line of work showing that for functions with simpler structure, it is possible to achieve faster rates than the global minimax ones (e.g., [109, 158, 39]).

To discuss optimality in this adaptive or local sense, we review the local minimax framework, where the focus is on the performance at every function, instead of the maximum risk over a large parameter space as in conventional minimax theory. This framework, first introduced by Cai and Low [29, 30] for shape-constrained regression, provides a much more precise characterization of the performance of an estimator than conventional minimax theory does. For a given function $f \in \mathcal{F}_0$, we choose the other function, say $g$, to be the one that is most difficult to distinguish from $f$ in the $\phi$-loss. This benchmark is defined as

$$\mathcal{R}_n(f) = \sup_{g \in \mathcal{F}_0} \inf_{\widehat{f}} \max\big\{\mathcal{L}(\widehat{f}, f),\; \mathcal{L}(\widehat{f}, g)\big\}. \qquad (2.12)$$

Cai and Low [29] demonstrate that this is a useful benchmark in the context of estimating convex functions, where $\mathcal{F}_0$ denotes the class of convex functions. They established some interesting properties: for instance, $\mathcal{R}_n(f)$ varies considerably over the collection of convex functions, and outperforming the benchmark $\mathcal{R}_n(f)$ at some convex function $f$ leads to worse performance at other functions. We want to point out that, while this is a very useful benchmark for evaluating the optimality of adaptive estimators, there can be other reasonable definitions of the local minimax framework that are suitable in other contexts.

2.2.2 Reproducing kernel Hilbert spaces

In this section, we provide some background on a particular class of functions that will be used in later chapters—a class of function-based Hilbert spaces that are defined by reproducing kernels. These function spaces have many attractive properties from both the computational and statistical points of view. A reproducing kernel Hilbert space $\mathcal{H}$ (RKHS for short; see the standard sources [143, 73, 128, 17]) consists of functions mapping a domain $\mathcal{X}$ to the real line $\mathbb{R}$. Any RKHS is defined by a bivariate symmetric kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is required to be positive semidefinite, i.e., for any integer $N \ge 1$ and any collection of points $\{x_j\}_{j=1}^N$ in $\mathcal{X}$, the matrix $[K(x_i, x_j)]_{ij} \in \mathbb{R}^{N \times N}$ is positive semidefinite. The associated RKHS is the closure of the linear span of functions of the form $f(\cdot) = \sum_{j \ge 1} \omega_j K(\cdot, x_j)$, where $\{x_j\}_{j=1}^\infty$ is some collection of points in $\mathcal{X}$ and $\{\omega_j\}_{j=1}^\infty$ is a real-valued sequence. We can also define the inner product of two functions in the space. For two functions $f_1, f_2 \in \mathcal{H}$ which can be expressed as finite sums $f_1(\cdot) = \sum_{i=1}^{\ell_1} \alpha_i K(\cdot, x_i)$ and $f_2(\cdot) = \sum_{j=1}^{\ell_2} \beta_j K(\cdot, x_j)$, the inner product is defined as

$$\langle f_1, f_2 \rangle_{\mathcal{H}} = \sum_{i=1}^{\ell_1} \sum_{j=1}^{\ell_2} \alpha_i \beta_j K(x_i, x_j),$$

with induced norm $\|f_1\|_{\mathcal{H}}^2 = \langle f_1, f_1\rangle_{\mathcal{H}} = \sum_{i=1}^{\ell_1} \sum_{j=1}^{\ell_1} \alpha_i \alpha_j K(x_i, x_j)$. For each $x \in \mathcal{X}$, the function $K(\cdot, x)$ belongs to $\mathcal{H}$ and satisfies the reproducing relation

$$\langle f, K(\cdot, x)\rangle_{\mathcal{H}} = f(x) \quad \text{for all } f \in \mathcal{H}. \qquad (2.13)$$

This property is known as the kernel reproducing property of the Hilbert space, and it underlies much of the power of RKHS methods in practice. Moreover, when the covariates $X_i$ are drawn i.i.d. from a distribution $\mathbb{P}_X$ with compact domain $\mathcal{X}$, we can invoke Mercer's theorem, which states that the kernel can be represented as

$$K(x, x') = \sum_{k=1}^{\infty} \mu_k \phi_k(x) \phi_k(x'), \qquad (2.14)$$

where $\mu_1 \ge \mu_2 \ge \cdots \ge 0$ are the eigenvalues of the kernel function $K$, and $\{\phi_k\}_{k=1}^\infty$ are eigenfunctions of $K$ which form an orthonormal basis of $L^2(\mathcal{X}, \mathbb{P}_X)$ with the inner product $\langle f, g\rangle := \int_{\mathcal{X}} f(x) g(x)\, d\mathbb{P}_X(x)$. We refer the reader to the standard sources [143, 73, 128, 17] for more details on RKHSs and their properties.
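A short computational sketch (illustrative only; the Sobolev-one kernel, the sample size, and the random coefficients are assumptions made for this example) ties these definitions together: it forms a kernel matrix, checks positive semidefiniteness, evaluates the RKHS inner product of two finite kernel expansions, and approximates the Mercer eigenvalues in (2.14) by eigenvalues of the scaled kernel matrix.

```python
import numpy as np

# First-order Sobolev kernel K(x, x') = 1 + min(x, x') on [0, 1].

rng = np.random.default_rng(0)
N = 200
x = np.sort(rng.uniform(0, 1, N))
K = 1.0 + np.minimum.outer(x, x)                 # [K(x_i, x_j)]_{ij}

eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue (PSD check):", eigvals.min())          # >= 0 up to round-off

# RKHS inner product of f1 = sum_i alpha_i K(., x_i) and f2 = sum_j beta_j K(., x_j):
# <f1, f2>_H = sum_{i,j} alpha_i beta_j K(x_i, x_j) = alpha^T K beta.
alpha = rng.standard_normal(N) / N
beta = rng.standard_normal(N) / N
inner = alpha @ K @ beta
norm_sq = alpha @ K @ alpha                      # ||f1||_H^2
print("<f1, f2>_H =", inner, " ||f1||_H^2 =", norm_sq)

# Empirical approximation of the Mercer eigenvalues mu_k in (2.14): eigenvalues of
# K / N converge to those of the integral operator (w.r.t. P_X) as N grows.
print("top empirical Mercer eigenvalues:", np.sort(eigvals)[::-1][:5] / N)
```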

Part II

Statistical inference and estimation

Chapter 3

Hypothesis testing over convex cones

3.1 Introduction

Composite testing problems arise in a wide variety of applications, and the generalized likelihood ratio test (GLRT) is a general-purpose approach to such problems. The basic idea of the likelihood ratio test dates back to the early work of Fisher, Neyman and Pearson; it attracted further attention following the work of Edwards [48], who emphasized likelihood as a general principle of inference. Recent years have witnessed a great amount of work on the GLRT in various contexts, including the papers [94, 112, 93, 51, 50]. However, despite the wide-spread use of the GLRT, its optimality properties have yet to be fully understood. For suitably regular problems, there is a great deal of asymptotic theory on the GLRT, in particular when its distribution under the null is independent of nuisance parameters (e.g., [9, 120, 117]). On the other hand, there are some isolated cases in which the GLRT can be shown to be dominated by other tests (e.g., [147, 107, 106, 93]).

In this chapter, we undertake an in-depth study of the GLRT in application to a particular class of composite testing problems of a geometric flavor. In this class of testing problems, the null and alternative hypotheses are specified by a pair of closed convex cones C1 and C2, taken to be nested as C1 ⊂ C2. Suppose that we are given an observation of the form y = θ + w, where w is a zero-mean Gaussian noise vector. Based on observing y, our goal is to test whether a given parameter θ belongs to the smaller cone C1—corresponding to the null hypothesis—or belongs to the larger cone C2. Cone testing problems of this type arise in many different settings, and there is a fairly substantial literature on the behavior of the GLRT in application to such problems (e.g., see the papers and books [18, 89, 118, 117, 119, 122, 110, 107, 108, 47, 130, 147], as well as references therein).

3.1.1 Some motivating examples

Before proceeding, let us consider some concrete examples so as to motivate our study.

Example 1 (Testing non-negativity and monotonicity in treatment effects). Suppose that we have a collection of d treatments, say different drugs for a particular medical condition. Letting θj ∈ R denote the mean of treatment j, one null hypothesis could be that none of the treatments has any effect—that is, θj = 0 for all j = 1, . . . , d. Assuming that none of the treatments are directly harmful, a reasonable alternative would be that θ belongs to the non-negative orthant cone

$$K_+ := \big\{\theta \in \mathbb{R}^d \mid \theta_j \ge 0 \text{ for all } j = 1, \ldots, d\big\}. \qquad (3.1)$$

This set-up leads to a particular instance of our general problem with C1 = {0} and C2 = K+. Such orthant testing problems have been studied by Kudo [89] and Raubertas et al. [117], among others.

In other applications, our treatments might consist of an ordered set of dosages of the same drug. In this case, we might have reason to believe that if the drug has any effect, then the treatment means obey a monotonicity constraint—that is, higher dosages lead to greater treatment effects. One would then want to detect the presence or absence of such a dose-response effect. Monotonicity constraints also arise in various types of econometric models, in which the effects of strategic interventions should be monotone with respect to parameters such as market size (e.g., [42]). For applications of this flavor, a reasonable alternative would be specified by the monotone cone

$$M := \big\{\theta \in \mathbb{R}^d \mid \theta_1 \le \theta_2 \le \cdots \le \theta_d\big\}. \qquad (3.2)$$

This set-up leads to another instance of our general problem with C1 = {0} and C2 = M. The behavior of the GLRT for this particular testing problem has also been studied in past work, including papers by Barlow et al. [9] and Raubertas et al. [117].

As a third instance of the treatment effects problem, we might like to include in our null hypothesis the possibility that the treatments have some (potentially) non-zero effect, but one that remains constant across levels—i.e., θ1 = θ2 = ··· = θd. In this case, our null hypothesis is specified by the ray cone

$$R := \big\{\theta \in \mathbb{R}^d \mid \theta = c\mathbf{1} \text{ for some } c \in \mathbb{R}\big\}. \qquad (3.3)$$

Supposing that we are interested in testing the alternative that the treatments lead to a monotone effect, we arrive at another instance of our general set-up with C1 = R and C2 = M. This testing problem has also been studied by Bartholomew [10, 11] and Robertson et al. [121], among other researchers.

In the preceding three examples, the cone C1 was a linear subspace. Let us now consider two more examples, adapted from Menendez et al. [108], in which C1 is not a subspace. As before, suppose that component θi of the vector θ ∈ R^d denotes the expected response of treatment i. In many applications, it is of interest to test equality of the expected responses over a subset S of the full treatment set [d] = {1, . . . , d}. More precisely, for a given subset S containing the index 1, let us consider the problem of testing the null hypothesis

$$C_1 \equiv E(S) := \big\{\theta \in \mathbb{R}^d \mid \theta_i = \theta_1 \;\;\forall\, i \in S, \text{ and } \theta_j \ge \theta_1 \;\;\forall\, j \notin S\big\} \qquad (3.4)$$

versus the alternative $C_2 \equiv G(S) = \{\theta \in \mathbb{R}^d \mid \theta_j \ge \theta_1 \;\forall\, j \in [d]\}$. Note that C1 here is not a linear subspace.

As a final example, suppose that we have a factorial design consisting of two treatments, each of which can be applied at two different dosages (high and low). Let (θ1, θ2) denote the expected responses of the first treatment at the low and high dosages, respectively, with the pair (θ3, θ4) defined similarly for the second treatment. Suppose that we are interested in testing whether the first treatment at the lowest level is more effective than the second treatment at the highest level. This problem can be formulated as testing the null cone

$$C_1 := \{\theta \in \mathbb{R}^4 \mid \theta_1 \le \theta_2 \le \theta_3 \le \theta_4\} \quad \text{versus the alternative} \quad C_2 := \{\theta \in \mathbb{R}^4 \mid \theta_1 \le \theta_2 \text{ and } \theta_3 \le \theta_4\}. \qquad (3.5)$$

As before, the null cone C1 is not a linear subspace.

Example 2 (Robust matched filtering in signal processing). In radar detection problems [126], a standard goal is to detect the presence of a known signal of unknown amplitude in the presence of noise. After a matched filtering step, this problem can be reduced to a vector testing problem, where the known signal direction is defined by a vector γ ∈ R^d, whereas the unknown amplitude corresponds to a scalar pre-factor c ≥ 0. We thus arrive at a ray cone testing problem: the null hypothesis (corresponding to the absence of signal) is given by C1 = {0}, whereas the alternative is given by the positive ray cone $R_+ = \{\theta \in \mathbb{R}^d \mid \theta = c\gamma \text{ for some } c \ge 0\}$.

In many cases, there may be uncertainty about the target signal, or jamming by adversaries who introduce additional signals that can potentially be confused with the target signal γ. Signal uncertainties of this type are often modeled by various forms of cones, with the most classical choice being a subspace cone [126]. In more recent work (e.g., [18, 66]), signal uncertainty has been modeled using the circular cone defined by the target signal direction, namely

$$C(\gamma; \alpha) := \big\{\theta \in \mathbb{R}^d \mid \langle \gamma, \theta\rangle \ge \cos(\alpha)\, \|\gamma\|_2 \|\theta\|_2\big\}, \qquad (3.6)$$

corresponding to the set of all vectors θ whose angle with the target signal is at most α. Thus, we are led to another instance of a cone testing problem involving a circular cone.

y = Xβ + σZ, where Z ∼ N(0,In), (3.7)

where X ∈ R^{n×p} is a fixed and known design matrix. In many applications, we are interested in testing certain properties of the unknown regression vector β, and these can often be encoded in terms of cone constraints on the vector θ := Xβ. As a very simple example, the problem of testing whether or not β = 0 corresponds to testing whether θ ∈ C1 := {0} versus the alternative that θ ∈ C2 := range(X). Thus, we arrive at a subspace testing problem. We note that this problem is known as testing the global null in the linear regression literature (e.g., [24]). If instead we consider the case when the p-dimensional vector β lies in the non-negative orthant cone (3.1), then our alternative for the n-dimensional vector θ becomes the polyhedral cone

$$P := \big\{\theta \in \mathbb{R}^n \mid \theta = X\beta \text{ for some } \beta \ge 0\big\}. \qquad (3.8)$$

The corresponding estimation problem with non-negative constraints on the coefficient vector β has been studied by Slawski et al. [131] and Meinshausen [104]; see also Chen et al. [40] for a survey of this line of work. In addition to the preceding two cases, we can also test various other types of cone alternatives for β, and these are transformed via the design matrix X into other types of cones for the parameter θ ∈ R^n.

Example 4 (Testing shape-constrained departures from parametric models). This example is non-parametric in flavor. Consider the class of functions f that can be decomposed as

$$f = \sum_{j=1}^k a_j \phi_j + \psi. \qquad (3.9)$$

Here the known functions $\{\phi_j\}_{j=1}^k$ define a linear space, parameterized by the coefficient vector a ∈ R^k, whereas the unknown function ψ models a structured departure from this linear parametric class. For instance, we might assume that ψ belongs to the class of monotone functions, or to the class of convex functions. Given a fixed collection of design points $\{t_i\}_{i=1}^n$, suppose that we make observations of the form $y_i = f(t_i) + \sigma g_i$ for i = 1, . . . , n, where each $g_i$ is a standard normal variable. Defining the shorthand notation $\theta := (f(t_1), \ldots, f(t_n))$ and $g = (g_1, \ldots, g_n)$, our observations can be expressed in the standard form y = θ + σg. If, under the null hypothesis, the function f satisfies the decomposition (3.9) with ψ = 0, then the vector θ must belong to the subspace $\{\Phi a \mid a \in \mathbb{R}^k\}$, where the matrix $\Phi \in \mathbb{R}^{n \times k}$ has entries $\Phi_{ij} = \phi_j(t_i)$. Now suppose that the alternative is that f satisfies the decomposition (3.9) with some ψ that is convex. A convexity constraint on ψ implies that we can write θ = Φa + γ, for some coefficients a ∈ R^k and a vector γ ∈ R^n belonging to the convex cone

$$V(\{t_i\}_{i=1}^n) := \Big\{\gamma \in \mathbb{R}^n \;\Big|\; \frac{\gamma_2 - \gamma_1}{t_2 - t_1} \le \frac{\gamma_3 - \gamma_2}{t_3 - t_2} \le \cdots \le \frac{\gamma_n - \gamma_{n-1}}{t_n - t_{n-1}}\Big\}. \qquad (3.10)$$

This particular cone testing problem, and other forms of shape constraints, have been studied by Meyer [110], as well as by Sen and Meyer [129].
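The cone (3.10) is easy to work with computationally: as a small illustration (assumptions: evenly spaced design points and a toy choice of γ, none of which come from the thesis), the sketch below builds the second-order divided-difference matrix D for which membership in $V(\{t_i\}_{i=1}^n)$ is exactly the linear constraint Dγ ≥ 0.

```python
import numpy as np

# Membership in the convexity cone V({t_i}) of (3.10) as a linear constraint D @ gamma >= 0.

def divided_difference_matrix(t):
    """Row i encodes (gamma_{i+2}-gamma_{i+1})/(t_{i+2}-t_{i+1}) - (gamma_{i+1}-gamma_i)/(t_{i+1}-t_i)."""
    n = len(t)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        d1, d2 = t[i + 1] - t[i], t[i + 2] - t[i + 1]
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0 / d1, -1.0 / d1 - 1.0 / d2, 1.0 / d2
    return D

t = np.linspace(0.0, 1.0, 8)
D = divided_difference_matrix(t)

gamma_convex = t ** 2            # values of a convex function lie in the cone
gamma_concave = -(t ** 2)        # values of a concave function do not
print(np.all(D @ gamma_convex >= -1e-12))   # True
print(np.all(D @ gamma_concave >= -1e-12))  # False
```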

3.1.2 Problem formulation

Having understood the range of motivations for our problem, let us now set up the problem more precisely. Suppose that we are given observations of the form y = θ + σg, where

θ ∈ R^d is a fixed but unknown vector, g ∼ N(0, I_d) is a d-dimensional vector of i.i.d. Gaussian entries, and σ² is a known noise level. Our goal is to distinguish the null hypothesis that θ ∈ C1 from the alternative that θ ∈ C2\C1, where C1 ⊂ C2 are a nested pair of closed, convex cones in R^d. In this chapter, we study both the fundamental limits of solving this composite testing problem, as well as the performance of a specific procedure, namely the generalized likelihood ratio test, or GLRT for short.

By definition, the GLRT for the problem of distinguishing between cones C1 and C2 is based on the statistic

$$T(y) := -2 \log \frac{\sup_{\theta \in C_1} \mathbb{P}_\theta(y)}{\sup_{\theta \in C_2} \mathbb{P}_\theta(y)}. \qquad (3.11a)$$

It defines a family of tests, parameterized by a threshold parameter β ∈ [0, ∞), of the form

$$\phi_\beta(y) := \mathbb{I}\big(T(y) \ge \beta\big) = \begin{cases} 1 & \text{if } T(y) \ge \beta, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.11b)$$

Recall that in Section 2.1.2 we set up the minimax testing framework. In order to be able to make quantitative statements about the performance of different tests, we exclude a certain $\epsilon$-ball from the alternative. We consider the testing problem of distinguishing between the two hypotheses

$$H_0 : \theta \in C_1 \quad \text{and} \quad H_1 : \theta \in C_2 \setminus B_2(C_1; \epsilon), \qquad (3.12)$$

where

$$B_2(C_1; \epsilon) := \big\{\theta \in \mathbb{R}^d \mid \min_{u \in C_1} \|\theta - u\|_2 \le \epsilon\big\} \qquad (3.13)$$

is the $\epsilon$-fattening of the cone C1. To be clear, the parameter $\epsilon > 0$ is a quantity that is used during the course of our analysis in order to titrate the difficulty of the testing problem. None of the tests that we consider, including the GLRT, is given knowledge of $\epsilon$. Let us introduce the shorthand $\mathcal{T}(C_1, C_2; \epsilon)$ to denote the testing problem (3.12). Obviously, the testing problem (3.12) becomes more difficult as $\epsilon$ approaches zero, and so it is natural to study this increase in difficulty in quantitative terms. Recall that for any (measurable) test function ψ : R^d → {0, 1}, we measure its performance in terms of its uniform error

$$\mathcal{E}(\psi; C_1, C_2, \epsilon) := \sup_{\theta \in C_1} \mathbb{E}_\theta[\psi(y)] + \sup_{\theta \in C_2 \setminus B_2(C_1; \epsilon)} \mathbb{E}_\theta[1 - \psi(y)], \qquad (3.14)$$

which controls the worst-case error over both null and alternative. For a given error level ρ ∈ (0, 1), we are interested in the smallest setting of $\epsilon$ for which either the GLRT, or some other test ψ, has uniform error at most ρ. More precisely, we define

$$\epsilon_{\mathrm{OPT}}(C_1, C_2; \rho) := \inf\big\{\epsilon \mid \inf_{\psi} \mathcal{E}(\psi; C_1, C_2, \epsilon) \le \rho\big\}, \quad \text{and} \qquad (3.15a)$$

$$\epsilon_{\mathrm{GLR}}(C_1, C_2; \rho) := \inf\big\{\epsilon \mid \inf_{\beta \in \mathbb{R}} \mathcal{E}(\phi_\beta; C_1, C_2, \epsilon) \le \rho\big\}. \qquad (3.15b)$$

When the cone pair (C1, C2) is clear from the context, we occasionally omit this dependence and write $\epsilon_{\mathrm{OPT}}(\rho)$ and $\epsilon_{\mathrm{GLR}}(\rho)$ instead. We refer to these two quantities as the minimax testing radius and the GLRT testing radius, respectively.

By definition, the minimax testing radius $\epsilon_{\mathrm{OPT}}$ corresponds to the smallest separation $\epsilon$ at which there exists some test that distinguishes between the hypotheses H0 and H1 in equation (3.12) with uniform error at most ρ. Thus, it provides a fundamental characterization of the statistical difficulty of the hypothesis testing problem. On the other hand, the GLRT testing radius $\epsilon_{\mathrm{GLR}}(\rho)$ provides us with the smallest radius $\epsilon$ for which there exists some threshold—say β*—for which the associated generalized likelihood ratio test $\phi_{\beta^*}$ distinguishes between the hypotheses with error at most ρ. Thus, it characterizes the performance limits of the GLRT when an optimal threshold β* is chosen. Of course, by definition, we always have $\epsilon_{\mathrm{OPT}}(\rho) \le \epsilon_{\mathrm{GLR}}(\rho)$. We write $\epsilon_{\mathrm{OPT}}(\rho) \asymp \epsilon_{\mathrm{GLR}}(\rho)$ to mean that—in addition to the previous upper bound—there is also a lower bound $\epsilon_{\mathrm{OPT}}(\rho) \ge c_\rho\, \epsilon_{\mathrm{GLR}}(\rho)$ that matches up to a constant $c_\rho > 0$ depending only on ρ.
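The following sketch (an illustrative toy experiment, not from the thesis; the orthant example, the single-spike alternative, and the error level 0.05 are assumptions made here) estimates the two ingredients of (3.14) by Monte Carlo for the orthant testing problem C1 = {0}, C2 = K_+, using the projection form of the GLRT statistic derived later in (3.23).

```python
import numpy as np

# GLRT for C1 = {0} versus C2 = K_+ (non-negative orthant): by (3.23) the statistic
# reduces to T(y) = ||Pi_{K_+}(y)||^2 = ||max(y, 0)||^2. We estimate its type I and
# type II errors at a given separation eps, taking the alternative to be a single
# spike of size eps (a heuristic choice of a hard alternative).

rng = np.random.default_rng(0)
d, sigma, n_mc = 100, 1.0, 5000

def glrt_stat(y):
    return (np.maximum(y, 0.0) ** 2).sum()      # ||Pi_{C2}(y)||^2 - ||Pi_{C1}(y)||^2 with C1 = {0}

# Threshold beta chosen so the type I error (at theta = 0) is about rho / 2 = 0.025.
null_stats = np.array([glrt_stat(sigma * rng.standard_normal(d)) for _ in range(n_mc)])
beta = np.quantile(null_stats, 0.975)

def type_two_error(eps):
    theta = np.zeros(d)
    theta[0] = eps                               # alternative at distance eps from C1
    stats = np.array([glrt_stat(theta + sigma * rng.standard_normal(d)) for _ in range(n_mc)])
    return (stats < beta).mean()

for eps in (1.0, 3.0, 5.0, 10.0):
    print(f"eps = {eps:5.1f}: estimated type II error = {type_two_error(eps):.3f}")
# The error decays as eps grows, illustrating how the testing radius epsilon_GLR marks
# the separation at which the total error drops below a target level rho.
```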

3.1.3 Overview of our results Having set up the problem, let us now provide a high-level overview of the main results of this chapter.

1. Our first main result, stated as Theorem 3.3.1 in Section 3.3.1, gives a sharp characterization—meaning upper and lower bounds that match up to universal constants—of the GLRT testing radius $\epsilon_{\mathrm{GLR}}$ for cone pairs (C1, C2) that are non-oblique (we discuss the non-obliqueness property and its significance at length in Section 3.2.2). We illustrate the consequences of this theorem for a number of concrete cones, including the subspace cone, orthant cone, monotone cone, circular cone, and a Cartesian product cone.

2. In our second main result, stated as Theorem 3.3.2 in Section 3.3.2, we derive a lower bound that applies to any test function. It leads to a corollary that provides sufficient conditions for the GLRT to be an optimal test, and we use it to establish optimality for the subspace cone and circular cone, among other examples. We then revisit the Cartesian product cone, first analyzed in the context of Theorem 3.3.1, and use Theorem 3.3.2 to show that the GLRT is sub-optimal for this particular cone, even though it is in no sense a pathological example.

3. For the monotone and orthant cones, we find that the lower bound established in Theorem 3.3.2 is not sharp, but that the GLRT turns out to be an optimal test. Thus, Section 3.3.3 is devoted to a detailed analysis of these two cases, in particular using a more refined argument to obtain sharp lower bounds.

The remainder of this chapter is organized as follows: Section 3.2 provides background on conic geometry, including conic projections, the Moreau decomposition, and the notion of Gaussian width. It also introduces the notion of a non-oblique pair of cones, which has been studied in the context of the GLRT. In Section 3.3, we state our main results and illustrate their consequences via a series of examples. Sections 3.3.1 and 3.3.2 are devoted, respectively, to our sharp characterization of the GLRT and to a general lower bound on the minimax testing radius. Section 3.3.3 explores the monotone and orthant cones in more detail. In Section 3.5, we provide the proofs of our main results, with certain more technical aspects deferred to the appendix sections.

Notation: Here we summarize some notation used throughout the remainder of this chapter. For functions f(σ, d) and g(σ, d), we write $f(\sigma, d) \lesssim g(\sigma, d)$ to indicate that $f(\sigma, d) \le c\, g(\sigma, d)$ for some constant $c \in (0, \infty)$ that may depend on ρ but is independent of (σ, d), and similarly for $f(\sigma, d) \gtrsim g(\sigma, d)$. We write $f(\sigma, d) \asymp g(\sigma, d)$ if both $f(\sigma, d) \lesssim g(\sigma, d)$ and $f(\sigma, d) \gtrsim g(\sigma, d)$ are satisfied.

3.2 Background on conic geometry and the GLRT

In this section, we provide some necessary background on cones and their geometry, including the notion of a polar cone and the Moreau decomposition. We also define the notion of a non-oblique pair of cones, and summarize some known results about properties of the GLRT for such cone testing problems.

3.2.1 Convex cones and Gaussian widths

For a given closed convex cone $C \subset \mathbb{R}^d$, we define the Euclidean projection operator $\Pi_C : \mathbb{R}^d \to C$ via

$$\Pi_C(v) := \arg\min_{u \in C} \|v - u\|_2. \qquad (3.16)$$

By standard properties of projection onto closed convex sets, we are guaranteed that this mapping is well-defined. We also define the polar cone

∗  d C : = v ∈ R | hv, ui ≤ 0 for all u ∈ C . (3.17)

Figure 3.1(b) provides an illustration of a cone in comparison to its polar cone. Using Π_{C^*} to denote the projection operator onto this cone, Moreau's theorem [111] ensures that every vector v ∈ R^d can be decomposed as

$$v = \Pi_C(v) + \Pi_{C^*}(v), \quad \text{such that} \quad \langle \Pi_C(v), \Pi_{C^*}(v) \rangle = 0. \tag{3.18}$$

We make frequent use of this decomposition in our analysis. Let S^{d-1} := {u ∈ R^d | ‖u‖_2 = 1} denote the Euclidean sphere of unit radius. For every set A ⊆ S^{d-1}, we define its Gaussian width as

$$\mathbb{W}(A) := \mathbb{E}\Big[\sup_{u \in A} \langle u, g \rangle\Big] \quad \text{where } g \sim N(0, I_d). \tag{3.19}$$

This quantity provides a measure of the size of the set A; indeed, it can be related to the volume of A viewed as a subset of the Euclidean sphere. The notion of Gaussian width arises in many different areas, notably in early work on probabilistic methods in Banach spaces [113]; the Gaussian complexity, along with its close relative the Rademacher complexity, plays a central role in empirical process theory [137, 87, 14]. Of interest in this work are the Gaussian widths of sets of the form A = C ∩ S^{d-1}, where C is a closed convex cone. For a set of this form, using the Moreau decomposition (3.18), we have the useful equivalence

$$\mathbb{W}(C \cap S^{d-1}) = \mathbb{E}\Big[\sup_{u \in C \cap S^{d-1}} \langle u, \Pi_C(g) + \Pi_{C^*}(g) \rangle\Big] = \mathbb{E}\|\Pi_C(g)\|_2, \tag{3.20}$$

where the final equality uses the fact that ⟨u, Π_{C^*}(g)⟩ ≤ 0 for all vectors u ∈ C, with equality holding when u is a non-negative scalar multiple of Π_C(g). For future reference, let us derive a lower bound on E‖Π_C g‖_2 that holds for every cone C strictly larger than {0}. Take some non-zero vector u ∈ C and let R_+ = {cu | c ≥ 0} be the ray that it defines. Since R_+ ⊆ C, we have ‖Π_C g‖_2 ≥ ‖Π_{R_+} g‖_2. But since R_+ is just a ray, the projection Π_{R_+}(g) is a standard normal variable truncated to be positive, and hence

$$\mathbb{E}\|\Pi_C g\|_2 \ge \mathbb{E}\|\Pi_{R_+} g\|_2 = \sqrt{\frac{1}{2\pi}}. \tag{3.21}$$

This lower bound is useful in parts of our development.
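To make the Gaussian width concrete, here is a minimal Monte Carlo sketch (added for illustration; not part of the original development) that approximates E‖Π_C g‖_2 for the non-negative orthant, a cone whose Euclidean projection is simply coordinatewise clipping at zero. The function name and sample sizes are arbitrary choices.

```python
import numpy as np

def orthant_gaussian_width(d, n_samples=20000, seed=0):
    """Monte Carlo estimate of E||Pi_C(g)||_2 for C = nonnegative orthant,
    whose Euclidean projection is coordinatewise clipping at zero."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, d))
    proj = np.clip(g, 0.0, None)              # Pi_C(g)
    return np.linalg.norm(proj, axis=1).mean()

for d in (10, 100, 1000):
    # the estimate tracks sqrt(E||Pi_C g||_2^2) = sqrt(d/2) and always exceeds
    # the universal lower bound sqrt(1/(2*pi)) from (3.21)
    print(d, orthant_gaussian_width(d), np.sqrt(d / 2))
```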

3.2.2 Cone-based GLRTs and non-oblique pairs

In this section, we provide some background on the notion of non-oblique pairs of cones, and their significance for the GLRT. First, let us exploit some properties of closed convex cones in order to derive a simpler expression for the GLRT test statistic (3.11a). Using the form of the multivariate Gaussian density, we have

$$T(y) = \min_{\theta \in C_1} \|y - \theta\|_2^2 - \min_{\theta \in C_2} \|y - \theta\|_2^2 = \|y - \Pi_{C_1}(y)\|_2^2 - \|y - \Pi_{C_2}(y)\|_2^2 \tag{3.22}$$
$$\phantom{T(y)} = \|\Pi_{C_2}(y)\|_2^2 - \|\Pi_{C_1}(y)\|_2^2, \tag{3.23}$$

where we have made use of the Moreau decomposition to assert that

$$\|y - \Pi_{C_1}(y)\|_2^2 = \|y\|_2^2 - \|\Pi_{C_1}(y)\|_2^2, \quad \text{and} \quad \|y - \Pi_{C_2}(y)\|_2^2 = \|y\|_2^2 - \|\Pi_{C_2}(y)\|_2^2.$$

Thus, we see that a cone-based GLRT has a natural interpretation: it compares the squared amplitude of the projection of y onto the two different cones. When C_1 = {0}, it can be shown that under the null hypothesis (i.e., y ~ N(0, σ² I_d)), the statistic T(y) (after rescaling by σ²) is a mixture of χ²-distributions (see e.g., [117]). On the other hand, for a general cone pair (C_1, C_2), it is not straightforward to characterize the distribution of T(y) under the null hypothesis. Thus, past work has studied conditions on the cone pair under which the null distribution has a simple characterization. One such condition is a certain non-obliqueness property that is common to much past work on the GLRT (e.g., [147, 107, 108, 80]). The non-obliqueness condition, first introduced by Warrack et al. [147], is also motivated by the fact that there are many instances of oblique cone pairs for which the GLRT is known to be dominated by other tests. Menendez et al. [106] provide an explanation for this dominance in a very general context; see also the papers [108, 80] for further studies of non-oblique cone pairs. A nested pair of closed convex cones C_1 ⊂ C_2 is said to be non-oblique if we have the successive projection property

$$\Pi_{C_1}(x) = \Pi_{C_1}\big(\Pi_{C_2}(x)\big) \quad \text{for all } x \in \mathbb{R}^d. \tag{3.24}$$

For instance, this condition holds whenever one of the two cones is a subspace, or more generally, whenever there is a subspace L such that C_1 ⊆ L ⊆ C_2; see Hu and Wright [80] for details of this latter property. To be clear, these conditions are sufficient, but not necessary, for non-obliqueness to hold. There are many non-oblique cone pairs in which neither cone is a subspace; the cone pairs (3.4) and (3.5), as discussed in Example 1 on treatment testing, are two such examples. (We refer the reader to Section 5 of the paper [108] for verification of these properties.) More generally, there are various non-oblique cone pairs that do not sandwich a subspace L. The significance of the non-obliqueness condition lies in the following decomposition result. For any nested pair of closed convex cones C_1 ⊂ C_2 that are non-oblique, for all x ∈ R^d we have

$$\Pi_{C_2}(x) = \Pi_{C_1}(x) + \Pi_{C_2 \cap C_1^*}(x) \quad \text{and} \quad \langle \Pi_{C_1}(x), \Pi_{C_2 \cap C_1^*}(x) \rangle = 0. \tag{3.25}$$

This decomposition follows from general theory due to Zarantonello [157], who proves that for non-oblique cones, we have Π_{C_2 ∩ C_1^*} = Π_{C_1^*} Π_{C_2}; in particular, see Theorem 5.2 in that paper. An immediate consequence of the decomposition (3.25) is that the GLRT for any non-oblique cone pair (C_1, C_2) can be written as

$$T(y) = \|\Pi_{C_2}(y)\|_2^2 - \|\Pi_{C_1}(y)\|_2^2 = \|\Pi_{C_2 \cap C_1^*}(y)\|_2^2 = \|y\|_2^2 - \min_{\theta \in C_2 \cap C_1^*} \|y - \theta\|_2^2.$$

Consequently, we see that the GLRT for the pair (C_1, C_2) is equivalent to (that is, determined by the same statistic as) the GLRT for testing the reduced hypothesis

$$\widetilde{H}_0 : \theta = 0 \quad \text{versus} \quad \widetilde{H}_1 : \theta \in C_2 \cap C_1^* \setminus B_2(\epsilon). \tag{3.26}$$

Following the previous notation, we write this reduced problem as T({0}, C_2 ∩ C_1^*; ε); we make frequent use of this convenient reduction in the sequel.
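As an illustration of this reduction (a sketch added here, not taken from the text), the GLRT for the reduced pair ({0}, K) only requires computing T(y) = ‖Π_K(y)‖²_2 and comparing it to a threshold. The snippet below does this for the special case where K is the non-negative orthant, calibrating the threshold by simulating T under the null; the helper names and simulation sizes are arbitrary.

```python
import numpy as np

def glrt_statistic_orthant(y):
    """Reduced GLRT statistic T(y) = ||Pi_K(y)||_2^2 for K = nonnegative orthant."""
    return float(np.sum(np.clip(y, 0.0, None) ** 2))

def null_threshold(d, rho=0.05, n_sim=50000, seed=1):
    """Choose beta as the (1 - rho) quantile of T(y) under H0: y ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    T0 = np.sum(np.clip(rng.standard_normal((n_sim, d)), 0.0, None) ** 2, axis=1)
    return float(np.quantile(T0, 1.0 - rho))

d = 50
beta = null_threshold(d)
y = 0.3 + np.random.default_rng(2).standard_normal(d)   # one draw from the alternative
print(glrt_statistic_orthant(y) >= beta)                 # the GLRT decision phi_beta(y)
```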

3.3 Main results and their consequences

We now turn to the statement of our main results, along with a discussion of some of their consequences. Section 3.3.1 provides a sharp characterization of the minimax radius for the generalized likelihood ratio test up to a universal constant, along with a number of concrete examples. In Section 3.3.2, we state and prove a general lower bound on the performance of any test, and use it to establish the optimality of the GLRT in certain settings, as well as its sub-optimality in other settings. In Section 3.3.3, we revisit and study in detail two cones of particular interest, namely the orthant and monotone cones.

3.3.1 Analysis of the generalized likelihood ratio test

Let (C_1, C_2) be a nested pair of closed cones C_1 ⊆ C_2 that are non-oblique (3.24). Consider the polar cone C_1^* as well as the intersection cone K = C_2 ∩ C_1^*. Letting g ∈ R^d denote a standard Gaussian random vector, we then define the quantity

$$\delta_{LR}^2(C_1, C_2) := \min\Bigg\{ \mathbb{E}\|\Pi_K g\|_2, \; \bigg( \frac{\mathbb{E}\|\Pi_K g\|_2}{\max\big\{0, \inf_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle \big\}} \bigg)^2 \Bigg\}. \tag{3.27}$$

Note that δ²_LR(C_1, C_2) is a purely geometric object, depending on the pair (C_1, C_2) via the new cone K = C_2 ∩ C_1^*, which arises due to the GLRT equivalence (3.26) discussed previously. Recall that the GLRT is based on applying a threshold, at some level β ∈ [0, ∞), to the likelihood ratio statistic T(y); in particular, see equations (3.11a) and (3.11b). In the following theorem, we study the performance of the GLRT in terms of the uniform testing error E(φ_β; C_1, C_2, ε) from equation (3.14). In particular, we show that the critical testing radius for the GLRT is governed by the geometric parameter δ²_LR(C_1, C_2).

Theorem 3.3.1. There are numbers {(b_ρ, B_ρ), ρ ∈ (0, 1/2)} such that for every pair of non-oblique closed convex cones (C_1, C_2) with C_1 strictly contained within C_2:
(a) For every error probability ρ ∈ (0, 0.5), we have

$$\inf_{\beta \in [0,\infty)} \mathcal{E}(\phi_\beta; C_1, C_2, \epsilon) \le \rho \quad \text{for all } \epsilon^2 \ge B_\rho\, \sigma^2\, \delta_{LR}^2(C_1, C_2). \tag{3.28a}$$

(b) Conversely, for every error probability ρ ∈ (0, 0.11], we have

$$\inf_{\beta \in [0,\infty)} \mathcal{E}(\phi_\beta; C_1, C_2, \epsilon) \ge \rho \quad \text{for all } \epsilon^2 \le b_\rho\, \sigma^2\, \delta_{LR}^2(C_1, C_2). \tag{3.28b}$$

See Section 3.5.1 for the proof of this result. CHAPTER 3. HYPOTHESIS TESTING OVER CONVEX CONES 24

Remarks While our proof leads to universal values for the constants B_ρ and b_ρ, we have made no effort to obtain the sharpest possible ones, so we do not state them here. In any case, our main interest is to understand the scaling of the testing radius with respect to σ and the geometric parameters of the problem. In terms of the GLRT testing radius ε_GLR previously defined (3.15b), Theorem 3.3.1 establishes that

$$\epsilon_{GLR}(C_1, C_2; \rho) \asymp \sigma\, \delta_{LR}(C_1, C_2), \tag{3.29}$$

where ≍ denotes equality up to constants depending on ρ, but independent of all other problem parameters. Since ε_GLR always upper bounds ε_OPT for every fixed level ρ, we can also conclude from Theorem 3.3.1 that

$$\epsilon_{OPT}(C_1, C_2; \rho) \lesssim \sigma\, \delta_{LR}(C_1, C_2).$$

It is worth noting that the quantity δ²_LR(C_1, C_2) depends on the pair (C_1, C_2) only via the new cone K = C_2 ∩ C_1^*. Indeed, as discussed in Section 3.2.2, for any pair of non-oblique closed convex cones, the GLRT for the original testing problem (3.12) is equivalent to the GLRT for the modified testing problem T({0}, K; ε). Observe that the quantity δ²_LR(C_1, C_2) from equation (3.27) is defined via the minimum of two terms. The first term E‖Π_K g‖_2 is the (square root of the) Gaussian width of the cone K, and is a familiar quantity from past work on least-squares estimation involving convex sets [139, 37]. The Gaussian width measures the size of the cone K, and it is to be expected that the minimax testing radius should grow with this size, since K characterizes the set of possible alternatives. The second term, involving the inner product ⟨η, EΠ_K g⟩, is less immediately intuitive, partly because no such term arises in estimation over convex sets. The second term becomes dominant for cones in which the expectation v^* := E[Π_K g] is relatively large; for such cones, we can test between the null and alternative by performing a univariate test after projecting the data onto the direction v^*. This possibility only arises for cones that are more complicated than subspaces, since E[Π_K g] = 0 for any subspace K. Finally, we note that Theorem 3.3.1 gives a sharp characterization of the behavior of the GLRT up to a constant, which is different from the usual minimax guarantee. To the best of our knowledge, it is the first result to provide tight upper and lower control on the uniform performance of a specific test.
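For intuition about how the two terms in (3.27) trade off, the following rough Monte Carlo sketch (an illustration added here, not from the original text) estimates both terms for a cone specified by its projection map. The infimum over K ∩ S^{d-1} is only approximated by sampled directions, which can overestimate the denominator and hence understate the second term, so the output is indicative rather than exact.

```python
import numpy as np

def delta_lr_sq_terms(project, d, n_mc=20000, seed=0):
    """Rough Monte Carlo estimates of the two terms in delta_LR^2 from (3.27):
    term1 = E||Pi_K g||_2, term2 = (term1 / max{0, inf <eta, E Pi_K g>})^2,
    with the infimum crudely approximated over sampled unit directions of K."""
    rng = np.random.default_rng(seed)
    P = project(rng.standard_normal((n_mc, d)))          # rows are Pi_K(g_i)
    term1 = np.linalg.norm(P, axis=1).mean()
    v = P.mean(axis=0)                                   # estimate of E Pi_K g
    D = project(rng.standard_normal((n_mc, d)))
    norms = np.linalg.norm(D, axis=1)
    D = D[norms > 1e-12] / norms[norms > 1e-12, None]    # sampled directions in K cap S^{d-1}
    denom = max(0.0, float((D @ v).min()))
    term2 = np.inf if denom == 0.0 else (term1 / denom) ** 2
    return term1, term2

# Example: the nonnegative orthant (projection = coordinatewise clipping);
# here the minimum in (3.27) is attained by the first term, of order sqrt(d).
print(delta_lr_sq_terms(lambda X: np.clip(X, 0.0, None), d=100))
```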

3.3.1.1 Consequences for convex set alternatives

Although Theorem 3.3.1 applies to cone-based testing problems, it also has some implications for a more general class of problems based on convex set alternatives. In particular, suppose that we are interested in the testing problem of distinguishing between

H0 : θ = θ0, versus H1 : θ ∈ S, (3.30)

where S is not necessarily a cone, but rather an arbitrary closed convex set, and θ_0 is some vector such that θ_0 ∈ S. Consider the tangent cone of S at θ_0, which is given by

$$T_S(\theta) := \{ u \in \mathbb{R}^d \mid \text{there exists some } t > 0 \text{ such that } \theta + t u \in S \}. \tag{3.31}$$

Note that T_S(θ_0) contains the shifted set S − θ_0. Consequently, we have

$$\mathcal{E}(\psi; \{0\}, S - \theta_0, \epsilon) \le \mathbb{E}_{\theta=0}[\psi(y)] + \sup_{\theta \in T_S(\theta_0) \setminus B_2(0;\epsilon)} \mathbb{E}_\theta[1 - \psi(y)] = \mathcal{E}(\psi; \{0\}, T_S(\theta_0), \epsilon),$$

which shows that the tangent cone testing problem

$$H_0 : \theta = 0 \quad \text{versus} \quad H_1 : \theta \in T_S(\theta_0), \tag{3.32}$$

is more challenging than the original problem (3.30). Thus, applying Theorem 3.3.1 to this cone-testing problem (3.32), we obtain the following:

Corollary 1. For the convex set testing problem (3.30), we have

$$\epsilon_{OPT}^2(\theta_0, S; \rho) \lesssim \sigma^2 \min\Bigg\{ \mathbb{E}\|\Pi_{T_S(\theta_0)} g\|_2, \; \bigg( \frac{\mathbb{E}\|\Pi_{T_S(\theta_0)} g\|_2}{\max\big\{0, \inf_{\eta \in T_S(\theta_0) \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_{T_S(\theta_0)} g \rangle\big\}} \bigg)^2 \Bigg\}. \tag{3.33}$$

This upper bound can be achieved by applying the GLRT to the tangent cone testing problem (3.32). This corollary offers a general recipe for upper bounding the optimal testing radius. In Subsection 3.3.1.6, we provide an application of Corollary 1 to the problem of testing

H0 : θ = θ0 versus H1 : θ ∈ M,

where M is the monotone cone (defined in expression (3.2)). When θ_0 ≠ 0, this is not a cone testing problem, since the set {θ_0} is not a cone. Using Corollary 1, we prove an upper bound on the optimal testing radius for this problem in terms of the number of constant pieces of θ_0. In the remainder of this section, we consider some special cases of testing a cone K versus {0} in order to illustrate the consequences of Theorem 3.3.1. In all cases, we compute the GLRT testing radius for a constant error probability, and so ignore the dependence on ρ. For this reason, we adopt the more streamlined notation ε_GLR(K) for the radius ε_GLR({0}, K; ρ).

3.3.1.2 Subspace of dimension k

Let us begin with an especially simple case, namely when K is equal to a subspace S_k of dimension k ≤ d. In this case, the projection Π_K is a linear operator, which can be represented by matrix multiplication using a rank k projection matrix. By symmetry of the Gaussian distribution, we have E[Π_K g] = 0. Moreover, by rotation invariance of the Gaussian distribution, the random variable ‖Π_K g‖²_2 follows a χ²-distribution with k degrees of freedom, whence

$$\frac{\sqrt{k}}{\sqrt{2}} \le \mathbb{E}\|\Pi_K g\|_2 \le \sqrt{\mathbb{E}\|\Pi_K g\|_2^2} = \sqrt{k}.$$

Applying Theorem 3.3.1 then yields that the testing radius of the GLRT scales as

$$\epsilon_{GLR}^2(S_k) \asymp \sigma^2 \sqrt{k}. \tag{3.34}$$

Here our notation ≍ denotes equality up to constants independent of (σ, k); we have omitted dependence on the testing error ρ so as to simplify notation, and will do so throughout our discussion.

3.3.1.3 Circular cone

A circular cone in R^d with constant angle α ∈ (0, π/2) is given by Circ_d(α) := {θ ∈ R^d | θ_1 ≥ ‖θ‖_2 cos(α)}. In geometric terms, it corresponds to the set of all vectors whose angle with the standard basis vector e_1 = (1, 0, ..., 0) is at most α radians. Figure 3.1(a) gives an illustration of a circular cone.

Figure 3.1. (a) A 3-dimensional circular cone with angle α. (b) Illustration of a cone versus its polar cone.

Suppose that we want to test the null hypothesis θ = 0 versus the cone alternative K = Circ_d(α). We claim that, in application to this particular cone, Theorem 3.3.1 implies that

$$\epsilon_{GLR}^2(K) \asymp \sigma^2 \min\{\sqrt{d}, 1\} = \sigma^2, \tag{3.35}$$

where ≍ denotes equality up to constants depending on (ρ, α), but independent of all other problem parameters. In order to apply Theorem 3.3.1, we need to evaluate both terms that define the geometric quantity δ²_LR(C_1, C_2). On one hand, by symmetry of the cone K = Circ_d(α) in its last (d − 1) coordinates, we have EΠ_K g = β e_1 for some scalar β > 0, where e_1 denotes the standard Euclidean basis vector with a 1 in the first coordinate. Moreover, for any η ∈ K ∩ S^{d-1}, we have η_1 ≥ cos(α), and hence

$$\inf_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle = \beta \inf_{\eta \in K \cap S^{d-1}} \eta_1 \ge \cos(\alpha)\,\beta = \cos(\alpha)\, \|\mathbb{E}\Pi_K g\|_2.$$

Next, we claim that ‖EΠ_K g‖_2 ≍ E‖Π_K g‖_2. In order to prove this claim, note that Jensen's inequality yields

$$\mathbb{E}\|\Pi_K g\|_2 \ge \|\mathbb{E}\Pi_K g\|_2 \overset{(a)}{\ge} (\mathbb{E}\Pi_{Circ_d(\alpha)} g)_1 = \mathbb{E}(\Pi_{Circ_d(\alpha)} g)_1 \overset{(b)}{\ge} \mathbb{E}\|\Pi_{Circ_d(\alpha)} g\|_2 \cos(\alpha), \tag{3.36}$$

where inequality (a) follows simply from the fact that ‖x‖_2 ≥ |x_1|, whereas inequality (b) follows from the definition of the circular cone. Therefore, the second term in the definition (3.27) of δ²_LR(C_1, C_2) is upper bounded by a constant, independent of the dimension d. On the other hand, from known results on circular cones (see §6.3, [103]), there are constants κ_j = κ_j(α) for j = 1, 2 such that κ_1 d ≤ E‖Π_K g‖²_2 ≤ κ_2 d. Moreover, we have

$$\mathbb{E}\|\Pi_K g\|_2^2 - 4 \overset{(a)}{\le} \big(\mathbb{E}\|\Pi_K g\|_2\big)^2 \overset{(b)}{\le} \mathbb{E}\|\Pi_K g\|_2^2.$$

Here inequality (b) is an immediate consequence of Jensen's inequality, whereas inequality (a) follows from the fact that var(‖Π_K g‖_2) ≤ 4; see Lemma A.4.1 in Section 3.5.1 and the surrounding discussion for details. Putting together the pieces, we see that E‖Π_K g‖_2 ≍ √d for the circular cone. Combining different elements of our argument leads to the stated claim (3.35).
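The calculation above only needs the Euclidean projection onto Circ_d(α), which is available in closed form since Circ_d(α) is a (rescaled) second-order cone. The sketch below (added for illustration, under the standard second-order-cone projection formula) implements it and checks numerically that E‖Π_K g‖_2 grows like √d for a fixed angle.

```python
import numpy as np

def project_circular_cone(v, alpha):
    """Projection onto Circ_d(alpha) = {theta : theta_1 >= ||theta||_2 cos(alpha)},
    equivalently ||theta_{2:d}||_2 <= theta_1 * tan(alpha)."""
    s = np.tan(alpha)
    t, x = v[0], v[1:]
    r = np.linalg.norm(x)
    if r <= s * t:                      # already inside the cone
        return v.copy()
    if t <= -s * r:                     # inside the polar cone: project to 0
        return np.zeros_like(v)
    t_hat = (t + s * r) / (1.0 + s ** 2)
    out = np.empty_like(v)
    out[0] = t_hat
    out[1:] = (s * t_hat / r) * x       # keep the direction of the last d - 1 coordinates
    return out

rng = np.random.default_rng(0)
alpha = np.pi / 6
for d in (10, 100, 1000):
    norms = [np.linalg.norm(project_circular_cone(rng.standard_normal(d), alpha))
             for _ in range(2000)]
    print(d, np.mean(norms) / np.sqrt(d))   # ratio settles near a constant kappa(alpha)
```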

3.3.1.4 A Cartesian product cone

We now consider a simple extension of the previous two examples, namely a convex cone formed by taking the Cartesian product of the real line R with the circular cone Circ_{d-1}(α), that is

$$K_\times := Circ_{d-1}(\alpha) \times \mathbb{R}. \tag{3.37}$$

Please refer to Figure 3.2 for an illustration of this cone in three dimensions. This example turns out to be rather interesting because, as will be demonstrated in Section 3.3.2.3, the GLRT is sub-optimal by a factor √d for this cone. In order to set up this later analysis, here we use Theorem 3.3.1 to prove that

$$\epsilon_{GLR}^2(K_\times) \asymp \sigma^2 \sqrt{d}. \tag{3.38}$$

Note that this result is strongly suggestive of sub-optimality on the part of the GLRT. More concretely, the two cones that form K_× are both "easy", in that the GLRT radius scales as

Figure 3.2: Illustration of the product cone defined in equation (3.37).

σ² for each. For this reason, one would expect that the squared radius of an optimal test would scale as σ², as opposed to the σ²√d of the GLRT, and our later calculations will show that this is indeed the case. We now prove claim (3.38) as a consequence of Theorem 3.3.1. First notice that projecting onto the product cone K_× amounts to projecting the first d − 1 coordinates onto the circular cone Circ_{d-1}(α) and the last coordinate onto R. Consequently, we have the following inequality

$$\mathbb{E}\|\Pi_{Circ_{d-1}(\alpha)} g\|_2 \le \mathbb{E}\|\Pi_{K_\times} g\|_2 \overset{(a)}{\le} \sqrt{\mathbb{E}\|\Pi_{K_\times} g\|_2^2} = \sqrt{\mathbb{E}\|\Pi_{Circ_{d-1}(\alpha)} g\|_2^2 + \mathbb{E}[g_d^2]},$$

where inequality (a) follows by Jensen's inequality. Making use of our previous calculations for the circular cone, we have E‖Π_{K_×} g‖_2 ≍ √d. Moreover, the last coordinate of E[Π_{K_×} g] is equal to 0 by symmetry, and the standard basis vector e_d ∈ R^d, with a single one in its last coordinate, belongs to K_× ∩ S^{d-1}; hence we have

$$\inf_{\eta \in K_\times \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_{K_\times}(g) \rangle \le \langle e_d, \mathbb{E}\Pi_{K_\times}(g) \rangle = 0.$$

Plugging into the definition of δ²_LR(C_1, C_2), the corresponding second term is infinite. Therefore, the minimum that defines δ²_LR(C_1, C_2) is achieved by the first term, and so is proportional to √d. Putting together the pieces yields the claim (3.38).

3.3.1.5 Non-negative orthant cone

Next let us consider the (non-negative) orthant cone given by K_+ := {θ ∈ R^d | θ_j ≥ 0 for j = 1, ..., d}. Here we use Theorem 3.3.1 to show that

$$\epsilon_{GLR}^2(K_+) \asymp \sigma^2 \sqrt{d}. \tag{3.39}$$

Turning to the evaluation of the quantity δ²_LR(C_1, C_2), it is straightforward to see that [Π_{K_+}(θ)]_j = max{0, θ_j}, and hence EΠ_{K_+}(g) = (1/2) E|g_1| 1 = (1/√(2π)) 1, where 1 ∈ R^d is the vector of all ones. Thus, we have

$$\|\mathbb{E}\Pi_{K_+}(g)\|_2 = \sqrt{\frac{d}{2\pi}}, \qquad \text{and} \qquad \|\mathbb{E}\Pi_{K_+}(g)\|_2 \le \mathbb{E}\|\Pi_{K_+}(g)\|_2 \le \sqrt{\mathbb{E}\|\Pi_{K_+}(g)\|_2^2} = \sqrt{\frac{d}{2}},$$

where the second inequality follows from Jensen's inequality. So the first term in the definition of the quantity δ²_LR(C_1, C_2) is proportional to √d. As for the second term, since the standard basis vector e_1 ∈ K_+ ∩ S^{d-1}, we have

$$\inf_{\eta \in K_+ \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_{K_+} g \rangle \le \Big\langle e_1, \frac{1}{\sqrt{2\pi}} \mathbf{1} \Big\rangle = \frac{1}{\sqrt{2\pi}}.$$

Consequently, the second term in the definition of the quantity δ²_LR(C_1, C_2) is lower bounded by a universal constant times d. Combining these derivations yields the stated claim (3.39).
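The closed-form quantities used above are easy to confirm numerically. The short check below (added here as an illustration) verifies that EΠ_{K_+}(g) ≈ (1/√(2π)) 1, that the first term E‖Π_{K_+} g‖_2 is of order √d, and that the coordinate direction e_1 already pushes the second term up to order d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mc = 200, 50000
P = np.clip(rng.standard_normal((n_mc, d)), 0.0, None)   # Pi_{K_+}(g), coordinatewise

mean_proj = P.mean(axis=0)
print(mean_proj[:4], 1 / np.sqrt(2 * np.pi))              # E Pi(g) approx (1/sqrt(2 pi)) 1
term1 = np.linalg.norm(P, axis=1).mean()
print(term1, np.sqrt(d / 2))                              # first term approx sqrt(d/2)
# <e_1, E Pi g> upper bounds the infimum, so this ratio lower bounds the second term / d
print((term1 / mean_proj[0]) ** 2 / d)                    # a constant, i.e. second term of order d
```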

3.3.1.6 Monotone cone

As our final example, consider testing in the monotone cone given by M := {θ ∈ R^d | θ_1 ≤ θ_2 ≤ ⋯ ≤ θ_d}. Testing with a monotone cone constraint has also been studied in different settings before, where it is known in some cases that restricting to the monotone cone reduces the hardness of the problem to a logarithmic dependence on the dimension (e.g., [16, 148]). Here we use Theorem 3.3.1 to show that

$$\epsilon_{GLR}^2(M) \asymp \sigma^2 \sqrt{\log d}. \tag{3.40}$$

From known results on the monotone cone (see §3.5, [3]), we know that E‖Π_M g‖_2 ≍ √(log d). So the only remaining detail is to control the second term defining δ²_LR(C_1, C_2). We claim that the second term is actually infinite, since

$$\max\Big\{0, \inf_{\eta \in M \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_M g \rangle\Big\} = 0, \tag{3.41}$$

which can be seen by simply noticing that the vectors (1/√d) 1 and −(1/√d) 1 belong to M ∩ S^{d-1} and

$$\min\Big\{ \big\langle \tfrac{1}{\sqrt{d}} \mathbf{1}, \mathbb{E}\Pi_M g \big\rangle, \big\langle -\tfrac{1}{\sqrt{d}} \mathbf{1}, \mathbb{E}\Pi_M g \big\rangle \Big\} \le 0.$$

Here 1 ∈ R^d denotes the vector of all ones. Combining the pieces yields the claim (3.40).
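The monotone-cone projection Π_M is exactly (unweighted) isotonic regression, computable by the pool-adjacent-violators algorithm. The sketch below (added for illustration; a simple PAVA, not a construction from the text) can be used to check the √(log d) scaling of E‖Π_M g‖_2 empirically.

```python
import numpy as np

def isotonic_projection(y):
    """Pool-adjacent-violators: Euclidean projection of y onto the monotone cone
    M = {theta_1 <= ... <= theta_d}."""
    sums, counts = [], []
    for v in y:
        s, c = float(v), 1
        # merge blocks from the right while their means violate monotonicity
        while sums and sums[-1] / counts[-1] > s / c:
            s += sums.pop()
            c += counts.pop()
        sums.append(s)
        counts.append(c)
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

rng = np.random.default_rng(0)
for d in (16, 256, 4096):
    norms = [np.linalg.norm(isotonic_projection(rng.standard_normal(d)))
             for _ in range(300)]
    print(d, np.mean(norms), np.sqrt(np.log(d)))   # E||Pi_M g||_2 grows like sqrt(log d)
```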

Testing constant versus monotone It is worth noting that the same GLRT bound also holds for the more general problem of testing the monotone cone M versus the linear subspace L = span(1) of constant vectors, namely:

$$\epsilon_{GLR}^2(L, M) \asymp \sigma^2 \sqrt{\log d}. \tag{3.42}$$

In particular, the following lemma provides the control that we need:

Lemma 3.3.1. For the monotone cone M and the linear space L = span(1), there is a universal constant c such that

$$\inf_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle \le c, \qquad \text{where } K := M \cap L^{\perp}.$$

See Appendix A.7.1 for the proof of this lemma.

Testing an arbitrary vector θ_0 versus the monotone cone Finally, let us consider an important implication of Corollary 1 in the context of testing departures in the monotone cone. More precisely, for a fixed vector θ_0 ∈ M, consider the testing problem

H0 : θ = θ0, versus H1 : θ ∈ M, (3.43)

Let us define k(θ_0) as the number of constant pieces of θ_0, by which we mean that there exist integers d_1, ..., d_{k(θ_0)} with d_i ≥ 1 and d_1 + ⋯ + d_{k(θ_0)} = d such that θ_0 is constant on each set S_i := {j : Σ_{t=1}^{i-1} d_t + 1 ≤ j ≤ Σ_{t=1}^{i} d_t}, for 1 ≤ i ≤ k(θ_0). We claim that Corollary 1 guarantees that the optimal testing radius satisfies

$$\epsilon_{OPT}^2(\theta_0, M; \rho) \lesssim \sigma^2 \sqrt{k(\theta_0) \log\frac{d}{k(\theta_0)}}. \tag{3.44}$$

Note that this upper bound depends on the structure of θ_0 through how many constant pieces θ_0 possesses, which reveals the adaptive nature of Corollary 1. In order to prove inequality (3.44), let us use the shorthand k to denote k(θ_0). First notice that both (1/√d) 1 and −(1/√d) 1 belong to T_M(θ_0), so that

$$\max\Big\{0, \inf_{\eta \in T_M(\theta_0) \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_{T_M(\theta_0)} g \rangle\Big\} = 0,$$

which implies that the second term defining δ²_LR(C_1, C_2) is infinite. It only remains to calculate E‖Π_{T_M(θ_0)} g‖_2. Since the tangent cone T_M(θ_0) equals the Cartesian product of k monotone cones, namely T_M(θ_0) = M_{d_1} × ⋯ × M_{d_k}, we have

$$\mathbb{E}\|\Pi_{T_M(\theta_0)} g\|_2^2 = \mathbb{E}\|\Pi_{M_{d_1}} g\|_2^2 + \cdots + \mathbb{E}\|\Pi_{M_{d_k}} g\|_2^2 = \log(d_1) + \cdots + \log(d_k) \le k \log\frac{d}{k},$$

where the last step follows from the concavity of the logarithm function. Therefore Jensen's inequality guarantees that

$$\mathbb{E}\|\Pi_{T_M(\theta_0)} g\|_2 \le \sqrt{\mathbb{E}\|\Pi_{T_M(\theta_0)} g\|_2^2} \le \sqrt{k \log\frac{d}{k}}.$$

Putting the pieces together, Corollary 1 guarantees that the claimed inequality (3.44) holds for the testing problem (3.43).

3.3.2 Lower bounds on the testing radius

Thus far, we have derived sharp bounds for a particular procedure, namely the GLRT. Of course, it is of interest to understand when the GLRT is actually an optimal test, meaning that there is no other test that can discriminate between the null and alternative for smaller separations. In this section, we use information-theoretic methods to derive a lower bound on the optimal testing radius ε_OPT for every pair of non-oblique and nested closed convex cones (C_1, C_2). Similar to Theorem 3.3.1, this bound depends on the geometric structure of the intersection cone K := C_2 ∩ C_1^*, where C_1^* is the polar cone to C_1. In particular, let us define the quantity

$$\delta_{OPT}^2(C_1, C_2) := \min\Bigg\{ \mathbb{E}\|\Pi_K g\|_2, \; \bigg( \frac{\mathbb{E}\|\Pi_K g\|_2}{\sup_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle} \bigg)^2 \Bigg\}. \tag{3.45}$$

Note that the only difference from δ²_LR(C_1, C_2) is the replacement of the infimum over K ∩ S^{d-1} with a supremum in the denominator of the second term. Moreover, since the supremum is achieved at EΠ_K g / ‖EΠ_K g‖_2, we have sup_{η ∈ K ∩ S^{d-1}} ⟨η, EΠ_K g⟩ = ‖EΠ_K g‖_2. Consequently, the second term on the right-hand side of equation (3.45) can also be written in the equivalent form (E‖Π_K g‖_2 / ‖EΠ_K g‖_2)². With this notation in hand, we are now ready to state a general lower bound on the minimax optimal testing radius:

Theorem 3.3.2. There are numbers {κρ, ρ ∈ (0, 1/2]} such that for every nested pair of non-oblique closed convex cones C1 ⊂ C2, we have

$$\inf_\psi \mathcal{E}(\psi; C_1, C_2, \epsilon) \ge \rho \quad \text{whenever } \epsilon^2 \le \kappa_\rho\, \sigma^2\, \delta_{OPT}^2(C_1, C_2). \tag{3.46}$$

In particular, we can take κρ = 1/14 for all ρ ∈ (0, 1/2]. See Section 3.5.2 for the proof of this result. CHAPTER 3. HYPOTHESIS TESTING OVER CONVEX CONES 32

Remarks In more compact terms, Theorem 3.3.2 can be understood as guaranteeing

$$\epsilon_{OPT}(C_1, C_2; \rho) \gtrsim \sigma\, \delta_{OPT}(C_1, C_2),$$

where ≳ denotes an inequality up to constants (with ρ viewed as fixed). Theorem 3.3.2 is proved by constructing a distribution over the alternative H_1 supported only on those points in H_1 that are hard to distinguish from H_0. Based on this construction, the testing error can be lower bounded by controlling the total variation distance between the two marginal likelihood functions. We refer the reader to Section 3.5.2 for more details on this proof. One useful consequence of Theorem 3.3.2 is in providing a sufficient condition for optimality of the GLRT, which we summarize here:

Corollary 2 (Sufficient condition for optimality of the GLRT). Given the cone K = C_2 ∩ C_1^*, suppose that there is a numerical constant b > 1, independent of K and all other problem parameters, such that

$$\sup_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle = \|\mathbb{E}\Pi_K g\|_2 \le b \inf_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle. \tag{3.47}$$

Then the GLRT is a minimax optimal test, that is, ε_GLR(C_1, C_2; ρ) ≍ ε_OPT(C_1, C_2; ρ).

It is natural to wonder whether the condition (3.47) is also necessary for optimality of the GLRT. This turns out not to be the case. The monotone cone, to be revisited in Section 3.3.3.2, provides an instance of a cone testing problem for which the GLRT is optimal while condition (3.47) is violated. Let us now return to our concrete examples.

3.3.2.1 Revisiting the k-dimensional subspace

Let S_k be a subspace of dimension k ≤ d. In our earlier discussion in Section 3.3.1.2, we established that ε²_GLR(S_k) ≍ σ²√k. Let us use Corollary 2 to verify that the GLRT is optimal for this problem. For a k-dimensional subspace K = S_k, we have EΠ_K g = 0 by symmetry; consequently, condition (3.47) holds in a trivial manner. Thus, we conclude that ε²_OPT(S_k) ≍ ε²_GLR(S_k), showing that the GLRT is optimal over all tests.

3.3.2.2 Revisiting the circular cone

Recall the circular cone K = {θ ∈ R^d | θ_1 ≥ ‖θ‖_2 cos(α)} for fixed 0 < α < π/2. In our earlier discussion, we proved that ε²_GLR(K) ≍ σ². Here let us verify that this scaling is optimal over all tests. By symmetry, we find that EΠ_K g = β e_1 ∈ R^d, where e_1 denotes the standard Euclidean basis vector with a 1 in the first coordinate, and β > 0 is some scalar. For any vector η ∈ K ∩ S^{d-1}, we have η_1 ≥ cos(α), and hence

$$\inf_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle \ge \cos(\alpha)\,\beta = \cos(\alpha)\, \|\mathbb{E}\Pi_K g\|_2.$$

Consequently, we see that condition (3.47) is satisfied with b = 1/cos(α) > 0, so that the GLRT is optimal over all tests for each fixed α. (To be clear, in this example, our theory does not provide a sharp bound uniformly over varying α.)

3.3.2.3 Revisiting the product cone

Recall from Section 3.3.1.4 our discussion of the Cartesian product cone K_× = Circ_{d-1}(α) × R. In this section, we establish that the GLRT, when applied to a testing problem based on this cone, is sub-optimal by a factor of √d. Let us first prove that the sufficient condition (3.47) is violated, so that Corollary 2 does not imply optimality of the GLRT. From our earlier calculations, we know that E‖Π_{K_×} g‖_2 ≍ √d. On the other hand, we also know that EΠ_{K_×} g is equal to zero in its last coordinate. Since the standard basis vector e_d belongs to the set K_× ∩ S^{d-1}, we have

$$\inf_{\eta \in K_\times \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_{K_\times} g \rangle \le \langle e_d, \mathbb{E}\Pi_{K_\times} g \rangle = 0,$$

so that condition (3.47) does not hold. From this calculation alone, we cannot conclude that the GLRT is sub-optimal. So let us now compute the lower bound guaranteed by Theorem 3.3.2. From our previous discussion, we know that EΠ_{K_×} g = β e_1 for some scalar β > 0. Moreover, we also have ‖EΠ_{K_×} g‖_2 = β ≍ √d; this scaling follows because we have ‖EΠ_{K_×} g‖_2 = ‖EΠ_{Circ_{d-1}(α)} g‖_2 ≍ √(d − 1), where we have used the previous inequality (3.36) for the circular cone. Putting together the pieces, we find that Theorem 3.3.2 implies that

$$\epsilon_{OPT}^2(K_\times) \gtrsim \sigma^2, \tag{3.48}$$

which differs from the GLRT scaling by a factor of √d. Does there exist a test that achieves the lower bound (3.48)? It turns out that a simple truncation test does so, and hence is optimal. To provide intuition for the test, observe that for any vector θ ∈ K_× ∩ S^{d-1}, we have θ_1² + θ_d² ≥ cos²(α). To verify this claim, note that

$$\frac{1}{\cos^2(\alpha)}\big(\theta_1^2 + \theta_d^2\big) \ge \frac{\theta_1^2}{\cos^2(\alpha)} + \theta_d^2 \ge \sum_{j=1}^{d-1} \theta_j^2 + \theta_d^2 = 1.$$

Consequently, the two coordinates (y_1, y_d) provide sufficient information for constructing a good test. In particular, consider the truncation test

$$\varphi(y) := \mathbb{I}\big[ \|(y_1, y_d)\|_2 \ge \beta \big],$$

for some threshold β > 0 to be determined. This can be viewed as a GLRT for testing the standard null against the alternative R², and hence our general theory guarantees that it will succeed with separation ε² ≳ σ². This guarantee matches our lower bound (3.48), showing that the truncation test is indeed optimal, and moreover, that the GLRT is sub-optimal by a factor of √d for this particular problem. We provide more intuition on why the GLRT is sub-optimal, and use this intuition to construct a more general class of problems for which a similar sub-optimality is witnessed, in Appendix A.1.
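The truncation test is simple to simulate. The sketch below (an added illustration, with arbitrary parameter choices) calibrates β from the null distribution of ‖(y_1, y_d)‖_2, which is the square root of a χ²_2 variable, and shows that the test has nontrivial power at a separation of constant order, independent of the ambient dimension d.

```python
import numpy as np

def truncation_test(y, beta):
    """phi(y) = 1{ ||(y_1, y_d)||_2 >= beta }: uses only the two informative
    coordinates for the product cone K_x = Circ_{d-1}(alpha) x R."""
    return int(np.hypot(y[0], y[-1]) >= beta)

rng = np.random.default_rng(0)
d, rho = 500, 0.05
null_stats = np.hypot(rng.standard_normal(100000), rng.standard_normal(100000))
beta = np.quantile(null_stats, 1 - rho)          # level-rho threshold under H0

theta = np.zeros(d)
theta[0] = 3.0                                    # a vector in K_x with constant norm
Y = theta + rng.standard_normal((5000, d))
power = np.mean([truncation_test(y, beta) for y in Y])
print(round(float(beta), 3), power)               # power does not degrade as d grows
```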

3.3.3 Detailed analysis of two cases

This section is devoted to a detailed analysis of the orthant cone, followed by the monotone cone. Here we find that the GLRT is again optimal for both of these cones, but establishing this optimality requires a more delicate analysis.

3.3.3.1 Revisiting the orthant cone

Recall from Section 3.3.1.5 our discussion of the (non-negative) orthant cone

$$K_+ := \{ \theta \in \mathbb{R}^d \mid \theta_j \ge 0 \text{ for } j = 1, \ldots, d \},$$

where we proved that ε²_GLR(K_+) ≍ σ²√d. Let us first show that the sufficient condition (3.47) does not hold, so that Corollary 2 does not imply optimality of the GLRT. As we computed in Section 3.3.1.5, the quantity E‖Π_{K_+}(g)‖_2 ≍ √d and

$$\inf_{\eta \in K_+ \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_{K_+} g \rangle \le \Big\langle e_1, \frac{1}{\sqrt{2\pi}} \mathbf{1} \Big\rangle = \frac{1}{\sqrt{2\pi}},$$

where we use the fact that EΠ_{K_+}(g) = (1/√(2π)) 1, so that condition (3.47) is violated. Does this mean the GLRT is sub-optimal? It turns out that the GLRT is actually optimal over all tests, as we can demonstrate by proving a lower bound, tighter than the one given in Theorem 3.3.2, that matches the performance of the GLRT. We summarize it as follows:

Proposition 3.3.1. There are numbers {κ_ρ, ρ ∈ (0, 1/2]} such that for the (non-negative) orthant cone K_+, we have

$$\inf_\psi \mathcal{E}(\psi; \{0\}, K_+, \epsilon) \ge \rho \quad \text{whenever } \epsilon^2 \le \kappa_\rho\, \sigma^2 \sqrt{d}. \tag{3.49}$$

See Section A.3.1 for the proof of this proposition. From Proposition 3.3.1, we see that the optimal testing radius satisfies ε²_OPT(K_+) ≳ σ²√d. Compared with the GLRT radius ε²_GLR(K_+) established in expression (3.39), this implies the optimality of the GLRT.

3.3.3.2 Revisiting the monotone cone

Recall the monotone cone given by M := {θ ∈ R^d | θ_1 ≤ θ_2 ≤ ⋯ ≤ θ_d}. In our previous discussion in Section 3.3.1.6, we established that ε²_GLR(M) ≍ σ²√(log d). We also pointed out that this scaling holds for a more general problem, namely testing the cone M versus the linear subspace L = span(1). In this section, we show that the GLRT is also optimal for both cases. First, observe that Corollary 2 does not imply optimality of the GLRT. In particular, using symmetry of the inner product, we have shown in expression (3.41) that

$$\max\Big\{0, \inf_{\eta \in M \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_M g \rangle\Big\} = 0,$$

for the cone pair (C_1, C_2) = ({0}, M). Also note that from Lemma 3.3.1 we know that for the cone pair (C_1, C_2) = (span(1), M), there is a universal constant c such that

$$\inf_{\eta \in K \cap S^{d-1}} \langle \eta, \mathbb{E}\Pi_K g \rangle \le c, \qquad \text{where } K := M \cap L^{\perp}.$$

In both cases, since E‖Π_K g‖_2 ≍ √(log d), the sufficient condition (3.47) for GLRT optimality fails to hold. It turns out that we can demonstrate a matching lower bound for ε²_OPT(M) in a more direct way, by carefully constructing a prior distribution on the alternative and controlling the testing error. Doing so allows us to conclude that the GLRT is optimal, and we summarize our conclusions in the following:

Proposition 3.3.2. There are numbers {κρ, ρ ∈ (0, 1/2]} such that for the monotone cone M and subspace L = {0} or span(1), we have

$$\inf_\psi \mathcal{E}(\psi; L, M, \epsilon) \ge \rho \quad \text{whenever } \epsilon^2 \le \kappa_\rho\, \sigma^2 \sqrt{\log(ed)}. \tag{3.50}$$

See Section A.3.2 for the proof of this proposition. Proposition 3.3.2, combined with the achievable result (3.40) for the GLRT, gives a sharp characterization of the testing radius for both problems involving the monotone cone:

H0 : θ = 0 versus H1 : θ ∈ M

and H0 : θ ∈ span(1) versus H1 : θ ∈ M.

In both cases, the optimal testing radius satisfies ε²_OPT(L, M, ρ) ≍ σ²√(log(ed)). As a consequence, the GLRT is optimal up to a universal constant. As far as we know, the problem of testing a zero or constant vector versus the monotone cone as the alternative has not been fully characterized in any past work.

3.4 Discussion

In this chapter, we have studied the problem of testing between two hypotheses that are specified by a pair of non-oblique closed convex cones. Our first main result provided a characterization, sharp up to universal multiplicative constants, of the testing radius achieved by the generalized likelihood ratio test. This characterization was geometric in nature, depending on a combination of the Gaussian width of an induced cone and a second geometric parameter. Due to the combination of these parameters, our analysis shows that the GLRT can have very different behavior even for cones that have the same Gaussian width; for instance, compare our results for the circular and orthant cone in Section 3.3.1. It is worth noting that this behavior is in sharp contrast to the situation for estimation problems over convex sets, where it is understood that (localized) Gaussian widths completely determine the estimation error associated with the least-squares estimator [139, 37]. In this way, our analysis reveals a fundamental difference between minimax testing and estimation. Our analysis also highlights some new settings in which the GLRT is non-optimal. Although past work [147, 107, 112] has exhibited non-optimality of the GLRT in certain settings, in the context of cones, all of these past examples involve oblique cones. In Section 3.3.1.4, we gave an example of sub-optimality which, to the best of our knowledge, is the first for a non-oblique pair of cones, namely the cone {0} and a certain type of Cartesian product cone. Our work leaves open various questions, and we conclude by highlighting a few here. First, in Section 3.3.2, we proved a general information-theoretic lower bound for the minimax testing radius. This lower bound provides a sufficient condition for the GLRT to be minimax optimal up to constants. Despite being tight in many non-trivial situations, our information-theoretic lower bound is not tight for all cones; proving such a sharp lower bound is an interesting topic for future research. Second, as with a long line of past work on this topic [117, 108, 106, 147], our analysis is based on assuming that the noise variance σ² is known. In practice, this may or may not be a realistic assumption, and so it is interesting to consider the extension of our results to this setting. We note that our minimax lower bounds are proved by constructing prior distributions on H_0 and H_1 and then controlling the distance between the marginal likelihood functions. Following this idea, we can also consider our testing problem in the Bayesian framework. Without any prior preference on which hypothesis to take, we let Pr(H_0) = Pr(H_1) = 1/2; the Bayesian testing procedure then makes a decision based on the quantity

$$B_{01} := \frac{m(y \mid H_0)}{m(y \mid H_1)} = \frac{\int_{\theta \in C_1} \mathbb{P}_\theta(y)\, \pi_1(\theta)\, d\theta}{\int_{\theta \in C_2} \mathbb{P}_\theta(y)\, \pi_2(\theta)\, d\theta}, \tag{3.51}$$

which is often called the Bayes factor in the literature. Analyzing the behavior of this statistic is an interesting direction to pursue in the future.

3.5 Proofs of main results

We now turn to the proofs of our main results, with the proofs of Theorems 3.3.1 and 3.3.2 given in Sections 3.5.1 and 3.5.2, respectively. In all cases, we defer the proofs of certain more technical lemmas to the appendices.

3.5.1 Proof of Theorem 3.3.1

Since the cones (C_1, C_2) are both invariant under rescaling by positive numbers, we can first prove the result for noise level σ = 1, and then recover the general result by rescaling appropriately. Thus, we fix σ = 1 throughout the remainder of the proof so as to simplify notation. Moreover, let us recall that the GLRT consists of tests of the form φ_β(y) := I(T(y) ≥ β), where the likelihood ratio T(y) is given in equation (3.11a). Note here that the cut-off β ∈ [0, ∞) is a constant that does not depend on the data vector y. By the previously discussed equivalence (3.26), we can focus our attention on the simpler problem T({0}, K; ε), where K = C_2 ∩ C_1^*. By the monotonicity of the square function for positive numbers, the GLRT is controlled by the behavior of the statistic ‖Π_K(y)‖_2, and in particular how it varies depending on whether y is drawn according to H_0 or H_1. Letting g ∈ R^d denote a standard Gaussian random vector, let us introduce the random variable Z(θ) := ‖Π_K(θ + g)‖_2 for each θ ∈ R^d. Observe that the statistic ‖Π_K(y)‖_2 is distributed according to Z(0) under the null H_0, and according to Z(θ) for some θ ∈ K under the alternative H_1. Lemma A.4.1, which is stated and proved in Appendix A.4.1, guarantees that random variables of the type Z(θ) and ⟨θ, Π_K g⟩ are sharply concentrated around their expectations. As shown in the sequel, using the concentration bound (A.15a), the study of the GLRT can be reduced to the problem of bounding the mean difference

$$\Gamma(\theta) := \mathbb{E}\big( \|\Pi_K(\theta + g)\|_2 - \|\Pi_K g\|_2 \big) \tag{3.52}$$

for each θ ∈ K. In particular, in order to prove the achievability result stated in part (a) of Theorem 3.3.1, we need to lower bound Γ(θ) uniformly over θ ∈ K, whereas a uniform upper bound on Γ(θ) is required in order to prove the negative result in part (b).

3.5.1.1 Proof of GLRT achievability result (Theorem 3.3.1(a))

By assumption, we can restrict our attention to alternative distributions defined by vectors θ ∈ K satisfying the lower bound ‖θ‖²_2 ≥ B_ρ δ²_LR({0}, K), where for every target level ρ ∈ (0, 1), the constant B_ρ is chosen such that

$$B_\rho := \max\Bigg\{ 32\pi, \; \inf\Big\{ B > 0 \;\Big|\; \frac{B^{1/2}}{(2^7 \pi B)^{1/4} + 16} - \frac{2}{\sqrt{e}} \ge \sqrt{-8 \log(\rho/2)} \Big\} \Bigg\}.$$

Since the function f(x) := x^{1/2} / ((2^7 π x)^{1/4} + 16) − 2/√e is strictly increasing and diverges to infinity, the constant B_ρ defined above is always finite. We first claim that it suffices to show that for every such vector, the difference (3.52) is lower bounded as

$$\Gamma(\theta) \ge \frac{B_\rho^{1/2}}{(2^7 \pi B_\rho)^{1/4} + 16} - \frac{2}{\sqrt{e}} = f(B_\rho). \tag{3.53}$$

Taking inequality (3.53) as given for the moment, we claim that the test

$$\phi_\tau(y) = \mathbb{I}\big[ \|\Pi_K(y)\|_2 \ge \tau \big] \quad \text{with threshold } \tau := \tfrac{1}{2} f(B_\rho) + \mathbb{E}\big[\|\Pi_K(g)\|_2\big]$$

has uniform error probability controlled as

$$\mathcal{E}(\phi_\tau; \{0\}, K, \epsilon) := \mathbb{E}_0[\phi_\tau(y)] + \sup_{\theta \in K, \|\theta\|_2^2 \ge \epsilon^2} \mathbb{E}_\theta[1 - \phi_\tau(y)] \le 2 e^{-f^2(B_\rho)/8} < \rho. \tag{3.54}$$

where the last inequality follows from the definition of Bρ.

Establishing the error control (3.54) Beginning with errors under the null H_0, we have

$$\mathbb{E}_0[\phi_\tau(y)] = \mathbb{P}_0\big( \|\Pi_K g\|_2 \ge \tau \big) = \mathbb{P}_0\big( \|\Pi_K g\|_2 - \mathbb{E}[\|\Pi_K g\|_2] \ge f(B_\rho)/2 \big) \le \exp\big(-f^2(B_\rho)/8\big),$$

where the final inequality follows from the concentration bound (A.15a) in Lemma A.4.1, as long as f(B_ρ) > 0. On the other hand, we have

$$\sup_{\theta \in K, \|\theta\|_2^2 \ge \epsilon^2} \mathbb{E}_\theta[1 - \phi_\tau(y)] = \mathbb{P}\Big[ \|\Pi_K(\theta + g)\|_2 \le \tfrac{1}{2} f(B_\rho) + \mathbb{E}\|\Pi_K g\|_2 \Big] = \mathbb{P}\Big[ \|\Pi_K(\theta + g)\|_2 - \mathbb{E}\|\Pi_K(\theta + g)\|_2 \le \tfrac{1}{2} f(B_\rho) - \Gamma(\theta) \Big],$$

where the last equality follows by substituting Γ(θ) = E[‖Π_K(θ + g)‖_2] − E[‖Π_K g‖_2]. Since the lower bound (3.53) guarantees that (1/2) f(B_ρ) − Γ(θ) ≤ −(1/2) f(B_ρ), we find that

$$\sup_{\theta \in K, \|\theta\|_2^2 \ge \epsilon^2} \mathbb{E}_\theta[1 - \phi_\tau(y)] \le \mathbb{P}\Big[ \|\Pi_K(\theta + g)\|_2 - \mathbb{E}\|\Pi_K(\theta + g)\|_2 \le -\tfrac{1}{2} f(B_\rho) \Big] \le \exp\big(-f^2(B_\rho)/8\big),$$

where the final inequality again uses the concentration inequality (A.15a). Putting the pieces together yields the claim (3.54). The only remaining detail is to prove the lower bound (3.53) on the difference (3.52). To prove inequality (3.53), we make use of the following auxiliary lemma.

Lemma 3.5.1. For every closed convex cone K and vector θ ∈ K, we have the lower bound

$$\Gamma(\theta) \ge \frac{\|\theta\|_2^2}{2\|\theta\|_2 + 8\,\mathbb{E}\|\Pi_K g\|_2} - \frac{2}{\sqrt{e}}. \tag{3.55a}$$

Moreover, for any vector θ that also satisfies the inequality ⟨θ, EΠ_K g⟩ ≥ ‖θ‖²_2, we have

$$\Gamma(\theta) \ge \alpha^2(\theta)\, \frac{\langle \theta, \mathbb{E}\Pi_K g \rangle - \|\theta\|_2^2}{\alpha(\theta)\|\theta\|_2 + 2\,\mathbb{E}\|\Pi_K g\|_2} - \frac{2}{\sqrt{e}}, \tag{3.55b}$$

where α(θ) := 1 − exp(−⟨θ, EΠ_K g⟩² / (8‖θ‖²_2)).

See Appendix A.4.2 for the proof of this claim. We now use Lemma 3.5.1 to prove the lower bound (3.53). Note that the inequality ‖θ‖²_2 ≥ B_ρ δ²_LR({0}, K) implies that one of the following two lower bounds must hold:

$$\|\theta\|_2^2 \ge B_\rho\, \mathbb{E}\|\Pi_K g\|_2, \tag{3.56a}$$
$$\text{or} \quad \langle \theta, \mathbb{E}\Pi_K g \rangle \ge \sqrt{B_\rho}\, \mathbb{E}\|\Pi_K g\|_2. \tag{3.56b}$$

We will analyze these two cases separately.

Case 1 In order to show that the lower bound (3.56a) implies inequality (3.53), we will prove a stronger result, namely that the inequality ‖θ‖²_2 ≥ √(B_ρ) E‖Π_K g‖_2 / 2 implies that inequality (3.53) holds. From the lower bound (3.55a) and the fact that, for each fixed a > 0, the function x ↦ x²/(2x + a) is increasing on the interval [0, ∞), we find that

$$\Gamma(\theta) \ge \frac{\sqrt{B_\rho\, \mathbb{E}\|\Pi_K g\|_2}/2}{\sqrt{2}\, B_\rho^{1/4} + 8\sqrt{\mathbb{E}\|\Pi_K g\|_2}} - \frac{2}{\sqrt{e}}.$$

Further, using the general bound (3.21), namely E‖Π_K g‖_2 ≥ 1/√(2π), and the fact that the function x ↦ x/(a + x) is increasing in x, we obtain

$$\Gamma(\theta) \ge \frac{\sqrt{B_\rho}}{2(8\pi B_\rho)^{1/4} + 16} - \frac{2}{\sqrt{e}},$$

which ensures inequality (3.53).

Case 2 We now turn to the case when inequality (3.56b) is satisfied. We may assume that the inequality ‖θ‖²_2 ≥ √(B_ρ) E‖Π_K g‖_2 / 2 is violated, because otherwise inequality (3.53) follows immediately. When this inequality is violated, we have

$$\langle \theta, \mathbb{E}\Pi_K g \rangle \ge \sqrt{B_\rho}\, \mathbb{E}\|\Pi_K g\|_2 \quad \text{and} \quad \|\theta\|_2^2 < \sqrt{B_\rho}\, \mathbb{E}\|\Pi_K g\|_2 / 2. \tag{3.57}$$

Our strategy is to make use of inequality (3.55b), and we begin by bounding the quantity α appearing therein. By combining inequality (3.57) and inequality (3.21), namely E‖Π_K g‖_2 ≥ 1/√(2π), we find that

$$\alpha \ge 1 - \exp\Big(-\frac{\sqrt{B_\rho}\, \mathbb{E}\|\Pi_K g\|_2}{4}\Big) \ge 1 - \exp\Big(-\frac{\sqrt{B_\rho}}{4\sqrt{2\pi}}\Big) \ge 1/2, \quad \text{whenever } B_\rho \ge 32\pi.$$

Using expression (3.57), we deduce that

$$\Gamma(\theta) \ge \frac{\alpha^2 \sqrt{B_\rho\, \mathbb{E}\|\Pi_K g\|_2}}{\alpha (4 B_\rho)^{1/4} + 4\sqrt{\mathbb{E}\|\Pi_K g\|_2}} - \sqrt{\frac{2}{e}} \ge \frac{\sqrt{B_\rho\, \mathbb{E}\|\Pi_K g\|_2}}{(2^6 B_\rho)^{1/4} + 16\sqrt{\mathbb{E}\|\Pi_K g\|_2}} - \sqrt{\frac{2}{e}},$$

where the second inequality uses the previously obtained lower bound α > 1/2, and the fact that the function x ↦ x²/(x + b) is increasing in x. This completes the proof of inequality (3.53), thus completing the GLRT achievability result.

3.5.1.2 Proof of GLRT lower bound (Theorem 3.3.1(b))

We divide our proof into two scenarios, depending on whether or not EkΠK gk2 is less than 128.

Case E‖Π_K g‖_2 < 128 We begin by setting b_ρ = 1/256. The assumed bound ε² ≤ (1/256) δ²_LR({0}, K) then implies that

$$\epsilon^2 \le \frac{1}{256}\, \delta_{LR}^2(\{0\}, K) \le \frac{\mathbb{E}\|\Pi_K g\|_2}{256} < \frac{1}{2}.$$

For every ε² ≤ 1/2, we claim that E(φ; {0}, K, ε) ≥ 1/2. Note that the uniform error E(φ; {0}, K, ε) is at least as large as the error in the simple binary test

$$H_0 : y \sim N(0, I_d) \quad \text{versus} \quad H_1 : y \sim N(\theta, I_d), \tag{3.58a}$$

where θ ∈ K is any vector such that ‖θ‖_2 = ε. We claim that the error for the simple binary test (3.58a) is lower bounded as

$$\inf_\psi \mathcal{E}(\psi; \{0\}, \{\theta\}, \epsilon) \ge 1/2 \quad \text{whenever } \epsilon^2 \le 1/2. \tag{3.58b}$$

The proof of this claim is straightforward: introducing the shorthand P_θ = N(θ, I_d) and P_0 = N(0, I_d), we have

$$\inf_\psi \mathcal{E}(\psi; \{0\}, \{\theta\}, \epsilon) = 1 - \|P_\theta - P_0\|_{TV}.$$

Using the relation between the χ² distance and the TV distance in expression (A.1c), and the fact that χ²(P_θ, P_0) = exp(ε²) − 1, we find that the testing error satisfies

$$\inf_\psi \mathcal{E}(\psi; \{0\}, \{\theta\}, \epsilon) \ge 1 - \frac{1}{2}\sqrt{\exp(\epsilon^2) - 1} \ge 1/2, \quad \text{whenever } \epsilon^2 \le 1/2.$$

(See Section A.2 for more details on the relation between the TV and χ²-distances.) This completes the proof under the condition E‖Π_K g‖_2 < 128.

Case E‖Π_K g‖_2 ≥ 128 In this case, our strategy is to exhibit some θ ∈ H_1 for which the expected difference Γ(θ) = E(‖Π_K(θ + g)‖_2 − ‖Π_K g‖_2) is small, which then leads to a significant error when using the GLRT. In order to do so, we require an auxiliary lemma (Lemma A.5.1) to suitably control Γ(θ), which is stated and proved in Appendix A.5.1. We now proceed to prove our main claim. Based on Lemma A.5.1, we claim that if ε² ≤ b_ρ δ²_LR({0}, K) for a suitably small constant b_ρ such that

$$b_\rho := \sup\Big\{ b_\rho > 0 \;\Big|\; 12\sqrt{b_\rho} + 3\sqrt{\tfrac{2 b_\rho}{e}} + 24\Big(\tfrac{b_\rho}{2e}\Big)^{1/4} \le \tfrac{1}{16} \Big\},$$

then

$$\Gamma(\theta) \le \frac{1}{16}, \quad \text{for some } \theta \in K, \; \|\theta\|_2 \ge \epsilon. \tag{3.59}$$

We take inequality (3.59) as given for now, returning to prove it in Appendix A.5.2. In summary, then, we have exhibited some θ ∈ H_1, namely a vector θ ∈ K with ‖θ‖_2 ≥ ε, such that Γ(θ) ≤ 1/16. This special vector θ plays a central role in our proof. We claim that the GLRT cannot succeed with error smaller than 0.11 no matter how the cut-off β is chosen. In order to see this, we first need the following lemma, which allows us to relate ‖Π_K g‖_2 to its expectation:

Lemma 3.5.2. Given every closed convex cone K such that EkΠK gk2 ≥ 128, we have

$$\mathbb{P}\big( \|\Pi_K g\|_2 > \mathbb{E}\|\Pi_K g\|_2 \big) > 7/16. \tag{3.60}$$

Case 1 First, consider a threshold β ∈ [0, E‖Π_K g‖_2]. It then follows immediately from inequality (3.60) that the type I error on its own satisfies

$$\text{type I error} = \mathbb{P}_0\big( \|\Pi_K y\|_2 \ge \beta \big) \ge \mathbb{P}\big( \|\Pi_K g\|_2 \ge \mathbb{E}\|\Pi_K g\|_2 \big) \ge \frac{7}{16}.$$

Case 2 Otherwise, consider a threshold β ∈ (E‖Π_K g‖_2, E‖Π_K(θ + g)‖_2]. In this case, we again use inequality (3.60) to bound the type I error, namely

$$\text{type I error} = \mathbb{P}_0\big( \|\Pi_K y\|_2 \ge \beta \big) = \mathbb{P}\big[ \|\Pi_K g\|_2 \ge \mathbb{E}\|\Pi_K g\|_2 \big] - \mathbb{P}\big[ \|\Pi_K g\|_2 \in [\mathbb{E}\|\Pi_K g\|_2, \beta) \big] \ge \frac{7}{16} - \max_x\big\{ f_{\|\Pi_K g\|_2}(x) \big\}\,\big(\beta - \mathbb{E}\|\Pi_K g\|_2\big),$$

where we use f_{‖Π_K g‖_2} to denote the density function of the random variable ‖Π_K g‖_2. As discussed earlier, the random variable ‖Π_K g‖_2 is distributed as a mixture of χ-distributions; in particular, see Lemma 3.5.2 above and the surrounding discussion for details. As can be verified by direct numerical calculation, any χ_k variable has a density that is bounded from above by 4/5. Using this fact, we have

$$\text{type I error} \ge \frac{7}{16} - \frac{4}{5}\big(\beta - \mathbb{E}\|\Pi_K g\|_2\big) \overset{(i)}{\ge} \frac{7}{16} - \frac{4}{5}\,\Gamma(\theta) \overset{(ii)}{>} 3/8,$$

where step (i) follows from the assumption that β belongs to the interval (E‖Π_K g‖_2, E‖Π_K(θ + g)‖_2], and step (ii) follows since Γ(θ) ≤ 1/16.

Case 3 Otherwise, given a threshold β ∈ (E‖Π_K(g + θ)‖_2, ∞), we define the scalar x := β − E‖Π_K(g + θ)‖_2. From the concentration inequality given in Lemma A.4.1, we can deduce that

$$\text{type II error} \ge \mathbb{P}_\theta\big( \|\Pi_K y\|_2 \le \beta \big) = 1 - \mathbb{P}\big( \|\Pi_K(\theta + g)\|_2 - \mathbb{E}\|\Pi_K(\theta + g)\|_2 > \beta - \mathbb{E}\|\Pi_K(\theta + g)\|_2 \big) \ge 1 - \exp(-x^2/2).$$

At the same time,

$$\text{type I error} = \mathbb{P}_0\big( \|\Pi_K y\|_2 \ge \beta \big) = \mathbb{P}\big( \|\Pi_K g\|_2 \ge \mathbb{E}\|\Pi_K g\|_2 \big) - \mathbb{P}\big( \|\Pi_K g\|_2 \in [\mathbb{E}\|\Pi_K g\|_2, \beta) \big) \ge \frac{7}{16} - \frac{4}{5}\big( \beta - \mathbb{E}\|\Pi_K g\|_2 \big),$$

where we again use inequality (3.60) and the boundedness of the density of ‖Π_K g‖_2. Recalling that we have defined x := β − E‖Π_K(g + θ)‖_2 as well as Γ(θ) = E(‖Π_K(θ + g)‖_2 − ‖Π_K g‖_2), we have

$$\beta - \mathbb{E}\|\Pi_K g\|_2 = x + \Gamma(\theta) \le x + \frac{1}{16},$$

where the last step uses the fact that Γ(θ) ≤ 1/16. Consequently, the type I error is lower bounded as

$$\text{type I error} \ge \frac{7}{16} - \frac{4}{5}\big( x + 1/16 \big) = \frac{31}{80} - \frac{4}{5} x.$$

Combining the two types of error, we find that the testing error is lower bounded as

$$\inf_{x > 0} \Big\{ \Big( \frac{31}{80} - \frac{4}{5} x \Big)_+ + 1 - \exp(-x^2/2) \Big\} = 1 - \exp\Big( -\frac{31^2}{2 \times 64^2} \Big) \ge 0.11.$$

Putting the pieces together, the GLRT cannot succeed with error smaller than 0.11 no matter how the cut-off β is chosen.
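The final numerical constant can be double-checked directly; the short computation below (added here as a sanity check) evaluates the infimum over a grid and confirms the value 1 − exp(−31²/(2·64²)) ≈ 0.111.

```python
import numpy as np

# numerical check of inf_{x>0} (31/80 - 4x/5)_+ + 1 - exp(-x^2/2)
x = np.linspace(1e-6, 3.0, 200001)
vals = np.maximum(31/80 - 4*x/5, 0.0) + 1.0 - np.exp(-x**2 / 2)
print(vals.min(), 1 - np.exp(-31**2 / (2 * 64**2)))   # both approx 0.1107 >= 0.11
```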

3.5.2 Proof of Theorem 3.3.2

We now turn to the proof of Theorem 3.3.2. As in the proof of Theorem 3.3.1, we can assume without loss of generality that σ = 1. Since 0 ∈ C_1 and K := C_2 ∩ C_1^* ⊆ C_2, it suffices to prove a lower bound for the reduced problem of testing

$$H_0 : \theta = 0, \quad \text{versus} \quad H_1 : \|\theta\|_2 \ge \epsilon, \; \theta \in K.$$

Let B(1) = {θ ∈ R^d | ‖θ‖_2 < 1} denote the open Euclidean ball of radius 1, and let B^c(1) := R^d \ B(1) denote its complement. We divide our analysis into two cases, depending on whether or not E‖Π_K g‖_2 is less than 7. In both cases, let us set κ_ρ = 1/14.

Case 1 Suppose that E‖Π_K g‖_2 < 7. In this case, ε² ≤ κ_ρ δ²_OPT({0}, K) ≤ κ_ρ E‖Π_K g‖_2 < 1/2. Similar to Case 1 in our proof of Theorem 3.3.1(b), by reducing to the simple-versus-simple testing problem (3.58a), any test yields a testing error no smaller than 1/2 if ε² < 1/2. So our lower bound holds directly in the case when E‖Π_K g‖_2 < 7.

Case 2 Otherwise, suppose we have EkΠK gk2 ≥ 7. The following lemma provides a generic way to lower bound the testing error.

Lemma 3.5.3. For every non-trivial closed convex cone K and probability measure Q supported on K ∩ B^c(1), the testing error is lower bounded as

$$\inf_\psi \mathcal{E}(\psi; \{0\}, K, \epsilon) \ge 1 - \frac{1}{2}\sqrt{\mathbb{E}_{\eta,\eta'}\exp\big(\epsilon^2 \langle \eta, \eta' \rangle\big) - 1}, \tag{3.61}$$

where E_{η,η'} denotes expectation with respect to an i.i.d. pair η, η' ~ Q. See Appendix A.6.1 for the proof of this claim. We apply Lemma 3.5.3 with the probability measure Q defined as

$$Q(A) := \mathbb{P}\Big( \frac{\Pi_K g}{\mathbb{E}\|\Pi_K g\|_2 / 2} \in A \;\Big|\; \|\Pi_K g\|_2 \ge \mathbb{E}\|\Pi_K g\|_2 / 2 \Big), \tag{3.62}$$

for each measurable set A ⊂ R^d, where g denotes a standard d-dimensional Gaussian random vector, i.e., g ~ N(0, I_d). It is easy to check that the measure Q is supported on K ∩ B^c(1). We make use of Lemma A.6.1 in Appendix A.6.2 to control E_{η,η'} exp(ε² ⟨η, η'⟩) and thereby upper bound the testing error. We now lower bound the testing error when ε² ≤ κ_ρ δ²_OPT({0}, K). By the definition of δ²_OPT({0}, K), the inequality ε² ≤ κ_ρ δ²_OPT({0}, K) implies that

$$\epsilon^2 \le \kappa_\rho\, \mathbb{E}\|\Pi_K g\|_2 \quad \text{and} \quad \epsilon^2 \le \kappa_\rho \bigg( \frac{\mathbb{E}\|\Pi_K g\|_2}{\|\mathbb{E}\Pi_K g\|_2} \bigg)^2.$$

The first inequality above implies, with κ_ρ = 1/14, that ε² ≤ E‖Π_K g‖_2 / 14 ≤ (E‖Π_K g‖_2)² / 32 (note that E‖Π_K g‖_2 ≥ 7). Therefore the assumption in Lemma A.6.1 is satisfied, so that inequality (A.40) gives

$$\mathbb{E}_{\eta,\eta'}\exp\big(\epsilon^2 \langle \eta, \eta' \rangle\big) \le \frac{1}{a^2}\exp\bigg( 5\kappa_\rho + \frac{40\kappa_\rho^2\, \mathbb{E}\|\Pi_K g\|_2^2}{(\mathbb{E}\|\Pi_K g\|_2)^2} \bigg). \tag{3.63}$$

So it suffices to control the right-hand side above. From the concentration result in Lemma A.4.1, we obtain

$$a = \mathbb{P}\Big( \|\Pi_K g\|_2 - \mathbb{E}\|\Pi_K g\|_2 \ge -\tfrac{1}{2}\mathbb{E}\|\Pi_K g\|_2 \Big) \ge 1 - \exp\Big( -\frac{(\mathbb{E}\|\Pi_K g\|_2)^2}{8} \Big) > 1 - \exp(-6),$$

where the last step uses E‖Π_K g‖_2 ≥ 7, and

$$\mathbb{E}\|\Pi_K g\|_2^2 = (\mathbb{E}\|\Pi_K g\|_2)^2 + \operatorname{var}(\|\Pi_K g\|_2) \le (\mathbb{E}\|\Pi_K g\|_2)^2 + 4.$$

Here the last inequality follows from the fact that var(kΠK gk2) ≤ 4—see Lemma A.4.1. Plugging these two inequalities into expression (3.63) gives

$$\mathbb{E}_{\eta,\eta'}\exp\big(\epsilon^2 \langle \eta, \eta' \rangle\big) \le \Big( \frac{1}{1 - \exp(-6)} \Big)^2 \exp\bigg( 5\kappa_\rho + 40\kappa_\rho^2 + \frac{160\kappa_\rho^2}{(\mathbb{E}\|\Pi_K g\|_2)^2} \bigg),$$

where the right-hand side is less than 2 when κ_ρ = 1/14 and E‖Π_K g‖_2 ≥ 7. Combining with inequality (3.61) forces the testing error to be lower bounded as

$$\mathcal{E}(\psi; \{0\}, K, \epsilon) \ge 1 - \frac{1}{2}\sqrt{\mathbb{E}_{\eta,\eta'}\exp\big(\epsilon^2 \langle \eta, \eta' \rangle\big) - 1} \ge \frac{1}{2} > \rho \quad \text{for every test } \psi,$$

which completes the proof of Theorem 3.3.2.

Chapter 4

Adaptive estimation of planar convex sets

4.1 Introduction

In this chapter, we discuss the problem of nonparametric estimation of an unknown planar compact, convex set from noisy measurements of its support function on a uniform grid. Before describing the details of the problem, let us first introduce the support function. For a compact, convex set K in R2, its support function is defined by

$$h_K(\theta) := \max_{(x_1, x_2) \in K} \big( x_1 \cos\theta + x_2 \sin\theta \big) \quad \text{for } \theta \in \mathbb{R}.$$

Note that h_K is a periodic function with period 2π. It is useful to think about θ in terms of the direction (cos θ, sin θ). The line x_1 cos θ + x_2 sin θ = h_K(θ) is a support line for K (i.e., it touches K and K lies on one side of it). Conversely, every support line of K is of this form for some θ. The convex set K is completely determined by its support function h_K, because K = ∩_θ {(x_1, x_2) : x_1 cos θ + x_2 sin θ ≤ h_K(θ)}. The support function h_K possesses the circle-convexity property (see, e.g., [140]): for every α_1 > α > α_2 with 0 < α_1 − α_2 < π,

$$\frac{h_K(\alpha_1)}{\sin(\alpha_1 - \alpha)} + \frac{h_K(\alpha_2)}{\sin(\alpha - \alpha_2)} \ge \frac{\sin(\alpha_1 - \alpha_2)}{\sin(\alpha_1 - \alpha)\,\sin(\alpha - \alpha_2)}\, h_K(\alpha). \tag{4.1}$$

Moreover the above inequality characterizes h_K, i.e., any periodic function of period 2π satisfying the above inequality equals h_K for a unique compact, convex subset K of R². The circle-convexity property (4.1) is clearly related to the usual convexity property. Indeed, replacing sin α by α in (4.1) leads to the condition for convexity. In spite of this similarity, (4.1) is different from convexity, as can be seen from the example of the function h(θ) = |sin θ|, which satisfies (4.1) but is clearly not convex.
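Since a planar polygon attains its support function at a vertex, both h_K and the inequality (4.1) are easy to evaluate numerically. The following sketch (added for illustration; the unit square and the particular angles are arbitrary choices) computes h_K on a few directions and verifies an instance of circle-convexity.

```python
import numpy as np

def support_function(vertices, thetas):
    """h_K(theta) = max_{(x1, x2) in K} (x1 cos theta + x2 sin theta);
    for a polygon the maximum is attained at a vertex."""
    dirs = np.column_stack([np.cos(thetas), np.sin(thetas)])
    return (dirs @ vertices.T).max(axis=1)

square = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
a2, a, a1 = 0.1, 0.4, 0.9                 # alpha_2 < alpha < alpha_1, with alpha_1 - alpha_2 < pi
h2, h, h1 = support_function(square, np.array([a2, a, a1]))
lhs = h1 / np.sin(a1 - a) + h2 / np.sin(a - a2)
rhs = np.sin(a1 - a2) / (np.sin(a1 - a) * np.sin(a - a2)) * h
print(lhs >= rhs - 1e-12)                  # the circle-convexity inequality (4.1) holds
```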

4.1.1 The problem, motivations, and background

We are now ready to describe the problem studied in this chapter. Let K^* be an unknown compact, convex set in R². We study the problem of estimating K^* or h_{K^*} from noisy measurements of h_{K^*}. Specifically, we observe data (θ_1, Y_1), ..., (θ_n, Y_n) drawn according to the model

$$Y_i = h_{K^*}(\theta_i) + \xi_i \quad \text{for } i = 1, \ldots, n, \tag{4.2}$$

where θ_1, ..., θ_n are fixed grid points in (−π, π] and ξ_1, ..., ξ_n are i.i.d. Gaussian random variables with mean zero and known variance σ². We focus on the dual problems of estimating the scalar quantity h_{K^*}(θ_i) for each 1 ≤ i ≤ n as well as the convex set K^*. In this chapter, we propose data-driven adaptive estimators and establish their optimality for both of these problems. The problem considered here has a range of applications in engineering. The regression model (4.2) was first proposed and studied by Prince and Willsky [115], who were motivated by an application to Computed Tomography. Lele et al. [95] showed how solutions to this problem can be applied to target reconstruction from resolved laser-radar measurements in the presence of registration errors. Gregor and Rannou [67] considered an application to Projection Magnetic Resonance Imaging. It is also a fundamental problem in geometric tomography; see Gardner [57]. Another application domain where this problem might plausibly arise is robotic tactile sensing, as has been suggested by Prince and Willsky [115]. Finally, this is a natural shape constrained estimation problem and fits right into the recent literature on shape constrained estimation (e.g., [70]). Most proposed procedures for estimating K^* in this setting are based on least squares minimization. The least squares estimator K̂_ls is defined as any minimizer of Σ_{i=1}^n (Y_i − h_K(θ_i))² as K ranges over all compact convex sets. The minimizer in this optimization problem is not unique, and one can always take it to be a polytope. This estimator was first proposed by [115], who also proposed an algorithm for computing it based on quadratic programming. Further algorithms for computing K̂_ls were proposed in [115, 95, 58]. The theoretical performance of the least squares estimator was first considered by Gardner et al. [59], who mainly studied its accuracy for estimating K^* under the natural fixed design loss:

$$L_f(K^*, \hat{K}) := \frac{1}{n} \sum_{i=1}^n \big( h_{K^*}(\theta_i) - h_{\hat{K}}(\theta_i) \big)^2. \tag{4.3}$$

The key result of Gardner et al. [59] (specialized to the planar case that we are studying) states that L_f(K^*, K̂_ls) = O(n^{−4/5}) as n → ∞ almost surely, provided K^* is contained in a ball of bounded radius. This result is complemented by the minimax lower bound in Guntuboyina [74], where it was shown that n^{−4/5} is the minimax rate for this problem. These two results together imply minimax optimality of K̂_ls under the loss function L_f. No other theoretical results for this problem are available outside of those in Gardner et al. [59] and Guntuboyina [74].
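For intuition about the least squares estimator, here is a small sketch of one standard finite-dimensional formulation (added for illustration; not the exact algorithm of the references): on a uniform grid, the circle-convexity constraint (4.1) applied to consecutive cyclic triples gives linear constraints, so the fit is a quadratic program. The sketch assumes the off-the-shelf solver cvxpy and that these consecutive-triple constraints are an adequate discretization.

```python
import numpy as np
import cvxpy as cp

def support_lse(Y):
    """Least squares fit of support-function values on the uniform grid
    theta_i = -pi + 2*pi*i/n, with (4.1) imposed on consecutive cyclic triples."""
    n = len(Y)
    gap = 2 * np.pi / n                    # uniform spacing, so the sines are constants
    h = cp.Variable(n)
    cons = [h[i] * np.sin(2 * gap) <= (h[(i - 1) % n] + h[(i + 1) % n]) * np.sin(gap)
            for i in range(n)]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(Y - h)), cons)
    prob.solve()
    return h.value

rng = np.random.default_rng(0)
n, sigma = 100, 0.2
thetas = -np.pi + 2 * np.pi * np.arange(1, n + 1) / n
h_true = np.abs(np.cos(thetas)) + np.abs(np.sin(thetas))   # support function of the unit square
Y = h_true + sigma * rng.standard_normal(n)
h_hat = support_lse(Y)
print(np.mean((h_hat - h_true) ** 2))      # an estimate of the fixed-design loss (4.3)
```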

As a result, the following basic questions are still unanswered:

1. How to optimally and adaptively estimate h_{K^*}(θ_i) for a fixed i ∈ {1, ..., n}? This is the pointwise estimation problem. In the literature on shape constrained estimation, pointwise estimation has been well studied. Prominent examples include [23, 152, 68, 69, 34, 36, 83] for monotonicity constrained estimation and [77, 100, 71, 72, 29] for convexity constrained estimation. For the problem considered here, however, nothing is known about pointwise estimation. It may be noted that the result L_f(K^*, K̂_ls) = O(n^{−4/5}) of Gardner et al. [59] does not say anything about the accuracy of h_{K̂_ls}(θ_i) as an estimator for h_{K^*}(θ_i).

2. How to construct minimax optimal estimators for the set K^* that also adapt to polytopes? Polytopes with a small number of extreme points have a much simpler structure than general convex sets. In the problem of estimating convex sets under more standard observation models different from the one studied here, it is possible to construct estimators that converge at faster rates for polytopes compared to the overall minimax rate (see Brunk [22] for a summary of this theory). Similar kinds of adaptation have been recently studied for monotonicity and convexity constrained estimation problems, see [75, 38, 8]. Based on these results, it is natural to expect minimax estimators that adapt to polytopes in this problem. This has not been addressed previously.

4.1.2 Overview of our results

We will answer both of the above questions in the affirmative in this chapter. The main contributions can be summarized as follows:

1. We study the pointwise adaptive estimation problem in detail in a decision theoretic framework where the focus is on the performance at every function, instead of the maximum risk over a large parameter space as in the conventional minimax theory of the nonparametric estimation literature. Recall that this framework, which has been discussed in Section 2.2.1, was first introduced in Cai and Low [29] for shape constrained regression and provides a much more precise characterization of the performance of an estimator than the conventional minimax theory does.

In the context of the present problem, the difficulty of estimating hK∗(θi) at a given K∗ and θi can be expressed by means of a benchmark Rn(K∗, θ), which is defined as follows (below EL denotes expectation taken with respect to the joint distribution of Y1, . . . , Yn generated according to the model (4.2) with K∗ replaced by L):
\[
R_n(K^*, \theta) = \sup_{L}\,\inf_{\tilde h}\,\max\Big\{\mathbb{E}_{K^*}\big(\tilde h - h_{K^*}(\theta)\big)^2,\ \mathbb{E}_{L}\big(\tilde h - h_{L}(\theta)\big)^2\Big\}, \tag{4.4}
\]
where the supremum above is taken over all compact, convex sets L while the infimum is over all estimators h̃. In our first result for pointwise estimation, we establish, for each i ∈ {1, . . . , n}, a lower bound for the performance of every estimator of hK∗(θi). Specifically, it is shown that

\[
R_n(K^*, \theta_i) \ge c\cdot\frac{\sigma^2}{k_*(i)+1}, \tag{4.5}
\]

where k∗(i) is an integer for which an explicit formula can be given in terms of K∗ and i, and c is a universal positive constant. It will turn out that k∗(i) is related to the smoothness of hK∗(θ) at θ = θi. We construct a data-driven estimator hˆi of hK∗(θi) based on local smoothing together with an optimization scheme for automatically choosing a bandwidth, and show that the estimator hˆi satisfies

\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le C\cdot\frac{\sigma^2}{k_*(i)+1} \tag{4.6}
\]
for a universal constant C > 0. Inequalities (4.5) and (4.6) (see also inequality (4.21)) together imply that hˆi is, within a constant factor, an optimal estimator of hK∗(θi) for every compact, convex set K∗. This optimality is much stronger than the traditional minimax optimality usually employed in nonparametric function estimation. The quantity σ²/(k∗(i) + 1) depends on the unknown set K∗ in a similar way that the Fisher information depends on the unknown parameter in a regular parametric model. In contrast, the optimal rate in the minimax paradigm is given in terms of the worst case performance over a large parameter space and does not depend on individual parameter values.

2. Using the optimal adaptive point estimators hˆ1, . . . , hˆn, we construct two set estimators Kˆ and Kˆ′. The details of this construction are given in Section 4.2.2. In Theorems 4.3.3 and 4.3.5, it is shown that Kˆ is minimax optimal for K∗ under the loss function Lf while the estimator Kˆ′ is minimax optimal under the integral squared loss function defined by
\[
L(K^*, \hat K') := \int_{-\pi}^{\pi}\big(h_{\hat K'}(\theta) - h_{K^*}(\theta)\big)^2\,d\theta. \tag{4.7}
\]
The square root of the above loss function is often referred to as the McClure-Vitale metric on the space of non-empty compact, convex sets (e.g. [102, 43]). In Theorem 4.3.3, we prove that

 √ !4/5  2 2  ∗ ˆ σ σ R EK∗ Lf (K , K) ≤ C + (4.8)  n n 

provided K∗ is contained in a ball of radius R. This, combined with the minimax lower bound in Guntuboyina [74], proves the minimax optimality of Kˆ. An analogous result is shown in Theorem 4.3.5 for EK∗L(K∗, Kˆ′). For the pointwise estimation problem where the goal is to estimate hK∗(θi), the optimal rate σ²/(k∗(i) + 1) can be as large as n−2/3. However, the bound (4.8) shows that globally the risk is at most n−4/5. The shape constraint given by convexity of K∗ ensures that the points where the pointwise estimation rate is n−2/3 cannot be too many. Note that we make no smoothness assumptions for proving (4.8).

3. We show that our set estimators Kˆ and Kˆ′ adapt to polytopes with a bounded number of extreme points. Already, inequality (4.8) implies that EK∗Lf(K∗, Kˆ) is bounded from above by the parametric risk Cσ²/n provided R = 0 (note that R = 0 means that K∗ is a singleton). Because σ²/n is much smaller than n−4/5, the bound (4.8) shows that Kˆ adapts to singletons. Theorem 4.3.4 extends this adaptation phenomenon to polytopes: we show that EK∗Lf(K∗, Kˆ) is bounded by the parametric rate (up to a multiplicative factor logarithmic in n) for all polytopes with a bounded number of extreme points. An analogous result is also proved for EK∗L(K∗, Kˆ′) in Theorem 4.3.5. It should be noted that the construction of our estimators Kˆ and Kˆ′ (described in Section 4.2.2) does not involve any special treatment for polytopes; yet the estimators automatically achieve faster rates for polytopes.

We would like to stress two features of the results in this chapter: (a) we do not make any smoothness assumptions on the boundary of K∗ throughout; in particular, note that we obtain the n−4/5 rate for the set estimators Kˆ and Kˆ′ without any smoothness assumptions, and (b) we go beyond the traditional minimax paradigm by considering adaptive estimation in both the pointwise estimation problem and the problem of estimating the entire set K∗. In particular, pointwise estimation is studied in a general non-asymptotic framework, which evaluates the performance of a procedure at each individual set K∗, not the worst case performance over a large parameter space as in the conventional minimax theory.

The remainder of this chapter is structured as follows. The proposed estimators are described in detail in Section 4.2. The theoretical properties are analyzed in Section 4.3; Section 4.3.1 gives results for pointwise estimation while Section 4.3.2 deals with set estimators. Section 4.4 considers optimal estimation of some special compact convex sets K∗ where we explicitly compute the associated rates of convergence. A simulation study is given in Section 4.5 where we compare the performance of our estimators to other existing estimators in the literature. In Section 4.6, we summarize our main results and discuss potential open problems for future work. The proofs of the main results are given in Section 4.7. Proofs of other results, together with additional technical results, are given in Chapter B.

4.2 Estimation procedures

Recall the regression model (4.2), where we observe noisy measurements (θ1, Y1), . . . , (θn, Yn) with θi = 2πi/n − π, i = 1, . . . , n, being fixed grid points in (−π, π]. In this section, we first describe in detail our estimate hˆi of hK∗(θi) for each i. Subsequently, we will put together these estimates hˆ1, . . . , hˆn to yield set estimators for K∗.

4.2.1 Estimators for hK∗(θi) for each fixed i

Fixing 1 ≤ i ≤ n, our construction of the estimator hˆi for hK∗(θi) is based on the key circle-convexity property (4.1) of the function hK∗(·). Let us define, for φ ∈ (0, π/2) and θ ∈ (−π, π], the following two quantities:

\[
l(\theta, \phi) := \cos\phi\,\big(h_{K^*}(\theta+\phi) + h_{K^*}(\theta-\phi)\big) - \frac{h_{K^*}(\theta+2\phi) + h_{K^*}(\theta-2\phi)}{2}
\quad\text{and}\quad
u(\theta, \phi) := \frac{h_{K^*}(\theta+\phi) + h_{K^*}(\theta-\phi)}{2\cos\phi}.
\]

The following lemma states that for every θ, the quantity hK∗ (θ) is sandwiched between l(θ, φ) and u(θ, φ) for every φ. This will be used crucially in defining hˆ. The proof of this lemma is a straightforward consequence of (4.1) and is given in Section B.1.6.

Lemma 4.2.1. For every 0 < φ < π/2 and every θ ∈ (−π, π], we have l(θ, φ) ≤ hK∗ (θ) ≤ u(θ, φ).
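As a quick illustration of the lemma (a worked example added here, not part of the original argument), take K∗ to be the unit ball centered at the origin, so that hK∗(θ) ≡ 1. Then
\[
l(\theta, \phi) = 2\cos\phi - 1 \ \le\ 1 \ \le\ \frac{1}{\cos\phi} = u(\theta, \phi) \qquad\text{for every } 0 < \phi < \frac{\pi}{2},
\]
and both bounds collapse to hK∗(θ) = 1 as φ → 0.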

For a fixed 1 ≤ i ≤ n, Lemma 4.2.1 implies that l(θi, 2πj/n) ≤ hK∗(θi) ≤ u(θi, 2πj/n) for every 0 ≤ j < ⌊n/4⌋. Note that when j = 0, we have l(θi, 0) = hK∗(θi) = u(θi, 0). Averaging these inequalities for j = 0, 1, . . . , k, where k is a fixed integer with 0 ≤ k < ⌊n/4⌋, we obtain
\[
L_k(\theta_i) \le h_{K^*}(\theta_i) \le U_k(\theta_i) \quad\text{for every } 0 \le k < \lfloor n/4\rfloor, \tag{4.9}
\]
where
\[
L_k(\theta_i) := \frac{1}{k+1}\sum_{j=0}^{k} l\Big(\theta_i, \frac{2\pi j}{n}\Big) \quad\text{and}\quad U_k(\theta_i) := \frac{1}{k+1}\sum_{j=0}^{k} u\Big(\theta_i, \frac{2\pi j}{n}\Big).
\]

We are now ready to describe our estimator. Fix 1 ≤ i ≤ n. Inequality (4.9) says that the quantity of interest, hK∗(θi), is sandwiched between Lk(θi) and Uk(θi) for every k. Both Lk(θi) and Uk(θi) can naturally be estimated by unbiased estimators. Indeed, let
\[
\hat l(\theta_i, 2j\pi/n) := \cos(2j\pi/n)\big(Y_{i+j} + Y_{i-j}\big) - \frac{Y_{i+2j} + Y_{i-2j}}{2}, \qquad
\hat u(\theta_i, 2j\pi/n) := \frac{Y_{i+j} + Y_{i-j}}{2\cos(2j\pi/n)},
\]
and take
\[
\hat L_k(\theta_i) := \frac{1}{k+1}\sum_{j=0}^{k}\hat l(\theta_i, 2j\pi/n), \qquad \hat U_k(\theta_i) := \frac{1}{k+1}\sum_{j=0}^{k}\hat u(\theta_i, 2j\pi/n). \tag{4.10}
\]

Obviously, in order for the above to be meaningful, we need to define Yi even for i ∉ {1, . . . , n}. This is easily done in the following way: for any i ∈ Z, let s ∈ Z be such that i − sn ∈ {1, . . . , n} and take Yi := Yi−sn. As k increases, one averages more terms in (4.10) and hence the estimators Lˆk(θi) and Uˆk(θi) become more accurate. Let ∆ˆk(θi) := Uˆk(θi) − Lˆk(θi), which is the same as

\[
\hat\Delta_k(\theta_i) = \frac{1}{k+1}\sum_{j=0}^{k}\left(\frac{Y_{i+2j} + Y_{i-2j}}{2} - \frac{\cos(4j\pi/n)}{\cos(2j\pi/n)}\cdot\frac{Y_{i+j} + Y_{i-j}}{2}\right). \tag{4.11}
\]

Because of (4.9), a natural strategy for estimating hK∗(θi) is to choose the k for which ∆ˆk(θi) is the smallest and then use either Lˆk(θi) or Uˆk(θi) at that k as the estimator. This is essentially our estimator, with one small difference in that we also take into account the noise present in ∆ˆk(θi). Formally, our estimator for hK∗(θi) is given by
\[
\hat h_i = \hat U_{\hat k(i)}(\theta_i), \quad\text{where}\quad \hat k(i) := \mathop{\mathrm{argmin}}_{k\in I}\left\{\big(\hat\Delta_k(\theta_i)\big)_{+} + \frac{2\sigma}{\sqrt{k+1}}\right\} \tag{4.12}
\]
and I := {0} ∪ {2^j : j ≥ 0 and 2^j ≤ ⌊n/16⌋}. Our estimator hˆi can be viewed as an angle-adjusted local averaging estimator. It is inspired by the estimator of Cai and Low [29] for convex regression. The number of terms averaged equals kˆ(i) + 1 and this is analogous to the bandwidth in kernel-based smoothing methods. Our kˆ(i) is determined from an optimization scheme. Notice that, unlike the least squares estimator hKˆls(θi), the construction of hˆi for a fixed i does not depend on the construction of hˆj for j ≠ i.
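For concreteness, a minimal Python sketch of this estimator is given below; it is an added illustration (NumPy assumed, function names ours), with the observations stored so that y[j−1] = Yj. The criterion uses the positive part of ∆ˆk(θi), which can be negative due to noise, as in the reconstruction of (4.12) above.

import numpy as np

def lae_point_estimate(y, i, sigma):
    # Local averaging estimate of h_{K*}(theta_i), following (4.10)-(4.12).
    # y: length-n array with y[j-1] = Y_j on the grid theta_j = 2*pi*j/n - pi.
    # i: 1-based index of the target angle; sigma: noise standard deviation.
    n = len(y)
    Y = lambda j: y[(j - 1) % n]          # periodic extension Y_j := Y_{j - sn}

    I = [0]                               # I = {0} union {2^j : 2^j <= floor(n/16)}
    p = 1
    while p <= n // 16:
        I.append(p)
        p *= 2

    best_crit, best_U = np.inf, None
    for k in I:
        u_hat = [(Y(i + j) + Y(i - j)) / (2.0 * np.cos(2 * np.pi * j / n))
                 for j in range(k + 1)]
        l_hat = [np.cos(2 * np.pi * j / n) * (Y(i + j) + Y(i - j))
                 - (Y(i + 2 * j) + Y(i - 2 * j)) / 2.0
                 for j in range(k + 1)]
        U_k, L_k = np.mean(u_hat), np.mean(l_hat)
        delta_k = U_k - L_k               # hat{Delta}_k(theta_i), cf. (4.11)
        crit = max(delta_k, 0.0) + 2.0 * sigma / np.sqrt(k + 1)
        if crit < best_crit:
            best_crit, best_U = crit, U_k
    return best_U                         # hat{h}_i = hat{U}_{hat{k}(i)}(theta_i)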

4.2.2 Set estimators for K∗

We next present estimators for the set K∗. The point estimators hˆ1, . . . , hˆn do not directly give an estimator for K∗ because (hˆ1, . . . , hˆn) is not necessarily a valid support vector, i.e., (hˆ1, . . . , hˆn) does not always belong to the following set:
\[
H := \big\{(h_K(\theta_1), \ldots, h_K(\theta_n)) : K \subseteq \mathbb{R}^2 \text{ is compact and convex}\big\}.
\]
To get a valid support vector from (hˆ1, . . . , hˆn), we need to project it onto H to obtain
\[
\hat h^P := (\hat h^P_1, \ldots, \hat h^P_n) := \mathop{\mathrm{arg\,min}}_{(h_1,\ldots,h_n)\in H}\ \sum_{i=1}^{n}\big(\hat h_i - h_i\big)^2. \tag{4.13}
\]
The superscript P here stands for projection. An estimator for the set K∗ can now be constructed immediately from hˆP1, . . . , hˆPn via
\[
\hat K := \big\{(x_1, x_2) : x_1\cos\theta_i + x_2\sin\theta_i \le \hat h^P_i \text{ for all } i = 1, \ldots, n\big\}. \tag{4.14}
\]
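A sketch of the projection step (4.13) is given below, using cvxpy as an assumed convex-optimization dependency. Rather than spelling out the vectors a1, . . . , an from Prince and Willsky [115, Theorem 1], it encodes membership in H through cyclic constraints of the form h(θi−1) + h(θi+1) ≥ 2 cos(2π/n) h(θi), which follow from Lemma 4.2.1 with φ = 2π/n on the equally spaced grid; this is intended only as an illustration of the quadratic program, not as the exact constraint set used in the chapter.

import numpy as np
import cvxpy as cp

def project_onto_support_vectors(h_hat):
    # Euclidean projection of (hat{h}_1, ..., hat{h}_n) onto linear constraints that
    # a support vector on the uniform angular grid must satisfy (sketch of (4.13)).
    n = len(h_hat)
    c = np.cos(2 * np.pi / n)
    h = cp.Variable(n)
    constraints = [h[(i - 1) % n] + h[(i + 1) % n] >= 2 * c * h[i] for i in range(n)]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(h - h_hat)), constraints)
    problem.solve()
    # hat{K} of (4.14) is then the intersection of the half-planes
    # {x : x1*cos(theta_i) + x2*sin(theta_i) <= h.value[i-1]}.
    return h.value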

In Theorems 4.3.3 and 4.3.4, we prove upper bounds on the accuracy of Kˆ under the loss function Lf given in (4.3).

There is another reasonable way of constructing a set estimator for K∗ based on the point estimators hˆ1, . . . , hˆn. We first interpolate hˆ1, . . . , hˆn to define a function hˆ′: (−π, π] → R as follows:
\[
\hat h'(\theta) := \frac{\sin(\theta_{i+1} - \theta)}{\sin(\theta_{i+1} - \theta_i)}\,\hat h_i + \frac{\sin(\theta - \theta_i)}{\sin(\theta_{i+1} - \theta_i)}\,\hat h_{i+1} \quad\text{for } \theta_i \le \theta \le \theta_{i+1}. \tag{4.15}
\]

Here i ranges over 1, . . . , n with the convention that θn+1 = θ1 + 2π (and θn ≤ θ ≤ θn+1 should be identified with −π ≤ θ ≤ −π + 2π/n). Based on this function hˆ′, we can define our estimator Kˆ′ of K∗ by
\[
\hat K' := \mathop{\mathrm{argmin}}_{K}\ \int_{-\pi}^{\pi}\big(\hat h'(\theta) - h_K(\theta)\big)^2\,d\theta. \tag{4.16}
\]
The existence and uniqueness of Kˆ′ can be justified in the usual way by the Hilbert space projection theorem. In Theorem 4.3.5, we prove bounds on the accuracy of Kˆ′ as an estimator for K∗ under the integral loss L given in (4.7).

Let us now briefly comment on the algorithms for computing our set estimators Kˆ and Kˆ′. The expression (4.14) shows how to write Kˆ in terms of hˆPi, i = 1, . . . , n, and therefore we only need to be able to compute hˆPi, i = 1, . . . , n, in order to compute Kˆ. This can be done via quadratic programming because the set H can be explicitly written as \(\{h\in\mathbb R^n : a_i^\top h \le 0,\ i = 1, \ldots, n\}\) for some collection of vectors a1, . . . , an in Rⁿ (see, for example, Prince and Willsky [115, Theorem 1]). To compute Kˆ′, we take a fine uniform grid of points α1, . . . , αM in (−π, π] for a large value of M and approximate Kˆ′ via

\[
\mathop{\mathrm{argmin}}_{K}\ \sum_{i=1}^{M}\big(\hat h'(\alpha_i) - h_K(\alpha_i)\big)^2.
\]
More precisely, one can take \(\hat K' := \{(x_1, x_2) : x_1\cos\alpha_i + x_2\sin\alpha_i \le \tilde h_i \text{ for all } i = 1, \ldots, M\}\) where
\[
(\tilde h_1, \ldots, \tilde h_M) := \mathop{\mathrm{arg\,min}}_{(h_1,\ldots,h_M)\in H^M}\ \sum_{i=1}^{M}\big(\hat h'(\alpha_i) - h_i\big)^2
\]
with \(H^M := \{(h_K(\alpha_1), \ldots, h_K(\alpha_M)) : K \subseteq \mathbb{R}^2 \text{ is compact and convex}\}\). This estimator can then be computed in a way analogous to Kˆ, by quadratic programming. We present simulation examples in Section 4.5 where one can see that there is often not much difference between Kˆ and Kˆ′ in practice.
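The interpolation (4.15) and the M-point discretization can be sketched as follows. This is again an added, hedged illustration: the grid of α's and the function name are ours, and the final projection onto the M-point analogue of H would reuse a quadratic program as sketched after (4.14).

import numpy as np

def interpolate_h(h_hat, alphas):
    # Evaluate the sinusoidal interpolant hat{h}' of (4.15) at the angles in alphas.
    # h_hat[i-1] holds hat{h}_i on the grid theta_i = 2*pi*i/n - pi (1-based i).
    h_hat = np.asarray(h_hat, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    n = len(h_hat)
    spacing = 2 * np.pi / n
    theta = 2 * np.pi * np.arange(1, n + 1) / n - np.pi
    out = np.empty_like(alphas)
    for m, a in enumerate(alphas):
        i0 = int(np.floor((a + np.pi) / spacing))   # number of grid points theta_i <= a
        idx = (i0 - 1) % n                          # 0-based index of the arc's left endpoint
        t_i = theta[idx]
        h_i, h_next = h_hat[idx], h_hat[(idx + 1) % n]
        out[m] = (np.sin(t_i + spacing - a) * h_i + np.sin(a - t_i) * h_next) / np.sin(spacing)
    return out

# A fine grid with M = 1000 points, as in the simulations, could then be used as
#   alphas = 2 * np.pi * np.arange(1, 1001) / 1000 - np.pi
# followed by a projection of interpolate_h(h_hat, alphas) onto the M-point analogue of H.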

4.3 Main results

We now investigate the accuracy of the proposed point and set estimators. The proofs of these results are given in Section 4.7.

4.3.1 Accuracy of the point estimator

As mentioned in the introduction, we evaluate the performance of the point estimator hˆi at individual functions, not the worst case over a large parameter space. This provides a much more precise characterization of the accuracy of the estimator. Let us first recall inequality (4.9), where hK∗(θi) is sandwiched between Lk(θi) and Uk(θi). Define ∆k(θi) := Uk(θi) − Lk(θi).

Theorem 4.3.1. Fix i ∈ {1, . . . , n}. There exists a universal constant C > 0 such that the risk of hˆi as an estimator of hK∗(θi) satisfies the inequality

\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le C\cdot\frac{\sigma^2}{k_*(i)+1}, \tag{4.17}
\]
where
\[
k_*(i) := \mathop{\mathrm{argmin}}_{k\in I}\left\{\Delta_k(\theta_i) + \frac{2\sigma}{\sqrt{k+1}}\right\}. \tag{4.18}
\]

Remark 4.3.1. It turns out that the bound in (4.17) is linked to the level of smoothness of the function hK∗ at θi. However, for this interpretation to be correct, one needs to regard hK∗ as a function on R² rather than on a subset of R. This is further explained in Remark 4.4.1.

Theorem 4.3.1 gives an explicit bound on the risk of hˆi in terms of the quantity k∗(i) defined in (4.18). It is important to keep in mind that k∗(i) depends on K∗ even though this is suppressed in the notation. In the next theorem, we show that σ²/(k∗(i) + 1) also provides a lower bound on the accuracy of every estimator of hK∗(θi). This implies, in particular, optimality of hˆi as an estimator of hK∗(θi). One needs to be careful in formulating the lower bound in this setting. A first attempt might perhaps be to prove that, for a universal constant c > 0,

\[
\inf_{\tilde h}\ \mathbb{E}_{K^*}\big(\tilde h - h_{K^*}(\theta_i)\big)^2 \ge c\cdot\frac{\sigma^2}{k_*(i)+1},
\]
where the infimum is over all possible estimators h̃. This, of course, would not be possible because one can take h̃ = hK∗(θi), which would make the left hand side zero. A formulation of the lower bound which avoids this difficulty was proposed by [29] in the context of convex function estimation. Their idea, translated to our setting of estimating the support function hK∗ at a point θi, is to consider, instead of the risk at K∗ alone, the maximum of the risk at K∗ and the risk at the set L∗ that is most difficult to distinguish from K∗ in terms of estimating hK∗(θi). This leads to the benchmark Rn(K∗, θi) defined in (4.4).

Theorem 4.3.2. For any fixed i ∈ {1, . . . , n}, we have

\[
R_n(K^*, \theta_i) \ge c\cdot\frac{\sigma^2}{k_*(i)+1} \tag{4.19}
\]
for a universal constant c > 0.

Theorems 4.3.1 and 4.3.2 together imply that σ²/(k∗(i) + 1) is the optimal rate of estimation of hK∗(θi) for a given compact, convex set K∗. The results show that our data-driven estimator hˆi of hK∗(θi) performs uniformly within a constant factor of the ideal benchmark Rn(K∗, θi) for every i. This means that hˆi adapts to every unknown set K∗, instead of to a collection of large parameter spaces as in the conventional minimax theory commonly used in the nonparametric literature.

Remark 4.3.2 (A stronger upper bound on the risk of hˆi). From the proof of Theorem 4.3.2, it can be seen that the following statement is true: there exists a compact, convex set L∗ such that
\[
\inf_{\tilde h}\,\max\Big\{\mathbb{E}_{K^*}\big(\tilde h - h_{K^*}(\theta_i)\big)^2,\ \mathbb{E}_{L^*}\big(\tilde h - h_{L^*}(\theta_i)\big)^2\Big\} \ge \frac{c\sigma^2}{k_*(i)+1}, \tag{4.20}
\]
the infimum above being over all estimators h̃ of hK∗(θi). In light of this, it is natural to ask whether the inequality
\[
\max\Big\{\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2,\ \mathbb{E}_{L^*}\big(\hat h_i - h_{L^*}(\theta_i)\big)^2\Big\} \le \frac{C\sigma^2}{k_*(i)+1} \tag{4.21}
\]
holds for the same L∗, where hˆi refers to our estimator defined in (4.12) and C represents a universal constant. Note that this is a stronger inequality than (4.17). It turns out that (4.21) is indeed a true inequality and we provide a proof in Section B.1.3.

Given a specific set K∗ and 1 ≤ i ≤ n, the quantity k∗(i) is often straightforward to compute up to constant multiplicative factors. Several examples are provided in Section 4.4. From these examples, it will be clear that the size of σ²/(k∗(i) + 1) is linked to the level of smoothness of the function hK∗ at θi. However, for this interpretation to be correct, one needs to regard hK∗ as a function on R² rather than on a subset of R. This is explained in Remark 4.4.1.

The following corollaries shed more light on the quantity σ²/(k∗(i) + 1). The proofs of these corollaries are given in Section B.1.4. The first corollary below shows that σ²/(k∗(i) + 1) is at most C(σ²R/n)^{2/3} for every i and K∗ (C is a universal constant), provided K∗ is contained in a ball of radius R. In Example 4.4.3, we provide an explicit choice of i and K∗ for which σ²/(k∗(i) + 1) ≥ c(σ²R/n)^{2/3} (c is a universal constant). This implies that the conclusion of the following corollary cannot in general be improved.

Corollary 4.3.1. Suppose K∗ is contained in some closed ball of radius R. Then for every i ∈ {1, . . . , n}, we have, for a universal constant C > 0,
\[
\frac{\sigma^2}{k_*(i)+1} \le C\left\{\left(\frac{\sigma^2 R}{n}\right)^{2/3} + \frac{\sigma^2}{n}\right\} \tag{4.22}
\]
and
\[
\mathbb{E}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le C\left\{\left(\frac{\sigma^2 R}{n}\right)^{2/3} + \frac{\sigma^2}{n}\right\}. \tag{4.23}
\]

Note that the above corollary implies the consistency of hˆi as an estimator of hK∗(θi) for every i and K∗. It turns out that hˆi is a minimax optimal estimator of hK∗(θi) over the class of all compact convex sets K∗ contained in some closed ball of radius R. This is proved in the next result.

Proposition 4.3.1. For R ≥ 0, let K(R) denote the class of all compact, convex sets that are contained in some fixed closed ball of radius R. Then for every i ∈ {1, . . . , n}, we have
\[
\sup_{K^*\in\mathcal K(R)}\ \mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le C\left\{\left(\frac{\sigma^2 R}{n}\right)^{2/3} + \frac{\sigma^2}{n}\right\} \tag{4.24}
\]
for a universal constant C. We further have
\[
\inf_{\tilde h}\ \sup_{K^*\in\mathcal K(R)}\ \mathbb{E}_{K^*}\big(\tilde h - h_{K^*}(\theta_i)\big)^2 \ge c\left\{\left(\frac{\sigma^2 R}{n}\right)^{2/3} + \frac{\sigma^2}{n}\right\} \tag{4.25}
\]
for a universal constant c > 0, where the infimum is taken over all possible estimators h̃ of hK∗(θi).

It is clear from the definition (4.18) that k∗(i) ≤ n for all i and K∗. In the next corollary, we prove that there exist sets K∗ and indices i for which k∗(i) ≥ cn for a constant c. For these sets, the optimal rate of estimating hK∗(θi) is therefore parametric.

For a fixed i and K∗, let φ1(i) and φ2(i) be such that φ1(i) ≤ θi ≤ φ2(i) and such that there exists a single point (x1, x2) ∈ K∗ with

hK∗ (θ) = x1 cos θ + x2 sin θ for all θ ∈ [φ1(i), φ2(i)]. (4.26)

The following corollary says that if the distance of θi to its nearest end-point in the interval [φ1(i), φ2(i)] is large (i.e., of constant order), then the optimal rate of estimation of hK∗ (θi) is parametric. This situation happens usually for polytopes (polytopes are compact, convex sets with finitely many vertices); see Examples 4.4.1 and 4.4.3 for specific instances of this phenomenon. For non-polytopes, it can often happen that φ1(i) = φ2(i) = θi in which case the conclusion of the next corollary is not useful.

Corollary 4.3.2. For every i ∈ {1, . . . , n}, we have

k∗(i) ≥ c n min (θi − φ1(i), φ2(i) − θi, π) (4.27)

for a universal constant c > 0. Consequently

\[
\mathbb{E}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le \frac{C\sigma^2}{1 + n\min\big(\theta_i - \phi_1(i),\ \phi_2(i) - \theta_i,\ \pi\big)} \tag{4.28}
\]
for a universal constant C > 0.

From the above two corollaries, it is clear that the optimal rate of estimation of hK∗ (θi) can be as large as n−2/3 and as small as the parametric rate n−1. The rate n−2/3 is achieved, for example, in the setting given in Example 4.4.3 while the parametric rate is achieved, for example, for polytopes. The next corollary argues that bounding k∗(i) in specific examples requires only bounding the quantity ∆k(θi) from above and below. This corollary will be useful in Section 4.4 while working out k∗(i) in specific examples.

Corollary 4.3.3. Fix 1 ≤ i ≤ n. Let {fk(θi), k ∈ I} and {gk(θi), k ∈ I} be two sequences which satisfy gk(θi) ≤ ∆k(θi) ≤ fk(θi) for all k ∈ I. Also let
\[
\breve k(i) := \max\left\{k\in I : f_k(\theta_i) < \frac{(\sqrt6 - 2)\sigma}{\sqrt{k+1}}\right\} \tag{4.29}
\]
and
\[
\tilde k(i) := \min\left\{k\in I : g_k(\theta_i) > \frac{6(\sqrt2 - 1)\sigma}{\sqrt{k+1}}\right\} \tag{4.30}
\]
as long as there is some k ∈ I for which gk(θi) > 6(√2 − 1)σ/√(k + 1); otherwise take k̃(i) := maxk∈I k. We then have k̆(i) ≤ k∗(i) ≤ k̃(i) and
\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le C\,\frac{\sigma^2}{\breve k(i)+1} \tag{4.31}
\]

for a universal constant C > 0.

4.3.2 Accuracy of set estimators

We now turn to study the accuracy of the set estimators Kˆ (defined in (4.14)) and Kˆ′ (defined in (4.16)). The accuracy of Kˆ will be investigated under the loss function Lf (defined in (4.3)) while the accuracy of Kˆ′ will be studied under the loss function L (defined in (4.7)). In Theorem 4.3.3 below, we prove that EK∗Lf(K∗, Kˆ) is bounded from above by a constant multiple of n−4/5 as long as K∗ is contained in a ball of radius R. The discussions following the theorem shed more light on its implications.

Theorem 4.3.3. If K∗ is contained in some closed ball of radius R ≥ 0, then

 √ !4/5    2 2  ∗ ˆ σ σ R EK∗ Lf K , K ≤ C + (4.32)  n n 

for a universal constant C > 0. Note here that R = 0 is allowed (in which case K∗ is a singleton). CHAPTER 4. ADAPTIVE ESTIMATION OF PLANAR CONVEX SETS 57

Note that as long as R > 0, the right hand side of (4.32) will be dominated by the (σ²√R/n)^{4/5} term for all large n. This would mean that
\[
\sup_{K^*\in\mathcal K(R)}\ \mathbb{E}_{K^*} L_f(K^*, \hat K) \le C\left(\frac{\sigma^2\sqrt R}{n}\right)^{4/5}, \tag{4.33}
\]
where K(R) denotes the set of all compact convex sets contained in some fixed closed ball of radius R. The minimax rate of estimation over the class K(R) was studied in Guntuboyina [74]. In Theorems 3.1 and 3.2 of [74], it was proved that

\[
\inf_{\tilde K}\ \sup_{K^*\in\mathcal K(R)}\ \mathbb{E}_{K^*} L_f(K^*, \tilde K) \asymp \left(\frac{\sigma^2\sqrt R}{n}\right)^{4/5}, \tag{4.34}
\]
where ≍ denotes equality up to constant multiplicative factors. From (4.33) and (4.34), it follows that Kˆ is a minimax optimal estimator of K∗. We should mention here that an inequality of the form (4.33) was proved for the least squares estimator Kˆls by Gardner et al. [59], which implies that Kˆls is also a minimax optimal estimator of K∗.

The n−4/5 minimax rate here is quite natural in connection with the estimation of smooth functions. Indeed, this is the minimax rate for estimating twice differentiable one-dimensional functions. Although we have not made any smoothness assumptions here, we are working under a convexity-based constraint, and convexity is associated, in a broad sense, with second-order smoothness (see, for example, Alexandrov [2]).

Remark 4.3.3. Because of the formula (4.3) for the loss function Lf, the risk EK∗Lf(K∗, Kˆ) can be seen as the average of the risks of Kˆ for estimating hK∗(θi) over i = 1, . . . , n. We have seen in Section 4.3.1 that the optimal rate of estimating hK∗(θi) can be as high as n−2/3. Theorem 4.3.3, on the other hand, can be interpreted as saying that, on average over i = 1, . . . , n, the optimal rate of estimating hK∗(θi) is at most n−4/5. Indeed, the key to proving Theorem 4.3.3 is to establish the inequality
\[
\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{1}{k_*(i)+1} \le C\left\{\frac{\sigma^2}{n} + \left(\frac{\sigma^2\sqrt R}{n}\right)^{4/5}\right\}
\]

under the assumption that K∗ is contained in a ball of radius R. Therefore, even though each term σ²/(k∗(i) + 1) can be as large as n−2/3, on average their size is at most n−4/5.

Remark 4.3.4. Theorem 4.3.3 provides different qualitative conclusions when K∗ is a singleton. In this case, one can take R = 0 in (4.32) to get the parametric bound Cσ²/n for EK∗Lf(K∗, Kˆ). Because this is smaller than the nonparametric n−4/5 rate, it means that Kˆ adapts to singletons. Singletons are simple examples of polytopes, and one naturally wonders here whether Kˆ adapts to other polytopes as well. This is, however, not implied by inequality (4.32), which gives the rate n−4/5 for every K∗ that is not a singleton. It turns out that Kˆ indeed adapts to other polytopes and we prove this in the next theorem. In fact, we prove that Kˆ adapts to any K∗ that is well-approximated by a polytope with not too many vertices. It is currently not known if the least squares estimator Kˆls has such adaptivity.

We next prove another bound for EK∗Lf(K∗, Kˆ). This bound demonstrates the adaptivity of Kˆ described in the previous remark. Recall that polytopes are compact, convex sets with finitely many extreme points (or vertices). The space of all polytopes in R² will be denoted by P. For a polytope P ∈ P, we denote by vP the number of extreme points of P. Also recall the notion of the Hausdorff distance between two compact, convex sets K and L, defined by
\[
\ell_H(K, L) := \sup_{\theta\in\mathbb{R}}\ \big|h_K(\theta) - h_L(\theta)\big|. \tag{4.35}
\]
This is not the usual way of defining the Hausdorff distance. For an explanation of the connection between this and the usual definition, see, for example, Schneider [127, Theorem 1.8.11].

Theorem 4.3.4. There exists a universal constant C > 0 such that
\[
\mathbb{E}_{K^*} L_f(K^*, \hat K) \le C\,\inf_{P\in\mathcal P}\left\{\frac{\sigma^2 v_P}{n}\log\frac{en}{v_P} + \ell_H^2(K^*, P)\right\}. \tag{4.36}
\]

Remark 4.3.5 (Near-parametric rates for polytopes). The bound (4.36) implies that Kˆ has the parametric rate (up to a logarithmic factor of n) for estimating polytopes. Indeed, suppose that K∗ is a polytope with v vertices. Then, using P = K∗ in the infimum in (4.36), we have the risk bound
\[
\mathbb{E}_{K^*} L_f(K^*, \hat K) \le \frac{C\sigma^2 v}{n}\log\frac{en}{v}. \tag{4.37}
\]
This is the parametric rate σ²v/n up to logarithmic factors and is smaller than the nonparametric rate n−4/5 given in (4.32).

Remark 4.3.6. When v = 1, inequality (4.37) has a redundant logarithmic factor. Indeed, when v = 1, we can use (4.32) with R = 0, which gives (4.37) without the additional logarithmic factor. We do not know if the logarithmic factor in (4.37) can be removed for values of v larger than one as well.

Now consider the second set estimator Kˆ′. The next theorem gives an upper bound on its accuracy under the integral loss function L (defined in (4.7)).

Theorem 4.3.5. Suppose K∗ is contained in some closed ball of radius R ≥ 0. The risk EK∗L(K∗, Kˆ′) satisfies both of the following inequalities:

 √ 4/5  2 2 ! 2 ∗ 0 σ σ R R  ∗ ˆ EK L(K , K ) ≤ C + + 2 (4.38)  n n n  CHAPTER 4. ADAPTIVE ESTIMATION OF PLANAR CONVEX SETS 59 and  2   2  ∗ 0 σ vP en 2 ∗ R ∗ ˆ EK L(K , K ) ≤ C inf log + `H (K ,P ) + 2 . (4.39) P ∈P n vP n The only difference between the inequalities (4.38) and (4.39) on one hand and (4.32) and (4.36) on the other is the presence of the R2/n2 term. This term is usually very small and does not change the qualitative behavior of the bounds. However note that inequality (4.36) did not require any assumption on K∗ being in a ball of radius R while this assumption is necessary for (4.39). √ Remark 4.3.7. The rate (σ2 R/n)4/5 is the minimax rate for this problem under the loss function L. Although this has not been proved explicitly anywhere, it can be shown by modifying the proof of Guntuboyina [74, Theorem 3.2] appropriately. Theorem 4.3.5 therefore shows that Kˆ 0 is a minimax optimal estimator of K∗ under the loss function L.

4.4 Examples

We now investigate the results given in the last section for specific choices of K∗. It is useful here to note that ∆k(θi) = Uk(θi) − Lk(θi) has the following alternative expression:

\[
\Delta_k(\theta_i) = \frac{1}{k+1}\sum_{j=0}^{k}\left(h_{K^*}(\theta_i \pm 4j\pi/n) - \frac{\cos(4j\pi/n)}{\cos(2j\pi/n)}\,h_{K^*}(\theta_i \pm 2j\pi/n)\right), \tag{4.40}
\]

where we write hK∗ (θi ± φ) for (hK∗ (θi + φ) + hK∗ (θi − φ)) /2 with φ = 2jπ/n, 4jπ/n.
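The quantities appearing in the examples below can also be checked numerically. The following sketch (an added illustration with NumPy; support_fn stands for any routine returning hK∗(θ), vectorized over an array of angles) computes ∆k(θi) from (4.40) and the benchmark index k∗(i) from (4.18).

import numpy as np

def delta_k(support_fn, n, i, k):
    # Delta_k(theta_i) from (4.40), with theta_i = 2*pi*i/n - pi.
    theta_i = 2 * np.pi * i / n - np.pi
    j = np.arange(k + 1)
    avg4 = (support_fn(theta_i + 4 * j * np.pi / n) + support_fn(theta_i - 4 * j * np.pi / n)) / 2
    avg2 = (support_fn(theta_i + 2 * j * np.pi / n) + support_fn(theta_i - 2 * j * np.pi / n)) / 2
    ratio = np.cos(4 * j * np.pi / n) / np.cos(2 * j * np.pi / n)
    return np.mean(avg4 - ratio * avg2)

def k_star(support_fn, n, i, sigma):
    # k_*(i) from (4.18): minimize Delta_k(theta_i) + 2*sigma/sqrt(k+1) over the dyadic grid I.
    I = [0]
    p = 1
    while p <= n // 16:
        I.append(p)
        p *= 2
    crit = [delta_k(support_fn, n, i, k) + 2 * sigma / np.sqrt(k + 1) for k in I]
    return I[int(np.argmin(crit))]

# Example 4.4.3 (segment from (0,-3) to (0,3)): support_fn = lambda t: 3 * np.abs(np.sin(t))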

Example 4.4.1 (Single point). Suppose K∗ := {(x1, x2)} for a fixed point (x1, x2) ∈ R². In this case
\[
h_{K^*}(\theta) = x_1\cos\theta + x_2\sin\theta \quad\text{for all } \theta. \tag{4.41}
\]

It can then be directly checked from (4.40) that ∆k(θi) = 0 for every k ∈ I and i ∈ {1, . . . , n}. As a result, it follows that k∗(i) = maxk∈I k ≥ cn for a constant c > 0. Theorem 4.3.1 then says that the point estimator hˆi satisfies
\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le \frac{C\sigma^2}{n} \tag{4.42}
\]
for a universal constant C > 0. One therefore gets the parametric rate here. Also, Theorem 4.3.3 and inequality (4.38) in Theorem 4.3.5 can both be used here with R = 0. This implies that the set estimators Kˆ and Kˆ′ both converge to K∗ at the parametric rate under the loss functions Lf and L respectively.
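The vanishing of ∆k(θi) can be seen in one line (a check added here for completeness): for hK∗(θ) = x1 cos θ + x2 sin θ, the identities cos(θ + φ) + cos(θ − φ) = 2 cos θ cos φ and sin(θ + φ) + sin(θ − φ) = 2 sin θ cos φ give
\[
h_{K^*}(\theta_i \pm \phi) = \cos\phi\; h_{K^*}(\theta_i),
\]
so each summand in (4.40) equals cos(4jπ/n) hK∗(θi) − [cos(4jπ/n)/cos(2jπ/n)] cos(2jπ/n) hK∗(θi) = 0.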

Example 4.4.2 (Ball). Suppose K∗ is a ball centered at (x1, x2) with radius R > 0. It is then easy to verify that
\[
h_{K^*}(\theta) = x_1\cos\theta + x_2\sin\theta + R \quad\text{for all } \theta. \tag{4.43}
\]

As a result, for every k ∈ I and i ∈ {1, . . . , n}, we have

\[
\Delta_k(\theta_i) = \frac{R}{k+1}\sum_{j=0}^{k}\left(1 - \frac{\cos(4\pi j/n)}{\cos(2\pi j/n)}\right) \le R\left(1 - \frac{\cos(4\pi k/n)}{\cos(2\pi k/n)}\right). \tag{4.44}
\]

Because k ≤ n/16 for all k ∈ I, it is easy to verify that ∆k(θi) ≤ 8R sin²(πk/n) ≤ 8Rπ²k²/n². Taking fk(θi) = 8Rπ²k²/n² in Corollary 4.3.3, we obtain that k∗(i) ≥ c min(n, (n²σ/R)^{2/5}) for a constant c. Also, since the function 1 − cos(2x)/cos(x) is strongly convex on [−π/4, π/4] with second derivative lower bounded by 3, we have

\[
\Delta_k(\theta_i) = \frac{R}{k+1}\sum_{j=0}^{k}\left(1 - \frac{\cos(4\pi j/n)}{\cos(2\pi j/n)}\right) \ge \frac{R}{k+1}\sum_{j=0}^{k}\frac{3}{2}\left(\frac{2\pi j}{n}\right)^2 = \frac{R\pi^2 k(2k+1)}{n^2}.
\]

This gives k∗(i) ≤ C min(n, (n²σ/R)^{2/5}) as well, for a constant C. We thus have k∗(i) ≍ (n²σ/R)^{2/5} for every i. Theorem 4.3.1 then gives

 √ !4/5 2 2 2 ˆ  σ σ R  EK∗ hi − hK∗ (θi) ≤ C + (4.45)  n n 

for every i ∈ {1, . . . , n}. Theorem 4.3.3 and inequality (4.38) prove that the set estimators Kˆ and Kˆ′ also converge to K∗ at the n−4/5 rate.

In the preceding examples, we saw that the optimal rate σ²/(k∗(i) + 1) for estimating hK∗(θi) did not depend on i. Next, we consider asymmetric examples where the rate changes with i.

Example 4.4.3 (Segment). Let K∗ be the vertical line segment joining (0,R) and (0, −R) for a fixed R > 0. Then hK∗ (θ) = R| sin θ| for all θ. Assume that n is even and consider i = n/2 so that θn/2 = 0. It can be verified that

\[
\Delta_k(\theta_{n/2}) = \Delta_k(0) = \frac{R}{k+1}\sum_{j=0}^{k}\tan\frac{2\pi j}{n} \quad\text{for every } k\in I.
\]

Because j ↦ tan(2πj/n) is increasing, it is straightforward to deduce from the above that 3πRk/(8n) ≤ ∆k(0) ≤ 4πRk/n. Corollary 4.3.3 then gives

\[
\frac{\sigma^2}{k_*(n/2)+1} \asymp \frac{\sigma^2}{n} + \left(\frac{\sigma^2 R}{n}\right)^{2/3}. \tag{4.46}
\]
It was shown in Corollary 4.3.1 that the right hand side above represents the maximum possible value of σ²/(k∗(i) + 1) when K∗ lies in a closed ball of radius R. Therefore this example presents the situation where estimation of hK∗(θi) is the most difficult. See Remark 4.4.1 for the connection to the smoothness of hK∗(·) at θi.

Now suppose that i = 3n/4 (assume that n/4 is an integer for simplicity) so that θi = π/2. Observe then that hK∗(θ) = R sin θ (without the modulus) for θ = θi ± 4jπ/n for every 0 ≤ j ≤ k, k ∈ I. Using (4.40), we have ∆k(θi) = 0 for every k ∈ I. This immediately gives k∗(i) = ⌊n/16⌋ and hence
\[
\frac{\sigma^2}{k_*(3n/4)+1} \asymp \frac{\sigma^2}{n}. \tag{4.47}
\]

In this example, the risk for estimating hK∗(θi) changes with i. For i = n/2 we get the n−2/3 rate, while for i = 3n/4 we get the parametric rate. For other values of i, one gets a range of rates between n−2/3 and n−1.

Because K∗ is a polytope with 2 vertices, Theorem 4.3.4 and inequality (4.39) imply that the set estimators Kˆ and Kˆ′ converge at the near-parametric rate σ² log n/n. It is interesting to note here that even though for some θi the optimal rate of estimation of hK∗(θi) is n−2/3, the entire set can be estimated at the near-parametric rate.

Example 4.4.4 (Half-ball). Suppose K∗ := {(x1, x2) : x1² + x2² ≤ 1, x2 ≤ 0}. One then has hK∗(θ) = 1 for −π ≤ θ ≤ 0 and hK∗(θ) = |cos θ| for 0 < θ ≤ π. Assume n is even and take i = n/2 so that θi = 0. It can be checked that
\[
\Delta_k(0) = \frac{1}{2(k+1)}\sum_{j=0}^{k}\left(1 - \frac{\cos(4\pi j/n)}{\cos(2\pi j/n)}\right).
\]

This is exactly as in (4.44) with R = 1 and an additional factor of 1/2. Arguing as in Example 4.4.2, we obtain that

\[
\frac{\sigma^2}{k_*(n/2)+1} \asymp \frac{\sigma^2}{n} + \left(\frac{\sigma^2}{n}\right)^{4/5}.
\]

Now take i = 3n/4 (assume n/4 is an integer) so that θi = π/2. Observe then that hK∗(θ) = |cos θ| for θ = θi ± 4jπ/n for every 0 ≤ j ≤ k, k ∈ I. The situation is therefore similar to (4.46) and we obtain
\[
\frac{\sigma^2}{k_*(3n/4)+1} \asymp \frac{\sigma^2}{n} + \left(\frac{\sigma^2}{n}\right)^{2/3}.
\]

Similar to the previous example, the risk for estimating hK∗(θi) changes with i and varies from n−2/3 to n−4/5. On the other hand, Theorem 4.3.3 states that the set estimator Kˆ still estimates K∗ at the rate n−4/5.

Remark 4.4.1 (Connection between risk and smoothness). The reader may observe that the support functions (4.41) and (4.43) in the two examples above differ only by the constant R. It might then seem strange that merely adding a non-zero constant changes the risk of estimating hK∗(θi) from n−1 to n−4/5. It turns out that the function (4.41) is much smoother than the function (4.43); the right way to view the smoothness of hK∗(·) is to regard it as a function on R². This is done in the following way. Define, for each z = (z1, z2) ∈ R²,

\[
h_{K^*}(z) = \max_{(x_1, x_2)\in K^*}\ \big(x_1 z_1 + x_2 z_2\big).
\]

When z = (cos θ, sin θ) for some θ ∈ R, this definition coincides with our definition of hK∗(θ). A standard result (see, for example, Corollary 1.7.3 and Theorem 1.7.4 in [127]) states that the subdifferential of z ↦ hK∗(z) exists at every z = (z1, z2) ∈ R² and is given by
\[
F(K^*, z) := \{(x_1, x_2)\in K^* : h_{K^*}(z) = x_1 z_1 + x_2 z_2\}.
\]

In particular, z ↦ hK∗(z) is differentiable at z if and only if F(K∗, z) is a singleton. Studying hK∗ as a function on R² sheds qualitative light on the risk bounds obtained in the examples. In the case of Example 4.4.1, when K∗ = {(x1, x2)}, it is clear that F(K∗, z) = {(x1, x2)} for all z. Because this set does not change with z, this provides the case of maximum smoothness (the derivative is constant) and thus we get the n−1 rate.

In Example 4.4.2, when K∗ is a ball centered at x = (x1, x2) with radius R, it can be checked that F(K∗, z) = {x + Rz/‖z‖} for every z ≠ 0. Since F(K∗, z) is a singleton for each z ≠ 0, it follows that z ↦ hK∗(z) is differentiable at every z ≠ 0. For R ≠ 0, the set F(K∗, z) changes with z, and thus here hK∗ is not as smooth as in Example 4.4.1. This explains the slower rate in Example 4.4.2 compared to 4.4.1.

Finally, in Example 4.4.3, when K∗ is the vertical segment joining (0, R) and (0, −R), it is easy to see that F(K∗, z) = K∗ when z = (1, 0). Here F(K∗, z) is not a singleton, which implies that hK∗(z) is non-differentiable at z = (1, 0). This is why one gets the slow rate n−2/3 for estimating hK∗(θn/2) in Example 4.4.3.
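The claim for the ball can be verified directly (a short check added here for completeness, writing x0 for the center to distinguish it from the maximizing point): for z ≠ 0,
\[
h_{K^*}(z) = \max_{\|x - x_0\|\le R}\ \langle x, z\rangle = \langle x_0, z\rangle + R\|z\|,
\]
with the maximum attained only at x = x0 + Rz/‖z‖, so that F(K∗, z) = {x0 + Rz/‖z‖} and ∇hK∗(z) = x0 + Rz/‖z‖, which indeed varies with z whenever R > 0.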

4.5 Numerical results

In this section, we compare the performance of our estimators to other existing estimators for both the pointwise estimation and set estimation problems. We shall refer to our estimator hˆi (defined in (4.12)) as the local averaging estimator (LAE). The set estimator Kˆ (defined in (4.14)) will be referred to as LAE with projection, and the set estimator Kˆ′ (defined in (4.16)) will be referred to as LAE with infinite projection.

Note that our estimators require knowledge of the noise level σ (which we have assumed to be known for our theoretical analysis). In practice, σ is typically unknown and needs to be estimated. Under the setting of the present chapter, σ is easily estimable by using the median of consecutive differences. Let δi = Y2i − Y2i−1, i = 1, . . . , ⌊n/2⌋. A simple robust estimator of the noise level σ is the following median absolute deviation (MAD) estimator:

\[
\hat\sigma = \frac{\mathrm{median}_i\,|\delta_i - \mathrm{median}_j(\delta_j)|}{\sqrt2\,\Phi^{-1}(0.75)} \approx 1.05\times\mathrm{median}_i\,|\delta_i - \mathrm{median}_j(\delta_j)|. \tag{4.48}
\]
We use this estimate of σ in our simulations.
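In code, (4.48) amounts to the following minimal sketch (NumPy assumed; the hard-coded constant is Φ⁻¹(0.75), which could equally be obtained from scipy.stats.norm.ppf(0.75)).

import numpy as np

def mad_sigma(y):
    # Robust noise-level estimate (4.48) from consecutive differences
    # delta_i = Y_{2i} - Y_{2i-1}, i = 1, ..., floor(n/2).
    y = np.asarray(y, dtype=float)
    m = len(y) // 2
    delta = y[1:2 * m:2] - y[0:2 * m:2]               # Y_2 - Y_1, Y_4 - Y_3, ...
    mad = np.median(np.abs(delta - np.median(delta)))
    return mad / (np.sqrt(2) * 0.6744897501960817)    # = mad / (sqrt(2) * Phi^{-1}(0.75)) ~ 1.05 * mad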

Let us now briefly describe the other estimators to which our estimators will be compared. The first of these is the least squares estimator [115], which we have already described in this chapter. The other estimators come from Fisher et al. [52, Section 2], where the authors propose four different estimators for K∗. These are: (A) a second-order local linear method; (B) a second-order Nadaraya-Watson kernel method; (C) a third-order local quadratic estimator; and (D) a fourth-order Nadaraya-Watson kernel method. As remarked in [52, Section 3], their method (D) is always inferior to (C) (even when the smoothing parameters for (D) were chosen optimally). Therefore, we only compare our estimators with the first three methods from [52]. We shall denote these estimators by FHTW-A, FHTW-B and FHTW-C respectively (FHTW is an acronym for the author names of [52]). In our simulations, we allow these three estimators to have knowledge of the true noise level σ.

In total, therefore, we evaluate the performance of seven estimators in this section: three estimators proposed in this chapter (LAE, LAE with projection and LAE with infinite projection), the least squares estimator (LSE) and the three estimators from [52] (FHTW-A, FHTW-B and FHTW-C). In the interest of space, we present simulation results here for only two cases: K∗ is (a) the unit ball, and (b) the segment joining (0, −3) to (0, +3). Simulation results for other choices of K∗, including a square, an ellipsoid and a random polytope, are given in Section B.2.

4.5.1 Pointwise estimation

In this section, we evaluate the performance of the seven pointwise estimators of hK∗(θi) for fixed 1 ≤ i ≤ n. We measure the performance of each estimator h̃ by the mean squared error (MSE) EK∗(h̃ − hK∗(θi))². For every fixed n, we simulate 200 random ensembles according to the model (4.2) and then approximate the expectation by the average of the errors (h̃ − hK∗(θi))². In the simulations, σ = 0.5 and n ranges over {20, 50, 100, 200, 300, 500}. We plot the risk as a function of n.
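The Monte Carlo protocol can be sketched as follows. This is an added, self-contained illustration only: the estimator used is a simple fixed-bandwidth average Uˆk(θi), a stand-in for the adaptive estimator hˆi, and the true σ is used rather than the MAD estimate.

import numpy as np

def mc_pointwise_mse(n, sigma=0.5, reps=200, k=8, seed=0):
    # Approximate E(h_tilde - h_{K*}(theta_i))^2 over `reps` ensembles from (4.2),
    # for the segment K* joining (0,-3) and (0,3), so h_{K*}(theta) = 3*|sin(theta)|,
    # at theta_i = 0 (i = n/2 on the grid theta_j = 2*pi*j/n - pi).
    rng = np.random.default_rng(seed)
    k = min(k, n // 16)                    # keep the bandwidth within the admissible range
    i = n // 2
    truth = 0.0                            # h_{K*}(0) = 3*|sin(0)| = 0
    errs = []
    for _ in range(reps):
        theta = 2 * np.pi * np.arange(1, n + 1) / n - np.pi
        y = 3 * np.abs(np.sin(theta)) + sigma * rng.normal(size=n)
        Y = lambda j: y[(j - 1) % n]
        u_hat = [(Y(i + j) + Y(i - j)) / (2 * np.cos(2 * np.pi * j / n)) for j in range(k + 1)]
        errs.append((np.mean(u_hat) - truth) ** 2)
    return np.mean(errs)

# e.g. [mc_pointwise_mse(n) for n in (20, 50, 100, 200, 300, 500)]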

Ball: We start with the case when K∗ is a ball. Without loss of generality, we can then assume that the ball is the standard unit ball, whose support function always equals one. By rotation invariance of the ball, it is enough to study the case θi = 0. In Figure 4.1, we plot the mean squared errors of all the estimators against the sample size n. From Figure 4.1, it is clear that the behaviors of the LSE and both LAE projection estimators (LAE with projection and LAE with infinite projection) are almost the same, while the performance of the LAE is quite comparable. When n is large, the performance of the LAE is as good as that of the LSE and the LAE projection estimators, i.e., in this case, projecting the LAE onto the support function space is unnecessary. Here the LAE, which only uses local information, performs quite similarly to the LSE. Also note that the best performance in this setting is achieved by the three FHTW estimators.


Figure 4.1: Point estimation error when K∗ is a ball

Segment: Our second example is when K∗ is the segment from (0, −3) to (0, +3), and we study the MSE when θi equals 0, π/4 and π/2 (in this example, the performance of the various estimators will vary with θi). The support function of K∗ here equals 3|sin θ| (this function is plotted in the first panel of Figure 4.2); the three choices of θi are indicated in this panel in red. The mean squared errors of all estimators against n are plotted in the last three panels of Figure 4.2 for each of the three choices of θi.


Figure 4.2: Point estimation error when K∗ is a segment

Observe that, similar to the case of the ball, the behaviors of the LSE and both LAE projection estimators are almost the same, and the LAE has comparable performance. An interesting fact is that, if one looks at the range of the y-axis in the last three panels of Figure 4.2, although the mean squared error is decreasing at each θi, the rate of decay varies with θi. It may be noted that this phenomenon is predicted by our theoretical analysis because the benchmark Rn(K∗, θi) is adaptive to the structure of hK∗ at θi.

Note that in this example the FHTW estimators perform poorly, unlike in the case of the ball. The reason is that in [52] the support function is assumed to be twice differentiable, and so is the fitted hˆ. On the other hand, in this example the true support function is non-differentiable, which explains their poor performance. Note that, in contrast, our local averaging estimator requires no assumptions on the local smoothness and, as we have seen, the estimator actually adapts to local smoothness.

Analogous plots for other choices of K∗ are given in Appendix B.2.1. These plots reveal the same story as the previous two settings.

4.5.2 Set estimation

We now turn to set estimation. Recall that we proposed two estimators for set estimation: the LAE with projection estimator Kˆ (defined in (4.14)) and the LAE with infinite projection estimator Kˆ′ (defined in (4.16)). We compare these two estimators to the LSE and the FHTW estimators from [52]. In our simulations, we found that FHTW-B works much better than FHTW-A and FHTW-C, which can also be seen from the simulations for point estimation above. So we only present the results for FHTW-B among the three FHTW estimators.

For a set of specific choices of K∗ and n, we compute the expected squared errors EK∗Lf(K∗, Kˆ) and EK∗L(K∗, Kˆ) for each of the estimators, where Lf and L are defined in (4.3) and (4.7) respectively. Similar to the point estimation case, these two expectations are approximated by the empirical average over 200 random ensembles generated according to the model (4.2). For our LAE projection estimators, which require the value of σ, we estimate σ via (4.48). For the FHTW-B estimator, which also requires σ, we take σ to be its true value. We plot EK∗Lf(K∗, Kˆ) and EK∗L(K∗, Kˆ) for each estimator Kˆ as a function of n. For visualizing the set estimators, we picked an ensemble randomly from the 200 ensembles and plotted each estimator. Note that for the LAE with infinite projection, as we mentioned before, we take a finer uniform grid of points α1, . . . , αM on (−π, π] for a large value of M and approximate the set by the intersection of M half-planes. In this case, M is set to be 1000.

Ball: Figure 4.3 presents the simulation results when K∗ is the unit ball. It shows that the performance of the LAE projection estimators is almost identical to that of the LSE. The three set estimators LSE, LAE with projection and LAE with infinite projection all look alike in the last subplot. Observe that for the LAE with infinite projection estimator, there are many more support lines compared to the LAE with projection estimator. This is because of the infinite nature of the projection that is used to define the LAE with infinite projection estimator. The best estimator in this example is the FHTW-B estimator, because it captures the geometry of K∗ exactly.


Figure 4.3: Set estimation when K∗ is a ball

Segment: Our second example takes K∗ to be the segment from (0, −3) to (0, +3). The plots are given in Figure 4.4. Similar to the ball case, our LAE projection estimators are comparable to the LSE. Note that the FHTW-B estimator, which assumes smoothness of the support function, is quite far off (much higher risk) in this case. From both these figures (as well as other set estimation figures in [27]), it is clear that both our set estimators (Kˆ and Kˆ′) look quite similar and have nearly identical performance.


Figure 4.4: Set estimation when K∗ is a segment

4.6 Discussion

In this chapter, we study the problems of estimating both the support function at a point, hK∗(θi), and the whole convex set K∗. Data-driven adaptive estimators are constructed and their optimality is established. For pointwise estimation, the quantity k∗(i), which appears in both the upper bound (4.17) and the lower bound (4.19), is related to the smoothness of hK∗(θ) at θ = θi. The construction of hˆi is based on local smoothing together with an optimization algorithm for choosing the bandwidth. Smoothing methods for estimating the support function have previously been studied by Fisher et al. [52]. Specifically, working under certain smoothness assumptions on the true support function hK∗(θ), Fisher et al. [52] estimated it using periodic versions of standard nonparametric regression techniques such as local regression, kernel smoothing and splines. They evade the problem of bandwidth selection, however, by assuming that the true support function is sufficiently smooth. Our estimator comes with a data-driven method for choosing the bandwidth automatically and we do not need any smoothness assumptions on the true convex set. The fact that our pointwise estimator uses only local information (i.e., for computing hˆi, we only use the observations Yj corresponding to θj near θi) is quite advantageous in that the computational complexity can be substantially reduced by parallelizing the computation.

It was noted that the construction of our estimators Kˆ and Kˆ′ given in Section 4.2.2 does not involve any special treatment for polytopes; yet we obtain faster rates for polytopes. Such automatic adaptation to polytopes has been observed in other contexts: isotonic regression, where one gets automatic adaptation for piecewise constant monotone functions (see Chatterjee et al. [38]), and convex regression, where one gets automatic adaptation for piecewise affine convex functions (see Guntuboyina and Sen [75]).

Finally, we note that because σ²/(k∗(i) + 1) gives the optimal rate in pointwise estimation, it can potentially be used as a benchmark to evaluate other estimators of hK∗(θi), such as the least squares estimator hKˆls(θi). From our simulations in Section 4.5, it seems that the least squares estimator is also optimal in our strong sense for pointwise estimation. It is, however, difficult to prove accuracy results for the least squares estimator for pointwise estimation. The main difficulty comes from the fact that the least squares estimator is technically a non-local estimator (meaning that hKˆls(θi) can depend on the values of Yj for θj far from θi). This, together with the fact that there is no closed form expression for the least squares estimator, makes it very hard to study its pointwise estimation properties. In the related problem of convex function estimation, pointwise properties of the least squares estimator have been studied in Groeneboom et al. [71]. But their results are asymptotic in nature and, more importantly, they make certain smoothness assumptions on the true function. In the generality considered in this chapter, studying the least squares estimator seems difficult; it will probably require new techniques which are beyond the scope of this chapter. This is an interesting topic for future research.

4.7 Proofs of the main results

This section contains the proofs of the main theorems stated in Section 4.3. The proofs of the corollaries of Subsection 4.3.1 are given in Section B.1.4. Some technical lemmas are required for the proofs given below; these lemmas are also given in Section B.1.6. Please note that, because of space constraints, for the first three proofs given below (those of Theorems 4.3.1, 4.3.2 and 4.3.3), we only give a few details here and relegate the complete arguments to the appendix.

4.7.1 Proof of Theorem 4.3.1

We provide the proof of Theorem 4.3.1 here. The proof uses three simple lemmas, Lemmas B.1.2, B.1.3 and B.1.4, which are stated and proved in Section B.1.6. Due to space constraints, we only provide the initial part of the proof here, moving the rest to Section B.1.1.

Fix i ∈ {1, . . . , n}. Because hˆi = Uˆkˆ(i)(θi), we write

\[
\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 = \sum_{k\in I}\big(\hat U_k(\theta_i) - h_{K^*}(\theta_i)\big)^2\,\mathbb{I}\big\{\hat k(i) = k\big\},
\]
where I{·} denotes the indicator function. Taking expectations on both sides and using the Cauchy-Schwarz inequality, we obtain

\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le \sum_{k\in I}\sqrt{\mathbb{E}\big(\hat U_k(\theta_i) - h_{K^*}(\theta_i)\big)^4}\ \sqrt{\mathbb{P}_{K^*}\big\{\hat k(i) = k\big\}}.
\]

The random variable Uˆk(θi) − hK∗(θi) is normally distributed, and we know that EZ⁴ ≤ 3(EZ²)² for every Gaussian random variable Z. We therefore have

\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le \sqrt3\,\sum_{k\in I}\mathbb{E}\big(\hat U_k(\theta_i) - h_{K^*}(\theta_i)\big)^2\,\sqrt{\mathbb{P}_{K^*}\big\{\hat k(i) = k\big\}}.
\]
Because EK∗Uˆk(θi) = Uk(θi) (defined in (4.9)), we have
\[
\mathbb{E}_{K^*}\big(\hat U_k(\theta_i) - h_{K^*}(\theta_i)\big)^2 = \big(U_k(\theta_i) - h_{K^*}(\theta_i)\big)^2 + \mathrm{var}\big(\hat U_k(\theta_i)\big).
\]

Because Lk(θi) ≤ hK∗(θi) ≤ Uk(θi), it is clear that Uk(θi) − hK∗(θi) ≤ Uk(θi) − Lk(θi) = ∆k(θi). Also, Lemma B.1.4 states that the variance of Uˆk(θi) is at most σ²/(k + 1). Putting these together, we obtain

\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le \sqrt3\,\sum_{k\in I}\left(\Delta_k^2(\theta_i) + \frac{\sigma^2}{k+1}\right)\sqrt{\mathbb{P}_{K^*}\big\{\hat k(i) = k\big\}}.
\]

The proof of (4.17) will therefore be complete if we show that

\[
\sum_{k\in I}\left(\Delta_k^2(\theta_i) + \frac{\sigma^2}{k+1}\right)\sqrt{\mathbb{P}_{K^*}\big\{\hat k(i) = k\big\}} \le C\,\frac{\sigma^2}{k_*(i)+1} \tag{4.49}
\]
for a universal positive constant C. The proof of this inequality is technical and we have moved it to Section B.1.1.

4.7.2 Proof of Theorem 4.3.2

This subsection is dedicated to the proof of Theorem 4.3.2. The proof is again long and we have moved most of it to Section B.1.2. The basic idea is presented below and is based on a classical inequality due to Le Cam [90], which states that for every estimator h̃ and compact, convex set L∗, the quantity

\[
\max\Big\{\mathbb{E}_{K^*}\big(\tilde h - h_{K^*}(\theta_i)\big)^2,\ \mathbb{E}_{L^*}\big(\tilde h - h_{L^*}(\theta_i)\big)^2\Big\}
\]

is bounded from below by
\[
\frac14\,\big(h_{K^*}(\theta_i) - h_{L^*}(\theta_i)\big)^2\,\big(1 - \|\mathbb{P}_{K^*} - \mathbb{P}_{L^*}\|_{TV}\big). \tag{4.50}
\]

Here PL∗ is the product of the Gaussian probability measures with mean hL∗(θi) and variance σ² for i = 1, . . . , n. Also, ‖P − Q‖TV denotes the total variation distance between P and Q. For ease of notation, we assume, without loss of generality, that θi = 0. We also write ∆k for ∆k(θi) and k∗ for k∗(i).

Suppose first that K∗ satisfies the following condition: there exists some α ∈ (0, π/4) such that
\[
\frac{h_{K^*}(\alpha) + h_{K^*}(-\alpha)}{2\cos\alpha} - h_{K^*}(0) > \frac{\sigma}{\sqrt{n_\alpha}}, \tag{4.51}
\]

where nα denotes the number of integers i for which −α < 2iπ/n < α. This condition will not be satisfied, for example, when K∗ is a singleton. We shall handle such K∗ later. Observe that nα ≥ 1 for all 0 < α < π/4 because we can take i = 0. Let us define, for each α ∈ (0, π/4),
\[
a_{K^*}(\alpha) := \left(\frac{h_{K^*}(\alpha) + h_{K^*}(-\alpha)}{2\cos\alpha},\ \frac{h_{K^*}(\alpha) - h_{K^*}(-\alpha)}{2\sin\alpha}\right), \tag{4.52}
\]
and let L∗ = L∗(α) be defined as the smallest convex set that contains both K∗ and the point aK∗(α). In other words, L∗ is the convex hull of K∗ ∪ {aK∗(α)}. We now use Le Cam's bound (4.50) with this choice of L∗. Details are given in [27, Subsection B.1.2].

4.7.3 Proof of Theorem 4.3.3

Recall the definition of hˆP in (4.13) and the definition of the estimator Kˆ in (4.14). The first thing to note is that
\[
h_{\hat K}(\theta_i) = \hat h^P_i \quad\text{for every } i = 1, \ldots, n. \tag{4.53}
\]
To see this, observe first that, because hˆP = (hˆP1, . . . , hˆPn) is a valid support vector, there exists a set K̃ with hK̃(θi) = hˆPi for every i. It is now trivial (from the definition of Kˆ) to see that K̃ ⊆ Kˆ, which implies that hKˆ(θi) ≥ hK̃(θi) = hˆPi. On the other hand, the definition of Kˆ immediately gives hKˆ(θi) ≤ hˆPi. The observation (4.53) immediately gives
\[
\mathbb{E}_{K^*} L_f(K^*, \hat K) = \mathbb{E}_{K^*}\,\frac1n\sum_{i=1}^{n}\big(h_{K^*}(\theta_i) - \hat h^P_i\big)^2.
\]
It will be convenient here to introduce the following notation. Let \(h^{\mathrm{vec}}_{K^*}\) denote the vector (hK∗(θ1), . . . , hK∗(θn)). Also, for u, v ∈ Rⁿ, let ℓ(u, v) denote the scaled Euclidean distance defined by \(\ell^2(u, v) := \sum_{i=1}^{n}(u_i - v_i)^2/n\). With this notation, we have
\[
\mathbb{E}_{K^*} L_f(K^*, \hat K) = \mathbb{E}_{K^*}\,\ell^2\big(h^{\mathrm{vec}}_{K^*}, \hat h^P\big). \tag{4.54}
\]
Recall that hˆP is the projection of hˆ := (hˆ1, . . . , hˆn) onto H. Because H is a closed convex subset of Rⁿ, it follows that (see, for example, [133])
\[
\ell^2(h, \hat h) \ge \ell^2(\hat h, \hat h^P) + \ell^2(h, \hat h^P) \quad\text{for every } h\in H.
\]
In particular, with \(h = h^{\mathrm{vec}}_{K^*}\), we obtain \(\ell^2(h^{\mathrm{vec}}_{K^*}, \hat h^P) \le \ell^2(h^{\mathrm{vec}}_{K^*}, \hat h)\). Combining this with (4.54), we obtain
\[
\mathbb{E}_{K^*} L_f(K^*, \hat K) \le \mathbb{E}_{K^*}\,\ell^2\big(h^{\mathrm{vec}}_{K^*}, \hat h\big) = \frac1n\sum_{i=1}^{n}\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2. \tag{4.55}
\]
In Theorem 4.3.1, we proved that
\[
\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2 \le \frac{C\sigma^2}{k_*(i)+1} \quad\text{for every } i = 1, \ldots, n.
\]
This implies that
\[
\mathbb{E}_{K^*} L_f(K^*, \hat K) \le \frac{C\sigma^2}{n}\sum_{i=1}^{n}\frac{1}{k_*(i)+1}.
\]
For inequality (4.32), it is therefore enough to prove that
\[
\sum_{i=1}^{n}\frac{1}{k_*(i)+1} \le C\left\{1 + \left(\frac{R\sqrt n}{\sigma}\right)^{2/5}\right\}. \tag{4.56}
\]
Proving the above inequality is the main part of the proof of Theorem 4.3.3. Because of space constraints, we have moved this proof to [27, Subsection B.1.5]. Our proof of (4.56) is inspired by an argument due to Zhang [159, Proof of Theorem 2.1] in a very different context.

4.7.4 Proof of Theorem 4.3.4

Let us start with some notation. For every compact, convex set P and i = 1, . . . , n, let \(k_*^P(i)\) denote the quantity k∗(i) with K∗ replaced by P. More precisely,
\[
k^P_*(i) := \mathop{\mathrm{argmin}}_{k\in I}\left\{\Delta^P_k(\theta_i) + \frac{2\sigma}{\sqrt{k+1}}\right\}, \tag{4.57}
\]

where \(\Delta_k^P(\theta_i)\) is defined as in (4.40) with K∗ replaced by P. Lemma B.1.6 (stated and proved in Section B.1.6) will be used crucially in the proof below. This lemma states that, for every i = 1, . . . , n, the risk EK∗(hˆi − hK∗(θi))² can be bounded from above by a combination of \(k_*^P(i)\) and how well K∗ can be approximated by P. This result holds for every P. The approximation of K∗ by P is measured in terms of the Hausdorff distance (defined in (4.35)).

We are now ready to prove Theorem 4.3.4. We first use inequality (4.55) from the proof of Theorem 4.3.3, which states
\[
\mathbb{E}_{K^*} L_f\big(K^*, \hat K\big) \le \frac1n\sum_{i=1}^{n}\mathbb{E}_{K^*}\big(\hat h_i - h_{K^*}(\theta_i)\big)^2.
\]
An application of Lemma B.1.6, specifically inequality (B.47), for i = 1, . . . , n now implies the existence of a universal positive constant C such that

\[
\mathbb{E}_{K^*} L_f\big(K^*, \hat K\big) \le C\left(\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{1}{k^P_*(i)+1} + \ell_H^2(K^*, P)\right)
\]
for every compact, convex set P. By restricting P to be in the class of polytopes, we get

\[
\mathbb{E}_{K^*} L_f\big(K^*, \hat K\big) \le C\,\inf_{P\in\mathcal P}\left(\frac{\sigma^2}{n}\sum_{i=1}^{n}\frac{1}{k^P_*(i)+1} + \ell_H^2(K^*, P)\right).
\]
For the proof of (4.36), it is therefore enough to show that

\[
\sum_{i=1}^{n}\frac{1}{k^P_*(i)+1} \le C\,v_P\log\frac{en}{v_P} \quad\text{for every } P\in\mathcal P, \tag{4.58}
\]

where vP denotes the number of extreme points of P and C is a universal positive constant. Fix a polytope P with vP = k. Let the extreme points of P be z1, . . . , zk. Let S1, . . . , Sk denote a partition of {θ1, . . . , θn} into k nonempty sets such that for each j = 1, . . . , k, we have
\[
h_P(\theta_i) = z_j(1)\cos\theta_i + z_j(2)\sin\theta_i \quad\text{for all } \theta_i\in S_j,
\]

where zj = (zj(1), zj(2)). For (4.58), it is enough to prove that
\[
\sum_{i:\theta_i\in S_j}\frac{1}{k^P_*(i)+1} \le C\log(e\,n_j) \quad\text{for every } j = 1, \ldots, k, \tag{4.59}
\]

where nj is the cardinality of Sj. This is because we can write

$$\sum_{i=1}^n \frac{1}{k^P_*(i) + 1} = \sum_{j=1}^k \sum_{i: \theta_i \in S_j} \frac{1}{k^P_*(i) + 1} \le C \sum_{j=1}^k \log(e n_j) \le C k \log\frac{en}{k},$$

where we used the concavity of $x \mapsto \log(ex)$. We prove (4.59) below. Fix $1 \le j \le k$. The inequality is obvious if $S_j$ is a singleton because $k^P_*(i) \ge 0$. So suppose that $n_j = m \ge 2$. Without loss of generality assume that $S_j = \{\theta_{u+1}, \dots, \theta_{u+m}\}$ where $0 \le u \le n - m$. The definition of $S_j$ implies that

$$h_P(\theta) = z_j(1) \cos\theta + z_j(2) \sin\theta \quad \text{for all } \theta \in [\theta_{u+1}, \theta_{u+m}].$$

We can therefore apply inequality (4.27) to claim the existence of a positive constant $c$ such that
$$k^P_*(i) \ge c\, n \min(\theta_i - \theta_{u+1},\, \theta_{u+m} - \theta_i) \quad \text{for all } u+1 \le i \le u+m.$$

The minimum with $\pi$ in (4.27) is redundant here because $\theta_{u+m} - \theta_{u+1} < 2\pi$. Because $\theta_i = 2\pi i/n - \pi$, we get

$$k^P_*(i) \ge 2\pi c \min(i - u - 1,\, u + m - i) \quad \text{for all } u+1 \le i \le u+m.$$

Therefore, there exists a universal constant $C$ such that

$$\sum_{i: \theta_i \in S_j} \frac{1}{k^P_*(i) + 1} \le C \sum_{i=1}^m \frac{1}{1 + \min(i-1, m-i)} \le C \sum_{i=1}^m \frac{1}{i} \le C \log(em).$$

This proves (4.59), thereby completing the proof of Theorem 4.3.4.

4.7.5 Proof of Theorem 4.3.5

Recall the definition (4.16) of the estimator $\hat{K}'$ and that of the interpolating function (4.15). Following an argument similar to that used at the beginning of the proof of Theorem 4.3.3, we observe that

$$\mathbb{E}_{K^*} L(K^*, \hat{K}') \le \mathbb{E}_{K^*} \int_{-\pi}^{\pi} \big( h_{K^*}(\theta) - \hat{h}'(\theta) \big)^2 d\theta = \sum_{i=1}^n \mathbb{E}_{K^*} \int_{\theta_i}^{\theta_{i+1}} \big( h_{K^*}(\theta) - \hat{h}'(\theta) \big)^2 d\theta. \qquad (4.60)$$
Now fix $1 \le i \le n$, $\theta_i \le \theta \le \theta_{i+1}$ and let $u(\theta) := \mathbb{E}_{K^*}\big( h_{K^*}(\theta) - \hat{h}'(\theta) \big)^2$. Using the expression (4.15) for $\hat{h}'(\theta)$, we get that

$$u(\theta) = \mathbb{E}_{K^*}\bigg( h_{K^*}(\theta) - \frac{\sin(\theta_{i+1} - \theta)}{\sin(\theta_{i+1} - \theta_i)} \hat{h}_i - \frac{\sin(\theta - \theta_i)}{\sin(\theta_{i+1} - \theta_i)} \hat{h}_{i+1} \bigg)^2.$$

We now write $\hat{h}_i = \hat{h}_i - h_{K^*}(\theta_i) + h_{K^*}(\theta_i)$ and a similar expression for $\hat{h}_{i+1}$. The elementary inequality $(a+b+c)^2 \le 3(a^2+b^2+c^2)$ along with $\max\big(\sin(\theta - \theta_i), \sin(\theta_{i+1} - \theta)\big) \le \sin(\theta_{i+1} - \theta_i)$ then imply that

$$u(\theta) \le 3\,\mathbb{E}_{K^*}\big( \hat{h}_i - h_{K^*}(\theta_i) \big)^2 + 3\,\mathbb{E}_{K^*}\big( \hat{h}_{i+1} - h_{K^*}(\theta_{i+1}) \big)^2 + 3 b^2(\theta), \quad \text{where}$$
$$b(\theta) := h_{K^*}(\theta) - \frac{\sin(\theta_{i+1} - \theta)}{\sin(\theta_{i+1} - \theta_i)} h_{K^*}(\theta_i) - \frac{\sin(\theta - \theta_i)}{\sin(\theta_{i+1} - \theta_i)} h_{K^*}(\theta_{i+1}).$$

Therefore from (4.60) (remember that $|\theta_{i+1} - \theta_i| = 2\pi/n$), we deduce

$$\mathbb{E}_{K^*} L(K^*, \hat{K}') \le \frac{12\pi}{n} \sum_{i=1}^n \mathbb{E}_{K^*}\big( \hat{h}_i - h_{K^*}(\theta_i) \big)^2 + 3 \int_{-\pi}^{\pi} b^2(\theta)\, d\theta.$$

Now to bound $\sum_{i=1}^n \mathbb{E}_{K^*}\big( \hat{h}_i - h_{K^*}(\theta_i) \big)^2$, we can simply use the arguments from the proofs of Theorems 4.3.3 and 4.3.4. Therefore, to complete the proof of Theorem 4.3.5, we only need to show that
$$|b(\theta)| \le \frac{CR}{n} \quad \text{for every } \theta \in (-\pi, \pi] \qquad (4.61)$$
for some universal constant $C$. For this, we use the hypothesis that $K^*$ is contained in a ball of radius $R$. Suppose that the center of the ball is $(x_1, x_2)$. Define $K' := K^* - \{(x_1, x_2)\} := \{(y_1, y_2) - (x_1, x_2) : (y_1, y_2) \in K^*\}$ and note that $h_{K'}(\theta) = h_{K^*}(\theta) - x_1 \cos\theta - x_2 \sin\theta$. It is then easy to see that $b(\theta)$ is the same for both $K^*$ and $K'$. It is therefore enough to prove (4.61) assuming that $(x_1, x_2) = (0, 0)$. In this case, it is straightforward to see that $|h_{K^*}(\theta)| \le R$ for all $\theta$ and also that $h_{K^*}$ is Lipschitz with constant $R$. Now, because $\max\big(\sin(\theta - \theta_i), \sin(\theta_{i+1} - \theta)\big) \le \sin(\theta_{i+1} - \theta_i)$, it can be checked that $|b(\theta)|$ is bounded from above by

$$|h_{K^*}(\theta)| \bigg| 1 - \frac{\sin(\theta_{i+1} - \theta)}{\sin(\theta_{i+1} - \theta_i)} - \frac{\sin(\theta - \theta_i)}{\sin(\theta_{i+1} - \theta_i)} \bigg| + \sum_{j=i}^{i+1} \big| h_{K^*}(\theta_j) - h_{K^*}(\theta) \big|.$$

Because $h_{K^*}$ is $R$-Lipschitz and bounded by $R$, it is clear that we only need to show

$$\bigg| 1 - \frac{\sin(\theta_{i+1} - \theta)}{\sin(\theta_{i+1} - \theta_i)} - \frac{\sin(\theta - \theta_i)}{\sin(\theta_{i+1} - \theta_i)} \bigg| \le \frac{C}{n}$$

in order to prove (4.61). For this, write $\alpha = \theta_{i+1} - \theta$ and $\beta = \theta - \theta_i$ so that the above expression becomes

$$\bigg| 1 - \frac{\sin\alpha + \sin\beta}{\sin(\alpha + \beta)} \bigg| \le |1 - \cos\alpha| + |1 - \cos\beta| \le \frac{\alpha^2 + \beta^2}{2} \le \frac{C}{n^2} \le \frac{C}{n}.$$
This completes the proof of Theorem 4.3.5.

Part III

Optimization

Chapter 5

Early stopping for kernel boosting algorithms

5.1 Introduction

While non-parametric models offer great flexibility, they can also lead to overfitting, and thus poor generalization performance. For this reason, it is well understood that procedures for fitting non-parametric models must involve some form of regularization. When models are fit via empirical risk minimization, the most classical form of regularization is based on adding some type of penalty to the objective function. An alternative form of regularization is based on the principle of early stopping, in which an iterative algorithm is run for a pre-specified number of steps and terminated prior to convergence. While the basic idea of early stopping is fairly old (e.g., [134, 4, 142]), recent years have witnessed renewed interest in its properties, especially in the context of boosting algorithms and neural network training (e.g., [114, 35]). Over the past decade, a line of work has yielded some theoretical insight into early stopping, including works on classification error for boosting algorithms [15, 53, 84, 101, 155, 160], L2-boosting algorithms for regression [26, 25], and similar gradient algorithms in reproducing kernel Hilbert spaces (e.g., [33, 32, 141, 155, 116]). A number of these papers establish consistency results for particular forms of early stopping, guaranteeing that the procedure outputs a function with statistical error that converges to zero as the sample size increases. On the other hand, there are relatively few results that actually establish rate optimality of an early stopping procedure, meaning that the achieved error matches known statistical minimax lower bounds. To the best of our knowledge, Bühlmann and Yu [26] were the first to prove optimality for early stopping of L2-boosting as applied to spline classes, albeit with a rule that was not computable from the data. Subsequent work by Raskutti et al. [116] refined this analysis of L2-boosting for kernel classes and first established an important connection to the localized Rademacher complexity; see also the related work [155, 123, 31] with rates for particular kernel classes.

More broadly, relative to our rich and detailed understanding of regularization via penalization (e.g., see the books [76, 138, 136, 144] and papers [13, 88] for details), our understanding of early stopping regularization is not as well developed. Intuitively, early stopping should depend on the same bias-variance tradeoffs that control estimators based on penalization. In particular, for penalized estimators, it is now well understood that complexity measures such as the localized Gaussian width, or its Rademacher analogue, can be used to characterize their achievable rates [13, 88, 136, 144]. Is such a general and sharp characterization also possible in the context of early stopping?

The main intention of this chapter is to answer this question in the affirmative for the early stopping of boosting algorithms for a certain class of regression and classification problems involving functions in reproducing kernel Hilbert spaces (RKHS). A standard way to obtain a good estimator or classifier is through minimizing some penalized form of loss function, of which the method of kernel ridge regression [143] is a popular choice. Instead, we consider an iterative update involving the kernel that is derived from a greedy update.
Borrowing tools from empirical process theory, we are able to characterize the “size” of the effective function space explored by taking T steps, and then to connect the resulting estimation error naturally to the notion of localized Gaussian width defined with respect to this effective function space. This leads to a principled analysis for a broad class of loss functions used in practice, including the loss functions that underlie the L2-boost, LogitBoost and AdaBoost algorithms, among other procedures. The remainder of this chapter is organized as follows. In Section 5.2, we provide back- ground on boosting methods and reproducing kernel Hilbert spaces, and then introduce the updates studied in this chapter. Section 5.3 is devoted to statements of our main results, followed by a discussion of their consequences for particular function classes in Section 5.4. We provide simulations that confirm the practical effectiveness of our stopping rules, and show close agreement with our theoretical predictions. In Section 5.6, we provide the proofs of our main results, with certain more technical aspects deferred to the appendices.

5.2 Background and problem formulation

The goal of prediction is to learn a function that maps covariates x ∈ X to responses y ∈ Y. In a regression problem, the responses are typically real-valued, whereas in a classification problem, the responses take values in a finite set. In this chapter, we study both regression (Y = R) and classification problems (e.g., Y = {−1, +1} in the binary case). Our primary focus is on the case of fixed design, in which we observe a collection of n pairs of the form {(x_i, Y_i)}_{i=1}^n, where each x_i ∈ X is a fixed covariate, whereas Y_i ∈ Y is a random response drawn independently from a distribution P_{Y|x_i} which depends on x_i. Later in the chapter, we also discuss the consequences of our results for the case of random design, where the (X_i, Y_i) pairs are drawn in an i.i.d. fashion from the joint distribution P = P_X P_{Y|X} for some distribution P_X on the covariates.

In this section, we provide some necessary background on a gradient-type algorithm which is often referred to as a boosting algorithm. We also briefly discuss reproducing kernel

Hilbert spaces before turning to a precise formulation of the problem that is studied in this chapter.

5.2.1 Boosting and early stopping

Consider a cost function φ : R × R → [0, ∞), where the non-negative scalar φ(y, θ) denotes the cost associated with predicting θ when the true response is y. Some common examples of loss functions φ that we consider in later sections include:

• the least-squares loss $\phi(y, \theta) := \frac{1}{2}(y - \theta)^2$ that underlies L2-boosting [26],

• the logistic regression loss $\phi(y, \theta) = \ln(1 + e^{-y\theta})$ that underlies the LogitBoost algorithm [55, 56], and

• the exponential loss φ(y, θ) = exp(−yθ) that underlies the AdaBoost algorithm [53].
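To make these loss functions and the gradients used by the boosting updates below fully concrete, the following short Python sketch (our own illustration; the function names are not from the original text) writes out each cost φ(y, θ) together with its derivative in θ:

    import numpy as np

    # Each cost phi(y, theta) and its derivative with respect to theta.
    def least_squares(y, theta):
        return 0.5 * (y - theta) ** 2

    def least_squares_grad(y, theta):
        return theta - y

    def logistic(y, theta):
        return np.log1p(np.exp(-y * theta))

    def logistic_grad(y, theta):
        return -y / (1.0 + np.exp(y * theta))

    def exponential(y, theta):
        return np.exp(-y * theta)

    def exponential_grad(y, theta):
        return -y * np.exp(-y * theta)

These derivatives are the only problem-specific ingredient needed by the kernel boosting iteration described in Section 5.2.3.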

The least-squares loss is typically used for regression problems (e.g., [26, 33, 32, 141, 155, 116]), whereas the latter two losses are frequently used in the setting of binary classification (e.g., [53, 101, 56]). We have set up the non-parametric estimation problem in our Section 2.2. To recall, we define the population cost functional f 7→ L(f) via

$$\mathcal{L}(f) := \mathbb{E}_{Y_1^n}\bigg[ \frac{1}{n} \sum_{i=1}^n \phi\big(Y_i, f(x_i)\big) \bigg]. \qquad (5.1)$$

Note that with the covariates $\{x_i\}_{i=1}^n$ fixed, the functional $\mathcal{L}$ is a non-random object. Given some function space $\mathcal{F}$, the optimal function$^*$ minimizes the population cost functional, that is,

$$f^* := \arg\min_{f \in \mathcal{F}} \mathcal{L}(f). \qquad (5.2)$$

As a standard example, when we adopt the least-squares loss $\phi(y, \theta) = \frac{1}{2}(y - \theta)^2$, the population minimizer $f^*$ corresponds to the conditional expectation $x \mapsto \mathbb{E}[Y \mid x]$.

Since we do not have access to the population distribution of the responses, however, the computation of $f^*$ is impossible. Given our samples $\{Y_i\}_{i=1}^n$, we consider instead some procedure applied to the empirical loss

$$\mathcal{L}_n(f) := \frac{1}{n} \sum_{i=1}^n \phi\big(Y_i, f(x_i)\big), \qquad (5.3)$$

where the population expectation has been replaced by an empirical expectation. For example, when $\mathcal{L}_n$ corresponds to the log likelihood of the samples with $\phi(Y_i, f(x_i)) = \log[\mathbb{P}(Y_i; f(x_i))]$, direct unconstrained minimization of $\mathcal{L}_n$ would yield the maximum likelihood estimator.

It is well known that direct minimization of $\mathcal{L}_n$ over a sufficiently rich function class $\mathcal{F}$ may lead to overfitting. There are various ways to mitigate this phenomenon, among which the most classical method is to minimize the sum of the empirical loss with a penalty regularization term. Adjusting the weight on the regularization term allows for a trade-off between fit to the data, and some form of regularity or smoothness in the fit. The behavior of such penalized or regularized estimation methods is now quite well understood (for instance, see the books [76, 138, 136, 144] and papers [13, 88] for more details).

In this chapter, we study a form of algorithmic regularization, based on applying a gradient-type algorithm to $\mathcal{L}_n$ but then stopping it "early", that is, after some fixed number of steps. Such methods are often referred to as boosting algorithms, since they involve "boosting" or improving the fit of a function via a sequence of additive updates (see e.g. [124, 53, 21, 20, 125]). Many boosting algorithms, among them AdaBoost [53], L2-boosting [26] and LogitBoost [55, 56], can be understood as forms of functional gradient methods [101, 56]; see the survey paper [25] for further background on boosting. The way in which the number of steps is chosen is referred to as a stopping rule, and the overall procedure is referred to as early stopping of a boosting algorithm.

$^*$As clarified in the sequel, our assumptions guarantee uniqueness of $f^*$.

Figure 5.1. Plots of the squared error $\|f^t - f^*\|_n^2 = \frac{1}{n}\sum_{i=1}^n (f^t(x_i) - f^*(x_i))^2$ versus the iteration number $t$ for (a) LogitBoost using a first-order Sobolev kernel, and (b) AdaBoost using the same first-order Sobolev kernel $\mathcal{K}(x, x') = 1 + \min(x, x')$, which generates a class of Lipschitz functions (splines of order one). Both plots correspond to a sample size n = 100.

In more detail, a broad class of boosting algorithms [101] generate a sequence $\{f^t\}_{t=0}^\infty$ via updates of the form

$$f^{t+1} = f^t - \alpha^t g^t \quad \text{with} \quad g^t \propto \arg\max_{\|d\|_{\mathcal{F}} \le 1} \langle \nabla\mathcal{L}_n(f^t), d(x_1^n)\rangle, \qquad (5.4)$$

where the scalars $\{\alpha^t\}_{t=0}^\infty$ form a sequence of step sizes chosen by the user, the constraint $\|d\|_{\mathcal{F}} \le 1$ defines the unit ball in a given function class $\mathcal{F}$, $\nabla\mathcal{L}_n(f) \in \mathbb{R}^n$ denotes the gradient taken at the vector $(f(x_1), \dots, f(x_n))$, and $\langle h, g\rangle$ is the usual inner product between vectors $h, g \in \mathbb{R}^n$. For non-decaying step sizes and a convex objective $\mathcal{L}_n$, running this procedure for an infinite number of iterations will lead to a minimizer of the empirical loss, thus causing overfitting. In order to illustrate this phenomenon, Figure 5.1 provides plots of the squared error $\|f^t - f^*\|_n^2 := \frac{1}{n}\sum_{i=1}^n \big(f^t(x_i) - f^*(x_i)\big)^2$ versus the iteration number, for LogitBoost in panel (a) and AdaBoost in panel (b). See Section 5.4.2 for more details on exactly how these experiments were conducted.

In the plots in Figure 5.1, the dotted line indicates the minimum mean-squared error $\rho_n^2$ over all iterates of that particular run of the algorithm. Both plots are qualitatively similar, illustrating the existence of a "good" number of iterations to take, after which the MSE greatly increases. Hence a natural problem is to decide at what iteration $T$ to stop such that the iterate $f^T$ satisfies bounds of the form

$$\mathcal{L}(f^T) - \mathcal{L}(f^*) \precsim \rho_n^2 \quad \text{and} \quad \|f^T - f^*\|_n^2 \precsim \rho_n^2 \qquad (5.5)$$

with high probability. Here $f(n) \precsim g(n)$ indicates that $f(n) \le c\, g(n)$ for some universal constant $c \in (0, \infty)$. The main results of this part provide a stopping rule $T$ for which bounds of the form (5.5) do in fact hold with high probability over the randomness in the observed responses.

5.2.2 Reproducing Kernel Hilbert Spaces

The analysis of this chapter focuses on algorithms with the update (5.4) when the function class $\mathcal{F}$ is a reproducing kernel Hilbert space $\mathcal{H}$. Several important properties of this space are summarized in Section 2.2.2. To recall, a reproducing kernel Hilbert space $\mathcal{H}$ (RKHS for short) consists of functions mapping a domain $\mathcal{X}$ to the real line $\mathbb{R}$. Any RKHS is defined by a bivariate symmetric kernel function $\mathcal{K}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is required to be positive semidefinite, i.e. for any integer $N \ge 1$ and any collection of points $\{x_j\}_{j=1}^N$ in $\mathcal{X}$, the matrix $[\mathcal{K}(x_i, x_j)]_{ij} \in \mathbb{R}^{N\times N}$ is positive semidefinite.

Throughout this chapter, we assume that the kernel function is uniformly bounded, meaning that there is a constant $L$ such that $\sup_{x \in \mathcal{X}} \mathcal{K}(x, x) \le L$. Such a boundedness condition holds for many kernels used in practice, including the Gaussian, Laplacian, Sobolev, and other types of spline kernels, as well as any trace class kernel with trigonometric eigenfunctions. By rescaling the kernel as necessary, we may assume without loss of generality that $L = 1$.

As a consequence, for any function $f$ such that $\|f\|_{\mathcal{H}} \le r$, we have by the reproducing relation (2.13) that

$$\|f\|_\infty = \sup_x \langle f, \mathcal{K}(\cdot, x)\rangle_{\mathcal{H}} \le \|f\|_{\mathcal{H}} \sup_x \|\mathcal{K}(\cdot, x)\|_{\mathcal{H}} \le r.$$

Given samples $\{(x_i, y_i)\}_{i=1}^n$, by the representer theorem [86], it is sufficient to restrict ourselves to the linear subspace $\mathcal{H}_n = \operatorname{span}\{\mathcal{K}(\cdot, x_i)\}_{i=1}^n$, for which all $f \in \mathcal{H}_n$ can be expressed as

$$f = \frac{1}{\sqrt{n}} \sum_{i=1}^n \omega_i \mathcal{K}(\cdot, x_i) \qquad (5.6)$$

for some coefficient vector $\omega \in \mathbb{R}^n$. Among those functions which achieve the infimum in expression (5.1), let us define $f^*$ as the one with the minimum Hilbert norm. This definition is equivalent to restricting $f^*$ to be in the linear subspace $\mathcal{H}_n$.
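As a small illustration of this representation (our own sketch; the helper names and the choice of the first-order Sobolev kernel are ours, not from the original text), the normalized kernel matrix used below and the map from a coefficient vector ω to the fitted values $f(x_1^n) = \sqrt{n} K \omega$ can be formed as follows:

    import numpy as np

    def sobolev_kernel(x, z):
        # First-order Sobolev kernel K(x, z) = 1 + min(x, z) on [0, 1].
        return 1.0 + np.minimum(x, z)

    def normalized_kernel_matrix(x, kernel=sobolev_kernel):
        # Matrix with entries K_ij = K(x_i, x_j) / n, cf. Section 5.2.3.
        n = len(x)
        return kernel(x[:, None], x[None, :]) / n

    def fitted_values(K, omega):
        # Function values f(x_1^n) = sqrt(n) * K * omega for f in the
        # representer expansion (5.6).
        n = K.shape[0]
        return np.sqrt(n) * (K @ omega)

Any function in the subspace H_n is determined by such a coefficient vector, which is why the boosting updates below can be carried out entirely on the n-dimensional vectors of fitted values.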

5.2.3 Boosting in kernel spaces

For a finite number of covariates $x_i$, $i = 1, \dots, n$, let us define the normalized kernel matrix $K \in \mathbb{R}^{n\times n}$ with entries $K_{ij} = \mathcal{K}(x_i, x_j)/n$. Since we can restrict the minimization of $\mathcal{L}_n$ and $\mathcal{L}$ from $\mathcal{H}$ to the subspace $\mathcal{H}_n$ without loss of generality, using expression (5.6) we can then write the function value vectors $f(x_1^n) := (f(x_1), \dots, f(x_n))$ as $f(x_1^n) = \sqrt{n} K \omega$. As there is a one-to-one correspondence between the $n$-dimensional vectors $f(x_1^n) \in \mathbb{R}^n$ and the corresponding function $f \in \mathcal{H}_n \subseteq \mathcal{H}$ by the representer theorem, minimization of an empirical loss in the subspace $\mathcal{H}_n$ essentially becomes the $n$-dimensional problem of fitting a response vector $y$ over the set $\operatorname{range}(K)$. In the sequel, all updates will thus be performed on the function value vectors $f(x_1^n)$. With a change of variable $d(x_1^n) = \sqrt{n}\sqrt{K} z$ we then have

$$d^t(x_1^n) := \arg\max_{\substack{\|d\|_{\mathcal{H}} \le 1 \\ d \in \operatorname{range}(K)}} \langle \nabla\mathcal{L}_n(f^t), d(x_1^n)\rangle = \frac{\sqrt{n}\, K \nabla\mathcal{L}_n(f^t)}{\sqrt{\nabla\mathcal{L}_n(f^t)^\top K \nabla\mathcal{L}_n(f^t)}}.$$

In this chapter, we study the choice $g^t = \langle \nabla\mathcal{L}_n(f^t), d^t(x_1^n)\rangle\, d^t$ in the boosting update (5.4), so that the function value iterates take the form

$$f^{t+1}(x_1^n) = f^t(x_1^n) - \alpha n K \nabla\mathcal{L}_n(f^t), \qquad (5.7)$$

where $\alpha > 0$ is a constant step-size choice. Choosing $f^0(x_1^n) = 0$ ensures that all iterates $f^t(x_1^n)$ remain in the range space of $K$.

In this chapter, we consider the following three error measures for an estimator $\hat{f}$:

$$L^2(\mathbb{P}_n)\text{ norm:} \quad \|\hat{f} - f^*\|_n^2 = \frac{1}{n}\sum_{i=1}^n \big(\hat{f}(x_i) - f^*(x_i)\big)^2,$$
$$L^2(\mathbb{P}_X)\text{ norm:} \quad \|\hat{f} - f^*\|_2^2 := \mathbb{E}\big(\hat{f}(X) - f^*(X)\big)^2,$$
$$\text{Excess risk:} \quad \mathcal{L}(\hat{f}) - \mathcal{L}(f^*),$$

where the expectation in the $L^2(\mathbb{P}_X)$-norm is taken over random covariates $X$ which are independent of the samples $(X_i, Y_i)$ used to form the estimate $\hat{f}$. Our goal is to propose a stopping time $T$ such that the averaged function $\hat{f} = \frac{1}{T}\sum_{t=1}^T f^t$ satisfies bounds of the type (5.5). We begin our analysis by focusing on the empirical $L^2(\mathbb{P}_n)$ error, but as we will see in Corollary 3, bounds on the empirical error are easily transformed to bounds on the population $L^2(\mathbb{P}_X)$ error. Importantly, we exhibit such bounds with a statistical error term $\delta_n$ that is specified by the localized Gaussian complexity of the kernel class.
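The update (5.7) is simple enough to state directly in code. The following Python sketch (our own; it assumes a function loss_grad(y, f) returning the elementwise derivatives of φ in its second argument, such as those sketched in Section 5.2.1) runs the iteration from f^0 = 0 with a constant step size and returns the averaged iterate that our theory analyzes:

    import numpy as np

    def kernel_boost(K, y, loss_grad, step_size, num_steps):
        # Kernel boosting update (5.7):
        #   f^{t+1}(x_1^n) = f^t(x_1^n) - alpha * n * K * grad L_n(f^t),
        # started at f^0 = 0, returning the averaged iterate (1/T) sum_t f^t.
        n = len(y)
        f = np.zeros(n)        # current function values f^t(x_1^n)
        f_sum = np.zeros(n)    # running sum of f^1, ..., f^T
        for _ in range(num_steps):
            grad = loss_grad(y, f) / n          # gradient of L_n at f(x_1^n)
            f = f - step_size * n * (K @ grad)
            f_sum += f
        return f_sum / num_steps

With the least-squares gradient loss_grad(y, f) = f − y, this reduces to the L2-boosting update analyzed in [116]; plugging in the logistic or exponential gradients gives the kernelized versions of LogitBoost and AdaBoost studied in this chapter.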

5.3 Main results

We now turn to the statement of our main results, beginning with the introduction of some regularity assumptions.

5.3.1 Assumptions

Recall from our earlier set-up that we differentiate between the empirical loss function Ln in expression (5.3), and the population loss L in expression (5.1). Apart from assuming differentiability of both functions, all of our remaining conditions are imposed on the popu- lation loss. Such conditions at the population level are weaker than their analogues at the empirical level. For a given radius r > 0, let us define the Hilbert ball around the optimal function f ∗ as

$$B_{\mathcal{H}}(f^*, r) := \{ f \in \mathcal{H} \mid \|f - f^*\|_{\mathcal{H}} \le r \}. \qquad (5.8)$$

Our analysis makes particular use of this ball defined for the radius $C_{\mathcal{H}}^2 := 2\max\{\|f^*\|_{\mathcal{H}}^2, 32, \sigma^2\}$, where the effective noise level $\sigma$ is defined in the sequel.

We assume that the population loss is $m$-strongly convex and $M$-smooth over $B_{\mathcal{H}}(f^*, 2C_{\mathcal{H}})$, meaning that the

$m$-$M$-condition:
$$\frac{m}{2}\|f - g\|_n^2 \;\le\; \mathcal{L}(f) - \mathcal{L}(g) - \langle \nabla\mathcal{L}(g), f(x_1^n) - g(x_1^n)\rangle \;\le\; \frac{M}{2}\|f - g\|_n^2$$

holds for all $f, g \in B_{\mathcal{H}}(f^*, 2C_{\mathcal{H}})$ and all design points $\{x_i\}_{i=1}^n$. In addition, we assume that the function $\phi$ is $M$-Lipschitz in its second argument over the interval $\theta \in [\min_{i\in[n]} f^*(x_i) - 2C_{\mathcal{H}},\; \max_{i\in[n]} f^*(x_i) + 2C_{\mathcal{H}}]$. To be clear, here $\nabla\mathcal{L}(g)$ denotes the vector in $\mathbb{R}^n$ obtained by taking the gradient of $\mathcal{L}$ with respect to the vector $g(x_1^n)$. It can be verified by a straightforward computation that when $\mathcal{L}$ is induced by the least-squares cost $\phi(y, \theta) = \frac{1}{2}(y - \theta)^2$, the $m$-$M$-condition holds for $m = M = 1$. The logistic and exponential losses satisfy this condition (see supp. material), where it is key that we have imposed the condition only locally on the ball $B_{\mathcal{H}}(f^*, 2C_{\mathcal{H}})$.

In addition to the least-squares cost, our theory also applies to losses $\mathcal{L}$ induced by scalar functions $\phi$ that satisfy the

$\phi'$-boundedness condition:
$$\max_{i=1,\dots,n} \bigg| \frac{\partial \phi(y, \theta)}{\partial \theta}\bigg|_{\theta = f(x_i)} \le B \quad \text{for all } f \in B_{\mathcal{H}}(f^*, 2C_{\mathcal{H}}) \text{ and } y \in \mathcal{Y}.$$

This condition holds with B = 1 for the logistic loss for all Y, and B = exp(2.5CH ) for the exponential loss for binary classification with Y = {−1, 1}, using our kernel boundedness condition. Note that whenever this condition holds with some finite B, we can always rescale the scalar loss φ by 1/B so that it holds with B = 1, and we do so in order to simplify the statement of our results.
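For instance, a one-line check of the logistic case (our own computation, added for concreteness): the derivative of the logistic loss in its second argument is
$$\frac{\partial \phi(y, \theta)}{\partial \theta} = \frac{-y}{1 + e^{y\theta}}, \qquad \text{so} \qquad \bigg|\frac{\partial \phi(y, \theta)}{\partial \theta}\bigg| \le 1 \quad \text{for } y \in \{-1, +1\} \text{ and all } \theta \in \mathbb{R},$$
which gives B = 1 without any localization. The exponential loss, in contrast, has derivative $-y e^{-y\theta}$, whose magnitude grows exponentially in $|\theta|$, which is why its constant B involves the local radius $C_{\mathcal{H}}$.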

5.3.2 Upper bound in terms of localized Gaussian width

Our upper bounds involve a complexity measure known as the localized Gaussian width. In general, Gaussian widths are widely used to obtain risk bounds for least-squares and other types of M-estimators. In our case, we consider Gaussian complexities for "localized" sets of the form
$$\mathcal{E}_n(\delta, 1) := \big\{ f - g \mid \|f - g\|_{\mathcal{H}} \le 1,\; \|f - g\|_n \le \delta \big\} \qquad (5.9)$$
with $f, g \in \mathcal{H}$. The Gaussian complexity localized at scale $\delta$ is given by
$$\mathcal{G}_n\big(\mathcal{E}_n(\delta, 1)\big) := \mathbb{E}\bigg[ \sup_{g \in \mathcal{E}_n(\delta, 1)} \frac{1}{n} \sum_{i=1}^n w_i g(x_i) \bigg], \qquad (5.10)$$

where (w1, . . . , wn) denotes an i.i.d. sequence of standard Gaussian variables. An essential quantity in our theory is specified by a certain fixed point equation that is now standard in empirical process theory [136, 13, 88, 116]. Let us define the effective noise level

$$\sigma := \begin{cases} \min\Big\{ t \;\Big|\; \max\limits_{i=1,\dots,n} \mathbb{E}\big[ e^{(Y_i - f^*(x_i))^2 / t^2} \big] < \infty \Big\} & \text{for the least-squares loss,} \\ 4(2M + 1)(1 + 2C_{\mathcal{H}}) & \text{for } \phi'\text{-bounded losses.} \end{cases} \qquad (5.11)$$

The critical radius $\delta_n$ is the smallest positive scalar such that
$$\frac{\mathcal{G}_n\big(\mathcal{E}_n(\delta, 1)\big)}{\delta} \le \frac{\delta}{\sigma}. \qquad (5.12)$$
We note that past work on localized Rademacher and Gaussian complexity [105, 13] guarantees that there exists a unique $\delta_n > 0$ that satisfies this condition, so that our definition is sensible.

5.3.2.1 Upper bounds on excess risk and empirical $L^2(\mathbb{P}_n)$-error

With this set-up, we are now equipped to state our main theorem. It provides high-probability bounds on the excess risk and $L^2(\mathbb{P}_n)$-error of the estimator $\bar{f}^T := \frac{1}{T}\sum_{t=1}^T f^t$ defined by averaging the $T$ iterates of the algorithm. It applies to both the least-squares cost function, and more generally, to any loss function satisfying the $m$-$M$-condition and the $\phi'$-boundedness condition.

Theorem 1. Suppose that the sample size $n$ is large enough such that $\delta_n \le \frac{M}{m}$, and that we compute the sequence $\{f^t\}_{t=0}^\infty$ using the update (5.7) with initialization $f^0 = 0$ and any step size $\alpha \in (0, \min\{\frac{1}{M}, M\}]$. Then for any iteration $T \in \{0, 1, \dots, \lfloor \frac{m}{8M\delta_n^2} \rfloor\}$, the averaged function estimate $\bar{f}^T$ satisfies the bounds

$$\mathcal{L}(\bar{f}^T) - \mathcal{L}(f^*) \le CM\Big( \frac{1}{\alpha m T} + \frac{\delta_n^2}{m^2} \Big), \quad \text{and} \qquad (5.13a)$$
$$\|\bar{f}^T - f^*\|_n^2 \le C\Big( \frac{1}{\alpha m T} + \frac{\delta_n^2}{m^2} \Big), \qquad (5.13b)$$

where both inequalities hold with probability at least $1 - c_1 \exp\big(-C_2 \frac{m^2 n \delta_n^2}{\sigma^2}\big)$.

We prove Theorem 1 in Section 5.6.1.

A few comments about the constants in our statement: in all cases, constants of the form $c_j$ are universal, whereas the capital $C_j$ may depend on parameters of the joint distribution and population loss $\mathcal{L}$. In Theorem 1, we have the explicit value $C_2 = \max\{\frac{m^2}{\sigma^2}, 1\}$, and $C$ is proportional to the quantity $2\max\{\|f^*\|_{\mathcal{H}}^2, 32, \sigma^2\}$. While inequalities (5.13a) and (5.13b) are stated as high probability results, similar bounds for the expected loss (over the responses $y_i$, with the design fixed) can be obtained by a simple integration argument.

In order to gain intuition for the claims in the theorem, note that apart from factors depending on $(m, M)$, the first term $\frac{1}{\alpha m T}$ dominates the second term $\frac{\delta_n^2}{m^2}$ whenever $T \precsim 1/\delta_n^2$. Consequently, up to this point, taking further iterations reduces the upper bound on the error. This reduction continues until we have taken of the order $1/\delta_n^2$ many steps, at which point the upper bound is of the order $\delta_n^2$. More precisely, suppose that we perform the updates with step size $\alpha = \frac{m}{M}$; then, after a total number of $\tau := \frac{1}{\delta_n^2 \max\{8, M\}}$ many iterations, the extension of Theorem 1 to expectations guarantees that the mean squared error is bounded as

$$\mathbb{E}\|\bar{f}^\tau - f^*\|_n^2 \le C' \frac{\delta_n^2}{m^2}, \qquad (5.14)$$

where $C'$ is another constant depending on $C_{\mathcal{H}}$. Here we have used the fact that $M \ge m$ in simplifying the expression. It is worth noting that guarantee (5.14) matches the best known upper bounds for kernel ridge regression (KRR); indeed, this must be the case, since a sharp analysis of KRR is based on the same notion of localized Gaussian complexity (e.g. [12, 13]). Thus, our results establish a strong parallel between the algorithmic regularization of early stopping, and the penalized regularization of kernel ridge regression. Moreover, as will be clarified in Section 5.3.3, under suitable regularity conditions on the RKHS, the critical squared radius $\delta_n^2$ also acts as a lower bound for the expected risk, meaning that our upper bounds are not improvable in general.

Note that the critical radius $\delta_n^2$ depends on our observations $\{(x_i, y_i)\}_{i=1}^n$ only through the solution of inequality (5.12). In many cases, it is possible to compute and/or upper bound this critical radius, so that a concrete and valid stopping rule can indeed be calculated in advance. In Section 5.4, we provide a number of settings in which this can be done in terms of the eigenvalues $\{\mu_j\}_{j=1}^n$ of the normalized kernel matrix.

5.3.2.2 Consequences for random design regression

Thus far, our analysis has focused purely on the case of fixed design, in which the sequence of covariates $\{x_i\}_{i=1}^n$ is viewed as fixed. If we instead view the covariates as being sampled in an i.i.d. manner from some distribution $\mathbb{P}_X$ over $\mathcal{X}$, then the empirical error $\|\hat{f} - f^*\|_n^2 = \frac{1}{n}\sum_{i=1}^n \big(\hat{f}(x_i) - f^*(x_i)\big)^2$ of a given estimate $\hat{f}$ is a random quantity, and it is interesting to relate it to the squared population $L^2(\mathbb{P}_X)$-norm $\|\hat{f} - f^*\|_2^2 = \mathbb{E}\big[(\hat{f}(X) - f^*(X))^2\big]$.

In order to state an upper bound on this error, we introduce a population analogue of the critical radius $\delta_n$, which we denote by $\bar{\delta}_n$. Consider the set
$$\bar{\mathcal{E}}(\delta, 1) := \big\{ f - g \mid f, g \in \mathcal{H},\; \|f - g\|_{\mathcal{H}} \le 1,\; \|f - g\|_2 \le \delta \big\}. \qquad (5.15)$$

It is analogous to the previously defined set $\mathcal{E}_n(\delta, 1)$, except that the empirical norm $\|\cdot\|_n$ has been replaced by the population version. The population Gaussian complexity localized at scale $\delta$ is given by

$$\bar{\mathcal{G}}_n\big(\bar{\mathcal{E}}(\delta, 1)\big) := \mathbb{E}_{w, X}\bigg[ \sup_{g \in \bar{\mathcal{E}}(\delta, 1)} \frac{1}{n}\sum_{i=1}^n w_i g(X_i) \bigg], \qquad (5.16)$$

where $\{w_i\}_{i=1}^n$ is an i.i.d. sequence of standard normal variates, and $\{X_i\}_{i=1}^n$ is a second i.i.d. sequence, independent of the normal variates, drawn according to $\mathbb{P}_X$. Finally, the population critical radius $\bar{\delta}_n$ is defined by the analogue of inequality (5.12), in which $\mathcal{G}_n$ is replaced by $\bar{\mathcal{G}}_n$.

Corollary 3. In addition to the conditions of Theorem 1, suppose that the sequence $\{(X_i, Y_i)\}_{i=1}^n$ of covariate-response pairs is drawn i.i.d. from some joint distribution $\mathbb{P}$, and that we compute the boosting updates with step size $\alpha \in (0, \min\{\frac{1}{M}, M\}]$ and initialization $f^0 = 0$. Then the averaged function estimate $\bar{f}^T$ at time $T := \lfloor \frac{1}{\bar{\delta}_n^2 \max\{8, M\}} \rfloor$ satisfies the bound

$$\mathbb{E}_X\big(\bar{f}^T(X) - f^*(X)\big)^2 = \|\bar{f}^T - f^*\|_2^2 \le \tilde{c}\, \bar{\delta}_n^2$$

with probability at least $1 - c_1 \exp\big(-C_2 \frac{m^2 n \bar{\delta}_n^2}{\sigma^2}\big)$ over the random samples.

The proof of Corollary 3 follows directly from standard empirical process theory bounds [13, 116] on the difference between the empirical risk $\|\bar{f}^T - f^*\|_n^2$ and the population risk $\|\bar{f}^T - f^*\|_2^2$. In particular, it can be shown that the $\|\cdot\|_2$ and $\|\cdot\|_n$ norms differ only by a factor proportional to $\bar{\delta}_n$. Furthermore, one can show that the empirical critical quantity $\delta_n$ is bounded by the population $\bar{\delta}_n$. By combining both arguments the corollary follows. We refer the reader to the papers [13, 116] for further details on such equivalences.

It is worth comparing this guarantee with the past work of Raskutti et al. [116], who analyzed kernel boosting iterates of the form (5.7), but with attention restricted to the special case of the least-squares loss. Their analysis was based on first decomposing the squared error into bias and variance terms, then carefully relating the combination of these terms to a particular bound on the localized Gaussian complexity (see equation (5.17) below). In contrast, our theory more directly analyzes the effective function class that is explored by taking $T$ steps, so that the localized Gaussian width (5.10) appears more naturally. In addition, our analysis applies to a broader class of loss functions.

In the case of reproducing kernel Hilbert spaces, it is possible to sandwich the localized Gaussian complexity by a function of the eigenvalues of the kernel matrix. Mendelson [105] provides this argument in the case of the localized Rademacher complexity, but similar arguments apply to the localized Gaussian complexity. Letting $\mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge 0$ denote the ordered eigenvalues of the normalized kernel matrix $K$, define the function
$$\mathcal{R}(\delta) = \frac{1}{\sqrt{n}} \sqrt{\sum_{j=1}^n \min\{\delta^2, \mu_j\}}. \qquad (5.17)$$

Up to a universal constant, this function is an upper bound on the Gaussian width $\mathcal{G}_n\big(\mathcal{E}_n(\delta, 1)\big)$ for all $\delta \ge 0$, and up to another universal constant, it is also a lower bound for all $\delta \ge \frac{1}{\sqrt{n}}$.
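Because R(δ) is computable from the kernel eigenvalues, it yields a simple data-dependent surrogate for the critical radius: one can numerically solve R(δ)/δ = δ/σ in place of (5.12). The sketch below is our own, with the sandwiching constant set to one, so the result should be read as an order-of-magnitude quantity rather than an exact δ_n; bisection applies because R(δ)/δ is non-increasing in δ while δ/σ is increasing:

    import numpy as np

    def R(delta, mu, n):
        # R(delta) = (1 / sqrt(n)) * sqrt(sum_j min(delta^2, mu_j)), cf. (5.17).
        return np.sqrt(np.sum(np.minimum(delta ** 2, mu)) / n)

    def critical_radius(mu, sigma, n, lo=1e-8, hi=10.0, iters=100):
        # Smallest delta with R(delta) / delta <= delta / sigma, by bisection.
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if R(mid, mu, n) / mid <= mid / sigma:
                hi = mid
            else:
                lo = mid
        return hi

    # Example (hypothetical variables K, sigma, M assumed to be defined):
    # mu      = np.linalg.eigvalsh(K)[::-1]
    # delta_n = critical_radius(mu, sigma, K.shape[0])
    # T       = int(1.0 / (delta_n ** 2 * max(8, M)))   # stopping rule of Theorem 1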

5.3.3 Achieving minimax lower bounds

In this section, we show that the upper bound (5.14) matches known minimax lower bounds on the error, so that our results are unimprovable in general. We establish this result for the class of regular kernels, as previously defined by Yang et al. [154], which includes the Gaussian and Sobolev kernels as special cases.

The class of regular kernels is defined as follows. Let $\mu_1 \ge \mu_2 \ge \dots \ge \mu_n \ge 0$ denote the ordered eigenvalues of the normalized kernel matrix $K$, and define the quantity $d_n := \operatorname{argmin}_{j=1,\dots,n}\{\mu_j \le \delta_n^2\}$. A kernel is called regular whenever there is a universal constant $c$ such that the tail sum satisfies $\sum_{j=d_n+1}^n \mu_j \le c\, d_n \delta_n^2$. In words, the tail sum of the eigenvalues for regular kernels is roughly on the same or smaller scale as the sum of the eigenvalues bigger than $\delta_n^2$.

For such kernels and under the Gaussian observation model ($Y_i \sim N(f^*(x_i), \sigma^2)$), Yang et al. [154] prove a minimax lower bound involving $\delta_n$. In particular, they show that the minimax risk over the unit ball of the Hilbert space is lower bounded as

$$\inf_{\hat{f}} \sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E}\|\hat{f} - f^*\|_n^2 \ge c_\ell\, \delta_n^2. \qquad (5.18)$$

Comparing the lower bound (5.18) with the upper bound (5.14) for our estimator $\bar{f}^T$ stopped after $O(1/\delta_n^2)$ many steps, it follows that the bounds proven in Theorem 1 are unimprovable apart from constant factors.

We now state a generalization of this minimax lower bound, one which applies to a sub-class of generalized linear models, or GLMs for short. In these models, the conditional distribution of the observed vector $Y = (Y_1, \dots, Y_n)$ given $\big(f^*(x_1), \dots, f^*(x_n)\big)$ takes the form

$$\mathbb{P}_\theta(y) = \prod_{i=1}^n h(y_i) \exp\Big( \frac{y_i f^*(x_i) - \Phi(f^*(x_i))}{s(\sigma)} \Big), \qquad (5.19)$$

where s(σ) is a known scale factor and Φ : R → R is the cumulant function of the generalized linear model. As some concrete examples:

• The linear Gaussian model is recovered by setting s(σ) = σ2 and Φ(t) = t2/2.

• The logistic model for binary responses y ∈ {−1, 1} is recovered by setting s(σ) = 1 and Φ(t) = log(1 + exp(t)).

Our minimax lower bound applies to the class of GLMs for which the cumulant function $\Phi$ is differentiable and has uniformly bounded second derivative $|\Phi''| \le L$. This class includes the linear, logistic, and multinomial families, among others, but excludes (for instance) the Poisson family. Under this condition, we have the following:

Corollary 4. Suppose that we are given i.i.d. samples $\{y_i\}_{i=1}^n$ from a GLM (5.19) for some function $f^*$ in a regular kernel class with $\|f^*\|_{\mathcal{H}} \le 1$. Then running $T := \lfloor \frac{1}{\delta_n^2 \max\{8, M\}} \rfloor$ iterations with step size $\alpha \in (0, \min\{\frac{1}{M}, M\}]$ and $f^0 = 0$ yields an estimate $\bar{f}^T$ such that

$$\mathbb{E}\|\bar{f}^T - f^*\|_n^2 \asymp \inf_{\hat{f}} \sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E}\|\hat{f} - f^*\|_n^2. \qquad (5.20)$$

Here $f(n) \asymp g(n)$ means $f(n) = c\, g(n)$ up to a universal constant $c \in (0, \infty)$. As always, in the minimax claim (5.20), the infimum is taken over all measurable functions of the input data and the expectation is taken over the randomness of the response variables $\{Y_i\}_{i=1}^n$. Since we know that $\mathbb{E}\|\bar{f}^T - f^*\|_n^2 \precsim \delta_n^2$, the way to prove bound (5.20) is by establishing $\inf_{\hat{f}} \sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E}\|\hat{f} - f^*\|_n^2 \succsim \delta_n^2$. See Section 5.6.2 for the proof of this result.

At a high level, the statement in Corollary 4 shows that early stopping prevents us from overfitting to the data; in particular, using the stopping time $T$ yields an estimate that attains the optimal balance between bias and variance.

5.4 Consequences for various kernel classes

In this section, we apply Theorem 1 to derive some concrete rates for different kernel spaces and then illustrate them with some numerical experiments. It is known that the complexity of an RKHS in association with a distribution over the covariates $\mathbb{P}_X$ can be characterized by the decay rate (2.14) of the eigenvalues of the kernel function. In the finite sample setting, the analogous quantities are the eigenvalues $\{\mu_j\}_{j=1}^n$ of the normalized kernel matrix $K$. The representation power of a kernel class is directly correlated with the eigen-decay: the faster the decay, the smaller the function class. When the covariates are drawn from the distribution $\mathbb{P}_X$, empirical process theory guarantees that the empirical and population eigenvalues are close.

5.4.1 Theoretical predictions as a function of decay

In this section, let us consider two broad types of eigen-decay:

• γ-exponential decay: For some γ > 0, the kernel matrix eigenvalues satisfy a decay condition of the form $\mu_j \le c_1 \exp(-c_2 j^\gamma)$, where $c_1, c_2$ are universal constants. Examples of kernels in this class include the Gaussian kernel, which for the Lebesgue measure satisfies such a bound with γ = 2 (real line) or γ = 1 (compact domain).

• β-polynomial decay: For some β > 1/2, the kernel matrix eigenvalues satisfy a decay condition of the form $\mu_j \le c_1 j^{-2\beta}$, where $c_1$ is a universal constant. Examples of kernels in this class include the $k$th-order Sobolev spaces for some fixed integer $k \ge 1$ with Lebesgue measure on a bounded domain. We consider Sobolev spaces that consist of functions that have $k$th-order weak derivatives $f^{(k)}$ being Lebesgue integrable and $f(0) = f^{(1)}(0) = \dots = f^{(k-1)}(0) = 0$. For such classes, the β-polynomial decay condition holds with β = k.

Given eigen-decay conditions of these types, it is possible to compute an upper bound on the critical radius $\delta_n$. In particular, using the fact that the function $\mathcal{R}$ from equation (5.17) is an upper bound on the function $\mathcal{G}_n\big(\mathcal{E}_n(\delta, 1)\big)$, we can show that for γ-exponentially decaying kernels we have $\delta_n^2 \precsim \frac{(\log n)^{1/\gamma}}{n}$, whereas for β-polynomial kernels we have $\delta_n^2 \precsim n^{-2\beta/(2\beta+1)}$, up to universal constants. Combining with our Theorem 1, we obtain the following result:

Corollary 5 (Bounds based on eigendecay). Under the conditions of Theorem 1:

(a) For kernels with γ-exponential eigen-decay, we have
$$\mathbb{E}\|\bar{f}^T - f^*\|_n^2 \le c\, \frac{\log^{1/\gamma} n}{n} \quad \text{at } T \asymp \frac{n}{\log^{1/\gamma} n} \text{ steps.} \qquad (5.21a)$$

(b) For kernels with β-polynomial eigen-decay, we have
$$\mathbb{E}\|\bar{f}^T - f^*\|_n^2 \le c\, n^{-2\beta/(2\beta+1)} \quad \text{at } T \asymp n^{2\beta/(2\beta+1)} \text{ steps.} \qquad (5.21b)$$

See Section 5.6.3 for the proof of Corollary 5. In particular, these bounds hold for LogitBoost and AdaBoost. We note that similar bounds can also be derived with regard to the risk in the $L^2(\mathbb{P}_X)$ norm, as well as the excess risk $\mathcal{L}(f^T) - \mathcal{L}(f^*)$.
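The stopping times appearing in Corollary 5 are simple enough to compute once the eigen-decay exponent is known. The helper below (our own illustration, with all unspecified constants set to one, so the output is only the order of the number of iterations) returns the predicted stopping time from (5.21a) and (5.21b):

    import numpy as np

    def predicted_stopping_time(n, decay, exponent):
        # Order of the optimal number of boosting iterations from Corollary 5.
        if decay == "exponential":      # mu_j <= c1 * exp(-c2 * j**gamma)
            gamma = exponent
            return int(n / np.log(n) ** (1.0 / gamma))
        if decay == "polynomial":       # mu_j <= c1 * j**(-2 * beta)
            beta = exponent
            return int(n ** (2.0 * beta / (2.0 * beta + 1.0)))
        raise ValueError("decay must be 'exponential' or 'polynomial'")

For the first-order Sobolev kernel (β = 1) and n = 100, this gives a stopping time of order n^(2/3), i.e. roughly twenty iterations, which matches the scaling used in the experiments of Section 5.4.2.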

To the best of our knowledge, this result is the first to show non-asymptotic and optimal statistical rates for the $\|\cdot\|_n^2$-error when early stopping LogitBoost or AdaBoost, with an explicit dependence of the stopping rule on $n$. Our results also yield similar guarantees for L2-boosting, as has been established in past work [116]. Note that we can observe a similar trade-off between computational efficiency and statistical accuracy as in the case of kernel least-squares regression [155, 116]: although larger kernel classes (e.g. Sobolev classes) yield higher estimation errors, boosting updates reach the optimum faster than for a smaller kernel class (e.g. Gaussian kernels).

5.4.2 Numerical experiments

We now describe some numerical experiments that provide illustrative confirmations of our theoretical predictions. While we have applied our methods to various kernel classes, in this section we present numerical results for the first-order Sobolev kernel, a typical example of a kernel class with polynomial eigen-decay.

The first-order Sobolev space of Lipschitz functions on the unit interval [0, 1] is defined by the kernel $\mathcal{K}(x, x') = 1 + \min(x, x')$, and we set the design points $\{x_i\}_{i=1}^n$ equidistantly over [0, 1]. Note that the equidistant design yields β-polynomial decay of the eigenvalues of $K$ with β = 1, as in the case when the $x_i$ are drawn i.i.d. from the uniform measure on [0, 1]. Consequently we have $\delta_n^2 \asymp n^{-2/3}$. Accordingly, our theory predicts that the stopping time $T = (cn)^{2/3}$ should lead to an estimate $\bar{f}^T$ such that $\|\bar{f}^T - f^*\|_n^2 \precsim n^{-2/3}$.

In our experiments for L2-Boost, we sampled $Y_i$ according to $Y_i = f^*(x_i) + w_i$ with $w_i \sim N(0, 0.5)$, which corresponds to the model $\mathbb{P}(Y \mid x_i) = N(f^*(x_i), 0.5)$, where $f^*(x) = |x - \frac{1}{2}| - \frac{1}{4}$ is defined on the unit interval [0, 1]. By construction, the function $f^*$ belongs to the first-order Sobolev space with $\|f^*\|_{\mathcal{H}} = 1$. For LogitBoost, we sampled $Y_i$ according to $\mathrm{Bin}(p(x_i), 5)$ where $p(x) = \frac{\exp(f^*(x))}{1 + \exp(f^*(x))}$. In all cases, we fixed the initialization $f^0 = 0$, and ran the updates (5.7) for L2-Boost and LogitBoost with the constant step size α = 0.75. We compared various stopping rules to the oracle gold standard G, meaning the procedure that examines all iterates $\{f^t\}$ and chooses the stopping time $G = \arg\min_{t \ge 1} \|f^t - f^*\|_n^2$ that yields the minimum prediction error. Of course, this procedure is unimplementable in practice, but it serves as a convenient lower bound with which to compare.

Figure 5.2 shows plots of the mean-squared error $\|\bar{f}^T - f^*\|_n^2$ over the sample size $n$, averaged over 40 trials, for the gold standard $T = G$ and stopping rules based on $T = (7n)^\kappa$ for different choices of κ. Error bars correspond to the standard errors computed from our simulations. Panel (a) shows the behavior for L2-boosting, whereas panel (b) shows the behavior for LogitBoost. Note that both plots are qualitatively similar and that the theoretically derived stopping rule $T = (7n)^\kappa$ with $\kappa^* = 2/3 \approx 0.67$, while slightly worse than the gold standard, tracks its performance closely. We also performed simulations for some "bad" stopping rules, in particular for an exponent κ not equal to $\kappa^* = 2/3$, indicated by the green and black curves.
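A compressed version of the L2-Boost experiment just described can be reproduced in a few lines of Python. This sketch is our own and only mirrors the setup in the text (first-order Sobolev kernel on an equidistant design, noise variance 0.5, constant step size 0.75, and the stopping rule T = (7n)^(2/3)); it returns the mean-squared error averaged over the trials:

    import numpy as np

    def run_l2_boost_experiment(n, num_trials=40, alpha=0.75, seed=0):
        rng = np.random.default_rng(seed)
        x = np.arange(1, n + 1) / n                          # equidistant design
        K = (1.0 + np.minimum(x[:, None], x[None, :])) / n   # normalized Sobolev kernel
        f_star = np.abs(x - 0.5) - 0.25                      # f*(x) = |x - 1/2| - 1/4
        T = int((7 * n) ** (2.0 / 3.0))                      # stopping rule, kappa = 2/3
        errors = []
        for _ in range(num_trials):
            y = f_star + rng.normal(0.0, np.sqrt(0.5), size=n)   # noise variance 0.5
            f = np.zeros(n)
            f_sum = np.zeros(n)
            for _ in range(T):
                grad = (f - y) / n                  # least-squares gradient of L_n
                f = f - alpha * n * (K @ grad)      # kernel boosting update (5.7)
                f_sum += f
            f_bar = f_sum / T
            errors.append(np.mean((f_bar - f_star) ** 2))
        return float(np.mean(errors))

Running this for the sample sizes shown in Figure 5.2 should exhibit approximately the n^(-2/3) decay of the error predicted by Corollary 5(b), up to constants that the sketch does not attempt to match.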

Figure 5.2. The mean-squared errors for the stopped iterates $\bar{f}^T$ at the Gold standard, i.e. the iterate with the minimum error among all unstopped updates (blue), and at $T = (7n)^\kappa$ (with the theoretically optimal κ = 0.67 in red, κ = 0.33 in black and κ = 1 in green) for (a) L2-Boost and (b) LogitBoost.

Figure 5.3. Logarithmic plots of the mean-squared errors at the Gold standard in blue and at $T = (7n)^\kappa$ (with the theoretically optimal rule κ = 0.67 in red, κ = 0.33 in black and κ = 1 in green) for (a) L2-Boost and (b) LogitBoost.

In the log-scale plots in Figure 5.3, we can clearly see that for κ ∈ {0.33, 1} the performance is indeed much worse, with the difference in slope even suggesting a different scaling of the error with the number of observations n. Recalling our discussion of Figure 5.1, this phenomenon likely occurs due to underfitting and overfitting effects. These qualitative shifts are consistent with our theory.

5.5 Discussion

In this chapter, we have proven non-asymptotic bounds for early stopping of kernel boosting for a relatively broad class of loss functions. These bounds allowed us to propose simple stopping rules which, for the class of regular kernel functions [154], yield minimax optimal rates of estimation. Although the connection between early stopping and regularization has long been studied and explored in the theoretical literature and in applications alike, to the best of our knowledge, these results are the first to establish a general relationship between the statistical optimality of stopped iterates and the localized Gaussian complexity. This connection is important, because this localized Gaussian complexity measure, as well as its Rademacher analogue, is now well understood to play a central role in controlling the behavior of estimators based on regularization [136, 13, 88, 144].

There are various open questions suggested by our results. The stopping rules in this chapter depend on the eigenvalues of the empirical kernel matrix; for this reason, they are data-dependent and computable given the data. However, in practice, it would be desirable to avoid the cost of computing all the empirical eigenvalues. Can fast approximation techniques for kernels be used to approximately compute our optimal stopping rules? Second, our current theoretical results apply to the averaged estimator $\bar{f}^T$. We strongly suspect that the same results apply to the stopped estimator $f^T$, but some new ingredients are required to extend our proofs.

5.6 Proof of main results

In this section, we present the proofs of our main results. The technical details are deferred to Appendix C.

In the following, recalling the discussion in Section 5.2.3, we denote the vector of function values of a function $f \in \mathcal{H}$ evaluated at $(x_1, x_2, \dots, x_n)$ as $\theta_f := f(x_1^n) = (f(x_1), f(x_2), \dots, f(x_n)) \in \mathbb{R}^n$, where we omit the subscript $f$ when it is clear from the context. As mentioned in the main text, updates on the function value vectors $\theta^t \in \mathbb{R}^n$ correspond uniquely to updates of the functions $f^t \in \mathcal{H}$. In the following we repeatedly abuse notation by defining the Hilbert norm and empirical norm of vectors $\Delta \in \operatorname{range}(K)$ as
$$\|\Delta\|_{\mathcal{H}}^2 = \frac{1}{n}\Delta^\top K^\dagger \Delta \quad \text{and} \quad \|\Delta\|_n^2 = \frac{1}{n}\|\Delta\|_2^2,$$

where $K^\dagger$ is the pseudoinverse of $K$. We also use $B_{\mathcal{H}}(\theta, r)$ to denote the ball with respect to the $\|\cdot\|_{\mathcal{H}}$-norm in $\operatorname{range}(K)$.

5.6.1 Proof of Theorem 1

The proof of our main theorem is based on a sequence of lemmas, all of which are stated with the assumptions of Theorem 1 in force. The first lemma establishes a bound on the empirical norm $\|\cdot\|_n$ of the error $\Delta^{t+1} := \theta^{t+1} - \theta^*$, provided that its Hilbert norm is suitably controlled.

Lemma 1. For any step size $\alpha \in (0, \frac{1}{M}]$ and any iteration $t$ we have
$$\frac{m}{2}\|\Delta^{t+1}\|_n^2 \le \frac{1}{2\alpha}\Big\{ \|\Delta^t\|_{\mathcal{H}}^2 - \|\Delta^{t+1}\|_{\mathcal{H}}^2 \Big\} + \big\langle \nabla\mathcal{L}(\theta^* + \Delta^t) - \nabla\mathcal{L}_n(\theta^* + \Delta^t),\, \Delta^{t+1}\big\rangle. \qquad (5.22)$$

See Section C.1 for the proof of this claim.

The second term on the right-hand side of the bound (5.22) involves the difference between the population and empirical gradient operators. Since this difference is being evaluated at the random points $\Delta^t$ and $\Delta^{t+1}$, the following lemma establishes a form of uniform control on this term. Let us define the set
$$S := \Big\{ (\Delta, \tilde{\delta}) \in \mathbb{R}^n \times \mathbb{R}^n \;\Big|\; \|\Delta\|_{\mathcal{H}} \ge 1, \text{ and } \Delta, \tilde{\delta} \in B_{\mathcal{H}}(0, 2C_{\mathcal{H}}) \Big\},$$
and consider the uniform bound

$$\big\langle \nabla\mathcal{L}(\theta^* + \tilde{\delta}) - \nabla\mathcal{L}_n(\theta^* + \tilde{\delta}),\, \Delta\big\rangle \le 2\delta_n\|\Delta\|_n + 2\delta_n^2\|\Delta\|_{\mathcal{H}} + \frac{m}{c_3}\|\Delta\|_n^2 \quad \text{for all } (\Delta, \tilde{\delta}) \in S. \qquad (5.23)$$

Lemma 2. Let $\mathcal{E}$ be the event that bound (5.23) holds. There are universal constants $(c_1, c_2)$ such that $\mathbb{P}[\mathcal{E}] \ge 1 - c_1 \exp\big(-c_2 \frac{m^2 n \delta_n^2}{\sigma^2}\big)$.

See Section C.2 for the proof of Lemma 2.

Note that Lemma 1 applies only to error iterates with a bounded Hilbert norm. Our last lemma provides this control for some number of iterations:

Lemma 3. There are constants $(C_1, C_2)$ independent of $n$ such that for any step size $\alpha \in \big(0, \min\{M, \frac{1}{M}\}\big]$, we have

$$\|\Delta^t\|_{\mathcal{H}} \le C_{\mathcal{H}} \quad \text{for all iterations } t \le \frac{m}{8M\delta_n^2} \qquad (5.24)$$

with probability at least $1 - C_1 \exp(-C_2 n\delta_n^2)$, where $C_2 = \max\{\frac{m^2}{\sigma^2}, 1\}$.

See Section C.3 for the proof of this lemma, which also uses Lemma 2.

Taking these lemmas as given, we now complete the proof of the theorem. We first condition on the event $\mathcal{E}$ from Lemma 2, so that we may apply the bound (5.23). We then fix some iterate $t$ such that $t < \frac{m}{8M\delta_n^2} - 1$, and condition on the event that the bound (5.24) in Lemma 3 holds, so that we are guaranteed that $\|\Delta^{t+1}\|_{\mathcal{H}} \le C_{\mathcal{H}}$. We then split the analysis into two cases:

Case 1: First, suppose that $\|\Delta^{t+1}\|_n \le \delta_n C_{\mathcal{H}}$. In this case, inequality (5.13b) holds directly.

Case 2: Otherwise, we may assume that $\|\Delta^{t+1}\|_n > \delta_n \|\Delta^{t+1}\|_{\mathcal{H}}$. Applying the bound (5.23) with the choice $(\tilde{\delta}, \Delta) = (\Delta^t, \Delta^{t+1})$ yields

$$\big\langle \nabla\mathcal{L}(\theta^* + \Delta^t) - \nabla\mathcal{L}_n(\theta^* + \Delta^t),\, \Delta^{t+1}\big\rangle \le 4\delta_n\|\Delta^{t+1}\|_n + \frac{m}{c_3}\|\Delta^{t+1}\|_n^2. \qquad (5.25)$$

Substituting inequality (5.25) back into equation (5.22) yields

$$\frac{m}{2}\|\Delta^{t+1}\|_n^2 \le \frac{1}{2\alpha}\Big\{ \|\Delta^t\|_{\mathcal{H}}^2 - \|\Delta^{t+1}\|_{\mathcal{H}}^2 \Big\} + 4\delta_n\|\Delta^{t+1}\|_n + \frac{m}{c_3}\|\Delta^{t+1}\|_n^2.$$

Re-arranging terms yields the bound

$$\gamma m \|\Delta^{t+1}\|_n^2 \le D^t + 4\delta_n\|\Delta^{t+1}\|_n, \qquad (5.26)$$

where we have introduced the shorthand notation $D^t := \frac{1}{2\alpha}\big\{ \|\Delta^t\|_{\mathcal{H}}^2 - \|\Delta^{t+1}\|_{\mathcal{H}}^2 \big\}$, as well as $\gamma = \frac{1}{2} - \frac{1}{c_3}$.

Equation (5.26) defines a quadratic inequality with respect to $\|\Delta^{t+1}\|_n$; solving it and making use of the inequality $(a + b)^2 \le 2a^2 + 2b^2$ yields the bound

$$\|\Delta^{t+1}\|_n^2 \le \frac{c\delta_n^2}{\gamma^2 m^2} + \frac{2D^t}{\gamma m}, \qquad (5.27)$$

for some universal constant $c$. By telescoping inequality (5.27), we find that

$$\frac{1}{T}\sum_{t=1}^T \|\Delta^t\|_n^2 \le \frac{c\delta_n^2}{\gamma^2 m^2} + \frac{1}{T}\sum_{t=1}^T \frac{2D^t}{\gamma m} \qquad (5.28)$$
$$\le \frac{c\delta_n^2}{\gamma^2 m^2} + \frac{1}{\alpha\gamma m T}\big[ \|\Delta^0\|_{\mathcal{H}}^2 - \|\Delta^T\|_{\mathcal{H}}^2 \big]. \qquad (5.29)$$

By Jensen's inequality, we have

$$\|\bar{f}^T - f^*\|_n^2 = \Big\| \frac{1}{T}\sum_{t=1}^T \Delta^t \Big\|_n^2 \le \frac{1}{T}\sum_{t=1}^T \|\Delta^t\|_n^2,$$

so that inequality (5.13b) follows from the bound (5.28). On the other hand, by the smoothness assumption, we have
$$\mathcal{L}(\bar{f}^T) - \mathcal{L}(f^*) \le \frac{M}{2}\|\bar{f}^T - f^*\|_n^2,$$
from which inequality (5.13a) follows.

5.6.2 Proof of Corollary 4

Similar to the proof of Theorem 1 in Yang et al. [154], a generalization can be shown using a standard argument based on Fano's inequality. By definition of the transformed parameter $\theta = DU\alpha$, where $K = U^\top D U$, we have for any estimator $\hat{f} = \sqrt{n}\, U^\top \theta$ that $\|\hat{f} - f^*\|_n^2 = \|\theta - \theta^*\|_2^2$. Therefore our goal is to lower bound the Euclidean error $\|\hat{\theta} - \theta^*\|_2$ of any estimator of $\theta^*$.

Borrowing Lemma 4 in Yang et al. [154], there exists a $\delta/2$-packing of the set $B = \{\theta \in \mathbb{R}^n \mid \|D^{-1/2}\theta\|_2 \le 1\}$ of cardinality $M = e^{d_n/64}$, with $d_n := \arg\min_{j=1,\dots,n}\{\mu_j \le \delta_n^2\}$. This is done through packing the following subset of $B$:

$$\mathcal{E}(\delta) := \Big\{ \theta \in \mathbb{R}^n \;\Big|\; \sum_{j=1}^n \frac{\theta_j^2}{\min\{\delta^2, \mu_j\}} \le 1 \Big\}.$$

Let us denote the packing set by $\{\theta^1, \dots, \theta^M\}$. Since $\theta \in \mathcal{E}(\delta)$, a simple calculation shows that $\|\theta^i\|_2 \le \delta$. Consider the random ensemble of regression problems in which we first draw an index $Z$ at random from the index set $[M]$ and then, conditionally on $Z = z$, observe $n$ i.i.d. samples $y_1^n := \{y_1, \dots, y_n\}$ from $\mathbb{P}_{\theta^z}$. Fano's inequality implies that
$$\mathbb{P}\Big( \|\hat{\theta} - \theta^*\|_2 \ge \frac{\delta}{4} \Big) \ge 1 - \frac{I(y_1^n; Z) + \log 2}{\log M},$$
where $I(y_1^n; Z)$ is the mutual information between the samples $Y$ and the random index $Z$. So it only remains to control the mutual information $I(y_1^n; Z)$. Using the mixture representation $\bar{\mathbb{P}} = \frac{1}{M}\sum_{i=1}^M \mathbb{P}_{\theta^i}$ and the convexity of the Kullback-Leibler divergence, we have
$$I(y_1^n; Z) = \frac{1}{M}\sum_{j=1}^M \|\mathbb{P}_{\theta^j}, \bar{\mathbb{P}}\|_{KL} \le \frac{1}{M^2}\sum_{i,j} \|\mathbb{P}_{\theta^i}, \mathbb{P}_{\theta^j}\|_{KL}.$$
We now claim that
$$\|\mathbb{P}_\theta(y), \mathbb{P}_{\theta'}(y)\|_{KL} \le \frac{nL\|\theta - \theta'\|_2^2}{s(\sigma)}. \qquad (5.30)$$

Since each $\|\theta^i\|_2 \le \delta$, the triangle inequality yields $\|\theta^i - \theta^j\|_2 \le 2\delta$ for all $i \ne j$. It is therefore guaranteed that
$$I(y_1^n; Z) \le \frac{4nL\delta^2}{s(\sigma)}.$$
Therefore, similarly to Yang et al. [154], using the fact that the kernel is regular and hence $s(\sigma) d_n \ge c n \delta_n^2$, any estimator $\hat{f}$ has prediction error lower bounded as
$$\sup_{\|f^*\|_{\mathcal{H}} \le 1} \mathbb{E}\|\hat{f} - f^*\|_n^2 \ge c_l \delta_n^2.$$
Corollary 4 thus follows using the upper bound in Theorem 1.

Proof of inequality (5.30): Direct calculations of the KL divergence yield
$$\|\mathbb{P}_\theta(y), \mathbb{P}_{\theta'}(y)\|_{KL} = \int \log\Big( \frac{\mathbb{P}_\theta(y)}{\mathbb{P}_{\theta'}(y)} \Big) \mathbb{P}_\theta(y)\, dy = \frac{1}{s(\sigma)}\sum_{i=1}^n \Big[ \Phi\big(\sqrt{n}\langle u_i, \theta'\rangle\big) - \Phi\big(\sqrt{n}\langle u_i, \theta\rangle\big) \Big] + \frac{\sqrt{n}}{s(\sigma)} \int \sum_{i=1}^n y_i \langle u_i, \theta - \theta'\rangle\, \mathbb{P}_\theta\, dy. \qquad (5.31)$$

To further control the right-hand side of expression (5.31), we concentrate on expressing $\int \sum_{i=1}^n y_i u_i\, \mathbb{P}_\theta\, dy$ differently. Leibniz's rule allows us to interchange the order of integral and derivative, so that
$$\int \frac{d\mathbb{P}_\theta}{d\theta}\, dy = \frac{d}{d\theta} \int \mathbb{P}_\theta\, dy = 0. \qquad (5.32)$$
Observe that
$$\int \frac{d\mathbb{P}_\theta}{d\theta}\, dy = \frac{\sqrt{n}}{s(\sigma)} \int \mathbb{P}_\theta \cdot \sum_{i=1}^n u_i\Big( y_i - \Phi'\big(\sqrt{n}\langle u_i, \theta\rangle\big) \Big)\, dy,$$
so that equality (5.32) yields

$$\int \sum_{i=1}^n y_i u_i\, \mathbb{P}_\theta\, dy = \sum_{i=1}^n u_i \Phi'\big(\sqrt{n}\langle u_i, \theta\rangle\big).$$
Combining the above equality with expression (5.31), the KL divergence between the two generalized linear models $\mathbb{P}_\theta, \mathbb{P}_{\theta'}$ can thus be written as

$$\|\mathbb{P}_\theta(y), \mathbb{P}_{\theta'}(y)\|_{KL} = \frac{1}{s(\sigma)}\sum_{i=1}^n \Big[ \Phi\big(\sqrt{n}\langle u_i, \theta'\rangle\big) - \Phi\big(\sqrt{n}\langle u_i, \theta\rangle\big) - \sqrt{n}\langle u_i, \theta' - \theta\rangle\, \Phi'\big(\sqrt{n}\langle u_i, \theta\rangle\big) \Big]. \qquad (5.33)$$

Together with the fact that

$$\Big| \Phi\big(\sqrt{n}\langle u_i, \theta'\rangle\big) - \Phi\big(\sqrt{n}\langle u_i, \theta\rangle\big) - \sqrt{n}\langle u_i, \theta' - \theta\rangle\, \Phi'\big(\sqrt{n}\langle u_i, \theta\rangle\big) \Big| \le nL\|\theta - \theta'\|_2^2,$$

which follows from the assumption that $\Phi$ has a uniformly bounded second derivative, equality (5.33) establishes our claim (5.30).

5.6.3 Proof of Corollary 5

The general statement follows directly from Theorem 1. In order to invoke Theorem 1 for the particular cases of LogitBoost and AdaBoost, we need to verify the conditions, i.e. that the $m$-$M$-condition and the $\phi'$-boundedness condition hold for the respective loss function over the ball $B_{\mathcal{H}}(\theta^*, 2C_{\mathcal{H}})$. The following lemma provides such a guarantee:

Lemma 4. With $D := C_{\mathcal{H}} + \|\theta^*\|_{\mathcal{H}}$, the logistic regression cost function satisfies the $m$-$M$-condition with parameters
$$m = \frac{1}{e^{-D} + e^{D} + 2}, \quad M = \frac{1}{4}, \quad \text{and} \quad B = 1.$$
The AdaBoost cost function satisfies the $m$-$M$-condition with parameters

$$m = e^{-D}, \quad M = e^{D}, \quad \text{and} \quad B = e^{D}.$$

See Section C.4 for the proof of Lemma 4.
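These parameters can be traced back to the second derivatives of the two losses; the display below is our own short computation (for y ∈ {−1, +1} and |θ| bounded by the radius D of the lemma), included to show where they come from:
$$\frac{\partial^2}{\partial\theta^2} \ln\big(1 + e^{-y\theta}\big) = \frac{1}{e^{-y\theta} + e^{y\theta} + 2} \in \Big[\frac{1}{e^{-D} + e^{D} + 2},\; \frac{1}{4}\Big], \qquad \frac{\partial^2}{\partial\theta^2}\, e^{-y\theta} = e^{-y\theta} \in \big[e^{-D},\, e^{D}\big].$$
The lower and upper ends of these ranges match the strong convexity and smoothness constants stated in Lemma 4, and the derivative bound $|{-y}\,e^{-y\theta}| \le e^{D}$ accounts for the value of B in the AdaBoost case.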

γ-exponential decay. If the kernel eigenvalues satisfy a decay condition of the form $\mu_j \le c_1 \exp(-c_2 j^\gamma)$, where $c_1, c_2$ are universal constants, the function $\mathcal{R}$ from equation (5.17) can be upper bounded as
$$\mathcal{R}(\delta) = \frac{1}{\sqrt{n}}\sqrt{\sum_{j=1}^n \min\{\delta^2, \mu_j\}} \le \sqrt{\frac{2}{n}} \sqrt{ k\delta^2 + c_1 \sum_{j=k+1}^n e^{-c_2 j^\gamma}},$$

where $k$ is the smallest integer such that $c_1 \exp(-c_2 k^\gamma) < \delta^2$. Since the localized Gaussian width $\mathcal{G}_n\big(\mathcal{E}_n(\delta, 1)\big)$ can be sandwiched above and below by multiples of $\mathcal{R}(\delta)$, some algebra shows that the critical radius scales as $\delta_n^2 \asymp \frac{\sigma^2 \log(n)^{1/\gamma}}{n}$. Consequently, if we take $T \asymp \frac{n}{\sigma^2 \log(n)^{1/\gamma}}$ steps, then Theorem 1 guarantees that the averaged estimator $\bar{\theta}^T$ satisfies the bound

$$\|\bar{\theta}^T - \theta^*\|_n^2 \precsim \Big( \frac{1}{\alpha m} + \frac{1}{m^2} \Big) \frac{\sigma^2 \log^{1/\gamma} n}{n},$$

with probability $1 - c_1 \exp\big(-c_2 m^2 \log^{1/\gamma} n\big)$.

β-polynomial decay. Now suppose that the kernel eigenvalues satisfy a decay condition of the form $\mu_j \le c_1 j^{-2\beta}$ for some $\beta > 1/2$ and constant $c_1$. In this case, a direct calculation yields the bound

$$\mathcal{R}(\delta) \le \sqrt{\frac{2}{n}} \sqrt{ k\delta^2 + c_2 \sum_{j=k+1}^n j^{-2\beta}},$$

where $k$ is the smallest integer such that $c_2 k^{-2\beta} < \delta^2$. Combined with the upper bound $c_2 \sum_{j=k+1}^n j^{-2\beta} \le c_2 \int_{k}^{\infty} t^{-2\beta}\, dt \precsim k\delta^2$, we find that the critical radius scales as $\delta_n^2 \asymp n^{-2\beta/(1+2\beta)}$. Consequently, if we take $T \asymp n^{2\beta/(1+2\beta)}$ many steps, then Theorem 1 guarantees that the averaged estimator $\bar{\theta}^T$ satisfies the bound

$$\|\bar{\theta}^T - \theta^*\|_n^2 \precsim \Big( \frac{1}{\alpha m} + \frac{1}{m^2} \Big) \Big( \frac{\sigma^2}{n} \Big)^{2\beta/(2\beta+1)},$$

with probability at least $1 - c_1 \exp\big(-c_2 m^2 (n/\sigma^2)^{1/(2\beta+1)}\big)$.

Chapter 6

Future directions

In this thesis, a range of different problems are described ranging from hypothesis testing, non-parametric estimation to optimization algorithms. A common theme underlying much of this work is the underlying geometric structure of the problem. For example, Chapter3 on cone testing showed that in addition to the Gaussian complexity, other geometric quantities play a role in determining the difficulty of testing; the convex set estimation project showed that polytopes with a controlled number of vertices are significantly easier to estimate. It is interesting to see whether these results can provide some insights to other questions in statistics and optimization that have a geometric flavor, such as structure. In the area of covariance estimation, Wiesel [151] noted that if covariance matrices are regarded as elements of a Riemannian manifold, then maximum likelihood estimation of these covariance matrices is a convex problem under the notion of geodesic convexity. This perspective opens up a variety of new questions and methods for matrix estimation. In addition, manifold learning is an area of active research in machine learning, applied , and statistics. In recent years, researchers have established a number of theoretical guarantees for such methods (e.g., [64, 85]). However, it remains unclear how to optimally extract the features of a manifold that are sufficient for subsequent clustering and/or classification tasks under minimal assumptions. It is my intention to tackle some of these interesting and fundamental problems in my future career, using the skills that I have developed thus far. In addition, statistical inference has long been one of the most important topics in statis- tics. Compared to its estimation analogue, there are many interesting problems still remain to be open. One general question that interests me a lot is how to do inference on struc- tured data. As one concrete instance, there is an evolving line of work on testing problems involving complicated structures such as communities in network data and trees (e.g., [1,5]). Such structures arise frequently in applications such as genetics, neuroscience, and the social sciences, and the corresponding theory for testing methods is relatively undeveloped. More- over, I am also interested in the problem of detecting multi-scaled signals and change-points from plain background. This problem is one of the key problems in and signal processing, and although some relevant results are known in different contexts (e.g., [46, 54,6, 146]), several issues are not yet resolved, including fundamental limits for CHAPTER 6. FUTURE DIRECTIONS 98 high-dimensional problems, behaviors of different non-parametric function classes, and effi- cient algorithms. Another direction that I am interested in understanding is the role of regularization in fitting complex models. A phenomenon that has been observed over this decade, is the great generalization performance of deep neural nets despite the fact that it is highly over- parametrized. To shed lights on this mystery, recent couple of years have witnessed many brand-new ideas from statistics and optimization community to reach a better understanding of non-convex problems. A line of work focus on studying the landscape of particular classes of non-convex objective functions such any stationary point of the non-convex objective is close to global optima, so it suffices to find a locally optimal solution (see e.g. 
[99, 78, 61, 62, 60]). Another line of work concentrates on analyzing local convergence for various algorithms and problems (e.g., [41, 98, 79]) and shows that, given a good initialization, many simple local search algorithms, including gradient descent, succeed. However, the work listed so far is mostly of a case-by-case flavor: each analysis depends heavily on the particular structure of the individual problem. One interesting open problem is whether we can obtain a more general way of analyzing these optimization landscapes and thereby better understand generalization.

Appendix A

Proofs for Chapter 3

This chapter is organized as follows. In Section A.1, we first explain the intuition behind the example in Section 3.3.2.3 where the GLRT is shown to be sub-optimal, and construct a series of other cases where this sub-optimality is observed. Section A.2 collects some background on distance metrics and their properties. We then provide the proofs of Propositions 3.3.1 and 3.3.2 in Sections A.3.1 and A.3.2, respectively. The proofs of Theorem 3.3.1(a) and (b) are completed in Sections A.4 and A.5, respectively. The proofs of the lemmas for Theorem 3.3.2 are collected in Section A.6. Finally, the technical lemmas that were crucially used in the proofs of Proposition 3.3.2 and the monotone cone example are proved in Section A.7.

A.1 The GLRT sub-optimality

In this appendix, we first try to understand why the GLRT is sub-optimal for the Cartesian product cone $K_\times = \mathrm{Circ}_{d-1}(\alpha) \times \mathbb{R}$, and use this intuition to construct a more general class of problems for which a similar sub-optimality is witnessed.

A.1.1 Why is the GLRT sub-optimal?

Let us consider tests with null $C_1 = \{0\}$ and a general product alternative of the form $C_2 = K_\times = K \times \mathbb{R}$, where $K \subseteq \mathbb{R}^{d-1}$ is a base cone. Note that $K = \mathrm{Circ}_{d-1}(\alpha)$ in our previous example. Now recall the decomposition (3.22) of the statistic $T$ that underlies the GLRT. By the product nature of the cone, we have
\[
T(y) = \|\Pi_{K_\times} y\|_2 = \|(\Pi_K(y_{-d}),\, y_d)\|_2 = \sqrt{\|\Pi_K(y_{-d})\|_2^2 + \|y_d\|_2^2},
\]
where $y_{-d} := (y_1, \ldots, y_{d-1}) \in \mathbb{R}^{d-1}$ is formed from the first $d-1$ coordinates of $y$. Suppose that we are interested in testing between the zero vector and a vector $\theta^* = (0, \ldots, 0, \theta^*_d)$, non-zero only in the last coordinate, which belongs to the alternative. With this particular choice, under the null distribution we have $y = \sigma g$, whereas under the alternative we have $y = \theta^* + \sigma g$. Letting $\mathbb{E}_0$ and $\mathbb{E}_1$ denote expectations under these two Gaussian distributions, the performance of the GLRT in this direction is governed by the difference
\[
\frac{1}{\sigma}\Bigl(\mathbb{E}_1[T(y)] - \mathbb{E}_0[T(y)]\Bigr)
= \mathbb{E}_1\sqrt{\|\Pi_K(g_{-d})\|_2^2 + \Bigl\|\tfrac{\theta^*_d}{\sigma} + g_d\Bigr\|_2^2}
\;-\; \mathbb{E}_0\sqrt{\|\Pi_K(g_{-d})\|_2^2 + \|g_d\|_2^2}.
\]
Note that both terms in this difference involve a $(d-1)$-dimensional "pure noise" component—namely, the quantity $\|\Pi_K(g_{-d})\|_2^2$ defined by the sub-vector $g_{-d} := (g_1, \ldots, g_{d-1})$—with the only signal lying in the last coordinate. For many choices of the cone $K$, the pure noise component acts as a strong mask for the signal component, so that the GLRT is poor at detecting differences in the direction $\theta^*$. Since the vector $\theta^*$ belongs to the alternative, this leads to sub-optimality in its overall behavior. Guided by this idea, we can construct a series of other cases where the GLRT is sub-optimal. See Appendix A.1.2 for details.
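To make the masking effect concrete, the following short Python simulation (not part of the original analysis) uses the non-negative orthant in $\mathbb{R}^{d-1}$ as a stand-in base cone $K$, for which the projection is simply the coordinate-wise positive part; the circular cone of the text would behave analogously, since both have Gaussian width of order $\sqrt{d}$. The dimension, signal size, and Monte Carlo sample sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_mc = 2000, 1.0, 5000
theta_d = 3.0                      # signal placed only in the last coordinate

def glrt_stat(y):
    # T(y) = ||Pi_{K x R}(y)||_2 with base cone K = nonnegative orthant in R^{d-1};
    # projection onto the orthant is the coordinate-wise positive part.
    noise_part = np.maximum(y[:-1], 0.0)
    return np.sqrt(np.sum(noise_part ** 2) + y[-1] ** 2)

T0 = np.empty(n_mc)
T1 = np.empty(n_mc)
for i in range(n_mc):
    g = rng.standard_normal(d)
    T0[i] = glrt_stat(sigma * g)               # null:        y = sigma * g
    y1 = sigma * g.copy()
    y1[-1] += theta_d                          # alternative: y = theta* + sigma * g
    T1[i] = glrt_stat(y1)

print("mean shift  E1[T] - E0[T] :", T1.mean() - T0.mean())
print("null fluctuation   std(T) :", T0.std())
```

With these choices the mean shift of $T$ between null and alternative comes out several times smaller than the null fluctuations of $T$, which is the masking phenomenon described above.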

A.1.2 More examples of GLRT sub-optimality

Now let us construct a larger class of product cones for which the GLRT is sub-optimal. For a given subset $S \subseteq \{1, \ldots, d\}$, define the subvectors $\theta_S = (\theta_i, i \in S)$ and $\theta_{S^c} = (\theta_j, j \in S^c)$, where $S^c$ denotes the complement of $S$. For an integer $\ell \geq 1$, consider any cone $K_\ell \subset \mathbb{R}^d$ with the following two properties:

• its Gaussian width scales as $\mathbb{E}\mathcal{W}(K_\ell \cap \mathbb{B}(1)) \asymp \sqrt{d}$, and

• for some fixed subset $S \subseteq \{1, 2, \ldots, d\}$ of cardinality $\ell$, there is a scalar $\gamma > 0$ such that
\[
\|\theta_S\|_2 \geq \gamma\,\|\theta_{S^c}\|_2 \qquad \text{for all } \theta \in K_\ell.
\]

As one concrete example, it is easy to check that the circular cone is a special example with $\ell = 1$ and $\gamma = 1/\tan(\alpha)$. The following result applies to the GLRT when applied to testing the null $C_1 = \{0\}$ versus the alternative $C_2 = K_\times = K_\ell \times \mathbb{R}$.

Proposition A.1.1. For the previously described cone testing problem, the GLRT testing radius is sandwiched as $\epsilon_{\mathrm{GLR}}^2 \asymp \sqrt{d}\,\sigma^2$, whereas a truncation test can succeed at radius $\epsilon^2 \asymp \sqrt{\ell}\,\sigma^2$.

Proof. The claimed scaling of the GLRT testing radius follows as a corollary of Theorem 3.3.1 after a direct evaluation of $\delta_{\mathrm{LR}}^2(C_1, C_2)$. In order to do so, we begin by observing that
\[
\inf_{\eta \in C_2 \cap \mathbb{S}^{d-1}} \langle \eta, \mathbb{E}\Pi_{C_2} g\rangle \leq \langle e_d, \mathbb{E}\Pi_{C_2} g\rangle = 0,
\qquad\text{and}\qquad
\mathbb{E}\mathcal{W}(C_2 \cap \mathbb{B}(1)) = \mathbb{E}\|\Pi_{C_2} g\|_2 \asymp \sqrt{d},
\]
which implies that $\delta_{\mathrm{LR}}^2(C_1, C_2) \asymp \sqrt{d}$, and hence implies the sandwich claim on the GLRT via Theorem 3.3.1. On the other hand, for some pre-selected $\beta > 0$, consider the truncation test
\[
\varphi(y) := \mathbb{I}\bigl[\|y_S\|_2 \geq \beta\bigr].
\]
This test can be viewed as a GLRT for testing the zero null against the alternative $\mathbb{R}^{\ell}$, and hence it will succeed with separation $\epsilon^2 \asymp \sigma^2\sqrt{\ell}$. Putting these pieces together, we conclude that the GLRT is sub-optimal whenever $\ell$ is of lower order than $d$.

A.2 Distances and their properties

Here we collect some background on distances between probability measures that are useful in analyzing the testing error. Suppose $\mathbb{P}_1$ and $\mathbb{P}_2$ are two probability measures on $(\mathbb{R}^d, \mathcal{B})$ equipped with Lebesgue measure. For our purposes, we assume $\mathbb{P}_1 \ll \mathbb{P}_2$. The total variation (TV) distance between $\mathbb{P}_1$ and $\mathbb{P}_2$ is defined as
\[
\|\mathbb{P}_1 - \mathbb{P}_2\|_{\mathrm{TV}} := \sup_{B \in \mathcal{B}} |\mathbb{P}_1(B) - \mathbb{P}_2(B)| = \frac{1}{2}\int |d\mathbb{P}_1 - d\mathbb{P}_2|. \tag{A.1a}
\]

A closely related measure of distance is the $\chi^2$ distance given by
\[
\chi^2(\mathbb{P}_1, \mathbb{P}_2) := \int \Bigl(\frac{d\mathbb{P}_1}{d\mathbb{P}_2} - 1\Bigr)^2 d\mathbb{P}_2. \tag{A.1b}
\]
For future reference, we note that the TV distance and $\chi^2$ distance are related via the inequality
\[
\|\mathbb{P}_1 - \mathbb{P}_2\|_{\mathrm{TV}} \leq \frac{1}{2}\sqrt{\chi^2(\mathbb{P}_1, \mathbb{P}_2)}. \tag{A.1c}
\]

A.3 Proofs of Propositions 3.3.1 and 3.3.2

In this section, we complete the proofs of Propositions 3.3.1 and 3.3.2 in Sections A.3.1 and A.3.2, respectively.

A.3.1 Proof of Proposition 3.3.1 As in the proof of Theorem 3.3.1 and Theorem 3.3.2, we can assume without loss of generality that σ = 1 since K+ is invariant under rescaling by positive numbers. We split our proof into two cases, depending on whether or not the dimension d is less than 81. APPENDIX A. PROOFS FOR CHAPTER3 102

Case 1: First suppose that $d < 81$. If the separation is upper bounded as $\epsilon^2 \leq \kappa_\rho\sqrt{d}$, then setting $\kappa_\rho = 1/18$ yields
\[
\epsilon^2 \leq \kappa_\rho\sqrt{d} < 1/2.
\]
Similar to Case 1 in our proof of Theorem 3.3.1(b), if $\epsilon^2 < 1/2$, then every test yields testing error no smaller than $1/2$. This is seen by considering a simple versus simple testing problem (3.58a). So our lower bound holds directly in the case $d < 81$.

Case 2: Let us consider the case when dimension d ≥ 81. The idea is to make√ use of 2 our Lemma 3.5.3 to show that the testing error is at least ρ whenever  ≤ κρ d. In order to apply Lemma 3.5.3, the key is to construct a probability measure Q supported on 0 set K ∩ Bc(1) such that for i.i.d. pair η, η0 drawn from Q, quantity Eeλhη, η i can be well controlled. We claim that there exists such a probability measure Q that    2! λhη, η0i 2 + λ 1 2 Eη,η0 e ≤ exp exp √ − 1 − √ where λ : =  . (A.2) d − 1 d √ 2 Taking inequality (A.2) as given for now, letting κρ = 1/8, we have λ =  ≤ d/8. So the right hand side in expression (A.2) can be further upper bounded as √ ! ! ! 2 d λ  1 2 1 9 1  12 exp exp √ + √ √ − 1 − √ ≤ exp exp + · − 1 − d − 1 d − 1 d d 4 8 8 9 < 2,

where we use the fact that d ≥ 81. As a consequence of Lemma 3.5.3, the testing error of every test satisfies q 1 2 0 1 inf E(ψ; {0},K+, ) ≥ 1 − Eη,η0 exp( hη, η i) − 1 > ≥ ρ. ψ 2 2 Putting these two cases together, our lower bound holds for any dimension thus we complete the proof of Proposition 3.3.1. So it only remains to construct a probability measure Q such that the inequality (A.2) holds. We begin by introducing some helpful notation. For an integer s to be specified, consider a collection of vectors S containing all d-dimensional vectors with exactly s non- √ d zero entries and each non-zero entry equals to 1/ s. Note that there are in total M : = s vectors of this type. Letting Q be the uniform distribution over this set of vectors namely 1 ({η}) : = , η ∈ S. (A.3) Q M APPENDIX A. PROOFS FOR CHAPTER3 103

Then we can write the expectation as

0 1 X 0 eλhη, η i = eλhη, η i. E M 2 η,η0∈S

Note that the inner product hη, η0i takes values i/s, for integer i ∈ {0, 1, . . . , s} and given every vector η and integer i ∈ {0, 1, . . . , s}, the number of η0 such that hη, η0i = i/s equals sd−s to i s−i . Consequently, we obtain

 −1 s    s i 0 d X s d − s X Aiz eλhη, η i = eλi/s = , (A.4) E s i s − i i! i=0 i=0 where (s!(d − s)!)2 z : = eλ/s and A : = . i ((s − i)!)2d!(d − 2s + i)! √ Let us set integer s : = b dc. We claim quantity Ai satisfies the following bound

 1 2 2i  Ai ≤ exp − (1 − √ ) + √ for all i ∈ {0, 1, . . . , s}. (A.5) d d − 1 Taking expression (A.5) as given for now and plugging into inequality (A.4), we have

s (z exp( √ 2 ))i 0  1  X eλhη, η i ≤ exp − (1 − √ )2 d−1 E i! d i=0 (a)  1   2  ≤ exp − (1 − √ )2 exp z exp(√ ) d d − 1 ! (b)  1 2  2 + λ  ≤ exp − 1 − √ + exp √ , d d − 1

where step (a) follows from the standard power series expansion ex = P∞ xi and step (b) √ √ i=0 i! follows by z = eλ/s and s = b dc > d − 1. Therefore it verifies inequality (A.2) and complete our argument. −x It is only left for us to check inequality (A.5) for Ai. Using the fact that 1 − x ≤ e , it is guaranteed that

s ((d − s)!)2 s s s X 1 A = = (1 − )(1 − ) ··· (1 − ) ≤ exp(−s ). 0 d!(d − 2s)! d d − 1 d − s + 1 d − s + i i=1 (A.6a) APPENDIX A. PROOFS FOR CHAPTER3 104

√ Recall that integer s = b dc, then we can bound the sum in expression (A.6a) as

s s X 1 X 1 s2 1 s ≥ s = ≥ (1 − √ )2, d − s + i d d i=1 i=1 d

which, when combined with inequality (A.6a), implies that A ≤ exp(−(1 − √1 )2). 0 d Moreover, direct calculations yield A (s − i + 1)2 i = , 1 ≤ i ≤ s. (A.6b) Ai−1 d − 2s + i

This ratio is decreasing with index i as 1 ≤ i ≤ s, thus is upper bounded by A1/A0, which implies that A d 1 2 i ≤ √ = (1 + √ )2 ≤ exp(√ ), Ai−1 d − 2 d + 1 d − 1 d − 1 where the last inequality follows from 1 + x ≤ ex. Putting pieces together validates bound (A.5) thus finishing the proof of Proposition 3.3.1.
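The combinatorial identity (A.4) underlying this construction can also be verified numerically. The Python sketch below (our own illustration; the function names are ours) compares the closed-form expression, with $s = \lfloor\sqrt{d}\rfloor$ as in the proof, against a direct Monte Carlo estimate over random $s$-sparse vectors.

```python
import numpy as np
from math import comb, exp, floor, sqrt

def mgf_formula(d, lam):
    """Right-hand side of (A.4): E exp(lam <eta, eta'>) for eta, eta' i.i.d. uniform
    over d-vectors with s = floor(sqrt(d)) non-zero entries, each equal to 1/sqrt(s)."""
    s = floor(sqrt(d))
    return sum(comb(s, i) * comb(d - s, s - i) * exp(lam * i / s)
               for i in range(s + 1)) / comb(d, s)

def mgf_monte_carlo(d, lam, n_samples=50_000, seed=1):
    """Direct Monte Carlo estimate of the same expectation."""
    rng = np.random.default_rng(seed)
    s = floor(sqrt(d))
    vals = []
    for _ in range(n_samples):
        a = set(rng.choice(d, size=s, replace=False).tolist())
        b = set(rng.choice(d, size=s, replace=False).tolist())
        vals.append(exp(lam * len(a & b) / s))   # <eta, eta'> = (overlap size) / s
    return float(np.mean(vals))

if __name__ == "__main__":
    d, lam = 100, 1.5       # lam plays the role of epsilon^2 in the proof
    print(mgf_formula(d, lam), mgf_monte_carlo(d, lam))
```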

A.3.2 Proof of Proposition 3.3.2 As in the proof of Theorem 3.3.1 and Theorem 3.3.2, we can assume without loss of generality that σ = 1 since L and M are both invariant under rescaling by positive numbers. We split our proof into two cases, depending on whether or not plog(ed) < 14.

p Case 1: First suppose log(ed) < 14, so that the choice κρ = 1/28 yields the upper bound

2 p  ≤ κρ log(ed) < 1/2. Similar to our proof of the lower bound in Theorem 3.3.1, by reducing to a simple testing problem (3.58a), any test yields testing error no smaller than 1/2 if 2 < 1/2. Thus, we conclude that the stated lower bound holds when plog(ed) < 14.

Case 2: Otherwise, we may assume that plog(ed) ≥ 14. In this case, we exploit Lemma 3.5.3 2 p in order to show that the testing error is at least ρ whenever  ≤ κρ log(ed). Doing so ⊥ c requires constructing a probability measure QL supported on M ∩ L ∩ B (1) such that the 2hη, η0i 0 expectation Ee can be well controlled, where (η, η ) are drawn i.i.d according to QL. Note that L can be either {0} or span(1). Before doing that, let us first introduce some notation. Let δ : = 9 and r : = 1/3 (note that δ = r−2). Let

( n ) X δ − 1 m : = max n b (d + log d + 3)c < d . (A.7) δi δ i=1 APPENDIX A. PROOFS FOR CHAPTER3 105

We claim that the integer m defined above satisfies: 3 d log (d)e + 1 ≤ m ≤ dlog de, (A.8) 4 δ δ where dxe denotes the smallest integer that is greater than or equal to x. To see this, notice 3 that for t = d 4 logδ(d)e + 1, we have t t X δ − 1 X δ − 1 1 b (d + log d + 3)c ≤ (d + log d + 3) = (1 − )(d + α) δi δ δi δ δt i=1 i=1 (i) d + α (ii) ≤ d + α − < d, δ2d3/4 3 where we denote α : = logδ d + 3. The step (i) follows by definition that t = d 4 logδ(d)e + 1 p 1/4 2 while step (ii) holds because as log(ed) ≥ 14, we have α = logδ d + 3 < d /δ . On the other hand, for t = dlogδ de, we have t t X δ − 1 X δ − 1 b (d + log d + 3)c ≥ (d + α) − t δi δ δi i=1 i=1 1 = (1 − )(d + α) − t δt d + α > d + α − − (log d + 1), d δ p where the last step uses fact t = dlogδ de. Since when log(ed) ≥ 14, we have α = logδ d+3 < d, therefore (d + α)/d + logδ d + 1 ≤ 2 + logδ d + 1 = α, which guarantees that t X δ − 1 b (d + log d + 3)c > d. δi δ i=1 We thereby established inequality (A.8). ⊥ c We now claim that there exists a probability measure QL supported on M ∩ L ∩ B (1) such that    2 ! λhη, η0i 9λ/4 + 2 1 27λ 2 0 e ≤ exp exp √ − 1 − √ + √ , where λ : =  . Eη,η ∼QL m − 1 m 32( m − 1) (A.9)

3 Recall that we showed in inequality (A.8) that m ≥ d 4 logδ(d)e + 1. Setting κρ = 1/62 2 p implies that whenever  ≤ κρ log(ed), we have s 1 1 r 4 3 1 4  3  1 √ 2 ≤ plog(ed) = 1 + log δ · log d ≤ log δ 1 + log d ≤ m. 62 62 3 4 δ 62 3 4 δ 36 (A.10) APPENDIX A. PROOFS FOR CHAPTER3 106

So the right hand side in expression (A.9) can be made less than 2 by

9λ/4 + 2  1 2 27λ exp √ − 1 − √ + √ m − 1 m 32( m − 1) √ √  9λ m 2  12 27λ m ≤ exp √ √ + − 1 − + √ √ 4 m m − 1 7 8 32 m m − 1  9 8 2  12 27 8 ≤ exp + − 1 − + < log 2, 4 · 36 7 7 8 32 · 36 7

√ q 3 where we use the fact that m ≥ 1 + 4 logδ d ≥ 8. Lemma 3.5.3 thus guarantees the testing error to be no less than q 1 2 0 1 inf E(ψ; L, M, ) ≥ 1 − Eη,η0 exp( hη, η i) − 1 > ≥ ρ, ψ 2 2 which leads to our result in Proposition 3.3.2. Now it only remains to construct a probability measure QL with the right support such that inequality (A.9) holds. To do this, we make use of a fact from the proof of Proposi- m tion 3.3.1 for the orthant cone K+ ⊂ R . Recall that to establish Proposition 3.3.1, we m−1 m 0 constructed a probability measure D supported on K+ ∩ S ⊂ R such that if b, b are an i.i.d pair drawn from D, we have    2! λhb, b0i 2 + λ 1 0 e ≤ exp exp √ − 1 − √ . (A.11) Eb,b ∼D m − 1 m

By construction, $\mathbb{D}$ is a uniform probability measure on the finite set $S$, which consists of all vectors in $\mathbb{R}^m$ that have $s$ non-zero entries, each equal to $1/\sqrt{s}$, where $s = \lfloor\sqrt{m}\rfloor$. Based on this measure $\mathbb{D}$, let us define $\mathbb{Q}_L$ as in the following lemma and establish some of its properties under the assumption that $\sqrt{\log(ed)} \geq 14$.

Lemma A.3.1. Let $G$ be the $m \times m$ lower triangular matrix given by
\[
G := \begin{pmatrix}
1 & & & & \\
r & 1 & & & \\
r^2 & r & 1 & & \\
\vdots & \vdots & & \ddots & \\
r^{m-1} & r^{m-2} & \cdots & & 1
\end{pmatrix}. \tag{A.12a}
\]

There exists an d × m matrix F such that

T F F = Im (A.12b) and such that for every b ∈ S and η : = F Gb, we have APPENDIX A. PROOFS FOR CHAPTER3 107

1. η ∈ M ∩ L⊥ ∩ Bc(1) if L = {0}, and

⊥ c Pd 2. η − η¯1 ∈ M ∩ L ∩ B (1) if L = span(1), where η¯ = i=1 ηi/d denotes the mean of the vector η. See Appendix A.7.2 for the proof of this claim. If L = {0}, let probability measure QL be defined as the distribution of η : = F Gb where b ∼ D. Otherwise if L = span(1), let QL be the distribution of η − η¯1 where again η : = F Gb ⊥ c and b ∼ D. From Lemma A.3.1 we know that QL is supported on M ∩ L ∩ B (1). It only remains to verify the critical inequality (A.9) to complete the proof of Proposition 3.3.2. Let η = F Gb and η0 = F Gb0 with b, b0 being i.i.d having distribution D. Using the fact that T 0 F F = Im, we can write the inner product of η, η as hη, η0i = bT GT F T F Gb0 = hGb, Gb0i. The following lemma relates inner product hη, η0i to hb, b0i, and thereby allows us to derive m inequality (A.9) based on inequality (A.11). Recall√ that S consists√ of all vectors in R which have s non-zero entries which are all equal to 1/ s where s = b mc. Lemma A.3.2. For every b, b0 ∈ S, we have hb, b0i r hGb, Gb0i ≤ + , (A.13a) (1 − r)2 s(1 − r)2(1 − r2) 1 2r + r2 kGbk2 ≥ − . (A.13b) 2 (1 − r)2 s(1 − r2)(1 − r)2 See Appendix A.7.3 for the proof of this claim. We are now ready to prove inequality (A.9). We consider the two cases L = {0} and L = span(1) separately. √ √ For L = {0}, recall that r = 1/3 and s = b mc ≥ m − 1. Therefore as a direct consequence of inequality (A.13a), we have   λhη, η0i 9λ 0 27λ e ≤ 0 exp hb, b i + √ . (A.14) Eη,η∼Q Eb,b ∼D 4 32( m − 1) Combining inequality (A.14) with (A.11) completes the proof of inequality (A.9). Let us now turn to the case when L = span(1). The proof is essentially the same as for L = {0} with only some minor changes. Again our goal is to check inequality (A.9). For this, we write

λhη, η0i λhη−η¯1, η0−η¯01i λhη, η0i 0 0 0 Eη,η ∼QL e = Eη,η ∼Q{0} e ≤ Eη,η ∼Q{0} e . Here the last step use the fact that hη − η¯1, η0 − η¯01i = hη, η0i − dη¯η¯0 ≤ hη, η0i where the last inequality follows from the non-negativity of every entry of vectors η and η0 (this non-negativity is a consequence of the non-negativity of F and G from Lemma A.3.1 and non-negativity of vectors in S). Thus, we have completed the proof of Proposition 3.3.2. APPENDIX A. PROOFS FOR CHAPTER3 108

A.4 Completion of the proof of Theorem 3.3.1(a)

In this appendix, we collect the proofs of lemmas involved in the proof of Theorem 3.3.1(a).

A.4.1 Proof of Lemma A.4.1

Let us start with the statement of this lemma.

Lemma A.4.1. For a standard Gaussian random vector g ∼ N(0,Id), closed convex cone K ∈ Rd and vector θ ∈ Rd, we have    t2  ± (Z(θ) − [Z(θ)]) ≥ t ≤ exp − , and (A.15a) P E 2 2    t  P ± (hθ, ΠK gi − Ehθ, ΠK gi) ≥ t ≤ exp − 2 , (A.15b) 2kθk2 where both inequalities hold for all t ≥ 0. For future reference, we also note that tail bound (A.15a) implies that the variance is bounded as Z ∞ Z ∞  √  −u/2 var(Z(θ)) = P Z(θ) − E[Z(θ)] ≥ u du ≤ 2 e du = 4. (A.16) 0 0

To prove Lemma A.4.1, we claim that, for every vector $\theta$, the function $g \mapsto \|\Pi_K(\theta + g)\|_2$ is 1-Lipschitz, whereas the function $g \mapsto \langle\theta, \Pi_K g\rangle$ is $\|\theta\|_2$-Lipschitz. Given these claims, the concentration results then follow from Borell's theorem [19]. In order to establish the Lipschitz properties, consider two vectors $g, g' \in \mathbb{R}^d$. By the triangle inequality and the non-expansiveness of the Euclidean projection, we have

0 0 0 kΠK (θ + g)k2 − kΠK (θ + g )k2 ≤ kΠK (θ + g) − ΠK (θ + g )k2 ≤ kg − g k2. Combined with the Cauchy-Schwarz inequality, we conclude that

0 0 0 hθ, ΠK gi − hθ, ΠK g i ≤ kθk2 kΠK g − ΠK g k2 ≤ kθk2 kg − g k2, which completes the proof of Lemma A.4.1.
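As a quick sanity check on the two Lipschitz claims and the variance bound (A.16), the following Python sketch (illustrative only, not part of the proof) takes the closed convex cone $K$ to be the non-negative orthant, for which the projection is the coordinate-wise positive part; all variable names and parameter choices are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
proj = lambda x: np.maximum(x, 0.0)         # projection onto K = nonnegative orthant
theta = np.abs(rng.standard_normal(d))      # an arbitrary fixed vector

# Empirical Lipschitz constants of the two maps over random pairs (g, g'):
lip1, lip2 = 0.0, 0.0
for _ in range(10_000):
    g, gp = rng.standard_normal(d), rng.standard_normal(d)
    denom = np.linalg.norm(g - gp)
    lip1 = max(lip1, abs(np.linalg.norm(proj(theta + g))
                         - np.linalg.norm(proj(theta + gp))) / denom)
    lip2 = max(lip2, abs(theta @ proj(g) - theta @ proj(gp)) / denom)
print("g -> ||Pi_K(theta + g)||_2 :", lip1, "(claimed Lipschitz constant 1)")
print("g -> <theta, Pi_K g>       :", lip2,
      "(claimed constant ||theta||_2 =", np.linalg.norm(theta), ")")

# Variance bound (A.16): var(Z(theta)) <= 4, with Z(theta) = ||Pi_K(theta+g)|| - ||Pi_K g||
Z = np.empty(20_000)
for i in range(Z.size):
    g = rng.standard_normal(d)
    Z[i] = np.linalg.norm(proj(theta + g)) - np.linalg.norm(proj(g))
print("Monte Carlo var(Z(theta))  :", Z.var(), "(bound from (A.16): 4)")
```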

A.4.2 Proof of Lemma 3.5.1

We define the random variable Z(θ) : = kΠK (θ + g)k2 − kΠK gk2, as well as its posi- tive and negative parts Z+(θ) = max{0,Z(θ)} and Z−(θ) = max{0, −Z(θ)}, so that Γ(θ) = EZ(θ) = EZ+(θ) − EZ−(θ). Our strategy is to bound EZ−(θ) from above and then bound EZ+(θ) from below. The following auxiliary lemma is useful for these purposes: APPENDIX A. PROOFS FOR CHAPTER3 109

Lemma A.4.2. For every closed convex cone K ⊂ Rd and vectors x ∈ K and y ∈ Rd, we have:

kΠK (x + y)k2 − kΠK (y)k2 ≤ kxk2, and (A.17) (i) (ii)  2 2 2 2 2 max 2hx, yi + kxk2, 2hx, ΠK yi − kxk2 ≤ kΠK (x + y)k2 − kΠK (y)k2 ≤ 2hx, ΠK yi + kxk2. (A.18) We return to prove this claim in Appendix A.4.3. − Inequality (A.17) implies that Z(θ) ≥ −kθk2 and thus EZ (θ) ≤ kθk2P{Z(θ) ≤ 0}. The 2 lower bound in inequality (A.18) then implies that P{Z(θ) ≤ 0} ≤ P{hθ, gi ≤ −kθk2/2} ≤ 2 kθk2  exp − 8 , whence  2  − −kθk2  −u2/8 2 EZ (θ) ≤ kθk2 exp ≤ sup ue = √ . 8 u>0 e Putting together the pieces, we have established the lower bound 2 Z(θ) = Z+(θ) − Z−(θ) ≥ Z+(θ) − √ . (A.19) E E E E e The next task is to lower bound the expectation EZ+(θ). By the triangle inequality, we have

kΠK (θ + g)k2 ≤ kΠK (θ + g) − ΠK (g)k2 + kΠK (g)k2

≤ kθk2 + kΠK (g)k2, where the second inequality uses non-expansiveness of the projection. Consequently, we have the lower bound 2 2 + 2 2 + + (kΠK (θ + g)k2 − kΠK gk2) (kΠK (θ + g)k2 − kΠK gk2) EZ (θ) = E ≥ E . (A.20) kΠK (θ + g)k2 + kΠK gk2 kθk2 + 2kΠK gk2 2 2 Note that inequality (A.18)(i) implies two lower bounds on the difference kΠK (θ + g)k2 − kΠK gk2. We treat each of these lower bounds in turn, and show how they lead to inequalities (3.55a) and (3.55b).

Proof of inequality (3.55a): Inequality (A.20) and the first lower bound term from in- equality (A.18)(i) imply that

2 + 2 + (2hθ, gi + kθk2) kθk2 EZ (θ) ≥ E ≥ E I{hθ, gi ≥ 0}. kθk2 + 2kΠK gk2 kθk2 + 2kΠK gk2 Jensen’s inequality (and the fact that P{hθ, gi ≥ 0} = 1/2) now allow us to deduce  −1 2 + 2 2EkΠK gk2 kθk2 EZ (θ) ≥ P {hθ, gi ≥ 0} kθk2 kθk2 + = P {hθ, gi ≥ 0} 2kθk2 + 8EkΠK gk2 and this gives inequality (3.55a). APPENDIX A. PROOFS FOR CHAPTER3 110

Proof of inequality (3.55b): Putting inequality (A.20), the second term on the left hand 2 side of inequality (A.18)(i), along with the fact that hθ, EΠK gi ≥ kθk2 together guarantees that

2 + 2   + (2hθ, ΠK gi − kθk2) hθ, EΠK gi − kθk2 1 EZ (θ) ≥ E ≥ E I hθ, ΠK gi > hθ, EΠK gi . kθk2 + 2kΠK gk2 kθk2 + 2kΠK gk2 2  Now introducing the event D : = hθ, ΠK gi > hθ, EΠK gi/2 , Jensen’s inequality implies that hθ, Π gi − kθk2 Z+(θ) ≥ (D) E K 2 . (A.21) E P E EkΠK gk2 kθk2 + 2 P(D) The concentration inequality (A.15b) from Lemma A.4.1 gives us that

   2  1 hθ, EΠK gi P(D) ≥ P hθ, ΠK gi > hθ, EΠK gi ≥ 1 − exp − 2 . (A.22) 2 8kθk2 Inequality (3.55b) now follows by combining inequalities (A.19), (A.21) and (A.22).

A.4.3 Proof of Lemma A.4.2 Let us turn to prove Lemma A.4.2. Inequality (A.17) is a standard Lipschitz property of projection onto a closed convex cone. Turning to inequality (A.18), recall the polar cone K∗ : = {z | hz, θi ≤ 0, ∀ θ ∈ K}, as well as the Moreau decomposition (3.18)—namely, z = ΠK (z) + ΠK∗ (z). Using this notation, we have

2 2 2 2 kΠK (x + y)k2 − kΠK yk2 = kx + y − ΠK∗ (x + y)k2 − ky − ΠK∗ yk2 2 2 2 = kxk2 + 2hx, y − ΠK∗ (x + y)i + ky − ΠK∗ (x + y)k2 − ky − ΠK∗ yk2.

∗ Since ΠK∗ (y) is the closest point in K to y, we have ky − ΠK∗ (x + y)k2 ≥ ky − ΠK∗ (y)k2, and hence

2 2 2 kΠK (x + y)k2 − kΠK yk2 ≥ kxk2 + 2hx, y − ΠK∗ (x + y)i. (A.23)

∗ Since x ∈ K and ΠK∗ (x + y) ∈ K , we have hx, ΠK∗ (x + y)i ≤ 0, and hence, inequal- ity (A.23) leads to the bound (i) in equation (A.18). In order to establish inequality (ii) in equation (A.18), we begin by rewriting expression (A.23) as

2 2 2 kΠK (x + y)k2 − kΠK yk2 ≥ kxk2 + 2hx, y − ΠK∗ yi + 2hx, ΠK∗ y − ΠK∗ (x + y)i. Applying the Cauchy-Schwarz inequality to the final term above and using the 1-Lipschitz property of z 7→ ΠK∗ z, we obtain:

2 hx, ΠK∗ y − ΠK∗ (x + y)i ≥ −kxk2kΠK∗ y − ΠK∗ (x + y)k2 ≥ −kxk2, APPENDIX A. PROOFS FOR CHAPTER3 111

which establishes the upper bound of inequality (A.18). Finally, in order to prove the lower bound in inequality (A.18), we write

2 2 kΠK (x + y)k2 − kΠK yk2 2 2 =kx + y − ΠK∗ (x + y)k2 − kx + y − ΠK∗ y − xk2 2 2 2 =kx + y − ΠK∗ (x + y)k2 − kx + y − ΠK∗ yk2 + 2hx, x + y − ΠK∗ yi − kxk2.

∗ Since the vector ΠK∗ (x + y) corresponds to the projection of x + y onto K , we have kx + y − ΠK∗ (x + y)k2 ≤ kx + y − ΠK∗ yk2 and thus

2 2 2 kΠK (x + y)k2 − kΠK yk2 ≤ kxk2 + 2hx, ΠK yi, which completes the proof of inequality (A.18).

A.5 Completion of the proof of Theorem 3.3.1(b)

In this appendix, we collect the proofs of lemmas involved in the proof of Theorem 3.3.1(b), corresponding to the lower bound on the GLRT performance.

A.5.1 Proof of Lemma A.5.1 Let us first state Lemma A.5.1 and give a proof of it. Lemma A.5.1. For any constant a ≥ 1 and for every closed convex cone K 6= {0}, we have

2 2akθk2 + 4hθ, EΠK gi 0 ≤ Γ(θ) ≤ + bkθk2 for all θ ∈ K, (A.24a) EkΠK gk2 where ( kΠ gk )2 a2kθk2 b : = 3 exp(− E K 2 ) + 24 exp(− 2 ). (A.24b) 8 16 In order to prove that Γ(θ) ≥ 0, we first introduce the convenient shorthand notation ∗ v1 : = ΠK∗ (θ + g) and v2 : = ΠK∗ g. Recall that K denotes the polar cone of K defined in expression (3.17). With this notation, the the Moreau decomposition (3.18) then implies that

2 2 2 2 kΠK (θ + g)k2 − kΠK gk2 = kθ + g − v1k2 − kg − v2k2 2 2 2 = kθk2 + 2hθ, g − v1i + kg − v1k2 − kg − v2k2.

2 2 The right hand side above is greater than kθk2 +2hθ, g−v1i because kg−v1k2 ≥ minv∈K∗ kg− 2 2 ∗ vk2 = kg − v2k2. From the fact that Ehθ, gi = 0 and hθ, vi ≤ 0 for all v ∈ K , we have Γ(θ) ≥ 0. APPENDIX A. PROOFS FOR CHAPTER3 112

Now let us prove the upper bound for expected difference Γ(θ). Using the convenient shorthand notation Z(θ) : = kΠK (θ + g)k2 − kΠK gk2, we define the event 1 B : = {kΠ gk ≥ kΠ gk }, where g ∼ N(0,I ). K 2 2E K 2 d

Our proof is then based on the decomposition Γ(θ) = EZ(θ) = EZ(θ)I(Bc) + EZ(θ)I(B). In particular, we upper bound each of these two terms separately.

Bounding E[Z(θ)I(Bc)]: The analysis of this term is straightforward: inequality (A.17) from Lemma A.4.2 guarantees that Z(θ) ≤ kθk2, whence

c c EZ(θ)I(B ) ≤ kθk2P(B ). (A.25)

Bounding E[Z(θ)I(B)]: Turning to the second term, we have

+ EZ(θ)I(B) ≤ EZ (θ)I(B) 2 2 + 2 2 + (kΠK (θ + g)k2 − kΠK gk2) (kΠK (θ + g)k2 − kΠK gk2) = E I(B) ≤ E I(B). kΠK (θ + g)k2 + kΠK gk2 kΠK gk2

On event B, we can lower bound quantity kΠK gk2 with EkΠK gk2/2 thus

2 2 + 2 2 + (kΠK (θ + g)k2 − kΠK gk2) (kΠK (θ + g)k2 − kΠK gk2) I(B) E I(B) ≤ E . (A.26) kΠK gk2 EkΠK gk2/2 | {z } : =T1

Next we use inequality (A.18) to bound the numerator of the quantity T1, namely

2 2+ 2+ E kΠK (θ + g)k2 − kΠK gk2 I(B) ≤ E 2hθ, ΠK gi + kθk2 I(B) 2+ ≤ E 2hθ, ΠK gi + akθk2 I(B),

T for every constant a ≥ 1. To further simplify notation, introduce event C : = {θ ΠK g ≥ 2 −akθk2/2} and by definition, we obtain

2+ 2 E 2hθ, ΠK gi + akθk2 I(B) = E 2hθ, ΠK gi + akθk2 I(B ∩ C) 2 ≤ akθk2 + 2E[hθ, ΠK giI(B ∩ C)]. (A.27)

2 The right hand side of inequality (A.27) consists of two terms. The first term akθk2 is a constant, so that we only need to further bound the second term 2Ehθ, ΠK giI(B ∩ C). We claim that

p c c E[hθ, ΠK giI(B ∩ C)] ≤ Ehθ, ΠK gi + kθk2EkΠK gk2(6 P(C ) + P(B )/2). (A.28) APPENDIX A. PROOFS FOR CHAPTER3 113

Taking inequality (A.28) as given for the moment, combining inequalities (A.26), (A.27) and (A.28) yields

2 + 2akθk2 + 4Ehθ, ΠK gi p c c EZ (θ)I(B) ≤ T1 ≤ + kθk2(24 P(C ) + 2P(B )). (A.29) EkΠK gk2 As a summary of the above two parts—namely inequalities (A.25) and (A.29), if we assume inequality (A.28), we have

2 2akθk2 + 4Ehθ, ΠK gi p c c Γ(θ) ≤ + kθk2(24 P(C ) + 3P(B )). (A.30) EkΠK gk2 Based on expression (A.30), the last step in proving Lemma A.5.1 is to control the probabil- c c ities P(C ) and P(B ) respectively. Using the fact that hθ, ΠK gi = hθ, (g − ΠK∗ g)i ≥ hθ, gi and the concentration of hθ, gi, we have a a a2kθk2 (Cc) = (hθ, Π gi < − kθk2) ≤ (hθ, gi < − kθk2) ≤ exp(− 2 ), P P K 2 2 P 2 2 8 1 ( kΠ gk )2 and (Bc) = (kΠ gk < kΠ gk ) ≤ exp(− E K 2 ). P P K 2 2E K 2 8 where the second inequality follows directly from concentration result in Lemma A.4.1 (A.15a). Substituting the above two inequalities into expression (A.30) yields Lemma A.5.1. So it is only left for us to show inequality (A.28). To see this, first notice that

c c E[hθ, ΠK giI(B ∩ C)] = Ehθ, ΠK gi − Ehθ, ΠK giI(C ∪ B ). (A.31) The Cauchy-Schwarz inequality and triangle inequality allow us to deduce

c c c c −Ehθ, ΠK giI(C ∪ B ) = hθ, −E[ΠK gI(C ∪ B )]i c c ≤ kθk2kE[ΠK gI(C ∪ B )]k2 n c c o ≤ kθk2 kEΠK gI(C )k2 + kEΠK gI(B )k2 . Jensen’s inequality further guarantees that

c c n c c o −Ehθ, ΠK giI(C ∪ B ) ≤ kθk2 E[kΠK gk2I(C )] + E[kΠK gk2I(B )] , (A.32) | {z } | {z } : =T2 : =T3

c By definition, on event B , we have kΠK gk2 ≤ EkΠK gk2/2, and consequently kΠ gk (Bc) T ≤ E K 2P . (A.33) 3 2

Turning to the quantity T2, applying Cauchy-Schwartz inequality yields q 2p c p 2 p c T2 ≤ EkΠK gk2 EI(C ) = (EkΠK gk2) + var(kΠK gk2) P(C ). APPENDIX A. PROOFS FOR CHAPTER3 114

The variance term can be bounded as in inequality (A.16) which says that var(kΠK gk2) ≤ 4. From inequality√ (3.21), for every non-trivial cone (K 6= {0}), we are guaranteed that 2 EkΠK gk2 ≥ 1/ 2π, and hence var(kΠK gk2) ≤ 8π(EkΠK gk2) . Consequently, the quantity T2 can be further bounded as √ p c p c T2 ≤ 1 + 8πEkΠK gk2 P(C ) ≤ 6EkΠK gk2 P(C ). (A.34) Putting together inequalities (A.33), (A.34) and (A.32) yields

c c p c c −E[hθ, ΠK giI(C ∪ (C ∩ B ))] ≤ kθk2EkΠK gk2(6 P(C ) + P(B )/2), which validates claim (A.28) when combined with inequality (A.31). We finish the proof of Lemma A.5.1.

A.5.2 Proof of inequality (3.59) Now let us turn to the proof of inequality (3.59). First notice that if the radius satisfies 2 2  ≤ bρδLR({0},K), then there exists some θ ∈ H1 with kθk2 =  that satisfies 2 p kθk2 ≤ bρEkΠK gk2 and hθ, EΠK gi ≤ bρEkΠK gk2. (A.35) p Setting a = 4/ bρ ≥ 1 in inequality (A.24a) yields 2 p 8kθk2/ bρ + 4hθ, EΠK gi Γ(θ) ≤ + bkθk2 EkΠK gk2 2 kθk2 where b : = 3 exp(− (EkΠK gk2) ) + 24 exp(− 2 ). Now we only need to bound the two terms 8 bρ in the upper bound separately. First, note that inequality (A.35) yields 2 p 8kθk2/ bρ + 4hθ, EΠK gi p ≤ 12 bρ. (A.36) EkΠK gk2 On the other hand, again by applying inequality (A.35), it is straightforward to verify the following two facts that ( kΠ gk )2 q ( kΠ gk )2 kθk exp(− E K 2 ) ≤ b kΠ gk exp(− E K 2 ) 2 8 ρE K 2 8 2  1/4 p √ x p 2 ≤ bρ max x exp(− ) = bρ , x∈(0,∞) 8 e 2 2 r kθk2 x bρ and kθk2 exp(− ) ≤ sup x exp(− ) = . bρ x∈(0,∞) bρ 2e

Combining the above two inequalities ensures an upper bound for product bkθk2 and directly leads to upper bound of quantity Γ(θ), namely 1/4 r 2 b Γ(θ) ≤ 12pb + 3pb + 24 ρ , ρ ρ e 2e

With the choice of bρ, we established inequality (3.59). APPENDIX A. PROOFS FOR CHAPTER3 115

A.5.3 Proof of Lemma 3.5.2

In order to prove this result, we first define the random variable $F := \|\Pi_K g\|_2^2 - m$, where $m := \mathbb{E}\|\Pi_K g\|_2^2$ and $\tilde\sigma^2 := \mathrm{var}(F)$. We make use of Theorem 2.1 in Goldstein et al. [65], which shows that the distribution of $F$ and the Gaussian distribution $Z \sim N(0, \tilde\sigma^2)$ are very close; more specifically, the theorem gives
\[
\|F - Z\|_{\mathrm{TV}} \leq \frac{16}{\tilde\sigma^2}\sqrt{m} \leq \frac{8}{\mathbb{E}\|\Pi_K g\|_2}. \tag{A.37}
\]

2 p 2 In the last inequality, we use the facts thatσ ˜ ≥ 2m and EkΠK gk2 ≥ EkΠK gk2. 2 2 It is known that the quantity kΠK gk2 is distributed as a mixture of χ distributions(see e.g., [117, 65])—in particular, we can write

VK VK 2 law X X kΠK gk2 = Xi = WK + VK ,WK = (Xi − 1), i=1 i=1

2 where each {Xi}i≥1 is an i.i.d. sequence χ1 variables, independent of VK . Applying the decomposition of variance yields

2 2 σ˜ = var(VK ) + 2EkΠK gk2 ≥ 2m.

We can write the probability P(kΠK gk2 > EkΠK gk2) as

2 2 2 2 P(kΠK gk2 > EkΠK gk2) = P(kΠK gk2 − EkΠK gk2 > (EkΠK gk2) − EkΠK gk2) ≥ P(F > 0).

So if EkΠK gk2 ≥ 128, then inequality (A.37) ensures that dTV (F,N) ≤ 1/16, and hence 7 (F > 0) ≥ (Z > 0) − kF − Zk ≥ . P P TV 16 We finish the proof of Lemma 3.5.2.
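The mixture-of-$\chi^2$ representation used above can be illustrated numerically. The sketch below (our own illustration) takes $K$ to be the non-negative orthant in $\mathbb{R}^d$, for which the number of active coordinates plays the role of $V_K$ and is Binomial$(d, 1/2)$; both $\|\Pi_K g\|_2^2$ and the corresponding $\chi^2$ mixture then have mean $d/2$, the statistical dimension of the orthant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mc = 20, 100_000

# ||Pi_K g||_2^2 for the orthant K: sum of squares of the positive coordinates
g = rng.standard_normal((n_mc, d))
sq_norm = np.sum(np.maximum(g, 0.0) ** 2, axis=1)

# Mixture representation: V ~ Binomial(d, 1/2) active coordinates, then chi^2_V
V = rng.binomial(d, 0.5, size=n_mc)
mixture = np.array([rng.chisquare(v) if v > 0 else 0.0 for v in V])

print("mean of ||Pi_K g||^2 :", sq_norm.mean(), "(statistical dimension d/2 =", d / 2, ")")
print("mean of the mixture  :", mixture.mean())
print("variances            :", sq_norm.var(), mixture.var())
```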

A.6 Completion of the proof of Theorem 3.3.2

In this appendix, we collect the proofs of various lemmas used in the proof of Theorem 3.3.2.

A.6.1 Proof of Lemma 3.5.3

For every probability measure Q supported on K ∩ Bc(1), let vector θ be distributed accord- ingly to measure Q then it is supported on K ∩ Bc(). Consider a mixture of distributions, ky − θk2 (y) = (2π)−d/2 exp(− 2 ). (A.38) P1 Eθ 2 APPENDIX A. PROOFS FOR CHAPTER3 116

2 Let us first control the χ distance between distributions P1 and P0 : = N(0,Id). Direct calculations yield

 2  2 2 2 2 P1 ky − θk2 kyk2 χ (P1, P0) + 1 = EP0 = EP0 Eθ exp{− + } P0 2 2  kθk2 2 = exp{hy, θi − 2 } . EP0 Eθ 2

Suppose random vector θ0 is an independent copy of random vector θ, then

2 0 2 2 0 kθk2 + kθ k2 χ ( , ) + 1 = 0 exp{hy, θ + θ i − } P1 P0 EP0 Eθ,θ 2 0 2 2 0 2 kθ + θ k2 kθk2 + kθ k2 = 0 exp{ − } Eθ,θ 2 2 0 = Eθ,θ0 exp(hθ, θ i) 2 0 = E exp( hη, η i), (A.39) where the second step uses the fact the moment generating function of multivariate normal distribution. As we know, the testing error is always bounded below by 1 − kP1, P0kTV, so by the relation between the χ2 distance and TV distance, we have: 1 testing error ≥ 1 − p exp (2hη, η0i) − 1, 2 E which completes our proof.
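The key identity (A.39), namely $\chi^2(\mathbb{P}_1, \mathbb{P}_0) + 1 = \mathbb{E}_{\theta,\theta'}\exp(\langle\theta, \theta'\rangle)$, can be checked numerically in one dimension. The sketch below (our own illustration, with the prior taken to be uniform on $\{-\mu, +\mu\}$) compares a quadrature evaluation of the $\chi^2$ distance against the moment-generating-function expression.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

mu = 1.2                                  # the prior puts mass 1/2 on each of +mu and -mu
p0 = norm(0, 1).pdf                       # null density
p1 = lambda x: 0.5 * (norm(mu, 1).pdf(x) + norm(-mu, 1).pdf(x))   # mixture density

# chi^2(P1, P0) + 1 = \int p1(x)^2 / p0(x) dx, evaluated by quadrature
lhs, _ = integrate.quad(lambda x: p1(x) ** 2 / p0(x), -12, 12)

# E exp(<theta, theta'>) for theta, theta' i.i.d. uniform on {-mu, +mu}
rhs = 0.5 * (np.exp(mu ** 2) + np.exp(-mu ** 2))

print(lhs, rhs)                           # the two values agree up to quadrature error
```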

A.6.2 Proof of Lemma A.6.1 Let us first provide a formal statement of Lemma A.6.1 and then prove it.

Lemma A.6.1. Letting η and η0 denote an i.i.d pair of random variables drawn from the distribution Q defined in equation (3.62), we have

 2 2 4 2  2 0 1 5 kEΠK gk2 40 E(kΠK gk2) 0 Eη,η exp( hη, η i) ≤ 2 exp 2 + 4 , (A.40) a (EkΠK gk2) (EkΠK gk2)

1 2 2 where a : = P(kΠK gk2 ≥ 2 EkΠK gk2) and  > 0 satisfies the inequality  ≤ (EkΠK gk2) /32. To prove this result, we use Borell’s lemma [19] which states that for a standard Gaussian d vector Z ∼ N(0,Id) and a function f : R → R which is L-Lipschitz, we have

2 2 E exp(af(Z)) ≤ exp(aEf(Z) + a L /2) (A.41) for every a ≥ 0. APPENDIX A. PROOFS FOR CHAPTER3 117
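As an illustration of Borell's inequality (A.41), the short Monte Carlo check below (ours, not part of the proof) uses the 1-Lipschitz function $f(z) = \|z\|_2$; the dimension and the value of $a$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, a, n_mc = 10, 0.8, 500_000
Z = rng.standard_normal((n_mc, d))
f = np.linalg.norm(Z, axis=1)             # f(z) = ||z||_2 is 1-Lipschitz (L = 1)

lhs = np.mean(np.exp(a * f))              # Monte Carlo estimate of E exp(a f(Z))
rhs = np.exp(a * f.mean() + a ** 2 / 2)   # exp(a E f(Z) + a^2 L^2 / 2)
print(lhs, "<=", rhs)
```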

Let g, g0 be i.i.d standard normal vectors in Rd. Let 1 1 A(g) : = {kΠ gk > kΠ gk } and A(g0) : = {kΠ g0k > kΠ g0k } K 2 2E K 2 K 2 2E K 2 By definition of the probability measure Q in expression (3.62), we have "  2 0  # 2 0 4 hΠK g, ΠK g i 0 0 exp( hη, η i) = 0 exp A(g) ∩ A(g ) Eη,η Eg,g 0 EkΠK gk2EkΠK g k2  2 0  1 4 hΠK g, ΠK g i 0 0 = 0 Eg,g exp 0 I(A(g) ∩ A(g )). P(A(g) ∩ A(g )) EkΠK gk2EkΠK g k2 Using the independence of g, g0 and non-negativity of the exponential function, we have  2 0  2 0 1 4 hΠK g, ΠK g i 0 0 Eη,η exp( hη, η i) ≤ 2 Eg,g exp 0 . (A.42) P(A(g)) EkΠK gk2EkΠK g k2 | {z } : =T1 2 2 To simplify the notation, we write λ : = 4 /(EkΠK gk2) so that 0 T1 = Eg,g0 exp (λhΠK g, ΠK g i) . (A.43)

Now for every fixed value of g, the function h 7→ hΠK g, ΠK hi is Lipschitz with Lipschitz constant equal to kΠK gk2. This is because 0 0 0 |hΠK g, ΠK hi − hΠK g, ΠK h i| ≤ kΠK gk2kΠK h − ΠK h k2 ≤ kΠK gk2kh − h k2, where we used Cauchy-Schwartz inequality and the non-expansive property of convex pro- jection. As a consequence of inequality (A.41) and Cauchy-Schwartz inequality, the term T1 can be upper bounded as  λ2kΠ gk2  T ≤ exp λhΠ g, Π g0i + K 2 1 Eg K E K 2 q q 0 2 2 ≤ Eg exp (2λhΠK g, EΠK g i) Eg exp (λ kΠK gk2) . (A.44) | {z } | {z } : =T2 : =T3 0 We now control T2,T3 separately. For T2, note again that h 7→ hΠK h, EΠK g i is a Lipschitz 0 function with Lipschitz constant equal to kEΠK g k2. Inequality (A.41) implies therefore that q 0 2 0 2 T2 ≤ exp (2λhEΠK g, EΠK g i + 2λ kEΠK g k2). (A.45)

To control quantity T3, we use a result from [3, Sublemma E.3] on the moment generating 2 function of kΠK gk which gives s  2λ4 (kΠ gk2) T ≤ exp λ2 (kΠ gk2) + E K 2 , whenever λ < 1/4. (A.46) 3 E K 2 1 − 4λ2 APPENDIX A. PROOFS FOR CHAPTER3 118

2 2 Because of the assumption that  ≤ (EkΠK gk2) /32, we have λ ≤ 1/8 < 1/4. Therefore putting all the pieces together as above, we obtain

 2 2 4 2  2 0 1 2 2 λ E(kΠK gk2) λ E(kΠK gk2) Eη,η0 exp( hη, η i) ≤ exp (λ + λ )kEΠK gk2 + + P(A(g))2 2 1 − 4λ2 1 2 2 2  ≤ exp 1.25λkEΠK gk2 + 2.5λ E(kΠK gk2) P(A(g))2  2 2 4 2  1 5 kEΠK gk2 40 E(kΠK gk2) = 2 exp 2 + 4 ) . P(A(g)) (E(kΠK gk2) (EkΠK gk2) This completes the proof of inequality (A.40).

A.7 Completion of the proof of Proposition 3.3.2 and the monotone cone

In this appendix, we collect various results related to the monotone cone, and the proof of Proposition 3.3.2.

A.7.1 Proof of Lemma 3.3.1

So as to simplify notation, we define $\xi = \Pi_K g$, with $j$-th coordinate denoted by $\xi_j$. Moreover, for a given vector $g \in \mathbb{R}^d$ and integers $1 \leq u < v \leq d$, we define the $u$ to $v$ average as
\[
\bar{g}_{uv} := \frac{1}{v - u + 1}\sum_{j=u}^{v} g_j.
\]

To demonstrate an upper bound for the inner product $\inf_{\eta \in K \cap \mathbb{S}^{d-1}} \langle\eta, \mathbb{E}\Pi_K g\rangle$, it turns out that it is enough to take $\eta = \frac{1}{\sqrt{2}}(-1, 1, 0, \ldots, 0) \in K \cap \mathbb{S}^{d-1}$ and use the fact that
\[
\inf_{\eta \in K \cap \mathbb{S}^{d-1}} \langle\eta, \mathbb{E}\Pi_K g\rangle \leq \frac{1}{\sqrt{2}}\,\mathbb{E}(\xi_2 - \xi_1). \tag{A.47}
\]

So it only remains to analyze $\mathbb{E}(\xi_2 - \xi_1)$, which has an explicit form based on the explicit representation of the projection onto the monotone cone (see Robertson et al. [121], Chapter 1), namely
\[
\xi_j = \lambda_j - \bar{\lambda}, \qquad \lambda_j = \max_{u \leq j}\,\min_{v \geq j}\, \bar{g}_{uv}. \tag{A.48}
\]
This is true because the projection onto the cone $K = M \cap L^{\perp}$ can be written in two steps, $\Pi_K g = \Pi_{L^{\perp}}(\Pi_M g)$, and projecting onto the subspace $L^{\perp}$ only shifts the vector to have mean zero.
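The min–max representation (A.48) is the classical characterization of isotonic regression, i.e., of the Euclidean projection onto the monotone cone $M$. The following Python sketch (ours; the function names are illustrative) checks this numerically by comparing a direct evaluation of the min–max formula against a pool-adjacent-violators implementation.

```python
import numpy as np

def minmax_projection(g):
    """lambda_j = max_{u <= j} min_{v >= j} mean(g[u..v]); the formula in (A.48)."""
    d = len(g)
    csum = np.concatenate(([0.0], np.cumsum(g)))
    avg = lambda u, v: (csum[v + 1] - csum[u]) / (v - u + 1)   # mean of g[u..v] (0-indexed)
    lam = np.empty(d)
    for j in range(d):
        lam[j] = max(min(avg(u, v) for v in range(j, d)) for u in range(j + 1))
    return lam

def pava(g):
    """Pool-adjacent-violators: the non-decreasing vector closest to g in Euclidean norm."""
    blocks = []                                   # each block is [mean, weight]
    for x in g:
        blocks.append([x, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([[m] * w for m, w in blocks])

rng = np.random.default_rng(0)
g = rng.standard_normal(30)
assert np.allclose(minmax_projection(g), pava(g))   # the two computations agree
# xi = Pi_{M ∩ L^perp}(g) is then obtained by centering: xi = lam - lam.mean()
```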

We claim that the difference satisfies

ξ2 − ξ1 ≤ max |g¯2v| + max |g¯1v|. (A.49) v≥2 v≥1 To see this, as a consequence of expression (A.48), we have

ξ2 − ξ1 = max{min g¯1v, min g¯2v} − min g¯1v. v≥2 v≥2 v≥1

The right hand side above only takes value in set {minv≥2 g¯1v −g1, 0, minv≥2 g¯2v −minv≥1 g¯1v} where the last two values agree with bound (A.49) obviously while the first value can be written as v ! 1 X 1 1 min g¯1v − g1 = min gi − (1 − )g1 = min(1 − )(¯g2v − g1) ≤ |g¯2v| + |g1|, v≥2 v≥2 v v v≥2 v i=2 which also agrees with inequality (A.49). Next let us prove that for every j = 1, 2, we have √ max |g¯jv| < 20 2, (A.50) E v≥j

and combine this fact with expressions (A.49) and (A.47) gives us inf hη, EΠK gi ≤ 40 η∈K∩Sd−1 which validates the conclusion in Lemma 3.3.1. It is only left for us to verify inequality (A.50). First as we can partition the interval [j, d] into k smaller intervals where each smaller interval is of length 2m except the last one, then k X E max |g¯jv| = E max max |g¯jv| ≤ E max |g¯jv|, (A.51) j≤v≤d 1≤m≤k v∈Im v∈Ik m=1 m m+1 where Im = [2 + j − 2, 2 + j − 3], 1 ≤ m < k, the number of intervals k and length of Ik are chosen to make those intervals sum up to d. m m+1 Given index 2 +j −2 ≤ v ≤ 2 +j −3, random variablesg ¯jv are Gaussian distributed with mean zero and variance 1/(v − j + 1). Suppose we have Gaussian random variable Xv 2 m with mean zero and variance σm = 1/(2 − 1) and the covariance satisfies cov(Xv,Xv0 ) = 2 0 cov(¯gjv, g¯jv ). Since σm ≥ 1/(v −j +1), the variable maxv∈Im |g¯jv| is stochastically dominated by the maximum max2m≤v≤2m+1−1 |Xv|, and therefore

k k X X E max |g¯jv| ≤ E max |Xv|. v∈Im 2m≤v≤2m+1−1 m=1 m=1 2 Applying the fact that√ for t ≥ 2 number of Gaussian random variable i ∼ N(0, σ ), we have E max1≤i≤t |i| ≤ 4σ 2 log t which gives k k k r ! X X p m p X m E max |g¯jv| ≤ 4σm 2 log(2 ) = 4 2 log 2 m . (A.52) v∈Im 2 − 1 m=1 m=1 m=1 APPENDIX A. PROOFS FOR CHAPTER3 120

Pk p m The last step is to control the sum m . There are many ways to show that it is m=1 2 −1 √ m m/4 upper bounded by some constant. One crude way is use the fact that 2m−1 ≤ 2 whenever m ≥ 5, therefore we have

k 4 k 4 k X r m X r m X r m X r m X 1 = + < + 2m − 1 2m − 1 2m − 1 2m − 1 2m/4 m=1 m=1 m=5 m=1 m=5 4 X r m 2−5/4 < + < 6, 2m − 1 1 − 2−1/4 m=1 which validates inequality (A.50) when combined with inequalities (A.51) and (A.52). This completes the proof of Lemma 3.3.1.

A.7.2 Proof of Lemma A.3.1 The proof of Lemma A.3.1 involves two parts. First, we define the matrices G, F . Then we prove that the distribution of η has the right support where we make use of Lemma A.3.2. As stated, matrix G is a lower triangular matrix satisfying (A.12a). Let us now specify the matrix F . Recall that we denote δ : = r−2 and r : = 1/3. To define matrix F , let us  first define a partition of [d] into m consecutive intervals I1,...,Im with m specified in expression (A.7) and the length of each interval |Ii| = `i where `i is defined as δ − 1 ` : = b (d + log d + 3)c, 1 ≤ i ≤ m − 1, (A.53) i δi δ Pm−1 and `m : = d − i=1 `i. Following directly from the definition (A.53), each length `i ≥ 1 and `i is a decreasing sequence with regard to i. Also `i satisfies the following δ − 1 ` = b (d + log d + 3)c < d and ` ≥ δ` , for 1 ≤ i ≤ m − 1, (A.54) 1 δ δ i i+1 p where the first inequality holds since as log(ed) ≥ 14, we have (δ − 1)(logδ d + 3) ≤ d and the last inequality follows from the fact that babc ≥ abbc for positive integer a and b ≥ 0 (because abbc is an integer that is smaller than ab). We are now ready to define the d × m matrix F . We take

\[
F(i, j) = \begin{cases} \dfrac{1}{\sqrt{\ell_j}} & \text{if } i \in I_j, \\[4pt] 0 & \text{otherwise.} \end{cases} \tag{A.55}
\]

It is easy to check that the matrix $F$ satisfies $F^{\top}F = I_m$, which validates equation (A.12b).
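The construction of $F$ and $G$ can be checked numerically. The sketch below (illustrative only) builds $G$ from (A.12a) and $F$ from (A.55) for some block lengths satisfying $\ell_i \geq \delta\,\ell_{i+1}$ as in (A.54) (the exact lengths from (A.53) are not reproduced), and verifies that $F^{\top}F = I_m$ and that $\eta = FGb$ is non-decreasing for a non-negative sparse $b$.

```python
import numpy as np

r, delta = 1.0 / 3.0, 9                # as in the text: delta = r^{-2}
lengths = [81, 9, 1]                   # illustrative block lengths with l_i >= delta * l_{i+1}
m, d = len(lengths), sum(lengths)

# G from (A.12a): lower triangular with entries G[i, j] = r^{i-j} for i >= j
G = np.array([[r ** (i - j) if i >= j else 0.0 for j in range(m)] for i in range(m)])

# F from (A.55): column j is the normalized indicator of the j-th block I_j
F = np.zeros((d, m))
start = 0
for j, l in enumerate(lengths):
    F[start:start + l, j] = 1.0 / np.sqrt(l)
    start += l

assert np.allclose(F.T @ F, np.eye(m))            # orthonormal columns, i.e. (A.12b)

# For a non-negative sparse b (as in the set S), eta = F G b is non-decreasing:
b = np.zeros(m)
b[0] = 1.0                                        # s = floor(sqrt(m)) = 1 non-zero entry
eta = F @ G @ b
assert np.all(np.diff(eta) >= -1e-12)
```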

First we show that both η = F Gb and η − η¯1 belong to M. The i-th coordinate of η can be written as

j 1 X j−t ηi = p r bt, ∀ i ∈ Ij. `j t=1

Therefore we can denote uj as the value of ηi for i ∈ Ij. To establish monotonicity, we only need to compare the value in the consecutive blocks. Direct calculation of the consecutive ratio yields s u r(Pj rj−tb ) + b p` ` j+1 = t=1 t j+1 j ≥ r j ≥ 1, u p Pj j−t ` j `j+1 t=1 r bt j+1 where we used the non-negativity of coordinates of vector b and the last inequality follows from inequality (A.54) and δ = r−2. The monotonicity of η − η¯1 thus inherits directly from the monotonicity of η. To complete the proof of Lemma A.3.1, we only need to prove lower bounds on kηk2 and kη − η¯k2. For these, we shall use inequality (A.13b) of Lemma A.3.2.

Proof of the bound kηk2 ≥ 1: Recall that r = 1/3 and as a direct consequence of inequality (A.13b) in Lemma A.3.2, we have 9 63 hη, ηi = kGbk2 ≥ − > 1.96, (A.56) 2 4 32s √ where the last step follows form the fact that s = b mc ≥ 7. Therefore, the norm condition holds so η is supported on M ∩ LT ∩ Bc(1).

2 Proof of the bound kη − η¯1k2 ≥ 1: The norm kη − η¯1k2 has the following decomposition where

2 2 2 kη − η¯1k2 = kηk2 − d(¯η) .

We claim that d(¯η)2 ≤ 0.2. If we take this for now, combining with inequality (A.56) which 2 2 says kηk2 is greater than 1.96, we can deduce that kη − η¯1k2 ≥ 1. So it suffices to verify the claim d(¯η)2 ≤ 0.2. Recall that η = F Gb. Direct calculation yields

m m T X X p i−k dη¯ = h1, ηi = 1 · F Gb = bk `ir . k=1 i=k | {z } : =ak APPENDIX A. PROOFS FOR CHAPTER3 122

Plugging into the definitions of r and `i guarantees that m r m X (δ − 1)(d + log d + 3) 1 X a ≤ δ = p(δ − 1)(d + log d + 3)δk δ−i k δi δ(i−k)/2 δ i=k i=k s (d + log d + 3) ≤ δ , (δ − 1)δk−2

Pm −i where the last step uses the summability of a geometric sequence—namely i=k δ ≤ −k+1 P δ /(δ − 1). Now for every vector b, our goal√ is to control√ akbk. Recall that every vector b has s non-zero entries which equal to 1/ s where s = b mc. Since ak decreases with k, P this inner product akbk is largest when the first s coordinates of b are non-zero, therefore s r s r X 1 1 δ2(d + log d + 3) X 1 1 δ2(d + log d + 3) 1 dη¯ ≤ a √ ≤ √ δ ≤ √ δ √ , k s s δ − 1 δk/2 s δ − 1 k=1 k=1 δ − 1 and thus we have 1 (d + log d + 3) δ2 81(d + log d + 3) d(¯η)2 ≤ √ δ √ ≤ √ δ < 0.2, m − 1 d (δ − 1)( δ − 1)2 32d( m − 1) √ where the last step uses m ≥ 8. Therefore, the norm condition also holds so η − η¯1 is supported on M ∩ LT ∩ Bc(1). Thus, we have completed the proof of Lemma A.3.1.

A.7.3 Proof of Lemma A.3.2 By definition of the matrix G, we have m m 0 X 0 X t−1 0 0 t−1 0 hGb, Gb i = (Gb)t(Gb )t = (bt + rbt−1 + ··· + r b1)(bt + rbt−1 + ··· + r b1) t=1 t=1 m t t X X X 2t−u−v 0 = r bubv. t=1 u=1 v=1 Switching the order of summation yields m m m 0 X X 0 X 2t−u−v hGb, Gb i = bubv r u=1 v=1 t=max{u,v} m m 0 2 max{u,v} 2m+2 X X bub r − r = v ru+v 1 − r2 u=1 v=1 m m m m 1 X X 1 X X = b b0 r|u−v| − b b0 r2m+2−u−v . (A.57) 1 − r2 u v 1 − r2 u v u=1 v=1 u=1 v=1 | {z } | {z } : =∆1 : =∆2 APPENDIX A. PROOFS FOR CHAPTER3 123

We bound the two terms ∆1 and ∆2 separately. √ Recall the fact that b, b0 belong to S, so there are exactly s = b mc non-zero entry in 0 √ both b and b and these entries equal to 1/ s. The summation defining ∆1 is not affected by the permutation of coordinates, so that we can assume without loss of generality that the indices of non-zero entries in b are indexed by {1, . . . , s}, and that the indices of non-zero entries in b0 are indexed by {k, k + 1, . . . , k + s − 1} for some 1 ≤ k ≤ m + 1 − s. We split our proof into two cases depending on whether k ≤ s or k > s.

Case 1 (k ≤ s): The summation ∆1 can be written as

m m s k+s−1 2 X X 0 |u−v| X X |u−v| s(1 − r )∆1 = s bubvr = r . u=1 v=1 u=1 v=k Direct calculation yields

k−1 k+s−1 s u s k+s−1 2 X X v−u X X u−v X X v−u s(1 − r )∆1 = r + r + r u=1 v=k u=k v=k u=k v=u+1 (1 − rs)(r − rk) s − k + 1 r r(s − k + 1) rk − rs+1 = + − (1 − rs−k+1) + − (1 − r)2 1 − r (1 − r)2 1 − r (1 − r)2 1 + r rk(rs + rs+2 − 2) = (s − k + 1) + . 1 − r (1 − r)2

Notice the following two facts that

s − k + 1 −2r rk(rs + rs+2 − 2) hb, b0i = and ≤ < 0, s (1 − r)2 (1 − r)2

so that 1 −2r 1 hb, b0i + ≤ ∆ ≤ hb, b0i. (A.58) (1 − r)2 s(1 − r2)(1 − r)2 1 (1 − r)2

Case 2 (k > s): The summation ∆1 satisfies the bounds

m m s k+s−1 X X X X rk−s(1 − rs)2 s(1 − r2)∆ = s b b0 r|u−v| = rv−u = . 1 u v (1 − r)2 u=1 v=1 u=1 v=k Since k − s ≥ 1, we have hb, b0i = 0 and consequently 1 r ∆ ≤ hb, b0i + . (A.59) 1 (1 − r)2 s(1 − r2)(1 − r)2 APPENDIX A. PROOFS FOR CHAPTER3 124

Combining inequalities (A.57), (A.58) and (A.59), we can deduce that 1 r hGb, Gb0i ≤ ∆ ≤ hb, b0i + , 1 (1 − r)2 s(1 − r2)(1 − r)2 which validates inequality (A.13a). 0 On the other hand, when b = b , the summation ∆2 is the largest when the non-zero entries of b lie on coordinates m − s + 1, . . . , m. Thus we have

m m X X r2(1 − rs)2 r2 s(1 − r2)∆ ≤ r2m+2−u−v = < . (A.60) 2 (1 − r)2 (1 − r)2 u=m−s+1 v=m−s+1

Combining decomposition (A.57) with the inequalities (A.58), we can deduce that

1 2r r2 hGb, Gbi ≤ − − , (1 − r)2 s(1 − r2)(1 − r)2 s(1 − r2)(1 − r)2 where we use the fact that hb, bi = 1. This completes the proof of inequality (A.13b). 125

Appendix B

Proofs for Chapter 4

This chapter is organized as follows. We complete the proofs of Theorems 4.3.1 and 4.3.2 in Subsections B.1.1 and B.1.2 respectively. The proof of Inequality (4.21) in Remark 4.3.2 is given in Subsection B.1.3. The proofs of the corollaries of Section 4.3.1 are given in Subsection B.1.4. The proof of Theorem 4.3.3 is completed in Subsection B.1.5. Technical lemmas which were crucially used in the proofs of the main results are stated and proved in Subsection B.1.6. Finally, note that additional simulations (similar to those in the main text) are presented in Section B.2.

B.1 Additional proofs and technical results

B.1.1 Completion of the proof of Theorem 4.3.1 We use the same notation as in the proof of Theorem 4.3.1 in the main text. To complete the proof, we need to prove inequality (4.49). ˆ ˆ Below, we write ∆k, k and k∗ for ∆k(θi), k(i) and k∗(i) respectively for ease of notation. We also write P for PK∗ . We prove (4.49) by considering the two cases: k ≤ k∗, k ∈ I and k > k∗, k ∈ I separately. The first case is k ≤ k∗, k ∈ I. By Lemma B.1.2 and (B.44), we get √ √ 6( 2 − 1)σ 6( 2 − 1)σ ∆k ≤ ∆k∗ ≤ √ ≤ √ k∗ + 1 k + 1 and consequently σ2 σ2  √  ∆2 + ≤ 36( 2 − 1)2 + 1 for all k ≤ k , k ∈ I. (B.1) k k + 1 k + 1 ∗ We bound P{kˆ = k} from above by      ˆ  2σ  ˆ  2σ  ˆ  2σ 2σ P ∆k + √ ≤ ∆k∗ + √ ≤ P ∆k∗ ≥ √ − √ . + k + 1 + k∗ + 1 + k + 1 k∗ + 1 APPENDIX B. PROOFS FOR CHAPTER4 126

Because k ≤ k∗, the positive part above can be dropped and we obtain   ˆ ˆ 2σ 2σ P{k = k} ≤ P ∆k∗ ≥ √ − √ . k + 1 k∗ + 1 ˆ Because ∆k∗ is normally distributed with mean ∆k∗ , we have    2σ(k + 1)−1/2 − 2σ(k + 1)−1/2 − ∆  ˆ ∗ k∗ P{k = k} ≤ P Z ≥ q ,  ˆ  var(∆k∗ ) where Z is a standard normal random variable. From (B.44), we have ! 2σ 2σ 2σ r k + 1  √  √ − √ − ∆k∗ ≥ √ 1 − 3 2 − 2 . k + 1 k∗ + 1 k + 1 k∗ + 1 As a result,   r √ ! ˆ  2σ k + 1    P{k = k} ≤ P Z ≥ q 1 − 3 2 − 2 .  ˆ k∗ + 1  (k + 1)var(∆k∗ ) √ ˜ −2 ˜ Suppose k := (k∗ + 1) 3 2 − 2 − 1. For k < k, we use the bound given by Lemma B.1.4 ˆ on the variance of ∆k∗ to obtain ( !)  " #2 rk + 1 √ rk + 1 √ {kˆ = k} ≤ Z ≥ 2 ∗ − 3 2 + 2 ≤ exp −2 ∗ − 3 2 + 2 . P P k + 1  k + 1 

Using this and (B.1), we see that the quantity

X  σ2  q ∆2 + {kˆ = k} k k + 1 P k

 2 2 √ "r √ # σ  2  X k∗ + 1 k∗ + 1 36( 2 − 1) + 1 exp − − 3 2 + 2  . k∗ + 1 k + 1 k + 1 k

X  h √ i2 X 2j exp − (3/2)j/2 − 3 2 + 2 + 2j, j≥4 0≤j≤3 which is just a universal positive constant. We have proved therefore that

 2  q 2 X 2 σ ˆ C1σ ∆k + P{k = k} ≤ , (B.2) k + 1 k∗ + 1 k

Once again because I consists of integers of the form 2j, we get

 2 X k∗ + 1 X  √  ≤ 2j (3/2)j ≤ 3 2 − 2 . k + 1 k˜≤k≤k∗,k∈I j≥0

The right hand side above is just a constant. It follows therefore that

 2  q 2 X 2 σ ˆ C2σ ∆k + P{k = k} ≤ , (B.3) k + 1 k∗ + 1 k˜≤k≤k∗,k∈I for a positive constant C2. Combining (B.2) and (B.3), we deduce that

 2  q 2 X 2 σ ˆ Cσ ∆k + P{k = k} ≤ (B.4) k + 1 k∗ + 1 k≤k∗,k∈I where C := C1 + C2 is a universal positive constant. To complete the proof of Theorem 4.3.1, we need to deal with the case k > k∗, k ∈ I and prove that  2  q 2 X 2 σ ˆ Cσ ∆k + P{k = k} ≤ (B.5) k + 1 k∗ + 1 k>k∗,k∈I for a constant C. Assume that {k ∈ I : k > k∗} is non-empty for otherwise there is nothing to prove. By the first part of (B.45) in Lemma B.1.3, we get

 2  q   q X 2 σ ˆ 1 X 2 ˆ ∆k + P{k = k} ≤ 1 + √ ∆k P{k = k}. (B.6) k + 1 ( 6 − 2)2 k>k∗,k∈I k>k∗,k∈I APPENDIX B. PROOFS FOR CHAPTER4 128

ˆ We first bound P{k = k} for k > k∗, k ∈ I. We proceed by writing  2σ 2σ  {kˆ = k} ≤ ∆ˆ + + √ ≤ ∆ˆ + + √ P P k k∗ k + 1 k∗ + 1  2σ 2σ  ≤ ∆ˆ + √ ≤ ∆ˆ + + √ (because x ≤ x+) P k k∗ k + 1 k∗ + 1     ˆ 2σ ˆ 2σ ˆ 2σ 2σ ≤ P ∆k + √ ≤ ∆k∗ + √ + PK ∆k + √ ≤ √ k + 1 k∗ + 1 k + 1 k∗ + 1     ˆ ˆ 2σ ˆ 2σ ≤ P ∆k ≤ ∆k∗ + √ + PK ∆k ≤ √ k∗ + 1 k∗ + 1     ˆ ˆ 2σ ˆ 2σ ≤ P ∆k∗ − ∆k ≥ −√ + P −∆k ≥ −√ k∗ + 1 k∗ + 1 ˆ ˆ ˆ Both ∆k∗ − ∆k and ∆k are normally distributed with means ∆k∗ − ∆k and ∆k respectively. As a result      ∆ − ∆ − 2σ(k + 1)−1/2   ∆ − 2σ(k + 1)−1/2  ˆ k k∗ ∗ k ∗ P{k = k} ≤ P Z ≥ q + P Z ≥ q  ˆ ˆ   ˆ  var(∆k∗ − ∆k) var(∆k) where Z is a standard normal random variable. Using (B.44) in Lemma B.1.3, we obtain  √    −1/2  −1/2 ˆ  ∆k − 2σ(k∗ + 1) 3 2 − 2   ∆k − 2σ(k∗ + 1)  P{k = k} ≤ P Z ≥ q + P Z ≥ q .  ˆ ˆ   ˆ  var(∆k∗ − ∆k) var(∆k)

By the Cauchy-Schwarz inequality and Lemma B.1.4, we get, for k > k∗, q q q ˆ ˆ ˆ ˆ σ σ 2σ var(∆k∗ − ∆k) ≤ var(∆k∗ ) + var(∆k) ≤ √ + √ ≤ √ k + 1 k∗ + 1 k∗ + 1 ˆ 2 2 Also var(∆k) ≤ σ /(k + 1) ≤ σ /(k∗ + 1). Therefore if k > k∗, k ∈ I is such that √ −1/2   ∆k ≥ 2σ(k∗ + 1) 3 2 − 2 , (B.7) we obtain ( √ ) ∆ − 2σ(k + 1)−1/2 3 2 − 2  ∆ − 2σ(k + 1)−1/2  {kˆ = k} ≤ Z ≥ k √∗ + Z ≥ k ∗ P P −1/2 P −1/2 σ 2(k∗ + 1) σ(k∗ + 1) ( √ ) ∆ − 2σ(k + 1)−1/2 3 2 − 2 ≤ 2 Z ≥ k √∗ P −1/2 σ 2(k∗ + 1)  k + 1  √ 2 ≤ 2 exp − ∗ ∆ − 2σ(k + 1)−1/2(3 2 − 2) . 2σ2 k ∗ APPENDIX B. PROOFS FOR CHAPTER4 129

√ 2 2 2 −1/2 Using the inequality (x − y) ≥ x /2 − y with x = ∆k and y = 2σ(k∗ + 1) (3 2 − 2), we obtain  √   (k + 1)∆2  {kˆ = k} ≤ 2 exp 2(3 2 − 2)2 exp − ∗ k (B.8) P 4σ2

whenever k ∈ I, k > k∗ satisfies (B.7). It is easy to see that when (B.7) is not satisfied, the right hand side above is larger than 2. Thus, inequality (B.8) is true for all k ∈ I, k > k∗. As a result, q √ √ 2 ˆ  2 2 ∆k P{k = k} ≤ 2 exp (3 2 − 2) ξ ∆k for all k ∈ I, k > k∗. (B.9) where  (k + 1)z  ξ(z) := z exp − ∗ for z > 0. 8σ2 By (B.6) and (B.9), the proof would therefore be complete if we show that P ξ (∆2) k∈I:k>k∗ k is bounded from above by a universal positive constant. For this, note first that the function 2 ξ(z) is decreasing for z ≥ z˘ := 8σ /(k∗ + 1) and attains its maximum over z > 0 at z =z ˘. 2 Note also the second part of inequality (B.45) gives ∆k ≥ zk for all k ∈ I, k > k∗ where √ ( 6 − 2)2σ2(k + 1) zk := 2 4(k∗ + 1) We therefore get

−(k + 1) max(z , z˘) ξ ∆2 ≤ ξ(max(z , z˘)) = max(z , z˘) exp ∗ k k k k 8σ2 −(k + 1)z  −(k + 1)z  ≤ max(z , z˘) exp ∗ k ≤ (z +z ˘) exp ∗ k . k 8σ2 k 8σ2

Because k > k∗, it is easy to see that 8σ2 8σ2(k + 1) z˘ = ≤ 2 . k∗ + 1 (k∗ + 1) We deduce that √ √ " 2 # 2 2 ! 2 ( 6 − 2) σ (k + 1) ( 6 − 2) k + 1 ξ ∆k ≤ + 8 2 exp − . 4 (k∗ + 1) 32 k∗ + 1

Denoting the constants above by c1 and c2, we can write

2   X 2 c1σ X k + 1 k + 1 ξ ∆k ≤ exp − . k∗ + 1 k∗ + 1 c2(k∗ + 1) k∈I:k>k∗ k∈I:k>k∗ APPENDIX B. PROOFS FOR CHAPTER4 130

The sum in the right hand side above is easily seen to be bounded from above by

j! X 1 3 2j exp − c 2 j≥0 2 which is further bounded by a universal constant. This completes the proof of Theorem 4.3.1.

B.1.2 Completion of the proof of Theorem 4.3.2 We continue from where we left off in the proof of chapter4. We first work with the case when K∗ satisfies the condition (4.51). The idea here is to use Le Cam’s bound (4.50) with the choice of L∗ given in the proof in the chapter4. In the remainder of the proof, we use Lemma B.1.5 which is stated and proved in Section B.1. To control the total variation distance in (4.50), we use Pinsker’s inequality:

\[
\|P_{K^*} - P_{L^*}\|_{\mathrm{TV}} \leq \sqrt{\tfrac{1}{2}\, D(P_{K^*}\,\|\,P_{L^*})},
\]
and the fact that (note that $\theta_i = 2\pi i/n - \pi$)
\[
D(P_{K^*}\,\|\,P_{L^*}) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(h_{K^*}(2i\pi/n - \pi) - h_{L^*}(2i\pi/n - \pi)\bigr)^2,
\]
where $D(P_{K^*}\|P_{L^*})$ denotes the Kullback--Leibler divergence between the probability measures $P_{K^*}$ and $P_{L^*}$.

The support function of $L^*$ is easily seen to be the maximum of the support functions of $K^*$ and the singleton $\{a_{K^*}(\alpha)\}$. Therefore,
\[
h_{L^*}(\theta) := \max\Bigl\{h_{K^*}(\theta),\; \frac{h_{K^*}(\alpha) + h_{K^*}(-\alpha)}{2\cos\alpha}\cos\theta + \frac{h_{K^*}(\alpha) - h_{K^*}(-\alpha)}{2\sin\alpha}\sin\theta\Bigr\}
= \max\Bigl\{h_{K^*}(\theta),\; \frac{\sin(\theta+\alpha)}{\sin 2\alpha}\,h_{K^*}(\alpha) + \frac{\sin(\alpha-\theta)}{\sin 2\alpha}\,h_{K^*}(-\alpha)\Bigr\}.
\]

Using (4.1), it can be shown that

sin(θ + α) sin(α − θ) h ∗ (θ) ≤ h ∗ (α) + h ∗ (−α) (B.10) K sin 2α K sin 2α K for −α < θ < α and sin(θ + α) sin(α − θ) h ∗ (θ) ≥ h ∗ (α) + h ∗ (−α) (B.11) K sin 2α K sin 2α K APPENDIX B. PROOFS FOR CHAPTER4 131

for θ ∈ [−π, −α] ∪ [α, π]. To see this, assume that θ > 0 without loss of generality. We then work with the two separate cases θ ∈ [0, α] and θ ∈ [α, π]. In the first case, apply (4.1) with α1 = α, α = θ and α2 = −α to get (B.10). In the second case, apply (4.1) with α1 = θ, α = α and α2 = −α to get (B.11). As a result of (B.10) and (B.11), we get that sin(θ + α) sin(α − θ) h ∗ (θ) = h ∗ (α) + h ∗ (−α) (B.12) L sin 2α K sin 2α K

for −α ≤ θ ≤ α, and that hL∗ (θ) equals hK∗ (θ) for every other θ in (−π, π]. We now give an upper bound on hL∗ (θ) − hK∗ (θ) for 0 ≤ θ < α. Using (4.1) with α1 = θ, α = 0 and α2 = −α, we obtain sin(α + θ) sin θ h ∗ (θ) ≥ h ∗ (0) − h ∗ (−α). K sin α K sin α K Thus for 0 ≤ θ < α, we obtain the inequality sin(θ + α) sin(α − θ) 0 ≤ h ∗ (θ) − h ∗ (θ) = h ∗ (α) + h ∗ (−α) − h ∗ (θ) L K sin 2α K sin 2α K K   sin(θ + α) hK∗ (α) + hK∗ (−α) ≤ − h ∗ (0) . sin α 2 cos α K Because 0 < α < π/4, 0 ≤ θ ≤ α, we use the fact that the sine function is increasing on (0, π/2) to deduce that

hK∗ (α) + hK∗ (−α) 0 ≤ h ∗ (θ) − h ∗ (θ) ≤ − h ∗ (0) for all 0 ≤ θ < α. L K 2 cos α K One can similarly deduce the same inequality for the case −α < θ ≤ 0 as well. Because of this and the fact that hL∗ (θ) equals hK∗ (θ) for all θ in (−π, π] that are not in the interval (−α, α), we obtain

n 1 X 2 D(P ∗ ||P ∗ ) = (h ∗ (2iπ/n − π) − h ∗ (2iπ/n − π)) K L 2σ2 K L i=1  2 nα hK∗ (α) + hK∗ (−α) ≤ − h ∗ (0) . 2σ2 2 cos α K

Also because hL∗ (0) = (hK∗ (α) + hK∗ (−α))/(2 cos α), Le Cam’s inequality gives  2  r   1 hK∗ (α) + hK∗ (−α) nα hK∗ (α) + hK∗ (−α) r ≥ − h ∗ (0) 1 − − h ∗ (0) 4 2 cos α K 4σ2 2 cos α K (B.13) for every 0 < α < π/4 where

 2 2 ˜  ˜  r := inf max EK∗ h − hK∗ (θi) , EL∗ h − hL∗ (θi) (B.14) h˜ APPENDIX B. PROOFS FOR CHAPTER4 132

where the infimum above is over all estimators h˜. Our strategy now is to choose an appro- 2 priate α∗ ∈ (0, π/4) in order to prove that r ≥ cσ /(k∗ + 1) for some positive constant c. Let us now define α∗ by   hK∗ (α) + hK∗ (−α) σ α∗ := inf 0 < α < π/4 : − hK∗ (0) > √ . 2 cos α nα

Note first that α∗ > 0 because nα ≥ 1 for all α and thus for α very small while the quantity (hK∗ (α) + hK∗ (−α))/(2 cos α) − hK∗ (0) becomes close to 0 for small α (by continuity of hK∗ (·)). Also because we have assumed (4.51), it follows that 0 < α∗ < π/4. Now for each  > 0 sufficiently small, we have

hK∗ (α∗ − ) + hK∗ (−α∗ + ) σ − hK∗ (0) ≤ √ . 2 cos(α∗ − ) nα∗−

∗ Letting  ↓ 0 in the above and using the fact that nα∗− → nα∗ and the continuity of hK , we deduce hK∗ (α∗) + hK∗ (−α∗) σ − hK∗ (0) ≤ √ . (B.15) 2 cos α∗ nα∗

Because 0 < α∗ < π/4, by the definition of the infimum, there exists a decreasing sequence {αk} ∈ (0, π/4) converging to α∗ such that

hK∗ (αk) + hK∗ (−αk) σ − hK∗ (0) > √ for all k. 2 cos αk nαk

For k large, nαk is either nα∗ or nα∗ + 2, and hence letting k → ∞, we get

hK∗ (α∗) + hK∗ (−α∗) σ 1 σ − hK∗ (0) ≥ √ ≥ √ √ , 2 cos α∗ nα∗ + 2 3 nα∗

where we also used that nα∗ ≥ 1. Combining the above with (B.15), we conclude that

1 σ hK∗ (α∗) + hK∗ (−α∗) σ √ √ ≤ − hK∗ (0) ≤ √ . 3 nα∗ 2 cos α∗ nα∗

Using α = α∗ in (B.13), we get σ2 r ≥ . (B.16) 24nα∗ We shall now show that 8(k + 1)π α ≤ α˜ := ∗ (B.17) ∗ n

when 8(k∗ + 1)π/n ≤ π/4 (otherwise (B.17) is obvious). This would imply, because α 7→ nα is non-decreasing, that nα˜ n ≤ n = − 1 = 8k + 7. α∗ α˜ π ∗ APPENDIX B. PROOFS FOR CHAPTER4 133

This and (B.16) would give σ2 cσ2 r ≥ ≥ 24(8k∗ + 7) k∗ + 1 for a positive constant c. This would prove the theorem when assumption (4.51) is true. To prove (B.17), we only need to show that

hK∗ (˜α) + hK∗ (−α˜) σ σ − hK∗ (0) > √ = √ . (B.18) 2 cosα ˜ nα˜ 8k∗ + 7

We verify this via Lemma B.1.5 on a case-by-case basis. When k∗ = 0, we haveα ˜ = 8π/n so that, by Lemma B.1.5, the left hand side above is bounded from below by ∆2. Because k∗ is zero, by definition of k∗, we have 2σ ∆2 + √ ≥ ∆0 + 2σ = 2σ. 3 √ √ √ This gives ∆2 ≥ 2σ(1−(1/ 3)) which can be verified to be larger than σ/ 8k∗ + 7 = σ/ 7. When k∗ = 1, we haveα ˜ = 16π/n so that, by Lemma B.1.5, the left hand side in (B.18) is bounded from below by ∆4. Because k∗ = 1, by definition of k∗, we have 2σ 2σ 2σ ∆4 + √ ≥ ∆1 + √ ≥ √ 5 2 2 √ √ √ which√ gives ∆4 ≥ 2σ((1/ 2)−(1/ 5)). This can be verified to be larger than σ/ 8k∗ + 7 = σ/ 15. When k∗ ≥ 2, we again use Lemma B.1.5 to argue that the left hand side in (B.18) is bounded from below by ∆2(k∗+1). Because ∆k is increasing in k (Lemma B.1.2), we have

∆2(k∗+1) ≥ ∆2k∗ . By the definition of k∗ (and the fact that ∆k∗ ≥ 0), we have r ! 2σ k∗ + 1 ∆2k∗ ≥ 1 − . k∗ + 1 2k∗ + 1

Because k∗ ≥ 2, it can be easily checked that (k∗ +1)/(2k∗ +1) ≤ 3/5 and (8k∗ +7)/(k∗ +1) ≥ 23/3. These, together with the fact that 2(1 − p3/5)p23/3 > 1, imply (B.18). This completes the proof of the theorem when assumption (4.51) holds. We now deal with the simpler case when (4.51) is violated. When (4.51) is violated, we first show that 12n k∗ > √ − 1. (B.19) 16(1 + 2 3)2 To see this, note first that, because (4.51) is violated, we have

−1/2 hK∗ (α) + hK∗ (−α) σ nα  − hK∗ (0) ≤ √ ≤ σ − 1 2 cos α nα π APPENDIX B. PROOFS FOR CHAPTER4 134

for all α ∈ (0, π/4]. Lemma B.1.5 implies that for every 1 ≤ k ≤ n/16, we get

hK∗ (4kπ/n) + hK∗ (−4kπ/n) σ σ ∆k ≤ − hK∗ (0) ≤ √ ≤ √ . (B.20) 2 cos 4kπ/n 4k − 1 3k Now for every 12n k ≤ √ − 1, (B.21) 16(1 + 2 3)2 we have 2σ 2σ σ 2σ 2σ ∆k + √ ≥ √ ≥ + > ∆n/16 + . k + 1 k + 1 p3n/16 pn/16 pn/16 + 1

−1/2 It follows therefore that any k satisfying (B.21) cannot be a minimizer of ∆k +2σ(k +1) , thereby implying (B.19). Let L∗ be defined as the Minkowski sum of K∗ and the closed ball with center 0 and radius σ(3n/2)−1/2. In other words, L∗ := x + σ(3n/2)−1/2y : x ∈ K and ||y|| ≤ 1 . The support function L∗ can be checked to equal:

−1/2 hL∗ (θ) = hK∗ (θ) + σ(3n/2) . (B.22)

Le Cam’s bound again gives

1 2 r ≥ (h ∗ (0) − h ∗ (0)) {1 − ||P ∗ − P ∗ || } (B.23) 4 K L K L TV where r is as defined in (B.14). By use of Pinsker’s inequality, we have v s u n 2 1 uX 2 1 nσ 1 ||PK∗ − PL∗ ||TV ≤ t (hK (2iπ/n − π) − h ˘ (2iπ/n − π)) = ≤ . 2σ K 2σ 3n/2 2 i=1

Therefore, from (B.23) and (B.19), we get that

σ2 1 σ2 r ≥ ≥ √ . 12n 16(1 + 2 3)2 k∗ + 1 This completes the proof of Theorem 4.3.2.

B.1.3 Proof of Inequality (4.21) in Remark4.3.2 Fix i ∈ {1, . . . , n} and a compact, convex set K∗. Let L∗ be defined as in the proof of Theorem 4.3.2. We want to show that   2 ˆ 2 ˆ 2 σ max EK∗ (hi − hK∗ (θi)) , EL∗ (hi − hL∗ (θi)) ≤ C. (B.24) k∗(i) + 1 APPENDIX B. PROOFS FOR CHAPTER4 135

ˆ for a universal constant C where hi denotes our estimator defined in (4.12). We have already proved in Theorem 4.3.1 that

2 2 ˆ  σ EK∗ hi − hK∗ (θi) ≤ C. . (B.25) k∗(i) + 1 It can similarly be proved that

 2 σ2 ∗ ˆ ∗ EL hi − hL (θi) ≤ C. L∗ (B.26) k∗ (i) + 1

L∗ ∗ ∗ where k∗ denotes the quantity k∗(i) with K replaced by L . More precisely   L∗ L∗ 2σ k∗ (i) := argmin ∆k (θi) + √ k∈I k + 1

L∗ ∗ ∗ where ∆k (θi) is defined as in (4.40) with K replaced by L . Inequalities (B.25) and (B.26) together imply that the left hand side of (B.24) is bounded from above by   2 1 1 Cσ max , L∗ . (B.27) k∗(i) + 1 k∗ (i) + 1 We show below how to establish (B.24) from the above bound. As in the proof of Theorem 4.3.2, we shall work with two separate cases. In the first case, we suppose that the condition (4.51) in the proof of Theorem 4.3.2 holds. In this case, recall from the proof of Theorem 4.3.2 that the set L∗ is defined as the convex ∗ hull of K ∪ {aK∗ (α)} where aK∗ (α) is defined as in (4.52). We show below then that

L∗ ∆k (θi) ≤ ∆k(θi) for every k ∈ I (B.28)

L∗ This would immediately imply that k∗(i) ≤ k∗ (i). The inequality (B.24) would then follow from the bound (B.27). In order to prove (B.28), we first recall from (B.12) the expression for the support function ∗ of L i.e., hL∗ (θ) equals the right hand side of (B.12) when −α ≤ θ ≤ α and it equals hK∗ (θ) K∗ for every other θ ∈ (−π, π]. For notational convenience, let us denote by δj (θi), the quantity inside the summation in (4.40) i.e.,

K∗ cos(4jπ/n) δ (θ ) = h ∗ (θ ± 4jπ/n) − h ∗ (θ ± 2jπ/n) (B.29) j i K i cos(2jπ/n) K i

Pk K∗ where hK∗ (θi±φ) has the same meaning as in (4.40). This means that ∆k(θi) = j=0 δj (θi)/(k+ L∗ ∗ ∗ L∗ Pk L∗ 1). We similarly define δj (θi) with L replacing K in (B.29) so that ∆k (θi) = j=0 δj (θi)/(k+ 1). APPENDIX B. PROOFS FOR CHAPTER4 136

We now verify (B.28) as follows. From the formula (B.12), it is easy to observe that L∗ δj (θi) equals zero whenever 4jπ/n ≤ α or, equivalently, j ≤ nα/(4π). We therefore have

k k ∗ 1 X ∗ 1 X ∗ ∆L (θ ) = δL (θ ) = δL (θ )I{j > nα/(4π)}. (B.30) k i k + 1 j i k + 1 j i j=0 j=0 where I{·} denotes the indicator function. When j > nα/(4π), again from the form of the support function of hL∗ described in (B.12), it follows that hL∗ (4jπ/n) = hK∗ (4jπ/n). On ∗ ∗ the other hand, hL∗ (θ) ≥ hK∗ (θ) for all θ simply because K ⊆ L . We thus have

L∗ K∗ δj (θi) ≤ δj (θi) for j > nα/(4π).

K∗ Because δj (θi) is always nonnegative, inequality (B.28) is now immediate from this and (B.30). We now turn to the case when the condition (4.51) in the proof of Theorem 4.3.2 does not hold. Observe that in this case, we proved in (B.19) and (B.20) respectively that 12n σ k∗(i) > √ − 1 and ∆k ≤ √ (B.31) 16(1 + 2 3)2 3k for every k ∈ I. It may also be recalled that L∗ in this case was chosen to be such that its support function satisfies the identity given in (B.22). As a result of this, it is easily seen that  −1/2 k   ∗ 3n 1 X cos(4jπ/n) ∆L (θ ) = ∆ (θ ) + σ 1 − k i k i 2 k + 1 cos(2jπ/n) j=0 for every k. Now following the calculations in Example 4.4.2 (immediately after inequality (4.44)), we deduce that √ 2 L∗ 8 2σπ 2 −5/2 ∆ (θi) ≤ ∆k(θi) + √ k n . k 3 The second inequality in (B.31) now allows us to deduce that √ 2 L∗ σ 8 2σπ 2 −5/2 −1/2 2 −5/2 ∆ (θi) ≤ √ + √ k n ≤ c1σ k + k n k 3k 3

for a universal constant c1. Thus if k ∈ I is such that k ≥ c2n for a positive constant c2, we have ∗ c σ   ∆L (θ ) ≤ √1 1 + c−1/2 k i n 2 and consequently

L∗ 2σ σ h −1/2 −1/2i ∆ (θi) + √ ≤ √ c1(1 + c ) + 2c (B.32) k k + 1 n 2 2 APPENDIX B. PROOFS FOR CHAPTER4 137

for every k ∈ I, k ≥ c2n. On the other hand for k ≤ c3n, we have

L∗ 2σ 2σ 2σ 2σ ∆k (θi) + √ ≥ √ ≥ √ ≥ p . (B.33) k + 1 k + 1 c3n + 1 n(c3 + 1)

From (B.32) and (B.33), it is easy to see that if c3 and c2 are suitably chosen, then no k for which k ≤ c3n can minimize the left hand side of (B.33). This implies therefore that L∗ k∗ (i) ≥ cn for a positive constant c. On the other hand, the first inequality in (B.31) implies that k∗(i) is also at least cn for a positive constant c. This allows to deduce that (B.27) is 2 bounded from above by a constant multiple of σ /(k∗(i) + 1). This completes the proof of inequality (4.21) in Remark 4.3.2.

B.1.4 Proofs of Corollaries and Proposition 4.3.1 in Section 4.3.1 The proofs of the corollaries stated in Section 4.3.1 are given here. For these proofs, we need some simple properties of the ∆k(θi) which are stated and proved in Appendix B.1. We start with the proof of Corollary 4.3.3. ˘ ˜ Proof of Corollary 4.3.3. Fix 1 ≤ i ≤ n. We will prove that k(i) ≤ k∗(i) ≤ k(i). Inequality (4.31) would then follow from Theorem 4.3.1. For simplicity, we write ∆k for ∆k(θi), fk for ˘ ˘ ˜ ˜ fk(θi), gk for gk(θi), k∗ for k∗(i), k for k(i) and k for k(i). Inequality (B.45) in Lemma B.1.3 gives √ σ( 6 − 2) ∆k ≥ √ for all k > k∗, k ∈ I. k + 1 √ √ Thus any k ∈ I for which fk ≤ ∆k < σ( 6 − 2)/ k + 1 has to satisfy k ≤ k∗. This proves ˘ k ≤ k∗. √ √ ˜ For k∗ ≤ k, we first inequality (B.44) in Lemma B.1.3 to obtain ∆k∗ ≥ 6( 2−1)σ/ k∗ + 1. Also Lemma B.1.2 states that k 7→ ∆k is non-decreasing for k ∈ I. We therefore have √ √ 6( 2 − 1)σ 6( 2 − 1)σ gk ≤ ∆k ≤ ∆k∗ ≤ √ ≤ √ for all k ≤ k∗, k ∈ I. k∗ + 1 k + 1 √ √ Therefore any k ∈ I for which gk > 6( 2 − 1)σ/ k + 1 has to be larger than k∗. This ˜ proves k ≥ k∗. The proof is complete. We next give the proof of Corollary 4.3.1. Proof of Corollary 4.3.1. We only need to prove (4.22). Inequality (4.23) would then follow from Theorem 4.3.1. Fix i ∈ {1, . . . , n} and suppose that K∗ is contained in a ball of radius R centered at (x1, x2). We shall prove below that ∆k(θi) ≤ 6πRk/n for every k ∈ I and (4.22) would then follow from Corollary 4.3.3. Without loss of generality, assume that θi = 0. APPENDIX B. PROOFS FOR CHAPTER4 138

As in the proof of Theorem 4.3.5, we may assume that K∗ is contained in the ball of radius R centered at the origin. This implies that |hK∗ (θ)| ≤ R for all θ and also that hK∗ is Lipschitz with constant R. Note then that for every k ∈ I and 0 ≤ j ≤ k, the quantity

h ∗ (4jπ/n) + h ∗ (−4jπ/n) cos(4jπ/n) h ∗ (2jπ/n) + h ∗ (−2jπ/n) Q := K K − K K 2 cos(2jπ/n) 2

can be bounded as

hK∗ (4jπ/n) − hK∗ (2jπ/n) + hK∗ (−4jπ/n) − hK∗ (−2jπ/n) |Q| = 2   cos(4jπ/n) − cos(2jπ/n) hK∗ (2jπ/n) + hK∗ (−2jπ/n) 6Rjπ − ≤ . cos(2jπ/n) 2 n

Here we used also the fact that cos(·) is Lipschitz and cos(2jπ/n) ≥ 1/2. The inequality ∆k(0) ≤ 6πRk/n then immediately follows. The proof is complete. Proof of Proposition 4.3.1. Inequality (4.24) is clearly a direct consequence of (4.23). We therefore only prove (4.25) below. We assume without loss of generality that n is even, i = n/2 and that θi = 0. Also assume that K(R) contains of all compact, convex sets that are contained in the ball of radius R centered at the origin. Take K∗ to be the vertical line segment joining the two points (0,R) and (0, −R) for a fixed R > 0 (as in Example 4.4.3). Further let L∗ be as in the proof of Theorem 4.3.2. It is then easy to check that L∗ ∈ K(R) and thus the minimax risk in the left hand side of (4.25) is bounded from below by

 ˜ 2 ˜ 2 inf max EK∗ (h − hK∗ (θi)) , EL∗ (h − hL∗ (θi)) h˜ Inequality (4.20) then gives that

2 2 ˜  cσ inf sup EK∗ h − hK∗ (θi) ≥ . h˜ K∗∈K(R) k∗(i) + 1

We now use inequality (4.46) which proves that the right hand side above is bounded from above by a constant multiple of (σ2/n)+(σ2R/n)2/3. This completes the proof of Proposition 4.3.1. We conclude this section with a proof of Corollary 4.3.2. Proof of Corollary 4.3.2. By Theorem 4.3.1, inequality (4.28) is a direct consequence of (4.27). We therefore only need to prove (4.27). Fix k ∈ I with n k ≤ min(θ − φ (i), φ (i) − θ ). (B.34) 4π i 1 2 i APPENDIX B. PROOFS FOR CHAPTER4 139

It is then clear that θi ± 4jπ/n ∈ [φ1(i), φ2(i)] for every 0 ≤ j ≤ k. From (4.26), it follows that 4jπ h ∗ (θ) = x cos θ + x sin θ for all θ = θ ± , 0 ≤ j ≤ k. K 1 2 i n

We now argue that ∆k(θi) = 0. To see this, note first that ∆k(θi) = Uk(θi) − Lk(θi) has the following alternative expression (4.40). Plugging in hK∗ (θ) = x1 cos θ + x2 sin θ in (4.40), one can see by direct computation that ∆k(θi) = 0 for every k ∈ I satisfying (B.34). The definition (4.18) of k∗(i) now immediately implies that  n  k (i) ≥ min min(θ − φ (i), φ (i) − θ ), cn ∗ 4π i 1 2 i for a small enough universal constant c. This proves (4.27) thereby completing the proof.

B.1.5 Completion of the proof of Theorem 4.3.3 We complete the proof of Theorem 4.3.3 starting from where we left off in the main text. The goal is to prove inequality (4.56). The argument below is inspired by an argument due to Zhang [159, Proof of Theorem 2.1] in a very different context. j j Recall that k∗(i) takes values in I := {0} ∪ {2 : j ≥ 0, 2 ≤ bn/16c}. For k ∈ I, let

n n X X ρ(k) := I{k∗(i) = k} and `(k) := I{k∗(i) < k} i=1 i=1 Note that `(0) = 0, `(1) = ρ(0) and ρ(k) = `(2k) − `(k) for k ≥ 1, k ∈ I. As a result

n X 1 X ρ(k) X `(2k) − `(k) = = `(1) + . k∗(i) + 1 k + 1 k + 1 i=1 k∈I k≥1,k∈I

Let K denote the maximum element of I. Because `(2K) = n, we can write

n X 1 n `(1) X k`(k) = + + . k∗(i) + 1 K + 1 2 (k + 1)(k + 2) i=1 k≥2,k∈I

Using n/(K + 1) ≤ C and loose bounds for the other terms above, we obtain

n X 1 X 3`(k) ≤ C + . (B.35) k∗(i) + 1 k i=1 k≥1,k∈I

We shall show below that  ARk5/2  `(k) ≤ min n, for all k ∈ I (B.36) σn APPENDIX B. PROOFS FOR CHAPTER4 140

for a universal positive constant A. Before that, let us first prove (4.56) assuming (B.36). Assuming (B.36), we can write

( 2/5) ( 2/5) X `(k) X `(k) σn2  X `(k) σn2  = I k ≤ + I k > (B.37) k k AR k AR k≥1,k∈I k≥1,k∈I k≥1,k∈I

In the first term on the right hand side above, we use the bound `(k) ≤ ARk5/2/(σn). We then get

( 2/5) ( 2/5) X `(k) σn2  AR X σn2  I k ≤ ≤ k3/2I k ≤ . k AR σn AR k≥1,k∈I k≥1,k∈I

Because I consists of integers of the form 2j, the sum in the right hand side above is bounded from above by a constant multiple of the last term. This gives

( 2/5) 3/5 √ 2/5 X `(k) σn2  CR σn2  R n I k ≤ ≤ = C (B.38) k AR σn AR σ k≥1,k∈I

For the second term on the right hand side in (B.37), we use the bound `(k) ≤ n which gives

( 2/5) ( 2/5) X `(k) σn2  X σn2  I k > ≤ n k−1I k > k AR AR k≥1,k∈I k≥1,k∈I

Again, because I consists of integers of the form 2j, the sum in the right hand side above is bounded from above by a constant multiple of the first term. This gives

( 2/5) −2/5 √ 2/5 X `(k) σn2  σn2  R n I k > ≤ Cn = C . (B.39) k AR AR σ k≥1,k∈I

Inequalities (B.38) and (B.39) in conjunction with (B.35) proves (4.56) which would complete the proof of (4.32). We only need to prove (B.36). For this, observe first that when k∗(i) < k, Corollary 4.3.3 gives that √ ( 6 − 2)σ ∆k(θi) ≥ √ . (B.40) k + 1 ˘ This is because if (B.40) is violated, then Corollary 4.3.3 gives k ≤ k(i) ≤ k∗(i). Conse- quently, we have √ ∆k(θi) k + 1 I{k∗(i) < k} ≤ √ ( 6 − 2)σ APPENDIX B. PROOFS FOR CHAPTER4 141

and √ n k + 1 X `(k) ≤ √ ∆k(θi) for every k ∈ I. (B.41) ( 6 − 2)σ i=1 ∗ We will now prove an upper bound for ∆k(θi), 1 ≤ i ≤ n under the assumption that K is contained in a ball of radius R ≥ 0. We may assume without loss of generality that this ball is centered at the origin because the expression for ∆k(θi) given in (4.40) remains unchanged 2 if hK∗ (θ) is replaced by hK∗ (θ) − a1 cos θ − a2 sin θ for any (a1, a2) ∈ R . Now using the expression (4.40) for ∆k(θi), it is easy to see that

n k X 1 X ∆ (θ ) = δ (B.42) k i k + 1 j i=1 j=0

where δj is given by n   X hK∗ (θi+2j) + hK∗ (θi−2j) cos(4jπ/n) hK∗ (θi+j) + hK∗ (θi−j) δ = − . j 2 cos(2jπ/n) 2 i=1 with θk = 2πk/n − π. Because θ 7→ hK∗ (θ) is a periodic function of period 2π, the above expression for δj only depends on hK∗ (θ1), ..., hK∗ (θn). In fact, it is easy to see that

n  cos(4jπ/n) X δ = 1 − h ∗ (θ ). j cos(2jπ/n) K i i=1 Now because K∗ is contained in the ball of radius R centered at the origin, it follows that |hK∗ (θi)| ≤ R for each i which gives  cos(4jπ/n)  cos(4kπ/n) nR(1 + 2 cos 2πk/n) δ ≤ nR 1 − ≤ nR 1 − = (1−cos 2πk/n) j cos(2jπ/n) cos(2kπ/n) cos 2πk/n

for all 0 ≤ j ≤ k. Because k ≤ n/16 for all k ∈ I, it follows that

8Rπ2k2 δ ≤ 8nR sin2(πk/n) ≤ for all 0 ≤ j ≤ k. j n Pn 2 2 The identity (B.42) therefore gives i=1 ∆k(θi) ≤ 8Rπ k /n for all k ∈ I. Consequently, from (B.41) and the trivial fact that `(k) ≤ n, we obtain √  8π2 Rk2 k + 1 `(k) ≤ min n, √ for all k ∈ I. ( 6 − 2) σn

Note that `(0) = 0 so that the above inequality only gives something useful for k ≥ 1. Using k + 1 ≤ 2k for k ≥ 1 and denoting the resulting constant by C, we obtain (B.36). This completes the proof of Theorem 4.3.3. APPENDIX B. PROOFS FOR CHAPTER4 142

B.1.6 Technical Lemmas Our first task here is to provide the proof of Lemma 4.2.1. We also restate this result here for the convenience of the reader.

Lemma B.1.1. For every 0 < φ < π/2 and every θ ∈ (−π, π], we have l(θ, φ) ≤ hK∗ (θ) ≤ u(θ, φ).

Proof. The inequality hK∗ (θ) ≤ u(θ, φ) is obtained by using (4.1) with α1 = θ +φ, α2 = θ −φ and α = θ. For l(θ, φ) ≤ hK∗ (θ), we use (4.1) with α1 = θ + 2φ, α2 = θ and α = θ + φ to obtain hK∗ (θ) ≥ 2hK∗ (θ + φ) cos φ − hK∗ (θ + 2φ).

One similarly has hK∗ (θ) ≥ 2hK∗ (θ − φ) cos φ − hK∗ (θ − 2φ) and l(θ, φ) ≤ hK∗ (θ) is deduced by averaging these two inequalities. We next provide three lemmas which were used in the proofs of the main results of chapter4.

Lemma B.1.2. Recall the quantity ∆k(θi) defined in (4.40). The inequality ∆2k(θi) ≥ 1.5∆k(θi) holds for every 1 ≤ i ≤ n and 0 ≤ k ≤ n/16.

Proof. We may assume without loss of generality that θi = 0. We will simply write ∆k for ∆k(θi) below for notational convenience. Let us define, for θ ∈ R,

h ∗ (2θ) + h ∗ (−2θ) cos 2θ h ∗ (θ) + h ∗ (−θ) δ(θ) := K K − K K . 2 cos θ 2 Pk Note then that ∆k = j=0 δ(2jπ/n)/(k + 1). We shall first prove that

tan y  δ(y) ≥ δ(x) for every 0 < y ≤ π/4 and x < y ≤ 2x. (B.43) tan x

For this, first apply (4.1) to α1 = 2x, α2 = x and α = y to get sin(y − x) sin(2x − y) h ∗ (y) ≤ h ∗ (2x) + h ∗ (x). K sin x K sin x K

We then apply (4.1) to α1 = 2y, α2 = x and α = 2x to get (note that 2y − x ≤ 2y < π/2) sin(2y − x) sin(2y − 2x) h ∗ (2y) ≥ h ∗ (2x) − h ∗ (x). K sin x K sin x K Combining these two inequalities, we get (note that 2y ≤ π/2 which implies that cos 2y ≥ 0) cos 2y h ∗ (2y) − h ∗ (y) ≥ αh ∗ (2x) − βh ∗ (x), K cos y K K K APPENDIX B. PROOFS FOR CHAPTER4 143

where sin(2y − x) cos 2y sin(y − x) α := − sin x cos y sin x and sin(2y − 2x) cos 2y sin(2x − y) β := + . sin x cos y sin x It can be checked by a straightforward calculation that tan y tan y cos 2x α = and β = . tan x tan x cos x It follows therefore that cos 2y tan y  cos 2x  h ∗ (2y) − h ∗ (y) ≥ h ∗ (2x) − h ∗ (x) . K cos y K tan x K cos x K We similarly obtain cos 2y tan y  cos 2x  h ∗ (−2y) − h ∗ (−y) ≥ h ∗ (−2x) − h ∗ (−x) . K cos y K tan x K cos x K The required inequality (B.43) now results by adding the above two inequalities. A trivial consequence of (B.43) is that δ(y) ≥ δ(x) for 0 < y ≤ π/4 and x < y ≤ 2x. Further, applying (B.43) to y = 2x (assuming that 0 < x < π/8), we obtain δ(2x) ≥ 2δ(x). Note that tan 2x = 2 tan x/(1 − tan2 x) ≥ 2 tan x for 0 < x < π/8. To prove ∆2k ≥ (1.5)∆k, we fix 1 ≤ k ≤ n/16 (note that the inequality is trivial when k = 0) and note that

2k k 1 X 2jπ  1 X  2(2j − 1)π  4jπ  ∆ = δ = δ + δ 2k 2k + 1 n 2k + 1 n n j=0 j=1 where we used the fact that δ(0) = 0. Using the bounds proved for δ(θ), we have 2(2j − 1)π  2jπ  4jπ  2jπ  δ ≥ δ and δ ≥ 2δ . n n n n Therefore k k 3 X 2jπ  3 X 2jπ  3 ∆ ≥ δ ≥ δ = ∆ 2k 2k + 1 n 2(k + 1) n 2 k j=1 j=0 and this completes the proof.

Lemma B.1.3. Fix i ∈ {1, . . . , n}. Consider ∆k(θi) (defined in (4.40)) and k∗(i) (defined in (4.18)). We then have the following inequalities √ 6( 2 − 1)σ ∆ (θ ) ≤ . (B.44) k∗(i) i p k∗(i) + 1 APPENDIX B. PROOFS FOR CHAPTER4 144

and √ √ √ ! ( 6 − 2)σ ( 6 − 2) k + 1σ ∆k(θi) ≥ max √ , (B.45) k + 1 2(k∗ + 1) for all k > k∗(i), k ∈ I.

Proof. Fix i ∈ {1, . . . , n}. Below we simply denote k∗(i) and ∆k(θi) by k∗ and ∆k respectively for notational convenience. We first prove (B.44). If k∗ ≥ 2, we have 2σ √ 2σ √ 2σ ∆k∗ + √ ≤ ∆k∗/2 + 2√ ≤ ∆k∗/2 + 2√ . k∗ + 1 k∗ + 2 k∗ + 1

Using Lemma B.1.2 (note that k∗ ∈ I and hence k∗ ≤ n/16), we have ∆k∗/2 ≤ (2/3)∆k∗ . We therefore have 2σ 2 √ 2σ ∆k∗ + √ ≤ ∆k∗ + 2√ k∗ + 1 3 k∗ + 1 which√ proves (B.44). Inequality (B.44) is trivial when k∗ = 0. Finally, for k∗ = 1, we have ∆1 + 2σ ≤ ∆0 + 2σ = 2σ which again implies (B.44). 0 We now turn to (B.45). Let k denote the smallest k ∈ I for which k > k∗. We start by proving the first part of (B.45): √ ( 6 − 2)σ ∆k ≥ √ for k > k∗, k ∈ I. (B.46) k + 1

0 0 Note first that if (B.46) holds√ for k = k , then√ it holds for all k ≥ k as well because ∆k ≥ ∆k0 (from Lemma B.1.2) and 1/ k + 1 ≤ 1/ k0 + 1. We therefore only need to verify (B.46) 0 0 for k = k . If k∗ = 0, then k = 1 and because 2σ ∆1 + √ ≥ ∆0 + 2σ = 2σ, 2 √ 0 we obtain ∆1 ≥ (2 − 2)σ. This implies (B.46). On the other hand, if k∗ > 0, then k = 2k∗ and we can write 2σ 2σ 2σ ∆2k∗ + √ ≥ ∆k∗ + √ ≥ √ . 2k∗ + 1 k∗ + 1 k∗ + 1 This gives r ! 2σ 2k∗ + 1 ∆2k∗ ≥ √ − 1 2k∗ + 1 k∗ + 1

which implies inequality (B.46) for k = 2k∗ because (2k∗ + 1)/(k∗ + 1) ≥ 3/2. The proof of (B.46) is complete. APPENDIX B. PROOFS FOR CHAPTER4 145

√ For the second part of (B.45), we use Lemma B.1.2 which states ∆2k ≥ (1.5)∆k ≥ 2∆k for all k ∈ I. By a repeated application of this inequality, we get r r k k + 1 0 ∆ ≥ ∆ 0 ≥ ∆ 0 for all k ≥ k . k k0 k k0 + 1 k Using (B.46) for k = k0, we get √ √ ( 6 − 2)σ k + 1 ∆ ≥ . k k0 + 1

0 The proof of (B.45) is now completed by observing that k ≤ 2k∗ + 1. Lemma B.1.4. Fix i ∈ {1, . . . , n}. For every 0 ≤ k ≤ n/8, the variance of the random ˆ 2 variable Uk(θi) (defined in (4.10)) is at most σ /(k + 1). Also, for every 0 ≤ k ≤ n/16, the ˆ 2 variance of the random variable ∆k(θi) (defined in (4.11)) is at most σ /(k + 1). ˆ Proof. Fix 1 ≤ i ≤ n. We shall first prove the bound for the variance of Uk(θi) for a fixed 0 ≤ k ≤ n/8. Note that k 1 X Yi+j + Yi−j Uˆ (θ ) = . k i k + 1 2 cos(2jπ/n) j=0 It is therefore straightforward to see that

k ! σ2 1 X var(Uˆ (θ )) = 1 + sec2(2jπ/n) . k i (k + 1)2 2 j=1 √ For 1 ≤ j ≤ k ≤ n/8, we have sec(2jπ/n) ≤ 2 because 2jπ/n ≤ π/4. The inequality ˆ 2 var(Uk(θi)) ≤ σ /(k + 1) then immediately follows. ˆ Let us now turn to the variance of ∆k(θi). When k = 0, the conclusion is obvious since ˆ ˆ ∆k(θi) = 0. Otherwise, the expression (4.11) for ∆k(θi) can be rewritten as ˆ ∆k(θi) = S1 + S2 + S3

where k −1 X cos(4jπ/n) Yi+j + Yi−j S = {j is odd} , 1 k + 1 cos(2jπ/n) 2 j=1 k   1 X cos(4jπ/n) Yi+j + Yi−j S = {j is even} 1 − , 2 k + 1 cos(2jπ/n) 2 j=1 and 2k 1 X Yj + Y−j S = {j is even} . 3 k + 1 2 j=k+1 APPENDIX B. PROOFS FOR CHAPTER4 146

S1,S2 and S3 are clearly independent. Moreover, the different terms in each Si are also independent. Thus

k σ2 X cos2(4jπ/n) var(S ) = {j is odd} , 1 2(k + 1)2 cos2(2jπ/n) j=1

k 2 σ2 X  cos(4jπ/n) var(S ) = {j is even} 1 − , 2 2(k + 1)2 cos(2jπ/n) j=1 and 2k σ2 X σ2 var(S ) = {j is even} ≤ . 3 2(k + 1)2 2(k + 1) j=k+1 Now for k ≤ n/16 and 1 ≤ j ≤ k, cos(4jπ/n) 0 ≤ ≤ 1 cos(2jπ/n)

2 ˆ 2 which implies that var(S1) + var(S2) ≤ σ /2(k + 1). Thus var(∆k(θi)) ≤ σ /(k + 1). The next lemma was used in the proof of Theorem 4.3.2.

Lemma B.1.5. Let ∆k be the quantity (4.40) with θi = 0 i.e., k   1 X hK∗ (4jπ/n) + hK∗ (−4jπ/n) cos(4jπ/n) hK∗ (2jπ/n) + hK∗ (−2jπ/n) ∆ := − . k k + 1 2 cos(2jπ/n) 2 j=0 Then the following inequality holds for every k ≤ n/16:

hK∗ (4kπ/n) + hK∗ (−4kπ/n) ∆ ≤ − h ∗ (0). k 2 cos(4kπ/n) K Proof. From Lemma B.1.2, it follows that δ(2iπ/n) ≤ δ(2kπ/n) for all 1 ≤ i ≤ k (this follows by reapplying Lemma B.1.2 to 2iπ/n, 4iπ/n, . . . until we hit 2kπ/n). As a consequence, we have ∆k ≤ δ(2kπ/n). Now, if θ = 2kπ/n then θ ≤ π/8 and we can write

h ∗ (2θ) + h ∗ (−2θ) cos 2θ h ∗ (θ) + h ∗ (−θ) δ(θ) = K K − K K 2 cos θ 2     hK∗ (2θ) + hK∗ (−2θ) hK∗ (θ) + hK∗ (−θ) = cos 2θ − h ∗ (0) − cos 2θ − h ∗ (0) . 2 cos 2θ K 2 cos θ K

Because hK∗ (θ) + hK∗ (−θ) ≥ 2hK∗ (0) cos θ and cos 2θ ≥ 0, we have   hK∗ (2θ) + hK∗ (−2θ) hK∗ (2θ) + hK∗ (−2θ) δ(θ) ≤ cos 2θ − h ∗ (0) ≤ − h ∗ (0). 2 cos 2θ K 2 cos 2θ K The proof is complete. APPENDIX B. PROOFS FOR CHAPTER4 147

Lemma B.1.6 (Approximation). There exists a universal positive constant C such that for every i = 1, . . . , n and every compact, convex set P , we have

2  2    σ 2 ∗ ∗ ˆ ∗ EK hi − hK (θi) ≤ C P + `H (K ,P ) . (B.47) k∗ (i) + 1 Proof. Fix i ∈ {1, . . . , n} and a compact, convex set P . For notational convenience, we write P P P P ∆k, ∆k , k∗ and k∗ for ∆k(θi), ∆k (θi), k∗(θi) and k∗ (θi) respectively. We assume that the following condition holds: √ P 24( 2 − 1) k + 1 ≥ √ (k∗ + 1). (B.48) ∗ 6 − 2 If this condition does not hold, we have √ 1 24( 2 − 1) 1 √ < P k∗ + 1 6 − 2 k∗ + 1 and then (B.1.6) immediately follows from Theorem 4.3.1. P Note that (B.48) implies, in particular, that k∗ > k∗. Inequality (B.45) in Lemma B.1.3 P applied to k = k∗ implies therefore that √ p P ( 6 − 2) k∗ + 1σ ∆ P ≥ . k∗ 2(k∗ + 1) Also inequality (B.44) applied to the set P instead of K∗ gives √ P 6( 2 − 1)σ ∆ P ≤ . k∗ p P k∗ + 1 Combining the above pair of inequalities, we obtain √ √ p P P ( 6 − 2) k∗ + 1σ 6( 2 − 1)σ ∆kP − ∆ P ≥ − . ∗ k∗ p P 2(k∗ + 1) k∗ + 1

P P The right hand above is non-decreasing in k∗ + 1 and so we can replace k∗ + 1 by the lower bound in (B.48) to obtain, after some simplication, q √ √ P σ ∆ P − ∆ P ≥ √ 24( 2 − 1)( 6 − 2). (B.49) k∗ k∗ 4 k∗ + 1 The key now is to observe that

P ∗ |∆k − ∆k | ≤ 2`H (K ,P ) for all k. (B.50) APPENDIX B. PROOFS FOR CHAPTER4 148

This follows from the definition (4.35) of the Hausdorff distance which gives

k ! P ∗ 1 X cos(4jπ/n) ∆k − ∆ ≤ `H (K ,P ) 1 + k k + 1 cos(2jπ/n) j=0 and this clearly implies (B.50) because cos(4jπ/n)/ cos(2jπ/n) ≤ 1 for all 0 ≤ j ≤ k. From (B.50) and (B.49), we deduce that

∗ cσ `H (K ,P ) ≥ √ k∗ + 1 for a universal positive constant c. This, together with inequality (4.17), clearly implies (B.47) which completes the proof.

B.2 Additional Simulation Results

We had presented simulation results only when K∗ is a ball and a segment in chapter4. Here we present additional simulation results when K∗ is a square, ellipsoid and random polytope.

B.2.1 Pointwise estimation Here, we present plots analogous to Figure 4.2 for three additional choices of K∗: 1. K∗ is the square formed by the four corner points: {(0, 0), (0, 1), (1, 0), (1, 1)} whose support function equals hK∗ (θ) = max{0, sin θ, cos θ, sin θ + cos θ}. This function is plotted in the first subplot of Figure B.1. We study pointwise estimation here for θi = 0, π/8 and π/4 (these points are indicated by the red dots in the first subplot). For each of these three values of θi, we calculated the mean squared error as a function of n which is plotted in Figure B.1.

∗ 2 2 2. K is the ellipsoid {(x, y): x /4+y /2 = 1} and θi = 0, π/4, π/2. The support function 2 2 1/2 equals hK∗ (θ) := 4 cos θ + 2 sin θ . This function is plotted in the first subplot of Figure B.2. We study pointwise estimation here for θi = 0, π/4 and π/2 (these points are indicated by the red dots in the first subplot). For each of these three values of θi, we calculated the mean squared error as a function of n which is plotted in Figure B.2. 3. For our final example, we consider a random polytope K∗ generated by sampling 10 points from the uniform distribution on the square [−2, 2] × [−2, 2] and taking their convex hull. The performance of the seven estimators is shown in the following plots. In the first subplot, the support function is drawn in black with points 0, π/8, π/4 marked as our choices for θi. Similarly as before, the last three subplots shows how the mean squared error changes with sample size n growing. APPENDIX B. PROOFS FOR CHAPTER4 149

support function for square square case θ = 0

● ● LSE

1.4 LAE ● LAE(projection)

0.15 LAE(infinite projection) ● FHTW−A

1.2 FHTW−B FHTW−C

● 1.0 0.10 2 0.8 ) i h ∗ K − i

h ● ^ h ( E 0.6 0.4 0.05 ● ●

● ● 0.2

● ● ● ● ●

● 0.0

−3 −2 −1 0 1 2 3 100 200 300 400 500

θ n

π π square case θ = square case θ = 8 4

● LSE ● LSE LAE 0.25 LAE LAE(projection) LAE(projection) LAE(infinite projection) LAE(infinite projection) ● ●

0.15 FHTW−A ● FHTW−A FHTW−B FHTW−B ● ● FHTW−C FHTW−C

0.20 ● ● ● ● ● ● ● ● ● 0.15 0.10 2 2 ) ) i i h h − − i i ^ ^ h h ( ( E E

● 0.10 0.05

● 0.05

● ●

● ● ● ● ● ● ● 0.00 0.00

100 200 300 400 500 100 200 300 400 500

n n

Figure B.1: Point estimation error when K∗ is a square

The story in all these plots is the same. The error decays in all cases as n grows. The per- formance of our estimators is similar to the LSE. The performance of the FHTW estimators is good when smoothness assumptions are met but otherwise they can be poor.

B.2.2 Set Estimation Here we present simulation results on set estimation for each of the three examples discussed above. The relevant plots are given in Figure B.4 (when K∗ is the square), Figure B.5 (when K∗ is the ellipsoid) and Figure B.6 (when K∗ is the random polytope). The conclusions are again same as before. Our estimators perform at the same level as the LSE. Even though, we propose two set estimators: LAE with projection and LAE with infinite projection, both of them look similar and have similar performance. The FHTW-B estimator seems to work well when K∗ can be well-approximated by an ellipsoid. APPENDIX B. PROOFS FOR CHAPTER4 150

support function for ellipsoid ellipsoid case θ = 0

● ● LSE 2.0

0.12 LAE LAE(projection) LAE(infinite projection) ● FHTW−A FHTW−B 1.9

0.10 FHTW−C

1.8 ● 0.08

● ●

● 2 ) i h ∗ ● ● K −

i ● ● h ^ h 1.7 ( 0.06 E 1.6 0.04 ●

1.5 ● 0.02

● ● ● 1.4

−3 −2 −1 0 1 2 3 100 200 300 400 500

θ n

π π ellipsoid case θ = ellipsoid case θ = 4 2

● ●

LSE 0.25 LSE LAE LAE LAE(projection) LAE(projection) LAE(infinite projection) LAE(infinite projection)

0.12 ● FHTW−A ● FHTW−A FHTW−B FHTW−B

FHTW−C 0.20 FHTW−C 0.10

● 0.15 0.08 2 2 ) ) i i h h − − i i ^ ^ h h ( ( E E 0.06

● 0.10

● 0.04 ● ● ● ● ● ● ● ● 0.05 0.02 ●

● ● ● ● ● ● ● ● ● ● ● 0.00 0.00 100 200 300 400 500 100 200 300 400 500

n n

Figure B.2: Point estimation error when K∗ is an ellipsoid

support function for random polytope random polytope case θ = 0

● LSE 2.5 LAE LAE(projection) LAE(infinite projection) 0.30 ● FHTW−A FHTW−B FHTW−C 0.25 2.0 0.20

● 2 ) i h ∗ K − i h ^ h ● ( E 0.15 1.5

● 0.10

● 0.05 1.0 ● ● ● ● ● ● ● ● ● ● 0.00

−3 −2 −1 0 1 2 3 100 200 300 400 500

θ n

π π random polytope case θ = random polytope case θ = 8 4

● LSE 0.12 ● LSE LAE LAE LAE(projection) LAE(projection) 0.20 LAE(infinite projection) LAE(infinite projection) ● FHTW−A ● FHTW−A FHTW−B 0.10 FHTW−B FHTW−C FHTW−C

● 0.15 0.08 2 2 ) ) i i h h − − i i 0.06 ^ ^ h h ( ( E E 0.10 ● ● ● ● ● ● ● ● 0.04 0.05 ● ● ● 0.02

● ● ● ● ● ● ● ● ● ● ● ● 0.00 0.00 100 200 300 400 500 100 200 300 400 500

n n

Figure B.3: Point estimation error when K∗ is a random polytope APPENDIX B. PROOFS FOR CHAPTER4 151

square case LSE LAE projection

● LSE LAE(projection) LAE(infinite projection)

0.08 FHTW−B

● 0.06 ) ∗ K , ^ K ( f L E 0.04

0.02 ●

100 200 300 400 500

n

square case LAE infinite projection FHTW−B

● LSE LAE(projection) LAE(infinite projection)

0.5 FHTW−B 0.4 ● ) ∗ K , ^ K 0.3 ( L E 0.2 ●

● 0.1

● ●

100 200 300 400 500

n

Figure B.4: Set estimation when K∗ is a square APPENDIX B. PROOFS FOR CHAPTER4 152

ellipsoid case LSE LAE projection

● LSE

0.15 LAE(projection) LAE(infinite projection) FHTW−B 0.10 )

∗ ● K , ^ K ( f L E 0.05

● ● ● 0.00 100 200 300 400 500

n

ellipsoid case LAE infinite projection FHTW−B

● 1.0 LSE LAE(projection) LAE(infinite projection) FHTW−B 0.8 0.6 ) ∗ K , ^

K ● ( L E 0.4

● 0.2

● ● ● 0.0 100 200 300 400 500

n

Figure B.5: Set estimation when K∗ is an ellipsoid APPENDIX B. PROOFS FOR CHAPTER4 153

random polytope case LSE LAE projection

0.20 LSE LAE(projection) LAE(infinite projection) FHTW−B 0.15 ) ∗ K , ^ K ( f 0.10

L ● E

0.05 ●

● ● ● 0.00 100 200 300 400 500

n

random polytope case LAE infinite projection FHTW−B

● LSE LAE(projection) 1.2 LAE(infinite projection) FHTW−B 1.0 0.8 ) ∗ K , ^ K ( L 0.6 E ● 0.4

● 0.2 ●

● ● ● 0.0

100 200 300 400 500

n

Figure B.6: Set estimation when K∗ is a random polytope 154

Appendix C

Proofs for Chapter5

C.1 Proof of Lemma1

Recalling that K† denotes the pseudoinverse of K, our proof is based on the linear transfor- mation √ z : = n−1/2(K†)1/2θ ⇐⇒ θ = nK1/2z. √ √ as well as the new function Jn(z) : = Ln( n Kz) and its population equivalent J (z) : = EJn(z). Ordinary gradient descent on Jn with stepsize α takes the form √ √ t+1 t t t √ √ t z = z − α∇Jn(z ) = z − α n K∇Ln( n Kz ). (C.1)

If we√ transform√ this update on z back to an equivalent one on θ by multiplying both sides by n K, we see that ordinary gradient descent on Jn is equivalent to the kernel boosting t+1 t t update θ = θ − αnK∇Ln(θ ). Our goal is to analyze the behavior of the update (C.1) in terms of the population cost J (zt). Thus, our problem is one of analyzing a noisy form of gradient descent on the function J , where the noise is induced by the difference between the empirical gradient operator ∇Jn and the population gradient operator ∇J . Recall that the L is M-smooth by assumption. Since the kernel matrix K has been normalized to have largest eigenvalue at most one, the function J is also M-smooth, whence M J (zt+1) ≤ J (zt) + h∇J (zt), dti + kdtk2, 2 2 t t+1 t t where d : = z − z = −α∇Jn(z ). Morever, since the function J is convex, we have J (z∗) ≥ J (zt)+h∇J (zt), z∗ −zti, whence M J (zt+1) − J (z∗) ≤ h∇J (zt), dt + zt − z∗i + kdtk2 2 2 M = h∇J (zt), zt+1 − z∗i + kdtk2. (C.2) 2 2 APPENDIX C. PROOFS FOR CHAPTER5 155

t 1 n t ∗ 2 t+1 ∗ 2o Now define the difference of the squared errors V : = 2 kz −z k2 −kz −z k2 . By some simple algebra, we have 1n o V t = kzt − z∗k2 − kdt + zt − z∗k2 2 2 2 1 = − hdt, zt − z∗i − kdtk2 2 2 1 = − hdt, −dt + zt+1 − z∗i − kdtk2 2 2 1 = − hdt, zt+1 − z∗i + kdtk2. 2 2 Substituting back into equation (C.2) yields 1 dt J (zt+1) − J (z∗) ≤ V t + h∇J (zt) + , zt+1 − z∗i α α 1 = V t + h∇J (zt) − ∇J (zt), zt+1 − z∗i, α n where we have used the fact that 1 ≥ M by our choice of stepsize α. α √ √ Finally,√ we√ transform back to the original variables θ = n Kz, using the relation ∇J (z) = n K∇L(θ), so as to obtain the bound 1 n o L(θt+1) − L(θ∗) ≤ k∆tk2 − k∆t+1k2 2α H H t t t+1 ∗ + h∇L(θ ) − ∇Ln(θ ), θ − θ i. Note that the optimality of θ∗ implies that ∇L(θ∗) = 0. Combined with m-strong convexity, m t+1 2 t+1 ∗ we are guaranteed that 2 k∆ kn ≤ L(θ ) − L(θ ), and hence m 1 n o k∆t+1k2 ≤ k∆tk2 − k∆t+1k2 2 n 2α H H ∗ t ∗ t t+1 + h∇L(θ + ∆ ) − ∇Ln(θ + ∆ ), ∆ i, as claimed.

C.2 Proof of Lemma2

We split our proof into two cases, depending on whether we are dealing with the least-squares 1 2 0 loss φ(y, θ) = 2 (y − θ) , or a classification loss with uniformly bounded gradient (kφ k∞ ≤ 1).

C.2.1 Least-squares case The least-squares loss is m-strongly convex with m = M = 1. Moreover, the difference ∗ ∗ between the population and empirical gradients can be written as ∇L(θ + δe) − ∇Ln(θ + APPENDIX C. PROOFS FOR CHAPTER5 156

σ n δe) = n (w1, . . . , wn), where the random variables {wi}i=1 are i.i.d. and sub-Gaussian with parameter 1. Consequently, we have

n ∗ ∗ σ X |h∇L(θ + δe) − ∇Ln(θ + δe), ∆i| = wi∆(xi) . n i=1 Under these conditions, one can show (see [144] for reference) that

n σ X 1 w ∆(x ) ≤ 2δ k∆k + 2δ2k∆k + k∆k2 , (C.3) n i i n n n H 16 n i=1

which implies that Lemma2 holds with c3 = 16.

C.2.2 Gradient-bounded φ-functions We now turn to the proof of Lemma2 for gradient bounded φ-functions. First, we claim

that it suffices to prove the bound (5.23) for functions g ∈ ∂H and kgkH = 1 where ∂H : = {f −g | f, g ∈ H }. Indeed, suppose that it holds for all such functions, and that we are given a function ∆ with k∆kH > 1 . By assumption, we can apply the inequality (5.23) to the new function g : = ∆/k∆kH , which belongs to ∂H by nature of the subspace H = n span{K(·, xi)}i=1. Applying the bound (5.23) to g and then multiplying both sides by k∆kH , we obtain

∗ ∗ h∇L(θ + δe) − ∇Ln(θ + δe), ∆i 2 2 m k∆kn ≤2δnk∆kn + 2δnk∆kH + c3 k∆kH 2 m 2 ≤2δnk∆kn + 2δnk∆kH + k∆kn, c3

where the second inequality uses the fact that k∆kH > 1 by assumption. In order to establish the bound (5.23) for functions with kgkH = 1, we first prove it uniformly over the set {g | kgkH = 1, kgkn ≤ t}, where t > 1 is a fixed radius (of course, we restrict our attention to those radii t for which this set is non-empty.) We then extend the argument to one that is also uniform over the choice of t by a “peeling” argument. Define the random variable

∗ ∗ Zn(t) : = sup h∇L(θ + δe) − ∇Ln(θ + δe), ∆i. (C.4) ∆,δe∈E(t,1)

The following two lemmas, respectively, bound the mean of this random variable, and its deviations above the mean: APPENDIX C. PROOFS FOR CHAPTER5 157

Lemma 5. For any t > 0, the mean is upper bounded as

EZn(t) ≤ σGn(E(t, 1)), (C.5) where σ : = 2M + 4CH .

Lemma 6. There are universal constants (c1, c2) such that

h i  c nα2  Z (t) ≥ Z (t) + α ≤ c exp − 2 . (C.6) P n E n 1 t2 See Appendices C.2.3 and C.2.4 for the proofs of these two claims. Equipped with Lemmas5 and6, we now prove inequality (5.23). We divide our argument into two cases:

Case t = δn We first prove inequality (5.23) for t = δn. From Lemma5, we have

(i) 2 EZn(δn) ≤ σGn(E(δn, 1)) ≤ δn, (C.7)

2 where inequality (i) follows from the definition of δn in inequality (5.12). Setting α = δn in expression (C.6) yields

h 2i 2 P Zn(δn) ≥ 2δn ≤ c1 exp −c2nδn , (C.8) which establishes the claim for t = δn.

Case t > δn On the other hand, for any t > δn, we have

(i) (ii) G (E(t, 1)) Z (t) ≤ σG (E(t, 1)) ≤ tσ n ≤ tδ , E n n t n

Gn(E(u,1)) where step (i) follows from Lemma5, and step (ii) follows because the function u 7→ u is non-increasing on the positive real line. (This non-increasing property is a direct conse- quence of the star-shaped nature of ∂H .) Finally, using this upper bound on expression 2 EZn(δn) and setting α = t m/(4c3) in the tail bound (C.6) yields

2 h t mi 2 2 P Zn(t) ≥ tδn + ≤ c1 exp −c2nm t . (C.9) 4c3

Note that the precise values of the universal constants c2 may change from line to line throughout this section. APPENDIX C. PROOFS FOR CHAPTER5 158

Peeling argument Equipped with the tail bounds (C.8) and (C.9), we are now ready to complete the peeling argument. Let A denote the event that the bound (5.23) is violated for some function g ∈ ∂H with kgkH = 1. For real numbers 0 ≤ a < b, let A(a, b) denote the event that it is violated for some function such that kgkn ∈ [a, b], and kgkH = 1. For k = k S∞ 0, 1, 2,..., define tk = 2 δn. We then have the decomposition E = (0, t0) ∪ ( k=0 A(tk, tk+1)) and hence by union bound,

∞ X P[E] ≤ P[A(0, δn)] + P[A(tk, tk+1)]. (C.10) k=1

2 From the bound (C.8), we have P[A(0, δn)] ≤ c1 exp (−c2nδn). On the other hand, suppose that A(tk, tk+1) holds, meaning that there exists some function g with kgkH = 1 and kgkn ∈ [tk, tk+1] such that

∗ ∗ 2 m 2 h∇L(θ + δe) − ∇Ln(θ + δe), gi > 2δnkgkn + 2δn + kgkn c3 (i) 2 m 2 ≥ 2δntk + 2δn + tk c3 (ii) 2 m 2 ≥ δntk+1 + 2δn + tk+1, 4c3 where step (i) uses the kgkn ≥ tk and step (ii) uses the fact that tk+1 = 2tk. This lower bound 2 tk+1m implies that Zn(tk+1) > tk+1δn + and applying the tail bound (C.9) yields 4c3

2 tk+1m P(A(tk, tk+1)) ≤ P(Zn(tk+1) > tk+1δn + ) 4c3 2 2k+2 2 ≤ exp −c2nm 2 δn .

Substituting this inequality and our earlier bound (C.8) into equation (C.10) yields

2 2 P(E) ≤ c1 exp(−c2nm δn), where the reader should recall that the precise values of universal constants may change from line-to-line. This concludes the proof of Lemma2.

C.2.3 Proof of Lemma5

Recalling the definitions (5.1) and (5.3) of L and Ln, we can write

n 1 X 0 ∗ 0 ∗ Zn(t) = sup (φ (yi, θ + δei) − φ (yi, θ + δei))∆i n i E i ∆,δe∈E(t,1) i=1 APPENDIX C. PROOFS FOR CHAPTER5 159

∗ Note that the vectors ∆ and δe contain function values of the form f(xi)−f (xi) for functions ∗ f ∈ BH (f , 2CH ). Recall that the kernel function is bounded uniformly by one. Conse- ∗ quently, for any function f ∈ BH (f , 2CH ), we have

∗ ∗ |f(x) − f (x)| = |hf − f , K(·, x)iH | ∗ ≤ kf − f kH kK(·, x)kH ≤ 2CH .

Thus, we can restrict our attention to vectors ∆, δe with k∆k∞, kδek∞ ≤ 2CH from hereon- wards. n Letting {εi}i=1 denote an i.i.d. sequence of Rademacher variables, define the symmetrized variable

n ˜ 1 X 0 ∗ Zn(t) : = sup εiφ (yi, θ + δei) ∆i. (C.11) n i ∆,δe∈E(t,1) i=1 ˜ By a standard symmetrization argument [138], we have Ey[Zn(t)] ≤ 2Ey,[Zn(t)]. Moreover, since

2 0 ∗ 1 0 ∗  1 2 φ (yi, θ + δei) ∆i ≤ φ (yi, θ + δei) + ∆ i 2 i 2 i we have

n n 1 X 0 ∗ 2 1 X 2 Zn(t) ≤ sup εi φ (yi, θ + δei) + sup εi∆ E E n i E n i δe∈E(t,1) i=1 ∆∈E(t,1) i=1 n n 1 X 0 ∗ 1 X ≤2 sup εiφ (yi, θ + δei) + 4C sup εi∆i, E n i H E n δe∈E(t,1) i=1 ∆∈E(t,1) i=1 | {z } | {z } T1 T2

where the second inequality follows by applying the Rademacher contraction inequality [92], 0 using the fact that kφ k∞ ≤ 1 for the first term, and k∆k∞ ≤ 2CH for the second term. 0 ∗ Focusing first on the term T1, since E[εiφ (yi, θi )] = 0, we have

n 1 X  0 ∗ 0 ∗  T1 = sup εi φ (yi, θ + δei) − φ (yi; θ ) E n i i δe∈E(t,1) i=1 | {z } ϕi(δei) n (i) 1 X ≤ M sup εiδei E n δe∈E(t,1) i=1 (ii) rπ ≤ MG (E(t, 1)), 2 n APPENDIX C. PROOFS FOR CHAPTER5 160

where step (i) follows since each function ϕi is M-Lipschitz by assumption; and step (ii) follows since the Gaussian complexity upper bounds the Rademacher complexity up to a p π factor of 2 . Similarly, we have

rπ T ≤ G (E(t, 1)), 2 2 n and putting together the pieces yields the claim.

C.2.4 Proof of Lemma6 ˜ Recall the definition (C.11) of the symmetrized variable Zn. By a standard symmetrization argument [138], there are universal constants c1, c2 such that

h i h ˜ ˜ i P Zn(t) ≥ EZn[t] + c1α ≤ c2P Zn(t) ≥ EZn[t] + α .

n n ˜ n Since {εi}i=1 are {yi}i=1 are independent, we can study Zn(t) conditionally on {yi}i=1. n ˜ Viewed as a function of {εi}i=1, the function Zn(t) is convex and Lipschitz with respect to the Euclidean norm with parameter

n 2 2 2 1 X  0 ∗  t L : = sup φ (yi, θ + δei) ∆i ≤ , n2 i n ∆,δe∈E(t,1) i=1

0 where we have used the facts that kφ k∞ ≤ 1 and k∆kn ≤ t. By Ledoux’s concentration for convex and Lipschitz functions [91], we have

h i  nα2  Z˜ (t) ≥ Z˜ [t] + α | {y }n ≤ c exp − c . P n E n i i=1 3 4 t2

n Since the right-hand side does not involve {yi}i=1, the same bound holds unconditionally over n the randomness in both the Rademacher variables and the sequence {yi}i=1. Consequently, the claimed bound (C.6) follows, with suitable redefinitions of the universal constants.

C.3 Proof of Lemma3

We first require an auxiliary lemma, which we state and prove in the following section. We then prove Lemma3 in Section C.3.2.

C.3.1 An auxiliary lemma The following result relates the Hilbert norm of the error to the difference between the empirical and population gradients: APPENDIX C. PROOFS FOR CHAPTER5 161

Lemma 7. For any convex and differentiable loss function L, the kernel boosting error ∆t+1 : = θt+1 − θ∗ satisfies the bound

t+1 2 t t+1 k∆ kH ≤ k∆ kH k∆ kH ∗ t ∗ t t+1 + αh∇L(θ + ∆ ) − ∇Ln(θ + ∆ ), ∆ i. (C.12)

t 2 t ∗ 2 t ∗ 2 Proof. Recall that k∆ kH = kθ − θ kH = kz − z k2 by definition of the Hilbert norm. Let us define the population update operator G on the population function J and the empirical update operator Gn on Jn as √ √ G(zt) : = zt − α∇J ( n Kzt), √ t+1 t t √ t and z : = Gn(z ) = z − α∇Jn( n Kz ). (C.13)

Since J is convex and smooth, it follows from standard arguments in convex optimization that G is a non-expansive operator—viz.

kG(x) − G(y)k2 ≤ kx − yk2 for all x, y ∈ C. (C.14)

In addition, we note that the vector z∗ is a fixed point of G—that is, G(z∗) = z∗. From these ingredients, we have

t+1 2 k∆ kH t+1 ∗ t t t ∗ = hz − z ,Gn(z ) − G(z ) + G(z ) − z i (i) t+1 ∗ t ∗ ≤ kz − z k2kG(z ) − G(z )k2 √ √ ∗ t ∗ t t+1 ∗ + αh n K[∇L(θ + ∆ ) − ∇Ln(θ + ∆ )], z − z i (ii) t+1 t ≤ k∆ kH k∆ kH ∗ t ∗ t t+1 + αh∇L(θ + ∆ ) − ∇Ln(θ + ∆ ), ∆ i where step (i) follows by applying√ √ the Cauchy-Schwarz to control the inner product,√ and step (ii) follows since ∆t+1 = n K(zt+1 − z∗), and the square root kernel matrix K is symmetric.

C.3.2 Proof of Lemma3 We now prove Lemma3. The argument makes use of Lemmas1 and2 combined with Lemma7. In order to prove inequality (5.24), we follow an inductive argument. Instead of prov- ing (5.24) directly, we prove a slightly stronger relation which implies it, namely

t 2 0 2 2 4M max{1, k∆ kH } ≤ max{1, k∆ kH } + tδn . (C.15) γme APPENDIX C. PROOFS FOR CHAPTER5 162

Here γe and c3 are constants linked by the relation

1 1 2 γe : = − = 1/CH . (C.16) 32 4c3 We claim that it suffices to prove that the error iterates ∆t+1 satisfy the inequality (C.15). Indeed, if we take inequality (C.15) as given, then we have

t 2 0 2 1 2 k∆ kH ≤ max{1, k∆ kH } + ≤ CH , 2γe 2 ∗ 2 where we used the definition CH = 2 max{kθ kH , 32}. Thus, it suffices to focus our attention on proving inequality (C.15). m For t = 0, it is trivially true. Now let us assume inequality (C.15) holds for some t ≤ 2 , 8Mδn and then prove that it also holds for step t + 1. t+1 If k∆ kH < 1, then inequality (C.15) follows directly. Therefore, we can assume t+1 without loss of generality that k∆ kH ≥ 1. We break down the proof of this induction into two steps:

t+1 • First, we show that k∆ kH ≤ 2CH so that Lemma2 is applicable.

t+1 • Second, we show that the bound (C.15) holds and thus in fact k∆ kH ≤ CH . √ √1 Throughout the proof, we condition on the event E and E0 := { n ky − E[y | x]k2 ≤ 2σ}. 2 2 c m nδn −n Lemma2 guarantees that P(E ) ≤ c1 exp(−c2 σ2 ) whereas P(E0) ≥ 1 − E follows from the fact that Y 2 is sub-exponential with parameter σ2n and applying Hoeffding’s inequality. Putting things together yields an upper bound on the probability of the complementary event, namely c c 2 P(E ∪ E0) ≤ 2c1 exp(−C2nδn) m2 with C2 = max{ σ2 , 1}.

t+1 Showing that k∆ kH ≤ 2CH In this step, we assume that inequality (C.15) holds at (K†)1/2 t+1 √ step t, and show that k∆ kH ≤ 2CH . Recalling that z : = n θ, our update can be written as √ √ zt+1 − z∗ = zt − α n K∇L(θt) − z∗ √ √ t t + α n K(∇Ln(θ ) − ∇L(θ )).

Applying the triangle inequality yields the bound √ t+1 ∗ t √ t ∗ kz − z k2 ≤ k z − α n K∇L(θ ) −z k2 | {z } G(zt) √ √ t t + kα n K(∇Ln(θ ) − ∇L(θ ))k2 APPENDIX C. PROOFS FOR CHAPTER5 163

where the population update operator G was previously defined (C.13), and observed to be non-expansive (C.14). From this non-expansiveness, we find that √ t+1 ∗ t ∗ √ t t kz − z k2 ≤ kz − z k2 + kα n K(∇Ln(θ ) − ∇L(θ ))k2,

Note that the `2 norm of z corresponds to the Hilbert norm of θ. This implies √ t+1 t √ t t k∆ kH ≤ k∆ kH + kα n K(∇Ln(θ ) − ∇L(θ ))k2 | {z } : =T Observe that because of uniform boundedness of the kernel by one, the quantity T can be bounded as √ √ 1 T ≤ α nk∇L (θt) − ∇L(θt))k = α n kv − vk , n 2 n E 2

n 0 t where we have define the vector v ∈ R with coordinates vi : = φ (yi, θi). For functions t ∗ φ satisfying the gradient boundedness and m − M condition, since θ ∈ BH (θ ,CH ), each coordinate of the vectors v and Ev is bounded by 1 in absolute value. We consequently have

T ≤ α ≤ CH ,

CH where we have used the fact that α ≤ m/M < 1 ≤ 2 . For least-squares φ we instead have √ n α √ T ≤ α ky − [y | x]k =: √ Y ≤ 2σ ≤ C n E 2 n H √ √1 2 conditioned on the event E0 := { n ky − E[y | x]k2 ≤ 2σ}. Since Y is sub-exponential 2 −n with parameter σ n it follows by Hoeffding’s inequality that P(E0) ≥ 1 − E . t+1 Putting together the pieces yields that k∆ kH ≤ 2CH , as claimed.

Completing the induction step We are now ready to complete the induction step for t+1 proving inequality (C.15) using Lemma1 and Lemma2 since k∆ kH ≥ 1. We split the t+1 t+1 argument into two cases separately depending on whether or not k∆ kH δn ≥ k∆ kn. In t+1 t general we can assume that k∆ kH > k∆ kH , otherwise the induction inequality (C.15) satisfies trivially.

t+1 t+1 Case 1 When k∆ kH δn ≥ k∆ kn, inequality (5.23) implies that

∗ ∗ t+1 h∇L(θ + ∆)e − ∇Ln(θ +∆)e , ∆ i 2 t+1 m t+1 2 ≤ 4δnk∆ kH + k∆ kn, (C.17) c3 APPENDIX C. PROOFS FOR CHAPTER5 164

Combining Lemma7 and inequality (C.17), we obtain

t+1 2 t t+1 2 t+1 m t+1 2 k∆ kH ≤k∆ kH k∆ kH + 4αδnk∆ kH + α k∆ kn c3 1 =⇒ k∆t+1k ≤ k∆tk + 4αδ2, (C.18) H 1 − αδ2 m H n n c3

t+1 t+1 where the last inequality uses the fact that k∆ kn ≤ δnk∆ kH .

t+1 t+1 t+1 t Case 2 When k∆ kH δn < k∆ kn, we use our assumption k∆ kH ≥ k∆ kH together with Lemma7 and inequality (5.23) which guarantee that

t+1 2 t 2 ∗ t ∗ t t+1 k∆ kH ≤k∆ kH + 2αh∇L(θ + ∆ ) − ∇Ln(θ + ∆ ), ∆ i t 2 t+1 m t+1 2 ≤k∆ kH + 8αδnk∆ kn + 2α k∆ kn. c3 Using the elementary inequality 2ab ≤ a2 + b2, we find that   t+1 2 t 2 t+1 2 1 2 m t+1 2 k∆ kH ≤k∆ kH + 8α mγek∆ kn + δn + 2α k∆ kn 4γme c3 2 t 2 m t+1 2 2αδn ≤k∆ kH + α k∆ kn + , (C.19) 4 γme

where in the final step, we plug in the constants γ,e c3 which satisfy equation (C.16). Now Lemma1 implies that

m t+1 2 t t+1 m t+1 2 k∆ kn ≤ D + 4k∆ knδn + k∆ kn 2 c3 (i)   t t+1 2 1 2 m t+1 2 ≤ D + 4 γme k∆ kn + δn + k∆ kn, 4γme c3

2 2 m t+1 2 t 1 2 where step (i) again uses 2ab ≤ a + b . Thus, we have k∆ kn ≤ D + δn. Together 4 γme with expression (C.19), we find that

t+1 2 t 2 1 t 2 t+1 2 4α 2 k∆ kH ≤ k∆ kH + (k∆ kH − k∆ kH ) + δn 2 γme t+1 2 t 2 4α 2 =⇒ k∆ kH ≤ k∆ kH + δn. (C.20) γme APPENDIX C. PROOFS FOR CHAPTER5 165

Combining the pieces By combining the two previous cases, we arrive at the bound

n t+1 2 o max 1, k∆ kH

n 2 t 2 2 t 2 4M 2o ≤ max 1, κ (k∆ kH + 4αδn) , k∆ kH + δn , (C.21) γme 1 1 where κ : = (1−αδ2 m ) and we used that α ≤ min{ M ,M}. n c3 1 1 Now it is only left for us to show that with the constant c3 chosen such that γ = − = e 32 4c3 2 1/CH , we have

2 t 2 2 t 2 4M 2 κ (k∆ kH + 4αδn) ≤ k∆ kH + δn. γme 2 2 2 2 4M 2 Define the function f : (0,CH ] → via f(ξ) : = κ (ξ + 4αδn) − ξ − δn. Since R γme κ ≥ 1, in order to conclude that f(ξ) < 0 for all ξ ∈ (0,CH ], it suffices to show that argminx∈R f(x) < 0 and f(CH ) < 0. The former is obtained by basic algebra and follows 1 1 2 1 2 M 2 directly from κ ≥ 1. For the latter, since γ = − = 1/C , α < and δ ≤ 2 it thus e 32 4c3 H M n m suffices to show 1 4M ≤ + 1 M 2 m (1 − 8m ) x 2 m Since (4x + 1)(1 − 8 ) ≥ 1 for all x ≤ 1 and M ≤ 1, we conclude that f(CH ) < 0. t+1 2 t 2 4M 2 Now that we have established max{1, k∆ k } ≤ max{1, k∆ k }+ δn, the induction H H γme step (C.15) follows. which completes the proof of Lemma3.

C.4 Proof of Lemma4

Recall that the LogitBoost algorithm is based on logistic loss φ(y, θ) = ln(1 + e−yθ), whereas the AdaBoost algorithm is based on the exponential loss φ(y, θ) = exp(−yθ). We now verify the m-M-condition for these two losses with the corresponding parameters specified in Lemma4.

C.4.1 m-M-condition for logistic loss The first and second derivatives are given by

∂φ(y, θ) −ye−yθ ∂2φ(y, θ) y2 = , and = . ∂θ 1 + e−yθ (∂θ)2 (e−yθ/2 + eyθ/2)2

∂φ(y,θ) It is easy to check that | ∂θ | is uniformly bounded by B = 1. APPENDIX C. PROOFS FOR CHAPTER5 166

Turning to the second derivative, recalling that y ∈ {−1, +1}, it is straightforward to show that y2 1 max sup −yθ/2 yθ/2 2 ≤ , y∈{−1,+1} θ (e + e ) 4

∂φ(y,θ) which implies that ∂θ is a 1/4-Lipschitz function of θ, i.e. with M = 1/4. Our final step is to compute a value for m by deriving a uniform lower bound on the Hessian. For this step, we need to exploit the fact that θ = f(x) must arise from a function ∗ f such that kfkH ≤ D : = CH + kθ kH . Since supx K(x, x) ≤ 1 by assumption, the repro- ducing relation for RKHS then implies that |f(x)| ≤ D. Combining this inequality with the fact that y ∈ {−1, 1}, it suffices to lower the bound the quantity

2 2 ∂ φ(y, θ) y min min = min min y∈{−1,+1} |θ|≤D (∂θ)2 |y|≤1 |θ|≤D (e−yθ/2 + eyθ/2)2 1 ≥ , e−D + eD + 2 | {z } m which completes the proof for the logistic loss.

C.4.2 m-M-condition for AdaBoost The AdaBoost algorithm is based on the cost function φ(y, θ) = e−yθ, which has first and second derivatives (with respect to its second argument) given by

∂φ(y, θ) ∂2φ(y, θ) = −ye−yθ, and = e−yθ. ∂θ (∂θ)2

As in the preceding argument for logistic loss, we have the bound |y| ≤ 1 and |θ| ≤ D. By inspection, the absolute value of the first derivative is uniformly bounded B : = eD, whereas the second derivative always lies in the interval [m, M] with M : = eD and m : = e−D, as claimed. Moreover, as shown by our later results, under suitable regularity conditions, the ex- 2 pectation of the minimum squared error ρn is proportional to the statistical minimax risk inf sup [L(f) − L(f)], where the infimum is taken over all possible estimators f. Note fb f∈F E b b that the minimax risk provides a fundamental lower bound on the performance of any esti- mator uniformly over the function space F. Coupled with our stopping time guarantee (5.5), we are guaranteed that our estimate achieves the minimax risk up to constant factors. As a result, our bounds are unimprovable in general (see Corollary4). 167

Bibliography

[1] L. Addario-Berry, N. Broutin, L. Devroye, G. Lugosi, et al. On combinatorial testing problems. The Annals of Statistics, 38(5):3063–3092, 2010.

[2] A. D. Alexandrov. Almost everywhere existence of the second differential of a convex function and some properties of convex surfaces connected with it. Leningrad State Univ. Annals [Uchenye Zapiski] Math. Ser., 6:3–35, 1939.

[3] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: phase transitions in convex programs with random data. Information and Inference, 3(3):224– 294, 2014.

[4] R. S. Anderssen and P. M. Prenter. A formal comparison of methods proposed for the numerical solution of first kind integral equations. Jour. Australian Math. Soc. (Ser. B), 22:488–500, 1981.

[5] E. Arias-Castro, E. Cand`es,and A. Durand. Detection of an abnormal cluster in a network. The Bulleting of the Internation Statistical Association, Durban, South Africa, 2009.

[6] E. Arias-Castro, D. L. Donoho, X. Huo, et al. Adaptive multiscale detection of filamen- tary structures in a background of uniform random points. The Annals of Statistics, 34(1):326–349, 2006.

[7] Y. Baraud. Non-asymptotic minimax rates of testing in signal detection. Bernoulli, 8(5):577–606, 2002.

[8] Y. Baraud and L. Birg´e. Rates of convergence of rho-estimators for sets of densities satisfying shape constraints. arXiv preprint arXiv:1503.04427, 2015.

[9] R. E. Barlow, D. J. Bartholomew, J. Bremner, and H. D. Brunk. Statistical inference under order restrictions: The theory and application of isotonic regression. Wiley New York, 1972.

[10] D. Bartholomew. A test of homogeneity for ordered alternatives. Biometrika, 46(1/2):36–48, 1959. BIBLIOGRAPHY 168

[11] D. Bartholomew. A test of homogeneity for ordered alternatives. ii. Biometrika, 46(3/4):328–335, 1959.

[12] P. Bartlett and S. Mendelson. Gaussian and Rademacher complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[13] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.

[14] P. L. Bartlett, O. Bousquet, S. Mendelson, et al. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

[15] P. L. Bartlett and M. Traskin. Adaboost is consistent. Journal of Machine Learning Research, 8(Oct):2347–2368, 2007.

[16] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 381–390. ACM, 2004.

[17] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic, Norwell, MA, 2004.

[18] O. Besson. Adaptive detection of a signal whose signature belongs to a cone. In Proceedings SAM Conference, 2006.

[19] C. Borell. The Brunn-Minkowski inequality in Gauss space. Inventiones mathematicae, 30:207–216, 1975.

[20] L. Breiman. Prediction games and arcing algorithms. Neural computation, 11(7):1493– 1517, 1999.

[21] L. Breiman et al. Arcing classifier (with discussion and a rejoinder by the author). Annals of Statistics, 26(3):801–849, 1998.

[22] V.-E. Brunel. Non-parametric estimation of convex bodies and convex polytopes. PhD thesis, Universit´ePierre et Marie Curie-Paris VI; University of Haifa, 2014.

[23] H. D. Brunk. Estimation of isotonic regression. In Nonparametric Techniques in Statistical Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969), pages 177–197. Cambridge Univ. Press, London, 1970.

[24] P. B¨uhlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.

[25] P. B¨uhlmannand T. Hothorn. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, pages 477–505, 2007. BIBLIOGRAPHY 169

[26] P. B¨uhlmannand B. Yu. Boosting with L2 loss: Regression and classification. Journal of American Statistical Association, 98:324–340, 2003.

[27] T. T. Cai, A. Guntuboyina, and Y. Wei. Supplement to “adaptive estimation of planar convex sets”. 2015.

[28] T. T. Cai, A. Guntuboyina, and Y. Wei. Adaptive estimation of planar convex sets. To appear in The Annals of Statistics, arXiv:1508.03744, 2017+.

[29] T. T. Cai and M. G. Low. A framework for estimation of convex functions. Statistica Sinica, 25:423–456, 2015.

[30] T. T. Cai, M. G. Low, and Y. Xia. Adaptive confidence intervals for regression functions under shape constraints. Annals of Statistics, 41:722–750, 2013.

[31] R. Camoriano, T. Angles, A. Rudi, and L. Rosasco. Nytro: When subsampling meets early stopping. In Proceedings of the 19th International Conference on Artificial Intel- ligence and Statistics, pages 1403–1411, 2016.

[32] A. Caponetto and Y. Yao. Adaptation for regularization operators in learning the- ory. Technical Report CBCL Paper #265/AI Technical Report #063, Massachusetts Institute of Technology, September 2006.

[33] A. Caponneto. Optimal rates for regularization operators in learning theory. Techni- cal Report CBCL Paper #264/AI Technical Report #062, Massachusetts Institute of Technology, September 2006.

[34] C. Carolan and R. Dykstra. Asymptotic behavior of the Grenander estimator at density flat regions. Canad. J. Statist., 27(3):557–566, 1999.

[35] R. Caruana, S. Lawrence, and C. L. Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408, 2001.

[36] E. Cator. Adaptivity and optimality of the monotone least-squares estimator. Bernoulli, 17:714–735, 2011.

[37] S. Chatterjee et al. A new perspective on least squares under convex constraint. The Annals of Statistics, 42(6):2340–2381, 2014.

[38] S. Chatterjee, A. Guntuboyina, and B. Sen. On risk bounds in isotonic and other shape restricted regression problems. Annals of Statistics, 2014. To appear.

[39] S. Chatterjee, A. Guntuboyina, B. Sen, et al. On risk bounds in isotonic and other shape restricted regression problems. The Annals of Statistics, 43(4):1774–1800, 2015.

[40] D. Chen and R. J. Plemmons. Nonnegativity constraints in numerical analysis. The Birth of Numerical Analysis, 10:109–140, 2009.

[41] Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.

[42] D. Chetverikov. Testing regression monotonicity in econometric models. Technical report, UCLA, December 2012. arXiv:1212.6757.

[43] M. M. Deza and E. Deza. Encyclopedia of distances. Springer, 2009.

[44] D. L. Donoho and I. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, December 1995.

[45] D. L. Donoho and J. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.

[46] L. Dümbgen and V. G. Spokoiny. Multiscale testing of qualitative hypotheses. The Annals of Statistics, pages 124–152, 2001.

[47] R. L. Dykstra and T. Robertson. On testing monotone tendencies. Journal of the American Statistical Association, 78(382):342–350, 1983.

[48] A. W. F. Edwards. Likelihood. Cambridge University Press, Cambridge, 1972.

[49] M. S. Ermakov. Minimax detection of a signal in a Gaussian white noise. Theory of Probability & Its Applications, 35(4):667–679, 1991.

[50] J. Fan and J. Jiang. Nonparametric inference with generalized likelihood ratio tests. Test, 16(3):409–444, 2007.

[51] J. Fan, C. Zhang, and J. Zhang. Generalized likelihood ratio statistics and Wilks phenomenon. The Annals of Statistics, pages 153–193, 2001.

[52] N. I. Fisher, P. Hall, B. A. Turlach, and G. S. Watson. On the estimation of a convex set from noisy data on its support function. Journal of the American Statistical Association, 92:84–91, 1997.

[53] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[54] K. Frick, A. Munk, and H. Sieling. Multiscale change point inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):495–580, 2014.

[55] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics, 28(2):337–407, 2000.

[56] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.

[57] R. J. Gardner. Geometric Tomography. Cambridge University Press, second edition, 2006.

[58] R. J. Gardner and M. Kiderlen. A new algorithm for 3D reconstruction from support functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:556–562, 2009.

[59] R. J. Gardner, M. Kiderlen, and P. Milanfar. Convergence of algorithms for reconstructing convex bodies and directional measures. Annals of Statistics, 34:1331–1374, 2006.

[60] R. Ge, C. Jin, and Y. Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

[61] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[62] R. Ge and T. Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3656–3666, 2017.

[63] S. A. van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.

[64] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, L. Wasserman, et al. Manifold estimation and singular deconvolution under Hausdorff loss. The Annals of Statistics, 40(2):941–963, 2012.

[65] L. Goldstein, I. Nourdin, and G. Peccati. Gaussian phase transitions and conic intrinsic volumes: Steining the Steiner formula. arXiv preprint arXiv:1411.6265, 2014.

[66] M. Greco, F. Gini, and A. Farina. Radar detection and classification of jamming signals belonging to a cone class. IEEE Trans. Signal Processing, 56(5):1984–1993, May 2008.

[67] J. Gregor and F. R. Rannou. Three-dimensional support function estimation and application for projection magnetic resonance imaging. International Journal of Imaging Systems and Technology, 12:43–50, 2002.

[68] P. Groeneboom. The concave majorant of Brownian motion. Ann. Probab., 11(4):1016–1027, 1983.

[69] P. Groeneboom. Estimating a monotone density. In Proceedings of the Berkeley conference in honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983), Wadsworth Statist./Probab. Ser., pages 539–555, Belmont, CA, 1985. Wadsworth.

[70] P. Groeneboom and G. Jongbloed. Nonparametric Estimation under Shape Constraints: Estimators, Algorithms and Asymptotics, volume 38. Cambridge University Press, 2014.

[71] P. Groeneboom, G. Jongbloed, and J. A. Wellner. A canonical process for estimation of convex functions: The “invelope” of integrated Brownian motion + t^4. Annals of Statistics, 29:1620–1652, 2001.

[72] P. Groeneboom, G. Jongbloed, and J. A. Wellner. Estimation of convex functions: characterizations and asymptotic theory. Annals of Statistics, 29:1653–1698, 2001.

[73] C. Gu. Smoothing spline ANOVA models. Springer Series in Statistics. Springer, New York, NY, 2002.

[74] A. Guntuboyina. Optimal rates of convergence for convex set estimation from support functions. The Annals of Statistics, 40(1):385–411, 2012.

[75] A. Guntuboyina and B. Sen. Global risk bounds and adaptation in univariate convex regression. Probab. Theory Related Fields, 2013. To appear, available at http://arxiv.org/abs/1305.1648.

[76] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, 2002.

[77] D. L. Hanson and G. Pledger. Consistency in concave regression. Ann. Statist., 4(6):1038–1050, 1976.

[78] M. Hardt. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 651–660. IEEE, 2014.

[79] M. Hardt and M. Wootters. Fast matrix completion without the condition number. In Conference on Learning Theory, pages 638–678, 2014.

[80] X. Hu and F. T. Wright. Likelihood ratio tests for a class of non-oblique hypotheses. Ann. Inst. Statist. Math., 46(1):137–145, 1994.

[81] Y. I. Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the Lp-metrics. Theory of Probability and Its Applications, 31(2):333–337, 1987.

[82] Y. I. Ingster and I. A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science and Business Media, 2012.

[83] H. Jankowski. Convergence of linear functionals of the Grenander estimator under misspecification. Ann. Statist., 42(2):625–653, 2014.

[84] W. Jiang. Process consistency for AdaBoost. Annals of Statistics, 32(1):13–29, 2004.

[85] A. K. Kim, H. H. Zhou, et al. Tight minimax rates for manifold estimation under Hausdorff loss. Electronic Journal of Statistics, 9(1):1562–1582, 2015.

[86] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Jour. Math. Anal. Appl., 33:82–95, 1971.

[87] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.

[88] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006.

[89] A. Kudo. A multivariate analogue of the one-sided test. Biometrika, 50(3/4):403–418, 1963.

[90] L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York, 1986.

[91] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.

[92] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, NY, 1991.

[93] E. L. Lehmann. On likelihood ratio tests. In Selected Works of EL Lehmann, pages 209–216. Springer, 2012.

[94] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Science and Business Media, 2006.

[95] A. S. Lele, S. R. Kulkarni, and A. S. Willsky. Convex-polygon estimation from support-line measurements and applications to target reconstruction from laser-radar data. Journal of the Optical Society of America, Series A, 9:1693–1714, 1992.

[96] O. V. Lepski and V. G. Spokoiny. Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative. Bernoulli, 5(2):333–358, 1999.

[97] O. V. Lepski and A. B. Tsybakov. Asymptotically exact nonparametric hypothesis testing in sup-norm and at a fixed point. Probability Theory and Related Fields, 117(1):17–48, 2000.

[98] Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.

[99] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.

[100] E. Mammen. Nonparametric regression under qualitative smoothness assumptions. Ann. Statist., 19(2):741–759, 1991.

[101] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, pages 512–518, 1999.

[102] D. E. McClure and R. A. Vitale. Polygonal approximation of plane convex bodies. Journal of Mathematical Analysis and Applications, 51(2):326–358, 1975.

[103] M. B. McCoy and J. A. Tropp. From Steiner formulas for cones to concentration of intrinsic volumes. Discrete and Computational Geometry, 51(4):926–963, 2014.

[104] N. Meinshausen. Sign-constrained least squares estimation for high-dimensional regression. Electronic Journal of Statistics, 7:1607–1631, 2013.

[105] S. Mendelson. Geometric parameters of kernel machines. In Proceedings of the Con- ference on Learning Theory (COLT), pages 29–43, 2002.

[106] J. Menéndez, C. Rueda, B. Salvador, et al. Dominance of likelihood ratio tests under cone constraints. The Annals of Statistics, 20(4):2087–2099, 1992.

[107] J. A. Menéndez and B. Salvador. Anomalies of the likelihood ratio test for testing restricted hypotheses. Annals of Statistics, 19(2):889–898, 1991.

[108] J. Menéndez, C. Rueda, and B. Salvador. Testing non-oblique hypotheses. Communications in Statistics - Theory and Methods, 21(2):471–484, 1992.

[109] M. Meyer and M. Woodroofe. On the degrees of freedom in shape-restricted regression. The Annals of Statistics, pages 1083–1104, 2000.

[110] M. C. Meyer. A test for linear versus convex regression function using shape-restricted regression. Biometrika, 90(1):223–232, 2003.

[111] J.-J. Moreau. Décomposition orthogonale d'un espace hilbertien selon deux cônes mutuellement polaires. CR Acad. Sci. Paris, 255:238–240, 1962.

[112] M. D. Perlman, L. Wu, et al. The emperor's new tests. Statistical Science, 14(4):355–369, 1999.

[113] G. Pisier. Probabilistic methods in the geometry of Banach spaces. Springer, 1986.

[114] L. Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 1998.

[115] J. L. Prince and A. S. Willsky. Reconstructing convex sets from support line measurements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:377–389, 1990.

[116] G. Raskutti, M. J. Wainwright, and B. Yu. Early stopping and non-parametric regression: An optimal data-dependent stopping rule. Journal of Machine Learning Research, 15:335–366, 2014.

[117] R. F. Raubertas, C.-I. Charles Lee, and E. V. Nordheim. Hypothesis tests for normal means constrained by linear inequalities. Communications in Statistics - Theory and Methods, 15(9):2809–2833, 1986.

[118] T. Robertson. Testing for and against an order restriction on multinomial parameters. Journal of the American Statistical Association, 73(361):197–202, 1978.

[119] T. Robertson. On testing symmetry and unimodality. In Advances in Order Restricted Statistical Inference, pages 231–248. Springer, 1986.

[120] T. Robertson and E. J. Wegman. Likelihood ratio tests for order restrictions in exponential families. The Annals of Statistics, pages 485–505, 1978.

[121] T. Robertson, F. T. Wright, and R. L. Dykstra. Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics, 1988.

[122] W. Robertson. On measuring the conformity of a parameter set to a trend, with applications. The Annals of Statistics, 10(4):1234–1245, 1982.

[123] L. Rosasco and S. Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

[124] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[125] R. E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.

[126] L. L. Scharf. Statistical signal processing: Detection, estimation and time series analysis. Addison-Wesley, Reading, MA, 1991.

[127] R. Schneider. Convex Bodies: The Brunn-Minkowski Theory. Cambridge Univ. Press, Cambridge, 1993.

[128] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[129] B. Sen and M. Meyer. Testing against a linear regression model using ideas from shape-restricted estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017.

[130] A. Shapiro. Towards a unified theory of inequality constrained testing in multivariate analysis. International Statistical Review/Revue Internationale de Statistique, pages 49–62, 1988.

[131] M. Slawski, M. Hein, et al. Non-negative least squares for high-dimensional linear models: Consistency and sparse recovery without regularization. Electronic Journal of Statistics, 7:3004–3056, 2013.

[132] V. G. Spokoiny. Adaptive and spatially adaptive testing a nonparametric hypothesis. Math. Methods Statist, 7:245–273, 1998.

[133] H. Stark and Y. Yang. Vector space projections. John Wiley & Sons, New York, 1998.

[134] O. N. Strand. Theory and methods related to the singular value expansion and Landweber's iteration for integral equations of the first kind. SIAM J. Numer. Anal., 11:798–825, 1974.

[135] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2009.

[136] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[137] S. A. van de Geer. Applications of empirical process theory, volume 91. Cambridge University Press, Cambridge, 2000.

[138] A. W. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, NY, 1996.

[139] A. W. van der Vaart and J. A. Wellner. Weak convergence. In Weak Convergence and Empirical Processes, pages 16–28. Springer, 1996.

[140] R. A. Vitale. Support functions of plane convex sets. Technical report, Claremont Graduate School, Claremont, CA, 1979.

[141] E. D. Vito, S. Pereverzyev, and L. Rosasco. Adaptive kernel methods using the balancing principle. Foundations of Computational Mathematics, 10(4):455–479, 2010.

[142] G. Wahba. Three topics in ill-posed problems. In M. Engl and G. Groetsch, editors, Inverse and ill-posed problems, pages 37–50. Academic Press, 1987.

[143] G. Wahba. Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA, 1990.

[144] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2017.

[145] A. Wald. Contributions to the theory of statistical estimation and testing hypotheses. The Annals of Mathematical Statistics, 10(4):299–326, 1939.

[146] G. Walther et al. Optimal and fast detection of spatial clusters with scan statistics. The Annals of Statistics, 38(2):1010–1033, 2010.

[147] G. Warrack and T. Robertson. A likelihood ratio test regarding two nested but oblique order-restricted hypotheses. Journal of the American Statistical Association, 79(388):881–886, 1984.

[148] Y. Wei and M. J. Wainwright. Sharp minimax bounds for testing discrete monotone distributions. In IEEE International Symposium on Information Theory (ISIT), pages 2684–2688. IEEE, 2016.

[149] Y. Wei, M. J. Wainwright, and A. Guntuboyina. The geometry of hypothesis testing over convex cones: Generalized likelihood tests and minimax radii. arXiv preprint arXiv:1703.06810, 2017.

[150] Y. Wei, F. Yang, and M. J. Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6067–6077, 2017.

[151] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182, 2012.

[152] F. T. Wright. The asymptotic behavior of monotone regression estimates. Ann. Statist., 9(2):443–448, 1981.

[153] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.

[154] Y. Yang, M. Pilanci, and M. J. Wainwright. Randomized sketches for kernels: Fast and optimal non-parametric regression. Annals of Statistics, 2017. To appear.

[155] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[156] B. Yu. Assouad, Fano, and Le Cam. In D. Pollard, E. Torgersen, and G. L. Yang, editors, Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435. Springer-Verlag, New York, 1997.

[157] E. H. Zarantonello. Projections on convex sets in Hilbert spaces and spectral theory. In Contributions to Nonlinear Functional Analysis, pages 237–424. Academic Press, 1971.

[158] C.-H. Zhang. Risk bounds in isotonic regression. The Annals of Statistics, 30(2):528– 555, 2002.

[159] C.-H. Zhang. Risk bounds in isotonic regression. Ann. Statist., 30(2):528–555, 2002.

[160] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. Annals of Statistics, 33(4):1538–1579, 2005.