Geometrical Aspects of Statistical Learning Theory
Dissertation approved by the Department of Computer Science of the Technische Universität Darmstadt in fulfilment of the requirements for the academic degree of Doctor rerum naturalium (Dr. rer. nat.), submitted by Dipl.-Phys. Matthias Hein from Esslingen am Neckar.

Examination committee:
Chair: Prof. Dr. B. Schiele
First referee: Prof. Dr. T. Hofmann
Co-referee: Prof. Dr. B. Schölkopf

Date of submission: 30.9.2005
Date of defense: 9.11.2005

Darmstadt, 2005
Hochschulkennziffer: D17

Abstract

Geometry plays an important role in modern statistical learning theory, and many different aspects of geometry can be found in this fast-developing field. This thesis addresses some of these aspects. A large part of this work is concerned with so-called manifold methods, which have recently attracted a great deal of interest. The key point is that for many real-world data sets it is natural to assume that the data lies on a low-dimensional submanifold of a potentially high-dimensional Euclidean space. We develop a rigorous and quite general framework for the estimation and approximation of some geometric structures and other quantities of this submanifold, using certain corresponding structures on neighborhood graphs built from random samples of that submanifold.
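To make the neighborhood-graph construction concrete, the following is a minimal sketch, not code from the thesis: it samples points from a one-dimensional submanifold of R^3 (a circle), builds an eps-neighborhood graph with Gaussian edge weights, and forms the unnormalized graph Laplacian L = D - W together with the induced smoothness functional S(f) = f^T L f. The sample size, the radius eps, and the bandwidth h are illustrative choices, and the circle stands in for an arbitrary submanifold.

    import numpy as np

    # Sample n points from a 1-D submanifold of R^3 (a unit circle in the
    # x-y plane); the circle is an illustrative stand-in for the submanifold.
    rng = np.random.default_rng(0)
    n = 200
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    X = np.stack([np.cos(t), np.sin(t), np.zeros(n)], axis=1)

    # Squared Euclidean distances in the ambient space R^3.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

    # eps-neighborhood graph with Gaussian edge weights; eps and the
    # bandwidth h are hypothetical parameters, not values from the thesis.
    eps, h = 0.3, 0.2
    W = np.exp(-sq_dists / (2.0 * h**2)) * (sq_dists <= eps**2)
    np.fill_diagonal(W, 0.0)  # no self-loops

    # Degree function and unnormalized graph Laplacian L = D - W.
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W

    # Smoothness functional S(f) = f^T L f = (1/2) sum_ij w_ij (f_i - f_j)^2,
    # evaluated here for the first ambient coordinate as a function on the vertices.
    f = X[:, 0]
    print(f @ L @ f)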
Another part of this thesis deals with the generalization of the maximum margin principle to arbitrary metric spaces. This generalization follows quite naturally from a change of viewpoint on the well-known support vector machine (SVM): the SVM can be seen as an algorithm which applies the maximum margin principle to a subclass of metric spaces. The motivation to consider the generalization to arbitrary metric spaces arose from the observation that in practice the condition for the applicability of the SVM is rather difficult to check for a given metric. Nevertheless, one would like to apply the successful maximum margin principle even in cases where the SVM cannot be applied.

The last part deals with the specific construction of so-called Hilbertian metrics and positive definite kernels on probability measures. We consider several ways of building such metrics and kernels, with emphasis on the incorporation of different desired properties into the metric and kernel. Such metrics and kernels have a wide applicability in so-called kernel methods, since probability measures occur as inputs in various situations.
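To illustrate what a positive definite kernel on probability measures looks like in the simplest discrete case, here is a small sketch, not the construction developed in the thesis: the Bhattacharyya coefficient k(p, q) = sum_i sqrt(p_i q_i) is positive definite because it is the inner product of the feature maps p -> sqrt(p), and the Hellinger distance it induces is a Hilbertian metric. The two histograms are made-up examples.

    import numpy as np

    def bhattacharyya_kernel(p: np.ndarray, q: np.ndarray) -> float:
        # k(p, q) = sum_i sqrt(p_i * q_i); positive definite since it is the
        # inner product of the feature maps p -> sqrt(p).
        return float(np.sum(np.sqrt(p * q)))

    def hellinger_distance(p: np.ndarray, q: np.ndarray) -> float:
        # Hilbertian metric induced by the kernel:
        # d(p, q)^2 = k(p, p) + k(q, q) - 2 k(p, q) = ||sqrt(p) - sqrt(q)||^2.
        return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)))

    # Two made-up histograms, normalized to probability distributions.
    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.3, 0.3, 0.4])
    print(bhattacharyya_kernel(p, q), hellinger_distance(p, q))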
Summary (Zusammenfassung)

Geometry plays an important role in modern statistical learning theory, and many aspects of geometry can be found in this fast-developing field. This dissertation is concerned with some of these aspects. A large part of this work deals with so-called manifold methods. The main motivation is that for data sets arising in applications it is in many cases an accurate assumption that the data lies on a low-dimensional submanifold of a potentially high-dimensional Euclidean space. In this work, a mathematically rigorous and general framework for the estimation and approximation of geometric structures and other quantities of the submanifold is developed. To this end, corresponding structures on a neighborhood graph generated from a sample of points of the submanifold are used. A further part of this dissertation treats the generalization of the so-called maximum margin principle to general metric spaces. This generalization follows naturally from a new viewpoint on the so-called support vector machine (SVM). It is shown that the SVM can be seen as an algorithm which applies the maximum margin principle to a subclass of metric spaces. The motivation for this generalization arose from the problem, frequently encountered in practice, that the conditions for using a particular metric in the SVM are hard to verify. Nevertheless, one would like to use the successful maximum margin principle even in cases where the SVM cannot be applied. The concluding part of this work is concerned with the specific construction of so-called Hilbertian metrics and positive definite kernels on probability measures. Several ways of constructing such metrics and kernels are investigated. The emphasis lies on incorporating various desired properties into the metric and kernel, respectively. Such metrics and kernels have a wide range of applications in so-called kernel methods, since probability measures occur as inputs in the most diverse situations.

Academic Career of the Author

10/1996–02/2002: Studies of physics with mathematics as a minor subject at the Universität Tübingen.
02/2002: Diploma in physics. Topic of the diploma thesis: Numerical simulation of axisymmetric, isolated systems in general relativity. Supervisor: PD Dr. J. Frauendiener.
06/2002–11/2005: Research scientist at the Max Planck Institute for Biological Cybernetics in Tübingen, in the department of Prof. Dr. Bernhard Schölkopf.

Declaration (Erklärung)

I hereby declare that I have written this thesis independently, except for the aids explicitly mentioned in it.

Acknowledgements

First of all I would like to thank Bernhard Schölkopf for giving me the opportunity to do my doctoral thesis in an excellent research environment. He gave me the freedom to pursue my own lines of research while always providing ideas on how to progress. I also very much appreciated his advice and support in times when it was needed.

I am especially thankful to Olivier Bousquet for guiding me into the world of learning theory. In our long discussions we usually grazed through all sorts of topics, ranging from pure mathematics to machine learning to theoretical physics. This was very inspiring and raised my interest in several branches of mathematics. He always had time for questions and was a constant source of ideas for me.

I want to thank Thomas Hofmann for giving me the opportunity to do my thesis at the TU Darmstadt. I am very thankful for his support in these last steps towards the thesis.

A special thanks goes to Olaf Wittich for reading parts of the second chapter and for giving helpful comments which improved the clarity of this part.

During these three years I had the pleasure to work or discuss with several other nice people. They all influenced the way I think about learning theory. I thank all of them for their time and help: Jean-Yves Audibert, Goekhan Bakır, Stephane Boucheron, Olivier Chapelle, Jan Eichhorn, André Elisseeff, Matthias Franz, Arthur Gretton, Jeremy Hill, Kwang-In Kim, Malte Kuss, Matti Kääriäinen, Navin Lal, Cheng Soon Ong, Petra Philips, Carl Rasmussen, Gunnar Rätsch, Lorenzo Rosasco, Alexander Smola, Koji Tsuda, Ulrike von Luxburg, Felix Wichmann, Olaf Wittich, Dengyong Zhou, Alexander Zien, Laurent Zwald.

I would like to thank the whole AGBS team, and in particular all the PhD students in our lab, for a very nice atmosphere and a lot of fun. In particular I would like to thank our pioneer Ulrike von Luxburg for pleasant and helpful discussions and for the mutual support of our small 'theory' group, Navin Lal for a nice time here in Tübingen, Malte Kuss for providing me his Matlab script to produce the nice manifold figures, my office mate Arthur Gretton for his subtle jokes and the nice atmosphere, and all AOE participants for relaxing afterhours in our lab.

Finally I would like to thank my family for their unconditional help and support during my studies, and Kathrin for her understanding and for reminding me sometimes that there is more in life than a thesis.

Contents

1 Introduction
  1.1 Introduction to statistical learning theory
    1.1.1 Empirical risk minimization
    1.1.2 Regularized empirical risk minimization
  1.2 Geometry in statistical learning theory
  1.3 Summary of Contributions of this thesis

2 Consistent Continuum Limit for Graph Structure on Point Clouds
  2.1 Abstract Definition of the Graph Structure
    2.1.1 Hilbert spaces of functions on the vertices V and the edges E
    2.1.2 The difference operator d and its adjoint d*
    2.1.3 The general graph Laplacian
    2.1.4 The special case of an undirected graph
    2.1.5 Smoothness functionals for regularization on undirected graphs
  2.2 Submanifolds in R^d and associated operators
    2.2.1 Basics of submanifolds
    2.2.2 The weighted Laplacian and the continuous smoothness functional
  2.3 Continuum limit of the graph structure
    2.3.1 Notations and assumptions
    2.3.2 Asymptotics of Euclidean convolutions on the submanifold M
    2.3.3 Pointwise consistency of the degree function d, or kernel density estimation on a submanifold in R^d
    2.3.4 Pointwise consistency of the normalized and unnormalized graph Laplacian
    2.3.5 Weak consistency of H_V and the smoothness functional S(f)
    2.3.6 Summary and fixation of H_V by mutual consistency requirement
  2.4 Applications
    2.4.1 Intrinsic dimensionality estimation of submanifolds in R^d
  2.5 Appendix
    2.5.1 U-statistics

3 Kernels, Associated Structures and Generalizations
  3.1 Introduction
  3.2 Positive Definite