Group Theoretical Methods in Machine Learning Risi Kondor COLUMBIA
Group theoretical methods in machine learning
Risi Kondor
Unofficial, corrected version of Ph.D. thesis
Last updated May 2008.
COLUMBIA UNIVERSITY, 2008

Contents

Introduction
Overview

I Algebra

1 Groups and representations
  1.1 Groups
    1.1.1 Taxonomy of groups
    1.1.2 Elementary concepts
    1.1.3 Decomposing groups
    1.1.4 G-modules and group algebras
    1.1.5 Group actions and homogeneous spaces
    1.1.6 Lie groups
  1.2 Representation theory
    1.2.1 Character theory and Abelian groups
    1.2.2 G-modules
    1.2.3 Restricted and induced representations
  1.3 The symmetric group
    1.3.1 Representations
  1.4 The general linear group
    1.4.1 Representations

2 Harmonic analysis
  2.1 The classical Fourier transforms
  2.2 Fourier analysis on groups
  2.3 Spectral analysis on finite groups
    2.3.1 Factorial designs
    2.3.2 Analysis of data on the symmetric group

3 Fast Fourier transforms
  3.1 Fast Fourier transforms on Z_n
  3.2 Non-commutative FFTs
  3.3 Clausen's FFT for the symmetric group
    3.3.1 Extensions
  3.4 The Snob library

II Learning on groups

4 Hilbert space learning algorithms
  4.1 The statistical learning framework
    4.1.1 Empirical risk minimization
  4.2 The Hilbert space formalism
    4.2.1 Reproducing kernels
    4.2.2 The representer theorem
    4.2.3 Regularization theory
  4.3 Algorithms
    4.3.1 Gaussian processes
    4.3.2 Support vector machines
  4.4 Symmetries
  4.5 Kernels on groups and homogeneous spaces
    4.5.1 Positive definite functions

5 Learning permutations
  5.1 Regularized risk minimization
  5.2 Diffusion kernels
  5.3 Immanants and the PerMceptron
    5.3.1 The "PerMceptron"

6 Multi-object tracking
  6.1 A band-limited representation
    6.1.1 Noise model
    6.1.2 Observations
    6.1.3 Inference
  6.2 Efficient computation
    6.2.1 Performance
  6.3 Experiments

III Invariance

7 Invariant functionals
  7.1 The power spectrum and the bispectrum
  7.2 The skew spectrum
  7.3 Generalization to homogeneous spaces

8 Invariant features for images
  8.1 A projection onto the sphere
  8.2 Bispectral invariants of functions on S^2
    8.2.1 Relation to the bispectrum on SO(3)
  8.3 Experiments

Conclusions
Bibliography
List of Symbols
Index

Acknowledgements

I would like to express my gratitude to everybody who has, directly or indirectly, had a part in helping this thesis come to be. First and foremost, I would like to thank my advisor, Tony Jebara, who has supported me, helped me, guided me, and given me advice over countless meetings and brainstorming sessions. Thanks to the entire Columbia machine learning crew, especially Andy Howard, Bert Huang, Darrin Lewis, Rafi Pelossof, Blake Shaw, and Pannaga Shivaswamy.
A very special thanks to Rocco Servedio, who was always there when I needed help, both in scientific matters and in the more mundane side of being a Ph.D. student. I would like to thank my entire committee, comprised of Cliff Stein, Christina Leslie and Zoubin Ghahramani in addition to Tony and Rocco, for the amount of time they put into trying to make sense of my thesis, and the valuable comments they made on it. Further afield, I would first like to thank John Lafferty and Guy Lebanon from my CMU days before I came to Columbia. They might not even remember now, but it was a couple of discussions with Guy and John that gave me the idea of learning on groups in the first place. I would like to thank Dan Rockmore for the discussions I had with him at Dartmouth and Ramakrishna Kakarala for sending me a hard copy of his thesis without ever having met me. If it were not for Gábor Csányi, who, in addition to being one of my longest-standing friends, is also one of my scientific collaborators (yes, it's time we finally came out with that molecular dynamics paper, Gábor), I might never have stumbled upon the idea of the bispectrum. Balázs Szendrői, János Csirik and Susan Harrington also provided important mathematical (and emotional) support towards the end of my Ph.D. Finally, I would like to thank DOMA for the inspiring company and the endless stream of coffee.

Introduction

Ever since its discovery in 1807, the Fourier transform has been one of the mainstays of pure mathematics, theoretical physics, and engineering. The ease with which it connects the analytical and algebraic properties of function spaces; the particle and wave descriptions of matter; and the time and frequency domain descriptions of waves and vibrations makes the Fourier transform one of the great unifying concepts of mathematics. Deeper examination reveals that the logic of the Fourier transform is dictated by the structure of the underlying space itself.
Hence, the classical cases of functions on the real line, the unit circle, and the integers modulo n are only the beginning: harmonic analysis can be generalized to functions on any space on which a group of transformations acts. Here the emphasis is on the word group in the mathematical sense of an algebraic system obeying specific axioms. The group might even be non-commutative: the fundamental principles behind harmonic analysis are so general that they apply equally to commutative and non-commutative structures. Thus, the humble Fourier transform leads us into the depths of group theory and abstract algebra, arguably the most extensive formal system ever explored by humans. Should this be of any interest to the practitioner who has his eyes set on concrete applications of machine learning and statistical inference? Hopefully, the present thesis will convince the reader that the answer is an emphatic “yes”. One of the reasons why this is so is that groups are the mathematician’s way of capturing symmetries, and symmetries are all around us. Twentieth century physics has taught us just how powerful a tool symmetry principles are for prying open the secrets of nature. One could hardly ask for a better example of the power of mathematics than particle physics, which translates the abstract machinery of group theory into predictions about the behavior of the elementary building blocks of our universe. I believe that algebra will prove to be just as crucial to the science of data as it has proved to be to the sciences of the physical world. In probability theory and statistics it was Persi Diaconis who did much of the pioneering work in this realm, brilliantly expounded in his little book [Diaconis, 1988]. Since then, several other authors have also contributed to the field. In comparison, the algebraic side of machine learning has until now remained largely unexplored. The present thesis is a first step towards filling this gap. 
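To make the classical starting point concrete: on the integers modulo n, the group Fourier transform is just the ordinary discrete Fourier transform, and convolution over the group becomes pointwise multiplication of Fourier coefficients. It is exactly this property that representation theory extends to non-commutative groups. A minimal sketch in plain Python (the function names are illustrative, not from any library used in the thesis):

```python
# Illustrative sketch: harmonic analysis on the cyclic group Z_n.
# The Fourier transform on Z_n is the ordinary DFT, and it turns group
# convolution into pointwise multiplication of Fourier coefficients.
import cmath

def dft(f):
    """Fourier transform of f: Z_n -> C, with f_hat[k] = sum_x f[x] e^{-2*pi*i*k*x/n}."""
    n = len(f)
    return [sum(f[x] * cmath.exp(-2j * cmath.pi * k * x / n) for x in range(n))
            for k in range(n)]

def convolve(f, g):
    """Group convolution on Z_n: (f*g)[x] = sum_y f[y] g[(x - y) mod n]."""
    n = len(f)
    return [sum(f[y] * g[(x - y) % n] for y in range(n)) for x in range(n)]

f = [1.0, 2.0, 0.0, -1.0]
g = [0.5, 0.0, 1.0, 0.0]

# Convolution theorem: DFT(f * g) equals DFT(f) . DFT(g) pointwise.
lhs = dft(convolve(f, g))
rhs = [a * b for a, b in zip(dft(f), dft(g))]
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```

For a non-commutative group the Fourier coefficients are matrices rather than scalars, indexed by irreducible representations instead of frequencies, but the convolution theorem survives in the same form.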
The two main themes of the thesis are (a) learning on domains which have non-trivial algebraic structure; and (b) learning in the presence of invariances. Learning rankings or matchings is the classic example of the first situation, whilst rotation/translation/scale invariance in machine vision is probably the most immediate example of the latter. The thesis presents examples addressing real-world problems in these two domains. However, the beauty of the algebraic approach is that it allows us to discuss these matters on a more general, abstract level, so most of our results apply equally well to a large range of learning scenarios. The generality of our approach also means that we do not have to commit to just one learning paradigm (frequentist/Bayesian) or one group of algorithms (SVMs/graphical models/boosting/etc.). We do find that some of our ideas regarding symmetrization and learning on groups mesh best with the Hilbert space learning framework, so in Chapters 4 and 5 we focus on this methodology, but even there we take a comparative stance, contrasting the SVM with Gaussian processes and a modified version of the Perceptron. One of the reasons why up until now abstract algebra has not had a larger impact on the applied side of computer science is that it is often perceived as a very theoretical field, where computations are difficult if not impossible due to the sheer size of the objects at hand. For example, while permutations obviously enter many applied problems, calculations on the full symmetric group (permutation group) are seldom viable, since it has n! elements. However, recently abstract algebra has developed a strong computational side [Bürgisser et al., 1997]. The core algorithms of this new computational algebra, such as the non-commutative FFTs discussed in detail