Riemannian Geometry and Statistical Machine Learning

Riemannian Geometry and Statistical Machine Learning Doctoral Thesis Guy Lebanon Language Technologies Institute School of Computer Science Carnegie Mellon University [email protected] January 31, 2005 Abstract Statistical machine learning algorithms deal with the problem of selecting an appropriate N N statistical model from a model space Θ based on a training set xi i=1 or (xi,yi) i=1 { } ⊂ X { } ⊂ . In doing so they either implicitly or explicitly make assumptions on the geometries X × Y of the model space Θ and the data space . Such assumptions are crucial to the success of X the algorithms as different geometries are appropriate for different models and data spaces. By studying these assumptions we are able to develop new theoretical results that enhance our understanding of several popular learning algorithms. Furthermore, using geometrical reasoning we are able to adapt existing algorithms such as radial basis kernels and linear margin classifiers to non-Euclidean geometries. Such adaptation is shown to be useful when the data space does not exhibit Euclidean geometry. In particular, we focus in our experiments on the space of text documents that is naturally associated with the Fisher information metric on corresponding multinomial models. Thesis Committee: John Lafferty (chair) Geoffrey J. Gordon, Michael I. Jordan, Larry Wasserman 1 Acknowledgements This thesis contains work that I did during the years 2001-2004 at Carnegie Mellon University. During that period I received a lot of help from faculty, students and friends. However, I feel that I should start by thanking my family: Alex, Anat and Eran Lebanon, for their support during my time at Technion and Carnegie Mellon. Similarly, I thank Alfred Bruckstein, Ran El-Yaniv, Michael Lindenbaum and Hava Siegelmann from the Technion for helping me getting started with research in computer science. At Carnegie Mellon University I received help from a number of people, most importantly my advisor John Lafferty. John helped me in many ways. He provided excellent technical hands-on assistance, as well as help on high-level and strategic issues. Working with John was a very pleasant and educational experience. It fundamentally changed the way I do research and turned me into a better researcher. I also thank John for providing an excellent environment for research without distractions and for making himself available whenever I needed. I thank my thesis committee members Geoffrey J. Gordon, Michael I. Jordan and Larry Wasser- man for their helpful comments. I benefited from interactions with several graduate students at Carnegie Mellon University. Risi Kondor, Leonid Kontorovich, Luo Si, Jian Zhang and Xioajin Zhu provided helpful comments on different parts of the thesis. Despite not helping directly on topics related to the thesis, I benefited from interactions with Chris Meek at Microsoft Research, Yoram Singer at the Hebrew University and Joe Verducci and Douglas Critchlow at Ohio State University. These interactions improved my understanding of machine learning and helped me write a better thesis. Finally, I want to thank Katharina Probst, my girlfriend for the past two years. Her help and support made my stay at Carnegie Mellon very pleasant, despite the many stressful factors in the life of a PhD student. I also thank her for patiently putting up with me during these years. 2 Contents 1 Introduction 7 2 Relevant Concepts from Riemannian Geometry 10 2.1 Topological and Differentiable Manifolds . ............. 10 2.2 TheTangentSpace................................. 13 2.3 RiemannianManifolds . .. .. .. .. .. .. .. .. ...... 14 3 Previous Work 16 4 Geometry of Spaces of Probability Models 19 4.1 Geometry of Non-Parametric Spaces . ......... 20 4.2 Geometry of Non-Parametric Conditional Spaces . ............. 22 4.3 Geometry of Spherical Normal Spaces . ......... 25 5 Geometry of Conditional Exponential Models and AdaBoost 26 5.1 Definitions...................................... 28 5.2 Correspondence Between AdaBoost and Maximum Likelihood............. 29 5.2.1 The Dual Problem (P1∗).............................. 30 5.2.2 The Dual Problem (P2∗).............................. 31 5.2.3 Specialcases .................................. 32 5.3 Regularization .................................. ..... 34 5.4 Experiments..................................... 36 6 Axiomatic Geometry for Conditional Models 39 6.1 Congruent Embeddings by Markov Morphisms of ConditionalModels ........ 41 6.2 A Characterization of Metrics on Conditional Manifolds ................ 43 6.2.1 ThreeUsefulTransformation . ...... 44 6.2.2 The Characterization Theorem . ...... 46 6.2.3 Normalized Conditional Models . ....... 52 6.3 A Geometric Interpretation of Logistic Regression and AdaBoost ........... 53 6.4 Discussion...................................... 55 7 Data Geometry Through the Embedding Principle 56 7.1 Statistical Analysis of the Embedding Principle . ............... 58 8 Diffusion Kernels on Statistical Manifolds 60 8.1 Riemannian Geometry and the Heat Kernel . ......... 62 8.1.1 TheHeatKernel ................................. 63 8.1.2 Theparametrixexpansion. ..... 65 8.2 RoundingtheSimplex .............................. ..... 67 8.3 Spectral Bounds on Covering Numbers and Rademacher Averages .......... 68 8.3.1 CoveringNumbers ............................... 68 8.3.2 RademacherAverages . 70 3 8.4 Experimental Results for Text Classification . .............. 73 8.5 Experimental Results for Gaussian Embedding . ............ 84 8.6 Discussion...................................... 86 9 Hyperplane Margin Classifiers 87 9.1 Hyperplanes and Margins on Sn .............................. 88 n 9.2 Hyperplanes and Margins on S+ ............................. 90 9.3 Logistic Regression on the Multinomial Manifold . .............. 94 9.4 Hyperplanes in Riemannian Manifolds . .......... 95 9.5 Experiments..................................... 97 10 Metric Learning 101 10.1 TheMetricLearningProblem . ........102 10.2 A Parametric Class of Metrics . .........103 10.3 An Inverse-Volume Probabilistic Model on the Simplex . ................108 10.3.1 Computing the Normalization Term . .......109 10.4 Application to Text Classification . ............109 10.5Summary ........................................ 113 11 Discussion 113 A Derivations Concerning Boosting and Exponential Models 115 A.1 Derivation of the Parallel Updates . ..........115 A.1.1 ExponentialLoss ............................... 115 A.1.2 Maximum Likelihood for Exponential Models . .........116 A.2 Derivation of the Sequential Updates . ...........117 A.2.1 ExponentialLoss ............................... 117 A.2.2 Log-Loss...................................... 118 A.3 RegularizedLossFunctions . ........118 A.3.1 Dual Function for Regularized Problem . ........118 A.3.2 Exponential Loss–Sequential update rule . ...........119 A.4 Divergence Between Exponential Models . ...........119 B Gibbs Sampling from the Posterior of Dirichlet Process Mixture Model based on a Spherical Normal Distribution 120 C The Volume Element of a Family of Metrics on the Simplex 121 C.1 The Determinant of a Diagonal Matrix plus a Constant Matrix ............121 C.2 The Differential Volume Element of F ........................123 λ∗J D Summary of Major Contributions 124 4 Mathematical Notation Following is a list of the most frequent mathematical notations in the thesis. Set/manifold of data points X Θ Set/manifold of statistical models X Y Cartesian product of sets/manifold × Xk The Cartesian product of X with itself k times The Euclidean norm k · k , Euclidean dot product between two vectors h· ·i p(x ; θ) A probability model for x parameterized by θ p(y x ; θ) A probability model for y, conditioned on x and parameterized by θ | D( , ), D ( , ) I-divergence between two models · · r · · T The tangent space to the manifold at x xM M ∈ M g (u, v) A Riemannian metric at x associated with the tangent vectors u, v T x ∈ xM ∂ n , e n The standard basis associated with the vector space T = Rn { i}i=1 { i}i=1 xM ∼ G(x) The Gram matrix of the metric g, [G(x)]ij = gx(ei, ej) (u, v) The Fisher information metric at the model θ associated with the vectors u, v Jθ δ (u, v) The induced Euclidean local metric δ (u, v)= u, v x x h i δx,y Kronecker’s delta δx,y =1 if x = y and 0 otherwise ι : A X The inclusion map ι(x)= x from A X to X. → ⊂ N, Q, R The natural, rational and real numbers respectively R+ The set of positive real numbers Rk m The set of real k m matrices × × [A]i The i-row of the matrix A ∂ k,m The standard basis associated with T Rk m { ab}a,b=1 x × X The topological closure of X Hn The upper half plane x Rn : x 0 { ∈ n ≥ } Sn The n-sphere Sn = x Rn+1 : x2 = 1 { ∈ i i } Sn The positive orthant of the n-sphere Sn = x Rn+1 : x2 = 1 + P + { ∈ + i i } P The n-simplex P = x Rn+1 : x = 1 n n { ∈ + i i } P f g Function composition f g(x)= f(g(x)) ◦ ◦ P C ( , ) The set of infinitely differentiable functions from to ∞ M N M N f u The push-forward map f : Tx Tf(x) of u associated with f : ∗ ∗ M→ N M→N f g The pull-back metric on associated with ( , g) and f : ∗ M N M→N d ( , ), d( , ) The geodesic distance associated with the metric g g · · · · dvol g(θ) The volume element of the metric gθ, dvol g(θ)= √det gθ = det G(θ) Y The covariant derivative of the vector field Y in the direction ∇X p of the vector field X associated with the connection ∇ 5 ℓ(θ) The log-likelihood function (θ) The AdaBoost.M2 exponential loss function E p˜ Empirical distribution associated with a training set D ⊂X×Y uˆ The L2 normalized version of the vector u d(x,S) The distance from a point to a set d(x,S) = infy S d(x,y) ∈ 6 1 Introduction There are two fundamental spaces in machine learning. The first space consists of data points X and the second space Θ consists of possible learning models. In statistical learning, Θ is usually a space of statistical models, p(x ; θ) : θ Θ in the generative case or p(y x ; θ) : θ Θ in the { ∈ } { | ∈ } discriminative case. The space Θ can be either a low dimensional parametric space as in parametric statistics, or the space of all possible models as in non-parametric statistics.

Load more