
An Information Geometry of Statistical Manifold Learning

Ke Sun ([email protected])
Stéphane Marchand-Maillet ([email protected])
Viper Group, Computer Vision & Multimedia Laboratory, University of Geneva, Switzerland

Abstract

Manifold learning seeks low-dimensional representations of high-dimensional data. The main tactics have been exploring the geometry in an observation space or an embedding space. This work instead studies the information geometry of a hypothesis space made of models.

NOTATIONS

X_n, Y_n, Z_n — respectively, a generic model, a MAL input model in ℝ^D, and a MAL output model in ℝ^d; det(·) — determinant; diag(x_1, ..., x_ℓ) — a diagonal matrix with x_1, ..., x_ℓ on its main diagonal.

Manifold learning (MAL), or non-linear dimensionality reduction, assumes that some given high-dimensional observations y_1, ..., y_n ∈ ℝ^D lie around a low-dimensional sub-manifold {Γ(z) : z ∈ ℝ^d}. This work studies MAL not in the observation space ℝ^D or an embedding space ℝ^d, but in a very high-dimensional hypothesis space made of models.

Definition 1. A model X_n is a specific instance of a set of vectors {x_1, ..., x_n}.

Remark 1.1. By default, a model denoted by X_n is a coordinate matrix (x_1, ..., x_n)^T. Alternatively, it can be implicitly specified by an n × n matrix of pairwise measurements, e.g., distances or (dis-)similarities.

Definition 2. A model family M is a smooth manifold consisting of continuous models.

For example in MAL, the input Y_n = (y_1, ..., y_n)^T or the output Z_n = (z_1, ..., z_n)^T is one single model. The model family M_z = {Z_n = (z_1, ..., z_n)^T : Σ_i z_i = 0; ∀i, z_i ∈ ℝ^d} includes all possible embeddings centered at 0. Then, MAL can be described as a projection Y → Z*(Y) ∈ M_z through convex optimization, or a path Z_0(Y), Z_1(Y), ..., Z*(Y) ∈ M_z along the gradient of some non-convex objective function.

1. Preliminaries

1.1. Manifold Learning

Given a model family M, any model X ∈ M, representing a collection of coordinates x_1, ..., x_n, can be encoded into n distributions over {1, 2, ..., n}:

    p_{j|i}(X) = exp(−s^X_{ij}) / Σ_{ℓ: ℓ≠i} exp(−s^X_{iℓ})   (∀j ≠ i),   p_{i|i} = 0 (∀i),        (1)

or one single distribution over {1, 2, ..., n}²:

    p_{ij}(X) = exp(−s^X_{ij}) / Σ_{k,ℓ: k≠ℓ} exp(−s^X_{kℓ})   (∀j ≠ i),   p_{ii} = 0 (∀i).        (2)

In either case, s^X_{ij} is a possibly non-symmetric difference measure between x_i and x_j, e.g., the squared Euclidean distance. After normalization, p represents the probability of x_i and x_j being similar. The subscript "j|i" of p in eq. (1) signifies a conditional probability; the subscript "ij" in eq. (2) signifies a joint probability. For example, SNE (Hinton & Roweis, 2003) applies s^X_{ij} = τ‖x_i − x_j‖² to eq. (1), where τ > 0 is a scalar; t-SNE applies s^X_{ij} = log(‖x_i − x_j‖² + 1) to eq. (2) for encoding the output. From a kernel view (Ham et al., 2004), any MAL technique that encodes into kernels naturally extends to such probabilities. It is not arbitrary but natural to employ eqs. (1) and (2) for statistical MAL, because they encode distributed local information.

Although a model X can have various forms, after the encoding p(X) lies in a unified space. In eq. (1), (p_{j|i}(X)) is a point on the product manifold (S^{n−1})^n, where S^{n−1} = {(p_1, ..., p_n) : ∀i, p_i > 0; Σ_{i=1}^n p_i = 1} is a statistical simplex consisting of all distributions over {1, 2, ..., n}. In eq. (2), (p_{ij}(X)) is a point on S^{n²−1}. Such a unified representation makes it possible to measure the difference between two models Y and Z with different original forms. It motivates us to develop an MAL theory on the statistical simplex regardless of the original representations.

1.2. Information Geometry

We introduce the Riemannian geometry of S^{n−1}, the (n − 1)-dimensional statistical simplex formed by all distributions of the form (p_1, ..., p_n). The reader is referred to (Jost, 2008; Amari & Nagaoka, 2000) for a thorough view. Any (p_1, ..., p_n) ∈ S^{n−1} uniquely corresponds to θ = (θ_1, ..., θ_n) via the invertible mapping θ_i = log(p_i/p_r), ∀i, where p_r (1 ≤ r ≤ n) is a reference probability. These canonical parameters θ serve as a global coordinate system of S^{n−1}. Around any θ ∈ S^{n−1}, the partial derivative operators {∂/∂θ_1, ..., ∂/∂θ_{r−1}, ∂/∂θ_{r+1}, ..., ∂/∂θ_n} represent the velocities passing through θ along the coordinate curves. An infinitesimal patch of S^{n−1} around θ can be studied as a linear space T_θ S^{n−1} = {Σ_{i: i≠r} α_i ∂/∂θ_i : ∀i, α_i ∈ ℝ} called the tangent space. A Riemannian metric G defines a local inner product ⟨∂/∂θ_i, ∂/∂θ_j⟩_{G(θ)} on each tangent space T_θ S^{n−1} and varies smoothly across different θ. Locally, it is given by the positive definite (p.d.) matrix G_{ij}(θ) = ⟨∂/∂θ_i, ∂/∂θ_j⟩_{G(θ)}. Under certain conditions (Čencov, 1982), the Riemannian metric of statistical manifolds, e.g. S^{n−1}, is uniquely given by the Fisher information metric (FIM) (Rao, 1945): G_{ij}(θ) = Σ_{k=1}^n p_k(θ) (∂ log p_k(θ)/∂θ_i)(∂ log p_k(θ)/∂θ_j).

Lemma 3. On S^{n−1}, G_{ij}(θ) = p_i(θ)δ_{ij} − p_i(θ)p_j(θ).

FIM grants us the power to measure information.
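As a concrete illustration of the encodings in eqs. (1) and (2), the following minimal NumPy sketch maps a coordinate model X_n to the conditional and joint probabilities, assuming the squared Euclidean difference s^X_{ij} = ‖x_i − x_j‖². The function names are illustrative and not from the paper.

```python
import numpy as np

def pairwise_sq_dists(X):
    """s_ij = ||x_i - x_j||^2 for a coordinate matrix X of shape (n, D)."""
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

def encode_conditional(S):
    """Eq. (1): row-wise normalization of exp(-s_ij), with p_{i|i} = 0."""
    # for widely spread data, subtract a per-row constant from S first for stability
    E = np.exp(-S)
    np.fill_diagonal(E, 0.0)                 # enforce p_{i|i} = 0
    return E / E.sum(axis=1, keepdims=True)  # each row lies on S^{n-1}

def encode_joint(S):
    """Eq. (2): one distribution over all ordered pairs, with p_{ii} = 0."""
    E = np.exp(-S)
    np.fill_diagonal(E, 0.0)
    return E / E.sum()                       # the whole matrix lies on S^{n^2-1}

X = np.random.randn(10, 3)                   # a toy model X_10 in R^3
P_cond = encode_conditional(pairwise_sq_dists(X))
P_joint = encode_joint(pairwise_sq_dists(X))
```

After encoding, both outputs live on statistical simplices ((S^{n−1})^n and S^{n²−1}, respectively), so two models of different original forms become directly comparable.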

Figure 1. A geometry of statistical manifold learning. (Diagram: a model family M_f on the left, containing M_y and M_z, with tangent space T_X M_f spanned by ∂/∂ϕ_i, ∂/∂ϕ_j; a (product of) statistical simplex S on the right, with tangent space T_{p(X)} S spanned by ∂/∂θ_i, ∂/∂θ_j. The probabilistic encoding p maps X to p(X); dp is the differential or pushforward, and the metric g on M_f is the pullback of G. Manifold learning appears as a statistical projection of p(Y) onto p(M_z), giving p(Z*(Y)).)
With respect to a coordinate system, e.g. the canonical parameters θ, the information density of a θ ∈ M is given by the Riemannian volume element √det(G(θ)) (Jost, 2008). It means the amount of information a single observation possesses with respect to θ. A small √det(G(θ)) means that θ contains much uncertainty and requires many observations to learn. Then, the information capacity of a statistical model family M ⊂ S^{n−1} is given by its volume |M| = ∫_{θ∈M} √det(G(θ)) dθ. It means the total amount of information, or the "number" of distinct models, contained in M (Myung et al., 2000). This volume is an intrinsic measure invariant to the choice of coordinate system.

To extend FIM to a general non-statistical manifold, e.g., a model family M^r parametrized by ϕ = (ϕ_1, ..., ϕ_r), one needs to construct a mapping from M^r to S^{n−1}. Following such a mapping, any tangent vector ∂/∂ϕ_i of M^r can be pushed forward to T_θ S^{n−1} as ∂/∂ϕ_i = Σ_{k=1}^n (∂θ_k/∂ϕ_i) ∂/∂θ_k. In this way, the inner product ⟨∂/∂ϕ_i, ∂/∂ϕ_j⟩ can be measured with FIM and used to define the Riemannian metric of M^r. Such a strategy is called pullback (Jost, 2008). This is intuitively shown in fig. 1. M^r (left) as in definition 2 mirrors the information geometry (right) through a probabilistic encoding p.

2. An Intrinsic Geometry of MAL

The central result of this paper is summarized as follows.

Theorem 4. Consider a model family M^r = {s_{ij}(ϕ) : ϕ = (ϕ_1, ..., ϕ_r)^T ∈ Φ}, where Φ ⊂ ℝ^r is open and the s_{ij}(ϕ) are differentiable. The pullback metric with respect to eq. (2) is the r × r matrix

    g(ϕ) = Σ_{k=1}^n Σ_{l=1}^n p_{kl} (∂s_{kl}/∂ϕ)(∂s_{kl}/∂ϕ)^T − ( Σ_{k=1}^n Σ_{l=1}^n p_{kl} ∂s_{kl}/∂ϕ ) ( Σ_{k=1}^n Σ_{l=1}^n p_{kl} ∂s_{kl}/∂ϕ )^T,

where s_{kl}, p_{l|k} and p_{kl} vary with ϕ as in eqs. (1) and (2); the pullback metric with respect to eq. (1) takes an analogous form, with the covariance of ∂s_{kl}/∂ϕ computed under the conditional distributions p_{l|k} row by row and summed over k.

Remark 4.1. g(ϕ) is a meta-metric. Its exact form depends on s_{ij}(ϕ) and the choice between eqs. (1) and (2). Different encodings lead to different geometries of information.

Remark 4.2. We leave the reader to verify g(ϕ) ⪰ 0. Therefore g(ϕ) is called a semi-Riemannian metric. "Semi" means g(ϕ) is positive semi-definite (p.s.d.) rather than p.d. A model X moving rigidly forms a null space (Jost, 2008), a model family with zero volume, meaning that such movements contribute zero information.

From theorem 4, the pullback metrics induced by eq. (1) and eq. (2) are in similar forms. Both are some covariance of ∂s_{kl}/∂ϕ. Such similarity agrees with previous studies where SNE and symmetric SNE show similar performance (Cook et al., 2007). Because of space limitation, the following results are derived based on the metric induced by eq. (1) only. The subtle difference between these two kinds of normalizations is left to a longer version.

A natural question arises as to what kind of geometry is induced on concrete model families. Corollary 5 quantifies the potential information in each x_i with the Riemannian volume element √det(⟨∂/∂x_i, ∂/∂x_i⟩_{g̃(X)}), where g̃(X) denotes the corresponding pullback metric. Such sample-wise information provides theoretical quantities for outlier identification or landmark selection in sub-sampling procedures (van der Maaten & Hinton, 2008). Corollary 5 gives a geometric interpretation of manifold kernel density estimation (Vincent & Bengio, 2003) as growing density to maximize the information variance on M_f.

The target of this paper is a learning theory centered on the above theorem 4 and supported by corollaries 5, 6 and 10 with illustrative simulations. We investigate two independent-yet-related model families, corresponding to information geometries of model complexity and model quality. We do not present a systematic comparison with other measurements. There are as many of them as there are manifold learners. None escapes the fact that it is measured in the observation space or the embedding space (Venna & Kaski, 2007; Zhang et al., 2011). In contrast, we regard all samples collectively as one point and measure information on a differentiable manifold of such models. In the history of statistical learning, similar information geometric theories contributed deep insights (Akaike, 1974; Amari & Nagaoka, 2000; Myung et al., 2000; Lebanon, 2003; Xu, 2004; Nock et al., 2011). We echo these previous works and adapt to recent developments of MAL. Given the uniqueness of FIM, the proposed measurements try to estimate the true information loss in MAL. This is more general than and fundamentally different from empirical measurements.

3. Locally Accumulated Information

We measure the complexity of a fixed model X = (x_1, ..., x_n)^T given by a matrix (s^X_{ij})_{n×n} of pairwise differences. We only study the case that s^X_{ij} = ‖x_i − x_j‖². Each x_i is treated as an observer that encodes its k-nearest-neighbourhood kNN_i through eq. (1), with

    s^X_{ij}(τ_i) = τ_i ‖x_i − x_j‖²  if j ∈ kNN_i, and +∞ otherwise,        (3)

or

    s^t_{ij}(τ_i) = log(1 + τ_i ‖x_i − x_j‖²)  if j ∈ kNN_i, and +∞ otherwise.        (4)

A larger τ_i > 0 zooms other samples near or far from x_i, so that information at different scales can be incorporated. k (2 ≤ k ≤ n − 1) specifies the visual range in the maximum number of samples that can be observed by any x_i. The purpose of k is to ignore distant relationships to make related computation scalable. Datasets of different size are measured on the same statistical manifold S^{k−1}. Such measurements are therefore comparable. As compared to eq. (3), the distribution defined by eq. (4) has higher entropy. Even if τ_i → ∞, meaning zero sight, distant pairs still occupy some probability mass. Hence, eq. (4) puts more emphasis on non-local information.

Once X and k are fixed, all possible configurations of τ = (τ_1, ..., τ_n) form an n-dimensional model manifold denoted as O^n_{X,k}(τ). Its geometry is given as follows.

Corollary 6. If the observers are characterized by eq. (3), the Riemannian metric of O^n_{X,k}(τ) is g(τ) = diag(g_1(τ_1), ..., g_n(τ_n)), where

    g_i(τ_i) = −∂/∂τ_i ( Σ_{j∈kNN_i} p_{j|i}(τ_i) s^X_{ij} ) = Σ_{j∈kNN_i} p_{j|i}(τ_i) (s^X_{ij})² − ( Σ_{j∈kNN_i} p_{j|i}(τ_i) s^X_{ij} )²;

if the observers in eq. (4) are used instead, the corresponding metric is g^t(τ) = diag(g^t_1(τ_1), ..., g^t_n(τ_n)), where

    g^t_i(τ_i) = Σ_{j∈kNN_i} p_{j|i}(τ_i) ( s^X_{ij} / (1 + τ_i s^X_{ij}) )² − ( Σ_{j∈kNN_i} p_{j|i}(τ_i) s^X_{ij} / (1 + τ_i s^X_{ij}) )².
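To make Theorem 4 and Corollary 6 concrete, here is a small NumPy sketch (our own illustrative code, not the authors') that evaluates the pullback metric for the joint encoding of eq. (2) as the covariance of ∂s_{kl}/∂ϕ, together with the scalar g_i(τ_i) of Corollary 6. It assumes the probabilities and partial derivatives are supplied as arrays.

```python
import numpy as np

def pullback_metric_joint(P, dS):
    """Pullback metric g(phi) of Theorem 4 for the joint encoding of eq. (2).

    P  : (n, n) array of joint probabilities p_kl (zero diagonal, sums to 1).
    dS : (n, n, r) array of partial derivatives ds_kl / dphi.
    Returns the r x r covariance of ds/dphi under p, which is p.s.d.
    """
    mean = np.einsum('kl,klr->r', P, dS)              # E_p[ds/dphi]
    second = np.einsum('kl,klr,kls->rs', P, dS, dS)   # E_p[(ds/dphi)(ds/dphi)^T]
    return second - np.outer(mean, mean)

def corollary6_gi(p_i, s_i):
    """g_i(tau_i) of Corollary 6: the variance of s_ij under p_{j|i}(tau_i),
    with both arrays restricted to the k nearest neighbours of x_i."""
    m = np.dot(p_i, s_i)
    return np.dot(p_i, s_i ** 2) - m ** 2

# sanity check on random inputs: the meta-metric is p.s.d. (Remark 4.2)
n, r = 8, 3
rng = np.random.default_rng(0)
dS = rng.standard_normal((n, n, r))
P = rng.random((n, n))
np.fill_diagonal(P, 0.0)
P /= P.sum()
g = pullback_metric_joint(P, dS)
assert np.all(np.linalg.eigvalsh(g) >= -1e-10)
```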

Table 1. Estimated intrinsic dimension (avg. ± std.) for each k ∈ {5, 10, ..., 100}.

DATASET      n     D      TRUE d     MLE            LAI            t-LAI
Spiral       200   3      1          1.3 ± 0.23     1.2 ± 0.21     1.1 ± 0.06
Swiss roll   500   3      2          2.1 ± 0.13     2.0 ± 0.08     2.0 ± 0.05
Faces        698   4k     ∼3         3.8 ± 0.42     3.4 ± 0.27     3.2 ± 0.15
Hands        481   245k   ∼2         2.2 ± 0.37     1.7 ± 0.21     2.0 ± 0.10
MNIST        60k   784    unknown    10.1 ± 0.27    9.7 ± 0.55     9.8 ± 0.26

Figure 2. Local dimension estimation.

We only discuss g(τ) and leave g^t(τ) for future extensions. The scale of O^n_{X,k}(τ) can be measured as follows.

Definition 7. The locally accumulated information (LAI) of X given the visual range k is defined as

    |O_k|(X) = (1/n) Σ_{i=1}^n ∫_0^∞ √(g_i(t)) dt.        (5)

Remark 7.1. O^n_{X,k}(τ) resembles an orthant (0, ∞)^n, and LAI measures its average side length.

To understand LAI, one needs to grasp the concept of information. Shannon's information, i.e. entropy, measures the absolute uncertainty of a single distribution. Fisher's information (Rao, 1945) measures the relative uncertainty within a continuous region of distributions (Myung et al., 2000). In eq. (5), each term in the sum is Fisher's information integrated along a statistical curve corresponding to a local observation process. LAI measures how much intrinsic difference, or how "many" distinct distributions, are observed while zooming the observation radius from 0 to ∞.

Proposition 8. ∀X, ∀k, ∀λ > 0, |O_k|(X) = |O_k|(λX).

Proposition 9. If ∀i, x_i's 1-NN is unique, then ∀k, arccos(1/√k) ≤ |O_k|(X) < ∞.

LAI is invariant to scaling (proposition 8). LAI is always finite and has a lower bound (proposition 9). This lower bound can be approached by approximately placing X as a regular n-simplex. This is only possible when the dimension of {x_i} is large enough. In fact, LAI reflects the intrinsic dimensionality. In high dimension, the pairwise distances present less variance (Bellman, 1961). Correspondingly, the curve of distributions defined by eqs. (3) and (4) is straighter and shorter. In low dimension, the observations at different scales are more different. Correspondingly, the curve of distributions is more bent. Consider a line of cities (London, Paris, Geneva, Rome). As an observer in London expands its sight, it discovers the other three cities one by one. On S², a curve starts from the Paris vertex, then bends towards Geneva, then bends towards Rome. In this model, LAI ≈ 5.3. A rectangular model (Berlin, Paris, Vienna, Marseille) has "higher dimensionality" and therefore a smaller value of LAI (≈ 5.1).

LAI can be conveniently calculated with an off-the-shelf numerical integrator. The computation involves n integrations, which can be reduced by averaging over a random subset of samples in eq. (5). The cost of each integration scales with k (k ≪ n). Overall, the computation is scalable.

A Dictionary-based Intrinsic Dimensionality Estimator

We generate random artificial data with different dimension D to build a dictionary of LAI values indexed by k and D. This dictionary is used to map any input data to an intrinsic dimensionality. Table 1 shows its performance compared to the maximum likelihood estimator (MLE) (Levina & Bickel, 2005) on several benchmark datasets, including a spiral with one intrinsic dimension, a Swiss roll with two intrinsic dimensions, an artificial face dataset (http://isomap.stanford.edu/datasets.html) rendered with different light directions and different orientations (three degrees of freedom), an image sequence (http://vasc.ri.cmu.edu//idb/html/motion/hand/index.html) recording a hand holding a bowl and rotating (around two degrees of freedom), and MNIST hand-written digits (http://yann.lecun.com/exdb/mnist). In order to suppress the curse of dimensionality (Bellman, 1961), a dataset is projected to ℝ^50 with principal component analysis (PCA) before going to the estimators. Overall, the LAI estimation is closer to the (estimated) ground truth with less variance. t-LAI is based on eq. (4) with similar definitions. It shows the best robustness to the choice of k, because s^t_{ij} is able to incorporate more non-local information. MNIST is closer to real-world datasets, where the ground truth is unknown and the intrinsic dimension varies from region to region. The large variance of (t-)LAI is because the intrinsic dimension changes with k. At a small scale (k = 5), the intrinsic dimension tends to be over-estimated (∼ 10) because of local high-dimensional noise. At a larger scale (k = 100), the intrinsic dimension is estimated as 8 ∼ 9 as the data manifold shows up. This observation agrees with the characteristics of real-world data.
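The "off-the-shelf numerical integrator" computation of eq. (5) can be sketched as follows, using g_i(τ) from Corollary 6 under the eq. (3) observers. This is our own illustrative code under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.integrate import quad

def g_i(tau, s_i):
    """g_i(tau) of Corollary 6: variance of s_ij under p_{j|i}(tau),
    where p_{j|i}(tau) is proportional to exp(-tau * s_ij) over the kNN of x_i."""
    w = np.exp(-tau * (s_i - s_i.min()))     # shifted exponent for numerical stability
    p = w / w.sum()
    m = np.dot(p, s_i)
    return np.dot(p, s_i ** 2) - m ** 2

def lai(X, k):
    """Locally accumulated information |O_k|(X), eq. (5)."""
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
    total = 0.0
    for i in range(n):
        s_i = np.sort(D2[i])[1:k + 1]        # squared distances to the k nearest neighbours
        total += quad(lambda t: np.sqrt(max(g_i(t, s_i), 0.0)),
                      0.0, np.inf, limit=200)[0]
    return total / n

# e.g. lai(np.random.randn(200, 3), k=10); averaging over a random subset of i
# reduces the n integrations for large datasets, as noted above.
```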
LAI can be used for local dimensionality estimation (Carter et al., 2010) by computing a local average of ∫_0^∞ √(g_i(t)) dt around any x_i. Figure 2 shows the side view of a candy bar dataset with a 1D stick, a 2D disk, and a 3D head. Its local dimension, shown by the colors, is well estimated.

4. A Gap between Two Models

A central problem in MAL is to define the difference between an input Y and an output Z. Then, the embedding quality can be evaluated, and MAL can be implemented through optimization. According to section 3, Y and Z individually extend to two families modeling their internal complexity. Our strategy is to continuously deform one family to the other along a bridging manifold. The volume of this bridge measures their intrinsic difference.

Fortunately, such a bridge exists for any given Y and Z. Consider the model family H^{2n}_{Y,Z,k} defined by

    s_{ij}(c) = a_i s^Y_{ij} + b_i s^Z_{ij}  if j ∈ kNN_i, and +∞ otherwise,        (6)

or

    s^t_{ij}(c) = a_i s^Y_{ij} + log(b_i s^Z_{ij} + 1)  if j ∈ kNN_i, and +∞ otherwise,        (7)

where ∀i, kNN_i = kNN_i(Y) ∪ kNN_i(Z) are the input or output neighbours of i, a_i > 0, b_i > 0, and c = (a_1, b_1, ..., a_n, b_n) serves as a global coordinate system. The boundary b = 0 deteriorates to a model family induced by Y (similar to O^n_{Y,k}). The boundary a = 0 deteriorates to a family induced by Z (similar to O^n_{Z,k}). Among all possible interpolations, eqs. (6) and (7) are simple and natural. H^{2n}_{Y,Z,k} defined in eq. (6) is somehow flat (Amari & Nagaoka, 2000). Its geometry is given as follows.

Corollary 10. With respect to eq. (6) and the global coordinate system c = (a_1, b_1, ..., a_n, b_n), the Riemannian metric of H^{2n}_{Y,Z,k} is in the block-diagonal form with 2 × 2 diagonal blocks (g^i_aa, g^i_ab; g^i_ba, g^i_bb), where ∀i,

    g^i_aa = Σ_{j∈kNN_i} p_{j|i} (s^Y_{ij})² − ( Σ_{j∈kNN_i} p_{j|i} s^Y_{ij} )²,
    g^i_bb = Σ_{j∈kNN_i} p_{j|i} (s^Z_{ij})² − ( Σ_{j∈kNN_i} p_{j|i} s^Z_{ij} )²,
    g^i_ab = g^i_ba = Σ_{j∈kNN_i} p_{j|i} s^Y_{ij} s^Z_{ij} − ( Σ_{j∈kNN_i} p_{j|i} s^Y_{ij} ) ( Σ_{j∈kNN_i} p_{j|i} s^Z_{ij} ).

The metric g^t(c) with respect to eq. (7) is obtained by replacing s^Z_{ij} with s^Z_{ij}/(1 + b_i s^Z_{ij}) in the above equations. Due to space limitation, the following discussion is only based on the geometry induced by eq. (6).

H^{2n}_{Y,Z,k} embeds all information regarding the intra- and inter-complexity of Y and Z. Across this bridge, we construct a low dimensional film, whose volume can be easily computed to estimate the closeness between the input and output boundaries. Consider the 2D sub-manifold H̄²_{Y,Z,k,κ} = {(a_1, b_1, ..., a_n, b_n) : ∀i, a_i = a τ_i(Y, κ), b_i = b τ_i(Z, κ); a ∈ (0, ∞); b ∈ (0, ∞); κ < k} with a global coordinate system (a, b). All observers in Y (resp. Z) simultaneously zoom according to one single parameter a (resp. b). A large value of a (resp. b) corresponds to high frequency local information; a small value of a (resp. b) corresponds to low frequency distant information. For each i, the scalars τ_i(Y, κ) and τ_i(Z, κ) are computed by binary search, so that the distribution p_{·|i} defined by eqs. (1), (6) and (7) has fixed entropy given by log κ and each sample has effectively the same number of neighbours given by κ (Hinton & Roweis, 2003); a sketch of this calibration is given after proposition 11 below. Such alignment is to drive the two lines a = 0 and b = 0 as close as possible on each side of the gap. The volume of the film H̄_{Y,Z,k,κ} in between these lines approximates the minimal effort needed to shift a continuous spectrum of information from one family to the other.

Proposition 11. The volume (area) of some region Ω on H̄_{Y,Z,k,κ} is

    |H̄_{k,κ,Ω}|(Y, Z) = ∫∫_Ω da db √[ ( Σ_{i=1}^n τ_i²(Y, κ) g^i_aa ) ( Σ_{i=1}^n τ_i²(Z, κ) g^i_bb ) − ( Σ_{i=1}^n τ_i(Y, κ) τ_i(Z, κ) g^i_ab )² ].        (8)
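The binary-search calibration of τ_i to a fixed entropy log κ can be sketched as follows, for the eq. (3)/(6)-style scaling of squared differences over the neighbours of x_i. The helper names and tolerances are our own assumptions, not part of the paper.

```python
import numpy as np

def entropy_of_tau(tau, s_i):
    """Shannon entropy (nats) of p_{.|i} when s_ij is scaled by tau over kNN_i."""
    w = np.exp(-tau * (s_i - s_i.min()))     # shifted exponent for numerical stability
    p = w / w.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def calibrate_tau(s_i, kappa, tol=1e-6, iters=100):
    """Binary search for tau such that the entropy of p_{.|i} equals log(kappa), kappa < k.

    The entropy decreases monotonically in tau: tau = 0 gives the uniform
    distribution over the k neighbours (entropy log k), while tau -> inf
    concentrates the mass on the nearest neighbour (entropy -> 0)."""
    target = np.log(kappa)
    lo, hi = 0.0, 1.0
    for _ in range(60):                      # grow the bracket until entropy(hi) <= target
        if entropy_of_tau(hi, s_i) <= target:
            break
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy_of_tau(mid, s_i) > target:
            lo = mid                         # entropy still too high: increase tau
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

This mirrors the usual perplexity calibration of SNE (Hinton & Roweis, 2003), with perplexity κ playing the role of the effective number of neighbours.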

Figure 3. Performance measurements of different embeddings. In each sub-figure, the color-map shows the local densities vol(a, b)dσ(a, b) over Ω. From left to right (resp. bottom to top), the input (resp. output) observation radius expands from 5 to 50. The x-axis (resp. y-axis) is linear in 1/a (resp. 1/b). The number below each square shows the volume, i.e. the integration of local densities in the corresponding color-map. Note, the colors are different between the two datasets.

Gap volumes per panel (columns: Random, PCA, Isomap, SNE, t-SNE):
Swiss roll:  (a) 5.3 × 10³   (b) 1.6 × 10³   (c) 4.6 × 10²   (d) 9.9 × 10²   (e) 8.1 × 10²
MNIST:       (f) 3.2 × 10³   (g) 2.8 × 10³   (h) 2.6 × 10³   (i) 2.3 × 10³   (j) 1.0 × 10³

Proposition 12. ∀Y, ∀Z, ∀λ_y > 0, ∀λ_z > 0, |H̄_{k,κ,Ω}|(Y, Z) = |H̄_{k,κ,Ω}|(λ_y Y, λ_z Z).

The region Ω means the information of interest in MAL. It can be chosen empirically as a rectangle to exclude low frequency information above the radius κ and high frequency information below a minimal radius k_s. Given Y and Z, the volume |H̄_{k,κ,Ω}|(Y, Z), shortly denoted as |H̄_Ω|, can be efficiently computed with a Monte Carlo integrator. It forms a theoretical objective of MAL that is scale-invariant (proposition 12).

We re-write eq. (8) as |H̄_Ω| = ∫∫_Ω vol(a, b) dσ(a, b). By noting that g^i_aa, g^i_bb, g^i_ab and g^i_ba in corollary 10 are all in the form of (co-)variances, the normalized scalar

    vol(a, b) = √( 1 − ( Σ_{i=1}^n τ_i(Y, κ) τ_i(Z, κ) g^i_ab )² / ( Σ_{i=1}^n τ_i²(Y, κ) g^i_aa · Σ_{i=1}^n τ_i²(Z, κ) g^i_bb ) )        (9)

ranges in [0, 1] and measures the overall linear correlation between s^Y_{ij} and s^Z_{ij} for the same i. The more linearly correlated s^Y_{ij} and s^Z_{ij}, the smaller the value of vol(a, b).

Proposition 13. |H̄_Ω| = 0 iff vol(a, b) = 0 for all (a, b) ∈ Ω.

vol(a, b) is usually positive (proposition 13) and tells the input-output agreement at a specific frequency. The term

    dσ(a, b) = √( Σ_{i=1}^n τ_i²(Y, κ) g^i_aa ) da · √( Σ_{i=1}^n τ_i²(Z, κ) g^i_bb ) db

measures the information density or interestingness at (a, b) based on separate observations on Y and Z (see section 3). Therefore, |H̄_Ω| can be understood as the linear agreement at different scales weighted by information density.

Two classical datasets in MAL, Swiss roll (n = 10³) and MNIST (n = 5 × 10³; five classes), are embedded into ℝ² by PCA, Isomap, SNE and t-SNE with typical configurations. The parameters k, κ and k_s are empirically set to 100, 50 and 5, respectively. The gap on MNIST is computed by randomly sampling 10³ input-output pairs to be comparable with Swiss roll. In fig. 3, the color-maps show vol(a, b)dσ(a, b) over Ω. Their appearances depend on the coordinate system of H̄_{Y,Z,k,κ}. Here, the axes are linear in 1/a and 1/b and represent the observation radii. For example, the upper-left corner means that the input (resp. output) information is examined at a radius of 5 (resp. 50) samples. The gap volume |H̄_Ω| below each color-map is independent of the coordinate system. The best method for each dataset (Isomap for Swiss roll; t-SNE for MNIST) is identified by the bluest square with the smallest volume.

The patterns on the color-maps tell more detailed information. The redness in the lower-right corner (e.g. fig. 3(b)) indicates that the original neighbours are heavily invaded in the embedding. Apparently, t-SNE outperforms SNE in this region. This is related to an information retrieval perspective (Venna & Kaski, 2007). By comparing fig. 3(f) with fig. 3(a), MNIST with high dimensional noise is closer to random data. It is more difficult to improve over random embeddings in this dataset. The spiky pattern in fig. 3(f-j) shows that some structural information that distinguishes MNIST from random data is forced into a thin band due to the high dimensionality (Bellman, 1961).

Most MAL techniques preserve a single frequency or scale of a specific type of local information. The result strongly favors this frequency and this type of information. The proposed gap yields a family of criteria that are less biased towards such choices. By mapping onto the statistical manifold, some redundant information is factored out. By aligning and seeking a minimal gap, an intrinsic difference is exposed. The integration over a spectrum gives an accurate estimation of the true information loss. In practice, computing the gap always faces the choice of a statistical encoding and associated parameters, e.g., k, κ, and k_s. However, the relative order of the gap volumes should be robust to such choices. Although the results are developed based on the SNE encodings, the gap volume is fundamentally different from SNE's objective and does not necessarily favor SNE.
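The Monte Carlo computation of |H̄_Ω| mentioned above can be sketched as follows, assuming Ω is an axis-aligned rectangle in the (a, b) coordinates and that the per-sample block entries of Corollary 10 are supplied by a callback. The interface and names are our own assumptions for illustration.

```python
import numpy as np

def gap_volume_mc(blocks, tau_y, tau_z, a_range, b_range, n_samples=2000, rng=None):
    """Monte Carlo estimate of |H_bar_Omega| = integral over Omega of vol(a,b) dsigma(a,b).

    blocks(a, b) -> (gaa, gbb, gab): per-sample block entries of Corollary 10
    evaluated at the point (a, b) of the film.
    tau_y, tau_z : arrays of tau_i(Y, kappa) and tau_i(Z, kappa).
    a_range, b_range : the rectangle Omega, e.g. observation radii between k_s and kappa.
    """
    rng = np.random.default_rng() if rng is None else rng
    area = (a_range[1] - a_range[0]) * (b_range[1] - b_range[0])
    acc = 0.0
    for _ in range(n_samples):
        a = rng.uniform(*a_range)
        b = rng.uniform(*b_range)
        gaa, gbb, gab = blocks(a, b)
        A = np.sum(tau_y ** 2 * gaa)             # Y-side information density term
        B = np.sum(tau_z ** 2 * gbb)             # Z-side information density term
        C = np.sum(tau_y * tau_z * gab)          # cross-covariance term
        acc += np.sqrt(max(A * B - C ** 2, 0.0)) # integrand of eq. (8) = vol(a,b) * density
    return area * acc / n_samples
```

Since the integrand equals vol(a, b) times the density dσ(a, b)/(da db), the same loop can also report the normalized agreement of eq. (9) per sampled point, which is what the color-maps of fig. 3 visualize.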
Figure 4. Internal complexity of Y and Z and their gap. (Diagram: manifold learning projects Y → Z; LAI scaling of Y and of Z forms the two boundaries of the gap H̄, with the region Ω in between.)

5. Related Works and Discussion

Information geometry (Rao, 1945; Čencov, 1982; Nagaoka & Amari, 1982) plays a vital role in statistical learning theory. MAL has been developed along a statistical approach. It is a natural and meaningful step to bridge the profound information geometry. Efforts (Weinberger et al., 2007; Carreira-Perpiñán, 2010; Vladymyrov & Carreira-Perpiñán, 2012) in seeking efficient MAL implicitly used such a geometry. There, a common technique is to bend the gradient ∇(Z) of a cost function into M^{−1}(Z)∇(Z), where M(Z) ≻ 0. This is equivalent to computing the gradient with respect to a Riemannian metric M(Z) on the solution space. Such a metric, however, has not been explicitly mentioned or formally studied.

Lebanon (2003; 2005) parametrized the Riemannian metric of a statistical simplex and proposed a metric learning objective to maximize the inverse volume element. Carter et al. (2009; 2011) studied MAL on collections of probability density functions. In these works, the subject is still a data geometry, where the observed data is assumed to lie on a statistical manifold. This is different from the picture shown here, which views all input or output information jointly as one point and studies its dynamics.

We formally introduce a semi-Riemannian geometry of a model manifold. It broadens our horizons so that MAL appears as a curve (figs. 1 and 4) and different manifold learners are viewed from a unified perspective. An intriguing aspect is that any volume corresponds to an amount of information. It can be measured to define intrinsic quantities. On two specific model manifolds, we demonstrate how to apply the theoretical results to measure the complexity and quality of models. These measurements are only briefly sketched here to support the learning theory. They can be further unfolded into meaningful theories. This work is summarized in fig. 4. A fundamental trade-off of MAL is to minimize the volume of the gap (lost information; see section 4) and to maximize the volume of the output (retained information; see section 3). To unify and combine LAI and the gap volume into one criterion, and to seek parameter-free invariants, are worthy directions of future work.

The gap volume in proposition 11 as a theoretical objective is hard to optimize directly. This is expected and fits in a usual two-stage learning scheme (Akaike, 1974; Schwarz, 1978; Xu, 2010). In the parameter learning stage, a simple objective function is optimized for each candidate model. In the model selection stage, a sophisticated criterion that better approximates the generalization error is computed to select the best model. We seek to derive simple approximations of the gap volume and develop related MAL algorithms.

Several possible extensions are discussed at the end of section 2. A problem that fits in the recent advancements (Vladymyrov & Carreira-Perpiñán, 2012; Yang et al., 2013) of MAL is to find efficient optimization based on generalizations of Amari's natural gradient (Amari & Nagaoka, 2000; Nock et al., 2011). A theoretical problem is to explore the relationship with graph Laplacian regularization (Belkin & Niyogi, 2003; Weinberger et al., 2007).

Acknowledgments

This work is jointly supported by the Swiss National Science Foundation (SNSF) and the European COST Action on Multilingual and Multifaceted Interactive Information Access (MUMIA) via the Swiss State Secretariat for Education and Research (SER project CC11.0043).

References

Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr., 19(6):716-723, 1974.

Amari, S. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379-1408, 1995.

Amari, S. and Nagaoka, H. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. AMS and OUP, 2000. (Published in Japanese in 1993).

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373-1396, 2003.

Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.

Carreira-Perpiñán, M. Á. The elastic embedding algorithm for dimensionality reduction. In ICML, pp. 167-174, 2010.

Carter, K. M., Raich, R., Finn, W. G., and Hero, A. O. FINE: Fisher Information Nonparametric Embedding. IEEE Trans. Pattern Anal. Mach. Intell., 31(11):2093-2098, 2009.

Carter, K. M., Raich, R., and Hero, A. O. On local intrinsic dimension estimation and its applications. IEEE Trans. Signal Processing, 58(2):650-663, 2010.

Carter, K. M., Raich, R., Finn, W. G., and Hero, A. O. Information-geometric dimensionality reduction. IEEE Signal Process. Mag., 28(2):89-99, 2011.

Čencov, N. N. Statistical Decision Rules and Optimal Inference, volume 53 of Translations of Mathematical Monographs. AMS, 1982. (Published in Russian in 1972).

Cook, J., Sutskever, I., Mnih, A., and Hinton, G. E. Visualizing similarity data with a mixture of maps. In AISTATS, JMLR: W&CP 2, pp. 67-74, 2007.

Ham, J., Lee, D. D., Mika, S., and Schölkopf, B. A kernel view of the dimensionality reduction of manifolds. In ICML, pp. 47-54, 2004.

Hinton, G. E. and Roweis, S. T. Stochastic Neighbor Embedding. In NIPS 15, pp. 833-840. 2003.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Jost, J. Riemannian Geometry and Geometric Analysis. Springer, 5th edition, 2008.

Lebanon, G. Learning Riemannian metrics. In UAI, pp. 362-369, 2003.

Lebanon, G. Riemannian geometry and statistical machine learning. PhD thesis, CMU, 2005.

Levina, E. and Bickel, P. J. Maximum likelihood estimation of intrinsic dimension. In NIPS 17, pp. 777-784. 2005.

Myung, J., Balasubramanian, V., and Pitt, M. A. Counting probability distributions: Differential geometry and model selection. PNAS, 97(21):11170-11175, 2000.

Nagaoka, H. and Amari, S. Differential geometry of smooth families of probability distributions. Technical Report METR 82-7, Univ. of Tokyo, 1982.

Nock, R., Magdalou, B., Briys, E., and Nielsen, F. On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the fool, the wise and the adaptive. In ICML, pp. 73-80, 2011.

Rao, C. R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Cal. Math. Soc., 37(3):81-91, 1945.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

Schwarz, G. Estimating the dimension of a model. Ann. Stat., 6(2):461-464, 1978.

Sha, F. and Saul, L. K. Analysis and extension of spectral methods for nonlinear dimensionality reduction. In ICML, pp. 784-791, 2005.

Sun, K., Bruno, E., and Marchand-Maillet, S. Stochastic unfolding. In MLSP, pp. 1-6, 2012.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

van der Maaten, L. J. P. and Hinton, G. E. Visualizing data using t-SNE. JMLR, 9(Nov):2579-2605, 2008.

Vapnik, V. N. Statistical Learning Theory. Wiley-Interscience, 1998.

Venna, J. and Kaski, S. Nonlinear dimensionality reduction as information retrieval. In AISTATS, JMLR: W&CP 2, pp. 572-579, 2007.

Vincent, P. and Bengio, Y. Manifold Parzen windows. In NIPS 15, pp. 825-832. 2003.

Vladymyrov, M. and Carreira-Perpiñán, M. Á. Partial-Hessian strategies for fast learning of nonlinear embeddings. In ICML, pp. 345-352, 2012.

Weinberger, K. Q., Sha, F., and Saul, L. K. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML, pp. 839-846, 2004.

Weinberger, K. Q., Sha, F., Zhu, Q., and Saul, L. K. Graph Laplacian regularization for large-scale semidefinite programming. In NIPS 19, pp. 1489-1496. 2007.

Xu, L. Advances on BYY harmony learning: Information theoretic perspective, generalized projection geometry, and independent factor auto-determination. IEEE Trans. Neural Networks, 15(4):885-902, 2004.

Xu, L. Machine learning problems from optimization perspective. J. Global Optim., 47(3):369-401, 2010.

Yang, Z., Peltonen, J., and Kaski, S. Scalable optimization of neighbor embedding for visualization. In ICML, JMLR: W&CP 28.2, pp. 127-135, 2013.

Zhang, J., Wang, Q., He, L., and Zhou, Z. H. Quantitative analysis of nonlinear embedding. IEEE Trans. Neural Networks, 22(12):1987-1998, 2011.