
An Information Geometry of Statistical Manifold Learning

Ke Sun ([email protected])
Stéphane Marchand-Maillet ([email protected])
Viper Group, Computer Vision & Multimedia Laboratory, University of Geneva, Switzerland

Abstract

Manifold learning seeks low-dimensional representations of high-dimensional data. The main tactics have been exploring the geometry in an observation space or an embedding space. This work instead studies the information geometry of a hypothesis space made of models.

NOTATIONS

X_n, Y_n, Z_n — respectively, a generic model, a MAL input model in ℝ^D, and a MAL output model in ℝ^d; det(·) — determinant; diag(x_1, ..., x_ℓ) — a diagonal matrix with x_1, ..., x_ℓ on its main diagonal.

Manifold learning (MAL), or non-linear dimensionality reduction, assumes that some given high-dimensional observations y_1, ..., y_n ∈ ℝ^D lie around a low-dimensional sub-manifold {Γ(z) : z ∈ ℝ^d}. This work studies MAL not in the observation space ℝ^D or an embedding space ℝ^d, but in a very high-dimensional hypothesis space made of models.

Definition 1. A model X_n is a specific instance of a set of vectors {x_1, ..., x_n}.

Remark 1.1. By default, a model denoted by X_n is a coordinate matrix (x_1, ..., x_n)^T. Alternatively, it can be implicitly specified by an n × n matrix of pairwise measurements, e.g., distances or (dis-)similarities.

Definition 2. A model family M is a smooth manifold consisting of continuous models.

For example in MAL, the input Y_n = (y_1, ..., y_n)^T or the output Z_n = (z_1, ..., z_n)^T is one single model. The model family M_z = {Z_n = (z_1, ..., z_n)^T : Σ_i z_i = 0; ∀i, z_i ∈ ℝ^d} includes all possible embeddings centered at 0. Then, MAL can be described as a projection Y → Z*(Y) ∈ M_z through convex optimization, or a path Z_0(Y), Z_1(Y), ..., Z*(Y) ∈ M_z along the gradient of some non-convex objective function.

1. Preliminaries

1.1. Manifold Learning

Given a model family M, any model X ∈ M, representing a collection of coordinates x_1, ..., x_n, can be encoded into n distributions over {1, 2, ..., n}:

    p_{j|i}(X) = exp(−s^X_{ij}) / Σ_{ℓ: ℓ≠i} exp(−s^X_{iℓ})   (∀j ≠ i),   p_{i|i} = 0 (∀i),        (1)

or one single distribution over {1, 2, ..., n}²:

    p_{ij}(X) = exp(−s^X_{ij}) / Σ_{k,ℓ: k≠ℓ} exp(−s^X_{kℓ})   (∀j ≠ i),   p_{ii} = 0 (∀i).        (2)

In either case, s^X_{ij} is a possibly non-symmetric difference measure between x_i and x_j, e.g., the squared Euclidean distance. After normalization, p represents the probability of x_i and x_j being similar. The subscript "j|i" of p in eq. (1) signifies a conditional probability; the subscript "ij" in eq. (2) signifies a joint probability. For example, SNE (Hinton & Roweis, 2003) applies s^X_{ij} = τ‖x_i − x_j‖² to eq. (1), where τ > 0 is a scalar; t-SNE applies s^X_{ij} = log(‖x_i − x_j‖² + 1) to eq. (2) for encoding the output. From a kernel view (Ham et al., 2004), any MAL technique that encodes into kernels naturally extends to such probabilities. It is not arbitrary but natural to employ eqs. (1) and (2) for statistical MAL, because they encode distributed local information.

Although a model X can have various forms, after the encoding p(X) lies in a unified space. In eq. (1), (p_{j|i}(X)) is a point on the product manifold (S^{n−1})^n, where S^{n−1} = {(p_1, ..., p_n) : ∀i, p_i > 0; Σ_{i=1}^n p_i = 1} is a statistical simplex consisting of all distributions over {1, 2, ..., n}. In eq. (2), (p_{ij}(X)) is a point on S^{n²−1}. Such a unified representation makes it possible to measure the difference between two models Y and Z with different original forms. It motivates us to develop an MAL theory on the statistical simplex regardless of the original representations.

1.2. Information Geometry

We introduce the Riemannian geometry of S^{n−1}, the (n − 1)-dimensional statistical simplex formed by all distributions of the form (p_1, ..., p_n). The reader is referred to (Jost, 2008; Amari & Nagaoka, 2000) for a thorough view. Any (p_1, ..., p_n) ∈ S^{n−1} uniquely corresponds to θ = (θ_1, ..., θ_n) via the invertible mapping θ_i = log(p_i/p_r), ∀i, where p_r (1 ≤ r ≤ n) is a reference probability. These canonical parameters θ serve as a global coordinate system of S^{n−1}. Around any θ ∈ S^{n−1}, the partial derivative operators {∂/∂θ_1, ..., ∂/∂θ_{r−1}, ∂/∂θ_{r+1}, ..., ∂/∂θ_n} represent the velocities passing through θ along the coordinate curves. An infinitesimal patch of S^{n−1} around θ can be studied as a linear space T_θ S^{n−1} = {Σ_{i: i≠r} α_i ∂/∂θ_i : ∀i, α_i ∈ ℝ} called the tangent space. A Riemannian metric G defines a local inner product ⟨∂/∂θ_i, ∂/∂θ_j⟩_{G(θ)} on each tangent space T_θ S^{n−1} and varies smoothly across different θ. Locally, it is given by the positive definite (p.d.) matrix G_{ij}(θ) = ⟨∂/∂θ_i, ∂/∂θ_j⟩_{G(θ)}. Under certain conditions (Čencov, 1982), the Riemannian metric of statistical manifolds, e.g. S^{n−1}, is uniquely given by the Fisher information metric (FIM) (Rao, 1945): G_{ij}(θ) = Σ_{k=1}^n p_k(θ) (∂ log p_k(θ)/∂θ_i)(∂ log p_k(θ)/∂θ_j).

Lemma 3. On S^{n−1}, G_{ij}(θ) = p_i(θ)δ_{ij} − p_i(θ)p_j(θ).

FIM grants us the power to measure information.
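As a concrete illustration of the encodings in eqs. (1) and (2), the following minimal NumPy sketch maps a coordinate model X_n to the conditional and joint probabilities, assuming the squared Euclidean difference s^X_{ij} = ‖x_i − x_j‖². The function names are illustrative and not from the paper.

```python
import numpy as np

def pairwise_sq_dists(X):
    """s_ij = ||x_i - x_j||^2 for a coordinate matrix X of shape (n, D)."""
    sq = np.sum(X ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * X @ X.T

def encode_conditional(S):
    """Eq. (1): row-wise normalization of exp(-s_ij), with p_{i|i} = 0."""
    # for widely spread data, subtract a per-row constant from S first for stability
    E = np.exp(-S)
    np.fill_diagonal(E, 0.0)                 # enforce p_{i|i} = 0
    return E / E.sum(axis=1, keepdims=True)  # each row lies on S^{n-1}

def encode_joint(S):
    """Eq. (2): one distribution over all ordered pairs, with p_{ii} = 0."""
    E = np.exp(-S)
    np.fill_diagonal(E, 0.0)
    return E / E.sum()                       # the whole matrix lies on S^{n^2-1}

X = np.random.randn(10, 3)                   # a toy model X_10 in R^3
P_cond = encode_conditional(pairwise_sq_dists(X))
P_joint = encode_joint(pairwise_sq_dists(X))
```

After encoding, both outputs live on statistical simplices ((S^{n−1})^n and S^{n²−1}, respectively), so two models of different original forms become directly comparable.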

Figure 1. A geometry of statistical manifold learning. (Diagram: a model family M_f on the left, containing M_y and M_z, with tangent space T_X M_f spanned by ∂/∂ϕ_i, ∂/∂ϕ_j; a (product of) statistical simplex S on the right, with tangent space T_{p(X)} S spanned by ∂/∂θ_i, ∂/∂θ_j. The probabilistic encoding p maps X to p(X); dp is the differential or pushforward, and the metric g on M_f is the pullback of G. Manifold learning appears as a statistical projection of p(Y) onto p(M_z), giving p(Z*(Y)).)
With respect to a coordinate system, e.g. the canonical parameters θ, the information density of a θ ∈ M is given by the Riemannian volume element √det(G(θ)) (Jost, 2008). It means the amount of information a single observation possesses with respect to θ. A small √det(G(θ)) means that θ contains much uncertainty and requires many observations to learn. Then, the information capacity of a statistical model family M ⊂ S^{n−1} is given by its volume |M| = ∫_{θ∈M} √det(G(θ)) dθ. It means the total amount of information, or the "number" of distinct models, contained in M (Myung et al., 2000). This volume is an intrinsic measure invariant to the choice of coordinate system.

To extend FIM to a general non-statistical manifold, e.g., a model family M^r parametrized by ϕ = (ϕ_1, ..., ϕ_r), one needs to construct a mapping from M^r to S^{n−1}. Following such a mapping, any tangent vector ∂/∂ϕ_i of M^r can be pushed forward to T_θ S^{n−1} as ∂/∂ϕ_i = Σ_{k=1}^n (∂θ_k/∂ϕ_i) ∂/∂θ_k. In this way, the inner product ⟨∂/∂ϕ_i, ∂/∂ϕ_j⟩ can be measured with FIM and used to define the Riemannian metric of M^r. Such a strategy is called pullback (Jost, 2008). This is intuitively shown in fig. 1. M^r (left) as in definition 2 mirrors the information geometry (right) through a probabilistic encoding p.

2. An Intrinsic Geometry of MAL

The central result of this paper is summarized as follows.

Theorem 4. Consider a model family M^r = {s_{ij}(ϕ) : ϕ = (ϕ_1, ..., ϕ_r)^T ∈ Φ}, where Φ ⊂ ℝ^r is open and the s_{ij}(ϕ) are differentiable. The pullback metric with respect to eq. (2) is the r × r matrix

    g(ϕ) = Σ_{k=1}^n Σ_{l=1}^n p_{kl} (∂s_{kl}/∂ϕ)(∂s_{kl}/∂ϕ)^T − ( Σ_{k=1}^n Σ_{l=1}^n p_{kl} ∂s_{kl}/∂ϕ ) ( Σ_{k=1}^n Σ_{l=1}^n p_{kl} ∂s_{kl}/∂ϕ )^T,

where s_{kl}, p_{l|k} and p_{kl} vary with ϕ as in eqs. (1) and (2); the pullback metric with respect to eq. (1) takes an analogous form, with the covariance of ∂s_{kl}/∂ϕ computed under the conditional distributions p_{l|k} row by row and summed over k.

Remark 4.1. g(ϕ) is a meta-metric. Its exact form depends on s_{ij}(ϕ) and the choice between eqs. (1) and (2). Different encodings lead to different geometries of information.

Remark 4.2. We leave the reader to verify g(ϕ) ⪰ 0. Therefore g(ϕ) is called a semi-Riemannian metric. "Semi" means g(ϕ) is positive semi-definite (p.s.d.) rather than p.d. A model X moving rigidly forms a null space (Jost, 2008), a model family with zero volume, meaning that such movements contribute zero information.

From theorem 4, the pullback metrics induced by eq. (1) and eq. (2) are in similar forms. Both are some covariance of ∂s_{kl}/∂ϕ. Such similarity agrees with previous studies where SNE and symmetric SNE show similar performance (Cook et al., 2007). Because of space limitation, the following results are derived based on the metric induced by eq. (1) only. The subtle difference between these two kinds of normalizations is left to a longer version.

A natural question arises as to what kind of geometry is induced on concrete model families. Corollary 5 quantifies the potential information in each x_i with the Riemannian volume element √det(⟨∂/∂x_i, ∂/∂x_i⟩_{g̃(X)}), where g̃(X) denotes the corresponding pullback metric. Such sample-wise information provides theoretical quantities for outlier identification or landmark selection in sub-sampling procedures (van der Maaten & Hinton, 2008). Corollary 5 gives a geometric interpretation of manifold kernel density estimation (Vincent & Bengio, 2003) as growing density to maximize the information variance on M_f.

The target of this paper is a learning theory centered on the above theorem 4 and supported by corollaries 5, 6 and 10 with illustrative simulations. We investigate two independent-yet-related model families, corresponding to information geometries of model complexity and model quality. We do not present a systematic comparison with other measurements. There are as many of them as there are manifold learners. None escapes the fact that it is measured in the observation space or the embedding space (Venna & Kaski, 2007; Zhang et al., 2011). In contrast, we regard all samples collectively as one point and measure information on a differentiable manifold of such models. In the history of statistical learning, similar information geometric theories contributed deep insights (Akaike, 1974; Amari & Nagaoka, 2000; Myung et al., 2000; Lebanon, 2003; Xu, 2004; Nock et al., 2011). We echo these previous works and adapt to recent developments of MAL. Given the uniqueness of FIM, the proposed measurements try to estimate the true information loss in MAL. This is more general than and fundamentally different from empirical measurements.

3. Locally Accumulated Information

We measure the complexity of a fixed model X = (x_1, ..., x_n)^T given by a matrix (s^X_{ij})_{n×n} of pairwise differences. We only study the case that s^X_{ij} = ‖x_i − x_j‖². Each x_i is treated as an observer that encodes its k-nearest-neighbourhood kNN_i through eq. (1), with

    s^X_{ij}(τ_i) = τ_i ‖x_i − x_j‖²  if j ∈ kNN_i, and +∞ otherwise,        (3)

or

    s^t_{ij}(τ_i) = log(1 + τ_i ‖x_i − x_j‖²)  if j ∈ kNN_i, and +∞ otherwise.        (4)

A larger τ_i > 0 zooms other samples near or far from x_i, so that information at different scales can be incorporated. k (2 ≤ k ≤ n − 1) specifies the visual range in the maximum number of samples that can be observed by any x_i. The purpose of k is to ignore distant relationships to make related computation scalable. Datasets of different size are measured on the same statistical manifold S^{k−1}. Such measurements are therefore comparable. As compared to eq. (3), the distribution defined by eq. (4) has higher entropy. Even if τ_i → ∞, meaning zero sight, distant pairs still occupy some probability mass. Hence, eq. (4) puts more emphasis on non-local information.

Once X and k are fixed, all possible configurations of τ = (τ_1, ..., τ_n) form an n-dimensional model manifold denoted as O^n_{X,k}(τ). Its geometry is given as follows.

Corollary 6. If the observers are characterized by eq. (3), the Riemannian metric of O^n_{X,k}(τ) is g(τ) = diag(g_1(τ_1), ..., g_n(τ_n)), where

    g_i(τ_i) = −∂/∂τ_i ( Σ_{j∈kNN_i} p_{j|i}(τ_i) s^X_{ij} ) = Σ_{j∈kNN_i} p_{j|i}(τ_i) (s^X_{ij})² − ( Σ_{j∈kNN_i} p_{j|i}(τ_i) s^X_{ij} )²;

if the observers in eq. (4) are used instead, the corresponding metric is g^t(τ) = diag(g^t_1(τ_1), ..., g^t_n(τ_n)), where

    g^t_i(τ_i) = Σ_{j∈kNN_i} p_{j|i}(τ_i) ( s^X_{ij} / (1 + τ_i s^X_{ij}) )² − ( Σ_{j∈kNN_i} p_{j|i}(τ_i) s^X_{ij} / (1 + τ_i s^X_{ij}) )².
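To make Theorem 4 and Corollary 6 concrete, here is a small NumPy sketch (our own illustrative code, not the authors') that evaluates the pullback metric for the joint encoding of eq. (2) as the covariance of ∂s_{kl}/∂ϕ, together with the scalar g_i(τ_i) of Corollary 6. It assumes the probabilities and partial derivatives are supplied as arrays.

```python
import numpy as np

def pullback_metric_joint(P, dS):
    """Pullback metric g(phi) of Theorem 4 for the joint encoding of eq. (2).

    P  : (n, n) array of joint probabilities p_kl (zero diagonal, sums to 1).
    dS : (n, n, r) array of partial derivatives ds_kl / dphi.
    Returns the r x r covariance of ds/dphi under p, which is p.s.d.
    """
    mean = np.einsum('kl,klr->r', P, dS)              # E_p[ds/dphi]
    second = np.einsum('kl,klr,kls->rs', P, dS, dS)   # E_p[(ds/dphi)(ds/dphi)^T]
    return second - np.outer(mean, mean)

def corollary6_gi(p_i, s_i):
    """g_i(tau_i) of Corollary 6: the variance of s_ij under p_{j|i}(tau_i),
    with both arrays restricted to the k nearest neighbours of x_i."""
    m = np.dot(p_i, s_i)
    return np.dot(p_i, s_i ** 2) - m ** 2

# sanity check on random inputs: the meta-metric is p.s.d. (Remark 4.2)
n, r = 8, 3
rng = np.random.default_rng(0)
dS = rng.standard_normal((n, n, r))
P = rng.random((n, n))
np.fill_diagonal(P, 0.0)
P /= P.sum()
g = pullback_metric_joint(P, dS)
assert np.all(np.linalg.eigvalsh(g) >= -1e-10)
```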

Table 1. Estimated intrinsic dimension (avg. ± std.) for each k ∈ {5, 10, ..., 100}.

DATASET      n     D      TRUE d     MLE            LAI            t-LAI
Spiral       200   3      1          1.3 ± 0.23     1.2 ± 0.21     1.1 ± 0.06
Swiss roll   500   3      2          2.1 ± 0.13     2.0 ± 0.08     2.0 ± 0.05
Faces        698   4k     ∼3         3.8 ± 0.42     3.4 ± 0.27     3.2 ± 0.15
Hands        481   245k   ∼2         2.2 ± 0.37     1.7 ± 0.21     2.0 ± 0.10
MNIST        60k   784    unknown    10.1 ± 0.27    9.7 ± 0.55     9.8 ± 0.26

Figure 2. Local dimension estimation.

We only discuss g(τ) and leave g^t(τ) for future extensions. The scale of O^n_{X,k}(τ) can be measured as follows.

Definition 7. The locally accumulated information (LAI) of X given the visual range k is defined as

    |O_k|(X) = (1/n) Σ_{i=1}^n ∫_0^∞ √(g_i(t)) dt.        (5)

Remark 7.1. O^n_{X,k}(τ) resembles an orthant (0, ∞)^n, and LAI measures its average side length.

To understand LAI, one needs to grasp the concept of information. Shannon's information, i.e. entropy, measures the absolute uncertainty of a single distribution. Fisher's information (Rao, 1945) measures the relative uncertainty within a continuous region of distributions (Myung et al., 2000). In eq. (5), each term in the sum is Fisher's information integrated along a statistical curve corresponding to a local observation process. LAI measures how much intrinsic difference, or how "many" distinct distributions, are observed while zooming the observation radius from 0 to ∞.

Proposition 8. ∀X, ∀k, ∀λ > 0, |O_k|(X) = |O_k|(λX).

Proposition 9. If ∀i, x_i's 1-NN is unique, then ∀k, arccos(1/√k) ≤ |O_k|(X) < ∞.

LAI is invariant to scaling (proposition 8). LAI is always finite and has a lower bound (proposition 9). This lower bound can be approached by approximately placing X as a regular n-simplex. This is only possible when the dimension of {x_i} is large enough. In fact, LAI reflects the intrinsic dimensionality. In high dimension, the pairwise distances present less variance (Bellman, 1961). Correspondingly, the curve of distributions defined by eqs. (3) and (4) is straighter and shorter. In low dimension, the observations at different scales are more different. Correspondingly, the curve of distributions is more bent. Consider a line of cities (London, Paris, Geneva, Rome). As an observer in London expands its sight, it discovers the other three cities one by one. On S², a curve starts from the Paris vertex, then bends towards Geneva, then bends towards Rome. In this model, LAI ≈ 5.3. A rectangular model (Berlin, Paris, Vienna, Marseille) has "higher dimensionality" and therefore a smaller value of LAI (≈ 5.1).

LAI can be conveniently calculated with an off-the-shelf numerical integrator. The computation involves n integrations, which can be reduced by averaging over a random subset of samples in eq. (5). The cost of each integration scales with k (k ≪ n). Overall, the computation is scalable.

A Dictionary-based Intrinsic Dimensionality Estimator

We generate random artificial data with different dimension D to build a dictionary of LAI values indexed by k and D. This dictionary is used to map any input data to an intrinsic dimensionality. Table 1 shows its performance compared to the maximum likelihood estimator (MLE) (Levina & Bickel, 2005) on several benchmark datasets, including a spiral with one intrinsic dimension, a Swiss roll with two intrinsic dimensions, an artificial face dataset (http://isomap.stanford.edu/datasets.html) rendered with different light directions and different orientations (three degrees of freedom), an image sequence (http://vasc.ri.cmu.edu//idb/html/motion/hand/index.html) recording a hand holding a bowl and rotating (around two degrees of freedom), and MNIST hand-written digits (http://yann.lecun.com/exdb/mnist). In order to suppress the curse of dimensionality (Bellman, 1961), a dataset is projected to ℝ^50 with principal component analysis (PCA) before going to the estimators. Overall, the LAI estimation is closer to the (estimated) ground truth with less variance. t-LAI is based on eq. (4) with similar definitions. It shows the best robustness to the choice of k, because s^t_{ij} is able to incorporate more non-local information. MNIST is closer to real-world datasets, where the ground truth is unknown and the intrinsic dimension varies from region to region. The large variance of (t-)LAI is because the intrinsic dimension changes with k. At a small scale (k = 5), the intrinsic dimension tends to be over-estimated (∼ 10) because of local high-dimensional noise. At a larger scale (k = 100), the intrinsic dimension is estimated as 8 ∼ 9 as the data manifold shows up. This observation agrees with the characteristics of real-world data.
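The "off-the-shelf numerical integrator" computation of eq. (5) can be sketched as follows, using g_i(τ) from Corollary 6 under the eq. (3) observers. This is our own illustrative code under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.integrate import quad

def g_i(tau, s_i):
    """g_i(tau) of Corollary 6: variance of s_ij under p_{j|i}(tau),
    where p_{j|i}(tau) is proportional to exp(-tau * s_ij) over the kNN of x_i."""
    w = np.exp(-tau * (s_i - s_i.min()))     # shifted exponent for numerical stability
    p = w / w.sum()
    m = np.dot(p, s_i)
    return np.dot(p, s_i ** 2) - m ** 2

def lai(X, k):
    """Locally accumulated information |O_k|(X), eq. (5)."""
    n = X.shape[0]
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
    total = 0.0
    for i in range(n):
        s_i = np.sort(D2[i])[1:k + 1]        # squared distances to the k nearest neighbours
        total += quad(lambda t: np.sqrt(max(g_i(t, s_i), 0.0)),
                      0.0, np.inf, limit=200)[0]
    return total / n

# e.g. lai(np.random.randn(200, 3), k=10); averaging over a random subset of i
# reduces the n integrations for large datasets, as noted above.
```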
LAI can be used for local dimensionality estimation (Carter et al., 2010) by computing a local average of ∫_0^∞ √(g_i(t)) dt around any x_i. Figure 2 shows the side view of a candy bar dataset with a 1D stick, a 2D disk, and a 3D head. Its local dimension, shown by the colors, is well estimated.

4. A Gap between Two Models

A central problem in MAL is to define the difference between an input Y and an output Z. Then, the embedding quality can be evaluated, and MAL can be implemented through optimization. According to section 3, Y and Z individually extend to two families modeling their internal complexity. Our strategy is to continuously deform one family to the other along a bridging manifold. The volume of this bridge measures their intrinsic difference.

Fortunately, such a bridge exists for any given Y and Z. Consider the model family H^{2n}_{Y,Z,k} defined by

    s_{ij}(c) = a_i s^Y_{ij} + b_i s^Z_{ij}  if j ∈ kNN_i, and +∞ otherwise,        (6)

or

    s^t_{ij}(c) = a_i s^Y_{ij} + log(b_i s^Z_{ij} + 1)  if j ∈ kNN_i, and +∞ otherwise,        (7)

where ∀i, kNN_i = kNN_i(Y) ∪ kNN_i(Z) are the input or output neighbours of i, a_i > 0, b_i > 0, and c = (a_1, b_1, ..., a_n, b_n) serves as a global coordinate system. The boundary b = 0 deteriorates to a model family induced by Y (similar to O^n_{Y,k}). The boundary a = 0 deteriorates to a family induced by Z (similar to O^n_{Z,k}). Among all possible interpolations, eqs. (6) and (7) are simple and natural. H^{2n}_{Y,Z,k} defined in eq. (6) is somehow flat (Amari & Nagaoka, 2000). Its geometry is given as follows.

Corollary 10. With respect to eq. (6) and the global coordinate system c = (a_1, b_1, ..., a_n, b_n), the Riemannian metric of H^{2n}_{Y,Z,k} is in the block-diagonal form with 2 × 2 diagonal blocks (g^i_aa, g^i_ab; g^i_ba, g^i_bb), where ∀i,

    g^i_aa = Σ_{j∈kNN_i} p_{j|i} (s^Y_{ij})² − ( Σ_{j∈kNN_i} p_{j|i} s^Y_{ij} )²,
    g^i_bb = Σ_{j∈kNN_i} p_{j|i} (s^Z_{ij})² − ( Σ_{j∈kNN_i} p_{j|i} s^Z_{ij} )²,
    g^i_ab = g^i_ba = Σ_{j∈kNN_i} p_{j|i} s^Y_{ij} s^Z_{ij} − ( Σ_{j∈kNN_i} p_{j|i} s^Y_{ij} ) ( Σ_{j∈kNN_i} p_{j|i} s^Z_{ij} ).

The metric g^t(c) with respect to eq. (7) is obtained by replacing s^Z_{ij} with s^Z_{ij}/(1 + b_i s^Z_{ij}) in the above equations. Due to space limitation, the following discussion is only based on the geometry induced by eq. (6).

H^{2n}_{Y,Z,k} embeds all information regarding the intra- and inter-complexity of Y and Z. Across this bridge, we construct a low dimensional film, whose volume can be easily computed to estimate the closeness between the input and output boundaries. Consider the 2D sub-manifold H̄²_{Y,Z,k,κ} = {(a_1, b_1, ..., a_n, b_n) : ∀i, a_i = a τ_i(Y, κ), b_i = b τ_i(Z, κ); a ∈ (0, ∞); b ∈ (0, ∞); κ < k} with a global coordinate system (a, b). All observers in Y (resp. Z) simultaneously zoom according to one single parameter a (resp. b). A large value of a (resp. b) corresponds to high frequency local information; a small value of a (resp. b) corresponds to low frequency distant information. For each i, the scalars τ_i(Y, κ) and τ_i(Z, κ) are computed by binary search, so that the distribution p_{·|i} defined by eqs. (1), (6) and (7) has fixed entropy given by log κ and each sample has effectively the same number of neighbours given by κ (Hinton & Roweis, 2003); a sketch of this calibration is given after proposition 11 below. Such alignment is to drive the two lines a = 0 and b = 0 as close as possible on each side of the gap. The volume of the film H̄_{Y,Z,k,κ} in between these lines approximates the minimal effort needed to shift a continuous spectrum of information from one family to the other.

Proposition 11. The volume (area) of some region Ω on H̄_{Y,Z,k,κ} is

    |H̄_{k,κ,Ω}|(Y, Z) = ∫∫_Ω da db √[ ( Σ_{i=1}^n τ_i²(Y, κ) g^i_aa ) ( Σ_{i=1}^n τ_i²(Z, κ) g^i_bb ) − ( Σ_{i=1}^n τ_i(Y, κ) τ_i(Z, κ) g^i_ab )² ].        (8)
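The binary-search calibration of τ_i to a fixed entropy log κ can be sketched as follows, for the eq. (3)/(6)-style scaling of squared differences over the neighbours of x_i. The helper names and tolerances are our own assumptions, not part of the paper.

```python
import numpy as np

def entropy_of_tau(tau, s_i):
    """Shannon entropy (nats) of p_{.|i} when s_ij is scaled by tau over kNN_i."""
    w = np.exp(-tau * (s_i - s_i.min()))     # shifted exponent for numerical stability
    p = w / w.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def calibrate_tau(s_i, kappa, tol=1e-6, iters=100):
    """Binary search for tau such that the entropy of p_{.|i} equals log(kappa), kappa < k.

    The entropy decreases monotonically in tau: tau = 0 gives the uniform
    distribution over the k neighbours (entropy log k), while tau -> inf
    concentrates the mass on the nearest neighbour (entropy -> 0)."""
    target = np.log(kappa)
    lo, hi = 0.0, 1.0
    for _ in range(60):                      # grow the bracket until entropy(hi) <= target
        if entropy_of_tau(hi, s_i) <= target:
            break
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy_of_tau(mid, s_i) > target:
            lo = mid                         # entropy still too high: increase tau
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

This mirrors the usual perplexity calibration of SNE (Hinton & Roweis, 2003), with perplexity κ playing the role of the effective number of neighbours.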

Figure 3. Performance measurements of different embeddings. In each sub-figure, the color-map shows the local densities vol(a, b)dσ(a, b) over Ω. From left to right (resp. bottom to top), the input (resp. output) observation radius expands from 5 to 50. The x-axis (resp. y-axis) is linear in 1/a (resp. 1/b). The number below each square shows the volume, i.e. the integration of local densities in the corresponding color-map. Note, the colors are different between the two datasets.

Gap volumes per panel (columns: Random, PCA, Isomap, SNE, t-SNE):
Swiss roll:  (a) 5.3 × 10³   (b) 1.6 × 10³   (c) 4.6 × 10²   (d) 9.9 × 10²   (e) 8.1 × 10²
MNIST:       (f) 3.2 × 10³   (g) 2.8 × 10³   (h) 2.6 × 10³   (i) 2.3 × 10³   (j) 1.0 × 10³

Proposition 12. ∀Y, ∀Z, ∀λ_y > 0, ∀λ_z > 0, |H̄_{k,κ,Ω}|(Y, Z) = |H̄_{k,κ,Ω}|(λ_y Y, λ_z Z).

The region Ω means the information of interest in MAL. It can be chosen empirically as a rectangle to exclude low frequency information above the radius κ and high frequency information below a minimal radius k_s. Given Y and Z, the volume |H̄_{k,κ,Ω}|(Y, Z), shortly denoted as |H̄_Ω|, can be efficiently computed with a Monte Carlo integrator. It forms a theoretical objective of MAL that is scale-invariant (proposition 12).

We re-write eq. (8) as |H̄_Ω| = ∫∫_Ω vol(a, b) dσ(a, b). By noting that g^i_aa, g^i_bb, g^i_ab and g^i_ba in corollary 10 are all in the form of (co-)variances, the normalized scalar

    vol(a, b) = √( 1 − ( Σ_{i=1}^n τ_i(Y, κ) τ_i(Z, κ) g^i_ab )² / ( Σ_{i=1}^n τ_i²(Y, κ) g^i_aa · Σ_{i=1}^n τ_i²(Z, κ) g^i_bb ) )        (9)

ranges in [0, 1] and measures the overall linear correlation between s^Y_{ij} and s^Z_{ij} for the same i. The more linearly correlated s^Y_{ij} and s^Z_{ij}, the smaller the value of vol(a, b).

Proposition 13. |H̄_Ω| = 0 iff vol(a, b) = 0 for all (a, b) ∈ Ω.

vol(a, b) is usually positive (proposition 13) and tells the input-output agreement at a specific frequency. The term

    dσ(a, b) = √( Σ_{i=1}^n τ_i²(Y, κ) g^i_aa ) da · √( Σ_{i=1}^n τ_i²(Z, κ) g^i_bb ) db

measures the information density or interestingness at (a, b) based on separate observations on Y and Z (see section 3). Therefore, |H̄_Ω| can be understood as the linear agreement at different scales weighted by information density.

Two classical datasets in MAL, Swiss roll (n = 10³) and MNIST (n = 5 × 10³; five classes), are embedded into ℝ² by PCA, Isomap, SNE and t-SNE with typical configurations. The parameters k, κ and k_s are empirically set to 100, 50 and 5, respectively. The gap on MNIST is computed by randomly sampling 10³ input-output pairs to be comparable with Swiss roll. In fig. 3, the color-maps show vol(a, b)dσ(a, b) over Ω. Their appearances depend on the coordinate system of H̄_{Y,Z,k,κ}. Here, the axes are linear in 1/a and 1/b and represent the observation radii. For example, the upper-left corner means that the input (resp. output) information is examined at a radius of 5 (resp. 50) samples. The gap volume |H̄_Ω| below each color-map is independent of the coordinate system. The best method for each dataset (Isomap for Swiss roll; t-SNE for MNIST) is identified by the bluest square with the smallest volume.

The patterns on the color-maps tell more detailed information. The redness in the lower-right corner (e.g. fig. 3(b)) indicates that the original neighbours are heavily invaded in the embedding. Apparently, t-SNE outperforms SNE in this region. This is related to an information retrieval perspective (Venna & Kaski, 2007). By comparing fig. 3(f) with fig. 3(a), MNIST with high dimensional noise is closer to random data. It is more difficult to improve over random embeddings in this dataset. The spiky pattern in fig. 3(f-j) shows that some structural information that distinguishes MNIST from random data is forced into a thin band due to the high dimensionality (Bellman, 1961).

Most MAL techniques preserve a single frequency or scale of a specific type of local information. The result strongly favors this frequency and this type of information. The proposed gap yields a family of criteria that are less biased towards such choices. By mapping onto the statistical manifold, some redundant information is factored out. By aligning and seeking a minimal gap, an intrinsic difference is exposed. The integration over a spectrum gives an accurate estimation of the true information loss. In practice, computing the gap always faces the choice of a statistical encoding and associated parameters, e.g., k, κ, and k_s. However, the relative order of the gap volumes should be robust to such choices. Although the results are developed based on the SNE encodings, the gap volume is fundamentally different from SNE's objective and does not necessarily favor SNE.
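The Monte Carlo computation of |H̄_Ω| mentioned above can be sketched as follows, assuming Ω is an axis-aligned rectangle in the (a, b) coordinates and that the per-sample block entries of Corollary 10 are supplied by a callback. The interface and names are our own assumptions for illustration.

```python
import numpy as np

def gap_volume_mc(blocks, tau_y, tau_z, a_range, b_range, n_samples=2000, rng=None):
    """Monte Carlo estimate of |H_bar_Omega| = integral over Omega of vol(a,b) dsigma(a,b).

    blocks(a, b) -> (gaa, gbb, gab): per-sample block entries of Corollary 10
    evaluated at the point (a, b) of the film.
    tau_y, tau_z : arrays of tau_i(Y, kappa) and tau_i(Z, kappa).
    a_range, b_range : the rectangle Omega, e.g. observation radii between k_s and kappa.
    """
    rng = np.random.default_rng() if rng is None else rng
    area = (a_range[1] - a_range[0]) * (b_range[1] - b_range[0])
    acc = 0.0
    for _ in range(n_samples):
        a = rng.uniform(*a_range)
        b = rng.uniform(*b_range)
        gaa, gbb, gab = blocks(a, b)
        A = np.sum(tau_y ** 2 * gaa)             # Y-side information density term
        B = np.sum(tau_z ** 2 * gbb)             # Z-side information density term
        C = np.sum(tau_y * tau_z * gab)          # cross-covariance term
        acc += np.sqrt(max(A * B - C ** 2, 0.0)) # integrand of eq. (8) = vol(a,b) * density
    return area * acc / n_samples
```

Since the integrand equals vol(a, b) times the density dσ(a, b)/(da db), the same loop can also report the normalized agreement of eq. (9) per sampled point, which is what the color-maps of fig. 3 visualize.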
Figure 4. Internal complexity of Y and Z and their gap. (Diagram: manifold learning projects Y → Z; LAI scaling of Y and of Z forms the two boundaries of the gap H̄, with the region Ω in between.)

5. Related Works and Discussion

Information geometry (Rao, 1945; Čencov, 1982; Nagaoka & Amari, 1982) plays a vital role in statistical learning theory. MAL has been developed along a statistical approach. It is a natural and meaningful step to bridge the profound information geometry. Efforts (Weinberger et al., 2007; Carreira-Perpiñán, 2010; Vladymyrov & Carreira-Perpiñán, 2012) in seeking efficient MAL implicitly used such a geometry. There, a common technique is to bend the gradient ∇(Z) of a cost function into M^{−1}(Z)∇(Z), where M(Z) ≻ 0. This is equivalent to computing the gradient with respect to a Riemannian metric M(Z) on the solution space. Such a metric, however, has not been explicitly mentioned or formally studied.

Lebanon (2003; 2005) parametrized the Riemannian metric of a statistical simplex and proposed a metric learning objective to maximize the inverse volume element. Carter et al. (2009; 2011) studied MAL on collections of probability density functions. In these works, the subject is still a data geometry, where the observed data is assumed to lie on a statistical manifold. This is different from the picture shown here, which views all input or output information jointly as one point and studies its dynamics.

We formally introduce a semi-Riemannian geometry of a model manifold. It broadens our horizons so that MAL appears as a curve (figs. 1 and 4) and different manifold learners are viewed from a unified perspective. An intriguing aspect is that any volume corresponds to an amount of information. It can be measured to define intrinsic quantities. On two specific model manifolds, we demonstrate how to apply the theoretical results to measure the complexity and quality of models. These measurements are only briefly sketched here to support the learning theory. They can be further unfolded into meaningful theories. This work is summarized in fig. 4. A fundamental trade-off of MAL is to minimize the volume of the gap (lost information; see section 4) and to maximize the volume of the output (retained information; see section 3). To unify and combine LAI and the gap volume into one criterion, and to seek parameter-free invariants, are worthy directions of future work.

The gap volume in proposition 11 as a theoretical objective is hard to optimize directly. This is expected and fits in a usual two-stage learning scheme (Akaike, 1974; Schwarz, 1978; Xu, 2010). In the parameter learning stage, a simple objective function is optimized for each candidate model. In the model selection stage, a sophisticated criterion that better approximates the generalization error is computed to select the best model. We seek to derive simple approximations of the gap volume and develop related MAL algorithms.

Several possible extensions are discussed at the end of section 2. A problem that fits in the recent advancements (Vladymyrov & Carreira-Perpiñán, 2012; Yang et al., 2013) of MAL is to find efficient optimization based on generalizations of Amari's natural gradient (Amari & Nagaoka, 2000; Nock et al., 2011). A theoretical problem is to explore the relationship with graph Laplacian regularization (Belkin & Niyogi, 2003; Weinberger et al., 2007).

Acknowledgments

This work is jointly supported by the Swiss National Science Foundation (SNSF) and the European COST Action on Multilingual and Multifaceted Interactive Information Access (MUMIA) via the Swiss State Secretariat for Education and Research (SER project CC11.0043).

References

Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr., 19(6):716-723, 1974.

Amari, S. Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9):1379-1408, 1995.

Amari, S. and Nagaoka, H. Methods of Information Geometry, volume 191 of Translations of Mathematical Monographs. AMS and OUP, 2000. (Published in Japanese in 1993).

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373-1396, 2003.

Bellman, R. Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.

Carreira-Perpiñán, M. Á. The elastic embedding algorithm for dimensionality reduction. In ICML, pp. 167-174, 2010.

Carter, K. M., Raich, R., Finn, W. G., and Hero, A. O. FINE: Fisher Information Nonparametric Embedding. IEEE Trans. Pattern Anal. Mach. Intell., 31(11):2093-2098, 2009.

Carter, K. M., Raich, R., and Hero, A. O. On local intrinsic dimension estimation and its applications. IEEE Trans. Signal Processing, 58(2):650-663, 2010.

Carter, K. M., Raich, R., Finn, W. G., and Hero, A. O. Information-geometric dimensionality reduction. IEEE Signal Process. Mag., 28(2):89-99, 2011.

Čencov, N. N. Statistical Decision Rules and Optimal Inference, volume 53 of Translations of Mathematical Monographs. AMS, 1982. (Published in Russian in 1972).

Cook, J., Sutskever, I., Mnih, A., and Hinton, G. E. Visualizing similarity data with a mixture of maps. In AISTATS, JMLR: W&CP 2, pp. 67-74, 2007.

Ham, J., Lee, D. D., Mika, S., and Schölkopf, B. A kernel view of the dimensionality reduction of manifolds. In ICML, pp. 47-54, 2004.

Hinton, G. E. and Roweis, S. T. Stochastic Neighbor Embedding. In NIPS 15, pp. 833-840. 2003.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Jost, J. Riemannian Geometry and Geometric Analysis. Springer, 5th edition, 2008.

Lebanon, G. Learning Riemannian metrics. In UAI, pp. 362-369, 2003.

Lebanon, G. Riemannian geometry and statistical machine learning. PhD thesis, CMU, 2005.

Levina, E. and Bickel, P. J. Maximum likelihood estimation of intrinsic dimension. In NIPS 17, pp. 777-784. 2005.

Myung, J., Balasubramanian, V., and Pitt, M. A. Counting probability distributions: Differential geometry and model selection. PNAS, 97(21):11170-11175, 2000.

Nagaoka, H. and Amari, S. Differential geometry of smooth families of probability distributions. Technical Report METR 82-7, Univ. of Tokyo, 1982.

Nock, R., Magdalou, B., Briys, E., and Nielsen, F. On tracking portfolios with certainty equivalents on a generalization of Markowitz model: the fool, the wise and the adaptive. In ICML, pp. 73-80, 2011.

Rao, C. R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Cal. Math. Soc., 37(3):81-91, 1945.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

Schwarz, G. Estimating the dimension of a model. Ann. Stat., 6(2):461-464, 1978.

Sha, F. and Saul, L. K. Analysis and extension of spectral methods for nonlinear dimensionality reduction. In ICML, pp. 784-791, 2005.

Sun, K., Bruno, E., and Marchand-Maillet, S. Stochastic unfolding. In MLSP, pp. 1-6, 2012.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

van der Maaten, L. J. P. and Hinton, G. E. Visualizing data using t-SNE. JMLR, 9(Nov):2579-2605, 2008.

Vapnik, V. N. Statistical Learning Theory. Wiley-Interscience, 1998.

Venna, J. and Kaski, S. Nonlinear dimensionality reduction as information retrieval. In AISTATS, JMLR: W&CP 2, pp. 572-579, 2007.

Vincent, P. and Bengio, Y. Manifold Parzen windows. In NIPS 15, pp. 825-832. 2003.

Vladymyrov, M. and Carreira-Perpiñán, M. Á. Partial-Hessian strategies for fast learning of nonlinear embeddings. In ICML, pp. 345-352, 2012.

Weinberger, K. Q., Sha, F., and Saul, L. K. Learning a kernel matrix for nonlinear dimensionality reduction. In ICML, pp. 839-846, 2004.

Weinberger, K. Q., Sha, F., Zhu, Q., and Saul, L. K. Graph Laplacian regularization for large-scale semidefinite programming. In NIPS 19, pp. 1489-1496. 2007.

Xu, L. Advances on BYY harmony learning: Information theoretic perspective, generalized projection geometry, and independent factor auto-determination. IEEE Trans. Neural Networks, 15(4):885-902, 2004.

Xu, L. Machine learning problems from optimization perspective. J. Global Optim., 47(3):369-401, 2010.

Yang, Z., Peltonen, J., and Kaski, S. Scalable optimization of neighbor embedding for visualization. In ICML, JMLR: W&CP 28.2, pp. 127-135, 2013.

Zhang, J., Wang, Q., He, L., and Zhou, Z. H. Quantitative analysis of nonlinear embedding. IEEE Trans. Neural Networks, 22(12):1987-1998, 2011.