
Lie PCA: Density estimation for symmetric manifolds

Jameson Cahill∗ Dustin G. Mixon†‡ Hans Parshall§

Abstract We introduce an extension to local principal component analysis for learning symmetric manifolds. In particular, we use a spectral method to approximate the Lie algebra corresponding to the symmetry group of the underlying manifold. We derive the sample complexity of our method for various manifolds before applying it to various data sets for improved density estimation.

∗Department of Mathematics and Statistics, University of North Carolina Wilmington, Wilmington, NC
†Department of Mathematics, The Ohio State University, Columbus, OH
‡Translational Data Analytics Institute, The Ohio State University, Columbus, OH
§Department of Mathematics, Western Washington University, Bellingham, WA

arXiv:2008.04278v2 [cs.LG] 13 Sep 2020

1 Introduction

Recent advances in machine learning have been made possible by exploiting symmetries and invariants in data. In 2003, Simard, Steinkraus and Platt [15] applied two different tricks in this spirit to achieve a record-breaking 0.40 percent error rate in classifying the MNIST database of handwritten digits. First, they augmented the training set using the observation that handwritten digits are closed under certain elastic distortions. Second, they exploited the translation invariance of images by applying a convolutional neural network architecture. In the time since, both data augmentation and convolutional neural networks have enabled substantial strides in image recognition (e.g., [11, 16]).

These engineering feats have inspired various theoretical treatments of symmetries and invariants in data. Mallat's scattering transform [12, 3] provides a principled alternative to convolutional neural networks that exhibits translation invariance and stability to diffeomorphisms. For settings beyond image classification, other symmetries and invariants must be considered. In this spirit, Cahill, Contreras and Contreras-Hip [5, 4] identified Lipschitz maps from a signal space $\mathbb{C}^n$ to a low-dimensional feature space in a way that distinguishes orbits in $\mathbb{C}^n$ under the action of a representation of a finite group.

Another approach is to learn symmetries and invariants from the data. For example, principal component analysis can be viewed as a method of identifying symmetries under the action of a low-dimensional affine group. For classification tasks, one may seek large linear groups under which the classification is invariant; this approach has been used in [13, 8, 7].

In this paper, we consider another fundamental problem in this vein of symmetries and invariants in data. Suppose you are given the task of augmenting a modest training set. You are told that there exists a Lie group of deformations (such as elastic distortions) that could be used for this task, but you are not told what the Lie group is. Can you estimate the Lie group from the data? In order to measure performance for this task, we phrase the problem in terms of density estimation: Given a sample $\{x_i\}_{i\in[n]}$ from some unknown distribution supported on some unknown symmetric manifold in $\mathbb{R}^d$, produce $\{y_s\}_{s\in[N]}$ with $N \gg n$ that approximates random draws from this unknown distribution.

In the next section, we propose an extension to local principal component analysis for this task. Our algorithm amounts to a spectral method that estimates the underlying Lie algebra from both the points $\{x_i\}_{i\in[n]}$ and estimates $\{T_i\}_{i\in[n]}$ of the tangent spaces at these points. In Section 3, we analyze the sample complexity of this approach. As one would hope, we find that fewer samples are necessary when the manifold has lower dimension and its symmetry group has higher dimension. We apply our method to the density estimation problem in Section 4, and we conclude in Section 5 with a discussion.

2 Derivation of Lie PCA

Given a manifold $M \subseteq \mathbb{R}^d$, denote its symmetry group by
\[ \operatorname{Sym}(M) := \{A \in \operatorname{GL}(d) : AM = M\}. \]
We are interested in $M$ for which $\operatorname{Sym}(M)$ is a Lie group and the orbits of $M$ under the action of $\operatorname{Sym}(M)$ are nontrivial. (See [10] for an elementary introduction to matrix Lie groups.) For example, if $M$ is the unit sphere, then $\operatorname{Sym}(M) = \operatorname{O}(d)$, which is a Lie group. Also, if $\operatorname{Sym}(M)$ acts transitively on $M$ (e.g., $\operatorname{O}(d)$ acts transitively on the unit sphere), then $M$ is the orbit of any point in $M$.

Let $\operatorname{sym}(M) \subseteq \mathbb{R}^{d\times d}$ denote the Lie algebra of $\operatorname{Sym}(M)$, that is, the set of all matrices of the form $f'(0)$, where $f$ is a differentiable function from some open interval $I \subseteq \mathbb{R}$ containing $0$ to $\operatorname{Sym}(M)$ such that $f(0) = \operatorname{id}$. Then the matrix exponential maps $\operatorname{sym}(M)$ onto the connected component of $\operatorname{Sym}(M)$ that contains $\operatorname{id}$. Notice that members of $\operatorname{Sym}(M)$ that are close to $\operatorname{id}$ can be realized as $e^A$, where $A \in \operatorname{sym}(M)$ is close to the zero matrix. Furthermore, for each $A \in \mathbb{C}^{d\times d}$, it holds that
\[ \|e^{tA} - \operatorname{id}\|_{2\to2} = (1 + o(1)) \cdot \|A\|_{2\to2} \cdot |t| \quad \text{as } t \to 0; \]
indeed, it is easy to verify this for diagonalizable matrices, which are dense in the set of complex matrices. As such, if we know $\operatorname{sym}(M)$, then for every $x \in M$, it holds that $e^A x$ is a slight perturbation of $x$ in $M$ for every small $A \in \operatorname{sym}(M)$. This is the heart of our approach to the density estimation problem, but in order for this to work, we first need to estimate $\operatorname{sym}(M)$ from a sample $\{x_i\}_{i\in[n]}$ of $M$. We accomplish this in two steps:

1. Use $\{x_i\}_{i\in[n]}$ to obtain an estimate $\{T_i\}_{i\in[n]}$ of the tangent spaces $\{T_{x_i}M\}_{i\in[n]}$.

2. Use $\{x_i\}_{i\in[n]}$ and $\{T_i\}_{i\in[n]}$ to obtain an estimate of $\operatorname{sym}(M)$.

The first step above is well understood: local PCA. That is, for each $i \in [n]$, we select the $x_j$'s closest to $x_i$ and run principal component analysis (PCA) on this subcollection to estimate $T_i$. For the second step, we apply Algorithm 1, which we derive in this section. Our approach is motivated by the following observation:

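Algorithm 1: Lie principal component analysis
Data: Sample $\{x_i\}_{i\in[n]}$ of $M \subseteq \mathbb{R}^d$, estimate $\{T_i\}_{i\in[n]}$ of $\{T_{x_i}M\}_{i\in[n]}$, and $\ell \in \mathbb{N}$
Result: Estimate of Lie algebra $\operatorname{sym}(M)$
Define $\Sigma\colon \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ by $\Sigma(A) := \sum_{i\in[n]} \operatorname{proj}_{T_i^\perp} \cdot A \cdot \operatorname{proj}_{\operatorname{span}\{x_i\}}$.
Compute eigenvectors $\{v_j\}_{j\in[\ell]}$ of $\Sigma$ corresponding to the $\ell$ smallest eigenvalues.
Output $\operatorname{span}\{v_j\}_{j\in[\ell]}$.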

Lemma 1. For every $x \in M$ and $A \in \operatorname{sym}(M)$, it holds that $Ax \in T_xM$.

Proof. Fix $x \in M$ and $A \in \operatorname{sym}(M)$. Consider any differentiable function $f\colon I \to \operatorname{Sym}(M)$ such that $f(0) = \operatorname{id}$ and $f'(0) = A$. Then $g\colon I \to M$ defined by $g(t) := f(t)x$ is differentiable with $g(0) = f(0)x = x$, and so $Ax = f'(0)x = g'(0) \in T_xM$.

For each $x \in M$, denote
\[ S_xM := \{A \in \mathbb{R}^{d\times d} : Ax \in T_xM\}. \]
By Lemma 1, we have $S_xM \supseteq \operatorname{sym}(M)$ for every $x \in M$. Given a sample $\{x_i\}_{i\in[n]}$ in $M$ and corresponding tangent spaces $\{T_{x_i}M\}_{i\in[n]}$, we may estimate $\operatorname{sym}(M)$ by $\bigcap_{i\in[n]} S_{x_i}M$. We seek a numerically robust version of this estimate. To this end, it is convenient to write
\[ \bigcap_{i\in[n]} S_{x_i}M = \Big( \sum_{i\in[n]} (S_{x_i}M)^\perp \Big)^\perp = \ker\Big( \sum_{i\in[n]} \operatorname{proj}_{(S_{x_i}M)^\perp} \Big). \tag{1} \]
The terms in this sum have a convenient expression:

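Lemma 2. For every $x \in M \setminus \{0\}$, it holds that $\operatorname{proj}_{(S_xM)^\perp}(A) = \operatorname{proj}_{N_xM} \cdot A \cdot \operatorname{proj}_{\operatorname{span}\{x\}}$.

Our proof of this lemma makes use of the following expression for $(S_xM)^\perp$:

Lemma 3. For every $x \in M$, it holds that

(a) $S_xM = \{yx^\top + B : y \in T_xM,\ B \in \mathbb{R}^{d\times d},\ Bx = 0\}$, and

(b) $(S_xM)^\perp = \{zx^\top : z \in N_xM\}$.

Proof. In the case where $x = 0$, we have $S_xM = \mathbb{R}^{d\times d}$ and both (a) and (b) are immediate. It remains to consider the case in which $x \neq 0$.

(a) First, ($\supseteq$) follows from the fact that $(yx^\top + B)x = \|x\|^2 y \in T_xM$. For ($\subseteq$), given $A \in S_xM$, take $y = \|x\|^{-2}Ax$ and $B = A(I - \|x\|^{-2}xx^\top)$ so that $A = yx^\top + B$ with $y \in T_xM$ and $Bx = 0$.

(b) First, ($\supseteq$) follows from the fact that
\[ \langle zx^\top, yx^\top + B \rangle = \operatorname{tr}((zx^\top)^\top(yx^\top + B)) = \operatorname{tr}(xz^\top(yx^\top + B)) = \|x\|^2 z^\top y + z^\top Bx = 0. \]
For ($\subseteq$), suppose $Z \in (S_xM)^\perp$. For every $u \in \mathbb{R}^d$ and $v \in \operatorname{span}\{x\}^\perp$, since $B = uv^\top \in S_xM$ satisfies $Bx = 0$, we have
\[ 0 = \langle Z, uv^\top \rangle = \operatorname{tr}(Z^\top uv^\top) = v^\top Z^\top u = \langle Zv, u \rangle. \]
Since $u \in \mathbb{R}^d$ is arbitrary, it follows that $Zv = 0$, and since $v \in \operatorname{span}\{x\}^\perp$ is arbitrary, we conclude that $Z = zx^\top$ for some $z \in \mathbb{R}^d$. Next, for every $y \in T_xM$, since $yx^\top \in S_xM$, it holds that
\[ 0 = \langle Z, yx^\top \rangle = \operatorname{tr}(Z^\top yx^\top) = \|x\|^2 \langle z, y \rangle, \]
and so $z \in (T_xM)^\perp = N_xM$, as desired.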
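The formula in Lemma 2 and the description in Lemma 3(a) can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: the "tangent space" is a randomly drawn subspace standing in for $T_xM$, and all variable names are hypothetical. It checks that $A \mapsto \operatorname{proj}_{N_xM} \cdot A \cdot \operatorname{proj}_{\operatorname{span}\{x\}}$ is idempotent and self-adjoint, that it annihilates a matrix of the form $yx^\top + B$ with $y \in T_xM$ and $Bx = 0$, and that the residual $A - \operatorname{proj}(A)$ maps $x$ into the tangent space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 5, 2

# Hypothetical stand-ins: a nonzero point x and an r-dimensional "tangent space".
x = rng.standard_normal(d)
T, _ = np.linalg.qr(rng.standard_normal((d, r)))  # orthonormal basis, d x r
P_nor = np.eye(d) - T @ T.T                       # projection onto the normal space
P_x = np.outer(x, x) / (x @ x)                    # projection onto span{x}

def proj_S_perp(A):
    """Candidate projection onto (S_x M)^perp from Lemma 2."""
    return P_nor @ A @ P_x

A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))

# Idempotent and self-adjoint with respect to the Frobenius inner product.
idempotent_gap = np.linalg.norm(proj_S_perp(proj_S_perp(A)) - proj_S_perp(A))
selfadjoint_gap = abs(np.sum(proj_S_perp(A) * B) - np.sum(A * proj_S_perp(B)))

# A member of S_x M as in Lemma 3(a): y x^T + B0 with y tangent and B0 x = 0.
y = T @ rng.standard_normal(r)
B0 = rng.standard_normal((d, d)) @ (np.eye(d) - P_x)
annihilation_gap = np.linalg.norm(proj_S_perp(np.outer(y, x) + B0))

# The residual A - proj(A) should send x into the tangent space.
residual = A - proj_S_perp(A)
membership_gap = np.linalg.norm(P_nor @ (residual @ x))
```

All four gaps should vanish up to floating-point error, which is consistent with the map being the orthogonal projection onto $(S_xM)^\perp$.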

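Proof of Lemma 2. Fix $A \in \mathbb{R}^{d\times d}$. We seek the member of $(S_xM)^\perp$ that is closest to $A$ in Frobenius norm. By Lemma 3(b), every member of $(S_xM)^\perp$ takes the form $zx^\top$ for some $z \in N_xM$. Letting $P$ denote orthogonal projection onto $N_xM$, then Cauchy–Schwarz gives
\begin{align*}
\|A - zx^\top\|_F^2
&= \|A\|_F^2 - 2\langle A, zx^\top \rangle + \|zx^\top\|_F^2 \\
&= \|A\|_F^2 - 2\langle Ax, z \rangle + \|z\|_2^2 \|x\|_2^2 \\
&= \|A\|_F^2 - 2\langle PAx, z \rangle + \|z\|_2^2 \|x\|_2^2 \\
&\geq \|A\|_F^2 - 2\|PAx\|_2 \|z\|_2 + \|z\|_2^2 \|x\|_2^2 \\
&\geq \min_{t \geq 0} \big( \|A\|_F^2 - 2\|PAx\|_2 \cdot t + \|x\|_2^2 \cdot t^2 \big).
\end{align*}
Equality is achieved in the first inequality precisely when $z$ is a positive multiple of $PAx$. For the second inequality, equality is achieved precisely when $\|z\|_2 = \|PAx\|_2 / \|x\|_2^2$. Overall,
\[ \operatorname{proj}_{(S_xM)^\perp}(A) = \Big( \frac{PAx}{\|PAx\|_2} \cdot \frac{\|PAx\|_2}{\|x\|_2^2} \Big) x^\top = PA \frac{xx^\top}{\|x\|_2^2} = \operatorname{proj}_{N_xM} \cdot A \cdot \operatorname{proj}_{\operatorname{span}\{x\}}. \]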

We are now ready to derive our approach. Let $\ell \in \mathbb{N}$ denote the dimension of the desired Lie algebra. (For our algorithm, this quantity is treated as a hyperparameter.) Next, let $\mathcal{L}(\ell, d)$ denote the set of all $\ell$-dimensional Lie algebras of Lie subgroups of $\operatorname{GL}(d)$. By ignoring the multiplicative structure, we may think of $\mathcal{L}(\ell, d)$ as a subset of the Grassmannian $\operatorname{Gr}(\ell, \mathbb{R}^{d\times d})$. Given a sample $\{x_i\}_{i\in[n]}$ in $M \subseteq \mathbb{R}^d$ and a noisy estimate $\{T_i\}_{i\in[n]}$ of $\{T_{x_i}M\}_{i\in[n]}$, we define $\Sigma\colon \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ by
\[ \Sigma(A) := \sum_{i\in[n]} \operatorname{proj}_{T_i^\perp} \cdot A \cdot \operatorname{proj}_{\operatorname{span}\{x_i\}}. \]
Note that by Lemma 2, the $i$th term approximates $\operatorname{proj}_{(S_{x_i}M)^\perp}$. Considering (1), we are compelled to solve the program
\[ \text{minimize} \quad \langle \Sigma, \operatorname{proj}_{\mathfrak{a}} \rangle \quad \text{subject to} \quad \mathfrak{a} \in \mathcal{L}(\ell, d). \]
Since optimizing over $\mathcal{L}(\ell, d)$ is cumbersome, we relax to the entire Grassmannian:
\[ \text{minimize} \quad \langle \Sigma, X \rangle \quad \text{subject to} \quad X^2 = X, \quad X^* = X, \quad \operatorname{rank} X = \ell. \]
As a consequence of the Poincaré separation theorem, the orthogonal projection onto the span of any $\ell$ of the bottom eigenvectors of $\Sigma$ gives an optimizer of this program. In particular, the span of any such eigenvectors gives a worthy estimate for $\operatorname{sym}(M)$. One may be inclined to round this estimate to the nearest member of $\mathcal{L}(\ell, d)$, but we do not attempt this here.
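The two-step procedure (local PCA followed by Algorithm 1) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names `local_pca` and `lie_pca` are hypothetical, neighborhoods are mean-centered before PCA (the text does not specify the centering), and $\Sigma$ is represented as a $d^2 \times d^2$ matrix acting on row-major vectorizations, using the identity $\operatorname{vec}(PAQ) = (P \otimes Q^\top)\operatorname{vec}(A)$.

```python
import numpy as np

def local_pca(X, k, r):
    """Estimate an orthonormal basis (d x r) of the tangent space at each sample.

    Assumption: each neighborhood consists of the k nearest samples (including
    the point itself) and is mean-centered before PCA.
    """
    bases = []
    for x in X:
        idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        nbrs = X[idx] - X[idx].mean(axis=0)
        _, _, Vt = np.linalg.svd(nbrs, full_matrices=False)
        bases.append(Vt[:r].T)
    return bases

def lie_pca(X, tangent_bases, ell):
    """Return ell bottom eigenvectors of Sigma, reshaped to d x d, and all eigenvalues."""
    n, d = X.shape
    Sigma = np.zeros((d * d, d * d))
    for x, T in zip(X, tangent_bases):
        P_normal = np.eye(d) - T @ T.T         # proj onto T_i^perp
        P_x = np.outer(x, x) / (x @ x)         # proj onto span{x_i}
        Sigma += np.kron(P_normal, P_x)        # A -> P_normal A P_x on row-major vec(A)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    basis = [eigvecs[:, j].reshape(d, d) for j in range(ell)]
    return basis, eigvals

# Toy example: equally spaced points on the unit circle in R^2.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
tangents = local_pca(X, k=3, r=1)
basis, eigvals = lie_pca(X, tangents, ell=1)
A_hat = basis[0]
```

For the unit circle, $\operatorname{Sym}(M) = \operatorname{O}(2)$, so the estimate `A_hat` should be close to a multiple of the rotation generator $\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$. The dense Kronecker representation costs $O(d^4)$ memory, which is fine for small $d$ but is only one possible design.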

3 Sample complexity of Lie PCA

In this section, we consider an idealized setting in which the estimate {Ti}i∈[n] of {Txi M}i∈[n] is exact. Intuitively, Lie PCA should require fewer samples when M is low-dimensional and sym(M) is high-dimensional. This intuition matches the following lower bound on the sample complexity of Lie PCA:

Lemma 4. It holds that $\operatorname{sym}(M) \subseteq \bigcap_{i\in[n]} S_{x_i}M$, with equality only if
\[ n \geq n^\star(M) := \frac{\operatorname{codim} \operatorname{sym}(M)}{\operatorname{codim} M}. \]
Furthermore, $\operatorname{sym}(M) = \bigcap_{i\in[n^\star(M)]} S_{x_i}M$ precisely when the subspaces $\{(S_{x_i}M)^\perp\}_{i\in[n^\star(M)]}$ are linearly independent.

Proof. Lemma 1 gives that $\operatorname{sym}(M) \subseteq S_{x_i}M$ for every $i \in [n]$, and so the desired inclusion holds. Next, suppose equality holds. Then $\operatorname{sym}(M)^\perp = \sum_{i\in[n]} (S_{x_i}M)^\perp$, and so
\[ \operatorname{codim} \operatorname{sym}(M) = \dim \operatorname{sym}(M)^\perp \leq \sum_{i\in[n]} \dim (S_{x_i}M)^\perp \leq \sum_{i\in[n]} \dim (N_{x_i}M) = n \cdot \operatorname{codim} M, \]
where the second inequality follows from Lemma 3(b). Rearranging gives the desired bound, and the last statement follows from counting dimensions.

We observe that in many cases, the bound in Lemma 4 is the threshold at which generic samples determine the Lie algebra. To make this rigorous, let $\mathcal{O}(M^n)$ denote the set of subsets of $M^n \subseteq (\mathbb{R}^d)^n$ that are open and dense in the subspace topology, and define
\[ n^\circ(M) := \inf\Big\{ n \in \mathbb{N} : \exists O \in \mathcal{O}(M^n),\ \forall \{x_i\}_{i\in[n]} \in O,\ \operatorname{sym}(M) = \bigcap_{i\in[n]} S_{x_i}M \Big\}. \]
(As a mnemonic, think of $\circ$ as the "o" in "open.") Lemma 4 implies that $n^\circ(M) \geq n^\star(M)$ for every $M$, and in this section, we show that $n^\circ(M) = n^\star(M)$ for several choices of $M$. We will continually make use of the following:

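Lemma 5. For any manifold $M \subseteq \mathbb{R}^d$ and any $Z \in \operatorname{GL}(d)$, it holds that
\[ n^\star(ZM) = n^\star(M), \qquad n^\circ(ZM) = n^\circ(M). \]

Proof. First, we show that $\operatorname{Sym}(ZM) = Z \cdot \operatorname{Sym}(M) \cdot Z^{-1}$. Suppose $A \in \operatorname{Sym}(M)$. Then $A \in \operatorname{GL}(d)$ such that $AM = M$. Then $ZAZ^{-1}ZM = ZAM = ZM$, meaning $ZAZ^{-1} \in \operatorname{Sym}(ZM)$. Similarly, $B \in \operatorname{Sym}(ZM)$ implies $Z^{-1}BZ \in \operatorname{Sym}(M)$, and the claim follows.

Next, we show that $\operatorname{sym}(ZM) = Z \cdot \operatorname{sym}(M) \cdot Z^{-1}$. Take any continuously differentiable $f\colon I \to \operatorname{Sym}(ZM)$ with $f(0) = \operatorname{id}$. Then $g\colon I \to \operatorname{Sym}(M)$ defined by $g(t) := Z^{-1}f(t)Z$ satisfies $g(0) = \operatorname{id}$ and $g'(0) = Z^{-1}f'(0)Z$. This implies $\operatorname{sym}(ZM) \subseteq Z \cdot \operatorname{sym}(M) \cdot Z^{-1}$, and the reverse containment holds by a similar argument. A similar argument demonstrates that $T_{Zx}ZM = Z \cdot T_xM$. As such, $\dim \operatorname{sym}(ZM) = \dim \operatorname{sym}(M)$ and $\dim ZM = \dim M$, from which it follows that $n^\star(ZM) = n^\star(M)$.

Now we verify that $n^\circ(ZM) = n^\circ(M)$. First, since $\langle Zx, y \rangle = \langle x, Z^\top y \rangle$, it follows that $z$ is orthogonal to $x$ precisely when $(Z^\top)^{-1}z$ is orthogonal to $Zx$. As such,
\[ N_{Zx}ZM = (T_{Zx}ZM)^\perp = (Z \cdot T_xM)^\perp = (Z^\top)^{-1} \cdot (T_xM)^\perp = (Z^\top)^{-1} \cdot N_xM. \]
Combining this with Lemma 3(b) then gives
\[ (S_{Zx}ZM)^\perp = \{z(Zx)^\top : z \in N_{Zx}ZM\} = \{(Z^\top)^{-1} u x^\top Z^\top : u \in N_xM\} = (Z^\top)^{-1} \cdot (S_xM)^\perp \cdot Z^\top. \]
Next, since $\langle (Z^\top)^{-1} X Z^\top, Y \rangle = \langle X, Z^{-1} Y Z \rangle$, we may similarly conclude that
\[ S_{Zx}ZM = ((S_{Zx}ZM)^\perp)^\perp = ((Z^\top)^{-1} \cdot (S_xM)^\perp \cdot Z^\top)^\perp = Z \cdot ((S_xM)^\perp)^\perp \cdot Z^{-1} = Z \cdot S_xM \cdot Z^{-1}. \]
As such, $\operatorname{sym}(M) = \bigcap_{i\in[n]} S_{x_i}M$ precisely when
\[ \operatorname{sym}(ZM) = Z \cdot \operatorname{sym}(M) \cdot Z^{-1} = Z \cdot \Big( \bigcap_{i\in[n]} S_{x_i}M \Big) \cdot Z^{-1} = \bigcap_{i\in[n]} S_{Zx_i}ZM. \]
Finally, since the bicontinuous mapping $f\colon \{x_i\}_{i\in[n]} \mapsto \{Zx_i\}_{i\in[n]}$ has the property that $f(\mathcal{O}(M^n)) = \mathcal{O}((ZM)^n)$, the claim follows.

We start by treating the case in which $M$ is a subspace: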

Theorem 6. Let M be any r-dimensional subspace of Rd. Then n◦(M) = n?(M) = r.

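Proof. By Lemma 5, we have $M = \operatorname{span}\{e_i\}_{i\in[r]}$ without loss of generality. We claim that
\[ \operatorname{Sym}(M) = \Big\{ \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} : A \in \operatorname{GL}(r),\ D \in \operatorname{GL}(d-r) \Big\}. \]
To see this, suppose $\begin{bmatrix} A & B \\ C & D \end{bmatrix} \in \operatorname{GL}(d)$ with $A \in \mathbb{R}^{r\times r}$ satisfies $\begin{bmatrix} A & B \\ C & D \end{bmatrix} M = M$. Then $C = 0$, which forces $A \in \operatorname{GL}(r)$ and $D \in \operatorname{GL}(d-r)$. Conversely, every $\begin{bmatrix} A & B \\ C & D \end{bmatrix}$ of this form satisfies $\begin{bmatrix} A & B \\ C & D \end{bmatrix} M = M$. Next, we claim that
\[ \operatorname{sym}(M) = \Big\{ \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} : A \in \mathbb{R}^{r\times r},\ D \in \mathbb{R}^{(d-r)\times(d-r)} \Big\}. \]
Let $S$ denote the right-hand side. For ($\subseteq$), select $Z \in \operatorname{sym}(M)$ and consider any continuously differentiable $f\colon I \to \operatorname{Sym}(M)$ with $f(0) = \operatorname{id}$ and $f'(0) = Z$. Then $Z = \lim_{h\to0} \frac{1}{h}(f(h) - \operatorname{id})$. Considering $\frac{1}{h}(f(h) - \operatorname{id}) \in S$ for every nonzero $h \in I$ and $S$ is closed, it follows that $Z \in S$. For ($\supseteq$), select $Z \in S$ and pick a small enough interval $I$ about $0$ so that $\det(\operatorname{id} + tZ) \neq 0$ for every $t \in I$. Then $f(t) := \operatorname{id} + tZ$ is a continuously differentiable map from $I$ to $\operatorname{Sym}(M)$. Furthermore, $f(0) = \operatorname{id}$ and $f'(t) = Z$, and so $Z \in \operatorname{sym}(M)$. At this point, we may compute
\[ n^\star(M) = \frac{\operatorname{codim} \operatorname{sym}(M)}{\operatorname{codim} M} = \frac{r(d-r)}{d-r} = r. \]
Furthermore, every $x \in M$ takes the form $x = \begin{bmatrix} v \\ 0 \end{bmatrix}$ with $v \in \mathbb{R}^r$, and so Lemma 3(b) gives
\[ (S_xM)^\perp = \{zx^\top : z \in N_xM\} = \Big\{ \begin{bmatrix} 0 & 0 \\ uv^\top & 0 \end{bmatrix} : u \in \mathbb{R}^{d-r} \Big\}. \]
We claim that if the vectors $\{x_i\}_{i\in[n]}$ are linearly independent, then the subspaces $\{(S_{x_i}M)^\perp\}_{i\in[n]}$ are linearly independent. To see this, suppose $\{(S_{x_i}M)^\perp\}_{i\in[n]}$ are dependent. Then there exist scalars $\{\alpha_{ij}\}_{i\in[n], j\in[d-r]}$, not all of which are zero, such that
\[ \sum_{i\in[n]} \sum_{j\in[d-r]} \alpha_{ij} e_{r+j} x_i^\top = 0. \]
Select $p \in [n]$ and $q \in [d-r]$ such that $\alpha_{pq} \neq 0$. Then
\[ \sum_{i\in[n]} \alpha_{iq} x_i^\top = e_{r+q}^\top \Big( \sum_{i\in[n]} \sum_{j\in[d-r]} \alpha_{ij} e_{r+j} x_i^\top \Big) = 0, \]
i.e., the vectors $\{x_i\}_{i\in[n]}$ are dependent, as claimed. Considering Lemma 4, the result follows by taking $O$ to be the set of linearly independent $\{x_i\}_{i\in[r]} \in M^r$.

Next, we consider the case in which $M$ is a strictly affine subspace, that is, any translation of a subspace that is not itself a subspace: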

Theorem 7. Let $M$ be any $r$-dimensional strictly affine subspace of $\mathbb{R}^d$. Then $n^\circ(M) = n^\star(M) = r + 1$.

Proof sketch. The proof is nearly identical to that of Theorem 6, so we state the intermediate claims without providing details. By Lemma 5, we have $M = \operatorname{span}\{e_i\}_{i\in[r]} + e_{r+1}$ without loss of generality. Then
\begin{align*}
\operatorname{Sym}(M) &= \Big\{ \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} : A \in \operatorname{GL}(r),\ D \in \operatorname{GL}(d-r),\ De_1 = e_1 \Big\}, \\
\operatorname{sym}(M) &= \Big\{ \begin{bmatrix} A & B \\ 0 & D \end{bmatrix} : A \in \mathbb{R}^{r\times r},\ D \in \mathbb{R}^{(d-r)\times(d-r)},\ De_1 = 0 \Big\}.
\end{align*}
It follows that
\[ n^\star(M) = \frac{\operatorname{codim} \operatorname{sym}(M)}{\operatorname{codim} M} = \frac{(r+1)(d-r)}{d-r} = r + 1. \]
Furthermore, if $\{x_i\}_{i\in[n]}$ are independent, then $\{(S_{x_i}M)^\perp\}_{i\in[n]}$ are independent. The result follows by taking $O$ to be the set of linearly independent $\{x_i\}_{i\in[r+1]} \in M^{r+1}$.

Next, we consider certain types of quadrics. The first type of quadric we consider includes spheres and hyperboloids:

Theorem 8. Select any full rank $Q \in \mathbb{R}^{d\times d}_{\mathrm{sym}}$ and put $M := \{x \in \mathbb{R}^d : x^\top Qx = 1\}$.

(a) $M$ is empty if and only if $Q$ is negative definite.

(b) Otherwise, $n^\circ(M) = n^\star(M) = \binom{d+1}{2}$.

We will apply the following well-known consequence of the implicit function theorem:

Lemma 9. Select any continuously differentiable $f\colon \mathbb{R}^d \to \mathbb{R}$ and put
\[ M := \{x \in \mathbb{R}^d : f(x) = 0,\ \nabla f(x) \neq 0\}. \]
Then $N_xM = \operatorname{span}\{\nabla f(x)\}$.

Proof of Theorem 8. First, (a) is immediate. For (b), take the eigenvalue decomposition $Q = U\Lambda U^\top$, and decompose $\Lambda = T^{1/2} S T^{1/2}$ with $T = \operatorname{abs}(\Lambda)$ and $S = \operatorname{sgn}(\Lambda)$, where these operations are performed entrywise. Notice that $Q$ being full rank implies that $U$ and $T$ are both full rank. Substituting $z = T^{1/2}U^\top x$ gives
\begin{align*}
M &:= \{x \in \mathbb{R}^d : x^\top Qx = 1\} \\
&= \{x \in \mathbb{R}^d : x^\top U T^{1/2} S T^{1/2} U^\top x = 1\} = (T^{1/2}U^\top)^{-1} \cdot \{z \in \mathbb{R}^d : z^\top Sz = 1\}.
\end{align*}

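By Lemma 5, we may assume without loss of generality that $Q = \operatorname{diag}(1_p, -1_q)$ for some $p \in [d]$ and $q = d - p$. (Here and throughout, $1_n$ denotes the all-ones vector in $\mathbb{R}^n$.)

We claim that $N_xM = \operatorname{span}\{Qx\}$ for every $x \in M$. Defining $f(x) := x^\top Qx - 1 = \sum_i Q_{ii}x_i^2 - 1$, then $\nabla f(x) = 2Qx$. As such, $x \in M$ only if $x^\top Qx = 1$, only if $x \neq 0$, only if $\nabla f(x) = 2Qx \neq 0$. Our claim then follows from Lemma 9.

Next, we verify that $M$ spans $\mathbb{R}^d$. There are two cases to consider. If $Q$ is the identity matrix, then $M$ contains the identity basis $\{e_i\}_{i\in[d]}$ and therefore spans. Otherwise, $Q = \operatorname{diag}(1_p, -1_q)$ for some $p \in [d-1]$ and $q = d - p$. Considering
\[ (\sqrt{2}e_1 + e_j)^\top Q (\sqrt{2}e_1 + e_j) = 2\|e_1\|^2 - \|e_j\|^2 = 1 \]
for every $j > p$, it follows that $M$ contains $\{e_i\}_{i=1}^{p} \cup \{\sqrt{2}e_1 + e_j\}_{j=p+1}^{d}$ and therefore spans.

Next, we verify that $L := \{xx^\top : x \in M\}$ spans $\mathbb{R}^{d\times d}_{\mathrm{sym}}$. To accomplish this, we select an arbitrary $A \in \mathbb{R}^{d\times d}_{\mathrm{sym}}$ that is orthogonal to $L$, and we show that $A = 0$. Since
\[ x^\top Ax = \operatorname{tr}(x^\top Ax) = \operatorname{tr}(Axx^\top) = \langle A, xx^\top \rangle = 0 \]
for every $x \in M$, it follows that $M \subseteq \{x \in \mathbb{R}^d : x^\top Ax = 0\}$. Now fix $x \in M$ and select an arbitrary $y \in T_xM = \operatorname{span}\{Qx\}^\perp$ and differentiable $f\colon I \to M$ such that $f(0) = x$ and $f'(0) = y$. Then $f(t)^\top A f(t) = 0$ for all $t \in I$, and so the product rule gives
\[ f'(t)^\top A f(t) + f(t)^\top A f'(t) = 0. \]
Evaluating at $t = 0$ gives $y^\top Ax = 0$, and so $y \in \operatorname{span}\{Ax\}^\perp$. This establishes that $\operatorname{span}\{Qx\}^\perp \subseteq \operatorname{span}\{Ax\}^\perp$, meaning $\operatorname{span}\{Ax\} \subseteq \operatorname{span}\{Qx\}$, which in turn implies that there exists $\alpha \in \mathbb{R}$ such that $Ax = \alpha Qx$. Considering
\[ 0 = x^\top Ax = \alpha x^\top Qx = \alpha, \]
it follows that $Ax = 0$. Since our choice for $x \in M$ was arbitrary, and furthermore, $M$ spans $\mathbb{R}^d$, we conclude that $A = 0$, as desired.

Next, we establish that $\operatorname{Sym}(M) = \{A \in \operatorname{GL}(d) : A^\top QA = Q\}$. For ($\subseteq$), select any $A \in \operatorname{Sym}(M)$. Then $A \in \operatorname{GL}(d)$ by definition. Furthermore, for every $x \in M$, it holds that $Ax \in M$, and so
\[ \langle A^\top QA, xx^\top \rangle = (Ax)^\top QAx = 1 = x^\top Qx = \langle Q, xx^\top \rangle. \]
Since $\{xx^\top : x \in M\}$ spans $\mathbb{R}^{d\times d}_{\mathrm{sym}}$, it follows that $A^\top QA = Q$. For ($\supseteq$), suppose $A \in \operatorname{GL}(d)$ satisfies $A^\top QA = Q$. Then for every $x \in \mathbb{R}^d$, it holds that
\[ (Ax)^\top QAx = x^\top A^\top QAx = x^\top Qx, \]
meaning $AM = M$. As such, $A \in \operatorname{Sym}(M)$, as desired.

In words, we have shown that $\operatorname{Sym}(M)$ is the group of linear transformations $A \in \operatorname{GL}(d)$ that leave invariant the symmetric bilinear form $(x, y) \mapsto x^\top Qy$. This group is known as the (indefinite) orthogonal group $\operatorname{O}(p, q)$, where $\operatorname{O}(d, 0) := \operatorname{O}(d)$. We claim that the corresponding Lie algebra is given by
\[ \operatorname{sym}(M) = \{Z \in \mathbb{R}^{d\times d} : Z^\top = -QZQ\} = \Big\{ \begin{bmatrix} A & B \\ B^\top & D \end{bmatrix} : A \in \mathbb{R}^{p\times p}_{\mathrm{antisym}},\ D \in \mathbb{R}^{q\times q}_{\mathrm{antisym}} \Big\}. \]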

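(This is presumably well known, but the proof is short, so we include it.) Select $Z \in \operatorname{sym}(M)$ so that there exists a differentiable $f\colon I \to \operatorname{Sym}(M)$ such that $f(0) = \operatorname{id}$ and $f'(0) = Z$. Differentiating the identity $f(t)^\top Q f(t) = Q$ gives
\[ f'(t)^\top Q f(t) + f(t)^\top Q f'(t) = 0. \]
Evaluating at $t = 0$ then gives $Z^\top Q + QZ = 0$, meaning $Z^\top = -QZQ$. Writing $Z = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$ then reveals that
\[ \begin{bmatrix} A^\top & C^\top \\ B^\top & D^\top \end{bmatrix} = Z^\top = -QZQ = \begin{bmatrix} -A & B \\ C & -D \end{bmatrix}, \]
from which it follows that $Z = \begin{bmatrix} A & B \\ B^\top & D \end{bmatrix}$ with $A \in \mathbb{R}^{p\times p}_{\mathrm{antisym}}$ and $D \in \mathbb{R}^{q\times q}_{\mathrm{antisym}}$. Furthermore, any such matrix satisfies
\[ \begin{bmatrix} A & B \\ B^\top & D \end{bmatrix}^\top = \begin{bmatrix} A^\top & B \\ B^\top & D^\top \end{bmatrix} = \begin{bmatrix} -A & B \\ B^\top & -D \end{bmatrix} = -Q \begin{bmatrix} A & B \\ B^\top & D \end{bmatrix} Q. \]
It remains to show that every $Z \in \mathbb{R}^{d\times d}$ satisfying $Z^\top = -QZQ$ necessarily resides in $\operatorname{sym}(M)$. To this end, define $f\colon \mathbb{R} \to \operatorname{GL}(d)$ by $f(t) = e^{tZ}$. Then $f(0) = \operatorname{id}$ and $f'(0) = Z$. It suffices to verify that $f(t)^\top Q f(t) = Q$, since this would imply $f\colon \mathbb{R} \to \operatorname{Sym}(M)$, meaning $Z \in \operatorname{sym}(M)$. Since $Q^{-1} = Q$, we have
\[ f(t)^\top Q f(t) = e^{tZ^\top} Q e^{tZ} = e^{-tQZQ} Q e^{tZ} = e^{-tQZQ^{-1}} Q e^{tZ} = Q e^{-tZ} Q^{-1} Q e^{tZ} = Q. \]
At this point, we may compute
\[ n^\star(M) = \frac{\operatorname{codim} \operatorname{sym}(M)}{\operatorname{codim} M} = \frac{d^2 - \binom{d}{2}}{d - (d-1)} = \binom{d+1}{2}. \]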

Furthermore, Lemma 3(b) gives

\[ (S_xM)^\perp = \{zx^\top : z \in N_xM\} = \{zx^\top : z \in \operatorname{span}\{Qx\}\} = \operatorname{span}\{Qxx^\top\}. \]
Put $n := \binom{d+1}{2}$. Considering Lemma 4, we want to find $\{x_i\}_{i\in[n]}$ in $M$ for which the subspaces $\{\operatorname{span}\{Qx_ix_i^\top\}\}_{i\in[n]}$ are linearly independent. This occurs precisely when $\{Qx_ix_i^\top\}_{i\in[n]}$ are linearly independent, which in turn occurs precisely when $\{x_ix_i^\top\}_{i\in[n]}$ are linearly independent. Overall, we wish to find an open and dense subset $O$ of $M^n$ such that for every $\{x_i\}_{i\in[n]} \in O$, it holds that $\{x_ix_i^\top\}_{i\in[n]}$ are linearly independent.

First, let $O'$ denote the set of $\{x_i\}_{i\in[n]} \in (\mathbb{R}^d)^n$ such that $\{x_ix_i^\top\}_{i\in[n]}$ is linearly independent. We claim that $O'$ is open and dense in $(\mathbb{R}^d)^n$. Select an orthonormal basis $\{B_i\}_{i\in[n]}$ for $\mathbb{R}^{d\times d}_{\mathrm{sym}}$, consider the mapping $\mathcal{A}\colon (\mathbb{R}^d)^n \to \mathbb{R}^{n\times n}$ defined by $(\mathcal{A}(\{x_i\}_{i\in[n]}))_{jk} := \langle B_j, x_kx_k^\top \rangle$, and let $p$ denote the polynomial in $dn$ variables defined by
\[ p(\{x_i\}_{i\in[n]}) := \det \mathcal{A}(\{x_i\}_{i\in[n]}). \]
Then $O' = p^{-1}(\mathbb{R} \setminus \{0\})$, which is open and dense in $(\mathbb{R}^d)^n$ provided $p$ is not the zero polynomial. As such, it suffices to find $\{x_i\}_{i\in[n]}$ such that $p(\{x_i\}_{i\in[n]}) \neq 0$. To this end, for each $i \in [n]$, consider the eigenvalue decomposition $B_i = \sum_{j\in[d]} \lambda_{ij} u_{ij}u_{ij}^\top$. Then $\{u_{ij}u_{ij}^\top\}_{i\in[n], j\in[d]}$ is a spanning set for $\mathbb{R}^{d\times d}_{\mathrm{sym}}$, and so any choice of basis $\{u_{ij}u_{ij}^\top\}_{(i,j)\in S}$ in this spanning set has the desired property that $p(\{u_{ij}\}_{(i,j)\in S}) \neq 0$.

Finally, we claim that $O := O' \cap M^n$ is open and dense in $M^n$. Openness follows from the definition of the subspace topology. To demonstrate denseness, consider the open set
\[ R := \{x \in \mathbb{R}^d : x^\top Qx > 0\}. \]
The mapping $g\colon R \to M$ defined by $g(x) := x / \sqrt{x^\top Qx}$ is surjective since $g(x) = x$ for every $x \in M \subseteq R$. Moreover, for every $\{x_i\}_{i\in[n]} \in O' \cap R^n$, it holds that $\{x_ix_i^\top\}_{i\in[n]}$ is linearly independent, and so $\{g(x_i)g(x_i)^\top\}_{i\in[n]}$ is also linearly independent, meaning $\{g(x_i)\}_{i\in[n]} \in O$. As such, the continuous function $h\colon R^n \to M^n$ defined by $h(\{x_i\}_{i\in[n]}) := \{g(x_i)\}_{i\in[n]}$ has the property that $h(O' \cap R^n) = O$. Since $O' \cap R^n$ is dense in $R^n$, it follows that $O = h(O' \cap R^n)$ is dense in $M^n = h(R^n)$, as desired.

Finally, we consider a family of quadrics that includes cones:

Theorem 10. Select any full rank $Q \in \mathbb{R}^{d\times d}_{\mathrm{sym}}$ and put $M := \{x \in \mathbb{R}^d \setminus \{0\} : x^\top Qx = 0\}$.

(a) $M$ is empty if and only if $Q$ is positive definite or negative definite.

(b) Otherwise, if $d = 2$, then $n^\star(M) = 2$ but $n^\circ(M) = \infty$.

(c) Otherwise, $n^\circ(M) = n^\star(M) = \binom{d+1}{2} - 1$.

For the previous results in this section, we computed $n^\circ(M)$ by selecting a certain open and dense subset of $M^n$. To accomplish this, we first identified an appropriate polynomial over $(\mathbb{R}^d)^n$, and then we argued that the complement of the zero set of this polynomial has an open and dense intersection with $M^n$. An analogous argument for Theorem 10 appears to require more powerful tools from real algebraic geometry. A real algebraic set $V \subseteq \mathbb{R}^k$ is the simultaneous zero set of a finite collection of polynomials in $\mathbb{R}[x_1, \ldots, x_k]$, which in turn determines the ideal $I(V) \subseteq \mathbb{R}[x_1, \ldots, x_k]$ of all polynomials that vanish on $V$. The dimension of a real algebraic set $V$ is the dimension of the ring $\mathbb{R}[x_1, \ldots, x_k]/I(V)$. This notion of dimension can be difficult to compute directly. Similarly, the definition of a nonsingular point of a real algebraic set is particularly technical; see Definition 3.3.9 of [2]. To simplify our interaction with these notions, we factor our analysis through the following proposition, the proof of which is contained in Section 5 of [6].

Proposition 11. Let $V \subseteq \mathbb{R}^k$ denote the simultaneous zero set of real polynomials $\{p_i\}_{i\in[k-d]}$.

(a) If $V$ is a $C^\infty$ submanifold of $\mathbb{R}^k$, then its dimension as a real algebraic set equals its dimension as a manifold.

(b) If $V$ has dimension $d$ as a real algebraic set and the Jacobian of $z \mapsto \{p_i(z)\}_{i\in[k-d]}$ has rank $k - d$ at $x \in V$, then $x$ is a nonsingular point of $V$.

(c) If the nonsingular points in $V$ form a dense and path-connected subset, then for every real algebraic set $U \subseteq \mathbb{R}^k$, the relative complement $V \setminus U$ is either empty or both open and dense in $V$.

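Proof of Theorem 10. First, (a) is immediate. In what follows, we assume $Q$ is neither positive definite nor negative definite. As in the proof of Theorem 8, we may assume without loss of generality that $Q = \operatorname{diag}(1_p, -1_q)$ for some $p \in [d]$ and $q = d - p$. Moreover, since $Q$ is neither positive nor negative definite, we have $p, q \in [d-1]$. Also, it holds that $N_xM = \operatorname{span}\{Qx\}$ for every $x \in M$.

Next, we verify that $L := \{xx^\top : x \in M\}$ satisfies $\operatorname{span}(L) = \operatorname{span}\{Q\}^\perp$. The definition of $M$ immediately gives $\operatorname{span}(L) \subseteq \operatorname{span}\{Q\}^\perp$. For ($\supseteq$), consider $\begin{bmatrix} A & B \\ C & D \end{bmatrix} \in \operatorname{span}\{Q\}^\perp$ with $A \in \mathbb{R}^{p\times p}$. Then $C = B^\top$, $A \in \mathbb{R}^{p\times p}_{\mathrm{sym}}$, $D \in \mathbb{R}^{q\times q}_{\mathrm{sym}}$, and $\operatorname{tr}(A) = \operatorname{tr}(D)$. For $x \in \mathbb{R}^p$ and $y \in \mathbb{R}^q$, we define
\[ A(x, y) := \begin{bmatrix} xx^\top & xy^\top \\ yx^\top & yy^\top \end{bmatrix}, \]
and we observe that $A(x, y) \in L$ precisely when $\|x\| = \|y\| \neq 0$. It follows that $\operatorname{span}(L)$ contains $\begin{bmatrix} 0 & B \\ B^\top & 0 \end{bmatrix}$, since we can pass to the singular value decomposition of $B$ and apply the fact that for $\|x\| = \|y\| = 1$,
\[ \begin{bmatrix} 0 & xy^\top \\ yx^\top & 0 \end{bmatrix} = \frac{A(x, y) - A(x, -y)}{2} \in \operatorname{span}(L). \]
It remains to show that $\operatorname{span}(L)$ contains $\begin{bmatrix} A & 0 \\ 0 & D \end{bmatrix}$. To this end, it suffices to show that $\begin{bmatrix} X & 0 \\ 0 & Y \end{bmatrix} \in \operatorname{span}(L)$ for every $X, Y \succeq 0$ such that $\operatorname{tr}(X) = \operatorname{tr}(Y)$. Indeed, we may express $\begin{bmatrix} A & 0 \\ 0 & D \end{bmatrix}$ as a difference of such matrices:
\[ \begin{bmatrix} A & 0 \\ 0 & D \end{bmatrix} = \begin{bmatrix} A + \frac{c}{p}\operatorname{id}_p & 0 \\ 0 & D + \frac{c}{q}\operatorname{id}_q \end{bmatrix} - \begin{bmatrix} \frac{c}{p}\operatorname{id}_p & 0 \\ 0 & \frac{c}{q}\operatorname{id}_q \end{bmatrix}, \]
where $c > 0$ is appropriately large. Given eigenvalue decompositions $X = \sum_{i\in[p]} \lambda_i x_ix_i^\top$ and $Y = \sum_{j\in[q]} \mu_j y_jy_j^\top$, select measurable partitions
\[ I_1 \sqcup \cdots \sqcup I_p = [0, \operatorname{tr}(X)] = [0, \operatorname{tr}(Y)] = J_1 \sqcup \cdots \sqcup J_q \]
such that $|I_i| = \lambda_i$ and $|J_j| = \mu_j$. Then
\[ \begin{bmatrix} X & 0 \\ 0 & Y \end{bmatrix} = \sum_{i\in[p]} \sum_{j\in[q]} |I_i \cap J_j| \cdot \begin{bmatrix} x_ix_i^\top & 0 \\ 0 & y_jy_j^\top \end{bmatrix}. \]
Each term above resides in $\operatorname{span}(L)$ since for $\|x\| = \|y\| = 1$, it holds that
\[ \begin{bmatrix} xx^\top & 0 \\ 0 & yy^\top \end{bmatrix} = \frac{A(x, y) + A(x, -y)}{2} \in \operatorname{span}(L). \]
As such, $\begin{bmatrix} X & 0 \\ 0 & Y \end{bmatrix} \in \operatorname{span}(L)$, as desired.

Next, we verify that $\operatorname{Sym}(M) = \{A \in \operatorname{GL}(d) : A^\top QA = \alpha Q,\ \alpha \neq 0\}$. For ($\subseteq$), let $A \in \operatorname{Sym}(M)$. Then
\[ \langle A^\top QA, xx^\top \rangle = (Ax)^\top QAx = 0 = x^\top Qx = \langle Q, xx^\top \rangle. \]
Since $\operatorname{span}(L) = \operatorname{span}\{Q\}^\perp$, we have shown $A^\top QA = \alpha Q$ for $\alpha \neq 0$. For ($\supseteq$), suppose $A \in \operatorname{GL}(d)$ satisfies $A^\top QA = \alpha Q$ for some $\alpha \neq 0$. Then for every $x \in \mathbb{R}^d$,
\[ (Ax)^\top Q(Ax) = x^\top A^\top QAx = \alpha x^\top Qx, \]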

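and so $AM = M$. As such, $A \in \operatorname{Sym}(M)$ as desired. We claim that the corresponding Lie algebra is given by
\begin{align*}
\operatorname{sym}(M) &= \{Z \in \mathbb{R}^{d\times d} : Z^\top = -QZQ + 2\beta \operatorname{id},\ \beta \in \mathbb{R}\} \\
&= \Big\{ \begin{bmatrix} A_0 & B \\ B^\top & D_0 \end{bmatrix} + \beta \operatorname{id} : A_0 \in \mathbb{R}^{p\times p}_{\mathrm{antisym}},\ D_0 \in \mathbb{R}^{q\times q}_{\mathrm{antisym}},\ \beta \in \mathbb{R} \Big\}.
\end{align*}
Select $Z \in \operatorname{sym}(M)$ so that there exists a differentiable $f\colon I \to \operatorname{Sym}(M)$ such that $f(0) = \operatorname{id}$ and $f'(0) = Z$. Define $\alpha\colon I \to \mathbb{R} \setminus \{0\}$ by
\[ \alpha(t) := \frac{\operatorname{tr}(f(t)^\top Q f(t) Q)}{\|Q\|_F^2}. \]
Differentiating the identity $f(t)^\top Q f(t) = \alpha(t)Q$ gives
\[ f'(t)^\top Q f(t) + f(t)^\top Q f'(t) = \alpha'(t)Q. \]
Evaluating at $t = 0$ then gives $Z^\top Q + QZ = \alpha'(0)Q$, meaning $Z^\top = -QZQ + \alpha'(0)\operatorname{id}$. Writing $Z = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$ then reveals that
\[ \begin{bmatrix} A^\top & C^\top \\ B^\top & D^\top \end{bmatrix} = Z^\top = -QZQ + \alpha'(0)\operatorname{id} = \begin{bmatrix} \alpha'(0)\operatorname{id}_p - A & B \\ C & \alpha'(0)\operatorname{id}_q - D \end{bmatrix}, \]
from which it follows that $C = B^\top$, $A + A^\top = \alpha'(0)\operatorname{id}_p$, and $D + D^\top = \alpha'(0)\operatorname{id}_q$. Setting $\beta = \alpha'(0)/2$, we see that $Z = \begin{bmatrix} A_0 & B \\ B^\top & D_0 \end{bmatrix} + \beta\operatorname{id}$ for $A_0 \in \mathbb{R}^{p\times p}_{\mathrm{antisym}}$, $D_0 \in \mathbb{R}^{q\times q}_{\mathrm{antisym}}$, and $\beta \in \mathbb{R}$. Furthermore, any such matrix satisfies
\[ \Big( \begin{bmatrix} A_0 & B \\ B^\top & D_0 \end{bmatrix} + \beta\operatorname{id} \Big)^\top = \begin{bmatrix} A_0^\top & B \\ B^\top & D_0^\top \end{bmatrix} + \beta\operatorname{id} = \begin{bmatrix} -A_0 & B \\ B^\top & -D_0 \end{bmatrix} + \beta\operatorname{id} = -Q\Big( \begin{bmatrix} A_0 & B \\ B^\top & D_0 \end{bmatrix} + \beta\operatorname{id} \Big)Q + 2\beta\operatorname{id}. \]
Fixing $\beta \in \mathbb{R}$, it remains to show that every $Z \in \mathbb{R}^{d\times d}$ satisfying $Z^\top = -QZQ + 2\beta\operatorname{id}$ necessarily resides in $\operatorname{sym}(M)$. To this end, define $f\colon \mathbb{R} \to \operatorname{GL}(d)$ by $f(t) = e^{tZ}$. Then $f(0) = \operatorname{id}$ and $f'(0) = Z$. It suffices to verify that $f(t)^\top Q f(t) = \alpha Q$ for some $\alpha \neq 0$, since this would imply $f\colon \mathbb{R} \to \operatorname{Sym}(M)$, meaning $Z \in \operatorname{sym}(M)$. Since $Q^{-1} = Q$, we have
\[ f(t)^\top Q f(t) = e^{tZ^\top} Q e^{tZ} = e^{-tQZQ + 2\beta t\operatorname{id}} Q e^{tZ} = e^{2\beta t} e^{-tQZQ^{-1}} Q e^{tZ} = e^{2\beta t} Q e^{-tZ} Q^{-1} Q e^{tZ} = e^{2\beta t} Q. \]
At this point, we may compute
\[ n^\star(M) = \frac{\operatorname{codim} \operatorname{sym}(M)}{\operatorname{codim} M} = \frac{d^2 - \big(\binom{d}{2} + 1\big)}{d - (d-1)} = \binom{d+1}{2} - 1. \]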

It remains to compute $n^\circ(M)$. For this, (b) and (c) require different approaches. For (b), we take $d = 2$. Express $M = M_+ \sqcup M_-$, where
\[ M_\pm := \{x \in \mathbb{R}^2 \setminus \{0\} : x_1 = \pm x_2\}. \]
Fix $n \in \mathbb{N}$ and select $O \in \mathcal{O}(M^n)$. Observe that $(1, 1) \in M_+ \subseteq M$, and therefore $\{(1, 1)\}_{i\in[n]} \in M^n$. Since $O$ is dense in $M^n$, there exists $\{x_i\}_{i\in[n]} \in O$ arbitrarily close to $\{(1, 1)\}_{i\in[n]}$; i.e., $x_i \in M_+$ for all $i$. Since $T_{x_i}M = T_{x_i}M_+$ for every $i \in [n]$, it holds that
\[ \bigcap_{i\in[n]} S_{x_i}M = \bigcap_{i\in[n]} S_{x_i}M_+ \supseteq \operatorname{sym}(M_+), \]
where the last step follows from Lemma 4. As in the proof of Theorem 6, $\operatorname{codim} \operatorname{sym}(M_+) = 1$. On the other hand, $\operatorname{codim} \operatorname{sym}(M) = 2$, and so $\bigcap_{i\in[n]} S_{x_i}M \neq \operatorname{sym}(M)$. All together, we conclude $n^\circ(M) = \infty$.

For (c), we consider $d \geq 3$. Lemma 3(b) gives
\[ (S_xM)^\perp = \{zx^\top : z \in N_xM\} = \{zx^\top : z \in \operatorname{span}\{Qx\}\} = \operatorname{span}\{Qxx^\top\}. \]
Put $n := \binom{d+1}{2} - 1$. Considering Lemma 4, we want to find $\{x_i\}_{i\in[n]}$ in $M$ for which the subspaces $\{\operatorname{span}\{Qx_ix_i^\top\}\}_{i\in[n]}$ are linearly independent. This occurs precisely when $\{Qx_ix_i^\top\}_{i\in[n]}$ are linearly independent, which in turn occurs precisely when $\{x_ix_i^\top\}_{i\in[n]}$ are linearly independent. Overall, we wish to find an open and dense subset $O$ of $M^n$ such that for every $\{x_i\}_{i\in[n]} \in O$, it holds that $\{x_ix_i^\top\}_{i\in[n]}$ are linearly independent.

As in Theorem 8, let $O'$ denote the set of $\{x_i\}_{i\in[n]} \in (\mathbb{R}^d)^n$ such that $\{x_ix_i^\top\}_{i\in[n]}$ is linearly independent; recall that $O'$ is open and dense in $(\mathbb{R}^d)^n$. Then $O := O' \cap M^n$ is open in $M^n$, and it remains to show that $O$ is dense in $M^n$. By scaling, it suffices to show that
\[ Z = \{(x_1, \ldots, x_n) \in O : \|x_i\| = \sqrt{2} \text{ for all } i \in [n]\} \]
is dense in
\[ V = \{(x_1, \ldots, x_n) \in M^n : \|x_i\| = \sqrt{2} \text{ for all } i \in [n]\}. \]
Considering $Z$ takes the form $Z = V \setminus U$ for some real algebraic set $U$, we are in a position to apply Proposition 11(c). We proceed in two cases.

Case I. $p \neq 1 \neq q$. By Proposition 11(c), it suffices to show that

(i) $Z$ is nonempty,

(ii) every point in $V$ is nonsingular, and

(iii) $V$ is connected.

For (i), recall that $\operatorname{span}(L) = \operatorname{span}\{Q\}^\perp$ and choose an appropriately scaled basis $\{x_ix_i^\top\}_{i\in[n]}$ for $\operatorname{span}(L)$ so that $\{x_i\}_{i\in[n]} \in Z$. Next, we demonstrate (ii). We may identify $V$ with $(S^{p-1} \times S^{q-1})^n$, meaning $V$ is a $C^\infty$ submanifold of $\mathbb{R}^{dn}$, and so Proposition 11(a) gives that its dimension as a real algebraic set equals $dn - 2n$. For each $x_i \in \mathbb{R}^d$, write $x_i = (y_i, z_i)$ with $y_i \in \mathbb{R}^p$ and $z_i \in \mathbb{R}^q$, and define the polynomials
\[ P_i(x_1, \ldots, x_n) = \|y_i\|^2 - 1, \qquad Q_i(x_1, \ldots, x_n) = \|z_i\|^2 - 1. \]
Observe that $V$ is the simultaneous zero set of $\{P_i, Q_i\}_{i\in[n]}$. For $\{x_i\}_{i\in[n]} \in V$, the Jacobian
\[ J(x_1, \ldots, x_n) := \operatorname{blockdiag}(2y_1^\top, 2z_1^\top, \ldots, 2y_n^\top, 2z_n^\top) \]
satisfies $JJ^\top = 4\operatorname{id}$, and so $J$ has rank $2n$. Proposition 11(b) then gives (ii). Finally, the identification $V = (S^{p-1} \times S^{q-1})^n$ implies (iii), as desired.

Case II. Either $p = 1$ or $q = 1$. We may identify $V$ with $(\{\pm1\} \times S^{d-2})^n$, and so we may express $V$ as the union of $2^n$ connected components
\[ V = \bigsqcup_{\epsilon \in \{\pm1\}^n} V_\epsilon, \qquad V_\epsilon := \Big\{ \Big( \begin{bmatrix} \epsilon_1 \\ z_1 \end{bmatrix}, \ldots, \begin{bmatrix} \epsilon_n \\ z_n \end{bmatrix} \Big) : z_1, \ldots, z_n \in S^{d-2} \Big\}. \]
We will apply Proposition 11(c) to show that, for each $\epsilon \in \{\pm1\}^n$, the intersection $V_\epsilon \cap Z$ is dense in $V_\epsilon$. By the argument in Case I, $Z$ is nonempty, and so we may select $\{x_i\}_{i\in[n]} \in Z$. Since $Z \subseteq V$, there exists $\epsilon \in \{\pm1\}^n$ such that $\{x_i\}_{i\in[n]} \in V_\epsilon$. Then for each $\epsilon' \in \{\pm1\}^n$, it holds that $\{\epsilon_i\epsilon'_i x_i\}_{i\in[n]} \in V_{\epsilon'} \cap Z$. As such, $V_\epsilon \cap Z$ is nonempty for every $\epsilon \in \{\pm1\}^n$. Moreover, $V_\epsilon$ is connected and, by the argument in Case I, every point in $V_\epsilon$ is nonsingular. All together, we conclude that each $V_\epsilon \cap Z$ is dense in $V_\epsilon$, and so $Z$ is dense in $V$, as desired.
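The threshold in Theorem 8 can be observed numerically in the idealized setting of this section. The sketch below is an illustration rather than part of the proofs: the helper name `kernel_dim_sphere` and the tolerance `tol` are hypothetical choices. It draws random points on the unit sphere in $\mathbb{R}^d$, uses the exact normal spaces $N_xM = \operatorname{span}\{x\}$, and counts the numerically zero eigenvalues of $\Sigma$, which by (1) and Lemma 2 is the dimension of $\bigcap_i S_{x_i}M$.

```python
import numpy as np

def kernel_dim_sphere(n, d, rng, tol=1e-8):
    """Dimension of the intersection of S_{x_i}M for n random points on the unit sphere.

    With exact tangent spaces, proj onto T_i^perp and proj onto span{x_i} both
    equal x_i x_i^T, so Sigma is a sum of Kronecker products of that matrix.
    """
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    Sigma = np.zeros((d * d, d * d))
    for x in X:
        P = np.outer(x, x)
        Sigma += np.kron(P, P)
    eigvals = np.linalg.eigvalsh(Sigma)
    return int(np.sum(eigvals < tol))

rng = np.random.default_rng(1)
d = 3
dim_at_threshold = kernel_dim_sphere(6, d, rng)  # n = binom(d+1, 2) = 6 samples
dim_below = kernel_dim_sphere(5, d, rng)         # one sample fewer
```

For $d = 3$, the Lie algebra of $\operatorname{O}(3)$ has dimension $3$ and $n^\star(M) = \binom{4}{2} = 6$, so one expects the intersection to have dimension $3$ with six generic samples and a strictly larger dimension with five.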

4 Application to density estimation

In this section, we apply Lie PCA to perform density estimation in various settings. For each experiment, we consider a manifold with a nontrivial Lie group of symmetries. We draw points $\{x_i\}_{i\in[n]}$ in $\mathbb{R}^d$ according to some distribution supported on that manifold. The density estimation algorithm then uses these points to produce $N \gg n$ draws $\{y_s\}_{s\in[N]}$ from an estimate of the underlying distribution. For these experiments, we grant access to the dimension $r$ of the manifold and the dimension $\ell$ of the Lie algebra so as to isolate the performance of each algorithm from the task of learning hyperparameters.

Our algorithm first runs local PCA on the $k$ nearest neighbors of each data point to produce estimates $\{T_i\}_{i\in[n]}$ of the tangent spaces at the sample points. We then run Algorithm 1 to obtain an estimate $\operatorname{span}\{v_j\}_{j\in[\ell]}$ of $\operatorname{sym}(M)$. Then for each $s \in [N]$, we draw $t \sim \operatorname{Unif}([n])$ and a random $A$ from $\operatorname{span}\{v_j\}_{j\in[\ell]}$ with spherical Gaussian distribution, and then we put $y_s := e^A x_t$. If $y_s$ is too far away from $\{x_i\}_{i\in[n]}$, then we replace $y_s$ with another draw from this random process, repeating as necessary.

We compare our approach to a few alternatives. One baseline is to simply draw each $y_s$ uniformly from $\{x_i\}_{i\in[n]}$. We denote this by BL1. Another baseline, which we call BL2, is to draw each $y_s$ from the unknown distribution. While this is not a plausible alternative, it indicates how well an algorithm can possibly perform.

We also consider a standard approach known as kernel density estimation, which we denote by KDE. For this approach, we draw $t \sim \operatorname{Unif}([n])$ and a random Gaussian vector $g$ with covariance determined by Silverman's rule of thumb [14], and then put $y_s := x_t + g$. This sort of estimate is specifically designed for the regime in which $n$ is large, where the covariance decays gracefully to zero. (We will find that the sample sizes $n < 100$ used in our experiments are not large enough for this method to perform well.)

Finally, we consider local PCA, which we denote by LPCA, in which the same estimates $\{T_i\}_{i\in[n]}$ obtained for the Lie PCA approach are used. Here, we draw $t \sim \operatorname{Unif}([n])$ and a random Gaussian vector $g$ in $T_t$, and then put $y_s := x_t + g$. If $y_s$ is too far away from $\{x_i\}_{i\in[n]}$, then we replace $y_s$ with another draw from this random process, repeating as necessary.

We consider two different metrics for measuring the performance of these algorithms. Both metrics compare $\{y_s\}_{s\in[N]}$ to a fresh draw $\{z_s\}_{s\in[N]}$ from the underlying distribution.
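The Lie PCA sampling step described above, $y_s := e^A x_t$ with rejection of far-away draws, can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the basis of the estimated Lie algebra is supplied by hand rather than computed by Algorithm 1, and the names `mat_exp`, `lie_pca_draws`, `sigma`, and `max_dist` are hypothetical. In particular, the text does not specify the Gaussian scale or the rejection threshold, so both are left as free parameters.

```python
import math
import random


def mat_exp(A, terms=30):
    """Approximate e^A for a small square matrix A by a truncated Taylor series."""
    d = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    term = [row[:] for row in result]  # holds A^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[sum(term[i][m] * A[m][j] for m in range(d)) / k
                 for j in range(d)] for i in range(d)]
        result = [[result[i][j] + term[i][j] for j in range(d)] for i in range(d)]
    return result


def lie_pca_draws(xs, basis, N, sigma, max_dist, rng):
    """Draw N points y = e^A x_t, rejecting draws farther than max_dist from the data.

    basis: matrices spanning the estimated Lie algebra (assumed orthonormal).
    sigma, max_dist: hypothetical tuning parameters, not taken from the paper.
    """
    d = len(xs[0])
    draws = []
    while len(draws) < N:
        x = rng.choice(xs)  # t ~ Unif([n])
        coeffs = [rng.gauss(0.0, sigma) for _ in basis]
        # Spherical Gaussian in the span: i.i.d. coefficients on the basis.
        A = [[sum(c * V[i][j] for c, V in zip(coeffs, basis)) for j in range(d)]
             for i in range(d)]
        E = mat_exp(A)
        y = tuple(sum(E[i][j] * x[j] for j in range(d)) for i in range(d))
        if min(math.dist(y, xi) for xi in xs) <= max_dist:
            draws.append(y)
    return draws


# Toy data: four points on the unit circle, whose symmetries are rotations.
rng = random.Random(0)
xs = [(math.cos(a), math.sin(a)) for a in (0.0, 0.5, 1.0, 2.0)]
s = 1.0 / math.sqrt(2.0)
J = [[0.0, -s], [s, 0.0]]  # unit-norm generator of rotations of the plane
ys = lie_pca_draws(xs, [J], N=20, sigma=0.3, max_dist=0.5, rng=rng)
```

In this toy case every $e^A$ is a rotation, so each draw stays on the circle, whereas additive Gaussian noise would move draws off of it.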

Table 1: Density estimation error in normalized earth mover's (nEMD) and Hausdorff distances

Manifold          Metric      BL1      KDE      LPCA     Lie PCA  BL2
line              nEMD        0.3901   0.7391   0.4902   0.4185   0.1543
                  Hausdorff   1.0886   2.1822   0.5558   0.5838   0.3645
ellipse           nEMD        0.4782   0.8620   0.5346   0.4686   0.1909
                  Hausdorff   0.8925   2.2269   1.3633   0.1569   0.1390
hyperbola         nEMD        0.3322   0.5645   0.4337   0.2750   0.1444
                  Hausdorff   3.0194   2.4414   2.5090   2.3396   0.4246
ellipse + noise   nEMD        0.4030   0.4959   0.3580   0.3057   0.3365
                  Hausdorff   0.7951   1.8170   1.0371   0.5512   0.6492
torus             nEMD        1.0040   1.0465   0.9549   0.7022   0.5469
                  Hausdorff   1.5197   3.2996   2.5790   1.7882   0.9122

Taking inspiration from [1], our first metric takes the (normalized) earth mover's distance between $\{y_s\}_{s\in[N]}$ and $\{z_s\}_{s\in[N]}$. Conveniently, this distance can be obtained by linear programming. Indeed, defining $D \in \mathbb{R}^{N\times N}$ by $D_{st} := \|y_s - z_t\|_2$ gives

$$\operatorname{nEMD}\big(\{y_s\}_{s\in[N]}, \{z_s\}_{s\in[N]}\big) = \min\Big\{ \tfrac{1}{N}\operatorname{tr}(DX) : X \in \mathbb{R}^{N\times N},\ X\mathbf{1} = X^\top\mathbf{1} = \mathbf{1},\ X \geq 0 \Big\}.$$

Intuitively, the normalized earth mover's distance captures the average distance traveled per point by optimal transport from $\{y_s\}_{s\in[N]}$ to $\{z_s\}_{s\in[N]}$. We found that for $N = 300$, this distance can be computed in CVX [9] in about 15 seconds.

We also wanted a metric that is determined by the underlying supports of the densities rather than being sensitive to fluctuations in the densities. This led us to also consider the Hausdorff distance between $\{y_s\}_{s\in[N]}$ and $\{z_s\}_{s\in[N]}$, which is much faster to compute:

$$\operatorname{Hausdorff}\big(\{y_s\}_{s\in[N]}, \{z_s\}_{s\in[N]}\big) = \max\Big\{ \max_{s\in[N]} \min_{t\in[N]} \|y_s - z_t\|_2,\ \max_{t\in[N]} \min_{s\in[N]} \|y_s - z_t\|_2 \Big\}.$$

In words, if we identify the closest member of $\{y_s\}_{s\in[N]}$ to each point in $\{z_s\}_{s\in[N]}$, and vice versa, then the Hausdorff distance reports the largest of these distances.

We considered densities on three different manifolds in $\mathbb{R}^2$. For these instances, we take $n = 30$, $N = 300$, $k = 2$, $r = 1$, and $\ell = 1$. We then considered a noisy version of a manifold-supported density in $\mathbb{R}^2$, in which we take $n = 60$, $N = 300$, $k = 10$, $r = 1$, and $\ell = 1$. Finally, we considered a density on a manifold in $\mathbb{R}^3$, in which we take $n = 60$, $N = 300$, $k = 20$, $r = 2$, and $\ell = 1$. Results are reported in Table 1 and illustrated in Figure 1.
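The two metrics used in this section can be sketched on tiny point sets as follows. This is an illustration rather than the computation used in the experiments, which relied on CVX. For the earth mover's distance, the sketch swaps the linear program for a brute-force search over permutations: the feasible set of the linear program is the set of doubly stochastic matrices, whose extreme points are the permutation matrices (Birkhoff-von Neumann theorem), so the minimum is attained at a permutation. Brute force is only practical for very small $N$. The function names `hausdorff` and `nemd_bruteforce` are hypothetical.

```python
import itertools
import math


def hausdorff(ys, zs):
    """Largest nearest-neighbor distance, taken in both directions."""
    y_to_z = max(min(math.dist(y, z) for z in zs) for y in ys)
    z_to_y = max(min(math.dist(y, z) for y in ys) for z in zs)
    return max(y_to_z, z_to_y)


def nemd_bruteforce(ys, zs):
    """Average distance per point under the best one-to-one matching.

    Tries all N! matchings, so it is meant for tiny N only.
    """
    N = len(ys)
    if len(zs) != N:
        raise ValueError("both point sets must have the same size")
    D = [[math.dist(y, z) for z in zs] for y in ys]
    best = min(sum(D[s][perm[s]] for s in range(N))
               for perm in itertools.permutations(range(N)))
    return best / N


# Small example in the plane.
ys = [(0.0, 0.0), (3.0, 0.0)]
zs = [(0.0, 0.0), (0.0, 4.0)]
print(hausdorff(ys, zs))        # 4.0
print(nemd_bruteforce(ys, zs))  # 2.5
```

In the example, the point (0, 4) has no neighbor closer than distance 4, which determines the Hausdorff distance, while the optimal matching leaves (0, 0) in place and moves (3, 0) to (0, 4) at cost 5, for an average of 2.5.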

5 Discussion

Figure 1: Illustration of results from Table 1. Given $n$ independent draws from an unknown distribution in $\mathbb{R}^d$, simulate an additional $N \gg n$ draws. (left) Given data points. The origin of $\mathbb{R}^d$ is located at the center of the display box. (middle left) Simulated draws using kernel density estimation with Silverman's rule of thumb. (middle) Simulated draws using local PCA. (middle right) Simulated draws using Lie PCA. (right) Fresh draws from the true distribution.

This paper introduced a spectral method that uses the output from local PCA to estimate the Lie algebra corresponding to the symmetry group of the underlying manifold. In this section, we point out a few opportunities for future work. First, recall that our spectral method arises from relaxing the set of $\ell$-dimensional Lie algebras to the Grassmannian. It would be interesting to somehow round the solution of the spectral method to a nearby Lie algebra. Next, our sample complexity results came from focusing on specific families of manifolds. It would be nice to have more general results in this vein. For example, can we characterize the manifolds $M$ for which $n^\circ(M) = n^\star(M)$? When applying Lie PCA to the density estimation problem, we would prefer a principled approach for drawing $A$ from our estimate of $\operatorname{sym}(M)$; we currently apply an ad hoc adaptation of Silverman's rule of thumb. Finally, the performance of Lie PCA for density estimation appears to depend on whether the symmetry group acts transitively on the manifold. For example, the torus partitions into circular orbits under the action of its symmetry group. For this manifold, Lie PCA will encourage sampling along these orbits without regard for the other dimension of the manifold. However, local PCA captures some information about this other dimension, and it would be interesting to somehow incorporate this into the density estimation algorithm.
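To make the torus remark concrete, here is a short worked example. It assumes the standard torus of revolution about the third coordinate axis, with hypothetical radii $R > \rho > 0$; the parametrization actually used in the experiments is not stated in the text.

```latex
\[
x(\theta,\varphi)
= \begin{pmatrix}
(R+\rho\cos\varphi)\cos\theta \\
(R+\rho\cos\varphi)\sin\theta \\
\rho\sin\varphi
\end{pmatrix},
\qquad
A = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
e^{tA} = \begin{pmatrix} \cos t & -\sin t & 0 \\ \sin t & \cos t & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]
\[
e^{tA}\, x(\theta,\varphi) = x(\theta + t,\ \varphi)
\quad\text{for all } t \in \mathbb{R}.
\]
```

Under this assumption, the orbit of $x(\theta,\varphi)$ is the circle obtained by varying $\theta$ with $\varphi$ held fixed. A draw of the form $e^{tA}x_t$ therefore keeps the value of $\varphi$ of the data point $x_t$, so the $\varphi$ direction is explored only through the $n$ given points.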

Acknowledgments

DGM was partially supported by AFOSR FA9550-18-1-0107 and NSF DMS 1829955. HP was partially supported by an AMS-Simons Travel Grant.

References

[1] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein Generative Adversarial Networks, ICML 2017, 214–223.

[2] J. Bochnak, M. Coste, M. F. Roy, Real Algebraic Geometry, Vol. 36, Springer Science & Business Media, 2013.

[3] J. Bruna, S. Mallat, Invariant scattering convolution networks, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 1872–1886.

[4] J. Cahill, A. Contreras, A. Contreras-Hip, Classifying Signals Under a Finite Abelian : The Finite Dimensional Setting, arXiv:1911.05862

[5] J. Cahill, A. Contreras, A. Contreras-Hip, Complete set of translation invariant measurements with Lipschitz bounds, Appl. Comput. Harmon. Anal. 49 (2020) 521–539.

[6] J. Cahill, D. G. Mixon, N. Strawn, Connectivity and irreducibility of algebraic varieties of finite unit norm tight frames, SIAM J. Appl. Algebra Geometry 1 (2017) 38–72.

[7] C. Clum, D. G. Mixon, T. Scarnati, Matching Component Analysis for Transfer Learning, SIAM J. Math. Data Sci. 2 (2020) 309–334.

[8] B. Dumitrascu, S. Villar, D. G. Mixon, B. E. Engelhardt, Optimal marker gene selection for cell type discrimination in single cell analyses, BioRxiv (2019) 599654.

[9] M. Grant, S. Boyd, CVX: Matlab software for disciplined convex programming, http://cvxr.com/cvx

[10] B. Hall, Lie groups, Lie algebras, and representations: An elementary introduction, Vol. 222. Springer, 2015.

[11] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS 2012, 1097–1105.

[12] S. Mallat, Group invariant scattering, Comm. Pure Appl. Math. 65 (2012) 1331–1398.

[13] C. McWhirter, D. G. Mixon, S. Villar, Squeezefit: Label-aware dimensionality reduction by semidefinite programming, IEEE Trans. Inform. Theory 66 (2020) 3878–3892.

[14] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986.

[15] P. Y. Simard, D. Steinkraus, J. C. Platt, Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, ICDAR 2003, 958.

[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, CVPR 2015, 1–9.
