arXiv:1402.4437v2 [cs.LG] 25 May 2014

Learning the Irreducible Representations of Commutative Lie Groups

Taco Cohen (T.S.COHEN@UVA.NL), Max Welling (M.WELLING@UVA.NL)
Machine Learning Group, University of Amsterdam

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We present a new probabilistic model of compact commutative Lie groups that produces invariant-equivariant and disentangled representations of data. To define the notion of disentangling, we borrow a fundamental principle from physics that is used to derive the elementary particles of a system from its symmetries. Our model employs a newfound Bayesian conjugacy relation that enables fully tractable probabilistic inference over compact commutative Lie groups -- a class that includes the groups that describe the rotation and cyclic translation of images. We train the model on pairs of transformed image patches, and show that the learned invariant representation is highly effective for classification.

1. Introduction

Recently, the field of deep learning has produced some remarkable breakthroughs. The hallmark of the deep learning approach is to learn multiple layers of representation of data (Bengio & Lecun, 2014), and much work has gone into the development of representation learning modules such as RBMs and their generalizations (Welling et al., 2005), and autoencoders (Vincent et al., 2008). However, at this point it is not quite clear what characterizes a good representation. Various desiderata for learned representations have been expressed: representations should be meaningful, invariant (Goodfellow et al., 2009), abstract and disentangled (Bengio et al., 2013), but so far most of these notions have not been defined in a mathematically precise way. Here we focus on the notions of invariance and disentangling, leaving the search for meaning for future work. In this paper, we take a fresh look at the basic principles behind unsupervised representation learning from the perspective of Lie group theory.(1)

What do we mean, intuitively, when we speak of invariance and disentangling? A disentangled representation is one that explicitly represents the distinct factors of variation in the data. For example, visual data (i.e. pixels) can be thought of as a composition of object identity, position and pose, lighting conditions, etc. Once disentangling is achieved, invariance follows easily: to build a representation that is invariant to the transformation of a factor of variation (e.g. object position) that is considered a nuisance for a particular task (e.g. object classification), one can simply ignore the units in the representation that encode the nuisance factor.

To get a mathematical handle on the concept of disentangling, we borrow a fundamental principle from physics, which we refer to as Weyl's principle, following Kanatani (1990). In physics, this idea is used to tease apart (i.e. disentangle) the elementary particles of a physical system from mere measurement values that have no inherent physical significance. We apply this principle to the area of vision, for after all, pixels are nothing but physical measurements.

Weyl's principle presupposes a symmetry group that acts on the data. By this we mean a set of transformations that does not change the "essence" of the measured phenomenon, although it may change the "superficial appearance", i.e. the measurement values. As a concrete example that we will use throughout this paper, consider the group known as SO(2), acting on images by 2D rotation about the origin. A transformation from this group (a rotation) may change the value of every pixel in the image, but leaves the identity of the imaged object invariant. Weyl's principle states that the elementary components of this system are given by the irreducible representations of the symmetry group -- a concept that will be explained in this paper. Although this theoretical principle is widely applicable, we demonstrate it for real-valued compact commutative groups only.

(1) We will at times assume a passing familiarity with Lie groups, but the main ideas of this paper should be accessible to a broad audience.
We introduce a probabilistic model that describes a representation of such a group, and show how it can be learned from pairs of images related by arbitrary and unobserved transformations in the group. Compact commutative groups are also known as toroidal groups, so we refer to this model as Toroidal Subgroup Analysis (TSA). Using a novel conjugate prior, the model integrates probability theory and Lie group theory in a very elegant way. All the relevant probabilistic quantities such as normalization constants, moments, KL-divergences, the posterior density over the transformation group, the marginal density in data space, and their gradients can be obtained in closed form.

1.1. Related work

The first to propose a model and algorithm for learning Lie group representations from data were Rao & Ruderman (1999). This model deals only with one-parameter groups, a limitation that was later lifted by Miao and Rao (2007). Both works rely on MAP-inference procedures that can only deal with infinitesimally small transformations. This problem was solved by Sohl-Dickstein et al. (2010) using an elegant adaptive smoothing technique, making it possible to learn from large transformations. This model uses a general linear transformation to diagonalize a one-parameter group, and combines multiple one-parameter groups multiplicatively.

Other, non-group-theoretical approaches to learning transformations and invariant representations exist (Memisevic & Hinton, 2010). These gating models were found to perform a kind of joint eigenspace analysis (Memisevic, 2012), which is somewhat similar to the irreducible reduction of a toroidal group.

Motivated by a number of statistical phenomena observed in natural images, Cadieu & Olshausen (2012) describe a model that decomposes a signal into invariant amplitudes and covariant phase variables.

None of the mentioned methods take into account the full uncertainty over transformation parameters, as does TSA. Due to exact or approximate symmetries in the data, there is in general no unique transformation relating two images, so that only a multimodal posterior distribution over the group gives a complete description of the geometric situation. Furthermore, posterior inference in our model is performed by a very fast feed-forward procedure, whereas the MAP inference algorithm by Sohl-Dickstein et al. requires a more expensive iterative optimization.

2. Preliminaries

2.1. Equivalence, Invariance and Reducibility

In this section, we discuss three fundamental concepts on which the analysis in the rest of this paper is based: equivalence, invariance and reducibility.

Consider a function $\Phi: \mathbb{R}^D \rightarrow X$ that assigns to each possible data point $x \in \mathbb{R}^D$ a class label ($X = \{1, \ldots, L\}$) or some distributed representation (e.g. $X = \mathbb{R}^L$). Such a function induces an equivalence relation on the input space $\mathbb{R}^D$: we say that two vectors $x, y \in \mathbb{R}^D$ are $\Phi$-equivalent if they are mapped onto the same representation by $\Phi$. Symbolically, $x \equiv_\Phi y \Leftrightarrow \Phi(x) = \Phi(y)$.

Every equivalence relation on the input space fully determines a symmetry group acting on the space. This group, call it $G$, contains all invertible transformations $\rho: \mathbb{R}^D \rightarrow \mathbb{R}^D$ that leave $\Phi$ invariant: $G = \{\rho \mid \forall x \in \mathbb{R}^D: \Phi(\rho(x)) = \Phi(x)\}$. $G$ describes the symmetries of $\Phi$, or, stated differently, the label function/representation $\Phi$ is invariant to transformations in $G$. Hence, we can speak of $G$-equivalence: $x \equiv_G y \Leftrightarrow \exists \rho \in G: \rho(x) = y$. For example, if some elements of $G$ act by rotating the image, two images are $G$-equivalent if they are rotations of each other.

Before we can introduce Weyl's principle, we need one more concept: the reduction of a group representation (Kanatani, 1990). Let us restrict our attention to linear representations of Lie groups: $\rho$ becomes a matrix-valued function $\rho_g$ of an abstract group element $g \in G$, such that $\forall g, h \in G: \rho_{g \circ h} = \rho_g \rho_h$. In general, every coordinate $y_i$ of $y = \rho_g x$ can depend on every coordinate $x_j$ of $x$. Now, since $x$ is $G$-equivalent to $y$, it makes no sense to consider the coordinates $x_i$ as separate quantities; we can only consider the vector $x$ as a single unit because the symmetry transformations $\rho_g$ tangle all coordinates. In other words, we cannot say that coordinate $x_i$ is an independent part of the aggregate $x$, because a mapping $x \rightarrow x' = \rho_g x$ that is supposed to leave the intrinsic properties of $x$ unchanged, will in fact induce a functional dependence between all supposed parts $x'_i$ and $x_j$.

However, we are free to change the basis of the measurement space. It may be possible to use a change of basis to expose an invariant subspace, i.e. a subspace $V \subset \mathbb{R}^D$ that is mapped onto itself by every transformation in the group: $\forall g \in G: x \in V \Rightarrow \rho_g x \in V$. If such a subspace exists and its orthogonal complement $V^\perp \subset \mathbb{R}^D$ is also an invariant subspace, then it makes sense to consider the two parts of $x$ that lie in $V$ and $V^\perp$ to be distinct, because they remain distinct under symmetry transformations.

Let $W$ be a change of basis matrix that exposes the invariant subspaces, that is,

$$\rho_g = W \begin{pmatrix} \rho_g^1 & \\ & \rho_g^2 \end{pmatrix} W^{-1}, \qquad (1)$$

for all $g \in G$. Both $\rho_g^1$ and $\rho_g^2$ form a representation of the same abstract group as represented by $\rho$. The group representations $\rho_g^1$ and $\rho_g^2$ describe how the individual parts $x_1 \in V$ and $x_2 \in V^\perp$ are transformed by the elements of the group. As is common in group representation theory, we refer to both the group representations $\rho_g^1, \rho_g^2$ and the subspaces $V$ and $V^\perp$ corresponding to these group representations as "representations".

The process of reduction can be applied recursively to $\rho_g^1$ and $\rho_g^2$. If at some point there is no more (non-trivial) invariant subspace, the representation is called irreducible. Weyl's principle states that the elementary components of a system are the irreducible representations of the symmetry group of the system. Properly understood, it is not a physics principle at all, but generally applicable to any situation where there is a well-defined notion of equivalence.(2) It is completely abstract and therefore agnostic about the type of data (images, optical flows, sound, etc.), making it eminently useful for representation learning.

In the rest of this paper, we will demonstrate Weyl's principle in the simple case of a compact commutative subgroup of the special orthogonal group in D dimensions. We want to stress though, that there is no reason the basic ideas cannot be applied to non-commutative groups acting on non-linear latent representation spaces.

(2) The picture becomes a lot more complicated, though, when the group does not act linearly or is not completely reducible.
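To make the reduction concept concrete, here is a small numerical sketch of ours (not part of the paper's model); it assumes numpy, an arbitrary 4-dimensional example, and a random orthogonal basis. It builds a representation that is block-diagonalized by a basis $W$ as in eq. (1) and checks that the subspace $V$ spanned by the first two columns of $W$ is invariant.

```python
import numpy as np

def rot2(a):
    """2x2 rotation matrix."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

rng = np.random.default_rng(0)

# Random orthogonal change of basis W (4x4), obtained from a QR decomposition.
W, _ = np.linalg.qr(rng.standard_normal((4, 4)))

def rho(a, b):
    """A 2-parameter commutative group representation, block-diagonalized by W (eq. 1)."""
    block = np.zeros((4, 4))
    block[:2, :2] = rot2(a)
    block[2:, 2:] = rot2(b)
    return W @ block @ W.T        # W is orthogonal, so W^{-1} = W^T

# V is spanned by the first two columns of W. Check that rho maps V into V:
v = W[:, :2] @ rng.standard_normal(2)      # a vector in V
v_t = rho(0.7, -1.3) @ v                   # transformed vector
# Its component in the orthogonal complement should vanish.
print(np.allclose(W[:, 2:].T @ v_t, 0))    # True: V is an invariant subspace
```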
2.2. Maximal Tori in the Orthogonal Group

In order to facilitate analysis, we will from here on consider only compact commutative subgroups of the special orthogonal group SO(D). For reasons that will become clear shortly, such groups are called toroidal subgroups of SO(D). Intuitively, the toroidal subgroups of general compact Lie groups can be thought of as the "commutative part" of these groups. This fact, combined with their analytic tractability (evidenced by the results in this paper), makes them suitable as the starting point of a theory of probabilistic Lie-group representation learning.

Imposing the constraint of orthogonality will make the computation of matrix inverses very cheap, because for orthogonal $Q$, $Q^{-1} = Q^T$. Orthogonal matrices also avoid numerical problems, because their condition number is always equal to 1. Another important property of orthogonal transformations is that they leave the Euclidean metric invariant: $\|Qx\| = \|x\|$. Therefore, orthogonal matrices cannot express transformations such as contrast scaling, but they can still model the interesting structural changes in images (Bethge et al., 2007). For example, since 2D image rotation and (cyclic) translation are linear and do not change the total energy (norm) of the image, they can be represented by orthogonal matrices acting on vectorized images.

As is well known, commuting matrices can be simultaneously diagonalized, so one could represent a toroidal group in terms of a basis of eigenvectors shared by every element in the group, and one set of eigenvalues for each element of the group, as was done in (Sohl-Dickstein et al., 2010) for 1-parameter Lie groups. However, orthogonal matrices do not generally have a complete set of real eigenvectors. One could use a complex basis instead, but this introduces redundancies because the eigenvalues and eigenvectors of an orthogonal matrix come in complex conjugate pairs. For machine learning applications, this is clearly an undesirable feature, so we opt for a joint block-diagonalization of the elements of the toroidal group: $\rho_\varphi = W R(\varphi) W^T$, where $W$ is orthogonal and $R(\varphi)$ is a block-diagonal rotation matrix(3):

$$R(\varphi) = \begin{pmatrix} R(\varphi_1) & & \\ & \ddots & \\ & & R(\varphi_J) \end{pmatrix}. \qquad (2)$$

The diagonal of $R(\varphi)$ contains $2 \times 2$ rotation matrices

$$R(\varphi_j) = \begin{pmatrix} \cos(\varphi_j) & -\sin(\varphi_j) \\ \sin(\varphi_j) & \cos(\varphi_j) \end{pmatrix}. \qquad (3)$$

In this parameterization, the real, orthogonal basis $W$ identifies the group representation, while the vector of rotation angles $\varphi$ identifies a particular element of the group. It is now clear why such groups are called "toroidal": the parameter space $\varphi$ is periodic in each element $\varphi_j$ and hence is a topological torus. For a $J$-parameter toroidal group, all the $\varphi_j$ can be chosen freely. Such a group is known as a maximal torus in SO(D), for which we write $\mathbb{T}^J = \{\varphi \mid \varphi_j \in [0, 2\pi],\ j = 1, \ldots, J\}$.

To gain insight into the structure of toroidal groups with fewer parameters, we rewrite eq. 2 using the matrix exponential:

$$R(\varphi) = \exp\Big(\sum_{j=1}^J \varphi_j A_j\Big). \qquad (4)$$

The anti-symmetric matrices $A_j = \frac{d}{d\varphi_j} R(\varphi)\big|_{\varphi=0}$ are known as the Lie algebra generators, and the $\varphi_j$ are Lie-algebra coordinates.

(3) For ease of exposition, we assume an even dimensional space D = 2J, but the equations are easily generalized.
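As a concrete illustration of eqs. (2)-(4), the following sketch (ours; the dimension, the angles, and the random orthogonal basis are arbitrary choices, and it assumes numpy/scipy) builds the block-diagonal rotation $R(\varphi)$, applies $\rho_\varphi = W R(\varphi) W^T$ to a vector, and verifies the matrix-exponential form of eq. (4).

```python
import numpy as np
from scipy.linalg import block_diag, expm

def rot2(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def R(phi):
    """Block-diagonal rotation matrix of eq. (2)."""
    return block_diag(*[rot2(p) for p in phi])

def generator(j, J):
    """Lie algebra generator A_j = dR/dphi_j at phi = 0 (eq. 4)."""
    A = np.zeros((2 * J, 2 * J))
    A[2*j:2*j+2, 2*j:2*j+2] = np.array([[0.0, -1.0], [1.0, 0.0]])
    return A

J = 3
phi = np.array([0.3, 1.1, -0.5])
W, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((2*J, 2*J)))  # orthogonal basis

rho = W @ R(phi) @ W.T                     # a group element acting on data space
x = np.random.default_rng(2).standard_normal(2*J)
print(np.isclose(np.linalg.norm(rho @ x), np.linalg.norm(x)))  # True: norm-preserving

# eq. (4): R(phi) equals the exponential of a linear combination of generators.
A_sum = sum(p * generator(j, J) for j, p in enumerate(phi))
print(np.allclose(expm(A_sum), R(phi)))    # True
```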

The Lie algebra is a structure that largely determines the structure of the corresponding Lie group, while having the important advantage of forming a linear space. That is, all linear combinations of the generators belong to the Lie algebra, and each element of the Lie algebra corresponds to an element of the Lie group, which itself is a non-linear manifold. Furthermore, every subgroup of the Lie group corresponds to a subalgebra (not defined here) of the Lie algebra. All toroidal groups are the subgroup of some maximal torus, so we can learn a general toroidal group by first learning a maximal torus and then learning a subalgebra of its Lie algebra. Due to commutativity, the structure of the Lie algebra of toroidal groups is such that any subspace of the Lie algebra is in fact a subalgebra. The relevance of this observation to our machine learning problem is that to learn a toroidal group with $I$ parameters ($I < J$), we only need to learn a maximal torus together with an $I$-dimensional subspace of its Lie algebra. In particular, a one-parameter subgroup is obtained by coupling the rotation angles through a single scalar $s$, setting $\varphi_j = \omega_j s$:

$$R(s) = \exp\Big(s \sum_{j=1}^J \omega_j A_j\Big) = \bigoplus_{j=1}^J R(\omega_j s). \qquad (5)$$

Each block $R(\omega_j s)$ is periodic with period $2\pi/\omega_j$, but their direct sum $R(s)$ need not be. When $\omega_1$ and $\omega_2$ are not commensurate, all values of $s \in \mathbb{R}$ will produce different $R(s)$, and hence the parameter space is not bounded. To obtain a compact one-parameter group with parameter space $s \in [0, 2\pi]$, we restrict the frequencies $\omega_j$ to be integers, so that $R(s) = R(s + 2\pi)$ (see figure 1).(4)

Figure 1. Parameter space $\varphi = (\omega_1 s, \omega_2 s)$ of two toroidal subgroups, for $\omega_1 = 2, \omega_2 = 3$ (left) and $\omega_1 = 1, \omega_2 = 2\pi$ (right). The point $\varphi$ moves over the dark green line as $s$ is changed. Wrapping around is indicated in light gray. In the incommensurable case (right), coupling does not add structure to the model, because all transformations (points in the plane) can still be constructed by an appropriate choice of $s$.

It is easy to see that each block of $R(s)$ forms a real-valued irreducible representation of the toroidal group, making $R(s)$ a direct sum of irreducible representations. From the point of view expounded in section 2.1, we should view the vector $x$ as a tangle of elementary components $u_j = W_j^T x$, where $W_j = (W_{(:,2j-1)}, W_{(:,2j)})$ denotes the $D \times 2$ submatrix of $W$ corresponding to the $j$-th block in $R(s)$. Each one of the elementary parts $u_j$ is functionally independent of the others under symmetry transformations.

The variable $\omega_j$ is known as the weight of the representation (Kanatani, 1990). When the representations are equivalent (i.e. they have the same weight), the parts are "of the same kind" and are transformed identically. Elementary components with different weights transform differently.

In the following section, we show how a maximal toroidal group and a 1-parameter subgroup can be learned from correspondence pairs, and how these can be used to generate invariant representations.

(4) The main reason for this restriction is that compact groups are simpler and better understood than non-compact groups. In practice, many non-compact groups can be compactified, so not much is lost.

3. Toroidal Subgroup Analysis

We will start by modelling a maximal torus. A data pair $(x, y)$ is related by a transformation $\rho_\varphi = W R(\varphi) W^T$:

$$y = W R(\varphi) W^T x + \epsilon, \qquad (6)$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ represents isotropic Gaussian noise. In other symbols, $p(y \mid x, \varphi) = \mathcal{N}(y \mid W R(\varphi) W^T x, \sigma^2)$.

We use the following notation for indexing invariant subspaces. As before, $W_j = (W_{(:,2j-1)}, W_{(:,2j)})$. Let $u_j = W_j^T x$ and $v_j = W_j^T y$. If we want to access one of the coordinates of $u_j$ or $v_j$, we write $u_{j1} = W_{(:,2j-1)}^T x$, and similarly for the other coordinates.
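A minimal sketch of ours illustrating the observation model of eq. (6); the dimension D = 6, the noise level sigma = 0.1, and the random orthogonal W are stand-ins for learned quantities, not part of the paper.

```python
import numpy as np
from scipy.linalg import block_diag

def R(phi):
    """Block-diagonal rotation matrix of eq. (2)."""
    return block_diag(*[[[np.cos(p), -np.sin(p)], [np.sin(p), np.cos(p)]] for p in phi])

rng = np.random.default_rng(0)
D, sigma = 6, 0.1                                  # assumed dimension and noise level
W, _ = np.linalg.qr(rng.standard_normal((D, D)))   # stand-in for a learned orthogonal basis

phi = rng.uniform(0.0, 2 * np.pi, size=D // 2)     # a random element of the maximal torus
x = rng.standard_normal(D)
y = W @ R(phi) @ W.T @ x + sigma * rng.standard_normal(D)   # eq. (6)

# Invariant-subspace coordinates used throughout section 3:
u = (W.T @ x).reshape(-1, 2)                       # rows are u_j = W_j^T x
v = (W.T @ y).reshape(-1, 2)                       # rows are v_j = W_j^T y
```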

To complete the model, we place a von-Mises prior on each rotation angle:

$$p(\varphi_j) = \mathcal{M}(\varphi_j \mid \mu_j, \kappa_j) = \frac{1}{2\pi I_0(\kappa_j)} \exp\big(\kappa_j \cos(\varphi_j - \mu_j)\big). \qquad (7)$$

The function $I_0$ that appears in the normalizing constant is known as the modified Bessel function of order 0.

Since the von-Mises distribution is an exponential family, we can write it in terms of natural parameters $\eta_j = (\eta_{j1}, \eta_{j2})^T$ as follows:

$$p(\varphi_j) = \frac{1}{2\pi I_0(\|\eta_j\|)} \exp\big(\eta_j^T T(\varphi_j)\big), \qquad (8)$$

where $T(\varphi_j) = (\cos(\varphi_j), \sin(\varphi_j))^T$ are the sufficient statistics. The natural parameters can be computed from the conventional parameters using

$$\eta_j = \kappa_j [\cos(\mu_j), \sin(\mu_j)]^T, \qquad (9)$$

and vice versa,

$$\kappa_j = \|\eta_j\|, \qquad \mu_j = \tan^{-1}(\eta_{j2} / \eta_{j1}). \qquad (10)$$
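A small self-contained sketch (ours, assuming numpy/scipy) of the two parameterizations in eqs. (7)-(10); it checks that the natural-parameter form reproduces the conventional von-Mises density.

```python
import numpy as np
from scipy.special import i0   # modified Bessel function of order 0

def vonmises_pdf(phi, mu, kappa):
    """Conventional parameterization, eq. (7)."""
    return np.exp(kappa * np.cos(phi - mu)) / (2 * np.pi * i0(kappa))

def to_natural(mu, kappa):
    """eq. (9): eta = kappa * [cos(mu), sin(mu)]."""
    return kappa * np.array([np.cos(mu), np.sin(mu)])

def to_conventional(eta):
    """eq. (10): kappa = ||eta||, mu = atan2(eta_2, eta_1)."""
    return np.linalg.norm(eta), np.arctan2(eta[1], eta[0])

def vonmises_pdf_natural(phi, eta):
    """Natural parameterization, eq. (8)."""
    T = np.array([np.cos(phi), np.sin(phi)])
    return np.exp(eta @ T) / (2 * np.pi * i0(np.linalg.norm(eta)))

mu, kappa = 0.8, 2.5
eta = to_natural(mu, kappa)
print(np.isclose(vonmises_pdf(1.3, mu, kappa), vonmises_pdf_natural(1.3, eta)))  # True
print(to_conventional(eta))   # approximately (2.5, 0.8)
```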
Using the natural parameterization, it is easy to see that the prior is conjugate to the likelihood, so that the posterior $p(\varphi \mid x, y)$ is again a product of von-Mises distributions. Such conjugacy relations are of great utility in Bayesian statistics, because they simplify sequential inference. To our knowledge, this conjugacy relation has not been described before. To derive this result, first observe that the likelihood term splits into a sum over the invariant subspaces indexed by $j$:

$$p(\varphi \mid x, y) \propto p(y \mid x, \varphi)\, p(\varphi) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - W R(\varphi) W^T x\|^2\Big) p(\varphi) \propto \exp\Big(\sum_{j=1}^J \Big[\frac{v_j^T R(\varphi_j) u_j}{\sigma^2} + \eta_j^T T(\varphi_j)\Big]\Big).$$

Both the bilinear forms $v_j^T R(\varphi_j) u_j$ and the prior terms $\eta_j^T T(\varphi_j)$ are linear functions of $\cos(\varphi_j)$ and $\sin(\varphi_j)$, so that they can be combined into a single dot product:

$$p(\varphi \mid x, y) \propto \exp\Big(\sum_{j=1}^J \hat{\eta}_j^T T(\varphi_j)\Big), \qquad (11)$$

which we recognize as a product of von-Mises in natural form. The parameters $\hat{\eta}_j$ of the posterior are given by:

$$\hat{\eta}_j = \eta_j + \frac{1}{\sigma^2}\,[u_{j1} v_{j1} + u_{j2} v_{j2},\ u_{j1} v_{j2} - u_{j2} v_{j1}]^T = \eta_j + \frac{\|u_j\| \|v_j\|}{\sigma^2}\,[\cos(\theta_j), \sin(\theta_j)]^T, \qquad (12)$$

where $\theta_j$ is the angle between $u_j$ and $v_j$. Geometrically, we can interpret the Bayesian updating procedure in eq. 12 as follows. The orientation of the natural parameter vector $\eta_j$ determines the mean of the von-Mises, while its magnitude determines the precision. To update this parameter with new information obtained from data $u_j, v_j$, one should add the vector $(\cos(\theta_j), \sin(\theta_j))^T$ to the prior, using a scaling factor that grows with the magnitude of $u_j$ and $v_j$ and declines with the square of the noise level $\sigma$. The longer $u_j$ and $v_j$ and the smaller the noise level, the greater the precision of the observation. This geometrically sensible result follows directly from the consistent application of the rules of probability.

Observe that when using a uniform prior (i.e. $\eta_j = 0$), the posterior mean $\hat{\mu}_j$ (computed from $\hat{\eta}_j$ by eq. 10) will be exactly equal to the angle $\theta_j$ between $u_j$ and $v_j$. We will use this fact in section 3.1 when we derive the formula for the orbit distance in a toroidal group.

Previous approaches to Lie group learning only provide point estimates of the transformation parameters, which have to be obtained using an iterative optimization procedure (Sohl-Dickstein et al., 2010). In contrast, TSA provides a full posterior distribution which is obtained using a simple feed-forward computation. Compared to the work of Cadieu & Olshausen (2012), our model deals well with low-energy subspaces, by simply describing the uncertainty in the estimate instead of providing inaccurate estimates that have to be discarded.

3.1. Invariant Representation and Metric

One way of doing invariant classification is by using an invariant metric known as the manifold distance. This metric $d(x, y)$ is defined as the minimum distance between the orbits $O_x = \{\rho_\varphi x \mid \varphi \in G\}$ and $O_y = \{\rho_\varphi y \mid \varphi \in G\}$. Observe that this is only a true metric that satisfies the coincidence axiom $d(x, y) = 0 \Leftrightarrow x = y$ if we take the condition $x = y$ to mean "equivalence up to symmetry transformations", or $x \equiv_G y$, as discussed in section 2.1.

In practice, it has proven difficult to compute this distance exactly, so approximations such as tangent distance have been invented (Simard et al., 2000). But for a maximal torus, we can easily compute the exact manifold distance:

$$d^2(x, y) = \min_\varphi \|y - W R(\varphi) W^T x\|^2 = \sum_j \min_{\varphi_j} \|v_j - R(\varphi_j) u_j\|^2 = \sum_j \|v_j - R(\hat{\mu}_j) u_j\|^2, \qquad (13)$$

where $\hat{\mu}_j$ is the mean of the posterior $p(\varphi_j \mid x, y)$, obtained using a uniform prior ($\kappa_j = 0$). The last step of eq. 13 follows, because as we saw in the previous section, $\hat{\mu}_j$ is simply the angle between $u_j$ and $v_j$ when using a uniform prior. Therefore, $R(\hat{\mu}_j)$ aligns $u_j$ and $v_j$, minimizing the distance between them.
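The posterior update of eq. (12) and the manifold distance of eq. (13) amount to a few lines of numpy. The sketch below is ours (a uniform prior is assumed; the example data are random), not code from the paper.

```python
import numpy as np

def posterior_natural_params(u, v, sigma, eta_prior=None):
    """eq. (12): per-subspace update of the von-Mises natural parameters.
    u, v: arrays of shape (J, 2) holding u_j = W_j^T x and v_j = W_j^T y."""
    eta_prior = np.zeros_like(u) if eta_prior is None else eta_prior
    dots = u[:, 0] * v[:, 0] + u[:, 1] * v[:, 1]
    crosses = u[:, 0] * v[:, 1] - u[:, 1] * v[:, 0]
    return eta_prior + np.stack([dots, crosses], axis=1) / sigma**2

def manifold_distance_sq(u, v):
    """eq. (13): align each u_j with v_j by the posterior mean angle and sum the residuals."""
    eta_hat = posterior_natural_params(u, v, sigma=1.0)   # uniform prior; sigma drops out of the mean
    mu_hat = np.arctan2(eta_hat[:, 1], eta_hat[:, 0])     # eq. (10)
    c, s = np.cos(mu_hat), np.sin(mu_hat)
    ru = np.stack([c * u[:, 0] - s * u[:, 1], s * u[:, 0] + c * u[:, 1]], axis=1)  # R(mu_hat_j) u_j
    return np.sum((v - ru) ** 2)

# Example with a noiseless rotated pair: the orbit distance should be ~0.
rng = np.random.default_rng(1)
u = rng.standard_normal((3, 2))
a = rng.uniform(0, 2 * np.pi, size=3)
v = np.stack([np.cos(a) * u[:, 0] - np.sin(a) * u[:, 1],
              np.sin(a) * u[:, 0] + np.cos(a) * u[:, 1]], axis=1)
print(manifold_distance_sq(u, v))   # ~0
```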

Another approach to invariant classification is through an invariant representation. Although the model presented above aims to describe the transformation between observations $x$ and $y$, an invariant-equivariant representation appears automatically in terms of the parameters of the posterior over the group. To see this, consider all the transformations in the learned toroidal group $G$ that take an image $x$ to itself. This set is known as the stabilizer $\mathrm{stab}_G(x)$ of $x$. It is a subgroup of $G$ and describes the symmetries of $x$ with respect to $G$. When a transformation $\varphi \in G$ is applied to $x$, the stabilizer subgroup is left invariant, for if $\theta \in \mathrm{stab}_G(x)$ then $\rho_\theta \rho_\varphi x = \rho_\varphi \rho_\theta x = \rho_\varphi x$ and hence $\rho_\theta \in \mathrm{stab}_G(\rho_\varphi x)$.

The posterior of $x$ transformed into itself, $p(\varphi \mid x, x, \mu, \kappa = 0) = \prod_j \mathcal{M}(\varphi_j \mid \hat{\mu}_j, \hat{\kappa}_j)$, gives a probabilistic description of the stabilizer of $x$, and hence must be invariant. Clearly, the angle between $x$ and $x$ is zero, so $\hat{\mu} = 0$. On the other hand, $\hat{\kappa}$ contains information about $x$ and is invariant. To see this, recall that $\hat{\kappa}_j = \|\hat{\eta}_j\|$. Using eq. 12 we obtain $\hat{\kappa}_j = \|u_j\|^2 \sigma^{-2} = \|W_j^T x\|^2 \sigma^{-2}$. Since every transformation $\varphi$ in the toroidal group acts on the 2D vector $u_j$ by rotation, the norm of $u_j$ is left invariant.

We recognize the computation of $\hat{\kappa}$ as the square pooling operation often applied in convolutional networks to gain invariance: project an image onto filters $W_{:,2j-1}$ and $W_{:,2j}$ and sum the squares. This computation is a direct consequence of our model setup. In section 3.3, we will find that the model for non-maximal tori is even more informative about the proper pooling scheme.

Since we want to use $\hat{\kappa}$ as an invariant representation, we should try to find an appropriate metric on $\hat{\kappa}$-space. Let $\hat{\kappa}(x)$ be defined by $p(\varphi \mid x, x, \kappa = 0) = \prod_j \mathcal{M}(\varphi_j \mid \hat{\mu}_j, \hat{\kappa}_j(x))$. We suggest using the Hellinger distance:

$$H^2(\hat{\kappa}(x), \hat{\kappa}(y)) = \frac{1}{2} \sum_j \Big(\sqrt{\hat{\kappa}_j(x)} - \sqrt{\hat{\kappa}_j(y)}\Big)^2 = \frac{1}{2\sigma^2} \sum_j \Big(\|u_j\|^2 + \|v_j\|^2 - 2\|u_j\|\|v_j\|\Big) = \frac{1}{2\sigma^2} \sum_j \|v_j - R(\hat{\mu}_j) u_j\|^2,$$

which is equal to the exact manifold distance (eq. 13) up to a factor of $\frac{1}{2\sigma^2}$. The first step of this derivation uses eq. 12 under a uniform prior ($\eta_j = 0$), while the second step again makes use of the fact that $\hat{\mu}_j$ is the angle between $u_j$ and $v_j$, so that $\|u_j\|\|v_j\| = u_j^T R(\hat{\mu}_j) v_j$.
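To illustrate the invariance of $\hat{\kappa}$, here is a self-contained sketch of ours (random W and an arbitrary noise level, not learned quantities): it computes the square-pooled representation $\hat{\kappa}_j = \|W_j^T x\|^2 / \sigma^2$ and checks that it is unchanged when $x$ is transformed by an arbitrary element of the torus.

```python
import numpy as np
from scipy.linalg import block_diag

def R(phi):
    return block_diag(*[[[np.cos(p), -np.sin(p)], [np.sin(p), np.cos(p)]] for p in phi])

def kappa_hat(x, W, sigma):
    """Invariant representation: kappa_hat_j = ||W_j^T x||^2 / sigma^2 (square pooling)."""
    u = (W.T @ x).reshape(-1, 2)
    return np.sum(u ** 2, axis=1) / sigma**2

rng = np.random.default_rng(2)
D, sigma = 6, 0.1
W, _ = np.linalg.qr(rng.standard_normal((D, D)))
x = rng.standard_normal(D)
x_rot = W @ R(rng.uniform(0, 2 * np.pi, size=D // 2)) @ W.T @ x   # a transformed copy of x

print(np.allclose(kappa_hat(x, W, sigma), kappa_hat(x_rot, W, sigma)))   # True: invariant
```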
3.2. Relation to the Discrete Fourier Transform

We show that the DFT is a special case of TSA. The DFT of a discrete 1D signal $x = (x_1, \ldots, x_D)$ is defined:

$$X_j = \sum_{n=0}^{D-1} x_n \rho^{-jn}, \qquad (14)$$

where $\rho = e^{2\pi i / D}$ is the $D$-th primitive root of unity. If we choose a basis of sinusoids for the filters in $W$,

$$W_{(:,2j-1)} = \Re(\rho^{-j}, \ldots, \rho^{-j(D-1)})^T = (\cos(2\pi j / D), \ldots, \cos(2\pi j (D-1)/D))^T,$$
$$W_{(:,2j)} = \Im(\rho^{-j}, \ldots, \rho^{-j(D-1)})^T = (\sin(-2\pi j / D), \ldots, \sin(-2\pi j (D-1)/D))^T,$$

then the change of basis performed by $W$ is a DFT. Specifically, $\Re(X_j) = u_{j1}$ and $\Im(X_j) = u_{j2}$.

Now suppose we are interested in the transformation taking some arbitrary fixed vector $e = W(1, 0, \ldots, 1, 0)^T$ to $x$. The posterior over $\varphi_j$ is $p(\varphi_j \mid e, x, \eta_j = 0, \sigma = 1) = \mathcal{M}(\varphi_j \mid \hat{\eta}_j)$, where (by eq. 12) we have $\hat{\eta}_j = \|u_j\| [\cos(\theta_j), \sin(\theta_j)]^T$, $\theta_j$ being the angle between $u_j$ and the "real axis" $e_j = (1, 0)^T$. In conventional coordinates, the precision of the posterior is equal to the modulus of the DFT, $\hat{\kappa}_j = \|u_j\| = |X_j|$, and the mean of the posterior is equal to the phase of the Fourier transform, $\hat{\mu}_j = \theta_j = \arg(X_j)$. Therefore, TSA provides a probabilistic interpretation of the DFT coefficients, and makes it possible to learn an appropriate generalized transform from data.

3.3. Modeling a Lie subalgebra

Typically, one is interested in learning groups with fewer than $J$ degrees of freedom. As we have seen, for one-parameter compact subgroups of a maximal torus, the weights of the irreducible representations must be integers. We model this using a coupled rotation matrix, as follows:

$$\rho_s = W \begin{pmatrix} R(\omega_1 s) & & \\ & \ddots & \\ & & R(\omega_J s) \end{pmatrix} W^T, \qquad (15)$$

where $s \in [0, 2\pi]$ is the scalar parameter of this subgroup. The likelihood then becomes $y \sim \mathcal{N}(y \mid \rho_s x, \sigma^2)$.

We have found that the conjugate prior for this likelihood is the generalized von-Mises (GvM) (Gatto & Jammalamadaka, 2007):

$$p(s) = \mathcal{M}^+(s \mid \eta^+) = \exp\big(\eta^+ \cdot T^+(s)\big)/Z^+ = \exp\Big(\sum_{j=1}^{K} \kappa_j^+ \cos(js - \mu_j^+)\Big)/Z^+,$$

where $T^+(s) = [\cos(s), \sin(s), \ldots, \cos(Ks), \sin(Ks)]^T$. This conjugacy relation $p(s \mid x, y) \propto \exp(\hat{\eta}^+ \cdot T^+(s))$ is obtained using similar reasoning as before, yielding the update equation,

$$\hat{\eta}_j^+ = \eta_j^+ + \sum_{k:\, \omega_k = j} \hat{\eta}_k, \qquad (16)$$

where $\hat{\eta}_k$ is obtained from eq. 12 using a uniform prior $\eta_k = 0$. The sum in this update equation performs a pooling operation over a weight space, which is defined as the span of those invariant subspaces $k$ whose weight $\omega_k = j$. As it turns out, this is exactly the right thing to do in order to summarize the data while maintaining invariance. As explained by Kanatani (1990), the norm $\|\hat{\eta}_j^+\|$ of a linear combination of same-weight representations is always invariant (as is the maximum of any two $\|\hat{\eta}_j\|$ or $\|\hat{\eta}_j^+\|$). The similarity to sum-pooling and max-pooling in convnets is quite striking.

Figure 2. Posterior distribution over $s$ for three image pairs.

In the maximal torus model, there are $J = D/2$ degrees of freedom in the group and the invariant representation $(\hat{\kappa}_1, \ldots, \hat{\kappa}_J)$ is $D - J = J$-dimensional. This representation of a datum $x$ identifies a toroidal orbit, and hence the vector $\hat{\kappa}$ is a maximal invariant with respect to this group (Soatto, 2009). For the coupled model, there is only one degree of freedom in the group, so the invariant representation should be $D - 1$ dimensional. When all $\omega_k$ are distinct, we have $J$ variables $\kappa_1^+, \ldots, \kappa_J^+$ that are invariant. Furthermore, from eq. 15 we see that as $x$ is transformed, the angle between $x$ and an arbitrary fixed reference vector in subspace $j$ transforms as $\theta_j(s) = \delta_j + \omega_j s$ for some data-dependent initial phase $\delta_j$. It follows that $\omega_j \theta_k(s) - \omega_k \theta_j(s) = \omega_j(\delta_k + \omega_k s) - \omega_k(\delta_j + \omega_j s) = \omega_j \delta_k - \omega_k \delta_j$ is invariant. In this way, we can easily construct another $J - 1$ invariants, but unfortunately these are not stable because the angle estimates can be inaccurate for low-energy subspaces. Finding a stable maximal invariant, and doing so in a way that will generalize to other groups, is an interesting problem for future work.

The normalization constant $Z^+$ for the GvM has so far only been described for the case of $K = 2$ harmonics, but we have found a closed form solution in terms of the so-called modified Generalized Bessel Functions (GBF) of $K$ variables $\kappa^+ = \kappa_1^+, \ldots, \kappa_K^+$ and parameters(5) $\exp(-i\mu^+) = \exp(-i\mu_1^+), \ldots, \exp(-i\mu_K^+)$ (Dattoli et al., 1991):

$$Z^+(\kappa^+, \mu^+) = 2\pi I_0(\kappa^+; e^{-i\mu^+}). \qquad (17)$$

We have developed a novel, highly scalable algorithm for the computation of GBF of many variables, which is described in the supplementary material.

Figure 2 shows the posterior over $s$ for three image pairs related by different rotations and containing different symmetries. The weights $W$ and $\omega$ were learned by the procedure described in the next section. It is quite clear from this figure that MAP inference does not give a complete description of the possible transformations relating the images when the images have a degree of rotational symmetry. The posterior distribution of our model provides a sensible way to deal with this kind of uncertainty, which (in the case of 2D translations) is at the heart of the well known aperture problem in vision. Having a tractable posterior is particularly important if the model is to be used to estimate longer sequences (akin to HMM/LDS models, but non-linear), where one may encounter multiple high-density trajectories.

If required, accurate MAP inference can be performed using the algorithm of Sohl-Dickstein et al. (2010), as described in the supplementary material. This allows us to compute the exact manifold distance for the coupled model.

(5) As described in the supplementary material, we use a slightly different parameterization of the GBF.
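A sketch of ours of the pooled update of eq. (16), assuming positive integer weights between 1 and K for simplicity: per-subspace natural parameters computed under a uniform prior (eq. 12) are summed into the harmonic of the generalized von-Mises posterior that matches their weight.

```python
import numpy as np

def gvm_posterior_natural_params(u, v, omega, sigma, K):
    """eq. (16): pool the per-subspace updates of eq. (12) (uniform prior) by weight omega_k.
    Assumes positive integer weights 1..K."""
    dots = u[:, 0] * v[:, 0] + u[:, 1] * v[:, 1]
    crosses = u[:, 0] * v[:, 1] - u[:, 1] * v[:, 0]
    eta_hat = np.stack([dots, crosses], axis=1) / sigma**2   # hat_eta_k, eq. (12)
    eta_plus = np.zeros((K, 2))                              # harmonics j = 1..K of the GvM posterior
    for k, w in enumerate(omega):
        eta_plus[int(w) - 1] += eta_hat[k]
    return eta_plus

# Example: three subspaces, two of which share weight 2.
rng = np.random.default_rng(3)
u, v = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
eta_plus = gvm_posterior_natural_params(u, v, omega=[1, 2, 2], sigma=0.1, K=2)
kappa_plus = np.linalg.norm(eta_plus, axis=1)   # invariant per-weight amplitudes kappa^+_j
```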

3.4. Maximum Marginal Likelihood Learning

We train the model by gradient descent on the marginal likelihood. Perhaps surprisingly given the non-linearities in the model, the integrations required for the evaluation of the marginal likelihood can be obtained in closed form for both the coupled and decoupled models. For the decoupled model we obtain:

$$p(y \mid x) = \int_{\varphi \in \mathbb{T}^J} \mathcal{N}(y \mid W R(\varphi) W^T x) \prod_j \mathcal{M}(\varphi_j \mid \eta_j)\, d\varphi = \frac{\exp\big(-\frac{1}{2\sigma^2}(\|x\|^2 + \|y\|^2)\big)}{(\sqrt{2\pi}\,\sigma)^D} \prod_j \frac{I_0(\hat{\kappa}_j)}{I_0(\kappa_j)}. \qquad (18)$$

Observing that $I_0(\hat{\kappa}_j) / I_0(\kappa_j)$ is the ratio of normalization constants of regular von-Mises distributions, the analogous expression for the coupled model is easily seen to be equal to eq. 18, only replacing $\prod_j I_0(\hat{\kappa}_j) / I_0(\kappa_j)$ by $Z^+(\hat{\kappa}^+, \hat{\mu}^+) / Z^+(\kappa^+, \mu^+)$. The derivation of this result can be found in the supplementary material.

The gradient of the log marginal likelihood of the uncoupled model w.r.t. a batch $X, Y$ (both storing $N$ vectors in the columns) is:

$$\frac{d}{dW} \ln p(Y \mid X) = X\big(R^T(\mu) W^T Y \odot A + W^T X \odot B^y\big)^T + Y\big(R(\mu) W^T X \odot A + W^T Y \odot B^x\big)^T,$$

where we have used $(P \odot Q)_{2j,n} = Q_{j,n} P_{2j,n}$ and $(P \odot Q)_{2j-1,n} = Q_{j,n} P_{2j-1,n}$ as a "subspace weighting" operation. $A$, $B^{(x)}$ and $B^{(y)}$ are $D \times N$ matrices with elements

$$a_{jn} = \frac{I_1(\hat{\kappa}_{jn})\, \kappa_j}{I_0(\hat{\kappa}_{jn})\, \hat{\kappa}_{jn}\, \sigma^2}, \qquad b_{jn}^{(y)} = \frac{\|W_j^T y^{(n)}\|^2}{\kappa_j\, \sigma^2},$$

where $\hat{\kappa}_{jn}$ is the posterior precision in subspace $j$ for image pair $x^{(n)}, y^{(n)}$ (the $n$-th column of $X$, resp. $Y$). The gradient of the coupled model is easily computed using the differential recurrence relations that hold for the GBF (Dattoli et al., 1991).

We use minibatch Stochastic Gradient Descent (SGD) on the log-likelihood of the uncoupled model. After every parameter update, we orthogonalize $W$ by setting all singular values to 1: let $U, S, V = \mathrm{svd}(W)$, then set $W := UV$. This procedure and all previous derivations still work when the basis is undercomplete, i.e. has fewer columns (filters) than rows (dimensions in data space). To learn $\omega_j$, we estimate the relative angular velocity $\omega_j = \theta_j / \delta$ from a batch of image patches rotated by a sub-pixel amount $\delta = 0.1°$.

Figure 3. Filters learned by TSA, sorted by absolute frequency $|\omega_j|$. The learned $\omega_j$-values range from $-11$ to $12$.
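A schematic training step of ours: `grad_log_marginal_likelihood` is a hypothetical placeholder for the gradient above (not an implementation of it), while the SVD re-orthogonalization and the weight estimation follow the procedure described in the text.

```python
import numpy as np

def orthogonalize(W):
    """Project W back onto the (possibly undercomplete) orthogonal bases: set all singular values to 1."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def sgd_step(W, X, Y, alpha, grad_log_marginal_likelihood):
    """One minibatch update on the log marginal likelihood, followed by re-orthogonalization."""
    W = W + alpha * grad_log_marginal_likelihood(W, X, Y)
    return orthogonalize(W)

def estimate_weights(W, X, X_rot, delta):
    """Estimate omega_j = theta_j / delta from patches X and copies X_rot rotated by a small angle delta (radians)."""
    U = (W.T @ X).reshape(-1, 2, X.shape[1])
    V = (W.T @ X_rot).reshape(-1, 2, X.shape[1])
    theta = np.arctan2(U[:, 0] * V[:, 1] - U[:, 1] * V[:, 0],
                       U[:, 0] * V[:, 0] + U[:, 1] * V[:, 1]).mean(axis=1)
    return np.round(theta / delta).astype(int)
```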

Figure 4. Results of classification experiment: 1-NN accuracy versus training set size (100 to 60k), for ED, TD, $\sqrt{\hat{\kappa}}$, MD and ED-NR. See text for details.

4. Experiments

We trained a TSA model with 100 filters on a stream of 250,000 $16 \times 16$ image patches $x^{(t)}, y^{(t)}$. The patches $x^{(t)}$ were drawn from a standard normal distribution, and $y^{(t)}$ was obtained by rotating $x^{(t)}$ by an angle $s$ drawn uniformly at random from $[0, 2\pi]$. The learning rate $\alpha$ was initialized at $\alpha_0 = 0.25$ and decayed as $\alpha = \alpha_0 / \sqrt{T}$, where $T$ was incremented by one with each pass through the data. Each minibatch consisted of 100 data pairs. After learning $W$, we estimate the weights $\omega_j$ and sort the filter pairs by increasing absolute value for visualization. As can be seen in fig. 3, the filters are very clean and the weights are estimated correctly except for a few filters on row 1 and 2 that are assigned weight 0 when in fact they have a higher frequency.

We tested the utility of the model for invariant classification on a rotated version of the MNIST dataset, using a 1-Nearest Neighbor classifier. Each digit was rotated by a random angle and rescaled to $16 \times 16$ pixels, resulting in 60k training examples and 10k testing examples, with no rotated duplicates. We compared the Euclidean distance (ED) in pixel space, tangent distance (TD) (Simard et al., 2000), Euclidean distance on the space of $\sqrt{\hat{\kappa}}$ (equivalent to the exact manifold distance for the maximal torus, see section 3.1), the true manifold distance for the 1-parameter 2D rotation group (MD), and the Euclidean distance on the non-rotated version of the MNIST dataset (ED-NR). The results in fig. 4 show that TD outperforms ED, but is outperformed by $\sqrt{\hat{\kappa}}$ and MD by a large margin. In fact, the MD-classifier is about as accurate as ED on a much simpler dataset, demonstrating that it has almost completely modded out the variation caused by rotation.

5. Conclusions and outlook

We have presented a novel principle for learning disentangled representations, and worked out its consequences for a simple type of symmetry group. This leads to a completely tractable model with potential applications to invariant classification and Bayesian estimation of motion. The model reproduces the pooling operations used in convolutional networks from probabilistic and Lie-group theoretic principles, and provides a probabilistic interpretation of the DFT and its generalizations.

The type of disentangling obtained in this paper is contingent upon the rather minimalist assumption that all that can be said about images is that they are equivalent (rotated copies) or inequivalent. However, the universal nature of Weyl's principle bodes well for future applications to deep, non-linear and non-commutative forms of disentangling.

Acknowledgments

We would like to thank prof. Giuseppe Dattoli for help with derivations related to the generalized Bessel functions.

References

Bengio, Y. and Lecun, Y. International Conference on Learning Representations, 2014. URL https://sites.google.com/site/representationlearning2014/.

Bengio, Y., Courville, A., and Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-30, February 2013.

Bethge, M., Gerwinn, S., and Macke, J. H. Unsupervised learning of a steerable basis for invariant image representations. Proceedings of SPIE Human Vision and Electronic Imaging XII (EI105), February 2007.

Cadieu, C. F. and Olshausen, B. A. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24(4):827-66, April 2012.

Soatto, S. Actionable information in vision. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2138-2145. IEEE, September 2009.

Sohl-Dickstein, J., Wang, J. C., and Olshausen, B. A. An unsupervised algorithm for learning Lie group transformations. arXiv preprint, 2010.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning, pp. 1096-1103, 2008.

Welling, M., Rosen-Zvi, M., and Hinton, G. Exponential Family Harmoniums with an Application to Information Retrieval. Advances in Neural Information Processing Systems, 17, 2005.

Dattoli, G., Chiccoli, C., Lorenzutta, S., Maino, G., Richetta, M., and Torre, A. A Note on the Theory of n-Variable Generalized Bessel Functions. Il Nuovo Cimento B, 106(10):1159-1166, October 1991.

Gatto, R. and Jammalamadaka, S. R. The generalized von Mises distribution. Statistical Methodology, 4(3):341-353, 2007.

Goodfellow, I. J., Le, Q. V., Saxe, A. M., Lee, H., and Ng, A. Y. Measuring invariances in deep networks. Advances in Neural Information Processing Systems, 2009.

Kanatani, K. Group Theoretical Methods in Image Under- standing. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1990. ISBN 0387512535.

Memisevic, R. On multi-view feature learning. International Conference on Machine Learning, 2012.

Memisevic, R. and Hinton, G. E. Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines. Neural Computation, 22(6):1473-1492, June 2010. ISSN 1530-888X. doi: 10.1162/neco.2010.01-09-953.

Miao, X. and Rao, R. P. N. Learning the Lie groups of visual invariance. Neural computation, 19(10):2665–93, October 2007.

Rao, R. P. N. and Ruderman, D. L. Learning Lie groups for invariant visual perception. Advances in Neural Information Processing Systems, 816:810-816, 1999.

Simard, P. Y., Le Cun, Y. A., Denker, J. S., and Victorri, B. Transformation invariance in pattern recognition: Tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181-197, 2000.