On multi-view feature learning

Roland Memisevic [email protected] University of Frankfurt, Robert-Mayer-Str. 10, 60325 Frankfurt, Germany

Abstract

Sparse coding is a common approach to learning local features for object recognition. Recently, there has been an increasing interest in learning features from spatio-temporal, binocular, or other multi-observation data, where the goal is to encode the relationship between images rather than the content of a single image. We provide an analysis of multi-view feature learning, which shows that hidden variables encode transformations by detecting rotation angles in the eigenspaces shared among multiple image warps. Our analysis helps explain recent experimental results showing that transformation-specific features emerge when training complex cell models on videos. Our analysis also shows that transformation-invariant features can emerge as a by-product of learning representations of transformations.

1. Introduction

Feature learning (also known as dictionary learning or sparse coding) has gained considerable attention in computer vision in recent years, because it can yield image representations that are useful for recognition. However, although recognition is important in a variety of tasks, a lot of problems in vision involve encoding the relationship between observations, not single observations. Examples include tracking, multi-view geometry, action understanding, and dealing with invariances.

A variety of multi-view feature learning models have recently been suggested as a way to learn features that encode relations between images. The basic idea behind these models is that hidden variables sum over products of filter responses applied to two observations x and y and thereby correlate the responses. Adapting the filters based on synthetic transformations of images was shown to yield transformation-specific features, like phase-shifted Fourier components when training on shifted image pairs, or "circular" Fourier components when training on rotated image pairs (Memisevic & Hinton, 2010). Task-specific filter pairs emerge when training on natural transformations, like facial expression changes (Susskind et al., 2011) or natural video (Taylor et al., 2010), and they were shown to yield state-of-the-art recognition performance in these domains. Multi-view feature learning models are also closely related to energy models of complex cells (Adelson & Bergen, 1985), which, in turn, have been successfully applied to video understanding, too (Le et al., 2011). They have also been used to learn within-image correlations by letting input and output images be the same (Ranzato & Hinton, 2010; Bergstra et al., 2010).

Common to all these methods is that they deploy products of filter responses to learn relations. In this paper, we analyze the role of these multiplicative interactions in learning relations. We show that the hidden variables in a multi-view feature learning model represent transformations by detecting rotation angles in eigenspaces that are shared among the transformations. We focus on image transformations here, but our analysis is not restricted to images.

Our analysis has a variety of practical applications, which we investigate in detail experimentally: (1) we can train complex cell and energy models using conditional sparse coding models and vice versa; (2) it is possible to extend multi-view feature learning to model sequences of three or more images instead of just two; (3) hidden variables must pool over multiple subspaces to work properly; (4) invariant features can be learned by separating pooling within subspaces from pooling across subspaces.

Our analysis is related to previous investigations of energy models and of complex cells (for example, Fleet et al., 1996; Qian, 1994), and it extends this line of work to more general transformations than local translation.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

2. Background on multi-view sparse coding

Feature learning¹ amounts to encoding an image patch x using a vector of latent variables z = σ(W^T x)², where each column of W can be viewed as a linear feature ("filter") that corresponds to one hidden variable z_k, and where σ is a non-linearity, such as the sigmoid σ(a) = (1 + exp(-a))^{-1}. To adapt the parameters, W, based on a set of example patches {x^α}, one can use a variety of methods, including maximizing the average sparsity of z, minimizing a form of reconstruction error, maximizing the likelihood of the observations via Gibbs sampling, and others (see, for example, (Hyvärinen et al., 2009) and references therein).

To obtain hidden variables z that encode the relationship between two images, x and y, one needs to represent correlation patterns between two images instead. This is commonly achieved by computing the sum over products of filter responses:

    z = W^T (U^T x ∗ V^T y)    (1)

where "∗" is element-wise multiplication, and the columns of U and V contain image filters that are learned along with W from data (Memisevic & Hinton, 2010). Again, one may apply an element-wise non-linearity to z. The hidden units are "multi-view" variables that encode transformations, not the content of single images, and they are commonly referred to as "mapping units".

Training the model parameters, (U, V, W), can be achieved by minimizing the conditional reconstruction error of y keeping x fixed, or vice versa (Memisevic, 2011), or by conditional variants of maximum likelihood (Ranzato & Hinton, 2010; Memisevic & Hinton, 2010). Training the model on transforming random-dot patterns yields transformation-specific features, such as phase-shifted Fourier features in the case of translation and circular harmonics in the case of rotation (Memisevic & Hinton, 2010; Memisevic, 2011). Eq. 1 can also be derived by factorizing the parameter tensor of a conditional sparse coding model (Memisevic & Hinton, 2010). An illustration of the model is shown in Figure 1 (a).

¹ We use the terms "feature learning", "dictionary learning" and "sparse coding" synonymously in this paper. Each term tends to come with a slightly different meaning in the literature, but for the purpose of this work the differences are negligible.

² In practice, it is common to add constant bias terms to the linear mapping. In the following, we shall refrain from doing so to avoid cluttering up the derivations. We shall instead think of data and hidden variables as being in "homogeneous notation" with an extra, constant 1-dimension.

2.1. Energy models

Multi-view feature learning is closely related to energy models and to models of complex cells (Adelson & Bergen, 1985; Fleet et al., 1996; Kohonen; Hyvärinen & Hoyer, 2000). The activity of a hidden unit in an energy model is typically defined as the sum over squared filter responses, which may be written

    z = W^T (B^T x ∗ B^T x)    (2)

where B contains image filters in its columns. W is usually constrained such that each hidden variable, z_k, computes the sum over only a subset of all products. This way, hidden variables can be thought of as encoding the norm of a projection of x onto a subspace³. Energy models are also referred to as "subspace" or "square-pooling" models.

For our analysis, it is important to note that, when we apply an energy model to the concatenation of two images, x and y, we obtain a response that is closely related to the response of a multi-view sparse coding model (cf. Eq. 1): Let b_f denote a single column of B. Furthermore, let u_f denote the part of the filter b_f that gets applied to image x, and let v_f denote the part that gets applied to image y, so that b_f^T [x; y] = u_f^T x + v_f^T y. Hidden unit activities, z_k, then take the form

    z_k = Σ_f W_{fk} (u_f^T x + v_f^T y)²
        = 2 Σ_f W_{fk} (u_f^T x)(v_f^T y) + Σ_f W_{fk} (u_f^T x)² + Σ_f W_{fk} (v_f^T y)²    (3)

Thus, up to the quadratic terms in Eq. 3, hidden unit activities are the same as in a multi-view feature learning model (Eq. 1). As we shall discuss in Section 3.5, the quadratic terms do not significantly change the behavior of the hidden units as compared to multi-view sparse coding models. An illustration of the energy model is shown in Figure 1 (b).

³ It is also common to apply non-linearities, such as a square-root, to the activity z_k.
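To make the relationship between Eqs. 1 and 3 concrete, the following minimal numpy sketch (with arbitrary toy dimensions and random parameters, not the trained models used later in the paper) computes both responses and checks that the energy response on the concatenated pair decomposes into the gated cross-term plus the two quadratic terms:

```python
import numpy as np

# Toy sketch of Eqs. (1) and (3); sizes and parameters are arbitrary assumptions.
D, F, K = 64, 32, 16          # pixels, factors (filters), mapping units
rng = np.random.default_rng(0)
U, V = rng.standard_normal((D, F)), rng.standard_normal((D, F))
W = rng.standard_normal((F, K))
x, y = rng.standard_normal(D), rng.standard_normal(D)

# Eq. (1): mapping-unit activities of a gated (multi-view) sparse coding model
z_gated = W.T @ (U.T @ x * V.T @ y)

# Eq. (3): energy model applied to the concatenation [x; y], with filters b_f = [u_f; v_f]
B = np.concatenate([U, V], axis=0)          # (2D, F)
xy = np.concatenate([x, y])                 # (2D,)
z_energy = W.T @ (B.T @ xy) ** 2

# The energy response equals twice the gated response plus the quadratic terms.
quadratic = W.T @ ((U.T @ x) ** 2 + (V.T @ y) ** 2)
assert np.allclose(z_energy, 2 * z_gated + quadratic)
```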


Figure 1. (a) Modeling an image pair using a gated sparse coding model. (b) Modeling an image pair using an energy model applied to the concatenation of the images. (c) Projections p_x and p_y of two images onto the complex plane spanned by two eigenfeatures. (d) Absorbing eigenvalues into input features amounts to performing a projection and a rotation for image x. Hidden units can detect whether this brings the projections into alignment (see text for details).

3. Eigenspace analysis

We now show that hidden variables turn into subspace rotation detectors when the models are trained on transformed image pairs. To simplify the analysis, we shall restrict our attention to transformations, L, that are orthogonal, that is, L^T L = L L^T = I, where I is the identity matrix. In other words, L^{-1} = L^T. Linear transformations in "pixel-space" are also known as warps. Note that practically all relevant spatial transformations, like translation, rotation or local shifts, can be expressed approximately as an orthogonal warp, because orthogonal transformations subsume, in particular, all permutations ("shuffling pixels").

An important fact about orthogonal matrices is that the eigen-decomposition L = U D U^T is complex, where the eigenvalues (the diagonal of D) have absolute value 1 (Horn & Johnson, 1990). Multiplying by a complex number with absolute value 1 amounts to performing a rotation in the complex plane, as illustrated in Figure 1 (c) and (d). Each eigenspace associated with L is also referred to as an invariant subspace of L (as application of L will keep eigenvectors within the subspace).

Applying an orthogonal warp is thus equivalent to (i) projecting the image onto filter pairs (the real and imaginary parts of each eigenvector), (ii) performing a rotation within each invariant subspace, and (iii) projecting back into the image space. In other words, we can decompose an orthogonal transformation into a set of independent, 2-dimensional rotations. The most well-known examples are translations: a 1-D translation matrix contains ones along one of its secondary diagonals, and it is zero elsewhere⁴. The eigenvectors of this matrix are Fourier components (Gray, 2005), and the rotation in each invariant subspace amounts to a phase shift of the corresponding Fourier feature. This leaves the norm of the projections onto the Fourier components (the power spectrum of the signal) constant, which is a well-known property of translations.

It is interesting to note that the imaginary and real parts of the eigenvectors of a translation matrix correspond to sine and cosine features, respectively, reflecting the fact that Fourier components naturally come in pairs. These are commonly referred to as quadrature pairs in the literature. The same is true of Gabor features, which represent local translations (Qian, 1994; Fleet et al., 1996). However, the property that eigenvectors come in pairs is not specific to translations. It is shared by all transformations that can be represented by an orthogonal matrix. Bethge et al. (2007) use the term generalized quadrature pair to refer to the eigen-features of these transformations.

⁴ To be exactly orthogonal, it has to contain an additional one in another place, so that it performs a rotation with wrap-around.

3.1. Commuting warps share eigenspaces

A central observation of our analysis is that eigenspaces can be shared among transformations. When eigenspaces are shared, the only way in which two transformations differ is in the angles of rotation within the eigenspaces. In this case, we can represent multiple transformations with a single set of features, as we shall show.

An example of a shared eigenspace is the Fourier basis, which is shared among translations (Gray, 2005). Less obvious examples are Gabor features, which may be thought of as the eigenbases of local translations, or features that represent spatial rotations. Formally, a set of matrices share eigenvectors if they commute (Horn & Johnson, 1990). This can be seen by considering any two matrices A and B with AB = BA and with λ, v an eigenvalue/eigenvector pair of B with multiplicity one. It holds that BAv = ABv = λAv. Therefore, Av is also an eigenvector of B with the same eigenvalue.
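As a toy numerical illustration of these properties (not code from the paper; the patch size and shift amounts are arbitrary assumptions), one can check that cyclic 1-D translations are orthogonal, have unit-modulus eigenvalues, and commute with each other, so that two different shifts differ only in their per-subspace rotation angles:

```python
import numpy as np

n = 8
L1 = np.roll(np.eye(n), 1, axis=0)   # shift by one pixel (with wrap-around)
L2 = np.roll(np.eye(n), 3, axis=0)   # shift by three pixels

assert np.allclose(L1 @ L1.T, np.eye(n))          # orthogonality
evals, evecs = np.linalg.eig(L1)
assert np.allclose(np.abs(evals), 1.0)            # rotations within eigenspaces

# Commuting warps share the eigenbasis; L2 only changes the rotation angles.
assert np.allclose(L1 @ L2, L2 @ L1)
angles = np.angle(np.diag(evecs.conj().T @ L2 @ evecs))   # per-subspace rotations of L2
print(angles)
```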

3.2. Extracting transformations

Consider the following task: Given two images x and y, determine the transformation L that relates them, assuming that L belongs to a given class of transformations.

The importance of commuting transformations for our analysis is that, since they share an eigenbasis, any two transformations differ only in the angles of rotation in the joint eigenspaces. As a result, one may extract the transformation from the given image pair (x, y) simply by recovering the angles of rotation between the projections of x and y onto the eigenspaces. To this end, consider the real and imaginary parts v_R and v_I of some eigenfeature v = v_R + i v_I, where i = sqrt(-1). The real and imaginary coordinates of the projection p_x of x onto the invariant subspace associated with v are given by v_R^T x and v_I^T x, respectively. For the projection p_y of the output image onto the invariant subspace, they are v_R^T y and v_I^T y.

Let φ_x and φ_y denote the angles of the projections of x and y with the real axis in the complex plane. If we normalize the projections to have unit norm, then the cosine of the angle between the projections, φ_y - φ_x, may be written

    cos(φ_y - φ_x) = cos φ_y cos φ_x + sin φ_y sin φ_x

by a trigonometric identity. This is equivalent to computing the inner product between the two normalized projections (cf. Figure 1 (c) and (d)). In other words, to estimate the (cosine of the) angle of rotation between the projections of x and y, we need to sum over the product of two filter responses.
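The following sketch illustrates this angle recovery in a hypothetical toy setting (a single frequency-k Fourier subspace of a cyclic 1-D translation; the frequency, shift, and signal are arbitrary assumptions):

```python
import numpy as np

n, k = 64, 3                                 # signal length and assumed subspace frequency
t = np.arange(n)
v_R = np.cos(2 * np.pi * k * t / n)          # real part of the eigenfeature (quadrature pair)
v_I = np.sin(2 * np.pi * k * t / n)          # imaginary part

rng = np.random.default_rng(0)
x = rng.standard_normal(n)
y = np.roll(x, 5)                            # warp: translate by 5 pixels

# Normalized projections onto the invariant subspace
p_x = np.array([v_R @ x, v_I @ x]); p_x /= np.linalg.norm(p_x)
p_y = np.array([v_R @ y, v_I @ y]); p_y /= np.linalg.norm(p_y)

# cos(phi_y - phi_x) as a sum over products of filter responses
cos_angle = p_x @ p_y
print(np.arccos(np.clip(cos_angle, -1, 1)))  # magnitude ~ 2*pi*k*5/n, the expected phase shift
```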

3.3. The subspace aperture problem

Note, however, that normalizing each projection to 1 amounts to dividing by the sum of squared filter responses, an operation that is highly unstable if a projection is close to zero. This will be the case whenever one of the images is almost orthogonal to the invariant subspace. This, in turn, means that the rotation angle cannot be recovered from the given image, because the image is too close to the axis of rotation. One may view this as a subspace generalization of the well-known aperture problem, beyond translation, to the set of orthogonal transformations. Normalization would ignore this problem and provide the illusion of a recovered angle even when the aperture problem makes the detection of the transformation component impossible. In the next section we discuss how one may overcome this problem by rephrasing the problem as a detection task.

3.4. Mapping units as rotation detectors

For each eigenvector, v, and rotation angle, θ, define the complex output image filter

    v^θ = exp(iθ) v

which represents a projection and simultaneous rotation by θ. This allows us to define a subspace rotation detector with preferred angle θ as follows:

    r^θ = (v_R^T y)(v_R^{θ T} x) + (v_I^T y)(v_I^{θ T} x)    (4)

where subscripts R and I denote the real and imaginary parts of the filters, as before. If projections are normalized to length 1, we have

    r^θ = cos φ_y cos(φ_x + θ) + sin φ_y sin(φ_x + θ) = cos(φ_y - φ_x - θ),    (5)

which is maximal whenever φ_y - φ_x = θ, thus when the observed angle of rotation, φ_y - φ_x, is equal to the preferred angle of rotation, θ. However, like before, normalizing projections is not a good idea because of the subspace aperture problem. We now show that mapping units are well-suited to detecting subspace rotations, if a number of conditions are met.

If features and data are contrast normalized, then the projections will depend only on how well the image pair represents a given subspace rotation. The value r^θ, in turn, will depend (a) on the transformation (via the subspace angle) and (b) on the content of the images (via the angle between each image and the invariant subspace). Thus, the output of the detector factors in both the presence of a transformation and our ability to discern it.

The fact that r^θ depends on image content makes it a suboptimal representation of the transformation. However, note that r^θ is a "conservative" detector that takes on a large value only if an input image pair (x, y) complies with its transformation. We can therefore define a content-independent representation by pooling over multiple detectors r^θ that represent the same transformation but respond to different images.

Therefore, by stacking the eigenfeatures v and v^θ in matrices U and V, respectively, we may define the representation t of a transformation, given two images x and y, as

    t = W^T P (U^T x ∗ V^T y)    (6)

where P is a band-diagonal within-subspace pooling matrix, and W is an appropriate across-subspace pooling matrix that supports content-independence. Furthermore, the following conditions need to be met: (1) images x and y are contrast-normalized, (2) for each row u_f of U there exists a θ such that the corresponding row v_f of V can be written v_f = exp(iθ) u_f. In other words, filter pairs are related through rotations only.
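A small sketch of the detector in Eq. 4 and of the pooled representation in Eq. 6 follows. The function names and the use of plain real-valued numpy arrays for U, V, P, and W are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rotation_detector(v, theta, x, y):
    """r^theta = (v_R^T y)(v_R^{theta T} x) + (v_I^T y)(v_I^{theta T} x), Eq. (4)."""
    v_theta = np.exp(1j * theta) * v         # rotated copy of the complex eigenfeature
    return (v.real @ y) * (v_theta.real @ x) + (v.imag @ y) * (v_theta.imag @ x)

def mapping_units(U, V, P, W, x, y):
    """t = W^T P (U^T x * V^T y), Eq. (6), with real-valued stacked filter matrices."""
    return W.T @ (P @ (U.T @ x * V.T @ y))
```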
Eq. 6 takes the same form as inference in a multi-view feature learning model (cf. Eq. 1), if we absorb the within-subspace pooling matrix P into W. Learning amounts to identifying both the subspaces and the pooling matrix, so training a multi-view feature learning model can be thought of as performing multiple simultaneous diagonalizations of a set of transformations. When a dataset contains more than one transformation class, learning involves partitioning the set of orthogonal warps into commutative subsets and simultaneously diagonalizing each subset. Note that, in practice, complex filters can be represented by learning two-dimensional subspaces in the form of filter pairs. It is uncommon, albeit possible, to learn actually complex-valued features in practice.

It is interesting to note that condition (2) above implies that filters are normalized to have the same lengths. Imposing a norm constraint has been a common approach to stabilizing learning (Ranzato & Hinton, 2010; Memisevic, 2011; Susskind et al., 2011), but it has not been clear why imposing norm constraints helps. Pooling over multiple subspaces may, in addition to providing content-independent representations, also help deal with edge effects and noise, as well as with the fact that learned transformations may not be exactly orthogonal. In practice, it is also common to apply a sigmoid non-linearity after computing mapping unit activities, so that the output of a hidden variable can be interpreted as a probability.

Note that diagonalizing a single transformation, L, would amount to performing a kind of canonical correlation analysis (CCA), so learning a multi-view feature learning model may be thought of as performing multiple such analyses with tied features. Similarly, modeling within-image structure by setting x = y (Ranzato & Hinton, 2010) would amount to learning a PCA mixture with tied weights. In the same way that neural networks can be used to implement CCA and PCA up to a linear transformation, the result of training a multi-view feature learning model is a simultaneous diagonalization only up to a linear transformation.

3.5. Relation to energy models

By concatenating images x and y, as well as filters v and v^θ, we may approximate the subspace rotation detector (Eq. 4) with the response of an energy detector:

    r^θ = ((v_R^T y) + (v_R^{θ T} x))² + ((v_I^T y) + (v_I^{θ T} x))²
        = 2 (v_R^T y)(v_R^{θ T} x) + 2 (v_I^T y)(v_I^{θ T} x) + (v_R^T y)² + (v_R^{θ T} x)² + (v_I^T y)² + (v_I^{θ T} x)²    (7)

Eq. 7 is equivalent to Eq. 4 up to the four quadratic terms. The four quadratic terms are equal to the sum of the squared norms of the projections of x and y onto the invariant subspace. Thus, like the norm of the projections, they contribute information about the discernibility of transformations. This makes the energy response depend more on the alignment of the images with its subspace. However, like for the inner product detector (Eq. 4), the peak response is attained when both images reside within the detector's subspace and when their projections are rotated by the detector's preferred angle θ.

By pooling over multiple rotation detectors, r^θ, we obtain the equivalent of an energy response (Eq. 3). This shows that energy models applied to the concatenation of two images are well-suited to modeling transformations, too. It is interesting to note that both multi-view sparse coding models (Taylor et al., 2010) and energy models (Le et al., 2011) were recently shown to yield highly competitive performance in action recognition tasks, which require the encoding of motion in videos.
4. Experiments

4.1. Learning quadrature pairs

Figure 2 shows random subsets of input/output filter pairs learned from rotations of random-dot images (top plot), and from a mixed dataset consisting of random rotations and random translations (bottom plot). We separate the two layers of pooling, W and P, and we constrain P to be band-diagonal with entries P_{i,i} = P_{i,i+1} = 1 and 0 elsewhere. Thus, filters need to come in pairs, which we expect to be approximately in quadrature (each pair spans the subspace associated with a rotation detector r^θ). Figure 2 shows that this is indeed the case after training. Here, we use a modification of a higher-order autoencoder (Memisevic, 2011) for training, but we expect similar results to hold for other multi-view models⁵. We discuss an application of separating pooling in detail in Section 5.

Note that for the mixed dataset, both the set of rotations and the set of translations are sets of commuting warps (up to edge effects), but rotations do not commute with translations and vice versa. The figure shows that the model has learned to separate out the two types of transformation by devoting a subset of filters to encoding rotations and another subset to modeling translations.

Figure 2. Learned quadrature filters. Top: Filters learned from rotated images. Bottom: Filters learned from images showing both rotations and translations.

⁵ Code and datasets are available at http://www.cs.toronto.edu/~rfm/code/multiview/index.html
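For concreteness, a band-diagonal P with the stated entries can be built as follows. The exact matrix shape (one fewer pooling row than factors) is an assumption, since the paper does not spell out the dimensions:

```python
import numpy as np

def band_diagonal_pooling(num_factors):
    """Within-subspace pooling matrix with P[i, i] = P[i, i + 1] = 1 and zeros elsewhere,
    so each first-level pooling unit sums a pair of adjacent filter products."""
    P = np.zeros((num_factors - 1, num_factors))
    idx = np.arange(num_factors - 1)
    P[idx, idx] = 1.0
    P[idx, idx + 1] = 1.0
    return P

print(band_diagonal_pooling(6))
# [[1. 1. 0. 0. 0. 0.]
#  [0. 1. 1. 0. 0. 0.]
#  ...]
```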

4.2. Learning "eigenmovies"

Both energy models and cross-correlation models can be applied to more than two images: Eq. 4 may be modified to contain all cross-terms, or all the ones that are deemed relevant (for example, adjacent frames, which would amount to a "Markov"-type gating model of a video). Alternatively, for the energy mechanism, we can compute the square of the concatenation of more than two images in place of Eq. 7, in which case we obtain the detector response

    r = (Σ_s v_R^{s T} x_s)² + (Σ_s v_I^{s T} x_s)² = Ω + Σ_{s≠t} (v_R^{s T} x_s)(v_R^{t T} x_t) + Σ_{s≠t} (v_I^{s T} x_s)(v_I^{t T} x_t)    (8)

where Ω contains the quadratic terms of the energy model. In analogy to Section 3.4, for the detector to function properly, features will need to satisfy v_R^s = exp(iθ_s) v_R and v_I^s = exp(iθ_s) v_I for appropriate filters v_R and v_I.

We verify that training on videos leads to filters which approximately satisfy this condition as follows: We use a gated autoencoder where we set U = V, and we set x = y to the concatenation of the 10 frames. In contrast to Section 4.1, we use a single (full) pooling matrix W. Figure 3 shows subsets of learned filters after training the model on shifted random dots (top) and on natural movies cropped from the van Hateren database (van Hateren & Ruderman, 1998) (center). The learned filter sequences represent repeated phase shifts, as expected. Thus, they form the "eigenmovies" of each respective transformation class.

Eq. 8 implies that learning videos requires consistency between the filters across time. Each factor – corresponding to a sequence of T filters – can model only the repeated application of the same transformation. An inhomogeneous sequence that involves multiple different types of transformation can only be modeled by devoting separate filter sets to homogeneous subsequences. We verify that this is what happens during training, by using 10-frame videos showing random dots that first rotate at a constant speed for 5 frames, then translate at a constant speed for the remaining 5 frames. Orientation, speed and direction vary across movies. We trained a gated autoencoder like in the previous experiment. The bottom plot in Figure 3 shows that the model learns to decompose the movies into (a) Fourier filters which are quiet in the first half of a movie and (b) rotation features which are quiet in the second half of the movie.
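A minimal sketch of the multi-frame detector in Eq. 8 follows, assuming that each factor holds one complex filter per frame; the toy dimensions and the fixed per-frame rotation angle stand in for learned parameters:

```python
import numpy as np

def sequence_energy(v_seq, frames):
    """Eq. (8): v_seq is a (T, D) complex filter sequence, frames a (T, D) video."""
    real_sum = sum(v.real @ x for v, x in zip(v_seq, frames))
    imag_sum = sum(v.imag @ x for v, x in zip(v_seq, frames))
    return real_sum ** 2 + imag_sum ** 2

# For a detector of one repeated transformation, the frame filters should be
# rotated copies of a single eigenfeature: v^s = exp(i * s * theta) * v.
D, T, theta = 64, 5, 0.3
rng = np.random.default_rng(0)
v = rng.standard_normal(D) + 1j * rng.standard_normal(D)
v_seq = np.stack([np.exp(1j * s * theta) * v for s in range(T)])
frames = rng.standard_normal((T, D))
print(sequence_energy(v_seq, frames))
```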
5. Learning invariances by learning transformations

Our analysis suggests that detector responses, r^θ, will not be affected by transformations they are tuned to, as these will only cause a rotation within the detector's subspace, leaving the norm of the projections unchanged. Any other transformation (like showing a completely different image pair), however, may change the representation: projections may get larger or smaller as the transformation changes the degree of alignment of the images with the invariant subspaces. This suggests that we can learn features that are invariant with respect to one type of transformation and at the same time selective with respect to any other type of transformation as follows: we separate the two pooling levels into a band matrix P and a full matrix W (cf. Section 4.1).


Figure 3. Subsets of filters showing six frames (frames 1, 2, 4, 6, 8, 10) from the "eigenmovies" of synthetic movies showing translating random dots (top); natural movies, cropped from the van Hateren broadcast TV database (middle); synthetic movies showing random dots which rotate for 5 frames, then translate for 5 frames (bottom).

After training the model on a transformation class, the first-level pooling activities P (U^T x ∗ V^T y) computed from a test image pair (x, y) will constitute a transformation-invariant code for this pair. Alternatively, we can use P (U^T x ∗ V^T x) if the test data does not come in the form of pairs but consists only of single images. In this case we obtain a representation of the null-transformation, but it will still be invariant.

We tested this approach on the "rotated MNIST" dataset from (Larochelle et al., 2007), which consists of 72000 MNIST digit images of size 28×28 pixels that are rotated by arbitrary random angles (-180 to 180 degrees; 12000 training and 60000 test cases; classes range from 0 to 9). Since the number of training cases is fairly large, most exemplars are represented at most angles, so even linear classifiers perform well (Larochelle et al., 2007). However, when reducing the number of training cases, the number of potential matches for any test case dramatically reduces, so classification error rates become much worse when using raw images or standard features.

We used a gated autoencoder with 2000 factors and 200 mapping units, which we trained on image pairs showing rotating random dots. Figure 4 shows the error rates when using subspace features (responses of the first-layer pooling units) with subspace dimension 2. We used the features in a logistic regression classifier vs. k-nearest neighbors on the original images (we also tried logistic regression on raw images, and logistic regression as well as nearest neighbors on 200-dimensional PCA features, but the performance is worse in all these cases). The learned features are similar to the features shown in Figure 2, so they are not tuned to digits, and they are not trained discriminatively. They nevertheless consistently perform about as well as, or better than, nearest neighbor. Also, even at half the original dataset size, the subspace features still attain about the same classification performance as raw images on the whole training set. All parameters were set using a fixed hold-out set of 2000 images. The experiment shows that the rotation detectors, r^θ, are affected sufficiently by the aperture problem, such that they are selective to image content while being invariant to rotation. This shows that we can "harness the aperture problem" to learn invariant features.

Figure 4. Classification error rate as a function of training set size on the "rotated MNIST" dataset, using raw images vs. subspace features trained on rotations.
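As an illustration of how such invariant features can be used for classification, the following sketch computes first-level pooling responses P (U^T x ∗ V^T x) for single images and feeds them to a linear classifier. The parameters U, V, P are random stand-ins for a model trained on rotating random dots, and scikit-learn's LogisticRegression is only an assumed stand-in for whichever linear classifier was used in the experiment:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def subspace_features(U, V, P, images):
    """First-level pooling responses P (U^T x * V^T x), one row of `images` per case;
    invariant to the transformation class the filters were trained on."""
    products = (U.T @ images.T) * (V.T @ images.T)      # (factors, cases)
    return (P @ products).T                             # (cases, pooling units)

# Toy stand-ins for trained filters, pooling matrix, and data (not real results).
D, F = 256, 64
rng = np.random.default_rng(0)
U, V = rng.standard_normal((D, F)), rng.standard_normal((D, F))
P = np.zeros((F - 1, F))
idx = np.arange(F - 1)
P[idx, idx] = P[idx, idx + 1] = 1.0
X_train, y_train = rng.standard_normal((200, D)), rng.integers(0, 10, 200)
X_test = rng.standard_normal((20, D))

clf = LogisticRegression(max_iter=500).fit(subspace_features(U, V, P, X_train), y_train)
pred = clf.predict(subspace_features(U, V, P, X_test))
```

With actually trained filters, these pooling responses are what Figure 4 reports as "subspace features".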

6. Conclusions

We analyzed multi-view sparse coding models in terms of the joint eigenspaces of a set of transformations. Our analysis helps understand why Fourier features and circular Fourier features emerge when training transformation models on shifts and rotations, and why square-pooling models work well in action and motion recognition tasks. Our analysis furthermore shows how the aperture problem implies that we can learn invariant features as a by-product of learning about transformations.

The fact that squaring nonlinearities and multiplicative interactions can support the learning of relations suggests that these may help increase the role of statistical learning in vision in general. By learning about relations we may extend the applicability of sparse coding models beyond recognizing objects in static, single images, towards tasks that involve the fusion of multiple views, including inference about geometry.

Acknowledgments

This work was supported by the German Federal Ministry of Education and Research (BMBF) in the project 01GQ0841 (BFNT Frankfurt).

References

Adelson, E. H. and Bergen, J. R. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2):284–299, 1985.

Bergstra, James, Bengio, Yoshua, and Louradour, Jérôme. Suitability of V1 energy models for object classification. Neural Computation, pp. 1–17, 2010.

Bethge, M., Gerwinn, S., and Macke, J. H. Unsupervised learning of a steerable basis for invariant image representations. In Human Vision and Electronic Imaging XII. SPIE, February 2007.

Fleet, D., Wagner, H., and Heeger, D. Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Research, 36(12):1839–1857, 1996.

Gray, Robert M. Toeplitz and circulant matrices: a review. Commun. Inf. Theory, 2:155–239, August 2005.

Horn, Roger A. and Johnson, Charles R. Matrix Analysis. Cambridge University Press, 1990.

Hyvärinen, Aapo and Hoyer, Patrik. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12:1705–1720, July 2000.

Hyvärinen, Aapo, Hurri, J., and Hoyer, Patrik O. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer Verlag, 2009.

Kohonen, Teuvo. The Adaptive-Subspace SOM (ASSOM) and its use for the implementation of invariant feature detection. In Proc. ICANN'95, Int. Conf. on Artificial Neural Networks, 1995.

Larochelle, Hugo, Erhan, Dumitru, Courville, Aaron, Bergstra, James, and Bengio, Yoshua. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007.

Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In Proc. CVPR 2011. IEEE, 2011.

Memisevic, Roland. Gradient-based learning of higher-order image features. In Proceedings of the International Conference on Computer Vision, 2011.

Memisevic, Roland and Hinton, Geoffrey E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–92, 2010.

Qian, Ning. Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6:390–404, May 1994.

Ranzato, Marc'Aurelio and Hinton, Geoffrey E. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Computer Vision and Pattern Recognition, 2010.

Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.

Taylor, Graham W., Fergus, Rob, LeCun, Yann, and Bregler, Christoph. Convolutional learning of spatio-temporal features. In Proc. European Conference on Computer Vision, 2010.

van Hateren, L. and Ruderman, J. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proc. Biological Sciences, 265(1412):2315–2320, 1998.