On Multi-View Feature Learning Roland Memisevic

On multi-view feature learning Roland Memisevic [email protected] University of Frankfurt, Robert-Mayer-Str. 10, 60325 Frankfurt, Germany Abstract Adapting the filters based on synthetic transformations on images was shown to yield transformation- Sparse coding is a common approach to learn- specific features like phase-shifted Fourier components ing local features for object recognition. Re- when training on shifted image pairs, or “circular” cently, there has been an increasing inter- Fourier components when training on rotated image est in learning features from spatio-temporal, pairs (Memisevic & Hinton, 2010). Task-specific filter- binocular, or other multi-observation data, pairs emerge when training on natural transforma- where the goal is to encode the relationship tions, like facial expression changes (Susskind et al., between images rather than the content of 2011) or natural video (Taylor et al., 2010), and a single image. We provide an analysis of they were shown to yield state-of-the-art recognition multi-view feature learning, which shows that performance in these domains. Multi-view feature hidden variables encode transformations by learning models are also closely related to energy detecting rotation angles in the eigenspaces models of complex cells (Adelson & Bergen, 1985), shared among multiple image warps. Our which, in turn, have been successfully applied to analysis helps explain recent experimental video understanding, too (Le et al., 2011). They results showing that transformation-specific have also been used to learn within-image correla- features emerge when training complex cell tions by letting input and output images be the same models on videos. Our analysis also shows (Ranzato & Hinton, 2010; Bergstra et al., 2010). that transformation-invariant features can emerge as a by-product of learning represen- Common to all these methods is that they deploy prod- tations of transformations. ucts of filter responses to learn relations. In this paper, we analyze the role of these multiplicative interactions in learning relations. We also show that the hidden 1. Introduction variables in a multi-view feature learning model represent transformations by detecting rotation angles in Feature learning (AKA dictionary learning, or sparse eigenspaces that are shared among the transforma- coding) has gained considerable attention in computer tions. We focus on image transformations here, but vision in recent years, because it can yield image rep- our analysis is not restricted to images. resentations that are useful for recognition. However, although recognition is important in a variety of tasks, Our analysis has a variety of practical applications, a lot of problems in vision involve the encoding of the that we investigate in detail experimentally: (1) We relationship between observations not single observa- can train complex cell and energy models using con- tions. Examples include tracking, multi-view geome- ditional sparse coding models and vice versa, (2) It try, action understanding or dealing with invariances. is possible to extend multi-view feature learning to model sequences of three or more images instead of A variety of multi-view feature learning models have just two, (3) It is mandatory that hidden variables recently been suggested as a way to learn features that pool over multiple subspaces to work properly, (4) In- encode relations between images. The basic idea be- variant features can be learned by separating pooling hind these models is that hidden variables sum over within subspaces from pooling across subspaces. Our products of filter responses applied to two observa- analysis is related to previous investigations of energy tions x and y and thereby correlate the responses. models and of complex cells (for example, (Fleet et al., th 1996; Qian, 1994)), and it extends this line of work to Appearing in Proceedings of the 29 International Confer- more general transformations than local translation. ence on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s). On multi-view feature learning 2. Background on multi-view sparse 2.1. Energy models coding Multi-view feature learning is closely related to Feature learning1 amounts to encoding an image patch energy models and to models of complex cells x using a vector of latent variables z = σ(W Tx), (Adelson & Bergen, 1985; Fleet et al., 1996; Kohonen; where each column of W can be viewed as a linear Hyvärinen & Hoyer, 2000). The activity of a hidden feature (“filter”) that corresponds to one hidden vari- unit in an energy model is typically defined as the sum 2 able zk, and where σ is a non-linearity, such as the over squared filter responses, which may be written −1 sigmoid σ(a)= 1 + exp(a) . To adapt the param- z = W T BTx BTx (2) eters, W , based on a set of example patches xα one ∗ can use a variety of methods, including maximizing{ } where B contains image filters in its columns. W is the average sparsity of z, minimizing a form of recon- usually constrained such that each hidden variable, z , struction error, maximizing the likelihood of the obser- k computes the sum over only a subset of all products. vations via Gibbs sampling, and others (see, for exam- This way, hidden variables can be thought of as en- ple, (Hyvärinen et al., 2009) and references therein). coding the norm of a projection of x onto a subspace3. To obtain hidden variables z, that encode the rela- Energy models are also referred to as “subspace” or tionship between two images, x and y, one needs to “square-pooling” models. represent correlation patterns between two images in- For our analysis, it is important to note that, when stead. This is commonly achieved by computing the we apply an energy model to the concatenation of two sum over products of filter responses: images, x and y, we obtain a response that is closely T T T related to the response of a multi-view sparse coding z = W U x V y (1) ∗ model (cf., Eq. 1): Let bf denote a single column of where “ ” is element-wise multiplication, and matrix B. Furthermore, let u denote the part of the ∗ f the columns of U and V contain image fil- filter bf that gets applied to image x, and let vf de- ters that are learned along with W from data note the part that gets applied to image y, so that (Memisevic & Hinton, 2010). Again, one may apply T T T bf [x; y] = uf x + vf y. Hidden unit activities, zk, an element-wise non-linearity to z. The hidden units then take the form are “multi-view” variables that encode transforma- T T 2 T T tions not the content of single images, and they are zk = X Wfkuf x + vf y = 2 X Wfkuf xvf y commonly referred to as “mapping units”. f f T 2 T 2 Training the model parameters, (U, V, W ), can be + X Wfkuf x + X Wfkvf y achieved by minimizing the conditional reconstruction f f error of y keeping x fixed or vice versa (Memisevic, (3) 2011), or by conditional variants of maximum likelihood (Ranzato & Hinton, 2010; Memisevic & Hinton, Thus, up to the quadratic terms in Eq. 3, hidden unit 2010). Training the model on transforming random- activities are the same as in a multi-view feature learn- dot patterns yields transformation-specific features, ing model (Eq. 1). As we shall discuss in Section 3.5, such as phase-shifted Fourier features in the case of the quadratic terms do not significantly change the be- translation and circular harmonics in the case of ro- havior of the hidden units as compared to multi-view tation (Memisevic & Hinton, 2010; Memisevic, 2011). sparse coding models. An illustration of the energy Eq. 1 can also be derived by factorizing the pa- model is shown in Figure 1 (b). rameter tensor of a conditional sparse coding model (Memisevic & Hinton, 2010). An illustration of the 3. Eigenspace analysis model is shown in Figure 1 (a). We now show that hidden variables turn into sub- 1 We use the terms “feature learning”, “dictionary learn- space rotation detectors when the models are trained ing” and “sparse coding” synonymously in this paper. on transformed image pairs. To simplify the analy- Each term tends to come with a slightly different mean- sis, we shall restrict our attention to transformations, ing in the literature, but for the purpose of this work the T T differences are negligible. L, that are orthogonal, that is, L L = LL = I, 2 −1 In practice, it is common to add constant bias terms to where I is the identity matrix. In other words, L = the linear mapping. In the following, we shall refrain from LT. Linear transformations in “pixel-space” are also doing so to avoid cluttering up the derivations. We shall instead think of data and hidden variables as being in “ho- 3It is also common to apply non-linearities, such as a mogeneous notation” with an extra, constant 1-dimension. square-root, to the activity zk. On multi-view feature learning z z zk Im Im zk W W px iθ e px (·)2 py py U V Re Re y xi U V j xi yj x y x y (a) (b) (c) (d) Figure 1. (a) Modeling an image pair using a gated sparse coding model. (b) Modeling an image pair using an energy model applied to the concatenation of the images. (c) Projections px and py of two images onto the complex plane, that is spanned by two eigenfeatures. (d) Absorbing eigenvalues into input-features amounts to performing a projection and a rotation for image x. Hidden units can detect if this brings the projections into alignment (see text for details).

Load more