Auto-Encoding Transformations in Reparameterized Lie Groups for Unsupervised Learning
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Feng Lin,1 Haohang Xu,2 Houqiang Li,1,4 Hongkai Xiong,2 Guo-Jun Qi3,*
1 CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China
2 Department of Electronic Engineering, Shanghai Jiao Tong University
3 Laboratory for MAPLE, Futurewei Technologies
4 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
* Corresponding author.
† The work of Houqiang Li was supported by NSFC under Contracts 61836011 and 62021001.

Abstract

Unsupervised training of deep representations has recently demonstrated remarkable potential for mitigating the prohibitive expense of annotating labeled data. Among these methods is predicting transformations as a pretext task to self-train representations, which has shown great promise for unsupervised learning. However, existing approaches in this category learn representations by either treating a discrete set of transformations as separate classes, or using the Euclidean distance as the metric to minimize the errors between transformations. None of them has been dedicated to revealing the vital role of the geometry of transformation groups in learning representations. Indeed, an image must continuously transform along the curved manifold of a transformation group rather than through a straight line in the forbidden ambient Euclidean space. This suggests the use of geodesic distance to minimize the errors between the estimated and groundtruth transformations. Particularly, we focus on homographies, a general group of planar transformations containing the Euclidean, similarity and affine transformations as its special cases. To avoid explicitly computing the intractable Riemannian logarithm, we project homographies onto an alternative group of rotation transformations SR(3) with a tractable form of geodesic distance. Experiments demonstrate that the proposed approach to Auto-Encoding Transformations exhibits superior performance on a variety of recognition problems.

Figure 1: The deviation between two transformations should be measured along the curved manifold (Lie group) of transformations rather than through the forbidden Euclidean space of transformations.

Introduction

Representation learning plays a vital role in many machine learning problems. Among them is supervised pretraining on large-scale labeled datasets such as ImageNet and MS COCO for various recognition tasks ranging from image classification to object detection and segmentation. Previous results have shown that representations learned through supervised training on labeled data are critical to competitive performance before being finetuned on downstream tasks. However, it becomes prohibitively expensive to collect labeled data for the purpose of pretraining representation networks, which prevents the application of deep networks in many practical settings. This inspires many works to explore the possibility of pretraining representations in an unsupervised fashion by removing the reliance on labeled data.

Specifically, self-supervised representation learning from transformed images has shown great potential. The idea can be traced back to data augmentation in the supervised learning of convolutional neural networks in AlexNet, where labeled data are augmented with various transformed copies. The fine structure of transformations has also been explored in training deep network representations. For example, the celebrated convolutional neural networks are trained to transform equivariantly against translations in order to capture translated visual structures. This idea has been extended to train deep convolutional networks that are equivariant to a family of transformations beyond translations. However, these models still rely on supervised pretraining of representations on labeled data.
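The translation equivariance referred to above can be made concrete with a small illustration: convolving a shifted image gives the same result as shifting the convolution's output. The following is a minimal numpy sketch of that property, not code from the paper; the toy image, filter, and shift amounts are arbitrary.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain stride-1, no-padding ('valid') 2-D cross-correlation."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))   # toy "image"
k = rng.normal(size=(3, 3))     # toy filter
dy, dx = 2, 3                   # an integer translation

shift_then_conv = conv2d_valid(np.roll(x, (dy, dx), axis=(0, 1)), k)
conv_then_shift = np.roll(conv2d_valid(x, k), (dy, dx), axis=(0, 1))

# Away from the wrapped-around border rows/columns the two agree exactly:
assert np.allclose(shift_then_conv[dy:, dx:], conv_then_shift[dy:, dx:])
```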
Recently, unsupervised pretraining of representations has been proposed in a self-supervised fashion. The deep networks are trained by decoding pretext signals without labeled data (Kolesnikov, Zhai, and Beyer 2019). State-of-the-art performances have been achieved by representations learned with various transformations as the self-supervisory signals. Among them is the representative paradigm of Auto-Encoding Transformations (AET) (Qi 2019), which extends the conventional autoencoder from reconstructing data to reconstructing transformations. The AET seeks to explore the variations of data under various transformations, and to capture the intrinsic patterns of data variations by pretraining representations without the need to annotate data. It relies on a simple criterion: if a representation network can successfully model the change of visual structures before and after a transformation, we should be able to decode the transformation by comparing the representations of the original and transformed images. In other words, transformations are used as a proxy to reveal the patterns of visual structures that are supposed to equivary against the applied transformations. Both deterministic and probabilistic AET models have been presented in the literature, which respectively minimize the Mean-Square Error (MSE) between the matrix representations of parametric transformations (Zhang et al. 2019) and maximize the mutual information between the learned representation and the transformations (Qi 2019).

Despite the promising performances of both AET models, they are subject to a questionable assumption: they use the Euclidean distance as the metric to minimize the errors between the estimated and the groundtruth transformations. This assumption is clearly problematic. Indeed, transformations have a Lie group structure restricted to a curved manifold of matrix representations rather than filling the ambient Euclidean space. Moreover, a group of transformations can be decomposed into various components of atomic transformations. For example, a homography, one of the most general planar transformations, can be decomposed hierarchically into affine, similarity and Euclidean transformations. A direct summation of the Euclidean distances over these components could yield incomparable results, with some components outweighing the others due to their different scales.

From the perspective of transformation geometry in Lie groups, a more natural way to measure the difference between transformations is the shortest curved path (i.e., geodesic) connecting two transformations along the manifold, instead of a straight segment in the ambient Euclidean space. As illustrated in Figure 1, given a pair of original and transformed images, one cannot find a valid path of transformations in the Euclidean space that continuously transforms the original image into its transformed counterpart. On the contrary, such a path of transformations can only reside on the curved manifold of a transformation group. Therefore, one should use the geodesic along the manifold as the metric to measure the errors between transformations.

To this end, we will leverage the theory of Lie groups and use the geodesic distance to minimize the errors between the estimated and groundtruth transformations in AET. However, this poses two challenges. First, computing the geodesic distance generally requires the Riemannian logarithm of transformations, which can be intractable for the homography group with no closed form. Instead, we project the transformations onto another group with a tractable expression of geodesic distances. Particularly, we choose the scaled rotation group S+(1)×SO(3) to approximate PG(2), in which the geodesic distance has a tractable closed form that can be efficiently computed without taking an explicit group logarithm. This can greatly reduce the computing overhead in self-training the proposed model in the Lie group of homographies. Second, to train the AET, we need a reasonable parameterization of homographies that can be computed from the deep network. Although a 3 × 3 matrix representation in homogeneous coordinates is available to parameterize homographies directly, it mixes up various components of transformations such as rotations, translations, and scales that have incomparable ranges of values. The direct matrix parameterization is thus not optimal and may lead to unstable numerical results. Thus, inspired by deep homography estimation (DeTone, Malisiewicz, and Rabinovich 2016; Erlik Nowruzi, Laganiere, and Japkowicz 2017; Nguyen et al. 2018; Le et al. 2020), we adopt a reparameterization approach by making the network output an implicit expression of homographies through the four-point correspondences between the original and transformed images. The matrix representation is then computed from the output correspondences through an SVD operation. Such a reparameterization of homographies is differentiable, thereby enabling us to minimize the geodesic errors between transformations and train the deep network in an end-to-end fashion.
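The first of these challenges turns on having a closed-form geodesic. On SO(3), the geodesic distance between two rotations is their relative rotation angle, which can be read off a trace without any matrix logarithm, and scale differences can be compared on a log scale. The sketch below combines the two in a simple product metric; this is an illustrative assumption about what such a closed form can look like, not the paper's exact SR(3) metric, and it also contrasts the geodesic with the ambient Euclidean (Frobenius) distance criticized above.

```python
import numpy as np

def so3_geodesic(R1, R2):
    """Geodesic distance on SO(3): the angle of the relative rotation R1^T R2,
    obtained in closed form from its trace (no matrix logarithm needed)."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

def scaled_rotation_geodesic(s1, R1, s2, R2):
    """An assumed product metric on S+(1) x SO(3): scales compared on a log
    scale, rotations along the SO(3) manifold."""
    return np.sqrt(np.log(s1 / s2) ** 2 + so3_geodesic(R1, R2) ** 2)

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R_a, R_b = rot_z(0.0), rot_z(np.pi / 2)
print(so3_geodesic(R_a, R_b))                 # ~1.571: the true 90-degree deviation
print(scaled_rotation_geodesic(1.0, R_a, 1.2, R_b))

# The ambient Euclidean (Frobenius) distance measures a chord, not an arc:
print(np.linalg.norm(R_a - R_b))              # 2.0

# And the straight Euclidean segment between two rotations leaves the manifold:
M = 0.5 * (R_a + R_b)                         # midpoint of the straight-line path
print(np.linalg.det(M))                       # 0.5, so M is not a valid rotation
```

For the second challenge, the four-point reparameterization can be made concrete with the classical direct linear transform (DLT): given four corner correspondences, the 3 × 3 homography is recovered as the null vector of an 8 × 9 linear system via an SVD. The sketch below is a generic numpy illustration of that step, not the authors' implementation; the corner offsets are made-up values standing in for what the network would predict.

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct linear transform: recover the 3x3 homography H such that
    dst ~ H @ src in homogeneous coordinates, from four correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows, dtype=float)
    # The homography is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]          # fix the homogeneous scale ambiguity

# Hypothetical corners of a unit square and their predicted displacements.
src = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
dst = src + np.array([[0.02, -0.01], [0.05, 0.03], [-0.04, 0.02], [0.01, 0.04]])
H = homography_from_points(src, dst)

# Sanity check: H maps src onto dst (up to the homogeneous scale).
src_h = np.hstack([src, np.ones((4, 1))])
proj = (H @ src_h.T).T
proj = proj[:, :2] / proj[:, 2:3]
assert np.allclose(proj, dst, atol=1e-6)
```

Because the SVD admits gradients away from degenerate configurations, the mapping from predicted correspondences to the homography matrix is differentiable, which is what permits the end-to-end training described above.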
In summary, our major contributions are as follows:

• We leverage the theory of Lie groups to explore the geometry of transformations, using geodesic distance to minimize the errors between the estimated and groundtruth transformations in AET.

• Due to the intractable Riemannian logarithm, we propose an approximation of PG(2) in a closed form by projecting the transformations onto the scaled rotation group.

• Experiments demonstrate the proposed method exhibits superior performances on various recognition problems.

Background & Methodology

AET with Euclidean Distances

First, let us review the Auto-Encoding Transformations