Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction

Richard Zhang, Phillip Isola, Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory, University of California, Berkeley
{rich.zhang,isola,efros}@eecs.berkeley.edu

Abstract

We propose split-brain autoencoders, a straightforward modification of the traditional autoencoder architecture, for unsupervised representation learning. The method adds a split to the network, resulting in two disjoint sub-networks. Each sub-network is trained to perform a difficult task: predicting one subset of the data channels from another. Together, the sub-networks extract features from the entire input signal. By forcing the network to solve cross-channel prediction tasks, we induce a representation within the network which transfers well to other, unseen tasks. This method achieves state-of-the-art performance on several large-scale transfer learning benchmarks.

Figure 1: Traditional vs. Split-Brain Autoencoder architectures. (top) Autoencoders learn a feature representation F by learning to reconstruct input data X. (bottom) The proposed split-brain autoencoder is composed of two disjoint sub-networks F1, F2, each trained to predict one data subset from another, changing the problem from reconstruction to prediction. The split-brain representation F is formed by concatenating the two sub-networks, and achieves strong transfer learning performance. The model is publicly available at https://richzhang.github.io/splitbrainauto.

1. Introduction

A goal of unsupervised learning is to model raw data without the use of labels, in a manner which produces a useful representation. By “useful” we mean a representation that is easily adaptable to other tasks, unknown at training time. Unsupervised deep methods typically induce representations by training a network to solve an auxiliary or “pretext” task, such as the image reconstruction objective in a traditional autoencoder model, as shown in Figure 1 (top). We instead force the network to solve complementary prediction tasks by adding a split in the architecture, shown in Figure 1 (bottom), dramatically improving transfer performance.

Despite their popularity, autoencoders have not been shown to produce strong representations for transfer tasks in practice [43, 34]. Why is this? One reason might be the mechanism for forcing model abstraction. To prevent a trivial identity mapping from being learned, a bottleneck is typically built into the autoencoder representation. However, an inherent tension is at play: the smaller the bottleneck, the greater the forced abstraction, but the smaller the information content that can be expressed.

Instead of forcing abstraction through compression, via a bottleneck in the network architecture, recent work has explored withholding parts of the input during training [43, 34, 47]. For example, Vincent et al. [43] propose denoising autoencoders, trained to remove iid noise added to the input. Pathak et al. [34] propose context encoders, which learn features by training to inpaint large, random contiguous blocks of pixels. Rather than dropping data in the spatial direction, several works have dropped data in the channel direction, e.g., predicting color channels from grayscale (the colorization task) [26, 47].
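As a concrete illustration of dropping data in the channel direction, the following minimal Python sketch converts an RGB image to Lab colorspace and splits it into the grayscale L channel and the color ab channels. The scikit-image dependency and the helper name split_lab are illustrative assumptions, not part of the original method.

    # Minimal sketch: splitting an image in the channel direction for a
    # colorization-style cross-channel prediction task (illustrative only).
    import numpy as np
    from skimage import color  # assumed dependency; the paper prescribes no library

    def split_lab(rgb):
        """Split an RGB image (H x W x 3, floats in [0, 1]) into the
        lightness channel L and the color channels ab of Lab space."""
        lab = color.rgb2lab(rgb)
        L = lab[:, :, :1]    # network input for colorization
        ab = lab[:, :, 1:]   # prediction target (roles reverse for the opposite task)
        return L, ab

    rgb = np.random.rand(64, 64, 3)  # stand-in for a training image
    L, ab = split_lab(rgb)
    print(L.shape, ab.shape)         # (64, 64, 1) (64, 64, 2)

These two halves are exactly the kind of input/target pair a cross-channel encoder, and later each half of the split-brain model, is trained on.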
Context encoders, while an improvement over autoencoders, demonstrate lower performance than competitors on large-scale semantic representation learning benchmarks [47]. This may be due to several reasons. First, image synthesis tasks are notoriously difficult to evaluate [35], and the loss function used in [34] may not properly capture inpainting quality. Second, the model is trained on images with missing chunks, but applied, at test time, to full images. This causes a “domain gap” between training and deployment. Third, it could simply be that the inpainting task in [34] can be adequately solved without high-level reasoning, by mostly copying low- and mid-level structure from the surround.

On the other hand, colorization turns out to be a surprisingly effective pretext task for inducing strong feature representations [47, 27]. Though colorization, like inpainting, is a synthesis task, the spatial correspondence between the input and output pairs may enable basic off-the-shelf loss functions to be effective. In addition, the systematic, rather than stochastic, nature of the input corruption removes the domain gap between pre-training and testing. Finally, while inpainting may admit reasoning mainly about textural structure, predicting accurate color, e.g., knowing to paint a schoolbus yellow, may more strictly require object-level reasoning and therefore induce stronger semantic representations. Colorization is an example of what we refer to as a cross-channel encoding objective: a task which directly predicts one subset of data channels from another.

In this work, we further explore the space of cross-channel encoders by systematically evaluating various channel translation problems and training objectives. Cross-channel encoders, however, face an inherent handicap: different channels of the input data are not treated equally, as part of the data is used for feature extraction and another part as the prediction target. In the case of colorization, the network can only extract features from the grayscale image and is blind to color, leaving the color information unused. A qualitative comparison of the different methods, along with their inherent strengths and weaknesses, is summarized in Table 1.

    Method                           auxiliary task type   domain gap   input handicap
    Autoencoder [19]                 reconstruction        no           no
    Denoising autoencoder [43]       reconstruction        suffers      no
    Context Encoder [34]             prediction            suffers      no
    Cross-Channel Encoder [47, 27]   prediction            no           suffers
    Split-Brain Autoencoder          prediction            no           no

Table 1: Qualitative Comparison. We summarize various qualitative aspects inherent in several representation learning techniques. Auxiliary task type: whether the pretext task is predicated on reconstruction or prediction. Domain gap: whether there is a gap between the input data seen during unsupervised pre-training and at test time. Input handicap: whether input data is systematically dropped out during test time.

Might there be a way to take advantage of the underlying principle of cross-channel encoders, while being able to extract features from the entire input signal? We propose an architectural modification to the autoencoder paradigm: adding a single split in the network, resulting in two disjoint, concatenated sub-networks. Each sub-network is trained as a cross-channel encoder, predicting one subset of the input channels from the other. A variety of auxiliary cross-channel prediction tasks may be used, such as colorization and depth prediction. For example, on RGB images, one sub-network can solve the problem of colorization (predicting the a and b channels from the L channel in Lab colorspace), and the other can perform the opposite task (synthesizing L from the a and b channels). In the RGB-D domain, one sub-network may predict depth from images, while the other predicts images from depth. The architectural change induces the same forced abstraction as observed in cross-channel encoders, but is able to extract features from the full input tensor, leaving nothing on the table.
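The following is a minimal PyTorch sketch of such a split, under illustrative assumptions: the small convolutional stacks and layer widths stand in for the AlexNet-style network used in the paper, and a simple L1 regression objective stands in for the paper's losses (for colorization, the network can instead output a per-pixel distribution F(X1) ∈ ∆^{H×W×Q} over Q quantized bins and be trained with a standard cross-entropy loss between the predicted and ground-truth distributions).

    # Minimal sketch of a split-brain autoencoder on Lab inputs (PyTorch).
    # Layer sizes are illustrative; the paper splits an AlexNet-style network.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def sub_network(in_ch, out_ch):
        # One disjoint sub-network: a small fully convolutional encoder-predictor.
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 1),
        )

    class SplitBrainAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.f1 = sub_network(in_ch=1, out_ch=2)  # L -> ab (colorization)
            self.f2 = sub_network(in_ch=2, out_ch=1)  # ab -> L (opposite task)

        def forward(self, L, ab):
            # Each sub-network predicts the subset of channels it does not see.
            return self.f1(L), self.f2(ab)

        def features(self, L, ab):
            # Transfer-learning representation: concatenate the penultimate
            # activations of both sub-networks, covering the full input tensor.
            h1 = self.f1[:-1](L)
            h2 = self.f2[:-1](ab)
            return torch.cat([h1, h2], dim=1)

    model = SplitBrainAutoencoder()
    L, ab = torch.randn(4, 1, 64, 64), torch.randn(4, 2, 64, 64)
    ab_pred, L_pred = model(L, ab)
    loss = F.l1_loss(ab_pred, ab) + F.l1_loss(L_pred, L)  # simple regression stand-in

Because neither sub-network ever sees the channels it is asked to predict, a trivial identity mapping cannot be learned even without a representation bottleneck, while the concatenated features still cover every input channel.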
Our contributions are as follows:
  • We propose the split-brain autoencoder, which is composed of concatenated cross-channel encoders, trained using raw data as its own supervisory signal.
  • We demonstrate state-of-the-art performance on several semantic representation learning benchmarks in the RGB and RGB-D domains.
  • To gain a better understanding, we perform extensive ablation studies by (i) investigating cross-channel prediction problems and loss functions and (ii) researching alternative aggregation methods for combining cross-channel encoders.

2. Related Work

Many unsupervised learning methods have focused on modeling raw data using a reconstruction objective. Autoencoders [19] train a network to reconstruct an input image, using a representation bottleneck to force abstraction. Denoising autoencoders [43] train a network to undo a random iid corruption. Techniques for modeling the probability distribution of images in deep frameworks have also been explored. For example, variational autoencoders (VAEs) [23] employ a variational Bayesian approach to modeling the data distribution. Other probabilistic models include restricted Boltzmann machines (RBMs) [40], deep Boltzmann machines (DBMs) [37], generative adversarial networks (GANs) [15], autoregressive models (Pixel-RNN [42] and Pixel-CNN [31]), bidirectional GANs (BiGANs) [8] and Adversarially Learned Inference (ALI) [9], and Real NVP [6]. Many of these methods [19, 43, 8, 9, 37] have been evaluated for representation learning.

Another form of unsupervised learning, sometimes referred to as “self-supervised” learning [4], has recently grown in popularity. Rather than predicting labels annotated by humans, these methods predict pseudo-labels computed from the raw data itself. For example, image colorization [47, 26] has been shown to be an effective pretext task. Other methods generate pseudo-labels from egomotion [1, 22], video [45, 29], inpainting [34], co-occurrence
