Learning Transferable Visual Models from Natural Language Supervision
Total Page:16
File Type:pdf, Size:1020Kb
Learning Transferable Visual Models From Natural Language Supervision Alec Radford * 1 Jong Wook Kim * 1 Chris Hallacy 1 Aditya Ramesh 1 Gabriel Goh 1 Sandhini Agarwal 1 Girish Sastry 1 Amanda Askell 1 Pamela Mishkin 1 Jack Clark 1 Gretchen Krueger 1 Ilya Sutskever 1 Abstract interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero- SOTA computer vision systems are trained to pre- shot transfer to downstream datasets. Flagship systems like dict a fixed set of predetermined object categories. GPT-3 (Brown et al., 2020) are now competitive across This restricted form of supervision limits their many tasks with bespoke models while requiring little to no generality and usability since additional labeled dataset specific training data. data is needed to specify any other visual con- cept. Learning directly from raw text about im- These results suggest that the aggregate supervision acces- ages is a promising alternative which leverages a sible to modern pre-training methods within web-scale col- much broader source of supervision. We demon- lections of text surpasses that of high-quality crowd-labeled strate that the simple pre-training task of predict- NLP datasets. However, in other fields such as computer ing which caption goes with which image is an vision it is still standard practice to pre-train models on efficient and scalable way to learn SOTA image crowd-labeled datasets such as ImageNet (Deng et al., 2009). representations from scratch on a dataset of 400 Could scalable pre-training methods which learn directly million (image, text) pairs collected from the inter- from web text result in a similar breakthrough in computer net. After pre-training, natural language is used to vision? Prior work is encouraging. reference learned visual concepts (or describe new Joulin et al.(2016) demonstrated that CNNs trained to pre- ones) enabling zero-shot transfer of the model to dict words in image captions can learn representations com- downstream tasks. We study performance on over petitive with ImageNet training. Li et al.(2017) then ex- 30 different computer vision datasets, spanning tended this approach to predicting phrase n-grams in ad- tasks such as OCR, action recognition in videos, dition to individual words and demonstrated the ability of geo-localization, and many types of fine-grained their system to zero-shot transfer to other image classifi- object classification. The model transfers non- cation datasets. Adopting more recent architectures and trivially to most tasks and is often competitive pre-training approaches, VirTex (Desai & Johnson, 2020), with a fully supervised baseline without the need ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT for any dataset specific training. For instance, we (Zhang et al., 2020) have recently demonstrated the po- match the accuracy of the original ResNet50 on tential of transformer-based language modeling, masked ImageNet zero-shot without needing to use any of language modeling, and contrastive objectives to learn im- the 1.28 million training examples it was trained age representations from text. on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP. However, the aforementioned models still under-perform current SOTA computer vision models such as Big Trans- fer (Kolesnikov et al., 2019) and the weakly supervised 1. Introduction and Motivating Work ResNeXt (Mahajan et al., 2018). A crucial difference is scale. While Mahajan et al.(2018) and Kolesnikov et al. Pre-training methods which learn directly from raw text (2019) trained for accelerator years on millions to billions have revolutionized NLP over the last few years (Dai & Le, of images, VirTex, ICMLM, and ConVIRT trained for ac- 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford celerator days on one to two hundred thousand images. We et al., 2018; Devlin et al., 2018; Raffel et al., 2019). The close this gap and study the behaviors of image models development of “text-to-text” as a standardized input-output trained from natural language supervision at large scale. We demonstrate that a simplified version of ConVIRT trained *Equal contribution 1OpenAI, San Francisco, CA 94110, USA. Correspondence to: <falec, [email protected]>. from scratch, which we call CLIP, for Contrastive Language- Image Pre-training, is an efficient and scalable method of Proceedings of the 38 th International Conference on Machine learning from natural language supervision. We find that Learning, PMLR 139, 2021. Copyright 2021 by the author(s). Learning Transferable Visual Models From Natural Language Supervision (1) Contrastive pre-training (2) Create dataset classifier from label text plane car Pepper the Pepper the Text aussie pupPepper thePepper the A photo of Text aussie pup Encoder dog aussie pupaussie pup … a {object}. Encoder ⋮ ⋮ … T1 T2 T3 … TN bird I1 I1·T1 I1·T2 I1·T3 … I1·TN (3) Use for zero-shot prediction I2 I2·T1 I2·T2 I2·T3 … I2·TN T1 T2 T3 … TN Image I3 I3·T1 I3·T2 I3·T3 … I3·TN Image Encoder I I ·T I ·T I ·T … I ·T Encoder 1 1 1 1 2 1 3 1 N ⋮ ⋮ ⋮ ⋮ ⋮ ⋱ ⋮ A photo of IN IN·T1 IN·T2 IN·T3 … IN·TN a dog. Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes. CLIP learns to perform a wide set of tasks during pre- large quantities of data of this form available publicly on training including OCR, geo-localization, action recogni- the internet. To test this we constructed a new dataset of tion, and outperforms the best publicly available ImageNet 400 million (image, text) pairs collected form a variety of model while being more computationally efficient. We also publicly available sources on the Internet. To attempt to find that zero-shot CLIP models are much more robust than cover as broad a set of visual concepts as possible, we equivalent accuracy supervised ImageNet models. search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries. 2. Approach We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting At the core of our work is the idea of learning perception dataset has a similar total word count as the WebText dataset from the supervision contained in natural language paired used to train GPT-2. We refer to this dataset as WIT for with images. In the following subsections we detail our WebImageText. 1 specific approach. 2.2. Selecting an Efficient Pre-Training Method 2.1. Creating a Sufficiently Large Dataset Our initial approach, similar to VirTex, jointly trained an Existing work has mainly used three datasets, MS-COCO image CNN and text transformer from scratch to predict (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and the caption of an image. However, we encountered difficul- YFCC100M (Thomee et al., 2016). While MS-COCO and ties efficiently scaling this method. In Figure2 we show Visual Genome are high quality crowd-labeled datasets, they that a 63 million parameter transformer language model, are small by modern standards with approximately 100,000 which already uses twice the compute of its ResNet50 im- training photos each. By comparison, other computer vision age encoder, learns to recognize ImageNet classes three systems are trained on up to 3.5 billion Instagram photos times slower than an approach similar to Joulin et al.(2016) (Mahajan et al., 2018). YFCC100M, at 100 million photos, that predicts a bag-of-words encoding of the same text. is a possible alternative, but the metadata for each image is Recent work in contrastive representation learning has found sparse and of varying quality. Many images use automati- that contrastive objectives can outperform the equivalent cally generated filenames like 20160716 113957.JPG predictive objective (Tian et al., 2019). Noting this finding, as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural 1The base query list is all words occurring at least 100 times in language titles and/or descriptions in English, the dataset the English version of Wikipedia. This is augmented with bi-grams shrunk by a factor of 6 to only 15 million photos. This is with high pointwise mutual information for the pair (Church & approximately the same size as ImageNet. Hanks, 1990) as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet (Miller, 1995) synsets A major motivation for natural language supervision is the not already in the query list are added. Learning Transferable Visual Models From Natural Language Supervision 40 # image_encoder - ResNet or Vision Transformer # text_encoder - CBOW or Text Transformer 35 # I[n, h, w, c] - minibatch of aligned images # T[n, l] - minibatch of aligned texts # W_i[d_i, d_e] - learned proj of image to embed 30 # W_t[d_t, d_e] - learned proj of text to embed # t - learned temperature parameter 25 # extract feature representations of each modality 20 I_f = image_encoder(I) #[n, d_i] 4X efficiency 3X efficiency T_f = text_encoder(T) #[n, d_t] 15 # joint multimodal embedding [n, d_e] Zero-Shot ImageNet Accuracy 10 I_e = l2_normalize(np.dot(I_f, W_i), axis=1) Bag of Words Contrastive (CLIP) T_e = l2_normalize(np.dot(T_f, W_t), axis=1) 5 Bag of Words Prediction Transformer Language Model # scaled pairwise cosine similarities [n, n] 0 2M 33M 67M 134M 268M 400M logits = np.dot(I_e, T_e.T) * np.exp(t) # of images processed # symmetric loss function Figure 2.