Learning Transferable Visual Models from Natural Language Supervision

16 pages · PDF · 1020 KB

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford*, Jong Wook Kim*, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

*Equal contribution. OpenAI, San Francisco, CA 94110, USA. Correspondence to: <{alec, [email protected]}>. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

SOTA computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study performance on over 30 different computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

1. Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). The development of "text-to-text" as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive with bespoke models across many tasks while requiring little to no dataset-specific training data.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.

Joulin et al. (2016) demonstrated that CNNs trained to predict words in image captions can learn representations competitive with ImageNet training. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

However, the aforementioned models still under-perform current SOTA computer vision models such as Big Transfer (Kolesnikov et al., 2019) and the weakly supervised ResNeXt (Mahajan et al., 2018). A crucial difference is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. We close this gap and study the behaviors of image models trained from natural language supervision at large scale. We demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient and scalable method of learning from natural language supervision. We find that CLIP learns to perform a wide set of tasks during pre-training, including OCR, geo-localization, and action recognition, and that it outperforms the best publicly available ImageNet model while being more computationally efficient. We also find that zero-shot CLIP models are much more robust than equivalent-accuracy supervised ImageNet models.

Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes. (Panels: (1) contrastive pre-training over an n x n matrix of image-text similarity scores; (2) creating a dataset classifier from label text such as "A photo of a {object}."; (3) using it for zero-shot prediction, e.g. "A photo of a dog.")

2. Approach

At the core of our work is the idea of learning perception from the supervision contained in natural language paired with images. In the following subsections we detail our specific approach.

2.1. Creating a Sufficiently Large Dataset

Existing work has mainly used three datasets: MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high-quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as "titles" or contain "descriptions" of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. To test this we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries.(1) We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT, for WebImageText.

(1) The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information for the pair (Church & Hanks, 1990) as well as the names of all Wikipedia articles above a certain search volume. Finally, all WordNet (Miller, 1995) synsets not already in the query list are added.

2.2. Selecting an Efficient Pre-Training Method

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet50 image encoder, learns to recognize ImageNet classes three times slower than an approach similar to Joulin et al. (2016) that predicts a bag-of-words encoding of the same text.

Recent work in contrastive representation learning has found that contrastive objectives can outperform the equivalent predictive objective (Tian et al., 2019). Noting this finding, …

Figure 2. Zero-shot ImageNet accuracy versus number of images processed (2M to 400M) for three approaches. Swapping the language-model objective for bag-of-words prediction gives a 3x efficiency gain, and swapping prediction for the bag-of-words contrastive (CLIP) objective a further 4x.

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP:

    # image_encoder - ResNet or Vision Transformer
    # text_encoder  - CBOW or Text Transformer
    # I[n, h, w, c] - minibatch of aligned images
    # T[n, l]       - minibatch of aligned texts
    # W_i[d_i, d_e] - learned proj of image to embed
    # W_t[d_t, d_e] - learned proj of text to embed
    # t             - learned temperature parameter

    # extract feature representations of each modality
    I_f = image_encoder(I)  #[n, d_i]
    T_f = text_encoder(T)   #[n, d_t]

    # joint multimodal embedding [n, d_e]
    I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
    T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

    # scaled pairwise cosine similarities [n, n]
    logits = np.dot(I_e, T_e.T) * np.exp(t)

    # symmetric loss function
    labels = np.arange(n)
    loss_i = cross_entropy_loss(logits, labels, axis=0)
    loss_t = cross_entropy_loss(logits, labels, axis=1)
    loss   = (loss_i + loss_t)/2
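To make the pseudocode concrete, here is a self-contained NumPy sketch of the same symmetric contrastive loss, with random features standing in for the image and text encoders; the batch size, dimensions, and temperature initialization are illustrative choices, not the paper's training settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 512, 256, 64   # illustrative sizes, not the paper's

# random stand-ins for image_encoder(I) and text_encoder(T) outputs
I_f = rng.normal(size=(n, d_i))
T_f = rng.normal(size=(n, d_t))
W_i = 0.02 * rng.normal(size=(d_i, d_e))   # learned image projection
W_t = 0.02 * rng.normal(size=(d_t, d_e))   # learned text projection
t = np.log(1 / 0.07)                        # temperature (an assumed init)

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy_loss(logits, labels, axis):
    # softmax cross-entropy along `axis`: axis=1 treats rows as distributions
    # (image -> text), axis=0 treats columns as distributions (text -> image)
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    m = len(labels)
    picked = (log_probs[np.arange(m), labels] if axis == 1
              else log_probs[labels, np.arange(m)])
    return -picked.mean()

# joint multimodal embedding and scaled pairwise cosine similarities
I_e = l2_normalize(I_f @ W_i)
T_e = l2_normalize(T_f @ W_t)
logits = (I_e @ T_e.T) * np.exp(t)

# symmetric loss: the matched (image, text) pairs lie on the diagonal
labels = np.arange(n)
loss = (cross_entropy_loss(logits, labels, axis=1) +
        cross_entropy_loss(logits, labels, axis=0)) / 2
print(f"symmetric contrastive loss: {loss:.4f}")
```

Averaging the image-to-text and text-to-image cross-entropies is what makes the objective symmetric: each image must pick out its caption among the batch's texts, and each text its image among the batch's images.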
Recommended publications
  • Backpropagation and Deep Learning in the Brain
Backpropagation and Deep Learning in the Brain. Timothy Lillicrap (DeepMind, UCL), Simons Institute -- Computational Theories of the Brain, 2018. With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton.

The talk covers the "credit assignment" problem and its solution in artificial networks: backprop. Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. It then asks why backprop isn't considered "biologically plausible", what neuroscience evidence exists for backprop in the brain, and surveys a spectrum of credit assignment algorithms.

How to convince a neuroscientist that the cortex is learning via [something like] backprop: to convince a machine learning researcher, an appeal to variance in gradient estimates might be enough, but this is rarely enough to convince a neuroscientist, so what lines of argument help? By "something like backprop" is meant that learning is achieved across multiple layers by sending information from neurons closer to the output back to "earlier" layers to help compute their synaptic updates (see the sketch below). Among the arguments: 1. Feedback connections in cortex are ubiquitous and modify the …
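A toy instance of that working definition may help: a two-layer network trained by plain backprop in NumPy, where the output error is sent back through the second layer's weights to compute the first layer's updates. Data, sizes, and the learning rate are made up.

```python
# Minimal credit assignment by backprop: the output error travels backward
# through W2 to tell the "earlier" layer W1 how to change.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                    # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0    # toy binary targets

W1 = 0.5 * rng.normal(size=(4, 8))
W2 = 0.5 * rng.normal(size=(8, 1))
lr = 0.5

for step in range(200):
    # forward pass
    h = np.tanh(X @ W1)                   # hidden layer
    p = 1 / (1 + np.exp(-(h @ W2)))       # output probability
    # backward pass: assign credit to W1 via feedback through W2
    d_out = (p - y) / len(X)              # dLoss/dlogit (sigmoid + cross-entropy)
    dW2 = h.T @ d_out
    d_h = (d_out @ W2.T) * (1 - h**2)     # error sent back, times tanh'
    dW1 = X.T @ d_h
    W1 -= lr * dW1
    W2 -= lr * dW2

h = np.tanh(X @ W1)
p = 1 / (1 + np.exp(-(h @ W2)))
print("train accuracy:", float(((p > 0.5) == y).mean()))
```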
  • Automated Elastic Pipelining for Distributed Training of Transformers
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers. Chaoyang He, Shen Li, Mahdi Soltanolkotabi, Salman Avestimehr.

Abstract: The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that can identify and freeze some layers gradually during training, and an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, …

From the introduction: … state-of-the-art convolutional networks ResNet-152 (He et al., 2016) and EfficientNet (Tan & Le, 2019). To tackle the growth in model sizes, researchers have proposed various distributed training techniques, including parameter servers (Li et al., 2014; Jiang et al., 2020; Kim et al., 2019), pipeline parallel (Huang et al., 2019; Park et al., 2020; Narayanan et al., 2019), intra-layer parallel (Lepikhin et al., 2020; Shazeer et al., 2018; Shoeybi et al., 2019), and zero redundancy data parallel (Rajbhandari et al., 2019).

Figure 1. Interpretable Freeze Training: DNNs converge bottom up (results on CIFAR10 using ResNet). (The figure plots per-layer similarity scores at four checkpoints: T0 at 0%, T1 at 35%, T2 at 75%, and T3 at 100% trained.)
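As a rough illustration of the freeze-training mechanic (not PipeTransformer's adaptive algorithm or its pipeline machinery), the PyTorch sketch below progressively freezes the bottom layers on a made-up linear schedule, so they drop out of gradient computation and could in principle be excluded from a pipeline; the stack of `nn.Linear` layers stands in for Transformer blocks.

```python
# Sketch of progressive layer freezing. The schedule here is an illustrative
# linear one; the paper's freeze decisions are adaptive and on the fly.
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(12)])  # stand-in layers

def freeze_bottom_layers(model, num_frozen):
    # layers below `num_frozen` stop requiring gradients entirely
    for i, layer in enumerate(model):
        for p in layer.parameters():
            p.requires_grad = i >= num_frozen

total_epochs, num_layers = 10, len(model)
for epoch in range(total_epochs):
    num_frozen = int(num_layers * epoch / total_epochs * 0.5)  # made-up schedule
    freeze_bottom_layers(model, num_frozen)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"epoch {epoch}: {num_frozen} layers frozen, {trainable} trainable params")
    # ... a training step over the remaining active layers would go here ...
```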
  • Convolutional Neural Network Transfer Learning for Robust Face
Convolutional Neural Network Transfer Learning for Robust Face Recognition in NAO Humanoid Robot. Daniel Bussey (Embry-Riddle Aeronautical University), Alex Glandon, Lasitha Vidyaratne, Mahbubul Alam, Khan Iftekharuddin (Old Dominion University).

Background: Artificial neural networks are computing systems whose model architecture is inspired by biological neural networks [1]. They are capable of improving task performance without task-specific programming and show excellent performance at classification-based tasks such as face recognition [2], text recognition [3], and natural language processing [4]. Deep learning is a subfield of machine learning that focuses on algorithms inspired by the function and structure of the brain, called artificial neural networks; it allows computers to learn through a process called training.

Research approach: Retrain AlexNet on the CASIA-WebFace dataset to configure the network for face recognition tasks. Acquire an input image using NAO's camera or a high-resolution camera to run through the convolutional neural network. Extract the features of the input image using the neural networks AlexNet and VGG-Face. Compare the features of the input image to the features of each image in the people database. Determine whether a match is detected or if the input image is a photo of a … (A sketch of this pipeline follows below.)

Results: VGG-Face shows better results in every performance benchmark measured compared to AlexNet, although AlexNet is able to extract features from an image 800% faster than VGG-Face. Resolution of the input image does not have a statistically significant impact on the performance of VGG-Face or AlexNet. AlexNet's performance decreases with distance from the camera, where VGG-Face shows no performance loss. Both frameworks show excellent performance at eliminating false positives.
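A hedged sketch of that extract-and-compare pipeline, using torchvision's ImageNet AlexNet as a stand-in (the poster's actual models were AlexNet retrained on CASIA-WebFace, and VGG-Face); the gallery, probe images, and acceptance threshold are all made up.

```python
# Use a pretrained CNN as a feature extractor, then match a probe face against
# an enrolled gallery by cosine similarity.
import torch
import torch.nn.functional as F
import torchvision.models as models

alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()

def extract_features(img_batch):
    # img_batch: [N, 3, 224, 224] tensors, ImageNet-normalized
    with torch.no_grad():
        x = alexnet.features(img_batch)
        x = alexnet.avgpool(x).flatten(1)   # [N, 9216] descriptor per image
    return F.normalize(x, dim=1)            # unit length for cosine matching

gallery = extract_features(torch.randn(10, 3, 224, 224))  # stand-in database
probe = extract_features(torch.randn(1, 3, 224, 224))     # stand-in camera frame

scores = probe @ gallery.T          # cosine similarity to each enrolled person
best = int(scores.argmax())
is_match = scores[0, best].item() > 0.8   # made-up acceptance threshold
print(f"closest identity {best}, score {scores[0, best]:.3f}, match={is_match}")
```

Thresholding the best score, rather than always accepting the nearest neighbor, is what lets the system reject faces that are not in the database.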
  • Tiny ImageNet Challenge
Tiny ImageNet Challenge. Jiayu Wu, Qixiang Zhang, Guoxi Xu (Stanford University); [email protected], [email protected], [email protected].

Abstract: We present image classification systems using Residual Network (ResNet), Inception-ResNet, and Very Deep Convolutional Networks (VGGNet) architectures. We apply data augmentation, dropout, and other regularization techniques to prevent over-fitting of our models. We also present error analysis based on per-class accuracy, and explore the impact of initialization methods, weight decay, and network depth on system performance. Visualizations of intermediate outputs and convolutional filters are shown. In addition, we complete an extra object localization system based upon a combination of a Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) units. Our best classification model, a fine-tuned Inception-ResNet, achieves a top-1 test error rate of 43.10% on the Tiny ImageNet dataset, and our best localization model can localize more than one object with high accuracy, given training images with one object labeled. In the Experiments and Evaluations section we present thorough analysis of the results, including per-class error analysis (sketched below), intermediate output distributions, the impact of initialization, etc.

2. Related Work: Deep convolutional neural networks have enabled the field of image recognition to advance at an unprecedented pace over the past decade. [10] introduces AlexNet, which has 60 million parameters and 650,000 neurons. The model consists of five convolutional layers, some of which are followed by …
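The per-class error analysis mentioned above amounts to grouping predictions by true label; a small NumPy sketch with synthetic labels (standing in for Tiny ImageNet's 200 classes and a real model's predictions):

```python
# Per-class accuracy from predicted vs. true labels; data is randomly generated.
import numpy as np

rng = np.random.default_rng(0)
num_classes, n = 5, 1000
y_true = rng.integers(0, num_classes, size=n)
y_pred = np.where(rng.random(n) < 0.6, y_true,
                  rng.integers(0, num_classes, size=n))  # ~60% correct, rest noise

for c in range(num_classes):
    mask = y_true == c
    acc = (y_pred[mask] == y_true[mask]).mean()
    print(f"class {c}: {mask.sum()} samples, accuracy {acc:.3f}")
print(f"overall top-1 error: {1 - (y_pred == y_true).mean():.3f}")
```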
  • Deep Learning Is Robust to Massive Label Noise
Deep Learning is Robust to Massive Label Noise. David Rolnick*, Andreas Veit*, Serge Belongie, Nir Shavit.

Abstract: Deep neural networks trained on large supervised datasets have led to impressive results in image classification and other tasks. However, well-annotated datasets can be time-consuming and expensive to collect, lending increased interest to larger but noisy datasets that are more easily obtained. In this paper, we show that deep neural networks are capable of generalizing from training data for which true labels are massively outnumbered by incorrect labels. We demonstrate remarkably high test performance after training on corrupted data from MNIST, CIFAR, and ImageNet. For example, on MNIST we obtain test accuracy above 90 percent even after each clean training example has been diluted with 100 randomly-labeled examples (the setup is sketched below).

From the introduction: Annotation can be expensive and, for tasks requiring expert knowledge, may simply be unattainable at scale. To address this limitation, other training paradigms have been investigated to alleviate the need for expensive annotations, such as unsupervised learning (Le, 2013), self-supervised learning (Pinto et al., 2016; Wang & Gupta, 2015), and learning from noisy annotations (Joulin et al., 2016; Natarajan et al., 2013; Veit et al., 2017). Very large datasets (e.g., Krasin et al. (2016); Thomee et al. (2016)) can often be obtained, for example from web sources, with partial or unreliable annotation. This can allow neural networks to be trained on a much wider variety of tasks or classes and with less manual effort. The good performance obtained from these large, noisy datasets indicates that deep learning approaches can tolerate modest amounts of noise in the training set.
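A toy version of the dilution experiment, assuming a simple linear classifier on synthetic Gaussian data rather than the paper's deep networks on MNIST/CIFAR/ImageNet: each clean example is joined by `alpha` uniformly-random-labeled copies, and the classifier still recovers high test accuracy because the true labels retain a slight but consistent plurality.

```python
# Dilute clean labels with alpha randomly-labeled examples per clean one,
# then train and evaluate on a held-out clean test set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, n_clean, alpha = 4, 200, 20   # alpha = noisy examples per clean one

means = rng.normal(scale=4.0, size=(num_classes, 2))
X_clean = np.concatenate([rng.normal(means[c], 1.0, size=(n_clean, 2))
                          for c in range(num_classes)])
y_clean = np.repeat(np.arange(num_classes), n_clean)

# the noisy portion: same inputs, labels drawn uniformly at random
X_noise = np.repeat(X_clean, alpha, axis=0)
y_noise = rng.integers(0, num_classes, size=len(X_noise))

X = np.concatenate([X_clean, X_noise])
y = np.concatenate([y_clean, y_noise])
clf = LogisticRegression(max_iter=1000).fit(X, y)

X_test = np.concatenate([rng.normal(means[c], 1.0, size=(100, 2))
                         for c in range(num_classes)])
y_test = np.repeat(np.arange(num_classes), 100)
print(f"test accuracy with {alpha}x label dilution: {clf.score(X_test, y_test):.3f}")
```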
  • The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches
The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches. Md Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Brian C. Van Essen, Abdul A. S. Awwal, Vijayan K. Asari.

Abstract: In recent years, deep learning has garnered tremendous success in a variety of application domains. This new field of machine learning has been growing rapidly and has been applied to most traditional application domains, as well as some new areas that present more opportunities. Different methods have been proposed based on different categories of learning, including supervised, semi-supervised, and unsupervised learning. Experimental results show state-of-the-art performance using deep learning when compared to traditional machine learning approaches in the fields of image processing, computer vision, speech recognition, machine translation, art, medical imaging, medical information processing, robotics and control, bioinformatics, natural language processing (NLP), cybersecurity, and many others.

I. Introduction: Since the 1950s, a small subset of Artificial Intelligence (AI), often called Machine Learning (ML), has revolutionized several fields. Neural Networks (NN) are a subfield of ML, and it was this subfield that spawned Deep Learning (DL). Since its inception, DL has been creating ever larger disruptions, showing outstanding success in almost every application domain. Fig. 1 shows the taxonomy of AI. DL (using either deep architectures of learning or hierarchical learning approaches) is a class of ML developed largely from 2006 onward. Learning is a procedure consisting of estimating …
  • Kernel Descriptors for Visual Recognition
Kernel Descriptors for Visual Recognition. Liefeng Bo (University of Washington, Seattle, WA 98195, USA), Xiaofeng Ren (Intel Labs Seattle, Seattle, WA 98105, USA), Dieter Fox (University of Washington & Intel Labs Seattle).

Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernel over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]; a sketch of this recipe follows below. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.

1 Introduction: Image representation (features) is arguably the most fundamental task in computer vision. The problem is highly challenging because images exhibit high variations, are highly structured, and lie in high dimensional spaces. In the past ten years, a large number of low-level features over images have been proposed.
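To make the recipe concrete, here is a toy sketch under stated assumptions: a Gaussian match kernel over per-pixel gradient attributes, compressed with scikit-learn's KernelPCA. The kernel width, patch data, and uniform pixel sampling are made up and much simpler than the paper's actual kernels.

```python
# Match kernel between patches from a pixel attribute (gradient magnitude and
# orientation), then compact kernel descriptors via KPCA on the kernel matrix.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)

def gradient_attributes(patch):
    # per-pixel attribute vector: (magnitude, cos theta, sin theta)
    gy, gx = np.gradient(patch)
    m = np.hypot(gx, gy)
    theta = np.arctan2(gy, gx)
    return np.stack([m.ravel(), np.cos(theta).ravel(), np.sin(theta).ravel()],
                    axis=1)

def match_kernel(p, q, gamma=2.0):
    # average Gaussian match over all pixel-attribute pairs of two patches
    a, b = gradient_attributes(p), gradient_attributes(q)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).mean()

patches = rng.normal(size=(40, 8, 8))          # toy 8x8 image patches
K = np.array([[match_kernel(p, q) for q in patches] for p in patches])

kpca = KernelPCA(n_components=10, kernel="precomputed")
descriptors = kpca.fit_transform(K)            # [40, 10] patch-level features
print(descriptors.shape)
```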
  • Face Recognition Using Popular Deep Net Architectures: a Brief Comparative Study
Face Recognition Using Popular Deep Net Architectures: A Brief Comparative Study. Tony Gwyn 1,*, Kaushik Roy 1, and Mustafa Atay 2. 1 Department of Computer Science, North Carolina A&T State University, Greensboro, NC 27411, USA; [email protected]. 2 Department of Computer Science, Winston-Salem State University, Winston-Salem, NC 27110, USA; [email protected]. * Correspondence: [email protected]. (Future Internet, Article.)

Abstract: In the realm of computer security, the username/password standard is becoming increasingly antiquated. Usage of the same username and password across various accounts can leave a user open to potential vulnerabilities. Authentication methods of the future need to maintain the ability to provide secure access without a reduction in speed. Facial recognition technologies are quickly becoming integral parts of user security, allowing for a secondary level of user authentication. Augmenting traditional username and password security with facial biometrics has already seen impressive results; however, studying these techniques is necessary to determine how effective these methods are within various parameters. A Convolutional Neural Network (CNN) is a powerful classification approach which is often used for image identification and verification. Quite recently, CNNs have shown great promise in the area of facial image recognition. The comparative study proposed in this paper offers an in-depth analysis of several state-of-the-art deep learning-based facial recognition technologies, to determine via accuracy and other metrics which of them are most effective. In our study, VGG-16 and VGG-19 showed the highest levels of image recognition accuracy, as well as F1-score.
  • Key Ideas and Architectures in Deep Learning Applications That (Probably) Use DL
Key Ideas and Architectures in Deep Learning.

Applications that (probably) use DL: autonomous driving, scene understanding/segmentation, WordLens, Prisma. Outline of the talk: image recognition architectures (LeNet, 1998; AlexNet, 2012; VGGNet, 2014; GoogLeNet, 2014; ResNet, 2015) and a fun application using CNNs, image style transfer. Questions to ask about each architecture/paper: What special layers and non-linearities does it use? What loss function and weight-update rule? Does it train faster, reduce parameters, reduce overfitting, or help you visualize?

LeNet5 (1998) specs: trained on MNIST (60,000 training images, 10,000 testing), input is a 32x32 image, 8 layers, 60,000 parameters, a few hours to train on a laptop.

Modified LeNet architecture (Assignment 3). Training: input goes through a forward pass of Conv, ReLU, Maxpool, Conv, ReLU, Maxpool, FC, ReLU, Softmax; backpropagation against the labels computes the loss and updates the weights. Testing: the same forward pass produces an output, which is compared with the labels.

Modified LeNet, CONV layer 1: input 32x32; output 6 feature maps, each 28x28; convolution filter 5x5x1 (convolution) + 1 (bias). How many parameters in this layer? (5x5x1 + 1) x 6 = 156 (the helper below reproduces this count). Max-pooling layer: decreases the spatial extent of the feature maps and makes them translation-invariant; input is a 28x28x6 volume, max-pooling with filter size 2 …
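The slide's parameter count generalizes to any conv layer: each output feature map needs one kernel over all input channels plus a bias. A tiny helper reproducing the arithmetic:

```python
# Parameters of a conv layer: (kernel_h * kernel_w * in_channels + 1 bias)
# per output feature map.
def conv_params(kh, kw, in_ch, out_ch):
    return (kh * kw * in_ch + 1) * out_ch

print(conv_params(5, 5, 1, 6))    # LeNet CONV layer 1: (5*5*1 + 1) * 6 = 156
print(conv_params(5, 5, 6, 16))   # a following 5x5 conv with 16 maps: 2416
```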
  • Transfer Learning Based Face Recognition Using Deep Learning
Transfer Learning based Face Recognition using Deep Learning. M. Madhu Latha, K. V. Krishnam Raju. International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-8, Issue-1S3, June 2019.

Abstract: Face recognition technology is advancing rapidly with the current developments in computer vision. Face recognition can be done either in a single image or from a tracked video. The task of face recognition can be improved with the recent surge of interest in deep learning architectures. In this paper we propose an Inception-v3 CNN using an approach called transfer learning. Deep learning algorithms are trained to solve specific tasks and designed to work in isolation. The idea of transfer learning is to overcome isolated learning and leverage knowledge acquired on one task to solve related ones. Thus, instead of training a CNN from scratch, we use Google's pretrained Inception-v3 model by applying transfer learning (a sketch of this recipe follows below). The Inception-v3 model was trained on the ImageNet dataset and transfers its …

From the introduction: Deep learning, also called hierarchical learning, has spread widely as a new area of machine learning research. Deep learning is a type of machine learning that gives computer systems the ability to improve with knowledge and data, and it has attained great power and flexibility through the way it represents knowledge. Nowadays, face recognition [2] has legitimately broken into widespread use and its benefits have become more and more tangible. Face recognition is the eminent biometric technique extensively used in many areas of daily life, such as the military, finance, and public security. Face recognition is an outstanding research topic in the area of computer vision.
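A hedged sketch of that transfer-learning recipe, using torchvision's ImageNet-pretrained Inception v3 as a stand-in for Google's pretrained model; the identity count, optimizer, and training settings are illustrative, not the paper's.

```python
# Freeze the pretrained convolutional base and train only a new top layer
# for the face-identity classes.
import torch
import torch.nn as nn
import torchvision.models as models

num_identities = 10   # hypothetical number of people to recognize

model = models.inception_v3(weights="IMAGENET1K_V1")  # expects 299x299 inputs
for p in model.parameters():
    p.requires_grad = False            # keep the ImageNet features fixed

# replace the top layer with a new, trainable identity classifier
model.fc = nn.Linear(model.fc.in_features, num_identities)
model.aux_logits = False               # skip the auxiliary head during training

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ...then a standard loop over (face image, identity label) batches, e.g.
# loss = nn.functional.cross_entropy(model(images), labels)
```

Training only the replaced head is what makes this practical on a small face dataset: the millions of pretrained parameters stay fixed and only the new classifier is fit.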
  • Transfer ImageNet for Medical Image Analyzing Using Unsupervised Learning
Transfer ImageNet for Medical Image Analyzing Using Unsupervised Learning. Ning Wang and Peng Chu.

Abstract: The Convolutional Neural Network (CNN) is a powerful tool for image analysis and has demonstrated success in multiple tasks, but the requirement for large amounts of annotated samples prohibits its wide use in medical image analysis. This project explores using an unsupervised convolutional auto-encoder (CAE) to fine-tune a CNN trained on ImageNet for medical image analysis. The fine-tuned CNN is then tested on a landmark-estimation task. Compared with the CNN without unsupervised fine-tuning, the CNN adjusted by the CAE achieves 22% less average error on four landmarks.

I. Introduction: Deep neural networks have recently become a popular way of learning knowledge from massive annotated databases. They show great success in multiple tasks such as image classification and playing the board game Go. Specifically, the convolutional neural network (CNN), which is believed to mimic the mammalian visual cortex, is the most widely used framework and achieves state-of-the-art performance on ImageNet classification [1] and in the AlphaGo project [2]. (Figure 1: image samples from (a) CIFAR10 and (b) dental X-ray images.) The CNN is usually trained with a gradient descent algorithm, which is a supervised process, so annotated data is required for training. A popular CNN configuration for ImageNet classification named 'VGG-16' has 138 million free parameters [3]. Fortunately, ImageNet also has over 10 million annotated images [4]. But for an ordinary task, there is usually no such huge annotated database available. A popular way to solve this problem is to transfer knowledge learned on ImageNet to another domain by using a small amount of annotated data to fine-tune the network learned on ImageNet [6]. (A minimal CAE sketch follows below.)
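A minimal PyTorch sketch of the unsupervised ingredient: a small convolutional auto-encoder trained to reconstruct unlabeled images. The architecture, sizes, and random stand-in batch are illustrative, not the project's actual CAE or X-ray data.

```python
# Convolutional auto-encoder: no labels needed, the image is its own target.
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),    # 16 -> 32
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

cae = ConvAutoEncoder()
opt = torch.optim.Adam(cae.parameters(), lr=1e-3)
x = torch.rand(8, 1, 64, 64)           # stand-in unlabeled image batch

for step in range(5):                  # a few illustrative steps
    recon = cae(x)
    loss = nn.functional.mse_loss(recon, x)   # reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: reconstruction loss {loss.item():.4f}")
```

In the project's setting, an encoder initialized from an ImageNet-trained CNN would be adapted to the medical domain through this reconstruction objective before supervised landmark training.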
  • Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings
Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings. Luke Melas-Kyriazi, George Han, Celine Liang; [email protected], [email protected], [email protected].

Abstract: Over the past year, the emergence of transfer learning with large-scale language models (LM) has led to dramatic performance improvements across a broad range of natural language understanding tasks. However, the size and memory footprint of these large LMs make them difficult to deploy in many scenarios (e.g. on mobile phones). Recent research points to knowledge distillation as a potential solution, showing that when training data for a given task is abundant, it is possible to distill a large (teacher) LM into a small task-specific (student) network with minimal loss of performance. However, when such data is scarce, there remains a significant performance gap between large pretrained …

From the introduction: … (et al., 2018; Rajpurkar et al., 2016). The now-common approach for employing these systems using transfer learning is to (1) pretrain a large language model (LM), (2) replace the top layer of the LM with a task-specific layer, and (3) finetune the entire model on a (usually relatively small) labeled dataset. Following this pattern, Peters et al. (2018), Howard and Ruder (2018), Radford et al. (2019), and Devlin et al. (2018) broadly outperform standard task-specific NLU models (i.e. CNNs/LSTMs), which are initialized from scratch (or only from word embeddings) and trained on the available labeled data. Notably, transfer learning with LMs vastly outperforms training task-specific models from scratch in low-data regimes. (A sketch of the standard distillation objective follows below.)
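For reference, a sketch of the standard teacher-student distillation objective this work builds on: the student matches the teacher's temperature-softened distribution in addition to the hard labels. The logits here are random stand-ins, and the temperature and mixing weight are conventional illustrative values, not the paper's settings.

```python
# Knowledge distillation loss: KL to softened teacher outputs + hard-label CE.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_classes, T, alpha = 4, 3, 2.0, 0.5

teacher_logits = torch.randn(batch, num_classes)   # from a large pretrained LM
student_logits = torch.randn(batch, num_classes, requires_grad=True)
labels = torch.randint(0, num_classes, (batch,))

# soft targets: KL between temperature-softened distributions (scaled by T^2
# so gradient magnitudes stay comparable across temperatures)
soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                F.softmax(teacher_logits / T, dim=1),
                reduction="batchmean") * T * T
hard = F.cross_entropy(student_logits, labels)
loss = alpha * soft + (1 - alpha) * hard
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```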