Path Towards Better Deep Learning Models and Implications for Hardware

Total Page:16

File Type:pdf, Size:1020Kb

Path Towards Better Deep Learning Models and Implications for Hardware Path towards better deep learning models and implications for hardware Natalia Vassilieva Product Manager, ML Cerebras Systems AI has massive potential, but is compute-limited today. Cerebras Systems © 2020 “Historically, progress in neural networks and deep learning research has been greatly influenced by the available hardware and software tools.” Yann LeCun, 1.1 Deep Learning Hardware: Past, Present and Future. 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 12-19 “Historically, progress in neural networks and deep learning research has been greatly influenced by the available hardware and software tools.” “Hardware capabilities and software tools both motivate and limit the type of ideas that AI researchers will imagine and will allow themselves to pursue. The tools at our disposal fashion our thoughts more than we care to admit.” Yann LeCun, 1.1 Deep Learning Hardware: Past, Present and Future. 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 12-19 “Historically, progress in neural networks and deep learning research has been greatly influenced by the available hardware and software tools.” “Hardware capabilities and software tools both motivate and limit the type of ideas that AI researchers will imagine and will allow themselves to pursue. The tools at our disposal fashion our thoughts more than we care to admit.” “Is DL-specific hardware really necessary? The answer is a resounding yes. One interesting property of DL systems is that the larger we make them, the better they seem to work.” Yann LeCun, 1.1 Deep Learning Hardware: Past, Present and Future. 2019 IEEE International Solid-State Circuits Conference (ISSCC 2019), 12-19 Cerebras Systems Founded in 2016 ● Visionary leadership ● Smart, strategic backing ● 200+ world-class engineering team Building systems to drastically change the landscape of compute for AI Cerebras Systems © 2020 The Cerebras Wafer Scale Engine (WSE) The most powerful processor for AI 46,225 mm2 silicon 1.2 trillion transistors 400,000 AI optimized cores 18 Gigabytes of On-chip Memory 9 PByte/s memory bandwidth 100 Pbit/s fabric bandwidth TSMC 16nm process Hosted in CS-1 Cerebras Systems © 2020 Cerebras CS-1: Cluster-Scale DL Performance in a Single System Powered by the WSE Programming accessibility of a single node, using TensorFlow, PyTorch, and other frameworks Deploys into standard datacenter infrastructure Multiple units delivered, installed, and in-use today across multiple verticals Built from the ground up for AI acceleration Cerebras Systems © 2020 Let’s look at modern DL models… Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 440 100 480 95 144 ● Accuracy steadily 829 11.2 improves over time 90 5 ● Model sizes grows 85 60 46 gradually 80 1.2 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 440 100 480 95 144 ● Accuracy steadily 829 11.2 improves over time 90 5 AlexNet ● Model sizes grows 85 60 46 gradually 80 1.2 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 440 100 480 95 144 ● Accuracy steadily 829 11.2 improves over time 90 ZFNet 5 AlexNet ● Model sizes grows 85 60 46 gradually 80 1.2 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 440 100 480 95 144 ● Accuracy steadily VGG-19 829 11.2 improves over time 90 ZFNet 5 AlexNet ● Model sizes grows Inception-v1 85 60 46 gradually 80 1.2 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 440 100 480 ResNet-152 95 144 ● Accuracy steadily VGG-19 829 11.2 improves over time 90 ZFNet 5 AlexNet ● Model sizes grows Inception-v1 85 60 46 gradually 80 1.2 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 440 100 Inception-ResNet 480 ResNet-152 95 144 ● Accuracy steadily VGG-19 829 11.2 improves over time 90 ZFNet 5 AlexNet ● Model sizes grows Inception-v1 85 60 46 gradually 80 1.2 SqueezeNet 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern models: computer vision Image classification accuracy and model sizes 105 SENet-154 FixEfficientNet 440 100 Inception-ResNet 480 ResNet-152 95 144 ● Accuracy steadily VGG-19 829 11.2 improves over time 90 AmoebaNet-A ResNeXt-101 ZFNet 5 AlexNet ● Model sizes grows Inception-v1 85 60 46 gradually (from tens to hundreds millions of 80 1.2 parameters) SqueezeNet 5 accuracy on ImageNet, % - 75 2010 2012 2014 2016 2018 2020 2022 Top Years Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern Models: Natural Language Processing Language modelling on WikiText-103 60 ● Quality of models 50 improves linearly with model size growing 40 exponentially, confirming a power law reported by OpenAI 30 20 Test perplexity 10 0 1 10 100 1000 10000 100000 1000000 Model size, millions of parameters (log-scale) Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern Models: Natural Language Processing Language modelling on WikiText-103 60 Tansformer-XL 2-layer skip-LSTM ● Quality of models 50 improves linearly with model size growing Mogrifier LSTM exponentially, confirming 40 GPT-2 a power law reported by OpenAI 30 BERT-Large-CAS GPT-3 20 Test perplexity 10 0 24M 1.5B 175B 1 10 100 1000 10000 100000 1000000 Model size, millions of parameters (log-scale) Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern Models: Natural Language Processing Sentiment analysis on SST-2 98 ● Similar trend is on 97 downstream tasks 96 95 94 93 Accuracy, % Accuracy, 92 91 90 1 10 100 1000 10000 100000 Model size, millions of parameters (log-scale) Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern Models: Natural Language Processing Sentiment analysis on SST-2 98 T5-3B T5-11B ● Similar trend is on 97 downstream tasks 96 95 94 Bert Large 93 Tiny Bert Accuracy, % Accuracy, 92 Bert Base 91 90 110M 340M 3B 11B 1 10 14.5M 100 1000 10000 100000 Model size, millions of parameters (log-scale) Data from https://paperswithcode.com/ Cerebras Systems © 2020 Modern Models: Natural Language Processing Sentiment analysis on SST-2 98 T5-3B ALBERT XLNet T5-11B ● Similar trend is on 97 downstream tasks 96 ● Further improvements 95 are possible with similar model sizes, but with 94 more compute Bert Large 93 Tiny Bert Accuracy, % Accuracy, 92 Bert Base 91 90 110M 340M 3B 11B 1 10 14.5M 100 1000 10000 100000 Model size, millions of parameters (log-scale) Data from https://paperswithcode.com/ Cerebras Systems © 2020 Memory and compute requirements Memory and compute requirements 100,000 ● 1 PFLOP-day is about days - 10,000 1 x DGX-2H or 1 x DGX-A100 busy for a day 1,000 100 10 Total training compute, PFLOP 1 1 10 100 1,000 10,000 100,000 Model memory requirement , GB Cerebras Systems © 2020 Memory and compute requirements Memory and compute requirements 100,000 Microsoft 1T-param ● 1 PFLOP-day is about days - 10,000 GPT-3 1 x DGX-2H or 1 x DGX-A100 busy for a day T5-11B 1,000 T-NLG Megatron-LM 100 GPT-2 10 BERT Large BERT Base Total training compute, PFLOP 2GB 272GB 1 5.4GB 2.8TB 16TB 1 10 100 1,000 10,000 100,000 Model memory requirement , GB Cerebras Systems © 2020 Memory and compute requirements Memory and compute requirements 100,000 Microsoft 1T-param 23,220 ● 1 PFLOP-day is about days - 10,000 GPT-3 1 x DGX-2H or 1 x DGX-A100 3,738 busy for a day T5-11B 1,000 ● NVIDIA Megatron-LM: trained on 512 V100 (32 DGX- T-NLG Megatron-LM 2H) for about 10 days 10096 GPT-2 ● OpenAI GPT-3: trained on 1024 V100 (64 DGX- 10 BERT Large 2H) for about 116 days 6 BERT Base Total training compute, PFLOP 2GB 272GB 1 5.4GB 2.8TB 16TB 1 10 100 1,000 10,000 100,000 Model memory requirement , GB Cerebras Systems © 2020 Future path: larger and smarter models Brute-force scaling is historical path to better models. • This is challenging • Memory needs to grow • Compute needs to grow Algorithmic innovations give more efficient models. • These are promising but challenge existing hardware. Cerebras Systems © 2020 Dynamic sparsity and conditional computations are key Training data Model parameters Sample 1 Sample 2 Sample N Ability to choose which input element should influence which model parameter, and which model parameter should influence the output for a given input element Cerebras Systems © 2020 Dynamic sparsity and conditional computations are key Training data Model parameters Sample 1 Sample 2 Sample N Ability to choose which input element should influence which model parameter, and which model parameter should influence the output for a given input element Cerebras Systems © 2020 Dynamic sparsity and conditional computations are key Training data Model parameters Sample 1 Sample 2 Sample N Ability to choose which input element should influence which model parameter, and which model parameter should
Recommended publications
  • Backpropagation and Deep Learning in the Brain
    Backpropagation and Deep Learning in the Brain Simons Institute -- Computational Theories of the Brain 2018 Timothy Lillicrap DeepMind, UCL With: Sergey Bartunov, Adam Santoro, Jordan Guerguiev, Blake Richards, Luke Marris, Daniel Cownden, Colin Akerman, Douglas Tweed, Geoffrey Hinton The “credit assignment” problem The solution in artificial networks: backprop Credit assignment by backprop works well in practice and shows up in virtually all of the state-of-the-art supervised, unsupervised, and reinforcement learning algorithms. Why Isn’t Backprop “Biologically Plausible”? Why Isn’t Backprop “Biologically Plausible”? Neuroscience Evidence for Backprop in the Brain? A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: A spectrum of credit assignment algorithms: How to convince a neuroscientist that the cortex is learning via [something like] backprop - To convince a machine learning researcher, an appeal to variance in gradient estimates might be enough. - But this is rarely enough to convince a neuroscientist. - So what lines of argument help? How to convince a neuroscientist that the cortex is learning via [something like] backprop - What do I mean by “something like backprop”?: - That learning is achieved across multiple layers by sending information from neurons closer to the output back to “earlier” layers to help compute their synaptic updates. How to convince a neuroscientist that the cortex is learning via [something like] backprop 1. Feedback connections in cortex are ubiquitous and modify the
    [Show full text]
  • Generative Adversarial Networks (Gans)
    Generative Adversarial Networks (GANs) The coolest idea in Machine Learning in the last twenty years - Yann Lecun Generative Adversarial Networks (GANs) 1 / 39 Overview Generative Adversarial Networks (GANs) 3D GANs Domain Adaptation Generative Adversarial Networks (GANs) 2 / 39 Introduction Generative Adversarial Networks (GANs) 3 / 39 Supervised Learning Find deterministic function f: y = f(x), x:data, y:label Generative Adversarial Networks (GANs) 4 / 39 Unsupervised Learning "Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we do not know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI." - Yann Lecun "You cannot predict what you cannot understand" - Anonymous Generative Adversarial Networks (GANs) 5 / 39 Unsupervised Learning More challenging than supervised learning. No label or curriculum. Some NN solutions: Boltzmann machine AutoEncoder Generative Adversarial Networks Generative Adversarial Networks (GANs) 6 / 39 Unsupervised Learning vs Generative Model z = f(x) vs. x = g(z) P(z|x) vs P(x|z) Generative Adversarial Networks (GANs) 7 / 39 Autoencoders Stacked Autoencoders Use data itself as label Generative Adversarial Networks (GANs) 8 / 39 Autoencoders Denosing Autoencoders Generative Adversarial Networks (GANs) 9 / 39 Variational Autoencoder Generative Adversarial Networks (GANs) 10 / 39 Variational Autoencoder Results Generative Adversarial Networks (GANs) 11 / 39 Generative Adversarial Networks Ian Goodfellow et al, "Generative Adversarial Networks", 2014.
    [Show full text]
  • Automated Elastic Pipelining for Distributed Training of Transformers
    PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers Chaoyang He 1 Shen Li 2 Mahdi Soltanolkotabi 1 Salman Avestimehr 1 Abstract the-art convolutional networks ResNet-152 (He et al., 2016) and EfficientNet (Tan & Le, 2019). To tackle the growth in The size of Transformer models is growing at an model sizes, researchers have proposed various distributed unprecedented rate. It has taken less than one training techniques, including parameter servers (Li et al., year to reach trillion-level parameters since the 2014; Jiang et al., 2020; Kim et al., 2019), pipeline paral- release of GPT-3 (175B). Training such models lel (Huang et al., 2019; Park et al., 2020; Narayanan et al., requires both substantial engineering efforts and 2019), intra-layer parallel (Lepikhin et al., 2020; Shazeer enormous computing resources, which are luxu- et al., 2018; Shoeybi et al., 2019), and zero redundancy data ries most research teams cannot afford. In this parallel (Rajbhandari et al., 2019). paper, we propose PipeTransformer, which leverages automated elastic pipelining for effi- T0 (0% trained) T1 (35% trained) T2 (75% trained) T3 (100% trained) cient distributed training of Transformer models. In PipeTransformer, we design an adaptive on the fly freeze algorithm that can identify and freeze some layers gradually during training, and an elastic pipelining system that can dynamically Layer (end of training) Layer (end of training) Layer (end of training) Layer (end of training) Similarity score allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the Figure 1. Interpretable Freeze Training: DNNs converge bottom pipeline, packs active layers into fewer GPUs, up (Results on CIFAR10 using ResNet).
    [Show full text]
  • Convolutional Neural Network Transfer Learning for Robust Face
    Convolutional Neural Network Transfer Learning for Robust Face Recognition in NAO Humanoid Robot Daniel Bussey1, Alex Glandon2, Lasitha Vidyaratne2, Mahbubul Alam2, Khan Iftekharuddin2 1Embry-Riddle Aeronautical University, 2Old Dominion University Background Research Approach Results • Artificial Neural Networks • Retrain AlexNet on the CASIA-WebFace dataset to configure the neural • VGG-Face Shows better results in every performance benchmark measured • Computing systems whose model architecture is inspired by biological networf for face recognition tasks. compared to AlexNet, although AlexNet is able to extract features from an neural networks [1]. image 800% faster than VGG-Face • Capable of improving task performance without task-specific • Acquire an input image using NAO’s camera or a high resolution camera to programming. run through the convolutional neural network. • Resolution of the input image does not have a statistically significant impact • Show excellent performance at classification based tasks such as face on the performance of VGG-Face and AlexNet. recognition [2], text recognition [3], and natural language processing • Extract the features of the input image using the neural networks AlexNet [4]. and VGG-Face • AlexNet’s performance decreases with respect to distance from the camera where VGG-Face shows no performance loss. • Deep Learning • Compare the features of the input image to the features of each image in the • A subfield of machine learning that focuses on algorithms inspired by people database. • Both frameworks show excellent performance when eliminating false the function and structure of the brain called artificial neural networks. positives. The method used to allow computers to learn through a process called • Determine whether a match is detected or if the input image is a photo of a training.
    [Show full text]
  • Lecture 10: Recurrent Neural Networks
    Lecture 10: Recurrent Neural Networks Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - 1 May 4, 2017 Administrative A1 grades will go out soon A2 is due today (11:59pm) Midterm is in-class on Tuesday! We will send out details on where to go soon Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - 2 May 4, 2017 Extra Credit: Train Game More details on Piazza by early next week Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - 3 May 4, 2017 Last Time: CNN Architectures AlexNet Figure copyright Kaiming He, 2016. Reproduced with permission. Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - 4 May 4, 2017 Last Time: CNN Architectures Softmax FC 1000 Softmax FC 4096 FC 1000 FC 4096 FC 4096 Pool FC 4096 3x3 conv, 512 Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool Pool 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 3x3 conv, 512 Pool Pool 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 3x3 conv, 256 Pool Pool 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 3x3 conv, 128 Pool Pool 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 Input Input VGG16 VGG19 GoogLeNet Figure copyright Kaiming He, 2016. Reproduced with permission. Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - 5 May 4, 2017 Last Time: CNN Architectures Softmax FC 1000 Pool 3x3 conv, 64 3x3 conv, 64 3x3 conv, 64 relu 3x3 conv, 64 3x3 conv, 64 F(x) + x 3x3 conv, 64 ..
    [Show full text]
  • Image Retrieval Algorithm Based on Convolutional Neural Network
    Advances in Intelligent Systems Research, volume 133 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE2016) Image Retrieval Algorithm Based on Convolutional Neural Network Hailong Liu 1, 2, Baoan Li 1, 2, *, Xueqiang Lv 1 and Yue Huang 3 1Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science & Technology University, Beijing 100101, China 2Computer School, Beijing Information Science and Technology University, Beijing 100101, China 3Xuanwu Hospital Capital Medical University, 100053, China *Corresponding author Abstract—With the rapid development of computer technology low, can’t meet the needs of people. In order to overcome this and the increasing of multimedia data on the Internet, how to difficulty, the researchers from the image itself, and put quickly find the desired information in the massive data becomes forward the method of image retrieval based on content. The a hot issue. Image retrieval can be used to retrieve similar images, method is to extract the visual features of the image content: and the effect of image retrieval depends on the selection of image color, texture, shape etc., the image database to be detected features to a certain extent. Based on deep learning, through self- samples for similarity matching, retrieval and sample images learning ability of a convolutional neural network to extract more are similar to the image. The main process of the method is the conducive to the high-level semantic feature of image retrieval selection and extraction of features, but there are "semantic using convolutional neural network, and then use the distance gap" [2] between low-level features and high-level semantic metric function similar image.
    [Show full text]
  • Tiny Imagenet Challenge
    Tiny ImageNet Challenge Jiayu Wu Qixiang Zhang Guoxi Xu Stanford University Stanford University Stanford University [email protected] [email protected] [email protected] Abstract Our best model, a fine-tuned Inception-ResNet, achieves a top-1 error rate of 43.10% on test dataset. Moreover, we We present image classification systems using Residual implemented an object localization network based on a Network(ResNet), Inception-Resnet and Very Deep Convo- RNN with LSTM [7] cells, which achieves precise results. lutional Networks(VGGNet) architectures. We apply data In the Experiments and Evaluations section, We will augmentation, dropout and other regularization techniques present thorough analysis on the results, including per-class to prevent over-fitting of our models. What’s more, we error analysis, intermediate output distribution, the impact present error analysis based on per-class accuracy. We of initialization, etc. also explore impact of initialization methods, weight decay and network depth on system performance. Moreover, visu- 2. Related Work alization of intermediate outputs and convolutional filters are shown. Besides, we complete an extra object localiza- Deep convolutional neural networks have enabled the tion system base upon a combination of Recurrent Neural field of image recognition to advance in an unprecedented Network(RNN) and Long Short Term Memroy(LSTM) units. pace over the past decade. Our best classification model achieves a top-1 test error [10] introduces AlexNet, which has 60 million param- rate of 43.10% on the Tiny ImageNet dataset, and our best eters and 650,000 neurons. The model consists of five localization model can localize with high accuracy more convolutional layers, and some of them are followed by than 1 objects, given training images with 1 object labeled.
    [Show full text]
  • Dynamic Factor Graphs for Time Series Modeling
    Dynamic Factor Graphs for Time Series Modeling Piotr Mirowski and Yann LeCun Courant Institute of Mathematical Sciences, New York University, 719 Broadway, New York, NY 10003 USA {mirowski,yann}@cs.nyu.edu http://cs.nyu.edu/∼mirowski/ Abstract. This article presents a method for training Dynamic Fac- tor Graphs (DFG) with continuous latent state variables. A DFG in- cludes factors modeling joint probabilities between hidden and observed variables, and factors modeling dynamical constraints on hidden vari- ables. The DFG assigns a scalar energy to each configuration of hidden and observed variables. A gradient-based inference procedure finds the minimum-energy state sequence for a given observation sequence. Be- cause the factors are designed to ensure a constant partition function, they can be trained by minimizing the expected energy over training sequences with respect to the factors’ parameters. These alternated in- ference and parameter updates can be seen as a deterministic EM-like procedure. Using smoothing regularizers, DFGs are shown to reconstruct chaotic attractors and to separate a mixture of independent oscillatory sources perfectly. DFGs outperform the best known algorithm on the CATS competition benchmark for time series prediction. DFGs also suc- cessfully reconstruct missing motion capture data. Key words: factor graphs, time series, dynamic Bayesian networks, re- current networks, expectation-maximization 1 Introduction 1.1 Background Time series collected from real-world phenomena are often an incomplete picture of a complex underlying dynamical process with a high-dimensional state that cannot be directly observed. For example, human motion capture data gives the positions of a few markers that are the reflection of a large number of joint angles with complex kinematic and dynamical constraints.
    [Show full text]
  • Deep Learning Is Robust to Massive Label Noise
    Deep Learning is Robust to Massive Label Noise David Rolnick * 1 Andreas Veit * 2 Serge Belongie 2 Nir Shavit 3 Abstract Thus, annotation can be expensive and, for tasks requiring expert knowledge, may simply be unattainable at scale. Deep neural networks trained on large supervised datasets have led to impressive results in image To address this limitation, other training paradigms have classification and other tasks. However, well- been investigated to alleviate the need for expensive an- annotated datasets can be time-consuming and notations, such as unsupervised learning (Le, 2013), self- expensive to collect, lending increased interest to supervised learning (Pinto et al., 2016; Wang & Gupta, larger but noisy datasets that are more easily ob- 2015) and learning from noisy annotations (Joulin et al., tained. In this paper, we show that deep neural net- 2016; Natarajan et al., 2013; Veit et al., 2017). Very large works are capable of generalizing from training datasets (e.g., Krasin et al.(2016); Thomee et al.(2016)) data for which true labels are massively outnum- can often be obtained, for example from web sources, with bered by incorrect labels. We demonstrate remark- partial or unreliable annotation. This can allow neural net- ably high test performance after training on cor- works to be trained on a much wider variety of tasks or rupted data from MNIST, CIFAR, and ImageNet. classes and with less manual effort. The good performance For example, on MNIST we obtain test accuracy obtained from these large, noisy datasets indicates that deep above 90 percent even after each clean training learning approaches can tolerate modest amounts of noise example has been diluted with 100 randomly- in the training set.
    [Show full text]
  • LSTM-In-LSTM for Generating Long Descriptions of Images
    Computational Visual Media DOI 10.1007/s41095-016-0059-z Vol. 2, No. 4, December 2016, 379–388 Research Article LSTM-in-LSTM for generating long descriptions of images Jun Song1, Siliang Tang1, Jun Xiao1, Fei Wu1( ), and Zhongfei (Mark) Zhang2 c The Author(s) 2016. This article is published with open access at Springerlink.com Abstract In this paper, we propose an approach by means of text (description generation) is a for generating rich fine-grained textual descriptions of fundamental task in artificial intelligence, with many images. In particular, we use an LSTM-in-LSTM (long applications. For example, generating descriptions short-term memory) architecture, which consists of an of images may help visually impaired people better inner LSTM and an outer LSTM. The inner LSTM understand the content of images and retrieve images effectively encodes the long-range implicit contextual using descriptive texts. The challenge of description interaction between visual cues (i.e., the spatially- generation lies in appropriately developing a model concurrent visual objects), while the outer LSTM that can effectively represent the visual cues in generally captures the explicit multi-modal relationship images and describe them in the domain of natural between sentences and images (i.e., the correspondence language at the same time. of sentences and images). This architecture is capable There have been significant advances in of producing a long description by predicting one description generation recently. Some efforts word at every time step conditioned on the previously rely on manually-predefined visual concepts and generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via sentence templates [1–3].
    [Show full text]
  • Energy-Based Generative Adversarial Network
    Published as a conference paper at ICLR 2017 ENERGY-BASED GENERATIVE ADVERSARIAL NET- WORKS Junbo Zhao, Michael Mathieu and Yann LeCun Department of Computer Science, New York University Facebook Artificial Intelligence Research fjakezhao, mathieu, [email protected] ABSTRACT We introduce the “Energy-based Generative Adversarial Network” model (EBGAN) which views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions. Similar to the probabilistic GANs, a generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples. Viewing the discrimi- nator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary classifier with logistic output. Among them, we show one instantiation of EBGAN framework as using an auto-encoder architecture, with the energy being the reconstruction error, in place of the dis- criminator. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images. 1I NTRODUCTION 1.1E NERGY-BASED MODEL The essence of the energy-based model (LeCun et al., 2006) is to build a function that maps each point of an input space to a single scalar, which is called “energy”. The learning phase is a data- driven process that shapes the energy surface in such a way that the desired configurations get as- signed low energies, while the incorrect ones are given high energies.
    [Show full text]
  • The History Began from Alexnet: a Comprehensive Survey on Deep Learning Approaches
    > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 1 The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches Md Zahangir Alom1, Tarek M. Taha1, Chris Yakopcic1, Stefan Westberg1, Paheding Sidike2, Mst Shamima Nasrin1, Brian C Van Essen3, Abdul A S. Awwal3, and Vijayan K. Asari1 Abstract—In recent years, deep learning has garnered I. INTRODUCTION tremendous success in a variety of application domains. This new ince the 1950s, a small subset of Artificial Intelligence (AI), field of machine learning has been growing rapidly, and has been applied to most traditional application domains, as well as some S often called Machine Learning (ML), has revolutionized new areas that present more opportunities. Different methods several fields in the last few decades. Neural Networks have been proposed based on different categories of learning, (NN) are a subfield of ML, and it was this subfield that spawned including supervised, semi-supervised, and un-supervised Deep Learning (DL). Since its inception DL has been creating learning. Experimental results show state-of-the-art performance ever larger disruptions, showing outstanding success in almost using deep learning when compared to traditional machine every application domain. Fig. 1 shows, the taxonomy of AI. learning approaches in the fields of image processing, computer DL (using either deep architecture of learning or hierarchical vision, speech recognition, machine translation, art, medical learning approaches) is a class of ML developed largely from imaging, medical information processing, robotics and control, 2006 onward. Learning is a procedure consisting of estimating bio-informatics, natural language processing (NLP), cybersecurity, and many others.
    [Show full text]