Deep Learning - Review: Yann LeCun, Yoshua Bengio & Geoffrey Hinton
OUTLINE
• Deep Learning - history, background & applications.
• Recent revival.
• Convolutional Neural Networks.
• Recurrent Neural Networks.
• Future.

WHAT IS DEEP LEARNING?
• A particular class of learning algorithms.
• Rebranded neural networks, with multiple layers.
• Inspired by the neuronal architecture of the brain.
• Renewed interest in the area due to a few recent breakthroughs.
• Learn parameters from data.
• Non-linear classification.

SOME CONTEXT

HISTORY
• 1943 - McCulloch & Pitts develop a computational model for neural networks. Idea: neurons with a binary threshold activation function are analogous to first-order logic sentences.
• 1949 - Donald Hebb proposes Hebb's rule. Idea: neurons that fire together, wire together!
• 1958 - Frank Rosenblatt creates the Perceptron.
• 1959 - Hubel and Wiesel characterize cells in the visual cortex.
• 1975 - Paul J. Werbos develops the backpropagation algorithm.
• 1980 - Neocognitron, a hierarchical multilayered ANN.
• 1990 - Convolutional Neural Networks.

APPLICATIONS
• Predict the activity of potential drug molecules.
• Reconstruct brain circuits.
• Predict the effects of mutations in non-coding regions of DNA.
• Speech/image recognition & language translation.

MULTILAYER NEURAL NETWORK
Common non-linear functions: 1) ReLU: f(z) = max(0, z); 2) tanh(z); 3) logistic (sigmoid).
Cost function: E = 1/2 Σ_l (y_l - t_l)²
Source: Deep Learning. Yann LeCun, Yoshua Bengio & Geoffrey Hinton. Nature 521, 436–444 (28 May 2015). doi:10.1038/nature14539

STOCHASTIC GRADIENT DESCENT
• Analogy: a person is stuck in the mountains and is trying to get down (i.e., trying to find the minimum).
• SGD: the person represents the backpropagation algorithm, and the path taken down the mountain represents the sequence of parameter settings that the algorithm will explore. The steepness of the hill represents the slope of the error surface at that point. The instrument used to measure steepness is differentiation (the slope of the error surface can be calculated by taking the derivative of the squared error function at that point). The direction chosen for each step aligns with the negative gradient of the error surface at that point. The amount of time travelled before taking another measurement is the learning rate of the algorithm. (A minimal code sketch of SGD with backpropagation appears after the ConvNets overview below.)
Source: http://sebastianraschka.com/images/faq/closed-form-vs-gd/ball.png; Wikipedia.

BACKPROPAGATION
Cost function: E = 1/2 Σ_l (y_l - t_l)²
Source: Deep Learning. Yann LeCun, Yoshua Bengio & Geoffrey Hinton. Nature 521, 436–444 (28 May 2015). doi:10.1038/nature14539

WHY ALL THE BUZZ?
ImageNet:
• ~15M labeled high-resolution images.
• Roughly 22K categories.
• Collected from the web & labeled via Amazon Mechanical Turk.
http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf

DIMINISHING ERROR RATES

CONVOLUTIONAL NEURAL NETWORKS - CORE IDEA
• Color image: 32 x 32 pixels on 3 color channels.
• Pixel intensity: 0 - 255.
• Image representation: a 32 x 32 x 3 array of numbers, with each value between 0 and 255.
• Idea: feed the numerical array to a ConvNet and obtain probabilities for each class of objects as an N-dimensional vector, where N is the number of classes.

CONVOLUTIONAL NEURAL NETS
• Multi-stage neural nets that model the V1, V2, V3 areas of the visual cortex.
• (Convolutional layer + non-linearity layer + pooling layer)^n + fully connected layer (see the sketches below).
• Highly correlated local values are easily detected.
• Ideal for data that come in the form of multiple arrays, e.g., color images.
• Learn the 'essence' of images well.
• Applications in computer vision.
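Referring back to the multilayer network, stochastic gradient descent and backpropagation slides above, the following is a minimal NumPy sketch (not part of the original slides) of how those pieces fit together: a two-layer network with a ReLU hidden layer, the squared-error cost E = 1/2 Σ(y - t)², gradients obtained by backpropagation, and SGD updates. The layer sizes, learning rate and synthetic data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: 2-layer network, ReLU hidden layer, squared-error cost,
# gradients by backpropagation, parameters updated by SGD.
rng = np.random.default_rng(0)

# Illustrative synthetic data: 4-dimensional inputs, 3 target outputs.
X = rng.normal(size=(100, 4))
T = rng.normal(size=(100, 3))

# Randomly initialised weights and biases (4 -> 8 -> 3).
W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 3)), np.zeros(3)
lr = 0.01  # learning rate: "how long we travel before re-measuring the slope"

for step in range(1000):
    i = rng.integers(len(X))      # SGD: one randomly chosen example per step
    x, t = X[i:i+1], T[i:i+1]

    # Forward pass.
    z1 = x @ W1 + b1
    h = np.maximum(0, z1)         # non-linearity f(z) = max(0, z)
    y = h @ W2 + b2               # linear output layer

    # Cost: E = 1/2 * sum((y - t)^2).
    # Backward pass: propagate dE/d(...) from the output toward the input.
    dy = y - t                    # dE/dy
    dW2, db2 = h.T @ dy, dy.sum(0)
    dh = dy @ W2.T
    dz1 = dh * (z1 > 0)           # derivative of ReLU
    dW1, db1 = x.T @ dz1, dz1.sum(0)

    # SGD update: step in the direction of the negative gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```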
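To make the "(convolutional layer + non-linearity + pooling)^n + fully connected layer" recipe above concrete, here is a second minimal NumPy sketch of one forward pass over a 32 x 32 x 3 image: a small bank of filters slid across the image with a given stride (weight sharing), a ReLU, 2x2 max pooling, and a fully connected layer producing probabilities for N classes. The filter count, filter size, stride and class count are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES = 10                     # assumed number of object classes

image = rng.random((32, 32, 3))    # 32 x 32 pixels, 3 color channels
filters = rng.normal(scale=0.1, size=(8, 5, 5, 3))  # 8 assumed 5x5x3 filters (shared weights)
stride = 1

def convolve(img, filt, stride):
    """Slide each filter over the image; one output value per filter position."""
    H, W, _ = img.shape
    k = filt.shape[1]
    out_h, out_w = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.zeros((out_h, out_w, filt.shape[0]))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i*stride:i*stride+k, j*stride:j*stride+k, :]  # receptive field
            out[i, j] = (filt * patch).sum(axis=(1, 2, 3))            # one dot product per filter
    return out

def max_pool(x, size=2):
    """2x2 max pooling: keep the largest value in each non-overlapping window."""
    H, W, C = x.shape
    x = x[:H - H % size, :W - W % size]
    return x.reshape(H // size, size, W // size, size, C).max(axis=(1, 3))

conv = convolve(image, filters, stride)   # convolutional layer
relu = np.maximum(0, conv)                # non-linearity f(z) = max(0, z)
pooled = max_pool(relu)                   # pooling layer reduces spatial size
features = pooled.reshape(-1)             # flatten for the fully connected layer

W_fc = rng.normal(scale=0.01, size=(features.size, N_CLASSES))
scores = features @ W_fc
probs = np.exp(scores - scores.max())
probs /= probs.sum()                      # probabilities for each of the N classes
print(probs)
```

Because the same 8 filters are reused at every position, the convolutional layer here has only 8 x 5 x 5 x 3 weights; this is the parameter-sharing point made in the "What is a convolution?" slide below.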
INITIAL WORK - YANN LECUN
• Primitive recognition without hand-coded features.
• Adaptive, yet constrained architecture.
• Handwritten digit recognition served as a simple and powerful model problem.
• Training sample: 9,298 ZIP codes from mail passing through Buffalo, NY.

CONCEPTUAL OVERVIEW
Source: Deep Learning. Yann LeCun, Yoshua Bengio & Geoffrey Hinton. Nature 521, 436–444 (28 May 2015). doi:10.1038/nature14539

WHAT IS A CONVOLUTION?
• Several meanings depending on the area of application.
• Convolution - the operation of applying filters/kernels across overlapping regions of the image.
• Stride - the step size with which a filter slides across the image; larger strides mean less overlap between neighboring regions.
• A filter uses the same set of weights and biases at every position (weight sharing). This minimizes the number of parameters.
Sources: http://ufldl.stanford.edu/; http://www.kdnuggets.com/2016/09/beginners-guide-understanding-convolutional-neural-networks-part-1.html

FILTERS/KERNELS
• Filters - feature detectors (matrices) that respond to edges, curves, colors, etc.
• Receptive field - the area covered by a single filter.
• 3x3 and 5x5 are the most common sizes.
• AlexNet used 96 kernels in its first convolutional layer.
http://web.pdx.edu/~jduh/courses/Archive/geog481w07/Students/Ludwig_ImageConvolution.pdf

RELU & POOLING LAYERS
• ReLU - applies the non-linear activation function max(0, x) to every value. Other common functions include tanh and sigmoid.
• ReLU addresses the 'vanishing gradient' problem.
• Pooling - reduces the spatial size and helps control overfitting.
• 2x2 max pooling is the most common pooling operation.
• Dropout - random elimination of neurons during training to reduce overfitting.
• Pooling vs. larger strides.

NON-LINEARITIES - TREND
https://arxiv.org/pdf/1606.02228.pdf

ALEXNET - PERSPECTIVE
http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf

CONVNETS - THE BIG PICTURE

RECURRENT NEURAL NETWORKS - RNN
• RNN - neural nets with feedback loops; unrolled, they are multiple copies of the same network, each passing a message to its successor.
• Used to train on sequential inputs, e.g., speech, DNA sequences, etc.
• Operate over sequences of vectors; predict the next character, word, etc.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

LONG SHORT-TERM MEMORY (LSTM)
• Sequences have long-term dependencies. Why? "the clouds are in the sky" vs. "I grew up in France… I speak fluent French."
• Problem: plain RNNs find it hard to store information for very long.
• Solution: LSTM.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

EXAMPLE BY ANDREJ KARPATHY
• Source: 474 MB of C code from GitHub.
• Multiple 3-layer LSTMs.
• A few days of training on GPUs.
• Parameters: ~10 million.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

RNNS AND BEYOND
• RNNs augmented with memory networks: 1. improved performance; 2. applications in question-answering systems.
• ConvNets + RNNs = novel applications.

FUTURE - DEEP LEARNING
• Extension of recent successes from supervised to unsupervised learning.
• End-to-end integration: reinforcement learning + ConvNets + RNNs.
• Natural language understanding.
• Complex systems that combine learning, memory and reasoning.
Source: https://developer.amazon.com/alexaprize

REFERENCES
• Andrej Karpathy's course: http://cs231n.stanford.edu/
• DeepLearning.tv: https://www.youtube.com/channel/UC9OeZkIwhzfv-_Cb7fCikLQ
• Wikipedia.