<<

Deep Learning

Elisa Ricci University of Perugia

Outline

● Introduction to Deep Learning
● Brief history of Neural Networks
● Recent technical advances
● Models: Autoencoders, CNN, RNN
● Open problems
● Deep Learning Hands-on

The AI Revolution

Computer Vision

Computer Vision

● Amazing progress in the last few years with Convolutional Neural Networks (CNNs).

1.4M images, 1K categories (2009)

Computer Vision

● Amazing progress in the last few years with Convolutional Neural Networks (CNNs).

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)

Traditional ML vs Deep models vs Human

Speech Processing

● Machine translation.

Rick Rashid in Tianjin, China, October 25, 2012

A voice recognition program translated a speech given by Richard F. Rashid, Microsoft’s top scientist, into Mandarin Chinese.

Speech Processing

● Text-to-Speech: WaveNet.

Creativity

● Style Transfer.

[Ruder et al.]

Creativity

● Generate Donald Trump’s Twitter eruptions.

Machine Learning vs Deep Learning

[Figure: comparison of approaches, 2002 vs 2012]

Deep Learning

● What is deep learning?

● Why is it generally better than traditional ML methods on image, speech and certain other types of data?

Deep Learning

● What is deep learning?

Deep Learning means using a neural network with several layers of nodes between input and output.

[Figure: network with an input layer, several hidden layers, and an output layer]

More formally

● A family of parametric models which learn non-linear hierarchical representations:

hL(x; θ1, …, θL) = fL( … f2( f1(x; θ1); θ2 ) …; θL)

where x is the input, θ1, …, θL are the parameters of the network, and fl is the non-linear activation of layer l.

… and informally

Deep Learning

● Why is it generally better than other ML methods on image, speech and certain other types of data?

The series of layers between input and output compute relevant features automatically in a series of stages, just as our brains seem to.

Deep Learning

...but neural networks have been around for 25 years... So, what is new?

Biological neuron

[Figure: biological neuron — dendrites, synapses, nucleus, cell body (soma), axon]

• A neuron has:
  – branching inputs (the dendrites)
  – a branching output (the axon)

• Information moves from the dendrites to the axon via the cell body

• Axon connects to dendrites via synapses
  – Synapses vary in strength
  – Synapses may be excitatory or inhibitory

Perceptron

● An Artificial Neuron (perceptron) is a non-linear parameterized function with a restricted output range.

[Figure: perceptron — inputs x1…x5, weights w1…w5, weighted sum Σ, activation function, output h]

Biological neuron and Perceptron

[Figure: side-by-side comparison of a biological neuron (dendrites, synapses, nucleus, cell body/soma, axon) and a perceptron (inputs x1…x5, weights w1…w5, Σ, activation function, output h); dendrite ↔ input, cell body ↔ summation and activation, axon ↔ output]

Brief History of Neural Networks

[VUNO]

1943 – McCulloch & Pitts Model

● Early model of the artificial neuron
● Generates a binary output
● The weight values are fixed

[Figure: McCulloch & Pitts neuron — inputs x1…x5, fixed weights w1…w5, Σ, threshold, binary output]

1958 – Perceptron by Rosenblatt

● Perceptron as a machine for linear classification
● Main idea: learn the weights and consider a bias
● One weight per input
● Multiply weights with respective inputs and add bias
● If the result is larger than a threshold, return 1, otherwise 0 (a minimal sketch of this rule follows the figure below)

[Figure: perceptron — inputs x1…x5, weights w1…w5, bias b (input fixed to 1), Σ, activation function, output h]
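A minimal NumPy sketch of this learning rule (the AND-function data, learning rate, and epoch count are illustrative assumptions, not from the slides):

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    """Rosenblatt-style perceptron: threshold unit with learned weights and bias."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # Forward pass: weighted sum plus bias, then hard threshold
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            # Update only when the prediction is wrong
            update = lr * (target - pred)
            w += update * xi
            b += update
    return w, b

# Toy usage: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)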

Activation functions

First AI winter

● The exclusive OR (XOR) problem cannot be solved by the perceptron
● Neural models cannot be applied to complex tasks

First AI winter

● But, can XOR be solved by neural networks?
● Multi-layer perceptrons (MLP) can solve XOR
● A few years later Minsky built such an MLP

[Figure: MLP with inputs x1 … xn, two hidden layers, and an output layer]


Multi-layer feed-forward Neural Network

● Main idea: densely connect artificial neurons to realize compositions of non-linear functions
● The information is propagated from the inputs to the outputs
● Directed Acyclic Graph (DAG)
● Tasks: classification, regression
● The input data are n-dimensional, usually feature vectors

[Figure: MLP with inputs x1 … xn, two hidden layers, and an output layer]

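A minimal NumPy sketch of the forward pass of such a network, i.e. the composition of non-linear functions from the earlier slide (the layer sizes and the ReLU activation are illustrative assumptions):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Forward pass through a feed-forward network: a composition of
    affine maps followed by non-linear activations, layer by layer."""
    h = x
    for W, b in layers:
        h = relu(W @ h + b)
    return h

# Illustrative shapes: 4 inputs -> 8 hidden units -> 3 outputs
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(3, 8)), np.zeros(3))]
y = mlp_forward(rng.normal(size=4), layers)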

First AI winter

● How to train an MLP?
● Rosenblatt's algorithm is not applicable, as it expects to know the desired target d_i
● For hidden layers we cannot know the desired target

[Figure: MLP with inputs x1 … xn, hidden layers (desired target d_i = ?), and output layer (y_i = {0,1})]


1986 – Backpropagation

● Backpropagation revitalized the field
● Learning MLPs for complicated functions can be solved
● Efficient algorithm which processes "large" training sets
● Allowed for complicated neural network architectures
● Today backpropagation is still at the core of neural network training

Werbos (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. Thesis, Harvard University.
Rumelhart, Hinton, Williams (1986). Learning representations by back-propagating errors. Nature.

Backpropagation

Learning is the process of modifying the weights of each layer θl in order to produce a network that performs some function:

[Figure: MLP with inputs x1 … xn, hidden layers, and output layer; the weights θl of each layer are unknown (θl = ?)]

Backpropagation

● Preliminary steps:
  ● Collect/acquire a training set {X, Y}
  ● Define the model and randomly initialize the weights

● Given the training set find the weights:

Backpropagation

1) Forward propagation: sum inputs, produce activations, feed forward; the hypothesis (network output) is aL = h(xi; θ)
2) Error estimation: compare the output with the true label vector yi
3) Back-propagate the error signal and use it to update the weights

Backpropagation

Randomly initialize the weights

While the error is too large:
  (1) For each training sample (presented in random order):
        Apply the inputs to the network
        Calculate the output for every neuron, from the input layer, through the hidden layers, to the output layer
  (2) Calculate the error at the outputs
  (3) Use the output error to compute error signals for the previous layers
        Use the error signals to compute the weight adjustments
        Apply the weight adjustments

Periodically evaluate the network performance
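A minimal PyTorch sketch of this training loop (the architecture, the random data standing in for {X, Y}, and the hyper-parameters are illustrative assumptions; autograd performs step (3) for us):

import torch
import torch.nn as nn

# Illustrative MLP and random data standing in for a real training set {X, Y}
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
X, Y = torch.randn(256, 20), torch.randint(0, 10, (256,))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):                      # outer loop: "while the error is too large"
    perm = torch.randperm(len(X))            # present samples in random order
    for i in range(0, len(X), 32):           # mini-batches of 32 samples
        idx = perm[i:i + 32]
        out = model(X[idx])                  # (1) forward pass through all layers
        loss = loss_fn(out, Y[idx])          # (2) error at the outputs
        optimizer.zero_grad()
        loss.backward()                      # (3) back-propagate error signals
        optimizer.step()                     # apply the weight adjustments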

Backpropagation

● Optimization with gradient descent: θ ← θ − η ∇θ L(θ)

● The most important component is how to compute the gradient
● The backward computations of the network return the gradient

Backpropagation

Forward pass: activations are computed layer by layer, from the input to the output.

Backward pass: error signals (gradients) are computed layer by layer, from the output back to the input.

Recursive rule (the error of the previous layer is obtained from the error of the current layer via the chain rule):

δ_{l−1} = (θl^T δl) ⊙ f′(z_{l−1})

1990s - CNN and LSTM

● Important advances in the field:
  ● Backpropagation
  ● Recurrent Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997)
  ● Convolutional Neural Networks: OCR solved before the 2000s (LeNet, 1998)

Second AI winter

● NNs cannot exploit many layers:
  ● Vanishing gradient (with NN training you need to multiply several small numbers → they become smaller and smaller)
  ● Lack of processing power (no GPUs)
  ● Lack of data (no large annotated datasets)

Second AI winter

● Kernel Machines (e.g. SVMs) suddenly became very popular
● Similar accuracies to NNs on the same tasks
● Much fewer heuristics and parameters
● Nice proofs on generalization

The believers

2006 - Learning deep belief nets

● A clever way of initializing network weights:
  ● Train each layer one by one with unsupervised training (using contrastive divergence)
  ● Much better than random values
● Fine-tune the weights with a round of backpropagation, just as is normal for neural nets
● State-of-the-art performance on the MNIST dataset

[Hinton et al.]

2012 - AlexNet

● Hinton's group implemented a CNN similar to LeNet [LeCun1998] but...
● Trained on ImageNet with two GPUs
● With some technical improvements (ReLU, dropout, …)

Traditional ML vs Deep models vs Human

Why so powerful?

● Build an improved feature space
● The first layer learns first-order features (e.g. edges…)
● Subsequent layers learn higher-order features (combinations of first-layer features, combinations of edges, etc.)
● The final layer of transformed features is fed into supervised layer(s)

Learning hierarchical representations

● Traditional framework

[Pipeline: input image → Handcrafted Features → Simple Classifier → "Pear" (the classification model)]

● Deep Learning

[Pipeline: input image → Layer 1 (θ1) → Layer 2 (θ2) → Layer 3 (θ3) → Simple Classifier → "Pear"]

Why Deep Learning now?

● Three main factors:

● Better hardware

● Big data

● Technical advances:

● Layer-wise pretraining

● Optimization (e.g. Adam, …)

● Regularization (e.g. dropout) ….

GPUs

[NVIDIA Blog]

Big Data

● Large fully annotated datasets

Advances with Deep Learning

● Better:

● Activation functions (ReLU)
● Training schemes
● Weights initialization
● Addressing overfitting (dropout)
● Normalization between layers
● Residual deep learning
● …


Sigmoid activations

● Positive facts:
  ● Output can be interpreted as a probability
  ● Output bounded in [0,1]
● Negative facts:
  ● Gradients always multiply factors smaller than 1, so they can become small
  ● The gradient at the tails is flat (close to 0), so there are almost no weight updates


Rectified Linear Units

● More efficient gradient propagation: (derivative is 0 or constant)

● More efficient computation: (only comparison, addition and multiplication).

● Sparse activation: e.g. in a randomly initialized network, only about 50% of the hidden units are activated (have a non-zero output)

● Lots of variations have been proposed recently.
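For reference, the unit and its derivative, which make the points above concrete:

\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}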

Losses

● Sum-squared error (L2) loss gradient seeks the maximum likelihood hypothesis under the assumption that the training data can be modeled by Normally distributed noise added to the target function value.

● Fine for regression but less natural for classification.

● For classification problems it is advantageous and increasingly popular to use the softmax activation, just at the output layer, with the cross-entropy loss.

Softmax and Cross Entropy

● Softmax: softens 1 of k targets to mimic a probability vector for each output.


Softmax and Cross Entropy

● Cross-entropy loss: the most popular classification loss for classifiers that output probabilities:

● A generalization of the (binary) logistic loss for more than two outputs.

● These new loss and activation functions help avoid gradient saturation.
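Written out, for a network producing scores z_1, …, z_K and a one-hot target vector y:

p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L_{CE} = -\sum_{k=1}^{K} y_k \log p_k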

Stochastic Gradient Descent (SGD)

● Use a mini-batch sampled from the dataset for the gradient estimate.

● Sometimes helps to escape from local minima
● Noisy gradients act as regularization
● Also suitable for datasets that change over time
● The variance of the gradients increases when the batch size decreases
● It is not clear how many samples to use per batch
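The resulting update on a mini-batch B, with learning rate η:

\theta_{t+1} = \theta_t - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \, \ell(h(x_i; \theta_t), y_i)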

Learning rate

● Great impact on learning performance


Momentum

● Gradient updates with momentum

● Prevents the gradient direction from switching all the time
● Faster and more robust convergence
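A common formulation (the momentum coefficient μ, typically around 0.9, is an illustrative default):

v_{t+1} = \mu v_t - \eta \nabla_\theta L(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}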


Adaptive Learning

● Popular schemes

● Nesterov Momentum – Calculate point you would go to if using normal momentum. Then, compute gradient at that point. Do normal update using that gradient and momentum.

● Rprop – Resilient backpropagation: each weight keeps its own step size; if the gradient sign inverts, decrease that step size, otherwise increase it.

● Adagrad – Scale each learning rate inversely proportionally to the square root of the sum of the historical squared gradients, so that parameters with smaller derivatives have their learning rates decreased less.

● RMSprop – Like Adagrad, but with an exponentially weighted moving average of the squared gradients, so older updates are gradually forgotten.

● Adam (Adaptive moments): Momentum terms on both gradient and squared gradient (1st and 2nd moments) – update based on both
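The Adam update as in [Kingma2014], with gradient g_t (the defaults β1 = 0.9, β2 = 0.999, ε = 10^-8 are the usual choices):

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t)

\theta_t = \theta_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)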


Data augmentation

● Simple preprocessing makes the difference (e.g. image flipping, scaling)
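A minimal sketch using torchvision transforms (the specific transforms and the 224-pixel crop size are illustrative assumptions):

import torchvision.transforms as T

# Random flips and crops generate new training views of each image
train_transform = T.Compose([
    T.RandomHorizontalFlip(),     # image flipping
    T.RandomResizedCrop(224),     # scaling / cropping
    T.ToTensor(),
])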



Weights initialization

● Initialization depends on the chosen non-linearities and on data normalization
● Initial weights are important to find a good balance among layers, so that the network learns well across all layers
● A common choice is to draw initial weights from a uniform distribution over [-c/√(node fan-in), c/√(node fan-in)] (c = 1 Xavier, c = 2 He)

● A Gaussian distribution can also be used, with the above value as the standard deviation

● Lots of other variations and current work
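A minimal NumPy sketch of the rule above (the layer sizes are illustrative assumptions; c = 1 corresponds to the Xavier-style choice and c = 2 to the He-style choice for ReLU units):

import numpy as np

def init_weights(fan_in, fan_out, c=1.0, rng=np.random.default_rng(0)):
    """Uniform initialization in [-c/sqrt(fan_in), c/sqrt(fan_in)]."""
    limit = c / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W1 = init_weights(784, 256, c=1.0)   # Xavier-style
W2 = init_weights(256, 10, c=2.0)    # He-style, for ReLU layers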


Regularization - Dropout

● For each training instance, drop each node (hidden or input) and its connections with probability p, and train
● The final net just uses all the weights (actually scaled by 1-p, to approximate the average)
● As if ensembling 2^n different network substructures
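A minimal NumPy sketch of this scheme (the drop probability p = 0.5 is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p=0.5, train=True):
    """Classic dropout on an activation vector/matrix h."""
    if train:
        mask = rng.random(h.shape) >= p   # keep each unit with probability 1-p
        return h * mask
    return h * (1.0 - p)                  # test time: scale by the keep probability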


Batch Normalization

● To maintain learning balance, renormalize the activations at each layer
● Obtain zero-mean and unit-variance inputs: re-normalize the activation/net values at each input dimension k of each layer

● We want the mean and variance of that activation for the entire data set. Approximate the empirical mean and variance over a mini-batch of instances and then normalize the activation.


[Ioffe and Szegedy, 2015]

Batch Normalization

● Then scale and shift the normalized activation with two learnable weights per input, γ and β, to attain the final batch normalization for that activation:
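For a mini-batch of m activations x_1, …, x_m at input dimension k (ε is a small constant for numerical stability):

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta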

● BN advantages:
  – Allows larger learning rates
  – Improves gradient flow
  – Reduces dependence on initialization


[Ioffe and Szegedy, 2015]

Deep Residual Learning

● Residual Nets
● 2015 ILSVRC winner
● A CNN with hundreds of layers
● Uses Batch Normalization extensively
● Learns the residual mapping with respect to the identity
● A simple concept which tends to make the function to be learned simpler across depth


Deep Residual Learning

● F is the residual mapping of the desired function H with respect to the identity: F(x) = H(x) − x, so the block computes H(x) = F(x) + x

● If the optimal mapping is close to the identity, F only needs to capture small fluctuations.

[He2015]
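A minimal PyTorch sketch of one such residual block, in the spirit of [He2015] (the channel count and kernel size are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is two conv/BN layers (the residual mapping)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # skip connection: add the identity

x = torch.randn(1, 64, 32, 32)
y = ResidualBlock()(x)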

Deep Residual Learning

● Very simple design but deep


References

[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas. Learning to learn by gradient descent by gradient descent, arXiv, 2016

[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human Level Performance on ImageNet Classification, ICCV, 2015

[Ioffe2015] Ioffe, Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015

[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014

[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 2014

[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013

[Bengio2012] Bengio, Practical recommendations for gradient-based training of deep architectures, ArXiv, 2012

[Krizhevsky2012] Krizhevsky, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012

[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011

[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, AISTATS, 2010

Deep Learning Models

Deep Learning Models

Deep Belief Networks and Autoencoders: employ layer-wise training to initialize each layer and capture multiple levels of representation simultaneously.

[Figure: encoder–decoder architecture]

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.
Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19.

Autoencoders

• The autoencoder idea is motivated by the concept of a good representation.
• For example, for a classifier, a good representation can be defined as one that will yield a better performing classifier.
• An encoder is a deterministic mapping f that transforms an input vector x into a hidden representation y.
• Parameters of f: a weight matrix W and a bias b (an offset vector).

● A decoder maps back the hidden representation y to the reconstructed input z via g.

● Autoencoding: compare the reconstructed input z to the original input x and try to minimize the reconstruction error.
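With this notation, a common formulation uses a non-linearity s (e.g. the sigmoid) and, for real-valued inputs, a squared-error reconstruction loss:

y = f(x) = s(Wx + b), \qquad z = g(y) = s(W'y + b'), \qquad L(x, z) = \lVert x - z \rVert^2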

Denoising Autoencoders

• Vincent et al. (2010), “a good representation is one that can be obtained robustly from a corrupted input and that will be useful for recovering the corresponding clean input.” • The higher level representations are relatively stable and robust to input corruption.

• In denoising autoencoders, the partially corrupted input is cleaned (de-noised).

Denoising Autoencoders

1. The clean input is partially corrupted through a stochastic mapping.
2. The corrupted input passes through a basic autoencoder and is mapped to a hidden representation.
3. From this hidden representation, we can reconstruct z.
4. Minimize the reconstruction error (cross-entropy or squared-error loss).
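A minimal PyTorch sketch of these four steps (the masking-noise corruption, layer sizes, and squared-error loss are illustrative assumptions):

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.rand(32, 784)                                   # a batch of clean inputs
x_tilde = x * (torch.rand_like(x) > 0.3).float()          # 1. stochastic corruption (masking noise)
y = encoder(x_tilde)                                      # 2. hidden representation of the corrupted input
z = decoder(y)                                            # 3. reconstruction
loss = nn.functional.mse_loss(z, x)                       # 4. error w.r.t. the CLEAN input
optimizer.zero_grad()
loss.backward()
optimizer.step()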

Stacked Denoising Autoencoders

• Deep architecture: autoencoders are stacked one on top of another.
• Once the encoding function of the first DAE is learned and used to reconstruct the corrupted input, we can train the second level.
• Once the SDAE is trained, its output can be used as the input to a supervised learning algorithm, such as a support vector machine classifier or multi-class logistic regression.

Structured Data

● Some applications naturally deal with an input space which is locally structured – spatial or temporal
● Images, language, etc. vs arbitrary input features
● Deep Learning is extremely powerful in this case.

Deep Learning Models

Convolutional Neural Networks: organize neurons based on the animal visual cortex, which allows for learning patterns at both the local and the global level.

Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998

Convolutional Neural Networks

● CNN: a multi-layer neural network
  – With local connectivity:
    ● Neurons in a layer are only connected to a small region of the layer before it
  – Sharing weight parameters across spatial positions:
    ● Learning shift-invariant filter kernels
    ● Reducing the number of parameters
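A minimal PyTorch sketch of such a network (the channel counts, kernel sizes, and 32×32 RGB input are illustrative assumptions):

import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local connectivity + shared filter weights
    nn.ReLU(),                                     # non-linearity
    nn.MaxPool2d(2),                               # spatial pooling (down-sampling)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                     # final classifier layer
)

logits = cnn(torch.randn(1, 3, 32, 32))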

CNN Architecture

Convolutional Neural Networks

Input Image → Convolution (learned) → Non-linearity → Spatial pooling → Normalize → Feature maps

[Figure: an input image and the corresponding feature (activation) map]

Convolutional Neural Networks


Each image sub-region yields a feature map, representing its feature. The filter weights are shared across sub-regions.

Convolutional Neural Networks


Convolutional filters are learned in a supervised manner by back-propagating classification error

Convolutional Neural Networks


Non-linearity: e.g. Rectified Linear Unit (ReLU)

Convolutional Neural Networks


Max pooling

A non-linear down-sampling, to provide translation invariance

Convolutional Neural Networks


Convolutional Neural Networks


Max pooling

Convolutional Neural Networks


• By progressively reducing the spatial size of the representation we reduce the number of parameters and the amount of computation in the network, and also control overfitting.

Convolutional Neural Networks


Deep Learning Models

Recurrent Neural Networks: connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior.

Hochreiter, S., Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

RNNs for sequences

Standard Neural Networks (and also CNN):

● Only accept a fixed-size vector/matrix as input (e.g., an image) and produce a fixed-size vector as output (e.g., probabilities of different classes).

● These models use a fixed amount of computational steps (e.g. the number of layers in the model).

Recurrent Neural Networks are unique as they allow us to operate over sequences of vectors. Sequences in the input, the output, or in the most general case both.

Recurrent Neural Networks

● An unrolled RNN (in time) can be considered as a deep neural network with indefinitely many layers:

Recurrent Neural Networks

x_t: input at time t
s_t: hidden state at time t (the memory of the network)
f: an activation function (e.g. sigmoid, ReLU)
U, V, W: network parameters (unlike a feedforward neural network, an RNN shares the same parameters across all time steps)
g: activation function for the output layer (typically a softmax)
y_t: the output of the network at time t
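Assuming the usual convention that U acts on the input, W on the previous state, and V on the output (the slide does not fix this assignment explicitly), the recurrence is:

s_t = f(U x_t + W s_{t-1}), \qquad y_t = g(V s_t)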

Back Propagation Through Time

● The backpropagation algorithm can be extended to BPTT by unfolding RNN in time and stacking identical copies of the RNN.

● As the parameters to be learned (U, V and W) are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on those of the previous time steps.

● A common choice for the loss function is the cross-entropy loss.

Vanishing gradient

● Definition: The influence of a given input on the hidden layer, and therefore on the network output, either decays or grows exponentially as it propagates through an RNN.

● In practice, the range of contextual information that standard RNNs can access is limited to approximately 10 time steps between the relevant input and target events.

Solution: LSTM networks.

Long Short Term Memory (LSTM)

● An LSTM is a special kind of RNN architecture, capable of learning long-term dependencies.

● An LSTM can learn to bridge time intervals in excess of 1000 steps.

● This is achieved by multiplicative gate units that learn to open and close access to the constant error flow.

Long Short Term Memory (LSTM)

● LSTM networks introduce a new structure called a memory cell.

● Each memory cell contains four main elements:

– Input gate

– Forget gate

– Output gate

– A neuron with a self-recurrent connection

● These gates allow the cells to keep and access information over long periods of time.

[Figure: LSTM memory cell]

Long Short Term Memory (LSTM)

• i: input gate, how much of the new information will be let through to the memory cell.
• f: forget gate, responsible for deciding what information should be thrown away from the memory cell.
• o: output gate, how much of the information will be exposed to the next time step.
• g: self-recurrent unit, equivalent to the standard RNN update.

• c_t: internal memory of the memory cell

• s_t: hidden state
• y: final output
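With this notation, the standard LSTM updates are as follows (σ is the sigmoid, ⊙ the element-wise product; writing the weight matrices as U_* and W_*, by analogy with the RNN slide, is an assumption):

i_t = \sigma(U_i x_t + W_i s_{t-1}), \qquad f_t = \sigma(U_f x_t + W_f s_{t-1}), \qquad o_t = \sigma(U_o x_t + W_o s_{t-1})

g_t = \tanh(U_g x_t + W_g s_{t-1})

c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad s_t = o_t \odot \tanh(c_t)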


The future of Deep Learning

New models

Deep Generative Models

● Lots of research on generative models to create probabilistic models of training data with ability to generate new images, sentences, etc.


Deep Generative Models

Generative Adversarial Networks (GANs)
● The generator net produces samples x close to the training samples
● The discriminator net (the adversary) must differentiate between samples from the generator and samples from the training set
● Error feedback is used to improve both nets, until the discriminator can no longer distinguish the two; the discriminator net can then be discarded
● The generated samples are increasingly difficult for humans to distinguish from real ones
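The adversarial game is usually written as a minimax objective, where the generator G maps noise z to samples and the discriminator D outputs the probability that a sample is real:

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]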


Open Issues

Why does it work?

Open Issues

Scale: larger and larger nets...

Open Issues

Scale: how to stop this???

Deep Learning & Data

https://www.linkedin.com/pulse/how-artificial-intelligence-revolutionizing-finance-del-toro-barba

Open Issues

Unsupervised Learning

Open Issues

Life-long Learning

[Eaton]

Summary

● Impressive results

● Works well in structured/Markovian spaces - CNNs, etc.

● Much recent excitement, still much to be discovered

● More work needed to understand how and why deep learning works so well – How deep should we go?

● Potential for significant improvements
