
CPSC 540: Deep CRFs, Convolutional Neural Networks

Mark Schmidt

University of British Columbia

Winter 2017

Admin

Assignment 4: 1 late day to hand in tonight, 2 for Monday.

Assignment 5 coming soon. Project description coming soon. Final details coming soon.

Bonus lecture on April 10th (same time and place)?

Last Time: Log-Linear Models

We discussed log-linear models like the Ising model,

p(x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^{d} x_i w + \sum_{(i,j) \in E} x_i x_j v \right),

where if v is large it encourages neighbouring values to be the same.
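As a concrete check of what Z and these probabilities look like, here is a minimal brute-force sketch for a tiny chain-structured Ising model (the chain graph, d = 4, and the parameter values are placeholders chosen for illustration):

    import itertools
    import numpy as np

    # Tiny Ising model: chain on d = 4 nodes with x_i in {-1, +1}.
    d = 4
    edges = [(0, 1), (1, 2), (2, 3)]   # E for a chain
    w, v = 0.1, 1.0                    # node and edge parameters (placeholders)

    def unnormalized(x):
        # exp( sum_i x_i w + sum_{(i,j) in E} x_i x_j v )
        return np.exp(sum(x[i] * w for i in range(d))
                      + sum(x[i] * x[j] * v for (i, j) in edges))

    # Brute-force normalizing constant Z over all 2^d configurations.
    configs = list(itertools.product([-1, 1], repeat=d))
    Z = sum(unnormalized(x) for x in configs)

    # With v = 1, configurations where neighbours agree get most of the mass.
    print(unnormalized((1, 1, 1, 1)) / Z)    # relatively high probability
    print(unnormalized((1, -1, 1, -1)) / Z)  # relatively low probability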

We also discussed conditional random field (CRF) variants that generalize this model,

p(y_1, y_2, \dots, y_d \mid x_1, x_2, \dots, x_d) = \frac{1}{Z} \exp\left( \sum_{i=1}^{d} y_i w^T x_i + \sum_{(i,j) \in E} y_i y_j v \right).

We can jointly learn a logistic regression model and the label dependencies.

Last Time: Structured SVMs

In generative UGM models (MRFs) we optimize the joint likelihood,

f(w) = -\sum_{i=1}^{n} \log p(y^i, x^i \mid w).

In discriminative UGM models (CRFs) we optimize the conditional likelihood,

f(w) = -\sum_{i=1}^{n} \log p(y^i \mid x^i, w).

In structured SVMs we penalize decisions from decoding the conditional likelihood,

f(w) = \sum_{i=1}^{n} \max_{y'} \left[ g(y^i, y') - \log p(y^i \mid x^i, w) + \log p(y' \mid x^i, w) \right],

where g(y^i, y') is the penalty for predicting y' when the true label is y^i. Only requires decoding to evaluate the objective/subgradient.

Outline

1 Neural Networks Review

2 Deep Conditional Random Fields

3 Convolutional Neural Networks

Feedforward Neural Networks

In 340 we discussed feedforward neural networks for supervised learning. With 1 hidden layer the classic model has this structure:

Motivation: For some problems it’s hard to find good features. This model learns features z that are good for supervised learning.

Neural Networks as DAG Models

It’s a DAG model but there is an important difference with our previous models:

The latent variables zc are deterministic functions of the xj.

Makes inference trivial: if you observe all x_j you also observe all z_c. In this case y is the only random variable.

Neural Network Notation

We’ll continue using our supervised learning notation:

X = \begin{bmatrix} (x^1)^T \\ (x^2)^T \\ \vdots \\ (x^n)^T \end{bmatrix}, \quad y = \begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^n \end{bmatrix}.

For the latent features and two sets of parameters we’ll use

Z = \begin{bmatrix} (z^1)^T \\ (z^2)^T \\ \vdots \\ (z^n)^T \end{bmatrix}, \quad w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_k \end{bmatrix}, \quad W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_k \end{bmatrix},

where Z is n by k and W is k by d.

Introducing Non-Linearity

We discussed how the “linear-linear” model,

z^i = W x^i, \quad y^i = w^T z^i,

is degenerate since it’s still a linear model.
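A quick numerical sanity check of this point (the sizes and random values below are arbitrary): composing the two linear maps gives exactly one linear model with weights W^T w.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 5, 3
    W = rng.standard_normal((k, d))   # "latent feature" weights
    w = rng.standard_normal(k)        # output weights
    x = rng.standard_normal(d)

    z = W @ x                 # z^i = W x^i
    yhat = w @ z              # y^i = w^T z^i
    print(np.allclose(yhat, (W.T @ w) @ x))  # True: same as one linear model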

The classic solution is to introduce a non-linearity,

z^i = h(W x^i), \quad y^i = w^T z^i,

where a common choice is applying the sigmoid element-wise,

z^i_c = \frac{1}{1 + \exp(-W_c x^i)},

which is said to be the “activation” of neuron c on example i. This is a universal approximator with k large enough.
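In code, the forward pass of this one-hidden-layer model is a few lines of NumPy (a minimal sketch; the sizes and random parameters are placeholders):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    d, k = 10, 4
    W = rng.standard_normal((k, d))   # k-by-d first-layer weights
    w = rng.standard_normal(k)        # output weights
    x = rng.standard_normal(d)        # one example x^i

    z = sigmoid(W @ x)   # z^i_c = h(W_c x^i), the "activations"
    yhat = w @ z         # y^i = w^T z^i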

Deep Neural Networks

In deep neural networks we add multiple hidden layers. Mathematically, with 3 hidden layers the classic model uses

y^i = w^T h(W^3 h(W^2 h(W^1 x^i))),

where the inner terms are the hidden layers z^{i1} = h(W^1 x^i), z^{i2} = h(W^2 z^{i1}), and z^{i3} = h(W^3 z^{i2}).

Biological Motivation

Deep learning is motivated by theories of deep hierarchies in the brain.

https://en.wikibooks.org/wiki/Sensory_Systems/Visual_Signal_Processing

But most research is about making models work better, not be more brain-like.

Deep Neural Network History

Popularity of deep learning has come in waves over the years. Currently, it is one of the hottest topics in science.

Recent popularity is due to unprecedented performance on some difficult tasks: Speech recognition. Computer vision. Machine translation. These are mainly due to big datasets, deep models, and tons of computation. Plus some tweaks to the classic models.

For a NY Times article discussing some of the history/successes/issues, see:

https://mobile.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

Training Deep Neural Networks

If we’re training a 3-layer network with squared error, our objective is

f(w, W^1, W^2, W^3) = \frac{1}{2} \sum_{i=1}^{n} \left( w^T h(W^3 h(W^2 h(W^1 x^i))) - y^i \right)^2.

Usual training procedure is stochastic gradient.
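A minimal NumPy sketch of evaluating this objective (the sizes, data, and parameters below are random placeholders); stochastic gradient then repeatedly picks a random example i and updates the parameters using the gradient of that single term, computed by backpropagation as discussed below:

    import numpy as np

    def h(a):
        return 1.0 / (1.0 + np.exp(-a))   # sigmoid non-linearity

    def f(w, W1, W2, W3, X, y):
        # f = 1/2 * sum_i (w^T h(W3 h(W2 h(W1 x^i))) - y^i)^2
        total = 0.0
        for xi, yi in zip(X, y):
            pred = w @ h(W3 @ h(W2 @ h(W1 @ xi)))
            total += 0.5 * (pred - yi) ** 2
        return total

    rng = np.random.default_rng(0)
    n, d, k = 50, 10, 4                    # placeholder sizes
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    w = rng.standard_normal(k)
    W1 = rng.standard_normal((k, d))
    W2 = rng.standard_normal((k, k))
    W3 = rng.standard_normal((k, k))

    print(f(w, W1, W2, W3, X, y))
    # A stochastic gradient step would sample one i and update the parameters
    # using the gradient of that single term (via backpropagation).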

Highly non-convex and notoriously difficult to tune.

Recent empirical/theoretical work indicates non-convexity may not be an issue: All local minima may be good for “large enough” networks. We’re discovering sets of tricks to make things easier to tune.

Some common data/optimization tricks we discussed in 340: Data transformations. For images, translate/rotate/scale/crop each x^i to make more data. Data standardization: centering and whitening. Adding bias variables. For sigmoids, corresponds to adding a row of zeroes to each W^m. Parameter initialization: “small but different”, standardizing within layers. Step-size selection: “babysitting”, Bottou trick, Adam, momentum. Momentum: heavy-ball and Nesterov-style modifications. Batch normalization: adaptive standardizing within layers. ReLU: replacing the sigmoid with [W_c x^i]^+. Avoids gradients extremely close to zero.
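To illustrate the last point about ReLU, a tiny comparison of the two activations' derivatives (the pre-activation values are arbitrary): the sigmoid's derivative is nearly zero for large |a|, while the ReLU's is exactly 0 or 1.

    import numpy as np

    a = np.array([-10.0, -2.0, 0.5, 10.0])   # pre-activations W_c x^i

    sigmoid = 1.0 / (1.0 + np.exp(-a))
    sigmoid_grad = sigmoid * (1.0 - sigmoid)  # h'(a), nearly 0 for |a| large

    relu = np.maximum(a, 0.0)                 # [W_c x^i]^+
    relu_grad = (a > 0).astype(float)         # exactly 0 or 1, never "tiny"

    print(sigmoid_grad)   # approx [4.5e-05, 0.105, 0.235, 4.5e-05]
    print(relu_grad)      # [0., 0., 1., 1.]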

Some common regularization tricks we discussed in 340: Standard L2-regularization or L1-regularization (“weight decay”). Sometimes with a different λ for each layer. Early stopping of the optimization based on validation accuracy. Dropout randomly zeroes z values to discourage dependence. Hyper-parameter optimization to choose various tuning parameters. Special architectures like convolutional neural networks and LSTMs (later). Yields W^m that are very sparse and have many tied parameters.

Backpropagation as Message-Passing

Computing the gradient in neural networks is called backpropagation. Derived from the chain rule and memoization of repeated quantities.

We’re going to view backpropagation as a message-passing algorithm.

Key advantages of this view: It’s easy to handle different graph structures. It’s easy to handle different non-linear transformations. It’s easy to handle multiple outputs (as in structured prediction). It’s easy to add non-deterministic parts and combine with other graphical models.

Backpropagation Forward Pass

Consider computing the output of a neural network for an example i,

y^i = w^T h(W^3 h(W^2 h(W^1 x^i)))
    = \sum_{c=1}^{k} w_c \, h\!\left( \sum_{c'=1}^{k} W^3_{cc'} \, h\!\left( \sum_{c''=1}^{k} W^2_{c'c''} \, h\!\left( \sum_{j=1}^{d} W^1_{c''j} x^i_j \right) \right) \right),

where we’ve assumed that all hidden layers have k values. In the second line, the h functions are single-input single-output.

The nested sum structure is similar to our message-passing structures.

However, it’s easier because it’s deterministic: no random variables to sum over. The messages will be scalars rather than functions.

Forward propagation through neural network as message passing:

y^i = \sum_{c=1}^{k} w_c \, h\!\left( \sum_{c'=1}^{k} W^3_{cc'} \, h\!\left( \sum_{c''=1}^{k} W^2_{c'c''} \, h\!\left( \sum_{j=1}^{d} W^1_{c''j} x^i_j \right) \right) \right)
    = \sum_{c=1}^{k} w_c \, h\!\left( \sum_{c'=1}^{k} W^3_{cc'} \, h\!\left( \sum_{c''=1}^{k} W^2_{c'c''} \, h(M_{c''}) \right) \right)
    = \sum_{c=1}^{k} w_c \, h\!\left( \sum_{c'=1}^{k} W^3_{cc'} \, h(M_{c'}) \right)
    = \sum_{c=1}^{k} w_c \, h(M_c)
    = M_y,

where the intermediate messages are the values before the non-linear transformation, so that for example z^{i2}_{c'} = h(M_{c'}).
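A minimal NumPy sketch of this forward pass as message passing (sizes and parameters are placeholders); each message vector holds the pre-activation values of one layer, and M_y equals the nested expression:

    import numpy as np

    def h(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    d, k = 10, 4
    x = rng.standard_normal(d)
    W1 = rng.standard_normal((k, d))
    W2 = rng.standard_normal((k, k))
    W3 = rng.standard_normal((k, k))
    w = rng.standard_normal(k)

    # Forward messages: each M holds the pre-activation values of one layer.
    M1 = W1 @ x          # M_{c''} = sum_j W^1_{c''j} x_j
    M2 = W2 @ h(M1)      # M_{c'}  = sum_{c''} W^2_{c'c''} h(M_{c''})
    M3 = W3 @ h(M2)      # M_c     = sum_{c'} W^3_{cc'} h(M_{c'})
    My = w @ h(M3)       # M_y     = sum_c w_c h(M_c) = y^i

    # Same as evaluating the nested expression directly.
    print(np.allclose(My, w @ h(W3 @ h(W2 @ h(W1 @ x)))))  # True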

Backpropagation Backward Pass

The backpropagation backward pass computes the partial derivatives. For a loss f, the partial derivatives in the last layer have the form

\frac{\partial f}{\partial w_c} = z^{i3}_c \, f'(w^T h(W^3 h(W^2 h(W^1 x^i)))),

where

z^{i3}_c = h\!\left( \sum_{c'=1}^{k} W^3_{cc'} \, h\!\left( \sum_{c''=1}^{k} W^2_{c'c''} \, h\!\left( \sum_{j=1}^{d} W^1_{c''j} x^i_j \right) \right) \right).

Written in terms of messages it simplifies to

\frac{\partial f}{\partial w_c} = h(M_c) f'(M_y).

In terms of forward messages, the partial derivatives have the forms:

\frac{\partial f}{\partial w_c} = h(M_c) f'(M_y),
\frac{\partial f}{\partial W^3_{cc'}} = h(M_{c'}) h'(M_c) w_c f'(M_y),
\frac{\partial f}{\partial W^2_{c'c''}} = h(M_{c''}) h'(M_{c'}) \sum_{c=1}^{k} W^3_{cc'} h'(M_c) w_c f'(M_y),
\frac{\partial f}{\partial W^1_{c''j}} = h(M_j) h'(M_{c''}) \sum_{c'=1}^{k} W^2_{c'c''} h'(M_{c'}) \sum_{c=1}^{k} W^3_{cc'} h'(M_c) w_c f'(M_y),

which are ugly but notice all the repeated calculations.

It’s again simpler using appropriate messages:

\frac{\partial f}{\partial w_c} = h(M_c) f'(M_y),
\frac{\partial f}{\partial W^3_{cc'}} = h(M_{c'}) h'(M_c) w_c V_y,
\frac{\partial f}{\partial W^2_{c'c''}} = h(M_{c''}) h'(M_{c'}) \sum_{c=1}^{k} W^3_{cc'} V_c,
\frac{\partial f}{\partial W^1_{c''j}} = h(M_j) h'(M_{c''}) \sum_{c'=1}^{k} W^2_{c'c''} V_{c'},

where M_j = x_j.

Backpropagation as Message-Passing

The general forward message for child c with parents p and weights W is

M_c = \sum_{p} W_{cp} \, h(M_p),

compute weighted combination of non-linearly transformed parents. In the first layer we don’t apply h. The general backward message from child c to all its parents is

V_c = h'(M_c) \sum_{c'} W_{c'c} V_{c'},

which weights the “grandchildren’s gradients”. Last layer uses f instead of h.

The gradient of W_{cp} is h(M_p) V_c, which works for general graphs.
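Putting the forward messages M, backward messages V, and the gradient rule h(M_p) V_c together for the 3-layer network with a squared-error loss, a minimal NumPy sketch (sizes and values are placeholders) with a finite-difference check:

    import numpy as np

    def h(a):
        return 1.0 / (1.0 + np.exp(-a))

    def hp(a):
        s = h(a)
        return s * (1.0 - s)   # h'(a) for the sigmoid

    rng = np.random.default_rng(0)
    d, k = 6, 4
    x, ytrue = rng.standard_normal(d), 1.3
    W1 = rng.standard_normal((k, d))
    W2 = rng.standard_normal((k, k))
    W3 = rng.standard_normal((k, k))
    w = rng.standard_normal(k)

    # Forward messages.
    M1 = W1 @ x; M2 = W2 @ h(M1); M3 = W3 @ h(M2); My = w @ h(M3)

    # Backward messages V_c = h'(M_c) * sum_{children c'} W_{c'c} V_{c'},
    # with squared-error loss f(M_y) = (1/2)(M_y - y)^2, so f'(M_y) = M_y - y.
    Vy = My - ytrue
    V3 = hp(M3) * (w * Vy)
    V2 = hp(M2) * (W3.T @ V3)
    V1 = hp(M1) * (W2.T @ V2)

    # Gradients: dW_{cp} = h(M_p) V_c (outer products), dw_c = h(M_c) f'(M_y).
    dw  = h(M3) * Vy
    dW3 = np.outer(V3, h(M2))
    dW2 = np.outer(V2, h(M1))
    dW1 = np.outer(V1, x)      # first layer: the parents are the raw x_j

    # Check one entry against finite differences.
    def loss(W1_):
        return 0.5 * (w @ h(W3 @ h(W2 @ h(W1_ @ x))) - ytrue) ** 2
    eps = 1e-6
    E = np.zeros_like(W1); E[0, 0] = eps
    print(dW1[0, 0], (loss(W1 + E) - loss(W1 - E)) / (2 * eps))  # should match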

Outline

1 Neural Networks Review

2 Deep Conditional Random Fields

3 Convolutional Neural Networks

Back to Structured Prediction

Recall that we’ve been discussing structured prediction

We may have multiple labels y_j or a general set of output objects.

We can also use deep neural networks for structured prediction.

Provides an alternative/complementary approach to using CRFs. We’ll build up to deep CRFs by parts...

Independent Logistic Regression

Our “base” model will be independent logistic regression with fixed features z_c.

For example: We have an image with pixels x_j (usually multiple channels, too). From the pixels x_j we extract features z_c (usually convolutions). From the features we predict the label of a pixel, e.g. y_3 (usually a linear classifier).

Structured prediction with the independent model classifies each y_j independently.

This model labels all y_j independently. Although some dependence exists because of the features.

Conditional Random Fields

Conditional random fields model dependencies in the yj.

Dependence is captured by a UGM on the y_j. Looks weird to have “directed” parents, but they are observed so they don’t induce dependencies.

Neural Networks

An alternative to fixed features is to learn the features using neural networks.

I’m using gray nodes because they’re not observed but also not random.

This multi-output network shares features z_c across “tasks”. It’s learning features that are good for predicting multiple labels.

Deep Neural Networks

Adding more layers increases our expressive power.

More powerful than any fixed feature set, but doesn’t model y_j dependence.

Conditional Neural Fields

Conditional neural fields learn a neural network and CRF simultaneously:

Because the z_{cc'} are deterministic, this does not increase the cost of inference.

We can think of conditional neural fields as conditional log-linear models,

p(y \mid x, v) = \frac{1}{Z} \exp(v^T F(y, x)),

but where F is the output of a neural network with its own parameters. We learn v and the neural net parameters jointly.

Computing the gradient:
1 Forward pass through neural network to get z_{cc'} values (gives F(y, x)).
2 Forward-backward algorithm to get marginals of y_j values.
3 Backwards pass through neural network to get all gradients.
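Steps 1 and 3 are the neural network passes sketched earlier; as a sketch of step 2, here is the forward-backward algorithm on a chain with binary labels, done in log-space for numerical stability (the unary scores stand in for the neural network outputs F, the pairwise scores for v, and all values below are placeholders):

    import numpy as np

    # Chain CRF with labels y_j in {0, 1} and length T.
    # unary[j, s]   : score of label s at position j (in a conditional neural
    #                 field this would come from the network output F).
    # pairwise[s, t]: score of transition (s, t), shared across the chain.
    rng = np.random.default_rng(0)
    T, S = 6, 2
    unary = rng.standard_normal((T, S))
    pairwise = rng.standard_normal((S, S))

    # Forward messages alpha and backward messages beta (log-space).
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = unary[0]
    for j in range(1, T):
        alpha[j] = unary[j] + np.logaddexp.reduce(
            alpha[j - 1][:, None] + pairwise, axis=0)
    for j in range(T - 2, -1, -1):
        beta[j] = np.logaddexp.reduce(
            pairwise + (unary[j + 1] + beta[j + 1])[None, :], axis=1)

    logZ = np.logaddexp.reduce(alpha[-1])
    marginals = np.exp(alpha + beta - logZ)   # p(y_j = s | x), one row per j
    print(marginals.sum(axis=1))              # each row sums to 1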

Beyond Combining CRFs and Neural Nets

Conditional random fields combine UGMs with supervised learning.

Conditional neural fields add deep learning to the mix.

But we said that UGMs are more powerful when combined with other tricks: Mixture models, latent factors, approximate inference.

Motivation: Gesture Recognition

Want to recognize gestures from video:

http://groups.csail.mit.edu/vision/vip/papers/wang06cvpr.pdf

A gesture is composed of a sequence of parts. And some parts appear in different gestures.

We have a label for the whole sequence (“gesture”) but no part labels. We don’t even know the set of possible parts.

http://groups.csail.mit.edu/vision/vip/papers/wang06cvpr.pdf

Generative Classifier based on an HMM

We could address this scenario using a generative HMM model.

Observed variable x_j is the image at time j (in this case x_j is a video frame). The gesture y is defined by a sequence of parts z_j. And we’re learning what the parts should be. But modelling p(x_j | z_j) is hard (probability of a video frame given the hidden part).

Hidden Conditional Random Field

A discriminative alternative is a hidden conditional random field.

The label y is based on a “hidden” CRF on the zj values. Again learns the parts as well as their temporal dependence.

Treats the x_j as fixed so we don’t need to model the video.

Motivation: Gesture Recognition

What if we want to label video with multiple potential gestures? We’re given a labeled video sequence.

http://www.lsi.upc.edu/~aquattoni/AllMyPapers/cvpr_07_L.pdf

Our videos are labeled with “gesture” and “background” frames, but we again don’t know the parts (G1, G2, G3, B1, B2, B3) that define the labels.

Latent-Dynamic Conditional Random Field

Here we could use a latent-dynamic conditional random field

The zj still capture “latent dynamics”, but we have a label yj for each time.

Notice in the above case that the conditional UGM is a tree.

Latent-Dynamic Conditional Neural Field

Latent-dynamic conditional neural fields also learn features with a neural network.

Combines deep learning, mixture models, and graphical models. This type of model is among the state of the art in several applications.

Outline

1 Neural Networks Review

2 Deep Conditional Random Fields

3 Convolutional Neural Networks

ImageNet Challenge

• Millions of labeled images, 1000 object classes.

http://www.image-net.org/challenges/LSVRC/2014/

• Object detection task: – Single label per image. – Humans: ~5% error.

https://ischlag.github.io/2016/04/05/important-ILSVRC-achievements/ http://arxiv.org/pdf/1409.0575v3.pdf http://arxiv.org/pdf/1409.4842v1.pdf


• 2015 winner: Microsoft – 3.6% error. – 152 layers.

Convolutional Neural Networks

• There are many heuristics to make deep learning work: – Parameter initialization and data transformations. – Setting the step size(s) in stochastic gradient. – Alternative non-linear functions like ReLU. – L2-regularization. – Early stopping. – Dropout. • Often, still not enough to get deep models working.

• Winning ImageNet models are all convolutional neural networks: – The W(m) are very sparse and have repeated parameters (“tied weights”). – Drastically reduces number of parameters (and speeds up training).

Need for Context

• Want to represent spatial “context” of a pixel neighborhood.

– Standard approach uses convolutions to represent neighbourhood. – Usually use multiple convolutions at multiple scales.

Image Convolution Examples

3D Convolution

Motivation for Convolutional Neural Networks

• Consider training neural networks on 256 by 256 images.

• Each zi in first layer has 65536 parameters (and 3x this for colour). – We want to avoid this huge number (due to storage and overfitting).

• Key idea: make Wxi act like convolutions (to make it smaller):

– Each row of W only applies to part of xi.

– Use the same parameters between rows.

2D Convolution as Matrix Multiplication

• 2D convolution: – Signal ‘x’, filter ‘w’, and output ‘z’ are now all images/matrices:

– Vectorized ‘z’ can be written as a matrix multiplication with vectorized ‘x’:
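A minimal NumPy sketch of this equivalence (image and filter sizes are placeholders, and the filtering is written as cross-correlation, as is usual for CNNs): each row of the big matrix is mostly zeros, only touches one patch of x, and repeats the same filter weights, which is exactly the sparse, tied-parameter W we want.

    import numpy as np

    rng = np.random.default_rng(0)
    H, W, F = 6, 6, 3                 # image size and filter size (placeholders)
    x = rng.standard_normal((H, W))   # "signal" x
    w = rng.standard_normal((F, F))   # filter w

    # Direct 2D filtering over the "valid" region.
    Hz, Wz = H - F + 1, W - F + 1
    z = np.zeros((Hz, Wz))
    for i in range(Hz):
        for j in range(Wz):
            z[i, j] = np.sum(w * x[i:i + F, j:j + F])

    # The same operation as a matrix multiply: each row of A is mostly zeros
    # and contains the filter weights at the positions of one image patch.
    A = np.zeros((Hz * Wz, H * W))
    for i in range(Hz):
        for j in range(Wz):
            row = np.zeros((H, W))
            row[i:i + F, j:j + F] = w
            A[i * Wz + j] = row.ravel()

    print(np.allclose(A @ x.ravel(), z.ravel()))  # True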

Using Fixed Convolutions

• Classic approach uses fixed convolutions as features: – Usually have different types/variances/orientations. – Can do subsampling or taking maxes across locations/orientations/scales.

Learning Convolutions

• Convolutional neural networks learn the features: – Learning ‘W’ and ‘w’ automatically chooses types/variances/orientations. – Can do multiple layers of convolution to get deep hierarchical features.

https://en.wikibooks.org/wiki/Sensory_Systems/Visual_Signal_Processing

Convolutional Neural Networks


• Convolutional neural networks classically have 3 layer “types”: – Fully connected layer: usual neural network layer with unrestricted W. – Convolutional layer: restrict W to results of several convolutions. – Pooling layer: downsamples result of convolution. • Can add invariances or just make the number of parameters smaller. • Usual choice is ‘max pooling’:
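A minimal sketch of 2-by-2 max pooling with stride 2 in NumPy (the input size is a placeholder):

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal((8, 8))       # output of a convolutional layer

    # 2x2 max pooling with stride 2: keep the max of each 2x2 block,
    # downsampling 8x8 -> 4x4 and adding a little translation invariance.
    pooled = z.reshape(4, 2, 4, 2).max(axis=(1, 3))
    print(pooled.shape)                   # (4, 4)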

LeNet for Optical Character Recognition (1998)

http://blog.csdn.net/strint/article/details/44163869

• Visualizing the “activations” of the layers: – http://scs.ryerson.ca/~aharley/vis/conv

http://scs.ryerson.ca/~aharley/vis/harley_vis_isvc15.pdf

Summary

• Neural networks learn features for supervised learning. • Backpropagation: message-passing algorithm to compute gradient. • Conditional neural fields combine CRFs with deep learning. – Latent-dynamic variants use a “hidden” CRF. • Convolutional neural networks: – Restrict W(m) matrices to represent sets of convolutions. – Often combined with max pooling.

• Next time: modern convolutional neural networks and applications. – , depth estimation, image colorization, artistic style.