
MIT Department of Brain and Cognitive Sciences
9.641J, Spring 2005 - Introduction to Neural Networks
Instructor: Professor Sebastian Seung

Backpropagation learning

Simple vs. multilayer perceptron

Hidden layer problem
• Radical change for the supervised learning problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.

Generalized delta rule
• The delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers.

Outline
• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma

Multilayer perceptron
• L layers of weights and biases
• L+1 layers of neurons

$$x^0 \xrightarrow{W^1,\,b^1} x^1 \xrightarrow{W^2,\,b^2} \cdots \xrightarrow{W^L,\,b^L} x^L$$

$$x_i^l = f\!\left( \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \right)$$

Reward function
• Depends on the activity of the output layer only: $R(x^L)$.
• Maximize reward with respect to weights and biases.

Example: squared error
• Square of desired minus actual output, with a minus sign.

$$R(x^L) = -\frac{1}{2} \sum_{i=1}^{n_L} \left( d_i - x_i^L \right)^2$$

Forward pass
For l = 1 to L,

$$u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l, \qquad x_i^l = f(u_i^l)$$

Sensitivity computation
• The sensitivity is also called "delta."

$$s_i^L = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L} = f'(u_i^L)\,(d_i - x_i^L)$$

Backward pass
For l = L to 2,

$$s_j^{l-1} = f'(u_j^{l-1}) \sum_{i=1}^{n_l} s_i^l W_{ij}^l$$

Learning update
• In any order

$$\Delta W_{ij}^l = \eta\, s_i^l x_j^{l-1}, \qquad \Delta b_i^l = \eta\, s_i^l$$

Backprop is a gradient update
• Consider R as a function of the weights and biases.

$$\frac{\partial R}{\partial W_{ij}^l} = s_i^l x_j^{l-1}, \qquad \Delta W_{ij}^l = \eta\, \frac{\partial R}{\partial W_{ij}^l}$$

$$\frac{\partial R}{\partial b_i^l} = s_i^l, \qquad \Delta b_i^l = \eta\, \frac{\partial R}{\partial b_i^l}$$

Sensitivity lemma
• Sensitivity matrix = outer product
  – sensitivity vector
  – activity vector

$$\frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1}$$

• The sensitivity vector is sufficient.
• Generalization of "delta."

Coordinate transformation

$$u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l f(u_j^{l-1}) + b_i^l$$

$$\frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l f'(u_j^{l-1}), \qquad \frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l}$$

Output layer

$$x_i^L = f(u_i^L), \qquad u_i^L = \sum_j W_{ij}^L x_j^{L-1} + b_i^L$$

$$\frac{\partial R}{\partial b_i^L} = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L}$$

Chain rule
• Composition of two functions: view $u^{l-1} \to R$ as $u^{l-1} \to u^l \to R$.

$$\frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l}\, \frac{\partial u_i^l}{\partial u_j^{l-1}}$$

$$\frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l}\, W_{ij}^l f'(u_j^{l-1})$$

Computational complexity
• Naïve estimate
  – network output: order N
  – each component of the gradient: order N
  – N components: order N²
• With backprop: order N

Biological plausibility
• Local: pre- and postsynaptic variables

$$x_j^{l-1} \xrightarrow{W_{ij}^l} x_i^l, \qquad s_j^{l-1} \xleftarrow{W_{ij}^l} s_i^l$$

• Forward and backward passes use the same weights.
• Extra set of variables

Backprop for brain modeling
• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
  – train a network
  – compare its hidden neurons with those found in the brain.

LeNet
• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs

Machine learning revolution
• Gradient following
  – or other hill-climbing methods
• Empirical error minimization
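The slides above fully specify the algorithm: forward pass, sensitivity computation, backward pass, and learning update. Below is a minimal NumPy sketch of those four steps for the squared-error reward. The sigmoid choice of f, the learning-rate value, and the helper names `forward`, `backward`, and `train_step` are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def f(u):
    """Sigmoidal neuron, f(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def f_prime(u):
    """Derivative of the sigmoid, f'(u) = f(u) (1 - f(u))."""
    s = f(u)
    return s * (1.0 - s)

def forward(x0, W, b):
    """Forward pass: u^l = W^l x^{l-1} + b^l,  x^l = f(u^l), for l = 1..L."""
    xs, us = [x0], []
    for Wl, bl in zip(W, b):
        u = Wl @ xs[-1] + bl
        us.append(u)
        xs.append(f(u))
    return xs, us

def backward(xs, us, W, d):
    """Backward pass for R = -1/2 * sum_i (d_i - x_i^L)^2.

    Sensitivities: s^L     = f'(u^L) * (d - x^L),
                   s^{l-1} = f'(u^{l-1}) * (W^l)^T s^l,  for l = L..2.
    """
    L = len(W)
    s = [None] * L
    s[L - 1] = f_prime(us[L - 1]) * (d - xs[L])
    for l in range(L - 1, 0, -1):
        s[l - 1] = f_prime(us[l - 1]) * (W[l].T @ s[l])
    return s

def train_step(x0, d, W, b, eta=0.1):
    """Learning update: dW^l = eta * s^l (x^{l-1})^T,  db^l = eta * s^l."""
    xs, us = forward(x0, W, b)
    s = backward(xs, us, W, d)
    for l in range(len(W)):
        W[l] += eta * np.outer(s[l], xs[l])
        b[l] += eta * s[l]
    return xs[-1]
```

Repeatedly calling `train_step` on (input, target) pairs performs gradient ascent on R; as the "Learning update" slide notes, the individual weight and bias updates within a step can be applied in any order, since they all use sensitivities from the same forward pass.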
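To illustrate the "Backprop is a gradient update" and "Computational complexity" slides, here is a hedged sketch of a finite-difference check: estimating the gradient numerically costs one extra order-N forward pass per parameter (order N² overall), whereas the backward pass yields the same quantities, ∂R/∂W^l_{ij} = s^l_i x^{l-1}_j, in a single order-N sweep. The code below assumes the `forward` and `backward` helpers from the backprop sketch and an illustrative step size `eps`.

```python
import numpy as np

def numerical_gradient(x0, d, W, b, eps=1e-6):
    """Finite-difference estimate of dR/dW^l for R = -1/2 sum_i (d_i - x_i^L)^2.

    Each weight is perturbed separately, so the cost is one forward pass
    (order N) per parameter, i.e. order N^2 in total.
    """
    def reward(W, b):
        xL = forward(x0, W, b)[0][-1]
        return -0.5 * np.sum((d - xL) ** 2)

    grads = []
    for l in range(len(W)):
        g = np.zeros_like(W[l])
        for idx in np.ndindex(W[l].shape):
            W[l][idx] += eps
            r_plus = reward(W, b)
            W[l][idx] -= 2 * eps
            r_minus = reward(W, b)
            W[l][idx] += eps                     # restore the weight
            g[idx] = (r_plus - r_minus) / (2 * eps)
        grads.append(g)
    return grads

def backprop_gradient(x0, d, W, b):
    """Same gradient from one forward/backward sweep (order N):
    dR/dW^l_{ij} = s^l_i x^{l-1}_j, the outer product of the sensitivity lemma."""
    xs, us = forward(x0, W, b)
    s = backward(xs, us, W, d)
    return [np.outer(s[l], xs[l]) for l in range(len(W))]
```

The two results should agree to within finite-difference error, which is the usual sanity check that the backward pass computes the gradient of R.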