
MIT Department of Brain and Cognitive Sciences 9.641J, Spring 2005 - Introduction to Neural Networks Instructor: Professor Sebastian Seung

Backpropagation learning

Simple vs. multilayer

Hidden problem

• Radical change for the problem.
• No desired values for the hidden layer.
• The network must find its own hidden layer activations.

Generalized delta rule

• Delta rule only works for the output layer.
• Backpropagation, or the generalized delta rule, is a way of creating desired values for hidden layers.

Outline

• The algorithm
• Derivation as a gradient algorithm
• Sensitivity lemma

• L layers of weights and biases
• L+1 layers of neurons

$$ x^0 \xrightarrow{W^1,\,b^1} x^1 \xrightarrow{W^2,\,b^2} \cdots \xrightarrow{W^L,\,b^L} x^L $$

$$ x_i^l = f\!\left( \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l \right) $$
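The slides give no code; as a concrete illustration of the indexing above, here is a minimal numpy sketch. The layer sizes, the logistic choice of f, and all variable names are assumptions of mine. It builds L = 2 layers of weights and biases for L+1 = 3 layers of neurons and applies one layer of the update.

```python
import numpy as np

def f(u):
    # Example choice of the activation function f: logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical layer sizes: n_0 = 3 inputs, n_1 = 4 hidden units, n_2 = 2 outputs (L = 2).
sizes = [3, 4, 2]
rng = np.random.default_rng(0)

# L layers of weights and biases: W[l-1] is W^l with shape (n_l, n_{l-1}); b[l-1] is b^l.
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

# One application of x_i^l = f(sum_j W_ij^l x_j^{l-1} + b_i^l), written as a matrix product.
x0 = rng.standard_normal(sizes[0])   # layer 0 activity (the input)
x1 = f(W[0] @ x0 + b[0])             # layer 1 activity
```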

Reward function

• Depends on activity of the output layer only.

$$ R(x^L) $$

• Maximize reward with respect to weights and biases.

Example: squared error

• Square of desired minus actual output, with minus sign.

$$ R(x^L) = -\frac{1}{2} \sum_{i=1}^{n_L} \left( d_i - x_i^L \right)^2 $$
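A direct numpy transcription of this reward, assuming the desired and actual outputs are given as arrays d and xL (the names are mine, not the slides'):

```python
import numpy as np

def reward(d, xL):
    # R(x^L) = -1/2 * sum_i (d_i - x_i^L)^2
    # Squared error with a minus sign, so larger R means a better fit
    # and learning can maximize R.
    return -0.5 * np.sum((d - xL) ** 2)

# A perfect output gives the maximum reward, zero.
print(reward(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 0.0
print(reward(np.array([1.0, 0.0]), np.array([0.5, 0.5])))   # -0.25
```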

Forward pass

For l = 1 to L,

$$ u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l x_j^{l-1} + b_i^l, \qquad x_i^l = f(u_i^l) $$
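A sketch of this loop in numpy, under the same assumed representation as the earlier sketch (lists of per-layer weight matrices and bias vectors; function and variable names are mine):

```python
import numpy as np

def f(u):
    # Example activation function (logistic sigmoid); the slides leave f generic.
    return 1.0 / (1.0 + np.exp(-u))

def forward(W, b, x0):
    """Forward pass: for l = 1..L, u^l = W^l x^{l-1} + b^l and x^l = f(u^l).

    W and b are assumed to be lists of per-layer weight matrices and bias vectors.
    The pre-activations u^l are kept because the backward pass needs f'(u^l).
    """
    xs, us = [x0], []
    for Wl, bl in zip(W, b):
        u = Wl @ xs[-1] + bl        # u_i^l = sum_j W_ij^l x_j^{l-1} + b_i^l
        us.append(u)
        xs.append(f(u))             # x_i^l = f(u_i^l)
    return xs, us                   # xs[l] is x^l, us[l-1] is u^l
```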

Sensitivity computation

• The sensitivity is also called “delta.”

$$ s_i^L = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L} = f'(u_i^L)\,(d_i - x_i^L) $$
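For the squared-error reward, the output-layer sensitivity is a one-liner; the logistic f and its derivative are example choices I am assuming, not something the slides fix:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def fprime(u):
    # Derivative of the logistic sigmoid: f'(u) = f(u) * (1 - f(u)).
    y = f(u)
    return y * (1.0 - y)

def output_sensitivity(uL, xL, d):
    # s_i^L = f'(u_i^L) * dR/dx_i^L = f'(u_i^L) * (d_i - x_i^L)
    # for the squared-error reward of the previous slide.
    return fprime(uL) * (d - xL)
```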

Backward pass

For l = L to 2,

$$ s_j^{l-1} = f'(u_j^{l-1}) \sum_{i=1}^{n_l} s_i^l\, W_{ij}^l $$
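A sketch of the backward recursion under the same assumed data layout, propagating sensitivities from layer L down to layer 1 through the transposed weight matrices:

```python
import numpy as np

def backward(W, us, sL, fprime):
    """Backward pass: s_j^{l-1} = f'(u_j^{l-1}) * sum_i s_i^l W_ij^l, for l = L down to 2.

    W is the assumed list of weight matrices (W[l-1] is W^l), us the stored
    pre-activations (us[l-1] is u^l), and sL the output-layer sensitivity s^L.
    """
    L = len(W)
    ss = [None] * L
    ss[L - 1] = sL
    for l in range(L, 1, -1):                 # layer index l = L, ..., 2
        # (W^l)^T s^l computes sum_i s_i^l W_ij^l for every j.
        ss[l - 2] = fprime(us[l - 2]) * (W[l - 1].T @ ss[l - 1])
    return ss                                 # ss[l-1] is s^l for l = 1..L
```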

Learning update

• In any order

$$ \Delta W_{ij}^l = \eta\, s_i^l\, x_j^{l-1}, \qquad \Delta b_i^l = \eta\, s_i^l $$
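The corresponding parameter updates as numpy operations; eta is the learning rate, and the outer product implements the product of sensitivity and activity (names and layout follow my earlier sketches):

```python
import numpy as np

def update(W, b, xs, ss, eta):
    """Apply Delta W_ij^l = eta * s_i^l * x_j^{l-1} and Delta b_i^l = eta * s_i^l.

    xs and ss are the activities and sensitivities from the forward and
    backward passes above; updates are in place, and their order does not matter.
    """
    for l in range(len(W)):                    # 0-based index: W[l] is W^{l+1}
        W[l] += eta * np.outer(ss[l], xs[l])   # outer product of s^{l+1} and x^l
        b[l] += eta * ss[l]
```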

Backprop is a gradient update

• Consider R as a function of the weights and biases.

$$ \frac{\partial R}{\partial W_{ij}^l} = s_i^l\, x_j^{l-1}, \qquad \Delta W_{ij}^l = \eta\, \frac{\partial R}{\partial W_{ij}^l} $$

$$ \frac{\partial R}{\partial b_i^l} = s_i^l, \qquad \Delta b_i^l = \eta\, \frac{\partial R}{\partial b_i^l} $$
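This equivalence can be checked numerically. The following self-contained sketch is a sanity check I am adding, not part of the original slides; the network sizes, seed, and epsilon are arbitrary. It compares the backprop value of one gradient component against a central finite-difference estimate:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def fprime(u):
    y = f(u)
    return y * (1.0 - y)

def reward(d, xL):
    return -0.5 * np.sum((d - xL) ** 2)

def backprop_gradients(W, b, x0, d):
    # Forward pass, storing pre-activations u^l and activities x^l.
    xs, us = [x0], []
    for Wl, bl in zip(W, b):
        us.append(Wl @ xs[-1] + bl)
        xs.append(f(us[-1]))
    # Backward pass: s^L, then s^{l-1} = f'(u^{l-1}) * (W^l)^T s^l.
    ss = [None] * len(W)
    ss[-1] = fprime(us[-1]) * (d - xs[-1])
    for l in range(len(W) - 1, 0, -1):
        ss[l - 1] = fprime(us[l - 1]) * (W[l].T @ ss[l])
    # dR/dW_ij^l = s_i^l x_j^{l-1},  dR/db_i^l = s_i^l
    return [np.outer(s, x) for s, x in zip(ss, xs[:-1])], ss

# Tiny random network.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [rng.standard_normal(m) for m in sizes[1:]]
x0 = rng.standard_normal(sizes[0])
d = rng.standard_normal(sizes[-1])

dW, db = backprop_gradients(W, b, x0, d)

# Central-difference estimate of dR/dW_00^1, for comparison with backprop.
eps = 1e-6

def run_reward(W00):
    Wp = [w.copy() for w in W]
    Wp[0][0, 0] = W00
    x = x0
    for Wl, bl in zip(Wp, b):
        x = f(Wl @ x + bl)
    return reward(d, x)

numeric = (run_reward(W[0][0, 0] + eps) - run_reward(W[0][0, 0] - eps)) / (2 * eps)
print(numeric, dW[0][0, 0])   # the two numbers should agree closely
```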

Sensitivity lemma

• Sensitivity matrix = outer product
  – sensitivity vector
  – activity vector

$$ \frac{\partial R}{\partial W_{ij}^l} = \frac{\partial R}{\partial b_i^l}\, x_j^{l-1} $$

• The sensitivity vector is sufficient.
• Generalization of “delta.”

Coordinate transformation

$$ u_i^l = \sum_{j=1}^{n_{l-1}} W_{ij}^l\, f(u_j^{l-1}) + b_i^l $$

$$ \frac{\partial u_i^l}{\partial u_j^{l-1}} = W_{ij}^l\, f'(u_j^{l-1}) $$

$$ \frac{\partial R}{\partial u_i^l} = \frac{\partial R}{\partial b_i^l} $$

Output layer

$$ x_i^L = f(u_i^L), \qquad u_i^L = \sum_j W_{ij}^L x_j^{L-1} + b_i^L $$

$$ \frac{\partial R}{\partial b_i^L} = f'(u_i^L)\, \frac{\partial R}{\partial x_i^L} $$

Chain rule

• Composition of two functions

$$ u^{l-1} \to R, \qquad u^{l-1} \to u^l \to R $$

$$ \frac{\partial R}{\partial u_j^{l-1}} = \sum_i \frac{\partial R}{\partial u_i^l}\, \frac{\partial u_i^l}{\partial u_j^{l-1}} $$

$$ \frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l}\, W_{ij}^l\, f'(u_j^{l-1}) $$
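Substituting the identification s_i^l = ∂R/∂b_i^l from the sensitivity lemma turns this chain-rule expression into the backward-pass recursion stated earlier (a one-line step spelled out here for completeness):

$$ s_j^{l-1} = \frac{\partial R}{\partial b_j^{l-1}} = \sum_i \frac{\partial R}{\partial b_i^l}\, W_{ij}^l\, f'(u_j^{l-1}) = f'(u_j^{l-1}) \sum_{i=1}^{n_l} s_i^l\, W_{ij}^l $$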

Computational complexity

• Naïve estimate
  – network output: order N
  – each component of the gradient: order N
  – N components: order N²
• With backprop: order N

Biological plausibility

• Local: pre- and postsynaptic variables

$$ x_j^{l-1} \xrightarrow{\;W_{ij}^l\;} x_i^l, \qquad s_j^{l-1} \xleftarrow{\;W_{ij}^l\;} s_i^l $$

• Forward and backward passes use same weights
• Extra set of variables

Backprop for brain modeling

• Backprop may not be a plausible account of learning in the brain.
• But perhaps the networks created by it are similar to biological neural networks.
• Zipser and Andersen:
  – train network
  – compare hidden neurons with those found in the brain.

LeNet

• Weight-sharing
• Sigmoidal neurons
• Learn binary outputs

Revolution

• Gradient following
  – or other hill-climbing methods
• Empirical error minimization