
Lecture 5: Value Function Approximation

Emma Brunskill

CS234: Reinforcement Learning

Winter 2018

The value function approximation structure for today closely follows much of David Silver's Lecture 6. For additional reading, please see Sutton and Barto 2018, Sections 9.3 and 9.6-9.7. The slides come almost exclusively from Ruslan Salakhutdinov's class and Hugo Larochelle's class (with thanks also to Zico Kolter for slide inspiration). The slides in my standard style format in the deep learning section are my own.

Important Information About Homework 2

Homework 2 will now be due on Saturday, February 10 (instead of February 7).
We are making this change to try to give some background on deep learning, give people enough time to do Homework 2, and still give people time to study for the midterm on February 14.
We will release the homework this week.
You will be able to start on some aspects of the homework this week, but we will be covering DQN, which is the largest part, on Monday.
We will also be providing optional tutorial sessions on tensorflow.

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

4 Deep Learning

Class Structure

Last time: control (making decisions) without a model of how the world works
This time: value function approximation and deep learning
Next time: deep reinforcement learning

Last time: Model-Free Control

Last time: how to learn a good policy from experience.
So far, we have been assuming we can represent the value function or state-action value function as a vector (a tabular representation).
Many real world problems have enormous state and/or action spaces.
A tabular representation is insufficient.

Recall: Reinforcement Learning Involves

Optimization
Delayed consequences
Exploration
Generalization

Today: Focus on Generalization

Optimization
Delayed consequences
Exploration
Generalization (today's focus)

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

4 Deep Learning

Value Function Approximation (VFA)

Represent a (state-action/state) value function with a parameterized function instead of a table

Motivation for VFA

Don't want to have to explicitly store or learn, for every single state:
a dynamics or reward model
a value
a state-action value
a policy
Want a more compact representation that generalizes across states, or across states and actions.

Benefits of Generalization

Reduce memory needed to store (P, R) / V / Q / π
Reduce computation needed to compute (P, R) / V / Q / π
Reduce experience needed to find a good (P, R) / V / Q / π

Value Function Approximation (VFA)

Represent a (state-action/state) value function with a parameterized function instead of a table

Which function approximator?

Function Approximators

Many possible function approximators, including:
Linear combinations of features
Neural networks
Decision trees
Nearest neighbors
Fourier / wavelet bases
In this class we will focus on function approximators that are differentiable (Why?)
Two very popular classes of differentiable function approximators:
Linear feature representations (today)
Neural networks (today and next lecture)

Review: Gradient Descent

Consider a function J(w) that is a differentiable function of a parameter vector w.
Goal is to find the parameter w that minimizes J.
The gradient of J(w) is ∇_w J(w) = (∂J(w)/∂w_1, ..., ∂J(w)/∂w_n)^T, and gradient descent repeatedly moves w in the direction of the negative gradient: Δw = -½ α ∇_w J(w), where α is a step-size parameter.
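To make the procedure concrete, here is a minimal gradient descent sketch in Python/NumPy; the quadratic objective, step size, and iteration count are illustrative choices, not anything specified in the slides.

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, n_steps=100):
    """Minimize a differentiable J(w), given a function returning its gradient."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= alpha * grad_J(w)   # move in the direction of the negative gradient
    return w

# Illustrative objective: J(w) = ||w - 3||^2, whose gradient is 2 (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=[0.0, 0.0])
print(w_min)  # converges toward [3, 3]
```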

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

4 Deep Learning

Value Function Approximation for Policy Evaluation with an Oracle

First consider what we could do if we could query any state s and an oracle would return the true value v^π(s).
The objective is then to find the best approximate representation of v^π given a particular parameterized function class.

Stochastic Gradient Descent

Goal: find the parameter vector w that minimizes the loss between the true value function v^π(s) and its approximation v̂(s, w), as represented with a particular function class parameterized by w.
Generally we use the mean squared error and define the loss as

J(w) = E_π[(v^π(S) - v̂(S, w))²]   (1)

Can use gradient descent to find a local minimum:

Δw = -½ α ∇_w J(w)   (2)

Stochastic gradient descent (SGD) samples the gradient:

Δw = α (v^π(S) - v̂(S, w)) ∇_w v̂(S, w)

In expectation, the SGD update is the same as the full gradient update.
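A minimal sketch of this oracle setting in Python/NumPy, assuming a linear approximator v̂(s, w) = x(s)^T w; the feature map and the oracle value function below are made-up placeholders.

```python
import numpy as np

def sgd_policy_eval_with_oracle(states, features, oracle_value, alpha=0.01, n_passes=50):
    """SGD on J(w) = E[(v_pi(S) - x(S)^T w)^2] using oracle targets."""
    w = np.zeros(len(features(states[0])))
    for _ in range(n_passes):
        for s in states:
            x = features(s)
            target = oracle_value(s)            # oracle gives the true v_pi(s)
            w += alpha * (target - x @ w) * x   # sampled gradient step
    return w

# Toy usage: 1-D states, polynomial features, and a made-up "true" value function.
states = np.linspace(-1.0, 1.0, 21)
features = lambda s: np.array([1.0, s, s ** 2])
oracle_value = lambda s: 2.0 + 0.5 * s - s ** 2
print(sgd_policy_eval_with_oracle(states, features, oracle_value))  # approaches [2.0, 0.5, -1.0]
```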

VFA Prediction Without An Oracle

We don't actually have access to an oracle to tell us the true value v^π(s) for any state s.
Now consider how to do value function approximation for prediction / evaluation / policy evaluation without a model.
Note: policy evaluation without a model is sometimes also called passive reinforcement learning with value function approximation; "passive" because we are not trying to learn the optimal decision policy.

Model Free VFA Prediction / Policy Evaluation

Recall model-free policy evaluation (Lecture 3):
Following a fixed policy π (or with access to prior data)
Goal is to estimate V^π and/or Q^π
Maintained a lookup table to store estimates of V^π and/or Q^π
Updated these estimates after each episode (Monte Carlo methods) or after each step (TD methods)
Now: in value function approximation, change the estimate update step to include fitting the function approximator.

Feature Vectors

Use a feature vector to represent a state:

x(s) = (x_1(s), x_2(s), ..., x_n(s))^T   (3)
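As an illustration (not from the slides), one hypothetical way to build such a feature vector for a one-dimensional state is a bias term plus a few radial-basis bumps:

```python
import numpy as np

def rbf_features(s, centers=(-1.0, 0.0, 1.0), width=0.5):
    """Feature vector x(s): a bias term plus one RBF bump per center."""
    bumps = [np.exp(-((s - c) ** 2) / (2 * width ** 2)) for c in centers]
    return np.array([1.0] + bumps)

print(rbf_features(0.2))  # length n = 1 + number of centers
```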

Linear Value Function Approximation for Prediction With An Oracle

Represent a value function (or state-action value function) for a particular policy with a weighted linear combination of features:

v̂(s, w) = Σ_{j=1}^{n} x_j(s) w_j = x(s)^T w

The objective function is

J(w) = E_π[(v^π(S) - v̂(S, w))²]

Recall the weight update is

Δw = -½ α ∇_w J(w)   (4)

For the linear case the update is therefore

Δw = α (v^π(S) - v̂(S, w)) x(S), i.e., update = step-size × prediction error × feature value

Monte Carlo Value Function Approximation

The return G_t is an unbiased but noisy sample of the true expected return v^π(S_t).
Therefore we can reduce MC VFA to doing supervised learning on a set of (state, return) pairs: ⟨S_1, G_1⟩, ⟨S_2, G_2⟩, ..., ⟨S_T, G_T⟩
Substitute G_t for the true v^π(S_t) when fitting the function approximator.
Concretely, when using linear VFA for policy evaluation:

Δw = α (G_t - v̂(S_t, w)) ∇_w v̂(S_t, w)   (5)
   = α (G_t - v̂(S_t, w)) x(S_t)   (6)

Note: G_t may be a very noisy estimate of the true return.

MC Linear Value Function Approximation for Policy Evaluation

1: Initialize w = 0, Returns(s) = 0 for all s, k = 1
2: loop
3:   Sample the k-th episode (s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, ..., s_{k,L_k}) given π
4:   for t = 1, ..., L_k do
5:     if first visit to state s_t in episode k then
6:       Append G_t = Σ_{j=t}^{L_k} r_{k,j} to Returns(s_t)
7:       Update weights: w ← w + α (G_t - v̂(s_t, w)) x(s_t)
8:     end if
9:   end for
10:  k = k + 1
11: end loop
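A compact Python/NumPy sketch of this loop, assuming episodes are given as lists of (state, reward) pairs with hashable states and that a feature function x(s) is available; the incremental weight update follows equation (6).

```python
import numpy as np

def mc_linear_policy_eval(episodes, features, alpha=0.01, gamma=1.0, n_features=3):
    """First-visit Monte Carlo policy evaluation with a linear value function."""
    w = np.zeros(n_features)
    for episode in episodes:                    # each episode: [(s_1, r_1), ..., (s_L, r_L)]
        # Compute the return G_t from every time step to the end of the episode.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        visited = set()
        for t, (s, _) in enumerate(episode):
            if s in visited:                    # first-visit check
                continue
            visited.add(s)
            x = features(s)
            w += alpha * (returns[t] - x @ w) * x   # w <- w + alpha (G_t - v_hat) x(s_t)
    return w

# Toy usage: two short episodes over integer states, with x(s) = [1, s, s^2].
features = lambda s: np.array([1.0, s, s * s])
episodes = [[(0, 1.0), (1, 0.0), (2, 5.0)], [(1, 2.0), (2, 3.0)]]
print(mc_linear_policy_eval(episodes, features, alpha=0.05))
```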

Recall: Temporal Difference (TD(0)) Learning with a Lookup Table

Uses bootstrapping and sampling to approximate V^π
Updates V^π(s) after each transition (s, a, r, s′):

V^π(s) = V^π(s) + α (r + γ V^π(s′) - V^π(s))   (7)

The target is r + γ V^π(s′), a biased estimate of the true value v^π(s).
A lookup table represents the value of each state with a separate table entry.

Temporal Difference (TD(0)) Learning with Value Function Approximation

Uses bootstrapping and sampling to approximate the true v^π
Updates the estimate V^π(s) after each transition (s, a, r, s′):

V^π(s) = V^π(s) + α (r + γ V^π(s′) - V^π(s))   (8)

The target is r + γ V^π(s′), a biased estimate of the true value v^π(s).
In value function approximation, the target is r + γ v̂^π(s′, w), a biased and approximated estimate of the true value v^π(s).
Three forms of approximation: sampling, bootstrapping, and value function approximation.

Temporal Difference (TD(0)) Learning with Value Function Approximation

In value function approximation, the target is r + γ v̂^π(s′, w), a biased and approximated estimate of the true value v^π(s).
This is supervised learning on a different set of data pairs: ⟨S_1, r_1 + γ v̂^π(S_2, w)⟩, ⟨S_2, r_2 + γ v̂^π(S_3, w)⟩, ...

Temporal Difference (TD(0)) Learning with Value Function Approximation

In value function approximation, the target is r + γ v̂^π(s′, w), a biased and approximated estimate of the true value v^π(s).
Supervised learning on a different set of data pairs: ⟨S_1, r_1 + γ v̂^π(S_2, w)⟩, ⟨S_2, r_2 + γ v̂^π(S_3, w)⟩, ...
In linear TD(0):

Δw = α (r + γ v̂^π(s′, w) - v̂^π(s, w)) ∇_w v̂^π(s, w)   (9)
   = α (r + γ v̂^π(s′, w) - v̂^π(s, w)) x(s)   (10)
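A minimal sketch of the linear TD(0) update in Python/NumPy, assuming transitions are supplied as (s, r, s_next, done) tuples and a feature function x(s); all names are illustrative.

```python
import numpy as np

def td0_linear_policy_eval(transitions, features, alpha=0.01, gamma=0.99, n_features=3):
    """Linear TD(0): w <- w + alpha * (r + gamma * v_hat(s') - v_hat(s)) * x(s)."""
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        x = features(s)
        v_hat = x @ w
        v_hat_next = 0.0 if done else features(s_next) @ w   # no bootstrap from terminal states
        td_error = r + gamma * v_hat_next - v_hat
        w += alpha * td_error * x
    return w

# Toy usage with integer states and x(s) = [1, s, s^2].
transitions = [(0, 1.0, 1, False), (1, 0.0, 2, False), (2, 5.0, 2, True)]
features = lambda s: np.array([1.0, s, s * s])
print(td0_linear_policy_eval(transitions, features))
```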

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation

Define the mean squared error of a linear value function approximation for a particular policy π relative to the true value as

MSVE(w) = Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (11)

where d(s) is the stationary distribution of π in the true decision process, and v̂^π(s, w) = x(s)^T w is a linear value function approximation.

(Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997. https://web.stanford.edu/~bvr/pubs/td.pdf)

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation

Define the mean squared error of a linear value function approximation for a particular policy π relative to the true value as

MSVE(w) = Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (12)

where d(s) is the stationary distribution of π in the true decision process, and v̂^π(s, w) = x(s)^T w is a linear value function approximation.

Monte Carlo policy evaluation with VFA converges to the weights w_MC which have the minimum mean squared error possible:

MSVE(w_MC) = min_w Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (13)

(ibid.)

Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation

Define the mean squared error of a linear value function approximation for a particular policy π relative to the true value as

MSVE(w) = Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (14)

where d(s) is the stationary distribution of π in the true decision process, and v̂^π(s, w) = x(s)^T w is a linear value function approximation.

TD(0) policy evaluation with VFA converges to weights w_TD which are within a constant factor of the minimum mean squared error possible:

MSVE(w_TD) ≤ (1 / (1 - γ)) min_w Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (15)

(ibid.)

Summary: Convergence Guarantees for Linear Value Function Approximation for Policy Evaluation

Monte Carlo policy evaluation with VFA converges to the weights w_MC which have the minimum mean squared error possible:

MSVE(w_MC) = min_w Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (16)

TD(0) policy evaluation with VFA converges to weights w_TD which are within a constant factor of the minimum mean squared error possible:

MSVE(w_TD) ≤ (1 / (1 - γ)) min_w Σ_{s∈S} d(s) (v^π(s) - v̂^π(s, w))²   (17)

Check your understanding: if the VFA is a tabular representation (one feature for each state), what is the MSVE for MC and TD?

Convergence Rates for Linear Value Function Approximation for Policy Evaluation

Does TD or MC converge faster to a fixed point?
Not (to my knowledge) definitively understood.
Practically, TD learning often converges faster to its fixed value function approximation point.

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

4 Deep Learning

Control using Value Function Approximation

Use value function approximation to represent state-action values: q̂^π(s, a, w) ≈ q^π
Interleave:
Approximate policy evaluation using value function approximation
Perform ε-greedy policy improvement

Action-Value Function Approximation with an Oracle

q̂^π(s, a, w) ≈ q^π
Minimize the mean squared error between the true action-value function q^π(s, a) and the approximate action-value function:

J(w) = E_π[(q^π(s, a) - q̂^π(s, a, w))²]   (18)

Use stochastic gradient descent to find a local minimum:

-½ ∇_w J(w) = E_π[(q^π(s, a) - q̂^π(s, a, w)) ∇_w q̂^π(s, a, w)]   (19)
Δw = -½ α ∇_w J(w)   (20)

Stochastic gradient descent (SGD) samples the gradient.

Linear State Action Value Function Approximation with an Oracle

Use features to represent both the state and action:

x(s, a) = (x_1(s, a), x_2(s, a), ..., x_n(s, a))^T   (21)

Represent the state-action value function with a weighted linear combination of features:

q̂(s, a, w) = x(s, a)^T w = Σ_{j=1}^{n} x_j(s, a) w_j   (22)

Stochastic gradient descent update:

∇_w J(w) = ∇_w E_π[(q^π(s, a) - q̂^π(s, a, w))²]   (23)

Incremental Model-Free Control Approaches

Similar to policy evaluation, the true state-action value function for a state is unknown, so we substitute a target value.

In Monte Carlo methods, use the return G_t as a substitute target:

Δw = α (G_t - q̂(s_t, a_t, w)) ∇_w q̂(s_t, a_t, w)   (24)

For SARSA, instead use a TD target r + γ q̂(s′, a′, w), which leverages the current function approximation value:

Δw = α (r + γ q̂(s′, a′, w) - q̂(s, a, w)) ∇_w q̂(s, a, w)   (25)

Incremental Model-Free Control Approaches

Similar to policy evaluation, the true state-action value function for a state is unknown, so we substitute a target value.

In Monte Carlo methods, use the return G_t as a substitute target:

Δw = α (G_t - q̂(s_t, a_t, w)) ∇_w q̂(s_t, a_t, w)   (26)

For SARSA, instead use a TD target r + γ q̂(s′, a′, w), which leverages the current function approximation value:

Δw = α (r + γ q̂(s′, a′, w) - q̂(s, a, w)) ∇_w q̂(s, a, w)   (27)

For Q-learning, instead use a TD target r + γ max_{a′} q̂(s′, a′, w), which leverages the max of the current function approximation value:

Δw = α (r + γ max_{a′} q̂(s′, a′, w) - q̂(s, a, w)) ∇_w q̂(s, a, w)   (28)
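A sketch of Q-learning with a linear approximator in Python/NumPy, assuming a state-action feature function x(s, a), a small discrete action set, and a hypothetical environment object whose reset() returns a state and whose step(a) returns (next_state, reward, done); none of these interfaces are prescribed by the slides.

```python
import numpy as np

def q_learning_linear(env, features, actions, n_features,
                      alpha=0.05, gamma=0.99, epsilon=0.1, n_episodes=200):
    """Q-learning with q_hat(s, a, w) = x(s, a)^T w and an epsilon-greedy policy."""
    w = np.zeros(n_features)
    q_hat = lambda s, a: features(s, a) @ w
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if np.random.rand() < epsilon:                      # epsilon-greedy exploration
                a = int(np.random.choice(actions))
            else:
                a = max(actions, key=lambda act: q_hat(s, act))
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(q_hat(s_next, act) for act in actions)
            x = features(s, a)
            w += alpha * (target - x @ w) * x                   # TD update on the linear weights
            s = s_next
    return w

class TwoStateChain:
    """Tiny illustrative MDP: taking action 1 in state 0 reaches a rewarding terminal state."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        if self.s == 0 and a == 1:
            self.s = 1
            return self.s, 1.0, True
        return self.s, 0.0, False

features = lambda s, a: np.eye(4)[2 * s + a]   # one-hot over (state, action) pairs
w = q_learning_linear(TwoStateChain(), features, actions=[0, 1], n_features=4)
print(w.reshape(2, 2))   # q_hat(0, 1) should be the largest entry
```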

Convergence of TD Methods with VFA

TD with value function approximation is not following the gradient of any objective function.
Informally, updates involve doing an (approximate) Bellman backup followed by trying to best fit the resulting value function with the given feature representation.
Bellman operators are contractions, but the value function approximation fitting step can be an expansion.

Convergence of Control Methods with VFA

Algorithm              Tabular      Linear VFA                          Nonlinear VFA
Monte-Carlo Control    converges    (chatters around a good solution)   no guarantees
Sarsa                  converges    (chatters around a good solution)   no guarantees
Q-learning             converges    can diverge                         can diverge

Table of Contents

1 Introduction

2 VFA for Prediction

3 Control using Value Function Approximation

4 Deep Learning

Other Function Approximators

Linear value function approximators often work well given the right set of features.
But this can require carefully hand-designing that feature set.
An alternative is to use a much richer function approximation class that is able to work directly from states without requiring an explicit specification of features.

Local representations, including kernel-based approaches, have some appealing properties (including convergence results in certain cases), but typically cannot scale to enormous spaces and datasets.
An alternative is to leverage the huge recent success of deep neural networks.

Deep Neural Networks

Today: a brief introduction to deep neural networks
Definitions
Power of deep neural networks
Neural networks / distributed representations vs kernel / local representations
Universal function approximator
Deep neural networks vs shallow neural networks
How to train neural nets


Feedforward Neural Networks

‣ Definition of Neural Networks - Forward propagation - Types of units - Capacity of neural networks
‣ How to train neural nets: - Loss function - Backpropagation with gradient descent
‣ More recent techniques: - Dropout - Batch normalization - Unsupervised Pre-training

Artificial Neuron

• Neuron pre-activation (or input activation): a(x) = b + Σ_i w_i x_i = b + w^T x
• Neuron output activation: h(x) = g(a(x)) = g(b + Σ_i w_i x_i)
where x = (x_1, ..., x_d) is the input, w are the connection weights (parameters), b is the bias term, and g(·) is called the activation function.
• The range of the output is determined by g(·); the bias b only changes where the activation is positioned (from Pascal Vincent's slides).

Single Hidden Layer Neural Net

• Hidden layer pre-activation: a(x) = b^(1) + W^(1) x, i.e., a(x)_i = b_i^(1) + Σ_j W_{i,j}^(1) x_j
• Hidden layer activation: h(x) = g(a(x))
• Output layer activation: f(x) = o(b^(2) + w^(2)T h(x)), where o(·) is the output activation function (e.g. producing p(y = 1 | x))

Activation Function

Sigmoid activation function: g(a) = sigm(a) = 1 / (1 + exp(-a))
Ø Squashes the neuron's output between 0 and 1
Ø Always positive
Ø Bounded
Ø Strictly increasing


Activation Function

Rectified linear (ReLU) activation function: g(a) = reclin(a) = max(0, a)
Ø Bounded below by 0 (always non-negative)
Ø Tends to produce units with sparse activities
Ø Not upper bounded
Ø Strictly increasing

Softmax Output and Multilayer Neural Net

• For multi-class classification, the output activation is the softmax, giving p(y = c | x):
o(a) = softmax(a) = (exp(a_1) / Σ_c exp(a_c), ..., exp(a_C) / Σ_c exp(a_C))^T
• Multilayer Neural Net: consider a network with L hidden layers
Ø layer pre-activation for k > 0: a^(k)(x) = b^(k) + W^(k) h^(k-1)(x), with h^(0)(x) = x
Ø hidden layer activation, for k from 1 to L: h^(k)(x) = g(a^(k)(x))
Ø output layer activation (k = L + 1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
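A minimal NumPy sketch of this forward pass for a network with L hidden layers and a softmax output; the layer sizes, random weights, and sigmoid activation are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))       # shift for numerical stability
    return e / e.sum()

def forward(x, weights, biases):
    """Forward propagation: h^(k) = g(b^(k) + W^(k) h^(k-1)), softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers k = 1..L
        h = sigmoid(W @ h + b)
    return softmax(weights[-1] @ h + biases[-1])  # output layer k = L+1

# Illustrative 2-hidden-layer network: 4 inputs -> 5 -> 5 -> 3 classes.
rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))  # entries sum to 1
```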


Local vs. Distributed Representations

• Local representations (clustering, nearest neighbors, RBF SVMs, local density estimators):
Ø parameters for each local region
Ø the number of regions is linear in the number of parameters
• Distributed representations (RBMs, factor models, PCA, sparse coding, deep models):
Ø each parameter affects many regions, not just a local one
Ø the number of regions grows (roughly) exponentially in the number of parameters

(Figure: learned prototypes and local regions versus regions labelled by feature activations C1, C2, C3; Bengio, 2009, Foundations and Trends in Machine Learning)

Capacity of Neural Nets

• Consider a single layer neural network

(Figure from Pascal Vincent's slides: a network with inputs x1, x2, hidden units y1, y2, and output z, together with the decision regions R1, R2 (z = +1 or z = -1) that the network carves out of the input space.)

Universal Approximation

• Universal Approximation Theorem (Hornik, 1991):

- "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"

• This applies for sigmoid, tanh and many other activation functions.

• However, this does not mean that there is a learning algorithm that can find the necessary parameter values.

Deep Networks vs Shallow

1-hidden-layer neural networks are already universal function approximators.
This implies the expressive power of deep networks is no larger than that of shallow networks: there always exists a shallow network that can represent any function representable by a deep (multi-layer) neural network.
But there can be cases where deep networks are exponentially more compact than shallow networks in terms of the number of nodes required to represent a function.
This has substantial implications for memory, computation, and data efficiency.
Empirically, deep networks often outperform shallower alternatives.


Feedforward Neural Networks

‣ How neural networks predict f(x) given an input x: - Forward propagation - Types of units - Capacity of neural networks

‣ How to train neural nets: - Loss function - Backpropagation with gradient descent

‣ More recent techniques: - Dropout - Batch normalization - Unsupervised Pre-training

Machine Learning

• Supervised learning example: (x, y), with input x and target y
• Training set: D^train = {(x^(t), y^(t))}
• Model: f(x; θ); use training, validation, and test sets
• Empirical Risk Minimization:

arg min_θ (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)

where l is the loss function and Ω(θ) is the regularizer.

• Learning is cast as optimization.

Ø For classification problems, we would like to minimize classification error.

Ø Loss function can sometimes be viewed as a surrogate for what we want to optimize (e.g. upper bound)

Stochastic Gradient Descent (SGD)

• Perform updates after seeing each example:
- Initialize θ ≡ {W^(1), b^(1), ..., W^(L+1), b^(L+1)}
- For each training example (x^(t), y^(t)):
    Δ = -∇_θ l(f(x^(t); θ), y^(t)) - λ ∇_θ Ω(θ)
    θ ← θ + α Δ
• A training epoch is one iteration over all examples
• To train a neural net, we need:
Ø a loss function l(f(x^(t); θ), y^(t))
Ø a procedure to compute the gradients ∇_θ l(f(x^(t); θ), y^(t)), plus the regularizer Ω(θ) and its gradient ∇_θ Ω(θ)

Loss Function for Classification

• f(x)_c = p(y = c | x): the network outputs a distribution over classes
• Use the negative log-likelihood (cross-entropy) loss: l(f(x), y) = -Σ_c 1_(y=c) log f(x)_c = -log f(x)_y
• Its gradient with respect to the network output is ∇_{f(x)} [-log f(x)_y] = -(1/f(x)_y) [1_(y=0), ..., 1_(y=C-1)]^T = -e(y) / f(x)_y, where e(y) is the one-hot indicator vector for class y
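A small NumPy check of the classification loss and its gradient with respect to the network output f(x); the probability vector below is made up for illustration.

```python
import numpy as np

def nll_loss(f_x, y):
    """Cross-entropy / negative log-likelihood: l(f(x), y) = -log f(x)_y."""
    return -np.log(f_x[y])

def nll_grad(f_x, y):
    """Gradient with respect to the output vector: -e(y) / f(x)_y."""
    g = np.zeros_like(f_x)
    g[y] = -1.0 / f_x[y]
    return g

f_x = np.array([0.2, 0.7, 0.1])   # example output of a softmax layer
print(nll_loss(f_x, y=1))          # = -log 0.7
print(nll_grad(f_x, y=1))          # only the true-class entry is nonzero
```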


Computational Flow Graph

• Forward propagation can be represented as an acyclic flow graph

• Forward propagation can be implemented in a modular way:

Ø Each box can be an object with an fprop method that computes the value of the box given its children

Ø Calling the fprop method of each box in the right order yields forward propagation

Computational Flow Graph

• Each object also has a bprop method

- it computes the gradient of the loss with respect to each child box.

• By calling bprop in the reverse order, we obtain backpropagation

Model Selection

• Training Protocol:
- Train your model on the Training Set
- For model selection, use the Validation Set
Ø Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc.
- Estimate generalization performance using the Test Set

• Generalization is the behavior of the model on unseen examples.


Early Stopping

• To select the number of epochs, stop training when validation set error increases (with some look ahead).

(Also on the slides: the sigmoid and tanh derivatives g′(a) = g(a)(1 − g(a)) and g′(a) = 1 − g(a)²; the L2 regularizer Ω(θ) = Σ_k ‖W^(k)‖²_F with gradient ∇_{W^(k)} Ω(θ) = 2W^(k); and the L1 regularizer Σ_k Σ_{i,j} |W^(k)_{i,j}| with gradient sign(W^(k)), where sign(W^(k))_{i,j} = 1_{W^(k)_{i,j} > 0} − 1_{W^(k)_{i,j} < 0}.)

(Also on the slides: weight initialization W^(k)_{i,j} ~ U[−b, b] with b = √6 / √(H_k + H_{k−1}), where H_k is the number of units in layer k; a finite-difference gradient check ∂f(x)/∂x ≈ (f(x + ε) − f(x − ε)) / (2ε); and decaying learning rates satisfying Σ_t α_t = ∞ and Σ_t α_t² < ∞, e.g. α_t = α / (1 + δt) or α_t = α t^{−δ} with 0.5 < δ ≤ 1.)

Mini-batch, Momentum

• Make updates based on a mini-batch of examples (instead of a single example):
Ø the gradient is the average regularized loss for that mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient
• Momentum: can use an exponential average of previous gradients:
∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄_θ^(t−1)
Ø can get past plateaus more quickly, by "gaining momentum"
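A small NumPy sketch of mini-batch SGD with the momentum average above; the quadratic per-example objective and the hyperparameters are illustrative.

```python
import numpy as np

def minibatch_momentum_sgd(grad_fn, data, theta0, alpha=0.1, beta=0.9,
                           batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD with momentum: g_bar <- grad + beta * g_bar; theta <- theta - alpha * g_bar."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    g_bar = np.zeros_like(theta)
    for _ in range(n_epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = np.mean([grad_fn(theta, x) for x in batch], axis=0)  # average over the mini-batch
            g_bar = grad + beta * g_bar
            theta -= alpha * g_bar
    return theta

# Illustrative use: fit theta to the mean of noisy observations (gradient of 0.5*(theta - x)^2).
data = np.random.default_rng(1).normal(loc=3.0, size=(500, 1))
print(minibatch_momentum_sgd(lambda th, x: th - x, data, theta0=[0.0]))  # approaches [3.0]
```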

Learning Distributed Representations

• Deep learning is research on learning models with multilayer representations

Ø multilayer (feed-forward) neural networks Ø multilayer graphical model (deep belief network, deep Boltzmann machine) • Each layer learns ‘‘distributed representation’’

Ø Units in a layer are not mutually exclusive • each unit is a separate feature of the input • two units can be ‘‘active’’ at the same time

Ø Units do not correspond to a partitioning (clustering) of the inputs

• in clustering, an input can only belong to a single cluster

Inspiration from Visual Cortex

Why Training is Hard

• First hypothesis: Hard optimization problem (underfitting)

Ø vanishing gradient problem Ø saturated units block gradient propagation

• This is a well-known problem in recurrent neural networks

Why Training is Hard

• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second order methods, such as KFAC) Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both: better optimization and better regularization!

Unsupervised Pre-training

• Initialize hidden layers using unsupervised learning

Ø Force network to represent latent structure of input distribution

Ø Encourage hidden layers to encode that structure

Unsupervised Pre-training

• Initialize hidden layers using unsupervised learning

Ø This is a harder task than supervised learning (classification)

Ø Hence we expect less overfitting

Autoencoders: Preview

• Feed-forward neural network trained to reproduce its input at the output layer
• Encoder: h(x) = g(a(x)) = sigm(b + Wx)
• Decoder: x̂ = o(â(x)) = sigm(c + W* h(x))

Autoencoders

• Loss function for binary inputs (cross-entropy error function): l(f(x)) = −Σ_k (x_k log(x̂_k) + (1 − x_k) log(1 − x̂_k))
• Loss function for real-valued inputs (sum of squared differences): l(f(x)) = ½ Σ_k (x̂_k − x_k)²; we use a linear activation function at the output
• Forward pass on an example x^(t): a(x^(t)) = b + W x^(t); h(x^(t)) = sigm(a(x^(t))); â(x^(t)) = c + W^T h(x^(t)); x̂^(t) = sigm(â(x^(t)))
• The output gradient for backpropagation is ∇_{â(x^(t))} l(f(x^(t))) = x̂^(t) − x^(t)
• Tied weights: W* = W^T
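A minimal NumPy sketch of an autoencoder forward pass with tied weights and the binary cross-entropy reconstruction loss; the sizes and random data are illustrative.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_forward(x, W, b, c):
    """Encoder h = sigm(b + W x); decoder (tied weights) x_hat = sigm(c + W^T h)."""
    h = sigm(b + W @ x)
    x_hat = sigm(c + W.T @ h)
    return h, x_hat

def cross_entropy_loss(x, x_hat, eps=1e-12):
    """Reconstruction loss for binary inputs."""
    return -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

rng = np.random.default_rng(0)
d, n_hidden = 8, 3
W = rng.normal(scale=0.1, size=(n_hidden, d))
b, c = np.zeros(n_hidden), np.zeros(d)
x = rng.integers(0, 2, size=d).astype(float)     # a binary input vector
h, x_hat = autoencoder_forward(x, W, b, c)
print(cross_entropy_loss(x, x_hat))
```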

Pre-training

• We will use a greedy, layer-wise procedure

Ø Train one layer at a time with an unsupervised criterion
Ø Fix the parameters of previous hidden layers
Ø Previous layers can be viewed as feature extraction

Fine-tuning

• Once all layers are pre-trained

Ø add output layer Ø train the whole network using supervised learning

• We call this last phase fine-tuning

Ø all parameters are "tuned" for the supervised task at hand
Ø representation is adjusted to be more discriminative

Dropout

• Key idea: Cripple neural network by removing hidden units stochastically

Ø each hidden unit is set to 0 with probability 0.5

Ø hidden units cannot co-adapt to other units

Ø hidden units must be more generally useful

• Could use a different dropout probability, but 0.5 usually works well

Dropout

• Use random binary masks m^(k)
Ø layer pre-activation for k > 0: a^(k)(x) = b^(k) + W^(k) h^(k−1)(x), with h^(0)(x) = x
Ø hidden layer activation (k from 1 to L), multiplied elementwise by the mask: h^(k)(x) = g(a^(k)(x)) ⊙ m^(k)
Ø output activation (k = L + 1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
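A NumPy sketch of dropout for one hidden layer: random binary masks at training time and the expectation-based scaling at test time described on the next slide; apart from the 0.5 rate, everything here is illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_layer(x, W, b, p_keep=0.5, train=True, rng=np.random.default_rng(0)):
    """One hidden layer with dropout: h = g(a) * m at train time, h = g(a) * p_keep at test time."""
    h = sigmoid(W @ x + b)
    if train:
        m = (rng.random(h.shape) < p_keep).astype(float)   # random binary mask
        return h * m
    return h * p_keep                                       # replace the mask by its expectation

rng = np.random.default_rng(1)
W, b = rng.normal(scale=0.1, size=(4, 6)), np.zeros(4)
x = rng.normal(size=6)
print(hidden_layer(x, W, b, train=True))    # some units zeroed out
print(hidden_layer(x, W, b, train=False))   # scaled activations, no randomness
```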

Dropout at Test Time

• At test time, we replace the masks by their expectation

Ø This is simply the constant vector 0.5 if dropout probability is 0.5 Ø For single hidden layer: equivalent to taking the geometric average of all neural networks, with all possible binary masks

• Can be combined with unsupervised pre-training

• Beats regular backpropagation on many datasets

• Ensemble: can be viewed as a geometric average of an exponential number of networks.

Batch Normalization

• Normalizing the inputs will speed up training (Lecun et al. 1998)

Ø could normalization be useful at the level of the hidden layers?

• Batch normalization is an attempt to do that (Ioffe and Szegedy, 2014)

Ø each unit's pre-activation is normalized (mean subtraction, stddev division)
Ø during training, the mean and stddev are computed for each minibatch
Ø backpropagation takes into account the normalization
Ø at test time, the global mean / stddev is used

Batch Normalization

• The normalization is applied to each activation independently. For a mini-batch B = {x_1, ..., x_m} of values of one activation:
Ø mini-batch mean: μ_B = (1/m) Σ_i x_i
Ø mini-batch variance: σ²_B = (1/m) Σ_i (x_i − μ_B)²
Ø normalize: x̂_i = (x_i − μ_B) / √(σ²_B + ε)
Ø scale and shift: y_i = γ x̂_i + β ≡ BN_{γ,β}(x_i)
• γ and β form a learned linear transformation that adapts the normalized value to the non-linear activation function; they are trained along with the original model parameters and restore the representation power of the network
• Why normalize the pre-activation?
Ø can help keep the pre-activation in a non-saturating regime (though the linear transform y = γ x̂ + β could cancel this effect)
• Use the global mean and stddev at test time:
Ø removes the stochasticity of the mean and stddev
Ø requires a final phase where, from the first to the last hidden layer, we propagate all the training data to that layer and compute and store the global mean and stddev of each unit

Ø for early stopping, could use a running average
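A NumPy sketch of the batch-normalizing transform applied to a mini-batch of pre-activations, with γ and β as the learned scale and shift; shapes and values are illustrative.

```python
import numpy as np

def batch_norm_forward(x_batch, gamma, beta, eps=1e-5):
    """BN transform per unit over a mini-batch: normalize, then scale and shift."""
    mu = x_batch.mean(axis=0)                    # mini-batch mean (per unit)
    var = x_batch.var(axis=0)                    # mini-batch variance (per unit)
    x_hat = (x_batch - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta                  # y = gamma * x_hat + beta

rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=5.0, scale=2.0, size=(32, 4))   # batch of 32, 4 hidden units
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_forward(pre_activations, gamma, beta)
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))           # ~0 mean, ~1 std per unit
```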