CSCE 636 Neural Networks
Instructor: Yoonsuck Choe

What Is Deep Learning?

• Learning higher-level abstractions/representations from data.
• Motivation: how the brain represents sensory information in a hierarchical manner.


The Rise of Deep Learning

• Made popular in recent years:
  - Hinton et al. (2006).
  - Andrew Ng & Jeff Dean (Google Brain team, 2012).
  - Schmidhuber et al.'s deep neural networks (won many competitions and in some cases showed super-human performance; 2011–).

Long History (in Hindsight)

• Fukushima's Neocognitron (1980).
• LeCun et al.'s convolutional neural networks (1989).
• Schmidhuber's work on stacked recurrent neural networks (1993).
  - Vanishing gradient problem.

Fukushima's Neocognitron

• Appeared in the journal Biological Cybernetics (1980).
• S cells (trainable) and C cells (fixed weight).
• Deformation-resistant recognition.

LeCun's Convolutional Neural Nets

• Multiple layers with local receptive fields.
• Convolution kernel (weight sharing) + subsampling.
• Fully connected layers near the end.
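To make the convolution + subsampling idea concrete, here is a minimal NumPy sketch (not LeCun's actual implementation): a single shared 3x3 kernel is slid over an image (weight sharing), followed by 2x2 average-pooling subsampling. The kernel values and image size are arbitrary, illustrative choices.

  import numpy as np

  def conv2d_valid(image, kernel):
      # Slide one shared kernel over the image (weight sharing): 'valid' convolution.
      H, W = image.shape
      kh, kw = kernel.shape
      out = np.zeros((H - kh + 1, W - kw + 1))
      for i in range(out.shape[0]):
          for j in range(out.shape[1]):
              out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
      return out

  def subsample(fmap, size=2):
      # 2x2 average pooling (subsampling), as in early convolutional nets.
      H, W = fmap.shape
      H, W = H - H % size, W - W % size          # crop to a multiple of the pool size
      return fmap[:H, :W].reshape(H // size, size, W // size, size).mean(axis=(1, 3))

  image = np.random.rand(8, 8)                        # toy 8x8 input "image"
  kernel = np.array([[1., 0., -1.]] * 3)              # one shared, edge-like kernel (illustrative)
  feature_map = np.tanh(conv2d_valid(image, kernel))  # convolution + nonlinearity
  pooled = subsample(feature_map)                     # subsampling layer
  print(pooled.shape)                                 # (3, 3)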


Current Trends

• Deep belief networks (based on the Boltzmann machine)
• Deep neural networks
• Convolutional neural networks
• Deep Q-learning Network (extensions to reinforcement learning)

To Deep Belief Nets

• Haykin Chapter 11: Stochastic Methods rooted in statistical mechanics.

Boltzmann Machine

• Network state: x, drawn from random variable X.
• Stochastic binary machine: units take values +1 or -1.
• Fully connected, symmetric connections: w_{ij} = w_{ji}.
• Visible vs. hidden neurons; clamped vs. free-running.
• Goal: learn weights to model the probability distribution of the visible units.
• Unsupervised. Pattern completion.

Boltzmann Machine: Energy

• w_{ij} = w_{ji} and w_{ii} = 0.
• Energy (in analogy to thermodynamics):

  E(x) = -\frac{1}{2} \sum_i \sum_{j, j \neq i} w_{ji} x_i x_j
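A tiny NumPy sketch of the energy above, for an illustrative random state and weight matrix (both made up here); W is kept symmetric with a zero diagonal as required.

  import numpy as np

  rng = np.random.default_rng(0)
  K = 5                                   # number of units (illustrative)
  W = rng.normal(size=(K, K))
  W = (W + W.T) / 2                       # symmetric: w_ij = w_ji
  np.fill_diagonal(W, 0.0)                # no self-connections: w_ii = 0
  x = rng.choice([-1.0, 1.0], size=K)     # a bipolar (+1/-1) state

  # E(x) = -1/2 * sum_i sum_{j != i} w_ji x_i x_j  (diagonal is zero, so x @ W @ x covers it)
  E = -0.5 * x @ W @ x
  print(E)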

Boltzmann Machine: Prob. of a State x

• The probability of a state x with energy E(x) follows the Gibbs distribution:

  P(X = x) = \frac{1}{Z} \exp\left( -\frac{E(x)}{T} \right)

  - Z: partition function (normalization factor, hard to compute): Z = \sum_x \exp(-E(x)/T)
  - T: temperature parameter.
  - Low-energy states are exponentially more probable.
• With the above, we can calculate P(X_j = x \mid \{X_i = x_i\}_{i=1, i \neq j}^{K}).
  - This can be done without knowing Z.

Boltzmann Machine: P(X_j = x | the rest)

• Let A : X_j = x and B : \{X_i = x_i\}_{i=1, i \neq j}^{K} (the rest). Then

  P(X_j = x \mid \text{the rest}) = \frac{P(A, B)}{P(B)} = \frac{P(A, B)}{\sum_{\forall A} P(A, B)} = \frac{P(A, B)}{P(A, B) + P(\neg A, B)}
                                  = \frac{1}{1 + \exp\left( -\frac{x}{T} \sum_{i, i \neq j} w_{ji} x_i \right)}
                                  = \text{sigmoid}\left( \frac{x}{T} \sum_{i, i \neq j} w_{ji} x_i \right)

• Can compute the equilibrium state based on the above.
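A minimal sketch of the conditional above: the probability that unit j takes value +1 given all the other units, computed without the partition function Z. The weights and state here are illustrative, not from any trained model.

  import numpy as np

  def p_unit_on(W, x, j, T=1.0):
      # P(X_j = +1 | the rest) = sigmoid((1/T) * sum_{i != j} w_ji x_i)
      field = W[j] @ x - W[j, j] * x[j]        # exclude i == j (diagonal is zero anyway)
      return 1.0 / (1.0 + np.exp(-field / T))

  rng = np.random.default_rng(1)
  K = 5
  W = rng.normal(size=(K, K))
  W = (W + W.T) / 2
  np.fill_diagonal(W, 0.0)
  x = rng.choice([-1.0, 1.0], size=K)
  print(p_unit_on(W, x, j=2, T=1.0))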

Boltzmann Machine: Gibbs Sampling

• Initialize x^{(0)} to a random vector.
• For j = 1, 2, ..., n (generate n samples x \sim P(X)):
  - x_1^{(j+1)} from p(x_1 \mid x_2^{(j)}, x_3^{(j)}, ..., x_K^{(j)})
  - x_2^{(j+1)} from p(x_2 \mid x_1^{(j+1)}, x_3^{(j)}, ..., x_K^{(j)})
  - x_3^{(j+1)} from p(x_3 \mid x_1^{(j+1)}, x_2^{(j+1)}, x_4^{(j)}, ..., x_K^{(j)})
  - ...
  - x_K^{(j+1)} from p(x_K \mid x_1^{(j+1)}, x_2^{(j+1)}, x_3^{(j+1)}, ..., x_{K-1}^{(j+1)})
  → One new sample x^{(j+1)} \sim P(X).
• Simulated annealing (high T to low T) is used for faster convergence.
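A sketch of one sequential Gibbs-sampling sweep as described above, assuming the same +1/-1 units and the sigmoid conditional from the previous slide; the weights, network size, and annealing schedule are illustrative.

  import numpy as np

  def gibbs_sweep(W, x, T=1.0, rng=None):
      # Update each unit in turn from P(X_j = +1 | the rest); one sweep gives one new sample.
      if rng is None:
          rng = np.random.default_rng()
      x = x.copy()
      for j in range(len(x)):
          field = W[j] @ x - W[j, j] * x[j]           # local field, excluding unit j itself
          p_on = 1.0 / (1.0 + np.exp(-field / T))
          x[j] = 1.0 if rng.random() < p_on else -1.0
      return x

  rng = np.random.default_rng(2)
  K = 6
  W = rng.normal(size=(K, K))
  W = (W + W.T) / 2
  np.fill_diagonal(W, 0.0)
  x = rng.choice([-1.0, 1.0], size=K)                 # x^(0): random initial state
  for T in (4.0, 2.0, 1.0, 0.5):                      # crude simulated-annealing schedule
      for _ in range(100):
          x = gibbs_sweep(W, x, T=T, rng=rng)
  print(x)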

Boltzmann Learning Rule (1)

• Probability of the activity pattern being one of the training patterns \mathcal{T} (visible units: subvector x_\alpha; hidden units: subvector x_\beta), given the weight vector w: P(X_\alpha = x_\alpha).
• Log-likelihood of the visible units being any one of the training patterns (assuming they are mutually independent):

  L(w) = \log \prod_{x_\alpha \in \mathcal{T}} P(X_\alpha = x_\alpha) = \sum_{x_\alpha \in \mathcal{T}} \log P(X_\alpha = x_\alpha)

• We want to learn w that maximizes L(w).

Boltzmann Learning Rule (2)

• Want to calculate P(X_\alpha = x_\alpha): use the energy function.

  P(X_\alpha = x_\alpha) = \frac{1}{Z} \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right)

  \log P(X_\alpha = x_\alpha) = \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log Z
                              = \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log \sum_x \exp\left( -\frac{E(x)}{T} \right)

• Note: Z = \sum_x \exp(-E(x)/T).

Boltzmann Learning Rule (3)

• Finally, we get:

  L(w) = \sum_{x_\alpha \in \mathcal{T}} \left[ \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log \sum_x \exp\left( -\frac{E(x)}{T} \right) \right]

• Note that w is involved in:

  E(x) = -\frac{1}{2} \sum_i \sum_{j, j \neq i} w_{ji} x_i x_j

• Differentiating L(w) with respect to w_{ji}, we get:

  \frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \sum_{x_\alpha \in \mathcal{T}} \left[ \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i - \sum_x P(X = x)\, x_j x_i \right]

Boltzmann Learning Rule (4)

• Setting:

  \rho_{ji}^{+} = \sum_{x_\alpha \in \mathcal{T}} \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i
  \rho_{ji}^{-} = \sum_{x_\alpha \in \mathcal{T}} \sum_x P(X = x)\, x_j x_i

  we get:

  \frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right)

• Attempting to maximize L(w), we get:

  \Delta w_{ji} = \epsilon \frac{\partial L(w)}{\partial w_{ji}} = \eta \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right)

  where \eta = \epsilon / T. This is gradient ascent.
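To make the learning rule concrete, here is a minimal sketch of one weight update, assuming the clamped (ρ+) and free-running (ρ−) correlations have already been estimated, e.g. by Gibbs sampling with the visible units clamped vs. left free; the correlation values below are placeholders, not real statistics.

  import numpy as np

  def boltzmann_update(W, rho_plus, rho_minus, eta=0.01):
      # Gradient ascent on L(w): delta_w_ji = eta * (rho+_ji - rho-_ji), keeping W symmetric.
      dW = eta * (rho_plus - rho_minus)
      dW = (dW + dW.T) / 2            # enforce w_ij = w_ji
      np.fill_diagonal(dW, 0.0)       # keep w_ii = 0
      return W + dW

  K = 4
  rng = np.random.default_rng(3)
  W = np.zeros((K, K))
  rho_plus = rng.uniform(-1, 1, size=(K, K))    # placeholder clamped-phase correlations <x_j x_i>+
  rho_minus = rng.uniform(-1, 1, size=(K, K))   # placeholder free-running correlations <x_j x_i>-
  W = boltzmann_update(W, rho_plus, rho_minus)
  print(W)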

Boltzmann Machine Summary

• Theoretically elegant.
• Very slow in practice (especially the unclamped phase).

Logistic (or Directed) Belief Net

• Similar to the Boltzmann Machine, but with directed, acyclic connections.
• P(X_j = x_j \mid X_1 = x_1, ..., X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \text{parents}(X_j))
• Same learning rule:

  \Delta w_{ji} = \eta \frac{\partial L(w)}{\partial w_{ji}}

• With dense connections, calculation of P becomes intractable.

Deep Belief Net (1)

• Hinton et al. (2006). Overcomes issues with the Logistic Belief Net.
• Based on the Restricted Boltzmann Machine (RBM): visible and hidden layers, with layer-to-layer full connection but no within-layer connections.
• RBM back-and-forth update: update hidden given visible, then update visible given hidden, etc., then train w based on

  \frac{\partial L(w)}{\partial w_{ji}} = \rho_{ji}^{(0)} - \rho_{ji}^{(\infty)}
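A rough sketch of the back-and-forth RBM update above, in the spirit of one-step contrastive divergence with 0/1 units: sample hidden given visible, reconstruct visible given hidden, and nudge the weights toward the data statistics and away from the reconstruction statistics. Layer sizes, learning rate, and data here are all illustrative.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def rbm_step(v0, W, b_h, b_v, eta=0.1, rng=None):
      # One back-and-forth update: v0 -> h0 -> v1 -> h1, then move W by the correlation difference.
      if rng is None:
          rng = np.random.default_rng()
      p_h0 = sigmoid(v0 @ W + b_h)                      # hidden given visible (data phase)
      h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
      p_v1 = sigmoid(h0 @ W.T + b_v)                    # visible given hidden (reconstruction)
      p_h1 = sigmoid(p_v1 @ W + b_h)                    # hidden given reconstruction
      W += eta * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))   # ~ rho^(0) - rho^(k)
      return W

  rng = np.random.default_rng(4)
  n_visible, n_hidden = 6, 3
  W = 0.01 * rng.normal(size=(n_visible, n_hidden))
  b_h, b_v = np.zeros(n_hidden), np.zeros(n_visible)
  v0 = rng.integers(0, 2, size=n_visible).astype(float)   # one toy binary "training" vector
  W = rbm_step(v0, W, b_h, b_v, rng=rng)
  print(W.shape)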

Deep Belief Net (2)

• Deep Belief Net = Layer-by-layer training using RBM.

• Hybrid architecture: Top layer = undirected, lower layers directed.

1. Train RBM based on input to form hidden representation.

2. Use hidden representation as input to train another RBM.

3. Repeat step 2 to stack additional layers.

• Applications: NIST digit recognition.
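A schematic sketch of the greedy layer-by-layer idea in steps 1-3. The train_rbm function here is a hypothetical placeholder standing in for real RBM training (e.g. the back-and-forth update sketched earlier); each trained layer's hidden activations become the input to the next RBM. Layer sizes and data are arbitrary.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def train_rbm(data, n_hidden, rng):
      # Placeholder for RBM training: just returns random weights of the right shape.
      return 0.01 * rng.normal(size=(data.shape[1], n_hidden))

  def train_dbn(data, layer_sizes, rng):
      # Greedy layer-by-layer stacking: train an RBM, then feed its hidden
      # representation as the "input data" for the next RBM.
      weights, layer_input = [], data
      for n_hidden in layer_sizes:
          W = train_rbm(layer_input, n_hidden, rng)   # steps 1 and 2
          layer_input = sigmoid(layer_input @ W)      # hidden representation -> next input
          weights.append(W)                           # step 3: repeat for the next layer
      return weights

  rng = np.random.default_rng(5)
  data = rng.integers(0, 2, size=(100, 784)).astype(float)   # toy stand-in for digit images
  weights = train_dbn(data, layer_sizes=[256, 64], rng=rng)
  print([W.shape for W in weights])   # [(784, 256), (256, 64)]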

Deep Convolutional Neural Networks (1)

• Krizhevsky et al. (2012).
• Applied to the ImageNet competition (1.2 million images, 1,000 classes).
• Network: 60 million parameters and 650,000 neurons.
• Top-1 and top-5 error rates of 37.5% and 17.0%, respectively.
• Trained with backprop.

Deep Convolutional Neural Networks (2)

• Learned kernels (first convolutional layer).
• They resemble mammalian RFs: oriented Gabor patterns and color opponency (red-green, blue-yellow).

Deep Convolutional Neural Networks (3)

• Left: hits, misses, and close calls.
• Right: test images (1st column) vs. training images with the closest hidden representation to the test data.

Deep Q-Network (DQN)

• Google DeepMind (Mnih et al., Nature 2015).

• Latest application of deep learning to a reinforcement learning domain (Q as in Q-learning).
• Applied to Atari 2600 video game playing.

DQN Overview

• Input: video screen; Output: Q(s, a); Reward: game score.
• Q(s, a): action-value function.
  - The value of taking action a when in state s.

DQN Overview

• Input preprocessing.

DQN Algorithm

• Experience replay (collect and replay state, action, reward, and resulting state).

• Delayed (periodic) update of Q.
• Moving target Q̂ value used to compute the error (loss function L, parameterized by weights θ_i).
  - Gradient: ∂L/∂θ_i
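A minimal sketch of the target/loss idea above. For illustration it uses a simple linear Q(s, a) instead of the actual deep convolutional network: the TD target is computed from a separate, periodically copied parameter set (the "moving target" Q̂), and transitions are drawn from a small experience-replay buffer. State size, γ, learning rate, and the buffer contents are all made up.

  import numpy as np

  rng = np.random.default_rng(6)
  n_state, n_action, gamma = 4, 3, 0.99

  theta = 0.1 * rng.normal(size=(n_action, n_state))   # online Q weights (theta_i)
  theta_target = theta.copy()                          # periodically copied target weights (Q-hat)

  def q_values(theta, s):
      # Illustrative linear Q(s, a) = theta[a] . s (the real DQN uses a conv net).
      return theta @ s

  # A tiny "experience replay" buffer of (state, action, reward, next_state) transitions.
  replay = [(rng.normal(size=n_state), rng.integers(n_action), rng.normal(),
             rng.normal(size=n_state)) for _ in range(32)]

  alpha = 0.01
  for step in range(100):
      s, a, r, s_next = replay[rng.integers(len(replay))]          # sample a stored transition
      target = r + gamma * np.max(q_values(theta_target, s_next))  # moving target from Q-hat
      td_error = target - q_values(theta, s)[a]                    # error for the taken action
      theta[a] += alpha * td_error * s                             # gradient step on the squared loss
      if step % 20 == 0:
          theta_target = theta.copy()                              # delayed (periodic) update of Q-hat
  print(q_values(theta, replay[0][0]))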

DQN Results

• Superhuman performance on over half of the games.

DQN Hidden Layer Representation (t-SNE map)

• States with similar perception and similar reward are clustered together.

DQN Operation

• Value vs. game state; Game state vs. action value.

Summary

• Deep belief network: Based on the Boltzmann machine. Elegant theory, good performance.
• Deep convolutional networks: High computational demand, great performance across the board.
• Deep Q-Network: unique approach to reinforcement learning. End-to-end learning. Super-human performance.