Deep Learning
CSCE 636 Neural Networks
Instructor: Yoonsuck Choe

What Is Deep Learning?
• Learning higher-level abstractions/representations from data.
• Motivation: how the brain represents sensory information in a hierarchical manner.

The Rise of Deep Learning
Made popular in recent years:
• Geoffrey Hinton et al. (2006).
• Andrew Ng & Jeff Dean (Google Brain team, 2012).
• Schmidhuber et al.'s deep neural networks (won many competitions and in some cases showed super-human performance; 2011–).

Long History (in Hindsight)
• Fukushima's Neocognitron (1980).
• LeCun et al.'s convolutional neural networks (1989).
• Schmidhuber's work on stacked recurrent neural networks (1993).
• Vanishing gradient problem.

Fukushima's Neocognitron
• Appeared in the journal Biological Cybernetics (1980).
• Multiple layers with local receptive fields.
• S cells (trainable) and C cells (fixed weight).
• Deformation-resistant recognition.

LeCun's Convolutional Neural Nets
• Convolution kernel (weight sharing) + subsampling.
• Fully connected layers near the end.

Current Trends
• Deep belief networks (based on the Boltzmann machine).
• Deep neural networks.
• Convolutional neural networks.
• Deep Q-learning Network (extensions to reinforcement learning).

Boltzmann Machine to Deep Belief Nets
• Haykin Chapter 11: stochastic methods rooted in statistical mechanics.

Boltzmann Machine
• Stochastic binary machine: unit states are +1 or -1.
• Fully connected symmetric connections: w_{ij} = w_{ji}.
• Visible vs. hidden neurons, clamped vs. free-running phases.
• Goal: learn weights to model the probability distribution of the visible units.
• Unsupervised. Pattern completion.

Boltzmann Machine: Energy
• Network state: x, drawn from random variable X.
• w_{ij} = w_{ji} and w_{ii} = 0.
• Energy (in analogy to thermodynamics):

    E(x) = -\frac{1}{2} \sum_i \sum_{j,\, j \neq i} w_{ji} x_i x_j

Boltzmann Machine: Probability of a State x
• The probability of a state x with energy E(x) follows the Gibbs distribution:

    P(X = x) = \frac{1}{Z} \exp\left( -\frac{E(x)}{T} \right)

  – Z: partition function (normalization factor, hard to compute), Z = \sum_x \exp(-E(x)/T).
  – T: temperature parameter.
  – Low-energy states are exponentially more probable.
• With the above, we can calculate P(X_j = x \mid \{X_i = x_i\}_{i=1, i \neq j}^{K}); this can be done without knowing Z.

Boltzmann Machine: P(X_j = x | the rest)
• Let A : X_j = x and B : \{X_i = x_i\}_{i=1, i \neq j}^{K} (the rest). Then

    P(X_j = x \mid \text{the rest}) = \frac{P(A, B)}{P(B)}
                                    = \frac{P(A, B)}{P(A, B) + P(\neg A, B)}
                                    = \frac{1}{1 + \exp\left( -\frac{x}{T} \sum_{i, i \neq j} w_{ji} x_i \right)}
                                    = \mathrm{sigmoid}\left( \frac{x}{T} \sum_{i, i \neq j} w_{ji} x_i \right)

Boltzmann Machine: Gibbs Sampling
• Initialize x^{(0)} to a random vector.
• For j = 1, 2, ..., n (generate n samples x ~ P(X)):
  – sample x_1^{(j+1)} from p(x_1 | x_2^{(j)}, x_3^{(j)}, ..., x_K^{(j)})
  – sample x_2^{(j+1)} from p(x_2 | x_1^{(j+1)}, x_3^{(j)}, ..., x_K^{(j)})
  – sample x_3^{(j+1)} from p(x_3 | x_1^{(j+1)}, x_2^{(j+1)}, x_4^{(j)}, ..., x_K^{(j)})
  – ...
  – sample x_K^{(j+1)} from p(x_K | x_1^{(j+1)}, x_2^{(j+1)}, ..., x_{K-1}^{(j+1)})
• One full sweep yields one new sample x^{(j+1)} ~ P(X).
• Simulated annealing (high T to low T) is used for faster convergence.
• (A minimal code sketch of this stochastic update and sweep is given below, after the learning-rule slide.)

Boltzmann Learning Rule (1)
• We want the probability of an activity pattern being one of the training patterns (visible units: subvector x_α; hidden units: subvector x_β), given the weight vector w: P(X_α = x_α).
• Log-likelihood of the visible units being any one of the training patterns in the training set \mathcal{T} (assuming they are mutually independent):

    L(w) = \log \prod_{x_\alpha \in \mathcal{T}} P(X_\alpha = x_\alpha)
         = \sum_{x_\alpha \in \mathcal{T}} \log P(X_\alpha = x_\alpha)

• We want to learn w that maximizes L(w).
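The stochastic unit update and Gibbs sweep described above can be illustrated with a short NumPy sketch. This is a minimal illustration under stated assumptions, not reference code from the course: the helper names (`unit_update_prob`, `gibbs_sweep`), the toy size K = 5, and the random symmetric weight matrix are made up for the example; units take values ±1 as on the slides.

```python
import numpy as np

def unit_update_prob(x, W, j, T=1.0):
    """P(X_j = +1 | the rest) = sigmoid((1/T) * sum_{i != j} w_ji x_i), as on the slide."""
    net = W[j] @ x - W[j, j] * x[j]   # exclude the self term (w_jj = 0 anyway)
    return 1.0 / (1.0 + np.exp(-net / T))

def gibbs_sweep(x, W, T=1.0, rng=None):
    """One full Gibbs sweep: resample each unit given the current values of the others."""
    if rng is None:
        rng = np.random.default_rng()
    x = x.copy()
    for j in range(len(x)):
        p_plus = unit_update_prob(x, W, j, T)
        x[j] = 1.0 if rng.random() < p_plus else -1.0
    return x

# Example: a small 5-unit Boltzmann machine with random symmetric weights.
rng = np.random.default_rng(0)
K = 5
W = rng.normal(size=(K, K))
W = (W + W.T) / 2.0                    # symmetric connections: w_ij = w_ji
np.fill_diagonal(W, 0.0)               # no self-connections: w_ii = 0
x = rng.choice([-1.0, 1.0], size=K)    # random initial state x^(0)

# Anneal from high to low temperature while sweeping (simulated annealing).
for T in [4.0, 2.0, 1.0, 0.5]:
    for _ in range(100):
        x = gibbs_sweep(x, W, T, rng)
print("sample near equilibrium:", x)
```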
Boltzmann Learning Rule (2)
• We want to calculate P(X_α = x_α): use the energy function.

    P(X_\alpha = x_\alpha) = \sum_{x_\beta} \frac{1}{Z} \exp\left( -\frac{E(x)}{T} \right)

    \log P(X_\alpha = x_\alpha) = \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log Z
                                = \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log \sum_{x} \exp\left( -\frac{E(x)}{T} \right)

• Note: Z = \sum_x \exp(-E(x)/T).

Boltzmann Learning Rule (3)
• Finally, we get:

    L(w) = \sum_{x_\alpha \in \mathcal{T}} \left[ \log \sum_{x_\beta} \exp\left( -\frac{E(x)}{T} \right) - \log \sum_{x} \exp\left( -\frac{E(x)}{T} \right) \right]

• Note that w is involved in:

    E(x) = -\frac{1}{2} \sum_i \sum_{j,\, j \neq i} w_{ji} x_i x_j

• Differentiating L(w) with respect to w_{ji}, we get:

    \frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \sum_{x_\alpha \in \mathcal{T}} \left[ \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i - \sum_{x} P(X = x)\, x_j x_i \right]

Boltzmann Learning Rule (4)
• Setting:

    \rho_{ji}^{+} = \sum_{x_\alpha \in \mathcal{T}} \sum_{x_\beta} P(X_\beta = x_\beta \mid X_\alpha = x_\alpha)\, x_j x_i
    \rho_{ji}^{-} = \sum_{x_\alpha \in \mathcal{T}} \sum_{x} P(X = x)\, x_j x_i

  we get:

    \frac{\partial L(w)}{\partial w_{ji}} = \frac{1}{T} \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right)

• Attempting to maximize L(w), we get:

    \Delta w_{ji} = \epsilon \frac{\partial L(w)}{\partial w_{ji}} = \eta \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right), \quad \eta = \frac{\epsilon}{T}

  This is gradient ascent.

Boltzmann Machine Summary
• Theoretically elegant.
• Very slow in practice (especially the unclamped, free-running phase).

Logistic (or Directed) Belief Net
• Similar to the Boltzmann machine, but with directed, acyclic connections.

    P(X_j = x_j \mid X_1 = x_1, ..., X_{j-1} = x_{j-1}) = P(X_j = x_j \mid \mathrm{parents}(X_j))

• Same learning rule:

    \Delta w_{ji} = \eta \frac{\partial L(w)}{\partial w_{ji}}

• With dense connections, calculation of P becomes intractable.

Deep Belief Net (1)
• Hinton et al. (2006). Overcomes issues with the logistic belief net.
• Based on the Restricted Boltzmann Machine (RBM): visible and hidden layers, with full layer-to-layer connections but no within-layer connections.
• RBM back-and-forth update: update hidden given visible, then update visible given hidden, etc., then train w based on

    \frac{\partial L(w)}{\partial w_{ji}} = \rho_{ji}^{(0)} - \rho_{ji}^{(\infty)}

  (a code sketch of this update appears at the end of these notes).

Deep Belief Net (2)
• Deep Belief Net = layer-by-layer training using RBMs:
  1. Train an RBM on the input to form a hidden representation.
  2. Use that hidden representation as input to train another RBM.
  3. Repeat step 2 for each additional layer.
• Hybrid architecture: top layer undirected, lower layers directed.
• Applications: MNIST digit recognition.

Deep Convolutional Neural Networks (1)
• Krizhevsky et al. (2012).
• Applied to the ImageNet competition (1.2 million images, 1,000 classes).
• Network: 60 million parameters and 650,000 neurons.
• Top-1 and top-5 error rates of 37.5% and 17.0%.
• Trained with backprop.

Deep Convolutional Neural Networks (2)
• Learned kernels (first convolutional layer).
• They resemble mammalian receptive fields: oriented Gabor patterns and color opponency (red-green, blue-yellow).

Deep Convolutional Neural Networks (3)
• Left: hits, misses, and close calls.
• Right: test image (1st column) vs. training images with the closest hidden representation to the test data.

Deep Q-Network (DQN)
• Google DeepMind (Mnih et al., Nature 2015).
• Latest application of deep learning to a reinforcement learning domain (Q as in Q-learning).
• Applied to Atari 2600 video game playing.

DQN Overview
• Input: video screen; Output: Q(s, a); Reward: game score.
• Q(s, a): action-value function, the value of taking action a when in state s.

DQN Algorithm
• Input preprocessing.
• Experience replay (collect and replay state, action, reward, and resulting state).
• Delayed (periodic) update of Q.
• A moving target value Q̂ is used to compute the error (loss function L, parameterized by weights θ_i).
  – Gradient descent on ∂L/∂θ_i.
• (A minimal sketch of experience replay and the target-network loss is given below, after the DQN slides.)

DQN Results
• Superhuman performance on over half of the games.

DQN Hidden Layer Representation (t-SNE map)
• States with similar perception and similar reward are clustered together.

DQN Operation
• Value vs. game state; game state vs. action value.
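The experience replay and moving-target loss from the DQN Algorithm slide can be sketched as follows. This is a toy sketch under loud assumptions: a tiny linear Q-function stands in for the actual deep convolutional network of Mnih et al., and the dimensions, names (`q_values`, `train_step`, `sync_target`), and hyperparameters are illustrative, not the published ones.

```python
import random
from collections import deque
import numpy as np

# Hypothetical stand-in for the deep Q-network: a linear model Q(s, .) = s @ theta.
STATE_DIM, N_ACTIONS = 8, 4
rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(STATE_DIM, N_ACTIONS))   # online weights θ
theta_target = theta.copy()                                   # periodically copied target weights

def q_values(s, w):
    return s @ w    # vector of Q(s, a) for all actions a

replay = deque(maxlen=10_000)   # experience replay buffer of (state, action, reward, next_state)

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))

def train_step(gamma=0.99, lr=0.01, batch_size=32):
    """One gradient step on the squared TD error, using the frozen target network Q-hat."""
    if len(replay) < batch_size:
        return
    for s, a, r, s_next in random.sample(list(replay), batch_size):
        target = r + gamma * np.max(q_values(s_next, theta_target))  # moving target from Q-hat
        td_error = target - q_values(s, theta)[a]
        theta[:, a] += lr * td_error * s   # gradient descent on L = (target - Q(s, a))^2

def sync_target():
    """Delayed (periodic) update: copy the online weights into the target network."""
    global theta_target
    theta_target = theta.copy()
```

In the full algorithm, `store` is called after every game step, `train_step` on minibatches sampled from the buffer, and `sync_target` only every few thousand steps, which is what keeps the regression target from chasing itself.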
Summary
• Deep belief networks: based on the Boltzmann machine. Elegant theory, good performance.
• Deep convolutional networks: high computational demand, across-the-board great performance.
• Deep Q-Network: a unique approach to reinforcement learning. End-to-end machine learning. Super-human performance.
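Finally, the RBM update ∂L(w)/∂w_ji = ρ_ji^(0) − ρ_ji^(∞) referenced on the Deep Belief Net slides can be sketched in a few lines. Assumptions not stated on the slides: binary 0/1 units, bias terms, and the common contrastive-divergence (CD-1) approximation, in which a single reconstruction step stands in for the free-running statistic ρ^(∞).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_cd1_update(v0, W, b_v, b_h, lr=0.1, rng=None):
    """One CD-1 weight update for a binary RBM: dW ~ <v h>_data - <v h>_reconstruction."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive (clamped) phase: hidden probabilities and sample given the data vector v0.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one back-and-forth reconstruction step approximates the model statistics.
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    # Gradient-ascent update on the approximate log-likelihood.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b_v += lr * (v0 - v1)
    b_h += lr * (ph0 - ph1)
    return W, b_v, b_h
```

Stacking such RBMs layer by layer, each trained on the hidden representation of the previous one, gives the greedy pretraining procedure of the Deep Belief Net slides.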
