
Machine Learning for Signal Processing: Neural Networks (Continued)

Instructor: Bhiksha Raj. Slides by Najim Dehak. 1 Dec 2016

So what are neural networks?

Voice signal → N.Net → Transcription
Image → N.Net → Text caption
Game state → N.Net → Next move

• What are these boxes?

So what are neural networks?

• It began with this...
• Humans are very good at the tasks we just saw
• Can we model the human brain / human intelligence?

– An old question, dating back to Plato and Aristotle...

MLP - Recap

• MLPs are Boolean machines
– They represent Boolean functions over linear boundaries
– They can represent arbitrary boundaries

• Perceptrons are correlation filters
– They detect patterns in the input
• MLPs are Boolean formulae over patterns detected by perceptrons
– Higher-level perceptrons may also be viewed as feature detectors
• MLPs are universal approximators
– Can model any function to arbitrary precision

• Extra: MLP in classification
– The network will fire if the combination of the detected basic features matches an “acceptable” pattern for a desired class of signal
• E.g. appropriate combinations of (Nose, Eyes, Eyebrows, Cheek, Chin) → Face
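To make the "Boolean functions over linear boundaries" idea concrete, here is a small illustrative sketch (my own, not from the slides): two threshold perceptrons each carve out a linear boundary, and a second-layer perceptron combines their firings to compute XOR, which no single linear boundary can represent.

```python
import numpy as np

def perceptron(x, w, b):
    """Threshold unit: fires (outputs 1) iff w.x + b > 0, i.e. x lies on one side of a linear boundary."""
    return int(np.dot(w, x) + b > 0)

def xor_mlp(x1, x2):
    # Hidden layer: two perceptrons, each detecting one linear region
    h1 = perceptron([x1, x2], [1, -1], -0.5)    # fires for "x1 AND NOT x2"
    h2 = perceptron([x1, x2], [-1, 1], -0.5)    # fires for "x2 AND NOT x1"
    # Output layer: a perceptron computing the Boolean OR of the two detectors
    return perceptron([h1, h2], [1, 1], -0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_mlp(x1, x2))      # 0, 1, 1, 0
```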

MLP - Recap

• MLPs are Boolean machines – They represent arbitrary Boolean functions over arbitrary linear boundaries

• Perceptrons are pattern detectors – MLPs are Boolean formulae over these patterns

• MLPs are universal approximators – Can model any function to arbitrary precision

• MLPs are very hard to train
– Training data are generally many orders of magnitude too few
– Even with optimal architectures, we could get rubbish
– Depth helps greatly!

– Can learn functions that regular classifiers cannot

What is a deep network?

Deep Structures

• In any directed network of computational elements with input source nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink

• Left: depth = 2. Right: depth = 3

Deep Structures

• Layered deep structure

• “Deep” → depth > 2

MLP as a continuous-valued regression

[Figure: a pair of threshold units with thresholds T1 and T2, weighted +1 and -1 and summed, produces a pulse between T1 and T2; summing many such scaled pulses approximates f(x)]

• MLPs can actually compose arbitrary functions to arbitrary precision
– Not just classification/Boolean functions
• 1D example
– Left: a net with a pair of units can create a pulse of any width at any location
– Right: a network of N such pairs approximates the function with N scaled pulses
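A minimal numerical sketch of the 1-D construction above (my own illustration, with steep sigmoids standing in for ideal threshold units): each pair of units forms a pulse, and a weighted sum of N pulses approximates the target function, with the error shrinking as N grows.

```python
import numpy as np

def pulse(x, t1, t2, steepness=2000.0):
    """A pair of (near-)threshold units: the difference of two steep sigmoids
    approximates a unit pulse on [t1, t2]."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))
    return sig(steepness * (x - t1)) - sig(steepness * (x - t2))

def mlp_approx(x, f, n_pulses, lo=0.0, hi=1.0):
    """Sum of n_pulses scaled pulses (one unit pair each) approximating f on [lo, hi]."""
    edges = np.linspace(lo, hi, n_pulses + 1)
    y = np.zeros_like(x)
    for t1, t2 in zip(edges[:-1], edges[1:]):
        y += f(0.5 * (t1 + t2)) * pulse(x, t1, t2)  # scale each pulse by f at its centre
    return y

x = np.linspace(0, 1, 1000)
f = lambda t: np.sin(2 * np.pi * t)
for n in (20, 50, 200):
    err = np.max(np.abs(mlp_approx(x, f, n_pulses=n) - f(x)))
    print(f"{n:4d} pulses -> max abs error {err:.3f}")   # error shrinks as n grows
```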

MLP features

DIGIT OR NOT?

• The lowest layers of a network detect significant features in the signal
• The signal could be reconstructed using these features

– Will retain all the significant components of the signal

Making it explicit: an autoencoder

[Figure: input X → encoder W → code Y → decoder Wᵀ → reconstruction]

• A neural network can be trained to predict the input itself
• This is an autoencoder
• An encoder learns to detect all the most significant patterns in the signal

• A decoder recomposes the signal from the patterns

Deep Autoencoder

[Figure: a deep encoder-decoder stack]

What does the AE learn?

• Encoder: $\mathbf{Y} = \mathbf{W}\mathbf{X}$
• Decoder: $\hat{\mathbf{X}} = \mathbf{W}^{T}\mathbf{Y}$
• Error: $E = \lVert \mathbf{X} - \mathbf{W}^{T}\mathbf{W}\mathbf{X} \rVert^{2}$
• Find $\mathbf{W}$ to minimize Avg[$E$]

• In the absence of an intermediate non-linearity, this is just PCA
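A quick numerical check of this claim (my own sketch, not course code): a linear autoencoder trained by gradient descent on $\lVert \mathbf{X} - \mathbf{W}^{T}\mathbf{W}\mathbf{X}\rVert^{2}$ ends up with approximately the same reconstruction error as rank-k PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 5-D points lying close to a 2-D subspace, stored as columns of X
d, k, n = 5, 2, 1000
X = rng.normal(size=(d, k)) @ rng.normal(size=(k, n)) + 0.1 * rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)                     # centre the data

# Rank-k PCA reconstruction error (top-k principal directions via SVD)
U = np.linalg.svd(X, full_matrices=False)[0][:, :k]
err_pca = np.mean((X - U @ (U.T @ X)) ** 2)

# Linear autoencoder: Y = W X, Xhat = W^T Y, trained by gradient descent on ||X - W^T W X||^2
W = 0.1 * rng.normal(size=(k, d))
lr = 5e-3
for _ in range(4000):
    R = X - W.T @ (W @ X)                              # reconstruction residual
    W += lr * (2.0 / n) * W @ (X @ R.T + R @ X.T)      # analytic gradient step

err_ae = np.mean((X - W.T @ (W @ X)) ** 2)
print(f"PCA rank-{k} MSE:        {err_pca:.5f}")
print(f"linear autoencoder MSE: {err_ae:.5f}")         # approximately the same
```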

The AE

• With non-linearity: “non-linear” PCA
– Deeper networks can capture more complicated manifolds

The Decoder:


• The decoder represents a source-specific generative dictionary • Exciting it will produce typical signals from the source!

The AE

• Cut the AE: separate the ENCODER from the DECODER

The Decoder: Sax dictionary


• The decoder represents a source-specific generative dictionary • Exciting it will produce typical signals from the source!

The Decoder: Clarinet dictionary


• The decoder represents a source-specific generative dictionary • Exciting it will produce typical signals from the source!

NN for speech enhancement

Story so far

• MLPs are universal classifiers – They can model any decision boundary

• Neural networks are universal approximators – They can model any regression

• The decoder of an autoencoder represents a non-linear constructive dictionary!

The need for shift invariance

• In many problems the location of a pattern is not important
– Only the presence of the pattern
• Conventional MLPs are sensitive to the location of the pattern
– Moving it by one component results in an entirely different input that the MLP won't recognize
• Requirement: the network must be shift invariant
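A small illustrative sketch of the shift-sensitivity point (my own, not from the slides): a random fully connected layer responds differently when the same pattern is moved, while a shared-weight convolution followed by global max pooling responds identically.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0.0)

# The same short pattern placed at two different positions in a length-100 signal
pattern = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
x1 = np.zeros(100); x1[20:25] = pattern
x2 = np.zeros(100); x2[60:65] = pattern

# 1) Fully connected layer: the response depends on *where* the pattern is
W_fc = rng.normal(size=(8, 100))
same_fc = np.allclose(relu(W_fc @ x1), relu(W_fc @ x2))

# 2) Shared-weight convolution + global max pooling: "did the pattern occur anywhere?"
kernel = rng.normal(size=5)
def conv_maxpool(x):
    responses = np.array([kernel @ x[i:i + 5] for i in range(len(x) - 4)])
    return relu(responses).max()
same_conv = np.isclose(conv_maxpool(x1), conv_maxpool(x2))

print("MLP output unchanged under shift:      ", same_fc)    # False
print("Conv+pool output unchanged under shift:", same_conv)  # True
```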

Convolutional Neural Networks

History: Hubel and Wiesel, 1959 (biological model); Fukushima, 1980 (computational model); Atlas, 1988; LeCun, 1989 (backprop in convnets)

[Photos: Yann LeCun, Kunihiko Fukushima]

Convolutional Neural Networks

• A special kind of multi-layer neural network.

• Implicitly extract relevant features.

• A feed-forward network that can extract topological properties from an image.

• CNNs are also trained with a version of back-propagation.

Connectivity & weight sharing

[Figure: fully connected (all different weights) vs. locally connected (all different weights) vs. convolutional (shared weights)]

A convolution layer has a much smaller number of parameters due to local connectivity and weight sharing.

Fully Connected Layer

Example: 200x200 image, 40K hidden units → ~2B parameters!!!

- Spatial correlation is local
- Waste of resources, and we don't have enough training samples anyway...

(Slide credit: Ranzato)

Locally Connected Layer

Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters

Note: this parameterization is good when the input image is registered (e.g., face recognition).

Locally Connected Layer

STATIONARITY? Statistics are similar at different locations.

Example: 200x200 image, 40K hidden units, filter size 10x10 → 4M parameters


Convolutional Layer

Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels


Convolutional Layer

[Animation: the learned kernel is convolved with the input, sliding across it to produce one output value at each location]

Learn multiple filters.

E.g.: 200x200 image, 100 filters, filter size 10x10 → 10K parameters
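The parameter counts quoted on these slides can be reproduced with a few lines of arithmetic (a sketch; biases are ignored, as in the slides):

```python
# Parameter counts for the three connectivity patterns on a 200x200 input
image_pixels      = 200 * 200          # 40K inputs
hidden_units      = 40_000             # 40K hidden units
filter_pixels     = 10 * 10            # 10x10 local receptive field
num_filters       = 100

fully_connected   = image_pixels * hidden_units   # every unit sees every pixel
locally_connected = hidden_units * filter_pixels  # every unit has its own 10x10 filter
convolutional     = num_filters * filter_pixels   # 100 shared 10x10 filters

print(f"fully connected:   {fully_connected:,}")   # 1,600,000,000  (~2B)
print(f"locally connected: {locally_connected:,}") # 4,000,000      (4M)
print(f"convolutional:     {convolutional:,}")     # 10,000         (10K)
```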


Convolutional Layers

before: [Figure: input layer, hidden layer, output layer, fully connected]

now: [Figure: the same layers with local, shared-weight connections]

Convolution Layer

32x32x3 image (32 width, 32 height, 3 depth)

Convolution Layer

• 32x32x3 image
• 5x5x3 filter (filters always extend the full depth of the input volume)
• Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

Convolution Layer

• 32x32x3 image, 5x5x3 filter
• 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
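A naive sketch of this operation in plain numpy (illustrative only; real frameworks use optimized implementations): sliding the 5x5x3 filter over a 32x32x3 image, computing a 75-dimensional dot product plus a bias at each location, yields a 28x28 activation map.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))    # 32x32x3 input volume
filt  = rng.normal(size=(5, 5, 3))      # 5x5x3 filter (spans the full input depth)
bias  = 0.1

# Slide the filter over all spatial locations; each location gives one number:
# a 5*5*3 = 75-dimensional dot product plus a bias.
out = np.zeros((28, 28))                # (32 - 5 + 1) = 28 valid positions per axis
for i in range(28):
    for j in range(28):
        patch = image[i:i + 5, j:j + 5, :]
        out[i, j] = np.sum(patch * filt) + bias

print(out.shape)                        # (28, 28): one activation map
```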

Convolution Layer

• 32x32x3 image, 5x5x3 filter → one 28x28x1 activation map
• Convolve (slide) over all spatial locations

Convolution Layer

• Consider a second, green filter: two 5x5x3 filters → two 28x28x1 activation maps

Convolution Layer

• For example, if we had 6 5x5 filters, we'll get 6 separate activation maps
• We stack these up to get a “new image” of size 28x28x6!

CNN Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions

[Figure: 32x32x3 → CONV + ReLU (6 5x5x3 filters) → 28x28x6]

CNN Preview: a ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

[Figure: 32x32x3 → CONV + ReLU (6 5x5x3 filters) → 28x28x6 → CONV + ReLU (10 5x5x6 filters) → 24x24x10 → ....]
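The spatial sizes above follow the usual "valid" convolution formula, output = (N - F + 2P)/stride + 1 (a standard result, not spelled out on the slides); a tiny helper reproduces the 32 → 28 → 24 progression:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / stride + 1."""
    return (n - f + 2 * pad) // stride + 1

size, depth = 32, 3
for num_filters, f in [(6, 5), (10, 5)]:   # CONV (6 5x5x3 filters), then CONV (10 5x5x6 filters)
    size, depth = conv_output_size(size, f), num_filters
    print(f"{size}x{size}x{depth}")        # 28x28x6, then 24x24x10
```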

Pooling Layer

Let us assume the filter is an “eye” detector.

Q.: how can we make the detection robust to the exact location of the eye?


Pooling Layer

By “pooling” (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features.


Pooling Layer
- Makes the representations smaller and more manageable
- Operates over each activation map independently:

Max Pooling

Single depth slice x:

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 → y:

6 8
3 4
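The pooled values above can be reproduced with a short sketch (my own helper, not course code):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

def max_pool(x, size=2, stride=2):
    """Max pooling over one depth slice: take the max of each size x size window."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

print(max_pool(x))   # [[6 8]
                     #  [3 4]]
```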

ConvNets: Typical Stage

One stage (zoom)

[Figure: Convolution → Pooling]

Courtesy of K. Kavukcuoglu (slide credit: Ranzato)

Digit classification

ImageNet
• 1.2 million high-resolution images from the ImageNet LSVRC-2010 contest
• 1000 different classes (softmax layer)
• NN configuration:
– The NN contains 60 million parameters and 650,000 neurons
– 5 convolutional layers, some of which are followed by max-pooling layers
– 3 fully-connected layers

Krizhevsky, A., Sutskever, I. and Hinton, G. E. “ImageNet Classification with Deep Convolutional Neural Networks” NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada

ImageNet

Figure 3: 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 for details.


ImageNet

(Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.

CNN for Automatic Speech Recognition

• Convolution over frequencies
• Convolution over time

CNN - Recap

[Figure: input image → convolution (learned) → non-linearity → pooling → feature maps]

• Neural network with specialized connectivity structure
• Feed-forward:
– Convolve input
– Non-linearity (rectified linear)
– Pooling (local max)
• Supervised training
• Train convolutional filters by back-propagating error
• Convolution over time
• Adding memory to the classical MLP network

Recurrent Neural Networks (RNNs)

Recurrent networks (RNNs) introduce cycles and a notion of time.

[Figure: input $x_t$ and the previous state $h_{t-1}$ (through a one-step delay) feed the hidden state $h_t$, which produces output $y_t$]

• They are designed to process sequences of data $x_1, \ldots, x_n$ and can produce sequences of outputs $y_1, \ldots, y_m$.

Elman Nets (1990) – Simple Recurrent Neural Networks

• Elman nets are feed forward networks with partial recurrence

• Unlike feed-forward nets, Elman nets have a memory or sense of time
• Can also be viewed as a “Markovian” NN

(Vanilla) Simple Recurrent Neural Network

The state consists of a single “hidden” vector $h$: [figure: $x_t$ and the delayed $h_{t-1}$ feed $h_t$, which produces $y_t$]

Unrolling RNNs

RNNs can be unrolled across multiple time steps: [figure: $h_0 \to h_1 \to h_2$, with inputs $x_0, x_1, x_2$ and outputs $y_0, y_1, y_2$]

This produces a DAG which supports backpropagation. But its size depends on the input sequence length.
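A minimal sketch of a vanilla RNN forward pass (my own illustration; the weight names W_xh, W_hh, W_hy and the tanh/linear choices are assumptions, not taken from the slides): the same parameters are applied at every step of the unrolled graph.

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim, T = 4, 8, 3, 10

# Shared parameters, reused at every time step (the "unrolled" network is just
# the same cell applied T times)
W_xh = rng.normal(scale=0.1, size=(h_dim, x_dim))
W_hh = rng.normal(scale=0.1, size=(h_dim, h_dim))
W_hy = rng.normal(scale=0.1, size=(y_dim, h_dim))

def rnn_forward(xs):
    h = np.zeros(h_dim)                       # initial hidden state
    ys = []
    for x_t in xs:                            # one step per element of the sequence
        h = np.tanh(W_xh @ x_t + W_hh @ h)    # new state from input and previous state
        ys.append(W_hy @ h)                   # output at this step
    return np.array(ys)

xs = rng.normal(size=(T, x_dim))
print(rnn_forward(xs).shape)                  # (10, 3): one output per time step
```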

Learning time sequences

• Recurrent networks have one or more feedback loops
• There are many tasks that require learning a temporal sequence of events
– Speech, video, text, markets
• These problems can be broken into 3 distinct types of tasks:
1. Sequence Recognition: produce a particular output pattern when a specific input sequence is seen. Applications: speech recognition
2. Sequence Reproduction: generate the rest of a sequence when the network sees only part of the sequence. Applications: time series prediction (stock market, sun spots, etc.)
3. Temporal Association: produce a particular output sequence in response to a specific input sequence. Applications: speech generation

RNN structure

Often layers are stacked vertically (deep RNNs):

[Figure: a two-level RNN unrolled over time. Level 1: inputs $x_0, x_1, x_2$, hidden states $h_{00}, h_{01}, h_{02}$, outputs $y_{00}, y_{01}, y_{02}$. Level 2: hidden states $h_{10}, h_{11}, h_{12}$, outputs $y_{10}, y_{11}, y_{12}$. The same parameters are used at every time step within each level. Vertical axis: abstraction (higher-level features); horizontal axis: time.]

RNN structure

Backprop still works (it is called Backpropagation Through Time):

[Animation over the unrolled network: activations propagate forward across time and up the stack; gradients then propagate backward along the same connections.]

The memory problem with RNNs

• The RNN models signal context
• If a very long context is required, RNNs become unable to learn the context information
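A small numerical illustration of why long contexts are hard for plain RNNs (my own sketch, not from the slides): the gradient reaching a state T steps back is a product of T per-step Jacobians, and with tanh units and modest recurrent weights this product typically shrinks exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
h_dim, T = 20, 100
W_hh = rng.normal(scale=0.3 / np.sqrt(h_dim), size=(h_dim, h_dim))
W_xh = rng.normal(scale=0.1, size=(h_dim, h_dim))

# Run a vanilla RNN forward and record the per-step Jacobians dh_t / dh_{t-1}
h = np.zeros(h_dim)
jacobians = []
for t in range(T):
    h = np.tanh(W_xh @ rng.normal(size=h_dim) + W_hh @ h)
    jacobians.append(np.diag(1 - h ** 2) @ W_hh)

# Backpropagate a gradient from the last step toward the first
grad = rng.normal(size=h_dim)
for t, J in enumerate(reversed(jacobians)):
    grad = J.T @ grad                              # chain rule, one step back in time
    if t % 20 == 19:
        print(f"{t + 1:3d} steps back: |grad| = {np.linalg.norm(grad):.2e}")  # shrinks fast
```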

Standard RNNs to LSTM

[Figure: a standard RNN cell vs. an LSTM cell]

LSTM illustrated: input and forming new memory

The LSTM cell takes the following input:
• the input $x_t$
• the past output $h_{t-1}$
• the past memory $C_{t-1}$
(all vectors)

[Figure: cell state, forget gate, input gate, new memory]

LSTM illustrated: Output

• The output of the cell is formed using the output gate

Overall picture: LSTM Equations

• $i = \sigma(x_t U^i + s_{t-1} W^i)$
• $f = \sigma(x_t U^f + s_{t-1} W^f)$
• $o = \sigma(x_t U^o + s_{t-1} W^o)$
• $g = \tanh(x_t U^g + s_{t-1} W^g)$
• $c_t = c_{t-1} \circ f + g \circ i$
• $s_t = \tanh(c_t) \circ o$
• $y = \mathrm{softmax}(V s_t)$

[Figure: LSTM memory cell]

• $i$: input gate, controls how much of the new information will be let through to the memory cell
• $f$: forget gate, controls what information should be thrown away from the memory cell
• $o$: output gate, controls how much of the information will be exposed to the next time step
• $g$: self-recurrent term, equal to the standard RNN update
• $c_t$: internal memory of the memory cell
• $s_t$: hidden state
• $y$: final output
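The equations above translate directly into code; here is a minimal, illustrative single-step implementation (my own sketch, with assumed shapes and random parameters, following the slide's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, params):
    """One LSTM step following the slide's equations (vectors as 1-D arrays)."""
    U, W, V = params["U"], params["W"], params["V"]
    i = sigmoid(x_t @ U["i"] + s_prev @ W["i"])     # input gate
    f = sigmoid(x_t @ U["f"] + s_prev @ W["f"])     # forget gate
    o = sigmoid(x_t @ U["o"] + s_prev @ W["o"])     # output gate
    g = np.tanh(x_t @ U["g"] + s_prev @ W["g"])     # candidate (standard-RNN-style) update
    c_t = c_prev * f + g * i                        # new internal memory
    s_t = np.tanh(c_t) * o                          # new hidden state
    y_t = np.exp(s_t @ V); y_t /= y_t.sum()         # softmax output
    return s_t, c_t, y_t

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim = 5, 8, 3
params = {
    "U": {k: rng.normal(scale=0.1, size=(x_dim, h_dim)) for k in "ifog"},
    "W": {k: rng.normal(scale=0.1, size=(h_dim, h_dim)) for k in "ifog"},
    "V": rng.normal(scale=0.1, size=(h_dim, y_dim)),
}
s, c = np.zeros(h_dim), np.zeros(h_dim)
for x_t in rng.normal(size=(4, x_dim)):             # run a short sequence through the cell
    s, c, y = lstm_step(x_t, s, c, params)
print(y, y.sum())                                   # a probability vector at the last step
```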

LSTM output synchronization

(NLP) Applications of RNNs

• Section overview
– Language Model
– Sentiment analysis / text classification
– Machine translation and conversation modeling
– Sentence skip-thought vectors

RNN for Sentiment analysis / text classification

• A quick example, to see the idea.
• Given text collections and their labels, predict labels for unseen texts.

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Translating Videos to Natural Language Using Deep Recurrent Neural Networks Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015.

Composing music with RNN

http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/

CNN-LSTM-DNN for speech recognition

• Ensembles of RNN/LSTM, DNN, & ConvNets (CNN) give huge gains (state of the art):
• T. Sainath, O. Vinyals, A. Senior, H. Sak. “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” ICASSP 2015.

The Impact of Deep Learning in Speech Technologies

[Figure: Cortana]

Conclusions

• MLPs are Boolean machines
– They represent Boolean functions over linear boundaries
– They can represent arbitrary boundaries
• Perceptrons are correlation filters
– They detect patterns in the input
• MLPs are Boolean formulae over patterns detected by perceptrons
– Higher-level perceptrons may also be viewed as feature detectors
• MLPs are universal approximators
– Can model any function to arbitrary precision
– Non-linear PCA
• Convolutional NNs can handle shift invariance
– CNN
• Special NNs can model sequential data
– RNN, LSTM