
Introduction to Machine Learning

Lecture *: Deep Learning

Amo G. Tong

• An Introduction • Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville

• Some materials are courtesy of . • All pictures belong to their creators.

Deep Learning

• What are deep learning methods? • Using a complex neural network to approximate the function we want to learn.

Story: the ImageNet object recognition contest.
• LeNet-5 (1998): a 7-level convolutional network.
• AlexNet (2012): more layers and filters. Trained for 6 days on two GPUs. Error rate: 15.3% (reduced from 26.2%).
• ResNet (2015): 152 layers with residual connections. Error rate: 3.57%.

https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5


• Why do we need deep learning?
  • We need to solve complex real-world problems.
  • A standard way to parameterize functions, layer by layer.
  • Flexible to be customized for different applications: image processing, sequence data, ...
  • Standard training methods.


• Why now?
  • Availability of data.
  • Hardware: GPUs, ...
  • Software: TensorFlow, PyTorch, ...

Outline

• General Deep Feedforward Network.

• Convolutional Neural Network (CNN) • Image processing.

• Recurrent Neural Network (RNN) • Sequence data processing.

Deep Feedforward Network

(Figure: a feedforward network with an input layer, several hidden layers, and an output layer.)

Training, Design, and Regularization


• Feedforward Network (Multi-Layer Perceptron (MLP)) with more layers.
  • Training
  • Design
  • Regularization

Deep Feedforward Network: Training

• Cost Function: $J(\theta)$
  • Common choice: cross-entropy, a measure of the difference between two distributions $P$ and $Q$ (a small numerical sketch follows below):
    • $H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x)$
    • Related: the KL divergence $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  • $J(\theta) = -\mathbb{E}_{x, y \sim D_{\text{train}}} \log P_{\text{model}}(y \mid x)$ (negative log-likelihood)
  • $P_{\text{model}}(y \mid x)$ changes from model to model.
  • Least squares error: assume a Gaussian and use MLE.
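
A minimal NumPy sketch of this negative log-likelihood cost; the probabilities and labels below are made-up illustrative values, not from the slides.

```python
import numpy as np

def nll_loss(probs, targets):
    """Average negative log-likelihood (cross-entropy) of the true classes.
    probs:   (batch, n_classes) predicted probabilities P_model(y|x)
    targets: (batch,) integer class labels
    """
    eps = 1e-12  # avoid log(0)
    picked = probs[np.arange(len(targets)), targets]  # probability of the true class
    return -np.mean(np.log(picked + eps))

# toy batch: 3 examples, 4 classes (values chosen only for illustration)
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10],
                  [0.05, 0.05, 0.10, 0.80]])
targets = np.array([0, 1, 3])
print(nll_loss(probs, targets))  # ~0.42
```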


• Gradient-Based Training
  • Non-convex because we have nonlinear units.
  • Iteratively decrease the cost function; not globally optimal, not even locally optimal.
  • Initialize with small random numbers.
  • Backpropagation.
  • One issue: the gradient must be large and predictable.

Stochastic gradient descent (SGD) algorithm (a code sketch follows below):
  While the stopping criterion is not met:
    Sample a minibatch of $m$ examples $D_m$
    Compute the gradient $g \leftarrow \sum_{x \in D_m} \nabla_\theta J(\theta, x)$
    Update $\theta \leftarrow \theta - \epsilon g$
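
A minimal NumPy sketch of the minibatch SGD loop above; `grad_J` is a hypothetical callable returning the per-example gradient, and the stopping criterion here is simply a fixed number of iterations.

```python
import numpy as np

def sgd(theta, data, grad_J, lr=0.01, batch_size=32, n_iters=1000, seed=None):
    """Minibatch SGD: theta <- theta - lr * sum_{x in D_m} grad_J(theta, x)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):                              # stopping criterion: fixed iteration count
        batch = rng.choice(len(data), size=batch_size, replace=False)
        g = sum(grad_J(theta, data[i]) for i in batch)    # gradient over the minibatch
        theta = theta - lr * g                            # gradient step with learning rate lr
    return theta
```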

Deep Feedforward Network: Design

• Output units (applied to the top hidden layer $h$)
  • Regression (a real vector): model $\Pr[y \mid x]$.
    • Affine transformation: $y_{\text{out}} = W^T h + b$
    • $J(\theta) = \mathbb{E}_{x, y \sim D_{\text{train}}} \| y - y_{\text{out}} \|^2$
  • Binary classification
    • Sigmoid $\sigma$: $y_{\text{out}} = \sigma(W^T h + b)$
    • $J(\theta) = -\mathbb{E}_{x, y \sim D_{\text{train}}} \log y_{\text{out}}(x)$
  • Multinoulli ($n$-class classification)
    • Softmax: $z = W^T h + b = (z_1, \dots, z_n)$, $\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ (sketched in code below)
  • Note: the gradient must be large and predictable; taking the log can help.
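
A minimal sketch of the softmax output unit; subtracting the maximum before exponentiating is a standard numerical-stability trick (it does not change the result).

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)      # shifting z does not change the softmax
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, -1.0])   # z = W^T h + b for a 3-class problem (illustrative values)
print(softmax(z))                # ~[0.705, 0.259, 0.035], sums to 1
```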


• Hidden Units
  • Rectified linear units (ReLU): $g(z) = \max\{0, z\}$, used as $h = g(W^T x + b)$.
    • Good point: large gradient when active, easy to train.
    • Bad point: not learnable when inactive (zero gradient).
  • Generalizations: $g(z) = \max(0, z) + \alpha \min(0, z)$ (sketched in code below)
    • Absolute value rectification: $\alpha = -1$.
    • Leaky ReLU: fixed small $\alpha$ (e.g., 0.01).
    • Parametric ReLU (PReLU): learnable $\alpha$.
    • Maxout units: group the inputs and take the max.
  • Traditional units:
    • Sigmoid (consider $\tanh(z)$)
    • Perceptron
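
A minimal sketch of ReLU and the generalized form $g(z) = \max(0, z) + \alpha \min(0, z)$ from the list above.

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z}."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """g(z) = max(0, z) + alpha * min(0, z); alpha = -1 gives |z|, a learnable alpha gives PReLU."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))         # [0.    0.    0.    1.5 ]
print(leaky_relu(z))   # [-0.02  -0.005  0.    1.5 ]
```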


• Architecture Design
  • The layer structure (see the code sketch below):
    • $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$
    • $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$
    • ...
  • How many layers? How many units in each layer?
    • Theoretically, the universal approximation theorem: one layer of sigmoid units can approximate any Borel measurable function, given enough hidden units.
    • But it is not guaranteed that such a network can be learned.
    • Practically, (a) follow classic models (CNN, RNN, ...); (b) start from a few (2 or 3) layers with a few (16, 32, or 64) hidden units and watch the validation error.
    • Greater depth may not be better.
  • Other considerations: locally connected layers, skip connections, recurrent layers, ...
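
A minimal sketch of a two-layer forward pass following the layer equations above, with ReLU hidden units and a softmax output; the layer sizes and random weights are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b, x, g):
    """One layer: h = g(W^T x + b)."""
    return g(W.T @ x + b)

relu = lambda z: np.maximum(0.0, z)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# sizes: 8 inputs -> 16 hidden units -> 3 output classes (arbitrary for the example)
W1, b1 = 0.1 * rng.standard_normal((8, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.standard_normal((16, 3)), np.zeros(3)

x = rng.standard_normal(8)
h1 = layer(W1, b1, x, relu)       # h^(1) = g^(1)(W^(1)T x + b^(1))
y = layer(W2, b2, h1, softmax)    # output distribution over 3 classes
print(y, y.sum())                 # probabilities summing to 1
```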

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.
  • Over-fitting is a serious problem in deep learning.

• Method 1: Norm Penalties
  • $J(\theta)_{\text{new}} = J(\theta) + \alpha \cdot \Omega(\theta)$
  • $L^2$ regularization (weight decay): $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$ (sketched in code below)
  • $L^1$ regularization: $\Omega(\theta) = \sum_i |\theta_i|$
  • What is the difference between them?
  • Typically, we put a penalty on the weights $W$ but not on the biases, e.g., on $W^{(1)}$ in $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$.
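
A minimal sketch of how the $L^2$ penalty enters the cost and its gradient (penalizing $W$ only, not the bias); the numbers in the usage example are arbitrary.

```python
import numpy as np

def l2_penalty(W):
    """Omega(theta) = 1/2 * ||W||^2 (biases are not penalized)."""
    return 0.5 * np.sum(W ** 2)

def regularized_loss_and_grad(loss, grad_W, W, alpha=1e-4):
    """J_new = J + alpha * Omega;  dJ_new/dW = dJ/dW + alpha * W."""
    return loss + alpha * l2_penalty(W), grad_W + alpha * W

W = np.array([[1.0, -2.0],
              [0.5,  0.0]])
loss, grad = regularized_loss_and_grad(loss=1.3, grad_W=np.zeros_like(W), W=W, alpha=0.1)
print(loss)   # 1.3 + 0.1 * 0.5 * (1 + 4 + 0.25) = 1.5625
```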


• Method 2: Norm Penalties as Constrained Optimization
  • $J(\theta, \alpha)_{\text{new}} = J(\theta) + \alpha \cdot (\Omega(\theta) - k)$
  • $\alpha$ increases when $\Omega(\theta) > k$; $\alpha$ decreases when $\Omega(\theta) < k$.
  • Take $\alpha$ as another learnable parameter.
  • Select $k$ manually; $\alpha$ still needs to be controlled manually.


• Method 3: Explicit Constraints
  • $J(\theta)_{\text{new}} = J(\theta)$
  • Whenever $\Omega(\theta) > k$, project $\theta$ to the nearest point satisfying $\Omega(\theta) \le k$ (a sketch for the $L^2$ case follows below).
  • After each iteration, check whether the new parameters are inside the region $\Omega(\theta) \le k$.
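
A minimal sketch of the projection step, assuming $\Omega$ is the $L^2$ norm so that the constraint region is a ball of radius $k$.

```python
import numpy as np

def project_l2_ball(theta, k):
    """If ||theta|| > k, scale theta back onto the ball of radius k; otherwise leave it unchanged."""
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

theta = np.array([3.0, 4.0])             # ||theta|| = 5
print(project_l2_ball(theta, k=1.0))     # [0.6, 0.8], norm 1
```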


• Method 4: Dataset Augmentation
  • How does over-fitting happen? High representability and insufficient data.
  • Creating fake but useful training data is possible when the target classifier is invariant to some transformations (e.g., in object detection).

(Figure: two transformed images of the digit 9, one a good augmented example and one a bad one.)
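
A minimal sketch of augmentation for image inputs, assuming the task is invariant to horizontal flips and to small pixel noise; for digits such as 9, a flip would be a bad transformation, so the allowed transformations must match the task.

```python
import numpy as np

def augment(image, rng, noise_std=0.02):
    """Return a randomly flipped and slightly noised copy of an (H, W) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                              # horizontal flip (only if the task allows it)
    out = out + noise_std * rng.standard_normal(out.shape)  # small pixel noise
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
print(augment(img, rng).shape)   # (28, 28): same shape, new training example
```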


• Method 5: Multi-Task Learning
  • Two models (problems) share some parameters.
  • The hidden representations can be better learned if the two tasks share certain factors.

(Figure: a shared hidden representation feeding task-specific outputs such as "is it an animal?", "is it a cat?", "is it a dog?".)


• Method 6: Early Stopping
  • We want to stop early, before convergence, based on the validation-set error.
  • One possible method (pseudocode below, and a code sketch after it): stop if no better parameters are found in $p$ (say 10) consecutive training rounds.
  • Simple to implement, but needs some extra data.

  while $j < p$:
    update $\theta$ for $n$ steps
    if no smaller validation error:
      $j = j + 1$
    else:
      note the new parameters
      $j = 0$

  • Use a second round of training to fully utilize the data:
    • Retraining: adopt the same number of training steps.
    • Continued training: start from the obtained parameters.
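
A minimal sketch of the early-stopping loop above; `train_n_steps` and `validation_error` are hypothetical callables standing in for one round of training and one validation-set evaluation.

```python
import copy

def early_stopping(theta, train_n_steps, validation_error, patience=10):
    """Stop when no better validation error is seen for `patience` consecutive rounds."""
    best_theta, best_err, j = copy.deepcopy(theta), validation_error(theta), 0
    while j < patience:
        theta = train_n_steps(theta)          # update theta for n steps
        err = validation_error(theta)
        if err < best_err:                    # better parameters found: note them, reset counter
            best_theta, best_err, j = copy.deepcopy(theta), err, 0
        else:
            j += 1
    return best_theta                         # return the best parameters seen
```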


• Method 7: Dropout
  • Background, bagging: train different models with different datasets. Bagging neural networks is not effective.
  • Idea: dynamically change the network structure by muting units, i.e., setting their outputs to zero.
  • Algorithm: before each iteration, randomly mute units; typically, drop an input node with probability 0.2 and a hidden node with probability 0.5 (a code sketch follows below).
  • Pros: computationally cheap; independent of the training algorithm or model.
  • Cons: the size of the model may need to increase; does not work well for small data sets.
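
A minimal sketch of a dropout mask applied to one layer's activations; the rescaling by 1/(1 - p) is the common "inverted dropout" convention, an implementation detail not stated on the slide.

```python
import numpy as np

def dropout(h, p_drop=0.5, seed=None, train=True):
    """Randomly mute units of h with probability p_drop during training."""
    if not train:
        return h                                   # no dropout at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) >= p_drop)         # 0 = muted unit, 1 = kept unit
    return h * mask / (1.0 - p_drop)               # rescale kept units (inverted dropout)

h = np.ones(10)
print(dropout(h, p_drop=0.5, seed=0))   # roughly half the entries zeroed, the rest scaled to 2.0
```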


Reference: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.


• Method 8: Adversarial Training
  • There can be inputs $x_1 \approx x_2$ whose predictions are very different.
  • Train on adversarially perturbed examples from the training set.
  • Related: generative adversarial networks (GANs).

Reference: Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.

Convolutional Neural Networks

• One large class of neural networks.
• The main unit is the convolutional layer: replace the matrix multiplication by a convolution operation.
• Designed for processing grid-like inputs.

• Convolution in math: $(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau$
  • An average of $g(t)$ weighted by $f(-\tau)$.

• Convolution in deep learning: in the layer equations
  • $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$
  • $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$
  • ...
  the matrix product $W^{(2)T} h^{(1)}$ is replaced by $K * h^{(1)}$ for a small kernel $K$.

• 1D example (sketched in code below): with $h^{(1)} = (a, b, c, d)$ and kernel $K = (e, f)$, the output is $(ae + bf,\ be + cf,\ ce + df,\ de + 0)$, where the last entry uses zero padding.
• 2D example: a 2D kernel slides over a 2D grid (e.g., an image) in the same way.

• What are the advantages?
  • Sparse interactions: each output depends only on a kernel-sized neighborhood, so there are fewer parameters.
  • Parameter sharing: one set of kernel parameters per layer.
  • Equivariance: the output changes in the same way as the input does (e.g., under translation).


• Typical structure of a convolutional network:
  • Convolution layers +
  • Activation functions (e.g., ReLU) +
  • Pooling and sampling
• Pooling layers:
  • Replace each node with a summary of its nearby neighbors.
  • E.g., max pooling reports the maximum output within a rectangular neighborhood (sketched in code below).
• Why pooling?
  • Theoretically, it makes the model invariant to small translations.
  • Intuitively, we care more about whether some feature exists than about exactly where it is.
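
A minimal sketch of max pooling with non-overlapping 2x2 windows (stride 2), one common choice among several.

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 windows of an (H, W) array with even H and W."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]], dtype=float)
print(max_pool_2x2(x))
# [[4. 2.]
#  [2. 8.]]
```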


Recurrent Neural Network

“In this lecture, we are going to learn deep learning…..”

• One large class of neural networks.
• Designed for processing sequence data.

(Figure: an unrolled sequence model mapping inputs $x_1, x_2, x_3, x_4$ to outputs $y_1, y_2, y_3, y_4$.)

• A typical pattern:
  • Stationary model
  • Parameter sharing
  • Unlimited sequence length
  • Powerful: can simulate a universal Turing machine

(Figure: at each time step, the input feeds the hidden state, the hidden state produces the output, and the output is compared with the truth to compute the loss; the hidden state is passed on to the next step.)

• Hidden transmission: $a^t = b + W h^{t-1} + U x^t$, $h^t = \tanh(a^t)$
• Output: $o^t = c + V h^t$, $y^t_{\text{out}} = \text{softmax}(o^t)$
• Loss: cross-entropy between $y^t_{\text{out}}$ and the truth.

Example ("hello"): a character-level language model with vocabulary [h, e, l, o]. Each input character is one-hot encoded; at each step the network should predict the next character, so for the input sequence h, e, l, l the targets are e, l, l, o (see the code sketch below).

(Figure: the one-hot input vectors for h, e, l, l; the hidden-state vectors; and the output scores at each step, with the target character's score highlighted.)
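
A minimal sketch of the forward recurrence on this example with random weights, so the predictions are meaningless; only the mechanics of $a^t$, $h^t$, $o^t$, $y^t_{\text{out}}$ are illustrated.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
V_size, H_size = len(vocab), 3
rng = np.random.default_rng(0)

# parameters: U (input->hidden), W (hidden->hidden), V (hidden->output), biases b, c
U = 0.1 * rng.standard_normal((H_size, V_size))
W = 0.1 * rng.standard_normal((H_size, H_size))
V = 0.1 * rng.standard_normal((V_size, H_size))
b, c = np.zeros(H_size), np.zeros(V_size)

def one_hot(ch):
    x = np.zeros(V_size)
    x[vocab.index(ch)] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H_size)                    # initial hidden state
for ch in "hell":                       # inputs; the targets would be "ello"
    a = b + W @ h + U @ one_hot(ch)     # a^t = b + W h^(t-1) + U x^t
    h = np.tanh(a)                      # h^t = tanh(a^t)
    o = c + V @ h                       # o^t = c + V h^t
    y = softmax(o)                      # y^t_out = softmax(o^t)
    print(ch, '->', vocab[int(np.argmax(y))])
```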


• Variant: output-guided hidden transmission (teacher forcing), where the connection into the next step comes from the output rather than from the hidden state, so less information is carried forward.
  • Less powerful in terms of representability: the output may not carry sufficient information.
  • But training can be parallelized.


Summary

• General Deep Feedforward Network • Training • Design: output units, hidden units, architecture • Regularization

• Convolutional Neural Network • Convolution Operator • Grid-like input • Extract features layer by layer

• Recurrent Neural Network • Sequence data processing • Hidden state transmission
