Introduction to Machine Learning
Lecture: Deep Learning, An Introduction
Amo G. Tong
• Reference: Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
• Some materials are courtesy of . All pictures belong to their creators.

Deep Learning
• What are deep learning methods?
  • Using a complex neural network to approximate the function we want to learn.
• Story: the ImageNet object recognition contest.
  • LeNet-5 (1998): a 7-level convolutional network.
  • AlexNet (2012): more layers and filters; trained for 6 days on two GPUs; error rate 15.3% (reduced from 26.2%).
  • ResNet (2015): 152 layers with residual connections; error rate 3.57%.
  • https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
• Why do we need deep learning?
  • We need to solve complex real-world problems.
  • It gives a standard way to parameterize functions, layer by layer.
  • It is flexible enough to be customized for different applications, e.g. image processing and sequence data.
  • It comes with standard training methods, namely backpropagation.
• Why now?
  • Availability of data.
  • Hardware: GPUs, …
  • Software: TensorFlow, PyTorch, …

Outline
• General deep feedforward networks.
• Convolutional neural networks (CNN): image processing.
• Recurrent neural networks (RNN): sequence data processing.

Deep Feedforward Network
• Also known as the multi-layer perceptron (MLP): an input layer, several hidden layers h, and an output layer. (Figure: input, hidden layers, output.)
• With more layers, three questions arise: training, design, and regularization.

Deep Feedforward Network: Training
• Cost function: J(θ).
• A common choice is cross-entropy, which measures the difference between two distributions P and Q:
  • H(P, Q) = −E_{x∼P}[log Q(x)]
  • Related: the KL divergence D(P || Q) = Σ_x P(x) log( P(x) / Q(x) ).
• As a training objective, J(θ) = −E_{(x,y)∼D_train}[log P_model(y | x)], the negative log-likelihood.
  • P_model(y | x) changes from model to model.
  • The least-squares error corresponds to assuming a Gaussian P_model and applying maximum likelihood estimation (MLE).
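As a quick check of these two formulas, here is a minimal NumPy sketch (my own illustration, not part of the lecture; the function names and the toy distributions p and q are assumptions):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -E_{x~P}[log Q(x)] for two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

def kl_divergence(p, q, eps=1e-12):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q
print(cross_entropy(p, q))      # H(P, Q)
print(kl_divergence(p, q))      # D(P || Q) = H(P, Q) - H(P, P)
```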
Deep Feedforward Network: Training
• Gradient-based training.
  • The cost is non-convex because the network contains nonlinear units.
  • We iteratively decrease the cost function; the result is not guaranteed to be a global optimum, nor even a local one.
  • Initialize the weights with small random numbers.
  • Compute gradients with backpropagation.
  • One issue: the gradient must be large and predictable enough to guide learning.
• Stochastic gradient descent (SGD):
  While the stopping criterion is not met:
    Sample a minibatch of m examples D_m.
    Compute the gradient g ← Σ_{x∈D_m} ∇_θ J(θ; x).
    Update θ ← θ − εg.

Deep Feedforward Network: Design
• Output units. Let h be the real vector produced by the last hidden layer.
  • Regression (model Pr[y | x]):
    • Affine transformation: y_out = W^T h + b.
    • J(θ) = E_{(x,y)∼D_train} ||y − y_out||².
  • Binary classification:
    • Sigmoid: y_out = σ(W^T h + b).
    • J(θ) = −E_{(x,y)∼D_train}[log y_out(x)].
  • Multinoulli (n-way classification):
    • Softmax over z = W^T h + b = (z_1, …, z_n): softmax(z)_i = exp(z_i) / Σ_j exp(z_j).
  • Note: the gradient must be large and predictable; taking the log, as in the log-likelihood costs above, can help.
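To make the SGD loop and the softmax output unit concrete, here is a minimal NumPy sketch (my own illustration, not code from the lecture; the toy data, the learning rate lr, and all variable names are assumptions) that trains a softmax output layer on the negative log-likelihood with minibatch SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_classes, lr = 4, 3, 0.1

# Toy data: labels depend on the inputs so the loss can actually decrease.
X = rng.normal(size=(200, n_features))
y = X[:, :n_classes].argmax(axis=1)

W = 0.01 * rng.normal(size=(n_features, n_classes))  # small random initialization
b = np.zeros(n_classes)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for step in range(500):                               # "while stopping criterion not met"
    idx = rng.choice(len(X), size=32, replace=False)  # sample a minibatch D_m
    Xb, yb = X[idx], y[idx]
    probs = softmax(Xb @ W + b)                       # y_out = softmax(W^T h + b)
    loss = -np.log(probs[np.arange(len(yb)), yb]).mean()  # negative log-likelihood
    if step % 100 == 0:
        print(step, round(float(loss), 3))
    # Gradient of the mean NLL w.r.t. the logits is (probs - one_hot(y)) / m.
    grad_z = probs.copy()
    grad_z[np.arange(len(yb)), yb] -= 1.0
    grad_z /= len(yb)
    gW, gb = Xb.T @ grad_z, grad_z.sum(axis=0)        # backprop through the affine map
    W -= lr * gW                                      # θ ← θ − εg
    b -= lr * gb
```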
Deep Feedforward Network: Design
• Hidden units h.
  • Rectified linear units (ReLU): g(z) = max{0, z}, so h = g(W^T x + b).
    • Good point: a large gradient when the unit is active, so it is easy to train.
    • Bad point: the unit cannot learn when it is inactive (the gradient there is zero).
  • Generalizations (see the ReLU sketch at the end of this section): g(z) = max(0, z) + α · min(0, z).
    • Absolute value rectification: α = −1.
    • Leaky ReLU: a fixed small α (e.g. 0.01).
    • Parametric ReLU (PReLU): a learnable α.
    • Maxout units: group the inputs and take the max within each group.
  • Traditional units:
    • Sigmoid (consider tanh(z) instead).
    • Perceptron (threshold) units.

Deep Feedforward Network: Design
• Architecture design: the layer structure.
  • h^(1) = g^(1)((W^(1))^T x + b^(1))
  • h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))
  • …
• How many layers? How many units in each layer?
  • Theoretically, the universal approximation theorem says that one layer of sigmoid units can approximate any Borel measurable function, given enough hidden units.
  • But there is no guarantee that this function can actually be learned.
  • Practically: (a) follow classic models such as CNNs and RNNs; (b) start from a few (2 or 3) layers with a few (16, 32, or 64) hidden units each and watch the validation error.
  • Greater depth may not be better.
• Other considerations: locally connected layers, skip connections, recurrent layers, … (A small layer-stacking sketch follows below.)
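A minimal sketch of the ReLU family described above (my own illustration; the parameter name alpha stands for the slide's α):

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z}."""
    return np.maximum(0.0, z)

def generalized_relu(z, alpha):
    """g(z) = max(0, z) + alpha * min(0, z)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.linspace(-2, 2, 5)
print(relu(z))                       # ReLU
print(generalized_relu(z, -1.0))     # absolute value rectification: |z|
print(generalized_relu(z, 0.01))     # leaky ReLU with a fixed small alpha
# PReLU uses the same formula, but alpha is a learnable parameter.
```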
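And a sketch of the layer-by-layer structure, assembling a small two-hidden-layer network by hand (the layer sizes, variable names, and helper function dense are illustrative assumptions, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

def dense(n_in, n_out):
    """One affine layer: returns (W, b) with a small random initialization."""
    return 0.01 * rng.normal(size=(n_in, n_out)), np.zeros(n_out)

# Start small: two hidden layers with 32 and 16 units, then watch the validation error.
W1, b1 = dense(8, 32)
W2, b2 = dense(32, 16)
W3, b3 = dense(16, 1)

def forward(x):
    h1 = np.maximum(0.0, x @ W1 + b1)   # h^(1) = g^(1)((W^(1))^T x + b^(1)), ReLU
    h2 = np.maximum(0.0, h1 @ W2 + b2)  # h^(2) = g^(2)((W^(2))^T h^(1) + b^(2))
    return h2 @ W3 + b3                 # affine output unit, e.g. for regression

x = rng.normal(size=(5, 8))             # a batch of 5 inputs
print(forward(x).shape)                 # (5, 1)
```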