
Introduction to Machine Learning

Lecture *: Deep Learning

Amo G. Tong

• An Introduction • Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville

• Some materials are courtesy of . • All pictures belong to their creators.

Deep Learning

• What are deep learning methods? • Using a complex neural network to approximate the function we want to learn.

Story: the ImageNet object recognition contest.
• LeNet-5 (1998): a 7-level convolutional network.
• AlexNet (2012): more layers and filters. Trained for 6 days on two GPUs. Error rate: 15.3% (reduced from 26.2%).
• ResNet (2015): 152 layers with residual connections. Error rate: 3.57%.

https://medium.com/analytics-vidhya/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5


• Why do we need deep learning?
  • We need to solve complex real-world problems.
  • A standard way to parameterize functions, layer by layer.
  • Flexible to be customized for different applications: image processing, sequence data, ...
  • Standard training methods.


• Why now?
  • Availability of data.
  • Hardware: GPUs, ...
  • Software: TensorFlow, PyTorch, ...

Outline

• General Deep Feedforward Network.

• Convolutional Neural Network (CNN) • Image processing.

• Recurrent Neural Network (RNN) • Sequence data processing.

Deep Feedforward Network

(Figure: a feedforward network with an input layer, several hidden layers, and an output layer.)

Training, Design, and Regularization


• Feedforward Network (Multi-Layer Perceptron (MLP)) with more layers.
  • Training
  • Design
  • Regularization

Deep Feedforward Network: Training

• Cost Function: $J(\theta)$
  • Common choice: cross-entropy, a measure of the difference between two distributions $P$ and $Q$ (a small numerical sketch follows below):
    • $H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x)$
    • Related: the KL divergence $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  • $J(\theta) = -\mathbb{E}_{x, y \sim D_{\text{train}}} \log P_{\text{model}}(y \mid x)$ (negative log-likelihood)
  • $P_{\text{model}}(y \mid x)$ changes from model to model.
  • Least squares error: assume a Gaussian and use MLE.
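
A minimal NumPy sketch of this negative log-likelihood cost; the probabilities and labels below are made-up illustrative values, not from the slides.

```python
import numpy as np

def nll_loss(probs, targets):
    """Average negative log-likelihood (cross-entropy) of the true classes.
    probs:   (batch, n_classes) predicted probabilities P_model(y|x)
    targets: (batch,) integer class labels
    """
    eps = 1e-12  # avoid log(0)
    picked = probs[np.arange(len(targets)), targets]  # probability of the true class
    return -np.mean(np.log(picked + eps))

# toy batch: 3 examples, 4 classes (values chosen only for illustration)
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10],
                  [0.05, 0.05, 0.10, 0.80]])
targets = np.array([0, 1, 3])
print(nll_loss(probs, targets))  # ~0.42
```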


• Gradient-Based Training
  • Non-convex because we have nonlinear units.
  • Iteratively decrease the cost function; not globally optimal, not even locally optimal.
  • Initialize with small random numbers.
  • Backpropagation.
  • One issue: the gradient must be large and predictable.

Stochastic gradient descent (SGD) algorithm (a code sketch follows below):
  While the stopping criterion is not met:
    Sample a minibatch of $m$ examples $D_m$
    Compute the gradient $g \leftarrow \sum_{x \in D_m} \nabla_\theta J(\theta, x)$
    Update $\theta \leftarrow \theta - \epsilon g$
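
A minimal NumPy sketch of the minibatch SGD loop above; `grad_J` is a hypothetical callable returning the per-example gradient, and the stopping criterion here is simply a fixed number of iterations.

```python
import numpy as np

def sgd(theta, data, grad_J, lr=0.01, batch_size=32, n_iters=1000, seed=None):
    """Minibatch SGD: theta <- theta - lr * sum_{x in D_m} grad_J(theta, x)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):                              # stopping criterion: fixed iteration count
        batch = rng.choice(len(data), size=batch_size, replace=False)
        g = sum(grad_J(theta, data[i]) for i in batch)    # gradient over the minibatch
        theta = theta - lr * g                            # gradient step with learning rate lr
    return theta
```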

Deep Feedforward Network: Design

• Output units (applied to the top hidden layer $h$)
  • Regression (a real vector): model $\Pr[y \mid x]$.
    • Affine transformation: $y_{\text{out}} = W^T h + b$
    • $J(\theta) = \mathbb{E}_{x, y \sim D_{\text{train}}} \| y - y_{\text{out}} \|^2$
  • Binary classification
    • Sigmoid $\sigma$: $y_{\text{out}} = \sigma(W^T h + b)$
    • $J(\theta) = -\mathbb{E}_{x, y \sim D_{\text{train}}} \log y_{\text{out}}(x)$
  • Multinoulli ($n$-class classification)
    • Softmax: $z = W^T h + b = (z_1, \dots, z_n)$, $\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$ (sketched in code below)
  • Note: the gradient must be large and predictable; taking the log can help.
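
A minimal sketch of the softmax output unit; subtracting the maximum before exponentiating is a standard numerical-stability trick (it does not change the result).

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)      # shifting z does not change the softmax
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, -1.0])   # z = W^T h + b for a 3-class problem (illustrative values)
print(softmax(z))                # ~[0.705, 0.259, 0.035], sums to 1
```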


• Hidden Units
  • Rectified linear units (ReLU): $g(z) = \max\{0, z\}$, used as $h = g(W^T x + b)$.
    • Good point: large gradient when active, easy to train.
    • Bad point: not learnable when inactive (zero gradient).
  • Generalizations: $g(z) = \max(0, z) + \alpha \min(0, z)$ (sketched in code below)
    • Absolute value rectification: $\alpha = -1$.
    • Leaky ReLU: fixed small $\alpha$ (e.g., 0.01).
    • Parametric ReLU (PReLU): learnable $\alpha$.
    • Maxout units: group the inputs and take the max.
  • Traditional units:
    • Sigmoid (consider $\tanh(z)$)
    • Perceptron
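
A minimal sketch of ReLU and the generalized form $g(z) = \max(0, z) + \alpha \min(0, z)$ from the list above.

```python
import numpy as np

def relu(z):
    """g(z) = max{0, z}."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """g(z) = max(0, z) + alpha * min(0, z); alpha = -1 gives |z|, a learnable alpha gives PReLU."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))         # [0.    0.    0.    1.5 ]
print(leaky_relu(z))   # [-0.02  -0.005  0.    1.5 ]
```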


• Architecture Design
  • The layer structure (see the code sketch below):
    • $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$
    • $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$
    • ...
  • How many layers? How many units in each layer?
    • Theoretically, the universal approximation theorem: one layer of sigmoid units can approximate any Borel measurable function, given enough hidden units.
    • But it is not guaranteed that such a network can be learned.
    • Practically, (a) follow classic models (CNN, RNN, ...); (b) start from a few (2 or 3) layers with a few (16, 32, or 64) hidden units and watch the validation error.
    • Greater depth may not be better.
  • Other considerations: locally connected layers, skip connections, recurrent layers, ...
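
A minimal sketch of a two-layer forward pass following the layer equations above, with ReLU hidden units and a softmax output; the layer sizes and random weights are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(W, b, x, g):
    """One layer: h = g(W^T x + b)."""
    return g(W.T @ x + b)

relu = lambda z: np.maximum(0.0, z)
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# sizes: 8 inputs -> 16 hidden units -> 3 output classes (arbitrary for the example)
W1, b1 = 0.1 * rng.standard_normal((8, 16)), np.zeros(16)
W2, b2 = 0.1 * rng.standard_normal((16, 3)), np.zeros(3)

x = rng.standard_normal(8)
h1 = layer(W1, b1, x, relu)       # h^(1) = g^(1)(W^(1)T x + b^(1))
y = layer(W2, b2, h1, softmax)    # output distribution over 3 classes
print(y, y.sum())                 # probabilities summing to 1
```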

Deep Feedforward Network: Regularization

• Regularization: methods for reducing over-fitting.
  • Over-fitting is a serious problem in deep learning.

• Method 1: Norm Penalties
  • $J(\theta)_{\text{new}} = J(\theta) + \alpha \cdot \Omega(\theta)$
  • $L^2$ regularization (weight decay): $\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2$ (sketched in code below)
  • $L^1$ regularization: $\Omega(\theta) = \sum_i |\theta_i|$
  • What is the difference between them?
  • Typically, we put a penalty on the weights $W$ but not on the biases, e.g., on $W^{(1)}$ in $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$.
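
A minimal sketch of how the $L^2$ penalty enters the cost and its gradient (penalizing $W$ only, not the bias); the numbers in the usage example are arbitrary.

```python
import numpy as np

def l2_penalty(W):
    """Omega(theta) = 1/2 * ||W||^2 (biases are not penalized)."""
    return 0.5 * np.sum(W ** 2)

def regularized_loss_and_grad(loss, grad_W, W, alpha=1e-4):
    """J_new = J + alpha * Omega;  dJ_new/dW = dJ/dW + alpha * W."""
    return loss + alpha * l2_penalty(W), grad_W + alpha * W

W = np.array([[1.0, -2.0],
              [0.5,  0.0]])
loss, grad = regularized_loss_and_grad(loss=1.3, grad_W=np.zeros_like(W), W=W, alpha=0.1)
print(loss)   # 1.3 + 0.1 * 0.5 * (1 + 4 + 0.25) = 1.5625
```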


• Method 2: Norm Penalties as Constrained Optimization
  • $J(\theta, \alpha)_{\text{new}} = J(\theta) + \alpha \cdot (\Omega(\theta) - k)$
  • $\alpha$ increases when $\Omega(\theta) > k$; $\alpha$ decreases when $\Omega(\theta) < k$.
  • Take $\alpha$ as another learnable parameter.
  • Select $k$ manually; $\alpha$ still needs to be controlled manually.


• Method 3: Explicit Constraints
  • $J(\theta)_{\text{new}} = J(\theta)$
  • Whenever $\Omega(\theta) > k$, project $\theta$ to the nearest point satisfying $\Omega(\theta) \le k$ (a sketch for the $L^2$ case follows below).
  • After each iteration, check whether the new parameters are inside the region $\Omega(\theta) \le k$.
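
A minimal sketch of the projection step, assuming $\Omega$ is the $L^2$ norm so that the constraint region is a ball of radius $k$.

```python
import numpy as np

def project_l2_ball(theta, k):
    """If ||theta|| > k, scale theta back onto the ball of radius k; otherwise leave it unchanged."""
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

theta = np.array([3.0, 4.0])             # ||theta|| = 5
print(project_l2_ball(theta, k=1.0))     # [0.6, 0.8], norm 1
```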


• Method 4: Dataset Augmentation
  • How does over-fitting happen? High representability and insufficient data.
  • Creating fake but useful training data is possible when the target classifier is invariant to some transformations (e.g., in object detection).

(Figure: two transformed images of the digit 9, one a good augmented example and one a bad one.)
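
A minimal sketch of augmentation for image inputs, assuming the task is invariant to horizontal flips and to small pixel noise; for digits such as 9, a flip would be a bad transformation, so the allowed transformations must match the task.

```python
import numpy as np

def augment(image, rng, noise_std=0.02):
    """Return a randomly flipped and slightly noised copy of an (H, W) image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                              # horizontal flip (only if the task allows it)
    out = out + noise_std * rng.standard_normal(out.shape)  # small pixel noise
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
print(augment(img, rng).shape)   # (28, 28): same shape, new training example
```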


• Method 5: Multi-Task Learning
  • Two models (problems) share some parameters.
  • The hidden representations can be better learned if the two tasks share certain factors.

(Figure: a shared hidden representation feeding task-specific outputs such as "is it an animal?", "is it a cat?", "is it a dog?".)


• Method 6: Early Stopping
  • We want to stop early, before convergence, based on the validation-set error.
  • One possible method (pseudocode below, and a code sketch after it): stop if no better parameters are found in $p$ (say 10) consecutive training rounds.
  • Simple to implement, but needs some extra data.

  while $j < p$:
    update $\theta$ for $n$ steps
    if no smaller validation error:
      $j = j + 1$
    else:
      note the new parameters
      $j = 0$

  • Use a second round of training to fully utilize the data:
    • Retraining: adopt the same number of training steps.
    • Continued training: start from the obtained parameters.
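
A minimal sketch of the early-stopping loop above; `train_n_steps` and `validation_error` are hypothetical callables standing in for one round of training and one validation-set evaluation.

```python
import copy

def early_stopping(theta, train_n_steps, validation_error, patience=10):
    """Stop when no better validation error is seen for `patience` consecutive rounds."""
    best_theta, best_err, j = copy.deepcopy(theta), validation_error(theta), 0
    while j < patience:
        theta = train_n_steps(theta)          # update theta for n steps
        err = validation_error(theta)
        if err < best_err:                    # better parameters found: note them, reset counter
            best_theta, best_err, j = copy.deepcopy(theta), err, 0
        else:
            j += 1
    return best_theta                         # return the best parameters seen
```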


• Method 7: Dropout
  • Background, bagging: train different models with different datasets. Bagging neural networks is not effective.
  • Idea: dynamically change the network structure by muting units, i.e., setting their outputs to zero.
  • Algorithm: before each iteration, randomly mute units; typically, drop an input node with probability 0.2 and a hidden node with probability 0.5 (a code sketch follows below).
  • Pros: computationally cheap; independent of the training algorithm or model.
  • Cons: the size of the model may need to increase; does not work well for small data sets.
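
A minimal sketch of a dropout mask applied to one layer's activations; the rescaling by 1/(1 - p) is the common "inverted dropout" convention, an implementation detail not stated on the slide.

```python
import numpy as np

def dropout(h, p_drop=0.5, seed=None, train=True):
    """Randomly mute units of h with probability p_drop during training."""
    if not train:
        return h                                   # no dropout at test time
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) >= p_drop)         # 0 = muted unit, 1 = kept unit
    return h * mask / (1.0 - p_drop)               # rescale kept units (inverted dropout)

h = np.ones(10)
print(dropout(h, p_drop=0.5, seed=0))   # roughly half the entries zeroed, the rest scaled to 2.0
```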


Reference: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958.


• Method 8: Adversarial Training
  • There can be inputs $x_1 \approx x_2$ whose predictions are very different.
  • Train on adversarially perturbed examples from the training set.
  • Related: generative adversarial networks (GANs).

Reference: Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, abs/1412.6572.

Convolutional Neural Networks

• One large class of neural networks.
• The main unit is the convolutional layer: replace the matrix multiplication by a convolution operation.
• Designed for processing grid-like inputs.

• Convolution in math: $(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau$
  • An average of $g(t)$ weighted by $f(-\tau)$.

• Convolution in deep learning: in the layer equations
  • $h^{(1)} = g^{(1)}(W^{(1)T} x + b^{(1)})$
  • $h^{(2)} = g^{(2)}(W^{(2)T} h^{(1)} + b^{(2)})$
  • ...
  the matrix product $W^{(2)T} h^{(1)}$ is replaced by $K * h^{(1)}$ for a small kernel $K$.

• 1D example (sketched in code below): with $h^{(1)} = (a, b, c, d)$ and kernel $K = (e, f)$, the output is $(ae + bf,\ be + cf,\ ce + df,\ de + 0)$, where the last entry uses zero padding.
• 2D example: a 2D kernel slides over a 2D grid (e.g., an image) in the same way.

• What are the advantages?
  • Sparse interactions: each output depends only on a kernel-sized neighborhood, so there are fewer parameters.
  • Parameter sharing: one set of kernel parameters per layer.
  • Equivariance: the output changes in the same way as the input does (e.g., under translation).


• Typical structure of a convolutional network:
  • Convolution layers +
  • Activation functions (e.g., ReLU) +
  • Pooling and sampling
• Pooling layers:
  • Replace each node with a summary of its nearby neighbors.
  • E.g., max pooling reports the maximum output within a rectangular neighborhood (sketched in code below).
• Why pooling?
  • Theoretically, it makes the model invariant to small translations.
  • Intuitively, we care more about whether some feature exists than about exactly where it is.
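
A minimal sketch of max pooling with non-overlapping 2x2 windows (stride 2), one common choice among several.

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 windows of an (H, W) array with even H and W."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]], dtype=float)
print(max_pool_2x2(x))
# [[4. 2.]
#  [2. 8.]]
```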


Recurrent Neural Network

“In this lecture, we are going to learn deep learning…..”

• One large class of neural networks.
• Designed for processing sequence data.

(Figure: an unrolled sequence model mapping inputs $x_1, x_2, x_3, x_4$ to outputs $y_1, y_2, y_3, y_4$.)

• A typical pattern:
  • Stationary model
  • Parameter sharing
  • Unlimited sequence length
  • Powerful: can simulate a universal Turing machine

(Figure: at each time step, the input feeds the hidden state, the hidden state produces the output, and the output is compared with the truth to compute the loss; the hidden state is passed on to the next step.)

• Hidden transmission: $a^t = b + W h^{t-1} + U x^t$, $h^t = \tanh(a^t)$
• Output: $o^t = c + V h^t$, $y^t_{\text{out}} = \text{softmax}(o^t)$
• Loss: cross-entropy between $y^t_{\text{out}}$ and the truth.

Example ("hello"): a character-level language model with vocabulary [h, e, l, o]. Each input character is one-hot encoded; at each step the network should predict the next character, so for the input sequence h, e, l, l the targets are e, l, l, o (see the code sketch below).

(Figure: the one-hot input vectors for h, e, l, l; the hidden-state vectors; and the output scores at each step, with the target character's score highlighted.)
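
A minimal sketch of the forward recurrence on this example with random weights, so the predictions are meaningless; only the mechanics of $a^t$, $h^t$, $o^t$, $y^t_{\text{out}}$ are illustrated.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
V_size, H_size = len(vocab), 3
rng = np.random.default_rng(0)

# parameters: U (input->hidden), W (hidden->hidden), V (hidden->output), biases b, c
U = 0.1 * rng.standard_normal((H_size, V_size))
W = 0.1 * rng.standard_normal((H_size, H_size))
V = 0.1 * rng.standard_normal((V_size, H_size))
b, c = np.zeros(H_size), np.zeros(V_size)

def one_hot(ch):
    x = np.zeros(V_size)
    x[vocab.index(ch)] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(H_size)                    # initial hidden state
for ch in "hell":                       # inputs; the targets would be "ello"
    a = b + W @ h + U @ one_hot(ch)     # a^t = b + W h^(t-1) + U x^t
    h = np.tanh(a)                      # h^t = tanh(a^t)
    o = c + V @ h                       # o^t = c + V h^t
    y = softmax(o)                      # y^t_out = softmax(o^t)
    print(ch, '->', vocab[int(np.argmax(y))])
```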


• Variant: output-guided hidden transmission (teacher forcing), where the connection into the next step comes from the output rather than from the hidden state, so less information is carried forward.
  • Less powerful in terms of representability: the output may not carry sufficient information.
  • But training can be parallelized.


Summary

• General Deep Feedforward Network • Training • Design: output units, hidden units, architecture • Regularization

• Convolutional Neural Network • Convolution Operator • Grid-like input • Extract features layer by layer

• Recurrent Neural Network • Sequence data processing • Hidden state transmission
