
Lecture 8. Deep Learning. Convolutional ANNs. Autoencoders
COMP90051 Statistical Machine Learning

Semester 2, 2017 Lecturer: Andrey Kan

Copyright: University of Melbourne

This lecture

• Deep learning
∗ Representation capacity
∗ Deep models and representation learning
• Convolutional Neural Networks
∗ Convolution operator
∗ Elements of a convolution-based network
• Autoencoders
∗ Learning efficient coding


Deep Learning and Representation Learning

Hidden layers viewed as feature space transformation

Representational capacity

• ANNs with a single hidden layer are universal approximators
• For example, such ANNs can represent any Boolean function:

$OR(x_1, x_2)$: $u = g(x_1 + x_2 - 0.5)$
$AND(x_1, x_2)$: $u = g(x_1 + x_2 - 1.5)$
$NOT(x)$: $u = g(-x)$
where $g(r) = 1$ if $r \geq 0$ and $g(r) = 0$ otherwise

• Any Boolean function over $m$ variables can be implemented using a hidden layer with up to $2^m$ elements
• More efficient to stack several hidden layers (see the sketch below)
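To make these constructions concrete, here is a small Python sketch (not part of the original deck) of the gates above using the step activation $g$; composing the gates into XOR, which a single linear threshold unit cannot represent, hints at why stacking layers helps.

```python
def g(r):
    # Step activation from the slide: g(r) = 1 if r >= 0, else 0
    return 1.0 if r >= 0 else 0.0

def OR(x1, x2):  return g(x1 + x2 - 0.5)
def AND(x1, x2): return g(x1 + x2 - 1.5)
def NOT(x):      return g(-x)

# XOR needs a second layer: XOR(x1, x2) = AND(OR(x1, x2), NOT(AND(x1, x2)))
def XOR(x1, x2): return AND(OR(x1, x2), NOT(AND(x1, x2)))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, XOR(a, b))   # prints 0, 1, 1, 0 for the four input pairs
```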


Deep networks

"Depth" refers to the number of hidden layers

[Figure: a feed-forward network with input layer $\boldsymbol{x} = (x_1, \dots, x_p)$, hidden layers $\boldsymbol{s}$, $\boldsymbol{t}$, $\boldsymbol{u}$, output layer $\boldsymbol{z} = (z_1, \dots, z_q)$, and weights $a_{ij}$, $b_{jk}$, $c_{kl}$, $d_{lm}$ between consecutive layers]

$\boldsymbol{s} = \tanh(\boldsymbol{A}'\boldsymbol{x})$, $\quad\boldsymbol{t} = \tanh(\boldsymbol{B}'\boldsymbol{s})$, $\quad\boldsymbol{u} = \tanh(\boldsymbol{C}'\boldsymbol{t})$, $\quad\boldsymbol{z} = \tanh(\boldsymbol{D}'\boldsymbol{u})$

Deep ANNs as representation learning

• Consecutive layers form representations of the input of increasing complexity
• An ANN can have a simple linear output layer, but use complex non-linear representations underneath: $\boldsymbol{z} = \tanh(\boldsymbol{D}'\tanh(\boldsymbol{C}'\tanh(\boldsymbol{B}'\tanh(\boldsymbol{A}'\boldsymbol{x}))))$
• Equivalently, a hidden layer can be thought of as the transformed feature space, e.g., $\boldsymbol{u} = \varphi(\boldsymbol{x})$
• Parameters of such a transformation are learned from data

Bias terms are omitted for simplicity
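As a concrete illustration, here is a minimal NumPy sketch (not from the deck) of this forward pass with random, untrained weights; the layer sizes are arbitrary assumptions.

```python
import numpy as np

def tanh_layer(W, x):
    # One fully connected layer with tanh activation
    # (biases omitted, matching the slides)
    return np.tanh(W.T @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=5)        # input vector
A = rng.normal(size=(5, 4))   # input -> hidden layer 1
B = rng.normal(size=(4, 3))   # hidden layer 1 -> hidden layer 2
C = rng.normal(size=(3, 3))   # hidden layer 2 -> hidden layer 3
D = rng.normal(size=(3, 2))   # hidden layer 3 -> output

s = tanh_layer(A, x)
t = tanh_layer(B, s)
u = tanh_layer(C, t)          # u = phi(x): the learned feature space
z = tanh_layer(D, u)
```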

ANN layers as data transformation

[Figure: the same network shown four times; the boundary between "data" and "the model" shifts one layer to the right each time, so each successive hidden layer can be viewed as producing pre-processed data for the rest of the model]

Depth vs width

• A single, infinitely wide layer in theory gives a universal approximator
• However, depth tends to give more accurate models
• Biological inspiration from the eye:
∗ first detect small edges and color patches;
∗ compose these into smaller shapes;
∗ build to more complex detectors, such as textures, faces, etc.
• Seek to mimic layered complexity in a network
• However, the vanishing gradient problem affects learning with very deep models

Animals in the zoo

[Figure: a taxonomy of Artificial Neural Networks (ANNs): feed-forward networks (perceptrons, multilayer perceptrons, convolutional neural networks) and recurrent neural networks; art: OpenClipartVectors at pixabay.com (CC0)]

• Recurrent neural networks are not covered in this subject
• If time permits, we will cover autoencoders. An autoencoder is an ANN trained in a specific way.
∗ E.g., a multilayer perceptron can be trained as an autoencoder, or a convolutional neural network can be trained as an autoencoder.

Convolutional Neural Networks (CNN)

Based on repeated application of small filters to patches of a 2D image or range of a 1D input

Convolution

[Figure: 1D convolution as a sliding window: filter weights $w_1, w_2, w_3$ are multiplied with the input elements under the window and summed ($\Sigma$) to produce one element $b_2$ of the output vector $\boldsymbol{b}$; $\boldsymbol{a} = (a_1, a_2, a_3, a_4, \dots)$ is the input vector]

$$b_i = \sum_{\delta=-C}^{C} a_{i+\delta} \, w_{\delta+C+1}$$

$\boldsymbol{w} = (w_1, w_2, w_3)^T$ is called a kernel* or filter (here $C = 1$), and the operation is written $\boldsymbol{b} = \boldsymbol{a} * \boldsymbol{w}$. The output $b_i$ is defined wherever the window fits inside the input, i.e., $i \geq 2$ for this size-3 filter.
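A minimal NumPy sketch of this operation (the input and filter values are made up; note that the slide's indexing matches what NumPy calls correlation, since the filter is not flipped):

```python
import numpy as np

def conv1d_valid(a, w):
    # b_i = sum over delta of a_{i+delta} * w_{delta+C+1},
    # for windows that fit entirely inside a ("valid" positions)
    C = (len(w) - 1) // 2
    return np.array([np.dot(a[i - C:i + C + 1], w)
                     for i in range(C, len(a) - C)])

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])            # small smoothing filter
print(conv1d_valid(a, w))                  # [2. 3. 4.]
print(np.correlate(a, w, mode="valid"))    # same result
```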

*Later in the subject, we will also use an unrelated definition of kernel as a function representing a dot product.

Convolution on 2D images

[Figure: 2D convolution: a filter $\boldsymbol{W}$ slides over the input image $\boldsymbol{A}$; at each position, the window is weighted and summed ($\Sigma$) to produce one output pixel of $\boldsymbol{B}$]

$$\boldsymbol{B} = \boldsymbol{A} * \boldsymbol{W}, \qquad B_{ij} = \sum_{\delta_i=-C}^{C} \sum_{\delta_j=-D}^{D} A_{i+\delta_i,\, j+\delta_j} \, W_{\delta_i+C+1,\, \delta_j+D+1}$$
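The same idea in two dimensions, as a small NumPy sketch ("valid" positions only; the toy image is made up, and the filter is the vertical edge filter shown on the next slide):

```python
import numpy as np

def conv2d_valid(A, W):
    # B_ij = weighted sum of the window of A at (i, j), with weights W
    m, n = W.shape
    H, Wd = A.shape
    B = np.zeros((H - m + 1, Wd - n + 1))
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i, j] = np.sum(A[i:i + m, j:j + n] * W)
    return B

A = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
W = np.array([[-1.0, 0.0, 1.0]] * 3)          # vertical edge filter
print(conv2d_valid(A, W))   # all 6s: the image is a uniform horizontal gradient
```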

Filters as feature detectors

Convolve the input image $\boldsymbol{A}$ with a vertical edge filter, then apply the activation function:

$$\begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{pmatrix}$$

[Figure: input image $\boldsymbol{A}$ and the resulting filtered image, highlighting vertical edges]

Similarly, convolve with a horizontal edge filter:

$$\begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{pmatrix}$$

[Figure: input image $\boldsymbol{A}$ and the resulting filtered image, highlighting horizontal edges]

Stacking convolutions

[Figure: the vertical and horizontal edge filters applied to an image, followed by downsampling and further convolutions]

• Develop complex representations at different scales and complexity
• Filters are learned from training data!

CNN example

[Figure: a LeNet5-style architecture. Patches of a 48 × 48 input image go through a 2D convolution to give 48 × 48 × 5 feature maps; downsampling gives 24 × 24 × 5; a 3D convolution gives 24 × 24 × 10; downsampling gives 12 × 12 × 10; flattening gives 1 × 1440; a fully connected layer gives 1 × 720, followed by the output. Implemented by Jizhizi Li based on LeNet5: http://deeplearning.net/tutorial/lenet.html]

Components of a CNN

• Convolutional layers
∗ Complex input representations based on the convolution operation
∗ Filter weights are learned from training data
• Downsampling, usually via max pooling
∗ Re-scaling to smaller resolution, limits parameter explosion
• Fully connected parts and output layer
∗ Merges representations together

A sketch of such a pipeline follows.
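As an illustration, here is a possible Keras version of the LeNet5-style pipeline above; the 5 × 5 filter sizes and the 10-class softmax output layer are assumptions, not from the slides.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Conv2D(5, (5, 5), activation="tanh", padding="same",
                        input_shape=(48, 48, 1)),                        # 48 x 48 x 5
    keras.layers.MaxPooling2D((2, 2)),                                   # 24 x 24 x 5
    keras.layers.Conv2D(10, (5, 5), activation="tanh", padding="same"),  # 24 x 24 x 10
    keras.layers.MaxPooling2D((2, 2)),                                   # 12 x 12 x 10
    keras.layers.Flatten(),                                              # 1 x 1440
    keras.layers.Dense(720, activation="tanh"),                          # 1 x 720
    keras.layers.Dense(10, activation="softmax"),                        # output layer
])
model.summary()
```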

Downsampling via max pooling

• Special type of processing layer. For an $m \times m$ patch: $v = \max(u_{11}, u_{12}, \dots, u_{mm})$
• Strictly speaking, not everywhere differentiable. Instead, the "gradient" is defined heuristically:
∗ Tiny changes in values of $u_{ij}$ that are not the maximum do not change $v$
∗ If $u_{ij}$ is the maximum value, tiny changes in that value change $v$ linearly
• As such, use $\partial v / \partial u_{ij} = 1$ if $u_{ij} = v$, and $\partial v / \partial u_{ij} = 0$ otherwise
• The forward pass records the maximising element, which is then used in the backward pass during back-propagation
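A small NumPy sketch of non-overlapping $2 \times 2$ max pooling and the heuristic "gradient" mask described above (input values are made up; ties give 1 to every maximiser):

```python
import numpy as np

def max_pool(U, m=2):
    # v = max over each non-overlapping m x m patch
    H, W = U.shape
    return U.reshape(H // m, m, W // m, m).max(axis=(1, 3))

U = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 1., 5., 6.],
              [2., 2., 7., 3.]])
V = max_pool(U)
print(V)    # [[4. 2.] [2. 7.]]

# Heuristic "gradient": dv/du_ij = 1 where u_ij equals its patch max, else 0
upsampled = np.repeat(np.repeat(V, 2, axis=0), 2, axis=1)
mask = (U == upsampled).astype(float)
print(mask)
```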

Convolution as a regulariser

[Figure: a fully connected, unrestricted network compared with a convolution-like restricted one; the restriction is that connections drawn in the same color share the same weight]

Document classification (Kalchbrenner et al., 2014)

The structure of text is important for classifying documents. Patterns of nearby words can be captured using 1D convolutions.
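A toy sketch of the idea (not from the paper): slide a filter spanning a few whole word embeddings along the sequence, producing one feature per window position.

```python
import numpy as np

rng = np.random.default_rng(1)
sentence = rng.normal(size=(6, 4))   # 6 words, 4-dim embeddings (made up)
filt = rng.normal(size=(3, 4))       # filter spanning 3 consecutive words

features = np.array([np.sum(sentence[i:i + 3] * filt)
                     for i in range(6 - 3 + 1)])
print(features)                      # 4 values: one per window position
```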


Autoencoder

An ANN training setup that can be used for unsupervised learning or efficient coding

Autoencoding idea

• Supervised learning:
∗ Univariate regression: predict $y$ from $\boldsymbol{x}$
∗ Multivariate regression: predict $\boldsymbol{y}$ from $\boldsymbol{x}$
• Unsupervised learning: explore data $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$
∗ No response variable
• For each $\boldsymbol{x}_i$, set $\boldsymbol{y}_i \equiv \boldsymbol{x}_i$
• Train an ANN to predict $\boldsymbol{y}_i$ from $\boldsymbol{x}_i$
• Pointless?

Autoencoder topology

• Given data without labels $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$, set $\boldsymbol{y}_i \equiv \boldsymbol{x}_i$ and train an ANN to predict $\boldsymbol{z}(\boldsymbol{x}_i) \approx \boldsymbol{x}_i$
• Set the hidden layer $\boldsymbol{u}$ in the middle "thinner" than the input

[Figure: an autoencoder network with input $\boldsymbol{x}$, bottleneck hidden layer $\boldsymbol{u}$, and output $\boldsymbol{z}$; adapted from Chervinskii at Wikimedia Commons (CC4)]

Introducing the bottleneck

• Suppose you managed to train a network that gives a good restoration of the original signal: $\boldsymbol{z}(\boldsymbol{x}_i) \approx \boldsymbol{x}_i$
• This means that the data structure can be effectively described (encoded) by a lower dimensional representation $\boldsymbol{u}$

• Autoencoders can be used for compression and dimensionality reduction via a non-linear transformation
• If you use linear activation functions and only one hidden layer, then the setup becomes almost that of Principal Component Analysis (coming up in a few weeks)
∗ The difference is that the ANN might find a different solution, as it doesn't use eigenvalue decomposition
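As an illustration, a minimal autoencoder in Keras (see Tools below); the toy data, layer sizes, and training settings are assumptions.

```python
import numpy as np
from tensorflow import keras

# Toy data: 100 points in 10-D that actually lie near a 2-D structure
rng = np.random.default_rng(0)
X = np.tanh(rng.normal(size=(100, 2)) @ rng.normal(size=(2, 10)))

model = keras.Sequential([
    keras.layers.Dense(2, activation="tanh", input_shape=(10,)),  # bottleneck u
    keras.layers.Dense(10, activation="linear"),                  # reconstruction z
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, X, epochs=100, verbose=0)   # targets are the inputs: y_i = x_i

encoder = keras.Model(model.inputs, model.layers[0].output)
U = encoder.predict(X, verbose=0)        # 2-D code u for each x
```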

Tools

• TensorFlow, Theano, Torch
∗ Python / Lua toolkits for deep learning
∗ symbolic or automatic differentiation
∗ GPU support for fast compilation
∗ Theano tutorials at http://deeplearning.net/tutorial/
• Various others
∗ CNTK
∗ …
• Keras: high-level Python API. Can run on top of TensorFlow, CNTK, or Theano

This lecture

• Deep learning
∗ Representation capacity
∗ Deep models and representation learning
• Convolutional Neural Networks
∗ Convolution operator
∗ Elements of a convolution-based network
• Autoencoders
∗ Learning efficient coding
