MALIS: Neural Networks
Maria A. Zuluaga, Data Science Department

Recap: Classification

The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.

Source: A. Zisserman

Separating hyperplanes

Two (of infinitely many) possible separating hyperplanes.

The least squares solution, obtained by regressing the labels $y \in \{-1, +1\}$ on $x$, leads to a decision boundary given by

$\{x : \hat{\beta}_0 + \hat{\beta}^\top x = 0\}$

Separating hyperplane classifiers are linear classifiers which try to "explicitly" separate the data as well as possible. (Figure 4.14 from The Elements of Statistical Learning.)

The Perceptron

• Assumptions:
  • Data is linearly separable
  • Binary classification using labels $y \in \{-1, +1\}$

• Goal: Find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

Formulation

• $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$, where

$\mathrm{sign}(z) = \begin{cases} +1, & z \ge 0 \\ -1, & z < 0 \end{cases}$

• As before, we will "absorb" b by adding a "dummy" dimension to x:

$\mathbf{x} \leftarrow (\mathbf{x}, 1), \quad \mathbf{w} \leftarrow (\mathbf{w}, b), \quad \text{so that } h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x})$

Source: Machine Learning for Intelligent Systems, Cornell University


Error function: The perceptron criterion

• Conditions:
  • Patterns with $y_i = +1$ should have $\mathbf{w}^\top \mathbf{x}_i > 0$
  • Patterns with $y_i = -1$ should have $\mathbf{w}^\top \mathbf{x}_i < 0$
  • Since $y_i \in \{-1, +1\}$, this means we want all patterns to satisfy $y_i\, \mathbf{w}^\top \mathbf{x}_i > 0$

• Perceptron criterion:

$E_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} y_i\, \mathbf{w}^\top \mathbf{x}_i$

where $\mathcal{M}$ is the set of misclassified points (see the numerical sketch below).
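A minimal numerical sketch of the criterion (the data and weight values below are illustrative, not from the course notebook):

```python
import numpy as np

# Toy data: each row of X already carries a trailing 1 that absorbs the bias.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0]])
y = np.array([1, 1, -1])
w = np.array([0.5, -0.5, 0.0])        # some current weight vector

scores = X @ w                         # w^T x_i for every sample
misclassified = y * scores <= 0        # points violating y_i w^T x_i > 0
E_P = -np.sum(y[misclassified] * scores[misclassified])
print(misclassified, E_P)              # here: two misclassified points, E_P = 1.0
```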


Interpretation

• The output is the characteristic function of a half-space, bounded by the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$

(Figure: the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ in the $(x_1, x_2)$ plane, with the $y = +1$ region on one side and the $y = -1$ region on the other.)


Representation

(Diagram: inputs $x_i$ with weights $w_i$ and a constant input $+1$ with bias weight $w_0$, feeding an activation unit.)

Linear Separability

• Given two sets of points, is there a perceptron which classifies them correctly?

(Figure: two point configurations in the $(x_1, x_2)$ plane; YES for the first, NO for the second.)

• True only if the sets are linearly separable


Finding the weights

• Obtain an expression for the gradient of the perceptron criterion
• Use stochastic gradient descent to minimize the error function
• A change in the weight vector is given by:

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\, y_n \mathbf{x}_n$

where $(\mathbf{x}_n, y_n)$ is a misclassified point and $\eta$ the learning rate.

Stochastic gradient descent vs gradient descent: rather than computing the sum of the gradient contributions of all observations followed by a step in the negative gradient direction, a step is taken after each single observation is visited.


Perceptron training algorithm

initialize w
while TRUE do:
    m = 0
    foreach (x_i, y_i) do:
        if y_i (w · x_i) ≤ 0:
            w ← w + y_i x_i
            m = m + 1
    if m == 0: break

(Illustration adapted from Fig. 4.7, PRML, C. Bishop.)
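A possible NumPy implementation of this loop (a sketch: the function name, the unit learning rate and the epoch cap are assumptions; the bias is absorbed into the weights as before):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=1000):
    """X: samples with a trailing column of ones (absorbed bias);
    y: labels in {-1, +1}. Returns the learned weight vector."""
    w = np.zeros(X.shape[1])              # initialize w
    for _ in range(max_epochs):
        m = 0                             # mistakes in this pass
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # misclassified (or on the boundary)
                w = w + y_i * x_i         # update step
                m += 1
        if m == 0:                        # a full pass without mistakes: done
            return w
    return w                              # reached the cap: possibly not separable
```

For linearly separable data the loop stops after a finite number of updates; otherwise it only stops because of the epoch cap.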

(The following slides repeat the algorithm while stepping through successive updates on a toy data set; illustrations adapted from Fig. 4.7, PRML, C. Bishop.)

Hands on example: The OR function

• 03_perceptron.ipynb

(Figure: the four OR inputs in the unit square; the point (0, 0) is labelled −1 and the other three are labelled +1.)

Hands on example: The OR function

(Worksheet: a table with columns X0, X1, X2, b, W1, W2, y, activation and m, to be filled in while running the algorithm by hand.)
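A sketch of what the notebook works through, using the update rule above on the OR truth table (the 0/1 outputs are mapped to labels in {−1, +1}; starting from zero weights is an assumption):

```python
import numpy as np

# OR truth table; a trailing 1 in each row absorbs the bias term.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])            # OR output, with 0 mapped to -1

w = np.zeros(3)
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:       # misclassified point
            w = w + y_i * x_i          # perceptron update
            converged = False

print(w, np.sign(X @ w))               # one separating w; predictions match y
```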


Solution


Perceptron convergence theorem

• If there exists an exact solution (i.e. the data is linearly separable), then the perceptron algorithm is guaranteed to converge to an exact solution in a finite number of steps.

• However:
  • The number of steps to convergence might be large
  • Until convergence is achieved, it is not possible to distinguish a non-separable problem from one that is merely slow to converge

What if we use a different initialization value?

Question: These are specific examples. What is the general set of inequalities that must be satisfied for an OR perceptron?

Other logic functions

Exercises to complete in the notebook.

Perceptron Limitations

A perceptron cannot generate the XOR function:

X1 X2 | XOR
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

(Figure: the four XOR points in the $(x_1, x_2)$ plane; no single line separates the two classes.)
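A quick check of this limitation (a sketch; the pass cap is needed because the loop would otherwise never terminate): the same update rule applied to the XOR table never completes a mistake-free pass.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])             # XOR, with 0 mapped to -1

w = np.zeros(3)
for epoch in range(1000):                # cap the passes: XOR is not separable
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w = w + y_i * x_i
            mistakes += 1
    if mistakes == 0:
        break
print(mistakes)                          # stays > 0: no separating hyperplane exists
```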

[Minsky 1969] → The AI winter


Perceptron Limitations

• The algorithm does not converge when the data are not separable
• When the data are separable, there are many solutions, and which one is found depends on the starting values
• The "finite" number of steps can be very large

Some history

Source: C. Bishop - PRML


Recap

• We introduced the perceptron algorithm, a linear classifier
• We saw that it guarantees a solution (convergence in a finite number of steps) for separable data
• But we also saw that it has numerous limitations

Neural Networks


Why neural networks? A note on history

• The term neural network has its origins in attempts to find mathematical representations of information processing in biological systems

• From Bishop: it has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility

• They are efficient nonlinear models for statistical pattern recognition


Motivation

• Recall the first lecture on linear models
• We saw that adding features could give a better fit of the model:

$\boldsymbol{\phi}(x) = (1, x, x^2, \ldots, x^n)$

• $\phi_j$: basis function
• Model:

$y(x, \mathbf{w}) = \sum_j w_j \phi_j(x)$

Motivation

Revisit: 01_linear_models.ipynb
• We also saw that choosing the right set of features was challenging
• Goal: Make the basis functions $\phi_j(x)$ depend on parameters, and allow these parameters to be adjusted along with the coefficients $\{w_j\}$ during training
• How? Neural networks

(Figure from the linear models lecture: polynomial fits of different degrees; which one is the good value for n?)

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Basic neural network model: a series of functional transformations
• Step 1: Construct M linear combinations of the input variables $\mathbf{x} \in \mathbb{R}^D$, called activations:

$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$

• The superscript (1) indicates parameters of the first layer of the network
• $j = 1, \ldots, M$: index of the linear combinations; $i = 1, \ldots, D$: index over the dimensions of the input $\mathbf{x}$; $w_{ji}^{(1)}$: weights; $w_{j0}^{(1)}$: biases

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Step 2: Transform each activation using a differentiable, nonlinear activation function $h(\cdot)$:

$z_j = h(a_j)$

• $h(\cdot)$ is generally chosen to be a sigmoidal function
• $z_j$ corresponds to the output of a basis function of our model. Recall: $y(x, \mathbf{w}) = \sum_j w_j \phi_j(x)$
• In the context of neural networks, the $z_j$ are called hidden units

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Step 3: The $z_j$ are again linearly combined to give output unit activations:

$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$

• The superscript (2) indicates parameters of the second layer of the network
• $k = 1, \ldots, K$: output index, K being the number of outputs; $w_{kj}^{(2)}$: weights; $w_{k0}^{(2)}$: biases

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Step 4: The $a_k$ are transformed using an activation function to give a set of network outputs $y_k$
• The choice of activation function follows the same considerations as for linear models
• For regression: identity function
• Common activation functions for classification (written out in the sketch below):
  • Sigmoid: $\sigma(a) = 1/(1 + e^{-a})$
  • Tanh: $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
  • Hinge or ReLU: $h(a) = \max(a, 0)$
  • Softmax (multiclass): $h(a_k) = \dfrac{e^{a_k}}{\sum_{k'} e^{a_{k'}}}$
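The activation functions above, written out as a small sketch (the softmax shift by the maximum is just for numerical stability):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - np.max(a))     # subtract the max for numerical stability
    return e / e.sum()
```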

Feed forward networks a.k.a. MLP: final expression

• Combining all of the above, and using a sigmoidal output unit activation function:

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$

Feed forward networks / MLP — Interpretation: network diagram

• Forward propagation of information through the network

(Figure: two-layer network diagram, Fig. 5.1, PRML, C. Bishop.)

Simplifying notation: absorbing the biases

• As with linear models, the bias parameters can be absorbed into the set of weight parameters by adding a dummy variable $x_0 = 1$:

$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$

• In the second layer, with $z_0 = 1$:

$a_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$

Feed forward networks a.k.a. MLP: simplified final expression

• The overall network function now becomes

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$

Compare with the linear model $y(x, \mathbf{w}) = \sum_j w_j \phi_j(x)$.
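A minimal NumPy sketch of this forward computation, with the dummy bias units appended at the end of the input and hidden vectors (the shapes, the choice of tanh for h and the random weights are assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2, h=np.tanh, out=sigmoid):
    """Two-layer network: x has shape (D+1,) with a trailing 1 (absorbed bias),
    W1 has shape (M, D+1), W2 has shape (K, M+1)."""
    a = W1 @ x                     # first-layer activations a_j
    z = np.append(h(a), 1.0)       # hidden units z_j, plus the dummy bias unit
    return out(W2 @ z)             # network outputs y_k

# Hypothetical sizes: D = 2 inputs, M = 3 hidden units, K = 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(1, 4))
print(mlp_forward(np.array([0.5, -1.0, 1.0]), W1, W2))
```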

Multilayer perceptron: interpretation

• Two stages of processing, each of which resembles a perceptron:

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$

(Adapted from Fig. 5.1, PRML, C. Bishop.)

Neuron: back to features

• A neuron can be seen as a feature map of the form

$\phi_j(\mathbf{x}) = h\!\left(\sum_i w_{ji} x_i\right)$

• Therefore, each node in the network can be interpreted as a feature variable

• By optimizing the weights $\{w\}$ we are doing feature selection

• Pre-trained networks: the features resulting from optimization are useful for many problems

• They need a lot of data

Network training: Backpropagation

• We cannot use the training algorithm from the perceptron because we don’t know the “correct” outputs of the hidden units

• Strategy: Apply the chain rule to differentiate composite functions

• Refresher: if $z = f(g(x))$, then $z' = f'(g(x))\, g'(x)$, or, in Leibniz's notation,

$\dfrac{dz}{dx} = \dfrac{dz}{dy} \cdot \dfrac{dy}{dx}$

Deriving gradient descent for MLP

• See the board notes for a simpler derivation of the backpropagation algorithm

Deriving gradient descent for MLP

• Error function:

$E(\mathbf{w}) = \sum_n E_n(\mathbf{w})$

• We will estimate $\nabla E_n(\mathbf{w})$
• Let us consider a simple linear model with outputs $y_k$:

$y_k = \sum_i w_{ki} x_i$

• The error function for a particular input sample n will be

$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2$

Deriving gradient descent for MLP

• The gradient of this error function w.r.t. a weight $w_{ji}$:

$\dfrac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$

• Interpretation: the product of an error signal $(y_{nj} - t_{nj})$, associated to the output end of the link, and the variable $x_{ni}$, associated to the input end
• Similar to the expression obtained for logistic regression when using the sigmoid function

Refresher: exercise proposed on slide 27 (annotated slides).

Refresher: forward propagation

• Let the activation of each unit in the network be denoted as

$a_j = \sum_i w_{ji} z_i$

with $z_i$ the activation (or input) of a unit that sends a connection to unit j, and $w_{ji}$ the weight associated to that connection, and

$z_j = h(a_j)$

• These are composite functions, so let's use the chain rule to estimate the derivative of the error

Deriving gradient descent for MLP • Applying the chain rule

$\dfrac{\partial E_n}{\partial w_{ji}} = \dfrac{\partial E_n}{\partial a_j} \dfrac{\partial a_j}{\partial w_{ji}}, \qquad \delta_j \equiv \dfrac{\partial E_n}{\partial a_j}$

• Since $a_j = \sum_i w_{ji} z_i$, then $\dfrac{\partial a_j}{\partial w_{ji}} = z_i$
• All together:

$\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$

Same form as before: an error signal times the variable at the input end of the weight.

How to estimate δ? • For the output units, we did it already:

$\delta_k = y_k - t_k$

• For the hidden units, we resort to the chain rule again:

$\delta_j \equiv \dfrac{\partial E_n}{\partial a_j} = \sum_k \dfrac{\partial E_n}{\partial a_k} \dfrac{\partial a_k}{\partial a_j}$

• Can we obtain an expression for it?

How to estimate δ for hidden units?

$\delta_j = \dfrac{\partial E_n}{\partial a_j} = \sum_k \dfrac{\partial E_n}{\partial a_k} \dfrac{\partial a_k}{\partial a_j} = \sum_k \delta_k \dfrac{\partial a_k}{\partial a_j}$

Solution on the next slide, but try to do it on your own. Useful facts:

$z_j = h(a_j), \qquad a_k = \sum_j w_{kj} z_j$

First, note that

$\delta_k = \dfrac{\partial E_n}{\partial a_k}$, so $\delta_j = \sum_k \dfrac{\partial E_n}{\partial a_k} \dfrac{\partial a_k}{\partial a_j} = \sum_k \delta_k \dfrac{\partial a_k}{\partial a_j}$

Now let us find an expression for $\dfrac{\partial a_k}{\partial a_j}$. We have $a_k = \sum_j w_{kj} z_j$ and, directly from the cheat sheet, $z_j = h(a_j)$. The derivative amounts to applying the chain rule:

$\dfrac{\partial a_k}{\partial a_j} = w_{kj}\, h'(a_j)$

Plugging into the original expression:

$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$

Backpropagation formula

$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$

(Figure 5.7, Bishop, PRML: forward propagation computes the activations, while the δ's are propagated backwards.)

Backpropagation algorithm

1. For an input vector $\mathbf{x}_n$ to the network, do a forward pass using

   $a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)$

   to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for the output units.
3. Backward pass the δ's to obtain the $\delta_j$ for all the hidden units, using $\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$.
4. Obtain the required derivatives using $\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$.
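A compact sketch of these four steps for a single sample, assuming a two-layer network with tanh hidden units and linear (identity) outputs, so that $\delta_k = y_k - t_k$; the shapes follow the forward-pass sketch above:

```python
import numpy as np

def backprop(x, t, W1, W2, h=np.tanh, h_prime=lambda a: 1.0 - np.tanh(a) ** 2):
    """Gradients of E_n = 1/2 ||y - t||^2; x carries a trailing bias input,
    W1 has shape (M, D+1), W2 has shape (K, M+1) with the bias in the last column."""
    # 1. Forward pass
    a = W1 @ x                          # a_j = sum_i w_ji x_i
    z = np.append(h(a), 1.0)            # z_j = h(a_j), plus the dummy bias unit
    y = W2 @ z                          # identity output units
    # 2. Output deltas
    delta_k = y - t                     # delta_k = y_k - t_k
    # 3. Backward pass to the hidden deltas (bias unit excluded)
    delta_j = h_prime(a) * (W2[:, :-1].T @ delta_k)
    # 4. Required derivatives
    grad_W2 = np.outer(delta_k, z)      # dE_n/dw_kj = delta_k z_j
    grad_W1 = np.outer(delta_j, x)      # dE_n/dw_ji = delta_j x_i
    return grad_W1, grad_W2
```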


Backpropagation algorithm: DIY

• Read Section 5.3.2 from Bishop for a concrete example
• We have derived a general form that covers any error function, activation function and network topology

• Obtain expression for the backpropagation algorithm when using cross-entropy error function (exercise 11.3 from ESL)


Properties: Universality

• MLPs are universal Boolean functions
  • They can compute any Boolean function

• MLPs are universal classification functions

• MLPs are universal approximators
  • They can actually compose arbitrary functions in any number of dimensions


MLPs are Universal Boolean Functions

• The perceptron could not solve the XOR.
• If the MLP is a universal Boolean function, it should be able to implement an XOR.
• How?

A truth table shows all input combinations for which the output is 1:

X1 X2 | Y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

We express the function in disjunctive normal form: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$

XOR function: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$

(* Bias omitted. The following slides build the corresponding network incrementally.)

XOR function: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$

(Figure: a first layer of AND units, one per term of the disjunctive normal form, feeding a single OR output unit. Bias omitted.)

Any truth table can be expressed in this manner.

Exercise: Find weights for the XOR

(Diagram: inputs X1, X2 and a constant +1 feed two hidden units Y1 and Y2, which together with a constant +1 feed the output unit Y.)

Step 1: Write down truth tables

X1 X2 | Y1        X1 X2 | Y2
0  0  |           0  0  |
0  1  |           0  1  |
1  0  |           1  0  |
1  1  |           1  1  |

Y1 Y2 | Y
0  0  |
0  1  |
1  0  |
1  1  |

(Output columns to be filled in during the exercise; see the next slides.)

Step 2: Write general expressions — output unit Y

$Y = h(w_b + w_1 Y_1 + w_2 Y_2)$

Y1 Y2 | Y | condition
0  0  | 0 | $w_b \le 0$
0  1  | 1 | $w_b + w_2 > 0$
1  0  | 1 | $w_b + w_1 > 0$
1  1  | 1 | $w_b + w_1 + w_2 > 0$

One solution: $w_b = -3$, $w_1 = 4$, $w_2 = 4$

** Layer index is being omitted

Step 2: Write general expressions — hidden unit Y2

$Y_2 = h(w_b + w_1 X_1 + w_2 X_2)$

X1 X2 | Y2 | condition
0  0  | 0  | $w_b \le 0$
0  1  | 0  | $w_b + w_2 \le 0$
1  0  | 1  | $w_b + w_1 > 0$
1  1  | 0  | $w_b + w_1 + w_2 \le 0$

One solution: $w_b = -3$, $w_1 = 4$, $w_2 = -5$

** Layer index is being omitted

Step 2: Write general expressions — hidden unit Y1

$Y_1 = h(w_b + w_1 X_1 + w_2 X_2)$

X1 X2 | Y1 | condition
0  0  | 0  | $w_b \le 0$
0  1  | 1  | $w_b + w_2 > 0$
1  0  | 0  | $w_b + w_1 \le 0$
1  1  | 0  | $w_b + w_1 + w_2 \le 0$

One solution: $w_b = -3$, $w_1 = -5$, $w_2 = 4$

** Layer index is being omitted

Result: Find weights for the XOR

(Network diagram with the weights found above:)

$Y_1 = h(-3 - 5 X_1 + 4 X_2)$
$Y_2 = h(-3 + 4 X_1 - 5 X_2)$
$Y = h(-3 + 4 Y_1 + 4 Y_2)$

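A quick check of this reading of the diagram (a sketch with a hard threshold unit; the weight-to-input assignment follows the reconstruction above rather than values read verbatim from the figure):

```python
import numpy as np

def step(a):                       # threshold unit: 1 when the activation is positive
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

y1 = step(-3 - 5 * X[:, 0] + 4 * X[:, 1])   # computes NOT(X1) AND X2
y2 = step(-3 + 4 * X[:, 0] - 5 * X[:, 1])   # computes X1 AND NOT(X2)
y = step(-3 + 4 * y1 + 4 * y2)              # OR of the two hidden units

print(y)                                     # [0 1 1 0] -> XOR
```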

What to do for more complex functions?

• Karnaugh Maps

(Figure: a Karnaugh map grouping adjacent 1-cells of the truth table to obtain a reduced sum-of-products expression.)

Drawback: an MLP built this way can represent a given function only if it is sufficiently wide.

MLP as universal function approximation

• A feed-forward network with at least one hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^d$, under mild assumptions on the activation function (see the sketch below)

• However, the key problem is how to find suitable parameter values given a set of training data

• Proof by G. Cybenko, 1989
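As a small illustration of this property (a sketch using scikit-learn's MLPRegressor; the target function, layer size and iteration count are arbitrary choices, and finding good parameters is exactly the hard part mentioned above):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel()                          # a smooth nonlinear target

# One hidden layer with enough units can, in principle, approximate this well.
net = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                   max_iter=5000, random_state=0).fit(X, y)
print(net.score(X, y))                         # training R^2, typically close to 1
```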


MLP Intuitive Potential

No hidden layer: half-spaces

One hidden layer: convex sets (intersections of half-spaces)

Two hidden layers: concave and non-connected sets (unions of intersections of half-spaces)

Summary on MLP

• Advantages
  • Very general, can be applied in many situations
  • Powerful according to theory
  • Efficient according to practice

• Drawbacks
  • Training is often slow
  • The choice of the optimal number of layers and neurons is difficult
  • Little understanding of the actual model

Deep Learning

LeNet-5: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition (1998)

Recap

• We introduced feedforward networks, aka the multilayer perceptron

• We introduced the backpropagation algorithm which is the mechanism to train feedforward networks

• We saw the strengths but also the limitations of MLPs

• Deep Learning course (spring term) if you want to learn about more powerful neural network architectures


What I have not covered yet

• Some other limitations: problems associated with training


Further reading and useful material

Source                                         | Chapters
The Elements of Statistical Learning           | Sec. 4.5, Ch. 11
Pattern Recognition and Machine Learning       | Sec. 4.1.7, Ch. 5
Rosenblatt's original article – The Perceptron | --

Warning: Notation might vary among the different sources


From the first lecture

Deep learning


Project definition: What I expect from you

• Able to identify a problem that can be solved using ML tools
• Frame it correctly: supervised or unsupervised, regression, classification, density estimation…
• Able to establish reasonable objectives
  • Not too easy
  • Not so difficult that it cannot be completed in the given time frame
• Able to follow instructions
  • Submit via Moodle
  • Work in pairs, or talk to me to agree on exceptions
• Able to produce a readable document
