– winter term 2016/17 – Chapter 07: Neural Networks I

Prof. Adrian Ulges Masters “” DCSM Department University of Applied Sciences RheinMain

1

14. November 2016 Resources for the next Chapters

Online Book 1: The nice one

I Nielsen: “Neural Networks and ” http://neuralnetworksanddeeplearning.com

Online Book 2: The tough one

I Goodfellow, Bengio, Courville: “Deep learning” https://www.deeplearningbook.org

2 (Artificial) Neural Networks.... image: [1]

I ... are one of the core topics of artificial intelligence (AI)

I ... are biologically inspired: They are mathematical models simulating the process of “thinking” in living organisms’ brains

I ... are graphs (or networks) of interlinked atomic processing units (neurons)

I ... are not programmed but define their own behavior entirely through the graph’s links and weights. These are adjusted by training. I ... are covered by two research fields 1. computational neuroscience (goal: better understanding of biological processes) 2. machine learning / AI (here) (goal: solving practical data analysis problems)

3 Neural Networks....

I ... have a turbulent history

1940 1950 1960 1970 1980 1990 2000 2010 2015

I ... exist in many variants

I multi-layer (in this lecture) I convolutional neural networks (in this lecture) I Boltzmann machines I RBF networks I recurrent neural networks I LSTM networks I ...

I ... are the most intensely machine learning model these days (“deep learning”)

4 Deep Learning Applications images from [5] [6] [2] [4] [7]

5 Neural Networks: Biological Inspiration Neurons are brain cells that send and receive electrical impulses. Neurons: Terminology I Cell body (soma): soma 1 center of the neuron, with electrical axon a diameter of 5 − 100µm impulse soma 3 synapse I axon: thin, long (up to 1m) axon dendrites nerve fibre, splits and dendrites dendrites transmits impulses to other neurons I dendrites: small, soma 2 branch-like outgrowths that axon synapse receive impulses and axon transmit them to the soma pre-synaptic post-synaptic membrane membrane I synapses: electro-chemical link between two neurons. Transfers signal from axon dentrite to dendrites via transmitter substances.

6 From Biological to Artificial Neural Networks

I Biological neural networks are (much) more complex than artificial ones - connectivity: 1011 neurons, each connected with ≈ 7, 000 others - adaptivity: number and connections of neurons change over time I Biological neural networks are asynchronous and compute with rather low frequency (1 KHz) I The circuit layout is unknown

Learning

I Learning happens at synapses, by changing the transmitter dose - sensitization: more transmitter, enhancing the signal - desensitization: less transmitter, supressing the signal

Artificial Neural Networks X1 w1 θ I weighted sum + activation function X y 2 Σ w2 I learning happens by adapting weights

wm I memory happens by storing weights Xm

7 Outline

1. Neurons

2. Training Neurons

3. The Capacity of Neurons

4. Neural Networks

8 The McP-Neuron

I The historically oldest (1943) neuron by McCulloch & Pitts

I strongly simplified model of biological neurons

I no learning yet

Definition

I input d x = (x1, ..., xd ) ∈ {0, 1} I output y ∈ {0, 1}

I weights w = X1 w d 1 (w1, ..., wd ) ∈ {−1, 1} θ X y I wj = 1: stimulating 2 w Σ (=”anregend”) 2 I wj = −1: inhibitory wm (= “hemmend”) Xm I threshold θ ∈ R

9 The McP-Neuron

Computational Model

I compute the scalar product between input signal x and weights w, namely w · x

I apply an activation function f (here, a step function), obtaining the output y

y = f (w · x)  1 if w · x ≥ θ = 0 else f(x·w)

I The McP-Neuron is a function φw,θ : {0, 1}d → {0, 1} with parameters w, θ If φw,θ(x) = 1, we say the x·w I θ neuron “is activated”, or the neuron “fires”

10 The McP-Neuron x → w McP-Neurons: Graphical Representation 1 1 . I input x . . I weights w → w xm m I threshold θ θ ↓ y What can McP-Neurons do? I We can model logic gates using McP-neurons

x → x → → 1 -1 1 1 x1 1 -0,5 x → → 2 1 x2 1 ↓ y 2 1 ↓ y ↓ y

NOT AND OR

11 From McP-Neurons to the Perceptron

McP-Neurons: Limitations I only boolean inputs/weights/outputs

I only binary activation functions

I no learning

Extensions: The Perceptron d I real-valued weights and inputs w, x ∈ R I We will introduce a learning , the Delta-rule

12 Perceptron: Graphical Interpretation

I The parameters w, θ define a hyperplane

I w · x ≥ θ: x is on the positive side

I w · x < θ: x is on the negative side

hy pe w rp la ne

θ / ||w|| positive side

negative side

13 Outline

1. Neurons

2. Training Neurons

3. The Capacity of Neurons

4. Neural Networks

14 Perceptron Training: Basics

w,θ I A neuron’s function (or behavior) φ is determined by the parameters w and θ

I Given labeled training data: How do we find the best parameters / the best hyperplane?

hyperplane 2 0 (w , θ ) positive 2 2 0 0 side 0 1 0 1 1 1 negative 0 0 0 0 side 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 1 1 negative 1 1 side hyperplane 1 positive θ (w1, 1) side

bad parameters: good parameters: error rate 12 / 30 = 40% error rate 0%

15 Perceptron Training: Vector Augmentation

Before learning, we simplify notation using a → w x1 1 simple trick called vector augmentation: . Idea . . I omit the additional threshold parameter → w θ and include it in the weight vector xd d θ So far... ↓ y I x := (x1, ..., xd ), w := (w1, ..., wd ) I fire if w · x ≥ θ (or w · x − θ ≥ 0) → w x1 1 Now . I x := (x1, ..., xd , 1), w := (w1, ..., wd , −θ) . . I fire if w · x ≥ 0 x → w Remark d d 1 → −θ This is just a change of notation (θ is not 0 written separately any more), not of behavior! ↓ y 16 Perceptron Training

Learning happens by an iterative optimization. We start with random values for w and θ, and in each iteration we ... I ... classify a training sample x and compare the result φ(x) with the targeted result y I If necessary, we correct the neuron’s parameters such that the output φ(x) is adapted towards the desired output y

wnew ← wold + ∆w x1 w 1 no 2 φ(x) x θ w Σ =? 2 yes w xd d y

17 Perceptron Training: The Delta Rule How do we compute the ’right’ update ∆w? There are different strategies. The most famous one is the Delta Rule:   4w = λ · y − φ(x) · x Remarks

I We call λ the

I Note: If xi is classified correctly, ∆w = 0 → no update

1 function train perceptron(x1, ..., xn, y1, ..., yn, λ): 2 initialize w randomly 3 until convergence: 4 choose a random training sample (x, y) (with y ∈ {0, 1}) 5 compute φ(x) // classify x 6 if φ(x) <> y:   7 w := w + λ · y − φ(x) · x | {z } ∆w 8

18 The Delta Rule: Motivation

  4w = λ · y − φ(x) · x

I Why does the Delta rule work?

I We illustrate what happens when a training sample is misclassified ...

hy pe rp la ne w w ∆w (=0.2·x) wnew =0.2 x 1 1

19 The Delta Rule: Motivation

Formal Motivation

20 The Delta Rule: Motivation

Formal Motivation

21 The Delta Rule: Code Example

22 The Delta Rule: Code Example

iteration 0 10

8

6

4

2

0 0 2 4 6 8 10

23 The Delta Rule: Code Example

iteration 1 10

8

6

4

2

0 0 2 4 6 8 10

24 The Delta Rule: Code Example

iteration 2 10

8

6

4

2

0 0 2 4 6 8 10

25 The Delta Rule: Code Example

iteration 1000 10

8

6

4

2

0 0 2 4 6 8 10

26 Outline

1. Neurons

2. Training Neurons

3. The Capacity of Neurons

4. Neural Networks

27 An Important Question...

d Can a single perceptron learn any Function f : R → {−1, 1}?

28 Achieving Non-Linearity by connecting Neurons We can realize XOR by connecting multiple neurons!

x2 neuron 1 neuron 2 1 0 x1 1 1 n e x2 1 1 u ro n 0,5 1,5 2

0 1 n x φ φ e 1 1(x1,x2) 2(x1,x2) u ro n 1 1 -1 0,5 φ φ φ (x ,x ) 3( 1(x1,x2), 2 1 2 neuron 3 φ (x ,x ) ) 2 1 2 0

φ φ φ φ φ ⊗ x1 x2 (x ,x ) 2(x ,x ) 3( (x ,x ), 2(x ,x )) x x 3 1 1 2 1 2 1 1 2 1 2 1 2 n ro u 0 0 0 0 0 0 e n 1 0 1 1 0 1 1 0 1 1 0 1 0 1 1 φ 1(x1,x2) 1 1 1 1 0 0

29 neuron 1 neuron 2 x 1 1 Achieving Non-Linearity 1 

x2  1 1 Remarks 0,5 1,5

(x ,x ) I This is only possible because 1(x1,x2)  1 2

of the neurons’ non-linear 1 -1 0,5 ( 1(x1,x2), activation functions! neuron 3 (x1,x2) ) - with activation functions

00 00 0 0 y = f (w1 · f (w1x1 + w2x2) + w2 · f (w1x1 + w2x2))

- without activation functions

00 00 0 0 y = w1 · (w1x1 + w2x2) + w2 · (w1x1 + w2x2) 000 000 = w1 x1 + w2 x2

This is just a linear function (and does not solve XOR)

30 Achieving Non-Linearity by connecting Neurons

Theorem (Learning Capacity of Networks of McP-Neurons)

We can realize any boolean function f : {0, 1}d → {0, 1} by connecting not more than 2d + 1 McP-neurons.

Proof(by construction)

31 Proof by Construction

32 Proof by Construction: Example ¬ ∧ ∧ ∧¬ ∧ ∨ x1 x2 x3 x4 x5 ∧¬ ∧¬ ∧ ∧ ∨ x1 x2 x3 x4 x5 ¬ ∧ ∧¬ ∧ ∧¬ x1 x2 x3 x4 x5

x ¬ 1 -1 1 -1 x1 x -1 1 2 1 x2 x -1 -1 3 1 x3

x4 -1 ... 1 1

x5 1 1 -1 3 3 2

1 1 1 1

33 The McP-Neuron: Do-it-Yourself

1. Sketch an McP-Neuron that models the (even) parity bit for a given sequence of 3 input bits, x1, x2, x3. 2. Given n input bits, the above model seems to require 2n + 1 neurons. Can you do ’better’ by using more than 2 layers?

34 The McP-Neuron: Do-it-Yourself

35 Perceptron: Activation Functions We can use other activation functions:

plot function parameters

1.00 0.75 

f(x) 0.50 1 if x ≥ θ f (x) = θ 0.25 0 else. 0.00 4 2 0 2 4 x

1.00 0.75 1 f (x) = −(x−θ) f(x) 0.50 1+e θ 0.25 (sigmoidal) 0.00 4 2 0 2 4 x

Remarks I The sigmoid approximates a step function, but is also differentiable (with f 0(x) = f (x) · (1 − f (x)))

36 Perceptron: Activation Functions

We can use other activation functions:

plot function parameters

5

4 3 f (x) = max(0, x) f(x) 2 - 1 rectified linear unit (RELU) 0 4 2 0 2 4 x

1.00 2 0.75 − (x−µ) 2σ2 f(x) 0.50 f (x) ∝ e µ, σ 0.25

0.00 (Gaussian) 4 2 0 2 4 x

37 Can connected learn Any Function?1

Theorem (Learning Capacity of Networks of Perceptrons) By connecting multiple perceptrons with sigmoidal activation function, we can approximate any continuous function f : [0, 1]d → [0, 1]. Remarks I By connecting perceptrons, we can learn any decision boundary! I This works with just one hidden layer (see below). I Open Question: How many perceptrons do we need? I Open Question: How do we find a solution that generalizes properly?

“In this paper we demonstrate that finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables with support in the unit hypercube. Only mild con- ditions are imposed on the univariate function. Our results settle an open question about representability in the class of single bid- den layer neural networks. In particular, we show that arbitrary decision regions can be arbitrarily well approximated by conti- nuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity. ”

(Cybenko., G. [3])

1 Try Nielsen’s online demo: http://neuralnetworksanddeeplearning.com/chap4.html 38 Outline

1. Neurons

2. Training Neurons

3. The Capacity of Neurons

4. Neural Networks

39 Definition: Neural Network

Neural Network Definition Example A neural network is a set of I I 4 neurons n1, ..., n4 (partially) connected neurons I activation functions I The output signal of a neuron a1, ..., a4 can be used as input signals to (multiple) other neurons I network input: x1, x2 I network output: z1, z2 I The networks’ input consists of all input signals that are not I “hidden signals”: y1, y2 derived from other neurons w12 I The network’s output consists of 1 -1 x1 all output signals that are not x -1 1 2 a a used as input for other neurons 1 2 y1 y2 Each link between two neurons I z1 a has a weight. We denote the 1 1 3 weight of the connection from 1 1 a neuron i to neuron j with wij . 4 z2

40 Neural Network: Graphical Representation Neural networks are weighted, directed graphs I Neurons are nodes, connected by weighted edges I Inputs and outputs are modeled as separate nodes

w 12 -1 x 1 -1 a 1 3 a 2 x -1 1 1 1 2 a 1 1 1 a2 1 2 -1 y1 y2 z 4 1 1 a 1 1 a3 5 5 input 1 1 1 a output 1 4 z 2 6

a6 Network Topologies We can distinguish two general network topologies

I feedforward networks

I recurrent networks 41 Feedforward Networks: Layers

I Feedforward networks are DAGs (“directed acyclic graphs”), i.e. they do not contain any cycles.

I The signal is never propagated backwards through the network (hence “feedforward”)

I Convention: We organize feedforward networks in layers

Layer Architecture

I Every neuron is only input hidden output connected with neurons from layer layers layer the previous and next layers

I We call two layers fully connected in case all of their neurons are connected

fully fully not fully connected connected connected 42 Feedforward Networks: Computation

Computational Model

I the signal is propagated through the network instantly

I We collect each layer’s weights+biases in a matrix/vector (for non-existing edges, entries are zero).

Example (two hidden layers) I We collect all the signals leaving each layer in a vector

I input x = (x1, ..., xn) 1 1 input hidden output I hidden layer 1: y = (y , ..., y ) 1 1 m layer layers layer 2 2 I hidden layer 2: y2 = (y1 , ..., yp ) I output z = (z1, ..., zq)

I Each layer applies (1) a weighted sum (a linear operation!), and (2) a non-linear activation f (for each neuron):

I y1 = f (W · x + b) 0 0 I y2 = g(W · y1 + b ) 00 00 x y1 y2 z I z = h(W · y2 + b ) 43 Do-Forward-Computation Yourself Given is the following network with threshold activation functions and input x = (1, 0). Compute the output z1.

1 y 2 1 1 -1 x1 z1 -2 1 0 y2 3 2 3 -1 -3 z x2 3 2 0 1 0 y3 0 44 Do-Forward-Computation Yourself Let’s do it again, this time in matrix notation:

1 y 2 1 1 -1 x1 z1 -2 1 0 y2 3 2 3 -1 -3 z x2 3 2 0 1 0 y3 0 45 The 3-layer Perceptron (MLP)

y1

x1 z1

......

z xn p

ym

x = (x1,...,xn) y = (x1,...,xm) z = (z1,...,zp)

In the following, we focus on the most basic type of neural network, the multi-layer perceptron (MLP). I The network is feed-forward I There is only one hidden layer I All layers are fully connected I The network uses a sigmoidal activation function I We know: such a network can learn any function! 46 Recurrent Neural Networks In contrast to feedforward networks, recurrent networks may contain cycles → the signal is propagated backwards through the network! Example Architecture: Elman Networks

context

input

hidden

output

47 Recurrent Neural Networks

I The signals in the network must be clocked

I We extend the computational model with a time component t = 1, 2, 3, ...

I Example: An OCR system recognizing a sequence of characters x1, x2, ... (here, t is the character number) I At each time t, the network is fed an input x(t)

1 function update elman network(x(t)) 2 hidden(t) := f(x(t), context(t-1)) 3 context(t) := g(hidden(t)) 4 output(t) := h(hidden(t)) 5

I This way, the network achieves a memory effect!

I Example “... in Europe. Italy ...”

48 References

[1] fdecomite: Minsky & Papert Model. https://flic.kr/p/5VsZ1M (retrieved: Nov 2016). [2] Google DeepDream robot: 10 weirdest images produced by AI ’inceptionism’ and users online (Photo: Reuters). http://www.straitstimes.com/asia/east-asia/ alphago-wins-4th-victory-over-lee-se-dol-in-final-go-match (retrieved: Nov 2016). [3] G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, Dec. 1989.

[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.

[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep . Nature, 518(7540):529–533, 02 2015.

[6] M.-A. Russon. Google DeepDream robot: 10 weirdest images produced by AI ’inceptionism’ and users online. http://www.ibtimes.co.uk/ google-deepdream-robot-10-weirdest-images-produced-by-ai-inceptionism-users-online-1509518 (retrieved: Nov 2016).

[7] C. Szegedy. Building a deeper understanding of images (Google Research Blog). https://research.googleblog.com/2014/09/building-deeper-understanding-of-images.html (retrieved: Nov 2016).

49