Machine Learning – winter term 2016/17 – Chapter 07: Neural Networks I
Prof. Adrian Ulges, Masters “Computer Science”, DCSM Department, University of Applied Sciences RheinMain
14 November 2016

Resources for the next Chapters
Online Book 1: The nice one
- Nielsen: “Neural Networks and Deep Learning”, http://neuralnetworksanddeeplearning.com
Online Book 2: The tough one
- Goodfellow, Bengio, Courville: “Deep Learning”, https://www.deeplearningbook.org

(Artificial) Neural Networks ... (image: [1])
- ... are one of the core topics of artificial intelligence (AI)
- ... are biologically inspired: they are mathematical models simulating the process of “thinking” in living organisms’ brains
- ... are graphs (or networks) of interlinked atomic processing units (neurons)
- ... are not programmed, but define their behavior entirely through the graph’s links and weights. These are adjusted by training.
- ... are covered by two research fields:
  1. computational neuroscience (goal: better understanding of biological processes)
  2. machine learning / AI (here) (goal: solving practical data analysis problems)

Neural Networks ...
- ... have a turbulent history
[Figure: timeline from 1940 to 2015]
- ... exist in many variants
  - multi-layer perceptron (in this lecture)
  - convolutional neural networks (in this lecture)
  - Boltzmann machines
  - RBF networks
  - recurrent neural networks
  - LSTM networks
  - ...
- ... are the most intensely studied machine learning model these days (“deep learning”)

Deep Learning Applications
[Figures: images from [5], [6], [2], [4], [7]]

Neural Networks: Biological Inspiration
Neurons are brain cells that send and receive electrical impulses.
Neurons: Terminology
- cell body (soma): center of the neuron, with a diameter of 5 to 100 µm
- axon: thin, long (up to 1 m) nerve fibre; splits and transmits impulses to other neurons
- dendrites: small, branch-like outgrowths that receive impulses and transmit them to the soma
- synapses: electro-chemical links between two neurons. They transfer the signal from axon to dendrites via transmitter substances.
[Figure: three neurons (soma 1, soma 2, soma 3) linked by axons, dendrites, and synapses, with an electrical impulse travelling along an axon; inset of a synapse showing the pre-synaptic and post-synaptic membranes]

From Biological to Artificial Neural Networks
- Biological neural networks are (much) more complex than artificial ones
  - connectivity: $10^{11}$ neurons, each connected with ≈ 7,000 others
  - adaptivity: number and connections of neurons change over time
- Biological neural networks are asynchronous and compute with rather low frequency (1 kHz)
- The circuit layout is unknown
Learning
- Learning happens at synapses, by changing the transmitter dose
  - sensitization: more transmitter, enhancing the signal
  - desensitization: less transmitter, suppressing the signal
Artificial Neural Networks
- weighted sum + activation function
- learning happens by adapting weights
- memory happens by storing weights
[Figure: artificial neuron with inputs x_1, ..., x_m, weights w_1, ..., w_m, summation Σ, threshold θ, and output y]

Outline
1. Neurons
2. Training Neurons
3. The Capacity of Neurons
4. Neural Networks

The McP-Neuron
- The historically oldest neuron model (1943), by McCulloch & Pitts
- strongly simplified model of biological neurons
- no learning yet
Definition
- input $x = (x_1, ..., x_d) \in \{0, 1\}^d$
- output $y \in \{0, 1\}$
- weights $w = (w_1, ..., w_d) \in \{-1, 1\}^d$
  - $w_j = 1$: stimulating (German: “anregend”)
  - $w_j = -1$: inhibitory (German: “hemmend”)
- threshold $\theta \in \mathbb{R}$
[Figure: neuron with inputs x_1, ..., x_m, weights w_1, ..., w_m, summation Σ, threshold θ, and output y]

The McP-Neuron
Computational Model
- compute the scalar product between input signal x and weights w, namely w · x
- apply an activation function f (here, a step function), obtaining the output y

  $y = f(w \cdot x) = \begin{cases} 1 & \text{if } w \cdot x \geq \theta \\ 0 & \text{else} \end{cases}$

- The McP-Neuron is a function $\varphi_{w,\theta} : \{0, 1\}^d \to \{0, 1\}$ with parameters $w, \theta$
- If $\varphi_{w,\theta}(x) = 1$, we say the neuron “is activated”, or the neuron “fires”
[Figure: step function f(x · w), jumping from 0 to 1 at x · w = θ]

The McP-Neuron
McP-Neurons: Graphical Representation
- input x
- weights w
- threshold θ
[Figure: McP-neuron with inputs x_1, ..., x_m, weights w_1, ..., w_m, threshold θ, and output y]
What can McP-Neurons do?
- We can model logic gates using McP-neurons
[Figure: three McP-neurons realizing NOT (w = −1, θ = −0.5), AND (w = (1, 1), θ = 2), and OR (w = (1, 1), θ = 1)]
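A minimal Python sketch of an McP-neuron and the three gates. The weights and thresholds are read off the figure as far as it is legible (NOT: w = −1, θ = −0.5; AND: w = (1, 1), θ = 2; OR: w = (1, 1), θ = 1) and should be treated as an assumption; the function names are illustrative only.

```python
def mcp_neuron(x, w, theta):
    """Return 1 if the weighted sum w . x reaches the threshold theta, else 0."""
    weighted_sum = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if weighted_sum >= theta else 0


# One neuron per gate; parameters assumed from the figure.
def gate_not(x1):
    return mcp_neuron([x1], [-1], -0.5)


def gate_and(x1, x2):
    return mcp_neuron([x1, x2], [1, 1], 2)


def gate_or(x1, x2):
    return mcp_neuron([x1, x2], [1, 1], 1)


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "AND:", gate_and(x1, x2), "OR:", gate_or(x1, x2))
    print("NOT 0:", gate_not(0), "NOT 1:", gate_not(1))
```

Any threshold in (1, 2] would also work for AND, and any in (0, 1] for OR, since the weighted sum of two binary inputs can only be 0, 1, or 2.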

From McP-Neurons to the Perceptron
McP-Neurons: Limitations
- only boolean inputs/weights/outputs
- only binary activation functions
- no learning
Extensions: The Perceptron
- real-valued weights and inputs $w, x \in \mathbb{R}^d$
- We will introduce a learning algorithm, the Delta rule

Perceptron: Graphical Interpretation
- The parameters w, θ define a hyperplane
- w · x ≥ θ: x is on the positive side
- w · x < θ: x is on the negative side
[Figure: hyperplane with normal vector w at distance θ / ||w|| from the origin, separating the positive side from the negative side]
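The geometric reading can be checked numerically. The sketch below (illustrative names and example values, not taken from the slides) classifies points by the sign of w · x − θ and computes the hyperplane's offset θ / ||w|| from the origin.

```python
import math


def perceptron_side(w, theta, x):
    """Return 1 if x lies on the positive side (w . x >= theta), else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= theta else 0


def hyperplane_offset(w, theta):
    """Signed distance of the hyperplane from the origin along w: theta / ||w||."""
    return theta / math.sqrt(sum(wj * wj for wj in w))


# Example values chosen for illustration: the line 3*x1 + 4*x2 = 10.
w, theta = [3.0, 4.0], 10.0
print(hyperplane_offset(w, theta))            # ||w|| = 5, so the offset is 2.0
print(perceptron_side(w, theta, [2.0, 1.0]))  # 3*2 + 4*1 = 10 >= 10 -> 1
print(perceptron_side(w, theta, [1.0, 1.0]))  # 3 + 4 = 7 < 10 -> 0
```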

Outline
1. Neurons
2. Training Neurons
3. The Capacity of Neurons
4. Neural Networks

Perceptron Training: Basics
- A neuron’s function (or behavior) $\varphi_{w,\theta}$ is determined by the parameters w and θ
- Given labeled training data: How do we find the best parameters / the best hyperplane?
[Figure: two candidate hyperplanes on the same labeled training data (classes 0 and 1), each with a positive and a negative side. Bad parameters: error rate 12 / 30 = 40%. Good parameters: error rate 0%.]

Perceptron Training: Vector Augmentation
Before learning, we simplify notation using a simple trick called vector augmentation.
Idea
- omit the additional threshold parameter θ and include it in the weight vector
So far ...
- $x := (x_1, ..., x_d)$, $w := (w_1, ..., w_d)$
- fire if $w \cdot x \geq \theta$ (or $w \cdot x - \theta \geq 0$)
Now
- $x := (x_1, ..., x_d, 1)$, $w := (w_1, ..., w_d, -\theta)$
- fire if $w \cdot x \geq 0$
Remark
This is just a change of notation (θ is no longer written separately), not of behavior!
[Figure: neuron with a separate threshold θ (top) versus the augmented neuron with a constant extra input 1, weight −θ, and threshold 0 (bottom)]

Perceptron Training
Learning happens by iterative optimization. We start with random values for w and θ, and in each iteration we ...
- ... classify a training sample x and compare the result φ(x) with the target y
- ... if necessary, correct the neuron’s parameters such that the output φ(x) moves towards the desired output y
[Figure: training loop. The neuron computes φ(x) and checks φ(x) = y. If yes, nothing changes; if no, the weights are updated: $w_{new} \leftarrow w_{old} + \Delta w$]

Perceptron Training: The Delta Rule
How do we compute the “right” update ∆w? There are different strategies. The most famous one is the Delta rule:

$\Delta w = \lambda \cdot (y - \varphi(x)) \cdot x$

Remarks
- We call λ the learning rate
- Note: If x is classified correctly, then y − φ(x) = 0 and hence ∆w = 0 → no update

    function train_perceptron(x_1, ..., x_n, y_1, ..., y_n, λ):
        initialize w randomly
        until convergence:
            choose a random training sample (x, y) (with y ∈ {0, 1})
            compute φ(x)                        // classify x
            if φ(x) ≠ y:
                w := w + λ · (y − φ(x)) · x     // ∆w = λ · (y − φ(x)) · x
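A runnable Python sketch of this training loop, using augmented vectors (constant last input 1, threshold folded into the weights). The stopping criterion, the iteration cap `max_iters`, the fixed random seed, and the toy data are assumptions made for this illustration; the loop is only guaranteed to terminate early if the data is linearly separable.

```python
import random


def predict(w, x):
    """Augmented perceptron: fire if w . x >= 0 (the last entry of x is the constant 1)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0


def train_perceptron(samples, labels, lam=0.1, max_iters=100000, seed=0):
    rng = random.Random(seed)
    # Vector augmentation: append a constant 1 so the last weight plays the role of -theta.
    data = [(list(x) + [1.0], y) for x, y in zip(samples, labels)]
    w = [rng.uniform(-1.0, 1.0) for _ in range(len(data[0][0]))]
    for _ in range(max_iters):
        if all(predict(w, x) == y for x, y in data):
            break  # converged: every training sample is classified correctly
        x, y = rng.choice(data)
        delta = y - predict(w, x)  # 0 if correct, +1 or -1 if misclassified
        w = [wj + lam * delta * xj for wj, xj in zip(w, x)]  # Delta rule
    return w


# Toy data (hypothetical): class 1 lies towards the upper right.
samples = [(2, 2), (3, 1), (1, 4), (7, 8), (8, 6), (6, 9)]
labels = [0, 0, 0, 1, 1, 1]
w = train_perceptron(samples, labels)
print([predict(w, list(x) + [1.0]) for x in samples])
```

Checking all samples in every iteration is wasteful for large data sets; it is used here only to make “until convergence” concrete.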
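
The Delta Rule: Motivation

$\Delta w = \lambda \cdot (y - \varphi(x)) \cdot x$

- Why does the Delta rule work?
- We illustrate what happens when a training sample is misclassified ...
[Figure: a misclassified sample x, the hyperplane with normal vector w, the update ∆w (= 0.2 · x), and the resulting weight vector w_new]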

The Delta Rule: Motivation
Formal Motivation
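One standard way to see that the update pushes the score in the right direction is sketched below. It is not necessarily the derivation given in the lecture; it assumes augmented vectors (so the neuron fires if w · x ≥ 0, and ||x||² ≥ 1) and a learning rate λ > 0.

```latex
\begin{align*}
w_{\text{new}} \cdot x
  &= (w + \Delta w) \cdot x
   = w \cdot x + \lambda \, (y - \varphi(x)) \, \lVert x \rVert^2 \\[4pt]
y = 1,\ \varphi(x) = 0: \quad
  & w_{\text{new}} \cdot x = w \cdot x + \lambda \lVert x \rVert^2 > w \cdot x \\
y = 0,\ \varphi(x) = 1: \quad
  & w_{\text{new}} \cdot x = w \cdot x - \lambda \lVert x \rVert^2 < w \cdot x
\end{align*}
```

In the first case the score of x moves towards the firing region, in the second case away from it, so a misclassified sample is classified "less wrongly" after each update. Whether a single update suffices depends on λ and on how far x is from the hyperplane.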

The Delta Rule: Code Example
[Figures: decision boundary on 2D training data (both axes from 0 to 10) at iteration 0, iteration 1, iteration 2, and iteration 1000]

Outline
1. Neurons
2. Training Neurons
3. The Capacity of Neurons
4. Neural Networks

An Important Question...
Can a single perceptron learn any function $f : \mathbb{R}^d \to \{-1, 1\}$?

Achieving Non-Linearity by connecting Neurons
We can realize XOR by connecting multiple neurons!
[Figure: the four inputs in the (x_1, x_2) plane with the decision lines of neuron 1 and neuron 2, and the same points mapped into the (φ_1, φ_2) plane with the decision line of neuron 3]
- neuron 1: weights (1, 1), threshold 0.5, output φ_1(x_1, x_2)
- neuron 2: weights (1, 1), threshold 1.5, output φ_2(x_1, x_2)
- neuron 3: weights (1, −1), threshold 0.5, output φ_3(φ_1(x_1, x_2), φ_2(x_1, x_2))

x1  x2  φ1  φ2  φ3  x1 ⊗ x2
0   0   0   0   0   0
1   0   1   0   1   1
0   1   1   0   1   1
1   1   1   1   0   0

Achieving Non-Linearity
[Figure: the three-neuron XOR network: neuron 1 (weights 1, 1; threshold 0.5), neuron 2 (weights 1, 1; threshold 1.5), neuron 3 (weights 1, −1; threshold 0.5)]
Remarks
- This is only possible because of the neurons’ non-linear activation functions!
  - with activation functions:
    $y = f\big(w''_1 \cdot f(w_1 x_1 + w_2 x_2) + w''_2 \cdot f(w'_1 x_1 + w'_2 x_2)\big)$
  - without activation functions:
    $y = w''_1 \cdot (w_1 x_1 + w_2 x_2) + w''_2 \cdot (w'_1 x_1 + w'_2 x_2) = w'''_1 x_1 + w'''_2 x_2$
This is just a linear function (and does not solve XOR)
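Both points can be checked in a few lines of Python. The sketch assumes the weights and thresholds legible in the figure (neuron 1: weights 1, 1 and threshold 0.5; neuron 2: weights 1, 1 and threshold 1.5; neuron 3: weights 1, −1 and threshold 0.5); function names are illustrative.

```python
def step_neuron(inputs, weights, theta):
    """Threshold neuron: 1 if the weighted sum reaches theta, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= theta else 0


def xor_network(x1, x2):
    phi1 = step_neuron([x1, x2], [1, 1], 0.5)       # neuron 1 behaves like OR
    phi2 = step_neuron([x1, x2], [1, 1], 1.5)       # neuron 2 behaves like AND
    return step_neuron([phi1, phi2], [1, -1], 0.5)  # neuron 3: "OR but not AND"


def linear_network(x1, x2):
    """Same weights, activation functions removed: collapses to a linear function."""
    return 1 * (1 * x1 + 1 * x2) + (-1) * (1 * x1 + 1 * x2)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2), linear_network(x1, x2))
```

With these particular weights the linear version is constantly 0, which makes the collapse especially visible; for other weights it would be some other linear function of x1 and x2.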

Achieving Non-Linearity by connecting Neurons
Theorem (Learning Capacity of Networks of McP-Neurons)
We can realize any boolean function $f : \{0, 1\}^d \to \{0, 1\}$ by connecting not more than $2^d + 1$ McP-neurons.
Proof (by construction)

Proof by Construction
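
Proof by Construction: Example
$(\neg x_1 \wedge x_2 \wedge x_3 \wedge \neg x_4 \wedge x_5) \vee (x_1 \wedge \neg x_2 \wedge \neg x_3 \wedge x_4 \wedge x_5) \vee (\neg x_1 \wedge x_2 \wedge \neg x_3 \wedge x_4 \wedge \neg x_5)$
[Figure: two-layer network of McP-neurons. One neuron per conjunction, with weight 1 for each plain input, weight −1 for each negated input, and thresholds 3, 3, and 2; their outputs feed an OR neuron with weights 1 and threshold 1]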
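The construction in the example generalizes to any boolean function via its disjunctive normal form. The Python sketch below is one possible reading of the proof idea, not necessarily the construction shown in the lecture: it creates one neuron per input pattern with f = 1 (weight 1 for bits that must be 1, weight −1 for bits that must be 0, threshold = number of 1-bits) plus a single OR neuron. All names and the example function (majority of three bits) are illustrative.

```python
from itertools import product


def mcp_neuron(x, w, theta):
    """Return 1 if the weighted sum w . x reaches the threshold theta, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= theta else 0


def build_dnf_network(f, d):
    """Return one (weights, threshold) pair per input pattern on which f is 1."""
    hidden = []
    for pattern in product((0, 1), repeat=d):
        if f(*pattern):
            weights = [1 if bit else -1 for bit in pattern]
            hidden.append((weights, sum(pattern)))  # fires only on exactly this pattern
    return hidden


def run_network(hidden, x):
    outputs = [mcp_neuron(x, w, theta) for w, theta in hidden]
    return mcp_neuron(outputs, [1] * len(outputs), 1)  # OR over all pattern neurons


def majority(a, b, c):
    return 1 if a + b + c >= 2 else 0


hidden = build_dnf_network(majority, 3)
print(len(hidden) + 1, "neurons used; the bound is", 2 ** 3 + 1)
for x in product((0, 1), repeat=3):
    print(x, run_network(hidden, x))
```

At most 2^d patterns exist, so at most 2^d pattern neurons plus one OR neuron are needed, which matches the bound in the theorem.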

The McP-Neuron: Do-it-Yourself
1. Sketch a network of McP-neurons that models the (even) parity bit for a given sequence of 3 input bits, x_1, x_2, x_3.
2. Given n input bits, the above model seems to require $2^n + 1$ neurons. Can you do “better” by using more than 2 layers?

Perceptron: Activation Functions
We can use other activation functions:
- step function: $f(x) = 1$ if $x \geq \theta$, $0$ else; parameter: θ
- sigmoidal: $f(x) = \frac{1}{1 + e^{-(x - \theta)}}$; parameter: θ
[Plots: f(x) for x from −4 to 4, with values between 0 and 1, for both functions]
Remarks
- The sigmoid approximates a step function, but is also differentiable (with $f'(x) = f(x) \cdot (1 - f(x))$)
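The derivative identity is easy to verify numerically. The following Python sketch (illustrative function names; θ defaults to 0) compares f(x) · (1 − f(x)) with a central finite difference.

```python
import math


def step(x, theta=0.0):
    return 1.0 if x >= theta else 0.0


def sigmoid(x, theta=0.0):
    return 1.0 / (1.0 + math.exp(-(x - theta)))


def sigmoid_prime(x, theta=0.0):
    """Analytic derivative f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x, theta)
    return s * (1.0 - s)


h = 1e-5  # step width of the finite difference
for x in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, step(x), sigmoid(x), sigmoid_prime(x), numeric)
```

The differentiability matters because gradient-based training needs a derivative, which the step function does not provide at the threshold and which is zero everywhere else.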

Perceptron: Activation Functions
We can use other activation functions:
- rectified linear unit (ReLU): $f(x) = \max(0, x)$; parameters: none
- Gaussian: $f(x) \propto e^{-\frac{(x - \mu)^2}{2\sigma^2}}$; parameters: μ, σ
[Plots: f(x) for x from −4 to 4, for both functions]

Can connected Perceptrons learn Any Function?¹
Theorem (Learning Capacity of Networks of Perceptrons)
By connecting multiple perceptrons with sigmoidal activation function, we can approximate any continuous function $f : [0, 1]^d \to [0, 1]$.
Remarks
- By connecting perceptrons, we can learn any decision boundary!
- This works with just one hidden layer (see below).
- Open Question: How many perceptrons do we need?
- Open Question: How do we find a solution that generalizes properly?
“In this paper we demonstrate that finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables with support in the unit hypercube. Only mild conditions are imposed on the univariate function. Our results settle an open question about representability in the class of single hidden layer neural networks. In particular, we show that arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity.”
(G. Cybenko [3])
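Written as a formula, the “finite linear combinations” from the quote are sums of the following shape. The notation is chosen here for illustration: N hidden units, output weights α_j, hidden weight vectors w_j, offsets θ_j, and a continuous sigmoidal function σ.

```latex
G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(w_j \cdot x + \theta_j),
\qquad
\lvert G(x) - f(x) \rvert < \varepsilon \quad \text{for all } x \in [0, 1]^d
```

The statement is one of existence: for every continuous f and every ε > 0 such a sum exists, but the theorem gives neither the required N nor a way to find the parameters.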
¹ Try Nielsen’s online demo: http://neuralnetworksanddeeplearning.com/chap4.html

Outline
1. Neurons
2. Training Neurons
3. The Capacity of Neurons
4. Neural Networks

Definition: Neural Network
Neural Network Definition
- A neural network is a set of (partially) connected neurons
- The output signal of a neuron can be used as input signal to (multiple) other neurons
- The network’s input consists of all input signals that are not derived from other neurons
- The network’s output consists of all output signals that are not used as input for other neurons
- Each link between two neurons has a weight. We denote the weight of the connection from neuron i to neuron j with $w_{ij}$.
Example
- 4 neurons n_1, ..., n_4
- activation functions a_1, ..., a_4
- network input: x_1, x_2
- network output: z_1, z_2
- “hidden signals”: y_1, y_2
[Figure: example network with inputs x_1, x_2, hidden signals y_1, y_2, outputs z_1, z_2, and weighted links such as w_12]

Neural Network: Graphical Representation
Neural networks are weighted, directed graphs
- Neurons are nodes, connected by weighted edges
- Inputs and outputs are modeled as separate nodes
[Figure: the example network drawn as a weighted, directed graph with separate input and output nodes]

Network Topologies
We can distinguish two general network topologies
- feedforward networks
- recurrent networks

Feedforward Networks: Layers
- Feedforward networks are DAGs (“directed acyclic graphs”), i.e. they do not contain any cycles.
- The signal is never propagated backwards through the network (hence “feedforward”)
- Convention: We organize feedforward networks in layers
Layer Architecture
- Every neuron is only connected with neurons from the previous and next layers
- We call two layers fully connected in case all of their neurons are connected
[Figure: input layer, hidden layers, and output layer; the pairs of adjacent layers are labeled fully connected, fully connected, and not fully connected]

Feedforward Networks: Computation
Computational Model
- the signal is propagated through the network instantly
- We collect each layer’s weights and biases in a matrix and a vector (for non-existing edges, entries are zero).
- We collect all the signals leaving each layer in a vector
Example (two hidden layers)
- input $x = (x_1, ..., x_n)$
- hidden layer 1: $y^1 = (y^1_1, ..., y^1_m)$
- hidden layer 2: $y^2 = (y^2_1, ..., y^2_p)$
- output $z = (z_1, ..., z_q)$
- Each layer applies (1) a weighted sum (a linear operation!), and (2) a non-linear activation f (for each neuron):
  - $y^1 = f(W \cdot x + b)$
  - $y^2 = g(W' \cdot y^1 + b')$
  - $z = h(W'' \cdot y^2 + b'')$
[Figure: input layer, two hidden layers, and output layer, with signals x → y^1 → y^2 → z]

Do-Forward-Computation Yourself
Given is the following network with threshold activation functions and input x = (1, 0). Compute the output z_1.
[Figure: network with inputs x_1, x_2, hidden neurons y_1, y_2, y_3, outputs z_1, z_2, and numeric edge weights and thresholds]

Do-Forward-Computation Yourself
Let’s do it again, this time in matrix notation:
[Figure: the same network as in the previous exercise]

The 3-layer Perceptron (MLP)
[Figure: fully connected three-layer network with input x = (x_1, ..., x_n), hidden layer y = (y_1, ..., y_m), and output z = (z_1, ..., z_p)]
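The three-layer structure in the figure (input x, hidden y, output z) corresponds to two weighted sums, each followed by an activation. A minimal forward-pass sketch in plain Python, assuming a sigmoid activation in both layers and hypothetical random weights (the sizes n = 3, m = 4, p = 2 and all names are illustrative):

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer(W, b, signal):
    """One layer: weighted sum W . signal + b, then the activation per neuron."""
    return [sigmoid(sum(w * s for w, s in zip(row, signal)) + bias)
            for row, bias in zip(W, b)]


def mlp_forward(x, W_hidden, b_hidden, W_out, b_out):
    y = layer(W_hidden, b_hidden, x)  # hidden layer signals
    z = layer(W_out, b_out, y)        # output layer signals
    return y, z


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, m, p = 3, 4, 2  # input, hidden, and output sizes (hypothetical)
W_hidden, b_hidden = random_matrix(m, n, rng), [0.0] * m
W_out, b_out = random_matrix(p, m, rng), [0.0] * p

y, z = mlp_forward([0.5, -1.0, 2.0], W_hidden, b_hidden, W_out, b_out)
print("hidden:", y)
print("output:", z)
```

Each row of a weight matrix holds the incoming weights of one neuron, so a missing edge would simply be a zero entry.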
In the following, we focus on the most basic type of neural network, the multi-layer perceptron (MLP).
- The network is feed-forward
- There is only one hidden layer
- All layers are fully connected
- The network uses a sigmoidal activation function
- We know: such a network can approximate any continuous function!

Recurrent Neural Networks
In contrast to feedforward networks, recurrent networks may contain cycles → the signal is propagated backwards through the network!
Example Architecture: Elman Networks
[Figure: Elman network with input, hidden, context, and output layers]

Recurrent Neural Networks
- The signals in the network must be clocked
- We extend the computational model with a time component t = 1, 2, 3, ...
- Example: An OCR system recognizing a sequence of characters x_1, x_2, ... (here, t is the character number)
- At each time t, the network is fed an input x(t)

    function update_elman_network(x(t)):
        hidden(t)  := f(x(t), context(t-1))
        context(t) := g(hidden(t))
        output(t)  := h(hidden(t))

- This way, the network achieves a memory effect!
- Example: “... in Europe. Italy ...”
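The memory effect can be made concrete with a deliberately tiny sketch: scalar signals, a tanh hidden unit for f, a plain copy for g, and a sigmoid for h. These choices and the weights are assumptions for illustration, not the lecture's definition of an Elman network.

```python
import math


def elman_step(x, context, w_in=1.0, w_ctx=1.0, w_out=1.0):
    """One clock tick: returns (output(t), context(t))."""
    hidden = math.tanh(w_in * x + w_ctx * context)     # f(x(t), context(t-1))
    new_context = hidden                               # g: copy the hidden state
    output = 1.0 / (1.0 + math.exp(-w_out * hidden))   # h(hidden(t))
    return output, new_context


def run_sequence(xs):
    context = 0.0  # empty memory before the first input
    outputs = []
    for x in xs:
        output, context = elman_step(x, context)
        outputs.append(output)
    return outputs


# The last input is 0.0 in both sequences, but the outputs differ
# because the context still carries information about the first input.
print(run_sequence([1.0, 0.0])[-1])
print(run_sequence([0.0, 0.0])[-1])
```

The same input produces different outputs depending on what the network saw before, which is the property the OCR example relies on.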

References
[1] fdecomite: Minsky & Papert Model. https://flic.kr/p/5VsZ1M (retrieved: Nov 2016).
[2] Google DeepDream robot: 10 weirdest images produced by AI ’inceptionism’ and users online (Photo: Reuters). http://www.straitstimes.com/asia/east-asia/alphago-wins-4th-victory-over-lee-se-dol-in-final-go-match (retrieved: Nov 2016).
[3] G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, Dec. 1989.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, Feb. 2015.
[6] M.-A. Russon. Google DeepDream robot: 10 weirdest images produced by AI ’inceptionism’ and users online. http://www.ibtimes.co.uk/google-deepdream-robot-10-weirdest-images-produced-by-ai-inceptionism-users-online-1509518 (retrieved: Nov 2016).
[7] C. Szegedy. Building a deeper understanding of images (Google Research Blog). https://research.googleblog.com/2014/09/building-deeper-understanding-of-images.html (retrieved: Nov 2016).