Lecture 6: Neural Networks
Shuai Li
John Hopcroft Center, Shanghai Jiao Tong University
https://shuaili8.github.io
https://shuaili8.github.io/Teaching/VE445/index.html

Outline
• Perceptron
• Activation functions
• Multilayer perceptron networks
• Training: backpropagation
• Examples
• Overfitting
• Applications

Brief history of artificial neural nets
• The first wave
  • 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model
  • 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons
  • 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation
• The second wave
  • 1986: The backpropagation learning algorithm for multilayer perceptrons was rediscovered, and the whole field took off again
• The third wave
  • 2006: Deep learning (deep neural networks) gained popularity
  • 2012: Deep learning made significant breakthroughs in many applications

Biological neuron structure
• A neuron receives signals through its dendrites and sends its own signal along the axon to the axon terminals

Biological neural communication
• The electrical potential across the cell membrane exhibits spikes called action potentials
• A spike originates in the cell body, travels down the axon, and causes the synaptic terminals to release neurotransmitters
• The chemical diffuses across the synapse to the dendrites of other neurons
• Neurotransmitters can be excitatory or inhibitory
• If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, the neuron fires an action potential

Perceptron
• Inspired by the biological neurons of humans and animals, researchers built a simple model called the perceptron
• It receives signals $x_i$, multiplies them by weights $w_i$, and outputs the weighted sum passed through an activation function (a step function)

Neuron vs. Perceptron

Artificial neural networks
• Multilayer perceptron network
• Convolutional neural network
• Recurrent neural network

Perceptron

Training
• $w_i \leftarrow w_i - \eta (o - y) x_i$
  • $y$: the real label
  • $o$: the output of the perceptron
  • $\eta$: the learning rate
• Explanation
  • If the output is correct, do nothing
  • If the output is too high, lower the weight (for a positive input $x_i$)
  • If the output is too low, increase the weight (for a positive input $x_i$)

Properties
• Rosenblatt [1958] proved that training converges if the two classes are linearly separable and $\eta$ is reasonably small

Limitation
• Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, cannot be done by Rosenblatt's one-layer perceptron
• Rosenblatt, however, believed the limitations could be overcome by adding more layers of units, but no learning algorithm for obtaining the weights was known at the time

Solution: Add hidden layers
• Two-layer feedforward neural network

Demo
• Larger neural networks can represent more complicated functions. The data are shown as circles colored by their class, and the decision regions of a trained neural network are shown underneath. You can play with these examples in this ConvNetJS demo.

Activation Functions

Activation functions
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  (sigmoid and tanh are most popular in fully connected neural networks)
• ReLU (rectified linear unit): $\mathrm{ReLU}(z) = \max(0, z)$
  (most popular in deep learning)

Activation function values and derivatives

Sigmoid activation function
• Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• Its derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
• Output range: $(0, 1)$
• Motivated by biological neurons; the output can be interpreted as the probability of an artificial neuron "firing" given its inputs
• However, saturated neurons make values vanish (why?)
  • Repeated application $f(f(f(\cdots)))$ of the sigmoid $f$ squeezes the output into an ever narrower range
  • $f([0, 1]) \subseteq [0.5, 0.732]$
  • $f([0.5, 0.732]) \subseteq [0.622, 0.676]$

Tanh activation function
• Tanh function: $\tanh(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
• Its derivative: $\tanh'(z) = 1 - \tanh^2(z)$
• Output range: $(-1, 1)$
• Thus strongly negative inputs to tanh map to negative outputs
• Only inputs near zero are mapped to near-zero outputs
• These properties make the network less likely to get "stuck" during training

ReLU activation function
• ReLU (rectified linear unit) function: $\mathrm{ReLU}(z) = \max(0, z)$
• Its derivative: $\mathrm{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$
• ReLU can be approximated by the softplus function
• ReLU's gradient does not vanish as $z$ increases
• It speeds up the training of neural networks
  • The gradient computation is very simple
  • The computational step is simple: no exponentials, no multiplication or division operations (compared to the others)
• The gradient on the positive portion is larger than that of the sigmoid or tanh functions
  • Weights update more rapidly
• The left "dead neuron" part can be ameliorated by Leaky ReLU

ReLU activation function (cont.)
• ReLU function: $\mathrm{ReLU}(z) = \max(0, z)$
• The only non-linearity comes from path selection, with individual neurons being active or not
• It allows sparse representations:
  • For a given input, only a subset of neurons is active
(Figure: sparse propagation of activations and gradients)

Multilayer Perceptron Networks

Neural networks
• Neural networks are built by connecting many perceptrons together, layer by layer

Universal approximation theorem
• A feed-forward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons
(Figure: a 1-20-1 NN approximates a noisy sine function)
Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366.

Increasing power of approximation
• With more neurons, the network's approximation power increases.
  The decision boundary captures more detail.
• In applications, we usually use more layers with structure to approximate complex functions, instead of one hidden layer with many neurons

Common neural network structures at a glance

Multilayer perceptron network

Single / multiple layers of calculation
• Single-layer function: $f_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
• Multiple-layer function:
  • $h_1(x) = \sigma(\theta_0^{1} + \theta_1^{1} x_1 + \theta_2^{1} x_2)$
  • $h_2(x) = \sigma(\theta_0^{2} + \theta_1^{2} x_1 + \theta_2^{2} x_2)$
  • $f_\theta(h) = \sigma(\theta_0 + \theta_1 h_1 + \theta_2 h_2)$

Training: Backpropagation

How to train?
• As with previous models, we use gradient descent to train the neural network
• Given the topology of the network (number of layers, number of neurons, their connections), find a set of weights that minimizes the error function
(Figure: error function, with labels for the set of training examples, the target, and the output)

Gradient interpretation
• The gradient is the vector (the red one in the figure) along which the value of the function increases most rapidly. Thus its opposite direction is where the value decreases most rapidly.

Gradient descent
• To find a (local) minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of an approximation to it) at the current point
• For a smooth function $f(x)$, $\frac{\partial f}{\partial x}$ is the direction in which $f$ increases most rapidly. So we apply
  $x_{t+1} = x_t - \eta \frac{\partial f}{\partial x}(x_t)$
  until $x$ converges

Gradient descent
(Figure: formulas labeled gradient, training rule, and update)
• Partial derivatives are the key

The chain rule
• The challenge in a neural network model is that we only know the target of the output layer, not the targets for the hidden and input layers. How can we update their connection weights using gradient descent?
• The answer is the chain rule from calculus:
  $y = f(g(x)) \Rightarrow \frac{dy}{dx} = f'(g(x))\, g'(x)$

Feed forward vs. Backpropagation

Make a prediction
(Figure labels: targets $t_1, \ldots, t_k$)

Make a prediction (cont.)
(Figure labels: targets $t_1, \ldots, t_k$)

Make a prediction (cont.)
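Making a prediction amounts to a forward pass through the network. The following is a minimal Python sketch of the multiple-layer function defined earlier, assuming sigmoid activations throughout. The function names and all parameter values are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_params, output_params):
    """Forward pass through one hidden layer and a single output unit.

    hidden_params: one (bias, weights) pair per hidden unit h_j
    output_params: a (bias, weights) pair for the output unit f
    Returns the hidden activations and the network output.
    """
    h = [sigmoid(b + sum(w * xi for w, xi in zip(ws, x)))
         for b, ws in hidden_params]
    b_out, w_out = output_params
    y = sigmoid(b_out + sum(w * hj for w, hj in zip(w_out, h)))
    return h, y

# Hypothetical parameters (illustration only, not from the lecture).
hidden_params = [(0.1, [0.4, -0.3]),   # theta^1 for h_1
                 (-0.2, [0.2, 0.5])]   # theta^2 for h_2
output_params = (0.05, [0.7, -0.6])    # theta for f

h, y = forward([1.0, 2.0], hidden_params, output_params)
print("hidden activations:", h)
print("network output:", y)
```

Because every unit applies a sigmoid, each hidden activation and the final output lie strictly between 0 and 1.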
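Backpropagation
• Assume all the activation functions are sigmoid
• Error function: $E = \frac{1}{2} \sum_k (y_k - t_k)^2$
• $\frac{\partial E}{\partial y_k} = y_k - t_k$
• $\frac{\partial y_k}{\partial w_{k,j}^{(2)}} = f_{(2)}'\big(net_k^{(2)}\big)\, h_j^{(1)} = y_k (1 - y_k)\, h_j^{(1)}$
• $\Rightarrow \frac{\partial E}{\partial w_{k,j}^{(2)}} = -(t_k - y_k)\, y_k (1 - y_k)\, h_j^{(1)}$
• $\Rightarrow w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \eta\, \delta_k^{(2)} h_j^{(1)}$
  • $\delta_k^{(2)} = (t_k - y_k)\, y_k (1 - y_k)$
  • $h_j^{(1)}$: the output of unit $j$

Backpropagation (cont.)
• Error function: $E = \frac{1}{2} \sum_k (y_k - t_k)^2$
• $\frac{\partial E}{\partial y_k} = y_k - t_k$
• $\delta_k^{(2)} = (t_k - y_k)\, y_k (1 - y_k)$
• $\Rightarrow w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \eta\, \delta_k^{(2)} h_j^{(1)}$
• $\frac{\partial y_k}{\partial h_j^{(1)}} = y_k (1 - y_k)\, w_{k,j}^{(2)}$
• $\frac{\partial h_j^{(1)}}{\partial w_{j,m}^{(1)}} = f_{(1)}'\big(net_j^{(1)}\big)\, x_m = h_j^{(1)} \big(1 - h_j^{(1)}\big)\, x_m$
• $\frac{\partial E}{\partial w_{j,m}^{(1)}} = -h_j^{(1)} \big(1 - h_j^{(1)}\big) \sum_k w_{k,j}^{(2)} (t_k - y_k)\, y_k (1 - y_k)\, x_m = -h_j^{(1)} \big(1 - h_j^{(1)}\big) \Big(\sum_k w_{k,j}^{(2)} \delta_k^{(2)}\Big) x_m$
• $\Rightarrow w_{j,m}^{(1)} \leftarrow w_{j,m}^{(1)} + \eta\, \delta_j^{(1)} x_m$
  • $\delta_j^{(1)} = h_j^{(1)} \big(1 - h_j^{(1)}\big) \sum_k w_{k,j}^{(2)} \delta_k^{(2)}$

Backpropagation algorithms
• Activation function: sigmoid
• Error function: $E = \frac{1}{2} \sum_k (y_k - t_k)^2$
• $\delta_k^{(2)} = (t_k - y_k)\, y_k (1 - y_k)$
• $\Rightarrow w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \eta\, \delta_k^{(2)} h_j^{(1)}$
• $\delta_j^{(1)} = h_j^{(1)} \big(1 - h_j^{(1)}\big) \sum_k w_{k,j}^{(2)} \delta_k^{(2)}$
• $\Rightarrow w_{j,m}^{(1)} \leftarrow w_{j,m}^{(1)} + \eta\, \delta_j^{(1)} x_m$

Initialize all weights to small random numbers
Do until convergence
• For each training example:
  1. Input it to the network and compute the network output
  2. For each output unit $k$, where $o_k$ is the output of unit $k$:
     $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
  3. For each hidden unit $j$, where $o_j$ is the output of unit $j$:
     $\delta_j \leftarrow o_j (1 - o_j) \sum_{k \in \text{next layer}} w_{k,j}\, \delta_k$
  4. Update each network weight, where $x_i$ is the output of unit $i$ (the input to unit $j$):
     $w_{j,i} \leftarrow w_{j,i} + \eta\, \delta_j x_i$

See the backpropagation demo
• https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/

Formula example for backpropagation

Formula example for backpropagation (cont.)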
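The backpropagation update rules above can be sketched in plain Python for a network with one hidden layer. This is an illustrative sketch under stated assumptions rather than a reference implementation: it assumes sigmoid activations, the squared error $E = \frac{1}{2} \sum_k (y_k - t_k)^2$, and no bias terms (matching the update rules as written). The function names, the random initialization range, and the training example are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """w1[j][m]: weight from input m to hidden j; w2[k][j]: hidden j to output k."""
    h = [sigmoid(sum(w * xm for w, xm in zip(row, x))) for row in w1]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w2]
    return h, y

def error(x, t, w1, w2):
    """Squared error E = 1/2 * sum_k (y_k - t_k)^2 for one example."""
    _, y = forward(x, w1, w2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop_step(x, t, w1, w2, eta):
    """One update on a single training example; modifies w1 and w2 in place."""
    h, y = forward(x, w1, w2)
    # Output deltas: delta_k = y_k (1 - y_k) (t_k - y_k)
    d2 = [yk * (1 - yk) * (tk - yk) for yk, tk in zip(y, t)]
    # Hidden deltas, computed before any weight changes:
    # delta_j = h_j (1 - h_j) * sum_k w2[k][j] * delta_k
    d1 = [hj * (1 - hj) * sum(w2[k][j] * d2[k] for k in range(len(d2)))
          for j, hj in enumerate(h)]
    # Weight updates: w <- w + eta * delta * (input carried by that weight)
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] += eta * d2[k] * h[j]
    for j, row in enumerate(w1):
        for m in range(len(row)):
            row[m] += eta * d1[j] * x[m]
    return d1, d2

random.seed(0)  # fixed seed so the run is repeatable
w1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(2)]]
x, t = [0.35, 0.9], [0.5]  # hypothetical training example

before = error(x, t, w1, w2)
for _ in range(100):
    backprop_step(x, t, w1, w2, eta=0.5)
after = error(x, t, w1, w2)
print("error before:", before, "error after:", after)
```

The hidden deltas are computed from the output-layer weights before those weights are updated, so both layers use gradients evaluated at the same point.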
Calculation example
• Consider the simple network below:
• Assume that the neurons have a sigmoid activation function and
  (i) Perform a forward pass on the network and find the predicted output
  (ii) Perform a reverse pass (training) once (target = 0.5) with $\eta = 1$
  (iii) Perform a further forward pass and comment on the result

Calculation example (cont.)
• For each output unit $k$, where $o_k$ is the output of unit $k$:
  $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
• For each hidden unit $j$, where $o_j$ is the output of unit $j$:
  $\delta_j \leftarrow o_j (1 - o_j) \sum_{k \in \text{next layer}} w_{k,j}\, \delta_k$
• Update each network weight, where $x_i$ is the input to unit $j$:
  $w_{j,i} \leftarrow w_{j,i} + \eta\, \delta_j x_i$

Calculation example (cont.)
• For each output unit $k$, where $o_k$ is the output of unit $k$:
  $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
• For each hidden unit $j$, $o_j$ is the output of unit $j$
• Answer (i)
  • Input to top neuron = 0.35 × 0.1 + 0.9 × 0.8 = 0.755.
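With a sigmoid activation, the top neuron's output follows from its input as $\sigma(0.755)$. A small Python check of this arithmetic, using only the numbers given above (the variable names are illustrative, and the rest of the network is not reproduced here):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

net_top = 0.35 * 0.1 + 0.9 * 0.8   # weighted input to the top neuron
out_top = sigmoid(net_top)         # output after the sigmoid activation
print(round(net_top, 3), round(out_top, 2))  # 0.755 0.68
```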
