Neural Networks

Neural Networks These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric. Robot Image Credit: Viktoriya Sukhanova © 123RF.com 1 Neural Function • Brain function (thought) occurs as the result of the firing of neurons • Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters – Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds – Learning occurs as a result of the synapses’ plasticicity: They exhibit long-term changes in connection strength • There are about 1011 neurons and about 1014 synapses in the human brain! Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 2 Biology of a Neuron 3 Brain Structure • Different areas of the brain have different functions – Some areas seem to have the same function in all humans (e.g., Broca’s region for motor speech); the overall layout is generally consistent – Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly • We don’t know how different functions are “assigned” or acquired – Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors) – Partly the result of experience (learning) • We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought” Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 4 The “One Learning Algorithm” Hypothesis Somatosensor Auditory Cortex y Cortex Auditory cortex learns to see Somatosensory cortex [Roe et al., 1992] learns to see [Metin & Frost, 1989] Based on slide by Andrew Ng 5 Sensor Representations in the Brain Seeing with your tongue Human echolocation (sonar) Haptic belt: Direction sense Implanting a 3rd eye [BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009] Slide by Andrew Ng 6 Comparison of computing power INFORMATION CIRCA 2012 Computer Human Brain Computation Units 10-core Xeon: 109 Gates 1011 Neurons Storage Units 109 bits RAM, 1012 bits disk 1011 neurons, 1014 synapses Cycle time 10-9 sec 10-3 sec Bandwidth 109 bits/sec 1014 bits/sec • Computers are way faster than neurons… • But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel • Neural networks are designed to be massively parallel • The brain is effectively a billion times faster 7 Neural Networks • Origins: Algorithms that try to mimic the brain. • Very widely used in 80s and early 90s; popularity diminished in late 90s. • Recent resurgence: State-of-the-art technique for many applications • Artificial neural networks are not nearly as complex or intricate as the actual brain structure Based on slide by Andrew Ng 8 Neural networks Output units Hidden units Input units Layered feed-forward network • Neural networks are made up of nodes or units, connected by links • Each link has an associated weight and activation level • Each node has an input function (typically summing over weighted inputs), an activation function, and an output Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 9 Neuron Model: Logistic Unit “bias unit” x0 ✓0 x1 ✓1 x0 =1x0 =1 x = 2 3 ✓ = 2 3 x2 ✓2 ✓0 6 x3 7 6 ✓ 7 6 7 6 3 7 ✓1 4 5 4 5 ✓ 1 2 h (x)=g (✓|x) ✓ ✓Tx ✓3 X 1+e1− h (x)= ✓ ✓Tx 1+e− 1 Sigmoid (logistic) activation function: g(z)= z 1+e− Based on slide by Andrew Ng 10 Neural Network (2) bias units x0 =1 a0 1 h (x)= ✓ ✓Tx 1+e− Layer 1 Layer 2 Layer 3 (Input Layer) (Hidden Layer) (Output Layer) Slide by Andrew Ng 12 Feed-Forward Process • Input layer units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level • Working forward through the network, the input function of each unit is applied to compute the input value – Usually this is just the weighted sum of the activation on the links feeding into this node • The activation function transforms this input function into a final value – Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node Based on slide by T. Finin, M. desJardins, L Getoor, R. Par 13 Neural Network ⇥(1) ⇥(2) (j) ai =1 “activation” of unit i in layer j h (x)= ✓ (j) ✓Tx Θ1+=e− weight matrix controlling function mapping from layer j to layer j + 1 If network has sj units in layer j and sj+1 units in layer j+1, (j) then Θ has dimension sj+1 × (sj+1) . (1) 3 4 (2) 1 4 ⇥ R ⇥ ⇥ R ⇥ Slide by Andrew Ng 2 2 14 Vectorization (2) (1) (1) (1) (1) (2) a1 = g ⇥10 x0 +⇥11 x1 +⇥12 x2 +⇥13 x3 = g z1 (2) ⇣ (1) (1) (1) (1) ⌘ ⇣ (2)⌘ a2 = g ⇥20 x0 +⇥21 x1 +⇥22 x2 +⇥23 x3 = g z2 (2) ⇣ (1) (1) (1) (1) ⌘ ⇣ (2)⌘ a3 = g ⇥30 x0 +⇥31 x1 +⇥32 x2 +⇥33 x3 = g z3 ⇣ (2) (2) (2) (2) (2) (2) (2)⌘ (2) ⇣ ⌘ (3) h⇥(x)=g ⇥10 a0 +⇥11 a1 +⇥12 a2 +⇥13 a3 = g z1 ⇣ Feed-Forward Steps:⌘ ⇣ ⌘ z(2) =⇥(1)x 1a(2) = g(z(2)) h (x)= ✓ ✓Tx 1+e−(2) Add a0 =1 z(3) =⇥(2)a(2) (1) (2) ⇥ ⇥ (3) (3) h⇥(x)=a = g(z ) Based on slide by Andrew Ng 15 Other Network Architectures s = [3, 3, 2, 1] 1 h (x)= ✓ ✓Tx 1+e− Layer 1 Layer 2 Layer 3 Layer 4 L denotes the number of layers +L s N contains the numbers of nodes at each layer 2 – Not counting bias units – Typically, s0 = d (# input features) and sL-1=K (# classes) 16 Multiple Output Units: One-vs-Rest Pedestrian Car Motorcycle Truck K h⇥(x) R 2 We want: 1 0 0 0 0 1 0 0 h⇥(x) 2 3 h⇥(x) 2 3 h⇥(x) 2 3 h⇥(x) 2 3 ⇡ 0 ⇡ 0 ⇡ 1 ⇡ 0 6 0 7 6 0 7 6 0 7 6 1 7 6 7 6 7 6 7 6 7 when pedestrian when car when motorcycle when truck4 5 4 5 4 5 4 5 Slide by Andrew Ng 17 Multiple Output Units: One-vs-Rest K h⇥(x) R 2 We want: 1 0 0 0 0 1 0 0 h⇥(x) 2 3 h⇥(x) 2 3 h⇥(x) 2 3 h⇥(x) 2 3 ⇡ 0 ⇡ 0 ⇡ 1 ⇡ 0 6 0 7 6 0 7 6 0 7 6 1 7 6 7 6 7 6 7 6 7 when pedestrian when car when motorcycle when truck4 5 4 5 4 5 4 5 • Given {(x1,y1), (x2,y2), ..., (xn,yn)} • Must convert labels to 1-of-K representation 0 0 0 1 y = y = – e.g., when motorcycle, when car, etc. i 2 1 3 i 2 0 3 6 0 7 6 0 7 Based on slide by Andrew Ng 6 7 6 7 18 4 5 4 5 Neural Network Classification Given: {(x1,y1), (x2,y2), ..., (xn,yn)} +L s N contains # nodes at each layer 2 – s0 = d (# features) Binary classification Multi-class classification (K classes) K y = 0 or 1 y R e.g. , , , 2 pedestrian car motorcycle truck 1 output unit (sL-1= 1) K output units (sL-1= K) Slide by Andrew Ng 19 Understanding Representations 20 Representing Boolean Functions Logistic / Sigmoid Function Simple example: AND 1 g(z)= z 1+e− -30 1 +20 h (x)= ✓ ✓Tx +20 1+e− x x h (x) h (x) = g(-30 + 20x + 20x ) 1 2 Θ Θ 1 2 0 0 g(-30) ≈ 0 0 1 g(-10) ≈ 0 1 0 g(-10) ≈ 0 1 1 g(10) ≈ 1 Based on slide and example by Andrew Ng 21 Representing Boolean Functions AND OR -30 -10 1 +20 1 +20 h (x)= h (x)= ✓ ✓Tx ✓ ✓Tx +20 1+e− +20 1+e− NOT (NOT x1) AND (NOT x2) +10 1 +10 h (x)= ✓ ✓Tx -20 1 1+e− h (x)= -20 ✓ ✓Tx -20 1+e− 22 Combining Representations to Create Non-Linear Functions (NOT x1) AND (NOT x2) AND OR -30 +10 -10 1 1 1 +20 h (x)= -20 h (x)= +20 h (x)= ✓ ✓Tx ✓ ✓Tx ✓ ✓Tx +20 1+e−-20 1+e−+20 1+e− not(XOR) -10 II I -30 +20 in I +20 +20 1 h (x)= +10 ✓ ✓Tx +20 1+e− in III I or III III IV -20 -20 Based on example by Andrew Ng 23 Layering Representations x1 ... x20 x21 ... x40 x41 ... x60 ... x381 ... x400 20 × 20 pixel images d = 400 10 classes Each image is “unrolled” into a vector x of pixel intensities 24 Layering Representations x1 x2 “0” x3 “1” x4 x5 “9” Output Layer Hidden Layer xd Input Layer Visualization of Hidden Layer 25 LeNet 5 Demonstration: http://yann.lecun.com/exdb/lenet/ 26 Neural Network Learning 27 Perceptron Learning Rule ✓ ✓ + ↵(y h(x))x − Equivalent to the intuitive rules: – If output is correct, don’t change the weights – If output is low (h(x) = 0, y = 1), increment weights for all the inputs which are 1 – If output is high (h(x) = 1, y = 0), decrement weights for all inputs which are 1 Perceptron Convergence Theorem: • If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge [Minicksy & Papert, 1969] 29 Batch Perceptron (i) (i) n 1.) Given training data (x ,y ) i=1 2.) Let ✓ [0, 0,...,0] 2.) Repeat: 2.) Let ∆ [0, 0,...,0] 3.) for i =1 ...n,do 4.) if y(i)x(i)✓ 0 // prediction for ith instance is incorrect 5.) ∆ ∆+ y(i)x(i) 6.) ∆ ∆/n // compute average update 6.) ✓ ✓ + ↵∆ 8.) Until ∆ <✏ k k2 • Simplest case: α = 1 and don’t normalize, yields the fixed increment perceptron • Each increment of outer loop is called an epoch Based on slide by Alan Fern 30 Learning in NN: Backpropagation • Similar to the perceptron learning algorithm, we cycle through our examples – If the output of the network is correct, no changes are made – If there is an error, weights are adjusted to reduce the error • The trick is to assess the blame for the error and divide it among the contributing weights Based on slide by T.

Load more