Multilayer Perceptron
Advanced Introduction to Machine Learning, CMU-10715
Perceptron, Multilayer Perceptron
Barnabás Póczos, Sept 17

Contents
• History of Artificial Neural Networks
• Definitions: Perceptron, MLP
• Representation questions
• Perceptron algorithm
• Backpropagation algorithm

Short History
Progression (1943-1960)
• First mathematical model of neurons: McCulloch & Pitts (1943)
• Beginning of artificial neural networks
• Perceptron, Rosenblatt (1958)
  – A single-layer neuron for classification
  – Perceptron learning rule
  – Perceptron convergence theorem
Degression (1960-1980)
• The perceptron cannot even learn the XOR function
• It was not known how to train an MLP
• 1969: backpropagation proposed, but it received little attention

Short History
Progression (1980-)
• 1986: backpropagation reinvented
  – Rumelhart, Hinton, Williams: Learning representations by back-propagating errors. Nature, 323, 533-536, 1986
• Successful applications
  – Character recognition, autonomous cars, ...
• Open questions: overfitting? network structure? number of neurons? number of layers? bad local minima? when to stop training?
• Hopfield nets (1982), Boltzmann machines, ...

Short History
Degression (1993-)
• SVM: Vapnik and his co-workers developed the Support Vector Machine (1993). It is a shallow architecture.
• The SVM almost killed ANN research.
• Training deeper networks consistently yielded poor results.
• Exception: deep convolutional neural networks, Yann LeCun, 1998 (a discriminative model).

Short History
Progression (2006-)
Deep Belief Networks (DBN)
• Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.
• Generative graphical model
• Based on restricted Boltzmann machines
• Can be trained efficiently
Deep autoencoder based networks
• Bengio, Y., Lamblin, P., Popovici, P., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19.

The Neuron
– Each neuron has a body, an axon, and many dendrites.
– A neuron can either fire or rest.
– If the sum of the weighted inputs is larger than a threshold, the neuron fires.
– Synapses: the gaps between the axon and other neurons' dendrites. They determine the weights in the sum.

The Mathematical Model of a Neuron
The neuron computes a weighted sum of its inputs plus a bias and passes it through an activation function g: y = g(w^T x + b).

Typical activation functions
• Identity function: g(z) = z
• Threshold function (perceptron): g(z) = +1 if z ≥ 0, −1 otherwise
• Ramp function: linear between two thresholds, constant outside them
• Logistic function: g(z) = 1 / (1 + e^{−z})
• Hyperbolic tangent function: g(z) = tanh(z)

Structure of Neural Networks

Fully Connected Neural Network
Input neurons, hidden neurons, output neurons

Layers
Convention: the input layer is Layer 0.

Feedforward neural networks
• Connections run only from Layer i to Layer i+1
• = Multilayer perceptron (MLP)
• The most popular architecture
• Recurrent NN: there are backward connections as well

What functions can multilayer perceptrons represent?

Perceptrons cannot represent the XOR function:
f(0,0)=0, f(1,1)=0, f(0,1)=1, f(1,0)=1
No single hyperplane separates the two classes. (The code sketch below illustrates this, and shows that a two-layer perceptron can represent XOR.)

What functions can multilayer perceptrons represent?

Hilbert's 13th Problem
"Solve the 7th-degree equation using continuous functions of two parameters."
Related conjecture: let f(x,y,z) be the function of three arguments defined by the root of the 7th-degree equation t^7 + x·t^3 + y·t^2 + z·t + 1 = 0. Prove that f cannot be rewritten as a composition of finitely many functions of two arguments.
Another rewritten form: prove that there is a nonlinear continuous function of three variables that cannot be decomposed into finitely many functions of two variables.
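To make the neuron model and the XOR limitation concrete, here is a minimal NumPy sketch (not from the slides; all function and variable names are mine). It implements the typical activation functions listed above, searches a coarse weight grid to show that no single threshold neuron fits the XOR points, and then represents XOR exactly with a hand-set two-layer perceptron of threshold units.

```python
import numpy as np

# Typical activation functions from the slides
def identity(z):  return z
def threshold(z): return np.where(z >= 0, 1.0, -1.0)   # perceptron / sign unit
def ramp(z):      return np.clip(z, 0.0, 1.0)
def logistic(z):  return 1.0 / (1.0 + np.exp(-z))
def tanh_act(z):  return np.tanh(z)

# XOR with +1/-1 labels: f(0,0)=-1, f(0,1)=+1, f(1,0)=+1, f(1,1)=-1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)

def single_neuron(x, w, b):
    """One threshold neuron: sign(w^T x + b)."""
    return threshold(x @ w + b)

# Brute-force search over a coarse weight grid: no single neuron fits XOR,
# because the two classes are not linearly separable.
grid = np.linspace(-2, 2, 9)
found = any(
    np.all(single_neuron(X, np.array([w1, w2]), b) == y)
    for w1 in grid for w2 in grid for b in grid
)
print("single threshold neuron fits XOR on this grid:", found)  # False

# A two-layer perceptron with hand-set weights does represent XOR:
# hidden unit 1 fires on (x1 OR x2), hidden unit 2 fires on NOT(x1 AND x2),
# and the output unit fires only when both hidden units fire.
W1 = np.array([[ 1.0,  1.0],   # weights into hidden unit 1
               [-1.0, -1.0]])  # weights into hidden unit 2
b1 = np.array([-0.5, 1.5])
w2 = np.array([1.0, 1.0])
b2 = -1.5

def mlp_xor(x):
    h = threshold(x @ W1.T + b1)   # hidden layer of threshold units
    return threshold(h @ w2 + b2)  # output threshold unit

print("two-layer MLP output:", mlp_xor(X))  # matches y: [-1, 1, 1, -1]
```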
Function decompositions
Example: f(x,y,z) = Φ1(ψ1(x), ψ2(y)) + Φ2(c1·ψ3(y) + c2·ψ4(z), x)
[Figure: network diagram of this decomposition — x, y, z feed the inner functions ψ1, ..., ψ4; Φ1 and Φ2 combine them, and their outputs are summed to give f(x,y,z).]

Function decompositions
1957: Arnold disproves Hilbert's conjecture. (Kolmogorov–Arnold representation theorem: every continuous function of N variables on [0,1]^N can be written as f(x_1,...,x_N) = Σ_{q=0}^{2N} Φ_q( Σ_{p=1}^{N} ψ_{q,p}(x_p) ) with continuous one-variable functions Φ_q and ψ_{q,p}.)

Function decompositions
Corollary: every continuous function of N variables can be represented exactly by a network of this form.
Issues: the statement is not constructive. For a given N we do not know the inner functions ψ_{q,p}, and for a given N and f we do not know the outer functions Φ_q.

Universal Approximators
Kurt Hornik, Maxwell Stinchcombe and Halbert White: "Multilayer feedforward networks are universal approximators", Neural Networks, Vol. 2(3), 359-366, 1989.
Definition: Σ^N(g), the class of neural networks with one hidden layer:
Σ^N(g) = { f : R^N → R | f(x_1,...,x_N) = Σ_{i=1}^{M} c_i · g(a_i^T x + b_i) }
Theorem: if δ > 0, g is an arbitrary sigmoid function, and f is continuous on a compact set A, then there exists f̂ ∈ Σ^N(g) such that |f(x) − f̂(x)| < δ for all x ∈ A.

Universal Approximators
Theorem (Blum & Li, 1991): SgnNet2(x,w), a network with two hidden layers and sgn activation function, is uniformly dense in L2.
Definition:
sgnNet^(2)(x, w) = Σ_i w_i^(3) · sgn( Σ_j w_ij^(2) · sgn( Σ_l w_jl^(1) · x_l ) )
Formal statement: if f ∈ L2, i.e. ∫_X f^2(x) dx < ∞, and ε > 0, then there exists w such that
∫_X ( f(x) − Σ_i w_i^(3) sgn( Σ_j w_ij^(2) sgn( Σ_l w_jl^(1) x_l ) ) )^2 dx_1 ... dx_n < ε.

Proof
• Integral approximation in one and two dimensions: partition the domain into small cells X_i and approximate f by a piecewise-constant function, f(x) ≈ Σ_i f(x_i) · 1_{X_i}(x).
• The indicator function of a polygon X_i can be learned by this neural network:
  sgn( Σ_i a_i · sgn( Σ_j b_ij · x_j ) ) = +1 if x is in X_i, −1 otherwise.
• A weighted linear combination of these indicator functions is then a good approximation of the original function f.

LEARNING: The Perceptron Algorithm

The Perceptron
A single neuron with a threshold (sign) activation; its output is +1 or −1.

The training set
Training data: {(x_1, y_1), ..., (x_n, y_n)} with labels y_i ∈ {−1, +1}.

The perceptron algorithm
Cycle through the training samples; whenever a sample (x_i, y_i) is misclassified, i.e. y_i · (w^T x_i) ≤ 0, update w ← w + y_i · x_i. (A minimal code sketch is given at the end of these notes.)

Perceptron convergence theorem
Theorem: if the training samples are linearly separable, then the algorithm finds a separating hyperplane in finitely many steps. The upper bound on the number of steps is "independent" of the number of training samples, but depends on the dimension.
Proof: [homework]

Multilayer Perceptron
The gradient of the error
Notation
Some observations
The backpropagated error (lemma)
The backpropagation algorithm
These sections derive backpropagation: write the training error as a function of all the weights, compute its gradient layer by layer with the chain rule, propagate the output-layer error backwards through the network, and update the weights by gradient descent. (A minimal code sketch of the resulting algorithm is given below.)
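The following is a minimal NumPy sketch of the perceptron learning rule referred to above (the function and variable names are mine, not the slides' notation): cycle through the data and, whenever a sample is misclassified, add y_i·x_i to the weight vector. By the convergence theorem, the outer loop terminates when the data are linearly separable.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Perceptron learning rule for labels y in {-1, +1}.

    X: (n, d) data matrix; a constant 1 is appended so the bias is
    learned as the last weight. Returns the weight vector w of length d+1.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb the bias into w
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi         # perceptron update: w <- w + y_i x_i
                mistakes += 1
        if mistakes == 0:            # converged: every sample on the correct side
            return w
    return w                         # reached max_epochs; data may not be separable

# Usage: a linearly separable toy problem (logical OR with +1/-1 labels)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)
w = perceptron_train(X, y)
print("weights:", w, "predictions:", np.sign(X @ w[:2] + w[2]))
```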
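As a stand-in for the backpropagation derivation outlined above, here is a minimal sketch of the resulting algorithm for a one-hidden-layer MLP. It assumes logistic hidden units, a linear output, and squared error; the slides' own notation and exact setup may differ, and all names here are mine. It shows the two phases: a forward pass that stores the activations, and a backward pass that propagates the output error back through the hidden layer to obtain the weight gradients.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass; keeps intermediate activations for the backward pass."""
    a1 = W1 @ x + b1       # hidden pre-activation
    h = logistic(a1)       # hidden activation
    yhat = W2 @ h + b2     # linear output
    return a1, h, yhat

def backprop_step(x, y, W1, b1, W2, b2, lr=0.5):
    """One stochastic gradient step on the squared error 0.5*||yhat - y||^2."""
    a1, h, yhat = forward(x, W1, b1, W2, b2)
    delta2 = yhat - y                        # error at the output layer
    delta1 = (W2.T @ delta2) * h * (1 - h)   # backpropagated error at the hidden layer
    W2 -= lr * np.outer(delta2, h)           # gradient of the loss w.r.t. W2
    b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x)           # gradient of the loss w.r.t. W1
    b1 -= lr * delta1

# Usage: learn XOR (0/1 targets) with 4 hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
for epoch in range(5000):
    for x, y in zip(X, Y):
        backprop_step(x, y, W1, b1, W2, b2)
# Typically close to 0, 1, 1, 0 (gradient descent can occasionally stall
# in a bad local minimum, one of the open questions noted in the history section)
print([round(float(forward(x, W1, b1, W2, b2)[2][0]), 3) for x in X])
```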