
Advanced Introduction to Machine Learning, CMU-10715

Barnabás Póczos, Sept 17

Contents

• History of Artificial Neural Networks
• Definitions: Perceptron, MLP
• Representation questions
• Perceptron algorithm
• Backpropagation algorithm

2 Short History

Progression (1943-1960)
• First mathematical model of neurons: Pitts & McCulloch (1943)
• Beginning of artificial neural networks
• Perceptron, Rosenblatt (1958)
  - A single neuron for classification
  - Perceptron learning rule
  - Perceptron convergence theorem

Degression (1960-1980)
• The perceptron cannot even learn the XOR function
• Nobody knew how to train an MLP
• 1969: Backpropagation… but not much attention

3 Short History

Progression (1980-)
• 1986: Backpropagation reinvented
  - Rumelhart, Hinton, Williams: Learning representations by back-propagating errors. Nature, 323, 533-536, 1986
• Successful applications: character recognition, autonomous cars, …

• Open questions: network structure? number of neurons? number of layers? bad local minima? when to stop training?

• Hopfield nets (1982), Boltzmann machines,…

4 Short History

Degression (1993-)
• SVM: Vapnik and his co-workers developed the Support Vector Machine (1993). It is a shallow architecture.
• SVMs almost killed ANN research.
• Training deeper networks consistently yielded poor results.
• Exception: deep convolutional neural networks, Yann LeCun, 1998 (a discriminative model).

5 Short History

Progression (2006-)

Deep Belief Networks (DBN)
• Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554.
• Generative model
• Based on restricted Boltzmann machines
• Can be trained efficiently

Deep networks built by greedy layer-wise training

• Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. Advances in Neural Information Processing Systems 19.

6 The Neuron

– Each neuron has a body, an axon, and many dendrites
– A neuron can either fire or rest
– If the sum of the weighted inputs is larger than a threshold, then the neuron fires
– Synapses: the gaps between an axon and other neurons' dendrites; they determine the weights in the sum

7 The Mathematical Model of a Neuron
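The figure for this slide is not reproduced in this extract. A standard formulation, consistent with the verbal description on the previous slide (the symbols $x_i$, $w_i$, $b$, and $g$ follow the usual convention and are not taken from the slide itself):

$$y = g\!\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

where $x_1, \dots, x_n$ are the inputs, $w_i$ the synaptic weights, $b$ the bias (the negative of the firing threshold), and $g$ the activation function.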

8 Typical activation functions
• Identity function
• Threshold function (perceptron)
• Ramp function

9 Typical activation functions
• Logistic (sigmoid) function
• Hyperbolic tangent function
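A minimal sketch of these activations in NumPy; the function names and the ramp's saturation points at ±1 are illustrative choices, not taken from the slides:

import numpy as np

def identity(x):
    # Identity activation: output equals the weighted-sum input
    return x

def threshold(x):
    # Perceptron-style threshold: +1 if the input is non-negative, -1 otherwise
    return np.where(x >= 0.0, 1.0, -1.0)

def ramp(x):
    # Linear in the middle, saturating at -1 and +1 (saturation points assumed)
    return np.clip(x, -1.0, 1.0)

def logistic(x):
    # Logistic (sigmoid) activation, values in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent activation, values in (-1, 1)
    return np.tanh(x)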

10 Structure of Neural Networks

11 Fully Connected Neural Network

Input neurons, Hidden neurons, Output neurons

12 Layers

Convention: The input layer is Layer 0.

13 Feedforward neural networks
• Connections only between Layer i and Layer i+1
• = multilayer perceptron (MLP)
• The most popular architecture
• In contrast, a recurrent NN also has connections going backwards
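A minimal forward-pass sketch for such a feedforward network, assuming tanh hidden units and a linear output layer (layer sizes and activation choices are illustrative, not specified on the slide):

import numpy as np

def forward(x, weights, biases):
    # Feedforward pass: Layer i feeds only Layer i+1
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)           # hidden layers
    return weights[-1] @ a + biases[-1]  # linear output layer

# Example: 3 input neurons, one hidden layer with 4 neurons, 2 output neurons
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(forward(rng.normal(size=3), weights, biases))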

14 What functions can a multilayer perceptron represent?

15 Perceptrons cannot represent the XOR function

f(0,0)=0, f(1,1)=0, f(0,1)=1, f(1,0)=1

No single hyperplane (line) separates {(0,1), (1,0)} from {(0,0), (1,1)}, so no single perceptron can realize this function.

What functions can multilayer perceptrons represent?

16 Hilbert’s 13th Problem

“Solve the general 7th-degree equation using continuous functions of two parameters.”

Related conjecture: Let f be the function of 3 arguments defined implicitly as a root of the reduced 7th-degree equation, i.e.
$$f(a,b,c)^7 + a\, f(a,b,c)^3 + b\, f(a,b,c)^2 + c\, f(a,b,c) + 1 = 0.$$

Prove that f cannot be written as a composition of finitely many continuous functions of two arguments.

Another formulation: Prove that there is a nonlinear continuous function of three variables that cannot be decomposed into finitely many continuous functions of two variables.

17 Function decompositions

$$f(x,y,z) = \Phi_1\big(\psi_1(x), \psi_2(y)\big) + \Phi_2\big(c_1\psi_3(y) + c_2\psi_4(z),\, x\big)$$

[Diagram: the decomposition drawn as a network; inner functions ψ1(x), ψ2(y), ψ3(y), ψ4(z) feed the outer functions Φ1 and Φ2 (with weights c1, c2), and their outputs are summed to give f(x,y,z).]

18 Function decompositions

1957, Arnold disproves Hilbert’s conjecture.

19 Function decompositions

Corollary (Kolmogorov-Arnold representation theorem): every continuous function $f : [0,1]^N \to \mathbb{R}$ can be written as
$$f(x_1,\dots,x_N) = \sum_{q=0}^{2N} \Phi_q\!\left(\sum_{p=1}^{N} \psi_{q,p}(x_p)\right)$$
with continuous one-variable functions $\Phi_q$ and $\psi_{q,p}$.

Issues: This statement is not constructive. For a given N we do not know the inner functions $\psi_{q,p}$, and for a given N and f we do not know the outer functions $\Phi_q$.

20 Universal Approximators

Kurt Hornik, Maxwell Stinchcombe, and Halbert White: “Multilayer feedforward networks are universal approximators”, Neural Networks, Vol. 2(3), 359-366, 1989

Definition: $\Sigma^N(g)$, the class of neural networks with one hidden layer:
$$\Sigma^N(g) = \left\{ f : \mathbb{R}^N \to \mathbb{R} \;\middle|\; f(x_1,\dots,x_N) = \sum_{i=1}^{M} c_i\, g(a_i^T x + b_i) \right\}$$

Theorem: If $\varepsilon > 0$, $g$ is an arbitrary squashing (sigmoid-type) function, and $f$ is continuous on a compact set $A$, then
$$\exists\, \hat f \in \Sigma^N(g) \text{ such that } |f(x) - \hat f(x)| < \varepsilon \quad \text{for all } x \in A.$$
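To make the theorem concrete, here is a small numerical sketch: a one-hidden-layer network of the form $\sum_i c_i\, g(a_i x + b_i)$ with a logistic squashing function g, random hidden parameters $a_i, b_i$, and output weights $c_i$ fitted by least squares. The target f(x) = sin(3x), the interval, and M = 50 are illustrative choices, not from the slides.

import numpy as np

rng = np.random.default_rng(0)

def g(z):
    # Logistic squashing function
    return 1.0 / (1.0 + np.exp(-z))

# Target function to approximate on the compact set A = [-2, 2]
f = lambda x: np.sin(3 * x)
x = np.linspace(-2, 2, 400)

# One hidden layer with M units: random a_i, b_i; output weights c_i fitted by least squares
M = 50
a = rng.normal(scale=3.0, size=M)
b = rng.normal(scale=3.0, size=M)
H = g(np.outer(x, a) + b)                      # hidden activations, shape (400, M)
c, *_ = np.linalg.lstsq(H, f(x), rcond=None)   # fit the output weights

f_hat = H @ c
print("max |f - f_hat| on the grid:", np.max(np.abs(f(x) - f_hat)))

Increasing M typically drives this error further down, in line with the theorem.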

21 Universal Approximators

Theorem (Blum & Li, 1991): Two-hidden-layer networks $\mathrm{SgnNet}^{(2)}(x, w)$ with sgn activations are dense in $L^2$.

Definition: (2) 3  2  1  sgn Net x, w   wi sgn wij sgn w jl xl  i  j  l  Formal statement:

If f x L2 , i.e.  f 2 xdx  , and   0, then w, such that X 2       f x  w 3 sgn w 2 sgn w 1x  dx dx        i  ij  jl l  1 n  i  j  l  22 Proof

Integral approximation in 1-dim:

Integral approximation in 2-dim:

Partition the domain $X$ into small regions $X_i$ with representative points $x_i$, and approximate $f$ by a step function over this partition:
$$f(x) \approx \sum_i f(x_i)\, \mathbf{1}_{X_i}(x).$$
Since $f \in L^2$, the partition can be chosen fine enough that the $L^2$ error of this approximation is smaller than any given $\varepsilon > 0$.

23 Proof

The indicator function of the polygon $X_i$ can be computed by this neural network:
$$\mathrm{sgn}\!\left(a_i + \sum_j \mathrm{sgn}\!\left(b_{ij}^T x\right)\right) = \begin{cases} 1 & \text{if } x \in X_i, \\ -1 & \text{otherwise.} \end{cases}$$
The weighted linear combination of these indicator functions is then a good approximation of the original function f.
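A small numerical illustration of this construction; the triangle, its three half-planes, the constant-1 input that carries the half-plane offsets, and the bias a = -(J-1) for J = 3 half-planes are my own illustrative choices, not the slides':

import numpy as np

sgn = lambda z: np.where(z >= 0, 1.0, -1.0)

# Triangle {x >= 0, y >= 0, x + y <= 1} as three half-planes b_j . (x, y, 1) >= 0
# (the constant 1 input carries the half-plane offsets)
B = np.array([[ 1.0,  0.0, 0.0],
              [ 0.0,  1.0, 0.0],
              [-1.0, -1.0, 1.0]])

def indicator(x, y, a=-2.0):          # a = -(J - 1) with J = 3 half-planes
    """+1 inside the triangle, -1 outside: sgn(a + sum_j sgn(b_j . (x, y, 1)))."""
    h = sgn(B @ np.array([x, y, 1.0]))   # first sgn layer: one unit per half-plane
    return sgn(a + h.sum())              # second sgn layer: 'AND' of the half-planes

print(indicator(0.2, 0.2))   # inside  -> 1.0
print(indicator(0.8, 0.8))   # outside -> -1.0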

24 Learning: The Perceptron Algorithm

25 The Perceptron

[Figure: a single perceptron; it outputs +1 if the weighted sum of its inputs exceeds the threshold, and -1 otherwise.]

26 The training set

27 The perceptron algorithm

28 The perceptron algorithm
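The formulas on these slides are not reproduced in this extract. Below is a minimal sketch of the standard perceptron learning rule for labels in {-1, +1}; the toy data, the bias handled via an appended constant-1 feature, and the epoch cap are illustrative assumptions:

import numpy as np

def perceptron(X, y, max_epochs=100):
    """Standard perceptron rule: on a mistake, add y_i * x_i to the weights.
    X: (n_samples, n_features) with a constant-1 column for the bias; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w += yi * xi             # perceptron update
                mistakes += 1
        if mistakes == 0:                # converged: all samples correctly classified
            break
    return w

# Toy linearly separable data: label is the sign of x1 + x2 - 0.2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] - 0.2 > 0, 1.0, -1.0)
Xb = np.hstack([X, np.ones((50, 1))])    # append constant-1 feature for the bias
w = perceptron(Xb, y)
print("learned weights:", w)
print("training accuracy:", np.mean(np.sign(Xb @ w) == y))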

29 Perceptron convergence theorem

Theorem:

If the training samples are linearly separable, then the algorithm finds a separating hyperplane in finitely many steps.

The upper bound on the number of steps is “independent” of the number of training samples, but it depends on the dimension.

Proof: [homework]

30 Multilayer Perceptron

31 The gradient of the error

32 Notation

33 Some observations

34 The backpropagated error

35 The backpropagated error

Lemma

36 The backpropagated error

Therefore,

37 The backpropagation algorithm
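The derivations on the preceding slides are not reproduced in this extract. As a stand-in, here is a minimal sketch of backpropagation for a network with one sigmoid hidden layer, a linear output, and squared error; the architecture, loss, learning rate, and XOR demo are my own illustrative choices, not necessarily the slides' notation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, t, W1, b1, W2, b2, lr=0.5):
    """One gradient step of backpropagation on the squared error 0.5 * ||y - t||^2."""
    # Forward pass
    z1 = W1 @ x + b1
    h = sigmoid(z1)           # hidden activations
    y = W2 @ h + b2           # linear output

    # Backward pass: propagate the error from the output layer to the hidden layer
    delta2 = y - t                              # dE/dy for squared error
    delta1 = (W2.T @ delta2) * h * (1.0 - h)    # backpropagated error through the sigmoid

    # Gradient descent updates (arrays modified in place)
    W2 -= lr * np.outer(delta2, h);  b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return 0.5 * np.sum((y - t) ** 2)

# Tiny demo: learn XOR with 8 hidden units
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 2)); b1 = np.zeros(8)
W2 = rng.normal(size=(1, 8)); b2 = np.zeros(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
for epoch in range(5000):
    for x, t in zip(X, T):
        train_step(x, t, W1, b1, W2, b2)
# Outputs should be close to [0, 1, 1, 0]
print([float((W2 @ sigmoid(W1 @ x + b1) + b2)[0]) for x in X])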
