Connectionist Learning an Artificial Neuron
Connectionist Learning
• Characteristics
  – not based on a symbol system
  – well suited for parallel distributed processing
  – due to its distributed nature, system degradation is “graceful”
  – despite the lack of symbols, representation of inputs and outputs is still critical
  – connectionist networks are trained rather than programmed
• Tasks well suited for connectionism
• Our coverage will stress the historical foundations, the back propagation model, and competitive networks

An Artificial Neuron
[figure not recovered]

McCulloch-Pitts Neuron
• Inputs and outputs are integers; inputs are variables, such as x and y below, or constants
• Here is the value table for the relation “and” [table not recovered]
• Although these logical relations allow development of a model of computation, it took the development of a learning algorithm with perceptrons to inspire new research interest in the 1950s and 1960s

A Perceptron
• Inputs and activation levels are +1 or -1
• There is a “hard-limited” threshold
• The learning is very simple: the adjustment for the weight of the ith component, where c is a learning constant and d the correct response, is $\Delta w_i = c\,\bigl(d - \mathrm{sign}(\sum_j x_j w_j)\bigr)\,x_i$
• The adjustments are 0 when the output is correct, $+2c\,x_i$ when the desired output is +1 and the actual output is -1, and $-2c\,x_i$ in the opposite case

Limitations
• Perceptrons can only solve problems that are linearly separable; the easiest example of a problem that is not linearly separable is exclusive-or
• It is impossible to draw a single straight line to separate the true from the false values
• Here would be the required weight assignments, but there is no solution [equations not recovered]
• Work came to a stop in the 1970s until backpropagation networks were designed

A General Classifier
• A full classification system [figure not recovered]
• Each data grouping is a region in multidimensional space
• Each region Ri has a discriminant function gi measuring membership in the region
• Within region Ri we have $g_i(x) > g_j(x)$ for every other region Rj
• Adjacent regions are separated by a border where $g_i(x) - g_j(x) = 0$

A Specific Example
• A two dimensional problem with two groupings
• A single line forms the boundary between regions

Training the Network - 1
• The perceptron computes $f(net) = f(w_1 x_1 + w_2 x_2 + w_3 \cdot 1)$ where f(x) is the sign of x
• Expected outputs are +1 and -1
• The weights are first assigned random values, we assume [0.75, 0.5, -0.6]; we put in the first data point to get [equation not recovered]
• This is the correct response, so the weights are not changed; plugging in the second data point we have [equation not recovered]
• This is the wrong response, so we apply the learning rule

Training the Network - 2
• Since this is a hard-limited, bipolar perceptron, the learning increment is either +2c or -2c; we will let c = 0.2
• The third data point with the newly adjusted weights gives [equation not recovered]
• This is not the desired output, so we adjust the weights again
• The data is separated after 10 training values; repeating the training on the same set, the values converge to [-1.3, -1.1, 10.9] and the separator is -1.3x1 - 1.1x2 + 10.9 = 0

Threshold Functions
• Here are three examples [figure not recovered]
• A commonly used sigmoidal function called the logistic function is [equation not recovered]
• Adjusting delta varies the slope of the sigmoidal curve in the area of transition

The Delta Rule - 1
• The error surface represents the cumulative error over a data set as a function of network weights
• Each possible weight configuration is a point on the surface
• We use gradient descent learning where the derivative gives the gradient to tell us the direction which most rapidly reduces error
• Using the logistic function, the weight adjustment of the jth input on the ith node is $\Delta w_{ij} = c\,(d_i - O_i)\,O_i\,(1 - O_i)\,x_j$
• c controls the learning rate, di and Oi are the desired and actual outputs; derivation of this formula is in the textbook

The Delta Rule - 2
• The delta rule is hill climbing that uses the local derivative to minimize the local error
• It is subject to getting stuck in a local minimum
• The value c adjusts the learning rate
  – if too small, the training time may be too long
  – if too large, the solution may oscillate around each side of the minimum
  – reasonably small values are less error prone
• Generalized to a multilayer network, the delta rule becomes the foundation of the back propagation algorithm

Back propagation - 1
• There is a hidden layer that passes forward the activations caused by the inputs
• Errors are determined at the output and propagated backwards, adjusting weights
• The logistic function is used because
  – it is a sigmoidal function
  – it is continuous and differentiable everywhere
  – the derivative is greatest when the function is changing most rapidly
  – the derivative is easy to compute (only multiplication and subtraction)

Back propagation - 2
• Adjusting the kth weight at the ith node [equation not recovered]
• A complete derivation is given in the text; the diagram below illustrates how the adjustments at the hidden layer are calculated [figure not recovered]

Example - NETtalk
• English pronunciation is highly irregular so learning is very difficult
• NETtalk inputs a string and for each character returns a phoneme and stress
• The input layer has seven characters each with 29 possible values; 21 outputs are for phonemes, five others are for stress and syllable boundaries; there are 80 hidden units and 18,629 connections

Performance of NETtalk
• Learning proceeded quickly at the start then progressed more slowly
• The network was robust; random changes to weights only degraded the network slightly
• The reduction in the number of hidden units compared to input units shows that knowledge is being encoded
• Training was laborious with up to 100 passes through the training data
• Performance was similar to a symbolic-based system, ID3: with a training set of 500 examples 60% were pronounced correctly
• ID3 only required one pass; NETtalk required multiple passes

Exclusive-or
• This network is small: two input nodes, one hidden node and one output node
• Although there are only four values in the data set, the network had to repeat this data 1400 times (!) to get the weights

Kohonen Network - 1
• In this single layer network the input value for the vector X causes one and only one output node to fire
• The weights of the winner are adjusted to be closer to the input vector
• Learning is unsupervised; the learning constant c is relatively small and gets smaller as learning progresses
• The vector with the smallest Euclidean distance from the input vector is the one to fire since this matches the node with the largest activation value WX

Kohonen Network - 2
• The Kohonen network is sometimes called a winner-take-all network
• It is interesting to study because
  – it is a classification system
  – it is the first stage of counterpropagation networks
• Learning is unsupervised; this means the system discovers data clusterings
• Prototypes are introduced into the system with random initial values
• These prototypes migrate to the data clusters
• The number of prototypes and weights are used to build the network

Using the Perceptron Data
• The initial candidates are A:(7,2) and B:(2,9); these migrate towards the data clusters
• On the first round A is the winner

More Steps in the Calculation
• Weights for A are adjusted by the following amounts, assuming c is 0.5 [equation not recovered]
• The next data point is (9.4, 6.4); A is again the winner and has its weights adjusted
• The third data point is (2.5, 2.1) and A is the winner again
• After all ten data points the two prototypes have migrated to the two data clusters

Grossberg Learning
• The basic idea is:
  – use a trained Kohonen net as the first layer to classify the data
  – add a supervision layer with counterpropagation to help train the network
  – each of the middle layer nodes connects to the output nodes in an outstar formation
• Training results in averaging the weights of the outputs from the outstar: $W^{t+1} = W^t + c\,(Y - W^t)$
• c is a small learning constant, Wt the weight of the outstar, and Y the desired output vector

Training the network
• x1 is the engine speed, x2 the engine temperature, and the two clusters are A and B
• We want to find safe and dangerous states
• Training of the outstar A (B is similar) with c set to 0.2 and weights [0, 0] [equation not recovered]
• The weights are moving toward [1, 0], the final output

Interpretation
• From a cognitive perspective
  – Kohonen learning is like acquiring a conditioned stimulus from inputs
  – the next layer is an association of an unconditioned stimulus to some response
• Another perspective
  – counterpropagation is reinforcement for memory links
  – this is like developing a lookup table for responses to data patterns
• Grossberg networks can perform analysis of linearly separable data
• By training the Kohonen net first, the learning is much faster than backpropagation

Hebbian Network
• In 1949 Hebb proposed the following rule about neurons [quotation not recovered]
• Suppose i’s output feeds the input of j
  – if both are positive or both are negative, the connection is strengthened
  – if they are opposite signs, the connection is weakened
• The equation for weight adjustment is $\Delta W = c \cdot f(X, W) \cdot X$, where c is the learning constant, f(X,W) is i’s output, and X is the input vector

Stimulus-Response
• Hebbian learning is based on unconditioned stimulus, conditioned stimulus and response
  – When a dog receives food its unconditioned response is to salivate
  – Pavlov rang a bell when the dog was fed, thus conditioning the dog to associate the bell with food; just the bell alone caused salivation
• Hebbian learning can be unsupervised or supervised; it is sometimes called coincidence learning

Training the Network
• Assume the input [1,-1,1,-1,1,-1] where the first three values are the unconditioned stimulus and the last three are conditioned
• The initial weight vector is [1,-1,1,0,0,0] since the network is untrained
• Calculating the new weights [equation not recovered]
• We repeat the process

Testing the network
• The unconditioned response still works [equation not recovered]
• Another unconditioned response [equation not recovered]
• The unconditioned vector is [1,1,1] to see if the conditioned response works
• A one
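The Hebbian training step from the "Training the Network" slide can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own calculation: the input and initial weight vectors come from the text, while the learning constant c = 0.2, the use of the sign function as f(X, W), and the number of repetitions are assumptions made here.

```python
def sign(net):
    """Bipolar hard-limited threshold: +1 for a positive net input, otherwise -1."""
    return 1 if net > 0 else -1


def hebbian_step(weights, x, c):
    """One unsupervised Hebbian update: W + c * f(X, W) * X, with f = sign(W . X)."""
    output = sign(sum(w * xi for w, xi in zip(weights, x)))
    return [w + c * output * xi for w, xi in zip(weights, x)]


x = [1, -1, 1, -1, 1, -1]   # from the text: unconditioned stimulus, then conditioned stimulus
w0 = [1, -1, 1, 0, 0, 0]    # from the text: untrained weight vector
C = 0.2                     # assumption: the text gives no value of c for this example

w1 = hebbian_step(w0, x, C)
print([round(w, 2) for w in w1])   # [1.2, -1.2, 1.2, -0.2, 0.2, -0.2]

w_repeated = w0
for _ in range(5):          # assumption: 5 repetitions, chosen only for illustration
    w_repeated = hebbian_step(w_repeated, x, C)
print([round(w, 2) for w in w_repeated])
```

Each repetition pushes the last three weights further from zero, which is how the conditioned stimulus comes to contribute to the response.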
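The perceptron walkthrough in "Training the Network - 1" and "Training the Network - 2" can be replayed with the sketch below. The initial weights [0.75, 0.5, -0.6] and c = 0.2 come from the text. The second and third data points are taken from the "Using the Perceptron Data" slides, and their desired outputs (-1 and +1) are inferred from the text's statements that both responses were wrong. The constant bias input of 1 for the third weight is an assumption, chosen to match the separator -1.3x1 - 1.1x2 + 10.9 = 0. The first data point is skipped because its values do not appear in the text and it leaves the weights unchanged.

```python
def sign(net):
    """Bipolar hard-limited threshold: +1 for a positive net input, otherwise -1."""
    return 1 if net > 0 else -1


def perceptron_step(weights, x, d, c):
    """Apply w_i += c * (d - output) * x_i once; return (new_weights, output)."""
    inputs = list(x) + [1.0]   # assumed constant bias input for the third weight
    output = sign(sum(w * xi for w, xi in zip(weights, inputs)))
    new_weights = [w + c * (d - output) * xi for w, xi in zip(weights, inputs)]
    return new_weights, output


C = 0.2
w_start = [0.75, 0.5, -0.6]

# Second data point: the perceptron answers +1, the desired output is -1.
w_after_2, out_2 = perceptron_step(w_start, (9.4, 6.4), -1, C)

# Third data point: the perceptron answers -1, the desired output is +1.
w_after_3, out_3 = perceptron_step(w_after_2, (2.5, 2.1), +1, C)

print(out_2, [round(w, 2) for w in w_after_2])
print(out_3, [round(w, 2) for w in w_after_3])
```

Both corrections have magnitude 2c = 0.4 per unit of input, matching the "+2c or -2c" increment described in the slides.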
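The winner-take-all update from the Kohonen slides can be sketched as follows. The prototypes A:(7,2) and B:(2,9), the constant c = 0.5, and the second and third data points come from the text. The first data point (1, 1) is an assumption: its values do not appear in the text, and it is used here only because it is consistent with A winning the first round. The update W + c (X - W) is the standard Kohonen form, also assumed rather than taken from the slides.

```python
def kohonen_step(prototypes, x, c):
    """Move the prototype nearest to x a fraction c of the way toward x."""
    def squared_distance(p):
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x))

    winner = min(prototypes, key=lambda name: squared_distance(prototypes[name]))
    updated = dict(prototypes)
    updated[winner] = tuple(p + c * (xi - p) for p, xi in zip(prototypes[winner], x))
    return winner, updated


C = 0.5
prototypes = {"A": (7.0, 2.0), "B": (2.0, 9.0)}
data = [
    (1.0, 1.0),   # assumed first data point (not given in the text)
    (9.4, 6.4),   # second data point, from the text
    (2.5, 2.1),   # third data point, from the text
]

winners = []
for point in data:
    winner, prototypes = kohonen_step(prototypes, point, C)
    winners.append(winner)

print(winners)
print({name: tuple(round(v, 3) for v in p) for name, p in prototypes.items()})
```

Under these assumptions A wins all three rounds, as the slides describe, and B stays at its initial position until a data point falls closer to it.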
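The outstar training described in "Grossberg Learning" and "Training the network" can be illustrated with a short sketch. The values c = 0.2, the initial weights [0, 0], and the target [1, 0] come from the text. The update rule W + c (Y - W) is the standard Grossberg form, and the number of steps is an arbitrary choice for illustration.

```python
def outstar_step(weights, y, c):
    """One Grossberg outstar update: move the weights a fraction c toward the target y."""
    return [w + c * (yi - w) for w, yi in zip(weights, y)]


C = 0.2
Y = [1.0, 0.0]          # desired output for cluster A, from the text
w = [0.0, 0.0]          # initial outstar weights, from the text

history = []
for _ in range(3):      # assumption: three steps, chosen only for illustration
    w = outstar_step(w, Y, C)
    history.append(w)

for step, weights in enumerate(history, start=1):
    print(step, [round(v, 3) for v in weights])
```

Each step closes one fifth of the remaining gap to the target, so the first weight approaches 1 geometrically while the second stays at 0.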