Connectionist Learning: An Artificial Neuron

Connectionist Learning

• Characteristics
  – not based on a symbol system
  – well suited for parallel distributed processing
  – due to its distributed nature, system degradation is "graceful"
  – despite the lack of symbols, representation of inputs and outputs is still critical
  – connectionist networks are trained rather than programmed
• Tasks well suited for connectionism
• Our coverage will stress the historical foundations, the back propagation model, and competitive networks

An Artificial Neuron

• [slide diagram: inputs, a weight on each input, a summed activation, and a threshold function that produces the output]

McCulloch-Pitts Neuron

• Inputs and outputs are integers; inputs are variables, such as x and y below, or constants
• Here is the value table for the relation "and"
• Although these logical relations allow development of a model of computation, it took the development of a learning algorithm for perceptrons to inspire new research interest in the 1950s and 1960s

A Perceptron

• Inputs and activation levels are +1 or -1
• There is a "hard-limited" threshold
• The learning is very simple: the adjustment for the weight of the ith component, where c is a learning constant and d is the correct response, is Δwi = c(d - sign(ΣWX)) xi
• The adjustments are therefore 0 when the response is correct, and ±2c times the input otherwise

Limitations

• Perceptrons can only solve problems that are linearly separable; the easiest example of a problem that is not linearly separable is exclusive-or
• It is impossible to draw a single straight line to separate the true from the false values
• Here would be the required weight assignments, but there is no solution
• Work came to a stop in the 1970s until backpropagation networks were designed

A General Classifier

• A full classification system
• Each data grouping is a region in multidimensional space
• Each region Ri has a discriminant function gi measuring membership in the region
• Within region Ri we have gi(x) > gj(x) for all j ≠ i
• Adjacent regions are separated by a border where gi(x) = gj(x)

A Specific Example

• A two dimensional problem with two groupings
• Expected outputs are +1 and -1
• A single line forms the boundary between regions

Training the Network - 1

• The perceptron computes f(ΣWX), where f(x) is the sign of x
• The weights are first assigned random values; we assume [0.75, 0.5, -0.6]; we put in the first data point to get the network's output
• This is the correct response, so the weights are not changed; plugging in the second data point we compute the output again
• This is the wrong response, so we apply the learning rule

Training the Network - 2

• Since this is a hard-limited, bipolar perceptron, the learning increment is either +2c or -2c; we will let c = 0.2
• The third data point with the newly adjusted weights gives another output
• This is not the desired output, so we adjust the weights again
• The data are separated after 10 training values; repeating the training on the same set, the values converge to [-1.3, -1.1, 10.9] and the separator is -1.3x1 - 1.1x2 + 10.9 = 0 (a runnable sketch of this procedure appears after the Delta Rule slides below)

Threshold Functions

• Here are three examples
• A commonly used sigmoidal function, called the logistic function, is f(net) = 1 / (1 + e^(-λ*net))
• Adjusting λ varies the slope of the sigmoidal curve in the area of transition

The Delta Rule - 1

• The error surface represents the cumulative error over a data set as a function of network weights
• Each possible weight configuration is a point on the surface
• We use gradient descent learning, where the derivative gives the gradient telling us the direction which most rapidly reduces error
• Using the logistic function, the weight adjustment of the jth input on the ith node is Δwij = c(di - Oi) Oi (1 - Oi) xj
• c controls the learning rate, di and Oi are the desired and actual outputs; the derivation of this formula is in the textbook

The Delta Rule - 2

• The delta rule is hill climbing that uses the local derivative to minimize the local error
• It is subject to getting stuck in a local minimum
• The value c adjusts the learning rate
  – if too small, the training time may be too long
  – if too large, the solution may oscillate around each side of the minimum
  – reasonably small values are less error prone
• Generalized to a multilayer network, the delta rule becomes the foundation of the back propagation algorithm
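The following is a minimal Python sketch of the perceptron training procedure from the "Training the Network" slides above. The starting weights [0.75, 0.5, -0.6], the learning constant c = 0.2, and the hard-limited sign threshold come from the slides; the four data points and their +1/-1 labels are hypothetical stand-ins for the slides' ten-point data set.

    # Hard-limited, bipolar perceptron trained with the +/-2c increments
    # described in the slides. Data points and labels are hypothetical.

    def sign(x):
        # f(x) is the sign of x (hard-limited threshold)
        return 1 if x >= 0 else -1

    # Each sample is ([x1, x2, 1.0], desired); the trailing 1.0 is the bias input.
    data = [
        ([1.0, 1.0, 1.0], 1),     # hypothetical member of the +1 grouping
        ([2.5, 2.1, 1.0], 1),     # hypothetical label for a point mentioned later in the slides
        ([9.4, 6.4, 1.0], -1),    # hypothetical label for a point mentioned later in the slides
        ([8.0, 7.7, 1.0], -1),    # hypothetical member of the -1 grouping
    ]

    weights = [0.75, 0.5, -0.6]   # initial weights from the slide
    c = 0.2                       # learning constant from the slide

    for _ in range(500):          # cap on passes through the training set
        changed = False
        for x, d in data:
            out = sign(sum(w * xi for w, xi in zip(weights, x)))
            if out != d:
                # (d - out) is +2 or -2, so each adjustment is +2c*xi or -2c*xi
                weights = [w + c * (d - out) * xi for w, xi in zip(weights, x)]
                changed = True
        if not changed:           # every point classified correctly: the data are separated
            break

    print(weights)                # defines a separating line w1*x1 + w2*x2 + w3 = 0

With a linearly separable data set the fixed-increment rule eventually stops changing the weights; repeating the passes is what the slides describe as repeating the training on the same set until the values converge.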
Back propagation - 1

• There is a hidden layer that passes forward the activations caused by the inputs
• Errors are determined at the output and propagated backwards, adjusting the weights
• The logistic function is used because
  – it is a sigmoidal function
  – it is continuous and differentiable everywhere
  – the derivative is greatest where the function is changing most rapidly
  – the derivative is easy to compute (just a multiplication and a subtraction)

Back propagation - 2

• Adjusting the kth weight at the ith node follows the generalized delta rule
• A complete derivation is given in the text; the slide's diagram illustrates how the adjustments at the hidden layer are calculated

Example - NETtalk

• English pronunciation is highly irregular, so learning is very difficult
• NETtalk inputs a string and for each character returns a phoneme and a stress
• The input layer has seven characters, each with 29 possible values; 21 outputs are for phonemes and five others are for stress and syllable boundaries; there are 80 hidden units and 18,629 connections
• Training was laborious, with up to 100 passes through the training data

Performance of NETtalk

• Learning proceeded quickly at the start, then progressed more slowly
• The network was robust; random changes to the weights only degraded the network slightly
• The reduction in the number of hidden units compared to input units shows that knowledge is being encoded
• Performance was similar to that of a symbol-based system, ID3: with a training set of 500 examples, 60% were pronounced correctly
• ID3 only required one pass through the data; NETtalk required multiple passes
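Below is a minimal Python sketch of the back propagation procedure just described: a forward pass through a hidden layer of logistic units, errors computed at the output, and the delta rule applied backwards. It is trained on exclusive-or, the problem revisited on the Exclusive-or slide that follows. The two hidden nodes, the 0/1 input coding, the learning rate, the pass count, and the random seed are illustrative assumptions; the slide's exclusive-or network instead uses a single hidden node plus direct input-to-output connections.

    import math
    import random

    def logistic(x):
        # the logistic threshold function; its derivative is f(x) * (1 - f(x))
        return 1.0 / (1.0 + math.exp(-x))

    # exclusive-or training data (0/1 coding used here for simplicity)
    DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

    random.seed(1)
    # 2 inputs -> 2 hidden nodes -> 1 output node; the last weight of each node is its bias
    w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    w_out = [random.uniform(-1, 1) for _ in range(3)]
    c = 0.5                                     # learning rate (illustrative)

    def forward(x):
        h = [logistic(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
        o = logistic(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
        return h, o

    for _ in range(20000):                      # many passes over only four examples
        for x, d in DATA:
            h, o = forward(x)
            # delta rule at the output node
            delta_o = (d - o) * o * (1 - o)
            # error propagated backwards to the hidden nodes
            delta_h = [delta_o * w_out[i] * h[i] * (1 - h[i]) for i in range(2)]
            # weight adjustments
            for i in range(2):
                w_out[i] += c * delta_o * h[i]
                w_hidden[i][0] += c * delta_h[i] * x[0]
                w_hidden[i][1] += c * delta_h[i] * x[1]
                w_hidden[i][2] += c * delta_h[i]
            w_out[2] += c * delta_o

    for x, d in DATA:
        print(x, d, round(forward(x)[1], 2))    # network outputs after training

As the Delta Rule slide warns, gradient descent can settle in a local minimum, so an unlucky random start may require re-running with different initial weights.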
Exclusive-or

• This network is small: two input nodes, one hidden node and one output node
• Although there are only four values in the data set, the network had to repeat this data 1400 times (!) to get the weights

Kohonen Network - 1

• In this single layer network the input value for the vector X causes one and only one output node to fire
• The weights of the winner are adjusted to be closer to the input vector
• Learning is unsupervised; the learning constant c is relatively small and gets smaller as learning progresses
• The node whose weight vector has the smallest Euclidean distance from the input vector is the one to fire, since this matches the node with the largest activation value WX

Kohonen Network - 2

• The Kohonen network is sometimes called a winner-take-all network
• It is interesting to study because
  – it is a classification system
  – it is the first stage of counterpropagation networks
• Learning is unsupervised; this means the system discovers the data clusterings
• Prototypes are introduced into the system with random initial values
• These prototypes migrate to the data clusters
• The number of prototypes and the weights are used to build the network

Using the Perceptron Data

• The initial candidates are A:(7,2) and B:(2,9); these migrate towards the data clusters
• On the first round A is the winner

More Steps in the Calculation

• Weights for A are adjusted by the following amounts, assuming c is 0.5
• The next data point is (9.4, 6.4); A is again the winner and has its weights adjusted
• The third data point is (2.5, 2.1) and A is the winner again
• After all ten data points the two prototypes have migrated to the two data clusters (a small sketch of this winner-take-all update appears after the Interpretation slide below)

Grossberg Learning

• The basic idea is
  – use a trained Kohonen net as the first layer to classify the data
  – add a supervision layer with counterpropagation to help train the network
  – each of the middle layer nodes connects to the output nodes in an outstar formation
• Training results in averaging the weights of the outputs from the outstar
• The update is W(t+1) = W(t) + c(Y - W(t)), where c is a small learning constant, W(t) the weight of the outstar, and Y the desired output vector

Training the network

• x1 is the engine speed, x2 is the engine temperature, and the two clusters are A and B
• We want to find safe and dangerous states
• Training of the outstar of A (B is similar) with c set to 0.2 and weights [0, 0]
• The weights are moving toward [1, 0], the final output

Interpretation

• From a cognitive perspective
  – Kohonen learning is like acquiring a conditioned stimulus from inputs
  – the next layer is an association of an unconditioned stimulus to some response
• Another perspective
  – counterpropagation is reinforcement for memory links
  – this is like developing a lookup table for responses to data patterns
• Grossberg networks can perform analysis of linearly separable data
• By training the Kohonen net first, the learning is much faster than backpropagation
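Here is a minimal Python sketch of the winner-take-all step the Kohonen slides above walk through: the prototype with the smallest Euclidean distance to the input fires, and its weights move toward the input vector. The prototypes A:(7,2) and B:(2,9), the learning constant c = 0.5, and the data points (9.4, 6.4) and (2.5, 2.1) come from the slides; the other data points are hypothetical, so the run is illustrative rather than a replay of the slides' ten-point calculation.

    import math

    prototypes = {"A": [7.0, 2.0], "B": [2.0, 9.0]}   # initial candidates from the slide
    c = 0.5                                           # learning constant from the slide

    # (9.4, 6.4) and (2.5, 2.1) appear on the slides; the other two points are made up
    data = [[9.4, 6.4], [2.5, 2.1], [1.0, 8.5], [2.2, 9.3]]

    def distance(w, x):
        # Euclidean distance between a prototype's weight vector and an input
        return math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))

    for x in data:
        # winner-take-all: only the closest prototype fires
        winner = min(prototypes, key=lambda name: distance(prototypes[name], x))
        # unsupervised update: move the winner's weights closer to the input vector
        prototypes[winner] = [w + c * (xi - w) for w, xi in zip(prototypes[winner], x)]
        print(x, "->", winner, [round(w, 2) for w in prototypes[winner]])

In a counterpropagation network this trained layer would then feed the Grossberg outstar layer, whose weights move toward the desired output Y by the same kind of increment, W + c(Y - W).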
Hebbian Network

• In 1949 Hebb proposed the following rule about neurons
• Suppose i's output feeds the input of j
  – if both are positive or both are negative, the connection is strengthened
  – if they are of opposite signs, the connection is weakened
• The equation for weight adjustment is ΔW = c f(X, W) X, where c is the learning constant, f(X, W) is i's output, and X is the input vector
• Hebbian learning can be unsupervised or supervised; it is sometimes called coincidence learning

Stimulus-Response

• Hebbian learning is based on unconditioned stimulus, conditioned stimulus and response
  – when a dog receives food, its unconditioned response is to salivate
  – Pavlov rang a bell when the dog was fed, thus conditioning the dog to associate the bell with food; eventually the bell alone caused salivation

Training the Network

• Assume the input [1, -1, 1, -1, 1, -1], where the first three values are the unconditioned stimulus and the last three are the conditioned stimulus
• The initial weight vector is [1, -1, 1, 0, 0, 0] since the network is untrained
• Calculating the new weights
• We repeat the process

Testing the network

• The unconditioned response still works
• Another unconditioned response
• The unconditioned vector is [1, 1, 1], to see if the conditioned response works
• A one
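To make the last two slides concrete, here is a minimal Python sketch of the Hebbian (coincidence) update ΔW = c f(X, W) X applied to the slides' example. The input [1, -1, 1, -1, 1, -1], its split into unconditioned and conditioned halves, and the initial weights [1, -1, 1, 0, 0, 0] come from the slides; the learning constant c = 0.2, the number of passes, and the final test vector are illustrative assumptions.

    def sign(x):
        return 1 if x >= 0 else -1

    def output(weights, x):
        # f(X, W): the node's hard-limited output for input X
        return sign(sum(w * xi for w, xi in zip(weights, x)))

    weights = [1, -1, 1, 0, 0, 0]     # untrained: only the unconditioned stimulus has weights
    pattern = [1, -1, 1, -1, 1, -1]   # unconditioned stimulus + conditioned stimulus
    c = 0.2                           # assumed learning constant

    for _ in range(10):               # repeat the process
        f = output(weights, pattern)
        # Hebbian (coincidence) update: delta_W = c * f(X, W) * X
        weights = [w + c * f * xi for w, xi in zip(weights, pattern)]

    print([round(w, 2) for w in weights])

    # after training, the conditioned stimulus presented alone (unconditioned part
    # zeroed out) produces the same +1 response: the association has been learned
    conditioned_only = [0, 0, 0, -1, 1, -1]
    print(output(weights, conditioned_only))

Because the conditioned values always appear together with a positive output, their weights are strengthened in the same direction, which is the coincidence learning the Hebbian slide describes.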
