Neural Networks and Ensemble Methods for Classification
NEURAL NETWORKS

Neural Networks
A neural network is a set of connected input/output units (neurons) where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so that it can predict the correct class label of the input samples (the training samples).
Knowledge about the learning task is given in the form of examples. Inter-neuron connection strengths (weights) are used to store the acquired information (the training examples). During the learning process the weights are modified in order to model the particular learning task correctly on the training examples.
(Image source: http://aemc.jpl.nasa.gov/activities/bio_regen.cfm)

Neural Networks: Advantages and Criticism
Advantages
- Prediction accuracy is generally high.
- Robust: works when training examples contain errors or noisy data.
- Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
- Fast evaluation of the learned target function.
Criticism
- Parameters, such as the network topology or "structure", are best determined empirically.
- Long training time.
- Difficult to understand the learned function (the weights).
- Not easy to incorporate domain knowledge.

Network architectures
Three different classes of network architectures:
- single-layer feed-forward: an input layer of source nodes connected directly to an output layer of neurons
- multi-layer feed-forward: one or more hidden layers between the input and output layers
- recurrent
In feed-forward networks the neurons are organized in acyclic layers. The architecture of a neural network is linked with the learning algorithm used to train it.

Neurons
Neural networks are built out of a densely interconnected set of simple units (neurons).
- Each neuron takes a number of real-valued inputs and produces a single real-valued output.
- Inputs to a neuron may be the outputs of other neurons, and a neuron's output may be used as input to many other neurons.

The neuron
A neuron combines an adder (a linear combiner), which computes the weighted sum of the inputs, with an activation (squashing) function, which limits the amplitude of the neuron's output. The bias serves to vary the activity of the unit:

    u = Σ_{j=1..m} w_j x_j + b        y = φ(u)

How does it work?
- Assign weights to each input link.
- Multiply each weight by the input value (0 or 1) and sum all the weighted inputs.
- Apply the squashing function, e.g.: if sum > threshold for the neuron, then output = +1, else output = -1.
(Figure: http://www-cse.uta.edu/~cook/ai1/lectures/figures/neuron.jpg)

Popular activation functions
- Linear activation:             φ(z) = z
- Logistic activation:           φ(z) = 1 / (1 + e^-z)
- Threshold (sign) activation:   φ(z) = sign(z) = +1 if z ≥ 0, -1 if z < 0
- Hyperbolic tangent activation: φ(u) = tanh(u) = (1 - e^-2u) / (1 + e^-2u)

How are neural networks trained?
Initially: choose small random weights w_i, set the threshold = 1 (step function), and choose a small learning rate r. Then apply each member of the training set to the neural network, using a training rule to adjust the weights. For each unit:
- compute the net input to the unit as a linear combination of all the inputs to the unit
- compute the output value using the activation function
- compute the error
- update the weights and the bias

Single Layer Perceptron
The simplest form of neural network: the input variables are connected directly to the output nodes.

Single layer perceptron: training rule
Modify the weights w_i according to the training rule

    w_i = w_i + r · (t - a) · x_i

where r is the learning rate (e.g. 0.2), t is the target output, a is the actual output, and x_i is the i-th input value.
Learning rate: if too small, learning occurs at a slow pace; if too large, the search may get stuck in a local minimum of the decision space.
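As an illustration of this training rule, here is a minimal Python sketch of a single-layer perceptron trained with repeated passes over the data. The function and variable names (step, train_perceptron, epochs) are illustrative, not from the slides, and the 0/1 output convention matches the worked example that follows.

```python
def step(u, threshold=0.5):
    """Threshold activation: output 1 when the weighted sum exceeds the threshold, else 0."""
    return 1 if u > threshold else 0


def train_perceptron(samples, weights, r=0.2, threshold=0.5, epochs=10):
    """Repeatedly apply the training rule w_i = w_i + r * (t - a) * x_i.

    samples: list of (inputs, target) pairs; inputs[0] can be a fixed bias input.
    weights: initial weights, one per input.
    """
    for _ in range(epochs):
        for inputs, target in samples:
            u = sum(w * x for w, x in zip(weights, inputs))  # weighted sum of the inputs
            actual = step(u, threshold)                      # actual output a
            error = target - actual                          # (t - a)
            weights = [w + r * error * x                     # update every weight
                       for w, x in zip(weights, inputs)]
    return weights
```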
Example
Training data (the logical OR of x1 and x2):

    x1  x2  Y
    0   0   0
    1   0   1
    0   1   1
    1   1   1

Initial weights w0 = 0.49, w1 = 0.95, w2 = 0.15; bias input b = -1; threshold = 0.5; learning rate r = 0.05.
Presented input: x1 = 0, x2 = 1, target output = 1.
- Compute the output for the input: u = -1 x 0.49 + 0 x 0.95 + 1 x 0.15 = -0.34 < threshold, thus the actual output y = 0.
- Compute the error: error = (1 - 0) = 1, so the correction factor = error x r = 0.05.
- Compute the new weights:
    w0 = 0.49 + 0.05 x (1 - 0) x (-1) = 0.44
    w1 = 0.95 + 0.05 x (1 - 0) x 0 = 0.95
    w2 = 0.15 + 0.05 x (1 - 0) x 1 = 0.20
- Repeat the process with the new weights for a given number of iterations.

Multi-layer network
A multi-layer network consists of an input layer, one or more hidden layers, and an output layer.

Training multi-layer networks
For a multi-layer network of sigmoid units there is a problem: what is the desired output for a hidden node? The answer is the back-propagation algorithm.
Phase 1: Propagation
- Forward propagation of a training input vector through the network.
- Back propagation of the propagation's output activations (the errors).
Phase 2: Weight update
- For each weight (synapse): multiply its output delta and input activation to get the gradient of the weight.
- Bring the weight in the opposite direction of the gradient by subtracting a ratio of it from the weight. This ratio influences the speed and quality of learning. The sign of the gradient of a weight indicates where the error is increasing; this is why the weight must be updated in the opposite direction.
Repeat phases 1 and 2 until the performance of the network is good enough.

Backpropagation formulas
With input vector x_i at the input nodes, for each hidden or output node j:
- Net input:                              I_j = Σ_i w_ij O_i + θ_j
- Output (sigmoid activation):            O_j = 1 / (1 + e^-I_j)
- Error for a node in the output layer:   Err_j = O_j (1 - O_j) (T_j - O_j)
- Error for a node in the hidden layer:   Err_j = O_j (1 - O_j) Σ_k Err_k w_jk
- To update the weights:                  w_ij = w_ij + r · Err_j · O_i
- To update the bias:                     θ_j = θ_j + r · Err_j
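The following Python sketch shows one way these formulas can be implemented for a network with a single hidden layer, processing one training example per call. The dictionary-based representation and the name backprop_step are illustrative assumptions, not part of the slides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def backprop_step(x, target, w, theta, hidden, output, r=0.9):
    """One propagation + weight-update pass for a single training example.

    x:      dict {input_node: value}
    target: dict {output_node: desired value T_j}
    w:      dict {(i, j): weight} for every connection i -> j
    theta:  dict {node: bias} for hidden and output nodes
    hidden, output: lists of hidden and output node ids
    """
    # Phase 1a: forward propagation, I_j = sum_i w_ij O_i + theta_j, O_j = sigmoid(I_j)
    O = dict(x)
    for j in hidden + output:
        I_j = sum(w[(i, j)] * O[i] for i in O if (i, j) in w) + theta[j]
        O[j] = sigmoid(I_j)

    # Phase 1b: back propagation of the errors
    err = {}
    for j in output:                      # Err_j = O_j (1 - O_j) (T_j - O_j)
        err[j] = O[j] * (1 - O[j]) * (target[j] - O[j])
    for j in hidden:                      # Err_j = O_j (1 - O_j) sum_k Err_k w_jk
        err[j] = O[j] * (1 - O[j]) * sum(err[k] * w[(j, k)] for k in output)

    # Phase 2: weight and bias updates
    for (i, j) in w:                      # w_ij = w_ij + r * Err_j * O_i
        w[(i, j)] += r * err[j] * O[i]
    for j in theta:                       # theta_j = theta_j + r * Err_j
        theta[j] += r * err[j]
    return O, err
```

Running one step with the weights, biases, and input of the worked example on the next slide reproduces, up to rounding, the values computed there.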
Example: Propagation
Consider a network with input nodes 1, 2, 3, hidden nodes 4, 5, and output node 6.
Input vector x_i = (1, 0, 1), whose class is 1; the weights w_ij are randomly assigned; learning rate r = 0.9.
Initial weights and biases:

    w14 = 0.2    w15 = -0.3    w24 = 0.4    w25 = 0.1    w34 = -0.5    w35 = 0.2
    w46 = -0.3   w56 = -0.2    θ4 = -0.4    θ5 = 0.2     θ6 = 0.1

Forward propagation, using I_j = Σ_i w_ij O_i + θ_j and O_j = 1 / (1 + e^-I_j):

    neuron   input I_j                                     output O_j
    4        0.2x1 + 0.4x0 - 0.5x1 - 0.4 = -0.7            1/(1 + e^0.7) = 0.332
    5        -0.3x1 + 0.1x0 + 0.2x1 + 0.2 = 0.1            1/(1 + e^-0.1) = 0.525
    6        -0.3x0.332 - 0.2x0.525 + 0.1 = -0.105         1/(1 + e^0.105) = 0.474

Calculation of the neurons' errors
Using Err_j = O_j (1 - O_j) (T_j - O_j) for the output layer and Err_j = O_j (1 - O_j) Σ_k Err_k w_jk for the hidden layer:

    neuron   output   error
    6        0.474    0.474 x (1 - 0.474) x (1 - 0.474) = 0.1311
    5        0.525    0.525 x (1 - 0.525) x (-0.2) x 0.1311 = -0.0065
    4        0.332    0.332 x (1 - 0.332) x (-0.3) x 0.1311 = -0.0087

Updating weights
Using w_ij = w_ij + r · Err_j · O_i for the weights and θ_j = θ_j + r · Err_j for the biases:

    weight/bias   new value
    w46           -0.3 + 0.9 x 0.1311 x 0.332 = -0.261
    w56           -0.2 + 0.9 x 0.1311 x 0.525 = -0.138
    w14           0.2 + 0.9 x (-0.0087) x 1 = 0.192
    w15           -0.3 + 0.9 x (-0.0065) x 1 = -0.306
    w24           0.4 + 0.9 x (-0.0087) x 0 = 0.4
    w25           0.1 + 0.9 x (-0.0065) x 0 = 0.1
    w34           -0.5 + 0.9 x (-0.0087) x 1 = -0.508
    w35           0.2 + 0.9 x (-0.0065) x 1 = 0.194
    θ6            0.1 + 0.9 x 0.1311 = 0.218
    θ5            0.2 + 0.9 x (-0.0065) = 0.194
    θ4            -0.4 + 0.9 x (-0.0087) = -0.408

This is the resulting network after the first iteration. We now have to process another training example, repeating until the overall error is low or we run out of examples. (A short code sketch at the end of this section repeats this arithmetic.)

Neural Network as a Classifier
Weaknesses
- Long training time.
- Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure".
- Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
Strengths
- High tolerance to noisy data.
- Ability to classify untrained patterns.
- Well suited for continuous-valued inputs and outputs.
- Successful on a wide array of real-world data.
- Algorithms are inherently parallel.
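To check the numbers in the worked example above, here is a small Python script that repeats the forward pass, the error calculation, and a few of the weight and bias updates with the slide's values; the variable names are illustrative. Small differences in the last digit arise because the slides round the intermediate values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# network from the worked example: inputs 1, 2, 3 -> hidden 4, 5 -> output 6
x1, x2, x3 = 1, 0, 1                              # input vector, class T6 = 1
T6, r = 1, 0.9                                    # target output and learning rate
w14, w15, w24, w25, w34, w35 = 0.2, -0.3, 0.4, 0.1, -0.5, 0.2
w46, w56 = -0.3, -0.2
b4, b5, b6 = -0.4, 0.2, 0.1                       # biases (theta)

# forward propagation
O4 = sigmoid(w14*x1 + w24*x2 + w34*x3 + b4)       # slide: 0.332
O5 = sigmoid(w15*x1 + w25*x2 + w35*x3 + b5)       # slide: 0.525
O6 = sigmoid(w46*O4 + w56*O5 + b6)                # slide: 0.474

# errors: output layer first, then hidden layer
Err6 = O6 * (1 - O6) * (T6 - O6)                  # slide: 0.1311
Err5 = O5 * (1 - O5) * Err6 * w56                 # slide: -0.0065
Err4 = O4 * (1 - O4) * Err6 * w46                 # slide: -0.0087

# a representative subset of the weight and bias updates
w46_new = w46 + r * Err6 * O4                     # slide: -0.261
w14_new = w14 + r * Err4 * x1                     # slide: 0.192
b6_new = b6 + r * Err6                            # slide: 0.218

print(round(O4, 3), round(O5, 3), round(O6, 3))
print(round(Err6, 4), round(Err5, 4), round(Err4, 4))
print(round(w46_new, 3), round(w14_new, 3), round(b6_new, 3))
```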
ENSEMBLE METHODS

Ensemble Method
Aggregation of multiple learned models with the goal of improving accuracy.
Intuition: simulate what we do when we combine the opinions of an expert panel in a human decision-making process.

Some Comments
- Combining models adds complexity: it is more difficult to characterize and explain predictions.
- The accuracy may increase.
- Violation of Ockham's Razor ("simplicity leads to greater accuracy"): identifying the best model requires identifying the proper "model complexity".

Methods to Achieve Diversity
Diversity from differences in input variation:
- Different feature weightings: give each classifier a different view of the training examples (e.g., ratings to classifier A, actors to classifier B, genres to classifier C) and combine their predictions.
- Divide up the training data among models: each classifier is trained on a different subset of the training examples and the predictions are combined.

How to combine models
- Algebraic methods: average, weighted average, sum, weighted sum, product, maximum, minimum, median.
- Voting methods: majority voting, weighted majority voting, Borda count (rank candidates in order of preference).

Ensemble Methods: Increasing the Accuracy
Use a combination of models to increase accuracy: combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*.
Popular ensemble methods:
- Bagging: averaging the prediction over a collection of classifiers.
- Boosting: weighted vote with a collection of classifiers.
- Ensemble: combining a set of heterogeneous classifiers.

Bagging: Bootstrap AGGregatING
Analogy: diagnosis based on multiple doctors' majority vote.
Training: given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample); a classifier model Mi is learned for each training set Di.
Classification (classify an unknown sample X): each classifier Mi returns its class prediction; the bagged classifier M* counts the votes and assigns the class with the most votes to X.
Prediction: can also be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple.

Bagging: Accuracy
- Often significantly better than a single classifier derived from D.
- For noisy data: not considerably worse, more robust.
- Proven improved accuracy in prediction.
- Requirement: needs unstable classifier types. Unstable means a small change to the training data may lead to major decision changes.
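To make the bagging procedure concrete, here is a minimal Python sketch. It assumes the base learner is exposed through a learn function that fits a model on a list of (x, y) tuples and returns an object with a predict method; the function names are illustrative, not from the slides.

```python
import random
from collections import Counter


def bagging_train(D, learn, k=10, seed=0):
    """Train k models, each on a bootstrap sample of the training set D.

    D:     list of (x, y) training tuples.
    learn: function that fits a model on a list of tuples and returns an
           object with a .predict(x) method.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        Di = [rng.choice(D) for _ in range(len(D))]   # sample d tuples with replacement
        models.append(learn(Di))                      # learn classifier Mi on Di
    return models


def bagging_classify(models, x):
    """The bagged classifier M*: majority vote over the k class predictions."""
    votes = Counter(m.predict(x) for m in models)
    return votes.most_common(1)[0][0]


def bagging_predict(models, x):
    """For continuous targets: average the individual predictions."""
    preds = [m.predict(x) for m in models]
    return sum(preds) / len(preds)
```

In line with the requirement above, an unstable base learner such as a decision tree is the natural choice to plug in through the learn argument.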