Applied Machine Learning
Annalisa Marsico
OWL RNA Bioinformatics group
Max Planck Institute for Molecular Genetics
Free University of Berlin
13 May, summer semester (SoSe) 2015

Neural Networks

Motivation: the non-linear hypothesis
[Figure: scatter plot of data points (x)]
Goal: learn the non-linear problem.
One solution: add a lot of non-linear features as input to the logistic function:
$$h_w(x) = g\left(w_0 + w_1 x_1 + w_2 x_2 + \text{higher-order polynomial terms in } x_1, x_2 + \cdots\right)$$
where $g$ is the sigmoid function.
With more than two dimensions, including polynomial features is not a good idea: it might lead to overfitting.

Neural Networks
• Solution: SVM with projection into a high-dimensional space (the kernel ‘trick’)
• New solution: neural networks, which learn complex non-linear hypotheses

Neural Networks (NNs) – Biological motivation
• Originally motivated by the desire to build machines that can mimic the brain
• Widely used in the 80s and early 90s; popularity diminished in the late 90s
• Recent resurgence: more computational power to run large-scale NNs

The ‘brain’ algorithm
• Fascinating hypothesis (Roe et al.
1992): the brain functions through a single learning algorithm
[Figure: auditory cortex]

Neurons
Schematic representation of a single neuron in the brain. A neuron is a computational unit.
[Figure: neuron, with labels ‘inputs’ (dendrites), synapse, cell body, cell nucleus, node of Ranvier, Schwann cell, myelin sheath]

Communication between neurons

Modeling neurons
Logistic unit: the activation function is often a sigmoid,
$$y \equiv h_w(x) = \frac{1}{1 + e^{-w^{T}x}} \equiv g(w^{T}x)$$
Other activation functions exist: step function, linear function.

Logistic unit representation
[Figure: logistic unit with bias unit $x_0 = 1$, output $h_w(x)$, and a sigmoid (logistic) activation function]
$w$ are the parameters of the model, called ‘weights’ in NNs.

Neural network representation
A NN is a group of different neurons (logistic units) connected together.
[Figure: three-layer network with bias unit $x_0 = 1$; Layer 1 (input layer), Layer 2 (hidden layer), Layer 3 (output layer); the output node outputs the computed function $h_w(x)$]

Neural Network – Computational Steps
Some notation:
• $a_i^{(j)}$ = activation of unit $i$ in layer $j$
• $W^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$; its entry $w_{ik}^{(j)}$ weights the input from unit $k$ of layer $j$ into unit $i$ of layer $j+1$

The hidden units have activation values computed as follows:
$$a_1^{(2)} = g\left(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3\right)$$
$$a_2^{(2)} = g\left(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3\right)$$
$$a_3^{(2)} = g\left(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3\right)$$

For the output unit, with hidden bias unit $a_0^{(2)} = 1$:
$$h_w(x) = a_1^{(3)} = g\left(w_{10}^{(2)} a_0^{(2)} + w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)}\right)$$
where $W^{(2)}$ is the matrix of weights (parameters) that controls the function from the hidden units to the single unit in layer 3.
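The computational steps above can be sketched in plain Python. This is a minimal illustration, assuming the network from the slides (3 inputs, 3 hidden units, 1 output unit). The weight values and the helper names `layer_forward` and `forward` are hypothetical, chosen only for this example; each weight row stores the bias weight first.

```python
import math

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, a_prev):
    """Compute the activations of one layer.

    W is a list of rows; row i holds the weights into unit i of the next
    layer, with the bias weight first. a_prev holds the activations of the
    previous layer without the bias unit, which is prepended here.
    """
    a_with_bias = [1.0] + list(a_prev)  # bias unit fixed to 1
    return [sigmoid(sum(w * a for w, a in zip(row, a_with_bias))) for row in W]

def forward(x, W1, W2):
    """Forward propagate input x through hidden layer (W1) and output layer (W2)."""
    a2 = layer_forward(W1, x)   # hidden activations a^(2)
    a3 = layer_forward(W2, a2)  # output activation(s) a^(3) = h_w(x)
    return a2, a3

# Hypothetical weights: W1 is 3x4 (3 hidden units, 3 inputs + bias),
# W2 is 1x4 (1 output unit, 3 hidden units + bias).
W1 = [[0.1, 0.2, -0.3, 0.4],
      [-0.5, 0.1, 0.2, 0.0],
      [0.3, -0.2, 0.1, 0.5]]
W2 = [[0.2, -0.4, 0.6, 0.1]]

hidden, output = forward([1.0, 0.0, -1.0], W1, W2)
print("hidden activations:", hidden)
print("h_w(x):", output[0])
```

Because every unit applies the sigmoid, all activations lie strictly between 0 and 1; with all weights set to zero each unit outputs exactly 0.5.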
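NNs computation – compact form
For the hidden units, denote the weighted sum inside $g$ by $z_i^{(2)}$:
$$a_1^{(2)} = g\left(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3\right) \equiv g\left(z_1^{(2)}\right)$$
$$a_2^{(2)} = g\left(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3\right) \equiv g\left(z_2^{(2)}\right)$$
$$a_3^{(2)} = g\left(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3\right) \equiv g\left(z_3^{(2)}\right)$$
with $W^{(1)} \in \mathbb{R}^{3 \times 4}$.

In vector form:
$$x = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix}; \quad z^{(2)} = \begin{pmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{pmatrix}; \quad a^{(2)} = \begin{pmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{pmatrix}$$
$$z^{(2)} = W^{(1)} x = W^{(1)} a^{(1)}$$
$$a^{(2)} = g\left(z^{(2)}\right)$$
$$z^{(3)} = W^{(2)} a^{(2)} \quad \text{(with the bias unit } a_0^{(2)} = 1 \text{ added to } a^{(2)}\text{)}$$
$$h_w(x) = a^{(3)} = g\left(z^{(3)}\right)$$

Forward propagation
$$h_w(x) = a_1^{(3)} = g\left(w_{10}^{(2)} a_0^{(2)} + w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)}\right)$$
This looks like logistic regression, except that the features fed into the logistic regression are $a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$, the activation values computed by the hidden layer.
The network has to learn new features $a^{(2)}$. By learning new features the network learns a better (non-linear) hypothesis.
Forward propagation: we start with the activations of the input units, propagate them to the hidden units, compute the activations of the hidden units, then compute the output.

How to compute the weights w?
• The network must be trained to assign values to the parameters $w$. This is again an optimization problem.
• Analogy with linear regression: given a training set of N input vectors $x_1, \ldots, x_N$ together with the responses $\{y_1, \ldots, y_N\}$, the objective function to minimize is the sum of squared errors:
$$E(w) = \sum_{n=1}^{N} \left(\hat{y}_n - y_n\right)^2 = \sum_{n=1}^{N} \left(h_w(x_n) - y_n\right)^2$$
• Analogy with logistic regression (classification): given a training set of N input vectors $x_1, \ldots, x_N$ together with the responses $\{y_1, \ldots, y_N\}$, the objective function is derived as follows:
– Remember that instead of modeling Y directly we model the probability of Y (given by the logistic function):
$$P(y = 1 \mid x, w) = h_w(x) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \cdots)}}$$
– For N observations:
$$P(y \mid x, w) = \prod_{n=1}^{N} h_w(x_n)^{y_n} \left(1 - h_w(x_n)\right)^{1 - y_n}$$
$$-\log P(y \mid x, w) = -\sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right] = -\sum_{n=1}^{N} \left[ y_n \log h_w(x_n) + (1 - y_n) \log \left(1 - h_w(x_n)\right) \right] = E(w)$$

Objective function (error function) for NNs
Multi-class classification problem:
$$y \in \left\{ (1, 0, 0, 0)^{T},\ (0, 1, 0, 0)^{T},\ (0, 0, 1, 0)^{T},\ (0, 0, 0, 1)^{T} \right\}$$
In the case of one output unit the objective function is exactly the same. In the case of $k$ output units (e.g.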
$k = 4$) the objective function becomes:
$$E(w) = -\sum_{n=1}^{N} \sum_{c=1}^{k} \left[ y_{nc} \log \hat{y}_{nc} + (1 - y_{nc}) \log (1 - \hat{y}_{nc}) \right] = -\sum_{n=1}^{N} \sum_{c=1}^{k} \left[ y_{nc} \log \left(h_w(x_n)\right)_c + (1 - y_{nc}) \log \left(1 - \left(h_w(x_n)\right)_c\right) \right]$$

Minimization of the error function E(w)
• Goal: find a weight vector $w$ which minimizes the defined error function
• With only one feature: the minimum of the function is where $\nabla E(w) = 0$ (gradient)
• With two features: the minimum of the function is again where $\nabla E(w) = 0$
$\delta E \approx \delta w^{T} \nabla E(w)$: change in the error function due to a small step in weight space from $w$ to $w + \delta w$
• $E(w)$ is non-convex, and there is no analytical solution to $\nabla E(w) = 0$. We will use iterative numerical procedures, where $\tau$ indexes the iteration step:
$$w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}$$
• Different algorithms make different choices for the weight vector update $\Delta w^{(\tau)}$
• Gradient descent: choose the weight update corresponding to a small step in the direction of the negative gradient
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E\left(w^{(\tau)}\right), \quad \eta = \text{learning rate}$$
• Goal: find an efficient technique for evaluating $\nabla E(w)$ for a feed-forward neural network. This is done by means of error backpropagation.

Training algorithms
Many involve an iterative procedure for minimizing the error function, with adjustments made to the weights. At each step we can distinguish between two stages:
• Compute the derivatives of the error function, $\nabla E(w)$ (backpropagation provides an efficient way to calculate them)
• Use the derivatives to compute adjustments to the weights

The error backpropagation algorithm
[Figure: three-layer network (Layer 1, Layer 2, Layer 3) with unit $i$ feeding unit $j$ in the next layer]
For unit $j$ in layer $l+1$, fed by the units $i$ of layer $l$:
$$z_j^{(l+1)} = \sum_i w_{ji}^{(l)} a_i^{(l)}, \qquad a_j^{(l+1)} = \sigma\left(z_j^{(l+1)}\right)$$
By the chain rule:
$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \frac{\partial E}{\partial z_j^{(l+1)}} \frac{\partial z_j^{(l+1)}}{\partial w_{ji}^{(l)}}; \quad \text{substituting } \frac{\partial z_j^{(l+1)}}{\partial w_{ji}^{(l)}} = a_i^{(l)} \text{ we obtain } \frac{\partial E}{\partial w_{ji}^{(l)}} = \frac{\partial E}{\partial z_j^{(l+1)}}\, a_i^{(l)}$$
We define the errors $\delta_j^{(l+1)} \equiv \dfrac{\partial E}{\partial z_j^{(l+1)}}$, so
$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \delta_j^{(l+1)} a_i^{(l)} \qquad (1)$$

The error backpropagation algorithm
For the output units:
$$\delta_k^{(3)} = a_k^{(3)} - y_k$$
where $a_k^{(3)}$ is the activation of output unit $k$.
Let’s evaluate $\delta_j$ for the hidden units:
With $a_j^{(2)} = \sigma\left(z_j^{(2)}\right)$, the hidden-unit errors follow from the chain rule:
$$\delta_j^{(2)} \equiv \frac{\partial E}{\partial z_j^{(2)}} = \sum_k \frac{\partial E}{\partial z_k^{(3)}} \frac{\partial z_k^{(3)}}{\partial z_j^{(2)}}$$
$$z_k^{(3)} = \sum_j w_{kj}^{(2)} a_j^{(2)} = \sum_j w_{kj}^{(2)} \sigma\left(z_j^{(2)}\right) \quad \Rightarrow \quad \frac{\partial z_k^{(3)}}{\partial z_j^{(2)}} = w_{kj}^{(2)} \sigma'\left(z_j^{(2)}\right)$$
Therefore
$$\delta_j^{(2)} = \sigma'\left(z_j^{(2)}\right) \sum_k w_{kj}^{(2)} \delta_k^{(3)}$$
Finally
$$\frac{\partial E}{\partial w_{ji}^{(1)}} = \delta_j^{(2)} a_i^{(1)}$$

The error backpropagation algorithm (in words)
Algorithm to minimize the error function in NNs:
1. Apply the input vectors $x_1, \ldots, x_N$ to the network and forward propagate through the network using the activation functions of all hidden and output units
2. Evaluate the $\delta_k$ for all output units
3. Backpropagate the $\delta$s to obtain $\delta_j$ for each hidden unit in the network
4. Use (1) to compute the derivatives
5. Use the derivatives in the gradient descent algorithm

Putting it together (short summary)
• Step 1: Pick a neural network architecture (number of layers, number of nodes, connectivity)
• Step 2: Training
– Randomly initialize the weights
– Implement forward propagation to get the activations (and hence the outputs $h_w(x_n)$)
– Implement code to compute the cost function
– Implement backpropagation to calculate the partial derivatives
– Use gradient descent (or another method) to minimize $E(w)$

Some famous applications
Automated driving system ALVINN (Pomerleau 1993)

NNs in Bioinformatics

Advantages and disadvantages of NNs in Bioinformatics
Advantages:
• No need for complete domain knowledge
– Protein folding
• Robust solution to complex problems
– Microarray data (noisy data)
Disadvantages:
• Local minima
• Selecting the architecture
• Not always the best approach
– For linear classification problems NNs perform worse than linear methods

NNs for protein secondary structure prediction
PHD program (Rost et al. 2001)
• Based on artificial neural networks
• The input is the protein sequence; the output is the predicted secondary structure class
• The most important factor for predicting the secondary structure of a single amino acid is the local sequence composition
[Figure: example input residues at positions 1 to 9 (K, R, R, G, L, P, P, A, R) with output values α 0.8, β 0.6, l 0.5]
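Returning to the five backpropagation steps and the training summary (‘Putting it together’), the procedure can be sketched in plain Python. This is a minimal illustration, assuming sigmoid units in both layers, the cross-entropy error $E(w)$ and batch gradient descent. The function names, the XOR toy data, the initialization range, the learning rate and the number of epochs are illustrative assumptions, not taken from the slides.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward propagation; a1 and a2 carry the bias unit (value 1) at index 0."""
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in W1]
    a3 = [sigmoid(sum(w * a for w, a in zip(row, a2))) for row in W2]
    return a1, a2, a3

def error(X, Y, W1, W2):
    """Cross-entropy error E(w) summed over all examples and output units."""
    total = 0.0
    for x, y in zip(X, Y):
        a3 = forward(x, W1, W2)[2]
        for yk, ak in zip(y, a3):
            total -= yk * math.log(ak) + (1 - yk) * math.log(1 - ak)
    return total

def gradients(X, Y, W1, W2):
    """Backpropagation: partial derivatives of E(w) w.r.t. every weight."""
    G1 = [[0.0] * len(row) for row in W1]
    G2 = [[0.0] * len(row) for row in W2]
    for x, y in zip(X, Y):
        a1, a2, a3 = forward(x, W1, W2)              # step 1: forward propagate
        d3 = [ak - yk for ak, yk in zip(a3, y)]      # step 2: output deltas
        # step 3: hidden deltas; sigma'(z_j) = a_j (1 - a_j), bias unit skipped
        d2 = [a2[j + 1] * (1 - a2[j + 1])
              * sum(W2[k][j + 1] * d3[k] for k in range(len(W2)))
              for j in range(len(W1))]
        for k in range(len(W2)):                     # step 4: dE/dw = delta * a
            for j in range(len(a2)):
                G2[k][j] += d3[k] * a2[j]
        for j in range(len(W1)):
            for i in range(len(a1)):
                G1[j][i] += d2[j] * a1[i]
    return G1, G2

def train(X, Y, n_hidden, eta=0.5, epochs=200, seed=0):
    """Random initialization followed by batch gradient descent (step 5)."""
    rng = random.Random(seed)
    n_in, n_out = len(X[0]), len(Y[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    history = [error(X, Y, W1, W2)]
    for _ in range(epochs):
        G1, G2 = gradients(X, Y, W1, W2)
        W1 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, G1)]
        W2 = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, G2)]
        history.append(error(X, Y, W1, W2))
    return W1, W2, history

# Toy data (XOR), used here only as an illustrative non-linear problem.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[0.0], [1.0], [1.0], [0.0]]
W1, W2, history = train(X, Y, n_hidden=3)
print("E(w) before training:", history[0])
print("E(w) after training: ", history[-1])
```

A common sanity check for such an implementation is to compare one backpropagated derivative with a finite-difference estimate of the same derivative of `error`; the two should agree closely. Since $E(w)$ is non-convex, the error reached depends on the random initialization and the learning rate.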