
Logistic Regression & Neural Networks
CMSC 723 / LING 723 / INST 725
Marine Carpuat
Slides credit: Graham Neubig, Jacob Eisenstein

Logistic Regression

Perceptron & Probabilities
• What if we want a probability p(y|x)?
• The perceptron gives us a prediction y
• Let's illustrate this with binary classification
Illustrations: Graham Neubig

The logistic function
• "Softer" function than in the perceptron
• Can account for uncertainty
• Differentiable

Logistic regression: how to train?
• Train based on conditional likelihood
• Find the parameters w that maximize the conditional likelihood of all answers yi given the examples xi

Stochastic gradient ascent (or descent)
• Online training algorithm for logistic regression and other probabilistic models
• Update the weights for every training example
• Move in the direction given by the gradient
• Size of the update step is scaled by the learning rate

Gradient of the logistic function

Example: Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person.

Example: initial update

Example: second update

How to set the learning rate?
• Various strategies
• Decay over time: α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far
• Use a held-out test set, increase the learning rate when likelihood increases

Multiclass version

Some models are better than others…
• Consider these 2 examples
• Which of the 2 models below is better?
Classifier 2 will probably generalize better! It does not include irrelevant information
=> A smaller model is better

Regularization
• A penalty on adding extra weights
• L2 regularization: ‖w‖₂²
  • big penalty on large weights
  • small penalty on small weights
• L1 regularization: ‖w‖₁
  • uniform increase whether weights are large or small
  • will cause many weights to become zero

L1 regularization in online learning

What you should know
• The standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
• 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
• Learning as optimization: what is the objective function optimized?
• Difference between generative vs. discriminative classifiers
• Smoothing, regularization
• Overfitting, underfitting

Neural networks

Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person.

Formalizing binary prediction

The Perceptron: a "machine" to calculate a weighted sum
y = sign( Σi wi · φi(x) )
[Figure: word features such as φ"A" = 1, φ"site" = 1, φ"located" = 1, φ"Maizuru" = 1, φ"," = 2, φ"in" = 1, φ"Kyoto" = 1, φ"priest" = 0, φ"black" = 0, each multiplied by its weight and summed before the sign]

The Perceptron: Geometric interpretation
[Figure, shown in two steps: positive (O) and negative (X) examples separated by a single line]

Limitation of perceptron
● Can only find linear separations between positive and negative examples
[Figure: an XOR-like arrangement of X and O points that no single line can separate]

Neural Networks
● Connect together multiple perceptrons
[Figure: the same word features feeding several perceptron units, whose outputs feed a final unit]
● Motivation: can represent non-linear functions! (see the sketch just below)
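To make the motivation concrete, here is a minimal Python sketch (my own, not from the slides) of "connecting perceptrons": two sign units feed a third, and the composed function labels the four corner points in an XOR-like pattern that no single linear separator can produce. The weights are the ones used in the worked example on the next slides.

    def sign(v):
        return 1 if v >= 0 else -1

    def two_layer_perceptron(x0, x1):
        h0 = sign(1 * x0 + 1 * x1 - 1)      # first perceptron: phi_1[0]
        h1 = sign(-1 * x0 - 1 * x1 - 1)     # second perceptron: phi_1[1]
        return sign(1 * h0 + 1 * h1 + 1)    # output unit over the new space

    for x in [(-1, 1), (1, 1), (-1, -1), (1, -1)]:
        print(x, two_layer_perceptron(*x))  # -1, +1, +1, -1: not linearly separable in the input space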
Neural Networks: key terms
• Input (aka features)
• Output
• Nodes
• Layers
• Hidden layers
• Activation function (non-linear)
• Multi-layer perceptron
[Figure: the multi-layer network from the previous slide, with these parts labeled]

Example
● Create two classifiers over the inputs φ0[0], φ0[1]:
  φ1[0] = sign(1·φ0[0] + 1·φ0[1] − 1)    (weights w0,0 = {1, 1}, bias b0,0 = −1)
  φ1[1] = sign(−1·φ0[0] − 1·φ0[1] − 1)   (weights w0,1 = {−1, −1}, bias b0,1 = −1)
[Figure: the four points φ0(x1) = {−1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {−1, −1} (O), φ0(x4) = {1, −1} (X)]

Example
● These classifiers map the examples to a new space:
  φ1(x1) = {−1, −1}, φ1(x2) = {1, −1}, φ1(x3) = {−1, 1}, φ1(x4) = {−1, −1}

Example
● In the new space, the examples are linearly separable!
  φ2[0] = y = sign(1·φ1[0] + 1·φ1[1] + 1)

Example wrap-up: Forward propagation
● The final net, with tanh activations (a runnable version appears at the end of this deck):
  φ1[0] = tanh(φ0[0] + φ0[1] − 1)
  φ1[1] = tanh(−φ0[0] − φ0[1] − 1)
  φ2[0] = tanh(φ1[0] + φ1[1] + 1)

Softmax Function for multiclass classification
● Sigmoid function for multiple classes:
  P(y | x) = e^{w·φ(x,y)} / Σ_{y′} e^{w·φ(x,y′)}
  (numerator: score of the current class; denominator: sum over all classes)
● Can be expressed using matrix/vector operations (see the sketch at the end of this deck):
  r = exp(W·φ(x, y)),   p = r / Σ_{r̃∈r} r̃

Stochastic Gradient Descent
Online training algorithm for probabilistic models:

  w = 0
  for I iterations
    for each labeled pair x, y in the data
      w += α * dP(y|x)/dw

In other words (a runnable sketch appears at the end of this deck):
• For every training example, calculate the gradient (the direction that will increase the probability of y)
• Move in that direction, multiplied by learning rate α

Gradient of the Sigmoid Function
Take the derivative of the probability:

  d/dw P(y = 1 | x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                    = φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

  d/dw P(y = −1 | x) = d/dw [ 1 − e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                     = −φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²

Learning: We Don't Know the Derivative for Hidden Units!
For NNs, we only know the correct tag for the last layer:

  dP(y = 1 | x) / dw4 = h(x) · e^{w4·h(x)} / (1 + e^{w4·h(x)})²
  dP(y = 1 | x) / dw1 = ?
  dP(y = 1 | x) / dw2 = ?
  dP(y = 1 | x) / dw3 = ?

[Figure: φ(x) feeds hidden units with weights w1, w2, w3; their outputs h(x) feed an output unit with weights w4 predicting y = 1]

Answer: Back-Propagation
Calculate the derivative with the chain rule:

  dP(y = 1 | x) / dw1,1 = [dP(y = 1 | x) / dh1(x)] · [dh1(x) / dw1,1]

  where dP(y = 1 | x) / dh1(x) = e^{w4·h(x)} / (1 + e^{w4·h(x)})² · w1,4
  (the error of the next unit, δ4, times the weight w1,4),
  and dh1(x) / dw1,1 is the gradient of this unit.

In general, calculate δi based on the next units j (a worked sketch appears at the end of this deck):

  dP(y = 1 | x) / dhi(x) = Σj δj · wi,j

Backpropagation = Gradient descent + Chain rule

Feed Forward Neural Nets
All connections point forward: it is a directed acyclic graph (DAG)
[Figure: φ(x) flowing forward through layers to y]

Neural Networks
• Non-linear classification
• Prediction: forward propagation
  • Vector/matrix operations + non-linearities
• Training: backpropagation + stochastic gradient descent
For more details, see CIML Chap 7.
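As a check on the forward-propagation wrap-up, here is a short Python sketch (my own, assuming the weights read off the example slides: hidden weights {1, 1} and {−1, −1}, hidden biases −1, output weights {1, 1}, output bias 1) that runs the final tanh net on the four example points.

    import numpy as np

    W1 = np.array([[1.0, 1.0], [-1.0, -1.0]])   # hidden-layer weights (from the example)
    b1 = np.array([-1.0, -1.0])                 # hidden-layer biases
    w2 = np.array([1.0, 1.0])                   # output weights
    b2 = 1.0                                    # output bias

    def forward(phi0):
        phi1 = np.tanh(W1 @ phi0 + b1)          # hidden layer phi_1
        return np.tanh(w2 @ phi1 + b2)          # output phi_2[0]

    for point in [(-1, 1), (1, 1), (-1, -1), (1, -1)]:
        print(point, round(float(forward(np.array(point, dtype=float))), 3))
    # signs come out -, +, +, -: the same labels as the sign-based example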
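The softmax formula is easy to write with vector operations. In this small sketch (mine, not the slides'), the scores array stands in for the per-class dot products w·φ(x, y); subtracting the maximum score is a standard numerical-stability trick, not something the slides mention.

    import numpy as np

    def softmax(scores):
        r = np.exp(scores - np.max(scores))   # exponentiate each class score
        return r / r.sum()                    # normalize over all classes

    print(softmax(np.array([2.0, 1.0, -1.0])))   # probabilities that sum to 1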
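The SGD pseudocode can be made runnable for binary logistic regression with labels y in {−1, +1}. This sketch (my own) plugs in the gradient from the "Gradient of the Sigmoid Function" slide, dP(y|x)/dw = y · φ(x) · e^{w·φ(x)} / (1 + e^{w·φ(x)})²; the two-feature toy dataset is a made-up illustration.

    import numpy as np

    def sgd_logistic(data, n_features, alpha=0.1, iterations=10):
        w = np.zeros(n_features)
        for _ in range(iterations):            # "for I iterations"
            for phi_x, y in data:              # "for each labeled pair x, y in the data"
                s = np.dot(w, phi_x)
                grad = y * phi_x * np.exp(s) / (1 + np.exp(s)) ** 2
                w += alpha * grad              # "w += alpha * dP(y|x)/dw"
        return w

    toy_data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
    print(sgd_logistic(toy_data, n_features=2))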
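Finally, a sketch of the back-propagation bookkeeping on a tiny hand-built network: three tanh hidden units hi = tanh(wi·φ(x)) feed a sigmoid output P(y=1|x). The network shape and the random weights are my own assumptions for illustration; the point is that each hidden gradient reuses the output error δ4, exactly as in dP/dhi = Σj δj · wi,j.

    import numpy as np

    rng = np.random.default_rng(0)
    phi = np.array([1.0, -2.0, 0.5])          # input features phi(x) (made up)
    W = rng.normal(size=(3, 3))               # hidden weights w_1, w_2, w_3 (placeholders)
    w4 = rng.normal(size=3)                   # output weights

    h = np.tanh(W @ phi)                      # forward pass: hidden layer h(x)
    s = w4 @ h
    p = np.exp(s) / (1 + np.exp(s))           # P(y = 1 | x)

    delta4 = np.exp(s) / (1 + np.exp(s)) ** 2 # error of the output unit
    grad_w4 = delta4 * h                      # dP/dw_4
    dP_dh = delta4 * w4                       # dP/dh_i = delta_4 * w_{i,4}
    grad_W = (dP_dh * (1 - h ** 2))[:, None] * phi[None, :]   # chain rule through tanh

    print(p, grad_w4, grad_W, sep="\n")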