
Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725

Marine Carpuat

Slides credit: Graham Neubig, Jacob Eisenstein

• What if we want a probability p(y|x)?

• The perceptron gives us a prediction y
• Let’s illustrate this with binary classification
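To make the contrast concrete, here is a minimal sketch (my own illustration, not from the slides; w and phi stand for an arbitrary weight vector and feature vector) of the perceptron's hard decision next to the logistic model's probability:

    import math

    def perceptron_predict(w, phi):
        # Perceptron: a hard decision, sign(w . phi(x)), in {-1, +1}
        score = sum(w_i * f_i for w_i, f_i in zip(w, phi))
        return 1 if score >= 0 else -1

    def logistic_probability(w, phi):
        # Logistic regression: P(y = +1 | x) = 1 / (1 + exp(-w . phi(x)))
        score = sum(w_i * f_i for w_i, f_i in zip(w, phi))
        return 1.0 / (1.0 + math.exp(-score))

The same score w ⋅ φ(x) drives both; only the output changes from a hard label to a probability.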

Illustrations: Graham Neubig

The logistic function

• “Softer” function than in the perceptron
• Can account for uncertainty
• Differentiable

Logistic regression: how to train?

• Train based on conditional likelihood
• Find parameters w that maximize the conditional likelihood of all answers y_i given examples x_i

Stochastic gradient ascent (or descent)
• Online training algorithm for logistic regression and other probabilistic models

• Update weights for every training example
• Move in the direction given by the gradient
• Size of the update step is scaled by the learning rate

Gradient of the logistic function

Example: Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person

Example: initial update
Example: second update

How to set the learning rate?

• Various strategies
• decay over time: α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far

• Use a held-out test set; increase the learning rate when likelihood increases

Multiclass version

Some models are better than others…
• Consider these 2 examples

• Which of the 2 models below is better?

Classifier 2 will probably generalize better!
It does not include irrelevant information
=> Smaller model is better

Regularization

• A penalty on adding extra weights

• L2 regularization: Σ_i w_i² (sum of squared weights)
• big penalty on large weights
• small penalty on small weights

• L1 regularization: Σ_i |w_i| (sum of absolute weight values)
• Uniform increase whether weights are large or small
• Will cause many weights to become exactly zero

L1 regularization in online learning

What you should know

• Standard supervised learning set-up for text classification
• Difference between train vs. test data
• How to evaluate

• 3 examples of supervised linear classifiers
• Naïve Bayes, Perceptron, Logistic Regression
• Learning as optimization: what is the objective function optimized?
• Difference between generative vs. discriminative classifiers
• Smoothing, regularization
• Overfitting, underfitting

Neural networks

Person/not-person classification problem
Given an introductory sentence in Wikipedia, predict whether the article is about a person

Formalizing binary prediction

The Perceptron: a “machine” to calculate a weighted sum

[Diagram: feature values φ“A” = 1, φ“site” = 1, φ“located” = 1, φ“Maizuru” = 1, φ“,” = 2, φ“in” = 1, φ“Kyoto” = 1, φ“priest” = 0, φ“black” = 0 are multiplied by their weights and summed; the prediction is sign(w ⋅ φ(x))]

The Perceptron: Geometric interpretation

[Figure: X and O training examples in feature space, separated by a linear decision boundary]

Limitation of perceptron

● Can only find linear separations between positive and negative examples
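As a quick sanity check (my own sketch, using the XOR-style points from the figure below and coding O as +1 and X as -1), the code brute-forces a grid of weights and biases and finds no line that classifies all four points correctly:

    import itertools

    # XOR-style points: ((feature 1, feature 2), label), with O = +1 and X = -1 (assumed encoding)
    points = [((-1, 1), -1), ((1, 1), 1), ((-1, -1), 1), ((1, -1), -1)]

    def separates(w0, w1, b):
        return all((w0 * x0 + w1 * x1 + b > 0) == (y > 0) for (x0, x1), y in points)

    grid = [i / 4.0 for i in range(-8, 9)]  # candidate weights and biases in [-2, 2]
    print(any(separates(w0, w1, b)
              for w0, w1, b in itertools.product(grid, repeat=3)))  # False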

[Figure: an XOR-like arrangement of X and O examples that no single line can separate]

Neural Networks

● Connect together multiple perceptrons

[Diagram: the same inputs (φ“A”, φ“site”, φ“located”, φ“Maizuru”, φ“,”, φ“in”, φ“Kyoto”, φ“priest”, φ“black”) feed into several connected perceptron units]

● Motivation: Can represent non-linear functions!

Neural Networks: key terms

• Input (aka features)
• Output
• Nodes
• Layers
• Hidden layers
• Activation functions (non-linear)
• Multi-layer perceptron

Example
● Create two classifiers

[Diagram: classifier 1 computes φ1[0] = sign(w0,0 ⋅ φ0(x) + b0,0) with weights w0,0 = {1, 1} and bias b0,0 = -1; classifier 2 computes φ1[1] = sign(w0,1 ⋅ φ0(x) + b0,1) with weights w0,1 = {-1, -1} and bias b0,1 = -1; the training points are φ0(x1) = {-1, 1} (X), φ0(x2) = {1, 1} (O), φ0(x3) = {-1, -1} (O), φ0(x4) = {1, -1} (X)]

Example

● These classifiers map to a new space

[Figure: the two classifiers map φ0(x1) = {-1, 1} → φ1(x1) = {-1, -1}, φ0(x2) = {1, 1} → φ1(x2) = {1, -1}, φ0(x3) = {-1, -1} → φ1(x3) = {-1, 1}, φ0(x4) = {1, -1} → φ1(x4) = {-1, -1}]

Example
● In new space, the examples are linearly separable!
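As a quick numeric check (my own sketch; the weights and point names come from the diagrams above), the code below applies the two sign classifiers to the four points and prints their coordinates in the new space; the next figure shows how a single unit then separates them:

    def sign(v):
        return 1 if v >= 0 else -1

    points = {"x1": (-1, 1), "x2": (1, 1), "x3": (-1, -1), "x4": (1, -1)}
    for name, (a, b) in points.items():
        phi1_0 = sign(1 * a + 1 * b - 1)    # classifier 1: weights {1, 1}, bias -1
        phi1_1 = sign(-1 * a - 1 * b - 1)   # classifier 2: weights {-1, -1}, bias -1
        print(name, (phi1_0, phi1_1))
    # x1 -> (-1, -1), x2 -> (1, -1), x3 -> (-1, 1), x4 -> (-1, -1)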

[Figure: in the new space, a single unit φ2[0] = y = sign(1 ⋅ φ1[0] + 1 ⋅ φ1[1] + 1) separates the O points (x2, x3) from the X points (x1, x4)]

Example wrap-up: Forward propagation
● The final net

[Diagram: the final net replaces sign with tanh: φ1[0] = tanh(1 ⋅ φ0[0] + 1 ⋅ φ0[1] − 1), φ1[1] = tanh(−1 ⋅ φ0[0] − 1 ⋅ φ0[1] − 1), φ2[0] = tanh(1 ⋅ φ1[0] + 1 ⋅ φ1[1] + 1)]

Softmax for multiclass classification

● The sigmoid function generalized to multiple classes:

P(y | x) = exp(w ⋅ φ(x, y)) / Σ_ỹ exp(w ⋅ φ(x, ỹ))

(numerator: score of the current class y; denominator: sum over all classes ỹ)

● Can be expressed using matrix/vector operations

Stochastic Gradient Descent
Online training algorithm for probabilistic models

w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α * dP(y|x)/dw
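Below is a runnable version of this loop for binary logistic regression — a sketch of my own, not code from the slides. It assumes dense feature vectors, labels y in {-1, +1}, the sigmoid gradient derived on the “Gradient of the Sigmoid Function” slide later in the deck, and the learning-rate decay α = 1/(C + t) from earlier:

    import math

    def train_logistic_sgd(data, dim, iterations=10, C=1.0):
        # data: list of (phi, y) pairs; phi is a list of floats of length dim, y is -1 or +1
        w = [0.0] * dim
        t = 0
        for _ in range(iterations):
            for phi, y in data:
                alpha = 1.0 / (C + t)                       # learning-rate decay
                z = sum(wi * fi for wi, fi in zip(w, phi))  # w . phi(x)
                s = 1.0 / (1.0 + math.exp(-z))              # P(y = +1 | x)
                grad = s * (1.0 - s)                        # = e^(w.phi) / (1 + e^(w.phi))^2
                for i in range(dim):
                    w[i] += alpha * y * grad * phi[i]       # ascend on P(y | x); sign flips for y = -1
                t += 1
        return w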

In other words:
• For every training example, calculate the gradient (the direction that will increase the probability of y)
• Move in that direction, multiplied by the learning rate α

Gradient of the Sigmoid Function

Take the derivative of the probability

dP(y = 1|x)/dw = d/dw [ e^(w ⋅ φ(x)) / (1 + e^(w ⋅ φ(x))) ] = φ(x) e^(w ⋅ φ(x)) / (1 + e^(w ⋅ φ(x)))²

dP(y = -1|x)/dw = d/dw [ 1 − e^(w ⋅ φ(x)) / (1 + e^(w ⋅ φ(x))) ] = −φ(x) e^(w ⋅ φ(x)) / (1 + e^(w ⋅ φ(x)))²

Learning: We Don't Know the Derivative for Hidden Units!
For NNs, we only know the correct tag for the last layer

For the last layer, the derivative is known:

dP(y = 1|x)/dw4 = h(x) e^(w4 ⋅ h(x)) / (1 + e^(w4 ⋅ h(x)))²

but not for the hidden units:

dP(y = 1|x)/dw1 = ?    dP(y = 1|x)/dw2 = ?    dP(y = 1|x)/dw3 = ?

[Diagram: φ(x) feeds hidden units with weights w1, w2, w3; their outputs h(x) feed the output unit with weights w4, which predicts y = 1]

Answer: Back-Propagation
Calculate the derivative with the chain rule

dP(y = 1|x)/dw1 = [dP(y = 1|x)/dh1(x)] ⋅ [dh1(x)/dw1]

where dP(y = 1|x)/dh1(x) = δ4 ⋅ w1,4:
• δ4 = e^(w4 ⋅ h(x)) / (1 + e^(w4 ⋅ h(x)))² is the error of the next unit
• w1,4 is the weight connecting unit 1 to unit 4
• dh1(x)/dw1 is the gradient of this unit

In General
Calculate δi based on the next units j:

dP(y = 1|x)/dwi = ( Σj δj wi,j ) ⋅ dhi(x)/dwi

Back-propagation = Gradient descent + Chain rule

Feed Forward Neural Nets

All connections point forward

[Diagram: inputs φ(x) flow forward through hidden units to the output y]

It is a directed acyclic graph (DAG)

Neural Networks

• Non-linear classification

• Prediction: forward propagation
• Vector/matrix operations + non-linearities

• Training: backpropagation + stochastic gradient descent

For more details, see CIML Chap 7
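To tie the pieces together, here is a minimal forward-propagation sketch of the final XOR-style net from the Example slides (tanh units; the weights are the ones reconstructed from the diagrams above, so treat this as an illustration rather than official course code):

    import math

    def forward(x0, x1):
        # hidden layer (tanh activations), weights as reconstructed from the slides
        h0 = math.tanh(1 * x0 + 1 * x1 - 1)
        h1 = math.tanh(-1 * x0 - 1 * x1 - 1)
        # output layer
        return math.tanh(1 * h0 + 1 * h1 + 1)

    for (x0, x1), label in [((-1, 1), "X"), ((1, 1), "O"), ((-1, -1), "O"), ((1, -1), "X")]:
        print((x0, x1), label, round(forward(x0, x1), 3))
    # The output is positive for the O points and negative for the X points.
    # Training these weights from data would use backpropagation (chain rule) + SGD as above.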