Machine Learning, Lecture 6
Chenhao Tan, University of Colorado Boulder
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen
Machine Learning: Chenhao Tan | Boulder

• HW1 turned in
• HW2 released
• Office hour
• Group formation signup
Overview
Feature engineering
Revisiting Logistic Regression
Feed Forward Networks
Layers for Structured Data
Feature Engineering
Republican nominee George Bush said he felt nervous as he voted today in his adopted home state of Texas, where he ended...
(From Chris Harrison's WikiViz)
Brainstorming: what features are useful for sentiment analysis?

• Unigrams
• Bigrams
• Normalizing options
• Part-of-speech tagging
• Parse-tree related features
• Negation related features
• Additional resources
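The first two feature types can be sketched as simple token counting. A minimal sketch in Python; the whitespace tokenizer and the function name are illustrative assumptions, not from the slides:

```python
from collections import Counter

def extract_features(text):
    """Count unigram and bigram features in a text.

    Real pipelines would add the other options listed above:
    normalization (lowercasing, stemming), POS tags, parse-tree
    and negation features.
    """
    tokens = text.lower().split()
    feats = Counter(tokens)                # unigrams as string keys
    feats.update(zip(tokens, tokens[1:]))  # bigrams as token-pair keys
    return feats

feats = extract_features("not good not bad")
# "not" occurs twice; ("not", "good") is one of the bigrams
```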
Sarcasm detection

“Trees died for this book?” (book)

• find high-frequency words and content words
• replace content words with “CW”
• extract patterns, e.g., “does not CW much about CW” [Tsur et al., 2010]
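The replace-content-words step of the Tsur et al. recipe can be sketched as follows. A toy sketch assuming a high-frequency word list is already available; the word list and function name here are hypothetical:

```python
def to_pattern(tokens, high_freq_words):
    """Replace content words (here: any token not in the given
    high-frequency word list) with the placeholder "CW"."""
    return " ".join(t if t in high_freq_words else "CW" for t in tokens)

# Hypothetical high-frequency list for illustration only.
high_freq = {"does", "not", "much", "about"}
pattern = to_pattern("does not care much about quality".split(), high_freq)
# pattern == "does not CW much about CW"
```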
More examples: Which one will be retweeted more?
[Tan et al., 2014] https://chenhaot.com/papers/wording-for-propagation.html
Revisiting Logistic Regression
\[
P(Y = 0 \mid x, \beta) = \frac{1}{1 + \exp\big[\beta_0 + \sum_i \beta_i x_i\big]}
\]
\[
P(Y = 1 \mid x, \beta) = \frac{\exp\big[\beta_0 + \sum_i \beta_i x_i\big]}{1 + \exp\big[\beta_0 + \sum_i \beta_i x_i\big]}
\]
\[
\mathcal{L} = -\sum_j \log P(y^{(j)} \mid x^{(j)}, \beta)
\]
• Transformation on x (we map class labels from {0, 1} to {1, 2}):
\[
l_i = \beta_i^T x, \quad i = 1, 2 \qquad
o_i = \frac{\exp l_i}{\sum_{c \in \{1,2\}} \exp l_c}, \quad i = 1, 2
\]
• Objective function (using cross entropy $-\sum_i p_i \log q_i$):
\[
\mathcal{L}(Y, \hat{Y}) = -\sum_j \Big[ P(y^{(j)} = 1) \log P(\hat{y}^{(j)} = 1 \mid x^{(j)}, \beta) + P(y^{(j)} = 0) \log P(\hat{y}^{(j)} = 0 \mid x^{(j)}, \beta) \Big]
\]
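The transformation and objective can be written out in NumPy. A minimal sketch; the example weight matrix and input are made up for illustration:

```python
import numpy as np

def softmax(l):
    e = np.exp(l - l.max())   # shift by max for numerical stability
    return e / e.sum()

def forward(x, B):
    """Logits l_i = beta_i^T x (one row of B per class), then softmax."""
    return softmax(B @ x)

def cross_entropy(p, q):
    """Cross entropy -sum_i p_i log q_i between true p and predicted q."""
    return -np.sum(p * np.log(q))

# Tiny example: two classes, two features, made-up weights.
B = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
x = np.array([2.0, 0.0])
o = forward(x, B)   # outputs o_1, o_2 sum to 1
```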
Logistic Regression as a Single-layer Neural Network
[Diagram: inputs x1, x2, …, xd feed a linear layer producing l1, l2, followed by a softmax producing outputs o1, o2]
The linear and softmax steps can equivalently be drawn as a single layer:
[Diagram: inputs x1, x2, …, xd feed a single layer producing outputs o1, o2]
Feed Forward Networks

Deep Neural Networks
A two-layer example (one hidden layer):
[Diagram: inputs x1, x2, …, xd → hidden layer → outputs o1, o2]
More layers:
[Diagram: inputs x1, x2, …, xd → Hidden 1 → Hidden 2 → Hidden 3 → outputs o1, o2]
Forward Propagation Algorithm

How do we make predictions based on a multi-layer neural network? Store the biases for layer l in b^l and the weight matrix in W^l.
[Diagram: inputs x1, x2, …, xd pass through layers with parameters (W^1, b^1), (W^2, b^2), (W^3, b^3), (W^4, b^4) to outputs o1, o2]
Suppose your network has L layers. To make a prediction on a test point x:
1: Initialize a^0 = x
2: for l = 1 to L do
3:   z^l = W^l a^{l−1} + b^l
4:   a^l = g(z^l)
5: end for
6: The prediction ŷ is simply a^L
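The loop above translates directly to NumPy. A minimal sketch; tanh stands in for the activation g, and the example parameters are made up:

```python
import numpy as np

def forward_propagation(x, weights, biases, g=np.tanh):
    """L-layer forward pass: a^0 = x; a^l = g(W^l a^{l-1} + b^l).

    `weights` and `biases` are lists [W^1, ..., W^L] and [b^1, ..., b^L].
    """
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = g(z)
    return a   # the prediction y_hat is a^L

# Tiny 2-layer network (2 -> 3 -> 1) with made-up parameters.
Ws = [np.ones((3, 2)), np.ones((1, 3))]
bs = [np.zeros(3), np.zeros(1)]
y_hat = forward_propagation(np.array([0.5, -0.5]), Ws, bs)
```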
Nonlinearity

What happens if there is no nonlinearity? Linear combinations of linear combinations are still linear combinations.
Neural networks in a nutshell
• Training data: S_train = {(x, y)}
• Network architecture (model): ŷ = f_w(x)
• Loss function (objective function): L(y, ŷ)
• Learning (next lecture)
Nonlinearity Options
• Sigmoid: f(x) = 1 / (1 + exp(−x))
• tanh: f(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
• ReLU (rectified linear unit): f(x) = max(0, x)
• Softmax: f(x)_i = exp(x_i) / Σ_j exp(x_j)
https://keras.io/activations/
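These options can be written directly in NumPy; a sketch mirroring the definitions above (the max-shift in softmax is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by max for numerical stability
    return e / e.sum()
```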
[Figure: plots of the activation function options]
Loss Function Options
• ℓ2 loss: Σ_i (y_i − ŷ_i)^2
• ℓ1 loss: Σ_i |y_i − ŷ_i|
• Cross entropy: −Σ_i y_i log ŷ_i
• Hinge loss (more on this during SVM): max(0, 1 − y ŷ)
https://keras.io/losses/
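The four options above, written as a NumPy sketch (for hinge loss, y is assumed to be in {−1, +1} and ŷ a real-valued score, the usual SVM convention):

```python
import numpy as np

def l2_loss(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def l1_loss(y, y_hat):
    return np.sum(np.abs(y - y_hat))

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

def hinge_loss(y, y_hat):
    # y in {-1, +1}, y_hat a real-valued score
    return np.maximum(0.0, 1.0 - y * y_hat)
```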
A Perceptron Example

x = (x1, x2), y = f(x1, x2)
[Diagram: inputs x1, x2 and a bias b feed a single unit producing output o1]
We consider a simple step activation function:
\[
f(z) = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0 \end{cases}
\]
Simple Example: Can we learn OR?

x1:          0 1 0 1
x2:          0 0 1 1
y = x1 ∨ x2: 0 1 1 1

Yes: w = (1, 1), b = −0.5.
[Diagram: inputs x1, x2 and bias b feed a single unit producing output o1]
Simple Example: Can we learn AND?

x1:          0 1 0 1
x2:          0 0 1 1
y = x1 ∧ x2: 0 0 0 1

Yes: w = (1, 1), b = −1.5.
[Diagram: inputs x1, x2 and bias b feed a single unit producing output o1]
Simple Example: Can we learn NAND?

x1:             0 1 0 1
x2:             0 0 1 1
y = ¬(x1 ∧ x2): 1 0 0 0

Yes: w = (−1, −1), b = 0.5.
[Diagram: inputs x1, x2 and bias b feed a single unit producing output o1]
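The three solutions can be checked mechanically with the step activation (a small sketch; `perceptron` is an illustrative name):

```python
import numpy as np

def perceptron(x, w, b):
    """Single unit with the step activation f(z) = 1 if z >= 0 else 0."""
    return 1 if np.dot(w, x) + b >= 0 else 0

inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]

or_out   = [perceptron(x, np.array([1, 1]),   -0.5) for x in inputs]
and_out  = [perceptron(x, np.array([1, 1]),   -1.5) for x in inputs]
nand_out = [perceptron(x, np.array([-1, -1]),  0.5) for x in inputs]
# or_out == [0, 1, 1, 1], and_out == [0, 0, 0, 1], nand_out == [1, 0, 0, 0]
```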
Simple Example: Can we learn XOR?

x1:        0 1 0 1
x2:        0 0 1 1
x1 XOR x2: 0 1 1 0

NOPE! But why? The single-layer perceptron is just a linear classifier, and can only learn things that are linearly separable. How can we fix this?
A Perceptron Example
Increase the number of layers.
x1:        0 1 0 1
x2:        0 0 1 1
x1 XOR x2: 0 1 1 0

\[
W^1 = \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix}, \quad
b^1 = \begin{pmatrix} -0.5 \\ 1.5 \end{pmatrix}, \quad
W^2 = \begin{pmatrix} 1 & 1 \end{pmatrix}, \quad
b^2 = -1.5
\]
With the step activation, the hidden units compute h1 = OR(x1, x2) and h2 = NAND(x1, x2), and the output unit computes AND(h1, h2), which is exactly XOR.
[Diagram: inputs x1, x2 → hidden units h1, h2 → output o1, with bias nodes b]
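These weights can be verified directly; a short sketch using the step activation from the perceptron slides:

```python
import numpy as np

def step(z):
    """Step activation applied elementwise: 1 if z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

W1 = np.array([[1, 1], [-1, -1]]); b1 = np.array([-0.5, 1.5])
W2 = np.array([1, 1]);             b2 = -1.5

def xor_net(x):
    h = step(W1 @ np.array(x) + b1)   # h1 = OR(x1, x2), h2 = NAND(x1, x2)
    return int((W2 @ h + b2) >= 0)    # output = AND(h1, h2)

out = [xor_net(x) for x in [(0, 0), (1, 0), (0, 1), (1, 1)]]
# out == [0, 1, 1, 0], i.e. XOR
```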
General Expressiveness of Neural Networks
Neural networks with a single hidden layer can approximate any measurable functions [Hornik et al., 1989, Cybenko, 1989].
Layers for Structured Data
Structured data: spatial information
https://www.reddit.com/r/aww/comments/6ip2la/before_and_after_she_was_told_she_was_a_good_girl/
Convolutional Layers
Sharing parameters across patches.
[Diagram: an input image (or input feature map) convolved into output feature maps]
https://github.com/davidstutz/latex-resources/blob/master/tikz-convolutional-layer/convolutional-layer.tex
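Parameter sharing across patches can be illustrated with a naive valid convolution. A sketch, not an efficient library implementation; as in most deep learning frameworks, it is technically cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: the same kernel weights are shared
    across every patch of the input."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge = np.array([[1.0, -1.0]])          # tiny horizontal-edge kernel
img = np.array([[0.0, 0.0, 1.0, 1.0]])
response = conv2d(img, edge)            # responds at the 0 -> 1 boundary
```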
Structured data: sequential information

“My words fly up, my thoughts remain below: Words without thoughts never to heaven go.” —Hamlet

• language
• activity history

x = (x1, …, xT)
Recurrent Layers
Sharing parameters along a sequence
h_t = f(x_t, h_{t−1})
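The recurrence h_t = f(x_t, h_{t−1}) with shared parameters can be sketched with a simple tanh cell (one common concrete choice of f; the example sizes and weights are made up):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step h_t = f(x_t, h_{t-1}); here f is a tanh cell."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def rnn(xs, W_x, W_h, b, h0):
    h = h0
    for x_t in xs:   # the same W_x, W_h, b are reused at every step t
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h

# Tiny example: 1-d inputs, 2-d hidden state, made-up weights.
W_x = np.ones((2, 1)); W_h = np.zeros((2, 2)); b = np.zeros(2)
h = rnn([np.array([0.0]), np.array([0.0])], W_x, W_h, b, np.zeros(2))
```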
A widely used recurrent layer of this form is the long short-term memory (LSTM).
What is missing?
• How to find good weights?
• How to make the model work (regularization, architecture, etc.)?
References (1)
George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Chenhao Tan, Lillian Lee, and Bo Pang. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of ACL, 2014.

Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM-A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews. In Proceedings of ICWSM, 2010.