LECTURE 6
Chenhao Tan, University of Colorado Boulder

Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

• HW1 turned in
• HW2 released
• Office hour
• Group formation signup

Overview

Feature engineering

Revisiting Logistic Regression

Feed Forward Networks

Layers for Structured Data
Feature engineering

Feature Engineering

Republican nominee George Bush said he felt nervous as he voted today in his adopted home state of Texas, where he ended...


(From Chris Harrison's WikiViz)


Brainstorming: What are features useful for sentiment analysis?

What are features useful for sentiment analysis?
• Unigram
• Bigram
• Normalizing options
• Part-of-speech tagging
• Parse-tree related features
• Negation related features
• Additional resources
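As a rough sketch of the first two items, unigram and bigram counts can be extracted with scikit-learn's CountVectorizer (the toy reviews and settings below are assumptions for illustration, not part of the lecture):

```python
# Minimal sketch: unigram + bigram count features for sentiment analysis.
# The toy "reviews" are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["this movie was great", "this movie was not great at all"]

# ngram_range=(1, 2) extracts unigrams and bigrams; lowercasing is one
# simple "normalizing option".
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())  # learned unigram/bigram vocabulary
print(X.toarray())                         # one count vector per review
```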

Sarcasm detection

"Trees died for this book?" (book)

• find high-frequency words and content words
• replace content words with "CW"
• extract patterns, e.g., "does not CW much about CW" [Tsur et al., 2010]
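A small sketch of the idea (not the exact procedure of Tsur et al. [2010]): keep high-frequency words and mask the rest with "CW", so repeated templates become visible. The word list below is a made-up stand-in.

```python
# Illustrative sketch: replace content words with "CW" to expose patterns.
# HIGH_FREQ is a made-up stand-in for words found to be high-frequency.
HIGH_FREQ = {"does", "not", "much", "about", "this", "for", "a", "the", "is"}

def to_pattern(sentence):
    tokens = sentence.lower().rstrip("?.!").split()
    return " ".join(t if t in HIGH_FREQ else "CW" for t in tokens)

print(to_pattern("does not say much about the plot"))  # -> does not CW much about the CW
print(to_pattern("Trees died for this book?"))         # -> CW CW for this CW
```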


More examples: Which one will be retweeted more?

[Tan et al., 2014] https://chenhaot.com/papers/wording-for-propagation.html

Revisiting Logistic Regression


P(Y = 0 | x, β) = 1 / (1 + exp[β_0 + Σ_i β_i x_i])

P(Y = 1 | x, β) = exp[β_0 + Σ_i β_i x_i] / (1 + exp[β_0 + Σ_i β_i x_i])

L = −Σ_j log P(y^(j) | x^(j), β)
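A small numerical sketch of these formulas in numpy (the weights and data are made up; x is assumed not to include the bias term):

```python
import numpy as np

# Made-up parameters and data for illustration.
beta0 = -0.5
beta = np.array([1.0, -2.0])
X = np.array([[0.3, 0.1],
              [1.2, 0.4]])                 # two examples x^(1), x^(2)
y = np.array([1, 0])                       # their labels

z = beta0 + X @ beta                       # beta_0 + sum_i beta_i x_i
p1 = np.exp(z) / (1 + np.exp(z))           # P(Y = 1 | x, beta)
p0 = 1 - p1                                # P(Y = 0 | x, beta)

# Negative log-likelihood: L = - sum_j log P(y^(j) | x^(j), beta)
L = -np.sum(np.where(y == 1, np.log(p1), np.log(p0)))
print(p1, L)
```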


• Transformation on x (we map class labels from {0, 1} to {1, 2}):

l_i = β_i^T x,  i = 1, 2

o_i = exp(l_i) / Σ_{c∈{1,2}} exp(l_c),  i = 1, 2

• Objective function (using cross entropy −Σ_i p_i log q_i):

L(Y, Ŷ) = −Σ_j [ P(y^(j) = 1) log P(ŷ^(j) = 1 | x^(j), β) + P(y^(j) = 0) log P(ŷ^(j) = 0 | x^(j), β) ]
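The same prediction written in the two-logit softmax form, with the cross-entropy objective for a single example (the weight matrix is an arbitrary placeholder; the bias is folded in by appending a constant 1 to x):

```python
import numpy as np

x = np.array([0.3, 0.1, 1.0])              # input with a constant 1 appended for the bias
B = np.array([[ 0.2, -0.4],
              [ 0.5,  0.1],
              [-0.3,  0.6]])               # column i holds beta_i, so l_i = beta_i^T x

l = x @ B                                  # logits l_1, l_2
o = np.exp(l) / np.exp(l).sum()            # softmax: o_i = exp(l_i) / sum_c exp(l_c)

p = np.array([1.0, 0.0])                   # one-hot "true" distribution (class 1)
loss = -np.sum(p * np.log(o))              # cross entropy  - sum_i p_i log q_i
print(o, loss)
```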


Logistic Regression as a Single-layer Neural Network

[Diagram: inputs x1, ..., xd feed into a linear layer producing l1, l2, followed by a softmax layer producing o1, o2.]

[Diagram: the same network drawn as a single layer mapping inputs x1, ..., xd directly to outputs o1, o2.]

Feed Forward Networks


Deep Neural networks

A two-layer example (one hidden layer):

[Diagram: inputs x1, ..., xd feed into a hidden layer, which feeds into outputs o1, o2.]


More layers:

[Diagram: inputs x1, ..., xd pass through Hidden 1, Hidden 2, and Hidden 3 before the output layer producing o1, o2.]


Forward propagation algorithm

How do we make predictions based on a multi-layer neural network?

Store the biases for layer l in b^l and the weight matrix in W^l.

[Diagram: a four-layer network with parameters (W^1, b^1), (W^2, b^2), (W^3, b^3), (W^4, b^4) mapping inputs x1, ..., xd to outputs o1, o2.]


Suppose your network has L layers. To make a prediction on a test point x:

1: Initialize a^0 = x
2: for l = 1 to L do
3:     z^l = W^l a^(l−1) + b^l
4:     a^l = g(z^l)
5: end for
6: The prediction ŷ is simply a^L
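A direct numpy translation of this loop (a sketch: the layer sizes and weights are random placeholders, and g is taken to be ReLU):

```python
import numpy as np

def g(z):
    return np.maximum(0, z)                 # nonlinearity g (ReLU here)

def forward(x, weights, biases):
    """a^0 = x; a^l = g(W^l a^(l-1) + b^l) for l = 1..L; prediction is a^L."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = g(z)
    return a

# Tiny made-up network: 4 inputs -> 3 hidden -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(rng.normal(size=4), weights, biases))
```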

Nonlinearity

What happens if there is no nonlinearity? Linear combinations of linear combinations are still linear combinations, so stacking linear layers without a nonlinearity gives no more expressive power than a single linear layer.

Neural networks in a nutshell

• Training data: S_train = {(x, y)}
• Network architecture (model): ŷ = f_w(x)
• Loss function (objective function): L(y, ŷ)
• Learning (next lecture)


Nonlinearity Options

• Sigmoid: f(x) = 1 / (1 + exp(−x))
• tanh: f(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x))
• ReLU (rectified linear unit): f(x) = max(0, x)
• Softmax: softmax(x)_i = exp(x_i) / Σ_j exp(x_j)

https://keras.io/activations/
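These options written out in numpy (a sketch; shifting by the max inside softmax is only for numerical stability):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # equals np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - np.max(x))               # subtract max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z))
```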


Loss Function Options

• ℓ2 loss: Σ_i (y_i − ŷ_i)^2
• ℓ1 loss: Σ_i |y_i − ŷ_i|
• Cross entropy: −Σ_i y_i log ŷ_i
• Hinge loss (more on this during SVM): max(0, 1 − y ŷ)

https://keras.io/losses/
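The same loss options in numpy (a sketch with placeholder targets and predictions; hinge loss assumes labels in {−1, +1} in practice):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])               # placeholder targets
yhat = np.array([0.9, 0.2, 0.6])            # placeholder predictions

l2 = np.sum((y - yhat) ** 2)                # l2 loss
l1 = np.sum(np.abs(y - yhat))               # l1 loss
xent = -np.sum(y * np.log(yhat))            # cross entropy
hinge = np.maximum(0, 1 - y * yhat)         # hinge loss (per example)
print(l2, l1, xent, hinge)
```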

A Perceptron Example

x = (x1, x2), y = f(x1, x2)

[Diagram: a single perceptron with inputs x1, x2, bias b, and output o1.]

We consider a simple step activation function:

f(z) = 1 if z ≥ 0,  0 if z < 0

Simple Example: Can we learn OR?

x1:        0 1 0 1
x2:        0 0 1 1
x1 ∨ x2:   0 1 1 1

w = (1, 1), b = −0.5

Simple Example: Can we learn AND?

x1:        0 1 0 1
x2:        0 0 1 1
x1 ∧ x2:   0 0 0 1

w = (1, 1), b = −1.5

Simple Example: Can we learn NAND?

x1:          0 1 0 1
x2:          0 0 1 1
¬(x1 ∧ x2):  1 0 0 0

w = (−1, −1), b = 0.5
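A quick check of these three weight settings using the step activation defined above (a sketch):

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)              # f(z) = 1 if z >= 0, else 0

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

for name, w, b in [("OR",   np.array([1, 1]),   -0.5),
                   ("AND",  np.array([1, 1]),   -1.5),
                   ("NAND", np.array([-1, -1]),  0.5)]:
    print(name, step(X @ w + b))  # OR: [0 1 1 1], AND: [0 0 0 1], NAND: [1 0 0 0]
```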

Simple Example: Can we learn XOR?

x1:         0 1 0 1
x2:         0 0 1 1
x1 XOR x2:  0 1 1 0

NOPE! But why? The single-layer perceptron is just a linear classifier, and can only learn things that are linearly separable. How can we fix this?


Increase the number of layers.

[Diagram: a two-layer network; inputs x1, x2 feed hidden units h1, h2, which feed the output o1.]

W1 = [[1, 1], [−1, −1]],  b1 = (−0.5, 1.5)
W2 = (1, 1),  b2 = −1.5
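Plugging these weights into the forward pass (with the same step activation) confirms that the two-layer network computes XOR:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(int)

W1 = np.array([[ 1,  1],
               [-1, -1]])
b1 = np.array([-0.5, 1.5])
W2 = np.array([1, 1])
b2 = -1.5

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    h = step(W1 @ np.array(x) + b1)           # h1 = OR(x1, x2), h2 = NAND(x1, x2)
    o = step(W2 @ h + b2)                     # o1 = AND(h1, h2) = XOR(x1, x2)
    print(x, int(o))                          # 0, 1, 1, 0
```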


General Expressiveness of Neural Networks

Neural networks with a single hidden layer can approximate any measurable function [Hornik et al., 1989, Cybenko, 1989].

Layers for Structured Data


Structured data

Spatial information

https://www.reddit.com/r/aww/comments/6ip2la/before_and_after_she_was_told_she_was_a_good_girl/


Convolutional Layers

Sharing parameters across patches.

[Diagram: an input image (or input feature map) is convolved to produce output feature maps.]

https://github.com/davidstutz/latex-resources/blob/master/tikz-convolutional-layer/convolutional-layer.tex
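A minimal sketch of parameter sharing across patches: one small kernel is applied at every position of a single-channel input (no padding, stride 1; the input and kernel values are made up):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the same kernel over every patch of the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # same weights reused for every patch
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # made-up 5x5 input feature map
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # made-up 2x2 filter
print(conv2d_valid(image, kernel))                 # 4x4 output feature map
```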

Structured data

Sequential information: "My words fly up, my thoughts remain below: Words without thoughts never to heaven go." —Hamlet

• language
• activity history

x = (x_1, ..., x_T)

Recurrent Layers

Sharing parameters along a sequence

h_t = f(x_t, h_{t−1})
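One common instantiation of f is a vanilla tanh RNN cell; a sketch with made-up sizes and random weights (this is not yet the LSTM mentioned below):

```python
import numpy as np

# Vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b), one common choice of f.
rng = np.random.default_rng(0)
d_in, d_h = 3, 4                                   # made-up input / hidden sizes
W_x = rng.normal(size=(d_h, d_in))
W_h = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

def rnn_step(x_t, h_prev):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)   # same parameters reused at every step t

h = np.zeros(d_h)                                  # h_0
for x_t in rng.normal(size=(5, d_in)):             # a length-5 sequence x_1, ..., x_T
    h = rnn_step(x_t, h)
print(h)
```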

One such recurrent layer: long short-term memory (LSTM).


What is missing?

• How to find good weights?
• How to make the model work (regularization, architecture, etc.)?


References

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Chenhao Tan, Lillian Lee, and Bo Pang. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of ACL, 2014.

Oren Tsur, Dmitry Davidov, and Ari Rappoport. ICWSM - A Great Catchy Name: Semi-Supervised Recognition of Sarcastic Sentences in Online Product Reviews. In Proceedings of ICWSM, 2010.
