
PROBABILISTIC MODELS FOR STRUCTURED DATA

02: Naïve Bayes and Logistic Regression

Instructor: Yizhou Sun [email protected]

January 10, 2019

Content

• Probabilistic Models for I.I.D. Data

• Naïve Bayes

• Logistic Regression

• Generative Models and Discriminative Models

• Summary

I.I.D. Data

• Data: D = {(x_i, y_i)}_{i=1}^n
  • A data point (x_i, y_i) contains a feature vector x_i and a label y_i
  • n: number of data points
• Assume data points are independent and identically distributed (i.i.d.)
• Model under the i.i.d. assumption
  • p(D | θ) = ∏_i p(x_i, y_i | θ) (if modeling the joint distribution)
  • p(D | θ) = ∏_i p(y_i | x_i, θ) (if modeling the conditional distribution; conditionally i.i.d.)
• Inference under the i.i.d. assumption
  • Inference can be made for each data point independently
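As a minimal illustration (the Bernoulli label model and the data below are assumed purely for the example), the i.i.d. factorization turns the dataset log-likelihood into a sum of per-point terms:

    import numpy as np

    # Toy conditional model: p(y=1 | x, theta) = theta (ignores x); labels are 0/1.
    def log_p_point(y, theta):
        return np.log(theta) if y == 1 else np.log(1 - theta)

    def log_likelihood(labels, theta):
        # i.i.d. assumption: log p(D | theta) = sum_i log p(y_i | x_i, theta)
        return sum(log_p_point(y, theta) for y in labels)

    labels = [1, 0, 1, 1]
    print(log_likelihood(labels, theta=0.7))   # product of probabilities becomes a sum of logs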

Content

• Probabilistic Models for I.I.D. Data

• Naïve Bayes

• Logistic Regression

• Generative Models and Discriminative Models

• Summary

Naïve Bayes for Text

• Text Data

• Revisit of Multinomial Distribution

• Multinomial Naïve Bayes

Text Data

• Word/term
• Document
  • A sequence of words
• Corpus
  • A collection of documents

Text Classification Applications

• Spam detection, e.g., an email such as:

  From: [email protected]
  Subject: Loan Offer
  Do you need a personal or business loan urgent that can be process within 2 to 3 working days? Have you been frustrated so many times by your banks and other loan firm and you don't know what to do? Here comes the Good news Deutsche Bank Financial Business and Home Loan is here to offer you any kind of loan you need at an affordable interest rate of 3% If you are interested let us know.

• Sentiment analysis

Represent a Document

• A document d is represented by a sequence of words selected from a vocabulary
  • w_d = (w_d1, w_d2, …, w_dN_d), where w_di is the id of the i-th word in document d and N_d is the length of document d
• A bag-of-words representation
  • x_d = (x_d1, x_d2, …, x_dN), where x_dn is the count of the n-th vocabulary word in document d
  • x_dn = ∑_i 1(w_di == n)

Example

[Table omitted: bag-of-words count vectors x_d for sample documents c1–c5 and m1–m4.]
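As a concrete sketch of the bag-of-words representation (the vocabulary and document below are made up for illustration):

    from collections import Counter

    vocab = ["loan", "bank", "rate", "meeting", "project"]   # toy vocabulary
    word_to_id = {w: n for n, w in enumerate(vocab)}

    def bag_of_words(words):
        # x_dn = number of times the n-th vocabulary word occurs in the document
        counts = Counter(words)
        return [counts.get(w, 0) for w in vocab]

    doc = ["loan", "rate", "loan", "bank"]   # w_d: the word sequence of one document
    print(bag_of_words(doc))                 # -> [2, 1, 1, 0, 0]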

Naïve Bayes for Text

• Text Data

• Revisit of Multinomial Distribution

• Multinomial Naïve Bayes

Bernoulli and Categorical Distribution

• Bernoulli distribution
  • Discrete distribution that takes two values {0, 1}
  • P(X = 1) = p and P(X = 0) = 1 − p
  • E.g., toss a coin with head and tail
• Categorical distribution
  • Discrete distribution that takes more than two values, i.e., x ∈ {1, …, K}
  • Also called generalized Bernoulli distribution or multinoulli distribution
  • P(X = k) = p_k and ∑_k p_k = 1
  • E.g., get 1–6 from a die, each with probability 1/6

Binomial and Multinomial Distribution

• Binomial distribution
  • Number of successes x (i.e., total number of 1's) in n trials of independent Bernoulli distributions with probability p
  • P(X = x) = (n choose x) p^x (1 − p)^{n−x}
• Multinomial distribution (multivariate random variable)
  • Repeat n trials of independent categorical distributions
  • Let x_k be the number of times value k has been observed; note ∑_k x_k = n
  • P(X_1 = x_1, X_2 = x_2, …, X_K = x_K) = (n! / (x_1! x_2! … x_K!)) ∏_k p_k^{x_k}

Naïve Bayes for Text

• Text Data

• Revisit of Multinomial Distribution

• Multinomial Naïve Bayes

Bayes' Theorem: Basics

• Bayes' Theorem: P(h|X) = P(X|h) P(h) / P(X)
  • Let X be a data sample ("evidence")
  • Let h be a hypothesis that X belongs to class C
• P(h) (prior probability): the probability of hypothesis h
  • E.g., the probability of the "spam" class
• P(X|h) (likelihood): the probability of observing the sample X, given that the hypothesis holds
  • E.g., the probability of an email given that it is spam
• P(X): marginal probability that the sample data is observed
  • P(X) = ∑_h P(X|h) P(h)
• P(h|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X
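A quick numeric sketch of the rule above for the spam setting (the prior and likelihood values are invented for illustration):

    # Toy spam example: h = "spam", X = "email contains the word 'loan'"
    p_spam = 0.2                 # prior P(h)
    p_loan_given_spam = 0.6      # likelihood P(X|h)
    p_loan_given_ham = 0.05      # likelihood P(X| not h)

    # Marginal P(X) = sum over hypotheses of P(X|h) P(h)
    p_loan = p_loan_given_spam * p_spam + p_loan_given_ham * (1 - p_spam)

    # Posterior P(h|X) by Bayes' theorem
    p_spam_given_loan = p_loan_given_spam * p_spam / p_loan
    print(round(p_spam_given_loan, 3))   # 0.12 / 0.16 = 0.75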

Classification: Choosing Hypotheses

• Maximum a posteriori (maximize the posterior):
  • Useful observation: it does not depend on the denominator P(X)

h_MAP = argmax_{h∈H} P(h | X) = argmax_{h∈H} P(X | h) P(h)

Classification by Maximum A Posteriori

• Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x_1, x_2, …, x_p)

• Suppose there are m classes, y ∈ {1, 2, …, m}

• Classification is to derive the maximum a posteriori, i.e., the maximal P(y = j | x)

• This can be derived from Bayes' theorem:
  p(y = j | x) = p(x | y = j) p(y = j) / p(x)

• Since p(x) is constant for all classes, only p(x | y) p(y) needs to be maximized

Now Come to the Text Setting: Modeling

• A document is represented as
  • w_d = (w_d1, w_d2, …, w_dN_d)
  • w_di is the i-th word of d and N_d is the length of document d
• Model p(w_d | y) for class y
  • Each word w_di in the sequence is sampled independently from a multinoulli distribution with parameter vector β_y = (β_y1, β_y2, …, β_yN)
  • p(w_di | y) = β_{y, w_di} and p(w_d | y) = ∏_i β_{y, w_di} = ∏_n β_yn^{x_dn},
    where x_dn is the count of the n-th vocabulary word in document d
• Model p(y = j)
  • Follows a categorical distribution with parameter vector π = (π_1, π_2, …, π_m), i.e.,
  • p(y = j) = π_j

Classification Process Assuming Parameters Are Given: Inference

• Find y that maximizes p(y | x_d), which is equivalent to maximizing p(x_d, y):
  y* = argmax_y p(x_d, y)
     = argmax_y p(x_d | y) p(y)
     = argmax_y ∏_n β_yn^{x_dn} × π_y
     = argmax_y ∑_n x_dn log β_yn + log π_y
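A minimal Python sketch of this inference step (the β and π values below are placeholders, not estimates from any real corpus):

    import numpy as np

    log_beta = np.log(np.array([   # log β_yn, one row per class, one column per vocabulary word
        [0.5, 0.3, 0.1, 0.1],      # class 0
        [0.1, 0.1, 0.4, 0.4],      # class 1
    ]))
    log_pi = np.log(np.array([0.6, 0.4]))   # log π_y

    def classify(x):
        # argmax_y  sum_n x_dn * log β_yn + log π_y
        scores = log_beta @ x + log_pi
        return int(np.argmax(scores))

    x_d = np.array([3, 1, 0, 0])   # bag-of-words counts of a test document
    print(classify(x_d))           # -> 0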

Parameter Estimation via MLE: Learning

• Given a corpus and a label for each document
  • D = {(x_d, y_d)}
• Find the MLE estimators for Θ = (β_1, β_2, …, β_m, π)
• The log-likelihood function for the training dataset:
  log L(Θ) = log ∏_d p(x_d, y_d | Θ) = ∑_d log p(x_d, y_d | Θ)
           = ∑_d log p(x_d | y_d) p(y_d)
           = ∑_d (∑_n x_dn log β_{y_d, n} + log π_{y_d})
• The optimization problem:
  max_Θ log L(Θ)
  s.t. π_j ≥ 0 and ∑_j π_j = 1
       β_jn ≥ 0 and ∑_n β_jn = 1 for all j

Solve the Optimization Problem

• Use the Lagrange multiplier method
• Solution:

  • β̂_jn = ∑_{d: y_d = j} x_dn / ∑_{d: y_d = j} ∑_{n'} x_dn'
    • ∑_{d: y_d = j} x_dn: total count of word n in class j
    • ∑_{d: y_d = j} ∑_{n'} x_dn': total count of words in class j
  • π̂_j = ∑_d 1(y_d == j) / |D|
    • 1(y_d == j) is the indicator function, which equals 1 if y_d = j holds
    • |D|: total number of documents

Smoothing

• What if some word n does not appear in some class j in the training dataset?
  • β̂_jn = ∑_{d: y_d = j} x_dn / ∑_{d: y_d = j} ∑_{n'} x_dn' = 0
  • ⇒ p(x_d | y = j) ∝ ∏_n β_jn^{x_dn} = 0
  • But other words may give a strong indication that the document belongs to class j
• Solution: add-1 smoothing or Laplace smoothing
  • β̂_jn = (∑_{d: y_d = j} x_dn + 1) / (∑_{d: y_d = j} ∑_{n'} x_dn' + N)
  • N: total number of words in the vocabulary
  • Check: ∑_n β̂_jn = 1?
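A minimal Python sketch of this learning step with add-1 smoothing (the count matrix and labels below are a toy illustration):

    import numpy as np

    X = np.array([[2, 1, 0],       # bag-of-words counts, one row per document
                  [1, 0, 1],
                  [0, 2, 2]])
    y = np.array([0, 0, 1])        # class label of each document
    m, N = 2, X.shape[1]           # number of classes, vocabulary size

    pi_hat = np.array([(y == j).mean() for j in range(m)])        # π̂_j
    beta_hat = np.zeros((m, N))
    for j in range(m):
        counts = X[y == j].sum(axis=0)                            # ∑_{d: y_d = j} x_dn
        beta_hat[j] = (counts + 1) / (counts.sum() + N)           # add-1 smoothing
    print(pi_hat)                  # [0.667, 0.333]
    print(beta_hat.sum(axis=1))    # each row sums to 1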

Example

• Data: [training documents table omitted; 4 training documents: 3 labeled c with 8 words in total, 1 labeled j with 3 words in total]
• Vocabulary:

  Index: 1        2        3         4      5      6
  Word:  Chinese  Beijing  Shanghai  Macao  Tokyo  Japan

• Learned parameters (with smoothing):

  β̂_c,Chinese  = (5 + 1) / (8 + 6) = 3/7     β̂_j,Chinese  = (1 + 1) / (3 + 6) = 2/9
  β̂_c,Beijing  = (1 + 1) / (8 + 6) = 1/7     β̂_j,Beijing  = (0 + 1) / (3 + 6) = 1/9
  β̂_c,Shanghai = (1 + 1) / (8 + 6) = 1/7     β̂_j,Shanghai = (0 + 1) / (3 + 6) = 1/9
  β̂_c,Macao    = (1 + 1) / (8 + 6) = 1/7     β̂_j,Macao    = (0 + 1) / (3 + 6) = 1/9
  β̂_c,Tokyo    = (0 + 1) / (8 + 6) = 1/14    β̂_j,Tokyo    = (1 + 1) / (3 + 6) = 2/9
  β̂_c,Japan    = (0 + 1) / (8 + 6) = 1/14    β̂_j,Japan    = (1 + 1) / (3 + 6) = 2/9

  π̂_c = 3/4    π̂_j = 1/4

Example (Continued)

• Classification stage
  • For the test document d = 5 (Chinese ×3, Tokyo ×1, Japan ×1), compute:
  • p(y = c | x_5) ∝ p(y = c) ∏_n β_cn^{x_5n} = 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
  • p(y = j | x_5) ∝ p(y = j) ∏_n β_jn^{x_5n} = 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001
• Conclusion: x_5 should be classified into class c
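For completeness, a quick Python check of the arithmetic above:

    p_c = 3/4 * (3/7)**3 * (1/14) * (1/14)   # score for class c
    p_j = 1/4 * (2/9)**3 * (2/9) * (2/9)     # score for class j
    print(round(p_c, 4), round(p_j, 4))      # 0.0003 0.0001  -> class c wins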

A More General Naïve Bayes Framework

• Let D be a training set of tuples and their class labels, where each tuple is represented by a p-dimensional attribute vector x = (x_1, x_2, …, x_p)
• Suppose there are m classes, y ∈ {1, 2, …, m}
• Goal: find y = argmax_y p(y | x) = p(y, x) / p(x) ∝ p(x | y) p(y)
• A simplifying assumption: attributes are conditionally independent given the class (class-conditional independence):
  • p(x | y) = ∏_k p(x_k | y)
• p(x_k | y) can follow any distribution,
  • e.g., Gaussian, Bernoulli, categorical, …

Content

• Probabilistic Models for I.I.D. Data

• Naïve Bayes

• Logistic Regression

• Generative Models and Discriminative Models

• Summary

Linear Regression vs. Logistic Regression

• Linear Regression (prediction)
  • Y: continuous value in (−∞, +∞)
  • y = x^T β = β_0 + x_1 β_1 + x_2 β_2 + ⋯ + x_p β_p
  • y | x, β ~ N(x^T β, σ²)
• Logistic Regression (classification)
  • Y: discrete value from m classes
  • P(Y = j | x, β) ∈ [0, 1] and ∑_j P(Y = j | x, β) = 1

Logistic Function

• Logistic function / sigmoid function: σ(x) = 1 / (1 + e^{−x})

• Note: σ'(x) = σ(x)(1 − σ(x))

Modeling Probabilities of Two Classes

• P(Y = 1 | x, β) = σ(x^T β) = exp(x^T β) / (1 + exp(x^T β)) = 1 / (1 + exp(−x^T β))
• P(Y = 0 | x, β) = 1 − σ(x^T β) = exp(−x^T β) / (1 + exp(−x^T β)) = 1 / (1 + exp(x^T β))

• Here β = (β_0, β_1, …, β_p)^T
• In other words:
  • y | x, β ~ Bernoulli(σ(x^T β))
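A minimal Python sketch of this model (the weight and feature values are arbitrary illustration numbers; the leading 1 in x plays the role of the intercept β_0):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    beta = np.array([-1.0, 2.0, 0.5])        # (β_0, β_1, β_2)
    x = np.array([1.0, 0.8, -0.3])           # leading 1 for the intercept
    p1 = sigmoid(x @ beta)                   # P(Y = 1 | x, β)
    print(round(p1, 3), round(1 - p1, 3))    # the two class probabilities sum to 1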

The 1-d Situation

• P(Y = 1 | x, β_0, β_1) = σ(β_0 + β_1 x)

[Figure: Prob(Y = 1 | x, β_0, β_1) plotted against x for several parameter settings, e.g. (β_0 = 0, β_1 = 1), (β_0 = 0, β_1 = 2), (β_0 = 5, β_1 = 2).]

Example

Q: What is β_0 here?

Classification Assuming Parameters Are Given: Inference

• If P(Y = 1 | x, β) = σ(x^T β) > 0.5
  • Class 1
• Otherwise
  • Class 0

Parameter Estimation: Learning

• MLE estimation
  • Given a dataset D with n data points
  • For a single data object i with attributes x_i and class label y_i
  • Let p_i = P(Y = 1 | x_i, β), the probability that i is in class 1
  • The probability of observing y_i would be:
    • If y_i = 1, then p_i
    • If y_i = 0, then 1 − p_i
    • Combining the two cases: p_i^{y_i} (1 − p_i)^{1−y_i}
  • L = ∏_i p_i^{y_i} (1 − p_i)^{1−y_i} = ∏_i (exp(x_i^T β) / (1 + exp(x_i^T β)))^{y_i} (1 / (1 + exp(x_i^T β)))^{1−y_i}

Optimization

• Equivalent to maximizing the log-likelihood
  • log L(β) = ∑_i [ y_i x_i^T β − log(1 + exp(x_i^T β)) ]
• Gradient ascent update:
  • β^{new} = β^{old} + η ∂log L(β)/∂β   (η: step size)
• Newton-Raphson update:
  • β^{new} = β^{old} − (∂²log L(β)/∂β∂β^T)^{−1} ∂log L(β)/∂β
  • where the derivatives are evaluated at β^{old}

First Derivative

• ∂log L(β)/∂β is a (p+1)-dimensional vector, with j-th element
  • ∂log L(β)/∂β_j = ∑_i x_ij (y_i − p_i(β)), for j = 0, 1, …, p

Second Derivative

• ∂²log L(β)/∂β∂β^T is a (p+1) by (p+1) matrix (the Hessian), with j-th row and n-th column
  • ∂²log L(β)/∂β_j ∂β_n = −∑_i x_ij x_in p_i(β)(1 − p_i(β))
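A minimal Python sketch of gradient ascent on this log-likelihood using the first derivative above (the toy data, step size, and iteration count are arbitrary choices for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy data: first column of X is 1 for the intercept β_0
    X = np.array([[1, 0.5], [1, 1.5], [1, -1.0], [1, 2.0]])
    y = np.array([0, 1, 0, 1])

    beta = np.zeros(X.shape[1])
    eta = 0.1                                  # step size
    for _ in range(1000):
        p = sigmoid(X @ beta)                  # p_i(β)
        grad = X.T @ (y - p)                   # ∂logL/∂β_j = Σ_i x_ij (y_i − p_i)
        beta = beta + eta * grad               # gradient ascent update
    print(beta)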

What about Multiclass Classification?

• It is easy to handle under logistic regression, say with M classes:
  • P(Y = j | x) = exp(x^T β_j) / (1 + ∑_{m=1}^{M−1} exp(x^T β_m)), for j = 1, …, M − 1
  • P(Y = M | x) = 1 / (1 + ∑_{m=1}^{M−1} exp(x^T β_m))
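A minimal Python sketch of these multiclass probabilities, treating class M as the reference class (the weight vectors below are placeholder values):

    import numpy as np

    def multiclass_probs(x, betas):
        # betas: (M-1) weight vectors; class M is the reference class with score 0
        scores = np.array([x @ b for b in betas])
        expo = np.exp(scores)
        denom = 1.0 + expo.sum()
        return np.append(expo / denom, 1.0 / denom)   # probabilities of classes 1..M

    x = np.array([1.0, 0.5])                                  # leading 1 for the intercept
    betas = [np.array([0.2, 1.0]), np.array([-0.5, 0.3])]     # β_1, β_2 for M = 3 classes
    p = multiclass_probs(x, betas)
    print(p, p.sum())                                         # probabilities sum to 1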

Recall Linear Regression and Logistic Regression

• Linear Regression
  • y | x, β ~ N(x^T β, σ²)
• Logistic Regression
  • y | x, β ~ Bernoulli(σ(x^T β))
• How about other distributions?
  • Yes, as long as they belong to the exponential family

Exponential Family

• Canonical form:
  • p(y; η) = b(y) exp(η^T T(y) − a(η))
  • η: natural parameter
  • T(y): sufficient statistic
  • a(η): log partition function, for normalization
  • b(y): function that depends only on y

Examples of Exponential Family

• p(y; η) = b(y) exp(η^T T(y) − a(η))
• Many distributions belong to this family: Gaussian, Bernoulli, Poisson, beta, Dirichlet, categorical, …
• For a Gaussian (taking σ² = 1, since we are not interested in σ):
  p(y; μ) = (1/√(2π)) exp(−(y − μ)²/2) = (1/√(2π)) exp(−y²/2) · exp(μy − μ²/2),
  so η = μ, T(y) = y, a(η) = η²/2, b(y) = (1/√(2π)) exp(−y²/2)
• For a Bernoulli with mean φ:
  p(y; φ) = φ^y (1 − φ)^{1−y} = exp( y log(φ/(1−φ)) + log(1 − φ) ),
  so η = log(φ/(1−φ)), T(y) = y, a(η) = log(1 + e^η), b(y) = 1

Recipe of GLMs*

• Determine a distribution for y
  • E.g., Gaussian, Bernoulli, Poisson
• Form the linear predictor for η
  • η = x^T β
• Determine a link function: μ = g^{−1}(η)
  • Connects the linear predictor to the mean of the distribution
  • E.g., μ = η for Gaussian, μ = σ(η) for Bernoulli, μ = exp(η) for Poisson

Content

• Probabilistic Models for I.I.D. Data

• Naïve Bayes

• Logistic Regression

• Generative Models and Discriminative Models

• Summary

Generative Models vs. Discriminative Models

• Generative model
  • Models the joint probability distribution p(x, y)
  • E.g., naïve Bayes
• Discriminative model
  • Models the conditional probability distribution p(y | x)
  • E.g., logistic regression

Which One is Better?

• Consider p(x, y) = p(y | x) × p(x)
  • Generative models require an additional model of the marginal distribution p(x)
  • Need more data to learn p(x)
  • The distributional assumption on p(x) might be incorrect
• In practice, discriminative models work very well
  • See https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

Content

• Probabilistic Models for I.I.D. Data

• Naïve Bayes

• Logistic Regression

• Generative Models and Discriminative Models

• Summary

Summary

• Probabilistic Models for I.I.D. Data
  • The i.i.d. assumption lets the joint distribution of the dataset factor into a product over single data points
• Naïve Bayes
  • Assumes independence among features given the class
• Logistic Regression
  • Assumes the conditional distribution of the label follows a Bernoulli distribution
• Generative Models and Discriminative Models
  • Modeling the joint distribution vs. the conditional distribution

References

• http://pages.cs.wisc.edu/~jerryzhu/cs769/nb.pdf
• http://cs229.stanford.edu/notes/cs229-notes1.pdf
• https://ai.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf

More about the Lagrangian

• Objective with equality constraints:
  min_w f(w)
  s.t. h_i(w) = 0, for i = 1, 2, …, l
• Lagrangian:
  • L(w, α) = f(w) + ∑_i α_i h_i(w)
  • α_i: Lagrange multipliers
• Solution: set the derivatives of the Lagrangian to 0
  • ∂L/∂w = 0 and ∂L/∂α_i = 0 for every i
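As a worked illustration of the multiplier method, here is a short derivation (a sketch added for clarity) that reproduces the β̂_jn solution quoted on the earlier naïve Bayes slide; it maximizes the class-j part of the log-likelihood under its normalization constraint:

    maximize ∑_n c_n log β_jn   s.t.  ∑_n β_jn = 1,   where c_n = ∑_{d: y_d = j} x_dn

    L(β_j, α) = ∑_n c_n log β_jn + α (1 − ∑_n β_jn)
    ∂L/∂β_jn = c_n / β_jn − α = 0   ⇒   β_jn = c_n / α
    ∂L/∂α = 1 − ∑_n β_jn = 0        ⇒   α = ∑_n c_n
    ⇒   β̂_jn = c_n / ∑_{n'} c_{n'}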

Generalized Lagrangian

• Objective with both equality and inequality constraints:
  min_w f(w)
  s.t. h_i(w) = 0, for i = 1, 2, …, l
       g_j(w) ≤ 0, for j = 1, 2, …, k
• Lagrangian:
  • L(w, α, β) = f(w) + ∑_i α_i h_i(w) + ∑_j β_j g_j(w)
  • α_i: Lagrange multipliers
  • β_j ≥ 0: Lagrange multipliers

Why It Works

• Consider the function θ_p(w) = max_{α, β: β_j ≥ 0} L(w, α, β):
  • θ_p(w) = f(w) if w satisfies all the constraints
  • θ_p(w) = +∞ if w doesn't satisfy the constraints
• Therefore, minimizing f(w) with the constraints is equivalent to minimizing θ_p(w)

Lagrange Duality

• The primal problem:
  p* = min_w max_{α, β: β_j ≥ 0} L(w, α, β)
• The dual problem:
  d* = max_{α, β: β_j ≥ 0} min_w L(w, α, β)
• According to the max-min inequality: d* ≤ p*

• When does p* = d* hold?

Primal = Dual

• p* = d* under some proper conditions (the Slater conditions):
  • f, g_j convex; h_i affine
  • There exists w such that g_j(w) < 0 for all j
• The solution w*, α*, β* needs to satisfy the KKT conditions:
  • ∂L/∂w = 0
  • β_j g_j(w) = 0
  • h_i(w) = 0, g_j(w) ≤ 0, β_j ≥ 0
• See https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf