10-606 Mathematical Foundations for Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University

Final Exam Review

Matt Gormley Lecture 13 Oct. 15, 2018

1 Reminders
• Homework 4:
  – Out: Thu, Oct. 11
  – Due: Mon, Oct. 15 at 11:59pm
• Final Exam
  – Date: Wed, Oct. 17
  – Time: 6:30 – 9:30pm
  – Location: Posner Hall A35

3 EXAM LOGISTICS

4 Final Exam

• Time / Location
  – Date: Wed, Oct 17
  – Time: Evening Exam, 6:30pm – 9:30pm
  – Room: Posner Hall A35
  – Seats: There will be assigned seats. Please arrive early.
• Logistics
  – Format of questions:
    • Multiple choice
    • True / False (with justification)
    • Short answers
    • Interpreting figures
    • Derivations
    • Short proofs
  – No electronic devices
  – You are allowed to bring one 8½ x 11 sheet of notes (front and back)

5 Final Exam
• How to Prepare
  – Attend this final exam review session
  – Review prior years' exams and solutions
    • We already posted these (see Piazza)
    • Disclaimer: This year's 10-606/607 is not the same as prior offerings!
  – Review this year's homework problems
  – Review this year's quiz problems

6 Final Exam
• Advice (for during the exam)
  – Solve the easy problems first (e.g. multiple choice before derivations)
    • If a problem seems extremely complicated, you're likely missing something
  – Don't leave any answer blank!
  – If you make an assumption, write it down
  – If you look at a question and don't know the answer:
    • we probably haven't told you the answer
    • but we've told you enough to work it out
    • imagine arguing for some answer and see if you like it

7 Topics Covered

• Preliminaries
  – Sets
  – Types
  – Functions
• Linear Algebra
  – Vector spaces
  – Matrices and linear operators
  – Linear independence
  – Invertibility
  – Eigenvalues and eigenvectors
  – Linear equations
  – Factorizations
  – Matrix Memories
• Matrix Calculus
  – Scalar derivatives
  – Partial derivatives
  – Vector derivatives
  – Matrix derivatives
  – Method of Lagrange multipliers
  – Least squares derivation
• Probability
  – Events
  – Disjoint union
  – Sum rule
  – Discrete random variables
  – Continuous random variables
  – Bayes Rule
  – Conditional, marginal, joint
  – Mean and variance

9 Analysis of 10-601 Performance

No obvious correlations…

10 Analysis of 10-601 Performance

Correlation between Background Test and Midterm Exam:
• Pearson: 0.46 (moderate)
• Spearman: 0.43 (moderate)

11 Q&A

12 Agenda 1. Review of probability (didactic) 2. Review of linear algebra / matrix calculus (through application)

20–21 Oh, the Places You'll Use Probability!
Supervised Classification
• Naïve Bayes: $p(y \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z}\, p(y) \prod_{i=1}^{n} p(x_i \mid y)$
• Logistic regression:

$P(Y = y \mid X = x; \theta) = p(y \mid x; \theta) = \frac{\exp(\theta_y \cdot \phi(x))}{\sum_{y'} \exp(\theta_{y'} \cdot \phi(x))}$
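As a sketch of the logistic regression posterior above, the normalized exponential can be computed directly. The class labels, weight vectors, and feature values here are all made up for illustration:

```python
import math

def softmax_posterior(theta, phi_x):
    """P(Y=y | X=x) proportional to exp(theta_y . phi(x)) for each class y.

    theta: dict mapping class label -> weight vector (list of floats)
    phi_x: feature vector phi(x) as a list of floats
    """
    scores = {y: math.exp(sum(w * f for w, f in zip(theta_y, phi_x)))
              for y, theta_y in theta.items()}
    Z = sum(scores.values())  # normalizer: sum over all classes y'
    return {y: s / Z for y, s in scores.items()}

# Two-class toy example with invented weights and features
theta = {"pos": [1.0, -0.5], "neg": [-1.0, 0.5]}
post = softmax_posterior(theta, [2.0, 1.0])
```

By construction the posterior sums to one over the classes, whatever weights are chosen.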

22 Oh, the Places You’ll Use Probability!

ML Theory (Example: Sample Complexity)
PAC/SLT models for Supervised Learning
• Algorithm sees training sample S: $(x_1, c^*(x_1)), \ldots, (x_m, c^*(x_m))$, with $x_i$ i.i.d. from $D$.
• Does optimization over S, finds hypothesis $h \in H$.
• Goal: $h$ has small error over $D$.

True error: $err_D(h) = \Pr_{x \sim D}(h(x) \neq c^*(x))$
How often $h(x) \neq c^*(x)$ over future instances drawn at random from $D$

• But, we can only measure:
Training error: $err_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}[h(x_i) \neq c^*(x_i)]$
How often $h(x) \neq c^*(x)$ over training instances

Sample complexity: bound $err_D(h)$ in terms of $err_S(h)$
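A quick simulation of the gap between training error and true error. The threshold concept c* and the slightly-off hypothesis h below are invented purely for illustration; D is the uniform distribution on [0, 1):

```python
import random

random.seed(0)

def c_star(x):           # hypothetical target concept
    return x > 0.5

def h(x):                # hypothetical learned hypothesis, slightly off
    return x > 0.55

# Training error: fraction of disagreements on a finite sample S drawn from D
S = [random.random() for _ in range(200)]
err_S = sum(h(x) != c_star(x) for x in S) / len(S)

# "True" error, estimated on a much larger fresh draw from D;
# analytically it is P(0.5 < x <= 0.55) = 0.05
big = [random.random() for _ in range(200_000)]
err_D = sum(h(x) != c_star(x) for x in big) / len(big)
```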

23 Oh, the Places You’ll Use Probability! Deep Learning (Example: Deep Bi-directional RNN)

[Figure: a deep bi-directional RNN with inputs x1–x4, two layers of hidden states h1–h4 (one per direction), and outputs y1–y4]

24 Oh, the Places You’ll Use Probability! Graphical Models • Hidden Markov Model (HMM)

[Figure: HMM over the sentence "time flies like an arrow" with hidden tag sequence n v p d n]

• Conditional Random Field (CRF)

[Figure: linear-chain CRF over the same sentence, with factors ψ0–ψ9 connecting adjacent tags and tag–word pairs]

25 Probability Outline

• Sample Space, Outcomes, Events
  – Complement
  – Disjoint union
  – Kolmogorov's Axioms of Probability
  – Sum rule
• Random Variables
  – Random variables, Probability mass function (pmf), Probability density function (pdf), Cumulative distribution function (cdf)
  – Examples
  – Notation
  – Expectation and Variance
  – Joint, conditional, marginal probabilities
  – Independence
  – Bayes' Rule
• Common Probability Distributions
  – Beta, Dirichlet, etc.

26 PROBABILITY AND EVENTS

27 Probability of Events
Example 1: Flipping a coin
Sample Space Ω: {Heads, Tails}
Outcome: Example: Heads
Event E: Example: {Heads}
Probability P(E): P({Heads}) = 0.5, P({Tails}) = 0.5

28 Probability Theory: Definitions
Probability provides a science for inference about interesting events.
Sample Space Ω: the set of all possible outcomes
Outcome ω: a possible result of an experiment
Event E: any subset of the sample space
Probability P(E): the non-negative number assigned to each event in the sample space
• Each outcome is unique
• Only one outcome can occur per experiment
• An outcome can be in multiple events
• An elementary event consists of exactly one outcome

• A compound event consists of multiple outcomes

29 Probability of Events
Example 2: Rolling a 6-sided die
Sample Space Ω: {1, 2, 3, 4, 5, 6}
Outcome: Example: 3
Event E: Example: {3} (the event "the die came up 3")
Probability P(E): P({3}) = 1/6, P({4}) = 1/6

30 Probability of Events
Example 2: Rolling a 6-sided die
Sample Space Ω: {1, 2, 3, 4, 5, 6}
Outcome: Example: 3
Event E: Example: {2, 4, 6} (the event "the roll was even")
Probability P(E): P({2,4,6}) = 0.5, P({1,3,5}) = 0.5
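The die examples above can be checked with a tiny pmf table; the probability of an event is the sum of the probabilities of its outcomes:

```python
from fractions import Fraction

# pmf of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def prob(event):
    """P(E) for an event E, i.e. a subset of the sample space."""
    return sum(pmf[x] for x in event)

p_even = prob({2, 4, 6})   # the event "the roll was even"
p_three = prob({3})        # the event "the die came up 3"
```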

31 Probability of Events
Example 3: Timing how long it takes a monkey to reproduce Shakespeare
Sample Space Ω: [0, +∞)
Outcome: Example: 1,433,600 hours
Event E: Example: [1, 6] hours
Probability P(E): P([1, 6]) = 0.000000000001, P([1,433,600, +∞)) = 0.99

32 Probability Theory: Definitions
• The complement of an event E, denoted ~E, is the event that E does not occur.
• P(E) + P(~E) = 1
• All of the following notations equivalently denote the complement of event E: $\bar{E}$, $E^{c}$, ~E

33 Disjoint Union

• Two events A and B are disjoint if $A \cap B = \emptyset$.
• The disjoint union rule says that if events A and B are disjoint, then $P(A \cup B) = P(A) + P(B)$.

34 Disjoint Union
• The disjoint union rule can be extended to multiple disjoint events.
• If each pair of events $A_i$ and $A_j$ ($i \neq j$) are disjoint, then $P\left(\bigcup_{i} A_i\right) = \sum_{i} P(A_i)$.

35 Non-disjoint Union

• Two events A and B are non-disjoint if $A \cap B \neq \emptyset$.
• We can apply the disjoint union rule to various disjoint sets: $P(A \cup B) = P(A \setminus B) + P(A \cap B) + P(B \setminus A)$.

36 Kolmogorov’s Axioms

1. $P(E) \geq 0$, for all events E
2. $P(\Omega) = 1$
3. If $E_1, E_2, \ldots$ are disjoint, then $P(E_1 \text{ or } E_2 \text{ or } \ldots) = P(E_1) + P(E_2) + \ldots$

37 Kolmogorov's Axioms
All of probability can be derived from just these!
1. $P(E) \geq 0$, for all events E
2. $P(\Omega) = 1$
3. If $E_1, E_2, \ldots$ are disjoint, then $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$
In words:
1. Each event has non-negative probability.
2. The probability that some event will occur is one.
3. The probability of the union of many disjoint sets is the sum of their probabilities.

38 Sum Rule

• For any two events A and B, we have that $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

[Figure: Venn diagram of overlapping events A and B]
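The sum rule $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ can be verified by direct enumeration on the die example; the two events below are chosen arbitrarily:

```python
from fractions import Fraction

# Uniform pmf over the outcomes of a fair die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def P(event):
    return sum(pmf[x] for x in event)

A = {1, 2, 3}          # "roll at most 3"
B = {2, 4, 6}          # "roll is even"

lhs = P(A | B)                     # P(A union B)
rhs = P(A) + P(B) - P(A & B)       # sum rule, correcting for double counting
```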

39 RANDOM VARIABLES

41 Random Variables: Definitions
Random Variable X (capital letters): Def 1: a variable whose possible values are the outcomes of a random experiment

Value of a Random Variable x (lowercase letters): the value taken by a random variable

42 Random Variables: Definitions Random Def 1: Variable whose possible values Variable X are the outcomes of a random experiment

Discrete Random Variable X: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
Continuous Random Variable X: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))

43 Random Variables: Definitions
Random Variable X: Def 1: a variable whose possible values are the outcomes of a random experiment
Def 2: a measurable function from the sample space to the real numbers: $X : \Omega \to \mathbb{R}$
Discrete Random Variable X: a random variable whose values come from a countable set
Continuous Random Variable X: a random variable whose values come from an interval or collection of intervals

44 Random Variables: Definitions

Discrete Random Variable X: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
Probability mass function (pmf) p(x): function giving the probability that discrete r.v. X takes value x: $p(x) := P(X = x)$

45 Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space Ω: {1, 2, 3, 4, 5, 6}
Outcome: Example: 3
Event E: Example: {3} (the event "the die came up 3")
Probability P(E): P({3}) = 1/6, P({4}) = 1/6

46 Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space Ω: {1, 2, 3, 4, 5, 6}
Outcome: Example: 3
Event E: Example: {3} (the event "the die came up 3")
Probability P(E): P({3}) = 1/6, P({4}) = 1/6
Discrete Random Variable X: Example: the value on the top face of the die
Prob. Mass Function (pmf) p(x): p(3) = 1/6, p(4) = 1/6

47 Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space Ω: {1, 2, 3, 4, 5, 6}
Outcome: Example: 3
Event E: Example: {2, 4, 6} (the event "the roll was even")
Probability P(E): P({2,4,6}) = 0.5, P({1,3,5}) = 0.5
Discrete Random Variable X: Example: 1 if the die landed on an even number and 0 otherwise
Prob. Mass Function (pmf) p(x): p(1) = 0.5, p(0) = 0.5

48 Random Variables: Definitions

Discrete Random Variable X: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
Probability mass function (pmf) p(x): function giving the probability that discrete r.v. X takes value x: $p(x) := P(X = x)$

49 Random Variables: Definitions

Continuous Random Variable X: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))
Probability density function (pdf) f(x): function that returns a nonnegative real number indicating the relative likelihood that a continuous r.v. X takes value x
• For any continuous random variable: P(X = x) = 0
• Non-zero probabilities are only available to intervals: $P(a \leq X \leq b) = \int_a^b f(x)\,dx$

50 Random Variables: Definitions
Example 3: Timing how long it takes a monkey to reproduce Shakespeare
Sample Space Ω: [0, +∞)
Outcome: Example: 1,433,600 hours
Event E: Example: [1, 6] hours
Probability P(E): P([1,6]) = 0.000000000001, P([1,433,600, +∞)) = 0.99
Continuous Random Variable X: Example: represents time to reproduce (not an interval!)
Prob. Density Function f(x): Example: Gamma distribution
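A numeric check of the interval rule $P(a \leq X \leq b) = \int_a^b f(x)\,dx$, using an Exponential pdf as the running example; the rate λ = 2 is an arbitrary illustrative choice:

```python
import math

lam = 2.0  # rate of an Exponential(lambda) r.v. (illustrative choice)

def f(x):
    """pdf of Exponential(lam): f(x) = lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x)

def prob_interval(a, b, steps=20_000):
    """P(a <= X <= b): integrate the pdf over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

p = prob_interval(0.0, 1.0)
exact = 1 - math.exp(-lam * 1.0)   # closed-form CDF difference F(1) - F(0)
```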

51 Random Variables: Definitions
"Region"-valued Random Variables
Sample Space Ω: {1, 2, 3, 4, 5}
Events x: the sub-regions 1, 2, 3, 4, or 5

Discrete Random Variable X: represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): proportional to size of sub-region

[Figure: a region partitioned into sub-regions labeled X=1 through X=5]

52 Random Variables: Definitions
"Region"-valued Random Variables
Sample Space Ω: all points in the region
Events x: the sub-regions 1, 2, 3, 4, or 5

Discrete Random Variable X: represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): proportional to size of sub-region
(Recall that an event is any subset of the sample space. So both definitions of the sample space here are valid.)


53 Random Variables: Definitions
String-valued Random Variables
Sample Space Ω: all Korean sentences (an infinitely large set)
Event x: translation of an English sentence into Korean (i.e. elementary events)
Discrete Random Variable X: represents a translation
Probability P(X=x): given by a model

English: machine learning requires probability and statistics
P( X = 기계학습은 확률과 통계를 필요 )
P( X = 머신러닝은 확률 통계를 필요 )
P( X = 머신러닝은 확률 통계를 이 필요합니다 )
…

54 Random Variables: Definitions

Cumulative distribution function F(x): function that returns the probability that a random variable X is less than or equal to x: $F(x) = P(X \leq x)$
• For discrete random variables: $F(x) = P(X \leq x) = \sum_{x' \leq x} p(x')$
• For continuous random variables: $F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt$

55 Random Variables and Events
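For the discrete die example, the cdf is just a running sum of the pmf; a minimal sketch:

```python
from fractions import Fraction

# pmf of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def F(x):
    """cdf: F(x) = P(X <= x) = sum of the pmf over values <= x."""
    return sum(p for v, p in pmf.items() if v <= x)

cdf_values = [F(x) for x in range(1, 7)]   # 1/6, 2/6, ..., 6/6
```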

Question: Something seems wrong…
• We defined P(E) (the capital 'P') as a function mapping events to probabilities
• So why do we write P(X=x)?
• A good guess: X=x is an event…
Random Variable, Def 2: a measurable function from the sample space to the real numbers: $X : \Omega \to \mathbb{R}$
Answer: P(X=x) is just shorthand! These sets are events:
Example 1: $P(X = x) \triangleq P(\{\omega \in \Omega : X(\omega) = x\})$
Example 2: $P(X \geq 7) \triangleq P(\{\omega \in \Omega : X(\omega) \geq 7\})$

56 Notational Shortcuts
A convenient shorthand: $P(A \mid B) = \frac{P(A, B)}{P(B)}$
For all values of a and b: $P(A = a \mid B = b) = \frac{P(A = a, B = b)}{P(B = b)}$

57 Notational Shortcuts

But then how do we tell P(E) (an event) apart from P(X) (a random variable)?

Instead of writing: $P(A \mid B) = \frac{P(A, B)}{P(B)}$

We should write: $P_{A \mid B}(A \mid B) = \frac{P_{A,B}(A, B)}{P_B(B)}$
…but only probability theory textbooks go to such lengths.

58 Expectation and Variance

The expected value of X is E[X]. Also called the mean.

• Discrete random variables: suppose X can take any value in the set $\mathcal{X}$. Then $E[X] = \sum_{x \in \mathcal{X}} x\, p(x)$
• Continuous random variables: $E[X] = \int_{-\infty}^{+\infty} x f(x)\,dx$

59 Expectation and Variance

The variance of X is Var(X): $Var(X) = E[(X - E[X])^2]$, where $\mu = E[X]$.
• Discrete random variables: $Var(X) = \sum_{x \in \mathcal{X}} (x - \mu)^2 p(x)$
• Continuous random variables: $Var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx$

60 MULTIPLE RANDOM VARIABLES (Joint probability, Marginal probability)
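Both the mean and the variance definitions can be evaluated exactly for the fair die:

```python
from fractions import Fraction

# pmf of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())               # E[X]
var = sum((x - mu) ** 2 * p for x, p in pmf.items())  # Var(X) = E[(X - mu)^2]
```

Using exact fractions makes the textbook values E[X] = 7/2 and Var(X) = 35/12 come out exactly.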

61 Means, Variances and Covariances / Marginal, Joint, and Conditional Probability (Slide from Sam Roweis, MLSS 2005)

Means, variances and covariances:
• Remember the definition of the mean and covariance of a vector random variable: $E[x] = \int_x x\, p(x)\,dx = m$ and $Cov[x] = E[(x - m)(x - m)^\top] = \int_x (x - m)(x - m)^\top p(x)\,dx = V$, which is the expected value of the outer product of the variable with itself, after subtracting the mean.
• Also, the covariance between two variables: $Cov[x, y] = E[(x - m_x)(y - m_y)^\top] = \int (x - m_x)(y - m_y)^\top p(x, y)\,dx\,dy = C_{xy}$, which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C is not symmetric.

Marginal probabilities:
• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables: $p(x) = \sum_y p(x, y)$. This is like adding slices of the table together.
• Another equivalent definition: $p(x) = \sum_y p(x \mid y)\, p(y)$.

Joint probability:
• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write $p(x, y) = \text{prob}(X = x \text{ and } Y = y)$.

Conditional probability:
• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table: $p(x \mid y) = p(x, y)/p(y)$.

[Figure: a 3-D joint p(x, y, z) and a conditional "slice" p(x, y | z)]

62 Slide from Sam Roweis (MLSS, 2005)

63 Slide from Sam Roweis (MLSS, 2005)

64 Slide from Sam Roweis (MLSS, 2005)

Bayes' Rule:
• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} = \frac{p(y \mid x)\, p(x)}{\sum_{x'} p(y \mid x')\, p(x')}$
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
$p(x, y, z, \ldots) = p(x)\, p(y \mid x)\, p(z \mid x, y)\, p(\ldots \mid x, y, z)$

Entropy:
• Measures the amount of ambiguity or uncertainty in a distribution:
$H(p) = -\sum_x p(x) \log p(x)$
• Expected value of $-\log p(x)$ (a function which depends on $p(x)$!).
• $H(p) > 0$ unless only one outcome is possible, in which case $H(p) = 0$.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs $-\log p(\text{event})$.
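Bayes' rule and entropy on a toy two-value distribution; all the numbers below are invented for illustration:

```python
import math

# Hypothetical prior p(x) and likelihood p(y=1 | x) over two values of x
p_x = {"a": 0.3, "b": 0.7}
p_y_given_x = {"a": 0.9, "b": 0.2}

# Bayes' rule: p(x | y) = p(y | x) p(x) / sum_x' p(y | x') p(x')
p_y = sum(p_y_given_x[x] * p_x[x] for x in p_x)
posterior = {x: p_y_given_x[x] * p_x[x] / p_y for x in p_x}

# Entropy of the prior: H(p) = -sum_x p(x) log p(x)
H = -sum(p * math.log(p) for p in p_x.values())
```

Observing y = 1 (much more likely under x = "a") shifts belief toward "a", and the entropy of any non-degenerate distribution is strictly positive.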

Independence & Conditional Independence:
• Two variables are independent iff their joint factors:
$p(x, y) = p(x)\, p(y)$
• Two variables are conditionally independent given a third one if for all values of the conditioning variable, the resulting slice factors:
$p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \quad \forall z$

Cross Entropy (KL Divergence):
• An asymmetric measure of the distance between two distributions:
$KL[p \| q] = \sum_x p(x) [\log p(x) - \log q(x)]$
• $KL > 0$ unless $p = q$, then $KL = 0$.
• Tells you the extra cost if events were generated by $p(x)$ but instead of charging under $p(x)$ you charged under $q(x)$.
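The KL divergence defined above, including its asymmetry, in a few lines; the distributions p and q are arbitrary examples:

```python
import math

def kl(p, q):
    """KL[p || q] = sum_x p(x) (log p(x) - log q(x)); assumes q(x) > 0."""
    return sum(px * (math.log(px) - math.log(q[x]))
               for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
```

KL[p || p] is zero, KL[p || q] is strictly positive when p differs from q, and swapping the arguments generally changes the value.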

65 Slide from Sam Roweis (MLSS, 2005) MLE AND MAP

66 MLE What does maximizing likelihood accomplish? • There is only a finite amount of probability mass (i.e. sum-to-one constraint) • MLE tries to allocate as much probability mass as possible to the things we have observed…

…at the expense of the things we have not observed

67 MLE vs. MAP

Suppose we have data $\mathcal{D} = \{x^{(i)}\}_{i=1}^N$.
Maximum Likelihood Estimate (MLE): $\theta_{MLE} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)$
Maximum a posteriori (MAP) estimate: $\theta_{MAP} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)\, p(\theta)$

68 MLE Example: MLE of Exponential Distribution

• pdf of Exponential(λ): $f(x) = \lambda e^{-\lambda x}$
• Suppose $X_i \sim \text{Exponential}(\lambda)$ for $1 \leq i \leq N$.
• Find MLE for data $\mathcal{D} = \{x^{(i)}\}_{i=1}^N$:
  1. First write down log-likelihood of sample.
  2. Compute first derivative, set to zero, solve for λ.
  3. Compute second derivative and check that it is concave down at $\lambda_{MLE}$.

69 MLE Example: MLE of Exponential Distribution
First write down log-likelihood of sample:
$\ell(\lambda) = \sum_{i=1}^N \log f(x^{(i)})$  (1)
$= \sum_{i=1}^N \log\left(\lambda \exp(-\lambda x^{(i)})\right)$  (2)
$= \sum_{i=1}^N \left[\log(\lambda) - \lambda x^{(i)}\right]$  (3)
$= N \log(\lambda) - \lambda \sum_{i=1}^N x^{(i)}$  (4)

70 MLE Example: MLE of Exponential Distribution
Compute first derivative, set to zero, solve for λ:
$\frac{d\ell(\lambda)}{d\lambda} = \frac{d}{d\lambda}\left[N \log(\lambda) - \lambda \sum_{i=1}^N x^{(i)}\right]$  (1)
$= \frac{N}{\lambda} - \sum_{i=1}^N x^{(i)} = 0$  (2)
$\Rightarrow \lambda_{MLE} = \frac{N}{\sum_{i=1}^N x^{(i)}}$  (3)
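The closed form $\lambda_{MLE} = N / \sum_i x^{(i)}$ can be sanity-checked by simulation; the true rate 3.0 and the sample size below are arbitrary choices:

```python
import random

random.seed(1)
true_lam = 3.0
N = 100_000

# Draw N i.i.d. samples from Exponential(true_lam)
data = [random.expovariate(true_lam) for _ in range(N)]

# Closed-form MLE from the derivation above: lambda_MLE = N / sum_i x_i
lam_mle = N / sum(data)
```

With a large sample, the estimate lands close to the true rate, as the MLE's consistency suggests.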

71 MLE Example: MLE of Exponential Distribution

• pdf of Exponential(λ): $f(x) = \lambda e^{-\lambda x}$
• Suppose $X_i \sim \text{Exponential}(\lambda)$ for $1 \leq i \leq N$.
• Find MLE for data $\mathcal{D} = \{x^{(i)}\}_{i=1}^N$:
  1. First write down log-likelihood of sample.
  2. Compute first derivative, set to zero, solve for λ.
  3. Compute second derivative and check that it is concave down at $\lambda_{MLE}$.

72 MLE vs. MAP

Suppose we have data $\mathcal{D} = \{x^{(i)}\}_{i=1}^N$.
Maximum Likelihood Estimate (MLE): $\theta_{MLE} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)$
Maximum a posteriori (MAP) estimate: $\theta_{MAP} = \operatorname{argmax}_\theta \prod_{i=1}^N p(x^{(i)} \mid \theta)\, p(\theta)$, where $p(\theta)$ is the prior

73 COMMON PROBABILITY DISTRIBUTIONS

74 Common Probability Distributions

• For Discrete Random Variables: – Bernoulli – Binomial – Multinomial – Categorical – Poisson • For Continuous Random Variables: – Exponential – Gamma – Beta – Dirichlet – Laplace – Gaussian (1D) – Multivariate Gaussian

75 Common Probability Distributions
Beta Distribution: probability density function:
$f(\phi \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, \phi^{\alpha - 1} (1 - \phi)^{\beta - 1}$
[Figure: Beta densities for (α, β) = (0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0)]

Common Probability Distributions
Dirichlet Distribution: probability density function:
$p(\vec{\phi} \mid \vec{\alpha}) = \frac{1}{B(\vec{\alpha})} \prod_{k=1}^{K} \phi_k^{\alpha_k - 1}$, where $B(\vec{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}$
[Figure: Dirichlet densities $p(\vec{\phi} \mid \vec{\alpha})$ plotted over the probability simplex for two settings of $\vec{\alpha}$]
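A sketch of the Beta density above, using the Gamma-function form of the normalizer B(α, β); the midpoint-rule check that the density integrates to 1 uses an arbitrary parameter setting:

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density: x^(a-1) (1-x)^(b-1) / B(a, b)."""
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

# Numerically check the density integrates to 1 for one parameter setting
steps = 20_000
h = 1.0 / steps
total = sum(beta_pdf((i + 0.5) * h, 5.0, 5.0) * h for i in range(steps))
```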