Linear Discriminant Analysis

Discriminant analysis, or classification, can be viewed as a special type of regression, sometimes called categorical regression. As before, we are given a data set $(\mathbf{X}, \mathbf{y})$ with data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N) \in \mathbb{R}^{d \times N}$, but now $\mathbf{y} = (y_1, \dots, y_N) \in \mathbb{L}^N$, $\mathbb{L} \subseteq \mathbb{N}$, where $\mathbb{L}$ is a finite set of discrete class labels, and we call its size $K = |\mathbb{L}|$ the number of classes. We note the labels are categorical and serve the purpose of labeling data: although labels are very often represented by integers like $1, \dots, K$, their numerical values usually do not matter conceptually. For example, the labeling still makes sense if we choose a different set of distinct integers. Discriminant analysis finds a discriminant function $\mathcal{f}: \mathbb{R}^d \to \mathbb{L}$, and $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_N) = \mathcal{f}(\mathbf{X})$ gives an estimate of the true labels $\mathbf{y}$. The percentage of labels in $\hat{\mathbf{y}}$ that agree with $\mathbf{y}$ is called the accuracy. The most basic case is binary classification, where $K = 2$. For a future data point $\mathbf{x}'$, $\mathcal{f}$ provides the label prediction $\hat{y}' = \mathcal{f}(\mathbf{x}')$.

We again emphasize that the labels mainly serve the purpose of distinguishing groups of data; however, it is not uncommon for the labels to also serve technical or interpretive purposes. For example, 1) in the perceptron algorithm, the label set is designed as $\{-1, 1\}$ to suit its error function; 2) Fisher's binary linear discriminant in Theorem 2-7 can be shown to be equivalent to LSE if using the label set $\{-\frac{N}{N_0}, \frac{N}{N_1}\}$ (see (2-38)); 3) in sentiment analysis, $-1$ is usually used to label the class of negative sentiment, and $+1$ is used to label the class of positive sentiment. The design of the label set is called label coding.

As before, there can be a feature function $\phi$ applied to each $\mathbf{x}_i$, so that we are actually dealing with $\boldsymbol{\phi}_i = \phi(\mathbf{x}_i)$. The basic idea is to utilize the regression in (2-2) and write the discriminant as

$\mathcal{f}(\mathbf{x}) = f\!\left(\phi(\mathbf{x})^\top\mathbf{w}\right) = f\!\left(\boldsymbol{\phi}^\top\mathbf{w}\right)$  (2-27)

where $f$ converts the decimal value resulting from $\boldsymbol{\phi}^\top\mathbf{w}$ into a label; it is very often referred to as the activation function. Discriminant analysis based on (2-27) is called linear discriminant analysis. For example, suppose we have a binary label set $\mathbb{L} = \{l_0, l_1\}$; then we can define

$\hat{y} = \mathcal{f}(\mathbf{x}) = \begin{cases} l_1 & \phi(\mathbf{x})^\top\mathbf{w} \ge 0 \\ l_0 & \phi(\mathbf{x})^\top\mathbf{w} < 0 \end{cases}$  (2-28)

For the special case where the bias $w_0$ is explicit, we can write

$\mathcal{f}(\mathbf{x}) = f\!\left(\phi(\mathbf{x})^\top\mathbf{w} + w_0\right) = f\!\left(\boldsymbol{\phi}^\top\mathbf{w} + w_0\right)$  (2-29)

which can be viewed as classification in one higher dimension, consistent with (2-27), by putting all feature points at height 1 in the first dimension. For the error, there could be a variety of definitions. Considering the categorical nature of labels, one intuitive error definition is to count the differences between $\hat{\mathbf{y}} = \mathcal{f}(\mathbf{X})$ and $\mathbf{y}$: $\mathcal{e}(\hat{\mathbf{y}}) = \sum_{i=1}^{N}\mathbb{1}[\hat{y}_i \ne y_i]$, which we refer to as the identity error. The identity error is the most widely used performance indicator, but it is hard to optimize directly (it is piecewise constant, and its derivative is zero almost everywhere).
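To make (2-27)–(2-29) and the identity error concrete, here is a minimal NumPy sketch; the feature matrix, weights, and labels are made-up placeholders, and feature points are stored as rows (whereas the text stacks them as columns).

```python
import numpy as np

def linear_discriminant(Phi, w, w0=0.0):
    """Binary discriminant (2-28)/(2-29): label 1 if phi^T w + w0 >= 0, else 0.
    Phi: (N, M) matrix whose rows are feature vectors phi_i."""
    scores = Phi @ w + w0                 # decimal values phi_i^T w + w0
    return (scores >= 0).astype(int)      # activation: convert scores to labels

def identity_error(y_hat, y):
    """Identity (0-1) error: count of positions where predictions differ from labels."""
    return int(np.sum(y_hat != y))

# toy usage with made-up data
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))             # N=5 feature points in R^3
w = np.array([1.0, -0.5, 0.2])
y = np.array([1, 0, 1, 0, 1])
y_hat = linear_discriminant(Phi, w, w0=0.1)
print(y_hat, identity_error(y_hat, y))
```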

Perceptron Algorithm

The perceptron algorithm is perhaps the simplest algorithm for binary classification and can be viewed as the simplest one-layer neural network with a single output. Technically, it minimizes a more tractable numerical error rather than the identity error. To achieve this, it designs the label set as $\{-1, 1\}$. For each data point $\mathbf{x}_i$ with label $y_i$, the algorithm checks whether $\mathrm{sgn}(\boldsymbol{\phi}_i^\top\mathbf{w}) = y_i$; if so, the classification is correct; otherwise, the classification is wrong, and we calculate the perceptron error of that point as

$\mathcal{e}_i = -y_i\boldsymbol{\phi}_i^\top\mathbf{w}$

Note $\mathrm{sgn}(\boldsymbol{\phi}_i^\top\mathbf{w}) \ne y_i$ is equivalent to $y_i\boldsymbol{\phi}_i^\top\mathbf{w} < 0$, and thus we put a minus sign in front to make the error positive. The total error sums over the set $\mathcal{M}$ of misclassified points, $\mathcal{e} = -\sum_{i\in\mathcal{M}} y_i\boldsymbol{\phi}_i^\top\mathbf{w}$, and our objective is simply to minimize it, i.e., $\min_{\mathbf{w}\ne\mathbf{0}}\mathcal{e}$. The classic way is to use gradient descent, with $\eta$ as a predefined positive step length,

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\nabla_{\mathbf{w}^{(t)}}\mathcal{e}$, where by Fact 1-7

$\nabla_{\mathbf{w}^{(t)}}\mathcal{e} = -\sum_{i\in\mathcal{M}} y_i\boldsymbol{\phi}_i \ \Rightarrow\ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta\sum_{i\in\mathcal{M}} y_i\boldsymbol{\phi}_i$  (2-30)

The above can be modified into online updates applied whenever an error is encountered,

$\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} + \eta y_i\boldsymbol{\phi}_i \quad \text{if } \mathrm{sgn}(\boldsymbol{\phi}_i^\top\mathbf{w}^{(t)}) \ne y_i,\ i\in\{1,\dots,N\}$  (2-31)

The geometric interpretation is illustrated in Figure 2-1. A hyperplane $H: \mathbf{w}^\top\mathbf{x} = 0$ goes through the origin and has $\mathbf{w}$ as its normal. With a fixed $\mathbf{w}$, $\cos\langle\boldsymbol{\phi},\mathbf{w}\rangle = \frac{\boldsymbol{\phi}^\top\mathbf{w}}{\|\mathbf{w}\|\|\boldsymbol{\phi}\|}$ is positive for all $\boldsymbol{\phi}$ on one side of $H$ and negative for all $\boldsymbol{\phi}$ on the other side; $d(\boldsymbol{\phi},H) = \frac{\boldsymbol{\phi}^\top\mathbf{w}}{\|\mathbf{w}\|}$ is the signed distance from the point $\boldsymbol{\phi}$ to the hyperplane, and $\boldsymbol{\phi}^\top\mathbf{w} = \|\mathbf{w}\|\,d(\boldsymbol{\phi},H) \propto d(\boldsymbol{\phi},H)$. The interpretation is clearer if we restrict $\|\mathbf{w}\| = 1$: then $\boldsymbol{\phi}^\top\mathbf{w} = d(\boldsymbol{\phi},H)$, and $\mathcal{e}$ adds up the distances of the misclassified points to the hyperplane. It is trivial to verify that the optimal solutions with and without the constraint are identical up to normalization. Thus, in practice, we may simply carry out the optimization without the "unit normal" constraint for simplicity and normalize the optimal solution $\mathbf{w}^*$ at the end. The perceptron objective $\min_{\mathbf{w}\ne\mathbf{0}}\mathcal{e}$ can therefore be viewed as finding an orientation of $H$ that best separates the data points, where "best" means minimum $\mathcal{e}$. The initialization $\mathbf{w}^{(0)}$ for (2-31) can be a random choice of any nonzero vector. The advantages of the perceptron are its simplicity, its online updates, and its guaranteed convergence in the ideal situation; the disadvantages are that it is limited to binary classification and that it runs indefinitely in the non-ideal situation where the data points are not separable by a hyperplane (in this case a heuristic rule is needed for termination).

Figure 2-1 The perceptron algorithm finds the best hyperplane through the origin separating the points. With a unit normal, the perceptron error can be interpreted as adding up the distances from misclassified points to the hyperplane.

Fact 2-5 The Cauchy-Schwarz inequality states that $\|\mathbf{x}\|_{\langle\cdot\rangle}\|\mathbf{y}\|_{\langle\cdot\rangle} \ge |\langle\mathbf{x},\mathbf{y}\rangle|$ for any $\mathbf{x},\mathbf{y}\in\mathbb{R}^n$, where $\langle\mathbf{x},\mathbf{y}\rangle$ is an inner product and $\|\mathbf{x}\|_{\langle\cdot\rangle} = \sqrt{\langle\mathbf{x},\mathbf{x}\rangle}$ is the inner-product-induced norm.

Theorem 2-5 Perceptron Convergence Theorem. In binary classification, if there exists a hyperplane $H: \mathbf{w}^\top\mathbf{x} = 0$ such that, among the feature points $(\boldsymbol{\phi}_1,y_1),\dots,(\boldsymbol{\phi}_N,y_N)$, those in $\{\boldsymbol{\phi}: \mathbf{w}^\top\boldsymbol{\phi} \ge 0\}$ all have the same label and those in $\{\boldsymbol{\phi}: \mathbf{w}^\top\boldsymbol{\phi} < 0\}$ all have the other label, then the feature points are called linearly separable. If the feature points are linearly separable by some hyperplane $H$, then there exist two constants $a, b > 0$ such that $at \le \|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\|$ and $\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\|^2 \le bt$ for all $t$ at which $\mathcal{e} \ne 0$, for the algorithm defined by (2-31). As a result, the algorithm must terminate, with $\mathcal{e} = 0$ at termination, within $t \le b/a^2$ iterations; otherwise $a^2t^2 > bt$ when $t$ is sufficiently large, a contradiction.

For the lower bound, let $\mathbf{w}^*$ be a normal of the separating hyperplane $H$, so that $y_i(\mathbf{w}^*)^\top\boldsymbol{\phi}_i > 0$ for every $i$ (assuming no point lies exactly on $H$); then

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(0)} + \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \dots + \eta y_{i_t}\boldsymbol{\phi}_{i_t} \ \Rightarrow\ \mathbf{w}^{(t+1)} - \mathbf{w}^{(0)} = \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \dots + \eta y_{i_t}\boldsymbol{\phi}_{i_t}$

$\Rightarrow\ (\mathbf{w}^*)^\top\!\left(\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\right) = (\mathbf{w}^*)^\top\!\left(\eta y_{i_1}\boldsymbol{\phi}_{i_1} + \dots + \eta y_{i_t}\boldsymbol{\phi}_{i_t}\right) \ge \eta t \min_{i=1,\dots,N} y_i(\mathbf{w}^*)^\top\boldsymbol{\phi}_i$

where $i_1,\dots,i_t$ are the indices of the misclassified points used at the successive updates. By the Cauchy-Schwarz inequality, $(\mathbf{w}^*)^\top(\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}) \le \|\mathbf{w}^*\|\,\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\|$, therefore

$\|\mathbf{w}^*\|\,\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\| \ge \eta t \min_{i=1,\dots,N} y_i(\mathbf{w}^*)^\top\boldsymbol{\phi}_i \ \Rightarrow\ \|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\| \ge \frac{\eta\min_{i=1,\dots,N} y_i(\mathbf{w}^*)^\top\boldsymbol{\phi}_i}{\|\mathbf{w}^*\|}\, t$

where $a = \frac{\eta\min_{i=1,\dots,N} y_i(\mathbf{w}^*)^\top\boldsymbol{\phi}_i}{\|\mathbf{w}^*\|}$ is independent of $t$. For the upper bound, notice

$\mathbf{w}^{(1)} - \mathbf{w}^{(0)} = \eta y_{i_1}\boldsymbol{\phi}_{i_1}$

$\mathbf{w}^{(2)} - \mathbf{w}^{(0)} = \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \eta y_{i_2}\boldsymbol{\phi}_{i_2} = \left(\mathbf{w}^{(1)} - \mathbf{w}^{(0)}\right) + \eta y_{i_2}\boldsymbol{\phi}_{i_2}$

$\mathbf{w}^{(3)} - \mathbf{w}^{(0)} = \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \eta y_{i_2}\boldsymbol{\phi}_{i_2} + \eta y_{i_3}\boldsymbol{\phi}_{i_3} = \left(\mathbf{w}^{(2)} - \mathbf{w}^{(0)}\right) + \eta y_{i_3}\boldsymbol{\phi}_{i_3}$

and generally

$\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)} = \left(\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right) + \eta y_{i_t}\boldsymbol{\phi}_{i_t}$ and

$\left\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\right\|^2 = \left\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right\|^2 + \eta^2\|\boldsymbol{\phi}_{i_t}\|^2 + 2\eta y_{i_t}\left(\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right)^\top\boldsymbol{\phi}_{i_t}$

Note $y_{i_t}(\mathbf{w}^{(t)})^\top\boldsymbol{\phi}_{i_t} < 0$ because we assume $\mathcal{e} \ne 0$ at step $t$, i.e., $\boldsymbol{\phi}_{i_t}$ is misclassified by $\mathbf{w}^{(t)}$. Therefore, expanding $(\mathbf{w}^{(t)} - \mathbf{w}^{(0)})^\top\boldsymbol{\phi}_{i_t} = (\mathbf{w}^{(t)})^\top\boldsymbol{\phi}_{i_t} - (\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_{i_t}$,

$\left\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\right\|^2 \le \left\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right\|^2 + \eta^2\|\boldsymbol{\phi}_{i_t}\|^2 - 2\eta y_{i_t}(\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_{i_t}$

which gives the chain

$\left\|\mathbf{w}^{(1)} - \mathbf{w}^{(0)}\right\|^2 = \eta^2\|\boldsymbol{\phi}_{i_1}\|^2 \le \eta^2\|\boldsymbol{\phi}_{i_1}\|^2 - 2\eta y_{i_1}(\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_{i_1}$

$\left\|\mathbf{w}^{(2)} - \mathbf{w}^{(0)}\right\|^2 \le \left\|\mathbf{w}^{(1)} - \mathbf{w}^{(0)}\right\|^2 + \eta^2\|\boldsymbol{\phi}_{i_2}\|^2 - 2\eta y_{i_2}(\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_{i_2}$

$\cdots$

$\left\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\right\|^2 \le \left\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right\|^2 + \eta^2\|\boldsymbol{\phi}_{i_t}\|^2 - 2\eta y_{i_t}(\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_{i_t}$

Adding up these inequalities, the intermediate terms cancel on both sides, and therefore

$\left\|\mathbf{w}^{(t+1)} - \mathbf{w}^{(0)}\right\|^2 \le \eta^2\left(\|\boldsymbol{\phi}_{i_1}\|^2 + \dots + \|\boldsymbol{\phi}_{i_t}\|^2\right) - 2\eta(\mathbf{w}^{(0)})^\top\!\left(y_{i_1}\boldsymbol{\phi}_{i_1} + \dots + y_{i_t}\boldsymbol{\phi}_{i_t}\right)$

$\le \eta^2\max_{i=1,\dots,N}\|\boldsymbol{\phi}_i\|^2\, t - 2\eta\min_{i=1,\dots,N} y_i(\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_i\, t$

where we see $b = \eta^2\max_{i=1,\dots,N}\|\boldsymbol{\phi}_i\|^2 - 2\eta\min_{i=1,\dots,N} y_i(\mathbf{w}^{(0)})^\top\boldsymbol{\phi}_i$ is independent of $t$.
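A minimal sketch of the online perceptron update (2-31), assuming labels coded as $\{-1,+1\}$ and feature points stored as rows; the epoch cap is the heuristic termination rule mentioned above for the non-separable case.

```python
import numpy as np

def perceptron_train(Phi, y, eta=1.0, max_epochs=100, seed=0):
    """Online perceptron updates (2-31).
    Phi: (N, M) feature points as rows; y: labels in {-1, +1}.
    Stops once all points are classified correctly, or after max_epochs."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Phi.shape[1])        # random nonzero initialization w^(0)
    for _ in range(max_epochs):
        mistakes = 0
        for phi_i, y_i in zip(Phi, y):
            if np.sign(phi_i @ w) != y_i:    # misclassified: sgn(phi^T w) != y
                w += eta * y_i * phi_i       # w := w + eta * y_i * phi_i
                mistakes += 1
        if mistakes == 0:                    # perceptron error e = 0: converged
            break
    return w

# toy usage: two linearly separable clusters
rng = np.random.default_rng(1)
Phi = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)
w = perceptron_train(Phi, y)
print("training mistakes left:", int(np.sum(np.sign(Phi @ w) != y)))
```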

Membership LSE

Theorem 2-6 Membership LSE. The least-squared-error method for regression (see Theorem 2-1) can be adapted for classification; however, directly using the categorical values of the class labels as the regression target is usually less favorable, due to both performance issues and interpretation difficulty.

One adaptation is to convert class labels into membership vectors. Define a new target vector $\mathbf{z}_j$, called the membership vector, for each class label $y_j$, $j = 1,\dots,N$,

$\mathbf{z}_j = \begin{pmatrix} t_{1,j} \\ \vdots \\ t_{K,j} \end{pmatrix} \in \{0,1\}^K, \quad j = 1,\dots,N, \quad \text{s.t.}\ \sum_{i=1}^{K} t_{i,j} = 1$

That is, exactly one component of $\mathbf{z}_j$ is assigned 1, with all other components being 0. If $t_{i,j} = 1$ for some $i$, then $y_j = i$, indicating $\mathbf{x}_j \in C_i$. The whole target matrix is $\mathbf{Z} = (\mathbf{z}_1,\dots,\mathbf{z}_N)$. By (2-4) we have the regression model $\hat{\mathbf{Z}} = \mathbf{W}^\top\boldsymbol{\Phi}$, and the objective of LSE is to minimize $\|\mathbf{Z} - \hat{\mathbf{Z}}\|^2$. We have the solution by (2-6) and (2-7) of Theorem 2-1,

$\mathbf{W}^* = (\boldsymbol{\Phi}^\top)^+\mathbf{Z}^\top$

Using the fact that $(\boldsymbol{\Phi}^\top)^+ = (\boldsymbol{\Phi}^+)^\top$, we have

$\mathbf{W}^* = (\boldsymbol{\Phi}^+)^\top\mathbf{Z}^\top \ \Rightarrow\ (\mathbf{W}^*)^\top = \mathbf{Z}\boldsymbol{\Phi}^+$

and the regression function is

$f(\mathbf{x}) = (\mathbf{W}^*)^\top\phi(\mathbf{x}) = \mathbf{Z}\boldsymbol{\Phi}^+\phi(\mathbf{x})$

Note the discriminant function is different: it needs to yield a class label. We heuristically set it as

$\mathcal{f}(\mathbf{x}) = \arg\max_{i} f_i(\mathbf{x})$  (2-32)

i.e., the class whose inferred membership component is largest. As a special case, if $\boldsymbol{\Phi} := \begin{pmatrix}\mathbf{1}^\top\\ \mathbf{X}\end{pmatrix}$ and $\mathbf{W} := \begin{pmatrix}\mathbf{w}_0^\top\\ \mathbf{W}\end{pmatrix}$ (bias explicit), then by (2-11) of Theorem 2-1 we have

$\mathbf{w}_0^* = \bar{\mathbf{z}} - (\mathbf{W}^*)^\top\bar{\mathbf{x}}, \qquad \mathbf{W}^* = (\tilde{\mathbf{X}}^+)^\top\tilde{\mathbf{Z}}^\top, \qquad (\mathbf{W}^*)^\top = \tilde{\mathbf{Z}}\tilde{\mathbf{X}}^+$

where $\bar{\mathbf{x}}, \bar{\mathbf{z}}$ are the mean vectors of $\mathbf{X}, \mathbf{Z}$, and $\tilde{\mathbf{X}}, \tilde{\mathbf{Z}}$ are the centered data. Then the regression function is

$g(\mathbf{x}) = (\mathbf{W}^*)^\top\mathbf{x} + \mathbf{w}_0^* = (\mathbf{W}^*)^\top\mathbf{x} + \bar{\mathbf{z}} - (\mathbf{W}^*)^\top\bar{\mathbf{x}} = \tilde{\mathbf{Z}}\tilde{\mathbf{X}}^+(\mathbf{x} - \bar{\mathbf{x}}) + \bar{\mathbf{z}}$  (2-33)

There are several limitations of LSE for classification. First, it inherits the sensitivity to outliers mentioned in Theorem 2-1; second, the efficiency of LSE quickly degenerates as $K$ increases; third, the discriminant decision is based on the maximum of the inferred membership weights $\hat{\mathbf{z}} = g(\mathbf{x})$, but we lack a good interpretation of $\hat{\mathbf{z}}$; one example for (2-33) is shown in the property below.
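Before turning to that property, here is a minimal sketch of membership LSE in the centered form (2-33) with the argmax discriminant (2-32), using the pseudo-inverse; the data and class layout are illustrative, and data points are stored as rows.

```python
import numpy as np

def membership_lse_fit(X, y, K):
    """Membership LSE (Theorem 2-6), centered form (2-33).
    X: (N, d) data as rows; y: integer labels in {0,...,K-1}."""
    Z = np.eye(K)[y]                      # (N, K) one-hot membership vectors z_j
    x_bar, z_bar = X.mean(axis=0), Z.mean(axis=0)
    Xc, Zc = X - x_bar, Z - z_bar         # centered data
    W = np.linalg.pinv(Xc) @ Zc           # (d, K) least-squares weights
    return W, x_bar, z_bar

def membership_lse_predict(X, W, x_bar, z_bar):
    """g(x) in (2-33) gives inferred memberships; discriminant (2-32) takes argmax."""
    G = (X - x_bar) @ W + z_bar           # (N, K) inferred membership weights
    return np.argmax(G, axis=1), G

# toy usage
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)
W, xb, zb = membership_lse_fit(X, y, K=3)
y_hat, G = membership_lse_predict(X, W, xb, zb)
print("accuracy:", np.mean(y_hat == y), " rows sum to 1:", np.allclose(G.sum(axis=1), 1))
```

The row-sum check at the end anticipates the property stated next (taking $\mathbf{a} = \mathbf{1}$).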

Property 2-1 If $\mathbf{a}^\top\mathbf{z}_j = b$ for every $j = 1,\dots,N$, then $\mathbf{a}^\top g(\mathbf{x}) = b$. Let $\mathbf{b} = b\mathbf{1}$; then

$\mathbf{a}^\top\bar{\mathbf{z}} = \frac{1}{N}\mathbf{a}^\top(\mathbf{z}_1 + \dots + \mathbf{z}_N) = b \ \Rightarrow\ \mathbf{a}^\top\mathbf{Z} = \mathbf{b}^\top \ \Rightarrow\ \mathbf{a}^\top\tilde{\mathbf{Z}} = \mathbf{a}^\top(\mathbf{Z} - \bar{\mathbf{Z}}) = \mathbf{b}^\top - \mathbf{b}^\top = \mathbf{0}^\top$

$\Rightarrow\ \mathbf{a}^\top g(\mathbf{x}) = \mathbf{a}^\top\!\left(\tilde{\mathbf{Z}}\tilde{\mathbf{X}}^+(\mathbf{x}-\bar{\mathbf{x}}) + \bar{\mathbf{z}}\right) = \mathbf{a}^\top\bar{\mathbf{z}} = b$

where $\bar{\mathbf{Z}}$ is the matrix whose every column is $\bar{\mathbf{z}}$.

In particular, if we let $\mathbf{a} = \mathbf{1}$, then $b = 1$, and $\mathbf{1}^\top\mathbf{z}_j = 1$ holds for every $j = 1,\dots,N$. Therefore $\mathbf{1}^\top g(\mathbf{x}) = 1$ for any $\mathbf{x}$, i.e., the inferred membership of any data entry $\mathbf{x}$ under $g$ in (2-33) sums to 1. However, it is not guaranteed that $\hat{\mathbf{z}} = g(\mathbf{x})$ is a probability vector, i.e., it may have components outside $[0,1]$, which is one more limitation of LSE.

Theorem 2-7 Fisher's Binary Linear Discriminant. For simplicity and WLOG, we consider $\phi(\mathbf{x}) := \mathbf{x}$ with the bias $w_0$ written explicitly alongside $\mathbf{w}$, and we start with binary classification. By (2-28), the general form of a binary discriminant is

$\mathcal{f}(\mathbf{x}) = \begin{cases}1 & \mathbf{x}^\top\mathbf{w} \ge -w_0\\ 0 & \mathbf{x}^\top\mathbf{w} < -w_0\end{cases}$

The idea is that the projections $\mathbf{w}^\top\mathbf{x}_j$ for $\mathbf{x}_j\in C_1$ should be as far away as possible from the projections for $\mathbf{x}_j\in C_0$. The distance between the two sets of data entries can be characterized by the means of the data,

$\bar{\mathbf{x}}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}, \quad i = 0,1$

We would like to maximize

$\max_{\|\mathbf{w}\|=1}\left(\mathbf{w}^\top\bar{\mathbf{x}}_1 - \mathbf{w}^\top\bar{\mathbf{x}}_0\right) = \max_{\|\mathbf{w}\|=1}\mathbf{w}^\top(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$  (2-34)

where the normalization $\|\mathbf{w}\| = 1$ is to prevent the objective from going to infinity. A less obvious issue is that when the projected class $\mathbf{w}^\top C_1$ or $\mathbf{w}^\top C_0$ (or both) has large variance, $\mathbf{w}^\top(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$ could be large while the projected classes still overlap considerably. The projected variance is written as

$s_i^2 = \sum_{\mathbf{x}\in C_i} \mathbf{w}^\top(\mathbf{x} - \bar{\mathbf{x}}_i)(\mathbf{x} - \bar{\mathbf{x}}_i)^\top\mathbf{w}, \quad i = 0,1$

Therefore, we would like to minimize the variances of $\mathbf{w}^\top C_0$ and $\mathbf{w}^\top C_1$ as well,

$\min_{\|\mathbf{w}\|=1}\left(s_0^2 + s_1^2\right)$  (2-35)

Combining (2-34) and (2-35), Fisher's criterion is the squared distance of the projected means over the sum of the projected variances,

$\max_{\|\mathbf{w}\|=1} L = \max_{\|\mathbf{w}\|=1}\frac{\left(\mathbf{w}^\top(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)\right)^2}{s_0^2 + s_1^2} = \max_{\|\mathbf{w}\|=1}\frac{\mathbf{w}^\top\mathbf{M}\mathbf{w}}{\mathbf{w}^\top(\mathbf{S}_0 + \mathbf{S}_1)\mathbf{w}}$

where $\mathbf{M} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)^\top$ and $\mathbf{S}_i = \sum_{\mathbf{x}\in C_i}(\mathbf{x} - \bar{\mathbf{x}}_i)(\mathbf{x} - \bar{\mathbf{x}}_i)^\top$ is the (unnormalized) covariance matrix of class $C_i$ for $i = 0,1$; all these matrices are symmetric. Let $\mathbf{S} = \mathbf{S}_0 + \mathbf{S}_1$ for convenience; then

$\nabla L \propto (\mathbf{w}^\top\mathbf{S}\mathbf{w})\mathbf{M}\mathbf{w} - (\mathbf{w}^\top\mathbf{M}\mathbf{w})\mathbf{S}\mathbf{w} = \mathbf{0} \ \Rightarrow\ (\mathbf{w}^\top\mathbf{S}\mathbf{w})\mathbf{M}\mathbf{w} = (\mathbf{w}^\top\mathbf{M}\mathbf{w})\mathbf{S}\mathbf{w}$

We note there is no need for a Lagrange multiplier for the constraint: if we can find the direction of the optimal $\mathbf{w}$, we simply normalize it to a unit vector. Note $\mathbf{M}\mathbf{w} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)^\top\mathbf{w} \propto \bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0$; then for some constant $c$,

$(\mathbf{w}^\top\mathbf{M}\mathbf{w})\mathbf{S}\mathbf{w} = c(\mathbf{w}^\top\mathbf{S}\mathbf{w})(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0) \ \Rightarrow\ \mathbf{w} = \frac{c(\mathbf{w}^\top\mathbf{S}\mathbf{w})}{\mathbf{w}^\top\mathbf{M}\mathbf{w}}\mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$

Since we need to normalize $\mathbf{w}$ anyway, we do not care about the constant, and therefore we have Fisher's discriminant

$\mathbf{w}^* \propto \mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$

Connection to LSE & optimal $w_0$. There are many strategies to determine $w_0$; a naive one is to set the threshold directly from the projected class means (e.g., at their midpoint). A better quantitative approach to finding the optimal $w_0$ is through LSE, by direct regression to class labels (rather than to membership vectors as in Theorem 2-6). We show that Fisher's discriminant is equivalent to LSE with a special label coding, and therefore $w_0$ can be found by taking the derivative. First, LSE minimizes

$\sum_{(\mathbf{x},y)\in C_0}\left(y - w_0 - \mathbf{x}^\top\mathbf{w}\right)^2 + \sum_{(\mathbf{x},y)\in C_1}\left(y - w_0 - \mathbf{x}^\top\mathbf{w}\right)^2$

Setting its derivative with respect to $\mathbf{w}$ to zero gives

$\sum_{(\mathbf{x},y)\in C_0}\left(y - w_0 - \mathbf{x}^\top\mathbf{w}\right)\mathbf{x} + \sum_{(\mathbf{x},y)\in C_1}\left(y - w_0 - \mathbf{x}^\top\mathbf{w}\right)\mathbf{x} = \mathbf{0}$

$\Rightarrow\ \sum_{(\mathbf{x},y)\in C_0}\left(w_0 + \mathbf{x}^\top\mathbf{w}\right)\mathbf{x} + \sum_{(\mathbf{x},y)\in C_1}\left(w_0 + \mathbf{x}^\top\mathbf{w}\right)\mathbf{x} = \sum_{j=1}^{N} y_j\mathbf{x}_j$

$\Rightarrow\ \sum_{(\mathbf{x},y)\in C_0}\mathbf{x}\mathbf{x}^\top\mathbf{w} + \sum_{(\mathbf{x},y)\in C_1}\mathbf{x}\mathbf{x}^\top\mathbf{w} + w_0\sum_{(\mathbf{x},y)\in C_0}\mathbf{x} + w_0\sum_{(\mathbf{x},y)\in C_1}\mathbf{x} = \sum_{j=1}^{N} y_j\mathbf{x}_j$

Then, plugging in (2-9) of Theorem 2-1 (i.e., $w_0 = \bar{y} - \bar{\mathbf{x}}^\top\mathbf{w}$), we have

$\left(\sum_{(\mathbf{x},y)\in C_0}\mathbf{x}\mathbf{x}^\top + \sum_{(\mathbf{x},y)\in C_1}\mathbf{x}\mathbf{x}^\top - \sum_{(\mathbf{x},y)\in C_0}\mathbf{x}\bar{\mathbf{x}}^\top - \sum_{(\mathbf{x},y)\in C_1}\mathbf{x}\bar{\mathbf{x}}^\top\right)\mathbf{w} = \sum_{j=1}^{N}(y_j - \bar{y})\mathbf{x}_j$  (2-36)

(퐱,)∈ (퐱,)∈ (퐱,)∈ (퐱,)∈ By Corollary 1-5, we have

$\sum_{\mathbf{x}\in C_0}\mathbf{x}\mathbf{x}^\top + \sum_{\mathbf{x}\in C_1}\mathbf{x}\mathbf{x}^\top = \mathbf{S} + N_0\bar{\mathbf{x}}_0\bar{\mathbf{x}}_0^\top + N_1\bar{\mathbf{x}}_1\bar{\mathbf{x}}_1^\top$

Noticing $\bar{\mathbf{x}} = \frac{1}{N}(N_0\bar{\mathbf{x}}_0 + N_1\bar{\mathbf{x}}_1)$, we have

$\sum_{\mathbf{x}\in C_0}\mathbf{x}\bar{\mathbf{x}}^\top = N_0\bar{\mathbf{x}}_0\bar{\mathbf{x}}^\top = N_0\bar{\mathbf{x}}_0\left(\frac{N_0}{N}\bar{\mathbf{x}}_0 + \frac{N_1}{N}\bar{\mathbf{x}}_1\right)^{\!\top} = \frac{N_0^2}{N}\bar{\mathbf{x}}_0\bar{\mathbf{x}}_0^\top + \frac{N_0N_1}{N}\bar{\mathbf{x}}_0\bar{\mathbf{x}}_1^\top$

$\sum_{\mathbf{x}\in C_1}\mathbf{x}\bar{\mathbf{x}}^\top = N_1\bar{\mathbf{x}}_1\bar{\mathbf{x}}^\top = N_1\bar{\mathbf{x}}_1\left(\frac{N_0}{N}\bar{\mathbf{x}}_0 + \frac{N_1}{N}\bar{\mathbf{x}}_1\right)^{\!\top} = \frac{N_0N_1}{N}\bar{\mathbf{x}}_1\bar{\mathbf{x}}_0^\top + \frac{N_1^2}{N}\bar{\mathbf{x}}_1\bar{\mathbf{x}}_1^\top$

Then the LHS of (2-36) becomes

$\mathbf{S} + N_0\bar{\mathbf{x}}_0\bar{\mathbf{x}}_0^\top + N_1\bar{\mathbf{x}}_1\bar{\mathbf{x}}_1^\top - \left(\frac{N_0^2}{N}\bar{\mathbf{x}}_0\bar{\mathbf{x}}_0^\top + \frac{N_0N_1}{N}\bar{\mathbf{x}}_0\bar{\mathbf{x}}_1^\top + \frac{N_0N_1}{N}\bar{\mathbf{x}}_1\bar{\mathbf{x}}_0^\top + \frac{N_1^2}{N}\bar{\mathbf{x}}_1\bar{\mathbf{x}}_1^\top\right)$

$= \mathbf{S} + \frac{N_0N_1}{N}\left(\bar{\mathbf{x}}_0\bar{\mathbf{x}}_0^\top + \bar{\mathbf{x}}_1\bar{\mathbf{x}}_1^\top - \bar{\mathbf{x}}_0\bar{\mathbf{x}}_1^\top - \bar{\mathbf{x}}_1\bar{\mathbf{x}}_0^\top\right) = \mathbf{S} + \frac{N_0N_1}{N}\mathbf{M}$  (2-37)

For the RHS of (2-36), suppose $y_j = c_0$ for all $\mathbf{x}_j \in C_0$ and $y_j = c_1$ for all $\mathbf{x}_j \in C_1$; then

$\sum_{j=1}^{N}(y_j - \bar{y})\mathbf{x}_j = \sum_{\mathbf{x}\in C_0}(c_0 - \bar{y})\mathbf{x} + \sum_{\mathbf{x}\in C_1}(c_1 - \bar{y})\mathbf{x} = (c_0 - \bar{y})N_0\bar{\mathbf{x}}_0 + (c_1 - \bar{y})N_1\bar{\mathbf{x}}_1$

As a result, if we let $(c_1 - \bar{y})N_1 = -(c_0 - \bar{y})N_0$, then the RHS will be $\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0$ times a constant. This happens simply when

$c_0 = -\frac{N}{N_0},\quad c_1 = \frac{N}{N_1} \ \Rightarrow\ \bar{y} = \frac{1}{N}(N_0c_0 + N_1c_1) = \frac{1}{N}(-N + N) = 0$

$\Rightarrow\ (c_1 - \bar{y})N_1 = -(c_0 - \bar{y})N_0 = N \ \Rightarrow\ \sum_{j=1}^{N}(y_j - \bar{y})\mathbf{x}_j = N(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$  (2-38)

Combining (2-36), (2-37) and (2-38), we have

$\left(\mathbf{S} + \frac{N_0N_1}{N}\mathbf{M}\right)\mathbf{w} = N(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0) \ \Rightarrow\ \mathbf{S}\mathbf{w} = \left(N - \frac{N_0N_1}{N}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)^\top\mathbf{w}\right)(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$

$\Rightarrow\ \mathbf{w}^* \propto \mathbf{S}^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_0)$

By this equivalence, we can find the best $w_0$ in terms of LSE by (2-9) of Theorem 2-1,

$w_0^* = \bar{y} - \bar{\mathbf{x}}^\top\mathbf{w}^*$

(with the label coding of (2-38), $\bar{y} = 0$, so $w_0^* = -\bar{\mathbf{x}}^\top\mathbf{w}^*$). Also, by this equivalence to LSE, Fisher's discriminant inherits all the limitations of LSE.

Theorem 2-8 Fisher's Linear Discriminant for Multiple Classes.
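A minimal numerical sketch of Theorem 2-7: the direction $\mathbf{w}^*\propto\mathbf{S}^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_0)$ with the bias $w_0^* = \bar{y}-\bar{\mathbf{x}}^\top\mathbf{w}^*$ from the LSE connection under the coding (2-38) (so $\bar{y}=0$). The toy data are made up, and data points are stored as rows.

```python
import numpy as np

def fisher_binary_fit(X, y):
    """Fisher's binary linear discriminant (Theorem 2-7).
    X: (N, d) data points as rows; y: labels in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter S = S0 + S1 (unnormalized covariances)
    S = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(S, m1 - m0)        # direction of w*
    w0 = 0.0 - X.mean(axis=0) @ w          # ybar = 0 under the coding of (2-38)
    return w, w0

def fisher_binary_predict(X, w, w0):
    """Discriminant of Theorem 2-7: label 1 if x^T w >= -w0, else 0."""
    return (X @ w + w0 >= 0).astype(int)

# toy usage
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (40, 2)), rng.normal([3, 2], 1.0, (60, 2))])
y = np.array([0] * 40 + [1] * 60)
w, w0 = fisher_binary_fit(X, y)
print("training accuracy:", np.mean(fisher_binary_predict(X, w, w0) == y))
```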

Sigmoid & Softmax Function

In the above models, each data point is hard-assigned to a class by designing an activation function and hand-crafting objectives to optimize. In contrast, probabilistic models find the distribution $p(C_i|\boldsymbol{\phi})$ for each feature data point $\boldsymbol{\phi}$ rather than a hard assignment. Many classic probabilistic classification models involve the sigmoid function and its extension, the softmax function.

Figure 2-2 Sigmoid function and logit function: (a) the sigmoid $\sigma(a)$ and $1-\sigma(a)$; (b) the logit $a(\sigma) = \ln\frac{\sigma}{1-\sigma}$.

The famous sigmoid function is defined as $\sigma(a) = \frac{1}{1+e^{-a}}: \mathbb{R}\to(0,1)$, representing an S-shaped curve. Its inverse $a = \ln\frac{\sigma}{1-\sigma}$ is called the logit function. Also note $\sigma(a)\in(0,1)$ and

$1 - \sigma(a) = 1 - \frac{1}{1+e^{-a}} = \frac{e^{-a}}{1+e^{-a}} = \frac{1}{1+e^{a}} \ \Rightarrow\ 1 - \sigma(a) = \sigma(-a)$  (2-39)

We therefore note $(\sigma(a), \sigma(-a))$ can be viewed as a Bernoulli distribution. The softmax function, or normalized exponential function, can be viewed as an extension of the distribution $(\sigma(a), \sigma(-a))$ to higher dimensions. Given a vector $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_K)^\top$, the softmax outputs a multinomial distribution

$\sigma(\boldsymbol{\alpha}) = \left(\frac{e^{\alpha_1}}{\sum_{k}e^{\alpha_k}},\dots,\frac{e^{\alpha_K}}{\sum_{k}e^{\alpha_k}}\right)^{\!\top}: \mathbb{R}^K\to\mathbb{R}^K$

The name "softmax" contains "max" because it is a smooth surrogate that preserves the information needed to find the maximum; meanwhile, it is "soft" in contrast to the hardmax function, which outputs a vector of zeros except that the component corresponding to the largest value in $\boldsymbol{\alpha}$ is set to 1. In the special case $K = 2$, the softmax function is

$\sigma(\alpha_1,\alpha_2) = \left(\frac{e^{\alpha_1}}{e^{\alpha_1}+e^{\alpha_2}}, \frac{e^{\alpha_2}}{e^{\alpha_1}+e^{\alpha_2}}\right) = \left(\frac{1}{1+e^{-(\alpha_1-\alpha_2)}}, \frac{1}{1+e^{-(\alpha_2-\alpha_1)}}\right)$  (2-40)

Comparing (2-39) and (2-40), we see the relation between the sigmoid parameter $a$ and the two softmax parameters is $a = \alpha_1 - \alpha_2$. This extension is made concrete in (2-43) below, where both the sigmoid and softmax functions are applied in constructing probabilistic discriminants.

Property 2-2 The derivative of the sigmoid $\sigma(a) = \frac{1}{1+e^{-a}}$ is $\frac{d\sigma}{da} = \sigma(1-\sigma)$. Simply check

$\sigma'(a) = \frac{e^{-a}}{(1+e^{-a})^2} = \frac{1}{1+e^{-a}}\left(1 - \frac{1}{1+e^{-a}}\right) = \sigma(a)\left(1 - \sigma(a)\right)$

In practice, we very often use the log-sigmoid function for optimization purposes. With $\mathcal{s}(a) := \log\sigma(a)$, we have $\mathcal{s}'(a) = \frac{\sigma'(a)}{\sigma(a)} = 1 - \sigma(a) = \sigma(-a)$.

Property 2-3 The Jacobian matrix of the softmax function is a symmetric matrix.

$\mathbf{J}_{\boldsymbol{\alpha}}\sigma = \begin{pmatrix} \sigma_1(1-\sigma_1) & -\sigma_1\sigma_2 & \cdots & -\sigma_1\sigma_K \\ -\sigma_2\sigma_1 & \sigma_2(1-\sigma_2) & \cdots & -\sigma_2\sigma_K \\ \vdots & \vdots & \ddots & \vdots \\ -\sigma_K\sigma_1 & -\sigma_K\sigma_2 & \cdots & \sigma_K(1-\sigma_K) \end{pmatrix}$

Simply consider the $i$th component $\sigma_i(\boldsymbol{\alpha}) = \frac{e^{\alpha_i}}{\sum_{k}e^{\alpha_k}}$; then

$\frac{\partial\sigma_i(\boldsymbol{\alpha})}{\partial\alpha_j} = \begin{cases} \dfrac{e^{\alpha_i}\sum_{k\ne i}e^{\alpha_k}}{\left(\sum_{k}e^{\alpha_k}\right)^2} = \sigma_i(\boldsymbol{\alpha})\left(1-\sigma_i(\boldsymbol{\alpha})\right) & j = i \\ -\dfrac{e^{\alpha_i}e^{\alpha_j}}{\left(\sum_{k}e^{\alpha_k}\right)^2} = -\sigma_i(\boldsymbol{\alpha})\sigma_j(\boldsymbol{\alpha}) & j \ne i \end{cases}$

In practice, we also very often use the log-softmax function $\mathcal{s}_i(\boldsymbol{\alpha}) = \alpha_i - \log\sum_{k}e^{\alpha_k}$ for optimization purposes, and its Jacobian is

$\mathbf{J}_{\boldsymbol{\alpha}}\mathcal{s} = \begin{pmatrix} 1-\sigma_1 & -\sigma_2 & \cdots & -\sigma_K \\ -\sigma_1 & 1-\sigma_2 & \cdots & -\sigma_K \\ \vdots & \vdots & \ddots & \vdots \\ -\sigma_1 & -\sigma_2 & \cdots & 1-\sigma_K \end{pmatrix}$

Consider the $i$th component $\mathcal{s}_i(\boldsymbol{\alpha}) = \alpha_i - \log\sum_{k}e^{\alpha_k} = \log\sigma_i(\boldsymbol{\alpha})$; then

$\frac{\partial\mathcal{s}_i(\boldsymbol{\alpha})}{\partial\alpha_j} = \frac{\partial\log\sigma_i}{\partial\alpha_j} = \frac{1}{\sigma_i}\frac{\partial\sigma_i}{\partial\alpha_j} = \begin{cases} \frac{1}{\sigma_i}\sigma_i(1-\sigma_i) = 1-\sigma_i & j = i \\ \frac{1}{\sigma_i}(-\sigma_i\sigma_j) = -\sigma_j & j \ne i \end{cases}$
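A small numerical check of Properties 2-2 and 2-3: the sigmoid derivative and the softmax Jacobian derived above are compared against finite differences (the test values are arbitrary).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(alpha):
    e = np.exp(alpha - alpha.max())        # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(alpha):
    """Property 2-3: J_ij = sigma_i(1 - sigma_i) if i == j, else -sigma_i sigma_j."""
    s = softmax(alpha)
    return np.diag(s) - np.outer(s, s)

def log_softmax_jacobian(alpha):
    """Jacobian of s_i(alpha) = alpha_i - log sum_k exp(alpha_k): entries delta_ij - sigma_j."""
    s = softmax(alpha)
    return np.eye(len(alpha)) - np.ones((len(alpha), 1)) * s

# finite-difference checks
a, eps = 0.3, 1e-6
assert abs((sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
           - sigmoid(a) * (1 - sigmoid(a))) < 1e-6          # Property 2-2

alpha = np.array([0.5, -1.0, 2.0])
J_num = np.column_stack([(softmax(alpha + eps * np.eye(3)[j]) -
                          softmax(alpha - eps * np.eye(3)[j])) / (2 * eps)
                         for j in range(3)])
print(np.allclose(J_num, softmax_jacobian(alpha), atol=1e-6))  # True
```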

Sigmoid Assumption

The models we discuss here are based on the sigmoid assumption: for $i = 1,2$, the probability that a data point belongs to class $C_i$ is a sigmoid curve $\sigma(a_i)$ of some parameter $a_i$. It turns out that $a_i$ has a nice interpretation, and therefore the assumption makes sense. Given a feature data point $\boldsymbol{\phi}$, assume

$p(C_i|\boldsymbol{\phi}) = \sigma(a_i) = \frac{1}{1+e^{-a_i}}, \quad i = 1,2$  (2-41)

where "$p(C_i|\boldsymbol{\phi})$" is a simplified notation for "$p(\boldsymbol{\phi}\in C_i|\boldsymbol{\phi})$". Then we can solve for $a_1$ by

$p(C_1|\boldsymbol{\phi}) = \frac{p(\boldsymbol{\phi}|C_1)p(C_1)}{p(\boldsymbol{\phi}|C_1)p(C_1) + p(\boldsymbol{\phi}|C_2)p(C_2)} = \sigma(a_1) = \frac{1}{1+e^{-a_1}}$

$\Rightarrow\ a_1 = \ln\frac{\sigma}{1-\sigma} = \ln\frac{p(\boldsymbol{\phi}|C_1)p(C_1)}{p(\boldsymbol{\phi}|C_2)p(C_2)} = \ln\frac{p(\boldsymbol{\phi},C_1)}{p(\boldsymbol{\phi},C_2)}$  (2-42)

Likewise, $a_2 = \ln\frac{p(\boldsymbol{\phi}|C_2)p(C_2)}{p(\boldsymbol{\phi}|C_1)p(C_1)} = \ln\frac{p(\boldsymbol{\phi},C_2)}{p(\boldsymbol{\phi},C_1)} = -a_1$, which is consistent with (2-39). We can also verify

$p(C_1|\boldsymbol{\phi}) + p(C_2|\boldsymbol{\phi}) = \frac{1}{1+e^{-a_1}} + \frac{1}{1+e^{a_1}} = \frac{e^{a_1}}{e^{a_1}+1} + \frac{1}{1+e^{a_1}} = 1$

These expressions are highly interpretable: note $p(\boldsymbol{\phi}) = p(\boldsymbol{\phi},C_1) + p(\boldsymbol{\phi},C_2)$, so if $p(\boldsymbol{\phi},C_1)$ is much larger than $p(\boldsymbol{\phi},C_2)$, then naturally it is more likely that $\mathbf{x}\in C_1$, and the increase/decrease of the probability is S-shaped. Now for the general case of multiple classes $C_i$, $i = 1,\dots,K$, we assume

$p(C_i|\boldsymbol{\phi}) = \sigma_i(\boldsymbol{\alpha}) = \frac{e^{\alpha_i}}{\sum_{k}e^{\alpha_k}}$  (2-43)

Expanding (2-43) by Bayes' theorem, we further have

$p(C_i|\boldsymbol{\phi}) = \frac{p(\boldsymbol{\phi}|C_i)p(C_i)}{\sum_{k}p(\boldsymbol{\phi}|C_k)p(C_k)} = \frac{e^{\alpha_i}}{\sum_{k}e^{\alpha_k}}, \quad i = 1,\dots,K$

The solution for $\alpha_i$, $i = 1,\dots,K$, is not unique, but the most intuitive one is

$e^{\alpha_i} = p(\boldsymbol{\phi}|C_i)p(C_i) \ \Rightarrow\ \alpha_i = \ln p(\boldsymbol{\phi}|C_i)p(C_i) = \ln p(\boldsymbol{\phi},C_i)$  (2-44)

We can now turn to modeling both $p(\boldsymbol{\phi}|C_i)$ and $p(C_i)$; for example, Lemma 2-1 shows the maximum likelihood estimate of $p(C_i)$ is proportional to the class size observed in the training data regardless of what $p(\boldsymbol{\phi}|C_i)$ is, and Theorem 2-9 shows $p(\boldsymbol{\phi}|C_i)$ can be modelled by Gaussian maximum likelihood. If both $p(\boldsymbol{\phi}|C_i)$ and $p(C_i)$ have analytical forms, then $\alpha_i$ will have an analytical form. In practice, we may instead find $\alpha_i = \ln c + \ln p(\boldsymbol{\phi},C_i)$ for some constant $c$. This does not matter, as it merely scales every $e^{\alpha_i}$ by the same constant $c$ (e.g., see (2-50)),

$e^{\alpha_i} = c\times p(\boldsymbol{\phi}|C_i)p(C_i) \ \Rightarrow\ \alpha_i = \ln c + \ln p(\boldsymbol{\phi},C_i)$  (2-45)

For the special case $K = 2$, (2-44) is consistent with (2-42), because by (2-40) we have

$a_1 = \alpha_1 - \alpha_2 = \ln p(\boldsymbol{\phi},C_1) - \ln p(\boldsymbol{\phi},C_2) = \ln\frac{p(\boldsymbol{\phi},C_1)}{p(\boldsymbol{\phi},C_2)}$  (2-46)

At last, we note the multinomial distribution $p(C_i|\boldsymbol{\phi})$, $i = 1,\dots,K$, is analogous to the deterministic membership vector defined in Theorem 2-6, and can be referred to as a probabilistic membership or mixed membership; we say the membership defined by (2-43) is a softmax membership. One standard measure of the performance of a model that yields mixed memberships is the cross-entropy error discussed earlier. In the case of classification, it is $\mathcal{e} = -\sum_{i=1}^{N}\ln p(y_i|\boldsymbol{\phi}_i)$ given data $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1,\dots,\boldsymbol{\phi}_N)$, $\mathbf{y} = (y_1,\dots,y_N)$, simply adding up the negative log probabilities of the observed labels. The negative cross-entropy $-\mathcal{e}$ equals the log-likelihood of $p(\mathbf{y}|\boldsymbol{\Phi})$; thus, maximizing this likelihood is equivalent to minimizing the cross-entropy error.
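A minimal sketch of the cross-entropy error $\mathcal{e} = -\sum_i\ln p(y_i|\boldsymbol{\phi}_i)$ computed from a matrix of mixed memberships; the membership matrix here is a made-up example.

```python
import numpy as np

def cross_entropy_error(P, y):
    """e = -sum_i ln p(y_i | phi_i) for mixed memberships.
    P: (N, K) rows are softmax memberships p(C_k | phi_i); y: integer labels 0..K-1."""
    return -np.sum(np.log(P[np.arange(len(y)), y]))

# toy usage: three points, K = 2 classes
P = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])
y = np.array([0, 1, 0])
print(cross_entropy_error(P, y))   # -(ln 0.9 + ln 0.8 + ln 0.6)
```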

Class Probability ML

Lemma 2-1 Class Probability ML. We refer to $p(C_1),\dots,p(C_K)$ as the class probabilities. We now show the MLE of the class probabilities is proportional to the class sizes. For binary classification (with $y_i = 1$ indicating $\boldsymbol{\phi}_i\in C_1$ and $y_i = 0$ indicating $\boldsymbol{\phi}_i\in C_2$), we have the likelihood

$p(\boldsymbol{\Phi},\mathbf{y}) = p(\boldsymbol{\Phi}|\mathbf{y})p(\mathbf{y}) = \prod_{i:\,y_i=1}p(C_1)p(\boldsymbol{\phi}_i|C_1)\prod_{i:\,y_i=0}p(C_2)p(\boldsymbol{\phi}_i|C_2)$  (2-47)

$\Rightarrow\ \ln p = \sum_{i=1}^{N}\left[y_i\left(\ln p(C_1) + \ln p(\boldsymbol{\phi}_i|C_1)\right) + (1-y_i)\left(\ln p(C_2) + \ln p(\boldsymbol{\phi}_i|C_2)\right)\right]$

Let $p(C_1) = \pi$; then $p(C_2) = 1-\pi$, and the only terms dependent on $\pi$ are

$\ln p \propto \sum_{i=1}^{N}\left(y_i\ln\pi + (1-y_i)\ln(1-\pi)\right)$

Taking the derivative, we have

$\sum_{i=1}^{N}\left(\frac{y_i}{\pi} - \frac{1-y_i}{1-\pi}\right) = 0 \ \Rightarrow\ \frac{1}{\pi}\sum_{i=1}^{N}y_i - \frac{1}{1-\pi}\sum_{i=1}^{N}(1-y_i) = 0$

$\Rightarrow\ \frac{N_1}{\pi} - \frac{N_2}{1-\pi} = 0 \ \Rightarrow\ \pi = \frac{N_1}{N_1+N_2} = \frac{N_1}{N}$

where $N_1$ is the size of class $C_1$ and $N_2$ is the size of class $C_2$. This can be easily extended to multiple classes. Let $p(C_i) = \pi_i$, $i = 1,\dots,K$; then note $\pi_K = 1-\pi_1-\dots-\pi_{K-1}$ is dependent on the previous probabilities, and

$p(\boldsymbol{\Phi},\mathbf{y}) = \prod_{\boldsymbol{\phi}\in C_1}\pi_1 p(\boldsymbol{\phi}|C_1)\times\dots\times\prod_{\boldsymbol{\phi}\in C_K}\pi_K p(\boldsymbol{\phi}|C_K)$  (2-48)

$\Rightarrow\ \ln p \propto \sum_{\boldsymbol{\phi}\in C_1}\ln\pi_1 + \dots + \sum_{\boldsymbol{\phi}\in C_K}\ln\pi_K = N_1\ln\pi_1 + \dots + N_K\ln\pi_K$

Taking derivatives w.r.t. $\pi_1,\dots,\pi_{K-1}$ (recall $\pi_K$ depends on them), we have

$\frac{N_i}{\pi_i} = \frac{N_K}{\pi_K} \ \Rightarrow\ \pi_i = \frac{N_i}{N_K}\pi_K, \quad i = 1,\dots,K-1$  (2-49)

which implies

$1-\pi_K = \sum_{i=1}^{K-1}\pi_i = \sum_{i=1}^{K-1}\frac{N_i}{N_K}\pi_K = \frac{N-N_K}{N_K}\pi_K \ \Rightarrow\ \pi_K = \frac{N_K}{N}$

Plugging this back into (2-49), we have the general result

$\pi_i = \frac{N_i}{N}, \quad i = 1,\dots,K$
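The MLE of Lemma 2-1 amounts to counting; a tiny sketch, assuming labels are coded $0,\dots,K-1$.

```python
import numpy as np

def class_probability_mle(y, K):
    """Lemma 2-1: pi_i = N_i / N, the class frequencies observed in the training labels."""
    counts = np.bincount(y, minlength=K)   # N_1, ..., N_K (labels coded 0..K-1)
    return counts / counts.sum()

print(class_probability_mle(np.array([0, 0, 1, 2, 2, 2]), K=3))  # [0.333 0.167 0.5]
```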

GaussianML-Softmax Discriminant

Theorem 2-9 GaussianML-Softmax Discriminant. For this model, we assume the feature points in each class are Gaussian distributed. The Gaussians of the classes can have different means $\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_K$ but share the same covariance $\boldsymbol{\Sigma}$, to make the inferences for the classes dependent. That is,

$p(\boldsymbol{\phi}|C_i) \sim \mathrm{Gaussian}(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}), \quad i = 1,\dots,K$

Note we do not give each class its own covariance, otherwise this model reduces to independent maximum likelihoods of Gaussian distributions. The shared covariance is clearly one limitation of Gaussian ML; the other main limitation is its high complexity, with the number of parameters growing quadratically in the feature dimension ($K$ considered fixed). We also emphasize again that, like other MLEs, it tends to suffer from overfitting and is not robust to outliers (see Theorem 2-2). Using (2-48), we have

$\ln p \propto \sum_{\boldsymbol{\phi}\in C_1}\ln p(\boldsymbol{\phi}|C_1) + \dots + \sum_{\boldsymbol{\phi}\in C_K}\ln p(\boldsymbol{\phi}|C_K)$

Clearly $\frac{\partial\ln p}{\partial\boldsymbol{\mu}_i} = \frac{\partial}{\partial\boldsymbol{\mu}_i}\sum_{\boldsymbol{\phi}\in C_i}\ln p(\boldsymbol{\phi}|C_i)$, $i = 1,\dots,K$, which is the same as in the MLE for $\boldsymbol{\mu}_i$ of the Gaussian likelihood $\mathrm{Gaussian}(\boldsymbol{\mu}_i,\boldsymbol{\Sigma})$ given only the data in class $C_i$, and therefore by (1-19) of Theorem 1-11,

$\boldsymbol{\mu}_i = \frac{1}{N_i}\sum_{\boldsymbol{\phi}\in C_i}\boldsymbol{\phi}, \quad i = 1,\dots,K$

For the covariance $\boldsymbol{\Sigma}$, using (1-18) of Theorem 1-11, we have

$\frac{\partial\ln p}{\partial\boldsymbol{\Sigma}} = \sum_{i=1}^{K}\left(-\frac{N_i}{2}\boldsymbol{\Sigma}^{-1} + \frac{1}{2}\boldsymbol{\Sigma}^{-1}\!\sum_{\boldsymbol{\phi}\in C_i}(\boldsymbol{\phi}-\boldsymbol{\mu}_i)(\boldsymbol{\phi}-\boldsymbol{\mu}_i)^\top\boldsymbol{\Sigma}^{-1}\right) = \mathbf{0}$

$\Rightarrow\ \boldsymbol{\Sigma} = \frac{1}{N}\sum_{i=1}^{K}\sum_{\boldsymbol{\phi}\in C_i}(\boldsymbol{\phi}-\boldsymbol{\mu}_i)(\boldsymbol{\phi}-\boldsymbol{\mu}_i)^\top$

Let $\boldsymbol{\Sigma}_i = \frac{1}{N_i}\sum_{\boldsymbol{\phi}\in C_i}(\boldsymbol{\phi}-\boldsymbol{\mu}_i)(\boldsymbol{\phi}-\boldsymbol{\mu}_i)^\top$, $i = 1,\dots,K$, which is the biased sample covariance of each class, or the MLE of $\boldsymbol{\Sigma}$ based only on the data from class $C_i$; then

$\boldsymbol{\Sigma} = \sum_{i=1}^{K}\frac{N_i}{N}\boldsymbol{\Sigma}_i = \sum_{i=1}^{K}\pi_i\boldsymbol{\Sigma}_i$

Recall the eventual goal of a probabilistic discriminant is to obtain $p(C_i|\boldsymbol{\phi})$. With the sigmoid (softmax) assumption, $p(C_i|\boldsymbol{\phi}) = \sigma_i(\boldsymbol{\alpha})$. Note by the discussion of (2-44) and (2-45), a factor common to all classes is not important and can be dropped; the terms $\ln\frac{1}{\sqrt{(2\pi)^M|\boldsymbol{\Sigma}|}}$ and $-\frac{1}{2}\boldsymbol{\phi}^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\phi}$ below are shared by all classes and play exactly this role. Therefore

$\ln p(\boldsymbol{\phi}|C_i)p(C_i) = \ln\frac{1}{\sqrt{(2\pi)^M|\boldsymbol{\Sigma}|}} - \frac{1}{2}(\boldsymbol{\phi}-\boldsymbol{\mu}_i)^\top\boldsymbol{\Sigma}^{-1}(\boldsymbol{\phi}-\boldsymbol{\mu}_i) + \ln\frac{N_i}{N}$  (2-50)

$\Rightarrow\ \alpha_i = \boldsymbol{\mu}_i^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\phi} - \frac{1}{2}\boldsymbol{\mu}_i^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln\frac{N_i}{N}$

We can rewrite $\boldsymbol{\alpha} = \mathbf{W}^\top\boldsymbol{\phi} + \mathbf{b}$, where $\mathbf{W} = (\mathbf{w}_1,\dots,\mathbf{w}_K)$ with $\mathbf{w}_i = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$, and $\mathbf{b} = (b_1,\dots,b_K)^\top$ with $b_i = -\frac{1}{2}\boldsymbol{\mu}_i^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i + \ln\frac{N_i}{N}$. Therefore, the softmax parameter of the Gaussian ML discriminant is linear in $\boldsymbol{\phi}$.
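A sketch of Theorem 2-9 as a fit/predict routine: per-class means, the shared covariance $\boldsymbol{\Sigma}=\sum_i\pi_i\boldsymbol{\Sigma}_i$, and the linear softmax parameters $\mathbf{w}_i=\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i$, $b_i=-\frac{1}{2}\boldsymbol{\mu}_i^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_i+\ln\frac{N_i}{N}$. Feature points are stored as rows and the toy data are made up.

```python
import numpy as np

def gaussian_ml_fit(Phi, y, K):
    """GaussianML-softmax discriminant (Theorem 2-9).
    Phi: (N, M) feature points as rows; y: labels in {0,...,K-1}.
    Returns W (M, K) and b (K,) so that alpha = W^T phi + b."""
    N, M = Phi.shape
    mus = np.array([Phi[y == k].mean(axis=0) for k in range(K)])    # class means mu_k
    pis = np.array([(y == k).mean() for k in range(K)])             # class probabilities N_k/N
    Sigma = np.zeros((M, M))                                        # shared covariance
    for k in range(K):
        D = Phi[y == k] - mus[k]
        Sigma += D.T @ D / N               # accumulates sum_k (N_k/N) * Sigma_k
    Sinv = np.linalg.inv(Sigma)
    W = Sinv @ mus.T                                                # columns w_k = Sigma^{-1} mu_k
    b = -0.5 * np.sum(mus @ Sinv * mus, axis=1) + np.log(pis)       # b_k
    return W, b

def gaussian_ml_predict(Phi, W, b):
    """Posterior p(C_k | phi) = softmax_k(W^T phi + b); predict the argmax class."""
    A = Phi @ W + b
    A -= A.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(A)
    P /= P.sum(axis=1, keepdims=True)
    return P.argmax(axis=1), P

# toy usage
rng = np.random.default_rng(4)
Phi = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 50)
W, b = gaussian_ml_fit(Phi, y, K=3)
print("training accuracy:", np.mean(gaussian_ml_predict(Phi, W, b)[0] == y))
```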

Logistic Regression

The complexity of the above Gaussian ML discriminant in Theorem 2-9 is usually considered too high. To reduce the complexity, the linearity of the softmax parameter in $\boldsymbol{\phi}$ provides inspiration: we may conversely restrict the softmax parameter to $\boldsymbol{\alpha} = \mathbf{W}^\top\boldsymbol{\phi}$ from the start, with $\mathbf{W}$ as the parameters. Such a model has linear complexity if $K$ is fixed. Logistic regression is also more robust (see (6-4)) when optimized with Newton's method. However, with the reduced model complexity, it loses a closed-form solution for multi-class classification, as shown below.

Binary logistic regression. For simplicity, we first consider binary classification with labels $l_1 = 1$ and $l_0 = 0$. The model makes an i.i.d. Bernoulli assumption that $y_i \sim \mathrm{Bernoulli}\!\left(\sigma(\boldsymbol{\phi}_i^\top\mathbf{w})\right)$, $i = 1,\dots,N$, or equivalently,

$p(C_1|\boldsymbol{\phi}_i) = \sigma(\boldsymbol{\phi}_i^\top\mathbf{w}), \quad i = 1,\dots,N$  (2-51)

And we optimize the following likelihood (negative cross-entropy),

$p(\boldsymbol{\Phi},\mathbf{y}) \propto p(\mathbf{y}|\boldsymbol{\Phi}) = \prod_{i=1}^{N}\sigma(\boldsymbol{\phi}_i^\top\mathbf{w})^{y_i}\left(1-\sigma(\boldsymbol{\phi}_i^\top\mathbf{w})\right)^{1-y_i}$

$\Rightarrow\ \ln p = \sum_{i=1}^{N}\left[y_i\ln\sigma(\boldsymbol{\phi}_i^\top\mathbf{w}) + (1-y_i)\ln\left(1-\sigma(\boldsymbol{\phi}_i^\top\mathbf{w})\right)\right]$  (2-52)

Closed-form solution. Applying the derivative of the sigmoid from Property 2-2 and the derivative of a vector product from Fact 1-7, and writing $\sigma_i := \sigma(\boldsymbol{\phi}_i^\top\mathbf{w})$, we have

$\nabla_{\mathbf{w}}\ln p = \sum_{i=1}^{N}\left[\frac{y_i}{\sigma_i}\sigma_i(1-\sigma_i)\boldsymbol{\phi}_i - \frac{1-y_i}{1-\sigma_i}\sigma_i(1-\sigma_i)\boldsymbol{\phi}_i\right] = \sum_{i=1}^{N}\left[y_i(1-\sigma_i) - (1-y_i)\sigma_i\right]\boldsymbol{\phi}_i = \sum_{i=1}^{N}\left(y_i - \sigma(\boldsymbol{\phi}_i^\top\mathbf{w})\right)\boldsymbol{\phi}_i$

Let $\boldsymbol{\sigma} = \begin{pmatrix}\sigma_1\\ \vdots\\ \sigma_N\end{pmatrix} = \begin{pmatrix}\sigma(\boldsymbol{\phi}_1^\top\mathbf{w})\\ \vdots\\ \sigma(\boldsymbol{\phi}_N^\top\mathbf{w})\end{pmatrix}$; then

$\nabla_{\mathbf{w}}\ln p = \boldsymbol{\Phi}(\mathbf{y}-\boldsymbol{\sigma})$  (2-53)

and $\nabla_{\mathbf{w}}\ln p = \mathbf{0}$ yields $\boldsymbol{\Phi}\boldsymbol{\sigma}^* = \boldsymbol{\Phi}\mathbf{y}$, for which one solution is $\boldsymbol{\sigma}^* = \boldsymbol{\Phi}^+\boldsymbol{\Phi}\mathbf{y}$. Writing $\boldsymbol{\sigma}^* = (\sigma_1^*,\dots,\sigma_N^*)^\top$, the above can be further solved as

$\boldsymbol{\Phi}^\top\mathbf{w} = \begin{pmatrix}\ln\frac{\sigma_1^*}{1-\sigma_1^*}\\ \vdots\\ \ln\frac{\sigma_N^*}{1-\sigma_N^*}\end{pmatrix} \ \Rightarrow\ \mathbf{w}^* = (\boldsymbol{\Phi}^\top)^+\begin{pmatrix}\ln\frac{\sigma_1^*}{1-\sigma_1^*}\\ \vdots\\ \ln\frac{\sigma_N^*}{1-\sigma_N^*}\end{pmatrix}$

Iterative solutions. Notice $\ln p$ in (2-52) is a strictly concave function w.r.t. $\mathbf{w}$, since the logarithm is strictly concave and every $\boldsymbol{\phi}_i^\top\mathbf{w}$ is linear, so the MLE for $\ln p$ has a unique solution and is solvable by classic convex optimization algorithms such as gradient ascent using (2-53). Let $\bar{\boldsymbol{\phi}}_1$ denote the sample mean of the first class, and $\mathbb{E}[\boldsymbol{\phi}]_1$ denote the expected data of the first class; then $\boldsymbol{\Phi}\mathbf{y} = N_1\bar{\boldsymbol{\phi}}_1$ is the sample feature sum of the first class, $\boldsymbol{\Phi}\boldsymbol{\sigma} = \sum_{i=1}^{N}p(C_1|\boldsymbol{\phi}_i)\boldsymbol{\phi}_i = N_1\mathbb{E}[\boldsymbol{\phi}]_1$ is the expected feature sum of the first class, and therefore $\nabla_{\mathbf{w}}\ln p = N_1\bar{\boldsymbol{\phi}}_1 - N_1\mathbb{E}[\boldsymbol{\phi}]_1$, i.e., the fastest ascending direction of $\mathbf{w}$ points from the expected sum to the sample sum, intuitively meaning the model is evolving from the prior assumed distribution to fit the observed data. We may also use Newton's method, which comes with quadratic convergence; its update is as follows,

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}_{\mathbf{w}}^{-1}\nabla_{\mathbf{w}}\ln p$  (2-54)

where $\mathbf{H}_{\mathbf{w}}$ is the negative definite (because of strict concavity) Hessian matrix of $\ln p$ w.r.t. $\mathbf{w}$, consisting of all second-order derivatives of $\ln p$ w.r.t. the components of $\mathbf{w}$. Recall the Hessian matrix is the Jacobian of the gradient. Denoting the Jacobian by the bold symbol "$\boldsymbol{\nabla}$", we have

$\boldsymbol{\nabla}_{\mathbf{w}}\boldsymbol{\sigma} = \begin{pmatrix}\sigma_1(1-\sigma_1)\boldsymbol{\phi}_1^\top\\ \vdots\\ \sigma_N(1-\sigma_N)\boldsymbol{\phi}_N^\top\end{pmatrix} = \begin{pmatrix}\sigma_1(1-\sigma_1) & & \\ & \ddots & \\ & & \sigma_N(1-\sigma_N)\end{pmatrix}\boldsymbol{\Phi}^\top$

Recall the model makes an i.i.d. Bernoulli assumption s.t. $y_i\sim\mathrm{Bernoulli}\!\left(\sigma(\boldsymbol{\phi}_i^\top\mathbf{w})\right)$, and therefore the diagonal matrix $\mathrm{diag}\!\left(\sigma_1(1-\sigma_1),\dots,\sigma_N(1-\sigma_N)\right)$ is the covariance matrix of the random vector $\mathbf{y}$. Denoting it by $\boldsymbol{\Sigma}$, we arrive at

$\boldsymbol{\nabla}_{\mathbf{w}}\boldsymbol{\Phi}(\mathbf{y}-\boldsymbol{\sigma}) = \boldsymbol{\nabla}_{\mathbf{w}}(-\boldsymbol{\Phi}\boldsymbol{\sigma}) = -\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top \ \Rightarrow\ \mathbf{H}_{\mathbf{w}} = \boldsymbol{\nabla}_{\mathbf{w}}\boldsymbol{\Phi}(\mathbf{y}-\boldsymbol{\sigma}) = -\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top$

Plugging this back into (2-54), and noting that both $\boldsymbol{\Sigma}$ and $\boldsymbol{\sigma}$ depend on $\mathbf{w}^{(t)}$ and hence on $t$ (we omit the superscript for simplicity), we have

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + (\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top)^{-1}\boldsymbol{\Phi}(\mathbf{y}-\boldsymbol{\sigma})$

$= (\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top)^{-1}\left(\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top\mathbf{w}^{(t)} + \boldsymbol{\Phi}(\mathbf{y}-\boldsymbol{\sigma})\right)$

$= (\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^\top)^{-1}\boldsymbol{\Phi}\boldsymbol{\Sigma}\left(\boldsymbol{\Phi}^\top\mathbf{w}^{(t)} + \boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\sigma})\right)$  (2-55)
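A sketch of the Newton/IRLS update (2-54)–(2-55) for binary logistic regression; feature points are stored as rows (the text stacks them as columns), and the small ridge term is an added assumption to keep the Hessian invertible on (nearly) separable toy data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_newton(Phi, y, iters=20, ridge=1e-8):
    """Newton / IRLS updates (2-54)-(2-55) for binary logistic regression.
    Phi: (N, M) feature points as rows; y: labels in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        s = sigmoid(Phi @ w)                                 # sigma_i = sigma(phi_i^T w)
        R = s * (1.0 - s)                                    # Bernoulli variances: diagonal of Sigma
        H = Phi.T @ (Phi * R[:, None]) + ridge * np.eye(M)   # Phi Sigma Phi^T in the text
        w = w + np.linalg.solve(H, Phi.T @ (y - s))          # (2-55): w + H^{-1} Phi (y - sigma)
    return w

# toy usage
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1.0, 1.0, (40, 2)), rng.normal(+1.0, 1.0, (40, 2))])
Phi = np.hstack([np.ones((80, 1)), X])       # explicit bias: first feature fixed at 1 (cf. (2-29))
y = np.array([0] * 40 + [1] * 40)
w = logistic_newton(Phi, y)
print("training accuracy:", np.mean((sigmoid(Phi @ w) >= 0.5) == y))
```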

Note (2-55) coincides with the solution of the weighted LSE as in EX 1, and therefore we have

$\mathbf{w}^{(t+1)} = \arg\min_{\mathbf{w}}\left\|\boldsymbol{\Sigma}^{1/2}\left(\boldsymbol{\Phi}^\top\mathbf{w}^{(t)} + \boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\sigma})\right) - \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Phi}^\top\mathbf{w}\right\|^2$  (2-56)

Evaluating the weighted residual at $\mathbf{w} = \mathbf{w}^{(t)}$ gives $\left\|\boldsymbol{\Sigma}^{-1/2}(\mathbf{y}-\boldsymbol{\sigma})\right\|^2$; that is, $\mathbf{w}^{(t+1)}$ is trying to minimize the squared error between $\mathbf{y}$ and $\boldsymbol{\sigma}$ weighted by the current inverse standard deviations $\boldsymbol{\Sigma}^{-1/2}$. Now if $\mathbf{w}^{(t)}$ goes to extreme values that make some $\sigma_i\approx y_i$, then the corresponding inverse standard deviation on the diagonal of $\boldsymbol{\Sigma}^{-1/2}$ will increase and upweight the corresponding error $\sigma_i - y_i$; therefore, the weights in $\boldsymbol{\Sigma}^{-1/2}$ make it harder for each $\sigma_i$ to overfit $y_i$, or equivalently prevent the components of $\mathbf{w}$ from becoming very large or very small. The algorithm (2-55) thus falls into the category of iteratively reweighted least squares (IRLS) algorithms.

Multi-class logistic regression. For multi-class classification, we first make an assumption similar to

(2-51) using the softmax function. Let $\mathbf{W} = (\mathbf{w}_1,\dots,\mathbf{w}_K)$, and let $\sigma_k$ denote the $k$th component of the softmax function $\sigma$; then

$p(C_k|\boldsymbol{\phi}_i) = \sigma_k(\mathbf{W}^\top\boldsymbol{\phi}_i) = \frac{e^{\mathbf{w}_k^\top\boldsymbol{\phi}_i}}{\sum_{j=1}^{K}e^{\mathbf{w}_j^\top\boldsymbol{\phi}_i}}, \quad i = 1,\dots,N,\ k = 1,\dots,K$  (2-57)

The log-likelihood of the class memberships (the negative cross-entropy) is

$\ln p \propto \sum_{\boldsymbol{\phi}\in C_1}\ln p(C_1|\boldsymbol{\phi}) + \dots + \sum_{\boldsymbol{\phi}\in C_K}\ln p(C_K|\boldsymbol{\phi})$

$= \sum_{\boldsymbol{\phi}\in C_1}\ln\frac{e^{\mathbf{w}_1^\top\boldsymbol{\phi}}}{\sum_{j}e^{\mathbf{w}_j^\top\boldsymbol{\phi}}} + \dots + \sum_{\boldsymbol{\phi}\in C_K}\ln\frac{e^{\mathbf{w}_K^\top\boldsymbol{\phi}}}{\sum_{j}e^{\mathbf{w}_j^\top\boldsymbol{\phi}}}$

$= \sum_{\boldsymbol{\phi}\in C_1}\left(\mathbf{w}_1^\top\boldsymbol{\phi} - \ln\sum_{j}e^{\mathbf{w}_j^\top\boldsymbol{\phi}}\right) + \dots + \sum_{\boldsymbol{\phi}\in C_K}\left(\mathbf{w}_K^\top\boldsymbol{\phi} - \ln\sum_{j}e^{\mathbf{w}_j^\top\boldsymbol{\phi}}\right)$  (2-58)

Let $\bar{\boldsymbol{\phi}}_k$ denote the sample mean of the $k$th class, and $\mathbb{E}[\boldsymbol{\phi}]_k$ denote the expected data of the $k$th class. Taking the derivative of the above likelihood, we have

$\frac{\partial\ln p}{\partial\mathbf{w}_k} = \sum_{\boldsymbol{\phi}\in C_k}\boldsymbol{\phi} - \sum_{i=1}^{N}\frac{e^{\mathbf{w}_k^\top\boldsymbol{\phi}_i}}{\sum_{j}e^{\mathbf{w}_j^\top\boldsymbol{\phi}_i}}\boldsymbol{\phi}_i = \sum_{\boldsymbol{\phi}\in C_k}\boldsymbol{\phi} - \sum_{i=1}^{N}p(C_k|\boldsymbol{\phi}_i)\boldsymbol{\phi}_i = N_k\bar{\boldsymbol{\phi}}_k - N_k\mathbb{E}[\boldsymbol{\phi}]_k$
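A sketch of gradient ascent on (2-58), using the gradient $\partial\ln p/\partial\mathbf{w}_k = \sum_{\boldsymbol{\phi}\in C_k}\boldsymbol{\phi} - \sum_i p(C_k|\boldsymbol{\phi}_i)\boldsymbol{\phi}_i$ for all $k$ at once; the step size, iteration count, and toy data are arbitrary choices, and feature points are stored as rows.

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_logistic_ga(Phi, y, K, eta=0.05, iters=500):
    """Gradient ascent on the multi-class log-likelihood (2-58).
    Phi: (N, M) feature points as rows; y: labels in {0,...,K-1}."""
    N, M = Phi.shape
    Y = np.eye(K)[y]                          # (N, K) one-hot class indicators
    W = np.zeros((M, K))                      # columns are w_k
    for _ in range(iters):
        P = softmax_rows(Phi @ W)             # p(C_k | phi_i)
        W += eta * Phi.T @ (Y - P) / N        # all K gradients at once (averaged over N)
    return W

# toy usage
rng = np.random.default_rng(6)
Phi = np.vstack([rng.normal(m, 1.0, (40, 2)) for m in ([0, 0], [4, 0], [0, 4])])
Phi = np.hstack([np.ones((120, 1)), Phi])     # explicit bias feature
y = np.repeat([0, 1, 2], 40)
W = multiclass_logistic_ga(Phi, y, K=3)
print("training accuracy:", np.mean(np.argmax(Phi @ W, axis=1) == y))
```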

Again, we can iteratively apply gradient ascent for each $\mathbf{w}_k$; the fastest ascending direction of $\mathbf{w}_k$ points from the expected sum to the sample sum, consistent with binary logistic regression, intuitively meaning the model is evolving from the prior assumed distribution to fit the observed data.

Bayesian Logistic Regression.

Probit Regression.