Linear Discriminant Analysis
Discriminant analysis, or classification, can be viewed as a special type of regression, sometimes called categorical regression. As before, we are given a data set $(\mathbf{X}, \mathbf{y})$ with data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$, $\mathbf{x}_i \in \mathbb{R}^D$, but now $\mathbf{y} = (y_1, \dots, y_N) \in \mathbb{L}^N$, $\mathbb{L} \subseteq \mathbb{N}$, where $\mathbb{L}$ is a finite set of discrete class labels, and we call its size $K = |\mathbb{L}|$ the class size. Note that the labels are categorical and serve the purpose of labeling data; although labels are very often represented by integers like $1, \dots, K$, their numerical values usually do not matter conceptually. For example, the labeling still makes sense if we choose a different set of distinct integers. Discriminant analysis finds a discriminant function $f: \mathbb{R}^D \to \mathbb{L}$, and $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_N) = f(\mathbf{X})$ gives an estimate of the true labels $\mathbf{y}$. The percentage of labels on which $\hat{\mathbf{y}}$ and $\mathbf{y}$ agree is called the accuracy. The most basic type is binary classification, where $K = 2$. For a future data point $\tilde{\mathbf{x}}$, $f$ provides the label prediction $\tilde{y} = f(\tilde{\mathbf{x}})$.
We emphasize again that the labels mainly serve to distinguish groups of data; however, it is not uncommon for the labels to also serve technical or interpretive purposes. For example, 1) in the perceptron algorithm, the label set is designed as $\{-1, 1\}$ to suit its error function; 2) Fisher's binary linear discriminant in Theorem 2-7 can be shown to be equivalent to LSE if a particular label set is used; 3) in sentiment analysis, $-1$ is usually used to label the class of negative sentiment, and $+1$ is used to label the class of positive sentiment. The design of the label set is called label coding.
As before, a feature function $\phi$ can be applied to each $\mathbf{x}_i$, so that we are actually dealing with $\boldsymbol{\phi}_i = \phi(\mathbf{x}_i)$. The basic idea is to utilize the regression in (2-2) and write the discriminant as
$$f(\mathbf{x}) = f_a\!\left(\phi(\mathbf{x})^{\mathsf{T}}\mathbf{w}\right) = f_a\!\left(\boldsymbol{\phi}^{\mathsf{T}}\mathbf{w}\right) \tag{2-27}$$

where $f_a$ converts the decimal value resulting from $\boldsymbol{\phi}^{\mathsf{T}}\mathbf{w}$ into a label and is very often referred to as the activation function. Discriminant analysis based on (2-27) is called linear discriminant analysis. For example, suppose we have the binary label set $\mathbb{L} = \{l_1, l_2\}$; then we can define

$$\hat{y} = f(\mathbf{x}) = \begin{cases} l_1 & \phi(\mathbf{x})^{\mathsf{T}}\mathbf{w} \ge 0 \\ l_2 & \phi(\mathbf{x})^{\mathsf{T}}\mathbf{w} < 0 \end{cases} \tag{2-28}$$
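As a concrete illustration of the threshold discriminant (2-28), here is a minimal Python sketch; the function name `linear_discriminant`, the default label pair, and the identity feature map are illustrative choices, not fixed by the text.

```python
import numpy as np

def linear_discriminant(x, w, labels=(1, 0), phi=lambda v: v):
    """Threshold discriminant of (2-28): labels[0] if phi(x)^T w >= 0, else labels[1]."""
    score = phi(x) @ w              # the decimal value phi(x)^T w
    return labels[0] if score >= 0 else labels[1]

# Example: w = (2, -1), so x = (1, 0.5) scores 2*1 - 1*0.5 = 1.5 >= 0 -> first label
w = np.array([2.0, -1.0])
print(linear_discriminant(np.array([1.0, 0.5]), w))   # prints 1
```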
For the special case where the bias $w_0$ is explicit, we can write

$$f(\mathbf{x}) = f_a\!\left(\phi(\mathbf{x})^{\mathsf{T}}\mathbf{w} + w_0\right) = f_a\!\left(\boldsymbol{\phi}^{\mathsf{T}}\mathbf{w} + w_0\right) \tag{2-29}$$

which can be viewed as classification in a space of one higher dimension, consistent with (2-27), by putting all feature points at height 1 in the first dimension. For the error, there could be a variety of different definitions. Considering the categorical nature of the labels, one intuitive error definition is to count the differences between $\hat{\mathbf{y}} = f(\mathbf{X})$ and $\mathbf{y}$:

$$e_I(\hat{\mathbf{y}}) = \sum_{i=1}^{N} \mathbb{1}_{\hat{y}_i \neq y_i}$$

which we refer to as the identity error. The identity error is the most widely used performance indicator, but it is hard to optimize (it is highly piecewise, and its derivative is zero almost everywhere).
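A small sketch of the identity error and the corresponding accuracy, assuming NumPy and hypothetical function names:

```python
import numpy as np

def identity_error(y_hat, y):
    """Identity (0-1) error: number of positions where predicted and true labels differ."""
    return int(np.sum(np.asarray(y_hat) != np.asarray(y)))

def accuracy(y_hat, y):
    """Fraction of labels predicted correctly."""
    return float(np.mean(np.asarray(y_hat) == np.asarray(y)))

# Example: one mismatch out of four labels -> identity error 1, accuracy 0.75
print(identity_error([1, 0, 1, 1], [1, 0, 0, 1]), accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```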
Perceptron Algorithm
The perceptron algorithm is perhaps the simplest algorithm for binary classification and can be viewed as the simplest one-layer neural network with a single output. Technically, it minimizes a more numerical error rather than the identity error. To achieve this, it designs the label set as $\{-1, 1\}$. For each data point $\mathbf{x}_i$ with label $y_i$, the algorithm checks whether $\operatorname{sgn}(\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w}) = y_i$; if so, the classification is correct; otherwise, the classification is wrong, and we calculate the perceptron error as

$$e_i = -y_i \boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w}$$

Note that $\operatorname{sgn}(\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w}) \neq y_i$ is equivalent to $y_i \boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w} < 0$, and thus we put a minus sign in front to make the error positive. The objective is now simply to minimize the error, i.e. $\min_{\mathbf{w} \neq \mathbf{0}} \sum_i e_i$. The classic way is to use gradient descent, with $\eta$ as a predefined positive step length:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla_{\mathbf{w}^{(t)}} e_i$$

where, by Fact 1-7,
$$\nabla_{\mathbf{w}^{(t)}} e_i = -y_i \boldsymbol{\phi}_i \;\Rightarrow\; \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta y_i \boldsymbol{\phi}_i \tag{2-30}$$

The above can be modified into online updates applied whenever an error is encountered:
$$\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} + \eta y_i \boldsymbol{\phi}_i \quad \text{if } \operatorname{sgn}\!\left(\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w}^{(t)}\right) \neq y_i,\; i \in \{1, \dots, N\} \tag{2-31}$$

The geometric interpretation is illustrated in Figure 2-1. A hyperplane $H(\mathbf{x}) = \mathbf{w}^{\mathsf{T}}\mathbf{x} = 0$ goes through the origin and has $\mathbf{w}$ as its normal. With a fixed $\mathbf{w}$, $\cos\langle\boldsymbol{\phi}_i, \mathbf{w}\rangle = \frac{\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w}}{\|\boldsymbol{\phi}_i\|\,\|\mathbf{w}\|}$ is positive for all $\boldsymbol{\phi}_i$ on one side of $H$ and negative for all $\boldsymbol{\phi}_i$ on the other side, $d(\boldsymbol{\phi}_i, H) = \frac{\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w}}{\|\mathbf{w}\|}$ is a signed distance from the point $\boldsymbol{\phi}_i$ to the hyperplane, and $\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w} = \|\mathbf{w}\|\, d(\boldsymbol{\phi}_i, H) \propto d(\boldsymbol{\phi}_i, H)$. The interpretation becomes clearer if we restrict $\|\mathbf{w}\| = 1$: then $\boldsymbol{\phi}_i^{\mathsf{T}}\mathbf{w} = d(\boldsymbol{\phi}_i, H)$, and the perceptron error adds up the distances of the misclassified points to the hyperplane. It is trivial to verify that the optimal solutions with and without the constraint are identical. Thus, in practice, we may simply carry out the optimization without the "unit normal" constraint for simplicity and normalize the optimal solution $\mathbf{w}^*$ at the end. The perceptron objective $\min_{\mathbf{w} \neq \mathbf{0}} \sum_i e_i$ can therefore be viewed as finding an orientation of $H$ that best separates the data points, where "best" means minimum perceptron error. The initialization $\mathbf{w}^{(0)}$ for (2-31) can be a random choice of any nonzero vector. The advantages of the perceptron are its simplicity, its online updates, and its guaranteed convergence in the ideal situation; the disadvantages are that it is limited to binary classification and that it runs indefinitely in the non-ideal situation where the data points are not separable by a hyperplane (in this case a heuristic rule is needed for termination).

Figure 2-1: The perceptron algorithm finds the best hyperplane through the origin separating the points. With a unit normal, the perceptron error can be interpreted as adding up the distances from the misclassified points to the hyperplane.

Fact 2-5 (Cauchy–Schwarz inequality): $\|\mathbf{x}\|_{\langle\cdot\rangle}\,\|\mathbf{y}\|_{\langle\cdot\rangle} \geq |\langle\mathbf{x}, \mathbf{y}\rangle|$ for any $\mathbf{x}, \mathbf{y}$, where $\langle\mathbf{x}, \mathbf{y}\rangle$ is an inner product and $\|\mathbf{x}\|_{\langle\cdot\rangle} = \sqrt{\langle\mathbf{x}, \mathbf{x}\rangle}$ is the inner-product-induced norm.
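The online update rule (2-31), together with a random nonzero initialization and the final normalization of the solution, can be sketched as follows. This is a minimal NumPy sketch; the function name, the epoch cap for the non-separable case, and the default step length are assumptions for illustration.

```python
import numpy as np

def perceptron_train(Phi, y, eta=1.0, max_epochs=1000, seed=None):
    """Online perceptron updates of (2-31).

    Phi : (N, M) array of feature vectors phi_i (a bias column may be appended beforehand).
    y   : (N,) array of labels coded as {-1, +1}.
    max_epochs is a heuristic cap for the non-separable case, as the text suggests.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Phi.shape[1])      # random nonzero initialization w^(0)
    for _ in range(max_epochs):
        mistakes = 0
        for phi_i, y_i in zip(Phi, y):
            if np.sign(phi_i @ w) != y_i:      # misclassified: sgn(phi_i^T w) != y_i
                w = w + eta * y_i * phi_i      # w := w + eta * y_i * phi_i
                mistakes += 1
        if mistakes == 0:                      # separable case: perceptron error is zero
            break
    return w / np.linalg.norm(w)               # normalize to a unit normal at the end
```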
Theorem 2-5 (Perceptron Convergence Theorem). In binary classification, if there exists a hyperplane $H: \mathbf{w}^{\mathsf{T}}\mathbf{x} = 0$ such that the feature points $(\boldsymbol{\phi}_1, y_1), \dots, (\boldsymbol{\phi}_N, y_N)$ satisfy that $\{\boldsymbol{\phi}_i : \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}_i \geq 0\}$ all have the same label and $\{\boldsymbol{\phi}_i : \mathbf{w}^{\mathsf{T}}\boldsymbol{\phi}_i < 0\}$ all have the other label, then the feature points are said to be linearly separable. If the feature points are linearly separable by some hyperplane $H$, then there exist two constants $a, b$ such that $at \leq \|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \leq b\sqrt{t}$ for all $t$ as long as the perceptron error is nonzero for the algorithm defined by (2-31). As a result, the algorithm must terminate within $t \leq (b/a)^2$ iterations with zero perceptron error at termination; otherwise $at > b\sqrt{t}$ when $t$ is sufficiently large, a contradiction.

For the lower bound, let $\mathbf{w}^*$ be a normal of the separating hyperplane $H$, oriented so that $y_i (\mathbf{w}^*)^{\mathsf{T}}\boldsymbol{\phi}_i > 0$ for every $i$; then
$$\mathbf{w}^{(t)} = \mathbf{w}^{(0)} + \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \cdots + \eta y_{i_t}\boldsymbol{\phi}_{i_t} \;\Rightarrow\; \mathbf{w}^{(t)} - \mathbf{w}^{(0)} = \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \cdots + \eta y_{i_t}\boldsymbol{\phi}_{i_t}$$

where $i_1, \dots, i_t$ are the indices of the misclassified points used in the first $t$ updates, and therefore

$$(\mathbf{w}^*)^{\mathsf{T}}\!\left(\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right) = (\mathbf{w}^*)^{\mathsf{T}}\!\left(\eta y_{i_1}\boldsymbol{\phi}_{i_1} + \cdots + \eta y_{i_t}\boldsymbol{\phi}_{i_t}\right) \geq \eta t \min_{i=1,\dots,N} y_i (\mathbf{w}^*)^{\mathsf{T}}\boldsymbol{\phi}_i$$

By the Cauchy–Schwarz inequality (Fact 2-5), $(\mathbf{w}^*)^{\mathsf{T}}\!\left(\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\right) \leq \|\mathbf{w}^*\|\,\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\|$, therefore

$$\|\mathbf{w}^*\|\,\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \geq \eta t \min_{i=1,\dots,N} y_i (\mathbf{w}^*)^{\mathsf{T}}\boldsymbol{\phi}_i \;\Rightarrow\; \|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \geq \frac{\eta \min_{i=1,\dots,N} y_i (\mathbf{w}^*)^{\mathsf{T}}\boldsymbol{\phi}_i}{\|\mathbf{w}^*\|}\, t$$
where $a = \frac{\eta \min_{i=1,\dots,N} y_i (\mathbf{w}^*)^{\mathsf{T}}\boldsymbol{\phi}_i}{\|\mathbf{w}^*\|}$ is independent of $t$. For the upper bound, notice

$$\begin{aligned}
\mathbf{w}^{(1)} - \mathbf{w}^{(0)} &= \eta y_{i_1}\boldsymbol{\phi}_{i_1} \\
\mathbf{w}^{(2)} - \mathbf{w}^{(0)} &= \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \eta y_{i_2}\boldsymbol{\phi}_{i_2} = \left(\mathbf{w}^{(1)} - \mathbf{w}^{(0)}\right) + \eta y_{i_2}\boldsymbol{\phi}_{i_2} \\
\mathbf{w}^{(3)} - \mathbf{w}^{(0)} &= \eta y_{i_1}\boldsymbol{\phi}_{i_1} + \eta y_{i_2}\boldsymbol{\phi}_{i_2} + \eta y_{i_3}\boldsymbol{\phi}_{i_3} = \left(\mathbf{w}^{(2)} - \mathbf{w}^{(0)}\right) + \eta y_{i_3}\boldsymbol{\phi}_{i_3}
\end{aligned}$$

and generally
$$\mathbf{w}^{(t)} - \mathbf{w}^{(0)} = \left(\mathbf{w}^{(t-1)} - \mathbf{w}^{(0)}\right) + \eta y_{i_t}\boldsymbol{\phi}_{i_t}$$

and