Linear Discriminant Analysis

Discriminant analysis, or classification, can be viewed as a special type of regression, sometimes named categorical regression. As before, we are given the data set $(\mathbf{X}, \mathbf{y})$ with data matrix $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N) \in \mathbb{R}^{D \times N}$, but now $\mathbf{y} = (y_1, \dots, y_N) \in \mathbb{L}^N$, where $\mathbb{L} \subseteq \mathbb{N}$ is a finite set of discrete class labels, and we name its size $K = |\mathbb{L}|$ the class size. We note the labels are categorical and serve the purpose of labeling data: although labels are very often represented by integers like $1, \dots, K$, their numerical values usually do not conceptually matter. For example, the labeling still makes sense if we choose a different set of distinct integers.

Discriminant analysis finds a discriminant function $\mathscr{f} : \mathbb{R}^D \to \mathbb{L}$, and $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_N) = \mathscr{f}(\mathbf{X})$ gives an estimate of the true labels $\mathbf{y}$; the percentage of labels on which $\hat{\mathbf{y}}$ and $\mathbf{y}$ agree is called the accuracy. The most basic type is binary classification, where $K = 2$. For a future variable $\tilde{\mathbf{x}}$, $\mathscr{f}$ provides the label prediction $\tilde{y} = \mathscr{f}(\tilde{\mathbf{x}})$.

We again emphasize that the labels mainly serve to distinguish groups of data; however, it is not uncommon for the labels to also serve technical or interpretive purposes. For example: 1) in the perceptron algorithm, the label set is designed as $\{-1, 1\}$ for its error function; 2) Fisher's binary linear discriminant in Theorem 2-7 can be shown to be equivalent to LSE if using the label set $\{N/N_1, -N/N_2\}$; 3) in sentiment analysis, $-1$ is usually used to label the class of negative sentiment and $+1$ the class of positive sentiment. The design of the label set is called label coding.

As before, there can be a feature function $\phi$ applied to each $\mathbf{x}_i$, so that we are actually dealing with $\boldsymbol{\phi}_i = \phi(\mathbf{x}_i)$. The basic idea is to utilize the regression in (2-2) and write the discriminant as

    $\mathscr{f}(\mathbf{x}) = f(\phi(\mathbf{x})^T \mathbf{w}) = f(\boldsymbol{\phi}^T \mathbf{w})$    (2-27)

where $f$ converts the decimal value resulting from $\boldsymbol{\phi}^T \mathbf{w}$ into a label; it is very often referred to as the activation function. Discriminant analysis based on (2-27) is called linear discriminant analysis. For example, suppose we have the binary label set $\mathbb{L} = \{l_0, l_1\}$; then we can define

    $\hat{y} = \mathscr{f}(\mathbf{x}) = \begin{cases} l_1 & \phi(\mathbf{x})^T \mathbf{w} \ge 0 \\ l_0 & \phi(\mathbf{x})^T \mathbf{w} < 0 \end{cases}$    (2-28)

For the special case where the bias $w_0$ is explicit, we can write

    $\mathscr{f}(\mathbf{x}) = f(\phi(\mathbf{x})^T \mathbf{w} + w_0) = f(\boldsymbol{\phi}^T \mathbf{w} + w_0)$    (2-29)

which can be viewed as classification in one more dimension, consistent with (2-27), by putting all feature points at height 1 in the first dimension.

For the error, there could be a variety of definitions. Considering the categorical nature of labels, one intuitive definition is to count the differences between $\hat{\mathbf{y}} = \mathscr{f}(\mathbf{X})$ and $\mathbf{y}$: $e(\hat{\mathbf{y}}) = \sum_{\hat{y}_i \neq y_i} 1$, which we refer to as the identity error. The identity error is the most widely used performance indicator, but it is hard to optimize: it is piecewise constant, and its derivative is zero almost everywhere.
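To make (2-27), (2-28), and the identity error concrete, the following is a minimal Python/NumPy sketch (an illustration added here, not part of the original text); the toy data, the label pair, and the choice $\phi(\mathbf{x}) = (1, x_1, x_2)^T$ are all assumptions for demonstration.

```python
import numpy as np

def linear_discriminant(Phi, w, labels=(0, 1)):
    """Apply (2-28): threshold the decimal values phi(x)^T w into labels.

    Phi    : (N, M) array whose rows are feature vectors phi(x_i)
    w      : (M,) weight vector
    labels : assumed binary label pair (l0, l1); l1 where phi^T w >= 0
    """
    scores = Phi @ w
    return np.where(scores >= 0, labels[1], labels[0])

def identity_error(y_hat, y):
    """Identity error: count the positions where y_hat differs from y."""
    return int(np.sum(y_hat != y))

# Toy usage mirroring (2-29): fold the bias w0 into w by putting all
# feature points at height 1 in the first dimension.
X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.5]])
Phi = np.hstack([np.ones((len(X), 1)), X])
w = np.array([-0.1, 1.0, 0.5])   # first entry plays the role of w0
y = np.array([1, 0, 1])
y_hat = linear_discriminant(Phi, w)
print(y_hat, identity_error(y_hat, y))   # -> [1 0 1] 0
```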
Perceptron Algorithm

The perceptron algorithm may be the simplest algorithm for binary classification, and it can be viewed as the simplest one-layer neural network with a single output. Technically, it minimizes a more numerical error rather than the identity error. To achieve this, it designs the label set as $\{-1, 1\}$.

For each data point $\mathbf{x}_i$ with label $y_i$, the algorithm checks whether $\mathrm{sgn}(\boldsymbol{\phi}_i^T \mathbf{w}) = y_i$; if so, the classification is correct; otherwise the classification is wrong, and we calculate the perceptron error over the set $\mathcal{M}$ of misclassified points as

    $e = -\sum_{i \in \mathcal{M}} y_i \boldsymbol{\phi}_i^T \mathbf{w}$

Note $\mathrm{sgn}(\boldsymbol{\phi}_i^T \mathbf{w}) \neq y_i$ is equivalent to $y_i \boldsymbol{\phi}_i^T \mathbf{w} < 0$, and thus we put a minus sign in front to make the error positive. Our objective is now simply to minimize the error, i.e., $\min_{\mathbf{w} \neq \mathbf{0}} e$. The classic way is to use gradient descent, with a predefined positive step length $\eta$:

    $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \nabla_{\mathbf{w}^{(t)}} e$

where by Fact 1-7

    $\nabla_{\mathbf{w}^{(t)}} e = -\sum_{i \in \mathcal{M}} y_i \boldsymbol{\phi}_i \;\Rightarrow\; \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta \sum_{i \in \mathcal{M}} y_i \boldsymbol{\phi}_i$    (2-30)

The above can be modified into online updates applied whenever an error is encountered:

    $\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} + \eta y_i \boldsymbol{\phi}_i \quad \text{if } \mathrm{sgn}(\boldsymbol{\phi}_i^T \mathbf{w}^{(t)}) \neq y_i, \; i \in \{1, \dots, N\}$    (2-31)

The geometric interpretation is illustrated in Figure 2-1. A hyperplane $H(\mathbf{x}) = \mathbf{w}^T \mathbf{x} = 0$ goes through the origin and has $\mathbf{w}$ as its normal. With a fixed $\mathbf{w}$, $\cos\langle \boldsymbol{\phi}, \mathbf{w} \rangle = \boldsymbol{\phi}^T \mathbf{w} / (\|\mathbf{w}\| \|\boldsymbol{\phi}\|)$ is positive for all $\boldsymbol{\phi}$ on one side of $H$ and negative for all on the other side; $d(\boldsymbol{\phi}, H) = \boldsymbol{\phi}^T \mathbf{w} / \|\mathbf{w}\|$ is a signed distance from the point $\boldsymbol{\phi}$ to the hyperplane, and $\boldsymbol{\phi}^T \mathbf{w} = \|\mathbf{w}\| \, d(\boldsymbol{\phi}, H) \propto d(\boldsymbol{\phi}, H)$. The interpretation is clearer if we restrict $\|\mathbf{w}\| = 1$: then $\boldsymbol{\phi}^T \mathbf{w} = d(\boldsymbol{\phi}, H)$, and $e$ adds up the distances of the misclassified points to the hyperplane. It is trivial to verify that the optimal solutions with and without the constraint are identical; thus, in practice, we might simply carry out the optimization without the "unit normal" constraint and normalize the optimal solution $\mathbf{w}^*$ at the end. The perceptron objective $\min_{\mathbf{w} \neq \mathbf{0}} e$ can be viewed as finding the orientation of $H$ that best separates the data points, where "best" is in terms of minimum $e$. The initialization $\mathbf{w}^{(0)}$ for (2-31) can be a random choice of any nonzero vector.

[Figure 2-1: The perceptron algorithm finds the best hyperplane through the origin separating the points. With a unit normal, the perceptron error can be interpreted as adding up the distances from misclassified points to the hyperplane.]

The advantages of the perceptron are its simplicity, its online updates, and its guaranteed convergence in the ideal situation; the disadvantages are that it is limited to binary classification and that it runs indefinitely in the non-ideal situation where the data points are not separable by a hyperplane (in this case a heuristic rule is needed for termination).
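Below is a minimal sketch of the online updates (2-31), assuming labels coded as $\{-1, +1\}$ and features stored as rows of a NumPy array; the epoch cap is one possible heuristic termination rule for the non-separable case mentioned above, not something prescribed by the text.

```python
import numpy as np

def perceptron(Phi, y, eta=1.0, max_epochs=1000, seed=0):
    """Online perceptron updates (2-31).

    Phi : (N, M) array, rows are feature vectors phi_i
    y   : (N,) labels coded in {-1, +1}
    Stops after an error-free pass over the data (e = 0), or after
    max_epochs passes as a heuristic stop for non-separable data.
    """
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = rng.standard_normal(M)            # w^(0): random nonzero vector
    for _ in range(max_epochs):
        mistakes = 0
        for i in rng.permutation(N):      # randomized pass over the data
            # y_i phi_i^T w <= 0 is the misclassification test,
            # equivalent to sgn(phi_i^T w) != y_i (ties counted as errors)
            if y[i] * (Phi[i] @ w) <= 0:
                w += eta * y[i] * Phi[i]  # w := w + eta y_i phi_i
                mistakes += 1
        if mistakes == 0:                 # e = 0: data linearly separated
            break
    return w
```

On linearly separable data, Theorem 2-5 below guarantees the inner update fires only finitely many times, so the loop exits early with an error-free pass; the epoch cap matters only in the non-separable case.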
Fact 2-5 Cauchy-Schwarz inequality. $\|\mathbf{x}\|_{\langle \cdot, \cdot \rangle} \, \|\mathbf{y}\|_{\langle \cdot, \cdot \rangle} \ge |\langle \mathbf{x}, \mathbf{y} \rangle|$ for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^D$, where $\langle \mathbf{x}, \mathbf{y} \rangle$ is an inner product and $\|\mathbf{x}\|_{\langle \cdot, \cdot \rangle} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$ is the inner-product-induced norm.

Theorem 2-5 Perceptron Convergence Theorem. In binary classification, if there exists a hyperplane $H : \mathbf{w}^T \mathbf{x} = 0$ s.t. the feature points $(\boldsymbol{\phi}_1, y_1), \dots, (\boldsymbol{\phi}_N, y_N)$ satisfy that $\{\boldsymbol{\phi}_i : \mathbf{w}^T \boldsymbol{\phi}_i \ge 0\}$ all have the same label and $\{\boldsymbol{\phi}_i : \mathbf{w}^T \boldsymbol{\phi}_i < 0\}$ all have the other label, then the feature points are called linearly separable. If the feature points are linearly separable by some hyperplane $H$, then there exist two constants $a, b > 0$ s.t. $a t \le \|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \le b \sqrt{t}$ for all $t$ with $e \neq 0$ for the algorithm defined by (2-31). As a result, the algorithm must terminate within $t \le b^2 / a^2$ iterations with $e = 0$ at termination; otherwise $a t > b \sqrt{t}$ when $t$ is sufficiently large.

Proof. For the lower bound, let $\mathbf{w}^*$ be a normal of the separating hyperplane $H$. Writing $i_k$ for the index of the point used at update $k$, we have

    $\mathbf{w}^{(t)} = \mathbf{w}^{(0)} + \eta y_{i_1} \boldsymbol{\phi}_{i_1} + \dots + \eta y_{i_t} \boldsymbol{\phi}_{i_t} \;\Rightarrow\; \mathbf{w}^{(t)} - \mathbf{w}^{(0)} = \eta y_{i_1} \boldsymbol{\phi}_{i_1} + \dots + \eta y_{i_t} \boldsymbol{\phi}_{i_t}$

    $\Rightarrow\; (\mathbf{w}^*)^T (\mathbf{w}^{(t)} - \mathbf{w}^{(0)}) = (\mathbf{w}^*)^T (\eta y_{i_1} \boldsymbol{\phi}_{i_1} + \dots + \eta y_{i_t} \boldsymbol{\phi}_{i_t}) \ge \eta t \min_{i = 1, \dots, N} y_i (\mathbf{w}^*)^T \boldsymbol{\phi}_i$

By the Cauchy-Schwarz inequality, $(\mathbf{w}^*)^T (\mathbf{w}^{(t)} - \mathbf{w}^{(0)}) \le \|\mathbf{w}^*\| \, \|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\|$, therefore

    $\|\mathbf{w}^*\| \, \|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \ge \eta t \min_{i} y_i (\mathbf{w}^*)^T \boldsymbol{\phi}_i \;\Rightarrow\; \|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \ge \frac{\eta \min_{i} y_i (\mathbf{w}^*)^T \boldsymbol{\phi}_i}{\|\mathbf{w}^*\|} \, t$

where $a = \eta \min_{i} y_i (\mathbf{w}^*)^T \boldsymbol{\phi}_i / \|\mathbf{w}^*\|$ is independent of $t$, and positive because $\mathbf{w}^*$ separates the points. For the upper bound, notice

    $\mathbf{w}^{(t)} - \mathbf{w}^{(0)} = (\mathbf{w}^{(t-1)} - \mathbf{w}^{(0)}) + \eta y_{i_t} \boldsymbol{\phi}_{i_t}$

and therefore

    $\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\|^2 = \|\mathbf{w}^{(t-1)} - \mathbf{w}^{(0)}\|^2 + \eta^2 \|\boldsymbol{\phi}_{i_t}\|^2 + 2 \eta y_{i_t} (\mathbf{w}^{(t-1)} - \mathbf{w}^{(0)})^T \boldsymbol{\phi}_{i_t}$

Note $y_{i_t} (\mathbf{w}^{(t-1)})^T \boldsymbol{\phi}_{i_t} < 0$ because we assume $e \neq 0$ at step $t$ (point $i_t$ is misclassified). Therefore

    $\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\|^2 \le \|\mathbf{w}^{(t-1)} - \mathbf{w}^{(0)}\|^2 + \eta^2 \|\boldsymbol{\phi}_{i_t}\|^2 - 2 \eta y_{i_t} (\mathbf{w}^{(0)})^T \boldsymbol{\phi}_{i_t}$

Writing this inequality for steps $1, \dots, t$ and adding them up, the terms $\|\mathbf{w}^{(k)} - \mathbf{w}^{(0)}\|^2$ for $0 < k < t$ cancel on both sides, and we get

    $\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\|^2 \le \eta^2 \left( \|\boldsymbol{\phi}_{i_1}\|^2 + \dots + \|\boldsymbol{\phi}_{i_t}\|^2 \right) - 2 \eta (\mathbf{w}^{(0)})^T \left( y_{i_1} \boldsymbol{\phi}_{i_1} + \dots + y_{i_t} \boldsymbol{\phi}_{i_t} \right) \le \left( \eta^2 \max_{i = 1, \dots, N} \|\boldsymbol{\phi}_i\|^2 - 2 \eta \min_{i = 1, \dots, N} y_i (\mathbf{w}^{(0)})^T \boldsymbol{\phi}_i \right) t$

where we see $b^2 = \eta^2 \max_i \|\boldsymbol{\phi}_i\|^2 - 2 \eta \min_i y_i (\mathbf{w}^{(0)})^T \boldsymbol{\phi}_i$ is independent of $t$, giving $\|\mathbf{w}^{(t)} - \mathbf{w}^{(0)}\| \le b \sqrt{t}$.

Membership LSE

Theorem 2-6 Membership LSE. The least squared error method for regression (see Theorem 2-1) can be adapted for classification; however, directly using the categorical values of the class labels as the regression target is usually less favorable, due to both performance issues and interpretation difficulty.

One adaptation is to convert class labels into membership vectors. Define a new target vector $\mathbf{z}_j$, named the membership vector, for each class label $y_j$, $j = 1, \dots, N$:

    $\mathbf{z}_j = (t_{1,j}, \dots, t_{K,j})^T \in \{0, 1\}^K, \quad j = 1, \dots, N, \quad \text{s.t. } \sum_{i=1}^{K} t_{i,j} = 1$

That is, exactly one component of $\mathbf{z}_j$ is assigned 1, with all other components being 0. If $t_{i,j} = 1$ for some $i$, then $y_j = i$, indicating $\mathbf{x}_j \in C_i$. The whole target matrix is $\mathbf{Z} = (\mathbf{z}_1, \dots, \mathbf{z}_N)$. By (2-4) we have the regression model $\hat{\mathbf{Z}} = \mathbf{W}^T \boldsymbol{\Phi}$, and the objective of LSE is to minimize $\|\mathbf{Z} - \hat{\mathbf{Z}}\|^2$. We have the solution by (2-6) and (2-7) of Theorem 2-1 that

    $\mathbf{W}^* = (\boldsymbol{\Phi}^T)^\dagger \mathbf{Z}^T$

Using the fact that $(\boldsymbol{\Phi}^T)^\dagger = (\boldsymbol{\Phi}^\dagger)^T$, we have

    $\mathbf{W}^* = (\boldsymbol{\Phi}^\dagger)^T \mathbf{Z}^T \;\Rightarrow\; (\mathbf{W}^*)^T = \mathbf{Z} \boldsymbol{\Phi}^\dagger$

and the regression function is

    $f(\mathbf{x}) = \mathbf{Z} \boldsymbol{\Phi}^\dagger \phi(\mathbf{x})$

Note the discriminant function is different: it still needs to yield a class label, e.g., by assigning $\mathbf{x}$ to the class whose component of $f(\mathbf{x})$ is the largest.
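A minimal sketch of membership LSE under stated assumptions: the columns of `Phi` are the feature vectors $\boldsymbol{\phi}_j$ (matching the convention $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1, \dots, \boldsymbol{\phi}_N)$), labels are coded $0, \dots, K-1$ for indexing convenience, the pseudoinverse $\boldsymbol{\Phi}^\dagger$ is computed with `np.linalg.pinv`, and the final argmax decoding is a common choice of discriminant rather than one fixed by the excerpt.

```python
import numpy as np

def membership_lse_fit(Phi, y, K):
    """Membership LSE (Theorem 2-6).

    Phi : (M, N) array whose columns are feature vectors phi(x_j)
    y   : (N,) integer labels, assumed coded as 0, ..., K-1
    Returns (W*)^T = Z Phi^+, a (K, M) array, so f(x) = Z Phi^+ phi(x).
    """
    N = y.shape[0]
    Z = np.zeros((K, N))
    Z[y, np.arange(N)] = 1.0                 # membership vectors z_j as columns
    return Z @ np.linalg.pinv(Phi)           # Z Phi^+ (Moore-Penrose pseudoinverse)

def membership_lse_predict(WT, phi_x):
    """Discriminant: return the class whose membership estimate is largest."""
    return int(np.argmax(WT @ phi_x))

# Toy usage: K = 3 classes, features phi(x) = (1, x1, x2) stored as columns.
X = np.array([[0.1, 0.2, 3.0, 3.1, -3.0, -2.9],
              [0.0, 0.1, 3.2, 2.9, 3.1, 3.0]])
Phi = np.vstack([np.ones(X.shape[1]), X])    # (M=3, N=6), height-1 first dim
y = np.array([0, 0, 1, 1, 2, 2])
WT = membership_lse_fit(Phi, y, K=3)
print(membership_lse_predict(WT, np.array([1.0, 3.0, 3.0])))  # expected: 1
```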