On Logistic Regression: Gradients of the Log Loss, Multi-Class Classification, and Other Optimization Techniques


Karl Stratos
June 20, 2018

Recall: Logistic Regression

- Task. Given input x ∈ R^d, predict either 1 or 0 (on or off).
- Model. The probability of on is parameterized by w ∈ R^d as a dot product squashed under the sigmoid/logistic function σ : R → [0, 1]:

    p(1 | x; w) := σ(w · x) := 1 / (1 + exp(−w · x))

  The probability of off is p(0 | x; w) = 1 − σ(w · x) = σ(−w · x).
- Today's focus:
  1. Optimizing the log loss by gradient descent
  2. Multi-class classification to handle more than two classes
  3. More on optimization: Newton's method, stochastic gradient descent

Overview

- Gradient Descent on the Log Loss
- Multi-Class Classification
- More on Optimization: Newton's Method, Stochastic Gradient Descent (SGD)

Gradient Descent on the Log Loss

Negative Log Probability Under the Logistic Function

- Using y ∈ {0, 1} to denote the two classes,

    −log p(y | x; w) = −y log σ(w · x) − (1 − y) log σ(−w · x)
                     = log(1 + exp(−w · x))   if y = 1
                     = log(1 + exp(w · x))    if y = 0

  is convex in w.
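As an aside (not from the slides): evaluating log(1 + exp(z)) naively overflows for large z, so a minimal sketch of this loss in Python would route it through numpy's logaddexp, which computes log(exp(a) + exp(b)) stably. The function name is my own.

```python
import numpy as np

def log_loss(w, x, y):
    """Negative log probability -log p(y | x; w) for a label y in {0, 1}.

    Uses log(1 + exp(z)) = logaddexp(0, z), matching the two cases above:
    z = -w.x when y = 1, and z = w.x when y = 0.
    """
    z = np.dot(w, x)
    return np.logaddexp(0.0, -z) if y == 1 else np.logaddexp(0.0, z)
```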
- The gradient is given by

    ∇_w (−log p(y | x; w)) = −(y − σ(w · x)) x

Logistic Regression Objective

- Given iid samples S = {(x^(i), y^(i))}_{i=1}^n, find w that minimizes the empirical negative log likelihood of S (the "log loss"):

    J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i); w)

- Gradient?

    ∇J_S^LOG(w) = −(1/n) Σ_{i=1}^n (y^(i) − σ(w · x^(i))) x^(i)

- Unlike in linear regression, there is no closed-form solution for

    w_S^LOG := argmin_{w ∈ R^d} J_S^LOG(w)

- But J_S^LOG(w) is convex and differentiable! So we can do gradient descent and approach an optimal solution.

Gradient Descent for Logistic Regression

Input: training objective J_S^LOG(w) := −(1/n) Σ_{i=1}^n log p(y^(i) | x^(i); w), number of iterations T
Output: parameter ŵ ∈ R^d such that J_S^LOG(ŵ) ≈ J_S^LOG(w_S^LOG)

1. Initialize θ^0 (e.g., randomly).
2. For t = 0, ..., T − 1:

    θ^{t+1} = θ^t + (η^t / n) Σ_{i=1}^n (y^(i) − σ(θ^t · x^(i))) x^(i)

3. Return θ^T. (A code sketch of this loop follows below.)
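A minimal sketch of this procedure in Python (illustrative only: the function names, the constant step size eta, and the zero initialization are my choices, not the slides'):

```python
import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z)), written so that
    # exp never receives a large positive argument (numerical stability).
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def gradient_descent(X, y, T=1000, eta=0.1):
    """Batch gradient descent on J_S^LOG.

    X: (n, d) array of inputs x^(i); y: (n,) array of 0/1 labels y^(i).
    Returns theta^T after T updates with a constant step size eta
    (the slides allow a step size eta^t that varies with t).
    """
    n, d = X.shape
    theta = np.zeros(d)                      # initialize theta^0
    for _ in range(T):
        residual = y - sigmoid(X @ theta)    # y^(i) - sigma(theta . x^(i))
        theta = theta + (eta / n) * (X.T @ residual)
    return theta
```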
Multi-Class Classification

From Binary to m Classes

- A logistic regressor has a single w ∈ R^d to define the probability of "on" (against "off"):

    p(1 | x; w) = exp(w · x) / (1 + exp(w · x))

- A log-linear model has a vector w^y ∈ R^d for each possible class y ∈ {1, ..., m} to define the probability of the class equaling y:

    p(y | x; θ) = exp(w^y · x) / Σ_{y'=1}^m exp(w^{y'} · x)

  where θ = {w^t}_{t=1}^m denotes the set of parameters.
- Is this a valid probability distribution?

Softmax Formulation

- We can transform any vector z ∈ R^m into a probability distribution over m elements by the softmax function softmax : R^m → Δ^{m−1},

    softmax_i(z) := exp(z_i) / Σ_{j=1}^m exp(z_j)    ∀ i = 1, ..., m

- A log-linear model is a linear transformation by a matrix W ∈ R^{m×d} (with rows w^1, ..., w^m) followed by the softmax:

    p(y | x; W) = softmax_y(W x)

- For any l ∈ {1, ..., m}, the gradient wrt. w^l mirrors the binary case:

    ∇_{w^l} (−log p(y | x; W)) = −([[y = l]] − p(l | x; W)) x

  where [[y = l]] is 1 if y = l and 0 otherwise; see the sketch after this list.
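A sketch of the softmax and this multi-class gradient in Python (illustrative; the max-subtraction trick and all names here are my additions):

```python
import numpy as np

def softmax(z):
    """Map z in R^m to a distribution in the simplex Delta^{m-1}."""
    e = np.exp(z - np.max(z))        # shifting by max(z) avoids overflow
    return e / e.sum()

def predict_proba(W, x):
    # p(. | x; W) = softmax(W x), where row l of W is w^l.
    return softmax(W @ x)

def grad_log_loss(W, x, y):
    """Gradient of -log p(y | x; W) wrt W; row l is -([[y = l]] - p(l | x; W)) x."""
    p = predict_proba(W, x)
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0                 # the indicator [[y = l]], as a vector over l
    return -np.outer(one_hot - p, x)
```

Averaging grad_log_loss over a sample S and stepping against it gives the multi-class analogue of the gradient-descent loop above.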