Categorical Data

Santiago Barreda, LSA Summer Institute 2019

Normally-distributed Data

• We have been modelling data with normally-distributed residuals.

• In other words: the data is normally distributed around the mean predicted by the model.

Predicting Position by Height

• We will invert the questions we’ve been considering.

• Can we predict position from player characteristics?

Using an OLS Regression
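For reference, this is roughly what the OLS approach looks like in R (a minimal sketch; the data frame 'players' and its columns 'position_num' (positions coded 1-5) and 'height' are hypothetical placeholders):

# OLS: treat the position code (1-5) as if it were a continuous outcome.
ols_fit <- lm(position_num ~ height, data = players)
summary(ols_fit)

# Predictions are continuous and unbounded, even though the real outcome
# can only be one of five categories.
predict(ols_fit, newdata = data.frame(height = c(72, 78, 84)))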

• Not bad, but we can do better. In particular:

• There are hard bounds on our outcome variable.

• The boundaries affect the possible errors: position 1 cannot be underestimated, for example, but position 2 can.

• There is nothing ‘between’ the outcomes.

• There are specialized models for data with these sorts of characteristics.

The Generalized Linear Model

• We can break up a regression model into three components:

• The systematic component.

• The random component.

• The link function.

y = a + β·x + e

The Systematic Component

• This is our regression equation.

• It specifies a deterministic relationship between the predictors and the predicted value.

• In the absence of noise and with the correct model, we would expect perfect prediction.

μ = a + β·x        (μ is the predicted value, not the observation)

The Random Component

• Unpredictable variation conditional on the fitted value.

• This specifies the nature of that variation.

• The fitted value serves as the mean parameter of a probability distribution (e.g., normal or Bernoulli).

μ = a + β·x

y ~ Normal(μ, σ²)        y ~ Bernoulli(μ)

Bernoulli Distribution

• Unlike the normal, this distribution generates only values of 1 and 0.

• It has only a single parameter, which must be between 0 and 1.

• The parameter is the probability of observing an outcome of 1 (an outcome of 0 has probability 1 − P).

y ~ Bernoulli(μ)        e.g., [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
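As a quick illustration (a sketch in R; the probability value is arbitrary):

# A Bernoulli draw is a binomial draw with size = 1.
set.seed(1)
rbinom(n = 10, size = 1, prob = 0.6)   # returns a vector containing only 0s and 1s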

The Link Function

• We are modeling lines. This means our predicted values:

• Range from positive to negative infinity.

• Feature a consistent change (slope) across the range.

• These characteristics are not compatible with the mean parameter of some distributions (e.g., the Bernoulli).

The Logit Link Function

• It connects the entire number line to the range between 0 and 1.

• The link function (logit) maps probabilities to logits, while its inverse (ilogit, the logistic function) maps logits to probabilities.

z = logit(p) = log( p / (1 − p) )        p = ilogit(z) = exp(z) / (1 + exp(z))
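A quick numerical check (a sketch in R; qlogis() and plogis() are base R's logit and inverse logit):

# logit: probabilities in (0, 1) -> the whole number line
qlogis(c(0.1, 0.5, 0.9))   # -2.20  0.00  2.20
# inverse logit (logistic): the whole number line -> (0, 1)
plogis(c(-5, 0, 5))        #  0.0067  0.5  0.9933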

[Figures: logit(p) maps probabilities to logits; ilogit(z) maps logits back to probabilities.]

Can we Distinguish SF and SG?

• We predict the logit of the probability that a player is a small forward as a function of their height.

• This is then turned into a probability and used in a Bernoulli distribution.

ilogit(a + H·height)        (fitted value, as a probability)

a + H·height                (fitted value, in logits)

[Figure: fitted values in logits are passed through ilogit to give fitted probabilities.]

Can we Distinguish PF and SG?

[Figures: model fits for the two positions, shown as probabilities and as logits.]
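A comparison like the SF-vs-SG one above can be fit with ordinary (frequentist) logistic regression; a minimal sketch in R, assuming a hypothetical data frame 'players' with 'position' and 'height' columns:

# Keep only SFs and SGs, and code SF = 1, SG = 0.
sf_sg <- subset(players, position %in% c("SF", "SG"))
sf_sg$is_sf <- ifelse(sf_sg$position == "SF", 1, 0)

# Logistic regression: coefficients are on the logit scale.
glm_fit <- glm(is_sf ~ height, data = sf_sg, family = binomial(link = "logit"))
summary(glm_fit)

# Predicted probability that a 78-inch player is an SF.
predict(glm_fit, newdata = data.frame(height = 78), type = "response")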

[Figure: GLM estimates compared with the Bayesian estimates.]

Likelihood (random component):  y ~ Bernoulli(θ)    (y is a vector containing only 1s and 0s)
Link function:                  θ = ilogit(μ)
Systematic component:           μ = a0 + H·height
Priors:                         a0 ~ N(0, 1000²),  H ~ N(0, 1000²)

Bayesian Logistic Regression
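A minimal JAGS sketch of this model (not the exact code used in class; JAGS parameterizes the normal by its precision, so the N(0, 1000²) priors become dnorm(0, 1/1000^2)):

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dbern(theta[i])          # random component / likelihood
    theta[i] <- ilogit(mu[i])       # link function
    mu[i] <- a0 + H * height[i]     # systematic component
  }
  a0 ~ dnorm(0, 1 / 1000^2)         # priors
  H  ~ dnorm(0, 1 / 1000^2)
}"
# Fit with, e.g., rjags::jags.model() and coda.samples(), passing y, height, and n as data.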

[Figures: posterior distributions for a0 and H.]

Adding a Third Outcome

• Logistic regression compares an outcome (success) to some reference outcome (failure).

• Adding a third categorical outcome (e.g., SF, SG, PG) means we now need multinomial logistic regression (which I'll call softmax regression, like Kruschke).

Softmax Regression

• In softmax regression, you have a different equation for each outcome category.

• The ‘weight’ for each outcome is then put into the softmax function, yielding outcome probabilities for each category:

μ_j = a0_j + H_j·height

P_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)
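The softmax function itself is one line of R (a sketch; the input is a vector of category weights):

# Softmax: turn a vector of category weights (log-odds) into probabilities.
softmax <- function(mu) exp(mu) / sum(exp(mu))
softmax(c(0, 1.2, -0.5))   # three categories; the probabilities sum to 1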

• For softmax regression, you set the weight of the reference category to zero. All other categories receive a predicted weight.

• The reference category is arbitrary (here, PG is the reference, while SG and SF receive weights).

• I like to pick one in the middle of the predictor ranges because it makes the models easier to interpret.

• Logistic regression can be thought of as a special case of softmax regression.

• In logistic regression, you model a single line for the outcome called ‘success’, and failure is given a weight of 0:

μ_j = a0_j + H_j·height

P_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)

P_j = exp(μ_j) / (exp(0) + exp(μ_j)) = exp(μ_j) / (1 + exp(μ_j))

This last line is just the logistic (inverse-logit) link function.
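A quick numerical check of that equivalence (a sketch in R):

# With the reference weight fixed at 0, softmax reduces to the inverse logit.
softmax <- function(mu) exp(mu) / sum(exp(mu))
softmax(c(0, 1.2))[2]   # probability of 'success'
plogis(1.2)             # the same value from the inverse logit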

P_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)        ln(P_j) = μ_j − ln( Σ_{k=1}^{K} exp(μ_k) )

[Figures: P(category response) and ln P(category response) for PG, SG, and SF.]

Bayesian Softmax Regression

For response categories 1…j (y is a vector containing integers > 0 and < (j+1); j is the number of categories):

Likelihood (random component):  y ~ Multinomial(θ_j)
Link function:                  θ_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)
Systematic component:           μ_j = a0_j + H_j·height
Priors:                         a0_j ~ N(0, 1000²),  H_j ~ N(0, 1000²)
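A sketch of what the JAGS model might look like (not the exact code from class; dcat accepts unnormalized weights and normalizes them internally):

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dcat(exp_mu[i, 1:n_cat])          # dcat normalizes the weights for us
    for (j in 1:n_cat) {
      mu[i, j] <- a0[j] + H[j] * height[i]   # a different mu for each category and trial
      exp_mu[i, j] <- exp(mu[i, j])          # exponentiate, but don't normalize
    }
  }
  a0[1] <- 0                                 # reference category fixed at zero
  H[1]  <- 0
  for (j in 2:n_cat) {                       # priors for the remaining j-1 coefficients
    a0[j] ~ dnorm(0, 1 / 1000^2)
    H[j]  ~ dnorm(0, 1 / 1000^2)
  }
}"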

• We have a new loop: a different μ for each category and trial.

• We exponentiate μ but don't normalize (dcat does it for us).

• We set up priors for all of the j−1 coefficients, remembering to set one group to zero!

Softmax Regression: Results

[Figures: posterior estimates for PG, SG, and SF.]

Results are coefficients specifying a different line for each category.

[Figures: fitted values in logits for PG, SG, and SF as a function of height.]

Bayesian Softmax Regression

• The values of the three lines are the fitted values for each of our categories, in log-odds.

• We can get the probability of observing each category for trial i by putting the fitted value for each category, for that trial, into the softmax function.

μ_{i,j} = a0_j + H_j·height_i

P_{i,j} = exp(μ_{i,j}) / Σ_{k=1}^{K} exp(μ_{i,k}) = exp(a0_j + H_j·height_i) / Σ_{k=1}^{K} exp(a0_k + H_k·height_i)

Softmax Regression: Results

[Figures: data, fitted values (logits), and P(category selection) for PG, SG, and SF as a function of height.]

Classification

• If you look up these and other topics:

• Luce choice model.

• Bayesian Decision Theory.

• Linear discriminant analysis.

• You will see the softmax function or something just like it.

• This framework is fundamental in research on decision-making and classification.

• It can be used with a decision rule: select the category that maximizes P.

• When the weights are a set of posterior probabilities, we have a Bayesian classifier.

P(C_i | Y) = P(Y | C_i)·P(C_i) / Σ_{j=1}^{J} P(Y | C_j)·P(C_j)
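The decision rule (select the category that maximizes P) is easy to express in R (a sketch with made-up probabilities):

# Two hypothetical trials, three categories; pick the most probable category per trial.
prob_matrix <- rbind(c(0.2, 0.5, 0.3),
                     c(0.6, 0.3, 0.1))
apply(prob_matrix, 1, which.max)   # -> 2 1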

Ordinal Logistic Regression

• Basketball positions are somewhere between ordered and unordered.

• They are numbered 1-5, and do correspond to increasing size of the players.

• On the other hand, they also seem at least somewhat arbitrarily ordered.

• We can use softmax regression whether our outcome categories are inherently ordered or not.

• But if our categories are inherently ordered, we may prefer ordinal logistic regression.

• This method allows us to predict categorical responses when the response categories have an inherent order.

Likert Scales

• Likert scales involve collecting categorical ordered responses, usually a few integers or phrases connoting degree of opinion (DBDA2, pg. 681).

• Categorical responses are meant to reflect an underlying metric variable.

• Responses are meant to be averaged, yielding a metric variable that can be analyzed using normal-theory statistics.

• The individual responses can be directly modeled using ordinal logistic regression.

• Ordinal logistic analyses can provide more information than analyses involving averaging.

Ordinal Logistic Regression

• Responses are assumed to depend on an underlying continuous value.

• Conceptually, a normal distribution is placed at a given location along this continuous variable.


• We estimate the μ and σ of the distribution.

• We also estimate J-1 ‘thresholds’ for J response values.

[Figure: thresholds 1, 2, and 3 placed along the continuous variable x.]

• The probability of observing a response of j is equal to the area under the curve between threshold j-1 and threshold j.
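For example (a sketch in R with arbitrary values of μ, σ, and the thresholds):

# P(response == 2): area under Normal(mu, sigma) between thresholds 1 and 2.
mu    <- 2.1                   # hypothetical fitted value
sigma <- 1
thresh <- c(1.5, 2.5, 3.5)     # hypothetical thresholds for J = 4 response categories
pnorm(thresh[2], mu, sigma) - pnorm(thresh[1], mu, sigma)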

[Figure: P(response == 1) through P(response == 4), shown as areas under the curve between adjacent thresholds along x.]

• We can make this into a regression problem by modeling the mean parameter as a function of relevant predictors.

Setting up the Data

• The only unusual thing is that you have to set up a vector for the threshold parameters.

• We will set thresholds 1 and J-1 to 1.5 and J-0.5.

• The rest are set to NA so they are estimated by JAGS.

Ordinal Logistic Model

For response possibilities 1…j:

Likelihood (random component):  y ~ Multinomial(θ_j)
Link function:                  θ_j = Φ(thresh_j, μ, σ) − Φ(thresh_{j−1}, μ, σ),  where thresh_0 = −∞ and thresh_J = +∞
Systematic component:           μ = a0 + H·height
Priors:                         a0 ~ N(0, 100²),  H ~ N(0, 100²),  σ ~ halfCauchy(0, 10),  thresh_j ~ N(0, 4)
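A sketch of how this might look, with the threshold vector set up on the R side and the model written in JAGS (not the exact class code; pnorm(x, mu, tau) in JAGS is the normal CDF with precision tau, and the N(0, 4) threshold prior is taken here as variance 4):

# R side: fix the first and last thresholds, leave the rest NA so JAGS estimates them.
J <- 5                       # number of response categories
thresh <- rep(NA, J - 1)
thresh[1] <- 1.5
thresh[J - 1] <- J - 0.5

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dcat(theta[i, 1:J])
    theta[i, 1] <- pnorm(thresh[1], mu[i], 1 / sigma^2)            # thresh_0 = -Inf
    for (j in 2:(J - 1)) {
      theta[i, j] <- max(0, pnorm(thresh[j],     mu[i], 1 / sigma^2) -
                            pnorm(thresh[j - 1], mu[i], 1 / sigma^2))
    }
    theta[i, J] <- 1 - pnorm(thresh[J - 1], mu[i], 1 / sigma^2)    # thresh_J = +Inf
    mu[i] <- a0 + H * height[i]
  }
  a0 ~ dnorm(0, 1 / 100^2)
  H  ~ dnorm(0, 1 / 100^2)
  sigma ~ dt(0, 1 / 10^2, 1) T(0,)        # half-Cauchy(0, 10)
  for (j in 1:(J - 1)) {
    thresh[j] ~ dnorm(0, 1 / 4)           # fixed thresholds keep their data values
  }
}"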

Results

[Figures: posterior distributions for the thresholds, a0, and H.]

[Figures: fitted response distributions at heights of 6’0”, 6’4”, 6’8”, and 7’0”. Note that μ = a0 + H·height changes with height, but σ is fixed.]

