Categorical Data

Santiago Barreda, LSA Summer Institute 2019

Normally-distributed Data

• We have been modelling data with normally-distributed residuals.

• In other words: the data is normally distributed around the mean predicted by the model.

Predicting Position by Height

• We will invert the questions we’ve been considering.

• Can we predict position from player characteristics?

Using an OLS Regression
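For reference, this is roughly what the OLS approach looks like in R (a minimal sketch; the data frame 'players' and its columns 'position_num' (positions coded 1-5) and 'height' are hypothetical placeholders):

# OLS: treat the position code (1-5) as if it were a continuous outcome.
ols_fit <- lm(position_num ~ height, data = players)
summary(ols_fit)

# Predictions are continuous and unbounded, even though the real outcome
# can only be one of five categories.
predict(ols_fit, newdata = data.frame(height = c(72, 78, 84)))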

• Not bad, but we can do better. In particular:

• There are hard bounds on our outcome variable.

• The boundaries affect the possible errors: position 1 cannot be underestimated, for example, but position 2 can.

• There is nothing ‘between’ the outcomes.

• There are specialized models for data with these sorts of characteristics.

The Generalized Linear Model

• We can break up a regression model into three components:

• The systematic component.

• The random component.

• The link function.

y = a + β·x + e

The Systematic Component

• This is our regression equation.

• It specifies a deterministic relationship between the predictors and the predicted value.

• In the absence of noise and with the correct model, we would expect perfect prediction.

μ = a + β·x        (μ is the predicted value, not the observation)

The Random Component

• Unpredictable variation conditional on the fitted value.

• This specifies the nature of that variation.

• The fitted value serves as the mean parameter of a probability distribution (e.g., normal or Bernoulli).

μ = a + β·x

y ~ Normal(μ, σ²)        y ~ Bernoulli(μ)

Bernoulli Distribution

• Unlike the normal, this distribution generates only values of 1 and 0.

• It has only a single parameter, which must be between 0 and 1.

• The parameter is the probability of observing an outcome of 1 (an outcome of 0 has probability 1 − P).

y ~ Bernoulli(μ)        e.g., [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
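As a quick illustration (a sketch in R; the probability value is arbitrary):

# A Bernoulli draw is a binomial draw with size = 1.
set.seed(1)
rbinom(n = 10, size = 1, prob = 0.6)   # returns a vector containing only 0s and 1s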

The Link Function

• We are modeling lines. This means our predicted values:

• Range from positive to negative infinity.

• Feature a consistent change (slope) across the range.

• These characteristics are not compatible with the mean parameter of some distributions (e.g., the Bernoulli).

The Logit Link Function

• It connects the entire number line to the range between 0 and 1.

• The link function (logit) maps probabilities to logits, while its inverse (ilogit, the logistic function) maps logits to probabilities.

z = logit(p) = log( p / (1 − p) )        p = ilogit(z) = exp(z) / (1 + exp(z))
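A quick numerical check (a sketch in R; qlogis() and plogis() are base R's logit and inverse logit):

# logit: probabilities in (0, 1) -> the whole number line
qlogis(c(0.1, 0.5, 0.9))   # -2.20  0.00  2.20
# inverse logit (logistic): the whole number line -> (0, 1)
plogis(c(-5, 0, 5))        #  0.0067  0.5  0.9933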

[Figures: logit(p) maps probabilities to logits; ilogit(z) maps logits back to probabilities.]

Can we Distinguish SF and SG?

• We predict the logit of the probability that a player is a small forward as a function of their height.

• This is then turned into a probability and used in a Bernoulli distribution.

ilogit(a + H·height)        (fitted value, as a probability)

a + H·height                (fitted value, in logits)

[Figure: fitted values in logits are passed through ilogit to give fitted probabilities.]

Can we Distinguish PF and SG?

[Figures: model fits for the two positions, shown as probabilities and as logits.]
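A comparison like the SF-vs-SG one above can be fit with ordinary (frequentist) logistic regression; a minimal sketch in R, assuming a hypothetical data frame 'players' with 'position' and 'height' columns:

# Keep only SFs and SGs, and code SF = 1, SG = 0.
sf_sg <- subset(players, position %in% c("SF", "SG"))
sf_sg$is_sf <- ifelse(sf_sg$position == "SF", 1, 0)

# Logistic regression: coefficients are on the logit scale.
glm_fit <- glm(is_sf ~ height, data = sf_sg, family = binomial(link = "logit"))
summary(glm_fit)

# Predicted probability that a 78-inch player is an SF.
predict(glm_fit, newdata = data.frame(height = 78), type = "response")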

[Figure: GLM estimates compared with the Bayesian estimates.]

Likelihood (random component):  y ~ Bernoulli(θ)    (y is a vector containing only 1s and 0s)
Link function:                  θ = ilogit(μ)
Systematic component:           μ = a0 + H·height
Priors:                         a0 ~ N(0, 1000²),  H ~ N(0, 1000²)

Bayesian Logistic Regression
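A minimal JAGS sketch of this model (not the exact code used in class; JAGS parameterizes the normal by its precision, so the N(0, 1000²) priors become dnorm(0, 1/1000^2)):

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dbern(theta[i])          # random component / likelihood
    theta[i] <- ilogit(mu[i])       # link function
    mu[i] <- a0 + H * height[i]     # systematic component
  }
  a0 ~ dnorm(0, 1 / 1000^2)         # priors
  H  ~ dnorm(0, 1 / 1000^2)
}"
# Fit with, e.g., rjags::jags.model() and coda.samples(), passing y, height, and n as data.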

[Figures: posterior distributions for a0 and H.]

Adding a Third Outcome

• Logistic regression compares an outcome (success) to some reference outcome (failure).

• Adding a third categorical outcome (e.g., SF, SG, PG) means we now need multinomial logistic regression (which I'll call softmax regression, like Kruschke).

Softmax Regression

• In softmax regression, you have a different equation for each outcome category.

• The ‘weight’ for each outcome is then put into the softmax function, yielding outcome probabilities for each category:

μ_j = a0_j + H_j·height

P_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)
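The softmax function itself is one line of R (a sketch; the input is a vector of category weights):

# Softmax: turn a vector of category weights (log-odds) into probabilities.
softmax <- function(mu) exp(mu) / sum(exp(mu))
softmax(c(0, 1.2, -0.5))   # three categories; the probabilities sum to 1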

• For softmax regression, you set the weight of the reference category to zero. All other categories receive a predicted weight.

• The reference category is arbitrary (here, PG is the reference, while SG and SF receive weights).

• I like to pick one in the middle of the predictor ranges because it makes the models easier to interpret.

• Logistic regression can be thought of as a special case of softmax regression.

• In logistic regression, you model a single line for the outcome called ‘success’, and failure is given a weight of 0:

μ_j = a0_j + H_j·height

P_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)

P_j = exp(μ_j) / (exp(0) + exp(μ_j)) = exp(μ_j) / (1 + exp(μ_j))

This last line is just the logistic (inverse-logit) link function.
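A quick numerical check of that equivalence (a sketch in R):

# With the reference weight fixed at 0, softmax reduces to the inverse logit.
softmax <- function(mu) exp(mu) / sum(exp(mu))
softmax(c(0, 1.2))[2]   # probability of 'success'
plogis(1.2)             # the same value from the inverse logit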

P_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)        ln(P_j) = μ_j − ln( Σ_{k=1}^{K} exp(μ_k) )

[Figures: P(category response) and ln P(category response) for PG, SG, and SF.]

Bayesian Softmax Regression

For response categories 1…j (y is a vector containing integers > 0 and < (j+1); j is the number of categories):

Likelihood (random component):  y ~ Multinomial(θ_j)
Link function:                  θ_j = exp(μ_j) / Σ_{k=1}^{K} exp(μ_k)
Systematic component:           μ_j = a0_j + H_j·height
Priors:                         a0_j ~ N(0, 1000²),  H_j ~ N(0, 1000²)
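A sketch of what the JAGS model might look like (not the exact code from class; dcat accepts unnormalized weights and normalizes them internally):

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dcat(exp_mu[i, 1:n_cat])          # dcat normalizes the weights for us
    for (j in 1:n_cat) {
      mu[i, j] <- a0[j] + H[j] * height[i]   # a different mu for each category and trial
      exp_mu[i, j] <- exp(mu[i, j])          # exponentiate, but don't normalize
    }
  }
  a0[1] <- 0                                 # reference category fixed at zero
  H[1]  <- 0
  for (j in 2:n_cat) {                       # priors for the remaining j-1 coefficients
    a0[j] ~ dnorm(0, 1 / 1000^2)
    H[j]  ~ dnorm(0, 1 / 1000^2)
  }
}"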

• We have a new loop: a different μ for each category and trial.

• We exponentiate μ but don't normalize (dcat does it for us).

• We set up priors for all of the j−1 coefficients, remembering to set one group to zero!

Softmax Regression: Results

[Figures: posterior estimates for PG, SG, and SF.]

Results are coefficients specifying a different line for each category.

[Figures: fitted values in logits for PG, SG, and SF as a function of height.]

Bayesian Softmax Regression

• The values of the three lines are the fitted values for each of our categories, in log-odds.

• We can get the probability of observing each category for trial i by putting the fitted value for each category, for that trial, into the softmax function.

μ_{i,j} = a0_j + H_j·height_i

P_{i,j} = exp(μ_{i,j}) / Σ_{k=1}^{K} exp(μ_{i,k}) = exp(a0_j + H_j·height_i) / Σ_{k=1}^{K} exp(a0_k + H_k·height_i)

Softmax Regression: Results

[Figures: data, fitted values (logits), and P(category selection) for PG, SG, and SF as a function of height.]

Classification

• If you look up these and other topics:

• Luce choice model.

• Bayesian Decision Theory.

• Linear discriminant analysis.

• You will see the softmax function or something just like it.

• This framework is fundamental in research on decision-making and classification.

• It can be used with a decision rule: select the category that maximizes P.

• When the weights are a set of posterior probabilities, we have a Bayesian classifier.

P(C_i | Y) = P(Y | C_i)·P(C_i) / Σ_{j=1}^{J} P(Y | C_j)·P(C_j)
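The decision rule (select the category that maximizes P) is easy to express in R (a sketch with made-up probabilities):

# Two hypothetical trials, three categories; pick the most probable category per trial.
prob_matrix <- rbind(c(0.2, 0.5, 0.3),
                     c(0.6, 0.3, 0.1))
apply(prob_matrix, 1, which.max)   # -> 2 1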

Ordinal Logistic Regression

• Basketball positions are somewhere between ordered and unordered.

• They are numbered 1-5, and do correspond to increasing size of the players.

• On the other hand, they also seem at least somewhat arbitrarily ordered.

• We can use softmax regression whether our outcome categories are inherently ordered or not.

• But if our categories are inherently ordered, we may prefer ordinal logistic regression.

• This method allows us to predict categorical responses when the response categories have an inherent order.

Likert Scales

• Likert scales involve collecting categorical ordered responses, usually a few integers or phrases connoting degree of opinion (DBDA2, pg. 681).

• Categorical responses are meant to reflect an underlying metric variable.

• Responses are meant to be averaged, yielding a metric variable that can be analyzed using normal-theory statistics.

• The individual responses can be directly modeled using ordinal logistic regression.

• Ordinal logistic analyses can provide more information than analyses involving averaging.

Ordinal Logistic Regression

• Responses are assumed to depend on an underlying continuous value.

• Conceptually, a normal distribution is placed at a given location along this continuous variable.


• We estimate the μ and σ of the distribution.

• We also estimate J-1 ‘thresholds’ for J response values.

[Figure: thresholds 1, 2, and 3 placed along the continuous variable x.]

• The probability of observing a response of j is equal to the area under the curve between threshold j-1 and threshold j.
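For example (a sketch in R with arbitrary values of μ, σ, and the thresholds):

# P(response == 2): area under Normal(mu, sigma) between thresholds 1 and 2.
mu    <- 2.1                   # hypothetical fitted value
sigma <- 1
thresh <- c(1.5, 2.5, 3.5)     # hypothetical thresholds for J = 4 response categories
pnorm(thresh[2], mu, sigma) - pnorm(thresh[1], mu, sigma)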

[Figure: P(response == 1) through P(response == 4), shown as areas under the curve between adjacent thresholds along x.]

• We can make this into a regression problem by modeling the mean parameter as a function of relevant predictors.

Setting up the Data

• The only unusual thing is that you have to set up a vector for the threshold parameters.

• We will set thresholds 1 and J-1 to 1.5 and J-0.5.

• The rest are set to NA so they are estimated by JAGS.

Ordinal Logistic Model

For response possibilities 1…j:

Likelihood (random component):  y ~ Multinomial(θ_j)
Link function:                  θ_j = Φ(thresh_j, μ, σ) − Φ(thresh_{j−1}, μ, σ),  where thresh_0 = −∞ and thresh_J = +∞
Systematic component:           μ = a0 + H·height
Priors:                         a0 ~ N(0, 100²),  H ~ N(0, 100²),  σ ~ halfCauchy(0, 10),  thresh_j ~ N(0, 4)
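A sketch of how this might look, with the threshold vector set up on the R side and the model written in JAGS (not the exact class code; pnorm(x, mu, tau) in JAGS is the normal CDF with precision tau, and the N(0, 4) threshold prior is taken here as variance 4):

# R side: fix the first and last thresholds, leave the rest NA so JAGS estimates them.
J <- 5                       # number of response categories
thresh <- rep(NA, J - 1)
thresh[1] <- 1.5
thresh[J - 1] <- J - 0.5

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dcat(theta[i, 1:J])
    theta[i, 1] <- pnorm(thresh[1], mu[i], 1 / sigma^2)            # thresh_0 = -Inf
    for (j in 2:(J - 1)) {
      theta[i, j] <- max(0, pnorm(thresh[j],     mu[i], 1 / sigma^2) -
                            pnorm(thresh[j - 1], mu[i], 1 / sigma^2))
    }
    theta[i, J] <- 1 - pnorm(thresh[J - 1], mu[i], 1 / sigma^2)    # thresh_J = +Inf
    mu[i] <- a0 + H * height[i]
  }
  a0 ~ dnorm(0, 1 / 100^2)
  H  ~ dnorm(0, 1 / 100^2)
  sigma ~ dt(0, 1 / 10^2, 1) T(0,)        # half-Cauchy(0, 10)
  for (j in 1:(J - 1)) {
    thresh[j] ~ dnorm(0, 1 / 4)           # fixed thresholds keep their data values
  }
}"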

Results

[Figures: posterior distributions for the thresholds, a0, and H.]

[Figures: fitted response distributions at heights of 6’0”, 6’4”, 6’8”, and 7’0”. Note that μ = a0 + H·height changes with height, but σ is fixed.]

