Bayesian Linear Regression (Hyperparameter Estimation, Sparse Priors), Bayesian Logistic Regression


Piyush Rai
Topics in Probabilistic Modeling and Inference (CS698X)
Jan 21, 2019

Recap: Bayesian Linear Regression

Assume a Gaussian likelihood:
$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \beta^{-1}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \beta^{-1}\mathbf{I}_N)$

Assume a zero-mean spherical Gaussian prior:
$p(\mathbf{w} \mid \lambda) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \lambda^{-1}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \lambda^{-1}\mathbf{I}_D)$

Assuming the hyperparameters are fixed, the posterior is Gaussian, $p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \beta, \lambda) = \mathcal{N}(\boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N)$, with
$\boldsymbol{\Sigma}_N = \Big(\beta \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top + \lambda \mathbf{I}_D\Big)^{-1} = (\beta \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_D)^{-1}$  (posterior covariance matrix)
$\boldsymbol{\mu}_N = \boldsymbol{\Sigma}_N \Big[\beta \sum_{n=1}^{N} y_n \mathbf{x}_n\Big] = \boldsymbol{\Sigma}_N \beta \mathbf{X}^\top \mathbf{y} = \Big(\mathbf{X}^\top \mathbf{X} + \tfrac{\lambda}{\beta}\mathbf{I}_D\Big)^{-1} \mathbf{X}^\top \mathbf{y}$  (posterior mean)

The posterior predictive distribution is also Gaussian:
$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}, \beta, \lambda) = \int p(y_* \mid \mathbf{w}, \mathbf{x}_*, \beta)\, p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \beta, \lambda)\, d\mathbf{w} = \mathcal{N}\big(\boldsymbol{\mu}_N^\top \mathbf{x}_*,\; \beta^{-1} + \mathbf{x}_*^\top \boldsymbol{\Sigma}_N \mathbf{x}_*\big)$

This gives both a predictive mean and a predictive variance (importantly, the predictive variance is different for each input).
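The closed-form expressions above are easy to compute directly. Below is a minimal NumPy sketch (not part of the original slides); the synthetic data and the values of $\beta$ and $\lambda$ are illustrative assumptions.

```python
import numpy as np

def posterior(X, y, beta, lam):
    """Posterior N(mu_N, Sigma_N) of the weights w for Bayesian linear regression."""
    D = X.shape[1]
    Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))  # posterior covariance
    mu_N = beta * Sigma_N @ X.T @ y                            # posterior mean
    return mu_N, Sigma_N

def posterior_predictive(x_star, mu_N, Sigma_N, beta):
    """Predictive mean and variance at a test input x_star."""
    mean = mu_N @ x_star
    var = 1.0 / beta + x_star @ Sigma_N @ x_star               # input-dependent variance
    return mean, var

# Illustrative synthetic data; beta and lam are assumed fixed, not estimated
rng = np.random.default_rng(0)
N, D = 50, 3
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w_true + rng.normal(scale=0.1, size=N)

beta, lam = 100.0, 1.0
mu_N, Sigma_N = posterior(X, y, beta, lam)
x_star = rng.normal(size=D)
print(posterior_predictive(x_star, mu_N, Sigma_N, beta))
```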
A Visualization of Uncertainty in Bayesian Linear Regression

Posterior $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$ and the lines ($w_0$ intercept, $w_1$ slope) corresponding to some random $\mathbf{w}$'s drawn from it. [Figure panels: Prior (N=0), Posterior (N=1), Posterior (N=2), Posterior (N=20).]

A visualization of the posterior predictive of a Bayesian linear regression model. [Figure: a line shows the predictive mean; annotations mark regions of small and large predictive variance.]
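These figures can be reproduced qualitatively with a short script. The matplotlib sketch below is not from the slides: it assumes a 1-D toy dataset with an intercept feature and fixed, made-up values of $\beta$ and $\lambda$, draws a few lines for $\mathbf{w}$'s sampled from the posterior, and shades the predictive mean plus/minus two standard deviations.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
beta, lam = 25.0, 2.0                        # assumed noise and prior precisions

# Toy 1-D data with an intercept feature: Phi = [1, x]
x = rng.uniform(-1, 1, size=20)
y = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=x.size)
Phi = np.column_stack([np.ones_like(x), x])

Sigma_N = np.linalg.inv(beta * Phi.T @ Phi + lam * np.eye(2))
mu_N = beta * Sigma_N @ Phi.T @ y

xs = np.linspace(-1, 1, 200)
Phi_s = np.column_stack([np.ones_like(xs), xs])

# Lines corresponding to a few random w's drawn from the posterior
for w in rng.multivariate_normal(mu_N, Sigma_N, size=6):
    plt.plot(xs, Phi_s @ w, color="gray", alpha=0.5)

# Predictive mean and +/- 2 standard deviation band
mean = Phi_s @ mu_N
std = np.sqrt(1.0 / beta + np.einsum("nd,dk,nk->n", Phi_s, Sigma_N, Phi_s))
plt.plot(xs, mean, color="red", label="predictive mean")
plt.fill_between(xs, mean - 2 * std, mean + 2 * std, alpha=0.2)
plt.scatter(x, y, s=15)
plt.legend(); plt.show()
```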
A Visualization of Uncertainty (Contd)

We can similarly visualize a Bayesian nonlinear regression model. In the figures (not reproduced here), the green curve is the true function and the blue circles are the observations $(x_n, y_n)$.

Posterior of the nonlinear regression model: some curves drawn from the posterior. [Four panels of y versus x.]

Posterior predictive: the red curve is the predictive mean and the shaded region denotes the predictive uncertainty. [Four panels of y versus x.]

Estimating Hyperparameters for Bayesian Linear Regression

Learning Hyperparameters in Probabilistic Models

We can treat hyperparameters as just a bunch of additional unknowns, which can be learned using a suitable inference algorithm (point estimation or fully Bayesian).

Example: for the linear regression model, the full set of parameters would be $(\mathbf{w}, \lambda, \beta)$.

We can assume priors on all these parameters and infer their "joint" posterior distribution:
$p(\mathbf{w}, \beta, \lambda \mid \mathbf{X}, \mathbf{y}) = \dfrac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta, \lambda)\, p(\mathbf{w}, \lambda, \beta)}{p(\mathbf{y} \mid \mathbf{X})} = \dfrac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \lambda)\, p(\beta)\, p(\lambda)}{\int p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \lambda)\, p(\beta)\, p(\lambda)\, d\mathbf{w}\, d\lambda\, d\beta}$

Inferring the above is usually intractable (it is rare to have conjugacy here) and requires approximations. Also: what priors (or "hyperpriors") should we choose for $\beta$ and $\lambda$? And what about the hyperparameters of those priors?
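As one concrete point-estimation option (an illustrative sketch, not a method prescribed on these slides), type-II maximum likelihood picks the hyperparameters that maximize the marginal likelihood, which for this model is available in closed form after integrating out $\mathbf{w}$: $p(\mathbf{y} \mid \mathbf{X}, \beta, \lambda) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\; \lambda^{-1}\mathbf{X}\mathbf{X}^\top + \beta^{-1}\mathbf{I}_N)$. A hedged sketch with an assumed grid and synthetic data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(X, y, beta, lam):
    """log p(y | X, beta, lambda) with w integrated out:
    y ~ N(0, lam^{-1} X X^T + beta^{-1} I_N)."""
    N = X.shape[0]
    cov = X @ X.T / lam + np.eye(N) / beta
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)

# Illustrative grid search over (beta, lambda) on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=40)

grid = [10.0 ** k for k in range(-2, 3)]
best = max(((b, l) for b in grid for l in grid),
           key=lambda bl: log_evidence(X, y, *bl))
print("selected (beta, lambda):", best)
```

A fully Bayesian treatment would instead place hyperpriors on $\beta$ and $\lambda$ and approximate the joint posterior discussed above.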