A Gibbs Sampler for a Hierarchical Dirichlet Process Mixture Model


Mark Andrews

April 11, 2019

The hierarchical Dirichlet process mixture model (HDPMM), when its mixture components are categorical distributions, is a probabilistic model of multinomial data. It was first described as part of a more general treatment of Hierarchical Dirichlet Process (HDP) models by Teh, Jordan, Beal, and Blei (2004, 2006), and is the Bayesian nonparametric generalization of the Latent Dirichlet Allocation (LDA) model of Blei, Ng, and Jordan (2003). The aim of this note is to describe in detail a Gibbs sampler for the HDPMM when used with multinomial data. This Gibbs sampler is based on what was described in Teh et al. (2004, 2006) for general HDP models. However, as they did not deal in detail with HDP mixture models for multinomial data, important details of the sampler required for this particular case were not described there. Newman, Asuncion, Smyth, and Welling (2009), on the other hand, do deal explicitly with the case of the HDPMM for multinomial data, and the Gibbs sampler that we describe here is almost identical to theirs. However, for some hyper-parameters, Newman et al. (2009) either make simplifying assumptions or assume that their values are known, assumptions that we do not make here. As such, the sampler we describe here is a minor extension of that described in Newman et al. (2009).

The probabilistic model

One of the most straightforward applications of the multinomial-data HDPMM is as a bag-of-words probabilistic language model, and in what follows we describe it with this application in mind. However, modulo some possible changes in notation, this will in fact constitute a general description of the HDPMM for multinomial data.

According to a bag-of-words language model, a corpus of natural language is a set of $J$ documents or texts $w_1, w_2 \ldots w_j \ldots w_J$, where text $j$, i.e. $w_j$, is a set of $n_j$ words from a finite vocabulary of $V$ word types. For simplicity, this vocabulary can be represented as the $V$ integers $\{1, 2 \ldots V\}$. From this, we have each $w_j$ defined as $w_j = w_{j1}, w_{j2} \ldots w_{ji} \ldots w_{jn_j}$, with each $w_{ji} \in \{1 \ldots V\}$. The bag-of-words assumption is that, for each text, $w_{j1}, w_{j2} \ldots w_{ji} \ldots w_{jn_j}$ are exchangeable random variables, i.e. their joint probability distribution is invariant to any permutation of the indices. By this assumption, therefore, as the name implies, each text is modelled as an unordered set, or bag, of words.

As a generative model of this language corpus, the HDPMM treats each observed word $w_{ji}$ as a sample from one of an underlying set of text or discourse topics $\phi = \phi_1, \phi_2 \ldots \phi_k \ldots$, where each $\phi_k$ is a probability distribution over $\{1 \ldots V\}$. The identity of the particular topic distribution from which $w_{ji}$ is drawn is determined by the value of a discrete latent variable $x_{ji} \in \{1, 2 \ldots k \ldots\}$ that corresponds to $w_{ji}$. As such, each $w_{ji}$ is modelled as
$$w_{ji} \mid x_{ji}, \phi \sim \mathrm{dcat}(\phi_{x_{ji}}).$$
To be clear, the HDPMM assumes that there are an unlimited number of topic distributions from which the observed data are drawn, and so each $x_{ji}$ can take on infinitely many discrete values. The probability distribution over the infinitely many possible values of each $x_{ji}$ is given by an infinite-length categorical distribution $\pi_j = \pi_{j1}, \pi_{j2} \ldots \pi_{jk} \ldots$, where $0 \leq \pi_{jk} \leq 1$ and $\sum_{k=1}^{\infty} \pi_{jk} = 1$, that is specific to text $j$. In other words,
$$x_{ji} \mid \pi_j \sim \mathrm{dcat}(\pi_j).$$
Each $\pi_j$ is assumed to be drawn from a Dirichlet process prior whose base distribution, $m$, is a categorical distribution over the positive integers and whose scalar concentration parameter is $a$:
$$\pi_j \mid a, m \sim \mathrm{ddp}(a, m).$$
The base distribution $m$ is itself assumed to be drawn from a stick-breaking distribution with parameter $\gamma$:
$$m \mid \gamma \sim \mathrm{dstick}(\gamma).$$
The prior distributions of the Dirichlet process concentration parameter $a$ and the stick-breaking parameter $\gamma$ are Gamma distributions, both with shape and scale parameters equal to 1. The topic distributions $\phi_1, \phi_2 \ldots \phi_k \ldots$ are assumed to be independently and identically drawn from a Dirichlet distribution with a length-$V$ location parameter $\psi$ and concentration parameter $b$. In turn, $\psi$ is drawn from a symmetric Dirichlet distribution with concentration parameter $c$. Finally, both $b$ and $c$, like $a$ and $\gamma$, are given Gamma priors, again with shape and scale parameters equal to 1.
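To make the generative process concrete, the following sketch simulates a small corpus from the model by truncating the infinite stick-breaking construction at a finite number of components. The corpus sizes, the truncation level `K_MAX`, the small numerical jitter, and the $c/V$ parameterisation of the symmetric Dirichlet on $\psi$ are illustrative assumptions for this sketch, not details taken from the note.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not taken from the note.
V, J, K_MAX = 50, 10, 100                    # vocabulary, documents, truncation level
n_j = rng.integers(20, 60, size=J)           # words per document

# Gamma(shape=1, scale=1) hyper-priors on a, gamma, b, c, as in the note.
a, gamma, b, c = rng.gamma(1.0, 1.0, size=4)

# m | gamma ~ stick-breaking, truncated at K_MAX and renormalised.
v_stick = rng.beta(1.0, gamma, size=K_MAX)
m = v_stick * np.concatenate(([1.0], np.cumprod(1.0 - v_stick[:-1])))
m /= m.sum()

# psi | c ~ symmetric Dirichlet (c/V per component is an assumed parameterisation);
# phi_k | b, psi ~ Dirichlet(b * psi), independently for each topic.
# A tiny jitter guards against numerical underflow for very small concentrations.
psi = rng.dirichlet(np.full(V, c / V) + 1e-6)
phi = rng.dirichlet(b * psi + 1e-6, size=K_MAX)
phi /= phi.sum(axis=1, keepdims=True)        # guard against floating-point drift

# pi_j | a, m ~ DP(a, m); with the truncated discrete base measure this
# reduces to a Dirichlet with parameter vector a * m.
pi = rng.dirichlet(a * m + 1e-6, size=J)
pi /= pi.sum(axis=1, keepdims=True)

# x_ji | pi_j ~ dcat(pi_j);  w_ji | x_ji, phi ~ dcat(phi_{x_ji}).
corpus = []
for j in range(J):
    x_j = rng.choice(K_MAX, size=n_j[j], p=pi[j])
    corpus.append(np.array([rng.choice(V, p=phi[k]) for k in x_j]))
```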
Sampling each latent variable $x_{ji}$

The posterior probability that $x_{ji}$ takes the value $k$, for any $k \in 1, 2 \ldots$, is
$$P(x_{ji} = k \mid w_{ji} = v, x_{\neg ji}, w_{\neg ji}, b, \psi, a, m) \propto P(w_{ji} = v \mid x_{ji} = k, w_{\neg ji}, x_{\neg ji}, b, \psi)\, P(x_{ji} = k \mid x_{\neg ji}, a, m),$$
where $x_{\neg ji}$ denotes all latent variables excluding $x_{ji}$, with an analogous meaning for $w_{\neg ji}$. Here, the likelihood term is
$$P(w_{ji} = v \mid x_{ji} = k, x_{\neg ji}, w_{\neg ji}, b, \psi) = \int P(w_{ji} = v \mid \phi_k)\, P(\phi_k \mid x_{\neg ji}, w_{\neg ji}, b, \psi)\, d\phi_k.$$
This is the expected value of $\phi_{kv}$ according to the Dirichlet posterior
$$P(\phi_k \mid x_{\neg ji}, w_{\neg ji}, b, \psi) = \frac{\Gamma(S^{\neg ji}_{k\cdot} + b)}{\prod_{v=1}^{V} \Gamma(S^{\neg ji}_{kv} + b\psi_v)} \prod_{v=1}^{V} \phi_{kv}^{S^{\neg ji}_{kv} + b\psi_v - 1},$$
where $S^{\neg ji}_{kv} \triangleq \sum_{j'i' \neq ji} \mathbb{I}(w_{j'i'} = v, x_{j'i'} = k)$ and $S^{\neg ji}_{k\cdot} = \sum_{v=1}^{V} S^{\neg ji}_{kv}$. As such,
$$P(w_{ji} = v \mid x_{ji} = k, x_{\neg ji}, w_{\neg ji}, b, \psi) = \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b}.$$

The prior term, on the other hand, is
$$P(x_{ji} = k \mid x_{\neg ji}, a, m) = \int P(x_{ji} = k \mid \pi_j)\, P(\pi_j \mid x_{\neg ji}, a, m)\, d\pi_j,$$
and this is the expected value of $\pi_{jk}$ according to
$$P(\pi_j \mid x_{\neg ji}, a, m) \propto P(x_{\neg ji} \mid \pi_j)\, P(\pi_j \mid a, m).$$
As $P(\pi_j \mid a, m)$ is a Dirichlet process, by the definition of a Dirichlet process we have
$$P(\pi_{jk} \mid a, m) = \mathrm{Beta}\Big(am_k,\; a\sum_{k' \neq k} m_{k'}\Big),$$
and therefore
$$P(\pi_{jk} \mid x_{\neg ji}, a, m) = \mathrm{Beta}\Big(R^{\neg ji}_{jk} + am_k,\; \sum_{k' \neq k} \big(R^{\neg ji}_{jk'} + am_{k'}\big)\Big),$$
where $R^{\neg ji}_{jk} \triangleq \sum_{i' \neq i}^{n_j} \mathbb{I}(x_{ji'} = k)$. As such, the expected value of $\pi_{jk}$ is
$$P(x_{ji} = k \mid x_{\neg ji}, a, m) = \frac{R^{\neg ji}_{jk} + am_k}{R^{\neg ji}_{j\cdot} + a},$$
where $R^{\neg ji}_{j\cdot} = \sum_{k=1}^{\infty} R^{\neg ji}_{jk}$.

Given these likelihood and prior terms, the posterior is simply
$$P(x_{ji} = k \mid w_{ji} = v, x_{\neg ji}, w_{\neg ji}, b, \psi, a, m) \propto \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \frac{R^{\neg ji}_{jk} + am_k}{R^{\neg ji}_{j\cdot} + a} \propto \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + am_k\big).$$
Note that from this, we also have
$$P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) \propto \sum_{k > K} \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + am_k\big).$$
Given that for all $k > K$, where $K$ is the maximum value of the set $\{x_{ji} : j \in 1 \ldots J,\; i \in 1 \ldots n_j\}$, we have $R^{\neg ji}_{jk} = 0$ and $S^{\neg ji}_{kv} = 0$, then
$$P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) \propto \sum_{k > K} \frac{b\psi_v}{b} \times am_k = \psi_v \times am_u,$$
where $m_u = \sum_{k > K} m_k$.

As a practical matter of sampling, for each latent variable $x_{ji}$ we calculate
$$f_{jik} = \frac{S^{\neg ji}_{kv} + b\psi_v}{S^{\neg ji}_{k\cdot} + b} \times \big(R^{\neg ji}_{jk} + am_k\big),$$
for $k \in 1, 2 \ldots K$, and then
$$f_{jiu} = \psi_v \times am_u,$$
where $K$ and $m_u$ are defined as above and $v = w_{ji}$. Now,
$$P(x_{ji} \leq K \mid x_{\neg ji}, w, b, \psi, a, m) = \frac{\sum_{k=1}^{K} f_{jik}}{\sum_{k=1}^{K} f_{jik} + f_{jiu}}$$
and
$$P(x_{ji} > K \mid x_{\neg ji}, w, b, \psi, a, m) = \frac{f_{jiu}}{\sum_{k=1}^{K} f_{jik} + f_{jiu}},$$
and so a single random sample will be sufficient to decide whether $x_{ji} \leq K$ or $x_{ji} > K$. If $x_{ji} \leq K$, then
$$P(x_{ji} = k \mid x_{\neg ji}, w, b, \psi, a, m, x_{ji} \leq K) = \frac{f_{jik}}{\sum_{k'=1}^{K} f_{jik'}}.$$
On the other hand, if $x_{ji} > K$, the probability that $x_{ji} = k^{\mathrm{new}}$ for $k^{\mathrm{new}} > K$ is
$$P(x_{ji} = k^{\mathrm{new}} \mid x_{\neg ji}, w, b, \psi, a, m, x_{ji} > K) = \frac{\psi_v \times am_{k^{\mathrm{new}}}}{f_{jiu}} = \frac{m_{k^{\mathrm{new}}}}{m_u}.$$
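As a concrete illustration of the update just described, the sketch below performs the Gibbs step for a single $x_{ji}$: it computes $f_{jik}$ for the $K$ currently represented topics and the residual mass $f_{jiu}$, and draws one categorical sample that settles both whether $x_{ji} \leq K$ and, if so, which $k$ to assign. When the residual mass is chosen, the function simply signals that a new topic should be opened; in a full sampler the new index $k^{\mathrm{new}}$ would then be drawn with probability $m_{k^{\mathrm{new}}}/m_u$ by breaking the residual stick. The count arrays and variable names are assumptions for this sketch, not the note's own code.

```python
import numpy as np

def sample_x_ji(v, S, S_dot, R_j, m, psi, a, b, rng):
    """One Gibbs update for a single latent variable x_ji.

    v     : observed word w_ji, an integer in 0..V-1
    S     : K x V array of counts S_kv, excluding token ji
    S_dot : length-K array of totals S_k., excluding token ji
    R_j   : length-K array of counts R_jk for document j, excluding token ji
    m     : length-(K+1) array (m_1, ..., m_K, m_u), where m_u is the
            total weight on all currently unrepresented topics
    """
    K = S.shape[0]
    # f_jik for the K represented topics
    f = (S[:, v] + b * psi[v]) / (S_dot + b) * (R_j + a * m[:K])
    # f_jiu: the collapsed mass for all topics k > K
    f_u = psi[v] * a * m[K]
    probs = np.append(f, f_u)
    probs /= probs.sum()
    k = rng.choice(K + 1, p=probs)
    return k if k < K else K   # returning K signals "open a new topic"
```

In a full sampler the caller would then update the $S$ and $R$ count arrays and, whenever a new topic is opened, extend $\phi$, $m$, and the counts accordingly.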
Sampling m and a

The posterior distribution over the infinite-length array $m$ is
$$P(m \mid x_{1:J}, a, \gamma) \propto P(x_{1:J} \mid a, m)\, P(m \mid \gamma),$$
while the posterior over the scalar parameter $a$ is
$$P(a \mid x_{1:J}) \propto P(x_{1:J} \mid a, m)\, P(a),$$
with the priors as stated above. The likelihood term in both cases is
$$P(x_{1:J} \mid a, m) = \prod_{j=1}^{J} \int \prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j)\, P(\pi_j \mid a, m)\, d\pi_j,$$
where $\prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j) = \prod_{k=1}^{K} \pi_{jk}^{R_{jk}}$, with $R_{jk} = \sum_{i=1}^{n_j} \mathbb{I}(x_{ji} = k)$, and $K$ is, as stated above, the maximum value attained by any latent variable. The prior $P(\pi_j \mid a, m)$ is a Dirichlet process prior and so, by definition of the Dirichlet process,
$$P(\pi_{j1}, \pi_{j2} \ldots \pi_{jK}, \pi_{ju} \mid a, m) = \frac{\Gamma(a)}{\Gamma(am_u)\prod_{k=1}^{K}\Gamma(am_k)}\; \pi_{ju}^{am_u - 1} \prod_{k=1}^{K} \pi_{jk}^{am_k - 1},$$
where $m_u = \sum_{k>K} m_k$, as stated above, and $\pi_{ju} = \sum_{k>K} \pi_{jk}$. Therefore,
$$
\begin{aligned}
P(x_{1:J} \mid a, m) &= \prod_{j=1}^{J} \int \prod_{i=1}^{n_j} P(x_{ji} \mid \pi_j)\, P(\pi_j \mid a, m)\, d\pi_j\\
&= \prod_{j=1}^{J} \int \frac{\Gamma(a)}{\Gamma(am_u)\prod_{k=1}^{K}\Gamma(am_k)}\; \pi_{ju}^{am_u - 1} \prod_{k=1}^{K} \pi_{jk}^{R_{jk} + am_k - 1}\, d\pi_j\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(am_u)\prod_{k=1}^{K}\Gamma(am_k)} \cdot \frac{\Gamma(am_u)\prod_{k=1}^{K}\Gamma(R_{jk} + am_k)}{\Gamma(R_{j\cdot} + a)}\\
&= \prod_{j=1}^{J} \frac{\Gamma(a)}{\Gamma(R_{j\cdot} + a)} \prod_{k=1}^{K} \frac{\Gamma(R_{jk} + am_k)}{\Gamma(am_k)},
\end{aligned}
$$
and this can be re-written as
$$P(x_{1:J} \mid a, m) = \prod_{j=1}^{J} \frac{1}{\Gamma(R_{j\cdot})} \int_{0}^{1} \tau_j^{\,a-1}(1 - \tau_j)^{R_{j\cdot}-1}\, d\tau_j \prod_{k=1}^{K} \sum_{\sigma_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma_{jk})\, (am_k)^{\sigma_{jk}},$$
where $\mathcal{S}(R_{jk}, \sigma_{jk})$ is an unsigned Stirling number of the first kind, given that
$$\frac{\Gamma(a)}{\Gamma(R_{j\cdot} + a)} = \frac{1}{\Gamma(R_{j\cdot})} \int_{0}^{1} \tau_j^{\,a-1}(1 - \tau_j)^{R_{j\cdot}-1}\, d\tau_j
\quad\text{and}\quad
\frac{\Gamma(R_{jk} + am_k)}{\Gamma(am_k)} = \sum_{\sigma_{jk}=0}^{R_{jk}} \mathcal{S}(R_{jk}, \sigma_{jk})\, (am_k)^{\sigma_{jk}}.$$
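The excerpt ends at this point, before the conditional distributions implied by this rewriting are stated. For orientation only, the sketch below shows the auxiliary-variable updates that this construction yields in the standard HDP treatment (e.g. Teh et al., 2006): each $\sigma_{jk}$ is drawn from the distribution proportional to $\mathcal{S}(R_{jk}, \sigma)(am_k)^{\sigma}$ via the Chinese-restaurant-table construction, $\tau_j \sim \mathrm{Beta}(a, R_{j\cdot})$, and then $a$ and $(m_1, \ldots, m_K, m_u)$ have Gamma and Dirichlet conditionals under the Gamma(1, 1) and stick-breaking priors. These formulas are assumed from that literature rather than taken from the note, and the variable names are illustrative.

```python
import numpy as np

def sample_crt(R_count, alpha, rng):
    """Number of occupied tables for R_count customers in a CRP with
    concentration alpha; distributed proportionally to S(R_count, s) * alpha**s."""
    if R_count == 0:
        return 0
    i = np.arange(R_count)
    return int((rng.random(R_count) < alpha / (alpha + i)).sum())

def sample_a_and_m(R, m, a, gamma, rng):
    """One auxiliary-variable update of a and (m_1, ..., m_K, m_u).

    R : J x K array of counts R_jk (every document assumed non-empty)
    m : length-(K+1) array (m_1, ..., m_K, m_u)
    """
    J, K = R.shape
    # sigma_jk drawn proportionally to S(R_jk, sigma) * (a m_k)^sigma
    sigma = np.array([[sample_crt(R[j, k], a * m[k], rng)
                       for k in range(K)] for j in range(J)])
    # tau_j | a ~ Beta(a, R_j.)
    tau = rng.beta(a, R.sum(axis=1))
    # a | ... ~ Gamma(shape = 1 + sum_jk sigma_jk, rate = 1 - sum_j log tau_j),
    # using the Gamma(1, 1) prior on a; numpy parameterises by scale = 1/rate.
    a_new = rng.gamma(shape=1.0 + sigma.sum(),
                      scale=1.0 / (1.0 - np.log(tau).sum()))
    # (m_1, ..., m_K, m_u) | ... ~ Dirichlet(sigma_.1, ..., sigma_.K, gamma);
    # every represented topic has at least one table, so all parameters are positive.
    m_new = rng.dirichlet(np.append(sigma.sum(axis=0), gamma).astype(float))
    return a_new, m_new
```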