Exponential Family Distributions

CS242: Probabilistic Graphical Models
Lecture 3A: Exponential Families & Graphical Models
Professor Erik Sudderth, Brown University Computer Science, September 27, 2016
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

CS242: Lecture 3A Outline
Ø Exponential Families of Probability Distributions
Ø Review: Maximum Likelihood (ML) Estimation
Ø ML Estimation for Exponential Families

Exponential Families of Distributions

p(x \mid \theta) = \frac{1}{Z(\theta)} \nu(x) \exp\{ \theta^T \phi(x) \} = \nu(x) \exp\{ \theta^T \phi(x) - \Phi(\theta) \}

Z(\theta) = \int_{\mathcal{X}} \nu(x) \exp\{ \theta^T \phi(x) \} \, dx,    \Phi(\theta) = \log Z(\theta)

• \phi(x) \in \mathbb{R}^d: fixed vector of sufficient statistics (features), specifying the family of distributions.
• \theta \in \Theta \subseteq \mathbb{R}^d: unknown vector of natural parameters; \theta determines a particular distribution in this family.
• Z(\theta) > 0: normalization constant or partition function (for discrete variables, the integral becomes a sum).
• \nu(x) > 0: reference measure, independent of the parameters (for many models we simply have \nu(x) = 1).
• \Theta = \{ \theta \in \mathbb{R}^d \mid Z(\theta) < \infty \} ensures we construct a valid distribution.

Preview: Why the Exponential Family?

p(x \mid \theta) = \nu(x) \exp\{ \theta^T \phi(x) - \Phi(\theta) \},    \Phi(\theta) = \log \int_{\mathcal{X}} \nu(x) \exp\{ \theta^T \phi(x) \} \, dx,    \Theta = \{ \theta \in \mathbb{R}^d \mid \Phi(\theta) < \infty \}

• \phi(x) \in \mathbb{R}^d: vector of sufficient statistics (features) defining the family.
• \theta \in \mathbb{R}^d: vector of natural parameters indexing particular distributions.
• Allows simultaneous study of many popular probability distributions: Bernoulli (binary), categorical, Poisson (counts), exponential (positive), Gaussian (real), ...
• Maximum likelihood (ML) learning is simple: moment matching of sufficient statistics.
• Bayesian learning is simple: conjugate priors are available (Beta, Dirichlet, Gamma, Gaussian, Wishart, ...).
• Maximum entropy interpretation: among all distributions with certain moments of interest, the exponential family member is the most random (makes the fewest assumptions).
• Parametric and predictive sufficiency: for arbitrarily large datasets, optimal learning is possible from a finite-dimensional set of statistics (streaming, big data).
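As a concrete illustration of the definitions above, the short Python sketch below (not part of the original slides; the function name exp_family_log_prob and the specific choices of \phi, \nu, \Phi are illustrative) evaluates the generic exponential family log-density \log p(x \mid \theta) = \log \nu(x) + \theta^T \phi(x) - \Phi(\theta) and checks it against the familiar Bernoulli form that is derived later in the lecture.

```python
import numpy as np

def exp_family_log_prob(x, theta, phi, log_nu, log_Z):
    """Generic exponential family log-density:
    log p(x | theta) = log nu(x) + theta^T phi(x) - Phi(theta)."""
    return log_nu(x) + np.dot(theta, phi(x)) - log_Z(theta)

# Bernoulli written in exponential family form (see the Bernoulli slide below):
# nu(x) = 1, phi(x) = x, Phi(theta) = log(1 + e^theta).
phi    = lambda x: np.array([float(x)])
log_nu = lambda x: 0.0
log_Z  = lambda theta: np.log1p(np.exp(theta[0]))

theta = np.array([0.8])                # natural parameter
mu = 1.0 / (1.0 + np.exp(-theta[0]))   # corresponding mean parameter
for x in (0, 1):
    lp_expfam = exp_family_log_prob(x, theta, phi, log_nu, log_Z)
    lp_moment = x * np.log(mu) + (1 - x) * np.log(1 - mu)
    print(x, np.isclose(lp_expfam, lp_moment))  # True: the two forms agree
```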
Natural Logarithms and Exponentials

\ln(x) = \log(x) : \mathbb{R}^+ \to \mathbb{R},    \exp(\log(x)) = x,    \log(xyz) = \log(x) + \log(y) + \log(z),    \log(x^a) = a \log(x)

Reminder: Logistic Functions

\sigma(z) = \frac{1}{1 + e^{-z}}

• Simple derivatives: \frac{\partial \sigma(z)}{\partial z} = \sigma(z)\,\sigma(-z)
• Convexity: -\log \sigma(z) = \log(1 + e^{-z})
• Symmetry: \sigma(-z) = 1 - \sigma(z)
• Log-odds ratio: \log \frac{\sigma(z)}{1 - \sigma(z)} = z

Bernoulli Distribution

Bernoulli distribution: single toss of a (possibly biased) coin.
\mathrm{Ber}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x},    x \in \{0, 1\},    E[x \mid \mu] = P[x = 1] = \mu,    0 \le \mu \le 1

Exponential family form (derivation on board):
\mathrm{Ber}(x \mid \theta) = \exp\{ \theta x - \Phi(\theta) \},    \theta \in \Theta = \mathbb{R}
• Logistic function: \mu = (1 + \exp(-\theta))^{-1},    \theta = \log(\mu) - \log(1 - \mu)
• Log normalization: \Phi(\theta) = \log(1 + e^{\theta}); the log-normalizer is convex.

Categorical Distribution

Categorical distribution: single roll of a (possibly biased) die.
\mathrm{Cat}(x \mid \mu) = \prod_{k=1}^{K} \mu_k^{x_k},    \mathcal{X} = \{0, 1\}^K,    \sum_{k=1}^{K} x_k = 1
0 \le \mu_k \le 1,    \sum_{k=1}^{K} \mu_k = 1,    \mu_k = E[x_k]
(Figure: for K = 3, the valid parameters (\mu_1, \mu_2, \mu_3) form the probability simplex.)

Exponential family form:
\mathrm{Cat}(x \mid \theta) = \exp\{ \sum_{k=1}^{K} \theta_k x_k - \Phi(\theta) \},    \Phi(\theta) = \log\left( \sum_{\ell=1}^{K} \exp(\theta_\ell) \right)
Ø A linear subspace of exponential family parameters gives the same probabilities, because the features are linearly dependent (\sum_k x_k = 1):
\mathrm{Cat}(x \mid \theta) = \mathrm{Cat}(x \mid \theta + c) for any scalar constant c.

Multinomial Logistic (Softmax) Function

\mu_k(\theta) = \frac{\exp(\theta_k)}{\sum_{\ell=1}^{K} \exp(\theta_\ell)},    k = 1, \ldots, K,    \theta_k \in \mathbb{R},    \mu_k(\theta) > 0,    \sum_{k=1}^{K} \mu_k(\theta) = 1
(Figure: \mu_k(\theta / T) for \theta = (3, 0, 1), interpolating between "uniform" probabilities for large T and "max" (one-hot) probabilities for small T.)

Softmax versus Logistic Function

Assume K = 2 categories:
\mu_1(z) = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{z_2 - z_1}} = \sigma(z_1 - z_2)
\mu_2(z) = \frac{e^{z_2}}{e^{z_1} + e^{z_2}} = 1 - \mu_1(z) = \sigma(z_2 - z_1)
The binary logistic function is a special case of the softmax.

Minimal Exponential Families

Ø In a minimal exponential family representation, the features must be linearly independent.
Example: \mathrm{Ber}(x \mid \theta) = \exp\{ \theta x - \Phi(\theta) \}
Ø In a non-minimal or redundant exponential family representation, the features are linearly dependent and multiple parameters give the same distribution.
Example: \mathrm{Ber}(x \mid \theta) = \exp\{ \theta_1 x + \theta_2 (1 - x) - \Phi(\theta_1, \theta_2) \}
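The two facts just stated are easy to verify numerically: softmax probabilities are unchanged by adding a constant to every natural parameter (the redundant representation), and for K = 2 the softmax reduces to a logistic function of the difference z_1 - z_2. The sketch below is illustrative only and is not part of the course materials.

```python
import numpy as np

def softmax(theta):
    """Multinomial logistic (softmax): mu_k = exp(theta_k) / sum_l exp(theta_l).
    Subtracting the max first improves numerical stability without changing the result."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([3.0, 0.0, 1.0])
c = 7.5  # any scalar constant

# Non-minimal representation: theta and theta + c give the same distribution.
print(np.allclose(softmax(theta), softmax(theta + c)))   # True

# K = 2: softmax reduces to the logistic of the difference of natural parameters.
z = np.array([1.2, -0.4])
print(np.isclose(softmax(z)[0], sigmoid(z[0] - z[1])))   # True
```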
Multivariate Gaussian Distribution

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

Moment parameterization (mean and covariance):
\mu = E[x] \in \mathbb{R}^N,    \Sigma = E[(x - \mu)(x - \mu)^T] = E[x x^T] - \mu \mu^T \in \mathbb{R}^{N \times N}
For simplicity, assume the covariance is positive definite and invertible.

Information parameterization (canonical parameters):
\Lambda = \Sigma^{-1},    \vartheta = \Sigma^{-1} \mu
\mathcal{N}(x \mid \mu, \Sigma) \propto \exp\left\{ -\frac{1}{2} x^T \Sigma^{-1} x + \mu^T \Sigma^{-1} x - \frac{1}{2} \mu^T \Sigma^{-1} \mu \right\}
\mathcal{N}(x \mid \vartheta, \Lambda) \propto \exp\left\{ -\frac{1}{2} x^T \Lambda x + \vartheta^T x \right\} \propto \exp\left\{ \sum_{s=1}^{N} \left( \vartheta_s x_s - \frac{1}{2} \Lambda_{ss} x_s^2 \right) - \frac{1}{2} \sum_{s \ne t} \Lambda_{st} x_s x_t \right\}
Recall the general exponential family form: p(x \mid \theta) = \exp\{ \theta^T \phi(x) - \Phi(\theta) \}.

CS242: Lecture 3A Outline
Ø Exponential Families of Probability Distributions
Ø Review: Maximum Likelihood (ML) Estimation
Ø ML Estimation for Exponential Families

Terminology: Inference versus Learning

• Inference: Given a model with known parameters, estimate the "hidden" variables for some data instance, or find their marginal distribution.
• Learning: Given multiple data instances, estimate parameters for a graphical model of their structure: maximum likelihood, maximum a posteriori, ...
Ø Training instances may be completely or partially observed.

Example: Expert systems for medical diagnosis
• Inference: Given observed symptoms for a particular patient, infer probabilities that they have contracted various diseases.
• Learning: Given a database of many patient diagnoses, learn the relationships between diseases and symptoms.

Example: Markov random fields for semantic image segmentation
• Inference: What object category is depicted at each pixel in some image?
• Learning: Across a database of images, in which some example objects have been labeled, how do those labels relate to low-level image features?

Frequentist Parameter Estimation

Ø Training data has L observations: x = \{ x^{(1)}, x^{(2)}, \ldots, x^{(L)} \}.
Ø Assume independent, identically distributed (i.i.d.) samples from an unknown member of some family of distributions p(x \mid \theta), \theta \in \Theta \subseteq \mathbb{R}^d.
Example: Data is Gaussian, \theta is the unknown mean.
Ø Think of the parameter vector \theta as a fixed, non-random constant.
Ø An estimator guesses (learns) parameters from data:
\hat{\theta}_L(x) = f(x^{(1)}, \ldots, x^{(L)}),    \hat{\theta}_L(x) \in \mathbb{R}^d
Example: \theta is the unknown mean, and \hat{\theta}_L(x) = \frac{1}{L} \sum_{\ell=1}^{L} x^{(\ell)}.
Ø GOAL: Estimators with good properties, uniformly across \theta.
Ø Analyze estimators via expected performance on hypothetical data (many trials of drawing L i.i.d. samples).

Maximum Likelihood (ML) Estimation

Ø Suppose I have L independent observations sampled from some unknown probability distribution: x = \{ x^{(1)}, \ldots, x^{(L)} \}.
Ø Suppose I have two candidate parameter estimates where p(x \mid \theta_1) > p(x \mid \theta_2).
Given no other information, choose the higher-likelihood model!
Ø The maximum likelihood (ML) parameter estimate is defined as:
\hat{\theta} = \arg\max_{\theta} \prod_{\ell=1}^{L} p(x^{(\ell)} \mid \theta) = \arg\max_{\theta} \sum_{\ell=1}^{L} \log p(x^{(\ell)} \mid \theta)
Here p(x \mid \theta) is a probability mass function for discrete X, and a probability density function for continuous X.

Consistency of ML Estimators

\hat{\theta}_L = \arg\max_{\theta} \prod_{\ell=1}^{L} p(x^{(\ell)} \mid \theta) = \arg\max_{\theta} \sum_{\ell=1}^{L} \log p(x^{(\ell)} \mid \theta)

Ø An estimator is consistent if the sequence of estimates \hat{\theta}_L converges to the true parameter in probability, for any true \theta:
\lim_{L \to \infty} P( \| \hat{\theta}_L - \theta \| \ge \epsilon ) = 0 for any \epsilon > 0.
Ø Under "mild conditions" that hold for most distributions, including a broad range of exponential family distributions, the ML parameter estimates are consistent.
Ø The ML estimator also satisfies a central limit theorem, and for large L has provably small variance ("efficiency").

Finding ML Estimates

Ø For many practical models (including exponential families), the log-likelihood is a smooth, continuous function of the parameters:
\mathcal{L}(\theta) = \sum_{\ell=1}^{L} \log p(x^{(\ell)} \mid \theta)
The maximum will occur at a point where the derivative equals zero!
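The consistency statement above is easy to check empirically. The sketch below (an illustration under the assumption of Gaussian data with known unit variance; not part of the slides) draws many hypothetical datasets of increasing size L, computes the ML estimate of the mean — the sample mean — for each, and shows the typical estimation error shrinking as L grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 2.0   # true parameter theta, known here only to measure error

# For i.i.d. Gaussian data with known variance, the ML estimate of the mean
# is the sample mean; its error should shrink as the number of samples L grows.
for L in (10, 100, 1000, 10000):
    errors = []
    for trial in range(200):                  # many hypothetical datasets of size L
        x = rng.normal(loc=true_mean, scale=1.0, size=L)
        theta_hat = x.mean()                  # ML estimate = sample mean
        errors.append(abs(theta_hat - true_mean))
    print(L, np.mean(errors))                 # average error decreases with L
```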
Example: Bernoulli Distribution

Ø A Bernoulli or indicator random variable X has one parameter:
p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x},    x \in \{0, 1\}
Ø The maximum likelihood (ML) estimate maximizes
\mathcal{L}(\theta) = \sum_{\ell=1}^{L} \left[ x^{(\ell)} \log(\theta) + (1 - x^{(\ell)}) \log(1 - \theta) \right],
giving
\hat{\theta} = \frac{1}{L} \sum_{\ell=1}^{L} x^{(\ell)}
ML estimate: the empirical fraction of successes!

Example: Categorical Distribution

Categorical distribution: single roll of a (possibly biased) die.
\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k},    \mathcal{X} = \{0, 1\}^K,    \sum_{k=1}^{K} x_k = 1
0 \le \theta_k \le 1,    \sum_{k=1}^{K} \theta_k = 1,    \theta_k = E[x_k]
(Figure: for K = 3, the valid parameters lie on the probability simplex.)

A Constrained Optimization Lemma

Objective:
\hat{\theta} = \arg\max_{\theta} \sum_{k=1}^{K} a_k \log \theta_k,    a_k \ge 0,
subject to \sum_{k=1}^{K} \theta_k = 1 and 0 \le \theta_k \le 1.
Solution:
\hat{\theta}_k = \frac{a_k}{a_0},    a_0 = \sum_{k=1}^{K} a_k
Ø Proof for K = 2: change of variables to an unconstrained problem.
Ø Proof for general K: Lagrange multipliers.

Learning Categorical Distributions

Categorical distribution: single roll of a (possibly biased) die.
\mathrm{Cat}(x \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k},    \mathcal{X} = \{0, 1\}^K,    \sum_{k=1}^{K} x_k = 1
Ø If we have N_k observations of outcome k out of N total rolls, the constrained optimization lemma gives the ML estimate \hat{\theta}_k = N_k / N: the empirical fraction of each outcome.
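To close the loop on the categorical example, the sketch below (illustrative only; the variable names and the K = 3 probabilities are my own) applies the lemma in code: count how often each outcome occurs and normalize, which is exactly the ML estimate \hat{\theta}_k = N_k / N.

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = np.array([0.5, 0.3, 0.2])   # unknown die probabilities (K = 3)
L = 5000

# Draw L categorical samples, encoded as integer outcomes 0, ..., K-1.
rolls = rng.choice(len(true_theta), size=L, p=true_theta)

# ML estimation by moment matching: N_k counts, normalized by N = L.
counts = np.bincount(rolls, minlength=len(true_theta))   # N_k for each outcome
theta_hat = counts / counts.sum()                        # theta_hat_k = N_k / N

print(counts)       # observation counts per face
print(theta_hat)    # close to true_theta for large L
```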