The Continuous Categorical: A Novel Simplex-Valued Exponential Family
Elliott Gordon-Rodriguez¹, Gabriel Loaiza-Ganem², John P. Cunningham¹

¹Department of Statistics, Columbia University. ²Layer 6 AI. Correspondence to: Elliott Gordon-Rodriguez <[email protected]>, Gabriel Loaiza-Ganem <[email protected]>, John P. Cunningham <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Simplex-valued data appear throughout statistics and machine learning, for example in the context of transfer learning and compression of deep networks. Existing models for this class of data rely on the Dirichlet distribution or other related loss functions; here we show these standard choices suffer systematically from a number of limitations, including bias and numerical issues that frustrate the use of flexible network models upstream of these distributions. We resolve these limitations by introducing a novel exponential family of distributions for modeling simplex-valued data – the continuous categorical, which arises as a nontrivial multivariate generalization of the recently discovered continuous Bernoulli. Unlike the Dirichlet and other typical choices, the continuous categorical results in a well-behaved probabilistic loss function that produces unbiased estimators, while preserving the mathematical simplicity of the Dirichlet. As well as exploring its theoretical properties, we introduce sampling methods for this distribution that are amenable to the reparameterization trick, and evaluate their performance. Lastly, we demonstrate that the continuous categorical outperforms standard choices empirically, across a simulation study, an applied example on multi-party elections, and a neural network compression task.¹

¹Our code is available at https://github.com/cunningham-lab/cb_and_cc

1. Introduction

Simplex-valued data, commonly referred to as compositional data in the statistics literature, are of great practical relevance across the natural and social sciences (see Pawlowsky-Glahn & Egozcue (2006); Pawlowsky-Glahn & Buccianti (2011); Pawlowsky-Glahn et al. (2015) for an overview). Prominent examples appear in highly cited work ranging from geology (Pawlowsky-Glahn & Olea, 2004; Buccianti et al., 2006), chemistry (Buccianti & Pawlowsky-Glahn, 2005), microbiology (Gloor et al., 2017), genetics (Quinn et al., 2018), psychiatry (Gueorguieva et al., 2008), ecology (Douma & Weedon, 2019), environmental science (Filzmoser et al., 2009), materials science (Na et al., 2014), political science (Katz & King, 1999), public policy (Breunig & Busemeyer, 2012), economics (Fry et al., 2000), and the list goes on. An application of particular interest in machine learning arises in the context of model compression, where the class probabilities outputted by a large model are used as 'soft targets' to train a small neural network (Buciluă et al., 2006; Ba & Caruana, 2014; Hinton et al., 2015), an idea used also in transfer learning (Tzeng et al., 2015; Parisotto et al., 2015).

The existing statistical models of compositional data come in three flavors. Firstly, there are models based on a Dirichlet likelihood, for example the Dirichlet GLM (Campbell & Mosimann, 1987; Gueorguieva et al., 2008; Hijazi & Jernigan, 2009) and Dirichlet Component Analysis (Wang et al., 2008; Masoudimansour & Bouguila, 2017). Secondly, there are also models based on applying classical statistical techniques to $\mathbb{R}^K$-valued transformations of simplex-valued data, notably via logratios (Aitchison, 1982; 1994; 1999; Egozcue et al., 2003; Ma et al., 2016; Quinn et al., 2020). Thirdly, the machine learning literature has proposed predictive models that forgo the use of a probabilistic objective altogether and optimize the categorical cross-entropy instead (Ba & Caruana, 2014; Hinton et al., 2015; Sadowski & Baldi, 2018).
The first type of model fundamentally suffers from an ill-behaved loss function, amongst other drawbacks (§2), which constrains the practitioner to inflexible (linear) models with only a small number of predictors. The second approach typically results in complicated likelihoods, which are hard to analyze and do not form an exponential family, forgoing many of the attractive theoretical properties of the Dirichlet. The third approach suffers from ignoring normalizing constants, sacrificing the properties of maximum likelihood estimation (see Loaiza-Ganem & Cunningham (2019) for a more detailed discussion of the pitfalls).

In this work, we resolve these limitations simultaneously by defining a novel exponential family supported on the simplex, the continuous categorical (CC) distribution. Our distribution arises naturally as a multivariate generalization of the recently discovered continuous Bernoulli (CB) distribution (Loaiza-Ganem & Cunningham, 2019), a $[0,1]$-supported exponential family motivated by Variational Autoencoders (Kingma & Welling, 2014), which showed empirical improvements over the beta distribution for modeling data which lies close to the extrema of the unit interval. Similarly, the CC will provide three crucial advantages over the Dirichlet and other competitors: it defines a coherent, well-behaved and computationally tractable log-likelihood (§3.1, §3.4), which does not blow up in the presence of zero-valued observations (§3.2), and which produces unbiased estimators (§3.3). Moreover, the CC model presents no added complexity relative to its competitors; it has one fewer parameter than the Dirichlet and a normalizing constant that can be written in closed form using elementary functions alone (§3.4). The continuous categorical brings probabilistic machine learning to the analysis of compositional data, opening up avenues for future applied and theoretical research.

2. Background

Careful consideration of the Dirichlet clarifies its shortcomings and the need for the CC family. The Dirichlet distribution is parameterized by $\alpha \in \mathbb{R}_+^K$ and defined on the simplex $\mathcal{S}^{K-1} = \{x \in \mathbb{R}_+^{K-1} : \sum_{i=1}^{K-1} x_i < 1\}$ by:

$$p(x_1, \ldots, x_{K-1}; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad (1)$$

where $B(\alpha)$ denotes the multivariate beta function, and $x_K = 1 - x_1 - \cdots - x_{K-1}$.
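For concreteness, equation (1) translates directly into a few lines of code. The following sketch is our own illustration (the function name `dirichlet_log_density` is ours, not from the paper's codebase); it evaluates the log of (1) in the full $K$-dimensional representation and checks the result against SciPy's built-in implementation.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

def dirichlet_log_density(x, alpha):
    """Log of eq. (1): sum_i (alpha_i - 1) log x_i - log B(alpha),
    where x is the full K-vector with x_K = 1 - x_1 - ... - x_{K-1}."""
    x, alpha = np.asarray(x, dtype=float), np.asarray(alpha, dtype=float)
    # log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)
    log_beta = gammaln(alpha).sum() - gammaln(alpha.sum())
    return np.sum((alpha - 1.0) * np.log(x)) - log_beta

x = np.array([0.2, 0.3, 0.5])      # a point in the interior of the simplex
alpha = np.array([2.0, 1.0, 3.0])  # positive concentration parameters
assert np.isclose(dirichlet_log_density(x, alpha), dirichlet.logpdf(x, alpha))
```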
This density function presents an attractive combination of mathematical simplicity, computational tractability, and the flexibility to model both interior modes and sparsity, as well as defining an exponential family of distributions that provides a multivariate generalization of the beta distribution and a conjugate prior to the multinomial distribution. As such, the Dirichlet is by far the best known and most widely used probability distribution for modeling simplex-valued random variables (Ng et al., 2011). This includes both the latent variables of mixed membership […] to the setting where we aim to learn a (possibly nonlinear) regression function that models a simplex-valued response in terms of a (possibly large) set of predictors. In this context, and in spite of its strengths, the Dirichlet distribution suffers from three fundamental limitations; a brief numerical sketch illustrating each appears at the end of this section.

First, flexibility: while the ability to capture interior modes might seem intuitively appealing, in practice it thwarts the optimization of predictive models of compositional data (as we will demonstrate empirically in §5.2). To illustrate why, consider fitting a Dirichlet distribution to a single observation lying inside the simplex. Maximizing the likelihood will lead to a point mass on the observed datapoint (as was also noted by Sadowski & Baldi (2018)); the log-likelihood will diverge, and so will the parameter estimate (tending to $\infty$ along the line that preserves the observed proportions). This example may seem degenerate; after all, any dataset that would warrant a probabilistic model had better contain more than a single observation, at which point no individual point can 'pull' the density onto itself. However, in the context of predictive modeling, observations will often present unique input-output pairs, particularly if we have continuous predictors, or a large number of categorical ones. Thus, any regression function that is sufficiently flexible, such as a deep network, or that takes a sufficiently large number of inputs, can result in a divergent log-likelihood, frustrating an optimizer's effort to find a sensible estimator. This limitation has constrained Dirichlet-based predictive models to the space of linear functions (Campbell & Mosimann, 1987; Hijazi & Jernigan, 2009).

Second, tails: the Dirichlet density diverges or vanishes at the extrema for all but a measure-zero set of values of $\alpha$, so that the log-likelihood is undefined whenever an observation contains zeros (transformation-based alternatives, such as logratios, suffer the same drawback). However, zeros are ubiquitous in real-world compositional data, a fact that has led to the development of numerous hacks, such as those of Palarea-Albaladejo & Martín-Fernández (2008); Scealy & Welsh (2011); Stewart & Field (2011); Hijazi et al. (2011); Tsagris & Stewart (2018), each of which carries its own tradeoffs.

Third, bias: an elementary result is that the MLE of an exponential family yields an unbiased estimator for its sufficient statistic. However, the sufficient statistic of the Dirichlet distribution is the logarithm of the data, so that by Jensen's inequality, the MLE for the mean parameter, which is often the object of interest, is biased. While the MLE is also […]
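The divergence described under the first limitation is easy to reproduce. A minimal sketch (our illustration, not from the paper's codebase): hold a single interior observation fixed and scale $\alpha$ along the proportion-preserving ray $\alpha = c \cdot x$; the log-likelihood grows without bound as $c \to \infty$.

```python
import numpy as np
from scipy.stats import dirichlet

x = np.array([0.2, 0.3, 0.5])  # a single observation inside the simplex

# Scaling alpha along the ray that preserves the observed proportions
# concentrates the Dirichlet on x, so the log-likelihood of this one
# point increases without bound (and so does the parameter estimate).
for c in [1e1, 1e2, 1e3, 1e4, 1e5]:
    print(f"c = {c:>8.0f}   log-likelihood = {dirichlet.logpdf(x, c * x):8.3f}")
```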
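The second limitation is just as mechanical: the data-dependent term of the Dirichlet log-likelihood, $\sum_i (\alpha_i - 1) \log x_i$, is infinite whenever any $x_i = 0$ (for all $\alpha_i \neq 1$). A small sketch, again our own illustration, showing both failure modes:

```python
import numpy as np

x = np.array([0.0, 0.4, 0.6])  # a composition containing a zero
with np.errstate(divide="ignore"):
    log_x = np.log(x)          # log(0) = -inf

for a1 in [0.5, 2.0]:          # alpha_1 < 1 versus alpha_1 > 1
    alpha = np.array([a1, 2.0, 2.0])
    term = np.sum((alpha - 1.0) * log_x)  # sum_i (alpha_i - 1) log x_i
    print(f"alpha_1 = {a1}:  data term = {term}")
# alpha_1 = 0.5 -> +inf (the density diverges at the boundary);
# alpha_1 = 2.0 -> -inf (the density vanishes there).
# Either way, the log-likelihood is unusable for observations with zeros.
```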
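Finally, the Jensen gap behind the third limitation can be computed in closed form using the standard Dirichlet identity $\mathbb{E}[\log X_i] = \psi(\alpha_i) - \psi(\alpha_0)$, where $\alpha_0 = \sum_i \alpha_i$ and $\psi$ is the digamma function; this identity is textbook material rather than something stated in the excerpt above, and the sketch is ours:

```python
import numpy as np
from scipy.special import digamma

alpha = np.array([2.0, 1.0, 3.0])
alpha0 = alpha.sum()

# The moment the MLE matches exactly (the sufficient statistic is log x):
E_log_x = digamma(alpha) - digamma(alpha0)
# The moment an unbiased estimate of the mean would need to match:
log_E_x = np.log(alpha / alpha0)

# E[log X_i] < log E[X_i] component-wise, by Jensen's inequality
# (log is strictly concave), which is the source of the bias in the
# MLE of the mean parameter.
print(np.column_stack([E_log_x, log_E_x]))
```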