Generalized Linear Models for Aggregated Data
Avradeep Bhowmik (University of Texas at Austin), Joydeep Ghosh (University of Texas at Austin), Oluwasanmi Koyejo (Stanford University)
Abstract

Databases in domains such as healthcare are routinely released to the public in aggregated form. Unfortunately, naïve modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. This paper addresses the scenario where features are provided at the individual level, but the target variables are only available as histogram aggregates or order statistics. We consider a limiting case of generalized linear modeling when the target variables are only known up to permutation, and explore how this relates to permutation testing, a standard technique for assessing statistical dependency. Based on this relationship, we propose a simple algorithm to estimate the model parameters and individual level inferences via alternating imputation and standard generalized linear model fitting. Our results suggest the effectiveness of the proposed approach when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship. The framework is extended to general histogram data with larger bins - with order statistics such as the median as a limiting case. Our experimental results on simulated data and aggregated healthcare data suggest a diminishing returns property with respect to the granularity of the histogram - when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms.

Appearing in Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors.

1 Introduction

Modern life is highly data driven. Datasets with records at the individual level are generated every day in large volumes. This creates an opportunity for researchers and policy-makers to analyze the data and examine individual level inferences. However, in many domains, individual records are difficult to obtain. This is particularly true in the healthcare industry, where protecting the privacy of patients restricts public access to much of the sensitive data. Therefore, in many cases, multiple Statistical Disclosure Limitation (SDL) techniques are applied [9]. Of these, data aggregation is the most widely used technique [5].

It is common for agencies to report individual level information for non-sensitive attributes together with aggregated information in the form of sample statistics. Care must be taken in the analysis of such data, as naïve modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. In particular, inferences drawn from aggregated data may lead to the problem of ecological fallacy [23], hence the resulting conclusions at the group level may be misleading to researchers and policy makers interested in individual level inferences. An example that has been cited [21] is the high correlation between per capita consumption of dietary fat and breast cancer in different countries, which may lead to the incorrect conclusion that dietary fat causes breast cancer [7].

Aggregated data in the form of histograms and other sample statistics are becoming more and more common. Further, most of the data that is collected relates to questions for which the respondents have only a few discrete options from which to select their answer. For example, data available from the General Social Survey (GSS) [1] are often in this form. This paper addresses the scenario where features are provided at the individual level, but the target variables are only available as histogram aggregates or order statistics. Despite the prevalence of order-statistic and histogram
aggregated data, to the best of our knowledge, this problem has not been addressed in the literature.

We consider a limiting case of generalized linear modeling when the target variables are only known up to permutation, and explore how this relates to permutation testing [12], a standard technique for assessing statistical dependency. Based on this relationship, we propose a simple algorithm to estimate the model parameters and individual level inferences via alternating imputation and standard generalized linear model fitting. Our results suggest the effectiveness of the proposed approach when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship. The framework is extended to general histogram data with larger bins - with order statistics such as the median as a limiting case. Our experimental results suggest a diminishing returns property - when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms. Our results also suggest caution in the widespread use of aggregation for ensuring the privacy of sensitive data.

In summary, the main contributions of this manuscript are as follows:

(i) we propose a framework for estimating the response variables of a generalized linear model given only a histogram aggregate summary by formulating it as an optimization problem that alternates between imputation and generalized linear model fitting.

(ii) we examine a limiting case of the framework when all the data is known up to permutation. Our examination suggests the effectiveness of the proposed approach when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship.

(iii) we examine a second limiting case where only a few order statistics are provided. Our experimental results suggest a diminishing returns property - when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms.

The proposed approach is applied to the analysis of simulated datasets. In addition, we examine the Texas Inpatient Discharge dataset from the Texas Department of State Health Services [3] and a subset of the 2008-2010 SynPUF dataset [2].

Notation

Matrices are denoted by boldface capital letters, vectors by boldface lower case letters and individual elements of the vector by the same lowercase letter with the boldface removed and the index added as a superscript. $\mathbf{v}^\top$ refers to the transpose of the column vector $\mathbf{v}$. We denote column partitions using semicolons, that is, $\mathbf{M} = [\mathbf{X}; \mathbf{Y}]$ implies that the columns of the submatrices $\mathbf{X}$ and $\mathbf{Y}$ are, in order, the columns of the full matrix $\mathbf{M}$. We use $\|\cdot\|$ to denote the $L_2$ norm for vectors and Frobenius norm for matrices. The vector $\mathbf{v}$ is said to be in increasing order if $v^{(i)} \le v^{(j)}$ whenever $i \le j$, and the set of all such vectors in $\mathbb{R}^n$ is denoted with a subscripted downward pointing arrow as $\mathbb{R}^n_{\downarrow}$. Two vectors $\mathbf{v}$ and $\mathbf{w}$ are said to be isotonic, $\mathbf{v} \sim_{\downarrow} \mathbf{w}$, if $v^{(i)} \ge v^{(j)}$ if and only if $w^{(i)} \ge w^{(j)}$ for all $i, j$.

1.1 Preliminaries and Related Work

Aggregated data is often summarized using a sample statistic, which provides a succinct descriptive summary [26]. Examples of sample statistics include the average, median and various other quantiles. While the mean is still the most common choice, the best choice for summarizing a sample generally depends on the distribution the sample has been generated from. In many cases, the use of histograms [24] or order statistic summaries is much more "natural", e.g. for categorical data, binary data, count valued data, etc.

The problem of imputing individual level records from the sample mean has been studied in [20] and [21] among others. In particular, the paper [21] attempts to reconstruct the individual level matrix by assuming a low rank structure and compares their framework with other approaches which include an extension of the neighborhood model [10] and a variation of ecological regression [13] for the task of imputing individual level records of the response variable. However, these approaches exploit the fact that the sample mean is a linear function of the sample values. Hence, none of these approaches are extendable to non-linear functions such as order statistics and histogram aggregates. To the best of our knowledge, the special case of individual inferences based on sample statistic aggregate data has not been addressed before. Our work proposes a first solution to address this open problem.

Order Statistics: Given a sample of $n$ real valued datapoints, the $\tau$th order statistic of the sample is the $\tau$th smallest value in the sample. For example, the first order statistic is the minimum value of the sample, the $\frac{n}{2}$th order statistic is the median and the $n$th order statistic is the maximum value of the sample. We specifically design a framework which makes it relatively straightforward to work with order statistics.

Histograms: Given a finite sample of $n$ items from a set $C$, a histogram is a partition of the set $C$ into disjoint bins $C_i : \cup_i C_i = C$ and the respective count or
percentage of elements from the sample in each bin. Seen this way, for any sample from $C \subseteq \mathbb{R}$, a histogram is essentially a set of order statistics for that sample. Histograms can sometimes be specified without their boundary values (e.g. "x < 30" as opposed to "0 < x < 30"); this is equivalent to leaving out the first and the $n$th order statistic. Further, a set of sample statistic summaries is easily converted to (and from) a discrete cumulative distribution by identifying the quantile value as the cumulative histogram boundary, and the quantile identity as the height. This cumulative histogram is easily converted to a standard histogram by differencing of adjacent bins. A similar strategy is also applicable to unbounded domains using abstract max and min boundaries of $\pm\infty$. Based on this bijection, we will refer to a histogram as a generalization of the order statistic for the remainder of this manuscript.

Bregman Divergence: Let $\phi : \Theta \mapsto \mathbb{R}$ be a strictly convex, closed function on the domain $\Theta \subseteq \mathbb{R}^m$ which is differentiable on $\mathrm{int}(\Theta)$. Then, the Bregman divergence $D_\phi(\cdot\|\cdot)$ corresponding to the function $\phi$ is defined as

$$D_\phi(\mathbf{y}\|\mathbf{x}) \triangleq \phi(\mathbf{y}) - \phi(\mathbf{x}) - \langle \nabla\phi(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$$

From strict convexity, it follows that $D_\phi(\mathbf{y}\|\mathbf{x}) \ge 0$ and $D_\phi(\mathbf{y}\|\mathbf{x}) = 0$ if and only if $\mathbf{y} = \mathbf{x}$. Bregman divergences are strictly convex in their first argument but not necessarily in their second argument. In this paper we only consider convex functions of the form $\phi(\cdot) : \mathbb{R}^m \ni \mathbf{x} \mapsto \sum_i \phi(x^{(i)})$ that are sums of identical scalar convex functions applied to each component of the vector $\mathbf{x}$. We refer to this class as identically separable (IS). Square loss, Kullback-Leibler (KL) divergence and generalized I-Divergence (GI) are members of this family (Table 1).

Table 1: Examples of Bregman Divergences

$$
\begin{array}{ll}
\phi(\mathbf{x}) & D_\phi(\mathbf{y}\|\mathbf{x}) \\
\hline
\frac{1}{2}\|\mathbf{x}\|^2 & \frac{1}{2}\|\mathbf{y} - \mathbf{x}\|^2 \\
\sum_i x^{(i)} \log x^{(i)}, \quad \mathbf{x} \in \text{Prob. Simplex} & \mathrm{KL}(\mathbf{y}\|\mathbf{x}) = \sum_i y^{(i)} \log\left(\frac{y^{(i)}}{x^{(i)}}\right) \\
\sum_i \left( x^{(i)} \log x^{(i)} - x^{(i)} \right), \quad \mathbf{x} \in \mathbb{R}^n_+ & \mathrm{GI}(\mathbf{y}\|\mathbf{x}) = \sum_i \left( y^{(i)} \log\left(\frac{y^{(i)}}{x^{(i)}}\right) - y^{(i)} + x^{(i)} \right)
\end{array}
$$

Generalized Linear Models: Least squares regression is useful for modeling continuous real valued data generated from a Gaussian distribution, but this is not always a valid assumption. In many cases, the data of interest may be binary valued or count valued. A generalized linear model (GLM) [18] is a generalization of linear regression that subsumes various models like Poisson regression, logistic regression, etc. as special cases (see [17] or [16] for a detailed discussion on GLMs). A generalized linear model assumes that the response variables $y$ are generated from a distribution in the exponential family with the mean parameter related via a link function to a linear function of the predictor $\mathbf{x}$. The model therefore is specified completely by a distribution $P_\phi(\cdot \mid \beta)$ from the exponential family, a linear predictor $\eta = \mathbf{x}\beta$, and a link function $(\nabla\phi)^{-1}(\cdot)$ which connects the expectation parameter of the response variable to the predictor variables as $E(y) = (\nabla\phi)^{-1}(\mathbf{x}\beta)$.

As explored in great detail in [6], Bregman divergences have a very close relationship with generalized linear models. In particular, maximum likelihood parameter estimation for a generalized linear model is equivalent to minimizing a corresponding Bregman divergence. For example, maximum likelihood for a Gaussian corresponds to squared loss, for Poisson the corresponding divergence is generalized I-divergence and for Binomial, the corresponding divergence is the KL divergence (see [6] for details). GLMs have been successfully applied in a wide variety of fields including machine learning, biological surveys [19], image segmentation and reconstruction [22], analysis of medical trials [8], studying species-environment relationships in ecological sciences [15], virology [11] and estimating mortality from infectious diseases [14], among many others, and are widely prized for the interpretability of their results and the extendability of their methods in a plethora of domain specific variations [25]. They are easy to use and implement and many off-the-shelf software packages are available for most major programming platforms.

2 Problem Description

Consider a set of fully observed covariates $\mathbf{X} = [\mathbf{x}_1; \mathbf{x}_2; \cdots; \mathbf{x}_{d-p}] \in \mathbb{R}^{n \times (d-p)}$, and columns of response variables, $\mathbf{Z} = [\mathbf{z}_1; \mathbf{z}_2; \cdots; \mathbf{z}_p] \in \mathbb{R}^{n \times p}$, which are known only up to the respective histograms of their values (i.e., up to order statistics). We assume that each element of $\mathbf{z}_i$ has been generated from covariates $\mathbf{X}$ according to some generalized linear model with parameters $\beta_i$. The objective is to estimate the $\beta_i$ together with $\mathbf{Z} = [\mathbf{z}_1; \mathbf{z}_2; \cdots; \mathbf{z}_p]$ subject to the given order statistic constraints. Since maximum likelihood estimation in a generalized linear model is equivalent to minimizing a corresponding Bregman divergence, we choose the loss function $L(\mathbf{Z}, \beta) = D_\phi\left(\mathbf{Z} \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta)\right)$ to be minimized over
the variables $\mathbf{Z}, \beta$ while satisfying order statistics constraints on $\mathbf{Z}$.

Without additional structure, the regression problem for each column can be solved independently, therefore without loss of generality we assume $\mathbf{Z}$ is a single column $\mathbf{z}$. We denote the $\tau_i$th order statistic of $\mathbf{z}$ as $s_{\tau_i}$, with $\tau_i \in \{\tau_1, \tau_2, \cdots, \tau_h\} \subseteq [n]$, which is the set of $h$ order statistics specified via the histogram. For simplicity, in the following section we consider estimation under a single order statistic which has been computed over the entire column. We extend it subsequently to the more general case of multiple order statistics computed over disjoint partitions.

Therefore, with Frobenius regularization terms $R(\beta) = \lambda\|\beta\|^2$, the overall problem statement boils down to the following optimization problem:

$$
\begin{aligned}
\min_{\mathbf{z}, \beta} \quad & D_\phi\left(\mathbf{z} \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta)\right) + \lambda\|\beta\|^2 \\
\text{s.t.} \quad & \tau_i\text{th order statistic of } \mathbf{z} = s_{\tau_i}
\end{aligned}
\tag{1}
$$

2.1 Estimation under a Single Order Statistic constraint

Estimating under order statistics constraints is in general a highly non-trivial problem. It is easy to see that the set of vectors with a given order statistic is not a convex set. Therefore, the above optimization problem looks especially difficult to even represent in a concise manner in terms of $\mathbf{z}$. However, it turns out that with the following reformulation, the analysis of the problem becomes much more manageable.

We rewrite $\mathbf{z} = \mathbf{P}\mathbf{y}$ where $\mathbf{P} \in \mathcal{P}$ is a permutation matrix and $\mathbf{y}$ is a vector sorted in increasing order. Note the following:

(i) For a $\mathbf{y} \in \mathbb{R}^n_{\downarrow}$, if $\mathbf{e}^{\tau_i}$ is a row vector with 1 in the $\tau_i$th index and 0 everywhere else, then $\mathbf{e}^{\tau_i}\mathbf{y}$ represents the $\tau_i$th order statistic of $\mathbf{y}$. Since permutation does not change the value of order statistics, this is also the $\tau_i$th order statistic of $\mathbf{z}$.

(ii) If $\Lambda$ is the matrix with $\Lambda_{j,j+1} = -1$, $\Lambda_{j,j} = 1$ and $\Lambda_{j,k} = 0$ for all other $j, k$ (that is, whenever $k - j \notin \{0, 1\}$), the condition that $\mathbf{y}$ is sorted in increasing order is equivalent to the linear constraint $\Lambda\mathbf{y} \le 0$.

Putting all this together, the optimization problem (1) becomes the following

$$
\begin{aligned}
\min_{\mathbf{P}, \mathbf{y}, \beta} \quad & D_\phi\left(\mathbf{P}\mathbf{y} \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta)\right) + R(\beta) \\
\text{s.t.} \quad & \mathbf{e}^{\tau_i}\mathbf{y} = s_{\tau_i}, \quad \Lambda\mathbf{y} \le 0, \quad \mathbf{P} \in \mathcal{P}
\end{aligned}
\tag{2}
$$

The above optimization problem is jointly convex in $\mathbf{y}$ and $\beta$ for a fixed $\mathbf{P}$, but the presence of $\mathbf{P}$ as a variable makes the problem much more complicated. Therefore, we attempt to solve it iteratively for each variable in an alternating minimization framework. The update steps consist of the following for each timestep:

(i) $\beta_t = \arg\min_{\beta} D_\phi\left(\mathbf{P}_{t-1}\mathbf{y}_{t-1} \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta)\right) + R(\beta)$

(ii) $\mathbf{y}_t = \arg\min_{\mathbf{y}} D_\phi\left(\mathbf{P}_{t-1}\mathbf{y} \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta_t)\right)$ such that $\Lambda\mathbf{y} \le 0$ and $\mathbf{e}^{\tau_i}\mathbf{y} = s_{\tau_i}$

(iii) $\mathbf{P}_t = \arg\min_{\mathbf{P} \in \mathcal{P}} D_\phi\left(\mathbf{P}\mathbf{y}_t \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta_t)\right)$

Step (i) is a standard generalized linear model parameter estimation problem. This problem has been studied in great detail in the literature and a variety of off-the-shelf GLM solvers can be used for this. We focus instead on steps (ii) and (iii), which are much more interesting.

For (ii), note that since we assumed that $\phi$ is identically separable, the same permutation applied to both arguments of the corresponding Bregman divergence $D_\phi(\cdot\|\cdot)$ does not change its value. For any constraint set $C$, we have

$$\arg\min_{\mathbf{y} \in C} D_\phi\left(\mathbf{P}\mathbf{y} \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta_t)\right) = \arg\min_{\mathbf{y} \in C} D_\phi\left(\mathbf{y} \,\|\, \mathbf{P}^{-1}(\nabla\phi)^{-1}(\mathbf{X}\beta_t)\right)$$

given a fixed $\mathbf{P}, \mathbf{X}, \beta$ (note that for a permutation matrix $\mathbf{P}$, $\mathbf{P}^{-1} = \mathbf{P}^\top$). Following this fact, step (ii) is a convex optimization problem in $\mathbf{y}$ and can be solved very easily.

Step (iii) is a non-convex optimization problem in general. However, for an identically separable Bregman divergence it turns out that the solution to this is remarkably simple.

Lemma 1. The (set of) optimal permutation(s) in step (iii) above is given by

$$\arg\min_{\mathbf{P} \in \mathcal{P}} D_\phi\left(\mathbf{P}\mathbf{y}_t \,\|\, (\nabla\phi)^{-1}(\mathbf{X}\beta_t)\right) = \left\{ \hat{\mathbf{P}} : \hat{\mathbf{P}}\mathbf{y}_t \sim_{\downarrow} (\nabla\phi)^{-1}(\mathbf{X}\beta_t) \right\}$$

In other words, the optimal permutation is the one which makes $\mathbf{y}_t$ isotonic with $(\nabla\phi)^{-1}(\mathbf{X}\beta_t)$. Note that the optimal permutation is not unique if $(\nabla\phi)^{-1}(\mathbf{X}\beta_t)$ is not totally ordered. This is a direct application of the following result, which appeared as Lemma 3 in the paper [4].

Lemma 2. If $x_1 \ge x_2$ and $y_1 \ge y_2$ and $\phi(\cdot)$ is identically separable, then

$$D_\phi\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \Big\| \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right) \le D_\phi\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \Big\| \begin{pmatrix} y_2 \\ y_1 \end{pmatrix}\right), \quad \text{and} \quad D_\phi\left(\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \Big\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) \le D_\phi\left(\begin{pmatrix} y_2 \\ y_1 \end{pmatrix} \Big\| \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right)$$

2.1.1 Solution in terms of z

Lemmata 1 and 2 suggest that we can optimize jointly over $\mathbf{P}$ and $\mathbf{y}$ instead of separately, since for any $\mathbf{y}$ we
already know the optimal P. Combining the optimization steps (ii) and (iii) in terms of P and y, our update step for z in the original optimization problem is the following

step. As the cost function is bounded below by 0, the algorithm converges to a stationary point. We now extend the framework to include histogram constraints and blockwise partitioning.
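For the Gaussian case (squared loss, identity link), the alternating scheme of Section 2.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `objective` and `fit_single_order_statistic`, the random feasible initialisation, and the closed-form clipping used for the y update are all choices made for this sketch. The clipping relies on the observation that, once the fitted values are sorted, clipping them at the given order statistic on either side of position `tau` already yields an increasing vector, so it solves the constrained least squares problem in step (ii). The index `tau` is 0-based here.

```python
import numpy as np


def objective(z, X, beta, lam):
    """Squared-loss Bregman divergence 0.5*||z - X beta||^2 plus ridge penalty."""
    r = z - X @ beta
    return 0.5 * (r @ r) + lam * (beta @ beta)


def fit_single_order_statistic(X, tau, s, lam=1e-3, n_iter=50, seed=0):
    """Alternate between ridge fitting and imputing z under one order statistic.

    tau : 0-based position in the sorted targets whose value is known.
    s   : the known value of that order statistic.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)

    # Assumed initialisation: random targets shifted so the constraint holds.
    z = rng.normal(size=n)
    z = z - np.sort(z)[tau] + s

    history = []
    for _ in range(n_iter):
        # Step (i): GLM fit; for squared loss this is ridge regression.
        beta = np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(d), X.T @ z)
        mu = X @ beta

        # Step (iii), via Lemma 1: the best permutation makes the imputed
        # targets isotonic with the fitted values, so sort the fitted values.
        order = np.argsort(mu, kind="stable")
        m = mu[order]

        # Step (ii): closest increasing vector to m with y[tau] == s.
        y = m.copy()
        y[:tau] = np.minimum(m[:tau], s)
        y[tau] = s
        y[tau + 1:] = np.maximum(m[tau + 1:], s)

        # Undo the sort: z = P y.
        z = np.empty(n)
        z[order] = y
        history.append(objective(z, X, beta, lam))

    return z, beta, history
```

Because both the β update and the combined (P, y) update minimize the same objective exactly, the recorded objective values are non-increasing, in line with the convergence argument above. For other members of the exponential family, step (i) would call a GLM solver and step (ii) would need a convex solver in place of the clipping rule.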