<<

Generalized Linear Models for Aggregated

Avradeep Bhowmik Joydeep Ghosh Oluwasanmi Koyejo University of Texas at Austin University of Texas at Austin Stanford University [email protected] [email protected] [email protected]

Abstract 1 Introduction

Modern life is highly data driven. Datasets with Databases in domains such as healthcare are records at the individual level are generated every routinely released to the public in aggregated day in large volumes. This creates an opportunity form. Unfortunately, na¨ıve modeling with for researchers and policy-makers to analyze the data aggregated data may significantly diminish and examine individual level inferences. However, in the accuracy of inferences at the individ- many domains, individual records are difficult to ob- ual level. This paper addresses the scenario tain. This particularly true in the healthcare industry where features are provided at the individual where protecting the privacy of patients restricts pub- level, but the target variables are only avail- lic access to much of the sensitive data. Therefore, able as aggregates or order statis- in many cases, multiple Statistical Disclosure Limita- tics. We consider a limiting case of gener- tion (SDL) techniques are applied [9]. Of these, data alized linear modeling when the target vari- aggregation is the most widely used technique [5]. ables are only known up to permutation, and explore how this relates to permutation test- It is common for agencies to report both individual ing; a standard technique for assessing statis- level information for non-sensitive attributes together tical dependency. Based on this relationship, with the aggregated information in the form of sample we propose a simple algorithm to estimate . Care must be taken in the analysis of such the model parameters and individual level in- data, as na¨ıve modeling with aggregated data may sig- ferences via alternating imputation and stan- nificantly diminish the accuracy of inferences at the dard fitting. Our re- individual level. In particular, inferences drawn from sults suggest the effectiveness of the proposed aggregated data may lead to the problem of ecologi- approach when, in the original data, permu- cal fallacy [23], hence the resulting conclusions at the tation testing accurately ascertains the verac- group level may be misleading to researchers and pol- ity of the linear relationship. The framework icy makers interested in individual level inferences. An is extended to general histogram data with example that has been cited [21] is the high correla- larger bins - with order statistics such as the tion between per capita consumption of dietary fat and as a limiting case. Our experimen- breast cancer in different countries, which may lead to tal results on simulated data and aggregated the incorrect conclusion that dietary fat causes breast healthcare data suggest a diminishing returns cancer [7]. property with respect to the granularity of Aggregated data in the form of and other the histogram - when a linear relationship sample statistics are becoming more and more com- holds in the original data, the targets can be mon. Further, most of the data that is collected relates predicted accurately given relatively coarse to questions for which the respondents have only a few histograms. discrete options from which to select their answer. For example, data available from the Generalized Social Survey (GSS) [1] are often in this form. This paper Appearing in Proceedings of the 18th International Con- addresses the scenario where features are provided at ference on Artificial Intelligence and Statistics (AISTATS) the individual level, but the target variables are only 2015, San Diego, CA, USA. JMLR: W&CP volume 38. available as histogram aggregates or order statistics. Copyright 2015 by the authors. Despite the prevalence of order- and histogram

93 Generalized Linear Models for Aggregated Data aggregated data, to the best of our knowledge, this ments of the vector by the same lowercase letter with problem has not been addressed in the literature. the boldface removed and the index added as a super- script. v> refers to the transpose of the column vector We consider a limiting case of generalized linear mod- v. We denote column partitions using semicolons, that eling when the target variables are only known up to is, M = [X; Y] implies that the columns of the subma- permutation, and explore how this relates to permu- trices X and Y are, in order, the columns of the full tation testing [12]; a standard technique for assessing matrix M. We use k · k to denote the L norm for vec- statistical dependency. Based on this relationship, we 2 tors and Frobenius norm for matrices. The vector v is propose a simple algorithm to estimate the model pa- said to be in increasing order if v(i) ≤ v(j) whenever rameters and individual level inferences via alternat- i ≤ j, and the set of all such vectors in n is denoted ing imputation and standard generalized linear model R with a subscripted downward pointing arrow as n. fitting. Our results suggest the effectiveness of the R↓ Two vectors v and w are said to be isotonic, v ∼ w, proposed approach when, in the original data, permu- ↓ if v(i) ≥ v(j) if and only if w(i) ≥ w(j) for all i, j. tation testing accurately ascertains the veracity of the linear relationship. The framework is extended to gen- eral histogram data with larger bins - with order statis- 1.1 Preliminaries and Related Work tics such as the median as a limiting case. Our exper- imental results suggest a diminishing returns property Aggregated data is often summarized using a sample - when a linear relationship holds in the original data, statistic, which provides a succinct descriptive sum- the targets can be predicted accurately given relatively mary [26]. Examples of sample statistics include the coarse histograms. Our results also suggest caution in average, median and various other quantiles. While in the widespread use of aggregation for ensuring the the is still the most common choice, the best privacy of sensitive data. choice for summarizing a sample generally depends on the distribution the sample has been generated from. In summary, the main contributions of this manuscript In many cases, the use of histograms [24] or order are as follows: statistic summaries is much more “natural” e.g. for categorical data, binary data, count valued data, etc. (i) we propose a framework for estimating the re- sponse variables of a generalized linear model The problem of imputing individual level records from given only a histogram aggregate summary by the sample mean has been studied in [20] and [21] formulating it as an optimization problem that among others. In particular, the paper [21] attempts alternates between imputation and generalized to reconstruct the individual level matrix by assuming linear model fitting. a low rank structure and compares their framework with other approaches which include an extension of (ii) we examine a limiting case of the framework the neighborhood model [10] and a variation of ecolog- when all the data is known up to permutation. ical regression [13] for the task of imputing individual Our examination suggests the effectiveness of the level records of the response variable. However, these proposed approach when, in the original data, approaches exploit the fact that sample mean is a lin- permutation testing accurately ascertains the ve- ear function of the sample values. Hence, none of these racity of the linear relationship. approaches are extendable to non-linear functions such (iii) we examine a second limiting case where only a as order statistics and histogram aggregates. To the few order statistics are provided. Our experimen- best of our knowledge, the special case of individual in- tal results suggest a diminishing returns property ferences based on sample statistic aggregate data has - when a linear relationship holds in the origi- not been addressed before. Our work proposes a first nal data, the targets can be predicted accurately solution to address this open problem. given relatively coarse histograms. Order Statistics: Given a sample of n real valued datapoints, the τ th of the sample is the The proposed approach is applied to the analysis of τ th smallest value in the sample. For example, the simulated datasets. In addition, we examine the Texas first order statistic is the minimum value of the sam- Inpatient Discharge dataset from the Texas Depart- n th th ple, the 2 order statistic is the median and the n ment of State Health Services [3] and a subset of the order statistic is the maximum value of the sample. 2008-2010 SynPUF dataset [2]. We specifically design a framework which makes it rel- atively straightforward to work with order statistics. Notation Histograms : Given a finite sample of n items from Matrices are denoted by boldface capital letters, vec- a set C, a histogram is a partition of the set C into tors by boldface lower case letters and individual ele- disjoint bins Ci : ∪iCi = C and the respective count or

94 Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo percentage of elements from the sample in each bin. A generalized linear model (GLM) [18] is a generaliza- Seen this way, for any sample from C ⊆ R, a his- tion of that subsumes various models togram is essentially a set of order statistics for that like , logistic regression, etc. as spe- sample. Histograms can sometimes be specified with- cial cases1. A generalized linear model assumes that out their boundary values (eg. ”x < 30” as opposed the response variables, y are generated from a distribu- to ”0 < x < 30”)- this is equivalent to leaving out tion in the with the mean param- the first and the nth order statistic. Further, a set eter related via a link function to a linear function of of sample statistic summaries are easily converted to the predictor x. The model therefore is specified com- (and from) a discrete cumulative distribution by iden- pletely by a distribution Pφ(· | β) from the exponential tifying the quantile value as the cumulative histogram family, a linear predictor η = xβ, and a link function boundary, and the quantile identity as the height. This (∇φ)−1(·) which connects the expectation parameter cumulative histogram is easily converted to a standard of the response variable to the predictor variables as histogram by differencing of adjacent bins. A similar E(y) = (∇φ)−1(xβ). strategy is also applicable to unbounded domains us- As explored in great detail in [6], Bregman Divergences ing abstract max and min boundaries of ±∞. Based have a very close relationship with generalized linear on this bijection, we we will refer to a histogram as a models. In particular, maximum likelihood parame- generalization of the order statistic for the remainder ter estimation for a generalized linear model is equiv- of this manuscript. alent to minimizing a corresponding Bregman diver- Bregman : Let φ :Θ 7→ R be a strictly gence. For example, maximum likelihood for a Gaus- convex, closed function on the domain Θ ⊆ Rm which sian corresponds to squares loss, for Poisson the cor- is differentiable on int(Θ). Then, the Bregman diver- responding divergence is generalized I-divergence and gence Dφ(·k·) corresponding to the function φ is de- for Binomial, the corresponding divergence is the KL fined as divergence (see [6] for details). GLMs have been suc- cessfully applied in a wide variety of fields including Dφ(ykx) , φ(y) − φ(x) − h∇φ(x), y − xi machine learning , biological surveys [19], image seg- mentation and reconstruction [22], analysis of medical From strict convexity, it follows that Dφ(ykx) ≥ 0 trials [8], studying species-environment relationships and Dφ(ykx) = 0 if and only if y = x. Bregman di- in ecological sciences [15], virology [11] and estimating vergences are strictly convex in their first argument mortality from infectious diseases [14], among many but not necessarily in their second argument. In this others, and are widely prized for the interpretability paper we only consider convex functions of the form of their results and the extendability of their methods m P (i) φ(·): R 3 x 7→ i φ(x ) that are sums of identical in a plethora of domain specific variations [25]. They scalar convex functions applied to each component of are easy to use and implement and many off-the-shelf the vector x. We refer to this class as identically sep- software packages are available for most major pro- arable (IS). Square loss, Kullback-Leibler (KL) diver- gramming platforms. gence and generalized I-Divergence (GI) are members of this family (Table 1). 2 Problem Description Table 1: Examples of Bregman Divergences Consider a set of fully observed covariates n×(d−p) φ(x) Dφ(ykx) X = [x1; x2; ··· ; xd−p] ∈ R , and columns of response variables, Z = [z ; z ; ··· z ] ∈ n×p, 1 2 1 2 1 2 p R 2 kxk 2 ky − xk which are only known only up to the respective P (i) (i) histograms of their values (i.e., up to order statistics). i(x log x ) KL(ykx) =  (i)  P (i) y We assume that each element of zi has been gener- x ∈ Prob. Simplex i y log( (i) ) x ated from covariates X according to some generalized P x(i) log x(i) − x(i) GI(ykx) = i linear model with parameters βi. The objective is (i) n P (i) y (i) (i) to estimate the βi together with Z = [z1; z2; ··· zp] x ∈ R+ y log( (i) ) − y + x i x subject to the given order statistic constraints. Since maximum likelihood estimation in a generalized lin- Generalized Linear Models: While least squares ear model is equivalent to minimizing a correspond- regression is useful for modeling continuous real valued ing Bregman divergence, we choose the −1  data generated from a Gaussian distribution. This is L(Z, β) = Dφ Zk(∇φ) (Xβ) to be minimized over not always a valid assumption. In many cases, the data of interest may be binary valued or count valued. 1see [17] or [16] for a detailed discussion on GLMs

95 Generalized Linear Models for Aggregated Data the variables Z, β while satisfying order statistics con- makes the problem much more complicated. There- straints on Z. fore, we attempt to solve it iteratively for each vari- able in an alternating minimization framework. The Without additional structure, the regression problem update steps consist of the following for each timestep: for each column can be solved independently, there- fore without loss of generality we assume Z is a single −1  th (i) βt = argmin Dφ Pt−1yt−1k(∇φ) (Xβ) + R(β) column z. We denote the τi order statistic of z as β sτ , with τi ∈ {τ1, τ2, ··· τh} ⊆ [n], which is the set of i −1  h order statistics specified via the histogram. For sim- (ii) yt = argminDφ Pt−1yk(∇φ) (Xβt) such y plicity, in the following section we consider estimation τ that Λy ≤ 0 and e i y = sτ under a single order statistic which has been computed i −1  over the entire column. We extend it subsequently to (iii) Pt = argmin Dφ Pytk(∇φ) (Xβt) the more general case of multiple order statistics com- P∈P puted over disjoint partitions. Step (i) is a standard generalized linear model param- Therefore, with Frobenius regularization terms eter estimation problem. This problem has been stud- R(β) = λkβk2, the overall problem statement boils ied in great detail in literature and a variety of off- down to the following optimization problem: the-shelf GLM solvers can be used for this. We focus instead on steps (ii) and (iii) which are much more min D zk(∇φ)−1(Xβ) + λkβk2 φ interesting. z,β (1) th s.t. τi order statistic of zi = sτi For (ii), note that since we assumed that φ is identi- cally separable, the same permutation applied to both 2.1 Estimation under a Single Order arguments of the corresponding Bregman divergence Statistic constraint Dφ(·k·) does not change its value. For any constraint −1  set C, we have argmin Dφ Pyk(∇φ) (Xβt) = Estimating under order statistics constraints is in gen- y∈C −1 −1  2 eral a highly non-trivial problem. It is easy to see argmin Dφ ykP (∇φ) (Xβt) given a fixed that the set of vectors with a given order statistic is y∈C P, X, β. Following this fact, step (ii) is a convex opti- not a convex set. Therefore, the above optimization mization problem in y and can be solved very easily. problem looks especially difficult to even represent in a concise manner in terms of z. However, it turns out Step (iii) is a non-convex optimization problem in gen- that with the following reformulation, the analysis of eral. However, for an identically separable Bregman the problem becomes much more manageable. divergence it turns out that the solution to this is re- markably simple. We rewrite z = Py where P ∈ P is a permutation matrix and y is a vector sorted in increasing order. Lemma 1. The (set of ) optimal permutation(s) in Note the following- step (iv) above is given by- −1  ˆ ˆ −1 argmin Dφ Pytk(∇φ) (Xβt) = P : Pyt ∼↓ (∇φ) (Xβt) P∈ n τi P (i) For a y ∈ R↓ , if e is a row vector with 1 in the th τ τi index and 0 everywhere else, then e i y repre- In other words, the optimal permutation is the th −1 sents the τi order statistic of y. Since permuta- one which makes yi,t isotonic with (∇φ) (Xβt). tion does not change the value of order statistics, Note that the optimal permutation is not unique if th −1 this is also the τi order statistic of z (∇φ) (Xβt) is not totally ordered. This is a direct application of the following result which appeared as (ii) If Λ is the matrix with Λj,j+1 = −1, Λj,j = 1 Lemma 3 in the paper [4]. and Λ = 0 for all other j, k :(k − j) 6= 0, ±1, j,k Lemma 2. If x ≥ x and y ≥ y and φ(·) is iden- the condition that y is sorted in increasing order 1 2 1 2 tically separable, then is equivalent to the linear constraint Λy ≤ 0.

 x1   y1   x1   y2  Dφ( x k y ) ≤ Dφ( x k y ), and Putting all this together, the optimization problem (1) 2 2 2 1  y1   x1   y2   x1  becomes the following Dφ( y2 k x2 ) ≤ Dφ( y1 k x2 ) −1  min Dφ Pyk(∇φ) (Xβ) + R(β) P,y,β (2) 2.1.1 Solution in terms of z s.t. eτi y = s , Λy ≥ 0, P ∈ τi P Lemmata 1 and 2 suggest that we can optimize jointly over P and y instead of separately, since for any y we The above optimization problem is jointly convex in y and β for a fixed P, but the presence of P as a variable 2note that for a permutation matrix P, P−1 = P>

96 Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo already know the optimal P. Combining the optimiza- step. As the cost function is bounded below by 0, the tion steps (ii) and (iii) in terms of P and y, our update algorithm converges to a stationary point. We now ex- step for z in the original optimization problem is the tend the framework to include histogram constraints following and blockwise partitioning.

−1  ˆzt = argmin Dφ zk(∇φ) (Xβt) (3) z 2.2 Histogram Constraints th s.t. τ order statistic of z = sτ i i In case there are multiple order statistics constraints (histogram), the solution can be obtained by repeated It is not immediately obvious how to approach the application of equation (5). solution to this since the constraint set for z is not th convex. However, note that as a result of Lemma Suppose for the column z we have constraints as τi

2 it is clear that given a fixed X and βt if ˆzt is a order statistic of z = sτi for τi ∈ {τ1, τ2, ··· τh} ⊆ solution to the subproblem (3), we must have ˆzt ∼↓ {1, 2, ··· n}, the solution is given by the following- −1 (∇φ) (Xβt). (j) (j) Therefore, instead of searching over the set of all vec- 1. For all j < τ1,z ˆ = min(Γi , sτ1 ); similarly, for tors in n, it is sufficient to search only in the subset of (j) (j) R all j > τh,z ˆ = max(Γi , sτh ) −1 vectors that are isotonic with (∇φ) (Xβt). It turns out that not only is this set convex given a fixed X, βt, 2. For all 1 ≤ k < h, and j : τk ≤ j ≤ τk+1, the solution for zt is readily available in closed form.  −1 sτ j = τk Let Γt =(∇φ) (Xβt). Since the Bregman Diver-  k (j)  gence is IS, without loss of generality we can assume zˆ = sτk+1 j = τk+1   (j)  that Γt is in increasing order, therefore the constraint min sτ , max(Γ , sτ ) τk ≤ j ≤ τk+1 n  k+1 i k set for z becomes z ∈ R↓ and order statistics con- τi straints for z becomes the linear constraint e z = sτi . The proof for this follows in an identical manner to the Therefore, the optimization problem (3) over z is proof for the non-partitioned case earlier. As above, equivalent, up to a simple re-permutation step, to the the updated zt can be obtained by re-permuting ˆz to following preserve isotonicity with (∇φ)−1(Xβ). For a fully ob- min Dφ(zkΓt) z (4) served histogram, the update for z only involves a per- n τi mutation at each step. s.t. z ∈ R↓ , e z = sτi Lemma 3. Let ˆz be the solution to the optimization problem (4). Then, ˆz is given by- 2.3 Blockwise Order Statistic Constraints  s j = τ In the setup where the order statistics (or histograms)  τi i (j)  (j) are computed over blockwise partitions of the sample, zˆ = max(Γ , s ) j > τ (5) t t τi i the permutation matrix is a blockwise permutation min(Γ(j), s ) j < τ t τi i matrix and the isotonicity constraint is a blockwise isotonicity costraint. Sketch of Proof In the space of all z ordered in in- creasing order, the τ th order statistic constraint simply Since the Bregman Divergence is identically separa- i ble, the update for z separates out into independent becomesz ˆ(j) < s for j < τ and vice versa for j > τ . t τi i i updates for every block which can be done in a manner Suppose we were to optimize over all space instead of identical to that given by Lemma 3. The update step n- because the Bregman divergence is identically sep- R↓ for β remains unchanged. arable, the optimization problem separates out over (j) (j) different coordinates j asz ˆt = arg minz Dφ(zkΓt ) 3 such that z < (>)sτi for j < (>)τi. This is a uni- dimensional convex optimization problem the solution to which is given by equation (5) above. We provide experimental results using both simulated data and real data. Error for each generalized lin- Finally we note that ˆz , automatically lies in n since t R↓ ear model is defined as the corresponding Bregman Γ ∈ n, and hence, is also the solution to the optimi- t R↓ divergence (square loss for Gaussian, generalized I- sation problem (4).  divergence for Poisson, etc. see [6]) between the true Now, note that since we are performing iterative min- and recovered targets. The average errors for each imization, the cost function is non-increasing at every model is shown separately.

97 Generalized Linear Models for Aggregated Data

(a) Poisson Fit Error (b) Gaussian Fit Error (c) Binomial Fit Error

Figure 1: Permutation tests under Poisson, Gaussian and Binomial Estimation for 2, 5, 25 bins (top left, top right, bottom left) and ”No Relationship” (bottom right)

(a) Poisson Training Error (b) Gaussian Training Error (c) Binomial Training Error

Figure 2: Training Error under Poisson, Gaussian and Binomial Estimation

(a) Poisson Test Error (b) Gaussian Test Error (c) Binomial Test Error

Figure 3: Test Set Error under Poisson, Gaussian and Binomial Estimation 3.1 Simulated Data algorithm performs with respect to the fit by a general- ized linear model which knows the values of the target We randomly generate different sets of real valued pre- variables but permutes the target variables randomly dictor variables and parameters, and use the corre- for estimation. We perform the randomized permuta- sponding exponential family to generate their respec- tions multiple times and plot a histogram of the fitting tive response variables. We compute histograms for errors thus obtained and see how the results from our the response variables thus generated with varying algorithm compares to the histogram (Figure 1). The number of bins and test our algorithm for each case. black bar is the error obtained by our framework, the We perform the experiments for three different models red histogram is the histogram of errors obtained by - Gaussian, Poisson and Binomial. fitting after randomly permuting the targets. The blue We perform a basic permutation test3 to show how our histogram is the histogram of errors obtained by fit- ting a model where there is no relationship between 3Refer to [12] for more details on permutation tests the target variable and the covariate, the cyan bar is

98 Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo

(a) Training Set Error (b) Test Set Error

Figure 4: Performance on SynPUF dataset

(a) Training Set Error (b) Test Set Error

Figure 5: Performance on Texas Inpatient Discharge dataset

(a) Recovered Histogram of DE-SynPUF Data (b) Recovered Histogram of Texas Discharged Data

Figure 6: Recovered Histograms of both datasets (true histograms on the left) the result of our framework applied to this data with a increase) error is lower i.e. black bar shifts towards histogram of 5 bins (histograms of other granularities left. perform similarly). Our test successfully rejects the We plot the average fitting and predictive performance null hypothesis of “no relationship when the the black of our algorithm with increasing number of bins over bar is to the left of the red histogram. Figure 1 shows five fold cross validation. We compare our results with that as histogram becomes finer (i.e number of bins the results obtained with the best possible GLM esti-

99 Generalized Linear Models for Aggregated Data mator which observes the full dataset (Figures 2 and variables like length of stay. 3)4. It can be seen in each case that as the histogram Following [21], we perform a log transform on the hos- of targets becomes finer (i.e., more bins) the error de- pital charges and length of stay before applying a Pois- creases but with a diminishing returns property with son regression model. We compare the performance respect to the coarseness of the histogram. of our algorithm over five-fold cross-validation with the best possible Poisson estimator which estimates in 3.2 DE-SynPUF dataset a fully observed scenario with an uncensored dataset (Figure 5). The plot shows that the performance of The CMS Beneficiary Summary DE-SynPUF dataset our framework improves with increasingly finer gran- is a public use dataset created by the Centers for Medi- ularity of histograms and approaches the performance care and Medicaid Services by applying different sta- of the best Poisson estimator. Finally, we compare the tistical disclosure limitation techniques to real benefi- histogram recovered by our framework with the true ciary claims data in a way so as to very closely resemble histogram for the dataset (Figure 6b). real Medicare data. It is often used for testing different data mining or statistical inferential methods before getting access to real Medicare data. We use a sub- 4 Conclusion and Future Work set of the DE-SynPUF dataset for a single state from the year 2008. With some trimming of datapoints (eg, This paper addresses the scenario where features are we do not take into account deceased beneficiaries) we provided at the individual level, but the target vari- model outpatient institutional annual primary payer ables are only available as histogram aggregates or or- reimbursement amount (PPPYMT-OP) with a num- der statistics. We proposed a simple algorithm to esti- ber of available predictor variables including age, race, mate the model parameters and individual level infer- sex, duration of coverage, presence/absence of a vari- ences via alternating imputation and standard gener- ety of chronic conditions, etc. alized linear model fitting. We considered two limiting We perform a log transform and compute histograms cases. In the first, the target variables are only known of varying granularity on the target variables. We use up to permutation. Our results suggest the effective- a Gaussian model for our estimation and evaluate the ness of the proposed approach when, in the original average performance of our algorithm over five fold data, permutation testing accurately ascertains the ve- cross validation in fitting both the training and test racity of the linear relationship. The framework was data sample points, comparing with the best possible then extended to general histogram data with larger Gaussian estimator which performs the estimation by bins - with order statistics such as the median as a observing the full dataset (Figures 4). As seen in the second limiting case. Experimental results on simu- plot, the performance of our framework improves as lated data and real healthcare data show the effective- the histogram of targets becomes finer in granularity ness of the proposed approach which may have impli- and approaches the performance of the best Gaussian cations on using aggregation as a of preserving estimator. We also compare the histogram of target privacy. For future work, we plan a more detailed anal- variables as recovered by our framework with the true ysis to better understand the properties and limits of histogram (Figure 6a). the framework given binned histogram data. We also plan to extend the approach to non-linear modeling. 3.3 Texas Inpatient Discharge dataset Acknowledgements We then test our algorithm on the Texas Inpatient Dis- charge dataset from the Texas Department of State Authors acknowledge support from NSF grant IIS Health Services [3] used in [21]. As with the simu- 1421729. lated data, we use histograms of varying granularity on the respective response variables and evaluate the References average performance in fitting both the training and test data sample points over five fold cross validation. [1] General Social Survey, NORC. http://www3. We use hospital billing records from the fourth quarter norc.org/GSS+Website/. of 2006 in the Texas Inpatient Discharge dataset and [2] Medicare Claims Synthetic Public Use Files regress it on the available individual level predictor (SynPUFs), Centers for Medicare and variables including binary variables race and sex, cat- Medicaid Services. http://www.cms.gov/ egorical variables county and zipcode, and real valued -Statistics-Data-and-Systems/ 4training/test error in figures 2b and 3b for the Gaus- Downloadable-Public-Use-Files/SynPUFs/ sian estimator for the fully observed case is ≈ 0 index.html.

100 Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo

[3] Texas Department of State Health Services. [16] P. McCullagh and J. A. Nelder. Generalized linear Texas Inpatient Public Use Data File, 2014. models. 1989. https://www.dshs.state.tx.us/thcic/ [17] J. A. Nelder and R. Baker. Generalized linear hospitals/Inpatientpudf.shtm. models. Wiley Online Library, 1972. [4] S. Acharyya, O. Koyejo, and J. Ghosh. Learning [18] J. A. Nelder and R. W. M. Wedderburn. General- to rank with Bregman divergences and monotone ized linear models. Journal of the Royal Statistical retargeting. In Proceedings of the 28th Conference Society. Series A (General), 135(3):pp. 370–384, on Uncertainty in Artificial Intelligence (UAI), 1972. ISSN 00359238. 2012. [19] A. Nicholls. How to make biological surveys go [5] M. P. Armstrong, G. Rushton, and D. L. Zim- further with generalised linear models. Biological merman. Geographically masking health data to Conservation, 50(1):51–75, 1989. preserve confidentiality. Statistics in Medicine, 18 (5):497–525, 1999. [20] Y. Park and J. Ghosh. A probabilistic imputation framework for predictive analysis using variably [6] A. Banerjee, S. Merugu, I. S. Dhillon, and aggregated, multi-source healthcare data. In Pro- J. Ghosh. Clustering with Bregman divergences. ceedings of the 2nd ACM SIGHIT International The Journal of Machine Learning Research, 6: Health Informatics Symposium, pages 445–454. 1705–1749, 2005. ACM, 2012. [7] K. K. Carroll. Experimental evidence of dietary [21] Y. Park and J. Ghosh. Ludia: an aggregate- factors and hormone-dependent cancers. Cancer constrained low-rank reconstruction algorithm to Research, 35(11 Part 2):3374–3383, 1975. leverage publicly released health data. In Pro- [8] S. Dias, A. J. Sutton, A. Ades, and N. J. Welton. ceedings of the 20th ACM SIGKDD International Evidence synthesis for decision making 2 a gener- Conference on Knowledge Discovery and Data alized linear modeling framework for pairwise and Mining, pages 55–64. ACM, 2014. network meta-analysis of randomized controlled [22] G. Paul, J. Cardinale, and I. F. Sbalzarini. Cou- trials. Medical Decision Making, 33(5):607–617, pling image restoration and segmentation: a gen- 2013. eralized linear model/bregman perspective. In- [9] G. Duncan, M. Elliot, and J. Salazar-Gonz´alez. ternational Journal of Computer Vision, 104(1): Statistical confidentiality: Principles and practice 69–93, 2013. 2011. [23] W. S. Robinson. Ecological correlations and the [10] D. A. Freedman, S. P. Klein, J. Sacks, C. A. behavior of individuals. International journal of Smyth, and C. G. Everett. Ecological regression , 38(2):337–341, 2009. and voting rights. Evaluation Review, 15(6):673– [24] D. W. Scott. On optimal and data-based his- 711, 1991. tograms. Biometrika, 66(3):605–610, 1979. [11] J. J. Gart. The analysis of poisson regression with [25] L. Song, P. Langfelder, and S. Horvath. Random an application in virology. Biometrika, 51(3/4): generalized linear model: a highly accurate and pp. 517–521, 1964. ISSN 00063444. interpretable ensemble predictor. BMC bioinfor- [12] P. I. Good. Permutation, parametric and boot- matics, 14(1):5, 2013. strap tests of hypotheses, volume 3. Springer, [26] S. S. Wilks. Mathematical statistics. New York, 2005. page 644, 1962. [13] L. A. Goodman. Ecological regressions and be- havior of individuals. American Sociological Re- view, 1953. [14] P. Hardelid, R. Pebody, and N. Andrews. Mortal- ity caused by influenza and respiratory syncytial virus by age group in England and Wales 1999– 2010. Influenza and other respiratory viruses, 7 (1):35–45, 2013. [15] T. Jamil, W. A. Ozinga, M. Kleyer, and C. J. ter Braak. Selecting traits that explain species– environment relationships: a generalized linear approach. Journal of Vegetation Sci- ence, 24(6):988–1000, 2013.

101