Generalized Linear Models for Aggregated Data
Total Page:16
File Type:pdf, Size:1020Kb
Generalized Linear Models for Aggregated Data Avradeep Bhowmik Joydeep Ghosh Oluwasanmi Koyejo University of Texas at Austin University of Texas at Austin Stanford University [email protected] [email protected] [email protected] Abstract 1 Introduction Modern life is highly data driven. Datasets with Databases in domains such as healthcare are records at the individual level are generated every routinely released to the public in aggregated day in large volumes. This creates an opportunity form. Unfortunately, na¨ıve modeling with for researchers and policy-makers to analyze the data aggregated data may significantly diminish and examine individual level inferences. However, in the accuracy of inferences at the individ- many domains, individual records are difficult to ob- ual level. This paper addresses the scenario tain. This particularly true in the healthcare industry where features are provided at the individual where protecting the privacy of patients restricts pub- level, but the target variables are only avail- lic access to much of the sensitive data. Therefore, able as histogram aggregates or order statis- in many cases, multiple Statistical Disclosure Limita- tics. We consider a limiting case of gener- tion (SDL) techniques are applied [9]. Of these, data alized linear modeling when the target vari- aggregation is the most widely used technique [5]. ables are only known up to permutation, and explore how this relates to permutation test- It is common for agencies to report both individual ing; a standard technique for assessing statis- level information for non-sensitive attributes together tical dependency. Based on this relationship, with the aggregated information in the form of sample we propose a simple algorithm to estimate statistics. Care must be taken in the analysis of such the model parameters and individual level in- data, as na¨ıve modeling with aggregated data may sig- ferences via alternating imputation and stan- nificantly diminish the accuracy of inferences at the dard generalized linear model fitting. Our re- individual level. In particular, inferences drawn from sults suggest the effectiveness of the proposed aggregated data may lead to the problem of ecologi- approach when, in the original data, permu- cal fallacy [23], hence the resulting conclusions at the tation testing accurately ascertains the verac- group level may be misleading to researchers and pol- ity of the linear relationship. The framework icy makers interested in individual level inferences. An is extended to general histogram data with example that has been cited [21] is the high correla- larger bins - with order statistics such as the tion between per capita consumption of dietary fat and median as a limiting case. Our experimen- breast cancer in different countries, which may lead to tal results on simulated data and aggregated the incorrect conclusion that dietary fat causes breast healthcare data suggest a diminishing returns cancer [7]. property with respect to the granularity of Aggregated data in the form of histograms and other the histogram - when a linear relationship sample statistics are becoming more and more com- holds in the original data, the targets can be mon. Further, most of the data that is collected relates predicted accurately given relatively coarse to questions for which the respondents have only a few histograms. discrete options from which to select their answer. For example, data available from the Generalized Social Survey (GSS) [1] are often in this form. This paper Appearing in Proceedings of the 18th International Con- addresses the scenario where features are provided at ference on Artificial Intelligence and Statistics (AISTATS) the individual level, but the target variables are only 2015, San Diego, CA, USA. JMLR: W&CP volume 38. available as histogram aggregates or order statistics. Copyright 2015 by the authors. Despite the prevalence of order-statistic and histogram 93 Generalized Linear Models for Aggregated Data aggregated data, to the best of our knowledge, this ments of the vector by the same lowercase letter with problem has not been addressed in the literature. the boldface removed and the index added as a super- script. v> refers to the transpose of the column vector We consider a limiting case of generalized linear mod- v. We denote column partitions using semicolons, that eling when the target variables are only known up to is, M = [X; Y] implies that the columns of the subma- permutation, and explore how this relates to permu- trices X and Y are, in order, the columns of the full tation testing [12]; a standard technique for assessing matrix M. We use k · k to denote the L norm for vec- statistical dependency. Based on this relationship, we 2 tors and Frobenius norm for matrices. The vector v is propose a simple algorithm to estimate the model pa- said to be in increasing order if v(i) ≤ v(j) whenever rameters and individual level inferences via alternat- i ≤ j, and the set of all such vectors in n is denoted ing imputation and standard generalized linear model R with a subscripted downward pointing arrow as n. fitting. Our results suggest the effectiveness of the R# Two vectors v and w are said to be isotonic, v ∼ w, proposed approach when, in the original data, permu- # if v(i) ≥ v(j) if and only if w(i) ≥ w(j) for all i; j. tation testing accurately ascertains the veracity of the linear relationship. The framework is extended to gen- eral histogram data with larger bins - with order statis- 1.1 Preliminaries and Related Work tics such as the median as a limiting case. Our exper- imental results suggest a diminishing returns property Aggregated data is often summarized using a sample - when a linear relationship holds in the original data, statistic, which provides a succinct descriptive sum- the targets can be predicted accurately given relatively mary [26]. Examples of sample statistics include the coarse histograms. Our results also suggest caution in average, median and various other quantiles. While in the widespread use of aggregation for ensuring the the mean is still the most common choice, the best privacy of sensitive data. choice for summarizing a sample generally depends on the distribution the sample has been generated from. In summary, the main contributions of this manuscript In many cases, the use of histograms [24] or order are as follows: statistic summaries is much more \natural" e.g. for categorical data, binary data, count valued data, etc. (i) we propose a framework for estimating the re- sponse variables of a generalized linear model The problem of imputing individual level records from given only a histogram aggregate summary by the sample mean has been studied in [20] and [21] formulating it as an optimization problem that among others. In particular, the paper [21] attempts alternates between imputation and generalized to reconstruct the individual level matrix by assuming linear model fitting. a low rank structure and compares their framework with other approaches which include an extension of (ii) we examine a limiting case of the framework the neighborhood model [10] and a variation of ecolog- when all the data is known up to permutation. ical regression [13] for the task of imputing individual Our examination suggests the effectiveness of the level records of the response variable. However, these proposed approach when, in the original data, approaches exploit the fact that sample mean is a lin- permutation testing accurately ascertains the ve- ear function of the sample values. Hence, none of these racity of the linear relationship. approaches are extendable to non-linear functions such (iii) we examine a second limiting case where only a as order statistics and histogram aggregates. To the few order statistics are provided. Our experimen- best of our knowledge, the special case of individual in- tal results suggest a diminishing returns property ferences based on sample statistic aggregate data has - when a linear relationship holds in the origi- not been addressed before. Our work proposes a first nal data, the targets can be predicted accurately solution to address this open problem. given relatively coarse histograms. Order Statistics: Given a sample of n real valued datapoints, the τ th order statistic of the sample is the The proposed approach is applied to the analysis of τ th smallest value in the sample. For example, the simulated datasets. In addition, we examine the Texas first order statistic is the minimum value of the sam- Inpatient Discharge dataset from the Texas Depart- n th th ple, the 2 order statistic is the median and the n ment of State Health Services [3] and a subset of the order statistic is the maximum value of the sample. 2008-2010 SynPUF dataset [2]. We specifically design a framework which makes it rel- atively straightforward to work with order statistics. Notation Histograms : Given a finite sample of n items from Matrices are denoted by boldface capital letters, vec- a set C, a histogram is a partition of the set C into tors by boldface lower case letters and individual ele- disjoint bins Ci : [iCi = C and the respective count or 94 Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo percentage of elements from the sample in each bin. A generalized linear model (GLM) [18] is a generaliza- Seen this way, for any sample from C ⊆ R, a his- tion of linear regression that subsumes various models togram is essentially a set of order statistics for that like Poisson regression, logistic regression, etc. as spe- sample. Histograms can sometimes be specified with- cial cases1. A generalized linear model assumes that out their boundary values (eg. "x < 30" as opposed the response variables, y are generated from a distribu- to "0 < x < 30")- this is equivalent to leaving out tion in the exponential family with the mean param- the first and the nth order statistic.