A Universal Global Measure of Univariate and Bivariate Data Utility for Anonymised Microdata S Kocar
CSRM & SRC Methods Paper No. 4/2018

Series note

The ANU Centre for Social Research & Methods (CSRM) was established in 2015 to provide national leadership in the study of Australian society. CSRM has a strategic focus on:
• development of social research methods
• analysis of social issues and policy
• training in social science methods
• providing access to social scientific data.

CSRM publications are produced to enable widespread discussion and comment, and are available for free download from the CSRM website (http://csrm.cass.anu.edu.au/research/publications).

CSRM is located within the Research School of Social Sciences in the College of Arts & Social Sciences at the Australian National University (ANU). The centre is a joint initiative between the Social Research Centre and the ANU. Its expertise includes quantitative, qualitative and experimental research methodologies; public opinion and behaviour measurement; survey design; data collection and analysis; data archiving and management; and professional education in social research methods.

As with all CSRM publications, the views expressed in this Methods Paper are those of the authors and do not reflect any official CSRM position.

Professor Matthew Gray
Director, ANU Centre for Social Research & Methods
Research School of Social Sciences
College of Arts & Social Sciences
The Australian National University
August 2018

Methods Paper No. 4/2018
ISSN 2209-184X
ISBN 978-1-925715-09-05

An electronic publication downloaded from http://csrm.cass.anu.edu.au/research/publications.

For a complete list of CSRM working papers, see http://csrm.cass.anu.edu.au/research/publications/working-papers.
ANU Centre for Social Research & Methods
Research School of Social Sciences
The Australian National University

Sebastian Kocar is a PhD candidate at the ANU Centre for Social Research & Methods and a data archivist in the Australian Data Archive at the Australian National University. His research interests include survey methodology, with a particular focus on online data collection, statistical disclosure control and higher education development.

Abstract

This paper presents a new global data utility measure, based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings – that is, they are limited to continuous variables, to univariate utility assessment, or to local information loss measurements. Several solutions are presented in the proposed global data utility model. It combines univariate and bivariate data utility measures, which calculate information loss using various statistical tests and association measures, such as two-sample Kolmogorov–Smirnov test, chi-squared test (Cramer’s V), ANOVA F test (eta squared), Kruskal-Wallis H test (epsilon squared), Spearman coefficient (rho) and Pearson correlation coefficient (r). The model is universal, since it also includes new local utility measures for global recoding and variable removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (size effect).

Since the model should be executed automatically with statistical software code or a package, our aim was to allow all steps to be done with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups.

Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop either program code or an R statistical software package for measuring data utility, and to establish the relationship between univariate, bivariate and multivariate data utility of anonymised data.

Keywords: statistical disclosure control, data utility, information loss, distribution estimation, bivariate analysis, effect size

Acronyms

ANU   Australian National University
CSRM  Centre for Social Research & Methods
IL    information loss
K–S   Kolmogorov–Smirnov
SDC   statistical disclosure control

Contents

Series note
Abstract
Acronyms
1 Introduction
2 Balancing disclosure risk and data utility
2.1 Disclosure risk measures
2.2 Information loss measures
3 A new user-centred global data utility measure
3.1 Structure of the utility model
3.2 Local univariate data utility after perturbation, suppression or removal
3.3 Local univariate data utility after global recoding
3.4 Local bivariate data utility
3.5 Data utility of data with removed variables
3.6 Multivariate data utility
4 Discussion
References

Tables and figures

Figure 1 Exponential distribution of information loss, the function of local data utility
Table 1 Bivariate tests and statistics used for local bivariate data utility calculation
Figure 2 Local data utility calculation procedure

1 Introduction

Demand has been increasing for governments to release unit record files in either publicly available anonymised form or restricted-access moderately protected form. Unit record files – which are basically tables that contain unaggregated information about individuals, enterprises, organisations, and so on – are progressively being disseminated by government agencies for both public use (i.e. anybody can access the data) and research use (i.e. only researchers can access the data). These disseminated data files can include detailed information – such as medical, voter registration, census and customer information – which could be used for the allocation of public funds, medical research and trend analysis (Machanavajjhala et al. 2007). Publishing data about individuals allows quality data analysis, which is essential to support an increasing number of real-world applications. However, it may also lead to privacy breaches (Loukides & Gkoulalas-Divanis 2012:9777). Some data are made publicly available, and some data are offered in less safe settings (e.g. to registered researchers on DVDs; see ‘five safes framework’ in Desai et al. [2016]). For instance, statistical organisations release sample microdata under different modes of access and with different levels of protection. Data may be offered as perturbed public use files, unperturbed statistical data to be accessed in on-site data labs, or even microdata under contract (Shlomo 2010) – also known as scientific use files – with a moderate level of statistical disclosure control (SDC). This means that distributors should focus on data protection. Additionally, unit record files should be more restrictively protected in case they are sensitive. Some variables in datasets can be sensitive as a result of the nature of the information they contain – for example, data on

disclosure, attribute disclosure, inferential disclosure, and disclosure of confidential information about a population or model. SDC is primarily focused on identity disclosure – that is, identification of a respondent (Duncan & Lambert 1989:207–208). On the one hand, releasing no data has zero disclosure risk with no utility; on the other, releasing all collected data including identifying information offers high utility, but with the highest risk (Larsen & Huckett 2010). Most research in the field has focused on the privacy-constrained anonymisation problem – that is, lowering information loss (IL) at certain k-anonymity or l-diversity values, which is called the direct anonymisation problem. Since this problem can lead to significant IL, data could be useless for specific applications, and the IL could be unacceptable to users (Ghinita et al. 2009). Data producers should also be concerned about data users who would publish an analysis based on statistics in anonymised files that are not similar to corresponding statistics in the original data. In that case, the data producers might have to correct any erroneous conclusions that are reached, and such efforts might require greater resources than those needed to produce data with higher utility (Winkler 1998). The goal is, therefore, to provide data products with reliable utility and controlled disclosure risk (Duncan & Stokes 2004, Templ et al. 2014).

The development of disclosure risk measures has been ahead of the development of data utility measures, mostly as a result of the focus and needs of data disseminators as data protectors, especially when it comes to global measures of those two data protection aspects. Research in the field mostly focuses on developing unbiased approaches to measuring risk of identification,