A universal global measure of univariate and bivariate utility for anonymised microdata S Kocar

CSRM & SRC METHODS PAPER NO. 4/2018 Series note

The ANU Centre for Social Research & Methods (CSRM) was established in 2015 to provide national leadership in the study of Australian society. CSRM has a strategic focus on:
• development of social research methods
• analysis of social issues and policy
• training in social science methods
• providing access to social scientific data.

CSRM is located within the Research School of Social Sciences in the College of Arts & Social Sciences at the Australian National University (ANU). The centre is a joint initiative between the Social Research Centre and the ANU. Its expertise includes quantitative, qualitative and experimental research methodologies; public opinion and behaviour measurement; survey design; data collection and analysis; data archiving and management; and professional education in social research methods.

CSRM publications are produced to enable widespread discussion and comment, and are available for free download from the CSRM website (http://csrm.cass.anu.edu.au/research/publications). As with all CSRM publications, the views expressed in this Methods Paper are those of the authors and do not reflect any official CSRM position.

Professor Matthew Gray
Director, ANU Centre for Social Research & Methods
Research School of Social Sciences
College of Arts & Social Sciences
The Australian National University

August 2018

Methods Paper No. 4/2018

ISSN 2209-184X ISBN 978-1-925715-09-05 An electronic publication downloaded from http://csrm.cass.anu.edu.au/research/publications.

For a complete list of CSRM working papers, see http://csrm.cass.anu.edu.au/research/publications/ working-papers.

ANU Centre for Social Research & Methods
Research School of Social Sciences
The Australian National University

A universal global measure of univariate and bivariate data utility for anonymised microdata

S Kocar

Sebastian Kocar is a PhD candidate at the ANU Centre for Social Research & Methods and a data archivist in the Australian Data Archive at the Australian National University. His research interests include survey methodology, with a particular focus on online data collection, statistical disclosure control and higher education development.

Abstract

This paper presents a new global data utility measure, based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings – that is, they are limited to continuous variables, to univariate utility assessment, or to local information loss measurements. Several solutions are presented in the proposed global data utility model. It combines univariate and bivariate data utility measures, which calculate information loss using various statistical tests and association measures, such as the two-sample Kolmogorov–Smirnov test, chi-squared test (Cramer's V), ANOVA F test (eta squared), Kruskal–Wallis H test (epsilon squared), Spearman correlation coefficient (rho) and Pearson correlation coefficient (r). The model is universal, since it also includes new local utility measures for global recoding and variable removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (effect size).

Since the model should be executed automatically with statistical software code or a package, our aim was to allow all steps to be done with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups. Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop either program code or an R statistical software package for measuring data utility, and to establish the relationship between univariate, bivariate and multivariate data utility of anonymised data.

Keywords: statistical disclosure control, data utility, information loss, distribution estimation, bivariate analysis, effect size

Acronyms

ANU Australian National University

CSRM Centre for Social Research & Methods

IL information loss

K–S Kolmogorov–Smirnov

SDC statistical disclosure control

Contents

Series note ii
Abstract iii
Acronyms iv
1 Introduction 1
2 Balancing disclosure risk and data utility 3
 2.1 Disclosure risk measures 3
 2.2 Information loss measures 4
3 A new user-centred global data utility measure 6
 3.1 Structure of the utility model 6
 3.2 Local univariate data utility after perturbation, suppression or removal 7
 3.3 Local univariate data utility after global recoding 9
 3.4 Local bivariate data utility 10
 3.5 Data utility of data with removed variables 14
 3.6 Multivariate data utility 14
4 Discussion 15
References 16

Tables and figures

Figure 1 Exponential distribution of information loss, the function of local data utility 8
Table 1 Bivariate tests and statistics used for local bivariate data utility calculation 10
Figure 2 Local data utility calculation procedure 11

1 Introduction

Demand has been increasing for governments to release unit record files in either publicly available anonymised form or restricted-access moderately protected form. Unit record files – which are basically tables that contain unaggregated information about individuals, enterprises, organisations, and so on – are progressively being disseminated by government agencies for both public use (i.e. anybody can access the data) and research use (i.e. only researchers can access the data). These disseminated data files can include detailed information – such as medical, voter registration, census and customer information – which could be used for the allocation of public funds, medical research and trend analysis (Machanavajjhala et al. 2007). Publishing data about individuals allows quality data analysis, which is essential to support an increasing number of real-world applications. However, it may also lead to privacy breaches (Loukides & Gkoulalas-Divanis 2012:9777). Some data are made publicly available, and some data are offered in less safe settings (e.g. to registered researchers on DVDs; see 'five safes framework' in Desai et al. [2016]). For instance, statistical organisations release sample microdata under different modes of access and with different levels of protection. Data may be offered as perturbed public use files, unperturbed statistical data to be accessed in on-site data labs, or even microdata under contract (Shlomo 2010) – also known as scientific use files – with a moderate level of statistical disclosure control (SDC). This means that distributors should focus on data protection. Additionally, unit record files should be more restrictively protected in case they are sensitive. Some variables in datasets can be sensitive as a result of the nature of the information they contain – for example, data on sexual behaviour, health status, income, or if the unit is an enterprise (Dupriez & Boyko 2010).

There is a trade-off between reducing disclosure risk and increasing data utility. We can distinguish between four types of disclosure: identity disclosure, attribute disclosure, inferential disclosure, and disclosure of confidential information about a population or model. SDC is primarily focused on identity disclosure – that is, identification of a respondent (Duncan & Lambert 1989:207–208). On the one hand, releasing no data has zero disclosure risk with no utility; on the other, releasing all collected data including identifying information offers high utility, but with the highest risk (Larsen & Huckett 2010). Most research in the field has focused on the privacy-constrained anonymisation problem – that is, lowering information loss (IL) at certain k-anonymity or l-diversity values, which is called the direct anonymisation problem. Since this problem can lead to significant IL, data could be useless for specific applications, and the IL could be unacceptable to users (Ghinita et al. 2009). Data producers should also be concerned about data users who would publish an analysis based on statistics in anonymised files that are not similar to corresponding statistics in the original data. In that case, the data producers might have to correct any erroneous conclusions that are reached, and such efforts might require greater resources than those needed to produce data with higher utility (Winkler 1998). The goal is, therefore, to provide data products with reliable utility and controlled disclosure risk (Duncan & Stokes 2004, Templ et al. 2014).

The development of disclosure risk measures has been ahead of the development of data utility measures, mostly as a result of the focus and needs of data disseminators as data protectors, especially when it comes to global measures of those two data protection aspects. Research in the field mostly focuses on developing unbiased approaches to measuring risk of identification, and on investigating which anonymisation methods and techniques are more suitable in different disclosure scenarios (Karr et al. 2006:225). However, the existing data protection approaches suffer from at least one of the following drawbacks (Ghinita et al. 2009):
• The research has focused on the privacy constraint problem and ignored the accuracy constraint problem.
• The anonymisation is inefficient.
• L-diversification causes unnecessary IL.

Very different approaches to SDC modify data in several different ways, leading to challenging disclosure risk and data utility assessments. Synthetic data generation creates a new data matrix with information about a fictitious software-generated population; sampling reduces the size of the sample by multiple times; other data reduction techniques remove variables or recode their values; and data perturbation methods replace original variable values with values generated by applying different perturbation techniques (International Household Survey Network 2017). As a result, protected matrices of different sizes and variables with different ranges of values require the use of adjusted measures of disclosure risk and data utility. Although there are generally accepted and widely used global measures of disclosure risk, mostly calculated as sums of individual per-record risks, data utility assessment is mainly based on estimations of changes in distributions and relations between protected variables (see Templ et al. 2015:28–30), which makes the assessment significantly more local (selected variables) than global (dataset). Methods such as distance measures, distribution estimation, cluster analysis and propensity score can be used to measure reduced utility, but we argue that IL and local (item) utility assessments as local measurements should be combined into data utility assessment as global measurements. The aim of this paper is to propose new solutions for global measurements of data utility for anonymised microdata.

Section 2 of this paper reviews disclosure risk measures, existing IL measures, and how to balance risk and data utility. Section 3 proposes a new user-centred global data utility measure, its structure and the included individual local utility measures. Section 4 establishes which scenarios the proposed model should be applied in, and what future research and development should focus on.

2 Balancing disclosure risk and data utility

To release safe data products with high enough data utility for a particular use, a holistic approach to SDC is preferred (Shlomo 2010). Government agencies in Australia are required by law (e.g. the Privacy Act 1988) to protect identities of their responding units. Therefore, their primary focus is on reducing disclosure risk. However, data utility assessment should be considered as equally important, because agencies release data to be used by as many (safe) users as possible, and want only unbiased results to be reported. Data utility assessment is especially important because several different data protection solutions are usually possible for the same unit record file – that is, there are different choices of SDC methods and their associated parameters. The trade-off between contradictory goals of decreasing risk and increasing utility involves choosing among different versions of information released (Cox et al. 2011:161). A risk–utility (R-U) confidentiality map (see Duncan & Stokes 2004) can be created to establish which of the solutions represent the best balance between privacy and accuracy. Winkler (1998) argues that users of released files are concerned with their analytic validity: a file is analytically valid if it preserves means and covariances on a small set of subdomains, a few margins and at least one other distributional characteristic. This section reviews both disclosure risk and IL measures as fundamentals for the development of a global data utility measure, which could be used with a global disclosure risk measure in a global R-U confidentiality map.

2.1 Disclosure risk measures

The risk of disclosure is a function of the population, as well as the sample. In particular, it is dependent on sample uniques – that is, sample units that are unique within the sample file, where uniqueness is determined by combinations of identifying discrete key variables (Shlomo 2010). Several data characteristics can make potential intruders' jobs easier, including geographical detail, outliers, many attribute variables and census data. Longitudinal or panel data also represent substantial disclosure risk (Duncan & Stokes 2004). The measures for calculating this risk are instruments for estimating the identity disclosure. There are two types of measures: individual per-record and global per-file disclosure risk measures (Shlomo 2010).

Individual measures assess the probability of re-identification of a single responding unit. Methods for assessing individual disclosure risk for sample microdata are generally classified into three types: heuristic, probabilistic record linkage and probabilistic modelling of disclosure risk. Since there is no framework for obtaining consistent global-level disclosure risk measures in the case of heuristics and record linkage approaches, probabilistic modelling seems like the most suitable approach for global disclosure risk assessment (Shlomo 2010). Spruill (1982; cited in Duncan & Lambert 1989) was one of the first to propose a measure of disclosure risk, which was a percentage of 'protected' records closer to their parent record than to any other source record, with the distance computed as a squared distance between the same record in the original and the protected data. Duncan and Lambert (1989) proposed disclosure assessment formulas, which are predictive distributions, for three different scenarios: release to a naive outsider, release to an insider with complete knowledge and release of masked data to an intruder with some knowledge. The last scenario seems to be the only realistic one, since even insiders do not have exact knowledge of all attributes for all units and may not believe respondents entirely. The predictive distributions represent probabilities that any released record is correctly linked, and the formulas take into account the probability that the target record has been released, the number of respondents with the same attributes and the predictive contribution on how the target attributes will appear if released (Duncan & Lambert 1989).

Modern measures for disclosure risk are often based on minimal sample uniques. One of these measures is sample frequencies on subset (SUDA2), a recursive algorithm that generates all possible key variable subsets and scans for unique patterns (Manning et al. 2008). The calculated risk depends on two aspects and is higher for a larger number of minimal sample uniques contained within an observation, or a lower number of variables to determine uniqueness. The risk is calculated from the maximum size of the variable subsets (m), the number of minimal sample uniques (MSUmin) and the number of data file units (n) (Templ et al. 2015).

Global risk measures are aggregated per-record risk measures. They represent the disclosure risk for the entire data file. There are three common disclosure risk measures at the global level: the number of sample uniques that are at the same time population uniques, the expected number of correct matches/re-identifications (Shlomo 2010:75–76), and the number of observations with individual risks higher than a threshold (Templ et al. 2014). Global risk can also be measured using log-linear models (Templ et al. 2015).

2.2 Information loss measures

Data utility is based on whether data users can carry out statistical inference and the same analysis on the anonymised data as on the original data. Proxy measures have already been developed to assess data utility, mostly based on measurements of distortions to distributions and the impact on other analysis tools, such as chi-squared statistics (Shlomo 2010). We could classify IL measures in terms of:
• approaches to measuring IL
 – direct distance metrics approach
 – benchmarking approach (Templ et al. 2014)
• type of analysis
 – univariate (e.g. comparing means)
 – bivariate (e.g. comparing measures of association)
 – multivariate (e.g. comparing regression analysis R²) (see Domingo-Ferrer et al. 2001, Shlomo 2010)
• types of variables
 – IL measures for continuous variables
 – IL measures for categorical microdata (Domingo-Ferrer et al. 2001).

Distance metrics are based on measuring distortions to distributions. One of them is the average absolute distance per cell (AAD), introduced by Gomatam and Karr (2003; cited in Shlomo 2010). It is based on the average absolute difference per table cell in the data and is defined as:

AAD(Dorig, Dpert) = (1/nc) Σc |Dorig(c) − Dpert(c)|

Dorig and Dpert are frequency distributions produced from the microdata, D(c) are frequencies in cell c, and nc is the number of cells. The other common measure, which also takes into account the variability of the data, is IL1s, proposed by Yancey, Winkler and Creecy (2002; cited in Templ et al. 2014). It is defined as:

IL1s = (1/(n·p)) Σj=1..p Σi=1..n |xij − x′ij| / (√2 · Sj)

where x is the original dataset, and x′ is the perturbed version of the dataset; there are n records and p variables; and Sj is the standard deviation of the jth variable. Some other common distance metrics are a direct comparison of categorical values (for categorical variables), and mean square error, mean absolute error and mean variation (for continuous variables). The listed IL measures for continuous variables are used to compare matrices, covariance matrices (distance metrics), and averages and correlation matrices (benchmarking) (Domingo-Ferrer et al. 2001).
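To make the two distance metrics concrete, the sketch below computes AAD and IL1s with NumPy. This is not code from the paper: the function names, and the assumption that records are rows and variables are columns of an array, are choices made for the example.

```python
# Sketch (not from the paper): AAD and IL1s distance metrics as defined above.
import numpy as np

def aad(orig_counts, pert_counts):
    """Average absolute distance per cell between two frequency tables."""
    orig = np.asarray(orig_counts, dtype=float)
    pert = np.asarray(pert_counts, dtype=float)
    return np.mean(np.abs(orig - pert))

def il1s(x_orig, x_pert):
    """IL1s: absolute differences scaled by sqrt(2)*S_j, averaged over n records and p variables."""
    x = np.asarray(x_orig, dtype=float)    # shape (n, p): original values
    xp = np.asarray(x_pert, dtype=float)   # shape (n, p): perturbed values
    s = x.std(axis=0, ddof=1)              # standard deviation of each variable
    n, p = x.shape
    return np.sum(np.abs(x - xp) / (np.sqrt(2) * s)) / (n * p)
```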

The benchmarking approach is based on comparing statistics computed on the original and perturbed data. At the univariate level, these are statistics such as point estimates, variances or confidence intervals (Templ et al. 2015). At the bivariate level, statistics produced by tests such as the chi-squared test or ANOVA (goodness-of-fit criterion R²) are compared. The IL measure, based on the chi-squared test and the measure of association between two categorical variables called Cramer's V (CV), is often used (Dorig and Dpert are frequency distributions produced from the microdata):

RCV(Dorig, Dpert) = 100 · (CV(Dpert) − CV(Dorig)) / CV(Dorig)

For continuous variables, a comparable measure can be used to assess the impact on correlation. Similarly, statistics produced using regression analysis (coefficients, R²) for multiple continuous variables and log-linear modelling for multiple categorical variables could be used.

Global IL metrics are less common than individual item, bivariate or multivariate loss metrics. One of these is the global certainty penalty (GCP), introduced by Ghinita et al. (2009), who adopted the normalised certainty penalty (NCP) concept of Xu et al. (2006). NCP measures the amount of distortion introduced by generalisation and is an IL measure for a single equivalence class (variable level, univariate). GCP combines these measures in an IL measure for the entire data file. However, this measure works better with data that include continuous variables only, since distance is often not well defined for categorical variables (Xu et al. 2006). Woo et al. (2009) introduced a set of global utility measures:
• propensity score measure
• cluster analysis measure
• empirical cumulative distribution function measures (using Kolmogorov–Smirnov-type statistics).

Propensity score measures are based on the idea that, if two large groups have the same distributions of propensity scores, they have a similar distribution of covariates. First, original and anonymised data are merged. Propensity scores – that is, the probability of a unit being in the masked dataset – are then calculated. Last, distributions of propensity scores, estimated via logistic regression, are compared. The similarity of propensity scores, which is, in the end, a data utility measure, can be calculated – for example, by comparing percentiles in each group (masked and original observations):

Up = (1/N) Σi=1..N (p̂i − c)²

where p̂i is the estimated propensity score, N is the number of records, and c is the proportion of units with masked data in the merged data file.

The cluster analysis measure is based on the multivariate unsupervised machine-learning method known as cluster analysis, which is carried out with a fixed number of groups on a merged data file including original and protected data. The following equation is used to calculate the utility measure:

Uc = (1/G) Σj=1..G wj (njo/nj − c)²

where G is the number of groups, wj is the cluster weight, njo is the number of observations from the original data, nj is the number of units in the jth cluster, and c is the proportion of original data in the merged dataset.
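The propensity score utility measure lends itself to a short implementation. The sketch below is a minimal illustration, not the paper's code: it assumes scikit-learn and pandas are available, uses a naive dummy-coded design matrix, and the function name is invented for the example.

```python
# Sketch (not from the paper): the propensity score utility measure U_p described above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_utility(original: pd.DataFrame, masked: pd.DataFrame) -> float:
    """U_p = (1/N) * sum_i (p_hat_i - c)^2, where c is the share of masked records."""
    merged = pd.concat([original, masked], ignore_index=True)
    masked_flag = np.r_[np.zeros(len(original)), np.ones(len(masked))]
    X = pd.get_dummies(merged, drop_first=True)       # simple design matrix
    model = LogisticRegression(max_iter=1000).fit(X, masked_flag)
    p_hat = model.predict_proba(X)[:, 1]              # estimated propensity scores
    c = masked_flag.mean()                            # proportion of masked units
    return float(np.mean((p_hat - c) ** 2))           # 0 means the files are indistinguishable
```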

3 A new user-centred global data utility measure

This section introduces a new global data utility measure, combining univariate distribution-based measures and bivariate measures, calculated from changed coefficients of association between all variables in original and anonymised data files. The model should be used as a relative measure (for comparing global utility of the same unit record file protected in alternative ways) and not as an absolute measure (for comparing the global utility of different data files). This is because global data utility is highly dependent on ranges of mostly demographic variables, as well as other identifying variables, sizes of samples and characteristics of units with an increased identification risk.

To create a global measure, individual (local) measures are also proposed. Based on a review of the literature, we concluded that most local and global data utility measures have significant shortcomings. Some can be used with continuous variables only; some are only suitable for univariate utility assessment; some cannot be used with significantly changed sizes of anonymised data matrices or altered measures of variables; and the remainder are more theoretical than user centred.

Generally, there are two main approaches to measuring IL and data utility: direct measures of distances and the benchmarking approach. The latter is generally considered a better solution (Templ et al. 2015). Since there are very different data protection methods and techniques, which produce data matrices of different sizes and include variables with different ranges of values (e.g. bracketing, synthetic data, sampling), unit records may not be the same, and variables might not be of the same type in original and protected data. Consequently, direct measures of distances between original and anonymised data often should not or cannot be calculated. Our proposed user-centred global data utility model is, therefore, based on benchmarking indicators, comparing item distributions and bivariate statistics calculated using original and anonymised data. The idea is that users of protected data should be provided with access to variables with similar (joint) distributions to those in original, non-anonymised versions of the data. In practice, this means that the same or at least similar results would be reported in their research publications as are reported by data producers with access to the original data, even though some characteristics of certain units in the data might differ substantially as a result of SDC procedures applied.

Information loss and overall data accuracy are quite challenging to quantify, partly because of different data masking techniques resulting in different changes to data, but mostly because of users focusing on different research topics and conducting different analyses of the same data. Consequently, we argue that purposely selecting the most important estimates/variables, as proposed by other authors (e.g. Templ 2011), could result in accurate measurements of local data utility and IL, but potentially biased estimates of global data utility. Unless the research is focused almost exclusively on measuring a particular concept (e.g. economically active population or unemployment rate in the Labour Force Survey), global data utility measurement should be extended to the majority of, if not all, variables in the data.

3.1 Structure of the utility model

The model consists of aggregated local utility measures, which are combined from univariate local item utility measures and bivariate pairs-of-items utility measures.

These measures could potentially be weighted, based on the level of data analysis that would primarily be carried out in secondary use of data – either simple public use (mostly calculating descriptive statistics) or slightly advanced scientific use of data (calculating both descriptive and bivariate statistics). Creating a global utility measurement follows the principles of creating global risk measurement (see Section 2.1). However, in contrast to the calculation of global risk measurements, which are aggregated record-level risks, global utility measures are aggregated and weighted-average variable-level local utility measures.

This model should be perceived as an expansion of more traditional univariate data utility assessment (e.g. Templ et al. 2015) to the bivariate level, since Lambert (1993) argues that a good anonymisation procedure generally preserves the first two moments of joint distributions, but analysis is not limited to mean and covariance estimations only. On the other hand, the primary focus of our model is not on data utility assessment at the multivariate level, and the model is therefore not particularly suitable for measuring the utility of anonymised data to be used in statistical modelling, although notable changes in univariate and joint distributions generally result in significant changes in multivariate distributions as well. In addition to expanding data utility measurement from the univariate to the bivariate level, we are expanding the benchmarking indicator approach usually applied for continuous variables (Templ et al. 2014) to categorical, including globally recoded, variables. Although Xu et al. (2006) listed global recoding as a more severe data reduction technique than local recoding and a source of higher IL, they did not offer any measures for that IL.

The key measures in the global data utility model are the following:
• Average local univariate data utility (ALDUuni) is the mean of all local univariate data utility scores (LDUuni), calculated for each variable (k) separately using approaches and tests described in Sections 3.2 and 3.3.
• Average local bivariate data utility (ALDUbiv) is the mean of all (k) local bivariate data utility scores, calculated for each variable separately using approaches and tests described in Section 3.4. Each local bivariate data utility score is the mean of all local bivariate data utility measures (LDUbiv) – there are l of them per variable, with l being the number of all possible bivariate tests of association (in most cases calculated as l = k − 1).
• Global data utility (GDU) is a measure combining the ALDUuni and ALDUbiv measures (possibly multiplied by coefficients and weights). If univariate utility is equally important to bivariate utility, the following equation should be used (hence the coefficient ½):

GDU = ½ ALDUuni + ½ ALDUbiv = (1/2k) Σi=1..k LDUuni(i) + (1/2k) Σi=1..k [(1/l) Σj=1..l LDUbiv(i,j)]
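The aggregation defined by the equation above can be expressed directly in code. The following sketch is illustrative only; the dictionary inputs, variable names and the equal ½ weights for the univariate and bivariate components are assumptions made for the example.

```python
# Sketch (not from the paper): aggregating local utility scores into GDU as defined above.
# ldu_uni maps each variable to its LDUuni score; ldu_biv maps each variable to the list
# of its LDUbiv scores (one per bivariate test involving that variable).
import numpy as np

def global_data_utility(ldu_uni: dict, ldu_biv: dict) -> float:
    aldu_uni = np.mean(list(ldu_uni.values()))                            # ALDUuni
    aldu_biv = np.mean([np.mean(scores) for scores in ldu_biv.values()])  # ALDUbiv
    return 0.5 * aldu_uni + 0.5 * aldu_biv                                # equal weights

# Example with three (hypothetical) variables:
# global_data_utility({"age": 0.9, "sex": 1.0, "income": 0.6},
#                     {"age": [0.8, 0.7], "sex": [0.8, 1.0], "income": [0.7, 1.0]})
```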

To calculate local univariate data utility, we have to take into account all the different ways in which data protection affects distributions of protected data. Generally, we propose two different local univariate data utility measurement approaches: LDUunipsr for perturbed, suppressed or removed data; and LDUunigr for globally recoded data. However, in some cases, a variable can be both globally recoded and suppressed/perturbed.

3.2 Local univariate data utility after perturbation, suppression or removal

To create local univariate data utility measures for perturbed, suppressed or removed data (also synthetic data), original and protected data have to be compared at a local level using appropriate statistical tests. Statistical significance (P) values returned by these tests indicate the level at which the null hypothesis (i.e. that the two samples were drawn from the same distribution) can be rejected. We argue that local utility, defined as original data local (item) utility minus IL, decreases with a decrease of significance levels of selected distribution comparison tests, such as the two-sample Kolmogorov–Smirnov (K–S) test for univariate continuous variable distributions. Instead of using IL scores with just two possible values – that is, IL = 0 if the two samples were drawn from the same distribution (K–S test significance > 0.05), and IL = 1 if the two samples were not drawn from the same distribution (K–S significance < 0.05) – we propose to calculate IL from the exponential distribution; an indicator of the protected item not being an accurate measure follows the exponential distribution, x ~ exp(λ). The probability density function of the exponential distribution is given by:

f(x) = λe^(−λx), x ≥ 0

We have to propose distribution parameters that would return reasonable estimates of IL and data utility at different P values. The scale parameter λ should equal 1, which means that, for each variable, IL and local utility values calculated with this equation will always have a value between 0 and 1. For x in the equation, different values have been considered. We suggest x to be 14 times the P value, the statistical significance value of a selected test. With x = 14P and with P = 0.05, which is the level traditionally used in hypothesis testing, the local data utility value equals IL (0.50). However, at P = 0.01, the probability of samples being drawn from the same distribution is much lower; hence local data utility is significantly lower, at 0.13 (see Figure 1). These seem to be quite reasonable estimates of univariate data utility. The following equation should be used to calculate local univariate data utility after perturbation, suppression or removal (LDUunipsr):

LDUunipsr = 1 − IL = 1 − e^(−14P)

We decided to use this distribution and its parameters to calculate IL, since very little increase in the P value is required to retain the null hypothesis (no distribution differences between groups), which means that IL levels should decrease very quickly. The exponential distribution is a continuous probability distribution, applied to a continuous random variable with sets of possible values that are infinite and uncountable. However, in our case, the density value for P values between 0.4 and 1 is <0.01, and IL is negligible. Therefore, we can use it in practice despite x having a defined range of values (see Figure 1).

Figure 1 Exponential distribution of information loss, the function of local data utility

[Figure 1: left panel shows the exponential probability density f(x) against x (0 to 14); right panel shows local data utility against the statistical test P value (0 to 1).]
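A minimal sketch of the proposed LDUunipsr calculation, using the two-sample K–S test from SciPy as the source of the P value. The simulated variable and the noise perturbation are invented for the example and are not part of the paper.

```python
# Sketch (not from the paper): LDUunipsr = 1 - e^(-14P), with P from a two-sample K-S test.
import numpy as np
from scipy import stats

def ldu_uni_psr(p_value: float) -> float:
    """Local univariate utility after perturbation, suppression or removal."""
    return 1.0 - np.exp(-14.0 * p_value)

rng = np.random.default_rng(0)
original = rng.normal(50, 10, size=5000)             # original continuous variable
protected = original + rng.normal(0, 2, size=5000)   # noise-perturbed version
p = stats.ks_2samp(original, protected).pvalue        # H0: same distribution
print(round(ldu_uni_psr(p), 3))                       # close to 1 when distributions match
```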

To compare data and calculate P values, different univariate statistical tests should be used. Since there are different types of variables with different distributions, various tests have been considered. The following tests are used in our model at the univariate level:
• For univariate-level interval and ratio variables, the nonparametric two-sample K–S test (Massey 1951) for equality of distributions for continuous variables should be used to determine significance levels. The distributions of original and protected variables are compared, given that the protected continuous variables have not been globally recoded into categorical variables or bottom/top coded. The two-sample t test, as a parametric counterpart of the K–S test, could potentially be used. However, it is not sensitive to variability and would not be a suitable solution for comparing non-normal distributions; we would prefer to use a universal distribution-free test. We could theoretically use the paired-samples version of the test. However, it focuses on distances between variable values of the same units, and therefore cannot be used for all data protection methods and techniques (e.g. sampling – data reduction).
• For univariate-level categorical variables (nominal and ordinal), the chi-squared test seems to be the most popular and reasonable solution. The test would compare discrete distributions of original and protected variables as 'groups' and report P values.

3.3 Local univariate data utility after global recoding

There are several existing distance measures for perturbed or suppressed categorical variables (see Domingo-Ferrer et al. 2001:107–109). However, developing an unbiased local utility measure for globally recoded variables is more challenging, because collapsing of variable values changes their ranges or even their types. Therefore, we have to consider how these kinds of data reduction techniques influence the analysis of data. After the initial collapsing of values, we also have to pay attention to distributions of already globally recoded variables, since aggregation is often combined with other data reduction techniques, such as local suppression, or data perturbation techniques, such as data swapping.

In this paper, we consider three different sources of local IL after global recoding:
• limited comparison of groups' statistics (categorical → categorical variable)
• potentially biased allocation of values into value groups (continuous → categorical variable)
• increased heterogeneity of groups (continuous → categorical variable).

Consequently, we propose three different measures of data utility after global recoding; some are more rigorous and some slightly less, and users need to select the most suitable measure for their case.

First, we have to consider how much we limit comparisons between groups of categorical variables due to the reduction in the number of categories. For example, we might want to mask an age-group variable with ten 10-year age groups (0–9, 10–19, …, 90–99 years), recoding it into a variable with five 20-year age groups (0–19, …, 80–99 years). In practice, this means that comparisons of proportions of people aged 20–29 with those aged 30–39, as well as calculation of population estimates for these two groups (individually), would no longer be possible. In this specific case, the total number of comparisons between pairs of groups (n(n − 1)/2) decreases from 45 to 10, by almost 78%. We argue that local univariate data utility after global recoding (LDUunigr) should decrease in the same way. The following formula for LDUunigr should be used in the model:

LDUunigr = [m(m − 1)/2] / [n(n − 1)/2]

In the equation, n represents the original number of variable values, and m represents the reduced number of variable values in the protected data.

Second, when values of continuous variables are collapsed into groups, they might end up in groups with less similar values (e.g. as a result of allocation). One way to measure this source of IL is to calculate the percentage of units that are closer to any unit in the neighbouring groups than to any unit in their own group (e.g. age value 39 in the 30–39 age group is closer to value 40 in the 40–49 age group than to value 33 in the 30–39 age group). The following formula for LDUunigr is proposed:

LDUunigr = 1 − nba/n

In the equation, n represents the number of all units, and nba represents the number of all biased allocations at collapsing of values in anonymised data. Generally, the fewer groups are created, the higher the chance of biased allocations due to an increased heterogeneity of groups.
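The first two recoding measures can be computed as follows. The sketch is not from the paper: the function names, the binning with np.digitize and the strict nearest-neighbour rule for counting 'biased' allocations are assumptions made to illustrate the definitions above.

```python
# Sketch (not from the paper): the first two global-recoding utility measures described above.
import numpy as np

def ldu_gr_pairs(n_orig_categories: int, n_prot_categories: int) -> float:
    """Ratio of remaining to original pairwise group comparisons: [m(m-1)/2] / [n(n-1)/2]."""
    n, m = n_orig_categories, n_prot_categories
    return (m * (m - 1) / 2) / (n * (n - 1) / 2)

def ldu_gr_allocation(values, bin_edges) -> float:
    """1 minus the share of units closer to a unit in a neighbouring group than to any unit
    in their own group, after collapsing continuous values into bins."""
    values = np.asarray(values, dtype=float)
    groups = np.digitize(values, bin_edges)            # group label per unit
    biased = 0
    for i, (v, g) in enumerate(zip(values, groups)):
        own = values[(groups == g) & (np.arange(len(values)) != i)]
        other = values[groups != g]
        if len(own) and len(other) and np.min(np.abs(other - v)) < np.min(np.abs(own - v)):
            biased += 1
    return 1.0 - biased / len(values)

print(ldu_gr_pairs(10, 5))   # ten age groups collapsed into five -> 10/45, roughly 0.22
```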

Third, we introduce another group heterogeneity-based measure of local data utility after global recoding. When we collapse values into a group, data analysis treats all these values as the same; ideally, this would not be the case. If the number of groups is lower, the groups might be very heterogeneous. To measure heterogeneity, we could calculate average distances – that is, mean absolute deviations (MAD) – from the mean of newly created groups. We would calculate MAD between unit values (before recoding) and corresponding group averages (of groups after recoding), and compare them with the maximum theoretical average distances if all units were in one group only. The following formula for LDUunigr is proposed:

LDUunigr = 1 − [(1/s) Σj=1..p Σi=1..rj |xij − x̄j|] / [(1/r) Σi=1..r |xi − x̄|]

There are r records in the original dataset and s records in the anonymised dataset, with p groups/categories and rj units in a group. x̄ is the grand average (as if there were just one group), while x̄j are the means of the newly created groups in the anonymised data, calculated from values in the original data. If no units had been removed and no sampling carried out, then r = s.

We also have to take into consideration other possible changes to already recoded data, such as changes after carrying out local suppression or micro-aggregation. Therefore, we have to compare further distributions of recoded variables (original and protected data) using the approaches described for nonaggregated univariate-level interval or ratio variables, or categorical variables – that is, two-sample K–S or chi-squared tests (LDUunipsr). Local univariate data utility (LDUuni) for globally recoded and further protected variables is a product of local item data utility after global recoding (LDUunigr) and local item data utility after perturbation (LDUunipsr):

LDUuni = LDUunigr · LDUunipsr

3.4 Local bivariate data utility

To calculate local bivariate data utility (LDUbiv) scores, original and protected data can be compared at a local level using benchmarking statistics generated by appropriate bivariate statistical tests. Our local bivariate measures will be based on the Cramer's V IL measure (see the equation in Section 2.2), proposed by Shlomo (2010). Cramer's V is a chi-squared-based measure of nominal association for tables sized 2 × 2 and larger. For measures of ordinal, interval and ratio associations/effect sizes, other statistical tests reporting coefficients measuring the strength of the relationship between two variables should be applied. Table 1 lists statistical tests and association measures for pairs of variables based on their type and the normality of the distributions (McDonald 2009:308–313; Fritz et al. 2012).

Table 1 Bivariate tests and statistics used for local bivariate data utility calculation

Type of variables | Nominal | Ordinal | Non-normally distributed interval/ratio | Normally distributed interval/ratio
Nominal | Chi-squared test (Cramer's V) | – | – | –
Ordinal | Kruskal–Wallis H test (epsilon squared) | Spearman correlation (Spearman's rho) | – | –
Non-normally distributed interval/ratio | Kruskal–Wallis H test (epsilon squared) | Spearman correlation (Spearman's rho) | Spearman correlation (Spearman's rho) | –
Normally distributed interval/ratio | ANOVA F test (eta squared) | Spearman correlation (Spearman's rho) | Spearman correlation (Spearman's rho) | Pearson correlation (Pearson's r)
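Table 1 can be treated as a lookup when automating test selection. The sketch below encodes the table as a dictionary keyed by the pair of variable types; the type labels and function name are mine, not the paper's.

```python
# Sketch (not from the paper): Table 1 as a lookup from a pair of variable types to the test used.
TESTS = {
    ("nominal", "nominal"): "chi-squared test (Cramer's V)",
    ("nominal", "ordinal"): "Kruskal-Wallis H test (epsilon squared)",
    ("nominal", "non-normal"): "Kruskal-Wallis H test (epsilon squared)",
    ("nominal", "normal"): "ANOVA F test (eta squared)",
    ("ordinal", "ordinal"): "Spearman correlation (rho)",
    ("ordinal", "non-normal"): "Spearman correlation (rho)",
    ("ordinal", "normal"): "Spearman correlation (rho)",
    ("non-normal", "non-normal"): "Spearman correlation (rho)",
    ("non-normal", "normal"): "Spearman correlation (rho)",
    ("normal", "normal"): "Pearson correlation (r)",
}

def select_test(type_a: str, type_b: str) -> str:
    order = ["nominal", "ordinal", "non-normal", "normal"]
    a, b = sorted([type_a, type_b], key=order.index)   # normalise the pair's order
    return TESTS[(a, b)]
```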

In total, five different bivariate tests and corresponding association measures should be included in our data utility model, in addition to an optional one-sample K–S test for establishing whether a continuous variable is normally distributed. The ANOVA F test and Kruskal–Wallis H test could be used for measuring the association between one interval/ordinal and one nominal variable with only two categories, although the t test and Mann–Whitney test would generally be considered as the primary test options. When performing bivariate analysis tests, the following should generally be explored: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (effect size) (Page 2014). Therefore, in all cases, after the test has been selected and the results obtained, the following three steps should be followed to calculate local bivariate data utility:

Step 1: Establish whether significance levels for bivariate statistical tests are above a threshold (Pt; e.g. P ≥ 0.05 or P ≥ 0.01) for the same pairs of variables in original and protected data:
• If both P values are above the threshold, there is no association in either case, and LDUbiv = 1.
• If one P value is above and the other below the threshold, there is a significant difference in associations, and LDUbiv = 0.
• If both P values are below the threshold, there are statistically significant associations, and further review of the results is required (continue with the next step).

Step 2: Establish whether the direction of the association is the same for pairs of variables in original and protected data (comparing signs of correlation coefficient numbers, standardised residuals of chi-squared tests, the total sum of squares between groups for ANOVA and H test – see details below):
• If there are opposite directions, the relationship between variables is significantly different as a result of data protection, and LDUbiv = 0.
• If the directions are the same, further review of results is needed (continue with the next step).

Step 3: Calculate IL at the bivariate level using measures of association between two variables, and calculate local bivariate data utility (LDUbiv) using IL scores.

We can present the procedure in an algorithmic way (Figure 2).

Figure 2 Local data utility calculation procedure

[Figure 2: flow chart of the three-step procedure. Both P values at or above Pt: LDUbiv = 1. One P value above and one below Pt: LDUbiv = 0. Both P values below Pt: check the direction of the association and calculate LDUbiv from the IL score.]
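The three steps (and Figure 2) can be read as a small decision function. The sketch below is an illustration, not the paper's implementation; the argument names and the default threshold of 0.05 are assumptions.

```python
# Sketch (not from the paper): the three-step decision procedure for one pair of variables.
def local_bivariate_utility(p_orig, p_prot, same_direction, il_score, p_threshold=0.05):
    # Step 1: significance of the association in original vs protected data
    if p_orig >= p_threshold and p_prot >= p_threshold:
        return 1.0                      # no association in either file
    if (p_orig >= p_threshold) != (p_prot >= p_threshold):
        return 0.0                      # association appears or disappears after protection
    # Step 2: direction of the association
    if not same_direction:
        return 0.0                      # direction reversed by protection
    # Step 3: information loss from the change in effect size
    return 1.0 - il_score
```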

Establishing the direction of the association, and calculating IL and local bivariate data utility using statistics of different bivariate tests, are described below.

3.4.1 Tests for pairs of variables including at least one nominal variable – chi-squared test, Kruskal–Wallis H test, ANOVA F test

To establish whether the direction of the association is the same for pairs of variables in original and protected data (step 2), we should review and compare measures of the strength of the difference between either:
• observed and expected values, or
• group means and the grand mean, or
• group average ranks and the grand average rank.

The measures used in our model are chi-squared test standardised residuals, ANOVA F test sums of squares between groups, and Kruskal–Wallis sums of squares between groups (from average ranks). We prefer to introduce solutions that reduce the amount of manual input, reduce the need for manual review of bivariate results, and calculate IL and data utility automatically (e.g. using R code or an R package). One of the solutions is to automatically establish the change in those measures of the strength of the difference. In general, if the direction of the association between two categorical variables changes in the protected data, standardised residuals (SR) change sign. Although it is straightforward to notice a difference in 2 × 2 tables, larger contingency tables are more difficult to review manually or automatically. At least four approaches can be used to investigate a statistically significant chi-squared test result: calculating residuals, comparing cells, ransacking and partitioning (Sharpe 2015). Calculating standardised residuals seems to be the best solution for automatic estimation (or detection of a change) of the direction of the association. The residuals are calculated for a single cell of the contingency table as:

SRi = (Oi − Ei) / √Ei

where Oi is the observed value, and Ei is the expected value in the cell. On the other hand, for the ANOVA F test and Kruskal–Wallis H test, sums of squares between groups (SSBi) for individual groups, not a total sum of squares between groups for all groups (SSB), should be compared. ANOVA F test sums of squares between groups (contributions of individual groups) are calculated with an adjusted Fisher's (1925) equation for total SSB (removing the summation notation):

SSBi = ni (x̄i − x̄)²

where x̄i is the mean of the group, ni is the number of units in that group, and x̄ is the grand mean. For the Kruskal–Wallis test, the approach is based on the same principles as the SSB calculation for normally distributed continuous variables; the variables are ranked, group means (x̄i) are replaced with average group ranks (r̄i), and the grand mean (x̄) is replaced with the grand average rank (r̄) in the equation:

SSBi = ni (r̄i − r̄)², with r̄ = (N + 1)/2

where ni is the number of units in that group, and N is the total number of units in all groups.

For the calculation of the direction, it is important to note that squaring removes the sign of the difference:

ni (x̄i − x̄)² = ni (x̄ − x̄i)²

This means that we have to change the sign of the SSBi value in the equation below if the mean (or mean rank) of the group is lower than the grand mean (or grand mean rank). For standardised residuals, there is no need for the change of sign, since they are already negative for cells with observed values lower than expected values. Taking all those conditions into account, we propose to calculate the possible change of the sign in the following way:

Σi=1..n (|MSDi(Dorig) − MSDi(Dprot)| − |MSDi(Dorig)|)
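One way to operationalise the standardised-residual check described above is sketched below. It is a simplified illustration, not the paper's procedure: it treats the direction as unchanged only when every cell keeps the sign of its standardised residual, rather than using the aggregate expression proposed above.

```python
# Sketch (not from the paper): comparing signs of chi-squared standardised residuals
# between the original and protected contingency tables.
import numpy as np
from scipy.stats import chi2_contingency

def standardised_residuals(table):
    observed = np.asarray(table, dtype=float)
    expected = chi2_contingency(observed)[3]        # expected counts under independence
    return (observed - expected) / np.sqrt(expected)

def same_direction(table_orig, table_prot) -> bool:
    """True if every cell keeps the sign of its standardised residual after protection."""
    sr_o = standardised_residuals(table_orig)
    sr_p = standardised_residuals(table_prot)
    return bool(np.all(np.sign(sr_o) == np.sign(sr_p)))
```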

The measures of the strength of the difference (MSD) used in the formula are, as explained above, either standardised residuals (SRi) or sums of squares between groups (SSBi, for either the ANOVA F test or the Kruskal–Wallis H test). In the equation, n stands for either the number of contingency table cells (chi-squared test) or the number of all one-way compared groups (ANOVA, Kruskal–Wallis). The calculation is based on the following:
• If all measures of the strength of the difference values in the protected data Dprot equalled 0 (no association between the two variables), the expression above would also equal 0.
• If measures of the strength of the difference values in the protected data were closer to the original Dorig values than to 0, the expression above would be negative (the same direction for the association).
• If measures of the strength of the difference values in the protected data changed the sign for most cells/groups and/or higher SR/SSB values, the expression above would be positive (changed direction of the association).

To calculate IL in step 3, the Cramer's V IL measure proposed by Shlomo (2010) should be adjusted. Since IL and local data utility in our model can take values between 0 and 1, the multiplier 100 should be removed. Also, the Cramer's V for the protected data, as well as epsilon squared and eta squared, could theoretically be higher than the original coefficients of association. Therefore, the difference between the coefficients in the numerator should be absolute. Although it is not very likely, it should still be noted that the theoretical value of the IL score could be higher than 1 if the original data (Dorig) coefficient of association were much lower than the protected data (Dprot) coefficient (e.g. 0.5 and 0.2), using the proposed equation. Therefore, the higher of the two coefficients should be in the denominator. Last but not least, we decided to use squared coefficients of association in the equation for consistency, since some effect size measures are already squared (eta and epsilon). This means that IL increases more quickly with an increase in the difference between the coefficient values; in our opinion, this better reflects the decreased local bivariate data utility. The adjusted equation for IL and local bivariate data utility (LDUbiv) is:

LDUbiv = 1 − IL = 1 − |coeff²(Dorig) − coeff²(Dprot)| / max(coeff²(Dorig), coeff²(Dprot))

We can use this formula to measure IL with three different squared coefficients of association/effect size measures (coeff²) (Fritz et al. 2012):
• Cramer's V (squared), calculated as CV² = χ² / (n(k − 1)), where n is the sample size and k is the smaller of the number of rows and columns
• epsilon squared, calculated as ε² = H / ((n² − 1)/(n + 1)), where H is the Kruskal–Wallis test statistic and n is the sample size
• eta squared, calculated as a ratio between the sum of squares between groups (SSB) and the total sum of squares (TSS), η² = SSB/TSS.

3.4.2 Tests for pairs of variables that are at least ordinal – Spearman correlation and Pearson correlation coefficient

To establish whether the direction of the association is the same for pairs of variables in original and protected data, this time no additional tests are needed, and the signs of the correlation coefficients just need to be checked. To calculate IL and local bivariate data utility for continuous or ranked variables (i.e. ordinal or non-normally distributed continuous variables), the same equation as for IL for other tests in this model should be used, just with squared Spearman ρ or Pearson r correlation coefficients (corr_coeff²). The following equation for IL and utility is proposed:

LDUbiv = 1 − IL = 1 − |corr_coeff²(Dorig) − corr_coeff²(Dprot)| / max(corr_coeff²(Dorig), corr_coeff²(Dprot))
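The adjusted formula can be applied directly to correlation coefficients. The sketch below is illustrative: the simulated variables and the noise level are invented, and the direction check simply compares the signs of the two Spearman coefficients, as described in step 2.

```python
# Sketch (not from the paper): LDUbiv from squared Spearman correlations for one pair of variables.
import numpy as np
from scipy.stats import spearmanr

def ldu_biv(coeff_orig: float, coeff_prot: float) -> float:
    """1 - |a^2 - b^2| / max(a^2, b^2), using squared association coefficients."""
    a2, b2 = coeff_orig ** 2, coeff_prot ** 2
    return 1.0 - abs(a2 - b2) / max(a2, b2)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 0.6 * x + rng.normal(size=2000)          # correlated pair in the original data
x_prot = x + rng.normal(0, 0.5, size=2000)   # noise-perturbed version of x
rho_orig = spearmanr(x, y).correlation
rho_prot = spearmanr(x_prot, y).correlation
utility = 0.0 if np.sign(rho_orig) != np.sign(rho_prot) else ldu_biv(rho_orig, rho_prot)
```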

3.4.3 Tests for variables with changed type due to data protection

If SDC leads to variables changing their type, either from continuous to categorical or from normally to non-normally distributed, it is more challenging to compare measures of association – that is, effect sizes measured with different statistical tests. An eta squared value represents a significantly different effect size from the same Pearson r² value (Vacha-Haase & Thompson 2004:2–4). Moreover, Cohen (1988:224–227) provided a guide to the magnitude of the effect size for the chi-squared test (Cramer's V: 0.1 = small effect, 0.3 = medium effect, 0.5 = large effect for 2 × 2 contingency tables). However, he noted that the degree of association does not depend on the Cramer's V coefficient only, but also on the number of cells in contingency tables (degrees of freedom). Therefore, just like at the univariate level, local bivariate data utility for these variables might have to be calculated in an alternative way.

It is easier to compare effect sizes if one or both variables, measured at the continuous level, are recoded to the ordinal level and later perturbed, suppressed or not additionally protected. In that case, local bivariate data utility can be calculated by comparing effect sizes r² or ρ² (original data) with r² or ρ² (protected data) using the LDUbiv equation for pairs of variables that are at least ordinal. Generally, effect sizes decrease with global recoding, and top and bottom coding, and there is very little difference in effect sizes measured with Pearson and Spearman correlation coefficient tests for both normal and skewed distributions.

On the other hand, if one of the variables in the pair is measured at the nominal level, while the other one changes distribution or type (e.g. from interval to ordinal or binary), then eta squared, epsilon squared and Cramer's V coefficient values cannot be directly compared using the equations proposed in this paper. In that case, we would have to review how effect sizes of different statistical tests should be compared and interpreted.

3.5 Data utility of data with removed variables

Removing variables is considered one of the most extreme measures of data protection, unless only direct identifiers, identification numbers or variables irrelevant for analytical purposes are deleted from disseminated data. Although removing individual cases and its effect on local data utility is already taken into account with the local utility calculations for perturbed, suppressed or removed data, removing variables is a special case. The utility of data, calculated with the measures presented in Sections 3.1–3.4, decreases linearly with an increase in variables removed, since local univariate data utility and average local bivariate data utility for a removed variable equal 0. However, removing variables means removing valuable information about a measured concept, results in additional effects on global data utility and should be additionally penalised. As in the case of global recoding (decreasing values/categories), decreasing the number of variables limits bivariate analysis because it reduces the total number of pairs of variables that could be investigated at the bivariate level. A similar equation to the LDUunigr formula for categorical variables after global recoding can be used for this global reduction coefficient (GRC):

GRC = [n(Dprot)(n(Dprot) − 1)/2] / [n(Dorig)(n(Dorig) − 1)/2]

In the equation, n(Dorig) represents the number of variables in the original data, and n(Dprot) represents the number of variables in the protected data. If, for example, we removed 5 variables relevant for analytical purposes from a dataset with 100 variables, the GRC would equal 0.902. We would then multiply global data utility by this GRC, if required.

3.6 Multivariate data utility

Although data utility assessment at the multivariate level is not the primary focus of the proposed global data utility measure, it could nevertheless be integrated into the model under certain conditions. The inclusion of multivariate assessment would contribute to the overall estimation of data quality, since microdata are more often than not used for statistical modelling. However, the most notable issue of an unbiased multivariate global utility assessment is the subjective selection of variables for multivariate analysis. If not all variables of analytical interest are included in the model(s), the assessment is more local than global. Consequently, multivariate data utility could feasibly be extended to the special case of research data that focus almost exclusively on measuring a particular concept (e.g. Labour Force Survey).

Our adjusted Cramer's V IL measure, originally proposed by Shlomo (2010:84), could be used with different multivariate coefficients, such as:
• linear regression adjusted R² statistic (including all standardised beta coefficients in the model) for normally distributed continuous variables
• logistic regression Cox & Snell pseudo R² statistic (including all odds ratios) for binary dependent variables, and continuous and binary independent variables
• Cronbach alpha coefficient (including all if-item-deleted coefficients) for correlated continuous variables deriving an index.
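As an illustration of how the adjusted benchmarking formula could be reused with a multivariate coefficient, the sketch below compares the adjusted R² of the same linear regression fitted to original and protected data. It assumes statsmodels is available, and the model formula and variable names are hypothetical.

```python
# Sketch (not from the paper): the adjusted formula applied to a multivariate coefficient,
# here adjusted R^2 from the same linear model fitted to original and protected data.
import statsmodels.formula.api as smf

def adjusted_r2(df, formula="income ~ age + education"):   # hypothetical variables
    return smf.ols(formula, data=df).fit().rsquared_adj

def multivariate_ldu(df_orig, df_prot, formula="income ~ age + education"):
    a2 = adjusted_r2(df_orig, formula)                      # already a squared-type measure
    b2 = adjusted_r2(df_prot, formula)
    return 1.0 - abs(a2 - b2) / max(a2, b2)
```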

Multivariate data utility – both local (see Shlomo 2010) and global (see Woo et al. 2009) – has already been studied in the literature. However, how these approaches compare and how they could be integrated into benchmarking-based global utility models are yet to be investigated. This type of global data utility assessment should therefore be an area of ongoing research and development.

4 Discussion

Data protection literature still offers little guidance on how to efficiently measure global data utility. Although much research has been done on how to calculate identification risk and how to protect data to minimise it, less research has looked at how to measure IL at the unit record file level, and how to find the proper balance between global risk and global data utility. In this paper, we propose a solution for effective global data utility assessment. We prepared a user-centred global data utility measurement model, combining local univariate and local bivariate data utility measures. We argue that utility assessment should be expanded at least to the bivariate level, and that all variables, except for weights, other derived variables and ID numbers, should be included in the estimations.

Since we have proposed several individual steps and individual local measures, this model can also be used for univariate or bivariate utility assessment only, or for local utility measurement focusing on a limited number of distributions and associations. Five different bivariate measures are proposed to calculate local bivariate utility for pairs of variables of different types and distributions. It should be noted that the proposed procedure could theoretically be fully automated using a software environment for statistical computing such as R. For this reason, approaches to automatically establish the direction of association, using standardised residuals and sums of squares between groups, are proposed (an illustrative sketch follows at the end of this section).

Because the model is based on a benchmarking approach rather than on distance measures, it can be used with data protected by sampling, as well as by perturbative and synthetic data techniques. One possible application of the model is first to investigate the general global data utility of a dataset and then to use the results to focus on local IL, using a top-down approach.

One important aspect of this model is that certain data utility measurement results (e.g. combinations of variables with less accurate estimations of associations) can be shared with data users, whereas data protection details, based on disclosure risk measurements, often cannot be provided because they have the potential to reveal information that would help intruders to re-create original data and identify individuals. Since this is a global measure based on univariate and bivariate test-reported statistics (a non-individual-records approach), it could also be used to compare data for other purposes, such as evaluating different survey data weighting schemes, or comparing data collected with different survey modes or interviewers (investigating measurement errors).

To fully exploit this model, either program code or an R package would have to be developed. This would enable other users to assess data utility with as little effort and manual input as possible. They would only need to provide information about variable types (nominal, ordinal, continuous) in the original and protected data, report any global recoding and bottom and top coding performed, and provide both the original and the protected data.

One of the future challenges is to develop unbiased global measures of data utility at the multivariate level for different types of variables, knowing that even less restrictive data protection approaches may result in considerable information loss in complex statistical models. To start with, it would be particularly useful to establish the relationship between local univariate and bivariate data utility scores and local multivariate utility scores for the same range of variables, calculated using the data utility measures proposed in this paper.
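As a minimal illustration (not the implementation proposed in this paper), the R sketch below shows how the direction of association could be established automatically: standardised residuals from a chi-squared test for a pair of nominal variables, and the between-group sum of squares together with the ordering of group means for a nominal-continuous pair. The function names, and the original/protected data frames and variable names in the benchmarking example, are placeholders rather than part of the proposed model.

# A minimal sketch, assuming the original and protected data frames share the
# same variables and category labels; all names below are illustrative only.

chisq_direction <- function(x, y) {
  # Standardised residuals from the chi-squared test: the sign of each cell
  # shows whether it is over- or under-represented relative to independence.
  chisq.test(table(x, y))$stdres
}

anova_direction <- function(group, value) {
  # Between-group sum of squares and the ordering of group means for a
  # nominal-continuous pair; the ordering of means indicates the direction.
  group <- as.factor(group)
  fit <- aov(value ~ group)
  tab <- summary(fit)[[1]]                      # ANOVA table
  list(ss_between = tab[["Sum Sq"]][1],         # first row is the group term
       group_means = sort(tapply(value, group, mean, na.rm = TRUE)))
}

# Benchmarking idea: compute the same quantities on the original and the
# protected data and compare the resulting patterns, e.g. the share of
# residual signs that agree, as one input into local bivariate utility.
res_orig <- chisq_direction(original$education, original$employment)
res_prot <- chisq_direction(protected$education, protected$employment)
sign_agreement <- mean(sign(res_orig) == sign(res_prot))

In this sketch, close agreement between the original and protected patterns (residual signs, or the ranking of group means) would indicate that the direction of the association has been preserved after anonymisation.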

References

Cohen J (1988). Statistical power analysis for the behavioral sciences, 2nd edn, Lawrence Erlbaum Associates, New Jersey.
Cox LH, Karr AF & Kinney SK (2011). Risk-utility paradigms for statistical disclosure limitation: how to think, but not how to act. International Statistical Review 79(2):160–183.
Desai T, Ritchie F & Welpton R (2016). Five safes: designing data access for research, working paper, University of the West of England.
Domingo-Ferrer J, Mateo-Sanz JM & Torra V (2001). Comparing SDC methods for microdata on the basis of information loss and disclosure risk. In: Pre-proceedings of ETK-NTTS, vol. 2, 807–826.
Duncan G & Lambert D (1989). The risk of disclosure for microdata. Journal of Business & Economic Statistics 7(2):207–217.
— & Stokes SL (2004). Disclosure risk vs. data utility: the R-U confidentiality map as applied to topcoding. Chance 17(3):16–20.
Dupriez O & Boyko E (2010). Dissemination of microdata files: principles, procedures and practices, International Household Survey Network.
Fisher RA (1925). Statistical methods for research workers, Genesis Publishing.
Fritz CO, Morris PE & Richler JJ (2012). Effect size estimates: current use, calculations, and interpretation. Journal of Experimental Psychology: General 141(1):2.
Ghinita G, Karras P, Kalnis P & Mamoulis N (2009). A framework for efficient data anonymization under privacy and accuracy constraints. ACM Transactions on Database Systems 34(2):9.
International Household Survey Network (2017). Reducing the disclosure risk, www.ihsn.org/anonymization-risk-reduction.
Karr AF, Kohnen CN, Oganian A, Reiter JP & Sanil AP (2006). A framework for evaluating the utility of data altered to protect confidentiality. American Statistician 60(3):224–232.
Lambert D (1993). Measures of disclosure risk and harm. Journal of Official Statistics 9(2):313.
Larsen MD & Huckett JC (2010). Measuring disclosure risk for multimethod synthetic data generation. In: Proceedings: SocialCom 2010 – the Second IEEE International Conference on Social Computing, Minneapolis, 20–22 August, Institute of Electrical and Electronics Engineers, 808–815.
Loukides G & Gkoulalas-Divanis A (2012). Utility-preserving transaction data anonymization with low information loss. Expert Systems with Applications 39(10):9764–9777.
Machanavajjhala A, Kifer D, Gehrke J & Venkitasubramaniam M (2007). L-diversity: privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data 1(1):3.
Manning AM, Haglin DJ & Keane JA (2008). A recursive search algorithm for statistical disclosure assessment. Data Mining and Knowledge Discovery 16(2):165–196.
Massey FJ Jr (1951). The Kolmogorov–Smirnov test for goodness of fit. Journal of the American Statistical Association 46(253):68–78.
McDonald JH (2009). Handbook of biological statistics, 2nd edn, Sparky House Publishing, Baltimore.
Page P (2014). Beyond statistical significance: clinical interpretation of rehabilitation research literature. International Journal of Sports Physical Therapy 9(5):726.
Sharpe D (2015). Your chi-square test is statistically significant: now what? Practical Assessment, Research & Evaluation 20(8):1–10.
Shlomo N (2010). Releasing microdata: disclosure risk estimation, data masking and assessing utility. Journal of Privacy and Confidentiality 2(1):73–91.
Templ M (2011). Estimators and model predictions from the structural earnings survey for benchmarking statistical disclosure methods, Research Report CS-2011-4, Department of Statistics and Probability Theory, Vienna University of Technology.
—, Meindl B, Kowarik A & Chen S (2014). Introduction to statistical disclosure control (SDC), IHSN Working Paper 005, International Household Survey Network, http://www.ihsn.org/home/sites/default/files/resources/ihsn-working-paper-007-Oct27.pdf.
—, Kowarik A & Meindl B (2015). Statistical disclosure control for micro-data using the R package sdcMicro. Journal of Statistical Software 67(1):1–36.
Vacha-Haase T & Thompson B (2004). How to estimate and interpret various effect sizes. Journal of Counseling Psychology 51(4):473.
Winkler WE (1998). Producing public-use microdata that are analytically valid and confidential, Statistical Research Report Series RR98/02, US Census Bureau, Statistical Research Division, Washington, DC.
Woo MJ, Reiter JP, Oganian A & Karr AF (2009). Global measures of data utility for microdata masked for disclosure limitation. Journal of Privacy and Confidentiality 1(1):7.
Xu J, Wang W, Pei J, Wang X, Shi B & Fu AWC (2006). Utility-based anonymization using local recoding. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, 20–23 August, Association for Computing Machinery, New York, 785–790.

CENTRE FOR SOCIAL RESEARCH & METHODS
+61 2 6125 1279
[email protected]
The Australian National University
Canberra ACT 2601 Australia
www.anu.edu.au

CRICOS PROVIDER NO. 00120C