Neumann et al. Journal of Internet Services and Applications (2019) 10:1. https://doi.org/10.1186/s13174-018-0098-z

RESEARCH - Open Access

Pseudonymization risk analysis in distributed systems

Geoffrey K. Neumann, Paul Grace*, Daniel Burns and Mike Surridge

Abstract

In an era of big data, online services are becoming increasingly data-centric; they collect, process, analyze and anonymously disclose growing amounts of data in the form of pseudonymized data sets. It is crucial that such systems are engineered to both protect individual user (data subject) privacy and give back control of personal data to the user. In terms of pseudonymized data this means that unwanted individuals should not be able to deduce sensitive information about the user. However, the plethora of pseudonymization algorithms and tuneable parameters that currently exist make it difficult for a non-expert developer (data controller) to understand and realise strong privacy guarantees. In this paper we propose a principled Model-Driven Engineering (MDE) framework to model data services in terms of their pseudonymization strategies and identify the risks of breaches of user privacy. A developer can explore alternative pseudonymization strategies to determine the effectiveness of their pseudonymization strategy in terms of quantifiable metrics: i) violations of privacy requirements for every user in the current data set; ii) the trade-off between conforming to these requirements and the usefulness of the data for its intended purposes. We demonstrate through an experimental evaluation that the information provided by the framework is useful, particularly in complex situations where privacy requirements are different for different users, and can inform decisions to optimize a chosen strategy in comparison to applying an off-the-shelf algorithm.

Keywords: Privacy, Pseudonymization, Risk analysis

1 Introduction

Motivation. The creation of novel, personalized and optimized data-centered applications and services now typically requires the collection, analysis and disclosure of increasing amounts of data. Such systems will leverage data-sets that include personal data and therefore their usage and disclosure represent a risk to a user's (data subject) privacy. This data may be disclosed to trusted parties or released into the public domain for social good (e.g. scientific and medical research); it may also be used internally within systems to improve the provision of a service (e.g. to better manage resources, or optimize a service for individual users). Importantly, users have different viewpoints of what privacy means to them. Westin's privacy indexes provide evidence of this [1]. For example, users may wish to disclose sensitive information to support medical research, but not allow an environmental monitoring service to track their movements. However, there is a concern that individuals may simply not understand the implications for their own data within a service [2]; or this may reflect the complexity of technology itself [3]. Hence, systems should be developed to ensure that: i) each individual's privacy preferences are taken into account, and ii) risks to each and every user's privacy are minimized.

Anonymization is the process of removing directly identifying information from data. Anonymized datasets may contain both sensitive information, and information that could identify the person. Hence, concerns about preserving privacy have led to algorithms to pseudonymize data in such a way that personal information is not disclosed to an unwanted party, e.g. k-anonymization [4], l-diversity [5] and t-closeness [6]. In general, developers' understanding of external privacy threats is limited [7]. Non-experts in these algorithms will also find them difficult to understand in terms of: i) the impact of the risk of re-identification, and ii) the effect of the pseudonymization on the usefulness of the data for its intended purpose.

*Correspondence: [email protected]
IT Innovation, University of Southampton, Gamma House, Enterprise Road, SO16 7NS Southampton, UK


Hence, development frameworks that guide a privacy-by-design [8] process to achieve better privacy protection are required.

Contribution. In this paper we propose a Model-Driven Engineering (MDE) framework to model and analyze the privacy risks of a system employing pseudonymization. This provides the following key features:

• Modeling user privacy. The framework takes as input models of the system in terms of the data and its flow between systems and stakeholders (via data flow diagrams) and automatically generates a formal model of user privacy in the system (in terms of a Labeled Transition System).
• Pseudonymization risk analysis. The user-centered privacy model is then analyzed to automatically identify and quantify privacy risks based upon the modeled choices of pseudonymization. This returns the number of privacy violations that a pseudonymization strategy will or may result in, and an assessment of whether the data will still be useable for its intended purposes when the violations have been eliminated to an acceptable standard.

We argue that the framework can be used by non-pseudonymization experts at different stages of the development lifecycle to make informed decisions about: pseudonymization strategies, whether pseudonymized data is safe to release, and whether to impose restrictions on when data should be released. A system designer, using profiled user data, can understand privacy risks early in the design process, and ensure that user privacy is a foundational requirement to be maintained. The framework can also be used by systems developers (or data controllers) to make decisions about data disclosure based on the information from the users in the real dataset.

Outline. Section 2 first describes how the framework is used to model user-centred privacy-aware systems that leverage pseudonymization, before describing how potential pseudonymization risks are identified in Section 3. Section 4 reports how the framework carries out data utility analysis. Section 5 describes how the framework is applied following privacy-by-design practice, and the framework is evaluated in Section 6. Section 7 describes related work, and Section 8 provides a conclusion and indicates areas of future work.

2 Modelling privacy aware pseudonymization systems

Here we describe a framework to model privacy aware systems. This follows two steps. First, the developer models their system; second, a formal model of user privacy in this system is generated.

2.1 Step 1: modelling a privacy aware system

The developers of the system create a set of artifacts that model the behaviour of their system:

• A set of Data-Flow diagrams that model the flow of personal data within a system, in particular focusing on how data is exchanged between the actors and datastores. We utilize data-flow diagrams because they are an existing well-understood form of modeling that simply captures data behaviours.
• The data schema associated with each datastore; this highlights the individual fields and also those containing personal data, and potentially sensitive personal data.

We now consider a simple example to illustrate how these elements form the input to the modelling framework. Two data-flow diagrams are given in Fig. 1.
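To make the modelling inputs concrete, the sketch below shows one possible in-memory representation of a flow arrow and its three labels (fields, purpose, execution order). It is illustrative only: the class and field names are ours and do not reflect the framework's actual input format.

import java.util.List;

// Hypothetical sketch of the data-flow artifacts described above; the
// framework's real input representation may differ.
public class FlowArrowSketch {
    // A flow arrow is labelled with: the data fields that flow, the purpose
    // of the flow, and a numeric value giving the order of execution.
    record FlowArrow(String from, String to, List<String> fields, String purpose, int order) {}

    public static void main(String[] args) {
        FlowArrow query = new FlowArrow(
                "Administrator", "HealthcareDB",
                List.of("age", "height", "weight"),
                "produce pseudonymized research dataset", 1);
        System.out.println(query);
    }
}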

Fig. 1 Example data-flow diagram for sharing data

The nodes represent either an actor (oval) or a datastore (rectangle). The datastores are labelled by two objects: the first is the identifier for the datastore, and the second are the data schemas. The actual data flow is represented by directed arrows between the ovals and rectangles, henceforth referred to as flow arrows. Each flow arrow is labelled with three objects: the set of data fields which flows between the two nodes, the purpose of the flow, and a numeric value indicating the order in which the data flow is executed. We assume datastore interfaces that support querying and display of individual fields (as opposed to coarse-grained records). The example illustrates the process where an administrator queries data from a healthcare database to produce a pseudonymized version that is stored in a data store; from which data is queried by researchers.

2.2 Step 2: automatically generating an LTS privacy model

In this section we provide a formal model of user privacy that is generated based upon the input data-flow diagrams (we do not detail the transformation algorithm in this paper, see [9]). User privacy is modeled in terms of how actor actions on personal data change the user's state of privacy. We define an actor to be an individual or role type which can identify the user's personal data. Depending on the service provided, each actor may or may not have the capability to identify personal data. Hence, a user's privacy changes if any of their personal data has been or can be identified by an actor. Prior models following this approach are: a Finite State Machine (FSM) [10, 11] or a Labelled Transition System (LTS) [12]. The common theme in both is that the user's privacy at any point in time is represented by a state, and that actions, executed by actors, taken on their personal data can change this state. We build upon these approaches and extend them to label both states and transitions in such a way that the model can be analysed to understand how, and why, the user's privacy changes. This novel contribution allows us to represent not only the sharing of a user's personal information, but also the potential for a user's personal information to be shared. This is the case when personal information is stored in a datastore that can be accessed by multiple individuals.

The key elements of our model (illustrated in Fig. 2) are:

• States: are representations of the user's privacy. They are labelled with variables to represent two pre-dominant factors: whether a particular actor has identified a particular field, or whether an actor could identify a field. These variables, henceforth known as state variables, take the form of Booleans, and there are two for each actor-data field pair (has, could). The state label s1 is given the table values shown in Fig. 2.
• Transitions: represent actions (collect, create, read, disclose, anon, delete) on personal data performed by actors. They are labelled according to: i) an action, ii) the set of data fields, iii) the data schema that the data field is a part of, iv) the actor performing the action. There are two optional fields: i) a purpose that explains the reason a particular privacy action is being taken, and ii) a privacy risk measure to identify risks associated with this action (whose value is calculated and annotated during risk analysis).
• Pseudonymization: the disclosure of pseudonymized versions of each sensitive field is modelled using the anon transition. State variables (i.e. can access, has accessed) can be declared on these fields. For example an analyst may have access permission for the field weight.anon but may not have permission to access weight.
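To illustrate the state labelling described above, the following minimal Java sketch stores a (has, could) pair of Booleans per actor-field pair. The types and names are hypothetical; they are not the framework's internal LTS representation.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a privacy state label: two Booleans per
// actor-field pair, recording whether the actor has identified the
// field and whether it could identify it.
public class PrivacyStateSketch {
    record ActorField(String actor, String field) {}
    record StateVariables(boolean has, boolean could) {}

    private final Map<ActorField, StateVariables> label = new HashMap<>();

    void set(String actor, String field, boolean has, boolean could) {
        label.put(new ActorField(actor, field), new StateVariables(has, could));
    }

    public static void main(String[] args) {
        PrivacyStateSketch s1 = new PrivacyStateSketch();
        // e.g. a researcher can access the pseudonymized weight field but has
        // not (and cannot) access the raw weight field
        s1.set("researcher", "weight.anon", false, true);
        s1.set("researcher", "weight", false, false);
        System.out.println(s1.label);
    }
}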

Fig. 2 A state-based model of user privacy

This will mean that they may be allowed access to pseudonymized weight data for statistical purposes but should be prevented from matching any value to an individual.

2.3 Step 3: considering user privacy preferences

Each user has a policy which controls all aspects of how their data is allowed to move through the system. This section will in brief outline how these policies operate. As they control more aspects of the system than just pseudonymization, for a more detailed description refer to literature describing the system overall [12]. These policies, which are derived from information obtained from a questionnaire, have three main aspects:

1 Which services the user agrees to use (based upon the purpose of the service, and optionally their trust of actors in that service).
2 The sensitivities the user has about certain fields, represented by either a sensitivity category (low, medium, high for example), or a number which takes a value between 0 and 1 indicating how sensitive the user is to disclosure of that data.
3 The overall sensitivity preferences of a user. Users are categorised as having one of three preference levels: unconcerned, pragmatist and fundamentalist [1]. These three categories indicate, respectively, that the user requires a low, medium or high level of restriction on data and which actions are allowed on their data. An unconcerned user will have few restrictions while a fundamentalist user will have many restrictions.

Within the context of the LTS, user policy operates on transitions and is used to determine whether a transition should or should not be allowed. A transition that occurs despite not being permitted is referred to as a policy violation. An actor that is not permitted to access a given field will be referred to as an unauthorised actor while one who is allowed will be referred to as an authorised actor.

3 Risk analysis

3.1 Identifying pseudonymization risk transitions

In the context of this paper risk refers to a danger that a single piece of information about a user is obtained by an entity or individual that is not authorised to have access to that information. In the case of pseudonymization this danger takes the form of a quantifiable probability. Some risks may be acceptable and it is impossible to eliminate all risk. In the case of pseudonymization, risks associated with a sufficiently low probability can be ignored. Which risks may be ignored is determined by a combination of system policy and the preferences of the individual user to which the data refers, their user policy.

Pseudonymization is incorporated into the LTS via the anon transition. This creates pseudonymized versions of each field. State variables (i.e. can access, has accessed) can be declared on these fields in the same way as for non-anonymized fields and are also subject to the same permissions. For example, an analyst may have access permission for the field weight.anon but may not have permission to access weight. This corresponds to an actor having access to a pseudonymized version of the database. When they are only permitted access to pseudonymized versions of sensitive fields they should be prevented from attaching any sensitive values to specific users.

There are two key types of risk which are considered:

1. Re-identification: the risk that a person whose personal data is pseudonymized within a disclosed data set can be re-identified.
2. Value prediction: the risk of a sensitive value being matched to an individual.

Techniques such as k-anonymization [4] prevent re-identification but do not guarantee that there is not still a value prediction risk. For example, suppose that after k-anonymization a k-set about human physical attributes contains 10 records, 9 of which have a weight over 100kg. If an attacker knows their target is in that k-set they can be 90% certain that their target has a weight over 100kg. This means that a privacy disclosure risk is still present.

In this version of the model, we focus on value prediction risk. Note that although alternatives to k-anonymization, for example l-diversity, may eliminate the danger of value prediction, this framework is designed to provide an assessment of any pseudonymization parameters that the user wishes to use. As stated, this framework does not attempt to produce a definitive pseudonymization technique.

The system will automatically discover and add to the LTS transitions that correspond to a possibility of unauthorized value prediction due to pseudonymization, referred to as risk-transitions. A risk that a given actor (actor.A) can access a given sensitive field (field.F) is said to be present in every node in the LTS where the pseudonymized version of F (F.anon) has been accessed by A. If A only has access rights to F.anon and not F, risk-transitions will be added to the LTS starting from each of these at-risk nodes. These will be marked as not allowed and it will be possible to calculate risk scores or declare policy associated with these transitions. Each risk transition is uniquely identified by the set of quasi-identifier fields which the actor has already accessed.

3.2 Scoring risk transitions - violations

To complete the generated privacy model, a numeric value or values are calculated and added to the risk transitions; these state how concerned a user should be about this transition.

To consider this, first a number of terms need to be defined:

• Risk is the danger of an individual value associated with an individual user being predicted.
• Violations are risks within the dataset which are higher than an acceptable threshold.
• A Threshold is a numeric limit above which a risk is judged to be unacceptable.
• A Margin specifies how close values in a continuous field are required to be in order to be judged equal for the purpose of risk calculation.

Thresholds are associated with individual values and every single value may have a different threshold. This is because they are calculated from a combination of overall system policy and an individual user's policy. Initially, an individual's preference level (unconcerned, pragmatist or fundamentalist as mentioned in the previous section) will dictate how high the default of their thresholds for each field should be set. Once these have been defined a user may specify particular sensitivities attached to certain fields and this will decrease the threshold value for those fields. Overall system policy will define how these user-dependent factors influence thresholds. Non-expert users often state preferences about privacy that are directly opposite to their actions [13], and it cannot be assumed that they are fully aware of the implications of revealing their data. Hence, it is the responsibility of the system designer to ensure that overall system policy does not allow thresholds outside of an acceptable range regardless of user preference.

The score used is simply the total count of violations within the data. It is calculated using the marginal probabilities of values within k-sets as illustrated in the pseudocode in Algorithm 1. Note that this process is associated with a single risk transition and therefore, as risk transitions are associated with the danger of accessing a single sensitive field, what we are calculating is the violation count for just one sensitive field (denoted v, as the sensitive field corresponds to a one dimensional array of values).

The following terms are used in this pseudocode:

• recordSet. The complete set of records.
• F is the set of fields that the attacker has access to.
• k^F-set. This is the set of records which appear to be identical given the information available. In k-anonymization the k-set is the set of records that appear to be identical when the quasi identifier fields are pseudonymized. The k^F-set is the set of records which appear to be identical in the pseudonymized data when all quasi identifiers except for those in set F are masked.
• k^F-set_r is the k^F-set which record r belongs to. The size of this set is also referred to as |k^F-set_r|.
• matches(k^F-set, R_r) is the number of records within k^F-set_r where the sensitive value in question matches the real value in record r to within the margin.
• threshold_rv is the threshold associated with value v in record r.
• margin_v is the margin associated with the sensitive field/set of values v. This will be 0 if the field is not continuous.

Algorithm 1 Calculating Violations
  calculateViolations(recordSet, v)
    violations = 0
    for all r ∈ recordSet do
      set = k^F-set_r
      risk = |matches(k^F-set, R_r)| ÷ |set|
      if risk > threshold_rv then
        violations = violations + 1
      end if
    end for
    return violations

4 Statistical utility

Once violations have been detected they will normally need to be removed. It may be possible to leave some violations in the data depending on whether policy defines hard restrictions or soft restrictions. Removing will affect the statistical properties of the data and so will affect the results of experiments that researchers carry out on the data. The issue is compounded by the fact that it is harder to protect the privacy of outliers in the data [14]. Because of this, violations are not likely to be evenly spread across the data distribution, which increases the possible impact on the data of removing them. To address this, when data-sets are applied to the model transitions a utility report for each risk transition is produced that shows both the violation count and a utility score which gives an indication of the impact that removing data has had on the dataset's statistical utility.

4.1 Statistics in the utility report

It is impossible to define what being usable from a research point of view actually entails as this depends on the purpose for which the data is intended. The intention in our system is that the data from which violations have been removed, the trimmed data, is sufficiently similar to the original data that the difference in results from any test or statistical analysis is not sufficient to affect conclusions. This is impossible to guarantee and so our approach is to remove as few records as possible and to provide a list of statistics for both the original data and the trimmed data that is reasonably comprehensive and enables the data controller to make their own decision.
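Before moving on, the violation count of Algorithm 1 (Section 3.2) can be made concrete with a short, self-contained Java sketch. It assumes records have already been grouped into k^F-sets and that each value carries its own threshold; the class, method and parameter names are ours, not the framework's.

import java.util.List;

// Hypothetical sketch of Algorithm 1: count values whose prediction risk
// exceeds their per-value threshold. Grouping into k^F-sets is assumed
// to have been done already.
public class ViolationCounter {

    // kfSets:     sensitive values grouped by k^F-set
    // thresholds: per-value thresholds, aligned with kfSets
    // margin:     two values within this distance count as a match
    static int calculateViolations(List<double[]> kfSets, List<double[]> thresholds, double margin) {
        int violations = 0;
        for (int s = 0; s < kfSets.size(); s++) {
            double[] set = kfSets.get(s);
            for (int i = 0; i < set.length; i++) {
                int matches = 0;
                for (double other : set) {
                    if (Math.abs(other - set[i]) <= margin) {
                        matches++;               // includes the value itself
                    }
                }
                double risk = (double) matches / set.length;
                if (risk > thresholds.get(s)[i]) {
                    violations++;
                }
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // Example k^F-set of weights (kg): threshold 0.75 everywhere, margin 5 kg
        double[] weights = {70, 80, 74, 74, 74, 76};
        double[] thresholds = {0.75, 0.75, 0.75, 0.75, 0.75, 0.75};
        System.out.println(calculateViolations(
                List.of(weights), List.of(thresholds), 5.0)); // prints 4
    }
}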

The use of a standard set of statistics may also provide some reassurance. It is important to note that differential privacy [15] is a privacy-preserving method that provides such strong privacy guarantees for certain statistical operations. The approach differs from pseudonymization in that noise is added to the results of statistical queries on the data. A system designer may choose to engineer this solution, but our tool considers cases where system designers have chosen pseudonymization methods.

With these considerations in mind it was decided that the utility report should for now provide the following information:

• How many values need to be eliminated for the number of violations to reach zero. Note that this number is not necessarily equal to the number of violations. This number shall be referred to as Rv for removed violations and the total number of violations shall be referred to as Nv for number of violations.
• The complete set of statistics calculated for each dataset provided by the Apache commons statistics function (endnote 1), which includes the statistics shown in Table 1.

Table 1 Statistical values used in Utility Report
  Name                 Definition
  Maximum              The maximum value in the data-set.
  Minimum              The minimum value in the data-set.
  Skewness             The asymmetry of a frequency-distribution curve.
  Kurtosis             The sharpness of the peak of a frequency-distribution curve.
  Median               The middle value in a data-set's values.
  Mean                 The average value of the data-set.
  Standard deviation   The measure of spread or dispersion of a data-set.

The assumption of independence. Currently the process of violation removal and utility reporting operates on a single sensitive field. This means that a violation is a single value at danger of prediction rather than an entire record. When a violation is removed only the relevant value is removed and the rest of the record is left untouched. This also means that the statistics above are calculated from the vector of values associated with the relevant field. Correlations between fields may be important for research purposes and so the intention is that this will be introduced into the utility report at a later stage.

4.2 Removing violations

Removing violations is not as simple as removing every value which is in violation. This strategy may, in certain circumstances, involve the removal of an unnecessary number of values and may also not be sufficient to remove all violations. This is because, on one hand, removing only a subset of violations may push the number of values within a certain range below the violation threshold while, on the other hand, removing all violations reduces the size of the k-set and so may create a situation where previously non-violating values become violations. Further complications are created as two or more sensitive values do not have to be exactly equal to be considered identical from a violation point of view, due to the use of margins. Additionally, the fact that different values may have different thresholds attached to them based on user policies makes the problem even more complex.

Table 2 provides an example of both of these situations. The sensitive field we are operating on is the weight field, the threshold is 0.75 for all values and the margin is 5 kilograms. The data has been divided into two k^F-sets as shown below; violating values are marked with an asterisk. In set 1 it is only necessary to remove one value for violations to be eliminated. If all violations are removed then not only will an unnecessary number of values be removed, but the set will only have one remaining member, item 1. This will now be classed as a violation as it is, by definition, identical to 100% of members of its set.

Table 2 Risk values for 2-anonymization data records
  Set 1                  Set 2
  No  Weight (kg)        No  Weight (kg)
  1   70                 1   70
  2   77 *               2   80
  3   78 *               3   74 *
  4   75 *               4   74 *
  5   79 *               5   74 *
                         6   76 *

An intuitive response might be to remove k − n + 1 violations at random, where k is the number of violations in the set and n is the minimum number of values for a given value to be identical to (or close to, given the use of margins), including itself, in order to be considered a violation based on its threshold. For example, if the threshold is 0.75 for value v and there are 10 values in v's set then v would be in violation if it is close to 8 or more values (including itself). In this case n would be equal to 8.

There are two problems with this approach. Firstly, different values may have different thresholds based on user policy and therefore there isn't a single n value for an entire group of violations; secondly, the fact that we use margins also means that this approach would sometimes produce the wrong result. This second problem is illustrated by set 2 in Table 2. This set contains 4 violations in a set of size 6.

As we are assuming a threshold of 0.75 for all values and 4/6 < 0.75, k is already lower than n. According to the approach described above no values should need removing at all and the set should already be free of violations. However, the set clearly does contain violations and this is due to the use of margins. Values 3 to 5 are in violation as they are each close (within 5 kg) to 5 other members of their set. They are each close to the other violating values as well as to value 1 (70 kg). Value 6 is also in violation as it is close to each of the other violating values and to value 2 (80 kg). Value 1 and value 2 are not in violation themselves but they must be considered in any violation removal process as their presence is responsible for other values being in violation. In this situation not just the number of violating values removed but also which violating values are removed is important. If value 6 is removed then the set will be left with 4 violating values, all values within the range 70 to 74. If value 3, 4 or 5 is removed then the set will be left with only 2 violating values as neither a value of 70 nor 76 will lead to a violation.

As it is clearly non-trivial to predict exactly how many and which violations should be removed in all situations that may occur, the process of removing violations was made iterative. It was achieved through a modification of the k − n + 1 technique described above. As thresholds vary based on user preferences, a default threshold defined in the system is used to approximate how many violations should be removed. As this may not succeed in removing all violations it is followed by checking each value individually for violation and repeating this process if necessary.

Algorithm 2 describes the process in detail (a Java sketch of this loop is given at the end of this section). The terms used in this pseudocode are defined in the following list:

• recordSet. The complete set of records.
• violations[data]. Records within the dataset data containing a violation.
• F is the set of fields that the attacker has access to.
• k^F-set. This is the set of records which appear to be identical given the information available.
• k^F-set_r is the k^F-set which record r belongs to.
• threshold. The default threshold value that each value has as its threshold before individual user policies are applied.

Algorithm 2 Removing Violations
  removeViolations(recordSet, v)
    do
      violations[recordSet] = calculateViolations(recordSet, v)
      for all r ∈ recordSet do
        set = k^F-set_r
        if set not already cleaned then
          numRemover = ((1 − threshold) ∗ |set|) − (|set| − |violations[set]|)
          ▷ randomly select (without replacement) numRemover violating values and remove them
          cleanSet(set, violations[set], numRemover)
        end if
      end for
    while violations exist in recordSet

As can be seen in the pseudocode, for each set violations are removed, the set is rechecked, and this is repeated until no violations remain.

Note that, although techniques such as l-diversity [5] can eliminate the risk of value re-identification, this system is not designed to be an alternative to such techniques. It allows finer control in the situation where every single value has its own associated risk threshold. It allows greater transparency for a non-expert user. We assume that a data controller has pseudonymized using a technique that they understand and have reasons for choosing. The resulting data is modified as minimally and transparently as possible in our system.

5 Pseudonymization analysis framework

We have implemented the prior described functionality in a software framework (developed in Java); this is available for download (endnote 2). This section outlines the use cases for systems under design that wish to analyze the pseudonymization strategies employed; and importantly the steps that a developer follows to use this tool. For this description we assume that the data-flow model has been created and the LTS has been generated (note, this must contain at least one instance in which the data is pseudonymized for the tool to be useful). The main use cases are as follows:

1 The user, described as a Data Controller, is working on a live system and wishes to check the data currently in the system to ensure that an unacceptable level of policy violations isn't occurring and rethink the pseudonymization approach if it is.
2 The user, described as a System Designer, is designing a system and wishes to devise a pseudonymization strategy that is unlikely to lead to an unacceptable level of policy violations.
3 The user, either a System Designer or a Data Controller, wishes to impose conditions on any movement of pseudonymized data in the system so that transitions are automatically disallowed when the number of violations is above a threshold. An error will be displayed if the system attempts to send data that is a violation of these restrictions.
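Returning to the removal procedure of Algorithm 2 (Section 4.2), a rough Java sketch for a single k^F-set is given below. It follows the same structure (estimate how many values to drop from the default threshold, remove violating values at random, re-check), but the helper names and the exact rounding of numRemover are our own simplifications rather than the framework's implementation.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the iterative removal in Algorithm 2 for one k^F-set.
public class ViolationRemover {

    static boolean isViolation(List<Double> set, int i, double threshold, double margin) {
        long matches = set.stream()
                .filter(v -> Math.abs(v - set.get(i)) <= margin)
                .count();                        // includes the value itself
        return (double) matches / set.size() > threshold;
    }

    static List<Double> removeViolations(List<Double> set, double threshold, double margin) {
        List<Double> cleaned = new ArrayList<>(set);
        while (true) {
            List<Integer> violating = new ArrayList<>();
            for (int i = 0; i < cleaned.size(); i++) {
                if (isViolation(cleaned, i, threshold, margin)) violating.add(i);
            }
            if (violating.isEmpty()) return cleaned;
            // Approximation of numRemover from Algorithm 2; remove at least one value.
            int numRemove = (int) Math.ceil((1 - threshold) * cleaned.size()
                    - (cleaned.size() - violating.size()));
            numRemove = Math.max(1, Math.min(numRemove, violating.size()));
            Collections.shuffle(violating);
            violating.subList(numRemove, violating.size()).clear();
            violating.sort(Collections.reverseOrder());
            for (int idx : violating) cleaned.remove(idx);   // drop values, keep the rest of the record
        }
    }

    public static void main(String[] args) {
        // Set 1 from Table 2: threshold 0.75, margin 5 kg
        List<Double> set = new ArrayList<>(List.of(70.0, 77.0, 78.0, 75.0, 79.0));
        System.out.println(removeViolations(set, 0.75, 5.0));
    }
}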

The overall workflow is shown in Fig. 3. Actions are marked with a "UC" caption to indicate which use case they are associated with.

For use case 3, which may well be performed after use cases 1 or 2, the user provides rules of the form (a sketch of such a check is given below):

IF numberOfViolations > acceptableLimit [OR/AND] utilityThreshold < acceptableLimit ... THEN block transition action

There may be multiple utility threshold parameters and any combination may be used. These will generally take the form of differences between the key statistics described in Section 4.

For use cases 1 and 2 the process is essentially the same. The user starts by viewing the risk transition in the LTS of most concern to them; this is the maximum risk transition, i.e. the transition that has the most violations. As has already been stated, risk transitions are uniquely identified by the quasi identifiers that have already been accessed and the maximum risk transition will always be the one in which all quasi identifiers have been accessed. The user will view the utility report associated with that transition and will then select an action based on that report. As an alternative to automatically viewing the maximum risk transition a user may also select risk transitions on the LTS manually.

This action may consist of either accepting the pseudonymization strategy as it is or exploring alternative pseudonymization strategies. Alternative pseudonymization strategies may involve either exploring alternative pseudonymization techniques, for example switching from k-anonymization to l-diversity, or considering allowing access to only a subset of fields. The latter option is equivalent to choosing a different risk transition. A list of all risk transitions associated with the target sensitive field will be displayed alongside the associated number of violations for each. The user selects one of these to generate the utility report associated with this new risk transition. The ability to see different subsets of fields accompanied by their violation count addresses the problem of high dimensionality in pseudonymization techniques such as k-anonymity [16]. This provides an easy to understand metric for comparing different subsets of fields and understanding which fields have the greatest cost in terms of re-identification risk. While most solutions attempt to address the question of whether it is possible to reduce violation cost while still sending all fields to the researcher, this may not be possible and the researcher may not need all fields.

If the user chooses to accept a pseudonymization strategy they may do so either before or after violations have been removed. Clearly this depends on the nature of these violations with regard to user policy and system policy, as any hard violation or violation associated with legally required policy must be removed.
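A minimal sketch of evaluating such a use-case 3 rule is shown below. The identifiers, the RiskTransition type and the choice of a mean-delta utility measure are illustrative assumptions, not the tool's API.

// Hypothetical sketch of a disclosure rule of the form described above.
public class DisclosureRule {
    record RiskTransition(int numberOfViolations, double meanDelta) {}

    // e.g. IF numberOfViolations > acceptableLimit OR |mean delta| > maxMeanDelta
    // THEN block the transition
    static boolean block(RiskTransition t, int acceptableLimit, double maxMeanDelta) {
        return t.numberOfViolations() > acceptableLimit
                || Math.abs(t.meanDelta()) > maxMeanDelta;
    }

    public static void main(String[] args) {
        RiskTransition disclose = new RiskTransition(116, 0.39);
        if (block(disclose, 50, 0.25)) {
            System.out.println("Error: transition disallowed by disclosure rule");
        }
    }
}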

Fig. 3 Overall workflow diagram

In a live system, accepting a pseudonymization strategy will lead to data being disclosed as the risk is judged to be acceptable. In a system being designed, accepting a pseudonymization strategy will mean defining that, either before or after violation removal, this transition will be part of the system and will always occur unless this rule is subsequently changed. The only difference between use cases 1 and 2 is how the utility report is generated. In a live system it will be generated with live data while in a system being designed sample data will be used.

6 Evaluation

To evaluate we apply the framework to particular use cases and observe the extent to which a user of the framework is informed about the risks of pseudonymization. The following use cases are described in turn and show how utility and violations may be considered. For both, we prepared a health record set to undergo 2-anonymization. A researcher has access to this anonymized data but does not and should not have access to the original data. The policy violation that we wish to avoid is the researcher being able to predict an individual's weight to within 5kg (margin = 5).

6.1 Uniform privacy use case

In this case we assume that the threshold is 0.9 for all users; all users are considered to have the same privacy preferences. This means that an attacker predicting any individual's weight with a 90% or above confidence is considered a violation. Age and height are quasi identifiers. Table 3 provides six sample records input to the model analysis process and shows how, as more identifying fields become available to the researcher, the number of violations of this policy increases. The risk columns show the proportion of records with matching quasi identifiers (considering only those that the attacker has access to) where weight is within 5kg of the weight of the current record. If this proportion is above 0.9 then it is highlighted as a violation.

Table 3 Risk values for 2-anonymization data records
  Age    Height (cm)  Weight (kg)  Height risk  Age risk  Age+height risk
  30-40  180-200      100          2/4          2/2       2/2
  30-40  180-200      102          2/4          2/2       2/2
  20-30  180-200      110          2/4          3/4       2/2
  20-30  180-200      111          2/4          3/4       2/2
  20-30  160-180      80           1/2          1/4       1/2
  20-30  160-180      110          1/2          3/4       1/2
  Violations:                      0            2         4

The framework generated the LTS as shown in Fig. 4. Dotted lines indicate potential policy violations. A system administrator has the option of loading the six records given as examples above into the LTS. They would then see the violation scores 0, 2 and 4 as shown in this figure. A utility report would be generated showing the cost to statistics of eliminating these violations. If the number of violations or the utility cost is unacceptable the data controller can consider increasing their k value or reconsider their pseudonymization entirely. Alternatively, at the design phase, a system designer could declare that a number of violations above 50% is unacceptable. The system would now throw an error if the above data was used, forcing the administrator to choose another form of pseudonymization.

Note that this situation assumes a uniform threshold for all users. If this is not the case the number of violations will change. Suppose, for example, that we know from a user questionnaire that the owner of row 4 is a data fundamentalist while all other users are pragmatists. The system designer may have created a rule that fundamentalists should, by default, have their threshold level set to 0.7 rather than 0.9. In row 4 Risk w/height is at 3/4. This is not a violation in the default situation as it is below 0.9, but now that the user is a fundamentalist it becomes a violation. Similarly, the owner of row 3 may be a pragmatist but may have declared that weight is an especially sensitive field for them. The system designer may have created a rule that especially sensitive fields, when their owner is a pragmatist, also have a threshold of 0.7, leading to the same outcome for row 3.

6.2 Realistic privacy distribution

This use case features a larger dataset with 1103 realistic records; we show that the framework can be utilised to make decisions about data disclosure taking into account user privacy preferences.

6.2.1 Generating sample data

The data used was randomly generated to approximate data from the United States national health survey of 2007-2010 [17]. Among other things, this survey provides the height, BMI (Body Mass Index), sex, race (including non-hispanic white, non-hispanic black and hispanic) and age of 11039 adults categorised by race, sex and age. For each of these categories the mean height and BMI and the standard error for each are provided. We have generated a dataset that is 10% of the size of the original survey using the same distributions. The number of individuals in each category in our dataset is exactly 10% of the number in the original data and for each category its own unique mean and standard error is used to generate these individuals. Table 4 describes our dataset. In this case weight is the sensitive field and age, sex, race and height are all quasi identifiers.
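Returning to the uniform use case, the violation counts in Table 3 can be reproduced directly: group the six records by the quasi-identifiers visible to the researcher and flag any record whose weight is predictable to within 5 kg with probability above 0.9. A self-contained sketch (names are ours) follows; it prints 0, 2 and 4.

import java.util.*;

// Hypothetical sketch reproducing the Table 3 violation counts.
public class UniformPolicyCheck {
    record Rec(String age, String height, double weight) {}

    static int violations(List<Rec> recs, boolean useAge, boolean useHeight) {
        Map<String, List<Rec>> groups = new HashMap<>();
        for (Rec r : recs) {
            String key = (useAge ? r.age() : "") + "|" + (useHeight ? r.height() : "");
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        int count = 0;
        for (Rec r : recs) {
            List<Rec> set = groups.get((useAge ? r.age() : "") + "|" + (useHeight ? r.height() : ""));
            long matches = set.stream().filter(o -> Math.abs(o.weight() - r.weight()) <= 5).count();
            if ((double) matches / set.size() > 0.9) count++;   // threshold 0.9, margin 5 kg
        }
        return count;
    }

    public static void main(String[] args) {
        List<Rec> recs = List.of(
                new Rec("30-40", "180-200", 100), new Rec("30-40", "180-200", 102),
                new Rec("20-30", "180-200", 110), new Rec("20-30", "180-200", 111),
                new Rec("20-30", "160-180", 80),  new Rec("20-30", "160-180", 110));
        System.out.println(violations(recs, false, true)); // height only -> 0
        System.out.println(violations(recs, true, false)); // age only    -> 2
        System.out.println(violations(recs, true, true));  // age+height  -> 4
    }
}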

Fig. 4 Pseudonymization risk analysis output

Table 4 Data distribution
  Sample size  Sex     Age range  Race      Mean height (cm)  Height std error  Mean BMI  BMI std error
  80           Male    20-39      White     178.4             0.35              27.7      0.25
  83           Male    40-59      White     178.3             0.28              29.2      0.22
  110          Male    60+        White     174.6             0.22              29.2      0.18
  36           Male    20-39      Black     176.9             0.39              28.7      0.39
  37           Male    40-59      Black     176.7             0.53              29.4      0.38
  36           Male    60+        Black     174.4             0.42              28.8      0.32
  57           Male    20-39      Hispanic  171.1             0.48              28.5      0.33
  58           Male    40-59      Hispanic  170.3             0.36              29.5      0.24
  39           Male    60+        Hispanic  167.3             0.45              29.2      0.32
  82           Female  20-39      White     164.9             0.25              27.5      0.41
  86           Female  40-59      White     163.8             0.27              28.3      0.24
  108          Female  60+        White     160.3             0.22              28.7      0.2
  40           Female  20-39      Black     163.7             0.32              31.4      0.46
  38           Female  40-59      Black     163.5             0.38              33.1      0.49
  37           Female  60+        Black     160.6             0.28              31.1      0.33
  67           Female  20-39      Hispanic  158.2             0.23              28.8      0.23
  58           Female  40-59      Hispanic  157.1             0.33              30.2      0.34
  51           Female  60+        Hispanic  153.7             0.31              29.9      0.17
  Total = 1103
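Section 6.2.1 states that each category's mean and standard error were used to generate individuals. A minimal sketch of that kind of sampling, assuming a normal distribution and deriving weight from BMI and height, is shown below; it is an approximation of the described procedure, not the authors' generation code.

import java.util.Random;

// Hypothetical sketch of sample-data generation for one category of Table 4.
public class SampleDataSketch {
    public static void main(String[] args) {
        Random rng = new Random(42);
        // e.g. 80 white males aged 20-39: mean height 178.4 cm (s.e. 0.35),
        // mean BMI 27.7 (s.e. 0.25); normality is an assumption here.
        for (int i = 0; i < 80; i++) {
            double height = 178.4 + 0.35 * rng.nextGaussian();
            double bmi = 27.7 + 0.25 * rng.nextGaussian();
            double weightKg = bmi * Math.pow(height / 100.0, 2); // weight derived from BMI
            System.out.printf("Male,20-39,White,%.1f,%.1f%n", height, weightKg);
        }
    }
}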

Policy Distributions. Each record is associated with an imagined individual through a one-to-one mapping and each imagined individual has their own privacy level. Some of these hypothetical data owners also consider the weight field to be extra sensitive. The 3 by 2 matrix given in Table 5 shows how a user's privacy level controls the threshold used both for fields they consider to be extra "sensitive" and for "normal" sensitive fields in our example. The number of users that fall into each sensitivity category is taken directly from our own survey of the general population (endnote 3). Table 6 shows the proportion of users falling into each category in our data and also shows how many consider weight to be an extra sensitive field.

Table 5 Overall policy
              Unconcerned  Pragmatist  Fundamentalist
  Normal      1            0.9         0.8
  Sensitive   0.9          0.8         0.7

Table 6 User policy distribution
  Pragmatist                   49%
  Fundamentalist               30%
  Unconcerned                  21%
  Regard weight as sensitive   10%
  Regard weight as normal      90%

6.2.2 Results

The sample dataset was pseudonymized initially using k-anonymity (carried out using the arx software [18] and using its default settings). As expected, a utility report was first generated for the transition in which every quasi identifier had been accessed. A screenshot of this report in the k-anonymity scenario is shown in Fig. 5.

In this case 116 violations were detected and 117 values (about 1% of values) needed to be removed. This has had no effect on the maximum and minimum and has had a slight effect on the standard deviation, the skew and kurtosis. The data controller or system designer may decide that this is an acceptable statistical impact and approve the sending of the data with violations removed, they may approve the sending of the original data or they may opt instead to try an alternative pseudonymization algorithm.

Removing Fields. Choosing the "try removing fields" option will result in a dialog box as shown in Fig. 6. This dialog box allows the data controller to generate a utility report for a smaller subset of fields and clearly shows how variable the number of violations is for different subsets. Even for subsets containing three out of the four fields this varies from 29, if only age is unknown, to 107, if sex is unknown. In this case it is perhaps not surprising that sex has little impact on the effectiveness of pseudonymization as it only contains two categories while age contains a large number of unique values and so would be expected to contribute significantly to disclosure risks. It may be more significant in this case that removing height leaves 38 violations and so has somewhat less of an impact than age, possibly due to a greater correlation between height and weight. It is perhaps most interesting that removing race has a much greater impact than removing sex (down to 67 violations) despite only containing 3 categories and despite sex being, one would assume, correlated to weight to a greater extent than race. This may be due to the unequal numbers of individuals in each racial group.

The usefulness of this information will of course depend on the purpose of the data and whether any fields can reasonably be removed. It could, however, also be used to inform decisions such as the weight placed on different fields when tuning pseudonymization algorithms or indicate which fields may need their hierarchical categories rethought. These results may seem relatively straightforward and predictable in this example but this functionality becomes more useful as the number of fields increases and in situations where removing, partially obscuring or reconsidering multiple fields at once is a possibility. This is especially true with multiple fields as they may interact with each other in unpredictable ways.

6.3 Comparing with l-diversity and other algorithms

In this use case we assess the extent to which the framework can be used to make decisions about the pseudonymization algorithm chosen. An alternative pseudonymization algorithm that a data controller may consider is l-diversity. Unlike k-anonymity, l-diversity [5] protects against value prediction by guaranteeing a range of different values in every set. Entropy based l-diversity goes one step further in ensuring that no single value dominates so as to protect against probabilistic predictions. Using l-diversity would therefore seem a logical strategy if the framework has revealed that using k-anonymity results in a large number of violations. For this reason we applied l-diversity using Shannon entropy to the data, again using the arx framework and default settings.

As before, a utility report was generated initially on the scenario where all fields have been read. As the key metric of utility is the difference between statistics calculated from the unmodified data and statistics from data which has been pseudonymized and had violations removed, the differences (or α) between each pair of statistics was recorded. This was done for both k-anonymity and l-diversity in order to compare the utility loss between the two algorithms. The results are shown in Table 7.
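The statistics behind these utility reports come from the Apache Commons Math DescriptiveStatistics class named in endnote 1. A minimal sketch of computing the before/after differences (the α values reported in Table 7) might look like this; the field values below are placeholders.

import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

// Sketch of computing utility deltas with Apache Commons Math; the
// surrounding code and example values are illustrative only.
public class UtilityDelta {
    static DescriptiveStatistics stats(double[] values) {
        DescriptiveStatistics ds = new DescriptiveStatistics();
        for (double v : values) ds.addValue(v);
        return ds;
    }

    public static void main(String[] args) {
        double[] original = {81.0, 70.5, 95.2, 102.3, 64.8};  // weights before removal
        double[] trimmed  = {81.0, 70.5, 95.2, 64.8};         // weights after removal
        DescriptiveStatistics before = stats(original);
        DescriptiveStatistics after = stats(trimmed);
        System.out.printf("mean delta: %.2f%n", after.getMean() - before.getMean());
        System.out.printf("std-dev delta: %.2f%n",
                after.getStandardDeviation() - before.getStandardDeviation());
        System.out.printf("skewness delta: %.2f%n", after.getSkewness() - before.getSkewness());
        System.out.printf("kurtosis delta: %.2f%n", after.getKurtosis() - before.getKurtosis());
        System.out.printf("median delta: %.2f%n",
                after.getPercentile(50) - before.getPercentile(50));
    }
}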

Fig. 5 Screenshot of Utility Report

Fig. 6 Screenshot of Violation Scores

Table 7 Risk values for 2-anonymization data records
  Statistic              Original  k-anonymity (α)  l-diversity (α)
  Number of violations   -         116              67
  Number removed         -         117              68
  Minimum                46        0                0
  Maximum                134       0                0
  Mean                   81.87     -0.39            -0.32
  Standard deviation     13.06     -0.31            -0.15
  Median                 81        1                1
  Skewness               0.22      0.06             0.05
  Kurtosis               -0.04     0.06             0.03

As would be expected, fewer violations need to be removed when l-diversity is used. However, the overall effect on utility is similar between the two algorithms. l-diversity also doesn't eliminate the need to remove violations altogether and the number of violations removed is still more than half the number removed when using k-anonymity. This is despite the fact that l-diversity, Shannon entropy l-diversity in particular, is designed for precisely this scenario when we are attempting to eliminate value predictions. Note that probabilistic value predictions in particular are the focus of this system and entropy based l-diversity is not expected to eliminate probabilistic value predictions, and so l-diversity based on recursion, which is designed for this purpose, was also used [5]. This did not work for our data. Using the default arx settings it failed to produce any results. t-closeness was also tried and this, similarly, did not produce any results [19]. Both of these techniques would presumably require fine tuning to be effective and how to do this is not always clear. In our scenario, values having different probability thresholds complicates the issue of avoiding probabilistic value predictions. Eliminating value prediction to the probability threshold required by fundamentalist users on their most sensitive fields would be so restrictive if applied to every value that l-diversity would be unlikely to be successful. The use of a fixed margin to define equality in continuous values may further complicate matters. More generally, we are operating on a large dataset with a lot of values for height, weight and age, several quasi identifiers and a lot of correlation between fields, and so it seems likely that more restrictive pseudonymization algorithms would tend to need tuning before being successful.

This study has shown our framework's ability to give detailed information on which pseudonymization strategy is preferable, compared to just choosing an individual strategy. This demonstrates that in a situation such as this one, simply choosing a more restrictive algorithm such as l-diversity or t-closeness is not sufficient for protecting user privacy.

7 Related work

The system discussed here incorporates pseudonymization into a wider framework. As we aim to provide novelty in how an easy to use but thorough pseudonymization framework is integrated into a wider privacy framework, this section will discuss both comparable pseudonymization techniques and comparable privacy frameworks from the literature.

7.1 Pseudonymization and differential privacy

Many pseudonymization algorithms have been developed, beginning with k-anonymity [4, 20]. This method divides data into groups such that no record is uniquely identifiable. It does not, however, address the issue that groups may contain only a single value for a sensitive field and so fails to eliminate the risk of value prediction. It also suffers from the curse of high dimensionality. With too many fields, dividing data into sets can become impossible [16].

To address the issue of value prediction, l-diversity was proposed by Machanavajjhala et al [5]. This extends k-anonymity by also ensuring that each individual sensitive value is well represented in each set. l-diversity guarantees that each q-set will contain at least l values for a sensitive field. This is, however, insufficient to protect against probabilistic attacks. Although multiple values are guaranteed to exist in each q-set there is no guarantee that a single value won't still dominate. To address this, alternative versions of l-diversity such as entropy l-diversity and recursive l-diversity aim to ensure that no value dominates. This objective does, however, remain hard to achieve in practice and may be overly limiting. t-closeness ensures that the distribution of each sensitive attribute in each set is close to its distribution in the overall table [19]. A limitation of t-closeness is that it does not protect against identity disclosure.

Alternatively, pseudonymization may be achieved through perturbation. Perturbation involves creating new values from the same distribution as the original values [21]. This suffers from the drawback that it assumes independence between variables and so potentially very useful correlation information is lost to researchers [22]. Perturbation methods also do not guarantee that a record is indistinguishable from a quantifiable number of other records in the way that k-anonymity does [22].

All of these approaches suffer from the disadvantage that they do not take account of individual data subjects having different requirements for different sensitive fields. In order to address this the concept of personalized privacy protection was developed by Xiao and Tao [23].

This approach is similar to the work described in this paper as it involves soliciting information from each user as to which sensitive fields are most important and giving these fields higher priority for protection. This goes some way towards what we are proposing but it defines sensitive fields in terms of their sensitivity relative to other fields. It does not allow the integration of a system of numeric sensitivity levels or allow for users to have a low or high sensitivity preference overall.

An alternative method for incorporating user preference is the condensation based approach proposed by Aggarwal et al [24]. In this approach the data is divided into groups of records which are guaranteed not to be distinguishable from each other, as with k-anonymity. The difference is that the size of each group is determined by the sensitivity level of the most sensitive value within it. That is to say, a value that a data subject specifies as highly sensitive will be allocated to a large group to minimise the probability of value prediction. This solution faces potential efficiency issues as less sensitive values may be placed into groups with more sensitive values and therefore may receive a higher level of pseudonymization than necessary. This method also uses perturbation and so involves modifying sensitive values, possibly to a greater extent than is required.

In terms of the presentation of pseudonymization for users there are a number of tools available to anonymize data, which also provide some risk analysis feedback. The ARX Tool [18] provides methods for analyzing re-identification risks following the prosecutor, journalist and marketer attacker models on a number of anonymization algorithms. The Cornell Anonymization Toolkit (CAT) [25] performs risk analysis by evaluating the disclosure risks of each value in pseudonymized data based on user specified assumptions about the adversary's background knowledge. These tools offer important insights to identify privacy risks; and in our approach we seek to integrate similar capabilities (alongside fine-grained user privacy consideration) into our privacy-by-design methodology for developing distributed data services.

Differential privacy [15] is a technique that provides strong formal guarantees of the privacy of users where personal data is released from statistical databases. Rather than changing the dataset itself (as with pseudonymization methods), differential privacy adds noise to the output of queries performed on the data. This guarantees that for any query, or sequence of queries, a subject and their personal data cannot be identified. Hence, it is an alternative method to achieving the same results as using the pseudonymization framework of this paper. However, there are downsides in the face of such strong privacy guarantees; i.e. the difficulty in developing the correct noise functions, the cases where the need to add too much noise reduces the statistical utility of the data, and also the situations where data is released without knowing what functions will be performed on it. Hence, developers will continue to consider pseudonymization methods, which this tool supports. There are tools and frameworks to help non-experts carry out differential privacy, e.g. PSI [26] and Adaptive Fuzz [27]; therefore an interesting avenue of future research is to consider privacy requirements and risk across the differing methods.

7.2 Privacy frameworks

Both Fischer [11] and Kosa [10] define formal models of privacy in terms of state machine representations. Their purpose is to demonstrate that a system complies with privacy regulations and requirements. Such models offer strong building blocks that our formal privacy model builds upon; in particular moving from hand-crafted specifications to auto-generated models that underpin the process and privacy risk analysis.

MAPaS [28] is a model-based framework for the specification and analysis of privacy-aware systems. Centred upon a UML Profile, purpose-based access control systems are modelled and the framework allows queries to be executed to identify errors in the design.

LINDDUN [29] is a framework for performing privacy threat analysis on the design of a system in order to select privacy enforcing mechanisms that mitigate the threats. This combines a data flow model of the system with a privacy threat catalogue to provide a design-time methodology to assess privacy risks. We similarly employ a data-flow oriented methodology but explore the extent to which risk can be analysed automatically via the generation of an underpinning formal model. Further, we consider the use of MDE methods beyond the design phase (and in particular analysis of running systems with real users).

A system's behaviour should be matched against its own privacy policy. [30] models a system's behaviour in terms of a Business Processing Model Notation (BPMN) diagram and then the goal is to check whether this is compliant with the system's P3P privacy policy. [31] integrate links to the privacy policy in the system's workflow (e.g. the BPEL specification); these are then checked by an analysis tool at design time to determine if the workflow agrees with the policy. [6] provide a similar method; rather than having a designer merge the workflow and policy, the approach converts both models (a BPEL specification and P3P policy) into a graph representation before formally analyzing the correctness of the graph. However, all of these solutions only check if a system behaves according to its stated privacy policy (our LTS can be similarly analysed); there has been limited research into the evaluation of a system in terms of privacy risk considering fine-grained user preferences.

8 Conclusion

8 Conclusion
This paper has discussed the pseudonymization aspect of a software framework for measuring privacy risk in a multi-actor system. It identifies pseudonymization risk and provides easily understandable information to help a user without expert knowledge choose a pseudonymization strategy. The concept of an individual policy is central to this system, to ensure that the preferences of individual data subjects are taken into account. Risk is quantified based on policy violations, and also on what impact removing these violations will have from a statistical point of view. Transparency is also key. By integrating these concerns into pseudonymization, and integrating pseudonymization into a larger privacy framework, the system developed goes beyond existing pseudonymization techniques. We have demonstrated through an experimental evaluation that, in variable policy scenarios, simply choosing a more thorough pseudonymizing technique is not sufficient to eliminate violations, and so more information is needed to help the user choose a pseudonymization strategy. This system may be useful either for a system designer designing a system or for a data controller supervising a system.

Future Work. A set of standard pseudonymization techniques could be provided so that the user does not have to manually input pseudonymized data, although with the existence of more complete tools such as ARX this may not be necessary. Challenges here concern the huge number of tunable parameters involved in the pseudonymization process. Utility metrics will also ultimately be improved; in particular, correlations will be incorporated, as correlations between fields are often more important in data analysis than the statistical properties of individual fields. Risk (and consequently violation) calculations can be improved by incorporating more situations, such as when an attacker does not know the value of a quasi-identifier exactly for their target, but access to this quasi-identifier, pseudonymized or not, may still provide clues for value prediction.
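As a concrete illustration of the per-field statistics involved, the utility calculations in the implementation use Apache Commons Math's DescriptiveStatistics class (see Endnote 1). The sketch below shows, using invented field values, how the statistical properties of a single field, and the kind of field-to-field correlation proposed above as future work, might be compared before and after pseudonymization. It is an illustrative assumption rather than the framework's actual utility metric, and the use of PearsonsCorrelation is our own choice for the example.

import org.apache.commons.math3.stat.correlation.PearsonsCorrelation;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class UtilitySketch {

    // Summarises a numeric field with the statistics class referenced in Endnote 1.
    static DescriptiveStatistics summarise(double[] values) {
        DescriptiveStatistics stats = new DescriptiveStatistics();
        for (double v : values) {
            stats.addValue(v);
        }
        return stats;
    }

    public static void main(String[] args) {
        // Invented example: a BMI field before pseudonymization, and after
        // generalising each value to the midpoint of a coarse 5-unit band.
        double[] bmiOriginal    = {18.2, 22.5, 24.1, 27.8, 31.4, 26.3, 23.9, 29.0};
        double[] bmiGeneralised = {17.5, 22.5, 22.5, 27.5, 32.5, 27.5, 22.5, 27.5};
        double[] age            = {21, 34, 29, 45, 58, 41, 33, 50};

        DescriptiveStatistics before = summarise(bmiOriginal);
        DescriptiveStatistics after  = summarise(bmiGeneralised);

        // Distortion of single-field statistics caused by the pseudonymization step.
        System.out.printf("mean    : %.2f -> %.2f%n", before.getMean(), after.getMean());
        System.out.printf("std dev : %.2f -> %.2f%n",
                before.getStandardDeviation(), after.getStandardDeviation());
        System.out.printf("median  : %.2f -> %.2f%n",
                before.getPercentile(50), after.getPercentile(50));

        // Field-to-field correlation, the property the future work above proposes to track.
        PearsonsCorrelation corr = new PearsonsCorrelation();
        System.out.printf("corr(age, BMI) : %.2f -> %.2f%n",
                corr.correlation(age, bmiOriginal), corr.correlation(age, bmiGeneralised));
    }
}

A utility report built from differences of this kind would let a data controller see how much a candidate strategy distorts the properties that matter for the intended analysis.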
We are also investigating the integration of the rule-based decision system (predicated on the utility reports) into distributed systems software, such as privacy-oriented middleware and cloud protection systems. This would enable fine-grained user privacy enforcement, where the system automatically makes decisions as to whether a disclosure of pseudonymized data will be allowed for any occurring distributed system action.

Endnotes
1 org.apache.commons.math3.stat.descriptive.DescriptiveStatistics
2 https://github.com/OPERANDOH2020/op-privacy-modelling
3 Details of this user privacy study are currently under submission.

Abbreviations
BMI: Body mass index; BPMN: Business Process Model and Notation; BPEL: Business Process Execution Language; CAT: Cornell Anonymization Toolkit; LTS: Labelled transition system; MDE: Model-driven engineering

Acknowledgements
This work was supported by the European Commission under the Horizon 2020 Programme (H2020), as part of the OPERANDO project (Grant Agreement no. 653704).

Availability of data and materials
The software and data supporting the conclusions of this article are available from https://github.com/OPERANDOH2020/op-privacy-modelling.

Authors' contributions
GN conceived the pseudonymization risk analysis algorithms, and designed and performed the experiments. PG, DB, MS and GN conceived and developed the underlying user privacy model theory and framework design. GN and PG implemented the software framework used in the paper. MS supervised the research project. All authors discussed the results, contributed to, and approved the final manuscript.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 2 May 2018 Accepted: 24 October 2018

References
1. Kumaraguru P, Cranor LF. Privacy indexes: a survey of Westin's studies. 2005.
2. Acquisti A, Brandimarte L, Loewenstein G. Privacy and human behavior in the age of information. Science. 2015;347(6221):509–14.
3. Hildebrandt M. Profile transparency by design? Re-enabling double contingency; 2013.
4. Sweeney L. k-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl-Based Syst. 2002;10(05):557–70.
5. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-diversity: privacy beyond k-anonymity. In: 22nd International Conference on Data Engineering (ICDE'06); 2006. p. 24.
6. Li YH, Paik H-Y, Benatallah B. Formal consistency verification between BPEL process and privacy policy. In: Proceedings of the 2006 International Conference on Privacy, Security and Trust. ACM; 2006. p. 26.
7. Hadar I, Hasson T, Ayalon O, Toch E, Birnhack M, Sherman S, Balissa A. Privacy by designers: software developers' privacy mindset. Empir Softw Eng. 2018;23(1):259–89.
8. Cavoukian A. Privacy by design: take the challenge. Information and Privacy Commissioner of Ontario, Canada. 2009.
9. Grace P, Burns D, Neumann G, Pickering B, Melas P, Surridge M. Identifying privacy risks in distributed data services: a model-driven approach. In: Proceedings of the 2018 IEEE International Conference on Distributed Computing Systems. ICDCS '18; 2018.
10. Kosa TA. Towards measuring privacy. PhD thesis, University of Ontario Institute of Technology. 2015.
11. Fischer-Hübner S, Ott A. From a formal privacy model to its implementation. In: Proceedings of the 21st National Information Systems Security Conference; 1998.
12. Grace P, Surridge M. Towards a model of user-centered privacy preservation. In: Proceedings of the 12th International Conference on Availability, Reliability and Security. ARES '17. New York: ACM; 2017. p. 91:1–91:8.
13. Norberg PA, Horne DR, Horne DA. The privacy paradox: personal information disclosure intentions versus behaviors. J Consum Aff. 2007;41(1):100–26.
14. Hay M, Miklau G, Jensen D, Towsley D, Weis P. Resisting structural re-identification in anonymized social networks. Proc VLDB Endowment. 2008;1(1):102–14.
15. Dwork C. Differential privacy: a survey of results. In: Agrawal M, Du D, Duan Z, Li A, editors. Theory and Applications of Models of Computation. Berlin: Springer; 2008. p. 1–19.
16. Aggarwal CC. On k-anonymity and the curse of dimensionality. In: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment; 2005. p. 901–9.
17. Fryar CD, Gu Q, Ogden CL. Anthropometric reference data for children and adults: United States, 2007–2010. Vital Health Stat. Series 11, Data from the National Health Survey. 2012;252:1–48.
18. Prasser F, Kohlmayer F, Gkoulalas-Divanis A, Loukides G. Putting statistical disclosure control into practice: the ARX data anonymization tool. Cham: Springer; 2015. p. 111–48.
19. Li N, Li T, Venkatasubramanian S. t-closeness: privacy beyond k-anonymity and l-diversity. In: 2007 IEEE 23rd International Conference on Data Engineering; 2007. p. 106–15.
20. Samarati P, Sweeney L. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, SRI International. 1998.
21. Agrawal R, Srikant R. Privacy-preserving data mining. In: ACM SIGMOD Record, vol. 29. ACM; 2000. p. 439–50.
22. Aggarwal CC, Philip SY. A condensation approach to privacy preserving data mining. In: International Conference on Extending Database Technology. Springer; 2004. p. 183–99.
23. Xiao X, Tao Y. Personalized privacy preservation. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. SIGMOD '06. New York: ACM; 2006. p. 229–40.
24. Aggarwal CC, Yu PS. On variable constraints in privacy preserving data mining. In: Proceedings of the 2005 SIAM International Conference on Data Mining. SIAM; 2005. p. 115–25.
25. Xiao X, Wang G, Gehrke J. Interactive anonymization of sensitive data. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. SIGMOD '09. New York: ACM; 2009. p. 1051–4.
26. Murtagh J, Taylor K, Kellaris G, Vadhan S. Usable differential privacy: a case study with PSI. ArXiv e-prints. 2018.
27. Winograd-Cort D, Haeberlen A, Roth A, Pierce BC. A framework for adaptive differential privacy. Proc ACM Program Lang. 2017;1(ICFP):10:1–10:29.
28. Colombo P, Ferrari E. Towards a modeling and analysis framework for privacy-aware systems. In: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing; 2012. p. 81–90.
29. Deng M, Wuyts K, Scandariato R, Preneel B, Joosen W. A privacy threat analysis framework: supporting the elicitation and fulfillment of privacy requirements. Requir Eng. 2011;16(1):3–32.
30. Chinosi M, Trombetta A. Integrating privacy policies into business processes. J Res Pract Inf Technol. 2009;41(2):155.
31. Short S, Kaluvuri SP. A data-centric approach for privacy-aware business process enablement. In: International IFIP Working Conference on Enterprise Interoperability. Springer; 2011. p. 191–203.