Contributions to Record Linkage for Disclosure Risk Assessment
by Jordi Nin Guerrero

Advisor: Dr. Vicenç Torra i Reventós

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor in Artificial Intelligence at the Departament de Ciències de la Computació, Universitat Autònoma de Barcelona

To Mireia, who makes my dreams her own. And to Xavier, whom we await with enthusiasm.

I am only a child playing on the beach, while vast oceans of truth lie undiscovered before me.
Isaac Newton (1642-1727)

Contents

Acknowledgements
Abstract
1 Introduction
1.1 Motivations
1.2 Contributions
1.3 Structure of the Document
2 Preliminaries
2.1 Aggregation Functions
2.2 Time Series
2.3 Re-identification Methods
2.4 Microdata Protection Methods
2.5 Information Loss and Disclosure Risk
2.6 Data Sets Description
3 Microaggregation Analysis
3.1 Attribute Selection in Multivariate Microaggregation
3.2 Modeling Projections in Microaggregation
3.3 Improving Microaggregation for Complex Record Anonymization
4 Specific Disclosure Risk Measures
4.1 Rank Swapping Record Linkage
4.2 Alignment Record Linkage
4.3 Projected Record Linkage
5 Record Linkage using Fuzzy Integrals
5.1 An Alternative Disclosure Risk Scenario
5.2 Experiments
6 Time Series Protection
6.1 Time Series Protection
6.2 Time Series Information Loss Measures
6.3 Time Series Disclosure Risk Measures
6.4 Final Trade-off Evaluation
6.5 Experiments
7 Conclusions and Future Directions
7.1 Summary of Contributions
7.2 Conclusions
7.3 Future Directions

List of Figures

2.1 Record Linkage Schema.
2.2 Data set protection and release process.
2.3 Disclosure Risk Scenario.
3.1 Points which have been artificially generated to obtain databases with non-correlated attributes, in 2 dimensions (a), and 3 dimensions (b).
3.2 Quantifier Q(A) = Q(|A|/N) where Q(x) = x and a represents the values of one record of the data set.
3.3 Graphical representation of the DR (a) and Score (b) of (PCP, Zscores and Sugeno) microaggregation using a = 4 and k = 5, 15, 25.
3.4 Mic1D-κ schema.
4.1 Graphic representation of disclosure risk.
4.2 Graphic representation of the p-distribution swap interval.
4.3 Graphic representation of the results obtained by the rank swapping disclosure risk measure applied to the Census data set (a) and EIA data set (b), protected with standard rank swapping.
4.4 Graphic representation of the results obtained by the rank swapping disclosure risk measure applied to the Census data set (a) and EIA data set (b), protected with rank swapping p-distribution.
4.5 Graphic representation of the results obtained by the rank swapping disclosure risk measure applied to the Census data set (a) and EIA data set (b), protected with rank swapping p-buckets.
4.6 Graphic representation of the information loss (a), disclosure risk (b) and score (c) values for the three rank swapping methods when the Census data set is protected.
4.7 Graphic representation of the information loss (a), disclosure risk (b) and score (c) values for the three rank swapping methods when the EIA data set is protected.
4.8 Graphic representation of the number of links obtained with different record linkage techniques, applied to the Census data set protected with optimal microaggregation (a) and MDAV microaggregation (b) using k = 50.
4.9 Graphic representation of the number of links obtained with different record linkage techniques, applied to the EIA data set protected with optimal microaggregation (a) and MDAV microaggregation (b) using k = 50.
4.10 Percentage of correct links obtained with different record linkage techniques, applied to the Census data set (a) and EIA data set (b), protected with PCP microaggregation with v = 4.
4.11 Percentage of correct links obtained with different record linkage techniques, applied to the Census data set (a) and EIA data set (b), protected with Zscores microaggregation with v = 4.
4.12 Percentage of correct links obtained with different record linkage techniques, applied to the Census data set (a) and EIA data set (b), protected with MDAV with v = 4.
5.1 Graphical representation of an artificial problem that satisfies Assumption 2: Data set A with attributes {Benefits, Start-up costs} and data set B with attribute {Business type}. In this figure, business types are represented using squares, ellipses, triangles, and so on.
5.2 Graphical representation of Q^e_α for α = 1/5, ..., 10/5.
5.3 Graphical representation of Q^s_α for α = 0, 0.1, ..., 0.9.
5.4 Graphical representation of Q^t_α for α = 0, 0.1, ..., 0.9.
6.1 Graphical representation of distance function selection.
6.2 Graphical representation of the effects of time series normalization: (a) represents the original data without normalization, (b) represents normalized data with independent normalization, (c) represents normalized data with time series normalization.

List of Tables

2.1 Rank swapping example.
2.2 Optimal univariate microaggregation example.
2.3 Projection based microaggregation example.
2.4 MDAV microaggregation example.
2.5 Attributes of the Census data set. In the first column, id stands for the attribute identifier used in this thesis; in the second column, Name stands for the identifier used in the source of the data set; and in the third column, a brief description of the attribute is given.
2.6 Attributes of the EIA data set. In the first column, id stands for the attribute identifier used in this thesis; in the second column, Name stands for the identifier used in the source of the data set; and in the third column, a brief description of the attribute is given.
2.7 Attributes of the Abalone data set. In the first column, id stands for the attribute identifier used in this thesis; in the second column, Name stands for the identifier used in the source of the data set; and in the third column, a brief description of the attribute is given.
2.8 Attributes of the Dermatology data set. In the first column, id stands for the attribute identifier used in this thesis; in the second column, Name stands for the identifier used in the source of the data set; and in the third column, a brief description of the attribute is given.
2.9 Attributes of the Housing data set. In the first column, id stands for the attribute identifier used in this thesis; in the second column, Name stands for the identifier used in the source of the data set; and in the third column, a brief description of the attribute is given.
2.10 Attributes of the Ionosphere data set.
2.11 Attributes of the Iris data set. In the first column, id stands for the attribute identifier used in this thesis; in the second column, Name stands for the identifier used in the source of the data set; and in the third column, a brief description of the attribute is given.
2.12 Attributes of the Water Treatment data set.
2.13 Attributes of the WDBC data set.
3.1 Attribute selection of the Water Treatment data set.
3.2 Attribute selection of the EIA data set.
3.3 Scores of different microaggregation methods and parameterizations using the Water Treatment data set. Mic.Method-k corresponds to microaggregation using method Method (MDAV, PCP or Zscore) with initial anonymity value k.
3.4 Scores of different microaggregation methods and parameterizations using the EIA data set. Mic.Method-k corresponds to microaggregation using method Method (MDAV, PCP or Zscore) with initial anonymity value k.
3.5 SSE and real k′ values of different microaggregation methods and parameterizations for different number of groups known by the intruder using the Water Treatment data set. Mic.Methodk corresponds to microaggregation using method Method (MDAV, PCP or Zscore) with initial anonymity value k.
3.6 SSE and real k′ values of different microaggregation methods and parameterizations for different number of groups known by the intruder using the EIA data set. Mic.Methodk corresponds to microaggregation using method Method (MDAV, PCP or Zscore) with initial anonymity value k.
3.7 Score of different microaggregation methods and parameterizations. Mic.i.var.j corresponds to microaggregation using variation var (either PCP, Zscores or Sugeno) with a = i and k = j.
3.8 Example of a microdata file used to illustrate the preprocessing block.
3.9 Different groups of attributes known by the intruder.
3.10 SSE and real k′ of different microaggregation methods and parameterizations using the EIA data set. Mic.Method-k corresponds to microaggregation using method Method (MDAV,