
ALGORITHMS AND ARCHITECTURES FOR DATA PRIVACY

A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

Dilys Thomas
June 2007

© Copyright by Dilys Thomas 2007
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Rajeev Motwani) Principal Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Dan Boneh)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(John Mitchell)

Approved for the University Committee on Graduate Studies.

Abstract

The explosive progress in networking, storage, and processor technologies has resulted in an unprecedented volume of digital data. With this increase in digital data, concerns about privacy of personal information have emerged. The ease with which data can be collected, stored in databases and queried efficiently over the internet has worsened the privacy situation, and has raised numerous ethical and legal concerns. Privacy enforcement today is being handled primarily through legislation. We aim to provide technological solutions to achieve a tradeoff between data privacy and data utility. We focus on three problems in the area of database privacy in this thesis.

The first problem is that of data sanitization before publication. Publishing health and financial information for research purposes requires that the data be anonymized so that the privacy of individuals in the database is protected.
This anonymized information can be (1) used as is or (2) combined with another (anonymized) dataset that shares columns or rows with the original anonymized dataset. We explore both these sub-problems in this thesis. Another reason for sanitization is to give the data to an outsourced software developer for testing software applications without the outsourced developer learning information about its client. We briefly explain such a tool in this thesis.

The second part of the thesis studies auditing query logs for privacy. Given certain forbidden views of a database that must be kept confidential, a batch of SQL queries that were posed over this database, and a definition of suspiciousness, we study the problem of determining whether the batch of queries is suspicious with respect to the forbidden views.

The third part of the thesis deals with distributed architectures for data privacy. The advent of databases as an outsourced service has resulted in privacy concerns on the part of the client storing data with third party database service providers. Previous approaches to enabling such a service have been based on data encryption, causing a large overhead in query processing. In this thesis we provide a distributed architecture for secure database services. We develop algorithms for distributing data and executing queries over this distributed data.

Acknowledgments

First and foremost I would like to thank my advisor Rajeev Motwani for running one of the most wonderful research groups. His suggestions of papers and ideas during our group lunches and research meetings have fueled a lot of good research in our group. His insightful comments from experience have been helpful from time to time. His group has been the source of a lot of academic and some social activities to keep us lively all the time. They say that a great teacher inspires.
Rajeev’s enthusiasm for solving relevant and varied research problems has inspired me and all other members of his group to do good research.

I would like to express my gratitude to other professors at Stanford: Hector Garcia-Molina, for being an active participant in privacy research and for his feedback on presentation style and other practical issues; Jennifer Widom, for running the Stream group and for her insights into research at the database group lunches; and Dan Boneh, for his mathematical puzzles, which gave me stimulating thinking many an afternoon.

I would like to thank my reading committee members, Rajeev Motwani, Dan Boneh, and John Mitchell, and the other members of my orals committee, Hector Garcia-Molina and Ashish Goel.

I would like to thank the various research groups I was a part of: STREAM run by Jennifer Widom and Rajeev Motwani, RAIN run by Rajeev Motwani, Ashish Goel and Amin Saberi, PORTIA-PRIVACY run by Rajeev Motwani, Hector Garcia-Molina, Dan Boneh and John Mitchell, and TRUST run by John Mitchell, Dan Boneh, Rajeev Motwani and Hector Garcia-Molina.

I would like to thank my internship mentors and managers, from whom I learnt a lot about different styles of research. I would like to especially thank Ramakrishnan Srikant, Rakesh Agrawal, Surajit Chaudhuri, Nicolas Bruno, Phillip Gibbons, Sachin Lodha, Anand Rajaraman and Srinivasan Sheshadri.

I would like to thank my professors at the Indian Institute of Technology, Bombay, especially S. Sudarshan, for providing me an excellent undergraduate education and getting me initiated into research. I would also like to thank my batchmates and friends from there.

I thank other students working with Rajeev: Krishnaram Kenthapadi, Gurmeet Manku, Gagan Aggarwal, Rina Panigrahy, Shubha Nabar, Ying Xu, Sergei Vassilvitskii, An Zhu, David Arthur, Aleksandra Korolova, Mayur Datar, Brian Babcock, Liadan Boyen and Aristides Gionis, who have provided a wonderful environment to work in.
I would like to thank Gaurav Bamania, Anuranjan Jha, Mayur Naik, Joseph Alex, Rajat Raina, Utkarsh Srivastava, Rajiv Agrawal, Omkar Deshpande, Pradeep Kumar, Rob, Jim Cybluski and Blake Blailey for being kind and considerate roommates.

It was a pleasure spending time with people in the Theory, Database and OR groups at Stanford: Zoltan Gyongyi, Prasanna Ganesan, Mayank Bawa, Qi Su, Mukund Sundararajan, Adam Barth, Aaron Bradley, Damon Mosk-Aoyama, Sriram Sankaranarayanan, Anupam Datta, Bobji Mungamuru, Shivnath Babu, Hamid Nazarzadeh, Arvind Arasu, David Menestrina and Arnab Roy, to name just a few.

I would like to thank my coauthors not mentioned above: Renato Carmo, Prasenjit Das, A A Diwan, Tomas Feder, Vignesh Ganapathy, Keith Ito, Samir Khuller, Yoshiharu Kohayakawa, Eduardo Sany Laber, Nina Mishra, Itaru Nishizawa, Nikhil Patwardhan, Sharada Sundaram and Rohit Varma.

I would like to thank Kathi DiTommaso, Lynda Harris, Maggie Mcloughin, Wendy Cardamone, Claire Stager, Verna Wong, Indira Chaudhury, Meredith Hutchin, Jam Kiattinant and Peche Turner for taking care of all the important administrative matters during the PhD. I would like to thank Lilian Lao, Andy Kacsmar and Miles Davis for taking care of the machines.

I would like to thank the various outing clubs and groups at Stanford, the Catholic community here, SIA, Rains groups, IVGrad, and the DB movie and social committee for ensuring life outside work was something to look forward to.

I would like to thank Joshua Easow and family and Jojy Michael and family for memorable times with them.

Above all, I thank my grandparents, parents, sister Dina, my cousins and friends for a wonderful childhood with memories of catching butterflies from fields, small fish from streams, spending time on the hills, studying and playing.

To God

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Sanitizing Data for Privacy
    1.1.1 Privacy Preserving OLAP
    1.1.2 Clustering for Anonymity
    1.1.3 Probabilistic Anonymity
    1.1.4 A Tool for Data Privacy: Masketeer
  1.2 Auditing
  1.3 Distributed Architectures for Privacy
I Sanitizing Data for Privacy
2 Privacy Preserving OLAP
  2.1 Introduction
  2.2 Related Work
  2.3 Data Perturbation
  2.4 Reconstruction
    2.4.1 Reconstructing Single Column Aggregates
    2.4.2 Reconstructing Multiple Column Aggregates
  2.5 Guarantees against privacy breaches
    2.5.1 Review of (ρ1, ρ2) Privacy Breach
    2.5.2 (s, ρ1, ρ2) Privacy Breach
    2.5.3 Single Column Perturbation
    2.5.4 Multiple Independently Perturbed Columns
  2.6 Extensions
    2.6.1 Categorical Data
    2.6.2 Alternative Retention Replacement Schemes
    2.6.3 Application to Classification
  2.7 Experiments
    2.7.1 Randomization and Reconstruction
    2.7.2 Scalability
    2.7.3 Privacy Breach Guarantees
  2.8 Conclusions
3 Clustering for Anonymity
  3.1 Introduction
  3.2 r-GATHER CLUSTERING
    3.2.1 Lower Bound
    3.2.2 Upper Bound
    3.2.3 (r, ε)-Gather Clustering
    3.2.4 Combining r-Gather with k-Center
  3.3 Cellular Clustering
    3.3.1 r-Cellular Clustering
  3.4 Conclusions
4 Probabilistic Anonymity
  4.1 Introduction
    4.1.1 Organization and Contributions
  4.2 Automatic Detection of Quasi-identifiers
    4.2.1 Distinct Values and