Inference Control and Privacy Preservation in Data Mining

Suggested reviewers: Dan Simovici, Xintao Wu

Motivation: Recent years have seen a staggering increase in the volume of information exchanged over the internet, and with it the amount of personally identifiable information that is exchanged online or stored in data repositories. This situation raises concerns about individual privacy rights and how to protect them through regulation and technology. This module aims to shed light on current privacy and data protection issues and on some of the methods that help protect private information.

Target Audience: Senior computer science majors. The module can also be used at the graduate level with additional assignments.

Prerequisites: Database, data structures, programming.

Module Objectives

 Give students an overview of the privacy concepts and requirements.

 Present the students with the techniques used for the preservation of private information.

 Present current privacy preserving techniques in data mining.

Module Organization

1. The Concept of Privacy (2 hours)

a) Definition of Privacy and Data Protection.

b) Privacy and Security

c) Privacy and Legislation:

i. Legal: Individual Rights, Human Rights, Fourth Amendment, HIPAA.

ii. Organizational Privacy

iii. Informational Privacy: Digital Identities and Online Privacy Regulations

d) Types of Privacy Attacks.

e) Evaluating Privacy Techniques: Utility functions, disclosure factor

2. Data Centered Privacy Protection Methods (2 hours)

a) What Data to Hide

b) Data Partitioning: Horizontal versus vertical

c) Data Modification: Aggregating, Blocking, Perturbation, Swapping, Sampling

d) Data Hiding: Cryptography-Based Techniques

3. Privacy Preserving Data Mining Techniques (2 hours)

a) Data Obfuscation

b) Data Summarization

c) Data Separation

d) Inference Control

i. Confidential Data: Legal Requirements and Societal Expectations

ii. Data Aggregation

iii. Statistical Databases: Inference Control

iv. Conclusion

e) Privacy Preserving Association Rule Mining:

i. Horizontal Data Partitioning

ii. The ID3 Algorithm

Exercise: Privacy preserving association mining

Assume the data is horizontally partitioned:

– Each site has complete information on a set of entities.

– The same attributes are stored at each site.

The goal is to avoid disclosing individual entities. Develop an efficient association mining algorithm that meets this goal.
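
One possible starting point for the exercise is a secure sum over per-site support counts. The sketch below is not a complete mining algorithm and is not presented as the intended solution. It only illustrates how the global support of a single candidate itemset might be computed without any site revealing its own count. It assumes semi-honest, non-colluding sites and simulates all sites in one process. The names local_support, secure_sum, global_support, and MODULUS are hypothetical, and the sample transactions are made up for illustration.

```python
import secrets

# Assumption: MODULUS is larger than any possible global count.
MODULUS = 2**32


def local_support(transactions, itemset):
    """Count the transactions at one site that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def secure_sum(local_values, modulus=MODULUS):
    """Ring-based secure sum, simulated here in a single process.

    The initiating site adds a random mask to its own value and passes the
    result on. Each later site adds its value to the running total it
    receives, so it only ever sees a masked partial sum. The initiator
    removes the mask at the end to recover the true total.
    """
    mask = secrets.randbelow(modulus)
    running = (mask + local_values[0]) % modulus
    for value in local_values[1:]:
        running = (running + value) % modulus
    return (running - mask) % modulus


def global_support(sites, itemset):
    """Fraction of all transactions, across all sites, containing itemset."""
    counts = [local_support(site, itemset) for site in sites]
    sizes = [len(site) for site in sites]
    total = secure_sum(sizes)
    return secure_sum(counts) / total if total else 0.0


# Hypothetical horizontally partitioned data: same attributes (items),
# different entities (transactions) at each site.
sites = [
    [{"bread", "milk"}, {"bread"}, {"milk", "eggs"}],
    [{"bread", "milk", "eggs"}, {"eggs"}],
    [{"bread", "milk"}, {"milk"}, {"bread", "eggs"}],
]
support = global_support(sites, {"bread", "milk"})
```

A full answer would wrap this step in a candidate generation loop such as Apriori, and would need to consider what happens if two neighboring sites collude to recover the value of the site between them.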