rx-anon—A Novel Approach on the De-Identification of Heterogeneous Data based on a Modified Mondrian Algorithm

F. Singhofer, University of Ulm, Germany, [email protected]
A. Garifullina, M. Kern, BT Technology, United Kingdom, {aygul.garifullina,mathias.kern}@bt.com
A. Scherp, University of Ulm, Germany, [email protected]

ABSTRACT

Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joined, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian

measure to protect PII is to anonymize all personal identifiers. Prior work considered such personal data to be name, age, email address, gender, sex, ZIP, any other identifying numbers, among others [12, 16, 31, 34, 52]. Therefore, the field of Privacy-Preserving Data Publishing (PPDP) has been established, which makes the assumption that a data recipient could be an attacker, who might also have additional knowledge (e.g., by accessing public datasets or observing individuals).

Data to be shared can be structured in the form of relational data or unstructured like free texts. Research in data mining and predictive models shows that a combination of structured and unstructured data leads to more valuable insights.
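To make the k-anonymity notion mentioned in the abstract concrete, the following minimal Python sketch checks whether a set of records is k-anonymous with respect to a chosen set of quasi-identifiers. The record layout, the attribute names (age, zip, terms), and the function is_k_anonymous are assumptions made purely for illustration and are not taken from rx-anon. The terms attribute stands in for sensitive terms extracted from a text attribute and mapped to the structured data.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` is shared by at least k records."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized toy records. "age" and "zip" are
# relational attributes; "terms" is an assumed stand-in for the set of
# sensitive term categories extracted from a free-text attribute.
records = [
    {"age": "20-29", "zip": "890**", "terms": frozenset({"LOCATION"})},
    {"age": "20-29", "zip": "890**", "terms": frozenset({"LOCATION"})},
    {"age": "30-39", "zip": "891**", "terms": frozenset({"PERSON"})},
    {"age": "30-39", "zip": "891**", "terms": frozenset({"PERSON"})},
]

print(is_k_anonymous(records, ["age", "zip", "terms"], k=2))  # True
print(is_k_anonymous(records, ["age", "zip", "terms"], k=3))  # False
```

Each equivalence class in this toy data contains exactly two records, so the check succeeds for k = 2 and fails for k = 3. Treating the extracted terms as one more quasi-identifier is only one possible way to join the relational and textual views.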