Topology of Privacy: Lattice Structures and Information Bubbles for Inference and Obfuscation
Michael Erdmann∗
Carnegie Mellon University

December 12, 2017

© 2017 Michael Erdmann

Abstract

Information has intrinsic geometric and topological structure, arising from relative relationships beyond absolute values or types. For instance, the fact that two people did or did not share a meal describes a relationship independent of the meal's ingredients. Multiple such relationships give rise to relations and their lattices. Lattices have topology. That topology informs the ways in which information may be observed, hidden, inferred, and dissembled. Privacy preservation may be understood as finding isotropic topologies, in which relations appear homogeneous. Moreover, the underlying lattice structure of those topologies has a temporal aspect, which reveals how isotropy may contract over time, thereby puncturing privacy.

Dowker's Theorem establishes a homotopy equivalence between two simplicial complexes derived from a relation. From a privacy perspective, one complex describes individuals with common attributes, the other describes attributes shared by individuals. The homotopy equivalence is an alignment of certain common cores of those complexes, effectively interpreting sets of individuals as sets of attributes, and vice-versa. That common core has a lattice structure. An element in the lattice consists of two components, one being a set of individuals, the other being an equivalent set of attributes. The lattice operations join and meet each amount to set intersection in one component and set union followed by a potentially privacy-puncturing inference in the other component.

One objective of this research has been to understand the topology of the Dowker complexes, from a privacy perspective. First, privacy loss appears as simplicial collapse of free faces.
Such collapse is local, but the property of fully preserving both attribute and association privacy requires a global condition: a particular kind of spherical hole. Second, by looking at the link of an identifiable individual in its encompassing Dowker complex, one can characterize that individual's attribute privacy via another sphere condition. This characterization generalizes to certain groups' attribute privacy. Third, even when long-term attribute privacy is impossible, homology provides lower bounds on how an individual may defer identification, when that individual has control over how to reveal attributes. Intuitively, the idea is to first reveal information that could otherwise be inferred. This last result highlights privacy as a dynamic process. Privacy loss may be cast as gradient flow. Harmonic flow for privacy preservation may be fertile ground for future research.

arXiv:1712.04130v1 [math.CO] 12 Dec 2017

∗This report is based upon work supported in part by the Air Force Office of Scientific Research under award number FA9550-14-1-0012 and in part by the National Science Foundation under award number IIS-1409003. Any opinions, findings and conclusions or recommendations expressed in this report are those of the author and do not necessarily reflect the views of the Government, the U.S. Department of Defense, or the National Science Foundation.

Contents

1 Introduction
2 Outline
List of Primary Symbols
3 Privacy: Relations and Partially Ordered Sets
  3.1 A Toy Example: Health Data and Attribute Privacy
      Assumption of Relational Completeness
      Assumption of Observational Monotonicity
      Assumption of Observational Accuracy
  3.2 A Dual Perspective: Payroll Data and Association Privacy
  3.3 Privacy Preservation and Loss: A Poset Model
4 The Galois Connection for Modeling Privacy
  4.1 Dowker Complexes
  4.2 Inference from Closure Operators
  4.3 Attribute and Association Privacy
  4.4 Disinformation Example Re-Revisited
5 The Face Shape of Privacy
  5.1 Free Faces
  5.2 Privacy versus Identifiability
  5.3 Spheres and Privacy
  5.4 A Spherical Non-Boundary Relation that Preserves Attribute Privacy
6 Conditional Relations as Simplicial Links
7 Privacy Characterization via Boundary Complexes
8 The Meaning of Holes in Relations
9 Change-of-Attribute Transformations
10 Leveraging Lattices for Privacy Preservation
  10.1 Attribute Release Order
  10.2 Inferences on a Lattice
  10.3 Preserving Attribute Privacy for Sets of Individuals
  10.4 Informative Attribute Release Sequences
  10.5 Isotropy, Minimal Identification, and Spheres
  10.6 Poset Lengths and Information Release
  10.7 Hidden Holes
  10.8 Bubbles are Lower Bounds for Privacy
11 Experiments
  11.1 Compare and Contrast
  11.2 Homology Computations
  11.3 Homology and Release Sequences in the Olympic Medals Dataset
  11.4 Homology and Release Sequences in the Jazz Dataset
12 Inference in Sequence Lattices
  12.1 Sequence Lattices for Dynamic Attribute Observations
  12.2 Lattices of Stochastic Observations
  12.3 General Inference Lattices
13 Lattices for Strategy Obfuscation
  13.1 Strategies for Nondeterministic Graphs
  13.2 Connecting the Topologies of Strategy Complexes and Privacy
  13.3 Example: Multi-State Goals and Multi-Strategy Singleton Goals
  13.4 Randomization
14 Relations as a Category
  14.1 Relationship-Preserving Morphisms
  14.2 Privacy-Establishing Morphisms
  14.3 Summary of Morphism Properties
  14.4 G-Morphisms
  14.5 Surjectivity Revisited
15 Future Thoughts
  15.1 Relaxing Assumptions
  15.2 Sensing Attributes Stochastically
Acknowledgments
References
A Preliminaries
  A.1 Simplicial Complexes
  A.2 Partially Ordered Sets (Posets)
  A.3 Semi-Lattices and Lattices
  A.4 Relations
B Basic Tools
C Links, Deletions, and Inference
  C.1 Links, Deletions, and Induced Maps
  C.2 Privacy Preservation in Links and Deletions
  C.3 Unique Identifiability, Free Faces, and Privacy Preservation
D Inference Hardness
E Privacy Spheres
  E.1 Individual Attribute Privacy
  E.2 Group Attribute Privacy
  E.3 Preserving Attribute and Association Privacy
  E.4 Square Relations Preserve Privacy Symmetrically
F Poset Chains
  F.1 Maximal Chains and Informative Attribute Release Sequences
  F.2 Chains and Links
  F.3 Isotropy
G Many Long Chains
H Obfuscating Strategies
  H.1 Source Complex
  H.2 Delaying Strategy Identification
  H.3 Delaying Goal Recognition
  H.4 Hamiltonian Flexibility for Strategy Obfuscation
  H.5 Example: A Rapidly Inferable Strategy
  H.6 Pure Nondeterministic Graphs and Pure Stochastic Graphs
  H.7 Strategy Obfuscation Summary
I Morphisms and Lattice Generators
  I.1 Morphisms
  I.2 G-Morphisms
  I.3 Lattice Generators
J A Few More Examples
  J.1 Local Spheres versus Global Contractibility
  J.2 Disinformation
  J.3 Insufficient Representation
  J.4 A Structural Inference Example: Passengers on Ferries

1 Introduction

Privacy is the ability of an individual or entity to control how much that individual or entity reveals about itself to others. Fundamental research into privacy seeks to understand the limits of that ability. A brief history of privacy should include the following:

• The right to privacy as a legal principle, appearing in an 1890 Harvard Law Review article [24]. The article was a reaction to the then modern technology of photography and the dissemination of gossip via print media.
• A demonstration linking supposedly anonymous public information with other, more specific public data, thereby revealing sensitive attributes [21]. The demonstration employed zip code, gender, and birth date to link anonymous public insurance summaries with voter registration data. Doing so produced the health record of the governor of Massachusetts. This privacy failure suggested a first form of homogenization, called k-anonymity. Roughly, the idea was to structure databases in such a way that a database could respond to any query with an answer consisting of no fewer than k individuals matching the query parameters.

• The discovery that it is impossible to preserve the privacy of an individual for even a single attribute in the face of repeated statistical queries over a population [2], unless answers to those queries are purposefully perturbed with noise of magnitude on the order of at least √n. Here n is the size of the population. The significance of this discovery is to underscore how difficult it is to preserve privacy while retaining information utility.

• Netflix Prize. In 2006, Netflix offered a $1M prize for an algorithm that would predict viewer preferences better than Netflix's internal algorithm. Netflix made available some of its historical user preferences, in anonymized form, as a basis for the competition. Once again, it turned out that one could link this anonymized data with other publicly available databases, resulting in the potential (and in some cases actual) identification of Netflix viewers, thereby de-anonymizing their viewing history [17]. Whereas in the earlier health example, a few specific observables made linking possible (global coordinates, one might say, namely zip code, gender, birth date), in the Netflix example, the intrinsic geometric structure of the database facilitated linking via a wide variety of observables
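The k-anonymity idea described above can be made concrete with a small sketch. The code below is illustrative only and not part of the cited work: the record values, field names, and the choice of zip code, gender, and birth year as quasi-identifiers are hypothetical, echoing the linking attack described in the second bullet. A table is k-anonymous with respect to its quasi-identifiers when every combination of quasi-identifier values that occurs at all occurs in at least k records.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination appearing in
    `rows` is shared by at least k records."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical released health records: zip code, gender, and birth
# year together act as the linkable "global coordinates".
records = [
    {"zip": "02138", "gender": "F", "birth": 1970, "diagnosis": "flu"},
    {"zip": "02138", "gender": "F", "birth": 1970, "diagnosis": "asthma"},
    {"zip": "02139", "gender": "M", "birth": 1945, "diagnosis": "diabetes"},
]

qi = ["zip", "gender", "birth"]
print(is_k_anonymous(records, qi, 2))  # False: the 1945 record is unique
print(is_k_anonymous(records, qi, 1))  # True
```

The unique record is exactly the kind of row that a linking attack can re-identify; achieving 2-anonymity here would require suppressing or coarsening its quasi-identifier values.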