Privacy Preserving Identification Using Sparse Approximation with Ambiguization

Behrooz Razeghi, Slava Voloshynovskiy, Dimche Kostadinov and Olga Taran
Stochastic Information Processing Group, Department of Computer Science, University of Geneva, Switzerland
{behrooz.razeghi, svolos, dimche.kostadinov, olga.taran}@unige.ch

arXiv:1709.10297v1 [cs.CR] 29 Sep 2017

Abstract—In this paper, we consider a privacy preserving encoding framework for identification applications covering biometrics, physical object security and the Internet of Things (IoT). The proposed framework is based on a sparsifying transform, which consists of a trained linear map, an element-wise nonlinearity, and privacy amplification. The sparsifying transform and privacy amplification are not symmetric for the data owner and data user. We demonstrate that the proposed approach is closely related to sparse ternary codes (STC), a recent information-theoretic concept proposed for fast approximate nearest neighbor (ANN) search in high dimensional feature spaces, which, being machine learning in nature, also offers significant benefits in comparison to sparse approximation and binary embedding approaches. We demonstrate that the privacy of the database outsourced to a server, as well as the privacy of the data user, are preserved at low computational cost and low storage and communication burdens.

Keywords—privacy; identification; sparse approximation; transform learning; ambiguization; clustering.

[Fig. 1: Block diagram of the proposed model. The owner encodes the database $\mathbf{X} = \{\mathbf{x}(1), \ldots, \mathbf{x}(m), \ldots, \mathbf{x}(M)\}$, $\mathbf{x}(m) \in \mathbb{R}^N$, into sparse ternary codes $\mathbf{a}(m) = T_{\lambda_x}(\mathbf{W}\mathbf{x}(m)) \in \{-1, 0, +1\}^L$ held in public storage $\mathbf{A} = \{\mathbf{a}(1), \ldots, \mathbf{a}(m), \ldots, \mathbf{a}(M)\}$. The data user observes a probe $\mathbf{y} = \mathbf{x}(m) + \mathbf{z}$ through the channel $p(\mathbf{y}(m)\,|\,\mathbf{x}(m))$ and encodes it as $\mathbf{b} = T_{\lambda_y}(\mathbf{W}\mathbf{y})$; public decoding returns the list of items satisfying $d(\mathbf{a}(m), \mathbf{b}) \le \gamma L$, $1 \le m \le M$, which schematic (private) decoding refines.]

I. INTRODUCTION

A. Identification and ANN Search

Many modern applications such as biometrics, digital physical object security and data generated by connected objects in the IoT require privacy preserving identification of a query with respect to a given dataset. Practically, the identification problem is based on an ANN search, where a list of indices corresponding to the NN items is returned. At the final refinement stage, the list can be refined in a private setting and a single index is declared as the identified one. The identification problem faces the curse of dimensionality. For this reason, exact identification is replaced by a search for a list of closest items, i.e., one trades off the accuracy of identification against the search complexity. In recent years, many methods providing efficient ANN solutions for multi-billion entry datasets were proposed; we name some of them, without pretending to be exhaustive, in our overview [1]–[3].

B. Search in Privacy Preserving Settings: Main Considerations

Due to the massive amount of data and modern distributed storage and computing facilities, many ANN problems are considered in a setting where the data owner outsources his datasets, after applying the corresponding protection measures, to third parties (servers) possessing powerful storage, communication and computing facilities. The need for data protection comes from many perspectives related to the cost of data collection and to data being a "product" of great value in the era of machine learning, which can be used to train and prune new and existing machine learning tools. Moreover, the server might want to discover some hidden relationships in the data. Finally, some features, such as biometrics, are non-renewable: once disclosed, they no longer represent any value for the related security applications.

The data users (clients), who by hypothesis possess some query related to those stored in the databases of the data owners, wish to identify them by obtaining the index/indices of the closest items in the data owner's datasets. However, the client does not want to disclose his query to the server completely, for privacy reasons. Therefore, it is assumed that both the data owner and the clients attempt to protect their data from server-side analysis, where the server is assumed to be honest but curious. More particularly, the server might attempt: (a) to find relationships between entries in the database of the data owner, (b) to reconstruct an individual entry of the data owner's database or common representatives or centroids of the clusters, (c) to reconstruct the queries of clients, and (d) to cluster multiple queries from the same client or from multiple ones, thus establishing group interests based on the similarity of probes. Curious clients might also be interested to discover more information about the structure of the database by exploring the NNs around one or multiple probes. Additionally, one can envision collaborative clients who might aggregate the results of multiple queries.

C. Our contribution

This paper presents a new privacy preserving strategy for the ANN search problem based on the recently proposed concept of Sparse Ternary Codes (STC) [1], [4]. Our main contribution consists in a novel formulation of database protection based on an STC representation with the addition of ambiguization noise to prevent database analysis, and a novel formulation of the query mechanism based on sending the positions of non-zero components in the query sparse representation, along with a fraction of ambiguous positions, to prevent reconstruction from the query and collaborative processing of queries. Our approach differs from existing privacy protection based on quantized embedding [2], [5] and attribute based [6] techniques. It has strong information-theoretic foundations and demonstrated higher coding gain with respect to binarized embedding methods [4].
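Concretely, the following Python sketch illustrates the two mechanisms just described: ternary encoding $\mathbf{a}(m) = T_{\lambda}(\mathbf{W}\mathbf{x}(m))$ as in Fig. 1, owner-side ambiguization of zero positions, and a client query that reveals only support positions mixed with decoys. This is a minimal sketch, assuming a random orthonormal map W and arbitrary sizes and thresholds; the paper learns W instead (Section III), and all function names and parameters here are our illustrative placeholders.

```python
# Minimal sketch of the STC encoding and ambiguization of Fig. 1.
# All dimensions, thresholds, and the random orthonormal map W are
# hypothetical; the paper learns W rather than drawing it at random.
import numpy as np

rng = np.random.default_rng(0)

def ternary_threshold(v, lam):
    """Element-wise map T_lambda: sign of entries above the threshold, else 0."""
    return np.sign(v) * (np.abs(v) > lam)

def encode(x, W, lam):
    """Sparse ternary code a = T_lambda(W x)."""
    return ternary_threshold(W @ x, lam)

def ambiguize(a, n_fake, rng):
    """Owner side: flip n_fake randomly chosen zero positions to random
    signs, so the stored code's support no longer reveals the true one."""
    a = a.copy()
    zeros = np.flatnonzero(a == 0)
    fake = rng.choice(zeros, size=n_fake, replace=False)
    a[fake] = rng.choice([-1.0, 1.0], size=n_fake)
    return a

def query_positions(y, W, lam, n_fake, rng):
    """Client side: reveal only the support of the sparsified probe,
    mixed with n_fake decoy positions; keep the true support private."""
    b = encode(y, W, lam)
    true_pos = np.flatnonzero(b)
    decoys = rng.choice(np.flatnonzero(b == 0), size=n_fake, replace=False)
    sent = rng.permutation(np.concatenate([true_pos, decoys]))
    return sent, true_pos

N, L, lam = 256, 512, 1.0                          # hypothetical sizes
W = np.linalg.qr(rng.standard_normal((L, N)))[0]   # stand-in for the learned map
x = rng.standard_normal(N)
a_public = ambiguize(encode(x, W, lam), n_fake=50, rng=rng)  # sent to server
sent, kept = query_positions(x + 0.3 * rng.standard_normal(N), W, lam,
                             n_fake=20, rng=rng)
```

The server matches codes only on the revealed positions; the client later discards the result lists stemming from the decoy positions, as detailed in the overview below.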
Our setup is quite generic: we assume that the input may be raw data, features extracted by any known hand-crafted method, aggregated local descriptors based on BoW, FV or VLAD [7]–[9], etc., or intermediate or last layers of deep nets [10]. Before providing the input to the server, we preprocess it by passing it through the sparsifying transform and thresholding it, thus producing the STC. The next stage is an ambiguization that consists in the addition of selective noise components to the zero positions of the sparse data representation (Fig. 1). As the sparsifying transform we use an efficient learning method based on the sparsified Procrustes problem formulation [11], in contrast to the fixed transform used in [1], [4]. The resulting data are stored on the public server. The client, who possesses some query, sends the set of non-zero positions of the sparsified query to the server and expects the server to return the sets of indices corresponding to the positive and negative non-zero components. To ambiguize the server, the data user also adds a certain portion of false positions to the query and keeps their indices. The search complexity of the server is logarithmic in the number of stored items times the number of non-zero positions in the sparsified query; this also determines the size of the file returned to the client. Due to the efficient sparse representation, the amount of returned data is minimized while the efficiency of the data structure representation is preserved. At the final stage, the data user disregards the lists corresponding to the false positions and aggregates the remaining lists for the corresponding positions, producing the final ANN list.

D. Notation and Definitions

Throughout this paper, superscript $(\cdot)^T$ stands for transpose and $(\cdot)^\dagger$ stands for pseudo-inverse. Vectors and matrices are denoted by boldface lower-case ($\mathbf{x}$) and upper-case ($\mathbf{X}$) letters, respectively. We use the same notation for a random vector $\mathbf{x}$ and its realization; the difference should be clear from the context.

The query (probe) $\mathbf{y}(m) \in \mathbb{R}^N$ is a noisy version of $\mathbf{x}(m)$, i.e., $\mathbf{y}(m) = \mathbf{x}(m) + \mathbf{z}$, where we assume $\mathbf{z} \in \mathbb{R}^N$ is a Gaussian noise vector with distribution $\mathcal{N}(\mathbf{0}, \mathbf{\Sigma}_z = \sigma_z^2 \mathbf{I}_N)$. Provided that the additive noise $\mathbf{z}$ is independent of the data, for large $N$ the law of large numbers states that $\frac{1}{N}\|\mathbf{y}(m) - \mathbf{x}(m)\|_2^2 \to \frac{1}{N}\mathbb{E}[\mathbf{z}^T\mathbf{z}] = \sigma_z^2$, $\forall m \in [M]$. Therefore, for an enrollment vector $\mathbf{x}(m)$ stored in the database, with high probability the query vector $\mathbf{y}(m)$ will lie in an $N$-dimensional sphere of radius $\sqrt{N}\sigma_z$ centered at $\mathbf{x}(m)$.
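This concentration claim is easy to check numerically. In the sketch below (with arbitrary illustrative values of $N$ and $\sigma_z$), the normalized squared distance between probe and enrolled vector lands on $\sigma_z^2$, i.e., the probe falls on a sphere of radius about $\sqrt{N}\sigma_z$:

```python
# Numerical check of the noise-sphere argument; N and sigma_z are
# arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(1)
N, sigma_z = 4096, 0.5
x = rng.standard_normal(N)                # enrolled vector x(m)
y = x + sigma_z * rng.standard_normal(N)  # probe y(m) = x(m) + z

print(np.linalg.norm(y - x) ** 2 / N)     # ~ sigma_z**2 = 0.25
print(np.linalg.norm(y - x))              # ~ np.sqrt(N) * sigma_z = 32.0
```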
III. PROPOSED FRAMEWORK

A. Sparse Representation and Encoding

The intrinsic information content of feature vectors, like that of other real-world signals, is much smaller than their length, so we are interested in approximating them by sparse vectors. Many signals $\mathbf{x}(m) \in \mathbb{R}^N$ may be represented as a linear combination of a small number of columns (words) of a frame (dictionary) $\mathbf{D} \in \mathbb{R}^{N \times L}$ according to the sparse approximation model. Therefore, $\mathbf{x}(m) = \mathbf{D}\mathbf{a}(m) + \mathbf{e}_x$, where $\mathbf{e}_x \in \mathbb{R}^N$ is the approximation error and $\mathbf{a}(m) \in \mathbb{R}^L$ is sparse with $\|\mathbf{a}(m)\|_0 \ll L$, in which $\|\mathbf{a}(m)\|_0 := \mathrm{card}(\mathrm{supp}(\mathbf{a}(m)))$. The general formulation of this sparse coding problem for a fixed dictionary can be expressed as:

$\hat{\mathbf{a}}(m) = \arg\min_{\mathbf{a}(m) \in \mathcal{A}^L} \|\mathbf{x}(m) - \mathbf{D}\mathbf{a}(m)\|_2^2 + \lambda\,\Omega(\mathbf{a}(m)), \quad \forall m \in [M], \qquad (1)$

where $\mathcal{A}$ is an alphabet of sparse representations, $\lambda$ is a regularization parameter and $\Omega(\cdot)$ is a sparsity-inducing regularization function. However, sparse coding with respect to $\mathbf{a}(m)$ is an inverse problem in nature and known to be NP-hard. It can be solved approximately by greedy or relaxation algorithms, but these provide the correct solution only under certain conditions, which are often violated in real-world applications. Furthermore, they are computationally expensive in practice, particularly for large-scale problems. In this paper, we use a transform model [12] for the sparse representation instead.
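For concreteness, here is a minimal sketch of one of the relaxation algorithms mentioned above: iterative soft thresholding (ISTA) applied to (1) with the convex regularizer $\Omega(\mathbf{a}) = \|\mathbf{a}\|_1$. The dictionary, regularization weight and iteration count are illustrative placeholders, not the paper's choices.

```python
# Minimal ISTA sketch for the l1-relaxed version of problem (1):
#   min_a ||x - D a||_2^2 + lam * ||a||_1
# D, lam, and n_iter are illustrative placeholders.
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam, n_iter=500):
    t = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)  # step = 1 / Lipschitz(grad)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        # gradient step on the quadratic term, then prox of the l1 term
        a = soft_threshold(a - 2.0 * t * (D.T @ (D @ a - x)), t * lam)
    return a

rng = np.random.default_rng(2)
N, L = 64, 128
D = rng.standard_normal((N, L)) / np.sqrt(N)     # hypothetical dictionary
a_true = np.zeros(L)
a_true[rng.choice(L, 5, replace=False)] = rng.standard_normal(5)
x = D @ a_true + 0.01 * rng.standard_normal(N)   # sparse signal plus noise
a_hat = ista(x, D, lam=0.05)                     # sparse approximation of x
```

The contrast with the transform model is the point: instead of iterating an inverse problem per item, encoding reduces to a single matrix product followed by thresholding, $\mathbf{a}(m) = T_{\lambda_x}(\mathbf{W}\mathbf{x}(m))$ as in Fig. 1, which is what keeps enrollment and querying cheap at database scale.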