1 Privacy Preserving Speech Processing Manas A. Pathak, Bhiksha Raj, Shantanu Rane, Paris Smaragdis

I.INTRODUCTION frameworks for voice processing. Here, we refer chiefly to secure pattern-matching applications of speech such Speech is one of the most private forms of communi- as speech biometrics and speech recognition, rather than cation. People do not like to be eavesdropped on. They secure speech communication techniques, which have will frequently even object to being recorded; in fact already been studied [5]. in many places it is illegal to record people speaking in The goal of the described frameworks is to enable public, even when it is acceptable to capture their images voice-processing tasks in a manner which ensures that on video [1]. Yet, when a person uses a speech-based no party, including the user, the system, or a snooper, service such as a voice authentication system or a speech can derive undesired or unintended information from the recognition service, they must grant the service complete transaction. This would imply, for instance, that a user access to their voice recordings. This exposes the user to may enroll at a voice-authentication system without fear abuse, with security, privacy and economic implications. that an intruder or even the system itself could capture For instance, the service could extract information such and abuse his voice or statistical models derived from it. as gender, ethnicity, and even the emotional state of A surveillance agency could now determine if a crime the user from the recording – factors not intended to suspect is on a telephone line, but would learn nothing be exposed by the user – and use them for undesired if the speaker is an innocent person. Private speech data purposes. The recordings may be edited to create fake may be mined by third parties without exposing the recordings that the user never spoke, or to impersonate recordings. them for other services. Even derivatives from the voice These frameworks follow two distinct paradigms. In are risky to expose. E.g. a voice-authentication service the first paradigm [6], [7], [8], [9], [10], which we could make unauthorized use of the models or voice will refer to as the “cryptographic” paradigm, conven- prints it has for users to try to identify their presence tional voice-processing algorithms are rendered secure in other media such as YouTube. by computing them through secure operations. By re- Privacy concerns also arise in other situations. For casting the computations as a pipeline of “primitives”, instance, a doctor cannot just transmit a dictated medical each of which can be computed securely through a record to a generic voice-recognition service for fear combination of homomorphic encryption [11], [12], [13], of violating HIPAA requirements; the service provider [14], secure multiparty computation (SMC) [15] and requires various clearances first. Surveillance agencies oblivious transfer [16], [17], we ensure that no undesired must have access to all recordings by all callers on a information is leaked by any party. In this paradigm, telephone line, just to determine if a specific person of the accuracy of the basic voice-processing algorithm can interest has spoken over that line. Thus, in searching for remain essentially unchanged with respect to the original Jack Terrorist, they also end up being able to listen to non-private version. The privacy requirements, however, and thereby violate the privacy of John and Jane Doe. introduce a computational and communication overhead The need to protect the privacy of users and their data as the computation requires user participation. is well recognized in other domains [2]. Any private The second paradigm modifies voice-pattern classi- information that can be gleaned by inspecting a user’s fication tasks into a string-comparison operation [18], interaction with a system must be protected from prying [10]. Using a combination of appropriate data representa- eyes. To this end, techniques have been proposed in tion followed by locality sensitive hashing schemes both the literature for protecting user privacy in a variety of the data to be matched and the patterns they must match applications including e-voting, information retrieval and are converted to collections of bit strings, and pattern biometrics. Yet, the privacy of voice has not been ad- classification is performed by counting exact matches. dressed until recently, and the issue of privacy of speech The computational overhead of this string-comparison has been dealt with primarily as a policy problem [3], framework is minimal compared to the cryptographic [4], and not as a technology challenge. In this article, framework. Moreover, the entire setup is non-interactive we describe recent developments in privacy-preserving and secure, in the sense that no party learns anything 2 undesired regardless of how they manipulate the data. suffices to know that when we refer to a speech recording This comes at the price of a slight loss of performance, we actually refer to the sequence of feature vectors X = since the modified classification mechanism is not as [x1, x2, ··· , xT ]. For the privacy-preserving frameworks effective as conventional classification schemes. described later, we will assume that the user’s () In the rest of this paper, we first present a brief primer device can compute these features. The conversion of on speech in Section II and discuss its privacy issues signals to features by itself provides no privacy, since in Section III. We describe the cryptographic approach they can be used for pattern classification, and can even in Section IV and the string-comparison approach in be inverted to generate intelligible speech signals [21]. Section V. Finally in Section VI we present our con- clusions. In our discussion, we will assume a user who A. Biometric Applications: Identification and Authenti- possesses voice data and a system who must process cation it. This nomenclature may not always be appropriate; Biometric applications deal with determining the iden- nevertheless we will retain it for consistency. tity of the speaker. Here, the set C in (1) is the set of candidate speakers who may have spoken in a record- II.ABRIEF PRIMERON SPEECH PROCESSING ing. In addition, a “universal” speaker U, representing The speech processing applications we consider all the aggregate of all speakers not otherwise in C is deal with making inferences about the information in included in the set. U may be viewed as the “none- the recorded . Biometric applications attempt to of-the-above” option – classifying a recording as U is determine or confirm the identity of the speaker of equivalent to stating that the speaker is unknown. In a recording. Recognition applications attempt to infer speaker identification systems C comprises a collection what was spoken in the recording. We present below of speakers S (possibly including U) for whom models a very brief description of the salient aspects of these λS are available. In speaker authentication, C comprises applications as they pertain to this paper. The description only the speaker who has claimed to have spoken in is not intended to be exhaustive; for more detailed the recording, and the universal speaker U. P (C) in (1) information we refer readers to the various books and now becomes a tuning parameter to bias the classification papers on the topics, e.g. [19]. towards specific speakers or U. In all cases, the problem of inference is treated as Typically, the individual feature vectors in any record- one of statistical pattern classification. Classification is ing X are assumed to be IID and distributed according usually performed through a Bayes classifier. Let C be to a Gaussian mixture model (GMM) [22]. Thus, for any a set of candidate classes that a recording X might speaker S in C, P (X; λS) is assumed to have the form Q belong to. Let P (X|C) be the probability distribution of P (X; λS) = t P (xt; λS), where speech recordings X from class C. P (X|C) is usually K represented through a parametric model, i.e. P (X|C) ≈ X S S S P (x; λS) = w N (xt; µ , Σ ). (2) P (X; λ ) λ C k k k C where C are the parameters of the class k=1 and are learned from data. Classification is performed as Here K is total number of Gaussians in P (x; λS), N () ˆ S S S C = arg max log P (X; λC ) + log P (C) (1) represents a Gaussian density, and wk , µk and Σk are C∈C the mixture weight, mean vector and covariance where P (C) represents the a priori bias for class C. of the kth Gaussian in the mixture. The parameters of S S S Stated thus, the primary difference among the applica- the GMM are expressed as λS = {wk , µk , Σk ∀ k = tions considered lies in the definition of the candidate 1,...,K}. classes in C and the nature of the model P (X; λC ). λU , the parameters of the GMM for “universal Before proceeding, we note that the classifiers do speaker” U are learned from a large collection of speech not work directly from the speech signal. Instead, the recordings from many speakers. λU is often called a signal is converted to a sequence of feature vectors, “Universal Background Model”, or UBM. The GMM typically “Mel-frequency cepstral coefficients” (MFCC) parameters λS for any speaker S are learned from a [20]. To generate these, the signal is segmented into collection of recordings for the speaker, using the EM overlapping “frames”, each typically 25ms wide, with algorithm. When the amount of training data for the an overlap of 15ms between adjacent frames. From each speaker are limited, e.g. from enrollment recordings in frame, a vector of MFCC (or similar) features is derived, an authentication system, and insufficient to learn GMM which may be further augmented with their temporal parameters robustly, they are learned by adapting the pa- differences and double-differences. For our purposes, it rameters of the UBM to the data from the speaker using 3

an MAP estimation algorithm [23], [24]. Classification To compute P (X; λC ) for any class C, we must of speakers with the GMMs now proceeds as employ the following recursion, commonly known as the forward recursion. Here, the term αC (t, i) represents the T X total probability that the process arrives at state i after t Sˆ = argmax log P (x; λS) + θS. (3) S∈C transitions and generates the partial sequence x1,..., xt, t=1 C i.e., α (t, i) = P (x1,..., xt, state(t) = i; λC ). where θS encodes our prior biases as mentioned earlier. C C C α (1, i) =πi P (x1|i) C X C C B. Recognition Applications α (t, i) =P (xt|i, C) α (t − 1, j)aj,i ∀ t > 1 j In speech recognition systems, C in (1) represents the X P (X; λ ) = αC (T, i) (4) collection of all possible word sequences that a person C may say in the recording. Phrase spotting systems only i include the specific phrases of interest, along with a where T is the total number of feature vectors in X. background model, usually called a “garbage model” in P (X; λC ) computed in the above manner considers this context, which represents the “none of the above all possible state sequences that the process may have option”. Isolated word recognition systems assume that followed to generate X. It can be used in (1) for only a single word was spoken in the recorded segment phrase spotting and isolated-word recognizers. In contin- being analyzed; C here comprises the vocabulary of uous speech recognition, however, the classes are word words to recognize. In continuous speech recognition sequences, which are collapsed into a compact word (CSR) systems, C represents the set of all possible graph [26] and computing P (X|C) through (4) is not sentences a person may speak. This set can be very large, feasible for individual word sequences. Here, P (X; λC ) and even infinite in size. To make classification man- is replaced by the probability of the most likely state ageable, the set of all possible sentences is represented sequence though the HMM for C, and the classification as a compact, loopy word graph, which conforms to a is performed as grammar or n-gram language model that embodies the Cˆ = argmax log P (C) + max log P (X, s; λC ) (5) a priori probabilities P (C) [19]. C s The probability distribution P (X; λ ) for each class C where s is the state sequence followed by the process. C is usually modelled by a hidden Markov model Unlike classification based on forward probabilities, this (HMM). The theory of HMMs is well known; we only estimation can be performed over the set of word se- reiterate the salient aspects of it here [25], [19]. An quences represented by a word graph: the word graph is HMM is a model for time-varying processes and is composed into a large HMM by replacing each edge in characterized by a set of states [s , . . . , s ] and an 1 M the graph by the HMM for the word it represents [26]. associated set of probability distributions. According to The word sequence corresponding to the most probable the model, the process transitions through the states state sequence through the resulting HMM is guaranteed according to a Markov chain. After each transition it to be identical to the one obtained by (5). Even in draws an observation vector from a probability distri- contexts outside continuous speech recognition, (5) is bution associated with its current state. The parameters frequently used as a generic substitute for (1), since it is characterizing the HMM for any class C are i) the initial more efficient to compute. state probabilities ΠC = {πC , i = 1,...,M}, where πC i i The term max log P (X, s|C) in turn can be com- representing the probability that at the first instant the s puted very efficiently using a dynamic programming process will be in state s ; ii) transition probabilities i algorithm known as the Viterbi algorithm, which imple- AC = {aC i = 1, . . . , M, j = 1,...,M} which i,j ments the following recursion. Here ΓC (t, i) represents represent the probability that, given the process is in the log of the joint probability of x ··· x and the most state s at any time, it will jump to s at the next 1 t i j probable state sequence that arrives at state s at time t. transition, and iii) the set of state output probability i distributions {P C (x;ΛC )} associated with each state. In C C C i Γ (1, i) = log π + log P (x1;Λ ) C C i i speech recognition systems the P (x;Λi ) are generally C C C C C δ (t, i) = argmax Γ (t − 1, j) + log aj,i ∀ t > 1 modeled as Gaussian mixture densities: P (x;Λi ) = j P C C C k wi,kN (xt; µi,k, Σi,k). Thus the parameters for class C C C C C c C C C C Γ (t, i) = log P (xt;Λi ) + Γ (t − 1, δ (t, i)) + log aδC (t,i),i C are λC = {Π , A , Λ }, where Λ = {Λi ∀i = C C C C max log P (X, s|C) = max ΓC (T, i) (6) 1,...,M} and Λi = {wi,k, µi,k, Σi,k ∀ k}. s i 4

δC (t, i) is a table of “backpointers” from which the most next two sections describe two frameworks that enable likely state sequence argmaxs log P (X, s; λC ) can be such private processing. determined by tracing backwards recursively as sT = C C IV. THE CRYPTOGRAPHIC APPROACH argmaxi Γ (T, i), st = δ (t + 1, st+1), where st is the tth state in optimal state sequence. In continuous speech The cryptographic approach treats private speech pro- recognition systems, in particular, the backtraced optimal cessing as an instance of secure two-party computa- state sequence traces a path through the word graph and tion [27]. Consider the case where two parties, Alice is used to determine the spoken word sequence. and Bob have private data a and b respectively and they want to compute the result of a function f(a, b). Any computational protocol to calculate f(a, b) is said to III.PRIVACY ISSUESIN SPEECH PROCESSING be secure only if it leaks no more information about a From the discussion of the previous section, the key and b than what either party can gain from learning the component of all the above applications is the compu- result c. We assume a semi-honest model for the parties tation of the class score log P (X; λC ). In conventional where each party follows the protocol but could save — non-private – implementations of these applications, and intermediate results to learn more about the system providing the application has access to data other’s private data. In other words, the parties are honest X, which represents the user’s voice. This enables it to but curious and will follow the agreed-upon protocol but manipulate or misuse the data, raising privacy concerns. will try to learn as much as possible from the data flow The alternative, i.e., permitting the user to access the between the parties. We return to the issue of honesty models λC , is equally unacceptable. In many situations, later in this section. the models are the system’s intellectual property. Fur- We recast the conventional speech processing algo- thermore, in voice-data mining situations, the system’s rithms described in Section II as a pipeline of primitive models also represent the patterns it is searching for; computations, and sow how to execute each primitive in revealing these to external parties may not be acceptable. a computationally secure manner, i.e., a computationally A voice-based authentication system where the user bounded participant should not be able to derive private himself has access to the models cannot be considered information possessed by the other participant from the an effective authenticator. computation. The output of each stage of the pipeline Thus, for maximally protecting the user and the sys- is either distributed across both participants in the form tem from one another, the system must not learn the of random additive shares, or arrives at one of the user’s data X, while the user must not be able to infer participants in a locked form, where the other participant the system’s models λC . For the speech processing ap- holds the key. Thus both participants must interact to plications discussed above, this means that log probabil- perform the computations until the final outcome of the ity terms computed from GMMs, forward probabilities, overall computation arrives at the correct participant. and best-state-sequence probabilities (also called Viterbi scores) must all be computed without revealing X to A. Secure Primitives for Speech Processing the system or λC to the user. An additional twist arises To keep the discussion simple, we will consider that in the case of authentication systems, where the system there are only two parties, Alice and Bob, engaged in must be prevented from making unauthorized use of the the computation. The reader may consider Alice as the models it has for the user. In this case, the model λS client or end-user who possesses a speech signal to be must itself be in a form that can only useful when the analyzed while keeping it private from Bob, the remote system engages with the appropriate user. system which has the model parameters, some or all of The above deals with the computation of scores for which must be kept private from Alice. any class. But what about the final outcome of the To enable private computation, we will utilize public- classification? This, too, has an intended recipient. For key homomorphic encryption schemes, which enable instance, in most biometric applications it is the system operations to be performed on encrypted data [28]. We that must receive the outcome of the classification; how- keep the development simple by considering an addi- ever for recognition systems the user obtains the result. tively homomorphic cryptosystem [12], [13], [14] with Thus we also stipulate that the outcome is only revealed encryption function E[·]. Such a cryptosystem satisfies to the intended recipient. We refer to any computa- E[x] · E[y] = E[x + y] and (E[x])y = E[xy] for integer tional mechanisms that enable the above requirements messages x, y. For a discussion of other flavors of homo- as private computation, as opposed to conventional non- morphic cryptosystems, namely multiplicatively homo- private methods that make no guarantees to privacy. The morphic, 2-DNF homomorphic and fully homomorphic 5

cryptosystems, the reader is referred to a companion Suppose that x + y = (ln z1, ln z2, ..., ln zn). By article in this issue [29]. a slight abuse of notation, denote the vector of the We assume that the cryptosystem is semantically elementwise logarithms and elementwise exponents of secure [30], i.e., by using fresh random parameters the elements of x as ln x and ex respectively. Alice and during encryption but not during decryption, a given Bob wish to obtain uninformative additive shares, a and Pn plaintext may be mapped to different ciphertexts b such that a + b = ln ( i=1 zi), which is the “logsum” every time encryption is performed. This makes the operation that gives the protocol its name. We note that Pn Pn xi+yi cryptosystem, and hence the protocols discussed, i=1 zi = i=1 e and achieve the desired secret resilient to a chosen plaintext attack (CPA). We sharing using the following protocol [7]: further assume that while Alice and Bob share a 1) Alice chooses a at random. Then Alice and Bob public encryption key, Alice is the only person with compute additive shares q, s such that q + s = a decryption key that reverses E[·]. This means that SIP (ex−a, ey) using the SIP protocol above. Bob the system may encrypt data if needed, but only the combines these shares to obtain the inner product end-user may decrypt it. However, if required the φ. Pn xi+yi situation can be reversed by mirroring the protocols, 2) Bob computes b = ln φ = −a+ln ( i=1 e ) = Pn with relatively minor additional changes. Assume that −a + ln ( i=1 zi), which gives the desired result. Alice and Bob own n-length integer vectors x and y In the first step above, Alice and Bob employ additive respectively. In what follows, E[x] denotes a vector secret sharing in the exponent, which is equivalent to containing encryptions of the individual elements of x, multiplicative secret sharing. The parameter a should be i.e., (E[x1],E[x2], ..., E[xn]). chosen large enough because multiplicative secret shar- ing is not as secure as standard additive secret sharing. Additive Secret Sharing We present this protocol to illuminate the fact that homo- Bob has an encrypted value E[x]. He wishes to morphic functions can be manipulated to compute useful, share it as random additive clear-text shares with non-obvious functions. The same functionality can be Alice. He chooses an integer b at random, and sends obtained in a secure manner using other cryptographic E[x] · E[−b] = E[x − b] to Alice. Bob retains b as primitives, e.g., garbled circuits. To refer to this protocol his additive share. Alice decrypts x − b = a which is henceforth, we will state that Alice and Bob obtain her additive share. We will represent this operation in additive shares a, b, such that a + b = SLOG(ln z). short-hand by saying Alice and Bob receive a and b In an alternate scenario, Bob possesses P such that a + b = SHARE(x). E[ln z1],E[ln z2], ··· . He wishes to obtain E[log i zi]. He uses the SHARE protocol to share E[ln zi] ∀i with Secure Inner Product (SIP) Alice and proceeds as earlier. Finally, Alice sends E[a] P Alice and Bob want to compute uninformative additive to Bob who computes E[log i zi] = E[a]E[b]. Note shares a, b respectively of the inner product x>y. There that in the process Alice and Bob also obtain additive are several ways to achieve this [31], [32], [33], but for shares of the outcome. We will denote this operation expository purposes, we focus on a simple approach us- by stating that Bob receives c such that c = SLOG(ln z). ing additively homomorphic functions [34]. Alice sends element-wise encryptions E[xi] to Bob. Bob computes Secure Maximum Index (SMI) Qn yi Pn > i=1 E[xi] = E[ i=1 xiyi] = E[x y]. Alice and Alice and Bob wish to compute additive shares, T Bob can then receive additive shares a and b of E[x y] such that a + b = argmaxi xi + yi without revealing as a+b = SHARE(x>y). Observe that Bob operates in their data to each other. This is achieved by exploiting the encrypted domain and cannot discover x. Similarly, homomorphic encryption to perform blind-and-permute Alice does not know b, and hence cannot discover y. operations [35]. Let z = x + y. Then, the goal is to When Bob possesses only E[y] rather than y as privately compute additive shares of the index of the assumed above, it is still possible — using a trick largest element of z. To accomplish this, Alice chooses involving additive secret sharing — for Alice and a secret permutation πA on the index set {1, 2, ..., n}. > Bob to obtain additive shares of x y. As a notational Similarly Bob chooses a secret permutation πB on the shorthand, when we invoke any variant of this protocol, same set. The outcome of the blind-and permute protocol we will just state that Alice and Bob obtain additive is that Alice and Bob each obtain additive shares of shares a, b, such that a + b = SIP (x, y). πAπBz. Neither can reverse the other person’s permu- tations. However, given the permuted shares, Alice and Secure Logsum (SLOG) Bob can use repeated instantiations of the millionaire 6 protocol [17], obtain the index of the maximum element values and operate exclusively in the logarithmic domain. in the permutation of z, from which, using their respec- In fact, the need to perform operations on the logarithms tive permutations, they can obtain shares of the index of of the probability values is what makes the SLOG the maximum element in z. To see the steps involved protocol so useful. A second way to mitigate the issue in minimum finding based on homomorphic encryption, of extremely small probability values is to appropriately refer to a companion article submitted to this special scale the probability values by a large constant, e.g., 106 issue [36]. Alternatively the entire minimum finding prior to all encryptions and compensate for the scaling protocol may be executed using garbled circuits [37]. As after decryption. a shorthand notation, we say that Alice and Bob obtain additive shares a, b, such that a + b = SMI(x, y) B. Computing Scores Privately for Speech Processing In an alternate scenario, Bob has two encrypted The computations involved in speech processing tasks numbers E[x] and E[y] and desires to find which of the can now be cast in terms of the above primitives. We two is larger, he can engage in protocols such as those first consider how the various required scores can be described in [38], which employ threshold encryption computed privately. or secret sharing schemes in which both parties must cooperate to perform decryption such that one of them The Gaussian as a Dot Product can obtain the answer without exposing x or y to either. The fundamental component of speech processing We denote this also as max(x, y) = SMI(x, y); this models is the Gaussian, since observation distributions should not result in confusion since the actual operation are generally modeled as Gaussian mixtures. The form used will be clear from the context. of the log of a multi-variate Gaussian is well known: > −1 D Secure Maximum Value (SMV) log P (x) = −0.5(x−µ) Σ (x−µ)−0.5 log(2π) |Σ| Alice and Bob wish to compute uninformative additive where D is the dimensionality of the data, and µ and Σ shares a, b such that a + b = maxi xi + yi. This is are the mean and covariance matrix of the Gaussian. It achievable using a protocol similar to the one described is fairly simple to show that this can be manipulated into above, with the only difference being in the last step, the form x˜T W˜ x˜ [7] where x˜ is obtained by extending which – instead of revealing the index of the maximum – x as x˜ = [x>1]>, and reveals additive shares of the maximum value. To invoke " # this protocol, we will state that Alice and Bob obtain −0.5Σ−1 Σ−1µ W˜ = (7) additive shares a, b, such that a + b = SMV (x, y) [7]. 0 w∗ Other similar protocols may be defined. In particular, it is often essential to compute distance measures, such where w∗ = −0.5µ>Σ−1µ − 0.5(2π)D log |Σ| + as the Hamming or Euclidean or Absolute distance log prior, and prior captures an a priori probability between vectors x and y held privately by Alice and associated with the Gaussian. In the case of a solitary Bob respectively. Protocols for these computations can Gaussian, prior = 1; however if the Gaussian is one be efficiently designed using homomorphic functions from a mixture, prior represents the mixture weight and are an integral part of privacy-preserving nearest for the Gaussian. We can reduce the above computation neighbor methods which are covered in a separate further to a single inner product x¯>W , where x¯ is article in this issue [36]. a quadratically extended feature vector derived from x˜ which consists of all pairwise product terms x˜ix˜j Practical Considerations of all components x˜i, x˜j ∈ x˜, and W is obtained An important practical consideration is that we cannot by unrolling W˜ into a vector. In this representation always assume that x and y are integer vectors. Indeed, log N (x; µ, Σ) = x¯>W . speech processing routinely uses probabilistic models, such as Gaussian Mixture Models (GMMs) and Hidden Privacy preserving computation of a Log Gaussian Markov Models (HMMs) in which the model parameters Input: Alice possesses a vector x. Bob possesses a are floating point values between 0 and 1. A particular Gaussian parameterized by λ = {µ, Σ}. problem that is nearly ubiquitous in the iterative algo- Output: Alice and Bob obtain random additive shares a rithms used to perform modern speech processing is that and b such that a + b = log P (x; λ). multiplication of probability values result in extremely 1) Alice computes the extended vector x¯ from x. Bob small numbers. One way to mitigate this issue of floating arranges his model into a vector W as explained point precision is to take the logarithm of the probability above. 7

2) Alice and Bob engage in the SIP protocol to obtain log α(t, i) and αl(t) = [log α(t, 1), ··· , log α(t, M)]. additive shares a and b: a + b = SIP (x¯,W ). We represent {qt,i ∀i} obtained by Bob in the above Note that this is possible even if Bob only possesses computation as qt. Now Bob precomputes E[log ai] ∀i. an encrypted version of W . Note also that given only The recursions of the forward probability computation E[µ],E[Σ] and E[prior] Bob can obtain E[W ] using in HMMs given by Equation 4 can now be computed the protocol given in [9]. In either case, Alice and Bob in the log domain securely as follows. obtain no information about each other’s data. Step II Forward Algorithm

Computing the Logarithm of a Gaussian Mixture Pri- 1) Bob computes E[αl(1)] = qt · E[log π]. vately 2) For t = 2 ··· T , i = 1 ··· M Input: Alice possesses x. Bob possesses Λ = a) Bob engages with Alice in the SLOG protocol {λ1, λ2, ··· , λK }, the parameters of a Gaussian mixture to obtain c = SLOG(log ai + αl(t − 1)). Note P with K Gaussians, each with parameter λi. that c = E[log j α(t − 1, j)aj,i]. Output: Alice and Bob receive a and b such that b) Bob computes E[αl(t, i)] = c.qt,i a + b = P (x; Λ). 3) Alice and Bob engage in the SLOG protocol to 1) Bob arranges the parameters λi of each Gaussian obtain additive shares a + b = SLOG(αl(T )). into a vector Wi, i = 1,...,K. Note that a and b are additive shares of 2) For each i = 1,...,K, Alice and Bob engage in the log P (x1, x2, ··· , xT ; Γ). SIP protocol to obtain additive shares ci and bi: We will refer to this operation as a+b = SFWD(X, Γ). ci + di = SIP (x¯,Wi). Note that if Wi is available in unencrypted form, Alice only needs to transmit Secure Viterbi Algorithm E[x¯] once for the entire Gaussian mixture. The Viterbi algorithm is very similar to the forward 3) Alice and Bob apply the SLOG protocol to algorithm, except for two key differences. First, instead obtain a + b = SLOG(ln g), where g = of adding all incoming probabilities into any state in (4), [c1 + d1, ··· , cK + dK ], where ci + di is the log we choose the maximum in (6). Additionally, Alice must th of the i Gaussian. also receive the optimal state sequence. Ideally, state As before, Alice and Bob do not learn about each indices would be permuted in the received sequence; other’s data. We will refer to this operation as however, for simplicity, we omit the additional com- a + b = SMOG(x, Λ). plexities involved with permuting the indices. Using the notation of (6): δ(t, i) refers to the “best” predecessor Computing the logarithm of an HMM forward score for state si at time t. Γ(t, i) is the joint log likelihood privately of the most probable state sequence arriving at state si Input: Alice possesses X = [x1, x2, ··· , xT ]. Bob at time t and the observation sequence x1, ··· , xt. We possesses an M-state HMM Γ = {Π, A, Λ}, with will use the notation that a bold character represents a initial state probabilities Π = [π1, ··· , πM ], a transition vector that aggregates the scalar values represented by matrix A comprising vectors ai = [a1,i, ··· , a1,M ] the corresponding unbolded symbol for all states. For and a set of state output Gaussian mixture densities instance Γ(t) = [Γ(t, 1), Γ(t, 2), ··· , Γ(t, M)]. Λ = {Λ1, Λ2, ··· , ΛM }. Input: Alice possesses X = [x1, x2, ··· , xT ]. Bob Output: Alice and Bob obtain a and b such that possesses an M-state HMM Γ = {Π, A, Λ}. a + b = log P (X; Γ). Output: Alice and Bob obtain additive shares a and b such that a + b = log maxs P (X, s; Γ). Alice receives Step I State output density computation the optimal state sequence ¯s = s1, ··· , sT . 1) For all t = 1 ··· T, i = 1 ··· M a) Alice and Bob engage in the SMOG pro- The protocol uses operations similar to the ones de- scribed in the previous section and in the secure forward tocol to obtain additive shares gt,i + ht,i = algorithm described above. For the detailed steps of the SMOG(xt, Λi). Alice sends E[gt,i] to Bob. protocol, the reader is referred to [7]. b) Bob computes qt,i = E[gt,i]E[ht,i] = E[gt,i + ht,i] = E[log P (xt;Λi)] to obtain the encrypted value of the logarithm of the state output distri- C. Implementing Secure Speech Techniques bution for state i on xt. Having set up the operations as described above, In the protocol below we use the notation αl(t, i) = we now consider how they may be applied to our 8 speech problems. We will generally assume that all lated word recognition or phrase spotting, the speech models have already been learned by the system, recognition processes can generally be summarized into and, where required (e.g. speaker authentication), the the following procedure. models are encrypted. To avoid making the development 1) For each speaker s = 1 ··· N, the user and the unnecessarily complex, we do not describe how to system engage in the SFWD operation with X and s s privately learn GMM/HMM parameters [7], or how to Γs to obtain additive shares A and B . privately adapt an existing background GMM or UBM 2) The user and system engage in an SMAX protocol to a user’s enrollment data [9]. s s to determine argmaxs A +B . The system gets the result. Speaker authentication The above operation is entirely performed in the log Input: The system possesses encrypted models ΛS domain – all probabilities are log probabilities – and is for the speaker and clear-text models ΛU as the uni- hence robust to underflow. In practice the secure Viterbi versal background model. The user possesses vectors algorithm is a better choice the forward algorithm to x1, x2, ··· , xT . compute class scores, since its secure implementation Output: The system authenticates the user. has many fewer expensive SMOG operations, which are 1) For each t = 1 ··· T replaced by SMV operations. In this case P (X;Γs) is a) The user and the system engage in the SMOG computed using the SVIT protocol instead of SFWD operation with xt and ΛS to obtain additive as described earlier. For continuous speech recognition S S shares at and bt . the optimal state sequence (and the word sequence b) The user and the system engage in the SMOG corresponding to it) can be obtained using the SVIT operation with xt and ΛU to obtain additive protocol. U U shares at and bt . S P S U P U 2) The user computes A = t at and A = t at . D. Analyzing the protocols S P S U The system computes B = t bt and B = P U Correctness: t bt 3) The user and system engage in an SMI protocol to All of the presented protocols, including the determine max(AS + BS,AU + BU ). The system primitives, the score computation and the actual gets the result. classification procedures above are easily shown to be correct – the final result obtained with the private protocols is exactly what would have been obtained Speaker identification had the operations been performed in a non-private Input: The system possesses a set of models manner. It can also be ascertained that in practical Λ , Λ , ··· , Λ corresponding to speakers 1 ··· N (we 1 2 N implementations the insecure and secure versions of the consider the “background” model to be one of the computations are virtually indistinguishable, thus the speakers). The user has X = [x , ··· , x ]. 1 T accuracy tradeoff owing to encryption is negligible [7], Output: The system learns argmax P (X;Λ ). i i [8]. 1) For each speaker s = 1 ··· N, a) For each time t = 1 ··· T , the user and the system Security: engage in the SMOG operation with xt and Λs Each of the operations above is also computationally s s to obtain additive shares at and bt . secure against honest but curious participants. The indi- s P s b) The user computes A = t at . The system vidual primitives in Section IV-A are easily shown to be s P s computes B = t bt . secure (e.g. [39]) – Alice and Bob do not learn about 2) The user and system engage in an SMI protocol each others’ data at all, since they only see ciphertexts, s s to determine argmaxs A + B . The system gets or data masked by additive noise [40]. Consequently, the result. the secure Gaussian operation, the SMOG operation, the SFWD and the SVIT protocol are all guaranteed Speech Recognition not to reveal the user and system’s data to one another Input: The system possesses a set of models if the protocols are correctly followed. In general, none Γ1, Γ2, ··· , ΓN corresponding to words or word se- of the protocols reveal more than what the outcome of quences 1 ··· N (we consider the “background” model to the computation itself reveals. be one of the phrases). The user has X = [x1, ··· , xT ]. One might also consider a malicious model, where Output: The system learns argmaxi P (X;Λi). For iso- one or more parties may manipulate the data or send 9 bogus data in an attempt to disrupt the protocol or in the computation time. It should be noted that Paillier learn about the other party’s data. If both parties are encryption based on 256-bit and 512-bit keys is no malicious, security can be enforced by accompanying longer considered sufficiently secure in the cryptography the protocols with zero-knowledge proofs [41]. If only community. one of the parties is malicious, the other party can use Table II shows a similar computational expense table conditional disclosure of secrets [42] to make sure he/she containing the average time required to perform privacy- receives valid inputs from the malicious party. Both these preserving speaker authentication on a Core 2 Duo 2 methods, however, greatly increase the computation and GHz Linux machine per second of input audio. In this communication overhead of the protocols. case, BGN encryption [43] is employed, rather than Finally we note that although, technically, the the significantly less expensive Paillier encryption. The proposed protocols (SVIT ) also permit fully continuous data are from the YOHO dataset [44]. Both the speaker speech recognition (CSR), they assume that the entire and the subject were modeled by mixtures of 32 Gaus- HMM representing the complete word graph to be sians, and the audio was represented by a sequence of searched is evaluated for each analysis frame. In 39-dimensional feature vectors computed at the rate of practice, CSR is never preformed in this manner, even 100 times a second. By comparison, the “insecure” com- in conventional non-private implementations. CSR putation on the same experiment took only 3.2 seconds has a high memory footprint and high computational per second of audio. The scores computed by the secure complexity, and the word graphs must be pruned computation were identical to within the five decimal heavily based on partial scores, in order to restrict the places to those obtained in the insecure conventional computation. The act of pruning restricts the hypothesis classifier. Classification accuracies on this data set, with set considered and reveals information about the this setup achieves an EER of about 7%. Clearly then, recognition output. Techniques to hide this information computation is a serious bottleneck. Encryption and are not within the scope of this article, although they decryption take the most time. Additional operations take are topics of active research. relatively small time in comparison to the overhead of cipher text processing; nevertheless even they are consid- Performance: erably more expensive than insecure computation. The The private computation techniques must in princi- times shown do not include communication overhead; ple result in identical classification outcomes to their however for these tasks the communication overhead is conventional non-private counterparts. Although there trivial compared to the computational expenses. is a minor loss of resolution resulting from the fact In the experiments presented above, the implementa- that all computation must now be performed with fixed- tions were far from optimal. Moreover, the primitives point arithmetic to accommodate encryption over integer described herein can themselves be optimized; it is not fields, this has little effect on accuracy – speech process- clear that the particular structure chosen to decompose ing applications have historically achieved reasonable the computations is optimal from the perspective of performance with fixed-point implementations with as computational complexity. Performing computations on little as 16 bits of resolution. parallel processors, more efficient implementations of Perhaps the most important performance consideration encryption etc. all may improve the speeds by some is the additional computational complexity imposed by orders of magnitude; regardless, the final outcome is the privacy primitives. In particular, computing a single currently far slower than the speed performance of Gaussian securely takes a significant fraction of a second insecure computations. on a desktop computer [7]. Table I reports computation The methods described so far provide privacy to times for 1 second of audio [8] on an isolated word conventional state-of-art mechanisms for performing pat- recognition experiment for a vocabulary of the ten digits TABLE I 0-10, each of which was modeled by a 5-state HMM EXECUTIONTIMEFORISOLATEDWORDRECOGNITIONPROTOCOL with a single Gaussian. Results were obtained on a 3.2 (PAILLIER ENCRYPTION). GHz Pentium 4. The Paillier encryption scheme was Activity 256-bit 512-bit 1024-bit used [12]. The difference between the forward scores keys keys keys computed using secure computation those computed in Alice encrypts input data 205.23 s 1944.27 s 11045 s the conventional manner was less than 0.52%, and the (only once) Bob computes ξ(log bj (xt)) 79.47 s 230.30 s 461 s classification accuracy for the secure and conventional (per HMM) versions was nearly the same, being 99%. Results with Both compute ξ(αT (j)) 16.28 s 107.35 s 785 s different key sizes are given primarily to show the trends (per HMM) 10 TABLE II EXECUTION TIME FOR THE VERIFICATION PROTOCOL (PAILLIER like bit strings that are not invertible. We describe these ENCRYPTION). briefly below. Steps Time (256-bit) Time (1024-bit) Encrypting x¯t ∀t 138 s 8511 s Evaluating Adapted 97 s 1809 s A. Feature Representation: Supervectors Evaluating UBM ≈ as Adapted ≈ as Adapted Comparison 0.07 s 4.01 s An obvious solution to create a fingerprint is to apply Total =E[¯xt] + adapted 331 s 12133 s a cryptographic hash function H[·], e.g., SHA-256 [45], + UBM + compare ∼ 5.47 min ∼ 3 hr, 32 min to the speech input itself. However, direct conversion of audio signals into fingerprint-like patterns is difficult, tern recognition applications on speech. An alternate due the inherent variation in cadence and length of audio mechanism may be to modify the matching algorithms recordings. A more robust length- and cadence-invariant themselves to make them more amenable to efficient representation of the audio is first required. secure implementations. In the next section, one such ap- Campbell, et al. [46] extend the MAP adaptation proach is described. These techniques are not as generic procedures for GMMs mentioned in Section II-A to in their scope as SMC- based methods described above, construct a supervector (SV) to represent each speech since the right form of classifier must be found for the sample. A supervector is a characterization of an estimate task. Specifically, the technique we describe now applies of the distribution of feature vectors derived from the only to speaker authentication and congruent tasks. speech recording. It is obtained by performing maximum a posteriori (MAP) adaptation of the UBM over the V. SPEECH PROCESSINGAS PRIVACY PRESERVING recording and concatenating the parameters, typically STRING MATCHING just the means, or means and mixture weights of the adapted model. For instance, given the adapted model We now discuss an alternative framework for privacy- λ = {wˆs, µˆs, Σˆ s} with M Gaussian components, the preserving speech processing based on private string s i i i supervector sv is given by (ˆµs k µˆs k · · · k µˆs ). comparison. The main idea is to convert a speech sample 1 2 M The supervector is then used as a feature vector instead into a fingerprint, i.e., a fixed-length bit string. This of the original feature vectors derived from the speech representation allows us to compare two speech samples sample. In the speaker authentication task addressed by by checking if their respective fingerprints match exactly. Campbell et al., authentication verification is performed Unlike the cryptographic framework described earlier, using a binary support vector machine (SVM) classifier string comparison is non interactive, and also much for each user. The SVM is trained on supervectors faster than encryption which enables us to perform the obtained from enrollment utterances from the user, as privacy-preserving processing very efficiently. instances of one class, and from a collection of impostor The fingerprint representation is similar to a text- recordings as instances of the opposite class. As the password system. The privacy issues are also similar; classes are usually not separable in the original space, the user requires the system to store and compare the [46] also use a kernel mapping that is shown to achieve passwords in an obfuscated form, so that an adversary higher accuracy. cannot observe the original passwords. Furthermore, we A related approach is to use k-nearest neighbors require the system to be accurate; just as the password trained on supervectors as our classification algorithm. system is able to reject users entering incorrect pass- The rationale for this approach is twofold: firstly, words, the privacy-preserving speech processor should k-nearest neighbors allow classification with non-linear be able to classify speech samples accurately. Finally, decision boundaries with accuracy comparable to SVMs the performance of this string comparison-based speech with kernels [47]. Secondly, using the LSH transfor- processing approach should be competitive with con- mations discussed below, private k-nearest neighbors ventional speech processing methods. Initial software computation can be reduced to private string comparison, implementations reveal a tradeoff between a significant which can be easily accomplished without an interactive increase in speed and a small degradation in the accuracy protocol. with respect to classical methods based on, for example, Hidden Markov Models. The twin objectives of accuracy and privacy are B. Locality Sensitive Hashing achieved as follows: A data-length independent feature Locality sensitive hashing (LSH) [48] is a widely used vector (string) is derived from the speech signal such that technique for performing efficient approximate nearest- string comparison implies a nearest-neighbor classifica- neighbor search. An LSH function L(·) proceeds by tion. Second, these vectors are converted into password- applying a random transformation to a data vector x to 11

+ H[ ]

Speech → Supervector → LSH → Crypto Hash

Original Feature Space LSH Keys + UBM Fig. 1. Locality Sensitive Hashing Fig. 2. Speaker Verification as String Comparison

th projecting it to a vector L(x) into a lower dimensional i bit of the LSH for cosine distance, using ri defined space, which we refer to as the LSH key or bucket. A set as above is given by of data points that map to the same key are considered ( T 1 if ri x > 0, as approximate nearest neighbors (Figure 1). Li(x) = (9) 0 otherwise, A single LSH function does not group the data points into fine-grained clusters, one must use a hash key LSH is inherently not privacy preserving due to its obtained by concatenating the output of k LSH functions. locality sensitive property. It is possible to reconstruct This k-bit LSH function L(x) = L1(x) ··· Lk(x) maps the input vector by observing a sufficient number of a d-dimensional vector into a k-bit string. Two data LSH keys obtained from the same vector. To satisfy the vectors may be deemed to be highly similar if the privacy constraint, a cryptographic hash function H[·] is k-bit hashes derived from them are identical. Such fine applied to the LSH keys. Cryptographic hash functions, selectivity may however have an adverse effect on the such as SHA-256, MD5, are orders of magnitude faster recall in identifying neighbors when the data from a class to compute compared to homomorphic encryption. have significant inherent variations. To address this, m different LSH keys are computed over the same input to C. Privacy-Preserving Speech Processing through achieve better recall. Two data vectors x and y are said to String Comparison be neighbors if at least one of their keys, each of length The private string comparison framework lends itself k, matches exactly. LSH provides a major efficiency very well to biometric tasks where audio-length-invariant advantages: By precomputing the keys, the approximate supervector representations of the audio may be derived. nearest neighbor search can be done in time sub-linear In applications such as speaker verification and speaker in the size of the dataset. identification, the user can convert the test speech sample A family of LSH functions is defined for a particular into supervectors, apply the LSH transformation, and distance metric. A hash function from this family has the finally apply a cryptographic hash function and submit property that data points that are close to each other as the output to the system. The system can simply compare defined by the distance metric are mapped to the same the hashes provided by the user to the hashes computed key with high probability. There exist LSH constructions over the enrollment data and accept or reject the user for a variety of distance metrics, including arbitrary by comparing the number of matches to a pre-calibrated kernels [49], but we mainly consider LSH for Euclidean threshold. Due to the irreversibility of the hashes, the distance (E2LSH) [50] and cosine distance [51] as the system is not able to recover the original speech sample, LSH functions for these constructions are simply random and we are able to maintain the privacy of the speech vectors. As the LSH functions are data independent, it input. As an illustrative example, we present the string is possible to distribute them to multiple parties without comparison technique applied to the privacy-preserving privacy loss. speaker verification problem below and refer the reader The LSH construction for Euclidean distance trans- to [10] for other applications. forms a d-dimensional vector into a vector of k integers. The ith entry of the hash is as follows. D. Privacy-Preserving Speaker Verification as String rT x + b L (x) = i , (8) Comparison i w As the possible values for LSH keys discussed above where ri is a d-dimensional vector with each component lie in a relatively small set by cryptographic standards, drawn iid from N (0, 1), w is the width of the bin, (e.g., 256k for k-bit Euclidean LSH and 2k for k-bit cosine 255), and b ∈ [0, w]. Similarly, the construction of the LSH, it is possible for the server to obtain L(s) from 12

TABLE III H[L(s)] by applying brute-force search. To make this AVERAGE EER FOR THE TWO ENROLLMENT DATA attack infeasible, the domain of the hash function H[·] is CONFIGURATIONS AND THREE LSH STRATEGIES:EUCLIDEAN, increased by concatenating the LSH key with a long ran- COSINE, ANDCOMBINED (EUCLIDEAN & COSINE). dom string qi (e.g., 80-bits in length) that is unique to the Enrollment: Only Speaker user i, which is called the salt, as summarized in Fig. 2. Requiring the user to keep the salt private and unique Euclidean Cosine Combined 15.18% 17.35% 13.80% to each system, also gives the additional advantage of rendering cryptographically hashed enrollment data use- less to an adversary. With this modification, LSH-based privacy-preserving speaker verification is achieved using Enrollment: Speaker & Imposter the enrollment and authentication protocols described Euclidean Cosine Combined below: 15.16% 18.79% 11.86% A. Enrollment Protocol Each user has a set of enrollment utterances from the salt, the user does not need to store any speech {x1,..., xn}. The users also obtain the UBM and the data on its device. The enrollment and verification proto- l LSH functions {L1(·),...,Ll(·)}, each of length cols, therefore, satisfy the privacy constraints discussed k-bit from the system. Each user i generates the above. random 80-bit salt string qi. For each enrollment utterance xj, user i: E. Experiments a) performs adaptation of xj with the UBM to obtain Experiments examining the accuracy and performance supervector sj. of LSH-based schemes have been reported using the b) applies the l LSH functions to sj to obtain the keys YOHO dataset [44] which comprises of a collection {L1(sj),...,Ll(sj)}. c) applies the cryptographic hash function salted with of short utterances, each a sequence of three two- digit numbers, produced by 138 speakers. There qi to each of these keys to obtain are 96 enrollment utterances and 40 test utterances {H[L1(sj) k qi],...,H[Ll(sj) k qi]}, and sends them to the system. from each speaker. Mel-frequency cepstral coefficient (MFCC) features augmented by differences and double B. Authentication Protocol differences are chosen as feature vectors in these For a test utterance x0, user i: experiments [10]. A UBM with 64 Gaussian mixture 0 (a) performs adaptation of x with the UBM to components is trained on a random subset of the 0 obtain supervector s . enrollment data belonging to all users. Supervectors are 0 (b) applies the l LSH functions to s to obtain the obtained by individually adapting all enrollment and keys verification utterances to the UBM. 0 0 {L1(s ),...,Ll(s )}. (c) applies the cryptographic hash function salted Accuracy with qi to each of these keys to obtain 0 0 The lowest equal error rate (EER) was achieved {H[L1(s ) k qi],...,H[Ll(s ) k qi]}, and sends by using l = 200 instances of LSH functions each it to the system. of length k = 20 for both Euclidean and cosine (d) The system computes the number of matches distances. Table III indicates that LSH for Euclidean between the hashed keys for the test utterance distance performs better than LSH for cosine distance, and the corresponding hashes of enrollment ut- while combining the two classifiers gives the best terances. If this number exceeds a threshold, it accuracy. Furthermore, using imposter data achieves accepts the user. lower EER when using the combined scores for both l the distances: one-sided classifiers are clearly not X X match = I(H[Lj(si) k qi] sufficiently accurate. While the actual error rates in this i∈enrollment j=1 example may appear relatively high, it should be noted 0 = H[Lj(s ) k qi]), that an SVM-based classifier working from supervectors if match > threshold : accept. obtained a classification accuracy of 10.8% on the same setup. In other experiments not reported here, using The system never observes any LSH key before a supervectors derived from more detailed GMMs, EERs salted cryptographic hash function is applied to it. Apart of approximately 5% have been obtained using the 13 string matching framework. protocols. This would allow the creation of a non- interactive protocol where the user uploads the encrypted Execution Time speech sample, and the server can perform all the neces- Compared to a non-private speaker recognition system sary computations without requiring any further involve- based on supervectors, the only computational overhead ment from the user. Such a scheme would not alleviate all for the privacy-preserving version is in applying the the problems of privacy-preserving computation; there LSH and salted cryptographic hash function. For a would still be a need to deal with key distribution, ma- 64 × 39 = 2496-dimensional supervector representing licious adversaries, communication overhead, and many a single utterance, the computation for both Euclidean other practical problems. Nevertheless, researchers and and cosine LSH involves a multiplication with a random practitioners of privacy-preserving speech processing are matrix of size 20 × 2496 which requires a fraction following developments in FHE with interest. of a millisecond. Performing this operation 200 times required 15.8 milliseconds on average [10]. The reported REFERENCES times are for a laptop running 64-bit Ubuntu 11.04 with [1] US Federal Law, Title 18, Part 1, Chapter 19, Section 2511, 2 GHz Intel Core 2 Duo processor and 3 GB RAM. “Interception and disclosure of , oral, or electronic com- The Euclidean and cosine LSH keys of length k = 20 munications prohibited,” January 2012. require 8 × 20 bits = 20 bytes and 20 bits = 1.6 bytes [2] “The end of privacy,” The Economist, April 1999. [3] Paul Schwartz and Joel R. Reidenberg, Data Privacy Law: A for storage respectively. Using a C++ implementation Study of United States Data Protection, LEXIS law, 1996. of SHA-256 cryptographic hashing algorithm based on [4] Helen Nissenbaum, Privacy in Context - Technology, Policy, and the OpenSSL libraries [52], hashing 200 instances of the Integrity of Social Life, Stanford University Press, 2010. [5] R. K. Nichols and P. C. Lekkas, Speech cryptology, McGraw- each of these keys in total required 28.34 milliseconds Hill, 2002. on average. Beyond this, the verification protocol only [6] Madhusudana Shashanka and Paris Smaragdis, “Secure sound consists of matching the 256-bit long cryptographically classification: Gaussian mixture models,” in ICASSP, 2006. hashed keys derived from the test utterance to those [7] Paris Smaragdis and Madhusudana Shashanka, “A frame- work for secure speech recognition,” IEEE Transactions on obtained from the enrollment data. Audio, Speech and Language Processing, vol. 15, no. 4, pp. 1404–1413, 2007. [8] Manas A. Pathak, Shantanu Rane, Wei Sun, and Bhiksha Raj, VI.CONCLUSIONS “Privacy preserving probabilistic inference with hidden Markov models,” in ICASSP, 2011. The two frameworks presented in this article both [9] Manas A. Pathak and Bhiksha Raj, “Privacy preserving speaker show promise in enabling privacy-preserving speech verification using adapted GMMs,” in Interspeech, 2011, pp. processing. The cryptographic framework, while more 2405–2408. generic, carries the usual computational overhead of [10] Manas A. Pathak, Privacy-Preserving Machine Learning for Speech Processing, Ph.D. thesis, Carnegie Mellon University, cryptography. However, it can be made arbitrarily se- 2012. cure and flexible. The string-matching framework, on [11] Ronald L. Rivest, Leonard Adleman, and Michael L. Dertouzos, the other hand, is much more efficient; however it is “On data banks and privacy homomorphisms,” in Foundations restricted in its applicability and currently results in a of Secure Computation, 1978, pp. 169–180. [12] Pascal Paillier, “Public-key cryptosystems based on composite degradation of performance. degree residuosity classes,” in EUROCRYPT, 1999. At this point, both of them can only be viewed as [13] J. Benaloh, “Dense Probabilistic Encryption,” in Proceedings initial forays into the implementation of truly secure of the Workshop on Selected Areas of Cryptography, Kingston, ON, Canada, May 1994, pp. 120–128. speech-processing frameworks, and much work remains. [14] I. Damgard,˚ M. Geisler, and M. Krøigard,˚ “Homomorphic Researchers continue to investigate more efficient proto- encryption and secure comparison,” International Journal of cols, possibly using simpler encryption techniques which Applied Cryptography, vol. 1, no. 1, pp. 22–31, 2008. could be orders of magnitude faster than the methods [15] Andrew Yao, “Protocols for secure computations (extended abstract),” in IEEE Symposium on Foundations of Computer described here. Within the string-matching framework, Science, 1982. recent work has shown that significantly greater accu- [16] Michael Rabin, “How to exchange secrets by oblivious trans- racies can be obtained using nearest-neighbor methods fer,” Tech. Rep. TR-81, Harvard University, 1981. [17] Andrew Yao, “How to Generate and Exchange Secrets,” in such as those described in [36]. These also show great Proceedings of the 27th Annual Symposium on Foundations of promise in affording greater generalizability than the Computer Science (SFCS), Washington, DC, USA, 1986, pp. string-matching solutions described here. 162–167, IEEE Computer Society. A computationally efficient implementation of a fully [18] Manas A. Pathak and Bhiksha Raj, “Privacy preserving speaker verification as password matching,” in ICASSP, 2012. homomorphic encryption (FHE) scheme [29] would sig- [19] X. Huang, A. Acero, and H.-W. Hon, Spoken Language nificantly change the construction of privacy preserving Processing, Prentice-Hall, 2001. 14

[20] S.B.Davis and P. Mermelstein, “Comparison of parametric rep- order side channel analysis with additive and multiplicative resentations for monosyllabic word recognition in continuously maskings,” in CHES 2011, 2011. spoken sentences,” IEEE Transactions on Acoustics, Speech, [41] J. Feigenbaum, “Overview of interactive proof systems and and Signal Processing, vol. 28, 1980. zero-knowledge,” in Contemporary Cryptology: The Science of [21] S. Imai, “Cepstral analysis synthesis on the mel frequency Information Integrity. IEEE Press, 1992. scale,” in Proc. IEEE International Conference on Acoustics, [42] S. Laur and H. Lipmaa, “Additive conditional disclosure of Speech and Signal Processing, 1983, vol. 8, pp. 93–96. secrets and applications,” Cryptology ePrint Archive, Report [22] D. Reynolds and R. C. Rose, “Robust text-independent speaker 2005/378, 2005, http://eprint.iacr.org/. identification using gaussian mixture speaker models,” IEEE [43] Dan Boneh, Eu-Jin Goh, and Kobbi Nissim, “Evaluating Transactions on Speech and Audio Processing, vol. 3, 1995. 2-DNF formulas on ciphertexts,” in Theory of Cryptography [23] P. Kenny, “New map estimators for speaker recognition,” in Conference, 2005, pp. 325–341. Proc. Eurospeech, 2003. [44] Joseph P. Campbell, “Testing with the YOHO CD-ROM voice [24] P. Kenny, “Joint factor analysis of speaker and session variabil- verification corpus,” in ICASSP, 1995, pp. 341–344. ity : Theory and algorithms,” Technical report CRIM-06/08-13, [45] National Institute for Standards and Technology, FIPS 180-3: 2005. Secure Hash Standard, 2008. [25] T. Petrie, “Probabilistic functions of finite state markov chains,” [46] William M. Campbell, Douglas E. Sturim, Douglas A. Annals of Mathematical Statistics, vol. 40, no. 1, pp. 97–115, Reynolds, and Alex Solomonoff, “SVM based speaker veri- 1969. fication using a GMM supervector kernel and NAP variability [26] B. Raj, R. Singh, and T. Virtanen, “The basics of automatic compensation,” in ICASSP, 2006. speech recognition,” in Techniques for noise robustness in [47] Johnny Mariethoz,´ Samy Bengio, and Yves Grandvalet, Kernel automatic speech recognition, T. Virtanen, R. Singh, and B. Raj, Based Text-Independent Speaker Verification, John Wiley & Eds. Wiley, 2012. Sons, 2008. [27] A. Yao, “Protocols for secure computations,” in Foundations [48] Piotr Indyk and Rajeev Motwani, “Approximate nearest neigh- of Computer Science, 1982. bors: Towards removing the curse of dimensionality,” in [28] R. Rivest, L. Adleman, and M. L. Dertouzos, “On data banks Proceedings of the ACM Symposium on Theory of Computing, and privacy homomorphisms,” in Foundations of Computer 1998, pp. 604–613. Science, 1976. [49] Brian Kulis and Kristen Grauman, “Kernelized locality- sensitive hashing for scalable image search,” in IEEE Interna- [29] “Recent advances in fully homomorphic encryption: a possible tional Conference on Computer Vision, 2009, pp. 2130–2137. future for signal processing in the encrypted domain,” IEEE [50] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Signal Processing Magazine. SI Signal Processing in the En- Mirrokni, “Locality-sensitive hashing scheme based on p-stable crypted Domain, 2013. distributions,” in ACM Symposium on Computational Geometry, [30] S. Goldwasser and S. Micali, “Probabilistic encryption,” Jour- 2004, pp. 253–262. nal of Computer and System Sciences, vol. 28, pp. 270–299, [51] Moses Charikar, “Similarity estimation techniques from round- 1984. ing algorithms,” in ACM Symposium on Theory of Computing, [31] Jaideep Vaidya, Chris Clifton, Murat Kantarcioglu, and Scott 2002. Patterson, “Privacy-preserving decision trees over vertically [52] “OpenSSL,” http://www.openssl.org/docs/crypto/bn.html, 2010. partitioned data,” TKDD, vol. 2, no. 3, 2008. [32] W. Du and M. J. Atallah, “Privacy-preserving cooperative statis- tical analysis,” in Computer Security Applications Conference. (ACSAC). IEEE, 2001, pp. 102–110. [33] Shai Avidan and Moshe Butman, “Blind Vision,” in Pro- ceedings of the 9th European Conference on Computer Vision (ECCV), Graz, Austria, May. 2006. [34] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen,¨ “On private scalar product computation for privacy-preserving data mining,” International Conference on Information Security and Cryptology (ICISC), pp. 23–25, 2004. [35] Mikhail Atallah and Jiangtao Li, “Secure outsourcing of sequence comparisons,” International Journal of Information Security, vol. 4, no. 4, pp. 277–287, 2005. [36] S. Rane and P. Boufounos, “Privacy-preserving nearest neighbor methods,” IEEE Signal Processing Magazine, To Appear, 2012. [37] V. Kolesnikov, A. Sadeghi, and T. Schneider, “Improved garbled circuit building blocks and applications to auctions and computing minima,” in Cryptology and Network Security (CANS), Kanazawa, Japan, December 2009, pp. 1–20. [38] F. Kershbaum, D. Biswas, and S. de Hoogh, “Performance com- parison of secure comparison protocols,” in Proc. International Workshop on Database and Expert System Applications, 2009, pp. 133–136. [39] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen, “On private scalar product computation for privacy-preserving data mining,” in Proc. Intl. Conference on Information Security and Cryptology, 2004. [40] M. Quisquater L. Genelle, E. Prouff, “Thwarting higher-