Privacy in Control: A Markov Decision Process Perspective

Parv Venkitasubramaniam1

*This work was supported in part by the National Science Foundation through grants CCF-1149495 and CNS-1117701.
1P. Venkitasubramaniam is with the Electrical and Computer Engineering Department at Lehigh University, parv.v at lehigh.edu

Abstract— Cyber physical systems, which rely on the joint functioning of information and physical systems, are vulnerable to information leakage through the actions of the controller. In particular, if an external observer has access to observations in the system exposed through cyber communication links, then critical information can be inferred about the internal states of the system and consequently compromise the privacy in system operation. In this work, a mathematical framework based on a Markov process model is proposed to investigate the design of controller actions when a privacy requirement is imposed as part of the system objective. Quantifying privacy using information theoretic equivocation, the tradeoff between achievable privacy and system utility is studied analytically. Specifically, for a sub-class of Markov Decision Processes (MDP), where the system output is independent of present and future states, the optimization is expressed as a solution to a Bellman equation with convex reward functions. Further, when the state evolution is a deterministic function of the states, actions and inputs, the Bellman equation is reduced to a series of recurrence relations. For the general MDP with privacy constraints, the optimization is expressed as a Partially Observable Markov Decision Process with belief dependent rewards (ρ−POMDP). Computable inner and outer bounds on the achievable privacy utility tradeoff are provided using greedy policies and rate distortion optimizations, respectively.

I. INTRODUCTION

Cyber-physical systems, as the name suggests, rely on the joint functioning of information systems and physical components. These systems of the future, which include the smart electric grid, smart transportation, advanced manufacturing and the next generation air traffic management system, are envisioned to transform the way engineered systems function, far exceeding the systems of today in capability, adaptability, reliability and usability. While the success of cyber physical systems relies on the power of information exchange, their fallibility lies in the power of information leakage. Despite tremendous advances in cryptography, communication over the Internet is far from being truly confidential. The recent NSA controversy notwithstanding, there are examples aplenty where the sensitive information of legitimate users is stolen using "visible" facets of communication such as the length of transmitted packets [1], the timing of transmitted packets [2], the routes of packet flow over a network [3] and suchlike. We emphasize the qualifier "visible" to indicate that the aforementioned features of communication cannot be hidden using encryption methodologies and are, in today's wireless communication medium, easily retrievable [4]. As cyber physical systems imminently begin to replace our existing basic infrastructures, leakage of such information can have critically damaging consequences ranging from airline delays and power blackouts to malfunctioning nuclear reactors. It is, therefore, imperative that we understand the privacy implications of information driven physical systems, and expand the science of control systems to include privacy requirements. In this article, we investigate a formal mathematical framework using a specific class of Markov modelled systems, analyze it under favourable mathematical conditions, and provide insights into the design of privacy preserving controller policies.

Markov decision processes (MDPs) are a common discrete time mathematical framework for modelling decision making in systems where the evolution of internal states and the observable outputs depend partly on the external input and partly on the actions of the internal control mechanism. In an MDP, at each time step, the process is in some state s, and upon receiving an input x, the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s′ (depending on the values of s and x), and giving the decision maker a corresponding reward or utility u(s, a, x). Often, the action a in state s could result in an output y. In many cyber physical systems, it is fair to assume that the sequence of internal states is available to the decision maker to take the necessary action at each step. It is also essential that these internal states remain confidential to any external observer.

In the context of cyber physical systems, particularly those implemented over wireless networks, which are vulnerable to eavesdropping, several questions arise: If the sequence of inputs and outputs is available to an external observer, how much information can he/she obtain about the internal states of the system? If the internal system states are to remain confidential from an external observer, how does that change the decision maker's choices of actions and consequently the expected reward or utility? Is there a fundamental tradeoff between the degree of privacy achievable in the system and the total expected utility? In this article, we investigate the answers to these questions within the framework of Markov Decision Processes.

A quantitative model for information privacy is essential to this investigation, and we rely on the information-theoretic equivocation (conditional entropy) for this purpose. Using equivocation as a measure of privacy, we wish to study controller policies that maximize utility subject to a desired level of privacy. Specifically, consider the system as shown in Figure 1. Let X = {X_1, ···, X_n}, A = {A_1, ···, A_n}, S = {S_1, ···, S_n}, Y = {Y_1, ···, Y_n} denote the respective sequences of input variables, actions, internal states and output variables over n time steps of system operation.

Fig. 1. Markov Decision Process Model of a Cyber Physical System (inputs X_1, ···, X_n and actions A_1, ···, A_n drive the internal states S_1, ···, S_n and the observable outputs Y_1, ···, Y_n).

The primary objective of the controller under no privacy restrictions is to maximize the expected reward/utility

U_n = E( Σ_{i=1}^{n} u(X_i, A_i, S_i) ).

Designing controller policies to maximize the net reward defined as above over finite and infinite horizons is well understood [5]. In this work, we investigate the design of controller policies when a constraint is imposed on the information leaked, measured using the equivocation

P_n = E( H(S_1, ···, S_n | X_1, ···, X_n, Y_1, ···, Y_n) ).

Specifically, we model the net reward as a weighted sum of the utility U_n and the privacy P_n, and study the design of optimal policies.

The confluence of communication and control has in general been a tremendous source of interest in the research community since Witsenhausen's famous counterexample was published [6]. More recently, the authors in [7] used an information theoretic perspective to shed new light on the counterexample. From the perspective of this work, we find it important to discuss in slight detail the problem of control under communication constraints [8]. Control under communication constraints studies decision making in systems where observations, feedback or state variables can be measured or are to be communicated under bandwidth limitations. Consequently, there is uncertainty in the exact value of observations, and the resulting decision framework mirrors that of a Partially Observable Markov Decision Process (POMDP). The approaches to solving those problems have also relied on information theoretic measures to quantify communication rates. There are two key differences between privacy preserving decision making and control under communication constraints. First, communication in privacy preserving decision making is implicit, wherein the control action results in information being communicated to the external observer. In contrast, the communication in the latter problem is explicit, and can be designed to optimize the primary objective of the controller. The second and more subtle difference is due to the non-causality of the privacy constraint. The value of information communicated in the communication constrained control problem is causal: the actions in the present state impact the rewards in that time step and future time steps, but have no relevance to rewards in past states. In the privacy preserving control scenario, an adversarial observer can utilize observations in a particular time step to improve his estimate of states in previous time steps; it is impractical to assume that eavesdroppers stop updating their estimates of a state after time has elapsed. An identity thief can and will use all possible information, past, present and future, to compromise a user's privacy. It is this important difference that makes the problem of designing optimal strategies quite challenging. It is therefore conceivable that although the decision maker has access to the true internal state, the strategies should take into account the belief from the external observer's point of view, and account for the possibility that actions in one time step can reveal information about states in past time steps.

A quantitative approach to privacy utility tradeoffs has previously been explored in [9], [10] using a notion the authors refer to as competitive privacy. In particular, they consider the privacy utility tradeoff resulting when information is shared between regional transmission operators on a smart grid for the purpose of joint state estimation. Using information theoretic equivocation to quantify privacy, they demonstrate the equivalence of the privacy utility tradeoff to a lossy source coding problem. As will be seen in Section IV-B of this paper, the connection to lossy source coding in our model is obtained in the form of an upper bound, rather than a direct equivalence. An alternative measure of privacy in control is the use of differential privacy as in [11], where the authors address the problem of releasing filtered signals that respect the privacy of the input data stream.

II. PRIVACY PRESERVING CONTROL: AN MDP FRAMEWORK

We describe the Markov Decision Process framework with relevance to the problem in consideration. Prior to the description of the model, we briefly discuss the notation that follows. Uppercase variables X_1, S_1 etc. denote random variables while lowercase x_1, a_1 denote values. Vectors X, S are represented using boldfaced letters. The notation X_i^j refers to the sequence X_i, X_{i+1}, ···, X_j.

We define a Markov Decision Process as an 8-tuple M = {S, X, Y, A, p_X, p_Y, p_S, u}, where S, X, Y, A are finite sets denoting the sets of states, inputs, outputs and actions respectively, p_X, p_Y, p_S are functions that define the distributions of the input, output and state transitions respectively, and u is the utility function. Following is a detailed description of the model.

Input Variables: Let X_1, ···, X_n denote the inputs to the system at discrete time steps 1, ···, n. Each X_t takes on values in the finite set X = {1, ···, n_X}. In this work, we consider X_1, ···, X_n to be i.i.d. random variables defined using a fixed probability mass function p_X : X → [0, 1], known to all players in the system.

State and Action Variables: Let S_1, ···, S_n denote the state variables of the system. Each S_t takes on values in the finite set S = {1, ···, n_S}. Associated with each state i is an action space A_i, and ∪_i A_i = A. When the system is in state i, if the input received is x, an action a in A_i results in the system transitioning to another state according to a time homogeneous conditional probability mass function p_S(s, x, a, s′) wherein, for every t,

Pr{S_{t+1} = s′ | S_t = s, X_t = x, A_t = a} = p_S(s, x, a, s′).

Output Variables: When the system is in state i, an action in A_i results in an output belonging to a finite set Y_i = {1, ···, n_Y}. Given an input X_t = x, state S_t = s and action A_t = a, the output is generated according to a time homogeneous probability mass function p_Y(s, x, a, y) wherein

Pr{Y_t = y | S_t = s, X_t = x, A_t = a} = p_Y(s, x, a, y).

In Section III, we will consider a sub-class of MDPs wherein the output and the controller are independent. Consequently, for those output-unobservable MDPs,

p_Y(s, x, a, y) = p_Y(s′, x, a′, y)  ∀ x, y, a, s, a′, s′.

Utility: The action in a particular state results in a utility reward for the controller defined by a deterministic function u : S × X × ∪_i A_i → R, such that u(x, s, a) is the utility when the decision maker employs action a in state s in response to input x.

Expected Utility: The following discussion assumes a finite horizon model, wherein the total utility for a policy is measured by the expected utility per time step over a horizon of n time steps. For a given policy µ = {µ_t, t = 0, 1, ···, n}, the expected utility is measured as

U_n(µ) = E( Σ_t u(S_t, X_t, A_t) ) / n,

where the expectation is over the realization of the state evolution and the probabilistic strategy at every time step. Although the reward as specified above represents a finite horizon, in some examples in subsequent sections we will consider an infinite horizon average reward to demonstrate the results. The infinite horizon average reward is expressed as the limit lim inf_{n→∞} U_n.

Note that in the absence of privacy constraints, the problem of maximizing the utility is a standard MDP optimization in stochastic control given by the Bellman equation formulation [5]. In fact, the optimal strategies in finite and infinite horizon MDP models with bounded utility are known to be deterministic (q_s(a) = 0 or 1) at every step. As will be demonstrated through the example in Section ??, the class of deterministic strategies is insufficient to optimize privacy.
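To make the preceding model concrete, the following sketch simulates an MDP of this form and estimates the per-step expected utility U_n(µ) of a memoryless randomized policy by Monte Carlo. It is only an illustration of the definitions above: the two-state, two-input, two-action instance, the random kernels, and all variable names (p_X, p_S, p_Y, u, q) are assumptions of this sketch, not quantities prescribed by the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_S, n_X, n_A, n_Y = 2, 2, 2, 2                          # illustrative sizes

    p_X = np.array([0.5, 0.5])                               # i.i.d. input pmf, known to all players
    p_S = rng.dirichlet(np.ones(n_S), size=(n_S, n_X, n_A))  # p_S[s, x, a, s'] = Pr{S_{t+1}=s' | s, x, a}
    p_Y = rng.dirichlet(np.ones(n_Y), size=(n_S, n_X, n_A))  # p_Y[s, x, a, y]  = Pr{Y_t=y | s, x, a}
    u   = rng.random((n_S, n_X, n_A))                        # deterministic utility u(s, x, a)
    q   = rng.dirichlet(np.ones(n_A), size=(n_S, n_X))       # memoryless randomized policy q_s(a) per (s, x)

    def estimate_utility(n=50, runs=2000, s0=0):
        """Monte Carlo estimate of U_n(mu) = E[ sum_t u(S_t, X_t, A_t) ] / n."""
        total = 0.0
        for _ in range(runs):
            s = s0
            for _ in range(n):
                x = rng.choice(n_X, p=p_X)           # exogenous input
                a = rng.choice(n_A, p=q[s, x])       # controller randomizes over available actions
                total += u[s, x, a]
                y = rng.choice(n_Y, p=p_Y[s, x, a])  # output seen by the eavesdropper (unused here)
                s = rng.choice(n_S, p=p_S[s, x, a])  # state transition
        return total / (runs * n)

    print(estimate_utility())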

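For reference, the no-privacy baseline mentioned in the note above can be computed by standard backward induction on the Bellman equation; the sketch below returns, for every (state, input, steps-to-go), an optimal action that is deterministic, consistent with the classical result cited from [5]. The kernels are passed in with the same illustrative shapes as in the previous snippet; this is a sketch of the textbook recursion, not of any algorithm specific to this paper.

    import numpy as np

    def value_iteration(n, p_X, p_S, u):
        """Finite-horizon backward induction for max E[ sum_t u(S_t, X_t, A_t) ], no privacy term.

        Returns V[t, s, x] (value with t steps to go) and a deterministic policy best_a[t, s, x].
        """
        n_S, n_X, n_A = u.shape
        V = np.zeros((n + 1, n_S, n_X))
        best_a = np.zeros((n + 1, n_S, n_X), dtype=int)
        for t in range(1, n + 1):
            cont = V[t - 1] @ p_X                      # expectation over the next i.i.d. input, shape (n_S,)
            for s in range(n_S):
                for x in range(n_X):
                    qval = u[s, x, :] + p_S[s, x, :, :] @ cont   # one value per action
                    best_a[t, s, x] = int(np.argmax(qval))
                    V[t, s, x] = qval[best_a[t, s, x]]
        return V, best_a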
The probability distributions p_X, p_S, p_Y are fixed and a priori known to both the controller and the eavesdropper.

Controller/Decision Maker: At time t, having observed the states and inputs at all time steps until and including time step t, and all outputs until and including time step t − 1, the decision maker chooses an action according to some probability distribution q_t : A → [0, 1]. In effect, the decision at time t, µ_t, is a set of n_S functions µ_{t,s} : X^t × S^{t−1} × Y^{t−1} → P^{|A_s|}, where P^M denotes the M-dimensional probability simplex. The controller is aware of the present state S_t = s_t and input X_t = x_t, and chooses an action a according to the probability distribution {q_s(a), a ∈ A_{s_t}} = µ_{t,s_t}(x_1^t, s_1^{t−1}, y_1^{t−1}). Let A_t denote the random variable corresponding to the action at time t. The controller policy µ is the sequence of decision functions µ = {µ_1, ···, µ_n}.

Eavesdropper Observation: The eavesdropper observes all inputs X_1, ···, X_n and all outputs Y_1, ···, Y_n, and is aware of the decision making policy µ. The realization of the randomization used by the controller at each time step is unknown to the eavesdropper and is an important reason for maintaining the privacy of the states. We note that the reward is not observable to the eavesdropper. We also note that at any time t, the eavesdropper likely has imperfect information about the state S_t (expressed using a belief vector over S), and consequently, the information about S_t from the perspective of the eavesdropper would be a function of the set of distributions q_t = {q_s(a), s ∈ S} = {µ_{t,s}(x_1^t, s_1^{t−1}, y_1^{t−1}), s ∈ S} (in effect a conditional distribution).

Privacy Measure: We quantify the privacy of the internal states from an external observer using Shannon's equivocation [12]. Specifically, given the policy µ, the observations X_1^n, Y_1^n induce an a posteriori probability distribution over the internal state sequence S_1^n. Over the n-step horizon, for a given policy µ, the privacy is measured as

P_n(µ) = H_µ(S_1, ···, S_n | Y_1, ···, Y_n, X_1, ···, X_n) / n,   (1)

where H_µ(·|·) is the Shannon conditional entropy of the induced probability distribution. The physical meaning of the measure can be understood by looking at two boundary conditions. P_n ≥ 0, where equality implies that S_1, ···, S_n is a deterministic function of the observations X_1, ···, X_n, Y_1, ···, Y_n, resulting in no privacy. Any policy that ensures a one-one mapping from the state space to the output space at every time step would result in privacy P_n = 0. The upper bound P_n ≤ log(|S|) is the maximum entropy rate of the internal state process, with equality implying that the observations of the eavesdropper provide no information about the internal states.

Weighted Reward: A Privacy Utility Tradeoff: For a given MDP M = {S, X, Y, A, p_Y, p_X, p_S, u}, our goal is to design a policy µ that maximizes a weighted sum of the utility and the achieved privacy. Mathematically, we wish to design the policy

µ*(λ) = arg max_µ ( λ P_n(µ) + (1 − λ) U_n(µ) )   (2)

for a desired weight λ. As λ is increased from 0 to 1, the controller moves gradually from being a purely utility driven one to a purely privacy preserving one. Let U_n*(λ) and P_n*(λ) denote the utility and privacy achieved by the optimal policy respectively for a given λ. The sweep of the pairs (U_n*(λ), P_n*(λ)) when 0 ≤ λ ≤ 1 provides the optimal privacy-utility tradeoff for the decision making process, which is the primary focus of this work.

In the subsequent sections, we investigate the weighted reward optimization, and propose reductions which are amenable to value iteration methods. We first consider a sub-class of MDPs wherein, conditioned on the inputs, the observations are independent of the states. This class of output unobservable systems will be shown to result in Bellman equations with continuous action spaces, and for a narrower sub-class with conditionally deterministic state evolution, the solution will be expressed as a sequence of recurrence relations. For the general MDP, the problem will be shown to represent a Partially Observable Markov Decision Process with belief dependent rewards (ρ−POMDP) which, owing to the continuous action space, is not practically amenable to value iteration. We then provide computable inner and outer bounds on the tradeoffs using rate distortion type optimizations.

III. I-UNOBSERVABLE MDP

According to classical state space models, a system is said to be observable at time n if there exists a finite n_1 > n such that for any s ∈ S, if at time n, S_n = s, the knowledge of the inputs X_n, ···, X_{n_1} and the outputs Y_n, ···, Y_{n_1} suffices to determine the state S_n; otherwise the system is said to be unobservable at time n. By this definition, any non-deterministic state transition MDP model could qualify as being unobservable. In this section, we consider an information theoretic notion of unobservability, wherein the generated output is independent of the present state and the succeeding state conditioned on the input. Any system which generates an output which is a deterministic function of the input regardless of state would fall under this category. Specifically, an i-unobservable MDP is a system wherein the outputs generated depend only on the input and have no dependence on present or future states of the system. More specifically, an i-unobservable system is an MDP that satisfies the following conditions:

(A1) The output generated is merely a function of the input of the system, i.e.

p_Y(s, a, x, y) = p_Y(s′, a′, x, y)  ∀ s, a, s′, a′, x, y.

(A2) The information flow between the state and input is unaffected by the output, i.e.

S_{i+1} − (X_i, S_i) − Y_i

forms a Markov chain.

Solutions to classical MDPs with no privacy restrictions are typically obtained by reducing the full horizon optimization to a single step Bellman equation and applying value iteration. A key element in the reduction is the additive form of the classical reward definitions. A privacy preserving MDP would also benefit from a Bellman-type reduction if the rewards were expressed in an additive form, which is provided in the following Lemma.

Lemma 3.1: For any policy µ for an i-unobservable MDP M, there exists a policy µ′ such that R(µ′) ≥ R(µ) and

P(µ′) = Σ_i H_{µ′}(S_i | S_{i−1} X_{i−1}).

Proof: Any strategy µ is a sequence of probabilistic mappings µ_1, ···, µ_n, where µ_i maps X_1, ···, X_i, S_1, ···, S_i = s, Y_1, ···, Y_{i−1} to a probability distribution {q(a), a ∈ A_s}. Based on the policy µ, let the induced joint probability distribution on X_1^t, Y_1^{t−1}, S_1^t be denoted by p_{µ,t}. We define a modified policy µ′ such that

µ′_t(x_t, s_t) = Σ_{x_1^{t−1}, s_1^{t−1}, y_1^{t−1}} p_{µ,t}(x_1^t, s_1^t, y_1^{t−1}) µ_t(x_1^t, s_1^t, y_1^{t−1}).

In other words, the policy µ′ at time t chooses the action A_t according to the marginal distribution generated by policy µ. Since the underlying stochastic model follows a Markovian evolution, the joint probability distribution of the state, action and input until time t under policy µ can be expressed as

Pr{X_1^t, S_1^t, A_1^t} = Π_{i=1}^{t} Pr{A_i, X_i, S_i | A_{i−1}, X_{i−1}, S_{i−1}}.

Since the one step conditional distributions for the policies µ and µ′ are identical by definition, we can write

U(µ′) = Σ_i E_{µ′}( u(X_i, S_i, A_i) ) = Σ_i E_µ( u(X_i, S_i, A_i) ).

For an i-unobservable MDP, the privacy metric for policy µ′ can be rewritten as

H_{µ′}(S_1, ···, S_n | X_1, ···, X_n, Y_1, ···, Y_n)
  (a) = Σ_i H_{µ′}(S_i | S_{i−1} X_{i−1} Y_{i−1}, X_i^n, Y_i^n)
  (b) = Σ_i H_{µ′}(S_i | S_{i−1} X_{i−1}, X_i^n, Y_i^n)
  (c) = Σ_i H_{µ′}(S_i | S_{i−1} X_{i−1}),

where (a) follows from the fact that given X_i, S_i the policy µ′ does not depend on past history, and (b) and (c) follow from conditions (A1) and (A2) for an i-unobservable MDP. Further note that by definition of policy µ′,

H_µ(S_i | S_{i−1} X_{i−1}) = H_{µ′}(S_i | S_{i−1} X_{i−1}),

and since conditioning reduces entropy,

H_µ(S_i | S_1^{i−1}, X_1^{i−1}, X_i^n, Y_1^n) ≤ H_µ(S_i | S_{i−1} X_{i−1}) = H_{µ′}(S_i | S_{i−1} X_{i−1}).   □

Consequent to the above lemma, it is sufficient to consider the class of policies which depend only on the present state and input, and the optimization can be written in additive form. The resulting Bellman equation for the cost-to-go function at time step t then reduces to

V_t(s, x) = max_q { λ H(p_S · q) + (1 − λ) Σ_a q(a) u(s, a, x) + Σ_{s′} (p_S · q)(s′) Σ_{x′} p(x′) V_{t−1}(s′, x′) },   (3)

where V_t(s, x) is the expected reward if t steps remain in the finite horizon and, for a given pair s, x,

(p_S · q)(s′) = Σ_a q(a) p_S(s, a, x, s′).

The above Bellman equation represents a finite state, continuous action, bounded convex reward system, for which a stationary optimal solution exists. The optimal solution can be obtained using a combination of convex optimization and value iteration techniques. Note that the maximization in the cost-to-go function is a variant of the Entropy power optimization, and it is possible to solve for the optimizing q using standard KKT conditions. In addition to conditions (A1) and (A2), if the succeeding state is always a deterministic function of the present input, state and action (p_S(s, a, x, s′) = 0 or 1), then the solution to the maximization is in fact identical to the Entropy power maximization, and the Bellman equation can be further reduced to a series of recurrence relations. This is shown in the following theorem.

Theorem 3.2: If ∀ s, a, x, s′, p_S(s, a, x, s′) = 1 or 0, then the solution to a finite horizon optimization which starts at state s_0 is given by

V_n = Σ_x p(x) log( ψ_n(s_0, x) ),

where ψ_t(s, x) is the solution to the following recursive equations:

ψ_t(s, x) = φ(s, x) Σ_{s′ ∈ S} Π_{x′} ( ψ_{t−1}(s′, x′) )^{p(x′)},
ψ_0(s, x) = 1, ∀ s, x.

Proof: When the state transition is deterministic, we assume without loss of generality that every action a leads to a unique state. To see this, consider a pair of actions a, a′ that lead to an identical state. In this case, the action with the larger immediate utility u(s, x, a) is strictly better, since the paths of the subsequent functioning of the MDP would be identical. Consequently, optimizing the probability of actions {q(a)} in every step is equivalent to optimizing the probability {q(s′)} of moving to a particular state s′. The Bellman optimization therefore reduces to

V_n(s, x) = sup_{q(s′)} { λ H(S′) + (1 − λ) Σ_{s′} q(s′) u(s, x, s′) + Σ_{s′, x′} q(s′) p(x′) V_{n−1}(s′, x′) },

which can be solved using an Entropy power maximization, yielding

q*(s′) = 2^{ ((1−λ)/λ) u(s, x, s′) + (1/λ) Σ_{x′} p(x′) V_{n−1}(s′, x′) } / Σ_{s′} 2^{ ((1−λ)/λ) u(s, x, s′) + (1/λ) Σ_{x′} p(x′) V_{n−1}(s′, x′) }.

The proof follows by substituting the above solution back into the Bellman equation.   □

Although the discussion thus far has centered around the finite horizon optimization, the equivalent optimizations for the discounted and average reward are easily obtained by modifying the cost-to-go functions. We demonstrate an infinite horizon average reward optimization for a specific example that satisfies the condition in Theorem 3.2 in the following discussion.

A. Example: Anonymity under Memory Limitations

Consider a sequence of packets arriving to a router from a pair of sources according to independent equal rate Poisson processes; for ease of explanation we refer to the two sources as being red and blue respectively. The packets departing the router, by virtue of encryption and bit-padding, are content-wise not matchable to the arriving packets. From the perspective of an eavesdropper who observes the arriving and departing packets, there are two distinct incoming streams while only a single departing stream. The eavesdropper uses the observation of the timing of the incoming and outgoing processes to determine the source of each outgoing packet. The objective of the router is to shuffle the packets such that, given the observation of the eavesdropper and knowledge of the shuffling strategy, the source entropy of departing packets is maximized. When the router has finite memory (can store a maximum of k packets), the problem can be written as an i-unobservable MDP with privacy requirements.

Prior to describing the MDP model for the problem, it is important to note a key reduction that simplifies the class of policies without losing generality. Specifically, we consider only those policies where a packet is transmitted only upon a new arrival to a full buffer. It can be shown that this simplification does not lose generality [?]. Further note that when the memory of a mix is limited, it is sufficient for Eve to know the sequence of arriving packets in place of the complete timing information of the arrival point process; a packet can wait in the buffer until the next packet arrives regardless of what time it arrives, so all that matters to Eve is the state of the buffer and the source of the next arriving packet. Therefore, without loss of generality, we will assume that Eve's observation is restricted to the incoming sequence of packets, and we will consider only those mixing strategies in which the mix transmits packets iff its buffer is full and a new packet arrives.

The MDP model of the router is as follows. The router is defined to be in state r if it has r red packets in its buffer including the new arrival. Due to the reduction in strategies wherein the mix transmits only upon a new arrival, the state of the mix transitions only upon a new arrival (and the resulting immediate departure). This definition of state assumes that it is sufficient to combine a new arrival into the buffer of the mix, although the eavesdropper observes a new arrival separately. This reduction is justified (and does not lose optimality) since whichever packet the mix chooses to transmit (red or blue), the transition to the subsequent state depends only on the total composition of the buffer including the new arrival, and not each individual component. Let S_t represent the mix's state prior to the t-th arrival.

That this model represents an i-unobservable MDP follows from the fact that the departure process is a delayed version of the arrival process regardless of the source pattern of arrivals. Consequently, conditioned on the arrival process, the outputs do not provide any new information about the sequence of states or actions of the mix, thus trivially satisfying conditions (A1) and (A2).

Anonymity and State Privacy: Let the random variables X_1, ···, X_n denote the sequence of sources of incoming packets. If Z_1, ···, Z_n denote the random variables representing the sources of outgoing packets, the measure of anonymity is given by the Shannon entropy

A(k) = H(Z_1, ···, Z_n | X_1, ···, X_n) / n,

where k is the buffer size of the mix, i.e. the maximum number of packets that can be stored by the mix. Given X_1^n, Z_1^n, the sequence of states of the mix's buffer is perfectly determined. Likewise, given the sequence X_1^n, S_1^n, the sequence of departing packets Z_1^n is perfectly determined. Consequently

H(Z_1^n | X_1^n) = H(S_1^n | X_1^n).

Utility/Priority: We assume that one of the sources, say the red source, has a higher priority for transmission by virtue of greater privileges for the user. The notion of prioritized scheduling is in conflict with the notion of anonymity, which necessitates shuffling of packets, and consequently a tradeoff between the level of priority provided and the degree of anonymity achievable is expected. We model priority by providing a utility 1 whenever the system reaches the state S_t = k + 1 (all red packets have departed). In the absence of a privacy constraint, the optimal strategy that maximizes utility would be to transmit red packets as soon as they arrive, and transmit blue packets if and only if a blue packet arrives to the mix when the buffer contains k blue packets. This deterministic strategy, however, works contrary to the idea of anonymity because the state is perfectly known at every point of time and the system provides zero anonymity. The following theorem provides the achievable net reward for any given reward weight λ.

Theorem 3.3: The net reward R(λ) of a Chaum mix with buffer capacity k serving two equal rate users, when the red source is given a higher priority, is

R(λ) = λ( 1 + log_2(cos θ) ),   (4)

where θ is the smallest non-zero solution to the equation

sin((k + 3)θ) / sin((k + 1)θ) = 2^{(1−λ)/λ} − 1.

Proof: The Bellman equation for the infinite horizon problem can be reduced to the following recurrence relations, where ψ_r = 2^{V(r)} (V(r) being the excess reward) and c = 2^{w} (w being the average reward):

ψ_{r+1} = c ψ_r − ψ_{r−1}, ∀ r = 1, ···, k − 1,
2^{(1−λ)/λ} ψ_k = c ψ_{k+1},

with initial conditions ψ_0 = 1, ψ_1 = c. Solving the recurrence relation yields the solution in the statement of the theorem.   □

A special case of the above theorem when priority is absent was shown in [13] in the context of memory limited anonymity of Chaum mixes, wherein the optimal anonymity without priority was shown to be log[ 2 cos( π/(k + 3) ) ].

IV. GENERAL MDP

In this section, we consider the general MDP problem posed in Section II. We demonstrate that the optimization problem can be expressed as a Partially Observable MDP with belief dependent rewards. In particular, the reduction to a POMDP formulation is based on conversion of the privacy metric to an additive sequence of information leakage terms which do not depend on future outputs, proven in the following theorem.

Theorem 4.1: If π_0 = {π_0(1), ···, π_0(n_S)} is the initial belief vector of the observer (about the state), the optimal reward for the weighted optimization

V_n(π_0) = sup_µ ( λ P(µ) + (1 − λ) U(µ) )

is given by the solution to the set of Bellman equations (if it exists):

V_t(π, x) = sup_Q { R^I_λ(π, Q, x) + Σ_{x′, y} γ(y) p_X(x′) V_{t−1}(π_po(y), x′) },   (5)

where

π_po(y) = [π_po,1(y), ···, π_po,|S|(y)],

π_po,s(y) = Σ_{s′, a} π(s′) q(a|s′) p_Y(s′, x, a, y) p_S(s′, x, a, s) / Σ_{s″, a} π(s″) q(a|s″) p_Y(s″, x, a, y),

γ(y) = Σ_s π(s) q(s, y),

R^I_λ(π, Q, x) = λ( Σ_s π(s) [ H(Q(·|s) · p_Y) + H(Q(·|s) · p_S) ] − H(π · Q · p_Y) ) + λ̄ Σ_{s, a} π(s) q(a|s) u(s, a, x),

and

(Q(·|s) · p_Y)(y) = Σ_a q(a|s) p_Y(s, a, x, y),
(Q(·|s) · p_S)(s′) = Σ_a q(a|s) p_S(s, a, x, s′),
(π · Q · p_Y)(y) = Σ_s Σ_a π(s) q(a|s) p_Y(s, a, x, y).

Note that unlike the i-unobservable system, the optimization at each step is over the complete conditional probability Q = {q(a|s), s ∈ S, a ∈ A_s}, and the instantaneous reward is a function of the belief vector (from the observer's perspective) at each time step. Although at each step the controller is perfectly aware of the state, the external observer only maintains a belief of the state, and the actions of the controller are observable only through the outputs. Consequently, it is critical for the designer to develop a strategy that takes into account the belief vector of the eavesdropper and not merely the actual state, although the controller would apply the specific distribution q(a|s) in choosing the action a when the state is s. Note that the eavesdropper's knowledge of the state is imperfect in the i-unobservable MDP as well, although the resulting optimization was shown to be belief independent.

Proof: The key to reducing the weighted optimization problem to the recursive Bellman equation form is in expressing the reward as a sum of rewards at each time step. In particular, this requires the instantaneous reward to be a function only of the past history and present state. For policy µ, let W_1 = (X_1, Y_1), ···, W_n = (X_n, Y_n) denote the random variables representing the sequence of input-output pairs. By repeated application of the chain rule of entropy and mutual information, we can write

H_µ(S_1, ···, S_n | X_1, ···, X_n, Y_1, ···, Y_n)
  = Σ_{t=1}^{n} H_µ(S_{t+1} | S_1^t W_1^t) + Σ_{t=1}^{n} H_µ(Y_t | S_1^t W_1^{t−1}, X_t) − Σ_{t=1}^{n} H_µ(Y_t | S_1^{t−1} W_1^{t−1}, X_t)   (6)
  ≜ Σ_t H_1(t) + Σ_t H_2(t) − Σ_t H_3(t).   (7)

Note that each term in the above summation, H_1(t), H_2(t) and H_3(t), can be computed using information available to the controller at time t. Further note that H_1(t) and H_2(t) are entropies conditioned on the complete knowledge available to the controller, whereas H_3(t) is conditioned on the knowledge of the observer, which is captured using the belief vector π_t at time t. Substituting (6) into the reward metric provides the POMDP equation in the theorem.   □

The general model reformulated using Theorem 4.1 is a POMDP with belief dependent rewards (ρ−POMDP) as described in [14] and is in general hard to solve. For the finite horizon version of the above problem, Theorem 2 in [14] shows that the optimal value function V_n(π) is convex in the belief vector π, and subsequently proposes value iteration to compute the optimal reward. Applying value iteration is nevertheless cumbersome in our setup, primarily owing to the uncountable action space. To that effect, we now provide bounds that can be computed using lower complexity methods.

A. Inner Bound: Greedy Policy

An intuitive sub-optimal policy that can be computed without having to execute a value iteration algorithm is the greedy policy, which maximizes the instantaneous reward at every step. While a closed form characterization of the reward obtained by the greedy policy is not always feasible, the optimal policy is easily computed. At any given belief π_t, the optimal action distribution Q* for the greedy policy is given by solving

Q* = arg sup_Q [ λ( Σ_s π(s) [ H(Q(·|s) · p_Y) + H(Q(·|s) · p_S) ] − H(π · Q · p_Y) ) + λ̄ Σ_{s, a} π(s) q(a|s) u(s, a, x) ].

The instantaneous reward expressed in the above maximization is a convex function of the probability distribution {Q}, and can be optimized in every step as a function of the belief and transition probabilities. Using Monte Carlo methods, the average reward for the greedy policy can be computed to provide a lower bound on the tradeoff. In some special cases [], the belief evolution of the greedy algorithm can be shown to converge to a specific belief, thus allowing an analytical characterization of the achievable tradeoff.
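A crude numerical rendering of the greedy step of Section IV-A can be written directly from the expressions above: sample candidate conditional distributions Q, score each with the instantaneous reward R^I_λ(π, Q, x) of Theorem 4.1, keep the best, and update the observer's belief with the realized output. The random search below is only a stand-in for the convex optimization the text has in mind (a KKT or Blahut-type iteration), and the kernels are indexed as p[s, x, a, ·] as in the earlier illustrative snippets.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def instantaneous_reward(pi, Q, x, p_Y, p_S, u, lam):
        """R^I_lambda(pi, Q, x) from Theorem 4.1, with Q[s, a] = q(a|s)."""
        out_given_s  = np.einsum('sa,say->sy', Q, p_Y[:, x, :, :])   # (Q(.|s) . p_Y)(y)
        next_given_s = np.einsum('sa,sat->st', Q, p_S[:, x, :, :])   # (Q(.|s) . p_S)(s')
        out_marginal = pi @ out_given_s                              # (pi . Q . p_Y)(y)
        leak = sum(pi[s] * (entropy(out_given_s[s]) + entropy(next_given_s[s]))
                   for s in range(len(pi))) - entropy(out_marginal)
        util = np.einsum('s,sa,sa->', pi, Q, u[:, x, :])
        return lam * leak + (1.0 - lam) * util

    def belief_update(pi, Q, x, y, p_Y, p_S):
        """Observer's posterior pi_po(y) over the next state after seeing output y."""
        w = pi[:, None] * Q * p_Y[:, x, :, y]          # weight of each (state, action) pair
        post = np.einsum('sa,sat->t', w, p_S[:, x, :, :])
        return post / post.sum()

    def greedy_step(pi, x, p_Y, p_S, u, lam, rng, n_candidates=500):
        """Random-search placeholder for argmax_Q R^I_lambda(pi, Q, x)."""
        n_S, n_A = u.shape[0], u.shape[2]
        best_Q, best_val = None, -np.inf
        for _ in range(n_candidates):
            Q = rng.dirichlet(np.ones(n_A), size=n_S)  # candidate conditional q(a|s)
            val = instantaneous_reward(pi, Q, x, p_Y, p_S, u, lam)
            if val > best_val:
                best_Q, best_val = Q, val
        return best_Q, best_val

Simulating the system while applying greedy_step at every time step and averaging the realized reward gives the Monte Carlo inner bound referred to above.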

B. Outer Bound: Weak Eavesdropper

The optimal greedy policy provides achievable privacy-utility tradeoffs, which serve as inner bounds to the optimal tradeoff between privacy and utility. In order to test the efficacy of these lower bounds, it is useful to compute close outer bounds. A simple outer bound can be obtained by assuming that the external observer has no prior information. We appeal to the fact that conditioning reduces entropy: by removing conditioning variables that the observer uses to determine the state at a given time, the privacy achieved can only be higher for any policy. When the observer is assumed to possess no prior information at any given time, the reward can be modified as

R_λ(µ) = λ Σ_t H_µ(S_t | S_{t−1} X_{t−1} Y_{t−1}) + λ̄ Σ_t u(X_t, S_t, A_t).

For any policy µ, the reward modified as above would be greater than or equal to the actual reward. Note that the above reward is belief independent at each time step, and is additive. Consequently, the optimal solution is given by solving the Bellman equations for an MDP model, identical to that discussed in Section III. Readers familiar with information theory will instantly recognize the above optimization as a modification of the classical rate-distortion optimization, with the entropy of the subsequent state H(Q(·|s) · p_S) being the additional term. An efficient iterative descent algorithm to solve this optimization can be obtained by modifying the classical Blahut algorithm [15]. A greedy decision maker can execute the iterative descent at every step (depending on the belief vector) to determine the optimal distribution and the belief in the subsequent state. The resulting reward would be a lower bound on the maximum achievable reward. This simple outer bound can be improved by gradually adding variables (that provide the observer information) and solving the resulting optimization. We demonstrate the computation of the inner and outer bounds for a simple two state example in the following section.
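The belief-independent reward above can be evaluated term by term for a memoryless policy, which is how a numerical version of this outer bound could be set up. The sketch below computes one step of the modified objective, λ H(S_t | S_{t−1}, X_{t−1}, Y_{t−1}) + (1 − λ) E[u], for a given q(a|s, x) and a supplied joint law b over (S_{t−1}, X_{t−1}); maximizing it over q (the text suggests a Blahut-type iterative descent [15]) is not shown, and all array conventions follow the earlier illustrative snippets.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def weak_eavesdropper_step(q, p_S, p_Y, u, b, lam):
        """One term of the modified reward: lam * H(S_t | S_{t-1}, X_{t-1}, Y_{t-1}) + (1 - lam) * E[u].

        q[s, x, a], p_S[s, x, a, s'], p_Y[s, x, a, y], u[s, x, a]; b[s, x] is the joint pmf of
        (S_{t-1}, X_{t-1}) supplied by the caller.
        """
        # joint(s, x, y, s') = b(s, x) * sum_a q(a|s,x) p_Y(s,x,a,y) p_S(s,x,a,s')
        joint = np.einsum('sx,sxa,sxay,sxat->sxyt', b, q, p_Y, p_S)
        cond_H = 0.0
        for s in range(joint.shape[0]):
            for x in range(joint.shape[1]):
                for y in range(joint.shape[2]):
                    w = joint[s, x, y, :].sum()        # Pr(S_{t-1}=s, X_{t-1}=x, Y_{t-1}=y)
                    if w > 0:
                        cond_H += w * entropy(joint[s, x, y, :] / w)
        avg_u = np.einsum('sx,sxa,sxa->', b, q, u)     # expected one-step utility under b
        return lam * cond_H + (1.0 - lam) * avg_u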
V. TWO STATE SYSTEM EXAMPLE

Consider a two state system, where S = {0, 1}, with a fixed transition probability matrix

P = [ 1 − α    α   ]
    [   β    1 − β ],

regardless of the action. We consider a system without inputs.

When the system is in state S_t = 0, the decision maker has two possible actions a = 0 and a = 1, resulting in outputs Y_t = 0 and Y_t = 1 respectively. When the system is in state S_t = 1, the decision maker has only one possible action a = 1, resulting in an output Y_t = 1. The utility function is defined as

u(s, a) = { 1,  s = a
          { 0,  s = 0, a = 1.

Since there are only two possible actions in any time step, the action probability matrix Q has the form

Q = [ 1 − p   p ]
    [   0     1 ],

and can consequently be parametrized using a single variable p. Given the parameter p and the prior belief vector π = [π, 1 − π] at time t, we know that

π_po(0) = { πp / (πp + (1 − π)),  Y_t = 1
          { 1,                    Y_t = 0.

Accordingly,

R_λ = λ[ H_S − h(π) + (πp + (1 − π)) h(π_po(0)) ] + (1 − λ)(1 − πp),

where H_S is the entropy rate of the state evolution process (which is a constant since the state evolution is independent of the action). The optimal solution, if it exists, is given by solving the ρ−POMDP. Solving the POMDP analytically might not be feasible, but deriving the tradeoff for the greedy algorithm and the weak eavesdropper outer bound is analytically straightforward. Specifically, at any given belief π = [π, 1 − π], the optimal parameter p* for the greedy policy is given by

p* = (1 − π) / ( exp( (1 − λ)/λ ) − π ),   (8)

which results in a countable state, positive recurrent Markov process for the belief vector. When the eavesdropper is assumed memoryless, since the belief is not updated at every step, substituting π* = α/(α + β) into (8) provides the optimal policy parameter for the upper bound. The reward for this optimal policy against a weak eavesdropper is then given by

R_up = λ[ H_S − h(π*) + (π* p* + π̄*) h( π* p* / (π* p* + π̄*) ) ] + (1 − λ)(1 − π* p*),

which serves as an outer bound for the optimal tradeoff. Figure 2 plots the privacy utility tradeoffs for the two state system for specific values of α, β. The comparison with the optimal deterministic policy demonstrates the necessity of randomness in the policy. Tighter inner and outer bounds can be obtained by providing additional memory to the eavesdropper in the outer bound and by adding more "look ahead" components to the controller in the inner bound.

Fig. 2. Privacy Utility Tradeoff for a Two State System (privacy on the horizontal axis, utility on the vertical axis; curves shown for the Deterministic policy, the Greedy policy, and the Upper Bound).

REFERENCES

[1] Marc Liberatore and Brian Neil Levine, "Inferring the source of encrypted HTTP connections," in Proceedings of the 13th ACM Conference on Computer and Communications Security (CCS '06), New York, NY, USA, 2006, pp. 255–263.
[2] D. X. Song, D. Wagner, and X. Tian, "Timing analysis of keystrokes and timing attacks on SSH," in Proc. 10th USENIX Security Symposium, 2001.
[3] Jean-François Raymond, "Traffic analysis: Protocols, attacks, design issues and open problems," in Designing Privacy Enhancing Technologies: Proceedings of the International Workshop on Design Issues in Anonymity and Unobservability, H. Federrath, Ed., vol. 2009 of LNCS, pp. 10–29, Springer-Verlag, 2001.
[4] Jihwang Yeo, Suman Banerjee, and Ashok Agrawala, "Measuring traffic on the wireless medium: Experience and pitfalls," Tech. Rep., DTIC Document, 2002.
[5] Dimitri P. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, Athena Scientific, Belmont, 1995.
[6] Hans S. Witsenhausen, "A counterexample in stochastic optimum control," SIAM Journal on Control, vol. 6, no. 1, pp. 131–147, 1968.
[7] Se Yong Park, Pulkit Grover, and Anant Sahai, "A constant-factor approximately optimal solution to the Witsenhausen counterexample," in Proceedings of the 48th IEEE Conference on Decision and Control, 2009, pp. 2881–2886.
[8] Sekhar Tatikonda and Sanjoy Mitter, "Control under communication constraints," IEEE Transactions on Automatic Control, vol. 49, no. 7, pp. 1056–1068, 2004.
[9] Lalitha Sankar, Soummya Kar, Ravi Tandon, and H. Vincent Poor, "Competitive privacy in the smart grid: An information-theoretic approach," in Proceedings of the 2011 IEEE International Conference on Smart Grid Communications (SmartGridComm), 2011, pp. 220–225.
[10] E. V. Belmega, L. Sankar, and H. V. Poor, "Repeated games for privacy-aware distributed state estimation in interconnected networks," in Proceedings of the 6th International Conference on Network Games, Control and Optimization (NetGCooP), 2012, pp. 64–68.
[11] Jerome Le Ny and George J. Pappas, "Differentially private filtering," in Proceedings of the 51st IEEE Conference on Decision and Control (CDC), 2012, pp. 3398–3403.
[12] C. E. Shannon, "Communication theory of secrecy systems," Bell System Technical Journal, 1949.
[13] A. Mishra and P. Venkitasubramaniam, "Anonymity under memory limitations: Optimal strategy and asymptotics," in IEEE International Symposium on Information Theory, July 2013.
[14] Mauricio Araya-López, Olivier Buffet, Vincent Thomas, François Charpillet, et al., "A POMDP extension with belief-dependent rewards," in Neural Information Processing Systems (NIPS), 2010.
[15] Richard Blahut, "Computation of channel capacity and rate-distortion functions," IEEE Transactions on Information Theory, vol. 18, no. 4, pp. 460–473, 1972.