
Privacy-Preserving Reinforcement Learning

Jun Sakuma [email protected]
Shigenobu Kobayashi [email protected]
Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama, 226-8502, Japan

Rebecca N. Wright [email protected]
Rutgers University, 96 Frelinghuysen Road, Piscataway, NJ 08854, USA

Abstract

We consider the problem of distributed reinforcement learning (DRL) from private perceptions. In our setting, agents' perceptions, such as states, rewards, and actions, are not only distributed but also should be kept private. Conventional DRL algorithms can handle multiple agents, but do not necessarily guarantee privacy preservation and may not guarantee optimality. In this work, we design cryptographic solutions that achieve optimal policies without requiring the agents to share their private information.

1. Introduction

With the rapid growth of computer networks and networked devices, a large amount of information is being sensed and gathered by distributed agents, physically or virtually. Distributed reinforcement learning (DRL) has been studied as an approach to learn a control policy through interactions between distributed agents and environments, for example, sensor networks and mobile robots. DRL algorithms, such as the distributed value function approach (Schneider et al., 1999) and the policy gradient approach (Moallemi & Roy, 2004), typically seek to satisfy two types of physical constraints. One is constraints on communication, such as an unstable network environment or limited communication channels. The other is memory constraints for managing the huge state/action space. Therefore, the main emphasis of DRL has been to learn good, but sub-optimal, policies with minimal or limited sharing of agents' perceptions.

In this paper, we consider the privacy of agents' perceptions in DRL. Specifically, we provide solutions for privacy-preserving reinforcement learning (PPRL), in which agents' perceptions, such as states, rewards, and actions, are not only distributed but are also desired to be kept private. Consider two example scenarios:

Optimized Marketing (Abe et al., 2004): Consider the modeling of a customer's purchase behavior as a Markov decision process (MDP). The goal is to obtain the optimal catalog mailing strategy which maximizes the long-term profit. Timestamped histories of customer status and mailing records are used as state variables; customers' purchase patterns are used as actions. Value functions are learned from these records to obtain the optimal policy. If these histories are managed separately by two or more enterprises, the enterprises may not want to share their histories for privacy reasons (for example, in keeping with privacy promises made to their customers), but might still like to learn a value function from their joint data so that they can all maximize their profits.

Load Balancing (Cogill et al., 2006): Consider load balancing among competing factories. Each factory wants to accept customer jobs, but in order to maximize its own profit, may need to redirect jobs when heavily loaded. Each factory can observe its own backlog, but factories do not want to share their backlog information with each other for business reasons; nevertheless, they would still like to make optimal decisions.

Privacy constraints prevent the data from being combined in a single location where centralized reinforcement learning (CRL) algorithms could be applied. Although DRL algorithms work in a distributed setting, they are designed to limit the total amount of data sent between agents, but do not necessarily do so in a way that guarantees privacy preservation. Additionally, DRL often sacrifices optimality in order to learn with low communication.
In contrast, we propose solutions that employ cryptographic techniques to achieve optimal policies (as would be learned if all the information were combined into a centralized reinforcement learning (CRL) problem) while also explicitly protecting the agents' private information. We describe solutions both for data that is "partitioned-by-time" (as in the optimized marketing example) and "partitioned-by-observation" (as in the load balancing example).

Table 1. Comparison of different approaches

          comp.     comm.     accuracy   privacy
  CRL     good      good      good       none
  DRL     good      good      medium     imperfect
  IDRL    good      good      bad        perfect
  PPRL    medium    medium    good       perfect
  SFE     bad       bad       good       perfect

Related Work. Private distributed protocols have been considered extensively for data mining, pioneered by Lindell and Pinkas (Lindell & Pinkas, 2002), who presented a privacy-preserving data-mining algorithm for ID3 decision-tree learning. Private distributed protocols have also been proposed for other data mining and machine learning problems, including k-means clustering (Jagannathan & Wright, 2005; Sakuma & Kobayashi, 2008), support vector machines (Yu et al., 2006), boosting (Gambs et al., 2007), and belief propagation (Kearns et al., 2007).

Agent privacy in reinforcement learning has been previously considered by Zhang and Makedon (Zhang & Makedon, 2005). Their solution uses a form of average-reward reinforcement learning that does not necessarily guarantee an optimal solution; further, their solution only applies to partitioning by time. In contrast, our solutions guarantee optimality under appropriate conditions, and we provide solutions both when the data is partitioned by time and when it is partitioned by observation.

In principle, private distributed computations such as these can be carried out using secure function evaluation (SFE) (Yao, 1986; Goldreich, 2004), which is a general and well-studied methodology for evaluating any function privately. However, although asymptotically polynomially bounded, these computations can be too inefficient for practical use, particularly when the input size is large. For the reinforcement learning algorithms we address, we make use of existing SFE solutions only for small portions of our computation, as part of a more efficient overall solution.

Our Contribution. We introduce the concepts of partitioning by time and partitioning by observation in distributed reinforcement learning (Section 2). We show privacy-preserving solutions for SARSA learning algorithms with random action selection for both kinds of partitioning (Section 4). Additionally, these algorithms are extended to Q-learning with greedy or ε-greedy action selection (Section 5). We provide experimental results in Section 6.

Table 1 provides a qualitative comparison of variants of reinforcement learning in terms of efficiency, learning accuracy, and privacy loss. We compare five approaches: CRL, DRL, independent distributed reinforcement learning (IDRL, explained below), SFE, and our privacy-preserving reinforcement learning solutions (PPRL). In CRL, all the agents send their perceptions to a designated agent, and then a centralized reinforcement learning algorithm is applied. In this case, the optimal convergence of value functions is theoretically guaranteed when the dynamics of the environment follow a discrete MDP; however, privacy is not provided, as all the data must be shared.

On the opposite end of the spectrum, in IDRL (independent DRL), each agent independently applies CRL using only its own local information; no information is shared. In this case, privacy is completely preserved, but the learning results will be different and independent. In particular, accuracy will be unacceptable if the agents have incomplete but important perceptions of the environment. DRL can be viewed as an intermediate approach between CRL and IDRL, in that the parties share only some information and accordingly reap only some gains in accuracy.

The table also includes the direct use of general SFE and our approach, PPRL. Both PPRL and SFE obtain good privacy and good accuracy. Although our solution incurs a significant cost (as compared to CRL, IDRL, and DRL) in computation and communication to obtain this, it does so with significantly improved computational efficiency over SFE. We provide a more detailed comparison of the privacy, accuracy, and efficiency of our approach and other possible approaches along with our experimental results in Section 6.

2. Preliminaries

2.1. Reinforcement Learning and MDP

Let S be a finite state set and A be a finite action set. A policy π is a mapping from a state/action pair (s, a) to the probability π(s, a) with which action a is taken at state s. At time step t, we denote by s_t, a_t, and r_t the state, action, and reward at time t, respectively. A Q-function is the expected return

    Q^π(s, a) = E_π [ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ],

where γ is a discount factor (0 ≤ γ < 1). The goal is to learn the optimal policy π maximizing the Q-function: Q*(s, a) = max_π Q^π(s, a) for all (s, a).

Figure 1. Partitioning model in the two-agent case. Left (partitioned-by-time): agents A and B hold the tuples (s_t, a_t, r_t, s_{t+1}) from disjoint sets of time steps. Right (partitioned-by-observation): at every time step, agent A holds (s_t^A, a_t^A, r_t^A, s_{t+1}^A) and agent B holds (s_t^B, a_t^B, r_t^B, s_{t+1}^B).

In SARSA learning, Q-values are updated at each step as

    ΔQ(s_t, a_t) ← α (r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)),
    Q(s_t, a_t) ← ΔQ(s_t, a_t) + Q(s_t, a_t),                          (1)

where α is the learning rate. Q-learning is obtained by replacing the update of ΔQ by

    ΔQ(s_t, a_t) ← α (r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)).

Iterating these updates under appropriate conditions, optimal convergence of the Q-values is guaranteed with probability 1 in discrete MDPs (Sutton & Barto, 1998; Watkins, 1989); the resulting optimal policy can then be readily obtained.
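For concreteness, update (1) and its Q-learning variant are sketched below in plain tabular form (in Python for brevity; the paper's own experiments were implemented in Java). The environment call in the usage comment is a hypothetical placeholder.

```python
import random
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step, following eq. (1)."""
    delta = alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
    Q[(s, a)] += delta
    return delta

def q_learning_update(Q, actions, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning replaces Q(s', a') by max_b Q(s', b) in the temporal difference."""
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = alpha * (r + gamma * best_next - Q[(s, a)])
    Q[(s, a)] += delta
    return delta

# Usage with random action selection (as in the SARSA protocols of Section 4):
actions = [0, 1]
Q = defaultdict(float)            # all Q-values start at 0
s, a = 0, random.choice(actions)
# r, s_next = env_step(s, a)      # env_step: hypothetical environment interface
# a_next = random.choice(actions)
# sarsa_update(Q, s, a, r, s_next, a_next)
```
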
2.2. Modeling Private Information in DRL

Let h_t = (s_t, a_t, r_t, s_{t+1}, a_{t+1}), let H = {h_t}, and suppose there are m agents. We consider two kinds of partitioning of H (see Fig. 1).

Partitioned-by-Time. This model assumes that only one agent interacts with the environment at any time step t. Let T^i be the set of time steps at which only the ith agent interacts with the environment. Then T^i ∩ T^j = ∅ (i ≠ j), and the set H^i = {h_t | t ∈ T^i} is the private information of the ith agent.

Partitioned-by-Observation. This model assumes that states and actions are represented as a collection of state and action variables. The state space and the action space are S = Π_i S^i and A = Π_i A^i, where S^i and A^i are the spaces of the ith agent's state and action variables, respectively. Without loss of generality (and for notational simplicity), we consider each agent's local state and action spaces to consist of a single variable. If s_t ∈ S is the joint state of the agents at time t, we denote by s_t^i the state that the ith agent perceives and by a_t^i the action of the ith agent. Let r_t^i be the local reward of the ith agent obtained at time t. We define the global reward (or reward for short) as r_t = Σ_i r_t^i in this model. Our Q-functions are evaluated based on this global reward. The perception of the ith agent at time t is denoted h_t^i = {s_t^i, a_t^i, r_t^i, s_{t+1}^i, a_{t+1}^i}, and the private information of the ith agent is H^i = {h_t^i}.

We note that partitioning by observation is more general than partitioning by time, in that one can always represent a sequence that is partitioned by time by one that is partitioned by observation. However, we provide more efficient solutions in the simpler case of partitioning by time.

Let π^c be a policy learned by CRL. Then, informally, the objective of PPRL is stated as follows:

Statement 1. The ith agent takes H^i as input. After the execution of PPRL, all agents learn a policy π which is equivalent to π^c. Furthermore, no agent can learn anything that cannot be inferred from π and its own private input.

This problem statement can be formalized as in SFE (Goldreich, 2004). It is a strong privacy requirement, which precludes solutions that reveal intermediate Q-values, actions taken, or states visited. We assume our agents behave semi-honestly, a common assumption in SFE: agents follow their specified protocol properly, but might also use their records of intermediate computations in order to attempt to learn other parties' private information.
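To make the two partitioning models concrete, the sketch below shows how one two-agent trajectory decomposes into the private inputs H^A and H^B under each model; the container types are illustrative only and not the paper's notation.

```python
from typing import Dict, List, Tuple

# One global step: joint state (s^A, s^B), joint action (a^A, a^B), and the
# local rewards (r^A, r^B) with r_t = r^A + r^B (successor state/action of
# h_t omitted here for brevity).
Step = Tuple[Tuple[int, int], Tuple[int, int], Tuple[float, float]]

def split_by_time(H: List[Step], T_A: set) -> Tuple[Dict, Dict]:
    """Partitioned-by-time: A keeps the full tuples for t in T^A, B keeps the
    rest (T^A and T^B are disjoint)."""
    H_A = {t: h for t, h in enumerate(H) if t in T_A}
    H_B = {t: h for t, h in enumerate(H) if t not in T_A}
    return H_A, H_B

def split_by_observation(H: List[Step]) -> Tuple[List, List]:
    """Partitioned-by-observation: at every step, agent i keeps only its own
    state variable, action variable, and local reward."""
    H_A = [(s[0], a[0], r[0]) for (s, a, r) in H]
    H_B = [(s[1], a[1], r[1]) for (s, a, r) in H]
    return H_A, H_B
```
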

3. Cryptographic Building Blocks

Our solutions make use of several existing cryptographic tools. Specifically, in our protocols, Q-values are encrypted with an additive homomorphic cryptosystem, which allows the addition of encrypted values without requiring their decryption, as described in Section 3.1. Using the homomorphic properties, the encrypted Q-values can be updated in the regular RL manner while the unencrypted Q-values remain unknown to the agents. For computations which cannot be handled by the homomorphic property, we use SFE as a primitive, as described in Section 3.2.

3.1. Homomorphic Public Key Cryptosystems

In a public key cryptosystem, encryption uses a public key that can be known to everyone, while decryption requires knowledge of the corresponding private key. Given a corresponding pair (sk, pk) of private and public keys and a message m, c = e_pk(m; ρ) denotes a (random) encryption of m, and m = d_sk(c) denotes decryption. The encrypted value c is uniformly distributed over the ciphertext space if ρ is taken from Z_N at random. An additive homomorphic cryptosystem allows addition computations on encrypted values without knowledge of the secret key. Specifically, there is some operation · (not requiring knowledge of sk) such that the product of encryptions of m_1 and m_2 is an encryption of m_1 + m_2 mod N. In particular, multiplying a ciphertext by a fresh encryption e(0) of zero re-randomizes it without changing its plaintext. Our protocols also use a threshold variant in which decryption requires shares held by both agents.

3.2. Secure Function Evaluation

Secure function evaluation (Yao, 1986; Goldreich, 2004) allows two or more parties to jointly compute a function of their private inputs so that nothing is revealed beyond the output. We use SFE only for two small subcomputations, private comparison and private division. Private division takes an encryption of x and a public integer K and lets an agent learn an encryption of the quotient Q ∈ Z_N such that x = (QK + R) mod N, where R ∈ Z_N and 0 ≤ R < K.
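A toy version of the additive homomorphic cryptosystem of Section 3.1 (textbook Paillier with tiny, insecure parameters) is sketched below to illustrate the two properties the protocols rely on: multiplying ciphertexts adds the underlying plaintexts, and multiplying by a fresh e(0) re-randomizes a ciphertext without changing its value. The actual implementation in Section 6 uses the Damgård-Jurik generalization with 1024-bit keys and threshold decryption, both of which this sketch omits.

```python
import math, random

# Toy Paillier cryptosystem (tiny parameters, for illustration only).
p, q = 293, 433                       # demo primes; real keys are ~1024 bits
n, n2, g = p * q, (p * q) ** 2, p * q + 1
lam = math.lcm(p - 1, q - 1)          # Carmichael lambda(n)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m, r=None):
    """e(m; r) = g^m * r^n mod n^2, with r a random unit mod n."""
    while r is None or math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """d(c) = L(c^lambda mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Additive homomorphism: the product of ciphertexts encrypts the sum.
c1, c2 = encrypt(20), encrypt(22)
assert decrypt((c1 * c2) % n2) == 42

# Re-randomization by e(0): same plaintext, fresh-looking ciphertext
# (this is what hides the updated entry in eqs. 6-8 below).
assert decrypt((c1 * encrypt(0)) % n2) == decrypt(c1)
```
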

4. Private Update of Q-Values

Figure 2. Private update of Q-values in the partitioned-by-time model (SARSA/random action selection). Public input: L, K, learning rate α, discount rate γ.
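The protocols take a public integer K. From the way e(K ΔQ(s_t, a_t)) is reduced to e(ΔQ'(s_t, a_t)) by private division in step 4(c) below, K can be read as a fixed-point scaling factor that maps real-valued Q-values into Z_N. The sketch below shows that bookkeeping in the clear; it is an interpretation consistent with the surviving text, not the paper's exact encoding.

```python
# Fixed-point encoding of real-valued Q-values into Z_N (an assumed encoding,
# chosen to match the e(K*dQ) -> private division -> e(dQ') pattern).
K = 10**6                  # public precision factor
N = 2**64                  # stands in for the plaintext modulus of the cryptosystem

def encode(x):
    """Real value -> Z_N; negative values wrap around modulo N."""
    return round(x * K) % N

def decode(v):
    """Z_N -> real value, reading the upper half of Z_N as negative."""
    return (v - N if v > N // 2 else v) / K

# alpha * (r + gamma*Q' - Q) computed on encoded values picks up an extra
# factor of K; the protocol removes it with private division instead of
# decrypting.  Here only the plaintext bookkeeping is shown.
alpha, gamma = 0.1, 0.9
q, q_next, r = 2.0, 0.5, 0.0
dQ = alpha * (r + gamma * q_next - q)                      # = -0.155
scaled = encode(alpha) * ((encode(r) + encode(gamma * q_next) - encode(q)) % N) % N
assert abs(decode(scaled) / K - dQ) < 1e-6                 # still scaled by K once
```
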

Figure 3. Private update of Q-values in the partitioned-by-observation model (SARSA/random action selection)

• Public input: L, K, learning rate α, discount rate γ
• A's input: (s_t^A, a_t^A); B's input: (s_t^B, a_t^B)
• A's output: encryption of the updated Q-value c(s_t, a_t)
• B's output: nothing

1. A: Initialize Q(s, a) arbitrarily and compute c(s, a) (= e(Q(s, a))) for all (s, a).
2. Interaction with the environment:
   • A: Take action a_t^A and get r_t^A, s_{t+1}^A.
   • B: Take action a_t^B and get r_t^B, s_{t+1}^B.
3. Action selection:
   • A: Choose a_{t+1}^A randomly.
   • B: Choose a_{t+1}^B randomly.
4. Update of the Q-value:
   (a) A: Send X^A, {c_{ik}}, {c'_{ik}} to B by eqs. 6, 7.
   (b) B: Compute e(K ΔQ(s_t, a_t)) by eq. 5.
   (c) B: Do private division of e(K ΔQ(s_t, a_t)) with A; then B learns e(ΔQ'(s_t, a_t)).
   (d) B: Generate {Δc_{hijk}} by eq. 8 and send it to A.
   (e) A: Update c(s, a) with {Δc_{hijk}} by eq. 9.

In the partitioned-by-observation model (Fig. 3), the encrypted Q-table is held by A and updated as follows. In step 4(a), A sends X^A and the tables {c_{ik}}, {c'_{ik}}, with re-randomization, such that

    c_{ik} = c(s_t^A, i, a_t^A, k) · e(0)            (i ∈ S^B, k ∈ A^B),    (6)
    c'_{ik} = c(s_{t+1}^A, i, a_{t+1}^A, k) · e(0)   (i ∈ S^B, k ∈ A^B),    (7)

to B. B determines c(s_t, a_t) = c_{s_t^B a_t^B} and c(s_{t+1}, a_{t+1}) = c'_{s_{t+1}^B a_{t+1}^B}, and obtains e(K ΔQ(s_t, a_t)) by eq. 5 (step 4(b)). B then computes e(ΔQ'(s_t, a_t)) by private division with A (step 4(c)). For all (h, i, j, k), B sets

    Δc_{hijk} ← e(ΔQ'(s_t, a_t))   if (i = s_t^B, k = a_t^B),
    Δc_{hijk} ← e(0)               otherwise,                               (8)

and sends {Δc_{hijk}} to A (step 4(d)). Finally, for all (i, k), the Q-values are updated by A as

    c(s_t^A, i, a_t^A, k) ← c(s_t^A, i, a_t^A, k) · Δc_{s_t^A i a_t^A k}.   (9)

With this update, e(ΔQ'(s_t, a_t)) is added only when (h, i, j, k) = (s_t^A, s_t^B, a_t^A, a_t^B); otherwise, e(0) is added. Note that A cannot tell which element of {Δc_{hijk}} is e(ΔQ'(s_t, a_t)) because of the re-randomization. Thus, eq. 9 is the desired update.

Lemma 2. If A and B behave semi-honestly, then after the private update of Q-values for SARSA learning with random action selection in the partitioned-by-observation model, A updates the encrypted Q-values correctly but learns nothing, and B learns nothing.

By iterating private updates, encrypted Q-values trained by SARSA learning are obtained.
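The heart of steps 4(d)-(e) is that B returns a table in which exactly one re-randomized entry carries the encrypted increment and every other entry encrypts zero, so that A can fold the update into its encrypted Q-table without learning which joint state/action was visited. The sketch below isolates that piece (eqs. 8-9) for the (s^B, a^B) slice corresponding to A's current state/action; e(ΔQ') itself, i.e. steps 4(a)-(c) with eq. 5 and private division, is taken as an already-available ciphertext, and encrypt/n2 are the primitives from the toy cryptosystem sketched in Section 3.

```python
import itertools

def build_masked_increments(delta_enc, sB_t, aB_t, S_B, A_B, encrypt, n2):
    """B's step 4(d), following eq. (8): one real (re-randomized) increment,
    encryptions of zero everywhere else, so the table reveals nothing to A."""
    delta_c = {}
    for i, k in itertools.product(S_B, A_B):
        if (i, k) == (sB_t, aB_t):
            delta_c[(i, k)] = (delta_enc * encrypt(0)) % n2   # e(dQ'), re-randomized
        else:
            delta_c[(i, k)] = encrypt(0)
    return delta_c

def apply_masked_increments(c_slice, delta_c, n2):
    """A's step 4(e), following eq. (9): multiply the masked table into the
    encrypted Q-values c(s_t^A, i, a_t^A, k) for all (i, k)."""
    for key, dc in delta_c.items():
        c_slice[key] = (c_slice[key] * dc) % n2
    return c_slice
```

Only the entry at (s_t^B, a_t^B) changes after decryption, yet every ciphertext A receives looks fresh, which is exactly the property Lemma 2 relies on.
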
5. Private Greedy Action Selection

In this section, private distributed algorithms for greedy action selection, which compute a* = arg max_a Q(s, a) from encrypted Q-values, are described for both partitioning models. These are used for: (1) (ε-)greedy action selection, (2) the max operation in the updates of Q-learning, and (3) extracting the learned policy from the final Q-values. In the partitioned-by-time model, this is readily solved by using private comparison, so the details are omitted.

5.1. Private Greedy Action Selection in the Partitioned-by-Observation Model

When A and B observe s_t^A and s_t^B, respectively, private greedy action selection requires that (1) A obtains a^{A*} and nothing else, and (2) B obtains a^{B*} and nothing else, where (a^{A*}, a^{B*}) = arg max_{(a^A, a^B)} Q(s_t^A, a^A, s_t^B, a^B).

The protocol is described in Fig. 4. Threshold decryption is used here, too. First, A sends the encrypted Q-values c(s_t^A, i, j, k), with re-randomization, for all (i, j, k). For all (i, k), B generates and sends tables {c_{ik}} and {σ_{ik}^B} whose values are set to

    c_{ik} = c(s_t^A, i, s_t^B, π(k)) · e(−Q_{iπ(k)}^B),   (10)
    σ_{ik}^B = d^B(c_{ik}),                                (11)

where π : A^B → A^B is a random permutation and Q_{iπ(k)}^B is chosen uniformly at random from Z_N. At the third step, A recovers Q_{ik}^A (= Q(s_t^A, i, s_t^B, π(k)) − Q_{iπ(k)}^B). With these random shares of Q(s_t^A, i, s_t^B, π(k)), the values (i*, k*) = arg max_{(i,k)} (Q_{ik}^A + Q_{ik}^B) are obtained by A using private comparison. Finally, B learns a^{B*} = π^{−1}(k*), where π^{−1} is the inverse of π.

Figure 4. Private greedy action selection in the partitioned-by-observation model

• A's input: c(s, a) for all (s, a), and s_t^A; B's input: s_t^B
• A's output: a^{A*}; B's output: a^{B*}

1. A: For all i ∈ S^B, j ∈ A^A, k ∈ A^B, send c(s_t^A, i, j, k) to B.
2. B: For all i ∈ A^A, k ∈ A^B, compute c_{ik} (eq. 10) and σ_{ik}^B (eq. 11), and send {c_{ik}}, {σ_{ik}^B} to A.
3. A: For all i ∈ A^A, k ∈ A^B, compute σ_{ik}^A = d^A(c_{ik}). Then compute Q_{ik}^A by applying the threshold decryption recovery algorithm with public key pk and shares σ_{ik}^A, σ_{ik}^B.
4. A and B: Compute (i*, k*) = arg max_{(i,k)} (Q_{ik}^A + Q_{ik}^B) by private comparison. (A learns (i*, k*).)
5. A: Send k* to B. Then output a^{A*} = i*.
6. B: Output a^{B*} = π^{−1}(k*).

Lemma 3. If A and B behave semi-honestly, then, after the execution of private greedy action selection, A learns a^{A*} and nothing else, and B learns a^{B*} and nothing else.

Note that a^{B*} is not learned by A because the index k* is obscured by the random permutation generated by B.
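The information flow of Fig. 4 can be mimicked with plaintext additive shares: B blinds each candidate Q-value with a random mask and permutes its own action index, A ends up holding only masked shares, and the arg max is computed jointly. In the sketch below, the joint step is an ordinary max over the recombined shares, standing in for the Yao-based private comparison; it illustrates who learns what, not the cryptography.

```python
import random

def private_greedy_sketch(Q_row, A_actions, B_actions, modulus=2**40):
    """Structure of Fig. 4 with additive shares in place of ciphertexts.
    Q_row[(i, k)] holds Q(s_t^A, i, s_t^B, k) as a nonnegative integer; in the
    real protocol these values exist only in encrypted form."""
    # B: random permutation pi of its actions and random masks (eqs. 10-11).
    pi = list(B_actions)
    random.shuffle(pi)                                  # pi[kp] = B's action at slot kp
    mask = {(i, kp): random.randrange(modulus)
            for i in A_actions for kp in range(len(B_actions))}
    # A: after (threshold) decryption, holds only the masked shares.
    share_A = {(i, kp): (Q_row[(i, pi[kp])] - mask[(i, kp)]) % modulus
               for i in A_actions for kp in range(len(B_actions))}
    # Joint step: arg max over share_A + mask (stand-in for private comparison).
    i_star, k_star = max(share_A, key=lambda ik: (share_A[ik] + mask[ik]) % modulus)
    # A outputs a^{A*} = i*; B un-permutes k* to obtain a^{B*}.
    return i_star, pi[k_star]

# The greedy joint action of this 2 x 2 example is (a^A, a^B) = (1, 0).
Q_row = {(0, 0): 3, (0, 1): 5, (1, 0): 9, (1, 1): 1}
assert private_greedy_sketch(Q_row, [0, 1], [0, 1]) == (1, 0)
```
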

5.2. Security of PPRL

Privacy-preserving SARSA learning is constructed by alternating iterations of the private update and random action selection. The policy π can then be extracted by computing arg max_a Q(s, a) for all (s, a) using private greedy action selection. The security follows from the earlier lemmas:

Theorem 1. SARSA learning with private update of Q-values and random action selection is secure in the sense of Statement 1.

Privacy-preserving SARSA learning and Q-learning with (ε-)greedy action selection can be constructed by combining the private update with private greedy and random action selection. However, these PPRLs do not satisfy Statement 1, because Statement 1 does not allow agents to know the greedy actions chosen in the middle of learning. Therefore, the problem definition is relaxed as follows:

Statement 2. The ith agent takes H^i as input. After the execution of PPRL, all agents learn the series of greedy actions taken during the learning steps and a policy π which is equivalent to π^c. Furthermore, no agent learns anything else.

Theorem 2. SARSA and Q-learning with private update of Q-values and private greedy/ε-greedy action selection are secure in the sense of Statement 2.

6. Experimental Results

We performed experiments to examine the efficiency of PPRL. Programs were written in Java 1.5.0. As the cryptosystem, the scheme of (Damgård & Jurik, 2001) with 1024-bit keys was used. For SFE, Fairplay (Malkhi et al., 2004) was used. Experiments were carried out under Linux with a 1.2 GHz CPU and 2 GB RAM.

6.1. Random Walk Task

This random walk task is partitioned by time. The state space is S = {s_1, ..., s_n} (n = 40) and the action space is A = {a_1, a_2}. The initial and goal states are s_1 and s_n, respectively. When a_1 is taken at s_p (p ≠ n), the agent moves to s_{p+1}. When a_2 is taken at s_p (p ≠ 1), the agent moves to s_{p−1}; the agent does not move when a_2 is taken at p = 1. A reward r = 1 is given only when the agent takes a_1 at s_{n−1}; otherwise, r = 0. An episode is terminated at s_n or after 1,000 steps.

A learns for 15,000 steps and then B learns for 15,000 steps. CRL, IDRL, PPRL, and SFE were compared. SARSA learning with random or ε-greedy action selection was used in all settings. Table 2 shows the comparison of computational cost, learning accuracy (the number of steps to reach the goal state, averaged over 30 trials, and the number of trials that successfully reached the goal state), and privacy preservation.

Table 2. Comparison of efficiency in random walk tasks

               comp. (sec)    avg. steps   #goal   privacy loss
  CRL/rnd.     0.901          40.0         30/30   disclosed all
  IDRL/rnd.    0.457          247          8/30    Stmt. 1
  PPRL/rnd.    4.71 × 10^3    40.0         30/30   Stmt. 1
  SFE/rnd.     > 7.0 × 10^6   40.0         30/30   Stmt. 1
  CRL/ε-grd.   0.946          40.0         30/30   disclosed all
  IDRL/ε-grd.  0.481          —            0/30    Stmt. 2
  PPRL/ε-grd.  3.36 × 10^4    40.0         30/30   Stmt. 2
  SFE/ε-grd.   > 7.0 × 10^6   40.0         30/30   Stmt. 2

The learning accuracy of PPRL and SFE is the same as that of CRL, because the policies learned by PPRL and SFE are guaranteed to be equal to the one learned by CRL. In contrast, the optimal policy is not obtained by IDRL, because each IDRL agent's learning steps amount to only half of those of the other settings. Because most of the computation time is spent on private division and private comparison, the computation time with random selection is much smaller than with ε-greedy selection. These experiments demonstrate that PPRL obtains good learning accuracy, while IDRL does not, though PPRL's computation time is larger than that of DRL and IDRL.
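The random-walk MDP is fully specified above and is easy to restate in code; the sketch below follows that description (n = 40, a_1 moves right, a_2 moves left, reward 1 only for taking a_1 at s_{n−1}, episodes capped at 1,000 steps).

```python
import random

N_STATES, MAX_STEPS = 40, 1000
A1, A2 = 1, 2                          # a1: move right, a2: move left

def step(p, action):
    """One transition of the random-walk task; states s_1, ..., s_n as ints 1..n."""
    reward = 1.0 if (action == A1 and p == N_STATES - 1) else 0.0
    if action == A1 and p != N_STATES:
        p += 1
    elif action == A2 and p != 1:
        p -= 1
    return p, reward, p == N_STATES    # done when the goal s_n is reached

def run_episode(policy=lambda s: random.choice((A1, A2))):
    p, total, steps = 1, 0.0, 0
    for steps in range(1, MAX_STEPS + 1):
        p, r, done = step(p, policy(p))
        total += r
        if done:
            break
    return total, steps
```
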
6.2. Load Balancing Task

In these experiments, we consider a load balancing problem (Cogill et al., 2006) in the partitioned-by-observation model with two factories A and B. Each factory can observe its own backlog s^A, s^B ∈ {0, ..., 5}. At each time step, each factory decides whether or not to pass a job to the other factory; the action variables are a^A, a^B ∈ {0, 1}. Jobs arrive and are processed independently at each time step with probability 0.4 and 0.48, respectively. Agent A receives reward r^A = 50 − (s^A)^2. If A passes a job to B, then A's reward is reduced by 2 as a cost for redirection. If an overflow happens, the job is lost and r^A = 0 is given. The reward r^B is computed in the same way. The perceptions (s^A, a^A, r^A) and (s^B, a^B, r^B) are to be kept private. (In this task, actions cannot be kept private, because each party learns the other's action from whether or not a job was passed.)

Distributed-reward DRL (RDRL) (Schneider et al., 1999) is tested in addition to the four approaches tested earlier. RDRL is a variant of DRL which is the same as IDRL except that the global rewards are shared among the distributed agents (Schneider et al., 1999). SARSA with ε-greedy action selection was used in all settings.
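The load-balancing dynamics are likewise specified above; a single-factory sketch of one time step is given below. The ordering of arrival, processing, and reward, and the handling of a job redirected from the other factory, are assumptions made for the sketch where the text is silent.

```python
import random

MAX_BACKLOG = 5
P_ARRIVE, P_PROCESS = 0.4, 0.48
REDIRECT_COST = 2

def factory_step(backlog, redirect, incoming=0):
    """One time step of one factory.

    backlog  : current queue length in {0, ..., MAX_BACKLOG}
    redirect : the factory's action a in {0, 1}; 1 passes an arriving job away
    incoming : jobs redirected here by the other factory (assumption: appended)
    Returns (new backlog, reward, number of jobs passed to the other factory).
    """
    reward = 50 - backlog ** 2
    arrived = random.random() < P_ARRIVE
    passed_on = 1 if (arrived and redirect) else 0
    if passed_on:
        reward -= REDIRECT_COST              # cost for redirecting the job
        arrived = False
    if backlog > 0 and random.random() < P_PROCESS:
        backlog -= 1                         # one job completed
    backlog += int(arrived) + incoming
    if backlog > MAX_BACKLOG:                # overflow: job lost, zero reward
        backlog = MAX_BACKLOG
        reward = 0
    return backlog, reward, passed_on
```
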

Fig. 5 shows the change in the sum of global rewards per episode. To avoid overflows, cooperation between the agents is essential in this task. The performance of the IDRL agents is inferior to the others because selfish behavior is learned. In contrast, the CRL, PPRL, and SFE agents successfully obtain cooperative behavior. The performance of RDRL is intermediate because the perceptions of the RDRL agents are limited. Efficiency is shown in Table 3. Since ε-greedy action selection was used, the privacy of IDRL, PPRL, and SFE follows Statement 2. The privacy preservation of RDRL lies between that of CRL and PPRL. As discussed in Section 1, PPRL achieves both the guarantee of privacy preservation and optimality equivalent to that of CRL; SFE does the same, but at a much higher computational cost.

Table 3. Comparison of efficiency in load balancing tasks

          comp. (sec)    accuracy   privacy loss
  CRL     5.11           90.0       disclosed all
  RDRL    5.24           87.4       partially disclosed
  IDRL    5.81           84.2       Stmt. 1
  PPRL    8.85 × 10^5    90.0       Stmt. 2
  SFE     > 2.0 × 10^7   90.0       Stmt. 2

Figure 5. Performance evaluation in load balancing tasks under SARSA/ε-greedy learning: sum of global rewards in an episode, normalized by the number of steps in the episode, averaged over 100 trials; curves for CRL/PPRL/SFE, RDRL, and IDRL.

Acknowledgments

This work was started at Tokyo Institute of Technology and carried out partly while the first author was a visitor at the DIMACS Center. The third author is partially supported by the National Science Foundation under grant number CNS-0822269.

References

Abe, N., Verma, N., Apte, C., & Schroko, R. (2004). Cross channel optimized marketing by reinforcement learning. ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (pp. 767–772).

Cogill, R., Rotkowitz, M., Van Roy, B., & Lall, S. (2006). An Approximate Dynamic Programming Approach to Decentralized Control of Stochastic Systems. LNCIS, 329, 243–256.

Damgård, I., & Jurik, M. (2001). A Generalisation, a Simplification and Some Applications of Paillier's Probabilistic Public-Key System. Public Key Cryptography 2001. Springer.

Gambs, S., Kégl, B., & Aïmeur, E. (2007). Privacy-preserving boosting. Data Mining and Knowledge Discovery, 14, 131–170.

Goldreich, O. (2004). Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press.

Jagannathan, G., & Wright, R. N. (2005). Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (pp. 593–599).

Kearns, M., Tan, J., & Wortman, J. (2007). Privacy-Preserving Belief Propagation and Sampling. NIPS 20.

Lindell, Y., & Pinkas, B. (2002). Privacy Preserving Data Mining. Journal of Cryptology, 15, 177–206.

Malkhi, D., Nisan, N., Pinkas, B., & Sella, Y. (2004). Fairplay: a secure two-party computation system. USENIX Security Symposium (pp. 287–302).

Moallemi, C. C., & Roy, B. V. (2004). Distributed optimization in adaptive networks. NIPS 16.

Sakuma, J., & Kobayashi, S. (2008). Large-scale k-means Clustering with User-Centric Privacy Preservation. Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) 2008, to appear.

Schneider, J., Wong, W., Moore, A., & Riedmiller, M. (1999). Distributed value functions. Int'l Conf. on Machine Learning (pp. 371–378).

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Watkins, C. (1989). Learning from Delayed Rewards. Cambridge University.

Yao, A. C.-C. (1986). How to generate and exchange secrets. IEEE Symposium on Foundations of Computer Science (pp. 162–167).

Yu, H., Jiang, X., & Vaidya, J. (2006). Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. ACM SAC (pp. 603–610).

Zhang, S., & Makedon, F. (2005). Privacy preserving learning in negotiation. ACM SAC (pp. 821–825).