Privacy-Preserving Reinforcement Learning
Jun Sakuma
Shigenobu Kobayashi
Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midoriku, Yokohama, 226-8502, Japan

Rebecca N. Wright
Rutgers University, 96 Frelinghuysen Road, Piscataway, NJ 08854, USA

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

We consider the problem of distributed reinforcement learning (DRL) from private perceptions. In our setting, agents' perceptions, such as states, rewards, and actions, are not only distributed but also should be kept private. Conventional DRL algorithms can handle multiple agents, but do not necessarily guarantee privacy preservation and may not guarantee optimality. In this work, we design cryptographic solutions that achieve optimal policies without requiring the agents to share their private information.

1. Introduction

With the rapid growth of computer networks and networked computing, a large amount of information is being sensed and gathered by distributed agents physically or virtually. Distributed reinforcement learning (DRL) has been studied as an approach to learn a control policy through interactions between distributed agents and environments, for example in sensor networks and mobile robots. DRL algorithms, such as the distributed value function approach (Schneider et al., 1999) and the policy gradient approach (Moallemi & Roy, 2004), typically seek to satisfy two types of physical constraints. One is constraints on communication, such as an unstable network environment or limited communication channels. The other is memory constraints to manage the huge state/action space. Therefore, the main emphasis of DRL has been to learn good, but sub-optimal, policies with minimal or limited sharing of agents' perceptions.

In this paper, we consider the privacy of agents' perceptions in DRL. Specifically, we provide solutions for privacy-preserving reinforcement learning (PPRL), in which agents' perceptions, such as states, rewards, and actions, are not only distributed but are desired to be kept private. Consider two example scenarios:

Optimized Marketing (Abe et al., 2004): Consider the modeling of the customer's purchase behavior as a Markov Decision Process (MDP). The goal is to obtain the optimal catalog mailing strategy which maximizes the long-term profit. Timestamped histories of customer status and mailing records are used as state variables. Their purchase patterns are used as actions. Value functions are learned from these records to learn the optimal policy. If these histories are managed separately by two or more enterprises, they may not want to share their histories for privacy reasons (for example, in keeping with privacy promises made to their customers), but might still like to learn a value function from their joint data in order that they can all maximize their profits.

Load Balancing (Cogill et al., 2006): Consider load balancing among competing factories. Each factory wants to accept customer jobs, but in order to maximize its own profit, may need to redirect jobs when heavily loaded. Each factory can observe its own backlog. Factories do not want to share their backlog information with each other for business reasons, yet they would still like to make optimal decisions.

Privacy constraints prevent the data from being combined in a single location where centralized reinforcement learning (CRL) algorithms could be applied. Although DRL algorithms work in a distributed setting, they are designed to limit the total amount of data sent between agents, but do not necessarily do so in a way that guarantees privacy preservation. Additionally, DRL often sacrifices optimality in order to learn with low communication. In contrast, we propose solutions that employ cryptographic techniques to achieve optimal policies (as would be learned if all the information were combined into a CRL problem) while also explicitly protecting the agents' private information. We describe solutions both for data that is "partitioned-by-time" (as in the optimized marketing example) and "partitioned-by-observation" (as in the load balancing example).

Related Work. Private distributed protocols have been considered extensively for data mining, pioneered by Lindell and Pinkas (Lindell & Pinkas, 2002), who presented a privacy-preserving data-mining algorithm for ID3 decision-tree learning. Private distributed protocols have also been proposed for other data mining and machine learning problems, including $k$-means clustering (Jagannathan & Wright, 2005; Sakuma & Kobayashi, 2008), support vector machines (Yu et al., 2006), boosting (Gambs et al., 2007), and belief propagation (Kearns et al., 2007).

Agent privacy in reinforcement learning has been previously considered by Zhang and Makedon (Zhang & Makedon, 2005). Their solution uses a form of average reward reinforcement learning that does not necessarily guarantee an optimal solution; further, their solution applies only to partitioning by time. In contrast, our solutions guarantee optimality under appropriate conditions, and we provide solutions both when the data is partitioned by time and when it is partitioned by observation.

In principle, private distributed computations such as these can be carried out using secure function evaluation (SFE) (Yao, 1986; Goldreich, 2004), which is a general and well-studied methodology for evaluating any function privately. However, although asymptotically polynomially bounded, these computations can be too inefficient for practical use, particularly when the input size is large. For the reinforcement learning algorithms we address, we make use of existing SFE solutions for small portions of our computation as part of a more efficient overall solution.

Our Contribution. We introduce the concepts of partitioning by time and partitioning by observation in distributed reinforcement learning (Section 2). We show privacy-preserving solutions for SARSA learning algorithms with random action selection for both kinds of partitioning (Section 4). Additionally, these algorithms are expanded to Q-learning with greedy or $\epsilon$-greedy action selection (Section 5). We provide experimental results in Section 6.

Table 1 provides a qualitative comparison of variants of reinforcement learning in terms of efficiency, learning accuracy, and privacy loss. We compare five approaches: CRL, DRL, independent distributed reinforcement learning (IDRL, explained below), SFE, and our privacy-preserving reinforcement learning solutions (PPRL). In CRL, all the agents send their perceptions to a designated agent, and then centralized reinforcement learning is applied. In this case, the optimal convergence of value functions is theoretically guaranteed when the dynamics of environments follow a discrete MDP; however, privacy is not provided, as all the data must be shared.

        computation  communication  accuracy  privacy
CRL     good         good           good      none
DRL     good         good           medium    imperfect
IDRL    good         good           bad       perfect
PPRL    medium       medium         good      perfect
SFE     bad          bad            good      perfect

Table 1. Comparison of different approaches

On the opposite end of the spectrum, in IDRL (independent DRL), each agent independently applies CRL using only its own local information; no information is shared. In this case, privacy is completely preserved, but the learning results will be different and independent. In particular, accuracy will be unacceptable if the agents have incomplete but important perceptions about the environment. DRL can be viewed as an intermediate approach between CRL and IDRL, in that the parties share only some information and accordingly reap only some gains in accuracy.

The table also includes the direct use of general SFE and our approach of PPRL. Both PPRL and SFE obtain good privacy and good accuracy. Although our solution incurs a significant cost (as compared to CRL, IDRL, and DRL) in computation and communication to obtain this, it does so with significantly improved computational efficiency over SFE. We provide a more detailed comparison of the privacy, accuracy, and efficiency of our approach and other possible approaches along with our experimental results in Section 6.

2. Preliminaries

2.1. Reinforcement Learning and MDP

Let $S$ be a finite state set and $A$ be a finite action set. A policy $\pi$ is a mapping from a state/action pair $(s, a)$ to the probability $\pi(s, a)$ with which action $a$ is taken at state $s$. At time step $t$, we denote by $s_t$, $a_t$, and $r_t$ the state, action, and reward, respectively. A Q-function is the expected return

$$Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right],$$

where $\gamma$ is a discount factor ($0 \le \gamma < 1$). The goal is to learn the optimal policy $\pi$ maximizing the Q-function: $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $(s, a)$. In SARSA learning, Q-values are updated at each step as:

Figure 1. Partitioning model in the two-agent case (panels: partitioned-by-time, partitioned-by-observation).

this global reward. The perception of the $i$th agent at time $t$ is denoted as $h_t^i = \{s_t^i, a_t^i, r_t^i, s_{t+1}^i, a_{t+1}^i\}$. The private information of the $i$th agent is $H^i = \{h_t^i\}$.

We note that partitioning by observation is more general than partitioning by time, in that one can always represent a sequence that is partitioned by time by one that is partitioned by observation. However, we provide more efficient solutions in the simpler case of partitioning by time.

Let $\pi_c$ be a policy learned by CRL. Then, informally, the objective of PPRL is stated as follows:

Statement 1.