

Deep Reinforcement Learning for Cyber Security

Thanh Thi Nguyen and Vijay Janapa Reddi

Abstract—The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyber attacks more than ever. The complexity and dynamics of cyber attacks require protecting mechanisms to be responsive, adaptive, and scalable. Machine learning, or more specifically deep reinforcement learning (DRL), methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This paper presents a survey of DRL approaches developed for cyber security. We touch on different vital aspects, including DRL-based security methods for cyber-physical systems, autonomous intrusion detection techniques, and multi-agent DRL-based game theory simulations for defense strategies against cyber attacks. Extensive discussions and future research directions on DRL-based cyber security are also given. We expect that this comprehensive review provides the foundations for and facilitates future studies on exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.

Index Terms—survey, review, deep reinforcement learning, deep learning, cyber security, cyber defense, cyber attacks, Internet of Things, IoT.

T. T. Nguyen is with the School of Information Technology, Deakin University, Melbourne Burwood Campus, Burwood, VIC 3125, Australia, e-mail: [email protected].

V. J. Reddi is with the John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA, e-mail: [email protected].

I. INTRODUCTION

INTERNET of Things (IoT) technologies have been employed broadly in many sectors such as telecommunications, transportation, manufacturing, water and power management, healthcare, education, finance, government, and even entertainment. The convergence of various information and communication technology (ICT) tools in the IoT has boosted its functionalities and services to users to new levels. ICT has witnessed a remarkable development in terms of system design and intelligent devices in the last decade. For example, ICT has been advanced with the innovations of cognitive radio networks and 5G cellular networks [1], [2], software-defined networking (SDN) [3], cloud computing [4], (mobile) edge caching [5], [6], and fog computing [7]. Accompanying these developments is the increasing vulnerability to cyber attacks, which are defined as any type of offensive maneuver exercised by one or multiple computers to target computer information systems, network infrastructures, or personal computer devices. Cyber attacks may be instigated by economic competitors or state-sponsored attackers. There has thus been a critical need for the development of cyber security technologies to mitigate and eliminate the impacts of these attacks [8].

Artificial intelligence (AI), especially machine learning (ML), has been applied to both attacking and defending in cyberspace. On the attacker side, ML is utilized to compromise defense strategies. On the cyber security side, ML is employed to put up robust resistance against security threats in order to adaptively prevent and minimise the impacts or damages that occur. Among these ML applications, unsupervised and supervised learning methods have been used widely for intrusion detection [9]–[11], malware detection [12]–[14], cyber-physical attacks [15]–[17], and data privacy protection [18]. In principle, unsupervised methods explore the structure and patterns of data without using their labels, while supervised methods learn by examples based on data labels. These methods, however, cannot provide dynamic and sequential responses against cyber attacks, especially new or constantly evolving threats. Also, the detection and defending responses often take place after the attacks, when traces of attacks become available for collecting and analyzing, and thus proactive defense solutions are hindered. A statistical study shows that 62% of the attacks were recognized only after they had caused significant damage to the cyber systems [19].

Reinforcement learning (RL), a branch of ML, is the closest form of human learning because it can learn by its own experience through exploring and exploiting the unknown environment. RL can model an autonomous agent to take sequential actions optimally without, or with limited, prior knowledge of the environment, and thus it is particularly adaptable and useful in real-time and adversarial environments. RL, therefore, demonstrates excellent suitability for cyber security applications where cyber attacks are increasingly sophisticated, rapid, and ubiquitous [20]–[23].

The recent development of deep learning has been incorporated into RL methods and has enabled them to solve many complex problems [24]–[28]. The emergence of DRL has witnessed great success in different fields, from the video game domain, e.g. Atari [29], [30], the game of Go [31], [32], the real-time strategy game StarCraft II [33]–[36], the 3D multi-player game Quake III Arena Capture the Flag [37], and the teamwork game Dota 2 [38], to real-world applications such as robotics [39], autonomous vehicles [40], autonomous surgery [41], [42], natural language processing [43], biological data mining [44], and drug design [45]. An area that has recently attracted great attention from the DRL research community is the IoT and cyber security. For example, a DRL-based resource allocation framework that integrates networking, caching, and computing capabilities for smart city applications is proposed in [46]. A DRL algorithm, i.e., double dueling deep Q-network [47], [48], is used to solve this problem because it involves a large state space, which consists of the dynamically changing status of base stations, mobile edge caching (MEC) servers and caches. The framework is developed based on the programmable control principle of

SDN and the caching capability of information-centric networking. Alternatively, Zhu et al. [49] explored MEC caching policies by using the context awareness concept that represents the user's context information and traffic patterns. The use of AI technologies at the mobile network edges is advocated to intelligently exploit the operating environment and make the right decisions regarding what, where, and how to cache appropriate contents. To increase the caching performance, a DRL approach, i.e., the asynchronous advantage actor-critic algorithm [50], is used to find an optimal policy aiming to maximize the offloading traffic.

Findings from our current survey show that applications of DRL in cyber environments are generally categorized under two perspectives: optimizing and enhancing the communications and networking capabilities of IoT applications, e.g. [51]–[59], and defending against cyber attacks. This paper focuses on the latter, where DRL methods are used to solve cyber security problems in the presence of cyber attacks or threats. The next section provides a background of DRL methods, followed by a detailed survey of DRL applications in cyber security in Section 3. We group these applications into three major categories, including DRL-based security solutions for cyber-physical systems, autonomous intrusion detection techniques, and multi-agent DRL-based game theory for cyber security. Section 4 concludes the paper with extensive discussions and future research directions on DRL for cyber security.

II. DEEP REINFORCEMENT LEARNING PRELIMINARY

Different from the other popular branch of ML, i.e., supervised methods learning by examples, RL characterizes an agent by creating its own learning experiences through interacting directly with the environment. RL is described by the concepts of state, action, and reward (Fig. 1). It is a trial and error approach in which the agent takes an action at each time step that causes two changes: the current state of the environment is changed to a new state, and the agent receives a reward or penalty from the environment. Given a state, the reward is a function that can tell the agent how good or bad an action is. Based on received rewards, the agent learns to take more good actions and gradually filter out bad actions.

Fig. 1. Interactions between the agent and its environment in RL, characterized by state, action and reward. Based on the current state s and reward r, the agent will take an optimal action, leading to changes of states and rewards. The agent then receives the next state s' and reward r' from the environment to determine the next action, making an iterative process of agent-environment interactions.

A popular RL method is Q-learning, whose goal is to maximize the discounted cumulative reward based on the Bellman equation [60]:

Q(s_t, a_t) = E[r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... | s_t, a_t]    (1)

The discount factor γ ∈ [0, 1] manages the importance levels of future rewards. It is applied as a mathematical trick to analyze the learning convergence. In practice, discounting is necessary because of partial observability or uncertainty of the stochastic environment.

Q-learning needs to use a lookup table or Q-table to store expected rewards (Q-values) of actions given a set of states. This requires a large memory when the state and action spaces increase. Real-world problems often involve continuous state or action spaces, and therefore Q-learning is inefficient at solving these problems. Fortunately, deep learning has emerged as a powerful tool that is a great complement to traditional RL methods. With the power of function approximation and representation learning, deep learning can learn a compact low-dimensional representation of raw high-dimensional data [61]. The combination of deep learning and RL was the research direction that DeepMind initiated and pioneered. They proposed deep Q-network (DQN) with the use of a deep neural network (DNN) to enable Q-learning to deal with high-dimensional sensory inputs [29], [62].
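As a concrete illustration of Eq. (1) in its tabular form, the following minimal sketch (not taken from any of the surveyed works; the action set, hyperparameters, and environment interface are illustrative assumptions) shows the standard Q-learning update that DQN later generalizes with a deep neural network.

import random
from collections import defaultdict

# Minimal tabular Q-learning sketch illustrating Eq. (1).
# The environment interface that produces (state, reward, next_state) is a hypothetical placeholder.
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
actions = [0, 1, 2]                     # assumed discrete action set
Q = defaultdict(float)                  # Q-table: maps (state, action) to an expected return

def select_action(state):
    # epsilon-greedy exploration over the Q-table
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # One-step Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])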

Fig. 2. DQN architecture with the loss function described by L(β) = E[(r + γ max_{a'} Q(s', a'|β') − Q(s, a|β))^2], where β and β' are parameters of the estimation and target deep neural networks respectively. Each action taken by the agent will generate an experience, which consists of the current state s, action a, reward r and next state s'. These learning experiences (samples) are stored in the experience replay memory, which are then retrieved randomly for a stable learning process.

Using DNNs to approximate the Q-function, however, is unstable due to the correlations among the sequences of observations and the correlations between the Q-values Q(s, a) and the target values Q(s', a'). Mnih et al. [29] proposed the use of two novel techniques, i.e., experience replay memory and target network, to address this problem (Fig. 2). On the one hand, the experience memory stores an extensive list of learning experience tuples (s, a, r, s'), which are obtained from the agent's interactions with the environment. The agent's learning process retrieves these experiences randomly to avoid the correlations of consecutive experiences. On the other hand, the target network is technically a copy of the estimation network, but its parameters are frozen and only updated after a period. For instance, the target network is updated after 10,000 updates of the estimation network, as demonstrated in [29]. DQN has made a breakthrough as it is the first time an RL agent can provide human-level performance in playing 49 Atari games by using just the raw image pixels of the game board.
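The interplay of the experience replay memory and the target network can be sketched as follows. This is a simplified illustration in which a linear function approximator stands in for the DNN and all dimensions and hyperparameters are assumed values; it is not the original DQN implementation of [29].

import random
from collections import deque
import numpy as np

state_dim, n_actions, gamma = 4, 3, 0.99   # assumed problem sizes
W_est = np.zeros((n_actions, state_dim))   # estimation network (linear stand-in for a DNN)
W_tgt = W_est.copy()                       # target network: frozen copy of the estimation weights
replay = deque(maxlen=10000)               # experience replay memory of (s, a, r, s') tuples

def store(s, a, r, s_next):
    replay.append((s, a, r, s_next))

def train_step(lr=0.01, batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)      # random sampling breaks correlations of consecutive samples
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(W_tgt @ s_next)  # target computed with the frozen target network
        td_error = target - (W_est @ s)[a]
        W_est[a] += lr * td_error * s                # gradient step on the estimation network only

def sync_target():
    # Periodically copy the estimation weights into the target network (e.g., every 10,000 updates in [29])
    global W_tgt
    W_tgt = W_est.copy()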
As a value-based method, DQN takes a long training time and has limitations in solving problems with continuous action spaces. Value-based methods, in general, evaluate the goodness of an action given a state using the Q-value function. When the number of states or actions is large or infinite, they show inefficiency or even impracticality. Another type of RL, i.e., policy gradient methods, has solved this problem effectively. These methods aim to derive actions directly by learning a policy π(s, a) that is a distribution over all possible actions. REINFORCE [64], vanilla policy gradient [65], trust region policy optimization (TRPO) [66] and proximal policy optimization (PPO) [67] are notable policy gradient methods. The gradient estimation, however, often suffers from large fluctuations [68]. The combination of value-based and policy-based methods has been developed to aggregate the advantages and eradicate the disadvantages of these two methods. This kind of combination has constituted another type of RL, i.e., actor-critic methods. This structure comprises two components: an actor and a critic, which can both be characterized by DNNs. The actor attempts to learn a policy by receiving feedback from the critic. This iterative process helps the actor improve its strategy and converge to an optimal policy.

Fig. 3. The learning architecture of A3C, consisting of a global network and a number of worker agents. Each worker initially resets its parameters to those of the global network and interacts with its copy of the environment for learning. Gradients obtained from these individual learning processes will be used to update the global network asynchronously. This increases the learning speed and diversifies the experience learned by the global network as the experiences obtained by individual worker agents are independent.

Deep deterministic policy gradient (DDPG) [69], distributed distributional DDPG (D4PG) [70], asynchronous advantage actor-critic (A3C) [50] and unsupervised reinforcement and auxiliary learning (UNREAL) [71] are methods that utilize the actor-critic framework. An illustrative architecture of the popular algorithm A3C is presented in Fig. 3. A3C's structure consists of a hierarchy of a master learning agent (global) and individual learners (workers). Both the master agent and the individual learners are modeled by DNNs, each having two outputs: one for the critic and another for the actor. The first output is a scalar value representing the expected reward of a given state V(s), while the second output is a vector of values representing a probability distribution over all possible actions π(s, a). The value loss function of the critic is specified by:

L_1 = Σ (R − V(s))^2    (2)

where R = r + γV(s') is the discounted future reward. Also, the actor pursues minimization of the following policy loss function:

L_2 = −log(π(a|s)) · A(s) − ϑ H(π)    (3)

where A(s) = R − V(s) is the estimated advantage function, and H(π) is the entropy term, which handles the exploration capability of the agent, with the hyperparameter ϑ controlling the strength of the entropy regularization. The advantage function A(s) shows how advantageous the agent is when it is in a particular state. The learning process of A3C is asynchronous because each learner interacts with its separate environment and updates the master network independently. This process is iterated, and the master network is the one to use when the learning is finished.
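For concreteness, the two losses in Eqs. (2) and (3) can be computed for a single worker transition as in the sketch below; the probability vector, values, and coefficients are placeholder numbers, not tied to any particular A3C implementation.

import numpy as np

gamma, entropy_coef = 0.99, 0.01      # discount factor and entropy weight (hyperparameter ϑ in Eq. (3))

def a3c_losses(pi, action, reward, v_s, v_s_next):
    # pi: probability vector over actions from the actor head
    # v_s, v_s_next: critic values V(s) and V(s') from the critic head
    R = reward + gamma * v_s_next                 # discounted future reward R = r + gamma * V(s')
    advantage = R - v_s                           # A(s) = R - V(s)
    value_loss = (R - v_s) ** 2                   # one term of the sum in Eq. (2)
    entropy = -np.sum(pi * np.log(pi + 1e-8))     # H(pi), encourages exploration
    policy_loss = -np.log(pi[action] + 1e-8) * advantage - entropy_coef * entropy  # Eq. (3)
    return value_loss, policy_loss

# Example with made-up numbers:
v_loss, p_loss = a3c_losses(pi=np.array([0.2, 0.5, 0.3]), action=1,
                            reward=1.0, v_s=0.4, v_s_next=0.6)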

Table I summarizes comparable features of value-based, policy-based, and actor-critic methods, and their typical examples. The following section examines the venture of these DRL algorithms in the field of cyber security under three broad categories: cyber-physical systems, autonomous intrusion detection, and cyber multi-agent game theory.

III. DRL IN CYBER SECURITY: A SURVEY

A large number of applications of RL to various aspects of cyber security have been proposed in the literature, ranging from data privacy to critical infrastructure protection. However, the drawbacks of traditional RL have restricted its capability in solving complex and large-scale cyber security problems. The increasing number of connected IoT devices in recent years has led to a significant increase in the number of cyber attack instances as well as their complexity. The emergence of deep learning and its integration with RL have created a class of DRL methods that are able to detect and fight against the most recent and sophisticated types of cyber attacks. This section provides a comprehensive survey of state-of-the-art DRL-powered solutions for cyber security, ranging from defense methods for cyber-physical systems to autonomous intrusion detection approaches and game theory-based multi-agent solutions. The survey structure is presented in Fig. 4.

A. DRL-based Security Methods for Cyber-Physical Systems

Investigations of defense methods for cyber-physical systems (CPS) against cyber attacks have received considerable attention and interest from the cyber security research community. A CPS is a mechanism controlled by computer-based algorithms facilitated by internet integration.

TABLE I
SUMMARY OF FEATURES OF DRL TYPES AND THEIR NOTABLE METHODS

DRL types: Value-based | Policy-based | Actor-critic

Features:
- Value-based: computes the value of an action given a state, Q(s, a); no explicit policy is learned.
- Policy-based: no value function is needed; an explicit policy is constructed.
- Actor-critic: the actor produces the policy π(s, a); the critic evaluates the action by V(s).

Typical methods:
- Value-based: DQN [29], Double DQN [47], Dueling Q-network [48], Prioritized Experience Replay DQN [63]
- Policy-based: REINFORCE [64], Vanilla Policy Gradient [65], TRPO [66], PPO [67]
- Actor-critic: DDPG [69], D4PG [70], A3C [50], UNREAL [71]

Fig. 4. Different (sub)sections of the survey on DRL in cyber security.

Fig. 5. The dynamics of attack and defense in a cyber-physical system. The physical layer is often uncertain with disturbances w while cyber attack a directly affects the cyber layer where a defense strategy d needs to be implemented. The dynamics of attack-defense characterized by θ(t, a, d) is injected into the conventional physical system to develop a cyber-physical co-modelling as presented in Eq. (4).

This mechanism provides efficient management of distributed physical systems via the shared network. With the rapid development of the Internet and control technologies, CPSs have been used extensively in many areas including manufacturing [72], health monitoring [73], [74], smart grid [75]–[77], and transportation [78]. Being exposed widely to the Internet, these systems are increasingly vulnerable to cyber attacks [79]. In 2015, hackers attacked the control system of a steel mill in Germany by obtaining login credentials via the use of phishing emails. This attack caused a partial plant shutdown and resulted in damage of millions of dollars. Likewise, there was a costly cyber attack on a power grid in Ukraine in late December 2015 that disrupted the electricity supply to a few hundred thousand end consumers [80].

In an effort to study cyber attacks on CPS, Feng et al. [80] characterized the cyber state dynamics by a mathematical model:

ẋ(t) = f(t, x, u, w; θ(t, a, d)),   x(t_0) = x_0    (4)

where x, u and w represent the physical state, control inputs and disturbances correspondingly (see Fig. 5). In addition, θ(t, a, d) describes the cyber state at time t, with a and d referring to cyber attack and defense respectively. The CPS defense problem is then modeled as a two-player zero-sum game in which the utilities of the players sum to zero at each time step. The defender is represented by an actor-critic DRL algorithm. Simulation results demonstrate that the proposed method in [80] can learn an optimal strategy to timely and accurately defend the CPS from unknown cyber attacks.

Applications of CPS in critical safety domains such as autonomous automotive, chemical process, automatic pilot avionics, and smart grid require a certain correctness level. Akazaki et al. [81] proposed the use of DRL, i.e., the double DQN and A3C algorithms, to find falsified inputs (counterexamples) for CPS models. This allows for effective yet automatic detection of CPS defects. Due to the infinite state space of CPS models, conventional methods such as simulated annealing [82] and cross entropy [83] were found inefficient. Experimental results show the superiority of the DRL algorithms over those methods in terms of a smaller number of simulation runs. This leads to a more practical detection process for CPS models' defects despite the great complexity of CPS software and physical systems.

Autonomous vehicles (AVs) operating in future smart cities require a robust processing unit for intra-vehicle sensors, including camera, radar, roadside smart sensors, and inter-vehicle beaconing. Such reliance is vulnerable to cyber-physical attacks aiming to get control of AVs by manipulating the sensory data and affecting the reliability of the system, e.g., increasing accident risks or reducing the vehicular flow. Ferdowsi et al. [84] examined the scenarios where the attackers manage to interject faulty data into the AV's sensor readings while the AV (the defender) needs to deal with that problem to control the AV robustly. Specifically, the car following model [85] is considered, in which the focus is on autonomous control of a car that closely follows another car. The defender aims to learn the leading vehicle's speed based on sensor readings. The attacker's objective is to mislead the following vehicle into a deviation from the optimal safe spacing. The interactions between the attacker and defender are characterized by a game-theoretic problem. The interactive game structure and its DRL solution are diagrammed in Fig. 6. Instead of directly deriving a solution based on the mixed-strategy Nash equilibrium, the authors proposed the use of DRL to solve this dynamic game. Long short term memory (LSTM) [86] is used to approximate the Q-function for both the defending and attacking agents as it can capture the temporal dynamics of the environment.

Fig. 6. The architecture of the adversarial DRL algorithm for robust AV control. A deep neural network (DNN) consisting of a long short term memory (LSTM), a fully connected layer (FCL) and regression is used to learn long-term dependencies within large data sets, which contain the outcomes of the players' past interactions. The DNN can approximate the Q functions to find optimal actions for players, i.e., the AV (defender) and especially the attacker, who seeks to inject faulty data into AV sensor readings.
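To make the zero-sum structure of such attacker-defender formulations concrete, the toy sketch below shows per-step rewards that always sum to zero; the spacing-based utility is an illustrative assumption and not the exact utility function used in [80] or [84].

def zero_sum_rewards(spacing, safe_spacing):
    # Illustrative per-step utilities for a car-following attack/defense game.
    # The defender (AV controller) is penalized by the deviation from the optimal
    # safe spacing; the attacker's gain is the defender's loss, so the two rewards
    # always sum to zero (the zero-sum property assumed in the text).
    defender_reward = -abs(spacing - safe_spacing)
    attacker_reward = -defender_reward
    return defender_reward, attacker_reward

# Example: spoofed sensor readings that push the spacing away from the safe value
# hurt the defender and reward the attacker by the same amount.
print(zero_sum_rewards(spacing=12.0, safe_spacing=20.0))   # (-8.0, 8.0)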
Autonomous systems can be vulnerable to inefficiency from various sources such as noise in communication channels, sensor failures, errors in sensor measurement readings, packet errors, and especially cyber attacks. Deception attacks on autonomous systems are widespread; such an attack is initiated by an adversary whose effort is to inject noise into the communication channels between sensors and the command center. This kind of attack leads to corrupted information being sent to the command center and eventually degrades the system performance. Gupta and Yang [87] studied ways to increase the robustness of autonomous systems by allowing the system to learn using adversarial examples. The problem is formulated as a zero-sum game with the players being the command center (observer) and the adversary. The inverted pendulum problem from Roboschool [67] is used as a simulation environment. The TRPO algorithm is employed to design an observer that can reliably detect adversarial attacks in terms of measurement corruption and automatically mitigate their effects.

B. DRL-based Intrusion Detection Systems

To detect intrusions, security experts conventionally need to observe and examine audit data, e.g., application traces, network traffic flow, and user command data, to differentiate between normal and abnormal behaviors [88]. However, the volume of audit data surges rapidly when the network size is enlarged. This makes manual detection difficult or even impossible. An intrusion detection system (IDS) is a software or hardware platform installed on host computers or network equipment to detect and report to the administrator abnormal or malicious activities by analysing the audit data. An active IDS may be able to take appropriate actions immediately to reduce the impacts of malevolent activities. Depending on the types of audit data, IDSs are grouped into two categories: host-based or network-based IDS. A host-based IDS typically observes and analyses the host computer's log files or settings to discover anomalous behaviors. A network-based IDS relies on a sniffer to collect transmitted packets in the network and examines the traffic data for intrusion detection. Regardless of the IDS type, two common methods are used: signature-based and anomaly-based detection. Signature detection involves the storage of patterns of known attacks and comparing the characteristics of possible attacks to those in the database. Anomaly detection observes the normal behaviors of the system and alerts the administrator if any activities are found to deviate from normality, for instance, an unexpected increase of traffic rate, i.e., number of IP packets per second. Machine learning techniques, including unsupervised clustering and supervised classification methods, have been used widely to build adaptive IDSs [89]–[93]. These methods, e.g. neural networks [94], k-nearest neighbors [94], [95], support vector machine (SVM) [94], [96], [97] and recently deep learning [98], [99], however, normally rely on fixed features of existing cyber attacks, so they are deficient in detecting new or deformed attacks. The lack of prompt responses to dynamic intrusions also leads to ineffective solutions produced by unsupervised or supervised techniques. In this regard, RL methods have been demonstrated to be effective in various IDS applications [100]. The following subsections review the use of DRL methods in both host-based and network-based IDSs.

1) Host-based IDS: As the volume of audit data and the complexity of intrusion behaviors increase, adaptive intrusion detection models demonstrate limited effectiveness because they can only handle temporally isolated labeled or unlabeled data. In practice, many complex intrusions comprise temporal sequences of dynamic behaviors. Xu and Xie [101] proposed an RL-based IDS that can handle this problem. System call trace data are fed into a Markov reward process whose state value can be used to detect abnormal temporal behaviors of host processes. The intrusion detection problem is thus converted to a state value prediction task for the Markov chains. The linear temporal difference (TD) RL algorithm [102] is used as the state value prediction model, and its outcomes are compared with a predetermined threshold to distinguish normal traces and attack traces. Instead of using the errors between real values and estimated ones, the TD learning algorithm uses the differences between successive approximations to update the state value function. Experimental results obtained using system call trace data show the dominance of the proposed RL-based IDS in terms of higher accuracy and lower computational costs compared to SVM and other machine learning or data mining methods. The proposed method based on linear basis functions, however, has a shortcoming when sequential intrusion behaviors are highly nonlinear. Therefore, a kernel-based RL approach using least-squares TD [103] was suggested for intrusion detection in [104], [105]. Relying on the kernel methods, the generalization capability of TD RL is enhanced, especially in high-dimensional and nonlinear feature spaces. The kernel least-squares TD algorithm is, therefore, able to predict anomalies accurately, which contributes to improving the detection performance of the IDS, especially when dealing with multi-stage cyber attacks.
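The temporal-difference value prediction underlying this host-based approach can be sketched as follows; the feature encoding of system-call states, the threshold value, and the dimensions are illustrative assumptions rather than the exact design of [101].

import numpy as np

dim, alpha, gamma = 16, 0.05, 0.9
w = np.zeros(dim)                       # weights of the linear state-value function V(s) = w . phi(s)

def td0_update(phi_s, reward, phi_s_next):
    # TD(0) uses the difference between successive estimates rather than a true target value
    td_error = reward + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    w[:] += alpha * td_error * phi_s

def is_attack_trace(phi_sequence, threshold=-0.5):
    # A trace is flagged as intrusive when its predicted state values fall below a preset threshold
    values = [np.dot(w, phi) for phi in phi_sequence]
    return min(values) < threshold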
2) Network-based IDS: Deokar and Hazarnis [106] pointed out the drawbacks of both anomaly-based and signature-based detection methods. On the one hand, anomaly detection has a high false alarm rate because it may categorize activities which users rarely perform as anomalies. On the other hand, signature detection cannot discover new types of attacks as it uses a database of patterns of well-known attacks. The authors, therefore, proposed an IDS that can identify known and unknown attacks effectively by combining features of both anomaly and signature detection through the use of log files. The proposed IDS is based on a collaboration of RL methods, association rule learning, and log correlation techniques. RL gives a reward (or penalty) to the system when it selects log files that contain (or do not contain) anomalies or any signs of attack. This procedure enables the system to choose more appropriate log files when searching for traces of attack.

One of the most difficult challenges on the current Internet is dealing with the distributed denial-of-service (DDoS) threat, which is a DoS attack of a distributed nature, occurring with a large traffic volume and compromising a large number of hosts. Malialis and Kudenko [107], [108] initially introduced the multi-agent router throttling method based on the SARSA algorithm [109] to address DDoS attacks by learning multiple agents to rate-limit or throttle traffic towards a victim server. That method, however, has limited capability in terms of scalability. They therefore further proposed the coordinated team learning design, extending the original multi-agent router throttling based on the divide-and-conquer paradigm, to eliminate the mentioned drawback. The proposed approach integrates three mechanisms, namely task decomposition, hierarchical team-based communication, and team rewards, involving multiple defensive nodes across different locations to coordinately stop or reduce the flood of DDoS attacks. A network emulator developed based on the work of Yau et al. [110] is used to evaluate throttling approaches. Simulation results show that the resilience and adaptability of the proposed method are superior to its competing methods, i.e., baseline router throttling and additive-increase/multiplicative-decrease throttling algorithms [110], in various scenarios with different attack dynamics. The scalability of the proposed method is successfully experimented with up to 100 RL agents, which gives it great potential to be deployed in a large internet service provider network.

Alternatively, Bhosale et al. [111] proposed a multi-agent intelligent system [112] using RL and influence diagrams [113] to enable quick responses against complex attacks. Each agent learns its policy based on a local database and information received from other agents, i.e., decisions and events. Shamshirband et al. [114], on the other hand, introduced an intrusion detection and prevention system for wireless sensor networks (WSNs) based on a game theory approach and employed a fuzzy Q-learning algorithm [115], [116] to obtain optimal policies for the players. Sink nodes, a base station, and an attacker constitute a three-player game, where the sink nodes and base station are coordinated to derive a defense strategy against DDoS attacks, particularly in the application layer. The IDS detects future attacks based on the fuzzy Q-learning algorithm that takes part in two phases: detection and defense (Fig. 7). The game commences when the attacker sends an overwhelming volume of flooding packets beyond a specific threshold as a DDoS attack to a victim node in the WSN. Using the low-energy adaptive clustering hierarchy (LEACH), which is a prominent WSN protocol [117], the performance of the proposed method is evaluated and compared with that of existing soft computing methods. The results show the efficacy and viability of the proposed method in terms of detection accuracy, energy consumption and network lifetime.

Fig. 7. Two-phase intrusion detection and prevention system based on a game theory approach and fuzzy Q-learning. In Phase 1, the sink node uses fuzzy Q-learning to detect anomalies caused by the attacker to victim nodes. The malicious information is preprocessed and checked against a threshold by the sink node before passing to Phase 2, where the base station also employs fuzzy Q-learning to select optimal defense actions.
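To illustrate the on-policy learning rule behind the router throttling idea discussed above, the sketch below shows a single-router SARSA agent; the state, action, and reward design here is an illustrative assumption and not the exact formulation of [107], [108].

import random
from collections import defaultdict

# Toy single-router SARSA throttler in the spirit of multi-agent router throttling.
throttle_levels = [0.0, 0.2, 0.4, 0.6, 0.8]   # fraction of traffic dropped toward the victim (assumed action set)
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)

def choose(state):
    if random.random() < epsilon:
        return random.choice(throttle_levels)
    return max(throttle_levels, key=lambda a: Q[(state, a)])

def sarsa_update(state, action, reward, next_state, next_action):
    # On-policy SARSA update: the bootstrap uses the action actually taken next
    Q[(state, action)] += alpha * (reward + gamma * Q[(next_state, next_action)] - Q[(state, action)])

# Assumed reward design: keep the victim's load below capacity while preserving
# as much legitimate traffic as possible, e.g.
# reward = legitimate_traffic_kept - penalty * max(0, victim_load - capacity)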
C. DRL-based Game Theory for Cyber Security

Traditional cyber security methods such as firewalls, anti-virus software, or intrusion detection are normally passive, unilateral, and lagging behind dynamic attacks. Cyberspace involves various cyber components, and thus reliable cyber security requires the consideration of interactions among these components. Specifically, the security policy applied to a component has a certain impact on the decisions taken by other components. Therefore, the decision space increases considerably, with many what-if scenarios when the system is large. Game theory has been demonstrated to be effective in solving such large-scale problems because it can examine many scenarios to derive the best policy for each player [118]–[122]. The utility or payoff of a game player depends not only on its actions but also on the other players' activities. In other words, the efficacy of cyber defending strategies must take into account the attacker's strategies and other network users' behaviors. Game theory can model the conflict and cooperation between intelligent decision makers, resembling the activities of cyber security problems, which involve attackers and defenders. This resemblance has enabled game theory to mathematically describe and analyze the complex behaviors of multiple competitive mechanisms. In the following, we present game-theoretic models involving multiple DRL agents that characterize cyber security problems in different attacking scenarios, including jamming, spoofing, malware, and attacks in adversarial environments.

1) Jamming attacks: Jamming attacks can be considered a special case of DoS attacks, which are defined as any event that diminishes or eradicates a network's capacity to execute its expected function [123]–[125].
Jamming is a serious attack in networking and has attracted great interest from researchers, who have used machine learning and especially RL to address this problem, e.g., [126]–[132]. The recent development of deep learning has facilitated the use of DRL for jamming handling or mitigation. Xiao et al. [133] studied the security challenges of MEC systems and proposed an RL-based solution to provide secure offloading to the edge nodes against jamming attacks. MEC is a technique that allows cloud computing functions to take place at the edge nodes of a cellular network or, generally, of any network. This technology helps to decrease network traffic and reduce overhead and latency when users request access to contents that have been cached in the edges closer to the cellular customer. MEC systems, however, are vulnerable to cyber attacks because they are physically located closer to users and attackers, with less secure protocols compared to cloud servers or database centers. In [133], the RL methodology is used to select the defense levels and important parameters such as offloading rate and time, transmit channel and power. As the network state space is large, the authors proposed the use of DQN to handle high-dimensional data, as illustrated in Fig. 8. DQN uses a convolutional neural network (CNN) to approximate the Q-function, which requires high computational complexity and memory. To mitigate this disadvantage, a transfer learning method named the hotbooting technique is used. The hotbooting method helps to initialize the weights of the CNN more efficiently by using experiences that have been learned in similar circumstances. This reduces the learning time by avoiding random explorations at the start of each episode. Simulation results demonstrate that the proposed method is effective in terms of enhancing the security and user privacy of MEC systems, and it can protect the systems when confronting different types of smart attacks with low overhead.

Fig. 8. Secure offloading method in MEC based on DQN with the hotbooting technique. The DQN agent's actions are to find optimal parameters such as offloading rate, power, and channel for the mobile device to offload the traces to the edge node accordingly. The attackers may deploy jamming, spoofing, DoS, or smart attacks to disrupt this process. By interacting with the edge caching systems, the agent can evaluate the reward of the previous action and obtain the new state, enabling it to select the next optimal action.

On the other hand, Aref et al. [134] introduced a multi-agent RL method to deal with anti-jamming communications in wideband autonomous cognitive radios (WACRs). WACRs are advanced radios that can sense the states of the radio frequency spectrum and network, and autonomously optimize their operating mode corresponding to the perceived states. Cognitive communication protocols, however, may struggle when there are unintentional interferences or malicious users who attempt to interrupt reliable communications by deliberate jamming. Each radio's effort is to occupy the available common wideband spectrum as much as possible and to avoid the sweeping signal of a jammer that affects the entire spectrum band. The multi-agent RL approach proposed in [134] learns an optimal policy for each radio to select an appropriate sub-band, aiming to avoid jamming signals and interruptions from other radios. Comparative studies show the significant dominance of the proposed method over a random policy. A drawback of the proposed method is the assumption that the jammer uses a fixed strategy in responding to the WACRs' strategies, although the jammer may be able to perform adaptive jamming with cognitive radio technology. In [135], when the current spectrum sub-band is interfered with by a jammer, Q-learning is used to optimally select a new sub-band that allows uninterrupted transmission for as long as possible. The reward structure of the Q-learning agent is defined as the amount of time that the jammer or interferer takes to interfere with the WACR transmission. Experimental results using a hardware-in-the-loop prototype simulation show that the agent can detect the jamming patterns and successfully learn an optimal sub-band selection policy for jamming avoidance. The obvious drawback of this method is the use of a Q-table with a limited number of environment states.
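A minimal version of the sub-band selection idea in [135] can be sketched as follows; the observation encoding and the number of sub-bands are assumptions for illustration, while the reward follows the cited idea of measuring how long the chosen sub-band remains uninterrupted.

import random
from collections import defaultdict

n_subbands = 8
alpha, gamma, epsilon = 0.2, 0.9, 0.1
Q = defaultdict(float)     # Q[(state, subband)]: expected interference-free transmission time

def pick_subband(state):
    # state: an assumed summary of recent spectrum observations (e.g., the last jammed sub-band)
    if random.random() < epsilon:
        return random.randrange(n_subbands)
    return max(range(n_subbands), key=lambda b: Q[(state, b)])

def update(state, subband, uninterrupted_time, next_state):
    # Reward is the time the jammer/interferer needed to hit the chosen sub-band,
    # so sub-bands that stay clean longer accumulate higher Q-values.
    best_next = max(Q[(next_state, b)] for b in range(n_subbands))
    Q[(state, subband)] += alpha * (uninterrupted_time + gamma * best_next - Q[(state, subband)])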
The access right to spectrum (or, more generally, resources) is the main difference between cognitive radio networks (CRNs) and traditional wireless technologies. RL in general, or Q-learning in particular, has been investigated to produce optimal policies for cognitive radio nodes to interact with their radio-frequency environment [136]. Attar et al. [137] examined RL solutions against attacks on both CRN architectures, i.e., infrastructure-based, e.g., the IEEE 802.22 standard, and infrastructure-less, e.g., ad hoc CRN. The adversaries may attempt to manipulate the spectrum sensing process and cause the main sources of security threats in infrastructure-less CRNs. The external adversary node is not part of the CRN, but such attackers can affect the operation of an ad hoc CRN via jamming attacks. In an infrastructure-based CRN, an exogenous attacker can mount incumbent emulation or perform sensor-jamming attacks. The attacker can increase the local false-alarm probability to affect the decision of the IEEE 802.22 base station about the availability of a given band. A jamming attack can have both short-term and long-term effects. Wang et al. [138] developed a game-theoretic framework to battle against jamming in CRNs, where each radio observes the status and quality of available channels and the strategy of jammers to make decisions accordingly. The CRN can learn an optimal channel utilization strategy using the minimax-Q learning policy [139], solving the problems of how many channels to use for data and control packets along with the channel switching strategy. The performance of minimax-Q learning, represented via spectrum-efficient throughput, is superior to the myopic learning method, which gives high priority to the immediate payoff and ignores the environment dynamics as well as the attackers' cognitive capability.

In CRNs, secondary users (SUs) are obliged to avoid disruptions to the communications of primary users (PUs) and can only gain access to the licensed spectrum when it is not occupied by PUs. Jamming attacks are emergent in CRNs due to the opportunistic access of SUs as well as the appearance of smart jammers, which can detect the transmission strategy of SUs.
Xiao et al. [140] studied the scenarios where a smart jammer aims to disrupt the SUs rather than the PUs. The SUs and the jammer, therefore, must sense the channel to check the presence of PUs before making their decisions. The constructed scenarios consist of a secondary source node supported by relay nodes to transmit data packets to secondary receiving nodes. The smart jammer can quickly learn the frequency and transmission power of SUs, while SUs do not have full knowledge of the underlying dynamic environment. The interactions between SUs and the jammer are modeled as a cooperative transmission power control game, and the optimal strategy for SUs is derived based on the Stackelberg equilibrium [141]. The aim of the SU players is to select appropriate transmission powers to efficiently send data messages in the presence of jamming attacks. The jammer's utility gain is the SUs' loss and vice versa. RL methods, i.e., Q-learning [60] and WoLF-PHC [142], are used to model SUs as intelligent agents for coping with the smart jammer. WoLF-PHC stands for the combination of the Win or Learn Fast algorithm and the policy hill-climbing method. It uses a varying learning rate to foster convergence to the game equilibrium by adjusting the learning speed [142]. Simulation results show the improvement in the anti-jamming performance of the proposed method in terms of the signal to interference plus noise ratio (SINR). The optimal strategy achieved from the Stackelberg game can minimize the damage created by the jammer in the worst-case scenario.

Recently, Han et al. [143] introduced an anti-jamming system for CRNs using the DQN algorithm based on a frequency-spatial anti-jamming communication game.
The game simu- of the proposed approaches is that both action and state spaces lates an environment of numerous jammers that inject jamming are quantized into discrete levels, bounded within a specified signals to disturb the ongoing transmissions of SUs. The SU interval, which may lead to locally optimal solutions. should not interfere with the communications of PUs and 3) Malware attacks: One of the most challenging malwares must defeat smart jammers. This communication system is of mobile devices is the zero-day attacks, which exploit two-dimensional that utilizes both frequency hopping and user publicly unknown security vulnerabilities, and until they are mobility. The RL state is the radio environment consisting of contained or mitigated, hackers might have already caused PUs, SUs, jammers, and serving base stations/access points. adverse effects on computer programs, data or networks [150], The DQN is used to derive an optimal frequency hopping [151]. To avoid such attacks, the traces or log data produced policy that determines whether the SU should leave an area by the applications need to be processed in real time. With of heavy jamming or choose a channel to send signals. limited computational power, battery life and radio bandwidth, Experimental results show the superiority of the DQN-based mobile devices often offload specific malware detection tasks method against the Q-learning based strategy in terms of faster to security servers at the cloud for processing. The security convergence rate, increased SINR, lower cost of defense, and server with powerful computational resources and more up- improved utility of the SU. DQN with the core component dated malware database can process the tasks quicker, more CNN helps to speed the learning process of the system, which accurately, and then send a detection report back to mobile has a large number of frequency channels, compared with the devices with less delay. The offloading process is, therefore, benchmark Q-learning method. the key factor affecting the cloud-based malware detection To improve the work of Han et al. [143], Liu et al. performance. For example, if too many tasks are offloaded to [144] also proposed an anti-jamming communication system the cloud server, there would be radio network congestion that using a DRL method but having different and more ex- can lead to long detection delay. Wan et al. [152] enhanced the tensive contributions. Specifically, Liu et al. [144] used the mobile offloading performance by improving the previously raw spectrum information with temporal features, known as proposed game model in [153]. The Q-learning approach used spectrum waterfall [145] to characterize the environment state in [153] to select optimal offloading rate suffers the curse rather than using the SINR and PU occupancy as in [143]. of high-dimensionality when the network size is increased, 9

server, the number of preserved nodes, migration costs, and the validity of the actions taken. That study considered the scenarios where attackers can penetrate the learning process of RL defender by flipping reward signs or manipulating states. These causative attacks poison the defender’s training process and cause it to perform sub-optimal actions. The adversarial training approach is applied to reduce the impact of poisoning attacks with its eminent performance demonstrated via several experiments using the popular network emulator Mininet [160]. In an adversarial environment, the defender may not know the private details of the attacker such as the type of at- Fig. 9. Cloud-based malware detection using DQN where the stochastic tack, attacking target, frequency, and location. Therefore, the gradient descent (SGD) method is used to update weights of the CNN. defender, for example, may allocate substantial resources to Malicious detection is performed in the cloud server with more powerful protect an asset that is not a target of the attacker. The computational resources than mobile devices. The DQN agent helps to select optimal task offloading rates for mobile devices to avoid network congestion defender needs to dynamically reconfigure defense strategies and detection delay. By observing the network status and evaluating utility to increase the complexity and cost for the intruder. Zhu et based on malware detection report from the server, the agent can formulate al. [161] introduced a model where the defender and attacker states and rewards, which are used to generate a sequence of optimal actions, i.e., dynamic offloading rates. can repeatedly change the defense and attack strategies. The defender has no prior knowledge about the attacker, such as launched attacks and attacking policies. However, it is or a large number of feasible offloading rates is available aware of the attacker classes and can access to the system for selection. Wan et al. [152] thus advocated the use of utilities, which are jointly contributed by the defense and hotbooting Q-learning and DQN, and showed the performance attack activities. Two interactive RL methods are proposed improvement in terms of malware detection accuracy and for cyber defenses in [161], namely adaptive RL and robust speed compared to the standard Q-learning. The cloud-based RL. The adaptive RL handles attacks that have a diminishing malware detection approach using DQN for selecting the exploration rate (non-persistent attacker) while the robust RL optimal offloading rate is illustrated in Fig. 9. deals with intruders who have a constant exploration rate 4) Attacks in adversarial environments: Traditional net- (persistent attacker). The interactions between defender and works facilitate the direct communications between client ap- attacker are illustrated via the attack and defense cycles as plication and server where each network has its switch control in Fig. 10. The attackers and defenders do not take actions that makes the network reconfiguration task time-consuming simultaneously but asynchronously. On the attack cycle, the and inefficient. This method is also disadvantageous because attacker evaluates previous attacks before launching a new the requested data may need to be retrieved from more attack if necessary. On the defense cycle, after receiving than one database involving multiple servers. 
Software-defined an alert, the defender carries out a meta-analysis on the network is a next-generation networking technology as it can latest attacks and calculates the corresponding utility before reconfigure the network adaptively. With the control being deploying a new defense if needed. An advantage of this programmable with a global view of the network architecture, system model is that it does not assume any underlying model SDN can manage and optimize network resources effectively. for the attacker but instead treats attack strategies as black RL has been demonstrated broadly in the literature as a robust boxes. method for SDN controlling, e.g., [154]–[158]. Although RL’s success in SDN controlling is abundant, the attacker may be able to falsify the defender’s training process if it is aware of the network control algorithm in an adversarial environment. To deal with this problem, Han et al. [159] proposed the use of adversarial RL to build an autonomous defense system for SDN. The attacker selects important nodes in the network to compromise, for example, nodes in the backbone network or the target subnet. By propagating through the network, the attacker attempts to eventually compromise Fig. 10. The defender and attacker interact via the intrusion detection system the critical server while the defender prevents the server from (IDS) in an adversarial environment, involving defense and attack cycles. being compromised and preserve as many unaffected nodes Using these two cycles, a defender and an attacker can repeatedly change their defense and attack strategies. This model can be used to study defense as possible. To achieve those goals, the RL defender takes strategies for different classes of attacks such as buffer over-read attacks [162] four possible actions, consisting of “isolating”, “patching”, and code reuse attacks [163]. “reconnecting” and “migrating”. Two types of DRL agents are trained to model defenders, i.e., double DQN and A3C, Alternatively, Elderman et al. [164] simulated cyber security to select appropriate actions given different network states. problems in networking as a stochastic Markov game with The reward is characterized based on the status of the critical two agents, one attacker, and one defender, with incomplete 10 information and partial observability. The attacker does not integration of deep learning and RL methods has just been know the network topology but attempts to reach and get sustained very recently, i.e., in the last few years, which leaves access to the location that contains a valuable asset. The researchers in the cyber intrusion detection area some time lag defender knows the internal network but does not see the attack to catch up with. The complexity and dynamics of intrusion types or position of intruders. This is a challenging cyber detection can be addressed efficiently by representation learn- security game because a player needs to adapt its strategy to ing and function approximation capabilities of deep learning defeat the unobservable opponents [165]. Different algorithms, and optimal sequential decision making the capability of RL e.g., Monte Carlo learning, Q-learning, and neural networks collectively. There is thus a gap for future research where are used to learn both defender and attacker. Simulation results DRL’s capabilities can be exploited fully to solve complex show that Monte Carlo learning with the softmax exploration and sophisticated cyber intrusion detection problems. 
is the best method for learning both attacking and defending Most DRL algorithms used for cyber defense so far are strategies. Neural network algorithms have a limited adversar- model-free methods, which are sample inefficient as they ial learning capability, and thus they are outperformed by Q- require a large quantity of training data. These data are difficult learning and Monte Carlo learning techniques. This simulation to obtain in real cyber security practice. Researchers generally has a disadvantage that simplifies the real-world cyber security utilize simulators to validate their proposed approaches, but problem into a game of only two players with only one asset. these simulators often do not characterize the complexity and In real practice, there can be multiple hackers simultaneously dynamics of real cyber space of the IoT systems fully. Model- penetrating a server that holds valuable data. Also, a network based DRL methods are more appropriate than model-free may contain useful data in different locations instead of in a methods when training data are limitedly available because, single location as simulated. with model-based DRL, it can be easy to collect data in a scalable way. Exploration of model-based DRL methods or the integration of model-based and model-free methods for IV. DISCUSSIONSAND FUTURE RESEARCH DIRECTIONS cyber defense is thus an interesting future study. For example, DRL has emerged over recent years as one of the most function approximators can be used to learn a proxy model of successful methods of designing and creating human or even the actual high-dimensional and possibly partial observable superhuman AI agents. Much of these successes have relied on environment [168]–[170], which can be then employed to the incorporation of DNNs into a framework of traditional RL deploy planning algorithms, e.g., Monte-Carlo tree search to address complex and high-dimensional sequential decision- techniques [171], to derive optimal actions. Alternatively, making problems. Applications of DRL algorithms, therefore, model-based and model-free combination approaches, such have been found in various fields, including IoT and cyber as model-free policy with planning capabilities [172], [173] security. Computers and the Internet today play crucial roles or model-based lookahead search [31], can be used as they in many areas of our lives, e.g., entertainment, communication, aggregate advantages of both methods. On the other hand, transportation, medicine, and even shopping. Lots of our current literature on applications of DRL to cyber security personal information and important data are stored online. often limits at discretizing the action space, which restricts the Even financial institutions, e.g., banks, mortgage companies, full capability of the DRL solutions to real-world problems. and brokerage firms, run their business online. Therefore, An example is the application of DRL for selecting optimal it is essential to have a security plan in place to prevent mobile offloading rates in [152], [153] where the action space hackers from accessing our computer systems. This paper has has been discretized although a small change of the rate would presented a comprehensive survey of DRL methods and their primarily affect the performance of the cloud-based malware applications to cyber security problems, with notable examples detection system. Investigation of methods that can deal with summarized in Table II. 
On the other hand, the current literature on applications of DRL to cyber security is often limited to discretized action spaces, which restricts the full capability of DRL solutions to real-world problems. An example is the application of DRL for selecting optimal mobile offloading rates in [152], [153], where the action space has been discretized even though a small change of the rate directly affects the performance of the cloud-based malware detection system. Investigation of methods that can deal with continuous action spaces in cyber environments, e.g., policy gradient and actor-critic algorithms, is another encouraging research direction.
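As one hedged illustration of handling a continuous action space, the sketch below implements a simple Gaussian policy-gradient (REINFORCE-style) agent that outputs an offloading rate in [0, 1] instead of a discretized level; the feature vector and advantage signal are assumed placeholders and do not reproduce the methods of [152], [153].

import numpy as np

class GaussianPolicy:
    def __init__(self, n_features, lr=0.01, sigma=0.1):
        self.w = np.zeros(n_features)   # parameters of the action mean
        self.lr, self.sigma = lr, sigma

    def act(self, features):
        mean = 1.0 / (1.0 + np.exp(-features @ self.w))   # squash into (0, 1)
        rate = np.clip(np.random.normal(mean, self.sigma), 0.0, 1.0)
        return rate, mean

    def update(self, features, rate, mean, advantage):
        # Likelihood-ratio gradient of log N(rate; mean, sigma^2) w.r.t. w,
        # scaled by an advantage estimate (e.g., reward minus a baseline).
        dlog_dmean = (rate - mean) / (self.sigma ** 2)
        dmean_dw = mean * (1.0 - mean) * features          # sigmoid derivative
        self.w += self.lr * advantage * dlog_dmean * dmean_dw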

TABLE II: SUMMARY OF TYPICAL DRL APPLICATIONS IN CYBER SECURITY

Applications: Virtualized smart city network resource allocation [46]
Goals/Objectives: Optimally assign virtualized resources to a specific user.
Algorithms: Double dueling DQN.
States: Status of the base station, MEC server, and content cache.
Actions: Decide which base station is assigned to the user and whether the requested content should be cached in the base station. Also, determine whether the computation task should be offloaded to the MEC server.
Rewards: Mobile virtual network operator's revenue computed based on the received signal-to-noise ratio of the wireless access link, the computation ability, and the cache status.

Applications: Mobile edge caching [49]
Goals/Objectives: Maximize offloading traffic.
Algorithms: A3C.
States: Raw signals such as user requests, context information, and network conditions.
Actions: Whether to cache the currently requested content. If yes, determine which local content will be replaced.
Rewards: The offloading traffic in each request, i.e., the number of contents responded by the edge node.

Applications: Robustness-guided falsification of CPS [81]
Goals/Objectives: Find falsifying inputs (counterexamples) for CPS.
Algorithms: Double DQN and A3C.
States: Defined as the output of the system.
Actions: Choose the next input value from a set of piecewise-constant input signals.
Rewards: Characterized by a function of past-dependent life-long property, output signal and time.

Applications: Security and safety in autonomous vehicle systems [84]
Goals/Objectives: Maximize the robustness of AV dynamics control to cyber-physical attacks that inject faulty data to sensor readings.
Algorithms: Q-learning with LSTM.
States: AV's own position and speed along with distance and speed of some nearby objects, e.g., the leading AV.
Actions: Take appropriate speeds to maintain safe spacing between AVs.
Rewards: Using a utility function that takes into account the deviation from the optimal safe spacing.

Applications: Increasing robustness of the autonomous system against adversarial attacks [87]
Goals/Objectives: Devise filtering schemes to detect corrupted measurements (deception attack) and mitigate the effects of adversarial errors.
Algorithms: TRPO.
States: Characterized by sensor measurements and actuation noises.
Actions: Determine which estimation rule to use to generate an estimated state from a corrupted state.
Rewards: Defined via a function that takes state features as inputs.

Applications: Secure offloading in mobile edge caching [133]
Goals/Objectives: Learn a policy for a mobile device to securely offload data to edge nodes against jamming and smart attacks.
Algorithms: DQN with hotbooting transfer learning technique.
States: Represented via a combination of user density, battery levels, jamming strength, and radio channel bandwidth.
Actions: Agent's actions include choosing an edge node, selecting offloading rate and time, transmit power and channel.
Rewards: Computed based on secrecy capacity, energy consumption, and communication efficiency.

Applications: Anti-jamming communication scheme for CRN [143]
Goals/Objectives: Derive an optimal frequency hopping policy for CRN SUs to defeat smart jammers based on a frequency-spatial anti-jamming game.
Algorithms: DQN that employs CNN.
States: Consist of presence status of PUs and SINR information at time t − 1 received from serving base station or access point.
Actions: SUs take action to leave a geographical area of heavy jamming obstructed by smart jammers or choose a frequency channel to send signals.
Rewards: Represented via a utility function based on SINR and transmission cost.
Applications: Anti-jamming communication method [144], improving the previous work in [143]
Goals/Objectives: Propose a smart anti-jamming scheme similar to [143] with two main differences: spectrum waterfall is used as the state, and jammers can have different channel-slot transmission structure with users.
Algorithms: DQN with recursive CNN due to the recursion characteristic of spectrum waterfall.
States: Using temporal and spectral information, i.e., spectrum waterfall containing both frequency and time domain information of the network environment.
Actions: Agent's action is to select a discretized transmission frequency from a predefined set.
Rewards: Defined by a function involving SINR-based transmission rate and cost for frequency switching.

Applications: Spoofing detection in wireless networks [147], [148]
Goals/Objectives: Select the optimal authentication threshold.
Algorithms: Q-learning and Dyna-Q.
States: Include false alarm rate and missed detection rate of the spoofing detection system at time t − 1.
Actions: Action set includes the choices of different discrete levels of the authentication thresholds bounded within a specified interval.
Rewards: Using a utility function calculated based on the Bayesian risk, which is the expected payoff in spoofing detection.

Applications: Mobile offloading for cloud-based malware detection [152], improving the previous work in [153]
Goals/Objectives: Improve malware detection accuracy and speed.
Algorithms: Hotbooting Q-learning and DQN.
States: Consist of current radio bandwidth and previous offloading rates of other devices.
Actions: Select optimal offloading rate level for each mobile device.
Rewards: Represented by a utility function calculated based on the detection accuracy, response speed, and transmission cost.

Applications: Autonomous defense in SDN [159]
Goals/Objectives: Tackle the poisoning attacks that manipulate states or flip reward signals during the training process of RL-based defense agents.
Algorithms: Double DQN and A3C.
States: Represented by an array of zeros and ones showing the state of the network (whether a node is compromised or a link is switched on/off). Array length is equal to the number of nodes plus the number of links.
Actions: Attackers learn to select a node to compromise while a defender can take four actions: isolate, patch, reconnect and migrate to protect the server and preserve as many nodes as possible.
Rewards: Modelled based on the status of the critical server, number of preserved nodes, migration cost and the validity of actions taken.

Applications: Secure mobile crowdsensing (MCS) system [184]
Goals/Objectives: Optimize payment policy to improve the sensing performance against faked sensing attacks by formulating a Stackelberg game.
Algorithms: DQN.
States: Consist of the previous sensing quality and the payment policy.
Actions: Select the server's optimal payment vector to smartphone users.
Rewards: Using a utility function that involves the total payment to users and the benefit of the server from sensing reports of different accuracy levels.

Applications: Automated URL-based phishing detection [189]
Goals/Objectives: Detect malicious websites (URLs).
Algorithms: DQN.
States: Characterized by the vector space representation of website features such as HTTPS protocols, having IP address, prefix or suffix in URLs.
Actions: Select either 0 or 1, corresponding to a benign or phishing URL.
Rewards: Based on the classification action, the reward equates to 1 or -1 if the URL is classified correctly or incorrectly.
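As a toy illustration of the last entry in Table II (phishing URL detection), the sketch below treats each URL feature vector as a one-step episode with actions 0/1 and reward +1/-1. With one-step episodes this reduces to a contextual bandit with a linear value function, so it is only a simplified stand-in for the DQN used in [189]; the features and labels are assumed placeholders.

import numpy as np

def train_url_agent(features, labels, lr=0.01, eps=0.1, epochs=5):
    n_features = features.shape[1]
    w = np.zeros((2, n_features))                  # linear Q(s, a) per action
    for _ in range(epochs):
        for x, y in zip(features, labels):
            q = w @ x
            a = np.random.randint(2) if np.random.rand() < eps else int(q.argmax())
            r = 1.0 if a == y else -1.0            # reward scheme from the table entry
            w[a] += lr * (r - q[a]) * x            # one-step TD / bandit update
    return w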
AI can help defend against cyber attacks but can also facilitate dangerous attacks, i.e., offensive AI. Hackers can take advantage of AI to make attacks smarter and more sophisticated so as to bypass detection methods and penetrate computer systems or networks. For example, hackers may employ algorithms to observe the normal behaviors of users and use these patterns to develop untraceable attacking strategies. Machine learning-based systems can mimic humans to craft convincing fake messages that are utilized to conduct large-scale phishing attacks. Likewise, by creating highly realistic fake video or audio messages based on AI advances (i.e., deepfakes [174]), hackers can spread false news in elections or manipulate financial markets [175]. Alternatively, attackers can poison the data pool used for training deep learning methods (i.e., machine learning poisoning), or they can manipulate the states or policies and falsify part of the reward signals in RL to trick the agent into taking sub-optimal actions, resulting in the agent being compromised [176]. These kinds of attacks are difficult to prevent, detect, and fight against as they are part of a battle between AI systems. Adversarial machine learning, especially for supervised methods, has been used extensively in cyber security [177], but very few studies have been found on using adversarial RL [178]. Adversarial DRL, or DRL algorithms trained in various adversarial cyber environments, is worth comprehensive investigation as it can be a solution to battle against the increasingly complex offensive AI systems [179]–[181].
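To make the reward-falsification threat concrete, the following toy sketch flips the sign of a fraction of the rewards observed by a tabular Q-learning defender during training, which biases its value estimates toward sub-optimal actions. The environment interface and parameters are illustrative assumptions, not taken from any cited work.

import numpy as np

def train_with_poisoning(env, flip_prob=0.2, episodes=500,
                         alpha=0.1, gamma=0.9, eps=0.1):
    q = np.zeros((env.n_states, env.n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = np.random.randint(env.n_actions) if np.random.rand() < eps \
                else int(q[s].argmax())
            s_next, r, done = env.step(a)
            if np.random.rand() < flip_prob:
                r = -r                      # poisoned reward seen by the learner
            q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
            s = s_next
    return q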

With the support of AI systems, cyber security experts no longer need to examine a huge volume of attack data manually to detect and defend against cyber attacks. This has many advantages because the security teams alone cannot sustain the volume. AI-enabled defense strategies can be automated and deployed rapidly and efficiently, but these systems alone cannot issue creative responses when new threats are introduced. Moreover, human adversaries are always behind cybercrime and cyber warfare. Therefore, there is a critical need for human intellect teamed with machines for cyber defense. The traditional human-in-the-loop model for human-machine integration struggles to adapt quickly to cyber defense because the autonomous agent carries out only part of the task and needs to halt and wait for the human's response before completing it. The modern human-on-the-loop model would be a solution for a future human-machine teaming cyber security system. This model allows agents to autonomously perform the task whilst humans monitor and intervene in the operations of agents only when necessary. How to integrate human knowledge into DRL algorithms [182] under the human-on-the-loop model for cyber defense is an interesting research question.
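A minimal sketch of the human-on-the-loop pattern is given below, assuming a policy, a risk-scoring function, and a human-override hook that are all hypothetical placeholders: the agent acts autonomously and only escalates actions whose estimated risk exceeds a threshold.

def on_the_loop_step(env, state, policy, risk_score, ask_human, threshold=0.8):
    action = policy(state)                       # autonomous decision
    if risk_score(state, action) > threshold:    # only escalate risky actions
        override = ask_human(state, action)      # analyst may return None
        if override is not None:
            action = override
    return env.step(action)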
As hackers utilize more and more sophisticated and large-scale approaches to attack computer systems and networks, the defense strategies need to be more intelligent and large-scale as well. Multi-agent DRL is a research direction that can be explored to tackle this problem. The game theory models for cyber security reviewed in this paper have involved multiple agents, but they are restricted to a couple of attackers and defenders with limited communication, cooperation, and coordination among the agents. These aspects of multi-agent DRL need to be investigated thoroughly in cyber security problems to enable an effective large-scale defense plan. Challenges of multi-agent DRL itself then need to be addressed, such as non-stationarity, partial observability, and efficient multi-agent training schemes [183]. On the other hand, the RL methodology has been applied to deal with various cyber attacks, e.g., jamming, spoofing, false data injection, malware, DoS, DDoS, brute force, Heartbleed, botnet, web attacks, and infiltration attacks [184]–[189]. However, recently emerged or new types of attacks have been largely unaddressed. One of these new types is the bit-and-piece DDoS attack. This attack injects small amounts of junk into the legitimate traffic of a large number of IP addresses so that it can bypass many detection methods, as there is so little of it per address. Another emerging attack, for instance, is attacking from the computing cloud to breach the systems of companies that manage IT systems for other firms or host other firms' data on their servers. Alternatively, hackers can use quantum physics-based powerful computers to crack encryption algorithms that are currently used to protect various types of invaluable data [175]. Consequently, a future study on addressing these new types of attacks is encouraged.
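As a simple starting point for the multi-agent setting discussed above, the sketch below runs independent tabular Q-learners for several defenders that share one environment; it deliberately ignores the non-stationarity issue, which is exactly one of the open challenges noted. The environment interface is an illustrative assumption.

import numpy as np

def independent_q_step(q_tables, states, env, alpha=0.1, gamma=0.9, eps=0.1):
    actions = []
    for i, q in enumerate(q_tables):
        a = np.random.randint(q.shape[1]) if np.random.rand() < eps \
            else int(q[states[i]].argmax())
        actions.append(a)
    next_states, rewards, done = env.step(actions)   # joint transition
    for i, q in enumerate(q_tables):
        target = rewards[i] + gamma * q[next_states[i]].max() * (not done)
        q[states[i], actions[i]] += alpha * (target - q[states[i], actions[i]])
    return next_states, done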
REFERENCES

[1] Kakalou, I., Psannis, K. E., Krawiec, P., and Badea, R. (2017). Cognitive radio network and chaining toward 5G: challenges and requirements. IEEE Communications Magazine, 55(11), 145-151.
[2] Huang, Y., Li, S., Li, C., Hou, Y. T., and Lou, W. (2020). A deep reinforcement learning-based approach to dynamic eMBB/URLLC multiplexing in 5G NR. IEEE Internet of Things Journal, 7(7), 6439-6456.
[3] Wang, P., Yang, L. T., Nie, X., Ren, Z., Li, J., and Kuang, L. (2020). Data-driven software defined network attack detection: state-of-the-art and perspectives. Information Sciences, 513, 65-83.
[4] Botta, A., De Donato, W., Persico, V., and Pescapé, A. (2016). Integration of cloud computing and Internet of Things: a survey. Future Generation Computer Systems, 56, 684-700.
[5] Krestinskaya, O., James, A. P., and Chua, L. O. (2020). Neuromemristive circuits for edge computing: a review. IEEE Transactions on Neural Networks and Learning Systems, 31(1), 4-23.
[6] Abbas, N., Zhang, Y., Taherkordi, A., and Skeie, T. (2018). Mobile edge computing: a survey. IEEE Internet of Things Journal, 5(1), 450-465.
[7] Dastjerdi, A. V., and Buyya, R. (2016). Fog computing: Helping the Internet of Things realize its potential. Computer, 49(8), 112-116.
[8] Geluvaraj, B., Satwik, P. M., and Kumar, T. A. (2019). The future of cybersecurity: major role of artificial intelligence, machine learning, and deep learning in cyberspace. In International Conference on Computer Networks and Communication Technologies (pp. 739-747).
[9] Buczak, A. L., and Guven, E. (2016). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys and Tutorials, 18(2), 1153-1176.
[10] Apruzzese, G., Colajanni, M., Ferretti, L., Guido, A., and Marchetti, M. (2018, May). On the effectiveness of machine and deep learning for cyber security. In 2018 International Conference on Cyber Conflict (CyCon) (pp. 371-390). IEEE.
[11] Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M., Hou, H., and Wang, C. (2018). Machine learning and deep learning methods for cybersecurity. IEEE Access, 6, 35365-35381.
[12] Milosevic, N., Dehghantanha, A., and Choo, K. K. R. (2017). Machine learning aided Android malware classification. Computers and Electrical Engineering, 61, 266-274.
[13] Mohammed Harun Babu, R., Vinayakumar, R., and Soman, K. P. (2018). A short review on applications of deep learning for cyber security. arXiv preprint arXiv:1812.06292.
[14] Berman, D. S., Buczak, A. L., Chavis, J. S., and Corbett, C. L. (2019). A survey of deep learning methods for cyber security. Information, 10(4), 122.
[15] Paul, S., Ni, Z., and Mu, C. (2020). A learning-based solution for an adversarial repeated game in cyber-physical power systems. IEEE Transactions on Neural Networks and Learning Systems, doi: 10.1109/TNNLS.2019.2955857.
[16] Ding, D., Han, Q. L., Xiang, Y., Ge, X., and Zhang, X. M. (2018). A survey on security control and attack detection for industrial cyber-physical systems. Neurocomputing, 275, 1674-1683.
[17] Wu, M., Song, Z., and Moon, Y. B. (2019). Detecting cyber-physical attacks in CyberManufacturing systems with machine learning methods. Journal of Intelligent Manufacturing, 30(3), 1111-1123.
[18] Xiao, L., Wan, X., Lu, X., Zhang, Y., and Wu, D. (2018). IoT security techniques based on machine learning. arXiv preprint arXiv:1801.06275.
[19] Sharma, A., Kalbarczyk, Z., Barlow, J., and Iyer, R. (2011, June). Analysis of security data from a large computing organization. In Dependable Systems and Networks (DSN), 2011 IEEE/IFIP 41st International Conference on (pp. 506-517). IEEE.
[20] Ling, M. H., Yau, K. L. A., Qadir, J., Poh, G. S., and Ni, Q. (2015). Application of reinforcement learning for security enhancement in cognitive radio networks. Applied Soft Computing, 37, 809-829.
[21] Wang, Y., Ye, Z., Wan, P., and Zhao, J. (2019). A survey of dynamic spectrum allocation based on reinforcement learning algorithms in cognitive radio networks. Artificial Intelligence Review, 51(3), 493-506.
[22] Lu, X., Xiao, L., Xu, T., Zhao, Y., Tang, Y., and Zhuang, W. (2020). Reinforcement learning based PHY authentication for VANETs. IEEE Transactions on Vehicular Technology, 69(3), 3068-3079.
[23] Alauthman, M., Aslam, N., Al-Kasassbeh, M., Khan, S., Al-Qerem, A., and Choo, K. K. R. (2020). An efficient reinforcement learning-based botnet detection approach. Journal of Network and Computer Applications, 150, 102479.
[24] Nguyen, N. D., Nguyen, T., and Nahavandi, S. (2017). System design perspective for human-level agents using deep reinforcement learning: A survey. IEEE Access, 5, 27091-27102.

[25] Sui, Z., Pu, Z., Yi, J., and Wu, S. (2020). Formation control with [47] Hasselt, H. V., Guez, A., and Silver, D. (2016, February). Deep collision avoidance through deep reinforcement learning using model- reinforcement learning with double Q-learning. In The Thirtieth AAAI guided demonstration. IEEE Transactions on Neural Networks and Conference on Artificial Intelligence (pp. 2094-2100). AAAI Press. Learning Systems, doi: 10.1109/TNNLS.2020.3004893. [48] Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, [26] Tsantekidis, A., Passalis, N., Toufa, A. S., Saitas-Zarkias, K., Chairis- N. (2016, June). Dueling network architectures for deep reinforcement tanidis, S., and Tefas, A. (2020). Price trailing for financial trading learning. In International Conference on Machine Learning (pp. 1995- using deep reinforcement learning. IEEE Transactions on Neural 2003). Networks and Learning Systems, doi: 10.1109/TNNLS.2020.2997523. [49] Zhu, H., Cao, Y., Wang, W., Jiang, T., and Jin, S. (2018). Deep [27] Nguyen, T. T. (2018). A multi-objective deep reinforcement learning reinforcement learning for mobile edge caching: Review, new features, framework. arXiv preprint arXiv:1803.02965. and open issues. IEEE Network, 32(6), 50-57. [28] Wang, X., Gu, Y., Cheng, Y., Liu, A., and Chen, C. P. (2020). [50] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, Approximate policy-based accelerated deep reinforcement learning. T., ... and Kavukcuoglu, K. (2016, June). Asynchronous methods for IEEE Transactions on Neural Networks and Learning Systems, 31(6), deep reinforcement learning. In International Conference on Machine 1820-1830. Learning (pp. 1928-1937). [29] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., [51] Zhang, Y., Yao, J., and Guan, H. (2017). Intelligent cloud resource Bellemare, M. G., ... and Petersen, S. (2015). Human-level control management with deep reinforcement learning. IEEE Cloud Comput- through deep reinforcement learning. Nature, 518(7540), 529. ing, 4(6), 60-69. [30] Nguyen, N. D., Nahavandi, S., and Nguyen, T. (2018, October). A [52] Zhu, J., Song, Y., Jiang, D., and Song, H. (2017). A new deep-Q- human mixed strategy approach to deep reinforcement learning. In learning-based transmission scheduling mechanism for the cognitive 2018 IEEE International Conference on Systems, Man, and Cybernetics Internet of Things. IEEE Internet of Things Journal, 5(4), 2375-2385. (SMC) (pp. 4023-4028). IEEE. [53] Shafin, R., Chen, H., Nam, Y. H., Hur, S., Park, J., Zhang, J., ... and [31] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Liu, L. (2020). Self-tuning sectorization: deep reinforcement learning Driessche, G., ... and Dieleman, S. (2016). Mastering the game of Go meets broadcast beam optimization. IEEE Transactions on Wireless with deep neural networks and tree search. Nature, 529(7587), 484. Communications, 19(6), 4038-4053. [32] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., [54] Zhang, D., Han, X., and Deng, C. (2018). Review on the research and Guez, A., ... and Chen, Y. (2017). Mastering the game of Go without practice of deep learning and reinforcement learning in smart grids. human knowledge. Nature, 550(7676), 354. CSEE Journal of Power and Energy Systems, 4(3), 362-370. [33] Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., [55] He, X., Wang, K., Huang, H., Miyazaki, T., Wang, Y., and Guo, Yeo, M., ... and Quan, J. (2017). 
StarCraft II: A new challenge for S. (2018). Green resource allocation based on deep reinforcement reinforcement learning. arXiv preprint arXiv:1708.04782. learning in content-centric IoT. IEEE Transactions on Emerging Topics [34] Sun, P., Sun, X., Han, L., Xiong, J., Wang, Q., Li, B., ... and Zhang, T. in Computing, doi: 10.1109/TETC.2018.2805718. (2018). TStarBots: Defeating the cheating level builtin AI in StarCraft [56] He, Y., Liang, C., Yu, R., and Han, Z. (2018). Trust-based social II in the full game. arXiv preprint arXiv:1809.07193. networks with computing, caching and communications: A deep rein- [35] Pang, Z. J., Liu, R. Z., Meng, Z. Y., Zhang, Y., Yu, Y., and Lu, T. forcement learning approach. IEEE Transactions on (2018). On reinforcement learning for full-length game of StarCraft. and Engineering, doi: 10.1109/TNSE.2018.2865183. arXiv preprint arXiv:1809.09095. [57] Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Liang, Y. [36] Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, C., and Kim, D. I. (2019). Applications of deep reinforcement learning I., ... and Shanahan, M. (2018). Relational deep reinforcement learning. in communications and networking: A survey. IEEE Communications arXiv preprint arXiv:1806.01830. Surveys and Tutorials. doi: 10.1109/COMST.2019.2916583. [37] Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., [58] Dai, Y., Xu, D., Maharjan, S., Chen, Z., He, Q., and Zhang, Y. (2019). Castaneda, A. G., ... and Sonnerat, N. (2018). Human-level perfor- Blockchain and deep reinforcement learning empowered intelligent 5G mance in first-person multiplayer games with population-based deep beyond. IEEE Network, 33(3), 10-17. reinforcement learning. arXiv preprint arXiv:1807.01281. [59] Leong, A. S., Ramaswamy, A., Quevedo, D. E., Karl, H., and Shi, L. [38] OpenAI (2019, March 1). OpenAI Five. Retrieved from (2020). Deep reinforcement learning for wireless sensor scheduling in https://openai.com/five/. cyberphysical systems. Automatica, 113, 108759. [39] Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017, May). Deep [60] Watkins, C. J., and Dayan, P. (1992). Q-learning. Machine Learning, reinforcement learning for robotic manipulation with asynchronous off- 8(3-4), 279-292. policy updates. In 2017 IEEE International Conference on Robotics [61] Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. and Automation (ICRA) (pp. 3389-3396). IEEE. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal [40] Isele, D., Rahimi, R., Cosgun, A., Subramanian, K., and Fujimura, Processing Magazine, 34(6), 26-38. K. (2018, May). Navigating occluded intersections with autonomous [62] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., vehicles using deep reinforcement learning. In 2018 IEEE International Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep Conference on Robotics and Automation (ICRA) (pp. 2034-2039). reinforcement learning. arXiv preprint arXiv:1312.5602. [41] Nguyen, T. T., Nguyen, N. D., Bello, F., and Nahavandi, S. (2019). A [63] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized new tensioning method using deep reinforcement learning for surgical experience replay. arXiv preprint arXiv:1511.05952. pattern cutting. In 2019 IEEE International Conference on Industrial [64] Williams, R. J. (1992). Simple statistical gradient-following algorithms Technology (ICIT), doi: 10.1109/ICIT.2019.8755235. for connectionist reinforcement learning. 
Machine Learning, 8(3-4), [42] Nguyen, N. D., Nguyen, T., Nahavandi, S., Bhatti, A., and Guest, 229-256. G. (2019). Manipulating soft tissues by deep reinforcement learning [65] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). for autonomous robotic surgery. In 2019 IEEE International Systems Policy gradient methods for reinforcement learning with function Conference (SysCon), doi: 10.1109/SYSCON.2019.8836924. approximation. In Advances in Neural Systems [43] Keneshloo, Y., Shi, T., Ramakrishnan, N., and Reddy, C. K. (2020). (pp. 1057-1063). Deep reinforcement learning for sequence-to-sequence models. IEEE [66] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015, Transactions on Neural Networks and Learning Systems, 31(7), 2469- June). Trust region policy optimization. In International Conference on 2489. Machine Learning (pp. 1889-1897). [44] Mahmud, M., Kaiser, M. S., Hussain, A., and Vassanelli, S. (2018). [67] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, Applications of deep learning and reinforcement learning to biological O. (2017). Proximal policy optimization algorithms. arXiv preprint data. IEEE Transactions on Neural Networks and Learning Systems, arXiv:1707.06347. 29(6), 2063-2079. [68] Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., [45] Popova, M., Isayev, O., and Tropsha, A. (2018). Deep reinforcement Kakade, S., ... and Abbeel, P. (2018). Variance reduction for policy learning for de novo drug design. Science Advances, 4(7), eaap7885. gradient with action-dependent factorized baselines. arXiv preprint [46] He, Y., Yu, F. R., Zhao, N., Leung, V. C., and Yin, H. (2017). arXiv:1803.07246. Software-defined networks with mobile edge computing and caching [69] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... for smart cities: A deep reinforcement learning approach. IEEE and Wierstra, D. (2015). Continuous control with deep reinforcement Communications Magazine, 55(12), 31-37. learning. arXiv preprint arXiv:1509.02971. 14

[70] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, [93] Papamartzivanos, D., Mrmol, F. G., and Kambourakis, G. (2019). Intro- D., Muldal, A., ... and Lillicrap, T. (2018). Distributed distributional ducing deep learning self-adaptive misuse network intrusion detection deterministic policy gradients. arXiv preprint arXiv:1804.08617. systems. IEEE Access, 7, 13546-13560. [71] Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., [94] Haider, W., Creech, G., Xie, Y., and Hu, J. (2016). Windows based Silver, D., and Kavukcuoglu, K. (2016). Reinforcement learning with data sets for of robustness of host based intrusion detection unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. systems (IDS) to zero-day and stealth attacks. Future Internet, 8(3), [72] Wang, L., Torngren, M., and Onori, M. (2015). Current status and 29. advancement of cyber-physical systems in manufacturing. Journal of [95] Deshpande, P., Sharma, S. C., Peddoju, S. K., and Junaid, S. (2018). Manufacturing Systems, 37, 517-527. HIDS: A host based intrusion detection system for cloud computing [73] Zhang, Y., Qiu, M., Tsai, C. W., Hassan, M. M., and Alamri, A. (2017). environment. International Journal of System Assurance Engineering Health-CPS: Healthcare cyber-physical system assisted by cloud and and Management, 9(3), 567-576. big data. IEEE Systems Journal, 11(1), 88-95. [96] Nobakht, M., Sivaraman, V., and Boreli, R. (2016, August). A host- [74] Shakeel, P. M., Baskar, S., Dhulipala, V. S., Mishra, S., and Jaber, based intrusion detection and mitigation framework for smart home IoT M. M. (2018). Maintaining security and privacy in health care system using OpenFlow. In 2016 11th International conference on availability, using learning based deep-Q-networks. Journal of Medical Systems, reliability and security (ARES) (pp. 147-156). IEEE. 42(10), 186. [97] Resende, P. A. A., and Drummond, A. C. (2018). A survey of random [75] Cintuglu, M. H., Mohammed, O. A., Akkaya, K., and Uluagac, A. S. forest based methods for intrusion detection systems. ACM Computing (2017). A survey on smart grid cyber-physical system testbeds. IEEE Surveys (CSUR), 51(3), 48. Communications Surveys and Tutorials, 19(1), 446-464. [98] Kim, G., Yi, H., Lee, J., Paek, Y., and Yoon, S. (2016). LSTM- [76] Chen, Y., Huang, S., Liu, F., Wang, Z., and Sun, X. (2018). Evaluation based system-call language modeling and robust ensemble method of reinforcement learning-based false data injection attack to automatic for designing host-based intrusion detection systems. arXiv preprint voltage control. IEEE Transactions on Smart Grid, 10(2), 2158-2169. arXiv:1611.01726. [77] Ni, Z., and Paul, S. (2019). A multistage game in smart grid secu- [99] Chawla, A., Lee, B., Fallon, S., and Jacob, P. (2018, September). Host rity: A reinforcement learning solution. IEEE Transactions on Neural based intrusion detection system with combined CNN/RNN model. Networks and Learning Systems, doi: 10.1109/TNNLS.2018.2885530. In Joint European Conference on Machine Learning and Knowledge [78] Li, Y., Zhang, L., Zheng, H., He, X., Peeta, S., Zheng, T., and Li, Discovery in (pp. 149-158). Springer, Cham. Y. (2018). Nonlane-discipline-based car-following model for electric [100] Janagam, A., and Hossen, S. (2018). Analysis of Network Intrusion vehicles in transportation-cyber-physical systems. IEEE Transactions Detection System with Machine Learning Algorithms (Deep Rein- on Intelligent Transportation Systems, 19(1), 38-47. 
forcement Learning Algorithm) (Dissertation, Blekinge Institute of [79] Li, C., and Qiu, M. (2019). Reinforcement Learning for Cyber-Physical Technology, Sweden). Systems: with Cybersecurity Case Studies. CRC Press. [101] Xu, X., and Xie, T. (2005, August). A reinforcement learning approach for host-based intrusion detection using sequences of system calls. In [80] Feng, M., and Xu, H. (2017, November). Deep reinforcement learning International Conference on Intelligent Computing (pp. 995-1003). based optimal defense for cyber-physical system in presence of un- [102] Sutton, R. S. (1988). Learning to predict by the methods of temporal known cyber attack. In Computational Intelligence (SSCI), 2017 IEEE differences. Machine Learning, 3(1), 9-44. Symposium Series on (pp. 1-8). IEEE. [103] Xu, X. (2006, September). A sparse kernel-based least-squares tem- [81] Akazaki, T., Liu, S., Yamagata, Y., Duan, Y., and Hao, J. (2018, poral difference algorithm for reinforcement learning. In International July). Falsification of cyber-physical systems using deep reinforcement Conference on Natural Computation (pp. 47-56). learning. In International Symposium on (pp. 456- [104] Xu, X., and Luo, Y. (2007, June). A kernel-based reinforcement 465). Springer, Cham. learning approach to dynamic behavior modeling of intrusion detection. [82] Abbas, H., and Fainekos, G. (2012, October). Convergence proofs In International Symposium on Neural Networks (pp. 455-464). for simulated annealing falsification of safety properties. In 2012 [105] Xu, X. (2010). Sequential anomaly detection based on temporal- 50th Annual Allerton Conference on Communication, Control, and difference learning: Principles, models and case studies. Applied Soft Computing (Allerton) (pp. 1594-1601). IEEE. Computing, 10(3), 859-867. [83] Sankaranarayanan, S., and Fainekos, G. (2012, April). Falsification [106] Deokar, B., and Hazarnis, A. (2012). Intrusion detection system using of temporal properties of hybrid systems using the cross-entropy log files and reinforcement learning. International Journal of Computer method. In The 15th ACM International Conference on Hybrid Systems: Applications, 45(19), 28-35. Computation and Control (pp. 125-134). ACM. [107] Malialis, K. (2014). Distributed Reinforcement Learning for Network [84] Ferdowsi, A., Challita, U., Saad, W., and Mandayam, N. B. (2018). Ro- Intrusion Response (Doctoral Dissertation, University of York, UK). bust deep reinforcement learning for security and safety in autonomous [108] Malialis, K., and Kudenko, D. (2015). Distributed response to network vehicle systems. arXiv preprint arXiv:1805.00983. intrusions using multiagent reinforcement learning. Engineering Appli- [85] Wang, X., Jiang, R., Li, L., Lin, Y. L., and Wang, F. Y. (2019). Long cations of Artificial Intelligence, 41, 270-284. memory is important: A test study on deep-learning based car-following [109] Sutton, R. S., Barto, A. G. (1998). Introduction to Reinforcement model. Physica A: Statistical Mechanics and its Applications, 514, 786- Learning. MIT Press Cambridge, MA, USA. 795. [110] Yau, D. K., Lui, J. C., Liang, F., and Yam, Y. (2005). Defending against [86] Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. distributed denial-of-service attacks with max-min fair server-centric Neural Computation, 9(8), 1735-1780. router throttles. IEEE/ACM Transactions on Networking, 13(1), 29-42. [87] Gupta, A., and Yang, Z. (2018). Adversarial reinforcement learning [111] Bhosale, R., Mahajan, S., and Kulkarni, P. 
(2014). Cooperative ma- for observer design in autonomous systems under cyber attacks. arXiv chine learning for intrusion detection system. International Journal of preprint arXiv:1809.06784. Scientific and Engineering Research, 5(1), 1780-1785. [88] Lopez-Martin, M., Carro, B., and Sanchez-Esguevillas, A. (2020). [112] Herrero, A., and Corchado, E. (2009). Multiagent systems for network Application of deep reinforcement learning to intrusion detection for intrusion detection: A review. In Computational Intelligence in Security supervised problems. Expert Systems with Applications, 141, 112963. for Information Systems (pp. 143-154). Springer, Berlin, Heidelberg. [89] Abubakar, A., and Pranggono, B. (2017, September). Machine learning [113] Detwarasiti, A., and Shachter, R. D. (2005). Influence diagrams for based intrusion detection system for software defined networks. In 2017 team decision analysis. Decision Analysis, 2(4), 207-228. Seventh International Conference on Emerging Security Technologies [114] Shamshirband, S., Patel, A., Anuar, N. B., Kiah, M. L. M., and (EST) (pp. 138-143). IEEE. Abraham, A. (2014). Cooperative game theoretic approach using fuzzy [90] Jose, S., Malathi, D., Reddy, B., and Jayaseeli, D. (2018, April). A Q-learning for detecting and preventing intrusions in wireless sensor survey on anomaly based host intrusion detection system. In Journal networks. Engineering Applications of Artificial Intelligence, 32, 228- of Physics: Conference Series (vol. 1000, no. 1, p. 012049). 241. [91] Roshan, S., Miche, Y., Akusok, A., and Lendasse, A. (2018). Adaptive [115] Muoz, P., Barco, R., and de la Bandera, I. (2013). Optimization of and online network intrusion detection system using clustering and load balancing using fuzzy Q-learning for next generation wireless extreme learning machines. Journal of the Franklin Institute, 355(4), networks. Expert Systems with Applications, 40(4), 984-994. 1752-1779. [116] Shamshirband, S., Anuar, N. B., Kiah, M. L. M., and Patel, A. (2013). [92] Dey, S., Ye, Q., and Sampalli, S. (2019). A machine learning based An appraisal and design of a multi-agent system based cooperative intrusion detection scheme for in mobile clouds involving wireless intrusion detection computational intelligence technique. En- heterogeneous client networks. Information Fusion, 49, 205-215. gineering Applications of Artificial Intelligence, 26(9), 2105-2127. 15

[117] Varshney, S., and Kuma, R. (2018, January). Variants of LEACH [138] Wang, B., Wu, Y., Liu, K. R., and Clancy, T. C. (2011). An anti- routing protocol in WSN: A comparative analysis. In The 8th Interna- jamming stochastic game for cognitive radio networks. IEEE Journal tional Conference on Cloud Computing, and Engineering on Selected Areas in Communications, 29(4), 877-889. (Confluence) (pp. 199-204). IEEE. [139] Littman, M. L. (1994). Markov games as a framework for multi- [118] Roy, S., Ellis, C., Shiva, S., Dasgupta, D., Shandilya, V., and Wu, agent reinforcement learning. In The 11th International Conference Q. (2010, January). A survey of game theory as applied to network on Machine Learning (pp. 157-163). security. In 2010 43rd Hawaii International Conference on System [140] Xiao, L., Li, Y., Liu, J., and Zhao, Y. (2015). Power control with Sciences (pp. 1-10). IEEE. reinforcement learning in cooperative cognitive radio networks against [119] Shiva, S., Roy, S., and Dasgupta, D. (2010, April). Game theory for jamming. The Journal of Supercomputing, 71(9), 3237-3257. cyber security. In The Sixth Annual Workshop on Cyber Security and [141] Yang, D., Xue, G., Zhang, J., Richa, A., and Fang, X. (2013). Coping Information Intelligence Research (p. 34). ACM. with a smart jammer in wireless networks: A Stackelberg game ap- [120] Ramachandran, K., and Stefanova, Z. (2016). Dynamic game theories proach. IEEE Transactions on Wireless Communications, 12(8), 4038- in cyber security. In International Conference of Dynamic Systems and 4047. Applications, 7, 303-310. [142] Bowling, M., and Veloso, M. (2002). Multiagent learning using a [121] Wang, Y., Wang, Y., Liu, J., Huang, Z., and Xie, P. (2016, June). A variable learning rate. Artificial Intelligence, 136(2), 215-250. survey of game theoretic methods for cyber security. In 2016 IEEE [143] Han, G., Xiao, L., and Poor, H. V. (2017, March). Two-dimensional First International Conference on Data Science in Cyberspace (DSC) anti-jamming communication based on deep reinforcement learning. (pp. 631-636). IEEE. In The 42nd IEEE International Conference on Acoustics, Speech and [122] Zhu, Q., and Rass, S. (2018, October). Game theory meets network se- , (pp. 2087-2091). IEEE. curity: A tutorial. In The 2018 ACM SIGSAC Conference on Computer [144] Liu, X., Xu, Y., Jia, L., Wu, Q., and Anpalagan, A. (2018). Anti- and Communications Security (pp. 2163-2165). ACM. jamming communications using spectrum waterfall: A deep reinforce- [123] Mpitziopoulos, A., Gavalas, D., Konstantopoulos, C., and Pantziou, G. ment learning approach. IEEE Communications Letters, 22(5), 998- (2009). A survey on jamming attacks and countermeasures in WSNs. 1001. IEEE Communications Surveys and Tutorials, 11(4), 42-56. [145] Chen, W., and Wen, X. (2016, January). Perceptual spectrum waterfall [124] Hu, S., Yue, D., Xie, X., Chen, X., and Yin, X. (2018). Resilient of pattern shape recognition algorithm. In 2016 18th International event-triggered controller synthesis of networked control systems under Conference on Advanced Communication Technology (ICACT) (pp. periodic DoS jamming attacks. IEEE Transactions on Cybernetics, doi: 382-389). IEEE. 10.1109/TCYB.2018.2861834. [146] Zeng, K., Govindan, K., and Mohapatra, P. (2010). Non-cryptographic [125] Boche, H., and Deppe, C. (2019). Secure identification under passive authentication and identification in wireless networks. IEEE Wireless eavesdroppers and active jamming attacks. IEEE Transactions on Communications, 17(5), 56-62. 
Information Forensics and Security , 14(2), 472-485. [147] Xiao, L., Li, Y., Liu, G., Li, Q., and Zhuang, W. (2015, December). [126] Wu, Y., Wang, B., Liu, K. R., and Clancy, T. C. (2012). Anti-jamming Spoofing detection with reinforcement learning in wireless networks. In IEEE Journal on games in multi-channel cognitive radio networks. Global Communications Conference (GLOBECOM) (pp. 1-5). IEEE. Selected Areas in Communications, 30(1), 4-15. [148] Xiao, L., Li, Y., Han, G., Liu, G., and Zhuang, W. (2016). PHY-layer [127] Singh, S., and Trivedi, A. (2012, September). Anti-jamming in cog- spoofing detection with reinforcement learning in wireless networks. nitive radio networks using reinforcement learning algorithms. In IEEE Transactions on Vehicular Technology, 65(12), 10037-10047. Wireless and Optical Communications Networks (WOCN), 2012 Ninth [149] Sutton, R. S. (1990, June). Integrated architecture for learning, plan- International Conference on (pp. 1-5). IEEE. ning, and reacting based on approximating dynamic programming. In [128] Gwon, Y., Dastangoo, S., Fossa, C., and Kung, H. T. (2013, October). The 7th International Conference on Machine Learning (pp. 216-224). Competing mobile network game: Embracing antijamming and jam- ming strategies with reinforcement learning. In 2013 IEEE Conference [150] Sun, X., Dai, J., Liu, P., Singhal, A., and Yen, J. (2018). Using Bayesian IEEE on Communications and (CNS) (pp. 28-36). IEEE. networks for probabilistic identification of zero-day attack paths. Transactions on Information Forensics and Security [129] Conley, W. G., and Miller, A. J. (2013, November). Cognitive jamming , 13(10), 2506- game for dynamically countering ad hoc cognitive radio networks. In 2521. MILCOM 2013-2013 IEEE Military Communications Conference (pp. [151] Afek, Y., Bremler-Barr, A., and Feibish, S. L. (2019). Zero-day 1176-1182). IEEE. signature extraction for high-volume attacks. IEEE/ACM Transactions [130] Dabcevic, K., Betancourt, A., Marcenaro, L., and Regazzoni, C. S. on Networking, doi: 10.1109/TNET.2019.2899124. (2014, May). A fictitious play-based game-theoretical approach to [152] Wan, X., Sheng, G., Li, Y., Xiao, L., and Du, X. (2017, December). Re- alleviating jamming attacks for cognitive radios. In 2014 IEEE In- inforcement learning based mobile offloading for cloud-based malware ternational Conference on Acoustics, Speech and Signal Processing detection. In GLOBECOM 2017-2017 IEEE Global Communications (ICASSP) (pp. 8158-8162). IEEE. Conference (pp. 1-6). IEEE. [131] Slimeni, F., Scheers, B., Chtourou, Z., and Le Nir, V. (2015, May). [153] Li, Y., Liu, J., Li, Q., and Xiao, L. (2015, April). Mobile cloud offload- Jamming mitigation in cognitive radio networks using a modified ing for malware detections with learning. In 2015 IEEE Conference Q-learning algorithm. In 2015 International Conference on Military on Computer Communications Workshops (INFOCOM WKSHPS) (pp. Communications and Information Systems (ICMCIS) (pp. 1-7). IEEE. 197-201). IEEE. [132] Xiao, L., Lu, X., Xu, D., Tang, Y., Wang, L., and Zhuang, W. (2018). [154] Salahuddin, M. A., Al-Fuqaha, A., and Guizani, M. (2015). Software- UAV relay in VANETs against smart jamming with reinforcement defined networking for RSU clouds in support of the internet of learning. IEEE Transactions on Vehicular Technology, 67(5), 4087- vehicles. IEEE Internet of Things Journal, 2(2), 133-144. 4097. [155] Huang, R., Chu, X., Zhang, J., and Hu, Y. H. (2015). 
Energy-efficient [133] Xiao, L., Wan, X., Dai, C., Du, X., Chen, X., and Guizani, M. (2018). monitoring in software defined wireless sensor networks using rein- Security in mobile edge caching with reinforcement learning. IEEE forcement learning: A prototype. International Journal of Distributed Wireless Communications, 25(3), 116-122. Sensor Networks, 11(10), 360428. [134] Aref, M. A., Jayaweera, S. K., and Machuzak, S. (2017, March). Multi- [156] Kim, S., Son, J., Talukder, A., and Hong, C. S. (2016, January). agent reinforcement learning based cognitive anti-jamming. In Wireless Congestion prevention mechanism based on Q-leaning for efficient Communications and Networking Conference (WCNC) (pp. 1-6). IEEE. routing in SDN. In 2016 International Conference on Information [135] Machuzak, S., and Jayaweera, S. K. (2016, July). Reinforcement learn- Networking (ICOIN) (pp. 124-128). IEEE. ing based anti-jamming with wideband autonomous cognitive radios. [157] Lin, S. C., Akyildiz, I. F., Wang, P., and Luo, M. (2016, June). QoS- In 2016 IEEE/CIC International Conference on Communications in aware adaptive routing in multi-layer hierarchical software defined net- China (ICCC) (pp. 1-5). IEEE. works: A reinforcement learning approach. In 2016 IEEE International [136] Felice M.D., Bedogni L., Bononi L. (2019) Reinforcement learning- Conference on Services Computing (SCC) (pp. 25-33). IEEE. based spectrum management for cognitive radio networks: A literature [158] Mestres, A., Rodriguez-Natal, A., Carner, J., Barlet-Ros, P., Alarcn, review and case study. In Handbook of Cognitive Radio (pp. 1849- E., Sol, M., ... and Estrada, G. (2017). Knowledge-defined networking. 1886). Springer, Singapore. ACM SIGCOMM Computer Communication Review, 47(3), 2-10. [137] Attar, A., Tang, H., Vasilakos, A. V., Yu, F. R., and Leung, V. C. (2012). [159] Han, Y., Rubinstein, B. I., Abraham, T., Alpcan, T., De Vel, O., Erfani, A survey of security challenges in cognitive radio networks: Solutions S., ... and Montague, P. (2018, October). Reinforcement learning for and future research directions. Proceedings of the IEEE, 100(12), 3172- autonomous defence in software-defined networking. In International 3186. Conference on Decision and Game Theory for Security (pp. 145-165). 16

[160] Lantz, B., and O’Connor, B. (2015, August). A mininet-based virtual IEEE International Conference on Industrial Technology (ICIT), doi: testbed for distributed SDN development. In ACM SIGCOMM Com- 10.1109/ICIT.2019.8755032. puter Communication Review, 45(4), 365-366. ACM. [183] Nguyen, T. T., Nguyen, N. D., and Nahavandi, S. (2020). Deep [161] Zhu, M., Hu, Z., and Liu, P. (2014, November). Reinforcement learning reinforcement learning for multiagent systems: a review of challenges, algorithms for adaptive cyber defense against Heartbleed. In The First solutions, and applications. IEEE Transactions on Cybernetics, doi: ACM Workshop on Moving Target Defense (pp. 51-58). ACM. 10.1109/TCYB.2020.2977374. [162] Wang, J., Zhao, M., Zeng, Q., Wu, D., and Liu, P. (2015, June). [184] Xiao, L., Li, Y., Han, G., Dai, H., and Poor, H. V. (2017). A secure Risk assessment of buffer ”Heartbleed” over-read vulnerabilities. In mobile crowdsensing game with deep reinforcement learning. IEEE 2015 45th Annual IEEE/IFIP International Conference on Dependable Transactions on Information Forensics and Security, 13(1), 35-47. Systems and Networks (pp. 555-562). IEEE. [185] Liu, Y., Dong, M., Ota, K., Li, J., and Wu, J. (2018, September). Deep [163] Luo, B., Yang, Y., Zhang, C., Wang, Y., and Zhang, B. (2018, June). A reinforcement learning based smart mitigation of DDoS flooding in survey of code reuse attack and defense. In International Conference software-defined networks. In 2018 IEEE 23rd International Workshop on Intelligent and Interactive Systems and Applications (pp. 782-788). on Computer Aided Modeling and Design of Communication Links and [164] Elderman, R., Pater, L. J., Thie, A. S., Drugan, M. M., and Wiering, Networks (CAMAD) (pp. 1-6). IEEE. M. (2017, January). Adversarial reinforcement learning in a cyber se- [186] Xu, Y., Ren, G., Chen, J., Zhang, X., Jia, L., and Kong, L. (2018). curity simulation. In International Conference on Agents and Artificial Interference-aware cooperative anti-jamming distributed channel selec- Intelligence (ICAART), (2) (pp. 559-566). tion in UAV communication networks. Applied Sciences, 8(10), 1911. [165] Chung, K., Kamhoua, C. A., Kwiat, K. A., Kalbarczyk, Z. T., and [187] Yao, F., and Jia, L. (2019). A collaborative multi-agent reinforcement Iyer, R. K. (2016, January). Game theory with learning for cyber learning anti-jamming algorithm in wireless networks. IEEE Wireless security monitoring. In 2016 IEEE 17th International Symposium on Communications Letters, doi: 10.1109/LWC.2019.2904486. High Assurance Systems Engineering (HASE) (pp. 1-8). IEEE. [188] Li, Y., Wang, X., Liu, D., Guo, Q., Liu, X., Zhang, J., and Xu, [166] Wei, F., Wan, Z., and He, H. (2020). Cyber-attack recovery strategy for Y. (2019). On the performance of deep reinforcement learning-based smart grid based on deep reinforcement learning. IEEE Transactions anti-jamming method confronting intelligent jammer. Applied Sciences, on Smart Grid, 11(3), 2476-2486. 9(7), 1361. [167] Liu, X. R., Ospina, J., and Konstantinou, C. (2020). Deep reinforce- [189] Chatterjee, M., and Namin, A. S. (2019, July). Detecting phishing ment learning for cybersecurity assessment of wind integrated power websites through deep reinforcement learning. In 2019 IEEE 43rd systems. arXiv preprint arXiv:2007.03025. Annual Computer Software and Applications Conference (COMPSAC) [168] Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action- (vol. 2, pp. 227-232). IEEE. 
conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems (pp. 2863-2871). [169] Mathieu, M., Couprie, C., and LeCun, Y. (2015). Deep multi- scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440. [170] Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018, May). Thanh Thi Nguyen was a Visiting Scholar with Neural network dynamics for model-based deep reinforcement learning the Department at Stanford Uni- with model-free fine-tuning. In 2018 IEEE International Conference on versity, California, USA in 2015 and the Edge Robotics and Automation (ICRA) (pp. 7559-7566). IEEE. Computing Lab, John A. Paulson School of Engi- [171] Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. neering and Applied Sciences, Harvard University, I., Rohlfshagen, P., ... and Colton, S. (2012). A survey of Monte Carlo Massachusetts, USA in 2019. He received an Alfred tree search methods. IEEE Transactions on Computational Intelligence Deakin Postdoctoral Research Fellowship in 2016, and AI in Games, 4(1), 1-43. a European-Pacific Partnership for ICT Expert Ex- [172] Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value change Program Award from iteration networks. In Advances in Neural Information Processing in 2018, and an AustraliaIndia Strategic Research Systems (pp. 2154-2162). Fund Early- and Mid-Career Fellowship Awarded by [173] Pascanu, R., Li, Y., Vinyals, O., Heess, N., Buesing, L., Racanire, S., ... the Australian Academy of Science in 2020. Dr. Nguyen obtained a PhD in and Battaglia, P. (2017). Learning model-based planning from scratch. Mathematics and Statistics from Monash University, Australia in 2013 and arXiv preprint arXiv:1707.06170. has expertise in various areas, including artificial intelligence, deep learning, [174] Nguyen, T. T., Nguyen, C. M., Nguyen, D. T., Nguyen, D. T., deep reinforcement learning, cyber security, IoT, and data science. and Nahavandi, S. (2019). Deep learning for deepfakes creation and Dr. Nguyen has been recognized as a leading researcher in Australia detection. arXiv preprint arXiv:1909.11573. in the field of Artificial Intelligence by The Australian Newspaper in a [175] Giles, M. (2019, January 4). Five emerging cyber-threats to report published in 2018. He is currently a Senior Lecturer in the School worry about in 2019. MIT Technology Review. Available at: of Information Technology, Deakin University, Victoria, Australia. https://www.technologyreview.com/2019/01/04/66232/five-emerging- cyber-threats-2019/. [176] Behzadan, V., and Munir, A. (2017, July). Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recog- nition (pp. 262-275). Springer, Cham. Vijay Janapa Reddi completed his PhD in com- [177] Duddu, V. (2018). A survey of adversarial machine learning in cyber puter science at Harvard University in 2010. He warfare. Defence Science Journal, 68(4), 356-366. is a recipient of multiple awards, including the [178] Chen, T., Liu, J., Xiang, Y., Niu, W., Tong, E., and Han, Z. (2019). National Academy of Engineering (NAE) Gilbreth Adversarial attack and defense in reinforcement learning-from AI Lecturer Honor (2016), IEEE TCCA Young Com- security view. Cybersecurity, 2(1), 11. puter Architect Award (2016), Intel Early Career [179] Ilahi, I., Usama, M., Qadir, J., Janjua, M. 
U., Al-Fuqaha, A., Hoang, Award (2013), Google Faculty Research Awards D. T., and Niyato, D. (2020). Challenges and countermeasures for (2012, 2013, 2015, 2017), Best Paper at the 2005 adversarial attacks on deep reinforcement learning. arXiv preprint International Symposium on Microarchitecture, Best arXiv:2001.09684. Paper at the 2009 International Symposium on High [180] Tong, L., Laszka, A., Yan, C., Zhang, N., and Vorobeychik, Y. Performance , and IEEE’s (2020, April). Finding needles in a moving haystack: prioritizing alerts Top Picks in Computer Architecture Awards (2006, 2010, 2011, 2016, 2017). with adversarial reinforcement learning. In Proceedings of the AAAI Dr. Reddi is currently an Associate Professor in the John A. Paulson School Conference on Artificial Intelligence (vol. 34, no. 01, pp. 946-953). of Engineering and Applied Sciences at Harvard University where he directs [181] Sun, J., Zhang, T., Xie, X., Ma, L., Zheng, Y., Chen, K., and the Edge Computing Lab. His research interests include computer architecture Liu, Y. (2020). Stealthy and efficient adversarial attacks against deep and system-, specifically in the context of mobile and edge reinforcement learning. arXiv preprint arXiv:2005.07099. computing platforms based on machine learning. [182] Nguyen, T., Nguyen, N. D., and Nahavandi, S. (2019). Multi- agent deep reinforcement learning with human strategies. In 2019