c 2020

MD BILLAL HOSSAIN

ALL RIGHTS RESERVED QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED

NETWORKING

A Thesis

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Master of Science

Md Billal Hossain

August, 2020 QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED

NETWORKING

Md Billal Hossain

Thesis

Approved: Accepted:

Advisor Dean of the College Dr. Jin Wei-Kocsis Dr. Craig Menzemer

Co-Advisor Dean of the Graduate School Dr. Kye-Shin Lee Dr. Marnie Saunders

Faculty Reader Date Dr. Hamid Bahrami

Department Chair Dr. Robert Veillette

ii ABSTRACT

This thesis proposes a (RL) driven software-defined networking

(SDN) routing scheme for the situation-awareness and intelligent networking man- agement. Firstly, the existing SDN network monitoring technique is applied to track the quality of service (QoS) parameters (link delay and packet loss). Afterward, the

QoS data are fed to the RL algorithm in order to achieve situation awareness in SDN routing. The performance of the proposed RL-enabled routing scheme is evaluated in the simulation section by considering various network scenarios, including network congestion. Finally, the end-to-end delay, the episode reward, and the probability of path selection are recorded for each case. According to the outcomes, the proposed scheme intelligently select the efficient data path according to the current state of the network. Moreover, the end-to-end delay is compared with the Dijkstra algorithm, demonstrating the superiority of RL-enabled dynamic routing strategy over static.

Additionally, the scalability of the algorithm is tested with multiple controller SDN.

iii ACKNOWLEDGMENTS

Firstly, I would like to extend my heartfelt gratitude to my adviser, Dr. Jin (Wei)

Kocsis, and co-adviser, Dr. Kye-Shin Lee, for their continuous guidance and advice throughout my master’s program. Their support and guidance empowered me to face challenges throughout this journey. Also, I would like to thank my thesis committee member, Dr. Hamid Bahrami, and Department chair, Dr. Robert Veillette, for their cooperation throughout the journey.

I would like to thank my colleagues at the University of Akron including members of the CPSSD group, especially Yifu, Gihan, Praveen, Moein for all their support. It was my pleasure working with you in a cooperative environment. My appreciation goes out to all the faculty members and staff at the University of Akron for heightening my knowledge and skills via graduate courses. I am forever grateful to my parents, Md Hafiz Ahmed and Aklima Akter.

I dedicate this thesis to those who work for a better world.

Thank you all for your support.

iv TABLE OF CONTENTS

Page

LISTOFTABLES...... vii

LISTOFFIGURES ...... viii

CHAPTER

I. INTRODUCTION...... 1

1.1 Contributions in this Thesis ...... 5

1.2 Publications ...... 5

1.3 ThesisOutline ...... 5

II. OVERVIEW OF SOFTWARE-DEFINED NETWORK TECHNOLOGY 8

2.1 MotivationsofSDN ...... 8

2.2 HistoryofSDNDevelopment ...... 10

2.3 SDNStructure ...... 11

2.4 DistributedSDNarchitecture ...... 17

2.5 Recent Development and Application of SDN ...... 19

III. OVERVIEW OF REINFORCEMENT LEARNING ...... 20

3.1 Artificial Intelligence, Artificial Neural Networks and 20

3.2 ReinforcementLearning ...... 23

3.3 Reinforcement Learning: Applications ...... 35

v IV. QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DE- FINEDNETWORK ...... 36 4.1 Introduction ...... 36

4.2 Design of QoS-Aware Intelligent Routing ...... 38

V. SIMULATION RESULTS: QoS-AWARE INTELLIGENT ROUTING FORSOFTWAREDEFINEDNETWORK...... 45 5.1 Test-bedSetup: SingleController...... 45

5.2 Test-bed Setup: Multiple Controller ...... 60

VI. CONCLUSIONANDFUTUREWORK ...... 65

6.1 Conclusion ...... 65

6.2 FutureWork ...... 65

BIBLIOGRAPHY ...... 67

vi LIST OF TABLES

Table Page

2.1 PopularSDNNOS ...... 15

5.1 Linkparametersfordifferentcases...... 46

5.2 Performance comparison between proposed method and Dijkstra algorithm-basedmethod ...... 48

vii LIST OF FIGURES

Figure Page

1.1 Overviewofthethesis...... 6

2.1 Systemarchitecture: TraditionalvsSDN ...... 9

2.2 System architecture of SDN: (a) Physical Structure; (b) Logical Structure 14

2.3 AnatomyofSDNHardwareComponents ...... 17

2.4 FlatArchitectureofSDN...... 18

2.5 HierarchicalSDNArchitecture ...... 19

3.1 AI and ML: (a) Hierarchical Structure of AI; (b) Classification of MLalgorithms ...... 21 3.2 ExampleofRLscenario...... 24

3.3 ClassificationofRL...... 24

3.4 Actor-Critic: (a) Concept of AC; (b) Structure of AC ...... 33

4.1 BlockRepresentationofRIRD ...... 36

4.2 Flow Chart of QoS-aware Intelligent Routing ...... 37

4.3 Active Monitoring QoS in SDN: (a) Packet Loss; (b) Link Delay . . . . 39

5.1 Single-controller SDN topology...... 46

5.2 Probability of path selection for Case-I: From Host-1 to Host-2 ..... 50

5.3 Probability of path selection for Case-I: From Host-2 to Host-1 ..... 50

viii 5.4 Performance of RL agent for Case-I: (a) Episode reward value, Host- 1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to- End delay comparison, ICMP message (Ping) from Host-1 to Host-2 .. 51 5.5 Probability of path selection for Case-II: From Host-1 to Host-2 .... 53

5.6 Probability of path selection for Case-II: From Host-2 to Host-1 .... 53

5.7 Performance of RL agent for Case-II: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 toHost-2...... 54 5.8 Probability of path selection for Case-III: From Host-1 to Host-2.... 56

5.9 Probability of path selection for Case-III: From Host-2 to Host-1.... 56

5.10 Performance of RL agent for Case-III: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 toHost-2...... 57 5.11 Probability of path selection for Case-IV: From Host-1 to Host-2.... 58

5.12 Probability of path selection for Case-IV: From Host-2 to Host-1.... 58

5.13 Performance of RL agent for Case-IV: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 toHost-2...... 59 5.14 The overview of multiple-controller SDN test-system topology . .... 60

5.15 Probability of path selection for controller-1: From Host-1 to Gateway-1...... 61 5.16 Probability of path selection for controller-1: From Gateway-1 to Host-1 ...... 61 5.17 Probability of path selection for controller-2: From Gateway-2 to Host-2 ...... 62 5.18 Probability of path selection for controller-2: From Host-2 to Gateway-2...... 62

ix 5.19 Performance of RL agent for controllers: (a) Episode reward value, Host-1 to Gateway-1; (b) Episode reward value, Gateway-1 to Host- 1; (c) Episode reward value, Gateway-2 to Host-2; (d) Episode re- ward value, Host-2 to Gateway-2; (e) End-to-End delay comparison, ICMPmessage(Ping)fromHost-1toHost-2 ...... 63

x CHAPTER I

INTRODUCTION

Internet has inaugurated a new social era, where all digital resources have the po- tential to be interconnected and accessible everywhere. Traditional Internet protocol

(IP) networking technology is complex. It is difficult to manage [1], configure, and re- configure in case of a fault. Moreover, a traditional IP network is vertically integrated as the control and data planes are bundled together. Software Defined Networking

(SDN) is an emerging networking technology that breaks the vertical integration by separating the network’s control logic from the underlying devices (e.g., routers and switches) [2]. The logically centralized network control eases the management, reduces operating costs, promotes evolution and innovation [3], allows vendor-independent im- plementation [4], improves the network’s resource utilization [5], and facilitates traffic monitoring [6].

Efficient data routing in a large, complex and dynamic network is a chal- lenging problem. Inefficient routing can lead to the network links overloading and increase the end-to-end transmission delay, which affects the overall performance of a communication network. In addition, a communication network is a dynamic envi- ronment where the performance depends on the current behavior of the nodes. There are multiple popular traditional routing algorithms (e.g., OSPF, IGRP, EIGRP, RIP,

1 BGP) established in traditional networking structures. A large volume of nodes with copious network parameters makes it more challenging for conventional routing to en- sure the expected QoS. Moreover, these algorithms were developed for decentralized network processing units. Therefore, they are not suitable for the SDN, whose fun- damental concept is to develop a centralized control [7]. SDN routing also demands the integration of Quality of Service (QoS) and situation awareness.

Machine Learning (ML) is an exceptional tool that enables automation in dy- namic domains. ML-based dynamic routing [8] could lead to an efficient, near-optimal routing solution [9]. ML algorithms are primarily classified into supervised, unsuper- vised, semi-supervised, and reinforcement learning [10] [11]. The network researchers are trying to inject different classes of ML in a distinct approach. Among the ML classes, reinforcement learning (RL) is a framework by which a system can learn from its previous interactions with its environment to efficiently select its actions in the fu- ture. This makes it suitable for the SDN routing problem. Additionally, powerful net- work status monitoring tools facilitate the opportunity for situation awareness in SDN routing [12], [2], [5], [13], [14]. The combination of network monitoring, distributed network analytics and ML is known as a knowledge defined network (KDN) [15]. The following paragraph will introduce some accomplished research work related to the application of ML in SDN routing.

There is existing research that applies various machine learning algorithms for QoS-aware dynamic SDN routing. In [12], the author proposed a supervised

(LSTM-RNN) learning-based dynamic routing called NeuRoute. NeuRoute predicts

2 the traffic matrix based on previously recorded data to generate forwarding rules.

However, the effectiveness of the algorithm depends on the volume of annotated data for training the neural networks. Similarly, a deep-RL approach for SDN routing was proposed by Stampa et al. [15], where routing was optimized based on the bandwidth request between a source-destination pair. The RL agent adapted dynamically to cur- rent traffic status and customized the route configurations in an attempt to minimize the network delay. The algorithm was tested for 14 nodes and 21 full-duplex links with an average node degree of three, with 10 traffic intensity levels. This method did not consider network internal parameters, such as packet loss in a link, that are critical for QoS. Additionally, the researchers considered 1,000 distinct traffic configurations, so the computational complexity will increase with the number of network nodes. A similar approach called DDPG Routing Optimization Mechanism-DROM was tested with a deep deterministic policy gradient by Yu [16]. Furthermore, Xu et al. [5] pro- posed a deep-Q and traffic engineering based routing, Deep Reinforcement Learning

(DRL) with Traffic Engineering-(DRL-TE), to maximize the utilization of an SDN.

In DRL-TE, the RL agent was designed to effectively utilize the networking resource among the three shortest paths according to hop-count. The method did not consider the network’s internal factors (e.g., link delay and packet loss). Kim et al. [17] pro- posed a simple demonstration of Q-learning’s application for congestion preventive routing in SDN. They considered a simple topology having five SDN switches and also didn’t consider packet loss. Sendra et al. [2] proposed a reinforcement learning-based

SDN routing. The simulation topology of [2] consisted of a few nodes with an average

3 degree of two. Yanju et al. [7] proposed a heuristic algorithm and supervised learn- ing algorithm-based SDN routing. The initial step of that method [7] was to collect abundant training data and build a path database for each source-destination pair.

However, it was an impractical solution for a large-scale network. Francois et al. [18] proposed a secured Cognitive Routing Engine-CRE. CRE algorithm reduced the end to end delay, and encrypted the data with the Random Neural Network (RNN) and

RL. However, the method focused on the end to end delay at the expense of packet loss. Wang et al. [19] proposed a deep RL based resource allocation for mobile edge computing. The objective of the research was to minimize service time and balance resource allocation by RL. Sun et al. [20] proposed deep deterministic policy gradient

(DDPG) and a (RNN) to generate routing policy in the

SDN platform. The method considered the average transmission delay to calculate episode reward. QoS-aware adaptive routing, called QAR, was proposed in [3] for multi-layer hierarchical SDN. The reward was a complex mathematical function and there was no simulation for the link congestion. There are also some other machine learning-based routing approaches proposed in [13, 21–28].

This thesis demonstrates a situation-aware routing scheme for the SDN by using an advantage actor critic (A2C [29]). The QoS-aware intelligent routing method continuously monitors the delay, and packet loss between SDN switches for situation awareness, then A2C makes an intelligent decision on that basis. Afterwards, the simulation results are analyzed for different cases and compared with static routing strategy.

4 1.1 Contributions in this Thesis

1. An A2C algorithm that interacted in real-time with SDN for best QoS.

2. A deep A2C-based routing where the centralized SDN controller tailored the

routing scheme according to the delay and packet loss of the path was proposed.

Several scenarios were tested along with the network congestion.

3. The scalability of the proposed algorithm was tested for an environment with

multiple SDN controllers.

1.2 Publications

Research achievements presented in this thesis have been included in the following research paper:

1. Hossain, M.B. and Jin Wei , “Reinforcement Learning-Driven QoS-Aware In-

telligent Routing for Software-Defined Networks,” IEEE Global Conference on

Signal and Information Processing (GlobalSIP), Nov 2019 [30].

1.3 Thesis Outline

The following chapters in this thesis unfold as shown in Fig. 1.1. Chapter 2 provides an overview about SDN. It discusses structural and historical information, such as the definition of SDN, the key component of SDN, the most significant milestones, and the current development of SDN. Chapter 3 reviews general ideas about machine

5 Chapter-1: Introduction Case-I test: One best path case (all the paths having different bandwidth, delay and packet loss) Chapter-2: Overview of Software Defined Network Technology Case-II test: One best path case (multiple path have same bandwidth, delay and different packet loss) Chapter-3: Overview of Reinforcement Learning Case-III test: Congestion (all the paths having different bandwidth, delay and packet loss) Chapter-4: QoS-aware Intelligent Routing for SDN Case-IV test: One best path case (multiple path have same packet loss, bandwidth and different delay) Chapter-5: Simulation Results Case-V test: Multiple SDN case (Scalability) Chapter-6: Conclusions and Future Work

Figure 1.1: Overview of the thesis.

learning artificial intelligence and RL. The chapter highlights the history, develop- ment, and applications of RL algorithms. The chapter also gives and detailed ideas about the structure of A2C. Chapter 4 has a detailed discussion about the proposed

RL based routing algorithm. The chapter also describes some QoS parameters that are considered.

Chapter 5 shows the test-bed setup and simulation results. The simulation considered five test scenario cases and described each case. Sections 5.1 and 5.1.1 show test-bed setup and simulation results for one best path case. Section 5.1.2 shows

6 simulation results for one best path case where multiple similar paths exist with the same bandwidth, and delay, and different packet loss. Section 5.1.3 describes the behavior of the proposed method in case of network congestion. Section 5.1.4 shows the test-bed setup and simulation results for one best path case where multiple paths had the same packet loss, bandwidth, and different delay. Section 5.2 shows the test-bed setup and simulation results for multiple SDN controllers.

Finally, Chapter 6 provides a summary of the work carried out and potential future work.

7 CHAPTER II

OVERVIEW OF SOFTWARE-DEFINED NETWORK TECHNOLOGY

This chapter provides a brief introduction to the SDN, introducing its history, evo- lution, structure, major achievements, and application in the real world networking infrastructure.

2.1 Motivations of SDN

Computer networks can be represented in three planes of functionality: the data, control, and management planes [as shown in Fig.-2.1 a and b]. The data plane cor- responds to the networking devices, the control plane represents the protocols used to generate the forwarding tables, and the management plane provides the software ser- vices for remote network device management. The control plane enforces the policy defined in the management plane, and data plane executes the policy. The tradi- tional networking devices (e.g., , switches) are integrated with vendor-specific embedded operating systems which decide the data forwarding policy and control the data plane through hardware. Notably, there is no standardized protocol proposed for coordination between vendors. Consequently, the device only supports vendor- specific network configuration. Thus, an arduous effort is required to upgrade or modify network policy. In some cases, to carry out these actions, the system must

8 be replaced with new devices. This is a costly solution. Furthermore, over 4 billion

Internet users are linked through over half a million Autonomous Systems (AS), and the number is increasing rapidly every year [31,32]. With that many users demanding versatile applications, implementation in a diverse range of AS is challenging for tra- ditional network infrastructure. SDN [33, 34] is a cutting edge networking approach that promises versatile application by decoupling the vertical integration, separating the control plane from the data plane. The separation replaces traditional forward- ing devices (e.g., router, switch) with simple forwarding devices. Therefore, decision making is shifted to a centralized controller, providing a global view of the network and programming abstractions. The SDN leads to a highly scalable, cost-effective, holistically manageable networking system with cloud abstraction and rigorous mon- itoring.

SDN Scheme Traditional Networking Scheme

Management Plane • Routing application Control Plane • Routing • Forwarding Rules (FR) • Application for FR • Packet Filter • Packet Filter application Management Plane • SNMP

Embedded Control Plane: Customizable operating system

Data plane Data plane

Packet Forwarding Hardware Packet Forwarding Hardware

(a) (b) Figure 2.1: System architecture: Traditional vs SDN

9 2.2 History of SDN Development

The concept of the programmable network has a long history [35]. The earliest initia- tives to separate data and control signaling go back to the nineteenth century, and over time, the approach did not change within the existing packet structure. Some notable achievements of the programmable network were: programmable ATM networks [36],

NCP [37], SOFTNET [38], OPENSIG [39], Tempest [40] , ANTS [41], NetScript [42].

The early attempts to program the data plane had two approaches: programmable switches and data capsules. The first approach did not modify the packet structure, whereas the latter approach replaced the packet with a tiny program. The POF is a model of modern programmable data plane devices. NCP initiated the idea of separating data and control signaling. Some major proposals for disintegrated con- trol and data were ForCES [43], PCE [44], RCP [45], GSMP [46]. The SANE [47],

OpenFlow [48], NOX [49]. ATM network virtualization, initiated by the Tempest

Project [50], was the earliest motivation behind SDN virtualization. Early network virtualization projects include VINI [51], and Planet Lab [52]. Moreover, the net- work operating system (NOS) is not a new concept. NOX [49], ONOS [53], CISCO

ISO [54], and JUNIOS [55] are the popular NOS. The consistent effort of researchers to separate the three network planes, along with the virtualization concept, tailored the innovative idea of SDN in 2008 at MIT, Stanford University and the University of California at Berkeley. The concept of OpenFlow protocol ushered in a new era in computer networking research that introduced the concept of SDN. Open Networking

10 Foundation (ONF) [56] defined the protocol standards for OpenFlow. SDN also has great potential. It is estimated that the SDN market will be worth more than $12 billion in 2022.

2.3 SDN Structure

The SDN structure is similar to the traditional network, as both have physically in- terconnected nodes. The simplest physical structure is composed of a SDN controller and connected switches [Fig.-2.2a]. The dotted line represents the logical connection between controller and switches, whereas the solid line depicts the physical contact.

The interface between the SDN controller and its switches is called southbound inter- face (SBI). SDN can be connected to the traditional network by the westbound inter- face. According to the functionality, the SDN could be represented by three planes: the management plane, data plane, and control plane [Fig.-2.2b]. The shaded area indicates the SDN controller responsible for network control and management. The management plane is the top level of hierarchy where the network administrator de-

fines and implements the network policy. The policy, called flow table, is forwarded to the lower plane through a protocol on request. The executive nodes at the data plane execute the plan and transfer data accordingly. The following section will depict an in-depth representation of the SDN structure.

11 2.3.1 SDN Interfaces

In computing, the formal definition of an interface is the connection of two or more pieces of network equipment or protocol. A typical network interface has some form of number called a port number, node ID, or node identifier. The SDN has four interfaces: the East, West, North, and Southbound.

The software-based northbound interface (NBI) provides the abstraction op- portunity in the programming language, known as application program interface

(API) design. It is still under research [57]. The researchers are working on de- veloping a standard structure for the NBI, as currently, the NOX, OpenDaylight [58], and Onix [59] controllers all have their own, self-defined API for NBI [60, 61]. Simi- larly, commonly used NBI APIs include the NVF NBAPI [59, 62], SDMN API [63], ad-hoc API, REST-ful API, and REST API [64]. On the other hand, some estab- lished NBI research achievements include the SFNet [65], yanc [66], and PANE [67] controller.

In contrast to the NBI, the southbound interface (SBI) has a well-defined and widely accepted standard structure. The SBI is a collection of physical or virtual devices (e.g., Open vSwitch [68]), and APIs (e.g., OpenFlow, NetConf, SNMP). In addition, OVSDB [69], NetConf [70], BGP [71], and OpFlex [72] are also used in the

SBI. The SBI defines the communication protocol between forwarding devices (i.e.,

SDN switches) and their controllers, outlining the instruction sets of the forwarding devices. Its objective is to push the execution order from the management plane to the data plane through a standard protocol. OpenFlow is the most common and

12 widely used API for the SBI. OpenFlow’s journey started with version 1.0, which had

12 fixed matching fields. Now, it is running version 1.5, with over 40 matching fields and a wider range of features [Fig.-2.3]. According to it’s inventor, it is a method to implement the experimental protocol [48] by transforming a policy table, called a

flow table (described in Fig.-2.3).

The eastbound and westbound interfaces are two special case interfaces which are used for scalability. The westbound interface (WBI) is used to connect the SDN with a traditional network. WBI translates the flow table into a traditional routing path, and vice verse. SDN’s scalability demands the connection of multiple controllers together, which in turn necessitates an inter-controller communication protocol. The eastbound interface (EBI) connects multiple SDN controllers. The EBI and WBI standards are under research.

2.3.2 Software Infrastructure: SDN

SDN was developed to construct vendor-independent, user-defined, and policy-based networks. The SDN platform is structured on numerous NOS with a virtualization hypervisor. Virtualization is a consolidated technology that can create multiple vir- tual instances on a single device. A hypervisor is a software that enables virtualization by different computing resources sharing the same physical machine. The network hypervisor is a solution to support numerous applications using a single physical topology. FlowVisitor [73], FlowN [74], OpenVirteX [75], IBM SDN VE [76,77], and

NVP [62] are some popular hypervisors in SDN.

13 Management Plane x Routing Module. x Load balancing. x Firewall. x Security Module x Virtualization x Other Application. SDN Controller

North Bound Interface

Control Plane x Flow Control Table x GUI module. x Cluster Module. ---- Logical Connection x Network Hypervisor. Physical Connection

South Bound Interface

Legacy network West Bound Data Plane SDN Switches SDN Switches Interface x switches x Switches. SDN Switches x Server. x Gateway Router.

Participation Plane

(a) SDN: Physical Structure (b) SDN: Logical Structure

Figure 2.2: System architecture of SDN: (a) Physical Structure; (b) Logical Structure

A NOS is a core element of SDN architecture that generates the control signal for network configuration based on management policies defined by the network administrator. NOS is designed for both centralized and distributed controllers. The centralized controller is a single entity that controls all the network activity, whereas the distributed controller oversees the management activity. Some of the popular

NOS are listed in table-2.1, and are developed on C, C++, python, java, and Ruby platforms.

14 Table 2.1: Popular SDN NOS

NOS Architecture Programming Language Beacon [78] centralized Java Floodlight [79] centralized java HyperFlow [80] distributed C++ Onix [59] distributed python, C Mul [81] centralized C Kandoo [82] distributed C,C++, python NOX [49] centralized C++ NOX-MT [83] centralized C++ OpenDaylight [58] distributed java ONOS [53] distributed java POX [84] centralized python Ryu [85] centralized python Trema [86] centralized C, Ruby

2.3.3 Hardware Infrastructure: SDN

SDNs are composed of a controller and switches. The controller act as a human brain and the switch as an executive part (i.e., hand, feet). Fig.-2.3 represents the anatom- ical structure of SDN hardware components. For simplicity, OpenFlow considered as an SBI API. The SDN switch has three major components: the SBI API, abstraction layer, and packet processing unit. SBI makes an interface between controller and switches connected through a secure sockets layer (SSL). Flow tables are the major component of the abstraction layer. It is possible to store multiple flow tables in a switch. The flow table has match, action, and statistic fields. Moreover, the switch might have a group table for flow control and a meter table to record the log activity.

Flow entries match incoming packets in order of priority, and a matching entry triggers the actions associated with the specific flow entry. If no match is found, the

15 outcome depends on the network management policy. The packet processing unit is just a physical infrastructure to implement open system interconnection (OSI) layer-1 and layer-2. An incoming packet could match according to the source or destination

IP, input or output port, or VLAN tag, as well as any other user-specific fields.

The action could be to either forward (to single, multiple, or all exit port) or drop.

The statistic field in the flow table counts the event. The capacity of forwarding devices varies from 8000 to 1000000 flow entries. The count can include the number per table (packet match, reference count, packet lookup), per-flow (received packet, received byte, duration), per port (transferred, received and dropped packet size and number, collision), or per group (reference, packet byte count). The statistic could be collected by one request from the controller for QoS analysis. Furthermore, some of the available commercial SDN forwarding device (hard and soft) models are

CX600 series [87], MLX series [88], NoviSwitch 1248 [89], RackSwitch G8264 [90],

Z-series [91], OpenvSwitch, XorPlus [92], and contrail-vrouter [93].

The hardware anatomy Fig.-2.3 depicts the controller’s core functionality, both a northbound and a southbound API, and a few sample applications. The discovery module is triggered when a new event is initiated. The event includes connection, disconnection, and change in the network. For example, installation of a new switch will trigger the discovery module, causing the SDN controller to take necessary action to add the switch. Flow management is another important module of the controller. It is triggered when a data transfer request comes from a new end-user. This module contains the data flow policy and action against the intruder.

16 The statistics module helps to collect the QoS parameters from the switch, which is under a controller called the domain. The Big Switch Networks, Cisco, Cumulus

Networks, Hewlett Packard Enterprise, , Nuage Networks, Pica8,

Pluribus Networks, and VMware are the major vendors for SDN controllers.

•Single table OpenFlow V1.0-2009 •Static matching field

•Multiple table OpenFlow V1.1-2011 •Group table

•IPV6 support OpenFlow V1.2-2011 •Controller role exchange

•Meter table OpenFlow V1.3-2012 •Table-miss entry

SDN Controller OpenFlow •Synchronized table V1.4-2013 SSL •Egress table OpenFlow V1.5-2015 •Scheduled bundle

---- Logical Connection Physical Connection

Legacy network West Bound SDN Switches SDN Switches Interface

SDN Switches

Figure 2.3: Anatomy of SDN Hardware Components

2.4 Distributed SDN architecture

The SDN can be extended by distributed flat architecture [Fig.-2.4]. The controllers are connected by the WBI and the domains are connected with a gateway switch

17 Eastbound Interface (EBI)

----Logical Connection Physical Connection

Domain-1 Domain-3

Gateway Switch

Domain-2

Figure 2.4: Flat Architecture of SDN

(shaded mark in Fig.-2.4). The architecture is used for large-scale networking, but it has a Single Node Failure (SNF) problem. For example, if the domain-1 controller fails, the domain-1 will shut down. Hierarchically distributed architecture has two layers of SDN controllers. The lower layer controllers are responsible for their respec- tive domains, while the upper layer root controller is responsible for managing a group of domain controllers [Fig.-2.5]. The number of layers can be increased according to the demand, and the responsibility is split accordingly among the layers. With more layers, the hierarchy is not affected by the single node failure problem [82]. Moreover, the data transfer between domains is handled by the gateway switch. It is possible to have multiple gateway switches in a domain.

18 ----Logical Connection Physical Connection

Eastbound Interface (EBI) Eastbound Interface

Gateway Switch

Domain-3 Domain-1 Domain-2

Figure 2.5: Hierarchical SDN Architecture

2.5 Recent Development and Application of SDN

There is a tremendous amount of ongoing research regarding SDN directly, as well as its potential to be deployed to new area. Its numerous practical applications have already led to better networking performance. SDN is presently studied extensively in areas such as traffic engineering [94–102], data centers [103–105], security [106–

111], and wireless networking [112–114]. Network function virtualization, resource provisioning, traffic steering, and revenue modeling are the most popular research topics among the SDN development community.

19 CHAPTER III

OVERVIEW OF REINFORCEMENT LEARNING

This chapter briefly introduces artificial intelligence (AI), and ML focuses on provid- ing an overview of RL. It also provides the history, motivation, challenges, building block algorithms, and applications of RL. The chapter will also describe the advan- tages A2C method which is used in the routing algorithm.

3.1 Artificial Intelligence, Artificial Neural Networks and Machine Learning

AI is a broad field of study that includes intelligent problem solving, knowledge reasoning, learning, and decision making. A human can learn from experience and interaction with the environment and hence can act intelligently. Any system that mimics human intelligence can be considered an AI system. It can be called a simu- lation of human intelligence in a machine. The discipline was born in the summer of

1956 at Dartmouth College in Hanover, New Hampshire. In the twentieth-century, research regarding AI greatly expanded. One notable findings was AI-based vision, which is composes of symbolic system AI and connectionist AI. A symbolic system AI develops methods for representing problems with finite symbols that represent knowl- edge, which are able to be understood by humans. On the other hand, a connectionist system is a bio-inspired model that consists of large number of interconnected ho-

20 mogeneous units and weights. The latter approach is a more popular and effective solution for AI.

Supervised Learning

Semi-

Machine leaning

Reinforcement Learning

(a) Hierarchical structure of AI (b) Classification of Machine Learning

Figure 3.1: AI and ML: (a) Hierarchical Structure of AI; (b) Classification of ML algorithms

ML is a subset [Fig.-3.1a] of AI, that focuses on developing algorithms which drive the machine to make an intelligent decision. The algorithmic model is either pre- dictive, making predictions, or descriptive, extracting information from the dataset.

In the prediction model, the objective of ML are to extrapolate outcomes. According to the dataset, the ML algorithm can be classified into three major categories: super- vised, semi-supervised, unsupervised, and reinforcement learning [Fig.-3.1b]. Super- vised learning is driven by dataset (training) labeled with input and expected output.

The performance of the algorithm is commensurate to the number of events recorded in the dataset. However, dataset collection and storage is a big challenge. The un- supervised learning only use datasets that exclusively contain input. The objective

21 of unsupervised learning is to classify data clusters having similar properties. The semi-supervised learning concept is between supervised and unsupervised algorithms in that it accepts data that is partially annotated. Finally, RL focuses on how the

ML algorithm can take the best action in the current state, maximizing output. SDN routing can be considered an RL problem, where the controller will discover the best path that has a minimum cost and maximum throughput. The following section will briefly describe the RL algorithm.

An artificial neural network (ANN) is also known as neural network (NN).

The model was inspired by the popular connectionist ML approach, and first pro- posed by Warren McCulloch and Walter Pitts in 1945 [115]. From that time onward, there was several ANN model proposed until 1979, when back propagation (BP), a significant achievement in the field, was proposed by Paul Werbos [116]. Initially, high computation speed requirement cause the development of BP to lag, but a boost in computational speed, along with parallel computation model, helped BP eventually gain popularity amongst researchers. Currently, almost all ANN algorithms devel- oped are based on the BP algorithm. After the year 2000, ANN became even more popular with the introduction of the deep neural network [117, 118]. Deep NN is a technique composed of multiple nonlinear transformation modules that transfer low- level representation into a higher abstract representation. The technique makes it easier to construct the feature vector.

22 3.2 Reinforcement Learning

The development of the framework of RL was inspired by behavioral psychology to solve the sequential decision-making problem in 1984. The main idea behind RL is to train the RL agent to learn the environment by continuously taking some ac- tion and learning from the environment results [Fig.-3.2]. The process is repeated until the RL agent reaches a predefined goal which will return the maximum appre- ciation called reward. The process creates a sequence of state, action, and reward

((s0, a0,r0);(s1, a1,r1), .....(sk, ak,rk)). Primarily, RL can be classified into two cat- egories: the Model based and Model free methods [Fig.-3.3]. The model based RL employs state, action and reward to get the maximum expected reward. The RL agent knows about the environmental model. On the other hand, the model free method utilizes action and reward to maximize the expected reward in the absence of environmental knowledge. The model free RL methods are further divided into two subcategories: value based and policy based. The chapter will mostly describe the model free RL methods. Moreover, the value update scheme divides RL into an on-policy and off-policy method. In the on-policy method, the RL agent estimates the reward for state-action pairs, assuming a greedy policy was followed, whereas the off-policy considers the current policy, which to be continue for the next action.

3.2.1 History

The concept of RL was inspired in the early 1980s by the psychology of animal learn- ing. The Markovian decision processes (MDPs), tools to predict discrete stochastic

23 Action , State

th ( | ) = ܵ א௜ actionݏat i ܣ אExpectedܽ௜ Reward ܴ௜ାଵ ܧ ݎ௜ାଵ ݏ௜

Figure 3.2: Example of RL scenario

Reinforcement Learning

Model Based Method Model Free Method

Value Based Policy Based Method Method

Figure 3.3: Classification of RL

processes, were proposed for use in RL by Richard Bellman in 1957 [119]. The method provides a framework for decision making in a dynamic environment. Fur- thermore, Ron Howard introduced policy iteration in 1960 [120]. Those are the building block algorithm for RL development. The theory of Law of Effect proposed by Edward Thorndike in 1898. It discusses the aftereffect of an action. Sixty years

24 later, The first analog machine of Rl model, known as SNARCs (Stochastic Neural-

Analog Reinforcement Calculators), was proposed by Minsky et al. Around the same time, in 1960, the term reinforcement learning was formally proposed. Presently, the STeLLA [121], BOXES [122], Tsetlin machine [123], temporal-difference learn- ing, TD(λ) [124], Q-learning [125, 126], Actor-Critic [127], DDPG [128], A3C [29],

A2C [29], Rainbow [129], and DDQN [130] are the most notable achievements in RL research. Moreover, the aided RL is now popular and can solve complex high-dimensional sequential decision making problems. Some of the notable deep RL based research projects are Atari [131], AlphaGo [132], the self-driving-car [133], and robotics [134].

3.2.2 RL algorithm: Components

This subsection will describe some of the parameters that are necessary to understand the advantage actor-critic method.

State: a feature representation of a test environment at the time of any action. Symbolic representation: si=state feature at ith action which is a subset of total state si ∈ S.

Action: an event that changes the state to achieve the objectives. Symbolic representation: ai=state feature at ith action which is a subset of action set ai ∈ A.

Reward: a scalar quantity of the appreciation level of action. For example, a positive numeric reward value could be represented as a positive action. Symbolic rep- resentation: ri= reward at ith action,expected reward at ith action, Ri = E(ri+1|si)

25 Markov decision processes (MDP): A full specification of the RL prob- lem can be represented by MDP [135]. It is a decision-making framework and stochas- tic control process for the discrete process. MDP can be represented by tuple of: state, action, transition probability and reward (S,A,P,R). The transition probability is available only for a deterministic environment.

Episode: a sequence of actions from the start to some terminal state. The number of actions could be limited to a maximum by which the RL could achieve its objectives. The term applies only to a discrete environment. In a continuous system, the action will continue forever.

State Transition Probability: a probability to change from a given state to a next state. State transition probability is available for the deterministic model where all the model parameters are known. In the case of a stochastic environment, the parameters are unknown. Symbolic representation: P [si+1|si, ai].

Discount Factor: a numeric present value of future reward ranging between

0 to 1. The factor limits the reward value in an infinite state action reward space.

Symbolic representation: γ ∈ [0, 1].

Return: total discounted reward from ith action. Symbolic representation:

+∞ k Gi = ri+1 + γri+2 + ...... = k=0 γ ri+k+1, where γ is a discounted factor.

Value Function: aP numerical value that measures the benefit of each state and action. Symbolic representation: V (s)= E[Gi|si = s] ∀s ∈ S.

26 Policy: a behavior or strategy of an agent. The policy could be stochastic or deterministic. Symbolic representation: deterministic policy, a = π(s) π(s): S →

A; stochastic policy, π(a|s)= P [a|s] ∀s ∈ S, ∀a ∈ A.

ε Greedy Policy Function: an algorithm that selects the action. The epsilon value determines how much time the RL agent will run with a trial and error basis. The pseudocode for the ε greedy algorithm is given in Algorithm-1. The main idea is to select a random action (exploration) at the beginning of the time and increase it by iterations to select the most probable action (exploitation).

Algorithm 1: ε greedy algorithm input : ε, maxiter, A, probA output: actionarray(1×maxiter) 1 for i ← 0 to maxiter do 2 temp = rand(); 3 if temp < ε then 4 actionarrayi = randomaction(A); 5 else 6 actionarrayi = returnmaxaction(A, probA) 7 ε = ε × i

State Value Function: the expected reward starting at state s and follow- ing policy π. The maximum State Value Function is also known as the optimal value function. Symbolic representation: state value function, vπ(s)= Eπ[Gi|si = s] ∀s ∈

S ; optimal state value function, v∗(s) = maxπ vπ(s) ∀s ∈ S.

Action Value Function: the expected return from state s and action a by following policy π. The optimal action value function is the maximum action value function. Symbolic representation: action value function, qπ(s, a)= Eπ[Gi|si =

27 s, ai = a] q : S ×A → R; optimal action value function, q∗(s, a) = maxπ qπ(s, a) q :

S × A → R.

Policy Gradient [136]: an optimization method that uses the gradient descent method to optimize the policy for the maximum expected reward. This helps the policy iteration method to excavate the optimal policy. Policy iteration has two phases: policy evaluation and updates to the NN parameter. The partial derivative of a stochastic policy of trajectory τ can be expressed as:

▽θπθ(τ) ▽θπθ(τ)= πθ(τ) = πθ(τ) ▽θ log πθ(τ) (3.1) πθ(τ)

where for a continuous space, the policy of trajectory τ with given NN param-

T eter vector θ is , πθ(τ)= πθ(s0, a0,s1, a1, ...., sT , aT )= p(s1) i=1 πθ(ai|si)p(si+1|si, ai).

Consider the objective function is J(θ) which can be definedQ as:

J(θ)= E[r(τ)] = πθ(τ)r(τ) (3.2) Z

The maximization of the objective function requires finding the optimal value of the network parameter θ. The gradient of objective function is:

▽θJ(θ)= ▽θπθ(τ)r(τ)dτ = πθ(τ) ▽θ log πθ(τ)r(τ)dτ = E[▽θ log πθ(τ)r(τ)] Z Z (3.3)

According to the definition of policy function πθ over a trajectory τ and N number episode, the expectation can be approximated to an equation with maximum log

28 likelihood as bellow:

N T T 1 ▽ J(θ) ≈ ▽ log π (a |s ) r(s , a ) (3.4) θ N θ θ i,j i,j i,j i,j i=1 j=1 ! j=1 ! X X X maximum log likelihood | {z } The maximum log-likelihood measures how likely the trajectory is under the current policy. The likelihood of a policy is increased if the trajectory results in a high positive reward. For a negative reward, the process is reversed. The model parameters will be updated to increase the likelihood of trajectories that move higher.

As a result, Eq.-3.4 is used to evaluate the policy. The policy parameter(θ) of the

NN will be updated with the learning rate (α ∈ [0, 1]) by the following equation:

θ ← θ + α ▽θ J(θ) (3.5)

The large number of possible actions in a stochastic policy leads to a higher variance in the general policy gradient method. There is a slightly modified version in which the reward is subtracted from a baseline b(i, j) function to solve the high variance problem. The policy gradient with baseline function can be represented by following the equation:

N T 1 ▽ J(θ) ≈ ▽ log π (a |s )(r(s , a ) − b(i, j)) (3.6) θ N θ θ i,j i,j i,j i,j i=1 j=1 ! X X

29 Value Iteration: the optimal value function and the best policy extraction.

The method is simple but requires more computation then the policy iteration. The value iteration can be represented by the following mathematical equation:

Vπ(si)= π(ai|si) p(si+1,ri|si, ai)[ri + γV (si+1)] (3.7) ai si ,ri X X+1

For a larger number of iteration(i): Vπ ≈ V∗

3.2.3 RL algorithm: MC Method

MC method is a simple model free RL method. The method learns the state value vπ(s) under the policy π by using average return at the end of the episode. MC method can be first-visit or every-visit. The first-visit MC method updates the return

(si) the first time it’s encountered in an episode, whereas every-visit calculates every encounter si. The pseudocode for every-visit MC method is given in Algorithm- 2.

3.2.4 RL algorithm: TD

Temporal difference (TD) is a slightly modified version of the MC method. In the MC method, the RL agent learns at the end of the episode. In TD, it learns after a certain episode interval. This technique, called bootstrapping, accelerates the performance of the MC method. There are two common versions of TD that are widely used:

TD(0) and TD(λ). In TD(0), the agent learns after each episode. It learns after λ step for TD(λ). The pseudocode for tabular TD(0) is given in Algorithm- 3. The δ

30 Algorithm 2: Every-Visit MC algorithm-(on-policy) input : policy: π,maxepisode

output: Optimal State Value: v∗

1 statecount(s) ← 0 for all s ∈ S; 2 return(s) ← 0 for all s ∈ S; 3 V ← 0 for all s ∈ S; 4 for e ← 1 to maxepisode do

5 Generate episode by π → s0, a0,r0,s1, a1,r1, ...... , sT , aT ,rT ; 6 Initialize return,G ← 0; 7 for i ← T − 1 to 0 do

8 G ← G + ri+1;

9 return(si) ← return(si)+ Gi;

10 statecount(si) ← statecount(si) + 1;

return(s) 11 ∗ v (s) ← statecount(s) fors ∈ S; 12 Return v∗; is called TD error or TD target, which is used in the actor-critic method(this will be described in the next section).

3.2.5 RL algorithm: Actor-Critic Method

The AC method is the building block of modern RL algorithms like asynchronous advantage actor-critic (A3C) and proximal policy optimization (PPO). This method combines the value and policy-based RL methods into a single algorithm[Fig.-3.4a].

This makes the AC method faster in a continuous stochastic environment. The struc- ture of the actor-critic is shown in Fig.-3.4. It consists of two parts, the actor and critic, where both interact back and forth to converge on an optimal solution. The actor, policy iteration network, is responsible for taking action and updating the

31 Algorithm 3: Tabular TD(0)-(on-policy) input : policy: π,maxepisode

output: Optimal State Value: v∗

1 α,γ ∈ [0, 1]; 2 V ← randomvalue except V (terminal) ← 0; 3 for e ← 1 to maxepisode do

4 Initialize state s0, i ← 1; 5 while not terminal state do

6 Generate action ai−1 by: π,si-1;

7 Apply ai−1 and get : ri,si;

8 V (si-1) ← V (si-1)+ α [ri + γV (si) − V (si-1)];

δ 9 s -1 ← s i i; | {z } 10 i ← i + 1;

11 Return: v∗ ← V ; network parameter θ by the policy gradient method. The critic evaluates the effec- tiveness of the actor network’s action by the TD method. The critic can use the action value or state value iteration method. The overall process is: According to current policy the actor generates an action a then applies it to the environment.

The environment returns information about the reward and the next state. The critic evaluates the value of the state by using the equation in line-8 Algorithm-3, and sends TD error(δ) as feedback to the actor network. The actor network uses δ from the critic and updates the NN parameter by policy gradient. The overall process is shown in Fig.-3.4b.

32 Critic

() ← () + [ + () − ()]

Reward Value Function Temporal Difference

← [ + () − ()]

STATE STATE

Value Based Actor Critic Policy Based Action ENVIRONMENT Policy Function Action Selection

Actor

(a) Concept of Actor-Critic (b) Structure of Actor-Critic

Figure 3.4: Actor-Critic: (a) Concept of AC; (b) Structure of AC

3.2.6 RL algorithm: Advantage Actor Critic Algorithm (A2C)

Advantage actor-critic (A2C), proposed in [29], includes the advantage function in policy gradient calculation, and replaces the value calculation with value iteration

NN in AC. It adds the advantage function to the calculation in Eq.-3.6 to reach the following equation:

N T 1 ▽ J(θ) ≈ ▽ log π (a |s )(A (s , a )) (3.8) θ N θ θ i,j i,j w i,j i,j i=1 j=1 ! X X

Advantage function Aw(si,j, ai,j) or TD error δ at episode i and jth action could be defined by:

Aw(si,j, ai,j)= ri,j + γV (si,j+1) − V (si,j)= δ (3.9)

33 The overall A2C algorithm is shown as pseudocode in Algorithm-4.

Algorithm 4: One-Step Advantage Actor Critic-A2C input : αw,γ,αθ, maxepisode,πθ

output: optimized NN for: v∗,π∗

1 Initialize the actor network parameter: θ ∈ Rd[0, 1]; 2 Initialize the critic network parameter: w ∈ Rd[0, 1]; 3 Initialize state value: V ← randomvalue except V (terminal) ← 0; 4 for e ← 1 to maxepisode do

5 Initialize state s0, i ← 1; 6 while not terminal state do

7 Generate action ai−1 by: πθ,si-1;

8 Apply ai−1 and get : ri,si;

9 δ = ri + γV (si,w) − V (si−1,w); i 10 w ← w + αwγ δ ▽w V (si−1,w); i 11 θ ← θ + αθγ δ ▽θ (log πθ(ai−1|si−1)δ);

12 si-1 ← si; 13 i ← i + 1;

The method can be enhanced by injecting deep NN concepts. For the deep

A2C method, there is an additional NN, called the critic network. The main difference is that instead of taking the value of next state V (si,w), the critic NN predicts the next state value (Vˆ (si,w)), then uses the predicted value in the equation stated in line

9, 10 Algorithm-4. Finally, the critic network optimizes the NN parameter to achieve

V (si,w) ≈ Vˆ (si,w)).

34 3.3 Reinforcement Learning: Applications

RL with deep learning techniques is superior; in some cases, it exceeds human in- telligence. Numerous researchers have explored RL’s application in different areas.

Some of the major applications of RL are robotic control [137–143], software engineer- ing [144–150], communication networks [151, 152], cyber-security [153, 154], business strategy planning [155,156], UAV and aircraft control [157–162], smart grids [163,164], transportation and self-driving cars [165–173], health care [174], and computer gam- ing [175–177].

35 CHAPTER IV

QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED

NETWORK

4.1 Introduction

The block-level representation of the proposed RL-driven QoS-aware routing algo- rithm is illustrated in Fig.-4.1. The algorithm comprises two main components: con- tinuous QoS monitoring (CQM), and RL-based Intelligent Routing Decision Making

(RIRD). The CQM block measures the link-QoS parameter continuously and gives feedback to the RL agent. The RL based intelligent decision-making block selects the best route accordingly. The overall process is represented in Fig.-4.2.

Continuous QoS Monitor RL based IR Decision making (CQM) (RIRD)

SDN Controller

Source Destination

Physical connection Logical connection Figure 4.1: Block Representation of RIRD

36 Find possible path to reach source to destination

Reduce the number of path by a factor (limit the maximum number of hop)and take certain number of Path request received by controller path

Select a path (action) by ε greedy policy Routing decision made by A2C

Send data using selected path Install flow to the OpenvSwitch according to the path calculated by A2C Calculate the reward

Update: Actor and Critic Weight 6/3/2019

Figure 4.2: Flow Chart of QoS-aware Intelligent Routing

The process starts with path requests from an end user to the controller for network service. There are many possible path for reaching destination node.

Some paths are not viable and some have a large number of processing nodes, which leads to a huge nodal processing delay. To get rid of this problem, the controller calculates all possible paths and sorts out some viable paths. Then, the objective of the RL-based routing method is to identify the best route for data transfer. In the exploration period, the RL agent selects paths by ǫ greedy method, and accordingly sets up a flow to send the data. The next stage is the evaluation section. The RL agent collects the reward value according to the policy. In RIRD, the RL will get

37 a maximum reward +r if and only if it can select the best path. However, it is possible to relax this restriction. In that case, the RL might converge to some path that has a tolerable QoS. In the next stage, the controller installs the flow to the

SDN switches, according to the routing decision. The process is repeated, and after a certain iteration, the ǫ value becomes small and the RL agent exits the exploration, entering into the exploitation stage. The network converges to the optimal policy if the number of iterations is large. The next section will give a big picture of the methodology.

4.2 Design of QoS-Aware Intelligent Routing

4.2.1 Continuous Quality-of-Service Monitoring

Network monitoring systems are mainly classified as active and passive. The ac- tive monitoring system collects the status of the network periodically by sending a query message after a certain interval. The real-time monitoring technique creates an additional payload to the network. The passive method uses a mathematical model instead of sending a query packet after a regular interval. Though the passive method will not introduce an extra packet into the network, it is difficuit to build a mathe- matical function, and the accuracy is less efficient compared to the active monitoring scheme. In this proposed method, the active monitoring scheme is applied to monitor the delay, and packet loss.

38 Make link layer probe packet which Make 20 link layer probe packet payload contain the information which payload contain the about current time stamp ,src and dst information about src and dst DPID DPID

Send probe packet to appropriate Send probe packet to appropriate switches switches

Send probe packet probeSend wait for 25000 ms packet probeSend wait for 500 ms

Receive the probe packet and Receive the probe packet and count calculate the difference between the the number of packet sending and receiving time

Calculate the ‘edge_pkt_loss_mat’ Calculate the ‘edge_delay_mat’ Receive probe packet Receive probepacket Receive

(a) (b)

Figure 4.3: Active Monitoring QoS in SDN: (a) Packet Loss; (b) Link Delay

As shown in Fig.-4.3b, to calculate the delay, a simple Ethernet probe packet with a unique source port (0x5577), with a payload of generation time and source- destination data-path identification (DPID) is sent through the connected OpenFlow switches repeatedly after 500ms. The delay of the individual link is calculated based on the method described in [178] and stored in edge delay mat − EDM. Mathemat- ically, link delay can be expressed as:

\[
\mathrm{LinkDelay}(s_1, s_2) = \tau_{\mathrm{total}} - \frac{\tau_{s_1}}{2} - \frac{\tau_{s_2}}{2} - C \tag{4.1}
\]

where,

τ_total: total travel time of the probe packet from the controller to the switch pair and back to the controller.

τ_s1: round-trip time between the controller and switch-1.

τ_s2: round-trip time between the controller and switch-2.

C: calibration error (a processing delay of the host machine, which depends on its processing power).
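As a small illustration of Eq.-(4.1), the sketch below computes one link delay from the three probe measurements; the function and variable names are hypothetical and only mirror the quantities defined above.

    def link_delay(tau_total, tau_s1, tau_s2, calibration_error=0.0):
        """Eq.-(4.1): one-way link delay between a switch pair.

        tau_total: controller -> s1 -> s2 -> controller travel time of the probe (ms).
        tau_s1, tau_s2: controller <-> switch round-trip times (ms).
        calibration_error: host-machine processing delay C (ms).
        """
        return tau_total - tau_s1 / 2.0 - tau_s2 / 2.0 - calibration_error

    # Example with illustrative (made-up) measurements in milliseconds:
    # link_delay(12.4, 3.0, 3.6, calibration_error=0.2)  -> 8.9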

Similarly, the packet loss rate (PLR) of a link is determined every 2500 ms from the number of probe packets sent by the SDN controller through the link and the number received [Fig.-4.3a]. In this experiment, 20 probe packets are sent per link (i.e., per switch pair), using 0x5599 as the source port number. The calculation is defined in Eq.-4.2. Finally, the individual PLR values are saved into edge_pkt_loss_mat (EPLM).

\[
\mathrm{PLR} = \left(1 - \frac{\text{Number of packets received}}{\text{Number of packets sent}}\right) \times 100\% \tag{4.2}
\]
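Eq.-(4.2) can likewise be computed directly from the probe counters; the sketch below uses the 20 probe packets sent per switch pair in this experiment, with hypothetical function and variable names.

    def packet_loss_rate(packets_sent, packets_received):
        """Eq.-(4.2): packet-loss rate of a link, in percent."""
        return (1.0 - packets_received / float(packets_sent)) * 100.0

    # With 20 probe packets per switch pair: packet_loss_rate(20, 18) -> 10.0 (%)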

4.2.2 RL-based Intelligent Routing Decision Making

RL-based intelligent routing decision making (RIRD) is the intelligent block that repeatedly interacts with the dynamic networking environment by exchanging three signals: state, action, and reward. The objective of RIRD is to formulate the optimal routing policy by mapping from the set of states and actions so as to maximize the reward. By applying the model-free off-policy A2C algorithm (described in Ch-3), the RL agent gradually acquires knowledge about the relationship between these three signals by iteratively updating the weights of the actor and critic NNs.

Algorithm 5: RL-based intelligent routing decision-making (RIRD)

input : Network topology G(V, E) with EDM_{1×E} and EPLM_{1×E}
output: Optimal routing policy π*_{s,d}

1:  while path request from source s and destination d do
2:      if first-time request from the (s, d) pair then
3:          take N path candidates: P_{s,d} (⊂ P_T) = [P_1, P_2, ..., P_N]
4:          Initialize:
5:              p = [p_1, p_2, ..., p_N]
6:              D = [d_1, d_2, ..., d_N]
7:              L = [l_1, l_2, ..., l_N]
8:              O = D + L × ŵ
9:              V_{s,d} ← normalized(O)
10:             γ ∈ [0.9, 1]
11:             A2C parameters θ, w and learning rates α_θ, α_w
12:         start executing from step 14
13:     else
14:         Feed O to the actor NN: select a path P_i from P_{s,d} with the ε-greedy method
15:         Calculate the observation o_i, packet loss l_i, and delay d_i of path P_i from EDM and EPLM by Eq.-(4.4)
16:         Update: O_i = o_i; L_i = l_i; D_i = d_i
17:         Calculate QoD_i by Eq.-(4.5) and update QoD
18:         if QoD_i = max{QoD} then
19:             r_i = +r
20:             i ← 1
21:             episode = done
22:         else
23:             r_i = −r
24:         δ ← r_i + γ V̂(s_{i+1}, w) − V(s_i, w)
25:         w ← w + α_w δ ∇_w V(s_i, w)
26:         θ ← θ + α_θ δ ∇_θ log p_i

The action is defined by the probability of selecting a path from the candidate set, and the state is represented by the traffic matrix (delay and packet-loss rate). Finally, the reward value of an action is quantified from the current action and the status of the traffic matrix. For example, if the RL agent selects the path with the lowest delay and packet-loss rate, it receives the highest reward value.

Algorithm-5 gives a comprehensive view of the process. The network topology is represented as an undirected graph G(V, E), where E is the set of edges and V is the set of vertices of G. An edge represents a physical link that connects a pair of network nodes (in this case, an SDN switch pair), whereas a vertex denotes a processing node (i.e., an SDN switch). The routing path P is modeled as a path that guides the data packets from the source host to the destination host and consists of a set of edges e ∈ E. Routing optimization can then be characterized as a discrete-state, discrete-action RL problem. Let P_T be all possible combinations of routing paths for a specific source-destination pair (src, dst). However, not all of P_T is sensible for data communication; for example, in Fig.-5.1, if the source is switch-0 and the destination is switch-4, then the path [0, 1, 3, 7, 10, 11, 9, 4] is valid but not viable for effective communication. To address this issue, the RIRD method takes a set of N candidate paths P_{s,d}, selected from the total P_T available paths based on the number of nodes.
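One possible way to realize this candidate selection is sketched below with networkx (the library already used for the Dijkstra baseline in Chapter V); the helper name and the small topology fragment are illustrative assumptions, not the exact implementation.

    import itertools
    import networkx as nx

    def candidate_paths(graph, src, dst, n):
        """Return up to n loop-free src->dst paths, preferring paths with fewer nodes.

        Mirrors the candidate selection in Algorithm 5: enumerate simple paths and
        keep the n paths with the smallest number of processing nodes.
        """
        # shortest_simple_paths yields simple paths in order of increasing hop count
        # on an unweighted graph, so taking the first n already favors short paths.
        return list(itertools.islice(nx.shortest_simple_paths(graph, src, dst), n))

    # Illustrative fragment of the Fig.-5.1 topology (switch IDs as nodes):
    G = nx.Graph()
    G.add_edges_from([(0, 1), (0, 6), (1, 5), (1, 3), (5, 7), (6, 8), (7, 10), (8, 10)])
    # candidate_paths(G, 0, 10, 4) -> [[0, 6, 8, 10], [0, 1, 5, 7, 10]]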

Algorithm 5 is triggered when a path request arrives from an (src, dst) pair. The first time a request is made, the SDN controller initializes several parameters, including the N path candidates P_{s,d}; the path-selection probabilities p, drawn from U[0, 1] and normalized so that Σp = 1; the path-delay matrix D; the packet-loss matrix L, which is initialized as a zero matrix; the observation matrix O, which is calculated based on Eq.-(4.3); the weight parameters of the actor and critic NNs, which are initialized randomly; the normalized observation, which is assigned as the value of the states V; and the learning rates of the actor and critic NNs, α_θ and α_w, respectively.

\[
O_{1\times N} = D_{1\times N} + L_{1\times N} \times \hat{w} \tag{4.3}
\]

where ŵ is a scalar set to 100 to magnify the packet loss relative to the delay.

In the next stage, the RL agent selects a path P_i from P_{s,d} by using the ε-greedy method, installs the flow on the candidate switches associated with the selected path P_i, subsequently observes the packet-loss rate l_i and delay d_i, calculates the observation o_i of the path P_i according to Eq.-(4.4), and updates O_i, L_i, and D_i.

\[
l_i = \prod_{n=1}^{x} \hat{p}_n, \qquad
d_i = \sum_{n=1}^{x} \hat{d}_n, \qquad
o_i = d_i + l_i \times \hat{w} \tag{4.4}
\]

where p̂_n is the packet-loss rate associated with Link-n, and x is the number of links along the path P_i. For example, in Fig.-5.1, the overall packet loss of the shortest path between switch-0 (S0) and switch-3 (S3) is the product of the link losses of S0-S1 and S1-S3. The term d̂_n is the delay of Link-n, so the path delay is formulated as the sum of the associated link delays. As stated above, the quality of the RL agent's action is determined by the reward function.
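To make Eq.-(4.4) concrete, the sketch below aggregates per-link measurements into the path-level loss, delay, and observation; the names and example values are illustrative assumptions.

    import math

    def path_observation(link_losses, link_delays, w_hat=100.0):
        """Eq.-(4.4): aggregate per-link measurements of one candidate path."""
        l_i = math.prod(link_losses)   # path loss: product of the link loss rates
        d_i = sum(link_delays)         # path delay: sum of the link delays
        o_i = d_i + l_i * w_hat        # observation, consistent with Eq.-(4.3)
        return l_i, d_i, o_i

    # Two-link example (S0-S1 and S1-S3 in Fig.-5.1) with illustrative values:
    # path_observation([0.10, 0.15], [30.0, 50.0]) -> (0.015, 80.0, 81.5)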

The quality of the routing path selected by A2C is quantified by the term "Quality of Delivery" (QoD). The QoD of path P_i is formulated by the following equation:

\[
QoD_i = (1 - \tilde{d}_i) \times w_1 + (1 - l_i) \times w_2 \tag{4.5}
\]

where d̃_i is the normalized delay of path P_i, and w_1 and w_2 are the weights for the delay and the packet-loss rate, respectively. According to Eq.-(4.5), smaller values of the delay d_i and the packet-loss rate l_i lead to a higher QoD_i value (better QoS). The reward of an action is calculated by comparing QoD_i with the most recent QoD values. If QoD_i is the highest among all the values in QoD for a certain action, the RL agent gets the positive reward +r; otherwise, it gets the negative reward −r. In the final stage, the RL agent updates the A2C parameters (θ, w). An episode is considered complete if the RL agent selects the path that has the highest value in QoD, or if it has tried for a predefined maximum number of trials (maximum episode step).
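As a small worked illustration of Eq.-(4.5) and the reward rule, consider the sketch below; the equal weights w1 = w2 = 0.5 and the helper names are assumptions, while the reward magnitudes +r = 30 and −r = −1 are the values used later in Section 5.1.

    def qod(norm_delay, loss_rate, w1=0.5, w2=0.5):
        """Eq.-(4.5): Quality of Delivery from normalized delay and packet-loss rate."""
        return (1.0 - norm_delay) * w1 + (1.0 - loss_rate) * w2

    def reward(qod_i, qod_values, r=30.0):
        """+r if the selected path currently has the best QoD in the record, else -1."""
        return r if qod_i >= max(qod_values) else -1.0

    # Example: qod(0.2, 0.05) -> 0.875; reward(0.875, [0.70, 0.875, 0.60]) -> 30.0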

CHAPTER V

SIMULATION RESULTS: QoS-AWARE INTELLIGENT ROUTING FOR

SOFTWARE DEFINED NETWORK

This chapter analyzes the simulation results of the RL-based routing approach in single-controller SDN and multi-controller SDN for different cases. In the test-systems, Mininet [179] was used to build the SDN topology, the POX [84] platform was applied to implement the OpenFlow 1.0-based SDN controller, and TensorFlow was leveraged in the Python 2.7 environment to implement the RIRD component of the proposed method. Additionally, the CloudLab [180] cluster was adopted to implement the multi-controller SDN simulation.

5.1 Test-bed Setup: Single Controller

The topology of the single-controller SDN used for the simulation is depicted in Fig.-5.1; it comprises 12 Open vSwitches and 24 full-duplex links (degree of connectivity > 3) with various link capacities. Each circle represents an Open vSwitch managed by a single controller (not shown in the figure); thus, all switches are in the same domain. The networking environment contains six types of links, L1 to L6. The parameters of the link types L4, L5, and L6 are fixed as detailed in Fig.-5.1, whereas the parameters of the links L1, L2, and L3 vary depending on the case study, as listed in Table-5.1.

[Fig.-5.1 shows the 12-switch topology with the source (S) and destination (D) hosts; fixed link parameters in the legend: L4 = 100 Mbps, 1 ms, 0% loss; L5 = 10 Mbps, 30 ms, 0% loss; L6 = 1 Mbps, 100 ms, 0% loss.]

Figure 5.1: Single-controller SDN topology.

Table 5.1: Link parameters for different cases

Scenario           Link parameters (bandwidth, link delay, packet loss)
Case-I             L1 = 100 Mbps, 1 ms, 0%;   L2 = 10 Mbps, 30 ms, 10%;   L3 = 1 Mbps, 50 ms, 15%
Case-II            L1 = 10 Mbps, 30 ms, 0%;   L2 = 10 Mbps, 30 ms, 10%;   L3 = 10 Mbps, 30 ms, 15%
Case-III           L1 = 0.5 Mbps, 30 ms, 0%;  L2 = 0.1 Mbps, 80 ms, 0%;   L3 = 0.1 Mbps, 80 ms, 0%
Case-IV            L1 = 100 Mbps, 1 ms, 0%;   L2 = 10 Mbps, 30 ms, 0%;    L3 = 1 Mbps, 50 ms, 0%
Case-V             Multi-controller simulation

The source host (Host-1) is connected to Switch 0, and the destination host (Host-2) is connected to Switch 10.

Moreover, the necessary parameters for RIRD (Algorithm-5) are as follows: N = 20, α_θ = 0.001, α_w = 0.01, +r = 30, −r = −1, maximum episode step = 10, ε = 1.0, and epsilon decay = 0.999. ε decays over time and is reset to 0.9 if ε < 0.01. Both the actor and critic NNs comprise one hidden layer with 20 units, the ReLU activation function, and the Adam optimizer. The output layer of the actor NN adopts the Softmax activation function to obtain the action probabilities. The selection of NN hyperparameters is a challenging task; the learning rates mentioned above are those generally used for Atari-game simulations, and the number of hidden layers is kept small for faster operation. This simulation illustrates a different perspective on intelligent SDN routing; however, the performance evaluation for different combinations of NN hyperparameters is listed as one of the future works.
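For concreteness, the sketch below shows a TensorFlow 2/Keras version of the actor and critic networks just described (one 20-unit ReLU hidden layer, Softmax actor output, Adam optimizers with the learning rates above), together with a single-step A2C update corresponding to lines 24-26 of Algorithm 5. The thesis implementation used TensorFlow under Python 2.7, so this modernized form and its function names are assumptions rather than the original code.

    import tensorflow as tf

    N = 20  # number of candidate paths (observation and action dimension)

    # Actor: one hidden layer of 20 ReLU units, Softmax over the N candidate paths.
    actor = tf.keras.Sequential([
        tf.keras.layers.Dense(20, activation="relu", input_shape=(N,)),
        tf.keras.layers.Dense(N, activation="softmax"),
    ])
    # Critic: same hidden layer, a single state-value output.
    critic = tf.keras.Sequential([
        tf.keras.layers.Dense(20, activation="relu", input_shape=(N,)),
        tf.keras.layers.Dense(1),
    ])
    actor_opt = tf.keras.optimizers.Adam(learning_rate=0.001)   # alpha_theta
    critic_opt = tf.keras.optimizers.Adam(learning_rate=0.01)   # alpha_w

    def a2c_update(obs, next_obs, action, reward, gamma=0.95):
        """One advantage actor-critic update for a single transition (Alg. 5, lines 24-26)."""
        obs = tf.reshape(tf.convert_to_tensor(obs, tf.float32), (1, N))
        next_obs = tf.reshape(tf.convert_to_tensor(next_obs, tf.float32), (1, N))
        with tf.GradientTape(persistent=True) as tape:
            v = critic(obs)[0, 0]
            v_next = critic(next_obs)[0, 0]
            delta = reward + gamma * tf.stop_gradient(v_next) - v   # TD error
            critic_loss = tf.square(delta)                          # moves w toward the target
            log_prob = tf.math.log(actor(obs)[0, action] + 1e-8)
            actor_loss = -tf.stop_gradient(delta) * log_prob        # policy-gradient step
        critic_opt.apply_gradients(
            zip(tape.gradient(critic_loss, critic.trainable_variables), critic.trainable_variables))
        actor_opt.apply_gradients(
            zip(tape.gradient(actor_loss, actor.trainable_variables), actor.trainable_variables))
        del tape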

To demonstrate the operation of the proposed routing algorithm, flows are removed from the switches immediately after the data packets are delivered. In practical implementations, the flows remain installed until a hard timeout or idle timeout, which adds additional time to the algorithm's convergence; in that case, the method can be applied after each timeout. Additionally, a custom-sized ICMP (ping) packet is sent from Host-1 to Host-2 to record the end-to-end QoS. The ICMP echo request is answered by an echo reply, so the communication is bidirectional. The link delay, bandwidth, and packet loss are generated by setting the bandwidth, delay, and loss parameters of the net.addLink() method in Mininet. These values are theoretical; the actual delay and packet loss are not constant, and the host network interface card (NIC) and processing power also affect the simulation.

The average and standard deviation of the simulated packet loss are reasonably close to the theoretical values, but the measured delay is slightly higher than the expected theoretical value.

The link bandwidths of L1, L2, and L3 are listed in Table-5.1.
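The emulated links themselves can be created with Mininet traffic-control (TC) links; a minimal sketch using the Case-I values of L1-L3 from Table-5.1 is given below. The two-switch chain and host attachments are purely illustrative and do not reproduce the full Fig.-5.1 topology.

    from mininet.net import Mininet
    from mininet.link import TCLink

    net = Mininet(link=TCLink)   # TCLink enables bw/delay/loss shaping
    net.addController('c0')
    h1, h2 = net.addHost('h1'), net.addHost('h2')
    s1, s2 = net.addSwitch('s1'), net.addSwitch('s2')

    # Case-I parameter sets from Table-5.1, applied to three illustrative links:
    net.addLink(h1, s1, bw=100, delay='1ms', loss=0)    # L1 values
    net.addLink(s1, s2, bw=10, delay='30ms', loss=10)   # L2 values
    net.addLink(s2, h2, bw=1, delay='50ms', loss=15)    # L3 values

    net.start()
    net.pingAll()   # ICMP echo between the hosts, as in the end-to-end QoS test
    net.stop()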

Table 5.2: Performance comparison between proposed method and Dijkstra algorithm-based method

Cases              Proposed Method                 Dijkstra
                   delay (ms)     packet loss      delay (ms)     packet loss
Case I             174.560        0.80%            282.874        18.40%
Case II            269.722        1.08%            286.193        18.51%
Case III           1028.102       13.25%           --             --
Case IV            115.922        0%               216.392        0%
Multi-controller   823.961        8.20%            993.064        31.2%

The performance of the routing algorithm for a single-controller SDN is evaluated considering three normal scenarios, Cases I, II, and IV, and one congestion scenario, Case-III, as detailed in Table-5.1. The scalability of the algorithm is also tested in the multi-controller environment (Case-V); its link parameters are discussed in the Case-V subsection. Notably, comparing routing strategies is challenging because existing methods evaluate performance in manifold, algorithm-dependent ways. For example, in supervised learning, method [12] used MSR over training epochs, whereas method [7] used the accuracy of the training and current network over the delay. In the case of RL, methods [15] and [16] used traffic intensity versus delay; method [5] employed traffic demand versus delay, throughput, and utility; and method [3] applied episodes versus the number of hops for different time-to-live values. These sundry, algorithm-dependent performance evaluations of the existing methods, stated in Chapter 1, lead me to leave a broader performance comparison as future work.

5.1.1 Case-I

Case-I stands for the one-best-path simulation, which means there is only one best path from Host-1 (H1) to Host-2 (H2). The source is H1 and the destination is H2, and the link parameters are set according to Case-I in Table-5.1. As the parameter N = 10, the RL agent evaluates the best path among 10 possibilities. The legends of Fig.-5.2 and 5.3 show all possible paths from H1 to H2; for example, the 10 possible paths from H1 to H2 are [[0, 6, 8, 10], [0, 6, 9, 11, 10], ..., [0, 2, 6, 8, 11, 10]]. In the rest of the description, the paths P1, P2, and P3 represent [0, 1, 5, 7, 10], [0, 1, 3, 7, 10], and [0, 6, 8, 10], and P1′, P2′, and P3′ represent the corresponding reverse directions. According to the SDN topology [Fig.-5.1], the best path for this source-destination pair is [0, 1, 5, 7, 10], as it has the least delay and no packet loss. As previously mentioned, the ping exchange is bidirectional, and the RL is applied to both directions.

The simulation results for Case-I are shown in Fig.-5.2, 5.3, and 5.4. From Fig.-5.2 and 5.3, it can be observed that the probability of selecting paths P1 and P1′ increases over the episodes. The increase indicates that the RL agent receives a higher reward for this path selection, which can be verified by the normalized reward graphs in Fig.-5.4a and b. In RL, an increasing episode number indicates that the agent tends toward exploitation rather than exploration. However, the proposed method never stops exploring, due to the ε-greedy method; additionally, it accelerates exploration if ε < 0.01.

Figure 5.2: Probability of path selection for Case-I: From Host-1 to Host-2

Figure 5.3: Probability of path selection for Case-I: From Host-2 to Host-1

Figure 5.4: Performance of RL agent for Case-I: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2

The logic of the exploration sometimes leads to a negative reward and also introduces some notches in the graphs. However, exploration after convergence makes the algorithm more dynamic in case of a network disaster (e.g., congestion), which will be illustrated in the Case-III simulation. The proposed method reaches a fairly conclusive routing decision at around 120 episodes [Fig.-5.2, 5.3]. Furthermore, the end-to-end delay is compared with one of the popular shortest-path algorithms, Dijkstra [181]. Under the Dijkstra algorithm, the routing path is selected according to the hop count, so the path with the fewest nodes (in this case SDN switches), P3, is selected for data transfer. For this baseline, the Python method networkx.dijkstra_path() is used. The end-to-end delay comparison graph in Fig.-5.4 indicates that the proposed intelligent routing strategy outperforms Dijkstra. The initialization time of the proposed model is around 1500 ms at the beginning of the simulation, which is acceptable since the performance benefit holds over the long run. The performance comparison in terms of average delay and packet-loss rate is also detailed in Table-5.2: the delay is 174.560 ms and the packet-loss rate is 0.80% with the RL-based method, whereas the delay is 282.874 ms and the packet-loss rate is 18.40% with the Dijkstra-based method.
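For reference, the Dijkstra baseline described above can be reproduced with networkx roughly as follows; the graph is again only the small illustrative fragment used earlier, not the full Fig.-5.1 topology.

    import networkx as nx

    # With no 'weight' attribute every edge counts as 1, so dijkstra_path()
    # returns the minimum-hop route, i.e. the fewest-switch path used as the baseline.
    G = nx.Graph()
    G.add_edges_from([(0, 1), (0, 6), (1, 5), (1, 3), (5, 7), (6, 8), (7, 10), (8, 10)])
    baseline = nx.dijkstra_path(G, 0, 10)
    # baseline -> [0, 6, 8, 10], which corresponds to path P3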

5.1.2 Case-II

Case-II stands for multiple similar paths from Host-1 (H1) to Host-2 (H2) having the same delay and the same bandwidth but different packet loss. For example, according to Fig.-5.1 and Table-5.1, the paths P3, P2, and P1 traverse the link types [L6, L2, L4], [L4, L4, L3, L4], and [L4, L4, L1, L4], respectively.

Figure 5.5: Probability of path selection for Case-II: From Host-1 to Host-2

Figure 5.6: Probability of path selection for Case-II: From Host-2 to Host-1

Figure 5.7: Performance of RL agent for Case-II: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2

This means that although path P3 has fewer nodes, it is not optimal due to a 10% packet loss and a theoretical delay of about 130 ms. On the other hand, the delay and link capacity of paths P1 and P2 are exactly the same, except that P2 has a 15% packet loss. There is another path, [0, 1, 5, 8, 10], that has the same specification as P1, and the remaining paths have a larger delay or packet loss than the aforementioned ones. According to this path analysis, path P1 or [0, 1, 5, 8, 10] is best for the Case-II scenario. The simulation results in Fig.-5.5 and 5.6 make it clear that the probability of selecting an optimal path increases with additional episodes, as does the collection of consistently positive rewards in Fig.-5.7a and b. From Fig.-5.5, it is clear that the initial path-selection probabilities of P1 and [0, 1, 5, 8, 10] are fairly equal and then increase in later episodes. From the end-to-end delay comparison, it is clear that the proposed method outperforms Dijkstra, even though the difference in delay is small [Fig.-5.7c, Table-5.2]. The average delay and packet loss for the proposed method are 269.722 ms and 1.08%, compared with 286.193 ms and 18.51% for Dijkstra.

5.1.3 Case-III

Case-III illustrates the performance of the proposed method under traffic congestion on the link between switches 5 and 7. As illustrated in Fig.-5.8 and 5.9, the decision for an optimal path converges to P1 before the congestion occurs. After the congestion starts at episode 150, the proposed method changes the routing strategy, gradually converging to path [0, 1, 6, 9, 8, 10] for delivering data packets from H1 to H2 and [0, 1, 5, 8, 10] from H2 to H1. Because of this intelligent and timely decision to change the optimal routing path, the SDN remains resilient during unexpected congestion.

The reward [Fig.-5.10a,b] is not constant due to higher exploration and inconsistent delay. As shown in Fig.-5.10(c), the end-to-end delay increases for Packet 500 and the packets after it due to unexpected traffic congestion, and then decreases to a satisfactory level by using the proposed intelligent routing method.

Figure 5.8: Probability of path selection for Case-III: From Host-1 to Host-2

Figure 5.9: Probability of path selection for Case-III: From Host-2 to Host-1

Figure 5.10: Performance of RL agent for Case-III: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2

As detailed in Table-5.2, the overall average end-to-end delay is 1028.102 ms and the packet-loss rate is 13.25%. These averages include the congestion period, so they appear disproportionately high. From Fig.-5.10c, it is clear that there is some high delay during the decision making, after which the delay settles at around 200 ms.

5.1.4 Case-IV

Figure 5.11: Probability of path selection for Case-IV: From Host-1 to Host-2

Figure 5.12: Probability of path selection for Case-IV: From Host-2 to Host-1

Figure 5.13: Performance of RL agent for Case-IV: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2

Case-IV depicts the scenario where several paths have the same packet loss but different delays. In such a scenario, the path with the lowest delay is the optimal path. According to the test topology for this case, the paths P1, P2, and P3 have the same packet loss (0%), but their delays differ. As seen in Fig.-5.11 and 5.12, the probability of selecting path P1 increases over the episodes. The normalized reward per episode is shown in Fig.-5.13a and b. The end-to-end delay graph (Fig.-5.13c) shows that the proposed method selects the optimal path, resulting in a lower end-to-end delay. More precisely, the average end-to-end delay is 115.992 ms with 0% packet loss, compared to 216.392 ms and 0% for Dijkstra.

5.2 Test-bed Setup: Multiple Controller

[Fig.-5.14: two CloudLab controllers (IP 128.110.153.127 and 128.110.153.119, both on port 6640) connected over the Internet to the Mininet topology running on the local machine.]

Figure 5.14: The overview of multiple-controller SDN test-system topology

The proposed model is also tested in a multi-controller environment to illustrate the scalability of the algorithm for a large-scale network. The topology for the multi-controller test is shown in Fig.-5.14. The test-system mainly consists of CloudLab and the local machine, connected via the Internet. Two computing nodes inside CloudLab implement the two SDN controllers, and Mininet is used on the local machine to build the SDN topology. As shown in Fig.-5.14, the links are of three types: L1, L2, and L3. The two controllers, controller-1 and controller-2, listen on port 6640 with different IP addresses. Each controller manages one domain, and the switch that connects the domains is called the gateway switch.

Figure 5.15: Probability of path selection for controller-1: From Host-1 to Gateway-1
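Attaching the emulated switches to the two CloudLab controllers can be done with Mininet remote-controller objects, roughly as sketched below; the per-domain switch assignment (s1-s7 under controller-1, s8-s12 under controller-2, with s5 and s8 as gateway switches) follows the topology described below, while hosts and links are omitted for brevity (they are added with addHost()/addLink() as in Section 5.1).

    from mininet.net import Mininet
    from mininet.node import RemoteController
    from mininet.link import TCLink

    net = Mininet(link=TCLink, build=False)

    # The two CloudLab controllers of Fig.-5.14, both listening on port 6640.
    c1 = net.addController('c1', controller=RemoteController, ip='128.110.153.127', port=6640)
    c2 = net.addController('c2', controller=RemoteController, ip='128.110.153.119', port=6640)

    # Domain-1 switches s1..s7 run under controller-1; domain-2 switches s8..s12
    # run under controller-2 (s5 and s8 act as the gateway switches).
    domain1 = [net.addSwitch('s%d' % i) for i in range(1, 8)]
    domain2 = [net.addSwitch('s%d' % i) for i in range(8, 13)]

    net.build()
    c1.start()
    c2.start()
    for sw in domain1:
        sw.start([c1])
    for sw in domain2:
        sw.start([c2])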

Figure 5.16: Probability of path selection for controller-1: From Gateway-1 to Host-1

Figure 5.17: Probability of path selection for controller-2: From Gateway-2 to Host-2

Figure 5.18: Probability of path selection for controller-2: From Host-2 to Gateway-2

Figure 5.19: Performance of RL agent for controllers: (a) Episode reward value, Host-1 to Gateway-1; (b) Episode reward value, Gateway-1 to Host-1; (c) Episode reward value, Gateway-2 to Host-2; (d) Episode reward value, Host-2 to Gateway-2; (e) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2

For this topology, domain-1 under controller-1 consists of seven switches, s1 to s7, with s5 as the gateway (Gateway-1). The second domain consists of five switches, s8 to s12, with s8 as the gateway (Gateway-2) switch. The simulation results index switches from 0, whereas the topology indexes them from 1, so switch-0 in the simulation corresponds to switch-1 in the topology; the topology indexing is used when describing the simulation results. Based on the delay and packet-loss settings detailed in Fig.-5.14, theoretically speaking, [1, 2, 7, 5] is the optimal path from Host-1 to Gateway-1, and [8, 9, 11] is the optimal path from the Gateway-2 switch of controller-2 to Host-2. Fig.-5.15-5.18 show the path-selection probabilities. It is clear that the proposed method converges to an optimal path within 100 episodes. As mentioned, the link delay also depends on the computation power of the machine, since the virtual topology (environment) uses the processing power of the local machine.

The reward graphs [Fig.-5.19a-d] are more consistent (a more consistent environment), as the computation performance of the CloudLab machines is much higher than that of the local machine. The convergence of the path selection is consistent with the above conclusion. The end-to-end delay comparison is shown in Fig.-5.19(e), which clearly shows that the proposed RL-based intelligent method outperforms the Dijkstra-based method. There is an additional delay due to the communication between CloudLab and the local machine over the Internet. More comparison results are shown in Table-5.2, from which it is clear that the RL-based method achieves a lower delay and packet-loss rate than the Dijkstra-based method.

CHAPTER VI

CONCLUSION AND FUTURE WORK

6.1 Conclusion

In this thesis, the A2C RL technique is integrated with a QoS-aware scheme to develop intelligence in SDN routing. The performance of the proposed RL-enabled routing algorithm is tested on single-controller and multi-controller SDN test-systems.

The simulation cases show that the RL-based routing strategy is sufficiently dynamic and intelligent to select the optimal path according to the current situation of the network. In the case of network congestion, the algorithm dynamically learns the next optimal path, so it is resilient to a network cataclysm. The multi-controller simulation is evidence of the scalability of the algorithm. As the framework mimics a real-world topology, the algorithm is ready to be tested in real-world applications.

6.2 Future Work

The potential directions for future work are:

1. To implement a large, real test-bed scenario with a real SDN topology.

2. To evaluate performance for different NN hyperparameters and compare them

with existing algorithms.

3. To test multi-agent-based reinforcement learning for the multi-controller topology.

4. To apply newer reinforcement learning approaches.

5. To include more QoS parameters, such as link utilization, service, and user

priority.

BIBLIOGRAPHY

[1] T. Benson, A. Akella, and D. Maltz. Unraveling the complexity of network man- agement. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI’09, pages 335–348, USA, 2009. USENIX As- sociation.

[2] S. Sendra, A. Rego, J. Lloret, J. M. Jimenez, and O. Romero. Including artificial intelligence in a routing protocol using software defined networks. In 2017 IEEE International Conference on Communications Workshops (ICC Workshops), pages 670–674, May 2017.

[3] S. Lin, I. F. Akyildiz, P. Wang, and M. Luo. Qos-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning ap- proach. In 2016 IEEE International Conference on Services Computing (SCC), pages 25–33, June 2016.

[4] C. N. Sminesh, E. G. M. Kanaga, and Ranjitha K. A proactive flow ad- mission and re-routing scheme for load balancing and mitigation of conges- tion propagation in SDN data plane. Computing Research Repository (CoRR), abs/1812.02474, 2018.

[5] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang. Experience-driven networking: A deep reinforcement learning based approach. Computing Research Repository (CoRR), abs/1801.05757, 2018.

[6] M. Karakus and A. Durresi. Quality of service (qos) in software defined network- ing (sdn): A survey. Journal of Network and Computer Applications, 80:200 – 218, 2017.

[7] L. Yanjun, L. Xiaobo, and Y. Osamu. Traffic engineering framework with machine learning based meta-layer in software-defined networks. In 2014 4th IEEE International Conference on Network Infrastructure and Digital Content, pages 121–125, Sep. 2014.

[8] Z. Mammeri. Reinforcement learning based routing in networks: Review and classification of approaches. IEEE Access, 7:55916–55950, 2019.

[9] J. Xie, F. R. Yu, T. Huang, R. Xie, J. Liu, C. Wang, and Y. Liu. A sur- vey of machine learning techniques applied to software defined networking (sdn): Research issues and challenges. IEEE Communications Surveys Tu- torials, 21(1):393–430, Firstquarter 2019.

[10] K. V. Murphy. Machine Learning: A Probabilistic Perspective. USA: MIT Press, 2012.

[11] T. M. Mitchell. Machine Learning. USA: McGraw-Hill, 2017.

[12] A. Azzouni, R. Boutaba, and G. Pujolle. Neuroute: Predictive dynamic rout- ing for software-defined networks. Computing Research Repository (CoRR), abs/1709.06002, 2017.

[13] C. Lin, K. Wang, and G. Deng. A qos-aware routing in sdn hybrid networks. Procedia Computer Science, 110:242 – 249, 2017. 14th International Conference on Mobile Systems and Pervasive Computing (MobiSPC 2017) / 12th Inter- national Conference on Future Networks and Communications (FNC 2017) / Affiliated Workshops.

[14] L. Peshkin and V. Savova. Reinforcement learning for adaptive routing. Com- puting Research Repository (CoRR), abs/cs/0703138, 2007.

[15] G. Stampa, M. Arias, D. Sanchez-Charles, V. Munt´es-Mulero, and A. Cabellos. A deep-reinforcement learning approach for software-defined networking routing optimization. Computing Research Repository (CoRR), abs/1709.07080, 2017.

[16] C. Yu, J. Lan, Z. Guo, and Y. Hu. Drom: Optimizing the routing in software- defined networks with deep reinforcement learning. IEEE Access, 6:64533– 64539, 2018.

[17] S. Kim, J. Son, A. Talukder, and C. S. Hong. Congestion prevention mechanism based on q-leaning for efficient routing in sdn. In 2016 International Conference on Information Networking (ICOIN), pages 124–128, Jan 2016.

[18] F. Francois and E. Gelenbe. Optimizing secure sdn-enabled inter-data centre overlay networks through cognitive routing. In 2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecom- munication Systems (MASCOTS), pages 283–288, Sep. 2016.

[19] J. Wang, L. Zhao, J. Liu, and N. Kato. Smart resource allocation for mobile edge computing: A deep reinforcement learning approach. IEEE Transactions on Emerging Topics in Computing, pages 1–1, 2019.

[20] P. Sun, J. Li, J. Lan, Y. Hu, and X. Lu. Rnn deep reinforcement learning for routing optimization. In 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pages 285–289, Dec 2018.

[21] L. Peshkin and V. Savova. Reinforcement learning for adaptive routing. In Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290), volume 2, pages 1825–1830 vol.2, May 2002.

[22] X. Guo, H. Lin, Z. Li, and M. Peng. Deep reinforcement learning based qos- aware secure routing for sdn-iot. IEEE Internet of Things Journal, pages 1–1, 2019.

[23] A. Azzouni and G. Pujolle. Neutm: A neural network-based framework for traffic matrix prediction in sdn. In NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium, pages 1–5, April 2018.

[24] T. A. Q. Pham, Y. Hadjadj-Aoul, and A. Outtagarts. Deep reinforcement learn- ing based qos-aware routing in knowledge-defined networking. In T. Q. Duong, N. Vo, and V. C. Phan, editors, Quality, Reliability, Security and Robustness in Heterogeneous Systems, pages 14–26, Cham, 2019. Springer International Publishing.

[25] Z. Mammeri. Reinforcement learning based routing in networks: Review and classification of approaches. IEEE Access, 7:55916–55950, 2019.

[26] W. Liu. Intelligent routing based on deep reinforcement learning in software- defined data-center networks. In 2019 IEEE Symposium on Computers and Communications (ISCC), pages 1–6, June 2019.

[27] Y. Liu, M. Dong, K. Ota, J. Li, and J. Wu. Deep reinforcement learning based smart mitigation of ddos flooding in software-defined networks. In 2018 IEEE 23rd International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), pages 1–6, Sep. 2018.

[28] P. Sun, J. Li, Z. Guo, Y. Xu, J. Lan, and Y. Hu. Sinet: Enabling scalable network routing with deep reinforcement learning on partial nodes. In Proceed- ings of the ACM SIGCOMM 2019 Conference Posters and Demos, SIGCOMM Posters and Demos ’19, page 88–89, New York, NY, USA, 2019. Association for Computing Machinery.

[29] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. Computing Research Repository (CoRR), abs/1602.01783, 2016.

[30] M. B. Hossain and J. Wei. Reinforcement learning-driven qos-aware intelligent routing for software-defined networks. In 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 1–5, Nov 2019.

[31] S. Kemp. Digital in 2018: World’s internet users pass the 4 billion mark. https://wearesocial.com/blog/2018/01/global-digital-report-2018, 2018.

[32] W. Yangyang and B. Jun. Survey of mechanisms for inter-domain sdn. https: //www.zte.com.cn/global/about/magazine/zte-communications/2017/3 /en 225/465746.html, 2017.

[33] N. McKeown. How sdn will shape networking. https://www.youtube.com/wa tch?v=c9-K5O qYgA, 2011.

[34] S. Schenke. The future of networking, and the past of protocols. https: //www.youtube.com/watch?v=YHeyuD89n1Y, 2011.

[35] N. Feamster, J. Rexford, and E. Zegura. The road to sdn: An intellectual history of programmable networks. SIGCOMM Comput. Commun. Rev., 44(2):87–98, April 2014.

[36] A. A. Lazar, Koon-Seng Lim, and F. Marconcini. Realizing a foundation for programmability of atm networks with the binding architecture. IEEE Journal on Selected Areas in Communications, 14(7):1214–1227, Sep. 1996.

[37] D. Sheinbein and R. P. Weber. Stored program controlled network: 800 service using spc network capability. The Bell System Technical Journal, 61(7):1737– 1744, Sep. 1982.

[38] J. Zander and R. Forchheimer. The softnet project: a retrospect. In 8th Euro- pean Conference on Electrotechnics, Conference Proceedings on Area Commu- nication, pages 343–345, June 1988.

[39] A. T. Campbell, I. Katzela, K. Miki, and J. Vicente. Open signaling for atm, internet and mobile networks (opensig’98). SIGCOMM Comput. Commun. Rev., 29(1):97–108, January 1999.

[40] J. E. van der Merwe, S. Rooney, L. Leslie, and S. Crosby. The tempest - a practical framework for network programmability. IEEE Network, 12(3):20–28, May 1998.

[41] D. J. Wetherall, J. V. Guttag, and D. L. Tennenhouse. Ants: a toolkit for build- ing and dynamically deploying network protocols. In IEEE Open Architectures and Network Programming, April 1998.

[42] S. da Silva, Y. Yemini, and D. Florissi. The netscript active network system. IEEE Journal on Selected Areas in Communications, 19(3):538–551, March 2001.

[43] A. Doria et al. Forwarding and control element separation (forces) protocol specification. https://tools.ietf.org/html/rfc5810, March 2010.

[44] Juniper Networks Inc. Path computation element protocol (pcep) configuration. https://www.juniper.net/documentation/en US/junos/topics/topic-m ap/pcep-configuration.html, 2020.

[45] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and J. van der Merwe. Design and implementation of a routing control platform. In Proceed- ings of the 2nd Conference on Symposium on Networked Systems Design and Implementation, NSDI’05, page 15–28, USA, 2005. USENIX Association.

[46] P. Newman, W. Edwards, R. Hinden, E. Hoffman, F. Ching Liaw, T. Lyon, and G. Minshall. Rfc1987: Ipsilon’s general switch management protocol specifica- tion version 1.1, 1996.

[47] M. Casado et al. Sane: A protection architecture for enterprise networks. In Proc. 15th Conf. USENIX Security Symp, volume 15, USA, 2006. USENIX Association.

[48] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rex- ford, S. Shenker, and J. Turner. Openflow: Enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, March 2008.

[49] G. Natasha, K. Teemu, P. Justin, P. Ben, C. Mart, M. Nick, and S. Scott. Nox: Towards an operating system for networks. SIGCOMM Comput. Commun. Rev., 38(3):105–110, July 2008.

[50] J. E. van der Merwe, S. Rooney, L. Leslie, and S. Crosby. The tempest-a practical framework for network programmability. IEEE Network, 12(3):20–28, May 1998.

[51] A. Bavier, N. Feamster, M. Huang, L. Peterson, and J. Rexford. In vini veritas: Realistic and controlled network experimentation. In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM ’06, page 3–14, New York, NY, USA, 2006. Association for Computing Machinery.

[52] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman. Planetlab: An overlay testbed for broad-coverage services. SIG- COMM Comput. Commun. Rev., 33(3):3–12, July 2003.

[53] P. Berde, M. Gerola, J. Hart, Y. Higuchi, M. Kobayashi, T. Koide, B. Lantz, B. O’Connor, P. Radoslavov, W. Snow, and et al. Onos: Towards an open, distributed sdn os. In Proceedings of the Third Workshop on Hot Topics in Software Defined Networking, HotSDN ’14, page 1–6, New York, NY, USA, 2014. Association for Computing Machinery.

[54] V. Bollapragada, C. Murphy, and R. White. Inside Cisco IOS Software Archi- tecture. Cisco Press, 2008.

[55] Juniper Network Inc. Junos os architecture overview. https://www.juniper. net/documentation/en US/junos/topics/concept/junos-software-arch itecture.html, 2020.

[56] Open Networking Foundation. An innovative combination of standards and open source software. https://www.opennetworking.org/software-define d-standards/overview/, 2020.

[57] J. Dix. Clarifying the role of software-defined networking northbound apis. https://www.networkworld.com/article/2165901/clarifying-the-role -of-software=defined-networking-northbound-apis.html, 2013.

[58] The Foundation. Opendaylight. https://www.opendaylight.org/, 2018.

[59] T. Koponen, M. Casado, N. Gude, J. Stribling, L. Poutievski, M. Zhu, R. Ra- manathan, Y. Iwata, H. Inoue, T. Hama, and S. Shenker. Onix: a distributed control platform for large-scale production networks. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pages 351–364, Berkeley, CA, USA, 2010.

[60] B. Salisbury. The northbound api- a big little problem. http://networksta tic.net/the-northbound-api-2/, 2012.

[61] R. Chua. Openflow northbound api - a new olympic sport. https://www.sdxcentral.com/articles/opinion-editorial/openflow-northbound-api-olympics/2012/07/, 2012.

[62] T. Koponen, K. Amidon, P. Balland, M. Casado, A. Chanda, B. Fulton, I. G., J. Gross, P. Ingram, E. Jackson, A. Lambeth, R. Lenglet, S. Li, A. Padman- abhan, J. Pettit, B. Pfaff, R. Ramanathan, S. Shenker, A. Shieh, J. Stribling, P. Thakkar, D. Wendlandt, A. Yip, and R. Zhang. Network virtualization in multi-tenant datacenters. In 11th USENIX Symposium on Networked Sys- tems Design and Implementation (NSDI 14), pages 203–216, Seattle, WA, 2014. USENIX Association.

[63] K. Pentikousis, Y. Wang, and W. Hu. Mobileflow: Toward software-defined mobile networks. IEEE Communications Magazine, 51(7):44–53, July 2013.

[64] W. Zhou, L. Li, M. Luo, and W. Chou. Rest api design patterns for sdn north- bound api. In 2014 28th International Conference on Advanced Information Networking and Applications Workshops, pages 358–365, May 2014.

[65] K. Yap, T. Huang, B. Dodson, M. S. Lam, and N. McKeown. Towards software- friendly networks. In Proceedings of the First ACM Asia-Pacific Workshop on Workshop on Systems, APSys ’10, page 49–54, New York, NY, USA, 2010. Association for Computing Machinery.

[66] M. Monaco, O. Michel, and E. Keller. Applying operating system principles to SDN controller design. Computing Research Repository (CoRR), abs/1510.05063, 2015.

[67] A. D. Ferguson, A. Guha, C. Liang, R. Fonseca, and S. Krishnamurthi. Par- ticipatory networking: An api for application control of sdns. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM ’13, page 327–338, New York, NY, USA, 2013. Association for Computing Machinery.

[68] Linux Foundation. Openvswitch. https://www.openvswitch.org/, 2011.

[69] B. Pfaff and B. Davie. The Open vSwitch Database Management Protocol. RFC 7047, 2013.

[70] R. Enns, M. Bj¨orklund, A. Bierman, and J. Sch¨onw¨alder. Network Configura- tion Protocol (NETCONF). RFC 6241, June 2011.

[71] Y. Rekhter and T. Li. Rfc1771: A border gateway protocol 4 (bgp-4), 1995.

[72] M. Smith, R. E. Adams, M. Dvorkin, Y. Laribi, V. Pandey, P. Garg, and N. Weidenbacher. OpFlex Control Protocol. Internet-Draft draft-smith-opflex-03, Internet Engineering Task Force, April 2016. Work in Progress.

[73] R. Sherwood, G. Gibb, K. Yap, G. Appenzeller, M. Casado, N. McKeown, and G. Parulkar. Can the production network be the testbed? In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, page 365–378, USA, 2010. USENIX Association.

[74] D. Drutskoy, E. Keller, and J. Rexford. Scalable network virtualization in software-defined networks. IEEE Internet Computing, 17(2):20–27, March 2013.

[75] A. Al-Shabibi, M. D. Leenheer, M. Gerola, A. Koshibe, W. Snow, and G. Parulkar. Openvirtex: A network hypervisor. In Open Networking Summit 2014 (ONS 2014), Santa Clara, CA, March 2014. USENIX Association.

[76] S. Racherla, D.Cain, S. Irwin, P. Ljungstrom, P. Patil, and A. M. Tarenzio. Implementing ibm software defined network for virtual environments. IBM Redbooks publication, 2014.

[77] C. . Li, B. L. Brech, S. Crowder, D. M. Dias, H. Franke, M. Hogstrom, D. Lindquist, G. Pacifici, S. Pappe, B. Rajaraman, J. Rao, R. P. Ratnaparkhi, R. A. Smith, and M. D. Williams. Software defined environments: An introduction. IBM Journal of Research and Development, 58(2/3):1:1–1:11, March 2014.

[78] D. Erickson. The Beacon OpenFlow controller. In Proceedings of the sec- ond ACM SIGCOMM workshop on Hot topics in software defined networking, HotSDN ’13, pages 13–18, New York, NY, USA, 2013. ACM.

[79] BigSwitch network. Floodlight-project. https://floodlight.atlassian.net /wiki/spaces/floodlightcontroller/pages/1343647/Floodlight+Proje cts, 2018.

[80] T. Amin and G. Yashar. Hyperflow: A distributed control plane for openflow. In Proceedings of the 2010 Internet Network Management Conference on Re- search on Enterprise Networking, INM/WREN’10, page 3, USA, 2010. USENIX Association.

[81] D. Saikia. Open mul. https://sourceforge.net/projects/mul/, 2014.

74 [82] S. H. Yeganeh and Y. Ganjali. Kandoo: A framework for efficient and scalable offloading of control applications. In Proceedings of the First Workshop on Hot Topics in Software Defined Networks, HotSDN ’12, page 19–24, New York, NY, USA, 2012. Association for Computing Machinery.

[83] A. Tootoonchian, S. Gorbunov, Y. Ganjali, M. Casado, and R. Sherwood. On controller performance in software-defined networks. In 2nd USENIX Workshop on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (Hot-ICE 12), San Jose, CA, April 2012. USENIX Association.

[84] M. C. Murphy. Pox controller. https://noxrepo.github.io/pox-doc/html/, 2014.

[85] Nippon Telegraph and Telephone Corporation. Ryu controller. https://ryu. readthedocs.io/en/latest/index.html, 2014.

[86] Takamiya et al. Openflow programming with trema. http://yasuhito.githu b.io/trema-book/, 2011.

[87] Huawei Technologies Co. Ltd. Cx600 metro services platform. https://carr ier.huawei.com/en/products/fixed-network/data-communication/rout er/cx600, 2020.

[88] . Netiron (ces, cer and mlx series). https://www.extrem enetworks.com/support/documentation/netiron-ces-cer-mlx-series/, 2020.

[89] NoviFlow inc. Noviflow. https://noviflow.com/noviswitch/, 2019.

[90] Lenevo. Rackswitch g8264. https://lenovopress.com/tips0815, 2015.

[91] Ciena Corporation. Z-series. https://www.ciena.com/products/z-series/, 2020.

[92] A. Shang, J. Liao, and L. Du. xorplus. https://sourceforge.net/projects /xorplus/, 2016.

[93] Andrey mp et al. contrail-vrouter. https://github.com/tungstenfabric/tf -vrouter, 2016.

[94] M. Scharf, V. Gurbani, T. Voith, M. Stein, W. Roome, G. Soprovich, and V. Hilt. Dynamic vpn optimization by alto guidance. In 2013 Second European Workshop on Software Defined Networks, pages 13–18, Oct 2013.

[95] N. Handigol, S. Seetharaman, M. Flajslik, A. Gember, N. McKeown, G. Parulkar, A. Akella, N. Feamster, R. Clark, A. Krishnamurthy, V. Brajkovic, and T. Anderson. Aster*x: Load-balancing web traffic over wide-area networks. 2010.

[96] C. A.B. Macapuna, C. E. Rothenberg, and M. F. Maur´ıcio. In-packet bloom filter based data center networking with distributed openflow controllers. In 2010 IEEE Globecom Workshops, pages 584–588, Dec 2010.

[97] T. Benson, A. Anand, A. Akella, and M. Zhang. Microte: Fine grained traffic engineering for data centers. In Proceedings of the Seventh COnference on Emerging Networking EXperiments and Technologies, CoNEXT ’11, New York, NY, USA, 2011. Association for Computing Machinery.

[98] M. F. Bari, S. R. Chowdhury, R. Ahmed, and R. Boutaba. Policycop: An autonomic qos policy enforcement framework for software defined networks. In 2013 IEEE SDN for Future Networks and Services (SDN4FNS), pages 1–7, Nov 2013.

[99] M. V. Neves, C. A. F. D. Rose, K. Katrinis, and H. Franke. Pythia: Faster big data in motion through predictive software-defined network optimization at runtime. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pages 82–90, May 2014.

[100] S. Sharma, D. Staessens, D. Colle, D. Palma, J. Gon¸calves, R. Figueiredo, D. Morris, M. Pickavet, and P. Demeester. Implementing quality of service for the software defined networking enabled future internet. In 2014 Third European Workshop on Software Defined Networks, pages 49–54, Sep. 2014.

[101] W. Kim, P. Sharma, J. Lee, S. Banerjee, J. Tourrilhes, S. Lee, and P. Yala- gandula. Automated and scalable qos control for network convergence. In Proceedings of the 2010 Internet Network Management Conference on Research on Enterprise Networking, INM/WREN’10, page 1, USA, 2010. USENIX As- sociation.

[102] P. Sk¨oldstr¨om and B. C. Sanchez. Virtual aggregation using sdn. In 2013 Second European Workshop on Software Defined Networks, pages 56–61, Oct 2013.

[103] T. Benson, A. Akella, A. Shaikh, and S. Sahu. Cloudnaas: A cloud networking platform for enterprise applications. In Proceedings of the 2nd ACM Symposium on Cloud Computing, SOCC ’11, New York, NY, USA, 2011. Association for Computing Machinery.

[104] A. Das, C. Lumezanu, Y. Zhang, V. Singh, G. Jiang, and C. Yu. Transparent and flexible network management for big data processing in the cloud. In Presented as part of the 5th USENIX Workshop on Hot Topics in Cloud Computing, San Jose, CA, 2013. USENIX.

[105] R. Raghavendra, J. Lobo, and K. Lee. Dynamic graph query primitives for sdn-based cloudnetwork management. In Proceedings of the First Workshop on Hot Topics in Software Defined Networks, HotSDN ’12, page 97–102, New York, NY, USA, 2012. Association for Computing Machinery.

[106] R. Hand, M. Ton, and E. Keller. Active security. In Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks, HotNets-XII, New York, NY, USA, 2013. Association for Computing Machinery.

[107] S. Shin and G. Gu. Cloudwatcher: Network security monitoring using openflow in dynamic cloud networks (or: How to provide security monitoring as a service in clouds?). In 2012 20th IEEE International Conference on Network Protocols (ICNP), pages 1–6, Oct 2012.

[108] E. Tantar, M. R. Palattella, T. Avanesov, M. Kantor, and T. Engel. Cogni- tion: A tool for reinforcing security in software defined networks. In A. Tantar, E. Tantar, J. Sun, W. Zhang, Q. Ding, O. Sch¨utze, M. Emmerich, P. Legrand, P. Del Moral, and C. A. C. Coello, editors, EVOLVE - A Bridge between Prob- ability, Set Oriented Numerics, and Evolutionary Computation V, pages 61–78, Cham, 2014. Springer International Publishing.

[109] Y. Wang, Y. Zhang, V. Singh, C. Lumezanu, and G. Jiang. Netfuse: Short- circuiting traffic surges in the cloud. In 2013 IEEE International Conference on Communications (ICC), pages 3514–3518, June 2013.

[110] J. H. Jafarian, E. Al-Shaer, and Q. Duan. Openflow random host mutation: Transparent moving target defense using software defined networking. In Pro- ceedings of the First Workshop on Hot Topics in Software Defined Networks, HotSDN ’12, page 127–132, New York, NY, USA, 2012. Association for Com- puting Machinery.

[111] J. R. Ballard, I. Rae, and A. Akella. Extensible and scalable network monitor- ing using opensafe. In Proceedings of the 2010 Internet Network Management Conference on Research on Enterprise Networking, INM/WREN’10, page 8, USA, 2010. USENIX Association.

[112] J. Schulz-Zander, N. Sarrar, and S. Schmid. Aeroflux: A near-sighted controller architecture for software-defined wireless networks. In Open Networking Summit 2014 (ONS 2014), Santa Clara, CA, March 2014. USENIX Association.

[113] H. Ali-Ahmad, C. Cicconetti, A. de la Oliva, M. Dr¨axler, R. Gupta, V. Mancuso, L. Roullet, and V. Sciancalepore. Crowd: An sdn approach for densenets. In 2013 Second European Workshop on Software Defined Networks, pages 25–31, Oct 2013.

[114] A. W. Dawson, M. K. Marina, and F. J. Garcia. On the benefits of ran virtual- isation in c-ran based mobile networks. In 2014 Third European Workshop on Software Defined Networks, pages 103–108, Sep. 2014.

[115] W. S McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 1943.

[116] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. 1974. PhD thesis; Harvard University, MA.

[117] L. Deng and D. Yu. Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 2014.

[118] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.

[119] R. Bellman. Dynamic Programming. Princeton University Press Princeton, New Jersey, 1957.

[120] R. A. Howard. Dynamic Programming and Markov Process. MIT press, John Wiley and Sons, Inc., 1960.

[121] J. H. Andreae. Stella: A scheme for a learning machine. IFAC Proceedings Volumes, 1(2):497 – 502, 1963. 2nd International IFAC Congress on Automatic and Remote Control: Theory, Basle, Switzerland, 1963.

[122] D. Michie and R. A Chambers. Boxes: An experiment in adaptive control. Machine intelligence, 2(2):137–152, 1968.

[123] Ole-Christoffer Granmo. The tsetlin machine - A game theoretic bandit driven approach to optimal pattern recognition with propositional logic. Computing Research Repository (CoRR), 2018.

[124] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[125] C. J. C. H. Watkins. Learning from delayed rewards. 1989. Ph.D. Thesis; Cambridge University, UK.

[126] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279– 292, 1992.

[127] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. Laboratory for Information and Decision Systems, MIT, 2000.

[128] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. M. O. Heess, T. Erez, Y. Tassa, D. Sil- ver, and D. Wierstra. Continuous control with deep reinforcement learning. Computing Research Repository (CoRR), abs/1509.02971, 2015.

[129] M. Hessel, J. Modayil, H. V. Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. G. Azar, and D. Silver. Rainbow: Combining improve- ments in deep reinforcement learning. Computing Research Repository (CoRR), abs/1710.02298, 2017.

[130] H. V. Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. Computing Research Repository (CoRR), abs/1509.06461, 2015.

[131] Atarimania. Atari 2600 games. https://www.randomterrain.com/atari-26 00-memories-history-1991.html, 2018.

[132] Google DeepMind. Alphago. https://deepmind.com/research/case-studi es/alphago-the-story-so-far, 2015.

[133] Y. You, X. Pan, Z. Wang, and C. Lu. Virtual to real reinforcement learning for autonomous driving. Computing Research Repository (CoRR), abs/1704.03952, 2017.

[134] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. Computing Research Repository (CoRR), abs/1504.00702, 2015.

[135] M. L Puterman. Markov decision processes: discrete stochastic dynamic pro- gramming. John Wiley & Sons, 2014.

[136] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient meth- ods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, page 1057–1063, Cambridge, MA, USA, 1999. MIT Press.

[137] J. Kober and J. Peters. Reinforcement Learning in Robotics: A Survey, pages 9–67. Springer International Publishing, Cham, 2014.

[138] X. Gao, Y. Jin, Q. Dou, and P. Heng. Automatic gesture recognition in robot- assisted surgery with reinforcement learning and tree search, 2020.

[139] M. Neunert, A. Abdolmaleki, M. Wulfmeier, T. Lampe, J. T. Springenberg, R. Hafner, F. Romano, J. Buchli, N. Heess, and M. Riedmiller. Continuous- discrete reinforcement learning for hybrid control in robotics, 2020.

[140] S. H. Semnani, H. Liu, M. Everett, A. de Ruiter, and J. P. How. Multi-agent motion planning for dense and dynamic environments via deep reinforcement learning, 2020.

[141] I. Carlucho, M. D. Paula, and G. G. Acosta. An adaptive deep reinforcement learning approach for mimo pid control of mobile robots. ISA Transactions, 2020.

[142] J. Garc´ıaand D. Shafie. Teaching a humanoid robot to walk faster through safe reinforcement learning. Engineering Applications of Artificial Intelligence, 88:103360, 2020.

[143] J. Liu, J. Shou, Z. Fu, H. Zhou, R. Xie, J. Zhang, J. Fei, and Y. Zhao. Efficient reinforcement learning control for continuum robots based on inexplicit prior knowledge, 2020.

[144] Y. Gar´ı,D. A. Monge, E. Pacini, C. Mateos, and C. G. Garino. Reinforcement learning-based autoscaling of workflows in the cloud: A survey, 2020.

[145] H. Wang, X. Hu, Q. Yu, M. Gu, W. Zhao, J. Yan, and T. Hong. Integrating reinforcement learning and skyline computing for adaptive service composition. Information Sciences, 519:141 – 160, 2020.

[146] A. Mahmud. Query-based summarization using reinforcement learning and transformer model. 2020. PhD Thesis; University of Lethbridge, AB., Canada.

[147] S. Walton. Decoupling reinforcement learning from search-based software engi- neering. Systems and Software Engineering Publication, 5(2), May 2019.

[148] P. S. Castro, S. Li, and D. Zhang. Inverse reinforcement learning with multiple ranked experts. arXiv preprint arXiv:1907.13411, 2019.

[149] S. Chen, J. Wu, and X. Chen. Deep reinforcement learning with model-based acceleration for hyperparameter optimization. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 170–177, Nov 2019.

[150] W. Li, S. Yan, B. Shen, and Y. Chen. Reinforcement learning of code search sessions. In 2019 26th Asia-Pacific Software Engineering Conference (APSEC), pages 458–465, Dec 2019.

[151] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and D. I. Kim. Applications of deep reinforcement learning in communications and net- working: A survey. IEEE Communications Surveys Tutorials, 21(4):3133–3174, Fourthquarter 2019.

[152] Y. Qian, J. Wu, R. Wang, F. Zhu, and W. Zhang. Survey on reinforcement learning applications in communication networks. Journal of Communications and Information Networks, 4(2):30–39, June 2019.

[153] T. T. Nguyen and V. J. Reddi. Deep reinforcement learning for cyber security. Computing Research Repository (CoRR), abs/1906.05799, 2019.

[154] C. Li and M. Qiu. Reinforcement learning for cyber-physical systems. New York: Chapman and Hall/CRC, 2019.

[155] Yoshiharu Sato. Model-free reinforcement learning for financial portfolios: A brief survey. arXiv preprint arXiv:1904.04973, 2019.

[156] L. A E. Leal, M. Westerlund, and A. Chapman. Autonomous industrial man- agement via reinforcement learning: Self-learning agents for decision-making–a review. arXiv preprint arXiv:1910.08942, 2019.

[157] O. Walker, F. Vanegas, F. Gonzalez, and S. Koenig. A deep reinforcement learning framework for uav navigation in indoor environments. In 2019 IEEE Aerospace Conference, pages 1–14, March 2019.

[158] W. Koch, R. Mancuso, R. West, and A. Bestavros. Reinforcement learning for uav attitude control. ACM Transactions on Cyber-Physical Systems, 3(2):1–21, 2019.

[159] J. Hu, H. Zhang, L. Song, Z. Han, and H V. Poor. Reinforcement learning for a cellular internet of uavs: protocol design, trajectory control, and resource management. IEEE Wireless Communications, 27(1):116–123, 2020.

[160] C. Piciarelli and G. L. Foresti. Drone patrolling with reinforcement learning. In Proceedings of the 13th International Conference on Distributed Smart Cameras, ICDSC 2019, New York, NY, USA, 2019. Association for Computing Machinery.

[161] T. Wan, R. Qin, Y. Chen, H. Snoussi, and C. Choi. A reinforcement learning approach for uav target searching and tracking. Multimedia Tools and Appli- cations, 78(4):4347–4364, 2019.

[162] R. Zanol, F. Chiariotti, and A. Zanella. Drone mapping through multi-agent re- inforcement learning. In 2019 IEEE Wireless Communications and Networking Conference (WCNC), pages 1–7, April 2019.

[163] D. An, Q. Yang, W. Liu, and Y. Zhang. Defending against data integrity attacks in smart grid: A deep reinforcement learning-based approach. IEEE Access, 7:110835–110845, 2019.

[164] X. Lu, X. Xiao, L. Xiao, C. Dai, M. Peng, and H. V. Poor. Reinforcement learning-based microgrid energy trading with a reduced power plant schedule. IEEE Internet of Things Journal, 6(6):10728–10737, Dec 2019.

[165] B R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A A. Sallab, S. Yogamani, and P. P´erez. Deep reinforcement learning for autonomous driving: A survey. arXiv preprint arXiv:2002.00444, 2020.

[166] R. Xing, Z. Su, N. Zhang, Y. Peng, H. Pu, and J. Luo. Trust-evaluation-based intrusion detection and reinforcement learning in autonomous driving. IEEE Network, 33(5):54–60, Sep. 2019.

[167] M. Bouton, J. Karlsson, A. Nakhaei, K. Fujimura, M. J Kochenderfer, and J. Tumova. Reinforcement learning with probabilistic guarantees for autonomous driving. arXiv preprint arXiv:1904.07189, 2019.

[168] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng. Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data. IET Intelligent Transport Systems, Jan 2020.

[169] M. Zhu, Y. Wang, Z. Pu, J. Hu, X. Wang, and R. Ke. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. Transportation Research Part C: Emerging Technologies, 117:102662, 2020.

[170] Y. Wang, D. Zhang, Y. Liu, B. Dai, and L. H. Lee. Enhancing transportation systems via deep learning: A survey. Transportation Research Part C: Emerging Technologies, 99:144 – 163, 2019.

[171] H. Wei, G. Zheng, V. Gayah, and Z. Li. A survey on traffic signal control methods. arXiv preprint arXiv:1904.08117, 2019.

[172] G. Zheng, X. Zang, N. Xu, H. Wei, Z. Yu, V. Gayah, K. Xu, and Z. Li. Diagnosing reinforcement learning for traffic signal control. arXiv preprint arXiv:1905.04716, 2019.

[173] J. F Pettit, R. Glatt, J. R Donadee, and B. K Petersen. Increasing performance of electric vehicles in ride-hailing services using deep reinforcement learning. arXiv preprint arXiv:1912.03408, 2019.

[174] C. Yu, J. Liu, and S. Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2019.

[175] T. Shang, K. Han, J. Ma, and M. Mao. Research on self-gaming training method of wargame based on deep reinforcement learning. In Proceedings of the 2019 International Conference on Artificial Intelligence and Computer Science, AICS 2019, page 251–254, New York, NY, USA, 2019. Association for Computing Machinery.

[176] M. Singh, M. Dhull, M. Pahwa, and M. Sharma. Automated gaming pommer- man: Ffa. arXiv preprint arXiv:1907.06096, 2019.

[177] Y. Zhang, S. Li, and X. Xiong. A study on the game system of dots and boxes based on reinforcement learning. In 2019 Chinese Control and Decision Conference (CCDC), pages 6319–6322, June 2019.

[178] K. Phemius and M. Bouet. Monitoring latency with openflow. In Proceedings of the 9th International Conference on Network and Service Management (CNSM 2013), pages 122–125, Oct 2013.

[179] Mininet Team. Mininet: An instant virtual network on your laptop (or other pc). http://mininet.org/, 2018.

[180] University of Utah. Cloudlab. https://www.cloudlab.us/, 2020.

[181] E. W Dijkstra. A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269–271, 1959.
