© 2020
MD BILLAL HOSSAIN
ALL RIGHTS RESERVED

QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED NETWORKING
A Thesis
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
Md Billal Hossain
August, 2020

QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED NETWORKING
Md Billal Hossain
Thesis
Approved:
Advisor: Dr. Jin Wei-Kocsis
Co-Advisor: Dr. Kye-Shin Lee
Faculty Reader: Dr. Hamid Bahrami
Department Chair: Dr. Robert Veillette

Accepted:
Dean of the College: Dr. Craig Menzemer
Dean of the Graduate School: Dr. Marnie Saunders
Date:
ABSTRACT
This thesis proposes a reinforcement learning (RL) driven software-defined networking (SDN) routing scheme for situation-aware and intelligent network management. Firstly, an existing SDN network monitoring technique is applied to track the quality of service (QoS) parameters (link delay and packet loss). Afterward, the QoS data are fed to the RL algorithm in order to achieve situation awareness in SDN routing. The performance of the proposed RL-enabled routing scheme is evaluated in the simulation section by considering various network scenarios, including network congestion. Finally, the end-to-end delay, the episode reward, and the probability of path selection are recorded for each case. According to the outcomes, the proposed scheme intelligently selects the efficient data path according to the current state of the network. Moreover, the end-to-end delay is compared with that of the Dijkstra algorithm, demonstrating the superiority of the RL-enabled dynamic routing strategy over the static one. Additionally, the scalability of the algorithm is tested with a multiple-controller SDN.
ACKNOWLEDGMENTS
Firstly, I would like to extend my heartfelt gratitude to my adviser, Dr. Jin (Wei) Kocsis, and co-adviser, Dr. Kye-Shin Lee, for their continuous guidance and advice throughout my master's program. Their support and guidance empowered me to face challenges throughout this journey. Also, I would like to thank my thesis committee member, Dr. Hamid Bahrami, and Department Chair, Dr. Robert Veillette, for their cooperation throughout the journey.
I would like to thank my colleagues at the University of Akron, including the members of the CPSSD group, especially Yifu, Gihan, Praveen, and Moein, for all their support. It was my pleasure working with you in a cooperative environment. My appreciation goes out to all the faculty members and staff at the University of Akron for heightening my knowledge and skills via graduate courses. I am forever grateful to my parents, Md Hafiz Ahmed and Aklima Akter.
I dedicate this thesis to those who work for a better world.
Thank you all for your support.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
1.1 Contributions in this Thesis
1.2 Publications
1.3 Thesis Outline
II. OVERVIEW OF SOFTWARE-DEFINED NETWORK TECHNOLOGY
2.1 Motivations of SDN
2.2 History of SDN Development
2.3 SDN Structure
2.4 Distributed SDN Architecture
2.5 Recent Development and Application of SDN
III. OVERVIEW OF REINFORCEMENT LEARNING
3.1 Artificial Intelligence, Artificial Neural Networks and Machine Learning
3.2 Reinforcement Learning
3.3 Reinforcement Learning: Applications
IV. QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED NETWORK
4.1 Introduction
4.2 Design of QoS-Aware Intelligent Routing
V. SIMULATION RESULTS: QoS-AWARE INTELLIGENT ROUTING FOR SOFTWARE DEFINED NETWORK
5.1 Test-bed Setup: Single Controller
5.2 Test-bed Setup: Multiple Controller
VI. CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work
BIBLIOGRAPHY
LIST OF TABLES

2.1 Popular SDN NOS
5.1 Link parameters for different cases
5.2 Performance comparison between proposed method and Dijkstra algorithm-based method
LIST OF FIGURES

1.1 Overview of the thesis
2.1 System architecture: Traditional vs SDN
2.2 System architecture of SDN: (a) Physical Structure; (b) Logical Structure
2.3 Anatomy of SDN Hardware Components
2.4 Flat Architecture of SDN
2.5 Hierarchical SDN Architecture
3.1 AI and ML: (a) Hierarchical Structure of AI; (b) Classification of ML algorithms
3.2 Example of RL scenario
3.3 Classification of RL
3.4 Actor-Critic: (a) Concept of AC; (b) Structure of AC
4.1 Block Representation of RIRD
4.2 Flow Chart of QoS-aware Intelligent Routing
4.3 Active Monitoring QoS in SDN: (a) Packet Loss; (b) Link Delay
5.1 Single-controller SDN topology
5.2 Probability of path selection for Case-I: From Host-1 to Host-2
5.3 Probability of path selection for Case-I: From Host-2 to Host-1
5.4 Performance of RL agent for Case-I: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2
5.5 Probability of path selection for Case-II: From Host-1 to Host-2
5.6 Probability of path selection for Case-II: From Host-2 to Host-1
5.7 Performance of RL agent for Case-II: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2
5.8 Probability of path selection for Case-III: From Host-1 to Host-2
5.9 Probability of path selection for Case-III: From Host-2 to Host-1
5.10 Performance of RL agent for Case-III: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2
5.11 Probability of path selection for Case-IV: From Host-1 to Host-2
5.12 Probability of path selection for Case-IV: From Host-2 to Host-1
5.13 Performance of RL agent for Case-IV: (a) Episode reward value, Host-1 to Host-2; (b) Episode reward value, Host-2 to Host-1; (c) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2
5.14 The overview of multiple-controller SDN test-system topology
5.15 Probability of path selection for controller-1: From Host-1 to Gateway-1
5.16 Probability of path selection for controller-1: From Gateway-1 to Host-1
5.17 Probability of path selection for controller-2: From Gateway-2 to Host-2
5.18 Probability of path selection for controller-2: From Host-2 to Gateway-2
5.19 Performance of RL agent for controllers: (a) Episode reward value, Host-1 to Gateway-1; (b) Episode reward value, Gateway-1 to Host-1; (c) Episode reward value, Gateway-2 to Host-2; (d) Episode reward value, Host-2 to Gateway-2; (e) End-to-End delay comparison, ICMP message (Ping) from Host-1 to Host-2
CHAPTER I
INTRODUCTION
The Internet has inaugurated a new social era, where all digital resources have the potential to be interconnected and accessible everywhere. Traditional Internet protocol (IP) networking technology is complex. It is difficult to manage [1], configure, and reconfigure in case of a fault. Moreover, a traditional IP network is vertically integrated, as the control and data planes are bundled together. Software Defined Networking (SDN) is an emerging networking technology that breaks the vertical integration by separating the network's control logic from the underlying devices (e.g., routers and switches) [2]. The logically centralized network control eases management, reduces operating costs, promotes evolution and innovation [3], allows vendor-independent implementation [4], improves the network's resource utilization [5], and facilitates traffic monitoring [6].
Efficient data routing in a large, complex, and dynamic network is a challenging problem. Inefficient routing can overload network links and increase the end-to-end transmission delay, which affects the overall performance of a communication network. In addition, a communication network is a dynamic environment whose performance depends on the current behavior of the nodes. There are multiple popular traditional routing algorithms (e.g., OSPF, IGRP, EIGRP, RIP, BGP) established in traditional networking structures. A large number of nodes with copious network parameters makes it more challenging for conventional routing to ensure the expected QoS. Moreover, these algorithms were developed for decentralized network processing units. Therefore, they are not suitable for the SDN, whose fundamental concept is to develop centralized control [7]. SDN routing also demands the integration of Quality of Service (QoS) and situation awareness.
Machine Learning (ML) is an exceptional tool that enables automation in dynamic domains. ML-based dynamic routing [8] could lead to an efficient, near-optimal routing solution [9]. ML algorithms are primarily classified into supervised, unsupervised, semi-supervised, and reinforcement learning [10] [11]. Network researchers are exploring ways to apply each of these classes in distinct approaches. Among them, reinforcement learning (RL) is a framework by which a system can learn from its previous interactions with its environment to efficiently select its actions in the future. This makes it suitable for the SDN routing problem. Additionally, powerful network status monitoring tools create the opportunity for situation awareness in SDN routing [12], [2], [5], [13], [14]. The combination of network monitoring, distributed network analytics, and ML is known as a knowledge defined network (KDN) [15]. The following paragraph introduces some accomplished research work related to the application of ML in SDN routing.
There is existing research that applies various machine learning algorithms for QoS-aware dynamic SDN routing. In [12], the authors proposed a supervised (LSTM-RNN) learning-based dynamic routing scheme called NeuRoute. NeuRoute predicts the traffic matrix based on previously recorded data to generate forwarding rules. However, the effectiveness of the algorithm depends on the volume of annotated data for training the neural networks. Similarly, a deep-RL approach for SDN routing was proposed by Stampa et al. [15], where routing was optimized based on the bandwidth request between a source-destination pair. The RL agent adapted dynamically to the current traffic status and customized the route configurations in an attempt to minimize the network delay. The algorithm was tested for 14 nodes and 21 full-duplex links with an average node degree of three, with 10 traffic intensity levels. This method did not consider network internal parameters, such as packet loss in a link, that are critical for QoS. Additionally, the researchers considered 1,000 distinct traffic configurations, so the computational complexity will increase with the number of network nodes. A similar approach, the DDPG Routing Optimization Mechanism (DROM), was tested with a deep deterministic policy gradient by Yu [16]. Furthermore, Xu et al. [5] proposed a deep-Q and traffic engineering based routing scheme, Deep Reinforcement Learning with Traffic Engineering (DRL-TE), to maximize the utilization of an SDN. In DRL-TE, the RL agent was designed to effectively utilize the networking resources among the three shortest paths according to hop count. The method did not consider the network's internal factors (e.g., link delay and packet loss). Kim et al. [17] proposed a simple demonstration of Q-learning's application for congestion-preventive routing in SDN. They considered a simple topology having five SDN switches and also did not consider packet loss. Sendra et al. [2] proposed a reinforcement learning-based SDN routing scheme. The simulation topology of [2] consisted of a few nodes with an average degree of two. Yanju et al. [7] proposed a heuristic algorithm and supervised learning algorithm-based SDN routing. The initial step of that method [7] was to collect abundant training data and build a path database for each source-destination pair. However, it was an impractical solution for a large-scale network. Francois et al. [18] proposed a secured Cognitive Routing Engine (CRE). The CRE algorithm reduced the end-to-end delay and encrypted the data with a Random Neural Network (RNN) and RL. However, the method focused on the end-to-end delay at the expense of packet loss. Wang et al. [19] proposed a deep RL based resource allocation for mobile edge computing; the objective of that research was to minimize service time and balance resource allocation by RL. Sun et al. [20] proposed a deep deterministic policy gradient (DDPG) and a recurrent neural network (RNN) to generate routing policy on the SDN platform. The method considered the average transmission delay to calculate the episode reward. A QoS-aware adaptive routing scheme, called QAR, was proposed in [3] for multi-layer hierarchical SDN. Its reward was a complex mathematical function, and there was no simulation of link congestion. There are also some other machine learning-based routing approaches proposed in [13, 21–28].
This thesis demonstrates a situation-aware routing scheme for the SDN by using an advantage actor-critic (A2C) algorithm [29]. The QoS-aware intelligent routing method continuously monitors the delay and packet loss between SDN switches for situation awareness; the A2C agent then makes an intelligent decision on that basis. Afterwards, the simulation results are analyzed for different cases and compared with a static routing strategy.
1.1 Contributions in this Thesis
1. An A2C algorithm that interacts in real time with the SDN to achieve the best QoS.
2. A deep A2C-based routing scheme in which the centralized SDN controller tailors the routing according to the delay and packet loss of each path. Several scenarios were tested, including network congestion.
3. The scalability of the proposed algorithm was tested in an environment with multiple SDN controllers.
1.2 Publications
Research achievements presented in this thesis have been included in the following research paper:
1. Hossain, M. B. and Jin Wei, "Reinforcement Learning-Driven QoS-Aware Intelligent Routing for Software-Defined Networks," IEEE Global Conference on Signal and Information Processing (GlobalSIP), Nov. 2019 [30].
1.3 Thesis Outline
The following chapters in this thesis unfold as shown in Fig. 1.1. Chapter 2 provides an overview of SDN. It discusses structural and historical information, such as the definition of SDN, the key components of SDN, the most significant milestones, and the current development of SDN. Chapter 3 reviews general ideas about machine learning, artificial intelligence, and RL.
[Figure 1.1 maps each chapter to the simulation cases: Case-I, one best path (all paths have different bandwidth, delay, and packet loss); Case-II, one best path (multiple paths have the same bandwidth and delay but different packet loss); Case-III, congestion (all paths have different bandwidth, delay, and packet loss); Case-IV, one best path (multiple paths have the same packet loss and bandwidth but different delay); Case-V, multiple SDN controllers (scalability).]
Figure 1.1: Overview of the thesis.
The chapter highlights the history, development, and applications of RL algorithms, and also gives detailed ideas about the structure of A2C. Chapter 4 provides a detailed discussion of the proposed RL-based routing algorithm and describes the QoS parameters that are considered.
Chapter 5 presents the test-bed setup and simulation results. The simulation considered five test scenario cases, each of which is described. Sections 5.1 and 5.1.1 show the test-bed setup and simulation results for the one-best-path case. Section 5.1.2 shows simulation results for the one-best-path case where multiple similar paths exist with the same bandwidth and delay but different packet loss. Section 5.1.3 describes the behavior of the proposed method in case of network congestion. Section 5.1.4 shows the test-bed setup and simulation results for the one-best-path case where multiple paths have the same packet loss and bandwidth but different delay. Section 5.2 shows the test-bed setup and simulation results for multiple SDN controllers.
Finally, Chapter 6 provides a summary of the work carried out and potential future work.
CHAPTER II
OVERVIEW OF SOFTWARE-DEFINED NETWORK TECHNOLOGY
This chapter provides a brief introduction to the SDN, introducing its history, evolution, structure, major achievements, and application in real-world networking infrastructure.
2.1 Motivations of SDN
Computer networks can be represented in three planes of functionality: the data, control, and management planes [as shown in Fig.-2.1 a and b]. The data plane corresponds to the networking devices, the control plane represents the protocols used to generate the forwarding tables, and the management plane provides the software services for remote network device management. The control plane enforces the policy defined in the management plane, and the data plane executes the policy. Traditional networking devices (e.g., routers, switches) are integrated with vendor-specific embedded operating systems, which decide the data forwarding policy and control the data plane through hardware. Notably, there is no standardized protocol proposed for coordination between vendors. Consequently, a device only supports vendor-specific network configuration. Thus, an arduous effort is required to upgrade or modify network policy. In some cases, to carry out these actions, the system must be replaced with new devices, which is a costly solution. Furthermore, over 4 billion Internet users are linked through over half a million Autonomous Systems (AS), and the number is increasing rapidly every year [31,32]. With that many users demanding versatile applications, implementation in a diverse range of AS is challenging for traditional network infrastructure. SDN [33,34] is a cutting-edge networking approach that promises versatile application by decoupling the vertical integration, separating the control plane from the data plane. The separation replaces traditional forwarding devices (e.g., routers, switches) with simple forwarding devices; decision making is shifted to a centralized controller, providing a global view of the network and programming abstractions. SDN leads to a highly scalable, cost-effective, holistically manageable networking system with cloud abstraction and rigorous monitoring.
[Figure 2.1 contrasts the two schemes: in the traditional scheme, the management plane (e.g., SNMP), the control plane (routing, forwarding rules, packet filter), and the data plane are bundled on each device with an embedded operating system; in the SDN scheme, routing, forwarding-rule, and packet-filter applications sit in the management plane above a customizable operating system that forms the control plane, with the data plane reduced to packet-forwarding hardware.]
Figure 2.1: System architecture: Traditional vs SDN
2.2 History of SDN Development
The concept of the programmable network has a long history [35]. The earliest initiatives to separate data and control signaling go back to the nineteenth century, and for a long time the approach worked within the existing packet structure. Some notable achievements of the programmable network were programmable ATM networks [36], NCP [37], SOFTNET [38], OPENSIG [39], Tempest [40], ANTS [41], and NetScript [42]. The early attempts to program the data plane had two approaches: programmable switches and data capsules. The first approach did not modify the packet structure, whereas the latter approach replaced the packet with a tiny program. POF is a model of modern programmable data plane devices. NCP initiated the idea of separating data and control signaling. Some major proposals for disintegrated control and data were ForCES [43], PCE [44], RCP [45], GSMP [46], SANE [47], OpenFlow [48], and NOX [49]. ATM network virtualization, initiated by the Tempest Project [50], was the earliest motivation behind SDN virtualization. Early network virtualization projects include VINI [51] and PlanetLab [52]. Moreover, the network operating system (NOS) is not a new concept: NOX [49], ONOS [53], Cisco IOS [54], and Junos [55] are popular NOS. The consistent effort of researchers to separate the three network planes, along with the virtualization concept, tailored the innovative idea of SDN in 2008 at MIT, Stanford University, and the University of California at Berkeley. The OpenFlow protocol ushered in a new era in computer networking research that introduced the concept of SDN. The Open Networking Foundation (ONF) [56] defined the protocol standards for OpenFlow. SDN also has great market potential: it is estimated that the SDN market will be worth more than $12 billion in 2022.
2.3 SDN Structure
The SDN structure is similar to the traditional network, as both have physically interconnected nodes. The simplest physical structure is composed of an SDN controller and connected switches [Fig.-2.2a]. The dotted line represents the logical connection between the controller and the switches, whereas the solid line depicts the physical connection. The interface between the SDN controller and its switches is called the southbound interface (SBI). SDN can be connected to the traditional network by the westbound interface. According to functionality, the SDN can be represented by three planes: the management plane, data plane, and control plane [Fig.-2.2b]. The shaded area indicates the SDN controller responsible for network control and management. The management plane is the top level of the hierarchy, where the network administrator defines and implements the network policy. The policy, called the flow table, is forwarded to the lower plane through a protocol on request. The executive nodes at the data plane execute the plan and transfer data accordingly. The following sections give an in-depth representation of the SDN structure.
2.3.1 SDN Interfaces
In computing, the formal definition of an interface is the connection of two or more pieces of network equipment or protocols. A typical network interface has some form of identifier, such as a port number or node ID. The SDN has four interfaces: the East, West, North, and Southbound.
The software-based northbound interface (NBI) provides the abstraction opportunity in the programming language, known as application program interface (API) design. It is still under research [57]. Researchers are working on developing a standard structure for the NBI, as currently the NOX, OpenDaylight [58], and Onix [59] controllers all have their own self-defined APIs for the NBI [60,61]. Similarly, commonly used NBI APIs include the NVF NBAPI [59,62], SDMN API [63], ad-hoc APIs, REST-ful APIs, and REST APIs [64]. On the other hand, some established NBI research achievements include the SFNet [65], yanc [66], and PANE [67] controllers.
In contrast to the NBI, the southbound interface (SBI) has a well-defined and widely accepted standard structure. The SBI is a collection of physical or virtual devices (e.g., Open vSwitch [68]) and APIs (e.g., OpenFlow, NetConf, SNMP). In addition, OVSDB [69], NetConf [70], BGP [71], and OpFlex [72] are also used in the SBI. The SBI defines the communication protocol between forwarding devices (i.e., SDN switches) and their controllers, outlining the instruction sets of the forwarding devices. Its objective is to push the execution order from the management plane to the data plane through a standard protocol. OpenFlow is the most common and widely used API for the SBI. OpenFlow's journey started with version 1.0, which had 12 fixed matching fields; it is now at version 1.5, with over 40 matching fields and a wider range of features [Fig.-2.3]. According to its inventors, it is a method to implement experimental protocols [48] by transforming a policy table, called a flow table (described in Fig.-2.3).
The eastbound and westbound interfaces are two special-case interfaces which are used for scalability. The westbound interface (WBI) is used to connect the SDN with a traditional network; the WBI translates the flow table into a traditional routing path, and vice versa. SDN's scalability demands the connection of multiple controllers, which in turn necessitates an inter-controller communication protocol: the eastbound interface (EBI) connects multiple SDN controllers. The EBI and WBI standards are still under research.
2.3.2 Software Infrastructure: SDN
SDN was developed to construct vendor-independent, user-defined, and policy-based networks. The SDN platform is structured on numerous NOS with a virtualization hypervisor. Virtualization is a consolidated technology that can create multiple virtual instances on a single device. A hypervisor is software that enables virtualization by letting different computing resources share the same physical machine. The network hypervisor is a solution to support numerous applications using a single physical topology. FlowVisor [73], FlowN [74], OpenVirteX [75], IBM SDN VE [76,77], and NVP [62] are some popular hypervisors in SDN.
[Figure 2.2a shows the physical structure: an SDN controller logically connected (dotted lines) and physically connected (solid lines) to SDN switches, top-of-rack switches, servers, and a gateway router, with a westbound interface to a legacy network. Figure 2.2b shows the logical structure: a management plane (routing, load balancing, firewall, security, virtualization, and other applications) above the northbound interface; a control plane (flow control table, GUI module, cluster module, network hypervisor) inside the SDN controller; and, below the southbound interface, the data plane of SDN switches and the participation plane.]
Figure 2.2: System architecture of SDN: (a) Physical Structure; (b) Logical Structure
A NOS is a core element of the SDN architecture that generates the control signals for network configuration based on management policies defined by the network administrator. NOS are designed for both centralized and distributed control. A centralized controller is a single entity that controls all the network activity, whereas distributed controllers share the management activity. Some of the popular NOS are listed in Table 2.1; they are developed on C, C++, Python, Java, and Ruby platforms.

Table 2.1: Popular SDN NOS

NOS                Architecture    Programming Language
Beacon [78]        centralized     Java
Floodlight [79]    centralized     Java
HyperFlow [80]     distributed     C++
Onix [59]          distributed     Python, C
Mul [81]           centralized     C
Kandoo [82]        distributed     C, C++, Python
NOX [49]           centralized     C++
NOX-MT [83]        centralized     C++
OpenDaylight [58]  distributed     Java
ONOS [53]          distributed     Java
POX [84]           centralized     Python
Ryu [85]           centralized     Python
Trema [86]         centralized     C, Ruby
2.3.3 Hardware Infrastructure: SDN
SDNs are composed of a controller and switches. The controller acts as the human brain, and the switches act as the executive parts (i.e., hands and feet). Fig.-2.3 represents the anatomical structure of the SDN hardware components; for simplicity, OpenFlow is considered as the SBI API. The SDN switch has three major components: the SBI API, the abstraction layer, and the packet processing unit. The SBI forms the interface between the controller and the switches, connected through a secure sockets layer (SSL). Flow tables are the major component of the abstraction layer, and it is possible to store multiple flow tables in a switch. A flow table has match, action, and statistics fields. Moreover, the switch might have a group table for flow control and a meter table to record the log activity.
Flow entries match incoming packets in order of priority, and a matching entry triggers the actions associated with that specific flow entry. If no match is found, the outcome depends on the network management policy. The packet processing unit is simply the physical infrastructure that implements open system interconnection (OSI) layer-1 and layer-2. An incoming packet can be matched according to the source or destination IP, input or output port, or VLAN tag, as well as any other user-specific fields. The action can be either to forward (to a single, multiple, or all exit ports) or to drop. The statistics field in the flow table counts events. The capacity of forwarding devices varies from 8,000 to 1,000,000 flow entries. The counts can be kept per table (packet match, reference count, packet lookup), per flow (received packets, received bytes, duration), per port (transferred, received, and dropped packet sizes and numbers, collisions), or per group (reference, packet byte count). The statistics can be collected by one request from the controller for QoS analysis. Furthermore, some of the available commercial SDN forwarding device (hardware and software) models are the CX600 series [87], MLX series [88], NoviSwitch 1248 [89], RackSwitch G8264 [90], Z-series [91], Open vSwitch, XorPlus [92], and contrail-vrouter [93].
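To make the match–action structure concrete, here is a minimal sketch of installing one flow entry with the Ryu controller framework (one of the Python NOS listed in Table 2.1). The switch-features trigger, the output port, and the destination address are illustrative assumptions, not values from this thesis.

```python
# A minimal sketch, assuming a Ryu/OpenFlow 1.3 setup; port numbers and
# the destination IP are hypothetical.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class FlowInstaller(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def on_switch_features(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser

        # Match field: IPv4 traffic destined to 10.0.0.2 (hypothetical host).
        match = parser.OFPMatch(eth_type=0x0800, ipv4_dst='10.0.0.2')
        # Action field: forward matching packets out of port 2.
        actions = [parser.OFPActionOutput(2)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS, actions)]

        # Higher-priority entries are matched first; the switch keeps
        # per-flow statistics (packets, bytes, duration) automatically.
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                      match=match, instructions=inst))
```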
The hardware anatomy in Fig.-2.3 depicts the controller's core functionality, both a northbound and a southbound API, and a few sample applications. The discovery module is triggered when a new event is initiated; events include connection, disconnection, and changes in the network. For example, installation of a new switch will trigger the discovery module, causing the SDN controller to take the necessary action to add the switch. Flow management is another important module of the controller. It is triggered when a data transfer request comes from a new end-user; this module contains the data flow policy and the actions against intruders. The statistics module helps to collect the QoS parameters from the switches under a controller; the set of switches under one controller is called its domain. Big Switch Networks, Cisco, Cumulus Networks, Hewlett Packard Enterprise, Juniper Networks, Nuage Networks, Pica8, Pluribus Networks, and VMware are the major vendors of SDN controllers.
[Figure 2.3 shows the SDN controller connected to the SDN switches over SSL, alongside a timeline of OpenFlow versions: v1.0 (2009) single table, static matching fields; v1.1 (2011) multiple tables, group table; v1.2 (2011) IPv6 support, controller role exchange; v1.3 (2012) meter table, table-miss entry; v1.4 (2013) synchronized tables; v1.5 (2015) egress table, scheduled bundles.]
Figure 2.3: Anatomy of SDN Hardware Components
2.4 Distributed SDN Architecture
The SDN can be extended into a distributed flat architecture [Fig.-2.4]. The controllers are connected by the eastbound interface, and the domains are connected through gateway switches (shaded in Fig.-2.4). The architecture is used for large-scale networking, but it has a Single Node Failure (SNF) problem: for example, if the domain-1 controller fails, domain-1 will shut down. The hierarchically distributed architecture instead has two layers of SDN controllers. The lower-layer controllers are responsible for their respective domains, while the upper-layer root controller is responsible for managing a group of domain controllers [Fig.-2.5]. The number of layers can be increased according to demand, and the responsibility is split accordingly among the layers. With more layers, the hierarchy is not affected by the single node failure problem [82]. Moreover, the data transfer between domains is handled by the gateway switch, and it is possible to have multiple gateway switches in a domain.

[Figure 2.4 shows three domains whose controllers peer over the eastbound interface (EBI), with shaded gateway switches linking the domains.]
Figure 2.4: Flat Architecture of SDN

[Figure 2.5 shows the two-layer hierarchy: a root controller manages the domain controllers over the eastbound interface (EBI), with gateway switches linking Domain-1, Domain-2, and Domain-3.]
Figure 2.5: Hierarchical SDN Architecture
2.5 Recent Development and Application of SDN
There is a tremendous amount of ongoing research regarding SDN directly, as well as its potential to be deployed in new areas. Its numerous practical applications have already led to better networking performance. SDN is presently studied extensively in areas such as traffic engineering [94–102], data centers [103–105], security [106–111], and wireless networking [112–114]. Network function virtualization, resource provisioning, traffic steering, and revenue modeling are the most popular research topics among the SDN development community.
CHAPTER III
OVERVIEW OF REINFORCEMENT LEARNING
This chapter briefly introduces artificial intelligence (AI) and ML, focusing on providing an overview of RL. It also covers the history, motivation, challenges, building-block algorithms, and applications of RL. The chapter will also describe the advantage actor-critic (A2C) method, which is used in the routing algorithm.
3.1 Artificial Intelligence, Artificial Neural Networks and Machine Learning
AI is a broad field of study that includes intelligent problem solving, knowledge reasoning, learning, and decision making. A human can learn from experience and interaction with the environment and hence can act intelligently; any system that mimics human intelligence can be considered an AI system. It can be called a simulation of human intelligence in a machine. The discipline was born in the summer of 1956 at Dartmouth College in Hanover, New Hampshire. In the twentieth century, research regarding AI greatly expanded. One notable development was the division of the field into symbolic-system AI and connectionist AI. A symbolic-system AI develops methods for representing problems with finite symbols that represent knowledge and are able to be understood by humans. On the other hand, a connectionist system is a bio-inspired model that consists of a large number of interconnected homogeneous units and weights. The latter approach is the more popular and effective solution for AI.
[Figure 3.1a places ML within the hierarchical structure of AI; Figure 3.1b classifies ML into supervised, semi-supervised, unsupervised, and reinforcement learning.]
Figure 3.1: AI and ML: (a) Hierarchical Structure of AI; (b) Classification of ML algorithms
ML is a subset of AI [Fig.-3.1a] that focuses on developing algorithms which drive the machine to make intelligent decisions. An algorithmic model is either predictive, making predictions, or descriptive, extracting information from the dataset. In the prediction model, the objective of ML is to extrapolate outcomes. According to the dataset, ML algorithms can be classified into four major categories: supervised, semi-supervised, unsupervised, and reinforcement learning [Fig.-3.1b]. Supervised learning is driven by a training dataset labeled with inputs and expected outputs; the performance of the algorithm is commensurate with the number of events recorded in the dataset. However, dataset collection and storage is a big challenge. Unsupervised learning only uses datasets that exclusively contain inputs; its objective is to classify data clusters having similar properties. The semi-supervised learning concept sits between the supervised and unsupervised algorithms in that it accepts data that is partially annotated. Finally, RL focuses on how the ML algorithm can take the best action in the current state, maximizing output. SDN routing can be considered an RL problem, where the controller will discover the best path that has minimum cost and maximum throughput. The following section will briefly describe the RL algorithm.
An artificial neural network (ANN), also known as a neural network (NN), is a model inspired by the popular connectionist ML approach; it was first proposed by Warren McCulloch and Walter Pitts in 1943 [115]. From that time onward, several ANN models were proposed until back propagation (BP), a significant achievement in the field, was proposed by Paul Werbos [116]. Initially, its high computation-speed requirements caused the development of BP to lag, but a boost in computational speed, along with parallel computation models, helped BP eventually gain popularity among researchers. Currently, almost all ANN algorithms developed are based on the BP algorithm. After the year 2000, ANN became even more popular with the introduction of the deep neural network [117,118]. A deep NN is a technique composed of multiple nonlinear transformation modules that transform a low-level representation into a higher, more abstract representation. The technique makes it easier to construct the feature vector.
22 3.2 Reinforcement Learning
The development of the RL framework was inspired by behavioral psychology to solve the sequential decision-making problem in 1984. The main idea behind RL is to train the RL agent to learn the environment by continuously taking actions and learning from the environment's responses [Fig.-3.2]. The process is repeated until the RL agent reaches a predefined goal that returns the maximum appreciation, called the reward. The process creates a sequence of states, actions, and rewards: $(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_k, a_k, r_k)$. Primarily, RL can be classified into two categories: model-based and model-free methods [Fig.-3.3]. A model-based RL method employs state, action, and reward to obtain the maximum expected reward; the RL agent knows the environmental model. On the other hand, a model-free method utilizes action and reward to maximize the expected reward in the absence of environmental knowledge. The model-free RL methods are further divided into two subcategories: value-based and policy-based. This chapter mostly describes the model-free RL methods. Moreover, the value-update scheme divides RL into on-policy and off-policy methods: an on-policy method evaluates and improves the same policy that it uses to select actions, whereas an off-policy method estimates the value of a target policy (e.g., the greedy policy) that differs from the behavior policy it follows.
3.2.1 History
The concept of RL was inspired in the early 1980s by the psychology of animal learning. Markov decision processes (MDPs), tools to predict discrete stochastic processes, were proposed for use in RL by Richard Bellman in 1957 [119]; the method provides a framework for decision making in a dynamic environment. Furthermore, Ron Howard introduced policy iteration in 1960 [120]. These are the building-block algorithms for RL development. The Law of Effect, which discusses the aftereffect of an action, was proposed by Edward Thorndike in 1898.
th ( | ) = ܵ א actionݏat i ܣ אExpectedܽ Reward ܴାଵ ܧ ݎାଵ ݏ
Figure 3.2: Example of RL scenario
Reinforcement Learning
Model Based Method Model Free Method
Value Based Policy Based Method Method
Figure 3.3: Classification of RL
processes, were proposed for use in RL by Richard Bellman in 1957 [119]. The method provides a framework for decision making in a dynamic environment. Fur- thermore, Ron Howard introduced policy iteration in 1960 [120]. Those are the building block algorithm for RL development. The theory of Law of Effect proposed by Edward Thorndike in 1898. It discusses the aftereffect of an action. Sixty years
Sixty years later, the first analog machine implementing an RL model, known as the SNARC (Stochastic Neural-Analog Reinforcement Calculator), was built by Minsky et al. Around the same time, in 1960, the term reinforcement learning was formally proposed. Presently, STeLLA [121], BOXES [122], the Tsetlin machine [123], temporal-difference learning TD(λ) [124], Q-learning [125,126], Actor-Critic [127], DDPG [128], A3C [29], A2C [29], Rainbow [129], and DDQN [130] are the most notable achievements in RL research. Moreover, deep-learning-aided RL is now popular and can solve complex, high-dimensional sequential decision-making problems. Some of the notable deep-RL-based research projects are Atari [131], AlphaGo [132], self-driving cars [133], and robotics [134].
3.2.2 RL algorithm: Components
This subsection will describe some of the parameters that are necessary to understand the advantage actor-critic method.
State: a feature representation of the test environment at the time of an action. Symbolic representation: $s_i$ = the state at the $i$th action, which is an element of the total state set, $s_i \in S$.
Action: an event that changes the state to achieve the objectives. Symbolic representation: $a_i$ = the action taken at the $i$th step, which is an element of the action set, $a_i \in A$.
Reward: a scalar quantity expressing the appreciation level of an action; for example, a positive numeric reward value could represent a positive action. Symbolic representation: $r_i$ = the reward at the $i$th action; expected reward at the $i$th action: $R_i = E(r_{i+1} \mid s_i)$.
Markov decision processes (MDP): a full specification of the RL problem can be represented by an MDP [135]. It is a decision-making framework and stochastic control process for discrete processes. An MDP can be represented by the tuple of state, action, transition probability, and reward: $(S, A, P, R)$. The transition probability is available only for a deterministic environment.
Episode: a sequence of actions from the start to some terminal state. The number of actions can be capped at a maximum within which the RL agent should achieve its objectives. The term applies only to a discrete environment; in a continuous system, the actions continue forever.
State Transition Probability: the probability of changing from a given state to a next state. The state transition probability is available for the deterministic model, where all the model parameters are known; in the case of a stochastic environment, the parameters are unknown. Symbolic representation: $P[s_{i+1} \mid s_i, a_i]$.
Discount Factor: the numeric present value of future reward, ranging between 0 and 1. The factor bounds the reward value in an infinite state–action–reward space. Symbolic representation: $\gamma \in [0, 1]$.
Return: the total discounted reward from the $i$th action. Symbolic representation: $G_i = r_{i+1} + \gamma r_{i+2} + \cdots = \sum_{k=0}^{+\infty} \gamma^k r_{i+k+1}$, where $\gamma$ is the discount factor.
Value Function: a numerical value that measures the benefit of each state and action. Symbolic representation: $V(s) = E[G_i \mid s_i = s]\ \forall s \in S$.
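As a brief worked illustration of the return (the reward values and discount factor are assumed for the example): with $\gamma = 0.9$ and rewards $r_{i+1} = 1$, $r_{i+2} = 0$, $r_{i+3} = 2$, and zero afterward,
$$G_i = r_{i+1} + \gamma r_{i+2} + \gamma^2 r_{i+3} = 1 + 0.9 \times 0 + 0.81 \times 2 = 2.62.$$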
Policy: a behavior or strategy of an agent; the policy can be stochastic or deterministic. Symbolic representation: deterministic policy, $a = \pi(s)$, $\pi: S \to A$; stochastic policy, $\pi(a \mid s) = P[a \mid s]\ \forall s \in S, \forall a \in A$.
ε-Greedy Policy Function: an algorithm that selects the action. The epsilon value determines how often the RL agent acts on a trial-and-error basis. The pseudocode for the ε-greedy algorithm is given in Algorithm 1. The main idea is to mostly select random actions (exploration) at the beginning and, as the iterations progress, to increasingly select the most probable action (exploitation).
Algorithm 1: ε-greedy algorithm
input : ε, maxiter, A, probA
output: actionarray (1 × maxiter)
1   for i ← 0 to maxiter do
2       temp ← rand();
3       if temp < ε then
4           actionarray[i] ← randomaction(A);
5       else
6           actionarray[i] ← returnmaxaction(A, probA);
7       ε ← ε × d;    // decay factor d ∈ (0, 1), so exploration shrinks over the iterations
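A minimal Python sketch of the same selection rule (the action-value estimates and the decay rate are illustrative assumptions):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Illustrative use: three actions, epsilon decaying each iteration.
q_values = [0.1, 0.5, 0.2]
epsilon, decay = 1.0, 0.99
for i in range(1000):
    action = epsilon_greedy(q_values, epsilon)
    epsilon *= decay  # exploration shrinks, exploitation grows
```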
State Value Function: the expected reward starting at state s and following policy π. The maximum state value function is also known as the optimal value function. Symbolic representation: state value function, $v_\pi(s) = E_\pi[G_i \mid s_i = s]\ \forall s \in S$; optimal state value function, $v_*(s) = \max_\pi v_\pi(s)\ \forall s \in S$.
Action Value Function: the expected return from state s and action a by following policy π. The optimal action value function is the maximum action value function. Symbolic representation: action value function, $q_\pi(s, a) = E_\pi[G_i \mid s_i = s, a_i = a]$, $q: S \times A \to \mathbb{R}$; optimal action value function, $q_*(s, a) = \max_\pi q_\pi(s, a)$, $q: S \times A \to \mathbb{R}$.
Policy Gradient [136]: an optimization method that uses gradient ascent to optimize the policy for the maximum expected reward. This helps the policy iteration method find the optimal policy. Policy iteration has two phases: policy evaluation and updates to the NN parameters. The partial derivative of a stochastic policy over a trajectory τ can be expressed as:
$$\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} = \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau) \tag{3.1}$$
where, for a continuous space, the policy of a trajectory τ with given NN parameter vector θ is
$$\pi_\theta(\tau) = \pi_\theta(s_0, a_0, s_1, a_1, \ldots, s_T, a_T) = p(s_1)\prod_{i=1}^{T} \pi_\theta(a_i \mid s_i)\, p(s_{i+1} \mid s_i, a_i).$$
Consider the objective function J(θ), which can be defined as:
$$J(\theta) = E[r(\tau)] = \int \pi_\theta(\tau)\, r(\tau)\, d\tau \tag{3.2}$$
The maximization of the objective function requires finding the optimal value of the network parameter θ. The gradient of the objective function is:
$$\nabla_\theta J(\theta) = \int \nabla_\theta \pi_\theta(\tau)\, r(\tau)\, d\tau = \int \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\, d\tau = E\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right] \tag{3.3}$$
According to the definition of the policy function πθ over a trajectory τ and N episodes, the expectation can be approximated by an equation with a maximum-log-likelihood term, as below:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \underbrace{\left( \sum_{j=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,j} \mid s_{i,j}) \right)}_{\text{maximum log likelihood}} \left( \sum_{j=1}^{T} r(s_{i,j}, a_{i,j}) \right) \tag{3.4}$$
The maximum log-likelihood measures how likely the trajectory is under the current policy. The likelihood of a policy is increased if the trajectory results in a high positive reward; for a negative reward, the process is reversed. The model parameters are thus updated to increase the likelihood of trajectories that score higher rewards.
As a result, Eq. 3.4 is used to evaluate the policy. The policy parameter θ of the NN is updated with the learning rate α ∈ [0, 1] by the following equation:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \tag{3.5}$$
The large number of possible actions in a stochastic policy leads to high variance in the general policy gradient method. There is a slightly modified version in which a baseline function b(i, j) is subtracted from the reward to address the high-variance problem. The policy gradient with a baseline function can be represented by the following equation:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,j} \mid s_{i,j}) \left( r(s_{i,j}, a_{i,j}) - b(i,j) \right) \tag{3.6}$$
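To make Eqs. 3.4–3.6 concrete, the sketch below estimates the baselined gradient for a tabular softmax policy. The toy environment, batch size, and mean-return baseline are illustrative assumptions, not the thesis's SDN setting:

```python
# A minimal sketch of the policy-gradient update in Eqs. (3.4)-(3.6) for a
# softmax policy with tabular parameters; the environment is hypothetical.
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # policy parameters

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """grad_theta log pi(a|s) for the softmax policy: one-hot(a) - pi(.|s)."""
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

def run_episode(T=10):
    """Toy rollout: random transitions, reward 1 when the action matches s % 2."""
    traj, rewards = [], []
    s = np.random.randint(n_states)
    for _ in range(T):
        a = np.random.choice(n_actions, p=policy(s))
        traj.append((s, a))
        rewards.append(1.0 if a == s % 2 else 0.0)
        s = np.random.randint(n_states)
    return traj, rewards

alpha, N = 0.05, 16
for it in range(200):
    grad = np.zeros_like(theta)
    episodes = [run_episode() for _ in range(N)]
    baseline = np.mean([sum(r) for _, r in episodes])       # b in Eq. (3.6)
    for traj, rewards in episodes:
        score = sum(grad_log_pi(s, a) for s, a in traj)     # max-log-likelihood term
        grad += score * (sum(rewards) - baseline)           # (return - baseline)
    theta += alpha * grad / N                               # Eq. (3.5), gradient ascent
```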
Value Iteration: the extraction of the optimal value function and the best policy. The method is simple but requires more computation than policy iteration. The iterative value update can be represented by the following mathematical equation:
$$V_\pi(s_i) = \sum_{a_i} \pi(a_i \mid s_i) \sum_{s_{i+1},\, r_i} p(s_{i+1}, r_i \mid s_i, a_i)\left[ r_i + \gamma V(s_{i+1}) \right] \tag{3.7}$$
For a large number of iterations $i$: $V_\pi \approx V_*$.
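A minimal sketch of the iterative update of Eq. 3.7 on an assumed two-state MDP; note that Eq. 3.7 as written evaluates a fixed policy π, while classic value iteration replaces the policy-weighted sum with a max over actions (shown at the end):

```python
# Toy MDP (illustrative): P[s][a] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
gamma = 0.9
pi = {s: {a: 0.5 for a in P[s]} for s in P}   # uniform policy pi(a|s)
V = {s: 0.0 for s in P}

# Repeated sweeps of Eq. (3.7): V converges to V_pi for the fixed policy.
for _ in range(200):
    V = {s: sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}

# Classic value iteration: max over actions instead of the policy average.
for _ in range(200):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
print(V)  # approximately the optimal state values V*
```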
3.2.3 RL algorithm: MC Method
The Monte Carlo (MC) method is a simple model-free RL method. The method learns the state value $v_\pi(s)$ under the policy π by using the average return at the end of each episode. The MC method can be first-visit or every-visit: the first-visit MC method updates the return of a state $s_i$ only the first time it is encountered in an episode, whereas the every-visit method updates it at every encounter of $s_i$. The pseudocode for the every-visit MC method is given in Algorithm 2, below.
Algorithm 2: Every-visit MC algorithm (on-policy)
input : policy π, maxepisode
output: optimal state value v∗
1   statecount(s) ← 0 for all s ∈ S;
2   return(s) ← 0 for all s ∈ S;
3   V ← 0 for all s ∈ S;
4   for e ← 1 to maxepisode do
5       generate an episode by π → s0, a0, r0, s1, a1, r1, ..., sT, aT, rT;
6       initialize the return, G ← 0;
7       for i ← T − 1 to 0 do
8           G ← γG + r_{i+1};    // discounted return, accumulated backward
9           return(s_i) ← return(s_i) + G;
10          statecount(s_i) ← statecount(s_i) + 1;
11  v∗(s) ← return(s) / statecount(s) for s ∈ S;
12  Return v∗;
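A minimal Python rendering of Algorithm 2; the episode generator is an illustrative stand-in for a real environment rollout:

```python
import random
from collections import defaultdict

def every_visit_mc(generate_episode, max_episode, gamma=0.9):
    """Estimate state values by averaging discounted returns over all visits.
    `generate_episode` must return a list of (state, action, reward) tuples."""
    totals = defaultdict(float)   # return(s): sum of returns observed at s
    counts = defaultdict(int)     # statecount(s): number of visits to s
    for _ in range(max_episode):
        episode = generate_episode()
        G = 0.0
        for s, a, r in reversed(episode):   # accumulate the return backward
            G = gamma * G + r
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in counts}

# Illustrative use with a two-state toy episode generator.
v = every_visit_mc(
    lambda: [(random.randint(0, 1), 0, random.random()) for _ in range(10)],
    max_episode=5000)
```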
output: Optimal State Value: v∗
1 statecount(s) ← 0 for all s ∈ S; 2 return(s) ← 0 for all s ∈ S; 3 V ← 0 for all s ∈ S; 4 for e ← 1 to maxepisode do
5 Generate episode by π → s0, a0,r0,s1, a1,r1, ...... , sT , aT ,rT ; 6 Initialize return,G ← 0; 7 for i ← T − 1 to 0 do
8 G ← G + ri+1;
9 return(si) ← return(si)+ Gi;
10 statecount(si) ← statecount(si) + 1;
return(s) 11 ∗ v (s) ← statecount(s) fors ∈ S; 12 Return v∗; is called TD error or TD target, which is used in the actor-critic method(this will be described in the next section).
3.2.5 RL algorithm: Actor-Critic Method
The AC method is the building block of modern RL algorithms like asynchronous advantage actor-critic (A3C) and proximal policy optimization (PPO). This method combines the value and policy-based RL methods into a single algorithm[Fig.-3.4a].
This makes the AC method faster in a continuous stochastic environment. The struc- ture of the actor-critic is shown in Fig.-3.4. It consists of two parts, the actor and critic, where both interact back and forth to converge on an optimal solution. The actor, policy iteration network, is responsible for taking action and updating the
3.2.5 RL algorithm: Actor-Critic Method
The AC method is the building block of modern RL algorithms like asynchronous advantage actor-critic (A3C) and proximal policy optimization (PPO). This method combines the value-based and policy-based RL methods into a single algorithm [Fig.-3.4a], which makes the AC method faster in a continuous stochastic environment. The structure of the actor-critic is shown in Fig.-3.4. It consists of two parts, the actor and the critic, which interact back and forth to converge on an optimal solution. The actor, a policy iteration network, is responsible for taking actions and updating the network parameter θ by the policy gradient method. The critic evaluates the effectiveness of the actor network's actions by the TD method; the critic can use the action value or state value iteration method. The overall process is as follows: according to the current policy, the actor generates an action a and applies it to the environment.
The environment returns the reward and the next state. The critic evaluates the value of the state by using the equation in line 8 of Algorithm 3 and sends the TD error (δ) as feedback to the actor network. The actor network uses the δ from the critic and updates the NN parameters by policy gradient. The overall process is shown in Fig.-3.4b.
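The loop below is a minimal tabular sketch of this actor-critic interaction: the critic's TD error δ (Algorithm 3, line 8) scales the actor's policy-gradient step. The toy environment and learning rates are assumptions; the thesis's A2C replaces the tables with neural networks:

```python
import numpy as np

n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
V = np.zeros(n_states)                   # critic: state-value table
alpha_actor, alpha_critic, gamma = 0.05, 0.1, 0.9

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def env_step(s, a):
    """Toy environment: reward 1 when the action matches the state's parity."""
    return (1.0 if a == s % 2 else 0.0), np.random.randint(n_states)

s = 0
for step in range(20000):
    a = np.random.choice(n_actions, p=pi(s))     # actor picks an action
    r, s_next = env_step(s, a)
    delta = r + gamma * V[s_next] - V[s]          # critic: TD error (Algorithm 3, line 8)
    V[s] += alpha_critic * delta                  # critic update
    grad = -pi(s)                                 # actor: grad of log softmax ...
    grad[a] += 1.0
    theta[s] += alpha_actor * delta * grad        # ... scaled by delta (policy gradient)
    s = s_next
```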