University of Nevada, Reno

Wireless Management Using Predictive Analytics

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering

by

Alisha Thapaliya

Dr. Shamik Sengupta/Thesis Advisor

December, 2018

THE GRADUATE SCHOOL

We recommend that the thesis prepared under our supervision by

ALISHA THAPALIYA

Entitled

Wireless Network Congestion Management Using Predictive Analytics

be accepted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Shamik Sengupta, Ph.D., Advisor

Lei Yang, Ph.D., Committee Member

Hanif Livani, Ph.D., Graduate School Representative

David W. Zeh, Ph.D., Dean, Graduate School

December, 2018

Abstract

Wi-Fi Access Points (APs) deployed publicly are facing serious demands due to the proliferation of Wi-Fi enabled devices. This becomes more prominent when the user crowd moves dynamically in space, creating a sporadic usage pattern. In order to cater to the dynamically changing spectrum demands, we need to identify areas with high spectrum usage that need better Wi-Fi coverage. In this thesis, we aim to understand the dynamic spectrum usage over space, time and channels. The temporal and spatial analysis helps us to identify places that are highly congested at any given time. The channel usage pattern determines the channels that are over utilized and under utilized in the congested areas.

The usage data from user devices can be analyzed to answer a number of possible questions in regards to congestion, access point load balancing, user mobility trends and efficient channel allocation. Using this data, we attempt to identify Wi-Fi usage trends in a dynamic environment and use them to further predict the congestion in various locations. To accomplish this, we have used the University of Nevada, Reno (UNR) as the site of our experiments, where we use various supervised learning algorithms to find the existing patterns in spectrum usage inside UNR. Using these patterns, we predict the values for certain key attributes that directly correlate to the congestion status of any location. Finally, we apply unsupervised learning algorithms to these predicted data instances to cluster them into different groups. Each group determines the level of congestion for any building at any time of any day. This way, we will be able to ascertain whether or not any place at any time in the future might require additional resources to be able to deliver wireless services efficiently.

In an attempt to deliver wireless services in a resourceful manner, we also discuss self-coexistence among networks, where secondary networks can access licensed bands without interfering with the primary networks. This technology, referred to as dynamic spectrum access, allows underutilized frequency bands to be used, avoiding the need for additional resources. With all the secondary networks trying to access an available channel, there arises a game theoretic competition in which each network wants to get a channel for itself while incurring as little cost/time as possible. We implement a predictive strategy in the networks for them to land on an available channel in the shortest time possible, minimizing the collisions among themselves. Thus, we investigate various predictive algorithms and observe how a self-learning approach can be helpful in maximizing the utilities of the players in comparison to traditional game theoretic approaches.

Acknowledgements

I would like to offer my sincere thanks to my advisor, Dr. Shamik Sengupta. Without his support, this thesis would not have been possible. He has always provided me with insightful suggestions and guidance throughout the journey of my Master's education. I would also like to thank my other committee members, Dr. Lei Yang and Dr. Hanif Livani, for giving me proper feedback and support. My thanks also go to the developers of the Python library scikit-learn and the WEKA tool, with which I was able to successfully conduct my research with valuable results. I acknowledge, from the bottom of my heart, the support of my research by the National Science Foundation (NSF), Award #1346600, Award #1516724, and Award #1723814. Lastly, thanks to all of my friends and family who have been a constant support in my life.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Network Congestion
  1.2 Research Problem
    1.2.1 Aerial Access Points
    1.2.2 Optimized Resource Allocation
    1.2.3 Dynamic Spectrum Access
  1.3 Thesis Organization

2 Related Works
  2.1 Wi-Fi Data Analysis
  2.2 Network Self-Coexistence

3 Wi-Fi Spectrum Usage Analytics
  3.1 Methodology
    3.1.1 Data Collection
    3.1.2 Data Preprocessing
    3.1.3 Data Analysis
  3.2 Results

4 Predicting Congestion Level in Wireless Networks
  4.1 Methodology
    4.1.1 Data Collection
    4.1.2 Dataset Creation and Pre-processing
    4.1.3 Supervised Learning
    4.1.4 Unsupervised Learning
  4.2 Results
    4.2.1 SVR Prediction Model
    4.2.2 EM clustering
    4.2.3 Final output

5 Incorporating Machine Learning in a Game Theoretic Environment for Dynamic Spectrum Access
  5.1 System Model
    5.1.1 Challenges
  5.2 Self-learning in the game
    5.2.1 Linear Regression
    5.2.2 Support Vector Regression
    5.2.3 Elastic Net
  5.3 Proposed Mechanism
  5.4 Results

6 Conclusion and Future Works

Bibliography

List of Tables

3.1 Mean and standard deviation of the group of users associated with each WiFi channel in ABB
3.2 Total number of users and the channels with the highest number of connected users in the 2.4 GHz and 5 GHz frequency range on March 15, Wednesday at 12:05 p.m.
3.3 Total number of users and the channels with the highest number of connected users in the 2.4 GHz and 5 GHz frequency range on March 28, Tuesday at 8:30 p.m.
3.4 Total number of users and the channels with the highest number of connected users in the 2.4 GHz and 5 GHz frequency range on April 01, Saturday at 4:00 p.m.
4.1 SVR Output Prediction (MSE in parenthesis)
4.2 Analysis of data in cluster2, cluster0 and cluster1
5.1 Strategic-form minority game with network x and y
5.2 Comparison between experimentally calculated and predicted probabilities
5.3 Mean square errors of predictive algorithms
5.4 Comparison between the time taken to reach equilibrium (in time units) when using different strategies

List of Figures

3.1 JCSU Wi-Fi Usage
3.2 ABB Wi-Fi Usage
3.3 Channel Load in ABB
3.4 Heatmap: March 15, Wednesday, 12:05 pm
3.5 Heatmap: March 28, Tuesday, 8:30 pm
3.6 Heatmap: April 01, Saturday, 4:00 pm
4.1 Hourly Clients, Throughput, Frame Retry and Frame Error
4.2 Data used for EM clustering
4.3 The 3 clusters generated by the EM algorithm
4.4 Evaluation of the EM clustering model
4.5 Predicted data used in the EM clustering model
4.6 Clustering Model Output
5.1 Networks and Channels a) at the beginning of the game; b) after the first stage when Network 2 got a channel; c) after the second stage when Networks 1, 3 and 4 got channels; d) at the last stage when equilibrium is achieved
5.2 Various stages of the game divided into time slots
5.3 Time to reach equilibrium (in time units) and corresponding optimal probability to switch with varying number of channels
5.4 Channel switching probability with varying number of channels
5.5 Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 10
5.6 Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 20
5.7 Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 30
5.8 Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 40

Chapter 1

Introduction

1.1 Network Congestion

The ubiquity of Wi-Fi enabled devices has placed a serious load on Access Points (APs), especially in public places, because of the high density of people accessing the internet. This load varies over space, time and channels. The crowds in public places such as universities, cafeterias and seminar halls move randomly, thus generating different levels of congestion at different places across different times. Additionally, this load is also unevenly distributed among the Wi-Fi channels employed by these APs. Studies conducted in this area have shown that many frequencies in the Wi-Fi spectrum are under utilized while some of the frequencies are seriously over utilized [1].

We need continuous access to wireless networks with greater capacity, performance, and throughput because there is a growing number of clients in wireless networks with more demand for a high quality internet connection [2]. This widely spreading need for wireless services calls for the deployment of additional resources when demand reaches its peak. This is highly probable in scenarios where it is difficult to identify the Wi-Fi usage pattern, mostly in dynamic public places where the spectrum usage is both time and space dependent.

1.2 Research Problem

In this thesis, we discuss different approaches to tackling the growing congestion. Our ideas are divided into separate research topics. Each topic is handled individually, and we identify new solutions using simulations and experiments. The main goal is to use these solutions, either separately or in a combined manner, to optimize wireless networks.

1.2.1 Aerial Access Points

The sheer increase in the ever changing technological demands of today's era has necessitated that these demands be fulfilled in an autonomous and dynamic manner. The concept of a floating AP is important to achieve load balancing dynamically in a predefined manner, which can be applied if we know the spectrum utilization patterns in Wi-Fi bands over space, time and channels. UAVs have been dominantly used for dynamic service provisioning in many such cases, be it in disaster recovery [3][4], wildlife monitoring [5], helping soldiers on the battlefield [6], package delivery [7] and so on. A remarkable contribution that can be made by UAVs is to provide a better wireless service to the users of a concerned region when operated as floating APs. The stability, agility and autonomy of these UAVs make them very efficient in roaming around the network to better serve user demands.

While such a network may seem fantastical, the concept of an airborne-based network isn't all that far-fetched. Large-scale companies like Google and Facebook have already taken initiatives to provide internet access to remote locations via aerial methods. Google calls it "Project Loon" and for Facebook it is "Internet.org" [8][9]. These devices are meant to be operational at the stratospheric level at very high altitudes using power from the sun and wind [10][11]. Therefore, for commercial and small-scale industries, such techniques are nowhere near affordable. However, similar concepts can be implemented to provide better internet access by scaling down the whole mechanism to a more practical level.

In this research, we are trying to understand the feasibility of proactive deployment of floating APs inside a university where the Wi-Fi spectrum usage varies highly across different buildings depending upon the time of the day. For this purpose we have conducted our study at the University of Nevada, Reno (UNR). Since our area of research is a university, the user base is mobile and moves to different locations depending on lunch breaks, class times, etc. However, mobile users tend to geographically cluster around several specific APs [12]. This calls for the dynamic assignment of APs according to user demands in several places. Human behaviour can be highly predictable [13]. If we narrow our focus to some of the buildings with high user density and analyze the spectrum usage statistics, we can discover patterns in the channel usage behaviour among the users at different periods of time. The mobility pattern can be detected by analyzing historic user behavior [14], [15]. This pattern among the users can help us identify the locations and the peak hours when some of the channels go through the highest level of congestion while some are still under utilized. To balance this load, the dynamic floating APs can be deployed proactively to serve user demands by moving around in the hot zones when the user crowd is at its maximum.

1.2.2 Optimized Resource Allocation

With the disproportionate and time varying usage of wireless services, studies show that some Wi-Fi resources are under-utilized in some locations while they are in high demand at other locations [1]. Therefore, instead of adding new resources every time the congestion level increases, it would be more efficient to utilize resources from areas where the congestion level is relatively low. This leads to an optimized resource allocation problem, which can be highly useful in a modern day setting where it is impossible to continuously deploy wireless APs as the number of wireless users varies.

According to the Cisco report, traffic from wireless and mobile devices will exceed traffic from wired devices by 2018. It states that in 2018, Wi-Fi and mobile devices will account for 61 percent of IP traffic [2]. Higher traffic equates to more data in terms of traffic load, number of users, quality of wireless signal, etc., which can be utilized for network management, optimization and fine-tuning [16]. In this thesis, we attempt to determine the congestion level at various locations inside the University of Nevada, Reno (UNR) for a given day and time using the data collected from the wireless activity of users. We investigate the patterns in user data and then apply both supervised and unsupervised machine learning algorithms to predict congestion in the future.

We believe this integrated Wi-Fi congestion prediction model can be used to control an autonomous resource allocation program. Ideally, the supervised prediction model, which will be fed real time data in order to create a more accurate prediction, will predict hours or even days ahead of time and send those predictions to the clustering algorithm. When the clustering algorithm predicts the future level of congestion, a resource allocator can see that at any time t, building y will have low congestion while building z will have high congestion. The resource allocator can then take resources from building y (for the duration of time that it has low congestion) and assign them to building z in order to improve the quality of signal in building z. This autonomous Wi-Fi resource allocator can then act as a load balancer for all buildings that have access to Wi-Fi data.
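As a minimal sketch of this allocation loop (all names here are hypothetical; `predict_congestion` stands in for the integrated prediction pipeline developed in Chapter 4), the allocator could pair low-congestion donor buildings with high-congestion receivers at a given time:

```python
# Hypothetical sketch of the autonomous resource allocator described above.
# predict_congestion(building, t) stands in for the supervised-prediction +
# clustering pipeline of Chapter 4 and returns "low", "medium" or "high".

def reallocate(buildings, t, predict_congestion):
    """Pair buildings with idle capacity with congested buildings at time t."""
    levels = {b: predict_congestion(b, t) for b in buildings}
    donors = [b for b, lvl in levels.items() if lvl == "low"]
    receivers = [b for b, lvl in levels.items() if lvl == "high"]
    # Each (donor, receiver) pair marks a resource transfer for this time slot.
    return list(zip(donors, receivers))
```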

1.2.3 Dynamic Spectrum Access

The key enabling technology in dynamic spectrum access is Cognitive Radio, which allows unlicensed secondary users to access licensed bands without causing any interference to the primary users. Cognitive Radios can sense the wireless environment, identify the wireless channels not being used by primary users, and access them dynamically; this technique is called Dynamic Spectrum Access [1]. As the unoccupied spectrum bands are identified, the networks start looking for a channel for themselves. Since all the networks are searching for channels in a greedy manner, naturally, each network wants to find an available channel in the least possible time. The networks have to find a way to self-coexist so that they would not cause any harm to others or themselves. When the networks start looking for channels randomly, there is a high possibility of collision among themselves, thereby disrupting the QoS, especially when the numbers of networks and channels are nearly equal. To avoid such collisions, the networks have to decide whether to switch to another channel or stay where they are. We look into this problem through a game theoretic lens: the players are networks, their strategy is to switch or stay, and their utility is a function of the time taken to reach system convergence, i.e., when all the networks have acquired an available channel for transmission.

However, the process of finding an available spectrum costs time, which they want to minimize. That is why it is not possible to sense and search too many channels every time [17]. The networks have to find a way to look for as few channels as possible and get an available channel for transmission. It could be beneficial if the networks learn to identify the best strategy for themselves and learn to model their behavior with the changing environmental components. In this thesis, we deal with the self-coexistence of networks using a self-learning approach. We investigate various predictive algorithms, namely Linear Regression, Support Vector Regression and Elastic Net, and compare them with other traditional non-predictive game theoretic mechanisms. We measure the accuracy of these algorithms in terms of the time taken to reach system convergence. We also observe how a self-learning approach can be helpful in maximizing the utilities of the players in comparison to traditional game theoretic approaches.
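To make the switch-or-stay dynamic concrete, the following toy simulation (an illustrative sketch under simplifying assumptions, not the thesis's exact game) lets each colliding network switch to a random free channel with a fixed probability and counts the time slots until every network holds a channel alone:

```python
import random

def time_to_convergence(n_networks, n_channels, p_switch):
    """Toy switch/stay game: contending networks pick channels; a network
    alone on a free channel keeps it, colliders switch w.p. p_switch."""
    assert n_channels >= n_networks
    free = set(range(n_channels))              # channels nobody has acquired yet
    choice = {i: random.choice(tuple(free)) for i in range(n_networks)}
    t = 0
    while choice:                              # some networks still contending
        t += 1
        occupancy = {}
        for i, c in choice.items():
            occupancy.setdefault(c, []).append(i)
        for c, nets in occupancy.items():
            if len(nets) == 1 and c in free:   # alone on a free channel: acquired
                free.discard(c)
                del choice[nets[0]]
        for i in choice:                       # remaining colliders switch or stay
            if random.random() < p_switch:
                choice[i] = random.choice(tuple(free))
    return t

print(time_to_convergence(10, 10, 0.5))        # e.g., N = M = 10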

1.3 Thesis Organization

The rest of this thesis is organized as follows. In Chapter 2, we discuss some of the related research that has been done in the wireless networking community. Chapter 3 provides ideas on how we can use floating access points to solve the problem of network congestion by performing spectrum usage analysis within UNR. In Chapter 4, we predict congestion at various locations in the future using an integrated approach of supervised and unsupervised learning models to tackle the growing congestion through an optimized resource allocation technique. In Chapter 5, we show how implementing a predictive strategy in a game theoretic setting can help achieve dynamic spectrum access. Finally, Chapter 6 concludes the thesis, providing insights on how this work can be further improved upon.

Chapter 2

Related Works

2.1 Wi-Fi Data Analysis

The analysis of data obtained from user activity in the wireless environment has gathered much attention in the networking community in recent times. A lot of relevant work has already been done in this regard.

In [18], the authors have focused on the performance of Wi-Fi based wireless networks to achieve network optimization. In this paper, an analysis of the Dartmouth WiFi campus network, composed of 476 access points, has been conducted. The data collected spanned a duration of 2 months and was collected via SNMP polling and syslog messaging. Furthermore, backend traffic was captured using passive sniffers and added to the database. The usage data from all of these sources was analyzed to extract information regarding various aspects of the wireless networks, such as network traffic (traffic load the network can handle, traffic per card, traffic across weekdays vs weekends), user mobility, card activity, access point activity, etc. By observing the high variance in the activity of buildings, access points and cards, network designers can implement new solutions to optimize wireless networks in a more resourceful manner.

A similar study related to understanding the trends in the use of academic Wi-Fi networks has been done in [19]. This paper analyzes data collected with syslog. The usage activity patterns on a daily and weekly basis demonstrated that the number of active devices increases in the morning and evening, and is considerably lower in the afternoon. The main finding is that users in a small area are more frequent and recurring than users in a larger area, where the population is more heterogeneous and the main difference among buildings depends on the users who actively access the internet in those locations.

The authors in [20] have collected Wi-Fi data within the MIT network, with 3000 access points, to perform a spatial and temporal analysis of the traffic flowing through the access points. In this paper, clustering techniques have been used to classify location dependent network behaviour by determining the number of connected users per access point over time.

In [21], the authors collected data over a month by polling 177 Access Points every 5 minutes via SNMP. They maintain that it is important to understand network usage characteristics as it would lead to a more effective deployment of wireless network components. Their major finding in this paper is that the average user transfer-rates follow a power law, meaning load is unevenly distributed across access points and is influenced more by which users are present than by the number of users. So they propose a clustering approach for the users based on two attributes, prevalence and persistence.

[22] analyzes the performance characteristics of the Google WiFi network in Mountain View (CA), with 500 access points. The proposed analysis has addressed three main categories: per-user traffic distributions, user classification based on their pattern of network usage and generated traffic, and user mobility in terms of travelled distance.

In [16], the authors analyze the use of a Wi-Fi network deployed in a large-scale technical university with data logs collected over three weeks. They present a spatio-temporal analysis of the given network, and search for distinctive user behaviours based on certain situations, such as whether students attending a lecture use wireless networks differently than users not attending the lecture. The authors of this paper believe that the analysis of such network behaviour can give fundamental insights on how to optimize and manage the network, and it can also reveal patterns in how the end users behave.

2.2 Network Self-Coexistence

The problems regarding self-coexistence among networks in such a competitive environment have also been dealt with before. In [17], the problem was solved using tools from Modified Minority Game (MMG) theory, where cooperation is self-imposed in a noncooperative game and a mixed strategy was identified for the networks to achieve Nash Equilibrium. The paper describes how the challenges of dynamic spectrum access can be resolved in Wireless Regional Area Networks (WRANs), where IEEE 802.22 networks based on cognitive radios are competing to get an available channel. The mathematical formulation to calculate the optimal probability to switch was derived using game-theoretic techniques from a modified minority strategic and noncooperative game. This probability depends on the number of competing networks, the number of available channels and the cost to switch to another channel. In this paper, however, the networks play the game in a static manner, without considering the history of their moves, giving them no basis to learn further in the game.

The authors in [23] used the potential of communication among Base Stations (BS) and Consumer Premise Equipments (CPE) in the IEEE 802.22 standard to guarantee self-coexistence among Wireless Regional Area Networks. They proposed a game-theoretic model with the BSs as players, the choices of a channel as strategies, and preferences associated with the quality of the channel. Their primary focus was on the use of two utility functions to maximize the spatial reuse and minimize the interference. They established that even though a rational player would choose to minimize the interference, it might not be the optimal choice when the players are selfish and resources are scarce. In such cases, if a BS chooses its channel for maximum spatial reuse, it helps them achieve self-coexistence in a better manner. This model relies on the ability of CPEs to sense the spectrum and detect the level of interference so that the WRANs can change their tolerable amount of interference accordingly.

In [24], another non-cooperative game theoretic model has been proposed to solve the problem of self-coexistence among Wireless Regional Area Networks. The authors in this paper have considered the same set of binary strategies, i.e., switch or stay, when the networks are faced with collision. The primary focus of this paper is to minimize the step cost at every stage within the game. The utility function at each step is dependent on the probability to switch chosen by the networks. In each step, the optimal strategy is to find an optimal probability to switch so as to maximize the utility function. The paper focuses on how to select a strategy that makes the expected cost of staying in the same channel equal to the cost of switching to another channel, but does not further analyze the expected cost in the game.

In [23], the self-coexistence problem has been addressed using two frameworks: (i) a multi-player non-cooperative repeated game (NoRa) and (ii) a hedonic coalitional formation game (HeCtor). In NoRa, the players have to choose from a finite set of strategies and their payoffs depend on the number of other players choosing the same strategy. In HeCtor, the tradeoffs between cooperation advantages and cooperation costs are discussed.
The authors established that cooperation among WRANs leads to a higher throughput, but at the cost of higher computational complexity, causing significant loss of throughput when rapid changes occur in the channel occupancy. However, in a non-cooperative environment, the networks adapt faster to rapid changes and are successful in attaining the same throughput. Even though this paper is about self-coexistence, the main motive is to evaluate the environment in terms of cooperation, to see whether a cooperative or a non-cooperative environment facilitates the networks in achieving a stable solution to the problem.

Chapter 3

Wi-Fi Spectrum Usage Analytics

In this chapter, we are trying to understand the feasibility of proactive deployment of floating APs inside a university where the WiFi spectrum usage varies highly across different buildings depending upon the time of the day. For this purpose, we have conducted our study at the University of Nevada, Reno (UNR) by collecting the spectrum usage data across almost all the buildings in UNR over a period of four weeks. For this whole idea of an aerial AP to be successful, we typically need answers to the following questions: 1) What are the buildings/locations with the maximum number of users accessing the WiFi channels? 2) At what time during the day does the spectrum usage reach its peak? 3) What are the channels that are highly over utilized and highly under utilized during those hours? To answer these questions, we measure various analyzable data from WiFi bands in the 2.4 GHz and 5 GHz range at different locations and understand how they vary across different days of the week and different times of the day. The results provide various insights into when and where to best deploy the floating APs in a proactive manner and how to reconfigure them to make use of the under utilized bands so that the load is efficiently distributed among all WiFi channels.

Spectrum monitoring is needed to continuously survey the radio spectrum, process the collected data and infer information about spectrum usage [25]. This study is important because it tells us how the spectrum usage statistics change across different places, times and channels. The goal is to generate a heat map that dynamically determines the hot zones within the campus and deploy floating APs in those zones. Since network resource usage statistics can be estimated by predicting a traffic map [26], [15], these APs can be deployed in a pre-defined manner. Additionally, the continuous analysis of Wi-Fi data also helps in risk management from a security standpoint. It helps us track any suspicious activities and alerts us ahead of time, thus allowing the network administrators to prevent or prepare for attacks beforehand.

3.1 Methodology

3.1.1 Data Collection

The results presented in this chapter are based on the analysis of spectrum usage data collected across almost all buildings at the University of Nevada, Reno. UNR spans an area of 290 acres and has over 80 buildings. At any point of time, there are approximately 1275 Aruba APs active throughout the university. Many of these APs are deployed in outdoor spaces as well. The APs allow the user devices to connect in either the 2.4 GHz or 5 GHz frequency range depending on whether these devices support the 802.11 b/g/n standard or the 802.11 a/c standard. The non-overlapping channels that the clients can use in the 2.4 GHz frequency range are 1, 6 and 11. The non-overlapping channels in the 5 GHz frequency range employed by our university APs are 36, 40, 44, 48, 149, 157 and 163. The data that we have used for this study extends from March 14, 2017 to April 14, 2017.

We obtained our data by querying the Aruba controller that handles all of the Access Points within UNR. We made this query every 5 minutes so as not to overload the server with requests. The query explicitly asks the server to show all the APs active at the point in time when the query is made. The information that we get from the response consists of the names of the APs, their locations, their IP addresses, the number of clients in the 2.4 GHz and 5 GHz range connected to each specific AP and the channels they are using.

3.1.2 Data Preprocessing

The raw data that we collected needed to be preprocessed before further analysis. From the vast array of unstructured data, we extracted the attributes most relevant to our study. Our preprocessed data consists of the following attributes: timestamp, AP name, AP location, number of clients in the 2.4 GHz and 5 GHz range, and the channels in the 2.4 GHz and 5 GHz range being used by those clients. From this database, the first obvious result that we can gather is the total number of users at any location at a specified time or during a specified time interval. With this analysis, we identified the seven most crowded locations in UNR for each day of the week, namely: MIKC, JCSU, ABB, DMSC, SEM, WRB and PSAC. Here, MIKC is the library, JCSU is the student union, and the rest of the buildings are equipped with classes and labs. After determining the highly congested buildings that require our attention, we analyzed the channel usage statistics in those buildings at different times to figure out the time intervals and channels that suffer from the highest level of interference in those buildings. This provides us with a basic idea of the congestion scenario in terms of space, time and channels.

3.1.3 Data Analysis

After identifying the buildings with a very large user base, it becomes important to determine the times when the congestion reaches its peak. In order to be able to better serve user demands, we first need to know the time periods that require our utmost attention. To achieve this, we computed location-wise hourly averages of our datasets for each day of the week. The results obtained from this analysis point out the hours of the day when the network congestion is at its maximum, and the channels being accessed during those hours. On weekends, the user mass drops significantly compared to weekdays. During weekdays, the data fluctuates from hour to hour. This fluctuation also depends on the type of the building. For instance, the buildings with classes have a high load during class hours, while the buildings with a food court have a high load during lunch times, and so on. For the time being, we will take into account some specific buildings where the deployment of floating APs will provide maximum benefit in terms of better WiFi coverage and user satisfaction.
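As a sketch of this aggregation step (the file and column names below are assumptions for illustration; the actual schema is the preprocessed dataset of Section 3.1.2), the busiest buildings and the location-wise hourly averages per weekday can be computed with pandas:

```python
import pandas as pd

# Load the preprocessed records; column names here are illustrative.
df = pd.read_csv("unr_wifi.csv", parse_dates=["timestamp"])
df["users"] = df["clients_24ghz"] + df["clients_5ghz"]

# Total users per building: identifies the seven most crowded locations.
busiest = df.groupby("ap_location")["users"].sum().nlargest(7)

# Location-wise hourly average for each day of the week.
hourly = (df.assign(weekday=df["timestamp"].dt.day_name(),
                    hour=df["timestamp"].dt.hour)
            .groupby(["ap_location", "weekday", "hour"])["users"]
            .mean())
```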

3.2 Results

We now present the findings of our study by comparing the spectrum usage analytics in our two candidate buildings: JCSU and ABB. JCSU is the Student Union of UNR. It houses the food court, Starbucks, a bank, the graduate student lounge, a theatre and so on. It is basically the place where students like to spend their leisure time. ABB is the Ansari Business Building. This building consists of classes and labs for all the students in the business department. We observe how the usage behavior in these two buildings changes over time, which channels are being highly utilized, and what fundamental differences we can establish in the channel usage pattern between these buildings.

FIGURE 3.1: Number of users in JCSU across different hours on Tuesdays and Saturdays

In Figure 3.1, we have the data from three Tuesdays, March 14, March 28 and April 04, obtained from JCSU. From this plot, we can easily deduce that on all Tuesdays, the user load remains similar during specific hours of the day. This plot helps us identify the time period when the load on that particular building is at its maximum. From Figure 3.1, it is readily visible that the user load is minimal until 7 a.m., after which it starts increasing noticeably until it reaches its peak between 12 p.m. and 1 p.m. Then the load slowly starts decreasing for the rest of the hours. With the assumption that 12 p.m. - 1 p.m. is usually the lunch hour for many students, the crowd in JCSU is naturally at its maximum at this point.

FIGURE 3.2: Number of users in ABB across different hours on Tuesdays and Saturdays

Figure 3.1 also shows the user flow in JCSU on Saturdays, April 01 and April 08. Compared to the weekdays, the number of users in this plot has dropped significantly, but it remains relatively high during the noon and afternoon hours, even though the user behavior for the two Saturdays is not quite the same.

We now present the same sets of results for ABB. Since ABB has class hours, the user rate changes according to those class hours. When there are classes, there is a high number of users, and vice-versa. Naturally, this causes the plots for different days to differ. As we can see from Figure 3.2, ABB has the highest number of users between 1 p.m. and 2 p.m. on Tuesdays (March 14, March 28 and April 04). As there are no classes on Saturdays, the plot depicts minimal user activity for April 01 and April 08 in ABB.

The same concept can be applied to all the buildings in UNR and to all days of the week. Since the users may have a trend in spectrum usage from a spatial and temporal perspective, we can effectively identify when the buildings are going to experience the maximum load. Now that we know the hot zones and the timestamps when the usage reaches its peak, we need to know the channels that are over utilized and under utilized in the corresponding buildings and hours so that we can effectively manage the spectrum load. Our next set of findings includes the number of users accessing a particular channel in the congested areas during the peak time, to identify the channels with the highest load that affect the services of the users accessing those channels. After a thorough analysis of different congested time periods on Tuesdays in ABB, we calculated the mean and standard deviation of the data points in each channel. A low standard deviation implies that the data points are close to each other.

TABLE 3.1: Mean and standard deviation of the group of users associated with each WiFi channel in ABB

Channel   Mean   Standard deviation   Range of Users
1           50                    8          42 - 57
6           38                    8          30 - 46
11          47                   10          37 - 57
36          91                   35          56 - 126
40          60                   26          35 - 86
44          50                   31          19 - 80
48         110                   51          60 - 161
149         71                   15          56 - 87
153         61                   13          48 - 74
157         28                   28           0 - 56
161        112                   26          86 - 139

In our case, this means that during the congested hours of a particular location on a particular day, the numbers of users using the same channel do not drastically differ from each other. The statistics obtained help us identify the channels with a higher user load and the channels with a lower user load. The mean and standard deviation of the number of users on each channel are shown in Table 3.1. This table depicts the mean and standard deviation values of the number of users seen on each channel employed by the APs of ABB during the peak load hours on Tuesdays. From these values, we obtain the range of users that use those channels. Numbers of users not falling within this range are considered to be outliers and are not taken into account. If the number of users lying within this range, as seen from our dataset, is averaged for each of the channels, we can determine the load on every channel during congested periods of time in the Ansari Business Building. Figure 3.3 shows the mean number of users obtained from Table 3.1 on each channel during the peak hours of Tuesdays in ABB.

FIGURE 3.3: Average users on each channel during the peak hours of Tuesdays in ABB
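The ranges in Table 3.1 are the mean plus or minus one standard deviation, so the per-channel load underlying Figure 3.3 can be sketched as follows (a minimal illustration of the outlier filtering described above):

```python
import numpy as np

def channel_load(users_per_sample):
    """Average load on one channel after discarding samples outside
    mean +/- one standard deviation (the ranges listed in Table 3.1)."""
    x = np.asarray(users_per_sample, dtype=float)
    mu, sigma = x.mean(), x.std()
    in_range = x[(x >= mu - sigma) & (x <= mu + sigma)]
    return in_range.mean()
```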

In Figure 3.3, we can see that Channel 1 in the 2.4 GHz range and Channel 161 in the 5 GHz range suffer from the highest user load. Since WiFi is contention based, when the number of users using a certain channel doubles, the signal quality in that frequency range deteriorates exponentially. That is why it becomes essential that the users be distributed among all available channels in an unbiased manner. In the 2.4 GHz range, Channels 1 and 11 seem to share the same level of load, whereas Channel 6 still has potential for utilization. However, in the 5 GHz range, the distribution is very drastic. As we can see, Channels 36, 48 and 161 are being accessed by a large population. On the contrary, Channel 157 has a very small user base. We already know that ABB is congested on Tuesdays from 10 a.m. to 11 a.m. Now we also know that Channel 161 withstands the highest number of users, while Channel 157 has very few users associated with it. After accumulating all this information, we can accurately deploy the floating APs on Tuesdays in ABB while making use of the under-utilized channels so that congestion is accommodated to the optimal level.

Using all of the above results and applying similar techniques to the other locations, we now create a heat map to identify hot zones within UNR. The heat map consists of a section of the geographical map of UNR (due to space constraints) that indicates, at any time of the day, the locations with the largest number of users and the channels that are being used the most in the 2.4 GHz and 5 GHz frequency range. In these maps, the size of the circles indicates the size of the user base, i.e., a greater radius implies a larger number of users. The channels are color-coded and the index is shown on the upper left side of the map. At each location, we have displayed the most congested channel from each of the 2.4 GHz and 5 GHz ranges. That is why there are two concentric circles for each location in the map, where the color indicates the channel name in one of the two frequency ranges. Also, the channel range represented by the color of the outer circle has more users than the channel range represented by the color of the inner circle. These heat maps were generated for every 5 minutes of an entire day.
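A minimal sketch of how one such frame could be rendered follows; the field names, color mapping and marker scaling are illustrative assumptions, not the exact tooling used in the study (the example values come from the JCSU and ABB rows of Table 3.2):

```python
import matplotlib.pyplot as plt

# Illustrative channel-to-color mapping; the thesis shows its index on the map.
CHANNEL_COLORS = {1: "tab:red", 6: "tab:green", 11: "tab:blue",
                  36: "tab:orange", 40: "tab:pink", 44: "tab:olive",
                  48: "tab:purple", 149: "tab:cyan", 153: "tab:gray",
                  157: "gold", 161: "tab:brown"}

def draw_frame(ax, buildings):
    """One 5-minute frame: circle size tracks the user count; the outer circle
    is colored by the busiest channel of the band with more users."""
    for b in buildings:  # b: dict with keys x, y, n24, n5, ch24, ch5 (assumed)
        outer, inner = (("ch5", "ch24") if b["n5"] >= b["n24"]
                        else ("ch24", "ch5"))
        area = 5.0 * (b["n24"] + b["n5"])        # marker area grows with users
        ax.scatter(b["x"], b["y"], s=area,
                   color=CHANNEL_COLORS[b[outer]], alpha=0.6)
        ax.scatter(b["x"], b["y"], s=area / 3,
                   color=CHANNEL_COLORS[b[inner]])

fig, ax = plt.subplots()
draw_frame(ax, [{"x": 0, "y": 0, "n24": 260, "n5": 517, "ch24": 11, "ch5": 48},
                {"x": 3, "y": 1, "n24": 85, "n5": 207, "ch24": 6, "ch5": 161}])
plt.show()
```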

We then created time lapse videos for a number of separate days using these maps to provide a visual clarification of how the user crowd and the channel usage change over time and space. These videos are available at Spectrum Usage Timelapse Video [27]. In this chapter, we have presented three of these maps, taken from different days at different times, in Figures 3.4, 3.5 and 3.6. All of the information that we can gather from these figures is then exhibited in Tables 3.2, 3.3 and 3.4, respectively. One important thing to note here is that MIKC, being the library, always has the highest user load compared to other locations irrespective of the day and time. Therefore, we will exclude MIKC from our further comparisons with other buildings where the user load varies with time.

FIGURE 3.4: Heat Map indicating the building load and the corresponding most utilized channels in a section of UNR on March 15, Wednesday at 12:05 p.m.

FIGURE 3.5: Heat Map indicating the building load and the corresponding most utilized channels in a section of UNR on March 28, Tuesday at 8:30 p.m.

Figure 3.4 shows the heat map on March 15, Wednesday at 12:05 p.m., and the corresponding building and channel loads are depicted in Table 3.2. As mentioned earlier, we expect the crowd at JCSU to be very high at this point, considering 12 - 1 p.m. as the lunch time for most students. From the table, we see that there are 260 users in the 2.4 GHz range and 517 users in the 5 GHz range. Mainly the buildings with classes, such as EJCH, WRB, SEM, DMSC, ABB and PSAC, depict a considerable user load, while the rest of the buildings have very little spectrum utilization. In terms of channel load, MIKC and PSAC have the most user connectivity in Channels 1 and 161. Since the outer circle (Channel 161) represents a channel in the 5 GHz range, it has more users than those connected in the 2.4 GHz range. From the table, it is clear that in MIKC there are a total of 300 users in the 2.4 GHz range, out of which 145 are using Channel 1 alone, and there are a total of 863 users in the 5 GHz range, out of which 208 are connected to Channel 161, making these two channels highly congested.

In Figure 3.5, we present the UNR wide spectrum usage on March 28 at 8:30 p.m. The reason behind choosing this time is to analyze whether the user activity drops significantly at night compared to the afternoon, which is exactly what this picture depicts. There is very little user connectivity in all the buildings except for MIKC and JCSU. Even in these buildings, the user load is less than it was in the afternoon. The total load in all of the buildings and the corresponding channel load in the most utilized channels are found in Table 3.3.

FIGURE 3.6: Heat Map indicating the building load and the corresponding most utilized channels in a section of UNR on April 01, Saturday at 4:00 p.m.

The load on these buildings decreased even more during the weekend, as shown in Figure 3.6, the heat map obtained from April 01, Saturday at 4:00 p.m. Comparing it with Figure 3.5, we see that the only building that has a slight increase in user count is JCSU, while the rest have very low spectrum usage. The exact values of user load over space and channels for this point in time are shown in Table 3.4.

We are now able to visually represent the WiFi spectrum usage inside UNR, which becomes very handy for deploying the flying drones serving as APs. A quick glance through these three maps tells us that in all of the buildings, the most utilized channels never remain the same; they change according to the day of the week and also the time of the day. Through the time lapse videos in Spectrum Usage Timelapse Video [27], we can very easily analyze the changes in user load happening across an entire day in different locations and how different channels are being highly utilized even within the same building.

TABLE 3.2: Total number of users and the channels with the highest number of connected users in 2.4 GHz and 5 GHz frequency range on March 15, Wednesday at 12:05 p.m.

          ------- 2.4 GHz range -------   -------- 5 GHz range --------
Building  Total   Busiest     Channel     Total   Busiest     Channel
code      load    channel     load        load    channel     load
JCSU      260     11          114         517     48          148
ARF       19      11          7           32      36          10
MIKC      300     1           145         863     161         208
NJC       28      11          10          11      153         8
WRB       93      1           41          297     153         64
EJCH      100     1           54          307     44          67
HREL      25      11          2           11      36          2
LP        29      1           19          31      36          10
CB        59      11          22          42      149         17
SLH       41      6           26          153     40          67
RSJ       32      1           18          75      40          37
MSS       41      11          21          204     48          43
PSAC      77      1           32          555     161         118
ABB       85      6           32          207     161         54
SEM       114     11          49          325     48          102
DMSC      96      11          44          307     149         68
LMR       26      1           11          62      161         20
LME       27      1           16          32      40          15
MM        49      1           25          172     36          38
RH        22      1           9           17      36          7
TB        3       11          2           0       None        0

TABLE 3.3: Total number of users and the channels with the highest number of connected users in 2.4 GHz and 5 GHz frequency range on March 28, Tuesday at 8:30 p.m.

          ------- 2.4 GHz range -------   -------- 5 GHz range --------
Building  Total   Busiest     Channel     Total   Busiest     Channel
code      load    channel     load        load    channel     load
JCSU      58      11          24          125     36          32
ARF       15      11          8           10      36          4
MIKC      164     11          66          669     40          141
NJC       6       1           4           2       40          1
WRB       18      11          11          97      40          42
EJCH      23      11          11          46      149         15
HREL      7       1           4           1       48          1
LP        7       11          3           12      36          6
CB        18      1           8           9       40          4
SLH       1       11          1           2       48          2
RSJ       0       None        0           12      36          3
MSS       5       11          14          29      161         14
PSAC      12      6           6           74      161         20
ABB       16      11          8           113     36          32
SEM       30      11          15          60      149         19
DMSC      15      11          11          43      44          21
LMR       11      1           5           13      161         7
LME       9       11          7           17      48          7
MM        8       11          5           37      161         10
RH        0       None        0           2       153         1
TB        0       None        0           1       44          1

TABLE 3.4: Total number of users and the channels with the highest number of connected users in 2.4 GHz and 5 GHz frequency range on April 01, Saturday at 4:00 p.m.

          ------- 2.4 GHz range -------   -------- 5 GHz range --------
Building  Total   Busiest     Channel     Total   Busiest     Channel
code      load    channel     load        load    channel     load
JCSU      67      6           34          185     161         37
ARF       9       6           5           7       36          3
MIKC      109     1           41          367     36          118
NJC       2       11          2           0       None        0
WRB       9       11          7           42      153         11
EJCH      3       1           2           10      153         4
HREL      10      11          5           4       40          4
LP        6       1           4           8       44          4
CB        12      11          8           7       40          3
SLH       0       None        0           0       None        0
RSJ       0       None        0           3       48          2
MSS       5       6           4           24      153         7
PSAC      7       11          3           31      161         12
ABB       10      1           5           55      48          21
SEM       12      1           7           22      149         12
DMSC      1       11          1           6       149         2
LMR       10      11          4           17      161         11
LME       2       1           1           8       48          5
MM        5       6           2           36      44          8
RH        1       6           1           0       None        0
TB        1       1           1           1       157         1

Chapter 4

Predicting Congestion Level in Wireless Networks

In our previous study, we had some constraints on the type of data that we analyzed and the conclusions that we reached. We attributed the AP load at a certain location directly to the number of users present at that location and the way they were distributed between the 2.4 GHz and 5 GHz frequency ranges. We also did not implement any sophisticated machine learning algorithms, but relied on mathematical models and plots to identify the trends because of the simplistic nature of the data we had. However, that study was crucial, as it led us to believe that a trend does in fact exist and that we can dig deeper into it with complex algorithms to find hidden patterns, gaining insights from more advanced tools that collect variations of data which can help us make informed decisions in regards to network congestion.

In this study, we attempt to determine the congestion level at various locations inside the University of Nevada, Reno (UNR) based on a certain day and time.

This prediction will lead us to identify the areas that are suffering from heavy congestion versus other areas where the congestion is relatively low at the same time. This identification can pave the path for optimized resource allocation, where resources can be pulled from areas with low congestion to alleviate the situation in high congestion areas, without having to add new resources every time congestion occurs. There is a myriad of information that we can gather from the wireless activity of clients and use to conduct this experiment. It is highly crucial to this study that we select the data types relevant to our research [28]. For this study, we have chosen four data attributes: number of clients, throughput, frame retry rate and frame error rate. These attributes help us answer the following questions: 1) Is there a pattern to how the users take advantage of wireless networks? 2) When does the client activity increase significantly? 3) How can we relate these four attributes to congestion?

In this study, we investigate patterns in usage data in terms of the number of clients, throughput, frame error rate and frame retry rate. Furthermore, we apply machine learning algorithms to predict the future values of these attributes based on the pattern. For instance, is the change in throughput values for one Tuesday similar to the change in throughput values for another Tuesday throughout the entire day? The existence of these trends allows us to apply supervised learning algorithms to predict the values [29] of the four attributes given the date and time. However, we still do not know what these predictions mean and how they relate to congestion. That is where unsupervised learning comes into play, with the main idea being to draw inferences from data sets consisting of input data without labeled responses and to find hidden groupings in the data [30].

Using this mechanism, we can efficiently classify data items with the four attributes into different groups, where each group defines a certain level of congestion. We use the rationale that if there are more clients connected to an access point, then the probability of a higher level of congestion is greater. We first pick the day and time in the future for which to determine the congestion status, after which we predict the number of clients, throughput, frame retry rate and frame error rate for those inputs. Finally, we feed these predicted values to a clustering mechanism which identifies whether these values refer to a high, medium or low level of congestion.

To gain a better understanding of how users are taking advantage of wireless networks, we examined the user activity data over a period of 5 weeks inside UNR. We polled the access points using SNMP commands [31] to extract the required information in terms of clients, throughput, packet drop and retry rates. The data was collected every 5 minutes so as not to overload the server with queries. The data consists of the following attributes: Timestamp, Location, Clients, Throughput, Frame Retry Rate, Frame Error Rate. We preprocessed the data into a suitable format to be stored in and accessed from a database. Afterwards, we averaged the data instances on an hourly basis to get a cleaner representation, and the analysis led us to believe that there exists a trend in how users use Wi-Fi services across different hours of a day. Following this trend, we implemented Support Vector Regression and Polynomial Regression to predict the values of Clients, Throughput, Frame Retry and Frame Error in the future for a certain location, day and time. To understand how congestion can be identified with these attributes, we created a clustering model using the EM algorithm from the averaged data instances for the 5 weeks with the 4 attributes. The model successfully grouped these data instances into 3 clusters, each cluster determining a congestion level of low, medium or high. Finally, integrating the whole idea, the predicted values of the 4 attributes for a certain location, day and time given by our supervised learning algorithms were fed to the clustering model, with the result specifying the congestion level corresponding to those inputs.
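A condensed sketch of this integrated pipeline follows; every interface here is a hypothetical placeholder for the models described in the following sections:

```python
ATTRIBUTES = ("clients", "throughput", "frame_retry", "frame_error")

def congestion_level(location, day, hour, predictors, clusterer, labels):
    """Predict the four attributes for a location/day/hour with the trained
    supervised models, then map the resulting point to a congestion cluster."""
    point = [predictors[a].predict(location, day, hour) for a in ATTRIBUTES]
    cluster_id = clusterer.predict([point])[0]
    return labels[cluster_id]   # e.g., {0: "medium", 1: "high", 2: "low"}
```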

4.1 Methodology

4.1.1 Data Collection

The results presented in this chapter are based on the analysis of spectrum usage data collected across a section of UNR. UNR spans an area of 290 acres and comprises more than 80 buildings. Two central controllers manage a total of approximately 1275 active access points at any point of time throughout the university. Many of these APs are deployed in outdoor spaces as well. To facilitate the data collection process, we have used the SNMP network protocol, which allows different devices on a network to share information with one another in a consistent and reliable manner [18]. We polled the APs with different queries (snmpwalk commands) via the controllers that manage these APs. We poll every 5 minutes to obtain information reasonably frequently, within the limits of the computation and bandwidth available on our two polling workstations [32].

At any point of time, separate queries were made to these APs, explicitly asking for each of the following pieces of information:

• number of clients connected to the AP

• location of the AP

• mac address of that AP

• AP bssid for 2.4 GHz and 5 GHz radio

• channels operating in that AP

• throughput of the AP

• frame retry rate of the AP (The number of retry packets as a percentage of the total packets transmitted and received by this AP)

• frame error rate of the AP (The number of error packets as a percentage of the total packets received on this AP)

Due to time constraints, we have limited our study to a geographical section of UNR consisting of 5 co-located buildings: JCSU, MIKC, ARF, WFC, and NJC. The data spans a period of 5 weeks, from December 1, 2017 to January 5, 2018. There are a total of 233 APs within these 5 buildings and the outside premises, from which we have extracted all the information relevant to this research.
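As an illustration of this polling loop, the sketch below wraps the snmpwalk command-line tool from Python. The controller hostnames, SNMP community string and OID are placeholders: the actual Aruba MIB OIDs used in the study are not listed in the text.

```python
import subprocess
import time

CONTROLLERS = ["controller1.example.edu", "controller2.example.edu"]  # placeholders
CLIENT_COUNT_OID = "1.3.6.1.4.1.EXAMPLE"  # hypothetical OID standing in for the
                                          # per-AP client-count object in the Aruba MIB

def poll_once(host, community="public"):
    """Run one snmpwalk query against a controller and return its output lines."""
    result = subprocess.run(
        ["snmpwalk", "-v2c", "-c", community, host, CLIENT_COUNT_OID],
        capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

while True:                               # one round of queries every 5 minutes
    for host in CONTROLLERS:
        records = poll_once(host)
        # ... parse the records and insert them into the database ...
    time.sleep(300)
```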

4.1.2 Dataset Creation and Pre-processing

Once the data is gathered using the process described in Section 4.1.1, we create individual CSV files that hold the average clients per AP_MAC address, average throughput per AP_MAC, average frame retry per AP_MAC, and average frame error per AP_MAC. We compute and predict averages instead of totals in order to avoid any bias introduced by one hour having more data than another. If one hour has more data than the others, the totals could show high congestion when actually there is just a difference in the number of data points resulting in higher totals. Each individual CSV corresponds to a building on campus as well as a day of the week. Using python2.7 and the MySQLdb library, we write SQL queries and manipulate the results with python before exporting the data to a CSV. Our database contains two tables, namely DATA_TABLE and LOCATION_TABLE. DATA_TABLE contains TIMESTAMP, AP_MAC, CLIENT, THROUGHPUT, FRAME_RETRY, and FRAME_ERROR. LOCATION_TABLE contains AP_MAC and the corresponding location code AP_LOCATION. Since we want to filter our queries based on location, an inner join on AP_MAC is executed. From the inner join, aliased as S, we return the per record averages of CLIENTS, THROUGHPUT, FRAME_RETRY, and FRAME_ERROR. This is done 24 times via a for loop in order to get averages for every hour (00-23). This process is repeated for every date in the database. To compute the averages we use the built-in SQL aggregate AVG. Since we want to compute averages per MAC and not per record, we multiply the result returned by AVG(clients) by four, since there are four BSSIDs per MAC. This results in an average number of clients per MAC address.

For the three other attributes, we take the result returned by AVG(attribute) and multiply it by two, since there are duplicate values for each BSSID in a channel. This gives us the average throughput, frame retry, and frame error per MAC address on every iteration of the for loop (every hour). From there we export them to a CSV before the next iteration of the for loop. This process of exporting data to a CSV is done for every date in the database for the five buildings on the UNR campus. The dates corresponding to the same day of the week (Monday - Friday) are aggregated into one CSV that holds all the data for a particular weekday and location. The CSV is named according to the location and day of the week. MIKCmonday.csv, for example, holds all Monday data at location MIKC.
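A minimal sketch of this export step, assuming the schema described above (connection credentials, the example date and the location code are placeholders):

```python
import csv
import MySQLdb  # the study used python2.7 with MySQLdb; this sketch is 2/3-compatible

db = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="wifi")
cur = db.cursor()

QUERY = """
    SELECT AVG(S.CLIENT) * 4,        -- four BSSIDs per MAC: per-MAC clients
           AVG(S.THROUGHPUT) * 2,    -- duplicate value per BSSID in a channel
           AVG(S.FRAME_RETRY) * 2,
           AVG(S.FRAME_ERROR) * 2
    FROM (SELECT d.* FROM DATA_TABLE d
          INNER JOIN LOCATION_TABLE l ON d.AP_MAC = l.AP_MAC
          WHERE l.AP_LOCATION = %s) AS S
    WHERE DATE(S.TIMESTAMP) = %s AND HOUR(S.TIMESTAMP) = %s
"""

with open("MIKCmonday.csv", "a") as f:        # naming pattern from the text
    writer = csv.writer(f)
    for hour in range(24):                    # hourly averages, hours 00-23
        cur.execute(QUERY, ("MIKC", "2017-12-04", hour))  # one example Monday
        writer.writerow((hour,) + cur.fetchone())
```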

4.1.3 Supervised Learning

To predict the average clients, throughput, frame retry and frame error per MAC address of an AP, we first plot each attribute as a function of time (hours) in order to identify trends before fitting an algorithm to the data. We found that the data points follow a normal distribution for all attributes. This normal distribution can be seen in Figure 4.1, where the average number of clients is plotted on an hourly basis. Based on the trend in the charts, we were able to identify peak congestion periods ranging from about 10 a.m. to 2 p.m. on average across all days and locations. The charts were created using the python3 library Matplotlib [33].

Supervised learning is a subsection of machine learning where the user feeds the model training data which contains labels. The model learns to classify or predict the labels based on the training data that is associated with it. Once training is complete, the model uses its prior knowledge of the training data to predict labels on new data. In this instance, we feed our model training data of Wi-Fi AP attributes (clients, throughput, frame retry and frame error) per hour, and it predicts the attribute at a given hour based on the training data.

After identifying the trends in the data, we decided to fit the data with support vector regression (SVR) because it does not take into account the outliers in the dataset and fits to the bulk of the data. SVR creates a regression line by splitting the difference between the two closest data points. Using the Gaussian kernel, we can create nonlinear regression lines by mapping the data points to a higher dimension. In addition to the SVR, we also fit a polynomial regression of the 2nd degree, which takes into account outliers in the dataset. This was done in order to compare our results to the SVR. Lastly, we create a weighted combination of both the SVR and the polynomial regression in hopes of finding an algorithm that allows the outliers some weight in the prediction, but not so much as to throw off the entire prediction. The SVR and polynomial regression were created and run using the Scikit-learn python3 library [34].
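A minimal sketch of these three fits with scikit-learn, on synthetic bell-shaped hourly data standing in for the real averages (the kernel parameters and the 0.7/0.3 mixing weights are illustrative assumptions, not the exact values used in the study):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
hours = np.arange(24)
X = hours.reshape(-1, 1)
# Synthetic stand-in for hourly client averages, peaking around midday.
y = 50 * np.exp(-((hours - 13) ** 2) / 18) + rng.normal(0, 2, 24)

svr = SVR(kernel="rbf", C=100, gamma=0.1).fit(X, y)           # Gaussian kernel
poly = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

pred_svr, pred_poly = svr.predict(X), poly.predict(X)
pred_mix = 0.7 * pred_svr + 0.3 * pred_poly   # illustrative weighted combination

for name, pred in [("SVR", pred_svr), ("poly", pred_poly), ("mix", pred_mix)]:
    print(name, mean_squared_error(y, pred))
```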

The formula for MSE is MSE = (1/N) ∑_{i=1}^{N} (f_i − y_i)², where N is the number of data points, f_i is the predicted value, and y_i is the actual value. On average, SVR had a lower MSE than the polynomial and weighted regressions.
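As a sketch of this fitting-and-comparison step (the hourly values below are placeholders, and the 0.5/0.5 weighting of the combined model is an assumption, since the exact weights are not stated):

import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

hours = np.arange(24).reshape(-1, 1)                # hour of day, 00-23
clients = np.random.poisson(8, 24).astype(float)    # placeholder hourly averages

# SVR with a Gaussian (RBF) kernel fits the bulk of the data
svr_pred = SVR(kernel="rbf").fit(hours, clients).predict(hours)

# 2nd-degree polynomial regression, which does weight outliers
poly = PolynomialFeatures(degree=2)
lin = LinearRegression().fit(poly.fit_transform(hours), clients)
poly_pred = lin.predict(poly.transform(hours))

# Weighted combination of the two predictions
combo_pred = 0.5 * svr_pred + 0.5 * poly_pred

for name, pred in [("SVR", svr_pred), ("Polynomial", poly_pred),
                   ("Weighted", combo_pred)]:
    print(name, mean_squared_error(clients, pred))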

Once we found SVR to be the best-performing regression algorithm for this data, we moved on to unsupervised learning to cluster the data points and define levels of AP congestion.

FIGURE 4.1: Hourly Clients, Throughput, Frame Retry and Frame Error (panels a-d)

4.1.4 Unsupervised Learning

Clustering, an unsupervised learning technique, takes a set of points in n-dimensional space and finds coherent subsets, each consisting of points that are grouped together. The advantages of clustering algorithms are the ability to categorize data instances automatically and the ability to find groupings that we might not otherwise find [35]. After predicting the values of average clients, throughput, frame retry and frame error per AP MAC address for a specified date and time, we need to relate these values to a certain congestion level. However, there is no definitive way of assigning the predicted values to groups, so we used the EM clustering algorithm for this purpose. EM is a soft-clustering technique: it assigns each instance a probability distribution that indicates the probability of the instance belonging to each of the clusters [36]. Mixture models are a probabilistically sound way to do soft clustering. We assume our data is sampled from K different sources (probability distributions). The expectation maximization (EM) algorithm allows us to discover the parameters of these distributions and, at the same time, figure out which source each point comes from [37]. EM is a method to find the means and variances of a mixture of Gaussian distributions. We specifically chose EM because it allowed us to assign a probability for any data instance to fall into any cluster (congestion level). This is important because we are not making a hard determination of congestion for the predicted values; instead, we are exploring the degree to which those instances can be correlated to congestion. EM is also useful when the range of values differs widely between dimensions [38], which is true in our case, as the range of values for throughput is very large compared to the other attributes. Using the EM algorithm, we clustered the hourly averaged data instances per

AP MAC address for the entire 5 weeks using just 4 attributes: Clients, Throughput, Frame Retry and Frame Error. There were a total of 863 instances. We chose to divide these data points into three groups, and the algorithm was able to separate them into 3 clean clusters. As we analyzed the clusters, we found that each cluster corresponds to a certain level of congestion. One cluster contained data items with very low values for all the attributes; we identified this as the low congestion cluster. Another cluster held data items with moderate attribute values, which we named the medium congestion cluster. Finally, the data instances in the last cluster had very high values for almost all attributes, so we designated it the high congestion cluster. We used WEKA, a collection of machine learning algorithms for data mining tasks, to perform the EM clustering. The software, written in Java, contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization [39]. The accuracy of EM is measured using log-likelihood. Since all the data points are assigned to their respective clusters on the basis of probability, the objective is to maximize the likelihood of the data instances belonging to their assigned clusters [40].

Let D = {x_1, x_2, ..., x_n} be n observed data vectors and let Z = {z_1, z_2, ..., z_n} be n values of hidden variables (i.e. the cluster labels). The log-likelihood of the observed data given the model is

L(θ) = log p(D|θ) = log ∑_Z p(D, Z|θ)    (4.1)

where p is the probability of a data vector x_i belonging to a cluster in Z, θ denotes the parameters (mean and variance), and both θ and Z are unknown [41].

4.2 Results

We now present our machine learning models' predictions of whether a certain location will be congested at a future time. We demonstrate how the data in this research is fitted to the algorithms that we chose, what type of outputs are received, and how they are processed and analyzed.

4.2.1 SVR Prediction Model

In order to get a future prediction of the congestion level, the first step was to predict the individual attributes: average number of clients, throughput, frame retry, and frame error. We took the Support Vector Regression (SVR) from section 4.1.3 and created a basic user interface that allows the user to enter input parameters such as day of week, building, and hour of day. Based on those input parameters, the program queries the SVR model and outputs a prediction of the aforementioned attributes of Wi-Fi congestion. The output of the program can be seen in table 4.1. For building MIKC at 4 pm, the SVR predicted 14 clients per AP, a throughput of 1742 per AP, a frame retry of 20 per AP, and a frame error of 1 per AP. It is important to note that these values are averages across all APs in the MIKC building. The MSE for clients tells us the prediction differs from the actual values by about 2.45 clients on average; to find this average difference, we take the square root of the MSE for any particular attribute. From here, the predicted values are sent to the unsupervised EM clustering algorithm to get the resulting predicted congestion level.
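As a quick check of that figure, using the clients MSE from table 4.1:

import math

mse_clients = 6.03                        # clients MSE (MIKC, Monday, hour 16)
print(round(math.sqrt(mse_clients), 2))   # -> 2.46, the ~2.45 quoted above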

TABLE 4.1: SVR Output Prediction (MSE in parentheses)

Day       Hour  Location  Clients    Throughput       Frame retry  Frame error
Monday    16    MIKC      14 (6.03)  1742 (68078.43)  20 (6.69)    1 (0.04)
Thursday  05    JCSU      0 (0.49)   60 (7009.18)     7 (0.07)     0 (0.43)
Thursday  08    MIKC      5 (4.37)   515 (74291.53)   7 (10.98)    0 (0.75)

4.2.2 EM clustering

To perform EM clustering, we have a dataset with 863 instances and 4 attributes, as shown in figure 4.2. The data is in .arff format, which is the standard in WEKA. The 4 attributes are Clients, Throughput, Frame retry and Frame error, each with the data type 'numeric', meaning real numbers. For each data object, the comma-separated numbers denote values corresponding to the attributes in the attribute section. For instance, in line 9 of figure 4.2, 0 is the client number, 59 is the throughput value, 0 is the frame retry value and 2 is the frame error value. This data is obtained by averaging the 5-minute-interval values for all these attributes on an hourly basis per AP MAC address, so each data point refers to the number of clients, throughput, packet retry rate and packet drop rate on an AP for a certain hour of a certain day in a certain location. We then upload the .arff data file into WEKA and apply the EM clustering algorithm, specifying the number of clusters (k) as 3. The default number of clusters generated by this algorithm is 4, but we found that the clusters are more clearly separated and of more value when we set k to 3. The algorithm groups data instances into clusters based on the maximum likelihood estimates of the parameters (the mean and variance of each attribute) and assigns each instance to the cluster to which it has the highest probability of belonging.
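For illustration, such a file might be laid out as below; the relation name is hypothetical, the first data row is the instance cited above, and the other two rows are taken from table 4.2:

@relation wifi_congestion

@attribute Clients numeric
@attribute Throughput numeric
@attribute FrameRetry numeric
@attribute FrameError numeric

@data
0,59,0,2
13,1660,21,3
2,258,2,2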

FIGURE 4.2: Data used for EM clustering

Figure 4.3 shows the clustering output as produced by the EM algorithm. The three clusters are identified by the colors blue, red and green, with the auto-generated names cluster0, cluster1 and cluster2 respectively. In this plot, the x-axis is the instance number and the y-axis is the number-of-clients attribute. Due to space constraints, we present only the plot with the client attribute on the y-axis; the plots for the remaining attributes displayed similar results. To understand how the clusters are formed, we select differently colored points in the plot and compare them. The results are shown in table 4.2. Here, the three data points in cluster2 have very high values for all the attributes. Compared to that, the data points in cluster0 have lower attribute values but are still significant.

FIGURE 4.3: The 3 clusters generated by the EM algorithm

TABLE 4.2: Analysis of data in cluster2, cluster0 and cluster1

Cluster   Instance number  Clients  Throughput  Frame retry  Frame error
cluster2  84               13       1660        21           3
cluster2  85               13       1789        19           3
cluster2  86               13       1675        19           3
cluster0  493              2        258         2            2
cluster0  494              2        207         2            3
cluster0  495              2        258         2            3
cluster1  364              0        0           0            1
cluster1  366              0        4           0            1
cluster1  368              0        44          0            2

Finally, in cluster1, the attribute values for all the data points are very low and almost negligible. We analyzed other points in all the clusters and obtained similar results. Based on this analysis, we labeled the data points in cluster2, cluster0 and cluster1 as highly congested, moderately congested and less congested respectively, with the cluster labels high for cluster2, medium for cluster0 and low for cluster1 in terms of congestion. Figure 4.4 shows the accuracy of our EM clustering model using the log-likelihood, whose value is -5.56. Since the log-likelihood is the logarithm of a probability, and probability is always between 0 and 1, the log value is negative. The algorithm took 4 iterations to converge to this result. Figure 4.4 also shows the mean and standard deviation of each attribute for each cluster, which gives the range of attribute values within each cluster and an intuitive idea of the level of congestion in each cluster.
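The clustering itself was run in WEKA; as a rough stand-in, scikit-learn's GaussianMixture implements the same EM procedure and reports both the soft cluster probabilities and the average per-sample log-likelihood used above (the rows here are only the table 4.2 excerpts, not the full 863-instance dataset):

import numpy as np
from sklearn.mixture import GaussianMixture

# (Clients, Throughput, Frame retry, Frame error) rows from table 4.2
X = np.array([[13, 1660, 21, 3], [13, 1789, 19, 3], [13, 1675, 19, 3],
              [2, 258, 2, 2], [2, 207, 2, 3], [2, 258, 2, 3],
              [0, 0, 0, 1], [0, 4, 0, 1], [0, 44, 0, 2]], dtype=float)

gm = GaussianMixture(n_components=3, covariance_type="diag",
                     random_state=0).fit(X)
print(gm.predict(X))         # hard cluster assignment per instance
print(gm.predict_proba(X))   # soft per-cluster probabilities (EM's E-step)
print(gm.score(X))           # average log-likelihood of the data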

4.2.3 Final output

Using the clustering model generated in section 4.2.2, we predict the congestion level for the inputs used in section 4.2.1, based on the values of the 4 attributes predicted in table 4.1. Figure 4.5 shows the data, generated by the prediction model, that we used as test data for our clustering model. We want to determine which congestion-level group each of these instances belongs to. As evident from figure 4.6, the first data instance belongs to cluster2 (high congestion), corresponding to MIKC, Monday at 4 p.m. The second data instance belongs to cluster0

(medium congestion), corresponding to MIKC, Thursday at 8 a.m. The third data instance belongs to cluster1 (low congestion), corresponding to JCSU, Thursday at 5 a.m. In this way, we can predict the congestion level for any location, day and time in the future.

FIGURE 4.4: Evaluation of the EM clustering model

FIGURE 4.5: Predicted data used in the EM clustering model

FIGURE 4.6: Output of the clustering model showing the assigned cluster for each data instance

Chapter 5

Incorporating Machine Learning in a Game Theoretic Environment for Dynamic Spectrum Access

In the last two chapters, we conducted feasibility studies to detect and predict congestion beforehand and deploy optimized solutions to handle it efficiently. The leading cause of this congestion is the increasing demand for wireless spectrum, which leaves the networks facing a serious shortage of frequency bands, even though a large portion of these bands is underutilized. Under such a scenario, efficient sharing of the available spectrum becomes inevitable, and it is yet another way to tackle the growing congestion. Cognitive radios can sense the wireless environment, identify the wireless channels not being used by primary users, and access them dynamically. In any situation where a number of secondary networks are trying to get an available channel, there arises a game theoretic competition in which each wants to get a channel for itself while incurring as little cost as possible. The increase in cost is equivalent to the increase in time caused by the need to search for an available channel. This process could be sped up if the networks had a predictive mechanism to determine the optimal strategy. In this chapter, we investigate various predictive algorithms (Linear Regression, Support Vector Regression and Elastic Net) and compare them with traditional non-predictive game theoretic mechanisms. We measure the performance of these algorithms in terms of the time taken to reach system convergence, and we observe how a self-learning approach can help maximize the players' utilities in comparison to traditional game theoretic approaches. Since we are imposing learning on games, we need a strong set of data for the networks to learn from. We simulated an environment with a random number of competing networks and a random number of available channels, both of which change dynamically. We then calculated the optimal probability for the networks to switch when there are N networks and M channels; M and N change dynamically as the networks start capturing available channels. We did this for a set of M and N values and stored the corresponding probabilities in a database. This became our training dataset, from which the networks predict the switch/stay probability that best suits them for any number of competing networks and available channels. As the networks predict, these predictions become part of the training dataset, making it easier for the networks to predict in the future; we have thus implemented online learning in our game environment.

To implement learning, we used three different learning algorithms that are proven to be successful in estimating predictions: Linear Regression (LR), Support Vector Regression (SVR) and Elastic Net. Since we are dealing with a continuous variable, the probability, we used regression algorithms to make predictions. We divided our dataset into training and testing datasets, and the same training dataset is used by all the algorithms. Traditional game theoretic works investigating self-coexistence do not apply machine learning, but rather stick to game-theoretic tools to derive strategies that lead to a Nash Equilibrium. Our main purpose is to see how a machine learning approach to self-coexistence among networks gives better results. Through this study, we try to establish the feasibility of incorporating machine learning algorithms in a game theoretic environment. The networks will play this incomplete-information game, without knowledge of the other players' strategies, using the self-learning approach, and we will find out how the time to access an available channel is minimized by opting for this strategy.

5.1 System Model

Let us assume a game environment with dynamically changing components. There are N networks and M channels, where the networks are contending for available channels. For simplicity, we assume that each network acquires only one channel for this game. While searching for available channels, if two or more networks land on the same channel, a collision occurs, resulting in failure of transmission. The networks are then faced with two choices: either switch to another channel or stay on the same channel, assuming the others will leave. Both choices carry risks, and it is the networks' job to determine the proper strategy as the situation demands. We consider the networks to be rational entities: each network tries to maximize its own utility, i.e. minimize the time it takes to acquire an available channel. For simplicity, we consider the number of available channels to always be equal to or greater than the number of competing networks. This way, equilibrium is always assured, as all the networks will get a channel at some point in time, leading to system convergence. So, if M is the number of available channels and N is the number of networks seeking a channel at any given point in time, M ≥ N. In this chapter, we have not considered the case where M < N, so as not to complicate the equilibrium scenario: with more networks than channels, some networks will always be left without a channel, and it then becomes difficult to define the system convergence point. So, we model a dynamic game in a non-cooperative environment where the networks have to determine an optimal strategy to either switch or stay, and co-exist among themselves to attain equilibrium in the least time possible.

5.1.1 Challenges

By an available channel, we mean one that is not being used by any of the networks. Using cognitive radio, it is possible to identify whether there is any primary network activity. However, the purpose of this game is to see how quickly a network can get a channel that no other secondary network is trying to access. If two or more networks try to grab the same channel, they collide with each other, resulting in a garbled transmission. In such a situation, they each have a choice to either switch to another channel or stay, assuming the others will leave. We will see which of these strategies works better for the networks. Let us illustrate the game as a two-player prisoners'-dilemma problem. We assume only two networks as players of the game, each with the strategy pair (switch, stay). We consider the cost of switching to another channel to be C, while the cost of staying on the same channel remains 0. The players x and y play this game in a pure strategy space, i.e. each of them either switches or stays with a probability of 1 or 0. In table 5.1, this game is expressed in strategic form:

x\y     Switch  Stay
Switch  (C, C)  (C, 0)
Stay    (0, C)  (0, 0)

TABLE 5.1: Strategic-form minority game with networks x and y

Each cell in the table contains a pair that represents the costs to players x and y when they choose the corresponding strategies. This form of game can be solved using Iterated Elimination of Dominated Strategies (IEDS). We find that IEDS gives us (Stay, Stay) as the Nash Equilibrium solution of this game, as it dominates all the other combinations of strategies. However, if both players only choose to stay, the game never proceeds and equilibrium is never reached. This special two-player case can easily be generalized to an N-player game, and we would still obtain the same result. What we can conclude is that always choosing to stay with a probability of 1 is never a good strategy for the players. Now, let us consider a scenario where some networks always choose to switch in case of collision. Switching to a different channel every time is not the ideal solution either: it is highly likely that the collided networks will again land on a channel that others are trying to access. When the competition is very high, all the networks continuously struggle for an available channel, and if all of them switch at every instance, the players may collide very frequently. We illustrate this scenario with figure 5.1, where squares represent the channels and numbered circles represent different networks. Figure 5.1 shows the game proceeding through various stages when there are 6 networks and 10 channels. In fig 5.1(a), the game starts with the networks randomly choosing channels. Network 1 is successful in getting a channel at this stage, and the rest are in conflict, left with the choice to either stick to the same channel or find another. The game then moves to the next stage, shown in fig 5.1(b). At this stage, Networks 5 and 6 switch but land on the same channel and collide again. In the next stage, shown in fig 5.1(c), they are able to acquire available channels when Network 5 stays and Network 6 switches. Finally, the game ends, as shown in fig 5.1(d), when there are no more competing networks.

FIGURE 5.1: Networks and Channels a) at the beginning of the game; b) after the first stage when Network 2 got a channel; c) after the second stage when Networks 1, 3 and 4 got channels; d) at the last stage when equilibrium is achieved

Now we see that if the networks only ever decide to switch, they are not guaranteed a channel; rather, they remain open to collisions and incur the additional cost of switching to another channel. We have established that the networks cannot switch or stay with a probability of 0 or 1. Thus, the networks need to identify the situations where switching would be beneficial and the situations where staying would be beneficial. This is achieved by moving from the pure strategy space to a mixed strategy space, in which probabilities are assigned to each of the two strategies. If p denotes the probability to switch at any given time, the probability to stay at that time is 1 − p. It is then the job of the networks to identify the optimal strategy tuple (p, 1 − p) depending on the environment. Since this is a dynamic game, the environment changes at every instance: the networks are continuously searching for an available channel, and some of them succeed at the very first instance while others may take a while. Under such circumstances, we have to develop a way for the networks to adapt to their surroundings by learning to change their strategies accordingly. A minimal simulation sketch of this switch/stay dynamic follows.
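The sketch below simulates one run of this game under the stated M ≥ N assumption; the uniform random choice of a new channel on a switch and a single common switch probability p are simplifications, not the thesis's exact simulator:

import random

def time_to_equilibrium(N, M, p, rng=None):
    # Time slots until each of the N networks holds its own channel (M >= N).
    rng = rng or random.Random(0)
    contenders = [rng.randrange(M) for _ in range(N)]   # initial random picks
    taken = set()                                       # channels already won
    t = 0
    while contenders:
        t += 1
        counts = {}
        for ch in contenders:
            counts[ch] = counts.get(ch, 0) + 1
        remaining = []
        for ch in contenders:
            if ch not in taken and counts[ch] == 1:
                taken.add(ch)              # alone on a free channel: success
            else:
                if rng.random() < p:       # switch to a not-yet-won channel
                    ch = rng.choice([c for c in range(M) if c not in taken])
                remaining.append(ch)       # contend again next slot
        contenders = remaining
    return t

print(time_to_equilibrium(N=6, M=10, p=0.6))   # e.g. the setting of figure 5.1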

5.2 Self-learning in the game

Machine learning is the study of algorithms that identify complex patterns and hidden relationships in large amounts of data and facilitate intelligent decision-making [42]. Machine learning is similar to data mining in the sense that both processes look for patterns in data. However, the information obtained from machine learning is used not just for human comprehension but to adjust programs so that they become intelligent enough to make rational decisions. There exist many machine learning techniques: classification, clustering, regression and rule-based methods are some of them. It covers a wide range of applications such as medical diagnosis, prediction, pattern recognition, recommender systems and forecasting. For this particular problem, we apply regression to perform prediction. Regression analysis is a statistical technique for investigating and modeling the relationship between variables [43]. In this chapter, we model the relationship between the networks and channels to predict the probability to switch. The dynamicity of this game is handled using machine learning algorithms. Since the game components change very frequently, the networks need a strategy that also changes with the environment. That is why we have implemented learning techniques for these networks. These techniques involve a predictive mechanism through which the networks identify the probability at any given point in time; the predictions depend upon the number of competing networks and the number of available channels. The networks need to learn before they make these predictions, so we train them using a dataset in which we provide the optimal probability for each combination of networks and channels. Based on this dataset, the networks learn to predict probabilities at any point in time. As they proceed in the game, the predicted values are incorporated back into the database to help the networks make further predictions. Before we proceed, let us briefly discuss the algorithms that we used for prediction. The following methods are intended for regression in which the target value is expected to be a combination of the input variables. In mathematical notation, given an input vector x where x_1, x_2, ..., x_n are the features or independent variables, and y is the predicted value or dependent variable, then

y(w, x) = b + w_1 x_1 + w_2 x_2 + ... + w_n x_n    (5.1)

where w = (w_1, w_2, ..., w_n) is the coefficient vector and b is the intercept. A regression model is stated in terms of a connection between the predictors x and the response y [44].

5.2.1 Linear Regression

Linear regression is used to study and identify the relationships between dependent variables (y) and independent variables (x) in a dataset. When there is only one independent variable, it is called simple linear regression; when there is more than one, it is called multiple linear regression. This relationship between variables is defined using a linear model, as described

by equation (5.1), with coefficients w = (w_1, w_2, ..., w_n) chosen to minimize the residual sum of squares between the actual values in the data and the values predicted by the linear model. Mathematically, it solves a problem of the form:

minimize ∑_{i=1}^{m} (y_i − ŷ_i)² = ∑_{i=1}^{m} (y_i − (w · x_i + b))²    (5.2)

where y_i is the actual output as obtained from the training dataset and ŷ_i is the predicted output from the linear approximation. Linear regression is typically used in predictive analysis. In [45], linear regression experiments were used to predict the age of a text's author based on content features and stylistic features.

5.2.2 Support Vector Regression

Support Vector Regression shares the same idea as the Support Vector Machine: find a hyper-plane that separates the data cleanly in multidimensional space, with as much separation between the data points and the hyper-plane as possible. If the data is not clearly separable, it is transformed by one of the kernels into a higher-dimensional (n+1) space and is then partitioned by an (n+1)-dimensional hyper-plane [46][47]. In SV regression, our goal is to find a function f(x) that has at most ε deviation from the actually obtained targets y_i for all the training data, and at the same time is as flat as possible. In other words, errors are acceptable as long as they are less than ε, but any deviation larger than this has to be accounted for. We have a linear function of the form

f(x) = w · x + b    (5.3)

Flatness in the case of (5.3) means that one seeks a small w. One way to ensure this is to minimize ||w||² = w · w (dot product) [47]. The problem can be written as a convex optimization problem:

minimize (1/2) ||w||²
subject to
    y_i − w · x_i − b ≤ ε
    w · x_i + b − y_i ≤ ε    (5.4)

Support Vector Regression (SVR) is a computational tool that has recently received much attention in the system identification literature, especially because of its successes in building nonlinear black-box models. The main feature of the algorithm is the use of a nonlinear kernel transformation to map the input variables into a feature space so that their relationship with the output variable becomes linear in the transformed space. This method has excellent generalization capabilities for high-dimensional nonlinear problems due to the use of kernels such as radial basis functions, which have good approximation capabilities. Another attractive feature of the method is its convex optimization formulation, which eliminates the problem of local minima while identifying nonlinear models [48].

5.2.3 Elastic Net

Elastic Net is a regularization method for regression that combines the penalties of both LASSO and Ridge regression; it performs at least as well as either of them and generally gives better results than both. LASSO occasionally achieves poor results when there is a high degree of collinearity among the features, since it selects one of them at random. Ridge regression combats overfitting and performs better when features are collinear, but it only scales the coefficients down without discarding them. Elastic net is a hybrid approach that linearly combines the two norms: the L1 (Manhattan distance) of LASSO and the L2 (Euclidean distance) of Ridge. The naive elastic net finds its estimators in two stages: first it finds the ridge regression coefficients, and then it applies the lasso shrinkage to remove the less correlated coefficients. Elastic net is computationally more expensive than LASSO or ridge, as the relative weight of the LASSO and ridge penalties has to be selected using cross-validation. Elastic net is useful when multiple features are correlated with one another: LASSO is likely to pick one of them at random, while elastic net is likely to pick both. The objective function to minimize for Elastic Net is:

minimize ∑_{i=1}^{m} (y_i − w · x_i − b)² + αρ ∑_{j=1}^{n} |w_j| + (α(1 − ρ)/2) ∑_{j=1}^{n} w_j²    (5.5)

Similar to the LASSO, the elastic net simultaneously performs automatic variable selection and continuous shrinkage, and it can select groups of correlated variables. It is like a stretchable fishing net that retains 'all the big fish'. Simulation studies and real data examples show that the elastic net often outperforms the LASSO in terms of prediction accuracy. Elastic Net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors is much bigger than the number of observations [49].
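As an illustration, the following sketch fits all three predictors to (networks, channels) to probability rows; the first four targets are the experimental values reported later in table 5.2, the last two rows are placeholders, and the Elastic Net alpha is an assumption:

import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.svm import SVR

# (N networks, M channels) -> optimal switch probability
X = np.array([[10, 10], [10, 20], [10, 30], [10, 40],
              [20, 30], [30, 40]], dtype=float)
y = np.array([0.522, 0.765, 0.858, 0.879, 0.70, 0.65])  # last two placeholders

models = {"Linear Regression": LinearRegression(),
          "SVR": SVR(kernel="rbf"),
          "Elastic Net": ElasticNet(alpha=0.01)}
for name, model in models.items():
    model.fit(X, y)
    p = float(model.predict(np.array([[10.0, 25.0]]))[0])  # unseen (N, M) pair
    print(name, round(p, 3))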

5.3 Proposed Mechanism

Since this is a dynamic game, it proceeds through various stages before equilibrium is achieved. We divide the game into time slots to mark the different stages; the game starts at time t = 0, and t increases at every stage. In this study, t is measured in time units. When the game begins, the networks start competing among themselves for available channels. Some of these networks succeed in acquiring a channel for themselves, while others collide. These networks have the choice to either switch to another channel or stay on the same channel, assuming that the other conflicted networks will leave. When a network switches, it needs more time to sense, search and find another available channel; when it stays, it only loses the time slot until the next transmission attempt. So, switching costs these networks more time than staying. After deploying their strategy to either switch or stay with probability p, the conflicted networks and the remaining available channels move on to the next time slot. This process is repeated over successive time slots until every network has a channel, which we define as the equilibrium point. Our objective is to minimize the number of time slots taken to achieve equilibrium. This scenario is illustrated in figure 5.2, where M, N and P denote the number of channels, the number of networks and the optimal probability to switch, respectively. The game starts at time t = 0, where 40 networks are competing for an available channel. Of these, 10 networks are successful in accessing available channels on the first attempt and do not compete further. The conflicted networks either switch or stay depending on their value of P and move on to the next time slot with the remaining available channels. The game thus proceeds through n stages until equilibrium is achieved at time T'.

FIGURE 5.2: Various stages of the game divided into time slots

We provide four different alternatives for the mixed strategy probability p and study each outcome in terms of the time taken to reach equilibrium: (i) using a random p at every stage of the game; (ii) staying neutral and using p = 0.5 throughout all the stages of the game; (iii) using the p obtained through the traditional game theoretic approach; (iv) using the p obtained through prediction in a machine-learning-based environment. Using a random probability means that the networks randomly choose the probability to switch at every stage of the game, avoiding the overhead of prediction but lacking a proper strategy. The second alternative is to stay neutral: neutrality, in this case, is the assumption that the networks are unbiased between switching and staying and choose a probability of 0.5 throughout the game. The third mechanism gives us the optimal probability at the Nash Equilibrium point after solving the game with traditional game theoretic tools; we use this probability throughout the game and observe the time taken to reach system convergence. Finally, our last alternative is to incorporate a learning model into the game, where the networks themselves predict their optimal strategy p. None of the first three mechanisms uses a machine learning approach to solving the game. We will see that a machine-learning-based environment facilitates self-coexistence at a faster rate than any of the other three mechanisms. Learning requires an initial dataset for the networks to learn from. This dataset is simply a collection of the networks' past actions and the strategies that gave them maximum utility in each case. After enough trials, the networks are able to identify the proper strategy for a situation and make predictions accordingly. To train the networks, we created our own database consisting of a number of combinations of networks (N) and channels (M). For each combination, we played the game multiple times using all possible probabilities and calculated the time taken to reach convergence in each case. The probability giving the least time to achieve equilibrium is then considered the optimal probability for that combination. We did this with N ranging from 2-50 and M ranging from 2-50 where M ≥ N, giving us a total of 1176 data points; a sketch of this sweep appears after figure 5.3. In figure 5.3, we present the graphs used to calculate the optimal probabilities when there are 10 networks and 10, 20, 30 and 40 channels respectively. Since each is a graph of probability to switch vs. time to reach equilibrium, the least time taken corresponds to the optimal probability. This is how we calculated the optimal probabilities required for our training dataset. One interesting thing to note is that for the cases with a higher number of channels, the system convergence time decreases, implying that it takes less time to reach equilibrium when the competition is lower.

FIGURE 5.3: Time to reach equilibrium (in time units) and corresponding optimal probability to switch with varying number of channels
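As referenced above, each training row can be produced by sweeping p and keeping the best value. This sketch reuses the time_to_equilibrium function from the sketch in section 5.1.1; the probability grid and trial count are assumptions:

import random

def optimal_switch_probability(N, M, trials=200):
    # Sweep p over a grid; return the value minimizing mean convergence time.
    rng = random.Random(42)
    best_p, best_time = None, float("inf")
    for step in range(1, 20):                       # p = 0.05, 0.10, ..., 0.95
        p = step / 20.0
        avg = sum(time_to_equilibrium(N, M, p, rng)
                  for _ in range(trials)) / float(trials)
        if avg < best_time:
            best_p, best_time = p, avg
    return best_p

# Training rows (N, M, P), with M >= N as in the text:
# rows = [(N, M, optimal_switch_probability(N, M))
#         for N in range(2, 51) for M in range(N, 51)]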

To evaluate our mechanism, the dataset obtained for learning is divided into training and testing datasets. The training dataset is used to train the networks so that they can predict the probability, and the testing dataset is used to check the accuracy of their predictions. Of the 1176 data points in total, 966 are used for training and the rest for testing. We store the training data in a database with 3 columns: Network, Channel and Probability. We discuss the flow of the program with the algorithm below, where N denotes the number of competing networks, M denotes the number of available channels and P denotes the optimal probability to switch at any time during the program.

input : M, N
output: Time

Time ← 0
Equilibrium ← False
play the game with (M, N)
while Equilibrium not True do
    find the (M, N) row in the database
    if the (M, N) row is found then
        use P from the database
    else
        predict P using the machine learning algorithm
        augment the database with the predicted P
    end
    M ← number of remaining channels
    N ← number of remaining networks
    play the game with (M, N)
end
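A runnable rendering of this loop, as a sketch: here database maps (M, N) pairs to stored probabilities, model is one of the fitted regressors from section 5.2, and play_stage is assumed to resolve one time slot and return the remaining channels and networks (passing P = None for the first, purely random, slot is an assumption of this sketch):

def run_game(M, N, database, model, play_stage):
    # Returns the number of time slots until equilibrium (N reaches 0).
    time = 1
    M, N = play_stage(M, N, None)                   # first slot: random picks
    while N > 0:
        if (M, N) in database:
            P = database[(M, N)]                    # reuse the stored optimal P
        else:
            P = float(model.predict([[N, M]])[0])   # predict P for this stage
            database[(M, N)] = P                    # online learning: grow table
        M, N = play_stage(M, N, P)
        time += 1
    return time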

5.4 Results

We simulated a game environment with varying numbers of networks and channels and observed the results when using a predictive strategy vs. a non-predictive strategy. For the non-predictive strategy, we used three different mechanisms: random probability, neutral probability and the traditional game-theoretic probability. In the predictive strategy, the probability is predicted by the networks at every stage of the game, depending upon the number of networks and channels at that particular stage; we call this the optimal probability. For training purposes, we provided an initial set of data, also calculated through our simulation experiment. The source code for this program was written in Python, making use of the SciPy and NumPy libraries for scientific and numerical calculations. We assume that N networks are competing for M channels, where M ≥ N. All the networks have a mixed strategy to either switch or stay when faced with conflict, and each is trying to access an available channel in as little time as possible. The networks reach the equilibrium state when every network has found an available channel. We used three different machine learning algorithms for prediction: Linear Regression, Support Vector Regression and Elastic Net. It is important to measure the accuracy of these algorithms before relying on them to predict our strategy for the game. In table 5.2, we present the experimentally calculated optimal probabilities that incur minimum cost when there are 10 networks and 10, 20, 30 and 40 channels respectively, together with the optimal probabilities calculated for the same cases by the prediction algorithms. It can be seen that the prediction results closely match the results obtained from the game simulation. Some predictions are more accurate than others, but overall the error is negligible, justifying the use of the predictive mechanism as an efficient way of finding the mixed strategy probability for the game. One interesting thing to note is that as the gap between the number of networks and channels increases, the probability also increases, although at a small rate. What we understand from this result is that switching is more beneficial for the networks than staying when the competition is lower.

10 competing networks

Number of        Experimental  Linear      Support Vector  Elastic
available bands  calculation   Regression  Regression      Net
10               0.522         0.631       0.622           0.641
20               0.765         0.689       0.711           0.692
30               0.858         0.746       0.777           0.743
40               0.879         0.804       0.779           0.794

TABLE 5.2: Comparison between experimentally calculated and predicted probabilities

Next, in figure 5.4, we examine the accuracy of these algorithms more closely with the graph for 10 competing networks. The minimal difference in the prediction results among these algorithms shows that any of them can be used interchangeably to implement machine learning in the game. In table 5.3, we present the Mean Square Error (MSE) of all three algorithms, calculated using our test dataset. From the table, we can see that, overall, Elastic Net gives us the most accurate results. All of these algorithms have a small error, further confirming their accuracy in the system.

FIGURE 5.4: Channel switching probability with varying number of channels

Finally, in figures 5.5, 5.6, 5.7 and 5.8, we present the major finding of this study: the use of a learning strategy at every stage of the game gives better results than every other scenario. We calculate the time taken to reach equilibrium in each of the four scenarios: (i) the networks use random probabilities at every stage of the game; (ii) the networks use a probability of 0.5 throughout the game; (iii) the networks use the optimal probability obtained through traditional game theoretic tools and keep it constant through all the stages of the game; (iv) the networks predict the optimal probability at every stage of the game using the three learning algorithms: Linear Regression,

Algorithm                  Mean Square Error
Linear Regression          0.03345
Support Vector Regression  0.181004
Elastic Net                0.025036

TABLE 5.3: Mean square errors of predictive algorithms

Support Vector Regression and Elastic Net. This is done for 10 competing networks with 10, 20, 30 and 40 available channels, and we compare the time taken by each strategy in each case against the others. Figures 5.5, 5.6, 5.7 and 5.8 show the time taken to reach equilibrium when playing the game with the different strategies in the case of 10 competing networks and 10, 20, 30 and 40 available channels respectively. As seen from these figures, even at different levels of competition, the predictive strategy gives better results than all the other strategies: the time taken to reach equilibrium decreases when the networks use machine learning algorithms to identify the proper strategy. This gain in performance increases for even larger game environments with more players. Since we are constantly growing the database with new combinations of N and M, the networks learn to adapt their strategies to such large environments. These results give us a new lead: rather than sticking to traditional game theoretic tools, a machine-learning-based environment has a more positive effect on these networks and helps them achieve self-coexistence in a better way. In table 5.4, we compare the time taken by all the non-predictive strategies against the time taken by the three learning algorithms when there are 10 networks with 10, 20, 30 and 40 channels respectively.

FIGURE 5.5: Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 10

10 competing networks

Number of        Random    Neutral   Game theoretic  Linear      Support Vector  Elastic
available bands  strategy  strategy  strategy        Regression  Regression      Net
10               30.11     21.45     17.83           16.46       17.23           17.32
20               18.69     10.25     8.27            7.21        6.24            6.29
30               10.71     6.48      8.11            8.35        20.8            17.3
40               8.29      7.27      6.34            3.55        6.23            6.21

TABLE 5.4: Comparison between the time taken to reach equilibrium (in time units) when using different strategies

FIGURE 5.6: Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 20

FIGURE 5.7: Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 30

FIGURE 5.8: Time taken to reach equilibrium (in time units) with different strategies when N = 10 and M = 40

Chapter 6

Conclusion and Future Work

The ultimate goal of these measurements and this study of Wi-Fi networks is to provide future users with good connectivity despite differences in network infrastructure and fluctuations in spectrum usage. Before investing in technologies that assist in load balancing or dynamic access point provisioning, it is crucial to have a deeper understanding of current and future Wi-Fi usage trends, which can be achieved through the measurements conducted in our study. In a world where everything is dynamic, the applications of UAVs are skyrocketing. While many major enterprises are opting for aerial mechanisms for provisioning WiFi service in remote areas, we have tried to adapt the concept by using UAVs for better WiFi coverage for small-scale industries with much less investment. In this thesis, we conducted a feasibility study on how to deploy floating APs in locations with a very high user base and dynamically changing demands. For this purpose, we analyzed the WiFi spectrum usage at the University of Nevada, Reno over different locations and at different times, and observed the impact on all the WiFi channels. We found that users tend to form clusters at different locations during different hours of the day, and that each building has its own busy hours depending on the day of the week. This trend in spectrum usage gave us crucial information regarding the congested locations and the busiest hours at those locations. We also analyzed the channels used the most during those peak hours and concluded that the users are almost always associated with the same channel, creating a heavy load on that channel. Thus, the deployed APs can not only serve the crowded locations but also make use of the underutilized channels to bring the congestion down to a minimum level. The first step in dealing with Wi-Fi network issues is identifying congestion, and when and where it occurs. In this thesis, we have accurately predicted the congestion level at certain locations for specified dates and times inside UNR, a public university equipped with thousands of access points and an even larger daily user base. After realizing that there is in fact a pattern in the network usage across the same days of different weeks, we were able to successfully predict the values of 4 attributes that correlate to congestion (clients, throughput, frame retry and frame error) using SVR based on their historical values. We then applied an EM clustering model to segregate the original training dataset into clusters, each defining a certain level of congestion: low, medium or high. The predicted attribute values for different locations, days and times are then used as test data for this model, which identifies whether each data instance corresponds to low, moderate or high congestion. In this way, we have integrated two machine learning approaches, supervised and unsupervised learning, to determine the congestion level in a wireless network. We also modeled dynamic spectrum access using a novel game-theoretic approach. We incorporated self-coexistence among cognitive-radio-based networks trying to access an available channel in a non-cooperative environment. This was done by making the networks intelligent: they analyze their surroundings using various machine learning algorithms so that they can learn their optimal strategy.
This means they can predict the mixed strategy probability depending on the number of competing networks and available channels. We also measured the accuracy of the learning algorithms that we used: Linear Regression, Support Vector Regression and Elastic Net. We found that the networks reach equilibrium more quickly when they opt for the prediction mechanism, thus minimizing their cost and maximizing their utility. The results obtained through this study open up a vast number of possibilities for the wireless research community. As discussed, in this thesis we queried an Aruba controller to obtain measurements such as the number of clients, AP name, location, channel etc. To explore the yet-undiscovered horizons and find ideas that can help create better network management, these kinds of studies involving measurements of Wi-Fi attributes are a must. To this end, we have identified new measurement perspectives, such as advanced SNMP queries to the controller and syslogs, for obtaining more detailed information on additional Wi-Fi variables, including channel throughput, Signal-to-Noise Ratio, number of transmitted packets/bytes of data, packet drop rate etc., which help depict a very clear picture of network congestion and bring forth new solutions for tackling it. Moreover, this study is a stepping stone towards security-aware cyber infrastructure analysis and design: a deeper analysis of the data collected in this study will unveil robust network design solutions that provide users not only with a quality service but also with a secure environment. This study is mainly important for network designers, helping them understand the activities of mobile users in the network, such as network congestion, variance in network activity across space, time and channels, and crowd mobility, and use this information to better plan and extend network infrastructure: predicting congestion and proactively implementing dynamic AP provisioning services, reconfiguring APs to balance the load across all channels, using mobility patterns to identify locations that require better service, and so on. The study of network congestion combined with user mobility trends can further research into efficient route calculation for deploying floating APs, minimizing the flight time of these APs, and providing better coverage with fewer additional APs through strategic channel allocation. We believe that this study can be scaled to bigger networks that exhibit similar trends. It serves as a tool for system administrators who constantly monitor wireless networks, helping them better optimize resources and infrastructure and allowing them to proactively avoid situations that might lead to a massive network load-balancing issue.

Bibliography

[1] Chengqi Song and Qian Zhang. “Intelligent dynamic spectrum access assisted by channel usage prediction”. In: INFOCOM IEEE Conference on Computer Communications Workshops, 2010. IEEE. 2010, pp. 1–6.

[2] Anna Kamińska-Chuchmała. “Performance analysis of access points of university wireless network”. In: Rynek Energii 1.122 (2016), pp. 122–124.

[3] Stuart M Adams and Carol J Friedland. “A survey of unmanned aerial vehicle (UAV) usage for imagery collection in disaster research and management”. In: 9th International Workshop on Remote Sensing for Disaster Response. 2011, p. 8.

[4] Gurkan Tuna, Bilel Nefzi, and Gianpaolo Conte. “Unmanned aerial vehicle-aided communications system for disaster recovery”. In: Journal of Network and Computer Applications 41 (2014), pp. 27–36.

[5] Luis F Gonzalez et al. “Unmanned Aerial Vehicles (UAVs) and artificial intelligence revolutionizing wildlife monitoring and conservation”. In: Sensors 16.1 (2016), p. 97.

[6] Noel Sharkey. “The automation and proliferation of military drones and the protection of civilians”. In: Law, Innovation and Technology 3.2 (2011), pp. 229–240.

[7] Amazon wins patent for a flying warehouse that will deploy drones to deliver parcels in minutes. https://www.cnbc.com/2016/12/29/amazon-flying-warehouse-deploy-delivery-drones-patent.html. 2016.

[8] Project Loon. https://loon.co/.

[9] Facebook's Giant Internet-Beaming Drone Finally Takes Flight. https://www.wired.com/2016/07/facebooks-giant-internet-beaming-drone-finally-takes-flight/. 2016.

[10] Kanchan Kamnani and Chaitali Suratkar. “A review paper on Google Loon technique”. In: International Journal of Research In Science & Engineering 1.1 (2015), pp. 167–171.

[11] Facebook's solar-powered drone makes first full test flight. https://www.engadget.com/2016/07/21/facebooks-solar-powered-drone-makes-first-full-test-flight/. 2016.

[12] Huazhi Gong and JongWon Kim. “Dynamic load balancing through association control of mobile users in WiFi networks”. In: IEEE Transactions on Consumer Electronics 54.2 (2008).

[13] Chaoming Song et al. “Limits of predictability in human mobility”. In: Science 327.5968 (2010), pp. 1018–1021.

[14] Jon Froehlich and John Krumm. Route prediction from trip observations. Tech. rep. SAE Technical Paper, 2008.

[15] Apollinaire Nadembega, Abdelhakim Hafid, and Tarik Taleb. “A destination and mobility path prediction scheme for mobile networks”. In: IEEE Transactions on Vehicular Technology 64.6 (2015), pp. 2577–2590.

[16] Alessandro E Redondi et al. “Understanding the WiFi usage of university students”. In: 7th IEEE International Workshop on TRaffic Analysis and Characterization (TRAC). 2016, pp. 44–49.

[17] Shamik Sengupta et al. “A game theoretic framework for distributed self-coexistence among IEEE 802.22 networks”. In: Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE. 2008, pp. 1–6.

[18] David Kotz and Kobby Essien. “Analysis of a campus-wide wireless network”. In: Wireless Networks 11.1-2 (2005), pp. 115–133.

[19] Enrica Zola and Francisco Barcelo-Arroyo. “A comparative analysis of the user behavior in academic WiFi networks”. In: Proceedings of the 6th ACM workshop on Performance monitoring and measurement of heterogeneous wireless and wired networks. ACM. 2011, pp. 59–66.

[20] Francesco Calabrese, Jonathan Reades, and Carlo Ratti. “Eigenplaces: segmenting space through digital signatures”. In: IEEE Pervasive Computing 9.1 (2010), pp. 78–84.

[21] Magdalena Balazinska and Paul Castro. “Characterizing mobility and network usage in a corporate wireless local-area network”. In: Proceedings of the 1st international conference on Mobile systems, applications and services. ACM. 2003, pp. 303–316.

[22] Mikhail Afanasyev et al. “Usage patterns in an urban WiFi network”. In: IEEE/ACM Transactions on Networking (TON) 18.5 (2010), pp. 1359–1372.

[23] Vanessa Gardellin, Sajal K Das, and Luciano Lenzini. “A fully distributed game theoretic approach to guarantee self-coexistence among WRANs”. In: INFOCOM IEEE Conference on Computer Communications Workshops, 2010. IEEE. 2010, pp. 1–6.

[24] Dong Huang et al. “A game theory approach for self-coexistence analysis among IEEE 802.22 networks”. In: Information, Communications and Signal Processing, 2009. ICICS 2009. 7th International Conference on. IEEE. 2009, pp. 1–5.

[25] Abdallah Abdallah et al. “Detecting the impact of human mega-events on spectrum usage”. In: Consumer Communications & Networking Conference (CCNC), 2016 13th IEEE Annual. IEEE. 2016, pp. 523–529.

[26] Yunjuan Zang et al. “Wavelet transform processing for cellular traffic prediction in machine learning networks”. In: Signal and Information Processing (ChinaSIP), 2015 IEEE China Summit and International Conference on. IEEE. 2015, pp. 458–462.

[27] Spectrum Usage Timelapse Video. https://goo.gl/TpAEkG.

[28] Minkyong Kim, David Kotz, and Songkuk Kim. “Extracting a mobility model from real user traces”. In: INFOCOM 2006. 25th IEEE International Conference on Computer Communications. Proceedings. IEEE. 2006, pp. 1–13.

[29] Libo Song et al. “Evaluating location predictors with extensive Wi-Fi mobility data”. In: INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies. Vol. 2. IEEE. 2004, pp. 1414–1424.

[30] Unsupervised Learning. https://www.mathworks.com/discovery/unsupervised-learning.html.

[31] Jeffrey D Case et al. Simple Network Management Protocol (SNMP). Tech. rep. 1990.

[32] Network Basics: What Is SNMP and How Does It Work? https://www.auvik.com/media/blog/network-basics-what-is-snmp/.

[33] John D Hunter. “Matplotlib: A 2D graphics environment”. In: Computing in Science & Engineering 9.3 (2007), pp. 90–95.

[34] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12.Oct (2011), pp. 2825–2830.

[35] Diane Tang and Mary Baker. “Analysis of a metropolitan-area wireless network”. In: Wireless Networks 8.2/3 (2002), pp. 107–120.

[36] http://weka.sourceforge.net/doc.dev/weka/clusterers/EM.html.

[37] EM Algorithm: How it works. https://www.youtube.com/watch?v=REypj2sy_5U&t=1s. 2014.

[38] Arthur P Dempster, Nan M Laird, and Donald B Rubin. “Maximum likelihood from incomplete data via the EM algorithm”. In: Journal of the Royal Statistical Society. Series B (Methodological) (1977), pp. 1–38.

[39] Weka 3: Data Mining Software in Java. https://www.cs.waikato.ac.nz/ml/weka/.

[40] Miin-Shen Yang, Chien-Yo Lai, and Chih-Ying Lin. “A robust EM clustering algorithm for Gaussian mixture models”. In: Pattern Recognition 45.11 (2012), pp. 3950–3961.

[41] Clustering and the EM algorithm. https://www2.cs.duke.edu/courses/fall07/cps271/EM.pdf.

[42] Shijun Wang and Ronald M Summers. “Machine learning and radiology”. In: Medical Image Analysis 16.5 (2012), pp. 933–951.

[43] Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining. Introduction to Linear Regression Analysis. Vol. 821. John Wiley & Sons, 2012.

[44] Frank E Harrell. “Ordinal logistic regression”. In: Regression Modeling Strategies. Springer, 2015, pp. 311–325.

[45] Dong Nguyen, Noah A Smith, and Carolyn P Rosé. “Author age prediction from text using linear regression”. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Association for Computational Linguistics. 2011, pp. 115–123.

[46] Olivier Chapelle and Vladimir Vapnik. “Model selection for support vector machines”. In: Advances in Neural Information Processing Systems. 2000, pp. 230–236.

[47] Alex J Smola and Bernhard Schölkopf. “A tutorial on support vector regression”. In: Statistics and Computing 14.3 (2004), pp. 199–222.

[48] Saneej B Chitralekha and Sirish L Shah. “Application of support vector regression for developing soft sensors for nonlinear processes”. In: The Canadian Journal of Chemical Engineering 88.5 (2010), pp. 696–709.

[49] Hui Zou and Trevor Hastie. “Regularization and variable selection via the elastic net”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005), pp. 301–320.