INTEGRATED NETWORK TRAFFIC-MOBILITY ANALYSIS AND MODELING USING BIG DATA

By BABAK ALIPOUR

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2019

© 2019 Babak Alipour

To my Mom and Dad, whose love and support made my success possible

ACKNOWLEDGMENTS

First and foremost, I am thankful for all the help I received from my advisor, Prof. Ahmed Helmy, whose guidance, patience and intelligence made research obstacles conquerable. We thank Dr. Alin Dobra and Dr. Daisy Zhe Wang for help with the computing cluster, and the anonymous reviewers of IEEE InfoCom 2018 for useful feedback. The term ’cello mobility’ was suggested by Prof. Mostafa Ammar and is used here with permission. Partial funding was provided by NSF Award Number 1320694 at the University of Florida. We gratefully acknowledge the support of NVIDIA Corp. with the donation of the Titan Xp GPU used for this research.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Data-Driven Traffic and Mobility Analysis
    1.1.1 Mobility Analysis
    1.1.2 Traffic Analysis
    1.1.3 ’FLAMeS’: Framework for Large-scale Analysis of Mobile Societies
  1.2 Predictability Analysis and Prediction Algorithm Design
  1.3 Integrated ’Generative’ Traffic-Mobility Modeling
  1.4 Research Contributions
  1.5 Dissertation Organization
2 MATERIALS AND METHODS
  2.1 Input Datasets
    2.1.1 WLAN AP Logs
    2.1.2 NetFlow Logs
  2.2 DHCP and Merging Datasets
    2.2.1 Device Type Classification
    2.2.2 Computing System
  2.3 Machine Learning Application and Background
    2.3.1 Supervised Learning
      2.3.1.1 Classification
      2.3.1.2 Sequence prediction
    2.3.2 Unsupervised Learning
      2.3.2.1 Clustering
      2.3.2.2 Generative
    2.3.3 Reinforcement Learning
    2.3.4 Feature Selection
3 ’FLUTES’ vs. ’CELLOS’: ANALYZING MOBILITY-TRAFFIC CORRELATIONS IN LARGE WLAN TRACES
  3.1 Related Work
  3.2 Mobility Analysis
    3.2.1 Session Start Probability
    3.2.2 Radius of Gyration
    3.2.3 Visitation Preferences and Interests
    3.2.4 Sessions Per Building
    3.2.5 Hourly Associations
    3.2.6 Visitation Preferences
    3.2.7 Return Probability
  3.3 Traffic Analysis
    3.3.1 Flow-level Statistical Characterization
      3.3.1.1 Size
      3.3.1.2 Packets
      3.3.1.3 Runtime
      3.3.1.4 Inter-arrival times (IAT)
      3.3.1.5 Protocols
    3.3.2 Network-Centric (Spatial) Analysis
    3.3.3 User Behavior (Temporal) Analysis
      3.3.3.1 Data consumption
      3.3.3.2 Packet rate
      3.3.3.3 Active duration
  3.4 Integrated Mobility-Network Traffic Analysis
    3.4.1 Feature Engineering
      3.4.1.1 Mobility
      3.4.1.2 Network traffic
      3.4.1.3 Cross-dimension
    3.4.2 Utility of Integrated Modeling
  3.5 Integrated Mobility-Network Traffic Generative Modeling
    3.5.1 Related Modeling Work
    3.5.2 Statistical Metrics
    3.5.3 Gaussian Mixture Model (GMM)
    3.5.4 Restricted Boltzmann Machine (RBM)
    3.5.5 Variational Auto-Encoder (VAE)
    3.5.6 Generative Adversarial Network (GAN)
  3.6 Lessons Learned and Modeling Insights
  3.7 Summary
4 PREDICTABILITY ANALYSIS AND PREDICTION ALGORITHM DESIGN
  4.1 Related Work
  4.2 Entropy Estimators and Maximum Predictability
  4.3 Prediction Algorithms
  4.4 Experimental Setup
    4.4.1 Discrete-time Series
    4.4.2 Experiment Dimensions
  4.5 Mobility Analysis
    4.5.1 Overview
    4.5.2 Spatio-temporal Resolutions
    4.5.3 Comparison of Methods
    4.5.4 Correlations with Mobility and Network Traffic
  4.6 Summary and Future Work
5 LEARNING THE RELATION BETWEEN MOBILE ENCOUNTERS AND WEB TRAFFIC PATTERNS
  5.1 Related Work
  5.2 Mobility Encounters
    5.2.1 Daily Encounter Duration at the Building Level
    5.2.2 Encounter Duration Statistical Distributions
  5.3 Web Traffic Profile
  5.4 Pairwise Encounter-Traffic Relationship
    5.4.1 Device Type Categories
    5.4.2 Weekday Vs. Weekend
    5.4.3 Encounter Duration
  5.5 Learning Encounters
    5.5.1 Random Forest (RF)
    5.5.2 Deep Learning
  5.6 Summary and Future Work
6 FUTURE DIRECTIONS
  6.1 Extensions of Flutes vs. Cellos in Social Context and Interest Dimensions
  6.2 Predictability Analysis in the Traffic and Interest Dimensions

APPENDIX

A WEB DOMAIN INTEREST ANALYSIS
  A.1 Web Domain Interest Extraction
B CHOICE OF DATA PROCESSING TOOLS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Summary of datasets. B=billion.
2-2 NetFlow example records.
2-3 AP logs/DHCP example records.
3-1 Summary of results for mobility analysis.
3-2 Merged DHCP-NetFlow traces overview
3-3 Traffic features used for integrated mobility-traffic analysis
3-4 Average Kolmogorov-Smirnov statistic of all algorithms
3-5 Kolmogorov-Smirnov statistic of the β-VAE for weekday features.
3-6 Kolmogorov-Smirnov statistic of the β-VAE for weekend features.
4-1 Statistics per device available for at least 7 days & accessed more than 5 APs.
4-2 Median Accuracy of LSTM across spatio-temporal resolutions
4-3 Summary of Median Accuracy for Flutes vs Cellos with different methods
5-1 Encounter record example
5-2 Daily Encounter Duration in Seconds
5-3 Best fit distributions for total daily encounter duration based on pairs classifications

LIST OF FIGURES

1-1 ’FLAMeS’ system overview.
2-1 Wireless association for a device at different times.
2-2 Time series for 25 days of combined AP-NetFlow Core traces
2-3 Machine Learning algorithms widely used in Internet traffic classification.
3-1 PDF Session start over time of the day.
3-2 Radius of gyration and visited locations S(t)
3-3 Zipf’s plot on L visited access points.
3-4 Probability P(t) of session duration t.
3-5 Hourly associations.
3-6 Time spent at preferred building.
3-7 Probability to return to a previously visited location.
3-8 Traffic distribution plots.
3-9 CDF of individual flow sizes
3-10 Lognormal distribution plot for mean packet size of either device type
3-11 Theoretical and empirical plots for Lognormal and mean packet size of Flute flows
3-12 Exponential and Beta distribution plots for IAT.
3-13 Correlation plots for mobility and traffic
3-14 Correlation plots across mobility and traffic
3-15 Visualization of Kolmogorov–Smirnov statistic
3-16 Synthetic vs. Original ’total active time (TAT)’ feature for flutes.
3-17 VAE architecture.
3-18 GAN architecture.
3-19 Kolmogorov-Smirnov statistic for mobility features
3-20 Kolmogorov-Smirnov statistic for network traffic features
3-21 Correlation between TAT and several mobility features
3-22 Visualization of mobility features with Z1 from β-VAE.
4-1 Discrete-Time Series Example
4-2 ECDF of LSTM Prediction Accuracy for Flutes & Cellos at AP and Bldg. level
4-3 Correlation of Prediction Accuracy with Mobility and Network Traffic Features
5-1 Social context axes. (Sourced from [112])
5-2 Encounter duration CDF based on encounter pair device type.
5-3 TF-IDF Matrix: Each row is a user profile.
5-4 ECDF of traffic profile similarity across device types
5-5 ECDF of traffic profile similarity for weekdays and weekends
5-6 ECDF of traffic profile similarity across durations
5-7 Pearson correlation coefficient between encounter duration and traffic profile similarity
5-8 Correlation of encounter duration and TP similarity for different Bldg. across days
5-9 Accuracy of random forest and SDAE for encounters across days
5-10 Architecture of the deep learning model (SDAE)
6-1 Positioning of the work and most promising future directions.
A-1 Heatmap of user activity across building categories
B-1 Visualization of join operation between NetFlow and DHCP in Apache Spark.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INTEGRATED NETWORK TRAFFIC-MOBILITY ANALYSIS AND MODELING USING BIG DATA

By

Babak Alipour

May 2019

Chair: Ahmed Helmy
Major: Computer Engineering

Two major factors affecting mobile network performance are mobility and traffic patterns. Simulations, analytical-based performance evaluations and predictive caching schemes rely on models to approximate factors affecting the network. Hence, the understanding of mobility and traffic is imperative to the effective evaluation and efficient design of future mobile networks. Current models target either mobility or traffic, but do not capture their interplay. Besides, many trace-based mobility models have largely used pre-smartphone datasets (e.g., AP-logs) or traces of much coarser granularity (e.g., cell-towers). Moreover, behaviors of users need to be modeled individually, and also in social contexts through pair-wise encounter and collective analysis. This raises questions regarding the relevance of existing models, and motivated us to revisit this area. In this work, we conduct a multidimensional analysis to quantitatively characterize mobility and traffic spatio-temporal patterns, for laptops and smartphones, and across individual, pair-wise and collective planes, leading to a detailed integrated mobility-traffic analysis. Our study is data-driven, as we collect and mine capacious datasets (30TB in size, 300k devices) that capture all of the mobility and traffic dimensions, in time and space, for each device type. The investigation is performed using our systematic (FLAMeS) framework. Overall, dozens of mobility and traffic features have been analyzed. This includes features at the individual level, i.e., how each user behaves, as well as interactions between users through pair-wise and collective analysis.

Furthermore, predictability limits of human behavior are investigated and pragmatic prediction algorithm design is discussed. We provide insights and lessons, to serve as guidelines and a first step towards future integrated mobility-traffic models. We posit that a realistic model should encapsulate mobility, traffic, social context (individual, pair-wise, collective), and interest features of users across time and space, per device type. Finally, we build upon the knowledge of this interplay across dimensions and propose a generative integrated mobility-traffic model, with experimental results showing that it captures mobility and traffic features at the individual level well, while being extensible for expansion into other axes of social context. Our work acts as a stepping-stone towards a richer, more-realistic suite of mobile test scenarios and benchmarks.

CHAPTER 1
INTRODUCTION

The research presented in this work consists of three main components: Data-Driven Integrated Traffic-Mobility Analysis, Predictability Analysis and Prediction Algorithm Design, and Integrated Generative Mobility-Traffic Modeling. We first study (I) How different are mobility and traffic characteristics across device types, time and space? (II) What are the relationships between these characteristics? (III) Should new models be devised to capture these differences? And, if so, how? Both mobility and network usage characterize different aspects of human behavior. In this sense, we have a mobility plane and a (network) traffic plane. In reality, these two planes are likely interdependent. Human mobility may be influenced by network activity; for example, a person slowing down to read incoming messages. Also, network activity may be influenced by mobility and location; stationary users may produce/consume more data than those walking, and people may use different services in different places ([6]). These planes also run through individual user behavior, as well as through the interactions between users through pair-wise and collective behaviors. Next, we study predictability by analyzing spatio-temporal patterns of entropy, and design predictive algorithms using a hybrid of classical algorithms (e.g., Markov Chains) with deep learning approaches (e.g., Recurrent Neural Networks (RNNs)). Then, we investigate the relationship between mobility ’encounters’ and web traffic profiles. Mobility and traffic planes run through individual user behaviors, as well as through the interactions between users through pair-wise and collective behaviors. Thus, this investigation seeks to establish the relationship between mobility pair-wise encounters and web traffic.

Finally, generative models such as Gaussian Mixture Models (GMMs) and Restricted Boltzmann Machines (RBMs) are used to create realistic integrated models spanning mobility, traffic, social context, and interest dimensions, with extensibility and privacy in mind. Given the ubiquity of Wireless LAN traces, these logs present an important data source to glean information on mobility behaviors and patterns of users. In addition, NetFlow traces, which contain large-scale data on network traffic headers, enable analysis of web traffic access patterns. Thus, we rely on merging 30TB of NetFlow data with half a terabyte of WLAN traces to achieve data-driven analysis, testing, design, and evaluation of integrated traffic-mobility models.

The rest of this chapter is organized as follows. Section 1.1 describes the current approaches in modeling mobility and traffic as well as challenges in their integration. Section 1.2 introduces ’predictability’ as a metric and discusses mobility and traffic prediction and predictor design. Section 1.3 presents a compelling case for the necessity of integrated traffic-mobility modeling. Section 1.4 summarizes the contributions of this work. Finally, Section 1.5 provides the organization of this dissertation.

1.1 Data-Driven Traffic and Mobility Analysis

Both mobility and network usage characterize different aspects of human behavior. In this sense, we have a ’mobility plane’ and a ’network traffic plane’. In reality, these two planes are likely interdependent. In this section, we introduce how each of these planes has traditionally been studied, and the challenges in performing integrated traffic-mobility analysis and modeling.

1.1.1 Mobility Analysis

Human mobility has been an area of interest for decades, due to its importance in various research areas. It has been studied extensively, with many models derived with varying degrees of complexity and customizability. The spectrum ranges from simple synthetic mobility models to complex trace-based models, capturing different properties with varying degrees of accuracy [47, 114]. For spatial-temporal patterns, [39] and [105] reveal the regularity and bounds for predicting human mobility using cellular logs, presenting similar findings. This reaffirms the intrinsic properties of human mobility, despite differences in granularity and population across datasets. Another study highlighted the importance of combining different datasets to study various features simultaneously [123]. Another study using WLAN traces [15] revealed surprising patterns in the increase of long-term mobility entropy with age, and the impact of academic majors on students’ long-term mobility entropy. One of the motivations of our work is to analyze the stability of human mobility patterns in our dataset, especially in terms of regularity and prediction accuracy upper bounds, and then to integrate different datasets to correlate mobility and network traffic.

1.1.2 Traffic Analysis

Network traffic has been studied extensively, for fixed networks (e.g., RIPE Atlas¹, Ark², BISmark [109]), and increasingly for wireless networks: for rather stationary users, as in WLANs (e.g., [46, 66]), and for potentially more mobile users, as in cellular networks (e.g., [76, 122]). Such analyses range from metrics such as flow count, sizes, and traffic volume to service usage (e.g., visited websites, backend services). The authors of [85] investigated correlations and characteristics of web domains accessed by users and their locations, based on NetFlow and DHCP logs from a university campus in 2004. They propose a simulation paradigm with data-driven parameters, producing realistic scenarios for simulations. On both WiFi and cellular networks, the authors of [91] performed an in-depth study on smartphone traffic, highlighting the benefits and limitations of using MPTCP. In [83], candidate distributions (e.g., Exp, Weibull, Pareto, Lognormal) were analyzed for the flow inter-arrival time (IAT) and the arrival rate of ’static’ flows at APs. Lognormal was found to best fit the flow sizes, while at small time scales (i.e., hourly), IAT was best described by Weibull, although the parameters vary from hour to hour. We analyze flows from a much larger and newer dataset that includes smartphones, and identify the Lognormal distribution as the best fit for flow sizes, and Beta as the best fit for IAT, regardless of device type. The study in [120] analyzed ISP traces of cellular towers in Shanghai, and mapped timed traffic patterns to urban regions. It provided insight into mobile traffic patterns across time, location, and frequency. This work is complementary to ours, as we provide a much finer-grained analysis of campus WLAN traces.

¹ http://atlas.ripe.net
² http://www.caida.org/projects/ark/
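As a rough illustration of the kind of distribution fitting discussed above, the following minimal sketch fits a Lognormal distribution to flow sizes by maximum likelihood and scores the fit with a Kolmogorov-Smirnov (KS) statistic, using an exponential with the same mean as a baseline. It is not the analysis pipeline of this dissertation: the flow sizes are synthetic, and all function and variable names are hypothetical.

```python
import math
import random
import statistics


def fit_lognormal(samples):
    """Maximum-likelihood fit: mean and (population) std of log(x)."""
    logs = [math.log(x) for x in samples]
    return statistics.fmean(logs), statistics.pstdev(logs)


def lognormal_cdf(x, mu, sigma):
    """CDF of a Lognormal(mu, sigma) distribution at x > 0."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


# Synthetic stand-in for flow sizes in bytes (assumption, not real trace data).
random.seed(0)
flow_sizes = [random.lognormvariate(7.0, 2.0) for _ in range(5000)]

mu, sigma = fit_lognormal(flow_sizes)
ks_lognormal = ks_statistic(flow_sizes, lambda x: lognormal_cdf(x, mu, sigma))

# Baseline: exponential distribution with the same mean as the sample.
rate = 1.0 / statistics.fmean(flow_sizes)
ks_exponential = ks_statistic(flow_sizes, lambda x: 1.0 - math.exp(-rate * x))

print(f"mu={mu:.2f} sigma={sigma:.2f}")
print(f"KS lognormal={ks_lognormal:.4f}  KS exponential={ks_exponential:.4f}")
```

A lower KS value indicates a closer fit. On real traces, the same comparison would presumably be repeated per device type and per candidate family.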

1.1.3 ’FLAMeS’: Framework for Large-scale Analysis of Mobile Societies

It is likely that mobility and traffic planes are interdependent. Human mobility may be influenced by network activity; for example, a person slowing down to read incoming messages. Also, network activity may be influenced by mobility and location; stationary users may produce/consume more data than those walking, and people may use different services in different places [85]. In earlier studies, this interdependence has not been widely considered, and models for both mobility and network traffic planes have been developed and evaluated largely in isolation, as described in the previous sections. For example, when evaluating mobile systems’ performance, traffic generation generally follows regular patterns, drawn from common simple distributions (e.g., exponential or uniform), while assuming that neither transmission nor reception of data impacts mobility. Simply observing people walking while staring at (or reacting to) their smartphones suggests, however, that such interdependencies need to be captured properly. Understanding the mobility-traffic interplay is imperative to the effective evaluation and efficient design of future mobile algorithms ranging from user behavior prediction and caching, to network load estimation and resource allocation. In this work, we take a stab at understanding the interconnection of the mobility and traffic planes. To do this properly, we need to consider the nature of mobile devices people use: one class of devices is intended for stationary use, typically while the user is seated; this primarily holds for laptop computers, dubbed ’cellos’. In contrast, another class, ’on-the-go’ smartphones, which we refer to as ’flutes’, lends itself to truly mobile use³. Analysis of the relation between mobility and traffic is an identified research gap. We focus our analysis on these two classes because they have been around long enough to have extensive datasets to build upon.

³ Throughout, we use ’flutes’ for ’smartphones’, and ’cellos’ for ’laptops’.

Usage and traffic patterns of different device types have been studied from various perspectives ([3, 18, 38, 68, 76, 93]). However, those findings are based on classifications that rely solely on either MAC addresses or HTTP headers. The former is rather limited, and the latter may have serious privacy implications and is often unavailable. In [30], the authors use packet-level traces from 10 phones and application-level monitoring from 33 Android devices to analyze smartphone traffic. Although this allowed fine-grained measurements, the approach is invasive and limited in scalability, leading to small sample sizes and restricted conclusions. They also do not compare the traffic of smartphones with that of “stop-to-use” wireless devices (i.e., cellos), nor do they measure spatial metrics. The study in [21] analyzes 32k users on campus, and focuses on multi-device usage. It notes differences between laptops and smartphones in packets, content, and time of usage. That work targets device usage patterns and security, while we study mobility and wireless traffic correlations. In our method, the combination of MAC and NetFlow allowed us to classify the majority of observed devices while preserving users’ privacy. We posit that the interconnection of mobility and traffic is modulated by the device(s) a mobile user is carrying. Therefore, we follow two main lines of investigation: we develop a framework to differentiate between cellos and flutes, and study both the mobility and traffic patterns for each of those types. Specifically, we quantitatively investigate the following questions in depth: (I) How different are mobility and traffic characteristics across device types, time and space? (II) What are the relationships between these characteristics? (III) Should new models be devised to capture these differences? And, if so, how? To answer these questions, a multi-dimensional (comparative) analysis approach is adopted to investigate mobility and traffic spatio-temporal patterns for flutes and cellos. We drive our study with capacious datasets (30TB+) that capture all the above dimensions in a campus society, including over 300k devices.
We set out to understand and ’quantify’ the ’gaps between flutes and cellos’, and the ’interaction between the mobility and traffic dimensions’. To methodically analyze statistical characteristics and correlations in multiple dimensions, a systematic Framework for Large-scale Analysis of Mobile Societies (’FLAMeS’) is devised, as depicted in Fig. 1-1. It is used to analyze multi-sourced, multi-dimensional big data for ’flutes’ and ’cellos’, including WLAN and NetFlow traces sourced from the University of Florida’s campus. The main components include: I. Data collection and pre-processing, II. Flutes vs. cellos mobility and traffic analysis, and III. Integrated mobility-traffic analysis.

[Diagram with blocks labeled I.a, I.b, I.c, II, and III]

Figure 1-1. ’FLAMeS’ system overview.

1.2 Predictability Analysis and Prediction Algorithm Design

Prediction techniques constitute fundamental mechanistic building blocks for many mobile protocols and applications, ranging from resource allocation to caching and recommender systems, among others. The study of predictability and behavioral regularity in the mobile society has been the focus of several studies.

However, the previously reported findings on the very high predictability of human mobility were drawn from analyses that had several limitations. First, the spatio-temporal scale at which users’ locations were observed (location within mobile network cell tower coverage) renders the prediction results impractical for most mobile applications that depend on highly accurate user localization. Second, the theoretical limits of predictability were derived based on an entropy estimator that was unable to capture the variability of repeated sub-sequences of visited locations, and that therefore underestimated entropy (and overestimated predictability). We revisit the study of mobile user predictability, and try to address the aforementioned limitations by exploring ’maximum predictability’ and ’pragmatic predictors’ using WiFi traces that have a much higher spatio-temporal granularity than the call data records used in most previous studies on predictability.

1.3 Integrated ’Generative’ Traffic-Mobility Modeling

Mobile network performance is heavily impacted by ’mobility’ and ’traffic’ patterns. Models are applied extensively in simulations, analytical-based performance evaluations and benchmarks, to approximate factors affecting the network. There is a vast body of research on mobility and traffic analysis and modeling, carried out independently. Some of the most advanced models of human mobility provide mathematical frameworks with tunable parameters to generate a variety of mobility scenarios. The previous works suffer from one or more of the following limitations: 1) Strong assumptions and lack of validation on real-world big data. 2) Use of coarse-grained CDR logs. 3) Lack of integration with traffic features. 4) Missing predictability and user interest aspects. In this work, after quantifying the differences between ’flutes’ and ’cellos’ across a variety of spatio-temporal mobility and traffic metrics, we establish the necessity of modeling the device type, in addition to spatio-temporal mobility and traffic features, at the individual, pair-wise and group levels. We also analyze predictability and use it as a feature to define user behavior. Finally, we attempt to put all of these features together, to create an integrated generative traffic-mobility model.
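To make the notion of predictability as a per-user feature more concrete, the following minimal sketch computes an entropy value from a sequence of visited locations and converts it into an upper bound on prediction accuracy by numerically solving Fano's inequality, S = H(Π) + (1 − Π) log2(N − 1), for Π. It is an illustration under stated assumptions, not the estimator used in this dissertation: the entropy here is the simple Shannon entropy of visit frequencies (which ignores visit order), and the location trace and all function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(locations):
    """Entropy (bits) of the visit-frequency distribution; ignores visit order."""
    counts = Counter(locations)
    n = len(locations)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fano_entropy(p, n_locations):
    """Entropy implied by predictability p over n_locations (Fano's relation)."""
    binary = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:
            binary -= q * math.log2(q)
    return binary + (1.0 - p) * math.log2(n_locations - 1)


def max_predictability(entropy, n_locations, tol=1e-9):
    """Solve fano_entropy(p) = entropy for p in [1/N, 1] by bisection."""
    if n_locations < 2:
        return 1.0
    lo, hi = 1.0 / n_locations, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # fano_entropy decreases with p on [1/N, 1]: too much entropy means p is too small.
        if fano_entropy(mid, n_locations) > entropy:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical sequence of visited locations for one device.
trace = ["lib", "lib", "dorm", "lib", "cafe", "lib", "dorm", "lib"]
entropy = shannon_entropy(trace)
n_locations = len(set(trace))
pi_max = max_predictability(entropy, n_locations)
print(f"entropy={entropy:.3f} bits, max predictability={pi_max:.3f}")
```

Because the frequency-based entropy ignores temporal order, it is never lower than an order-aware estimate for the same trace, so the bound it yields is comparatively loose.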

1.4 Research Contributions

Our main contributions include:

1. ’Integrated Mobility-Traffic Analyses’ (Chapter 3): This work is the first to quantify the correlations of numerous features of mobility and traffic simultaneously. This can identify gaps in existing mobile networking models, and reopen the door for future impactful work in this area.

2. ’Flutes vs. Cellos Analysis’ (Sec. 3.2 and 3.3): The device type classification presented here is an important dimension to understand. This is particularly relevant as new generations of portable devices are introduced that differ from the laptops traditionally considered in earlier studies.

3. ’Systematic Multi-dimensional Investigation Framework’ (Fig. 1-1): ’FLAMeS’ provides the scaffolding needed to process, in multiple dimensions, many features of large sets of measurements from wireless networks, including AP-logs and NetFlow traces. This systematic method can apply to other datasets in future studies.

4. ’Predictability Analysis and Prediction Algorithm Design’ (Chapter 4): We use highly granular datasets to measure maximum predictability, and compare the theoretical upper bounds with pragmatic predictors based on deep learning techniques such as Long Short-Term Memory (LSTM).

5. ’Analysis of Mobile Encounters and Web Traffic Patterns’ (Chapter 5): Mobility and network traffic have been traditionally studied separately and at an ’individual’ level. The interactions and patterns at ’encounter’ and ’group’ levels are vital factors for future mobile services and encounter-based services, but have not been studied in depth with real-world big data. In Chapter 5, we characterize mobility encounters and study the correlation between encounters and web traffic profiles using large-scale datasets of WiFi and NetFlow traces.

6. ’Integrated Mobility-Traffic Generative Modeling’ (Section 3.5): Models are utilized extensively in simulations, performance evaluations and benchmarks. With the analyses done in previous chapters, the dimensions of such models were investigated and their correlations analyzed. This part of the study is the culmination of all of those analyses, with the goal of creating an integrated generative mobility-traffic model.

1.5 Dissertation Organization

The rest of this dissertation is organized as follows. Chapter 2 introduces the materials and methods used in this study, including the datasets, tools, and brief machine learning background information. Chapter 3 presents the analysis of mobility and traffic dimensions for flutes and cellos, and then provides insight for an integrated mobility-traffic model that captures the correlations of mobility and traffic features across device types, time and space. It also outlines a generative, integrated mobility-traffic model based on prior analyses. In Chapter 4, work on the predictability of human behaviors is presented, theoretical limits of predictability are investigated, and a new prediction algorithm is proposed that combines Markov Chain (MC) based predictors with recurrent neural networks. Chapter 5 characterizes mobility encounters and studies the correlation between encounters and web traffic profiles using the large-scale datasets of WiFi and NetFlow traces introduced in Chapter 2. It also introduces a deep learning approach that learns to classify whether the devices behind a pair of web traffic profiles have encountered each other. Finally, Chapter 6 puts forward potential future research directions that have been identified throughout this work.

CHAPTER 2
MATERIALS AND METHODS

This chapter provides background information on the materials and methods used in this study, including the multi-sourced datasets, a wide variety of analytical and big data tools, the experimental setup, as well as a brief introduction to several machine learning algorithms explored in this work. The goal is to describe the raw, interim and preprocessed inputs for the rest of this document (i.e., Sections I.a, I.b and I.c of the FLAMeS framework depicted in Fig. 1-1), with reproducibility and completeness in mind.

2.1 Input Datasets

We drive our FLAMeS framework with large-scale datasets from multiple sources, capturing the mobility and traffic features in different dimensions. In this section, we introduce the two major datasets and their preprocessing, and present the device type classification into flutes and cellos. This is a necessary step to enable analysis of numerous metrics for mobility and traffic over such large-scale datasets. The input datasets are specifically chosen to capture: 1. location, mobility and network traffic information, 2. smartphone and laptop devices, 3. spatio-temporal features, and 4. scale in the number of devices and records. The total size is >30TB, consisting of two main parts: WLAN Access Point (AP) logs and NetFlow records (details in Tables 2-1, 2-2, and 2-3).¹

Table 2-1. Summary of datasets. B=billion.

          Record count        Traffic Vol. (TB)   MAC count
          DHCP      CORE      TCP       UDP       WLAN      CORE
Flutes    412.0 M   2.13 B    56.18     4.50      186.0 K   50.3 K
Cellos    101.0 M   4.20 B    73.85     12.90     93.2 K    27.1 K
Total²    557.5 M   6.53 B    134.39    17.61     316.0 K   80.0 K

¹ Data was collected using proper procedures. It does not contain personally identifiable information (PII).

Table 2-2. NetFlow example records.

Start time      Finish time     Duration  Src. IP        Dst. IP        Protocol  Src. port  Dst. port  Packets  Size
1334332274.912  1334332276.576  1.664     173.194.37.7   10.15.225.126  TCP       80         60482      157      217708
1334332281.440  1334332282.912  1.472     10.15.133.170  74.125.229.58  TCP       2068       80         6        1484

Table 2-3. AP logs/DHCP example records.

User IP        User MAC           AP name         AP MAC             Lease begin time  Lease end time
10.130.90.3    00:11:22:33:44:55  b422r143-win-1  00:1d:e5:8f:1b:30  1333238737        1333238741
10.132.190.68  00:22:11:44:11:66  b416c299-win-1  00:17:59:5a:0e:30  1333239804        1333239818
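The records in Tables 2-2 and 2-3 share an IP address and a time axis, which is what makes it possible to attribute a flow to a device and an AP. As a minimal single-machine sketch of such a merge (not the large-scale join used in this work, which the list of figures indicates was done in Apache Spark), each flow can be matched to the DHCP lease that held one of its IP addresses at the flow's start time. The field names, the sample records, and the assumption that leases for the same IP do not overlap are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_lease_index(leases):
    """Map each IP to (sorted lease begin times, leases sorted by begin time)."""
    by_ip = defaultdict(list)
    for lease in leases:
        by_ip[lease["ip"]].append(lease)
    index = {}
    for ip, entries in by_ip.items():
        entries.sort(key=lambda lease: lease["begin"])
        index[ip] = ([lease["begin"] for lease in entries], entries)
    return index


def find_lease(index, ip, timestamp):
    """Return the lease that held `ip` at `timestamp`, or None."""
    if ip not in index:
        return None
    begins, entries = index[ip]
    pos = bisect_right(begins, timestamp) - 1  # last lease starting at or before timestamp
    if pos >= 0 and timestamp <= entries[pos]["end"]:
        return entries[pos]
    return None


def attach_device(index, flow):
    """Add the device MAC and AP name to a flow, trying both endpoints."""
    for side in ("src_ip", "dst_ip"):
        lease = find_lease(index, flow[side], flow["start"])
        if lease is not None:
            return {**flow, "mac": lease["mac"], "ap": lease["ap"]}
    return None  # no lease covers this flow


# Hypothetical records, loosely shaped like Tables 2-2 and 2-3.
leases = [
    {"ip": "10.0.0.5", "mac": "00:11:22:33:44:55", "ap": "b001r100-win-1", "begin": 1000, "end": 2000},
    {"ip": "10.0.0.5", "mac": "00:22:11:44:11:66", "ap": "b002r200-win-1", "begin": 3000, "end": 4000},
]
flows = [
    {"start": 1500.0, "src_ip": "10.0.0.5", "dst_ip": "203.0.113.9", "size": 1484},
    {"start": 3500.0, "src_ip": "203.0.113.9", "dst_ip": "10.0.0.5", "size": 217708},
    {"start": 2500.0, "src_ip": "10.0.0.5", "dst_ip": "203.0.113.9", "size": 512},
]

index = build_lease_index(leases)
merged = [attach_device(index, flow) for flow in flows]
for row in merged:
    print(row)
```

The same IP address maps to different devices at different times, which is why the lookup is keyed on both the address and the timestamp rather than on the address alone.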

2.1.1 WLAN AP Logs

These logs were collected from 1760 APs in 138 buildings over 479 days on a university campus, and contain association and authentication events from 316k devices in 2011-2012. They contain over 555M records, each including the device’s MAC and assigned IP addresses, the associated AP and a timestamp. Locations of the APs are approximated by the locations of the buildings where they are installed, i.e., the (longitude, latitude) obtained from the Google Maps API. To validate this, we fetched 8000 mapped APs around the campus area from a crowd-sourced service, wigle.net. For the 130 matched APs (7.6% of total) in 42% of buildings (i.e., 58 bldgs), all were less than 200m from their mapped location; an error of less than 1.5% of the campus area. This is a reasonable margin of error for our research purposes, and acceptable when considering the maximum AP coverage range, the inaccuracy of coarse-grained localization services, and the fact that we use the coordinates of the center of each building whereas users may see an AP on the edge of a building. These access points are installed in a wide variety of buildings, including housing, classrooms, computer laboratories, libraries, offices, administrative buildings, and restaurants.

2.1.2 NetFlow Logs

Over 76 billion records of NetFlow traces were collected from the same network, over 25 days in April 2012. A flow is defined as a consecutive sequence of packets with the same transport protocol, source/destination IP and port number, as identified by the collecting gateway router. An example of the major NetFlow data fields is presented in Table 2-2. The NetFlow records are matched with the wireless associations (from the AP logs) using the dynamic MAC-to-IP address mapping from the DHCP logs. We refer to the result as the CORE dataset (Table 2-1). The records are also augmented with location and website information using reverse DNS (rDNS). Dataset merging and system details for this part of the work are described in the following sections.

2.2 DHCP and Merging Datasets

In order to study network traffic across devices and APs, it is necessary to match the NetFlow records with wireless associations (from the WLAN dataset). This task requires the MAC-IP mapping. The IP addresses are dynamically assigned using DHCP, but DHCP session logs were not directly available and had to be derived. We define the duration of a DHCP lease as the time between two consecutive associations of a device with any AP; i.e., when a device connects to AP1, a session starts, and once the user device connects to AP2, the first session ends and a new one starts. Fig. 2-1 illustrates the associations of a sample device with different APs at different times. The first session would have the IP given by AP1 and a lease time t2 − t1, and so on (a total of 5 sessions in this example). The last association is discarded as we do not know the duration of that IP assignment. Combining these derived DHCP records with the Location Information and Device Type Classification, we create the DHCP table.

Figure 2-1. Wireless association for a device at different times.
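The lease derivation described above, together with the containment rule used when attaching flows to leases (same IP, flow lifetime entirely within the lease), can be illustrated with a minimal Python sketch. This is not the Spark/Hive pipeline used in this study; the function names, the tuple layouts and the choice of inclusive interval boundaries are assumptions made only for illustration.

```python
from collections import defaultdict

def derive_leases(associations):
    """Derive DHCP-like leases from association events.

    associations: iterable of (mac, ip, ap, timestamp) tuples (hypothetical layout).
    Each association opens a lease that ends at the same device's next
    association; the last association of a device is dropped because its
    end time is unknown.
    """
    by_mac = defaultdict(list)
    for mac, ip, ap, ts in associations:
        by_mac[mac].append((ts, ip, ap))
    leases = []
    for mac, events in by_mac.items():
        events.sort()  # order each device's events by timestamp
        for (t0, ip, ap), (t1, _, _) in zip(events, events[1:]):
            leases.append((mac, ip, ap, t0, t1))
    return leases

def match_flow(flow, leases):
    """Return the first lease with the flow's IP that contains its lifetime.

    flow: (client_ip, start, end). Boundaries are treated as inclusive,
    which is an assumption of this sketch.
    """
    ip, start, end = flow
    for lease in leases:
        _, lease_ip, _, t0, t1 = lease
        if lease_ip == ip and t0 <= start and end <= t1:
            return lease
    return None
```

A linear scan over leases is used here for readability; at the scale of the traces described in this chapter, an interval join in a distributed engine would be needed instead.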

The derived DHCP and NetFlow datasets were then merged to form what we refer to as the Core dataset for our study. The identifiers shared between the two are the clients’ IPs together with the start and end times of flows, hence the need for a DHCP-like set. For a DHCP lease session LS, all flows whose IP address is the same as that of the lease and whose entire lifetime falls within the lease duration are associated with LS.

Given these traces, cellular usage cannot be analyzed. However, this does not significantly impact the analysis, for two reasons: 1) The traces already capture a very large user base, with tens of thousands of active devices. This raises confidence in our analysis of a real-world WLAN. 2) The WiFi campus coverage is ubiquitous, with 1760 APs installed in the vast majority of populated areas. Also, most laptops on campus lack cellular connectivity, and many smartphones use WiFi for their data to avoid cellular data costs.

2.2.1 Device Type Classification

To classify devices into flutes and cellos, we utilize several observations and heuristics. To start, note that a device manufacturer (with OUI) can be identified based on the first 3 octets of the MAC address3. Most manufacturers produce one type of device (either laptop or phone), but some produce both (e.g., Apple). In the latter case, an OUI used for one device type is not used for another. We conducted a survey to help classify 30 MAC prefixes accurately. Using OUI and survey information, we identify and label 46% of the total devices (90k cellos and 56k flutes). Then, from the NetFlow logs of these labeled devices, we observe over 3k devices (92% of which are flutes) contacting admob.com, an ad platform serving mainly smartphones and tablets (i.e., flutes). This enables further classification of the remaining MAC addresses. Finally, we apply the following heuristic to the dataset: (1) obtain all OUIs (MAC prefixes) that contacted admob.com; (2) if an OUI is unlabeled, mark it as a flute. Overall, over 270k devices were labeled (180k as flutes), covering 86% of the devices in AP logs and 97% in NetFlow traces, a reasonable coverage for our purposes. Out of ≈80k devices in the NetFlow logs, ≈50k are flutes and ≈27k are cellos.

Fig. 2-2 shows the temporal plot for the combined traces over 25 days, after device classification. Throughout, the number of flows and the total traffic volume are clearly higher for cellos, even with an overall higher number of flutes connected. This larger number of flutes is also reflected in up to 600 more active APs communicating with flutes (during early morning hours). Also, note that device activity follows diurnal and weekly cycles, with the peaks occurring during weekdays, as expected. Wed, 25th, was the last day of classes, explaining the decline in network activity afterwards. This plot motivated our analyses for flutes vs cellos, over weekends vs weekdays.

3 MAC address randomization does not affect our association trace since it became common practice after our traces were collected.

Figure 2-2. Time series for 25 days of combined AP-NetFlow Core traces

2.2.2 Computing System

The size of the datasets is ≈30TB in raw text format, mostly consisting of NetFlow data, with ≈0.5TB for AP logs. Managing and mining datasets of this scale posed several challenges and required thorough preparation, as well as machines with ample compute and memory resources. We explored several techniques and pipelines for extraction, transformation, loading (ETL) and querying of big data, and chose tools from the Apache Hadoop ecosystem. We use Hive as our data warehouse (tables stored in Parquet format). Apache Spark is the compute engine for data processing and analysis tasks. Computation runs on two nodes, each with 64 cores and ≈0.5TB of memory. Further discussion of the system and comparison to others is out of the scope of this document.

2.3 Machine Learning Application and Background

This section provides a brief overview of machine learning (ML) and deep learning (DL) fundamentals and concepts that are essential to the discussions and comparisons of various methods in the rest of this study. The goal is to introduce important concepts that help explain the decisions made in terms of ML algorithms, methods and tools, which will lead to discussions on speed, efficiency and scalability.

The amount of data generated has been increasing rapidly, and it has become evident that automated data analysis is crucial to make sense of these massive data pools. Machine learning enables computers to learn functions and patterns from data without having been programmed explicitly; those patterns can then be used to predict future outcomes. Machine learning is typically utilized when explicit algorithms are not feasible and has been widely adopted in various fields, from automatic speech recognition and translation to data center optimization [36]. According to the type of problem, available data and the context, machine learning methods are broadly put into three groups [88]: supervised learning, unsupervised learning and reinforcement learning. An introduction to each of these categories follows.

2.3.1 Supervised Learning

In supervised learning, the goal is to learn a function approximation, a mapping, from inputs to outputs, given a set of labeled data. The input typically consists of D-dimensional vectors of features or attributes, and the output is either a real number or a category from a finite set of categorical class labels. Labeled data consists of data points whose input and output are known; this is also called the training data. When the output is provided by direct observation with very high confidence, it is referred to as ground or base truth. Erroneous output labels in the training data directly impact the usefulness of the learned mappings, as the supervised learning algorithm merely optimizes its predicted output to be as close as possible to the ground truth, according to its loss function; it has no means to determine the real-world usefulness of the results. It is up to the researchers to provide reasonable ground truth, which is indeed one of the major resource-intensive challenges in supervised learning. Depending on whether the output is categorical or real-valued, the problem is called classification or regression, respectively. The following algorithms have been used in multiple studies for traffic classification and are the most popular supervised learning algorithms in the community.

2.3.1.1 Classification

Naïve Bayes (NB). This model is a probabilistic classifier based on Bayes’ theorem. Under the assumption that the features are conditionally independent given the class, the probability of an observation with multiple features given a class can be written as the product of the probabilities of the individual (one-dimensional) features of that observation given that class, each of which can be easily estimated from the data. This method is optimal if the independence assumption is true. However, the strong conditional independence assumption is the reason for the word “Naïve”, because most real-world datasets are not expected to show such properties [88]. Nonetheless, the authors in [24] show that the Naïve Bayes classifier achieves high classification accuracy on many real-world datasets. They chose 28 datasets from the UCI machine learning repository [71] and obtained an average accuracy of 79%. Owing to its simplicity, the Naïve Bayes algorithm runs very fast compared to many other machine learning algorithms [87].
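To make the product-of-per-feature-probabilities idea concrete, the following is a minimal sketch of a categorical Naïve Bayes classifier in plain Python. It is an illustration only and not the implementation used in the cited studies; the function names are hypothetical, and the add-one (Laplace) smoothing and the use of log-probabilities are choices made here to keep the example numerically well behaved.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """rows: list of tuples of categorical feature values; labels: class per row."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)           # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(y, j)][v] += 1
            values[j].add(v)
    return class_counts, feat_counts, values

def predict_nb(model, row):
    """Return the class maximizing log P(class) + sum_j log P(feature_j | class)."""
    class_counts, feat_counts, values = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for j, v in enumerate(row):
            # Add-one smoothing avoids zero probabilities (an assumption of this sketch).
            score += math.log((feat_counts[(c, j)][v] + 1) / (nc + len(values[j])))
        if score > best_score:
            best, best_score = c, score
    return best
```

Training reduces to counting, which is why the method is fast: each per-feature conditional probability is estimated independently of the others.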

k-Nearest Neighbors (k-NN). This algorithm is a non-parametric learner, which means it does not have a fixed number of parameters; its effective complexity grows with the amount of training data. The algorithm looks at some number of nearest labeled neighbors of a new data point and typically outputs the majority class label. The parameter k determines how many neighbors are to be examined, and a metric such as the Euclidean distance establishes the notion of nearest neighbors [88]. Since there is no actual training involved, this algorithm is categorized as a “lazy” learner, which means the computation happens when a new unlabeled data point is made available and a label prediction is required. There have been numerous research studies on making k-NN run efficiently, such as [98], or in a distributed environment, such as [74].
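A minimal sketch of the majority-vote rule follows, assuming numeric feature vectors and the Euclidean distance; the function name and the default k are illustrative choices, not taken from the cited works.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # All work happens at prediction time ("lazy" learning): compute every
    # distance, sort, and vote among the k closest labeled points.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

The brute-force scan costs one distance computation per training point for every query, which is the cost that the efficiency-oriented and distributed variants mentioned above aim to reduce.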

Support Vector Machines (SVM). This class of supervised learners typically projects the data points into a higher dimensional space and then attempts to find a hyperplane or a set of hyperplanes that best separate the target classes. That means finding a hyperplane whose distance from the nearest data points is maximum. Those nearest data points in the training set are the support vectors [110, 115]. A more in-depth discussion of the theory behind this model is beyond the scope of this document and can be found in [88] and [115].

Decision Trees (DT). Given a training set of data points with known labels, the goal is to find classification rules that can predict the label of any data point from the values of its features. Such classification rules can be expressed as a decision tree. The internal nodes of a decision tree consist of rules whose outcomes create the branches. The leaves in a decision tree represent the class labels. Training a decision tree on the input data involves creating various possible decision trees on the input and choosing the simplest one that most often correctly classifies the training data [95]. A popular algorithm for supervised learning using decision trees, the C4.5 algorithm, is described in detail in [96]. An open source Java implementation of this algorithm, J48, is available in the WEKA software suite [43] and has been used in numerous studies [7, 33, 89].

Multilayer Perceptron (MLP). Neural networks have always been of great interest among researchers aspiring to mimic the brain. A specific type of neural network is the multilayer perceptron, which is a feed-forward artificial neural network. The neurons in this type of network form a directed acyclic graph, hence the name feed-forward. Each non-input node in the graph represents a neuron that has an activation function; this function defines the output of the neuron given its set of inputs. The authors in [69] show that these networks can approximate any continuous function if and only if the activation function is not a polynomial. The researchers in [9] were among the earliest adopters of MLP for Internet traffic classification.

Meta-algorithms. This category consists of algorithms that by definition are not machine learning algorithms themselves but rely on combining various machine learning techniques into a more accurate learner. A well-known meta-algorithm is AdaBoost [34]. This algorithm makes no assumptions about the performance of the weak classifiers as long as they are slightly better than random in predicting the true label. AdaBoost, typically using a decision tree, attempts to reduce error by iteratively updating weights for input data depending on the results of the previous iteration. [7] applies AdaBoost for traffic characterization.

2.3.1.2 Sequence prediction

Markov Chain-based predictor. A Markov chain (MC) with a discrete state space has been applied for user mobility sequence prediction [75, 107]. In an order-k Markov predictor, the state space consists of tuples of k location (e.g., AP) names, where the next location prediction depends solely on the most recent preceding k-tuple. We build the model on the data so that observed k-tuples comprise the states. The transition probabilities are learned based on the frequency of appearances of such a transition in observations. The probability of a transition from the current state $S = X_i X_{i+1} \ldots X_j$ to $X_{i+1} X_{i+2} \ldots X_j X_{j+1}$, where $j - i + 1 = k$ (so that $S$ is a k-tuple) and each $X_i$ is the symbol for a location, is represented as $P(X_{j+1} = c \mid S = X_i X_{i+1} \ldots X_j)$ for all $c$ observed in the data, and is learned based on the frequency with which such a sequence reappears. If an MC O(k) encounters a sequence that it has never seen before, it falls back to MC O(k-1) recursively. The base case is O(0), which is simply the frequency distribution of all symbols observed so far. We compare the accuracy of Markov chains of varying orders with the theoretical predictability and recurrent neural networks in Chapter 4.
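The order-k predictor with recursive fallback can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation evaluated in Chapter 4: the class and method names are hypothetical, the model is fit on a single symbol sequence, and ties between equally frequent next symbols are broken arbitrarily.

```python
from collections import Counter, defaultdict

class MarkovPredictor:
    """Order-k Markov predictor that falls back to lower orders down to O(0)."""

    def __init__(self, order):
        self.order = order
        # counts[k][context] counts the symbols observed after a length-k context.
        self.counts = [defaultdict(Counter) for _ in range(order + 1)]

    def fit(self, sequence):
        for k in range(self.order + 1):
            for i in range(k, len(sequence)):
                context = tuple(sequence[i - k:i])
                self.counts[k][context][sequence[i]] += 1
        return self

    def predict(self, history):
        # Try the longest usable context first, then shorter ones; the empty
        # context (order 0) is the overall frequency distribution of symbols.
        for k in range(min(self.order, len(history)), -1, -1):
            context = tuple(history[len(history) - k:])
            next_counts = self.counts[k].get(context)
            if next_counts:
                return next_counts.most_common(1)[0][0]
        return None  # nothing has been observed yet
```

For example, after fitting on the location sequence A B C A B C A B D B with order 2, the context (A, B) predicts C, an unseen context ending in C falls back to order 1 and predicts A, and a fully unseen context falls back to order 0 and predicts the most frequent symbol, B.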

Recurrent Neural Networks (RNN). A more recent approach to sequence prediction, used in Chapter 4, is using deep recurrent neural networks (RNNs). Recurrent neural networks have loops within their cells, allowing information to persist and thus enabling the network to use previous information to make a reasonable prediction of the future. Certain types of RNNs are capable of learning long-term dependencies. These networks are essentially supervised neural networks that are trained to predict the next symbol in a sequence. There are multiple variants of RNNs, including Long Short-Term Memory (LSTM) [52] and Gated Recurrent Unit (GRU) [20]. These networks can learn dynamic temporal patterns and have successfully been applied in speech recognition and text-to-speech engines [101]. In this work, we use a multi-layer LSTM to predict movements of users based on input tuples similar to those used for the Markov chain-based predictors in Chapter 4. RNNs are computationally expensive and require hyper-parameter tuning. Thus, the deep model is run only on a sample of users in this study. We also introduce a hybrid model of Markov chains (MC) and RNN. The hybrid model uses the output of the RNN only when the RNN is very confident (i.e., over 90% probability associated with the next symbol); otherwise it falls back to the MC introduced before.
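The decision rule of the hybrid model can be written in a few lines. The sketch below assumes the RNN output is available as a mapping from candidate next symbols to probabilities and that the Markov chain prediction has already been computed; the function name, argument layout and example AP names are hypothetical.

```python
def hybrid_predict(rnn_probs, mc_prediction, threshold=0.9):
    """Use the RNN's top symbol only when its probability exceeds threshold.

    rnn_probs: dict mapping candidate next symbol -> probability (assumed format).
    mc_prediction: the Markov chain predictor's output for the same step.
    """
    if rnn_probs:
        symbol, prob = max(rnn_probs.items(), key=lambda kv: kv[1])
        if prob > threshold:
            return symbol
    return mc_prediction  # fall back to the Markov chain otherwise

# Usage: a confident RNN wins, an unsure one defers to the Markov chain.
confident = hybrid_predict({"AP1": 0.95, "AP2": 0.05}, "AP2")
unsure = hybrid_predict({"AP1": 0.60, "AP2": 0.40}, "AP2")
```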

1D Convolutional Neural Networks (CNN). Another type of neural network, the convolutional neural network (CNN), learns convolutional filters to extract latent information across the data (i.e., 1D CNNs learn different temporal locality patterns) and uses that information for predicting the next location. This is in contrast to the 2D convolution filters that are typically used in image recognition tasks, where the filters capture spatial information. 1D CNNs have successfully been used in natural language processing for modeling and classification of sentences [61, 64]. In our study in Chapter 4, in addition to the multi-layer LSTM, we also use a 1D CNN to predict movements of users based on input tuples similar to those used for MC-based predictors. Convolutional neural networks tend to utilize GPU resources better, and thus were faster in practice than the recurrent neural networks in our tests.

2.3.2 Unsupervised Learning

In contrast to supervised learning, unsupervised learning is given only the inputs, and the goal is to find interesting patterns; that is why it is also sometimes called knowledge discovery. The most common unsupervised learning method is clustering. There are generally two types of clustering: hard clustering algorithms such as k-Means assign each data point to exactly one cluster, whereas soft clustering algorithms such as the expectation maximization (EM) algorithm [22] work by assigning probabilities to data points belonging to different clusters [81]. K-means is sometimes called a hard EM algorithm [88]. As [88] explains, due to the inherent ambiguity of pattern mining, there is no obvious error metric to use; various metrics are employed in different algorithms. These cluster validity measures are typically divided into two major groups [42]:

• External measures: These measures rely on the availability of external class labels to validate clusters. For instance, for each cluster, one can calculate the ratio of data points within the cluster with a specific class label and then repeat the same procedure to find the maximum ratio over all class labels. That is the definition of purity for the cluster. The average of purity for all clusters can be used as a measure for the purity of the clustering algorithm, with higher purity meaning more homogeneous clustering. These ratios calculated as described above can also be interpreted as probabilities of various classes within a cluster and then used to calculate the entropy of each cluster. Taking the average of these entropy values, one can compute the entropy of the clustering algorithm, with lower entropy showing higher homogeneity.

• Internal measures: These measures attempt to quantify how good the clusters are without class labels. An example of this measure is the sum of squared errors (SSE), which is calculated as the sum of squared distances from the cluster mean for each data point. The distance used is usually the Euclidean distance or the Manhattan distance. Another popular cluster validation method is the silhouette coefficient [97]. This measure calculates how well each data point belongs to its own cluster and how well it is separated from other clusters. Since the definition is per data point, the silhouette coefficients of all points within a cluster are typically averaged to get a measure per cluster, or these values are averaged over all clusters to get an idea of the clustering algorithm’s performance.

A brief introduction of clustering methods that have been used in traffic classification follows. Almost all of the studies that we discuss use labeled data and rely on some form of external validity measure for their choice of clustering method.
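As an illustration of the external measures defined above, the following sketch computes the average purity and the average entropy of a clustering from cluster assignments and class labels. The function names are hypothetical, and the unweighted average over clusters follows the description in the text; weighting each cluster by its size is a common alternative that is not shown here.

```python
import math
from collections import Counter, defaultdict

def cluster_label_counts(cluster_ids, class_labels):
    """Count class labels inside each cluster."""
    counts = defaultdict(Counter)
    for c, y in zip(cluster_ids, class_labels):
        counts[c][y] += 1
    return counts

def average_purity(cluster_ids, class_labels):
    """Mean over clusters of the largest class ratio (higher is more homogeneous)."""
    counts = cluster_label_counts(cluster_ids, class_labels)
    purities = [max(cnt.values()) / sum(cnt.values()) for cnt in counts.values()]
    return sum(purities) / len(purities)

def average_entropy(cluster_ids, class_labels):
    """Mean over clusters of the class-ratio entropy in bits (lower is more homogeneous)."""
    counts = cluster_label_counts(cluster_ids, class_labels)
    entropies = []
    for cnt in counts.values():
        n = sum(cnt.values())
        entropies.append(-sum((v / n) * math.log2(v / n) for v in cnt.values()))
    return sum(entropies) / len(entropies)
```

For instance, with two clusters where the first holds one point each of two classes and the second holds two points of a single class, the per-cluster purities are 0.5 and 1.0 and the per-cluster entropies are 1 bit and 0 bits.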

2.3.2.1 Clustering

Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This clustering method is density-based, which means that clusters are defined to be regions of higher density in the input space. DBSCAN finds clusters of points that are close to each other and marks those that do not have many data points nearby as outliers. The algorithm takes two parameters, ε and N. The algorithm starts with some random, as yet unvisited, data point and looks for points that are less than ε distance away from it; if the number of such points exceeds N, a new cluster is started from there [29].

Expectation Maximization (EM). The goal of expectation maximization is to find the most probable clusters based on the training data and prior expectations. Assuming there exist groupings of data points within the entire dataset so that each group fits a certain distribution with different parameters, the data set can be modeled as a mixture of these distributions. The probability distribution of choice is usually the Gaussian distribution. Finding the optimal maximum likelihood estimate is NP-hard [88]. But the EM algorithm is a good heuristic that iteratively attempts to find the maximum likelihood estimates of the parameters for these distributions [81, 88]. Another issue is finding the number of groupings in the data, which is typically done empirically. The authors in [81] apply the EM algorithm for clustering different traffic profiles.

K-Means. K-means, sometimes called hard EM [88], assigns each data point to one cluster. This problem is also NP-hard, but a heuristic algorithm that iteratively refines the assignment exists; it is sometimes referred to as Lloyd’s algorithm [73]. In the simplest approach, one starts with K random points in the input space, called the centroids, and then assigns each data point to the nearest centroid, creating K clusters. Before the start of the next iteration, K new centroids are calculated by taking the mean of each cluster. This process is typically repeated until the changes to the centroids become smaller than a threshold. Examples of studies applying K-means to traffic characterization include [27, 28].
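The iterative refinement described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the cited studies; the function name, the default parameters, and two simplifications are assumptions of the sketch: the initial centroids are sampled from the data points rather than from the whole input space, and an empty cluster keeps its previous centroid.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd-style K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialize from data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep centroid of an empty cluster
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stop once centroids move less than the threshold
            break
    return centroids, clusters
```

Because the result depends on the initial centroids, practical use typically involves several restarts with different seeds and keeping the run with the lowest SSE.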

2.3.2.2 Generative

Gaussian Mixture Model (GMM). Mixture models allow probabilistic representation of sub-populations within an overall population. In a Gaussian mixture model, all data points are assumed to be generated from a mixture of Gaussian distributions. These models rely on Expectation Maximization (EM), introduced above, for finding the best fit of the Gaussian mixture. This approach is efficient and fast, and gives reasonable results on our datasets.

Restricted Boltzmann Machine (RBM). A restricted Boltzmann machine is a type of artificial neural network that uses generative learning, which is different from the discriminative learning of supervised learning methods. In this approach, the goal is to estimate the probability distribution of the input. A fast learning algorithm for RBMs was introduced in 2006 [50]. RBMs have been used for various tasks, such as classification, regression, feature learning and modeling. An RBM is composed of two layers, a visible (input) layer and a hidden layer. Each node in the hidden layer takes input from all nodes in the visible layer and produces an output by putting the weighted sum of those values through an activation function such as a sigmoid. The weights are typically initialized to small random values chosen from a zero-mean Gaussian with a standard deviation of about 0.01; using larger random values can speed up the initial learning, but may lead to a slightly worse final model [49]. There are no intra-layer edges (hence the “restricted” in restricted Boltzmann machine). RBMs can reconstruct the data without labels, by sampling the values of the visible layer from the hidden layer, and can be utilized for modeling. In this work, we use RBMs in Section 3.5 to reconstruct integrated mobility-traffic data points. More advanced deep generative models are introduced and employed in Section 3.5.

Figure 2-3. Machine Learning algorithms widely used in Internet traffic classification. [Tree diagram: Machine Learning branches into Supervised Learning (Naïve Bayes, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Multilayer Perceptron, Meta-algorithms [AdaBoost]) and Unsupervised Learning (EM Clustering, DBSCAN, K-Means).]

2.3.3 Reinforcement Learning

Reinforcement learning (RL) is a type of learning which stems from behaviorism in psychology, with the goal of maximizing a reward by taking appropriate actions in a given context or environment, allowing the machine to learn a policy given the feedback from the environment. It has a wide range of applications, from robot navigation to playing abstract strategy games [102], but it is less commonly used in general, and we have not seen RL techniques being applied to the traffic characterization problem.

2.3.4 Feature Selection

The run time of many machine learning models, whether supervised such as Naïve Bayes or unsupervised such as k-Means, depends on the number of features. In addition to that, training a statistical classifier such as Naïve Bayes involves fitting a joint distribution over the features, which in this case are assumed to be independent. Such a model’s performance can be improved if less relevant features that are not beneficial for classification are removed [88]. One way of doing this is to use filters that measure the relevance of features independent of the choice of the model. For example, the correlation between the features and the target class can be used for feature selection. Another approach is using wrapper methods. This approach uses a learning algorithm and attempts to find a good subset of features by running the learning algorithm on different subsets of features iteratively and comparing their outcomes. There are many ways to generate the candidate subsets of features, including but not limited to: exhaustive search of all subsets; forward selection, which starts with no features and adds features one by one; and backward elimination, where all features are available in the beginning and are removed one by one in each iteration until no improvement is observed [87]. In the end, “All models are wrong, but some models are useful” [11].
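The wrapper approach with forward selection can be sketched as a greedy loop around an arbitrary scoring function. In this minimal illustration, the function name is hypothetical and score stands in for whatever evaluation the chosen learning algorithm provides (for example, cross-validated accuracy); it is an assumed callable, not part of any specific library.

```python
def forward_selection(features, score):
    """Greedy forward selection.

    features: list of candidate feature names.
    score: callable taking a tuple of feature names and returning a number,
           higher being better (assumed to wrap a learning algorithm).
    Returns the selected features and the best score reached.
    """
    selected = []
    best = score(tuple(selected))
    remaining = list(features)
    while remaining:
        # Evaluate adding each remaining feature to the current subset.
        candidates = [(score(tuple(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best:
            break  # no candidate improves the score, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best
```

Backward elimination follows the same pattern in reverse, starting from all features and removing the one whose removal improves the score most in each iteration.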

CHAPTER 3 ’FLUTES’ VS. ’CELLOS’: ANALYZING MOBILITY-TRAFFIC CORRELATIONS IN LARGE WLAN TRACES

Mobile network performance is affected by a multitude of elements. Two major factors affecting mobile network performance are ’mobility’ and ’traffic’ patterns. Simulations and analytical-based performance evaluations rely on models to approximate factors affecting the network, enabling design and planning of future networks. Hence, the understanding of mobility and traffic is imperative to the effective evaluation and efficient design of future mobile networks. Current models target either mobility or traffic, but do not capture their interplay. Many trace-based mobility models have largely used pre-smartphone datasets (e.g., AP-logs), or much coarser granularity (e.g., cell-towers) traces. This raises questions regarding the relevance of existing models, and motivates our study to revisit this area.

Here, we conduct a multi-dimensional analysis, to ’quantitatively’ characterize mobility and traffic spatio-temporal patterns, for laptops and smartphones, leading to a detailed integrated mobility-traffic analysis. We present a ’data-driven’ study, published in [6], in which we collect and mine capacious datasets (with over ’30TB’ of data, and 300k devices overall) that capture all of these dimensions. The investigation is performed using our systematic (’FLAMeS’) framework (Fig. 1-1). Overall, dozens of mobility and traffic features have been analyzed. The insights and lessons learned serve as guidelines and first steps towards future ’integrated mobility-traffic models’. In addition, our work acts as a stepping-stone towards a richer, more realistic suite of ’mobile test scenarios’ and ’benchmarks’.

3.1 Related Work

The related literature can be grouped into multiple subcategories, including studies in mobility analysis and modeling, traffic and network usage analysis and modeling, as well as variations of the aforementioned dimensions across device types.

Human Mobility: Human mobility is an important area of research with numerous applications in urban planning, design and construction of infrastructure (including networking), congestion prediction, energy consumption estimation, content dissemination through opportunistic networking, etc. Due to the wide variety of applications, human mobility has received significant attention. Over several decades, many models have been derived with varying degrees of complexity and customizability. The spectrum ranges from simple synthetic mobility models to complex data-driven models, designed to capture different properties with varying degrees of accuracy. We refer the reader to [47, 114] for surveys of mobility modeling and analysis. For spatial-temporal patterns, [39] and [105] reveal the regularity and bounds for predicting human mobility using cellular logs. In [105], the authors argue for quantitative models to capture statistical characteristics of individual users by highlighting the differences between continuous-time random walk (CTRW) models and empirical results. Another work on WLAN traces [15] revealed surprising patterns on increases of long-term mobility entropy by age, and the impact of academic majors on students’ long-term mobility entropy. We have made observations in terms of regularity of human mobility that are similar to [39, 105], which reaffirms the intrinsic properties of human mobility, despite differences in granularity and population across datasets. Another study highlighted the importance of combining datasets from multiple sources to study various features simultaneously [123]. To advance the understanding of human mobility, we integrated different datasets to correlate mobility and network traffic.

Traffic analysis: Network traffic has been studied extensively, for fixed networks (e.g., RIPE Atlas1, Ark2, BISmark [109]), and increasingly for wireless networks: for rather stationary users (as in WLANs) (e.g., [46, 66]) and potentially more mobile users as for cellular networks (e.g., [76, 122]). Multimedia content is a major category of traffic consumption in mobile devices [76]. In [122], a comparison of traffic patterns between wired and wireless connections is presented.
Their results show that overall flow counts and sizes are smaller for wireless networks, and there are differences between accesses to different categories of websites such as social media or multi-media content. The usage mode of wireless networks has been changing: while most users were non-mobile in the early 2000s [46], with the proliferation of on-the-go devices (’flutes’) that is no longer the case, as we show in this study. Such analyses range from metrics such as flow count, sizes, and traffic volume to service usage (e.g., visited websites, backend services), to categories of interest (e.g., sports, news, etc.). The authors of [85] investigated correlations and characteristics of web domains accessed by users and their locations, based on NetFlow and DHCP logs from a university campus in 2004. They propose a simulation paradigm with data-driven parameters, producing realistic scenarios for simulations. On both WiFi and cellular networks, the authors of [91] performed an in-depth study on smartphone traffic, highlighting the benefits and limitations of using MPTCP. Distributions of flow inter-arrival time (IAT) and arrival rate at APs of “static” flows were analyzed (e.g., Exp, Weibull, Pareto, Lognormal) in [83]. Lognormal was found to best fit the flow sizes, while at small time scales (i.e., hourly), IAT was best described by Weibull, but parameters vary from hour to hour. We analyze flows on a much larger-scale, newer dataset including smartphones, and identify the Lognormal distribution as the best fit for flow sizes, and beta as best for IAT, regardless of device type. The study in [120] analyzed ISP traces with 9600 cellular towers and 150K users in Shanghai, and mapped timed traffic patterns to urban regions. It provided insight into mobile traffic patterns across time, location and frequency. This work is complementary to ours, as we provide a much finer scope analyzing campus WLAN traces.

1 http://atlas.ripe.net
2 http://www.caida.org/projects/ark/

Device Variation: Usage and traffic patterns of different device types have been studied from various perspectives ([3, 18, 38, 68, 76, 93]). The study in [3] analyzed the Google WiFi network deployed in Mountain View, California, during the spring of 2008. They categorized devices into three groups: Hotspot, Modem and Smartphone.
Their results show that these three device types are naturally different from each other in terms of traffic features. Hand-held devices were found to have low UDP traffic in [38], while HTTP traffic was high. This is in line with our findings. In [68], the authors use extensive datasets to analyze mobility behaviors of laptops and smartphones. They compared a wide range of mobility metrics and presented significant differences between these two device types across spatial and temporal features,

such as unique APs visited, the radius of gyration and load characteristics. This study does not correlate traffic features with mobility. However, those findings are based on classifications that rely on either MAC addresses or HTTP headers solely. The former is rather limited and the latter may have serious privacy implications and is often unavailable. In [30], the authors use packet-level traces from 10 phones and application-level monitoring from 33 Android devices to analyze smartphone traffic. Although this allowed fine-grained measurements, the approach is invasive and limited in scalability, leading to small sample sizes and restricted conclusions. They also do not compare the traffic of smartphones with that of “stop-to-use” wireless devices (i.e., cellos), nor do they measure spatial metrics. To characterize usage patterns for users with multiple wireless devices, the authors in [21] analyze 32k users in a university campus environment, and focus on multi-device usage. They note differences between laptops and smartphones in packets, content, and time of usage. That work targets device usage patterns and security, while we study mobility and wireless traffic correlations. In our method, the combination of MAC and NetFlow allowed us to classify the majority of observed devices while preserving users’ privacy.

3.2 Mobility Analysis

This section covers the ’temporal’ and ’spatial’ mobility analyses. For all metrics, unless otherwise noted, we investigate 479 days. A summary of studied metrics and their most significant statistical values is presented in Table 3-1 along with mean and median ratios for comparison. From that list, we further investigate in this section those metrics that show the most interesting or non-trivial differences between ’flutes’ and ’cellos’.³

3.2.1 Session Start Probability

A session is defined as the period between WLAN associations. The distributions of session start times across the day for four building categories are depicted in Fig. 3-1. The start times of the sessions match the periodic beginning of classes, but mainly in ’Academic’ buildings, where users move mostly at the start and end of classes. In these places, activity drops sharply for ’cellos’ at 5pm, with considerable ’flutes’ activity until 8pm. For ’Social’ and ’Library’ buildings, the probability of new sessions remains higher for a few more hours into the evening, and the times users tend to leave are more spread out. We do not make similar observations during weekends, which is expected since, unlike weekdays, the day is not governed by a class schedule. For most visitors, the session start distributions show a smooth shape and no significant differences between device types (omitted for brevity).

3 We would like to thank Leonardo Tonetto, at Technical University of Munich (TUM), for the effective collaboration on this investigation.

Table 3-1. Summary of results for mobility. Upper values are for weekdays and lower ones for weekends. LJM: maximum jump [m]; DIA: diameter [m]; TJM: total trajectory length [m]; GYR: radius of gyration [m]; BLD: no. uniq. buildings; APC: access point count; PDT: time spent at preferred building [minutes]; DLT: total session time at each building.

                   Flutes (F)             Cellos (C)             Ratio (C/F)
Metric  Day        µ      median  σ       µ      median  σ       µ       median
LJM     weekday    435    296     813     178    1       624     0.409   0.003
        weekend    350    168     683     97     1       312     0.277   0.006
DIA     weekday    549    411     874     195    1       642     0.355   0.002
        weekend    425    179     739     107    1       338     0.252   0.006
TJM     weekday    1582   707     2336    378    1       1444    0.239   0.001
        weekend    1036   279     1793    252    1       1766    0.243   0.004
GYR     weekday    396    290     2725    321    191     3265    1.102   1.019
        weekend    330    248     1368    178    65.1    1800    1.247   1.4
BLD     weekday    5.4    3       5.6     1.8    1       2.1     0.811   0.659
        weekend    2.8    2       4.1     1.5    1       1.8     0.539   0.262
APC     weekday    11.8   6       13.3    3.7    2       4.8     0.333   0.333
        weekend    7.2    4       8.8     3      2       3.8     0.536   0.5
PDT     weekday    225    161     219     248    164     254     0.314   0.333
        weekend    223    135     272     278    189     292     0.417   0.5
DLT     weekday    316    235     302     316    217     305     1       0.92
        weekend    326    247     308     316    221     309     0.97    0.89

Figure 3-1. PDF of session start, P(T), over time of day T [h], for laptops and smartphones. Panels: Academic, Social, Administrative, Library.

3.2.2 Radius of Gyration

This metric, GYR, captures the size of the geospatial dispersion of a device’s movements, denoted by $r_g$ and computed as $r_g = \sqrt{\frac{1}{N} \sum_{k=1}^{N} (\vec{r}_k - \vec{r}_s)^2}$, where $\vec{r}_1, \ldots, \vec{r}_N$ are positional vectors of a device and $\vec{r}_s$ is its center of gravity.
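The radius of gyration defined above can be sketched in a few lines. This is a minimal illustration and not the code used for the analysis: it assumes positions are already projected to planar coordinates in meters, uses the usual root-mean-square form of the metric, and the function name and sample coordinates are hypothetical.

```python
import math

def radius_of_gyration(points):
    """Return r_g for a list of (x, y) positions given in meters."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one position")
    # Center of gravity r_s: the mean of all observed positions.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Mean squared distance from the center, then the square root.
    msd = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(msd)

# Hypothetical device observed at four AP locations (corners of a 100 m square).
positions = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
rg = radius_of_gyration(positions)
```

A device that never leaves a single location has a radius of gyration of zero, which is one reason the metric separates stationary devices from mobile ones.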

Grouping devices by their $r_g$ after six months of observation, we look at its evolution since the first time they are observed. Unsurprisingly (cf. [39]), after an initial transient period of about one week, this value stabilizes even across different semesters (not shown). We split the traces into weekdays and weekends, presenting the distributions in Fig. 3-2(a). For ’cellos’, we notice a substantial reduction in their overall mobility on weekends whereas, for ’flutes’, this difference is not so pronounced. This might be due to students having fewer

activities on weekends, a tendency to study at a single building like a library, or simply not carrying their cellos; we will revisit this aspect in Sec. 3.4. ’Flutes’, being “always-on” devices, are able to capture movements at pass-by locations, dining areas, and bus stops and thus are better suited to capture the fine-grained mobility of their users than cellos. The latter, in a campus environment, are often used during studies only (such as during lectures and in libraries), while the former often follow the user while moving around the campus. Despite the 8.1 km² area of the campus (approximate radius of 1.42 km), buildings with related fields of study (e.g., Fine Arts, Law School and Engineering) are fairly clustered. Computing the distance between the k-nearest neighboring buildings, for k = 22 and k = 9 (average number of visited buildings for ’flutes’ and ’cellos’), the median distances are 295 m and 172 m, respectively. Due to their focus on classes, attending students have a limited area of activity on weekdays, which explains the observed ’radius of gyration’.

We also evaluated: (1) ’diameter’ DIA, the longest distance between any pair of $\vec{r}_k$ points; (2) ’max jump’ LJM, the longest distance between a pair of consecutive $\vec{r}_k$ points; and (3) ’total trajectory length’ TJM, the sum of all trips made by a device. The distributions of these metrics are similar to that of the ’Radius of Gyration’ and are therefore not shown. Table 3-1 summarizes the most significant statistical values for these metrics.

3.2.3 Visitation Preferences and Interests

We count the number of unique buildings visited by a user, BLD, and define a ’preferred building’ as the location where a device has spent most of its time in a given day, measured in minutes and referred to as PDT. We approximate the latter by the formula $t_b = \sum_{k=1}^{N_b} S_k$, where $t_b$ is the time spent, $N_b$ the total number of sessions and $S_1, \ldots, S_{N_b}$ the time durations of ’each session’ at a building b, here referred to as DLT. Interestingly, cellos have slightly longer stays, but both have medians around 2:40 hours. The similarity of the distributions, combined with a lower number of visited locations, indicates that cellos are used mostly when users remain at places for longer periods.
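To make the definitions of BLD and PDT concrete, the following sketch sums session durations per building for one device on one day and picks the building with the largest total. It is an illustration under simple assumptions: the input layout (building name, duration in minutes), the function name and the sample day are hypothetical and not taken from the traces.

```python
from collections import defaultdict

def preferred_building(sessions):
    """sessions: iterable of (building, duration_minutes) for one device-day.

    Returns (building, t_b) for the building with the largest total time,
    where t_b is the sum of the session durations S_k at that building.
    """
    total = defaultdict(float)
    for building, minutes in sessions:
        total[building] += minutes
    if not total:
        raise ValueError("no sessions")
    best = max(total, key=total.get)
    return best, total[best]

# Hypothetical day for a single device.
day = [("library", 95), ("engineering", 50), ("library", 70), ("dining", 20)]
building, pdt = preferred_building(day)          # preferred building and PDT
unique_buildings = len({b for b, _ in day})      # BLD for the same day
```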

Figure 3-2. (a) CDF of the radius of gyration ($r_g$, in meters) for the device types, on weekdays and weekends. (b) Visited locations S(t) over time t [hours], for laptops and smartphones; slope annotations in the panel: $t^{0.59}$, $t^{0.33}$, $t^{0.54}$, $t^{0.20}$. Vertical lines at 7, 120 and 240 days.

Fig. 3-2(b) highlights the differences between ’flutes’ and ’cellos’ in the time t required to visit S(t) locations. After an initial exploration period of one week, the rates of new visits change similarly for both device types, and new exploration rates show up at 120 and 240 days. These could be explained by the weekly schedules of the university as well as the usual length of a lecture term (≈ 4 months). Considering the coefficients during the “steady states” between 7 days and 120 days ($t^{0.59}$ and $t^{0.54}$), our dataset matches observations made by [105]. We also consider the number of unique APs a device associates with, APC, which provides a finer spatial resolution than the building level. Furthermore, the probability of finding a device at its ’L-th’ most visited access point is shown in Fig. 3-3. When taking buildings as aggregating points for location, the values become $L^{-1.36}$ for ’cellos’ and $L^{-1.16}$ for ’flutes’. These approximations validate previous work on human mobility [39], yet highlight differences between device types.
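Exponents such as those quoted for S(t) and for the Zipf plot can be estimated in several ways. The sketch below uses one simple option, an ordinary least-squares fit on log-log axes; this estimator is our assumption for illustration and is not stated to be the procedure behind the reported values. The function name and the synthetic data are hypothetical.

```python
import math

def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. b in y ~ a * x**b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx

# Synthetic "steady state" samples generated with a known exponent of 0.59.
hours = [200, 400, 800, 1600, 2800]
visited = [2.0 * t ** 0.59 for t in hours]
beta = fit_power_law_exponent(hours, visited)
```

On real traces the fit would be restricted to one regime at a time (for example between 7 and 120 days), since the slope changes at the vertical lines marked in the figure.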

Figure 3-3. Zipf’s plot on L visited access points.

Figure 3-4. Probability P(t) of session duration t.

3.2.4 Sessions Per Building

To study AP utilization over time, we look at the session duration distribution, or session duration dispersal kernel P(t), depicted in Fig. 3-4. To facilitate comparison with NetFlow records, we limit the period to April/2012. The smaller inner plots represent the same metric, limited to four types of buildings.

We noted that the five-minute spikes correspond to the default idle-timeout of the WiFi routers in use. On the other hand, the ’knees’ at 1 and 2 hours could be explained by the typical duration of classes. They are only noticeable at Academic buildings (shown in the inner plots) and during weekdays (not shown). This leads us to conclude that despite the differences in distributions of device types, ’flutes’ and ’cellos’ present ’certain similarities in their usage, such as during classes’. To differentiate ’pass-by’ access points, we examine all sequences of three unique APs where all session durations are lower than 5 minutes (typical idle-timeout). We observed these APs clustered at buildings that also had major bus stops nearby. This shows that top ’pass-by’ APs are not necessarily at the most popular buildings. We revisit mobility in Section 3.4.1 for feature engineering.
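The pass-by rule described above (three consecutive sessions at distinct APs, each shorter than the idle-timeout) can be sketched as a sliding window over a device’s time-ordered sessions. This is one possible reading of the rule; the input layout, function name and sample trace are hypothetical.

```python
from collections import Counter

IDLE_TIMEOUT_MIN = 5  # typical idle-timeout quoted in the text

def pass_by_aps(sessions, window=3, max_minutes=IDLE_TIMEOUT_MIN):
    """sessions: time-ordered list of (ap_id, duration_minutes) for one device.

    Counts every AP that appears in a run of `window` consecutive sessions
    at distinct APs, each shorter than `max_minutes`.
    """
    counts = Counter()
    for i in range(len(sessions) - window + 1):
        run = sessions[i:i + window]
        aps = [ap for ap, _ in run]
        if len(set(aps)) == window and all(d < max_minutes for _, d in run):
            counts.update(aps)
    return counts

# Hypothetical trace: a long stay, a short walk past three APs, another long stay.
trace = [("ap1", 60), ("ap2", 2), ("ap3", 1), ("ap4", 3), ("ap5", 45)]
counts = pass_by_aps(trace)
```

Summing these counters over all devices and ranking the APs gives a list of candidate pass-by locations, which can then be compared with building popularity.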

Figure 3-5. Hourly associations, panels (a) and (b).

3.2.5 Hourly Associations

Measuring device associations every hour, Fig. 3-5(a) shows the percentage of devices with at least one event as a function of the hour of the day. The majority of devices appear online between 9am and 8pm, with the hours between 2am and 6am having less than 20% of devices associating. We find no major differences between the distributions of flutes and cellos, as many users potentially own both. As users arrive on campus and their phones announce their first location, they switch on their laptops. This issue bears further research through a future census study.

To measure the stay of devices throughout a day, we look at 1-hour intervals, and measure the number of hours a device accessed an AP.⁴ Fig. 3-5(b) depicts $S_h - L_h$, where $S_h$ and $L_h$ are the total numbers of flutes and cellos respectively, with at least one record per hour, as a function of the number of hours online h. Flutes are predominant for short visits and very long stays, but the difference drops significantly at 9 hours, then increases. The rise after 9 hours is likely due to students living on campus, with always-on connected phones.
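The difference between flutes and cellos per number of hours online can be sketched as follows. The sketch assumes each association event has already been reduced to a (device, hour-of-day) pair for a single day; that layout, the function names and the sample records are hypothetical.

```python
from collections import Counter

def hours_online_histogram(records):
    """records: iterable of (device_id, hour_of_day) association events.

    Returns a Counter mapping h to the number of devices seen in exactly h
    distinct 1-hour intervals of the day.
    """
    hours_by_device = {}
    for device, hour in records:
        hours_by_device.setdefault(device, set()).add(hour)
    return Counter(len(hours) for hours in hours_by_device.values())

def flute_minus_cello(flute_records, cello_records, max_hours=24):
    """Difference S_h - L_h for h = 1..max_hours."""
    s_h = hours_online_histogram(flute_records)
    l_h = hours_online_histogram(cello_records)
    return {h: s_h[h] - l_h[h] for h in range(1, max_hours + 1)}

# Hypothetical records: two flutes and one cello.
flutes = [("f1", 9), ("f1", 9), ("f1", 10), ("f2", 14)]
cellos = [("c1", 9), ("c1", 10)]
diff = flute_minus_cello(flutes, cellos)
```

Positive values of the difference at a given h mean that more flutes than cellos were online for exactly h hours that day.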

Figure 3-6. Time spent at preferred building, panels (a) and (b).

3.2.6 Visitation Preferences

Fig. 3-6(b) shows a scatter of the median time spent at a user’s preferred building. Each dot represents this value for a given location. This plot shows that ’academic’, ’police’ and ’museum’ buildings tend to have laptops staying longer, which makes sense intuitively, with students using laptops during lectures and staff working at the other two categories. On the contrary, for ’social’ and ’housing’ buildings, there is a higher probability of having flutes staying longer, hinting at a tendency to use mobile devices more in such places. Finally, ’administrative’, ’sports’ and ’library’ buildings tend to have both types of devices staying for

4 P. Widhalm, et al., “Discovering urban activity patterns in cell phone data”.

similar amounts of time. Analysis of inherent differences in browsing of online services caused by this heterogeneity among buildings is left for future work. Fig. 3-6(a) depicts the time devices spend at their ’preferred building’ in a day.

Figure 3-7. Probability to return to a previously visited location.

3.2.7 Return Probability

We compare empirical probabilities for devices to return to previously visited APs or buildings in Fig. 3-7. We observe return spikes every 24 hours, with the highest peaks at 48 and 168 hours (2 and 7 days). This can be explained by the schedule of classes at the university.

3.3 Traffic Analysis

Traffic characterization is vital to network research, from simulation and performance to design and testing. This section covers the metrics of flow and traffic on the merged NetFlow and DHCP (i.e. ’core’) dataset (overview in Table 3-2).

Table 3-2. Merged DHCP-NetFlow traces overview

                                     Cellos     Flutes
Avg. number of packets (per flow)    3.64       7.06
Avg. size of flows (bytes)           815.30     2063.48
Avg. duration of flows (ms)          1665.58    1896.90

In this section, we compare different ’traffic’ characteristics across ’device types’, ’time’ and ’space’. For this purpose, we start with a statistical characterization of ’individual’ flute and cello flows, in terms of size, duration, inter-arrival times and protocols. Next, we measure how these flows, ’put together’, affect the network patterns across APs and buildings. Finally, ’user behavior’ is analyzed by monitoring weekly cycles, data rates, and active durations. By quantifying ’temporal’ and ’spatial’ variations of traffic across device types, we make a case for new models to capture such variations based on the most relevant attributes. Table 3-3 summarizes the results. We find that smartphones and laptops are significantly different across most metrics tested. The differences between weekdays and weekends also tend to be significant. We also analyze the relationship among extracted features in Section 3.4.1 for feature engineering.

3.3.1 Flow-level Statistical Characterization

First, statistical characteristics and distributions of traffic metrics are analyzed. We compare the following distributions using maximum likelihood estimation (MLE) and maximizing goodness-of-fit estimation: Gaussian, Exponential, Gamma, Weibull, Logistic, Beta and Lognormal.⁵ None of the analyzed flow metrics fit a Gaussian distribution (based on the Shapiro-Wilk test for normality, goodness-of-fit tests and Q-Q plot results, not included for brevity). Thus, we use the Mann-Whitney statistical test [77] to compare two unpaired groups (laptops vs smartphones), and the Wilcoxon signed-rank test [119] to compare two paired groups (each device type on weekdays vs weekends).

3.3.1.1 Size

Flow size is the sum of bytes for all packets within a single flow. First, outlier data points are removed using a robust measure of scale, based on the inter-quartile range (IQR). Looking at individual flows of each device type shows that the sizes of flows that originated from smartphones are significantly different from those of laptop flows (p-value < .05). On weekdays, the average ’size of individual flute flows’ is ’> 2x larger’ than that of cello flows (2070 vs. 822 bytes), while the median is ’> 4x larger’ (678 vs. 142 bytes). There are no significant changes on weekends. The average packet size within a single flow also provides insight into packet-level behavior of services on mobile devices. We notice that the average ’packet size of flute flows’ is ’≈50% larger’ ’than that of cellos’ (212 vs 144 bytes on weekdays, 205 vs 142 on weekends). Comparing weekdays and weekends, the median size of flute packets drops on weekends whereas it remains ’the same’ for cellos. In fact, comparing cello flows on weekdays and weekends shows ’no significant difference’ in terms of average packet size (p-value > .05). Despite smaller flows, the average cello generates ’2.7 times the traffic’ of an average flute because the average cello is responsible for ’3.7 times as many flows’ as a flute.

Analyzing distributions of flow size and average packet size in our datasets shows that the ’Lognormal’ distribution is the best fit regardless of device type, with varying parameters for each device type (Fig. 3-10). Many models assume flow sizes are static, or follow an exponential distribution, but real-world data provides no supporting evidence. Such simplifying assumptions fail to accurately account for the very large flows obtained from a Lognormal distribution.

5 For distribution comparison, the significance threshold (p-value) is set at .05.

Figure 3-8. Traffic distribution plots (CDFs for weekday and weekend laptops and smartphones). (a) Packet processing rate of APs (millions per day). (b) Traffic load of APs (MB per day, log-scale). (c) User data consumption (MB per day, log-scale). (d) User active time (minutes per day).

Figure 3-9. CDF of individual flow sizes (bytes, log-scale x axis), similar pattern on weekends.

3.3.1.2 Packets

This metric is the count of packets within each flow. The mean and median packet counts per flow are 7.06 and 5 in flutes and 3.64 and 2 in cellos, during weekdays. The means drop slightly on weekends. Packet counts per flow match the ’Lognormal’ distribution well for flows of both device types. There are over 2 billion flows in the entire core dataset which contain only 1 packet. The average flute flow is bigger in size and has ’more packets’ (with higher variance) but there are ’fewer’ flows coming from these devices. This is analyzed further for TCP/UDP flows (Section 3.3.1.5).

Figure 3-10. Lognormal distribution plot for mean packet size of either device type. Similar pattern emerges for individual flow sizes.

3.3.1.3 Runtime

Flow runtime is the period of time the flow was active (equal to a flow’s finish time − start time). Flute flows have a mean and median of 1868 ms and 128 ms respectively on weekdays, while these numbers are 1639 ms and 64 ms for cellos. Both device types show an increase in means during weekends (flutes by 204 ms and cellos by 164 ms), indicating that although there are fewer devices online during weekends, they are more active. The low medians in either group correspond to many ’short-lived’ flows with few packets, showing little variation across device type, time or space.

3.3.1.4 Inter-arrival times (IAT)

IAT is essential in simulation and modeling of networking protocols, Internet traffic classification [30, 87], and traffic performance [94]. In addition, realistic modeling of IAT is required for accurate simulation and measurement of congestion control mechanisms [87]. In the research community, IAT and its Fourier transform are considered important features in traffic analysis as well. Most packet-level datasets are of limited availability or stale. Although our NetFlow data is on a higher abstraction layer (flow-level vs packet-level), analysis of flow IAT, as an aggregate of multiple TCP packets, is still a suitable way of measuring delay and jitter effects, though with lower granularity. Flow vs packet separation is irrelevant for UDP flows, since UDP is connectionless and each flow represents a single packet.

Figure 3-11. Theoretical and empirical densities and CDFs, Q-Q and P-P plots for Lognormal and mean packet size of Flute flows. Similar pattern emerges for Cellos (omitted).

Our results show that the flow IAT, regardless of device type, does not follow an exponential distribution, which is used to model IAT in many tool chains. The median of the flow ’IAT’ at APs is 6 ms for cello flows and 4 ms in the case of flutes, on weekdays (similar on weekends), which suggests that the majority of APs handle flows from either device type at nearly the same rate. However, the average ’IAT’ is ≈ 143 ms for flute and ≈ 78 ms for cello flows, as there are more cellos with a very high rate of flows. Flow ’IAT’ in our datasets matches a ’beta’ distribution well (Fig. 3-12) with a ’very high estimated kurtosis’ and ’skewness’ (estimated at 58 & 6.9 respectively). The high estimated kurtosis illustrates that there are ’infrequent extreme values’, which explains the observed highly elevated standard deviation of ’IAT’. The higher average IAT of flutes, combined with the higher standard deviation compared to cellos (596 vs 284), shows that flutes face ’more extreme periods of inactivity’, which can be caused by higher mobility and packet loss.
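As an illustration of why a few long gaps dominate the mean, standard deviation, skewness and kurtosis of IAT while leaving the median small, the sketch below derives inter-arrival times from flow start timestamps and computes their moments. It uses population moments and the non-excess convention for kurtosis, which is our assumption; the estimator behind the reported values is not stated. The function names and timestamps are hypothetical.

```python
import math

def inter_arrival_times(start_times_ms):
    """Gaps between consecutive flow start times (ms) observed at one AP."""
    ordered = sorted(start_times_ms)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def moments(values):
    """Return (mean, std, skewness, kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2

# Hypothetical AP: flows arrive every 4 ms, then one long period of inactivity.
starts = [0, 4, 8, 12, 16, 20, 24, 28, 32, 1032]
iat = inter_arrival_times(starts)
mean, std, skew, kurt = moments(iat)
median = sorted(iat)[len(iat) // 2]
```

In this toy trace a single long gap pulls the mean far above the median and yields a kurtosis well above the Gaussian value of 3, the same qualitative pattern as in the measured IAT.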

Figure 3-12. Exponential and Beta distribution plots for IAT.

3.3.1.5 Protocols

We analyze TCP and UDP, which comprise the majority of flows (>99.5%). Few flows use other protocols such as ICMP (.4%). TCP accounts for ’78.5%’ of cello flows (’84.6% of bytes’) and ’98.2%’ of flute flows (’91.6% of bytes’). The higher presence of UDP in cellos is reasonable, considering that UDP applications (e.g., multi-player games, video conferencing and file sharing) are more likely to be used with cellos. Comparing the number of packets in flows, in the case of TCP, the average number of packets in cello flows is almost half that of a flute flow (4.6 vs 8.8), and the average packet size of flutes is 22% higher than that of cellos. This supports our earlier observation regarding the bigger flow sizes of flutes, since most of their traffic uses TCP, and smartphones are responsible for more, and bigger, packets per TCP flow. However, for UDP, the two device types are similar in terms of average packet count per flow (2.5 for cellos & 2.87 for flutes) and average packet size (119 bytes for both). This conforms to the low latency requirements of many UDP applications. Given these differences, traffic classification using machine learning [90] could benefit from considering device types to train models. Internet traffic analysis is a challenging and important research area which seeks the ability to characterize the flow of data on the Internet by differentiating between various profiles of Internet traffic generated by users. We investigate this in Sec. 3.4.2. After establishing the similarities and differences of flows, the next step is to evaluate whether the individual variations in flows lead to different ’aggregate traffic behaviors’ from the viewpoint of the network.

3.3.2 Network-Centric (Spatial) Analysis

We examined the load of APs in all buildings on a daily basis to provide insight into differences from the viewpoint of the network. For each AP, we calculate flow metrics for every weekday and weekend. We assume that APs retain a constant MAC address during the period of our traces and serve both flutes and cellos indiscriminately. We focus our analysis on the first three weeks of NetFlow traces to avoid significant user behavior change during the exam period, as already shown in Fig. 2-2. First, we measure the daily packet and flow arrival rates at APs. The median flow rates are 42k and 20k per weekday for cellos and flutes respectively (7.5k and 0.5k on weekends). The average number of cello packets processed daily by APs is ’1.6 times higher’ than that of flute packets (Fig. 3-8(a)), with the medians on weekdays being 881k and 556k cello and flute packets respectively. Each AP handles, on average during weekdays, ’≈ 27 cello packets per second’ and ’≈ 17 flute packets per second’, dropping significantly to ≈ 13.5 and ≈ 6.25 on weekends. This indicates that, during the weekends, a high percentage of access points are not utilized, and surprisingly, it is ’less’ likely to find flutes on campus, with ’60% of APs seeing no flute flows’ and ’70% receiving no cello flows’. However, ’at least one AP in >80% of buildings sees traffic’, supporting observations of less mobility during weekends. Next, we look at traffic volume, analyzed daily for every AP. On average weekdays, 90% of APs handle < 5GB of cello traffic (2.5GB on weekends), whereas the same percentage handles < 3GB of flute traffic (1GB on weekends) (Fig. 3-8(b)). Flutes are more mobile, visit a higher number of unique APs and have bigger flow sizes, but they are still responsible for ’less overall network load’. Thus, lower traffic per AP does not necessarily translate to flutes consuming less bandwidth.

Overall, the individual differences of flute and cello flows result in ’heterogeneous aggregate traffic patterns’ in time (different days) and space (APs at different buildings). With that established, in order to take steps towards modeling and simulation, we also need to analyze the behavior of users.

3.3.3 User Behavior (Temporal) Analysis

Here, we measure traffic patterns from a user-centric perspective. We identified gaps in diurnal and weekly cycles (Fig. 2-2) as well as traffic flow features of individual ’users’ including data consumption, packet rates, and network activity duration.

3.3.3.1 Data consumption

Fig. 3-8(c) shows daily data consumption, with 90% of cellos consuming ’<700MB’ and 90% of flutes using ’<200MB’ on weekdays. Surprisingly, for cellos on campus during weekends, average data consumption is even higher, whereas data consumption of flutes drops sharply. As shown in Fig. 2-2, traffic size follows diurnal and weekly cycles, consistent with observations of flow and packet counts as well as active duration.

3.3.3.2 Packet rate

On weekdays, cellos on average generate ’≈318K packets’, while flutes only average ’≈84K packets’ per day. On weekends, the few on-campus cellos see a greatly increased number of packets, with an average daily packet rate of ≈ 495K. Weekend flutes also have a modestly ’increased’ packet count, with an average of ≈ 96K packets. With a similar pattern to packet counts, cellos have a much higher daily flow rate than flutes on weekdays and weekends alike.

3.3.3.3 Active duration

Total active time of devices serves well to demonstrate the differences between time spent online by users of different device types. We rely on NetFlow to measure ’active’ time instead of AP association time. This allows us to distinguish a user’s ’idle’ presence in the network from its ’activity’ periods. Cellos have ’4x’ the average active time of flutes in our traces (≈ 81 vs ≈ 21 min on weekdays, ≈ 97 vs ≈ 17 min on weekends). Overall, 90% of cellos are active for ’<3.5h’ and 90% of flutes are active for ’<1h’ (Fig. 3-8(d)). As evident in various metrics, the cellos appearing on weekends are more active than the average cello on weekdays. We plan to extend this user behavior analysis to website visitation patterns and domain interests in future work.

Overall, the data consumption of flutes seems to be ’more bursty’ in nature, with ’bigger’ flows and ’lower active duration’. This could be due to more intermittent usage of flutes and also bundling of network requests to save battery on these devices. In addition, there are fewer devices on campus during weekends, but those remaining devices are more active and consume more data than average. Using mobility and traffic metrics, we have quantified how these features vary across device types, time and space. Next, we study the relationship between these features.

3.4 Integrated Mobility-Network Traffic Analysis

We study the relation between mobility and network traffic features, examine whether their ’fusion’ provides a case for the necessity of ’integrated mobility-traffic models’, and introduce steps towards such models (Section 3.4.2). Integrated analysis of mobility and network traffic is necessary because mobility, network usage, and device types characterize different aspects of human behavior. They are usually studied independently, but they are correlated. Thus, there is a need to quantify and model relationships among these features using statistical techniques and machine learning models. Next, we discuss the features used in this analysis, and then we showcase the value and utility of integrated mobility-network traffic modeling.
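As a minimal illustration of quantifying such a relationship, the sketch below computes the Pearson correlation between one mobility feature and one traffic feature measured per device per day. The feature values are made up for the example, and the function name is hypothetical; it only shows the kind of statistic involved, not the analysis pipeline.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-device daily features:
# number of unique APs visited and average active time in minutes.
unique_aps = [2, 3, 5, 8, 12]
active_minutes = [90, 80, 60, 35, 20]
r = pearson(unique_aps, active_minutes)
```

A value of r close to -1 or +1 indicates a strong linear relationship between the two features, while a value near 0 indicates that neither helps predict the other linearly.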

57 3.4.1 Feature Engineering

To simplify analysis and interpretation, and reduce dimensionality, we identify the most important features. First, we study the relationships among variables from the ’mobility’ and ’traffic’ dimensions separately. Then, from this subset of combined features, we investigate whether clusters of user devices appear in the dataset. For this, we use correlation feature selection (’CFS’ [44]) to obtain features that are uncorrelated with each other⁶ but highly correlated with the classification. Finally, we quantify correlations between mobility and traffic metrics (see abbreviations in Fig. 3-13). Pearson correlation is shown in the figures.

3.4.1.1 Mobility

The ’CFS’ algorithm was run on 8 features (in Sec. 3.2), and kept only ’5’ (to be used in the cross-dimension analysis). Fig. 3-13(a) visualizes the linear dependence between mobility features, comparing flutes and cellos on weekdays and weekends. It shows a strong correlation between the number of unique APs and buildings, as well as between diameter and longest jump. The latter is because parts of the AP logs might not have captured intermediate movements. This also explains the high correlation between radius of gyration and diameter. Also, PCA shows that the first two principal components explain ≈ 78% of the variance in the data. Close inspection reveals temporal correlation relationships. For example, for cellos on weekdays, there is a strong correlation (0.96) between preferred building time (’PDT’) and time of network association (’DLT’), but a weak correlation (0.1) on weekends, suggesting that most of weekend online time is spent at preferred buildings (e.g., libraries).

3.4.1.2 Network traffic

We extract statistical measures for traffic metrics (Sec. 3.3) per device per day. The ’CFS’ algorithm was run on 19 features, reducing them to 11. A summary of these metrics is provided in Table 3-3. The correlations are depicted in Fig. 3-13. The analysis shows us that average number of packets and bytes are positively correlated, but negatively correlated with

6 We show only Pearson correlation for simplicity and brevity.

58 Abbr. Description

flutes cellos Weekday TBY Total flow bytes ABY Avg. flow bytes SBY Std. flow bytes TAT Total active time AAT Avg. active

Weekend time TFC Total flow Abbr. Description count APC AP Count (unique) SFC Std. flow PDT Preferred building counts ∆t RUB UDP bytes / TJM Total (sum) jumps total bytes DIA Diameter of mobility RUF UDP flows / DLT Delta time (time of total flows network association) AIT Avg. IAT SIT Std. IAT (a) Mobility (b) Traffic Figure 3-13. Correlation plots for (a) ’mobility’ and (b) ’traffic’ features. Each cell’s left halfis for flutes and right half is for cellos, the upper right triangle is for weekdays and the lower left for weekends.

variance of bytes and uncorrelated with IAT. In addition, the first two principal components explain only ≈54% of the variance in the data, which shows that more principal components are needed to capture the power of traffic features. Average IAT (’AIT’) seems to be mostly independent of other traffic features, but as ’AIT’ increases, its standard deviation (’SIT’) also increases greatly, which could be due to device mobility and merits further investigation of traffic-mobility interactions. Interestingly, active time is ’weakly correlated’ with the number of flows and packets, which shows that users who remain online longer are ’not’ necessarily consuming traffic at a high rate. Examining weekdays and weekends, correlation trends among traffic features remain similar for either device type.

3.4.1.3 Cross-dimension

Studying correlations across mobility and traffic dimensions, based on the subsets of features selected by ’CFS’ in the mobility and traffic feature selection steps, is a solid step towards an integrated mobility-traffic model. Results are presented in Fig. 3-14. We find that as the numbers of unique APs/buildings visited (’APC’, ’BLD’) ’increase’, the average active time (’AAT’), and total and std. of flow counts (’TFC’ and ’SFC’) ’decrease’ markedly (significant negative correlation). Surprisingly, there is no

Table 3-3. Traffic features used for integrated mobility-traffic analysis (per device, per day; see Fig. 3-13 for abbreviations). For each feature, the upper row is for weekdays and the lower row for weekends. F: flutes, C: cellos, C/F: ratio of cellos to flutes.

Feature   Day      F µ     F median  F σ     C µ     C median  C σ     C/F µ   C/F median
TBY [MB]  Weekday  96.77   11.47     194.52  373.08  144.68    554.54  3.85    12.61
          Weekend  80.96   0.86      195.15  448.87  180.23    623.86  5.54    209.56
ABY       Weekday  5.48    0.74      14.02   15.67   7.34      25.81   2.85    9.91
          Weekend  4.54    0.15      14.16   18.06   8.34      28.71   3.97    55.6
SBY       Weekday  10.56   1.57      23.76   30.59   13.77     49.82   2.89    8.77
          Weekend  8.09    0.13      21.48   33.21   15.42     53.39   4.10    118.61
TAT       Weekday  1,330   388.6     2,517   5,123   3,003     6,444   3.85    7.73
          Weekend  1,059   90.89     2,497   5,883   3,861     6,934   5.55    42.48
AAT       Weekday  63.14   27.97     86.69   188.26  166.93    138.70  2.98    5.96
          Weekend  50.60   12.98     85.27   206.89  184.17    156.53  4.08    14.18
TFC [K]   Weekday  7.2     1.7       15.61   33.5    17.1      60.10   4.65    10.05
          Weekend  5.7     0.3       15.01   38.5    20.6      88.52   6.75    68.66
SFC       Weekday  515.6   177.3     907.7   1,640   1,181     2,081   3.18    6.66
          Weekend  361.05  30.18     796.6   1,673   1,215     2,098   4.63    40.27
RUB       Weekday  0.05    0.00      0.19    0.07    0.00      0.22    1.4     N/A
          Weekend  0.06    0.00      0.22    0.08    0.00      0.23    1.33    N/A
RUF       Weekday  0.07    0.00      0.18    0.12    0.02      0.22    1.71    N/A
          Weekend  0.09    0.00      0.22    0.13    0.02      0.24    1.44    N/A
AIT       Weekday  3.36    2.24      3.59    3.40    2.45      3.51    1.01    1.09
          Weekend  2.95    1.74      3.60    3.18    2.27      3.39    1.07    1.3
SIT       Weekday  5.22    3.44      5.50    5.14    3.18      5.28    0.98    0.92
          Weekend  4.09    1.98      5.06    4.72    2.79      4.96    1.15    1.41

Figure 3-14. Correlation plots of mobility vs. traffic on weekdays (top) vs. weekends (bottom) for flutes (left) and cellos (right).

noticeable change in total traffic consumed with change in ’APC’, suggesting bundling of more packets in flute flows. (A similar correlation holds between mobility diameter and the above traffic features.) Average IAT (’AIT’) of flutes also rises slightly as mobility metrics ’decrease’; for cellos this correlation is almost ’nonexistent’. This reinforces our ’“stop-to-use”’ categorization; cellos are movable but are not active in transit. PCA shows that ≈54% of variance can be explained by the first two principal components. To sum up, ’flutes score high on mobility metrics’, have an overall lower flow count and network traffic but produce bigger flows on average. For cellos, on weekends the more time spent at preferred buildings the higher the total active time (’TAT’) and flow counts; this effect exists to a lesser degree for flutes. On weekdays, such correlation does not exist.

3.4.2 Utility of Integrated Modeling

Here we present the first steps towards an integrated model, by showcasing the value and utility of integrated mobility-network traffic modeling, with various applications in simulation, protocol design, and resource planning, particularly for 5G systems. We utilize daily mobility and traffic features of users during a week. First, we examine how different mobility and traffic features are for flutes and cellos using machine learning. Second, we investigate whether natural convex clusters of users appear in the dataset. These steps verify that the differences in mobility and traffic characteristics across device types are significant. We also find that combining mobility and traffic makes this distinction even clearer. Finally, mixture models are used to model and synthesize simulated data points of each device type, finding that the accuracy of the mixture model ’increases’ when trained on ’combined’ features.
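As a rough illustration of the first two steps, the sketch below evaluates an SVM with 5-fold cross-validation and default parameters, and scores K-means with the silhouette coefficient, on separate and combined feature subsets. It is a minimal sketch, not the pipeline used in this study: scikit-learn is an assumed library choice, and the feature matrices are synthetic stand-ins whose sizes and class separation were picked arbitrarily.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300  # hypothetical number of devices per type

# Synthetic stand-ins: 5 "mobility" and 11 "traffic" features per device.
# Label 0 plays the role of flutes, label 1 of cellos; the shifts are arbitrary.
mobility = np.vstack([rng.normal(0.0, 1.0, (n, 5)), rng.normal(0.5, 1.0, (n, 5))])
traffic = np.vstack([rng.normal(0.0, 1.0, (n, 11)), rng.normal(1.0, 1.0, (n, 11))])
labels = np.repeat([0, 1], n)

feature_sets = {
    "mobility": mobility,
    "traffic": traffic,
    "combined": np.hstack([mobility, traffic]),
}

def svm_cv_accuracy(X, y):
    """Mean 5-fold cross-validated accuracy of an SVM with default parameters."""
    model = make_pipeline(StandardScaler(), SVC())
    return cross_val_score(model, X, y, cv=5).mean()

def kmeans_report(X, y, k=2):
    """Silhouette coefficient and, for k=2, agreement of clusters with device type."""
    Xs = StandardScaler().fit_transform(X)
    assign = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    sil = silhouette_score(Xs, assign)
    agree = (assign == y).mean()
    # Cluster ids are arbitrary, so a swapped labeling counts as the same clustering.
    return sil, max(agree, 1.0 - agree)

svm_acc = {name: svm_cv_accuracy(X, labels) for name, X in feature_sets.items()}
km = {name: kmeans_report(X, labels) for name, X in feature_sets.items()}
for name in feature_sets:
    sil, agree = km[name]
    print(f"{name:9s} SVM acc={svm_acc[name]:.2f} silhouette={sil:.2f} agreement={agree:.2f}")
```

Standardizing before both the SVM and K-means is a design choice of this sketch, made so that features on very different scales (e.g., bytes vs. counts) do not dominate the distance computations.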

• Supervised classification: Having shown significant differences throughout this study, we used support vector machines (SVM) on different subsets of features to examine the feasibility of device type inference as well as the relationship between mobility and traffic characteristics.⁷ These sets include mobility and traffic features ’separately’, then ’combined’, and then combined with ’weekend/weekday labels’. Using ’solely mobility features’ achieves ≈65% accuracy, while ’traffic features alone’ obtain ≈79% accuracy. Using all mobility and traffic variables ’combined’, the trained model achieves ≈81% accuracy. Then, as the ’combined’ feature set is extended to include ’weekdays and weekends’ independently, accuracy increases to ≈86%. This suggests that users’ behavior (both flutes and cellos) is ’more distinguishable’ when looking at ’combined’ mobility and traffic features, especially when ’temporal’ features such as weekdays are considered separately from weekends. We note that such behavior gaps are ’not’ the same for both device types and a model should take that into account.

• Unsupervised clustering: To investigate natural convex clusters, we used the K-means algorithm. Using ’mobility features only’, the best mean silhouette coefficient is achieved at k=2 and k=4. However, cluster sizes are highly skewed and at k=2, ≈60% of devices are correctly clustered. ’Traffic features alone’, at k=2, result in ≈81.2% accuracy. ’Combining’ mobility and traffic ’increases’ the accuracy to ≈81.5%. While some flutes and cellos are similar in terms of mobility and traffic, the clusters of the combined features clearly illustrate ’two distinct modes’ (especially in ’traffic’), and the ’high homogeneity’ of the clusters hints at ’disjoint sets of behaviors’ in mobility and traffic dimensions, governed by the device type.

3.5 Integrated Mobility-Network Traffic Generative Modeling

Mobile network performance is heavily impacted by ’mobility’ and ’traffic’ patterns. Models are applied extensively in simulations, analytical-based performance evaluations and benchmarks, to approximate factors affecting the network. In Chapter 3, we have quantified

7 The classifier was used in conjunction with 5-fold cross-validation and run with default parameters.

the differences between ’flutes’ and ’cellos’, across a variety of spatio-temporal mobility and traffic metrics. With that investigation, we established the necessity of modeling the device type, in addition to spatio-temporal mobility and traffic features. Chapter 4 expands upon that work by exploring the limits of human behavior predictability, and Chapter 5 explores the social context dimension, providing analysis and applications at the Encounter (pairwise) level. The understanding of mobility and traffic dimensions, taking into account time, space and device type, is imperative to the effective evaluation and efficient design of future mobile networks.

This part of the work is the culmination of all of the analysis in this chapter, with the goal of creating an integrated generative mobility-traffic model. We have identified insights into relevant considerations in design and parameterization of mobility and network traffic models. We quantified the value and utility of such integrated models. AP associations allow us to calibrate user mobility and determine their community structures (and “hotspots”), while daily and weekly activity patterns help outline user schedules, which could be parameterized. Devising and validating a concrete candidate model is the next step.

We also noted that it is crucial to differentiate between flutes and cellos for both mobility and traffic due to their very different nature. There is also potential to analyze a spectrum of devices (e.g., looking for ’guitars’, i.e., tablets). In addition, correlations of these features matter, and should be captured in models. As a result, one dimension that such an integrated model should include is the device type. Moreover, the traffic generation, spatial locations, and temporal behavior can be linked per device type and per user “community” (e.g., students of different disciplines at various buildings). This hints at the importance of social context in building mobility and traffic models. The results of the analysis so far helped determine the dimensions of an integrated mobility-traffic model and how to parameterize it, with the goal of creating an integrated generative mobility-traffic model. Such models should have the following properties:

• Capture the relationship of features across ’mobility and traffic’ dimensions. This model is based on features from both aspects of human behavior, and should not only capture each feature independently, but also be capable of representing their correlations.

• Allow modeling of different ’device types’, such as flutes and cellos. Considering the always-changing spectrum of device types, the model should be flexible enough to allow for addition or removal of new device categories with different modes of usage (e.g. ’guitars’ described in section 3.4.2).

• Preserve privacy of input campus dataset. This is achieved by synthesizing statistically representative data points for any application, thus the original data points are no longer used.

• Provide a modular, systematic architecture. As new research uncovers new aspects of human behavior, new features and characteristics might need to be added to the model. A modular, systematic framework enables the integration of novel techniques and information while maintaining the other properties of the system.

Generative models are a powerful class of machine learning models that learn the distribution and characteristics of the underlying data and hence are useful for synthesis (generation) of similar datasets conforming to the model. The ability to generate new synthetic, but realistic, user device profiles is of immense value for regenerating gaps in data, resource planning, predictive caching for 5G systems and other mobility-dependent services. What’s more, a generative model can serve as a privacy-preserving method of sharing data to encourage reproducibility and further research. The shared data points can be generated from a data-driven model, without being uniquely mappable to a certain user device. Such a generative model must be able to model spatio-temporal features of mobility, network traffic, device types, and their interdependence. Next, an overview of related work in modeling is provided.

3.5.1 Related Modeling Work

Mobility and network traffic have been traditionally studied and modeled separately. There is a vast body of research on mobility and traffic analysis and modeling. Some of the most advanced models of human mobility provide mathematical frameworks with tunable parameters to generate a variety of mobility scenarios. The work in [54] proposes a time-variant community (TVC) model that captures skewed location visitation preferences and periodic re-appearance of mobile users. While the design of this model is data-driven, it still lacks the tools to capture all the mobility metrics introduced previously. Besides, it is a mobility model only, and does not consider the traffic dimension. The authors in [57] propose the ’WHERE’ model that aims to capture movements of large populations in metropolitan areas based on probability distributions of call duration, call location, home and work locations, as well as commute distances between home and work extracted from Call Detail Records (CDR). This approach uses coarse-grained CDR logs, and does not integrate the traffic dimension either. For a more complete overview of mobility models, we refer the reader to [47, 114] for surveys of mobility modeling and analysis.

Many earlier studies on network traffic modeling and simulation assume random models of traffic, but that assumption has been rejected by empirical evidence [17]. In addition, the effects of individual behavior of users, in terms of mobility, social context and interest, and the interactions of these dimensions with network traffic have not been studied. Simulators such as the well-known OMNeT++ [116] rely on realistic models to capture these dimensions and produce data points resembling the real world, for evaluation of new protocols in terms of correctness and performance. However, there are no modules that integrate mobility and traffic dimensions, capturing their relationship. Designing next-generation networks with the Internet of Things in mind requires appropriate Quality of Service (QoS) mechanisms, which in turn require realistic traffic characterization and modeling. We aim to propose an integrated mobility-traffic model that can generate realistic data points for both mobility and traffic dimensions while maintaining their correlations.

3.5.2 Statistical Metrics

To evaluate and compare the generative models, we first need to choose statistical methods of comparing synthetic and original samples. We used the Kolmogorov–Smirnov (KS) test and Jensen–Shannon divergence (JSD), which are briefly introduced below.

Figure 3-15. Visualization of Kolmogorov–Smirnov statistic for a normal distribution.

Kolmogorov–Smirnov (KS)

The Kolmogorov–Smirnov statistic quantifies the maximum distance between the empirical distribution functions of two samples. This test can be used as a goodness of fit test [80]. Since the KS statistic only measures the maximum distance, it is possible for two samples to have a high KS statistic yet have similar shape and parameters (e.g., falling off on the extremes).

Jensen–Shannon divergence (JSD)

Jensen–Shannon divergence measures the similarity between two probability distributions (P, Q) by symmetrizing and smoothing Kullback–Leibler divergence [67], which itself measures the relative entropy of a probability distribution (P) to a reference probability distribution (Q). We use the square root of Jensen-Shannon divergence which is a metric (which means it has the following properties: non-negativity, the identity of indiscernibles, symmetry and triangle inequality), and with a base 2 logarithm in entropy calculations to make sure the distance is between 0 and 1. It is calculated using the following formula:

$$\mathrm{JSD}(P \parallel Q) = \frac{1}{2} D(P \parallel M) + \frac{1}{2} D(Q \parallel M) \tag{3–1}$$

where M is the average of P and Q, i.e.

$$M = \frac{1}{2}(P + Q) \tag{3–2}$$
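To make the two metrics concrete, the following sketch computes the KS statistic and the square root of the Jensen–Shannon divergence (Eq. 3–1 with M from Eq. 3–2, base-2 logarithms) between two samples. It is an illustrative sketch only: SciPy is an assumed library choice, the lognormal samples are placeholders for a real and a synthetic feature, and the histogram binning (50 shared bins) is an arbitrary assumption needed to turn samples into discrete distributions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
original = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # stand-in for a real feature
synthetic = rng.lognormal(mean=0.1, sigma=1.0, size=2000)  # stand-in for generated samples

# KS statistic: maximum gap between the two empirical CDFs (0 = identical).
ks_stat = ks_2samp(original, synthetic).statistic

# The JS distance needs discrete distributions, so both samples are binned on shared edges.
edges = np.histogram_bin_edges(np.concatenate([original, synthetic]), bins=50)
p, _ = np.histogram(original, bins=edges)
q, _ = np.histogram(synthetic, bins=edges)

# SciPy normalizes p and q and returns the square root of the JS divergence;
# base=2 keeps the result in [0, 1].
js_dist = jensenshannon(p, q, base=2)

def js_distance_manual(p, q):
    """Square root of Eq. 3-1, with M from Eq. 3-2 and base-2 logarithms."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

print(ks_stat, js_dist, js_distance_manual(p, q))
```

The manual function is included only to show that the library call and the formulas above agree; in practice one of the two would suffice.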

To create the synthetic user device profiles, we built on top of the integrated mobility-network traffic feature engineering and employed multiple techniques from statistics, and, to the best of our knowledge, for the first time adapted state-of-the-art deep generative models to this task (such as variational auto-encoders and generative adversarial networks). We explore four generative models: Gaussian Mixture Model (GMM), Restricted Boltzmann Machine (RBM), Variational Auto-Encoder (VAE) and Generative Adversarial Networks (GAN). These generative models are introduced next.

3.5.3 Gaussian Mixture Model (GMM)

As the first step towards the synthesis of traces based on our datasets, we trained Gaussian mixture models (GMM) on ’combined mobility and traffic features’. A Gaussian Mixture Model works by finding sub-populations that follow a multivariate Gaussian distribution, using Expectation Maximization (EM) to find the optimal parameters (mean and covariance) of the Gaussian distributions. From the combined model (CM), we acquired simulated samples. We used the Kolmogorov-Smirnov (KS) statistic to compare the simulated samples with the real data and found that CM is able to capture the behaviors of each device type (the average KS statistic of features is < 0.2 for Flutes and Cellos). Importantly, we found that providing ’both’ mobility and traffic features to train a GMM results in a lower average KS statistic. In other words, the combined model produces samples whose ’traffic’ features match the original data ’better’, compared with training a GMM ’on traffic features alone’ (based on the KS statistic), hinting at a key relationship between mobility and traffic. On the other hand, comparing mobility features of CM with a GMM trained on mobility features alone shows no improvement. Fig. 3-16 shows a sample CDF of ’TAT’, a network traffic feature, comparing the original and synthetic data for Cellos/laptops.
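The fit-sample-compare loop described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the code used here: scikit-learn's `GaussianMixture` is an assumed implementation, the "real" profiles are synthetic placeholders, and the number of components (2) and the 4-dimensional feature space are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Placeholder "combined" profiles: two sub-populations in a 4-D feature space.
real = np.vstack([
    rng.normal(loc=[0, 0, 0, 0], scale=1.0, size=(500, 4)),
    rng.normal(loc=[4, 4, 1, 1], scale=0.5, size=(500, 4)),
])

# EM estimates the mean and covariance of each Gaussian component.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(real)

# Draw as many synthetic profiles as there are real ones.
synthetic, _ = gmm.sample(n_samples=len(real))

# Per-feature KS statistic between real and synthetic samples, then the average.
ks_per_feature = [
    ks_2samp(real[:, j], synthetic[:, j]).statistic for j in range(real.shape[1])
]
avg_ks = float(np.mean(ks_per_feature))
print(ks_per_feature, avg_ks)
```

To compare training on one dimension against training on combined features, the same loop would be run once per feature subset, and the KS statistics of the shared features compared.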

Figure 3-16. Synthetic vs. Original ’total active time (TAT)’ feature for flutes.

Overall, this shows that there is significant potential for an ’integrated mobility-traffic model’ that captures the differences and relationships of features, across ’device types’, ’time’ and ’space’. There is also a lot of room for improvement by using more advanced generative models, such as restricted Boltzmann machines, variational auto-encoders, and generative adversarial networks.

3.5.4 Restricted Boltzmann Machine (RBM)

A Restricted Boltzmann Machine (RBM) is a shallow, generative, artificial neural network consisting of one hidden layer and one visible layer. The neurons form a bipartite graph (visible and hidden). RBMs can be stacked together to form deep belief networks (DBN). In the forward pass, the RBM takes the input and translates it into an encoding through its hidden layer. In the backward pass, the encoding flows back to the visible layer to be reconstructed into data. An RBM is trained using contrastive divergence. Put simply, an RBM is a non-linear lossy estimator that learns the mapping of data from visible nodes into hidden nodes.
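The forward and backward passes can be illustrated with scikit-learn's `BernoulliRBM`, used here purely as an assumed stand-in. Note the caveats: that class expects inputs scaled to [0, 1], it is trained with persistent contrastive divergence (a variant of contrastive divergence), and the data, layer size, and training settings below are arbitrary placeholders rather than the configuration used in this work.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
X = rng.random((200, 6))  # placeholder features, already scaled to [0, 1]

# One visible layer (6 units, implied by X) and one hidden layer (4 units).
rbm = BernoulliRBM(n_components=4, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

# Forward pass: probability that each hidden unit is on, given the visible data.
hidden = rbm.transform(X)

# One Gibbs step: sample hidden units from the data, then sample visible units back.
recon = rbm.gibbs(X)

print(hidden.shape, recon.shape)
```

Repeating the Gibbs step several times from a random start is one common way to draw synthetic samples from a trained RBM.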

3.5.5 Variational Auto-Encoder (VAE)

An Auto-Encoder is a neural network that finds a dense encoding of the input by decreasing the size of each consecutive layer down to the desired encoding size, followed by layers of increasing size back up to the original size, where the output is expected to be the reconstructed data. Variational Auto-Encoders [65] follow a similar architecture with two differences: (1) instead of mapping to a fixed vector (encoding), the model maps the data to two vectors representing a Gaussian distribution’s mean and standard deviation, and uses samples from that distribution to flow through the second half of the model, where the data is reconstructed. A re-parameterization technique is employed to allow for backpropagation through the stochastic sampling step. (2) The loss function is now penalized by the Kullback–Leibler (KL) divergence between the learned distribution and the standard normal. These changes enable the generation of data that preserves the underlying distribution of the data. Figure 3-17 demonstrates the idea. Put simply, the encoder part of the VAE is a transformation that turns the complex input into Gaussian representations, and the decoder is the reverse transformation from Gaussian samples to complex output. We use a variant of VAE, called the β-VAE, where the original VAE is modified to learn an interpretable, ’disentangled’ latent representation. The loss function is presented in Equation 3–3 [48]. Using β = 1 would result in the original VAE, but higher values of β show superior accuracy.

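$$\mathcal{F}(\theta, \phi, \beta; x, z) \geq \mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right) \tag{3–3}$$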
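A numerical sketch of the two ingredients in Equation 3–3, the re-parameterized sample and the β-weighted KL penalty, is given below. It is an assumption-laden illustration, not the trained model: the encoder outputs are random placeholders, the decoder is a hypothetical linear map, and a unit-variance Gaussian decoder is assumed so that half the squared error stands in for the negative log-likelihood (up to an additive constant).

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); randomness is isolated in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    """Negative of the bound in Eq. 3-3 (up to a constant), averaged over the batch."""
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)  # assumed Gaussian decoder
    return float(np.mean(recon + beta * kl_to_standard_normal(mu, log_var)))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))       # batch of 8 placeholder feature vectors
mu = rng.normal(size=(8, 3))       # placeholder encoder outputs (mean)
log_var = rng.normal(size=(8, 3))  # placeholder encoder outputs (log-variance)

z = reparameterize(mu, log_var, rng)
W = rng.normal(size=(3, 16))       # hypothetical linear decoder
x_recon = z @ W

loss_b1 = beta_vae_loss(x, x_recon, mu, log_var, beta=1.0)  # original VAE weighting
loss_b4 = beta_vae_loss(x, x_recon, mu, log_var, beta=4.0)  # stronger KL penalty
print(loss_b1, loss_b4)
```

Because the KL term is non-negative, raising β can only increase the penalty for a fixed encoder output, which is what pushes the latent code towards the standard normal prior.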

3.5.6 Generative Adversarial Network (GAN)

A Generative Adversarial Network revolves around the idea of two competing neural networks, a Generator and a Discriminator, pitted against each other [40]. The Generator tries to generate data similar to the input and the Discriminator tries to estimate how fake the generated data is. This game is played between the two agents for many rounds, resulting in a generator so powerful that even a powerful discriminator cannot decide the ’fakeness’ of the generated data. Training is done alternately and involves freezing the Discriminator and updating the weights for the Generator, and then the same is repeated for the Discriminator. We use a variation called Wasserstein GAN [8], where the loss function is based on the Wasserstein distance (earth mover’s distance). GANs suffer from issues such as non-convergence and mode collapse. Mode collapse means that the generator fails to learn the diversity of modes in the input data and produces a limited variety of data points. The WGAN variant improves the stability of GAN training, and also alleviates mode collapse. Figure 3-18 shows the abstract architecture of a GAN.

Figure 3-17. VAE architecture.

Figure 3-18. GAN architecture.
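The sketch below does not train a GAN; it only illustrates, in one dimension, what the Wasserstein (earth mover's) distance measures and why mode collapse shows up as a large distance. The bimodal "real" data and the two hypothetical generators are assumptions made for illustration, and `scipy.stats.wasserstein_distance` is an assumed tool for computing the 1-D distance directly from samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Placeholder real data with two modes, at -2 and +2.
real = np.concatenate([rng.normal(-2.0, 0.3, 500), rng.normal(2.0, 0.3, 500)])

# Hypothetical generator suffering from mode collapse: only the +2 mode is produced.
collapsed = rng.normal(2.0, 0.3, 1000)

# Hypothetical generator that covers both modes.
diverse = np.concatenate([rng.normal(-2.0, 0.3, 500), rng.normal(2.0, 0.3, 500)])

w_collapsed = wasserstein_distance(real, collapsed)
w_diverse = wasserstein_distance(real, diverse)
print(w_collapsed, w_diverse)
```

In the collapsed case roughly half of the probability mass has to be moved a distance of about 4, so the distance comes out near 2, while the generator covering both modes stays close to 0.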

3.6 Lessons Learned and Modeling Insights

Our findings in this chapter provide further (but surely not yet comprehensive) insights into considerations relevant to the design and parameterization of mobility and network traffic

Table 3-4. Average Kolmogorov-Smirnov statistic of all generative algorithms for training and testing on mobility, network traffic, and combined features.

                    Test on Mobility    Test on Traffic
                    Flutes   Cellos     Flutes   Cellos
Train on Mobility   0.198    0.350      N/A      N/A
Train on Traffic    N/A      N/A        0.317    0.172
Train on Combined   0.187    0.316      0.268    0.180

Figure 3-19. Kolmogorov-Smirnov (KS) statistic for all mobility features, comparing training on mobility alone with training on mobility and network traffic features combined. The values are scaled to 1 based on each feature’s KS statistic in training on mobility.

models. As summarized in Table 3-4, there is a significant potential accuracy gain when using integrated mobility-network traffic features. Such a model has multiple important benefits:

• Potential improvements in model accuracy: Averaged over all algorithms, training on combined features lowers the error for mobility features across device types and for traffic features for Flutes, while the error for traffic features for Cellos stays within 5% (Table 3-4). The breakdown of the KS statistic for each mobility and network traffic feature is presented in Fig. 3-19 and Fig. 3-20, respectively. It shows up to 25% error reduction for mobility and up to 20% error reduction for network traffic (with the exception of a few traffic features).

• Re-use the same model: When training on mobility and network traffic separately, there are two separate models that need to be coded, tuned and maintained. With the integrated model, there is a single model, which is easier to code and maintain.

Figure 3-20. Kolmogorov-Smirnov (KS) statistic for all network traffic features, comparing training on traffic alone with training on mobility and network traffic features combined. The values are scaled to 1 based on each feature’s KS statistic in training on traffic.

• Capture correlations between mobility and network traffic: An integrated model can easily capture the correlations among these features. An example of this is shown in Fig. 3-21, where a traffic feature’s correlations with multiple mobility features are shown for both synthetic and original data points.
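The accuracy claim in the first benefit can be checked directly against the averages in Table 3-4. The short worked example below computes the relative change in the average KS statistic when moving from single-dimension training to combined training; the dictionary layout and function name are illustrative choices, while the numbers are copied from the table.

```python
# (separate training, combined training) average KS statistics from Table 3-4.
table_3_4 = {
    ("mobility", "flutes"): (0.198, 0.187),
    ("mobility", "cellos"): (0.350, 0.316),
    ("traffic", "flutes"): (0.317, 0.268),
    ("traffic", "cellos"): (0.172, 0.180),
}

def relative_change(separate, combined):
    """Relative change in KS statistic; negative values mean lower error when combined."""
    return (combined - separate) / separate

changes = {key: relative_change(*pair) for key, pair in table_3_4.items()}
for (dimension, device), change in changes.items():
    print(f"{dimension:8s} {device:6s} {100 * change:+.1f}%")
```

This gives roughly -5.6% and -9.7% for mobility (flutes, cellos), -15.5% for flute traffic, and +4.7% for cello traffic, which is the case described above as staying within 5%.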

Figure 3-21. Correlation between total active time (TAT) and several mobility features in both synthetic and original data points. DLT: delta time, DIA: diameter, TJM: total jumps, PDT: preferred location delta time, APC: AP count. Synthetic data points are from the RBM model.

The best performer among the generative models introduced (GMM, RBM, VAE, GAN) is the VAE variant (β-VAE).⁸ The Kolmogorov-Smirnov statistic of the β-VAE is presented

8 A good set of hyper-parameters is as follows: intermediate dimension = 128, latent dimension = 112, learning rate = 0.005, decay rate = 0.00005, number of encoder layers = 4,

in Table 3-5 for weekdays and Table 3-6 for weekends. The numbers show significant gains in model performance for Cellos when combined features are used, with the performance for Flutes being better for mobility but worse for traffic. The results are from the same model, with the same parameters, trained on different subsets of the data. The model achieves overall superior accuracy, is simple to code and maintain, and also captures the relationship of mobility and network traffic features in somewhat interpretable latent variables. To ’visualize’ the latent variables, one can uniformly sample a single dimension, Zi, from the latent space while fixing the rest, and use the decoder network to generate samples. Then the samples can be analyzed to find the relationship of Zi with different input features. In the original study introducing β-VAE, it was shown that when the model was trained to generate faces, the first few latent variables captured important aspects of a human face, such as azimuth (rotation), emotion (e.g., smile), and hair (fringe) [48]. In our dataset, we used a heat-map to show how each individual feature varies when the latent variable Zi varies from -4 to +4. For some of the latent variables, we found obvious patterns, such as the case of Z1, which tends to mostly capture correlated mobility features on weekends (depicted in Fig. 3-22). The changes are not always easily interpretable, as changes in Zi can lead to changes in non-linear combinations of mobility and network traffic features.
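The latent traversal described above can be sketched as follows. This is a hypothetical illustration: the decoder here is a fixed random map with a tanh non-linearity standing in for the trained β-VAE decoder, and the latent and feature dimensions are arbitrary; only the traversal range of -4 to +4 follows the text.

```python
import numpy as np

def latent_traversal(decoder, latent_dim, index, low=-4.0, high=4.0, steps=9, base=None):
    """Vary one latent dimension over [low, high] while the others stay fixed."""
    if base is None:
        z = np.zeros((steps, latent_dim))
    else:
        z = np.tile(np.asarray(base, dtype=float), (steps, 1))
    values = np.linspace(low, high, steps)
    z[:, index] = values
    return values, decoder(z)

# Hypothetical stand-in for a trained decoder network.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))

def toy_decoder(z):
    return np.tanh(z @ W)

values, features = latent_traversal(toy_decoder, latent_dim=8, index=0)

# features[i, j] is decoded feature j at the i-th value of the traversed latent
# variable; plotted as an image, this matrix is the heat-map described in the text.
print(values)
print(features.round(2))
```

Fixing the remaining dimensions at zero (the prior mean) is one simple choice; passing a `base` vector taken from an encoded real profile would instead show the effect of the latent variable around that profile.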

Table 3-5. Kolmogorov-Smirnov statistic of the β-VAE for weekday features.

                    Test on Mobility    Test on Traffic
                    Flutes   Cellos     Flutes   Cellos
Train on Mobility   0.107    0.295      N/A      N/A
Train on Traffic    N/A      N/A        0.074    0.278
Train on Combined   0.081    0.179      0.108    0.110

number of decoder layers = 7, alpha = 0.09 for LeakyReLU, batch normalization momentum = 0.39, beta = 40, batch size = 32. The hyper-parameters were tuned using Microsoft Neural Network Intelligence (NNI), available online: https://github.com/Microsoft/nni

Table 3-6. Kolmogorov-Smirnov statistic of the β-VAE for weekend features.

                    Test on Mobility    Test on Traffic
                    Flutes   Cellos     Flutes   Cellos
Train on Mobility   0.187    0.493      N/A      N/A
Train on Traffic    N/A      N/A        0.120    0.303
Train on Combined   0.191    0.403      0.124    0.142

Figure 3-22. Visualization of mobility features with Z1 from β-VAE. (Features that end in ’W’ are for weekdays and those ending in ’E’ are for weekends. UAP: unique AP count, PDT: preferred location delta time, TJU: total jumps, DIA: diameter, DTI: delta time)

While we leave validating a concrete candidate model for future work, we can readily identify the following important elements: It is crucial to differentiate flutes vs. cellos for both mobility and traffic due to their very different nature. More specifically, flutes exhibit continuous presence whereas cellos are on/off with jumps between locations. Beyond differences in continuity, the traffic patterns (flow sizes, arrival times, etc.) should be specified by device class. Moreover, the traffic generation, spatial locations, and temporal behavior can be linked per device type and per user “community” (e.g., students of different disciplines at various buildings).

Our analyses allow us to quantify the correlated elements of traffic and motion for the (type of) campus we investigated. AP associations allow us to calibrate user mobility and determine their community structures (and “hotspots”), while daily and weekly activity patterns help outline user schedules, which could be parameterized based upon outline information such as class schedules, public holidays, and other predictable events. Traffic flows per device can be created from our corresponding distributions, while AP traffic observations allow parameterizing the above “hotspots” with respect to user activity. Together, these could form a basis for generating parameters for established mobility models⁹,¹⁰, possibly extend them as needed, and augment them with adequate correlated traffic models. The next step is examining the minimum number of parameters required (parsimonious model) to approximate user behavior within given error bounds with respect to the observed ground truth. This should be complemented by studying datasets from other network settings to further validate the observed behaviors beyond our campus environment, taking into account the specifics of the campus and identifying further influencing factors.

Besides different campus environments, a few further elements deserve further study: 1) Our study so far gives only limited insight into the actual services users are accessing, so analyzing the web domains users access, for interest and application analysis, as well as investigating the relationship of user interests with spatio-temporal characteristics of mobility and traffic, would be important next steps. 2) Flutes and cellos are likely two points in a dimension that is continuously changing as a) differentiation in device classes appears (e.g., smartphones vs. tablets: do we have “guitars” as well?); b) classes blur again (e.g., due to new form factors for smartphones and tablets); and c) new devices such as fitness trackers, smartwatches and “glasses” further enrich the portfolio. 3) The models so far were trained on individual device profiles only; the next generation of models will also need to span the social context dimension (pair-wise and collective).

9 Hsu, W-J., et al. "Modeling time-variant user mobility in wireless mobile networks." INFOCOM 2007
10 Boldrini, Chiara, and Andrea Passarella. "HCMM: Modelling spatial and temporal properties of human mobility driven by users’ social relationships." Computer Communications 2010

3.7 Summary

In this chapter, we mined large-scale WLAN and NetFlow logs from a campus to answer: (I) How different are mobility and traffic characteristics across device types, time and space? (II) What are the relationships between these characteristics? (III) Should new models be devised to capture these differences? And, if so, how? We built ’FLAMeS’, a framework for systematic processing and analysis of the datasets. Using MAC address survey, OUI matching and web domain analysis, we categorized devices: ’flutes’ (’“on-the-go”’) and ’cellos’ (’“stop-to-use”’). We then studied a multitude of mobility and traffic metrics, comparing flutes and cellos across time and space. On average, flutes visit twice as many APs as cellos, while cellos generate ≈2x more flows. However, flute flows are 2.5x larger in size, with ≈2x the number of packets. The best fit distribution for location preference is ’Zipfian’, for flow/packet sizes is ’Lognormal’, and for flow IAT at APs is ’beta’. Furthermore, flute traffic drops sharply on weekends whereas many cellos remain active.

Across mobility and traffic dimensions, we spotted a negative correlation for flutes between mobility and flow duration but negligible correlation with traffic size; for cellos, this effect is less pronounced. We found a negative correlation between APs visited and active time, particularly for flutes. However, no correlation exists between APs visited and traffic for cellos. We ’quantified’ correlations ’across both mobility and traffic’. Finally, we applied machine learning and trained a mixture model to synthesize data points and verified that the ’combined’ mobility-traffic features capture the ’differences’ in metrics better than ’either mobility or traffic separately’. Many of our findings are not captured by today’s models, and they provide insightful guidelines for the design of evaluation frameworks and simulation models.

Hence, this study answered the questions posed, introduced a strong case for newer models, and provided our first step towards a future integrated mobility-traffic model, with implementation and analysis of multiple generative models: Gaussian Mixture Model (GMM), Restricted Boltzmann Machine (RBM), Variational Auto-Encoder (VAE) and Generative Adversarial Networks (GAN). A variant of the VAE, called β-VAE, showed the best performance in regenerating integrated mobility-network traffic user device profiles. However, this approach is ≈2 orders of magnitude slower than simpler approaches such as the GMM. This trade-off does not reduce the utility of the gains, given the fact that generative models are going to be used in an offline fashion.

CHAPTER 4 PREDICTABILITY ANALYSIS AND PREDICTION ALGORITHM DESIGN

Prediction techniques constitute fundamental mechanistic building blocks for many mobile protocols and applications, ranging from resource allocation to caching and recommender systems, among others. Predictability and behavioral regularity in the mobile society have been the focus of several studies. Employing information-theoretic concepts and machine learning methods, earlier research has shown evidence that human behavior can be highly predictable. However, the previously reported findings on very high predictability of human mobility were drawn from analyses that had several limitations. First, the spatio-temporal scale at which users' locations were observed (location within mobile network cell tower coverage) renders the prediction results impractical for most mobile applications that depend on high accuracy of user localization. Second, the theoretical limits of predictability were derived based on an entropy estimator that was unable to capture the variability of repeated sub-sequences of visited locations, therefore underestimating entropy (overestimating predictability). Despite existing studies, more investigations are needed to capture intrinsic mobility characteristics constraining predictability, and to explore more dimensions (e.g. device types) and spatio-temporal granularities, especially with the change in human behavior and technology. We analyze extensive longitudinal datasets with fine spatial granularity (AP level) covering 16 months. We revisit the study of mobile user predictability, and we address the aforementioned limiting factors by exploring two datasets of WiFi network associations (traces with spatio-temporal granularity that is orders of magnitude finer than in the previous studies), and by utilizing a more efficient entropy estimator as well as several well-known predictors.
We use our extensive WLAN dataset, described in Section 2.1.1, covering a WiFi network with 1750 APs in 150 buildings. The predictors include a Long Short-Term Memory (LSTM) neural network (a variant of recurrent neural networks (RNNs), introduced in Section 2.3.1.2), where feedback across hidden layers captures order in time and sequencing of mobility behavior, and a Markov Chain (MC) based predictor, described in Section 2.3.1.2, in addition to a 1D convolutional neural network (CNN), used in natural language processing and other sequence learning tasks, adapted to this new context. The goal of this study is to investigate practical prediction mechanisms to quantify predictability as an aspect of human mobility modeling, across time, space and device types. We apply our systematic analysis to wireless traces from a large university campus. We compare several algorithms using varying degrees of temporal and spatial granularity for the two modes of devices: Flutes vs. Cellos. We show that for many practical wireless applications, human mobility is much less regular than previously reported. The study also reveals device type as an important factor affecting predictability. Ultra-portable devices such as smartphones have an "on-the-go" mode of usage (hence dubbed 'Flutes'), whereas laptops are "sit-to-use" (dubbed 'Cellos'). Through our analysis, we quantify how the mobility of Flutes is less predictable than the mobility of Cellos. In addition, this pattern is consistent across various spatio-temporal granularities and for different methods (Markov chains, neural networks/deep learning, entropy-based estimators). This work substantiates the importance of predictability as an essential aspect of human mobility, with direct application in predictive caching, user behavior modeling and mobility simulations. The main questions addressed here are:
i. How different are 'Flutes' and 'Cellos' in terms of predictability?
ii. How does the predictability of these device types change with different spatio-temporal granularity (5, 15, 30 min, 1 hour and 2 hours; access point and building level)?
iii. Does the choice of method or predictor (e.g. Markov Chain, neural networks such as LSTM and CNN, BWT or LZ based estimators) significantly alter the answers to the aforementioned questions?
To address these questions, this study provides the following contributions:
1. Quantifying the differences of Flutes and Cellos for prediction analysis, evaluated on a real-world large-scale dataset.
2. Comparison of several well-known algorithms (Markov Chains, Neural Networks) and LZ/BWT-based theoretical bounds across different time and space scales for Flutes and Cellos.
3. Use of prediction accuracy as part of the user profile for modeling, and investigation of its correlation with a combination of network traffic and mobility features.

4.1 Related Work

There are various studies regarding the prediction of human behavior and its limits. Many studies have suggested that humans behave in a highly predictable manner. To study theoretical limits of predictability, the concept of entropy from information theory has been used. Predictability limits have been studied in various settings using entropy, while machine learning is employed for practical predictions in different applications. Given its simplicity and interpretable results, entropy has been extensively used to model the randomness in numerous studies analyzing predictability limits of sequences that encode human behaviors, such as movements in time. The seminal work by [106] devised a framework for estimating the maximum possible predictability (Πmax) by solving Fano's inequality using the estimated entropy rate of a sequence of time-ordered events [106]. Using traces from cellular network operators, their work reported a possible upper limit of 93% with very little variance across age groups and short- or long-distance travelers. Attempting to refine these results, the authors in [104], using GPS traces, finer spatial and temporal resolutions (600 m² and 5 minutes vs. 3 km² and 1 hour) and an alternative approximation for Πmax, report that these upper limits are actually 11-24% lower [104]. Studying different problems, [82] devise the calculation of 'instantaneous entropy' using GPS traces along with mobile app usage and evaluate how anticipatory phone notifications could benefit from predictability estimations [82]; [74], using cellular traces, report a high predictability of human mobility even after a natural disaster event [74]; and [14], using WLAN traces from a university campus network, report multi-modal entropy distributions which can be partially explained by demographic data from the same population (i.e., age, gender, major of studies) [14].
Other entropy based studies include vehicular mobility [35, 70, 118], online social behavior [103, 111], complex systems [45], cellular network traffic [124] and utilization [41].

Several studies used an entropy estimator based on the LZ compression algorithm to compute Πmax, but a more efficient estimator based on BWT has been proposed [13]. An important aspect of entropy estimators is their sensitivity to large alphabet sizes [13]. Alphabet sizes (|Z|) comparable to the length of the input sequence may influence the accuracy and introduce bias for an equal number of observations. Comparing the behaviors of LZ and BWT under large alphabets (i.e., the device has visited a large number of locations), LZ shows a larger variance. For our study, we estimate the entropy rate of sequences using the Burrows–Wheeler Transform [2], which performs better than the widely used Lempel-Ziv [125] under larger alphabet sizes. Various attempts were made to close the gap between theoretical upper limits and practical approaches using machine learning. Using cellular traces with large spatial and temporal granularities (sub-prefectures in Cote D'Ivoire and 24 hours respectively), [75] report accuracies of up to 87% and 95% for stationary and non-stationary trajectories respectively with Markov predictors, where MC(1) (i.e., a predictor which takes into account only the current event for predicting the next event) performs as well as higher-order predictors [75]. Similarly, [121], using GPS traces, report high correlations (up to 0.86) between Πmax and Markov predictors for different spatial and temporal granularities (hundreds of meters, streets, districts and regions of a city, and from 1 to 4 hours respectively) [121]. Similar to [56], we model our input sequence of mobility events into two types of time series: discrete-time and event-based. The former, being the most used in the aforementioned literature, may overestimate accuracy given the tendency for people to stay for long hours at often-visited locations (e.g., home, work, school) [41].
Alternative methods for mobility or location prediction include: Hidden Markov Models reporting high regularity in human mobility after earthquakes in Japan [108], improving prediction accuracy by clustering users with similar behavior and using transfer learning [59], as well as the use of the social graph and knowledge of the whereabouts of someone's friends to predict the next location [19, 99]. Other approaches include kernel density models [23], decision trees [113], Gaussian Mixture Models [32] and Neural Networks [72]. University campus wireless networks have also been previously used as the means to study the mobility of their users. The authors in [107] compare four families of online location predictors, including Markov, LZ [125], Prediction by Partial Matching (PPM), and Sampled Pattern Matching (SPM) [58], and found that more complex predictors yield only marginally better results. Long-spanning traces (6 years) have been studied in [63], which enables exploring the evolution of WLAN users over time. In this study, we evaluate our method using two large WLAN traces: one from the University of Florida (UF), introduced in Section 2.1.1, and one from KTH Royal Institute of Technology, with similar features to the UF traces, collected for 4 months (September-December 2014) from 930 APs in 48 buildings. Table 4-1 contains a brief summary of each trace with mean (µ) and standard deviation (std), where Nap is the number of unique access points observed, Nday the number of unique days with at least one record, Nrec the number of records during data collection, and 'Total Devices' the number of devices available for at least 7 days.

Table 4-1. Statistics per device available for at least 7 days & accessed more than 5 APs.

        Nap               Nday              Nrec              Total Devices
        µ        std      µ        std      µ        std
UF      127.3    142.3    63.5     59.2     1861     5121     138028

We investigate two methods to measure predictability: a theoretical method based on entropy, and a systems method based on practical predictor algorithms. In the following, we provide the entropy-estimation-based definition and discuss the different algorithms studied, including a reference-point Markov Chain approach and more sophisticated deep learning approaches.

4.2 Entropy Estimators and Maximum Predictability

Entropy can be defined as the level of order (or disorder) of a system. For a random process, this metric is sensitive to both the relative frequency of events and their interdependencies [41]. For estimating the lower bound of predictability, we compute the time-uncorrelated entropy ($S^{unc}$), which takes into account only the frequency of the observed events. For the upper bound of predictability, we compute two time-correlated estimators based on compression algorithms ($S^{lz}$ and $S^{bwt}$), which also consider the memory of the system.¹

Formally, for a random variable Λ with possible values $\{\lambda_1, \ldots, \lambda_n\}$ from an alphabet Z and probability distribution $p(\lambda) = P(\Lambda = \lambda)$ for $\lambda \in Z$, we define $S^{unc} = -\sum_{\lambda \in Z} p(\lambda) \log_2 p(\lambda)$, often referred to as Shannon entropy. Next, based on the LZ compression algorithm, we define $S^{lz} = -\sum_{\Lambda' \subset \Lambda} P(\Lambda') \log_2 P(\Lambda')$, where $P(\Lambda')$ is the probability of finding the time-ordered sub-sequence $\Lambda'$ in the sequence Λ. Note that this estimator is used to account for the ordered sequence of values (i.e., the memory of the system) [106]. Similarly, $S^{bwt}$ is computed based on the Burrows–Wheeler Transform (BWT) [2], in which a sequence of values Λ is rearranged into runs of similar symbols (Ω), which are then chunked into sub-sequences (ω). Finally, we define $S^{bwt} = -\frac{1}{n} \sum_{\omega \in \Omega} \log_2 \hat{q}(\omega)$, where $\hat{q}(\omega)$ is the entropy of the chunk ω. In other words, the uncorrelated form of entropy quantifies the amount of randomness in a sequence based only on the probabilities of the events. Furthermore, while the LZ algorithm estimates the entropy of a sequence Λ in terms of the probabilities of finding repeated sub-sequences $\Lambda'$ in Λ, the BWT estimator is based on a block-sorting algorithm which converges to the real entropy by averaging the entropy of equally sized chunks of the block-sorted input sequence. For a complete description of how to compute $S^{lz}$ and $S^{bwt}$, we invite the reader to refer to [37] and [13] respectively.

The maximum predictability ($\Pi^{max}$) is defined as the best possible prediction of an entity's state $x_i$ given a state $x_j$. Note that in our current problem formulation we are estimating $\Pi^{max}$ of state $x_i$ given $x_{i-n}$, where n ranges over all possible previous events. We compute $\Pi^{max}$ from the entropy S of a given sequence of events X by solving Fano's inequality [31].
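As an illustration, the following sketch computes the time-uncorrelated (Shannon) entropy of a symbol sequence and then solves the Fano relation $S = H(\Pi) + (1 - \Pi) \log_2(N - 1)$ numerically for Π. It is a minimal sketch and not the code used in this study: the function names are hypothetical, N is assumed here to be the number of distinct symbols, and bisection is only one convenient root-finding choice. The compression-based estimators (LZ, BWT) are not shown.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Time-uncorrelated entropy in bits, from symbol frequencies only."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def max_predictability(entropy, n_states, tol=1e-9):
    """Solve S = H(Pi) + (1 - Pi) log2(N - 1) for Pi by bisection.

    Assumption: N (n_states) is the number of distinct symbols.
    The right-hand side decreases from log2(N) at Pi = 1/N to 0 at Pi = 1,
    so there is a single root in [1/N, 1].
    """
    if n_states < 2:
        return 1.0
    # Clamp the entropy to the range the equation can represent.
    entropy = min(max(entropy, 0.0), math.log2(n_states))

    def rhs(p):
        return binary_entropy(p) + (1 - p) * math.log2(n_states - 1)

    lo, hi = 1.0 / n_states, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if rhs(mid) > entropy:
            lo = mid  # predictability must be higher to reach this entropy
        else:
            hi = mid
    return (lo + hi) / 2
```

For the alternating sequence "ABABABAB", `shannon_entropy` gives 1 bit and `max_predictability(1.0, 2)` gives approximately 0.5, even though the order of events makes the sequence fully predictable; this is the gap that the time-correlated estimators are meant to close.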

1 This work was carried out in collaboration with Leonardo Tonetto at Technical University of Munich (TUM), with help from Aaron Yi Ding and supervised by Jörg Ott.

Formally, for a sequence of N events $X = \{x_1, \ldots, x_n\}$ and entropy S, Π is given by $S = H(\Pi) + (1 - \Pi) \log_2(N - 1)$, where $H(\Pi) = -\Pi \log_2 \Pi - (1 - \Pi) \log_2(1 - \Pi)$. Therefore, solving this non-linear equation allows us to find a candidate for $\Pi^{max}$. In this way, from $S^{unc}$ we can estimate $\Pi^{unc}$, which could be interpreted as the maximum predictability an algorithm could achieve if guessing future events simply based on the probabilities of these events. Similarly, from $S^{real}$ (LZ or BWT) we can estimate $\Pi^{real}$, the maximum achievable predictability if the order of events is also taken into account. For a complete description of theoretically linking entropy with an upper bound of predictability, we invite the reader to refer to [104].

4.3 Prediction Algorithms

As introduced in Section 2.3.1.2, we analyze well-known sequence prediction algorithms, namely, Markov Chain (MC) based predictors, long short-term memory (LSTM) and 1D convolutional neural network (1D CNN). A quick summary of these algorithms is provided next.

Markov Chain-based predictor

A Markov chain (MC) with a discrete state space has been applied for user mobility prediction [75, 107]. In an order-k Markov predictor, the state space consists of tuples of k location names (e.g., AP), where the next location prediction depends solely on the most recent preceding k-tuple. We build the model on the data so that observed k-tuples comprise the states. The transition probabilities are learned based on the frequency of appearances of such a transition in observations. The probability for a transition from the current state $S = X_i X_{i+1} \ldots X_j$ to $X_{i+1} X_{i+2} \ldots X_j X_{j+1}$, where $j - i + 1 = k$ and each $X_i$ is the symbol for a location, is represented as $P(X_{j+1} = c \mid S = X_i X_{i+1} \ldots X_j)$ for all c observed in the data, and is learned based on the reappearance frequency of such a sequence. If the predictor of order k encounters a new sequence that it has never seen before, it falls back to the lower order k − 1, recursively. The base case is O(0), which is simply the frequency distribution of all symbols observed so far.
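The order-k predictor with recursive fallback described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the class and method names are hypothetical, and ties between equally frequent successors are broken by first occurrence, which is an assumption.

```python
from collections import Counter, defaultdict


class MarkovPredictor:
    """Order-k Markov predictor that falls back to lower orders, down to O(0)."""

    def __init__(self, k):
        self.k = k
        # counts[m][context] counts the successors of each context of length m.
        # The empty context (m = 0) holds the overall symbol frequencies, O(0).
        self.counts = [defaultdict(Counter) for _ in range(k + 1)]

    def update(self, history, nxt):
        """Record that symbol `nxt` followed `history` (most recent symbol last)."""
        for m in range(self.k + 1):
            if len(history) >= m:
                context = tuple(history[len(history) - m:])
                self.counts[m][context][nxt] += 1

    def predict(self, history):
        """Return the most frequent successor of the longest context seen so far."""
        for m in range(min(self.k, len(history)), -1, -1):
            context = tuple(history[len(history) - m:])
            seen = self.counts[m].get(context)
            if seen:
                return seen.most_common(1)[0][0]
        return None  # nothing has been observed yet
```

After training an order-2 predictor on the sequence A, B, C repeated three times, the unseen context (X, A) falls back to the order-1 context (A,), whose only observed successor is B.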

Deep learning

Recent approaches to sequence prediction use deep Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN). Recurrent neural networks have loops within their cells, allowing information to persist and thus enabling the network to connect previous information to make a reasonable prediction of the future. An RNN is a supervised model that is trained to predict the next symbol in a sequence. Certain types of RNNs are capable of learning long-term dependencies. There are multiple variants of RNNs, including Long Short-Term Memory (LSTM) [52] and Gated Recurrent Unit (GRU) [20]. In an LSTM, feedback across hidden layers captures order in time and sequencing of mobility behavior in our data. These networks can learn dynamic temporal patterns and have successfully been applied in speech recognition, text-to-speech engines and next-location prediction [62, 101]. Another type of neural network, the Convolutional Neural Network (CNN), learns convolutional filters to extract latent information across the data (i.e., 1D CNNs learn different temporal locality patterns) and uses that information for predicting the next location. In our study, we use a multi-layer LSTM and a 1D CNN to predict movements of users based on input tuples similar to those used for MC-based predictors. Neural networks are computationally expensive and require hyper-parameter tuning. Thus the deep models are run only on a sample of users in this study. One goal of this study is to analyze the payoff (and cost) of adding complexity to the predictor (e.g. LSTMs), versus the simpler MC-based predictors.

4.4 Experimental Setup

To study the regularity of human behavior, we performed a data-driven analysis applying our methods to university campus WiFi traces from the University of Florida. This dataset was initially introduced in Chapter 2. The logs were collected from networks providing wireless access to a large number of portable devices via access points deployed in non-residential areas, including classrooms, computer laboratories, libraries, offices, administrative premises, cafeterias, and restaurants. Every trace entry contains a unique user identifier ('uuid'), a timestamp and an access point unique identifier ('apid'). Based on the 'apid' string we are able to identify the building as well as the room in which an access point (AP) was located. Only the geographical coordinates of buildings are known. Table 4-1 contains a brief summary of the dataset with mean (µ) and standard deviation (std), where Nap is the number of unique access points observed per device, Nday the number of unique days with at least one record, Nrec the number of records during data collection, and the total is the number of devices available for at least 7 days that accessed more than 5 APs.² We also employ traces from KTH Royal Institute of Technology, with similar features to the UF traces, collected for 4 months (September-December 2014) from 930 APs in 48 buildings. The KTH dataset lacks the device type classification, thus it is only used in the subset of the analysis that does not consider device types. All collected WiFi traces are processed as discrete time series, defined next.

4.4.1 Discrete-time Series

Given a set of time-ordered events $X = \{x_t : t = 1, \cdots, n\}$, where $x_t$ is the realization of X at time t for t ∈ T, we say that a time series is discrete if T consists of measurements taken at successive times spaced at uniform intervals w, also referred to as the sampling rate (defining the temporal granularity). Fig. 4-1 depicts an example of how the real location of a device is sensed by the wireless management system through AP associations (red stars) and finally how the discrete-time series is obtained. For a given sampling time window w, our discrete-time series may result in different sequences depending on whether we choose an AP or a building as the level of spatial resolution.

2 Transient devices are not counted to ensure the analysis is carried out on devices that are mobile and benefit from predictive systems the most, while stationary devices (e.g. plugged-in Cellos) and guests that never return to campus are ignored.

Figure 4-1. Discrete-Time Series: Location of the device is sampled at a constant rate.
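The sampling shown in Fig. 4-1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function name, the tuple layout and the use of seconds are hypothetical; each window of length w is represented by the location with the longest dwell time in that window; a stay is cut off after t_max, after which the device is treated as 'unknown'; and the last association is assumed to last until t_max or the end of the observation period.

```python
from collections import defaultdict

UNKNOWN = "unknown"


def discretize(associations, w, t_max, start, end):
    """Turn time-sorted (timestamp, location) pairs into one symbol per window.

    All times are in the same unit (e.g. seconds). Assumption: a stay lasts
    from its association until the next association, capped at t_max.
    """
    n_steps = max(0, -(-(end - start) // w))  # ceiling division
    dwell = [defaultdict(float) for _ in range(n_steps)]
    for i, (t, loc) in enumerate(associations):
        nxt = associations[i + 1][0] if i + 1 < len(associations) else end
        stay_start = max(t, start)
        stay_end = min(nxt, t + t_max, end)
        step = int((stay_start - start) // w)
        # Split the stay across the windows it overlaps.
        while stay_start < stay_end:
            window_end = start + (step + 1) * w
            segment_end = min(stay_end, window_end)
            dwell[step][loc] += segment_end - stay_start
            stay_start = segment_end
            step += 1
    # Pick the location with the largest dwell time in each window.
    return [max(d, key=d.get) if d else UNKNOWN for d in dwell]
```

For example, with associations at AP1 (t = 0), AP2 (t = 100) and AP1 (t = 400), w = 300, t_max = 600 and an observation period of 0 to 1500, the resulting sequence is AP2, AP1, AP1, AP1, unknown. The same function applies at building level if AP identifiers are first mapped to buildings.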

In Fig. 4-1, for the first 4 time steps the device switched its associated AP without a real location change. This switch in AP association can be triggered by the mobile device (e.g. stronger wireless signal) or by the network management system (e.g. load balancing). Note that it is important to define the resolution for space and time, i.e., how big a location is in space (or point-of-interest) and how often we are going to sample from the input signal. In this example, larger values of w could eliminate this 'ping-pong' effect of switching between APs without actually moving, but also cause loss of information when the user transits from one location to another. On the contrary, very small values of w could over-sample long periods when the user is not moving. Similarly, different values of spatial resolution could mitigate noise but eliminate information from the traces. Choosing these parameters is often influenced by the characteristics of the available dataset as well as the targeted application of the study.

Step value

A weighting mechanism is used to pick the location that represents a time step. During a time interval, we weigh every observed location of the device by the duration of time spent at that location and pick the one with the highest weight to represent that step. We assign a user to a specific location ℓ in the time interval δt between an association at ℓ and the next association at any other location, but only if δt < t_max. After t_max the device will be in an 'unknown' state [106] until the next network event, which will reveal its location for future steps.

4.4.2 Experiment Dimensions

The design of our experiments is based on our study's questions:
i. How different are Flutes and Cellos in terms of predictability?
ii. How does the predictability of these device types change with different spatio-temporal granularity?
iii. Does the choice of method or predictor significantly alter the answers to the aforementioned questions?
Thus, we evaluated a matrix involving combinations of the following dimensions:

• Device Types: Flutes vs. Cellos.

• Temporal Resolutions: 5 min, 15 min, 30 min, 1 hour and 2 hours.

• Spatial Resolutions: Access Points, and Buildings.

• Methods: A. Well-known sequence prediction algorithms from the machine learning literature (Markov Chains, Neural Networks). B. Entropy-based estimations of predictability upper bounds.

The experiments were implemented in Python; the neural networks were implemented using TensorFlow [1] and Keras. Training is carried out in an online manner, and the evaluation is performed by providing a sliding window of k observations to the predictor and testing the correctness of its prediction of the next symbol. The fraction of correct next-symbol predictions is the prediction accuracy metric.

4.5 Mobility Analysis

4.5.1 Overview

Before feeding our WiFi traces into the analysis tools, we modeled the network associations into a series of discrete events. This step allows us to reduce the inherent complexity of continuous time, as we describe below. For that, we model our traces as a 'discrete-time signal' [106] as well as 'state transition/event-based' [41]. In short, the former samples the original traces at constant time intervals, assigning a sensed value to each time step, and the latter defines possible states from the input traces and generates a new sequence (or chain) of states. A discrete-time series, when modeling human mobility, might be limited in representing the likelihood of people remaining at their current location. This problem is accentuated when smaller sampling rates are used and the sampling rate is no longer comparable to the frequency of location changes. Given the environment of our university campus WiFi, users tend to sit for lectures or study at libraries for up to a couple of hours; therefore, the choice of the spatial and temporal scales will affect the maximum predictability of devices. Thus, the high predictability values reported in some studies (e.g. [106]) may have low applicability, in spite of their importance in devising the use of entropy for modeling the predictability of human mobility. The state transition model does not overestimate static periods, and the meaning of each state is more interpretable. That is, a device disconnected from the system for several days between periods of activity will have a single "unknown" state representing this period of inactivity. Similarly, long periods of stay at any location will not generate a series of repeated states which could bias the predictability results. This model has a few drawbacks, though. Given that the number of states is directly proportional to the level of activity of a user (i.e., it is determined by how often location changes happen instead of by a fixed sampling rate), in this model it might not be possible to precisely point out which state the user was in at a certain time. Note that in both models the prediction of a future event will be based on knowledge about its previous n events.
In our studied problem, in the discrete-time model, events are locations sampled at regular time intervals, and therefore two consecutive steps may or may not be the same depending on the sampling rate and spatial granularity chosen. In the state transition model, by contrast, any two consecutive states must be different. We present and discuss these two models before presenting the results for each of them.

It is important to define the resolution for space and time, i.e., how big a location is in space (or point-of-interest) and how often we are going to sample from the input signal. Larger values of the temporal window size (w) could eliminate the 'ping-pong' effect of switching between APs without actually moving, but also cause loss of information when the user transits from one location to another. On the contrary, very small values of w could over-sample long periods when the user is not moving. Similarly, different values of spatial resolution could mitigate noise but eliminate information from the traces. Choosing these parameters is often defined by the available dataset as well as the targeted application of the study. For our analysis, we apply a weighting-based approach to pick the sensed location which is going to represent a time step. During a temporal window of size w, we weigh every observed location of the device by the duration of time spent at that location and pick the one with the highest weight to represent that step. We assign a user to a specific location ℓ in the time interval δt between an association at ℓ and the next association at any other location, but only if δt < t_max. After t_max the device will be in an 'unknown' state [106] until the next network event, which will reveal its location for future steps.

To evaluate the predictors, we use an online training approach. The accuracy is defined as the fraction of next symbols correctly predicted. For Markov chains, we find that the best order is 1, followed by 2, 3 and 4. This is in line with findings of prior research. In the case of higher-order MCs, the predictor cannot make a prediction if it has never seen a sequence in the input, so it falls back to lower-order MCs. We find that the algorithm falls back to lower-order MCs most of the time. Thus the results are at best on par with lower-order MCs, while being computationally more expensive. Overall, it turns out that the more previous symbols the chain relies on, the less accurate it becomes. This behavior is consistent with findings of previous research [68]. We find that when the MC is allowed to fall back to O(1), the prediction accuracy goes up for 76.68% of UF users. If we allow fallback to O(0), i.e. just counting the frequency of symbols seen so far and returning the most probable one, this number increases to 97.75%. The 90th percentile of accuracy gains is 5.08%, which is a significant boost given the low predictability of most users. To measure the robustness of the MC, different random subsets of data were omitted. The results show that accuracy varies significantly between runs. On the other hand, LSTMs not only achieve higher accuracy overall but also have better robustness, with different runs using random subsets producing similar results, as opposed to the MC algorithm. For the neural networks, there is a lot of potential in parameter tuning to find a good balance of accuracy and compute time.

Comparing prediction accuracy and theoretical predictability limits, our initial experimental results show that the real-world prediction accuracy numbers are significantly lower than the maximum predictability numbers estimated using entropy. The results of the LSTM model are also lower than the theoretical upper bounds of predictability. There are two reasons: 1) These theoretical upper bounds significantly overestimate the prediction accuracy possible. 2) The prediction algorithms need to be better tuned for this kind of task. The theoretical upper bounds are simply not achievable using today's state-of-the-art algorithms. The current methods to estimate the upper bound seem out of touch with reality, and a better, tighter upper bound needs to be devised.

We also compared results of using different temporal windows, w, from 5 minutes to 2 hours. We found that using a smaller w significantly increases the predictor results. While the theoretical limit is higher in this case, the difference between theoretical and experimental numbers is much smaller. Very small values of w could over-sample long periods when the user is not moving; thus there are many repeated sequences of the same symbol, leading to inflated prediction accuracy results. Predictability as presented here, based on entropy, is boolean (either right or not), but different metrics for accuracy could be used to evaluate a predictor, such as the distance between the predicted next AP and the actual next AP.

4.5.2 Spatio-temporal Resolutions

To answer the first two questions of this study, particularly ”ii. How does the predictability of these device types change with different spatio-temporal granularity?”, Table 4-2 summarizes the median accuracy of an LSTM predictor for Flutes and Cellos with different spatial and temporal granularity.

Table 4-2. Median accuracy (%) of LSTM (sequence length 40) for Flutes (F) vs. Cellos (C), 5 min to 2 h temporal and AP/Bldg spatial granularity.

            AP                   Building
            F         C          F         C
5 min       33.22     42.25      44        63.4
15 min      21.42     36.9       34.53     58.06
30 min      21.88     27.39      39.56     50.78
1 hour      19.67     24.33      32.62     52.03
2 hour      17.17     22.5       32.6      59.62
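The accuracy values in Table 4-2 are fractions of correct next-symbol predictions. As a rough sketch of how such a metric can be computed in an online fashion (a sliding window of k observations is given to the predictor, the prediction is scored, and only then is the true symbol revealed), consider the harness below. The function names and the 'stay' baseline are hypothetical and are not the predictors evaluated in this study; the baseline is included only to show how runs of repeated locations raise accuracy.

```python
def online_accuracy(sequence, k, predict, update):
    """Fraction of correct next-symbol predictions over a sliding window of k symbols.

    predict(window) returns a guess for the next symbol; update(window, actual)
    lets the predictor learn from the revealed symbol (online training).
    Requires k >= 1 for the baseline below.
    """
    correct = 0
    total = 0
    for i in range(k, len(sequence)):
        window = tuple(sequence[i - k:i])
        actual = sequence[i]
        if predict(window) == actual:
            correct += 1
        total += 1
        update(window, actual)  # learn only after the prediction was scored
    return correct / total if total else 0.0


def predict_stay(window):
    """Hypothetical baseline: guess that the device stays at its latest location."""
    return window[-1]


def no_update(window, actual):
    """The stay baseline has nothing to learn."""
```

On the toy sequence AAAABBBB with k = 1, the stay baseline is correct on 6 of 7 steps although it never anticipates a move, while on the alternating sequence ABABAB it is never correct. Finer temporal sampling lengthens runs of the same symbol, which is one reason accuracy rises at small time windows.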

The choice of granularity is application-dependent; for example, to predict foot traffic at buildings and for congestion planning based on density, building-level analysis is more appropriate. Cellos show more predictable behavior overall, as the fraction of correct next-symbol predictions is higher for Cellos across the board. At the AP level, with longer time bins, the accuracy for both Flutes and Cellos decreases. This observation is in line with previous findings [104]. At 15min time intervals, the difference between Flutes and Cellos is at its maximum; it then drops and remains stable for longer time intervals. At the building level, the accuracy follows a less regular pattern, but both Flutes and Cellos are most predictable at 5min intervals (due to repeats of the same location in the sequence). Cellos' accuracy drops for 30min bins and goes back up again. On the other hand, Flutes are more predictable in 30min bins than in 15min, 1h or 2h bins. Looking across all temporal bins, Fig. 4-2 presents the empirical cumulative distribution function (ECDF) of prediction accuracy at AP and building spatial granularity. The "sit-to-use" Cellos show significantly higher predictability at every percentile; this is reasonable given their lower mobility [6] and mode of usage. In fact, prediction accuracy is highly correlated with other mobility and network traffic features of mobile wireless users; we will take a brief look at these correlations in Section 4.6 and Fig. 4-3.

4.5.3 Comparison of Methods

Figure 4-2. ECDF of LSTM Prediction Accuracy for Flutes & Cellos at AP and Building spatial levels (all temporal levels combined, vertical lines denote medians, sequence length 40). [Plot not reproduced; y-axis: Fraction of users; x-axis ticks from 0 to 100; curves: Flutes (AP), Cellos (AP), Flutes (Bldg.), Cellos (Bldg.).]

To answer the third question of this study, “iii. Does the choice of method or predictor significantly alter the answers to aforementioned questions?”, here we compare the experiment results for different methods: 1) MC: Markov Chain 2) LSTM: A type of recurrent neural network 3) CNN: 1D Convolutional Neural Network 4) Hr_LZ: Theoretical predictability based on the Lempel-Ziv (LZ) entropy estimator 5) Hr_BWT: Theoretical predictability based on the Burrows-Wheeler transform (BWT) entropy estimator. A summary of comparisons is presented in Table 4-3, for temporal granularities of 1h and 15min, highlighting the difference Cellos − Flutes. In all cases Cellos are more predictable than Flutes, regardless of the choice of method (with the minor exception of the LZ predictor at the 15min, building level, which might be due to the intrinsic instability of the LZ-based estimator). The difference in median accuracy between Flutes and Cellos is up to 25 percentage points (Building level, 15min window, sequence length 40, Flutes 33.97% vs Cellos 59.03%). Other temporal choices result in a similar pattern. Another notable observation is that while the neural networks are more complex and require vastly more computing power, they only achieve a modest increase over Markov Chains in some scenarios (e.g., Cellos, at the Bldg. level and Seq. Len. 40, from 48.56% to 52.5%). This is a trade-off that needs to be considered in the design of predictive caching systems. In addition, increasing the sequence length k (i.e. the number of previous time steps available to the predictor) impacts the Markov Chain model more than the neural networks. This is particularly pronounced for the 15min time window; in the 1h time window, the neural networks do not lose much accuracy when the sequence length increases from 5 to 40. Also, the theoretical LZ and BWT based estimators show higher upper bounds than the best of the algorithms, with Seq. Len. 5 Markov Chains and CNNs being the closest practical algorithms for the 15min case. The predictors are far behind in the 1h case, suggesting room for improvement via tuning for specific time and space granularities. The run time of LSTM is the longest, followed by CNN (not shown for brevity).

Table 4-3. Summary of Median Accuracy for Flutes vs Cellos with different methods (Diff is Cellos − Flutes) and sequence lengths for 15min and 1h time windows.

Seq Len  Predictor  AP, 1h                  Bldg., 1h               AP, 15min               Bldg., 15min
                    F      C      Diff      F      C      Diff      F      C      Diff      F      C      Diff
5        MC         21.05  25.95  +4.90     38.25  53.50  +15.25    61.72  70.30  +8.58     75.00  87.60  +12.60
5        LSTM       21.62  25.00  +3.38     35.03  50.00  +14.97    40.00  44.56  +4.56     52.44  65.56  +13.12
5        CNN        16.45  24.27  +7.82     34.94  50.00  +15.06    50.00  59.80  +9.80     64.60  76.94  +12.34

10       MC         17.98  25.6   +7.62     36.72  50.28  +13.56    52.25  61.97  +9.72     68.00  82.25  +14.25
10       LSTM       20.83  26.31  +5.48     37.50  50.66  +13.16    31.14  44.62  +13.48    45.38  64.56  +19.18
10       CNN        18.06  22.62  +4.56     36.20  52.03  +15.83    49.20  58.80  +9.60     64.56  74.00  +9.44

20       MC         18.1   24.52  +6.42     36.28  49.94  +13.66    38.50  48.22  +9.72     57.30  74.94  +17.64
20       LSTM       21.22  24.19  +2.97     36.12  50.78  +14.66    29.17  41.00  +11.83    43.62  61.47  +17.85
20       CNN        18.44  23.60  +5.16     35.28  50.00  +14.72    37.84  48.12  +10.28    50.00  65.00  +15.00

40       MC         17.88  23.61  +5.73     35.1   48.56  +13.46    27.97  31.00  +3.03     47.12  65.80  +18.68
40       LSTM       19.67  24.33  +4.66     32.62  52.03  +19.41    23.30  39.40  +16.10    33.97  59.03  +25.06
40       CNN        18.75  23.97  +5.22     35.25  52.50  +17.25    27.62  44.70  +17.08    41.25  62.10  +20.85

-        LZ         46.90  52.60  +5.70     58.78  66.40  +7.62     72.70  76.06  +3.36     79.60  79.10  -0.50
-        BWT        66.44  69.44  +3.00     73.70  79.90  +6.20     83.30  88.06  +4.76     88.60  92.20  +3.60
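To make the Markov Chain baseline in Table 4-3 concrete, the following is a minimal sketch of an order-k next-symbol predictor, scored as the fraction of next symbols predicted correctly. It is an illustration under stated assumptions, not the exact implementation used in our experiments: the function names, the use of k as the Markov order, and the fallback to the most frequent training symbol for unseen contexts are all hypothetical choices.

```python
from collections import Counter, defaultdict


def fit_markov_chain(train, k):
    """Count which symbol follows each length-k context in the training sequence."""
    transitions = defaultdict(Counter)
    for i in range(k, len(train)):
        transitions[tuple(train[i - k:i])][train[i]] += 1
    return transitions


def next_symbol_accuracy(train, test, k=1):
    """Fraction of next symbols in `test` predicted correctly by an order-k chain.

    Assumes `train` is non-empty. Unseen contexts fall back to the most
    frequent training symbol (an assumption made for this sketch).
    """
    transitions = fit_markov_chain(train, k)
    fallback = Counter(train).most_common(1)[0][0]
    correct = 0
    total = 0
    for i in range(k, len(test)):
        followers = transitions.get(tuple(test[i - k:i]))
        predicted = followers.most_common(1)[0][0] if followers else fallback
        correct += int(predicted == test[i])
        total += 1
    return correct / total if total else 0.0


# Hypothetical location symbols (e.g., AP or building identifiers per time bin).
train = ["A", "A", "B"] * 3
print(next_symbol_accuracy(train, ["A", "A", "B", "A", "A", "B"], k=2))
```

Note that a sequence dominated by repeated locations is easy for such a predictor, which is one reason short time bins yield high accuracy.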

4.5.4 Correlations with Mobility and Network Traffic

Figure 4-3 shows the correlation of prediction accuracy with a sample of features that describe the mobility or network traffic of users. PDT(W/E) and TJ(W/E) are mobility features, while AAT(W/E) and AI(W/E) are traffic features. PDTW is the time spent at the user’s preferred (most common) building on weekdays (PDTE for weekends). TJW is the total sum of jumps (distance) on weekdays, while TJE describes the same feature for weekends. AATW is the average active time (as indicated by network usage) of the user on weekdays (AATE for weekends). AIW stands for the average inter-arrival time of flows on weekdays, and AIE for weekends ([5, 6]).

Feature   Flutes   Cellos
PDTW       0.48     0.51
PDTE       0.24     0.49
TJW       -0.51    -0.26
TJE       -0.26    -0.17
AATW       0.18    -0.43
AATE       0.19    -0.42
AIW        0.25     0.41
AIE        0.14     0.41

Figure 4-3. Pearson Correlation of Prediction Accuracy with several Mobility and Network Traffic Features.

The results show significant correlations between prediction accuracy and not only the mobility features but also the network traffic features. These correlations vary across device types (Flutes vs Cellos) and in time (weekdays vs weekends). This is a very important observation for the design of predictive caching systems: it might be possible to improve prediction of where the user is going based on the network traffic profile, while noting the different modes of usage based on device types. We leave the investigation of such improvements to future work.
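For reference, the Pearson coefficient reported in Figure 4-3 can be computed per feature as in the short sketch below. The per-user values are made up for illustration, and the function name is hypothetical; only the formula itself is standard.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-user values: prediction accuracy (%) and total weekday jumps (TJW).
accuracy = [62.0, 55.0, 40.0, 33.0, 21.0]
tjw = [120.0, 300.0, 650.0, 900.0, 1500.0]
print(round(pearson(accuracy, tjw), 2))  # negative: more movement, lower accuracy
```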

4.6 Summary and Future Work

Summary

In this study, we defined our research problem as predicting the next symbol in a discrete-time series for users with two categories of devices. The accuracy is evaluated as the fraction of the next symbols predicted correctly. We sought to answer three questions: i. How different are Flutes and Cellos in terms of predictability? ii. How does the predictability of these device types change with different spatio-temporal granularity? iii. Does the choice of method or predictor significantly alter the answers to the aforementioned questions? For this purpose, we processed a large-scale dataset from a campus environment and grouped the devices into two categories. We chose a set of methods to make the comparisons, including entropy-based estimators and popular algorithms such as Markov Chains and neural networks. The results of the experiments show that the movements of Cellos (“sit-to-use”) are significantly more predictable than those of Flutes (up to 25% difference in accuracy). This pattern is consistent across various temporal granularities (5 min to 2 hours), spatial granularities (Access Point and Building level), and different methods (Markov Chains, neural networks, entropy-based estimators). We illustrate that the performance of predictors depends strongly on the span of temporal bins. Markov Chains tend to outperform deep learning models in shorter time bins, while LSTMs and CNNs usually show higher accuracy in longer time bins. CNNs have mostly similar accuracy to LSTMs in the latter case but have significantly better run time on a modern GPU. We also found significant correlations among prediction accuracy, mobility features, and network traffic features, an important observation for the design of predictive caching systems, where it might be possible to improve mobility prediction based on the network traffic profile. While some earlier studies investigated a similar problem setup, our study has notable implications.
For example, across device types, predictability can vary significantly. Also, with larger time windows such as 1 hour, it is easy to miss short stays (since one location visit with a duration of 31 minutes would result in other locations in that 1h window being ignored). On the other hand, a short time window results in multiple repetitions of the same location in the sequence, potentially achieving high prediction accuracy even when the method is not predicting the ’transitions’ well. It is important to consider the device type, context, and application in order to choose an appropriate time and space granularity; the best performing method differs across these dimensions. Moreover, the measured accuracy only counts an exact match as correct, so a prediction of a location near the actual one still counts as incorrect. We plan to investigate measuring how far a predicted location is from the actual location and to embed that information in the loss function of our neural networks for possible improvements in prediction.

Extension to prediction of network traffic

With an approach similar to the mobility analysis, traffic events can be modeled as discrete-time or event-based time series. Each event corresponds to a flow, i.e. a user accessing a website. In an event-based model, these events are sorted in time for each user. In the discrete-time case, the weighting mechanism assigns weights based on the total traffic volume instead of time. Therefore, during a temporal window of size ’w’, we weigh every observed IP access of the device by the traffic size of the flow and pick the IP with the highest sum of bytes to represent that step. Given the massive size of this dataset (30TB, discussed in Section 2.1.2), this analysis is much more computationally demanding.

Integrated mobility-network traffic modeling

Given the observed correlations, we hypothesize that use of ’predictability’ as a feature in an integrated mobility-traffic generative model could lead to more realistic synthetic traces. Such a data-driven generative model would be an essential tool for network simulations and capacity planning. Notably, it can also be made privacy preserving, since collected traces would be replaced with realistic synthetic data that captures mobility, network traffic, predictability, and their relationships. Further study is beyond the scope of this work and is left for the future.

CHAPTER 5
LEARNING THE RELATION BETWEEN MOBILE ENCOUNTERS AND WEB TRAFFIC PATTERNS

Mobility and network traffic have been traditionally studied separately, and at an individual level. Their interaction, as well as their patterns at encounter and group levels, are vital factors for future generations of mobile services and effective caching, but have not been studied in depth with real-world big data. In this chapter, we characterize mobility encounters and study the correlation between encounters and web traffic profiles using large-scale datasets of WiFi and NetFlow traces, introduced in Chapter 2. The analysis quantifies these correlations for the first time, across spatio-temporal dimensions, for device types grouped into on-the-go ’Flutes’ and sit-to-use ’Cellos’. The results consistently show a clear relation between encounters and traffic across different buildings over multiple days, with encountered pairs showing higher traffic similarity than non-encountered pairs, and long encounters being associated with the highest similarity. This provides a compelling case to integrate both dimensions in future models, not only at an individual level, but also at ’pairwise’ and collective levels. We also investigate the feasibility of learning encounters through web traffic profiles, with implications for dissemination protocols and contact tracing. We have released samples of code and data used in this study on GitHub, to support reproducibility and encourage further research (https://github.com/BabakAp/encounter-traffic).

In our analysis of mobility and network traffic correlations, published in [6], we contrasted various mobility and traffic features of ’Flutes’ and ’Cellos’, including radius of gyration, location visitation preferences, and flow-level statistics. But that study only investigated mobility and traffic features across the ’individual’ dimension.
We argue that the relation between mobility and traffic needs further in-depth analysis, as it will likely be at the center of many future mobile services. In this chapter, we focus on the ’pairwise’ (encounter) dimension of mobility and study its interplay with the traffic patterns of mobile users. Aside from the importance of this study in realistic modeling, simulation and performance evaluation of next generation networks, it is quite relevant to encounter-based services, e.g., content sharing, opportunistic networking, mobile social networks, and encounter-based trust and ([25, 78]), to name a few.

We use extensive traces from our collected datasets (with 76B records, and ≈30TB in size) covering over 78K devices in 140 buildings on a university campus. The data includes information about WiFi associations, as well as DHCP and NetFlow traces, covering the dimensions of mobility and network traffic. The data is sanitized and categorized based on buildings, days, device types (’Flutes’, ’Cellos’), and encounter duration; the analysis is then done across all these dimensions. The full description of these datasets and the experimental setup can be found in Section 2.1. Because the cost of ’pairwise’ traffic analysis grows quadratically with the number of users, for this study we focus on the 10,000 most active users in terms of traffic consumption, to keep computations manageable.

Using the device type classification described in Section 2.2.1, pairwise encounters are categorized into the following groups: 1) Flute-Flute (’FF’): an encounter event between two flutes. 2) Cello-Cello (’CC’): both devices in the pair are cellos. 3) Flute-Cello (’FC’): an encounter event between a flute and a cello. Pairwise user mobility behavioral patterns are represented through the patterns of encounters between two mobile nodes. An encounter occurs when two user devices are associated with the same AP during overlapping time intervals. The ’Encounter’ traces are generated based on WLAN logs. An example of a pairwise encounter record, constructed from WLAN traces, is shown in Table 5-1.

The main question addressed in this chapter is ‘How do device encounters affect network traffic patterns, across time, space, device type and encounter duration?’ For that purpose, we: i) analyze mobility encounter patterns, ii) define web traffic profiles for users, and iii) look at their interplay.
This question has not been directly studied in depth before, and our findings are quite surprising: for the majority of buildings, a consistent correlation exists between traffic profiles and encountered (vs. non-encountered) pairs of users. We also found the correlation to be the strongest for ’Cello-Cello’ encounters on weekends. Further, we find that this relation strengthens for long encounters, while short encounters are not significantly different from non-encountered pairs. Finally, we utilized a deep learning model to learn encounters of user pairs in a day and building based on their traffic profiles alone. The model achieved a high accuracy (90%+) in most settings.

These results can potentially impact a variety of applications, including encounter-based services, rumor anatomy analysis, tracing, and any service utilizing prediction of traffic load/demand using encounters, and vice versa. In addition, mobility modeling and protocol evaluation could benefit from deep integration of (and interplay between) encounters and traffic. We hope for this chapter to be the first in a series of studies on mobility and traffic, and their interplay, across individual, ’pairwise’ and collective (group) dimensions, towards fully integrated realistic traffic-mobility models. Integrated mobility-traffic modeling is discussed in Section 3.5 and the collective dimension is left for future work.

5.1 Related Work

The effect of mobility and network traffic on wireless networks has been clearly established in the literature (such as the seminal work in [10]). Several efforts studied models of mobility and network traffic, albeit mostly separately and in isolation. There is a vast body of research focused on mobility or traffic independently, which we cannot possibly cover exhaustively. We refer the reader to [47, 114] for surveys of mobility modeling and analysis. Some of the most advanced studies on mobility [112] have identified individual [54], pairwise (encounter), and collective (group) plans for modeling of the social context. Fig. 5-1 shows the 3 axes of the social context from [112], consisting of individual, pair-wise and collective patterns. That study, however, did not consider traffic. We hope to bridge that gap by analyzing the interplay of mobility and traffic at the ’pair-wise’ level.

Figure 5-1. Social context axes. (Sourced from [112])

Encounters between mobile nodes have been studied in previous research (e.g., in [16, 53]) to characterize opportunities of inter-user encounters. Others (e.g., [26, 55]) mainly collect encounter traces using mobile devices to analyze, model and understand communication opportunities in different settings. None of these studies, however, analyze traffic nor the correlation between encounters and traffic patterns. In this study, we focus on the interplay between traffic and encounters, while considering the score (e.g., duration) of the encounter events. We use extensive data-driven analyses to quantify the correlations between network traffic and encounter scores. Several studies analyzed wireless traffic flows [84] and mobile web-visitation patterns [86]. These studies, however, did not investigate the relation with mobility and node encounters. In addition, many research studies on mobility encounters or traffic patterns did not consider device type. Devices’ form factor affects mode of usage, leading to varied traffic profiles ([3, 18, 30, 38, 68, 76, 93]). But these studies do not examine the interplay of traffic with mobility and encounters. These devices are also used during different modes of transportation. Smartphones and e-readers, which we refer to as ’Flutes’, are devices used ’on-the-go’. On the other hand, laptops are ’sit-to-use’ devices and are referred to as ’Cellos’ in this study.

Table 5-1. Encounter record example

Field                    Value
User1 Mac                7c:61:93:9d:30:2e
User2 Mac                cc:08:e0:34:a7:3e
User1 Assoc. Start       1334503199
User1 Assoc. End         1334503337
User2 Assoc. Start       1334501153
User2 Assoc. End         1334506764
Access Point Mac         bcftr0gb-win-lap1142-1
Encounter Start          1334503199
Encounter End            1334503337
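A record like the one in Table 5-1 can be derived from two association sessions at the same AP by intersecting their time intervals. The sketch below illustrates this under our own assumptions: the session tuple layout and function names are hypothetical, and an encounter is required to have a strictly positive duration. The all-pairs loop is quadratic in the number of sessions, which is acceptable for illustration only.

```python
from itertools import combinations


def encounter_interval(a_start, a_end, b_start, b_end):
    """Return the (start, end) overlap of two association intervals, or None."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start < end else None


def find_encounters(sessions):
    """sessions: list of (mac, ap, start, end) tuples.

    Returns (mac1, mac2, ap, enc_start, enc_end) for every pair of distinct
    devices whose sessions overlap at the same AP.
    """
    found = []
    for (m1, ap1, s1, e1), (m2, ap2, s2, e2) in combinations(sessions, 2):
        if m1 == m2 or ap1 != ap2:
            continue
        overlap = encounter_interval(s1, e1, s2, e2)
        if overlap is not None:
            found.append((m1, m2, ap1, overlap[0], overlap[1]))
    return found


# Timestamps mirror the example record in Table 5-1.
print(encounter_interval(1334503199, 1334503337, 1334501153, 1334506764))
```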

5.2 Mobility Encounters

Pairwise mobile encounter events provide opportunities for dissemination events such as content dissemination [16] and infection spreading through direct encounter [4]. Consequently, designing effective content distribution, routing schemes and infection trace-back approaches requires an understanding of encounters and realistic modeling. While encounter events have been analyzed in several previous studies (e.g. [16, 53]), here we develop new insights into pairwise events by considering the following:
1) Device types: We distinguish between encounters among the three groups in our analyses (’FF’, ’CC’, ’FC’).
2) Large-scale data: The data is the first of its kind in terms of size, covering more than 140 buildings of different categories. Also, we analyze mainly indoor (in-building) encounters, unlike most previous studies.
3) Traffic-encounter analysis: Daily encounter patterns at buildings are analyzed per device type, then their correlation to traffic patterns is studied for the first time in Section 5.4.

5.2.1 Daily Encounter Duration at the Building Level

The pairwise statistical summary of mobility encounters is generated from daily encounter records at each building (footnote 1). The total encounter duration, ’E’, of a pair of users (u1, u2) during day ’d’ at building ’B’ is computed as:

$$E_{Bd}(u_1, u_2) = \sum_{i=1}^{n} E_{Bd}(u_1, u_2)_i,$$

where n is the number of encounters, and $E_{Bd}(u_1, u_2)_i$ is the duration of encounter i between u1 and u2 on day d at building B. The pairs are then separated based on their pair device types.

1 This task was done in collaboration with Mimonah Al Qathrady at University of Florida.
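The daily total above amounts to a grouped sum over encounter records. A minimal sketch follows, assuming a hypothetical record layout of (u1, u2, building, day, start, end) with timestamps in seconds; the unordered-pair key is our own choice so that (u1, u2) and (u2, u1) are counted together.

```python
from collections import defaultdict


def total_daily_encounter_duration(records):
    """Sum encounter durations per (user pair, building, day).

    records: iterable of (u1, u2, building, day, start, end), times in seconds.
    """
    totals = defaultdict(int)
    for u1, u2, building, day, start, end in records:
        pair = tuple(sorted((u1, u2)))  # treat the pair as unordered
        totals[(pair, building, day)] += end - start
    return dict(totals)


# Hypothetical records: two encounters of the same pair at B1, one at B2.
records = [
    ("u1", "u2", "B1", "d1", 0, 30),
    ("u2", "u1", "B1", "d1", 100, 150),
    ("u1", "u2", "B2", "d1", 0, 5),
]
print(total_daily_encounter_duration(records))
```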

Table 5-2. Daily Encounter Duration in Seconds

Pair type          Min.  1st Qu.  Median  Mean    3rd Qu.  Max.   Std.
Flute-Flute (FF)   1     27       84      528.2   367      77170  1417
Cello-Cello (CC)   1     135      844     2061    2834     84580  3244
Flute-Cello (FC)   1     34       169     954.4   855      80660  2021

[Plot not reproduced; y-axis: P[Encounter Duration < X]; x-axis: Encounter Duration in Seconds (0 to 2000); curves: All-Encounters, Flute-Flute (FF), Cello-Cello (CC), Flute-Cello (FC).]

Figure 5-2. Encounter duration CDF based on encounter pair device type.

The daily encounter durations by device type are summarized in Table 5-2. From the table, it is clear that CC pairs have longer encounter durations than other kinds of pairs. For example, the mean CC daily encounter duration is 290% longer than that of FF pairs. This result is useful when modeling encounters based on device type, or for applications that use the encounter duration.
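As a quick check of the 290% figure, using the mean values from Table 5-2:

$$\frac{2061 - 528.2}{528.2} = \frac{1532.8}{528.2} \approx 2.90,$$

i.e., the mean CC duration is about 3.9 times the mean FF duration. The gap is even larger at the median, where $844 / 84 \approx 10$.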

Table 5-3. Best fit distributions for total daily encounter pair duration based on pair classification. Percentage of buildings shown in brackets. (’Pl’: Power Law, ’Ll’: Log Logistic, ’W’: Weibull, ’G’: Gamma, ’Pr’: Pareto, ’Ln’: Log Normal, ’B’: Beta).

Flute-Flute (FF)
  1st best fit:    Pl[74%], Ll[10%], Pr[8%]
  2nd best fit:    Ll[29%], Ln[26%], Pr[19%], W[11%]
  3rd best fit:    Ll[39%], Pr[24%], Ln[21%]
  KS-test<=5%:     Pl[72%], Ll[19%], Ln[15%], Pr[15%]
  KS-test<=10%:    Pl[92%], Ll[92%], Pr[89%], Ln[85%]

Cello-Cello (CC)
  1st best fit:    Pl[39%], W[21%], B[14%]
  2nd best fit:    B[22%], Ll[18%], Pl[16%], W[14%]
  3rd best fit:    Ll[25%], W[24%], Pr[13%], Pl[12%]
  KS-test<=5%:     Pl[34%], W[32%], B[30%], G[16%]
  KS-test<=10%:    Ll[87%], Pl[86%], W[78%], Ln[71%]

Flute-Cello (FC)
  1st best fit:    Pl[67%], Ll[13%], Pr[9%]
  2nd best fit:    Ln[28%], Ll[20%], Pr[19%], W[18%], Pl[10%]
  3rd best fit:    Ll[46%], Ln[16%], Pr[14%]
  KS-test<=5%:     Pl[61%], Pr[12%], Ll[11%], W[9%]
  KS-test<=10%:    Pl[92%], Ll[92%], Pr[80%], Ln[78%]

Figure 5-2 shows the CDF of the encounter duration for 95% of pairs (the highest 5% is omitted for clarity). Note that 80% of FF encounters have a daily encounter duration ≤8 minutes, while only 40% of CC encounters are ≤8 minutes. Across all encounters, 33% are ≤38s (dubbed ’short’), the next 33% are ≤317s (’medium’), and those >317s are ’long’ encounters. This definition will be used in Sec. 5.4 for pairwise analysis.

5.2.2 Encounter Duration Statistical Distributions

Eleven distributions are fit to the total daily encounter duration using maximum likelihood and goodness-of-fit test methods: Power-law (Pl), Weibull (W), Gamma (G), Lognormal (Ln), Pareto (Pr), Normal (N), Exponential (Ex), Uniform (U), Cauchy (C), Beta (B) and Log-logistic (Ll). The Kolmogorov-Smirnov (KS) statistic is used to evaluate the goodness of fit of each distribution. The three best fit distributions are selected and presented in Table 5-3. For example, 74% of the buildings have the power-law (Pl) distribution as the best fit for their ’FF’ pair daily encounter duration, while only 39% of buildings have a ’CC’ daily encounter distribution following power-law. Table 5-3 also shows the percentage of buildings whose KS statistic is below a threshold, specifically ≤ 5% and ≤ 10%. This is calculated to see whether some distribution is a good fit for the majority of buildings, even if it is not the first-best fit. Power-law and log-logistic distributions have a KS statistic ≤10% in 92% of buildings for ’FF’ and ’FC’ pairs.

5.3 Web Traffic Profile

We use ’NetFlow’ traces to analyze the traffic behavior of user devices. In [6], we analyzed traffic on an ’individual’ level. We found ’cellos’ to generate 2x more flows than ’flutes’, while the Flute flows are 2.5x larger. Also, flow sizes were found to follow a Lognormal distribution, while flow inter-arrival times (IAT) follow a beta distribution, with high skewness/kurtosis, hinting at infrequent extreme values (e.g., Flutes incur more extreme periods of inactivity, caused by higher mobility). In this study, we conduct ’pairwise’ (vs. individual) level analysis. To analyze traffic patterns of users for all buildings and days, we define a ’traffic vector’ based on NetFlow traces that is efficient to calculate and granular enough for our analysis:

• First, we select a set of popular websites for analysis based on total bytes sent and received, filtering out websites with little usage. The IP addresses of the selected websites form the dimensions of the traffic vector, denoted as κ. There are ≈10,000 IP addresses in κ, each having exchanged >1GB of data in NetFlow, with average ’daily’ traffic ranging from a few MBs up to 690GBs.

• Next, for each address, IPj, we calculate B(i,j), defined as the natural logarithm of the total traffic that ’user i’, Ui, has sent to or received from IPj. This forms the initial ’traffic vector’ for Ui, consisting of B(i,j), ∀j ∈ κ. We also tried counting the 15-min buckets with data, or simply summing bytes, but log(sum(bytes)) is preferable since it dampens the effect of outliers. Other potential choices include the percentage of active time, packet counts, and frequency of visits to a website, which we leave to future work.

• Finally, we apply term frequency inverse document frequency (TF-IDF [100]) to the collection of traffic vectors of all users. This reduces the effect of wildly popular websites and identifies websites that can distinguish between users’ on-line behavior, enabling us to study the richness in the access patterns. In this context, each IPj is a ’term’ and each user traffic vector is a ’document’. TF-IDF is calculated as the product of the term frequency (the number of times a term appears in a document, corresponding to B(i,j) in our context, which reflects the bytes IPj exchanged with Ui) and the inverse document frequency (the inverse of the number of documents (’users’) the term (’IP’) occurs in) [60]. Each row of the resulting matrix is a ’traffic profile’, TPi, of user Ui, as depicted in Fig 5-3.

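$$
\text{TF-IDF} =
\begin{pmatrix}
 & \vdots & \\
\cdots & B_{(i,j)} & \cdots \\
 & \vdots &
\end{pmatrix}
\quad \text{(row } i \text{ is } TP_i,\ \text{column } j \text{ is } IP_j\text{)}
$$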

Figure 5-3. TF-IDF Matrix: Each row is a user profile.
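The three steps above can be sketched end to end as follows. This is an illustration rather than our production pipeline: the flow tuple layout and function names are hypothetical, the popular-IP filter is assumed to have been applied already, and the IDF term uses the common logarithmic form log(N/df), which is an assumption of this sketch. A cosine similarity helper is included since it is a standard way to compare two such profiles.

```python
import math


def traffic_vectors(flows):
    """flows: iterable of (user, ip, n_bytes). Returns {user: {ip: ln(total bytes)}}."""
    totals = {}
    for user, ip, n_bytes in flows:
        per_user = totals.setdefault(user, {})
        per_user[ip] = per_user.get(ip, 0) + n_bytes
    return {
        user: {ip: math.log(b) for ip, b in ips.items() if b > 0}
        for user, ips in totals.items()
    }


def tfidf_profiles(vectors):
    """Weight each B(i,j) by log(N / df_j), where df_j counts users that accessed IP j."""
    n_users = len(vectors)
    df = {}
    for vec in vectors.values():
        for ip in vec:
            df[ip] = df.get(ip, 0) + 1
    return {
        user: {ip: tf * math.log(n_users / df[ip]) for ip, tf in vec.items()}
        for user, vec in vectors.items()
    }


def cosine(p, q):
    """Cosine similarity of two sparse profiles (0.0 if either has zero norm)."""
    dot = sum(w * q.get(ip, 0.0) for ip, w in p.items())
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0


# Hypothetical flows: u1 and u2 use the same sites, u3 uses a different one.
flows = [
    ("u1", "a", 1000), ("u1", "b", 500), ("u1", "b", 500),
    ("u2", "a", 1000), ("u2", "b", 1000),
    ("u3", "c", 1000),
]
vectors = traffic_vectors(flows)
profiles = tfidf_profiles(vectors)
print(cosine(profiles["u1"], profiles["u2"]), cosine(profiles["u1"], profiles["u3"]))
```

With the logarithmic IDF, an IP accessed by every user receives zero weight, which matches the goal of discounting wildly popular websites.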

This process is applied to the NetFlow data of every building on each day, to enable spatial (across buildings) and temporal (across days) analysis of user traffic profiles (footnote 2). For pairwise comparison of traffic profiles, we use cosine similarity, which computes the cosine of the angle between two user profiles.

5.4 Pairwise Encounter-Traffic Relationship

With mobility encounters and traffic profiles as the pillars, here we take steps to investigate “whether physical encounters are correlated with similarity of traffic profiles”. This analysis outlines our initial findings in the ’pairwise’ (’encounter’) dimension of mobility-traffic analysis, following our work in [6] that focused on the ’individual’ aspect of combined mobility-traffic modeling, and providing the foundation for ’collective’ (’group’) analysis in the future. As a first step, we seek to establish whether the traffic profiles of encountered pairs are more similar than the traffic profiles of non-encountered pairs of users. For this purpose, we calculate

$$\mathrm{enc} = \frac{\sum_{\forall (i,j) \in Z} \mathrm{cosine}(TP_i, TP_j)}{|Z|},$$

where Z denotes the set of all encountered pairs of users. Similarly, for the set of all non-encountered pairs, Z′,

$$\mathrm{nonenc} = \frac{\sum_{\forall (i,j) \in Z'} \mathrm{cosine}(TP_i, TP_j)}{|Z'|}.$$

This calculation is carried out for each building on every day. Overall, we observe that ≈93% of the time ’enc > nonenc’, with the main exceptions being buildings close to bus stop hubs on campus, which have a high pass-by rate of users, resulting in many short encounters that do not show traffic similarity. With that simple observation, we next asked whether the difference between the traffic similarity of encountered and non-encountered pairs is statistically significant. The ’Mann–Whitney U’ test [77] was applied to the two independent sets, with the null hypothesis being that the two sets are drawn from the same distribution. We find that for 86% of (building, day) tuples, we can reject the null hypothesis (p < 0.05). This shows that ’in most cases, the traffic profiles of encountered pairs are more similar and the difference between the two groups is statistically significant’. The next question is how consistent these differences are across:

2 If a building on a specific day has fewer than 20 encountered pairs, that (building, day) pair is omitted, to maintain statistical significance.

1. Device type categories: As discussed earlier, usage patterns of devices differ based on form factor (e.g., on-the-go ’flutes’ vs. sit-to-use ’cellos’). We compare flute-flute (FF), cello-cello (CC), and flute-cello (FC) encounters.

2. Weekday vs. Weekend: We established significant differences between mobility and traffic patterns of weekdays and weekends in [6]. Here we analyze the mobility encounter-traffic interplay across weekdays and weekends.

3. Encounter duration: We define three encounter duration categories using 3 bins of equal frequency: short (< 0.6min), medium (0.6 − 5min) and long (> 5min). We then analyze each group for correlation between encounter duration and traffic profile similarity.

5.4.1 Device Type Categories

We analyze how the similarity of traffic profiles for encountered pairs varies when two ’flutes’ meet (’FF’), two ’cellos’ meet (’CC’) or a ’flute’ meets a ’cello’ (’FC’). The results, as presented in Figure 5-4, show that the similarity of ’CC’ is slightly higher than that of the other groups, while the ’FF’ and ’FC’ groups show similar trends. Notably, however, ’all three encountered groups are significantly different from the non-encountered group’ (p < 0.05). This is consistent across most buildings. Given the context of the traces, we suspect heavy use of laptops for educational content on campus. Further analysis of website content may shed light on the shared interests among encountered users with various forms of devices. We leave this for future work.

5.4.2 Weekday Vs. Weekend

Figure 5-4. CDF of pairwise cosine similarity of traffic profiles across device types (vertical lines denote medians). [Plot not reproduced; y-axis: Fraction of pairs; x-axis ticks from 0.00 to 0.30; curves: Flute-Flute (FF), Cello-Cello (CC), Flute-Cello (FC), Non-encountered pair.]

Intuitively, there are significant differences between weekdays and weekends in user behavior, and consequently in their mobility and traffic patterns. In [6], we found that the number of user devices on campus drops significantly on weekends, but the remaining devices do not show significant differences in terms of flow size, packet count and active duration. Here we identify and quantify the encounter-traffic correlation over weekdays/weekends for the first time. Results are depicted in Fig 5-5. We find the ’pairwise similarities of weekend pairs to be overall higher than their weekday counterparts’ regardless of encounter (or not), with weekend ’non-encountered’ pairs being more similar than weekday ’encountered’ pairs. This is explained by the significantly reduced mobility of devices on weekends. For example, the median radius of gyration drops by 66% for Cellos and by 15% for Flutes [6]. In addition to decreased mobility, most activity is clustered around several academic buildings with research labs (33% of APs handle no Flute traffic on weekends, while 56% receive no Cello traffic). Thus the increase in traffic similarity during weekends might be explained by the presence of researchers collaborating on related fields of interest and accessing similar content.

5.4.3 Encounter Duration

Figure 5-5. CDF of pairwise cosine similarity of traffic profiles in weekdays and weekends (vertical lines denote medians). [Plot not reproduced; y-axis: Fraction of pairs; x-axis ticks from 0.00 to 0.30; curves: Weekday encountered pair, Weekday non-encountered pair, Weekend encountered pair, Weekend non-encountered pair.]

Based on the encounter durations introduced in Section 5.2.1, we define three encounter duration categories with 3 bins of equal frequency: short (< 0.6min), medium (0.6 − 5min) and long (> 5min). As depicted in Fig. 5-6, the short encounter group is not significantly different from the group of non-encountered pairs (p > 0.05). However, the differences between the other groups are statistically significant, with the ’long encounter group showing the highest similarity of traffic profiles’, hinting at a correlation between the duration of encounter and traffic profile similarity. Hence, we investigate this correlation. We found almost no correlation for short and medium encounters (based on Pearson and Spearman correlation coefficients); however, there is a ’small positive’ linear correlation between encounter duration and traffic profile similarity for long encounters (ρ = 0.21). Breaking down the correlations into different device types and weekday/weekend (Fig. 5-7) shows the highest correlation for Cello-Cello (CC) encounters on weekends, supporting our earlier observation. Overall, the correlation between encounter duration and traffic profile similarity is dynamic, changing across space and time. Fig. 5-8 shows a time-series plot of the linear correlation coefficient of several buildings on campus for more than 3 weeks. It shows how the correlation varies across time in different buildings, with rapid changes every 7 days (around weekends). Surprisingly, a few buildings show significant ’negative’ correlations on weekends (e.g., music and theater buildings), while others show significant ’positive’ correlations on the same days (mostly academic buildings). Further analysis of the other buildings and their interaction with mobility encounters and traffic profiles is left for future work.

Figure 5-6. CDF of pairwise cosine similarity of traffic profiles with different encounter durations (vertical lines denote medians). [Plot not reproduced; y-axis: Fraction of pairs; x-axis ticks from 0.00 to 0.30; curves: Short encounter, Medium encounter, Long encounter, Non-encountered pair.]

5.5 Learning Encounters

Given the relationships shown so far, there seems to be great potential in training a ’machine learning model’ that can learn to predict an encounter given two traffic profiles. Such a model has several practical applications. If it is possible to predict, with good accuracy, whether two users encountered on a certain day in a building given only their traffic profiles, then there is useful information in the relationship between mobility encounters and traffic profile similarity. This information could be used in the design of encounter-based ’dissemination’ protocols, analysis of ’rumor anatomy’, or ’tracing’ of disease spread [4], even if mobility traces are not reliable for each user (for example, due to MAC address ’randomization’) but traffic profiles of users are accessible (via ’authentication’ mechanisms identifying users at a higher OSI layer).

Figure 5-7. Pearson correlation coefficient between encounter duration and traffic profile similarity for weekdays and weekends as well as device types. (Cello-Cello (CC) encounters on weekends show the highest positive correlation.) Values recovered from the heatmap:

        Weekend   Weekday
FF      0.22      0.25
CC      0.34      0.12
FC      0.16      0.20

To investigate the feasibility of this task, for every pair of users, their ’traffic profiles’ in each building and on each day are coupled as input (either through concatenation or by taking the absolute differences, with the former depicted in results and figures), and a binary target label is assigned based on whether the pair encountered on that day in that building. Since most pairs of users do not typically encounter on a given day, predicting a negative label is rather trivial. To prevent this bias, we sample the dataset so that each label is represented by an equal number of samples for our models.
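The pairing and balancing steps described above can be illustrated with a minimal sketch. It is not the code used in this study: the function name `build_pair_dataset`, the in-memory dictionary of profiles, and the choice of random undersampling of the majority label are assumptions made for illustration only.

```python
import random

def build_pair_dataset(profiles, encountered, mode="concat", seed=0):
    """Build a label-balanced dataset of user pairs (hypothetical helper).

    profiles:    dict user -> list of floats, the traffic profile of that user
                 for ONE building and ONE day.
    encountered: set of frozenset({u, v}) for pairs that encountered there.
    mode:        "concat" joins the two profiles, "absdiff" takes |a - b|.
    """
    users = sorted(profiles)
    positives, negatives = [], []
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            a, b = profiles[u], profiles[v]
            if mode == "concat":
                features = a + b
            elif mode == "absdiff":
                features = [abs(p - q) for p, q in zip(a, b)]
            else:
                raise ValueError(f"unknown mode: {mode}")
            label = 1 if frozenset((u, v)) in encountered else 0
            (positives if label else negatives).append((features, label))
    # Undersample so that both labels have the same number of samples.
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample

# Toy profiles (assumed values) for four users in one building on one day.
profiles = {
    "u1": [0.5, 0.0, 0.5],
    "u2": [0.4, 0.1, 0.5],
    "u3": [0.0, 1.0, 0.0],
    "u4": [0.1, 0.9, 0.0],
}
encountered = {frozenset(("u1", "u2"))}
dataset = build_pair_dataset(profiles, encountered)
```

With four users there are six pairs, only one of which encountered, so the balanced dataset keeps that pair plus one randomly drawn non-encountered pair.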

Figure 5-8. Pearson correlation coefficient between encounter duration and traffic profile similarity for different buildings across days of the month (X-axis). [Plot omitted. X-axis: days 8 to 24. Legend: Turlington Hall, Bruton-Geer Hall, Nuclear Sciences, Clock Tower, Tolbert Hall, Norman Hall Addition, Constans Theatre, Weil Hall, Music Building, Reitz Student Union, Library West, School of Architecture, Hume Commons, Gator Corner Dining Center, Rinker Hall, Physics.]
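As an illustration of how coefficients such as those in Fig. 5-8 could be obtained, the following sketch computes the cosine similarity of each encountered pair's traffic profiles and then the Pearson correlation between encounter duration and similarity per (building, day) group. The record layout and function names are assumptions for this example, and the toy values are not taken from the dataset.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length traffic profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def correlation_by_group(records):
    """records: (building, day, duration_min, profile_a, profile_b) tuples."""
    groups = defaultdict(lambda: ([], []))
    for building, day, duration, profile_a, profile_b in records:
        durations, similarities = groups[(building, day)]
        durations.append(duration)
        similarities.append(cosine_similarity(profile_a, profile_b))
    return {key: pearson(d, s) for key, (d, s) in groups.items()}

# Toy records (assumed): three long encounters in one building on day 8.
records = [
    ("B1", 8, 6.0, [1.0, 0.0], [0.0, 1.0]),
    ("B1", 8, 10.0, [1.0, 0.0], [1.0, 1.0]),
    ("B1", 8, 30.0, [1.0, 0.0], [1.0, 0.0]),
]
result = correlation_by_group(records)
```

In this toy group, longer encounters coincide with more similar profiles, so the coefficient for `("B1", 8)` is positive.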

5.5.1 Random Forest (RF)

We first used Random Forests [12, 51] for this classification task, a well-established algorithm for supervised learning problems. For one building (the computer department), the algorithm achieved a promising ≈ 70% accuracy on average across all days, based on stratified k-fold cross-validation, without any preprocessing or parameter tuning of the model. Next, we applied dimensionality reduction, using ’Singular-Value Decomposition (SVD)’ to preprocess the input vector. This technique is adapted from latent semantic analysis in natural language processing. Its application improved the accuracy, in the same settings, to ≈ 73%. This led to the idea of using stacked auto-encoders (SAE) to retain information and connecting the SAE to a ’deep, fully-connected neural network (DNN)’ for classification.

5.5.2 Deep Learning

We utilized several recent ideas from the field of artificial intelligence to improve our learning of encounters significantly. Auto-encoders are a class of artificial neural networks that are trained in an unsupervised fashion to learn an efficient representation of their input. In simple terms, the network consists of two stages: encoder and decoder. The encoder consists of layers of decreasing size and is connected to the decoder, which is made up of layers of increasing size. The objective is to reconstruct the input as accurately as possible with purely unsupervised learning. Stacked auto-encoders (SAE) have been used in various applications to extract features [79], reduce dimensionality [92], and denoise corrupted variations of the input [117]. We use a ’Stacked De-noising Auto-Encoder (SDAE)’ [117]. This network is provided with input data corrupted by Gaussian noise and is trained to reconstruct the original, uncorrupted input, similar to a traditional auto-encoder. Thus, denoising becomes a training criterion, while dimensionality reduction of the input is also achieved (we refer the reader to [117] for details of SDAE and a comparison with SAE). Then, the encoded data points are fed to a fully connected, multi-layer neural network. We used ’stratified k-fold cross-validation’, ’early stopping’ and ’dropout’ layers to regularize the network and alleviate overfitting. An illustration of the architecture is presented in Fig. 5-10.

Comparing the results to the random forest, there is a significant increase in accuracy, to an average of 92% for the same building and days as the random forest classifier. The SDAE is able to capture high-dimensional, sparse input much better than the random forest. Comparing ’device type categories’ of encounters, we find ’cello-cello’ encounters to be the most distinguishable, followed by ’flute-cello’ and ’flute-flute’. However, the difference between accuracies for different device type categories is <5% in most locations and dates, a testimony to the ’robustness’ of the model. This accuracy is also stable across time, with the median accuracy at 93.25% for ’weekdays’ and 90.75% for ’weekends’, for the computer department samples. The much higher accuracy comes at the cost of higher compute requirements and model complexity, as well as lower relative interpretability. Fig. 5-9 shows the results of both the random forest and the deep neural network model for this building across ≈ 3 weeks.

Figure 5-9. Accuracy of random forest and deep learning model for encounters in the computer department across days (X-axis). [Plot omitted. Legend: Random Forest, Deep Model; y-axis: Accuracy (Computer department data only); x-axis: Days, 8 to 24.]

5.6 Summary and Future Work

In this chapter, we presented the first steps to analyze and quantify the relation between mobility and traffic. Focusing on the pairwise (encounter) dimension of mobility, its interplay with the traffic patterns of mobile users was studied. This work has implications for realistic modeling and simulation, offloading through opportunistic encounters, as well as implementation and benchmarking of encounter-based services such as content sharing, mobile social networks and encounter-based trust. We use extensive, highly granular datasets (30TB in size), in more than 140 buildings on a university campus, including information about WiFi associations, DHCP and NetFlow, covering the dimensions of mobility and network traffic.

To answer our main question, ’How do device encounters affect network traffic patterns across time, space, device type and encounter duration?’, we analyzed mobility encounters and presented their statistical characteristics. Power law and log-logistic distributions, fitted to daily encounter duration, have KS-test ≤ 10% in 92% of buildings for flute-flute and flute-cello encounters, and in 86% for cello-cello encounters. Also, cello-cello pairs have longer daily encounter duration than the others. Analyzing traffic, we find significant differences between the traffic profile similarity of encountered versus non-encountered pairs for device type categories (Flute-Flute, Cello-Cello, Flute-Cello), with the highest similarity being in the CC group. Further, comparing weekdays and weekends, in both cases the encountered pairs are more similar, with the distinction that weekend traffic profiles are more similar than weekday profiles. Analysis of encounter duration revealed that the short encounter group is not significantly different from the non-encountered group and that short and medium encounters show almost no correlation between duration and traffic profile similarity, while long encounters show the highest similarity and a small positive correlation.

We employed random forests and created a deep neural network (DNN) model to predict encounters of pairs of user traffic profiles with a very high accuracy (up to 94% depending on settings). To the best of our knowledge, this was the first study to use a stacked denoising autoencoder to capture pairwise network traffic behavior. The findings in this part of the work are not currently captured by any of the existing mobility or traffic models, while having important implications in many contexts, such as predictive caching, information dissemination, opportunistic social networks and infection tracing. This makes a compelling case for integrated traffic-mobility models that consider multiple dimensions (individual, pairwise, and group). We plan to further investigate the causal relationship between mobility and traffic for pairwise and ’collective’ (group) dimensions in the future. Samples of code and data have been released on GitHub and Docker Hub to support reproducibility and encourage further research (https://github.com/BabakAp/encounter-traffic).

Figure 5-10. Architecture of the deep learning model (SDAE). Numbers show the number of neurons in each layer (internal details omitted for brevity). [Diagram omitted. Stacked Auto-Encoder (unsupervised step): Gaussian Input Noise, Dense (256), Dense (128), Dense (64), Dense (64), Dense (128), Dense (256). Fully-connected layers (supervised step): three layers of Dense (512) with Dropout (0.2), followed by the output ’Encounter?’.]

CHAPTER 6 FUTURE DIRECTIONS

In this chapter, we summarize the planned future research directions. Many of these potential follow-up threads have been described and detailed before, but they are summarized and enumerated here. We identified different dimensions that comprise user device behavior. These are:

• Mobility: Quantifies how a user device moves in the form of spatiotemporal features.

• Network traffic: Provides spatiotemporal web traffic patterns.

• Device type: Devices such as Flutes and Cellos have different modes of usage.

• Social context: Spans individual, pairwise (encounter), and collective levels.

• Interest: This pertains to the interest of users in terms of mobility or network traffic. These ’interest’ based features include preferred locations as well as favorite apps and web domains.

• Predictability: Different groups of users show different levels of predictability, depending on many factors. Many of these factors originate in the aforementioned dimensions; for example, devices with low mobility tend to be more predictable.

Please note that these aspects of user device behavior are not orthogonal to each other. There are inter-dependencies among them, and we have quantified some of those relationships in this work.

To position this work and visualize the most promising future directions, Fig. 6-1 presents a radar chart. While our user device profiles cover a wide range of mobility and network traffic features, in addition to identifying device types, analyzing location preferences and prediction, there is still a lot of room for expansion:

• Extensions of integrated mobility-network traffic models in the social context and interest dimensions: The models introduced used individual features and did not consider the preferences of users in terms of web domains.

• Predictability analysis in the network traffic and interest dimensions: Prediction schemes relied on individual patterns only. Moreover, the prediction task was to predict the next ’location’ (AP or building), but there is also the task of predicting the next web domain access, whose analysis can be immensely helpful for predictive caching schemes.

• Integrated generative modeling across mobility, network traffic, device types, social context, predictability and interest: We envision a future integrated model using artificial intelligence that incorporates all bits of available information and views user device behavior across these dimensions as spectrums (instead of categorizing them into a limited number of bins such as Flutes and Cellos), while preserving their relationships and associations.

Figure 6-1. Positioning of the work and most promising future directions.

6.1 Extensions of Flutes vs. Cellos in Social Context and Interest Dimensions

In Chapter 3 we mined large-scale WLAN and NetFlow traces to quantify mobility and traffic characteristics across device types (’flutes’ vs. ’cellos’), time and space. We built the ’FLAMeS’ framework to systematically approach this problem, relying on a data cube containing mobility, traffic, and device type dimensions.

In addition, in Chapter 5, we extended our analysis of inter-dimension correlations to the pairwise level and measured the relationship between the traffic profiles of users and pairwise mobility encounters. Analysis at the individual level has implications for realistic modeling and simulation, while the extension to the pairwise level enables studying offloading through opportunistic encounters, as well as implementation and benchmarking of encounter-based services such as content sharing, mobile social networks, and encounter-based trust. This analysis can also be extended to ’communities’ of users, which is the next, more abstract, level of social context. Furthermore, the correlation of traffic with mobility studied previously only looked at the overall traffic. However, certain categories of websites, such as games or music, might show different behaviors. Thus this analysis can be extended to include the user ’interest’ dimension.

6.2 Predictability Analysis in the Traffic and Interest Dimensions

In Chapter 4, we looked at maximum theoretical predictability as well as multiple prediction algorithms for mobility event sequences, with a focus on Markov-Chain (MC) based predictors and Long Short-Term Memory (LSTM). We compared the results of different algorithms, while also contrasting the practical accuracy achieved against theoretical upper bounds. This work was carried out on UF and KTH traces, with two different types of time series: 1) Event-based: This is the sequence of events in a time series, as they happen. 2) Discrete-time: Using temporal windows of size ’w’, the location where the user spent the most time during a time window is assigned to that window. The sensitivity of the algorithms to the parameter ’w’ was also analyzed.

With an approach similar to the mobility analysis, traffic events can be modeled as a discrete-time or event-based time series. Each event corresponds to a flow, i.e. a user accessing a website. In an event-based model, these events are sorted in time for each user. In the discrete-time case, the weighting mechanism assigns the weight based on the total traffic volume instead of time. Therefore, during a temporal window of size ’w’, we weigh every observed IP access of the device by the traffic size of the flow and pick the one with the highest sum of bytes to represent that step. In addition, these websites belong to certain categories of interest (such as arts and sports). Thus, each user might be better represented using multiple time series, one per interest category. This analysis needs to be carried out on the 30TB of NetFlow data, as well as other similar candidate datasets.
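The byte-weighted discretization proposed above can be sketched as follows. This is an illustrative outline rather than an implemented part of this work: the flow tuple layout, the function names, and the tie-breaking rule (larger destination string wins on equal byte sums) are assumptions.

```python
from collections import defaultdict

def event_series(flows):
    """Event-based series: destinations of one device, ordered by time."""
    return [dest for _, dest, _ in sorted(flows, key=lambda f: f[0])]

def discretize_traffic(flows, w):
    """Discrete-time series for ONE device.

    flows: iterable of (timestamp_sec, dest, n_bytes) tuples.
    w:     window size in seconds.
    Returns {window_index: dest with the highest sum of bytes in that window}.
    """
    per_window = defaultdict(lambda: defaultdict(int))
    for ts, dest, n_bytes in flows:
        per_window[int(ts // w)][dest] += n_bytes
    return {
        idx: max(totals.items(), key=lambda kv: (kv[1], kv[0]))[0]
        for idx, totals in sorted(per_window.items())
    }

# Toy flows (assumed values): (seconds since start, destination, bytes).
flows = [
    (0, "a", 100),
    (10, "b", 300),
    (20, "a", 250),
    (65, "c", 50),
]
events = event_series(flows)
steps = discretize_traffic(flows, w=60)
```

In the first 60-second window, destination "a" accumulates 350 bytes against 300 for "b", so "a" represents that step even though "b" had the single largest flow.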

APPENDIX A WEB DOMAIN INTEREST ANALYSIS

In addition to the analysis of user device behavior and interests in terms of mobility, it is also important to extend that analysis to web traffic. It has been shown that users have preferences towards certain locations and buildings. Similarly, users can have an interest in certain categories of websites. Some prior research has shown promise in finding patterns of web interests in a wireless dataset [85, 86].

A.1 Web Domain Interest Extraction

The motivation for this section is to go from IP addresses, available in the NetFlow traces, to hostnames/websites and then classify the web domains into various categories of user interest, such as arts, computers, games, business, health, home, recreation, science, society, and sports. Combining the core dataset introduced before with the interest categories discovered enables us to analyze the correlation of interest profiles of users with mobility and network traffic. This is a multi-stage process. The first piece in the pipeline is a reverse domain name lookup. However, a simple reverse DNS (rDNS) request to a current DNS server would not suffice, as the mapping from IP addresses to hosts might have changed. As a result, we decided to use a snapshot of rDNS from all IPv4 to host mappings in 2013 (footnote 1). This reverse DNS dataset was transformed into ’Parquet’ files and processed with Apache Spark in conjunction with the WLAN and NetFlow traces. The next step is to classify websites into various categories, for which two approaches are taken:

1. Domain-based categorization: Based on categories from the open-content directory ’DMOZ’ (footnote 2), each domain name is mapped to one or more categories. To avoid ambiguity, we only kept the domains that had a single category. For each domain, only the top level category was preserved. The categories of websites are: Arts, Business, Computers, Games, Health, Home, Recreation, Science, Society, Sports.

2. Document classification: In this approach, we created an asynchronous pipeline that connects to various websites (the result of rDNS) and retrieves their base HTML document. The asynchronous system is built on top of an open source library to make REST API calls (footnote 3). Next, an HTML parsing library strips away all tags, headers and markup (footnote 4), leaving the raw text intact, which is then passed to a document classification algorithm (footnote 5). The document classification algorithm returns an array of probabilities, denoting the probability of the given text belonging to each category. The top three most probable categories are maintained for analysis.

These steps through the interest extraction pipeline produce a complementary dataset that can be joined with the ’core’ dataset, enriching it further by adding the category of interest to flows of user devices. This dataset includes a classification for over 200,000 websites.

Figure A-1. Heatmap of user activity across building categories. The Y-axis includes the user indices, and the X-axis includes building categories. The categories are: Academic, Admin, Housing, Library, Museum, Police, Social, Sports, Unknown.

Footnote 1: Available online: https://scans.io/study/sonar.rdns
Footnote 2: Available online: http://www.dmoz.org/
Footnote 3: The library is called OpenFeign, available online: https://github.com/OpenFeign/feign
Footnote 4: Available online: https://jsoup.org/
Footnote 5: Available online: https://www.uclassify.com/browse/uclassify/topics

Unfortunately, this classification process puts many websites in the ”Computers” category (about 56%), which shows that this category is too general and needs to be broken down. With this dataset, it is possible to use 2D and 3D co-clustering methods: in the 2D case, to find clusters of users that access similar sets of websites, and in the 3D case, to find clusters of users that are similar, in groups of buildings, in accessing similar sets of websites. A snapshot of user activity data across building categories is shown in Fig. A-1. This dataset can then be used for 2D co-clustering of users and building categories. Expansion of this dataset to 3D would allow creating clusters of users across building categories and web interest domains, capturing spatial preferences (categories of buildings) as well as interests (categories of websites), which would provide better resolution for modeling of communities.
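The domain-based categorization approach described earlier in this section can be sketched as follows. This is a minimal illustration and not the pipeline used here: the `Top/<Category>/...` path format, the function names, and the rule of reducing each path to its top-level category before checking for ambiguity are assumptions.

```python
def top_level(category_path):
    """Reduce a directory path such as 'Top/Computers/Software' to 'Computers'.

    The 'Top/...' path layout is an assumed input format.
    """
    parts = [p for p in category_path.split("/") if p and p != "Top"]
    return parts[0] if parts else None

def categorize_domains(directory):
    """Keep only domains that map to exactly one top-level category.

    directory: dict domain -> list of category paths for that domain.
    Returns dict domain -> top-level category.
    """
    result = {}
    for domain, paths in directory.items():
        tops = {top_level(p) for p in paths} - {None}
        if len(tops) == 1:  # drop ambiguous (multi-category) domains
            result[domain] = tops.pop()
    return result

# Toy directory with placeholder domain names.
directory = {
    "physics.example": ["Top/Science/Physics"],
    "mixed.example": ["Top/Arts/Music", "Top/Business/Retail"],
    "games.example": ["Top/Games/Video_Games", "Top/Games/Board_Games"],
}
categories = categorize_domains(directory)
```

Here `mixed.example` is discarded because it spans two top-level categories, while `games.example` is kept because both of its entries fall under Games.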

APPENDIX B CHOICE OF DATA PROCESSING TOOLS

Throughout this study, we found Apache Spark to be the best tool since it incorporates a data processing engine, connectors to various data storage tools and also a machine learning library. So the processing is done by Apache Spark, but for storage, there are a few options:

1. Working directly with the CSVs: This is the easiest to work with, as no further preprocessing or setup is necessary. It uses the universal language of text, and all tools can work with this format. The disadvantage is that CSVs are row-oriented, which leads to multiple issues: 1) analytical queries suffer since entire rows need to be read through I/O and parsed; 2) data size is larger because compression works best when columns of the same type are compressed based on their type.

2. Working with a relational database such as PostgreSQL: Popular databases like PostgreSQL are easy to work with and have a vast array of tools and plugins to manipulate and process data. But they also tend to be optimized for transactions, with row-oriented storage and limited scaling when it comes to multi-terabyte datasets.

3. Working with a distributed NoSQL database such as Apache Cassandra: We also used Apache Cassandra. It is easy to work with, has a mature Apache Spark driver, and provides a hybrid column-oriented/key-value database with reasonable compression. However, it is tuned for writes and multiple users rather than bulk reads, it has high processing overhead, and the included query language (CQL) has limited capability and does not support subqueries and joins (though these operations can be done using Spark’s engine on top of Cassandra).

4. Working with a distributed Hadoop data warehouse such as Apache Hive: This option also provides a mature Apache Spark driver and can use Apache Parquet (a columnar, compressed storage format available to the entire Hadoop ecosystem, where one can only add columns to the end). It scales well to large datasets and is optimized for bulk processing of data warehouses. However, it is hard to set up, requires careful configuration management and also imposes a high overhead, which makes this tool only useful for very large datasets (e.g. our NetFlow dataset).

With these advantages and disadvantages in mind, we performed a simple benchmark query as a preliminary performance measurement, using the latest version of the tools at the time of analysis. The query groups flows by protocol and then takes the sum of the number of packets per protocol. It is a simple aggregate function in SQL and should run in O(n). Testing is done with one hour of NetFlow (2012-04-13-09), which is a 33GB uncompressed CSV.
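For concreteness, the shape of the benchmark query can be illustrated on a toy table. The sketch below uses Python's built-in SQLite purely as a stand-in; SQLite was not one of the benchmarked systems, and the table name `flows` and column names `protocol` and `packets` are assumed for illustration.

```python
import sqlite3

# In-memory toy table standing in for one hour of NetFlow records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flows (protocol TEXT, packets INTEGER)")
conn.executemany(
    "INSERT INTO flows VALUES (?, ?)",
    [("tcp", 10), ("udp", 3), ("tcp", 5), ("icmp", 1)],
)

# Group by protocol, then sum the number of packets per protocol.
QUERY = """
    SELECT protocol, SUM(packets) AS total_packets
    FROM flows
    GROUP BY protocol
    ORDER BY protocol
"""
rows = conn.execute(QUERY).fetchall()
conn.close()
```

The aggregate needs only a single pass over the rows, which is why its running time is dominated by how quickly each system can read and decode the `protocol` and `packets` columns.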

Here is a summary of results:

1. PostgreSQL 9.5: The problem with PostgreSQL is that it’s single-threaded per user connection and doesn’t scale well. The only way to parallelize is to create multiple connections in the code (e.g. Java using JDBC) and then merge the results in code which is quite cumbersome. This limitation was later changed with the release of PostgreSQL 9.6 with parallel query support. It took the database 16 minutes to import the CSV file, resulting in a database storage requirement of 34GB and took 100 seconds to process the query. Adding more data results in significantly longer query times.

2. Apache Cassandra 3.7, Apache Spark 1.6.2: It took this system 1 hour and 16 minutes to import the dataset. The dataset size was reduced to 17GB with compression (51% of the original size). The query time was about 300 seconds.

3. Hive table stored as Parquet as the data source, Spark SQL as the engine: The query was written in HiveQL (which was almost SQL2003 compliant at the time and has joins, subqueries, aggregates, etc.). It took this system 16 minutes to import the data and save it to Parquet files, requiring only 5.7GB of storage (17% of the original size). The query time was also only 28 seconds. We also tested saving the Hive table using Apache ORC (optimized row columnar). It decreases the file size to 4.6GB, but the higher compression comes at the cost of CPU time, thus increasing the query time to 60s.

Apache Parquet for storage and Apache Spark for the engine were chosen because this system is fast and efficient with CPU while achieving a stellar compression ratio. The good query time is in part due to the greatly reduced stored size, which to some degree alleviates the I/O bottleneck.

We also made several optimizations to the Hive tables to save space and increase efficiency. These include converting IPv4 addresses to 32-bit integers, deduplication of calculated columns, categorizing protocols (i.e. instead of using strings such as ”tcp”, use enums), and cleaning many erroneous entries. After these optimizations, it took 147 hours to merge the NetFlow dataset with the WLAN-based DHCP traces to create the core dataset used throughout the studies. This was done using join and union operators in Apache Spark. A visualization of a sample run of the join operation between NetFlow and DHCP in Apache Spark is shown in Fig. B-1.
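Two of the space-saving conversions mentioned above, IPv4 addresses to 32-bit integers and protocol strings to enums, can be illustrated with a short sketch. The helper names are hypothetical, and using the IANA protocol numbers as enum values is an assumption; the source only states that enums replaced strings.

```python
import ipaddress
from enum import IntEnum

class Protocol(IntEnum):
    """Compact protocol codes (values follow IANA protocol numbers)."""
    ICMP = 1
    TCP = 6
    UDP = 17

def ipv4_to_int(addr):
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(addr))

def int_to_ipv4(value):
    """Inverse of ipv4_to_int, for display or joins on string addresses."""
    return str(ipaddress.IPv4Address(value))

def encode_protocol(name):
    """Map a protocol string such as 'tcp' to its enum member."""
    return Protocol[name.upper()]

encoded_ip = ipv4_to_int("192.168.1.1")
encoded_proto = encode_protocol("tcp")
```

A dotted-quad string takes up to 15 characters, whereas the integer form always fits in 4 bytes and sorts and compresses well in columnar formats.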

Figure B-1. Visualization of join operation between NetFlow and DHCP in Apache Spark.

REFERENCES

[1] Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dandelion, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” 2015. Software available from tensorflow.org.

[2] Adjeroh, Donald, Bell, Timothy, and Mukherjee, Amar. The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching. Springer Science & Business Media, 2008.

[3] Afanasyev, Mikhail, Chen, Tsuwei, Voelker, Geoffrey M, and Snoeren, Alex C. “Analysis of a mixed-use urban wifi network: when metropolitan becomes neapolitan.” ACM SIGCOMM. 2008.

[4] Al Qathrady, Mimonah, Helmy, Ahmed, and Almuzaini, Khalid. “Infection tracing in smart hospitals.” IEEE WiMob. 2016.

[5] Alipour, Babak, Al Qathrady, Mimonah, and Helmy, Ahmed. “Learning the Relation Between Mobile Encounters and Web Traffic Patterns: A Data-driven Study.” ACM MSWIM. 2018.

[6] Alipour, Babak, Tonetto, Leonardo, Yi Ding, Aaron, Ketabi, Roozbeh, Ott, Joerg, and Helmy, Ahmed. “Flutes vs. Cellos: Analyzing Mobility-Traffic Correlations in Large WLAN Traces.” IEEE INFOCOM. 2018.

[7] Alshammari, Riyad and Zincir-Heywood, A. Nur. “Machine learning based encrypted traffic classification: Identifying SSH and Skype.” IEEE Symposium on Computational Intelligence for Security and Defense Applications, CISDA 2009 (2009).

[8] Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. “Wasserstein gan.” arXiv preprint arXiv:1701.07875 (2017).

[9] Auld, Tom, Moore, Andrew W, and Gull, Stephen F. “Bayesian Neural Networks for Internet Traffic Classification.” IEEE Transactions on Neural Networks 18 (2007).1: 223–239. URL http://ieeexplore.ieee.org/document/4049810/

[10] Bai, Fan, Sadagopan, Narayanan, and Helmy, Ahmed. “The IMPORTANT framework for analyzing the Impact of Mobility on Performance Of RouTing protocols for Adhoc NeTworks.” Ad hoc networks (2003).

[11] Box, George EP, Draper, Norman Richard, et al. Empirical model-building and response surfaces, vol. 424. Wiley New York, 1987.

[12] Breiman, Leo. “Random Forests.” Machine Learning 45 (2001).1: 5–32.

[13] Cai, Haixiao, Kulkarni, Sanjeev R., and Verdú, Sergio. “Universal entropy estimation via block sorting.” 2004.

[14] Cao, Paul, Li, Gang, Champion, Adam, Xuan, Dong, Romig, Steve, and Zhao, Wei. “On Human Mobility Predictability via WLAN Logs.” Proc. INFOCOM. 2017.

[15] Cao, Paul Y, Li, Gang, Champion, Adam C, Xuan, Dong, Romig, Steve, and Zhao, Wei. “On human mobility predictability via WLAN logs.” INFOCOM 2017. IEEE, 2017, 1–9.

[16] Chaintreau, Augustin, Hui, Pan, Crowcroft, Jon, Diot, Christophe, Gass, Richard, and Scott, James. “Impact of human mobility on opportunistic forwarding algorithms.” IEEE Transactions on Mobile Computing (2007).

[17] Chen, Shengyong, Huang, Wei, Cattani, Carlo, and Altieri, Giuseppe. “Traffic dynamics on complex networks: a survey.” Mathematical Problems in Engineering 2012 (2012).

[18] Chen, Xian, Jin, Ruofan, Suh, Kyoungwon, Wang, Bing, and Wei, Wei. “Network performance of smart mobile handhelds in a university campus WiFi network.” IMC. 2012.

[19] Cho, Eunjoon, Myers, Seth A., and Leskovec, Jure. “Friendship and mobility.” Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’11 (2011): 1082.

[20] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. “Learning phrase representations using RNN encoder-decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014).

[21] Das, Aveek K., Pathak, Parth H., Chuah, Chen-Nee, and Mohapatra, Prasant. “Characterization of Wireless Multidevice Users.” ACM TOIT (2016).

[22] Dempster, Arthur P, Laird, Nan M, and Rubin, Donald B. “Maximum likelihood from incomplete data via the EM algorithm.” Journal of the royal statistical society. Series B (methodological) (1977): 1–38.

[23] Do, Trinh Minh Tri, Dousse, Olivier, Miettinen, Markus, and Gatica-Perez, Daniel. “A probabilistic kernel method for human mobility prediction with smartphones.” Pervasive and Mobile Computing 20 (2015): 13–28.

[24] Domingos, Pedro and Pazzani, Michael. “On the Optimality of the Simple Bayesian Classifier under Zero-One Loss.” Machine Learning 29 (1997): 103–130.

[25] Dong, Yuxiao, Yang, Yang, Tang, Jie, Yang, Yang, and Chawla, Nitesh V. “Inferring user demographics and social strategies in mobile social networks.” Proceedings of the 20th ACM SIGKDD. 2014.

[26] Eagle, Nathan and Pentland, Alex Sandy. “Reality mining: sensing complex social systems.” Personal and ubiquitous computing (2006).

[27] Erman, Jeffrey, Arlitt, Martin, and Mahanti, Anirban. “Traffic classification using clustering algorithms.” Proceedings of the 2006 SIGCOMM workshop on Mining network data - MineNet ’06 (2006): 281–286. URL http://portal.acm.org/citation.cfm?doid=1162678.1162679

[28] Erman, Jeffrey, Mahanti, Anirban, Arlitt, Martin, and Williamson, Carey. “Identifying and discriminating between web and peer-to-peer traffic in the network core.” Proceedings of the 16th international conference on World Wide Web - WWW ’07 (2007): 883. URL http://portal.acm.org/citation.cfm?doid=1242572.1242692

[29] Ester, Martin, Kriegel, Hans-Peter, Sander, Jörg, Xu, Xiaowei, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” KDD. 1996.

[30] Falaki, Hossein, Lymberopoulos, Dimitrios, Mahajan, Ratul, Kandula, Srikanth, and Estrin, Deborah. “A first look at traffic on smartphones.” IMC ’10 (2010).

[31] Fano, Robert M and Wintringham, WT. “Transmission of information.” Physics Today 14 (1961): 56.

[32] Ficek, Michal and Kencl, Lukas. “Inter-call mobility model: A spatio-temporal refinement of call data records using a gaussian mixture model.” IEEE INFOCOM. 2012.

[33] Finsterbusch, Michael, Richter, Chris, Rocha, Eduardo, Müller, Jean Alexander, and Hänßgen, Klaus. “A survey of payload-based traffic classification approaches.” IEEE Communications Surveys and Tutorials 16 (2014).2: 1135–1156.

[34] Freund, Yoav and Schapire, Robert E. “A desicion-theoretic generalization of on-line learning and an application to boosting.” European conference on computational learning theory. Springer, 1995, 23–37.

[35] Gallotti, Riccardo, Bazzani, Armando, Esposti, Mirko Degli, and Rambaldi, Sandro. “Entropic measures of individual mobility patterns.” Journal of Statistical Mechanics: Theory and Experiment (2013).

[36] Gao, Jim. “Machine Learning Applications for Data Center Optimization.” Google White Paper (2014): 1–13.

[37] Gao, Yun, Kontoyiannis, Ioannis, and Bienenstock, Elie. “Estimating the entropy of binary time series: Methodology, some theory and a simulation study.” Entropy 10 (2008).2: 71–99.

[38] Gember, Aaron, Anand, Ashok, and Akella, Aditya. “A comparative study of handheld and non-handheld traffic in campus Wi-Fi networks.” PAM. 2011.

[39] Gonzalez, Marta C, Hidalgo, Cesar A, and Barabasi, Albert-Laszlo. “Understanding individual human mobility patterns.” Nature 453 (2008).7196.

[40] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. “Generative adversarial nets.” Advances in neural information processing systems. 2014, 2672–2680.

[41] Goulet-Langlois, Gabriel, Koutsopoulos, Haris N., Zhao, Zhan, and Zhao, Jinhua. “Measuring Regularity of Individual Travel Patterns.” IEEE Transactions on Intelligent Transportation Systems (2017).

[42] Granichin, O. “Cluster validation.” Intelligent Systems Reference Library 67 (2015): 163–228.

[43] Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. “The WEKA data mining software.” ACM SIGKDD Explorations 11 (2009).1: 10–18.

[44] Hall, Mark A. Correlation-based feature selection for machine learning. Ph.D. thesis, The University of Waikato, 1999.

[45] Hanel, R. and Thurner, S. “A comprehensive classification of complex statistical systems and an axiomatic derivation of their entropy and distribution functions.” EPL 93 (2011).2.

[46] Henderson, Tristan, Kotz, David, and Abyzov, Ilya. “The changing usage of a mature campus-wide wireless network.” Elsevier Computer Networks (2008).

[47] Hess, Andrea, Hummel, Karin Anna, Gansterer, Wilfried N., and Haring, Guenther. “Data-driven Human Mobility Modeling: A Survey and Engineering Guidance for Mobile Networking.” ACM CSUR (2016).

[48] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, and Lerchner, Alexander. “beta-vae: Learning basic visual concepts with a constrained variational framework.” ICLR. vol. 3. 2017.

[49] Hinton, Geoffrey E. “A practical guide to training restricted Boltzmann machines.” Neural networks: Tricks of the trade. Springer, 2012. 599–619.

[50] Hinton, Geoffrey E and Salakhutdinov, Ruslan R. “Reducing the dimensionality of data with neural networks.” Science 313 (2006).5786: 504–507.

[51] Ho, Tin Kam. “Random decision forests.” Document analysis and recognition conference. vol. 1. IEEE, 1995, 278–282.

[52] Hochreiter, Sepp and Schmidhuber, Jürgen. “Long short-term memory.” Neural computation 9 (1997).8: 1735–1780.

[53] Hsu, Wei-jen and Helmy, Ahmed. “On nodal encounter patterns in wireless LAN traces.” IEEE Transactions on Mobile Computing (2010).

[54] Hsu, Wei-Jen, Spyropoulos, Thrasyvoulos, Psounis, Konstantinos, and Helmy, Ahmed. “Modeling spatial and temporal dependencies of user mobility in wireless mobile networks.” IEEE/ACM Transactions on Networking (ToN) (2009).

[55] Hui, Pan, Chaintreau, Augustin, Scott, James, Gass, Richard, Crowcroft, Jon, and Diot, Christophe. “Pocket switched networks and human mobility in conference environments.” Proceedings of the 2005 ACM SIGCOMM workshop on DTN. 2005.

[56] Ikanovic, Edin Lind and Mollgaard, Anders. “An alternative approach to the limits of predictability in human mobility.” EPJ Data Science 6 (2017).1.

[57] Isaacman, Sibren, Becker, Richard, Cáceres, Ramón, Martonosi, Margaret, Rowland, James, Varshavsky, Alexander, and Willinger, Walter. “Human mobility modeling at metropolitan scales.” Proceedings of MobiSys. ACM, 2012, 239–252.

[58] Jacquet, P., Szpankowski, W., and Apostol, I. “A universal predictor based on pattern matching.” IEEE Transactions on Information Theory 48 (2002).6: 1462–1472.

[59] Jeong, Jaeseong, Leconte, Mathieu, and Proutiere, Alexandre. “Cluster-aided mobility predictions.” Proceedings - IEEE INFOCOM 2016-July (2016): 1–9.

[60] Jones, Karen Sparck. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation (1972).

[61] Kalchbrenner, Nal, Grefenstette, Edward, and Blunsom, Phil. “A convolutional neural network for modelling sentences.” arXiv preprint arXiv:1404.2188 (2014).

[62] Karatzoglou, Antonios, Jablonski, Adrian, and Beigl, Michael. “A Seq2Seq learning approach for modeling semantic trajectories and predicting the next location.” ACM SIGSPATIAL. 2018.

[63] Kim, Jeeyoung and Helmy, Ahmed. “Analysing the Mobility, Predictability and Evolution of WLAN Users.” International Journal of Autonomous and Adaptive Communications Systems (IJAACS) 7 (2014).1/2: 169–191.

[64] Kim, Yoon. “Convolutional neural networks for sentence classification.” arXiv preprint arXiv:1408.5882 (2014).

[65] Kingma, Diederik P and Welling, Max. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).

[66] Kotz, David and Essien, Kobby. “Analysis of a Campus-Wide Wireless Network.” Springer Wireless Networks 11 (2005).2.

[67] Kullback, Solomon and Leibler, Richard A. “On information and sufficiency.” The annals of mathematical statistics 22 (1951).1: 79–86.

[68] Kumar, Udayan, Kim, Jeeyoung, and Helmy, Ahmed. “Changing patterns of mobile network (WLAN) usage: Smart-phones vs. laptops.” IWCMC ’13 (2013).

[69] Leshno, Moshe, Lin, Vladimir Ya, Pinkus, Allan, and Schocken, Shimon. “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function.” Neural Networks (1993).

[70] Li, Yong, Jin, Depeng, Hui, Pan, Wang, Zhaocheng, and Chen, Sheng. “Limits of Predictability for Large-Scale Urban Vehicular Mobility.” IEEE Transactions on Intelligent Transportation Systems 15 (2014).6: 2671–2682.

[71] Lichman, M. “UCI Machine Learning Repository.” 2013. URL http://archive.ics.uci.edu/ml

[72] Liu, Qiang, Wu, Shu, Wang, Liang, and Tan, Tieniu. “Predicting the Next Location: A Recurrent Model with Spatial and Temporal Contexts.” AAAI. 2016, 194–200.

[73] Lloyd, Stuart P. “Least Squares Quantization in PCM.” IEEE Transactions on Information Theory 28 (1982).2: 129–137.

[74] Lu, W., Shen, Y., Chen, S., and Ooi, B.C. “Efficient processing of k nearest neighbor joins using MapReduce.” Proceedings of the VLDB Endowment 5 (2012).10: 1016–1027. URL http://dl.acm.org/citation.cfm?id=2336664.2336674

[75] Lu, Xin, Wetter, Erik, Bharti, Nita, Tatem, Andrew J, and Bengtsson, Linus. “Approaching the limit of predictability in human mobility.” Scientific reports 3 (2013).

[76] Maier, Gregor, Schneider, Fabian, and Feldmann, Anja. “A first look at mobile hand-held device traffic.” PAM. Springer, 2010.

[77] Mann, Henry B and Whitney, Donald R. “On a test of whether one of two random variables is stochastically larger than the other.” The annals of mathematical statistics (1947): 50–60.

[78] Manweiler, Justin, Scudellari, Ryan, and Cox, Landon P. “SMILE: encounter-based trust for mobile social services.” Proceedings of the 16th ACM CCS. 2009.

[79] Masci, Jonathan, Meier, Ueli, Cireşan, Dan, and Schmidhuber, Jürgen. “Stacked convolutional auto-encoders for hierarchical feature extraction.” International Conference on Artificial Neural Networks. Springer, 2011, 52–59.

[80] Massey Jr, Frank J. “The Kolmogorov-Smirnov test for goodness of fit.” Journal of the American statistical Association 46 (1951).253: 68–78.

[81] McGregor, Anthony, Hall, Mark, Lorier, Perry, and Brunskill, James. “Flow Clustering Using Machine Learning Techniques.” Passive and Active Network Measurement SE - 21 (2004).

[82] McInerney, James, Stein, Sebastian, Rogers, Alex, and Jennings, Nicholas R. “Breaking the habit: Measuring and predicting departures from routine in individual human mobility.” Pervasive and Mobile Computing 9 (2013).6: 808–822.

[83] Meng, Xiaoqiao George, Wong, Starsky H Y, Yuan, Yuan, and Lu, Songwu. “Characterizing flows in large wireless data networks.” MobiCom ’04 (2004).

[84] Meng, Xiaoqiao George, Wong, Starsky HY, Yuan, Yuan, and Lu, Songwu. “Characterizing flows in large wireless data networks.” MobiCom (2004).

[85] Moghaddam, Saeed and Helmy, Ahmed. “SPIRIT: A simulation paradigm for realistic design of mature mobile societies.” IWCMC ’11. 2011.

[86] Moghaddam, Saeed, Helmy, Ahmed, Ranka, Sanjay, and Somaiya, Manas. “Data-driven co-clustering model of internet usage in large mobile societies.” ACM MSWiM (2010).

[87] Moore, Andrew W. and Zuev, Denis. “Internet traffic classification using bayesian analysis techniques.” SIGMETRICS ’05 33 (2005).1.

[88] Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.

[89] Nguyen, Thuy T T, Armitage, Grenville, Branch, Philip, and Zander, Sebastian. “Timely and Continuous Machine-Learning-Based Classification for Interactive IP Traffic.” IEEE/ACM Transactions on Networking 20 (2012).6: 1880–1894.

[90] Nguyen, T.T.T. and Armitage, G. “A survey of techniques for internet traffic classification using machine learning.” IEEE CST (2008).

[91] Nikravesh, Ashkan, Guo, Yihua, Qian, Feng, Mao, Z Morley, and Sen, Subhabrata. “An in-depth understanding of multipath TCP on mobile devices.” MobiCom. 2016.

[92] Nowicki, Michał and Wietrzykowski, Jan. “Low-effort place recognition with WiFi fingerprints using deep learning.” International Conference Automation. Springer, 2017.

[93] Papapanagiotou, Ioannis, Nahum, Erich M, and Pappas, Vasileios. “Smartphones vs. laptops: comparing web browsing behavior and the implications for caching.” SIGMETRICS (2012).

[94] Paxson, V. and Floyd, Sally. “Wide area traffic: the failure of Poisson modeling.” IEEE/ACM ToN 3 (1995).3.

[95] Quinlan, J.R. “Induction of Decision Trees.” Machine Learning 1 (1986).1: 81–106. URL http://link.springer.com/10.1023/A:1022643204877

[96] Quinlan, Ross J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[97] Rousseeuw, Peter J. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.” Journal of computational and applied mathematics 20 (1987): 53–65.

[98] Roussopoulos, Nick, Kelley, Stephen, and Vincent, Frédéric. “Nearest neighbor queries.” Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data - SIGMOD ’95 (1995): 71–79. URL http://portal.acm.org/citation.cfm?doid=223784.223794

[99] Sadilek, Adam, Kautz, Henry, and Bigham, Jeffrey P. “Finding your friends and following them to where you are.” Proceedings of the fifth ACM international conference on Web search and data mining - WSDM ’12 (2012): 723.

[100] Salton, Gerard and McGill, Michael J. Introduction to modern information retrieval. McGraw-Hill, Inc., 1986.

[101] Schmidhuber, Jürgen. “Deep learning in neural networks: An overview.” Neural networks 61 (2015): 85–117.

[102] Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy, Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. “Mastering the game of Go with deep neural networks and tree search.” Nature 529 (2016).7587: 484–489.

[103] Sinatra, Roberta and Szell, Michael. “Entropy and the predictability of online life.” Entropy 16 (2014).1: 543–556.

[104] Smith, Gavin, Wieser, Romain, Goulding, James, and Barrack, Duncan. “A refined limit on the predictability of human mobility.” 2014 IEEE International Conference on Pervasive Computing and Communications, PerCom 2014 (2014): 88–94.

[105] Song, Chaoming, Koren, Tal, Wang, Pu, and Barabási, Albert-László. “Modelling the scaling properties of human mobility.” Nature Physics 6 (2010).10.

[106] Song, Chaoming, Qu, Zehui, Blumm, Nicholas, and Barabási, Albert-László. “Limits of predictability in human mobility.” Science 327 (2010).5968: 1018–1021.

[107] Song, Libo, Kotz, David, Jain, Ravi, and He, Xiaoning. “Evaluating location predictors with extensive Wi-Fi mobility data.” Proc. IEEE INFOCOM. vol. 2. 2004, 1414–1424.

[108] Song, Xuan, Zhang, Quanshi, Sekimoto, Yoshihide, Shibasaki, Ryosuke, Yuan, Nicholas Jing, and Xie, Xing. “Prediction and Simulation of Human Mobility Following Natural Disasters.” ACM Transactions on Intelligent Systems and Technology 8 (2016).2: 1–23.

[109] Sundaresan, Srikanth, de Donato, Walter, Feamster, Nick, Teixeira, Renata, Crawford, Sam, and Pescapè, Antonio. “Broadband Internet Performance: A View from the Gateway.” SIGCOMM Comput. Commun. Rev. 41 (2011).4: 134–145. URL http://doi.acm.org/10.1145/2043164.2018452

[110] Suykens, J A K and Vandewalle, J. “Least Squares Support Vector Machine Classifiers.” Neural Processing Letters 9 (1999).3: 293–300. URL http://dx.doi.org/10.1023/A:1018628609742

[111] Takaguchi, Taro, Nakamura, Mitsuhiro, Sato, Nobuo, Yano, Kazuo, and Masuda, Naoki. “Predictability of conversation partners.” Physical Review X 1 (2011).1: 011008.

[112] Thakur, Gautam S and Helmy, Ahmed. “COBRA: A framework for the analysis of realistic mobility models.” INFOCOM. 2013.

[113] Tran, Le Hung, Catasta, Michele, McDowell, Lucas Kelsey, and Aberer, Karl. “Next Place Prediction using Mobile Data.” Proceedings of the Mobile Data Challenge Workshop (MDC 2012) (2012).

[114] Treurniet, Joanne. “A Taxonomy and Survey of Microscopic Mobility Models from the Mobile Networking Domain.” ACM CSUR (2014).

[115] Vapnik, Vladimir. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.

[116] Varga, Andras. “OMNeT++.” Modeling and Tools for Network Simulation. Springer, 2010. 35–59.

[117] Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine. “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.” Journal of Machine Learning Research 11 (2010).Dec: 3371–3408.

[118] Wang, Jingyuan, Mao, Yu, Li, Jing, Xiong, Zhang, and Wang, Wen Xu. “Predictability of road traffic and congestion in urban areas.” PLoS ONE 10 (2015).4: 1–12.

[119] Wilcoxon, Frank. “Individual comparisons by ranking methods.” Biometrics bulletin 1 (1945).6: 80–83.

[120] Xu, F., Li, Y., Wang, H., Zhang, P., and Jin, D. “Understanding Mobile Traffic Patterns of Large Scale Cellular Towers in Urban Environment.” IEEE/ACM Transactions on Networking (2017).

[121] Zeng, Sihan, Wang, Huandong, Li, Yong, and Jin, Depeng. “Predictability and Prediction of Human Mobility Based on Application-Collected Location Data.” 2017 IEEE 14th International Conference on Mobile Ad Hoc and Sensor Systems (MASS) (2017): 28–36.

[122] Zhang, Ying and Arvidsson, Åke. “Understanding the Characteristics of Cellular Data Traffic.” ACM SIGCOMM CellNet workshop. 2012.

[123] Zhang, Desheng, Huang, Jun, Li, Ye, Zhang, Fan, Xu, Chengzhong, and He, Tian. “Exploring human mobility with multi-source data at extremely large metropolitan scales.” MobiCom (2014).

[124] Zhou, Xuan, Zhao, Zhifeng, Li, Rongpeng, Zhou, Yifan, and Zhang, Honggang. “The predictability of cellular networks traffic.” 2012 International Symposium on Communications and Information Technologies, ISCIT 2012. 2012, 973–978.

[125] Ziv, J. and Lempel, A. “Compression of individual sequences via variable-rate coding.” IEEE Transactions on Information Theory 24 (1978).5: 530–536.

BIOGRAPHICAL SKETCH

Babak Alipour received his Ph.D. from the Department of Computer and Information Sciences and Engineering at the University of Florida (UF) in spring 2019, under the supervision of Prof. Ahmed Helmy. In 2014, he received his B.Eng. in information technology from Amirkabir University of Technology (Tehran Polytechnic), Iran. His research interests lie in big data analytics and machine learning in mobile networks for modeling, simulation and benchmarking of protocols. He was a lead researcher on studies funded by the National Science Foundation (NSF). He worked on research projects as an intern in Google Geo location platform during summer 2017 and summer 2018. He is a recipient of the best poster award at the IEEE International Conference on Computer Communications (INFOCOM) 2017 and the best presentation-in-session award at IEEE INFOCOM 2018. He has served as a reviewer for multiple top-tier academic journals and conferences such as IEEE Transactions on Mobile Computing (TMC), Transactions on Vehicular Technology (TVT), and International Wireless Communications and Mobile Computing Conference (IWCMC).
