INTEGRATED NETWORK TRAFFIC-MOBILITY ANALYSIS AND MODELING USING BIG DATA
By BABAK ALIPOUR
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2019
© 2019 Babak Alipour
To my Mom and Dad, whose love and support made my success possible
ACKNOWLEDGMENTS
First and foremost, I am thankful for all the help I received from my advisor, Prof. Ahmed Helmy, whose guidance, patience and intelligence made research obstacles conquerable. We thank Dr. Alin Dobra and Dr. Daisy Zhe Wang for help with the computing cluster, and the anonymous reviewers of IEEE INFOCOM 2018 for useful feedback. The term ’cello mobility’ was suggested by Prof. Mostafa Ammar and is used here with permission. Partial funding was provided by NSF Award Number 1320694 at the University of Florida. We gratefully acknowledge the support of NVIDIA Corp. with the donation of the Titan Xp GPU used for this research.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
  1.1 Data-Driven Traffic and Mobility Analysis
    1.1.1 Mobility Analysis
    1.1.2 Traffic Analysis
    1.1.3 ’FLAMeS’: Framework for Large-scale Analysis of Mobile Societies
  1.2 Predictability Analysis and Prediction Algorithm Design
  1.3 Integrated ’Generative’ Traffic-Mobility Modeling
  1.4 Research Contributions
  1.5 Dissertation Organization
2 MATERIALS AND METHODS
  2.1 Input Datasets
    2.1.1 WLAN AP Logs
    2.1.2 NetFlow Logs
  2.2 DHCP and Merging Datasets
    2.2.1 Device Type Classification
    2.2.2 Computing System
  2.3 Machine Learning Application and Background
    2.3.1 Supervised Learning
      2.3.1.1 Classification
      2.3.1.2 Sequence prediction
    2.3.2 Unsupervised Learning
      2.3.2.1 Clustering
      2.3.2.2 Generative
    2.3.3 Reinforcement Learning
    2.3.4 Feature Selection
3 ’FLUTES’ vs. ’CELLOS’: ANALYZING MOBILITY-TRAFFIC CORRELATIONS IN LARGE WLAN TRACES
  3.1 Related Work
  3.2 Mobility Analysis
    3.2.1 Session Start Probability
    3.2.2 Radius of Gyration
    3.2.3 Visitation Preferences and Interests
    3.2.4 Sessions Per Building
    3.2.5 Hourly Associations
    3.2.6 Visitation Preferences
    3.2.7 Return Probability
  3.3 Traffic Analysis
    3.3.1 Flow-level Statistical Characterization
      3.3.1.1 Size
      3.3.1.2 Packets
      3.3.1.3 Runtime
      3.3.1.4 Inter-arrival times (IAT)
      3.3.1.5 Protocols
    3.3.2 Network-Centric (Spatial) Analysis
    3.3.3 User Behavior (Temporal) Analysis
      3.3.3.1 Data consumption
      3.3.3.2 Packet rate
      3.3.3.3 Active duration
  3.4 Integrated Mobility-Network Traffic Analysis
    3.4.1 Feature Engineering
      3.4.1.1 Mobility
      3.4.1.2 Network traffic
      3.4.1.3 Cross-dimension
    3.4.2 Utility of Integrated Modeling
  3.5 Integrated Mobility-Network Traffic Generative Modeling
    3.5.1 Related Modeling Work
    3.5.2 Statistical Metrics
    3.5.3 Gaussian Mixture Model (GMM)
    3.5.4 Restricted Boltzmann Machine (RBM)
    3.5.5 Variational Auto-Encoder (VAE)
    3.5.6 Generative Adversarial Network (GAN)
  3.6 Lessons Learned and Modeling Insights
  3.7 Summary
4 PREDICTABILITY ANALYSIS AND PREDICTION ALGORITHM DESIGN
  4.1 Related Work
  4.2 Entropy Estimators and Maximum Predictability
  4.3 Prediction Algorithms
  4.4 Experimental Setup
    4.4.1 Discrete-time Series
    4.4.2 Experiment Dimensions
  4.5 Mobility Analysis
    4.5.1 Overview
    4.5.2 Spatio-temporal Resolutions
    4.5.3 Comparison of Methods
    4.5.4 Correlations with Mobility and Network Traffic
  4.6 Summary and Future Work
5 LEARNING THE RELATION BETWEEN MOBILE ENCOUNTERS AND WEB TRAFFIC PATTERNS
  5.1 Related Work
  5.2 Mobility Encounters
    5.2.1 Daily Encounter Duration at the Building Level
    5.2.2 Encounter Duration Statistical Distributions
  5.3 Web Traffic Profile
  5.4 Pairwise Encounter-Traffic Relationship
    5.4.1 Device Type Categories
    5.4.2 Weekday Vs. Weekend
    5.4.3 Encounter Duration
  5.5 Learning Encounters
    5.5.1 Random Forest (RF)
    5.5.2 Deep Learning
  5.6 Summary and Future Work
6 FUTURE DIRECTIONS
  6.1 Extensions of Flutes vs. Cellos in Social Context and Interest Dimensions
  6.2 Predictability Analysis in the Traffic and Interest Dimensions
APPENDIX
A WEB DOMAIN INTEREST ANALYSIS
  A.1 Web Domain Interest Extraction
B CHOICE OF DATA PROCESSING TOOLS
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
2-1 Summary of datasets. B=billion.
2-2 NetFlow example records.
2-3 AP logs/DHCP example records.
3-1 Summary of results for mobility analysis.
3-2 Merged DHCP-NetFlow traces overview
3-3 Traffic features used for integrated mobility-traffic analysis
3-4 Average Kolmogorov-Smirnov statistic of all algorithms
3-5 Kolmogorov-Smirnov statistic of the β-VAE for weekday features.
3-6 Kolmogorov-Smirnov statistic of the β-VAE for weekend features.
4-1 Statistics per device available for at least 7 days & accessed more than 5 APs.
4-2 Median Accuracy of LSTM across spatio-temporal resolutions
4-3 Summary of Median Accuracy for Flutes vs Cellos with different methods
5-1 Encounter record example
5-2 Daily Encounter Duration in Seconds
5-3 Best fit distributions for total daily encounter duration based on pairs classifications
LIST OF FIGURES
1-1 ’FLAMeS’ system overview.
2-1 Wireless association for a device at different times.
2-2 Time series for 25 days of combined AP-NetFlow Core traces
2-3 Machine Learning algorithms widely used in Internet traffic classification.
3-1 PDF Session start over time of the day.
3-2 Radius of gyration and visited locations S(t)
3-3 Zipf’s plot on L visited access points.
3-4 Probability P(t) of session duration t.
3-5 Hourly associations.
3-6 Time spent at preferred building.
3-7 Probability to return to a previously visited location.
3-8 Traffic distribution plots.
3-9 CDF of individual flow sizes
3-10 Lognormal distribution plot for mean packet size of either device type
3-11 Theoretical and empirical plots for Lognormal and mean packet size of Flute flows
3-12 Exponential and Beta distribution plots for IAT.
3-13 Correlation plots for mobility and traffic
3-14 Correlation plots across mobility and traffic
3-15 Visualization of Kolmogorov–Smirnov statistic
3-16 Synthetic vs. Original ’total active time (TAT)’ feature for flutes.
3-17 VAE architecture.
3-18 GAN architecture.
3-19 Kolmogorov-Smirnov statistic for mobility features
3-20 Kolmogorov-Smirnov statistic for network traffic features
3-21 Correlation between TAT and several mobility features
3-22 Visualization of mobility features with Z1 from β-VAE.
4-1 Discrete-Time Series Example
4-2 ECDF of LSTM Prediction Accuracy for Flutes & Cellos at AP and Bldg. level
4-3 Correlation of Prediction Accuracy with Mobility and Network Traffic Features
5-1 Social context axes. (Sourced from [112])
5-2 Encounter duration CDF based on encounter pair device type.
5-3 TF-IDF Matrix: Each row is a user profile.
5-4 ECDF of traffic profile similarity across device types
5-5 ECDF of traffic profile similarity for weekdays and weekends
5-6 ECDF of traffic profile similarity across durations
5-7 Pearson correlation coefficient between encounter duration and traffic profile similarity
5-8 Correlation of encounter duration and TP similarity for different Bldg. across days
5-9 Accuracy of random forest and SDAE for encounters across days
5-10 Architecture of the deep learning model (SDAE)
6-1 Positioning of the work and most promising future directions.
A-1 Heatmap of user activity across building categories
B-1 Visualization of join operation between NetFlow and DHCP in Apache Spark.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
INTEGRATED NETWORK TRAFFIC-MOBILITY ANALYSIS AND MODELING USING BIG DATA
By Babak Alipour
May 2019
Chair: Ahmed Helmy
Major: Computer Engineering
Two major factors affecting mobile network performance are mobility and traffic patterns. Simulations, analytical performance evaluations and predictive caching schemes rely on models to approximate factors affecting the network. Hence, understanding mobility and traffic is imperative to the effective evaluation and efficient design of future mobile networks. Current models target either mobility or traffic, but do not capture their interplay. Moreover, many trace-based mobility models rely on pre-smartphone datasets (e.g., AP logs) or on traces of much coarser granularity (e.g., cell towers). In addition, behaviors of users need to be modeled individually, and also in social contexts through pair-wise encounter and collective analysis. This raises questions regarding the relevance of existing models, and motivated us to revisit this area.
In this work, we conduct a multidimensional analysis to quantitatively characterize spatio-temporal mobility and traffic patterns, for laptops and smartphones, across the individual, pair-wise and collective planes, leading to a detailed integrated mobility-traffic analysis. Our study is data-driven: we collect and mine large datasets (30TB in size, covering 300k devices) that capture all of the mobility and traffic dimensions, in time and space, for each device type. The investigation is performed using our systematic framework (FLAMeS). Overall, dozens of mobility and traffic features have been analyzed. These include features at the individual level, i.e., how each user behaves, as well as interactions between users, studied through pair-wise and collective analysis.
Furthermore, predictability limits of human behavior are investigated, and pragmatic prediction algorithm design is discussed. We provide insights and lessons to serve as guidelines and a first step towards future integrated mobility-traffic models. We posit that a realistic model should encapsulate mobility, traffic, social context (individual, pair-wise, collective), and interest features of users across time and space, per device type. Finally, we build upon the knowledge of this interplay across dimensions and propose a generative integrated mobility-traffic model. Experimental results show that it captures mobility and traffic features at the individual level well, while remaining extensible to the other axes of social context. Our work acts as a stepping-stone towards a richer, more realistic suite of mobile test scenarios and benchmarks.
CHAPTER 1 INTRODUCTION
The research presented in this work consists of three main components: Data-Driven Integrated Traffic-Mobility Analysis, Predictability Analysis and Prediction Algorithm Design, and Integrated Generative Mobility-Traffic Modeling.
We first study: (I) How different are mobility and traffic characteristics across device types, time and space? (II) What are the relationships between these characteristics? (III) Should new models be devised to capture these differences? And, if so, how? Both mobility and network usage characterize different aspects of human behavior. In this sense, we have a mobility plane and a (network) traffic plane. In reality, these two planes are likely interdependent. Human mobility may be influenced by network activity; for example, a person may slow down to read incoming messages. Conversely, network activity may be influenced by mobility and location: stationary users may produce or consume more data than those walking, and people may use different services in different places ([6]). These planes also run through individual user behavior, as well as through the interactions between users in pair-wise and collective behaviors.
Next, we study predictability by analyzing spatio-temporal patterns of entropy, and we design predictive algorithms that combine classical algorithms (e.g., Markov Chains) with deep learning approaches (e.g., Recurrent Neural Networks (RNNs)). Then, we investigate the relationship between mobility ’encounters’ and web traffic profiles. Because the mobility and traffic planes run through pair-wise and collective interactions as well as individual behavior, this investigation seeks to establish the relationship between pair-wise mobility encounters and web traffic.
Finally, generative models such as Gaussian Mixture Models (GMMs) and Restricted Boltzmann Machines (RBMs) are used to create realistic integrated models spanning mobility, traffic, social context, and interest dimensions, with extensibility and privacy in mind.
Given the ubiquity of Wireless LAN traces, these logs present an important data source to glean information on the mobility behaviors and patterns of users. In addition, NetFlow traces, which contain large-scale data on network traffic headers, enable analysis of web traffic access patterns. Thus, we rely on merging 30TB of NetFlow data with half a terabyte of WLAN traces to achieve data-driven analysis, testing, design, and evaluation of integrated traffic-mobility models.
The rest of this chapter is organized as follows. Section 1.1 describes the current approaches to modeling mobility and traffic, as well as the challenges in their integration. Section 1.2 introduces ’predictability’ as a metric and discusses mobility and traffic prediction and predictor design. Section 1.3 makes the case for the necessity of integrated traffic-mobility modeling. Section 1.4 summarizes the contributions of this work. Finally, Section 1.5 provides the organization of this dissertation.
1.1 Data-Driven Traffic and Mobility Analysis
Both mobility and network usage characterize different aspects of human behavior. In this sense, we have a ’mobility plane’ and a ’network traffic plane’. In reality, these two planes are likely interdependent. In this section, we introduce how each of these planes has traditionally been studied, and the challenges in performing integrated traffic-mobility analysis and modeling.
1.1.1 Mobility Analysis
Human mobility has been an area of interest for decades, due to its importance in various research areas. It has been studied extensively, with many models derived with varying degrees of complexity and customizability. The spectrum ranges from simple synthetic mobility models to complex trace-based models, capturing different properties with varying degrees of accuracy [47, 114]. For spatio-temporal patterns, [39] and [105] reveal the regularity of human mobility and the bounds on predicting it using cellular logs, presenting similar findings. This reaffirms the intrinsic properties of human mobility, despite differences in granularity and population across datasets. Another study highlighted the importance of combining different datasets to study various features simultaneously [123]. A study using WLAN traces [15] revealed surprising patterns: long-term mobility entropy increases with age, and academic major affects students’ long-term mobility entropy. One of the motivations of our work is to analyze the stability of human mobility patterns in our dataset, especially in terms of regularity and upper bounds on prediction accuracy, and then to integrate different datasets to correlate mobility and network traffic.
1.1.2 Traffic Analysis
Network traffic has been studied extensively for fixed networks (e.g., RIPE Atlas¹, Ark², BISmark [109]), and increasingly for wireless networks: for rather stationary users, as in WLANs (e.g., [46, 66]), and for potentially more mobile users, as in cellular networks (e.g., [76, 122]). Such analyses range from metrics such as flow count, sizes, and traffic volume to service usage (e.g., visited websites, backend services). The authors of [85] investigated correlations and characteristics of web domains accessed by users and their locations, based on NetFlow and DHCP logs from a university campus in 2004. They propose a simulation paradigm with data-driven parameters, producing realistic scenarios for simulations. The authors of [91] performed an in-depth study of smartphone traffic on both WiFi and cellular networks, highlighting the benefits and limitations of using MPTCP. In [83], several candidate distributions (e.g., exponential, Weibull, Pareto, Lognormal) were analyzed for the flow inter-arrival time (IAT) and the arrival rate of ’static’ flows at APs. Lognormal was found to best fit the flow sizes, while at small time scales (i.e., hourly) IAT was best described by Weibull, with parameters that vary from hour to hour. We analyze flows from a much larger and newer dataset that includes smartphones, and identify the Lognormal distribution as the best fit for flow sizes and the beta distribution as the best fit for IAT, regardless of device type. The study in [120] analyzed ISP traces of cellular towers in Shanghai, and mapped timed traffic patterns to urban regions. It provided insight into mobile traffic patterns across time, location, and frequency. This work is complementary to ours, as we analyze campus WLAN traces at a much finer granularity.
¹ http://atlas.ripe.net
² http://www.caida.org/projects/ark/
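As a rough illustration of the kind of distribution fitting discussed above, the following sketch fits a Lognormal distribution to a handful of flow sizes by maximum likelihood and measures the fit with a Kolmogorov-Smirnov statistic. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function names and the sample values are hypothetical, only the Python standard library is used, and only the Lognormal case is covered (not the beta fit for IAT).

```python
import math
from statistics import fmean, pstdev


def fit_lognormal(samples):
    """Maximum-likelihood Lognormal fit: mean and std. dev. of log(samples)."""
    logs = [math.log(x) for x in samples]  # samples must be strictly positive
    return fmean(logs), pstdev(logs)


def lognormal_cdf(x, mu, sigma):
    """CDF of a Lognormal distribution with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of samples and a model CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


# Hypothetical flow sizes, for illustration only (not taken from the traces).
sizes = [1484, 217708, 5120, 40960, 900, 12288]
mu, sigma = fit_lognormal(sizes)
ks = ks_statistic(sizes, lambda x: lognormal_cdf(x, mu, sigma))
print(f"mu={mu:.3f} sigma={sigma:.3f} KS={ks:.3f}")
```

A smaller statistic indicates a closer match between the fitted and the empirical distribution, so candidate distributions can be ranked by repeating the last step with each fitted CDF.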
1.1.3 ’FLAMeS’: Framework for Large-scale Analysis of Mobile Societies
It is likely that the mobility and traffic planes are interdependent. Human mobility may be influenced by network activity; for example, a person may slow down to read incoming messages. Conversely, network activity may be influenced by mobility and location: stationary users may produce or consume more data than those walking, and people may use different services in different places [85]. Earlier studies have not widely considered this interdependence, and models for the mobility and network traffic planes have been developed and evaluated largely in isolation, as described in the previous sections. For example, when evaluating mobile systems’ performance, traffic generation generally follows regular patterns drawn from common simple distributions (e.g., exponential or uniform), under the assumption that neither transmission nor reception of data impacts mobility. Simply observing people walking while staring at (or reacting to) their smartphones suggests, however, that such interdependencies need to be captured properly. Understanding the mobility-traffic interplay is imperative to the effective evaluation and efficient design of future mobile algorithms, ranging from user behavior prediction and caching to network load estimation and resource allocation.
In this work, we take a step toward understanding the interconnection of the mobility and traffic planes. To do this properly, we need to consider the nature of the mobile devices people use. One class of devices is primarily intended for stationary use, typically while the user is seated; this holds mainly for laptop computers, dubbed ’cellos’. In contrast, another class, ’on-the-go’ smartphones, which we refer to as ’flutes’, lends itself to truly mobile use³. Analysis of the relation between mobility and traffic is an identified research gap. We focus our analysis on these two classes because they have been around long enough to have extensive datasets to build upon.
³ Throughout, we use ’flutes’ for ’smartphones’, and ’cellos’ for ’laptops’.
Usage and traffic patterns of different device types have been studied from various perspectives ([3, 18, 38, 68, 76, 93]). However, those findings are based on classifications that rely solely on either MAC addresses or HTTP headers. The former is rather limited, and the latter may have serious privacy implications and is often unavailable. In [30], the authors use packet-level traces from 10 phones and application-level monitoring from 33 Android devices to analyze smartphone traffic. Although this allowed fine-grained measurements, the approach is invasive and limited in scalability, leading to small sample sizes and restricted conclusions. They also neither compare the traffic of smartphones with that of “stop-to-use” wireless devices (i.e., cellos) nor measure spatial metrics. The study in [21] analyzes 32k users on campus and focuses on multi-device usage. It notes differences between laptops and smartphones in packets, content, and time of usage. That work targets device usage patterns and security, while we study mobility and wireless traffic correlations. In our method, the combination of MAC addresses and NetFlow allowed us to classify the majority of observed devices while preserving users’ privacy.
We hypothesize that the interconnection of mobility and traffic is modulated by the device(s) a mobile user is carrying. Therefore, we follow two main lines of investigation: we develop a framework to differentiate between cellos and flutes, and we study both the mobility and traffic patterns for each of those types. Specifically, we quantitatively investigate the following questions in depth: (I) How different are mobility and traffic characteristics across device types, time and space? (II) What are the relationships between these characteristics? (III) Should new models be devised to capture these differences? And, if so, how?
To answer these questions, a multi-dimensional (comparative) analysis approach is adopted to investigate spatio-temporal mobility and traffic patterns for flutes and cellos. We drive our study with large datasets (30TB+) that capture all the above dimensions in a campus society, including over 300k devices.
We set out to understand and ’quantify’ the ’gaps between flutes and cellos’ and the ’interaction between the mobility and traffic dimensions’. To methodically analyze statistical characteristics and correlations in multiple dimensions, a systematic Framework for Large-scale Analysis of Mobile Societies (’FLAMeS’) is devised, as depicted in Fig. 1-1. It is used to analyze multi-sourced, multi-dimensional big data for ’flutes’ and ’cellos’, including WLAN and NetFlow traces sourced from the University of Florida’s campus. The main components include: I. Data collection and pre-processing, II. Flutes vs. cellos mobility and traffic analysis, and III. Integrated mobility-traffic analysis.
[Figure: diagram of the framework components I.a, I.b, I.c, II, and III.]
Figure 1-1. ’FLAMeS’ system overview.
1.2 Predictability Analysis and Prediction Algorithm Design
Prediction techniques constitute fundamental building blocks for many mobile protocols and applications, ranging from resource allocation to caching and recommender systems, among others. Predictability and behavioral regularity in mobile societies have been the focus of several studies.
However, the previously reported findings on the very high predictability of human mobility were drawn from analyses that had several limitations. First, the spatio-temporal scale at which users’ locations were observed (location within mobile network cell tower coverage) renders the prediction results impractical for most mobile applications that depend on highly accurate user localization. Second, the theoretical limits of predictability were derived from an entropy estimator that was unable to capture the variability of repeated sub-sequences of visited locations, and that therefore underestimated entropy (and overestimated predictability). We revisit the study of mobile user predictability and try to address the aforementioned limitations by exploring ’maximum predictability’ and ’pragmatic predictors’ using WiFi traces, which have a much higher spatio-temporal granularity than the call data records used in most previous studies on predictability.
1.3 Integrated ’Generative’ Traffic-Mobility Modeling
Mobile network performance is heavily impacted by ’mobility’ and ’traffic’ patterns. Models are applied extensively in simulations, analytical performance evaluations and benchmarks to approximate factors affecting the network. There is a vast body of research on mobility and traffic analysis and modeling, carried out independently. Some of the most advanced models of human mobility provide mathematical frameworks with tunable parameters to generate a variety of mobility scenarios. The previous works suffer from one or more of the following limitations: 1) Strong assumptions and lack of validation on real-world big data. 2) Use of coarse-grained CDR logs. 3) Lack of integration with traffic features. 4) Missing predictability and user interest aspects.
In this work, after quantifying the differences between ’flutes’ and ’cellos’ across a variety of spatio-temporal mobility and traffic metrics, we establish the necessity of modeling the device type, in addition to spatio-temporal mobility and traffic features, at the individual, pair-wise and group levels. We also analyze predictability and use it as a feature to define user behavior. Finally, we attempt to put all of these features together to create an integrated generative traffic-mobility model.
1.4 Research Contributions
Our main contributions include:
1. ’Integrated Mobility-Traffic Analyses’ (Chapter 3): This work is the first to quantify the correlations of numerous features of mobility and traffic simultaneously. This can identify gaps in existing mobile networking models, and reopen the door for future impactful work in this area.
2. ’Flutes vs. Cellos Analysis’ (Sec. 3.2 and 3.3): The device type classification presented here is an important dimension to understand. This is particularly relevant as new generations of portable devices are introduced that differ from the laptops traditionally considered in earlier studies.
3. ’Systematic Multi-dimensional Investigation Framework’ (Fig. 1-1): ’FLAMeS’ provides the scaffolding needed to process, in multiple dimensions, many features of large sets of measurements from wireless networks, including AP-logs and NetFlow traces. This systematic method can apply to other datasets in future studies.
4. ’Predictability Analysis and Prediction Algorithm Design’ (Chapter 4): We use highly granular datasets to measure maximum predictability, and compare the theoretical upper bounds with pragmatic predictors based on deep learning techniques such as Long Short-Term Memory (LSTM).
5. ’Analysis of Mobile Encounters and Web Traffic Patterns’ (Chapter 5): Mobility and network traffic have traditionally been studied separately and at an ’individual’ level. The interactions and patterns at the ’encounter’ and ’group’ levels are vital factors for future mobile services and encounter-based services, but have not been studied in depth with real-world big data. In Chapter 5, we characterize mobility encounters and study the correlation between encounters and web traffic profiles using large-scale datasets of WiFi and NetFlow traces.
6. ’Integrated Mobility-Traffic Generative Modeling’ (Section 3.5): Models are utilized extensively in simulations, performance evaluations and benchmarks. With the analyses done in previous chapters, the dimensions of such models were investigated and their correlations analyzed. This part of the study is the culmination of all of those analyses, with the goal of creating an integrated generative mobility-traffic model.
1.5 Dissertation Organization
The rest of this dissertation is organized as follows. Chapter 2 introduces the materials and methods used in this study, including the datasets, tools, and brief machine learning background information. Chapter 3 presents the analysis of mobility and traffic dimensions for flutes and cellos, and then provides insight for an integrated mobility-traffic model that captures the correlations of mobility and traffic features across device types, time and space. It concludes by outlining a generative, integrated mobility-traffic model based on the prior analyses. In Chapter 4, work on the predictability of human behaviors is presented, theoretical limits of predictability are investigated, and a new prediction algorithm is proposed that combines Markov Chain (MC) based predictors with recurrent neural networks. Chapter 5 characterizes mobility encounters and studies the correlation between encounters and web traffic profiles using the large-scale datasets of WiFi and NetFlow traces introduced in Chapter 2. It also introduces a deep learning approach that learns to classify whether the devices behind a pair of web traffic profiles have encountered each other. Finally, Chapter 6 puts forward potential future research directions that have been identified throughout this work.
CHAPTER 2 MATERIALS AND METHODS
This chapter provides background information on the materials and methods used in this study, including the multi-sourced datasets, a wide variety of analytical and big data tools, the experimental setup, and a brief introduction to several machine learning algorithms explored in this work. The goal is to describe the raw, interim and preprocessed inputs for the rest of this document (i.e., parts I.a, I.b and I.c of the FLAMeS framework depicted in Fig. 1-1), with reproducibility and completeness in mind.
2.1 Input Datasets
We drive our FLAMeS framework with large-scale datasets from multiple sources, capturing the mobility and traffic features in different dimensions. In this section, we introduce the two major datasets and their preprocessing, and present the device type classification into flutes and cellos. This is a necessary step to enable analysis of numerous metrics for mobility and traffic over such large-scale datasets. The input datasets are specifically chosen to capture: 1. location, mobility and network traffic information, 2. smartphone and laptop devices, 3. spatio-temporal features, and 4. scale in the number of devices and records. The total size is >30TB, consisting of two main parts: WLAN Access Point (AP) logs and NetFlow records (details in Tables 2-1, 2-2, and 2-3).¹
Table 2-1. Summary of datasets. B=billion.
          Record count         Traffic Vol. (TB)    MAC count
          DHCP      CORE       TCP       UDP        WLAN       CORE
Flutes    412.0 M   2.13 B     56.18     4.50       186.0 K    50.3 K
Cellos    101.0 M   4.20 B     73.85     12.90      93.2 K     27.1 K
Total²    557.5 M   6.53 B     134.39    17.61      316.0 K    80.0 K
1 Data was collected using proper procedures. It does not contain personally identifiable information (PII).
Table 2-2. NetFlow example records.
Start time       Finish time      Duration  Src. IP        Dst. IP        Protocol  Src. port  Dst. port  Packets  Size
1334332274.912   1334332276.576   1.664     173.194.37.7   10.15.225.126  TCP       80         60482      157      217708
1334332281.440   1334332282.912   1.472     10.15.133.170  74.125.229.58  TCP       2068       80         6        1484

Table 2-3. AP logs/DHCP example records.
User IP        User MAC           AP name         AP MAC             Lease begin time  Lease end time
10.130.90.3    00:11:22:33:44:55  b422r143-win-1  00:1d:e5:8f:1b:30  1333238737        1333238741
10.132.190.68  00:22:11:44:11:66  b416c299-win-1  00:17:59:5a:0e:30  1333239804        1333239818
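To make the record layout of Table 2-2 concrete, the sketch below parses one whitespace-separated NetFlow row into a typed record and checks that the duration column agrees with the two timestamps. The class name, field names and helper functions are hypothetical, and the whitespace-separated text layout is an assumption made for illustration; the actual export format of the traces is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One NetFlow row, with fields in the column order of Table 2-2."""
    start: float     # start time (Unix epoch seconds)
    finish: float    # finish time (Unix epoch seconds)
    duration: float  # reported duration in seconds
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int
    packets: int
    size: int        # reported flow size (unit not stated in the table)


def parse_flow(line: str) -> FlowRecord:
    """Parse a whitespace-separated row (assumed layout) into a FlowRecord."""
    f = line.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}")
    return FlowRecord(float(f[0]), float(f[1]), float(f[2]), f[3], f[4], f[5],
                      int(f[6]), int(f[7]), int(f[8]), int(f[9]))


def duration_matches(rec: FlowRecord, tol: float = 1e-3) -> bool:
    """Check that finish - start agrees with the reported duration."""
    return abs((rec.finish - rec.start) - rec.duration) <= tol


row = ("1334332274.912 1334332276.576 1.664 173.194.37.7 "
       "10.15.225.126 TCP 80 60482 157 217708")
rec = parse_flow(row)
print(rec.src_ip, "->", rec.dst_ip, rec.protocol, duration_matches(rec))
```

A consistency check of this kind is a cheap way to catch misaligned columns before a large join; the tolerance is an arbitrary choice that absorbs floating-point rounding on epoch-scale timestamps.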
2.1.1 WLAN AP Logs
These logs were collected from 1760 APs in 138 buildings over 479 days on a university campus, and contain association and authentication events from 316k devices in 2011-2012. They contain over 555M records, with each record including the device’s MAC and assigned IP addresses, the associated AP and a timestamp. The location of each AP is approximated by the location of the building where it is installed, i.e., the (longitude, latitude) obtained from the Google Maps API. To validate this, we fetched 8000 mapped APs around the campus area from a crowd-sourced service, wigle.net. All of the 130 matched APs (7.6% of total), located in 42% of buildings (i.e., 58 bldgs), were less than 200m from their mapped location, an error of less than 1.5% of the campus area. This is a reasonable margin of error for our research purposes. It is also acceptable given the maximum AP coverage range, the inaccuracy of coarse-grained localization services, and the fact that we use the coordinates of the center of each building, whereas users may see an AP on the edge of a building. These access points are installed in a wide variety of buildings, including housing, classrooms, computer laboratories, libraries, offices, administrative buildings, and restaurants.
2.1.2 NetFlow Logs
Over 76 billion records of NetFlow traces were collected from the same network, over 25 days in April 2012. A flow is defined as a consecutive sequence of packets with the same transport protocol, source/destination IP and port number, as identified by the collecting gateway router. Examples of the major NetFlow data fields are presented in Table 2-2. The NetFlow records are matched with the wireless associations (from the AP logs) using the dynamic MAC-to-IP address mapping from the DHCP logs. We refer to the result as the CORE dataset (Table 2-1). The records are also augmented with location and website information using reverse DNS (rDNS). Dataset merging and system details for this part of the work are described in the following sections.
2.2 DHCP and Merging Datasets
In order to study network traffic across devices and APs, it is necessary to match the NetFlow records with wireless associations (from the WLAN dataset). This task requires the MAC-IP mapping. The IP addresses are dynamically assigned using DHCP, but DHCP session logs were not directly available and had to be derived. We define the duration of a DHCP lease as the time between two consecutive associations of a device with any AP; i.e., when a device connects to AP1, a session starts, and once the device connects to AP2, the first session ends and a new one starts. Fig. 2-1 illustrates the associations of a sample device with different APs at different times. The first session would have the IP given by AP1 and a lease time of t2 − t1, and so on (a total of 5 sessions in this example). The last association is discarded as we do not know the duration of that IP assignment. Combining these derived DHCP records with the location information and device type classification, we create the DHCP table.
Figure 2-1. Wireless association for a device at different times.
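The lease derivation described above can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in this work; the function name `derive_leases` and the `(mac, ip, ap, timestamp)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def derive_leases(associations):
    """Derive DHCP-like lease sessions from association events.

    associations: iterable of (mac, ip, ap, timestamp) tuples
    (hypothetical layout, chosen only for this sketch).
    """
    by_device = defaultdict(list)
    for mac, ip, ap, ts in associations:
        by_device[mac].append((ts, ip, ap))

    leases = []
    for mac, events in by_device.items():
        events.sort()  # chronological order for this device
        # Pair each association with the next one. The last association
        # has no known end time, so it never appears as a lease start.
        for (t_start, ip, ap), (t_end, _, _) in zip(events, events[1:]):
            leases.append({
                "mac": mac, "ip": ip, "ap": ap,
                "lease_begin": t_start, "lease_end": t_end,
            })
    return leases
```

A device with n associations yields n − 1 derived leases, since the final association is discarded.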
The derived DHCP and NetFlow datasets were then merged to form what we refer to as the Core dataset for our study. The common identifiers between the two are the clients’ IPs, in addition to the start and end times of flows, hence the need for a DHCP-like set. For a DHCP lease session LS, all flows whose IP address is the same as that of the lease and whose entire lifetime falls within the lease duration are associated with LS. Given these traces, cellular usage cannot be analyzed. However, this does not significantly impact the analysis for two reasons: 1) The traces already capture a very large user base, with tens of thousands of active devices. This raises confidence in our analysis of a real-world WLAN. 2) The WiFi campus coverage is ubiquitous, with 1760 APs installed in the vast majority of populated areas. Also, most laptops on campus lack cellular connectivity, and many smartphones use WiFi for their data to avoid cellular data costs.
2.2.1 Device Type Classification
To classify devices into flutes and cellos, we utilize several observations and heuristics. To start, note that a device manufacturer can be identified from the OUI, i.e., the first 3 octets of the MAC address3. Most manufacturers produce one type of device (either laptop or phone), but some produce both (e.g., Apple). In the latter case, an OUI used for one device type is not used for another. We conducted a survey to help classify 30 MAC prefixes accurately. Using OUI and survey information, we identify and label 46% of the total devices (90k cellos and 56k flutes). Then, from the NetFlow logs of these labeled devices, we observe over 3k devices (92% of which are flutes) contacting admob.com, an ad platform serving mainly smartphones and tablets (i.e., flutes). This enables further classification of the remaining MAC addresses. Finally, we apply the following heuristic to the dataset: (1) obtain all OUIs (MAC prefixes) that contacted admob.com; (2) if an OUI is unlabeled, mark it as a flute. Overall, over 270k devices were labeled (180k as flutes), covering 86% of the devices in AP logs and 97% in NetFlow traces, a reasonable coverage for our purposes. Out of ≈80k devices in the NetFlow logs, ≈50k are flutes and ≈27k are cellos.
Fig. 2-2 shows the temporal plot for the combined traces over 25 days, after device classification. Throughout, the number of flows and the total traffic volume are clearly higher for cellos, even with an overall higher number of flutes connected. This larger number of flutes is also reflected in up to 600 more active APs communicating with flutes (during early morning hours). Also, note the diurnal and weekly cycles of device activity, with the peaks occurring during weekdays, as expected. Wed, 25th, was the last day of classes, explaining the decline in network activity afterwards. This plot motivated our analyses of flutes vs. cellos, over weekends vs. weekdays.
3 MAC address randomization does not affect our association trace since it became common practice after our traces were collected.
Figure 2-2. Time series for 25 days of combined AP-NetFlow Core traces.
2.2.2 Computing System
The size of the datasets is ≈30TB in raw text format, consisting mostly of NetFlow data, with ≈0.5TB for AP logs. Managing and mining such large-scale datasets posed several challenges and required thorough preparation, as well as a fast machine with plenty of resources/memory. We explored several techniques and pipelines for extraction, transformation, loading (ETL) and querying of big data, and chose tools from the Apache Hadoop ecosystem. We use Hive as our data warehouse (tables stored in Parquet format). Apache Spark is the compute engine for data processing and analysis tasks. Computation runs on two nodes, each with 64 cores and ≈0.5TB of memory. Further discussion of the system and comparison to others is out of the scope of this document.
2.3 Machine Learning Application and Background
This section provides a brief overview of machine learning (ML) and deep learning (DL) fundamentals and concepts that are essential to the discussions and comparisons of various methods in the rest of this study. The goal is to introduce important concepts that help explain decisions that have been made in terms of ML algorithms, methods and tools, which will lead to discussions on speed, efficiency and scalability. The amount of data generated has been increasing rapidly, and thus it has become evident that automated data analysis is crucial to make sense of these massive data pools. Machine learning enables computers to learn functions and patterns from data, without having been programmed explicitly; those patterns can then be used to predict future outcomes. Machine learning is typically utilized when explicit algorithms are not feasible and has been widely adopted in various fields, from automatic speech recognition and translation to data center optimization [36]. According to the type of problem, available data and the context, machine learning methods are broadly put into three groups [88]: supervised learning, unsupervised learning and reinforcement learning. An introduction to each of these categories follows.
2.3.1 Supervised Learning
In supervised learning, the goal is to learn a function approximation, a mapping, from inputs to outputs, given a set of labeled data. The input typically consists of D-dimensional vectors of features or attributes, and the output is either a real number or a category from a finite set of categorical class labels. Labeled data consists of data points whose input and output are known; this is also called the training data. When the output is provided by direct observation with very high confidence, it is referred to as ground or base truth. Erroneous output labels in the training data directly impact the usefulness of learned mappings, as the supervised learning algorithm merely optimizes its predicted output to be as close as possible to the ground truth, according to its loss function; it does not have any means to determine the real-world usefulness of results. It is up to the researchers to provide reasonable ground truth, which is indeed one of the major resource-intensive challenges in supervised learning. Depending on whether the output is categorical or real-valued, the problem is called classification or regression, respectively. The following algorithms have been used in multiple studies for traffic classification and are the most popular supervised learning algorithms in the community.
2.3.1.1 Classification
Naïve Bayes (NB). This model is a probabilistic classifier based on Bayes’ theorem. Under the assumption that the features are conditionally independent given the class, the probability of an observation with multiple features given a class can be written as the product of the probabilities of its individual (one-dimensional) features given that class, each of which can be easily estimated from the data. This method is optimal if the independence assumption holds. The strong conditional independence assumption is the reason for the word “Naïve”, because most real-world datasets are not expected to show such properties [88]. Nonetheless, the authors of [24] show that the Naïve Bayes classifier achieves high classification accuracy on many real-world datasets. They chose 28 datasets from the UCI machine learning repository [71] and obtained an average accuracy of 79%. Owing to its simplicity, the Naïve Bayes algorithm runs very fast compared to many other machine learning algorithms [87].
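As a rough illustration of the product-of-per-feature-probabilities idea, the following sketch implements a Naïve Bayes classifier for categorical features. The function names, the toy feature values and the use of add-one (Laplace) smoothing are assumptions made for this example and are not taken from the studies cited above.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class, per-feature value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)           # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            feat_counts[(y, j)][v] += 1
            values[j].add(v)
    return class_counts, feat_counts, values


def predict_nb(model, row):
    """Return the class maximizing log P(c) + sum_j log P(x_j | c)."""
    class_counts, feat_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for j, v in enumerate(row):
            # Add-one smoothing (an assumption) avoids log(0) for unseen values.
            score += math.log((feat_counts[(c, j)][v] + 1) / (n_c + len(values[j])))
        if score > best_score:
            best, best_score = c, score
    return best
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow when the number of features is large.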
k-Nearest Neighbors (k-NN). This algorithm is a non-parametric learner, which means that it does not have a fixed number of parameters; instead, the model effectively grows with the amount of training data. The algorithm looks at some number of nearest labeled neighbors of a new data point and typically outputs the majority class label. The parameter k determines how many neighbors are to be examined, and a distance metric such as the Euclidean distance establishes the notion of nearest neighbors [88]. Since there is no actual training involved, this algorithm is categorized as a “lazy” learner, which means the computation happens when a new unlabeled data point is made available and a label prediction is required. There have been numerous research studies on making k-NN run efficiently, such as [98], or in a distributed environment, such as [74].
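A minimal sketch of the majority-vote rule described above, assuming Euclidean distance and a brute-force scan over the training points; the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # All the work happens at query time ("lazy" learning): compute every
    # distance, sort, and keep the k closest labeled neighbors.
    dists = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

The brute-force scan costs time linear in the training set size per query, which is the cost that the efficient and distributed variants cited above aim to reduce.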
Support Vector Machines (SVM). This class of supervised learners typically projects the data points into a higher dimensional space and then attempts to find a hyperplane or a set of hyperplanes that best separate the target classes. That means finding a hyperplane whose distance from the nearest data points is maximum. Those nearest data points in the training set are the support vectors [110, 115]. A more in-depth discussion of the theory behind this model is beyond the scope of our survey and can be found in [88] and [115].
Decision Trees (DT). Given a training set of data points with known labels, the goal is to find classification rules that can predict the label of any data point from the values of its features. Such classification rules can be expressed as a decision tree. The internal nodes of a decision tree consist of rules whose outcomes create the branches. The leaves of a decision tree represent the class labels. Training a decision tree on the input data involves creating various possible decision trees on the input and choosing the simplest one that most often correctly classifies the training data [95]. A popular algorithm for supervised learning using decision trees, the C4.5 algorithm, is described in detail in [96]. An open source Java implementation of this algorithm, J48, is available in the WEKA software suite [43] and used in numerous studies [7, 33, 89].
Multilayer Perceptron (MLP). Neural networks, which aspire to mimic the brain, have long been of great interest to researchers. A specific type of neural network is the multilayer perceptron, which is a feed-forward artificial neural network. The neurons in this type of network form a directed acyclic graph, hence the name feed-forward. Each non-input node in the graph represents a neuron that has an activation function; this function defines the output of the neuron given its set of inputs. The authors of [69] show that these networks can approximate any continuous function if and only if the activation function is not a polynomial. The researchers in [9] are among the earliest adopters of MLP for Internet traffic classification.
Meta-algorithms. This category consists of algorithms that by definition are not machine learning algorithms themselves but rely on combining various machine learning techniques into a more accurate learner. A well-known meta-algorithm is AdaBoost [34]. This algorithm makes no assumptions about the performance of the weak classifiers as long as they are slightly better than random in predicting the true label. AdaBoost, typically using a decision tree, attempts to reduce error by iteratively updating weights for input data depending on the results of the previous iteration. The study in [7] applies AdaBoost for traffic characterization.
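The iterative re-weighting can be illustrated with a compact sketch of discrete AdaBoost over one-dimensional decision stumps (single-threshold trees). Labels are assumed to be +1/−1, and all function names are hypothetical; this illustrates the weight update and is not the configuration used in [7].

```python
import math


def stump_predict(x, threshold, polarity):
    """A one-level decision tree: +polarity above the threshold, else -polarity."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, w):
    """Pick the (threshold, polarity) pair with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict(xi, threshold, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=5):
    n = len(xs)
    w = [1.0 / n] * n  # start with uniform weights on the training points
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points gain weight; correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

Each weak learner receives a vote `alpha` that grows as its weighted error shrinks, so more reliable stumps dominate the final weighted majority.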
2.3.1.2 Sequence prediction
Markov Chain-based predictor. A Markov chain (MC) with a discrete state space has been applied for user mobility sequence prediction [75, 107]. In an order-k Markov predictor, the state space consists of tuples of k location (e.g., AP) names, where the next location prediction depends solely on the most recent preceding k-tuple. We build the model on the data so that observed k-tuples comprise the states. The transition probabilities are learned based on the frequency of appearances of such a transition in observations. The probability of a transition from the current state $S = X_i X_{i+1} \ldots X_j$ to $X_{i+1} X_{i+2} \ldots X_j X_{j+1}$, where $j - i = k - 1$ (so that each state holds $k$ symbols) and each $X_i$ is the symbol for a location, is represented as $P(X_{j+1} = c \mid S = X_i X_{i+1} \ldots X_j)$ for all $c$ observed in the data, and is learned based on the reappearance frequency of such a sequence. If an order-k Markov chain, O(k), encounters a sequence that it has never seen before, it falls back to O(k-1) recursively. The base case is O(0), which is simply the frequency distribution of all symbols observed so far. We compare the accuracy of Markov chains of varying orders with the theoretical predictability and recurrent neural networks in Chapter 4.
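The order-k predictor with recursive fallback can be sketched as follows. This is an illustrative implementation under simple assumptions (a single symbol sequence, most-frequent-successor prediction); the class name `MarkovPredictor` is hypothetical and the code is not the one evaluated in Chapter 4.

```python
from collections import Counter, defaultdict


class MarkovPredictor:
    """Order-k Markov predictor that falls back to lower orders."""

    def __init__(self, order):
        self.order = order
        # counts[k][context] is a Counter of symbols seen after that k-tuple.
        self.counts = [defaultdict(Counter) for _ in range(order + 1)]

    def fit(self, sequence):
        for k in range(self.order + 1):
            for i in range(k, len(sequence)):
                context = tuple(sequence[i - k:i])  # empty tuple when k == 0
                self.counts[k][context][sequence[i]] += 1
        return self

    def predict(self, history):
        # Try the longest usable context first, then fall back to
        # O(k-1), ..., O(0). O(0) is the overall symbol frequency.
        for k in range(min(self.order, len(history)), -1, -1):
            context = tuple(history[len(history) - k:])
            if context in self.counts[k]:
                return self.counts[k][context].most_common(1)[0][0]
        return None  # only reached if the model was never fitted
```

For instance, after fitting on the location sequence A, B, C, A, B, C, A, B with order 2, the history (A, B) predicts C, while the unseen history (B, A) falls back to the order-1 context (A) and predicts B.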
Recurrent Neural Networks (RNN). A more recent approach to sequence prediction, used in Chapter 4, is using deep recurrent neural networks (RNNs). Recurrent neural networks have loops within their cells, allowing information to persist and thus enabling the neural network to connect previous information to make a reasonable prediction of the future. Certain types of RNNs are capable of learning long-term dependencies. These networks are essentially supervised neural networks that are trained to predict the next symbol in a sequence. There are multiple variants of RNNs, including Long Short-Term Memory (LSTM) [52] and Gated Recurrent Unit (GRU) [20]. These networks can learn dynamic temporal patterns and have successfully been applied in speech recognition and text-to-speech engines [101]. In this work, we use a multi-layer LSTM to predict movements of users based on input tuples similar to those used for the Markov chain-based predictors in Chapter 4. RNNs are computationally expensive and require hyper-parameter tuning. Thus, the deep model is run only on a sample of users in this study. We also introduce a hybrid model of Markov chains (MC) and RNN. The hybrid model uses the output of the RNN only when the RNN is very confident (i.e., over 90% probability associated with the next symbol); otherwise, it falls back to the MC introduced before.
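The decision rule of the hybrid model can be written down directly. In this sketch, the RNN output is assumed to be available as a mapping from candidate next symbols to probabilities, and the Markov chain prediction is passed in as a plain value; both interfaces, and the function name, are assumptions made for illustration.

```python
def hybrid_predict(rnn_probs, markov_prediction, threshold=0.9):
    """Use the RNN's top symbol only if its probability exceeds `threshold`.

    rnn_probs: dict mapping candidate next symbol -> predicted probability.
    markov_prediction: the fallback symbol from the Markov chain predictor.
    """
    symbol, prob = max(rnn_probs.items(), key=lambda kv: kv[1])
    if prob > threshold:
        return symbol
    return markov_prediction
```

The threshold of 0.9 mirrors the 90% confidence level mentioned above; lowering it would hand more predictions to the RNN.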
1D Convolutional Neural Networks (CNN). Another type of neural network, the convolutional neural network (CNN), learns convolutional filters to extract latent information across the data (i.e., 1D CNNs learn different temporal locality patterns) and uses that information for predicting the next location. This is in contrast to the 2D convolution filters that are typically used in image recognition tasks, where the filters capture spatial information. 1D CNNs have successfully been used in natural language processing for modeling and classification of sentences [61, 64]. In our study in Chapter 4, in addition to the multi-layer LSTM, we also use a 1D CNN to predict movements of users based on input tuples similar to those used for MC-based predictors. Convolutional neural networks tend to utilize GPU resources better, and thus are faster in practice in our tests compared to the recurrent neural networks.
2.3.2 Unsupervised Learning
In contrast to supervised learning, in unsupervised learning only the inputs are given, and the goal is to find interesting patterns; that is why it is also sometimes called knowledge discovery. The most common unsupervised learning method is clustering. There are generally two types of clustering: hard clustering algorithms such as K-means assign each data point to exactly one cluster, whereas soft clustering algorithms such as the expectation maximization (EM) algorithm [22] assign to each data point probabilities of belonging to different clusters [81]. K-means is sometimes called a hard EM algorithm [88]. As [88] explains, due to the inherent ambiguity of pattern mining, there is no obvious error metric to use; various metrics are employed in different algorithms. These cluster validity measures are typically divided into two major groups [42]:
• External measures: These measures rely on the availability of external class labels to validate clusters. For instance, for each cluster, one can calculate the ratio of data points within the cluster with a specific class label and then repeat the same procedure to find the maximum ratio over all class labels. That is the definition of purity for the cluster. The average purity of all clusters can be used as a measure for the purity of the clustering algorithm, with higher purity meaning more homogeneous clustering. The ratios calculated as described above can also be interpreted as probabilities of various classes within a cluster and then used to calculate the entropy of each cluster. Taking the average of these entropy values, one can compute the entropy of the clustering algorithm, with lower entropy showing higher homogeneity.
• Internal measures: These measures attempt to quantify how good the clusters are without class labels. An example of this measure is the sum of squared errors (SSE), which is calculated as the sum of squared distances of each data point from its cluster mean. The distance is usually the Euclidean or the Manhattan distance. Another popular cluster validation method is the silhouette coefficient [97]. This measure calculates how well each data point belongs to its own cluster and how well it is separated from other clusters. Since the definition is per data point, the silhouette coefficients of all points within a cluster are typically averaged to get a measure per cluster, or these values are averaged over all clusters to get an idea of the clustering algorithm’s performance.
A brief introduction of clustering methods that have been used in traffic classification follows. Almost all of the studies that we discuss use labeled data and rely on some form of external validity measure for their choice of clustering method.
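The purity, entropy and SSE measures defined above translate directly into code. The sketch below follows the unweighted per-cluster averaging described in the text; the function names are hypothetical, and each cluster is represented simply as the list of class labels (or points) it contains.

```python
import math
from collections import Counter


def cluster_purity(labels):
    """Largest fraction of points in the cluster sharing one class label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def cluster_entropy(labels):
    """Entropy (in bits) of the class distribution inside one cluster."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def average(measure, clusters):
    """Unweighted mean of a per-cluster measure over all clusters."""
    return sum(measure(c) for c in clusters) / len(clusters)


def sse(points, centroid):
    """Sum of squared Euclidean distances from the cluster mean."""
    return sum(math.dist(p, centroid) ** 2 for p in points)
```

As a small example, the clusters [web, web, web, p2p] and [p2p, p2p] have purities 0.75 and 1.0, so the average purity is 0.875.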
2.3.2.1 Clustering
Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This clustering method is density-based, which means that clusters are defined to be regions of higher density in the input space. DBSCAN finds clusters of points that are close to each other and marks those that do not have many data points nearby as outliers. The algorithm takes two parameters, ε and N. It starts with some random, yet unvisited, data point and looks for points that are less than ε distance away from it. If the number of such points exceeds N, then a new cluster is started from there [29].
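A compact sketch of this procedure is given below. It makes a few simplifying assumptions that are not part of the description above: points are visited in index order rather than randomly, a point counts as its own neighbor, and a point is treated as dense when it has at least `min_pts` neighbors. The function names are hypothetical.

```python
import math


def region_query(points, i, eps):
    """Indices of all points closer than eps to points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) < eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense: keep expanding
    return labels
```

Unlike K-means, the number of clusters is not specified in advance; it emerges from the density parameters.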
Expectation Maximization (EM). The goal of expectation maximization is to find the most probable clusters based on the training data and prior expectations. Assuming there exist groupings of data points within the entire dataset such that each group fits a certain distribution with different parameters, the dataset can be modeled as a mixture of these distributions. The probability distribution of choice is usually the Gaussian distribution. Finding the optimal maximum likelihood estimate is NP-hard [88], but the EM algorithm is a good heuristic that iteratively attempts to find the maximum likelihood estimates of the parameters for these distributions [81, 88]. Another issue is finding the number of groupings in the data, which is typically done empirically. The authors of [81] apply the EM algorithm for clustering different traffic profiles.
K-Means. K-means, sometimes called hard EM [88], assigns each data point to exactly one cluster. This problem is also NP-hard, but a heuristic algorithm that iteratively refines the assignment exists and is sometimes referred to as Lloyd’s algorithm [73]. In the simplest approach, one starts with K random points in the input space, called the centroids, and then assigns each data point to the nearest centroid, creating K clusters. Before the start of the next iteration, K new centroids are calculated by taking the mean of each cluster. This process is typically repeated until the changes to the centroids become smaller than a threshold. Examples of studies applying K-means to traffic characterization include [27, 28].
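The iteration just described can be sketched as follows. As a simplification, the initial centroids are sampled from the data points themselves rather than from the whole input space; that choice, the function name `kmeans` and the default parameters are assumptions made for this example.

```python
import math
import random


def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stop once the centroids have stabilized
            break
    return centroids, clusters
```

Because the result depends on the initial centroids, the procedure is commonly restarted several times and the run with the lowest SSE is kept.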
2.3.2.2 Generative
Gaussian Mixture Model (GMM). Mixture models allow probabilistic representation of sub-populations within an overall population. In a Gaussian mixture model, all data points are assumed to be generated from a mixture of Gaussian distributions. These models rely on Expectation Maximization (EM), introduced above, for finding the best fit of the Gaussian mixture. This approach is efficient and fast, with reasonable results on our datasets.
Restricted Boltzmann Machine (RBM). A Restricted Boltzmann Machine is a type of artificial neural network that uses generative learning, which is different from the discriminative learning of supervised learning methods. In this approach, the goal is to estimate the probability distribution of the input. A fast learning algorithm for RBMs was introduced in 2006 [50]. RBMs have been used for various tasks, such as classification, regression, feature learning and modeling. An RBM is composed of two layers, a visible (input) layer and a hidden layer. Each node in the hidden layer takes input from all nodes in the visible layer and produces an output by putting the weighted sum of those values through an activation function such as a sigmoid. The weights are typically initialized to small random values chosen from a zero-mean Gaussian with a standard deviation of about 0.01; using larger random values can speed up the initial learning, but may lead to a slightly worse final model [49]. There are no intra-layer edges (hence “restricted” in the name). RBMs can reconstruct the data without labels, by sampling the values of the visible layer from the hidden layer, and can be utilized for modeling. In this work, we use RBMs in Section 3.5 to reconstruct integrated mobility-traffic data points. More advanced deep generative models are introduced and employed in Section 3.5.
Figure 2-3. Machine learning algorithms widely used in Internet traffic classification. [Taxonomy: supervised learning (Naïve Bayes, k-nearest neighbors, support vector machines, decision trees, multilayer perceptron, meta-algorithms such as AdaBoost); unsupervised learning (EM clustering, DBSCAN, K-means).]
2.3.3 Reinforcement Learning
Reinforcement learning (RL) is a type of learning which stems from behaviorism in psychology, with the goal of maximizing a reward by taking appropriate actions in a given context or environment, allowing the machine to learn a policy given the feedback from the environment. It has a wide range of applications, from robot navigation to playing abstract strategy games [102], but it is less commonly used in general and we have not seen RL techniques being applied to the traffic characterization problem.
2.3.4 Feature Selection
The run time of many machine learning models, whether supervised such as Naïve Bayes or unsupervised such as K-means, depends on the number of features. In addition, training a statistical classifier such as Naïve Bayes involves fitting a joint distribution over the features, which in this case are assumed to be independent. Such a model’s performance can be improved if less relevant features that are not beneficial for classification are removed [88]. One way of doing this is to use filters that measure the relevance of features independent of the choice of the model. For example, the correlation between the features and the target class can be used for feature selection. Another approach is using wrapper methods. This approach uses a learning algorithm and attempts to find a good subset of features by running the learning algorithm on different subsets of features iteratively and comparing their outcomes. There are many ways to generate the candidate subsets of features, including but not limited to exhaustive search of all subsets, forward selection, which starts with no features and adds features one by one, and backward elimination, where all features are available in the beginning and are removed one by one at each iteration until no improvement is observed [87]. In the end, “All models are wrong, but some models are useful” [11].
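The forward selection wrapper described above can be sketched as a greedy loop. The scoring callable stands in for training and evaluating a learning algorithm on a candidate feature subset; its interface, the function name and the stopping rule (stop when no single added feature improves the score) are assumptions made for this illustration.

```python
def forward_selection(features, score):
    """Greedy wrapper: start empty, repeatedly add the best-scoring feature.

    features: list of feature names.
    score: callable taking a list of feature names and returning a number,
           e.g. validation accuracy of a model trained on that subset.
    """
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # no remaining feature improves the outcome
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score
```

Backward elimination follows the same pattern in reverse, starting from the full feature set and removing one feature per iteration.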
CHAPTER 3 ‘FLUTES’ VS. ‘CELLOS’: ANALYZING MOBILITY-TRAFFIC CORRELATIONS IN LARGE WLAN TRACES
Mobile network performance is affected by a multitude of elements. Two major factors affecting mobile network performance are ‘mobility’ and ‘traffic’ patterns. Simulations and analytical-based performance evaluations rely on models to approximate factors affecting the network, enabling design and planning of future networks. Hence, the understanding of mobility and traffic is imperative to the effective evaluation and efficient design of future mobile networks. Current models target either mobility or traffic, but do not capture their interplay. Most trace-based mobility models have used pre-smartphone datasets (e.g., AP logs) or traces of much coarser granularity (e.g., cell towers). This raises questions regarding the relevance of existing models, and motivates our study to revisit this area. Here, we conduct a multi-dimensional analysis to ‘quantitatively’ characterize mobility and traffic spatio-temporal patterns, for laptops and smartphones, leading to a detailed integrated mobility-traffic analysis. We present a ‘data-driven’ study, published in [6], as we collect and mine large datasets (with over ‘30TB’ of data, and 300k devices overall) that capture all of these dimensions. The investigation is performed using our systematic (‘FLAMeS’) framework (Fig. 1-1). Overall, dozens of mobility and traffic features have been analyzed. The insights and lessons learned serve as guidelines and first steps towards future ‘integrated mobility-traffic models’. In addition, our work acts as a stepping-stone towards a richer, more realistic suite of ‘mobile test scenarios’ and ‘benchmarks’.
3.1 Related Work
The relevant body of literature can be grouped into multiple subcategories, including studies in mobility analysis and modeling, traffic and network usage analysis and modeling, as well as variations of the aforementioned dimensions across device types.
Human Mobility: Human mobility is an important area of research with numerous applications in urban planning, design and construction of infrastructure (including networking), congestion prediction, energy consumption estimation, content dissemination through opportunistic networking, etc. Due to the wide variety of applications, human mobility has received significant attention. Over several decades, many models have been derived with varying degrees of complexity and customizability. The spectrum ranges from simple synthetic mobility models to complex data-driven models, designed to capture different properties with varying degrees of accuracy. We refer the reader to [47, 114] for surveys of mobility modeling and analysis. For spatial-temporal patterns, [39] and [105] reveal the regularity and bounds for predicting human mobility using cellular logs. In [105], the authors argue for quantitative models to capture statistical characteristics of individual users by highlighting the differences between continuous time random walk (CTRW) models and empirical results. Another work on WLAN traces [15] revealed surprising patterns on increases of long-term mobility entropy by age, and the impact of academic majors on students’ long-term mobility entropy. We have made observations in terms of regularity of human mobility that are similar to [39, 105], which reaffirms the intrinsic properties of human mobility, despite differences in granularity and population across datasets. Another study highlighted the importance of combining datasets from multiple sources to study various features simultaneously [123]. To advance the understanding of human mobility, we integrated different datasets to correlate mobility and network traffic.
Traffic analysis: Network traffic has been studied extensively for fixed networks (e.g., RIPE Atlas1, Ark2, BISmark [109]), and increasingly for wireless networks: for rather stationary users, as in WLANs (e.g., [46, 66]), and for potentially more mobile users, as in cellular networks (e.g., [76, 122]). Multimedia content is a major category of traffic consumption in mobile devices [76]. In [122], a comparison of traffic patterns between wired and wireless connections is presented.
Their results show that overall flow counts and sizes are smaller for wireless networks, and there are differences between accesses to different categories of websites such as social media or multimedia content. The usage mode of the wireless network has been changing: while most users were non-mobile in the early 2000s [46], with the proliferation of on-the-go devices (‘flutes’) that is no longer the case, as we show in this study. Such analyses range from metrics such as flow count, sizes, and traffic volume to service usage (e.g., visited websites, backend services), to categories of interest (e.g., sports, news, etc.). The authors of [85] investigated correlations and characteristics of web domains accessed by users and their locations, based on NetFlow and DHCP logs from a university campus in 2004. They propose a simulation paradigm with data-driven parameters, producing realistic scenarios for simulations. On both WiFi and cellular networks, the authors of [91] performed an in-depth study on smartphone traffic, highlighting the benefits and limitations of using MPTCP. The distributions of flow inter-arrival time (IAT) and arrival rate at APs of “static” flows were analyzed against candidate distributions (e.g., exponential, Weibull, Pareto, lognormal) in [83]. Lognormal was found to best fit the flow sizes, while at small time scales (i.e., hourly), IAT was best described by Weibull, but with parameters varying from hour to hour. We analyze flows on a much larger-scale, newer dataset including smartphones, and identify the lognormal distribution as the best fit for flow sizes, and beta as the best fit for IAT, regardless of device type. The study in [120] analyzed ISP traces with 9600 cellular towers and 150K users in Shanghai, and mapped timed traffic patterns to urban regions. It provided insight into mobile traffic patterns across time, location and frequency. This work is complementary to ours, as we provide a much finer scope by analyzing campus WLAN traces.
1 http://atlas.ripe.net
2 http://www.caida.org/projects/ark/
Device Variation: Usage and traffic patterns of different device types have been studied from various perspectives ([3, 18, 38, 68, 76, 93]). The study in [3] analyzed the Google WiFi network, deployed in Mountain View, California, during the Spring of 2008. They categorized devices into three groups: Hotspot, Modem and Smartphone.
Their results show that these three device types are naturally different from each other in terms of traffic features. Hand-held devices were found to have low UDP traffic in [38], while HTTP traffic was high. This is in line with our findings. In [68], the authors use extensive datasets to analyze mobility behaviors of laptops and smartphones. They compared a wide range of mobility metrics and presented significant differences between these two device types across spatial and temporal features, such as unique APs visited, the radius of gyration and load characteristics. This study does not correlate traffic features with mobility. However, those findings are based on classifications that rely solely on either MAC addresses or HTTP headers. The former is rather limited, and the latter may have serious privacy implications and is often unavailable. In [30], the authors use packet-level traces from 10 phones and application-level monitoring from 33 Android devices to analyze smartphone traffic. Although this allowed fine-grained measurements, the approach is invasive and limited in scalability, leading to small sample sizes and restricted conclusions. They also do not compare the traffic of smartphones with that of “stop-to-use” wireless devices (i.e., cellos), nor do they measure spatial metrics. To characterize usage patterns for users with multiple wireless devices, the authors in [21] analyze 32k users in a university campus environment, and focus on multi-device usage. That work notes differences between laptops and smartphones in packets, content, and time of usage. It targets device usage patterns and security, while we study mobility and wireless traffic correlations. In our method, the combination of MAC and NetFlow allowed us to classify the majority of observed devices while preserving users’ privacy.
3.2 Mobility Analysis
This section covers the ’temporal’ and ’spatial’ mobility analyses. For all metrics, unless otherwise noted, we investigate 479 days. A summary of the studied metrics and their most significant statistical values is presented in Tab. 3-1, along with mean and median ratios for comparison. From that list, we further investigate in this section those metrics that show the most interesting or non-trivial differences between ’flutes’ and ’cellos’.³
3.2.1 Session Start Probability
A session is defined as the period between WLAN associations. The distributions of session start times across the day for four building categories are depicted in Fig. 3-1. The start times of the sessions match the periodic beginning of classes, but mainly in ’Academic’ buildings, where users move mostly at the start and end of classes. In these places, activity drops sharply for ’cellos’ at 5pm, with considerable ’flutes’ activity until 8pm. For ’Social’ and ’Library’ buildings, the probability of new sessions remains higher for a few more hours into the evening, and the times at which users tend to leave are more spread out. We do not make similar observations on weekends, which is expected since weekend days, unlike weekdays, are not governed by a class schedule. For most visitors, the session start distributions show a smooth shape and no significant differences between device types (omitted for brevity).

³ We would like to thank Leonardo Tonetto, at Technical University of Munich (TUM), for the effective collaboration on this investigation.

Table 3-1. Summary of results for mobility. For each metric, the upper row is for weekdays and the lower row for weekends. LJM: maximum jump [m]; DIA: diameter [m]; TJM: total trajectory length [m]; GYR: radius of gyration [m]; BLD: no. of unique buildings; APC: access point count; PDT: time spent at preferred building [minutes]; DLT: total session time at each building.

                       Flutes (F)              Cellos (C)          Ratio (C/F)
Metric  Days        µ   median      σ       µ   median      σ       µ   median
LJM     weekday   435      296    813     178        1    624   0.409    0.003
        weekend   350      168    683      97        1    312   0.277    0.006
DIA     weekday   549      411    874     195        1    642   0.355    0.002
        weekend   425      179    739     107        1    338   0.252    0.006
TJM     weekday  1582      707   2336     378        1   1444   0.239    0.001
        weekend  1036      279   1793     252        1   1766   0.243    0.004
GYR     weekday   396      290   2725     321      191   3265   0.811    0.659
        weekend   330      248   1368     178     65.1   1800   0.539    0.262
BLD     weekday   5.4        3    5.6     1.8        1    2.1   0.333    0.333
        weekend   2.8        2    4.1     1.5        1    1.8   0.536      0.5
APC     weekday  11.8        6   13.3     3.7        2    4.8   0.314    0.333
        weekend   7.2        4    8.8       3        2    3.8   0.417      0.5
PDT     weekday   225      161    219     248      164    254   1.102    1.019
        weekend   223      135    272     278      189    292   1.247      1.4
DLT     weekday   316      235    302     316      217    305       1     0.92
        weekend   326      247    308     316      221    309    0.97     0.89
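The session start distribution discussed in Sec. 3.2.1 can be estimated with a simple time-of-day histogram. The following is a minimal sketch, not the pipeline used in this study: it assumes session start times are already available as minutes since midnight, and the function name and the bin width parameter are hypothetical. Strictly speaking it returns a probability per bin rather than a continuous density.

```python
from collections import Counter

MINUTES_PER_DAY = 24 * 60


def session_start_pdf(start_minutes, bin_minutes=1):
    """Empirical probability of a session starting in each time-of-day bin.

    start_minutes: iterable of session start times in minutes since midnight
                   (assumed input format, not taken from the original traces).
    bin_minutes:   bin width in minutes; must divide 1440 (hypothetical knob).
    Returns a list with one probability per bin; the list sums to 1
    whenever at least one session is given.
    """
    if MINUTES_PER_DAY % bin_minutes != 0:
        raise ValueError("bin_minutes must divide 1440")
    n_bins = MINUTES_PER_DAY // bin_minutes
    # Map every start time to its bin index and count occurrences.
    counts = Counter(
        (int(m) % MINUTES_PER_DAY) // bin_minutes for m in start_minutes
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_bins
    return [counts.get(b, 0) / total for b in range(n_bins)]


# Example: four sessions starting at 09:00, 09:05, 10:00 and 17:00, hourly bins.
pdf = session_start_pdf([540, 545, 600, 1020], bin_minutes=60)
```

Computing one such distribution per building category and device type would give curves comparable in shape to those of Fig. 3-1, under the stated assumptions.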
Figure 3-1. PDF of session start over time of the day. [Panels: Academic, Social, Administrative, Library; x-axis: T (time of day) [h]; y-axis: P(T); curves: laptop and smartphone.]
3.2.2 Radius of Gyration
This metric, GYR, captures the size of the geospatial dispersion of a device’s movements. It is denoted by $r_g$ and computed as $r_g = \sqrt{\frac{1}{N}\sum_{k=1}^{N} (\vec{r}_k - \vec{r}_s)^2}$, where $\vec{r}_1, \ldots, \vec{r}_N$ are the positional vectors of a device and $\vec{r}_s$ is its center of gravity.
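As an illustration of the definition above, the radius of gyration can be computed in a few lines. This is a minimal sketch under two assumptions that are not stated in the text: positions are given as planar (x, y) coordinates in meters, and the usual root-mean-square form of the metric is used. The function name is hypothetical.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of the points from their center of gravity.

    points: non-empty list of (x, y) positions in meters
            (planar coordinates are an assumption of this sketch).
    """
    n = len(points)
    if n == 0:
        raise ValueError("at least one position is required")
    # Center of gravity r_s: the mean of all positions.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Mean squared distance of each position r_k from r_s.
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


# Example: two observations 2 m apart have their center halfway between them,
# so each lies 1 m from the center.
rg = radius_of_gyration([(0.0, 0.0), (2.0, 0.0)])
```

A device observed at a single location yields a radius of zero, which matches the intuition that a stationary device has no spatial dispersion.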
Grouping devices by their $r_g$ after six months of observation, we look at its evolution since the first time they were observed. Unsurprisingly (cf. [39]), after an initial transient period of about one week, this value stabilizes, even across different semesters (not shown). We split the traces into weekdays and weekends, presenting the distributions in Fig. 3-2(a). For ’cellos’, we notice a substantial reduction in overall mobility on weekends whereas, for ’flutes’, this difference is not so pronounced. This might be due to students having fewer activities on weekends, a tendency to study in a single building such as a library, or simply not carrying their cellos; we will revisit this aspect in Sec. 3.4. ’Flutes’, being “always-on” devices, are able to capture movements at pass-by locations, dining areas, and bus stops, and are thus better suited than ’cellos’ to capture the fine-grained mobility of their users. The latter, in a campus environment, are often used only while studying (such as during lectures and in libraries), while the former often follow the user around the campus. Despite the 8.1 km² area of the campus (approximate radius of 1.42 km), buildings with related fields of study (e.g., Fine Arts, Law School and Engineering) are fairly clustered. Computing the distance between the k-nearest neighboring buildings, for k = 22 and k = 9 (average number of visited buildings for ’flutes’ and ’cellos’), the median distances are 295 m and 172 m, respectively. Due to their focus on classes, attending students have a limited area of activity on weekdays, which explains the observed ’radius of gyration’.
We also evaluated: (1) ’diameter’ DIA, the longest distance between any pair of $\vec{r}_k$ points; (2) ’max jump’ LJM, the longest distance between a pair of consecutive $\vec{r}_k$ points; and (3) ’total trajectory length’ TJM, the sum of the lengths of all trips made by a device. The distributions of these metrics are similar to that of the ’radius of gyration’ and are therefore not shown. Table 3-1 summarizes the most significant statistical values for these metrics.
3.2.3 Visitation Preferences and Interests
We count the number of unique buildings visited by a user, BLD, and define a ’preferred building’ as the location where a device has spent most of its time in a given day, measured in minutes and referred to as PDT. We approximate the latter by the formula $t_b = \sum_{k=1}^{N_b} S_k$, where $t_b$ is the time spent, $N_b$ the total number of sessions and $S_1, \ldots, S_{N_b}$ the time duration of ’each session’ at a building $b$, here referred to as DLT. Interestingly, ’cellos’ have slightly longer stays, but both device types have medians around 2:40 hours. The similarity of the distributions, combined with a lower number of visited locations, indicates that ’cellos’ are used mostly when users remain at a place for longer periods.
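The per-building sum above, together with the BLD count and the choice of a preferred building, can be sketched as follows. This is an illustrative example only: the input format (one device, one day, a list of building and duration pairs) and both function names are assumptions, and the building identifiers are made up.

```python
from collections import defaultdict


def building_dwell_times(sessions):
    """Total time t_b per building, as the sum of its session durations S_k.

    sessions: iterable of (building_id, duration_minutes) pairs for one
              device on one day (assumed input format).
    """
    totals = defaultdict(float)
    for building, duration in sessions:
        totals[building] += duration
    return dict(totals)


def preferred_building(sessions):
    """Return (preferred building, its dwell time in minutes, unique buildings).

    The preferred building is the one with the largest t_b; the last value
    corresponds to the BLD count for that day.
    """
    totals = building_dwell_times(sessions)
    if not totals:
        return None, 0.0, 0
    best = max(totals, key=totals.get)
    return best, totals[best], len(totals)


# Example with hypothetical building identifiers.
day = [("LIB", 90), ("ENG", 45), ("LIB", 70.5), ("GYM", 30)]
result = preferred_building(day)  # ("LIB", 160.5, 3)
```

When two buildings tie for the longest stay, this sketch returns whichever was seen first; how ties were handled in the study is not stated.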
Figure 3-2. (a) Radius of gyration ($r_g$) for the device types. (b) Visited locations $S(t)$. Vertical lines at 7, 120 and 240 days. [Panel (a): CDF vs. $r_g$ [meters], with weekday and weekend curves for laptops and smartphones. Panel (b): $S(t)$ vs. $t$ [hour] on log-log axes for laptops and smartphones, with annotated slopes $t^{0.59}$, $t^{0.33}$, $t^{0.54}$ and $t^{0.20}$.]
Fig. 3-2(b) highlights the differences between ’flutes’ and ’cellos’ in the time $t$ required to visit $S(t)$ locations. After an initial exploration period of one week, the rates of new visits change similarly for both device types, and new exploration rates show up at 120 and 240 days. These could be explained by the weekly schedules of the university as well as the usual length of a lecture term (≈ 4 months). Considering the exponents during the “steady states” between 7 days and 120 days ($t^{0.59}$ and $t^{0.54}$), our dataset matches observations made by [105]. We also consider the number of unique APs a device associates with, APC, which provides a finer spatial resolution than the building level. Furthermore, the probability of finding a device at its $L$-th most visited access point is shown in Fig. 3-3. When taking buildings as aggregating points for location, the values become $L^{-1.36}$ for ’cellos’ and $L^{-1.16}$ for ’flutes’. These approximations validate previous work on human mobility [39], yet highlight differences between device types.
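Both the growth of visited locations with time and the rank-ordered visit probabilities are summarized above by power-law exponents. One simple way to estimate such an exponent is a least-squares line fit in log-log space, sketched below. This is an assumption for illustration, not necessarily the fitting procedure used in this study, and both function names are hypothetical.

```python
import math
from collections import Counter


def fit_power_law_exponent(xs, ys):
    """Slope of log(y) against log(x), i.e. alpha in y ~ c * x**alpha.

    Uses an ordinary least-squares fit in log-log space (a simple choice
    made for this sketch). Pairs with non-positive values are skipped.
    """
    pairs = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive (x, y) pairs")
    mx = sum(lx for lx, _ in pairs) / n
    my = sum(ly for _, ly in pairs) / n
    sxx = sum((lx - mx) ** 2 for lx, _ in pairs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((lx - mx) * (ly - my) for lx, ly in pairs)
    return sxy / sxx


def rank_visit_probabilities(visits):
    """Probability of finding a device at its L-th most visited location.

    visits: iterable of location identifiers, one entry per visit.
    Returns probabilities ordered by rank L = 1, 2, ...
    """
    counts = Counter(visits)
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]


# Example: synthetic data following y = 3 * x**0.5 recovers an exponent of 0.5.
xs = [1, 2, 4, 8, 16]
alpha = fit_power_law_exponent(xs, [3 * x ** 0.5 for x in xs])

# Example: rank probabilities from a short visit log with made-up AP names.
probs = rank_visit_probabilities(["ap1", "ap1", "ap2", "ap1", "ap3", "ap2"])
```

For a Zipf-style plot, the ranks 1, 2, ... and the output of `rank_visit_probabilities` can be passed to `fit_power_law_exponent`, where a negative slope is expected. A line fit in log-log space is sensitive to noise in the tail, so the resulting exponent should be treated as a rough estimate.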
Figure 3-3. Zipf’s plot on L visited access points.