Integrated Network Traffic-Mobility Analysis and Modeling Using Big Data
INTEGRATED NETWORK TRAFFIC-MOBILITY ANALYSIS AND MODELING USING BIG DATA

By
BABAK ALIPOUR

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2019

© 2019 Babak Alipour

To my Mom and Dad, whose love and support made my success possible

ACKNOWLEDGMENTS

First and foremost, thanks to all the help I received from my advisor, Prof. Ahmed Helmy, whose guidance, patience and intelligence made research obstacles conquerable. We thank Dr. Alin Dobra and Dr. Daisy Zhe Wang for help in the computing cluster, and the anonymous reviewers of IEEE InfoCom 2018 for useful feedback. The term ‘cello mobility’ was suggested by Prof. Mostafa Ammar and used here with permission. Partial funding was provided by NSF Award Number 1320694 at University of Florida. We gratefully acknowledge the support of NVIDIA Corp. with the donation of the Titan Xp GPU used for this research.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Data-Driven Traffic and Mobility Analysis
    1.1.1 Mobility Analysis
    1.1.2 Traffic Analysis
    1.1.3 ‘FLAMeS’: Framework for Large-scale Analysis of Mobile Societies
  1.2 Predictability Analysis and Prediction Algorithm Design
  1.3 Integrated ‘Generative’ Traffic-Mobility Modeling
  1.4 Research Contributions
  1.5 Dissertation Organization

2 MATERIALS AND METHODS
  2.1 Input Datasets
    2.1.1 WLAN AP Logs
    2.1.2 NetFlow Logs
  2.2 DHCP and Merging Datasets
    2.2.1 Device Type Classification
    2.2.2 Computing System
  2.3 Machine Learning Application and Background
    2.3.1 Supervised Learning
      2.3.1.1 Classification
      2.3.1.2 Sequence prediction
    2.3.2 Unsupervised Learning
      2.3.2.1 Clustering
      2.3.2.2 Generative
    2.3.3 Reinforcement Learning
    2.3.4 Feature Selection

3 ‘FLUTES’ vs. ‘CELLOS’: ANALYZING MOBILITY-TRAFFIC CORRELATIONS IN LARGE WLAN TRACES
  3.1 Related Work
  3.2 Mobility Analysis
    3.2.1 Session Start Probability
    3.2.2 Radius of Gyration
    3.2.3 Visitation Preferences and Interests
    3.2.4 Sessions Per Building
    3.2.5 Hourly Associations
    3.2.6 Visitation Preferences
    3.2.7 Return Probability
  3.3 Traffic Analysis
    3.3.1 Flow-level Statistical Characterization
      3.3.1.1 Size
      3.3.1.2 Packets
      3.3.1.3 Runtime
      3.3.1.4 Inter-arrival times (IAT)
      3.3.1.5 Protocols
    3.3.2 Network-Centric (Spatial) Analysis
    3.3.3 User Behavior (Temporal) Analysis
      3.3.3.1 Data consumption
      3.3.3.2 Packet rate
      3.3.3.3 Active duration
  3.4 Integrated Mobility-Network Traffic Analysis
    3.4.1 Feature Engineering
      3.4.1.1 Mobility
      3.4.1.2 Network traffic
      3.4.1.3 Cross-dimension
    3.4.2 Utility of Integrated Modeling
  3.5 Integrated Mobility-Network Traffic Generative Modeling
    3.5.1 Related Modeling Work
    3.5.2 Statistical Metrics
    3.5.3 Gaussian Mixture Model (GMM)
    3.5.4 Restricted Boltzmann Machine (RBM)
    3.5.5 Variational Auto-Encoder (VAE)
    3.5.6 Generative Adversarial Network (GAN)
  3.6 Lessons Learned and Modeling Insights
  3.7 Summary

4 PREDICTABILITY ANALYSIS AND PREDICTION ALGORITHM DESIGN
  4.1 Related Work
  4.2 Entropy Estimators and Maximum Predictability
  4.3 Prediction Algorithms
  4.4 Experimental Setup
    4.4.1 Discrete-time Series
    4.4.2 Experiment Dimensions
  4.5 Mobility Analysis
    4.5.1 Overview
    4.5.2 Spatio-temporal Resolutions
    4.5.3 Comparison of Methods
    4.5.4 Correlations with Mobility and Network Traffic
  4.6 Summary and Future Work

5 LEARNING THE RELATION BETWEEN MOBILE ENCOUNTERS AND WEB TRAFFIC PATTERNS
  5.1 Related Work
  5.2 Mobility Encounters
    5.2.1 Daily Encounter Duration at the Building Level
    5.2.2 Encounter Duration Statistical Distributions
  5.3 Web Traffic Profile
  5.4 Pairwise Encounter-Traffic Relationship
    5.4.1 Device Type Categories
    5.4.2 Weekday Vs. Weekend
    5.4.3 Encounter Duration
  5.5 Learning Encounters
    5.5.1 Random Forest (RF)
    5.5.2 Deep Learning
  5.6 Summary and Future Work

6 FUTURE DIRECTIONS
  6.1 Extensions of Flutes vs. Cellos in Social Context and Interest Dimensions
  6.2 Predictability Analysis in the Traffic and Interest Dimensions

APPENDIX

A WEB DOMAIN INTEREST ANALYSIS
  A.1 Web Domain Interest Extraction

B CHOICE OF DATA PROCESSING TOOLS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Summary of datasets. B=billion.
2-2 NetFlow example records.
2-3 AP logs/DHCP example records.
3-1 Summary of results for mobility analysis.
3-2 Merged DHCP-NetFlow traces overview
3-3 Traffic features used for integrated mobility-traffic analysis
3-4 Average Kolmogorov-Smirnov statistic of all algorithms
3-5 Kolmogorov-Smirnov statistic of the β-VAE for weekday features.
3-6 Kolmogorov-Smirnov statistic of the β-VAE for weekend features.
4-1 Statistics per device available for at least 7 days & accessed more than 5 APs.
4-2 Median Accuracy of LSTM across spatio-temporal resolutions
4-3 Summary of Median Accuracy for Flutes vs Cellos with different methods
5-1 Encounter record example
5-2 Daily Encounter Duration in Seconds
5-3 Best fit distributions for total daily encounter duration based on pairs classifications

LIST OF FIGURES

1-1 ‘FLAMeS’ system overview.
2-1 Wireless association for a device at different times.
2-2 Time series for 25 days of combined AP-NetFlow Core traces
2-3 Machine Learning algorithms widely used in Internet traffic classification.
3-1 PDF Session start over time of the day.
3-2 Radius of gyration and visited locations S(t)
3-3 Zipf’s plot on L visited access points.
3-4 Probability P(t) of session duration t.
3-5 Hourly associations.
3-6 Time spent at preferred building.
3-7 Probability to return to a previously visited location.
3-8 Traffic distribution plots.
3-9 CDF of individual flow sizes
3-10 Lognormal distribution plot for mean packet size of either device type
3-11 Theoretical and empirical plots for Lognormal and mean packet size of Flute flows
3-12 Exponential and Beta distribution plots for IAT.
3-13 Correlation plots