Modeling Network User Behavior: Various Approaches Shidan Xu

Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2016-007 June 28, 2016 Modeling Network User Behavior: Various Approaches Shidan Xu massachusetts institute of technology, cambridge, ma 02139 usa — www.csail.mit.edu Modeling User Network Transitions: Various Approaches by Shidan Xu Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2016 c Massachusetts Institute of Technology 2016. All rights reserved. ⃝ Author................................................................ Department of Electrical Engineering and Computer Science May 23, 2016 Certified by............................................................ Professor Karen R. Sollins Principal Research Scientist Thesis Supervisor Accepted by........................................................... Professor Christopher J. Terman Chairman, Masters of Engineering Thesis Committee Modeling Network User Behavior: Various Approaches Shidan Xu May 20, 2016 Abstract This project involves learning to predict users’ mobility within the network topology. Topo- logical mobility, as opposed to physical mobility, can be substantial as a user switches from LTE to wifi network, while moving minimally physically. Our dataset consists of email IMAP logs as they document associated client IP addresses, as well as the clients’ identifiers. Prediction for online mobility is of particular interest to the networks community. If we can predict online mobility with high probability, then new network architecture can be designed to optimize the caching system by minimizing resending packets. We used various approaches and techniques to model the user’s behavior, including probabilistic programming, regression, neural nets, and clustering algorithms. We compare and contrast how models di↵er in their prediction accuracy, speed of convergence, and algorithmic complexity. 1 Acknowledgement I would like to thank Dr. Karen Sollins for giving me an opportunity to work on this MobilityFirst project for my MEng. I loved my time exploring the intricacies in the datasets and I learned so much in the research process. I would like to thank Dr. Karen Sollins and Dr. Steven Bauer for their continued advice and insight into conceiving this project and thesis. I would like to thank Dr. Bauer for his technical help and project focus guidance throughout this project. I’d like to thank Dr. Sollins for all of her help with design decisions and project planning. Without their insights, this project would not have been possible. Their help was invaluable. I would like to thank the MIT Advanced Networks Architecture (ANA) group for having me carry out this research. I would like to thank Tingtao Zhou, Tianfan Xue, and Zhengli Wang for their help with the statistics concepts and machine learning ideas. Lastly, I would like to thank my family for all of their support, love, and help in life. This project was funded in part under NSF Grant 1413973, ”NeTS: Large: Collaborative Re- search: Location-Independent Networks: Evaluation Strategies and Studies”. All code written in this project can be found on Github at All the work can be found on Github at https://github.com/shidanxu/mengvfinal. 2 Contents 1 Introduction 7 1.1 ProjectBackground .................................... 7 1.2 RelationstoMobilityFirst ................................. 7 1.3 Current Internet Has Inefficiency in Topological Distance . 8 1.4 Data Analysis Approaches on Networks Problem . 10 2 Previous Works on Quantitative Network Measurements 11 2.1 Sookhyun Yang’s Three State Markov Model . 11 2.1.1 Choosing the Dataset . 11 2.1.2 Comments on Yang’s Work . 12 2.2 Beverly Applies Machine Learning to Extract Most Important IP Bit for Traffic Con- gestion............................................ 13 2.3 Probabilistic Programming As a Tool For Fast Modeling . 14 3 Reproducing Yang Paper Results 16 3.1 A Peek of the Dataset . 16 3.2 Markov Chain For Transition Prediction . 17 3.3 Evaluate Markov Chain Prediction . 19 3.4 Aspects of Modeling where Markov Chain Falls Short . 22 3.5 Questions that Emerge from Yang Experiment Reproduction . 22 4Modeling 24 4.1 ClusteringFindsThreeGroups .............................. 24 4.2 Clustering Shines Light on Canonical User . 27 4.3 Regression Helps Weighing Factors . 28 4.3.1 Ridge Regression . 29 4.3.2 Logistic Regression . 32 4.4 Neural Net Increases Performance . 35 4.4.1 Background . 35 3 4.4.2 Results and Discussion . 37 4.5 Bayesian Modeling and Probabilistic Programming . 38 4.5.1 Generativevs. DiscriminativeModeling . 38 4.5.2 Probabilistic Programming . 39 5 Approach Comparison 43 5.1 Speed ............................................ 43 5.2 Prediction Accuracy . 44 5.3 Relevance to Dataset . 45 6 Toolkit 47 7 Challenges 49 7.1 Challenges in the Dataset . 49 7.2 Challenges in Technology . 51 8 Contributions and Conclusion 54 8.1 Contributions........................................ 54 8.2 Conclusion ......................................... 55 4 List of Figures 1 Physical proximity does not guarantee topological proximity. Small moves in physical space may be large moves in Network topology. The path from the 4G LTE network to the MIT network goes through US Eastern. 9 2 Sample IMAP entry for 2014.01.28 . 16 3 The distribution of number of transitions from Markov chain and testing set. Sample size 10000 user days each. 20 4 The distribution of number of transitions from training set and testing set. Sample size 10000 user days each. The light blue regions are overlaps of both sets. 21 5 The clustering algorithm output. X-axis is average duration in seconds. Y-axis is session start hour. Average duration length significantly a↵ected the clustering. The greendotsrepresentthecentersofeachcluster. 25 6 X-axis is portion of user entry generated from a wifi network. Y-axis is day of the week. Those were two of the features that did not significantly a↵ected the clustering. 25 7 The K-means clustering algorithm output. The top three factors for separating the dataset are average duration per session, average number of sessions in one day, and the average starting hour for sessions. 26 8 Theaveragenumberofsessionspercluster. 28 9 The long cluster average session length prediction. Each integer on the x-axis is a feature and the y-axis is showing the feature’s corresponding weight. 31 10 The distribution of session durations, in seconds. Note the second peak at the hour mark. ............................................ 32 11 The distribution of session durations, by tags. 34 12 A one hidden-layer feed forward neural net. Image courtesy to Bishop[20]. 36 13 An example PP output using PYMC3. 41 14 Sample session data, each line is a session. The column headers are session start time, sessionendtime,IP,device,userID. 49 15 Exponential distribution implementation in PYMC3. 51 List of Tables 1 Di↵erent kinds of Markov Models[17] . 18 2 Empirical estimates of transition probabilities. Data acquired on training sample (10000 user days). 19 5 3 Empirical estimates of transition probabilities for all users. 27 4 Empirical estimates of transition probabilities for the long duration cluster. 27 5 Comparison of di↵erentnumberofnodesinthehiddenlayer. 38 6 Complexity of di↵erent approaches. 44 7 Prediction Accuracy of di↵erent approaches. 44 8 Comparison of various Python probabilistic programming APIs. 48 6 1 Introduction 1.1 Project Background The Internet is approaching a historic inflection point, with mobile platforms and applications re- placing the fixed-host/server model that has dominated the Internet since its inception [14]. This gradual shift produces an opportunity to design a new network focused on mobile users. The problem this thesis investigates is motivated by the MobilityFirst project in the FIND (Future Internet Design) initiative [7], which asks the question of what the requirements should be for a global network 15 years from now, and how we can build such a network if we are not constrained by the current Internet. 1.2 Relations to MobilityFirst The MobilityFirst projects presents several major design goals [14]: • Dynamic hosting to provide scalable network mobility. • The robustness of the wireless transfer medium. • Reinforce network security and privacy for both mobile and wired networks. • Provide context-aware mobile services. Mobile users often switch between di↵erent networks (such as cellular and wifi) as they move geographically. Traditional networks such as the Internet were designed more for a static user. In designing a new network that is more suited to the mobile users, we need to evaluate this switching behavior. In this thesis project, we evaluate the user’s network transition behavior, in order to understand which users should the new network address. This relates to the first design goal. By 7 providing a clearer understanding of the user behavior, new networks can be designed to configure hosts that is more suitable for the user. One aspect of suitable is to minimize the di↵erence in topological distance and geographical distance. We expand this point next. 1.3 Current Internet Has Inefficiency in Topological Distance Developers of the Internet made some design choices with assumptions. One particular assumption is that network mobility is similar to mobility in geography[3]. However, in real world cases, this assumption is often violated. For example,

Load more