Computer Science and Artificial Intelligence Laboratory Technical Report
MIT-CSAIL-TR-2016-007 June 28, 2016
Modeling Network User Behavior: Various Approaches

Shidan Xu
massachusetts institute of technology, cambridge, ma 02139 usa — www.csail.mit.edu

Modeling User Network Transitions: Various Approaches

by Shidan Xu

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2016

© Massachusetts Institute of Technology 2016. All rights reserved.
Author: Department of Electrical Engineering and Computer Science, May 23, 2016
Certified by: Professor Karen R. Sollins, Principal Research Scientist, Thesis Supervisor
Accepted by: Professor Christopher J. Terman, Chairman, Masters of Engineering Thesis Committee

Modeling Network User Behavior: Various Approaches

Shidan Xu

May 20, 2016
Abstract
This project involves learning to predict users’ mobility within the network topology. Topological mobility, as opposed to physical mobility, can be substantial: a user who switches from an LTE to a Wi-Fi network may move only minimally in physical space. Our dataset consists of email IMAP logs, which record the associated client IP addresses as well as client identifiers. Predicting online mobility is of particular interest to the networking community: if we can predict online mobility with high probability, new network architectures can be designed to optimize caching and minimize resent packets. We used various approaches and techniques to model user behavior, including probabilistic programming, regression, neural nets, and clustering algorithms. We compare and contrast how the models differ in prediction accuracy, speed of convergence, and algorithmic complexity.
Acknowledgement
I would like to thank Dr. Karen Sollins for giving me an opportunity to work on this MobilityFirst project for my MEng. I loved my time exploring the intricacies in the datasets and I learned so much in the research process.
I would like to thank Dr. Karen Sollins and Dr. Steven Bauer for their continued advice and insight into conceiving this project and thesis.
I would like to thank Dr. Bauer for his technical help and project focus guidance throughout this project. I’d like to thank Dr. Sollins for all of her help with design decisions and project planning.
Without their insights, this project would not have been possible. Their help was invaluable.
I would like to thank the MIT Advanced Networks Architecture (ANA) group for having me carry out this research.
I would like to thank Tingtao Zhou, Tianfan Xue, and Zhengli Wang for their help with the statistics concepts and machine learning ideas.
Lastly, I would like to thank my family for all of their support, love, and help in life.
This project was funded in part under NSF Grant 1413973, “NeTS: Large: Collaborative Research: Location-Independent Networks: Evaluation Strategies and Studies”. All code written for this project can be found on GitHub at https://github.com/shidanxu/mengvfinal.
Contents

1 Introduction
1.1 Project Background
1.2 Relations to MobilityFirst
1.3 Current Internet Has Inefficiency in Topological Distance
1.4 Data Analysis Approaches on Networks Problem
2 Previous Works on Quantitative Network Measurements
2.1 Sookhyun Yang’s Three State Markov Model
2.1.1 Choosing the Dataset
2.1.2 Comments on Yang’s Work
2.2 Beverly Applies Machine Learning to Extract Most Important IP Bit for Traffic Congestion
2.3 Probabilistic Programming As a Tool For Fast Modeling
3 Reproducing Yang Paper Results
3.1 A Peek of the Dataset
3.2 Markov Chain For Transition Prediction
3.3 Evaluate Markov Chain Prediction
3.4 Aspects of Modeling where Markov Chain Falls Short
3.5 Questions that Emerge from Yang Experiment Reproduction
4 Modeling
4.1 Clustering Finds Three Groups
4.2 Clustering Shines Light on Canonical User
4.3 Regression Helps Weighing Factors
4.3.1 Ridge Regression
4.3.2 Logistic Regression
4.4 Neural Net Increases Performance
4.4.1 Background
4.4.2 Results and Discussion
4.5 Bayesian Modeling and Probabilistic Programming
4.5.1 Generative vs. Discriminative Modeling
4.5.2 Probabilistic Programming
5 Approach Comparison
5.1 Speed
5.2 Prediction Accuracy
5.3 Relevance to Dataset
6 Toolkit
7 Challenges
7.1 Challenges in the Dataset
7.2 Challenges in Technology
8 Contributions and Conclusion
8.1 Contributions
8.2 Conclusion
List of Figures

1 Physical proximity does not guarantee topological proximity. Small moves in physical space may be large moves in network topology. The path from the 4G LTE network to the MIT network goes through US Eastern.
2 Sample IMAP entry for 2014.01.28
3 The distribution of number of transitions from Markov chain and testing set. Sample size 10000 user days each.
4 The distribution of number of transitions from training set and testing set. Sample size 10000 user days each. The light blue regions are overlaps of both sets.
5 The clustering algorithm output. X-axis is average duration in seconds. Y-axis is session start hour. Average duration length significantly affected the clustering. The green dots represent the centers of each cluster.
6 X-axis is portion of user entries generated from a wifi network. Y-axis is day of the week. Those were two of the features that did not significantly affect the clustering.
7 The K-means clustering algorithm output. The top three factors for separating the dataset are average duration per session, average number of sessions in one day, and the average starting hour for sessions.
8 The average number of sessions per cluster.
9 The long cluster average session length prediction. Each integer on the x-axis is a feature and the y-axis shows the feature’s corresponding weight.
10 The distribution of session durations, in seconds. Note the second peak at the hour mark.
11 The distribution of session durations, by tags.
12 A one hidden-layer feed-forward neural net. Image courtesy of Bishop[20].
13 An example PP output using PYMC3.
14 Sample session data; each line is a session. The column headers are session start time, session end time, IP, device, userID.
15 Exponential distribution implementation in PYMC3.
List of Tables

1 Different kinds of Markov Models[17]
2 Empirical estimates of transition probabilities. Data acquired on training sample (10000 user days).
3 Empirical estimates of transition probabilities for all users.
4 Empirical estimates of transition probabilities for the long duration cluster.
5 Comparison of different numbers of nodes in the hidden layer.
6 Complexity of different approaches.
7 Prediction accuracy of different approaches.
8 Comparison of various Python probabilistic programming APIs.
1 Introduction
1.1 Project Background
The Internet is approaching a historic inflection point, with mobile platforms and applications replacing the fixed-host/server model that has dominated the Internet since its inception [14]. This gradual shift creates an opportunity to design a new network focused on mobile users. The problem this thesis investigates is motivated by the MobilityFirst project in the FIND (Future Internet Design) initiative [7], which asks what the requirements should be for a global network 15 years from now, and how we could build such a network if we were not constrained by the current Internet.
1.2 Relations to MobilityFirst
The MobilityFirst project presents several major design goals [14]:
• Provide dynamic hosting for scalable network mobility.
• Ensure the robustness of the wireless transfer medium.
• Reinforce network security and privacy for both mobile and wired networks.
• Provide context-aware mobile services.
Mobile users often switch between different networks (such as cellular and Wi-Fi) as they move geographically. Traditional networks such as the Internet were designed for a more static user. In designing a new network better suited to mobile users, we need to evaluate this switching behavior. In this thesis project, we evaluate users’ network transition behavior in order to understand which users the new network should address. This relates to the first design goal. By providing a clearer understanding of user behavior, new networks can be designed to configure hosts in a way that is more suitable for the user. One aspect of suitability is minimizing the difference between topological distance and geographical distance. We expand on this point next.
1.3 Current Internet Has Inefficiency in Topological Distance
Developers of the Internet made design choices based on certain assumptions. One particular assumption is that network mobility is similar to geographic mobility [3]. In real-world cases, however, this assumption is often violated. For example, a phone user can transition from a 4G network to Wi-Fi by walking a few meters, yet the network topological distance covered is long. In extreme cases, the shortest path between two nodes in the network topology can traverse opposite coasts of the country. Such a design makes it costly for the user to transition between networks, as packets of information need to be rerouted. In designing a new network, one focal point is to decrease the inconsistency between geographical and topological distance. The future network needs to address these geographically proximate network changes and decrease this data transfer cost.
Figure 1: Physical proximity does not guarantee topological proximity. Small moves in physical space may be large moves in network topology. The path from the 4G LTE network to the MIT network goes through US Eastern.
Imagine a person traversing a college campus while checking his email on his phone. As he moves along, his phone’s network switches alternately between LTE and the college network. Every time the user switches networks, his phone loses some data and has to send extra requests to recover the lost data. The LTE and campus Wi-Fi networks, despite being geographically close, can be vastly distant in network topology. Hence a request may have to traverse a much longer path than the physical proximity suggests. This extra path traversed slows down packet transfer in the current network.
Consider another example: a typical 9-to-5 office worker. She wakes up at 8, goes to work at 9, and comes home at 5. She has dinner at 6 and spends two hours surfing the web at night. Now consider what her daily network usage would look like. She would be on her home network in the morning. She uses a combination of the company and LTE networks at work. Sometimes she eats lunch at a coffee shop and uses the network there. She spends some time on her home network at night. Her active network traffic ends at 11, but some passive email checking and other maintenance from her device persist throughout the night. Her daily network usage is very predictable. If researchers can predict her network usage, mobile device applications can be customized to make her life more convenient.
Multiple questions can be asked about these two scenarios. To what degree is human network behavior habitual? To what degree can user network behavior be predicted? How should we evaluate network usage behavior? How often do such costly network transitions happen? How do we measure the cost of one such transition? How much traffic is affected by those costly transitions? How should we design our networks to minimize these costly transitions?
1.4 Data Analysis Approaches on Networks Problem
Additionally, the project stresses creating a procedure for fast exploration of new models using approaches such as probabilistic programming and neural nets. These procedures were chosen so that the current model can be expanded quickly. We spent time evaluating the different modeling techniques and tools for their prediction accuracy, speed of convergence, and algorithmic complexity.
Noam Chomsky believed that the Merge operation is one of the fundamental differences that separate humans from other primates [13]. That is, humans can take two abstract concepts and combine them to create scenarios that they have never physically experienced. A person can imagine a dinosaur ripping apart an opera stage as long as he understands dinosaurs and opera, even though he has never seen such an event in reality. This Merge ability can induce creativity. Statistical methods and machine learning are trending approaches for extracting insight out of a dataset. By applying machine learning strategies to a networks problem, we try to develop new insights that are not traditionally seen.
2 Previous Works on Quantitative Network Measurements
In this chapter, we review some of the previous work. We review previous quantitative models of users transitioning among networks, and we evaluate what kinds of approaches have been taken with networks datasets. We also evaluate new techniques for implementing models.
2.1 Sookhyun Yang’s Three State Markov Model
This thesis seeks to expand on the work in Yang’s Measurement and Modeling of User Transitions Among Networks [5]. Yang’s paper created a 3-state Markov model that is evaluated by measuring the distribution of the cost of signaling a network attachment/detachment, using Internet Message Access Protocol (IMAP) logs of email access for a population of University of Massachusetts at Amherst residents [5].
2.1.1 Choosing the Dataset
A major difficulty in network mobility research is selecting a dataset that captures user transitions. Yang argued for three major criteria for selecting a proper dataset [5].
• The application should be frequently used by the user, to provide a sufficiently large dataset.
• The application can be monitored without introducing inconvenience to the user.
• The application provides trackable user identification.
Therefore, Yang’s choice of IMAP logs is a good exploration that satisfies those requirements. We also had several options in choosing our dataset.
• Use the Yang dataset.
• Collect our own email based dataset.
• Use mobility datasets from a third party provider who installs apps to track user network usage.
If we pick Yang’s dataset, we get a head start on the kinds of approaches we can take to understand the data. More importantly, we save the time of acquiring the dataset ourselves, which in Yang’s case took four months. We could also collect our own email-based dataset at MIT; in order to extract more information on users’ age, residence, etc., we would need to clear privacy controls with the Institute, and such data collection can also be time-consuming. Using mobility datasets from a third party is another option, and could provide a feature-rich dataset that includes identification features for the users, such as age and gender; this is not possible in the IMAP logs due to the way the protocol is set up. However, using data from a third party creates a bias towards only the users who are interested in installing the tracking software. We evaluated our options and continued our work based on Yang’s dataset, as we saw areas of improvement on their work.
The Yang dataset identifies a user by an ID that was intentionally anonymized. The dataset specifies the times when the user logged into an email service, the IP address s/he used to attach to the network, and the times when the server closed the network connection. The raw dataset is large, as it contains many entries, but not feature-rich, as each entry contains limited information.
2.1.2 Comments on Yang’s Work
While the paper serves as a decent exploration baseline, we explore alternatives on some aspects, definitions, and assumptions. For instance, the Yang paper collected two datasets. One dataset contains email logs for campus professors and staff members in the fall. Another dataset contains all users across campus over a four-month period in winter. In evaluating the transition costs, Yang assumed all users in the campus-wide dataset formed one uniform group. We hypothesized that users form several groups based on their behavioral patterns, and would like to evaluate potentially different user types. We suspect that students have preferred times to access email (night or morning) depending on course load and personal preference. Those factors can possibly separate the dataset into multiple clusters.
Another criticism is that the Markov model is simple and not easily expandable to adapt to specific user types. The only independent parameter in this model is the current state the user is in. If another factor is identified as significant, the entire modeling process needs to restart. We want to avoid such repetitive work in further modeling by utilizing tools and techniques that facilitate exploration of the modeling space.
2.2 Beverly Applies Machine Learning to Extract Most Important IP Bit for Traffic Congestion
There is much related work applying machine learning to networks problems. In Beverly [1], the author described several learning approaches for attacking various network traffic problems with minimal datasets. In particular, learning was useful for capitalizing on under-utilized information and inferring behavior reliably. Using a hidden Markov model, Beverly was able to predict mean TCP congestion times based on IP address bits. Using 8 of the 32 bits, Beverly could predict the mean congestion time to an error of less than 30ms. We take from Beverly’s work that statistics and machine learning can provide unique insights into a dataset. By utilizing statistical methods to aid humans in extracting patterns from data, we can improve prediction accuracies. Beverly’s paper provides intuition on which learning approaches to use on which types of problems.
2.3 Probabilistic Programming As a Tool For Fast Modeling
Probabilistic programming is a generative learning approach and an ongoing research area. Generative learning approaches can analyze complex scenes more richly and flexibly, but have been less widely embraced due to their slow inference speed and the problem-specific engineering needed to obtain robust results [6]. Like a Markov model, generative approaches can learn the optimal parameters for the model. Once the parameters are set, researchers can use the model to generate new data to compare with the observed testing data. One of the trending topics lately in probabilistic programming is creating a generic inference engine that facilitates optimization of parameters. Such an inference engine simplifies code implementation for different models. Many applications of probabilistic programming are under development [16] (see probabilistic-programming.org for a list of existing probabilistic programming systems). Picture and Venture are two notable probabilistic programming languages conceived at MIT. Both languages focus on the development of a general inference engine. With an inference engine, a user can benefit from probabilistic models without having expertise in statistics. We refrained from using those languages, as they are alpha versions and standalone platforms with higher learning curves. Instead, we explored PyMC3, a stable Python package written with a general inference engine. We chose this option because we want to explore more models without having to implement the inference methods ourselves [2].
In Tenenbaum [9], current applications and limitations of probabilistic programming are discussed, with a vision for hierarchical structural learning. In hierarchical structural learning, the algorithm can learn the dependence of variables. That is, not only can the learning algorithm optimize the parameters for the model, it can also learn the most appropriate structure of the model.
With enhanced storage capacities and decreasing computational cost, hierarchical structural learning can facilitate data mining further by saving scientists the time to conceive a more structurally accurate model.
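To make the generative workflow described above concrete, here is a minimal sketch (not the thesis code, and deliberately library-free rather than PyMC3): we posit a generative model in which session durations are exponentially distributed, forward-simulate data under a known rate, and then recover that rate by inference. For an exponential distribution the maximum-likelihood rate is simply the reciprocal of the sample mean; a real probabilistic-programming system would do this inference step generically. The mean of 600 seconds is an illustrative value, not a statistic from the dataset.

```python
import random

random.seed(0)

TRUE_RATE = 1 / 600.0  # hypothetical mean session length of 600 seconds

def generate_durations(rate, n):
    """Forward-simulate n session durations from the generative model."""
    return [random.expovariate(rate) for _ in range(n)]

def fit_rate(durations):
    """Inference step: the MLE rate of an exponential is 1 / sample mean."""
    return len(durations) / sum(durations)

data = generate_durations(TRUE_RATE, 10_000)
estimated_mean = 1 / fit_rate(data)
print(round(estimated_mean))  # close to the true mean of 600
```

Once the rate is fitted, the same `generate_durations` function can be reused to produce synthetic data for comparison against a held-out testing set, which is exactly the generate-then-compare loop described above.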
Summary. In this chapter, we reviewed some of the related previous work. Three works are especially important to us. Yang [5] created a three-state Markov model to predict user transitions among networks. Beverly [1] applied machine learning techniques to various networks problems. We also reviewed probabilistic programming, a modeling tool that implements an inference engine to facilitate fast modeling.
3 Reproducing Yang Paper Results
In this chapter, we first reproduce Yang’s Markov chain model for user transition prediction. We then evaluate this model to see areas of improvement.
3.1 A Peek of the Dataset
We acquired processed IMAP datasets from the University of Massachusetts. The dataset included the IMAP logs for 7137 campus-wide users. The user population was mainly students [5], and data were taken from December 3, 2013 to March 26, 2014.
Figure 2: Sample IMAP entry for 2014.01.28
This is what a section of a sample processed IMAP log file looks like. The most useful information here is that on line 1, a user with an IP address of 68.186.244.103 attached to the network. In the middle section, another user attached to the UMass network. At 00:47:44, the server unilaterally detached the idle user identified by 34630. The log also includes information on when the user checks the inbox and deletes messages. Note that the user identity, which is originally an email address, is masked using SHA2 hashing [5] to protect user privacy.
We extract the device the user signed in with, and acquire additional information from the IP address. We used the free website ip-api.com to determine the user’s city and internet service provider (ISP). We threw out entries with unidentifiable IP addresses, as we used IP addresses to identify when a user is continuously on a network. We did not consider entries with overseas IP addresses, because network transitions elsewhere in the world may exhibit different patterns due to local network infrastructure. Those discarded entries account for roughly 3% of the entire dataset.
We note that a user can simultaneously be signed into the same email account from multiple devices or networks. The server does not dismiss a user’s first attachment once s/he signs in from a second device.
3.2 Markov Chain For Transition Prediction
One of the important results in Yang’s paper is a 3-state Markov chain model that predicts the likelihood of a user transitioning (defined below), given the number of networks s/he is currently on.
To understand how users make network transitions, first we need to define a transition.
Definition A session is a continuous period of time during which a user is on the same network, as identified by the same IP address. A transition happens when the user detaches from one network and attaches to another.
With this definition, we can decide from the logs whether a transition happened. Brian Copeland, a UROP student in the ANA group, parsed the original UMass log entries into entries that contain each individual session. We overlay the durations to decide occurrences of transitions. The details of the transition criteria can be found in the Yang [5] paper.
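The session-to-transition step can be sketched in a few lines. Assuming each parsed entry has been reduced to a `(start, end, ip)` tuple for one user (an illustrative layout, not the actual parsed format, and deliberately simpler than the overlap handling in Yang [5]), a transition is counted whenever the user detaches from one network and then attaches to a different one:

```python
def count_transitions(sessions):
    """Count network transitions for one user.

    sessions: list of (start, end, ip) tuples, times in seconds.
    A transition is recorded when a session on one network ends
    and a later session begins on a different network.
    """
    ordered = sorted(sessions)  # order by start time
    transitions = 0
    for (s1, e1, ip1), (s2, e2, ip2) in zip(ordered, ordered[1:]):
        if ip2 != ip1 and s2 >= e1:  # detach from ip1, then attach to ip2
            transitions += 1
    return transitions

# Example: home Wi-Fi -> LTE -> back to home Wi-Fi counts as 2 transitions.
day = [(0, 3600, "10.0.0.1"), (4000, 4500, "172.16.0.9"), (5000, 8000, "10.0.0.1")]
print(count_transitions(day))  # 2
```

Note that overlapping sessions (the same account attached from two devices at once, as described below) are deliberately not counted as transitions here; the real criteria must overlay durations rather than assume sessions are disjoint.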
A Markov model is a way of modeling randomly changing systems in which future states depend only on the current state. There are four types of Markov models, depending on two criteria. Observable refers to whether the state of the system is directly observable from the outside. Controlled means that at each step the system chooses among actions, and the chosen action affects the transition probabilities. Autonomous means that the system does not choose an action.
Dimension            | System State Fully Observable  | System State Partially Observable
System is Autonomous | Markov Chain                   | Hidden Markov Model (HMM)
System is Controlled | Markov Decision Process (MDP)  | Partially Observable MDP

Table 1: Different kinds of Markov Models [17]
In Yang, the researchers implemented a Markov chain, that is, a system with fully observable states and no action options. The three states of the Markov chain are:
• On 0 networks
• On 1 network
• On 2 networks
A 3-by-3 state transition probability matrix can be calculated directly from the dataset. One subtle choice we have to make is the length of each time step. One extreme is to take near-continuous measurements: for instance, we could consider every second a new time step. However, this creates a nearly inescapable state. On average, a typical user logs 20 sessions a day, and there are 86400 seconds in a day, so the per-second chance of a new session is at most 20/86400 ≈ 0.23‰. Hence the chain would almost always remain stuck in the on-0-networks state. Using such a model, we might create a skewed dataset that does not faithfully capture the original distribution.
We took measurements at the end of each 20-minute period, as Yang concluded that 20-minute periods were appropriate for this dataset [5]. We look at the networks the user is on at the end of the 20-minute window versus at the beginning, to see whether s/he has transitioned. In Table 2, we show the transition probability matrix calculated on a 10000 user-day training sample. Given this transition probability matrix, we can generate estimates of how the user may behave in the future. We then compare this model-generated data with the testing sample.
Transition Prob.  | On 0 Networks | On 1 Network | On >= 2 Networks
On 0 Networks     | 0.930         | 0.066        | 0.004
On 1 Network      | 0.696         | 0.266        | 0.038
On >= 2 Networks  | 0.291         | 0.356        | 0.353
Table 2: Empirical estimates of transition probabilities. Data acquired on training sample (10000 user days).
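The empirical estimation behind Table 2 can be sketched as follows: reduce each user-day to a sequence of states (0, 1, or 2-plus networks) sampled at the 20-minute boundaries, count consecutive state pairs, and normalize each row. This is a simplified stand-in for the thesis pipeline, with toy sequences in place of the real user-days:

```python
from collections import Counter

def estimate_transition_matrix(state_sequences, n_states=3):
    """Row-normalized counts of consecutive state pairs."""
    counts = Counter()
    for seq in state_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    matrix = []
    for a in range(n_states):
        row_total = sum(counts[(a, b)] for b in range(n_states))
        matrix.append([counts[(a, b)] / row_total if row_total else 0.0
                       for b in range(n_states)])
    return matrix

# Toy user-days (a real day at 20-minute resolution has 72 samples).
days = [[0, 0, 1, 0, 0], [0, 1, 2, 1, 0]]
P = estimate_transition_matrix(days)
print(P[0])  # [0.5, 0.5, 0.0]
```

Each row of the result is a conditional distribution over the next state given the current one, which is exactly the form of Table 2.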
3.3 Evaluate Markov Chain Prediction
For a baseline, we evaluate the distribution of the number of transitions for all users on a daily basis, as noted in Yang. We identify a transition as a change of state (on 0, on 1, or on 2 or more networks) from the end of the previous 20-minute period to the end of the current 20-minute period. We compute one distribution from the testing dataset, and another from the randomly generated dataset based on Markov chain runs. Further, we also include the distribution for the training data used to produce the transition probability matrix. We make the simple assumption that all users behave similarly.
Figure 3: The distribution of number of transitions from Markov chain and testing set. Sample size 10000 user days each.
In Figure 3, we illustrate the distribution of daily transitions for 10000 sample days generated by the Markov chain, and for another randomly selected 10000 user days of testing data. The Markov-chain-generated data show a more Gaussian distribution, whereas the testing data show a skewed tail. We further confirmed that this difference is not due to randomness in the data selection, by comparing the 10000 user days of training data used to generate the Markov transition probability matrix against the test data.
Figure 4: The distribution of number of transitions from training set and testing set. Sample size 10000 user days each. The light blue regions are overlaps of both sets.
The training data used to generate the Markov transition probability matrix has a similar dis- tribution as the testing data.
Despite the Markov-generated set and the testing set having similar transition probability matrices, the Markov chain failed to capture the exponential distribution. Because a fixed transition probability matrix was used to generate all of the Markov chain data, we expected a more normally distributed number of transitions, whereas the real data include users with larger variance in their number of transitions. We confirm the concern that a simple three-state Markov transition matrix alone does not capture the transition behavior exhibited by the dataset.
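The generation side of this comparison can be reproduced in outline: drive the chain with the Table 2 probabilities for 72 twenty-minute steps (one day), count state changes, and repeat for many user-days to build the histogram behind Figure 3. A minimal sketch (the starting state and day count are simplifying choices, not the thesis’s exact procedure):

```python
import random

random.seed(1)

# Transition probabilities from Table 2 (rows: on 0, on 1, on >= 2 networks).
P = [[0.930, 0.066, 0.004],
     [0.696, 0.266, 0.038],
     [0.291, 0.356, 0.353]]
STEPS_PER_DAY = 72  # 24 hours / 20-minute periods

def simulate_day(start_state=0):
    """Return the number of state changes (transitions) in one simulated day."""
    state, changes = start_state, 0
    for _ in range(STEPS_PER_DAY):
        nxt = random.choices([0, 1, 2], P[state])[0]
        if nxt != state:  # a change of state counts as a transition
            changes += 1
        state = nxt
    return changes

counts = [simulate_day() for _ in range(10_000)]
mean_daily = sum(counts) / len(counts)
print(round(mean_daily, 1))  # roughly 9-10 transitions per simulated day
```

Plotting a histogram of `counts` against the empirical per-day transition counts exposes exactly the mismatch discussed above: the simulated counts concentrate around their mean, while the real data have a heavier tail.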
3.4 Aspects of Modeling where Markov Chain Falls Short
We do not believe that such modeling captures user behavior well enough for the following reasons.
• Loss of information by the probability matrix. By categorizing the user into one of three states, we assume the transition depends only on the number of networks a user is on. The user’s likelihood of making a transition can depend on other factors, such as the time of day. If more users access the network during the daytime, it is plausible that transition probabilities during the day are higher than at night.
• Loss of the individuality of users. The dataset contains users with various habitual patterns. There could be a number of groups of users with contrasting transition probability matrices. By combining all users together, we only gain insight into how an “average user” behaves, though this “average user” may be a mix of several exemplary users.
• Failure to capture the exponential distribution. The Markov-chain-generated dataset presented a more normal distribution of transitions, as opposed to the exponential distribution in the training and testing datasets. This could potentially be explained by the model not capturing the different types of users.
We continue to the next section to evaluate what can be done along those dimensions.
3.5 Questions that Emerge from Yang Experiment Reproduction
Due to the simplicity of the Markov chain, we can ask several major questions:
• Can the entire user population be separated into groups? How many different groups of users are there?
• How frequently do users transition? What does a typical transition look like? What factors can signal, or affect, a transition?
• Do users exhibit any periodic behavioral patterns?
We first answer the grouping question by utilizing clustering algorithms.
4 Modeling
In this section, we use clustering, regression, neural nets, and probabilistic programming to evaluate and predict the number of sessions a user participates in and the duration of each session.
4.1 Clustering Finds Three Groups
A closer inspection of the dataset suggests that we may split users into groups. The population in the dataset consists of residents of UMass, which could include professors, students, and other populations whose behavioral patterns are likely different. It would be beneficial for the Markov model to perform inference for each of the groups separately.
We used K-means analysis to carry out the clustering. K-means is a commonly used clustering algorithm that is scalable and fast. A typical implementation of the k-means algorithm requires the user to provide the number of clusters, k. The algorithm picks k initial centroids for the population and gradually updates the positions of the centers as it iterates. In Tibshirani [10], the optimal number of clusters k is found using the gap statistic, which compares the compactness of the clusters against a reference distribution with no obvious clustering. We implemented such an algorithm with features including ID, IP, start time, end time, average session duration, Android, Apple, device, whether the ISP is wireless, whether the ISP is wifi, and average number of sessions per day. Three clusters were identified by the algorithm. The clustering separates the users mainly by average session duration.
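The core k-means loop can be sketched without any library, on two illustrative features rather than the full feature list above (the user tuples and the naive first-k initialization are assumptions for the example; real implementations use k-means++-style initialization). Note how the large-scale duration feature dominates the distances, mirroring the finding that average session duration drove the clustering; in practice features should be normalized if that dominance is unwanted:

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm: assign to nearest centroid, then recenter."""
    centroids = list(points[:k])  # naive deterministic init for illustration
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Illustrative feature vectors: (average session duration in s, sessions/day).
users = [(60, 30), (90, 25), (1800, 8), (2100, 6), (30000, 2), (28000, 3)]
centroids, clusters = kmeans(users, k=3)
print(sorted(len(cl) for cl in clusters))  # [2, 2, 2]
```

With duration-separated toy users like these, the three groups recover themselves after a few iterations; choosing k itself is where the gap statistic of Tibshirani [10] comes in.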
Figure 5: The clustering algorithm output. X-axis is average duration in seconds. Y-axis is session start hour. Average duration length significantly affected the clustering. The green dots represent the centers of each cluster.
Figure 6: X-axis is the portion of user entries generated from a wifi network. Y-axis is the day of the week. Those were two of the features that did not significantly affect the clustering.
Figure 7: The K-means clustering algorithm output. The top three factors for separating the dataset are average duration per session, average number of sessions in one day, and the average starting hour for sessions.
The K-means clustering algorithm helps quantify the influence each feature has on partitioning the dataset. Here the algorithm indicates that average session duration is the primary factor for distinguishing user types. Another plausible but less influential factor is the average number of sessions per day.
We note that the clustering output on its own may not be conclusive, but it points toward potential criteria for distinguishing the users. We make the following arguments and observations regarding this output.
• The long duration cluster transitions less frequently.

• Users with long average durations may be less likely to transition.

• The number of sessions per day can directly affect the calculation of how frequently the user transitions.

• Combining the above arguments, we want to verify that users who have on average short sessions and a high number of sessions tend to switch between networks more frequently. (Note that a user is not always online, otherwise this claim is trivial. The average online time is roughly 4.5 hours a day[5].) This needs to be verified because a user can be on multiple networks simultaneously. Hence, when designing a new network catered to mobile users, those users deserve extra attention for congestion.
We move on to test the arguments above.
4.2 Clustering Shines Light on Canonical User
To test the hypotheses above, we break the users into clusters, and then calculate the Markov transition probability matrix to see whether there is a significant difference.
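A minimal sketch of how an empirical transition matrix of this kind can be estimated from a discretized per-user state sequence (the states and the sequence here are illustrative, not from the dataset):

```python
import numpy as np

def transition_matrix(states, n_states=3):
    """Estimate empirical Markov transition probabilities from a state
    sequence, where states[t] is the state occupied at time step t
    (here: on 0, 1, or more concurrent networks)."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1  # count each observed transition a -> b
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave never-visited states as all zeros
    return counts / row_sums       # normalize each row to probabilities

# Illustrative sequence: one user's state at each time step.
P = transition_matrix([0, 0, 1, 0, 0, 0, 1, 2, 1, 0])
```

Each row of `P` sums to 1 and gives the conditional distribution over next states, matching the layout of the tables below.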
Transition Prob.       On 0 Networks   On 1 Network   On >= 1 Networks
On 0 Networks Now          0.930           0.066            0.004
On 1 Network               0.696           0.266            0.038
On >= 1 Networks           0.291           0.356            0.353

Table 3: Empirical estimates of transition probabilities for all users.
Transition Prob.       On 0 Networks   On 1 Network   On >= 1 Networks
On 0 Networks Now          0.896           0.096            0.008
On 1 Network               0.694           0.266            0.040
On >= 1 Networks           0.285           0.293            0.422

Table 4: Empirical estimates of transition probabilities for the long duration cluster.
Hypothesis 1: The average user in the long duration cluster (average duration at or above the 67th percentile) exhibits fewer transitions on average than the average user.
H0 is that the average user in the long duration cluster exhibits on average the same number of transitions as the average user.
First we see that, on average, the long duration users have 18.49 sessions per day, whereas the average user in the population of 6084 has 20.64 sessions per day. The difference is significant (p = 0.00179) by a t-test, so we reject the null hypothesis.
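The test itself can be sketched as follows; the two samples here are synthetic draws with the stated group means, since the real per-user counts are not reproduced in this document:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins: daily session counts with the observed group means.
long_cluster = rng.normal(loc=18.49, scale=6.0, size=2000)
all_users = rng.normal(loc=20.64, scale=6.0, size=6084)

# Welch's t-test (no equal-variance assumption between the two groups).
t_stat, p_value = stats.ttest_ind(long_cluster, all_users, equal_var=False)
reject_null = p_value < 0.05
```

A negative t-statistic here corresponds to the long duration cluster having fewer daily sessions than the general population.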
Figure 8: The average number of sessions per cluster.
Hypothesis 2 Users with on average the most sessions per day (top 33%) transition more frequently than the average user.
Using a similar t-test, we observe that users in the high-session group transition more frequently on average than the average user (p = 0.00834).
4.3 Regression Helps Weigh Factors
Another aspect of the Yang model that can be improved and discussed is the assumption that the number of networks the user is on solely determines the transition probabilities. We'd like to quantify the effect other features have on transitioning.
To answer the question of how frequently users transition, there are several parameters we can model. We chose to model, for each user, how many sessions happen in a given day and the duration of each session. If we can accurately model the number of sessions in a day, how long each session is, and when exactly these sessions happen, then we can overlay the sessions to evaluate the transitions.
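Counting sessions per user-day from a parsed log can be sketched like this (the log entries below are hypothetical, for illustration only):

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed log entries: (user_id, session_start) pairs.
log = [
    ("u1", datetime(2016, 3, 1, 9, 5)),
    ("u1", datetime(2016, 3, 1, 14, 30)),
    ("u1", datetime(2016, 3, 2, 8, 0)),
    ("u2", datetime(2016, 3, 1, 22, 10)),
]

# Number of sessions per (user, day): one of the quantities we model.
daily_sessions = Counter((uid, ts.date()) for uid, ts in log)
```

Session durations per user can be aggregated the same way once end times are attached to each entry.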
We first tried regression analysis. Regression analysis evaluates the relationship between a dependent variable and multiple independent variables by fitting a curve to the data points with minimal total error norm. We started out by exploring linear regression models for simplicity. More specifically, let each x_i be a p-dimensional feature vector and each y_i the value to predict; a typical regression estimates β by minimizing

\[ \sum_i (y_i - x_i^T \beta)^2, \]

where β is the weight vector for the variables.
Regression analysis can be used to discover correlations between the variables. One of the challenges of our dataset is that the predictors are not entirely independent. Hence we tried Ridge regression and logistic regression.
4.3.1 Ridge Regression
Background Ridge regression is a technique for fitting multiple-regression data that suffer from multicollinearity[19], i.e. multiple predictor variables being highly correlated. Ridge regression resolves that issue through regularization, adding a penalty on the weight sizes. Specifically, let each x_i be a p-dimensional feature vector and each y_i the value to predict; Ridge regression estimates β by minimizing

\[ \sum_i (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2, \]

where β is the weight vector for the variables and β_j is the weight for the jth variable.
The first sum is the error term, and the second sum adds a penalty on the absolute sizes of the individual weights. This partially resolves the multicollinearity of the variables. If two variables are negatively correlated, the regularization term pushes their weights toward small absolute values[20]. Note the regularization constant λ, which controls how strongly small weights are preferred. The choice of λ is left to the researcher, typically by trial and error.
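Under these definitions, a Ridge fit can be sketched with scikit-learn; the data below are synthetic stand-ins, and `alpha` plays the role of λ:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Synthetic stand-in for 7 per-user features (device flags, day of week,
# cellular likelihood, average duration, average start time, ...).
X = rng.normal(size=(6000, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=5.0, size=6000)

# 60/40 train/test split over users.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

model = Ridge(alpha=1.0)  # alpha is the regularization constant lambda
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the held-out users
```

In practice λ (here `alpha`) would be tuned by trial and error or cross-validation rather than fixed at 1.0.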
Results
Predicting Number of Sessions We first tried to predict the average number of sessions per day for every user, as it is normally distributed. We used scikit-learn[27] as our library, for its ease of use and availability of tutorials. We used all 6000 users as our data, taking 60% of the users as the training set and 40% as the testing set. For each user-day, we passed in the following features: whether the device is Apple, whether the device is Android, what day of the week it is, the user's likelihood to be on a cellular network that day, the user's average session duration, and the user's average session start time. The prediction score is defined by the R² score, or
\[ R^2 = 1 - \frac{\sum_i (y_{\mathrm{true},i} - y_{\mathrm{pred},i})^2}{\sum_i (y_{\mathrm{true},i} - \bar{y}_{\mathrm{true}})^2}, \]

i.e. how much of the variability in the data around its mean the model explains. We did the prediction twice: once on the general population, and once per cluster.
Figure 9: The long cluster average session length prediction. Each integer on the x-axis is a feature, and the y-axis shows the feature's corresponding weight.
R² scores on the training and testing datasets reach only around 7% for both the general population and each cluster. Training the model took 0.003 seconds on average, due to the low dimensionality.
The main observations are:

• Using just the 7 features, a linear regression model cannot predict the number of sessions in a given day.

• Being on a cellular network decreases the number of sessions per day, while being on a wifi network increases it. This is potentially because automatic email checking is enabled only on wifi, to save cellular data usage.

• Grouping by cluster does not show a significant difference in prediction accuracy.
Note that in predicting the daily number of sessions, we no longer have access to the unique IP address found in each entry before. This decrease in the number of independent variables further reduced the performance of our prediction.
This result opens up discussion for alternative modeling techniques.
4.3.2 Logistic Regression
Motivation Another parameter we’d like to predict is the duration of each session. We first look at the distribution of session durations.
Figure 10: The distribution of session durations, in seconds. Note the second peak at the hour mark.
We observe that there is a secondary peak at the hour mark, and more importantly, the distribution is not normal. A deeper inspection of the original processed dataset reveals that the hour peak was largely due to the server closing connections after an hour of idle time. This does not conflict with our definition of a session, as the user was on for the entire time, despite not utilizing the network.
Since Ridge regression assumes a normal distribution, it will not work well for directly predicting the session duration. We asked whether a categorical answer (such as long or short) could be acquired through regression, and decided to use logistic regression to give a categorical prediction for session duration.
Background Logistic regression is a regression model where the dependent variable is categorical, as opposed to continuous. Logistic regression makes its predictions by comparing the likelihood of each individual category, and picks the category with the highest likelihood.
Formally, logistic regression tries to predict the conditional probability Pr(Y = 1 | X = x), where Y is the binary indicator variable for the category (0 or 1). To reach a 0 or 1 output, we need to examine the math carefully to produce the correct results.
The formal model[21] is

\[ \log \frac{p(x)}{1 - p(x)} = \beta_0 + x \cdot \beta, \]

where β_0 is the y-intercept and β the weights for the features. Solving for p we get

\[ p(x) = \frac{e^{\beta_0 + x \cdot \beta}}{1 + e^{\beta_0 + x \cdot \beta}} = \frac{1}{1 + e^{-(\beta_0 + x \cdot \beta)}}. \]

To minimize error, we should predict Y = 1 when p ≥ 0.5 and Y = 0 when p < 0.5. Equivalently, we predict 1 when β_0 + x · β ≥ 0, and 0 otherwise.
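The model above can be exercised with scikit-learn; here the data are synthetic draws from a known logistic model (the task labels are illustrative), so the fitted classifier should roughly recover the decision rule:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Synthetic binary task (e.g. "long" vs "short" session) from a known model.
X = rng.normal(size=(1000, 4))
true_beta = np.array([1.5, -2.0, 0.5, 0.0])
p = 1 / (1 + np.exp(-(X @ true_beta)))        # p(x) from the formula above
y = (rng.uniform(size=1000) < p).astype(int)  # sample labels with prob p(x)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]  # estimated p(x) for each sample
accuracy = clf.score(X, y)          # fraction predicted correctly
```

Because the labels are themselves noisy draws, even the true model cannot reach perfect accuracy; the fit should still do substantially better than chance.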
We used a variation of the logistic regression model, namely expanding it to predict among multiple categories. That is,