
Web Pre-Fetching

Implementation Project

Implementation of Markov model for web request prediction

Habel Kurian Spring 08

Major advisor: Dr. Daniel Andresen

Department of Computing and Information Sciences Kansas State University, Manhattan, Kansas

Index

1. Introduction
2. Web pre-fetching
3. Train-test model
4. Pseudo code
5. Markov tree (serialized as XML)
6. Implementation
7. Results
8. Future work
9. Conclusion
10. References

Introduction

The World Wide Web is a huge information repository. When so many users access this repository, it is easy to find patterns in the way they access it. The growing popularity of technologies like Ajax and REST-based architectures, which generate dynamic content on a continuous basis, motivates me to revisit web pre-fetching. Web request prediction has been implemented in the past primarily for static content. Since the goal is to pre-fetch the predicted item, I will use the terms pre-fetching and prediction interchangeably.

The objective of web request prediction is to identify the subsequent requests of a user given the current request that the user has made. The server can then either pre-fetch and cache those pages or pre-send this information to the client. The idea is to control the load on the server and thus reduce access time. Careful implementation of this technique can reduce access time and latency, making optimal use of the server's computing power and the network bandwidth.

In this project I have built a Markov model. The algorithm used to build it is based on the algorithm proposed by Brian D. Davison in his work on learning web request patterns. Since I plan to study this topic further, I have designed the model so that it can be easily serialized and has cross-platform applicability. A lot of work has been done in point-based prediction, where the next request depends on actions recorded at time instants and the result depends solely on the currently observed action. Path-based models, on the contrary, are built from users' past path data and have generally been found to be more accurate, with higher applicability. This work re-establishes the fact that a user's short request sequences yield unpredictable results, and that using a higher-order Markov model is therefore more useful.

The Markov model is a machine-learning technique and differs from the data-mining approach applied to web logs. The data-mining approach identifies classes of users from their attributes and predicts future actions without considering interactivity and immediate implications. Other techniques, like prediction by partial matching and information retrieval, may be used in conjunction with Markov modeling to enhance performance and accuracy.

There are different types of logs maintained by a server, largely depending on how the server is configured. Two types of logs considered in this implementation and in the work ahead are access logs and referrer logs. The access log contains all the requests that the server receives; these are by and large new requests, since repeated requests may be served by the browser cache, a proxy server, a reverse proxy server, or the main server itself if it maintains a cache. The referrer log also captures traversal information; however, the referrer header is not a mandatory field in HTTP and is often found to be empty. In this implementation I have considered only the access logs. Also, I am concerned with the requests served by the server, not with those handled at intermediate stages like the browser cache, proxy server or reverse proxy server.
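As an illustration (not an actual entry from the CIS logs), an access log line in the common combined format carries the client IP, timestamp, request line, status, response size, referrer and user agent; the IP, path and referrer below are hypothetical:

192.0.2.1 - - [10/Mar/2008:13:55:36 -0600] "GET /~user/index.html HTTP/1.1" 200 2326 "http://www.cis.ksu.edu/" "Mozilla/5.0"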

Web Pre-fetching

Web pre-fetching is a mechanism by which web servers/clients can pre-fetch web pages well in advance, before a request is actually received by a server or sent by a client. The idea is: given a request, how accurately can we predict the next request? A web server can then cache the most probable next request, reducing the time taken to respond to it considerably. It can help make up for the web latency that we face on the Internet today.

This implementation project is about designing and implementing a pre-fetching model in the form of a Markov model. The Markov property states that, given the current state, future states are independent of the past states. Hence, the present state fully captures all information that can influence the future evolution of the model.

The Markov model has been found to be successful in the web domain particularly because it has variants in the form of different orders of the model. The data collected in this implementation clearly shows that 3rd and higher order models have a high success rate in terms of correct future predictions. Thus one can build variations of this model and use the one with the highest applicability and success rate, or use a combination. The different-order models are directly associated with the n-grams used by the speech and language processing community. We borrow this idea and consider an n-gram to be a sequence of n consecutive requests. To make a prediction, one matches the prefix of length n-1 of an n-gram and uses the Markov model to predict the nth request. The important question is that, given a prefix of length n-1, there are several possibilities for the nth request; how do we identify it appropriately? We use the Markov model's idea of states. Each node, and therefore each request, represents a state. The transition from one state to another has some probability associated with it. Given a sequence of n-1 states, we pick the nth request with the highest probability; how this probability is calculated is explained in the next section. Note that the transition with the highest probability may not always be the correct one, and hence we use the idea of top-n predictions, where we consider not just the single most probable nth request but several nth requests with high probabilities. We could also establish a minimum threshold to achieve higher accuracy.

The change between different states is known as a transition, and the probability with which it occurs is known as the transition probability. So when we have a request that fetches a web page, that page is the current state of the system. Based on the pre-fetching model (the transition probabilities) one can predict the future state.

Train-Test Model

How is this implementation project different from some of the work done in the past? First of all, the log files used each time to construct the model are of limited size, with roughly 20 thousand entries recording the requests made to the CIS web server in one day. Therefore I have sessions spanning less than 12 hours. I believe 12 hours is too big a session. I plan to build a better recording system in which I will not only use the IP to record all requests made by a client, but will concatenate the IP and timestamp information to generate unique keys over pre-defined time intervals. This will not only help record the requests coming from an IP but also give a better session definition (a sketch of such a key follows).
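The following is a minimal Perl sketch of how such a key could be generated. The 30-minute bucket length, the subroutine name and the example IP are assumptions for illustration, not part of the current implementation.

use strict;
use warnings;

# Hypothetical session key: the client IP concatenated with the index of a
# fixed-length time bucket, so requests from the same IP that are far apart
# in time fall into different sessions.
my $SESSION_WINDOW = 30 * 60;          # assumed 30-minute interval

sub session_key {
    my ($ip, $epoch_seconds) = @_;
    my $bucket = int($epoch_seconds / $SESSION_WINDOW);
    return "$ip-$bucket";
}

print session_key('192.0.2.1', time()), "\n";   # e.g. "192.0.2.1-669..."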

The Markov model constructed while running this program is stored in XML format. This gives me the flexibility to incorporate it into other systems, say a web server, with ease. I build the model using one set of logs and thereafter test it with another set of logs.

Although this model does not consider referrer logs while constructing the Markov model, they do help to a great extent in filtering out requests that have come from other websites. Since this model is being used to improve the performance of an in-house web server, we are more concerned with requests whose origin is within the CIS domain.

So I have used the referrer logs to eliminate a lot of requests that are not of much concern at this stage. I have also taken measures to eliminate log entries that deal with requests for embedded resources like audio, video, pictures, etc. I have not eliminated JavaScript or Java resource files, like applets, as doing so has more repercussions and requires more detailed study.

In the Markov tree constructed here each node has the following set of information:

1. Self count
2. No. of children
3. Child count
4. Immediate parent

Fig1. Markov tree node

Self-count is the number of times a particular node has been accessed, i.e. the number of times that node has been visited in a Markov chain. Number of children is the number of children that belong to a node. Child count is the number of times one or more of its children have been accessed. The immediate parent of a node is also recorded; it has no immediate use but has been added in case I decide to use a different data structure in the future.

Given the current state, the Markov-based probability of a particular state is calculated by dividing the self-count of a node by the child count of its parent node.
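Written as a formula, for a node c whose parent is p:

    P(c | p) = selfcount(c) / childcount(p)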

Given the URL sequence associated with a particular user's session, the URLs are assigned unique identifiers, and as we run through the sequences, a new node is constructed if one does not exist, or the contents of the node (self count, child count and number of children) are updated as the case may be. Please see the pseudo code for more details.

Suppose we have already transformed a sequence of requests into unique identifiers, e.g. 1000 1003 1002 1000. As per the algorithm, the following subsequences are considered in a second-order Markov model:

a. 1003 1002 1000 -> 1000; 1002 1000; 1003 1002 1000
b. 1000 1003 1002 -> 1002; 1003 1002; 1000 1003 1002
c. 1000 1003 -> 1003; 1000 1003
d. 1000 -> 1000

Fig2. Markov tree after completing the algorithm for the first part (a) of the sequence

Fig3. Markov tree after completing the algorithm for the sequence listed above

The transition probabilities of the states can be explained using Fig3. Given the root of the model, which is the starting point of a sequence of requests, one could pre-fetch all three requests, namely 1000, 1002 and 1003. However, this is not a good way of doing it, because in a real scenario the root may have many children (524 in the example given below). So we either pick the one with the highest probability or use the top-n approach. 1000, 1002 and 1003 have probabilities 0.5, 0.25 and 0.25 respectively. In this case we would pre-fetch only 1000, i.e. the one with the highest probability. This is calculated by dividing the self-count of the child by the child count of its parent. The use of unique identifiers instead of URLs plays an important role in reducing space usage and has been found to be extremely convenient.
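As an illustration, here is a rough Perl sketch of the top-n selection step. The nested-hash node layout, the subroutine name and the counts (chosen to reproduce the 0.5/0.25/0.25 probabilities above) are assumptions, not the project's actual code.

use strict;
use warnings;

# Each node is assumed to be a hash with selfcount, childcount and a
# children hash keyed by request id (mirroring the fields of Fig1).
sub predict_top_n {
    my ($node, $n, $threshold) = @_;
    return () unless $node->{childcount};            # no transitions recorded
    my %prob = map { $_ => $node->{children}{$_}{selfcount} / $node->{childcount} }
               keys %{ $node->{children} };
    my @ranked = grep { $prob{$_} >= $threshold }
                 sort { $prob{$b} <=> $prob{$a} } keys %prob;
    return @ranked > $n ? @ranked[0 .. $n - 1] : @ranked;
}

# Root node with assumed counts that yield the probabilities 0.5, 0.25, 0.25.
my $root = {
    childcount => 4,
    children   => {
        1000 => { selfcount => 2 },
        1002 => { selfcount => 1 },
        1003 => { selfcount => 1 },
    },
};
print join(' ', predict_top_n($root, 2, 0.2)), "\n";   # 1000, then 1002 or 1003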

Pseudo code to build the Markov tree

Based on Brian D. Davison (2004), Learning Web Request Patterns.

for each set of sequences associated with a user session {
    while (there is a subsequence s, starting from the first request, that has not been considered) {
        for i from 0 to min(|s|, markov model order) {
            let ss be the subsequence containing the last i items from s
            let p be a pointer to t        (t is the root of the Markov tree)
            if |ss| == 0
                increment p.selfcount
            else
                for j from first(ss) to last(ss) {
                    increment p.childcount
                    if not-exist-child(p, j) {
                        increment p.no_children
                        add a new node for j to the list of p's children
                    }
                    let p point to child j
                    if j == last(ss)
                        increment p.selfcount
                }
        }
    }
}
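Below is a rough Perl translation of this pseudo code, under one reading of it: the model is updated as each request of a session is taken up, using the last i requests seen so far for every i up to the model order. The nested-hash layout and the names (new_node, add_session, $ORDER) are assumptions, not the project's actual code.

use strict;
use warnings;

my $ORDER = 3;                       # assumed Markov model order

# Create an empty node carrying the fields of Fig1. The immediate-parent
# field is omitted here to avoid circular references in the sketch.
sub new_node {
    return { selfcount => 0, childcount => 0, no_children => 0, children => {} };
}

# Update the tree rooted at $root with one session of request ids.
sub add_session {
    my ($root, @requests) = @_;
    my @s;                           # requests considered so far in this session
    for my $req (@requests) {
        push @s, $req;
        my $max = @s < $ORDER ? scalar @s : $ORDER;
        for my $i (0 .. $max) {
            if ($i == 0) {           # |ss| == 0: only the root is touched
                $root->{selfcount}++;
                next;
            }
            my @ss = @s[ -$i .. -1 ];        # last $i requests
            my $p  = $root;
            for my $idx (0 .. $#ss) {
                my $j = $ss[$idx];
                $p->{childcount}++;
                unless (exists $p->{children}{$j}) {
                    $p->{no_children}++;
                    $p->{children}{$j} = new_node();
                }
                $p = $p->{children}{$j};
                $p->{selfcount}++ if $idx == $#ss;
            }
        }
    }
    return $root;
}

# Example: the identifier sequence used in the worked example above.
my $tree = new_node();
add_session($tree, 1000, 1003, 1002, 1000);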

Snapshot of the model as an XML file:

[XML listing not reproduced here: the root node carries a self-count of 524, a child count of 524 and 134 children; nested child nodes (e.g. 10309 and 10313) follow, each carrying the same set of fields.]

The first node gives the self-count, child count and number of children of the root node, i.e. 524, 524 and 134 respectively. The full tree cannot be listed, but you can clearly see the nesting of child nodes within their parent nodes. The higher the order of the model, the more nested this structure will be. The parent information is not used in this implementation; however, it may be useful if we decide to use a different data structure in the future.
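Since the exact element names are not reproduced above, the following is a purely hypothetical sketch of how one node of the serialized tree might look; the tag names and the child-node contents are assumptions, and only the root's counts (524, 524, 134) come from the actual file.

<node id="10000">
  <selfcount>524</selfcount>
  <childcount>524</childcount>
  <nochildren>134</nochildren>
  <parent>10000</parent>
  <node id="10309">
    ...
  </node>
  <node id="10313">
    ...
  </node>
</node>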

Implementation

The parser is coded in Perl and takes advantage of its scalar data types, the rich set of libraries on CPAN, and Perl's ease of working with regular expressions. As the parser works through the log files, it stores each unique URL it encounters in a hash table and assigns it a unique identifier. This way one can use these identifiers instead of variable-length URLs, bringing consistency as well as saving storage space. A multi-list hash table is used to collect all the URLs accessed by an IP (user) during a given session. The definition of a session is vague: I parse one log file at a time, and a log file holds the entries of one day, or at most two days, on average. It is, however, possible to make a stronger definition of sessions by making use of the timestamp information in the logs.
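A rough sketch of this step is shown below, assuming Apache-style access log lines read from standard input; the field pattern, the starting identifier 10000 and the variable names are illustrative assumptions, not the project's actual parser.

use strict;
use warnings;

my (%url_id, %sessions);             # url -> id, and ip -> ordered list of ids
my $next_id = 10000;                 # assumed starting identifier

while (my $line = <STDIN>) {
    # Very rough pattern: client IP ... "GET /path HTTP/x.y" ...
    next unless $line =~ /^(\S+) .*?"(?:GET|POST) (\S+) HTTP/;
    my ($ip, $url) = ($1, $2);
    next if $url =~ /\.(?:jpe?g|gif|png)(?:\?|$)/i;  # skip embedded images
    $url_id{$url} = $next_id++ unless exists $url_id{$url};
    push @{ $sessions{$ip} }, $url_id{$url};         # multi-list hash table
}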

The URLs associated with a user are thus stored in a multi-list hash table in the same order as they appear in the log file. These lists are used one at a time to construct Markov models of different orders. The Markov model is built as an XML tree. This is only one of the ways to model it, but it provides flexibility as far as storage and portability are concerned.

XML::DOM and XML::Twig provide a rich set of functions to build and manipulate XML files. I maintain two XML files: one mapping each URL to its unique identifier, and the other holding the model itself. The ability to specify library paths in Perl makes it possible to run the program remotely on any machine. The pseudo code for this model can be found above and is self-explanatory.
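As a hedged illustration of how one node might be written out with XML::Twig (the element names are assumptions, as above, and this is not the project's actual serialization code):

use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new(pretty_print => 'indented');
$twig->parse('<markovtree/>');                        # start with an empty root

my $node = XML::Twig::Elt->new('node', { id => 10000 });
$node->insert_new_elt(last_child => selfcount  => 524);
$node->insert_new_elt(last_child => childcount => 524);
$node->insert_new_elt(last_child => nochildren => 134);
$node->paste(last_child => $twig->root);

$twig->print;                                         # write the XML to STDOUT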

To test the model for accuracy, I use the same infrastructure used to build it. Say, for instance, I am testing the 2nd-order model with the URL sequence 1000 -> 1001 -> 1003, and the model says that 1000 -> 1001 -> 1003 has a higher transition probability than 1000 -> 1001 -> 1004. The third URL in the log (1003 in this case) matches the one predicted by the model, so this is counted as a successful match. These tests are repeated for all the Markov models.
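A sketch of this test is given below, reusing the node layout and the predict_top_n routine from the earlier sketches; the window-based counting is my reading of the procedure, not necessarily the exact test program.

# Walk every window of (order + 1) consecutive ids in a test session; the
# first `order` ids form the prefix and the last one is the request the
# model should have predicted.
sub test_session {
    my ($root, $order, $n, @session) = @_;
    my ($hits, $total) = (0, 0);
    for my $start (0 .. $#session - $order) {
        my $actual = $session[$start + $order];
        my $p = $root;
        for my $id (@session[$start .. $start + $order - 1]) {
            $p = $p && $p->{children}{$id};          # follow the prefix
        }
        $total++;
        $hits++ if $p && grep { $_ == $actual } predict_top_n($p, $n, 0);
    }
    return $total ? $hits / $total : 0;              # accuracy = correct / all
}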

Results

The results are based on data collected over a period of approximately ten days. The total number of log entries is about ninety thousand. It is important to keep in mind, however, that many of these entries are ignored by the parser for the reasons mentioned above. The accuracy of the model is calculated as correct predictions divided by all requests, expressed as a percentage.

The parser records only those requests that try to access a page in the cis.ksu.edu domain. Also note that many of the web pages have files like images or scripts embedded in them; among these are common image files like jpg/jpeg/gif. The parser does not record requests for these resources, as they are embedded and would be implicitly pre-fetched.

The first graph (Fig. 4) shows the variation in the accuracy of making correct predictions for four different orders of the Markov tree. It is clearly seen that the third-order Markov model has a high success rate, and the prediction rate remains relatively the same for the fourth order. This is explained by the fact that a user follows at most three to five embedded hyperlinks from the originally requested page; after making three to five consecutive requests by following hyperlinks, the user is most likely to traverse back. This is in agreement with the work done by other researchers. However, models with an order higher than five are observed to have deteriorating or invariant success rates, for the same reason as the lower-order models.

The second graph (Fig 5) shows the relative improvement in prediction accuracy as the number of top predictions considered increases. Given a request, the Markov model predicts the next request as the one with the highest probability. However, the test data reveals that the resource with the highest probability is often not the actual next request. Pre-fetching at least two pages/resources with the highest probabilities increases the performance significantly; however, pre-fetching more than two resources yields negligible accuracy gain.

Testing the performance of the model with a neutral file, i.e. one that has not been used to build the model, indicates that as the number of logs used to build the model increases, the performance improves moderately over a span of six to seven logs and then stabilizes. In each of the above tests I tested the model over multiple logs and used the mean accuracy.

Letting predictions match more than just the very next request, thus increasing the prediction window, also has an impact on the accuracy of the model: content cached by the server based on the current Markov model is very likely to be used by later requests, so an unmatched request need not necessarily mean a prediction failure. However, my test program currently does not have the infrastructure to support this. I propose to do this as part of my future work.

Fig. 4 Graph showing success rate (%) with increase in the order of the Markov tree (1st to 4th order)

Fig 5. Graph depicting relative improvement in accuracy (%) as the number of top predictions used varies from 1 to 6 (for the 3rd-order Markov model)

Future work

The direction for future work is to analyze the important role that a referrer log can play, in combination with an access log, in building a prediction model. Referrer logs are often neglected, as a particular web request may not carry the referrer information. However, the fact that some experiments utilizing referrer logs have shown a significant difference in prediction success rate makes this proposition important to consider. By considering a larger number of logs it is possible to collect more referrer information. Also, I would like to define a session precisely. As mentioned earlier in this report, the definition of a session differs from person to person; coming up with a stronger definition and identifying the one that gives the best results is another goal. The way this model can evolve with dynamic technologies like Ajax and REST-based services will also be closely studied. This model has a few drawbacks too. One of them is that it cannot assign a probability to a next item that has never been seen; using a variant of prediction by partial matching will help take care of this situation and will be considered in the work ahead.

Conclusion

Web prediction can go a long way toward improving the experience of users on the web. The web and web technologies are still evolving, and the opportunity to incorporate such techniques is wide open. Care has to be taken that using such techniques does not indirectly hamper the performance of the system; for instance, we cannot afford to build the model on the same machine as the web server during high-traffic periods.

Tests conducted on Markov models built using the logs of the CIS department at Kansas State University once again re-establish the effectiveness of Markov models and the role they can play in pre-fetching, pre-loading, recommendation systems and caching.

References

1. Brian D. Davison. The Design and Evaluation of Web Prefetching and Caching Techniques. PhD thesis, Department of Computer Science, Rutgers University, October 2002.

2. Rikki N. Nguyen and Azer Bestavros. Implementation of server-assisted prefetching for the WWW. cs-www.bu.edu, Spring 1996.

3. Zhong Su, Qiang Yang, Ye Lu, and Hong-Jiang Zhang. WhatNext: A prediction system for Web requests using n-gram sequence models. In Proceedings of the First International Conference on Web Information Systems Engineering, pages 200-207, June 2000.

4. Raluca Popa and Tihamer Levendovszky. Markov model for web access prediction. 8th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics, November 2007.

5. Markov model/chain, http://en.wikipedia.org/wiki/Markov_chain
