Detecting Sequences and Cycles of Web Pages

B. L. Narayan and Sankar K. Pal
Machine Intelligence Unit, Indian Statistical Institute, 203, B. T. Road, Calcutta - 700108, India.
E-mail: {bln_r, sankar}@isical.ac.in

International Conference on Web Intelligence, September 2005, Compiègne, France

Abstract

Cycle detection in graphs and digraphs has received wide attention and several algorithms are available for this purpose. While the web may be modeled as a digraph, such algorithms would not be of much use due to both the scale of the web and the number of uninteresting cycles and sequences in it. We propose a novel sequence detection algorithm for web pages, and highlight its importance for search-related systems. Here, the sequence found is such that its consecutive elements have the same relation among them. This relation is measured in terms of the positional properties of navigational links, for which we provide a method for identifying navigational links. The proposed methodology does not detect all possible sequences and cycles in the web graph, but just those that were intended by the creators of those web pages. Experimental results confirm the accuracy of the proposed algorithm.

1. Introduction

The Web has a complex graph structure and is usually modeled as a digraph [3, 6], with the pages forming the vertices and the links between them being the arcs. We use the following terminology related to graphs [8]. A digraph or directed graph G is an ordered pair (V, E), where V is a set of vertices and E is the set of arcs or directed edges between these vertices. We denote the arc from vertex i to vertex j by e_{ij}. A path in a digraph is a sequence of vertices x_1, x_2, ..., x_i, x_{i+1}, ..., x_n such that x_i and x_{i+1} are connected by an arc in E, and the vertices are not repeated. A cycle is a sequence of vertices x_1, x_2, ..., x_i, x_{i+1}, ..., x_n, x_{n+1} such that x_{n+1} = x_1 and if x_{n+1} is removed, the remaining part is a path. We denote the path and cycle defined here by x_1.x_2 ... x_i.x_{i+1} ... x_n and (x_1.x_2 ... x_i.x_{i+1} ... x_n), respectively. A digraph is strongly connected if there exists a path from any vertex x_1 to any other vertex x_n in V. The in-degree (out-degree) of a vertex is the number of arcs leading to (going out of) the vertex. In what follows, we use the term sequence to refer to a path in a digraph.

Link analysis algorithms dealing with the web graph assume that it is strongly connected, and in practice, this is ensured by adding artificial arcs to the graph [6]. According to recent estimates, the number of pages on the WWW is several billion, with Google claiming to have indexed over 8 billion pages (Source: http://www.google.com/googleblog/2004/11/googles-index-nearly-doubles.html).

In such massive digraphs, with several arcs between the vertices, sequences and cycles are a common phenomenon. However, only a few of these sequences and cycles were conceived at the time of their creation by the authors of the constituent pages. While the reason behind creating such special structures in the web graph may be convenience or Search Engine Optimization, sometimes it borders on malicious intent or spam. Moreover, with web authoring styles varying widely, the same kind of content may be presented in a single page, or broken up into several pages. Such dissimilarities bring about large differences in link-based analyses. So, for the sake of uniformity and consistency during comparison of web pages, the detection of these special structures is essential.

In the present article, we propose a novel algorithm that detects sequences of web pages which were intended by their creator to be traversed in that order. The algorithm broadly consists of three parts. The first deals with identifying navigational links. The second part finds ordered triplets of pages (A, B, C) such that C is related to B in the same way as B is related to A. These form the segments of the sequences to be identified. The final part concatenates the detected segments, to obtain sequences as long as possible. Certain speedups and improvements over this naive algorithm are also discussed. Experiments performed on a data set containing cycles of HTML pages show the performance of the proposed methodology to be satisfactory.

This article is organized as follows: Section 2 lists the background work related to identifying navigational links and detecting cycles in graphs. In Section 3, we describe the proposed methodology for detecting sequences and cycles of web pages. We discuss the characteristics and importance of our methodology in Section 4, which is followed by Section 5 where we present our experimental results. We draw our conclusions in Section 6.

2. Related Work

The methodology proposed in this article relies on both identification of navigational links and detecting sequences and cycles from graphs. Here, we survey the literature on both these topics.

Identification of navigational links has been studied in many works available in the literature [2, 4, 16, 7]. Bharat and Mihaila [2] disregarded links between pages on affiliated hosts. Two hosts were called affiliated if they shared the first three octets in their IP addresses, or the rightmost non-generic tokens in their hostnames were the same. Borodin et al. [4] identified and eliminated navigational links using a very similar idea. However, not all navigational links are detected in this manner [4]. Moreover, this approach is quite severe, and some links connecting pages wholly on the basis of their content would be wrongly classified as navigational links. This would especially be the case where, say, a member of an organization links to content on a colleague's page.

Yi and Liu [17] detect navigational links along with banner ads, decoration pictures, etc., which were collectively termed web page noise, in a set of pages by constructing a compressed structure tree (CST). The diversity of tags in an element node of the CST determines how noisy the corresponding block in the web page is.

In the above-mentioned studies, navigational links have been treated as noise and discarded. Also, in [17], they have not been separated from other kinds of web page noise like banner ads, decoration pictures, etc.

Sequence or path detection in graphs has largely been restricted to finding shortest paths, which conveys the feeling that sequence detection, by itself, was not considered very interesting. On the other hand, cycle detection has been widely studied under two related areas, namely, pseudo-random number generation [5, 11] and graph theory [14, 9]. In pseudo-random number generation, a sequence of numbers is generated by applying the same function to the last generated number. The objective is to detect cycles in the sequences of random numbers being generated. By adding arcs between consecutive elements of such sequences, this problem may be studied as a graph cycle detection problem. Since every element of such sequences has out-degree one, stack-based algorithms are employed for cycle detection [11].

Detecting cycles in general graphs and digraphs is more complicated, because the out-degrees of the vertices may be more than one. There are several digraph cycle detection algorithms, as evident from the opening statement of [9]. Some of them, like the one by Read and Tarjan, depend on depth-first search, while others, like the Szwarcfiter-Lauer algorithm, employ a recursive backtracking procedure that searches strongly connected components of the graph [9]. For the web graph, where the whole of it may form a single strongly connected component, such algorithms may not be of much use.

We now look at an algorithm that detects sequences and cycles of interest from the web graph.

3. Proposed Sequence Detection Algorithm

We begin this section with the motivation for why yet another cycle detection algorithm is needed.

3.1. Motivation

Consider the following statement.

A: http://www.stanford.edu/index.html, http://www.yahoo.com/index.html and http://wi-consortium.org/index.html form part of a cycle of web pages.

One can easily be convinced that Statement A is true, primarily because of the high in- and out-degrees of the chosen web pages. Moreover, Statement A would hold true even in conjunction with any of the following statements:

B: The three pages appear in a prespecified order

C: The length of the cycle is exactly 20

D: The cycle contains no web page from a co.uk or a .net domain

The ease with which these statements have been made, and can be verified to be true, provides an idea of how huge the number of cycles involving these three web pages would be. This in turn would imply that the total number of cycles on the web (without imposing any conditions) would be extremely large, say of the order of billions, with the number of vertices (web pages) being a few to several billion. Enumerating all such cycles (say, by an algorithm as mentioned in [13] or [9]) would be tedious and computationally expensive, and several individual cycles that have been formed just by chance may not be interesting at all. Perhaps, of more interest, and consequence, would be statements about the absence of cycles of a certain kind.

However, there are some that stand out in this galaxy of ... elements remains the same.
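
As a rough illustration of the counting argument in Section 3.1, the sketch below counts simple directed cycles in a small complete digraph, optionally restricted to cycles that pass through a given set of vertices (the analogue of the three pages in Statement A). It is not part of the original paper: the complete digraph is an idealized, worst-case stand-in for a densely linked set of pages, and the function names `count_simple_cycles` and `closed_form` are hypothetical names chosen for this example.

```python
from itertools import permutations
from math import comb, factorial


def count_simple_cycles(n, required=frozenset()):
    """Brute-force count of simple directed cycles (length >= 2) in the
    complete digraph on vertices 0..n-1 that contain every vertex in
    `required`. Assumption: every ordered pair of distinct vertices is an arc.
    """
    required = frozenset(required)
    count = 0
    for k in range(2, n + 1):
        for perm in permutations(range(n), k):
            # A cycle of length k appears as k rotations of the same vertex
            # order; keep only the rotation that starts at its smallest vertex.
            if perm[0] != min(perm):
                continue
            if required <= set(perm):
                count += 1
    return count


def closed_form(n):
    """Number of simple directed cycles in the complete digraph on n vertices:
    choose k vertices, then arrange them in one of (k - 1)! cyclic orders."""
    return sum(comb(n, k) * factorial(k - 1) for k in range(2, n + 1))


if __name__ == "__main__":
    for n in range(3, 8):
        total = count_simple_cycles(n)
        through_three = count_simple_cycles(n, required={0, 1, 2})
        print(n, total, closed_form(n), through_three)
```

Even in this toy setting the count grows factorially with the number of vertices, which is consistent with the observation above that exhaustively enumerating cycles on a graph of web scale is impractical.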
