Local Identification of Web Graph Communities

LOCAL IDENTIFICATION OF WEB GRAPH COMMUNITIES Max Hinne Institute for Computing and Information Science, Radboud University, Toernooiveld 1, Nijmegen, Netherlands [email protected] Keywords: Web graph, local modularity, community identification algorithm Abstract: In order to use knowledge of the Web graph in Information Retrieval, we provide a consistent overview, aiming firstly at global aspects of the graph such as degree distribution, and then proceed by examining local aspects of the graph: community identification. We discuss several community models and we implement a community identification algorithm that operates without a priori knowledge of the graph. To elaborate on the algorithm we introduce a notational framework for graph clusters. We run the algorithm on the Dutch domain (.NL) and from the results of this experiment we conclude that the Web consists of several clusters that are mutually connected through a core of hubs. In addition we evaluate the clustering quality of the algorithm, which provides a reputable basis for local community identification. 1 INTRODUCTION lately and to connected components. After this global view we proceed with local aspects of the In the past decade, the World Wide Web (WWW) Web graph in section 4 where the emerging of has grown significantly. A recent study estimates graph communities and modularity is discussed. the total number of websites at 11,5 billion (Gulli & The question we consider here is: ‘How is the Web Signorini, 2005) and this number is still increasing. graph organised on a local scale?’. To answer this Since the WWW has become such an important question we review several community models. asset of our daily life, the Web has gained interest Subsequently we go into great detail about an in the scientific community, which resulted in algorithm that can identify communities without a various studies concerning a wide variety of topics. priori knowledge of the graph, based on local One of these areas of research examines only modularity, which can be seen as a measure of the structural properties of the WWW – the Web is seen disconnectedness of clusters in relation to the rest of as a graph, the contents of websites are mostly the graph (Clauset, 2005). To explain the algorithm ignored. Using this approach one is able to analyse we introduce a framework for describing local the evolution of structures and phenomena on the graph phenomena. We then proceed by Web (Broder, et al., 2000). An interesting example implementing this algorithm on the Web. The of such a phenomenon is the scale-free degree results of this experiment with our community distribution on the Web (Barabási, Albert & Jeong, identification implementation are provided in 2000), which will be explained in detail in the section 5. Section 6 concludes the paper and following sections. In this paper we continue the provides suggestions for further research. ongoing process of providing a model that accurately describes the Web. To do so we firstly provide a brief primer on basic graph theoretic 2 PRIMER ON GRAPH THEORY concepts in section 2. Thereafter the distinction between global and local graph characteristics is Before we proceed with modelling the Web graph, made. In section 3 we discuss the current state of we cover some of the basics of graph theory. affairs concerning the Web graph globally. The We abstract from the content of websites and attention will be directed to the scale-free degree regard only their connectivity. An interesting side distribution that has received so much attention effect of this approach is that the Web can be 1 compared to totally different networks – like the The neighbourhood of 푥 in an undirected graph is metabolic system. We define the Web graph as an given by ordered pair 퐺 = (푉, 퐴). The set 푉 contains the websites, which we will refer to as nodes or vertices 퐸 푥 ≜ 퐸(푥, 푉) 푣 ∈ 푉 and the set 퐴 contains the directed hyperlinks, ordered pairs (푖, 푗) ∈ 퐴 ⊆ 푉2, which we We also define a predicate for all edges between will refer to as arcs. We assume the graph contains two sets: no point-cycles. 퐴 can be viewed as a binary relation over 푉. The 푒푑푒푠(푋, 푌) ≜ (푥, 푦) ∈ 퐸 푥 ∈ 푋 ∧ 푦 ∈ 푌 notation 퐴(푥, 푦) means that an arc from 푥 to 푦 exists. In a directed graph, this relation is In addition there is the notion of a path between two asymmetric, so in general 퐴(푥, 푦) ↮ 퐴(푦, 푥). In vertices 푥 and 푦 if they are neighbours in one or addition the predicate 퐴(푥, 푌) is used, indicating the more steps: vertices in the set 푌 ⊆ 푉 that 푥 points to: 푝푎푡ℎ 푥, 푦 ≜ 퐴(푥, 푦) ∨ ∃푧 퐴(푥, 푧) ∧ 푝푎푡ℎ(푧, 푦) 퐴(푥, 푌) ≜ 푦 ∈ 푌 퐴(푥, 푦) And in an undirected graph there can exist a chain The symbol ≜ is used as ‘is defined as’. Secondly of edges between two nodes 푥 and 푦: we introduce 퐴(푋, 푦), the nodes in the set 푋 ⊆ 푉 that point to 푦: 푐ℎ푎푖푛 푥, 푦 ≜ 퐸(푥, 푦) ∨ ∃푧 퐸(푥, 푧) ∧ 푐ℎ푎푖푛(푧, 푦) 퐴(푋, 푦) ≜ 푥 ∈ 푋 퐴(푥, 푦) These predicates will play an important role in our community identification algorithm, to which we Of special interested is the set of all nodes that will return later. connect to a specific vertex; its neighbourhood. In a directed graph two types of neighbourhoods exist: the set that points to a node and the set that are 3 GLOBAL STRUCTURE OF pointed to by a node: THE WEB GRAPH 퐴푖푛 푥 ≜ 퐴(푉, 푥) and 퐴표푢푡 푥 ≜ 퐴(푥, 푉) When trying to find the connectivity structure of a The complete neighbourhood of 푥 is then simply large graph, in particular the WWW, we use a process called crawling. The crawler starts at a given seed vertex 푣0 ∈ 푉 (or a seed set of vertices) 퐴 푥 = 퐴푖푛 ∪ 퐴표푢푡 and proceeds to add all neighbours 퐴표푢푡 (푣0) to its Later on we will also use sets of arcs instead of crawl frontier. This is then repeated in a breadth- nodes. More specifically, we want to know all arcs first search process for each vertex in the frontier, from 푋 to 푌: adding all new vertices and arcs to the stored graph, until no new vertices to explore remain. Crawlers are often used by search engines, which in addition 푎푟푐푠(푋, 푌) ≜ (푥, 푦) ∈ 퐴 푥 ∈ 푋 ∧ 푦 ∈ 푌 to storing the graph structure, index the documents based on their contents and structure. It is sometimes desirable to view a directed By using such a crawl, Broder et al. (2000) have graph as undirected, i.e. we make no distinction observed that if the Web is seen as undirected, between a source and a destination vertex: 퐺 = about 10% of the vertices have no chain to any of (푉, 퐸). The arcs in an undirected graph are edges. the nodes in the other 90%, which form a connected For an undirected graph the above predicates are component and as a consequence, not all vertices defined analogously: The notation 퐸(푥, 푦) indicates can be reached from the chosen seed of a crawl. It that 푥 and 푦 are connected. This relation is gets more interesting when directionality is taken symmetric, i.e. 퐸 푥, 푦 ↔ 퐸 푦, 푥 . The predicate into account. One can distinguish four different 퐸(푥, 푌) provides all the nodes in 푌 ⊆ 푉 that are graph connectivity subsets: A strongly connected connected to 푥: component (SCC), which is defined as a subset 푆 of a directed graph 퐺, such that any node in 푆 has a 퐸(푥, 푌) ≜ 푦 ∈ 푌 퐸(푥, 푦) path to all other nodes in 푆 and 푆 is not a subset of 2 any larger such set: Both the bow-tie model and the daisy model provide a general idea of how the Web is organised 푆퐶퐶(푆) ≜ 푥 ∈ 푉 ∀푦 푦 ∈ 푆 ↔ 푝푎푡ℎ(푥, 푦) on a global scale. However, they provide no insight in how vertices tend to relate to each other. For this, The SCC forms the central CORE of the web we need another concept called the degree graph. The next two parts are referred to as IN and distribution of the graph. OUT, which respectively label the subset of nodes that have a path to a node in the central core, but cannot be reached from it, and the subset that has a path from a node in the central core, but cannot return to it: 퐼푁(퐼, 푆) ≜ 푥 ∈ 푉 − 푆퐶퐶(푆) ∀푦∈푆퐶퐶(푆) 푝푎푡ℎ 푥, 푦 And Figure 2: The daisy visualisation of the Web graph. 푂푈푇(푂, 푆) ≜ 푥 ∈ 푉 − 푆퐶퐶 푆 ∀푦∈푆퐶퐶(푆) 푝푎푡ℎ 푦, 푥 3.1 Degree Distributions and Scale- Finally there is the collection of sub graphs that cannot reach, and cannot be reached from, the SCC, free Graphs but that are connected to either the IN or OUT component. These sets are called the TENDRILS of The degree distribution of the Web has received the World Wide Web. The CORE is the largest much attention in the scientific community, because component with roughly 27% of the vertices, it shows similarities to various other networks. To followed by the IN and OUT components that both explain the concept some predicates require consist of 21% of the graph. The TENDRILS make definition: up for 22%, which means that 9% of the web graph Let 푖푛푑푒⁡(푥) be the in-degree of vertex 푥, is disconnected from the rest of the graph (which defined as the number of neighbours that point to could also be considered as a fifth component).

Load more