Nities in a Neighborhood Graph
Total Page:16
File Type:pdf, Size:1020Kb
Library and Bibliotheque et 1+1 Archives Canada Archives Canada Published Heritage Direction du Branch Patrimoine de !'edition 395 Wellington Street 395, rue Wellington Ottawa ON K1A ON4 Ottawa ON K1A ON4 Canada Canada Your file Votre reference ISBN: 978-0-494-33440-9 Our file Notre reference ISBN: 978-0-494-33440-9 NOTICE: AVIS: The author has granted a non L'auteur a accorde une licence non exclusive exclusive license allowing Library permettant a Ia Bibliotheque et Archives and Archives Canada to reproduce, Canada de reproduire, publier, archiver, publish, archive, preserve, conserve, sauvegarder, conserver, transmettre au public communicate to the public by par telecommunication ou par !'Internet, preter, telecommunication or on the Internet, distribuer et vendre des theses partout dans loan, distribute and sell theses le monde, a des fins commerciales ou autres, worldwide, for commercial or non sur support microforme, papier, electronique commercial purposes, in microform, et/ou autres formats. paper, electronic and/or any other formats. The author retains copyright L'auteur conserve Ia propriete du droit d'auteur ownership and moral rights in et des droits moraux qui protege cette these. this thesis. Neither the thesis Ni Ia these ni des extraits substantiels de nor substantial extracts from it celle-ci ne doivent etre imprimes ou autrement may be printed or otherwise reproduits sans son autorisation. reproduced without the author's permission. In compliance with the Canadian Conformement a Ia loi canadienne Privacy Act some supporting sur Ia protection de Ia vie privee, forms may have been removed quelques formulaires secondaires from this thesis. ant ete enleves de cette these. While these forms may be included Bien que ces formulaires in the document page count, aient inclus dans Ia pagination, their removal does not represent il n'y aura aucun contenu manquant. any loss of content from the thesis. ••• Canada Modeling and Thematic Analysis of Neighborhood Structures in the Web and Hierarchical Identification of Web Communities by © lsheeta N argis A thesis submitted to the School of Graduate Studies in partial fulfilment of the requirements for the degree of M.Sc. Department of Computer Science Memorial University of Newfoundland July 2007 St. John's Newfoundland and Labrador ii Abstract The web graph represents the structure of the World Wide Web by denoting each web page as a vertex and each hyperlink as an arc. The motivating goal behind the research constituting this thesis is twofold - firstly, to model the local structure of the web graph, and secondly, to discover communities of related web pages. We study the concept of the neighborhood graph to model the local structure surrounding a particular web page. We analyze some structural and statistical prop erties of the neighborhood graphs and perform a comparison with the corresponding properties of the whole web graph. In several aspects these neighborhood graphs show a similar characteristic to the entire web graph. Both the indegree and out degree distribution of a sufficiently large neighborhood graph follow the power law phenomenon. We perform a thematic analysis of the local structure of the Web by discovering authorities and hubs in neighborhood graphs, where the set of author ities and hubs gives a representative flavor on the theme of the page in question. We also analyze temporal evolution of Hyperlinked communities and identify Core communities in neighborhood graphs. We devise an algorithm to extract communities solely based on the topology of the Web. Central to our approach is the innovative idea of Iterative Cycle Contraction to discover Web communities comprised of related web pages. The intuition behind this algorithm is that if two pages link to each other then they are thematically re lated. Successive iterations yield a hierarchical structure of communities and allow us to define a similarity measure between two web pages in the same community by noting at which iteration their corresponding vertices are first grouped into a sin gle vertex. We apply the algorithm to some focused subgraphs of the web graph and evaluate its effectiveness by performing an investigation into the theme(s) of the lll putative communities that it finds. We find that the algorithm is successful at iden tifying communities and distinguishing communities with varying thematic content in neighborhood graphs. An examination of the distribution of community sizes in a particular iteration of the algorithm reveals that for sufficiently large neighborhood graphs a power law is observed. iv Acknowledgments First of all, I would like to express my immense gratitude to my supervisor, Dr. David A. Pike, without whose continuous support and guidance this work could not reach any destination. I particularly appreciate the amount of time and effort that he spent for guiding me. His support was not only limited to research works; he gave me advices on other aspects of academic life as well. He sent me to AARMS Summer School to have a solid background in my research area. Whenever I faced any problem in the preparation of materials in latex, he provided me a solution. We have gone through several corrections on my writing, he corrected many grammatical mistakes and more importantly, many instances where I could not express in writing what I was trying to convey. He commented to fill the gaps in my writing. He helped me in all ways to improve the skills in scientific writing. I am grateful to my co-supervisor, Dr. Wolfgang Banzhaf, who gave me many suggestions to improve my thesis and encouraged me by his inspiring comments. I would like to express my gratitude to the examiners Dr. Anthony Bonato and Dr. Manrique Mata-Montero for the feedback and suggestions they provided me. I would like to express my thanks to Neil A. McKay, who worked on the construc tion of the Neighborhood Graphs in Summer 2005 with David. At the very start of my work he explained his works to me. I adapted his programs to suit my specific needs for the research. He also gave valuable feedbacks on the draft of one of my papers of which he is a co-author, and I have incorporated some of those feedbacks to my thesis as well. I feel myself really very lucky since I had the chance to learn about various math ematical models of the web graph when I took a course named Massive Networks and Internet Mathematics in the AARMS Summer School 2006 held in Dalhousie Uni- v versity, Halifax, Canada. I would like to express gratitude to the course teacher Dr. Anthony Bonato. I learnt about recent developments in this area when I attended the MoMiNIS Winter School and the Fourth WAW held in Banff, Canada. I would like to thank Dr. Jeannette Janssen for providing me feedback and sug gestions in my early stage of research while I was in Halifax. This research has been supported by funding from MITACS (Mathematics of Information Technology and Complex Systems) and NSERC (Nat ural Sciences and Engineering Research Council of Canada). I would like to thank MITACS and NSERC for providing us the grants. I have learnt many things about the academic life while in conversation with my friends Momotaz and Rajibul. In many stages whenever I felt frustrated they gave me the best suggestions. It has been a very pleasant time with them in St. John's. I would like to express my gratitude to my family members. My husband Thhin has been always there for me, whenever I need any type of support. Being a student in the same discipline, he also gave me good feedback on my work. At times of frustrations he is the one to inspire me and cheer me up; at the time of achievements he is the one who seems happier than me. My sisters and my brother gave me good suggestions in all stages of my life. My nephews and nieces are the best gifts in this world, I find immense pleasure passing my time with them. My parents are the ones for whom I have reached where I am now. My mother took care of my studies in my childhood. My father has been the inspiration of my life. I always wanted to be as caring as my mother and as scholar as my father. I always miss them here in Canada since they live in my country, Bangladesh. But whenever I feel obstacles or darkness I think about my parents to get a light of hope. Contents Abstract 11 Acknowledgments iv List of Tables X List of Figures xu 1 Introduction 1 1.1 The Structure of the Web 1 1.2 Motivation ........ 2 1.3 Contributions of this Thesis 2 1.4 Organization of this Thesis . 3 2 Background and Related Works 6 2.1 The Web Graph . 6 2 .1.1 Definition 6 2.1.2 Applications . 7 2.2 The Bow-Tie Structure of the Web 8 2.3 Power Law . 8 2.4 Self-similarity of the Web . 10 vi CONTENTS vii 2.5 Authorities and Hubs . 10 2.5.1 Overview ... 10 2.5.2 The Graph on which HITS operates . 11 2.5.3 The HITS Algorithm 13 2.5.4 Convergence of HITS 14 2.5.5 HITS in similar-page queries . 14 2.6 Trawling the Web . 16 2. 7 Additional Notations and Terminology from Graph Theory 16 3 Neighborhood Graphs 18 3.1 Definition . 18 3.2 Construction of the neighborhood graph 21 3.3 Similarities and differences between the input graph in similar-page queries by HITS and neighborhood graphs 23 3.4 Experimental Results . 25 3.4.1 Relative Size of Regions in the bow-tie structure .