Google’s PageRank and Beyond: The Science of Search Engine Rankings

Amy N. Langville and Carl D. Meyer

PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD

Contents

Preface

Chapter 1. Introduction to Web Search Engines
    1.1 A Short History of Information Retrieval
    1.2 An Overview of Traditional Information Retrieval
    1.3 Web Information Retrieval

Chapter 2. Crawling, Indexing, and Query Processing
    2.1 Crawling
    2.2 The Content Index
    2.3 Query Processing

Chapter 3. Ranking Webpages by Popularity
    3.1 The Scene in 1998
    3.2 Two Theses
    3.3 Query-Independence

Chapter 4. The Mathematics of Google’s PageRank
    4.1 The Original Summation Formula for PageRank
    4.2 Matrix Representation of the Summation Equations
    4.3 Problems with the Iterative Process
    4.4 A Little Markov Chain Theory
    4.5 Early Adjustments to the Basic Model
    4.6 Computation of the PageRank Vector
    4.7 Theorem and Proof for Spectrum of the Google Matrix

Chapter 5. Parameters in the PageRank Model
    5.1 The α Factor
    5.2 The Hyperlink Matrix H
    5.3 The Teleportation Matrix E

Chapter 6. The Sensitivity of PageRank
    6.1 Sensitivity with respect to α
    6.2 Sensitivity with respect to H
    6.3 Sensitivity with respect to v^T
    6.4 Other Analyses of Sensitivity
    6.5 Sensitivity Theorems and Proofs

Chapter 7. The PageRank Problem as a Linear System
    7.1 Properties of (I − αS)
    7.2 Properties of (I − αH)
    7.3 Proof of the PageRank Sparse Linear System

Chapter 8. Issues in Large-Scale Implementation of PageRank
    8.1 Storage Issues
    8.2 Convergence Criterion
    8.3 Accuracy
    8.4 Dangling Nodes
    8.5 Back Button Modeling

Chapter 9. Accelerating the Computation of PageRank
    9.1 An Adaptive Power Method
    9.2 Extrapolation
    9.3 Aggregation
    9.4 Other Numerical Methods

Chapter 10. Updating the PageRank Vector
    10.1 The Two Updating Problems and their History
    10.2 Restarting the Power Method
    10.3 Approximate Updating Using Approximate Aggregation
    10.4 Exact Aggregation
    10.5 Exact vs. Approximate Aggregation
    10.6 Updating with Iterative Aggregation
    10.7 Determining the Partition
    10.8 Conclusions

Chapter 11. The HITS Method for Ranking Webpages
    11.1 The HITS Algorithm
    11.2 HITS Implementation
    11.3 HITS Convergence
    11.4 HITS Example
    11.5 Strengths and Weaknesses of HITS
    11.6 HITS’s Relationship to Bibliometrics
    11.7 Query-Independent HITS
    11.8 Accelerating HITS
    11.9 HITS Sensitivity

Chapter 12. Other Link Methods for Ranking Webpages
    12.1 SALSA
    12.2 Hybrid Ranking Methods
    12.3 Rankings based on Traffic Flow

Chapter 13. The Future of Web Information Retrieval
    13.1 Spam
    13.2 Personalization
    13.3 Clustering
    13.4 Intelligent Agents
    13.5 Trends and Time-Sensitive Search
    13.6 Privacy and Censorship
    13.7 Library Classification Schemes
    13.8 Data Fusion

Chapter 14. Resources for Web Information Retrieval
    14.1 Resources for Getting Started
    14.2 Resources for Serious Study

Chapter 15. The Mathematics Guide
    15.1 Linear Algebra
    15.2 Perron–Frobenius Theory
    15.3 Markov Chains
    15.4 Perron Complementation
    15.5 Stochastic Complementation
    15.6 Censoring
    15.7 Aggregation
    15.8 Disaggregation

Chapter 16. Glossary

Bibliography

Index

Preface

Purpose

As teachers of linear algebra, we wanted to write a book to help students and the general public appreciate and understand one of the most exciting applications of linear algebra today: the use of link analysis by web search engines. This topic is inherently interesting, timely, and familiar. For instance, the book answers such curious questions as: How do search engines work?
Why is Google so good? What’s a Google bomb? How can I improve the ranking of my homepage in Teoma?

We also wanted this book to be a single source for material on web search engine rankings. A great deal has been written on this topic, but it’s currently spread across numerous technical reports, preprints, conference proceedings, articles, and talks. Here we have summarized, clarified, condensed, and categorized the state of the art in web ranking.

Our Audience

We wrote this book with two diverse audiences in mind: the general science reader and the technical science reader. The title echoes the technical content of the book, but in addition to being informative on a technical level, we have also tried to provide some entertaining features and lighter material concerning search engines and how they work.

The Mathematics

Our goal in writing this book was to reach a challenging audience consisting of the general scientific public as well as the technical scientific public. Of course, a complete understanding of link analysis requires an acquaintance with many mathematical ideas. Nevertheless, we have tried to make the majority of the book accessible to the general scientific public. For instance, each chapter builds progressively in mathematical knowledge, technicality, and prerequisites. As a result, Chapters 1-4, which introduce web search and link analysis, are aimed at the general science reader. Chapters 6, 9, and 10 are particularly mathematical. The last chapter, Chapter 15, “The Mathematics Guide,” is a condensed but complete reference for every mathematical concept used in the earlier chapters. Throughout the book, key mathematical concepts are highlighted in shaded boxes. By postponing the mathematical definitions and formulas until Chapter 15 (rather than interspersing them throughout the text), we were able to create a book that our mathematically sophisticated readers will also enjoy.
We feel this approach is a compromise that allows us to serve both audiences: the general and technical scientific public.

Asides

An enjoyable feature of this book is the use of Asides. Asides contain entertaining news stories, practical search tips, amusing quotes, and racy lawsuits. Every chapter, even the particularly technical ones, contains several asides. Oftentimes a light aside provides the perfect break after a stretch of serious mathematical thinking. Brief asides appear in shaded boxes, while longer asides that stretch across multiple pages are offset by horizontal bars and italicized font. We hope you enjoy these breaks; we found ourselves looking forward to writing them.

Computing and Code

Truly mastering a subject requires experimenting with the ideas. Consequently, we have incorporated Matlab code to encourage and jump-start the experimentation process. While any programming language is appropriate, we chose Matlab for three reasons: (1) its matrix storage architecture and built-in commands are particularly suited to the large sparse link analysis matrices of this text, (2) among colleges and universities, Matlab is a market leader in mathematical software, and (3) it’s very user-friendly. The Matlab programs in this book are intended to be instructional, not production, code. We hope that, by playing with these programs, readers will be inspired to create new models and algorithms.

Acknowledgments

We thank Princeton University Press for supporting this book. We especially enjoyed working with Vickie Kearn, the Senior Editor at PUP. Vickie, thank you for displaying just the right combination of patience and gentle pressure. For a book with such timely material, you showed amazing faith in us. We thank all those who reviewed our manuscripts and made this a better book. Of course, we also thank our families and friends for their encouragement. Your pride in us is a powerful driving force.
Dedication

We dedicate this book to mentors and mentees worldwide. The energy, inspiration, and support that are sparked through such relationships can inspire great products. For us, it produced this book, but more importantly, a wonderful synergistic friendship.

Chapter One

Introduction to Web Search Engines

1.1 A SHORT HISTORY OF INFORMATION RETRIEVAL

Today we have museums for everything: the museum of baseball, of baseball players, of crazed fans of baseball players, museums for world wars, national battles, legal fights, and family feuds. While there’s no shortage of museums, we have yet to find a museum dedicated to this book’s field, a museum of information retrieval and its history. Of course, there are related museums, such as the Library Museum in Boras, Sweden, but none concentrating on information retrieval. Information retrieval is the process of searching within a document collection for a particular information need (called a query). Although dominated by recent events following the invention of the computer, information retrieval actually has a long and glorious tradition. To honor that tradition, we propose the creation of a museum dedicated to its history. Like all museums, our museum of information retrieval contains some very interesting artifacts. Join us for a brief tour.

The earliest document collections were recorded on the painted walls of caves. A cave dweller interested in searching a collection of cave paintings to answer a particular information query had to travel by foot, and stand, staring in front of each painting. Unfortunately, it’s hard to collect an artifact without being gruesome, so let’s fast forward a bit.

Before the invention of paper, ancient Romans and Greeks recorded information on papyrus rolls.