Analysis, Modeling, and Algorithms for Scalable Web Crawling
ANALYSIS, MODELING, AND ALGORITHMS FOR SCALABLE WEB CRAWLING

A Dissertation
by
SARKER TANZIR AHMED

Submitted to the Office of Graduate and Professional Studies of Texas A&M University
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Chair of Committee: Dmitri Loguinov
Committee Members: Riccardo Bettati, James Caverlee, A. L. Narasimha Reddy
Head of Department: Dilma Da Silva

August 2016

Major Subject: Computer Science

Copyright 2016 Sarker Tanzir Ahmed

ABSTRACT

This dissertation presents a modeling framework for the intermediate data generated by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort, replacement selection) that are well known, yet lack accurate models of the data volume they produce. The motivation comes from the IRLbot crawl experience in June 2007, where a collection of scalable and high-performance external sorting methods was used to handle such problems as URL uniqueness checking, real-time frontier ranking, budget allocation, and spam avoidance, all of them monumental tasks, especially when limited to the resources of a single machine. We discuss this crawl experience in detail, use novel algorithms to collect data from the crawl image, and then advance to a broader problem: sorting arbitrarily large-scale data using limited resources while accurately capturing the required cost (e.g., time and disk usage).

To solve these problems, we present an accurate model of uniqueness probability, i.e., the probability of encountering previously unseen data, and use it to analyze the amount of intermediate data generated by the above-mentioned sorting methods. We also demonstrate how the intermediate data volume and runtime vary with the input properties (e.g., frequency distribution), the hardware configuration (e.g., main-memory size, CPU and disk speed), and the choice of sorting method, and show that our proposed models accurately capture such variation.

Furthermore, we propose a novel hash-based method for replacement selection sort, together with its model for duplicate data, whereas the existing literature is limited to random or mostly unique data. Note that the classic replacement selection method can increase the length of sorted runs and reduce their number, both of which directly benefit the merge step of external sorting. But because its priority-queue-assisted sort operation is inherently slow, the application of replacement selection has been limited. Our hash-based design solves this problem by making the sort phase significantly faster than existing methods, making replacement selection a preferred choice.

The presented models also enable exact analysis of Least-Recently-Used (LRU) and Random Replacement caches (i.e., their hit rates), which are used as part of the algorithms presented here. These cache models are more accurate than those in the existing literature, since the existing ones mostly assume an infinite stream of data, while our models also work accurately on finite streams (e.g., sampled web graphs, click streams). In addition, we present accurate models for various crawl characteristics of random graphs, which can forecast a number of aspects of the crawl experience based on the graph properties (e.g., degree distribution). All these models are presented under a unified umbrella to analyze a set of large-scale information processing algorithms that are streamlined for high performance and scalability.
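To make the run-formation behavior mentioned above concrete, the sketch below shows the classic priority-queue version of replacement selection in C++, producing sorted runs from a small in-memory window. This is a generic textbook formulation given purely as an illustration, not IRLbot's code or the hash-based variant proposed in this dissertation; the function name replacement_selection, the sample input, and the window size M = 4 are hypothetical choices for the example.

    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Minimal sketch of classic (priority-queue-based) replacement selection.
    // Records enter a min-heap of capacity M; each popped key is appended to the
    // current sorted run, and an incoming key smaller than the last emitted key
    // is tagged for the next run so that every run stays sorted.
    std::vector<std::vector<int>> replacement_selection(const std::vector<int>& input, size_t M) {
        using Entry = std::pair<int, int>;  // (run id, key); lexicographic order keeps runs separated
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;  // min-heap
        std::vector<std::vector<int>> runs;
        size_t next = 0;

        // Fill the memory window with the first M records, all tagged with run 0.
        while (next < input.size() && heap.size() < M)
            heap.push({0, input[next++]});

        while (!heap.empty()) {
            auto [run, key] = heap.top();
            heap.pop();
            if (static_cast<size_t>(run) == runs.size())
                runs.emplace_back();              // first record of a new run
            runs[run].push_back(key);             // emit the key to its (sorted) run

            if (next < input.size()) {
                int incoming = input[next++];
                // A record smaller than the last emitted key would break the current
                // run's sorted order, so it is deferred to the next run.
                heap.push({incoming >= key ? run : run + 1, incoming});
            }
        }
        return runs;
    }

    int main() {
        std::vector<int> data = {5, 3, 9, 1, 7, 2, 8, 6, 4, 0, 11, 10};
        auto runs = replacement_selection(data, 4);   // memory window of M = 4 records
        for (size_t r = 0; r < runs.size(); r++) {
            std::printf("run %zu:", r);
            for (int k : runs[r]) std::printf(" %d", k);
            std::printf("\n");
        }
        return 0;
    }

For uniformly random input, this scheme is known to produce runs of expected length roughly 2M, which is why fewer, longer runs reach the merge phase; the drawback is that every emitted record passes through the heap, and it is this slow sort operation that the hash-based design proposed in this dissertation replaces.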
DEDICATION

To my mother,
To my father,
To my brother, and
To my niece.

ACKNOWLEDGEMENTS

First of all, I would like to express my deepest gratitude to Dr. Dmitri Loguinov for all that he has done for me. He has changed me from a person with no research experience into someone who is ready to take on the challenges of real-world research. His commitment to research, attention to detail, insistence on high quality, and most of all his hard work have left a deep impression on me, and I intend to reflect these qualities in my own research career. I am also deeply thankful to him for his patience with me and for the countless hours he spent with me, not just to solve the problems at hand, but also to teach me how to solve them on my own.

I am thankful to the Department of Computer Science for generously supporting me as a Teaching Assistant (TA) for most of my study here. The TA job also gave me the opportunity to meet new students, who thought highly of me and gave me a great deal of confidence, which affected my research positively. I also consider myself lucky to have worked in the exciting IRL lab, where Xiaoyong Li, Yi Cui, and Zain Shmasi have been more than friends to me. I have learned a great deal from working with them.

Finally, I would like to acknowledge my family: my mother and father, who worked so hard to give me and my brother the best possible environment to become good human beings and the opportunity to get the best out of us. My brother is the best in the world; although he is almost the same age as me, he has always been a fatherly figure to me. He has stood by me through thick and thin and continually pushed me toward my goal, even when I faltered. I want to mention that without these three persons, I would never have achieved this. Thank you for being my family.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
   1.1 Motivation

2. UNDERSTANDING LARGE-SCALE WEB CRAWLS
   2.1 Introduction
   2.2 Understanding Web Crawls
       2.2.1 Crawler Operation
       2.2.2 Duplicate Elimination
       2.2.3 Ranking and Admission Control
   2.3 IRLbot
       2.3.1 Duplicate Elimination
       2.3.2 Ranking
       2.3.3 Admission Control
       2.3.4 Software and Hardware
       2.3.5 Discussion
   2.4 Page-Level Analysis
       2.4.1 Admitted URLs
       2.4.2 Crawled URLs
       2.4.3 Downloaded URLs
       2.4.4 Links
       2.4.5 Summary
   2.5 Server-Level Analysis
       2.5.1 DNS and Robots
       2.5.2 Bandwidth
   2.6 Extrapolating Crawls
       2.6.1 Stochastic Model
       2.6.2 Data Extraction
       2.6.3 URLs
       2.6.4 Hosts and PLDs
   2.7 Internet-Wide Coverage
       2.7.1 Basic Properties
       2.7.2 Observations
       2.7.3 TLD Coverage
       2.7.4 More on Budgets
   2.8 Related Work
   2.9 Conclusion

3. RANDOMIZED DATA STREAMS
   3.1 Introduction
   3.2 Literature Review
   3.3 Randomized 1D Streams
       3.3.1 Terminology
       3.3.2 Stream Residuals
       3.3.3 A Few Words on Simulations
       3.3.4 Single Node
       3.3.5 Uniqueness Probability
       3.3.6 Set of Unique Nodes
   3.4 Randomized 2D Streams
       3.4.1 Terminology and Assumptions
       3.4.2 Degree of the Seen Set
       3.4.3 Destination Nodes
   3.5 Applications
       3.5.1 Least Recently Used Cache
       3.5.2 Random Replacement Cache
       3.5.3 MapReduce
       3.5.4 Frontier in Graph Traversal
   3.6 Conclusion

4. CLASSICAL EXTERNAL MEMORY ALGORITHMS
   4.1 Introduction
       4.1.1 Contributions
   4.2 Related Work
       4.2.1 MapReduce Performance
       4.2.2 Replacement Selection
   4.3 Random Stream Properties
       4.3.1 Terminology and Assumptions
       4.3.2 Uniqueness Probability
       4.3.3 Experimental Setup
   4.4 Basic Load-and-Sort
       4.4.1 Preliminaries
       4.4.2 Merge Sort
       4.4.3 Hash Tables
   4.5 Replacement Selection
       4.5.1 Traditional Two Queue Implementation
       4.5.2 Disk I/O in Replacement Selection
   4.6 Discussion
       4.6.1 Comparisons
       4.6.2 Multi-Core MapReduce
       4.6.3 Cluster MapReduce
   4.7 Conclusion

5. ADVANCED EXTERNAL MEMORY ALGORITHMS
   5.1 Introduction
   5.2 Related Work
       5.2.1 Caching
       5.2.2 High-Performance Sorting
   5.3 Preliminaries
       5.3.1 Finite Stream
   5.4 Incremental Hash Table
       5.4.1 Individual Run Analysis
       5.4.2 Partial Key Space Functions
       5.4.3 Key Density Functions
       5.4.4