University of Iceland
Faculty of Engineering
Department of Computer Science

Adaptive Revisiting with Heritrix

Master Thesis (30 credits/60 ECTS)
by Kristinn Sigurðsson
May 2005

Supervisors:
Helgi Þorbergsson, PhD
Þorsteinn Hallgrímsson

Útdráttur á íslensku (Abstract in Icelandic)

The World Wide Web holds an ever-growing share of the world's knowledge and cultural heritage. Because the Web is also constantly changing, vigorous work is now under way to preserve its contents as they stand at any given time. This work is an extension of the legal deposit laws that have, over recent centuries, helped to preserve printed material.

The first three chapters describe the fundamental difficulties of collecting the Web and introduce the Heritrix software, which was built to perform that task. The first chapter concentrates on the reasons for and background of this work, while chapters two and three turn the spotlight on more technical aspects.

The goal of the project was to develop a new technique for collecting a particular part of the Web that is considered to change rapidly and to be inherently interesting. The later chapters cover the definition of such a methodology and how it was implemented in Heritrix. Part of this discussion addresses how changes in documents can be detected.

Finally, initial experience with the new software is discussed, with a focus on the aspects that need further work or attention. Because the aim of the project was to lay the groundwork for such a methodology and to produce a simple and stable implementation, this part contains many ideas about what could be done better.

Keywords

Web crawling, web archiving, Heritrix, Internet, World Wide Web, legal deposit, electronic legal deposit.

Abstract

The World Wide Web contains an increasingly significant amount of the world’s knowledge and heritage. Since the Web is also in a constant state of change, significant efforts are now underway to capture and preserve its contents. These efforts extend the traditional legal deposit laws that have been aimed at preserving printed material over the last few centuries.
The first three chapters outline the fundamental challenges of collecting the Web and present the software, Heritrix, which has been designed to perform this task. The first chapter focuses on the reasons and history behind this endeavour, with chapters two and three focusing on more technical aspects.

The goal of this project was to develop a new way of collecting parts of the Web that are believed to change very rapidly and are considered of significant interest. The later chapters focus on defining such an incremental strategy, which we call an ‘adaptive revisiting strategy’, and on how it was implemented as a part of Heritrix. A part of this discussion is how to detect change in documents.

Finally, we discuss initial impressions of the new software and highlight areas that require further work or attention. As the goal of the project was primarily to establish the foundation for such incremental crawling and provide a simple and sturdy implementation, this section contains many thoughts on issues that could be improved on in the future.

Table of contents

TABLES
FIGURES
1. BACKGROUND
    1.1 WEB ARCHIVING
    1.2 LEGAL DEPOSIT
    1.3 ELECTRONIC LEGAL DEPOSIT LAWS
    1.4 COOPERATION
2. CRAWLING STRATEGIES
    2.1 TERMINOLOGY
3. HERITRIX
    3.1 CRAWL CONTROLLER
    3.2 TOE THREADS
    3.3 THE SETTINGS FRAMEWORK
        3.3.1 Context based settings
    3.4 THE WEB USER INTERFACE
        3.4.1 Jobs and profiles
        3.4.2 Logs and reports
    3.5 FRONTIERS
        3.5.1 HostQueuesFrontier
        3.5.2 BdbFrontier
        3.5.3 AbstractFrontier
        3.5.4 Making other Frontiers
    3.6 URIs, UURIs, CandidateURIs AND CrawlURIs
    3.7 THE PROCESSING CHAIN
    3.8 SCOPES
    3.9 FILTERS
4. THE OBJECTIVE
    4.1 LIMITING THE PROJECT
5. DEFINING AN ADAPTIVE REVISITING STRATEGY
    5.1 DETECTING CHANGE
6. INTEGRATION WITH HERITRIX
    6.1 CHANGES TO THE CrawlURI
    6.2 THE ADAPTIVE REVISITING FRONTIER
        6.2.1 AdaptiveRevisitHostQueue
        6.2.2 AdaptiveRevisitQueueList
        6.2.3 Synchronous Access
        6.2.4 Recovery
        6.2.4 Frontier features not implemented
        6.2.5 AbstractFrontier
    6.3 NEW PROCESSORS
        6.3.1 ChangeEvaluator
        6.3.2 WaitEvaluators
        6.3.3 HTTPContentDigest
    6.4 USING HTTP HEADERS
7. RESULTS
8. UNRESOLVED AND FUTURE ISSUES
9. ACKNOWLEDGEMENTS
REFERENCES

Tables

Table 1 Reliability and usefulness of datestamps and etags

Figures

Figure 1 The Frontier concept in crawling
Figure 2 Different emphasis of incremental and snapshot strategies
Figure 3 Heritrix’s basic architecture
Figure 4 Heritrix’s web user interface
Figure 5 Heritrix’s settings
Figure 6 CandidateURI and CrawlURI lifecycles
Figure 7 A typical processing chain
Figure 8 AdaptiveRevisitFrontier architecture
Figure 9 Frontier data flow
Figure 10 AdaptiveRevisitHostQueue databases
Figure 11 Fitting the AR processors into the processing chain
Figure 12 The UI settings for three WaitEvaluators
Figure 13 Modules setting with the ARFrontier set

1. Background

Since the World Wide Web's inception in the early '90s, it has grown at a phenomenal rate. The amount and diversity of content have rapidly increased and, almost from the very start, the only way to locate anything you didn't already have a link to was to use a search engine. It is fair to say