Adaptive Revisiting with Heritrix Master Thesis (30 Credits/60 ECTS)

Total Page:16

File Type:pdf, Size:1020Kb

Adaptive Revisiting with Heritrix Master Thesis (30 Credits/60 ECTS) University of Iceland Faculty of Engineering Department of Computer Science Adaptive Revisiting with Heritrix Master Thesis (30 credits/60 ECTS) by Kristinn Sigurðsson May 2005 Supervisors: Helgi Þorbergsson, PhD Þorsteinn Hallgrímsson Útdráttur á íslensku Veraldarvefurinn geymir sívaxandi hluta af þekkingu og menningararfi heimsins. Þar sem Vefurinn er einnig sífellt að breytast þá er nú unnið ötullega að því að varðveita innihald hans á hverjum tíma. Þessi vinna er framlenging á skylduskila lögum sem hafa í síðustu aldir stuðlað að því að varðveita prentað efni. Fyrstu þrír kaflarnir lýsa grundvallar erfiðleikum við það að safna Vefnum og kynnir hugbúnaðinn Heritrix, sem var smíðaður til að vinna það verk. Fyrsti kaflinn einbeitir sér að ástæðunum og bakgrunni þessarar vinnu en kaflar tvö og þrjú beina kastljósinu að tæknilegri þáttum. Markmið verkefnisins var að þróa nýja tækni til að safna ákveðnum hluta af Vefnum sem er álitinn breytast ört og vera í eðli sínu áhugaverður. Seinni kaflar fjalla um skilgreininu á slíkri aðferðafræði og hvernig hún var útfærð í Heritrix. Hluti þessarar umfjöllunar beinist að því hvernig greina má breytingar í skjölum. Að lokum er fjallað um fyrstu reynslu af nýja hugbúnaðinum og sjónum er beint að þeim þáttum sem þarfnast frekari vinnu eða athygli. Þar sem markmiðið með verkefninu var að leggja grunnlínur fyrir svona aðferðafræði og útbúa einfalda og stöðuga útfærsla þá inniheldur þessi hluti margar hugmyndir um hvað mætti gera betur. Keywords Web crawling, web archiving, Heritrix, Internet, World Wide Web, legal deposit, electronic legal deposit. i Abstract The World Wide Web contains an increasingly significant amount of the world’s knowledge and heritage. Since the Web is also in a constant state of change significant efforts are now underway to capture and preserve its contents. These efforts extend the traditional legal deposit laws that have been aimed at preserving printed material over the last centuries. The first three chapters outline the fundamental challenges for collecting the Web and present the software, Heritrix, which has been designed to perform this task. The first chapter focuses on the reasons and history behind this endeavour, with chapters two and three focusing on more technical aspects. The goal of this project was to develop a new way of collecting parts of the Web that are believed to change very rapidly and are considered of significant interest. The later chapters focus on defining such an incremental strategy, which we call an ‘adaptive revisting strategy’ and how it was implemented as a part of Heritrix. A part of this discussion is how to detect change in documents. Finally we discuss initial impressions of the new software and highlight areas that require further work or attention. As the goal of the project was primarily to establish the foundation for such incremental crawling and provide a simple and sturdy implementation, this section contains many thoughts on issues that could be improved on in the future. ii Table of contents TABLES................................................................................................... V FIGURES ................................................................................................. V 1. BACKGROUND................................................................................. 1 1.1 WEB ARCHIVING ........................................................................................1 1.2 LEGAL DEPOSIT .........................................................................................2 1.3 ELECTRONIC LEGAL DEPOSIT LAWS ..........................................................2 1.4 COOPERATION ...........................................................................................3 2. CRAWLING STRATEGIES ............................................................ 5 2.1 TERMINOLOGY ........................................................................................11 3. HERITRIX........................................................................................ 13 3.1 CRAWL CONTROLLER ..............................................................................14 3.2 TOE THREADS ..........................................................................................15 3.3 THE SETTINGS FRAMEWORK ...................................................................16 3.3.1 Context based settings ................................................................................ 17 3.4 THE WEB USER INTERFACE ....................................................................19 3.4.1 Jobs and profiles......................................................................................... 20 3.4.2 Logs and reports......................................................................................... 22 3.5 FRONTIERS ..............................................................................................23 3.5.1 HostQueuesFrontier................................................................................... 27 3.5.2 BdbFrontier................................................................................................ 28 3.5.3 AbstractFrontier......................................................................................... 29 3.5.4 Making other Frontiers .............................................................................. 30 3.6 URI S, UURI S, CANDIDATE URI S AND CRAWL URI S.................................30 3.7 THE PROCESSING CHAIN ..........................................................................33 3.8 SCOPES ....................................................................................................36 3.9 FILTERS ...................................................................................................37 4. THE OBJECTIVE ........................................................................... 39 4.1 LIMITING THE PROJECT ............................................................................41 5 DEFINING AN ADAPTIVE REVISITING STRATEGY ............ 43 5.1 DETECTING CHANGE ...............................................................................47 6. INTEGRATION WITH HERITRIX ............................................. 52 6.1 CHANGES TO THE CRAWL URI ..................................................................54 6.2 THE ADAPTIVE REVISITING FRONTIER .....................................................55 6.2.1 AdaptiveRevisitHostQueue ......................................................................... 65 6.2.2 AdaptiveRevisitQueueList........................................................................... 70 iii 6.2.3 Synchronous Access.................................................................................... 70 6.2.4 Recovery..................................................................................................... 71 6.2.4 Frontier features not implemented ............................................................. 73 6.2.5 AbstractFrontier......................................................................................... 74 6.3 NEW PROCESSORS ...................................................................................76 6.3.1 ChangeEvaluator........................................................................................ 77 6.3.2 WaitEvaluators........................................................................................... 79 6.3.3 HTTPContentDigest ................................................................................... 82 6.4 USING HTTP HEADERS ...........................................................................83 7. RESULTS.......................................................................................... 85 8. UNRESOLVED AND FUTURE ISSUES ...................................... 89 9. ACKNOWLEDGEMENTS............................................................. 94 REFERENCES....................................................................................... 95 iv Tables Table 1 Reliability and usefulness of datestamps and etags ................... 50 Figures Figure 1 The Frontier concept in crawling ............................................... 6 Figure 2 Different emphasis of incremental and snapshot strategies ..... 10 Figure 3 Heritrix’s basic architecture ..................................................... 14 Figure 4 Heritrix’s web user interface .................................................... 19 Figure 5 Heritrix’s settings ..................................................................... 21 Figure 6 CandidateURI and CrawlURI lifecycles .................................. 32 Figure 7 A typical processing chain ....................................................... 35 Figure 8 AdaptiveRevisitFrontier architecture ....................................... 58 Figure 9 Frontier data flow ..................................................................... 61 Figure 10 AdaptiveRevisitHostQueue databases .................................... 69 Figure 11 Fitting the AR processors into the processing chain .............. 77 Figure 12 The UI settings for three WaitEvaluators ............................... 82 Figure 13 Modules setting with the ARFrontier set ............................... 88 v 1. Background Since the World Wide Web's inception in the early '90s it has grown at a phenomenal rate. The amount and diversity of content has rapidly increased and almost from the very start, the only way to locate anything you didn't already have a link to was to use a search engine. It is fair to say
Recommended publications
  • “The Future Is Open” for Composition Studies: a New
    THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES “THE FUTURE IS OPEN” FOR COMPOSITION STUDIES: A NEW INTELLECTUAL PROPERTY MODEL IN THE DIGITAL AGE By CHARLES LOWE A Dissertation submitted to the Department of English in partial fulfillment of the requirements for the degree of Doctor of Philosophy Degree Awarded Summer Semester, 2006 Copyright © 2006 Charles Lowe All Rights Reserved The members of the committee approve the dissertation of Charles Lowe defended on May 25, 2006. ______________________________ John Fenstermaker Professor Directing Dissertation ______________________________ Ernest Rehder Outside Committee Member ______________________________ Eric Walker Committee Member ______________________________ Deborah Coxwell-Teague Committee Member Approved: ______________________________ Hunt Hawkins, Chair, Department of English The Office of Graduate Studies has verified and approved the above named committee members. ii This text is dedicated to Wendy Bishop, John Lovas, Candace Spigelman, and Richard Straub, four teachers and researchers in the field of composition studies with whom it was my pleasure to work. I only wish I could have the opportunity again. iii TABLE OF CONTENTS ABSTRACT...............................................................................................................................vi INTRODUCTION.......................................................................................................................1 The Future Is Open.............................................................................................................
    [Show full text]
  • Harvesting Strategies for a National Domain France Lasfargues, Clément Oury, Bert Wendland
    Legal deposit of the French Web: harvesting strategies for a national domain France Lasfargues, Clément Oury, Bert Wendland To cite this version: France Lasfargues, Clément Oury, Bert Wendland. Legal deposit of the French Web: harvesting strategies for a national domain. International Web Archiving Workshop, Sep 2008, Aarhus, Denmark. hal-01098538 HAL Id: hal-01098538 https://hal-bnf.archives-ouvertes.fr/hal-01098538 Submitted on 26 Dec 2014 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License Legal deposit of the French Web: harvesting strategies for a national domain France Lasfargues, Clément Oury, and Bert Wendland Bibliothèque nationale de France Quai François Mauriac 75706 Paris Cedex 13 {france.lasfargues, clement.oury, bert.wendland}@bnf.fr ABSTRACT 1. THE FRENCH CONTEXT According to French Copyright Law voted on August 1st, 2006, the Bibliothèque nationale de France (“BnF”, or “the Library”) is 1.1 Defining the scope of the legal deposit in charge of collecting and preserving the French Internet. The On August 1st, 2006, a new Copyright law was voted by the Library has established a “mixed model” of Web archiving, which French Parliament.
    [Show full text]
  • The Internet Archive: an Interview with Brewster Kahle Brewster Kahle and Ana Parejo Vadillo
    The Internet Archive: An Interview with Brewster Kahle Brewster Kahle and Ana Parejo Vadillo Rumour has it that one of the candidates for Librarian of Congress is Brewster Kahle, the founder and director of the non-profit digital library Internet Archive.1 That he may be considered for the post is a testament to Kahle’s commitment to mass digitization, the cornerstone of modern librarianship. A visionary of the digital preservation of knowledge and an outspo- ken advocate of the open access movement (the memorial for the Internet activist Aaron Swartz was held at the Internet Archive’s headquarters in San Francisco), Kahle has been part of the many ventures that have created our cyber age. At MIT, he was on the project team of Thinking Machines, a precursor of the World Wide Web. In 1989 he created WAIS (Wide Area Information Server), the first electronic publishing system, which was designed to search and make information available. He left Thinking Machines to focus on his newly founded company, WAIS, Inc., which was sold to AOL two years later for a reported $15 million. In 1996 he co- founded Alexa Internet, which was built on the principles of collecting Web traffic data and analysis.2 The company was named after the Library of Alexandria, the largest repository of knowledge in the ancient world, to highlight the potential of the Internet to become such a custodian. It was sold for c. $250 million in stock to Amazon, which uses it for data mining. Alongside Alexa Internet, in 1996 Kahle founded the Internet Archive to archive Web culture (Fig.
    [Show full text]
  • The BRIDGE Linking Engin Ee Ring and Soci E T Y
    Spring 2010 THE ELECTRICITY GRID The BRIDGE LINKING ENGIN ee RING AND SOCI E TY The Impact of Renewable Resources on the Performance and Reliability of the Electricity Grid Vijay Vittal Securing the Electricity Grid S. Massoud Amin New Products and Services for the Electric Power Industry Clark W. Gellings Energy Independence: Can the U.S. Finally Get It Right? John F. Caskey Educating the Workforce for the Modern Electric Power System: University–Industry Collaboration B. Don Russell The Smart Grid: A Bridge between Emerging Technologies, Society, and the Environment Richard E. Schuler Promoting the technological welfare of the nation by marshalling the knowledge and insights of eminent members of the engineering profession. The BRIDGE NatiOnaL AcaDemY OF Engineering Irwin M. Jacobs, Chair Charles M. Vest, President Maxine L. Savitz, Vice President Thomas F. Budinger, Home Secretary George Bugliarello, Foreign Secretary C.D. (Dan) Mote Jr., Treasurer Editor in Chief (interim): George Bugliarello Managing Editor: Carol R. Arenberg Production Assistant: Penelope Gibbs The Bridge (ISSN 0737-6278) is published quarterly by the National Aca- demy of Engineering, 2101 Constitution Avenue, N.W., Washington, DC 20418. Periodicals postage paid at Washington, DC. Vol. 40, No. 1, Spring 2010 Postmaster: Send address changes to The Bridge, 2101 Constitution Avenue, N.W., Washington, DC 20418. Papers are presented in The Bridge on the basis of general interest and time- liness. They reflect the views of the authors and not necessarily the position of the National Academy of Engineering. The Bridge is printed on recycled paper. © 2010 by the National Academy of Sciences. All rights reserved.
    [Show full text]
  • The Culture of Wikipedia
    Good Faith Collaboration: The Culture of Wikipedia Good Faith Collaboration The Culture of Wikipedia Joseph Michael Reagle Jr. Foreword by Lawrence Lessig The MIT Press, Cambridge, MA. Web edition, Copyright © 2011 by Joseph Michael Reagle Jr. CC-NC-SA 3.0 Purchase at Amazon.com | Barnes and Noble | IndieBound | MIT Press Wikipedia's style of collaborative production has been lauded, lambasted, and satirized. Despite unease over its implications for the character (and quality) of knowledge, Wikipedia has brought us closer than ever to a realization of the centuries-old Author Bio & Research Blog pursuit of a universal encyclopedia. Good Faith Collaboration: The Culture of Wikipedia is a rich ethnographic portrayal of Wikipedia's historical roots, collaborative culture, and much debated legacy. Foreword Preface to the Web Edition Praise for Good Faith Collaboration Preface Extended Table of Contents "Reagle offers a compelling case that Wikipedia's most fascinating and unprecedented aspect isn't the encyclopedia itself — rather, it's the collaborative culture that underpins it: brawling, self-reflexive, funny, serious, and full-tilt committed to the 1. Nazis and Norms project, even if it means setting aside personal differences. Reagle's position as a scholar and a member of the community 2. The Pursuit of the Universal makes him uniquely situated to describe this culture." —Cory Doctorow , Boing Boing Encyclopedia "Reagle provides ample data regarding the everyday practices and cultural norms of the community which collaborates to 3. Good Faith Collaboration produce Wikipedia. His rich research and nuanced appreciation of the complexities of cultural digital media research are 4. The Puzzle of Openness well presented.
    [Show full text]
  • SLA Silicon Valley
    Home Discussion List About Us » Leadership » Events » Members » Career » Documents » Sponsorship » Site Map Archive | Chapter Events RSS feed for this section 2017 Holiday Party Posted on November 14, 2017. The holidays are quickly approaching, which means it’s time for our annual holiday party! This year, we are pulling out all the stops, so be sure to join us for an evening of fun and festivities. Ticket price includes dinner, dessert, and two drink tickets. When: 5:30-8pm, Tuesday, Dec 12 Where: Xanh, a modern Vietnamese restaurant located at 110 Castro St, in Mountain View. How: Pre-pay for admission here: Event has concluded We look forward to seeing you then! Posted in Chapter Events, EventsComments Off on 2017 Holiday Party SLA SF/SLA SV Joint Dinner Program Posted on August 24, 2017. The San Francisco Bay Region and the Silicon Valley Chapters jointly present a dinner program… Developing Your Cultural Intelligence In The Workplace: What It Is and Why It Matters Presented by Dr. Michele A. L. Villagran, President and CEO of CulturalCo, LLC. Our workplaces are becoming more diverse than ever with a range of cultures, including ethic, national, generational, and organizational. Do you want to learn how to develop and apply cultural intelligence at your organization? How can you improve your effectiveness when working with culturally diverse colleagues and clients? Dr. Villagran will share with us how we can use Cultural Intelligence to address these concerns. Tuesday, September 19, 2017 Fattoria e Mare 1095 Rollins Road Burlingame, CA 94010 Schedule: 5:30pm Check in/networking 6:00pm Dinner 6:45pm Introductions 7:00pm Speaker presentation 8:30pm Closing remarks Registration Cost: $30 SLA Member $40 Non- SLA Member $20 Student/Retired/Unemployed ————————————————— Please RSVP to Heather Heen ([email protected]) by September 15th, by sending her your name, email address, and company affiliation.
    [Show full text]
  • The Federal Computer Commission, 84 N.C
    NORTH CAROLINA LAW REVIEW Volume 84 | Number 1 Article 3 12-1-2005 The edeF ral Computer Commission Kevin Werbach Follow this and additional works at: http://scholarship.law.unc.edu/nclr Part of the Law Commons Recommended Citation Kevin Werbach, The Federal Computer Commission, 84 N.C. L. Rev. 1 (2005). Available at: http://scholarship.law.unc.edu/nclr/vol84/iss1/3 This Article is brought to you for free and open access by Carolina Law Scholarship Repository. It has been accepted for inclusion in North Carolina Law Review by an authorized administrator of Carolina Law Scholarship Repository. For more information, please contact [email protected]. THE FEDERAL COMPUTER COMMISSION KEVIN WERBACH* The conventional wisdom that the computer industry thrives in the absence of government regulation is wrong. Federal Communications Commission ("FCC") rules touch every personal computer ever made. Over the last quarter-century, the FCC has steadily increasedits influence over personalcomputing devices and applications. Perhaps surprisingly, though, the "Federal Computer Commission" has largely been a positive force in the technology sector. Regulators are now poised to take several actions that could shape the future of the Internet and the computer industry. In this environment, exposing the Federal Computer Commission provides a foundation for reasoned policy approaches. The fate of a dynamic and important set of industries should not be decided under the influence of a myth. INTRODU CTION ....................................................................................... 2 I. FEAR AND LOATHING .............................................................. 8 II. FCC COMPUTER REGULATION: PAST AND PRESENT ............. 14 A. Computers Invade the Phone Network ............................ 16 1. The Battle Over Terminal Attachments .................... 16 2. The Computer Inquiries and Open Network A rchitecture ................................................................
    [Show full text]
  • Tentechnologiesforpubliclibrar
    10 TECHNOLOGY RESOURCES FOR PUBLIC LIBRARIANS Kathryn Brockmeier, Jodi Rethmeier, Lisa Sewell, and Kirsten Yates School of Information Science & Technologies, University of Missouri Emerging Technologies, 9410 Seminar in Information Science and Learning Technology, Fall 2015 This publication is meant to be inspiration for public library staff who are looking for ways to bring technology tools to their staff and patrons. We have attempted to compile some of the most accessible technologies that can be used with a variety of budgets and levels of experience. On the Ground – For Library Users Ancestry.com Library Edition http://www.ancestry.com/cs/us/institution#library-edition According to GenealogyInTime Magazine, it’s possible that one out of forty members of a family is researching family history (“How Popular Is Genealogy?” n.d.). That totals 7.9 million Americans. Ancestry.com is an online family history research collection of billions of digitized and indexed historical records. This resource provides access to millions of historical photos, billions of historical documents, plus local narratives, oral histories, indexes and other resources in over 30,000 databases that span from the 1500s to the 2000s. Ancestry Library Edition, distributed exclusively by ProQuest, is the institutional subscription available to libraries. If one- fortieth of your local population could use Ancestry.com, after cost-analysis, it may be worth the price of a subscription. Canva https://www.canva.com/ If one doesn’t have access to expensive graphic design software, Canva is a freemium application that tailors the user experience by work, personal, or education, to create stunning posters, social media graphic posts, cards, letterheads, wallpapers, flyers, invitations, and more.
    [Show full text]
  • Louis Round Wilson Academy Formed Inaugural Meeting Held in Chapel Hill
    $1.5 million bequest to benefit SILS technology Inside this Issue Dean’s Message ....................................... 2 Dr. William H. and Vonna K. Graves have pledged a gift of $1.5 Faculty News ............................................. 8 million to the School of Information and Library Science (SILS). The Honor Roll of Donors ........................... 13 bequest, SILS’ largest to date, is intended to enhance the School’s Student News ..........................................18 technology programs and services. See page 3. Alumni News ...........................................23 SCHOOL OF INFORMATION AND LIBRARY SCIENCE @ The SCHOOL of INFORMATION and LIBRARY SCIENCE • TheCarolina UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Spring 2006 http://sils.unc.edu Number 67 Louis Round Wilson Academy Formed Inaugural meeting held in Chapel Hill Citizens around “Our faculty, and the world are becoming the faculty of every more aware that they leading University often need a trusted in the world, real- guide to help sort and izes that the role of substantiate the infor- the 21st and 22nd mation they require. century knowledge Faculty members at the professional must be School of Information carefully shaped,” and Library Science said Dr. José-Marie (SILS) agree that Griffiths, dean leading institutions are of SILS and the obliged to review and Lippert Photography Photo by Tom founding chair of design anew roles and Members of the Louis Round Wilson Academy and the University of North Carolina at Chapel Hill’s School the Louis Round models for Knowledge Pro- of Information and Library Science faculty following the formal induction ceremony in the rotunda of Wilson Academy. the Rare Books Room of the Louis Round Wilson Library.
    [Show full text]
  • Web Archiving Environmental Scan
    Web Archiving Environmental Scan The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Truman, Gail. 2016. Web Archiving Environmental Scan. Harvard Library Report. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:25658314 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Web Archiving Environmental Scan Harvard Library Report January 2016 Prepared by Gail Truman The Harvard Library Report “Web Archiving Environmental Scan” is licensed under a Creative Commons Attribution 4.0 International License. Prepared by Gail Truman, Truman Technologies Reviewed by Andrea Goethals, Harvard Library and Abigail Bordeaux, Library Technology Services, Harvard University Revised by Andrea Goethals in July 2017 to correct the number of dedicated web archiving staff at the Danish Royal Library This report was produced with the generous support of the Arcadia Fund. Citation: Truman, Gail. 2016. Web Archiving Environmental Scan. Harvard Library Report. Table of Contents Executive Summary ............................................................................................................................ 3 Introduction ......................................................................................................................................
    [Show full text]
  • User Manual [Pdf]
    Heritrix User Manual Internet Archive Kristinn Sigur#sson Michael Stack Igor Ranitovic Table of Contents 1. Introduction ............................................................................................................ 1 2. Installing and running Heritrix .................................................................................... 2 2.1. Obtaining and installing Heritrix ...................................................................... 2 2.2. Running Heritrix ........................................................................................... 3 2.3. Security Considerations .................................................................................. 7 3. Web based user interface ........................................................................................... 7 4. A quick guide to running your first crawl job ................................................................ 8 5. Creating jobs and profiles .......................................................................................... 9 5.1. Crawl job .....................................................................................................9 5.2. Profile ....................................................................................................... 10 6. Configuring jobs and profiles ................................................................................... 11 6.1. Modules (Scope, Frontier, and Processors) ....................................................... 12 6.2. Submodules ..............................................................................................
    [Show full text]
  • Web Archiving for Academic Institutions
    University of San Diego Digital USD Digital Initiatives Symposium Apr 23rd, 1:00 PM - 4:00 PM Web Archiving for Academic Institutions Lori Donovan Internet Archive Mary Haberle Internet Archive Follow this and additional works at: https://digital.sandiego.edu/symposium Donovan, Lori and Haberle, Mary, "Web Archiving for Academic Institutions" (2018). Digital Initiatives Symposium. 4. https://digital.sandiego.edu/symposium/2018/2018/4 This Workshop is brought to you for free and open access by Digital USD. It has been accepted for inclusion in Digital Initiatives Symposium by an authorized administrator of Digital USD. For more information, please contact [email protected]. Web Archiving for Academic Institutions Presenter 1 Title Senior Program Manager, Archive-It Presenter 2 Title Web Archivist Session Type Workshop Abstract With the advent of the internet, content that institutional archivists once preserved in physical formats is now web-based, and new avenues for information sharing, interaction and record-keeping are fundamentally changing how the history of the 21st century will be studied. Due to the transient nature of web content, much of this information is at risk. This half-day workshop will cover the basics of web archiving, help attendees identify content of interest to them and their communities, and give them an opportunity to interact with tools that assist with the capture and preservation of web content. Attendees will gain hands-on web archiving skills, insights into selection and collecting policies for web archives and how to apply what they've learned in the workshop to their own organizations. Location KIPJ Room B Comments Lori Donovan works with partners and the Internet Archive’s web archivists and engineering team to develop the Archive-It service so that it meets the needs of memory institutions.
    [Show full text]