Large Multilingual Corpus


Charles University in Prague
Faculty of Mathematics and Physics

MASTER THESIS

Martin Majliš

Large Multilingual Corpus

Institute of Formal and Applied Linguistics
Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.
Study programme: Informatics
Specialization: Mathematical Linguistics

Prague 2011

I would like to thank my supervisor doc. Ing. Zdeněk Žabokrtský, Ph.D. for his advice and my parents for their support.

I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In Prague, August 5, 2011

Title (Czech original): Velký mnohojazyčný korpus
Author: Martin Majliš
Department: Institute of Formal and Applied Linguistics
Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.

Abstract (translated from the Czech): This thesis describes the W2C web corpus. The corpus contains 97 languages, with at least 10 million words for each of them; its total size is 10.5 billion words. Building such a corpus required solving a whole range of subproblems. At the beginning, a corpus covering 122 languages had to be assembled from Wikipedia, and a language recognizer was trained on it. A distributed system using 35 computers was implemented for downloading web pages, and duplicities were removed from the downloaded data. The resulting corpora were compared with each other using various statistics, such as average word and sentence length, conditional entropy and conditional perplexity.

Keywords: language corpus, distributed processing

Title: Large Multilingual Corpus
Author: Martin Majliš
Department: Institute of Formal and Applied Linguistics
Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.

Abstract: This thesis introduces the W2C Corpus, which contains 97 languages with more than 10 million words for each of these languages and has a total size of 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages and, finally, reducing duplicities. A comparative analysis of the texts of Wikipedia and the Internet is provided at the end of this thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy and perplexity.

Keywords: language corpus, distributed processing
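For reference, the conditional entropy and perplexity used in this comparative analysis are standard information-theoretic quantities; a minimal statement of their usual definitions follows, written here over characters for concreteness (the exact formulation used later in the thesis may differ). For a history h of preceding symbols and a next symbol c,

\[ H(C \mid H) = - \sum_{h} \sum_{c} P(h, c) \, \log_2 P(c \mid h), \]

and the corresponding conditional perplexity is

\[ PP = 2^{H(C \mid H)}. \]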
Contents

1 Introduction
  1.1 Problem Definition
  1.2 Motivation
  1.3 Thesis Organization
2 Literature Review
  2.1 Existing Languages
  2.2 Language Resources
  2.3 Multilingual Web Corpora
  2.4 Word Seeds
  2.5 URL Seeds
  2.6 Crawling
  2.7 Language Recognition
  2.8 Corpus Storing and Distribution
  2.9 Corpus Quality Analysis
  2.10 Internet Size
3 Methods
  3.1 Available Resources
  3.2 General Principles
  3.3 Metadata
  3.4 Database
  3.5 Wiki Corpus
  3.6 Language Recognition
  3.7 URL Seeds
  3.8 W2C Builder
  3.9 Distributed Corpus Building
  3.10 Duplicity Detection
  3.11 W2C Corpus
  3.12 Corpus Distribution
  3.13 Comparing Wiki vs Web
4 Results
  4.1 Wiki Corpus
  4.2 Language Recognition
  4.3 W2C Corpus
  4.4 Comparing Wiki vs Web
  4.5 Internet Size
5 Conclusions
A DVD Content
B List of Languages
C Wiki vs Web

1. Introduction

As statistical approaches become the dominant paradigm in natural language processing, there is an increasing demand for data. It is known that simple models trained on large amounts of data outperform sophisticated models trained on less data. The web contains huge amounts of linguistic data for many languages and has many undeniable advantages: (a) size: it is the largest text collection, containing billions of documents, and it is growing exponentially; (b) range: texts are available in many languages, styles and domains; (c) availability: most of the documents are available in machine-readable form, so no scanning or rewriting is necessary. One of the key issues for computational linguists is easy access to such data. These data are publicly available on the Internet, but ready-made corpora exist only for the major world languages, not for most of the others. Therefore, my aim is to collect, with minimal or no human intervention, at least ten million words for as many languages as possible.

1.1 Problem Definition

The goal of this thesis is to build a multilingual corpus of texts available on the Internet. This corpus will consist of at least 10 million words for as many languages as possible. The collected material will be analysed quantitatively and qualitatively, and conclusions about different languages will be drawn. The project consists of:

- A study of existing multilingual resources and the approaches used to construct them.
- A review of tools and methods used for solving particular tasks such as building initial corpora, crawling, language recognition and duplicity detection.
- A design for solving these particular tasks, as well as the main task, with respect to the amount of processed data.
- An implementation of tools and processes capable of taking advantage of a distributed environment.
- A quantitative and qualitative analysis of the collected material.
- Conclusions about the methods used, with an evaluation of their performance for different languages.
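One of the subtasks listed above, language recognition, is commonly handled with character n-gram models trained on per-language text such as Wikipedia articles. The sketch below illustrates that general technique only; it is not the recognizer implemented in this thesis, and the function names and toy training snippets are hypothetical.

from collections import Counter
import math

def char_ngrams(text, n=3):
    """Yield overlapping character n-grams, padding the text with spaces."""
    text = " " + text.lower() + " "
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

def train(samples, n=3):
    """samples maps a language code to training text; returns per-language
    n-gram counts together with the total number of n-grams seen."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text, n))
        models[lang] = (counts, sum(counts.values()))
    return models

def recognize(text, models, n=3):
    """Return the language whose model assigns the text the highest
    add-one-smoothed log-probability."""
    grams = list(char_ngrams(text, n))
    best_lang, best_score = None, float("-inf")
    for lang, (counts, total) in models.items():
        vocab = len(counts) + 1  # crude smoothing denominator for unseen n-grams
        score = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Toy usage with hypothetical snippets; a real recognizer would be trained
# on much larger per-language text, e.g. Wikipedia dumps.
models = train({
    "ces": "toto je kratky cesky text pro ukazku",
    "eng": "this is a short english text for the example",
})
print(recognize("another short english sentence", models))  # expected: eng

In practice the reliability of such a recognizer depends mainly on the amount and cleanliness of the per-language training text, which is why a Wikipedia-derived corpus is a convenient starting point.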
1.2 Motivation

There are many publicly available projects that try to collect multilingual textual resources. Some of them cover many languages but contain either very few documents or documents that are not in a machine-accessible form, so they cannot easily be used in computational linguistics. Other projects contain more data but cover very few languages. Therefore, it will be useful to construct a corpus that overcomes these disadvantages.

When this data becomes available, it will be possible to use it for comparative analysis of related languages and for building language models for various applications such as machine translation, speech recognition, spell checking, etc. To achieve the main goal, many subtasks have to be solved, such as recognizing languages or downloading millions of web pages. When all this data is collected, it will be possible to use it for further improvements. Apart from these objective motivations, there are also my personal motivations: working on this project gives me a chance to gain insight, knowledge and hands-on experience in processing massive amounts of data.

1.3 Thesis Organization

The work is divided into five chapters, beginning with the introductory Chapter 1, which contains the problem definition and motivation. Chapter 2 gives an overview of existing methods and techniques: it briefly introduces existing multilingual resources and multilingual corpora, as well as the methods used for their construction, and it presents methods for solving particular steps. Chapter 3 presents the requirements for the complete system and the available computational resources, and introduces the implemented tools and how to use them effectively. Chapter 4 shows the results achieved in language recognition and the size of the constructed corpus; a quantitative and qualitative analysis of the corpus is included. Chapter 5 discusses the results and the areas where the methods and implementation could be improved, and suggests goals for future work. Three appendices are included: Appendix A describes the content of the DVD, Appendix B contains the list of languages covered by the collected corpus with their ISO 639-3 codes, and Appendix C presents the differences between the Wiki Corpus and the W2C Corpus.

2. Literature Review

This chapter reviews existing tools, methods and approaches. It opens by presenting statistics about existing languages, followed by an introduction of existing multilingual projects and multilingual web corpora. The end of the chapter gives an overview of the methods used for crawling, text extraction, language recognition, and corpus storing and distribution.

2.1 Existing Languages

There are 6,909 known living languages according to the Ethnologue database, but only about 390 of them are used by more than 1 million native speakers, while 172 of them have more than 3 million speakers. The detailed distribution of languages and speakers is shown in Table 2.1 and Figure 2.1. These numbers must be