Charles University in Prague

Faculty of Mathematics and Physics

MASTER THESIS

Martin Majliš

Large Multilingual Corpus

Institute of Formal and Applied Linguistics

Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.

Study programme: Informatics

Specialization: Mathematical Linguistics

Prague 2011

I would like to thank my supervisor doc. Ing. Zdeněk Žabokrtský, Ph.D. for his advice and my parents for their support.

I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act.

In Prague, August 5, 2011 ......

Title: Velký mnohojazyčný korpus (Large Multilingual Corpus)

Author: Martin Majliš

Department: Institute of Formal and Applied Linguistics

Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.

Abstract: This thesis describes the W2C web corpus. The corpus contains 97 languages, with at least 10 million words for each of them. Its total size is 10.5 billion words. In order to build such a corpus, a whole range of subproblems had to be solved. At the beginning, a corpus covering 122 languages had to be built from Wikipedia, on which a language recognizer was trained. A distributed system using 35 computers was implemented for downloading web pages. Duplicities were removed from the downloaded data. The resulting corpora were compared with each other using various statistics, such as average word and sentence lengths, conditional entropy and conditional perplexity.

Keywords: language corpus, distributed processing

Title: Large Multilingual Corpus

Author: Martin Majliš

Department: Institute of Formal and Applied Linguistics

Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D.

Abstract:

This thesis introduces the W2C Corpus, which contains 97 languages with more than 10 million words for each of these languages and a total size of 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing web pages and, finally, the reduction of duplicities. A comparative analysis of the texts from Wikipedia and the Internet is provided at the end of this thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy and perplexity.

Keywords: language corpus, distributed processing

Contents

1 Introduction
1.1 Problem Definition
1.2 Motivation
1.3 Thesis Organization

2 Literature Review
2.1 Existing Languages
2.2 Language Resources
2.3 Multilingual Web Corpora
2.4 Word Seeds
2.5 URL Seeds
2.6 Crawling
2.7 Language Recognition
2.8 Corpus Storing and Distribution
2.9 Corpus Quality Analysis
2.10 Internet Size

3 Methods
3.1 Available Resources
3.2 General Principles
3.3 Metadata
3.4 Database
3.5 Wiki Corpus
3.6 Language Recognition
3.7 URL Seeds
3.8 W2C Builder
3.9 Distributed Corpus Building
3.10 Duplicity Detection
3.11 W2C Corpus
3.12 Corpus Distribution
3.13 Comparing Wiki vs Web

4 Results
4.1 Wiki Corpus
4.2 Language Recognition
4.3 W2C Corpus
4.4 Comparing Wiki vs Web
4.5 Internet Size

5 Conclusions

A DVD Content
B List of Languages
C Wiki vs Web

1. Introduction

As statistical approaches become the dominant paradigm in natural language processing, there is an increasing demand for data. It is known that simple models trained on a lot of data outperform sophisticated models based on less data. The web contains huge amounts of linguistic data for many languages. The web has many undeniable advantages: (a) size — it is the largest text collection, containing billions of documents, and its size is growing exponentially, (b) range — texts are available in many languages, styles and domains, (c) availability — most of the documents are available in machine-readable form, so no scanning or rewriting is necessary.

One of the key issues for computational linguists is easy access to such data. These data are publicly available on the Internet, but ready-made corpora exist only for the major world languages, not for most of the other languages.

Therefore, my aim is to collect, with minimal or no human intervention, at least ten million words for as many languages as possible.

1.1 Problem Definition

The goal of this thesis is to build a multilingual corpus of texts available on the Internet. This corpus will consist of at least 10 million words for as many languages as possible. The collected material will be quantitatively and qualitatively analysed, and conclusions about different languages will be made.

The project consists of:

• A study of existing multilingual resources and approaches used to construct them.

• A review of tools and methods used for solving particular tasks such as building initial corpora, crawling, language recognition and duplicity detection.

• A design for solving these particular tasks, as well as the main tasks, with respect to the amount of processed data.


• An implementation of tools and processes capable of taking advantage of a distributed environment.

• Quantitative and qualitative analyses of the collected material.

• Conclusions about the methods used, with an evaluation of their performance for different languages.

1.2 Motivation

There are many publicly available projects that are trying to collect multilingual textual resources. Some of them cover many languages but contain either very few documents, or documents that are not in a computer-accessible form, so they cannot be easily used in computational linguistics. Other projects contain more data, but are available in very few languages. Therefore, it will be useful to construct a corpus that overcomes these disadvantages. When this data becomes available, it will be possible to use it for comparative analysis of related languages and for building language models for various applications such as machine translation, speech recognition, spell checking, etc. To achieve the main goal, many subtasks have to be solved, such as recognizing languages or downloading millions of web pages. When all this data is collected, it will be possible to use it for further improvements.

Apart from these objective motivations, there are also my personal motivations. Working on this project gives me a chance to gain insight, knowledge and hands-on experience in processing massive amounts of data.

1.3 Thesis Organization

The work is divided into five chapters, beginning with the introductory Chapter 1, containing the problem definition and motivation. Chapter 2 gives an overview of existing methods and techniques. It briefly introduces existing multilingual resources and multilingual corpora, as well as the methods used for their construction. It also presents methods for solving particular steps. Chapter 3 presents requirements for the complete system and the available computational resources. It also introduces the implemented tools and methods for using them effectively. Chapter 4 shows the achieved results in language recognition and the size of the constructed corpus. Quantitative and qualitative analyses of the corpus are included.

Chapter 5 discusses the results and areas where the methods and implementation could be improved. It also suggests goals for future work.

Three appendices are included: Appendix A describes the content of the DVD. Appendix B contains the list of languages covered by the collected corpus with their ISO 639-3 codes. Appendix C presents differences between the Wiki Corpus and the W2C Corpus.

2. Literature Review

This chapter reviews existing tools, methods and approaches. It opens by presenting statistics about existing languages, followed by an introduction of existing multilingual projects and multilingual web corpora. The end of this chapter contains an overview of methods used for crawling, text extraction, language recognition, and corpus storing and distribution.

2.1 Existing Languages

There are 6,909 known living languages according to the Ethnologue database1, but only about 390 of them are used by more than 1 million native speakers2, while 172 of them have more than 3 million speakers.

The detailed distribution of languages and speakers is shown in Table 2.1 and Figure 2.1. These numbers must be treated with caution, because they are slightly out of date. The total population according to this table is about 6 billion, which was the figure in 19993.

According to Wikipedia, there are 116 official languages4.

2.2 Language Resources

There are many projects aiming to collect materials in as many languages as possible, because it is predicted that fifty percent of the world's languages will disappear in the next century5.

The following projects are reviewed:

• The Rosetta Project (2.2.1)

• The Open Language Archives Community (2.2.2)

• Wikipedia (2.2.3)

1http://www.ethnologue.com/web.asp 2http://www.ethnologue.com/ethno_docs/distribution.asp?by=size 3http://www.census.gov/population/international/data/idb/worldpopgraph.php 4http://en.wikipedia.org/wiki/List_of_official_languages 5http://www.unesco.org/new/en/culture/themes/cultural-diversity/languages-and-multili


Population range | Living languages: Count / Percent / Cumulative | Number of speakers: Count / Percent / Cumulative
100,000,000 to infinity | 8 / 0.1 / 0.1% | 2,308,548,848 / 38.73721 / 38.73721%
10,000,000 to 99,999,999 | 77 / 1.1 / 1.2% | 2,346,900,757 / 39.38076 / 78.11797%
1,000,000 to 9,999,999 | 304 / 4.4 / 5.6% | 951,916,458 / 15.97306 / 94.09103%
100,000 to 999,999 | 895 / 13.0 / 18.6% | 283,116,716 / 4.75067 / 98.84170%
10,000 to 99,999 | 1,824 / 26.4 / 45.0% | 60,780,797 / 1.01990 / 99.86160%
1,000 to 9,999 | 2,014 / 29.2 / 74.1% | 7,773,810 / 0.13044 / 99.99204%
100 to 999 | 1,038 / 15.0 / 89.2% | 461,250 / 0.00774 / 99.99978%
10 to 99 | 339 / 4.9 / 94.1% | 12,560 / 0.00021 / 99.99999%
1 to 9 | 133 / 1.9 / 96.0% | 521 / 0.00001 / 100.00000%
Unknown | 277 / 4.0 / 100.0% |
Total | 6,909 / 100.0 | 5,959,511,717 / 100.00000
Table 2.1: Distribution of languages by number of first-language speakers

[Figure 2.1: Distribution of languages by number of first-language speakers. Panels: (a) All Languages, (b) Top 300 Languages; number of speakers (logarithmic scale) per language.]


• The Universal Declaration of Human Rights (2.2.4)

• The Project Gutenberg (2.2.5)

• The Wikisource (2.2.6)

• The Watchtower (2.2.7)

• Urbi et Orbi (2.2.8)

• Open-source Software (2.2.9)

2.2.1 Rosetta Project

The Rosetta Project6 is a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of material on all known human languages. The collection currently contains nearly 100,000 pages of material spanning over 2,500 languages, as well as a growing multimedia collection of modern and historical language recordings.

This material is publicly available on the website7. Most of the languages are covered very briefly. For example, for Yami8, with three thousand speakers, there is only a dictionary9. For Czech10, with twelve million speakers, there is only a dictionary and the Universal Declaration of Human Rights11.

The 300 Languages Project12 is Rosetta's sub-project with the specific goal of compiling a universal collection of the 300 most widely spoken languages. This collection will contain parallel texts and recordings.

2.2.2 Open Language Archives Community

The Open Language Archives Community13 (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Their language coverage is presented in Table 2.2.

6http://rosettaproject.org/ and http://www.archive.org/details/rosettaproject 7http://www.archive.org/browse.php?field=subject&mediatype=texts&collection=rosetta 8http://www.ethnologue.com/show_language.asp?code=tao 9http://www.archive.org/details/rosettaproject_tao_swadesh-1 10http://www.ethnologue.com/show_language.asp?code=ces 11http://www.archive.org/search.php?query=language%3A%22ces%22 12http://rosettaproject.org/projects/300-languages/ 13http://www.language-archives.org/


Population range | Languages | Coverage: Count / Percent / Items | Online Resources: Count / Percent / Items
100,000,000 to 999,999,999 | 8 | 8 / 100% / 7745 | 8 / 100% / 1007
10,000,000 to 99,999,999 | 77 | 75 / 97% / 4367 | 72 / 94% / 2152
1,000,000 to 9,999,999 | 304 | 277 / 91% / 4887 | 246 / 81% / 3006
100,000 to 999,999 | 895 | 716 / 80% / 8814 | 600 / 67% / 4388
10,000 to 99,999 | 1824 | 1181 / 65% / 15208 | 951 / 52% / 5581
1,000 to 9,999 | 2014 | 1244 / 62% / 20566 | 1097 / 54% / 8190
100 to 999 | 1038 | 634 / 61% / 11239 | 560 / 54% / 3799
10 to 99 | 339 | 235 / 69% / 6427 | 202 / 60% / 1075
1 to 9 | 133 | 90 / 68% / 1067 | 75 / 56% / 519
Unknown | 277 | 115 / 42% / 1731 | 79 / 29% / 394
All living languages | 6909 | 4575 / 66% / 82051 | 3890 / 56% / 30111
Extinct languages | 520 | 242 / 47% / 2328 | 178 / 34% / 778
Table 2.2: OLAC - language coverage

Articles | Count | Cumulative
1,000,000 to 9,999,999 | 3 | 3
100,000 to 999,999 | 34 | 37
10,000 to 99,999 | 64 | 101
1,000 to 9,999 | 107 | 208
100 to 999 | 60 | 268
10 to 99 | 7 | 275
1 to 9 | 5 | 280
Table 2.3: Wikipedia - article counts

2.2.3 Wikipedia

Wikipedia14 is a free, web-based, collaborative, multilingual project. It contains 19 million articles in 281 languages15. Article counts are presented in Table 2.3.

2.2.4 Universal Declaration of Human Rights

The Universal Declaration of Human Rights16 (UDHR) is a milestone document in the history of human rights. At present, there are 379 different translations of the UDHR, available in HTML and/or PDF format. This project sets the Guinness World Record for the Most Translated Document17.

14http://www.wikipedia.org/ 15http://meta.wikimedia.org/wiki/List_of_Wikipedias 16http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx

There is a related project UDHR in Unicode18 which aims to convert all documents into Unicode, although only four of them have been completed and reviewed19.

2.2.5 Project Gutenberg

The Project Gutenberg20 is a volunteer effort to digitize and archive cultural works. It contains over 34 thousand documents in 60 languages. Most of the items are the full texts of public domain books.

2.2.6 Wikisource

Wikisource21 is an online library of free-content textual sources, operated by the Wikimedia Foundation. Its aim is to harbour all forms of free text, in many languages. Wikisource contains more than 1M articles in 62 languages22.

2.2.7 Watchtower

The Watchtower23 is an illustrated religious magazine, published semi-monthly by Jehovah’s Witnesses. It is written in 418 languages (366 without sign languages). Texts are available as web pages or PDF files. All files have a very similar structure, so it may serve as a very good source of parallel texts.

2.2.8 Urbi et Orbi

Urbi et Orbi24 is the blessing which takes place at each Easter and Christmas celebration in Rome, from the central loggia of St. Peter's Basilica, at noon. This year (2011) it was pronounced in 65 languages, the highest number in history.

17http://www.ohchr.org/EN/UDHR/Pages/WorldRecord.aspx 18http://unicode.org/udhr/ 19http://unicode.org/udhr/index_by_stage.html 20http://www.gutenberg.org/ 21http://www.wikisource.org/ 22http://meta.wikimedia.org/wiki/Wikisource#List_of_Wikisources 23http://watchtower.org/ 24http://en.wikipedia.org/wiki/Urbi_et_Orbi

This blessing consists of a single sentence: “May the grace and joy of the Risen Christ be with you all.”

2.2.9 Open-source Software

Open-source Software25 is computer software that is available in source code, typically developed by volunteers distributed amongst different geographic regions. Therefore, big OSS projects are available in many languages. These strings are mostly texts of error messages, menus and buttons. For example:

• Launchpad26 - 323 languages, 1,730,838 strings

• Gnome27 - 173 languages

• KDE28 - 75 languages

2.2.10 Summary

Sizes of different language resources are summarized in Table 2.4. From these sizes, it is possible to conclude:

• Thousands of languages are available in the Rosetta Project and the Open Language Archives Community. To achieve this, special language interest groups and linguistics specialists are required.

• Around 300 languages are represented in the Universal Declaration of Human Rights, Wikipedia, the Watchtower and Launchpad. This is the upper bound for the number of languages that are at least theoretically available in written form on the Internet. This covers almost 90% of all people.

• Around 60 languages are available in Project Gutenberg, Wikisource and Urbi et Orbi. This is the lower bound for the number of languages that are used in developed or newly industrialized29 countries. This covers almost 70% of all people.

25http://en.wikipedia.org/wiki/Open_source_software 26https://translations.launchpad.net 27http://l10n.gnome.org/languages/ 28http://l10n.kde.org/teams-list.php 29http://en.wikipedia.org/wiki/Newly_industrialized_country


Project | Languages | Size
Rosetta Project (2.2.1) | over 2,500 | 100,000 pages
OLAC (2.2.2) | 4,575 | 82,051 items
Wikipedia (2.2.3) | 281 | 19,034,746 articles
UDHR (2.2.4) | 379 | at most 379 documents
Project Gutenberg (2.2.5) | 60 | 34,000 documents
Wikisource (2.2.6) | 62 | 1,028,303 pages
Watchtower (2.2.7) | 366 | thousands of pages
Urbi et Orbi (2.2.8) | 65 | 65 sentences
Launchpad (2.2.9) | 323 | 1,730,838 strings
Gnome (2.2.9) | 173 | about 1 million strings
Table 2.4: Multilingual resources — summary

2.3 Multilingual Web Corpora

As early as 2001, Banko and Brill [BB01], and more recently in 2009 Halevy et al. [HNP09], showed that more data with a simple method outperforms less data with a sophisticated method.

The following multilingual web corpora are reviewed in more detail: WaCky (2.3.1), Crúbadán (2.3.2), I-X (2.3.3) and Corpus Factory (2.3.4). The unit ‘W’ will be used instead of ‘word’, so 10MW means 10 million words.

2.3.1 WaCky

WaCky was introduced for the first time by Baroni and Kilgarriff [BK06] in 2006, with more detailed information in [BBFZ09]. This corpus contains 3 languages (English, German and Italian), and each of them has approximately 1.5GW.

They randomly combined mid-frequency content words from existing corpora for each language to construct word pairs. These bigrams were used to construct search queries for Google to retrieve a list of seed URLs.

Heritrix30 was used to crawl pages with a breadth-first crawling strategy. Crawling was restricted to pages in relevant web domains (.de/.at for German; .it for Italian; .uk for English). URLs with a suffix indicating non-HTML data (.pdf, .jpg, etc.) were discarded.

From the Heritrix log file they retrieved pages with mime type text/html and size between 5 and 200kB. These pages were preserved. For removing boilerplate code they used their own reimplementation of the BTE tool31.

30http://crawler.archive.org/

Property | deWaC | itWaC | ukWaC
Raw crawl size (GB) | 398 | 379 | 351
Documents after filtering (M) | 4.86 | 4.43 | 5.69
Size after document filtering (GB) | 20 | 19 | 19
Size after near-duplicate cleaning (GB) | 13 | 10 | 12
Documents after near-duplicate cleaning (M) | 1.75 | 1.87 | 2.69
Tokens (M) | 1,278 | 1,586 | 1,914
Table 2.5: WaCky — data size

The cleaned documents were filtered based on lists of function words. Documents not meeting the minimal requirements (10 types and 30 tokens per page, with function words accounting for at least a quarter of all words) were discarded. This filter also worked as a simple language identifier. They also used a blacklist to discard pornographic pages.

Near-duplicate detection was performed by a simplified version of Broder's “shingling” algorithm [BGMZ97]. They removed function words and randomly selected 25 5-grams from each document. If a pair of documents shared at least two 5-grams, they were considered near duplicates and one of them was removed.
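
The following is a minimal Python sketch of this simplified shingling test (the thesis tools themselves are shell and Perl scripts). Documents are assumed to be already tokenized and a function-word list available; the sample size (25 5-grams) and the threshold (two shared 5-grams) follow the description above.

import random

def shingles(tokens, function_words, k=5, sample_size=25, seed=0):
    # Drop function words, build all k-grams, then sample a fixed number of them.
    content = [t.lower() for t in tokens if t.lower() not in function_words]
    grams = [tuple(content[i:i + k]) for i in range(len(content) - k + 1)]
    if not grams:
        return set()
    rnd = random.Random(seed)  # fixed seed: the same document always yields the same sample
    return set(rnd.sample(grams, min(sample_size, len(grams))))

def near_duplicates(doc_a, doc_b, function_words, threshold=2):
    # Two documents are near duplicates if their samples share at least 'threshold' 5-grams.
    return len(shingles(doc_a, function_words) & shingles(doc_b, function_words)) >= threshold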

Building a corpus for each language took approximately 3 weeks (10 days crawling, 7 days cleaning, 4 days near-duplicate detection). Basic statistics are presented in Table 2.5.

2.3.2 Crúbadán

Crúbadán is a multilingual corpus introduced by Scannell [Sca07]. This corpus contains 487 languages32.

For each language, some additional metadata was provided manually: the name of the language in English, the ISO 639-3 code, a flag indicating whether the language is under-resourced, and a list of “polluting” languages (languages frequently used in boilerplate texts).

31http://dev.sslmit.unibo.it/wac/post_processing.php 32http://borel.slu.edu/crubadan/stadas.html

The majority of training texts came from three sources: Wikipedia, the Watchtower (the Jehovah's Witnesses web site) and the Universal Declaration of Human Rights site. Training texts were preprocessed. A temporary word frequency list was generated and then several filters were applied to produce a clean word list. For example, the following words were removed: words with characters not usually appearing in the target language, words with no vowels (when this made sense), words with the same character appearing three or more times in a row, words with a capital character appearing after the first character, words that appeared in the word list for a polluting language, and words that contained improbable letter trigrams (at later stages, after the statistics were available). Native speakers were asked to define language-specific constraints.

Stop words were extracted by native speakers. When no native speaker was available, the highest frequency words that did not appear as high frequency words in other languages were used.

Search queries were generated by combining randomly chosen words, connected by OR, and at least one stop word connected by AND. They used Google to retrieve URLs. For the conversion to plain text, they used the open-source programs vilistextum, pdftotext and wvText.

Language detection is based on comparing the cosine of the angle between vectors representing the downloaded document and the training documents in the space of character trigrams, with manual tuning based on a number of ad-hoc factors.
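
A sketch of this trigram-based cosine scoring, without the manual tuning mentioned above. It assumes training_texts maps a language code to a reference text; names and structure are illustrative, not Crúbadán's actual code.

import math
from collections import Counter

def char_ngrams(text, n=3):
    # Character n-gram counts, with runs of whitespace collapsed to single spaces.
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    # Cosine of the angle between two sparse count vectors.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def detect_language(document, training_texts):
    # Pick the language whose training trigram vector is closest to the document.
    doc = char_ngrams(document)
    return max(training_texts, key=lambda lang: cosine(doc, char_ngrams(training_texts[lang])))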

Crúbadán corpus size is presented in Table 2.6.

Scannell also states:

Indeed, we claim that any effort to crawl the web for a large number of languages without attempting to harness the collective knowledge of many language experts, ..., is doomed to failure.

2.3.3 I-X

Sharoff [Sha06] introduced a BNC-like multilingual web corpus. This corpus contains 6 languages (English, German, Russian, Chinese, Romanian and Ukrainian), but results are available for only three of them.

(a) Document counts
Document count | Languages
> 1k | 70
> 500 | 115
> 250 | 143
> 125 | 181
> 65 | 210
> 32 | 255
> 16 | 337
> 8 | 356
> 4 | 381
> 2 | 416
> 1 | 449

(b) Word counts
Word count | Languages
> 100MW | 1
> 10MW | 11
> 1MW | 127
> 100kW | 225
> 10kW | 354
> 1kW | 473
> 100W | 487

Table 2.6: Crúbadán — data size

500 common words were chosen from existing corpora, and queries were constructed by combining N-tuples (N = 2-4) of such words connected by AND, prefixed by 2 very frequent words connected by OR.
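
A small sketch of this query-generation scheme, assuming the lists of mid-frequency and very frequent words have already been extracted. The exact operator syntax and bracketing accepted by the search engine are assumptions here, not specified in the paper.

import random

def make_query(mid_frequency, very_frequent, n=3, seed=None):
    # One query: an N-tuple of mid-frequency words joined by AND,
    # prefixed by two very frequent words joined by OR.
    rnd = random.Random(seed)
    prefix = " OR ".join(rnd.sample(very_frequent, 2))
    body = " AND ".join(rnd.sample(mid_frequency, n))
    return "({}) AND {}".format(prefix, body)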

5,000 queries were used, and 50,000 URLs were retrieved with the Google API. These URLs were downloaded without recursion. Encoding was unified and lynx33 was used to convert pages from HTML to plain text (it worked better than ad-hoc Perl filters). Then, a simple heuristic was used for navigation frame detection (link density). For deduplication they used a simplified version of the “shingling” algorithm from WaCky (2.3.1).

The corpus size is presented in Table 2.7. The corpora for Chinese, Romanian and Ukrainian are mentioned only in the introduction and no results are presented.

2.3.4 Corpus Factory

Corpus Factory is a multilingual corpus constructed by Kilgarriff et al. [KRPP10]. This corpus contains 8 languages - Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese.

Firstly, they built corpora from Wikipedia pages with at least 500 words (Wiki Corpora). Secondly, they tokenized these corpora. For languages without explicit word delimiters (Thai, Vietnamese) they used language-specific tools; for the rest they used spaces and punctuation marks.

33http://lynx.isc.org/ - text web browser

Language | Size in MW
English (I-EN) | 127
German (I-DE) | 126
Russian (I-RU) | 156
Chinese | ???
Romanian | ???
Ukrainian | ???
Table 2.7: I-X — size in MW

They considered the top 1000 words as high-frequency words and the next 5000 words as mid-frequency ones. They used only words with at least 5 characters (except for Vietnamese, where words may contain spaces).

They used BootCaT’s query generation module. The number of words in a query was language dependent and was automatically assigned. They also found out, that Google normalizes many non-UTF8 encodings to UTF-8 whereas Yahoo and Bing don’t. For licensing and usability reasons they used Yahoo and Bing, therefore they converted UTF8 word seeds into native encodings.

Wget was used for downloading web pages, and only pages with mime type text/html and size between 5kB and 2MB were kept (this information was provided by the search engine API).

They used the BTE34 algorithm to remove boilerplate code and to retrieve plain text. They considered the 500 words with the highest frequency (from the Wiki Corpora) as functional words. Then they sorted wiki pages according to the proportion of top-500 words. They found out that the top 70% of pages contain connected text. These pages contain at least 65% of words from the top-500 list. Therefore they preserved only web pages with at least 65% of words from the top-500.
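
A sketch of this proportion-based filter under one reading of the 65% criterion (the share of a page's tokens that belong to the top-500 list); the threshold value follows the text, the function name and interface are mine.

def keep_page(tokens, top_words, threshold=0.65):
    # Keep a page only if at least 'threshold' of its tokens come from the top-500 list.
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t.lower() in top_words)
    return hits / len(tokens) >= threshold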

For near-duplicate detection, they used Perl's Text::DeDupper module, which implements Broder's “shingling” algorithm.

The corpora sizes are displayed in Table 2.8.

34Body Text Extraction algorithm (BTE, Finn et al. 2001)

Language | Wiki Corpora | Web Corpora
Dutch | 30.0 | 108.6
Hindi | 2.5 | 30.6
Indonesian | 8.5 | 102.0
Norwegian | 19.1 | 94.9
Swedish | 9.3 | 114.0
Telugu | 0.2 | 3.4
Thai | 6.2 | 81.8
Vietnamese | 9.5 | 149.0
Table 2.8: Corpus Factory — size in MW

2.3.5 Summary

In this subsection, I summarize existing multilingual corpora and compare them with one another. Sizes are presented in Table 2.9.

All approaches used very similar methods:

1. Retrieve word seeds from existing corpora or a reliable text source.
2. Generate n-tuples of words.
3. Use these tuples as search queries.
4. Download the found web pages.
5. Preserve just files with mime type text/html and acceptable size.
6. Use BTE for removing boilerplate code.
7. Use functional words for language detection and running-text detection.
8. Use Broder's “shingling” algorithm for near-duplicate detection.

Differences among all approaches are displayed in Table 2.10.

2.3.6 Conclusions

If I evaluate these projects with respect to the goals of this thesis, then the target size of 10MW was reached by 11 languages in Crúbadán, 7 in the Corpus Factory and 3 in WaCky and I-X. The methods used for building Crúbadán required native speakers or were computationally inefficient (language detection was done by comparison with all the training documents).

Language | WaCky | Crúbadán | I-X | Corpus Factory
English | 1,914MW | 26.8MW | 127MW | No
German | 1,278MW | 2.7MW | 126MW | No
Russian | No | 333kW | 156MW | No
Italian | 1,586MW | 3.2MW | No | No
Dutch | No | 2.6MW | No | 138.6MW
Hindi | No | 805kW | No | 33.1MW
Indonesian | No | 5MW | No | 110.5MW
Norwegian | No | 1.3MW (B), 2.6MW (N) | No | 114MW
Swedish | No | 2MW | No | 123.3MW
Telugu | No | 2MW | No | 3.6MW
Thai | No | 218kW | No | 90MW
Vietnamese | No | 3.9MW | No | 158.5MW
Chinese | No | 320kW | Yes | No
Romanian | No | 6.6MW | Yes | No
Ukrainian | No | 273kW | Yes | No
Table 2.9: Language coverage

2.4 Word Seeds

A word seed is an initial small corpus that is used as a source of words for generating queries, recognizing languages and estimating document quality. All multilingual web corpora (2.3) used an existing reliable language resource as the initial corpus. They used Wikipedia (2.2.3) or established corpora such as the British National Corpus.

2.5 URL Seeds

URL seeds are an initial list of URLs for the crawler. There are at least 2 possible approaches to constructing this list: use an existing list of URLs (2.5.1) or retrieve the URLs from a search engine (2.5.2).


Property | WaCky | Crúbadán | I-X | Corpus Factory
Word seeds | Texts from existing corpora. | Texts from specified websites. | Texts from existing corpora. | Texts from Wikipedia.
URL seeds | Searching pairs of mid-frequency content words using Google. | Searching randomly chosen words from the lexicon (OR'ed together) with at least one stopword AND'ed. | Searching randomly chosen words from the lexicon (AND'ed together) with 2 high-frequency words OR'ed. | Searching mid-frequency words; the number of words is language dependent.
Crawler | Heritrix | wget | Unspecified | wget
Crawling | Domain restricted, suffix restricted. Recursive. | Extracted URLs are added to the pending list of URLs for the language of the downloaded document. Recursive. | Just extracted URLs. Without recursion. | Just extracted URLs. Without recursion.
Filtering | Mime type text/html, size between 5kB and 200kB. | Unmentioned | Unmentioned | Mime type text/html, size between 5kB and 2MB. At least 65% of high-frequency words.
Boilerplate | Modified BTE algorithm. | Unmentioned | Tag density (maybe BTE) | BTE algorithm.
Deduplication | Simplified version of Broder's “shingling” algorithm. | Unspecified | Simplified version of Broder's “shingling” algorithm. | Broder's “shingling” algorithm.
Language detection | Contains functional words. | Cosine angle between vectors representing the document and training texts in the space of character trigrams. Manual tuning. | Unmentioned. Functional words in search query. | Unmentioned. Functional words in search query.
Languages | 3 | 487 | 3 (6) | 8
Median size | 1.586GW | 68,221W | 126MW | 102MW
Table 2.10: Existing multilingual corpora — overview

2.5.1 Existing Lists

A web directory or a language resource with external links can be a good source of URLs in a specific language.

Open Directory Project

The Open Directory Project35 (ODP or Dmoz) is a multilingual open content directory of World Wide Web links. It contains links to websites in 90 languages36, although since 2006 it has no longer been expanding37.

Wikipedia

Wikipedia contains a lot of links to external web pages (mostly in the External Links section). It is also possible to retrieve all external links in a single file - for example, for English38.

2.5.2 Search Engines

Another approach is to use search engines to retrieve URLs. This method is used by all the multilingual web corpora (2.3). They differ in the way they generate queries from word seeds - how many words should be used, which words should be chosen and how they should be combined.

Google Search

Google Search39 is a web search engine owned by Google Inc. and is the most-used search engine on the Web.

In the past, there was the Google Web Search API40, but it was officially deprecated as of November 1, 2010. There is also the Custom Search API41, which allows 100 queries per day; additional ones must be bought.

35http://www.dmoz.org/ 36http://www.dmoz.org/World/ 37http://commons.wikimedia.org/wiki/File:Odp_sitecount_top.png 38http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-externallinks.sql.gz 39http://www.google.com 40http://code.google.com/intl/cs/apis/websearch/ 41http://code.google.com/apis/customsearch/v1/overview.html

Bing

Bing42 is a web search engine owned by Microsoft Corporation and is one of the most used search engines on the Web.

It provides an API43 for searching. The only limitation is a rate of less than 7 queries per second.

2.6 Crawling

There are plenty of crawlers available on the Internet. Some of them are:

GNU Wget

GNU Wget44 is part of the GNU Project and is therefore available on virtually all machines.

• Very simple and easy to use.

• It can store HTTP headers.

• It is not possible to create rules that decide whether to continue or terminate downloading according to the HTTP header.

• A lot of websites return different content or a 4XX HTTP status code; it is possible to change the user agent.

Nutch

Nutch45 is a crawler that is built on Apache Lucene.

• It can run on a single machine, but also on a Hadoop46 cluster.

• It supports plugins.

42http://www.bing.com/ 43http://www.bing.com/developers/ 44http://www.gnu.org/software/wget/ 45http://nutch.apache.org/ and http://en.wikipedia.org/wiki/Nutch 46http://hadoop.apache.org/

Heritrix

Heritrix47 is a crawler developed by the Internet Archive for web archiving.

• It is a very complex piece of software with dozens of options.

• It can be very precisely tuned, and missing functionality may be implemented as a plug-in.

2.7 Language Recognition

Language detection is one of the crucial parts of this project. This field has been researched since the 1970s48. There are many articles about language recognition, but I found out that the algorithms used in real applications are different. Therefore, I first introduce theoretical approaches (2.7.1) and then the approaches used in some applications (2.7.2).

2.7.1 Theoretical

The Cavnar and Trenkle [CT94] algorithm uses a sliding window over the characters of a text. A list of the 300 most common n-grams, for n in 1..5, is created during training for each training document. To classify a new document, they construct its list of the 300 most common n-grams and compare the n-gram positions with the training lists. The list with the minimal difference is the most similar one, and the new document is assumed to be in the same language. They classified 3478 samples in 14 languages from a newsgroup and reported an accuracy of 99.8% (only 7 documents were misclassified).
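
A compact Python sketch of this rank-order ("out-of-place") comparison. It works directly on character n-grams without the tokenization and padding of the original paper; it is an illustration, not the reference implementation.

from collections import Counter

def ngram_profile(text, max_n=5, top=300):
    # Ranked list of the 'top' most frequent character n-grams, n = 1..max_n.
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams missing from the language profile get the maximum penalty.
    ranks = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - ranks[g]) if g in ranks else penalty
               for i, g in enumerate(doc_profile))

def classify(text, profiles):
    # Pick the language profile with the smallest out-of-place distance.
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))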

Sibun [SR96] introduced a method for language detection based on relative entropy, a well-known measure also known as the Kullback-Leibler distance. Relative entropy is a useful measure of the similarity between probability distributions. She used texts in 18 languages from the European Corpus Initiative CD-ROM. With bigrams she achieved an accuracy of 100%.
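
For reference, the relative entropy used here is the standard Kullback-Leibler expression (the textbook definition, not a formula quoted from [SR96]):

D(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

where P is the n-gram distribution of the document and Q the distribution estimated from a language's training text; the language whose Q minimizes this quantity is selected.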

Hayati [Hay04] reported that with Cavnar and Trenkle's algorithm they achieved an 86.8% accuracy on webpages spanning 11 languages.

47http://crawler.archive.org/ and http://en.wikipedia.org/wiki/Heritrix 48 http://speech.inesc.pt/~dcaseiro/html/bibliografia.html

Therefore, they used the Fisher discriminant function to choose representative n-grams for all the languages, and compared the new document to the reference using a cosine similarity measure. Using this method, they achieved an accuracy of 93.9%. They also showed that using information about incoming and outgoing links (webpages usually link to pages in the same language) increases the accuracy of the classifier.

Martins et al. [MS05] reported that with Cavnar and Trenkle's algorithm and an improved metric for comparing lists of n-grams, they achieved an accuracy of 91.25% for 12 languages.

2.7.2 Applications

There are plenty of applications that use language detection, and some of them have accessible source code, which is why I take a closer look at them.

Mozilla Firefox

Mozilla Firefox49 is a free and open-source web browser. Mozilla currently contains two charset detectors50.

Chardet51 is an original Mozilla project that is still used for charset detection. It uses precomputed bi-gram models. The source code is originally in C++52, but there are ports in Java53 and Python54.

The Li and Momoi [LM01] universal charset detector55 uses a combined approach. In the first phase, the coding scheme is checked: some byte sequences are illegal in some encodings, so this is very effective for 7-bit multi-byte encodings. If the encoding is not recognized, then a unigram distribution is used to detect it. If the detector is still not confident enough, it uses a bigram distribution. The source code is in C++ and is publicly available56.

Google CLD

Google CLD57 (Compact Language Detection Library) is a part of Google Chrome58. Google Chrome is also a free and open source web browser.

The CLD looks up each quadgram in a large hashtable that contains language probabilities. This hashtable was originally built by processing language probabilities over billions of web pages indexed by Google's search engine. The CLD is able to recognize approximately 90 languages59.

The algorithm itself60 uses information from the TLD (it is more probable that a page in the TLD .cz will be in Czech than in Slovak). It also uses the page encoding (windows-1250 is Central European, and therefore the page will not be in any Asian language). It iterates over all quadgrams (HTML markup is ignored) and accumulates a score for each language. It uses information about close language pairs to modify the overall score. Language scores are also normalized by the average score retrieved per kilobyte.

It also contains manually written rules. If the text is in English or language X (and the score for X is high enough), then it assumes the English is boilerplate and the page is in language X. If the text is in FIGS (French, Italian, German or Spanish) or X (X is not English and its score is high enough), then it assumes the FIGS text is boilerplate and the page is in language X.

There are a lot of magic numbers for different thresholds, ratios, etc. The hashtables of quadgrams are declared in this file61. The code also contains interesting comments62:

Restrict the set of scored languages to the Google ”Top 40*”, which is actually 38 languages. This gets rid of about 110 language that represent about 0.7% of the web. Typically used when the first pass got unreliable results. 57http://googletranslate.blogspot.com/2010/03/faster-simpler-and-safer-browser-goes 58http://www.google.com/chrome/ 59http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/languages/proto/la 60http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/encodings/compact_ 61http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/encodings/compact_ 62http://src.chromium.org/viewvc/chrome/trunk/src/third_party/cld/encodings/compact_

Google Translate API

The Google Translate API63 is a service provided by Google to translate texts between 52 languages. It also has an interface for language detection64. It is very probable that it uses Google CLD.

This API was officially deprecated as of 26th May 2011 and will be shut down on 1st December 201165.

2.7.3 Multilingual Web Corpora

Multilingual web corpora use different methods (summarized in Table 2.10). WaCky detects functional words in the document. I-X and Corpus Factory rely on functional words in the search queries. Crúbadán compares the cosine angle between vectors representing the document and training texts in the space of character trigrams, with manual tuning.

2.7.4 Summary

An overview of all these papers is given in Table 2.11. Crúbadán has the highest number of recognized languages, but more than 100 of its languages contain four or fewer documents and half of them fewer than 40, so I suppose that a lot of them were assigned manually.

2.8 Corpus Storing and Distribution

Corpus storing and distribution is one of the fundamental parts of corpus building. Wynne [Wyn05] as well as E-MELD66 suggest many tips.

Archival copies should be made in a format which offers LOTS (i.e., it is Lossless, Open Standard, Transparent, and Supported by multiple vendors). A corpus must also contain proper documentation of the formats used, along with information about terms of use and access restrictions.

63code.google.com/apis/language/translate/overview.html 64http://code.google.com/intl/cs/apis/language/translate/v2/using_rest.html#detect- 65http://googlecode.blogspot.com/2011/05/spring-cleaning-for-some-of-our-apis.html 66http://emeld.org/school/bpnutshell.html

Paper | Languages | Accuracy
Cavnar and Trenkle [CT94] | 14 | 99.8%
Sibun [SR96] | 18 | 100%
Hayati [Hay04] | 11 | 93.9%
Martins et al. [MS05] | 12 | 91.25%
Google CLD (2.7.2) | 87 | unknown
Google Translate API (2.7.2) | 53 | unknown
WaCky (2.3.1) | 3 | unknown
Crúbadán (2.3.2) | 489 | unknown
I-X (2.3.3) | 3 | unknown
Corpus Factory (2.3.4) | 8 | unknown
Table 2.11: Language detection — summary

Making a corpus widely available may not be possible due to copyright and other legal issues.

2.9 Corpus Quality Analysis

Corpus analysis is an important step in building a web corpus. Without comparison with existing corpora, it is hard to say whether high-quality texts were downloaded or whether they are just some ‘CD image’.

Rayson et al. [RG00] suggested using log-likelihood statistics for comparing frequency lists. This approach was used in all the multilingual corpora. Bharati et al. [BRSB00] also suggested using the number of unique unigrams, entropy, and word and sentence lengths for comparing different corpora.
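
A minimal Python sketch of the usual contingency-table form of the log-likelihood (G2) score for a single word in two corpora; the variable names are mine and this is not code from the cited paper.

import math

def log_likelihood(a, b, c, d):
    # 'a' and 'b' are the word's frequencies in two corpora of sizes 'c' and 'd' tokens.
    e1 = c * (a + b) / (c + d)   # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)   # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2.0 * ll

A large score for a word indicates that its relative frequency differs markedly between the two frequency lists.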

2.10 Internet Size

When the corpus is downloaded, it is useful to know how much it can be extended. In 1997, Bharat et al. [BB98] used 300,000 documents in the Yahoo! hierarchy to build a lexicon of about 400,000 words (low-frequency words were excluded). Then they constructed random queries and retrieved random pages from the first 100 results. They used 35,000 queries and 4 search engines to estimate the size of the Internet in November 1997, which was at least 200 million pages.

In 2005, Gulli et al. [GS05] used a very similar method, but on a larger scale. They used 438,141 queries in 75 different languages. They also used four search engines, and they found out that their overlap was just 28.85%. They needed 43 Linux servers, requiring about 70Gb of bandwidth and more than 3600 machine-hours. They estimated that the indexable web has more than 11.5 billion pages.

Broder et al. [BFJ+06] and Bar-Yossef et al. [BYG07] showed that random queries do not return random documents, and this causes an underestimation of the size. To overcome this problem, Lu et al. [LL10] introduced an estimator based on capture-recapture methods.
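
Only to illustrate the idea behind such estimators, the simplest capture-recapture (Lincoln-Petersen) estimate applied to two samples of page URLs is sketched below; the estimators in the cited papers are considerably more involved, precisely because search-engine samples are not random.

def lincoln_petersen(sample_a, sample_b):
    # Estimate population size from two samples: N ~= |A| * |B| / |A intersect B|.
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(a) * len(b) / overlap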

According to other resources, there are more than 255 million websites67 and almost 50 billion webpages68. The former Google CEO, Eric Schmidt, stated in 2005 that Google was indexing 170GB69. The search engine Cuil70 indexed more than 121 billion pages in 2007.

The Internet now has more than 2.1 billion users71, double the number in 2007. Zhang et al. [ZZY+08] showed that the number of autonomous systems doubles every 5.3 years.

67http://www.focus.com/images/view/48564/ 68http://www.worldwidewebsize.com/ 69http://news.softpedia.com/news/How-Big-Is-the-Internet-10177.shtml 70http://en.wikipedia.org/wiki/Cuil 71http://www.internetworldstats.com/emarketing.htm

3. Methods

This chapter describes the tools and methods used for building the web corpus. The complete process is illustrated in Figure 3.1, together with the available resources and data flow.

Constructing the web corpus consists of several steps. The initial step was gathering metadata from Wikipedia and Ethnologue. The downloaded metadata were stored in the database on the hosting. Once the metadata were available, the Wiki Corpus was built from Wikipedia articles. Frequency lists of trigrams and quadgrams were computed and uploaded to the hosting. From the Wiki Corpus the language model was trained and moved to the hosting. Building the web corpus was divided into smaller jobs that were executed in the computer laboratory. Job results were stored on ufallab, where they were merged into a raw corpus. This raw corpus was transferred back to the laboratory, where duplicities were reduced and statistics were computed. The clean corpus was stored on ufallab.

3.1 Available Resources

I had access to the following computational resources during my work on this thesis.

[Figure 3.1: Building Web Corpus. The diagram shows the data flow (steps 1-9) between Ethnologue, Wikipedia, the Internet, the PC, the computer laboratory, ufallab (job results, raw corpus, clean corpus) and the hosting (metadata, Wiki Corpus, language model).]

• PC (stingray) — Intel Pentium 4 CPU 1.80GHz, 2GB RAM, 230GB disk

• Computer laboratory72 — available to all students of our faculty — 15 computers with Intel Core i7 920 (4x2.67GHz + HyperThreading), 6GB RAM and 150GB disk space; 16 computers with Intel Core2 Quad Q9550 (4x2.83GHz), 4GB RAM and 50GB disk space; 6 computers with AMD 64 X2 3800+/4200+ (2x2GHz/2.2GHz), 2GB RAM and 50GB disk space; and 5GB on a shared network disk.

• Server (ufallab) — Intel Pentium 4 CPU 3GHz, 2GB RAM, 1.1TB disk

• Hosting — shared webhosting, 50GB disk space

72http://wiki.ms.mff.cuni.cz/wiki/Po%C4%8D%C3%ADta%C4%8De_UNIX

The most serious limitation of the resources used was the impossibility of establishing an ssh connection between the computer laboratory and ufallab without a manually typed password. This password typing would have been required every 15 minutes, more than two thousand times in total, if no other solution had been found.

Another complication, which I strongly underestimated in the early stages, was the very unpredictable environment in the computer laboratory. The main complications were:

1. Very low (for my needs) quotas on network traffic. I was able to consume the weekly quota (2GB of outgoing traffic) in less than two hours. I negotiated the disabling of these quotas.

2. No program may run longer than 24 hours, so I had to divide the work into smaller jobs.

3. The laboratory is used by many students, and quite often some of them executed programs that consumed all the memory or acted as fork bombs73. This behaviour had two consequences. The first was that any running program could crash because it could not execute any subprogram. I added scripts that restarted crashed programs, but these programs were also crashing, so I monitored them manually. The second was that a stuck computer might be restarted (by a student or the laboratory service). In such situations I tried to reuse as much of the already processed data as possible.

4. A shared file system is used for home directories in this laboratory, so even a few programs intensively interacting with the file system cause lags (in seconds; dozens of such programs can lag for longer than a minute). These situations induce unpredictable behaviour of file operations. For example, when two scripts are executed in serial order, where the first one creates a file and the second reads it, the file may appear not to exist when the second script is executed.

73http://en.wikipedia.org/wiki/Fork_bomb

5. The problems mentioned in points 3 and 4 also have a social aspect. If computers are working slowly, then users start complaining to the laboratory service or the administrator. It is easy for them to assume that the user with more than 2 thousand running processes is causing the problems. I had to defend myself to avoid account deletion.

3.2 General Principles

The design was based on maximal utilization of the available resources (3.1), taking their limitations into account. I decided to use many small components that could be easily connected into more complex ones. I also preferred to start with small-scale experiments and simple scripts to explore the problem.

3.2.1 Usability

Usability is important for a product to spread among users. It is very frustrating when a program is executed and nothing happens, or when it crashes with a cryptic message because another required program or library is missing.

To overcome these problems, all scripts check that their own requirements are fulfilled and, additionally, there is a script checkRequirements.sh that checks the requirements of all scripts.

All scripts also use a special notation for writing comments, which makes it easy to automatically generate help when the parameter -h is used and to generate HTML documentation74.

3.2.2 External Tools

I decided to employ as many already existing open-source software components as possible. An external tool may be widespread and therefore included in an OS distribution package repository and easily installable (with apt, yum, etc.), or it may be a specific tool that must be retrieved from the developer's website.

There are at least 3 ways in which these dependencies can be managed:

74http://w2c.martin.majlis.cz/w2c/doc-gen/

1. They are just mentioned in the documentation.

2. They are bundled with the project.

3. They are retrieved from an external resource.

All these approaches have their pros and cons. The first one is the easiest for the developer. If the external tool is widespread, then this possibility is also very convenient for the user. The second possibility gives the developer high control over the execution environment, but at the cost of expanding the project size. The third one has the benefits of the second one (control over the environment) and it also does not increase the project size.

I decided to use the third one, because it provides many benefits for users. As I mentioned in the section about usability (3.2.1), there is a script checkRequirements.sh that checks all requirements of external tools and libraries. If any of them is missing, then tips for installation are provided.

I also found out that there are huge differences in the performance of different versions of the same software. For example, grep 2.5.4 (which was available in the computer laboratory) is in some situations 60 times slower than version 2.6.3 or newer. Filtering 2 million lines with grep 2.5.4 took more than 13 minutes, but with 2.6.3 it took slightly more than 12 seconds. Therefore, program versions are also checked and newer ones are installed when necessary.

3.3 Metadata

Metadata such as the language name, its ISO code, population size, writing system, etc. was automatically downloaded from the Internet for each language. The following sources were combined:

• SIL International75 — which provides an easily parsable table76 of all languages with their ISO codes and names.

• Wikipedia77 — with its list of all Wikipedias78, where they use their own codes and names.

75http://sil.org 76http://www.sil.org/iso639-3/iso-639-3_20100707.tab 77http://www.wikipedia.org/ 78http://meta.wikimedia.org/wiki/List_of_Wikipedias


• Ethnologue79 — with easily parsable pages with language information, e.g. Czech80.

Because I knew that the Ethnologue numbers are out of date (2.1), I intended to use information from the info-boxes in Wikipedia. For example, English has 328 million speakers according to Ethnologue81, while Wikipedia82 also provides information about first and second language speakers, with overall up to 1.8 billion speakers. In fact, English is the ‘lingua franca’ of the Internet, therefore I would prefer to use the numbers from Wikipedia.

To avoid parsing Wikipedia, I wanted to use DBpedia83, which extracts information from Wikipedia, but I discovered that it is not reliable. For example, for the Buginese language, DBpedia84 reports 240 speakers, Wikipedia85 3.5 to 4 million, and Ethnologue86 3.5 million.

From this I concluded that information extraction from Wikipedia may not be easy. Not all languages are present and it may be hard to locate them due to their name variants. It would also be hard to automatically and correctly decide which number of speakers is correct. Therefore, I decided to stick with Ethnologue.

Scripts used for metadata extraction are langList.sh and ethnologueParser.sh.

In the early stages, the extracted information was stored in text files. Later on, it was moved into a database (3.4).

3.4 Database

The database is used for storing metadata (3.3) and achieved results. In the early stages I was using text files, but that required synchronization (among nb, pc, lab, ufallab), so I decided to use a database. I wanted to use a key-value store87, but due to limitations (I could not install anything on a computer that would be permanently visible from the Internet), I decided to use MySQL88, which is available on my web hosting.

87http://en.wikipedia.org/wiki/NoSQL

3.4.1 Tables

There are two tables, w2c_alias and w2c_language. The table w2c_alias contains two columns, alias and iso. The purpose of this table is to make all scripts more user-friendly. Internally, ISO 639-3 codes are used, but for the user it is much easier to write ‘czech’ or ‘cs’ instead of ‘ces’.

The problem was that some language names are the same as ISO codes of other languages. For example, the En language89 has the ISO 639-3 code enc, but its name is the same as the ISO 639-1 code for English. It would be very confusing if ‘en’ were used and it did not mean English. For these reasons, aliases are filled in this order: ISO 639-3 codes, language name, name used on Wikipedia and local name. Now, when ‘en’ is used, it means English.

The second table, w2c_language, contains 3 columns, language, key and value, where language represents the ISO 639-3 code, and key and value are arbitrary strings of up to 30 characters for key and 255 characters for value.

3.4.2 Access

There are three ways to access the stored data: using the web interface, a simplified RESTful API90 or the script webAPI.sh.

The web interface is available at http://w2c.martin.majlis.cz/language/. It is possible to specify the language and key, and all corresponding values are returned. It is also possible to specify the output format, which can be:

• TXT - text output; columns are separated by tabs. This output may be easily processed with Unix command-line tools.

• XML - XML output

• JSON - JSON output91, which can be easily used in programs.

88http://www.mysql.com/ 89http://www.ethnologue.com/show_language.asp?code=enc 90http://en.wikipedia.org/wiki/REST 91http://en.wikipedia.org/wiki/JSON
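
A minimal Python sketch of querying the metadata interface and parsing its TXT output. The base URL is the one given above; the query parameter names ('language', 'key', 'format') and the exact layout of the TXT response are assumptions, not taken from the thesis.

import urllib.parse
import urllib.request

def get_metadata(language, key, base="http://w2c.martin.majlis.cz/language/"):
    # Fetch tab-separated metadata rows for a language/key pair (TXT output).
    params = urllib.parse.urlencode({"language": language, "key": key, "format": "txt"})
    with urllib.request.urlopen(base + "?" + params) as response:
        text = response.read().decode("utf-8")
    return [line.split("\t") for line in text.splitlines() if line]

# e.g. get_metadata("ces", "wiki url") would return the rows stored for Czech under that key.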


[Figure 3.2: Metadata — work flow. The diagram shows Wikipedia, SIL and Ethnologue feeding the scripts langList.sh and ethnologueParser.sh, whose output is loaded by fillLangDB into the DB; the scripts S1, S2 and S3 access the DB through webAPI.sh.]

The URLs provided by the web interface are also a part of the REST API. If a proper authentication token is used, values may be changed or new ones added.

The script webAPI.sh is a wrapper written in bash. It uses the REST API and its text output. This script is used by almost all programs.

3.4.3 Work Flow

The metadata flow is displayed in Figure 3.2. This section just connects information presented in Sections 3.3 and 3.4.

Metadata is automatically retrieved from the Internet with the scripts langList.sh and ethnologueParser.sh. The downloaded information is stored in temporary text files. These files are then processed with scripts in the fillLangDB directory. These scripts use webAPI.sh to insert data into the database. When any script (S1, S2, etc.) needs any information, it uses webAPI.sh. Some scripts also add new metadata; therefore the arrow between the scripts and webAPI.sh is bidirectional.

Using this metadata, it is very easy to create simple scripts. A script for building corpora from Wikipedia for languages that do not use the Latin script is shown in Example 1.

3.5 Wiki Corpus

The next step in building a web corpus was to construct the initial corpus. I decided to use Wikipedia (2.2.3), because it was widely used in other multilingual corpora and because I had previously worked with it. I constructed several tools, built several initial corpora and developed a work flow for building additional wiki corpora.

Example 1 Wiki corpora for languages not using the Latin script

    for l in `webAPI.sh GET null script | grep -v 'Lat' | cut -f1`; do
        url=`webAPI.sh GET $l 'wiki url' | cut -f3`;
        if [ ! -z $url ]; then
            wikiCorpora.sh -c 100 $l;
        fi;
    done;

3.5.1 Tools

At the beginning I used the script wikiMiniCorpora.sh for downloading Wikipedia pages. It is a wrapper for crawlerSimple.sh. It is possible to specify how many web pages should be downloaded from Wikipedia, and the script then downloads them. Pages with a colon or a number in their title are skipped. Pages with a colon are typically special pages (Talk:*, Wikipedia*, Special:*, User:) and pages with a number are very often 'Date pages'92.

When I had downloaded a few Wikipedias, I found out that this approach was insufficient for at least two reasons. Firstly, there were many similar, automatically generated sentences, e.g. "This article needs additional citations for verification. Please help improve this article by adding reliable references."93 Secondly, it was quite slow, because the crawler had to wait a few seconds after each request, which dramatically increased the total execution time.

The first problem was solved by the script cleanFile.sh, which is described in Section 3.10. To overcome the second one I developed the wikiCorpora.sh script.

The script wikiCorpora.sh downloads the Wikipedia dumps directly (provided by Wikimedia). On the one hand, this significantly improved processing speed; on the other hand, it brought problems with parsing the Wikipedia special syntax. I used the CPAN module Text::MediawikiFormat94 to convert the wiki format to HTML and then to plain text. I found out that this module did not work correctly, so I used a slightly different approach. At the beginning, all links, tables and special syntax are removed. This preprocessed text is passed to the Text::MediawikiFormat module to create HTML output, from which only paragraphs are preserved and all tags are removed. Then, duplicate lines are removed with the script cleanFile.sh.

92 e.g. http://en.wikipedia.org/wiki/1918
93 http://www.google.com/search?q=site%3Aen.wikipedia.org+"This+article+needs+additio
94 http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/
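The cleaning pipeline can be approximated as follows. The real script relies on Text::MediawikiFormat for the wiki-to-HTML step; the sketch below replaces the whole pipeline with simplified regular expressions, so it is only an illustration of the idea, not the code used for the corpus.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Approximate wiki-to-text cleaning: strip tables, templates and links,
    # keep paragraph-like lines, then remove any remaining tags.
    sub wiki_to_text {
        my ($wiki) = @_;
        $wiki =~ s/\{\|.*?\|\}//gs;                   # tables  {| ... |}
        $wiki =~ s/\{\{.*?\}\}//gs;                   # templates {{ ... }}
        $wiki =~ s/\[\[[^\]|]*\|([^\]]*)\]\]/$1/g;    # [[target|label]] -> label
        $wiki =~ s/\[\[([^\]]*)\]\]/$1/g;             # [[target]] -> target
        $wiki =~ s/\[[^\]\s]+\s+([^\]]*)\]/$1/g;      # external links -> label
        $wiki =~ s/'{2,}//g;                          # bold / italic markup

        # Keep only paragraph-like lines (no headings, lists or leftover markup).
        my @paragraphs = grep { /\S/ && !/^[=*#:;{|!<]/ } split /\n/, $wiki;

        s/<[^>]*>//g for @paragraphs;                 # strip remaining tags
        return join "\n", @paragraphs;
    }

    print wiki_to_text("== Heading ==\n'''Bold''' text with a [[link|label]].\n* list item\n");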

3.5.2 Corpora

For prototyping, I used a corpus built from 5,500 articles for each language with at least 100 thousand articles. Later on, I extended this corpus to languages with at least 5 thousand articles. This corpus contains 115 languages and has the database key 'data wiki_5500'95.

For my main work, I used a corpus of 20,000 articles from Wikipedias with at least 5 thousand articles. This corpus has the database key 'data wiki_20000'96.

Both corpora are available as plain text files, vertical files and frequency files of trigrams and quadgrams.

3.5.3 Work Flow

The work flow for building the Wiki Corpus is displayed in Figure 3.3. The first step is the execution of the script download-wikipedias.sh with a specified number of required pages and a minimal article count. This script executes wikiCorpora.sh for each language, and the created sub-corpora are stored on disk.

It is possible to extend this process by executing the script processFiles.sh, which iterates over the languages included in the downloaded corpus. For each language, the script processFile.sh is executed. This script removes duplicities with cleanFile.sh and generates a vertical file using verticalFile.sh. Frequency lists of n-grams are constructed with frequencyList.sh. All created files are uploaded to the hosting and the URLs of these files are added to the database.

95 http://w2c.martin.majlis.cz/language/?lang=&key=data+wiki_5500*&format=TXT
96 http://w2c.martin.majlis.cz/language/?lang=&key=data+wiki_20000*&format=TXT


[Figure: download-wikipedias.sh runs wikiCorpora.sh to download data from Wikipedia into the file system; processFiles.sh runs processFile.sh, verticalFile.sh, cleanFile.sh and frequencyList.sh, and the results are uploaded to the hosting via webAPI.sh]

Figure 3.3: Wiki Corpora — work flow

3.6 Language Recognition

Language recognition is one of the crucial components of the project. Existing solutions, described in Section 2.7, are usually able to recognize around 10 languages. To achieve the goal, my language detector had to be capable of recognizing more than ten times as many languages.

3.6.1 Prototype

I started language detection with simple prototyping. I built a Wikipedia corpus for languages with at least 100 thousand articles (31 at that time) and used two thousand articles from each. I used the simplest method, a character n-gram model, trained on full sentences without segmentation or any preprocessing. For example, 'I am' would create the 3-grams ' I', ' I ', 'I a', ' am', 'am ' and 'm '. I trained this model for n from 1 to 5 and selected n-grams from the top of the frequency list until p percent of the total n-gram count was covered. This means that for the unigram frequency list 'a': 5, 'b': 2, 'c': 1 and p equal to 0.5, only 'a' would be chosen. The achieved results are shown in Table 3.1. It seemed that 4-grams or longer would provide sufficient results, and I considered the problem solved.
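The cut-off can be illustrated with a short Perl sketch; the hash of counts and the ratio p are the only inputs, and the example reproduces the unigram case mentioned above. It is an illustration of the selection rule, not the prototype code itself.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy frequency list: unigram => count (the 'a':5, 'b':2, 'c':1 example above).
    my %freq = ( 'a' => 5, 'b' => 2, 'c' => 1 );
    my $p = 0.5;     # keep n-grams from the top until p of all occurrences are covered

    my $total = 0;
    $total += $_ for values %freq;

    my @kept;
    my $covered = 0;
    # Walk the frequency list from the most frequent n-gram downwards.
    for my $ngram (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
        last if $covered >= $p * $total;
        push @kept, $ngram;
        $covered += $freq{$ngram};
    }
    print "kept: @kept\n";   # prints "kept: a" for p = 0.5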

3.6.2 Full Scale

In the next step, I ran this experiment in full scale with more than one hundred languages, and I found out that accuracy dropped significantly. The reason was that for every major language there is a set of related languages. For English, it was Welsh, Irish, Scottish Gaelic, Scots, etc.; for Spanish, it was Portuguese, Occitan, Catalan, Asturian, Galician, etc.; for Russian, it was Bulgarian and Ukrainian. The hardest group was Croatian, Serbo-Croatian and Bosnian.

Ratio   1-gram   2-gram   3-gram   4-gram   5-gram
0.05    0.021    0.403    0.891    0.992    0.999
0.10    0.022    0.623    0.969    0.999    0.999
0.15    0.037    0.790    0.989    0.999    0.999
0.20    0.117    0.880    0.992    0.999    0.999
0.25    0.222    0.918    0.992    0.999    0.999
0.30    0.285    0.907    0.993    0.999    0.999
0.35    0.350    0.930    0.993    0.999    0.999
0.40    0.219    0.903    0.993    0.999    0.999

Table 3.1: Language recognition for the first 31 languages

For example, the word 'goat' is written as 'cabra' in Occitan, Catalan, Spanish and Portuguese, and as 'capra' in Latin, Italian and Romanian. The word 'bridge' is written as 'pont' in Occitan, Catalan and French, and as 'ponte' in Latin, Italian and Portuguese.

The full-scale experiment used 20 thousand articles from Wikipedias with at least 5 thousand articles. One half was used for training, one third as heldout data and the rest for testing.

The main problem was that a minor language was often reported instead of its associated major language. I tried different selection strategies: the top X n-grams, where X is either a fixed count or a percentage, or selecting n-grams until they covered some ratio of all occurrences (as in the initial experiment). I also tried using segmented or even lower-cased sentences, and using the heldout data to normalize the score by the score achieved per kilobyte. Some combinations of these methods preferred the major languages and some preferred the minor ones. To fix this, I boosted 'the right ones' manually.

Because running the full-scale experiment took more than a day, I used an iterative approach. From the last known result, I picked a problematic pair, e.g. English and Scots, selected a method and tuned its parameters to achieve good results. Then I added a few more languages from other families (Spanish, Catalan, Russian and Bulgarian) and re-ran the experiment. Very often this ended with satisfactory results, so I ran the experiment with the full data. However, that very often ended in failure; for example, most of the Occitan texts were recognized as Spanish. So I added Occitan to the small-scale experiment and tried to fix it with minor tuning. When it was fixed, I ran the experiment in full scale again and found that there was trouble with yet another language.


Language   Model 1 Accuracy   Model 1 Mistakes           Model 2 Accuracy   Model 2 Mistakes
Major      0.8                Min 1: 0.1, Min 2: 0.1     0.99               Min 1: 0.01
Min 1      1                                             0.7                Maj: 0.3
Min 2      0.9                Major: 0.1                 0.7                Maj: 0.3

Table 3.2: Language recognition — model selection

For example, when the top 5% of 4-grams or more than 2000 4-grams were chosen, then all Russian texts were recognized as Bulgarian (all Bulgarian was recognized as Bulgarian). When I decreased the number of 4-grams to 200, only 4% of Russian texts were recognized as Bulgarian (Bulgarian was still Bulgarian). When I decreased the number of 4-grams to 100, all samples were recognized perfectly.

Decreasing the number of n-grams dramatically increased the performance of the recognizer, but a few languages were still recognized in place of the major languages, so I manually boosted English (eng), French (fra) and Russian (rus) by 10%.

3.6.3 Language Model Selection

Selecting the language model was the only step where manual intervention was needed. A typical situation is displayed in Table 3.2, where two models are compared. The language Major represents a major language and the languages Min 1 and Min 2 represent related minor languages. The first model has a higher mean and median accuracy, so it looks better. However, if the first model is chosen, then during the harvesting of language Min 1 a lot of pages in the Major language will be discovered and 10% of them will be recognized as Min 1. Because the amount of available text for Major and Min 1 is expected to differ by many orders of magnitude, the corpus of Min 1 would effectively be built from texts in Major. The second model, on the other hand, throws away 30% of the pages written in languages Min 1 and Min 2, but the resulting corpus will still be cleaner than the one produced by the first model. That is why I preferred the second model.

3.6.4 Final Version

The final version of my language recognizer was constructed in the following way. The Wiki Corpus was divided into two parts, for training and testing. The first five sixths were used for training and the remaining data was used for testing. The test data for each language was divided into 500 equally large chunks (measured in words). If a chunk was longer than 500 words, the extra words were deleted.

Preprocessing

All input texts, both for training and testing, were processed in the following way:

• All punctuation characters were removed. I only used the character class [:punct:]97, because adding the Unicode punctuation characters used in non-Latin languages significantly increased the processing time98.

• All digits were removed, using the character class [:digit:]. (If instead every word containing a digit had been deleted, almost everything would have been deleted in languages without spaces, e.g. Japanese (jpn) or Chinese (zho).)

• The input text was divided into words, separated by the character class [:space:].

For example, the input sentence 'A, b568c de.' is segmented into the words 'A', 'b', 'c' and 'de'. Each word is separately divided into 4-grams with padding; e.g. 'I' produces the four 4-grams '   I', '  I ', ' I  ' and 'I   '.
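The preprocessing and 4-gram extraction can be sketched as follows. Treating the removed characters as word separators is an assumption that matches the 'A, b568c de.' example above; the function name and the exact regular expressions are illustrative only.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Remove punctuation and digits, split into words and produce padded 4-grams.
    sub word_4grams {
        my ($text) = @_;
        # Assumption: removed characters act as word separators (see the example above).
        $text =~ s/[[:punct:][:digit:]]+/ /g;
        my @ngrams;
        for my $word (split /[[:space:]]+/, $text) {
            next unless length $word;
            my $padded = '   ' . $word . '   ';             # pad with 3 spaces on each side
            push @ngrams, substr($padded, $_, 4) for 0 .. length($padded) - 4;
        }
        return @ngrams;
    }

    my @g = word_4grams('A, b568c de.');
    print map { "<$_>\n" } @g;   # 4-grams of the words 'A', 'b', 'c' and 'de'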

Training

The probability of each 4-gram is computed on the training data and only the 100 most probable 4-grams are preserved. These probabilities are normalized to sum up to 1. The probabilities for English (eng), French (fra) and Russian (rus) are boosted by 10%, so they sum to 1.1. All these probabilities are treated as scores and merged into a single model.

97 http://www.gnu.org/software/grep/manual/grep.html
98 Grep with [:punct:] executed on a 20MB file took 2 seconds; when the character 'ˆ' was added, processing took 10 minutes.

(a) Training data
Lang   Training data
L1     bbbeaccdcdaabbbbeddc
L2     bbacceeceaedcdeabbeb

(b) Training probabilities
Lang   a      b      c      d      e
L1     0.15   0.35   0.20   0.20   0.10
L2     0.15   0.25   0.20   0.10   0.30

(c) Language model
Uni   Lang   Score      Uni   Lang   Score
b     L1     0.43       c     L2     0.27
b     L2     0.33       d     L1     0.29
c     L1     0.29       e     L2     0.40

(d) Detection — 'aabbecdec'
Lang   Computation                                      Score
L1     0.00+0.00+0.43+0.43+0.00+0.29+0.29+0.00+0.29     1.73
L2     0.00+0.00+0.33+0.33+0.40+0.27+0.00+0.40+0.27     2.00

Table 3.3: Language recognition — example

Detection

During detection, the input text is preprocessed and divided into 4-grams. Scores for each language are summed up and the language with the highest score is the winner.

Example

A simple example with two languages, a unigram language model and only the 3 most probable unigrams is shown in Table 3.3. The training data (a) is used to compute the probabilities (b). Only the 3 most probable unigrams for each language are preserved, normalized and stored in the language model (c). Language detection for a sample input string is presented in part (d); the input string 'aabbecdec' is recognized as L2.
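The toy example can be reproduced with a few lines of Perl. The sketch follows the procedure described above (top-k truncation, renormalization, score summation) rather than the actual detector code, so the rounded scores in Table 3.3 may differ slightly from its output.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(sum);

    # Toy training data from Table 3.3 (a).
    my %train = ( L1 => 'bbbeaccdcdaabbbbeddc', L2 => 'bbacceeceaedcdeabbeb' );
    my $top_k = 3;     # keep only the 3 most probable unigrams per language

    # Train: relative frequencies, truncated to the top k and renormalized.
    my %model;
    for my $lang (keys %train) {
        my %count;
        $count{$_}++ for split //, $train{$lang};
        my @top  = (sort { $count{$b} <=> $count{$a} } keys %count)[0 .. $top_k - 1];
        my $kept = sum map { $count{$_} } @top;
        $model{$lang}{$_} = $count{$_} / $kept for @top;
    }

    # Detect: sum the scores of all unigrams, pick the language with the highest sum.
    sub detect {
        my ($text) = @_;
        my %score;
        for my $lang (keys %model) {
            $score{$lang} = 0;
            $score{$lang} += $model{$lang}{$_} // 0 for split //, $text;
        }
        return (sort { $score{$b} <=> $score{$a} } keys %score)[0];
    }

    print detect('aabbecdec'), "\n";   # prints L2, as in Table 3.3 (d)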

3.7 URL Seeds

At the beginning, I used external links from Wikipedia. These external links are stored in SQL dumps provided by Wikimedia. For retrieving these links, I used the script wikiExternalLinks.sh. I found out that the vast majority of these links could not be used, because the pages no longer existed, or they were specialized websites or databases, or they were written in English, etc.

So I decided to use Google Search. When the user agent in the HTTP request header contained the word 'bot', Google returned HTTP status code 403 Forbidden, so I used user agent strings used by web browsers.

I used the trigram frequency file from the Wiki Corpus to generate search phrases. All trigrams containing numbers or punctuation were removed, and from the remaining list the trigrams on the lines between the 2nd and 5th percentile were chosen. I sent 30 queries to Google and stored the first hundred links.
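A rough sketch of this phrase selection is shown below. The frequency file is assumed to contain one trigram per line, sorted from the most to the least frequent, which may differ from the actual file format; the file name is hypothetical.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Pick search phrases from a trigram frequency list (one trigram per line,
    # most frequent first); a sketch of the selection rule described above.
    my $file = shift // 'trigrams.freq';
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my @trigrams = grep { !/[[:punct:][:digit:]]/ } <$fh>;
    close $fh;
    chomp @trigrams;
    die "no usable trigrams in $file\n" unless @trigrams;

    # Take the lines between the 2nd and 5th percentile of the filtered list.
    my $n     = scalar @trigrams;
    my $from  = int(0.02 * $n);
    my $to    = int(0.05 * $n);
    my @seeds = @trigrams[$from .. $to];

    print "$_\n" for @seeds[0 .. ($#seeds < 29 ? $#seeds : 29)];   # up to 30 search phrases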

3.8 W2C Builder

The W2C Builder is a distributed corpus builder capable of running on multiple machines. For building the web corpus, several components are needed:

• crawler — receives a URL and returns HTML code

• parser — receives HTML code and returns text

• detector — receives text and returns a language code

• master — coordinates the work of all components mentioned above

The initial plan anticipated multiple masters running in the computer laboratory that would coordinate all workers. However, a few aspects had to be considered. Not all workers use the same resources: parsing requires CPU, while language detection requires CPU and memory for storing the language model. It is also a waste of resources to transfer data over the network when it can be completely processed on a single computer. Hence, I changed my plan, and instead of a single master for the whole laboratory, a master was executed on every machine. Support scripts were used for master execution and for storing the overall results. Even though the W2C Builder is capable of running on multiple machines, it was never really used this way, because it would not provide any benefit.

3.8.1 Overview

The W2C Builder consists of several bash and Perl scripts that cooperate with each other. The scripts crawler.pl, parser.pl and detector.pl are workers responsible for crawling, text extraction and language recognition. The script create-corpora.pl is the master; the scripts keeper.sh and charter.sh are responsible for restarting workers and drawing charts. The script create-corpora.sh is only a wrapper that is executed by the user. The scripts are displayed in Figure 3.4.

[Figure: the master create-corpora.pl exchanges tasks with the crawler.pl, parser.pl and detector.pl workers; create-corpora.sh, keeper.sh and charter.sh are support scripts]

Figure 3.4: W2C Builder

For building a web corpus with 10 million words in Czech, it is sufficient to execute ./create-corpora.sh ces 10M.

3.8.2 Configuration

The system configuration is stored in an XML file which specifies how many workers are to be executed. For example, the configuration file in Example 2 specifies that the master runs on localhost and listens on port 9001. One crawler, one parser and one detector are to be executed on localhost. Each task will contain 40 URLs. All workers will be executed with nice 19.

3.8.3 create-corpora.sh

The script create-corpora.sh is the main script executed by the end-user. For example, the command create-corpora.sh ces 10M creates a corpus with 10 million words for Czech. This script is responsible for argument checking, i.e. whether the specified language code is available in the language recognizer.


Example 2 W2C Builder — configuration file

[configuration values: master on localhost, port 9001; one crawler, one parser and one detector on localhost; directories /tmp/corpora-tmp/, /tmp/corpora-res/ and /tmp/corpora-workers/; task size 40 URLs; workers executed with nice -n 19]

When the correct language code is used, the language model and the corresponding trigram frequency lists are downloaded from the hosting. The URL seed (3.7) is constructed from the downloaded frequency lists. Then, the scripts keeper.sh and charter.sh are executed in the background, followed by the master create-corpora.pl. When the master finishes, keeper.sh and charter.sh are killed and the downloaded results are packed with the script packData.sh.

3.8.4 create-corpora.pl

The script create-corpora.pl is the master script for the W2C Builder and works as a server for all workers.

During the initialization phase, the script reads the configuration file, inserts the initial URL seed into the database and builds a distribution archive. The path to the configuration file and the file with the initial URLs are passed as arguments. The distribution archive is a gzipped tar archive with the source code necessary for worker execution.

All URLs are stored in an SQLite database99. I decided to use this database because it is widely available on all systems, and therefore it does not increase the requirements. I would have preferred to use a NoSQL database, but I did not find any that was widely available without additional configuration. The same problem applied to traditional databases like MySQL100 and PostgreSQL101.

Then, the distribution archives are copied to the nodes specified in the configuration file and the corresponding workers are executed.

Logging is important for debugging and run analysis of complex programs, so I decided to use log4perl102, which is compatible with log4j103. Apache log4j is widely used in applications written in Java, but there are also ports to other languages. The main advantage of this widely used format is the availability of tools for log analysis104.

99 http://www.sqlite.org/
100 http://www.mysql.com/
101 http://www.postgresql.org/
102 http://mschilli.github.com/log4perl/
103 http://logging.apache.org/log4j/1.2/
104 http://en.wikipedia.org/wiki/Log4j#Log_Viewers

Tasks

A task is a small unit of work which is assigned to a waiting worker. Tasks are gzipped tar archives, designed in such a way that the output from the preceding worker in the processing pipeline is the input for the following worker. The main file in the archive is called the protocol; its columns are called attributes. Each row contains information about one processed URL.

The crawler task contains only a protocol with URLs. The URLs are read from the database; when a URL is chosen, it is marked as 'in progress'. The crawler downloads the URLs and fills in the attributes actual time, the URL's MD5 hash, HTTP status code, base URL, charset and size. Downloaded files are added to the archive in the form of urls-md5.html.

The parser task is the crawler's output archive. It reads the protocol and searches for URLs with the correct attributes (HTTP status, mime-type). If a correct URL is found, the stored HTML file is processed. Links are stored in the file urls-md5.links, the text is saved to the file urls-md5.txt, and the attributes for the number of links, text size in characters and text size in words are filled in.

The detector task is the parser's output archive. It reads the protocol and searches for URLs with the correct attributes (text size, number of links). If a correct URL is found, its language is recognized and stored in the protocol.

When the server retrieves a result from any detector, it reads the protocol and searches for URLs in the target language. If a URL is found, all links are added to the database and the text is appended to the corpus. The attributes of all URLs are stored in the database and the URL itself is marked as finished.

When a new URL is added to the database, it is assigned a random number. When URLs are selected for a new crawler task, the first N according to this random number are chosen. The purpose of this is to reduce the probability of all the URLs in the task being from the same domain.
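A minimal sketch of this random-ordering trick using DBI and SQLite (which the builder uses for its URL store) is shown below; the table and column names (urls, url, rnd, state) are hypothetical, not the builder's actual schema.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Store URLs with a random number and select them in that order for a task.
    my $dbh = DBI->connect('dbi:SQLite:dbname=urls.db', '', '', { RaiseError => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, rnd REAL, state TEXT)');

    # Every new URL gets a random number on insert.
    my $ins = $dbh->prepare('INSERT OR IGNORE INTO urls (url, rnd, state) VALUES (?, ?, ?)');
    $ins->execute($_, rand(), 'new') for @ARGV;

    # A new crawler task takes the first N unprocessed URLs ordered by the random
    # number, which makes it unlikely that the whole task comes from one domain.
    my $n   = 40;
    my $sel = $dbh->prepare('SELECT url FROM urls WHERE state = ? ORDER BY rnd LIMIT ?');
    $sel->execute('new', $n);
    while (my ($url) = $sel->fetchrow_array) {
        print "$url\n";
    }
    $dbh->disconnect;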

This design allows the reprocessing of finished tasks. If the text extraction or the language detection is improved, all finished tasks can be used as input for the parser or the detector again.

URL Preprocessing and Filtering

All URLs are normalized105 to reduce obvious duplicity at the URL level; for example, the URLs HTTP://www.Example.com/ and http://www.example.com/ are equal.
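A small illustration of such normalization using the CPAN URI module is shown below; the real builder may normalize more aggressively, so this is only an approximation.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use URI;

    # URI::canonical lower-cases the scheme and host and removes the default port.
    for my $raw ('HTTP://www.Example.com/', 'http://www.example.com:80/') {
        my $norm = URI->new($raw)->canonical;
        print "$raw -> $norm\n";     # both normalize to http://www.example.com/
    }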

The URL filtering was essential for increasing the yield of the crawling. In the early versions, I started with manually written regular expressions for the most common file types (doc, docx, xls, xlsx, etc.) which should be ignored. After a few experiments, I found out that this was not sufficient, because a lot of links pointed to advertisement websites. I thus decided to use a list of known advertising websites106 as a blacklist. However, further investigation revealed that there are also links to bookmarking services (digg, stumble, etc.) and social services (twitter, facebook) which should be ignored as well, so I abandoned this idea.

The top-level domain names can also be used for filtering. When the task is to build a Czech corpus, all pages under the TLD '.cz' are good candidates (Czech is used in the Czech Republic, whose TLD is '.cz'), but pages under '.de' (Germany) are not. It would be feasible to create such rules for a few major languages, but not for hundreds. Furthermore, domains under the 'right' TLD are not always worth crawling, for example search results, catalogues, advertisement servers, etc.

To solve this problem, I used an additional database with two tables, one for TLDs and one for domains. These tables contain columns for the TLD (or domain name), the number of downloaded URLs, the number of valid URLs, the ratio of valid URLs (in percent) and a flag indicating whether the TLD or domain is ignored.

When a URL was processed, its TLD and domain name were extracted. The number of downloads for this TLD and domain was increased. If the URL was in the target language, the number of valid URLs was also increased. Then, the ratios were updated. If a TLD had been downloaded more than 20 times and had less than 10% of valid URLs, it was marked as ignored. The same approach was used for domains, but at least 40 downloads were required. The 10% threshold looks very low (it should be higher), but I found out that when this ratio was higher, a lot of domains were banned too quickly. Complex websites contain many sections with categories, tags, archives, lists of articles by date, author, etc. A typical situation was that a page with connected text was downloaded first, but many links from this page pointed to pages with lists of articles (tags, sections, etc.) without connected text, so the domain got immediately marked for ignoring.

105 http://en.wikipedia.org/wiki/URL_normalization
106 https://easylist.adblockplus.org/en/

When a whole task was processed, the domains newly marked as ignored were used to mark all their unprocessed URLs in the database as invalid (so they would not be chosen). Before any URL was added to the database, it was checked whether it came from an ignored TLD or domain.

This filtering roughly doubled the processing speed.
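The bookkeeping behind this domain filter can be sketched as follows; the thresholds follow the text (at least 40 downloads, less than 10% valid), while the in-memory hash and the helper record_url are illustrative only, since the builder keeps these counters in its database.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %domain;   # name => { downloaded => n, valid => n, ignored => 0/1 }

    # Update the per-domain counters and ban the domain once enough pages were
    # seen and the yield stays below 10%.
    sub record_url {
        my ($name, $in_target_language) = @_;
        my $d = $domain{$name} ||= { downloaded => 0, valid => 0, ignored => 0 };
        $d->{downloaded}++;
        $d->{valid}++ if $in_target_language;
        my $ratio = $d->{valid} / $d->{downloaded};
        $d->{ignored} = 1 if $d->{downloaded} >= 40 && $ratio < 0.10;
        return $d->{ignored};
    }

    # Example: 50 pages from one domain, only 2 of them in the target language.
    record_url('example.com', $_ <= 2 ? 1 : 0) for 1 .. 50;
    print "example.com ignored: $domain{'example.com'}{ignored}\n";   # prints 1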

3.8.5 crawler.pl

The script crawler.pl is responsible for downloading web pages, using the CPAN package LWPx::ParanoidAgent. Downloading a URL consists of several steps. The HTTP header is read, and the HTTP status code and mime-type are extracted. Only pages with the mime-type text/html and a status code 2XX are processed further. In the next step, the content charset is retrieved. The complete webpage is converted to UTF-8 encoding with the package Text::Iconv. If the conversion fails or empty content is returned, the processing of this URL is stopped. The converted webpage is normalized by tidy107 with the options -utf8 -asxml -b -q.

3.8.6 parser.pl

The script parser.pl is used for extracting texts and links from web pages. I used the CPAN module HTML::Parser for parsing. The parser extracts only the texts of paragraphs (inside <p> elements). The text from a paragraph is added to the result if it is considered valid. A valid paragraph:

• contains at least 8 words; this omits poorly written lists and headers such as 'Item 1', 'Item 2', ...

• contains at least twice as many words as links; this omits menus such as 'Menu 1', ...

• does not contain too much punctuation (less than 66% of the words).

All these constants were empirically selected during the initial phases of development.

107 http://tidy.sourceforge.net/

During testing, I found out that the number of poorly written web pages was much higher than I had expected, so usually only a very small amount of text was selected. This was caused by pages using div tags instead of p, or by dividing long texts only with br tags. When the extracted text was smaller than 20% of the complete webpage size, all div and td tags were treated as p.
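A sketch of this paragraph filter with the thresholds from the list above is shown below; the words-to-links criterion is interpreted as requiring at least twice as many words as links, and the exact counting rules of parser.pl may differ.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Decide whether an extracted paragraph should be kept.
    sub is_valid_paragraph {
        my ($text, $link_count) = @_;
        my @words = grep { length } split /\s+/, $text;
        my $punct = grep { /^[[:punct:]]+$/ } @words;   # tokens that are only punctuation
        return 0 if @words < 8;                         # poorly written lists, headers
        return 0 if @words < 2 * $link_count;           # menus: almost every word is a link
        return 0 if @words && $punct / @words >= 0.66;  # mostly punctuation
        return 1;
    }

    print is_valid_paragraph('Item 1', 0), "\n";   # 0: too short
    print is_valid_paragraph('This paragraph has enough connected text to pass the filter.', 0), "\n";   # 1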

3.8.7 detector.pl

The script detector.pl is responsible for the language detection of downloaded texts. At the beginning, it receives the language model from the master. Only texts with at least 50 words (or 300 characters) are recognized. Language recognition is described in Section 3.6.

3.8.8 controller.pl

The program controller.pl is used for controlling and monitoring. The main commands are:

• nodes — returns a list of nodes along with their statistics

• status — returns information about progress

• terminate — terminates create-corpora.pl

• addCrawler — executes a new crawler on a specified computer

• addParser — executes a new parser on a specified computer

3.8.9 keeper.sh

The script keeper.sh is crucial for keeping the W2C Builder running. If fewer workers are running than specified in the configuration file, it executes the missing workers. If the job queue is empty, it terminates create-corpora.pl. This script also removes zombie processes.

3.8.10 charter.sh

The script charter.sh is responsible for collecting statistics, e.g. queue sizes, the number of downloaded URLs, the amount of processed text, the number of running workers, etc. These statistics are stored in a file in which each column has a specific meaning. This file is parsed and various charts are generated.

3.8.11 create-corpora-local.sh

The script create-corpora-local.sh is simply a wrapper for create-corpora.sh. I found out that in the computer laboratory it is not possible to connect to localhost and that the hostname has to be specified. Therefore, this script creates a new configuration file config-HOSTNAME.xml from the main configuration file and replaces all occurrences of the string 'localhost' with the actual hostname. The path to this configuration file is used as an argument for create-corpora.sh.

3.9 Distributed Corpus Building

To maximize the utilization of the computer laboratory, a distributed approach was applied. In the early stages, I was executing jobs manually; a typical work flow looked like this:

• Find a node where the downloading finished or crashed.

• Connect to that node.

• Copy the result to ufallab (type the password (3.1)).

• Delete results and temporary files.

• Execute a new job.

The problem with this approach was that it required constant supervision, because some interaction was needed every 15 minutes. I realized that this method was not efficient in terms of human work, so I started to work on a fully automatic process. The server ufallab runs the Apache108 web server with the option post_max_size109 set to 8MB. I used 7-zip110 to compress the result and to create 4MB volumes. These volumes were then encoded to base64111 and uploaded via the HTTP POST method to ufallab, where a simple PHP script stored the received volumes on the disk and created a bash script for their processing. This included decoding, extraction, moving to the folder with results and deletion of the received files. This solution did not require human interaction, but it was very inefficient. The amount of data downloaded during six hours in the computer laboratory took almost one day to compress and upload and another day to merge and decompress.

108 http://httpd.apache.org/
109 http://php.net/manual/en/ini.core.php#ini.sect.data-handling
110 http://www.7-zip.org/
111 http://en.wikipedia.org/wiki/Base64

My final solution was a semi-automatic process that required typing in the password only a few times a day. It consists of two scripts — fill-corpora-quotas.sh and copy-results-to-single-node.sh.

3.9.1 fill-corpora-quotas.sh

The purpose of the script fill-corpora-quotas.sh is to build a corpus of a specified size for all languages passed as arguments. It processes the languages sequentially and reads information about the actual corpus size112, from which it retrieves the amount of data already downloaded for each language. The size of the jobs already running for the language is added. If the total size is smaller than the quota, a free node is located and the W2C Builder for the language is executed there. If a free node is found and the builder executed, the size of already running jobs is increased accordingly.

For example, if 85MW have already been downloaded for language X and the limit is 100MW, only 2 jobs totalling 20MW will be executed.

3.9.2 fill-corpora-quotas-wrapper.sh

The script fill-corpora-quotas-wrapper.sh executes the script fill-corpora-quotas.sh for each language.

3.9.3 copy-results-to-single-node.sh

The script copy-results-to-single-node.sh copies all downloaded data to a single computer in the computer laboratory, from which it can be manually uploaded to ufallab.

112 http://ufallab.ms.mff.cuni.cz/~majlis/info.txt — generated by generate-stats.sh

3.9.4 merge-results.sh

The script merge-results.sh is executed on the server ufallab. This script takes all partial results and merges them into a single corpus. Each language is treated separately, so when new results for a single language are added, only the data related to this language is processed.

In the early versions, duplicate content was removed while the texts were being merged. But when the collected data grew, it took almost three days to merge the results that were downloaded in the computer laboratory during a single day. It would not have been feasible to distribute the merging to the computers in the laboratory, because the bottleneck was the single disk on ufallab.

3.9.5 generate-stats.sh

The script generate-stats.sh is responsible for generating statistics about the downloaded data. The statistics are generated in a machine-readable form113 for other scripts and in a human-readable form114.

3.10 Duplicity Detection

When I was processing Wikipedia, I found out that many pages contained the same paragraphs, even though the pages were not duplicates. It would be wasteful to throw away a complete web page, especially for minor languages, so instead of a page-oriented approach, I decided to use a paragraph-oriented one. I created a frequency list of all paragraphs and removed the first percentile of its lines, which removed the most duplicated lines. I also removed the last percentile of lines, because they contained malformed lines (", , , , , , ,", ") ))", etc.). The remaining lines were shuffled. This was done by the cleanFile.sh script.
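The line-level cleaning can be sketched as follows; cleanFile.sh itself is a shell script, so this Perl version only illustrates the idea (trim the top and bottom percentile of the paragraph frequency list, shuffle the rest).

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    # Read paragraphs (one per line), build a frequency list, drop the first and
    # last percentile of that list and print the remaining lines in random order.
    my %count;
    my @paragraphs = <STDIN>;
    chomp @paragraphs;
    $count{$_}++ for @paragraphs;

    # Frequency list sorted from the most to the least frequent paragraph.
    my @by_freq = sort { $count{$b} <=> $count{$a} } keys %count;
    my $n    = scalar @by_freq;
    my $trim = int($n / 100);                      # 1 percentile at each end
    my @kept = @by_freq[$trim .. $n - 1 - $trim];

    print "$_\n" for shuffle @kept;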

Then I realized that removing duplicities on the line level could also solve the problem with spam pages and spam comments.

A good position in search engine results is now crucial for business success. There are thousands of pages trying to sell the same product, but users usually click only on the top few links. Therefore, spammers try to manipulate the search engine indexing (a technique called spamdexing115).

113 http://ufallab.ms.mff.cuni.cz/~majlis/info.txt
114 http://ufallab.ms.mff.cuni.cz/~majlis/

They build link farms116 or scraper sites117: automatically generated websites consisting of tightly-knit pages referring to each other. Their content is typically generated from Wikipedia or other publicly available resources. To trick the search engines, these websites do not contain exact copies of the original texts, but rather only mixed fragments. These spamdexing techniques may cause problems during crawling. If a breadth-first approach is used, the crawler may get stuck in such a farm. It may also fool the duplicity detection.

Another technique used by spammers is blog spam118, where bots comment on blog posts. These comments contain links to the spammer's website to increase its popularity. Projects like Project Honey Pot119 or Akismet120 catch millions of spam comments every day. Spam in comments may also be a source of duplicities and therefore decrease the corpus quality. When a blogger writes a post on his or her blog in language X, the text is valuable for the corpus. Later, when a few spam comments are attached, the article will still be recognized as language X, but it will not be as valuable, because it will also contain some English sentences. When many such articles are added, the same comments may be present many times.

Removing duplicate lines may also fix the problem with incorrectly recognized boilerplate code.

For these reasons I decided to remove duplicities on the line level instead of the page level.

3.11 W2C Corpus

The W2C Corpus was one of the main goals of this thesis. When I reached the quota of 100 million words for many languages, I started cleaning the downloaded material. For the construction of the clean corpus, the following scripts were used: process-results.sh, process-results-wrapper.sh and process-results-overview.sh.

115 http://en.wikipedia.org/wiki/Spamdexing
116 http://en.wikipedia.org/wiki/Link_farm
117 http://en.wikipedia.org/wiki/Scraper_site
118 http://en.wikipedia.org/wiki/Spam_in_blogs
119 http://www.projecthoneypot.org/statistics.php
120 http://akismet.com/

3.11.1 process-results.sh

The script process-results.sh prepares the corpus for a single language. It includes downloading the raw corpus from ufallab, computing various statistics for the Wiki Corpus, estimating the size of the Internet, duplicity reduction and computing statistics for the W2C Corpus.

3.11.2 process-results-wrapper.sh

The script process-results-wrapper.sh executes the script process-results.sh for each language included in the raw web corpus.

3.11.3 process-results-overview.sh

The script process-results-overview.sh generates statistics about the cleaned corpus. The statistics are generated in a machine-readable form121 for other scripts and in a human-readable form122.

3.12 Corpus Distribution

At the end, the final distribution is compiled. Two scripts are responsible for the distribution process — build-package.sh and build-package-wrapper.sh.

The W2C-97-10 Corpus was extracted from the W2C Corpus. This corpus contains 10 million words for each of 97 languages.

The unclear legal status of the downloaded material does not allow the publishing of the W2C-97-10 Corpus on the Internet, so it was released for internal usage only.

3.12.1 build-package.sh

The script build-package.sh downloads the complete text for the specified language from ufallab. From this file, only the required amount of text is extracted and copied to a single computer.

121 http://w2c.martin.majlis.cz/w2c//data/results.eye.txt
122 http://w2c.martin.majlis.cz/w2c//data/process-results-overview.html

3.12.2 build-package-wrapper.sh

The script build-package-wrapper.sh executes the script build-package.sh for each language included in the raw web corpus.

3.13 Comparing Wiki vs Web

For comparing the two corpora, I used the following methods:

• Average Word Length

• Average Sentence Length

• Conditional Entropy — $H(Y|X) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x)$ (a short computational sketch follows this list)

• Conditional Perplexity — $G(Y|X) = 2^{H(Y|X)}$
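The following Perl sketch shows one way to compute the conditional entropy and perplexity of word bigrams from whitespace-tokenized text, as an illustration of the formulas above. It is not the evaluation script used for the tables, and the result depends heavily on the preprocessing, as discussed below in this section.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Conditional entropy H(Y|X) of word bigrams, where X is the previous word
    # and Y the following one.
    my @words = grep { length } split /\s+/, do { local $/; <STDIN> // '' };
    die "need at least two words on input\n" if @words < 2;

    my (%bigram, %first);
    for my $i (0 .. $#words - 1) {
        $bigram{"$words[$i] $words[$i+1]"}++;
        $first{$words[$i]}++;
    }
    my $total = @words - 1;                       # number of bigram tokens

    my $h = 0;
    for my $bg (keys %bigram) {
        my ($x) = split / /, $bg;
        my $p_xy  = $bigram{$bg} / $total;        # p(x, y)
        my $p_y_x = $bigram{$bg} / $first{$x};    # p(y | x)
        $h -= $p_xy * log($p_y_x) / log(2);       # entropy in bits
    }
    printf "H(Y|X) = %.3f bits, perplexity G(Y|X) = %.3f\n", $h, 2 ** $h;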

When I collected the data for the first time, I found out that the average word length for the Internet was higher than for Wikipedia. This was caused by a bug in the parser: when the tag <br> was found, it was erased, and many Japanese web pages used line breaks instead of paragraphs. It was not possible to parse all pages again, because it would have required 5 thousand hours.

I found out that the achieved values depend on preprocessing. For example, for Chinese it was possible to obtain a conditional entropy from 1 to almost 5. If all words containing punctuation or numbers were removed, the conditional entropy was 1, because all bigrams were seen just once. But when numbers and punctuation characters were treated as separate words, the conditional entropy was 4.8, because many bigrams were just a number with its unit.

4 Results

This chapter describes the collected results. At the beginning of this chapter, the size of the Wiki Corpus (4.1) is presented, followed by the results for the language recognizer (4.2). Then the results for the W2C Corpus (4.3) and its comparison with the Wiki Corpus are presented. The chapter ends with an estimate of the size of the Internet.

Tables are sorted alphabetically according to the ISO 639-3 code. All used codes are in Table B. The highest five values in each column are printed overlined and the lowest five are printed underlined.

4.1 Wiki Corpus

The Wiki Corpus was built from Wikipedias with at least 5000 articles, corresponding to 122 languages. The methods used for building this corpus are described in Section 3.5. These 122 languages are used by 4.6 billion people (2/3 of the total population).

The complete data is presented in Table 4.1 and visualized in Figure 4.1.

The biggest outlier is the Kannada language (kan), which has 118MB with just 10 thousand articles. It seems that many articles are complete translations of articles from the English Wikipedia123. Kannada is written in the Kannada script, which consumes 3 bytes per character124, so the corpus may contain up to 3 times fewer characters than its size suggests. A similar explanation also applies to Thai (tha), Gujarati (guj) and Burmese (mya).

4.2 Language Recognition

The language recognition was trained and evaluated on the Wiki Corpus. Methods used for language detection are described in Section 3.6.

123 e.g. http://kn.wikipedia.org/wiki/%E0%B2%B5%E0%B3%87%E0%B2%B2%E0%B3%8D%E0%B2%B8%E0%B3 — and other articles about countries
124 http://www.unicode.org/charts/PDF/U0C80.pdf — Kannada Script


Lang Size Art     | Lang Size Art     | Lang Size Art     | Lang Size Art
afr 25348 18      | fas 40272 159     | lat 8930 56       | sah 4595 8
als 15190 10      | fin 55427 273     | lav 17714 35      | scn 4587 17
amh 3752 11       | fra 99152 1126    | lim 6679 7        | sco 5219 7
ara 57379 152     | fry 16404 21      | lit 24326 136     | sgs 0 13
arg 9890 26       | gan 920 6         | lmo 6063 21       | slk 29200 126
arz 8788 7        | gla 3020 8        | ltz 12237 33      | slv 18436 119
ast 11634 15      | gle 12178 13      | mal 42828 19      | spa 105891 802
aze 27808 74      | glg 30382 73      | mar 10835 34      | sqi 20443 33
bcl 1604 5        | glk 2314 6        | mkd 43384 47      | srp 46074 145
bel 26792 38      | guj 54981 21      | mlg 7018 20       | sun 7303 15
ben 21103 22      | hat 984 53        | mon 11681 6       | swa 11282 22
bos 22804 31      | hbs 28723 44      | mri 1553 7        | swe 39858 403
bpy 20720 25      | heb 80910 121     | msa 30708 121     | tam 42169 34
bre 10688 38      | hif 973 5         | mya 40937 6       | tat 6994 12
bug 392 9         | hin 38859 99      | nap 3299 43       | tel 12291 48
bul 51853 119     | hrv 36790 100     | nds 19561 18      | tgk 5123 9
cat 49694 346     | hsb 2450 7        | nep 19201 15      | tgl 6491 53
ceb 3070 43       | hun 50084 195     | new 9788 70       | tha 64411 68
ces 42328 201     | hye 18132 14      | nld 59434 738     | tur 39112 170
chv 9074 13       | ido 14132 22      | nno 14352 71      | ukr 77036 302
cos 1359 6        | ina 3208 6        | nor 35170 308     | urd 19138 17
cym 10264 33      | ind 31301 168     | oci 8772 30       | uzb 3779 8
dan 24431 152     | isl 11370 32      | oss 3324 8        | vec 4543 9
deu 136894 1259   | ita 73510 819     | pam 2723 7        | vie 48960 211
diq 1832 11       | jav 5321 35       | pms 5724 41       | vol 19384 119
ell 72189 63      | jpn 95234 759     | pnb 3750 17       | war 2196 102
eng 170366 3683   | kan 118430 11     | pol 49864 815     | wln 3174 12
epo 20495 148     | kat 43485 50      | por 68096 690     | yid 13311 9
est 22945 86      | kaz 31429 56      | que 1536 17       | yor 1635 30
eus 16086 103     | kor 38020 167     | ron 28735 163     | zho 48508 365
fao 2580 5        | kur 8368 16       | rus 130610 736    |

Table 4.1: Wiki Corpora — size in kB

Columns — Lang: ISO 639-3 code, Size: text size in kB after duplicity reduction, Art: number of articles in thousands


[Figure: Wiki Corpus size in kB plotted per language, with Size (kB) on the vertical axis and Languages on the horizontal axis]

Figure 4.1: Wiki Corpora — size in kB

Languages are sorted according to their size in the Wiki Corpora.

In total, 122 languages were evaluated on 61 thousand test samples. The recognizer achieved an overall accuracy of 0.885 (with a median of 0.982). The detector recognized 27 languages without any errors, 52 with an accuracy higher than 0.99 and 92 with an accuracy higher than 0.9. The histogram and the quantiles are presented in Table 4.2.

Complete results for all languages are in Table 4.3. Languages sorted according to the detector’s accuracy are displayed in Figure 4.2.

There are several reasons for some of the mistakes that were made. The Wiki Corpus was constructed without any manual interaction, so not all of the training and the testing data was in a single language. This is because:

• It is quite common practice that a new article starts out as a copy of the article from the English Wikipedia, and only afterwards is it slowly translated into the target language. This practice is especially common for minor languages125.

• I suppose that a similar practice could exist between closely related languages, but I did not find any evidence.

• Some articles contain texts in multiple languages. These articles are about widespread texts such as famous passages from books, songs or anthems. For example, this article126 contains a single song in eleven languages along with their transcriptions.

125 http://new.wikipedia.org/wiki/%E0%A4%AD%E0%A4%BE%E0%A4%B7%E0%A4%BE


(a) Results

Languages   122
Total       61000
Success     53997
Fail        7003
1st Qu.     0.912
Median      0.982
Mean        0.885
3rd Qu.     0.998

(b) Histogram
[histogram of per-language accuracy; Count on the vertical axis, Accuracy (0 to 1) on the horizontal axis]

Table 4.2: Language Detection — Overview

[Figure: per-language detection accuracy, with Accuracy on the vertical axis and Languages, sorted by accuracy, on the horizontal axis]

Figure 4.2: Language Detection

Corresponding table — 4.3


Lang Acc Sim Lang Acc Sim Lang Acc Sim afr 0.996 fra: 0.002 hat 0.990 eng: 0.010 nor 0.752 dan: 0.230 als 0.834 fra: 0.050 hbs 0.342 bos: 0.342 oci 0.728 fra: 0.260 amh 1.000 heb 1.000 oss 0.994 rus: 0.006 ara 1.000 hif 0.864 swa: 0.122 pam 0.912 eng: 0.086 arg 0.880 glg: 0.092 hin 0.898 eng: 0.090 pms 1.000 arz 0.714 ara: 0.280 hrv 0.648 hbs: 0.208 pnb 0.996 urd: 0.004 ast 0.948 fra: 0.044 hsb 0.788 swa: 0.192 pol 0.964 swa: 0.032 aze 0.988 tur: 0.010 hun 1.000 por 1.000 bcl 0.960 tgl: 0.032 hye 0.986 eng: 0.014 que 0.074 swa: 0.890 bel 1.000 ido 0.788 ita: 0.096 ron 0.992 vec: 0.004 ben 1.000 ina 0.994 vec: 0.004 rus 1.000 bos 0.420 hrv: 0.306 ind 0.314 msa: 0.672 sah 0.978 rus: 0.022 bpy 0.984 vie: 0.012 isl 0.996 fao: 0.004 scn 0.974 ita: 0.022 bre 0.958 eng: 0.020 ita 1.000 sco 0.924 eng: 0.076 bug 0.984 swa: 0.016 jav 0.882 pam: 0.054 slk 0.978 swa: 0.006 bul 0.996 mkd: 0.004 jpn 0.980 eng: 0.006 slv 0.868 fra: 0.028 cat 0.858 fra: 0.066 kan 1.000 spa 0.932 glg: 0.060 ceb 0.996 tgl: 0.004 kat 1.000 sqi 0.984 fra: 0.006 ces 0.982 slk: 0.010 kaz 1.000 srp 0.998 mkd: 0.002 chv 0.976 rus: 0.012 kor 0.998 vie: 0.002 sun 0.114 swa: 0.392 cos 0.706 scn: 0.236 kur 0.974 eng: 0.008 swa 1.000 cym 1.000 lat 0.958 eng: 0.036 swe 0.954 nds: 0.040 dan 0.990 fry: 0.006 lav 1.000 tam 1.000 deu 1.000 lim 0.982 fra: 0.012 tat 0.952 sah: 0.032 diq 0.144 swa: 0.714 lit 1.000 tel 0.994 eng: 0.006 ell 1.000 lmo 0.980 vec: 0.020 tgk 0.800 rus: 0.150 eng 1.000 ltz 0.926 fra: 0.050 tgl 0.996 pam: 0.002 epo 0.982 fra: 0.010 mal 0.990 eng: 0.010 tha 0.982 eng: 0.018 est 0.980 swa: 0.018 mar 0.750 eng: 0.242 tur 0.996 ron: 0.002 eus 0.994 fra: 0.002 mkd 0.998 swa: 0.002 ukr 0.958 rus: 0.036 fao 0.930 scn: 0.018 mlg 1.000 urd 0.988 eng: 0.012 fas 0.998 arz: 0.002 mon 0.998 sah: 0.002 uzb 0.596 tgk: 0.314 fin 0.998 eng: 0.002 mri 0.932 swa: 0.058 vec 0.952 ita: 0.032 fra 1.000 msa 0.956 eng: 0.018 vie 1.000 fry 1.000 mya 1.000 vol 0.970 fin: 0.026 gan 0.276 zho: 0.172 nap 0.388 scn: 0.570 war 0.954 swa: 0.044 gla 0.996 eng: 0.004 nds 0.998 eng: 0.002 wln 0.202 fra: 0.794 gle 0.656 gla: 0.330 nep 0.978 eng: 0.022 yid 0.998 heb: 0.002 glg 0.994 por: 0.006 new 0.966 eng: 0.030 yor 0.642 eng: 0.134 glk 0.398 fas: 0.562 nld 1.000 zho 0.914 eng: 0.030 guj 0.958 eng: 0.042 nno 0.268 nor: 0.702 Table 4.3: Language Detection

Columns — Lang: ISO 639-3 code, Acc: accuracy and Sim: the most similar language

• I preferred the language model best suited for constructing a web corpus, not the one achieving the highest accuracy (3.6.3).

The languages that were incorrectly recognized have some common properties. Either their training data was small, as for Buginese (bug), Dimli (diq), Gan Chinese (gan), Gilaki (glk), Quechua (que), etc., or there is a very similar language: a) Norwegian Nynorsk (nno) and Norwegian Bokmål (nor); b) Walloon (wln) and French (fra), both from the same language family127; c) Gilaki (glk) and Farsi (fas), where Gilaki was strongly influenced by Farsi128.

4.3 W2C Corpus

The W2C Corpus was the main goal of this thesis. Methods used for its construction are described in Section 3.11.

4.3.1 Execution Statistics

To build this corpus, more than 100 million web pages were downloaded, and this downloading took more than 7.6 thousand hours (approximately 317 days) of computer time. The real time, as well as the real number of downloaded URLs, is higher, because these statistics were not tracked in the early stages. The number of downloaded URLs may be even higher because, when a component crashed, all web pages currently being processed were lost. If the time consumed for transferring data between nodes, the time needed for duplicity reduction, the time for computing quality metrics and the time for building the distribution packages were added, more than one year of computer time was consumed. These statistics are summarized in Table 4.4.

The absence of statistics from the early phases caused some ratios, which should be in the range 0–1, to be higher. The statistics were also added at various times, so metrics that should correlate may have outliers.

126 http://sk.wikipedia.org/wiki/Hej,_Slov%C3%A1ci
127 http://www.ethnologue.com/show_family.asp?subid=304-16
128 http://www.ethnologue.com/show_language.asp?code=glk


URL (k)              103935.444    URL L         64308.829     Ratio       0.618
Downloaded (MB)      4556544.624   Downloaded L  3335991.519   Ratio       0.732
Text (MB)            173785.867    Text L        131314.057    Ratio       0.755
Words (MW)           23278.877     Words L       17369.166     Ratio       0.746
Execution Time (h)   7620.656      Chunks        2231.000      Data (MB)   964755.611

Table 4.4: Web Corpora — execution statistics

URL: number of downloaded URLs, Downloaded: amount of downloaded data in MB, Text: size of extracted text in MB, Words: number of words; suffix L means in target language; Execution Time: consumed computer time in hours and Chunks: number of executed jobs

4.3.2 Corpus Size

The W2C Corpus contains a total of 10.6 billion words in 97 languages, with a total size of almost 90GB. The size of the collected material is presented in Table 4.5. The collected size differs across languages: 64 languages have more than 100 million words, 75 have more than 50 million and 97 have more than 10 million words.

The languages with the highest amount of collected material are Malayalam (mal), Thai (tha), Japanese (jpn), Burmese (mya) and Chinese (zho). What all these languages have in common is that they do not use spaces to separate words. There was a bug in the script fill-corpora-quotas.sh (3.9.1).

Table 4.6 presents the yield of crawling for different languages. The column URL represents the ratio of unique URLs to all downloaded URLs. The average value is 0.536, but for small languages, such as Tosk Albanian (als), Haitian (hat), Gujarati (guj), etc., it is only around 0.1. Languages that are not present in this corpus have a ratio of only around 0.02. This metric indicates how big the Internet is for a specific language. The column Dup represents how much text remains after duplicity reduction.

4.4 Comparing Wiki vs Web

Comparing the Wiki Corpus and the Web Corpus is one way to check whether reliable data was downloaded. Several different properties may point to a language for which suspicious material was collected.

The following properties are used for comparing Wikipedia and the Internet:


ISO Size Words ISO Size Words ISO Size Words afr 741.140 125.684 hif 98.432 18.741 nor 963.056 153.024 als 134.228 19.956 hin 1254.703 125.470 oci 305.517 30.551 amh 202.222 20.222 hrv 818.463 119.687 pam 178.985 24.565 ara 1261.417 126.141 hun 869.870 106.172 pol 920.957 119.613 ast 136.435 21.328 hye 445.940 44.594 por 804.510 119.560 aze 633.229 68.768 ina 77.814 11.891 ron 852.413 128.423 bel 1113.988 111.398 ind 782.032 113.751 rus 1963.244 196.324 ben 1489.417 148.941 isl 777.660 110.613 sah 586.165 58.616 bos 657.471 100.145 ita 959.041 135.908 scn 253.118 25.311 bre 251.128 42.527 jav 136.911 16.139 sco 511.439 84.363 bul 1268.706 126.870 jpn 3505.917 350.591 slk 970.614 132.368 cat 732.564 119.962 kan 1727.014 172.701 slv 708.846 107.655 ces 1244.336 166.429 kat 1851.489 185.148 spa 864.982 137.491 cym 496.677 81.339 kaz 1030.728 103.072 sqi 650.159 103.415 dan 714.138 109.872 kor 1258.226 125.822 srp 1037.876 103.787 deu 770.740 107.150 kur 360.505 54.793 swa 687.184 102.848 ell 1764.323 176.432 lat 451.212 58.068 swe 837.778 128.774 eng 835.683 138.522 lav 1437.623 172.097 tam 2235.889 223.588 epo 941.651 133.936 lim 237.350 32.920 tat 426.672 42.667 est 958.051 125.339 lit 1078.219 113.359 tel 1559.062 155.906 eus 702.816 86.490 lmo 215.487 34.049 tgk 511.233 51.123 fao 159.554 22.563 ltz 451.970 72.264 tgl 641.369 101.669 fas 1153.468 115.346 mal 3614.134 361.413 tha 3582.302 358.230 fin 1215.412 129.699 mar 1379.852 137.985 tur 1033.919 119.865 fra 800.397 122.345 mkd 1194.824 119.482 ukr 1210.234 121.023 fry 656.543 98.340 mlg 93.080 13.700 urd 1164.416 116.441 gla 156.379 25.486 mon 1186.394 118.639 uzb 359.062 42.583 gle 633.214 96.461 mri 54.203 10.251 vec 112.725 18.186 glg 642.326 101.009 msa 804.109 108.043 vie 648.241 105.097 guj 1039.488 103.948 mya 3132.473 313.247 yid 1045.940 104.594 hat 114.674 21.319 nds 116.133 17.179 zho 2883.315 288.331 hbs 789.628 122.806 nep 1291.312 129.131 heb 1124.252 112.425 nld 869.751 139.009 Table 4.5: Web Corpora — size

Columns — Size: size in MB, Words: words in millions


ISO URL Dup ISO URL Dup ISO URL Dup ISO URL Dup afr 0.616 0.574 fry 0.474 0.568 lav 0.610 0.649 sco 0.678 0.569 als 0.080 0.616 gla 0.112 0.629 lim 0.687 0.237 slk 0.639 0.698 amh 0.382 0.247 gle 0.181 0.705 lit 0.790 0.683 slv 0.763 0.596 ara 0.778 0.773 glg 0.520 0.782 lmo 0.292 0.351 spa 0.592 0.898 ast 0.460 0.338 guj 0.106 0.869 ltz 0.295 0.482 sqi 0.531 0.889 aze 0.457 0.761 hat 0.081 0.758 mal 0.538 0.920 srp 0.434 0.746 bel 0.369 0.770 hbs 0.399 0.678 mar 0.408 0.854 swa 0.268 0.648 ben 0.630 0.791 heb 0.830 0.550 mkd 0.579 0.621 swe 0.741 0.744 bos 0.651 0.658 hif 0.109 0.949 mlg 0.113 0.373 tam 0.590 0.856 bre 0.166 0.268 hin 0.580 0.916 mon 0.440 0.869 tat 0.299 0.561 bul 0.641 0.789 hrv 0.394 0.524 mri 0.127 0.671 tel 0.496 0.919 cat 0.587 0.755 hun 0.704 0.733 msa 0.633 0.716 tgk 0.366 0.418 ces 0.740 0.564 hye 0.174 0.606 mya 0.464 0.871 tgl 0.759 0.571 cym 0.796 0.298 ina 0.822 0.096 nds 0.116 0.441 tha 0.559 0.690 dan 0.704 0.710 ind 0.392 0.696 nep 0.375 0.919 tur 0.863 0.514 deu 1.047 0.567 isl 0.626 0.915 nld 0.854 0.747 ukr 0.721 0.578 ell 0.796 0.877 ita 0.853 0.800 nor 0.604 0.531 urd 0.434 0.766 eng 1.095 0.675 jav 0.378 0.167 oci 0.319 0.558 uzb 0.148 0.873 epo 0.688 0.559 jpn 0.646 0.756 pam 0.687 0.259 vec 0.321 0.256 est 0.623 0.755 kan 0.446 0.886 pol 0.663 0.628 vie 0.779 0.560 eus 0.685 0.379 kat 0.694 0.565 por 0.737 0.575 yid 0.644 0.811 fao 0.131 0.847 kaz 0.833 0.239 ron 0.632 0.776 zho 0.580 0.903 fas 0.722 0.820 kor 0.673 0.710 rus 0.602 0.816 fin 0.761 0.797 kur 0.171 0.706 sah 0.151 0.788 fra 0.970 0.634 lat 0.551 0.631 scn 0.778 0.253 Table 4.6: Web Corpus — yield

URL: Ratio between unique and all downloaded URLs; Dup: Ratio between text size after removing duplicate URLs and after duplicity reduction;


[Figure: average word length in the Wiki Corpus (horizontal axis) versus the Web Corpus (vertical axis) per language]

Figure 4.3: Wiki vs Web — average word length

Raw data are in Table C.1

• Average Word Length (4.4.1)

• Average Sentence Length (4.4.2)

• Conditional Entropy and Perplexity (4.4.3)

The values presented should be used with caution, because their main purpose was only the comparison of the two corpora. The numbers can change significantly with different preprocessing, as shown in Section 3.13.

4.4.1 Average Word Length

The average word length may reveal problems caused by HTML parsing. The raw data are presented in Table C.1 and visualized in Figure 4.3.

The biggest outlier is Japanese (jpn), where the length differs by about 66%, but this was caused by the parser bug described in Section 3.13.

4.4.2 Average Sentence Length

The average sentence length is presented in Table C.2 and visualized in Figure 4.4.


[Figure: average sentence length in the Wiki Corpus (horizontal axis) versus the Web Corpus (vertical axis) per language]

Figure 4.4: Wiki vs Web — average sentence length

Raw data are in Table C.2

The biggest outliers in this metric are Urdu (urd) and Japanese (jpn). The average sentence length for Urdu is 429.52 in Wikipedia and only 151.94 on the Internet. Checking any page on the Urdu Wikipedia (http://ur.wikipedia.org/wiki/) reveals that it does not contain any dot, so a whole paragraph is treated as a single sentence, whereas the segments extracted from the Internet are much shorter, which causes the difference. Japanese is on the opposite side: sentences extracted from the Internet are 2.54 times longer than the Wikipedia ones. When the Japanese Wikipedia (http://ja.wikipedia.org/) is checked, it turns out that dots are missing there as well. I am not able to explain why this happened.
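
The following illustration (not the segmenter actually used for the corpus) shows why punctuation-based sentence splitting inflates the average sentence length when a text lacks the ASCII full stop; the Urdu-like sample string is a hypothetical placeholder using the Urdu full stop U+06D4.

import re

def avg_sentence_length(text, terminators="."):
    # split on any of the given sentence terminators, drop empty fragments
    parts = re.split("[" + re.escape(terminators) + "]", text)
    sentences = [p for p in parts if p.split()]
    words = sum(len(s.split()) for s in sentences)
    return words / len(sentences) if sentences else 0.0

latin = "First sentence here. Second one. A third, somewhat longer sentence."
# hypothetical Urdu-like sample: romanized placeholder words terminated by
# the Urdu full stop U+06D4 instead of the ASCII "."
urdu_like = "lafz ek do\u06d4 lafz teen chaar\u06d4 lafz paanch chhe\u06d4"

print(avg_sentence_length(latin))                            # about 3.3 words per sentence
print(avg_sentence_length(urdu_like))                        # 9.0: whole text counted as one sentence
print(avg_sentence_length(urdu_like, terminators=".\u06d4"))  # 3.0: splits correctly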

4.4.3 Conditional Entropy and Conditional Perplexity

The conditional entropy is presented in Table C.3 and visualized in Figure 4.5. The average ratio between the conditional perplexity computed for the Wiki Corpus and the Web Corpus is 0.98, which indicates that the downloaded data corresponds quite well to the data retrieved from Wikipedia. On one side are Maori (mri) and Malagasy (mlg) with a ratio over 2, and on the other side Sicilian (scn) with a ratio below 0.5.


Figure 4.5: Wiki vs Web — conditional entropy (scatter plot of per-language values; Wiki Corpus on the x-axis, Web Corpus on the y-axis)

The conditional perplexity is presented in Table C.4 and visualized in Figure 4.6.
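
A minimal sketch of the statistic behind Tables C.3 and C.4, assuming a bigram model over tokens: the conditional entropy is H = -sum over (prev, cur) of p(prev, cur) * log2 p(cur | prev), and the conditional perplexity is 2^H (this relation is consistent with the paired values in Tables C.3 and C.4). The tokenization unit (word or character) and any smoothing used in the thesis are not restated here, so treat this only as an illustration.

import math
from collections import Counter

def conditional_entropy(tokens):
    """H(next token | previous token) under a plain bigram count model."""
    pair_counts = Counter(zip(tokens, tokens[1:]))  # counts of (previous, current) pairs
    prev_counts = Counter(tokens[:-1])              # counts of the conditioning token
    total = sum(pair_counts.values())
    h = 0.0
    for (prev, cur), n in pair_counts.items():
        h -= (n / total) * math.log2(n / prev_counts[prev])
    return h

# tiny inline sample; a real run would read a whole corpus file instead
sample = "a tiny sample text a tiny sample of text".split()  # word tokens; use list(text) for characters
h = conditional_entropy(sample)
print("conditional entropy:", round(h, 3), "bits")
print("conditional perplexity:", round(2 ** h, 3))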

4.4.4 Conclusions

All outliers have one thing in common: they are either minor languages, such as Occitan (oci), Sicilian (scn), Maori (mri) and Malagasy (mlg), for which low-quality texts were collected, or they are written in non-Latin scripts, such as Japanese (jpn), Chinese (zho) and Nepali (nep), which are sensitive to preprocessing.

When different clustering algorithms were applied, the languages grouped into the same clusters did not share many common properties.

4.5 Internet Size

The W2C Corpus was then used to estimate the size of the Internet in gigabytes of text and in numbers of pages. Table 4.7 presents the size of the Internet in gigabytes and Table 4.8 in millions of pages. The total size of all texts was estimated at 3.2 PB and the total number of pages at 1.3 trillion.


Figure 4.6: Wiki vs Web — conditional perplexity (scatter plot of per-language values; Wiki Corpus on the x-axis, Web Corpus on the y-axis)

Pages written in English occupy 57% of the Internet, Russian 9%, Spanish 7%, French 4%, and Arabic and German 2% each.

The estimated numbers are much higher than they should be according to the available information. This overestimation is caused by the selection of the search queries, for which phrases from Wikipedia were used; for example, even a 15-word English query returns hundreds of results.
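
A minimal sketch of query-based extrapolation in the spirit of the work cited in Section 2.10 (e.g. [BB98], [BFJ+06]): if a phrase occurs in a known fraction of the documents of a reference corpus and a search engine reports a hit count for it, the hit count divided by that fraction estimates the number of indexed pages in that language. The estimator actually used in the thesis and the numbers below are assumptions for illustration only.

def estimate_pages(phrase_stats):
    """phrase_stats: list of (hits reported by the engine, document fraction in the reference corpus)."""
    estimates = [hits / fraction for hits, fraction in phrase_stats if fraction > 0]
    return sum(estimates) / len(estimates) if estimates else 0.0

# hypothetical numbers for illustration only
sample = [(120_000, 0.004), (85_000, 0.002), (300_000, 0.010)]
print(f"estimated pages: {estimate_pages(sample):,.0f}")

Averaging over several phrases smooths out noise, but the estimate inherits any bias of the reported hit counts, which is exactly the overestimation discussed above.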


ISO Size Part ISO Size Part ISO Size Part afr 92.303 0.000 hif 0.040 0.000 nor 6215.762 0.001 als 0.036 0.000 hin 2651.988 0.000 oci 13.364 0.000 amh 0.004 0.000 hrv 12190.795 0.003 pam 0.004 0.000 ara 89416.080 0.027 hun 3.769 0.000 pol 51055.931 0.015 ast 9.258 0.000 hye 166.321 0.000 por 45789.010 0.014 aze 602.170 0.000 ina 0.038 0.000 ron 7717.890 0.002 bel 269.672 0.000 ind 51542.119 0.016 rus 313035.213 0.097 ben 0.024 0.000 isl 3331.202 0.001 sah 0.012 0.000 bos 2452.213 0.000 ita 27496.694 0.008 scn 0.265 0.000 bre 0.056 0.000 jav 0.016 0.000 sco 518.982 0.000 bul 15172.081 0.004 jpn 1690.396 0.000 slk 10.676 0.000 cat 9771.115 0.003 kan 7.579 0.000 slv 7.230 0.000 ces 8663.921 0.002 kat 2065.458 0.000 spa 208703.996 0.064 cym 92.672 0.000 kaz 43.548 0.000 sqi 1252.158 0.000 dan 6228.309 0.001 kor 0.368 0.000 srp 1347.320 0.000 deu 84322.440 0.026 kur 1.729 0.000 swa 0.096 0.000 ell 27149.268 0.008 lat 0.589 0.000 swe 12886.093 0.004 eng 1838631.170 0.571 lav 1159.025 0.000 tam 0.080 0.000 epo 46.632 0.000 lim 0.013 0.000 tat 9.852 0.000 est 630.203 0.000 lit 10.901 0.000 tel 0.028 0.000 eus 2.777 0.000 lmo 0.004 0.000 tgk 0.012 0.000 fao 0.999 0.000 ltz 1.878 0.000 tgl 174.601 0.000 fas 39487.743 0.012 mal 19.729 0.000 tha 0.116 0.000 fin 1952.586 0.000 mar 0.072 0.000 tur 25279.695 0.007 fra 150028.022 0.046 mkd 2373.767 0.000 ukr 21913.459 0.006 fry 0.425 0.000 mlg 0.000 0.000 urd 371.984 0.000 gla 0.065 0.000 mon 401.265 0.000 uzb 0.080 0.000 gle 25.420 0.000 mri 0.086 0.000 vec 0.075 0.000 glg 1174.611 0.000 msa 22167.686 0.006 vie 50267.527 0.015 guj 0.433 0.000 mya 0.084 0.000 yid 0.211 0.000 hat 0.000 0.000 nds 0.020 0.000 zho 12402.360 0.003 hbs 7632.790 0.002 nep 0.005 0.000 heb 11154.272 0.003 nld 36260.232 0.011 Table 4.7: Internet — size in GB


ISO Pages Part ISO Pages Part ISO Pages Part afr 117.781 0.000 hif 0.003 0.000 nor 5333.773 0.003 als 0.006 0.000 hin 227.653 0.000 oci 22.174 0.000 amh 0.005 0.000 hrv 10729.230 0.007 pam 0.003 0.000 ara 26625.484 0.019 hun 1.938 0.000 pol 22574.016 0.016 ast 42.338 0.000 hye 63.672 0.000 por 32832.643 0.024 aze 184.932 0.000 ina 0.901 0.000 ron 3270.138 0.002 bel 80.743 0.000 ind 22146.039 0.016 rus 63948.004 0.047 ben 0.006 0.000 isl 813.613 0.000 sah 0.005 0.000 bos 1617.746 0.001 ita 16645.993 0.012 scn 2.021 0.000 bre 0.026 0.000 jav 0.015 0.000 sco 485.694 0.000 bul 3413.059 0.002 jpn 87.777 0.000 slk 4.916 0.000 cat 4016.530 0.002 kan 0.456 0.000 slv 7.719 0.000 ces 4416.463 0.003 kat 639.687 0.000 spa 54071.089 0.040 cym 595.061 0.000 kaz 41.593 0.000 sqi 240.534 0.000 dan 3706.546 0.002 kor 0.047 0.000 srp 319.285 0.000 deu 53575.887 0.039 kur 0.792 0.000 swa 0.042 0.000 ell 6163.909 0.004 lat 0.926 0.000 swe 5414.419 0.004 eng 779025.248 0.578 lav 736.307 0.000 tam 0.009 0.000 epo 60.814 0.000 lim 0.103 0.000 tat 6.454 0.000 est 301.061 0.000 lit 7.997 0.000 tel 0.012 0.000 eus 8.040 0.000 lmo 0.003 0.000 tgk 0.006 0.000 fao 0.606 0.000 ltz 1.231 0.000 tgl 318.172 0.000 fas 8980.481 0.006 mal 1.097 0.000 tha 0.009 0.000 fin 1356.983 0.001 mar 0.034 0.000 tur 26286.978 0.019 fra 85847.997 0.063 mkd 950.037 0.000 ukr 9853.784 0.007 fry 0.639 0.000 mlg 0.000 0.000 urd 69.210 0.000 gla 0.051 0.000 mon 70.244 0.000 uzb 0.041 0.000 gle 16.184 0.000 mri 0.114 0.000 vec 0.243 0.000 glg 578.176 0.000 msa 11391.889 0.008 vie 45520.137 0.033 guj 0.050 0.000 mya 0.003 0.000 yid 0.048 0.000 hat 0.009 0.000 nds 0.004 0.000 zho 1261.691 0.000 hbs 2697.391 0.002 nep 0.011 0.000 heb 8187.467 0.006 nld 18858.003 0.014 Table 4.8: Internet — page counts

5. Conclusions

The Web Corpus ‘W2C’ consists of at least 10 million words for each of the 97 included languages, out of which 63 contain more than 100 million words. For the purpose of corpus construction, tools for collecting metadata, building a corpus from Wikipedia, language recognition, distributed crawling, duplicity reduction and statistical analysis were developed.

The language metadata is automatically extracted from Ethnologue and Wikipedia and stored in the database. The collected metadata is used and extended by all the components.

Wikipedia was used as the source for the initial corpus. The Wiki Corpus was constructed from Wikipedias with at least 5 thousand articles. The Wiki Corpus contains 20 thousand articles (or as many as available) for 122 languages. This corpus served for training and testing of a language recognizer, as well as a baseline for comparison with the Web Corpus.

A language recognizer for 122 languages was developed. The recognizer was able to achieve 0.885 accuracy (median 0.982). The model used in the recognizer was specially tuned for corpus building, which may have decreased its overall accuracy.

The Web Corpus was built from more than 100 million pages, which were downloaded in the computer laboratory on 35 computers with a total execution time of over a year. The computer laboratory is used by faculty students, so the developed solution had to be able to recover from failures caused by memory exhaustion and computer restarts.

The raw corpus of downloaded data contained at least 10 million words for each of the 106 languages included at that time, and more than 100 million words for 96 of them. Because the quality of the resulting corpus was important too, only 97 languages remained with more than 10 million words after duplicity reduction. The total corpus size is 10.5 billion words, almost 90 GB of text.

Both corpora were statistically analysed and compared.

The unclear legal status of the downloaded material does not allow publishing the Web Corpus on the Internet, so the W2C-97-10 corpus, with 10 million words for each of the 97 languages, was released for internal usage only.

One of the goals, building a corpus for hundreds of languages, was not achieved. There are only around 60 languages used in industrialized countries, which makes them possible to download without any special effort. For the next 30 languages, it was possible to build a corpus of the required size, but a lot of duplicate content was downloaded. The quota of one hundred languages could be reached at the cost of decreased corpus quality. Downloading hundreds of languages would require collecting initial corpora for that many languages, and these are not easily accessible. Even if such initial corpora were available, a highly specialized language recognizer for each language would be necessary, because only very short text fragments would be analysed. And even if such a recognizer were available, it still might not be possible to download the texts automatically, because they are not available on-line.

All downloaded data, more than 4.5 TB, were preserved so that they can be investigated further and more information about real language usage can be revealed, such as the distribution of encodings or scripts for each language. Different tools for text extraction, language recognition and duplicity detection may be plugged in. If the text extractor could extract text segments instead of complete pages, it would be possible to increase the corpus size for minor languages. A different set-up of the existing tools allows constructing corpora for many purposes, from high-quality ones for manual usage to low-quality ones for machine processing. Also, a specialized single-topic corpus could be compiled.

Also, many partial topics can be investigated in more detail. For example, the language recognition problem, where dozens of combinations of parameters and methods were tested ad hoc, requires a more rigorous approach. The text extraction problem could be studied together with duplicity reduction as a single complex problem: a much simpler extractor would not remove all boilerplate, but with duplicity reduction at the line level this boilerplate would be removed anyway. All these methods could also be investigated from a performance point of view, where simpler methods could save weeks of computation at the cost of slightly decreased quality.

The W2C Corpus is a unique data source for linguists, because it outclasses all published works both in the size of the collected material and in the number of covered languages. The collected data may be used for comparative analysis of related languages or for building language models for various applications such as machine translation, speech recognition, spell checking, etc.

A. DVD Content

All source code, the W2C Corpus and the text of this thesis are available on the DVD. The DVD has the following directory structure:

• corpus — the W2C-97-10 Corpus — 10 million words for 97 languages

• text — the text of this thesis in versions for viewing and printing

• source — the tarball with the source code

The source code extracted from the tarball has the following content:

• checkRequirements.sh — checks whether all required programs and libraries are available

• bin — symlinks for scripts located in pipes, visualizations and tools

• builder — directory containing the W2C Builder

• data — retrieved and generated files; all scripts store their results in this folder

• experiments — scripts for experiments

• pipes — scripts that read from standard input and print the modified output to standard output

• scripts — scripts that are useful in a specific environment — the computer laboratory or the ufallab server

• tools — scripts for specific tasks
  – aspellCoverage — scripts for computing the coverage of an aspell dictionary
  – crawlerSimple — a simple crawler for downloading web pages
  – ethnologueParser — scripts for parsing the Ethnologue website
  – fillLangDB — scripts for filling the metadata database
  – internetSize — a script for estimating the Internet size
  – langDetect — scripts for training and testing the language recognizer
  – langList — scripts for parsing Wikipedia
  – search — scripts for retrieving results from search engines
  – utils — scripts with miscellaneous purposes
  – webAPI — a command line client to the database
  – wikiCorpora — scripts for building the corpus from the Wikipedia dumps
  – wikiExternalLinks — scripts for extracting external links from Wikipedia
  – wikiMiniCorpora — scripts for building the corpus from the Wikipedia pages

• visualizations — scripts used for visualization, formatting or analysis of the input file

B. List of Languages

All information is automatically extracted from Ethnologue (http://ethnologue.org).

Columns — ISO: ISO 639-3 code, Name: language name, Pop: population in thousands, Type: language type (Liv: living, Con: constructed, Anc: ancient) and Classification: language classification

Table B.1: List of Languages ISO Name Pop Type Classification afr Afrikaans 4934 Liv Indo-European, Germanic, West als Tosk Albanian 3035 Liv Indo-European, Albanian, Tosk amh Amharic 17528 Liv Afro-Asiatic, Semitic, South ara Arabic 221002 Liv Afro-Asiatic, Semitic, Central arg Aragonese 2000 Liv Indo-European, Italic, Romance arz Egyptian Arabic 53990 Liv Afro-Asiatic, Semitic, Central ast Asturian 125 Liv Indo-European, Italic, Romance aze Azerbaijani 19147 Liv Altaic, Turkic, Southern bcl Central Bicolano 2500 Liv Austronesian, Malayo-Polynesian, Philipp bel Belarusian 8618 Liv Indo-European, Slavic, East ben Bengali 181272 Liv Indo-European, Indo-Iranian, Indo-Aryan bos Bosnian 2203 Liv Indo-European, Slavic, South bpy Bishnupriya 115 Liv Indo-European, Indo-Iranian, Indo-Aryan bre Breton 500 Liv Indo-European, Celtic, Insular bug Buginese 3500 Liv Austronesian, Malayo-Polynesian, South S bul Bulgarian 9097 Liv Indo-European, Slavic, South cat Catalan 11530 Liv Indo-European, Italic, Romance ceb Cebuano 15807 Liv Austronesian, Malayo-Polynesian, Philipp ces Czech 9490 Liv Indo-European, Slavic, West chv Chuvash 1674 Liv Altaic, Turkic, Bolgar cos Corsican 402 Liv Indo-European, Italic, Romance cym Welsh 537 Liv Indo-European, Celtic, Insular dan Danish 5581 Liv Indo-European, Germanic, North deu German 90294 Liv Indo-European, Germanic, West diq Dimli 1000 Liv Indo-European, Indo-Iranian, Iranian ell Modern Greek 13084 Liv Indo-European, Greek, Attic eng English 328008 Liv Indo-European, Germanic, West epo Esperanto 0 Con Constructed language est Estonian 1048 Liv Uralic, Finnic eus Basque 658 Liv Basque fao Faroese 48 Liv Indo-European, Germanic, North fas Persian 31381 Liv Indo-European, Indo-Iranian, Iranian Continued on Next Page... 131http://ethnologue.org


ISO Name Pop Type Classification fin Finnish 5009 Liv Uralic, Finnic fra French 67838 Liv Indo-European, Italic, Romance fry Western Frisian 467 Liv Indo-European, Germanic, West gan Gan Chinese 20600 Liv Sino-Tibetan, Chinese gla Scottish Gaelic 66 Liv Indo-European, Celtic, Insular gle Irish 391 Liv Indo-European, Celtic, Insular glg Galician 3185 Liv Indo-European, Italic, Romance glk Gilaki 3270 Liv Indo-European, Indo-Iranian, Iranian guj Gujarati 46493 Liv Indo-European, Indo-Iranian, Indo-Aryan hat Haitian 7701 Liv Creole, French based hbs Serbo-Croatian 16351 Liv Indo-European, Slavic, South heb Hebrew 5316 Liv Afro-Asiatic, Semitic, Central hif Fiji Hindi 380 Liv Indo-European, Indo-Iranian, Indo-Aryan hin Hindi 181676 Liv Indo-European, Indo-Iranian, Indo-Aryan hrv Croatian 5546 Liv Indo-European, Slavic, South hsb Upper Sorbian 18 Liv Indo-European, Slavic, West hun Hungarian 12501 Liv Uralic hye Armenian 6376 Liv Indo-European, Armenian ido Ido 0 Con ina Interlingua 0 Con ind Indonesian 23187 Liv Austronesian, Malayo-Polynesian, Malayo- isl Icelandic 238 Liv Indo-European, Germanic, North ita Italian 61696 Liv Indo-European, Italic, Romance jav Javanese 84608 Liv Austronesian, Malayo-Polynesian, Javanes jpn Japanese 122080 Liv Japonic kan Kannada 35327 Liv Dravidian, Southern, Tamil-Kannada kat Georgian 4255 Liv Kartvelian, Georgian kaz Kazakh 8331 Liv Altaic, Turkic, Western kor Korean 66305 Liv Language isolate kur Kurdish 16025 Liv Indo-European, Indo-Iranian, Iranian lat Latin 0 Anc Indo-European, Italic, Latino-Faliscan lav Latvian 1504 Liv Indo-European, Baltic, Eastern lim Limburgan 1300 Liv Indo-European, Germanic, West lit Lithuanian 3154 Liv Indo-European, Baltic, Eastern lmo Lombard 9133 Liv Indo-European, Italic, Romance ltz Luxembourgish 320 Liv Indo-European, Germanic, West mal Malayalam 35893 Liv Dravidian, Southern, Tamil-Kannada mar Marathi 68061 Liv Indo-European, Indo-Iranian, Indo-Aryan mkd Macedonian 2113 Liv Indo-European, Slavic, South mlg Malagasy 14736 Liv Austronesian, Malayo-Polynesian, Greater mon Mongolian 5720 Liv Altaic, Mongolic, Eastern mri Maori 60 Liv Austronesian, Malayo-Polynesian, Central Continued on Next Page...


ISO Name Pop Type Classification msa Malay 39144 Liv Austronesian, Malayo-Polynesian, Malayo- mya Burmese 32319 Liv Sino-Tibetan, Tibeto-Burman, Lolo-Burmes nap Neapolitan 7050 Liv Indo-European, Italic, Romance nds Low German 1 Liv Indo-European, Germanic, West nep Nepali 13875 Liv Indo-European, Indo-Iranian, Indo-Aryan new Newari 839 Liv Sino-Tibetan, Tibeto-Burman, Himalayish nld Dutch 21730 Liv Indo-European, Germanic, West nno Norwegian Nynorsk 0 Liv nor Norwegian 4640 Liv Indo-European, Germanic, North oci Occitan 2048 Liv Indo-European, Italic, Romance oss Ossetian 641 Liv Indo-European, Indo-Iranian, Iranian pam Pampanga 1905 Liv Austronesian, Malayo-Polynesian, Philipp pms Piemontese 3110 Liv Indo-European, Italic, Romance pnb Western Panjabi 62648 Liv Indo-European, Indo-Iranian, Indo-Aryan pol Polish 39990 Liv Indo-European, Slavic, West por Portuguese 177981 Liv Indo-European, Italic, Romance que Quechua 10098 Liv Quechuan, Quechua II, C ron Romanian 23351 Liv Indo-European, Italic, Romance rus Russian 143553 Liv Indo-European, Slavic, East sah Yakut 443 Liv Altaic, Turkic, Northern scn Sicilian 4830 Liv Indo-European, Italic, Romance sco Scots 200 Liv Indo-European, Germanic, West sgs Samogitian 0 Liv slk Slovak 5019 Liv Indo-European, Slavic, West slv Slovenian 1909 Liv Indo-European, Slavic, South spa Spanish 328518 Liv Indo-European, Italic, Romance sqi Albanian 5825 Liv Indo-European, Albanian, Gheg srp Serbian 7020 Liv Indo-European, Slavic, South sun Sundanese 34000 Liv Austronesian, Malayo-Polynesian, Malayo- swa Swahili 730 Liv Niger-Congo, Atlantic-Congo, Volta-Congo swe Swedish 8311 Liv Indo-European, Germanic, North tam Tamil 65675 Liv Dravidian, Southern, Tamil-Kannada tat Tatar 6496 Liv Altaic, Turkic, Western tel Telugu 69758 Liv Dravidian, South-Central, Telugu tgk Tajik 4457 Liv Indo-European, Indo-Iranian, Iranian tgl Tagalog 23853 Liv Austronesian, Malayo-Polynesian, Philipp tha Thai 20362 Liv Tai-Kadai, Kam-Tai, Be-Tai tur Turkish 50750 Liv Altaic, Turkic, Southern ukr Ukrainian 37029 Liv Indo-European, Slavic, East urd Urdu 60586 Liv Indo-European, Indo-Iranian, Indo-Aryan uzb Uzbek 20250 Liv Altaic, Turkic, Eastern vec Venetian 6230 Liv Indo-European, Italic, Romance Continued on Next Page...


ISO Name Pop Type Classification vie Vietnamese 68634 Liv Austro-Asiatic, Mon-Khmer, Viet-Muong vol Volap¨uk 0 Con war Waray 2570 Liv Austronesian, Malayo-Polynesian, Philipp wln Walloon 1120 Liv Indo-European, Italic, Romance yid Yiddish 2255 Liv Indo-European, Germanic, West yor Yoruba 19380 Liv Niger-Congo, Atlantic-Congo, Volta-Congo zho Chinese 1212515 Liv Sino-Tibetan, Chinese

C. Wiki vs Web

This appendix contains raw data for comparing the Wiki Corpus and the W2C Corpus.

• Average Word Length (C.1)

• Average Sentence Length (C.2)

• Conditional Entropy (C.3)

• Conditional Perplexity (C.4)


ISO Wiki Web R ISO Wiki Web R ISO Wiki Web R afr 9.30 9.44 1.01 hif 6.41 5.99 0.93 nor 9.88 9.97 1.01 als 9.23 9.69 1.05 hin 6.95 6.67 0.96 oci 7.64 10.31 1.35 amh 4.78 5.19 1.09 hrv 8.33 8.65 1.04 pam 7.39 8.43 1.14 ara 6.33 6.94 1.10 hun 9.67 10.29 1.06 pol 8.88 9.55 1.08 ast 7.88 8.50 1.08 hye 8.92 9.31 1.04 por 7.99 8.43 1.06 aze 8.53 9.24 1.08 ina 7.76 8.56 1.10 ron 8.10 8.83 1.09 bel 8.56 8.80 1.03 ind 7.48 8.42 1.13 rus 8.97 9.68 1.08 ben 7.86 7.60 0.97 isl 9.40 9.27 0.99 sah 8.45 9.27 1.10 bos 8.31 8.50 1.02 ita 8.16 8.96 1.10 scn 7.99 10.40 1.30 bre 7.21 8.70 1.21 jav 7.39 9.66 1.31 sco 7.10 7.94 1.12 bul 8.35 8.44 1.01 jpn 7.40 12.30 1.66 slk 8.52 8.81 1.03 cat 7.79 8.61 1.11 kan 10.20 9.59 0.94 slv 8.12 8.66 1.07 ces 8.35 8.72 1.04 kat 8.84 9.12 1.03 spa 8.17 9.04 1.11 cym 7.48 8.41 1.12 kaz 8.62 9.10 1.05 sqi 7.91 8.07 1.02 dan 9.82 10.19 1.04 kor 4.25 5.07 1.19 srp 8.10 8.29 1.02 deu 10.89 10.78 0.99 kur 7.14 7.46 1.04 swa 8.13 8.54 1.05 ell 8.58 9.63 1.12 lat 8.54 9.88 1.16 swe 9.85 9.99 1.01 eng 7.55 8.44 1.12 lav 8.51 9.12 1.07 tam 10.65 10.62 1.00 epo 8.26 10.18 1.23 lim 8.66 9.22 1.06 tat 8.01 8.34 1.04 est 9.85 10.27 1.04 lit 8.79 9.51 1.08 tel 9.01 9.15 1.02 eus 9.06 10.57 1.17 lmo 7.07 7.99 1.13 tgk 7.33 7.84 1.07 fao 8.57 8.86 1.03 ltz 9.45 9.72 1.03 tgl 7.80 8.44 1.08 fas 6.65 6.96 1.05 mal 12.08 12.37 1.02 tha 28.14 23.65 0.84 fin 11.19 11.04 0.99 mar 8.04 7.90 0.98 tur 8.86 9.46 1.07 fra 7.84 8.71 1.11 mkd 8.30 8.61 1.04 ukr 8.74 9.50 1.09 fry 9.12 10.38 1.14 mlg 7.18 8.46 1.18 urd 5.92 6.43 1.09 gla 7.45 7.69 1.03 mon 7.75 7.91 1.02 uzb 8.35 8.88 1.06 gle 8.16 8.55 1.05 mri 6.97 7.18 1.03 vec 7.45 7.84 1.05 glg 8.04 8.92 1.11 msa 7.42 10.09 1.36 vie 6.08 6.03 0.99 guj 7.40 7.28 0.98 mya 14.28 13.28 0.93 yid 7.42 7.27 0.98 hat 6.47 6.76 1.05 nds 9.32 10.13 1.09 zho 6.77 7.76 1.15 hbs 8.29 8.44 1.02 nep 7.66 8.05 1.05 heb 6.31 6.60 1.05 nld 9.59 9.98 1.04 Table C.1: Wiki vs Web — average word length


ISO Wiki Web R ISO Wiki Web R ISO Wiki Web R afr 19.86 17.68 0.89 hif 15.34 18.87 1.23 nor 15.71 13.87 0.88 als 15.45 15.06 0.97 hin 63.36 34.27 0.54 oci 21.14 22.80 1.08 amh 65.34 57.07 0.87 hrv 14.70 17.05 1.16 pam 17.13 18.01 1.05 ara 24.45 30.85 1.26 hun 13.93 15.23 1.09 pol 14.92 14.02 0.94 ast 21.91 22.24 1.01 hye 64.22 53.80 0.84 por 22.32 16.81 0.75 aze 13.07 13.57 1.04 ina 19.59 17.55 0.90 ron 20.06 19.68 0.98 bel 12.10 13.83 1.14 ind 17.37 14.75 0.85 rus 15.50 15.83 1.02 ben 175.91 70.08 0.40 isl 14.89 16.44 1.10 sah 9.39 11.33 1.21 bos 14.97 17.51 1.17 ita 25.55 18.59 0.73 scn 21.23 24.46 1.15 bre 20.12 20.57 1.02 jav 14.73 14.52 0.99 sco 19.34 16.45 0.85 bul 14.95 16.04 1.07 jpn 24.62 59.51 2.42 slk 13.92 14.82 1.06 cat 25.11 22.76 0.91 kan 13.42 9.80 0.73 slv 15.72 15.77 1.00 ces 14.88 14.73 0.99 kat 11.69 13.82 1.18 spa 24.92 22.81 0.92 cym 18.52 20.53 1.11 kaz 10.89 12.40 1.14 sqi 20.75 22.52 1.09 dan 16.16 16.70 1.03 kor 14.01 13.58 0.97 srp 14.41 17.89 1.24 deu 16.62 15.47 0.93 kur 14.51 18.44 1.27 swa 17.16 17.70 1.03 ell 19.60 18.68 0.95 lat 14.79 18.82 1.27 swe 16.97 15.01 0.88 eng 21.56 19.11 0.89 lav 12.37 15.70 1.27 tam 10.74 10.73 1.00 epo 16.37 17.92 1.09 lim 16.53 15.48 0.94 tat 10.79 11.89 1.10 est 11.75 13.07 1.11 lit 10.58 13.60 1.28 tel 11.25 9.48 0.84 eus 13.00 14.22 1.09 lmo 18.17 18.18 1.00 tgk 13.46 19.30 1.43 fao 13.43 15.27 1.14 ltz 15.62 17.26 1.11 tgl 18.73 14.64 0.78 fas 21.19 25.18 1.19 mal 9.28 9.06 0.98 tha 23.93 18.28 0.76 fin 12.12 12.53 1.03 mar 11.56 11.62 1.01 tur 13.49 14.31 1.06 fra 23.83 22.95 0.96 mkd 18.06 16.78 0.93 ukr 13.32 13.67 1.03 fry 15.86 15.03 0.95 mlg 17.82 20.71 1.16 urd 429.52 151.94 0.35 gla 18.86 20.55 1.09 mon 16.12 14.58 0.90 uzb 13.44 14.43 1.07 gle 20.62 19.77 0.96 mri 20.84 20.55 0.99 vec 22.64 16.71 0.74 glg 21.65 19.55 0.90 msa 17.43 15.95 0.92 vie 27.05 25.25 0.93 guj 17.32 15.99 0.92 mya 147.34 17.86 0.12 yid 23.35 24.04 1.03 hat 12.39 17.82 1.44 nds 14.10 14.76 1.05 zho 18.10 12.62 0.70 hbs 15.49 16.68 1.08 nep 62.44 107.89 1.73 heb 17.12 16.15 0.94 nld 17.66 15.36 0.87 Table C.2: Wiki vs Web — average sentence length


ISO Wiki Web R ISO Wiki Web R ISO Wiki Web R afr 6.57 6.77 1.03 hif 4.61 4.96 1.08 nor 6.68 6.48 0.97 als 6.17 5.82 0.94 hin 6.21 6.42 1.03 oci 5.88 3.77 0.64 amh 3.74 3.61 0.96 hrv 6.24 6.02 0.97 pam 4.85 4.19 0.86 ara 5.87 5.45 0.93 hun 6.00 5.99 1.00 pol 5.95 5.82 0.98 ast 6.27 6.20 0.99 hye 5.51 5.53 1.00 por 6.72 6.06 0.90 aze 5.42 5.16 0.95 ina 5.80 5.19 0.89 ron 6.34 6.45 1.02 bel 5.42 5.54 1.02 ind 6.62 5.95 0.90 rus 5.41 5.30 0.98 ben 5.23 5.52 1.06 isl 6.16 6.28 1.02 sah 3.92 3.97 1.01 bos 6.19 6.17 1.00 ita 6.74 6.46 0.96 scn 5.82 2.62 0.45 bre 5.86 5.33 0.91 jav 4.78 4.05 0.85 sco 5.83 5.92 1.02 bul 6.10 6.23 1.02 jpn 2.68 1.64 0.61 slk 5.93 5.74 0.97 cat 6.55 6.64 1.01 kan 4.58 4.24 0.93 slv 5.95 6.22 1.05 ces 6.10 5.98 0.98 kat 4.97 4.76 0.96 spa 6.62 6.59 1.00 cym 6.08 6.05 1.00 kaz 5.10 4.57 0.90 sqi 6.61 6.74 1.02 dan 6.69 6.47 0.97 kor 4.63 4.58 0.99 srp 5.65 5.84 1.03 deu 6.64 6.07 0.91 kur 5.80 6.01 1.04 swa 6.03 5.79 0.96 ell 6.16 6.05 0.98 lat 5.46 5.04 0.92 swe 6.65 6.63 1.00 eng 6.87 6.46 0.94 lav 5.90 5.48 0.93 tam 4.19 4.16 0.99 epo 6.43 6.25 0.97 lim 6.01 4.35 0.73 tat 4.26 4.61 1.08 est 5.72 5.99 1.05 lit 5.65 4.60 0.81 tel 3.91 4.01 1.03 eus 5.74 5.36 0.93 lmo 4.56 5.36 1.17 tgk 4.31 4.54 1.05 fao 5.29 5.33 1.01 ltz 6.08 5.90 0.97 tgl 5.85 5.95 1.02 fas 6.60 6.60 1.00 mal 3.81 3.76 0.99 tha 1.78 1.60 0.90 fin 5.45 5.19 0.95 mar 4.64 4.88 1.05 tur 5.88 5.03 0.85 fra 6.64 6.34 0.95 mkd 6.23 5.91 0.95 ukr 5.50 5.42 0.99 fry 6.51 5.89 0.90 mlg 2.76 5.71 2.07 urd 6.43 6.46 1.00 gla 5.26 5.13 0.97 mon 5.27 5.15 0.98 uzb 3.88 4.45 1.15 gle 6.01 5.87 0.98 mri 2.52 5.18 2.06 vec 5.84 4.48 0.77 glg 6.70 6.54 0.98 msa 6.51 6.35 0.98 vie 6.52 6.53 1.00 guj 4.31 5.46 1.27 mya 2.51 2.15 0.86 yid 6.18 6.11 0.99 hat 4.38 5.20 1.19 nds 5.89 5.85 0.99 zho 2.91 2.04 0.70 hbs 6.19 6.39 1.03 nep 5.31 5.11 0.96 heb 5.81 5.70 0.98 nld 6.74 6.72 1.00 Table C.3: Wiki vs Web — conditional entropy


ISO Wiki Web R ISO Wiki Web R ISO Wiki Web R afr 95.11 109.23 1.15 hif 24.39 31.20 1.28 nor 102.41 89.33 0.87 als 71.85 56.38 0.78 hin 74.21 85.88 1.16 oci 58.75 13.62 0.23 amh 13.36 12.19 0.91 hrv 75.35 65.02 0.86 pam 28.75 18.23 0.63 ara 58.43 43.82 0.75 hun 63.92 63.45 0.99 pol 61.89 56.68 0.92 ast 77.01 73.38 0.95 hye 45.50 46.13 1.01 por 105.15 66.52 0.63 aze 42.73 35.79 0.84 ina 55.77 36.54 0.66 ron 80.81 87.48 1.08 bel 42.85 46.44 1.08 ind 98.50 61.91 0.63 rus 42.65 39.43 0.92 ben 37.44 45.99 1.23 isl 71.38 77.52 1.09 sah 15.13 15.67 1.04 bos 73.04 71.78 0.98 ita 106.53 88.23 0.83 scn 56.44 6.15 0.11 bre 58.20 40.13 0.69 jav 27.51 16.52 0.60 sco 56.92 60.66 1.07 bul 68.76 74.93 1.09 jpn 6.42 3.13 0.49 slk 60.78 53.48 0.88 cat 93.88 99.56 1.06 kan 23.97 18.94 0.79 slv 61.75 74.68 1.21 ces 68.75 62.96 0.92 kat 31.36 27.05 0.86 spa 98.30 96.35 0.98 cym 67.71 66.36 0.98 kaz 34.23 23.67 0.69 sqi 97.49 106.99 1.10 dan 103.38 88.58 0.86 kor 24.84 23.87 0.96 srp 50.18 57.47 1.15 deu 99.57 67.22 0.68 kur 55.71 64.56 1.16 swa 65.35 55.51 0.85 ell 71.38 66.23 0.93 lat 44.04 32.80 0.74 swe 100.65 98.98 0.98 eng 116.94 88.11 0.75 lav 59.52 44.49 0.75 tam 18.27 17.89 0.98 epo 86.30 76.16 0.88 lim 64.23 20.45 0.32 tat 19.22 24.45 1.27 est 52.70 63.52 1.21 lit 50.29 24.27 0.48 tel 14.98 16.08 1.07 eus 53.56 41.12 0.77 lmo 23.61 41.03 1.74 tgk 19.83 23.24 1.17 fao 39.18 40.23 1.03 ltz 67.87 59.57 0.88 tgl 57.74 61.81 1.07 fas 97.03 96.92 1.00 mal 14.00 13.51 0.96 tha 3.43 3.02 0.88 fin 43.60 36.57 0.84 mar 24.86 29.47 1.19 tur 59.01 32.61 0.55 fra 99.89 80.78 0.81 mkd 74.90 59.97 0.80 ukr 45.30 42.96 0.95 fry 91.19 59.31 0.65 mlg 6.79 52.52 7.74 urd 86.39 87.81 1.02 gla 38.30 34.93 0.91 mon 38.45 35.60 0.93 uzb 14.71 21.82 1.48 gle 64.65 58.29 0.90 mri 5.73 36.37 6.35 vec 57.29 22.35 0.39 glg 103.86 93.37 0.90 msa 91.16 81.73 0.90 vie 91.63 92.10 1.01 guj 19.86 44.10 2.22 mya 5.70 4.44 0.78 yid 72.74 69.10 0.95 hat 20.85 36.76 1.76 nds 59.24 57.60 0.97 zho 7.50 4.11 0.55 hbs 72.94 84.03 1.15 nep 39.63 34.64 0.87 heb 56.19 51.90 0.92 nld 106.70 105.10 0.99 Table C.4: Wiki vs Web — conditional perplexity

Bibliography

[BB98] Krishna Bharat and Andrei Broder. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the seventh international conference on World Wide Web 7, WWW7, pages 379–388, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.

[BB01] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL ’01, pages 26–33, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.

[BBFZ09] Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43:209–226, 2009. 10.1007/s10579-009-9081-4.

[BFJ+06] Andrei Broder, Marcus Fontura, Vanja Josifovski, Ravi Kumar, Rajeev Motwani, Shubha Nabar, Rina Panigrahy, Andrew Tomkins, and Ying Xu. Estimating corpus size via queries. In Proceedings of the 15th ACM international conference on Information and knowledge management, CIKM ’06, pages 594–603, New York, NY, USA, 2006. ACM.

[BGMZ97] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29:1157–1166, September 1997.

[BK06] Marco Baroni and Adam Kilgarriff. Large linguistically-processed web corpora for multiple languages. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, EACL ’06, pages 87–90, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

[BRSB00] A. Bharati, K. P. Rao, R. Sangal, and S. M. Bendre. Basic statistical analysis of corpus and cross comparison among corpora. Technical Report of Indian Institute of Information Technology, 2000.

[BYG07] Ziv Bar-Yossef and Maxim Gurevich. Efficient search engine measurements. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 401–410, New York, NY, USA, 2007. ACM.

[CT94] William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proc. of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, 1994.

[GS05] A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In Special interest tracks and posters of the 14th international conference on World Wide Web, WWW ’05, pages 902–903, New York, NY, USA, 2005. ACM.

[Hay04] Katia Hayati. Language identification on the world wide web, 2004.

[HNP09] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24:8–12, 2009.

[KRPP10] Adam Kilgarriff, Siva Reddy, Jan Pomikálek, and Avinesh PVS. A corpus factory for many languages. In Language Resources and Evaluation, 2010.

[LL10] Jianguo Lu and Dingding Li. Estimating deep web data source size by capture-recapture method. Inf. Retr., 13:70–95, February 2010.

[LM01] Shanjian Li and Katsuhiko Momoi. A composite approach to language/encoding detection. In 19th International Unicode Conference, International Unicode Conference ’01, 2001.

[MS05] Bruno Martins and Mário J. Silva. Language identification in web pages. In Proceedings of the 2005 ACM symposium on Applied computing, SAC ’05, pages 764–768, New York, NY, USA, 2005. ACM.

[RG00] Paul Rayson and Roger Garside. Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing corpora - Volume 9, WCC ’00, pages 1–6, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.

[Sca07] Kevin P. Scannell. The Crúbadán Project: Corpus building for under-resourced languages, volume 4 of Cahiers du Cental, pages 5–15. Louvain-la-Neuve, Belgium, 2007.

[Sha06] Serge Sharoff. Creating general-purpose corpora using automated search engine queries. In WaCky! Working papers on the Web as Corpus. Gedit, 2006.

[SR96] Penelope Sibun and Jeffrey C. Reynar. Language identification: Examining the issues, 1996.

[Wyn05] Martin Wynne. Developing Linguistic Corpora: a Guide to Good Practice, chapter Archiving, Distribution and Preservation, pages 71–78. Oxford: Oxbow Books, 2005. Available online, accessed 2011-01-01.

[ZZY+08] Guo-Qing Zhang, Guo-Qiang Zhang, Qing-Feng Yang, Su-Qi Cheng, and Tao Zhou. Evolution of the internet and its cores. New Journal of Physics, 10(12):123027, 2008.

List of Tables

2.1 Distribution of languages by number of first-language speakers ...... 7
2.2 OLAC - language coverage ...... 9
2.3 Wikipedia - article counts ...... 9
2.4 Multilingual resources — summary ...... 12
2.5 WaCky — data size ...... 13
2.6 Crúbadán — data size ...... 15
2.7 I-X — size in MW ...... 16
2.8 Corpus Factory — size in MW ...... 17
2.9 Language coverage ...... 18
2.10 Existing multilingual corpora — overview ...... 19
2.11 Language detection — summary ...... 26
3.1 Language recognition for the first 31 languages ...... 38
3.2 Language recognition — model selection ...... 39
3.3 Language recognition — example ...... 41
4.1 Wiki Corpora — size in kB ...... 57
4.2 Language Detection — Overview ...... 59
4.3 Language Detection ...... 60
4.4 Web Corpora — execution statistics ...... 62
4.5 Web Corpora — size ...... 63
4.6 Web Corpus — yield ...... 64
4.7 Internet — size in GB ...... 69
4.8 Internet — page counts ...... 70
B.1 List of Languages ...... 75
C.1 Wiki vs Web — average word length ...... 80
C.2 Wiki vs Web — average sentence length ...... 81
C.3 Wiki vs Web — conditional entropy ...... 82
C.4 Wiki vs Web — conditional perplexity ...... 83

List of Figures

2.1 Distribution of languages by number of first-language speakers ...... 7
3.1 Building Web Corpus ...... 28
3.2 Metadata — workflow ...... 34
3.3 Wiki Corpora — workflow ...... 37
3.4 W2C Builder ...... 43
4.1 Wiki Corpora — size in kB ...... 58
4.2 Language Detection ...... 59
4.3 Wiki vs Web — average word length ...... 65
4.4 Wiki vs Web — average sentence length ...... 66
4.5 Wiki vs Web — conditional entropy ...... 67
4.6 Wiki vs Web — conditional perplexity ...... 68
