Large Multilingual Corpus
Total Page:16
File Type:pdf, Size:1020Kb
Charles University in Prague Faculty of Mathematics and Physics MASTER THESIS Martin Majliˇs Large Multilingual Corpus Institute of Formal and Applied Linguistics Supervisor: doc. Ing. Zdenˇek Zabokrtsk´y,ˇ Ph.D. Study programme: Informatics Specialization: Mathematical Linguistics Prague 2011 I would like to thank my supervisor doc. Ing. Zdenˇek Zabokrtsk´y,ˇ Ph.D. for his advice and my parents for their support. I declare that I carried out this master thesis independently, and only with the cited sources, literature and other professional sources. I understand that my work relates to the rights and obligations under the Act No. 121/2000 Coll., the Copyright Act, as amended, in particular the fact that the Charles University in Prague has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 paragraph 1 of the Copyright Act. In Prague, August 5, 2011 ...................... N´azev pr´ace: Velk´ymnohojazyˇcn´ykorpus Autor: Martin Majliˇs Katedra: Ustav´ form´aln´ıa aplikovan´elingvistiky Vedouc´ıdiplomov´epr´ace: doc. Ing. Zdenˇek Zabokrtsk´yPh.D.ˇ Abstrakt: V t´eto diplomov´epr´aci je pops´an webov´ykorpus W2C. Tento korpus obsahuje 97 jazyku a pro kaˇzd´yz nich alespoˇn10 milion˚uslov. Celkov´avelikost je 10,5 miliardy slov. Aby bylo moˇzn´etakov´yto korpus vytvoˇrit, bylo nutn´evyˇreˇsit celou ˇradu d´ılˇc´ıch probl´em˚u. Na zaˇc´atku musel b´yt sestaven korpus z Wikipedie se 122 jazyky, na kter´em byl natr´enov´an rozpozn´avaˇcjazyk˚u. Pro stahov´an´ı webov´ych str´anek byl implementov´an distribuovan´ysyst´em, kter´yvyuˇz´ıval 35 poˇc´ıtaˇc˚u. Ze staˇzen´ych dat byly odstranˇeny duplicity. Vytvoˇren´ekorpusy byly vz´ajemnˇeporovn´any pomoc´ır˚uzn´ych statistik, jako jsou pr˚umˇern´ad´elky slov a vˇet, podm´ınˇen´aentropie a podm´ınˇen´aperplexita. Kl´ıˇcov´aslova: jazykov´ykorpus, distribuovan´ezpracov´an´ı Title: Large Multilingual Corpus Author: Martin Majliˇs Department: Institute of Formal and Applied Linguistics Supervisor: doc. Ing. Zdenˇek Zabokrtsk´yPh.D.ˇ Abstract: This thesis introduces the W2C Corpus which contains 97 languages with more than 10 million words for each of these languages, with the total size 10.5 billion words. The corpus was built by crawling the Internet. This work describes the methods and tools used for its construction. The complete process consisted of building an initial corpus from Wikipedia, developing a language recognizer for 122 languages, implementing a distributed system for crawling and parsing webpages and finally, the reduction of duplicities. A comparative analysis of the texts of Wikipedia and the Internet is provided at the end of this thesis. The analysis is based on basic statistics such as average word and sentence length, conditional entropy and perplexity. Keywords: language corpus, distributed processing Contents 1 Introduction 3 1.1 ProblemDefinition .......................... 3 1.2 Motivation............................... 4 1.3 ThesisOrganization.......................... 4 2 Literature Review 6 2.1 ExistingLanguages.......................... 6 2.2 LanguageResources ......................... 6 2.3 Multilingual Web Corpora . 12 2.4 WordSeeds .............................. 18 2.5 URLSeeds............................... 18 2.6 Crawling................................ 21 2.7 LanguageRecognition ........................ 22 2.8 CorpusStoringandDistribution. 25 2.9 Corpus Quality Analysis . 26 2.10InternetSize.............................. 26 3 Methods 28 3.1 AvailableResources.......................... 28 3.2 GeneralPrinciples........................... 30 3.3 Metadata ............................... 31 3.4 Database................................ 32 3.5 WikiCorpus.............................. 34 3.6 LanguageRecognition ........................ 37 1 CONTENTS CONTENTS 3.7 URLSeeds............................... 41 3.8 W2CBuilder ............................. 42 3.9 Distributed Corpus Building . 50 3.10 Duplicity Detection . 52 3.11W2CCorpus ............................. 53 3.12 CorpusDistribution.......................... 54 3.13 ComparingWikivsWeb ....................... 55 4 Results 56 4.1 WikiCorpus.............................. 56 4.2 LanguageRecognition ........................ 56 4.3 W2CCorpus ............................. 61 4.4 ComparingWikivsWeb ....................... 62 4.5 InternetSize.............................. 67 5 Conclusions 71 A DVD Content 73 B List of Languages 75 C Wiki vs Web 79 2 1. Introduction As statistical approaches become the dominant paradigm in natural language processing, there is an increasing demand for data. It is known that simple models and a lot of data outclass sophisticated models based on less data. The web contains huge amounts of linguistics data for many languages. The web has many undeniable advantages: (a) size — it is the largest text collection containing billions of documents and its size is exponentially growing, (b) range — texts are available in many languages, styles and domains, (c) availability — most of the documents are available in machine-readable form, so no scanning or rewriting is necessary. One of the key issues for computational linguists is easy access to such data. These data are publicly available on the Internet, but already collected corpora are available only for the major word languages, but not for most of the other languages. Therefore, my aim is to collect, with minimal or no human intervention, at least ten millions of words for as many languages as possible. 1.1 Problem Definition The goal of this thesis is to build multilingual corpus of texts available on the Internet. This corpus will consist of at least 10 million words for as many lan- guages as possible. The collected material will be quantitative, and qualitative, analysed and conclusions about different languages will be made. The project consists of: A study of existing multilingual resources and approaches used to construct them. A review of tools and methods used for solving particular tasks such as building initial corpora, crawling, language recognition and duplicity detec- tion. A design for solving these particular tasks as well as the main tasks with respect to amount of processed data. 3 1.2. MOTIVATION CHAPTER 1. INTRODUCTION An implementation of tools and processes capable of taking benefits of distributed environment. A quantitative and qualitative analyses of the collected material. Conclusions about used methods with evaluation of their performances for different languages. 1.2 Motivation There are many publicly available projects that are trying to collect multilingual textual resources. Some of them cover many of languages but contain either very few documents or these documents are not in computer accessible form, so they cannot be easily used in computational linguistics. Other projects contain more data, but are available in very few languages. Therefore, it will be useful to construct corpus, that will overcome these disadvantages. When this data becomes available, it will be possible to use it for comparative analysis of related languages, building language models for various applications such as machine translation, speech recognition, spell checking, etc. For achieving the main goal, many subtasks has to be solved, such as recognizing languages or downloading millions of web pages. When all this data is collected, it will be possible to use it for further improvements. Apart from these objective motivations, there are also my personal motivations. Working on this project gives me a chance to get insight, knowledge and hands-on experience on processing massive amounts of data. 1.3 Thesis Organization The work is divided into five chapters, beginning with the introductory Chapter 1 containing problem definition and motivation. Chapter 2 gives an overview of existing methods and techniques. It briefly introduces existing multilingual re- sources and multilingual corpora as well as methods used for their construction. It also presents methods for solving particular steps. Chaprter 3 presents require- ments for the complete system and available computational resources. It also in- troduces implemented tools and methods how to use them effectively. Chapter 4 shows achieved results in language recognition and size of constructed corpus. A quantitative and qualitative analyses of the corpus is included. Chapter 5 dis- 4 1.3. THESIS ORGANIZATION CHAPTER 1. INTRODUCTION cusses the results and areas where the methods and implementation could be improved. It also suggests goals for the for future work. Four appendices are included: Appendix A describes the content of the DVD. Appendix B contains lists of languages covered by the collected corpus with their ISO-639-3 codes and. Appendix C presents differences between the Wiki Corpus and the W2C Corpus. 5 2. Literature Review This chapter reviews existing tools, methods and approaches. It opens by pre- senting statistics about existing languages, followed by an introduction of exist- ing multilingual projects and multilingual web corpora. The end of this chapter contains an overview of methods used for crawling, text extraction, language recognition and corpus storing and distribution. 2.1 Existing Languages There are 6,909 known living languages according to the Ethnologue databse 1, but only about 390 of them are used by more than 1 million of native speakers2, while 172 of them have more than 3 million speakers. Detailed distribution of languages and speakers is showed in Table 2.1 and Fig- ure 2.1. These numbers must be