Better Web Corpora for Corpus Linguistics and NLP

Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Masaryk University Faculty of Informatics Better Web Corpora For Corpus Linguistics And NLP Doctoral Thesis Vít Suchomel Brno, Spring 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Vít Suchomel Advisor: Pavel Rychlý i Acknowledgements I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý for their problem insight, help with software design and con- stant encouragement. I am also grateful to my colleagues from Natural Language Process- ing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP and Jan Pomikálek who helped me to start. I thank to my wife Kateřina who supported me a lot during writing this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest. ii Abstract The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods. This thesis presents a web crawler designed to obtain texts from the internet allowing to build large text corpora for NLP and linguistic applications. An asynchronous communication design (rather than usual synchronous multi-threaded design) was implemented for the crawler to provide an easy to maintain alternative to other web spider software. Cleaning techniques were devised to transform the messy nature of data coming from the uncontrolled environment of the internet. However, it can be observed that usability of recently built web corpora is hindered by several factors: The results derived from statistical processing of corpus data are significantly affected by the presence of non-text (web spam, computer generated text and machine translation) in text corpora. It is important to study the issue to be able to avoid non-text at all or at least decrease its size in web corpora. Another observed factor is the case of web pages or their parts written in multiple languages. Multilingual pages should be recognised, languages identified and text parts separated to respective monolingual corpora. This thesis proposes additional cleaning stages in the process of building text corpora which help to deal with these issues. Unlike traditional corpora made from printed media in the past decades, sources of web corpora are not categorised and described well, thus making it difficult to control the content of the corpus. Rich annotation of corpus content is dealt with in the last part of the thesis. An inter-annotator agreement driven English genre annotation and two experiments with supervised classification of text types in English and Estonian web corpora are presented. iii Keywords Web corpora, Web crawling, Text processing, Language identification, Discerning similar languages, Spam removal, Corpus annotation, Inter- annotator agreement, Text types, Text topic, Text genre iv Contents Introduction 1 0.1 Large, Clean and Rich Web Corpora .............1 0.2 Contents of the Thesis & Relation to Publications ......5 1 Efficient Web Crawling For Large Text Corpora 8 1.1 Building Corpora From the Web ...............8 1.2 SpiderLing, an Asynchronous Text Focused Web Crawler .. 14 1.2.1 General Web Crawler Architecture . 14 1.2.2 SpiderLing Architecture . 17 1.2.3 Yield Rate Aware Efficient Crawling . 28 1.2.4 Deployment of SpiderLing in Corpus Projects . 38 1.3 Brno Corpus Processing Pipeline ............... 41 2 Cleaner Web Corpora 47 2.1 Discerning Similar Languages ................ 48 2.1.1 Method Description . 48 2.1.2 Evaluation on VarDial Datasets . 51 2.1.3 Comparison to Other Language Detection Tools 56 2.1.4 Application to Web Corpora . 57 2.2 Non-Text Removal ...................... 63 2.2.1 Web Spam in Text Corpora . 63 2.2.2 Removing Spam from an English Web Corpus through Supervised Learning . 73 2.2.3 Semi-manual Efficient Classification of Non-text in an Estonian Web Corpus . 85 2.2.4 Web Spam Conclusion . 91 3 Richer Web Corpora 92 3.1 Genre Annotation of Web Corpora: Scheme and Issues ... 94 3.1.1 Genre Selection and Reliability of Classification 94 3.1.2 Experiment Setup . 97 3.1.3 Inter-annotator Agreement . 103 3.1.4 Dealing with a Low Agreement . 107 3.2 Text Type Annotation of Web Corpora ............ 110 3.2.1 Topic Annotation of an English Web Corpus through Learning from a Web Directory . 110 v 3.2.2 Semi-manual Efficient Annotation of Text Types in Estonian National Corpus . 117 4 Summary 120 4.1 Author’s Contribution .................... 120 4.2 Future Challenges of Building Web Corpora ......... 122 A Appendices 124 A.1 Genre Definition for Annotators ............... 125 A.2 Text Size after Processing Steps ................ 127 A.3 FastText Hyperparameters for English Topic Classification . 129 A.4 Selected Papers ........................ 130 Bibliography 140 vi List of Tables 1.1 Sums of downloaded and final data size for all domains above the given yield rate threshold. 31 1.2 The yield rate threshold as a function of the number of downloaded documents. 33 1.3 Yield rate of crawling of the web of selected target languages in 2019: The ratio of the size of the plaintext output of the crawler to the size of all data downloaded is calculated in the fourth column ‘YR’. The ratio of the size of the plaintext after discerning similar languages and near paragraph de-duplication to the size of all data downloaded is calculated in the last, cumulative yield rate column ‘CYR’. Cs & sk denotes Czech and Slovak languages that were crawled together. 39 2.1 Sizes of wordlists used in the evaluation. Large web sources – TenTen, Aranea and WaC corpora – were limited to respective national TLDs. Other wordlists were built from the training and evaluation data of DSL Corpus Collection and parts of GloWbE corpus. Columns Web, DSL and GloWbE contain the count of words in the respective wordlist. 53 2.2 Overall accuracy using large web corpus wordlists and DSL CC v. 1 training data wordlists on DSL CC v. 1 gold data. The best result achieved by participants in VarDial 2014 can be found in the last column. 54 2.3 Performance of our method on VarDial DSL test data compared to the best score achieved by participants of the competition at that time. 55 2.4 Comparison of language identification tools on 952 random paragraphs from Czech and Slovak web. The tools were set to discern Czech, Slovak and English. 57 2.5 Discriminating similar languages in Indonesian web corpus from 2010 (Indonesian WaC corpus v. 3 by Siva Reddy): Document count and token count of corpus parts in languages discerned. 58 vii 2.6 Discriminating similar languages in the Norwegian web corpus from 2015 (noTenTen15): Document count and token count of corpus parts in languages discerned. 59 2.7 Overview of removal of unwanted languages in recently built web corpora (gaTenTen20, enTenTen19, etTenTen19, frTenTen19, huTenTen12, itTenTen19, roTenTen16). Document count and token count of corpus data before and after language filtering. ‘Removed’ stands for the percent of data removed. 60 2.8 Languages recognised in the Estonian web corpus from 2019 (etTenTen19). Document count and token count of corpus parts in languages discerned. 61 2.9 Languages recognised in the output of SpiderLing crawling Czech and Slovak web in 2019. Document count and token count of corpus parts in languages discerned. 62 2.10 Comparison of the 2015 English web corpus before and after spam removal using the classifier. Corpus sizes and relative frequencies (number of occurrences per million words) of selected words are shown. By reducing the corpus to 55 % of the former token count, phrases strongly indicating spam documents such as cialis 20 mg, payday loan, essay writing or slot machine were almost removed while innocent phrases not attracting spammers from the same domains such as oral administration, interest rate, pass the exam or play games were reduced proportionally to the whole corpus. 75 2.11 Top collocate objects of verb ‘buy’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the verb: 14,267,996 in the original corpus and 2,699,951 in the cleaned corpus – 81 % reduction by cleaning (i.e. more than the average reduction of a word in the corpus). 80 2.12 Top collocate subjects of verb ‘buy’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). 81 viii 2.13 Top collocate modifiers of noun ‘house’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the noun: 10,873,053 in the original corpus and 3,675,144 in the cleaned corpus – 66 % reduction by cleaning. 82 2.14 Top collocate nouns modified by adjective ‘online’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the adjective: 20,903,329 in the original corpus and 4,118,261 in the cleaned corpus – 80 % reduction by cleaning. 83 2.15 Top collocate nouns modified by adjective ‘green’ before and after spam removal in English web corpus (enTenTen15, Word Sketches).

Better Web Corpora for Corpus Linguistics and NLP

Talk Bank: a Multimodal Database of Communicative Interaction

International Computer Science Institute Activity Report 2007

Arabic Corpus and Word Sketches

Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna

Collection of Usage Information for Language Resources from Academic Articles

A Resource of Corpus-Derived Typed Predicate Argument Structures for Croatian

Student Research Workshop Associated with RANLP 2011, Pages 1–8, Hissar, Bulgaria, 13 September 2011

Linguistic Annotation of the Digital Papyrological Corpus: Sematia

Creating and Using Multilingual Corpora in Translation Studies Claudio Fantinuoli and Federico Zanettin

From CHILDES to Talkbank

Jazykovedný Ústav Ľudovíta Štúra

The British National Corpus Revisited: Developing Parameters for Written BNC2014 Abi Hawtin (Lancaster University, UK)