
Masaryk University Faculty of Informatics

Better Web Corpora For Corpus Linguistics And NLP

Doctoral Thesis

Vít Suchomel

Brno, Spring 2020

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Vít Suchomel

Advisor: Pavel Rychlý

Acknowledgements

I would like to thank my advisors, prof. Karel Pala and prof. Pavel Rychlý, for their insight into the problems, help with software design and constant encouragement. I am also grateful to my colleagues from the Natural Language Processing Centre at Masaryk University and Lexical Computing, especially Miloš Jakubíček, Pavel Rychlý and Aleš Horák, for their support of my work and invaluable advice. Furthermore, I would like to thank Adam Kilgarriff, who gave me a wonderful opportunity to work for a leading company in the field of lexicography and corpus driven NLP, and Jan Pomikálek, who helped me to start. I thank my wife Kateřina, who supported me a lot during the work on this thesis. Of those who have always accepted me and loved me in spite of my failures, God is the greatest.

Abstract

The internet is used by computational linguists, lexicographers and social scientists as an immensely large source of text data for various NLP tasks and language studies. Web corpora can be built in sizes which would be virtually impossible to achieve using traditional corpus creation methods. This thesis presents a web crawler designed to obtain texts from the internet, making it possible to build large text corpora for NLP and linguistic applications. An asynchronous communication design (rather than the usual synchronous multi-threaded design) was implemented for the crawler to provide an easy to maintain alternative to other web spider software. Cleaning techniques were devised to cope with the messy nature of data coming from the uncontrolled environment of the internet. However, it can be observed that the usability of recently built web corpora is hindered by several factors: The results derived from statistical processing of corpus data are significantly affected by the presence of non-text (web spam, computer generated text and machine translation) in text corpora. It is important to study the issue to be able to avoid non-text altogether or at least decrease its amount in web corpora. Another observed factor is the case of web pages or their parts written in multiple languages. Multilingual pages should be recognised, languages identified and text parts separated to respective monolingual corpora. This thesis proposes additional cleaning stages in the process of building text corpora which help to deal with these issues. Unlike traditional corpora made from printed media in the past decades, sources of web corpora are not categorised and described well, thus making it difficult to control the content of the corpus. Rich annotation of corpus content is dealt with in the last part of the thesis. An inter-annotator agreement driven English genre annotation and two experiments with supervised classification of text types in English and Estonian web corpora are presented.

Keywords

Web corpora, Web crawling, Language identification, Discerning similar languages, Spam removal, Corpus annotation, Inter-annotator agreement, Text types, Text topic, Text genre

Contents

Introduction
0.1 Large, Clean and Rich Web Corpora
0.2 Contents of the Thesis & Relation to Publications

1 Efficient Web Crawling For Large Text Corpora
1.1 Building Corpora From the Web
1.2 SpiderLing, an Asynchronous Text Focused Web Crawler
1.2.1 General Web Crawler Architecture
1.2.2 SpiderLing Architecture
1.2.3 Yield Rate Aware Efficient Crawling
1.2.4 Deployment of SpiderLing in Corpus Projects
1.3 Brno Corpus Processing Pipeline

2 Cleaner Web Corpora
2.1 Discerning Similar Languages
2.1.1 Method Description
2.1.2 Evaluation on VarDial Datasets
2.1.3 Comparison to Other Language Detection Tools
2.1.4 Application to Web Corpora
2.2 Non-Text Removal
2.2.1 Web Spam in Text Corpora
2.2.2 Removing Spam from an English Web Corpus through Supervised Learning
2.2.3 Semi-manual Efficient Classification of Non-text in an Estonian Web Corpus
2.2.4 Web Spam Conclusion

3 Richer Web Corpora
3.1 Genre Annotation of Web Corpora: Scheme and Issues
3.1.1 Genre Selection and Reliability of Classification
3.1.2 Experiment Setup
3.1.3 Inter-annotator Agreement
3.1.4 Dealing with a Low Agreement
3.2 Text Type Annotation of Web Corpora
3.2.1 Topic Annotation of an English Web Corpus through Learning from a Web Directory

3.2.2 Semi-manual Efficient Annotation of Text Types in Estonian National Corpus

4 Summary
4.1 Author’s Contribution
4.2 Future Challenges of Building Web Corpora

A Appendices
A.1 Genre Definition for Annotators
A.2 Text Size after Processing Steps
A.3 FastText Hyperparameters for English Topic Classification
A.4 Selected Papers

Bibliography

List of Tables

1.1 Sums of downloaded and final data size for all domains above the given yield rate threshold.
1.2 The yield rate threshold as a function of the number of downloaded documents.
1.3 Yield rate of crawling of the web of selected target languages in 2019: The ratio of the size of the plaintext output of the crawler to the size of all data downloaded is calculated in the fourth column ‘YR’. The ratio of the size of the plaintext after discerning similar languages and near paragraph de-duplication to the size of all data downloaded is calculated in the last, cumulative yield rate column ‘CYR’. Cs & sk denotes Czech and Slovak languages that were crawled together.
2.1 Sizes of wordlists used in the evaluation. Large web sources – TenTen, Aranea and WaC corpora – were limited to respective national TLDs. Other wordlists were built from the training and evaluation data of DSL Corpus Collection and parts of GloWbE corpus. Columns Web, DSL and GloWbE contain the count of words in the respective wordlist.
2.2 Overall accuracy using large web corpus wordlists and DSL CC v. 1 training data wordlists on DSL CC v. 1 gold data. The best result achieved by participants in VarDial 2014 can be found in the last column.
2.3 Performance of our method on VarDial DSL test data compared to the best score achieved by participants of the competition at that time.
2.4 Comparison of language identification tools on 952 random paragraphs from Czech and Slovak web. The tools were set to discern Czech, Slovak and English.
2.5 Discriminating similar languages in Indonesian web corpus from 2010 (Indonesian WaC corpus v. 3 by Siva Reddy): Document count and token count of corpus parts in languages discerned.

2.6 Discriminating similar languages in the Norwegian web corpus from 2015 (noTenTen15): Document count and token count of corpus parts in languages discerned.
2.7 Overview of removal of unwanted languages in recently built web corpora (gaTenTen20, enTenTen19, etTenTen19, frTenTen19, huTenTen12, itTenTen19, roTenTen16). Document count and token count of corpus data before and after language filtering. ‘Removed’ stands for the percent of data removed.
2.8 Languages recognised in the Estonian web corpus from 2019 (etTenTen19). Document count and token count of corpus parts in languages discerned.
2.9 Languages recognised in the output of SpiderLing crawling Czech and Slovak web in 2019. Document count and token count of corpus parts in languages discerned.
2.10 Comparison of the 2015 English web corpus before and after spam removal using the classifier. Corpus sizes and relative frequencies (number of occurrences per million words) of selected words are shown. By reducing the corpus to 55 % of the former token count, phrases strongly indicating spam documents such as cialis 20 mg, payday loan, essay writing or slot machine were almost removed while innocent phrases not attracting spammers from the same domains such as oral administration, interest rate, pass the exam or play games were reduced proportionally to the whole corpus.
2.11 Top collocate objects of ‘buy’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the verb: 14,267,996 in the original corpus and 2,699,951 in the cleaned corpus – 81 % reduction by cleaning (i.e. more than the average reduction of a word in the corpus).
2.12 Top collocate subjects of verb ‘buy’ before and after spam removal in English web corpus (enTenTen15, Word Sketches).

2.13 Top collocate modifiers of noun ‘house’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the noun: 10,873,053 in the original corpus and 3,675,144 in the cleaned corpus – 66 % reduction by cleaning.

2.14 Top collocate nouns modified by adjective ‘online’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the adjective: 20,903,329 in the original corpus and 4,118,261 in the cleaned corpus – 80 % reduction by cleaning.

2.15 Top collocate nouns modified by adjective ‘green’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the adjective: 2,626,241 in the original corpus and 1,585,328 in the cleaned corpus – 40 % reduction by cleaning (i.e. less than the average reduction of a word in the corpus).

3.1 Sources of the collection of texts used in our experiment. Different subsets (S) were added at different times (starting with subset 1). Most certain texts and least certain texts refer to the certainty of a classifier measured by the entropy of the probability distribution of labels given by FastText for a particular document. UKWaC [Fer+08b], enTenTen13, enTenTen15 and enTenTen18 are English web corpora from 2007, 2013, 2015 and 2018, respectively.

3.2 Inter-annotator agreement of genre annotation of web documents for different experiment setups. P is the count of people annotating, Data refers to collection subsets, N is the count of documents, A is the average count of annotations per text. Acc is Accuracy, Jac is Jaccard’s similarity, K-Acc, K-Jac and K-Nom stand for Krippendorff’s alpha with the set similarity metric set to Accuracy, Jaccard’s similarity and nominal comparison, respectively. ‘6/9 genres’ means that four of the nine labels were merged in a single label for the particular evaluation. ‘No unsure’ means annotations indicating the person was not sure were omitted. ‘No multi’ means annotations with multiple strong labels were omitted.
3.3 Pair agreement summary for setups with 9 genres and 6/9 genres, without unsure or multi-label samples.
3.4 Topics from dmoz.org in the training set.
3.5 Precision and recall for each recognised dmoz.org level 1 topic estimated by FastText. The threshold of minimal probability of the top label was set to the value where the estimated precision was close to 0.94.

A.1 Text size after three Brno pipeline processing steps for ten recently crawled target languages. ‘Clean rate’ columns show how much data or tokens were removed in the respective cleaning step. The first part is performed by tools embedded in crawler SpiderLing: Boilerplate removal by Justext, plaintext extraction from HTML by Justext and language filtering using character models for each recognised language. More than 95 % of downloaded data is removed by this procedure. The next step is filtering unwanted languages including discerning similar languages using lists of words from large web corpora. The last step in this table shows the percentage of tokens removed by near paragraph de-duplication by Onion. More than 60 % of tokens is removed this way.

A.2 Hyperparameter values autotuned by FastText for topic classification in our English web corpus. By modifying FastText’s autotune code, the search space of some parameters was limited to a certain interval and parameters marked as fixed were set to a fixed value. ‘Val.’ is the final value. ‘M’ stands for millions.

List of Figures

1.1 Crawling as the data source component of web search engines. Graphics source: A presentation of paper [She13].
1.2 General web crawler architecture. Source: [MRS08, Chapter 20].
1.3 IRLbot architecture. DRUM stands for ‘Disk Repository With Update Management’, a fast storage solution. Source: [Lee+09].
1.4 A focused crawler architecture. Source: [SBB14].
1.5 SpiderLing architecture. The design loosely follows the general model. There is a single process scheduler, a single process downloader using asynchronous sockets and multiple processes for web page processors that extract text and links from HTML.
1.6 An example of a web page stored in a doc structure. The plaintext is separated to paragraphs marked by structure p.
1.7 Average TCP connections opened per second in day intervals by SpiderLing crawling selected language webs in 2019.
1.8 Average size of raw HTML data downloaded per day in day intervals by SpiderLing crawling selected language webs in 2019.
1.9 Average size of plaintext extracted from HTML per day in day intervals by SpiderLing crawling selected language webs in 2019.
1.10 Web domains yield rate for a Heritrix crawl on .pt.
1.11 Average yield rate in time for various yield rate threshold functions (crawling the Czech web).
1.12 Web domains yield rate for a SpiderLing crawl on the Czech web.
1.13 The yield rate of web domains measured during SpiderLing crawls of six target languages in 2011 and 2012.

2.1 Sentence score and word scores calculated to discern British English from American English using relative word counts from a large web corpus. A sample from VarDial 2014 test data, vertical format. Column description: Word form, en-GB score, en-US score.
2.2 Web spam in examples of use of word ‘money’ in the application SkELL for Language Learning at https://skell.sketchengine.eu/. See non-text lines 2, 4 and 10.
2.3 Google’s analysis of spam types and quantities that had to be removed manually, 2004–2012. Source: http://www.google.com/insidesearch/howsearchworks/fighting-spam, accessed in January 2015, no longer at the site as of April 2020. Labels were moved below the chart and resized by the author of this thesis for the sake of readability.
2.4 Relative word count comparison of the original 2015 web corpus with British National Corpus, top 26+ lemmas sorted by the keyword score. Score = (fpm1 + 100) / (fpm2 + 100), where fpm1 is the count of lemmas per million in the focus corpus (3rd column) and fpm2 is the count of lemmas per million in the reference corpus (5th column).
2.5 Relative word count comparison of the cleaned web corpus with British National Corpus. (A screenshot from Sketch Engine.)
2.6 Relative word count comparison of the original web corpus with the cleaned version. (A screenshot from Sketch Engine.)
2.7 Evaluation of the binary spam classifier in a 10 fold cross-validation on semi-manually checked Estonian web corpus. Precision and recall were estimated for minimal probabilities of the top label from 0 to 1 in 0.05 steps and averaged across folds. The baseline accuracy (putting all samples in the larger class) is 0.826.

2.8 Evaluation of the final binary spam classifier on documents not previously checked by a human annotator in Estonian web corpus. Precision and recall were estimated for minimal probabilities of the non-text label from 0.05 to 0.15. Since we aim for a high recall, the performance with the non-text label threshold set to 0.05 is satisfying. A higher threshold leads to an undesirable drop of recall.

2.9 Evaluation of the relation of the distance of web domain from the initial domains to the presence of non-text on the sites. Web pages of distances 0 to 4 classified semi-manually or by the spam classifier were taken into account. Two thirds of the pages were in distance 1. The percentage of good and bad documents within the same domain distance is shown. The presence of non-text in the data is notable from distance 1.

3.1 Text type annotation interface – a web application in a browser – the left side of the screen. Information about the annotation process can be seen at the top. Genres with a brief description and examples follow. Class ‘Information::Promotion’ is labelled as strongly present in this case. Buttons for weaker presence of genre markers (Partly, Somewhat, None) can be clicked to change the annotation.

3.2 Text type annotation interface – a web application in a browser – the right side of the screen. The title of the document with a link leading to the original source is located at the top. The plaintext split to paragraphs can be seen below. Both sides of each paragraph are coloured to visualise separate paragraphs. A paragraph can be suggested for removal from the document (to make the result training data less noisy) by clicking the respective button.

3.3 Text type annotation interface in the review mode after the training round – as seen by the author of this thesis who trained six other annotators. Labels, coded by identifiers in columns B1 to B99, assigned to a single document by each annotator are shown. Values ‘Strong’, ‘Partially’, ‘Somewhat’ and ‘None’ are coded by 2, 1, 1/2 and 0, respectively. (The same coding was used by [Sha18].) Time in seconds spent by annotating the document by each annotator can be seen in the rightmost column.
3.4 Pair annotation matrix for the setup with 9 genres, without unsure or multi-label samples. Percentage of all annotation pairs is shown.
3.5 Pair annotation matrix for the setup with 6/9 genres, without unsure or multi-label samples. Percentage of all annotation pairs is shown.
3.6 Evaluation of the 14 topic classifier on the test set. Precision and recall were estimated by FastText for minimal probabilities of the top label from 0 to 1 in 0.05 steps. F-0.5 values plotted in green.
3.7 Sizes of topic annotated subcorpora of enTenTen15 – document and token counts.
3.8 Sizes of topic annotated subcorpora of Estonian National Corpus 2019 – document and token counts.

Introduction

0.1 Large, Clean and Rich Web Corpora

A corpus is a special collection of textual material collected according to a certain set of criteria. In statistical natural language processing one needs a large amount of language use data situated within its textual context. Text corpora are one of the main requirements for statistical NLP research. [MS99, pp. 5, 6, 117, 119] The field of linguistics greatly benefits from the evidence of language phenomena of interest one can find in large text corpora. In particular, such a data source is essential for various subfields of computational linguistics such as lexicography, machine translation, language learning and text generation. It is said ‘There is no data like more data’. [PRK09] showed that ‘Bigger corpora provide more information.’ Indeed, since the 2000s, the internet has been commonly used by computational linguists (which resulted in establishing the Web as Corpus ACL SIG1). The count of words in very large corpora reached tens of billions of words, e.g. 70 billion words reported by [PJR12]. Since then, constantly growing and spreading, the web has become an immensely large source of text data for various NLP tasks and language studies in general: Web corpora can be built in sizes hardly possible to achieve using traditional methods of corpus creation. [PJR12] The quantity of text data on the web is quite large, with many varieties, for a very wide range of languages. [GN00] Further advantages of this source are immediate availability, low cost of access and no need for concern over copyright. [CK01] [KG03] list examples of use of the web as a source of corpus data for language modelling, information retrieval, automatic population of ontologies, translating terms and language teaching.

1. Special Interest Group of the Association for Computational Linguistics on Web as Corpus, https://www.sigwac.org.uk/

There are 77 possible text/linguistic corpora applications listed on the website of the Linguistic Data Consortium2. We believe language modelling, language teaching, lexicography, linguistic analysis and machine learning could benefit from large, clean and richly annotated web corpora the most. Corpora built using methods and tools presented in this thesis are used by Sketch Engine3 users in those fields. Although there is a valid criticism of web corpora – e.g. [Cvr+20] showed that web corpora lack some areas of linguistic variation that ‘cannot be substituted by general web-crawled data’, such as the coverage of certain genres, ‘namely spoken informal (intimate), written private correspondence and some types of fiction (dynamic and addressee oriented)’ – the size of web corpora helps to find evidence of scarce language phenomena in natural language context. Most language phenomena follow the Zipfian distribution; simply said, the more data the better. For example, to study modifiers of the phrase ‘to deliver speech’4, what size of the corpus is sufficient to contain enough occurrences of important collocates in a natural context? Corpus frequencies of the strongest collocates of ‘to deliver speech’ in selected English corpora follow. It can be observed that a 100 million word corpus and a 1 billion word corpus are clearly not large enough.

∙ BNC (96 million words in the corpus): major (8), keynote (6).

∙ 2007 web corpus (ukWaC, 1.32 billion words): keynote (125), opening (12), budget (8), wedding (7).

∙ 2012 web corpus (enTenTen12, 11.2 billion words): keynote (813), acceptance (129), major (127), wedding (118), short (101), opening (97), famous (80).

∙ 2015 web corpus (enTenTen15, 15.7 billion words): keynote (3673), opening (684), welcome (413), key (257), major (255),

2. https://catalog.ldc.upenn.edu/search, accessed in April 2020. 3. Corpus management software and a website operated by company Lexical Computing at https://www.sketchengine.eu/. 4. E.g. in a lexicographic project where the goal is to explain the meaning and typical use of ‘to deliver speech’ to an intermediate level student of English using natural context of the phrase.

acceptance (233), powerful (229), commencement (226), inspiring (210), inaugural (146).

∙ 2009 web corpus (ClueWeb09, English part, 70.5 billion words): keynote (3802), acceptance (1035), opening (589), famous (555), commencement (356), impassioned (335), inaugural (333).

A question similar to the previous one5 is – which phrases can be combined? If ‘pregnancy test’ is a strong collocation and ‘pass a test’ is another one, can they be combined into ‘pass a pregnancy test’? Speakers proficient in English know they cannot. (This example was borrowed from [PJR12].) Large corpora help to get correct answers for phenomena not having enough evidence in small corpora.

However – size is not everything. ‘A significant fraction of all web pages are of poor utility.’6 The content of the web is not regulated in terms of data quality, originality or correct description, and this results in even more issues. This is a list of selected issues of building language resources from the web – formulated as practical tasks:

∙ Language identification and discerning similar languages,
∙ Character encoding detection,
∙ Efficient web crawling,
∙ Boilerplate removal (basically the extraction of plaintext from HTML markup7),
∙ De-duplication (removal of identical or nearly identical texts),
∙ Fighting web spam (i.e. dealing with computer generated text, in general any non-text),
∙ Authorship recognition & plagiarism detection,
∙ Storing and indexing large text collections.

Boilerplate, duplicates, and web spam skew corpus based analyses and therefore have to be removed. While the first two issues have been successfully addressed, e.g. by [MPS07; Pom11; SB13; VP12], spam can still be observed in web corpora as reported by us in [KS13].

5. E.g. a question of a student of English as their second language. 6. In 2020 as well as in 2008 [MRS08, Chapter 20] 7. Boilerplate – unwanted content like HTML markup, non textual parts, short repetitive text such as page navigation.

That is why a spam cleaning stage should be a part of the process of building web corpora. Automatically generated content does not provide examples of authentic use of a natural language. Nonsense, incoherent or otherwise unnatural texts such as the following short instance have to be removed from a good quality web corpus:

Edmonton Oilers rallied towards get over the Montreal Canadiens 4-3 upon Thursday.Ryan Nugent-Hopkins completed with 2 aims, together with the match-tying rating with 25 seconds remaining within just legislation.8

Another drawback of building text corpora from the web – which has to be dealt with – is understanding the content of a corpus. Traditional corpora (e.g. the British National Corpus) were designed for particular use and compiled from deliberately selected sources of good quality (e.g. the BNC consisting of a spoken and a written component further divided by other metadata [Lee92]). Such precise selection of nice texts is hardly possible in the case of large web corpora. Do we know what is being downloaded from the web? Do researchers who base their work on web corpora know which language varieties, topics, genres, registers and other text types are represented in the corpus and what their distribution is like? These questions should be asked by those who build or use web corpora. We would like to add rich metadata to texts in web corpora, including text type annotation. Because of the size of web corpora, supervised classification is the preferred way to achieve that.

8. Source: http://masterclasspolska.pl/forum/, accessed in 2015.

0.2 Contents of the Thesis & Relation to Publications

Chapter 1 presents an overview of technical aspects of a web crawler architecture. SpiderLing, a web crawler implemented by the author of this thesis, is introduced. Key design features of the software are explained. The crawler gathers information about web domains and aims to download web pages from domains providing a high ratio of the size of plaintext extracted from web pages to the size of all downloaded data. This feature was described in our co-authored paper Efficient Web Crawling for Large Text Corpora [SP12]. The paper was presented at the Web as Corpus workshop in Lyon in 2012 and with 94 citations so far9 it belongs to our most cited works.

The crawler is put into the context of other components of the so-called ‘Brno processing pipeline’ in this thesis. This set of tools has been successfully used to build large, clean text corpora from the web. Separate tools from the pipeline were described in corresponding papers in the past – including our co-authored works on character encoding detection [PS11] and text tokenisation [MSP14] and our paper on discriminating similar languages [Suc19]. Our work on the Brno processing pipeline follows the steps of Jan Pomikálek who finished the first components, the boilerplate removal and de-duplication tools [Pom11]. The author of this thesis has been developing the pipeline and maintaining its parts since 2012.

Since 2012 our work on efficient web crawling and building web corpora in more than 50 languages has led to publishing papers co-authored with academics studying the respective languages: [Art+14; BS12b; Boj+14; DSŠ12a; RS16; Srd+13]. Among other venues, this work was presented at the B-rated conferences LREC10 in 2014 and TSD11 in 2016. The emerging set of web corpora built by the author of this thesis was presented in our co-authored paper The TenTen Corpus Family [Jak+13] in Lancaster in 2013 – with 203 citations to date. All corpora in this corpus family became a part of the corpus manager and corpus query system Sketch Engine operated by Lexical Computing. The work on Sketch Engine, including our contribution of corpora in many languages, was presented in an article in Springer’s journal on lexicography [Kil+14]. Having been cited 627 times, this is our most cited co-authored work to date.

9. The source of all citation counts in this thesis is Google Scholar, accessed in April 2020.
10. Language Resources and Evaluation Conference
11. Text, Speech and Dialogue

Chapter 2 deals with two issues of building language resources from the web: discerning similar languages and non-text removal. Methods presented in both sections of that chapter were applied to web corpora. This thesis builds upon our previous work on language identification and discerning similar languages. A method of Language Discrimination Through Expectation–Maximization [Her+16] was presented in a co-authored paper at the Third Workshop on NLP for Similar Languages, Varieties and Dialects in Osaka in 2016. A recent work consisting of adjusting the method for use with large web corpora and evaluating the result was published in the paper Discriminating Between Similar Languages Using Large Web Corpora [Suc19]. We have been dealing with non-text in web corpora since 2012. It still remains one of the current challenges in web corpus building. Papers on this topic were presented at Web as Corpus workshops in 2013, 2015, 2017 and 2020. The issue was described, and selected ways to avoid downloading non-text and methods to remove web spam were proposed and implemented. The most important results of our work are summarised in this thesis. The improvement achieved by a supervised classifier applied to an English web corpus is shown at the end of the chapter. Although the spam fighting procedure is not perfect, the evaluation of the impact on the quality of an English web corpus shows great progress made towards a better quality of web corpora in a lexicography oriented application.

Chapter 3 stresses the need for adding rich annotation to web corpora. An experiment to annotate genres in an English web corpus, leading to a discussion about overcoming a low inter-annotator agreement, is introduced in the first section. A text type classification task performed on an English web corpus and the Estonian National Corpus is presented at the end of the chapter.

6 The results of this work are summarised in Chapter 4. Challenges of building large, clean web corpora to be addressed in the near future are briefly discussed there too.

1 Efficient Web Crawling For Large Text Corpora

1.1 Building Corpora From the Web

To build a large collection of texts from the web, one needs to master the following general disciplines stated in [SB13]:

1. Data collection,
2. Post-processing,
3. Linguistic processing,
4. Corpus Evaluation and Comparison.

A web corpus can be built following these general steps:

1. Identify suitable documents to obtain.
2. Download the selected data from the internet, keeping important metadata such as the source URL and the date of acquisition.
3. Process the obtained data by stripping off non textual parts, clearing away boilerplate and unwanted text parts, removing duplicate parts, and other possible methods to get quality data in the result.
4. Store the result in a way enabling access according to the desired purpose, reusability and the ability to process the raw internet data again.

A good source of information about important decisions and practical advice for building large web corpora is [SB13]. There is also a tradition of building sizeable text collections at the author’s institution1. There are billions of documents available on the web. The process of traversing the web and downloading data (crawling) is a time and resource consuming task. A web crawler is a piece of software made for the task of crawling the internet. The crawler is usually initialized by a set of starting internet points, the seed URLs. It downloads each document from the initial set, extracts links to other documents from the data and continues its work with the discovered set of new URLs.

1. The Natural Language Processing Centre at Faculty of Informatics, Masaryk University, http://nlp.fi.muni.cz/web3/en/NLPCentre


The crawling strategy – making decisions about which parts of the web to explore first, i.e. which documents to download immediately and which to postpone for later – is a very important factor in the design of a successful crawler. A well crawled data set should contain data which is important. [FCV09] dealt with the evaluation of web crawl source selection policy and showed crawling the top PageRank metric sites is better than a simple breadth-first crawl in terms of the importance of documents in the set. [SP12] showed pruning domains yielding less data (selective crawling) outperforms a general Heritrix crawl in terms of crawling efficiency (i.e. the ratio of the size of extracted text to the amount of all downloaded data). Thus the implemented traversing algorithm is crucial in achieving wide coverage of web domains or higher crawling efficiency; a higher amount of extracted data or catching ‘important’ web pages, whichever is a priority. Additional issues have to be taken into account when crawling the web: not overusing the source servers by obeying the Robots exclusion protocol2, boilerplate removal and content de-duplication (if desired), robust post-processing of crawled data (e.g. dealing with malformed data, language detection, character encoding detection) [SP12]. In a one-time web crawling setup, if the same URL is discovered again, it is considered a duplicate and discarded. Since the web changes rapidly, maintaining a good ‘freshness’ of the data is hard. The content is constantly added, modified, deleted [NCO04] and duplicated [Pom11]. For scenarios where one needs to keep the crawled data up to date, [CG99] proposed a crawler which selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. [FCV09] devised a more conservative strategy to continuously crawl the web, starting from the seed URLs over and over again, revisiting all pages once crawled and building ‘snapshots’ of the part of the web it is visiting.

2. A standard to control access to certain web pages by automated means, http://www.robotstxt.org
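To make the crawling efficiency measure above concrete, the following minimal sketch (not SpiderLing's actual code) computes a per-domain yield rate and a pruning decision based on it; the threshold and the minimum page count are illustrative, and Section 1.2.3 describes a threshold that grows with the number of downloaded documents.

def yield_rate(clean_text_bytes: int, downloaded_bytes: int) -> float:
    """Ratio of the size of extracted text to the size of all data downloaded from a domain."""
    return clean_text_bytes / downloaded_bytes if downloaded_bytes else 0.0

def should_prune_domain(clean_text_bytes: int, downloaded_bytes: int, pages_downloaded: int,
                        min_pages: int = 10, threshold: float = 0.01) -> bool:
    """Stop crawling a domain once enough pages were seen and its yield rate stays below the threshold."""
    if pages_downloaded < min_pages:
        return False  # too early to judge the domain
    return yield_rate(clean_text_bytes, downloaded_bytes) < threshold

print(should_prune_domain(50_000, 10_000_000, 120))  # True: only 0.5 % of the downloaded data was text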


As could be understood from the previous, starting the crawl with good, text yielding and trustworthy (i.e. with a low possibility of spammed content) sources can positively benefit the quality of the resulting corpus. [GJM01] proposed a method exploiting web search engines to identify relevant documents on the web. The search engine is expected to supply good text data in the desired language based on search parameters. Baroni [BB04] devised the method ‘BootCaT’ for bootstrapping corpora and terms from the web. The method requires a small set of seed terms as input. The seeds are used to build a corpus via automated Google queries, more terms are extracted from that corpus and used again as seeds to build a larger corpus and so forth [BB04]. Two thirds of English and Italian documents obtained with BootCaT were reported to be informative and related to the search terms. WebBootCaT [Bar+06] is an extension of the former tool in the form of a web application, allowing quick and effortless domain focused web corpus building3. A similar approach on a much larger scale was used later by the ClueWeb project: [Cal+09] started with two types of seed URLs: one from an earlier 200 million page crawl, another given by commercial search engines (Google, Yahoo). The search engines were queried using the most frequent queries and random word queries for each target language. The DMOZ4 categories were used in the process too. To get the DMOZ categories in other languages, Google Translate was employed to translate the originals in English. More internet directory and content rating services can be employed when looking for quality content to include in web corpora. [BS14] constrained the selection of web documents in their corpus for language learning to sources listed in the DMOZ directory or in the white list of the URL blacklist.5 6

3. WebBootCaT is currently a module in Sketch Engine corpus query system, http://www.sketchengine.co.uk/documentation/wiki/Website/Features#WebBootCat
4. DMOZ is the largest, most comprehensive human-edited directory of the Web. http://www.dmoz.org/
5. I.e. URLs not categorised as spam, advertisement or pornography in the URL blacklist directory, http://urlblacklist.com/.
6. Apart from Wikipedia articles and search engine supplied documents.
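The first step of the BootCaT procedure, turning seed terms into search engine queries, can be sketched as follows; the random tuple length and the number of queries are illustrative defaults, and the real BootCaT tools go further by downloading the hits and extracting new terms for the next iteration.

import random

def seed_queries(seed_terms, tuple_size=3, n_queries=20, seed=0):
    """Build random tuples of seed terms to be issued as search engine queries."""
    rng = random.Random(seed)
    queries = set()
    # Assumes there are enough distinct terms to form n_queries different tuples.
    while len(queries) < n_queries:
        queries.add(" ".join(rng.sample(seed_terms, tuple_size)))
    return sorted(queries)

terms = ["corpus", "annotation", "lexicography", "collocation",
         "tokenisation", "lemma", "concordance", "frequency"]
for query in seed_queries(terms, n_queries=5):
    print(query)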


Many large web corpora have been constructed using crawlers recently. There is ClueWeb, a huge general purpose collection of one billion web pages in ten languages7 – cleaned, de-duplicated and further processed into a 70 billion word corpus of English by [PJR12]. Another huge web crawl is CommonCrawl8 which was gathered through the years 2008–2014. Many private companies, most prominently those providing search engines, download data from the web for their own purpose, usually web indexing9, web data mining10, web monitoring (for changes, for copyright violations), or web archiving (digital preservation).11 According to a web crawler tracking list12 mentioned in [She13], there were over 1100 crawler agents in 2013. According to our web crawling experience – checking the agent names, some bearing the names of the companies behind them, in robots exclusion protocol files – the number has grown a lot. Even though so much crawling is done that some sites get more traffic from crawlers than human visitors [She13], the data may not be available to researchers and other institutions or may not be suitable for linguistic use. The effort of web search companies is also notable since they maintain large distributed data warehouses to store indexed web pages for serving by their search engines. Text corpora for linguistic purposes can make use of textual parts of such data. Google Books ngrams [GO13] is a collection of word n-grams from English books spanning a long time period, which received attention

7. 2009 collection: http://www.lemurproject.org/clueweb09, 2012 collection: http://www.lemurproject.org/clueweb12/
8. An open repository of web crawl data that can be accessed and analyzed by anyone, http://commoncrawl.org/
9. Obviously, all search engines do web indexing.
10. A lot of companies mine the web for trends, marketing, user opinion, shopping or even political preferences of people nowadays.
11. E.g. the Internet Archive storing the history of over 424 billion web pages on the Internet (as of April 2020), archive.org. There is a long list of web archiving initiatives at https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives.
12. http://www.crawltrack.fr/crawlerlist.php


recently.13 Unfortunately, the n-grams are hardly sufficient for certain computational linguistics use, e.g. lexicography and language learning. Apart from corpora made from traditional sources in the past, such as the British National Corpus14, created by a process of selecting, balancing and redacting texts, or linguistic resource collections with a subscription access policy, such as the Linguistic Data Consortium collection containing the Gigaword corpora15, several families of web corpora16 for computational linguistics use emerged in the Web as Corpus community of late: Web as Corpus (WaC)17 [Bar+09; Fer+08c], Corpus of Web (CoW)18 [SB12], TenTen19 [Jak+13], Aranea20 [Ben14], multiple WaC inspired corpora [Let14; LK14; LT14], The Leipzig Corpora Collection21 [Bie+07]. A corpus manager is software that indexes text corpora and provides a corpus interface to the users. According to our experience with the development and support of the corpus manager Sketch Engine [Kil+14; Kil+04], the users are linguists, lexicographers, social scientists, brand specialists, people who teach languages and their students [Tho14], various human language technologists and others. Figure 1.1 shows crawling as the data source component of web search engines. Similarly, in the field of computational linguistics, crawling is the source of data of a corpus manager. The architecture of the web crawler SpiderLing developed by the author of this thesis, its key features such as asynchronous communication and text focused design, and the Brno Corpus Processing Pipeline, a set of tools for building corpora from the web, are presented in the following sections.

13. The N-gram viewer is a very nice application – https://books.google.com/ngrams.
14. Originally created by Oxford University Press in the 1980s - early 1990s, https://www.english-corpora.org/bnc/
15. https://catalog.ldc.upenn.edu/
16. By corpus family we name a collection of corpora in different languages sharing the same general name, means of retrieval, cleaning and processing.
17. http://wacky.sslmit.unibo.it/doku.php?id=corpora
18. http://corporafromtheweb.org/category/corpora/
19. TenTen – aiming at reaching the size of at least ten to the power of ten (10^10) tokens for each language. http://www.sketchengine.co.uk/documentation/wiki/Corpora/TenTen
20. http://ucts.uniba.sk/aranea_about/
21. https://wortschatz.uni-leipzig.de/en/


Figure 1.1: Crawling as the data source component of web search engines. Graphics source: A presentation of paper [She13].

1.2 SpiderLing, an Asynchronous Text Focused Web Crawler

1.2.1 General Web Crawler Architecture

The classical textbook definition of web crawling as a technique in the field of information retrieval according to [MRS08, Chapter 20] is ‘the process by which we gather pages from the Web to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them.’ The main component of web crawling is a web crawler. Assuming a graph representation of the internet consisting of web pages – nodes – and links connecting web pages – one-directional edges – making an oriented graph structure, the crawler starts at seed web pages (seed URLs or seed domains) – the initial nodes – and traverses the web graph by extracting links from web pages and following them to the target pages. The planner component, also called the scheduler, is the component of the crawler responsible for making decisions in which order to follow the links, i.e. in which directions to search the graph. The features a crawler must or should provide, according to the book, are:

∙ Robustness: Crawlers must be resilient to traps misleading them into getting stuck fetching an infinite number of pages in a particular domain. – Let us add a more general statement: Crawlers must be resilient to both design and technical properties of the ever changing web.
∙ Politeness: Policies of web servers regulating the rate at which a crawler can visit them must be respected.
∙ Distributed: The ability to execute in a distributed fashion across multiple machines.
∙ Scalable: Permit scaling up by adding extra machines and bandwidth.
∙ Performance and efficiency: System resources including processor, storage, and network bandwidth should be used efficiently.


∙ Quality: The crawler should be biased toward fetching useful pages first.
∙ Freshness: Able to obtain fresh copies of previously fetched pages.
∙ Extensible: Should cope with new data formats and protocols, implying a modular crawler architecture.

There are many ways to meet the crucial and recommended features. We expect the following are the main differences in the architecture of various crawlers:

1. The intended audience and use. Our goal is to get a large collection of monolingual natural language texts for each target language for the Sketch Engine audience or other NLP applications.
2. The importance of the crawler’s features. For example, a simple design to improve the easiness of use, maintenance and extensibility is more important than the ‘distributed’ feature for us.
3. The scheduler component. (Also called the frontier.) Various strategies of web traversal are possible. A baseline strategy is the breadth-first search.
4. Technical solution. [SB13] stresses that crawlers ‘require careful implementations if the crawler is intended to sustain a high download rate over long crawl times’. A general design pattern is depicted in Figure 1.2. A more detailed schema of the architecture of the large scale web crawler IRLbot [Lee+09] is shown in Figure 1.3.

[SB13] names the following components of a web crawler:

∙ Fetcher – A massively multi-threaded downloader.
∙ Parser – Extracts URLs from downloaded web pages.
∙ URL filters to discard duplicate or blacklisted22 URLs.

22. Blacklisted URLs or web domains are sources of unwanted text types or poor quality text so the aim is to avoid crawling them. The blacklist can be supplied at the start of the crawler and extended on-the-fly.


Figure 1.2: General web crawler architecture. Source: [MRS08, Chapter 20].

Figure 1.3: IRLbot architecture. DRUM stands for ‘Disk Repository With Update Management’, a fast storage solution. Source: [Lee+09].


Figure 1.4: A focused crawler architecture. Source: [SBB14].

∙ Frontier – ‘Data structures which store, queue and prioritize URLs and finally pass them to the fetcher.’

‘Biasing a crawl toward desired pages to improve the word count for a well-defined weight function’ [SBB14] is called focused crawling. The crawler can prefer particular web domains or text types based on the function. A focused crawler architecture can be seen in Figure 1.4.

1.2.2 SpiderLing Architecture

Most documents on the internet contain data not useful for text corpora, such as lists of links, forms, advertisements, isolated words in tables, and other kinds of text not comprised of grammatical sentences. Therefore, by doing general web crawls, we typically download a lot of data which gets filtered out during post-processing. This makes the process of web corpus collection inefficient. [SB13] reported 94 % of downloaded web pages not making it into the final version of corpus DECOW12. To be able to download large collections of web texts in a good quality and at a low cost for the corpora collection managed by Sketch Engine23, we developed SpiderLing – a web spider for linguistics. Unlike traditional crawlers or web indexers, we do not aim to collect all data

23. http://sketchengine.co.uk/


(e.g. whole web domains). Rather than that, we want to retrieve many documents containing full sentences in as little time as possible. Implementation details follow. The crawler was implemented in Python 3 and released under the GNU GPL v. 3 licence at http://corpus.tools/wiki/SpiderLing. The design schema of the crawler can be seen in Figure 1.5. There are three main components of the crawler: a scheduler, a downloader and a document processor. Each of the components runs as a separate process. While there can be multiple document processors, there is a single scheduler and a single downloader, which is the main difference from standard crawler architectures employing multi-threaded downloaders. The reason for this design decision is to make the tool easy to understand, maintain and extend. Although there are multiple threads operating within each process of the crawler to prevent I/O deadlocks, due to Python’s global interpreter lock only a single thread can be active at a time. Furthermore, all processes communicate through files read from and written to a filesystem. This way, debugging the whole tool is easier than in the case of heavily parallel software consisting of hundreds or thousands of concurrently running components. Using more sophisticated storage such as a database was avoided for the purpose of simplicity. All queues (URL, robots, redirects, web page metadata) are read and written sequentially. The user can check the queues anytime to see what is happening. The scheduler contains data structures to represent each web domain encountered in the process of crawling. This is another difference from the standard design: URLs are separated by their respective web domains rather than held together in a single data structure. The scheduler takes into account information about the domain, such as the effectiveness of crawling the domain or the information about paths within the domain, when making decisions concerning traversing the web. A web domain consists of the following metainformation (a schematic sketch follows the list):

∙ Technical information: Protocol (http/https), IP address.

∙ Hostname – long hostnames tend to contain more non-text (each URL is split to protocol, hostname and path).

Figure 1.5: SpiderLing architecture. The design loosely follows the general model. There is a single process scheduler, a single process downloader using asynchronous sockets and multiple processes for web page processors that extract text and links from HTML.


∙ New paths within the domain to send to the downloader in the future. The paths are sorted by length; short paths are downloaded prior to obtaining long paths.
∙ Hashes of paths sent to the downloader – stored just to be able to prevent downloading the same page multiple times.
∙ Number of pages already downloaded from the domain – this can be limited to avoid crawler traps or overrepresentation of a domain in the target corpus.
∙ Web domain yield rate – the effectiveness of the web domain calculated on-the-fly. More on this key feature follows in Section 1.2.3.
∙ Distance of the domain from the seed domains. The distance is the graph distance of a node (a path) within the domain closest to the initial nodes (the seed URLs) in a graph representation of the web. Since the seed domains are trustworthy (high quality content) and (most likely) links leading from a trustworthy page lead to other quality content, the value can be used to estimate the quality of the content of the domain. Domains close to the seeds should be crawled more than domains far from the seeds.
∙ Robots exclusion protocol24 file for the domain. The file is parsed into a set of rules to follow when adding new paths into the domain.

24. A web communication standard for operators of web servers to set rules for automated agents accessing their content. https://www.robotstxt.org/robotstxt.html. For example, a path within the domain can be prevented from downloading this way. Following the protocol is polite.
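For illustration, the per-domain record described above might look roughly like the following sketch. The field names are the author's paraphrase of the list, not SpiderLing's actual identifiers, and the byte counters used to derive the yield rate are an assumption.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class WebDomain:
    """A rough sketch of the per-domain metainformation kept by the scheduler."""
    protocol: str                                             # http or https
    ip_address: str
    hostname: str                                             # long hostnames tend to indicate more non-text
    new_paths: List[str] = field(default_factory=list)        # sorted by length, short paths downloaded first
    sent_path_hashes: Set[int] = field(default_factory=set)   # prevents downloading the same page twice
    pages_downloaded: int = 0                                  # can be capped to avoid crawler traps
    clean_bytes: int = 0                                       # plaintext extracted from the domain so far
    raw_bytes: int = 0                                         # all data downloaded from the domain so far
    seed_distance: int = 0                                     # graph distance from the seed domains
    robots_rules: List[str] = field(default_factory=list)      # parsed robots exclusion protocol rules

    @property
    def yield_rate(self) -> float:
        """Effectiveness of the domain, updated on-the-fly."""
        return self.clean_bytes / self.raw_bytes if self.raw_bytes else 0.0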


There is a ‘URL selector’ cycle periodically taking a small count of new URLs from each domain structure. This helps to increase the variety of URL hosts sent to the downloader. The more separate web domains, the more connections can be opened at once. A ‘URL manager’ receives seed URLs, redirected URLs from the downloader and new extracted URLs from document processors. A new URL is put into the respective domain structure. Its hash is compared to hashes of paths already in the domain. The manager also sorts web domains by their text yield rate, preferring text-rich websites over less text yielding sources. A ‘duplicate content manager’ is a routine reading hashes of downloaded web pages and other web page metadata from document processors. Since the information about duplicate content is distributed in document processors, this procedure gathers the data and writes identifiers of duplicate documents to a file. The file is used by a standalone script after the crawling is finished to de-duplicate the text output of the crawler. A ‘crawl delay manager’ in the downloader component receives URLs to download. It makes sure target HTTP servers are not overloaded by forcing a crawl delay, postponing connections to the same HTTP host or IP address within a certain period after the last connection was made. Excess paths within the same web domain are stored to the filesystem to be read later. An asynchronous design (rather than the usual synchronous multi-threaded design) is used to achieve a high throughput of HTTP communication. A TCP socket is opened for each web page in a download queue at once. The sockets are non-blocking so the opener routine does not have to wait for an answer of the remote server. There is another routine run periodically to poll sockets that are ready to write to, in which case an HTTP request is sent, or ready to read from, in which case a chunk of the response of the remote server is read. The socket poller stores the downloaded data into the filesystem. There are two queues to be read and resolved by the scheduler: a robot queue holding the content of robots exclusion protocol files and a redirect queue recording HTTP redirects. There is another queue for web pages waiting to be parsed by a document processor.
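A heavily simplified sketch of the asynchronous, non-blocking socket approach using Python's standard selectors module follows: one routine opens a socket per URL without waiting, another polls the sockets and either sends the request or reads a chunk of the response. SpiderLing's real downloader additionally handles HTTPS, crawl delays per host and IP address, error states and the file queues, none of which is shown here.

import selectors
import socket
from urllib.parse import urlsplit

sel = selectors.DefaultSelector()

def open_request(url):
    """Open a non-blocking connection for a plain HTTP URL and register it for polling."""
    parts = urlsplit(url)
    request = ("GET %s HTTP/1.0\r\nHost: %s\r\nUser-Agent: example-bot\r\n\r\n"
               % (parts.path or "/", parts.hostname)).encode()
    sock = socket.socket()
    sock.setblocking(False)
    try:
        sock.connect((parts.hostname, parts.port or 80))
    except BlockingIOError:
        pass  # expected: the connection keeps being established in the background
    sel.register(sock, selectors.EVENT_WRITE,
                 {"url": url, "request": request, "response": b""})

def poll_sockets(timeout=1.0):
    """Poll all registered sockets once and return the responses that were completed."""
    finished = {}
    for key, events in sel.select(timeout):
        sock, data = key.fileobj, key.data
        try:
            if events & selectors.EVENT_WRITE:
                sock.send(data["request"])   # the small request is assumed to fit into one send
                sel.modify(sock, selectors.EVENT_READ, data)
            elif events & selectors.EVENT_READ:
                chunk = sock.recv(65536)
                if chunk:
                    data["response"] += chunk
                else:                         # the server closed the connection, response complete
                    sel.unregister(sock)
                    sock.close()
                    finished[data["url"]] = data["response"]
        except OSError:                       # failed connections are simply dropped in this sketch
            sel.unregister(sock)
            sock.close()
    return finished

open_request("http://example.com/")
while len(sel.get_map()):
    for url, response in poll_sockets().items():
        print(url, len(response), "bytes downloaded")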

Text processing details follow. Plaintext and new links are extracted from the HTML text of web pages by a document processor. New links are stored in a file to be read by the scheduler. The plaintext is wrapped in an XML structure and stored to a file. Useful metadata obtained during the processing is added as attributes of the structures: the title (the content of HTML element


title), the character length range, the date of crawling25, the IP address, the language model difference (see more about the model below), the URL and the character encoding (the value stated in the HTML and the detected value). An example of the data in this format is shown in Figure 1.6. The tool Chared26 [PS11] is used to detect the character encoding of web pages. The tool Justext27 [Pom11] is used to extract plaintext from HTML, split text to paragraphs (adhering to the paragraph denoting HTML markup) and remove boilerplate such as headers, footers, navigation and tables. A character trigram language model is built to determine the similarity of the text in a web page to a pre-defined set of nice texts in the target language or unwanted languages expected to be downloaded too. The cosine of the vectors of character trigram counts is used to calculate the similarity. If the text is more similar to an unwanted language or the similarity to the target language is below a threshold, it is not included in the result data. All text post-processing mentioned here is done on-the-fly by the document processor.
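A minimal sketch of the trigram similarity test follows; the threshold value and the interface are illustrative, and SpiderLing's real implementation, which builds its models from pre-defined sample texts per language, differs in details.

import math
from collections import Counter

def trigram_counts(text):
    """Character trigram frequency vector of a text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    """Cosine similarity of two trigram frequency vectors."""
    dot = sum(count * b[trigram] for trigram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def accept_page(page_text, target_model, unwanted_models, threshold=0.5):
    """Keep a page only if it is similar enough to the target language
    and not more similar to any of the unwanted languages."""
    page = trigram_counts(page_text)
    target_similarity = cosine(page, target_model)
    if target_similarity < threshold:
        return False
    return all(target_similarity >= cosine(page, model) for model in unwanted_models)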

Comments on scalability, performance and adaptability follow: Since there is only one scheduler and one downloader, and the rate at which data is read and written by these components is variable, the data queues implemented as files allow all separate components to run at their own best pace. No component is waiting for the others. The scheduler is limited by the operational memory holding the web domain metadata and hashes of text files. The downloader is limited by the bandwidth of the network and the crawl delay policy. Furthermore, the scalability of the crawler is controlled by setting the count of document processors. According to our experience, four processes are enough for crawling texts in languages with a small web presence such as Estonian, so a machine with a stock 8 core CPU with 16 GB RAM should be enough.

25. Unfortunately, the dates of creation and modification of the content optionally sent in HTTP headers may not be accurate. 26. http://corpus.tools/wiki/Chared 27. http://corpus.tools/wiki/Justext


<doc title="..." url="..." ...>
<p>
A paragraph of text.
</p>
<p>
Another paragraph of text.
</p>
</doc>

Figure 1.6: An example of a web page stored in a doc structure. The plaintext is separated to paragraphs marked by structure p.

SpiderLing used for big crawling projects such as obtaining the English, Spanish or French web can make use of a 32 core machine and 200 GB RAM. Concerning a comparison to big crawling projects of large institutions, let us compare our work to three results achieved by others:

∙ [Cal+09] used a modified version of Nutch [Kha+04] to build ClueWeb09 – one of the biggest collections of web texts made available. It was crawled in 60 days and consists of one billion web pages (with a total size of 25 TB). [PJR12] took the English part of the collection – approximately 500 million web pages – processed it using Unitok and Chared28 and got a result corpus sized 82.6 billion tokens.
∙ According to [Tro+12], a dedicated cluster of 100 machines running the Hadoop file system, 33 TB of disk storage and a 1 GB/s network were used. The Heritrix crawler29 was chosen for gathering

28. Tokeniser and de-duplication tool, see Section 1.3. 29. https://webarchive.jira.com/wiki/display/Heritrix/Heritrix


ClueWeb12, the successor collection, in 2012. 1.2 billion pages amounting to 37 TB of text (and 67 TB of other files) were crawled for this collection in 13 weeks [Tro+12].
∙ The IRLbot crawler³⁰ downloaded 6.4 billion pages over two months as reported by [Lee+09], with a peak performance of 3,000 pages downloaded per second and an average of 1,200 pages per second.
∙ SpiderLing crawled 179 million web pages from the English web in 15 days with a peak performance of around 600 pages per second and an average of 140 pages per second.
The crawling performance of SpiderLing on the English, French, Estonian, Finnish and Czech & Slovak web in 2019, measured as connections opened per second, the size of raw data downloaded per day and the size of plaintext extracted from HTML per day, can be found in Figures 1.7, 1.8 and 1.9.
The figures show the performance declining over time. The main reason for this is the decreasing variety of web domains. The scheduler started with a high count of different web domains; the variety decreased as the new paths to download from the web domain objects were depleted. The downloader is prevented from opening more connections and downloading more data by the crawl delay imposed as a part of the politeness policy. The size of extracted plaintext is limited by the number of text processors employed. That count is set before the start of crawling. The extraction rate starts to decrease when there is no plaintext in the waiting queue, causing some text processors to go idle.
The crawler can be adapted to downloading national or language webs of various properties. The adaptation requires three kinds of language dependent resources described in Section 1.3. Furthermore, the behaviour of all components of the software is configurable using a file with preset defaults and switches for crawling ‘small languages’ or starting from a low number of seed URLs. Setting the target internet top level domain (TLD) and blacklists of TLDs and hostnames is supported.

30. http://irl.cs.tamu.edu/crawler/

Figure 1.7: Average TCP connections opened per second in day intervals by SpiderLing crawling selected language webs in 2019.

Figure 1.8: Average size of raw HTML data downloaded per day in day intervals by SpiderLing crawling selected language webs in 2019.

Figure 1.9: Average size of plaintext extracted from HTML per day in day intervals by SpiderLing crawling selected language webs in 2019.


Thus one can avoid downloading from domains known for bad content or force crawling from particular national domains, e.g. .de, .at and .ch for the German language. Crawling multiple languages at the same time is encouraged. This functionality was used e.g. for obtaining text in three languages spoken in Nigeria: Hausa, Igbo and Yoruba. The configuration file can be reused for crawling the same language in the future. Adapting the crawler to focus on texts on a particular topic is also possible. A list of two hundred environment and nature protection terms was added to the scheduler to prefer domains consisting of documents containing words from the list. This work was done for a lexicographic project. The seed URLs were obtained from a web search engine set to search for the terms. This experiment led to building a 61 million word topical corpus.³¹ The corpus was used to improve terminology in the domain of environment and nature protection.
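The configuration options mentioned above may be pictured by the following hypothetical sketch; the option names are illustrative assumptions and do not necessarily match SpiderLing's real configuration file.

    # Hypothetical configuration sketch for a focused German crawl; the names
    # of the options are illustrative, not SpiderLing's actual parameters.
    LANGUAGES = ['German']                      # language models to load
    TLD_WHITELIST = ['de', 'at', 'ch']          # restrict crawling to these TLDs
    TLD_BLACKLIST = []                          # TLDs never to visit
    HOSTNAME_BLACKLIST = ['spam.example.com']   # domains known for bad content
    DOC_PROCESSOR_COUNT = 4                     # enough for a small-web language
    CRAWL_DELAY = 5                             # seconds between requests to one host
    TOPIC_TERMS_FILE = 'environment_terms.txt'  # prefer domains containing these terms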

1.2.3 Yield Rate Aware Efficient Crawling

We experimented with third-party software for obtaining text documents from the web. Following the example of other researchers [BK06; Bar+09; Fer+08a], we used the Heritrix crawler³² and downloaded documents for the language of interest by restricting the crawl to the national web domains of the countries where the language is widely used (e.g. .cz for Czech). Though our colleagues managed to compile corpora of up to 5.5 billion words this way [PRK09], they were not satisfied with the fact that they needed to keep the crawler running for several weeks and download terabytes of data in order to retrieve a reasonable amount of text. It turned out that most downloaded documents were discarded during post-processing since they contained only material with little or no good quality text. We were interested in knowing how much data was downloaded in vain when using Heritrix and whether the sources which should be avoided can be easily identified. In order to get that information we analyzed

31. Information about the result corpus can be found at https://www.sketchengine.eu/environment-corpus/.
32. http://crawler.archive.org/


the data of a billion word corpus of European Portuguese downloaded from the .pt domain with Heritrix. For each downloaded web page we compute its yield rate as

    yield rate = final data size / downloaded data size

where final data size is the number of bytes in the text which the page contributed to the final corpus and downloaded data size is simply the size of the page in bytes (i.e. the number of bytes which had to be downloaded). Many web pages have a zero yield rate, mostly because they get rejected by a language classifier or because they only contain junk or text duplicating previously retrieved text. We grouped the data by web domains and computed a yield rate for each domain as the average yield rate of the contained web pages. We visualized this on a scatterplot which is displayed in Figure 1.10. Each domain is represented by a single point in the graph. It can be seen that the differences among domains are enormous. For example, each of the points in the lower right corner represents a domain from which we downloaded more than 1 GB of data, but which only yielded around 1 kB of text. At the same time, there are domains which yielded more than 100 MB of text (an amount higher by five orders of magnitude) from a similar amount of downloaded data. These domains are positioned in the upper right corner of the graph. Next, we selected a set of yield rate thresholds and computed for each threshold the number of domains with a higher yield rate and the sum of downloaded and final data in these domains. The results can be found in Table 1.1. It is easy to see that as the yield rate threshold increases, the size of the downloaded data drops quickly whereas there is only a fairly small loss in the final data. This suggests that by avoiding the domains with low yield rate a web crawler could save a lot of bandwidth (and time) without making the final corpus significantly smaller. For instance, if only domains with a yield rate above 0.0128 were crawled, the amount of downloaded data would be reduced from 1,289 GB to 86 GB (to less than 7 %) while the size of the final data would only drop from 4.91 GB to 3.62 GB (to 73.7 %). This is of course only a hypothetical situation, since in practice one would need to
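A minimal sketch of the per-domain yield rate analysis follows; the tab-separated input format (URL, downloaded bytes, final bytes per page) is an assumption made for illustration, not the format actually used.

    # Compute a yield rate per domain as the average of its pages' yield rates.
    import sys
    from collections import defaultdict
    from urllib.parse import urlparse

    page_rates = defaultdict(list)   # domain -> per-page yield rates

    for line in sys.stdin:           # input: URL<TAB>downloaded bytes<TAB>final bytes
        url, downloaded, final = line.rstrip('\n').split('\t')
        domain = urlparse(url).hostname
        page_rates[domain].append(int(final) / max(int(downloaded), 1))

    # Print domains from the lowest to the highest average yield rate.
    for domain, rates in sorted(page_rates.items(),
                                key=lambda item: sum(item[1]) / len(item[1])):
        print(f'{domain}\t{sum(rates) / len(rates):.4f}')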


Figure 1.10: Web domains yield rate for a Heritrix crawl on .pt. (Scatterplot of .pt domains with the downloaded data size in bytes on the x axis and the final data size in bytes on the y axis, both on logarithmic scales, together with reference lines for yield rates of 0.1, 0.01 and 0.001.)


Table 1.1: Sums of downloaded and final data size for all domains above the given yield rate threshold.

Yield rate    Domains above    Crawler output    Final data    Final
threshold     the threshold    size [GB]         size [GB]     yield rate
none          51645            1288.87           4.91          0.0038
0.0001        29580            705.07            4.90          0.0069
0.0002        28710            619.44            4.89          0.0079
0.0004        27460            513.86            4.86          0.0095
0.0008        25956            407.30            4.80          0.0118
0.0016        24380            307.27            4.68          0.0152
0.0032        22325            214.18            4.47          0.0209
0.0064        19463            142.38            4.13          0.0290
0.0128        15624            85.69             3.62          0.0422
0.0256        11277            45.05             2.91          0.0646
0.0512        7003             18.61             1.98          0.1064
0.1024        3577             5.45              1.06          0.1945
0.2048        1346             1.76              0.54          0.3068
0.4096        313              0.21              0.10          0.4762


download at least several pages from each domain in order to estimate its yield rate. Nevertheless, it is clear that there is a lot of room for making the crawling for web corpora much more efficient. We observe that many web domains offer documents of a similar type. For example, a news site contains short articles, a blog site contains blog entries, a company presentation site contains descriptions of the goods sold or products manufactured. We believe the quality of several documents (with regard to building text corpora) on such sites could represent the quality of all documents within the given domain. One could argue that a segmentation by domains is too coarse-grained since a domain may contain multiple websites with both high and low yield rates. Though we agree, we believe that identifying more fine-grained sets of web pages (like a text rich discussion forum on a text poor goods presentation site) introduces further complications and we leave that for future work. Simple web crawlers are not robust enough to suit our needs (e.g. not supporting heavily concurrent communication, lacking load balancing by domain or IP address, not being able to restart the crawling after a system crash). On the other hand, the source code of sophisticated crawlers is too complex to alter, making implementation of our way of efficient web traversing difficult. We came to the conclusion that the easiest way of implementing our very specific requirements on web crawling is to create a custom crawler from scratch. In order to reduce the amount of unwanted downloaded content, the crawler actively looks for text rich resources and avoids websites containing material mostly not suitable for text corpora. Our hope was that by avoiding the unwanted content we can not only save bandwidth but also shorten the time required for data post-processing and for building a web corpus of a given size. Our primary aim is to identify high-yielding domains and to avoid low-yielding ones. At the same time we want to make sure that we do not download all data only from a few top-yielding domains, so that we achieve a reasonable diversity of the obtained texts. We collect information about the current yield rate of each domain during crawling. If the yield rate drops below a certain threshold, we blacklist the domain and do not download any further data

Table 1.2: The yield rate threshold as a function of the number of downloaded documents.

Document count    Yield rate threshold
10                0.00
100               0.01
1000              0.02
10000             0.03

from it. We define a minimum amount of data which must be retrieved from each domain before it can be blacklisted. The current limit is 10 web pages or 512 kB of data, whichever is higher. The yield rate threshold is dynamic and increases as more pages are downloaded from the domain. This ensures that sooner or later all domains get blacklisted, which prevents over-representation of data from a single domain. Nevertheless, low-yielding domains are blacklisted much sooner and thus the average yield rate should increase. The yield rate threshold for a domain is computed using the following function:

    t(n) = 0.01 · (log₁₀(n) − 1)

where n is the number of documents downloaded from the domain. The function is based partly on the author's intuition and partly on the results of initial experiments. Table 1.2 contains a list of thresholds for selected numbers of downloaded documents. We experimented with various parameters of the yield rate threshold function. Figure 1.11 shows how the average yield rate changes in time with different yield rate threshold functions. These experiments were performed with Czech as the target language. It can be seen that stricter threshold functions result in a higher average yield rate. However, too high thresholds have a negative impact on the crawling speed (some domains are blacklisted too early). It is therefore necessary to make a reasonable compromise. Note: We used the threshold functions from Figure 1.11 in our initial experiments. We selected an even less strict one (defined in this section) later on during crawling of various data sources. It was a


Figure 1.11: Average yield rate in time for various yield rate threshold functions (crawling the Czech web). (The plot compares three settings over 20 hours of crawling: no constraints, t(n) = 0.02 · (log₁₀(n) − 1), and the less strict t(n) = 0.015 · (log₁₀(n) − 1); the crawler output yield rate is on the y axis and the time in hours on the x axis.)

matter of balancing a high yield rate against the total amount of obtained data. Too much data was thrown away due to a strict threshold. That is why the currently used threshold function is not present in the figure. The main point is that the yield rate is strongly affected by the selected threshold function.
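The blacklisting rule described in this section may be sketched as follows; the code assumes per-domain counters are available and approximates the domain yield rate by the ratio of totals, so it is an illustration rather than SpiderLing's actual implementation.

    # Sketch of yield rate aware domain blacklisting.
    import math

    MIN_PAGES = 10          # a domain may be blacklisted only after 10 pages ...
    MIN_BYTES = 512 * 1024  # ... and 512 kB of downloaded data

    def threshold(n):
        """Dynamic yield rate threshold t(n) = 0.01 * (log10(n) - 1)."""
        return 0.01 * (math.log10(n) - 1)

    def should_blacklist(pages_downloaded, bytes_downloaded, bytes_yielded):
        """Stop crawling a domain once its yield rate falls below the threshold."""
        if pages_downloaded < MIN_PAGES or bytes_downloaded < MIN_BYTES:
            return False    # not enough evidence about the domain yet
        yield_rate = bytes_yielded / bytes_downloaded
        return yield_rate < threshold(pages_downloaded)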

The tool Justext was embedded in SpiderLing to remove content such as navigation links, advertisements, headers and footers from downloaded web pages. Only paragraphs containing full sentences are preserved. Duplicate documents are removed at two levels: (i) the original form (text + HTML), and (ii) the clean text as produced by jusText. Two corresponding checksums are computed for each web page and stored in memory. Documents with previously seen checksums are discarded. Both kinds of removal are done on-the-fly during the crawling to immediately propagate the currently crawled documents' yield rate into the corresponding domain yield rate. This enables SpiderLing to dynamically react to the obtained data.
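The two-level checksum removal may be sketched as follows; the particular hash function is an assumption made for illustration.

    # On-the-fly duplicate removal with two checksums per web page.
    import hashlib

    seen_raw = set()    # checksums of the original form (text + HTML)
    seen_clean = set()  # checksums of the clean text after boilerplate removal

    def is_duplicate(raw_html, clean_text):
        raw_sum = hashlib.md5(raw_html.encode('utf-8')).hexdigest()
        clean_sum = hashlib.md5(clean_text.encode('utf-8')).hexdigest()
        if raw_sum in seen_raw or clean_sum in seen_clean:
            return True
        seen_raw.add(raw_sum)
        seen_clean.add(clean_sum)
        return False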


By applying yield rate thresholds to domains we managed to reduce downloading data which is of no use for text corpora and increased the overall average yield rate. Figure 1.12 contains the same kind of scatterplot as displayed in Figure 1.10, this time on the data downloaded by SpiderLing with Czech as the target language. This is a significant improvement over the previous graph. For low-yielding domains only up to 1 MB of data is downloaded, and high amounts of data are only retrieved from high-yielding sources. Many points (i.e. domains) are aligned along the line representing a yield rate of 10 %. Furthermore, the crawling was stopped already at the 512 kB threshold in the case of many bad domains. Note that the graph in Figure 1.12 does not take de-duplication by Onion into account. It displays the size of the data as output by the crawler (i.e. with boilerplate removed by jusText and no exactly identical documents), not the final de-duplicated text size. Even so, the improvement over the previous graph is indisputable.

We were also interested in the development of the crawling efficiency during a crawl. We expected the yield rate to slightly increase over time (the more data downloaded, the higher yielding domains selected). The results are shown in Figure 1.13. Contrary to our expectations, the measured efficiency grew only slightly or stagnated in most cases. We still consider this a good result because even the stagnating yield rates were good (with regard to Table 1.1). Crawling Japanese was an exception, since the rate kept increasing almost all the time there. The reason may be that the starting rate was low. The inbuilt language dependent models (character trigrams, wordlist) may not be adapted well for Japanese and may throw away good documents as well. The fewer web resources exist in the target language, the sooner the yield rate drops. This can be demonstrated by the example of Tajik Persian. The initial yield rate obviously depends on the quality of the seed (initial) URLs. (For example, many URLs of electronic newspaper articles in the target language give a good initial yield rate.) Irrespective of the seed URLs, the measurements show that sooner or later, the


(Scatterplot of .cz domains with the downloaded data size in bytes on the x axis and the crawler output size in bytes on the y axis, both on logarithmic scales, together with reference lines for yield rates of 0.1, 0.01 and 0.001.)

Figure 1.12: Web domains yield rate for a SpiderLing crawl on the Czech web.


(Crawler output yield rate plotted against the amount of raw data downloaded, from 1 GB to 4096 GB on a logarithmic x axis, for crawls of American Spanish, Czech, Tajik, Japanese, Russian and Turkish.)

Figure 1.13: The yield rate of web domains measured during SpiderLing crawls of six target languages in 2011 and 2012.

program discovers enough URLs to be able to select good quality domains. Unlike other languages, crawling Arabic, Japanese and Turkish was not restricted to the respective national domains. That inevitably led to downloading more data in other languages and thus throwing away more documents. Considering the crawling efficiency in these cases in Figure 1.13, the yield rate also depends on constraining the crawl to national top level domains. The yield rate may decrease after downloading a lot of data (the amount depends on the web presence of the target language). In the case of rare languages, the best (text rich) domains get exhausted and the crawler has to select less yielding domains. The yield rates of parts of the web in selected languages³³ obtained by the crawler recently are summarized in Table 1.3. As can be seen in the table, the yield rate differs for various languages. According to our experience, the difference can be caused by a different configuration of the crawler, including the language resources and seed URLs – the better the yield rate of the seed domains, the better the yield rate of the whole crawl. The different nature of the web in different languages may also play a part.

1.2.4 Deployment of SpiderLing in Corpus Projects

The crawler was successfully used in cooperation with various research partners to build corpora in many languages, e.g.: Tajik Persian (2011, 2012) [Dov+11; DSŠ12b], Arabic (2012, 2018) [Art+14; Bel+13], Japanese (2011, 2018) [Srd+13], Hindi (2013, 2017) [Boj+14], Amharic (2015-2017) [RS16], Lao (2018-2019), Tagalog (2018, 2019) [Bai+19], Estonian (2013, 2017, 2019) [KSK17; Kop+19]. A summary of data sizes in four stages of text processing of web corpora crawled by SpiderLing recently can be found in Table A.1 in the Appendices. Three linguistic resource projects benefited from the crawler and the processing pipeline: ELEXIS³⁴, Habit³⁵ and Lindat³⁶.

33. Defined for the whole crawl similarly to a single domain. 34. https://elex.is/ 35. https://habit-project.eu/ 36. https://lindat.cz/


Table 1.3: Yield rate of crawling of the web of selected target languages in 2019: The ratio of the size of the plaintext output of the crawler to the size of all data downloaded is calculated in the fourth column ‘YR’. The ratio of the size of the plaintext after discerning similar languages and near paragraph de-duplication to the size of all data downloaded is calculated in the last, cumulative yield rate column ‘CYR’. Cs & sk denotes Czech and Slovak languages that were crawled together.

             Crawler output                After de-duplication
Language     HTML       plaintext  YR      plaintext   CYR
Cs & sk      9.74 TB    241 GB     2.48 %  71.7 GB     0.74 %
English      9.09 TB    611 GB     6.72 %  300 GB      3.30 %
Estonian     3.81 TB    57.9 GB    1.52 %  19.4 GB     0.51 %
Finnish      7.65 TB    137 GB     1.79 %  41.9 GB     0.55 %
French       7.44 TB    264 GB     3.55 %  98.8 GB     1.33 %
Greek        3.71 TB    109 GB     2.93 %  35.7 GB     0.96 %
Irish        38.6 GB    1.02 GB    2.66 %  398 MB      1.03 %
Italian      6.26 TB    195 GB     3.12 %  96.7 GB     1.54 %
Polish       2.17 TB    97.4 GB    4.50 %  46.5 GB     2.14 %


According to our records, SpiderLing was downloaded by others more than 600 times between November 2016 and April 2020, mostly by academics from around the world. The crawler was successfully used by researchers at the Slovak Academy of Sciences [Ben14; Ben16] and the University of Zagreb [LK14; LT14].

1.3 Brno Corpus Processing Pipeline

The process of building corpora, also called the corpus creation pipeline, consists of several steps. Having discussed the topic in the WaC community at the WaC workshop in 2014, we discovered that the principal parts of the pipeline are the same for many corpus families made recently. An older paper on creating WaC corpora reports a similar process as well [Bar+09]. The differences lie in the technical implementation of the cleaning tools, in some minor improvements tailored for a specific language, or in some minor problem spotted in the particular corpus data. But the general coarse grained steps are the same. [VP12] summarizes the corpus creation pipeline as follows:
1. In the first phase, crawling creates an archive of HTML pages that are to be processed further.
2. The second phase consists of boilerplate removal and de-duplication, which yields the raw text of these HTML pages without any navigation elements or non-informative text.
3. In the subsequent phases, the raw text is tokenised, sentence splitting and subsequent linguistic processing are applied.
In the first phase, the document language and character encoding have to be identified. In the case of a single target language, data in other languages are stripped off. The document encoding can be normalized to UTF-8, which is the most widespread encoding standard capable of addressing all necessary character codepoints. [VP12] stated that without boilerplate detection and removal, one could hardly use the result, since boilerplate text snippets would distort the statistics (e.g. the distributional similarity information) of the final corpus. Similarly, [Pom11] agreed that boilerplate is known to cause problems in text corpora, since the increased count of some terms gives biased information about the language and makes the corpus search provide no useful evidence about the phenomenon being investigated. The literature names non-informative parts such as navigation menus, lists of links, decorative elements, advertisements, copyright notes, headers and footers, etc. as examples of web document boilerplate. The main approaches to deal with boilerplate either:


1. Take advantage of knowledge of the common HTML DOM structure across multiple documents on a website [BR02];
2. Or take into account properties of the inspected HTML page alone (an unsupervised, heuristic method) such as the HTML DOM, headings, link density, paragraph length, stoplist words, etc. [Pom11]
There was also a competition on cleaning web pages organized by the ACL SIG WAC – CLEANEVAL³⁷ [Bar+08]. The best performing tool was Victor [MPS07]. Other tools we have encountered later are BoilerPipe³⁸, based on [KFN10], and jusText³⁹ [Pom11]. The second important issue is duplicity. Digital information on the web is easily copied and thus documents may have significant overlaps. That might happen on purpose, e.g. in the case of the spamming technique called ‘content farms’ (covered later on), or in a seemingly innocent case of multiple online newspapers sharing the same announcement released by a press agency. [Pom11] added to these cases also document revisions (multiple versions of a single document usually overlap a lot) and quotations of previous posts in online discussion forums. Pomikálek further explains that identifying and removing duplicate and near-duplicate texts is therefore essential for using web data in text corpora. For instance, ‘the users might get many duplicate lines when searching in a corpus with excess duplicate content’. Similarly to boilerplate, ‘duplicate content may bias results derived from statistical processing of corpus data by artificially inflating frequencies of some words and expressions’. The work also points out that it is necessary to distinguish natural recurrences of phrases or short chunks of text (which are not harmful to effective use of text corpora) from whole duplicate or near-duplicate sentences and paragraphs (characteristic for so called co-derivative texts). [Pom11] The corpus de-duplication experience follows: [Bar+09] removed duplicate documents sharing at least two of 25 5-grams sampled per document.

37. http://cleaneval.sigwac.org.uk/ 38. http://code.google.com/p/boilerpipe/ 39. http://corpus.tools/wiki/Justext


Broder’s shingling algorithm [Bro+00] was implemented by [VP12] and [Pom11] for identifying duplicate n-grams of words in corpus data. [VP12] used an approach that creates a smaller number of still-distinctive overlapping phrases based on ‘spots’ that are composed of a function word and three content words rather than plain word n-grams. Unlike the previous cases, [LK14] and [Ben14] just marked the duplicate paragraphs and let the user of the corpus decide whether the duplicate part is needed. The crawler described in this thesis has found its place in the context of contemporary web corpus building projects. SpiderLing is a component of the group of tools for building large web corpora produced and maintained by the Natural Language Processing Centre at Masaryk University⁴⁰. This set of tools was nicknamed by others [Ben16; LT14] the ‘Brno Corpus Processing Pipeline’. All tools in the pipeline are open source, released under a free licence. The processing chain for creating a corpus using the tools in the pipeline follows:⁴¹
1. Consider the geographical distribution of the target languages⁴² (countries where they are spoken), the top level domains (TLDs) of those countries, the main script used for writing each target language, and similar languages that could be hard to distinguish from the target languages when configuring the crawler.

2. Prepare the language dependent resources used for text processing in the following steps. The resources can be obtained from another general corpus or by downloading web pages and text from trustworthy (i.e. good content) sites, e.g. news sites,

40. https://nlp.fi.muni.cz/en/NLPCentre 41. The author of this thesis is also the author of the crawler and the language discerning tool and a co-author of the character encoding detection tool and the universal tokeniser. 42. Assuming the goal is to obtain monolingual corpora in target languages, these languages along with other languages spoken in the same geographical or internet space should be recognised. The pipeline will aim to accept text in target languages and reject content in other recognised languages.


government sites, quality blogs and Wikipedia. Resources not supplied with the tools have to be created manually:

(a) 1,000 web pages in each recognised language to train a byte trigram model for Chared, the encoding detection tool. The software comes with default models for many languages.
(b) The 3,000 most frequent words in each target language. A larger amount is recommended in the case of highly inflectional languages. The wordlist is used by Justext, the boilerplate removal tool, to identify general text in the target language. The software comes with default wordlists for many languages.
(c) 100 kB of plaintext for the character trigram language model that is built by the crawler for each recognised language. Natural sentences on general topics are advisable here.

3. Find at least thousands of seed URLs.⁴³ It is reasonable to include just trustworthy sites in the case of languages widely spread on the web, to reduce downloading poor quality content from the start. On the other hand, if crawling scarce languages, use all URLs that can be found. Wikipedia and web directories such as curlie.org⁴⁴ are rich sources. Employing a search engine to search for documents containing frequent words or phrases in the target languages, as suggested by [BB04], is helpful when the number of initial sites is low.
4. Start the crawler using the seed URLs.
5. Processing and evaluation of the downloaded data is done by the crawler on-the-fly:

(a) Encoding detection using Chared. Although UTF-8 is the most frequent encoding today, encoding detection is still needed for a part of the web.
(b) Language filtering using the character trigram model.

43. It is still possible to start the crawler with a single seed URL. 44. Formerly known as dmoz.org.


(c) Boilerplate removal using Justext. The algorithm is lexically informed, rejecting material that does not have a high proportion of tokens that are words of the language; thus most material which is not in the desired language is removed.
(d) De-duplication of identical documents.
(e) Evaluation of the yield rate and other features of web domains.

6. The post-processing phase follows offline with tokenisation performed by Unitok⁴⁵ [MSP14] or a third-party segmentation tool (often also a morphological analyser) in the case of East Asian languages.
7. Near duplicate paragraphs are removed using the de-duplication tool Onion⁴⁶. [Pom11] The task is performed on the paragraph level. Paragraphs consisting of more than 50 % word 7-tuples encountered in previously processed data are removed. Since such de-duplication is a highly demanding task in terms of both processor cycles and memory consumption, it has not been embedded into the crawler. Nonetheless, we are still considering some way of integration, since it would enable a more accurate estimate of yield rates and thus improve the crawler's web traversing algorithm.
8. Similar languages should be discerned and multi-language pages split into separate documents. A language filtering script using wordlists with word occurrence counts from big monolingual web corpora can do that. [Her+16; Suc19]
9. Morphological, syntactic, or semantic annotation can be performed by external tools.⁴⁷ We made extensive use of TreeTagger and FreeLing for European languages; Stanford tools for Chinese,

45. Universal text tokenisation tool with profiles for many languages, http://corpus.tools/wiki/Unitok.
46. http://corpus.tools/wiki/Onion
47. Annotation is not a part of Brno Corpus Processing Pipeline. It is mentioned here to mark its place in the list of corpus building procedures. It is done after all text cleaning and processing.


meCab (with UniDic) for Japanese, Han Nanum for Korean, and MADA (in collaboration with Columbia University) for Arabic in the past.
10. All data is stored and indexed by the corpus manager Sketch Engine.
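The whole chain can be summarised by the schematic driver below; every callable in the tools mapping is a stand-in for one of the programs named above (SpiderLing, Unitok, Onion, the language filter, an external tagger, Sketch Engine), not a real API of any of these tools.

    # Schematic outline of the corpus building chain; the callables passed in
    # the 'tools' mapping are placeholders for the real programs, not their APIs.
    def build_corpus(seed_urls, target_language, tools):
        docs = tools['crawl'](seed_urls)                         # steps 4-5: crawl + on-the-fly cleaning
        docs = [tools['tokenise'](d) for d in docs]              # step 6: tokenisation
        docs = tools['deduplicate'](docs)                        # step 7: near-duplicate paragraphs
        docs = tools['language_filter'](docs, target_language)   # step 8: discern similar languages
        docs = tools['annotate'](docs)                           # step 9: external taggers
        tools['index'](docs)                                     # step 10: corpus manager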

2 Cleaner Web Corpora

We observe the web has become the source of data preferred by many for NLP oriented research. As has been stated in the introductory section, the content of the web is not regulated in terms of data quality, originality, or correct description. That brings serious issues to deal with. Our work is directed towards cleaning web corpora. Boilerplate removal and near-duplicate de-duplication (explained in the previous chapter) are solved problems. Methods for language identification and web spam removal in large web corpora are presented in this chapter. A tool for language identification and discerning similar languages is introduced and evaluated in section 2.1. Its application to web corpora is presented as well. The issue of non-text in contemporary web corpora is described in section 2.2. A method for removing spam from web corpora through supervised learning using FastText is presented at the end of this chapter.

2.1 Discerning Similar Languages

We present a method for discriminating similar languages based on wordlists from large web corpora. The main benefits of the approach are language independence, a measure of confidence of the classification and an easy-to-maintain implementation. The method is evaluated on data sets of the workshops on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial). The resulting accuracy is comparable to other methods successfully performing at the workshop. Language identification is a procedure necessary for building monolingual text corpora from the web. For obvious reasons, discriminating similar languages is the most difficult case to deal with. Continuing in the steps of our previous work [Her+16], our goal in corpus building is to keep documents in the target languages while removing texts in other, often similar, languages. The aim is to process text of billion-word sized corpora using efficient and language independent algorithms. Precision (rather than recall), processing speed and an easy-to-maintain software design are of key importance to us. Data to evaluate language discrimination methods have been created by the organisers of the VarDial shared tasks since 2014 [Mal+16; Zam+17; Zam+14; Zam+15]. Various media ranging from nice newspaper articles to short social network texts full of tags were made available. Successful participants of this series of workshops have published their own approaches to the problem.

2.1.1 Method Description

The aim of the method presented in this thesis is to provide a simple and fast way to separate a large collection of documents from the web by language. This is the use case: millions of web pages are downloaded from the web using a web crawler. To build monolingual corpora, one has to split the data by language. Since the set of internet national top level domains (TLDs) targeted by the crawler is usually limited and the similarity of the downloaded texts to the target languages can be easily measured using e.g. a character n-gram model [LB12], one can expect only a limited set of languages similar to the target languages to discriminate. The


method should work both with documents in languages that have been discerned in the past and with texts in languages never processed before. The presented method does:
∙ Enable supporting new languages easily (that implies the same way for adding any language).
∙ Allow adding a language never worked with before, using just the web pages downloaded or a resource available for all languages (e.g. articles from Wikipedia).
∙ Not use language specific resources varying for each language supported (e.g. a morphological database) – since that makes supporting new languages difficult.
∙ Apply to any structure of text, e.g. documents, paragraphs, sentences.
∙ Provide a way to measure the contribution of parts of a text, e.g. paragraphs, sentences, tokens, to the final classification of the structure of the text.
∙ Provide a measure of confidence to allow setting a threshold and classifying documents below the threshold of minimal confidence as mixed or unknown language.
∙ Work fast even with collections of millions of documents.
This method uses the initial step of the algorithm described in our co-authored paper [Her+16]. The reason for not including the expectation-maximisation steps mentioned in the paper is that we aim to decrease the complexity of the solution, keeping the data processing time reasonably short. The method exploits big monolingual collections of web pages downloaded in the past or even right before applying the method (i.e. using the text to identify its language as the method's data source at the same time). The language of documents in such collections should be determined correctly in most cases; however, some mistakes must be accepted because there are many foreign words in monolingual web corpora, since e.g. foreign named entities or quotes are preserved. Even a lot of low frequency noise can be tolerable. Lists of words with relative frequency are built from these big monolingual collections of web pages.
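A minimal sketch of building such a frequency wordlist from a tokenised corpus follows; the one-token-per-line input and the tab-separated output format are assumptions made for illustration.

    # Build a frequency wordlist (word<TAB>count) from a tokenised corpus read
    # from standard input, one token per line.
    import sys
    from collections import Counter

    counts = Counter(token for token in (line.strip() for line in sys.stdin) if token)

    for word, count in counts.most_common():
        print(f'{word}\t{count}')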


The method uses a decimal logarithm of word count per billion words to determine the relative wordlist score of each word from the list of words according to the following formula:

    score(w) = log₁₀( f(w) · 10⁹ / |D| )

where f(w) is the corpus frequency of the word (the number of occurrences of the word in the collection) and |D| is the corpus size (the number of all occurrences of all words in the collection). The wordlist is built for all languages to discern, prior to reading the input text. Usually, when building corpora from the web, languages similar to the target languages and languages prevalent in the region of the internet national top level domains occurring in the crawled data are considered. A big web corpus is a suitable source. To improve the list by reducing the presence of foreign words, limiting the national TLD of the source web pages is advisable. E.g. using texts from TLD .cz to create a Czech wordlist should, intuitively, improve precision at a slight cost of recall. The input of the method, i.e. the documents to separate by language, must be tokenised. Unitok [MSP14] was used to tokenise text in all sources used in this work. Then, for each word in the input, the relative wordlist score is retrieved from each language wordlist. The scores of all words in a document, grouped by language, are summed up to calculate the language score of a document. The same can be done for paragraphs or sentences or any corpus structure.

    document score(language) = Σ_{w ∈ document} score_language(w)

The language scores of a document are sorted and the ratio of the two highest scoring languages is computed to determine the confidence of the classification. The score ratio is compared to a pre-set confidence threshold. If the ratio is below the threshold, the document is marked as a mixed language text and not included in the final collection of monolingual corpora. Otherwise the resulting language is the language with the highest score.

    confidence ratio(document) = document score(top language) / document score(second top language)
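A minimal sketch of the scoring and confidence computation follows; the wordlist file format matches the sketch above, |D| is approximated by the sum of the listed counts, and the threshold value of 1.01 is only an example.

    # Wordlist based language scoring with a confidence threshold.
    import math

    def load_wordlist(path):
        """Read 'word<TAB>count' lines and convert counts to log10 of
        occurrences per billion words."""
        counts = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                word, count = line.rstrip('\n').split('\t')
                counts[word] = int(count)
        corpus_size = sum(counts.values())
        return {w: math.log10(c * 1e9 / corpus_size) for w, c in counts.items()}

    def classify(tokens, wordlists, threshold=1.01):
        """Return the winning language, or None for mixed or unknown text."""
        totals = {lang: sum(wl.get(t, 0.0) for t in tokens)
                  for lang, wl in wordlists.items()}
        ranked = sorted(totals, key=totals.get, reverse=True)
        best, second = ranked[0], ranked[1]
        if totals[best] <= 0:
            return None                     # nothing recognised at all
        if totals[second] > 0 and totals[best] / totals[second] < threshold:
            return None                     # below the confidence threshold
        return best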


According to our experience, setting the confidence threshold quite low (e.g. to 1.005) is advisable in the case of discerning very similar languages, while higher values (e.g. 1.01 to 1.05) work for other cases (e.g. Czech vs. Slovak, Norwegian vs. Danish). We usually understand a paragraph to be the largest structure consisting of a single language in the case of multi-language web pages. The method presented in this work allows separating paragraphs in different languages found in a single multilingual document into multiple monolingual documents. Although code switching within a paragraph is possible, detecting that phenomenon is beyond the scope of this work. Figure 2.1 shows the overall sentence language scores as well as the individual word language scores in a sentence from the VarDial 2014 test data. The words ‘scheme’, ‘council’ and ‘tenant’ contribute the most to correctly classifying the sample as British English (rather than American English). Punctuation was omitted from the wordlists, thus getting a zero score.

2.1.2 Evaluation on VarDial Datasets

The method was used to build language wordlists from the sources described in the next subsection and evaluated on groups of similar languages. In this work, the TenTen web corpus family [Jak+13] was used to build the language wordlists. Aranea web corpora [Ben14; Ben16] were used in addition to TenTen corpora in the case of Czech and Slovak. The bsWaC, hrWaC and srWaC web corpora [LK14] were used in the case of Bosnian, Croatian and Serbian. All words, even hapax legomena, were included in the wordlists unless stated otherwise. The source web pages were limited to the respective national TLD where possible. Another set of wordlists, used to compare the method to other approaches, was obtained from the DSL Corpus Collection v. 1.¹ The data set was made available at VarDial 2014 and described by [Zam+14].²

1. http://ttg.uni-saarland.de/resources/DSLCC/ 2. http://corporavm.uni-koeln.de/vardial/sharedtask.html


Under        5.74    5.74
the          7.77    7.75
rent         4.70    4.59
deposit      4.56    4.40
bond         4.49    4.63
scheme       5.26    4.41
,            0.00    0.00
the          7.77    7.75
council      5.56    5.20
pays         4.20    4.26
the          7.77    7.75
deposit      4.56    4.40
for          7.06    7.07
a            7.36    7.34
tenant       4.34    3.94
so           6.34    6.31
they         6.51    6.50
can          6.53    6.54
rent         4.70    4.59
a            7.36    7.34
property     5.38    5.37
privately    4.05    3.99
.            0.00    0.00

Figure 2.1: Sentence score and word scores calculated to discern British English from American English using relative word counts from a large web corpus. A sample from VarDial 2014 test data, vertical format. Column description: Word form, en-GB score, en-US score.

Table 2.1: Sizes of wordlists used in the evaluation. Large web sources – TenTen, Aranea and WaC corpora – were limited to respective national TLDs. Other wordlists were built from the training and evaluation data of the DSL Corpus Collection and parts of the GloWbE corpus. Columns Web, DSL and GloWbE contain the count of words in the respective wordlist.

Language                TLD    Web           DSL        GloWbE
Bosnian                 .ba    2,262,136     51,337
Croatian                .hr    6,442,922     50,368
Serbian                 .rs    3,510,943     49,370
Indonesian              –      860,827       48,824
Malaysian               –      1,346,371     34,769
Czech                   .cz    26,534,728    109,635
Slovak                  .sk    5,333,581     121,550
Brazilian Portuguese    .br    9,298,711     52,612
European Portuguese     .pt    2,495,008     51,185
Argentine Spanish       .ar    6,376,369     52,179
European Spanish        .es    8,396,533     62,945
English, UK             .uk    6,738,021     42,516     1,222,292
English, US             .us    2,814,873     42,358     1,245,821

The last pair of wordlists used for evaluating the method was taken from the corpus GloWbE, comprising 60 % blogs from various English speaking countries [DF15].³ The sizes and source TLDs of the wordlists are shown in Table 2.1. The difference in wordlist sizes is countered by using relative counts in the algorithm. The evaluation of the language separation method described in this work on the DSL Corpus Collection v. 1 gold data⁴, performed by the

3. http://www.corpusdata.org/
4. https://bitbucket.org/alvations/dslsharedtask2014/src/master/test-gold.txt

Table 2.2: Overall accuracy using large web corpus wordlists and DSL CC v. 1 training data wordlists on DSL CC v. 1 gold data. The best result achieved by participants in VarDial 2014 can be found in the last column.

Languages                     Wordlist        Accuracy    DSL Best
English UK/US                 Web corpora     0.6913      0.6394
English UK/US                 GloWbE          0.6956      0.6394
English UK/US                 DSL training    0.4706      0.6394
Other languages               Web corpora     0.8565      0.8800
Other languages               DSL training    0.9354      0.9571
Bosnian, Croatian, Serbian    DSL training    0.8883      0.9360
Indonesian, Malaysian         DSL training    0.9955      0.9955
Czech, Slovak                 DSL training    1.0000      1.0000
Portuguese BR/PT              DSL training    0.9345      0.9560
Spanish AR/ES                 DSL training    0.8820      0.9095

original evaluation script⁵, can be found in Table 2.2. The resulting overall accuracy is compared to the best result presented at VarDial 2014.⁶ This comparison shows that our method performed better with the large web corpora based wordlists than with the DSL training data based wordlists in the case of discerning British from American English.

5. https://bitbucket.org/alvations/dslsharedtask2014/src/master/dslevalscript.py
6. http://htmlpreview.github.io/?https://bitbucket.org/alvations/dslsharedtask2014/downloads/dsl-results.html

Table 2.3: Performance of our method on VarDial DSL test data compared to the best score achieved by participants of the competition at that time.

Year    Dataset    Wordlist        Metric      Score     DSL best
2015    A          Web corpora     Accuracy    0.9149    0.9565
2015    B          Web corpora     Accuracy    0.8999    0.9341
2016    1 A        DSL training    Macro-F1    0.8743    0.8938
2016    1 A        Web corpora     Macro-F1    0.8420    0.8889
2017    DSL        DSL training    Macro-F1    0.8883    0.9271
2017    DSL        Web corpora     Macro-F1    0.8414    N/A

A brief overview of the results achieved by Language Filter on selected datasets from the subsequent VarDial shared tasks from 2015⁷, 2016⁸ and 2017⁹ can be found in Table 2.3. Wordlists created from the shared task training data might have been better than the large web corpora wordlists for discriminating the DSL test data, since the DSL training sentences, coming from the domain of journalism and newspaper texts, were more similar to the test sentences than web documents are. Generally, our wordlist based language separation method performs comparably to the results of the participants of the VarDial shared tasks, albeit never reaching the top score since the 2015 edition. A better adaptation to the task data would probably have helped a bit.

7. VarDial 2015, http://ttg.uni-saarland.de/lt4vardial2015/dsl.html. DSLCC v. 2.0; Set A: newspapers, named entities; Set B: newspapers, named entities blinded. Languages: Bosnian, Croatian, and Serbian; Bulgarian and Macedonian; Czech and Slovak; Malay and Indonesian; Portuguese: Brazil and Portugal; Spanish: Argentina and Spain; Other (mixed) languages.
8. VarDial 2016, http://ttg.uni-saarland.de/vardial2016/dsl2016.html. DSLCC v. 3; Sub-task 1 (DSL); Set A: newspapers. Languages: Bosnian, Croatian, and Serbian; Malay and Indonesian; Portuguese: Brazil and Portugal; Spanish: Argentina, Mexico, and Spain; French: France and Canada.
9. VarDial 2017, http://ttg.uni-saarland.de/vardial2017/sharedtask2017.html. DSLCC v. 4; There was a single DSL task/dataset. Languages: Bosnian, Croatian, and Serbian; Malay and Indonesian; Persian and Dari; Canadian and Hexagonal French; Brazilian and European Portuguese; Argentine, Peninsular, and Peruvian Spanish. Competition results of the open submission in 2017 are not available.


2.1.3 Comparison to Other Language Detection Tools

To compare Language Filter to other publicly available and easy-to-install language detection tools, langid.py¹⁰ [LB12] and langdetect¹¹ [Nak10], which are well known pieces of software, were selected. Both tools use a naive Bayes classifier, the first with byte 1-to-7-grams, the second with character n-grams. 1,000 random paragraphs from the Czech and Slovak web crawled in 2019 were selected for the experiment. All tools were set to discern Czech, Slovak and English. To create the ‘gold standard’ data for the task, the base set of paragraphs was classified by all three tools. Texts that were given the same label by all tools were put in the gold standard without being checked by a human. 104 paragraphs where the tools disagreed were labelled by us. In the end, there were 920 Czech pieces of text, 27 Slovak texts and 5 English texts in the collection, i.e. 952 altogether. The rest were either cases where a human could not determine the language (e.g. both Czech and Slovak could be right) or duplicates. The average length of the texts in the collection is 48 tokens, the median is 30 tokens. Our tool was slightly modified to predict a language even for short texts or texts where the scores of the top two languages were the same. (The default behaviour is to throw away these cases.) Furthermore, the tool was used with selected sizes of frequency wordlists. The smallest number – the most frequent 10,000 words in the language – is the size of the wordlist distributed with the tool under a free licence by us. The full frequency wordlists come from large web corpora: csTenTen17, skTenTen11 and enTenTen15. Smaller wordlists were taken from their top million, 500,000, 100,000 and 50,000 lines to evaluate how much data helps to improve the result.

10. Source homepage: https://github.com/saffsd/langid.py. The latest commit (4153583 from July 15, 2017) was obtained.
11. Source homepage: https://github.com/Mimino666/langdetect. The latest commit (d0c8e85 from March 5, 2020) was obtained. That is a Python re-implementation of the original tool written in Java – https://github.com/shuyo/language-detection/tree/wiki.

Table 2.4: Comparison of language identification tools on 952 random paragraphs from the Czech and Slovak web. The tools were set to discern Czech, Slovak and English.

Tool                                        Accuracy
Langid.py                                   0.963
Langdetect                                  0.954
Language Filter, 10 k wordlist (public)     0.960
Language Filter, 50 k wordlist              0.985
Language Filter, 100 k wordlist             0.991
Language Filter, 500 k wordlist             0.993
Language Filter, 1 M wordlist               0.994
Language Filter, unlimited wordlist         0.995

Table 2.4 shows that Language Filter outperforms the other tools in the experiment when using the 50,000 word wordlist and larger lists. The tool performed best with the unlimited wordlists.¹² The results indicate that the larger the source web corpus, the better the wordlist for discerning languages. A Makefile and test data to reproduce the experiment are attached to http://corpus.tools/wiki/languagefilter.

2.1.4 Application to Web Corpora

The software was added to the ‘Brno Corpus Processing Pipeline’. By applying the tool to all web corpora recently produced by the author of this thesis, text in unwanted languages was removed, thus improving the quality of the corpora.

12. More precisely ‘full’ wordlists – limited just by the relative corpus frequency of a word which had to be greater than one hit per billion words in the respective corpus to account the word in the list. The size of all ‘full’ wordlists in this experiment was between 2 and 3 million items.

Table 2.5: Discriminating similar languages in the Indonesian web corpus from 2010 (Indonesian WaC corpus v. 3 by Siva Reddy): Document count and token count of corpus parts in languages discerned.

                           Documents              Tokens
Cleaning Indonesian        Count      %           Count          %
Original data              27,049     100 %       111,722,544    100 %
Indonesian language        25,876     95.7 %      94,280,984     84.4 %
Malay language             12,684     46.9 %      13,946,288     12.5 %
English                    5,780      21.4 %      1,397,778      1.3 %
Arabic                     263        1.0 %       107,359        0.1 %
French                     185        0.7 %       19,471         0.0 %

The aim was to get clean monolingual corpora by removing paragraphs and documents in unwanted languages. Selected results are summarised in tables at the end of this section.¹³ Sizes of the language parts identified in web pages downloaded from the Indonesian web are summarised in Table 2.5. Indonesian and Malay are similar languages. 16 % of the text was removed from the corpus to get a better Indonesian-only corpus. Table 2.6 presents the sizes of language parts identified while processing Norwegian web texts. Partners in the corpus building projects who were native Norwegian speakers immediately revealed there was a lot of Danish content in the corpus. By applying the language filtering script, 28 % of the text was removed from the corpus, being mostly Danish text, or mixed language text below the threshold of reliable identification, or very short paragraphs also below the threshold of reliable identification. Variants of Norwegian were discerned too. In the end, two corpora were built: a Bokmål corpus consisting of the majority of the original data and a smaller Nynorsk corpus. A summary of corpus sizes before and after the additional language filtering of recently built web corpora is presented in Table 2.7.

13. Note the sum of document counts of separated language data in the tables may exceed 100 % of the original document count since multi-language documents were split to separate documents for each language identified.

Table 2.6: Discriminating similar languages in the Norwegian web corpus from 2015 (noTenTen15): Document count and token count of corpus parts in languages discerned.

                            Documents              10⁶ tokens
Cleaning Norwegian          Count        %         Count    %
Original data               5,589,833    100 %     1,954    100 %
Bokmål Norwegian            3,443,807    61.6 %    1,365    69.8 %
Nynorsk Norwegian           175,885      3.1 %     50       2.6 %
Danish and other langs      1,970,141    35.2 %    539      27.6 %

The procedure proved the most useful in the case of Estonian. Nevertheless, the quality of all corpora in the table benefited from the process. Multiple languages were identified in the source web texts of the Estonian web corpus from 2019. Although Estonian, English and Russian were discerned using character trigram language models and the boilerplate removal tool Justext – the usual procedure in the former text processing pipeline – almost 5 % of corpus tokens, including 1.7 million English words and a smaller amount of Russian words, still had to be removed using the method described here. The confidence threshold was set to 1.01 in this case. A complete list of language parts can be found in Table 2.8. An example of using this method to deal with bad orthography – missing diacritics in Czech and Slovak¹⁴ ¹⁵ – is shown by Table 2.9. Texts suffering from the problem were treated as a separate language to filter out. A copy of the Czech frequency wordlist with diacritics removed was used to simulate the language ‘Czech without diacritics’. A ‘Slovak without diacritics’ wordlist was produced the same way from

14. For example, ‘plast’, ‘plást’ and ‘plášť’ are three different words separable only by diacritical marks in Czech orthography. A notable number of Czech and Slovak texts on the web are written without diacritics, either because the writer is too lazy to type them or for a technical reason. These texts should be removed from corpora.
15. We faced the same issue in a Tajik web corpus written in Cyrillic in 2011. It turned out that many Tajik people used Russian keyboards without labels for characters not found in the Russian alphabet.


Table 2.7: Overview of removal of unwanted languages in recently built web corpora (gaTenTen20, enTenTen19, etTenTen19, frTenTen19, huTenTen12, itTenTen19, roTenTen16). Document count and token count of corpus data before and after language filtering. ‘Removed’ stands for the percent of data removed.

Documents
Target language    Before         After          Removed
Irish              242,442        239,840        1.07 %
English            126,318,554    126,118,769    0.16 %
Estonian           13,123,009     11,971,640     8.77 %
French             60,977,258     60,904,106     0.12 %
Hungarian          6,447,178      6,427,320      0.31 %
Italian            44,753,427     44,708,252     0.10 %
Romanian           9,302,262      9,239,153      0.68 %

Millions of tokens
Target language    Before         After          Removed
Irish              172            172            0.48 %
English            104,997        104,184        0.77 %
Estonian           7,762          7,404          4.61 %
French             44,448         44,079         0.83 %
Hungarian          3,162          3,153          0.29 %
Italian            31,102         30,989         0.36 %
Romanian           3,143          3,136          0.21 %


Table 2.8: Languages recognised in the Estonian web corpus from 2019 (etTenTen19). Document count and token count of corpus parts in languages discerned.

                       Documents                Tokens
Filtering Estonian     Count         %          Count            %
Original data          13,123,009    100 %      7,761,827,865    100 %
Estonian               11,971,640    91.2 %     7,404,008,394    95.4 %
Danish                 344           0.00 %     16,004           0.00 %
English                35,362        0.27 %     1,714,243        0.02 %
Finnish                1,069         0.01 %     66,375           0.00 %
French                 1,323         0.01 %     59,217           0.00 %
German                 964           0.01 %     70,706           0.00 %
Italian                861           0.01 %     40,963           0.00 %
Polish                 818           0.01 %     44,014           0.00 %
Portuguese             399           0.00 %     19,611           0.00 %
Russian                643           0.00 %     70,805           0.00 %
Spanish                1,059         0.01 %     44,907           0.00 %
Mixed or too short     3,073,350     23.42 %    36,546,756       0.47 %

Table 2.9: Languages recognised in the output of SpiderLing crawling the Czech and Slovak web in 2019. Document count and token count of corpus parts in languages discerned.

                             Documents               10⁶ tokens
Filtering Czech & Slovak     Count         %         Count     %
Original data                64,772,104    100 %     35,187    100 %
Czech                        62,520,894    96.5 %    33,111    94.1 %
Czech without diacritics     3,375,842     5.2 %     747       2.1 %
Slovak                       3,213,002     5.0 %     687       2.0 %
Slovak without diacritics    1,171,249     1.8 %     130       0.4 %
Other languages              162,373       0.3 %     8         0.0 %
Mixed or too short           21,780,897    33.6 %    505       1.4 %

the general Slovak wordlist. The confidence threshold was set to 1.01 in this case. To conclude the section on discerning similar languages: the language filtering tool from the Brno Corpus Processing Pipeline was introduced. The evaluation shows it works very well for general corpus cleaning as well as for discerning similar languages. A trick to remove texts without diacritics using the tool was presented too. We think the main advantage of our method is that it provides an understandable threshold for the confidence of the classification. We do not mind false positives – removing more text from the result corpus than necessary – but it is important to know how the threshold by which the removal rate is controlled works.
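A minimal sketch of deriving such a ‘without diacritics’ wordlist follows; the tab-separated word and count format is the same assumption as in the earlier sketches.

    # Derive a 'without diacritics' wordlist from an existing frequency wordlist.
    import sys
    import unicodedata

    def strip_diacritics(word):
        # Decompose accented characters and drop the combining marks.
        decomposed = unicodedata.normalize('NFD', word)
        return ''.join(c for c in decomposed if not unicodedata.combining(c))

    for line in sys.stdin:                   # input: word<TAB>count
        word, count = line.rstrip('\n').split('\t')
        print(f'{strip_diacritics(word)}\t{count}')
    # Counts of words that collapse to the same stripped form could be summed
    # in a second pass if needed.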

2.2 Non-Text Removal

2.2.1 Web Spam in Text Corpora

According to our experience, the most problematic issue of contemporary web corpora is the presence of web spam. The internet is a messy place and the spread of spam affecting text is ‘getting worse’. In [KS13], we reported that the biggest difference observed between the 2008 and 2012 web corpora, both obtained in the same way, was web spam. [GG05] defines web spamming as ‘actions intended to mislead search engines into ranking some pages higher than they deserve’ and states that the amount of web spam has increased dramatically, leading to a degradation of search results. According to the paper, even spamming techniques observed in 2005 involved modification of text data or computer generated text. A text corpus built for computational linguistics purposes should contain fluent, meaningful, natural sentences in the desired language. However, some spamming methods, in addition to misleading the search results, break those properties and thus hinder the quality of the corpus. The reasons are similar as with boilerplate or duplicate content – it may affect results derived from statistical processing of corpus data significantly. Therefore it is important to study spamming techniques and spam web pages to be able to avoid them in the process of cleaning text data.

Here are three examples of nonsense texts from an English web corpus found by [KS13]:
∙ The particular Moroccan oil could very well moisturize dry skin handing it out an even make-up including easier different textures.
∙ Now on the web stores are very aggressive price smart so there genuinely isn't any very good cause to go way out of your way to get the presents (unless of course of program you procrastinated).
∙ Hemorrhoids sickliness is incorrect to be considered as a lethiferous malaise even though shutins are struck with calamitous tantrums of agonizing hazards, bulging soreness and irritating psoriasis.


Consider another example from the Google Quality guidelines for webmasters:16 The same words or phrases can be repeated so often that it sounds unnatural. That is a sign of a spamming technique called keyword stuffing.
∙ We sell custom cigar humidors. Our custom cigar humidors are handmade. If you're thinking of buying a custom cigar humidor, please contact our custom cigar humidor specialists at [email protected].
As can be seen, the context of an examined word is not natural there, i.e. not as it would be used in a meaningful sentence uttered by a human speaker. Such texts skew the properties of the word in the corpus: its corpus frequency, its document frequency, its collocates and, because of that, all derived analyses as well. Nonsense sentences cannot be used to explain the meaning of the word either.
We managed to work around the issue of spammed data – in most cases – by selecting quality sources in the case of a billion word sized English corpus used in an application for English language learning introduced in [BS14]. Some nonsense sentences remained in the corpus nevertheless. An example of web spam in the application can be seen in Figure 2.2. In the case of general large web corpora we were not able to simply avoid problematic kinds of web spam. Unless the issue is dealt with, we have less confidence in the usefulness of recent web data for language analysis.
To improve the user experience of browsing the internet and to better search the massive amount of online data, web search engines encourage web administrators to create high quality, well structured and informative web pages. An example of such an administrator manual is the (already mentioned) Webmaster Guidelines by Google. Adhering to these principles is called search engine optimization (SEO). According to [GG05], SEO 'without improving the true value of a page' is in fact web spamming. Web spamming refers to 'any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value'. It is used to mislead search engines to assign to some pages a

16. Available at https://support.google.com/webmasters/answer/66358?hl=en, accessed in January 2015.

Figure 2.2: Web spam in examples of use of word 'money' in the application Sketch Engine for Language Learning at https://skell.sketchengine.eu/. See non-text lines 2, 4 and 10.


higher ranking than they deserve. Web page content or links that are the result of some form of spamming are called spam.
As has been shown in the introductory example, certain kinds of web spam look quite like the data we want to gather, but once properly identified as non-fluent, non-coherent, non-grammatical, containing suspicious unrelated keywords or otherwise unnatural text, we do not want it to corrupt our corpora.
[GG05] presented a useful taxonomy of web spam and corresponding strategies used to make spam. Their paper was presented at the first AIRWeb17 workshop: it was the first of five annual workshops, associated with two shared tasks or 'Web Spam Challenges'. The last of the AIRWeb workshops was held in 2009. In the following years, there have been joint WICOW/AIRWeb Workshops on Web Quality.18 These workshops, held at WWW conferences, have been the main venue for adversarial information retrieval work on web spam. Since the merge, there has been less work on web spam, with the focus, insofar as it relates to spam, moving to spam in social networks. [EGB12]
The web spam taxonomy paper [GG05] revealed two main types of web spamming: 'boosting techniques, i.e., methods through which one seeks to achieve high relevance and/or importance for some pages' and 'hiding techniques, methods that by themselves do not influence the search engine's ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users'. The boosting type of spamming consists in changing the frequency properties of a web page content in favour of spam targeted words or phrases – to increase the relevance of a document in a web search for those words or phrases. The paper also identified these ways of altering web text: repetition of terms related to the spam campaign target, inserting a large number of unrelated terms (often even entire dictionaries), weaving of spam terms into content copied from informative sites (e.g. news articles), and gluing sentences or phrases from different sources together.

17. Adversarial Information Retrieval on the Web, organized in 2005–2009, http://airweb.cse.lehigh.edu/
18. Workshop on Information Credibility on the Web, organized since 2011, http://www.dl.kuis.kyoto-u.ac.jp/webquality2011/


Other techniques listed in the spam taxonomy article, namely link spamming or instances of hiding techniques such as content hiding, cloaking or redirection, are less problematic for web corpora. Although they may reduce the efficiency of a web crawler (by attracting it to poor quality sources), they do not present corrupted or fake text hard to tell apart from the desired content.
Search engines strive hard to suppress bad SEO. The document 'Fighting Spam' by Google19 describes the kinds of spam that Google finds and what they do about it. Among the techniques most dangerous for web corpora are 'aggressive spam techniques such as automatically generated gibberish', automatically generated text or duplicated content. It is interesting to note that although Google says their algorithms address the vast majority of spam, other spam has to be addressed manually. Figure 2.3 shows an analysis of manually inspected spam types and quantities from 2004 to 2012.
The following types of automatically generated content are examples of documents penalised by Google: 'Text translated by an automated tool without human review or curation before publishing. Text generated through automated processes, such as Markov chains. Text generated using automated synonymizing or obfuscation techniques.' These kinds of spam should certainly be eliminated from web corpora while the other two examples given by Google may not present a harm to the corpus use: 'Text generated from scraping Atom/RSS feeds or search results. Stitching or combining content from different web pages without adding sufficient value.'20
[KS13] described the situation as a game played between spammers and search engines who employ teams of analysts and programmers to combat spam. However, the most up to date knowledge of experts and resources may not be shared outside the search engine companies, for obvious reasons. Assuming the great effort of search engines' measures against web spam pays off, the study speculated that corpus builders can benefit directly from the BootCaT approach.

19. http://www.google.com/insidesearch/howsearchworks/fighting-spam.html, accessed in January 2015, moved to https://www.google.com/search/howsearchworks/mission/creators/ as of April 2020.
20. Source of quoted text: Google quality guidelines – https://support.google.com/webmasters/answer/2721306, accessed in January 2015.
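To illustrate what 'text generated through automated processes, such as Markov chains' looks like, here is a minimal sketch of a word-level Markov chain generator. It is an illustration of the general technique only, not a reconstruction of any spammer's tool.

import random
from collections import defaultdict

def build_markov_model(text, order=2):
    # map each tuple of `order` consecutive words to the words observed after it
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model

def generate(model, length=30):
    # walk the chain from a random state, producing locally plausible
    # but globally incoherent text of the kind discussed above
    state = random.choice(list(model.keys()))
    out = list(state)
    for _ in range(length):
        candidates = model.get(tuple(out[-len(state):]))
        if not candidates:
            break
        out.append(random.choice(candidates))
    return " ".join(out)

Fed with a few scraped articles, such a generator produces text whose short word n-grams look natural while the overall message is missing, which is exactly the property that makes this kind of non-text hard to detect by simple frequency statistics.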

Figure 2.3: Google's analysis of spam types and quantities that had to be removed manually, 2004–2012. Source: http://www.google.com/insidesearch/howsearchworks/fighting-spam.html, accessed in January 2015, no longer at the site as of April 2020. Labels were moved below the chart and resized by the author of this thesis for the sake of readability.


General research in combating spam is conducted in two main directions: content analysis and web topology.
The first approach, represented e.g. by [Nto+06], is based on extracting text features such as 'number of words, number of words in title, average word length, fraction of anchor words, fraction of content visible on the page, compression ratio, fraction of the most common words, 3-gram occurrence probability' (a minimal sketch of computing several such features follows the list below). The web topology oriented techniques perceive the web as a graph of web pages interconnected by links; e.g. [Cas+07] discovered that 'linked hosts tend to belong to the same class: either both are spam or both are non-spam'. Another work [GGP04] proposes TrustRank, a web page assessment algorithm based on trusted sources manually identified by experts. The set of such reputable pages is used as seeds for a web crawl. The link structure of the web helps to discover other pages that are likely to be good.
[Nto+06] also published two observations we find important for our research:
1. Web spamming is not only a matter of the English part of the internet. Spam was found in their French, German, Japanese and Chinese documents as well. There seems to be no language constraint. Language independent methods of combating spam might be of use.

2. NLP techniques required for 'grammatical and ultimately semantic correctness' analysis of web content are computationally expensive. Since web corpora processed by current cleaning tools are much smaller than the original data, we may nonetheless employ even some computationally expensive techniques.
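As referenced above, a minimal sketch of a few of the content features listed by [Nto+06] follows. The chosen features, their exact definitions and the toy list of common words are simplified assumptions for illustration, not the authors' implementation.

import zlib

COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}

def content_features(text):
    # a few text-level features used in content-based spam detection
    words = text.split()
    n_words = len(words)
    avg_word_length = sum(len(w) for w in words) / n_words if n_words else 0.0
    raw = text.encode("utf-8")
    # keyword-stuffed or generated text tends to have an unusual compression ratio
    compression_ratio = len(raw) / len(zlib.compress(raw)) if raw else 0.0
    common_fraction = (sum(1 for w in words if w.lower() in COMMON_WORDS) / n_words
                       if n_words else 0.0)
    return {
        "n_words": n_words,
        "avg_word_length": avg_word_length,
        "compression_ratio": compression_ratio,
        "fraction_common_words": common_fraction,
    }

Feature vectors of this kind can then be fed to any standard classifier; the point of the observation above is that none of them depends on the language of the page.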

Our paper [KS13] summarizes the issues in web text corpus building in the context of corpora for language studies and NLP: Instances of web spam observed in data from 2012 and later are ‘by design, hard to find and set apart from good text’. ‘At what stage should spam detection take place – before HTML removal, or after, and do we work at the level of the page or the website?’ ‘Should we concentrate on websites, or web pages?’ The evidence suggests that ‘the landscape


of hosts and domains change very quickly, so methods based on text may retain validity for longer.' These questions should be addressed in order to convince the NLP community that the web can be a reliable source of text data despite the growing spam content.
In comparison to the problem of boilerplate and duplicity, web spam brings new challenges. While boilerplate and duplicate content is in fact an inevitable and natural property of the web, we consider spamming worse because it is always deliberate. Spammers intend to fool the search engines (and, in some cases, the web page visitors too). We have to realize that spamming techniques will keep advancing as fast as the countermeasures are taken.

The definition of web spam varies according to the use of the web pages. Having discussed web spam with the fulltext search development team of the Czech web search engine Seznam.cz, we realized the difference. A search engine:
∙ The goal is to serve nice looking, informative, quality, trustworthy and original texts.
∙ The users are people searching for websites, products, information.
∙ Therefore, relevance and importance of documents is crucial.
∙ Content that is not original, poorly informative, scarcely connected to other pages, or misused is penalized in the search results.
∙ Spammers are interested in search engine behaviour. They try to keep up with countermeasures.

On the other hand, a text corpus:
∙ The goal is to represent the behaviour of words, phrases, sentences, generally all kinds of linguistic phenomena in a context in a language.
∙ The users are linguists, translators, teachers, NLP scientists and NLP applications.
∙ Natural distribution and context of the studied phenomena is important.
∙ Web spammers are not interested in text corpora.


For instance, an unofficial clone of a famous web page gets penalized by the search engine since people most likely want to find the official page. Yet the clone might contain some original content. The corpus builders do not care if the page is official or not. We do want to keep the unofficial page in the corpus too, provided it contains at least a few natural sentences that cannot be seen elsewhere.
Another example is a web content farm. A website (or several sites) contains parts of texts or full texts copied from original, trustworthy and informative sources, e.g. Wikipedia or news pages. Search engines do not permit such an aggressive spamming technique. Nonetheless, it is no issue for text corpora cleaned by currently available tools because duplicate removal is a solved problem. Still, some harm is done in the phase of crawling – the crawler is occupied by downloading data that gets cleaned away afterwards.
As with search engines, there is a significant overlap with the information retrieval approach to spam; however, it is not the same. 'IR work mostly focuses on finding bad hosts (and much of it, on links, 'the web as a graph'). That is a distinct strategy to finding bad text, e.g. within a web corpus once it has been cleaned, with links deleted.' [KS13]
Considering the corpus definition of web spam, the coarse grained classification of web pages into spam / not spam (or borderline as a category in between) might not be enough. Although there are web spam datasets available for development and evaluation of spam fighting methods, the focus is, to a certain extent, on the search engine definition of spam rather than our definition of non-text. We also note that historical datasets are of limited value as spammers will have moved on.
Avoiding web spam by carefully selecting spam free sources works well. Wikipedia, news sites and government websites are generally considered trustworthy. As we reported in [BS14], it is possible to construct medium sized corpora from URL whitelists and web catalogues. [SS12] reported a similar way of building a Czech web corpus too. Also the BootCaT method [BB04] indirectly avoids the spam issue by relying on a search engine to find non-spam data. Although these avoidance methods are rather successful, it is doubtful that a huge web collection can be obtained from trustworthy sources alone.


Furthermore, a manual spam classification of seed web pages is costly for each target language.
In contrast to the search engine understanding of web spamming as well as other previously seen definitions, one could reformulate the definition of spam for the linguistic use of text corpora:
∙ A fluent, naturally sounding, consistent text is good, regardless of the purpose of the web page or its links to other pages.

∙ The bad content is this: computer generated text, machine translated text, text altered by keyword stuffing or phrase stitching, text altered by replacing words with synonyms using a thesaurus, summaries automatically generated from databases (e.g. weather forecasts or sport results – all very similar to each other), and finally any incoherent text. This is the kind of non-text this work is interested in.

∙ Varieties of spam removable by existing tools, e.g. duplicate content, link farms (quite a lot of links with scarce text), are only a minor problem. The same holds for techniques not affecting text, e.g. redirection.

To summarize: in contrast to the traditional or search engine definitions of web spam, the corpus use point of view is not concerned with the intentions of spam producers or the justification of the search engine optimisation of a web page. A text corpus built for NLP or linguistics purposes should contain coherent and consistent, meaningful, natural and authentic sentences in the target language. Only texts created by spamming techniques breaking those properties should be detected and avoided. The unwanted non-text is this: computer generated text, machine translated text, text altered by keyword stuffing or phrase stitching, text altered by replacing words with synonyms using a thesaurus, summaries automatically generated from databases (e.g. stock market reports, weather forecasts, sport results – all very similar to each other), and finally any incoherent text. Varieties of spam removable by existing tools, e.g. duplicate content or link farms (many links with scarce text), are only a minor problem.


Considering the strictness of removing spam, there is no lack of data on the internet nowadays. Therefore recall should be preferred over precision. We do not mind false positives as long as most of the spam is stripped away.
To address non-text in web corpora, we carried out a supervised learning experiment. The setup and results are described in the following section.

2.2.2 Removing Spam from an English Web Corpus through Supervised Learning

This section describes training and evaluation of a supervised classifier to detect spam in web corpora.
We manually annotated a collection of 1,630 web pages from various web sources from years 2006 to 2015.21 To cover the main topics of spam texts observed in our previously built corpora, we included 107 spam pages promoting medication, financial services, commercial essay writing and other subjects. Both phrase level and sentence level incoherent texts (mostly keyword insertions, n-grams of words stitched together or seemingly authentic sentences not conveying any connecting message) were represented. Another 39 spam documents coming from random web documents identified by annotators were included. There were 146 positive instances of spam documents altogether.
The classifier was trained using FastText [Jou+16] and applied to a large English web corpus from 2015. The expected performance of the classifier was evaluated using a 30-fold cross-validation on the web page collection. Since our aim was to remove as much spam from the corpus as possible, regardless of false positives, the classifier confidence threshold was set to prioritize recall over precision.
The achieved precision and recall were 71.5 % and 70.5 % respectively. Applying this classifier to an English web corpus from 2015 resulted in removing 35 % of corpus documents, still leaving enough data for the corpus use.

21. This is a subset of a text collection which was a part of another classification experiment co-authored by us.
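A minimal sketch of this kind of FastText-based spam filtering follows. The file name, the label names and the particular threshold value are illustrative assumptions, not the exact values or code used in the experiment.

import fasttext

# train.txt holds one document per line in FastText format, e.g.
# "__label__spam buy cheap viagra online ..." or "__label__ok ordinary text ..."
model = fasttext.train_supervised(input="train.txt", epoch=25, wordNgrams=2)

def keep_document(plaintext, spam_threshold=0.2):
    # ask for the probabilities of all labels for the document
    labels, probs = model.predict(plaintext.replace("\n", " "), k=-1)
    spam_prob = dict(zip(labels, probs)).get("__label__spam", 0.0)
    # a low threshold removes documents even when the classifier is not very
    # confident they are spam, i.e. it prefers recall over precision
    return spam_prob < spam_threshold

Lowering the threshold trades precision for recall, which is the setting described above where false positives are acceptable.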


An inspection of the cleaned corpus revealed that the relative count of usual spam related keywords dropped significantly, as expected, while general words not necessarily associated with spam were affected less, as can be seen in Table 2.10.
Another evaluation of the classifier was performed by manually checking 299 random web documents from the cleaned corpus and 25 random spam documents removed by the classifier. The achieved precision was 40.0 % with a recall of 27.8 %. The error analysis showed that what the classifier failed to recognise was non-text rather than spam. 17 of 26 unrecognised documents were scientific paper references or lists of names, dates and places, i.e. Submitted by Diana on 2013-09-25 and updated by Diana on Wed, 2013-09-25 08:32 or January 13, 2014 – January 16, 2014 Gaithersburg, Maryland, USA. Such web pages were not present in the training data since we believed they had been removed from the corpus sources by a boilerplate removal tool and paid attention to longer documents. Not counting these 17 non-text false negatives, the recall would reach 52.6 % (with 10 of the 25 removed documents being true spam and 26 unwanted documents missed among the 299 kept ones, the recall is 10/36 ≈ 27.8 %; excluding the 17 non-text cases gives 10/19 ≈ 52.6 %).
To find out what was removed from the corpus, relative counts of lemmas22 in the corpus were compared with the BNC23 in Figure 2.4 and Figure 2.5. A list of lemmas in the web corpus with the most reduced relative count caused by removing unwanted documents is presented in Figure 2.6. The inspection showed there were a lot of spam related words in the original web corpus and that spam words are no longer characteristic of the cleaned version of the corpus in comparison to the BNC.24
To show the impact of the cleaning method on data used in real applications, Word Sketches of selected verbs, nouns and adjectives in the original corpus and the cleaned corpus were compared. A Word Sketch is a table-like report providing a collocation and grammatical summary of the word's behaviour that is essential for modern lexicography, for example to derive the typical context and word senses of headwords in a dictionary [Bai+19; Kil+14].

22. Corpora in the study were lemmatised by TreeTagger.
23. The tokenisation of the BNC had to be changed to the same way the web corpus was tokenised in order to make the counts of tokens in both corpora comparable.
24. The comparison with the BNC also revealed there are words related to the modern technology (e.g. website, online, email) and American English spelled words (center, organization) in the 2015 web corpus.


Table 2.10: Comparison of the 2015 English web corpus before and after spam removal using the classifier. Corpus sizes and relative frequencies (number of occurrences per million words) of selected words are shown. By reducing the corpus to 55 % of the former token count, phrases strongly indicating spam documents such as cialis 20 mg, payday loan, essay writing or slot machine were almost removed while innocent phrases not attracting spammers from the same domains such as oral administration, interest rate, pass the exam or play games were reduced proportionally to the whole corpus.

                       Original corpus      Clean corpus        Kept
Document count         58,438,034           37,810,139          64.7 %
Token count            33,144,241,513       18,371,812,861      55.4 %

Phrase                 Original hits/M      Clean hits/M        Kept
viagra                 229.71               3.42                0.8 %
cialis 20 mg           2.74                 0.02                0.4 %
aspirin                5.63                 1.52                14.8 %
oral administration    0.26                 0.23                48.8 %
loan                   166.32               48.34               16.1 %
payday loan            24.19                1.09                2.5 %
cheap                  295.31               64.30               12.1 %
interest rate          14.73                9.80                36.7 %
essay                  348.89               33.95               5.4 %
essay writing          7.72                 0.32                2.3 %
pass the exam          0.34                 0.36                59.4 %
slot machine           3.50                 0.99                15.8 %
playing cards          1.01                 0.67                36.8 %
play games             3.55                 3.68                53.9 %
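A brief worked example of how the Kept column can be read (our interpretation of the table): the kept share of a phrase is the ratio of its absolute occurrence counts in the cleaned and the original corpus, i.e. its relative frequency scaled by the respective corpus token counts. For 'viagra':

\mathrm{kept} \approx \frac{3.42 \times 18.37 \cdot 10^{9}}{229.71 \times 33.14 \cdot 10^{9}} \approx 0.8\,\%

The same relation holds for the other rows, e.g. for 'cheap' it gives approximately 12.1 %.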


Figure 2.4: Relative word count comparison of the original 2015 web corpus with the British National Corpus, top 26 lemmas sorted by the keyword score: Score = (fpm1 + 100) / (fpm2 + 100), where fpm1 is the count of the lemma per million tokens in the focus corpus (3rd column) and fpm2 is the count of the lemma per million tokens in the reference corpus (5th column).


Figure 2.5: Relative word count comparison of the cleaned web corpus with British National Corpus. (A screenshot from Sketch Engine.)


Figure 2.6: Relative word count comparison of the original web corpus with the cleaned version. (A screenshot from Sketch Engine.)


In all tables below, the highest scoring lemmas are displayed. Lemma is the base form of a word (aggregating all word forms in Word Sketches). Frequency denotes the number of occurrences of the lemma as a collocate of the headword – the main word of a dictionary entry, the verb 'buy' in this case – in the corpus. The score column represents the typicality value (calculated by the collocation metric logDice described in [Ryc08] in this case) indicating how strong the collocation is. To create a good entry in a dictionary, one has to know strong collocates of the headword. We will show that better collocates are provided by the cleaned corpus than by the original version in the case of selected headwords.
Tables 2.11 and 2.12 show that the top collocates of verb 'buy' in the relation 'objects of verb' were improved a lot by applying the cleaning method to the corpus. It is true that e.g. 'buy viagra' or 'buy essay' are valid phrases; however, looking at random concordance lines of these, the vast majority come from computer generated ungrammatical sentences such as Canadensis where can I buy cheap viagra in the UK atardecer viagra japan ship deodorant25 or Judge amoungst none buy argumentative essay so among is what in himself And almost Interpreter26.
Comparison of modifiers of noun 'house' in Table 2.13 reveals that the Word Sketch of a seemingly problem-free headword such as 'house' can be polluted by a false collocate – 'geisha'. Checking random concordance lines for co-occurrences of 'house' and 'geisha', almost none of them are natural English sentences, e.g. 'well done enough glimpse of geisha house as a remote and devotee of sherlock republic windows 7 holmes'27. While 'geisha' is the fifth strongest collocate in the original corpus, it is not present among the top 100 collocates in the cleaned version.

25. Source: http://nakil.baskent-adn.edu.tr/viagra-japan-ship/, visited by the crawler in December 2015.
26. Source: http://www.pushtherock.org/writing-a-college-research-paper/, visited by the crawler in December 2015.
27. Source: http://www.littleteddies.net/?de-vergeting-epub, visited by the crawler in November 2015.
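For reference, the logDice score used in these tables is commonly given, following [Ryc08], by the formula below, where f_xy is the co-occurrence frequency of the headword x and the collocate y in the given grammatical relation and f_x, f_y are their individual frequencies. This is the standard formulation restated here for convenience.

\mathrm{logDice}(x, y) = 14 + \log_2 \frac{2 f_{xy}}{f_x + f_y}

The score has a theoretical maximum of 14 and does not grow with corpus size, which is what makes the original and the cleaned corpus directly comparable in Tables 2.11 to 2.15.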


Table 2.11: Top collocate objects of verb ‘buy’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the verb: 14,267,996 in the original corpus and 2,699,951 in the cleaned corpus – 81 % reduction by cleaning (i.e. more than the average reduction of a word in the corpus).

Original corpus                          Cleaned corpus
lemma       frequency   score            lemma       frequency   score
viagra      569,944     10.68            ticket      52,529      9.80
ciali       242,476     9.56             house       28,313      8.59
essay       212,077     9.17             product     37,126      8.49
paper       180,180     8.93             food        24,940      8.22
levitra     98,830      8.33             car         20,053      8.18
uk          93,491      8.22             book        27,088      8.09
ticket      85,994      8.08             property    17,210      7.88
product     105,263     8.00             land        15,857      7.83
cialis      71,359      7.85             share       12,083      7.67
car         75,496      7.75             home        22,599      7.63
house       70,204      7.61             item        12,647      7.40
propecia    55,883      7.53             good        9,480       7.37


Table 2.12: Top collocate subjects of verb ‘buy’ before and after spam removal in English web corpus (enTenTen15, Word Sketches).

Original corpus                          Cleaned corpus
lemma          frequency   score         lemma       frequency   score
viagra         83,231      9.41          customer    6,356       7.82
ciali          74,612      9.32          consumer    4,216       7.78
i              372,257     9.25          store       3,254       7.70
prescription   46,548      8.62          investor    3,210       7.59
essay          37,119      7.94          i           14,334      7.42
levitra        24,694      7.82          [number]    2,227       6.71
[url]          27,316      7.76          money       2,295       6.57
uk             24,126      7.60          trader      1,131       6.46
paper          28,990      7.54          Ignatius    803         6.33
delivery       18,541      7.39          people      21,800      6.32
tablet         18,806      7.38          Best        789         6.28
canada         18,924      7.32          [url]       919         6.16


Table 2.13: Top collocate modifiers of noun ‘house’ before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the noun: 10,873,053 in the original corpus and 3,675,144 in the cleaned corpus – 66 % reduction by cleaning.

Original corpus                          Cleaned corpus
lemma        frequency   score           lemma        frequency   score
white        280,842     10.58           publishing   20,314      8.63
opera        58,182      8.53            open         39,684      8.47
auction      41,438      8.05            guest        13,574      7.94
publishing   41,855      8.02            opera        9,847       7.67
geisha       38,331      7.95            old          32,855      7.64
open         37,627      7.78            haunted      9,013       7.58
old          73,454      7.52            auction      8,240       7.40
guest        28,655      7.44            manor        7,225       7.28
country      26,092      7.07            bedroom      7,717       7.26
stone        18,711      6.77            country      9,926       7.20
dream        17,953      6.77            coffee       8,171       7.18
coffee       18,336      6.74            wooden       6,803       6.96


Table 2.14: Top collocate nouns modified by adjective 'online' before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the adjective: 20,903,329 in the original corpus and 4,118,261 in the cleaned corpus – 80 % reduction by cleaning.

Original corpus                          Cleaned corpus
lemma       frequency   score            lemma          frequency   score
pharmacy    317,588     9.54             course         70,353      8.71
casino      183,846     8.88             store          43,183      8.63
store       224,567     8.85             resource       60,044      8.45
game        210,890     8.49             platform       39,529      8.27
course      187,383     8.45             tool           43,916      8.03
uk          125,519     8.11             form           44,133      8.00
viagra      135,812     8.09             survey         24,608      7.92
canada      108,810     8.03             registration   20,276      7.91
shop        89,764      7.70             database       23,112      7.89
business    100,203     7.56             game           39,666      7.81
resource    91,213      7.53             learning       21,260      7.78
site        115,730     7.52             portal         17,682      7.77

Table 2.15: Top collocate nouns modified by adjective 'green' before and after spam removal in English web corpus (enTenTen15, Word Sketches). Corpus frequency of the adjective: 2,626,241 in the original corpus and 1,585,328 in the cleaned corpus – 40 % reduction by cleaning (i.e. less than the average reduction of a word in the corpus).

Original corpus                          Cleaned corpus
lemma       frequency   score            lemma            frequency   score
tea         86,031      10.04            tea              45,214      9.94
light       54,991      8.74             light            33,069      8.86
bean        28,724      8.63             space            51,830      8.72
egg         26,150      8.45             roof             17,916      8.72
space       55,412      8.19             bean             15,398      8.52
vegetable   20,906      8.16             economy          24,181      8.21
roof        18,910      8.1              energy           18,101      7.8
leave       16,712      7.74             infrastructure   13,331      7.69
economy     25,261      7.72             leave            9,754       7.69
grass       13,483      7.61             building         22,172      7.67
eye         24,025      7.56             eye              11,753      7.64
onion       11,544      7.46             vegetable        8,110       7.58

As expected, Table 2.14 shows that nouns modified by adjective 'online' – another word highly attractive for web spammers – suffer from spam too. Again, the cleaned Word Sketch looks better than the original version. The last comparison, in Table 2.15, showing nouns modified by adjective 'green', is an example of cases not changed much by the cleaning method.28 It is also worth noting that, unlike the other words in this evaluation, the relative number of occurrences of adjective 'green' in the corpus was decreased less than the whole corpus. Although

28. There is a difference in the order of the collocates; however, a fine-grained evaluation of collocates is out of the scope of this work.


the classifier deliberately prefers recall over precision, the presence of non-spam words in the corpus was reduced less than the count of ‘spam susceptible’ words.

2.2.3 Semi-manual Efficient Classification of Non-text in an Estonian Web Corpus

Unlike the spam classification of English web pages described in the previous section, where human annotators identified a small set of spam documents representing various non-text types, the annotators classified whole web domains this time. An Estonian web corpus crawled in 2019 was used in this experiment. Similarly to our previous result, supervised learning using FastText was employed to classify the corpus.
Our assumption in this setup is that all pages in a web domain are either good – consisting of nice human produced text – or bad – i.e. machine generated non-text or other poor quality content. Although this supposition might not hold in all cases and can lead to noisy training data for the classifier, it has two advantages: many more training samples are obtained and the cost of determining whether a web domain tends to provide good text or non-text is not high. In this case, that work was done by Kristina Koppel from the Institute of the Estonian Language at the University of Tartu in several days.
Furthermore, it is efficient to check the most represented domains in the corpus. Thus a lot of non-text can be eliminated while obtaining a lot of training web pages at the same time. Spam documents coming from less represented web domains can be traced by the classifier once it is built.
A list of 1,000 Estonian web sites with the highest count of documents or the highest count of tokens in the corpus was used in the process of manual quality checking. There were also path prefixes covering at least 10 % of all paths within each site available to provide information about the structure of the domain. If the site was known to the human annotator, it was marked as good without further checks. If the site name looked suspicious (e.g. a concatenation of unrelated words, mixed letters and numbers, or a foreign TLD), the annotator checked the site content on the web or its concordance lines in Sketch Engine.


Site name rules were formulated by observation of bad web domains. E.g. all hosts starting with ee., est., or et. under the generic TLDs .com, .net, .org29 were marked as non-text since machine translated content was usually observed in these cases. 77 % of web pages in the corpus were semi-manually classified this way. 16 % of these documents were marked as computer generated non-text, mostly machine translated. 6 % of these documents were marked as bad for other reasons, generally poor quality content.
A binary classifier was trained using FastText on good and non-text web pages. The URL of a page, plaintext word forms and 3 to 6 tuples of plaintext characters were the features supplied to FastText. A 10-fold cross-validation was carried out to estimate the classifier's performance. Documents from the same web site were put in the same fold to make sure there was not the same content or the same URL prefix in multiple folds.
The estimated performance reported by FastText can be found in Figure 2.7. Precision and recall of a classification into two classes were measured for various label probability thresholds given by FastText. Since the ratio of good to non-text samples in the data was approximately 77:16, the baseline accuracy (putting all samples in the larger class) was 0.826. Despite the rather high baseline, the classifier performed well.
The final classifier was applied to the part of the corpus that had not been checked by the human annotator. 100 positive, i.e. non-text, and 100 negative, i.e. good, web pages were randomly selected for inspection. Kristina Koppel and Margit Langemets from the same institution checked the URL and plaintext30 of each page. Three minimal probabilities of the top label were tested. The resulting precision and recall can be seen in Figure 2.8. It can be observed that the recall dropped a lot with an increasing threshold. Therefore, the final top label probability threshold applied to the corpus was set to just 0.05 to keep the recall high. We do not mind false positives as long as most of the non-text is removed.

29. Excluding et.wikipedia.org.
30. Texts were cropped to the first 2,000 characters to speed up the process.
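A minimal sketch of the domain-grouped cross-validation split and of training FastText with character n-gram features follows. The record layout, the parameter values and the file name are illustrative assumptions, not the exact setup of the experiment.

import hashlib
import fasttext
from urllib.parse import urlparse

def assign_fold(url, n_folds=10):
    # all pages of one web site share a fold, so no site leaks across folds
    domain = urlparse(url).netloc.lower()
    return int(hashlib.md5(domain.encode("utf-8")).hexdigest(), 16) % n_folds

def to_fasttext_line(label, url, plaintext):
    # label is "good" or "nontext"; the URL is included as ordinary tokens
    text = " ".join(plaintext.split())  # FastText input must be a single line
    return f"__label__{label} {url} {text}"

# one training file per held-out fold would be written with the lines above;
# minn/maxn make FastText also use character 3- to 6-grams as features
model = fasttext.train_supervised(input="train_fold_0.txt", minn=3, maxn=6, epoch=25)

Grouping by domain is what prevents the optimistic bias that would arise if near-identical pages from the same site ended up in both the training and the test fold.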

Figure 2.7: Evaluation of the binary spam classifier in the Estonian web corpus. Precision and recall were estimated in a 10-fold cross-validation on semi-manually checked web pages, for minimal probabilities of the top label from 0 to 1 in 0.05 steps, and averaged across folds. The baseline accuracy (putting all samples in the larger class) is 0.826.


Figure 2.8: Evaluation of the final binary spam classifier on documents not previously checked by a human annotator in the Estonian web corpus. Precision and recall were estimated for minimal probabilities of the non-text label from 0.05 to 0.15. Since we aim for high recall, the performance with the non-text label threshold set to 0.05 is satisfying. A higher threshold leads to an undesirable drop in recall.


We consider this setup and the result both time efficient and well performing. It will be applied to web corpora in other languages in cooperation with native speaker experts in the future.
Since the web crawler SpiderLing measures the distance of web domains from the initial domains, the value can be used to estimate the quality of the content of a domain. At least, this is our hypothesis. If it were true, domains close to the seeds should be crawled more than domains far from the seeds.
To prove or reject the hypothesis, the classification of spam from the previous experiment was put into a relation with the domain distance of the respective good or bad documents. Both semi-manually and machine classified web pages were included. The binary classification of texts – good and bad labels – aggregated by the distance of the respective web sites from seed domains is displayed in Figure 2.9. The evaluation does not support the hypothesis much, at least in the case of the Estonian web.
To sum up the findings of our experiments with the Estonian web corpus:
1. A non-text classifier with a very high recall (at the cost of precision) can be trained on human annotated good and bad web sites.
2. The annotation process can be quite efficient: checking the web domains most represented in the corpus produces sufficient samples to classify the rest.
3. It is beneficial to start the crawling from trustworthy, quality content sites. However, there is non-text on web sites linked from the initial sites. The domain distance is related to the presence of non-text but the correlation is not strong enough to make it an important feature in spam removal.


Figure 2.9: Evaluation of the relation of the distance of a web domain from the initial domains to the presence of non-text on the sites. Web pages of distances 0 to 4 classified semi-manually or by the spam classifier were taken into account. Two thirds of the pages were in distance 1. The percentage of good and bad documents within the same domain distance is shown. The presence of non-text in the data is notable from distance 1.


2.2.4 Web Spam Conclusion

To conclude the section on web spam removal, we consider computer generated non-text the main factor decreasing the quality of web corpora.
A classifier trained on manually identified spam documents was applied to remove unwanted content from a recent English web corpus. The threshold of the classifier was set to prefer recall at the cost of greatly reducing the size of the result corpus. Although the evaluation of the classifier on the training set reports a far from perfect recall of 71 %, it was able to notably decrease the presence of spam related words in the corpus. An extrinsic evaluation was carried out by comparing the original data and the cleaned version in a lexicography oriented application: relative corpus frequencies of words and Word Sketches of grammatical relations that could be used to make a dictionary entry for selected verbs, nouns and adjectives were compared in the experiment.
Another experiment, with a smaller Estonian corpus, was carried out. An efficient human annotation led to using more than two thirds of the corpus as training data for the spam classifier. The evaluation of the classifier shows a very high recall of 97 % was reached. We understand the process can take more time for large internet languages such as English, Spanish, Russian or Chinese. We admit the number of sites in our Estonian experiment is small in comparison to these languages. Nevertheless we believe this is a good way to go for all languages.

3 Richer Web Corpora

As we stated in the introductory chapter, understanding the content of a web corpus is important. The traditional corpora were designed for particular use and compiled from deliberately selected sources of good quality:
∙ The British National Corpus consists of ca. 10 % spoken data and a ca. 90 % written component further divided by domain, genre, level and date. [Lee92]
∙ SYN contains 'two kinds of reference corpora: balanced and newspaper ones. The latter are composed solely of newspapers and magazines and this is denoted by their suffix "PUB". The 3 balanced corpora cover 3 consecutive time periods and they contain a large variety of written genres (including also fiction and professional literature) subclassified into 74 categories in proportions based on language reception studies.' [Hná+14]
∙ The Slovak National Corpus is reported to contain published texts consisting of 71.1 % journalism, 15.4 % arts, 8.5 % technical documents and 5.0 % other texts.1
∙ The Estonian National Corpus is a mix of texts from traditional media and web pages with subcorpora from the Estonian Reference Corpus 1990–2008, Estonian web from 2013, 2017 and 2019, Estonian Wikipedia from 2017 and 2019 and Estonian Open Access Journals (DOAJ). [KK] There is rich metadata in the Reference Corpus part and the Wikipedia subcorpus because of the nature of their sources.
Such precise selection of nice texts is hardly possible in the case of large web corpora. Researchers using web corpora for their work need to know their sources! (A re-formulation of the appeal in [Kil12].)
From our point of view, it is desirable to provide web corpora with rich annotation such as language varieties (e.g. British, American, Australian or Indian English), topics (such as sports, health, business,

1. https://korpus.sk/structure1.html, accessed in January 2020.


culture), genres (informative, encyclopedic, instructive, appellative, narrative), registers (formal, informal, professional) and other text types. Since these categories do not come with the data, it is possible to annotate web documents using a supervised classification approach.
Furthermore, the distribution of these text attributes in a web corpus can help the users not only to know what is 'inside'. In addition to that, it enables the users to work with just a selection of subcorpora based on the text types.
Language variety can be identified by the wordlist method described and applied to the Norwegian variants Bokmål and Nynorsk in Section 2.1.
[Kil12] proposed keyword comparison as a method of knowing corpora better. This and other intrinsic methods such as the size, average sentence length, homogeneity or the count of blacklisted words compare measurable corpus properties.
On the other hand, extrinsic evaluation compares corpora by comparing the output of applications. For example, by measuring the BLEU score of the same method trained on the corpora being compared, we can find which corpus is better for training machine translation or – in a wider scope – better for all language model based methods or – generalising ad maximum – better altogether.
A genre annotation scheme in an English web corpus and selected issues of genre definition and identification are described in Section 3.1. A case study of adding a text topic annotation to English and Estonian web corpora through supervised classification is presented in Section 3.2.

3.1 Genre Annotation of Web Corpora: Scheme and Issues

This section presents an attempt to classify genres in a large English web corpus through supervised learning. A collection of web pages representing various genres that was created for this task and a scheme of the consequent human annotation of the data set are described. Measuring the inter-annotator agreement revealed that either the problem may not be well defined, or that our expectations concerning the precision and recall of the classifier cannot be met.
Eventually, the project was postponed at that point. Possible solutions of the issue are discussed at the end of the section.

3.1.1 Genre Selection and Reliability of Classification

A dictionary definition of genre is 'A particular style or category of works of art; esp. a type of literary work characterised by a particular form, style, or purpose.'2
[MSS10] state genre is 'A set of conventions (regularities) that transcend individual texts, helping humans to identify the communicative purpose and the context underlying a document.'
To add the perspective of text corpus users who do language research, build dictionaries or e.g. produce language models for writing prediction: adding information about genre to corpus texts allows them to know more about the composition of the corpus and enables them to use subcorpora limited to a particular genre.
[Ros08] lists reasons for determining genres of web documents for information retrieval. These reasons can be applied to corpus linguistics and NLP too:
∙ 'People normally store and retrieve documents by genre.'
∙ 'Genre is a compact way to describe a document.'
∙ 'There is a need for non-topical search descriptors.'
∙ 'Many traditional genres have migrated to the web.'

2. The second edition of Oxford English Dictionary (1989). Accessed online at https://www.oed.com/oed2/00093719 in April 2020.


∙ There are unique genres on the web. 'Some of the most popular tags for web pages on the social tagging site delicio.us are genre labels, such as blog, howto, tutorial, news and research.'
[Bib88, p. 121] uses The Lancaster-Oslo/Bergen Corpus of British English (LOB) and The London-Lund Corpus of Spoken English. Six dimensions based on lexical, syntactic and semantic attributes of text are described, e.g. Narrative vs. Non-Narrative Concerns or Abstract vs. Non-Abstract Information. Indeed, genres can be described by linguistic properties of text. [Bib89] uses the term 'linguistic markers' and lists positive features of e.g. discerning 'narrative versus non-narrative concerns' – 'past-tense verbs, 3rd person pronouns, perfect-aspect verbs, public verbs, synthetic negation, present-participial clauses' – and complementary features – 'present-tense verbs, attributive adjectives'.
Biber used rules based on scores of linguistic features. Machine learning techniques are preferred by recent research. Our understanding is that a genre is determined by the style of writing; content words are only supporting evidence – unlike topics, which are determined by content words. Therefore it is the style that is key in assessing the genre, content words are only secondary. It is necessary to add linguistic features such as the verb tense to the features for training a classifier.
The well known corpora Brown and LOB consist of an a priori determined number of texts bearing signs of the following genres and topics: Press: reportage; Press: editorial; Press: reviews; Religion; Skills, trades and hobbies; Popular lore; Belles lettres, biography, essays; Learned and scientific writings; General fiction; Mystery and detective fiction; Science fiction; Adventure and western fiction; Romance and love story; Humour; Miscellaneous.
Unlike corpora traditionally constructed from selected sources with known text types, we need to classify documents that come from sources without a determined text type and that are already in the corpus.
In the early years of the internet, [DKB98] dealt with the genre of web sites. The set of genres to recognise was derived from a poll of internet users. The resulting set was: Personal homepages; Public or commercial homepages; Searchable indices; Journalistic materials; Reports (scientific, legal); Other running text; FAQs; Link Collections;


Other listings and tables; Asynchronous multi-party correspondence (discussions, Usenet News); Error Messages. Despite being too internet oriented and somewhat old, categories such as Personal homepages, Public or commercial homepages and FAQs are useful for our purpose.
While [Ros08] claimed 'People can recognize the genre of digital documents', [ZS04] stated the opposite – that humans are incapable of determining the genre of a web page consistently and that web pages can have multiple genres and may not resemble sample single genre texts.
A review of papers on web genres by [CKR10] revealed serious difficulties of defining and determining web genres: 'Unfortunately, our review of the literature reveals a lack of consensus about the Web genre taxonomy on which to base such systems. Furthermore, our review of reported efforts to develop such taxonomies suggests that consensus is unlikely. Rather, we argue that these issues actually resist resolution because the acceptance of potential answers depends on a researcher's epistemological and ontological orientation'. Yet they stress 'a continuing and, indeed, growing need for understanding a document's genre'.
In our work, we are determined to identify genres useful for text corpora users. That should reduce the number of possibilities and approaches at least a little. [ZS04] claimed 'An inherent problem of Web genre classification is that even humans are not able to consistently specify the genre of a given page.' Such a cautious approach resulted in a relatively small count of genres that were identified: Help; article; discussion; shop; private portrayal; non-private portrayal; link collection; download. A support vector machines classifier was trained on 800 HTML pages leading to an average classification performance of 70 %.

From our point of view, Dewe [DKB98], zu Eissen and Stein [ZS04] and Crowston [CKR10] propose too web oriented labels. We believe corpus linguists and lexicographers expect separate classes like Reporting, Information, Legal, Narrative, which are not discerned by these works.
The audience of the British National Corpus (BNC) is close to ours. [Lee01] categorised documents in the BNC into 46 genres of written text


and 24 genres of spoken text. Some documents can have multiple genres assigned. Such fine grained categorization is not possible in the case of web corpora for the reasons already mentioned.
[KC13] constructed a corpus from four sources naturally consisting of different genres: conversation, newspaper, fiction and the web. (The point was that a dictionary based on that corpus represented the use of words in those genres better than if it had been based on a single genre corpus.) On the contrary, we would like to keep selected web genres separate rather than merging them into a single label.
[DS16] identified seven genres in academic texts on the web: instructions, hard news, legal, commercial presentation, science/technology, information, narrative. Although we like these categories, we need to cover more than academic texts.
Following Biber, [Sha18] worked with 'Functional Text Dimensions' and focused on large web corpora. 18 genres were identified: Argumentative, Emotive, Fictive, Flippant, Informal, Instructive, Hard news, Legal, Personal, Commercial presentation, Ideological presentation, Science and technology, Specialist texts, Information or encyclopedic, Evaluation, Dialogue, Poetic, Appellative. A subset containing 12 of these genres was also defined.
We decided to start from Sharoff's list of 12 genres. As noted above, finding genres that can be reliably identified is a difficult task. We understand that borders between genres are not sharp, so other parameters, namely the very definition of the genres, have to be adapted.
Therefore we decided to measure the agreement of human classification of genres of web documents and merge classes until the inter-annotator agreement is sufficient. The agreement can be increased by decreasing the granularity of genres to discern. That is why we call our approach agreement driven.

3.1.2 Experiment Setup

The set of 12 genres defined by [Sha18] is: Argumentative, Fictive, Instructive, Hard news, Legal, Personal, Commercial presentation, Ideological presentation, Science and technology, Information or encyclopedic, Evaluation, Appellative.


Table 3.1: Sources of the collection of texts used in our experiment. Different subsets (S) were added at different times (starting with subset 1). Most certain texts and least certain texts refer to the certainty of a classifier measured by the entropy of the probability distribution of labels given by FastText for a particular document. UKWaC [Fer+08b], enTenTen13, enTenTen15 and enTenTen18 are English web corpora from 2007, 2013, 2015 and 2018, respectively.

S   Description                                   Author     Count
1   Selection from UKWaC                          Sharoff    448
    Selection of non-text from enTenTen15         Suchomel   123
2   Most certain texts from UKWaC                 Sharoff    456
    Web search for underrepresented genres        Suchomel   198
3   Least certain texts from enTenTen13           Suchomel   405
4   enTenTen18 random texts                       Suchomel   344
    Total count of documents in the collection               1974

Non-text (machine generated text and other web spam) was a 13th genre added by us to enable us to use the data for learning a non-text classifier as described in Chapter 2.3
We aimed to determine genres of web pages in a large English web corpus by training a supervised classifier on human annotated texts. A new collection of documents, mostly web pages, was created for this purpose. Texts were added to the collection in four subsets according to our evaluation of the task in several stages. The sources of the collection are summarised in Table 3.1.
We did four rounds of manual annotation of texts from the collection. A group of students and academics at the University of Leeds was instructed and supervised by Serge Sharoff. Another group of students and academics at Masaryk University was instructed and supervised by the author of this thesis.

3. Since this section is about genres in web corpora rather than spam removal, genre Non-text is not included in the results unless explicitly mentioned.


The inter-annotator agreement (IAA) was measured after each round to see if the genre definitions were understood well and to decide how to improve. Sample documents to explain genre differences to annotators were created. Multiple labels were allowed for documents showing strong signs of more genres.
We also did a round of active learning: according to the idea of a technique called uncertainty sampling [LC94], annotating samples where the classifier is the least certain should efficiently contribute to training the classifier, e.g. requiring a lower number of costly annotated samples than a random selection. 'The basic premise is that the learner can avoid querying the instances it is already confident about, and focus its attention instead on the unlabeled instances it finds confusing.' [Set12, p. 11] In our experiment, a classifier was trained on the texts annotated at that time using FastText. Documents with the highest entropy of the probability distribution of labels provided by FastText – i.e. cases where the classifier was most unsure – were selected for the next round of the annotation.
Since the initial IAA was below our expectation, we tried merging labels and omitting the least successful classes, getting to '6 classes + non-text'. Unfortunately, an evaluation showed that did not help enough, so we decided to start over with the following set of categories (still largely based on Sharoff's genres from [Sha18]). The full version can be found in Appendix A.1.
1. Information – subcategorised to

(a) Promotion (covering both commercial and ideological presentation from Sharoff's list): Promotion of a product, service, political movement, party, religious faith. Examples: An advert, a product/service page, a political manifesto.
(b) Academic: Research. Example: A research paper or any text written using the academic style.
(c) Review: Evaluation of a specific entity by endorsing or criticising it. Example: A product review endorsing or criticising the product.


(d) Other: Any informative text not from a category above.

2. Story telling (a better name for both Fiction and Narrative): A description of events (real or fictional, usually in the order they followed), often informal, can be in the first person. Examples: Fiction, narrative blogs.
3. Instructions: Teaching the reader how something works. The imperative is frequent. 2nd person pronouns may be frequent. Examples: Howtos, FAQs, instructions to fill a web form.
4. News: Informative report of events recent (or coming in the near future) at the time of writing (not a discussion or a general state of affairs). Usually the formal style, set to a particular place and time. May quote sources of information. Examples: A newswire, diary-like blogs.
5. Legal. Examples: A contract, a set of regulations, a software licence.
6. Discussion: A written communication of participants of a discussion. Usually multiple authors. Can be personal, informal style. Examples: Web forums, discussions, product comments.
7. Unsure or too short: Indicating the annotator was unable to determine the genre with confidence.
The four subcategories of Information were meant to be used separately or merged together for the final classifier depending on the IAA and classifier performance. The fourth subset of the collection was created from enTenTen18 at this moment to introduce more contemporary web documents to the collection. A way for annotators to mark documents too short to reach an agreement was implemented to reduce noise in the training data.
A web application providing the annotation interface was made by the author of this thesis. The tool consists of a Python script with an SQLite database backend and an HTML & JavaScript frontend making asynchronous requests to the backend. Screenshots of the application can be seen in Figure 3.1 and Figure 3.2. The functionality of the interface includes:


∙ Displaying the plaintext to annotate with a link to the original web page.
∙ Compact definitions of genres with examples from the annotation manual are shown. A short description of the procedure helps annotators to remember key concepts, e.g. that the style of writing is more important than the topic.
∙ Metainformation such as the annotator's nickname and the number of documents already annotated and remaining to annotate in the current round is displayed.
∙ Multiple genres can be marked for each document.
∙ There are four labels to indicate the presence of the features of a particular genre in the document: None (the default), Somewhat, Partly, Strongly.
∙ The plaintext is split to paragraphs. The application allows marking paragraphs containing signs of a genre different from the genre of the majority of the text.
∙ A review mode allowing the supervisor to review annotations of others after a training round.
∙ Anyone can be enabled to use the review mode to see common mistakes or example documents after a training round.
After reviewing the initial round of annotations, we decided to take into account only the 'Strongly' labels. The annotators did not know that. The purpose of the other, weaker labels was just to help humans resist the temptation to choose 'Strongly' when not perfectly sure (without marking 'Unsure or too short' either).
The final experiment continued by hiring six university students proficient in English at level B2 or C1 who received a two hours' training. Then, each of the annotators worked through a training round of 45 documents. All annotations were checked by us, issues were made clear and everyone had to review frequent mistakes.
A screenshot of the use of the review mode can be seen in Figure 3.3. Since the agreement was poor in this instance, the case was explained to the annotators and the document was put to a review for everyone.
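For illustration, a minimal sketch of the kind of backend such an annotation tool needs is shown below. The table layout, the use of the Flask framework and the endpoint name are assumptions made for this sketch only; the actual tool is not documented at this level of detail here.

import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "annotations.sqlite"

def init_db():
    # one row per (annotator, document, genre) with the chosen strength label
    with sqlite3.connect(DB) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS annotation ("
            " annotator TEXT, doc_id INTEGER, genre TEXT, strength TEXT,"
            " PRIMARY KEY (annotator, doc_id, genre))")

@app.route("/annotate", methods=["POST"])
def annotate():
    # the JavaScript frontend posts one label asynchronously, e.g.
    # {"annotator": "B1", "doc_id": 42, "genre": "News", "strength": "Strongly"}
    a = request.get_json()
    with sqlite3.connect(DB) as conn:
        conn.execute("INSERT OR REPLACE INTO annotation VALUES (?, ?, ?, ?)",
                     (a["annotator"], a["doc_id"], a["genre"], a["strength"]))
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    init_db()
    app.run()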


Figure 3.1: Text type annotation interface – a web application in a browser – the left side of the screen. Information about the annotation process can be seen at the top. Genres with a brief description and examples follow. Class ‘Information::Promotion’ is labelled as strongly present in this case. Buttons for weaker presence of genre markers (Partly, Somewhat, None) can be clicked to change the annotation.

Figure 3.2: Text type annotation interface – a web application in a browser – the right side of the screen. The title of the document with a link leading to the original source is located at the top. The plaintext split into paragraphs can be seen below. Both sides of each paragraph are coloured to visualise separate paragraphs. A paragraph can be suggested for removal from the document (to make the resulting training data less noisy) by clicking the respective button.


Figure 3.3: Text type annotation interface in the review mode after the training round – as seen by the author of this thesis who trained six other annotators. Labels assigned to a single document by each annotator, the annotators being coded by identifiers in columns B1 to B99, are shown. Values 'Strongly', 'Partly', 'Somewhat' and 'None' are coded by 2, 1, 1/2 and 0, respectively. (The same coding was used by [Sha18].) Time in seconds spent annotating the document by each annotator can be seen in the rightmost column.

After the training round, two consecutive rounds of annotation were made. Several rounds of active learning were planned after that, i.e. training a classifier to find the most unsure documents and annotating those texts. However, we were not satisfied with the IAA at that time. Further evaluations were made to find out whether changing some parameters of the experiment would help.

3.1.3 Inter-annotator Agreement

Our goal was to reach the following level of inter-annotator agreement:

∙ Pairwise Jaccard's similarity ≥ 0.8. The similarity of the sets of labels assigned by each pair of annotators is measured.
∙ Krippendorff's alpha [Kri04] ≥ 0.67.4

Since multiple labels for a single sample were allowed, a text annotation was a set of labels in our case. We selected the following metrics providing the similarity of a pair of sets.

4. Krippendorff wrote on the acceptable level of reliability expressed by the α: 'Rely only on variables with reliabilities above α = .800. Consider variables with reliabilities between α = .667 and α = .800 only for drawing tentative conclusions.' [Kri04, p. 241]


1. Accuracy as a set similarity metric:

def accuracy(labels1, labels2):
    # mean of the two directed overlap ratios
    i = labels1.intersection(labels2)
    return (len(i) / len(labels1) + len(i) / len(labels2)) / 2.0

2. Jaccard's similarity:

def jaccard(labels1, labels2):
    # size of the intersection relative to the size of the union
    i = len(labels1.intersection(labels2))
    u = len(labels1.union(labels2))
    return i / u

3. Nominal comparison, which tests for an exact match. It is the default metric for Krippendorff's alpha for discrete labels.

An example illustrating the difference between the metrics (assume A, B, C are class labels) follows:

\[
\mathrm{Acc}(\{A,B\},\{A,C\}) = \frac{\frac{|\{A,B\}\cap\{A,C\}|}{|\{A,B\}|}+\frac{|\{A,B\}\cap\{A,C\}|}{|\{A,C\}|}}{2} = \frac{\frac{1}{2}+\frac{1}{2}}{2} = \frac{1}{2}
\]
\[
\mathrm{Jaccard}(\{A,B\},\{A,C\}) = \frac{|\{A,B\}\cap\{A,C\}|}{|\{A,B\}\cup\{A,C\}|} = \frac{1}{3}
\]
\[
\mathrm{Nominal}(\{A,B\},\{A,C\}) = 0 \qquad (\{A,B\}\neq\{A,C\})
\]
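To make the use of these metrics concrete, the following sketch computes the average pairwise similarity over all annotator pairs and documents, which is how the Acc and Jac columns of Table 3.2 can be understood; the data layout is a hypothetical illustration and Krippendorff's alpha is left to an external implementation.

```python
from itertools import combinations

def accuracy(labels1, labels2):
    i = labels1 & labels2
    return (len(i) / len(labels1) + len(i) / len(labels2)) / 2.0

def jaccard(labels1, labels2):
    return len(labels1 & labels2) / len(labels1 | labels2)

def average_pairwise(annotations, metric):
    """annotations: {doc_id: {annotator: set of labels}} (hypothetical layout).
    Returns the mean metric value over all annotator pairs and all documents."""
    scores = [metric(a, b)
              for per_annotator in annotations.values()
              for a, b in combinations(per_annotator.values(), 2)]
    return sum(scores) / len(scores)

annotations = {
    1: {"B1": {"News"}, "B2": {"News", "Information"}},
    2: {"B1": {"Legal"}, "B2": {"Legal"}},
}
print(average_pairwise(annotations, jaccard))   # (1/2 + 1) / 2 = 0.75
print(average_pairwise(annotations, accuracy))  # (3/4 + 1) / 2 = 0.875
```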

An overview of all experiments with IAA expressed as Accuracy, Jaccard's similarity and Krippendorff's alpha with the set similarity metrics Accuracy, Jaccard's similarity and Nominal comparison is provided by Table 3.2. The first four experiments led to the final setup with nine classes (the row marked as the base for further setups in Table 3.2). All rows below were derived from that setup by not counting instances marked as unsure, by not counting instances with multiple labels, or by merging labels Information, Promotion, Academic and Review into a single label.

Figure 3.4 and Figure 3.5 show pair annotation matrices for the experiments where unsure and multi-label samples were not counted. Each pair of annotations of the same sample by two annotators was counted in a two dimensional matrix with each dimension representing the labels given by the corresponding annotator.

Table 3.2: Inter-annotator agreement of genre annotation of web documents for different experiment setups. P is the count of people annotating, Data refers to collection subsets, N is the count of documents, A is the average count of annotations per text, Acc is Accuracy, Jac is Jaccard's similarity, and K-Acc, K-Jac and K-Nom stand for Krippendorff's alpha with the set similarity metric Accuracy, set Jaccard's similarity and Nominal comparison, respectively. '6/9 genres' means that four of the nine labels were merged into a single label for the particular evaluation. 'No unsure' means annotations indicating the person was not sure were omitted. 'No multi' means annotations with multiple strong labels were omitted.

Experiment                          P   Data    N      A     Acc    Jac    K-Acc  K-Jac  K-Nom
12 genres + spam                    7   1 & 2   77     3.66  0.527  0.507  0.444  0.428  0.401
12 genres + spam                    4   1 & 2   50     3.30  0.495  0.484  0.449  0.438  0.417
6 genres + spam                     4   1 to 3  50     4.00  0.660  0.653  0.557  0.550  0.534
6 genres + spam, no unsure          5   1 to 3  45     5.00  0.768  0.762  0.603  0.595  0.580
9 genres, training                  7   All     149    7.00  0.600  0.585  0.491  0.477  0.449
9 genres – the base for further setups  6  All  1,356  2.51  0.640  0.628  0.530  0.518  0.497
9 genres, no unsure                 6   All     1,342  2.43  0.670  0.658  0.562  0.550  0.528
9 genres, no unsure, no multi       6   All     1,340  2.57  0.676  0.676  0.570  0.570  0.570
6/9 genres                          6   All     1,356  2.51  0.776  0.765  0.566  0.552  0.527
6/9 genres, no unsure               6   All     1,342  2.43  0.814  0.802  0.622  0.606  0.576
6/9 genres, no unsure, no multi     6   All     1,340  2.57  0.819  0.819  0.629  0.629  0.629


Figure 3.4: Pair annotation matrix for the setup with 9 genres, without unsure or multi-label samples. Percentage of all annotation pairs is shown.

A sample was counted in the row corresponding to the label given by the first annotator and the column corresponding to the label given by the second annotator. Agreements are on the diagonal, disagreements are in the other cells. The percentage of all pairs is shown.

It can be seen that Information was the class causing the most disagreement. The reason may be that the borders between the other genres are clearer than the border of any genre with Information in our definition of genres.
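The pair annotation matrices and the per-class agreement in Table 3.3 can be tallied along the lines of the following sketch; single-label annotations are assumed (as in the 'no multi' setups) and the data layout is a hypothetical illustration.

```python
from collections import Counter
from itertools import permutations

def pair_matrix(annotations, labels):
    """annotations: {doc_id: {annotator: label}} with one label per annotation.
    Returns a labels x labels matrix with percentages of all ordered label pairs."""
    counts = Counter()
    for per_annotator in annotations.values():
        # every ordered pair of labels given by two different annotators
        for a, b in permutations(per_annotator.values(), 2):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return [[100.0 * counts[(row, col)] / total for col in labels] for row in labels]

def class_agreement(matrix, labels):
    # Ratio of the diagonal value to all pairs involving the label
    # (the whole row plus the whole column, counting the diagonal once).
    result = {}
    for i, label in enumerate(labels):
        involving = sum(matrix[i]) + sum(row[i] for row in matrix) - matrix[i][i]
        result[label] = matrix[i][i] / involving if involving else 0.0
    return result
```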

Figure 3.5: Pair annotation matrix for the setup with 6/9 genres, without unsure or multi-label samples. Percentage of all annotation pairs is shown.


Table 3.3: Pair agreement summary for setups with 9 genres and 6/9 genres, without unsure or multi-label samples.

Pair agreement   9 genres   6/9 genres
Information      51.2 %     77.8 %
Story telling    64.8 %     64.8 %
Instructions     49.1 %     49.1 %
News             42.7 %     42.7 %
Legal            73.5 %     73.5 %
Discussion       31.7 %     31.7 %
Promotion        52.9 %     –
Academic         32.4 %     –
Review           44.1 %     –

The percentage of agreement for each class (i.e. the ratio of the value on the diagonal to all pairs involving the label) is summarised in Table 3.3.

3.1.4 Dealing with a Low Agreement

To summarise, we tried the following to improve the inter-annotator agreement:

∙ The number of recognised genres was reduced.
∙ Multi-genre texts were omitted.
∙ Short texts (indicated by annotators) were omitted.
∙ Annotators were trained and their mistakes in the training round were explained thoroughly.
∙ Annotators could indicate they were not sure.
∙ Annotators were paid for the time spent annotating rather than for the count of annotations. (The average duration of annotating a document was 57 seconds in the final annotation round.)

As can be seen in Table 3.2, the minimal acceptable value of Krippendorff's alpha was not reached. Getting a high agreement in web genre classification is hardly possible. Defining genres both interesting for

web corpus users and agreeable to annotators is difficult. That is our conclusion as well as that of others [CKR10; ZS04].

If there is a reasonable solution, it must require a different approach or a lower level of target inter-annotator agreement or both. We suggest taking these measures to get a reliable genre annotation of web corpora:

∙ Remove paragraphs showing signs of a genre different from the major genre from the training data (such paragraphs were marked by annotators).
∙ Continue the process with active learning rounds to efficiently annotate more data.
∙ Consider using whole single genre web sites for training. This technique helped in our other work with an Estonian web corpus – non-text removal (Section 2.2.3) and topic classification (Section 3.2.2).
∙ Train the classifier only on documents with a perfect agreement.
∙ Set a high top label probability threshold of the classifier. That will increase precision at the cost of recall. The users of text corpora will not mind if the genre of some documents remains unknown. They mind the precision.

Furthermore, since the borders of genres are not strict, we suggest a different approach to the evaluation: the 'User's point of view'. To evaluate the classification, corpus users would be asked to assess the genre annotation of random web pages from the corpus (in the plaintext format) by assigning one of the following three labels to each selected document:

1. This is the genre.
2. This could possibly be the genre.
3. This could not be the genre.

We consider 5 % of texts marked as 'This could not be the genre' an acceptable level of classification mistakes.

To conclude, we will continue the research on genre classification of large web corpora despite the issues described in this section. We are interested in exploring the genre composition of the English web and offering the corpus users a possibility to focus their research on particular genres.


A keyword comparison of single genre subcorpora to the whole corpus, as proposed by [Kil12], could also show interesting properties of web genres. Exploring possibilities for transferring the method to languages other than English will be important too.

3.2 Text Type Annotation of Web Corpora

3.2.1 Topic Annotation of an English Web Corpus through Learning from a Web Directory

Text type annotation was the next stage in the process of making the English web corpus5 better – after building it using the Brno Corpus Processing Pipeline introduced in Chapter 1, followed by language filtering and non-text removal as described in Chapter 2. At this moment, 15.4 billion tokens in over 33 million web pages were in the collection.

Several factors contributed to selecting the method and the setup of the task. The first decision was which text types to discern. We considered both the 'bottom to top' and the 'top to bottom' approach. The bottom to top way is data driven and unsupervised. The text types are determined by an algorithm that reads big data (i.e. text documents corresponding to web pages in a web corpus). The number of text types is arbitrary and can be specified. Clustering of vector representations of documents or topic modelling through Latent Dirichlet Allocation (LDA) were feasible options. Having experimented with the second method, we were unable to reliably select a single clear label not overlapping with other labels for a set of words representing an LDA topic.

In the supervised or top to bottom setup, the class labels are defined a priori. Then a supervised classifier is trained using instances of data where the label is known, e.g. manually annotated texts. In the case of text topics, the training instances can be obtained from Wikipedia articles organised in portals and a hierarchical structure of categories, or from web directories such as dmoz.org6 which maintain lists of web sites organised by a topic tree. We went for the supervised approach to classify topics using the English part of the web directory dmoz.org as the source of training data.

Unlike genre, which is determined by the style of writing where content words are only supporting evidence, topics tend to be determined by words. Therefore word aware methods should be suitable for topic classification. That is why we think it is easier to discern topics than genres.

5. enTenTen15 obtained by crawling the English web in 2015.
6. dmoz.org was moved to curlie.org in 2017.

FastText by Facebook Research, presented in [Jou+16], was selected for the task. It is an 'open-source lightweight library that allows users to learn text representations and text classifiers' using vector representations of words and character n-grams.7

There are 14 top level topics in the web directory: Arts, Business, Computers, Games, Health, Home, News, Recreation, Reference, Regional, Science, Shopping, Society, and Sports. There are hundreds of topics at the second level, for example Arts → Movies, Society → History, Sports → Track and Field. The directory goes even deeper.8 We decided to aim for high precision and use just the first level topics in the end.

532 thousand standalone web pages and pages from 1,376 thousand web sites linked from the landing pages of the sites listed in the directory were downloaded from the web. The data was processed by the Brno Corpus Processing Pipeline. The result consisted of 2.2 billion tokens in nearly 4 million documents after processing. 2 % of documents that were categorised in multiple classes were removed. Documents shorter than 50 words were removed too. The resulting training corpus was made from a balanced (to a limited degree) subset of the data, 1,220,530 web pages in total. The distribution of topics in this set can be found in Table 3.4.

The collection was shuffled and split into a 97 % training set, a 2 % evaluation set and a 1 % test set. The Autotune feature of FastText was employed to find optimal hyperparameters for the classification of our data using the evaluation set.9 Reasonable boundaries of the hyperparameter space were set prior to running Autotune. The loss function was set to negative sampling since it is much faster than softmax. The features of documents were words and character 3-to-6-grams. The size of the vector dimension was set to 100. We also experimented with dimensions of 200 and 300 but autotuning in that space was two or three times slower (as expected) and showed no improvements over the default value of 100.

7. https://fasttext.cc/, accessed in April 2020.
8. E.g. Society → Issues → Warfare and Conflict → Specific Conflicts → War on Terrorism → News and Media → September 11, 2001 → BBC News.
9. Autotune mode was added to FastText in 2019. It searches the hyperparameter space to optimise F-1 measure for the given task. https://fasttext.cc/docs/en/autotune.html.
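A sketch of how such a classifier can be trained and evaluated with the fastText Python bindings follows, using the final hyperparameter values later listed in Table A.2; the file names are placeholders and the training file is assumed to contain one document per line prefixed with its topic label.

```python
import fasttext

# Training data: one plain-text document per line, prefixed with its topic,
# e.g. "__label__Arts the plaintext of the web page ..." (file names are placeholders).
model = fasttext.train_supervised(
    input="dmoz_train.txt",
    loss="ns",              # negative sampling, much faster than softmax
    dim=100,                # size of the vector dimension
    ws=5, neg=15,
    lr=0.134, epoch=33,     # values found by Autotune (see Table A.2)
    minCount=5, wordNgrams=5,
    minn=3, maxn=6,         # character 3-to-6-grams
    bucket=5_000_000,
)

# Overall precision and recall at one label on the held-out test set.
n_docs, precision_at_1, recall_at_1 = model.test("dmoz_test.txt")

# Probability distribution of the most likely labels for a new document.
labels, probabilities = model.predict("plaintext of a web page to classify", k=3)
```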


Table 3.4: Topics from dmoz.org in the training set

Topic      Web pages    Topic       Web pages
Arts       98,320       Recreation  98,673
Business   98,694       Reference   93,176
Computers  98,259       Regional    97,322
Games      53,828       Science     96,994
Health     98,826       Shopping    99,378
Home       45,942       Society     97,399
News       44,722       Sports      98,997
Total      1,220,530

The final hyperparameter values were obtained after 3,000 CPU core-hours of autotuning for level 1 topics and 1,000 CPU core-hours for level 2 topics.10 The expected F-1 values reached 0.712 in the case of level 1 topics and approximately 0.6 in the case of a subset of level 2 topics. From this step on we decided to continue just with level 1 topics.

Using the best hyperparameter values, a FastText supervised model was trained on the training and evaluation sets. Figure 3.6 shows the final evaluation of the classifier on the test set. Precision, recall, F-1 and F-0.5 were measured for probability thresholds given by the classifier.11 A web page was classified by the best estimated label if the probability of the label was higher than the threshold. Thresholds from 0 to 1 with steps of 0.05 were applied. The best value of F-1 was reached at threshold 0.15. The best value of F-0.5 was reached at threshold 0.45. Since we want to achieve a high precision of the final corpus classification, low recall is permissible.

In the end, we decided to set the final probability threshold to apply to the corpus to the value where the estimated precision was close to 0.94 – separately for each topic.

10. The CPUs were 2 × 16-core Intel Xeon 4110s. The values of hyperparameters can be found in Table A.2 in the Appendices. 11. FastText in the supervised mode gives the probability distribution of all labels for each classified instance.

Figure 3.6: Evaluation of the FastText classifier of the top label on the test set. Precision and recall were estimated for the 14 topics for minimal top label probabilities from 0 to 1 in 0.05 steps. F-0.5 values are plotted in green.


Precision, recall and F-0.5 estimated for this setup and each recognised topic are summarised in Table 3.5.

The setup can be explained using this example: The classifier gives the probability distribution of labels for a web page. Assume the best label is Arts. If the reported probability of the label is at least 98.3 % (see the Arts row in Table 3.5), the web page gets the topic label 'Arts'. Otherwise the topic label will be 'unknown'. According to the estimation obtained by evaluating the classifier on the test set, we can expect 1.) to find approximately 25 % of real Arts web pages and 2.) that there may be approximately 6 % of documents wrongly classified as Arts in the Arts subcorpus.

The final model was trained on the whole set (1,220,530 web pages listed on dmoz.org) and applied using the label probability thresholds in Table 3.5 to the English web corpus. Although documents in the corpus are web pages like the training data, the variety of the full corpus can be much greater – not just good content sites that made it into one of the 14 level 1 categories in the directory. Therefore a lower success rate than estimated should be expected. Finally, 11 % of web pages in the corpus were assigned a topic label using the method. Figure 3.7 shows the distribution of classified documents in the corpus.
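The per-topic thresholding described in this example can be written down as a short function; the threshold values correspond to Table 3.5 (only two topics are shown here) and the model is assumed to be the fastText classifier sketched above.

```python
# Minimal top label probabilities per topic – an excerpt of Table 3.5.
THRESHOLDS = {"Arts": 0.983, "Business": 0.985}  # ... one entry per level 1 topic

def topic_of(model, plaintext):
    """Return the topic label of a web page, or 'unknown' when the classifier
    is not confident enough for the estimated precision of about 0.94."""
    labels, probabilities = model.predict(plaintext.replace("\n", " "), k=1)
    topic = labels[0].replace("__label__", "")
    if probabilities[0] >= THRESHOLDS.get(topic, 1.1):  # 1.1 = never accept unlisted topics
        return topic
    return "unknown"
```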


Table 3.5: Precision and recall for each recognised dmoz.org level 1 topic estimated by FastText. The threshold of minimal probability of the top label was set to the value where the estimated precision was close to 0.94.

Topic       Threshold  Precision  Recall  F-0.5
Arts        0.983      0.942      0.246   0.602
Business    0.985      0.932      0.106   0.364
Computers   0.990      0.940      0.262   0.619
Games       0.944      0.936      0.507   0.801
Health      0.864      0.940      0.571   0.832
Home        0.922      0.942      0.365   0.716
News        0.931      0.940      0.335   0.691
Recreation  0.993      0.930      0.191   0.524
Reference   0.989      0.946      0.208   0.553
Regional    0.935      0.946      0.367   0.719
Science     0.980      0.938      0.228   0.578
Shopping    0.963      0.942      0.341   0.696
Society     0.986      0.946      0.199   0.540
Sports      0.957      0.940      0.506   0.802

Figure 3.7: Sizes of topic annotated subcorpora of enTenTen15 – document and token counts.


3.2.2 Semi-manual Efficient Annotation of Text Types in Estonian National Corpus

A similar setup to classify topics was made in the case of Estonian National Corpus 2019. This time, the following parts of the procedure were different from the case of the English corpus:

1. The Estonian web directory neti.ee and sites in the corpus identified by Kristina Koppel from the Institute of Estonian Language at the University of Tartu were used to annotate the training data. Kristina also defined rules for URLs that had precedence over the label of whole sites. For example, all web pages containing 'forum' in the URL were labelled as Discussion and all web pages containing 'blog' in the URL were labelled as Blog. Thus only 28 % of non-spam web pages remained to be classified.

2. 36 classes identified by Kristina were merged by the author of this thesis into 20 topics and two genres (the genres being Discussion and Blog) for the sake of clarity. E.g. ‘sites focused on women’ were merged into Society which is a more general category.

3. The URL of a page was made part of the features for training the classifier (a sketch of one way to do this follows after this list).

4. The amount and consistency of the training data and better estimates of the performance of the classifier allowed for a more permissive top label probability threshold of 0.5 for the best estimated label.

The final recall of the classifier built by FastText using the training data reached 33 %. Figure 3.8 shows the distribution of topics of all documents in the resulting corpus.
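One simple way to make the URL part of the features, as mentioned in item 3 of the list above, is to tokenise it and prepend the tokens to the document text; this is only a sketch under the assumption that fastText is fed plain token sequences, and the exact representation used for the Estonian corpus may differ.

```python
import re

def with_url_features(url, plaintext):
    # Split the URL into lowercase alphanumeric tokens and mark them with a prefix
    # so that they do not mix with ordinary words, e.g.
    # "https://foorum.example.ee/teema/123"
    # -> "url_https url_foorum url_example url_ee url_teema".
    tokens = ["url_" + t for t in re.findall(r"[a-z0-9]+", url.lower()) if not t.isdigit()]
    return " ".join(tokens) + " " + plaintext.replace("\n", " ")

# One training line per document for the fastText supervised mode:
line = "__label__Discussion " + with_url_features(
    "https://foorum.example.ee/teema/123", "plaintext of the forum page ...")
```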

Twenty topics and two genres were discerned in Estonian National Corpus to better understand the content of the corpus. It is possible to work with topic based subcorpora of the corpus now. Better preparation of training data (including human effort) and adding the URL among the training features led to an improved performance in comparison with the previously described experiment.

Figure 3.8: Sizes of topic annotated subcorpora of Estonian National Corpus 2019 – document and token counts.


A plan for the future may involve creating a web application for lexicographers to get more text types in large corpora in a semi-automated way: First, text types of the web domains most represented in the corpus and rules for URLs would be set. Then, one or two rounds of active learning would follow to improve the classifier efficiently. The rest can be automated: FastText provides an interface for all classification tasks involved and is fast enough to train a model and classify the corpus in less than a day.

4 Summary

4.1 Author’s Contribution

The author of this thesis has been creating corpus tools and building web corpora since 2011. The size of the corpora built in the last three years altogether exceeds 200 billion tokens after post-processing.

In this thesis, the architecture of a web crawler developed by the author was introduced. Its key design features were explained. The crawler was put into the context of other components of the so-called 'Brno processing pipeline' which has been successfully used to build large text corpora from the web. Separate tools from the pipeline were described in several papers in the past. This thesis presents a new overview of the whole process of using the text processing pipeline. The author of this thesis has been developing the pipeline and maintaining its parts since 2012.

This thesis builds upon our previous work on discerning similar languages and removing non-text from an English web corpus. In this thesis, a method of language discrimination using word counts from large web corpora was implemented and evaluated.

We have been dealing with non-text in web corpora since 2012. The contribution of the author of this thesis published in his own paper [Suc17] and in co-authored submissions [BS12a; Bai+15; Jak+20b; KS13] presented at Web as Corpus workshops is:

1. Description of the issue of spam in web corpora.
2. Proposal of ways to avoid downloading non-text, implementation of the method, application to data for a language learning site.
3. Proposal of methods to remove spam, application of a supervised learning based method to an English web corpus, evaluation.
4. Analysis of current challenges in web corpus building and mitigation of their impact on web corpora.

The most important results of that work were summarised in this thesis. The improvement of corpus based language analyses achieved by a supervised classifier applied to an English web corpus was shown.


A semi-manual approach to obtaining samples of non-text web pages, making the supervised learning process more efficient, was used for cleaning an Estonian web corpus. The recall of the Estonian web spam classifier – manually evaluated on 200 web pages – reaching 97 % is a success.

Our work on large web corpora, i.e. building corpora, cleaning them and making them richer, proved essential in successful projects in several fields of natural language processing:

∙ In the field of terminology, results were presented in a co-authored paper Finding Terms in Corpora for Many Languages with the Sketch Engine [Jak+14] in the demonstration session of the A-ranked conference EACL1 in Gothenburg in 2014, cited 36 times to date, and in a co-authored article A New Approach for Semi-Automatic Building and Extending a Multilingual Terminology Thesaurus [Hor+19] published in the International Journal on Artificial Intelligence Tools.
∙ In the field of language learning, our contribution to a corpus based tool for online language learning was presented at the Corpus Linguistics conference in Birmingham in 2015. [Bai+15]
∙ In the field of lexicography, several projects were based on corpora built and adjusted to the projects' needs by the author of this thesis. Results were presented at the conference e-Lex2 in Leiden in 2017 and in Sintra in 2019. [Bai+19; KSK17; Kop+19]

1. The European Chapter of the Association for Computational Linguistics 2. Electronic Lexicography in the 21st Century

4.2 Future Challenges of Building Web Corpora

Although our work on efficient web crawling and corpus text cleaning has proven useful in more than ten projects collaborating with partners from all over the world, there is yet more work ahead. The internet is changing constantly.

The part of the web closed to general crawlers is growing. Both [She13] and [Jak+20b] name closed content or the deep web as a significant issue in their papers on current challenges in web crawling. News sites are moving to paid subscriptions. Facebook and Twitter, which could be great sources of genres under-represented in web corpora (as noticed by [Cvr+20]), are not giving such data out (obviously) or give only some data and only to large companies and their affiliates. The organisers of the Web as Corpus workshop in 2020 noticed 'the death of forums in favour of more closed platforms'.3

A part of web content is served dynamically. This is another change crawling has to cope with. Using a web browser engine to parse and execute web scripts is the key.

Efficient crawling requires dealing with the fact that the same content can be reached via multiple (possibly an unlimited number of) URLs, e.g. various filters in e-shops leading to the same page with the description of goods.

Computer generated text is on the rise too. Although 'starting the crawl from a set of trustworthy seed domains, measuring domain distance from seed domains and not deviating too deep from the seed domains using hostname heuristics' [Jak+20b] are ways to avoid spam, a lot of generated non-text will still be downloaded. Strategies of non-text detection using language models will just compete with the same language models generating non-text.

Machine translation is a specific subcase. Although there might exist a solution – watermarking the output of statistical machine translation – suggested by [Ven+11], we are not aware of the actual spread of this technique.

Internet societies emerging from the so called 'developing countries' seem to skip the age of recording thoughts in text – to generating multimedia content.

3. https://www.sigwac.org.uk/wiki/WAC-XII, visited in January 2020.

Will there be 'Video Sketches' summarising multimedia collocations in the future? An example is Laos, a country with over 7 million citizens out of which over 25 % are online4, where after extensive crawling for about half a year we were only able to obtain a corpus of about 100 million words (after a thorough cleaning) – whereas in a country like Slovenia, with 2 million citizens out of which almost 80 % are online, one can crawl a billion-word-sized corpus with no extra effort.

Richer annotation of web corpora is another field in which we would like to move ahead in the future. Topic and genre annotation should be added to all corpora since understanding the data is important for every application.

Finally, we believe the web will remain the largest source of text corpora – worthy of dealing with both known and emerging challenges.

4. Data taken from https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users, accessed in February 2020.

A Appendices

A.1 Genre Definition for Annotators

Definitions of genre and the classes to recognise from the annotation manual used for the reference of annotators in the '9 genres' annotation scheme follow. This manual is an extension of Sharoff's description of Functional Text Dimensions in [Sha18, pp. 94–95]. Classes were re-organised into a 9 class scheme, linguistic markers to observe in the text were provided and examples of text were added by the author of this thesis.

Definitions of genre

∙ A particular style or category of works of art; esp. a type of literary work characterised by a particular form, style, or purpose. – A general OED definition.

∙ A set of conventions (regularities) that transcend individual texts, helping humans to identify the communicative purpose and the context underlying a document. – Santini, Mehler, Sharoff: Genres on the Web: Computational Models and Empirical Studies. Vol. 42. Springer Science & Business Media, 2010.

∙ Sketch Engine perspective: The users need to know what types of text are included in the corpus. Since the users do language research, build dictionaries, n-gram models for writing prediction etc., including genre information allows them to use subcorpora limited to a particular genre.

The aim of this annotation project is to identify genres in texts on the web. Genres are not topics. Topics are determined by content words. A genre is determined by the style of writing, content words are only supporting evidence. Therefore it is the style that is key in assessing the genre, content words are only secondary.

Recognised Genres (Functional Text Dimensions)

Information = To what extent does the text provide information? Examples: topic definition (textbooks, general information, encyclopedia), blogs (topic blogs, argumentative/point of view blogs), research (scientific papers, popular science), advertisement/promotion of goods/services/thoughts.


– Subcategory Promotion = To what extent does the text promote a product, service, political movement, party, religious faith? Examples: A company landing page, an advertisement, a product presentation page, an e-shop catalogue page, a job offer, a page describing the service of a charity, a religious tract, a political manifesto.
– Subcategory Academic = To what extent would you consider the text as representing research? Usually formal, first person plural, scientific terms. Example: A research paper or any text written using the academic style. Also, it can be Partly if a news text reports scientific contents.
– Subcategory Review = To what extent does the text evaluate a specific entity by endorsing or criticising it? Usually a personal experience of the reviewer, comparison to other products, pros and cons. Example: product review endorsing or criticising the product.
– Other informative text – select the general Information

Story telling = To what extent is the text's content fictional or telling a story? Examples: Description of events in the order they followed (real or fictional), sometimes informal style, can be in the first person. All narrative texts belong to this category: fiction, narrative blogs. Example: "I visited New York, I was at the White House, saw the Statue of Liberty, my luggage got lost on the way back."

Instructions = Teaching the reader how something works. The imperative is frequent. Example: “Fill in all fields in this form and click the OK button. Then wait for three minutes and add one teaspoon of sugar.”

News = Informative report of events recent (or coming in the near future) at the time of writing (not a discussion or a general state of affairs). Frequently formal style, set to a particular place and time. Often quotes sources of information. A diary-like blog entry is also considered reporting. Examples: “Prague, 10/28/18. President Zeman said .” Or: “‘Almost


five million tourists visited the last year’, said Jana Bartošová, the deputy of the minister of culture.”

Legal = To what extent does the text lay down a contract or specify a set of regulations? Examples: a law, a contract, copy- right notices, university regulations.

Discussion = A written communication of participants of a dis- cussion. Frequently personal and informal style. Examples: ex- pressing points of view, giving advice, responses/comments to the original article or previous comments, sharing personal ex- periences. Can be multiple authors. (Note that just describing how something works is Information, just giving instructions how to solve a problem is Instructions.)

Non-text = To what extent is the text different from what is expected to be a normal running text? Examples: Lists of links, online forms, tables of items, bibliographic references, cloud of tags, sentences ending in the middle, machine generated text. Not a Non-text if at least 30

Short text/Unsure = A valid text (not a Non-text) that is too short to determine its genre. Or a text not belonging strongly to any class.

Multiple genres = A valid text (not a Non-text) consisting of several long parts showing strong signs of multiple genres. Example: A long news article with a long discussion below the article. Instruction: Select Multiple genres, then mark particular genres by Partly. Use the Remove button to remove paragraphs of a minor genre instead.

A.2 Text Size after Processing Steps

A summary of data size in four stages of text processing of web corpora crawled by SpiderLing recently can be found in Table A.1.

Table A.1: Text size after three Brno pipeline processing steps for ten recently crawled target languages (CS & SK, English, Estonian, Finnish, French, Greek, Irish, Italian and Polish). For each language the table gives the downloaded HTML size, the extracted plaintext size and the clean rate of the first step; the input tokens, output tokens and clean rate of the second step; and the output tokens and clean rate of the third step. The first part is performed by tools embedded in the crawler SpiderLing: boilerplate removal and paragraph plaintext extraction from HTML by Justext and language filtering using character trigram models for each recognised language. More than 95 % of downloaded data is removed by this procedure. The next step is filtering unwanted languages, including discerning similar languages using lists of words from large web corpora. The last step in this table is de-duplication by Onion; more than 60 % of tokens is removed this way. 'Clean rate' columns show how much data or tokens were removed in the respective cleaning step.

A.3 FastText Hyperparameters for English Topic Classification

Table A.2: Hyperparameter values autotuned by FastText for topic classification in our English web corpus. By modifying FastText’s autotune code, the search space of some parameters was limited to a certain interval and parameters marked as fixed were set to a fixed value. ‘Val.’ is the final value. ‘M’ stands for millions.

Parameter          FT param.   Val.   Method
Dimensions         dim         100    try {100, 200, 300}
Loss function      loss        ns     fixed
Context size       ws          5      try {5, 10}
Negatives sampled  neg         15     try {5, 10, 15}
Learning rate      lr          0.134  autotune in [0.1, 1]
Epoch              epoch       33     autotune in [10, 50]
Min word freq.     minCount    5      fixed
Max word tuples    wordNgrams  5      fixed
Min char. tuples   minn        3      try {0, 3}
Max char. tuples   maxn        6      try {0, 6}
Buckets            bucket      5 M    autotune in [1M, 5M]
Subvector size     dsub        2      autotune in {2, 4, 8}

A.4 Selected Papers

A list of journal and conference papers authored or co-authored by the author of this thesis follows. A brief description of the contribution of the author of this thesis and an estimated share of the work is given for each paper. The citation count obtained from Google Scholar in April 2020 is provided in the case of the most cited works.

Journals: 1. Aleš Horák, Vít Baisa, Adam Rambousek, and Vít Suchomel. “A New Approach for Semi-Automatic Building and Extending a Multilingual Terminology Thesaurus”. In: International Journal on Artificial Intelligence Tools 28.02 (2019), p. 1950008 Share of work: 15 % Own contribution: Crawling domain web corpora, corpora description, extraction of ontology information from the corpora. Journal Rating: Impact factor 0.849 2. Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, and Vít Suchomel. “arTenTen: Arabic corpus and word sketches”. In: Journal of King Saud University – Computer and Information Sciences 26.4 (2014), pp. 357–371 Share of work: 20 % Own contribution: Efficient crawling of an Arabic web corpus, building the corpus, corpus description. Journal Rating: FI-rank B (Google Scholar h5-index = 30) Citation count: 34 3. Adam Kilgarriff, Vít Baisa, Jan Busta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. “The Sketch Engine: Ten Years On”. In: Lexicography 1.1 (2014), pp. 7–36 Share of work: 10 % Own contribution: Efficient crawling of corpora in various languages. Text processing pipeline. Process description. Citation count: 622


Rank B conferences: 4. Pavel Rychlý and Vít Suchomel. “Annotated amharic corpora”. In: International Conference on Text, Speech, and Dialogue. Springer. 2016, pp. 295–302 Share of work: 50 % Own contribution: Efficient crawling of an Amharic corpus, building the corpus, corpus description. Conference Rating: FI-rank B (Google Scholar h5-index = 11)

5. Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. “HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation”. In: Proceedings of Ninth International Conference on Language Resources and Evaluation. 2014, pp. 3550–3555 Share of work: 10 % Own contribution: Efficient crawling of a Hindi corpus, data for the final corpus. Conference Rating: FI-rank B (GGS Conference Rating B, Google Scholar h5-index = 45) Citation count: 78

Web As Corpus ACL SIG workshop – The venue most relevant to my work: 6. Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. Current Challenges in Web Corpus Building. Submission Accepted for publication. 2020 Share of work: 40 % Own contribution: Co-analysis of current challenges in web corpus building and mitigation of their impact on web corpora.

7. Vít Suchomel. “Removing spam from web corpora through supervised learning using FastText”. In: Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017 including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 2017, pp. 56–60 Share of work: 100 %


Own contribution: A supervised learning method of non-text detection, application to a web corpus, evaluation. 8. Adam Kilgarriff and Vít Suchomel. “Web Spam”. In: Proceedings of the 8th Web as Corpus Workshop (WAC-8) @Corpus Linguistics 2013. Ed. by Paul Rayson Stefan Evert Egon Stemle. 2013, pp. 46–52 Share of work: 50 % Own contribution: Methods of non-text detection and mitigation of its impact on web corpora. Citation count: 6 9. Vít Suchomel and Jan Pomikálek. “Efficient Web Crawling for Large Text Corpora”. In: Proceedings of the seventh Web as Corpus Workshop (WAC7). Ed. by Serge Sharoff Adam Kilgarriff. Lyon, 2012, pp. 39–43 Share of work: 60 % Own contribution: Web crawler design, implementation of focused crawling by measuring the yield rate of web domains, efficient crawling of a web corpus used in a comparison of web crawlers. Citation count: 94

Chapter in a book: 10. Miloš Jakubíček, Vít Baisa, Jan Bušta, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. Walking the Tightrope between Linguistics and Language Engineering. Submission accepted for publication. Springer, 2020 Share of work: 10 % Own contribution: Section on building very large text corpora from the web.

Other papers relevant to this thesis: 11. Vít Suchomel. “Discriminating Between Similar Languages Using Large Web Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. 2019, pp. 129–135 Share of work: 100 %


Own contribution: A method derived from our previous co-authored work applied to web corpora. Evaluated using data from workshop VarDial on discriminating between similar languages and dialects.

12. Vít Baisa, Marek Blahuš, Michal Cukr, Ondřej Herman, Miloš Jakubíček, Vojtěch Kovář, Marek Medveď, Michal Měchura, Pavel Rychlý, and Vít Suchomel. “Automating Dictionary Production: a Tagalog-English-Korean Dictionary from Scratch”. In: Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference. 1-3 October 2019, Sintra, Portugal. 2019, pp. 805–818 Share of work: 10 % Own contribution: Efficient crawling of a Tagalog web corpus, adaptation of the crawler to a low resourced language, building the corpus, development of methods of avoiding non-text in the process of crawling and in the composition of the corpus.

13. Kristina Koppel, Jelena Kallas, Maria Khokhlova, Vít Suchomel, Vít Baisa, and Jan Michelfeit. “SkELL Corpora as a Part of the Language Portal Sõnaveeb: Problems and Perspectives”. In: Proceedings of eLex 2019 (2019) Share of work: 10 % Own contribution: Efficient crawling of an Estonian web corpus, building the corpus. Co-analysis of problems and solutions.

14. Vít Suchomel. “csTenTen17, a Recent Czech Web Corpus”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2018. 2018, pp. 111–123 Share of work: 100 % Own contribution: Efficient crawling of a Czech web corpus, building the corpus, corpus description and comparison to another Czech web corpus.

15. Jelena Kallas, Vít Suchomel, and Maria Khokhlova. “Automated Identification of Domain Preferences of Collocations”. In: Electronic lexicography in the 21st century. Proceedings of eLex 2017 conference. 2017, pp. 309–320 Share of work: 25 %


Own contribution: Building domain corpora from the web, terminology extraction, evaluation of .

16. Ondřej Herman, Vít Suchomel, Vít Baisa, and Pavel Rychlý. “DSL Shared task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation–Maximization and Chunk-based Language Model”. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3). Osaka, 2016, pp. 114–118 Share of work: 25 % Own contribution: Frequency wordlists for similar languages obtained from large web corpora built by me, co-development of the method of discerning languages.

17. Darja Fišer, Vít Suchomel, and Miloš Jakubíček. “Terminology Extraction for Academic Slovene Using Sketch Engine”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. 2016, pp. 135–141 Share of work: 25 % Own contribution: Compilation of a corpus of academic Slovene. Co-work on Slovene terminology definition and extraction.

18. Vít Baisa, Vít Suchomel, Adam Kilgarriff, and Miloš Jakubíček. “Sketch Engine for English Language Learning”. In: Corpus Linguistics 2015. Ed. by Federica Formato and Andrew Hardie. UCREL. Birmingham, 2015, pp. 33–35 Share of work: 25 % Own contribution: Efficient crawling of an English corpus, building the corpus, development of methods of avoiding non-text in the process of crawling and in the composition of the corpus.

19. Vít Baisa and Vít Suchomel. “Corpus Based Extraction of Hypernyms in Terminological Thesaurus for Land Surveying Domain”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2015. Tribun EU, 2015, pp. 69–74 Share of work: 50 % Own contribution: Building domain corpora from the web. Hypernym extraction for a terminology project. Co-work on the method and its description.


20. Vít Baisa and Vít Suchomel. “Turkic Language Support in Sketch Engine”. In: Proceedings of the 3rd International Conference on Computer Processing in Turkic Languages (TurkLang 2015). 2015, pp. 214–223 Share of work: 40 % Own contribution: Efficient crawling of web corpora in Turkic languages, building the corpora, corpus description.

21. Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. “Finding terms in corpora for many languages with the Sketch Engine”. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, 2014, pp. 53–56 Share of work: 30 % Own contribution: Efficient crawling of web corpora, term extraction from the corpora for 10 languages, co-development of a general description of terms in the languages. Submission type: A paper in the demonstrations track (not a part of the main conference proceedings) Conference Rating: FI-rank A (CORE rank A) Citation count: 36

22. Vít Baisa and Vít Suchomel. “SkELL: Web Interface for English Language Learning”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. Brno, 2014, pp. 63–70 Share of work: 30 % Own contribution: Efficient crawling of an English corpus, building the corpus, development of methods of avoiding non-text in the process of crawling and in the composition of the corpus. Citation count: 40

23. Jan Michelfeit, Vít Suchomel, and Jan Pomikálek. “Text Tokenisation Using unitok.” In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. 2014, pp. 71–75 Share of work: 50 % Own contribution: Description and evaluation of the method. Citation count: 21


24. Adam Rambousek, Aleš Horák, Vít Suchomel, and Lucia Kocincová. “Semiautomatic Building and Extension of Terminological Thesaurus for Land Surveying Domain”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. 2014, pp. 129–137 Share of work: 15 % Own contribution: Building domain corpora from the web. Automated extraction of semantic relations from the corpora for a terminology project. Co-work on the method and its description. 25. Zuzana Nevěřilová and Vít Suchomel. “Intelligent Search and Replace for Czech Phrases”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. 2014, pp. 97–105 Share of work: 50 % Own contribution: The idea of search and replace for phrases in inflected languages. Application to a Czech corpus. Co-work on the method and its description. 26. Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, Jan Bušta, Vít Baisa, and Jan Michelfeit. “The TenTen Corpus Family”. In: 7th International Corpus Linguistics Conference CL. UCREL. Lancaster, 2013, pp. 125–127 Share of work: 30 % Own contribution: Efficient crawling of web corpora in multiple languages. Citation count: 202 27. Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, and Vít Suchomel. “arTenTen: a new, vast corpus for Arabic”. In: Proceedings of WACL 20 (2013) Share of work: 20 % Own contribution: Efficient crawling of an Arabic web corpus, building the corpus, corpus description. 28. Irena Srdanović, Vít Suchomel, Toshinobu Ogiso, and Adam Kilgarriff. “ Lexical and Grammatical Profiling Using the Web Corpus JpTenTen”. In: Proceeding of the 3rd Japanese corpus linguistics workshop. Tokyo: NINJAL, Department of


Corpus Studies/Center for Corpus Development. 2013, pp. 229–238 Share of work: 20 % Own contribution: Efficient crawling of a Japanese web corpus, building the corpus, corpus description.

29. Vít Baisa and Vít Suchomel. “Intrinsic Methods for Comparison of Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. 2013, pp. 51–58 Share of work: 50 % Own contribution: Method description and evaluation – 2/3 of methods in the paper.

30. Vít Baisa and Vít Suchomel. “Large Corpora for Turkic Languages and Unsupervised Morphological Analysis”. In: Proceedings of the Eight International Conference on Language Resources and Evaluation. Ed. by Mehmet Ugur Dogan Seniz Demir Ilknur Durgar El-Kahlout. European Language Resources Association (ELRA). Istanbul, 2012, pp. 28–32 Share of work: 50 % Own contribution: Efficient crawling of five corpora in Turkic languages, building the corpora, corpus description. Submission type: A paper in a workshop on Turkic languages. Conference Rating: FI-rank B (GGS Conference Rating B, Google Scholar h5-index = 45) Citation count: 9

31. Vít Baisa and Vít Suchomel. “Detecting Spam in Web Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. 2012, pp. 69–76 Share of work: 50 % Own contribution: Methods of non-text detection and mitigation of its impact on web corpora.

32. Gulshan Dovudov, Vít Suchomel, and Pavel Šmerk. “Towards 100M Morphologically Annotated Corpus of Tajik”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. 2012, pp. 91–94 Share of work: 25 % Own contribution: Efficient crawling of a Tajik Persian web corpus,


building the corpus, corpus description. Adaptation of the crawler to a less resourced language.

33. Vít Suchomel. “Recent Czech Web Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. 2012, pp. 77–83 Share of work: 100 % Own contribution: Efficient crawling of a Czech web corpus, building the corpus, corpus description. Corpus comparison to other contemporary Czech corpora.

34. Gulshan Dovudov, Vít Suchomel, and Pavel Šmerk. “POS Annotated 50M Corpus of Tajik Language”. In: Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages SaLTMiL 8 – AfLaT2012. Istanbul, 2012, pp. 93–98 Share of work: 25 % Own contribution: Efficient crawling of a Tajik Persian corpus, building the corpus. Adaptation of the crawler to a less resourced language.

35. Vít Suchomel and Jan Pomikálek. “Practical Web Crawling for Text Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011. 2011, pp. 97–108 Share of work: 75 % Own contribution: Crawler implementation and description.

36. Jan Pomikálek and Vít Suchomel. “chared: Character Encoding Detection with a Known Language”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011. 2011, pp. 125–129 Share of work: 50 % Own contribution: Tool implementation and a part of its description.

37. Gulshan Dovudov, Jan Pomikálek, Vít Suchomel, and Pavel Šmerk. “Building a 50M Corpus of Tajik Language”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011. 2011, pp. 89–95 Share of work: 25 %


Own contribution: Efficient crawling of a Tajik Persian web corpus, building the corpus, corpus description.

Bibliography

[Art+14] Tressy Arts, Yonatan Belinkov, Nizar Habash, Adam Kil- garriff, and Vít Suchomel. “arTenTen: Arabic corpus and word sketches”. In: Journal of King Saud University – Com- puter and Information Sciences 26.4 (2014), pp. 357–371. [Bai+19] Vít Baisa, Marek Blahuš, Michal Cukr, Ondřej Herman, Miloš Jakubíček, Vojtěch Kovář, Marek Medveď, Michal Měchura, Pavel Rychlý, and Vít Suchomel. “Automating Dictionary Production: a Tagalog-English-Korean Dictio- nary from Scratch”. In: Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference. 1-3 October 2019, Sintra, Portugal. 2019, pp. 805–818. [BS12a] Vít Baisa and Vít Suchomel. “Detecting Spam in Web Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. 2012, pp. 69– 76. [BS12b] Vít Baisa and Vít Suchomel. “Large Corpora for Turkic Languages and Unsupervised Morphological Analysis”. In: Proceedings of the Eight International Conference on Lan- guage Resources and Evaluation. Ed. by Mehmet Ugur Do- gan Seniz Demir Ilknur Durgar El-Kahlout. European Language Resources Association (ELRA). Istanbul, 2012, pp. 28–32. [BS13] Vít Baisa and Vít Suchomel. “Intrinsic Methods for Com- parison of Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. 2013, pp. 51–58. [BS14] Vít Baisa and Vít Suchomel. “SkELL: Web Interface for English Language Learning”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. Brno, 2014, pp. 63–70. [BS15a] Vít Baisa and Vít Suchomel. “Corpus Based Extraction of Hypernyms in Terminological Thesaurus for Land


Surveying Domain”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2015. Tribun EU, 2015, pp. 69–74. [BS15b] Vít Baisa and Vít Suchomel. “Turkic Language Support in Sketch Engine”. In: Proceedings of the 3rd International Conference on Computer Processing in Turkic Languages (Turk- Lang 2015). 2015, pp. 214–223. [Bai+15] Vít Baisa, Vít Suchomel, Adam Kilgarriff, and Miloš Jakubíček. “Sketch Engine for English Language Learning”. In: Corpus Linguistics 2015. Ed. by Federica Formato and Andrew Hardie. UCREL. Birmingham, 2015, pp. 33–35. [BR02] Ziv Bar-Yossef and Sridhar Rajagopalan. “Template detec- tion via and its applications”. In: Proceedings of the 11th international conference on World Wide Web. ACM. 2002, pp. 580–591. [BK06] M. Baroni and A. Kilgarriff. “Large linguistically- processed web corpora for multiple languages”. In: Proceedings of European ACL (2006). [BB04] Marco Baroni and Silvia Bernardini. “BootCaT: Bootstrap- ping Corpora and Terms from the Web”. In: Proceedings of International Conference on Language Resources and Evalu- ation. 2004. [Bar+09] Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. “The WaCky wide web: a collection of very large linguistically processed web-crawled corpora”. In: Language resources and evaluation 43.3 (2009), pp. 209– 226. [Bar+08] Marco Baroni, Francis Chantree, Adam Kilgarriff, and Serge Sharoff. “Cleaneval: a Competition for Cleaning Web Pages”. In: Proceedings of Sixth International Conference on Language Resources and Evaluation. 2008. [Bar+06] Marco Baroni, Adam Kilgarriff, Jan Pomikálek, Pavel Rychlý, et al. “WebBootCaT: instant domain-specific


corpora to support human translators”. In: Proceedings of EAMT. 2006, pp. 247–252. [Bel+13] Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, and Vít Suchomel. “arTenTen: a new, vast corpus for Arabic”. In: Proceedings of WACL 20 (2013). [Ben14] Vladimír Benko. “Aranea: Yet Another Family of (Com- parable) Web Corpora”. In: Text, Speech and Dialogue. Springer. 2014, pp. 247–256. [Ben16] Vladimír Benko. “Feeding the "Brno Pipeline": The Case of Araneum Slovacum”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. 2016, pp. 19–27. [Bib88] Douglas Biber. Variation across speech and writing. Cam- bridge University Press, 1988. [Bib89] Douglas Biber. “A typology of English texts”. In: Linguis- tics 27.1 (1989), pp. 3–44. [Bie+07] Chris Biemann, Gerhard Heyer, Uwe Quasthoff, and Matthias Richter. “The Leipzig Corpora Collection- monolingual corpora of standard size”. In: Proceedings of Corpus Linguistic 2007 (2007). [Boj+14] Ondřej Bojar, Vojtěch Diatka, Pavel Rychlý, Pavel Straňák, Vít Suchomel, Aleš Tamchyna, and Daniel Zeman. “HindEnCorp-Hindi-English and Hindi-only Corpus for Machine Translation”. In: Proceedings of Ninth Inter- national Conference on Language Resources and Evaluation. 2014, pp. 3550–3555. [Bro+00] Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. “Min-wise independent permu- tations”. In: Journal of Computer and System Sciences 60.3 (2000), pp. 630–659. [Cal+09] Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao. Clueweb09 data set. 2009.


[Cas+07] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, and Fabrizio Silvestri. “Know your neighbors: Web spam detection using the web topology”. In: Proceed- ings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM. 2007, pp. 423–430. [CK01] Gabriela Cavagliá and Adam Kilgarriff. “Corpora from the Web”. In: Fourth Annual CLUCK Colloquium, Sheffield, UK. 2001. [CG99] J. Cho and H. Garcia-Molina. The Evolution of the Web and Implications for an Incremental Crawler. Technical Report 1999-22. Stanford InfoLab, 1999. [CKR10] Kevin Crowston, Barbara Kwasnik, and Joseph Rubleske. “Problems in the use-centered development of a taxon- omy of web genres”. In: Genres on the Web. Springer, 2010, pp. 69–84. [Cvr+20] Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina, and Vladimír Benko. “Comparing web-crawled and traditional corpora”. In: Language Resources and Evaluation (2020), pp. 1–33. [DS16] Erika Dalan and Serge Sharoff. “Genre classification fora corpus of academic webpages”. In: Proceedings of the 10th Web as Corpus Workshop. 2016, pp. 90–98. [DF15] Mark Davies and Robert Fuchs. “Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-based English Corpus (GloWbE)”. In: En- glish World-Wide 36.1 (2015), pp. 1–28. [DKB98] Johan Dewe, Jussi Karlgren, and Ivan Bretan. “Assem- bling a balanced corpus from the internet”. In: Proceedings of the 11th Nordic Conference of Computational Linguistics (NODALIDA 1998). 1998, pp. 100–108. [Dov+11] Gulshan Dovudov, Jan Pomikálek, Vít Suchomel, and Pavel Šmerk. “Building a 50M Corpus of Tajik Language”.

[DSŠ12a] Gulshan Dovudov, Vít Suchomel, and Pavel Šmerk. “POS Annotated 50M Corpus of Tajik Language”. In: Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages SaLTMiL 8 – AfLaT2012. Istanbul, 2012, pp. 93–98.
[DSŠ12b] Gulshan Dovudov, Vít Suchomel, and Pavel Šmerk. “Towards 100M Morphologically Annotated Corpus of Tajik”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. 2012, pp. 91–94.
[EGB12] Miklós Erdélyi, András Garzó, and András A. Benczúr. “Web spam classification: a few features worth more”. In: Proc. Joint WICOW/AIRWeb Workshop at WWW-2012. 2012.
[Fer+08a] A. Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini. “Introducing and evaluating ukWaC, a very large web-derived corpus of English”. In: Proceedings of the 4th Web as Corpus Workshop at LREC 2008. 2008.
[Fer+08b] Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. “Introducing and evaluating ukWaC, a very large web-derived corpus of English”. In: Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google. 2008, pp. 47–54.
[Fer+08c] Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. “Introducing and evaluating ukWaC, a very large web-derived corpus of English”. In: Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google. 2008, pp. 47–54.
[FCV09] Dennis Fetterly, Nick Craswell, and Vishwa Vinay. “The impact of crawl policy on web search effectiveness”. In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM. 2009, pp. 580–587.

[FSJ16] Darja Fišer, Vít Suchomel, and Miloš Jakubíček. “Terminology Extraction for Academic Slovene Using Sketch Engine”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2016. 2016, pp. 135–141.
[GJM01] Rayid Ghani, Rosie Jones, and Dunja Mladenić. “Mining the web to create minority language corpora”. In: Proceedings of the tenth international conference on Information and knowledge management. ACM. 2001, pp. 279–286.
[GO13] Yoav Goldberg and Jon Orwant. “A dataset of syntactic-ngrams over time from a very large corpus of English books”. In: Second Joint Conference on Lexical and Computational Semantics (*SEM). Vol. 1. 2013, pp. 241–247.
[GN00] Gregory Grefenstette and Julien Nioche. “Estimation of English and non-English Language Use on the WWW”. In: Recherche d’Information Assistée par Ordinateur (RIAO). 2000.
[GG05] Zoltán Gyöngyi and Hector Garcia-Molina. “Web spam taxonomy”. In: First international workshop on adversarial information retrieval on the web (AIRWeb 2005). 2005.
[GGP04] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. “Combating web spam with trustrank”. In: Proceedings of the Thirtieth international conference on Very large data bases – Volume 30. VLDB Endowment. 2004, pp. 576–587.
[Her+16] Ondřej Herman, Vít Suchomel, Vít Baisa, and Pavel Rychlý. “DSL Shared Task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation–Maximization and Chunk-based Language Model”. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3). Osaka, 2016, pp. 114–118.
[Hná+14] Milena Hnátková, Michal Křen, Pavel Procházka, and Hana Skoumalová. “The SYN-series corpora of written Czech”. In: Proceedings of Ninth International Conference on Language Resources and Evaluation. 2014, pp. 160–164.

[Hor+19] Aleš Horák, Vít Baisa, Adam Rambousek, and Vít Suchomel. “A New Approach for Semi-Automatic Building and Extending a Multilingual Terminology Thesaurus”. In: International Journal on Artificial Intelligence Tools 28.02 (2019), p. 1950008.
[Jak+20a] Miloš Jakubíček, Vít Baisa, Jan Bušta, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. Walking the Tightrope between Linguistics and Language Engineering. Submission accepted for publication. Springer, 2020.
[Jak+14] Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. “Finding terms in corpora for many languages with the Sketch Engine”. In: Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, 2014, pp. 53–56.
[Jak+13] Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, Jan Bušta, Vít Baisa, and Jan Michelfeit. “The TenTen Corpus Family”. In: 7th International Corpus Linguistics Conference CL. UCREL. Lancaster, 2013, pp. 125–127.
[Jak+20b] Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. Current Challenges in Web Corpus Building. Submission accepted for publication. 2020.
[Jou+16] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. “Bag of Tricks for Efficient Text Classification”. In: arXiv preprint arXiv:1607.01759 (2016).
[KK] Jelena Kallas and Kristina Koppel. Eesti keele ühendkorpus 2019. Centre of Estonian Language Resources.
[KSK17] Jelena Kallas, Vít Suchomel, and Maria Khokhlova. “Automated Identification of Domain Preferences of Collocations”. In: Electronic lexicography in the 21st century. Proceedings of eLex 2017 conference. 2017, pp. 309–320.
[Kha+04] Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin. “Nutch: A flexible and scalable open-source web search engine”. In: Oregon State University 1 (2004), pp. 32–32.

[Kil12] Adam Kilgarriff. “Getting to know your corpus”. In: International conference on text, speech and dialogue. Springer. 2012, pp. 3–15.
[Kil+14] Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. “The Sketch Engine: Ten Years On”. In: Lexicography 1.1 (2014), pp. 7–36.
[KC13] Adam Kilgarriff and Carole Tiberius. Genre in a frequency dictionary. Presented at the International Conference on Corpus Linguistics, Lancaster, 2013.
[KG03] Adam Kilgarriff and Gregory Grefenstette. “Introduction to the special issue on the web as corpus”. In: Computational linguistics 29.3 (2003), pp. 333–347.
[Kil+04] Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. “The Sketch Engine”. In: Information Technology 105 (2004), p. 116.
[KS13] Adam Kilgarriff and Vít Suchomel. “Web Spam”. In: Proceedings of the 8th Web as Corpus Workshop (WAC-8) @Corpus Linguistics 2013. Ed. by Paul Rayson, Stefan Evert, and Egon Stemle. 2013, pp. 46–52.
[KFN10] Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. “Boilerplate detection using shallow text features”. In: Proceedings of the third ACM international conference on Web search and data mining. ACM. 2010, pp. 441–450.
[Kop+19] Kristina Koppel, Jelena Kallas, Maria Khokhlova, Vít Suchomel, Vít Baisa, and Jan Michelfeit. “SkELL Corpora as a Part of the Language Portal Sõnaveeb: Problems and Perspectives”. In: Proceedings of eLex 2019 (2019).
[Kri04] Klaus Krippendorff. Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage, 2004.

[Lee01] David Y. W. Lee. “Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle”. In: Language Learning and Technology 5.3 (2001), pp. 37–72.
[Lee+09] Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. “IRLbot: scaling to 6 billion pages and beyond”. In: ACM Transactions on the Web (TWEB) 3.3 (2009), pp. 1–34.
[Lee92] Geoffrey Leech. “100 million words of English: the British National Corpus (BNC)”. In: Language Research 28.1 (1992), pp. 1–13.
[Let14] Igor Leturia. “The Web as a Corpus of Basque”. PhD thesis. University of the Basque Country, 2014.
[LC94] David D Lewis and Jason Catlett. “Heterogeneous uncertainty sampling for supervised learning”. In: Machine learning proceedings 1994. Elsevier, 1994, pp. 148–156.
[LK14] Nikola Ljubešić and Filip Klubička. “bs, hr, sr WaC: Web corpora of Bosnian, Croatian and Serbian”. In: Proceedings of WAC-9 workshop. Association for Computational Linguistics. 2014.
[LT14] Nikola Ljubešić and Antonio Toral. “caWaC – A web corpus of Catalan and its application to language modeling and machine translation”. In: Proceedings of Ninth International Conference on Language Resources and Evaluation. 2014, pp. 1728–1732.
[LB12] Marco Lui and Timothy Baldwin. “langid.py: An Off-the-shelf Language Identification Tool”. In: Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics. Jeju Island, Korea, July 2012, pp. 25–30.
[Mal+16] Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, and Jörg Tiedemann. “Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task”. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3). Osaka, Japan, Dec. 2016, pp. 1–14.

[MS99] Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, 1999.
[MRS08] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to information retrieval. Cambridge University Press, 2008.
[MPS07] Michal Marek, Pavel Pecina, and Miroslav Spousta. “Web page cleaning with conditional random fields”. In: Building and Exploring Web Corpora: Proceedings of the Fifth Web as Corpus Workshop, Incorporating CleanEval (WAC3), Belgium. 2007, pp. 155–162.
[MSS10] Alexander Mehler, Serge Sharoff, and Marina Santini. Genres on the web: Computational models and empirical studies. Vol. 42. Springer, 2010.
[MSP14] Jan Michelfeit, Vít Suchomel, and Jan Pomikálek. “Text Tokenisation Using unitok”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. 2014, pp. 71–75.
[Nak10] Shuyo Nakatani. Language Detection Library for Java. 2010. URL: https://github.com/shuyo/language-detection.
[NS14] Zuzana Nevěřilová and Vít Suchomel. “Intelligent Search and Replace for Czech Phrases”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. 2014, pp. 97–105.
[NCO04] Alexandros Ntoulas, Junghoo Cho, and Christopher Olston. “What’s new on the web?: the evolution of the web from a search engine perspective”. In: Proceedings of the 13th international conference on World Wide Web. ACM. 2004, pp. 1–12.
[Nto+06] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. “Detecting spam web pages through content analysis”. In: Proceedings of the 15th international conference on World Wide Web. ACM. 2006, pp. 83–92.

[Pom11] Jan Pomikálek. “Removing boilerplate and duplicate content from web corpora”. PhD thesis. Masaryk University, 2011.
[PJR12] Jan Pomikálek, Miloš Jakubíček, and Pavel Rychlý. “Building a 70 billion word corpus of English from ClueWeb”. In: Proceedings of Eighth International Conference on Language Resources and Evaluation. 2012, pp. 502–506.
[PRK09] Jan Pomikálek, Pavel Rychlý, and Adam Kilgarriff. “Scaling to Billion-plus Word Corpora”. In: Advances in Computational Linguistics 41 (2009), pp. 3–13.
[PS11] Jan Pomikálek and Vít Suchomel. “chared: Character Encoding Detection with a Known Language”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011. 2011, pp. 125–129.
[Ram+14] Adam Rambousek, Aleš Horák, Vít Suchomel, and Lucia Kocincová. “Semiautomatic Building and Extension of Terminological Thesaurus for Land Surveying Domain”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2014. 2014, pp. 129–137.
[Ros08] Mark Rosso. “User-based identification of Web genres”. In: Journal of the American Society for Information Science and Technology 59.7 (2008), pp. 1053–1072.
[Ryc08] Pavel Rychlý. “A Lexicographer-Friendly Association Score”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008. 2008, pp. 6–9.
[RS16] Pavel Rychlý and Vít Suchomel. “Annotated Amharic Corpora”. In: International Conference on Text, Speech, and Dialogue. Springer. 2016, pp. 295–302.
[SBB14] Roland Schäfer, Adrien Barbaresi, and Felix Bildhauer. “Focused Web Corpus Crawling”. In: Proceedings of the 9th Web as Corpus workshop (WAC-9). 2014, pp. 9–15.

[SB12] Roland Schäfer and Felix Bildhauer. “Building Large Corpora from the Web Using a New Efficient Tool Chain”. In: Proceedings of Eighth International Conference on Language Resources and Evaluation. 2012, pp. 486–493.
[SB13] Roland Schäfer and Felix Bildhauer. Web Corpus Construction. Vol. 6. Morgan & Claypool Publishers, 2013, pp. 1–145.
[Set12] Burr Settles. Active Learning. Vol. 6. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2012.
[Sha18] Serge Sharoff. “Functional text dimensions for the annotation of web corpora”. In: Corpora 13.1 (2018), pp. 65–95.
[She13] Denis Shestakov. “Current challenges in web crawling”. In: International Conference on Web Engineering. Springer. 2013, pp. 518–521.
[SS12] Johanka Spoustová and Miroslav Spousta. “A High-Quality Web Corpus of Czech”. In: Proceedings of Eighth International Conference on Language Resources and Evaluation. 2012, pp. 311–315.
[Srd+13] Irena Srdanović, Vít Suchomel, Toshinobu Ogiso, and Adam Kilgarriff. “Japanese Language Lexical and Grammatical Profiling Using the Web Corpus JpTenTen”. In: Proceedings of the 3rd Japanese corpus linguistics workshop. Tokyo: NINJAL, Department of Corpus Studies/Center for Corpus Development. 2013, pp. 229–238.
[Suc12] Vít Suchomel. “Recent Czech Web Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. 2012, pp. 77–83.
[Suc17] Vít Suchomel. “Removing spam from web corpora through supervised learning using FastText”. In: Proceedings of the Workshop on Challenges in the Management of Large Corpora and Big Data and Natural Language Processing (CMLC-5+BigNLP) 2017, including the papers from the Web-as-Corpus (WAC-XI) guest section. Birmingham, 2017, pp. 56–60.

[Suc18] Vít Suchomel. “csTenTen17, a Recent Czech Web Corpus”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2018. 2018, pp. 111–123.
[Suc19] Vít Suchomel. “Discriminating Between Similar Languages Using Large Web Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2019. 2019, pp. 129–135.
[SP11] Vít Suchomel and Jan Pomikálek. “Practical Web Crawling for Text Corpora”. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011. 2011, pp. 97–108.
[SP12] Vít Suchomel and Jan Pomikálek. “Efficient Web Crawling for Large Text Corpora”. In: Proceedings of the seventh Web as Corpus Workshop (WAC7). Ed. by Serge Sharoff and Adam Kilgarriff. Lyon, 2012, pp. 39–43.
[Tho14] James Thomas. Discovering English with the Sketch Engine. Research-publishing.net. La Grange des Noyes, France, 2014.
[Tro+12] Andrew Trotman, Charles LA Clarke, Iadh Ounis, Shane Culpepper, Marc-Allen Cartright, and Shlomo Geva. “Open source information retrieval: a report on the SIGIR 2012 workshop”. In: ACM SIGIR Forum. Vol. 46. ACM. 2012, pp. 95–101.
[Ven+11] Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz J Och, and Juri Ganitkevitch. “Watermarking the outputs of structured prediction with an application in statistical machine translation”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2011, pp. 1363–1372.
[VP12] Yannick Versley and Yana Panchenko. “Not just bigger: Towards better-quality Web corpora”. In: Proceedings of the seventh Web as Corpus Workshop (WAC7). 2012, pp. 44–52.

[Zam+17] Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. “Findings of the VarDial Evaluation Campaign 2017”. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). Association for Computational Linguistics. Valencia, Spain, Apr. 2017, pp. 1–15.
[Zam+14] Marcos Zampieri, Liling Tan, Nikola Ljubešić, and Jörg Tiedemann. “A report on the DSL shared task 2014”. In: Proceedings of the first workshop on applying NLP tools to similar languages, varieties and dialects. 2014, pp. 58–67.
[Zam+15] Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, and Preslav Nakov. “Overview of the DSL shared task 2015”. In: Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects. 2015, pp. 1–9.
[ZS04] Sven Meyer Zu Eissen and Benno Stein. “Genre classification of web pages”. In: Annual Conference on Artificial Intelligence. Springer. 2004, pp. 256–269.
