AMALGUM – A Free, Balanced, Multilayer English Web Corpus
Luke Gessler, Siyao Peng, Yang Liu, Yilun Zhu, Shabnam Behzad, Amir Zeldes
Corpling Lab, Georgetown University
{lg876, sp1184, yl879, yz565, sb1796, [email protected]}

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 5267–5275, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

Abstract

We present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory. By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller, manually created annotated data sets, while avoiding pitfalls such as imbalanced or unknown composition, licensing problems, and low-quality natural language processing. We harness knowledge from multiple annotation layers in order to achieve a "better than NLP" benchmark and evaluate the accuracy of the resulting resource.

Keywords: web corpora, annotation, discourse, coreference, NER, RST

1. Introduction

Corpus development for academic research and practical NLP applications often meets with a dilemma: development of high-quality data sets with rich annotation schemes (e.g. MASC (Ide et al., 2010)) requires substantial manual curation and analysis effort and does not scale up easily; on the other hand, easy "opportunistic" data collection (Kennedy, 1998), where as much available material as possible is collected from the Internet, usually means little or no manual annotation or inspection of corpus contents for genre balance or other distributional properties, and at most the addition of relatively reliable automatic annotations such as part-of-speech tags and lemmatization (e.g. the WaCky corpora (Baroni et al., 2009)), or sometimes out-of-the-box automatic dependency parsing (see the COW corpora (Schäfer, 2015)). This divide leads to a number of problems:

1. The study of systematic genre variation is limited to surface textual properties and cannot use complex annotations, unless it is confined to very limited data.

2. For large resources, NLP quality is low, since opportunistically acquired content often deviates substantially from the language found in the training data used by out-of-the-box tools.

3. Large open data sets for training NLP tools on complex phenomena (e.g. coreference resolution, discourse parsing) are unavailable.[1]

[1] The largest datasets for these tasks (e.g. the RST Discourse Treebank (Carlson et al., 2003) and OntoNotes (Weischedel et al., 2012)) contain between 200K and 1.6 million words and are only available for purchase from the LDC.

In this paper, we attempt to walk a middle path, combining some of the best features of corpus data harvested from the web – size, open licenses, lexical diversity – with data collected using a carefully defined sampling frame (cf. Hundt (2008)), which allows for more interpretable inferences as well as the application of tools specially prepared for the target domain(s). In addition, we focus on making discourse-level annotations available, including complete nested entity recognition, coreference resolution, and discourse parsing, while striving for accuracy that is as high as possible. Some of the applications we envision for this corpus include:

• Corpus-linguistic studies on genre variation and discourse using detailed annotation layers

• Active learning – by permuting different subsets of the automatically annotated corpus as training data, we can evaluate performance on gold standard development data and automatically select those documents whose annotations improve performance on a given task

• Cross-tagger validation – we can select the subset of sentences which are tagged or parsed identically by multiple taggers/models trained on different datasets and assume that these are highly likely to be correct (cf. Derczynski et al. (2012)), or use other similar approaches such as tri-training (cf. Zhou and Li (2005)); see the sketch after this list

• Human-in-the-loop/crowdsourcing – outputting sentences with low NLP tool confidence for expert correction or crowd worker annotations, or using a hybrid annotation setup in which several models are used, only the tags that the models agree on are reviewed by annotators, and the rest are annotated from scratch (Berzak et al., 2016)

• Pre-training – the error-prone but large automatically produced corpus can be used for pre-training a model which can later be fine-tuned on a smaller gold dataset
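To make the cross-tagger validation idea concrete, the following is a minimal sketch of agreement-based filtering, assuming two POS taggers trained on different datasets; the tagger callables and type aliases are illustrative placeholders rather than tools shipped with the corpus.

```python
from typing import Callable, List, Tuple

Sentence = List[str]                # a tokenized sentence
TaggedSent = List[Tuple[str, str]]  # (token, POS tag) pairs

def agreement_filter(
    sentences: List[Sentence],
    tag_a: Callable[[Sentence], TaggedSent],  # e.g. a tagger trained on GUM
    tag_b: Callable[[Sentence], TaggedSent],  # e.g. a tagger trained on another treebank
) -> List[TaggedSent]:
    """Keep only sentences on which both taggers produce identical tag
    sequences (cf. Derczynski et al. 2012); agreement makes the silver
    annotations highly likely to be correct."""
    kept = []
    for sent in sentences:
        tagged_a = tag_a(sent)
        tagged_b = tag_b(sent)
        if [tag for _, tag in tagged_a] == [tag for _, tag in tagged_b]:
            kept.append(tagged_a)
    return kept
```

The same pattern extends to parsers by comparing dependency structures instead of tag sequences.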
Testing these applications is outside the scope of the current paper, though several papers suggest that large-scale "silver quality" data can be better for training tools than smaller gold standard resources: for POS tagging (Schulz and Ketschik, 2019), for parsing (Toutanova, 2005) using information from parse banks (Charniak et al., 2000), for discourse relation classification (Marcu and Echihabi, 2002), and for other tasks. Moreover, due to its genre diversity, this corpus can be useful for genre studies as well as for automatic text type identification in Web corpora (e.g. Asheghi et al. (2014), Rehm et al. (2008), and Dalan and Sharoff (2016)). We plan to pursue several of these approaches in future work (see Section 5).

Our main contributions in this paper consist in (1) presenting a genre-balanced, richly annotated web corpus which is more than double (and in some cases 20 times) the size of the largest similar gold annotated resources; (2) evaluating tailored NLP tools which have been trained specifically for this task and comparing them to off-the-shelf tools; and (3) demonstrating the added value of rich annotations and of ensembling multiple concurrent tools, which improve the accuracy of each task by incorporating information from other tasks and tool outputs.

In the remainder of this paper, we describe and evaluate the quality of the dataset. In the next section, we present the data collected for our corpus, AMALGUM (A Machine-Annotated Lookalike of GUM), which covers eight English web genres for which we can obtain gold standard training data: news, interviews, travel guides, how-to guides, academic papers, fiction, biographies, and forum discussions. Section 2 describes the data and how it was collected, Section 3 presents the annotations added to the corpus and the process of using information across annotation layers to improve NLP tool output, Section 4 evaluates the quality of the resulting data, and Section 5 concludes with future plans and applications for the corpus. Our corpus is available at https://github.com/gucorpling/amalgum.

2. Data

2.1. Corpus Composition

In order to ensure relatively high-quality NLP output and an open license, we base our dataset on the composition of an existing, smaller web corpus of English, called GUM (the Georgetown University Multilayer corpus (Zeldes, 2017)). The GUM corpus contains data from the same genres mentioned above, currently amounting to approximately 130,000 tokens. We use the term genre somewhat loosely here to describe any recurring combination of features which characterizes groups of texts created under similar extralinguistic conditions and with comparable communicative intent (cf. Biber and Conrad (2009)). The corpus is manually annotated with a large number of annotation layers. At 4M tokens, AMALGUM is more than double the size of the largest gold standard coreference dataset for English (around 1.6M tokens with coreference in OntoNotes) and 4–20 times the size of the largest English discourse treebanks (around 1M tokens for the shallow-parsed PDTB (Prasad et al., 2019), and 200K tokens for RST-DT with full document trees (Carlson et al., 2003), both of which are restricted to WSJ text). The close match in content to the manually annotated GUM means that we expect AMALGUM to be particularly useful for applications such as active learning when combining resources from the gold standard GUM data and the automatically prepared data presented here.

2.2. Document Filtering

Documents were scraped from sites that provide data under open licenses (in most cases, Creative Commons licenses). The one exception is text from Reddit forum discussions, which cannot be distributed directly. To work around this, we distribute only stand-off annotations of the Reddit text, along with a script that recovers the plain text using the Reddit API and then combines it with the annotations.

In order to ensure high quality, we applied a number of heuristics during scraping to rule out documents that are likely to differ strongly from the original GUM data: for fiction, we required the keyword "fiction" to appear in the Project Gutenberg metadata, and we disqualified documents that contained archaic forms (e.g. "thou") or unrestored hyphenation (e.g. tokens like "disre-", from a broken "disregard") by using a stop list of frequent items from sample documents marked for exclusion by a human. For Reddit, we ruled out documents containing too many links or e-mail addresses, which were often just lists of links. In all wiki-based genres, references, tables of contents, empty sections, and all other portions that contained non-primary or boilerplate text were removed.
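As a concrete illustration of these heuristics, the sketch below shows one plausible implementation of the fiction and Reddit filters; the stop list contents and the link/e-mail threshold are assumptions for the example, since the paper does not specify exact values.

```python
import re

# Assumed stop list: frequent items from sample documents marked for
# exclusion by a human (archaic forms and unrestored hyphenation).
FICTION_STOP_LIST = {"thou", "thee", "hath", "disre-"}

LINK_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")

def keep_fiction_document(text: str) -> bool:
    """Reject fiction documents containing archaic forms or broken hyphenation."""
    tokens = set(text.lower().split())
    return not (tokens & FICTION_STOP_LIST)

def keep_reddit_document(text: str, max_linkish_ratio: float = 0.2) -> bool:
    """Reject Reddit documents that are mostly links or e-mail addresses.
    The 0.2 ratio is an invented threshold for illustration."""
    n_tokens = max(len(text.split()), 1)
    n_linkish = len(LINK_RE.findall(text)) + len(EMAIL_RE.findall(text))
    return n_linkish / n_tokens <= max_linkish_ratio
```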
2.3. Document Extent

Since the corpus described here is focused on discourse-level annotations, such as coreference resolution and discourse parses, we elected to only use documents similar in size to the documents in the benchmark GUM data, which generally has texts of about 500–1,000 words (a length check is sketched below). This restriction was also necessary due to the disparity across genres, and
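To make the size restriction concrete, here is a minimal sketch of the kind of length filter implied above; the 500–1,000-word window is taken from the description of GUM document sizes, and the exact bounds used by the project are an assumption here.

```python
def has_gum_like_extent(text: str, min_words: int = 500, max_words: int = 1000) -> bool:
    """Keep only documents comparable in length to GUM documents
    (roughly 500-1,000 words; exact bounds assumed for illustration)."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```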