
DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Martin Brümmer (1), Milan Dojchinovski (1,2), Sebastian Hellmann (1)
(1) AKSW, InfAI at the University of Leipzig, Germany
(2) Web Intelligence Research Group, FIT, Czech Technical University in Prague
{bruemmer,dojchinovski,hellmann}@informatik.uni-leipzig.de

Abstract
The ever increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Due to its size, its openness and relative quality, Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated Wikipedia texts in six languages, featuring over 11 million texts and over 97 million entity links. The properties of the Wikipedia texts are described, as well as the corpus creation process, its format and interesting use-cases, like Named Entity Linking training and evaluation.

Keywords: training, corpus, named entity recognition, named entity linking, nlp

1. Introduction

Wikipedia is the most important and comprehensive source of open, encyclopedic knowledge. The English Wikipedia alone features over 5 million articles, representing a vast source of openly licensed natural language text.
Besides the Wikipedia web site, Wikipedia data is available via a RESTful API as well as complete XML dumps. However, API access is officially limited to one request per second, prohibiting a web scraping approach to acquire the data. Even after downloading the XML dump files, consuming and processing them is a complicated task, further hindering large-scale data extraction. To alleviate these issues, the DBpedia project (Auer et al., 2008) has been extracting, mapping, converting and publishing Wikipedia data since 2007. The main focus of the DBpedia extraction lies in the mapping of infoboxes, templates and other easily identified structured data found in Wikipedia articles to properties of an ontology. Article texts and the data they may contain are not a focus of the extraction process, although they are the largest part of most articles in terms of time spent on writing, informational content and size. Only the text of the first introductory section, up until the first heading of the article, is extracted and contained in DBpedia; it is called the abstract. Links inside the articles are only extracted as an unordered bag, showing only an unspecified relation between the article that contains the link and the articles that are being linked, but not where in the text the linked article was mentioned or what its surface form was. But because the links are set by the contributors to Wikipedia themselves, they represent entities in the text that were intellectually disambiguated by URL. This property makes extracting the abstracts, including the links and their exact position in the text, an interesting opportunity to create a corpus usable for, among other cases, NER and NEL algorithm training and evaluation.
This paper describes the creation of a large-scale, multilingual Wikipedia abstract corpus that contains Wikipedia abstract texts, entity mentions occurring in the text, their position, surface form as well as the entity link. The vast majority of entity links are linking Wikipedia pages, allowing for extraction of related categories and entity types. It contributes by making a large number of Wikipedia abstracts in currently six languages available for bulk processing with NLP tools. NIF (Hellmann et al., 2013) was used as the corpus format to provide DBpedia compatibility using Linked Data as well as NLP tool interoperability. Furthermore, lists of link surface forms as well as the number of occurrences per surface form per link have been extracted as a secondary resource.

1.1. Wikipedia Abstract Properties

The abstracts represent a special form of text. The articles have an encyclopedic style (Nguyen et al., 2007), describing the topic by relating it to other entities that are often explicitly linked in the text. Following the official Wikipedia guidelines on article style and writing (Wikipedia Manual of Style, Lead section, Opening paragraph: http://en.wikipedia.org/wiki/Wikipedia:MOSBEGIN#Opening_paragraph), the first sentence usually tries to put the article in a larger context. Thus Wikipedia abstracts can be understood as an “Is-A” corpus, defining an entity in relation to a broader class of related entities or an overarching topic. This interesting property is exemplified by the fact that by clicking the first link in a Wikipedia article, one will eventually reach the article on “Philosophy” for 94.5% of all articles (data compiled by Wikipedia user “Ilmari_Karonen” from English Wikipedia articles as of 26 May 2011: http://en.wikipedia.org/wiki/User:Ilmari_Karonen/First_link).
The Wikipedia guidelines on linking (Wikipedia Manual of Style, Linking: http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking#Principles) prescribe that:

1. every concept that is needed to understand the article should be linked,

2. links should only appear in the text once, and

3. the article topic itself should not be linked in its own article.

These guidelines manifest themselves in certain properties of the corpus:

1. Most major concepts important to the topic at hand will be mentioned in the text and their article will be linked.

2. If a concept has been linked, repeat mentions of it will not be linked again.

3. The article topic itself is often mentioned in the text, but never linked.

While the first property is crucial for using the data as a training corpus, the second and third properties have been addressed in the conversion process to guarantee a high-quality corpus.

2. Related Work

Wikilinks and their surface forms are one data source used to train entity linking tools such as DBpedia Spotlight (Mendes et al., 2011) or AGDISTIS (Usbeck et al., 2014). The articles themselves serve as a source of contextual information to (Cucerzan, 2007) in their named entity disambiguation approach. They have also been used for named entity linking evaluation (Hachey et al., 2013) and wikification (Cheng and Roth, 2013). (Nothman et al., 2012) present a small subset of Wikipedia articles with entity annotations (http://schwa.org/projects/resources/wiki/Wikiner) manually checked for correctness in the CoNLL format. NIF corpora are being used for evaluation in the GERBIL entity annotator benchmark (Usbeck et al., 2015).

3. Conversion Implementation

Although Wikipedia article texts can be acquired using the official Wikipedia API, it is recommended not to use it on a large scale as a courtesy to the Wikipedia project, so it cannot be used for large-scale extractions. Wikipedia XML dumps (http://dumps.wikimedia.org/) provide an alternative, but contain the articles in Wiki markup (http://en.wikipedia.org/wiki/Help:Wiki_markup), a special syntax that is used to format Wikipedia articles and add further data. Besides text formatting, like representing a ==Heading==, it also features calls to external LUA scripts and Wikipedia templates, making it a very tedious and Wikipedia language-specific task to implement a tool to render Wikipedia articles just like Mediawiki (Wikipedia's software, http://www.mediawiki.org/) does. To our knowledge, no tool exists that implements all templates and LUA scripts used in Wiki markup, making Mediawiki the only tool available to produce high-quality text from Wiki markup. Therefore a local Mediawiki instance, acting as a mirror of the Wikipedia, was installed and configured as the first part of the extraction pipeline, tasked with rendering the Wiki markup to plain text.
Figure 1 shows the data flow of the pipeline used. Central to the extraction was the DBpedia extraction framework (https://github.com/dbpedia/extraction-framework), a framework to convert Wikipedia data to LOD.

[Figure 1: Data flow diagram showing the data conversion process]

For the extraction of abstracts, the framework uses Mediawiki's API. For every page to extract data from, an HTTP request is made to the locally hosted API with the parameters action=parse and section=0 to obtain the complete HTML source of the first section of each article (example: https://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&page=Leipzig&format=). Running the framework using this extractor and the local Mediawiki mirror creates one file in N-Triples format per language, containing a triple of the following form for each article:

    <$DBpediaUri> <...> "$htmlAbstract" .

These triples present an intermediary data format that only serves for further extraction of the HTML abstracts and the respective DBpedia URIs. The HTML abstract texts were then split into relevant paragraph elements, which were converted to valid XHTML and parsed with a SAX-style parser. All child elements of the relevant paragraphs were traversed by the parser. If a #text element was encountered, its content was appended to the abstract string. If an a element with attribute href was encountered, a Link object was created, containing the start and end offsets of the #text child element of the a element, as well as the surface form and the linked URL. Thus, links are anchored to the text position they are found at. Other elements were skipped to exclude tables and spans containing various other information not representable as plain text.
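To make these two steps more concrete, the following Python sketch approximates them: it fetches the section-0 HTML of a page via the MediaWiki parse API and traverses the markup SAX-style to collect the plain text together with link offsets. This is a simplified illustration, not the DBpedia extraction framework itself; the public en.wikipedia.org endpoint and the function names are assumptions, and the skipping of tables and styled spans described above is omitted.

# Illustration only: fetch the first section of an article via the MediaWiki
# "parse" API and recover plain text plus link offsets. A production setup,
# as in the paper, would point API at a local Mediawiki mirror instead.
import json
import urllib.parse
import urllib.request
from html.parser import HTMLParser

API = "https://en.wikipedia.org/w/api.php"  # assumption: public endpoint for the example

def fetch_first_section_html(page):
    """Return the HTML of section 0 (the abstract) of a Wikipedia page."""
    params = urllib.parse.urlencode({
        "action": "parse", "page": page, "section": 0,
        "prop": "text", "format": "json",
    })
    with urllib.request.urlopen(f"{API}?{params}") as resp:
        data = json.load(resp)
    return data["parse"]["text"]["*"]

class AbstractParser(HTMLParser):
    """SAX-style traversal: accumulate #text content and anchor <a href>
    elements to their start/end offsets in the accumulated plain text."""
    def __init__(self):
        super().__init__()
        self.text = []          # collected text fragments
        self.links = []         # (surface_form, begin, end, href) tuples
        self._open_href = None  # href of the currently open <a>, if any
        self._link_start = 0

    def _offset(self):
        return sum(len(t) for t in self.text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open_href = dict(attrs).get("href")
            self._link_start = self._offset()

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            end = self._offset()
            surface = "".join(self.text)[self._link_start:end]
            self.links.append((surface, self._link_start, end, self._open_href))
            self._open_href = None

    def handle_data(self, data):
        self.text.append(data)

if __name__ == "__main__":
    parser = AbstractParser()
    parser.feed(fetch_first_section_html("Leipzig"))
    abstract = "".join(parser.text)
    print(abstract[:80])
    print(parser.links[:3])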

Special consideration was given to Mediawiki-specific elements used for visual styling. For example, kbd elements were not skipped, because they are used to style the text they contain as if it was a keyboard letter (for example the mentions of keyboard keys on https://en.wikipedia.org/wiki/Delete_key). Thus they contain important content easily representable as text. After the paragraphs were completely processed, the resulting strings were concatenated into the abstract string and the Link objects created were aggregated. Using simple string processing, the resulting data was written to disk in the NIF 2.1 format before consuming the next line, producing the abstract corpus.
Using the finished corpus, surface forms of all links, including their total number of occurrences in the corpus, were extracted as a secondary resource. For example, {Berlin; http://dbpedia.org/resource/Berlin; 9338} indicates 9,338 occurrences of the entity "Berlin" and includes the link to the corresponding DBpedia resource. We provide these surface forms for each language as complementary datasets (http://downloads.dbpedia.org/current/ext/nlp/abstracts/surfaceforms/).
Finally, the corpus was enriched using these surface forms to account for the Wikipedia linking guidelines described in Section 1.1. Because the topic itself is never linked, each abstract was enriched by linking occurrences of surface forms of the article entity in the abstract texts. For example, in Listing 1, the first occurrence of "Pizza" is linked to its respective DBpedia article, being the article's topic. Because an entity is linked only once in the text, subsequent mentions are not linked. We therefore annotate all surface forms of an entity linked in the abstract, so subsequent mentions of "bread" would not be linked in the article, but are linked in our corpus. Should a subsequent mention already be linked, that is, it is found in a text span already covered by another link, no new link is established. This accounts for spans that contain the surface form but link more specific articles. In the scope of our example, the occurrence of "pizza" in a phrase "pizza stone" that links another article would not be additionally linked. We reason that, where a link is established by Wikipedia editors, it will be more likely correct than our automatic enrichment efforts.
To enable users to judge the correctness of annotations better and differentiate between human and machine annotators, links established by enrichment have been differentiated from links made by Wikipedia editors using the prov:wasAttributedTo property: original Wikipedia links and our enriched links are attributed to different agents.

4. NIF Format

A nif:Context resource was established for each article, containing the article abstract in the nif:isString property, as well as the string offsets denoting its length and the URL of the source Wikipedia page. For each link, another resource was created, containing the surface form in the nif:anchorOf property, its position in the string of the referenced nif:Context resource, as well as the URL of the linked resource via the property itsrdf:taIdentRef. Thus the identity of the link's surface form and the linked resource is made explicit, leading to the disambiguation of the link text by URL. An example resource can be found in Listing 1.

<...> a nif:String , nif:Context ;
    nif:isString """Pizza ([ˈpittsa]) is an oven-baked flat bread typically topped with a tomato sauce, cheese and various toppings."""^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "112"^^xsd:nonNegativeInteger ;
    nif:sourceUrl <http://en.wikipedia.org/wiki/Pizza> .

<...> a nif:String , nif:RFC5147String ;
    nif:referenceContext <...> ;
    nif:anchorOf """Pizza"""^^xsd:string ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex "5"^^xsd:nonNegativeInteger ;
    prov:wasAttributedTo <...> ;
    a nif:Word ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Pizza> .

<...> a nif:String , nif:RFC5147String ;
    nif:referenceContext <...> ;
    nif:anchorOf """bread"""^^xsd:string ;
    nif:beginIndex "40"^^xsd:nonNegativeInteger ;
    nif:endIndex "45"^^xsd:nonNegativeInteger ;
    prov:wasAttributedTo <...> ;
    a nif:Word ;
    itsrdf:taIdentRef <...> .

Listing 1: Example resource of the abstract corpus
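To illustrate how such NIF resources can be consumed, the following is a minimal Python sketch using rdflib; it is not part of the published tooling, the file name is a placeholder, and the namespace IRIs are the standard NIF Core and ITS RDF vocabularies.

# Consumption sketch for the NIF abstract corpus, assuming rdflib and a local
# Turtle file; recovers (abstract text, mention annotations) pairs.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.parse("leipzig_en.ttl", format="turtle")  # placeholder file name

# Every abstract is a nif:Context carrying the full text in nif:isString.
for context in g.subjects(RDF.type, NIF.Context):
    text = str(g.value(context, NIF.isString))
    mentions = []
    # Every link annotation points back to its abstract via nif:referenceContext.
    for mention in g.subjects(NIF.referenceContext, context):
        begin = int(g.value(mention, NIF.beginIndex))
        end = int(g.value(mention, NIF.endIndex))
        surface = str(g.value(mention, NIF.anchorOf))
        entity = g.value(mention, ITSRDF.taIdentRef)      # linked DBpedia resource
        agent = g.value(mention, PROV.wasAttributedTo)    # Wikipedia editor vs. enrichment
        mentions.append((surface, begin, end, entity, agent))
    print(text[:60], len(mentions))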
5. Validation

Due to the size of the data and the diversity of the HTML that had to be processed, a special effort was made to validate the data. Validation was 3-fold:

1. The Raptor RDF syntax parsing and serializing utility 2.0.13 (raptor2-utils, http://librdf.org/raptor/) was used to make sure the Turtle files are syntactically correct RDF.

2. The GNU tools iconv (http://www.gnu.org/savannah-checkouts/gnu/libiconv/documentation/libiconv-1.13/iconv.1.html) and wc (https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html) were used to make sure the files only contain valid unicode codepoints. Testing was performed by dropping wrongly encoded characters and comparing the number of characters afterwards to the character counts of the original files. This was necessary because the corpus contains citations from various languages and IPA sequences denoting phonology, which also heavily rely on unicode for presentation.
Iconv is typically used to convert text files from one format to another. However, starting it with the -c parameter makes it discard inconvertible (thus wrongly encoded) characters. Doing a character count before and after the iconv execution and comparing the numbers shows whether characters were dropped. This procedure was again run in a script on all files of the corpus, considering the unicode characters valid after this step.

3. RDFUnit (http://aksw.org/Projects/RDFUnit.html) was used to ensure adherence to the NIF standard. (Kontokostas et al., 2014) provides a comprehensive description of the procedure. RDFUnit is a Java framework for test-driven evaluation of Linked Data. It uses SPARQL test queries generated from constraints given by an ontology as well as manually defined test cases. Using the framework with the ontologically defined constraints given by NIF, the corpus was validated for the correctness of property domains and ranges, as well as datatypes.

In addition, manual test cases as described in (Kontokostas et al., 2014) were used to ensure that the nif:anchorOf string of each nif:String resource (i.e. the link annotations) equals the part of the respective abstract text indicated by the begin and end offsets of said string. Thus, the correctness of link positions and surface forms has been tested and verified for each link.
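The anchor/offset consistency that these manual test cases verify can be illustrated with a few lines of Python, under the same assumptions as the previous snippet; this is only a sketch, not the authors' RDFUnit setup: every nif:anchorOf must equal the substring of its nif:Context delimited by nif:beginIndex and nif:endIndex.

# Sketch of the anchor/offset consistency check described above.
from rdflib import Graph, Namespace

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")

def check_offsets(g):
    """Yield annotations whose anchorOf does not equal the indicated substring."""
    for mention in g.subjects(NIF.referenceContext, None):
        context = g.value(mention, NIF.referenceContext)
        text = str(g.value(context, NIF.isString))
        begin = int(g.value(mention, NIF.beginIndex))
        end = int(g.value(mention, NIF.endIndex))
        anchor = str(g.value(mention, NIF.anchorOf))
        if text[begin:end] != anchor:
            yield mention, anchor, text[begin:end]

g = Graph()
g.parse("leipzig_en.ttl", format="turtle")  # placeholder file name
violations = list(check_offsets(g))
print(f"{len(violations)} offset mismatches found")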


Language    Abstracts     Entity links    Triples
Dutch       1,740,494     11,344,612      114,284,973
English     4,415,993     39,650,948      387,953,239
French      1,476,876     11,763,080      116,205,859
German      1,556,343     15,859,142      153,626,686
Italian       907,329      7,705,247       75,698,533
Spanish     1,038,639     11,558,121      111,293,569
All        11,135,674     97,881,150      959,062,859

Table 1: Corpus statistics

6. Access, Persistence, Maintenance

The corpus is available on the DBpedia download server at http://downloads.dbpedia.org/current/ext/nlp/abstracts/. Example files are available for all languages (for example English: http://downloads.dbpedia.org/current/ext/nlp/abstracts/en/leipzig_en.ttl). There is currently no version, because we don't believe this contributes to its intended purpose for large-scale processing. The surface forms are available in separate TSV files for each language (http://downloads.dbpedia.org/current/ext/nlp/abstracts/surfaceforms/). The DBpedia Project guarantees the accessibility and persistence of the presented resources. The corpus generation and development is integrated into the DBpedia release cycle and will be maintained accordingly.

7. Use Cases

7.1. Training NER Systems

Named entity recognition systems perform entity spotting, entity linking and entity classification. For each individual task, appropriate training data is required. For training entity spotting, texts with annotated mentions are needed. For entity linking, surface forms and their corresponding links are needed to generate a list of candidates for linking. Finally, for entity classification, the classes of each entity have to be known. The DBpedia abstracts provide all required information for training these tasks. For training entity spotting, the dataset provides the underlying textual content with a list of entity mentions and their exact location in the text. The entity linking task can be supported with the surface forms dataset compiled from the DBpedia abstracts. The entity types required for training classification can then be easily retrieved from DBpedia and the DBpedia Ontology.

7.2. Large Scale Multilingual Gazetteer

Maintenance and validation of gazetteers, as well as keeping them up-to-date, is an expensive and time-consuming task. Moreover, in gazetteers listing organizations, musicians, events, etc., new and previously unknown entities pop up continuously. Wikipedia editors constantly provide new entities, validate them and maintain the existing list of entities. The DBpedia abstracts can be used to generate up-to-date gazetteer lists with the latest entity names, which are validated and maintained by humans. Even more, gazetteers can be generated for different languages, complemented with various inflected forms of their names.
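As a sketch of the gazetteer use case, the per-mention annotations can be aggregated into a mapping from DBpedia resource to the surface forms observed for it. This is an illustration only, not a released tool; the file name is a placeholder, and the published surface-form TSV files could serve the same purpose without re-parsing the corpus.

# Sketch: build a gazetteer (DBpedia resource -> observed surface forms) from
# the annotated abstracts of one language.
from collections import defaultdict
from rdflib import Graph, Namespace

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

def build_gazetteer(ttl_path):
    g = Graph()
    g.parse(ttl_path, format="turtle")
    gazetteer = defaultdict(set)
    # Each annotation carries its surface form (nif:anchorOf) and the linked
    # DBpedia resource (itsrdf:taIdentRef).
    for mention, entity in g.subject_objects(ITSRDF.taIdentRef):
        surface = g.value(mention, NIF.anchorOf)
        if surface is not None:
            gazetteer[str(entity)].add(str(surface))
    return gazetteer

gaz = build_gazetteer("leipzig_en.ttl")  # placeholder file name
print(len(gaz), "entities in the gazetteer")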
8. Conclusion and Future Work

We presented a large, multilingual corpus generated from enriched Wikipedia data that makes annotated Wikipedia abstracts in six languages available in bulk for NLP processing. Including intellectually annotated entity mentions and enriched with mentions generated from extracted link surface forms, this corpus presents an improvement upon the existing DBpedia abstract texts. We are currently using both the original and the enriched corpus for training and comparative evaluation of Named Entity Linking tools. We are also including further languages for subsequent releases, with the goal of providing the corpus in all 125 DBpedia languages. Immediate language candidates for inclusion include Czech, Japanese, Polish and Portuguese. Finally, we will include annotations for basic NLP tasks, such as sentence detection and tokenization, to further ease NLP processing.

9. Acknowledgement

This work is supported by grants from the EU's 7th Framework Programme provided for the project LIDER (GA no. 610782), the H2020 Programme provided for the project FREME (GA no. 644771) and the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the SmartDataWeb Project (Förderkennzeichen 1MD15010A).

10. Bibliographical References

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2008). DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer.
Cheng, X. and Roth, D. (2013). Relational inference for wikification. In EMNLP, pages 1787–1796.
Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data. In EMNLP-CoNLL, pages 708–716.
Hachey, B., Radford, W., Nothman, J., Honnibal, M., and Curran, J. R. (2013). Evaluating entity linking with Wikipedia. Artificial Intelligence, 194:130–150.
Hellmann, S., Lehmann, J., Auer, S., and Brümmer, M. (2013). Integrating NLP using linked data. In 12th International Semantic Web Conference (ISWC 2013), 21-25 October 2013, Sydney, Australia.
Kontokostas, D., Brümmer, M., Hellmann, S., Lehmann, J., and Ioannidis, L. (2014). NLP data cleansing based on linguistic ontology constraints. In Proc. of the Extended Semantic Web Conference 2014.
Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM.
Nguyen, D. P., Matsuo, Y., Hogan, A., and Ishizuka, M. (2007). Relation extraction from Wikipedia using subtree mining. In Robert C. Holte et al., editors, Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, pages 1414–1420. The AAAI Press.
Nothman, J., Ringland, N., Radford, W., Murphy, T., and Curran, J. R. (2012). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.
Usbeck, R., Ngonga Ngomo, A.-C., Röder, M., Gerber, D., Coelho, S., Auer, S., and Both, A. (2014). AGDISTIS – Graph-based disambiguation of named entities using linked data. In The Semantic Web – ISWC 2014.
Usbeck, R., Röder, M., Ngonga Ngomo, A.-C., Baron, C., Both, A., Brümmer, M., Ceccarelli, D., Cornolti, M., Cherix, D., Eickmann, B., Ferragina, P., Lemke, C., Moro, A., Navigli, R., Piccinno, F., Rizzo, G., Sack, H., Speck, R., Troncy, R., Waitelonis, J., and Wesemann, L. (2015). GERBIL – General entity annotation benchmark framework. In 24th WWW Conference.
