
A Tree Extension for CoNLL-RDF

Christian Chiarcos, Luis Glaser
Applied Computational Linguistics Lab (ACoLi), Goethe University Frankfurt, Germany
{chiarcos, lglaser}@informatik.uni-frankfurt.de
∗ The authors contributed equally to the submission.

Abstract
The technological bridges between knowledge graphs and natural language processing are of utmost importance for the future development of language technology. CoNLL-RDF is a technology that provides such a bridge for popular one-word-per-line formats with tab-separated values (TSV) that are widely used in NLP (e.g., the CoNLL Shared Tasks), annotation (Universal Dependencies, UniMorph), corpus linguistics (Corpus WorkBench, CWB) and digital lexicography (SketchEngine): Every empty-line-separated table (usually a sentence) is parsed into a graph, can be freely manipulated and enriched using standard graph technology, and can then be serialized back into a TSV format, RDF or other formats. An important limitation is that CoNLL-RDF provides native support for word-level annotations only. This does include dependency syntax and semantic role annotations, but neither phrase structures nor text structure. We describe the extension of the CoNLL-RDF technology stack for two vocabulary extensions of CoNLL-TSV: the bracket notation used in earlier CoNLL Shared Tasks and the extension with XML markup elements featured by CWB and SketchEngine. In order to represent the necessary extensions of the CoNLL vocabulary in an adequate fashion, we employ the POWLA vocabulary for representing and navigating tree structures.

Keywords: linguistic annotation, data formats, interoperability, tab-separated values, RDF, graphs, trees, CoNLL-RDF, POWLA

1. Introduction & Motivation

Since their original conception in the early 1990s, one-word-per-line formats, mostly based on tab-separated values (TSV), have become enormously influential in corpus linguistics (Evert and Hardie, 2011, Corpus Workbench/CWB),1 digital lexicography (Kilgarriff et al., 2014, SketchEngine),2 natural language processing (Schmid, 1994, TreeTagger), and as exchange format of various annotation projects (Universal Dependencies (Nivre et al., 2016), PropBank, UniMorph).3 The main reasons for their continuing popularity are the apparent simplicity of tables and tab-separated values, and their expressive power that allows for the flexible and seamless extension of annotations with additional columns and novel annotations, making these formats capable of expressing and combining any kind of word-level annotation in a compact, human-readable, and machine-processable format.

Yet another factor contributing to the popularity of TSV formats in natural language processing is their application as exchange and evaluation format(s) of the Shared Tasks of the SIGNLL Conference on Computational Natural Language Learning (CoNLL),4 conducted annually since 1999. With CoNLL data sets as an important basis for future evaluation of most NLP tasks, their formats are supported by most existing NLP tools, and various CoNLL-TSV formats ('dialects') have seen a steady increase in popularity for linguistic data in NLP applications since. Even when created independently from CoNLL Shared Tasks, novel TSV formats following CoNLL conventions are now usually referred to as 'CoNLL-TSV', e.g., the CoNLL-U format of the Universal Dependencies. Here, we follow this convention in describing all one-word-per-line TSV formats as CoNLL-TSV.

1 http://cwb.sourceforge.net/
2 http://www.sketchengine.eu
3 http://universaldependencies.org/, http://propbank.github.io/, http://unimorph.github.io/
4 http://www.signll.org/conll

Chiarcos and Fäth (2017) introduced the CoNLL-RDF library, which makes it possible to transform any CoNLL-TSV format into an isomorphic graph representation in RDF, to flexibly transform it with SPARQL Update, to enrich it with external ontologies and knowledge bases, and to serialize it in a canonical format, in any RDF serialization (RDF/XML, Turtle, JSON-LD), as a graphical representation (using GraphViz/Dot), or back into a CoNLL-TSV representation. This transformation is generic and applicable to any TSV format, as it uses user-provided column labels to produce properties (labeled edges) that hold the various annotations. The only data structures it enforces are word (row) and sentence (empty-line separated group of rows), as well as conll:HEAD, a property that links a word with its parent (in dependency annotation), resp., the (virtual root of the) sentence. Beyond that, CoNLL-RDF is semantically shallow and its annotation properties are not backed by an ontology. However, using SPARQL Update, CoNLL-RDF data can be flexibly transformed into a proper Linked Data representation and linked with external ontologies.

The canonical CoNLL-RDF format is a fragment of RDF/Turtle that emulates the original TSV formats by putting one word (with its annotations) in one line, and separating sentences with an empty line. However, whereas CoNLL-TSV is a tabular format, the canonical CoNLL-RDF format uses explicit property names (comparable to a Python dictionary or Java hashtable) as its data structure. Aside from providing RDF output, we also provide transformation back into CoNLL for further processing of its annotations with conventional NLP pipelines and libraries.

Despite its popularity and wide use, CoNLL-TSV is limited in its expressivity: It imposes constraints on the type of linguistic annotation it captures.

Indeed, CoNLL-TSV is ideally suited for word-level annotations such as part-of-speech tags or syntactic dependencies, and it can be extended to the annotation of non-recursive spans (using the IOBES scheme) and semantic roles (by adding one argument column per predicate), but the format provides no native support for phrase-level annotations, text structure or annotations beyond single sentences. CoNLL-RDF does not impose such restrictions, but allows arbitrary graph data structures to be created and processed, for example, for parsing topological fields in Middle High German (Chiarcos et al., 2018), or for the creation of syntactic-semantic annotations in the context of Role and Reference Grammar (Chiarcos and Fäth, 2019). However, as its vocabulary is grounded in CoNLL-TSV, CoNLL-RDF does not specify datatypes for trees and directed graphs not grounded in individual words, nor does the CoNLL-RDF package parse or serialize such annotations.

In this paper we introduce the extensions we made to the CoNLL-RDF library to natively support tree-like annotations. Using the POWLA vocabulary (Chiarcos, 2012c), we model an annotation graph independent of the sentence and word structure imposed by CoNLL. In this way, we can even represent generic linguistic annotations that go beyond token-level annotation, e.g., for usage in spoken discourse.

We will first revisit the original concept and functionality of the CoNLL-RDF libraries. Second, this paper will describe a number of one-word-per-line TSV dialects with extensions for tree structures. We show that CoNLL-RDF was only partially able to represent them, and incapable of processing them effectively. We then explain how we complement the CoNLL-RDF vocabulary with generic linguistic data structures from the POWLA vocabulary (Chiarcos, 2012c), and we describe the extensions of the CoNLL-RDF libraries to parse tree annotations, and to process and produce such data structures. Afterwards we present the extensions we made to CoNLL-RDF that enable the tree extension, and we evaluate them.

2.1. CoNLL-TSV

In CoNLL-TSV, every word is represented by one row, with its annotations in tab-separated columns; sentences are delimited with a new line. There are different CoNLL dialects that each represent different information in these columns.

#ID  WORD       LEMMA     UPOS   POS  FEATS         HEAD  EDGE
1    James      James     PROPN  NNP  Number=Sing   2     name
2    Baker      Baker     PROPN  NNP  Number=Sing   3     nsubj
3    told       tell      VERB   VBD  Mood=Ind|...  0     root
4    reporters  reporter  NOUN   NNS  Number=Plur   3     obj
5    Friday     Friday    PROPN  NNP  Number=Sing   3     nmod:tmod
6    :          :         PUNCT  :    _             3     punct

Figure 1: Example of CoNLL-U annotation (DEPS and MISC columns omitted)

Fig. 1 depicts an example representation of such a data format, the CoNLL-U format used in the context of the Universal Dependencies initiative, with the last two columns DEPS and MISC omitted. The column names are given in the first comment line, which is led by a hash mark. The ID column contains a sentence-wide identifier, WORD represents the form, LEMMA the lemma, UPOS the POS tag in the UD tagset, the POS column contains a language-specific POS tag, and FEATS contains morphosyntactic features encoded as key-value pairs with a pipe ('|') as delimiter. The HEAD column points to the ID of the parent in the dependency tree, with 0 representing the root. Finally, the EDGE column carries the label of the dependency relation that holds between a word and its head, resp., the (virtual root of the) sentence.

CoNLL-U only represents one specific CoNLL dialect, but it is supported by most multilingual dependency parsers, as these are usually trained on UD data. Fig. 1 shows a slightly simplified example from the OntoNotes (Hovy et al., 2006) wsj-0655 file, with annotations subsequently provided by the authors. Other dialects omit, reorder or complement these columns with additional content. All of them share the general advantages of this format: They enjoy widespread integration in existing NLP technology, they are easy to read, and they can be transformed and processed without a lot of overhead.

2.2. From TSV to Annotation Graphs

Being easily processable, interpretable and extensible, tabular data structures in TSV formats are popular and effective for word-level annotations. They are also capable of representing other annotations, e.g., dependency relations between words by means of coindexing (between HEAD and ID columns in Fig. 1). But while TSV formats can be trivially parsed into tables (two-dimensional arrays), such tables are less than ideal for querying or manipulating graph structures. With the following listing, we show how to retrieve the lemmas of all subject arguments of the root nodes of a sentence in Java, using a two-dimensional array:

String[][] sentence; // 2-dimensional array
// find root
for(int i = 0; i < sentence.length; i++)
    if(sentence[i][7].equals("root"))                      // EDGE column, cf. Fig. 1
        // find the subjects of the root and print their lemmas
        for(int j = 0; j < sentence.length; j++)
            if(sentence[j][7].equals("nsubj")
                && sentence[j][6].equals(sentence[i][0]))  // HEAD points to the root's ID
                System.out.println(sentence[j][2]);        // LEMMA column
// (the listing is truncated in this copy; the loop body above is reconstructed
//  from the surrounding description, with column indices taken from Fig. 1)

The example nicely illustrates that path traversal within an array requires loops and variable bindings for every edge in the graph. In particular, it is hardly possible to define graph patterns of variable length. Graphs provide the higher-order data structures necessary for this purpose; with RDF and SPARQL, they can be represented, queried and manipulated in a platform-independent, W3C-standardized manner. Another advantage of using RDF for representing graphs is that it builds on Uniform Resource Identifiers (URIs) to represent nodes, relations and types. With a URI, every node requires the developer to provide a globally unambiguous identity, whereas CoNLL-U IDs are only unique within a sentence. As a result, RDF trivially permits creating cross-references between different sentences, whereas CoNLL-U does not provide a formalism for this purpose, so that ad hoc solutions need to be improvised. CoNLL-RDF was designed to provide a seamless round-tripping between arbitrary CoNLL-TSV formats and RDF graphs.

2.3. CoNLL-RDF

Before we proceed with the extension of CoNLL-RDF, we describe its current status. CoNLL-RDF (Chiarcos and Fäth, 2017) is available as open source from our GitHub repository.7 CoNLL-RDF provides vocabulary conventions for representing CoNLL and other TSV formats in RDF, components for consuming and producing such data, a component for the efficient manipulation of CoNLL data by means of SPARQL updates applied in parallel to individual sentence graphs, and various scripts that demonstrate the potential of this functionality.

:s1_1 rdf:type nif:Word;
      conll:ID "1";
      conll:WORD "James";
      conll:LEMMA "James";
      conll:UPOS "PROPN";
      conll:POS "NNP";
      conll:FEATS "Number=Sing";
      conll:HEAD :s1_2;
      conll:EDGE "name";
      nif:nextWord :s1_2 .

Figure 2: CoNLL-RDF transformation of the first row in fig. 1

The conversion with CoNLL-RDF returns a canonical serialization in Turtle/RDF of the CoNLL data, using user-specified column names in the conll namespace.6 Thus, the new representation can be further enhanced with SPARQL update queries using the CoNLLRDFUpdater. Because we emphasize interoperability with existing NLP and LOD technology, we also provide the CoNLLRDFFormatter that can recreate CoNLL from the CoNLL-RDF after updates have been performed. Note that the last step may be lossy, as newly introduced annotations may not be representable within the restrictions of CoNLL.

2.3.1. The CoNLL-RDF Format

The RDF data model is grounded on the notion of triples, consisting of a subject (source node), a property (relation type) and an object (target node or literal value), each represented by a URI (say, conll:WORD in Fig. 2) or, in the case of objects, a literal value.

2.3.2. Querying CoNLL-RDF with SPARQL

SPARQL (Prud'Hommeaux and Seaborne, 2008) combines SQL-like query keywords (SELECT, WHERE, FILTER, etc.) and Turtle (representation of graph patterns, extended with variables). In addition, SPARQL 1.1 (Harris and Seaborne, 2013) introduced property paths, i.e., the possibility to express chains of triples (e.g., ?x conll:HEAD/conll:HEAD ?z as a shorthand for ?x conll:HEAD ?y. ?y conll:HEAD ?z), and operators over these, e.g., the Kleene star (?x conll:HEAD* ?y).

The lemma(s) of the subject(s) of root nodes in CoNLL-U can thus be addressed in a SPARQL WHERE block:

?root conll:EDGE "root".
?nsubj conll:EDGE "nsubj".
?nsubj conll:HEAD ?root.
?nsubj conll:LEMMA ?lemma.

2.3.3. Manipulating CoNLL-RDF

The CoNLL-RDF library provides the CoNLLRDFUpdater, which allows a sequence of SPARQL Update scripts to be applied to, and iterated over, the CoNLL data it consumes. This can be used, for example, to simplify navigation in a parse tree. The following SPARQL Update script combines information from the EDGE and HEAD columns to facilitate navigation in CoNLL-U dependency trees:

INSERT { ?x ?rel ?y. }
WHERE {
  ?x conll:HEAD ?y; conll:EDGE ?e.
  BIND(URI(CONCAT('https://universaldependencies.org/u/dep/', ?e)) AS ?rel)
}

As a result, we can now effectively traverse UD dependencies:10

PREFIX dep: <https://universaldependencies.org/u/dep/>
SELECT ?lemma
WHERE {
  ?nsubj dep:nsubj/dep:root ?sentence;
         conll:LEMMA ?lemma }

More complex graph patterns can be expressed using additional SPARQL property path operators such as + and * (iteration), | (disjunction), ^ (inversion), resp., their corresponding grouping in parentheses.

As mentioned above, CoNLL-RDF has been used for various annotation and feature extraction tasks. It does, however, inherit some of the limitations of the CoNLL-TSV data structures it is based on: CoNLL-TSV can neither adequately represent units of annotation that are smaller than words (e.g., in morphology), nor is it capable of representing nested structures (as necessary for text structure or phrase structure grammar). Each of these can be represented in RDF, but they require extending the CoNLL-RDF vocabulary beyond the elements defined by CoNLL, and these vocabulary elements need to be supported by the CoNLL-RDF package as input and output elements.

In the following section, we describe two extensions of the original one-word-per-line TSV formats introduced for representing tree annotations, and use these to extend the CoNLL-RDF infrastructure accordingly.

3. Tree Structures in TSV Formats

Conversion from CoNLL to CoNLL-RDF works seamlessly for word-level annotations, dependency syntax and semantic role annotations. However, one-word-per-line TSV formats provide only rudimentary support for annotations on the level of text (markup) and syntax (phrase structures), and CoNLL-RDF inherits this limitation.

We present two suggested extensions of one-word-per-line TSV formats as the basis of the extension of CoNLL-RDF with data structures for trees: the bracketing notation in accordance with the Penn Treebank and the XML markup used by SketchEngine or CorpusWorkbench.

3.1. PTB Bracketing Notation

The Penn Treebank (Marcus et al., 1993, PTB) featured a formalism for phrase structure grammar that uses opening and closing brackets to indicate the beginning and end of a phrase. They enclose first the phrase annotation, followed by the primary data. For representing such data, the CoNLL-2005 format added another column with parse information split into word-level pieces, with the primary data (= content of the WORD column) replaced by the placeholder '*'. For each unit of primary data, the bracketing and annotations containing the *-replacement are presented in a separate column.

In CoNLL-RDF, this information is simply preserved as an opaque string, e.g., in the triple :s1_1 conll:PARSE_PTB "(TOP(S(NP-SBJ*" for the first row of Fig. 4. With the CoNLLRDFUpdater, it was already possible to recover the actual tree structure from such strings using SPARQL Update scripts, and likewise, to create them from an internal data structure that represents the parse tree. This is, however, too complex and time-consuming to be practical. In particular, it requires recursive SPARQL Updates in order to find and resolve (resp., insert) matching parentheses. As an alternative, we provide native support for tree structures in CoNLL-RDF.

3.2. SGML/XML Markup Extensions

For representing markup information and syntactic chunks, the CorpusWorkbench (Evert and Hardie, 2011) and the SketchEngine (Kilgarriff et al., 2014) provide another extension of one-word-per-line TSV formats, the enrichment with XML (SGML) markup elements: In order to represent trees, word-level annotations can be interrupted by lines that hold one markup element each. For example, several lines of word-level annotations can be grouped together into a phrase by enclosing them between single lines that hold the corresponding opening, resp., closing markup element. XML markup may also include document-wide information, for which the CoNLL-TSV formalism has no concept. See Fig. 6 for example data from SketchEngine.

In the original CoNLL-RDF pipeline, this markup information was largely ignored, except that certain markup elements were recognized as sentence separators (like an empty line in CoNLL-TSV).

10 Note that we also improved annotation transparency: The URIs of the newly created properties are links to the actual documentation about the corresponding label, so this information is now bundled with the annotation graph.

James      NNP  (TOP (S (NP-SBJ *
Baker      NNP  *)
told       VBD  (VP *
reporters  NNS  (NP *)
Friday     NNP  (NP-TMP *)
:          :    *

Figure 4: Example of CoNLL bracketing notation
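
To make the notation concrete: concatenating the bracket column of Fig. 4 row by row and substituting each '*' placeholder with the word of its row (with the closing brackets of Fig. 5 added, since Fig. 4 shows only a partial sentence) yields the corresponding single-line PTB tree:

(TOP (S (NP-SBJ James Baker) (VP told (NP reporters) (NP-TMP Friday) :)))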

Figure 3: POWLA data model

James      NNP  (TOP (S (NP-SBJ *
Baker      NNP  *)
told       VBD  (VP *
reporters  NNS  (NP *)
Friday     NNP  (NP-TMP *)
:          :    *)))

Figure 5: CoNLL with bracketing notation after round-tripping through CoNLL-RDF, identical with Fig. 4 except for whitespaces

4. Extending the CoNLL-RDF Vocabulary with POWLA Data Types

CoNLL-TSV data structures provide no base vocabulary for representing trees. As we aim to use established vocabularies rather than to invent new ones, we adopt the POWLA vocabulary (Chiarcos, 2012b), as this provides an OWL2/DL implementation of the Linguistic Annotation Framework (Ide and Romary, 2004, LAF). In this approach, we follow the suggestion of Cimiano et al. (2020, p. 103-111) to complement NIF with POWLA data structures, and extend it to CoNLL-RDF.

4.1. The POWLA Vocabulary

POWLA (Chiarcos, 2012c) is an OWL2/DL vocabulary for systematizing linguistic annotations. Compared to other formalisms, POWLA does not make any assumptions on what the atomic structures look like. POWLA is an OWL2/DL serialization of PAULA (Dipper and Götze, 2005; Dipper and Götze, 2006),11 an early implementation of LAF (Ide and Romary, 2004) serialized in a standoff XML format (Stede et al., 2006; Zeldes et al., 2009). POWLA has been applied to modeling multi-layer corpora, including case studies on the Manually Annotated Sub-Corpus of the American National Corpus (Ide et al., 2008, MASC), syntactic and coreference annotated corpora (Chiarcos, 2012a), high-precision information extraction (de Araujo et al., 2017), annotation engineering (Chiarcos and Fäth, 2019), and syntactic parsing (Chiarcos et al., 2018).

POWLA aims to formalize linguistic annotations by building on existing standards with respect to their anchoring in the original document. PAULA XML uses XLink/XPointer references for this purpose. POWLA's design goal is to complement existing NIF, Web Annotations or application-specific RDF renderings of linguistic annotations with interoperable linguistic data structures. It is thus underspecified with respect to the anchor modelling.

4.2. Key Concepts of POWLA

The overall structure of the POWLA vocabulary is summarized in Fig. 3. POWLA defines a key concept in the form of powla:Node. Such a node expresses a single linguistic annotation. These nodes can be related to each other using two property pairs. The property powla:hasParent (powla:hasChild is its inverse) can be used to express hierarchical structures in the annotation, e.g., when representing syntax trees. The property powla:next (powla:previous is its inverse) can be used to put powla:Nodes into a sequential order. We recommend to always put siblings pointing to the same node with powla:hasParent in relation with each other by using powla:next, as this makes navigating the graph easier. Also, this provides an order of the tree elements that can be independent of the underlying strings. This can be useful when annotating speech data, for example, where the canonical syntax may deviate from the actual representation.

11 Every PAULA data structure can be expressed in POWLA, but POWLA is slightly more general: It is not restricted to acyclic graphs. This is not a disadvantage, as cycle detection (and resolution) for POWLA data can be easily implemented by SPARQL, e.g., using the property path powla:hasParent+.

5. Processing Tree Annotations

5.1. Penn Treebank Bracketing Notation

We now elaborate on how this transformation works for CoNLL containing PTB-like bracketing annotation as described in section 3.1. First, we detect columns that contain bracketing annotation by searching for opening and closing brackets, because we need to treat them separately from the normal CoNLL columns. Then, we read the CoNLL-TSV sentence-wise, recognizing empty lines as sentence boundaries. For each sentence, we first separate standard CoNLL-TSV columns, treat them as described in section 2.1., and output them. With the remaining columns containing bracketing annotation we do the following: We remove the aforementioned *-placeholders in all columns that contain the bracketing annotations. We then traverse the tree given by the bracketing annotation within the POWLA structure. Similar to the conll:COLNAME structure we employ when parsing CoNLL, we introduce a powla:Node that also has a conll:COLNAME object for each annotation we find. This node will furthermore be linked to other powla:Nodes via powla:hasParent and powla:next properties to recreate the tree structure. The annotation part of the tree will be represented with an rdf:value property. Fig. 4 displays a partial sentence in such a CoNLL file with bracketing notation. Using the CoNLLBrackets2RDF class, we generate the shallow CoNLL-RDF transformation depicted in Fig. 8.

<doc n="32" id="G10">
<head type="min">
FEDERAL        JJ
CONSTITUTION   NN
<g/>
,              $,
1789           NUM
</head>
<p n="1">
"              $,
<g/>
we             PRP
the            DT
People         NN

Figure 6: Example of a SketchEngine file

# WORD          POS  XML_DATA
FEDERAL         JJ   (doc n="32" id="G10" (head type="min" *
CONSTITUTION    NN   *
                     (g *)
,               $,   *
1789            NUM  *))
"               $,   (doc n="32" id="G10" (p n="1" *
                     (g *)
we              PRP  *
the             DT   *
People          NN   *))

Figure 7: CoNLL with XML data in PTB-style annotation after roundtrip

Because one central design goal of the CoNLL-RDF package is building a bridge between NLP and LOD formats and pipelines, we also provide functionality for round-tripping to CoNLL again. Using a SPARQL update query, we again collapse the tree structure that the powla:Nodes represent into their former bracketing representation. This is achieved by using the link established between the properties in the conll: namespace and the powla:Node structure. For each conll: property, we search the graph linked via powla:hasParent for all dependent powla:Nodes, concatenate their rdf:values into the original bracketed notation, and reintroduce the *-placeholder. The result after round-tripping is depicted in Fig. 5. Also note that the brackets were incomplete in Fig. 4, because we only used a partial sentence. This is fixed by the SPARQL script, as we guarantee full trees.

5.2. SGML/XML Extensions

In analogy with section 5.1., we provide a conversion of the SGML/XML markup extensions to CoNLL-RDF. XML data that is captured will be appended to a powla:Node with type conll:XML_DATA. The node name will be given as an object to an rdf:value property. Each attribute contained in the XML node will receive a dedicated property in the xml namespace12 with the attribute value as the object. E.g., the doc XML node in the first line of Fig. 6 will result in :x1 a powla:Node, conll:XML_DATA; rdf:value "doc"; x:id "G10"; x:n "32". See Fig. 9 for the full CoNLL-RDF result. Again, as we emphasize interoperability between NLP and LOD in our design, we provide functionality for reconversion to CoNLL. Note that we can not fully reconstruct the XML annotations, as they may give information that is valid for the entire document, which is not possible in the standard CoNLL format. Instead, we can recover the tree structure similar to the PTB-like annotations as explained in section 5.1., which results in CoNLL as depicted in Fig. 7. Parsing the example SketchEngine file in Fig. 6 with the generic CoNLLStreamExtractor class yielded 41 triples, compared to 72 triples using the new XMLTSV2RDF class. With further postprocessing scripts, the original XML markup can be restored.

5.3. Limitations

SketchEngine allows non-closing brackets, which we currently do not. Therefore it must be ensured that each opening XML span is closed at some point. Furthermore, we do not validate the result in any way. To detect and resolve cycles, we provide a SPARQL script at our GitHub repository. Alternatively, it is possible to use dedicated solutions for this purpose, for example Shape Expressions (ShEx).13

6. Evaluation

We conducted three evaluation experiments:

1. Runtime for parsing CoNLL-TSV with PTB-style syntax annotations
2. Information loss for round-tripping from CoNLL-TSV via CoNLL-RDF and POWLA to CoNLL-TSV
3. Conversion from SketchEngine format via CoNLL-RDF and POWLA to CoNLL-TSV

For experiment 1, we compare an implementation that uses SPARQL Update scripts with the revised CoNLL-RDF parser. We use both to convert the WSJ section of the OntoNotes corpus14 to CoNLL-RDF, using a 3.79 GHz Intel i5-7600K with 16 GB of memory and an SSD drive. Table 1 presents a significant speed-up.

12 http://purl.org/acoli/conll-rdf/xml#
13 http://shex.io/
14 PropBank edition, available from https://github.com/propbank/propbank-release

Version  number of triples  time elapsed  time per triple
(1)      10,924,378         966,162 ms    0.088 ms
(2)      12,526,204         34,457 ms     0.0027 ms

Table 1: Timing comparisons between conversion with (1) CoNLLStreamExtractor + SPARQL updates and (2) CoNLLBrackets2RDF

For experiment 2, we parsed and regenerated the WSJ section of the OntoNotes corpus. For this purpose, we excluded the SRL annotations from the documents. We were able to perform a loss-less roundtrip from the original CoNLL-TSV to a CoNLL-RDF and POWLA representation and back to an identical representation of the data in CoNLL-TSV. Parsing of SRL arguments was already possible using the CoNLLStreamExtractor.

For experiment 3, the principal feasibility has been demonstrated in section 5.2. Because we only recover a PTB-like CoNLL annotation of the XML markup elements so far, we did not perform a round-tripping experiment as in experiment 2. Instead, we measured whether the information contained in the XML markup is adequately preserved by comparing the number of markup elements with the number of powla:Nodes retrieved from it. We transformed part of the TenTen Corpus (Jakubíček et al., 2013) into CoNLL-RDF using the SGML/XML extension. The SGML/XML annotations are embedded in metadata concerning the retrieval, which we removed. This resulted in 71,617 powla:Nodes representing the annotation graph. This is identical to the number of XML markup elements. This means that we were able to fully transport all SGML/XML annotations from the original corpus into CoNLL-RDF.

7. Related Research

A very large number of interchange formats for NLP and representation formalisms for linguistic annotations have been (and continue to be) developed in the past 10 years, some of them developed as native RDF vocabularies, usually with serializations in Turtle or JSON-LD, many others with designated interfaces to Semantic Web technology, e.g., NAF (Fokkens et al., 2014).15

Popular representatives from the first group are the NLP Interchange Format (Hellmann et al., 2013, NIF) and the LAPPS Interchange Format (Verhagen et al., 2015, LIF). Both provide basic data structures for linguistic annotations in NLP pipelines, coupled with annotation-specific extensions. NIF focuses on formalizing string annotations (and is employed here for this purpose). It does not, however, provide specifications for linguistic data structures in general. In particular, NIF 2.0 provides no data structures for empty nodes (e.g., syntactic traces or zero anaphora), nor does it permit multiple distinct, but co-extensional annotations of the same string. Although it does provide syntax-related classes such as nif:Phrase (⊑ nif:String), it is thus not an adequate vocabulary for extending CoNLL-RDF with generic data structures for representing trees.

LIF16 is a JSON-LD-based formalism inspired by NIF, but capable of expressing such information. LIF is designed for NLP pipelines and provides data structures for popular types of linguistic annotation. As far as generic data structures for linguistic annotation are concerned, these are largely equivalent to POWLA, although less compact, in that LIF relations are reified, whereas POWLA provides the powla:hasParent relation as a single statement.

Another RDF-based vocabulary that has been used for linguistic annotation is Web Annotation (Sanderson et al., 2017, WA). Web Annotation was not designed for NLP or language resources, but for expressing metadata over web objects, and it is largely applied for this purpose, especially in Bioinformatics and Digital Humanities. It does not, however, provide specifically linguistic data structures.

Overall, these formats are actively used by their respective communities, but they are less popular in the language resource community. The reason is pragmatic rather than scientific, i.e., they impose some technical overhead over working with traditional formats such as CoNLL-TSV, so, for applications not requiring Semantic Web technology (even if they could benefit from it), they tend to be ignored.

By providing a technological bridge between RDF and TSV formats, CoNLL-RDF addresses this particular gap, and to the best of our knowledge, it is the only proposal of its kind for TSV formats in language technology. Interfaces between TSV formats and RDF technology have been explored before, e.g., with the W3C recommendation CSV2RDF (Tandy et al., 2015), which allows columns in a table to be mapped to RDF properties. It however misses certain requirements for language resources, e.g., support for tables of varying size (for SRL annotations), sentence splitting with empty lines, or relations between adjacent rows (nif:nextWord). We thus chose CoNLL-RDF as the basis for our experiments rather than CSV2RDF.

The tree extension of CoNLL-RDF builds on the POWLA vocabulary. POWLA has originally been developed for modelling and querying multi-layer corpora (Chiarcos, 2012b), and subsequently applied in information extraction (de Araujo et al., 2017). Alternatives to POWLA do exist (especially the LIF format mentioned above), and we chose POWLA because it is a compact, generic vocabulary (no annotation-specific extensions, no obligatory reification of hasParent relations).

8. Conclusion & Outlook

In this paper, we present the extension to the CoNLL-RDF library. We present a novel approach to natively transform tree-like structures in two different varieties: PTB-like annotations and CoNLL embedded in XML markup as used by SketchEngine or CorpusWorkbench. To do so, we extend the previous shallow transformation from CoNLL to CoNLL-RDF with a parallel annotation graph using the POWLA vocabulary. This enables us to represent annotations on any level of complexity, even going beyond tree-like structures by representing entire annotation graphs using RDF. The extensions to the CoNLL-RDF libraries enable lossless round-tripping from CoNLL to CoNLL-RDF and back for PTB-like annotations. Round-tripping CoNLL embedded in XML may be lossy, as document-wide annotations cannot be represented in standard CoNLL. Instead, we provide conversion from XML-embedded CoNLL to CoNLL-RDF, with a possible roundtrip to a PTB-like representation in CoNLL.

15 https://github.com/newsreader/NAF
16 http://vocab.lappsgrid.org/

PREFIX nif: <...>
PREFIX conll: <...>
PREFIX x: <...>
PREFIX rdfs: <...>
PREFIX rdf: <...>
PREFIX terms: <...>
PREFIX powla: <...>
PREFIX : <...>

:s1_0 a nif:Sentence .
:s1_1 a nif:Word; conll:WORD "James"; conll:POS_PTB "NNP"; conll:HEAD :s1_0; nif:nextWord :s1_2 .
:s1_2 a nif:Word; conll:WORD "Baker"; conll:POS_PTB "NNP"; conll:HEAD :s1_0; nif:nextWord :s1_3 .
:s1_3 a nif:Word; conll:WORD "told"; conll:POS_PTB "VBD"; conll:HEAD :s1_0; nif:nextWord :s1_4 .
:s1_4 a nif:Word; conll:WORD "reporters"; conll:POS_PTB "NNS"; conll:HEAD :s1_0; nif:nextWord :s1_5 .
:s1_5 a nif:Word; conll:WORD "Friday"; conll:POS_PTB "NNP"; conll:HEAD :s1_0; nif:nextWord :s1_6 .
:s1_6 a nif:Word; conll:WORD ":"; conll:POS_PTB ":"; conll:HEAD :s1_0 .
:bPARSE_PTB_1 a powla:Node, conll:PARSE_PTB ; rdf:value "TOP" .
:bPARSE_PTB_2 a powla:Node, conll:PARSE_PTB ; powla:hasParent :bPARSE_PTB_1 ; rdf:value "S" .
:bPARSE_PTB_3 a powla:Node, conll:PARSE_PTB ; powla:hasParent :bPARSE_PTB_2 ; rdf:value "NP-SBJ" .
:s1_1 powla:hasParent :bPARSE_PTB_3 .
:s1_1 powla:next :s1_2 .
:s1_2 powla:hasParent :bPARSE_PTB_3 .
:bPARSE_PTB_3 powla:next :bPARSE_PTB_4 .
:bPARSE_PTB_4 a powla:Node, conll:PARSE_PTB ; powla:hasParent :bPARSE_PTB_2 ; rdf:value "VP" .
:s1_3 powla:hasParent :bPARSE_PTB_4 .
:s1_3 powla:next :bPARSE_PTB_5 .
:bPARSE_PTB_5 a powla:Node, conll:PARSE_PTB ; powla:hasParent :bPARSE_PTB_4 ; rdf:value "NP" .
:s1_4 powla:hasParent :bPARSE_PTB_5 .
:bPARSE_PTB_5 powla:next :bPARSE_PTB_6 .
:bPARSE_PTB_6 a powla:Node, conll:PARSE_PTB ; powla:hasParent :bPARSE_PTB_4 ; rdf:value "NP-TMP" .
:s1_5 powla:hasParent :bPARSE_PTB_6 .
:bPARSE_PTB_6 powla:next :s1_6 .
:s1_6 powla:hasParent :bPARSE_PTB_4 .

Figure 8: Shallow CoNLL-RDF transformation of bracketing notation in fig. 4
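
With trees represented as in Fig. 8, phrase-level access reduces to SPARQL property paths over powla:hasParent (cf. section 2.3.2.). As an illustrative sketch (assuming the namespace prefixes declared in Fig. 8; the query itself is not part of the CoNLL-RDF distribution), the following SELECT retrieves the words dominated by the NP-SBJ node:

SELECT ?word
WHERE {
  # the phrase node labelled NP-SBJ (:bPARSE_PTB_3 in Fig. 8)
  ?phrase a conll:PARSE_PTB ; rdf:value "NP-SBJ" .
  # any word below it, at arbitrary depth
  ?w powla:hasParent+ ?phrase ; conll:WORD ?word .
}

Against the data in Fig. 8, this returns "James" and "Baker".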

:s1_0 a nif:Sentence .
:s1_1 a nif:Word; conll:WORD "FEDERAL"; conll:POS "JJ"; conll:HEAD :s1_0; nif:nextWord :s1_2 .
:s1_2 a nif:Word; conll:WORD "CONSTITUTION"; conll:POS "NN"; conll:HEAD :s1_0; nif:nextWord :s1_3 .
:s1_3 a nif:Word; conll:WORD ","; conll:POS "$,"; conll:HEAD :s1_0; nif:nextWord :s1_4 .
:s1_4 a nif:Word; conll:WORD "1789"; conll:POS "NUM"; conll:HEAD :s1_0 .
:x1 a powla:Node, conll:XML_DATA; rdf:value "doc" ; x:id "G10"; x:n "32" .
:x2 powla:hasParent :x1; a powla:Node, conll:XML_DATA; rdf:value "head" ; x:type "min" .
:s1_1 powla:hasParent :x2 .
:s1_1 powla:next :s1_2 .
:s1_2 powla:hasParent :x2 .
:s1_2 powla:next :x3 .
:x3 powla:hasParent :x2; a powla:Node, conll:XML_DATA; rdf:value "g" .
:x3 powla:next :s1_3 .
:s1_3 powla:hasParent :x2 .
:s1_3 powla:next :s1_4 .
:s1_4 powla:hasParent :x2 .

:s1_0 nif:nextSentence :s2_0 .
:s2_0 a nif:Sentence .
:s2_1 a nif:Word; conll:WORD "\""; conll:POS "$,"; conll:HEAD :s2_0; nif:nextWord :s2_2 .
:s2_2 a nif:Word; conll:WORD "we"; conll:POS "PRP"; conll:HEAD :s2_0; nif:nextWord :s2_3 .
:s2_3 a nif:Word; conll:WORD "the"; conll:POS "DT"; conll:HEAD :s2_0; nif:nextWord :s2_4 .
:s2_4 a nif:Word; conll:WORD "People"; conll:POS "NN"; conll:HEAD :s2_0 .
:x1 a powla:Node, conll:XML_DATA; rdf:value "doc" ; x:id "G10"; x:n "32" .
:x4 powla:hasParent :x1; a powla:Node, conll:XML_DATA; rdf:value "p" ; x:n "1" .
:s2_1 powla:hasParent :x4 .
:s2_1 powla:next :x5 .
:x5 powla:hasParent :x4; a powla:Node, conll:XML_DATA; rdf:value "g" .
:x5 powla:next :s2_2 .
:s2_2 powla:hasParent :x4 .
:s2_2 powla:next :s2_3 .
:s2_3 powla:hasParent :x4 .
:s2_3 powla:next :s2_4 .
:s2_4 powla:hasParent :x4 .

Figure 9: CoNLL-RDF conversion of the SketchEngine sample presented in fig. 6
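
A practical benefit of this representation is that document-wide markup, which has no counterpart in plain CoNLL-TSV (cf. section 3.2.), remains queryable. As an illustrative sketch (again assuming the prefixes of Fig. 8 and the xml namespace x: of footnote 12), the following query pairs every word of Fig. 9 with the id of the doc element it is (transitively) contained in:

SELECT ?word ?docid
WHERE {
  ?w conll:WORD ?word ;
     powla:hasParent+ ?docNode .
  ?docNode a conll:XML_DATA ;
           rdf:value "doc" ;
           x:id ?docid .
}

For the data in Fig. 9, every word is paired with the id "G10".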

9. Acknowledgements

The research described in this paper has been partially conducted in the context of the BMBF Early Career Research Group 'Linked Open Dictionaries (LiODi)', and partially in the context of the Horizon 2020 Research and Innovation Action 'Prêt-à-LLOD', Grant Agreement number 825182. We would like to thank Mari Aigro, University of Tartu, for providing us with sample data from the TenTen Corpus for SketchEngine evaluation.

10. Bibliographical References

Chiarcos, C. and Fäth, C. (2017). CoNLL-RDF: Linked corpora done in an NLP-friendly way. In Jorge Gracia, et al., editors, Language, Data, and Knowledge, pages 74–88, Cham, Switzerland. Springer.

Chiarcos, C. and Fäth, C. (2019). Graph-based annotation engineering: Towards a gold corpus for Role and Reference Grammar. In 2nd Conference on Language, Data and Knowledge (LDK-2019). OpenAccess Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, Germany.

Chiarcos, C., Kosmehl, B., Fäth, C., and Sukhareva, M. (2018). Analyzing Middle High German syntax with RDF and SPARQL. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan.

Chiarcos, C. (2012a). POWLA: Modeling linguistic corpora in OWL/DL. In 9th Extended Semantic Web Conference (ESWC-2012), pages 225–239, Heraklion, Crete, May.

Chiarcos, C. (2012b). A generic formalism to represent linguistic corpora in RDF and OWL/DL. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC-2012), pages 3205–3212. European Language Resources Association (ELRA).

Chiarcos, C. (2012c). POWLA: Modeling linguistic corpora in OWL/DL. In Elena Simperl, et al., editors, The Semantic Web: Research and Applications, pages 225–239, Berlin, Heidelberg. Springer Berlin Heidelberg.

Cimiano, P., Chiarcos, C., McCrae, J., and Gracia, J. (2020). Linguistic Linked Data. Representation, Generation and Applications. Springer, Cham.

de Araujo, D. A., Rigo, S. J., and Barbosa, J. L. V. (2017). Ontology-based information extraction for juridical events with case studies in Brazilian legal realm. Artificial Intelligence and Law, 25(4):379–396.

Dipper, S. and Götze, M. (2005). Accessing heterogeneous linguistic data — generic XML-based representation and flexible visualization. In 2nd Language & Technology Conference 2005, pages 23–30, Poznan, Poland, April.

Dipper, S. and Götze, M. (2006). ANNIS: Complex Multilevel Annotations in a Linguistic Database. In 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing, Trento, Italy.

Evert, S. and Hardie, A. (2011). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics 2011 conference, Birmingham, UK.

Fokkens, A., Soroa, A., Beloki, Z., Ockeloen, N., Rigau, G., van Hage, W. R., and Vossen, P. (2014). NAF and GAF: Linking linguistic annotations. In Proc. 10th Joint ISO-ACL SIGSEM Workshop on Interoperable Semantic Annotation, pages 9–16.

Harris, S. and Seaborne, A. (2013). SPARQL 1.1 query language. W3C Recommendation, World Wide Web Consortium.

Hellmann, S., Lehmann, J., Auer, S., and Brümmer, M. (2013). Integrating NLP Using Linked Data. In Camille Salinesi, et al., editors, Advanced Information Systems Engineering, volume 7908, pages 98–113. Springer Berlin Heidelberg.

Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., and Weischedel, R. (2006). OntoNotes: The 90% solution. In Proc. of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short 2006, pages 57–60, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ide, N. and Romary, L. (2004). International standard for a linguistic annotation framework. Natural Language Engineering, 10(3-4):211–225.

Ide, N., Baker, C. F., Fellbaum, C., Fillmore, C. J., and Passonneau, R. (2008). MASC: The Manually Annotated Sub-Corpus of American English. In 6th International Conference on Language Resources and Evaluation (LREC-2008), pages 2455–2461, Marrakech, Morocco, May.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., and Suchomel, V. (2013). The TenTen Corpus Family. In 7th International Corpus Linguistics Conference CL-2013, pages 125–127, Lancaster.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., and Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1(1):7–36.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Nivre, J., Agić, Ž., Ahrenberg, L., et al. (2016). Universal Dependencies 1.4. http://hdl.handle.net/11234/1-1827.

Prud'Hommeaux, E. and Seaborne, A. (2008). SPARQL query language for RDF. W3C Working Draft, January 2008.

Sanderson, R., Ciccarese, P., and Young, B. (2017). Web Annotation Data Model. Technical report, W3C Recommendation.

Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Stede, M., Bieler, H., Dipper, S., and Suriyawongk, A. (2006). SUMMaR: Combining linguistics and statistics for text summarization. In 17th European Conference on Artificial Intelligence (ECAI-2006), pages 827–828, Riva del Garda, Italy.

Tandy, J., Herman, I., and Kellogg, G. (2015). Generating RDF from tabular data on the web. Technical report, W3C Recommendation.

Verhagen, M., Suderman, K., Wang, D., Ide, N., Shi, C., Wright, J., and Pustejovsky, J. (2015). The LAPPS Interchange Format. In International Workshop on Worldwide Language Service Infrastructure, pages 33–47. Springer.

Zeldes, A., Ritz, J., Lüdeling, A., and Chiarcos, C. (2009). ANNIS: A search tool for multi-layer annotated corpora. In Corpus Linguistics, pages 20–23, Liverpool, UK.
