Development of a Dependency Treebank for Russian and Its Possible Applications in NLP

Development of a Dependency Treebank for Russian and its Possible Applications in NLP Igor BOGUSLAVSKY, Ivan CHARDIN, Svetlana GRIGORIEVA, Nikolai GRIGORIEV, Leonid IOMDIN, Lеonid KREIDLIN, Nadezhda FRID Laboratory for Computational Linguistics Institute for Information Transmission Problems Russian Academy of Sciences Bolshoj Karetnyj per. 19, 101447, Moscow – RUSSIA {bogus, ic, sveta, grig, iomdin, lenya, nadya}@iitp.ru Abstract The paper describes a tagging scheme designed for the Russian Treebank and presents tools used for corpus creation. syntactic relations). This is an important feature, since the 1. Introductory Remarks majority of syntactically annotated corpora, both those The present paper describes a project aimed at developing already available and under construction, represent the the first annotated corpus of Russian texts. Large text syntactic structure by means of constituents. corpora have been used in the computational linguistics The closest analogue to our work is Prague Dependency community for quite a long time now; at present, over 20 Treebank (PDT) – an annotated corpus of Czech collected large corpora for the main European languages are at Charles University in Prague (see Hajicova, Panevova, available, the largest of them containing hundreds of Sgall, 1998). In this corpus, the syntactic data are also millions of words (Language Resources (1997); Marcus, expressed in the dependency formalism, although the Santorini and Marcinkiewicz (1993); Kurohashi, Nagao inventory of syntactic functional relations is much smaller (1998)). For Russian, annotated corpora had been than ours as it only has 23 relations. Our corpus therefore nonexistent until 2000 when the first part of the corpus gives a more fine-grained representation of syntactic reported here was compiled (Boguslavsky et al., 2000). phenomena. On the other hand, Czech researchers made Since then, Russian corpus linguistics has been evolving an extremely interesting attempt to incorporate into their rapidly, and several groups of researchers announced their annotation information on discourse structure (topic-focus intent to create corpora. Of these, the CLD-MSU Corpus opposition) (Bemova et al., 1999). project looks particularly promising. It aims at building a morphologically tagged corpus, and an upgrade to Besides PDT, several other corpus-related projects use syntactic annotation is envisaged in the future even though some kind of dependency structures; worth noticing are it is not pursued at the present stage (Sichinava, 2001). NEGRA for German (Brants et al, 1999) and Alpino Dependency Treebank for Dutch (Van der Beek et al., Different tasks require different annotation levels that 2001). entail different amount of information about text structure. The corpus that is being created in the framework of the In what follows, we describe the types of texts used to project under discussion consists of several subcorpora create the corpus (Section 2), markup format (Section 3), that differ in the level of annotation. The following three annotation tools and procedures (Section 4), and types of levels are envisaged: linguistic data included in the markup (Section 5). S lemmatized texts: for every word, its normal form (lemma) and part of speech are indicated; 2. Source Text Selection S The well-known Uppsala University Corpus of morphologically tagged texts: for every word, along contemporary Russian prose has been chosen as the with the lemma and the part of speech, a full set of primary source for the first part of our corpus, which has inflectional morphological attributes is specified; already been completed. This part contains about 10,000 S syntactically tagged texts: apart from the full sentences. The Uppsala Corpus is well balanced between morphological markup at the word level, every fiction and journalistic genre, with a smaller percentage of sentence is assigned a syntactic structure. scientific and popular science texts. The Corpus includes samples of contemporary Russian prose, as well as We annotate Russian texts with dependency structures – a excerpts from newspapers and magazines of the last few formalism which we consider more suitable for Slavonic decades of the 20th century, and gives a representative languages with their relatively free word order than coverage of written Russian in modern use. constituent structures. The structure not only contains Conversational examples are scarce and appear as information as to which words of the sentence are dialogues inside fiction texts. syntactically linked, but also relegates each link to one of the several dozen syntactic types (at present, we use 78 The second part of the corpus consists of several hundred 3.4. Storing Information about Syntactic Structure. To short texts published in 2001-2002 on various Internet annotate the information about syntactic dependencies, we news portals. The bulk of the texts come from the use two other attributes attached to the <W> element: following newswires: www.yandex.ru, www.rbc.ru, DOM – the ID of the word on which W depends and www.polit.ru, www.lenta.ru, www.strana.ru, LINK – syntactic function label. www.news.ru. Each text is a small story (up to 30 sentences) about a single event. Their themes include The formalism has special provisions to store auxiliary political, financial, cultural, and sports news, both information, e.g. multiple morphological analyses and domestic and international; and a certain amount of texts syntactic trees. They will not appear in the final version of deal with hi-tech achievements. We have done our best to the corpus. make source text selection a representative sample of Internet news delivery in Russian. 4. Annotation Tools and Procedures The procedure of corpus data acquisition is semi- 3. Markup Format automatic. An initial version of markup is generated by a The design principles were formulated as follows: computer using a general purpose morphological analyzer and syntax parser engine; after that, the results of the S “layered” markup – several annotation levels coexist automatic processing are submitted to human post-editing. and can be extracted or processed independently; The analysis engine (morphology and parsing) is based S incrementality – it should be easy to add higher upon the ETAP-3 machine translation engine as discussed annotation levels; in Apresjan et al. (1992, 1993). To support the creation of annotated data, a variety of S convenient parsing of the annotated text by means of tools have been designed and implemented. All tools are standard software packages. Win32 applications written in C++. The tools available The most natural solution to meet these criteria is an are: XML-based markup language. We have tried to make our S format compatible with TEI (Text Encoding for a program that creates sentence boundaries markup, Interchange, see TEI Guidelines (1994)), introducing new called Chopper; elements or attributes only in situations where TEI S a post-editor for building, editing and managing markup does not provide adequate means to describe the syntactically annotated texts – Structure Editor (or text structure in the dependency grammar framework. StrEd). Listed below are types of information about text structure The amount of manual labor required to build annotations that must be encoded in the markup, and the respective depends on the complexity of the input data. StrEd offers tags/attributes used to carry this information. different options for building structures. Most sentences 3.1. Splitting Text into Sentences. A special container can be reliably processed without any human intervention; in which case, a linguist should only look through the element <S> (available in TEI) is used to delimit sentence result of the processing and endorse it. If the structure boundaries. The element may have an optional ID contains errors, the linguist can edit it using a user- attribute that supplies a unique identifier for the sentence friendly graphical interface (see screenshots below). If the within the text; this identifier may be used to store errors are too many or no structure could be produced, the information about extra-sentential relations in the text. It linguist may resort to a special split-and-run mode. This may also have a attribute, used by linguists to COMMENT mode involves manual pre-chunking of the input sentence store notes and observations on about particular syntactic into such pieces that have a more transparent structure and phenomena encountered in the sentence; applying the analyzer/parser to every chunk in turn. Then 3.2. Splitting Sentences into Lexical Items (Words). the linguist must manually integrate the subtrees produced The words are delimited by a container element <W>. Like for every chunk into a single tree structure. sentences, words may have a unique ID attribute that is If the linguist has come across an especially difficult used to refer to the word within the sentence; syntactic construction so that he/she is uncertain about 3.3. Assigning Morphological Features to Words. what the adequate structure is, he/she may mark as Morphological information is ascribed to the word by “doubtful” the whole sentence or else single words whose means of two attributes attached to the <W> tag: LEMMA – functions are not completely clear. The information will normalized word form and FEAT – list of morphological be stored in the markup, and StrEd will visualize the features. respective sentence as one needing further editing. Fig. 1 presents the main dialog window for editing way of editing the structure consists in invoking a Tree sentence properties. An operator can edit the markup Editor window, shown in Fig. 2 with the same sentence directly in any text editor, or edit single properties using a as in the previous picture. graphical interface. The source text under analysis is The Tree Editor interface is simple and natural. Words of written in the top line of the edit window: Xotja pis’mo ne the source sentence are written on the left, their lemmas bylo podpisano, ja mgnovenno dogadalsja, kto ego are placed into gray rectangles, and their morphological napisal [Although the letter was not signed, I instantly features are written on the right.

Development of a Dependency Treebank for Russian and Its Possible Applications in NLP

How Do BERT Embeddings Organize Linguistic Knowledge?

Treebanks, Linguistic Theories and Applications Introduction to Treebanks

Usage of IT Terminology in Corpus Linguistics Mavlonova Mavluda Davurovna

Senserelate::Allwords - a Broad Coverage Word Sense Tagger That Maximizes Semantic Relatedness

Deep Linguistic Analysis for the Accurate Identification of Predicate

The Procedure of Lexico-Semantic Annotation of Składnica Treebank

Linguistic Annotation of the Digital Papyrological Corpus: Sematia

Unified Language Model Pre-Training for Natural

Building a Treebank for French

Converting an HPSG-Based Treebank Into Its Parallel Dependency-Based Treebank

Creating and Using Multilingual Corpora in Translation Studies Claudio Fantinuoli and Federico Zanettin

Corpus Linguistics As a Tool in Legal Interpretation Lawrence M