ULSAna: Universal Language Semantic Analyzer

Ondřej Pražák
NTIS
Faculty of Applied Sciences
University of West Bohemia
Technická 8, 306 14 Plzeň
[email protected]

Miloslav Konopík
Dpt. of Computer Science and Engineering
Faculty of Applied Sciences
University of West Bohemia
Technická 8, 306 14 Plzeň
[email protected]

Abstract

We present a live cross-lingual system capable of producing shallow semantic annotations of natural language sentences for 51 languages at this time. The domain of the input sentences is in principle unconstrained. The system uses single training data (in English) for all the languages. The resulting semantic annotations are therefore consistent across different languages. We use CoNLL training data and Universal Dependencies as the basis for the system. The system is publicly available and supports processing data in batches; therefore, it can be easily used by the community for research tasks.

(1) [He]A0 believes [in what he plays]A1.
(2) Can [you]A0 cook [the dinner]A1?
(3) [The nation's]AM-LOC largest [pension]A1 fund,

Figure 1: Three examples of shallow semantic annotations: 1) and 2) are examples of verb predicates and 3) of a noun predicate.

1 Introduction

In this work, we present a major outcome in our journey to build a system capable of producing semantic annotations for domain and language unconstrained natural language sentences. Currently, we rely on the Semantic Role Labeling (SRL) annotation scheme (Gildea and Jurafsky, 2002). The SRL goal is to determine semantic relationships (semantic roles) of given predicates (see examples in Figure 1). Verbs, such as “believe” or “cook”, are natural predicates, but certain nouns can be accepted as predicates as well (see the third line in the example). The semantic roles are specific for each predicate; however, the meaning of the roles is mostly shared across predicates. The core roles are denoted by A0 (usually Agent), A1 (usually Patient) and A2. Additional roles are modifier arguments (AM-*), restriction arguments (R-*) and others. We selected SRL because we believe that the annotations are simple enough to be generalized for different languages and target domains but at the same time expressive enough to bring a useful insight into the sentence semantics.
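For illustration only, the annotation of example (1) can be viewed as a predicate with labeled argument spans over the token sequence. The following Java sketch is a hypothetical representation (the class and field names are ours, not the system's API):

    import java.util.List;

    // Illustrative data structure for one shallow semantic annotation;
    // spans are token index ranges [start, end).
    record Argument(String label, int start, int end) {}
    record PredicateAnnotation(String predicate, List<Argument> arguments) {}

    class Figure1Example {
        public static void main(String[] args) {
            // "He believes in what he plays ." -> tokens 0..6
            PredicateAnnotation believes = new PredicateAnnotation(
                    "believes",
                    List.of(new Argument("A0", 0, 1),    // [He]
                            new Argument("A1", 2, 6)));  // [in what he plays]
            System.out.println(believes);
        }
    }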

In order to be able to produce semantic annotations for more languages, we employ cross-lingual SRL. Cross-lingual SRL takes the training data from a source language (usually English) and builds a language independent model that can be applied to target languages. The advantage of cross-lingual SRL is that it ensures coherent annotation for all supported languages because it trains on single training data for all the languages. This does not apply to monolingual SRL, where the tagsets and annotation guidelines change with every training dataset.

In our approach, we heavily depend on Universal Dependencies (UD) (Nivre et al., 2016). They are the primary means to transfer the learned rules from one language to another. With Universal Dependencies, we can create language independent parse trees. It means that sentences with the same syntactic structure share (in theory) the same parse trees for all (supported) languages. We train our machine learning model on the UD trees to capture the syntactic patterns required for semantic role labeling. We do not use any lexical information or any other language dependent features. Our only information for SRL comes from UD trees. Thus, the resulting model can be applied to any of the supported languages.

The system we present in this paper is a web-based application written in Java – see the screenshots in Figures 3 and 2. The system allows a user to input a natural language sentence in any of the 51 languages. The system outputs SRL annotations (predicates and corresponding semantic roles) of the input sentences. The semantic roles are associated with syntactic tree nodes. A video demonstration of the system is available at https://www.youtube.com/watch?v=8QPKCegHT_c. The system itself can be accessed at the following address: http://liks.fav.zcu.cz/ulsana. We intend to support the system for public use for several years.

2 Related Work

Approaches to cross-lingual SRL can be divided into three main categories: 1) Annotation projection methods attempt to transfer annotations from one language to another and then train an SRL system on the transferred annotations. 2) Model transfer approaches are designed to use language-independent features to train a universal model which can be applied to any language that supports the designed features. 3) Methods based on unsupervised training require no annotated data; however, the models have difficulties in assigning meaningful labels to predicate arguments.

Annotation projection. Padó and Lapata (2009) transfer annotations to a target language via word alignments obtained from parallel corpora. Annesi and Basili (2010) use a similar approach and extend it with an HMM model to increase the transfer accuracy.

Model transfer. Kozhevnikov and Titov (2013) use cross-lingual word mappings and cross-lingual semantic clusters obtained from parallel corpora, and cross-lingual features extracted from unlabelled syntactic dependencies, to create a cross-lingual SRL system. In (Kozhevnikov and Titov, 2014), they try to find a mapping between language-specific models using parallel data automatically.

Unsupervised SRL. Grenager and Manning (2006) deploy unsupervised learning using the EM algorithm based upon a structured probabilistic model of the domain. Lang and Lapata (2011) discover arguments of verb predicates with high accuracy using a small set of rules. A split-merge clustering is consequently applied to assign (nameless) roles to the discovered arguments. Titov and Klementiev (2012) propose a superior argument clustering by using the Chinese restaurant process.

Our approach belongs among the model transfer approaches. Most of the other state-of-the-art approaches to SRL rely on lexical features (e.g. lemmas). In the cross-language scenario, such features require bilingual models (e.g. word mapping or bilingual clusters). In our demonstration application, we show a multi-language model that is capable of producing annotations for many languages. Therefore, we omit all bilingual features, including the lexical features, from our model.

3 System Description

In this section, we describe the core of the cross-lingual system we use in our demo. The system is described in our original paper (Pražák and Konopík, 2017) in full detail. Here, we explain only the basic principles. The system described in (Pražák and Konopík, 2017) is available for five languages only. In this demo, we extended the system to 51 languages.

3.1 Training Dataset and Annotation Conversion

We train our system on UD parse trees. However, there are no training data that would contain SRL annotations on UD trees. Therefore, we proposed an algorithm to convert existing SRL annotations built on SD trees (SD stands for Standard or language-Specific Dependencies, e.g. Stanford dependencies – https://nlp.stanford.edu/software/stanford-dependencies.shtml).

The conversion process is by no means straightforward. The main source of complications stems from different approaches to choosing heads for syntactic phrases in UD trees. To solve this issue, we proposed optimization algorithms that attempt to select the most appropriate heads for UD trees which would cover the same phrases as the heads in the original SD trees. In many cases, there is no such word in the UD tree which could be used as the new head. In such cases, we choose the head that minimizes the annotation error. The details of the proposed conversion algorithms are presented in the original paper – Section 4; a simplified sketch of the head-selection idea is shown below.
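The following Java sketch illustrates the head-selection idea under simplifying assumptions; it is not the authors' actual conversion algorithm, and the node representation and error score are hypothetical. Given a phrase annotated on the SD tree (a set of token indices), it picks the UD node whose subtree covers that phrase with the fewest mismatched tokens.

    import java.util.*;

    // Hypothetical UD node: token index and children in the UD tree.
    class UdNode {
        int index;
        List<UdNode> children = new ArrayList<>();
        UdNode(int index) { this.index = index; }

        // All token indices in the subtree rooted at this node.
        Set<Integer> subtreeIndices() {
            Set<Integer> ids = new HashSet<>();
            collect(this, ids);
            return ids;
        }
        private static void collect(UdNode n, Set<Integer> ids) {
            ids.add(n.index);
            for (UdNode c : n.children) collect(c, ids);
        }
    }

    class HeadSelectionSketch {
        // Choose the UD node whose subtree matches the SD phrase best,
        // i.e. minimizes the number of tokens missing from or added to it.
        static UdNode selectHead(List<UdNode> udNodes, Set<Integer> sdPhrase) {
            UdNode best = null;
            int bestError = Integer.MAX_VALUE;
            for (UdNode candidate : udNodes) {
                int error = symmetricDifferenceSize(candidate.subtreeIndices(), sdPhrase);
                if (error < bestError) {
                    bestError = error;
                    best = candidate;
                }
            }
            return best;   // error 0 = exact match; otherwise the least-bad head
        }

        static int symmetricDifferenceSize(Set<Integer> a, Set<Integer> b) {
            int diff = 0;
            for (int x : a) if (!b.contains(x)) diff++;
            for (int x : b) if (!a.contains(x)) diff++;
            return diff;
        }
    }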

In our application, we use the CoNLL 2009 English dataset (Hajič et al., 2009). The corpus includes syntactic dependencies (from the Penn Treebank [TB]) and semantic dependencies (from PropBank [PB] and NomBank [NB]).

Figure 2: Application screenshot

3.2 Universal Dependencies Parser

Our system requires syntactic trees in the UD annotation scheme. We rely on the freely available tool UDPipe (Straka et al., 2016). It contains pre-trained models for all the languages we support in our application. We use the models provided with the tool (parsers based on UD). The algorithms were developed on UD v1.2; models based on UD 2.0 were also added, and they achieve better results (about +1% of labeling accuracy).

3.3 Classifier & Features

We train a supervised system based upon the Maximum Entropy classifier using the Brainy tool (Konkol, 2014). We use separate models for verb and non-verb predicates.

All features employed in our system are syntactic:

• Predicate-argument distance – the distance between the locations of a predicate and an argument in a sentence.

• POS – part-of-speech of the predicate, the argument and their parent nodes.

• Dependency relation – dependency tree relation of the predicate, the argument and their parent nodes.

• Directed path – dependency tree path from the predicate to the argument, including the indication of the dependency directions.

• Undirected path – the list of relations from the predicate to the argument.

• Verb voice – indication of active/passive voice.

• Other syntactic features – the FEATS column in the CoNLL 2009 format with additional information about the words.

• Bigram features – predicate-argument bigrams of the part-of-speech tags and the dependency relations.

The dependency path features are encoded as a probability of a word being a predicate argument (or having a specific role) given the path. These features are more general, and the resulting vectors have a smaller dimension. Also, the cost function is smoother, and thus the model is easier to train. A sketch of the path feature extraction and its probability encoding is given below.
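As an illustration only (not the paper's actual code), the Java sketch below shows how a directed dependency path from a predicate to an argument could be extracted, and how the conditional probability of a role given a path could be estimated from training counts; all class and method names are hypothetical, and add-one smoothing is our simplification.

    import java.util.*;

    // Hypothetical dependency tree node: head pointer and relation to the head.
    class DepNode {
        DepNode head;        // null for the root
        String deprel;       // UD relation to the head, e.g. "nsubj"
    }

    class PathFeatureSketch {
        // Directed path: climb from the predicate to the lowest common
        // ancestor, then descend to the argument; "^" marks upward steps
        // and "v" marks downward steps.
        static String directedPath(DepNode predicate, DepNode argument) {
            List<DepNode> predUp = pathToRoot(predicate);
            List<DepNode> argUp = pathToRoot(argument);
            Set<DepNode> predAncestors = new HashSet<>(predUp);
            StringBuilder path = new StringBuilder();
            for (DepNode n : predUp) {                 // go up from the predicate
                if (argUp.contains(n)) break;          // reached common ancestor
                path.append("^").append(n.deprel);
            }
            DepNode ancestor = argUp.stream().filter(predAncestors::contains)
                                    .findFirst().orElse(null);
            List<String> down = new ArrayList<>();
            for (DepNode n : argUp) {                  // collect downward steps
                if (n == ancestor) break;
                down.add("v" + n.deprel);
            }
            Collections.reverse(down);
            down.forEach(path::append);
            return path.toString();
        }

        static List<DepNode> pathToRoot(DepNode node) {
            List<DepNode> up = new ArrayList<>();
            for (DepNode n = node; n != null; n = n.head) up.add(n);
            return up;
        }

        // Probability-style encoding: P(role | path) estimated from counts
        // over the training data, with add-one smoothing.
        static double roleGivenPath(String role, String path,
                                    Map<String, Map<String, Integer>> counts,
                                    int numRoles) {
            Map<String, Integer> byRole = counts.getOrDefault(path, Map.of());
            int total = byRole.values().stream().mapToInt(Integer::intValue).sum();
            return (byRole.getOrDefault(role, 0) + 1.0) / (total + numRoles);
        }
    }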

3.4 Web Application Description

We created a Java web UI for the SRL annotation and its visualization. We use TikZ for the visualization of the trees. The TikZ output is converted to SVG, which is then shown in the browser. The application takes its input either in plain text or in various CoNLL formats. The input can be a single sentence or a file with sentences separated by new lines. On the input, a user has to select an input language (one of the 51 supported languages) because the syntactic parsers are language-dependent and the application cannot determine the language automatically at this time. The application can parse the sentences syntactically and semantically. After these steps (if requested), the annotations are visualized in the SVG format and shown in the browser. The user can download the analyzed sentences in SVG, PDF or raw CoNLL-U format.
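For instance, a plain-text batch input for the English model could simply contain one sentence per line (the sentences here are taken from Figure 1; the exact file name is arbitrary):

    He believes in what he plays.
    Can you cook the dinner?

After processing, such a batch can be downloaded again in SVG, PDF or CoNLL-U form, as described above.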
3.5 Application Use Cases

Research Experiments. The primary purpose of our application is to help researchers get familiar with the capabilities of cross-lingual semantic processing. We also want to demonstrate the power and limitations of Universal Dependencies. Users can work with examples entered into the input field, but they can also use the batch processing feature. In this way, users can obtain SRL annotations of larger corpora that can be used in subsequent research.

Language Learning. The application can also help users who are learning a new language. Our application shows the structure of a sentence and the basic roles of the main phrases in the sentence. In this way, users can more easily understand the semantic structure of the sentence.

Translation. Cross-lingual SRL can be used in either machine or human translation. When translating a sentence, we aim to preserve the semantic structure of the sentence. We can achieve that by studying the structures of both the source and target sentences.

3.6 Known Issues

• Parser errors – Since our system relies solely on syntactic features, it is very sensitive to parser errors. When the sentences match the domain of the training data of the UD annotations (mostly the news domain), the parse trees are generally quite correct, and with correct parse trees we produce mostly correct SRL annotations. However, our system is usually unable to classify correctly when the parse trees contain significant errors.

• Visualizing complex relations – Our system sometimes struggles with visualizing complex relations. In those circumstances, the resulting visualization can be confusing.

4 Future Work

In future work, we plan to adopt the end-to-end approach to Semantic Role Labeling. We intend to attach the SRL annotation after UD parsing and use a global cost function to optimize the UD parsing and the SRL annotation simultaneously. In order to apply the end-to-end approach, we might have to switch to the SyntaxNet UD parser (https://opensource.google.com/projects/syntaxnet). We expect to be able to produce more robust SRL annotations with one global optimization function.

Next, we plan to focus on lexical features. We want to stay with the idea of one model for many languages. Therefore, we need to use cross-lingual embeddings as lexical features.

5 Conclusion

We have created a semantic role labeling system with a massively multilingual model. A single model can be used for SRL in 51 different languages. The system supports large inputs, and therefore it can be used to annotate entire datasets for various NLP tasks.

Acknowledgments

This publication was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports and by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme ”Projects of Large Research, Development, and Innovations Infrastructures”.

Figure 3: Application screenshot – sentence visualization example

References

Paolo Annesi and Roberto Basili. 2010. Cross-lingual alignment of FrameNet annotations through hidden Markov models. In Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing '10, pages 12–25, Berlin, Heidelberg. Springer-Verlag.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Trond Grenager and Christopher D. Manning. 2006. Unsupervised discovery of a statistical verb lexicon. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18. Association for Computational Linguistics.

Michal Konkol. 2014. Brainy: A machine learning library. In Artificial Intelligence and Soft Computing, volume 8468 of Lecture Notes in Computer Science, pages 490–499. Springer International Publishing.

Mikhail Kozhevnikov and Ivan Titov. 2013. Cross-lingual transfer of semantic role labeling models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4–9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 1190–1200.

Mikhail Kozhevnikov and Ivan Titov. 2014. Cross-lingual model transfer using feature representation projection. In ACL (2), pages 579–585.

Joel Lang and Mirella Lapata. 2011. Unsupervised semantic role induction via split-merge clustering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1117–1126, Portland, Oregon, USA. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In LREC.

Sebastian Padó and Mirella Lapata. 2009. Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research, 36:307–340.

Ondřej Pražák and Miloslav Konopík. 2017. Cross-lingual SRL based upon Universal Dependencies. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 592–600, Varna, Bulgaria. INCOMA Ltd.

Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC '16), Paris, France. European Language Resources Association (ELRA).

Ivan Titov and Alexandre Klementiev. 2012. A Bayesian approach to unsupervised semantic role induction. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 12–22, Stroudsburg, PA, USA. Association for Computational Linguistics.
