English Propbank Annotation Guidelines

Total Page:16

File Type:pdf, Size:1020Kb

English Propbank Annotation Guidelines English PropBank Annotation Guidelines Claire Bonial [email protected] Julia Bonn [email protected] Kathryn Conger [email protected] Jena Hwang [email protected] Martha Palmer [email protected] Nicholas Reese [email protected] Center for Computational Language and Education Research Institute of Cognitive Science University of Colorado at Boulder Based on original PropBank guidelines by Olga Babko-Malaya July 14, 2015 Contents 1 Verb Annotation Instructions3 1.1 PropBank Annotation Goals..............................3 1.2 Sense Annotation....................................4 1.2.1 Frame Files...................................4 1.2.2 ER: Error roleset................................6 1.2.3 IE: Idiomatic Expression roleset.......................7 1.2.4 DP: Duplicate indicator roleset........................7 1.2.5 LV: Light Verb roleset.............................8 1.2.6 What to do when there is no frame file....................8 1.3 Annotation of Numbered Arguments.........................8 1.3.1 Choosing ARG0 versus ARG1.........................8 1.3.2 ARGA: Secondary Agent........................... 10 1.4 Annotation of Modifiers................................ 10 1.4.1 Comitatives (COM).............................. 11 1.4.2 Locatives (LOC)................................ 12 1.4.3 Directional (DIR)............................... 12 1.4.4 Goal (GOL)................................... 12 1.4.5 Manner (MNR)................................. 13 1.4.6 Temporal (TMP)................................ 14 1.4.7 Extent (EXT)................................. 14 1.4.8 Reciprocals (REC)............................... 15 1.4.9 Secondary Predication (PRD)......................... 16 1.4.10 Purpose Clauses (PRP)............................ 16 1.4.11 Cause Clauses (CAU)............................. 17 1.4.12 Discourse (DIS)................................. 17 1.4.13 Modals (MOD)................................. 19 1.4.14 Negation (NEG)................................ 19 1.4.15 Direct Speech (DSP).............................. 19 1.4.16 Adverbials (ADV)............................... 21 1.4.17 Adjectival (ADJ)................................ 21 1.4.18 Light Verb (LVB)............................... 21 1.4.19 Construction (CXN).............................. 22 1.5 Span of Annotation................................... 22 1.6 Where to Place Tags.................................. 26 1.6.1 Exceptions to Normal Tag Placement.................... 26 1.7 Understanding and Annotating Null Elements in the Penn TreeBank....... 27 1.7.1 Passive Sentences................................ 28 1.7.2 Fronted and Dislocated Arguments...................... 29 1.7.3 Questions and Wh-Phrases.......................... 30 1.7.4 Interpret Constituent Here (ICH) Traces................... 31 1 1.7.5 Right Node Raising (RNR) Traces...................... 32 1.7.6 It EXtraPosition (EXP)............................ 35 1.7.7 Other Traces.................................. 37 1.8 Linking and Annotation of Null Elements...................... 38 1.8.1 Relative Clause Annotation.......................... 38 1.8.2 Reduced Relative Annotation......................... 38 1.8.3 Concatenation of multiple nodes into one argument............. 41 1.8.4 Special cases of topicalization......................... 42 1.9 Special Cases: small clauses and sentential complements.............. 43 1.10 Handling common features of spoken data...................... 45 1.10.1 Disfluencies and Edited Nodes........................ 45 1.10.2 Asides: PRN nodes............................... 46 2 Light Verb Annotation 47 2.1 Pass 1: Verb Pass.................................... 47 2.2 Pass 2: Noun Pass................................... 50 2.3 Examples........................................ 50 2.4 Tricky Cases...................................... 52 3 Noun Annotation Instructions 53 3.1 Span of Annotation................................... 53 3.2 Annotation of Numbered Arguments......................... 55 3.3 Annotation of Modifiers................................ 55 3.3.1 Adjectival modifiers (ADJ).......................... 56 3.3.2 Secondary Predication modifiers (PRD)................... 56 3.4 Exceptions to Annotation: Determiners and Other Noun Relations........ 57 3.5 .YY Roleset....................................... 57 4 Adjectival Predicate Annotation Instructions 59 4.1 Span of Annotation................................... 59 4.2 Annotation of Arguments............................... 59 4.3 Annotating Constructions............................... 59 4.3.1 Comparative Constructions.......................... 60 4.3.2 Degree Construction.............................. 61 4.3.3 The Xer the Yer Construction......................... 62 A Jubilee Hotkeys 63 References 65 2 Chapter 1 Verb Annotation Instructions 1.1 PropBank Annotation Goals PropBank is a corpus in which the arguments of each predicate are annotated with their semantic roles in relation to the predicate (Palmer et al., 2005). Currently, all the PropBank annotations are done on top of the phrase structure annotation of the Penn TreeBank (Marcus et al., 1993). In addition to semantic role annotation, PropBank annotation requires the choice of a sense id (also known as a frameset or roleset id) for each predicate. Thus, for each verb in every tree (representing the phrase structure of the corresponding sentence), we create a PropBank instance that consists of the sense id of the predicate (e.g. run.02) and its arguments labeled with semantic roles. An important goal is to provide consistent argument labels across different syntactic realizations of the same verb, as in. [John]ARG0 broke [the window]ARG1 [The window]ARG1 broke As this example shows, the arguments of the verbs are labeled as numbered arguments: ARG0, ARG1, ARG2, and so on. The argument structure of each predicate is outlined in the PropBank frame file for that predicate. The frame file gives both semantic and syntactic information about each sense of the predicate lemmas that have been encountered thus far in PropBank annotation. The frame file also denotes the correspondences between numbered arguments and semantic roles, as this is somewhat unique for each predicate. Numbered arguments reflect either the arguments that are required for the valency of a predicate (e.g., agent, patient, benefactive), or if not required, those that occur with high-frequency in actual usage. Although numbered arguments correspond to slightly different semantic roles given the usage of each predicate, in general numbered arguments correspond to the following semantic roles: ARG0 agent ARG3 starting point, benefactive, attribute ARG1 patient ARG4 ending point ARG2 instrument, benefactive, attribute ARGM modifier Table 1.1: List of arguments in PropBank In addition to numbered arguments, another task of PropBank annotation involves assigning functional tags to all modifiers of the verb, such as manner (MNR), locative (LOC), temporal (TMP) and others: Mr. Bush met him privately, in the White House, on Thursday. REL: met 3 ARG0: Mr. Bush ARG1: him ARGM-MNR: privately ARGM-LOC: in the White House ARGM-TMP: on Thursday And, finally, PropBank annotation involves finding antecedents for empty arguments of the verbs, as illustrated below: I made a decision [*PRO*] to leave. The subject of the verb leave in this example is represented as an empty category [*] in TreeBank. In PropBank, all empty categories which could be co-referred with a NP within the same sentence are linked in co-reference chains: REL: leave ARG0: [*PRO*] * [I] Similarly, relativizers and their referents are linked in relative clause constructions, and traces are linked to their referents in reduced relative constructions. While these links were at one time created manually by the annotators, they are now added automatically in post-processing. The annotation of this information creates a valuable corpus, which can be used as training data for a variety of natural language processing applications. Training data, essentially, is what computer scientists and computational linguists can use to `teach the computer' about different aspects of human language. Once this information is processed, it can guide future decisions on how to categorize and/or label different features in novel utterances outside of the PropBank corpus. Parallel PropBank corpora currently exist or are underway for English, Chinese, Arabic and Hindi. As a whole, the PropBank corpus has the potential to assist in natural language processing applications such as machine translation, text editing, text summary and evaluation as well as question answering. Thus, the main tasks of PropBank annotation are: argument labeling, annotation of modifiers, choosing a sense for the predicate, and creating links for empty categories, relative clauses, and reduced relatives. Each of these aspects of annotation are discussed in detail below. Although some detail is provided in each section on how to annotate appropriately using the annotation tool, Jubilee, complete guidelines on the use of this tool are provided in the Jubilee technical report: Jubilee: Propbank Instance Editor Guideline (Version 2.1) (Choi et al., 2009). 1.2 Sense Annotation 1.2.1 Frame Files The argument labels for each predicate are specified in the frame files, which are available at http://verbs.colorado.edu/propbank/framesets-english-aliases/ and are also dis- played in the frameset view of the annotation tool, Jubilee (see Jubilee technical report (Choi et al., 2009) for further information).
Recommended publications
  • Abstraction and Generalisation in Semantic Role Labels: Propbank
    Abstraction and Generalisation in Semantic Role Labels: PropBank, VerbNet or both? Paola Merlo Lonneke Van Der Plas Linguistics Department Linguistics Department University of Geneva University of Geneva 5 Rue de Candolle, 1204 Geneva 5 Rue de Candolle, 1204 Geneva Switzerland Switzerland [email protected] [email protected] Abstract The role of theories of semantic role lists is to obtain a set of semantic roles that can apply to Semantic role labels are the representa- any argument of any verb, to provide an unam- tion of the grammatically relevant aspects biguous identifier of the grammatical roles of the of a sentence meaning. Capturing the participants in the event described by the sentence nature and the number of semantic roles (Dowty, 1991). Starting from the first proposals in a sentence is therefore fundamental to (Gruber, 1965; Fillmore, 1968; Jackendoff, 1972), correctly describing the interface between several approaches have been put forth, ranging grammar and meaning. In this paper, we from a combination of very few roles to lists of compare two annotation schemes, Prop- very fine-grained specificity. (See Levin and Rap- Bank and VerbNet, in a task-independent, paport Hovav (2005) for an exhaustive review). general way, analysing how well they fare In NLP, several proposals have been put forth in in capturing the linguistic generalisations recent years and adopted in the annotation of large that are known to hold for semantic role samples of text (Baker et al., 1998; Palmer et al., labels, and consequently how well they 2005; Kipper, 2005; Loper et al., 2007). The an- grammaticalise aspects of meaning.
    [Show full text]
  • A Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling
    VerbAtlas: a Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling Andrea Di Fabio}~, Simone Conia}, Roberto Navigli} }Department of Computer Science ~Department of Literature and Modern Cultures Sapienza University of Rome, Italy {difabio,conia,navigli}@di.uniroma1.it Abstract The automatic identification and labeling of ar- gument structures is a task pioneered by Gildea We present VerbAtlas, a new, hand-crafted and Jurafsky(2002) called Semantic Role Label- lexical-semantic resource whose goal is to ing (SRL). SRL has become very popular thanks bring together all verbal synsets from Word- to its integration into other related NLP tasks Net into semantically-coherent frames. The such as machine translation (Liu and Gildea, frames define a common, prototypical argu- ment structure while at the same time pro- 2010), visual semantic role labeling (Silberer and viding new concept-specific information. In Pinkal, 2018) and information extraction (Bas- contrast to PropBank, which defines enumer- tianelli et al., 2013). ative semantic roles, VerbAtlas comes with In order to be performed, SRL requires the fol- an explicit, cross-frame set of semantic roles lowing core elements: 1) a verb inventory, and 2) linked to selectional preferences expressed in terms of WordNet synsets, and is the first a semantic role inventory. However, the current resource enriched with semantic information verb inventories used for this task, such as Prop- about implicit, shadow, and default arguments. Bank (Palmer et al., 2005) and FrameNet (Baker We demonstrate the effectiveness of VerbAtlas et al., 1998), are language-specific and lack high- in the task of dependency-based Semantic quality interoperability with existing knowledge Role Labeling and show how its integration bases.
    [Show full text]
  • How Do BERT Embeddings Organize Linguistic Knowledge?
    How Do BERT Embeddings Organize Linguistic Knowledge? Giovanni Puccettiy , Alessio Miaschi? , Felice Dell’Orletta y Scuola Normale Superiore, Pisa ?Department of Computer Science, University of Pisa Istituto di Linguistica Computazionale “Antonio Zampolli”, Pisa ItaliaNLP Lab – www.italianlp.it [email protected], [email protected], [email protected] Abstract et al., 2019), we proposed an in-depth investigation Several studies investigated the linguistic in- aimed at understanding how the information en- formation implicitly encoded in Neural Lan- coded by BERT is arranged within its internal rep- guage Models. Most of these works focused resentation. In particular, we defined two research on quantifying the amount and type of in- questions, aimed at: (i) investigating the relation- formation available within their internal rep- ship between the sentence-level linguistic knowl- resentations and across their layers. In line edge encoded in a pre-trained version of BERT and with this scenario, we proposed a different the number of individual units involved in the en- study, based on Lasso regression, aimed at understanding how the information encoded coding of such knowledge; (ii) understanding how by BERT sentence-level representations is ar- these sentence-level properties are organized within ranged within its hidden units. Using a suite of the internal representations of BERT, identifying several probing tasks, we showed the existence groups of units more relevant for specific linguistic of a relationship between the implicit knowl- tasks. We defined a suite of probing tasks based on edge learned by the model and the number of a variable selection approach, in order to identify individual units involved in the encodings of which units in the internal representations of BERT this competence.
    [Show full text]
  • Treebanks, Linguistic Theories and Applications Introduction to Treebanks
    Treebanks, Linguistic Theories and Applications Introduction to Treebanks Lecture One Petya Osenova and Kiril Simov Sofia University “St. Kliment Ohridski”, Bulgaria Bulgarian Academy of Sciences, Bulgaria ESSLLI 2018 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Plan of the Lecture • Definition of a treebank • The place of the treebank in the language modeling • Related terms: parsebank, dynamic treebank • Prerequisites for the creation of a treebank • Treebank lifecycle • Theory (in)dependency • Language (in)dependency • Tendences in the treebank development 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Treebank Definition A corpus annotated with syntactic information • The information in the annotation is added/checked by a trained annotator - manual annotation • The annotation is complete - no unannotated fragments of the text • The annotation is consistent - similar fragments are analysed in the same way • The primary format of annotation - syntactic tree/graph 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Example Syntactic Trees from Wikipedia The two main approaches to modeling the syntactic information 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Pros vs. Cons (Handbook of NLP, p. 171) Constituency • Easy to read • Correspond to common grammatical knowledge (phrases) • Introduce arbitrary complexity Dependency • Flexible • Also correspond to common grammatical
    [Show full text]
  • An Arabic Wordnet with Ontologically Clean Content
    Applied Ontology (2021) IOS Press The Arabic Ontology – An Arabic Wordnet with Ontologically Clean Content Mustafa Jarrar Birzeit University, Palestine [email protected] Abstract. We present a formal Arabic wordnet built on the basis of a carefully designed ontology hereby referred to as the Arabic Ontology. The ontology provides a formal representation of the concepts that the Arabic terms convey, and its content was built with ontological analysis in mind, and benchmarked to scientific advances and rigorous knowledge sources as much as this is possible, rather than to only speakers’ beliefs as lexicons typically are. A comprehensive evaluation was conducted thereby demonstrating that the current version of the top-levels of the ontology can top the majority of the Arabic meanings. The ontology consists currently of about 1,300 well-investigated concepts in addition to 11,000 concepts that are partially validated. The ontology is accessible and searchable through a lexicographic search engine (http://ontology.birzeit.edu) that also includes about 150 Arabic-multilingual lexicons, and which are being mapped and enriched using the ontology. The ontology is fully mapped with Princeton WordNet, Wikidata, and other resources. Keywords. Linguistic Ontology, WordNet, Arabic Wordnet, Lexicon, Lexical Semantics, Arabic Natural Language Processing Accepted by: 1. Introduction The importance of linguistic ontologies and wordnets is increasing in many application areas, such as multilingual big data (Oana et al., 2012; Ceravolo, 2018), information retrieval (Abderrahim et al., 2013), question-answering and NLP-based applications (Shinde et al., 2012), data integration (Castanier et al., 2012; Jarrar et al., 2011), multilingual web (McCrae et al., 2011; Jarrar, 2006), among others.
    [Show full text]
  • Verbnet Based Citation Sentiment Class Assignment Using Machine Learning
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 9, 2020 VerbNet based Citation Sentiment Class Assignment using Machine Learning Zainab Amjad1, Imran Ihsan2 Department of Creative Technologies Air University, Islamabad, Pakistan Abstract—Citations are used to establish a link between time-consuming and complicated. To resolve this issue there articles. This intent has changed over the years, and citations are exists many researchers [7]–[9] who deal with the sentiment now being used as a criterion for evaluating the research work or analysis of citation sentences to improve bibliometric the author and has become one of the most important criteria for measures. Such applications can help scholars in the period of granting rewards or incentives. As a result, many unethical research to identify the problems with the present approaches, activities related to the use of citations have emerged. That is why unaddressed issues, and the present research gaps [10]. content-based citation sentiment analysis techniques are developed on the hypothesis that all citations are not equal. There are two existing approaches for Citation Sentiment There are several pieces of research to find the sentiment of a Analysis: Qualitative and Quantitative [7]. Quantitative citation, however, only a handful of techniques that have used approaches consider that all citations are equally important citation sentences for this purpose. In this research, we have while qualitative approaches believe that all citations are not proposed a verb-oriented citation sentiment classification for equally important [9]. The quantitative approach uses citation researchers by semantically analyzing verbs within a citation text count to rank a research paper [8] while the qualitative using VerbNet Ontology, natural language processing & four approach analyzes the nature of citation [10].
    [Show full text]
  • Senserelate::Allwords - a Broad Coverage Word Sense Tagger That Maximizes Semantic Relatedness
    WordNet::SenseRelate::AllWords - A Broad Coverage Word Sense Tagger that Maximizes Semantic Relatedness Ted Pedersen and Varada Kolhatkar Department of Computer Science University of Minnesota Duluth, MN 55812 USA tpederse,kolha002 @d.umn.edu http://senserelate.sourceforge.net{ } Abstract Despite these difficulties, word sense disambigua- tion is often a necessary step in NLP and can’t sim- WordNet::SenseRelate::AllWords is a freely ply be ignored. The question arises as to how to de- available open source Perl package that as- velop broad coverage sense disambiguation modules signs a sense to every content word (known that can be deployed in a practical setting without in- to WordNet) in a text. It finds the sense of vesting huge sums in manual annotation efforts. Our each word that is most related to the senses answer is WordNet::SenseRelate::AllWords (SR- of surrounding words, based on measures found in WordNet::Similarity. This method is AW), a method that uses knowledge already avail- shown to be competitive with results from re- able in the lexical database WordNet to assign senses cent evaluations including SENSEVAL-2 and to every content word in text, and as such offers SENSEVAL-3. broad coverage and requires no manual annotation of training data. SR-AW finds the sense of each word that is most 1 Introduction related or most similar to those of its neighbors in the Word sense disambiguation is the task of assigning sentence, according to any of the ten measures avail- a sense to a word based on the context in which it able in WordNet::Similarity (Pedersen et al., 2004).
    [Show full text]
  • Deep Linguistic Analysis for the Accurate Identification of Predicate
    Deep Linguistic Analysis for the Accurate Identification of Predicate-Argument Relations Yusuke Miyao Jun'ichi Tsujii Department of Computer Science Department of Computer Science University of Tokyo University of Tokyo [email protected] CREST, JST [email protected] Abstract obtained by deep linguistic analysis (Gildea and This paper evaluates the accuracy of HPSG Hockenmaier, 2003; Chen and Rambow, 2003). parsing in terms of the identification of They employed a CCG (Steedman, 2000) or LTAG predicate-argument relations. We could directly (Schabes et al., 1988) parser to acquire syntac- compare the output of HPSG parsing with Prop- tic/semantic structures, which would be passed to Bank annotations, by assuming a unique map- statistical classifier as features. That is, they used ping from HPSG semantic representation into deep analysis as a preprocessor to obtain useful fea- PropBank annotation. Even though PropBank tures for training a probabilistic model or statistical was not used for the training of a disambigua- tion model, an HPSG parser achieved the ac- classifier of a semantic argument identifier. These curacy competitive with existing studies on the results imply the superiority of deep linguistic anal- task of identifying PropBank annotations. ysis for this task. Although the statistical approach seems a reason- 1 Introduction able way for developing an accurate identifier of Recently, deep linguistic analysis has successfully PropBank annotations, this study aims at establish- been applied to real-world texts. Several parsers ing a method of directly comparing the outputs of have been implemented in various grammar for- HPSG parsing with the PropBank annotation in or- malisms and empirical evaluation has been re- der to explicitly demonstrate the availability of deep ported: LFG (Riezler et al., 2002; Cahill et al., parsers.
    [Show full text]
  • The Procedure of Lexico-Semantic Annotation of Składnica Treebank
    The Procedure of Lexico-Semantic Annotation of Składnica Treebank Elzbieta˙ Hajnicz Institute of Computer Science, Polish Academy of Sciences ul. Ordona 21, 01-237 Warsaw, Poland [email protected] Abstract In this paper, the procedure of lexico-semantic annotation of Składnica Treebank using Polish WordNet is presented. Other semantically annotated corpora, in particular treebanks, are outlined first. Resources involved in annotation as well as a tool called Semantikon used for it are described. The main part of the paper is the analysis of the applied procedure. It consists of the basic and correction phases. During basic phase all nouns, verbs and adjectives are annotated with wordnet senses. The annotation is performed independently by two linguists. Multi-word units obtain special tags, synonyms and hypernyms are used for senses absent in Polish WordNet. Additionally, each sentence receives its general assessment. During the correction phase, conflicts are resolved by the linguist supervising the process. Finally, some statistics of the results of annotation are given, including inter-annotator agreement. The final resource is represented in XML files preserving the structure of Składnica. Keywords: treebanks, wordnets, semantic annotation, Polish 1. Introduction 2. Semantically Annotated Corpora It is widely acknowledged that linguistically annotated cor- Semantic annotation of text corpora seems to be the last pora play a crucial role in NLP. There is even a tendency phase in the process of corpus annotation, less popular than towards their ever-deeper annotation. In particular, seman- morphosyntactic and (shallow or deep) syntactic annota- tically annotated corpora become more and more popular tion. However, there exist semantically annotated subcor- as they have a number of applications in word sense disam- pora for many languages.
    [Show full text]
  • Generating Lexical Representations of Frames Using Lexical Substitution
    Generating Lexical Representations of Frames using Lexical Substitution Saba Anwar Artem Shelmanov Universitat¨ Hamburg Skolkovo Institute of Science and Technology Germany Russia [email protected] [email protected] Alexander Panchenko Chris Biemann Skolkovo Institute of Science and Technology Universitat¨ Hamburg Russia Germany [email protected] [email protected] Abstract Seed sentence: I hope PattiHelper can helpAssistance youBenefited party soonTime . Semantic frames are formal linguistic struc- Substitutes for Assistance: assist, aid tures describing situations/actions/events, e.g. Substitutes for Helper: she, I, he, you, we, someone, Commercial transfer of goods. Each frame they, it, lori, hannah, paul, sarah, melanie, pam, riley Substitutes for Benefited party: me, him, folk, her, provides a set of roles corresponding to the sit- everyone, people uation participants, e.g. Buyer and Goods, and Substitutes for Time: tomorrow, now, shortly, sooner, lexical units (LUs) – words and phrases that tonight, today, later can evoke this particular frame in texts, e.g. Sell. The scarcity of annotated resources hin- Table 1: An example of the induced lexical represen- ders wider adoption of frame semantics across tation (roles and LUs) of the Assistance FrameNet languages and domains. We investigate a sim- frame using lexical substitutes from a single seed sen- ple yet effective method, lexical substitution tence. with word representation models, to automat- ically expand a small set of frame-annotated annotated resources. Some publicly available re- sentences with new words for their respective sources are FrameNet (Baker et al., 1998) and roles and LUs. We evaluate the expansion PropBank (Palmer et al., 2005), yet for many lan- quality using FrameNet.
    [Show full text]
  • Unified Language Model Pre-Training for Natural
    Unified Language Model Pre-training for Natural Language Understanding and Generation Li Dong∗ Nan Yang∗ Wenhui Wang∗ Furu Wei∗† Xiaodong Liu Yu Wang Jianfeng Gao Ming Zhou Hsiao-Wuen Hon Microsoft Research {lidong1,nanya,wenwan,fuwei}@microsoft.com {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com Abstract This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirec- tional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of- the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm. 1 Introduction Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks [8, 29, 19, 31, 9, 1].
    [Show full text]
  • Towards a Cross-Linguistic Verbnet-Style Lexicon for Brazilian Portuguese
    Towards a cross-linguistic VerbNet-style lexicon for Brazilian Portuguese Carolina Scarton, Sandra Alu´ısio Center of Computational Linguistics (NILC), University of Sao˜ Paulo Av. Trabalhador Sao-Carlense,˜ 400. 13560-970 Sao˜ Carlos/SP, Brazil [email protected], [email protected] Abstract This paper presents preliminary results of the Brazilian Portuguese Verbnet (VerbNet.Br). This resource is being built by using other existing Computational Lexical Resources via a semi-automatic method. We identified, automatically, 5688 verbs as candidate members of VerbNet.Br, which are distributed in 257 classes inherited from VerbNet. These preliminary results give us some directions of future work and, since the results were automatically generated, a manual revision of the complete resource is highly desirable. 1. Introduction the verb to load. To fulfill this gap, VerbNet has mappings The task of building Computational Lexical Resources to WordNet, which has deeper semantic relations. (CLRs) and making them publicly available is one of Brazilian Portuguese language lacks CLRs. There are some the most important tasks of Natural Language Processing initiatives like WordNet.Br (Dias da Silva et al., 2008), that (NLP) area. CLRs are used in many other applications is based on and aligned to WordNet. This resource is the in NLP, such as automatic summarization, machine trans- most complete for Brazilian Portuguese language. How- lation and opinion mining. Specially, CLRs that treat the ever, only the verb database is in an advanced stage (it syntactic and semantic behaviour of verbs are very impor- is finished, but without manual validation), currently con- tant to the tasks of information retrieval (Croch and King, sisting of 5,860 verbs in 3,713 synsets.
    [Show full text]