English Propbank Annotation Guidelines
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Abstraction and Generalisation in Semantic Role Labels: Propbank
Abstraction and Generalisation in Semantic Role Labels: PropBank, VerbNet or both? Paola Merlo Lonneke Van Der Plas Linguistics Department Linguistics Department University of Geneva University of Geneva 5 Rue de Candolle, 1204 Geneva 5 Rue de Candolle, 1204 Geneva Switzerland Switzerland [email protected] [email protected] Abstract The role of theories of semantic role lists is to obtain a set of semantic roles that can apply to Semantic role labels are the representa- any argument of any verb, to provide an unam- tion of the grammatically relevant aspects biguous identifier of the grammatical roles of the of a sentence meaning. Capturing the participants in the event described by the sentence nature and the number of semantic roles (Dowty, 1991). Starting from the first proposals in a sentence is therefore fundamental to (Gruber, 1965; Fillmore, 1968; Jackendoff, 1972), correctly describing the interface between several approaches have been put forth, ranging grammar and meaning. In this paper, we from a combination of very few roles to lists of compare two annotation schemes, Prop- very fine-grained specificity. (See Levin and Rap- Bank and VerbNet, in a task-independent, paport Hovav (2005) for an exhaustive review). general way, analysing how well they fare In NLP, several proposals have been put forth in in capturing the linguistic generalisations recent years and adopted in the annotation of large that are known to hold for semantic role samples of text (Baker et al., 1998; Palmer et al., labels, and consequently how well they 2005; Kipper, 2005; Loper et al., 2007). The an- grammaticalise aspects of meaning. -
A Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling
VerbAtlas: a Novel Large-Scale Verbal Semantic Resource and Its Application to Semantic Role Labeling Andrea Di Fabio}~, Simone Conia}, Roberto Navigli} }Department of Computer Science ~Department of Literature and Modern Cultures Sapienza University of Rome, Italy {difabio,conia,navigli}@di.uniroma1.it Abstract The automatic identification and labeling of ar- gument structures is a task pioneered by Gildea We present VerbAtlas, a new, hand-crafted and Jurafsky(2002) called Semantic Role Label- lexical-semantic resource whose goal is to ing (SRL). SRL has become very popular thanks bring together all verbal synsets from Word- to its integration into other related NLP tasks Net into semantically-coherent frames. The such as machine translation (Liu and Gildea, frames define a common, prototypical argu- ment structure while at the same time pro- 2010), visual semantic role labeling (Silberer and viding new concept-specific information. In Pinkal, 2018) and information extraction (Bas- contrast to PropBank, which defines enumer- tianelli et al., 2013). ative semantic roles, VerbAtlas comes with In order to be performed, SRL requires the fol- an explicit, cross-frame set of semantic roles lowing core elements: 1) a verb inventory, and 2) linked to selectional preferences expressed in terms of WordNet synsets, and is the first a semantic role inventory. However, the current resource enriched with semantic information verb inventories used for this task, such as Prop- about implicit, shadow, and default arguments. Bank (Palmer et al., 2005) and FrameNet (Baker We demonstrate the effectiveness of VerbAtlas et al., 1998), are language-specific and lack high- in the task of dependency-based Semantic quality interoperability with existing knowledge Role Labeling and show how its integration bases. -
How Do BERT Embeddings Organize Linguistic Knowledge?
How Do BERT Embeddings Organize Linguistic Knowledge? Giovanni Puccettiy , Alessio Miaschi? , Felice Dell’Orletta y Scuola Normale Superiore, Pisa ?Department of Computer Science, University of Pisa Istituto di Linguistica Computazionale “Antonio Zampolli”, Pisa ItaliaNLP Lab – www.italianlp.it [email protected], [email protected], [email protected] Abstract et al., 2019), we proposed an in-depth investigation Several studies investigated the linguistic in- aimed at understanding how the information en- formation implicitly encoded in Neural Lan- coded by BERT is arranged within its internal rep- guage Models. Most of these works focused resentation. In particular, we defined two research on quantifying the amount and type of in- questions, aimed at: (i) investigating the relation- formation available within their internal rep- ship between the sentence-level linguistic knowl- resentations and across their layers. In line edge encoded in a pre-trained version of BERT and with this scenario, we proposed a different the number of individual units involved in the en- study, based on Lasso regression, aimed at understanding how the information encoded coding of such knowledge; (ii) understanding how by BERT sentence-level representations is ar- these sentence-level properties are organized within ranged within its hidden units. Using a suite of the internal representations of BERT, identifying several probing tasks, we showed the existence groups of units more relevant for specific linguistic of a relationship between the implicit knowl- tasks. We defined a suite of probing tasks based on edge learned by the model and the number of a variable selection approach, in order to identify individual units involved in the encodings of which units in the internal representations of BERT this competence. -
Treebanks, Linguistic Theories and Applications Introduction to Treebanks
Treebanks, Linguistic Theories and Applications Introduction to Treebanks Lecture One Petya Osenova and Kiril Simov Sofia University “St. Kliment Ohridski”, Bulgaria Bulgarian Academy of Sciences, Bulgaria ESSLLI 2018 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Plan of the Lecture • Definition of a treebank • The place of the treebank in the language modeling • Related terms: parsebank, dynamic treebank • Prerequisites for the creation of a treebank • Treebank lifecycle • Theory (in)dependency • Language (in)dependency • Tendences in the treebank development 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Treebank Definition A corpus annotated with syntactic information • The information in the annotation is added/checked by a trained annotator - manual annotation • The annotation is complete - no unannotated fragments of the text • The annotation is consistent - similar fragments are analysed in the same way • The primary format of annotation - syntactic tree/graph 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Example Syntactic Trees from Wikipedia The two main approaches to modeling the syntactic information 30th European Summer School in Logic, Language and Information (6 August – 17 August 2018) Pros vs. Cons (Handbook of NLP, p. 171) Constituency • Easy to read • Correspond to common grammatical knowledge (phrases) • Introduce arbitrary complexity Dependency • Flexible • Also correspond to common grammatical -
An Arabic Wordnet with Ontologically Clean Content
Applied Ontology (2021) IOS Press The Arabic Ontology – An Arabic Wordnet with Ontologically Clean Content Mustafa Jarrar Birzeit University, Palestine [email protected] Abstract. We present a formal Arabic wordnet built on the basis of a carefully designed ontology hereby referred to as the Arabic Ontology. The ontology provides a formal representation of the concepts that the Arabic terms convey, and its content was built with ontological analysis in mind, and benchmarked to scientific advances and rigorous knowledge sources as much as this is possible, rather than to only speakers’ beliefs as lexicons typically are. A comprehensive evaluation was conducted thereby demonstrating that the current version of the top-levels of the ontology can top the majority of the Arabic meanings. The ontology consists currently of about 1,300 well-investigated concepts in addition to 11,000 concepts that are partially validated. The ontology is accessible and searchable through a lexicographic search engine (http://ontology.birzeit.edu) that also includes about 150 Arabic-multilingual lexicons, and which are being mapped and enriched using the ontology. The ontology is fully mapped with Princeton WordNet, Wikidata, and other resources. Keywords. Linguistic Ontology, WordNet, Arabic Wordnet, Lexicon, Lexical Semantics, Arabic Natural Language Processing Accepted by: 1. Introduction The importance of linguistic ontologies and wordnets is increasing in many application areas, such as multilingual big data (Oana et al., 2012; Ceravolo, 2018), information retrieval (Abderrahim et al., 2013), question-answering and NLP-based applications (Shinde et al., 2012), data integration (Castanier et al., 2012; Jarrar et al., 2011), multilingual web (McCrae et al., 2011; Jarrar, 2006), among others. -
Verbnet Based Citation Sentiment Class Assignment Using Machine Learning
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 9, 2020 VerbNet based Citation Sentiment Class Assignment using Machine Learning Zainab Amjad1, Imran Ihsan2 Department of Creative Technologies Air University, Islamabad, Pakistan Abstract—Citations are used to establish a link between time-consuming and complicated. To resolve this issue there articles. This intent has changed over the years, and citations are exists many researchers [7]–[9] who deal with the sentiment now being used as a criterion for evaluating the research work or analysis of citation sentences to improve bibliometric the author and has become one of the most important criteria for measures. Such applications can help scholars in the period of granting rewards or incentives. As a result, many unethical research to identify the problems with the present approaches, activities related to the use of citations have emerged. That is why unaddressed issues, and the present research gaps [10]. content-based citation sentiment analysis techniques are developed on the hypothesis that all citations are not equal. There are two existing approaches for Citation Sentiment There are several pieces of research to find the sentiment of a Analysis: Qualitative and Quantitative [7]. Quantitative citation, however, only a handful of techniques that have used approaches consider that all citations are equally important citation sentences for this purpose. In this research, we have while qualitative approaches believe that all citations are not proposed a verb-oriented citation sentiment classification for equally important [9]. The quantitative approach uses citation researchers by semantically analyzing verbs within a citation text count to rank a research paper [8] while the qualitative using VerbNet Ontology, natural language processing & four approach analyzes the nature of citation [10]. -
Senserelate::Allwords - a Broad Coverage Word Sense Tagger That Maximizes Semantic Relatedness
WordNet::SenseRelate::AllWords - A Broad Coverage Word Sense Tagger that Maximizes Semantic Relatedness Ted Pedersen and Varada Kolhatkar Department of Computer Science University of Minnesota Duluth, MN 55812 USA tpederse,kolha002 @d.umn.edu http://senserelate.sourceforge.net{ } Abstract Despite these difficulties, word sense disambigua- tion is often a necessary step in NLP and can’t sim- WordNet::SenseRelate::AllWords is a freely ply be ignored. The question arises as to how to de- available open source Perl package that as- velop broad coverage sense disambiguation modules signs a sense to every content word (known that can be deployed in a practical setting without in- to WordNet) in a text. It finds the sense of vesting huge sums in manual annotation efforts. Our each word that is most related to the senses answer is WordNet::SenseRelate::AllWords (SR- of surrounding words, based on measures found in WordNet::Similarity. This method is AW), a method that uses knowledge already avail- shown to be competitive with results from re- able in the lexical database WordNet to assign senses cent evaluations including SENSEVAL-2 and to every content word in text, and as such offers SENSEVAL-3. broad coverage and requires no manual annotation of training data. SR-AW finds the sense of each word that is most 1 Introduction related or most similar to those of its neighbors in the Word sense disambiguation is the task of assigning sentence, according to any of the ten measures avail- a sense to a word based on the context in which it able in WordNet::Similarity (Pedersen et al., 2004). -
Deep Linguistic Analysis for the Accurate Identification of Predicate
Deep Linguistic Analysis for the Accurate Identification of Predicate-Argument Relations Yusuke Miyao Jun'ichi Tsujii Department of Computer Science Department of Computer Science University of Tokyo University of Tokyo [email protected] CREST, JST [email protected] Abstract obtained by deep linguistic analysis (Gildea and This paper evaluates the accuracy of HPSG Hockenmaier, 2003; Chen and Rambow, 2003). parsing in terms of the identification of They employed a CCG (Steedman, 2000) or LTAG predicate-argument relations. We could directly (Schabes et al., 1988) parser to acquire syntac- compare the output of HPSG parsing with Prop- tic/semantic structures, which would be passed to Bank annotations, by assuming a unique map- statistical classifier as features. That is, they used ping from HPSG semantic representation into deep analysis as a preprocessor to obtain useful fea- PropBank annotation. Even though PropBank tures for training a probabilistic model or statistical was not used for the training of a disambigua- tion model, an HPSG parser achieved the ac- classifier of a semantic argument identifier. These curacy competitive with existing studies on the results imply the superiority of deep linguistic anal- task of identifying PropBank annotations. ysis for this task. Although the statistical approach seems a reason- 1 Introduction able way for developing an accurate identifier of Recently, deep linguistic analysis has successfully PropBank annotations, this study aims at establish- been applied to real-world texts. Several parsers ing a method of directly comparing the outputs of have been implemented in various grammar for- HPSG parsing with the PropBank annotation in or- malisms and empirical evaluation has been re- der to explicitly demonstrate the availability of deep ported: LFG (Riezler et al., 2002; Cahill et al., parsers. -
The Procedure of Lexico-Semantic Annotation of Składnica Treebank
The Procedure of Lexico-Semantic Annotation of Składnica Treebank Elzbieta˙ Hajnicz Institute of Computer Science, Polish Academy of Sciences ul. Ordona 21, 01-237 Warsaw, Poland [email protected] Abstract In this paper, the procedure of lexico-semantic annotation of Składnica Treebank using Polish WordNet is presented. Other semantically annotated corpora, in particular treebanks, are outlined first. Resources involved in annotation as well as a tool called Semantikon used for it are described. The main part of the paper is the analysis of the applied procedure. It consists of the basic and correction phases. During basic phase all nouns, verbs and adjectives are annotated with wordnet senses. The annotation is performed independently by two linguists. Multi-word units obtain special tags, synonyms and hypernyms are used for senses absent in Polish WordNet. Additionally, each sentence receives its general assessment. During the correction phase, conflicts are resolved by the linguist supervising the process. Finally, some statistics of the results of annotation are given, including inter-annotator agreement. The final resource is represented in XML files preserving the structure of Składnica. Keywords: treebanks, wordnets, semantic annotation, Polish 1. Introduction 2. Semantically Annotated Corpora It is widely acknowledged that linguistically annotated cor- Semantic annotation of text corpora seems to be the last pora play a crucial role in NLP. There is even a tendency phase in the process of corpus annotation, less popular than towards their ever-deeper annotation. In particular, seman- morphosyntactic and (shallow or deep) syntactic annota- tically annotated corpora become more and more popular tion. However, there exist semantically annotated subcor- as they have a number of applications in word sense disam- pora for many languages. -
Generating Lexical Representations of Frames Using Lexical Substitution
Generating Lexical Representations of Frames using Lexical Substitution Saba Anwar Artem Shelmanov Universitat¨ Hamburg Skolkovo Institute of Science and Technology Germany Russia [email protected] [email protected] Alexander Panchenko Chris Biemann Skolkovo Institute of Science and Technology Universitat¨ Hamburg Russia Germany [email protected] [email protected] Abstract Seed sentence: I hope PattiHelper can helpAssistance youBenefited party soonTime . Semantic frames are formal linguistic struc- Substitutes for Assistance: assist, aid tures describing situations/actions/events, e.g. Substitutes for Helper: she, I, he, you, we, someone, Commercial transfer of goods. Each frame they, it, lori, hannah, paul, sarah, melanie, pam, riley Substitutes for Benefited party: me, him, folk, her, provides a set of roles corresponding to the sit- everyone, people uation participants, e.g. Buyer and Goods, and Substitutes for Time: tomorrow, now, shortly, sooner, lexical units (LUs) – words and phrases that tonight, today, later can evoke this particular frame in texts, e.g. Sell. The scarcity of annotated resources hin- Table 1: An example of the induced lexical represen- ders wider adoption of frame semantics across tation (roles and LUs) of the Assistance FrameNet languages and domains. We investigate a sim- frame using lexical substitutes from a single seed sen- ple yet effective method, lexical substitution tence. with word representation models, to automat- ically expand a small set of frame-annotated annotated resources. Some publicly available re- sentences with new words for their respective sources are FrameNet (Baker et al., 1998) and roles and LUs. We evaluate the expansion PropBank (Palmer et al., 2005), yet for many lan- quality using FrameNet. -
Unified Language Model Pre-Training for Natural
Unified Language Model Pre-training for Natural Language Understanding and Generation Li Dong∗ Nan Yang∗ Wenhui Wang∗ Furu Wei∗† Xiaodong Liu Yu Wang Jianfeng Gao Ming Zhou Hsiao-Wuen Hon Microsoft Research {lidong1,nanya,wenwan,fuwei}@microsoft.com {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com Abstract This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirec- tional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of- the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm. 1 Introduction Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks [8, 29, 19, 31, 9, 1]. -
Towards a Cross-Linguistic Verbnet-Style Lexicon for Brazilian Portuguese
Towards a cross-linguistic VerbNet-style lexicon for Brazilian Portuguese Carolina Scarton, Sandra Alu´ısio Center of Computational Linguistics (NILC), University of Sao˜ Paulo Av. Trabalhador Sao-Carlense,˜ 400. 13560-970 Sao˜ Carlos/SP, Brazil [email protected], [email protected] Abstract This paper presents preliminary results of the Brazilian Portuguese Verbnet (VerbNet.Br). This resource is being built by using other existing Computational Lexical Resources via a semi-automatic method. We identified, automatically, 5688 verbs as candidate members of VerbNet.Br, which are distributed in 257 classes inherited from VerbNet. These preliminary results give us some directions of future work and, since the results were automatically generated, a manual revision of the complete resource is highly desirable. 1. Introduction the verb to load. To fulfill this gap, VerbNet has mappings The task of building Computational Lexical Resources to WordNet, which has deeper semantic relations. (CLRs) and making them publicly available is one of Brazilian Portuguese language lacks CLRs. There are some the most important tasks of Natural Language Processing initiatives like WordNet.Br (Dias da Silva et al., 2008), that (NLP) area. CLRs are used in many other applications is based on and aligned to WordNet. This resource is the in NLP, such as automatic summarization, machine trans- most complete for Brazilian Portuguese language. How- lation and opinion mining. Specially, CLRs that treat the ever, only the verb database is in an advanced stage (it syntactic and semantic behaviour of verbs are very impor- is finished, but without manual validation), currently con- tant to the tasks of information retrieval (Croch and King, sisting of 5,860 verbs in 3,713 synsets.