SemEval-2017 Task 9: Abstract Meaning Representation Parsing and Generation

Jonathan May and Jay Priyadarshi
Information Sciences Institute, Computer Science Department
University of Southern California
[email protected], [email protected]

Abstract

In this report we summarize the results of the 2017 AMR SemEval shared task. The task consisted of two separate yet related subtasks. In the parsing subtask, participants were asked to produce Abstract Meaning Representation (AMR) (Banarescu et al., 2013) graphs for a set of English sentences in the biomedical domain. In the generation subtask, participants were asked to generate English sentences given AMR graphs in the news/forum domain. A total of five sites participated in the parsing subtask, and four participated in the generation subtask. Along with a description of the task and the participants' systems, we show various score ablations and some sample outputs.

(f / fear-01
    :polarity "-"
    :ARG0 (s / soldier)
    :ARG1 (d / die-01
        :ARG1 s))

The soldier was not afraid of dying.
The soldier was not afraid to die.
The soldier did not fear death.

Figure 1: An Abstract Meaning Representation (AMR) with several English renderings. Example borrowed from Pust et al. (2015).

1 Introduction

Abstract Meaning Representation (AMR) is a compact, readable, whole-sentence semantic annotation (Banarescu et al., 2013). It includes entity identification and typing, PropBank semantic roles (Kingsbury and Palmer, 2002), individual entities playing multiple roles, as well as treatments of modality, negation, etc. AMR abstracts in numerous ways, e.g., by assigning the same conceptual structure to fear (v), fear (n), and afraid (adj). Figure 1 gives an example.

In 2016 an AMR parsing shared task was held at SemEval (May, 2016). Task participants demonstrated several new directions in AMR parsing technology and also validated the strong performance of existing parsers. We sought, in 2017, to focus AMR parsing performance on the biomedical domain, for which a not insignificant but still relatively small training corpus had been produced. While sentences from this domain are quite formal compared to some of those evaluated in last year's task, they are also very complex, and have many terms unique to the domain. An example is shown in Figure 2. We continue to use Smatch (Cai and Knight, 2013) as a metric for AMR parsing, but we perform additional ablative analysis using the approach proposed by Damonte et al. (2016).

Along with parsing into AMR, it is important to encourage improvements in automatic generation of natural language (NL) text from AMR. Humans favor communication in NL. An AI that is able to parse text into AMR at a quality level indistinguishable from humans may be said to understand NL, but without the ability to render its own semantic representations into NL no human will ever be able to appreciate this.

The advent of several systems that generate English text from AMR input (Flanigan et al., 2016b; Pourdamghani et al., 2016) inspired us to conduct a generation-based shared task from AMRs in the news/discussion forum domain. For the generation subtask, we solicited human judgments of sentence quality. We followed the precedent established by the Workshop in Machine Translation (Bojar et al., 2016) and used the Appraise solicitation system (Federmann, 2012), lightly modified, to gather human rankings, then TrueSkill (Sakaguchi et al., 2014) to elicit an overall system ranking.
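As a rough illustration of how such pairwise preference judgments can be aggregated into a single system ranking, the sketch below uses the open-source Python trueskill package. This is not the task's actual evaluation pipeline; the system names and judgments here are invented for illustration.

# Minimal sketch, assuming the pip-installable "trueskill" package:
# turn hypothetical pairwise preference judgments into a system ranking.
import trueskill

systems = ["sysA", "sysB", "sysC"]
ratings = {s: trueskill.Rating() for s in systems}  # default mu=25, sigma≈8.33

# Hypothetical judgments: (preferred system, dispreferred system)
judgments = [("sysA", "sysB"), ("sysA", "sysC"), ("sysB", "sysC"), ("sysC", "sysB")]

for winner, loser in judgments:
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Rank by a conservative skill estimate (mu - 3*sigma), as is common with TrueSkill.
for name in sorted(systems, key=lambda s: ratings[s].mu - 3 * ratings[s].sigma, reverse=True):
    r = ratings[name]
    print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}")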
Since the same training data and tools are available to both subtasks (though, in the case of the generation subtask, the utility of the Bio-AMR corpus is unclear), we will describe all the resources for both subtasks in Sections 2 and 3, but then will handle descriptions and ablations for the parsing and generation subtasks separately, in, respectively, Sections 4 and 5. Readers interested in only one of these subtasks should not feel compelled to read the other section. We will reconvene in Section 6 to conclude and discuss hardware, as we continue the tradition established last year in the awarding of trophies to the declared winners of each subtask.

2 Data

LDC released a new corpus of AMRs (LDC2016E25), created as part of the DARPA DEFT program, in March of 2016. The new corpus, which was annotated by teams at SDL, LDC, and the University of Colorado, and supervised by Ulf Hermjakob at USC/ISI, is an extension of previous releases (LDC2015E86, LDC2014E41, and LDC2014T12). It contains 39,260 sentences (subsuming, in turn, the 19,572 AMRs from LDC2015E86, the 18,779 AMRs from LDC2014E41, and the 13,051 AMRs from LDC2014T12), partitioned into training, development, and test splits, drawn from a variety of news and discussion forum sources. Participants in the generation task only were provided with AMRs for an additional 1,293 sentences for evaluation; the original sentences were also provided, as needed, to human evaluators during the human evaluation phase of the generation subtask (see Section 5.2). These sentences and their corresponding AMRs were sequestered and never released as data before the evaluation phase.

We also made available the Bio-AMR corpus version 0.8, which consists of 6,452 AMR annotations of sentences from cancer-related PubMed articles, covering 3 full papers (PMIDs 24651010, 11777939, and 15630473) as well as the result sections of 46 additional PubMed papers. The corpus also includes about 1000 sentences each from the BEL BioCreative training corpus and the Chicago Corpus. The Bio-AMR corpus was partitioned into training, development, and test splits. An additional 500 sentences and their AMRs were sequestered until the evaluation phase, at which point the sentences were provided to parsing task participants only. Table 1 summarizes the available data, including the split sizes.

Interestingly, serpinE2 mRNA and protein were also markedly enhanced in human CRC cells exhibiting mutation in <i>KRAS </i>and <i>BRAF</i>.

(e / enhance-01 :li 2
    :ARG1 (a3 / and
        :op1 (n6 / nucleic-acid
            :name (n / name :op1 "mRNA")
            :ARG0-of (e2 / encode-01
                :ARG1 p))
        :op2 (p / protein
            :name (n2 / name :op1 "serpinE2")))
    :manner (m / marked)
    :mod (a2 / also)
    :location (c / cell
        :ARG0-of (e3 / exhibit-01
            :ARG1 (m2 / mutate-01
                :ARG1 (a4 / and
                    :op1 (g / gene
                        :name (n4 / name :op1 "KRAS"))
                    :op2 (g2 / gene
                        :name (n5 / name :op1 "BRAF")))))
        :mod (h / human)
        :mod (d / disease
            :name (n3 / name :op1 "CRC")))
    :manner (i / interesting))

Figure 2: One of the simpler biomedical domain sentences and its AMR. Note that the italics markers in the original sentence are preserved, as they are semantically important to the sentence's understanding.

Corpus                                   Domain      Train   Dev    Test   Eval
LDC2016E25                               News/Forum  36,521  1,368  1,371  N/A
Bio-AMR v0.8                             Biomedical   5,452    500    500  N/A
Parsing evaluation set                   Biomedical     N/A    N/A    N/A    500
Generation evaluation set (LDC2016R33)   News/Forum     N/A    N/A    N/A  1,293

Table 1: A summary of data used in this task; split sizes indicate the number of AMRs per sub-corpus.

3 Other Resources

We made the following resources available to participants:

• The tokenizer (from Ulf Hermjakob) used to produce the tokenized sentences in the training corpus (http://alt.qcri.org/semeval2016/task8/data/uploads/tokenizer.tar.gz).

• The AMR specification, used by annotators in producing the AMRs (https://github.com/kevincrawfordknight/amr-guidelines/blob/master/amr.md).

• The JAMR (Flanigan et al., 2014) and CAMR (Wang et al., 2015a) parsers, as strong parser baselines (https://github.com/jflanigan/jamr, https://github.com/c-amr/camr).

• The JAMR (Flanigan et al., 2016b) generation system, as a strong generation baseline.

• An unsupervised AMR-to-English aligner (Pourdamghani et al., 2014) (http://isi.edu/~damghani/papers/Aligner.zip).

• The same Smatch (Cai and Knight, 2013) scoring script used in the evaluation (https://github.com/snowblink14/smatch).

• A Python AMR manipulation library, from Nathan Schneider (https://github.com/nschneid/amr-hackathon).
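As a small illustration of programmatic AMR manipulation, the sketch below parses the Figure 1 AMR and enumerates its concepts and relations. It uses the third-party penman library rather than any of the resources listed above (the released Python library was Nathan Schneider's amr-hackathon), so it should be read as an illustrative example only.

# Minimal sketch, assuming the pip-installable "penman" library.
import penman

amr_string = '''
(f / fear-01
    :polarity "-"
    :ARG0 (s / soldier)
    :ARG1 (d / die-01
        :ARG1 s))
'''

graph = penman.decode(amr_string)

# Concept (instance) triples, e.g. ('f', ':instance', 'fear-01')
for source, role, target in graph.instances():
    print(source, role, target)

# Relation edges between variables, e.g. ('f', ':ARG0', 's'); the reentrant
# use of "s" under die-01 appears here as a second edge pointing at "s".
for source, role, target in graph.edges():
    print(source, role, target)

# Re-serialize with standard indentation.
print(penman.encode(graph, indent=4))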
Figure 3: The Appraise interface, adapted for AMR generation evaluation.

4 Parsing Sub-Task

In the parsing sub-task, participants were given 500 previously sequestered Bio-AMRs and were asked to produce AMR graphs. Main results and ablative results are shown in Table 2.

4.1 Systems

Five teams participated in the task, a noticeable decline from last year's task, which saw eleven full participants. One team submitted two systems, for a total of six distinct systems. Two teams were repeats from last year: CMU and RIGOTRIO (previously RIGA). Below are brief descriptions of each of the various systems, based on summaries provided by the system authors. Readers are encouraged to consult individual system description papers or relevant conference paper descriptions for more details.

4.1.1 The Meaning Factory (van Noord and Bos, 2017)

This team submitted two parsers. TMF-1 is a character-level sequence-to-sequence deep learn-

                    Smatch  Unlab.  No WSD  NER   Wiki
TMF-1               0.46    0.50    0.46    0.51  0.46
TMF-2               0.58    0.63    0.58    0.58  0.40
UIT-DANGNT-CLNLP    0.61    0.65    0.61    0.66  0.35
Oxford              0.59    0.63    0.59    0.66  0.18
CMU                 0.44    0.47    0.44    0.48  0.59
RIGOTRIO            0.54    0.59    0.54    0.46  0.00

(a) Main parsing results and four ablations

Smatch Neg. Concepts Reent.