GCN-Sem at SemEval-2019 Task 1: Semantic Parsing using Graph Convolutional and Recurrent Neural Networks

Shiva Taslimipoor, Omid Rohanian, Sara Može
Research Group in Computational Linguistics, University of Wolverhampton, UK
{shiva.taslimi, omid.rohanian, s.moze}@wlv.ac.uk

Abstract

This paper describes the system submitted to the SemEval 2019 shared task 1, 'Cross-lingual Semantic Parsing with UCCA'. We rely on the semantic dependency parse trees provided in the shared task, which are converted from the original UCCA files, and model the task as tagging. The aim is to predict the graph structure of the output along with the types of relations among the nodes. Our proposed neural architecture is composed of Graph Convolution and BiLSTM components. The layers of the system share their weights while predicting dependency links and semantic labels. The system is applied to the CONLLU format of the input data and is best suited for semantic dependency parsing.

1 Introduction

Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013) is a semantically motivated approach to grammatical representation inspired by typological theories of grammar (Dixon, 2012) and the Cognitive Linguistics literature (Croft and Cruse, 2004). In parsing, bi-lexical dependencies that are based on binary head-argument relations between lexical units are commonly employed in the representation of syntax (Nivre et al., 2007; Chen and Manning, 2014) and semantics (Hajič et al., 2012; Oepen et al., 2014; Dozat and Manning, 2018).

UCCA differs significantly from traditional dependency approaches in that it attempts to abstract away traditional syntactic structures and relations in favour of employing purely semantic distinctions to analyse sentence structure. The shared task, 'Cross-lingual Semantic Parsing with UCCA' (Hershcovich et al., 2019), consists in parsing English, German, and French datasets using the UCCA semantic tagset. In order to enable multi-task learning, the UCCA-annotated data is automatically converted to other parsing formats, e.g. Abstract Meaning Representation (AMR) and Semantic Dependency Parsing (SDP), inter alia (Hershcovich et al., 2018).

Although the schemes are formally different, they have shared semantic content. In order to perform our experiments, we target the converted CONLLU format, which corresponds to traditional bi-lexical dependencies, and rely on the conversion methodology provided in the shared task (Hershcovich et al., 2019) to attain UCCA graphs.

UCCA graphs contain both explicit and implicit units.[1] However, in bi-lexical dependencies, nodes are text tokens and semantic relations are direct bi-lexical relations between the tokens. The conversion between the two formats results in a partial loss of information. Nonetheless, we believe that it is worth trying to model the task using one of the available formats (i.e. semantic dependency parsing), which is very popular among NLP researchers.

Typically, transition-based methods are used in syntactic (Chen and Manning, 2014) and semantic (Hershcovich et al., 2017) dependency parsing. By contrast, our proposed system shares several similarities with sequence-to-sequence neural architectures, as it does not specifically deal with parsing transitions. Our model uses word, POS and syntactic dependency tree representations as input and directly produces an edge-labeled graph representation for each sentence (i.e. edges and their labels as two separate outputs). This multi-label neural architecture, which consists of a BiLSTM and a Graph Convolutional Network (GCN), is described in Section 3.

[1] Explicit units (terminal nodes) correspond to tokens in the text, but implicit (semantic) units have no corresponding component in the text.

2 Related Work

A recent trend in parsing research is sequence-to-sequence learning (Vinyals et al., 2015b; Kitaev and Klein, 2018), which is inspired by Neural Machine Translation. These methods ignore explicit structural information in favour of relying on long-term memory, attention mechanisms (content-based or position-based) (Kitaev and Klein, 2018), or pointer networks (Vinyals et al., 2015a). By doing so, high-order features are implicitly captured, which results in competitive parsing performance (Jia and Liang, 2016).

Sequence-to-sequence learning has been particularly effective in semantic role labeling (SRL) (Zhou and Xu, 2015). By augmenting these models with syntactic information, researchers have been able to develop state-of-the-art systems for SRL (Marcheggiani and Titov, 2017; Strubell et al., 2018).

As information derived from dependency parse trees can significantly contribute towards understanding the semantics of a sentence, a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is used to help our system perform semantic parsing while attending to structural syntactic information. The architecture is similar to the GCN component employed in Rohanian et al. (2019) for detecting gappy multiword expressions.

3 Methodology

For this task, we employ a neural architecture utilising structural features to predict semantic parsing tags for each sentence. The system maps a sentence from the source language to a probability distribution over the tags for all the words in the sentence. Our architecture consists of a GCN layer (Kipf and Welling, 2017), a bidirectional LSTM, and a final dense layer on top.

The inputs to our system are sequences of words, alongside their corresponding POS and named-entity tags.[2] Word tokens are represented by contextualised ELMo embeddings (Peters et al., 2018), and POS and named-entity tags are one-hot encoded. We also use sentence-level syntactic dependency parse information as input to the system. In the GCN layer, the convolution filters operate based on the structure of the dependency tree (rather than the sequential order of words).

[2] spaCy (Honnibal and Johnson, 2015) is used to generate POS, named-entity and syntactic dependency tags.
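Although the paper does not include its preprocessing code, the sketch below illustrates how the three adjacency masks consumed by the GCN layer could be built from a spaCy dependency parse (the three relation types are defined in the Graph Convolution paragraph below). The function name and the choice of spaCy model are our own assumptions.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; the paper only states that spaCy is used

def dependency_adjacencies(sentence):
    """Build the three (n x n) masks used by the GCN layer:
    head-to-dependent, dependent-to-head, and self-loops."""
    doc = nlp(sentence)
    n = len(doc)
    head_to_dep = np.zeros((n, n))
    dep_to_head = np.zeros((n, n))
    self_loops = np.eye(n)
    for token in doc:
        if token.head.i != token.i:  # spaCy marks the root as its own head
            head_to_dep[token.head.i, token.i] = 1.0
            dep_to_head[token.i, token.head.i] = 1.0
    return head_to_dep, dep_to_head, self_loops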
Graph Convolution. Convolutional Neural Networks (CNNs), as originally conceived, are sequential in nature, acting as detectors of N-grams (Kim, 2014), and are often used as feature-generating front-ends in deep neural networks. Graph Convolutional Networks (GCNs) have been introduced as a way to integrate rich structural relations, such as syntactic graphs, into the convolution process.

In the context of a syntax tree, a GCN can be understood as a non-linear activation function f and a filter W with a bias term b:

    c = f( Σ_{i ∈ r(v)} W x_i + b )    (1)

where r(v) denotes all the words in relation with a given word v in a sentence, and c represents the output of the convolution. Using adjacency matrices, we define graph relations as mask filters for the inputs (Kipf and Welling, 2017; Schlichtkrull et al., 2017).

In the present task, information from each graph corresponds to a sentence-level dependency parse tree. Given the filter W_s and bias b_s, we can therefore define the sentence-level GCN as follows:

    C = f( W_s X^T A + b_s )    (2)

where X_{n×v}, A_{n×n}, and C_{o×n} are tensor representations of the words, the adjacency matrix, and the convolution output, respectively.[3]

[3] o: output dimension; v: word vector dimension; n: sentence length.

In Kipf and Welling (2017), a separate adjacency matrix is constructed for each relation; by contrast, to avoid over-parametrising the model, ours is limited to the following three types of relations: 1) the head to the dependents, 2) the dependents to the head, and 3) each word to itself (self-loops), similar to Marcheggiani and Titov (2017). The final output is the maximum of the weights from the three individual adjacency matrices.

The model architecture is depicted in Figure 1.

[Figure 1: A GCN-based recurrent architecture. Word Representation → GCN (Root to Dependency) / GCN (Dependency to Root) / GCN (Root) → Max → BiLSTM → FFN.]
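As a concrete illustration of Eq. (2) combined with the element-wise maximum over the three relation types, consider the following numpy sketch. The paper does not specify the activation function f, so ReLU is assumed here; variable names mirror the notation above.

import numpy as np

def sentence_gcn(X, adjacencies, W_s, b_s):
    """Sentence-level GCN of Eq. (2).
    X: (n, v) word vectors; adjacencies: three (n, n) masks
    (head-to-dependent, dependent-to-head, self-loops);
    W_s: (o, v) filter; b_s: (o, 1) bias. Returns C: (o, n)."""
    outputs = []
    for A in adjacencies:
        # Eq. (2): C = f(W_s X^T A + b_s), with f = ReLU (assumed).
        C = np.maximum(0.0, W_s @ X.T @ A + b_s)
        outputs.append(C)
    # Final output: element-wise maximum over the three relation types.
    return np.maximum.reduce(outputs)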

4 Experiments

Our system participated in the closed track for English and German and the open track for French. We exclusively used the data provided in the shared task. The system is trained on the training data only, and the parameters are optimised using the development set. The results are reported on blind-test data in both in-domain and out-of-domain settings. We focus on predicting the primary edges of UCCA semantic relations and their labels.

4.1 Data

The datasets of the shared task are devised for four settings: 1) English in-domain, using the Wiki corpus; 2) English out-of-domain, using the Wiki corpus as training and development data, and 20K Leagues as test data; 3) German in-domain, using the 20K Leagues corpus; 4) a French setting with no training data (except trial data), using the 20K Leagues corpus as development and test data.

Whilst the annotated files used by the shared task organisers are in the XML format, several other formats are also available. We decided to use CONLLU, as it is more interpretable. However, according to the shared task description,[4] the conversion between XML and CONLLU, which is a necessary step before evaluation, is lossy. Hershcovich et al. (2017) used the same procedure of applying dependency parsing methods to CONLLU files and converting the predictions back to UCCA.

[4] https://competitions.codalab.org/competitions/19160#fn1
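For illustration, the converted data represents each sentence as bi-lexical dependencies in CONLLU-style columns, with a head index and a UCCA category for every token. The fragment below is a simplified, hypothetical encoding of the sentence 'John kicked the ball'; the actual column layout and label inventory in the shared task files may differ.

# ID   FORM     HEAD   UCCA-CATEGORY
1      John     2      A      # participant
2      kicked   0      root   # the scene's process heads the sentence
3      the      4      F      # function word
4      ball     2      A      # participant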

4.2 Settings

We trained ELMo on each of the shared task datasets using the system implemented by Che et al. (2018). The embedding dimension is set to 1024. The number of nodes is 256 for the GCN and 300 for the BiLSTM, and we applied a dropout of 0.5 after each layer. We used the Adam optimiser for compiling the model.

We tested our model in four different settings, as explained in Section 4.1. The parameters are optimised on the English Wiki development data (batch size = 16 and number of epochs = 100) and used for all four settings. As no training data was available for French, the system trained on English Wiki was used to parse the French sentences of 20K Leagues. For this reason, the French model is evaluated within the open track.
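The phrase 'compiling the model' suggests a Keras-style implementation, so the sketch below expresses the architecture and hyperparameters above in Keras. Only the GCN/BiLSTM sizes, dropout, ELMo dimension, and optimiser come from the paper; the maximum sentence length, feature dimension, label count, head-prediction parameterisation, and loss are our own assumptions.

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN = 50      # assumed maximum sentence length
ELMO_DIM = 1024   # ELMo embedding dimension (Section 4.2)
FEAT_DIM = 64     # assumed size of one-hot POS + NER features
N_LABELS = 15     # assumed number of UCCA edge categories

tokens = layers.Input(shape=(MAX_LEN, ELMO_DIM), name="elmo")
feats = layers.Input(shape=(MAX_LEN, FEAT_DIM), name="pos_ner")
adj = layers.Input(shape=(3, MAX_LEN, MAX_LEN), name="adjacency")  # 3 relation types

x = layers.Concatenate()([tokens, feats])

def gcn_branch(r):
    # One graph convolution per relation type: A_r (X W_r + b_r),
    # with a separate filter W_r instantiated for each branch.
    a_r = layers.Lambda(lambda t: t[:, r])(adj)   # (batch, n, n)
    h = layers.Dense(256, use_bias=True)(x)       # X W_r + b_r
    return layers.Lambda(lambda ts: tf.matmul(ts[0], ts[1]))([a_r, h])

g = layers.Maximum()([gcn_branch(r) for r in range(3)])  # element-wise max
g = layers.Activation("relu")(g)
g = layers.Dropout(0.5)(g)

h = layers.Bidirectional(layers.LSTM(300, return_sequences=True))(g)
h = layers.Dropout(0.5)(h)

# Two outputs per token: its head position (edge) and its UCCA label.
# Treating head prediction as a softmax over positions is our assumption.
edges = layers.Dense(MAX_LEN + 1, activation="softmax", name="edges")(h)
labels = layers.Dense(N_LABELS, activation="softmax", name="labels")(h)

model = Model(inputs=[tokens, feats, adj], outputs=[edges, labels])
model.compile(optimizer="adam", loss="categorical_crossentropy")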

4.3 Official Evaluation

Our model predicts two outputs for each dataset: primary edges and their labels (UCCA semantic categories).[5]

Table 1 shows the performance (in terms of precision, recall, and F1-score) for predicting primary edges in both labeled (i.e. with semantic tags) and unlabeled (i.e. ignoring semantic tags) settings. Table 2 shows F1-scores for each semantic category separately. Although the overall performance of the system, as shown in the official evaluation in Table 1, is not particularly impressive, there are a few results worth reporting. These are listed in Table 2.

Our system is ranked second in predicting four relations, i.e. L (Linker), N (Connector), R (Relator), and G (Ground), in all settings. A plausible explanation would be that these relations are somewhat less affected by the loss of information incurred as a result of the conversions between formats.

[5] For more details about UCCA semantic categories and the way they are used for the shared task, see https://competitions.codalab.org/competitions/19160#learn_the_details-overview. Our system does not predict remote edges defined in UCCA.
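For clarity, the labeled and unlabeled scores in Table 1 follow the usual edge-overlap definitions of precision, recall, and F1. The sketch below illustrates the computation; it is our own helper, not the shared task's official evaluation script.

def edge_prf(gold, predicted, labeled=True):
    """gold, predicted: sets of (head, dependent, label) triples."""
    if not labeled:
        # Unlabeled setting: ignore the semantic tag and compare
        # (head, dependent) pairs only.
        gold = {(h, d) for h, d, _ in gold}
        predicted = {(h, d) for h, d, _ in predicted}
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1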

                             ------------ labeled ------------    ----------- unlabeled -----------
dataset             track    Avg. F1   P       R       F1         Avg. F1   P       R       F1
UCCA English-Wiki   closed   0.657     0.673   0.655   0.664      0.809     0.829   0.807   0.818
UCCA English-20K    closed   0.626     0.632   0.642   0.637      0.800     0.808   0.821   0.814
UCCA German-20K     closed   0.710     0.720   0.720   0.720      0.851     0.863   0.862   0.862
UCCA French-20K*    open     0.438     0.443   0.447   0.445      0.690     0.698   0.705   0.702

Table 1: Official results of the shared task evaluation for predicting primary edges and their labels. (* The results for French are for post-evaluation.)

dataset        (D)    (C)    (N)    (E)    (F)    (G)    (L)    (H)    (A)    (P)    (U)    (R)    (S)    (Terminal)
English-Wiki   0.700  0.708  0.866  0.738  0.801  0.286  0.836  0.289  0.582  0.451  0.948  0.914  0.000  0.997
English-20K    0.521  0.733  0.776  0.743  0.647  0.040  0.719  0.248  0.538  0.527  0.978  0.844  0.000  0.997
German-20K     0.691  0.813  0.796  0.820  0.845  0.778  0.834  0.375  0.697  0.561  0.997  0.916  0.000  0.998
French-20K*    0.223  0.569  0.579  0.551  0.378  0.000  0.536  0.118  0.314  0.358  0.987  0.711  0.000  0.993

Table 2: Official results of the shared task evaluation for predicting different semantic category labels (F1-scores). (* The results for French are for post-evaluation.)

5 Discussion

Our neural model is applied to UCCA corpora, which are converted to bi-lexical semantic dependency graphs and represented in the CONLLU format. The conversion from UCCA annotations to CONLLU tags appears to have a distinctly negative impact on the system's overall performance. As reported in the shared task description, converting the English Wiki corpus to the CONLLU format and back to the standard format results in an F1-score of only 89.7 for primary labeled edges. This means that our system cannot go beyond this upper limit.

Since our system is trained on CONLLU files and the evaluation involves converting the CONLLU format back to the standard UCCA format, the reported results for our system can be misleading. In order to further investigate this issue, we performed an evaluation using the English Wiki development data, comparing the predicted labels with the gold standard for the development set in the CONLLU format. The average F1-score for labelled edges was 0.71, compared to the 0.685 score our system achieved on the development set using the official evaluation script.

This clearly demonstrates that our system fares significantly better if it receives its input in the form of bi-lexical dependency graphs. Therefore, the system is best suited for semantic dependency parsing, although we believe that promising results could also be achieved in UCCA annotation if the conversion between the CONLLU and UCCA formats were improved to map and preserve information more accurately.

6 Conclusion and Future Work

In this paper, we described the system we submitted to SemEval-2019 Task 1, 'Cross-lingual Semantic Parsing with UCCA', which performs semantic parsing using Graph Convolutional and Recurrent Neural Networks. The model performs semantic parsing using information derived from syntactic dependencies between the words in each sentence. We developed the model using a combination of GCN and BiLSTM components. Due to the penalisation resulting from the use of lossy CONLLU files, we argue that the results cannot be directly compared with those of the other task participants.[6]

In the future, we would like to build on the work presented in this paper by applying the architecture to the standard UCCA dataset, or possibly by training the system to perform bi-lexical semantic dependency annotation.

[6] The code is available at https://github.com/shivaat/GCN-Sem.

References

Omri Abend and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proc. of ACL, pages 228–238.

Wanxiang Che, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Brussels, Belgium. Association for Computational Linguistics.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750. Association for Computational Linguistics.

William Croft and D. A. Cruse. 2004. Cognitive Linguistics. Cambridge University Press.

Robert M. W. Dixon. 2012. Basic Linguistic Theory. Oxford University Press.

Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490. Association for Computational Linguistics.

Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Ondřej Bojar, Silvie Cinková, Eva Fučíková, Marie Mikulová, Petr Pajas, Jan Popelka, Jiří Semecký, Jana Šindlerová, Jan Štěpánek, Josef Toman, Zdeňka Urešová, and Zdeněk Žabokrtský. 2012. Announcing Prague Czech-English Dependency Treebank 2.0. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), pages 3153–3160. ELRA, European Language Resources Association.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proc. of ACL, pages 1127–1138.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2018. Multitask parsing across semantic representations. In Proc. of ACL, pages 373–385.

Daniel Hershcovich, Zohar Aizenbud, Leshem Choshen, Elior Sulem, Ari Rappoport, and Omri Abend. 2019. SemEval-2019 Task 1: Cross-lingual semantic parsing with UCCA.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22. Association for Computational Linguistics.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1506–1515. Association for Computational Linguistics.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02).

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Yi Zhang. 2014. SemEval 2014 Task 8: Broad-coverage semantic dependency parsing. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 63–72.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Omid Rohanian, Shiva Taslimipoor, Samaneh Kouchaki, Le An Ha, and Ruslan Mitkov. 2019. Bridging the gap: Attending to discontinuity in identification of multiword expressions.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2017. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038. Association for Computational Linguistics.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015a. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015b. Grammar as a foreign language. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2773–2781. Curran Associates, Inc.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1127–1137. Association for Computational Linguistics.
