MaskParse@Deskin at SemEval-2019 Task 1: Cross-lingual UCCA Semantic Parsing using Recursive Masked Sequence Tagging

Gabriel Marzinotto (1,2)   Johannes Heinecke (1)   Géraldine Damnati (1)
(1) Orange Labs, Lannion, France
(2) Aix Marseille Univ, CNRS, LIS, Marseille, France
{gabriel.marzinotto,johannes.heinecke,geraldine.damnati}@orange.com

Abstract

This paper describes our recursive system for SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA. Each recursive step consists of two parts. We first perform semantic parsing using a sequence tagger to estimate the probabilities of the UCCA categories in the sentence. Then, we apply a decoding policy which interprets these probabilities and builds the graph nodes. Parsing is done recursively: we perform a first inference on the sentence to extract the main scenes and links, and then we recursively apply our model on the sentence using a masking feature that reflects the decisions made in previous steps. The process continues until the terminal nodes are reached. We chose a standard neural tagger and focused on our recursive parsing strategy and on the cross-lingual transfer problem to develop a robust model for the French language, using only a few training samples.

1 Introduction

Semantic representation is an essential part of NLP. For this reason, several semantic representation paradigms have been proposed. Among them we find PropBank (Palmer et al., 2005), FrameNet semantics (Baker et al., 1998), Abstract Meaning Representation (AMR) (Banarescu et al., 2013), Universal Decompositional Semantics (White et al., 2016) and Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013). These constantly improving representations, along with the advances in semantic parsing, have proven to be beneficial in many NLU tasks such as question answering (Shen and Lapata, 2007), text summarization (Genest and Lapalme, 2011), dialog systems (Tur et al., 2005), information extraction (Bastianelli et al., 2013) and machine translation (Liu and Gildea, 2010).

UCCA is a cross-lingual semantic representation scheme that has demonstrated applicability in English, French and German (with pilot annotation projects on Czech, Russian and Hebrew). Despite the newness of UCCA, it has proven useful for defining semantic evaluation measures in text-to-text generation and machine translation (Birch et al., 2016). UCCA represents the semantics of a sentence using directed acyclic graphs (DAGs), where terminal nodes correspond to text tokens, and non-terminal nodes to higher-level semantic units. Edges are labelled, indicating the role of a child in the relation to its parent. UCCA parsing is a recent task, and since UCCA has several unique properties, adapting syntactic parsers or parsers from other semantic representations is not straightforward. The current state-of-the-art parser, TUPA (Hershcovich et al., 2017), uses transition-based parsing to build UCCA representations. Building on previous work on FrameNet semantic parsing (Marzinotto et al., 2018a,b), we chose to perform UCCA parsing using sequence tagging methods along with a graph decoding policy. To do this, we propose a recursive strategy in which we perform a first inference on the sentence to extract the main scenes and links, and then recursively apply our model on the sentence with a masking mechanism at the input in order to feed information about the previous parsing decisions.

2 Model

Our system consists of a sequence tagger that is first applied on the sentence to extract the main scenes and links, and then recursively applied on the extracted elements to build the semantic graph. At each step of the recursion we use a masking mechanism to feed information about the previous stages into the model. In order to convert the sequence labels into nodes of the UCCA graph we also apply a decoding policy at each stage.
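To make the recursion concrete, here is a minimal Python sketch of this loop. It is our own illustration, not the authors' code: the `tagger` and `decode` callables stand in for the biGRU tagger (section 2) and the decoding policy (section 2.4), and the dictionary node format is an assumption.

```python
# Minimal sketch of the recursive parsing loop (our illustration, not the
# authors' implementation). `tagger` returns per-token BIO labels given the
# tokens and a mask; `decode` turns those labels into (span, arc_label) pairs.

def parse(tokens, tagger, decode, mask=None):
    """Recursively build UCCA nodes by re-tagging the sentence with a mask
    that encodes the decisions made at the previous step."""
    if mask is None:
        mask = ["INIT"] * len(tokens)      # initial step: extract scenes and links
    bio_labels = tagger(tokens, mask)      # per-token BIO labels over UCCA categories
    children = decode(tokens, bio_labels)  # (span, arc_label) pairs
    nodes = []
    for span, arc_label in children:
        # Mask for the next call: the arc label inside the child span,
        # "O" everywhere else (cf. section 2.1).
        child_mask = [arc_label if i in span else "O" for i in range(len(tokens))]
        subtree = []
        if len(span) > 1:  # recurse until terminal (single-token) nodes are reached
            subtree = parse(tokens, tagger, decode, child_mask)
        nodes.append({"span": span, "label": arc_label, "children": subtree})
    return nodes
```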
tion scheme, has demonstrated applicability in En- Our tagger is implemented using deep bi- directional GRU (biGRU). This simple architec- To train such a model, we build a new training ture is frequently used in semantic parsers across corpus in which the sentences are repeated several different representation paradigms. Besides its times. More precisely, a sentence appears N times flexibility, it is a powerful model, with close to (N being the number of non terminal nodes in the state of the art performance on both PropBank (He UCCA graph) each one a with different mask. et al., 2017) and FrameNet semantic parsing (Yang 2.2 Multi-Task UCCA Objective and Mitchell, 2017; Marzinotto et al., 2018b). More precisely, the model consists of a 4 layer Along with the UCCA-XML graph representa- bi-directional Gated Recurrent Unit (GRU) with tions, a simplified tree representation in CoNLL highway connections (Srivastava et al., 2015). Our format was also provided. Our model combines model uses has a rich set of features including syn- both representations using a multitask objective tactic, morphological, lexical and surface features, with two tasks. TASK1 consists in, for a given which have shown to be useful in language ab- node and its corresponding mask, predicting the stracted representations. The list is given below: children and their arc labels. TASK1 encodes the children spans using a BIO scheme. The • Word embeddings of 300 dimensions 1. TASK2 consists in predicting the CoNLL sim- • Syntactic dependencies of each token2. plified UCCA structure of the sentence. More • Part-of-speech and morphological features precisely, TASK2 is a sequence tagger that pre- such as gender, number, voice and degree2. dicts the UCCA-CoNLL function of each token. TASK2 • Capitalization and word length encoding. is not used for inference purposes. It is only a support that help the model to extract rele- • Prefixes and Suffixes of 2 and 3 characters. vant features, allowing it to model the whole sen- • A language indicator feature. tence even when parsing small pre-terminal nodes. • Boolean indicator of idioms and multi word expression. Detailed in section 3.2. 2.3 Label Encoding • Masking mechanism, which indicates, for a We have previously stated that TASK1 uses BIO given node in the graph, the tokens within the encoded labels to model the structure of the chil- span as well as the arc label between the node dren of each node in the semantic graph. In some and its parent. See details in section 2.1. rare cases, the BIO encoding scheme is not suf- ficient to model the interaction between parallel Except for words where we use pre-trained em- scenes. For example, when we have two paral- beddings, we use randomly initialized embedding lel scenes and one of them appears as a clause layers for categorical features. inside the other. In such cases, BIO encoding does not allow to determine whether the last part 2.1 Masking Mechanism of the sentence belongs to the first scene or to We introduce an original masking mechanism in the clause. Despite this issue, prior experiments order to feed information about the previous pars- testing more complete label encoding schemes ing stages into the model. During parsing, we (BIEO, BIEOW) showed that BIO outperforms the first do an initial inference step to extract the main other schemes on the validation sets. scenes and links. Then, for each resulting node, we build a new input which is essentially the same, 2.4 Graph Decoding but with a categorical sequence masking feature. 
2.4 Graph Decoding

During the decoding phase, we convert the BIO labels into graph nodes. To do so, we add a few constraints to ensure the outputs are feasible UCCA graphs that respect the sentence's structure (a sketch of this step follows the list):

• We merge parallel scenes (H) that do not have either a verb or an action noun into the nearest previous scene having one.
• Within each parallel scene, we force the existence of one and only one State (S) or Process (P) by selecting the token with the highest probability of State or Process.
• For scenes (H) and arguments (A) we do not allow splitting multi-word expressions (MWEs) and chunks into different graph nodes. If the boundary between two segments lies inside a chunk or MWE, the segments are merged.
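A minimal sketch of this decoding step follows, under the same assumptions as the previous snippets. The BIO grouping mirrors the scheme of section 2.3; the single State/Process constraint keeps the candidate whose first token has the highest P/S probability. How the discarded candidates are relabeled is not specified in the paper, so relabeling them as Elaborators (E) here is our simplification.

```python
# Sketch of the core of the decoding policy (our reconstruction).

def bio_to_children(bio_labels):
    """Group B:X / I:X tags into (span, label) pairs; 'O' tokens are skipped."""
    children, span, label = [], [], None
    for i, tag in enumerate(list(bio_labels) + ["O"]):  # sentinel flushes the last span
        if tag == "O" or tag.startswith("B:"):
            if span:
                children.append((span, label))
            span, label = ([i], tag[2:]) if tag.startswith("B:") else ([], None)
        else:                        # "I:X" continues the current span
            if label is None:        # tolerate an ill-formed leading I:X
                label = tag[2:]
            span.append(i)
    return children

def force_single_process_or_state(children, ps_probs):
    """Keep the single P/S child whose first token has the highest P/S
    probability; relabel the others as E (our simplification)."""
    candidates = [c for c in children if c[1] in ("P", "S")]
    if len(candidates) <= 1:
        return children
    best = max(candidates, key=lambda c: ps_probs[c[0][0]])
    return [c if c == best or c[1] not in ("P", "S") else (c[0], "E")
            for c in children]
```

For instance, `bio_to_children(["B:A", "I:A", "B:P", "B:A", "O", "O", "O"])` yields the three children produced at Step 2.A of figure 1.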

[Figure 1: Masking mechanism through recursive calls, illustrated on the sentence "The mice ate cheese and fell asleep". Step 1 parses the sentence (mask INIT on every token) to extract parallel scenes (H) and links (L). Steps 2.A and 2.B then each use a different mask to parse these scenes and extract arguments (A) and processes (P), which are recursively parsed until terminal nodes are reached.]
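Transcribed from figure 1, the first two steps of the running example read as follows (Steps 2.B and 3 follow the same pattern); the label strings are those shown in the figure.

```python
# The first two steps of Figure 1 as data; labels transcribed from the figure.

tokens = ["The", "mice", "ate", "cheese", "and", "fell", "asleep"]

step1_mask   = ["INIT"] * 7
step1_output = ["B:H", "I:H", "I:H", "I:H", "B:L", "B:H", "I:H"]  # two scenes + a link

# Step 2.A re-parses the first scene; its tokens carry the H mask.
step2a_mask   = ["H", "H", "H", "H", "O", "O", "O"]
step2a_output = ["B:A", "I:A", "B:P", "B:A", "O", "O", "O"]       # arguments + process
```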

Corpus         Train   Dev   Test
English Wiki    4113   514    515
English 20K        -     -    492
German 20K      5211   651    652
French 20K        15   238    239

Table 1: Number of UCCA-annotated sentences in the partitions for each language and domain.

2.5 Remote Edges

Our approach easily handles remote edges. We consider remote arguments as those detected outside the parent's node span (see REM in figure 1). Our earlier models showed low recall on remotes. To fix this, we introduced a detection threshold on the output probabilities. This increased the recall at the cost of some precision. The detection threshold was optimized on the validation set.

3 Data

3.1 UCCA Task Data

In table 1 we show the number of annotations for each language and domain. Our objective is to build a model that generalizes to the French language despite having only 15 training samples.

When we analyse the data in detail, we observe that there are several tokenization errors, especially in the French corpus. These errors propagate to the POS tagging and dependency parsing as well. For this reason, we re-tokenized and parsed all the corpora using an enriched version of UDPipe that we trained ourselves (Straka and Straková, 2017), using the treebanks from Universal Dependencies (3). For French we enriched the treebank with XPOS from our lexicon. Finally, since tokenization is pre-established in the UCCA corpus, we projected the improved POS tags and dependency parses onto the original tokenization of the task.

(3) https://universaldependencies.org/

3.2 Supplementary Lexicon

We observed that a major difficulty in UCCA parsing is analyzing idioms and phrases. Unawareness of these expressions, which are mostly used as links between scenes, misleads the model during the early stages of the inference, and errors get propagated through the graph. To boost the performance of our model when detecting links and parallel scenes, we developed an internal list of about 500 expressions for each language. These lists include prepositional, adverbial and conjunctive expressions and are used to compute Boolean features indicating the words in the sentence which are part of an expression.
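As an illustration of how such Boolean features can be computed, here is a small sketch; the authors' 500-entry lists are internal, so the `EXPRESSIONS` list below is a hypothetical stand-in with a few French multi-word connectives.

```python
# Sketch of the expression-indicator feature of section 3.2. The expression
# list is a hypothetical stand-in, not the authors' internal lexicon.

EXPRESSIONS = [("afin", "de"), ("tandis", "que"), ("au", "fur", "et", "à", "mesure")]

def expression_flags(tokens, expressions=EXPRESSIONS):
    """Return one Boolean per token: True if it lies inside a known expression."""
    lowered = [t.lower() for t in tokens]
    flags = [False] * len(tokens)
    for expr in expressions:
        n = len(expr)
        for start in range(len(tokens) - n + 1):
            if tuple(lowered[start:start + n]) == tuple(expr):
                for i in range(start, start + n):
                    flags[i] = True
    return flags

# expression_flags("Il court afin de gagner".split())
# -> [False, False, True, True, False]
```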

3.3 Multilingual Training

This model uses multilingual word embeddings trained using fastText (Bojanowski et al., 2017) and aligned using MUSE (Conneau et al., 2017) in order to ease cross-lingual training. In prior experiments we introduced an adversarial objective similar to (Kim et al., 2017; Marzinotto et al., 2019) to build a language-independent representation. However, the language imbalance in the training data did not allow us to take advantage of this technique. Hence, we simply merged the training data from the different languages.

4 Experiments

We focus on obtaining the model that best generalizes to the French language. We trained our model for 50 epochs and selected the best one on the validation set. In our experiments we did not use any product-of-experts or bagging technique, and we did not run any hyper-parameter optimization.

We trained several models on training corpora composed of different language combinations. We obtained our best model using the training data for all the languages. This FR+DE+EN model achieved 63.6% avg. F1 on the French validation set, compared to 63.1% for FR+DE, 62.9% for FR+EN and 50.8% for FR only.

4.1 Main Results

In Table 2 we provide the performance of our model for all the open tracks, along with the results for the TUPA baseline in order to establish a comparison. Our model finishes 4th in the French Open Track with an average F1 score of 65.4%, very close to the 3rd place, which had a 65.6% F1. For languages with larger training corpora, our model did not outperform the monolingual TUPA.

                       Ours Labeled       Ours Unlabeled     TUPA Labeled       TUPA Unlabeled
Open Tracks            Avg   Prim  Rem    Avg   Prim  Rem    Avg   Prim  Rem    Avg   Prim  Rem
Dev  English Wiki      70.8  71.3  58.7   82.5  83.8  37.5   74.8  75.3  51.4   86.3  87.0  51.4
Dev  German 20K        74.7  75.4  40.5   87.4  88.6  40.9   79.2  79.7  58.7   90.7  91.5  59.0
Dev  French 20K        63.6  64.4  19.0   78.9  79.6  20.5   51.4  52.3   1.6   74.9  76.2   1.6
Test English Wiki      68.9  69.4  42.5   82.3  83.1  42.8   73.5  73.9  53.5   85.1  85.7  54.3
Test English 20K       66.6  67.7  24.6   82.0  83.4  24.9   68.4  69.4  25.9   82.5  83.9  26.2
Test German 20K        74.2  74.8  47.3   87.1  88.0  47.6   79.1  79.6  59.9   90.3  91.0  60.5
Test French 20K        65.4  66.6  24.3   80.9  82.5  25.8   48.7  49.6   2.4   74.0  75.3   3.2

Table 2: Our model vs. the TUPA baseline on each open track (average, primary and remote F1).

4.2 Error Analysis

In Table 3 we give the performance by arc type. We observe that the main performance bottleneck is the parallel scene segmentation (H). Due to our recursive parsing approach, this kind of error is particularly harmful to the model performance, because scene segmentation errors at the early steps of the parsing may induce errors in the rest of the graph. To assess this, we used the validation set to compare the performance on mono-scene sentences (with no potential scene segmentation problems) with multi-scene sentences. For the French track we obtained 67.2% avg. F1 on the 114 mono-scene sentences, compared to 61.9% avg. F1 on the 124 multi-scene sentences.

Tracks     D     C     N     E     F     G     L     H     A     P     U     R     S
EN Wiki   64.3  71.4  68.5  69.6  76.7   0.0  71.4  61.3  60.0  64.0  99.7  89.2  25.1
EN 20K    47.2  75.2  62.5  72.3  71.5   0.2  57.9  49.5  55.7  69.8  99.7  83.2  19.5
DE 20K    69.4  83.8  57.7  80.5  83.8  59.2  68.4  62.2  67.5  68.9  97.1  86.9  25.9
FR 20K    46.1  76.0  58.9  71.2  53.3   4.8  59.4  50.4  52.8  67.6  99.6  83.5  16.9

Table 3: Our model's fine-grained F1 by label on the test open tracks.

5 Conclusions

We described an original approach that recursively builds the UCCA semantic graph using a sequence tagger along with a masking mechanism and a decoding policy. Even though this approach did not yield the best results in the UCCA task, we believe that our original recursive, mask-based parsing can be helpful for low-resource languages. Moreover, we believe that this model could be further improved by introducing a global criterion and by performing further hyper-parameter tuning.
References

Omri Abend and Ari Rappoport. 2013. Universal conceptual cognitive annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–238.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, pages 86–90. Association for Computational Linguistics.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Emanuele Bastianelli, Giuseppe Castellucci, Danilo Croce, and Roberto Basili. 2013. Textual inference and meaning representation in human robot interaction. In Proceedings of the Joint Symposium on Semantic Processing: Textual Inference and Structures in Corpora, pages 65–69.

Alexandra Birch, Omri Abend, Ondřej Bojar, and Barry Haddow. 2016. HUME: Human UCCA-based evaluation of machine translation. arXiv preprint arXiv:1607.00030.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Pierre-Etienne Genest and Guy Lapalme. 2011. Framework for abstractive summarization using text-to-text generation. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, Portland, Oregon, USA. Association for Computational Linguistics.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. arXiv preprint arXiv:1704.00552.

Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838. Association for Computational Linguistics.

Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 716–724, Stroudsburg, PA, USA. Association for Computational Linguistics.

Gabriel Marzinotto, Jeremy Auguste, Frédéric Béchet, Géraldine Damnati, and Alexis Nasr. 2018a. Semantic frame parsing for information extraction: the CALOR corpus. In LREC 2018, Miyazaki, Japan.

Gabriel Marzinotto, Frédéric Béchet, Géraldine Damnati, and Alexis Nasr. 2018b. Sources of complexity in semantic frame parsing for information extraction. In International FrameNet Workshop 2018, Miyazaki, Japan.

Gabriel Marzinotto, Géraldine Damnati, Frédéric Béchet, and Benoît Favre. 2019. Robust semantic parsing with adversarial learning for domain generalization. In Proceedings of NAACL.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Dan Shen and Mirella Lapata. 2007. Using semantic roles to improve question answering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 12–21, Prague. Association for Computational Linguistics.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In NIPS 2015, pages 2377–2385, Montréal, Québec, Canada.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.

Gokhan Tur, Dilek Hakkani-Tür, and Ananlada Chotimongkol. 2005. Semi-supervised learning for spoken language understanding semantic role labeling. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 232–237.

Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723.

Bishan Yang and Tom Mitchell. 2017. A joint sequential and relational model for frame-semantic parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1247–1256. Association for Computational Linguistics.