THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Multilingual Abstractions: Abstract Syntax Trees and Universal Dependencies

PRASANTH KOLACHINA

UNIVERSITY OF GOTHENBURG

Department of Computer Science & Engineering
Chalmers University of Technology and Gothenburg University
Gothenburg, Sweden, 2019

© Prasanth Kolachina, 2019.

ISBN 978-91-7833-509-1
Technical Report 174D
Department of Computer Science & Engineering

Division of Functional Programming
Department of Computer Science & Engineering
Chalmers University of Technology and Gothenburg University
Gothenburg, Sweden
Telephone +46 (0)31-772 1000

Printed at Chalmers reproservice Gothenburg, Sweden 2019.

To my family
Mom, Dad and Sudheer

Abstract

This thesis studies the connections between parsing-friendly representations and interlingua grammars developed for multilingual language generation. Parsing-friendly representations refer to dependency tree representations that can be used for robust, accurate and scalable analysis of natural language text. Shared multilingual abstractions are central to both these representations. Universal Dependencies (UD) is a framework for developing cross-lingual representations, using dependency trees as the multilingual representations. Similarly, Grammatical Framework (GF) is a framework for interlingual grammars, used to derive abstract syntax trees (ASTs) corresponding to sentences.

The first half of this thesis explores the connections between the representations behind these two multilingual abstractions. The first study presents a conversion method from abstract syntax trees (ASTs) to dependency trees and presents the mapping between the two abstractions – GF and UD – obtained by applying the conversion from ASTs to UD. Experiments show substantial overlap between the two abstractions, and our method is used to bootstrap parallel UD treebanks for 31 languages. The second study addresses the inverse problem, i.e. converting UD trees to ASTs. This is motivated by the goal of aiding GF-based interlingual translation, by using dependency parsers as a robust front end in place of the parser used in GF.

The second half of this thesis focuses on data augmentation for parsing – specifically, using grammar-based backends to aid dependency parsing. We propose a generic method to generate synthetic UD treebanks using interlingua grammars and the methods developed in the first half. Results show that these synthetic treebanks are an alternative way to develop parsing models, especially for under-resourced languages. This study is followed by a study on out-of-vocabulary words (OOVs) – a more focused problem in parsing. OOVs pose an interesting problem in parser development, and the method we present is a generic simplification that can act as a drop-in replacement for any symbolic parser. Our idea of replacing unknown words with known, similar words results in small but significant improvements in experiments using two parsers across a range of seven languages.

Keywords

Natural Language Processing, Grammatical Framework, Universal Dependencies, multilinguality, abstract syntax trees, dependency trees, multilingual generation, multilingual parsers

Acknowledgments

In the time spent working on this, I was also trying to narrow down the one domino that made me pursue this crazy job. I did find it in due time – a stroll on a winter evening with a colleague on the streets of Hyderabad. I told him the crazy idea I had had in the months prior, working as a research intern: to pursue a Ph.D. By the end of the day, he convinced me I was both necessarily and sufficiently crazy to go for one! This was followed by a conversation with a mentor from those days, who cautioned me about a long tunnel, along with what I thought was some sage advice. That evening passed, and a year later I was at Chalmers.

In all my time at Chalmers – working with Aarne – I did find myself asking at times if the tale about the tunnel was true. There were certainly times when it seemed true, but the ensuing time was also filled with learning moments. Aarne gave me the space and time to address each part I thought I lacked in order to pursue my research interests. For that I will always thank him, more so because he let me address them on my own terms. Thanks are also due to Richard Johansson and Krasimir Angelov – my co-supervisors – who have supported me at all times during this journey. Richard did at times indulge me, listening patiently to what seemed and still seem like crazy ideas to me, and always helped me refine those ideas and the aspects of my research that are not often overtly realized, at least immediately. Krasimir was more hands-on, helping me make sense of the craziness involved by sharing his own experiences. The Grammar Technology group – Herbert, Peter Ljunglöf, Prasad KVS, Inari, John, Normunds, Gregoire, Thomas Hallgren, Koen Claessen (when he consented to being part of the group) and David – was something I could always count on for insightful conversations and, at times, for procrastination working on interesting problems. Agneta Nilsson and Mary Sheeran, who have been on my committee, and Devdatt Dubhashi, who acted in the role of my examiner, have also supported me throughout the years.

But the journey is not all about research, and those of us here know the role teaching plays in the process. I had taught before coming to Chalmers – when asked to – but never imagined myself liking it, much less enjoying it. I did discover those aspects of teaching over the years while working with Dag Wedelin on the problem solving course. That only got better working with the other people involved in the course – Birgit Grohe, Simon R., Dan R., Victor, Mikael, amongst others – and it is one I think of as a valuable experience. If the idea of a Ph.D. seemed crazy to me those years ago, I admit the idea of pursuing an academic career seems equally crazy now. That said, if I do pursue one, it will not be due to my problem solving skills – it will surely be due to my experience in the course on problem solving. Thank you for that, Dag, and everyone who worked in the course over the last five years, including the students.

I would like to say thanks to Joakim Nivre, who shared his insights on my work

and has also graciously agreed to be a part of my defense, in addition to hosting my research visit at Uppsala. The Computational Linguistics group at Uppsala – Miryam, Amir, Ali, Yan, Marie, Fabienne, Aaron – are nothing short of fantastic and an excellent presence to have in close proximity. Thanks also to Filip Ginter, who was the discussion leader for my licentiate, and to Lilja Øvrelid and Marco Kuhlmann, who have both accepted to be on the committee for my defense. I met Marco while working on ud2gf and his insights have been very helpful in improving my understanding of this work.

Just as all work and no play makes Jack a dull boy, my time at Chalmers was enriched abundantly by time spent with amazing people outside the group. Olof and Mikael – with whom I had conversations on everything around, from machine learning and natural language processing to Sweden, from the technical to the social and perhaps at times even the religious – and Alirad were the best colleagues one could ask for. These hangouts were further improved when I spent time with the larger CLT group in Gothenburg – Luis, Ildiko, Mehdi, Nina, Markus – who constantly reminded me that it was okay to sign out from work.

There are other things I need to acknowledge as part of this journey – Sweden being the primary one. The who, why and what of that is impossible to precisely quantify – neither the who nor the what can be enumerated here in their entirety – and is perhaps best left unspecified while relishing all that did happen. I also found great company in the online world – Science Twitter – which made me feel welcome.

Finally, I heard over the years the cliff-climbing cliché of life, and I admit this never felt like one. Most times, it felt as though I had jumped off one – after all, climbing gives you a choice to stop at any point but jumping never does – only to realize I had a lot of fantastic support underneath it all. Sudheer – the colleague and brother who encouraged me to start this journey – and Lilla have always given me, and continue to give me, sage advice when I need it. Behind him was my mother who, despite not knowing why I was doing this, always let me know that things would be okay when I most needed to hear it. The journey might not have started because of them, but it definitely would not have come this far if not for them.

The work presented in this thesis has been funded by the Swedish Research Council as part of the REMU project – Reliable Multilingual Digital Communication: Methods and Applications (grant number 2012-5746).

List of Publications

Appended publications

This thesis is based on the following publications:

[I] Prasanth Kolachina and Aarne Ranta. “From Abstract Syntax to Universal Dependencies”. Linguistic Issues in Language Technology 13(3), 2016.

[II] Aarne Ranta and Prasanth Kolachina. “From Universal Dependencies to Abstract Syntax”. Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pp. 107–116.

[III] Prasanth Kolachina and Aarne Ranta. “Bootstrapping UD treebanks for Delexicalized Parsing”. Under submission.

[IV] Prasanth Kolachina, Martin Riedl and Chris Biemann. “Replacing OOV Words For Dependency Parsing With ”. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa), 2017, pp. 11–19.


Other publications

[V] Aarne Ranta, Prasanth Kolachina and Thomas Hallgren. “Cross-Lingual Syntax: Relating Grammatical Framework with Universal Dependencies”. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa), System Demos, 2017.

[VI] Prasanth Kolachina and Aarne Ranta. “GF Wide-coverage English-Finnish MT system for WMT 2015”. Proceedings of the Tenth Workshop on Statistical Machine Translation, 2015, pp. 141–144.

[VII] Ramona Enache, Inari Listenmaa and Prasanth Kolachina. “Handling non-compositionality in multilingual CNLs”. Fourth Workshop on Controlled Natural Language (CNL 2014), 2014, pp. 147–154.

Research Contribution

• In Paper I (Chapter 2), the concrete and extended variants of the configurations were made by the author. Necessary changes to the translation algorithm were also made by the author, in addition to the complete configuration specification from GF-RGL to Universal Dependencies. The manuscript was jointly written; the author's contribution was around 75% of the manuscript.

• In Paper II (Chapter 3), the author's contribution was to the experiments described in the manuscript. Subsequent modifications to the implementation include k-best parsing, probabilistic disambiguation and an unlexicalized variant of ud2gf. Subsequent modifications to the configurations include extensions from UDv1 to UDv2 and extensions to the grammar to improve the coverage of the translation method. These changes were made after the publication.

• In Paper III (Chapter 4), the author designed the experimental setup, carried out the experiments and wrote most of the paper.

• In Paper IV (Chapter 5), the author designed the experimental setup, carried out the experiments and the subsequent data analysis, and wrote around 50% of the paper.

Contents

Abstract v

Acknowledgement vii

List of Publications ix

Personal Contribution xi

1 Introduction 1
  1.1 Research Questions 2
  1.2 Multilingual Grammars and Representations 3
  1.3 Abstract Syntax trees and Dependency trees 9
      1.3.1 ast2dep: From Abstract Syntax to Dependency trees 10
      1.3.2 dep2ast: From Dependencies to Abstract Syntax trees 12
      1.3.3 Expressivity and Limitations of ast2dep and dep2ast 16
  1.4 GF-RGL and Universal Dependencies 16
      1.4.1 gf2ud: Extensions to ast2dep 18
      1.4.2 ud2gf: Extensions to dep2ast 19
      1.4.3 Applications 23
  1.5 Related Work 24
  1.6 Results 25
  1.7 Summary of the studies 28
      1.7.1 Paper I: gf2ud 28
      1.7.2 Paper II: ud2gf 29
      1.7.3 Paper III: Bootstrapping UD treebanks 29
      1.7.4 Paper IV: OOV words in Dependency parsing 29
  1.8 Conclusions and Future work 30
      1.8.1 Open Problems and Future directions 31

2 Paper I: gf2ud 33
  2.1 Introduction 34
  2.2 Grammars and trees 36
      2.2.1 Abstract and concrete syntax 36
      2.2.2 Trees and their conversions 38
      2.2.3 Abstracting from morphological variation 40
      2.2.4 Abstracting from syncategorematic words 42
  2.3 An overview of GF-RGL and UD 44
      2.3.1 Overview of RGL 44


      2.3.2 Overview of UD 47
  2.4 Dependency mappings: straightforward cases 49
      2.4.1 Clausal predicates: predication and complementation 49
      2.4.2 Adverbial modifiers 51
      2.4.3 Questions and relative clauses 51
      2.4.4 Noun phrases and modifiers 51
      2.4.5 Coordination 53
  2.5 Dependency mappings: Problematic cases 54
      2.5.1 Passive voice constructions 56
      2.5.2 Copula constructions 58
      2.5.3 Verb phrase complements and prepositional verbs 59
      2.5.4 Auxiliary verbs and verbal negation 60
      2.5.5 Multi-word expressions 61
      2.5.6 Idiomatic and Semantic (CxG) Constructions 62
      2.5.7 Dependency conversion algorithm and specification language 63
  2.6 Experiments 64
      2.6.1 UD test 64
      2.6.2 Evaluation 65
      2.6.3 GF Penn Treebank 69
  2.7 Conclusion 70
  2.8 Appendix: GF-RGL and UD Reference 71
      2.8.1 GF-RGL categories 71
      2.8.2 UD tags and labels 73
  2.9 Appendix: Dependency conversion algorithm and specification language 74

3 Paper II: ud2gf 77
  3.1 Introduction 78
  3.2 From gf2ud to ud2gf 80
  3.3 The ud2gf basic algorithm 81
  3.4 Refinements of the basic algorithm 84
  3.5 First results 86
  3.6 Conclusion 88

4 Paper III: Bootstrapping UD treebanks 91
  4.1 Introduction 92
  4.2 Grammatical Framework 93
      4.2.1 gf2ud 94
  4.3 Bootstrapping AST and UD treebanks 95
      4.3.1 Differences against UDv2 97
  4.4 UD Parsing 97
  4.5 Experiments 99
  4.6 Related Work 100
  4.7 Conclusions 101

5 Paper IV: OOV words in Dependency Parsing 107
  5.1 Introduction 108
  5.2 Related Work 108
  5.3 Methodology 109
      5.3.1 Semantic Similarities 109

      5.3.2 Suffix Source 109
      5.3.3 Replacement Strategies regarding POS 109
      5.3.4 Replacement Example 110
  5.4 Experimental Settings 110
      5.4.1 Similarity Computations 111
      5.4.2 Corpora for Similarity Computation 111
      5.4.3 Dependency Parser and POS Tagger 112
      5.4.4 Treebanks 112
  5.5 Results 112
      5.5.1 Results for POS Tagging 112
      5.5.2 Results for Dependency Parsing 113
  5.6 Data Analysis 115
      5.6.1 Analysis of POS Accuracy 115
      5.6.2 Analysis of Parsing Accuracy by Relation Label 115
  5.7 Discussion 116
      5.7.1 Recommendations for OOV Replacement 116
      5.7.2 On Differences between Graph-Based and Dense-Vector Similarity 117
  5.8 Conclusion 118

Bibliography 119

Chapter 1

Introduction

Structured representations such as parse trees have been central in Natural Language Processing (NLP) and Computational Linguistics (CL), often used as intermediate representations in downstream applications like machine translation (MT), question answering (QA) and document summarization. The underlying abstractions used to derive these structures have changed radically in the last three decades — expert-based models have been replaced by models learnt from examples using statistical and machine learning techniques. This paradigm shift has made corpus creation the basic exercise in developing linguistic resources for a language. In other words, tagged corpora (Francis Nelson and Kučera, 1979) have replaced expert-based automata (Beesley and Karttunen, 2003) and treebanks (Marcus et al., 1994, Abeillé et al., 2000, Böhmová et al., 2003) have replaced hand-crafted grammars (XTAG, 2001, Copestake and Flickinger, 2000, Rayner et al., 2000) – all developed to induce more accurate and robust abstractions (Charniak, 1996). [1] Most of these were independent efforts over the last three decades, each an attempt to devise optimal representations (Johnson, 1998) suitable for the language and the task in question.

These efforts were successful in creating both robust and scalable models for understanding text. Central to this success in web-scale parsing are light-weight representations used to compute the shallow meaning of a sentence. These representations range from simple part-of-speech tagged sentences to tree- or graph-like dependency structures that mark grammatical functions in a sentence, e.g. the subject and object of the given sentence. The robustness of these representations and their scalability derive from efficient algorithms that assign a plausible representation to the input, without recourse to notions like grammaticality and well-formedness of the text. These algorithms, coupled with the surging interest in multilinguality in NLP, highlighted the need for a harmonious representation suitable for a wide range of languages. A shared intermediate representation that is useful for applications, by abstracting over language-specific variation, can be seen as a parsimonious representation for the application. This is what set the stage for Universal Dependencies – a framework for cross-linguistically consistent syntactic annotation of text in a wide variety of languages. The framework uses dependency trees and directed acyclic graphs as the primary descriptions in more than 70 languages. This effort in turn led to many advances in multilingual parsing – producing universal parsers and fostering research in cross-lingual parsing, even for languages in which annotated examples are not available.

[1] This is now a well-accepted fact. Induced abstractions have been shown to be more machine-friendly.


But what about generating language? While the above efforts have aided the understanding phase of NLP, i.e. analyzing text, progress in generation has been largely driven by enormous advances in language modeling and data-driven techniques. These techniques have shown excellent results in a wide variety of applications – reaching “human parity” in MT and generating human-like text (Radford et al., 2018) – indirectly contributing to the current focus on monolingual text generation.

But what if one is interested in simultaneous multilingual generation? This is of particular interest when generation originates from abstract representations of meaning like semantic dependency structures (Abstract Meaning Representations), logical formulae or other formal structures that need to be simultaneously translated into text in many languages – a sub-task of language generation referred to in the literature as surface realization. Efforts towards generation from dependency structures have recently started, but are focused on monolingual generation. It is not difficult to imagine why and how simultaneous multilingual generation is useful: translations presented in more than one language serve as explanation aids, and question answering systems that provide answers in multiple languages and multilingual summarization systems have long been of interest to the community.

So, why has multilingual generation not seen much progress in recent years? One reason is that NLU applications rarely build abstract intermediate representations useful for generation purposes. Second, multilingual generation has largely been put in the domain of producer NLP – tasks that require a faithful and grammatical rendering of meaning in each language – leading to efforts in focused domains like instruction manuals, official documents, etc. Grammatical Framework is one such framework, originally created with the generation of multilingual documents as its central aim; past efforts have shown multilingual generation to be one of its core strengths. The primary descriptions here are abstract syntax trees (ASTs), different from the dependency structures described above that are used to parse documents from the web and, sometimes, the web itself.

At this point it should not be difficult to foresee where this is headed – universal models for language understanding and generation. Related ideas were proposed by Vauquois in the context of machine translation over 60 years ago. And indeed these ideas have seen a resurgence in recent years, with attempts in MT shifting focus from translation for language pairs to multilingual MT. And while Vauquois himself may not have foreseen this, his idea of uniform models for analysis and synthesis has been shown to be feasible and has been widely embraced by the field. What has remained elusive in his architecture is a precise form of the interlingua that is expressive enough for both analysis and synthesis of general-purpose text.

But perhaps one single abstraction for both phases is an impossible task — what if it is replaced with two abstractions? One abstraction to derive representations amenable to analysis tasks and another to derive representations amenable to synthesis tasks. This is indeed the focus and aim of the current thesis. Ongoing work on UD has shown it to provide useful representations for NLU in a range of applications, from question answering to natural language inference. But what does the “bridge” between these two abstractions look like?
What are the potential applications of such a “bridge”?

1.1 Research Questions

The following research questions are discussed in this thesis:

(1) How can dependency trees be derived from abstract syntax trees (ASTs) defined by an interlingual grammar? Can this process be reversed, i.e. can ASTs similarly be derived from an input dependency tree? Once defined, what are the characteristics of these functions?

(2) The functions are operationalized for two independent multilingual descriptions of language: Universal Dependencies (UD), which uses dependency trees as primary descriptions, and the Resource Grammar Library of Grammatical Framework (GF-RGL / RGL), which uses ASTs as primary descriptions. The goal here is to understand and assess, both quantitatively and qualitatively, what the two frameworks share, while leveraging the artifacts of the respective frameworks in NLP applications.

(3) Finally, we ask what the appropriate roles of grammars and machine-induced abstractions are. The human effort involved in grammar engineering, to design or add a grammar for a new language, is different from the effort involved in annotating treebanks, both in terms of the sub-tasks involved and the time required. Grammars, as expert-designed abstractions, have long been considered appropriate for restricted domains in NLP; this thesis looks at potential applications of such abstractions by generating synthetic treebanks that are used to induce dependency parsing models.

1.2 Multilingual Grammars and Representations

Interlingual grammars are one of the multilingual abstractions at the core of this thesis. An interlingual grammar consists of two parts: an abstract syntax that is shared across languages and a set of concrete syntaxes defined separately for each language. The abstract syntax defines a set of categories and functions, where the functions correspond to rules specifying which parts are combined together. These functions abstract away from language-specific details like word order and what the parts look like: those are specified in the concrete syntax. Figure 1.1 illustrates an interlingual grammar, a small fragment of the larger Resource Grammar used in Grammatical Framework (Ranta, 2009a, 2004b). [2]

The primary descriptions derived from these abstractions for an input text are abstract syntax trees (ASTs), once again a representation that is shared across the languages. The AST combined with the concrete syntax is used to derive an auxiliary representation – concrete syntax trees – that is language-specific and reminiscent of the constituency trees and phrase-structure trees of the syntax literature. The algorithm used to derive concrete syntax trees from an AST is deterministic: a linearization [3] into a bracketed string. Figure 1.2 shows the abstract syntax tree and the concrete syntax trees for the input sentence the black cat sees us today and its Swedish translation den svarta katten ser oss idag.

One of the central areas in computer science where these representations have been studied and applied is compiler development. Programming language compilers use ASTs as an intermediate representation, sharing the representation across several source and target languages [4], only changing the first step of parsing and the last step of code generation.

[2] Any function with a definition written as f : C1 → C2 → ... → Cn → C can be rewritten as a context-free rule f. C ::= C1 C2 ... Cn. The former is the notation used throughout this thesis.
[3] Linearization is the reverse process of parsing, i.e. generating or recovering the input sentence from an abstract syntax tree.

cat
  S ;    -- sentence
  NP ;   -- noun phrase
  VP ;   -- verb phrase
  AP ;   -- adjectival phrase
  CN ;   -- common noun
  Det ;  -- determiner
  V2 ;   -- transitive verb
  Pron ; -- pronoun
  Adv ;  -- adverbial modifier

fun
  PresCl  : NP -> VP -> S ;    -- predication: (the cat)(sees us)
  CompAP  : AP -> VP ;         -- copula:
  ComplV2 : V2 -> NP -> VP ;   -- complementation: (sees)(a cat)
  DetCN   : Det -> CN -> NP ;  -- determination: (the)(cat)
  AdvVP   : VP -> Adv -> VP ;  -- modification: (see)(today)
  AdjCN   : AP -> CN -> CN ;   -- adjectival modification: (black)(cat)
  UsePron : Pron -> NP ;       -- use pronoun as noun phrase: (us)

  see_V2    : V2 ;   -- see/sees
  the_Det   : Det ;  -- the
  black_AP  : AP ;   -- black
  cat_CN    : CN ;   -- cat/cats
  we_Pron   : Pron ; -- we/us
  today_Adv : Adv ;  -- today
Figure 1.1: An abstract syntax for a fragment of GF-RGL.
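To make the data structures concrete, the abstract syntax above can be mirrored as algebraic datatypes in a general-purpose functional language. The following is a minimal sketch in plain Haskell (an illustration, not GF itself or its runtime representation); lexical functions are capitalized here to satisfy Haskell's naming rules.

data S    = PresCl NP VP
data NP   = DetCN Det CN | UsePron Pron
data VP   = CompAP AP | ComplV2 V2 NP | AdvVP VP Adv
data CN   = AdjCN AP CN | Cat_CN
data Det  = The_Det
data AP   = Black_AP
data V2   = See_V2
data Pron = We_Pron
data Adv  = Today_Adv

-- The AST of Figure 1.2a, "the black cat sees us today":
exampleAST :: S
exampleAST =
  PresCl (DetCN The_Det (AdjCN Black_AP Cat_CN))
         (AdvVP (ComplV2 See_V2 (UsePron We_Pron)) Today_Adv)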

(a) Example of an abstract syntax tree

(b) Example of concrete syntax trees in English and Swedish

(S (NP (Det the)(CN (AP black)(CN cat)))(VP (VP (V2 sees)(NP (Pron us)))(Adv today)))
(S (NP (Det den)(CN (AP svarta)(CN katten)))(VP (VP (V2 ser)(NP (Pron oss)))(Adv idag)))

(c) Concrete syntax trees as bracketed linearization

Figure 1.2: Primary and auxiliary descriptions derived from an interlingual grammar

This intermediate representation is used in semantic analysis, for example to type-check programs and make sure that the code is not erroneous.

But these abstractions can also be described as a combination of two well-studied aspects of linguistic and computational grammars discussed in the CL literature – multi-stratal abstractions and synchronous grammars. The first of these characterizations, referred to here as multi-stratal abstractions, can be understood by following the contributions of Curry (1961). Curry himself never uses the words “multi-stratal” or “abstractions”; he only proposes that a logic system describing language has two distinct aspects – a phenogrammatical and a tectogrammatical aspect. [5] The phenogrammatical aspect describes how a linguistic phenomenon is realized in the sentence, while the tectogrammatical description refers to a higher level of abstraction – the underlying structure (Muskens, 2010). The idea of abstracting above language-specific details like word order in grammars [6] has long been appealing in computational grammar formalisms like Head-driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG) and Tree Adjoining Grammars (TAGs). For example, Vijay-Shanker and Weir (1995) describe a variant of Tree Adjoining Grammars, TAG(ID/LP), that factorizes word order information (linear precedence) away from tree information (immediate dominance relations). A similar distinction to tectogrammatical and phenogrammatical is made in the CL/NLP literature using the terms deep and surface syntax.

Now, the above discussion is too abstract for developing multilingual NLP applications. In NLP, cross-lingual or multilingual equivalence is characterized using a corpus of translations, aligned at the sentence level and referred to as parallel corpora and multilingual corpora. [7] In other words, multilingual corpora are assumed to be implicitly equivalent – irrespective of whether they are lexically, semantically or pragmatically equivalent. Such corpora are used to induce parallel abstractions (without an abstract syntax) in the form of synchronous context-free grammars (or syntax-directed transducers (Lewis and Stearns, 1968, Aho and Ullman, 1969a), as they were originally called), used widely in MT. Algorithms to induce basic alignment units in parallel corpora (Vaswani et al., 2012) and different variants of synchronous grammars – for example, Inversion Transduction Grammars (Wu, 1997), Hierarchical grammars (Chiang, 2005, 2007), synchronous TAGs (Shieber, 2014) and other abstractions (Nederhof and Vogler, 2012) – have been proposed and shown to scale to large parallel corpora (Zhang et al., 2008, Pauls et al., 2010). It is trivial to see how the concrete syntax trees in Figure 1.2 can be modeled using a synchronous CFG, without defining an explicit abstract syntax. [8] But these abstractions mostly work with two languages; there has been very limited work on multilingual abstractions that cover more than two.

[4] In compiler theory, the source language refers to a high-level programming language and the target language to low-level system code. Unlike in NLP, the set of source languages does not overlap with the set of target languages.
[5] The reader is cautioned against drawing similarities between the proposal of Curry (1961) and that of Sgall et al. (1986), referred to as “multi-stratal” in the literature on dependency grammars (de Marneffe and Nivre, 2019).
[6] Curry himself outlines this for the Categorial Grammars of Lambek (1968).
[7] The reader should be aware of the terms “bitext” and “multi-texts”, used as synonyms for parallel and multilingual corpora following Melamed and Wang (2004).
[8] The difference between a parallel and an interlingual abstraction, as defined here, is the explicit definition of an abstract syntax corresponding to the multilingual abstraction. Designing an abstract syntax for a given synchronous grammar is not always straightforward, especially when the grammars of the source and target languages follow different annotation schemes.
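To make linearization concrete, the datatype sketch above can be extended with a toy English linearization, one function per category. This is again an illustration in plain Haskell, not GF: a real concrete syntax computes inflection from agreement features instead of hard-coding sees and us, and a Swedish concrete syntax would supply den/svarta/katten/ser/oss/idag analogously.

linS :: S -> String
linS (PresCl np vp)  = linNP np ++ " " ++ linVP vp

linNP :: NP -> String
linNP (DetCN d cn)   = linDet d ++ " " ++ linCN cn
linNP (UsePron p)    = linPron p

linVP :: VP -> String
linVP (ComplV2 v np) = linV2 v ++ " " ++ linNP np
linVP (AdvVP vp a)   = linVP vp ++ " " ++ linAdv a
linVP (CompAP ap)    = "is " ++ linAP ap  -- copula introduced syncategorematically

linCN :: CN -> String
linCN (AdjCN ap cn)  = linAP ap ++ " " ++ linCN cn
linCN Cat_CN         = "cat"

linDet The_Det   = "the"
linAP  Black_AP  = "black"
linV2  See_V2    = "sees"   -- fixed third-person form, for this example only
linPron We_Pron  = "us"     -- object form, for this example only
linAdv Today_Adv = "today"

-- linS exampleAST == "the black cat sees us today"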
However, it was the development of efficient data-driven methods for parsing into these representations combined with increased emphasis on robustness that made these representations mainstream in NLP/CL. The statistical models used in these data- driven methods do not induce an explicit grammar – instead the linguistic regularities learnt from annotated corpora are implicit in the model behavior. Hence, abstraction in the context of dependency syntax can be any one of the following: (a) a coherent description of how all linguistic phenomena are analyzed in the language(s). (b) an encoding in the form of a formal grammar, induced using the linguistic examples in the annotated corpora. (c) a statistical model without an explicit grammar, induced only for parsing new text. In the remainder of the discussion here descriptions of how languages are analyzed, otherwise called annotation guidelines, will be used as the underlying abstraction behind these representations. The guidelines are designed by linguistic and computational experts and in turn are used by human annotators to create treebanks. A full discussion about the landscape of dependency syntax and parsing is well beyond the scope of this thesis, the reader is recommended to Tesnière(2015), Kübler et al. (2009) for an interesting discussion. The rest of the discussion is also limited to the

[9] The set of categories in the abstract syntax can be partitioned into sets of terminals and non-terminals by introducing variants Cterm and Cnonterm for ambiguous categories.

root → have; nsubj(have → We); det(cat → a); obj(have → cat); acl(cat → named); obj(named → Noir)

We/PRON  have/VERB  a/DET  cat/NOUN  named/VERB  Noir/PROPN

Figure 1.3: Example dependency tree following the UDv2 annotation

The framework has its roots in multiple independent efforts towards developing a consistent multilingual annotation scheme (McDonald et al., 2013, de Marneffe et al., 2014, Rosa et al., 2014), and the original scheme has undergone one revision – UDv1 and UDv2. The primary descriptions used in UD are a class of dependency graphs that can always be separated into a basic dependency tree and a set of edges grouped together as enhanced dependencies. The edges are directed from a head to a dependent (also referred to as a modifier in some literature). The nodes in these graphs correspond to words in the sentence and the edges are labelled using grammatical relations. The set of relations used (called the core label set) contains 41 labels – ranging from labels marking the subject, object and predicate of a basic clause to loosely defined relations like list and goeswith, used to analyze ungrammatical text. These were revised to 37 relations in UDv2. At the node level, words are tagged with part-of-speech tags and morphological features in the form of an attribute–value vector. Figure 1.3 illustrates a prototypical UD structure.

With this, the question of what is language-independent (i.e. abstract) and what is language-dependent (i.e. concrete) in this abstraction can already be answered, at least partially. The abstract description includes the part-of-speech tags and grammatical labels from the core label set, in addition to the choice of direction for different edges. [10] The labels and the morphological feature descriptions can be extended to mark language-specific details; for example, the label obl:tmod is used to optionally mark temporal expressions in the sentence. It is worth noting that the fuzziness of the boundary between abstract and concrete is a feature of the framework – UD classifies all information into core and optional – universal relations and feature value descriptions are defined and all languages are encouraged to use them. However, if the realization of a construction in a particular language is different, subtypes of the universal relations can be used to label the distinct realization in that language. This indirectly introduces the possibility that label subtypes can be exploited to selectively label abstract information in only a few languages. [11]

Another distinctive feature of the UD scheme is to select semantic words as heads (content-head choice) as opposed to syntactic words (functional-head choice). This is motivated by cross-lingual considerations: this choice allows minimal changes to the tree structure across a wide range of languages. For example, not all languages realize determiners using a word – as in Swedish, where the distinction between definite and indefinite nouns is realized using different word forms of the noun – so marking the determiner as the dependent of the noun, when present, is more consistent cross-linguistically. A similar reasoning can also be applied to copula verbs (the cat is black); Russian, for example, does not always use a copula verb.

[10] Direction here is implied in the vertical sense, i.e. in terms of the tree structure, and not in the horizontal sense, i.e. in terms of word order in the sentence.
[11] There are 240 subtypes of the 37 labels defined in the UDv2 scheme, 158 of which are found in only one language.
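For concreteness, the tree in Figure 1.3 can also be written down in CoNLL-U, the tab-separated exchange format used by UD releases. The rendering below is a hand-made sketch of that example, not an excerpt from a released treebank; the morphological feature column is left as “_” for brevity.

# text = We have a cat named Noir
1   We      we     PRON   _   _   2   nsubj   _   _
2   have    have   VERB   _   _   0   root    _   _
3   a       a      DET    _   _   4   det     _   _
4   cat     cat    NOUN   _   _   2   obj     _   _
5   named   name   VERB   _   _   4   acl     _   _
6   Noir    Noir   PROPN  _   _   5   obj     _   _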

1.3 Abstract Syntax trees and Dependency trees

As already mentioned, the first research question addressed in this thesis is how to derive dependency structures from interlingua grammars. A dependency configuration corresponding to the grammar is a specification that defines an anchor for each function, using the keyword head. In the case of a syntactic interlingua, the anchor corresponds to the syntactic head and the relations correspond to grammatical roles; in the case of a semantic interlingua, the relations correspond to thematic relations for a semantic head. An example configuration for the grammar in Figure 1.1 is shown in Figure 1.4. Each line in the configuration starts with the name of a function in the grammar, followed by an assignment of labels to each argument of the function. The labels correspond either to the anchor or to one of the labels defined in the annotation scheme. The anchor for unary functions (e.g. CompAP, UsePron) is the degenerate case and may be omitted.

PresCl  nsubj head    -- NP -> VP -> S
ComplV2 head  obj     -- V2 -> NP -> VP
DetCN   det   head    -- Det -> CN -> NP
AdvVP   head  advmod  -- VP -> Adv -> VP
AdjCN   amod  head    -- AP -> CN -> CN

Figure 1.4: Dependency configurations for the grammar fragment from RGL. Also shown as comments are the rules in the grammar.

The configuration specifies one anchor among the arguments of each function in the grammar. The rest of the arguments can be assigned a default label dep. With only anchors specified, the resulting dependency trees are unlabelled, i.e. trees whose directed edges, from the anchor to the heads of the arguments, are all labelled dep. The configuration shown additionally specifies grammatical roles for the arguments of a function with respect to its anchor. In this example, the subject and the object of a clause are marked using the labels nsubj and obj. Similarly, determiners in noun phrases, like “the” and “some”, are marked using the label det, and adjectives, when used, are marked using amod. Adverbial modifiers of verb phrases are assigned the label advmod. [12]

The use of an external configuration provides flexibility in the choice of heads and in the specific labels of the target annotation scheme, which may be subject to revision, or of which there may be several. We use this configuration as a starting point for deriving dependency trees from ASTs (ast2dep) and for building ASTs corresponding to a dependency tree (dep2ast). It is worth mentioning that the dependency configuration classifies the functions in the grammar into two classes: exo-centric and endo-centric. Endo-centric functions are recursive rules where the category of the argument marked head is the same as the value category. Exo-centric functions are all functions that are not endo-centric.

[12] This is revisited later in Section 1.4.1.

In the grammar shown in Figure 1.1, the functions AdvVP and AdjCN are endo-centric and all other functions are exo-centric.
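This classification can be computed mechanically. The following is a minimal sketch in plain Haskell, assuming the configuration and the type signatures of the grammar functions are available as lookup tables (the table representation is an illustration, not the actual GF implementation):

type Fun = String

config :: [(Fun, [String])]           -- per-argument labels; "head" marks the anchor
config =
  [ ("PresCl",  ["nsubj", "head"])
  , ("ComplV2", ["head",  "obj"])
  , ("DetCN",   ["det",   "head"])
  , ("AdvVP",   ["head",  "advmod"])
  , ("AdjCN",   ["amod",  "head"])
  ]

sigs :: [(Fun, ([String], String))]   -- argument categories and result category
sigs =
  [ ("PresCl",  (["NP","VP"],  "S"))
  , ("ComplV2", (["V2","NP"],  "VP"))
  , ("DetCN",   (["Det","CN"], "NP"))
  , ("AdvVP",   (["VP","Adv"], "VP"))
  , ("AdjCN",   (["AP","CN"],  "CN"))
  ]

-- a function is endo-centric iff its "head" argument has the result category
endoCentric :: Fun -> Bool
endoCentric f = case (lookup f config, lookup f sigs) of
  (Just ls, Just (args, res)) ->
    [c | (c, l) <- zip args ls, l == "head"] == [res]
  _ -> False

-- endoCentric "AdvVP" == True; endoCentric "PresCl" == False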

1.3.1 ast2dep: From Abstract Syntax to Dependency trees

Algorithm

The algorithm that derives a dependency tree from a given abstract syntax tree (ast2dep) is a deterministic many-to-one mapping by design; the dependency tree is hence a lossy representation of the AST. The algorithm works in two steps: a recursive labelling step, a breadth-first traversal over the AST, followed by a tree derivation step. The labelling procedure marks each argument of a function in the AST according to the configuration. Given this annotated abstract syntax tree T for the word sequence S, the dependency tree is derived as follows:

1) For each word w in the sentence, find the function fw forming its smallest spanning subtree in the AST. The smallest spanning subtree of a word is the subtree whose top node is the function whose linearization generates that word.

2) Trace the path up from fw towards the root until an edge annotated with a label l is found. From the node immediately above l, follow the spine – the path of unlabelled edges – down to another leaf y. Then y is the head of w, with label l.

Figure 1.5 shows the parse tree for the English sentence the black cat sees us today and its Swedish translation den svarta katten ser oss idag. The nodes in the parse tree are decorated with the abstract functions and the edges with the dependency labels. Arrows are added to the edges to indicate the direction of the edges in the resulting dependency tree. Each path in this representation – when collapsed into a single edge – matches an edge in the resulting dependency tree. Part-of-speech tags corresponding to the words are obtained using a category configuration, a lookup table mapping the lexical categories to their respective tags (shown in Figure 1.6).

An intermediate representation, the abstract dependency tree (ADT), is defined as a directed unordered dependency tree over the lexical functions (0-argument functions in a grammar) of the AST, with labels on the edges, shown in Figure 1.7. Note that the order of the nodes in this dependency tree does not reflect the surface order of the words in the sentence; nodes are shown here in pre-order traversal. The ADT is not explicitly constructed by the algorithm – it is, however, a useful multilingual abstraction over the dependency trees defined by an interlingual grammar.
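The labelling and head-finding steps can be sketched over ASTs viewed as trees of function names, reusing the config table from the sketch above. For simplicity, this illustration derives the edges of the ADT (Figure 1.7) directly, rather than going through the linearized words; it is not the actual GF implementation:

data AST = App Fun [AST]              -- Fun and config as in the earlier sketch

-- follow the spine of "head"-labelled edges down to a lexical function
spine :: AST -> Fun
spine (App f []) = f
spine (App f ts) = spine (headArg f ts)

headArg :: Fun -> [AST] -> AST
headArg f ts = case lookup f config of
  Just ls -> head [t | (t, l) <- zip ts ls, l == "head"]
  Nothing -> head ts                  -- unary/unconfigured functions: the argument is the spine

-- labelled edges (label, head word, dependent word) of the ADT
adtEdges :: AST -> [(String, Fun, Fun)]
adtEdges t@(App f ts) =
  [ (l, spine t, spine arg)
  | (arg, l) <- zip ts (labelsOf f), l /= "head" ]
  ++ concatMap adtEdges ts
  where labelsOf g = maybe (repeat "head") id (lookup g config)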

Completeness of configurations

The labelling step using the configurations defined above would be sufficient if each word in the tree had a corresponding lexical function in the grammar. This is not always the case, since words can be introduced only in the concrete syntax specific to a language – such words are called syncategorematic – in which case the labelling step assigns the default label dep to each of them. For example, the copula verb “is” in the sentence the cat is black is a syncategorematic word, introduced only in the concrete syntax of English corresponding to the function CompAP, without a corresponding category in the abstract syntax. Similarly, the concrete syntax of Swedish introduces the translation equivalent of the copula, “är”. In order to label these syncategorematic words, the configurations are extended with concrete configurations, defined separately for each language.

Figure 1.5: Parse tree decorated with abstract syntax functions and dependency labels

Det   DET
AP    ADJ
CN    NOUN
V2    VERB
Pron  PRON
Adv   ADV

Figure 1.6: Category configuration mapping the categories to part-of-speech tags

root → see_V2/VERB; nsubj(see_V2 → cat_CN/NOUN); det(cat_CN → the_Det/DET); amod(cat_CN → black_AP/ADJ); obj(see_V2 → we_Pron/PRON); advmod(see_V2 → today_Adv/ADV)

Figure 1.7: An example of an abstract dependency tree

Figure 1.8: Setup of ast2dep and intermediate abstractions defined

Shown below are the concrete configurations corresponding to the copula verb in both English and Swedish.

CompAP head {"is"} cop head -- English CompAP head {"är"} cop head -- Swedish

Each rule in the concrete configuration specifies a relabelling operation. The relabelling operation in this instance renames the label on the edge from the head (black in this example) to the copula verb “is” as cop. The configurations defined previously (shown in Figure 1.4) are hereafter referred to as abstract configurations. Configurations refers to the union of the abstract configurations and the concrete configurations, when available for a language. The combination of abstract and concrete configurations is sufficient to derive a fully labelled dependency tree corresponding to an AST. The domain of dependency trees at this point is restricted by the set of ASTs defined by the grammar. [13] The setup of ast2dep is summarized in Figure 1.8 – in order to derive dependency trees for a new language, both the concrete syntax and the concrete configurations defined for that language are necessary.
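Operationally, such a rule can be thought of as a relabelling pass over word-level edges. A minimal sketch, assuming edges are (label, head, dependent) triples of word forms (an assumption for illustration, not the implemented rule syntax):

applyCopRule :: String -> [(String, String, String)] -> [(String, String, String)]
applyCopRule copula = map rule
  where
    rule (_, h, d) | d == copula = ("cop", h, d)  -- rename the edge ending at the copula
    rule e = e

-- applyCopRule "is" relabels the default dep edge ending at "is" to cop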

1.3.2 dep2ast: From Dependencies to Abstract Syntax trees

The derivation from ASTs to dependency trees in ast2dep is deterministic, because it is the linearization of an AST into an annotated string representing the dependency tree. dep2ast, on the other hand, is a non-deterministic search realizing a one-to-many relation. By definition, dep2ast accepts a dependency tree, builds an abstract dependency tree and returns the set of ASTs that can be translated back to the original dependency tree using ast2dep.

[13] A different formulation of this is that the tree language of dependency trees generated using ast2dep depends on the tree language of the interlingual grammar.

Algorithm

The algorithm works in two steps: lexical annotation (dep2adt) followed by syntactic annotation (adt2ast). The lexical annotation step builds an unordered dependency tree from the input dependency structure [14], in which each node is labelled by a label, lemma, POS tag and morphological features. After the tree is built, each lemma is replaced with a lexical function of category C, using a lexicon and the category configuration. The resulting data structure is an abstract dependency tree.

The syntactic annotation step annotates the ADT recursively with applications of syntactic combination functions. At each step, endo-centric functions are applied (when available) before exo-centric functions, using the ASTs corresponding to the sub-trees of the ADT. The algorithm is a depth-first postorder traversal, completed when all nodes in the ADT are covered, the final result being the ASTs built in the root node of the ADT.
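A minimal sketch of the lexical annotation step in plain Haskell, assuming input tokens carry the usual UD token fields and assuming a toy lexicon (the field names and the lexicon are illustrative, not the actual implementation):

import Data.Maybe (listToMaybe)

data Token = Token { idx, hd :: Int, deprel, lemma, upos :: String }
-- idx, hd and deprel are kept for building the tree; lemma and upos drive lookup

catConfig :: [(String, String)]          -- GF category -> UD POS tag (Figure 1.6)
catConfig = [("CN","NOUN"), ("V2","VERB"), ("Det","DET"), ("AP","ADJ")]

lexicon :: [((String, String), String)]  -- (lemma, category) -> lexical function
lexicon = [ (("cat","CN"), "cat_CN"), (("see","V2"), "see_V2")
          , (("the","Det"), "the_Det"), (("black","AP"), "black_AP") ]

-- pick a lexical function whose category linearizes to the token's POS tag
annotate :: Token -> Maybe String
annotate t = listToMaybe
  [ f | (c, p) <- catConfig, p == upos t
      , Just f <- [lookup (lemma t, c) lexicon] ]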

Restrictions of dep2ast

The algorithm described above works only when an ADT can be built for every input dependency tree using the abstract configurations (such as those shown in Figure 1.4). From the discussion of ast2dep, we already know that this is not always possible, frequently due to the presence of syncategorematic words. In order to address this, we extend the configuration with what are called helper categories and helper functions. The helper categories are used to postulate an abstract syntax category for each syncategorematic word, and the helper functions use these helper categories. Table 1.1 shows the extensions required to handle the CompAP function, which introduces the copula. In this instance, only the definition of the helper category Cop- is language-specific. The helper function that uses this category is shared across the three languages, but is only applied if the definition of the helper category is available for the language in the first place. Figure 1.9 shows the abstract dependency tree (ADT) that is the output of the lexical annotation phase in dep2ast. ADTs that use helper categories not defined in the grammar are different from the ADTs in ast2dep and are henceforth called quasi-abstract dependency trees.

helper category      Cop-  AUX  lemma=be    (English)
                     Cop-  AUX  lemma=vara  (Swedish)
                     Cop-  AUX  lemma=olla  (Finnish)
helper function      CompAP- : Cop- → AP → VP   cop head   (quasi-interlingua)
function definition  CompAP- = λ cop,ap → CompAP ap        (quasi-interlingua)

Table 1.1: Configurations added for handling copulas in 3 languages. The function definition shown is syntactic sugar for the syntax of dep2ast.

The abstract syntax of an interlingual grammar G is characterized as a 3-tuple (C, F, S), where C is the set of categories, F the set of functions and S the start symbol of the grammar (Angelov, 2011). Using a similar characterization, the extended configurations for the same grammar can be written as (EC, EF, S) coupled with defs, where EC is the union of the categories C defined in the grammar and the helper categories defined in the configurations. Similarly, EF is the union of the functions F defined in the grammar and the helper functions defined in the configurations.

[14] The input can be a dependency tree or a dependency graph as defined in UD.

root → black_AP (black); cop(black_AP → Cop- “is”); nsubj(black_AP → cat_CN “cat”); det(cat_CN → the_Det “the”)

Figure 1.9: ADT for the sentence the cat is black, using helper categories; also an illustration of a quasi-ADT.

The function definitions (defs) define how helper functions can be eliminated, by applying the function so as to yield a valid AST defined by the grammar. The function definitions are checked for type consistency, i.e. that the category of the tree resulting from applying a helper function matches the type of the expression provided in the definition of the helper function. This type consistency verification is an approximation for checking that the grammar defined by the configurations generates the same set of ASTs as the original grammar. [15]

The extended configurations define an approximate grammar for a given interlingual grammar, in which all syncategorematic words are eliminated using the helper categories. The helper functions and definitions are called quasi-interlingua to emphasize their multilinguality, i.e. they are shared similarly to the functions defined in the abstract syntax, unlike the definitions of the helper categories. A similar distinction is drawn in Croft et al. (2017) between constructions, which are language-independent, and strategies, which can be language-dependent; however, strategies refer only to multilingual phenomena, e.g. the use of a copula is a strategy.

It is interesting to note that the concrete configurations used in ast2dep are not equivalent to the helper functions defined in dep2ast. In the case of ast2dep, the configurations are clearly factored into language-independent (abstract) and language-dependent (concrete) configurations. In dep2ast, however, the set of helper functions is used to handle both strategies and language-specific cases. The functions in the grammar that trigger the concrete configurations and the helper functions are nevertheless the same: their treatment differs based on the direction of the translation between ASTs and dependency trees.
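The elimination of helper functions via defs can be sketched on top of the earlier Haskell datatypes. This illustrates only the mechanism of Table 1.1; in the implementation the elimination is driven by the configuration, not hard-coded:

data CopH = CopH                       -- helper category Cop- (one lemma per language)
data VPx  = VPx VP | CompAPH CopH AP   -- VP extended with the helper function CompAP-

-- applying the definition (λ cop,ap → CompAP ap) discards the copula node
elimHelpers :: VPx -> VP
elimHelpers (VPx vp)        = vp
elimHelpers (CompAPH _c ap) = CompAP ap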

Ambiguity and Spurious ambiguity

This basic algorithm is non-deterministic, though the ambiguity at this point is primarily of one of two types:

1) functional ambiguity: when the abstract syntax has two or more functions with the same configuration, the under-specification in the underlying dependency representation triggers ambiguity in the ASTs;

2) structural ambiguity: when the input dependency tree contains more than one endo-centric configuration, or cyclic exo-centric configurations, ambiguity is triggered because the different orders in which these functions can be applied result in different ASTs.

[15] In other words, both these grammars have the same strong generative capacity.

Ambiguity here should be defined in the context of dep2ast, which is different from the ambiguity encountered using the concrete syntax and a parser. A sentence is ambiguous for a grammar G if the parser returns multiple ASTs corresponding to the sentence. dep2ast returns the ASTs as generated by the parser, but also returns additional spurious ASTs – an artifact of using the ADT instead of the sentence to build the AST. Factoring the word order information out of the data structures in the search keeps the configurations multilingual; however, it also introduces ambiguity that is not always present in the concrete syntax of the language. Now, to define functional ambiguity, consider the case of two functions in the grammar that share the same configuration.

fun1 : C1 -> C2 -> C ; head mod fun2 : C1 -> C2 -> C ; head mod

The two functions in the grammar correspond to different linguistic phenomena and may hence have different semantics, in which case this is a genuine case of ambiguity with respect to the abstract syntax, i.e. the mapping to two ASTs is valid. More often, however, such cases are a by-product of under-specification in the dependency scheme rather than genuine ambiguity. For example, the phrases two levels and level two are indistinguishable with respect to their dependency trees in UD (two is marked as a dependent of level(s) using nummod), and hence the ASTs resulting from the fragment corresponding to level two can be linearized back to both level two and, incorrectly, two levels. In order to address functional ambiguity, we specify morphological constraints on top of the configurations to remove spurious ambiguity. In cases where the morphological constraints are inadequate for disambiguation, multiple ASTs are returned. [16]

A more frequent case is structural ambiguity, i.e. when an endo-centric function can be applied in different orders to cover the same dependency subtree. In the example men, women, children, the ADT is indistinguishable from the ADT corresponding to men, children, women. Similarly, the phrase big black cat is ambiguous without the concrete syntax: the orders in which the adjectival modification can be carried out result in ASTs that linearize to both big black cat and black big cat. The phrase grande famille française (big French family in French) is likewise ambiguous, but both ASTs linearize to grande famille française. In order to address structural ambiguity, we define a normalized ADT: an ordered dependency tree in which the children of a node are ordered by their distance to the parent in the tree. In the above examples, black is placed closer to cat than big in the ADT, while grande and française are equally close. The lexical annotation step in dep2ast builds a normalized ADT in addition to the ADT, and the syntactic annotation step uses a left-associativity property in the case of endo-centric functions. This is an approximation in the syntactic annotation step – one that eliminates both spurious ambiguity and the need for a generalized function application step, while keeping the algorithm simple.

[16] This can also happen if variants of existing functions are introduced in the abstract syntax to model other linguistic universals, e.g. focus is optionally marked in the abstract syntax using a different set of functions.
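Normalization itself is a small tree transformation. A minimal sketch in plain Haskell, assuming each ADT node is annotated with the position of its word in the sentence (an assumption for illustration):

import Data.List (sortOn)

data NTree = NTree { wordPos :: Int, nfun :: String, kids :: [NTree] }

-- order the children of every node by surface distance to the parent
normalize :: NTree -> NTree
normalize (NTree p f cs) =
  NTree p f (sortOn (\c -> abs (wordPos c - p)) (map normalize cs))

-- For "big black cat" (positions 1 2 3, both adjectives attached to cat),
-- black (distance 1) is ordered before big (distance 2), fixing the
-- application order of the endo-centric AdjCN.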

Figure 1.10: Abstractions and representations underlying ast2dep and dep2ast

1.3.3 Expressivity and Limitations of ast2dep and dep2ast

The transducers for ast2dep and dep2ast, in the form presented here, are limited in the space of dependency structures they can cover. Figure 1.10 shows the primary, auxiliary and intermediate representations derived for an interlingual grammar. To better understand this, consider an example function defined in the abstract syntax as fTernary : C1 -> C2 -> C3 -> C. The function takes three arguments of distinct categories (C1, C2 and C3) and builds a result of category C. If we assume that each of these arguments is represented by a node in the dependency tree (HC1, HC2, HC3), the number of dependency trees that can be generated is 9: 3 trees of height one and 6 trees of height two. These are visualized in Figure 1.11. [17] ast2dep, using an abstract configuration defined for the function fTernary, can only generate the 3 dependency trees of height one. The same holds for dep2ast, where the AST can only be recovered from the one-level trees. The implicit assumption made by the configurations – that each function in the abstract syntax has a unique head and that the arguments of the function are each assigned a dependency label – limits the expressivity of the transducers. When operationalizing these transducers for GF-RGL and the UD scheme, these limitations are relaxed by extending the dependency configurations to generate the other dependency trees defined by UD. The extensions necessary to cover the target scheme do not necessitate generalized transducers; the relevant extensions in the context of Universal Dependencies are discussed next.

1.4 GF-RGL and Universal Dependencies

The implementation of interlingual grammars central to this thesis is the Resource Grammar Library (RGL / GF-RGL), which consists of concrete syntaxes for over 30 languages, all sharing a single abstract syntax (Ranta, 2009b).

17 Note that, as before, word order is not reflected here, since we are dealing with ADTs.

Figure 1.11: Possible dependency trees for 3 nodes (HC1, HC2, HC3). The first row corresponds to the one-level trees, which are covered by the configurations of ast2dep and dep2ast.

These grammars are expert-designed, and are intended as formal specifications of the morphology and syntax of the languages in the library. This collection of grammars is the primary linguistic artifact of GF – the RGL is used as a software library of syntax (Ranta, 2009a) to develop application grammars used in multilingual applications (Ranta et al., 2010, Dannells et al., 2013, Angelov et al., 2014). The design of the RGL, i.e. the abstract syntax of the RGL, has remained stable for well over 12 years. Concrete syntaxes for new languages are ongoing work (Lange, 2017, Papadopoulou, 2013, Paikens and Gruzitis, 2012), and more recently Listenmaa (2019) proposed methods to find errors in these libraries. There has also been work on connecting the RGL with external linguistic resources like FrameNet (Dannells and Gruzitis, 2014, Gruzitis and Dannélls, 2015) and WordNet (Angelov and Lobanov, 2016, Virk et al., 2014). Bernardy and Chatzikyriakidis (2017) have shown how the RGL can be helpful in Natural Language Inference (NLI) applications using formal semantics.

Parsing in GF is the inverse of the pure linearization rules specified in the concrete syntax of a language (Angelov, 2009). In the current thesis, this reversibility of GF grammars is used to linearize ASTs into both sentences and dependency trees in multiple languages.

The UD framework, on the other hand, develops annotated data, a.k.a. treebanks; the current version18 contains treebanks for over 70 languages. The treebanks are directly used to train dependency parsers – universal, multilingual and cross-lingual – by applying machine learning and statistical methods to induce models for parsing text. Applications built using UD include semantic parsing using logical forms (Reddy et al., 2016). The guidelines specifying the UD schema used in annotation are an encoding of the underlying UD grammar; however, the encoding is descriptive and not done

18 The latest version at the time of writing is v2.3. The next revision, UDv2.4, is expected to contain 83 languages.

property                  UD                   GF
primitive descriptors     dependency trees     abstract syntax trees
linguistic resources      treebanks            grammars
parser coverage           robust               brittle
parser speed              fast                 slow
disambiguation            context-sensitive    context-free
multilingual parsing      easy                 difficult
------------------------------------------------------------------
semantics                 loose                compositional
generation                non-deterministic    accurate
multilingual generation   difficult            easy
new language              low-level work       high-level work

Table 1.2: Complementary properties of GF and UD; UD's strengths appear above the dividing line.

using a formal grammar. This grammar can be contrasted with the interlingual grammars available in GF, which use specifications at different linguistic levels, in this case both morphology and syntax. Table 1.2 shows a high-level comparison of the current state of GF and UD, excluding ongoing work. For example, context-sensitive disambiguation models defined on the abstract syntax in GF have been studied and shown to work, though the widely used disambiguation model in GF is still context-free. Since the two cross-lingual efforts have largely been independent, there are differences in ways of thinking when handling some linguistic phenomena. One alternative in these cases is to redesign the RGL to match the target UD scheme.19 Another alternative is to extend the dependency configurations, thus retaining the algorithms used in ast2dep and dep2ast while also improving the correspondence between the two multilingual descriptions of language.

1.4.1 gf2ud: Extensions to ast2dep

The deterministic nature of the mapping between ASTs and UD trees in gf2ud presupposes that the abstraction levels of the RGL and UD are similar, i.e. that both the grammar and the annotation scheme make the same distinctions. This is not always the case, which can be illustrated using the example of modifiers.

AdvVP   VP → Adv → VP   head advmod/obl/advcl
AdvS    S → Adv → S     head advcl
AdvAP   AP → Adv → AP   head advmod/obl/advcl
AdvCN   CN → Adv → CN   head nmod
AdvNP   NP → Adv → NP   head nmod

Table 1.3: Functions in the RGL for modification. All the functions are recursive rules with endocentric configurations.

Adverbial modifiers in GF are grouped into two classes, functional modifiers and content modifiers. Functional modifiers like AdN, AdA and AdV are used to modify numeral expressions, adjectives and verbs respectively. Content modifiers are grouped

19 UD has undergone one revision, from UDv1 to UDv2, in the last 3 years. The current version UDv2 is expected to be stable for the next few years.

under a single category Adv, used to modify noun phrases (NPs), verb phrases (VPs) and sentences (Ss). All these categories are characterized by recursive functions that combine the modifiers with the different categories. UDv2, on the other hand, uses four different labels – advmod, obl, nmod, advcl – to attach the head of the modifier with its respective label.20 Functions that use the functional modifiers have a straightforward labelling, the advmod label. Table 1.3 shows the different functions that use the content modifier Adv; it can be seen that the mapping in the case of AdvVP and AdvAP is ambiguous. The precise mapping relies on the internal structure of the modifier: when the Adv contains a verbal modifier, it is mapped to advcl; when it contains a prepositional phrase, to obl; and to advmod in all other cases. In order to induce these finer distinctions, non-local configurations are defined that specify the additional context in which certain configurations are applied. Shown below is the handling for the function AdvVP; similar extensions are necessary for handling the function AdvAP. The first three mappings are applied only when the internal structure of the Adv matches the specified tree patterns.

AdvVP ? (GerundAdv ?)   head advcl    -- GerundAdv : VP -> Adv
AdvVP ? (PrepNP ? ?)    head obl      -- PrepNP : Prep -> NP -> Adv
AdvVP ? (PrepCN ? ?)    head obl      -- PrepCN : Prep -> CN -> Adv
AdvVP                   head advmod   -- AdvVP : VP -> Adv -> VP

Similarly, non-local configurations are also defined on the concrete syntax to handle mismatches specific to a language.

Coordination poses a unique problem in gf2ud and ud2gf, for multiple reasons. Figure 1.12 shows two fragments of GF grammars for coordination of noun phrases (NPs). The first fragment – illustrating how coordination is implemented in the GF-RGL – defines a ConsNP function that takes two arguments, an NP and a ListNP, and prepends the NP to the existing list of NPs. ListNP is the category in the grammar used to model a list of NPs of arbitrary length; similar List* categories are defined and used for other categories in the grammars. The second fragment defines an analogous AppendNP function that appends the NP to the end of the existing list of NPs. In other words, the RGL fragment uses left-branching to build ASTs of the ListNP category, while the alternative uses right-branching. Both fragments cover the same set of utterances; the choice of one branching over the other was a grammar engineering choice, not motivated by linguistic reasons.

In the right-branching AST, local configurations result in a flat dependency tree with the first item of the list as head and all other items attached as dependents using the conj label. In the left-branching AST defined by the RGL, similar local configurations result in a chain of head and conj edges. This is shown in Figure 1.13. The dependency tree corresponding to the left-branching AST derived using the implementation in the RGL contains non-projective edges (i.e. crossing edges in the dependency tree). In comparison, the right-branching AST generates a flat dependency tree – with the first NP as the head and edges to all other NPs marked conj – without any crossing edges. This is the annotation used in UDv1 for marking co-ordination.

1.4.2 ud2gf: Extensions to dep2ast

The dep2ast method presented in Section 1.3 abstracts away from certain practical details encountered in addressing Universal Dependencies. The extensions in ud2gf described below are primarily designed to address them.

20 In UDv1, these are mapped to three labels, advmod, nmod and advcl.

BaseNP : NP -> NP -> ListNP ;       BaseNP head conj
ConsNP : NP -> ListNP -> ListNP ;   ConsNP head conj
ConjNP : Conj -> ListNP -> NP ;     ConjNP cc head

(a) Co-ordination of noun phrases (NPs) as implemented in GF-RGL and corresponding UD configuration.

BaseNP : NP -> NP -> ListNP ;         BaseNP head conj
AppendNP : ListNP -> NP -> ListNP ;   AppendNP head conj
ConjNP : Conj -> ListNP -> NP ;       ConjNP cc head

(b) An alternative implementation of co-ordination of NPs and the corresponding UD configuration.

Figure 1.12: Co-ordination in GF.
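To see how the left-branching RGL fragment produces the surface list, here is a minimal string-only linearization sketch (a simplification invented for illustration: it assumes that NP, Conj and the resulting NP all linearize to plain records {s : Str}, and it ignores the agreement features that the actual RGL threads through):

  lincat ListNP = {s1 : Str ; s2 : Str} ;               -- s1: all elements but the last, s2: the last element
  lin BaseNP x y = {s1 = x.s ; s2 = y.s} ;
  lin ConsNP x xs = {s1 = x.s ++ "," ++ xs.s1 ; s2 = xs.s2} ;
  lin ConjNP conj xs = {s = xs.s1 ++ conj.s ++ xs.s2} ;

With these rules, ConjNP and_Conj (ConsNP men (BaseNP women children)) linearizes to men , women and children.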


(a) UD tree for right-branching AST. (b) UD tree for left-branching AST.

Figure 1.13: Two possible dependency trees for NPs using different grammars for co-ordination.

(1) Robustness using Backup grammars: The first extension is designed to simultaneously address two problems with dep2ast – incomplete coverage in grammars and recovery from parser errors.

(2) Temporary categories: Extensions to the expressivity of dep2ast are required to handle specific linguistic constructions in UD. One such extension is the definition and use of temporary categories in the configurations, which behave like aliases to existing categories in the grammar. The result is still a restricted transducer, but one that can correctly translate specific constructions in UD like coordination and prepositional verbs.

(3) De-lexicalized ud2gf: The problem solved by dep2ast in the form presented in Section 1.3 is a constrained search problem, i.e. the availability of a high-precision grammar is a prerequisite in this setup for completeness. This prerequisite can be relaxed, especially if we are interested in the frequent use-case of parsing with an incomplete lexicon of a language in the GF-RGL, or parsing languages not covered by the GF-RGL.

Robustness in ud2gf

In Section 1.3, we glossed over the possibility in the search algorithm that syntactic annotation at an intermediate step does not result in any functions matching the dependency subtree. There are two possible reasons why this happens21: either the grammar is missing a function (syntactic rule) to handle a specific dependency label, or the dependency tree does not match the configuration (head and label information) corresponding to the relevant function in the grammar.

The first issue of incomplete coverage arises for ungrammatical sentences which dependency parsers nevertheless analyse correctly, for example the phrase the every man in English. Indeed, this robustness is one of the strengths of UD that needs to be imported to GF. Possible interpretations for the example are the ASTs for the man and every man, to which every and the are added as modifiers respectively. The same can also happen because the set of dependency structures generated by the combination of the grammar and gf2ud is smaller than the set of dependency structures in the UD annotation scheme. The second case is caused by errors in dependency parsing or annotation, where either the label is wrongly suggested by the parser or the complete edge is wrong in the dependency tree.

In order for ud2gf to work as a robust front end for parsing in GF, it is necessary to address both these problems – which we do by extending the grammar with a backup grammar. The backup grammar consists of a set of backup functions that are used to recover from both scenarios. The backup grammar defines, for every category in the grammar, a pair of functions, and a top-level function that can combine an arbitrary number of trees of the Backup category (shown in Figure 1.14).

BackupC : C -> Backups -> C ;                 -- wrap tree of type C with Backups
CBackup : C -> Backup ;                       -- build Backup from tree of type C
NilBackup : Backups ;                         -- no Backups
ConsBackup : Backup -> Backups -> Backups ;   -- chain of Backups

Figure 1.14: The two types of elementary functions defined in the backup grammar. The BackupC functions are defined for functions higher in the GF category taxonomy and CBackup functions are defined for every category in the grammar.

Consequently, ud2gf adds another step: after each dependency subtree is translated into an AST T, the ASTs corresponding to any uncovered subtrees are converted to backups and added as adverbial modifiers to the interpreted AST T. A sketch of the resulting tree shape is shown below.
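As a hypothetical illustration (the instantiated function names BackupNP and DetBackup are invented here by analogy with the schema in Figure 1.14, with C instantiated to NP and Det), the phrase the every man could come out as:

  -- "every man" is interpreted in the grammar;
  -- the uncovered determiner "the" is attached as a backup
  BackupNP (DetCN every_Det (UseN man_N))
           (ConsBackup (DetBackup the_Det) NilBackup)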

Temporary categories

The limited expressivity of the transducer discussed in Section 1.3 poses a systematic problem when working with Universal Dependencies22, due to the annotation of particular linguistic constructions that are common in text. This can be illustrated using two examples in English – although there are other instances of this issue in the RGL. The first example is the UD tree for the phrase men, women and children; Figure 1.15 shows the different dependency trees defined by the annotation schemes of UDv1 and UDv2. The annotation of the entire phrase is a one-level dependency tree in UDv1, while in UDv2 the resulting dependency tree is of depth 2. Converting the one-level dependency tree used to annotate the entire coordination in UDv1 is straightforward

21 The third possibility is plausible but easier to handle: the language-specific UD annotation defines a subtype of a core label, which needs to be added as an ambiguous configuration to one of the existing functions.

22 This is more systematic in UDv2 than in UDv1, due to the changes in how coordination is annotated in UDv2.


Figure 1.15: NP coordination in the phrase men, women and children per the UD guidelines. The red edge shows the treatment of the conjunction per UDv2, while the blue edge is per the UDv1 annotation. The part-of-speech tag of the conjunction in UDv1 would be CONJ instead of CCONJ.

4.a  root  ConjCN 3 (ConsCN 4 (BaseCN (UseN men_N) 2))   [CN]      {1,2,3,4}
4.b  root  ConsCN 4 (BaseCN (UseN men_N) 2)              [ListCN]  {1,2,4}
4.c  root  BaseCN (UseN men_N) 2                         [ListCN]  {1,2}
4.d  root  UseN men_N                                    [CN]      {1}
4.e  root  men_N                                         [N]       {1}
1    conj  UseN women_N                                  [CN]      {2}
2    cc    and_Conj                                      [Conj]    {3}
3    conj  UseN children_N                               [CN]      {4}

Figure 1.16: Iterative function application to translate the NP coordination in UDv1. Shown in red and blue is the order in which the ASTs are refined iteratively. The details have been simplified by positing lexical functions for the plural forms.

(iterations in the syntactic annotation phase are shown in Figure 1.16). However, the same does not work for the two-level dependency tree, because there is no function in the grammar that can be applied to a single noun and a conjunction – the result being that the conjunction "and" is covered using backup functions. The solution is to introduce category aliases in the configurations and to build helper functions that use these categories to build the corresponding ASTs in a compositional way. Table 1.4 shows the relevant additions to the configurations required to handle a single conjunction; a similar treatment for all other conjunctions is added to the configurations. A worked derivation using these additions follows Table 1.4.

category alias            cat CN[and] = CN

helper functions          and[Conj2CN]   CN → CN[and] → CN       head conj
(quasi-interlingua)       and[ConjCN]    ListCN → CN[and] → CN   head conj

function definitions      and[Conj2CN]   λ init,last → ConjCN and_Conj (BaseCN init last)
(quasi-interlingua)       and[ConjCN]    λ init,last → ConjCN and_Conj (ConsCN last init)

helper functions          ENand[CN]   Conj → CN → CN[and]   cc head   cc.lemma=and
(language-specific)       SVand[CN]   Conj → CN → CN[and]   cc head   cc.lemma=och
                          FIand[CN]   Conj → CN → CN[and]   cc head   cc.lemma=ja

function definitions      ENand[CN]   λ conj,last → last
(language-specific)       SVand[CN]   λ conj,last → last
                          FIand[CN]   λ conj,last → last

Table 1.4: Configurations added for handling conjunctions in UDv2.
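To make the table concrete, here is a worked sketch of how these configurations translate the UDv2 tree of men, women and children (the steps follow the function definitions in Table 1.4 literally; the intermediate trees are our own illustration):

  -- step 1: ENand[CN] matches the cc edge (lemma "and") and its head "children";
  -- by its definition it simply returns the head as a CN[and]
  ENand[CN] and_Conj (UseN children_N)   -- : CN[and], equal to (UseN children_N)
  -- step 2: and[ConjCN] combines the ListCN for "men, women" with that CN[and]
  and[ConjCN] (BaseCN (UseN men_N) (UseN women_N)) (UseN children_N)
  -- which, by its definition, expands to the RGL AST
  ConjCN and_Conj (ConsCN (UseN children_N) (BaseCN (UseN men_N) (UseN women_N)))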

Lexicalized vs De-lexicalized ud2gf

The second extension addresses a practical design choice in the pipeline: whether one can assume the availability of a multilingual lexicon for all languages as a prerequisite for ud2gf. We address this by introducing two flavors of ud2gf – a lexicalized variant and a de-lexicalized variant. The lexicalized ud2gf assumes the availability of a GF lexicon (monolingual or multilingual), which is a reasonable assumption when working with precision grammars. The de-lexicalized ud2gf, on the other hand, dynamically builds lexical functions using the lemma and the category configurations in the lexical annotation step (a sketch is given below). This variant is more suitable in a wide-coverage parsing setup, which in turn is used to extend the core RGL in our experiments with UD treebanks.

A different way to perceive the lexicalized and de-lexicalized variants is to say that the lexicalized variant is a constrained version of the transducer, where additional configurations are required to accurately match non-compositional phrases in the grammar. For example, in the lexicalized ud2gf setup, configurations corresponding to phrasal verbs and multi-word expressions are required to match the correct phrasal functions in the dictionary. These are semi-automatically generated from the dictionary for the language and use a combination of helper functions and category aliases. The assumption that a large-scale interlingual dictionary is available does not scale well across languages. In the de-lexicalized ud2gf setup, functions that build these phrasal verbs for different patterns are instead introduced into the grammar. This does lead to some over-generation in the de-lexicalized grammar, but it also simplifies the handling of configurations when compared against the configurations derived from a dictionary in the lexicalized case.
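As a hypothetical sketch of the dynamic lexicon construction (the configuration line and the synthesized entry are invented for illustration; mkN is an actual RGL paradigm function), a token with lemma cat and universal POS tag NOUN could give rise to:

  NOUN  N                  -- category configuration: UD POS tag NOUN maps to RGL category N
  fun cat_N : N ;          -- lexical function synthesized on the fly by de-lexicalized ud2gf
  lin cat_N = mkN "cat" ;  -- built with the RGL smart paradigm for nouns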

1.4.3 Applications

The implementations of gf2ud and ud2gf have multiple applications, a few of which are listed below.

[I] The transducer gf2ud is useful to bootstrap parallel treebanks according to the annotation scheme of UD.

[II] The transducer ud2gf combined with a dependency parser serves as a robust front end instead of the parser used in GF.

[III] The configurations defined in gf2ud are also a formal encoding of the UD annotation scheme, which is useful for a qualitative assessment of the cross-lingual annotation scheme.

[IV] The combination of ud2gf and the concrete syntax for a language yields a setup similar to the task of surface realization: generating the sentence from a UD tree without word order and word forms (Mille et al., 2017, 2018).

[V] Furthermore, the transducer ud2gf makes UD treebanks accessible to train statistical models useful for disambiguation of ASTs.

The above applications have been studied in the current thesis, and are visually summarized in Figure 1.17.

Figure 1.17: Pipeline of gf2ud and ud2gf.

1.5 Related Work

The dependency configurations central to both ast2dep and dep2ast are similar to the head rules of Collins (1996), used to introduce selectional preferences into context-free grammars for parsing.23 Translations between representations – from phrase-structure trees to dependency structures – have been widely used in both NLP and CL, and de Marneffe et al. (2006), de Marneffe and Manning (2008) can be considered prototypical approaches in this line of work. The approach of de Marneffe et al. (2006) consists of two steps: (a) dependency extraction and (b) dependency typing. The dependency extraction step uses head rules to mark the head of each constituent and construct an unlabelled dependency tree. This is followed by typing, where labels are assigned to edges using pattern-matching rules on the phrase-structure tree. The pattern matching is done at every node in the tree, and the most specific matching pattern determines the label. This is similar to the algorithms ast2dep and gf2ud; in our case, however, the dependency typing and the head marking happen before the dependencies are extracted, and the pattern-matching rules necessary for inducing the labels are defined on the abstract syntax instead of the phrase-structure tree.

There is, however, a significant difference between ast2dep and gf2ud and such transducers for translating between representations. Transducers between representations rely on heuristic rules, especially to recover from parser errors. Both the head rules and the typing rules are thus defined on plausible tree patterns – which makes them robust but also non-reversible. The domain of dependency trees is not well-specified; instead, the dependency scheme is precisely specified.

Grammars designed in computational syntactic formalisms like HPSG and LFG (Kaplan and Bresnan, 1982) have been used to induce dependency representations from the primary descriptions used in the respective frameworks (Meurer, 2017, Przepiórkowski and Patejuk, 2018). More broadly, formalisms that use any degree of lexicalization can

23 Selectional preferences are lexical constraints on plausible sentences in a language: for example, it is more likely to say I eat pizza with a knife than I eat a pizza with sauce.

generate dependency trees as auxiliary descriptions.24 This is illustrated both in Tree Adjoining Grammars (Joshi and Schabes, 1997), where these auxiliary descriptions are called derivation trees, and in Combinatory Categorial Grammars (Hockenmaier and Steedman, 2007). These frameworks constrain plausible dependency trees by defining a domain restricted by the grammar, but this also restricts the dependency scheme to the one defined by the grammar. This is a necessary feature especially if one is interested in translation in the other direction, i.e. converting dependency structures to the primary descriptions of the formalism. There has been a long line of work related to parsing in these frameworks where dependency trees have been used to improve the robustness of the parsing pipeline. However, the tight coupling between the original grammar and the dependency representations necessitates a redesign of the grammar to change the dependency scheme.

The transducers described in this thesis should be seen as a combination of these two approaches: the external configuration provides the flexibility to modify the target annotation scheme without making changes to the grammar, but the configuration used in ast2dep and gf2ud is also a specification of the grammar and is used in dep2ast and ud2gf as well. Using a dependency parser as a robust front end in dep2ast and ud2gf is also related to parsing approaches using supertagging (Vaswani et al., 2016, Dridan, 2013, Clark, 2002, Bangalore and Joshi, 1999), where the input is pre-processed using a tagger that assigns complex descriptions, significantly simplifying the task of parsing (Bangalore, 1997). The interlingual grammars, however, are not lexicalized, and as such dep2ast uses a dependency parser instead of a plain tagger.

1.6 Results

The experiments reported in this thesis use both intrinsic and extrinsic evaluation to validate the methods. The intrinsic evaluation in the experiments with gf2ud uses statistics about labelled edges in the bootstrapped parallel UD treebanks. Similarly, the intrinsic evaluation in the experiments with ud2gf reports statistics about the expected tree quality in terms of coverage and interpretability, where interpretability corresponds to the fraction of cases in which the AST is built without resorting to backups. In cases where backups are used in building the AST, part of the tree contains no backups and is hence interpreted in the grammar, while the remaining nodes trigger the backup functions in the translation algorithm; we measure these two fractions separately in our evaluation. The results of the intrinsic evaluation are reported in Table 1.5 for gf2ud and in Table 1.6 for ud2gf.

Extrinsic evaluation is the evaluation of the transducers in an external application – in the current thesis, we evaluate gf2ud on the task of bootstrapping de-lexicalized UD treebanks. We use synthetic UD treebanks bootstrapped from the GF-RGL grammars and train dependency parsers, in order to evaluate the usefulness of data generated using interlingual abstractions when compared to human annotations. Figure 1.18 shows the learning curves of a de-lexicalized parser for English trained on synthetic data generated using interlingual grammars.

24 Lexicalized grammars are grammars in which complex descriptions are defined for each word (or lexical item because the definition of word itself is complex and varies from language to language). These descriptions can be trees, higher-order functions or attribute-value matrices (AVMs). The degree of lexicalization in a formalism can range from strong – where the lexicon is the only and entire grammar – to weak in which the lexicon constitutes only a large part of the grammar. 26 CHAPTER 1. INTRODUCTION

                              GF treebank
Language       UD treebank    RGL      Wide-coverage
Afrikaans      81.57          -        81.73
Amharic        73.60          -        75.29
Bulgarian      76.60          80.53    88.13
Catalan        82.18          76.13    83.10
Chinese        84.89          77.33    82.46
Danish         80.49          -        87.45
Dutch          83.62          68.10    84.23
English        84.12          81.17    93.13
Estonian       82.38          79.10    86.52
Finnish        81.84          74.09    91.27
French         82.81          79.61    93.47
German         83.09          77.47    95.27
Greek          83.09          -        88.13
Hindi          72.63          69.34    84.18
Italian        81.79          77.52    90.47
Japanese       71.39          71.09    84.94
Latvian        72.11          -        89.10
Maltese        83.92          -        90.29
Mongolian      72.22          -        87.34
Nepali         82.91          -        83.19
Norwegian      80.37          -        86.36
Persian        83.87          -        84.40
Punjabi        72.37          -        82.82
Polish         83.05          -        87.38
Romanian       69.58          -        86.75
Russian        87.30          -        87.92
Sindhi         68.21          -        84.59
Spanish        81.41          79.15    91.28
Swedish        82.56          83.05    93.89
Thai           83.72          73.12    85.29
Urdu           72.86          -        84.67

Table 1.5: Percentage of completeness in the bootstrapped dependency treebanks. The Amharic grammar (underlined) is incomplete, i.e. it does not implement all RGL functions. The same table appears in Kolachina and Ranta (2016).

language    #trees   #confs   %int'd trees   %int'd nodes
English     2077     59       72             94
Finnish     648      49       68             92
Finnish*    648      0        61             79
Swedish     1219     62       70             91
Swedish*    1219     0        65             76

Table 1.6: Coverage of nodes in each test set (L-ud-test.conllu). L* (Swedish*, Finnish*) is with language-independent configurations only. #confs is the number of language-specific configurations. %int'd trees and %int'd nodes are the percentages of interpreted trees and nodes, respectively. These results are improvements over those in Ranta and Kolachina (2017).

Figure 1.18: Learning curves for English comparing synthetic treebanks with a real treebank. Models trained on synthetic data take approximately twice the amount of data to achieve accuracies comparable to models trained on real examples.

1.7 Summary of the studies

The individual studies are structured as follows:

1) Paper I is the first part of the study exploring the relation between abstract syntax trees (ASTs) and dependency structures in the UD scheme. The paper extensively studies the connection between the two, in addition to proposing a generalized transducer to translate ASTs to dependency trees.

2) Paper II is the second part of the study, where the inverse problem, i.e. translating dependency trees to abstract syntax trees, is studied. The algorithm presented there is a restricted version of the generalized method discussed in Section 1.4, and it works on a large fragment of the structures found in the UD treebanks.

3) Papers III and IV both cover the topic of dependency parsing, in relation to Universal Dependencies as the target scheme.

4) While Paper III focuses on the task of de-lexicalized parsing using synthetic UD treebanks, Paper IV focuses on addressing one specific shortcoming of dependency parsers: parsing sentences with out-of-vocabulary words.

Paper IV is a stand-alone study on symbolic dependency parsers and ways to improve them, but it also explores initial steps towards improving the robustness of dependency parsers, and of parsers in general. The work sets the tone for Paper III, which shares the objective of improving the robustness of parsers, albeit for GF grammars and using a completely different approach. The rest of this chapter summarizes the individual contributions of each of these studies.

1.7.1 Paper I: gf2ud

From Abstract Syntax to Universal Dependencies

This paper presents a conversion method from abstract syntax trees to dependency trees. This is done in two steps: by proposing a general algorithm that builds dependency trees for a given interlingual grammar in GF (ast2dep), and by applying this algorithm to convert GF-RGL trees to Universal Dependencies (UD). One of the aims of the study is to precisely describe the correspondence between the two multilingual abstractions, namely GF-RGL grammars and UD. The correspondence between GF-RGL and UD turns out to be good, and the relatively few discrepancies give rise to interesting questions about universality. The conversion also has applications: (a) it can bootstrap parallel UD treebanks from GF treebanks; (b) it defines a formal encoding of the annotation guidelines of UD, in terms of functions in the GF-RGL grammar; (c) it makes information from UD treebanks available for the construction of ASTs; (d) it gives a method to check the consistency of manually annotated UD trees with respect to the annotation schemes.

The conversion is tested and evaluated by bootstrapping two small treebanks for 31 languages, as well as by comparing a GF version of the English Penn treebank with the UD version. In the first case, the bootstrapped treebanks are evaluated in terms of the percentage of labelled edges, while the Penn treebank is compared against a UDv1 treebank obtained by running the Stanford dependency converter on the Penn treebank. Furthermore, the work in this paper serves as a precursor to the work in Papers II and III, and leaves some unexplored directions for future research.

1.7.2 Paper II: ud2gf

From Universal Dependencies to Abstract Syntax

This study is a continuation of Paper I, which describes the relation between ASTs and dependency trees. This paper attempts to invert the mapping: take dependency trees from standard UD treebanks and reconstruct ASTs from them. The primary aim of this method is to help GF-based interlingual translation by providing a robust, efficient front end, as a substitute for the exact parsing used in GF. However, since UD trees are based on natural (as opposed to generated) data and built manually or by machine learning (as opposed to rules), the conversion is not trivial. As mentioned above, this work uses both insights and artifacts (mainly the dependency configurations) from Study I as a starting point. The study provides a stand-alone description of a basic algorithm, essentially focussed on inverting the conversion, i.e. dep2ast. This method covers around 70% of the nodes, and the rest can be covered by approximative backup strategies. Analyzing the reasons for the incompleteness reveals structures missing in the GF grammars, but also some problems in the UD treebanks. Extensions to the core algorithm and improvements of the results presented in this study have already been described in Section 1.4.2.

1.7.3 Paper III: Bootstrapping UD treebanks

Bootstrapping UD treebanks for Delexicalized Parsing

Standard approaches to treebanking traditionally employ a waterfall model (Sommerville, 2010), where annotation guidelines guide the annotation process and insights from the annotation process in turn lead to subsequent changes in the annotation guidelines. This process remains a very expensive step in creating linguistic resources for a target language: it requires both linguistic expertise and manual effort to develop the annotations, and it is subject to inconsistencies in the annotation due to human errors. In this paper, we propose an alternative approach to treebanking – one that requires writing grammars instead. This approach is motivated specifically in the context of Universal Dependencies, an effort to develop uniform and cross-lingually consistent treebanks for multiple languages.

We show that a bootstrapping approach to treebanking via interlingual grammars is plausible and useful in a process where grammar engineering and treebanking are jointly pursued when creating resources for the target language. We demonstrate the usefulness of synthetic treebanks in the task of delexicalized parsing. Our experiments reveal that simple models for treebank generation are cheaper than human-annotated treebanks, especially at the lower ends of the learning curves for delexicalized parsing, which is particularly relevant in the context of low-resource languages.

1.7.4 Paper IV: OOV words in Dependency parsing

Replacing OOV Words For Dependency Parsing with Distributional Semantics

Lexical information is an important feature in dependency parsers – both in the case of stand-alone parsing given a tagged sequence and in pipeline systems where other components like part-of-speech (POS) taggers rely on the forms of the lexical items. However, no such information is available for out-of-vocabulary (OOV) words, which causes many classification errors. In this study, we propose a method to address this shortcoming: replacing OOV words with known, in-vocabulary words that are similar according to different notions of similarity. Specifically, we study two such notions in detail: semantic and morphological similarity. The replacement candidates are obtained using distributionally similar words computed from a large background corpus, as well as morphologically similar words sharing common suffixes. Extensive experiments are done to cover different design parameters: both transition-based and graph-based symbolic dependency parsers; count-based and dense neural vector-based semantic models for distributional similarity; and a set of typologically diverse languages for each of the two similarity heuristics. We show performance differences for both count-based and dense neural vector-based semantic models using the proposed technique. Further, we discuss the interplay of POS and lexical information for dependency parsing and provide a detailed analysis and discussion of the results: while we observe significant improvements for count-based methods, neural vectors do not increase the overall accuracy.

1.8 Conclusions and Future work

The thesis studies the interrelatedness of different types of multilingual abstractions. On one side are interlingual reversible grammars built using the Grammatical Framework, designed to be used in applications focussed on multilingual generation and semantics. On the other side, Universal Dependencies are parallel rather than interlingual abstractions, designed to be used in applications focused on light-weight syntax. The primary representations in GF and UD are abstract syntax trees (ASTs) and dependency trees, respectively. The aim of the thesis is to bridge these two representations, with the practical goal of exchanging linguistic artifacts between the two frameworks.

The algorithms proposed in this work address the translation of ASTs into dependency trees and the reverse direction, i.e. translating dependency trees into ASTs. These algorithms are useful for both GF and UD – the ease of multilingual parsing in UD is imported to GF, while generation with GF is in turn useful to bootstrap UD resources for new languages. Both these applications have been developed in this thesis: dependency parsers built using UD resources have been used as a robust front end to parsing in GF, and at the same time, dependency parsers trained on UD treebanks generated using GF have also been presented. This work shows how grammar engineering can be used to bootstrap treebanks as opposed to annotating data by hand, especially for languages with few resources.

In the later part of the thesis, we also propose a method to address the well-known problem of out-of-vocabulary words in dependency parsers. Using a distributional thesaurus as an additional source of information, we show that parsing can be improved using a simple technique – replacing unknown words with semantically similar known words. Empirical results using multiple parsers across 7 languages show overall improvements, and improvements specific to out-of-vocabulary words, that are statistically significant.

1.8.1 Open Problems and Future directions

From ASTs to Graphs

An alternative to Universal Dependencies that captures the semantics of language is Abstract Meaning Representation (AMR) (Banarescu et al., 2013). Much of the ongoing work in AMR is focussed on English and is missing a multilingual component. The representations are richer than what is typically used in tasks focussed on syntax – graphs rather than trees, and more generally a richer class of graphs. GF grammars have been shown to work in controlled domains as precise models of semantics – it is thus interesting to look at connections between GF and the representations used in AMR. These connections have been explored both in the context of surface realization (Gruzitis et al., 2017) and multilingual summarization (Gruzitis and Barzdins, 2016). However, what remains to be seen is whether GF can help in adding multilinguality to AMR. In the best case, the interlingua grammars of GF may serve as under-specified representations in parallel graph treebanks that can accelerate the development of multilingual graph bank resources.

Data Augmentation for Dependency Parsing

The work on data augmentation presented in this thesis is a preliminary study within the broader topic of data augmentation methods for dependency parsing. Multiple directions are left unexplored in this study – the use of grammar-based abstractions as adversarial generators for breaking existing parsers, as well as building new parsing models using augmented data. Another clear direction for this line of work is to study how synthetic data can be combined with real, human-annotated data in dependency parsing.

Predicting Consistency in Universal Dependencies

The configurations developed in gf2ud and ud2gf are a formalization of the guidelines provided to human annotators. Xia (2001) described how the annotation guidelines of a treebank are a grammar – one of a radically different form than the formal grammars used in this thesis. The encoded transductions on the abstract syntax for both gf2ud and ud2gf raise an obvious question: how much of the annotated treebanks is consistent across languages, or within a language across treebanks? Previous work on error detection in treebanks has been done in monolingual settings. Using the RGL – specifically the concrete syntaxes for various languages – the fraction of the treebanks that can be generated can serve as an indicator of how consistent these treebanks are with respect to the interlingual grammar at hand.

Concept alignment: Inducing Interlingual grammars

Most of the work in this thesis has been focused on multilingual parsing and generation. The obvious direction ahead is to put these two together, i.e. to induce interlingua grammars for a given multi-text. This particular question opens up problems that have previously been avoided in NLP – extending sequence alignment methods to multiple sequences and designing better abstractions for more than two languages. Current research in deep neural machine translation has been encouraging for this direction of working with more than two languages – however, interlingua grammars promise an explainable abstraction that is more compact than deep neural models.

Chapter 2

Paper I: gf2ud

From Abstract Syntax to Universal Dependencies

Prasanth Kolachina, Aarne Ranta

Linguistic Issues in Language Technology 13(3), 2016.


Abstract

Abstract syntax is a semantic tree representation that lies between parse trees and logical forms. It abstracts away from word order and lexical items, but contains enough information to generate both surface strings and logical forms. Abstract syntax is commonly used in compilers as an intermediate between source and target languages. Grammatical Framework (GF) is a grammar formalism that generalizes the idea to natural languages, to capture cross-lingual generalizations and perform interlingual translation. As one of the main results, the GF Resource Grammar Library (GF-RGL) has implemented a shared abstract syntax for over 30 languages. Each language has its own set of concrete syntax rules (morphology and syntax), by which it can be generated from the abstract syntax and parsed into it. This paper presents a conversion method from abstract syntax trees to dependency trees. The method is applied for converting GF-RGL trees to Universal Dependencies (UD), which uses a common set of labels for different languages. The correspondence between GF-RGL and UD turns out to be good, and the relatively few discrepancies give rise to interesting questions about universality. The conversion also has potential for practical applications: (1) it makes the GF parser usable as a rule-based dependency parser; (2) it enables bootstrapping UD treebanks from GF treebanks; (3) it defines formal criteria to assess the informal annotation schemes of UD; (4) it gives a method to check the consistency of manually annotated UD trees with respect to the annotation schemes; (5) it makes information from UD treebanks available for the construction and ranking of GF trees, which can improve GF applications such as machine translation. The conversion is tested and evaluated by bootstrapping two small treebanks for 31 languages, as well as comparing a GF version of the English Penn treebank with the UD version.

2.1 Introduction

Computational syntax can work on different levels of abstraction. The lowest level normally used when processing written text is strings of tokens ("words"). But it is often useful to work with more abstract structures: part-of-speech (POS) tagged lemma sequences, phrase structure trees, dependency trees, or some kind of logical forms. Raising the level of abstraction often gives new ways to relate different languages to each other. Thus tagged lemmas enable the separation of surface strings from word senses, which can be useful, for instance, in factored machine translation. Logical forms ideally ignore all language-related features, and express just the pure propositional meaning. But what about syntax trees? Traditional phrase structure trees preserve surface words and constituent order and are hence language-dependent. Dependency trees are often a bit more abstract, treating the word order as irrelevant and lemmatizing the words. Sharing the POS tags and dependency labels between languages increases this potential to abstract over languages.

Universal Dependencies (UD, de Marneffe et al. (2014)) is a recent approach to dependency parsing that tries to maximize the sharing of structures between languages. UD has a set of dependency labels and POS tags that are designed to fit many languages, and a series of annotation manuals that guide treebank builders to use the labels and tags in a uniform way. The expected gain is that efforts can be shared among languages. For instance, searching for semantic roles in sentences can be defined uniformly for different languages, and parsers for new languages can be bootstrapped by using treebanks for other languages.

As suggested by Nivre (2015), UD can be seen as a modern approach to universal grammar. The originally mediaeval idea of a universal grammar has many times been rejected by linguists, often with good reasons. But much of it can be saved if we think of it as an abstraction: on a proper level of abstraction, languages have much in common, so why not try to find out what is common? Universality should be seen as a working hypothesis rather than an a priori truth. In UD, the working hypothesis is that languages have a common set of parts of speech (nouns, verbs, etc) as well as grammatical functions (subject, modifier, etc). Some languages don't have all of these features, and individual languages may have features that are not universal. The annotation manuals for treebank builders have recommendations that maximize the use of common features, but give room to diversity. This approach has proved successful, and as a result, UD has presented treebanks for over 30 languages.

An older but nonetheless computational approach to universal grammar is Curry's notion of tectogrammatical structure (Curry, 1961). The tectogrammatical representations are function applications1. They are trees that describe pure constituency: what the constituents are and how they are put together, ignoring what word strings are ultimately used and what their linear order is. To give an example, subject-verb-object predication could be presented by a tectogrammatical function Pred,

Pred : TV -> NP -> NP -> S

that takes a transitive verb and two noun phrases as its arguments and produces a sentence. The constituent order (SVO, SOV, etc) is specified separately for each language in their phenogrammatical rules. These rules may look as follows:

1 Also known as lambda terms or LISP terms or, as in Curry's original work, terms of combinatory logic.

Pred verb subj obj = subj ++ verb ++ obj
Pred verb subj obj = subj ++ obj ++ verb

for SVO and SOV, respectively (with ++ marking concatenation). Curry's tectogrammar inspired the Prague school of dependency parsing (Böhmová et al., 2003). The grammars, however, are different, since the Prague school is based on the roles of words (similar to what is traditionally called "grammatical functions"), whereas Curry's tectogrammatical functions are functions that combine words.

Grammatical Framework (GF, Ranta (2004b, 2011)) is a grammar formalism that is more directly based on Curry's architecture. GF grammars are similar to grammars used in compiler construction, where tectogrammar is called abstract syntax and phenogrammar is called concrete syntax or linearization (McCarthy, 1962, Appel, 1998).2 The most comprehensive multilingual grammar in GF is the Resource Grammar Library, GF-RGL (Ranta, 2009b), which by the time of writing has concrete syntaxes for over 30 languages, ranging from European through Finno-Ugric and Semitic to East Asian languages.3

When the UD approach appeared, it became immediately interesting to see how it relates to GF-RGL. The formal correspondence was obvious: once we have a GF abstract syntax tree, we can easily derive a dependency tree. For instance, a rule of the form

Pred verb subj obj

gives rise to a dependency tree where the first argument produces the head, the second argument produces a dependent with label subj, and the third argument a dependent with label obj. As we will show more formally in Section 2.2.2, a simple recursive function can convert abstract trees to dependency trees in this way. However, there are details that remain to be worked out:

• How to convert GF-RGL to an independently given dependency scheme, such as UD?
• Is GF-RGL complete, in the sense of covering all UD structures?
• Can GF-RGL give any new insights for developing UD further?

The purpose of this paper is to answer these questions. While doing so, we will often discuss the differences in analyses between GF-RGL and UD, and in many cases argue for the GF-RGL decisions. But in a bigger picture, we have been surprised to see how much the approaches have in common. That so similar structures of "universal grammar" have been found in two independent ways can be seen as confirming evidence for both of them.

Our work is also expected to have practical uses: bootstrapping UD treebanks from GF; using UD treebanks to help GF parsing (in particular, statistical disambiguation); assessing UD annotation schemes and treebanks from a formal perspective. Our conversion moreover makes the GF-RGL parser usable as a rule-based UD parser, although a rather slow one. Perhaps more interestingly, generation from UD trees becomes possible (including translations to other languages), because GF grammars are reversible (and generation is fast, as opposed to parsing).

2 As noted by Dowty (1979), also Montague grammar (Montague, 1974) can be seen as having Curry's architecture, although Montague only used it for English.

3 The current status of GF-RGL can be seen in http://www.grammaticalframework.org/lib/doc/synopsis.html, which also gives access to the source code.

The structure of the paper is as follows: Section 2.2 prepares the discussion with a concise introduction to GF and a mathematical definition of the correspondence between abstract syntax trees and dependency trees. Section 2.3 gives an overview of GF-RGL and UD; parts of it can be skipped by the reader who already knows the approaches, and many of the details are given in an Appendix. Section 2.4 goes through the great majority of structures, where GF-RGL and UD are similar enough to allow a simple, local (i.e. compositional) and language-independent conversion of trees. Section 2.5 covers the remaining structures, where non-local or language-dependent conversions are needed. Section 2.6 presents an evaluation with three different treebanks. Section 2.7 concludes.

2.2 Grammars and trees

2.2.1 Abstract and concrete syntax

A GF grammar consists of an abstract syntax and a set of concrete syntaxes. Figure 2.1 shows a set of abstract syntax rules, which is a small fragment of the GF Resource Grammar Library, but representative in the sense that it covers some of the most fundamental syntactic structures. The rule set includes a lexicon that is large enough to cover our running example, the English sentence

the black cat sees us

and its French equivalent

le chat noir nous voit (word to word: "the cat black us sees")
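Looking ahead to the grammar fragment in Figure 2.1, both sentences are linearizations of one and the same abstract syntax tree (the tree visualized later in Figure 2.2 (a)):

PredVP (DetCN the_Det (AdjCN black_AP cat_CN)) (ComplTV see_TV (UsePron we_Pron))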

An abstract syntax has two kinds of rules:

• cat rules defining categories, here S, NP, VP, etc.
• fun rules defining functions, here PredVP, ComplTV, etc.

All rules in Figure 2.1 are equipped with comments (starting with --) that explain the categories and functions. Categories are the basic building blocks of types, which have the form

C1 → ... → Cn → C

where n ≥ 0 and C1,...,Cn,C are categories. Each such type is a function type, where C1,...,Cn are the argument types and C is the value type. The limiting case n = 0 gives the types of constant functions, which typically correspond to lexical items, such as we_Pron in Figure 2.1. Types with n > 1 typically correspond to syntactic combinations, such as PredVP, which combines an NP with a VP into an S. Types with n = 1 are typically coercions, such as UsePron, which lifts a pronoun into an NP.

A concrete syntax has two kinds of rules, parallel to the cat and fun rules:

• for each category, a lincat rule defining its linearization type
• for each function, a lin rule defining its linearization, which is a function that combines the linearizations of the arguments into an object of the linearization type of the value type

To define a concrete syntax for Figure 2.1, we can start by uniformly using Str as linearization type:

cat
  S ;     -- sentence
  NP ;    -- noun phrase
  VP ;    -- verb phrase
  TV ;    -- transitive verb
  AP ;    -- adjectival phrase
  CN ;    -- common noun
  Det ;   -- determiner
  Pron ;  -- personal pronoun

fun
  PredVP : NP -> VP -> S ;     -- predication: (the cat)(sees us)
  ComplTV : TV -> NP -> VP ;   -- complementation: (see)(us)
  DetCN : Det -> CN -> NP ;    -- determination: (the)(cat)
  AdjCN : AP -> CN -> CN ;     -- adjectival modification: (black)(cat)
  UsePron : Pron -> NP ;       -- use pronoun as noun phrase: (us)

  we_Pron : Pron ;    -- we/us
  see_TV : TV ;       -- see/sees
  the_Det : Det ;     -- the
  black_AP : AP ;     -- black
  cat_CN : CN ;       -- cat/cats

Figure 2.1: An abstract syntax for a fragment of GF-RGL.

lincat S, NP, VP, TV, AP, CN, Det, Pron = Str

All linearizations are then defined as strings and their concatenations (denoted by ++). Thus a tree formed by the function PredVP is in both English and French linearized by

lin PredVP np vp = np ++ vp

concatenating the linearization of the NP argument with the linearization of the VP argument. But usually the rules are different. Thus lexical items have rules such as

lin black_AP = "black" -- English lin black_AP = "noir" -- French

More interestingly, linearization rules can also vary the word order:

lin ComplTV tv np = tv ++ np  -- English
lin ComplTV tv np = np ++ tv  -- French

lin AdjCN ap cn = ap ++ cn  -- English
lin AdjCN ap cn = cn ++ ap  -- French

The concrete syntax rules shown above use only strings and their concatenation. Such rules have an expressive power similar to synchronous context-free grammars (Aho and Ullman, 1969b), the main difference being that synchronous grammars don't make the abstract syntax explicit; a schematic comparison is sketched below. Synchronous context-free grammars are sufficient for changing the lexical items and their order, which is what we need in our running example. However, the above rules are not correct, because they don't deal with morphology (case and agreement) nor with variable word order (clitic vs. non-clitic objects in French). To deal with natural languages in full scale while sharing the abstract syntax, we need a bit more expressive power. We will return to this in Section 2.2.3.
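As a schematic illustration of this correspondence (our own sketch, with superscript indices linking nonterminals across the two sides of each synchronous rule), the ComplTV and AdjCN rules above would be written as:

  VP → TV(1) NP(2)   ||   VP → NP(2) TV(1)   -- English || French, cf. lin ComplTV
  CN → AP(1) CN(2)   ||   CN → CN(2) AP(1)   -- English || French, cf. lin AdjCN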

2.2.2 Trees and their conversions

Figure 2.2 summarizes the different kinds of trees that we will speak about, by showing different representations of one and the same example.

• Abstract syntax trees (a) are trees where the nodes and leaves are abstract syntax functions.
• Parse trees (b,c), also known as concrete syntax trees or phrase structure trees, are trees where the nodes are categories and the leaves are words (strings).
• Dependency trees (d,e) are trees where the nodes are words and the edges are marked by dependency labels; the order of words is significant.
• Abstract dependency trees (f) are trees where the nodes are constant functions and the edges are marked by dependency labels; the order of nodes is not significant.

The tree visualizations are generated by GF software. The abstract syntax tree (Figure 2.2 (a)) is a non-redundant representation from which all the others can be derived. Parse trees are derived as follows: given an abstract syntax tree T,

[I] Linearize it to a word sequence S.
[II] Link each word in S to its smallest spanning subtree in T.
[III] Replace each function in the nodes of T by its value category.

The smallest spanning subtree of a word is the subtree whose top node is the function whose linearization generates that word. To convert an abstract tree to a dependency tree, we specify, for each abstract syntax function, its dependency configuration: which of the arguments is the head, and how the other arguments are labelled. The default dependency configuration used in GF says that the head is the first argument, and the other arguments have labels dep1, dep2, and so on. This default can be overridden by an explicit configuration. In Figure 2.2, we assume the following configurations to produce the standard UD labels:

PredVP   nsubj head
ComplTV  head dobj
DetCN    det head
AdjCN    amod head

A similar configuration can be used for mapping GF categories to UD part-of-speech tags; the default is the category symbol itself. Given an abstract syntax tree T of a word sequence S, a dependency tree is derived as follows:

[I] For each word w in S, find the function fw forming its smallest spanning subtree in T.
[II] Link each word w in S with either
  (a) the head argument of fw, if w is not the head
  (b) the head of the whole fw, if w is itself the head

Given a concrete syntax and a dependency configuration, an abstract syntax tree is thus a non-redundant and faithful representation of all information about a sentence. For the running example, the procedure yields the arcs shown below.
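Applying the configurations above to the AST of the running example gives the following labelled arcs, matching the English dependency tree in Figure 2.2 (d) (the layout of the listing is ours):

  the    --det-->   cat
  black  --amod-->  cat
  cat    --nsubj--> sees
  us     --dobj-->  sees
  sees   root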


Figure 2.2: Trees for the sentence the black cat sees us and its French translation: (a) abstract syntax tree; (b,c) parse trees; (d,e) dependency trees; (f) abstract dependency tree with unordered word senses. 40 CHAPTER 2. PAPER I: GF2UD

Figure 2.3: Dependency tree derivation from a decorated parse tree. The word cat has sees as its head and nsubj as its label. The path from the top to the word sees is a spine.

In contrast to this, parse trees and dependency trees are lossy representations. For dependency trees, this is easy to see: many functions can have the same dependency configuration, and if we only see the labels attached to the arguments, we cannot know what the dominating function is. For parse trees, it can likewise happen that a context-free grammar rule encodes different ways of putting together its constituents. For instance, the flat predication rule

S → NP TV NP

can, in a free word order language, match both SVO and OVS sequences. Because of this missing information, dependency trees and parse trees cannot in general be derived from each other. This is why the existing algorithms use uncertain heuristics such as head percolation (Collins, 1996). Using abstract syntax trees as the master representation solves this problem. An alternative way to derive dependency trees is to use parse trees decorated with abstract syntax functions and dependency labels. This gives a natural way to explain the conversions and will therefore be used later in this paper. In a decorated parse tree, we mark the dependency labels at each branching point of the tree, as shown in Figure 2.3, but omit the "head" labels. To find the labelled arc for each word,

[I] Follow edges up from the word until a label is reached: this is the label of the word.
[II] From the dominating node, follow the (unique) path of unlabelled edges down to another word: this is the head of the word. A head path in a tree is called a spine.
[III] If no label is encountered on the way upwards, the word is itself the head of the sentence.

Figure 2.3 shows the path corresponding to the labelled arc of the word cat.

2.2.3 Abstracting from morphological variation

If we swap the subject and the object in the example sentence of Figure 2.2 and linearize with the concatenation rules of concrete syntax in Section 2.2.1, we get us sees the black cat in English and nous le chat noir voit in French. Both sentences have a subject-verb agreement error. The English sentence also has a wrong case of the pronoun, and the French sentence has a wrong word order, since the object can appear before the verb only if it is a clitic pronoun. To solve these problems, we need to introduce morphological variation in the grammar. Since morphological variation is language-dependent, we introduce it in concrete syntax and not in the abstract syntax. This forces us to go beyond the context-free linearization rules of Section 2.2.1. Let us consider verb inflection first, restricted to present tense indicative forms for simplicity. In English, we need two forms: the third person singular and "other". We define, in the English concrete syntax, a parameter type of verb forms, which has two elements:

param VForm = SgP3 | Other

The category TV (and also VP) has as its linearization type a table (similar to an inflection table), which produces a string as a function of a verb form:

lincat VP, TV = VForm => Str

Thus the verb see is linearized as follows:

lin see_TV = table {SgP3 => "sees" ; Other => "see"}

For noun phrases, we need the parameter of case, which is nominative or accusative.

param Case = Nom | Acc

But we also need to account for the subject-verb agreement: the fact that a noun phrase can determine the form of a verb. This we can do by using a record type as the linearization type of NP:

lincat NP = {s : Case => Str ; a : VForm} ;

A record of this type has two fields: the field s, which is a case-dependent string, and the field a (agreement feature), which is a verb form. An example is the linearization of we_Pron:

lin we_Pron = {s = table {Nom => "we" ; Acc => "us"} ; a = Other}

Putting everything together, we obtain the linearization rule for predication:

lin PredVP np vp = np.s ! Nom ++ vp ! np.a
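Before unpacking the operators, which are explained below, a hand evaluation of this rule with we_Pron as the subject shows how agreement flows from the NP to the verb (a sketch; the verb phrase table vp is left abstract):

-- PredVP we_Pron vp
--   = we_Pron.s ! Nom ++ vp ! we_Pron.a
--   = "we" ++ vp ! Other     -- since the a field of we_Pron is Other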

This rule uses the s field of the NP (by the projection operator .), from which it takes the Nom form (by the selection operator !). The result is concatenated with the verb form selected by the value of the a field of the NP. In French, we need different parameter types: for instance, verbs have six forms and not just two. We also need some parameters not present in English, such as the gender of adjectives, nouns, and determiners. The most interesting parameter for the example at hand is, however, a boolean that states if an NP is a clitic, to determine its position when used as object:

lincat NP = {s : Case => Str ; a : VForm ; isClitic : Bool}

Now we can write a complementation rule that inspects the cliticity feature of the object to decide if the verb or the object comes first:

lin ComplTV tv np = table {
  vf => case np.isClitic of {
    True  => np.s ! Acc ++ tv ! vf ;
    False => tv ! vf ++ np.s ! Acc
  }
}
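For instance, with a clitic pronoun and a full noun phrase as objects (a sketch; nous_NP and chat_NP are assumed example constants with isClitic = True and False respectively):

-- (ComplTV voir_TV nous_NP) ! vf = nous_NP.s ! Acc ++ voir_TV ! vf   -- clitic object precedes: "nous voit"
-- (ComplTV voir_TV chat_NP) ! vf = voir_TV ! vf ++ chat_NP.s ! Acc   -- "voit le chat noir"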

The generalization of linearization from strings to tables and records leads us from context-free grammars to multiple context-free grammars (MCFG) (Seki et al., 1991). As shown in Ljunglöf (2004), GF is actually equivalent to PMCFG (Parallel MCFG), which is MCFG with reduplication.4 An MCFG is a grammar over tuples of strings rather than just strings. In addition to inflection and word order variation, MCFG enables discontinuous constituents. Thus, for instance, in VSO languages (such as Classical Arabic) verb phrases are records with separate fields for the verb and the complement:

lincat VP = {verb : Str ; compl : Str}

(ignoring all morphological parameters). The VSO order is realized in the predication rule, which puts the subject noun phrase between the verb and the complement:

lin PredVP np vp = vp.verb ++ np ++ vp.compl

2.2.4 Abstracting from syncategorematic words

Designing a GF grammar involves finding a level of abstraction that makes sense for all languages to be addressed. For instance, morphological distinctions present in one language but not the others should be ignored in the abstract syntax. Some of these questions can be subtle. The copula of adjectival predication is an example of a common kind of question encountered in the RGL-UD mapping task. Let us extend the abstract syntax of Figure 2.1 with a rule converting adjectival phrases to verb phrases:

fun CompAP : AP -> VP

The English linearization rule (ignoring morphology, which is irrelevant for this question) introduces the copula is prefixed to the AP:

lin CompAP ap = "is" ++ ap

The copula thus has no abstract syntax of its own. The cross-linguistic justification of this treatment is that many languages (Arabic, Russian) don’t need copulas, and that they should therefore not be introduced in the abstract syntax.

4 Reduplication is used only in a few places in the RGL: Chinese yes/no questions and some semantic constructions such as the intensification of adjectives in Swedish.


Figure 2.4: Dependency trees for the cat is black: (a) with a syncategorematic copula; (b) with a categorized copula. Also shown in (c) is a collapsed variant of (a).

Words with no abstract syntax category attached are called syncategorematic. The conversion defined in Section 2.2.2 generates no dependency labels for syncategorematic words. Thus the tree assigned to the cat is black in this grammar has no label for the word is. A default dummy label "dep" can then be used, as in Figure 2.4 (a). This tree moreover uses VP as the part of speech tag, since this is the category of the smallest spanning subtree of the copula. However, the UD annotation manual for English says that the copula should have the label "cop" with the word black as its head (Figure 2.4 (b)) and AUX as its POS tag. The general principle in dependency parsing is that all words have labels connecting them to a head (except for the head of the whole sentence). According to this principle, there are no syncategorematic words.

The principle that all words are categorized makes sense in the dependency parsing context, where the trees are trees of words and not of phrases, let alone abstract functions. But from the "universal" point of view, this results in trees that may be unnecessarily different in different languages, because syncategorematic words are not universal. They can also be argued to be irrelevant for the semantic structure. Thus dependency parsers are sometimes supplemented by "flattening" or "collapsing" functions that remove semantically irrelevant words (de Marneffe and Manning, 2008, Ruppert et al., 2015). The "collapsing" process has two characteristics: a collapsed tree no longer contains all the words in a sentence, and the dependency labels and heads are semantically coherent (but may be syntactically heterogeneous).

When dependency trees are derived from abstract syntax, the situation is the opposite. What we get first, by straightforward dependency configurations, is a kind of "collapsed" tree (Figure 2.4 (c)). The semantically irrelevant words are not provided with a dependency label of their own by the dependency configurations defined on abstract syntax. To obtain these labels, we must extend the dependency configurations with rules defined on concrete syntax. Such rules are language-dependent, not universal.

One could ask why not just stop at the collapsed trees, since they are the truly universal ones. However, since we are interested in converting GF trees to complete UD trees, we have chosen to extend our conversion algorithm with a "decollapsing" phase, by extending the abstract (language-independent) dependency configurations with language-dependent ones (Section 2.5). The resulting dependency tree at the end of this "decollapsing" phase is a connected tree where all words have labels connecting them to a head, as in the UD scheme. The extended conversion algorithm is presented in Section 2.5.7. An alternative to these language-dependent configurations is to rewrite the grammar. Rewriting the grammars facilitates constructing complete UD trees using dependency configurations defined only on the abstract syntax. For the example at hand, this is simple: just introduce a category Cop of copulas, with one element be_Cop, and change the CompAP rule:

cat Cop
fun be_Cop : Cop
fun CompAP : Cop -> AP -> VP
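With the categorized copula, a purely abstract dependency configuration is then enough to label it; a minimal sketch (the configuration below is ours, following the UD treatment shown in Figure 2.4 (b)):

CompAP cop head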

In languages with zero copulas, the linearization of be_Cop is the empty string. The line that we follow in this paper, however, is to keep GF-RGL as it is and instead make the dependency configurations more elaborate. This choice has several advantages:

• We automatically get collapsed trees. These can be post-processed later to add labels for syncategorematic words, but are still useful if the end goal is only to extract relations between the content words in the sentence.
• We don’t optimize for the particular dependency annotation scheme of UD and can easily change the configurations.
• Since GF-RGL was from the start designed to be multilingual, we get evidence to assess the universality of the current UD.

2.3 An overview of GF-RGL and UD

This section provides readers with an overview of GF-RGL and the UD annotation project. Readers familiar with GF can skip Section 2.3.1 and readers familiar with UD can skip Section 2.3.2.

2.3.1 Overview of RGL

The first applications of GF were small grammars built for translation systems on specific application domains, such as geographic facts (Dymetman et al., 2000), mathematics (Hallgren and Ranta, 2000), and simple health-care dialogues (Khegai, 2006). The abstract syntax in these applications encoded the semantic structures that were to be preserved in translation. But the idea soon emerged to generalize the abstract syntax idea from domain semantics to domain-independent syntactic structures, such as NP-VP predication. This led to the development of the GF Resource Grammar Library (RGL), whose first version was inspired by the syntactic structures used in CLE (Core Language Engine) (Rayner et al., 2000). The RGL was first intended to be a library that would help domain grammarians by giving them reusable functions for surface syntax and morphology (Ranta, 2009a). But it was soon also seen as a linguistic experiment, to find out how comprehensive a common abstract syntax could be and how many languages it could apply to. This experiment had no commitment to any theory of linguistic universals; the idea was just to try and see how far one can get. Nevertheless, the sharing of tree structures was more far-reaching than parallel grammar development in systems like CLE, LinGO Matrix (Bender and Flickinger, 2005), and ParGram (Butt et al., 2002).

The first experiments on a handful of not too similar languages (English, Finnish, French, Russian) were encouraging. Gradual modifications led to a stable abstract syntax, which was reported in Ranta (2009b) and was at the time implemented for 14 languages. As of late 2015, the same abstract syntax has concrete syntaxes for 30 languages, and 5 to 10 more are under construction. More than 50 persons have contributed to the GF-RGL. Its source code and documentation are available on the GF web page.5

The GF Resource Grammar Library has an abstract syntax with 86 categories and 356 functions. Of these functions, 140 are functions that don’t take arguments (mostly structural words such as determiners), 85 are one-argument coercions, and 131 are syntactic combinations with more than one argument (the only class that needs dependency configurations). This abstract syntax is the core RGL, which is specified as the minimum to be implemented for a language to count as a "complete" resource grammar. This notion of completeness is of course purely formal, and does not mean that the whole language is analysable by the grammar. But it does mean that the standard GF applications, such as controlled language grammars (Ranta et al., 2012, Dannélls et al., 2012, Kaljurand and Kuhn, 2013), are directly portable to that language.

In addition to the core RGL, the library has language-specific extensions, which need not be shared by all languages. These extensions are grouped hierarchically, so that for instance Romance languages have a set of common extensions, on top of which Catalan, French, Italian, and Spanish have their own extensions. These modules typically make a 10% addition to the core RGL code. The extension modules enable grammarians to experiment with individual languages without bothering about the universality of all constructs. There is after all no reason why all grammatical constructs should be universal: it is good enough if a core subset is. Another, recent addition to the core RGL is a set of categories and functions that enable wide-coverage parsing and translation (Angelov et al., 2014).
To maintain the interlingual translation architecture, these extensions are shared by all languages (15 at the time of writing). But they are generally less precise than the original RGL functions. In particular, there is a category of chunks, which is used as a robust back-up to syntactically complete parsing. The present paper focuses on the core RGL, showing how the main syntactic structures of GF are mapped into UD. This is sufficient for two small treebanks, which are covered for 31 languages. Dealing with a larger treebank, covering 15 languages, also requires the robust back-up rules. But this part is less stable and more ad hoc, and we will report the results for the two parts separately. Figure 2.18 in the Appendix shows the hierarchy of categories in the core RGL. The same picture appears in Ranta (2011), and Ranta (2009b) provides a systematic linguistic discussion of the categories and functions. In this paper, we will present the RGL from another perspective, showing how the functions are decorated with UD labels to convert them to dependency trees. Table 2.11 in the Appendix lists the RGL categories with explanations, examples, and corresponding UD POS tags.

5 http://www.grammaticalframework.org/lib/doc/synopsis.html

Let us walk through an example, which shows many of the main functions in use. Figure 2.5 shows an abstract syntax tree for the sentence my two brothers and I would not have bought that red car and its RGL equivalents in 29 other languages, artificially constructed to show as many structures as possible. The tree is decorated with UD dependency labels.

Figure 2.5: Dependency-decorated abstract syntax tree for my two brothers and I would not have bought that red car.

The topmost category in Figure 2.5 is Utt, utterances, which is built from S, sentence; an Utt could also be built from a question or an imperative, as shown in Figure 2.18. The sentence in turn is built from a clause (Cl) by adding temporal and polarity features: conditional anterior negative. The core RGL has 16 tense-polarity combinations for clauses. The clause is built from a noun phrase (NP) and a verb phrase (VP). The subject noun phrase is a coordination, built from a list of noun phrases (ListNP) with a conjunction (Conj). The list is a recursive structure, with the base case taking two elements. Longer lists are (in English and many other languages) linearized by using commas, for instance, her mother, my brother and I. The first conjunct of the subject is built from a determiner (Det) and a common noun (CN). The determiner is further analysed into the head quantifier (Quant) and a number (Num). The quantifier is a pronoun (Pron) used possessively, and the number is a cardinal numeral. The second conjunct is just a pronoun.

The verb phrase is built from a slash verb phrase (VPSlash), by providing an object noun phrase; VPSlash is similar to a "slash category" (VP/NP in the notation of Gazdar et al. (1985)). The VPSlash is here just a two-place verb (V2). V2 is a generalization of TV and covers both transitive and prepositional verbs, as well as verbs taking different complement cases in different languages. A verb being transitive is not a multilingual invariant and therefore not maintained in the abstract syntax. The complement NP is built from a Det and a CN that has an AP modifier. The determiner has a dummy number (NumSg) indicating that the quantifier that_Quant is used in the singular form.

As we can see from the tree, most of the dependency labels are straightforward to define, since the words ultimately appear as categorized lexical items. But some of them need special attention, in particular the tense and polarity features, which in English are realized syncategorematically as auxiliary verbs (discussed in Section 2.5.4). Also coordination is worth discussion (Section 2.4.5), as it is a perennial question in dependency parsing.

2.3.2 Overview of UD

The UD annotation scheme defines a taxonomy of 40 relations as the core or universal dependency label set (shown in Table 2.12 in the Appendix). This taxonomy is further refined (or reduced) to address language-specific extensions, not necessarily shared by all languages. The scheme also defines universal sets of part-of-speech tags and morphological features (shown in Table 2.13 in the Appendix). The part-of-speech tagset includes 17 tags, extending the previously proposed Universal Part-of-Speech tagset (Petrov et al., 2012). The project is an effort to consistently annotate multilingual corpora with these morphological features, part-of-speech tags and universal labels in the dependency parse tree. Figure 2.6 shows all three layers of UD annotation for the French sentence Toutefois les filles adorent les desserts au chocolat (trans. “However girls love chocolate desserts”; example and figure quoted verbatim from Nivre et al. (2016)).

Figure 2.6: UD annotation for a French sentence (lemmas are capitalized). Figure quoted verbatim from Nivre et al.(2016).

Core arguments of clauses are marked as either a subject (nsubj in the case of nominal subjects and csubj for clausal subjects) or direct/indirect objects depending on their grammatical function. An example is shown in Figure 2.7. In the case of direct objects, a distinction is made between nominal arguments (dobj), finite clausal arguments (ccomp) and open clausal arguments (xcomp). Furthermore, in the case of passive constructions, separate labels are used to mark subjects (nsubjpass and csubjpass) to indicate the transformation of voice. We will look at these constructions separately in Section 2.5.

Figure 2.7: UD dependency labels for arguments in clauses

Non-core dependents of clausal predicates, like prepositional phrases attached to the verb, are annotated as nominal dependents using the nmod label, or as adverbial modifiers (advmod for adverbs and advcl for clausal dependents). Other labels used for non-core dependents in a clause are neg to mark negation of predicates (and also noun phrases), vocative to mark vocative noun phrases and expl for expletives. Expletives are nominals that appear with labels for the core arguments in a clause, but have no semantic significance by themselves (for example, there is an expletive in there is a cat in the house). More frequently found labels in annotated corpora from this class are aux for auxiliary verbs, auxpass for auxiliary verbs in passive voice constructions and cop for copula verbs. The mapping for auxiliary verbs and copulas is discussed in Section 2.5.

An example of language-specific extensions defined in the UD annotation scheme are the labels used to annotate these non-core nominal dependents. In English, the nmod label is refined to indicate temporal adverbs (nmod:tmod), possessive noun modifiers (nmod:poss) and noun phrases used as adverbials (nmod:npmod). In the case of Swedish, nmod is further refined to indicate agents in passive voice constructions (nmod:agent), along with the same possessive noun modifiers as in English. The annotation scheme allows fine-grained extensions to a label, but the choice of annotating the extension is left to the annotators of the language. The guidelines for a specific language provide details on when to annotate these extensions in the language. The mapping we discuss in this paper will ignore these fine-grained extensions, since the goal is to map the core RGL to the core label set. But it is possible to add the fine-grained distinctions to the mapping if we are interested in independently analysing each language. In our own experiments, the extensions to nominal dependents for possessive noun modifiers and noun phrase adverbials are defined in a separate mapping on the abstract syntax. This mapping is defined only for English. Similarly, the extension in Swedish for agents in passive constructions is equally straightforward to define.

When annotating noun phrases, dependents are typically labelled using either nummod for numerical modifiers, appos for appositions or a generic nmod label for all other nominal modifiers like prepositional phrases. There is a det label for determiners and an amod label for adjectival modifiers. Relative clauses that modify the noun are marked using the acl label (that we saw in the NP the children that we saw). Prepositions are always marked as modifiers using the case label. Other labels defined in the core label set include cc and conj for coordination constructions (an example of the flat structure provided for coordinations is shown in Figure 2.9), compound for compounding, mwe to treat multi-word expressions and a loosely defined goeswith label for robust analysis of web and raw texts. Annotation strategies for these specific labels will be explained when we discuss the mapping between GF-RGL and UD for these phenomena.

Figure 2.8: UD labels for modifiers in NPs

In this paper, the labels used for defining these loose joining relations will not receive critical attention.

Figure 2.9: UD labels for a coordination of 3 NPs

2.4 Dependency mappings: straightforward cases

In the general case, we define the mapping between functions in the core RGL and the universal UD labels using dependency configurations. The mapping to the core UD labels allows bootstrapping treebanks for new languages using the abstract syntax defined in GF-RGL. Additionally, we define a fine-grained mapping over these functions to derive the language-specific UD labels defined in the annotation scheme. Throughout this paper, we will mainly describe the mapping to the core UD labels.

2.4.1 Clausal predicates: predication and complementation

We will start by detailing the mappings for functions in the GF-RGL library used to build declarative clauses. The example shown in Figure 2.5 discussed some of these functions. Table 2.1 shows the complete set of functions used to construct these clauses, the argument types and value type of the resulting phrase, and the dependency configuration for each of these functions.

The PredVP and PredSCVP functions are the main functions responsible for predication. The interpretation of the type signature shown in column 2 of the table is as follows: the PredVP function takes a noun phrase (NP) and a verb phrase (VP) and constructs a clause (Cl). Similarly, the PredSCVP function takes an embedded clause SC and a verb phrase and constructs a clause. Figure 2.10 shows the parse trees for the example sentences John killed him and that she came will make news. We map the noun phrase in the PredVP function to the nsubj label, and in the case of PredSCVP, the embedded clause is mapped to the csubj label.

Figure 2.10: Decorated parse trees showing nominal and clausal predication in GF. The parse trees show the abstract function names, the category of each node and the UD labels.

VPs are created using the ComplSlash function, which takes a VPSlash-type phrase and an NP phrase as a core argument. Alternative complementation functions are ComplVS for clausal arguments S (he says that you want to swim), ComplVV for verb phrase complements (want to swim) and ComplVA for adjectival phrase complements (feel bad). In these cases, we map the NP phrase using the dobj label and the clause S using the ccomp label. The complement arguments of the ComplVV and ComplVA functions are mapped using the xcomp label. In all complementation functions, the arguments with the different verb types (VPSlash, VS, VV, VA) are marked as the heads of the phrases.

Verbs that take sentential arguments (VS in the GF-RGL library) are used in the ComplVS function to create a VP; the clausal argument (S) is mapped to the ccomp label. Alternatively, the RGL also defines a ComplVQ function for verbs that take question clauses as arguments (do you know who did it); here too we map the clausal argument to the ccomp label. Similarly, the SlashVP and SlashVS functions are alternate ways to combine an NP with a VPSlash (verb phrase missing an NP) or with a VS and an SSlash, forming a ClSlash (clause missing an NP). These types of phrases are used to create question clauses and relative clauses. In SlashVP, we map the NP argument to nsubj; in SlashVS, the arguments are mapped to nsubj and ccomp respectively.

Passive voice constructions in GF are created using a type-raising function that creates the VP phrase by dropping one of the mandatory arguments. In the case of transitive and ditransitive verbs, the VPSlash type is converted into a VP type without any additional NPs. The predication functions PredVP and PredSCVP remain the same, since the voice is localized in the sub-tree corresponding to the VP. In order to distinguish the nsubj and nsubjpass labels in the case of passive voice, we extend our dependency configuration rules defined on the abstract syntax. We describe the mapping for passive constructions in Section 2.5.1.

It is worth mentioning here that the entire mapping to the UD labels is encoded in a declarative fashion, exactly as shown in Table 2.1. Our mappings are the first and the third columns in this table; the other two columns are shown for convenience. The declarative style of specification allows the conversion algorithm from GF-RGL trees to remain independent of the annotation scheme, allowing one to easily switch between different annotation schemes.

PredVP      NP -> VP -> Cl                 nsubj head        John walks
PredSCVP    SC -> VP -> Cl                 csubj head        that she came will make news
ComplSlash  VPSlash -> NP -> VP            head dobj         love it
ComplVS     VS -> S -> VP                  head ccomp        say that she runs
ComplVQ     VQ -> QS -> VP                 head ccomp        wonder who runs
ComplVV     VV -> VP -> VP                 head xcomp        want to sleep
ComplVA     VA -> AP -> VP                 head xcomp        become red
SlashVP     NP -> VPSlash -> ClSlash       nsubj head        (whom) he sees
SlashVS     NP -> VS -> SSlash -> ClSlash  nsubj head ccomp  (who) he says that she loves

Table 2.1: UD mappings for declarative clauses


2.4.2 Adverbial modifiers

Adverbial phrases that modify VP phrases are analysed using either the AdVVP function, which takes a VP phrase and an AdV phrase (always sleep), or AdVVPSlash, which modifies a VPSlash instead of a VP phrase. Alternatively, simple adverbs can modify a VP phrase using the AdvVP function (sleep here). For all these functions, we map the head of the adverbial phrase using the advmod label. Table 2.2 lists the functions used to modify VP phrases.

AdVVP       AdV -> VP -> VP            advmod head  always sleep
AdVVPSlash  AdV -> VPSlash -> VPSlash  advmod head  always use (something)
AdvVP       VP -> Adv -> VP            head advmod  sleep here
AdvVPSlash  Adv -> VPSlash -> VPSlash  advmod head  use (something) here
AdvS        Adv -> S -> S              advmod head  then I will go home
AdAP        AdA -> AP -> AP            advmod head  very warm

Table 2.2: UD mappings for adverbial modifiers

2.4.3 Questions and relative clauses

The GF-RGL provides a module for questions, combining interrogatives (IP, IComp, IAdv) with verb phrases (VP, VPSlash) and clauses (Cl, ClSlash). For example, the function QuestIAdv takes a clause like John walks and a pronoun like why to construct the question why does John walk. Similarly, relative clauses are constructed using one of the NP, VP or ClSlash types. Table 2.3 shows the mappings defined for functions used in constructing question and relative clauses.

2.4.4 Noun phrases and modifiers

The GF-RGL provides two primary kinds of functions to create noun phrases: functions used to generate noun phrases and phrases with adjectival modifiers, and functions used to modify these noun phrases using prepositional phrases, appositional NPs and relative clauses. We will present these two groups separately.

QuestIAdv   IAdv -> Cl -> QCl     advmod head  why does John walk
QuestIComp  IComp -> NP -> QCl    head nsubj   where is John
QuestVP     IP -> VP -> QCl       nsubj head   who walks
QuestSlash  IP -> ClSlash -> QCl  dobj head    who does John love
QuestQVP    IP -> QVP -> QCl      nsubj head   who buys what he is selling
RelSlash    RP -> ClSlash -> RCl  mark head    whom John loves
RelVP       RP -> VP -> RCl       mark head    who loves John

Table 2.3: UD mappings for questions and relative clauses

There are in all two different categories for nouns defined in the RGL: CN for basic nouns, and N2 for nouns that take a noun complement (mother of king, list of names). The primary DetCN function takes a determiner and a CN to create an NP (the boy, this paper). N2-type nouns take an NP complement phrase to construct a CN. In the DetCN function, we map the determiner to the det label and make the CN argument the head of the NP phrase. In the ComplN2 and ComplN3 functions, NP arguments are mapped as modifiers of the noun using the nmod label (names is marked as the child of list with nmod). Alternative ways to generate NP phrases use CNIntNP for phrases like level 5 and CNNumNP for level five; in both cases we map the Int/Card argument to the nummod label. The AdjCN function is used to modify nouns; these nouns still need a Det argument to become noun phrases (the blue house). Here, the head of the adjective phrase AP is mapped to the amod label. Table 2.4 lists the functions used to construct the basic NP phrases and the respective mappings.

DetCN        Det -> CN -> NP             det head       the man
ComplN2      N2 -> NP -> CN              head nmod      mother of the king
CNIntNP      CN -> Int -> NP             head nummod    level 5
CNNumNP      CN -> Card -> NP            head nummod    level five
AdjCN        AP -> CN -> CN              amod head      big house
DetQuant     Quant -> Num -> Det         det head       these five
DetQuantOrd  Quant -> Num -> Ord -> Det  det head amod  these five best

Table 2.4: UD mappings for generating basic noun phrases

The Det type in GF is used for determiners, typically constructed by taking a Quant type and a Num type. For the definite article the, the analysis would be

the : DetQuant (DefArt) (NumSg)
the : DetQuant (DefArt) (NumPl)

This Det type and the DetCN function are used to construct determiner phrases like these five that can modify a noun or act as noun phrases by themselves. An alternative function to generate the Det type is DetQuantOrd, using which phrases like these five best can be analysed. In both these cases, we map the Quant type as the head of the phrase, and the Num type is mapped using the nummod label. Thus, in phrases like these boys, where there are no numerical modifiers, we obtain the same structure as in UD, i.e. these is given the det label and boys is the head. However, in phrases with numerical modifiers (for example these five boys), the numerical modifier five is attached to the quantifier these and not to the noun boys.
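Schematically, the arcs obtained for these five boys are thus (following the discussion above; word -> head [label]):

these -> boys  [det]
five  -> these [nummod]   -- the UD annotation instead attaches five directly to boys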

In addition to these functions, NPs can be modified using other NPs (apposition) and prepositional phrases. The list of functions is shown in Table 2.5. ApposCN is used for apposition (Sam, my brother), PossNP for possessive nominal modifier constructions (Marie’s book), and PartNP for partitive constructions (glass of wine). The modifier NP phrases are mapped to the appos and nmod labels. In PossNP, the noun representing the possessive modifier is assigned the nmod (or nmod:poss for English) label, whereas in the PartNP function it is the opposite.

ApposCN  CN -> NP -> NP    head appos  city Paris
PossNP   CN -> NP -> CN    head nmod   Marie’s brother
PartNP   CN -> NP -> CN    head nmod   glass of wine
ComparA  A -> NP -> AP     amod head   warmer than I
RelCN    CN -> RelS -> CN  head acl    house that John bought
RelNP    NP -> RelS -> NP  head acl    Paris, which is beautiful

Table 2.5: UD mappings for modifying noun phrases

The RelCN and RelNP functions are used to modify nouns and noun phrases with relative clauses. In this case, we map the head of the relative clausal predicate using the acl label.

2.4.5 Coordination

The RGL defines multiple functions for building coordination constructions by combining lists of phrases (two or more) of the same category using conjunctions. Base* functions accept two phrases of the same type and create a tree of type List*. The List* types are combined with conjunctions (marked as Conj) using Conj* functions to form trees corresponding to coordination constructions. Additionally, Cons* functions recursively add a new phrase to List* phrases. These Cons* functions are used when coordinating more than two phrases of a given type. The Conj* functions handle both syndetic and asyndetic coordination constructions in all languages. Syndetic coordination is a form of coordination with the help of an explicit coordinating conjunction (ham and eggs), while asyndetic coordination allows for conjunctions to be omitted from one or all of the conjuncts in the coordination construction (he came, saw and conquered).

Let us look at an example of adverbial coordination using the phrase here, there and everywhere. The parse tree for this example is shown in Figure 2.11. The adverbials there and everywhere are first analysed using the BaseAdv function. The ConsAdv function adds here to the list of adverbs from the BaseAdv function. This resulting ListAdv phrase is used by the ConjAdv function, along with the conjunction and, to form the AST for the coordinated phrase.

The existing UD scheme annotates the first conjunct in the construction as the head, and treats the rest of the conjuncts as modifiers attached directly to the head via the conj relation. Similarly, conjunctions are attached to the same head via the cc relation. Table 2.6 shows the mapping defined for coordinating any two types in the GF-RGL. Note that when coordinating more than two phrases, the first conjunct always comes from the argument of the ConsNP function. The same mapping is defined for the set of these three functions defined for all types in GF-RGL (complex nouns CN, noun phrases NP, adjectival phrases AP, adverbial phrases AdV, simple clauses S and relative clauses RS).

Figure 2.11: Coordination of three adverbials
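Written out as a GF expression, the AST in Figure 2.11 is, schematically (the names of the lexical functions for and, here, there and everywhere are assumed here for illustration):

ConjAdv and_Conj (ConsAdv here_Adv (BaseAdv there_Adv everywhere_Adv))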

BaseT  T -> T -> ListT      head conj
ConsT  T -> ListT -> ListT  head conj
ConjT  Conj -> ListT -> T   cc head

Table 2.6: UD mappings for generic type T in GF-RGL. T can be CN, NP, AP, AdV, S and RS

2.5 Dependency mappings: problematic cases

The mappings described in the previous section are straightforward rules defined over the abstract syntax in GF-RGL. When the equivalent labelled dependency tree for a given phrase can be completely identified from the AST by defining a mapping to dependency labels for each function over its immediate arguments, such a mapping is sufficient to construct a fully connected dependency tree from the AST. This is not always the case, however. Using only the rules described in the previous section, the mappings can result in a dependency tree with edges labelled using the dep label. The main reason for this is the presence of syncategorematic words in GF (discussed in Section 2.2.4), due to the abstraction level of GF-RGL, which is intended to maximize sharing across languages.

For example, GF-RGL defines multiple functions in abstract syntax for existential clauses (there is a cat). Existential clauses are analysed using the ExistNP or ExistNPAdv function. The ExistNP function takes an NP and constructs a clause Cl. The dummy pronoun, an expletive there, and the copula verb is are abstracted by the ExistNP function in the AST. Similarly, ExistNPAdv takes an NP and a modifier phrase Adv (for example, in the bag) to construct the clause. Linearizations of these functions vary across languages, depending on whether expletives are necessary and on the choice of copula in the language. See Figure 2.12 for the parse trees of the clause in English and Bulgarian. The Bulgarian translation of the existential clause contains neither the expletive nor the copula; instead, the verb used here is annotated as the main head in the current UD annotation. We will discuss existential clauses in Section 2.5.6.

Figure 2.12: Parse trees of the existential clause there is a cat and its translation in Bulgarian

We extend the set of mappings defined until now with three additional types of rules (the first type below covers the rules defined so far):

i) local abstract rules are rules on abstract functions addressing their immediate arguments only (all rules defined until this point are of this type)

ii) non-local abstract rules allow mappings to be defined on the abstract functions using more context than the immediate arguments of the functions.

iii) local concrete rules are defined for each language on the respective linearizations of an abstract function.

iv) non-local concrete rules are defined to map the linearization of a function to the UD labels using more context, as in the case of non-local abstract rules.

            Abstract             Concrete
Local       Fun Label+           Fun ReLabel+
Non-local   (Fun args*) Label+   (Fun args*) ReLabel+

Table 2.7: Types of rules in the extended mapping to construct full UD trees

Table 2.7 shows the different types of rules and the specification formats used to encode UD information. We previously mentioned that the mapping is encoded in a declarative manner to remain independent of annotation schemes. We briefly explain these new types of rules before explaining how they are used in our conversion; this is discussed in more detail at the end of this section (and in the Appendix).

The local abstract rules discussed previously are encoded using the syntax shown in Section 2.2. Fun here refers to the name of a function in the abstract syntax (PredVP or ComplSlash). One argument of Fun is mapped to a label called head and the rest are mapped to corresponding UD labels. In the trivial case, where a function takes only one argument, the argument is always mapped to the head label.

Non-local abstract rules are an extension of local abstract rules that are applied only if the arguments of Fun in the abstract syntax tree match the values (or patterns) specified in the rule. args* represents the list of arguments of the function Fun (* corresponds to the Kleene operator in regular expressions), and each item in this list expresses a pattern for the arguments. These patterns encode context that is outside the scope of what is matched by local abstract rules. Non-local rules can simply be interpreted as multi-level rules in the grammar, while local rules are one-level rules. We will see these rules in more detail in Section 2.5.1.

Local concrete rules are introduced to address the different realizations of an abstract function across multiple languages. The example of existential clauses mentioned above is an instance of this. These mappings are encoded using a list, similar to local abstract rules. In both local and non-local concrete rules, each item in the list represents a relabelling operation on an edge in the dependency tree. This relabelling can be one of three types: relabel an existing edge in the dependency tree with a new label, relabel an edge with a new label after reversing the direction of the edge, or add both an edge and a label to the dependency tree. These rules will be discussed in Section 2.5.2 for copula constructions and in Section 2.5.3 for verb phrase complements and prepositional verbs. Non-local concrete rules are again an extension of local concrete rules, where the relabelling operations are applied only if the arguments of the abstract function Fun match the context specified in the rules. These rules are explained in the context of clausal negation and auxiliary verbs in Section 2.5.4.

The terms “local” and “non-local” refer to the scope over which the conversion algorithm matches the specified mappings. The mappings are always specified over functions in the abstract syntax. The terms “abstract” and “concrete” refer to the layer of syntax on which the mappings are applied: abstract rules are applied on the AST, and concrete rules are applied after abstract rules, on the labelled dependency trees for the specific language. Both local and non-local concrete rules must be defined for each language in the RGL. In the absence of concrete rules, the conversion results in a connected UD tree where some edges are connected artificially using the dep label. When the edges marked with these dummy labels are collapsed, the resulting dependency tree structure can be a representation that is closer to the semantics of the sentence.
We will show this in our experiments described in Section 2.6.
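To summarize the rule formats before turning to the individual cases, the following schematic grammar is reconstructed from Table 2.7 and the examples in the coming subsections (a sketch only; the exact BNF is given in Figure 2.19):

Rule         ::= AbstractRule | ConcreteRule
AbstractRule ::= Key Label+                   -- exactly one Label is "head"
ConcreteRule ::= Key ReLabel (";" ReLabel)*
Key          ::= Fun                          -- local rule
               | "(" Fun Pattern* ")"         -- non-local rule
Pattern      ::= Fun | Var | "?"              -- "?" matches any sub-tree
ReLabel      ::= "head" Match Label Target    -- attach the matched material under Target
Match        ::= "{" Word ("," Word)* "}"     -- specific word forms
               | "{*}"                        -- any word
               | "." Field                    -- a record field, e.g. .c2
Target       ::= Label | "head"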

2.5.1 Passive voice constructions

Passive clauses in GF are constructed using functions that allow a verb to drop one of its obligatory arguments to construct the VP, for example PassV2 and PassVPSlash.

Let us look at the ASTs for the clause John killed him and its passive counterpart (he was killed), shown in Figure 2.13. Using only local abstract rules, the grammatical subject of the passive clause is incorrectly mapped to the nsubj label.

Figure 2.13: Decorated ASTs for the clause John killed him and its passive counterpart He was killed using both local and non-local abstract rules

The PassV2 function in GF-RGL allows a transitive verb to form a VP phrase without the obligatory NP phrase (dobj). The extensions in GF-RGL define a function PassVPSlash that can be used to construct passive voice constructions for both transitive and ditransitive verbs. This function can be understood as a generalization of the PassV2 defined in the core RGL. However, the top-level predication function remains the same (PredVP) irrespective of active or passive voice. The UD annotation scheme distinguishes between subjects of active and passive clauses (NP arguments) using two labels, nsubj and nsubjpass. In order to make this distinction in our mapping, the rules corresponding to PredVP must be enriched with more context. We define the following non-local rules for the PredVP function for this purpose:

(1) (PredVP ? PassV2)       nsubjpass head
(2) (PredVP ? PassVPSlash)  nsubjpass head
(3) PredVP                  nsubj head

The context in this example is encoded using the abstract function names corresponding to passivization of VP phrases. The first rule (1) is applicable only when PassV2 appears in the sub-tree corresponding to the VP phrase, i.e. for transitive verbs in passive voice constructions. Similarly, the second rule (2) addresses ditransitive verbs in passive voice, when PassVPSlash appears in the sub-tree of the VP phrase. Finally, the general rule (3) (the local abstract rule) is applied in all other cases, when neither of these two contexts is matched, i.e. in active voice constructions. The ? character represents a meta-variable in the tree. It is interpreted as matching anything, i.e. there are no restrictions on the sub-tree corresponding to this argument. In the example of the predication function, there are no restrictions on the sub-tree corresponding to the NP argument. Using these non-local rules, grammatical subjects of passive voice clauses are mapped to the nsubjpass label instead of the nsubj label.

We also define similar rules for the PredSCVP function, used in analyses of constructions with clausal subjects. Recall from Section 2.4 that the PredSCVP function is used to construct a clause using an embedded clause and a VP phrase. The passive voice constructions described above are addressed using a limited non-local context (the daughter nodes). However, in principle it is possible to define contexts corresponding to abstract functions using expressions of arbitrary depth.

Before extending the format of the dependency configurations, we considered the alternative of changing the RGL to resemble the UD annotation in this particular case. The RGL localizes the voice transformation to the VP, while UD uses a different label for subject arguments, which is non-local to the VP; this would be easy to achieve in GF as well. However, if we consider coordinated constructions with subject sharing (he killed his wife and was chased by the police), the shared subject he should be both an nsubj and an nsubjpass. UD resolves the conflict with its flat structure for coordinations, by choosing the label appropriate for the first conjunct (nsubj). By localizing the voice to the VP, however, it is possible to give separate interpretations to the subject.6

At this point, it is clear that the addition of non-local rules makes it necessary to introduce a notion of precedence between the local and non-local rules for a specific function. We will formally address this question in Section 2.5.7.
For now, it is enough to note that non-local rules take precedence over local rules; this is specified in the semantics of dependency configurations.

2.5.2 Copula constructions

Copula constructions in GF-RGL are analysed in two steps: any of the Comp* functions is used to convert a phrase into a copula complement (Comp in GF-RGL), followed by the UseComp function, which converts this complement into a VP. The UseComp function also introduces the copula verb into the VP (when necessary). The choice of copula verb and its form is specified in the concrete syntax, and the abstract syntax tree does not know exactly what the copula verb in the specific language is. This abstraction is desired because the realization of copula constructions varies across languages. For example, the linearization of UseComp in Russian (a so-called zero-copula language) does not contain a copula verb in the present tense. The linearization rules in English, Finnish and Swedish introduce the verbs is, on and är respectively. In the case of Chinese and Thai, the copula verbs are restricted to only some selected complements. Figure 2.14 shows the decorated parse tree in English for the sentence John is clever. Notice the dummy dep label for the copula verb is, obtained using abstract rules. The UD scheme in the case of copula constructions annotates the copula verb as a child of the complement using the cop label. In order to match this structure in our mapping, we introduce a local concrete rule that marks copula verbs using the cop label. Shown below is the concrete rule specified for English:

UseComp head {"am", "is", "was", "are", "were", "be", "been", "being"} cop head

This rule specifies that if any of the different forms of the copula verb in English (am, is, was, are, were, be, been, being) is found under the functions in the spine corresponding to the head of the UseComp function in the AST, it should be attached to the head of the UseComp function using the cop label.

6 With the above local rules for PredVP, the AST for this example will yield the same dependency tree as the UD tree, which is just what we wanted.

Figure 2.14: Decorated parse tree for a copula construction in English

head {"am", "is", ...} cop head specifies a single relabelling operation in the concrete rule.

2.5.3 Verb phrase complements and prepositional verbs

Let us look at one more example of the local concrete rules used in our mapping, for verb phrase complements and prepositional verbs. GF-RGL defines a different function for complementation in case the argument is a verb phrase (these verbs are marked as VV in the RGL): the ComplVV function is used with verb phrase complements. The ComplSlash function, used for nominal objects, is additionally used for prepositional verbs (listen to ...). For these verbs, the required case of the NP argument is specified in the lexicon entry corresponding to the verb. This case information is then propagated to the ComplSlash function (specifically, the function stores this information in a record field labelled c2). Figure 2.15 shows the parse tree for the sentence I like to listen to music in French. The verb like is an example of a verb phrase complement, and listen to is a prepositional verb. Note that in both these functions, the infinitive marker of the verb phrase complement and the case are abstracted away from the AST; they are localized in the concrete syntax (seen only in the parse tree) of the language. The local abstract rules shown in Table 2.1 map the heads of these arguments to xcomp and dobj, but leave the infinitive marker to and the preposition unlabelled.

ComplSlash  head .c2 case dobj
ComplVV     head {"to"} mark xcomp

The local rules above specify the following relabelling operations: for the ComplSlash function, the preposition or case marker in the .c2 record field should be relabelled as a child of the direct object (marked dobj) using the label case; in the ComplVV function, the infinitive marker to is labelled using the mark label as a child of the verb complement.

Figure 2.15: Decorated parse tree showing verb phrase complements and prepositional verbs

2.5.4 Auxiliary verbs and verbal negation

Non-local concrete rules address dependencies of words introduced in the concrete syntax, where context inside the arguments of an abstract function is necessary to determine the respective UD labels. In order to understand the necessity of these rules, we will look at a concrete example (shown in Figure 2.16), addressing auxiliary verbs constructed from tense in GF and clausal negation.

The decorated parse tree in Figure 2.16 shows the analysis provided by GF for the sentence John had not slept. The parse tree for the clause John sleeps has the same functions and categories as this sentence. Why is this? The Cl type resulting from the predication (PredVP) function represents an untensed clause, in other words a simple proposition. This proposition is transformed into a sentence when Tense and Polarity arguments are given to the UseCl function along with the untensed clause Cl. The UseCl function then results in a sentence of type S. Note the PNeg value for the polarity of the clause and the TPast,AAnter value for the tense in our example. The word not appears due to the PNeg value, and the auxiliary verb had due to the anteriority of the tense. The tense parameter dictates the choice of auxiliary verbs, and the polarity parameter dictates whether not occurs in the clause. The equivalent UD labels are also shown in the parse tree. It can be seen that the modifiers according to UD are not always in the same sub-tree as the head (not modifies slept). In order to define the mapping to UD labels, we need non-local concrete rules, as the specific auxiliary verbs in a language are abstracted by the Tense parameter in the AST and are only available in the concrete rules for the language.

Figure 2.16: Decorated parse tree showing tense and clausal negation with the UD labels

The non-locality is also required to handle cases of negation. The non-local rules for negation and auxiliary verbs are defined as follows:

(1) (UseCl ? PNeg ?)   head {"not"} neg head
(2) (UseCl tense ? ?)  head {*} aux head

The negation modifier not is introduced into the clause by the UseCl function if its polarity argument has the PNeg value. The word not is introduced under the UseV function, which lies on the spine corresponding to the head of the UseCl function. The same is also true for auxiliary verbs: they are introduced by the Tense parameter under the UseV function. For clausal negation, the edge between the head word of the UseCl function and the word not is labelled using the neg label, with not as the child. Similarly, auxiliary verbs introduced by the Tense argument to the UseCl function, irrespective of which auxiliary it is, are always connected to the head using the aux label. The {*} in the rule specifies that any auxiliary verb can be matched in this relabelling operation.
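The resulting arcs for John had not slept are then (schematically, word -> head [label]):

John -> slept [nsubj]
had  -> slept [aux]
not  -> slept [neg]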

2.5.5 Multi-word expressions

One part of the UD taxonomy we haven’t discussed yet covers multi-word expressions and some instances of compounding. Compounding encompasses a wide range of linguistic phenomena, from compound nouns and phrasal verbs to non-compositional multi-word expressions. In the UD scheme, this also includes foreign-language phrases and two parts of a single word in poorly written text. The annotation of compound nouns in UD (for example phone book, agent provocateur) is fairly consistent across languages. However, there is variation in the annotation of phrasal verbs. This is not surprising given the variety of "phrasal verb" constructions in different languages.

In the GF-RGL, compounds are handled in two different ways: compositional compounds are analysed using the previously mentioned RGL functions CompoundN and CompoundAP (for example control systems, language independent). Other compounds, i.e. non-compositional compounds, are addressed inside the lexicon specific to a language. The analyses provided for these types of compounds are very much dependent on the concrete syntax of the language, i.e. on whether the translation in the specific language is compositional or not. It is also possible that the translation is simply lexicalized, without any compounding in a language. For these reasons, the abstract syntax shared for these compounds is only the function name; the specific analysis of the compound is localized in the concrete syntax. This is also because languages allow for discontinuous compounds, particularly in cases of phrasal verbs and light verb constructions (for example shut off in shut it off, shut off the TV).

We provide three dependency labels for all these compounding phenomena in our mapping: compound, mwe and goeswith. However, unlike the local abstract rules for the functions that correspond to compositional compounds, these labels are defined on the concrete syntax of a specific language. Shown below are the local concrete rules used for the particle verb shut off and the multi-word expressions according to and in front of.

shut_off_V2 head {"off"} compound head
according_to_Prep head {"to"} mwe head
in_front_of_Prep head {"front", "of"} mwe head

The relabelling operation in the case of shut_off_V2 specifies that the particle off be added as a child of shut with a compound label. Similarly, for in_front_of_Prep, the edges from in to front and of are relabelled using the mwe label. In all these cases, it is implicitly understood that the first word (shut, according, in) is the head of the function. This is in sync with the UD annotation scheme in its current form. However, the rule specification format is flexible enough to allow a different choice of the head, for example a semantic head for multi-word expressions. The UD annotation scheme for multi-word expressions might be revised in the future, and our current mapping scheme allows room for any such changes.

2.5.6 Idiomatic and Semantic (CxG) Constructions

Finally, we discuss semantic constructions that are commonly found in languages and can be translated idiomatically. We discussed at the beginning of this section how GF-RGL defines functions for constructions like existential clauses (there is a blue house on the hill) or cleft sentences (it is money that was owed to him). There is also a small subset of semantic constructions defined in GF-RGL. One example of these semantic constructions is a function that accepts a number, say 10, and renders the semantic linearization of the VP is 10 years old in a specific language.

has_age_VP n -- (someone) "is" n "years old"

We define the mappings to UD labels over these semantic constructions defined in the RGL. In order to construct fully connected UD trees, it is necessary to define both abstract rules and concrete rules over these functions. The rules used for the ExistNP function are shown below. The expletive pronoun there is marked using the expl label, and the copula verb indicating tense is given the label cop.

ExistNP head {"there"} expl head ; head {*} cop head

Notice that in these examples, the concrete mappings specify more than one relabelling operation. This is possible, and the ; is used as a separator between different operations. The conversion algorithm carries out each of these operations in the order specified by the mapping.
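For illustration, a concrete mapping with several operations can be split at the ; separators and executed in order. The following Python sketch is a simplification (it assumes whitespace-separated fields and a word set containing no spaces), not the parser used in this work:

def parse_operations(mapping):
    """Split a concrete mapping such as
    'head {"there"} expl head ; head {*} cop head'
    into its relabelling operations, in application order."""
    ops = []
    for part in mapping.split(";"):
        spine, words, newlabel, newhead = part.split()
        ops.append((spine, words, newlabel, newhead))
    return ops

print(parse_operations('head {"there"} expl head ; head {*} cop head'))
# [('head', '{"there"}', 'expl', 'head'), ('head', '{*}', 'cop', 'head')]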

2.5.7 Dependency conversion algorithm and specification language

We show the exact dependency configuration syntax used to specify the mappings to the UD scheme, and more generally to any dependency annotation scheme, in Figure 2.19 (in Appendix). The figure shows the BNF fragment specifying the syntax of the configuration rules. In the case of local rules, both abstract and concrete, the key to the mapping is the abstract syntax function name Fun that specifies where the mapping should be applied. Similarly, for non-local rules, this key is extended from Fun to an expression pattern that matches sub-trees. Each arg in a non-local rule is a pattern for the arguments of Fun, the number of such patterns being the same as the number of arguments. The ? character is used to match anything. For abstract rules, the mapping is a list of labels of the same size as the number of arguments to the function. The items in these lists are any of the labels specified by the annotation scheme, or the special label head that encodes which argument is the head of the abstract function. By definition, there should be one and only one head in each mapping. The mapping in the case of concrete rules is a list of relabelling operations. The number of operations specified by a single rule depends on the number of lexical items introduced by the concrete syntax in the language that are hidden in the AST. There are three possible relabelling operations: renaming an existing edge with a new label (retaining the already encoded head-modifier relationship), reversing the direction of an existing edge and renaming it with a new label, or adding a completely new edge that did not exist before. Examples of all three operations have been shown before. Each relabelling operation is specified as

Label Part Label Label

The first Label in a relabelling operation is a label specified in the abstract rules, to indicate function(s) in the AST marked with Label. This specifies a spine in the AST used to locate functions under which syncategorematic words can be found. Part specifies either a set of words (in the case of copula verbs), or a record label (in the case of prepositional verbs), or a * to match all lexical items introduced under a function in the AST. The remaining two Label specify the label of the modifier and the label of the head in the AST, in that order. The precedence of the local and non-local rules corresponding to a function Fun follows the order in which they are specified in the configuration: non-local rules that must take precedence are specified before the local rules. By encoding precedence in the configuration, we avoid modifying the conversion algorithm to specially handle non-local rules. With this dependency configuration syntax, the dependency conversion algorithm is extended to derive the complete UD tree. Section 2.2 described the conversion method that maps an AST T and a word sequence S to the corresponding dependency tree. The algorithm is divided into two separate steps: conversion using the abstract rules and the concrete rules, in that order. Algorithm 1 (in Appendix) describes the extended conversion algorithm used in this paper.
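The three relabelling operations can be pictured on a simple edge-list encoding of the dependency tree. The Python sketch below is illustrative only; the dict-based tree encoding and the exact semantics chosen for the edge reversal are assumptions, not the implementation used in this work:

# A dependency tree as a dict: dependent position -> (head position, label).
deps = {1: (4, "nsubj"), 2: (4, "dep"), 3: (4, "dep"), 4: (0, "root")}

def rename(deps, dep, newlabel):
    """Rename an existing edge, retaining its head-modifier direction."""
    head, _ = deps[dep]
    deps[dep] = (head, newlabel)

def reverse(deps, dep, newlabel):
    """Reverse an existing edge with a new label: the old head becomes a
    dependent of dep, and dep inherits the old head's own attachment
    (this is how, e.g., a new head can be chosen for a function)."""
    head, _ = deps[dep]
    deps[dep] = deps[head]
    deps[head] = (dep, newlabel)

def add_edge(deps, dep, head, label):
    """Add a completely new edge that did not exist before."""
    deps[dep] = (head, label)

rename(deps, 2, "aux")    # e.g. label a syncategorematic auxiliary
print(deps[2])            # (4, 'aux')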

2.6 Experiments

In our experiments, we focus on evaluating two things: the mapping proposed in this paper to convert GF trees into UD dependency trees, and subsequently how useful the abstract syntax in GF-RGL and the mapping are in bootstrapping treebanks. We carry out experiments to evaluate the correctness of our mapping and to understand the systematic similarities between GF-RGL and the UD annotation scheme. In the first set of experiments, using mappings defined on the abstract syntax alone (both local and non-local rules), we bootstrap dependency trees for 31 languages from a treebank of ASTs. Analyses of these bootstrapped treebanks give insights about the UD scheme and the GF-RGL. The lack of concrete local and non-local rules for a specific language results in trees that resemble collapsed dependency trees (de Marneffe and Manning, 2008, Ruppert et al., 2015) obtained by deleting the dummy dep labels. Because concrete rules have so far been defined only for English, the bootstrapped treebanks for the other languages do not contain labels for syncategorematic words, and these dep labels are collapsed. Figure 2.17 shows examples of bootstrapped and "collapsed" UD trees for Swedish and Bulgarian in the context of this work. When defined, the concrete rules "decollapse" these trees by introducing the right UD labels for syncategorematic words.
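A sketch of the collapsing step, under the simplifying assumption that the tree is a dict from dependent position to (head position, label) (Python; illustrative only, using made-up positions for John was killed):

def collapse(tree):
    """Delete nodes whose incoming edge carries the dummy dep label
    (syncategorematic words with no concrete rule yet), re-attaching
    any dependents of a deleted node to that node's own head."""
    dummies = {d for d, (h, lab) in tree.items() if lab == "dep"}
    collapsed = {}
    for dep, (head, lab) in tree.items():
        if dep in dummies:
            continue
        while head in dummies:        # skip over deleted nodes
            head = tree[head][0]
        collapsed[dep] = (head, lab)
    return collapsed

# John was killed: "was" (position 2) carries only a dummy dep edge
tree = {1: (3, "nsubjpass"), 2: (3, "dep"), 3: (0, "root")}
print(collapse(tree))   # {1: (3, 'nsubjpass'), 3: (0, 'root')}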

2.6.1 UD test treebank

In order to evaluate the mapping described in this paper, we created two treebanks of ASTs. The first treebank was constructed using examples in the UD annotation guidelines for English dependency relations7. The ASTs from the GF parser (Angelov and Ljunglöf, 2014) were post-edited to correct the ambiguity choices in these examples. The collection of examples from the UD guidelines ensures good coverage of the dependency relations proposed in the UD scheme. Similarly, in an attempt to maximize the coverage of functions defined in the GF-RGL, we created a second treebank composed of the minimal examples used to document the linguistic usage of abstract functions when writing resource grammars for new languages8. The second treebank is relevant for two purposes: the ASTs from the GF-RGL documentation ensure full coverage of the GF interlingua, and also provide minimal examples/test-cases for purposes of UD documentation. These can be used as unit tests to validate the UD annotation efforts; unit tests are typically used in software engineering to check whether the actual output of a system matches the expected result. Additionally, these examples can be compositionally combined to automatically create a treebank of ASTs, which can then be bootstrapped to UD treebanks for languages in GF. We refer to the former treebank as the UD treebank and the latter as the GF treebank here; these should not be confused with the treebanks provided in the UD and GF distributions. Using both these treebanks we evaluate the precision of the mapping and make qualitative analyses. The UD treebank contains a total of 104 ASTs: 66 ASTs that are covered by the RGL and 38 ASTs that use the robust back-up rules in the grammar, enabled for wide-coverage parsing and translation.

7http://universaldependencies.github.io/docs/en/dep/all.html
8http://www.grammaticalframework.org/lib/doc/synopsis.html

As these examples are hand-picked to document the annotation guidelines for UD treebanks, the treebank does not have a specific genre. We will report the precision of the mapping rules separately for these two sets. One reason to set apart these two sets is the ad-hoc nature of recovering UD labels in the case of rules corresponding to chunking, which form a small part of the robust back-up rules. Another reason is that these rules are implemented only for 15 of the 31 languages covered by the core RGL. So, the subset of ASTs covered by the RGL is bootstrapped into all 31 languages, while the rest are bootstrapped into only 15 languages. Many of these ASTs have been used as examples previously in this paper, in Section 2.3, Section 2.4 and Section 2.5. The GF treebank contains a total of 400 ASTs that are covered by the core RGL. These examples are used to document the usage of abstract functions when writing concrete rules for a new language. Hence, these examples provide complete coverage of the RGL grammar and are minimal examples of different linguistic phenomena that can be tested for cross-lingual consistency in UD annotations. We bootstrap this treebank into all 31 languages in the RGL (these examples have been previously used in this paper, in Table 2.1 and other tables).

2.6.2 Evaluation

            Local   Non-local
Abstract    132     11
Concrete    17      29

Table 2.8: Distribution of rules for English according to their types

The complete mapping described in this paper contains a total of 185 rules, 143 of which are defined on the abstract syntax. The remaining rules are defined as concrete local and non-local rules for English. Table 2.8 shows the distribution of the rules in the complete mapping for English. It is necessary to define concrete rules for each language in order to construct fully labelled UD trees. Figure 2.17 shows examples of bootstrapped trees for the Swedish, Finnish and Bulgarian translations of the English sentence John was killed, generated using only the rules defined on the abstract syntax. Also shown is the full UD tree for the English sentence, generated using both the abstract and concrete mappings.

Analyses of English dependency trees

The converted UD trees in the UD treebank have a 99% labelled attachment score (LAS) against the gold UD trees. The dissimilarities in English occur only in two cases: modal verbs and noun phrases, specifically noun phrases with nummod labels. A 100% LAS score is achieved by adding additional rules to handle modal verbs to our mappings. While our mappings with these additional rules result in a perfect match with the UD trees, we look at these two specific cases in some detail below. The UD annotation guidelines treat modal verbs in English the same way as auxiliary verbs, with the POS tag AUX and the aux label. The lexical features of the word indicate its modality, but the label assigned is the same as for auxiliary verbs. This is expected, since modal verbs in English are realized as auxiliaries.

Figure 2.17: Bootstrapped UD trees in Swedish, Finnish and Bulgarian and the complete UD tree in English. The trees on the right are the collapsed variants of the bootstrapped trees.

However, modal verbs in other languages can be realized as inflectional variants of the main verb; for instance, English would corresponds to conditional forms in French. Non-auxiliary modal verbs in GF receive a structure that is shared across languages: they are verbs of the category VV, i.e. verbs that take VP complements, similar to verbs like want or like in I want to ask you something and I like to run. This analysis of modal verbs from GF-RGL maps the modal verb to the head and the main verb to the xcomp label as a child of the modal verb, using the mapping defined for the ComplVV function (complementation using non-finite arguments). The labels of the arguments of the main verb remain the same, though. It is possible to give modals in English the aux label by adding non-local abstract rules for the ComplVV function that match only modals:

(ComplVV can_VV ?) aux head (ComplVV must_VV ?) aux head

However, by defining these mappings on the abstract functions, we would map modal verbs in all 31 languages to the aux label, even in languages where modal verbs are not realized as auxiliary verbs. The UD treebanks in some languages treat the modal verb as the head of the clause, while other languages treat them consistently as auxiliaries (English and Swedish in particular). The problem is that modal verbs are not always realized as auxiliaries across languages. For instance, can and must in

English translate to pouvoir (or savoir) and devoir in French, which are semantically modal but otherwise work just like normal verbs. In the other direction, want is not modal in English, but its equivalent wollen in German is. The RGL choice is to have a uniform abstract category VV of VP-complement verbs and to treat the property of being "modal" in the concrete syntax of each language9. Alternatively, the mapping of modal verbs can be addressed using concrete rules specific to a language. This allows our conversion method to address the differences in the way modal verbs are treated in different languages in UD. These rules are shown below. They relabel the edges of the converted tree in the case of modal verbs to obtain the exact UD tree. What these rules say is that the head of the ComplVV function, i.e. the modal verb in this case (can or must), is relabelled using the aux label, while the other argument, i.e. the main verb, is relabelled to be the head of the ComplVV function. Defining the mapping for modal verbs using concrete rules for English is the best way to address the different annotations used for modal verbs.

(ComplVV can_VV ?) head {*} aux head; xcomp {*} head head (ComplVV must_VV ?) head {*} aux head; xcomp {*} head head

The other divergence from UD is in the case of noun phrases with numerical modifiers: the UD annotation scheme treats all numerical modifiers occurring in NPs as modifiers of the head noun, using the nummod label. While this is valid in almost all cases, it is possible for the number to require a label other than a modifier label. For example, the phrase level 3 is treated as a compound in GF, one where both level and 3 contribute to the meaning. We retain this analysis by mapping this to the compound label, rather than using the nummod label. Also, in instances of noun phrases with numerical heads, as in these five will come with me, there is a difference in how the NPs are analysed. As explained previously, we treat the quantifier as the head in the DetQuant function; as such, these modifiers are attached to the quantifier using the nummod label. These cases can be addressed using non-local abstract rules for the DetCN function where the Det argument is an ordinal number.

Analyses of bootstrapped dependency trees

Table 2.9 shows the percentage of labelled edges in the bootstrapped treebanks. We report these numbers by calculating the fraction of edges in the treebank for each language labelled using the dep relation. The numbers in turn suggest the reduction in UD annotation effort obtainable by bootstrapping using the GF-RGL and the wide-coverage abstract syntax. Note that for these bootstrapping experiments we only use the rules on the abstract syntax (for all languages, including English). One interesting result is that partial UD treebanks can be obtained even for languages with incomplete grammars: if the concrete syntax for a specific language does not implement linearization rules for all the functions defined by the core RGL, we say that the RGL grammar for that language is incomplete. In Table 2.9, Amharic is one example of a language with an incomplete grammar implementation. In the case of the UD treebank, it can be seen that on average, across all languages, about 80% of the edges are labelled with UD labels in the RGL bootstrapped treebanks. There is a decrease in this number when we go to the wide-coverage treebank.

9More concrete syntax distinctions are needed in languages like Finnish, where VV can take its complement in at least five different infinitival forms, and Hindi, where modal verbs interfere with verbalizers.

                     UD treebank
Language     RGL     Wide-coverage     GF treebank
Afrikaans    81.57   -                 81.73
Amharic      73.60   -                 75.29
Bulgarian    76.60   80.53             88.13
Catalan      82.18   76.13             83.10
Chinese      84.89   77.33             82.46
Danish       80.49   -                 87.45
Dutch        83.62   68.10             84.23
English      84.12   81.17             93.13
Estonian     82.38   79.10             86.52
Finnish      81.84   74.09             91.27
French       82.81   79.61             93.47
German       83.09   77.47             95.27
Greek        83.09   -                 88.13
Hindi        72.63   69.34             84.18
Italian      81.79   77.52             90.47
Japanese     71.39   71.09             84.94
Latvian      72.11   -                 89.10
Maltese      83.92   -                 90.29
Mongolian    72.22   -                 87.34
Nepali       82.91   -                 83.19
Norwegian    80.37   -                 86.36
Persian      83.87   -                 84.40
Punjabi      72.37   -                 82.82
Polish       83.05   -                 87.38
Romanian     69.58   -                 86.75
Russian      87.30   -                 87.92
Sindhi       68.21   -                 84.59
Spanish      81.41   79.15             91.28
Swedish      82.56   83.05             93.89
Thai         83.72   73.12             85.29
Urdu         72.86   -                 84.67

Table 2.9: Percentage of completeness in the bootstrapped dependency treebanks. The Amharic grammar is incomplete, i.e. it does not implement all RGL functions.

This decrease is due to the semantic constructions introduced in the wide-coverage grammar, necessary to achieve the abstractions required for interlingua-based MT. These constructions require concrete mappings in order to construct the dependency trees. In comparison, the GF treebank results in much better scores. The percentage of labelled edges is higher in the GF treebank because syncategorematic words have a much lower incidence in the unit tests corresponding to functions defined in the GF-RGL. From the collapsed trees in the bootstrapped treebanks, we primarily learned that the missing labels correspond to auxiliary verbs and copula verbs in most instances. The fact that these are frequent in all languages is clearly the reason for this. But it also suggests that defining a very small set of concrete non-local rules will yield large improvements in completing the bootstrapped treebanks with the correct UD labels. A full investigation of the quality of these bootstrapped treebanks is pending, due to the lack of manually annotated UD treebanks for this collection of languages.
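The completeness percentages reported in Table 2.9 amount to the following computation (a Python sketch, assuming the treebank is given as a list of (head, label) edges):

def completeness(edges):
    """Percentage of edges carrying a real UD label rather than the
    dummy dep relation, i.e. the completeness reported in Table 2.9."""
    labelled = sum(1 for head, label in edges if label != "dep")
    return 100.0 * labelled / len(edges)

edges = [(2, "nsubj"), (0, "root"), (2, "dep")]
print(round(completeness(edges), 2))   # 66.67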

2.6.3 GF Penn Treebank

Previous work on parsing in GF has resulted in a version of the Penn Treebank mapped to the GF-RGL functions (Angelov and Ljunglöf, 2014). The GF-Penn treebank was created by mapping the annotation schema used in the Penn Treebank to the abstract functions in the RGL using hand-crafted rules. This treebank is used to train statistical models for disambiguation in the parser and the wide-coverage translator. In cases where a fragment of the sentence is not covered by the grammar, the resulting AST is still a tree, except that the functions missing from the grammar are replaced by missing nodes. Hence, the GF-Penn treebank is a GF annotation of the trees in the original Penn Treebank where constructions outside the scope of the grammar are marked using a default function indicating incompleteness of the grammar. Nonetheless, the partially complete trees in the GF-Penn treebank are useful for estimating the probability distributions over the grammar. We performed experiments on converting the GF-Penn treebank into the UD annotation scheme using our mapping. The entire treebank has about 5% missing functions, and on average each such function has two arguments. So, about 10% of the edges are simply assigned the dep label because the algorithm cannot determine the mapping corresponding to these functions. We did not make any changes to the mappings or the conversion procedure to address the case of missing nodes. The converted treebank is evaluated against a UD version of the Penn Treebank; these UD annotations are obtained from the Stanford parser distribution (de Marneffe and Manning, 2008, de Marneffe et al., 2014). We evaluate our converted treebank against this UD treebank by computing the labelled attachment score (LAS) used in evaluating dependency parser performance. This metric calculates the percentage of edges in the treebank that are attached to their correct head and assigned the correct dependency label. Table 2.10 shows the LAS scores on Sections 22, 23 and 24 of the GF-Penn treebank, against standard partitions of the Penn Treebank that are used to report parser accuracies. Analysis of the converted UD treebank showed three major reasons for failure to map the entire treebank to the UD labels.

i) About 5% of the edges in these sections are wrongly labelled due to divergences of our mapping from the UD annotation scheme. Most of these correspond to the treatment of modal verbs (mentioned previously), and a small fraction correspond to numerical modifiers mapped as children of the quantifiers in our mapping.

Section   LAS
WSJ-22    81.34
WSJ-23    83.21
WSJ-24    85.12

Table 2.10: LAS scores comparing our mapping to UD against the Penn UD treebank


ii) About 7% of the wrongly labelled edges correspond to proper nouns in the corpus. The UD annotation scheme selects the first token in a name to be the head, and all the remaining tokens in the name are labelled using the name label. Contrary to this, GF treats the entire name as a single token and does not assign edges between tokens in a proper noun.

iii) About 7% of the edges are wrongly labelled due to the missing functions in the GF treebank. As mentioned previously, we did not make any changes to the conversion algorithm or the mappings to address these cases.
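For reference, the LAS metric used in Table 2.10 can be sketched as follows (Python; gold and system analyses are assumed to be aligned token by token as (head, label) pairs):

def las(gold, system):
    """Labelled attachment score: the percentage of tokens whose
    predicted (head, label) pair equals the gold annotation."""
    assert len(gold) == len(system)
    correct = sum(1 for g, s in zip(gold, system) if g == s)
    return 100.0 * correct / len(gold)

gold   = [(2, "nsubj"), (0, "root"), (2, "dobj")]
system = [(2, "nsubj"), (0, "root"), (2, "nmod")]
print(round(las(gold, system), 2))   # 66.67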

2.7 Conclusion

In this paper, we investigate the similarities and differences between GF-RGL and Universal Dependencies (UD). Our main result is a conversion from GF-RGL abstract syntax trees to UD dependency trees. The conversion from abstract syntax trees in GF to UD trees is parameterized by a mapping defined on functions in the abstract syntax, encoding the notion of "head" and the grammatical relation of the other modifiers with respect to the head. The mapping described in this work has two levels: rules defined on the abstract syntax and rules defined on the concrete syntax. Rules defined on the abstract syntax are by definition language-independent and "shared". Rules defined on the concrete syntax, however, are defined for each language separately, even if the UD labels they introduce may be shared across languages. Furthermore, both abstract and concrete rules come in two types, "local" and "non-local". In the case of "local" rules, the mapping between a function and its arguments and their UD annotations (both the identity of the head and the UD labels) is determined using only the function, unlike "non-local" rules, where a mapping applies only in certain contexts. The mapping from GF-RGL to UD showed that a large part of the GF-RGL and the UD annotation scheme is language-independent. About 30 of the 40 UD labels are mapped using the abstract "local" and "non-local" rules in our experiments. These parts of both the GF-RGL and the UD scheme can be said to be language-independent. The parts of the mapping defined on the concrete syntax interestingly fall into two taxons within the UD annotation scheme (clausal dependents such as aux, auxpass, cop, mark, expl, and compounding labels compound, mwe). One application of the conversion detailed in this paper is to automatically create UD treebanks from GF treebanks. Our experiments using a small treebank showed that about 80% of UD treebank annotations can be ported using the abstract syntax in GF-RGL. The remaining annotations can also be derived from GF, provided a small effort is put into defining concrete rules in the mapping for the languages of interest. Other plausible applications and future directions of our work include multilingual consistency checking in the UD annotated treebanks and using these treebanks to estimate probabilistic ranking of GF abstract syntax trees.

Acknowledgements

We would like to thank Krasimir Angelov, Filip Ginter, Richard Johansson, Joakim Nivre and the two anonymous referees for their valuable comments and suggestions. Special thanks to Annie Zaenen for her technical comments in addition to all editorial help. The research was funded by the REMU project (Reliable Multilingual Digital Communication, Swedish Research Council 2012-5746).

2.8 Appendix: GF-RGL and UD Reference

This Appendix gives a listing of GF-RGL categories and UD tags and labels.

2.8.1 GF-RGL categories

Figure 2.18 shows the hierarchy of categories in the core RGL. The same picture appears in Ranta (2011), whereas Ranta (2009b) provides a systematic linguistic discussion of the RGL categories and functions. Table 2.11 lists the same categories with explanations, minimal linguistic examples and corresponding UD parts of speech.

Figure 2.18: The categories of core RGL.

Category  Explanation                           Example                  UD POS
A         adjective                             old                      ADJ
AP        adjectival phrase                     very warm                phrasal
Adv       adverb or adverbial phrase            in the house             ADV
Ant       anteriority                           simultaneous, anterior   syncat
CN        common noun (without determiner)      red house                phrasal
Card      cardinal number                       seven                    NUM
Cl        declarative clause, with all tenses   she looks at this        phrasal
Comp      complement of copula, such as AP      very warm                phrasal
Conj      conjunction                           and                      CONJ
Det       determiner phrase                     those seven              DET,phrasal
IAdv      interrogative adverb                  why                      ADV
IComp     interrogative complement of copula    where                    phrasal
IDet      interrogative determiner              how many                 DET,phrasal
IP        interrogative pronoun                 who                      PRON
Imp       imperative                            look at this             phrasal
Interj    interjection                          alas                     INTJ
N         common noun                           house                    NOUN
NP        noun phrase (subject or object)       the red house            phrasal
Num       number determining element            seven                    phrasal
Numeral   cardinal or ordinal in words          five/fifth               NUM
Ord       ordinal number (used in Det)          seventh                  NUM
PConj     phrase-beginning conjunction          therefore                CONJ
PN        proper name                           Paris                    PROPN
Phr       phrase in a text                      but be quiet please      phrasal
Pol       polarity                              positive, negative       syncat
Predet    predeterminer (prefixed Quant)        all                      DET
Prep      pre/postposition, or just case        in                       ADP
Pron      personal pronoun                      she                      PRON
Punct     punctuation mark                      !                        PUNCT
QCl       question clause, with all tenses      why does she walk        phrasal
QS        question                              where did she live       phrasal
Quant     quantifier ('nucleus' of Det)         this/these               DET
RCl       relative clause, with all tenses      in which she lives       phrasal
RP        relative pronoun                      in which                 PRON
RS        relative clause, tense fixed          in which she lived       phrasal
S         declarative sentence                  she lived here           phrasal
SC        embedded sentence or question         that it rains            phrasal
Subj      subjunction                           if                       SCONJ
Temp      temporal and aspectual features       past anterior            syncat
Tense     tense                                 present, past, future    syncat
Text      text consisting of several phrases    He is here. Why?         phrasal
Utt       sentence, question, word...           be quiet                 phrasal
V         one-place verb                        sleep                    VERB
V2        two-place verb                        love                     VERB
V2V       verb with NP and V complement         cause                    VERB
V3        three-place verb                      show                     VERB
VA        adjective-complement verb             look                     VERB
VP        verb phrase                           is very warm             phrasal
VPSlash   verb phrase missing complement        give to John             phrasal
VQ        question-complement verb              wonder                   VERB
VS        sentence-complement verb              claim                    VERB
VV        verb-phrase-complement verb           want                     VERB
Voc       vocative                              my darling               phrasal

Table 2.11: Main RGL categories and corresponding UD POS tags. The tag "phrasal" means that the category has only complex phrases. "syncat" means that the category is linearized to abstract features, to which words are assigned syncategorematically.

2.8.2 UD tags and labels

Table 2.12 shows the hierarchy of core UD labels used to annotate treebanks in the UD annotation project. Table 2.13 shows the universal part-of-speech tags and the morphological features annotated in the UD treebanks. We recommend the UD documentation10 for readers interested in further details about the UD annotation efforts.

Core dependents of clausal predicates
  Nominal dep: nsubj, nsubjpass, dobj, iobj
  Predicate dep: csubj, csubjpass, ccomp, xcomp
Non-core dependents of clausal predicates
  Nominal dep: nmod
  Predicate dep: advcl
  Modifier word: advmod, neg
Special clausal dependents
  Nominal dep: vocative, discourse, expl
  Auxiliary: aux, auxpass, cop
  Other: mark, punct
Noun dependents
  Nominal dep: nummod, appos, nmod
  Predicate dep: acl
  Modifier word: amod, det, neg
Case-marking, prepositions, possessive
  case
Coordination
  conj, cc, punct
Compounding and unanalysed
  compound, mwe, goeswith, name, foreign
Loose joining relations
  list, parataxis, remnant, dislocated, reparandum
Other
  Sentence head: root
  Unspecified dependency: dep

Table 2.12: Dependency labels used in UD, organized as a taxonomy, as shown in Nivre et al. (2016)

10http://universaldependencies.org/docs/

Open class words        Closed class words                 Other
ADJ    adjective        ADP    preposition/postposition    PUNCT  punctuation
ADV    adverb           AUX    auxiliary                   SYM    symbol
INTJ   interjection     CONJ   coordinating conjunction    X      unspecified POS
NOUN   noun             DET    determiner
PROPN  proper noun      NUM    numeral
VERB   verb             PART   particle
                        PRON   pronoun
                        SCONJ  subordinating conjunction

Lexical     Inflectional (Nominal)   Inflectional (Verbal)
PronType    Gender                   VerbForm
NumType     Animacy                  Mood
Poss        Number                   Tense
Reflex      Case                     Aspect
            Definite                 Voice
            Degree                   Person
                                     Negative

Table 2.13: The top table shows the part-of-speech tags in UD; the bottom table shows the morphological features in UD. Both tables are shown verbatim as in Nivre et al. (2016).

2.9 Appendix: Dependency conversion algorithm and specification language

Figure 2.19 shows the BNF grammar of dependency configurations.

Rule     ::= Fun Label+ | Fun Relabels | Tree Label+ | Tree Relabels
Relabels ::= Relabel ";" Relabels | Relabel
Relabel  ::= Label Part Label Label
Part     ::= "." Field | "(" Words ")" | "{*}"
Tree     ::= "(" Fun Arg* ")"
Arg      ::= "(" Tree ")" | "?"
Fun      ::= Ident
Label    ::= Ident
Words    ::= QuotedString | QuotedString "," Words

Figure 2.19: BNF specification of the dependency configuration syntax. Label+ refers to one or more labels using the syntax of regular expressions.
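As an illustration of the Tree and Arg productions, a minimal recursive-descent reader for context patterns such as (UseCl ? PNeg ?) could look as follows (Python; a sketch that covers only the pattern fragment of the grammar and, following the examples in the text, also accepts a bare function name as shorthand for a zero-argument tree):

def parse_tree(tokens):
    """Parse a pattern of the form "(" Fun Arg* ")". Each Arg is "?",
    a nested parenthesized pattern, or a bare function name."""
    assert tokens.pop(0) == "("
    fun, args = tokens.pop(0), []
    while tokens[0] != ")":
        if tokens[0] == "(":
            args.append(parse_tree(tokens))
        else:
            args.append(tokens.pop(0))   # "?" or a bare function name
    tokens.pop(0)                        # consume ")"
    return (fun, args)

print(parse_tree("( UseCl ? PNeg ? )".split()))
# ('UseCl', ['?', 'PNeg', '?'])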

Shown in Algorithm 1 is the conversion algorithm used to obtain dependency trees from abstract syntax trees.

Data: Abstract syntax tree T, configuration on abstract syntax Cabstract and concrete syntax Cconcrete
Result: Dependency tree D

# Decorate T with labels to get TL and a list of concrete configurations
TL, concreterelabels ← decorate(T)
# Convert TL to a partial dependency tree
Dpartial ← getdependencies(TL)
# Apply the concrete mappings to get the complete dependency tree
D ← relabeledges(Dpartial, concreterelabels)

Function decorate(tree)
    decoratedtree ← tree
    concreterelabels ← Table()   # initialize an empty table
    foreach node in AST tree do
        funcontext ← get subtree dominated by node
        funname ← get function name at node
        abstractconfigs ← Cabstract[funname]
        for config in abstractconfigs do
            context, labels ← unpack(config)
            if match(context, funcontext) then
                add labels to children of node in decoratedtree
                break loop
        concreteconfigs ← Cconcrete[funname]
        for config in concreteconfigs do
            context, relabelops ← unpack(config)
            if match(context, funcontext) then
                store relabelops for node in table concreterelabels
                break loop
    return decoratedtree, concreterelabels

Function getdependencies(tree)
    D ← DependencyTree()   # initialize an empty dependency tree
    foreach word in Linearization(tree) do
        find lowest node parent in tree spanning the word
    foreach leaf in AST tree do
        find the UD label by traversing unlabelled edges up the AST
        find the head by traversing unlabelled edges down from the dominating node
        store the word, parent, head function and UD label in D
    return D

Function relabeledges(deptree, concreterelabels)
    foreach node in table concreterelabels do
        relabelops ← concreterelabels[node]
        foreach relabelop in relabelops do
            (label, words, newlabel, newhead) ← unpack(relabelop)
            for word in children of node in deptree do
                if word matches words then
                    assign the new UD label newlabel and head newhead in deptree
    return deptree

Algorithm 1: Algorithm for converting ASTs to dependency trees

Data: Context for a node in the AST nodecontext and a context pattern from the configuration configcontext
Result: True or False

remove branches from nodecontext that are not heads
foreach arg in nodecontext do
    pat ← next(configcontext)   # pattern of the next argument
    if not argmatch(arg, pat) then
        return False
return True

Function argmatch(tree, treepattern)
    if treepattern is "?" then
        return True   # match everything
    funname, patargs ← unpack(treepattern)
    topnode, args ← unpack(tree)
    if funname does not match the name of topnode then
        return False
    else if patargs is Empty then
        return True   # local rule
    foreach argument in args and pat in patargs do
        if not argmatch(argument, pat) then
            return False
    return True

Algorithm 2: Algorithm used to match contexts in configurations to ASTs

Chapter 3
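A direct, runnable transcription of argmatch (Python; trees and patterns are encoded as nested (name, arguments) pairs, an assumption made purely for illustration):

def argmatch(tree, pattern):
    """Match an AST against a context pattern: "?" matches anything,
    and a pattern with no argument patterns matches on the function
    name alone (the local-rule case)."""
    if pattern == "?":
        return True
    funname, patargs = pattern
    topnode, args = tree
    if funname != topnode:
        return False
    if not patargs:
        return True
    return all(argmatch(a, p) for a, p in zip(args, patargs))

ast = ("ComplVV", [("can_VV", []), ("UseV", [("sleep_V", [])])])
print(argmatch(ast, ("ComplVV", [("can_VV", []), "?"])))   # True
print(argmatch(ast, ("ComplVV", [("must_VV", []), "?"])))  # False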

Paper II: ud2gf

From Universal Dependencies to Abstract Syntax Aarne Ranta and Prasanth Kolachina Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pp. 107–116.

77

Abstract

Abstract syntax is a tectogrammatical tree representation, which can be shared between languages. It is used for programming languages in compilers, and has been adapted to natural languages in GF (Grammatical Framework). Recent work has shown how GF trees can be converted to UD trees, making it possible to generate parallel synthetic treebanks for the 30 languages that are currently covered by GF. This paper attempts to invert the mapping: take UD trees from standard treebanks and reconstruct GF trees from them. Such a conversion is potentially useful in bootstrapping treebanks by translation. It can also help GF-based interlingual translation by providing a robust, efficient front end. However, since UD trees are based on natural (as opposed to generated) data and built manually or by machine learning (as opposed to rules), the conversion is not trivial. This paper presents a basic algorithm, which is essentially based on inverting the GF to UD conversion. This method covers around 70% of the nodes, and the rest can be covered by approximative back-up strategies. Analysing the reasons for the incompleteness reveals structures missing in GF grammars, but also some problems in UD treebanks.

The abstract syntax defines a set of categories, such as CN (Common Noun) and AP (Adjectival Phrase), and a set of functions, such as ModCN (modification of CN with AP):

cat CN ; AP
fun ModCN : AP -> CN -> CN

A concrete syntax defines, for each category, a linearization type, and for each function, a linearization function; these can make use of parameters. For English, we need a parameter type Number (singular or plural). We define CN as a table (similar to an inflection table), which produces a string as a function of Number (Number=>Str). As AP is not inflected, it is just a string. Adjectival modification places the AP before the CN, passing the number to the CN head of the construction:

param Number = Sg | Pl
lincat CN = Number => Str
lincat AP = Str
lin ModCN ap cn = \\n => ap ++ cn ! n

In French, we also need the parameter of gender. An AP depends on both gender and number. A CN has a table on Number like in English, but in addition, an inherent gender. The table and the gender are collected into a record. Adjectival modification places the AP after the CN, passing the inherent gender of the CN head to the AP, and the number to both constituents:

param Gender = Masc | Fem
lincat CN = {s : Number => Str ; g : Gender}
lincat AP = Gender => Number => Str
lin ModCN ap cn = {
  s = \\n => cn ! n ++ ap ! cn.g ! n ;
  g = cn.g
}

Context-free grammars correspond to a special case of GF where Str is the only linearization type. The use of tables (P=>T) and records ({a : A ; b : B}) makes GF more expressive than context-free grammars. The distinction between dependent and inherent features, as well as the restriction of tables to finite parameter types, makes GF less expressive than unification grammars. Formally, GF is equivalent to PMCFG (Parallel Multiple Context-Free Grammars) (Seki et al., 1991), as shown in (Ljunglöf, 2004), and has polynomial parsing complexity. The power of PMCFG has shown to be what is needed to share an abstract syntax across languages. In addition to morphological variation and agreement, it permits discontinuous constituents (used heavily e.g. in German) and reduplication (used e.g. in Chinese questions). The GF Resource Grammar Library uses a shared abstract syntax for currently 32 languages (Indo-European, Fenno-Ugric, Semitic and East Asian) written by over 50 contributors. Software, grammars, and documentation are available at http://www.grammaticalframework.org

Figure 3.1: GF in a nutshell. The text works out a simple GF grammar of adjectival modification in English and French, showing how the structure can be shared despite differences in word order and agreement.

3.1 Introduction

GF (Grammatical Framework (Ranta, 2011)) is a formalism for multilingual grammars. Similarly to UD (Universal Dependencies (Nivre et al., 2016)), GF uses shared syntactic descriptions for multiple languages. In GF, this is achieved by using abstract syntax trees, similar to the internal representations used in compilers and to Curry's tectogrammatical formulas (Curry, 1961). Given an abstract syntax tree, strings in different languages can be derived mechanically by linearization functions written for each language, similar to pretty-printing rules in compilers and to Curry's phenogrammatical rules. The linearization functions of GF are by design reversible to parsers, which convert strings to abstract syntax trees. Figure 3.1 gives a very brief summary of GF for readers unfamiliar with it. In UD, the shared descriptions are dependency labels and part of speech tags used in dependency trees. The words in the leaves of UD trees are language-specific, and languages can extend the core tagset and labels to annotate constructions in the language.

Figure 3.2: Conversions between UD trees, GF trees, and surface strings in English and French.

The relation between trees and strings is not defined by grammar rules, but by constructing a set of example trees: a treebank. From a treebank, a parser is typically constructed by machine learning (Nivre, 2006). There is no mechanical way to translate a UD tree from one language to other languages. But such a translation can be approximated in different ways to bootstrap treebanks (Tiedemann and Agic, 2016). GF's linearization can convert abstract syntax trees to UD trees (Kolachina and Ranta, 2016). This conversion can be used for generating multilingual (and parallel) treebanks from a given set of GF trees. However, to reach the full potential of the GF-UD correspondence, it would also be useful to go in the opposite direction, converting UD trees to GF trees. Then one could translate standard UD treebanks to new languages. One could also use dependency parsing as a robust front-end to a translator, which uses GF linearization as a grammaticality-preserving backend (Angelov et al., 2014), or to a logical form generator in the style of Reddy et al. (2016), but where GF trees give an accurate intermediate representation in the style of Ranta (2004a). Figure 3.2 shows both of these scenarios, using the term gf2ud for the conversion of Kolachina and Ranta (2016) and ud2gf for the inverse procedure, which is the topic of this paper. GF was originally designed for multilingual generation in controlled language scenarios, not for wide-coverage parsing. The GF Resource Grammar Library (Ranta, 2009b) thus does not cover everything in all languages, but just a "semantically complete subset", in the sense that it provides ways to express all kinds of content, but not necessarily all possible ways to express it. It has therefore been interesting to see how much of the syntax in UD treebanks is actually covered, to assess the completeness of the library. In the other direction, some of the difficulties in the ud2gf mapping suggest that UD does not always annotate syntax in the most logical way, or in a way that is maximally general across languages. The work reported in this paper is the current status of work in progress. Therefore the results are not conclusive: in particular, we expect to improve the missing coverage in a straightforward way.

Figure 3.3: Annotating a GF tree with dependency labels. The label dobj results from the annotation of the ComplV2 function. The category annotations (cat) are used in Figures 3.2 and 3.4 to map between GF categories and UD POS tags.

The most stable part of the work is the annotation algorithm described in Sections 3.3 and 3.4. It is based on a general notation for dependency configurations, which can be applied to any GF grammar and to any dependency annotation scheme, not only the UD scheme. The code for the algorithm and the annotations used in the experiments is available open source.1 The structure of the paper is as follows: Section 3.2 summarizes the existing gf2ud conversion and formulates the problem of inverting it. Section 3.3 describes a baseline bottom-up algorithm for translation from UD trees to GF trees. Section 3.4 presents some refinements to the basic algorithm. Section 3.5 shows a preliminary evaluation with UD treebanks for English, Finnish, and Swedish. Section 3.6 concludes.

3.2 From gf2ud to ud2gf

The relation between UD and GF is defined declaratively by a set of dependency configurations. These configurations specify the dependency labels that attach to each subtree in a GF tree. Figure 3.3 shows an abstract syntax specification together with a dependency configuration, as well as a GF tree with the corresponding labels attached. Consider, for example, the second line of the "abstract syntax" part of Figure 3.3, with the symbol ComplV2. This symbol is one of the functions that are used for building the abstract syntax tree. Such a function takes a number of trees (zero or more) as arguments and combines them into a larger tree. Thus ComplV2 takes a V2 tree (two-place verb) and an NP tree (noun phrase) to construct a VP tree (verb phrase). Its name hints that it performs complementation, i.e. combines verbs with their complements. Its dependency configuration head dobj specifies that the first argument (the verb) will contain the label head in UD, whereas the second argument (the noun phrase) will contain the label dobj (direct object). When the configuration

1https://github.com/GrammaticalFramework/gf-contrib/tree/master/ud2gf

is applied to a tree, the head labels are omitted, since they are the default. Notice that the order of arguments in an abstract syntax tree is independent of the order of words in its linearizations. Thus, in Figure 3.2, the object is placed after the verb in English but before the verb in French. The algorithm for deriving the UD tree from the annotated GF tree is simple:

• for each leaf X (which corresponds to a lexical item)
  – follow the path up towards the root until you encounter a label L
  – from the node immediately above L, follow the spine (the unlabelled branches) down to another leaf Y
  – Y is the head of X with label L

It is easy to verify that the UD trees in Figure 3.2 can be obtained in this way, together with the English and French linearization rules that produce the surface words and the word order. In addition to the configurations of functions, we need category configurations, which map GF types to UD part of speech (POS) tags. This algorithm covers what Kolachina and Ranta (2016) call local abstract configurations. They are sufficient for most cases of the gf2ud conversion, and have the virtue of being compositional and exactly the same for all languages. However, since the syntactic analyses of GF and UD are not exactly the same, and within UD can moreover differ between languages, some non-local and concrete configurations are needed in addition. We will return to these after showing how the local abstract configurations are used in ud2gf. The path from GF trees to UD trees (gf2ud) is deterministic: it is just linearization to an annotated string representing a dependency tree. It defines a relation between GF trees and UD trees: GF tree t produces UD tree u. Since the mapping involves loss of information, it is many-to-one. The opposite direction, ud2gf, is a nondeterministic search problem: given a UD tree u, find all GF trees t that can produce u. The first problem we have to solve is thus

Ambiguity: a UD tree can correspond to many GF trees.

More problems are caused by the fact that GF trees are formally generated by a grammar whereas UD trees have no grammar. Thus a UD tree may lack a corresponding GF tree for many different reasons:

Incompleteness: the GF grammar is incomplete.
Noise: the UD tree has annotation errors.
Ungrammaticality: the original sentence has grammar errors.

Coping with these problems requires robustness of the ud2gf conversion. The situation is similar to the problems encountered when GF is used for wide-coverage parsing and translation (Angelov et al., 2014). The solution is also similar, as it combines a declarative rule-based approach with disambiguation and a back-up strategy.
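The head-finding procedure in the bullet list above can be made concrete with a small sketch (Python; the encoding of annotated trees as lists of (label, child) pairs, with None marking spine branches, is an assumption made for illustration):

def head_leaf(node):
    """Follow the spine (unlabelled branches) down to the head word."""
    while not isinstance(node, str):
        node = next(child for label, child in node if label is None)
    return node

def dependencies(tree):
    """For each leaf, the nearest labelled edge above it gives its label;
    its head is the spine leaf of the node immediately above that edge."""
    deps = []
    def walk(node, pending):
        if isinstance(node, str):
            if pending is None:
                deps.append((node, "root", None))
            else:
                label, above = pending
                deps.append((node, label, head_leaf(above)))
            return
        for label, child in node:
            walk(child, (label, node) if label is not None else pending)
    walk(tree, None)
    return deps

# the black cat: DetCN the_Det (ModCN (PositA black_A) (UseN cat_N))
tree = [("det", "the"),
        (None, [("amod", [(None, "black")]),
                (None, [(None, "cat")])])]
print(dependencies(tree))
# [('the', 'det', 'cat'), ('black', 'amod', 'cat'), ('cat', 'root', None)]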

3.3 The ud2gf basic algorithm

The basic algorithm is illustrated in Figure 3.4. Its main data structure is an annotated dependency tree, where each node has the form

Figure 3.4: Steps in ud2gf: restructuring and lexical annotation; node annotation by endo- and exocentric functions; the final annotated tree.

<L, t, ts, C, p>

where

• L is a dependency label (always the same as in the original UD tree)
• t is the current GF abstract syntax tree (iteratively changed by the algorithm)
• ts is a list of alternative GF abstract syntax trees (iteratively changed by the algorithm)
• C is the GF category of t (iteratively changed by the algorithm)
• p is the position of the original word in the UD tree (always the same as in the original UD tree)
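In code, such a node could be represented as follows (a Python sketch; the field names simply follow the list above):

from dataclasses import dataclass, field

@dataclass
class AnnotatedNode:
    label: str                # L: dependency label, fixed from the UD tree
    tree: object              # t: current GF abstract syntax tree
    alternatives: list = field(default_factory=list)   # ts: candidate trees
    category: str = ""        # C: GF category of t
    position: int = 0         # p: word position, fixed from the UD tree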

Examples of such nodes are shown in Figure 3.4, in the tree marked (5) and in all trees below it. The algorithm works in the following steps, with references to Figure 3.4:

1. Restructuring. Convert the CoNLL graph (marked (2) in Figure 3.4) to a tree data structure (3), where each node is labelled by a dependency label, lemma, POS tag, and word position. This step is simple and completely deterministic, provided that the graph is a well-formed tree; if it is not, the conversion fails2.

2. Lexical annotation. Preserve the tree structure in (3) but change the structure of the nodes to the one described above and shown in (5). This is done by using a GF lexicon (4) and a category configuration, replacing each lemma with a GF abstract syntax function and its POS with a GF category.3

3. Syntactic annotation. The GF trees t in the initial tree (5) are lexical (0-argument) functions. The syntactic annotation step annotates the tree recursively with applications of syntactic combination functions. Some of them may be endofunctions (i.e. endocentric functions), in the sense that one of the argument types is the same as the value type. In Figure 3.3, the functions AdvVP and ModCN are endocentric. All other functions are exofunctions (i.e. exocentric functions), where none of the argument types is the same as the value type. In the syntactic annotation, it is important to apply endofunctions before exofunctions, because exofunctions could otherwise block later applications of endofunctions.4

The algorithm is a depth-first postorder traversal: for an annotated tree T = (N T1 ... Tn), where N = <L, t, ts, C, p>,

• syntax-annotate the subtrees T1,...,Tn

• apply available combination functions to N:

– if an endofunction f : C → C applies, replace <t, ts> with <(f t), {t} ∪ ts>
– else, if an exofunction f : C → C′ applies, replace <t, ts, C> with <(f t), {t} ∪ ts, C′>

where a function f : A → B applies if f = (λx)(g ... x ...), where g is an endo- or exocentric function on C and all other argument places than x are filled with GF trees from the subtrees of T. Every subtree can be used at most once.
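The preference for endofunctions can be sketched as follows, using the AnnotatedNode sketch above (Python; applicable_endo and applicable_exo are hypothetical helpers, assumed to return a combination function whose other argument places have been filled from the subtrees, or None):

def apply_step(node, applicable_endo, applicable_exo):
    """One syntactic annotation step at a node: prefer an endocentric
    function (category preserved) over an exocentric one (category
    changed), since an exofunction would otherwise block later
    endofunction applications."""
    f = applicable_endo(node)
    if f is not None:
        node.alternatives.append(node.tree)
        node.tree = f(node.tree)                 # f : C -> C
        return True
    f = applicable_exo(node)
    if f is not None:
        node.alternatives.append(node.tree)
        node.tree, node.category = f(node.tree)  # f : C -> C'
        return True
    return False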

2This has never happened with the standard UD treebanks that we have worked with.
3The GF lexicon is obtained from the GF grammar by linearizing each lexical item (i.e. zero-place function) to the form that is used as the lemma in the UD treebank for the language in question.
4This is a simplifying assumption: a chain of two or more exofunctions could in theory bring us back to the same category as we started with.

An example of syntactic annotation is shown in the middle part of Figure 3.4. The node for the word cat at position 3 (the second line in the tree) has one applicable endofunction, ModCN (adjectival modification), and one exofunction, DetCN (determination). Hence the application of the endofunction ModCN combines the AP in position 2 with the CN in position 3. For brevity, the subtrees that the functions can apply to are marked by the position numbers.5 Hence the tree

DetCN 1 3

in the final annotated tree actually expands to

DetCN the_Det (ModCN (PositA black_A) (UseN cat_N))

by following these links. The whole GF tree at the root node expands to the tree shown in Figures 3.2 and 3.3.

3.4 Refinements of the basic algorithm

We noted in Section 3.2 that ud2gf has to deal with ambiguity, incompleteness, noise, and ungrammaticality. The basic algorithm of Section 3.3 takes none of these aspects into account. But it does contain what is needed for ambiguity: the list ts of previous trees at each node can also be used more generally for storing alternative trees. The "main" tree t is then compared and ranked together with these candidates. Ranking based on tree probabilities in previous GF treebanks, as in Angelov (2011), is readily available. But an even more important criterion is the node coverage of the tree. This means heavily penalizing trees that do not cover all nodes in the subtrees. This leads us to the problem of incompleteness: what happens if the application of all possible candidate functions and trees still does not lead to a tree covering all nodes? An important part of this problem is due to syncategorematic words. For instance, the copula in GF is usually introduced as a part of the linearization, and does not have a category or function of its own.6 To take the simplest possible example, consider the adjectival predication function and its linearization:

fun UseAP : AP -> VP
lin UseAP ap = \\agr => be agr ++ ap

where the agreement feature of the verb phrase is passed to an auxiliary function be, which produces the correct form of the copula when the subject is added. The sentence the cat is black has the following tree obtained from UD:

root (PredVP 2 4) [UseAP...black_A] S 4
  nsubj (DetCN 1 2) [UseN 2,cat_N] 2
    det the_Det Det 1
  cop "be" String 3 ***

5If the argument has the same node as the head (like 3 here), the position refers to the next-newest item on the list of trees.
6This is in (Kolachina and Ranta, 2016) motivated by cross-lingual considerations: there are languages that do not need copulas. In (Croft et al., 2017), the copula is defined as a strategy, which can be language-dependent, in contrast to constructions, which are language-independent. This distinction seems to correspond closely to concrete vs. abstract syntax in GF.

The resulting GF tree is correct, but it does not cover node 3 containing the copula.7 The problem is the same in gf2ud (Kolachina and Ranta, 2016), which introduces language-specific concrete annotations to endow syncategorematic words with UD labels. Thus the concrete annotation

UseAP head {"is","are","am"} cop head

specifies that the words is, are, am occurring in a tree linearized from a UseAP application have the label cop attached to the head. In ud2gf, the treatment of the copula turned out to be simpler than in gf2ud. What we need is to postulate an abstract syntax category of copulas and a function that uses the copula. This function has the following type and configuration:

UseAP_ : Cop_ -> AP -> VP ; cop head

It is used in the basic algorithm in the same way as ordinary functions, but eliminated from the final tree by an explicit definition:

UseAP_ cop ap = UseAP ap

The copula is captured from the UD tree by applying a category configuration that has a condition on the lemma:8

Cop_ VERB lemma=be

This configuration is used in the lexical annotation phase, so that the last line of the tree for the cat is black becomes

cop be Cop_ 3

Hence the final tree built for the sentence is

PredVP (DetCN the_Det (UseN cat_N)) (UseAP_ be (PositA black_A))

which covers the entire UD tree. By applying the explicit definition of UseAP_, we obtain the standard GF tree

PredVP (DetCN the_Det (UseN cat_N)) (UseAP (PositA black_A))

Many other syncategorematic words, such as negations, tense auxiliaries, and infinitive marks, can be treated in a similar way. The eliminated constants are called helper functions and helper categories, and for clarity suffixed with underscores. Another type of incomplete coverage is due to missing functions in the grammar, annotation errors, and actual grammar errors in the source text. To deal with these, we have introduced another type of extra functions: backup functions. These functions collect the uncovered nodes (marked with ***) and attach them to their heads as adverbial modifiers. The nodes collected as backups are marked with single asterisks (*). In the evaluation statistics, they are counted as uninterpreted nodes, meaning that they are not covered with the standard GF grammar. But we have added linearization

7We use *** to mark uncovered nodes; since be has no corresponding item in the GF lexicon, its only possible categorization is as a String literal.
8The simplicity is due to the fact that the trees in the treebank are lemmatized, which means that we need not match all forms of the copula.

I have a change in plans next week .

root      have_V2 : V2       2
  nsubj     i_Pron : Pron      1
  dobj      change_N : N       4
    det       IndefArt : Quant   3
    nmod      plan_N : N         6
      case      in_Prep : Prep     5
  nmod:tmod Backup week_N : N  8 *
    amod      next_A : A         7 *
  punct     "." : String       9

I have a change in plans "." [ next week ]
minulla on muutos suunnitelmissa "." [ seuraava viikko ]
jag har en ändring i planer "." [ nästa vecka ]

Figure 3.5: A tree from the UD English training treebank with lexical annotations and backups marked, and the resulting linearizations to English, Finnish, and Swedish.

rules to them, so that they are for instance reproduced in translations. Figure 3.5 gives an example of a UD tree thus annotated, and the corresponding translations to Finnish and Swedish, as well as back to English. What has happened is that the temporal modifier formed from the bare noun phrase next week and labelled nmod:tmod has not found a matching rule in the configurations. The translations of the resulting backup string are shown in brackets.

3.5 First results

The ud2gf algorithm and annotations are tested using the UD treebanks (v1.4)9. The training section of each treebank was used to develop the annotations, and the results are reported on the test section. We evaluated the performance in terms of coverage and interpretability of the GF trees derived from the translation. The coverage figures show the percentage of dependency nodes (or tokens) covered, and the interpretability figures show the percentage of nodes covered by "normal" categories, that is, categories other than the Backup category. The percentage of uninterpreted nodes is calculated from the number of nodes in the tree that use a Backup function to cover their children. Additionally, the GF trees can be translated back into strings using the concrete grammar, allowing for qualitative evaluation of the translations to the original and other languages.10 We performed experiments for three languages: English, Swedish and Finnish. Table 3.1 shows the scores for the experiments using the gold UD trees. Also shown is the number of trees (i.e. sentences) in the test set for each language. The results show an incomplete coverage, as some nodes are not yet covered even by the available Backup functions. Second, we see the impact of language-specific configurations (mostly defining helper categories for syncategorematic words) on the interpretability of GF trees. For example, in Swedish, just a small number of such categories (26) increases the coverage significantly. Further experiments also showed an average increase of 4-6 percentage points in interpretability scores when out-of-vocabulary words were handled using additional functions based on the part-of-speech tags; in other words,

9 https://github.com/UniversalDependencies/, retrieved in October 2016.
10 A quantitative evaluation would also be possible with standard machine translation metrics, but has not been done yet.

language   #trees   #confs   %cov'd   %int'd
English      2077       31       94       72
Finnish       648       12       92       61
Finnish*      648        0       74       55
Swedish      1219       26       91       65
Swedish*     1219        0       75       57

Table 3.1: Coverage of nodes in each test set (L-ud-test.conllu). L* (Swedish*, Finnish*) is with language-independent configurations only. #confs is the number of language-specific configurations. %cov'd and %int'd are the percentages of covered and interpreted nodes, respectively.

rule type             number
GF function (given)      346
GF category (given)      109
backup function           16
function config          128
category config           33
helper function          250
helper category*          26

Table 3.2: Estimating the size of the project: GF abstract syntax (as given in the resource grammar library) and its abstract and concrete configurations. Helper category definitions are the only genuinely language-dependent configurations, as they refer to lemmas.

Table 3.2 shows how much work was needed in the configurations. It shows the number of GF functions (excluding the lexical ones) and language-independent configurations. It reveals that there are many GF functions that are not yet reached by configurations, and which, once covered, would be likely to increase the proportion of interpreted nodes. The helper categories in Table 3.2, such as Copula, typically refer to lemmas. These categories, even though they can be used in language-independent helper rules, become actually usable only if the language-specific configuration gives ways to construct them. A high number of helper functions was needed to construct tensed verb phrases (VPs) covering all combinations of auxiliary verbs and negations in the three languages. This is not surprising given the different ways in which tenses are realized across languages. The extent to which these helper functions can be shared across languages depends on where the information is annotated in the UD tree and how uniform the annotations are; in English, Swedish, and Finnish, the compound tense systems are similar to each other, whereas the negation mechanisms are quite different.
Modal verbs outside the tense system were another major issue in gf2ud (Kolachina and Ranta, 2016), but this issue has an easier solution in ud2gf. In GF resource grammars, modal verbs are a special case of VP-complement verbs (VV), a category which also contains non-modal verbs. The complementation function ComplVV hence needs two configurations:

ComplVV : VV -> VP -> VP ; head xcomp

property          UD                GF
parser coverage   robust            brittle
parser speed      fast              slow
disambiguation    cont.-sensitive   context-free
semantics         loose             compositional
generation        ?                 accurate
new language      low-level work    high-level work

Table 3.3: Complementary strengths and weaknesses of GF and UD. UD strengths above the dividing line, GF strengths below.

ComplVV : VV -> VP -> VP ; aux head

The first configuration is valid for the cases where the VP complement is marked with the xcomp label (e.g. want to sleep). The second one covers the cases where the VP complement is treated as the head and the VV is labelled aux (e.g. must sleep). The choice of which verbs are modal is language-specific. For example, the verb want, marked as VV in GF, is non-modal in English but is translated in Swedish as an auxiliary verb, vilja. In gf2ud, modal verbs need non-local configurations, but in ud2gf, we handle them simply by using alternative configurations as shown above. Another discrepancy across languages was found in the treatment of progressive verb phrases (e.g. be reading, Finnish olla lukemassa). In English, the verb be is annotated as a child of the content verb with the aux label. In Finnish, however, the equivalent verb olla is marked as the head and the content verb as the child with the xcomp label. This is a case where the content word is not chosen as the head; the choice is instead more syntax-driven.

3.6 Conclusion

The main rationale for relating UD with GF is their complementary strengths. Generally speaking, UD's strengths lie in parsing and GF's in generation. UD pipelines are robust and fast at analyzing large texts. GF, on the other hand, allows for accurate generation in multiple languages, in addition to compositional semantics. This suggests pipelines where UD feeds GF.
In this paper, we have done preparatory work for such a pipeline. Most of the work can be done on a language-independent level of abstract syntax configurations. This currently brings us to around 70–75% coverage of nodes, which applies automatically to new languages. A handful of language-specific configurations (mostly for syncategorematic words) increases the coverage to 90–95%. The configuration notation is generic and can be applied to any GF grammar and dependency scheme.
Future work includes testing the pipeline in applications such as machine translation, abstractive summarization, logical form extraction, and treebank bootstrapping. A more theoretical line of work includes assessing the universality of current UD praxis following the ideas of Croft et al. (2017). In particular, their distinction between constructions and strategies seems to correspond to what we have implemented with shared vs. language-specific configurations, respectively. Situations where a shared rule would be possible but the treebanks diverge, such as the treatment of VP-complement verbs and progressives (Section 3.5), would deserve closer inspection. Also, an analysis of UD version 2, which became available in the course of the project, would be in order, with the expectation that the differences between languages decrease.

Acknowledgements We want to thank Joakim Nivre and the anonymous referees for helpful comments on the work. The project has been funded by the REMU project (Reliable Multilingual Digital Communication, Swedish Research Council 2012-5746).

Chapter 4

Paper III: Bootstrapping UD treebanks

Bootstrapping UD treebanks for Delexicalized Parsing
Kolachina Prasanth and Aarne Ranta
Under submission.


Abstract

Standard approaches to treebanking traditionally employ a waterfall model (Sommerville, 2010), where annotation guidelines steer the annotation process and insights from the annotation process in turn lead to subsequent changes in the annotation guidelines. This process remains a very expensive step in creating linguistic resources for a target language: it necessitates both linguistic expertise and manual effort to develop the annotations, and it is subject to inconsistencies in the annotation due to human error. In this paper, we propose an alternative approach to treebanking—one that requires writing grammars. This approach is motivated specifically in the context of Universal Dependencies, an effort to develop uniform and cross-lingually consistent treebanks across multiple languages.
We show here that a bootstrapping approach to treebanking via interlingual grammars is plausible and useful in a process where grammar engineering and treebanking are jointly pursued when creating resources for the target language. We demonstrate the usefulness of synthetic treebanks in the task of delexicalized parsing. Our experiments reveal that simple models for treebank generation are cheaper than human-annotated treebanks, especially at the lower ends of the learning curves for delexicalized parsing, which is relevant in particular in the context of low-resource languages.

4.1 Introduction

Treebanking remains a vital step in the process of creating linguistic resources for a language – a practice established over the last two to three decades (Marcus et al., 1994). The process of treebanking involves training human annotators in order to obtain high-quality annotations. This is a human-intensive and costly process in which multiple iterations are performed to refine the quality of the linguistic resource. Grammar engineering is a complementary approach to creating linguistic resources: one that requires a different kind of expertise and incurs comparable costs. These two approaches have remained orthogonal for obvious reasons: treebanks are primarily necessary to induce abstractions in NLU (Natural Language Understanding) models from data, while grammars are themselves abstractions arising from linguistic knowledge. Abstractions induced from data have proven useful for robust NLU tasks, while grammars are better at precision tasks involving NLG (Natural Language Generation).
Given the resources required for treebanking, synthetic treebanks have been proposed and used as substitutes in cross-lingual parsing for languages where treebanks do not exist. Such treebanks are created using parallel corpora, where parse trees in one language are bootstrapped into a target language using alignment information through annotation projection (McDonald et al., 2011, Tiedemann, 2014), or using machine translation systems to bootstrap existing treebanks in one or more source language(s) to the target language (Tiedemann and Agic, 2016, Tyers et al., 2018). More recently, synthetic treebanks have been generated for both real and artificial languages from multilingual treebanks by learning feasible parameter combinations (Wang and Eisner, 2016); Wang and Eisner (2018) show that such treebanks can be useful to select the most similar language to train a parsing model for an unknown language. At the same time, grammar-based treebanking approaches have been shown to work in monolingual setups—to derive rich linguistic representations defined by explicit grammars (Oepen et al., 2004). These approaches are carried out by parsing raw corpora with a target grammar and using an additional human disambiguation phase. Alternatively, existing treebanks are matched against the target grammar, further reducing the human effort in disambiguation; these approaches face a challenge of under-specification in the source treebanks (Angelov, 2011). In the current paper, we propose a hybrid of these two methods: we use abstract syntax grammars as the core linguistic abstraction to generate synthetic treebanks from a grammar that can be translated to target representations with high precision.
The question of annotation costs and ways to minimize the dependence on such annotated corpora has been a recurring theme in the field for the last two decades (Ngai and Yarowsky, 2000, Garrette and Baldridge, 2013). This question has also been extensively addressed in the context of dependency treebanks. We revisit this question in the context of Universal Dependencies and recent work on the interplay between interlingua grammars and multilingual dependency trees in this scheme (Kolachina and Ranta, 2016, Ranta and Kolachina, 2017, Ranta et al., 2017). The use of interlingua grammars to bootstrap dependency treebanks guarantees two types of consistency: multilingual treebank consistency and intra-treebank consistency.
We study the efficacy of these dependency treebanks using learning curves of a transition-based parser in a delexicalized parsing setup. The delexicalized parsing setup allows for the generation of parallel UD treebanks in multiple languages with minimal prerequisites on language-specific knowledge. Another rationale behind the current work in the context of cross-lingual parsing is that while synthetic treebanks offer a "cheap" alternative, the signal for the target language is limited by the quality of the MT system.

Figure 4.1: Abstract syntax of a GF grammar and its specification for the UD scheme. Also shown is an example AST for the sentence the black cat sees us today. Any function with a definition written as f : C1 → C2 → ... → Cn → C can be rewritten as a context-free rule f. C ::= C1 C2 ... Cn.

On the other hand, interlingua grammars provide a high-quality signal about the target language. High quality here refers to accurate generation of word order and morphology using interlingual grammars – although lexical selection in translation is still a problem. To our knowledge, there have been no previous attempts in cross-lingual parsing to study the effect of such high-precision signals. This paper is structured as follows: Section 4.2 gives the relevant background on interlingua grammars and the algorithm used to generate UD trees given a treebank derived from an interlingua grammar. Section 4.3 describes our algorithm to bootstrap treebanks for a given interlingua grammar and parallel UD treebanks from them, along with an intrinsic evaluation of these bootstrapped UD treebanks. Section 4.4 shows the parsing setup we use and Section 4.5 details the results of the parsing experiments.

4.2 Grammatical Framework

Grammatical Framework (GF) is a multilingual grammar formalism using abstract syntax trees (ASTs) as primary descriptions (Ranta, 2011). Originating in compilers, an AST is a tectogrammatical tree representation that can be shared between languages. A GF grammar consists of two parts – an abstract syntax shared between languages and a concrete syntax defined for each language. The abstract syntax defines a set of categories and a set of functions, as shown in Figure 4.1. The functions defined in the abstract syntax specify the result of putting together subparts of two categories, and the concrete syntax specifies how the subparts are combined, i.e. word-order preferences and agreement constraints specific to the language.
A comprehensive implementation of a multilingual grammar in GF is the Resource Grammar Library, GF-RGL (Ranta, 2009b), which currently has concrete syntaxes for over 40 languages, ranging from Indo-European through Finno-Ugric and Semitic to East Asian languages.1 This implementation contains a full implementation of the morphology of each language, and a set of 284 syntactic constructors that correspond to the core syntax of the language. Also included is a small lexicon of 500 lexical concepts from a set of 19 categories, of which 10 correspond to different sub-categorization frames of verbs, in addition to classes of nouns and adjectives. These grammars are reversible, i.e. they can be used for parsing as well as simultaneous generation into multiple languages. The concrete syntaxes for all the languages define the rules for these syntactic constructors and the lexical concepts.
The expressivity of these grammars is equivalent to PMCFG (Seki et al., 1991), which makes the parsing complexity of this formalism polynomial in sentence length. Polynomial parsing with high exponents can still be too slow for many tasks, and it is also brittle if the grammars are designed not to over-generate. But generation using GF grammars has been shown to be both precise and fast, which suggests the idea of combining data-driven parsing with grammar-driven generation. We refer the interested reader to Ljunglöf (2004) for a discussion on the expressivity of this formalism and to Angelov (2011), Angelov and Ljunglöf (2014) for a discussion on probabilistic parsing using GF grammars.

4.2.1 gf2ud

Kolachina and Ranta (2016) propose an algorithm to translate ASTs to dependency trees, which takes a specification of the abstract syntax of the GF grammar (referred to as configurations, see Figure 4.1) describing the mapping between the grammar and a target dependency scheme, in this case Universal Dependencies. These configurations can be interpreted as a synchronous grammar with the abstract syntax as source and the dependency scheme as target. The first step in this transducer is a recursive annotation that marks, for each function in the AST, one of the arguments as head and specifies labels for the other arguments, as given by the configuration. The algorithm to extract the resulting dependency tree from the annotated AST is simple:

• for each leaf X (which corresponds to a lexical item) in the AST
  – trace the path up towards the root until you encounter a label L
  – from the node immediately above L, follow the spine (the unlabeled branches) down to another leaf Y
  – Y is the head of X with label L

A sketch of this extraction step is given at the end of this subsection. At the end of these two steps, the resulting data structure is an abstract dependency tree (ADT, shown in Figure 4.2). It should be noted that the order of nodes shown in the ADT does not reflect the surface order that is specific to a language. The ADT, combined with the concrete syntax of a language and concrete configurations (when necessary), results in the corresponding full UD tree. The concrete configurations are necessary to provide appropriate labels to syncategorematic words like auxiliary verbs and negation particles. Additionally, the category configuration on the abstract syntax can be augmented with language-specific category configurations to generate the morphological features in the dependency tree with a desired tag set.
Kolachina and Ranta (2016) show that their method can be used to generate partially labeled UD trees for 30 languages when the corresponding concrete syntax is available. They also show that, using configurations defined on the abstract syntax alone and depending on the availability of the concrete syntax, a large fraction (around 75–85% of edges) of the dependency treebanks can be generated automatically.

1 The current status of GF-RGL can be seen at http://www.grammaticalframework.org/lib/doc/synopsis.html, which also gives access to the source code.

[Figure 4.2 here: an ADT with the arcs root, nsubj, obj, det, amod, mod over the nodes see_V2, cat_CN, the_Det, black_AP, we_Pron, today_Adv, with the categories V2, CN, Det, AP, Pron, Adv shown below the nodes.]

Figure 4.2: ADT for the sentence the black cat sees us today. The nodes in the ADT correspond to lexical functions defined in the grammar. Also shown is the UD part-of-speech tag sequence. Note that the order of nodes does not reflect the surface order in any particular language.

This is done with small treebanks of ASTs – a UD treebank of 104 ASTs and a GF treebank of 400 ASTs. Their results show that parallel UD treebanks can be bootstrapped using ASTs and interlingua grammars; the usefulness of such treebanks, however, is not addressed in that work. Full UD treebanks can be generated when concrete configurations (those addressing syncategorematic words) are additionally available for the language.

4.3 Bootstrapping AST and UD treebanks

The abstract syntax component of a GF grammar is an algebraic datatype definition, which can also be seen as a context-free grammar (CFG). The disambiguation model defined in GF uses a context-free probability distribution defined on the abstract syntax. The advantage of defining the distribution on the abstract syntax is that it allows for the transfer of the distribution to languages for which GF treebanks do not exist. The context-free distribution decomposes the probability estimate of a tree as the product of the probabilities of the subtrees and the probability of the function applied to these subtrees (rendered schematically below). The probabilistic abstract syntax grammar can therefore be defined in terms analogous to a probabilistic CFG (PCFG). The probability distribution over the set of categories in the grammar is also included in the distribution corresponding to the abstract syntax. We use this formulation as a starting point and generate ASTs for a given grammar.
The ASTs bootstrapped using the probability model defined above are correct in terms of the grammar but do not follow the selectional preferences typically found in language. For this reason, we refer to the bootstrapped treebanks as "synthetic" data. Additionally, while the algorithm used to bootstrap ASTs does not change depending on whether the grammar includes a lexicon or not, its speed varies significantly with the size of the grammar. Stacking gf2ud defined using abstract configurations on top of these bootstrapped ASTs results in a treebank of ADTs. Alternatively, the concrete syntax of a language can straightforwardly be used to linearize a corpus in the target language. The concrete syntax and the concrete configurations, when available, are used to generate fully labelled UD treebanks for a target language. Figure 4.3 shows an example of a synthetic AST and a delexicalized UD tree bootstrapped using the RGL.

(a) An AST of an existential clause bootstrapped using our model.

[Panel (b) here: the ADT with the arcs root, conj, cc, obl over the nodes nothing_NP, or_Conj, nobody_NP, saturday_Weekday, tagged PRON, CCONJ, PRON, NOUN.]

(b) ADT corresponding to the above example that has to be delexicalized.

[Panel (c) here: the delexicalized UD tree with the arcs root, expl, nsubj, cc, conj, amod, nmod over the tag sequence PRON VERB PRON CCONJ PRON ADJ NOUN.]

(c) The delexicalized UD tree: English and Swedish share the same part-of-speech tag sequence and dependency labels.

Figure 4.3: Example of a bootstrapped AST and UD tree and the intermediate ADT.

The bootstrapping algorithm uses a parameter corresponding to the maximum depth d of the trees to be generated. The generative story, sketched in code below, is as follows:

• Pick a category C using the distribution over categories defined in the probability model.
• Select a function f with the definition C1 → C2 → ... → Cn → C according to the conditional distribution P(f|C).
• Recursively apply the same step to build subtrees t1, t2, ..., tn of categories C1, C2, ..., Cn, each of maximum depth d − 1.
• Apply the function to all the subtrees and return (f t1 t2 ... tn).

4.3.1 Differences against UDv2

The design of the RGL and the corresponding configurations do not cover all of the structures defined in the UD annotation scheme. The missing structures fall into two major categories: labels that depend on the lexical realization in a specific language, and structures that correspond to specific linguistic constructions that are not part of the core RGL syntax. Examples of the first type include multi-word expressions and proper nouns (labeled using the fixed and flat labels). In the second class are ellipsis and paratactic constructions, in addition to labels that are used in robust analysis of web text (orphan, goeswith and reparandum). Examples that cover these labels can be generated by re-writing the grammar; however, we found very few instances of these in the treebanks. Finally, another variation in the bootstrapped treebanks is in the case of label subtypes that are optionally defined in a language-specific manner. While the configurations allow for accurate generation of certain labels (e.g. obl:agent in the case of passive agents), recovering similar information in other instances is not possible without a significant redesign of the RGL (e.g. obl:tmod for temporal modifiers). We address this issue by restricting gf2ud to generate only the core labels in UD and to ignore subtype labels uniformly across languages.
Table 4.1 shows the entropies of the conditional probability distribution defined as the probability of a UD label given the part-of-speech tag of the head. The distributions are estimated on both the synthetic UD treebanks and human-annotated UD treebanks.2 Also shown in the table are the cross-entropy values between the distributions estimated from the synthetic and the original treebanks.

4.4 UD Parsing

The bootstrapped UD treebanks are used to train delexicalized parsing models. We choose to work with delexicalized UD treebanks for two reasons: the context-free assumption in the probabilistic model defined on the abstract syntax makes the tree generation decomposable, but selectional preferences are not encoded in the generative model used for bootstrapping the ASTs. Additionally, generating a full UD treebank assumes the availability of an interlingual lexicon – which reduces the portability of this approach to new languages.3 For both these reasons, we restrict ourselves to strictly delexicalized UD treebanks in our parsing experiments.

2 The UD treebanks are taken from the v2.3 distribution.
3 There is ongoing work on developing interlingual lexica from linked data like WordNet (Virk et al., 2014, Angelov and Lobanov, 2016).

Language                H(PUD)   H(PGF)   Cross-entropy
Afrikaans                39.59    58.34    63.12
Arabic                   40.00    42.13    51.38
Basque                   44.19    51.19    54.21
Bulgarian                32.09    53.76    61.23
Catalan                  44.49    49.37    57.39
Chinese                  39.25    42.10    59.76
Danish                   44.85    55.28    63.39
Dutch                    48.99    49.67    61.27
English                  50.52    45.31    58.17
Estonian                 39.45    43.82    49.35
Finnish                  47.86    41.52    54.39
French                   43.41    49.43    53.47
German                   41.35    49.35    51.29
Greek                    29.48    41.13    49.17
Hindi                    32.99    43.18    54.27
Italian                  38.55    51.37    59.64
Japanese                 27.34    40.18    47.25
Latin                    42.07    43.47    49.89
Latvian                  49.75    49.91    59.26
Norwegian (bokmal)       40.29    45.97    53.17
Norwegian (nynorsk)      37.29    44.56    56.32
Persian                  33.07    47.29    47.16
Polish                   23.85    41.27    49.83
Portuguese               40.84    48.73    53.60
Romanian                 47.31    52.31    57.12
Russian                  39.14    47.92    52.84
Spanish                  46.36    52.17    57.73
Swedish                  35.36    47.41    51.39
Urdu                     33.70    42.14    58.73
Icelandic                  N/A    51.26      N/A
Thai                       N/A    41.23      N/A

Table 4.1: Entropy values of the probability distributions P(label | head-pos) for different languages, estimated from real (PUD) and bootstrapped (PGF) treebanks. If a language has more than one treebank in the UD distribution, we select one treebank as the primary treebank and use it both to estimate the distribution and in the parsing experiments. Languages for which a UD treebank does not exist but which are included in GF-RGL are listed towards the bottom of the table.

We are interested in the following three use-cases depending on the size of the training data (N) available for inducing parsing models.

• When N ≤ 1K sentences4 are available for a language. There are around 20 treebanks in the current UD distribution that match this criterion, and almost all of these treebanks have been manually annotated from scratch. This corresponds to the scenario of under-resourced languages, where either monolingual corpora for treebanking or annotators are scarce. This scenario corresponds most strongly to our proposed idea of simultaneous grammar engineering and treebanking.

• When 1K ≤ N ≤ 5K sentences5 are available for a language. There are around 18 treebanks in the current UD distribution that match this criterion. While one can argue that these languages are not really under-resourced, this setup matches the typical case of training domain-specific parsers, for instance for bio-medical or legal texts.

• The case where treebanks are larger than in either of the two previous scenarios, N ≥ 5K. This setup is interesting for testing the limits of how useful bootstrapped ASTs and UD treebanks are for training parsing models.

For each of these use-cases, we train parsing models using data from both human-annotated UD treebanks and synthetic treebanks for different sizes of training data. The resulting parsing models are evaluated using labelled attachment scores, obtained by parsing the test set of the UD treebank for the language in question. We experiment with an off-the-shelf transition-based dependency parser that gives good results in the dependency parsing task (Straka and Straková, 2017). Ideally, the experiments would be carried out using multiple parsers from both the transition-based and graph-based paradigms; we leave that for future work.

4.5 Experiments

We ran experiments with three languages – English, Swedish and Finnish. In addition to the availability of a concrete syntax for the language, our approach also requires concrete configurations for the languages (Ranta and Kolachina, 2017) in order to bootstrap full UD trees. Table 4.2 shows statistics about the concrete configurations of the RGL grammar for these languages. The probability distribution defined on the RGL was estimated using the GF-Penn treebank (Marcus et al., 1994, Angelov, 2011) of English. This raises another question: how well does the distribution defined on the abstract syntax of the RGL, estimated from monolingual data of one language, transfer to other languages? The bootstrapping algorithm was restricted to generate 20K ASTs of depth less than 10.6
We use UDPipe (Straka and Straková, 2017) to train parsing models, using settings comparable to the baseline systems provided in the CoNLL18 shared task (Zeman and Hajič, 2018). Gold tokenization and part-of-speech tags are used in both training and testing the parser. This was done to control for differences in tagging performance across the synthetic and original UD treebanks.

4 This approximately corresponds to 20K tokens.
5 This approximately corresponds to 20K – 100K tokens.
6 Trees of depth less than 4 were filtered out in the process.

Language   Abstract   Concrete   Morph-features
English         143         21               57
Swedish         143         25               59
Finnish         143         31               57

Table 4.2: Estimate of the effort required in gf2ud. The abstract configurations are the same for all languages, while the concrete functions and morph-features are defined for each language. The first column corresponds to configurations for syntactic constructors in the RGL, and the second column to constructors that use syncategorematic words in the linearization.

The models are trained using the primary treebanks from the Universal Dependencies v2.3 distribution.7 Figure 4.4 plots the learning curves for parsing models trained on both the original and synthetic treebank data for each use case outlined in Section 4.4. The learning curves were plotted using the LAS accuracies obtained on the test sets for the three languages, using models trained on both the original and the synthetic treebanks. It is seen from the learning curves that models trained on the synthetic treebanks do not outperform the models trained using the original UD treebanks. However, the full learning curves shown in Figure 4.4 do not tell the complete story.
Figure 4.5 shows the learning curves (visualized using bar plots) for English, Finnish and Swedish in the setup where less than 1K sentences from the UD treebanks are used. It is clear from the plots for all three languages that the synthetic treebanks are sub-optimal when directly compared against real treebanks of the same size. However, what is interesting is that parsing models in this range (i.e. N ≤ 1K) with synthetic treebanks quickly reach accuracies comparable to using real treebank data, with an approximate effective data coefficient of 2.0. In other words, comparable accuracies can be obtained using roughly twice the amount of synthetic data, generated for free by the abstract syntax grammar.
It is interesting to note that the learning curves using the synthetic data for the English parsing models become comparatively flat in our setup with less than 5K sentences (shown in Figure 4.6a). Despite the lower improvements with increasing treebank sizes, there is still a consistent improvement in parsing accuracies, with the best accuracy of 65.4 LAS using 10K synthetic samples (shown in Figure 4.6b). This pattern is consistent across Swedish and Finnish, which allows us to draw the conclusion that while the effective data coefficient is smaller, the synthetic treebanks are still useful for improving parsing accuracies.

4.6 Related Work

The current trend in dependency parsing is directed towards using synthetic treebanks in an attempt to cover unknown languages for which resources are minimal or do not exist altogether. Such treebanks rely on various auxiliary resources: parallel corpora (Tiedemann, 2014), multilingual word embeddings (Xiao and Guo, 2014), an MT system for the target language (Tiedemann and Agic, 2016, Tyers et al., 2018) or, more minimally, tagged corpora in the target language (Wang and Eisner, 2018).

7 The notion of a primary treebank for a language has been made obsolete in the UD v2.3 distribution, with all treebanks being assigned a code. We therefore use the term primary in this paper to refer to EWT for English, TDT for Finnish and Talbanken for Swedish.

Tiedemann and Agic (2016) propose a method to generate synthetic treebanks for new languages, using machine translation systems to transfer cross-linguistic information from resource-rich languages to under-resourced languages. This work builds on top of many previous approaches to cross-lingual parsing using parallel corpora and multilingual word embeddings. The synthetic treebanks generated in the current work are different in two ways:
• we assume that a multilingual abstraction and the concrete syntaxes are available – namely the GF-RGL – to generate language-independent samples in the form of ASTs.
• we also assume that a distribution over the target language is not available; what is available is a distribution on the abstract syntax that generalizes to other languages.
Hence, the resulting treebank is licensed by a grammar, and high-precision cross-linguistic information is specified, but the distribution over the resulting treebank is different from the distribution obtained using real treebanks. An alternative to this method of bootstrapping UD treebanks is to use ud2gf (Ranta and Kolachina, 2017) as a way to translate existing UD treebanks into GF treebanks that are licensed by a grammar.
The current work also relates to more recent work on data augmentation for dependency parsing (Sahin and Steedman, 2018) and more generally in NLP (Sennrich et al., 2016). The augmentation methods are designed to address data scarcity by exploiting monolingual corpora or generating synthetic samples in multilingual applications. However, the underlying abstractions used to generate the synthetic data are induced from auxiliary corpora. Jonson (2006) shows that synthetic corpora generated using a GF grammar can be used to build language models for speech recognition. Experiments in that work show that synthetic in-domain examples generated using the grammar, when combined with large out-of-domain data, result in a significant reduction in the word error rate of the speech recognizer. This work falls in line with similar approaches that combine corpus-driven methods with rule-based systems (Bangalore and Johnston, 2004), as a way to combine the statistical information available from corpora with the good coverage resulting from rule-based abstractions, especially when working with restricted domains. In this paper, we restrict ourselves to utilizing synthetic treebanks for parsing, and leave the discussion of ways to combine synthetic treebanks with real treebanks as future work. This choice is primarily motivated by our interest in grammar-based development of dependency treebanks, as opposed to the traditional way of treebanking – by training human annotators.

4.7 Conclusions

In the current paper, we propose an alternative approach to cross-lingual treebanking—one that recommends grammar engineering. Multilingual abstractions that facilitate bootstrapping of cross-lingual treebanks have previously been explored in the setting of low-precision, high-recall methods. These methods presume the availability of different resources in order to induce the cross-linguistic signal – parallel or multilingual corpora, word embeddings etc. Our approach explores the opposite direction: multilingual grammars of high precision are used to bootstrap parallel treebanks. While these multilingual grammars are not easy to develop, the question of how useful such grammars are has remained largely unexplored in the context of cross-lingual syntactic parsing.
We use a context-free model to generate ASTs that are used to bootstrap parallel UD treebanks in 3 languages. Experiments in delexicalized parsing show that these treebanks are useful in two scenarios – when data in the target language is minimal (<1K sentences) or small (<5K sentences). In the future, we intend to look at ways to generate synthetic treebanks from existing UD treebanks using ud2gf (Ranta and Kolachina, 2017), which aims to address the lack of syntactic distributions in our synthetic treebanks. We also did not pursue the obvious direction of combining the real and synthetic treebanks in the current work; we leave this for future work. Another direction of interest is to augment existing treebanks with syntactic variations to quantify the need for regular syntactic variants in parser development, such as converting declaratives to questions, varying tense and polarity, adding and removing modifiers, and so on. String-based augmentation (as opposed to precise grammar-based generation) in this direction has already shown promising results (Sahin and Steedman, 2018).

(a) Learning curves for English

(b) Learning curves for Finnish

(c) Learning curves for Swedish

Figure 4.4: Learning curves for parsing models trained on original UD and synthetic UD treebanks.

(a) Learning curves for English

(b) Learning curves for Finnish

(c) Learning curves for Swedish

Figure 4.5: Learning curves shown using bar plots for parsing models trained on less than 1000 sentences from original UD and 2000 sentences from synthetic UD treebanks.

(a) Learning curves for English with N between 1K and 5K samples

(b) Learning curves for English with N between 5K and 10K samples

Figure 4.6: Learning curves shown using bar plots for parsing models of English.

Chapter 5

Paper IV: OOV words in Dependency Parsing

Replacing OOV Words For Dependency Parsing With Distributional Semantics
Kolachina Prasanth and Martin Riedl and Chris Biemann
Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 11–19.


Abstract

Lexical information is an important feature in syntactic processing like part-of-speech (POS) tagging and dependency parsing. However, no such information is available for out-of-vocabulary (OOV) words, which causes many classification errors. We propose to replace OOV words with in-vocabulary words that are semantically similar, according to distributionally similar words computed from a large background corpus, or morphologically similar, according to common suffixes. We show performance differences both for count-based and dense neural vector-based semantic models. Further, we discuss the interplay of POS and lexical information for dependency parsing and provide a detailed analysis and discussion of the results: while we observe significant improvements for count-based methods, neural vectors do not increase the overall accuracy.

5.1 Introduction

Due to the high expense of creating treebanks, there is a notorious scarcity of training data for dependency parsing. The quality of dependency parsing crucially hinges on the quality of part-of-speech (POS) tagging as a preprocessing step; many dependency parsers also utilize lexicalized information, which is only available for the training vocabulary. Thus, errors in dependency parsers often relate to OOV (out-of-vocabulary, i.e. not seen in the training data) words. While there has been a considerable amount of work addressing the OOV problem with continuous word representations (see Section 5.2), this requires a more complex model and hence increases training and execution complexity.
In this paper, we present a very simple yet effective way of alleviating the OOV problem to some extent: we use two flavors of distributional similarity, computed on a large background corpus, to replace OOV words in the input with semantically or morphologically similar words that have been seen in training, and project parse labels back to the original sequence. If we succeed in replacing OOV words with in-vocabulary words of the same syntactic behavior, we expect the tagging and parsing process to be less prone to errors caused by the absence of lexical information. We show consistent significant improvements both in POS tagging accuracy and in Labeled Attachment Scores (LAS) for graph-based semantic similarities. The successful strategies mostly improve POS accuracy on open-class words, which results in better dependency parses. Beyond improving POS tagging, the strategy also contributes to parsing accuracy. Through extensive experiments – we show results for seven different languages – we are able to recommend one particular strategy in the conclusion and show the impact of using different similarity sources. Since our method manipulates the input data rather than the model, it can be used with any existing dependency parser without re-training, which makes it very applicable in existing environments.

5.2 Related Work

While part-of-speech (POS) tags play a major role in detecting syntactic structure, it is well known (Kaplan and Bresnan (1982), inter alia) that lexical information helps parsing in general and dependency parsing in particular, see e.g. Wang et al. (2005). In order to transfer lexical knowledge from the training data to unseen words in the test data, Koo et al. (2008) improve dependency parsing with features based on Brown clusters (Brown et al., 1992), which are known to draw syntactic-semantic distinctions. Bansal et al. (2014) show slight improvements over Koo et al. (2008)'s method by tailoring word embeddings to dependency parsing, inducing them on syntactic contexts, which presupposes the existence of a dependency parser. In a more principled fashion, Socher et al. (2013) directly operate on vector representations. Chen et al. (2014) address the lexical gap by generalizing over OOV and other words in a feature role via feature embeddings. Another approach for replacing OOV words by known ones using word embeddings is introduced by Andreas and Klein (2014). All these approaches, however, require re-training the parser with these additional features and make the model more complex. We present a much simpler setup of replacing OOV words with similar words from the training set, which allows retrofitting any parser with our method.

This work is related to Biemann and Riedl (2013), where the OOV performance of fine-grained POS tagging was improved in a similar fashion. Another work similar to ours is proposed by Huang et al. (2014), who replace OOV named entities with named entities from the same (fine-grained) class for improving Chinese dependency parsing; this largely depends on the quality of the employed NER tagger and is restricted to named entities only. In contrast, we operate on all OOV words, and try to improve prediction on coarse universal POS classes and universal dependencies. On a related note, successful applications of OOV replacement have been demonstrated for Machine Translation (Gangadharaiah et al., 2010, Zhang et al., 2012).

5.3 Methodology

For replacing OOV words we propose three strategies: replacing OOV words by the most similar ones using distributional semantic methods, replacing OOV words with words sharing the longest common suffix, and replacing OOV words either before or after POS tagging to observe the effect on dependency parsing. The influence of all components is evaluated separately for POS tagging and dependency parsing in Section 5.5.

5.3.1 Semantic Similarities

In order to replace an OOV word by a similar in-vocabulary word, we use models based on the distributional hypothesis (Harris, 1951). For showing the impact of different models, we use a graph-based approach that takes the left- and right-neighboring words as context, represented by the method proposed by Biemann and Riedl (2013) and called a distributional thesaurus (DT). Furthermore, we apply two dense numeric vector-space approaches, using the skip-gram model (SKG) and the CBOW model of the implementation of Mikolov et al. (2013).

5.3.2 Suffix Source

In addition, we explore replacing OOVs with words from the similarity source that are contained in the training set and share the longest suffix. This can be beneficial, as suffixes reflect morphological markers and carry word-class information in many languages. The assumption here is that for syntactic dependencies, it is more crucial that the replacement comes from the same word class than that it preserves the meaning. This also serves as a comparison to gauge the benefits of the similarity source alone. Below, these experiments are marked with suffix, whereas the highest-ranked replacement from the similarity sources is marked as sim. As a suffix-only baseline, we replace each OOV with its most suffix-similar word from the training data, irrespective of its distributional similarity. This serves as a sanity check whether semantic similarities are helpful at all. A code sketch of the three strategies is given below.

5.3.3 Replacement Strategies regarding POS

We explore two different settings for dependency parsing that differ in the use of POS tags:
(1) oTAG: POS-tag the original sequence, then replace OOV words, retaining the original tags for parsing;

(2) reTAG: replace OOV words, then POS-tag the new sequence and use the new tags for parsing.
The oTAG experiments primarily quantify the sensitivity of the parsing model to word forms, whereas reTAG assesses the potential improvements in POS tagging. Both settings are sketched below.

5.3.4 Replacement Example

As an example, consider the automatically POS-tagged input sentence "We/P went/V to/P the/D aquatic/N park/N", where "aquatic" is an OOV word. Strategy oTAG sim replaces "aquatic" with "marine", since it is the most similar in-vocabulary word to "aquatic". Strategy oTAG suffix replaces it with "exotic", because of the suffix "tic" and its similarity with "aquatic". The suffix-only baseline would replace it with "automatic", since it shares the longest suffix of all in-vocabulary words. The reTAG strategy would then re-tag the sentence, so the parser will e.g. operate on "We/P went/V to/P the/D marine/ADJ park/N". Table 5.1 shows examples of the different similarity-based strategies for English and German.1 We observe that the sim strategy returns semantically similar words that do not necessarily have the same syntactic function as the OOV target.

method        sim           sim & suffix
English OOV: upgraded
Suffix-only   –             paraded
CBOW          upgrade       downloaded
SKG           upgrade       expanded
DT            expanded      updated
German OOV: Nachtzeit
Suffix-only   –             Pachtzeit
CBOW          tagsüber      Ruhezeit
SKG           tagsüber      Echtzeit
DT            Jahreswende   Zeit

Table 5.1: Replacements produced by the different methods and strategies.

5.4 Experimental Settings

Here we describe the methods, the background corpora used for computing similarities, and all further tools used for the experiments. With our experiments, we aim to address the following research questions:
• Can syntactic processing benefit from OOV replacement, and if so, under which strategies and conditions?
• Is there a qualitative difference between similarity sources with respect to tagger/parser performance?
• Are there differences in the sensitivity of parsing inference methods to OOV replacement?

1 Translations: Nachtzeit = night time; tagsüber = during the day; Pachtzeit = length of lease; Ruhezeit = downtime; Echtzeit = real time; Jahreswende = turn of the year.

5.4.1 Similarity Computations

We use two different approaches to determine semantic similarity: a symbolic, graph-based framework for distributional similarity and a neural language model that encodes words in a dense vector space.

Graph-based Semantic Similarity

The computation of a corpus-based distributional thesaurus (marked as DT below) is performed following the approach of Biemann and Riedl (2013), as implemented in the JoBimText2 software. For computing similarities between words from large unlabeled corpora, we extract as word contexts the left and right neighboring words, without using language-specific syntactic preprocessing. Words are more similar if they share more of their 1000 most salient context features, where salient context features are ranked by Lexicographer's Mutual Information (LMI) (Evert, 2005). Word similarity in the DT is defined as the count of overlapping salient context features. In addition, we prune similar words3 below a similarity threshold of 5. In order to use such a DT to replace an OOV word, we look up the most similar terms for the OOV word and choose the highest-ranked word from the training data vocabulary, respectively the most similar word with the longest common suffix.

Neural Semantic Similarity

As an alternative similarity, we run word2vec with default parameters (marked as w2v below) (Mikolov et al., 2013) on our background corpora, obtaining 200-dimensional dense vector embeddings for all words with a corpus frequency larger than 5. We do this for both flavors of w2v: skip-gram, marked as SKG below (based on positional windows), and CBOW (based on bag-of-words sentential contexts). Following the standard approach, we use the cosine between word vectors as a similarity measure: for each OOV, we compare vectors of all words in the training set and pick the word corresponding to the most similar vector as a replacement, respectively the most similar word of those with the longest common suffix.

5.4.2 Corpora for Similarity Computation As we perform the experiments on various languages, we will compute similarities for each language separately. The English similarities are computed based on 105M sentences from the Leipzig corpora collection (LCC) (Richter et al., 2006). The German (70M) and the Hindi (2M) corpora are extracted from the LCC as well. We compute similarities on 19.7M sentences of Arabic, 259.7M sentences of French and 128.1M sentences of Spanish extracted from web corpora4 provided by Schäfer and Bildhauer(2013). For the computation of the Swedish similarities we use a 60M- sentence news corpus from Spraakbanken.5 In summary, all background corpora are in the order of about 1 Gigaword, except the Hindi corpus, which is considerably smaller.

2 http://www.jobimtext.org
3 We tried a few thresholds in preliminary experiments and did not find the results to be very sensitive in the range of 2 – 20.
4 http://corporafromtheweb.org/
5 http://spraakbanken.gu.se

5.4.3 Dependency Parser and POS Tagger

For dependency parsing, we use the implementation of the graph-based dependency parser provided in Mate-tools (Bohnet, 2010, version 3.6) and the transition-based Malt parser (Nivre, 2009, version 1.8.1). Graph-based parsers use global inference to construct the maximum spanning dependency tree for the input sequences. In contrast, the greedy algorithm in the transition-based parser uses local inference to predict the dependency tree. The parsing models for both parsers, Mate-tools and Malt parser, are optimized using cross-validation on the training section of the treebank.6 We train the dependency parsers using POS tags (from the Mate-tools tagger) predicted using 5-fold cross-validation. The evaluation of parser accuracies is carried out using MaltEval. We report labeled attachment score (LAS) both overall and on OOV token positions.

5.4.4 Treebanks

For training and testing we use the treebanks (train/dev/test sizes in tokens in parentheses) from the Universal Dependencies project (Nivre et al., 2016, version 1.2, released November 15th, 2015) for Arabic (226K/28K/28K), English (205K/25K/25K), French (356K/39K/7K), German (270K/12K/16K), Hindi (281K/35K/35K), Spanish (383K/41K/8K), and Swedish (67K/10K/20K). Tagset definitions are available online.7

5.5 Results

In this section, we report experimental results and compare them to the baseline without OOV replacement. All statistical significance tests are done using McNemar's test. Significant improvements (p < 0.05) over the baseline without OOV replacement are marked with an asterisk (∗), significant performance drops with a hash mark (#), and the best result per experiment is marked in bold.

5.5.1 Results for POS Tagging

In Table 5.2 we show overall and OOV-only POS tagging accuracies on the respective test sets for seven languages, using similarities extracted from the DT. Unsurprisingly, we observe consistent performance drops, mostly significant, for the suffix-only baseline. For all languages except German, the DT-based replacement strategies result in significant improvements of either overall accuracy, OOV accuracy or both. In most experiments, the DT suffix replacement strategy scores slightly higher than the DT sim strategy.
Table 5.3 lists POS accuracies for three languages with similarities from the w2v neural language model in its SKG and CBOW flavors, using the cosine similarity. In contrast to the DT-based replacements, there are no improvements over the baseline, and some performance drops are even significant. Also, replacing the cosine similarity with the Euclidean distance did not change this observation.

6 Using MaltOptimizer (Ballesteros and Nivre, 2014) for the Malt parser; for Mate-tools, we tuned the parameter that represents the percentage of non-projective edges in a language, which matches the parameters suggested by Bohnet (2010).
7 http://universaldependencies.org/

           OOV%   baseline        suffix-only      DT sim           DT suffix
LANG              all     OOV     all      OOV     all      OOV     all      OOV
Arabic     10.3   98.53   94.01   97.82#   87.44#  98.49#   93.67#  98.52    93.91
English     8.0   93.43   75.39   93.09#   72.03#  93.82*   78.67*  93.61*   76.75
French      5.3   95.47   83.29   95.17#   78.30#  95.68*   86.28*  95.73*   86.78*
German     11.5   91.92   85.63   90.88#   77.70#  91.84    85.32   91.92    85.68
Hindi       4.4   95.35   76.41   95.07#   71.27#  95.41    77.57   95.44*   78.00*
Spanish     6.9   94.82   79.62   95.00    81.17   95.45*   86.36*  95.49*   85.84*
Swedish    14.3   95.34   89.80   94.78#   86.04#  95.57*   90.88*  95.82*   92.40*

Table 5.2: Test-set overall OOV rates and POS accuracies in % for the baseline, the suffix-only baseline, and the DT similarity and suffix replacement strategies for seven languages.

           SKG                              CBOW
           sim             suffix           sim              suffix
LANG       all     OOV     all     OOV     all      OOV     all      OOV
Arabic     98.46#  93.39#  98.50#  93.73#  98.48#   93.60#  98.52    93.94
English    93.10#  72.29#  93.57   76.31   93.24#   73.91   93.52    75.70
German     90.99#  77.65#  91.62#  83.61#  91.78    83.92#  91.91    85.43

Table 5.3: Test-set POS accuracies for the w2v-based models' similarity and suffix replacement strategies for three languages.

The suffix-based strategy seems to work better than the similarity-based strategy also for the w2v-based replacement. Overall, count-based similarities seem to perform better for the replacement. Thus, we did not extend the experiments with w2v to other languages.

5.5.2 Results for Dependency Parsing

As a general trend for all languages (see Table 5.4), we observe that the graph-based parser achieves higher LAS scores than the transition-based parser. However, the optimal replacement strategy depends on the language for both parsers. Only for Swedish (reTAG DT suffix) and Spanish (reTAG DT sim) do the same replacements yield the highest scores both on all words and on OOV words for both parsers. Using the modified POS tags (reTAG) results in improvements for the transition-based parser for 4 languages and for the graph-based parser for 5 languages. Whereas the results improve only marginally when using the reTAG strategy, as can be observed from Table 5.4, most improvements are significant. Using word embeddings with the reTAG strategy (see Table 5.5), we again observe performance drops, except for Arabic. Following the oTAG strategy, we observe significant improvements on German and Arabic for the CBOW method. For German, the best performance is obtained with the SKG model (74.47*), which is slightly higher than the suffix-only replacement, which

                            oTAG                                             reTAG
           baseline        suffix-only     DT sim          DT suffix       suffix-only     DT sim          DT suffix
Language   all     OOV     all     OOV     all     OOV     all     OOV     all     OOV     all     OOV     all     OOV
Graph-based Parser
Arabic     75.60   56.90   75.61   57.76*  75.74*  58.18*  75.71*  58.31*  74.54#  52.84#  75.75*  58.18*  75.72*  58.31*
English    79.57   63.64   79.55   63.77   79.64   64.38*  79.54   64.20   79.24#  62.37   79.95*  66.17*  79.78*  65.30*
French     77.76   64.59   77.91   65.34   77.61   64.09   77.79   64.84   77.59   64.59   77.59   64.09   77.97   65.84
German     74.24   68.93   74.43*  69.66*  74.27   69.14   74.21   69.24   72.26#  63.43#  74.13   68.10   74.22   69.09
Hindi      87.67   72.00   87.76*  72.74   87.78*  72.80*  87.71   72.86*  87.49#  70.60   87.67   72.62   87.69   72.74
Spanish    80.02   63.56   80.07   65.28*  80.32*  67.18*  80.30*  66.84*  79.38#  64.59   80.41*  68.91*  80.27   68.05*
Swedish    77.13   70.70   77.16   70.87   77.44*  71.07   77.31*  71.03   76.55#  69.12#  77.62*  71.96*  77.65*  72.05*
∆ all       0.00    0.00    0.10    0.72    0.10    0.89    0.08    0.93   -0.79   -1.89    0.02    0.95    0.12    1.35
Transition-based Parser
Arabic     72.63   52.81   72.71   53.67   72.79*  53.94*  72.75*  53.91*  71.75#  48.61#  72.77*  53.84*  72.74*  53.84*
English    77.26   61.84   77.15#  61.67   77.16   61.84   77.30   62.41   76.85#  60.14#  77.32   62.33   77.53*  63.29*
French     74.25   63.09   74.37   63.84   74.38   64.09   74.24   62.84   74.14   62.34   74.59*  64.59   74.69*  64.09
German     70.29   63.02   70.24   62.97   70.22   62.76   70.29   63.07   67.97#  56.38#  70.21   62.19   70.16   62.34
Hindi      84.08   66.14   83.99#  65.16   84.16*  67.24*  84.14*  67.05*  83.78#  63.08#  84.10   66.99   84.14   66.99
Spanish    75.39   57.86   75.52   59.59*  75.67*  59.93*  75.38   59.07   75.19   60.10   76.10*  63.90*  75.68   62.52*
Swedish    73.45   66.59   73.48   66.46   73.52   66.66   73.60*  67.02   72.91#  64.61#  74.01*  68.27*  74.09*  68.53*
∆ all       0.00    0.00    0.02    0.36    0.11    0.70    0.02    0.53   -0.76   -2.10    0.12    1.01    0.20    1.50

Table 5.4: LAS scores for the parsing performance on the test sets when replacing OOV words with a DT. Additionally, we present ∆ values for all languages.

Language    oTAG sim        oTAG sim        oTAG suffix     oTAG suffix     reTAG sim       reTAG sim       reTAG suffix    reTAG suffix
            SKG             CBOW            SKG             CBOW            SKG             CBOW            SKG             CBOW
            all     OOV     all     OOV     all     OOV     all     OOV     all     OOV     all     OOV     all     OOV     all     OOV

Graph-based Parser
Arabic      75.62   58.00*  75.71*  57.97*  75.67   58.62*  75.73*  58.49*  75.54   57.66*  75.69   57.83*  75.65   58.42*  75.73*  58.49*
English     79.55   63.85   79.57   64.16   79.58   63.99   79.61   64.03   78.86#  59.97#  79.64   64.12   79.38   62.81   79.57   64.03
German      74.47*  69.55*  74.39   69.29   74.39*  69.35   74.40*  69.24   72.82#  64.26#  73.70#  66.60#  74.06   67.95   74.14   68.41
∆ all        0.08    0.64    0.08    0.83    0.09    0.65    0.11    0.76   -0.73   -2.53   -0.11   -0.10   -0.13   -0.31    0.01    0.49

Transition-based Parser
Arabic      72.62   53.67*  72.65   53.60*  72.88*  54.80*  72.72   53.67*  72.60   53.46   72.64   53.49*  72.85*  54.53*  72.71   53.63*
English     77.10#  61.49   77.24   62.06   77.17   62.28   77.28   62.46*  76.54#  57.78#  77.22   61.84   77.07   60.58   77.24   62.37
German      70.19   63.07   70.22   63.38   70.17   63.54   70.36   63.49   68.90#  57.62#  69.48#  60.68#  69.98#  62.09   70.06   62.60
∆ all       -0.09    0.19    0.01    0.98   -0.02    0.46    0.06    0.65   -0.71   -2.94   -0.09   -0.16   -0.28   -0.55    0.06    0.31

Table 5.5: LAS scores for parsing performance on the test sets when replacing OOV words with w2v, together with ∆ values.

achieves high scores in the oTAG setting. Whereas for POS tagging the suffix-based DT replacement mostly results in the highest scores, no single replacement strategy can be recommended for parsing across all languages. Looking at the average delta (∆) values over all languages (see Tables 5.4 and 5.5) in comparison to the baseline, the picture is clearer: for both parsers, the reTAG DT suffix strategy yields the highest improvements, and the CBOW and SKG methods result in consistent improvements only for the oTAG strategy. Further average performance gains are observed for the suffix-based CBOW method using the reTAG strategy.
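The ∆ rows in Tables 5.4 and 5.5 are plain averages, over languages, of the difference between a strategy and the baseline. A minimal sketch of that computation follows; the dictionary layout is an illustrative assumption.

    def delta(scores, strategy, baseline="baseline"):
        # Average score difference of a strategy over the baseline across languages.
        # scores: {strategy_name: {language: LAS}}.
        langs = scores[baseline]
        diffs = [scores[strategy][lang] - scores[baseline][lang] for lang in langs]
        return sum(diffs) / len(diffs)

    # Example with two of the 'all words' LAS values from Table 5.4:
    las = {"baseline":        {"Spanish": 80.02, "Swedish": 77.13},
           "reTAG DT suffix": {"Spanish": 80.27, "Swedish": 77.65}}
    print(delta(las, "reTAG DT suffix"))  # average gain on this two-language subset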

To sum up, the DT-based strategies appear more advantageous than the w2v-based strategies across languages. Comparing the different strategies for using DTs, we observe an advantage of reTAG over oTAG and a slight advantage of suffix over sim. Most notably, DT reTAG suffix is the only strategy that never resulted in a significant performance drop on any dataset for either parser, and it yields the highest average ∆ improvement of 1.50. Given its winning performance in the POS evaluation, we recommend using this strategy; a schematic sketch of the procedure follows.
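The sketch below outlines the reTAG pipeline under stated assumptions: the tagger, parser, vocabulary and the replace function are placeholders for whichever components are used, not a fixed API.

    def retag_pipeline(tokens, train_vocab, replace, tag, parse):
        """reTAG: replace OOV words first, then POS-tag and parse the altered
        sequence, and project the predictions back to the original tokens.
        replace(word) -> in-vocabulary word or None; tag(words) -> POS tags;
        parse(words, tags) -> dependency tree over token positions.
        """
        altered = []
        for word in tokens:
            if word in train_vocab:
                altered.append(word)
            else:
                altered.append(replace(word) or word)  # abstain -> keep the OOV
        tags = tag(altered)          # tagging already sees the replacements
        tree = parse(altered, tags)
        # Replacement is one-to-one, so positions line up: the tags and arcs
        # predicted for the altered words carry over to the original tokens.
        return list(zip(tokens, tags)), tree

The oTAG variant would instead call tag(tokens) on the original sequence and apply the replacement only before parsing.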

5.6 Data Analysis

5.6.1 Analysis of POS Accuracy

Since POS quality has a direct influence on parser accuracy, we have analyzed the two reTAG strategies, suffix and sim, for our three similarity sources (DT, SKG, CBOW) in more detail for German and English by comparing them to the oTAG baselines. In general, differences are mostly found for open word classes such as ADJ, ADV, NOUN, PROPN and VERB, which naturally have the highest OOV rates in the test data. In both languages, the DT-based strategies supply about 84% of the replacements of the w2v strategies.

For German, only the DT suffix-based replacements led to a slight overall POS improvement. All similarity sources improved the tagging of NOUN for suffix, but not for sim. All replacements led to some losses on VERB, with SKG losing the most. Both w2v sources lost more on ADJ than the DT, which also showed the largest improvements on ADV. In addition, we analyzed the POS classification only for tokens that could be replaced both by the DT and by the w2v methods. For these tokens, the SKG method cannot surpass the oTAG performance. Furthermore, for DT and CBOW, the suffix strategies achieve slightly lower scores than sim (0.18% to 0.63%). On the tokens where all methods propose replacements, the DT results in better accuracy (86.00%) than CBOW (85.82%).

For English, the picture is similar but the improvements in the scores are generally larger: while DT sim led to the largest and DT suffix to the second-largest overall improvements, the suffix-based w2v strategies can also improve POS tagging quality, whereas the sim-based w2v strategies decrease POS accuracy. Here, we see improvements on ADJ for all but the sim-based w2v strategies, improvements on NOUN for all but SKG suffix, and improvements on VERB for all suffix strategies. Inspecting again the words that can be replaced by all replacement strategies, we observe the highest accuracy improvement using the suffix strategies: the scores rise from the baseline of 78.07% to up to 84.00% using the DT and up to 80.90% with CBOW.

The largest difference, and the decisive factor for English and German, concerns the PROPN tag: whereas DT sim and SKG suffix only result in small positive changes, all other strategies frequently mis-tag PROPN as NOUN, increasing this error class by a relative 15% to 45%. These are mostly replacements of rare proper names with rare nouns, which occur less often among DT replacements due to the similarity threshold.

Regarding the other languages, we found the largest improvements in French on NOUN for the DT sim replacement, coupled with losses on PROPN; both DT strategies improved VERB. For Spanish, the largest improvements were found on ADJ, NOUN and PRON for both DT strategies. Small but significant improvements for Hindi were distributed across parts of speech, and for Arabic, no sizeable improvements were observed. Only for Arabic do we observe a general performance drop when replacing OOV words. Inspecting the OOV words, we find that around 97% of these words have been annotated as X (other). Overall, the test set contains 8.4% such annotations, whereas X is rarely encountered in our other languages. Since the baseline POS performance for Arabic is very high, there is not much to improve with replacements.
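The comparison on jointly replaceable tokens above amounts to restricting accuracy to the intersection of the token sets each method can replace. A small sketch, with illustrative data structures that are our own assumption:

    def accuracy_on_common_tokens(gold_tags, predicted, replaceable):
        """POS accuracy per method, restricted to tokens all methods can replace.
        gold_tags:   {token_id: gold POS tag}
        predicted:   {method: {token_id: predicted POS tag}}
        replaceable: {method: set of token_ids with a replacement}
        """
        common = set.intersection(*replaceable.values())
        if not common:
            return {}
        return {method: sum(tags[t] == gold_tags[t] for t in common) / len(common)
                for method, tags in predicted.items()}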

5.6.2 Analysis of Parsing Accuracy by Relation Label

We have conducted a differential analysis comparing LAS F-scores on all our languages between the baseline and the different replacement options, specifically to understand the effects of the DT reTAG strategies. Focusing on frequent dependency labels (average occurrence: 4% to 14%), we gain improvements for the relations conj, amod and case across all test sets. Except for Hindi, the LAS F1 score increases by up to 0.6% for case relations, the relation between prepositions (or postpositions) and the head noun of the prepositional phrase. For the amod relation, which connects modifying adjectives to nouns, we observe a +0.5% to +1% improvement in F-score for all languages except Hindi and French, corresponding largely to the increased POS accuracy for nouns and adjectives.

For English, we found most improvements in the relations compound (about +1 F1) and name (+0.5 to +5.0 F1) for both parsers, while the relations cop and xcomp were recognized less precisely (-0.2 to -0.9 F1). The graph-based parser also improves substantially on appos (+3.5 to +4.2 F1) and nmod:npmod (+5.2 to +6.5 F1), while the transition-based parser sees improvements on iobj (+3.8 to +5.1 F1) and neg (+1.0 F1). For German, the case relation improves for both parsers by +0.2 to +0.6 F1. The graph-based parser improves on auxpass (+1.1 to +1.4 F1) and conj (+0.4 to +0.9 F1).

Whereas pinpointing systematic differences between the two parsers is hardly possible, we often observe that the graph-based parser seems to perform better on rare relations, whereas the transition-based parser deals better with frequent relations. As with the overall evaluation, there is no clear trend for the suffix vs. the sim strategy on single relations, except for German dobj and iobj with the graph-based parser, which stayed the same or performed worse with DT suffix reTAG (0 to -0.9 F1) but improved greatly with DT sim reTAG (+0.9 to +2.4 F1).

In summary, OOV replacement seems to benefit dependency parsing mostly on relations that involve open-class words, as well as relations that need semantic information for disambiguation, e.g. case, dobj and iobj.
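The per-label differential analysis rests on labeled-attachment F-scores per relation. A minimal sketch of that metric over gold and predicted (token, head, label) triples; the triple format is an assumption about the data layout.

    from collections import Counter

    def las_f1_by_label(gold, pred):
        """Per-relation labeled-attachment F1.
        gold, pred: lists of (token_id, head_id, label) triples; an arc counts
        as correct when both head and label match the gold annotation.
        """
        gold_map = {tok: (head, lab) for tok, head, lab in gold}
        gold_n, pred_n, correct = Counter(), Counter(), Counter()
        for _, _, lab in gold:
            gold_n[lab] += 1
        for tok, head, lab in pred:
            pred_n[lab] += 1
            if gold_map.get(tok) == (head, lab):
                correct[lab] += 1
        f1 = {}
        for lab in gold_n:
            p = correct[lab] / pred_n[lab] if pred_n[lab] else 0.0
            r = correct[lab] / gold_n[lab]
            f1[lab] = 2 * p * r / (p + r) if p + r else 0.0
        return f1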

5.7 Discussion

In the following, we motivate our recommendation for OOV replacement and highlight the differences we observed in our experiments between graph-based and dense-vector-based similarities.

5.7.1 Recommendations for OOV Replacement

Our experiments show that a simple OOV replacement strategy can lead to significant improvements for dependency parsing across typologically different languages. The improvements can be partially attributed to gains in POS tagging quality, especially with the suffix-based replacement strategy, and partially to improved use of lexicalized information from semantic similarity. Overall, the strategy of replacing OOV words first and POS-tagging the sequence on the basis of the replacements (reTAG) proves more effective than the other way around. While the improvements are generally small yet significant, we still believe that OOV replacement is a viable strategy, especially given its simplicity. In learning curve experiments, as exemplified in Figure 5.1, we found the relative effect to be more pronounced for smaller amounts of training data, despite there being less in-vocabulary material to choose from. Thus, our approach seems especially suited for low-resource languages, where labeled training material is notoriously scarce.

Figure 5.1: Learning curve of LAS for OOV words on the English development set.

The question whether to use DT suffix or DT sim as the replacement strategy for dependency parsing is not easily answered: while DT suffix shows the best overall improvements across the datasets, DT sim performs slightly better on Arabic and English graph-based parsing and on English POS tagging.
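The learning-curve experiment behind Figure 5.1 amounts to retraining on growing slices of the treebank and scoring LAS on OOV tokens only. A schematic sketch with placeholder training and evaluation functions; train_fn, las_on_oov, the fractions and the sentence format are assumptions made for illustration.

    def learning_curve(train_sents, dev_sents, train_fn, las_on_oov,
                       fractions=(0.1, 0.25, 0.5, 1.0)):
        """LAS on OOV words as a function of training-set size.
        train_fn(sentences) -> parser model;
        las_on_oov(model, sentences, vocab) -> LAS restricted to words outside
        vocab; each sentence is assumed to be a list of token strings.
        """
        curve = []
        for frac in fractions:
            subset = train_sents[:int(frac * len(train_sents))]
            vocab = {w for sent in subset for w in sent}  # OOV is relative to the slice
            model = train_fn(subset)
            curve.append((frac, las_on_oov(model, dev_sents, vocab)))
        return curve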

5.7.2 On Differences between Graph-Based and Dense-Vector Similarity

What would be needed to fruitfully utilize the popular neural language model w2v as a similarity source, and why does the graph-based DT seem to be so much better suited for OOV replacement? From the above analysis and from data inspection, we attribute the advantage of the DT to its capability of not returning a replacement when its confidence is too low, i.e. when no in-vocabulary word is found with a similarity score of 5 or more. In contrast, vector spaces do not provide an interpretable notion of similarity or closeness that can be uniformly applied as a similarity threshold: we have compared the cosine similarities of token replacements that lead to improvements, no changes and drops, and found no differences between their average values.

A further difference lies in the structure of the vector space and the DT similarity rankings. Whereas the DT returns similar words with a frequency bias, i.e. rather frequent words are found among the most similar words per OOV target, the vector space has no such frequency bias and, since there are more rare than frequent words in language, returns many rare words from the background corpus.^8 This effect can be alleviated to some extent by applying frequency thresholds, but is in turn aggravated when scaling up the background corpus. Thus, a condition that only takes the top-N most similar words from the background collection into account for expansions is also bound to fail for w2v. The only reasonable mechanism seems to be a background-corpus frequency threshold on the in-vocabulary word. However, even when comparing only the positions where both DT and w2v returned replacements, we still find the DT replacements more advantageous. Inspection revealed that while many replacements are the same for the two similarity sources, the DT replacements more often stay in the same word class (cf. Table 5.1), e.g. regarding conjugated forms of verbs and the distinction between common and proper nouns.

^8 We have seen this effect repeatedly and consistently across corpora, languages and parameters.
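The abstention behaviour credited to the DT, returning nothing when no in-vocabulary neighbour reaches the similarity score of 5, is easy to sketch; the ranking format below is an assumption about how the DT neighbour lists are stored.

    def dt_replacement(oov, dt_neighbors, train_vocab, min_score=5):
        """Thresholded DT lookup: return the best-ranked in-vocabulary
        neighbour, or None when no candidate reaches min_score, i.e. the
        method abstains rather than guessing.
        dt_neighbors: {word: [(neighbour, score), ...]} sorted by score.
        """
        for neighbour, score in dt_neighbors.get(oov, []):
            if score < min_score:
                break  # list is sorted by decreasing score; nothing better follows
            if neighbour in train_vocab:
                return neighbour
        return None

No analogous global cutoff worked for cosine similarity in our data, which is precisely the difference discussed above.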

5.8 Conclusion

In this paper, we have shown that syntactic preprocessing, both POS tagging and dependency parsing, can benefit from OOV replacement. We have devised a simple yet effective strategy (DT suffix reTAG) to improve the quality of universal dependency parsing: replacing OOV words with semantically similar words that share a suffix, subsequently running the POS tagger and the dependency parser over the altered sequence, and projecting the labels back to the original sequence. In these experiments, similar words from a count-based distributional thesaurus proved more effective than the dense numeric w2v approach. In future work, we will apply our method to other types of lexicalized parsers, such as constituency grammar and combinatory categorial grammar parsers, as well as examine the influence of OOVs on semantic tasks such as frame-semantic parsing.
