Frontiers of Multilingual Grammar Development Ramona Enache ISBN 978-91-628-8787-2
THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Frontiers of Multilingual Grammar Development

Ramona Enache

Department of Computer Science and Engineering
Chalmers University of Technology and Göteborg University
SE-412 96 Göteborg, Sweden
Göteborg, 2013

ISBN 978-91-628-8787-2
© Ramona Enache, 2013
Technical Report no. 99D
Department of Computer Science and Engineering
Language Technology Research Group
Department of Computer Science and Engineering
Chalmers University of Technology and Göteborg University
SE-412 96 Göteborg, Sweden
Printed at Chalmers, Göteborg, 2013

Abstract

The thesis explores a number of ways of developing multilingual grammars written in GF (Grammatical Framework). The goal is both to enhance the coverage of the grammars, in terms of content and number of languages, and to reduce the development effort by automating a larger part of the process.

The first direction in grammar development targets the creation of general language resources. These are the starting point for building domain-specific grammars for the language. Developing resource grammars gives a good overview of the effort required and provides a solid base for subsequent experiments in automation. Our work resulted in computational grammars for Romanian and Swedish.

A further development step is multilingual domain-specific grammar creation. The technique we employed is converting structured models into grammars, which preserves the original structure of the model as a backbone of the grammar and uses the general GF resources for a smooth multilingual verbalization of the model.
The use cases considered are an upper-domain ontology, a business model and an ontology describing cultural heritage artefacts, each posing a different challenge and illustrating another aspect of the interoperability between GF grammars and ontologies, and its advantages.

An orthogonal approach to multilingual grammar development aims at increasing the number of languages covered by a domain grammar. Our solution is an example-based prototype which partially replaces grammar programming with feedback from native informants and SMT tools (such as Google Translate).

Last but not least, as an attempt to not only enhance GF grammars, but also use them in a novel way, we present a grammar-based hybrid system architecture combining GF grammars and SMT systems. This marks some of the first steps in using grammars for translating free text. As a side effect of the work, we propose a technique for building bilingual GF lexicon resources from SMT phrase tables.

Keywords: multilingual grammar development, ontology verbalization, resource grammar development, hybrid machine translation, functional programming, domain-specific languages.

Acknowledgements

The thesis is dedicated first and foremost to my dearest grandmother Gica, who raised me from when I was 2 weeks old, taught me everything about life that’s worth knowing and has been my best friend even before I knew what a best friend was. Unfortunately, she won’t read these lines, as she left this world for a better place in March 2010, one week after I started my PhD studies. Even so, not a day passes without my thinking about her and about all her unconditional love and support.

On my path to PhD studies, for every crossroad I found, there was also someone who steered me in the right direction and I am very grateful for that! It all started with my high school Mathematics teacher, Mr. Constantin Ursu, who believed that even I could do Math, and eventually convinced me of that, too!
It’s thanks to him that I chose to study Mathematics and Computer Science at the University, which I still regard as one of my wisest choices to date.

Later on, during my BSc studies at the University of Bucharest, I was exposed to research for the first time by joining the GLAU research group (Group for Logic and Universal Algebra), thanks to Prof. George Georgescu, the supervisor of my BSc thesis. It’s thanks to him that I discovered how amazing research is and that I wanted to continue my studies and become a researcher myself.

With this thought in mind and also longing for a change of scenery and new challenges, I came to Sweden for MSc studies in 2008. There I was lucky to meet Krasimir Angelov, from whom I heard about Language Technology for the first time. It was our stimulating discussions and the amazing PhD defence of Björn Bringert which convinced me that there’s nothing else I would rather do than Language Technology! I would also like to thank Krasimir for introducing me to the GF formalism, with which I worked for the next 5 years, for being a great supervisor for the MSc thesis and a good colleague and collaborator ever since.

Regarding the time of my PhD studies, in the first place, I would like to thank my supervisor, Aarne Ranta, for being such an inspiring researcher. I greatly appreciate working with him, his patience and his vast knowledge. Also, I’m grateful that he provided collaboration opportunities for me with academic and industrial partners during the MOLTO project and gave me the freedom to find my own research path.

I would also like to thank my co-supervisor Koen Claessen, with whom it’s always so inspiring to discuss research and much more. Our meetings were always too short! Luckily, we will get to work together more in the future project, where I will continue as a post-doc.

Many thanks also go to my examiner, Patrik Jansson, who gave valuable feedback at each follow-up meeting and on the thesis manuscript.
Also, I would like to thank Wolfgang Ahrendt, not only for his role in the follow-up committee, but also for our collaboration in the Software Engineering using Formal Methods course and illuminating discussions about football and life.

I would like to express my gratitude to the partners from the MOLTO project, with whom it was so inspiring to collaborate during the last 3 years. In particular, I thank Cristina España-Bonet, Lluís Màrquez and Meritxell Gonzàlez, who hosted me at Universitat Politècnica de Catalunya, in Barcelona, during the summer of 2012. Moreover, I would like to thank Dana Dannélls for our collaboration, our refreshing walks downtown and the most inspiring discussions about everything. I would also like to thank Jeroen van Grondelle from Be Informed, Laurette Pretorius from UNISA and Brian Davis from DERI for all the inspiration they provided during our short, but fruitful collaboration. Special thanks also go to Malin Ahlberg, whom I had the pleasure to supervise for the BSc and MSc theses and who is a most brilliant young researcher.

I would also like to acknowledge the past and present members of the Language Technology group, which is such a great environment for research: Thomas Hallgren, Bengt Nordström, John J. Camilleri, Grégoire Détrez, Peter Ljunglöf, Håkan Burden, Olga Caprotti, Shafqat Virk, K.V.S. Prasad and Inari Listenmaa.

The Computer Science and Engineering Department at Chalmers was not only a fabulous and inspiring working environment, but also the place with the highest density of amazing people that I’ve ever encountered. I would like to thank from the bottom of my heart all the people at CSE who were like another family to me during these 3.5 years!

Special thanks also go to Christina Lidbeck, who helped me start a life in Gothenburg, especially in the beginning, when times were rough. She also gave me the best advice about Swedish society and helped me integrate here.
Knowing her and her family made all the difference to me!

To my dearest friends from Gothenburg (Nui, Camilo, Maryana, Gaël) and elsewhere (Oana and Andrei), for their overall awesomeness that words can’t describe and for making my life wonderful, I would like to say: Kob kun ka / Gracias / Spasiba / Merci / Mulțumesc!

Last, but not least, the thesis is dedicated to the world’s most wonderful mom, Lidia, to my amazing little brother, Bogdan, and to my miniature muse, François-Frederic. Vă iubesc mult!

Since arriving in Sweden 5 years ago, my life has changed in so many ways. However hard I tried, I just couldn’t fully describe the whole series of fortunate events that made me grow so much as a person without risking turning the thesis into my own Bildungsroman. Everyone who had a part in it knows it already, and knows how lucky I am that our paths crossed.

In conclusion, to all of you who made my life better, who provided inspiration, friendship and support, I would like to say (again): Thank you so much!

This work has been funded by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement number 247914 (MOLTO project, FP7-ICT-2009-4-247914) and by the Swedish Research Council (Vetenskapsrådet) (REMU project 2013-2017).

Contents

1 Introduction
    1 Contributions
        1.1 Creating Language Resources
        1.2 Grammars Describing Structured Models
        1.3 Bootstrapping Grammars from External Sources
        1.4 Grammar-Based Hybrid Systems for Machine Translation
    2 Further Frontiers of Multilingual Grammar Development
2 Creating Language Resources
    1 An Open-Source Computational Grammar for Romanian
    2 A Type-Theoretical Wide-Coverage Computational Grammar for Swedish
3 Grammars Describing Structured Models
    1 Typeful Ontologies with Direct Multilingual Verbalization
    2 Multilingual Verbalization of Modular Ontologies using GF and lemon
    3 Multilingual Grammar for Museum Object Descriptions
4 Bootstrapping Grammars from External Sources
    1 Controlled Language for Everyday Use: the MOLTO Phrasebook