This thesis has been submitted in fulfilment of the requirements for a postgraduate degree (e.g. PhD, MPhil, DClinPsychol) at the University of Edinburgh. Please note the following terms and conditions of use:

• This work is protected by copyright and other intellectual property rights, which are retained by the thesis author, unless otherwise stated.
• A copy can be downloaded for personal non-commercial research or study, without prior permission or charge.
• This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author.
• The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author.
• When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given.

An Iterated Learning Framework for Unsupervised Part-of-Speech Induction

Christos Christodoulopoulos

Doctor of Philosophy
School of Informatics
University of Edinburgh
2013

Abstract

Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on rule-based or corpora-based (supervised) methods. Despite the undeniable success of supervised learning methods in NLP, they have two main drawbacks: on the practical side, it is expensive to produce the manual annotation (or the rules) required and it is not easy to find annotators for less common languages. A theoretical disadvantage is that the computational analysis produced is tied to a specific theory or annotation scheme.

Unsupervised methods offer the possibility to expand our analyses into more resource-poor languages, and to move beyond the conventional linguistic theories.
They are a way of observing patterns and regularities emerging directly from the data and can provide new linguistic insights.

In this thesis I explore unsupervised methods for inducing parts of speech across languages. I discuss the challenges in evaluation of unsupervised learning and at the same time, by looking at the historical evolution of part-of-speech systems, I make the case that the compartmentalised, traditional pipeline approach of NLP is not ideal for the task.

I present a generative Bayesian system that makes it easy to incorporate multiple diverse features, spanning different levels of linguistic structure, like morphology, lexical distribution, syntactic dependencies and word alignment information that allow for the examination of cross-linguistic patterns. I test the system using features provided by unsupervised systems in a pipeline mode (where the output of one system is the input to another) and show that the performance of the baseline (distributional) model increases significantly, reaching and in some cases surpassing the performance of state-of-the-art part-of-speech induction systems.

I then turn to the unsupervised systems that provided these sources of information (morphology, dependencies, word alignment) and examine the way that part-of-speech information influences their inference. Having established a bi-directional relationship between each system and my part-of-speech inducer, I describe an iterated learning method, where each component system is trained using the output of the other system in each iteration. The iterated learning method improves the performance of both component systems in each task.

Finally, using this iterated learning framework, and by using parts of speech as the central component, I produce chains of linguistic structure induction that combine all the component systems to offer a more holistic view of NLP. To show the potential of this multi-level system, I demonstrate its use ‘in the wild’.
I describe the creation of a vastly multilingual parallel corpus based on 100 translations of the Bible in a diverse set of languages. Using the multi-level induction system, I induce cross-lingual clusters, and provide some qualitative results of my approach. I show that it is possible to discover similarities between languages that correspond to ‘hidden’ morphological, syntactic or semantic elements.

Lay Summary

Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on supervised methods. Despite their undeniable success in NLP, supervised learning methods have two main drawbacks: on the practical side, it is expensive to produce the manual annotation (or the rules) required and it is not easy to find annotators for less common languages. A theoretical disadvantage is that the computational analysis produced is tied to a specific theory or annotation scheme.

Unsupervised methods, on the other hand, offer the possibility to expand our analyses into more resource-poor languages, and move beyond the conventional linguistic theories. They are a way of observing patterns and regularities emerging directly from the data and provide new linguistic insights.

In this thesis I explore unsupervised methods for inducing parts of speech across languages. I discuss the challenges in evaluation of unsupervised learning and at the same time, by looking at the historical evolution of part-of-speech systems, I make the case that the compartmentalised, traditional pipeline approach of NLP (where the output of one system is the input to the next) is not ideal for the task.
I present a part-of-speech induction system that makes it easy to incorporate multiple diverse features, spanning different levels of linguistic structure, like morphology, lexical distribution, syntactic dependencies and word alignment information that allow for the examination of cross-linguistic patterns. I then turn to the unsupervised systems that provide these sources of information (morphology, dependencies, word alignment) and examine the way that part-of-speech information influences their decisions and describe an iterated learning method, where each component system is trained using the output of the other system in each iteration. Using this iterated learning framework, and by using parts of speech as the central component, I combine all the component systems in a chain that offers a more holistic view of NLP. I describe the creation of a vastly multilingual parallel corpus based on 100 translations of the Bible in a diverse set of languages, and provide some qualitative results of my approach.

Acknowledgements

A commonplace statement that I’m used to reading in acknowledgement sections is that there are lots of people without whom a dissertation would not have been possible. I always thought of it as an exaggeration or (at best) a pleasantry. That is, until I started my own PhD! Three, or three and a half years is not a lot of time for a whole PhD project and, while no-one is going to do your work for you, you need all the support you can get.

I was very fortunate to have the support of two of the best supervisors one can wish for, in Mark Steedman and Sharon Goldwater. Not only because of their tremendous academic knowledge, but also because they complemented each other in the most harmonious way for me. Mark, whose incredible depth and breadth of knowledge surprises me to this day, was an inspiration ever since he introduced me to computational linguistics during an MSc in Edinburgh.
He allowed me to pursue wild ideas while steering me away from research pitfalls. Sharon embraced my project in its early stages and her energy and immense technical knowledge helped me to shape the core of my thesis. Both Sharon and Mark’s patience and willingness to be convinced about new potential directions led me to a better understanding of my project and (perhaps more importantly) of the underlying principles behind it. I am a better researcher thanks to them.

Apart from my supervisors, most of the weight for my support (academic, moral, or otherwise) fell on the past members of the ‘Steedman gang’: Michael Auli, Lexi Birch, Prachya (Arm) Boonkwan, Tom Kwiatkowski, Kira Mourão, Emily Thomforde, Luke Zettlemoyer, and its current members (also known as the ‘Darkstar crew’): Bharat Ambati, Greg Coppola, Tejaswini Deoskar, Aciel Eshky, Mark Granroth-Wilding, Mike Lewis, Siva Reddy, Nathaniel Smith. Together with the rest of the students and postdocs of the ILCC, they made me feel part of a great community, and provided me with endless occasions of discussion, drinking, dining and general good times!

I would like to add my special thanks to two people in particular. First, to Mark Granroth-Wilding, for being an amazing friend since my MSc days, for giving me my British accent (and most of my Britishness in general), for the countless coffee sessions that fuelled my entire PhD with caffeine and ideas, for trusting me with his thesis (and accusing me of procrastinating in his own acknowledgements!) and for painstakingly going through this document and making sense of my ramblings. Second, to Yannis Konstas, for keeping me a bit closer to Greece, for being a constant source of academic stimulation and good music, for patiently listening to me through my endless rants about part-of-speech induction, parsing and the state of NLP, for giving me his amazing parser and just for being a reminder of all that’s good about Greece.
I’m also indebted to many researchers that helped me throughout my PhD, either by providing their code during my hunt for unsupervised part-of-speech induction systems, or by giving me their expertise on various subjects, or by reviewing my papers. I want to thank all my friends, both here in Edinburgh and back in Greece for putting up with my academic nature and for constantly reminding me that life is more than a PhD.