Computational Analysis of Morphosyntactic Categories in Georgian
Sophiko Daraselia

Submitted in accordance with the requirements for the degree of Doctor of Philosophy

University of Leeds
School of Languages, Cultures and Societies

April 2019

The candidate confirms that the work submitted is her own and that appropriate credit has been given where reference has been made to the work of others.

This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

The right of Sophiko Daraselia to be identified as Author of this work has been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.

©2019 The University of Leeds and Sophiko Daraselia

Acknowledgements

I would like to thank the International Educational Center (IEC) for giving me the opportunity of a lifetime by fully funding my PhD program at the University of Leeds. Without this financial support, my PhD would not have been possible.

Of course, none of this would have been possible without my supervisory team, which has included Dr Serge Sharoff, Dr Diane Nelson and Dr Andrew Hardie. I would like to thank Dr Serge Sharoff, my first supervisor. I am very grateful to both him and my second supervisor, Dr Diane Nelson, for their guidance throughout the years of my PhD project. Most importantly, they have always supported my academic development beyond my PhD research. They have helped and supported me in leading the Kartvelian and Caucasian Studies research satellite at Language@Leeds, and in organizing the international annual workshop on Computational approaches to Morphologically Rich Languages (CAMRL2017 and CAMRL2018) at the University of Leeds.

I cannot thank Dr Andrew Hardie, my external supervisor from Lancaster University, enough, nor fully express how grateful I am to him. He has been so helpful, and I feel I have learnt a great deal from him.
With Dr Hardie’s supervision, I have managed to design a tagset for Georgian (KATAG), which is one of the main contributions of my PhD project.

I am very lucky to have had my husband, Thomas James Blackwell, supporting me during my PhD journey. My loving thanks also go to each member of Thomas’s family, who always made me feel very special.

Many thanks go to my family in Georgia, especially to my mother, Nona Berishvili, who has always been there for me, showing her love and support every single day throughout the whole PhD project.

Alongside my PhD project, I was involved in a number of interesting projects with my Georgian friends and colleagues. I would like to thank Professor Tinatin Margalitadze, Professor Marine Beridze, Professor Nino Sharashenidze, Eto Churadze and Dr Rusudan Gersamia for their friendship, kindness and cooperation.

I would also like to thank the University of Leeds for accepting my PhD project and providing an incredibly professional and supportive research environment throughout my PhD study.

I would like to thank my friends in Leeds, with whom I shared the PhD experience and whose love and support gave me strength during hard times, especially Faye Bennet, Polly Gallis, Aishah Mubaraki and Shifa Al Askari.

Abstract

This thesis describes the development of part-of-speech tagging resources for the Georgian language, consisting of (i) a new morphosyntactic language model for part-of-speech (POS) tagging purposes; (ii) tagging guidelines for tagging and post-editing; (iii) the KATAG tagset; and (iv) the trained parameter files that the probabilistic TreeTagger program needs in order to process Georgian texts.

A new morphosyntactic model of Georgian for part-of-speech tagging purposes is described in the thesis. The thesis also describes a tagset (KATAG) defined in accordance with this new morphosyntactic model of the language, together with a set of design principles and tagging guidelines.
A stochastic methodology is used here to perform tagging in Georgian. Namely, TreeTagger, a probabilistic part-of-speech tagging program, has been trained on Georgian texts. The justification for this choice is discussed. I use two tokenisation approaches in part-of-speech tagging. An accuracy of 92.41% was achieved using an enclitic tokenisation approach, and an accuracy of 87.13% using a non-enclitic tokenisation approach, corroborating my hypothesis that treating enclitic elements separately from the host words results in better tagging performance.

To make the tagger program easily adaptable to a range of inputs (type, variety or genre of text), the performance of the probabilistic TreeTagger program was evaluated on a test set covering five different genres: academic, informal, legal, fiction and news.

Table of Contents

Acknowledgements
Abstract
List of Tables
List of Figures
Abbreviation
Chapter 1 Introduction
    1.1 Motivation
    1.2 Aims and objectives
    1.3 Research questions
    1.4 Thesis outline
Chapter 2 Background issues to the tagging of Georgian
    2.1 The Georgian Language
        2.1.1 A brief overview of the structure of Georgian
    2.2 Previous work on corpus annotation in Georgian
        2.2.1 The KaWaC Corpus
        2.2.2 A parser for Georgian
        2.2.3 Georgian morphological analyser
        2.2.4 Morphological Analyzer and Generator for Georgian
    2.3 Concluding Remarks
Chapter 3 Design principles of the Georgian Tagset
    3.1 What is part-of-speech tagging?
    3.2 Previous work on English part-of-speech tagsets
    3.3 Design principles for a Georgian tagset
        3.3.1 Information to include
        3.3.2 Hierarchy and decomposability
        3.3.3 Tokenisation
        3.3.4 Disambiguation
        3.3.5 The tagset’s appearance
Chapter 4 Specification of the Tagset for Georgian
    4.1 Noun (arsebiti saxeli)
        4.1.1 Number
        4.1.2 Case System in Modern Georgian
        4.1.3 Tags for Nouns
    4.2 Adjectives (zedsartavi saxeli)
        4.2.1 Tags for Adjectives
    4.3 Pronouns (nacvalsaxeli)
        4.3.1 Tags for Pronouns
    4.4 Numerals (ricxviti saxeli)
        4.4.1 Tags for numerals
    4.5 Adverbs (zmnizeda or zmnisarti)
        4.5.1 Tags for Adverbs
    4.6 Conjunctions (k'avširi)
        4.6.1 Coordinating Conjunctions
        4.6.2 Subordinating Conjunctions
        4.6.3 Tags for Conjunctions