A Model for Natural Language Processing
A COMPUTATIONAL ANALYSIS OF NEPALI MORPHOLOGY: A MODEL FOR NATURAL LANGUAGE PROCESSING

A Dissertation Submitted to the Faculty of Humanities and Social Sciences of Tribhuvan University in Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY in LINGUISTICS

By
BALARAM PRASAIN
Ph.D. Reg. No.: 19-2058 Magh
TU Reg. No.: 288-83
March 2011

ACKNOWLEDGEMENTS

I am profoundly indebted to my supervisor, Prof. Dr. Yogendra Prasad Yadava, former head of the Central Department of Linguistics, Tribhuvan University, Nepal, for his persistent encouragement, continuous guidance, valuable suggestions and insightful comments in accomplishing this dissertation.

I would like to express my sincere gratitude to Professor Dr. Miriam Butt, Department of Linguistics, Konstanz University, Germany, for her constructive suggestions, proper guidance and insightful comments, which improved this dissertation. I owe a great deal to Dr. Andrew Hardie, Lancaster University, for his valuable suggestions, for providing his articles, which helped me understand the basic concepts, and for helping me use the online NNC corpus.

I would like to extend my thanks to Dr. Dan Raj Regmi, head of the Central Department of Linguistics, Tribhuvan University, for his encouragement and for his useful suggestions and comments on my study. I would like to extend my sincere gratitude to Prof. Madhav Prasad Pokharel, Central Department of Linguistics, Tribhuvan University, for his prompt answers to my queries regarding Nepali morphology and its structure. I would like to express my gratitude to Prof. Dr. Chuda Mani Bandhu and Prof. Dr. Tej R. Kansakar, former heads of the Central Department of Linguistics, for the inspiration and encouragement they gave to this study.

I extend my thanks to Krishna Prasad Chalise, Dubi Nanda Dhakal and Krishna Poudel, all of the Central Department of Linguistics, for their valuable comments and active participation in discussion whenever a problem was raised. I am equally thankful to Ram Raj Lohani, Bhim Narayan Regmi, Karnakhar Khatiwada, Bhim Lal Gautam and Krishna Prasad Parajuli, all of the Central Department of Linguistics, for their support and encouragement.

I am extremely thankful to Santa Bahadur Basnet for his tokenizer and for a number of computational concepts. I would like to thank Madan Puraskar Library and its staff for their help at different points in time. My sincere thanks go to Dr. Tika Ram Poudel, Tina Bögel and Sebastian Sulger, Konstanz University, for their help.

I would like to thank Tribhuvan University for providing me study leave for two years; SIL International for providing me travel grants to attend the workshops and institutes at the University of California, in Bangkok, Thailand, and at IIT Hyderabad, India; the Bhashasanchar project for supporting me financially while I attended the training on text-to-speech at Gothenburg University, Sweden; and the Department of Linguistics, University of Konstanz, for supporting me financially to attend the school of computational and natural language processing at Konstanz University, Germany.

I would like to take this opportunity to express my sincere appreciation to my spouse, Mrs. Nirmala Prasain, my son Aryan Prasain and my daughter Sanskriti Prasain for their tolerance. Finally, I express my thanks to all the non-teaching staff of the Central Department for their help whenever it was required.

BALARAM PRASAIN

ABSTRACT

The main goal of this study is to present a computational analysis of Nepali morphology in order to develop a model for natural language processing using the finite state approach. The morphological categories have been analyzed according to the principles of Two-level morphology (Koskenniemi 1983), and these categories have been implemented using the Xerox Finite State Tool (Beesley and Karttunen 2003) to create the morphological analyzer. The study uses a type of finite state automaton called a finite state transducer, which handles the relation between two languages, namely the upper language and the lower language. The upper language corresponds to the lexical level and the lower language to the surface level. The finite state transducer is bidirectional: moving from the surface level to the lexical level is analysis, and moving from the lexical level to the surface level is generation.

This study is organized into eight chapters. Chapter 1 presents the general morphological concepts, the objectives, the methodology, the significance and the limitations of the study. Chapter 2 presents the theoretical framework adopted for the study. Chapter 3 analyzes nouns, pronouns, adjectives, numerals and classifiers in Nepali. Chapter 4 analyzes the verbs in Nepali from a computational approach in the first part and verbal inflections in the second part. Chapter 5 deals with indeclinable words in Nepali. Chapter 6 analyzes the derivational process. Chapter 7 implements the outcome of the analysis in the previous chapters as a finite state transducer using the Xerox Finite State Tool. Chapter 8 summarizes the findings of the study.

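The bidirectional behaviour of a finite state transducer described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than in the Xerox tools, and everything in it is an assumption made for illustration: the toy English lexicon (`cat`), the tags `+N`, `+Sg` and `+Pl`, and the names `ARCS` and `transduce` are hypothetical and are not taken from the dissertation's Nepali analysis.

```python
EPS = ""  # the empty symbol (epsilon): nothing is read or written on that side

# Hypothetical toy network. Each arc is
# (source_state, upper_symbol, lower_symbol, target_state).
# The upper side is the lexical level, the lower side is the surface level.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),   # the category tag has no surface realisation
    (4, "+Sg", EPS, 5),  # singular: no surface suffix
    (4, "+Pl", "s", 5),  # plural: surface suffix "s"
]
START, FINALS = 0, {5}


def transduce(text, read_upper):
    """Return every string on the other side that is paired with `text`.

    read_upper=True  reads the lexical level and writes the surface level
                     (generation);
    read_upper=False reads the surface level and writes the lexical level
                     (analysis).
    The search assumes the network has no epsilon cycles, as is the case here.
    """
    results = []
    stack = [(START, text, "")]  # (state, remaining input, output so far)
    while stack:
        state, rest, out = stack.pop()
        if state in FINALS and not rest:
            results.append(out)
        for src, upper, lower, dst in ARCS:
            if src != state:
                continue
            read, write = (upper, lower) if read_upper else (lower, upper)
            if rest.startswith(read):
                stack.append((dst, rest[len(read):], out + write))
    return sorted(results)


if __name__ == "__main__":
    print(transduce("cat+N+Pl", read_upper=True))  # generation
    print(transduce("cats", read_upper=False))     # analysis
```

The same arc table serves both directions; only the side that is read changes. This is the property that lets a single network act as both analyzer and generator.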
This study has identified fourteen groups of nouns, eight groups of pronouns, four groups of adjectives, one group of cardinal numerals, two groups of ordinal numerals, three groups of classifiers, ten groups of verbs, seven groups of adverbs, two groups of conjunctions, three groups of postpositions, one group of particles and fifteen groups of derivations in Nepali. The phonological rules for each group have also been identified. A finite state transducer with the corresponding morphological tags and phonological rules has been created for each group, and all of these have been combined into a single transducer that can be used as a morphological analyzer for Nepali.

TABLE OF CONTENTS

Recommendation Letter
Approval Letter
Declaration
Acknowledgments
Abstract
List of tables
List of figures
List of abbreviations

Chapter 1: Introduction
  1.1 Background
  1.2 Statement of the problem
  1.3 Objectives of the study
  1.4 Literature review
  1.5 Significance of the study
  1.6 Research methodology
  1.7 Limitations
  1.8 Organization of the study

Chapter 2: Theoretical framework
  2.0 Outline
  2.1 Computational concept
  2.2 Regular expression
  2.3 Finite state technology
  2.4 Regular language
  2.5 Finite state machine
    2.5.1 Finite state automata (FSA)
    2.5.2 Finite state transducer (FST)
    2.5.3 Some important operations on FSTs
  2.6 FST in computational morphology
  2.7 Xerox finite state tool syntax (XFST)
    2.7.1 LEXC grammar
    2.7.2 XFST interface
  2.8 Summary

Chapter 3: Nominal morphology
  3.0 Outline
  3.1 Nouns in Nepali
    3.1.1 Characteristics of nouns in Nepali
  3.2 Classification of nouns in Nepali
    3.2.1 O-ending nouns
    3.2.2 Non-o-ending nouns
  3.3 Pronouns
    3.3.1 Characteristics of pronouns in Nepali
    3.3.2 Grouping of pronouns
  3.4 Adjectives
    3.4.1 Characteristics of adjectives in Nepali
    3.4.2 Classification of adjectives
  3.5 Numerals
    3.5.1 Cardinal numbers
    3.5.2 Ordinal number
    3.5.3 Other numerals
  3.6 Classifiers in Nepali
    3.6.1 Numeral classifiers
    3.6.2 Quasi classifiers
  3.7 Summary

Chapter 4: Verbal morphology
  4.0 Outline
  4.1 Characteristics of verb in Nepali
    4.1.1 Significant verb stem finals
    4.1.2 Transitivity
    4.1.3 Syllabicity
    4.1.4 Sound आ a
  4.2 Morphological processes
    4.2.1 Causativization/transitivization
    4.2.2 Passivization
    4.2.3 Negativization
  4.3 Stem formation
  4.4 Grouping of verb stems
    4.4.1 Intransitive verb stems
    4.4.2 Transitive verb stem
    4.4.3 Irregular verb stems
    4.4.4 Suppletive verb stems
  4.5 Verbal inflections
    4.5.1 Auxiliary verbs in Nepali
    4.5.2 Tense
    4.5.3 Aspects
    4.5.4 Moods
    4.5.5 Participial forms
  4.6 Summary

Chapter 5: Adverbs, conjunctions, postpositions and particles
  5.0 Outline
  5.1 Adverbs in Nepali
    5.1.1 Temporal adverbs
    5.1.2 Spatial adverbs
    5.1.3 Amount adverbs
    5.1.4 Manner adverbs
    5.1.5 Frequency adverbs
    5.1.6 Reason adverbs
    5.1.7 Sentential adverbs
  5.2 Conjunctions in Nepali
    5.2.1 Coordinate conjunctions
    5.2.2 Subordinate conjunctions
  5.3 Postpositions in Nepali
    5.3.1 Plural/collective marker
    5.3.2 Case markers in Nepali
    5.3.3 Adverbial postpositions
  5.4 Particles and interjections in Nepali
    5.4.1 Particles
    5.4.2 Emphatic markers
    5.4.3 Interjections in Nepali
  5.5 Summary

Chapter 6: Derivational morphology
  6.0 Outline
  6.1 Prefixation
    6.1.2 Noun to noun derivation
    6.1.3 Noun to adjective derivation
    6.1.4 Noun to adverb derivation
    6.1.5 Adjective to adjective derivation
  6.2 Suffixation
    6.2.1 Noun to noun derivation
    6.2.2 Noun to adjective derivation
    6.2.3 Noun to noun/adjective derivation
    6.2.4 Adjective to noun derivation
    6.2.5 Adjective/noun to noun derivation
    6.2.6 Verb to noun derivation
    6.2.7 Verb to adjective derivation
    6.2.8 Verb to adverb derivation
    6.2.9 Adverb to adjective derivation
    6.2.10 Verb to noun conversion
    6.2.11 Verb to adjective/noun conversion
    6.2.12 Verb to noun derivation
  6.3 Summary

Chapter 7: Implementation
  7.0 Outline
  7.1 Morphotactics: syntax of morphemes
    7.1.1 Morphological categories
    7.1.2 Grammatical categories
  7.2 Lexc grammar
    7.2.1