A step towards Torwali machine translation: an analysis of morphosyntactic challenges in a low-resource language
Naeem Uddin Jalal Uddin Torwali Research Forum Torwali Research Forum [email protected] [email protected]
dictionary available along with some structured Abstract data. This paper discusses an attempt to create a morphological analyzer using HFST from Torwali is an endangered language spo- scratch. ken in the north of Pakistan. It is a com- In NLP, morphological analysis is used to putationally challenging language be- identify the morpheme and affixes of words in a cause of its RTL Perso-Arabic script, language and individual words are analyzed into non-concatenative nature and distinct their components. Apart from computational words alterations. This paper discusses linguistics, there are other uses that require issues and challenges regarding gram- morphological analysis e.g text processing, matical structure, divergence in terms of information retrieval and user interfaces. lexicon as well as morphological make- Morphologies nowadays are commonly writ- up for the machine translation of a less ten by using special purpose languages based on studied language. It includes creation of finite state technology, one of them is HFST NLP tools such as parts of speech (POS) which is based on regular expressions. tagger and morphological analyser with HFST which is based on the idea of 2 Goals building lexicon and morphological rules using finite state devices. This work, on The goal of this study is to create a baseline which this paper is based, will be a system that paves way for machine translation of source of Torwali finite state morphology Torwali which will cover: and its future computational growth as POS tagging electronic dictionaries are usually Creating a lexc equipped with morphological analyser Basic inflection rules using twol (two and it will also be helpful for developing level rule) language pairs. 3 Unicode and input method Key words: Machine Translation, Low- Unicode UTF-8 encoding is used as an encoding resource language, Morphological analy- scheme as XFST/HFST files are always treated sis, language pairs as UTF-8. As an Input tool TRF phonetic keyboard (TRF 1 Introduction 2L V1.0) is used which is developed so that users Torwali belongs to the Kohistani sub-group of can easily input texts without going on-screen. the Indo-Aryan Dardic languages, spoken in the 4 Morphological analysis using HFST upper reaches of district Swat of northern Paki- stan. It has two dialects (the Bahrain and Chail HFST-Helsinki finite-state Technology is a dialects), with a total of approximately 90,000 to framework for compiling and applying linguistic 100,000 speakers. descriptions with finite state methods. Finite- Torwali is written in a cursive, context sensi- state transducers methods are useful for solving tive Perso-Arabic script from left to right having problems involving language identification via unique grammar (morphology + syntax). Being morphological processing and POS tagging. a marginalized and low resource language, there There are two principle files in a morphological are no robust morphology sources which hinders transducer in HFST, a lexc file which is progress in NLP (Natural Language Processing) concerned with morphotactics i-e about the way tools for Torwali thou there is a digital Torwali morphemes are joined together in a word and
LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 6 twol file is used to describe phonological and 5.1 Nouns orthographical alternation rules i-e about what In Torwali, nouns are inflected for number and happened when the morphemes are joined case and the stem can be joined by an optional together. plural suffix and an optional oblique case marker. For morphological analysis of Torwali Torwali uses several strategies to mark plurality HFST/finite state transducers are chosen because but the primary morphological method is tone Torwali language is quite immature for statistical along with verb agreement like for most of the machine translation and also this implementation singular nouns have a tone with rising pitch from is done using Apertium, which is an open source low-to-high and their plural counterparts have a machine translation platform in which HFST can tone with low pitch. Due to the issue related to be used. representing tone, Torwali words which use tone 4.1 LEXC to mark plurality the following approach is used where singular/plural for masculine and feminine Lexicon Compiler or LEXC is a finite-state are handled in a single paradigm. compiler also called a lexical transducer that reads set of morphemes and their morphotactic combinations in order to create finite-state Table 1: tags transducer of a lexicon. LEXC contains morphemes grouped in sub-lexicon sets which in tag description turn contain finite strings separated by ‘:’ and a N-M Noun masculine continuation class (a lexicon name).
4.2 TWOLC N-F Noun feminine TWOLC, Two-Level Compiler is a two level Noun mascu- rule compiler used for compiling grammars of N-MF line/feminine two levels into finite state transducers sets. Two level rules are constraints on lexical word forms corresponding to surface forms, It describes ڙینگ:ڙینگو morphological alternations such as Multichar_Symbols say: saying). It takes) بن:بنو ,(weep: weeping) ! Part of speech categories surface forms produced by LEXC and applies %
5 Torwali Morphology LEXICON Root NounRoot ; Torwali has a unique morphology because it is LEXICON N-M basically a fusional language which uses several %
LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 7 person. Torwali has three tenses: present, past can be used to mark /ی/ ,/If we compile the lexicon with hfst-lexc and and future. The suffix /i test it with hfst-fst2strings, it spits out the feminine singular forms and present tense on as masculine /و/ ,/following result: feminine singular forms, /u singular suffix and present tense on masculine being used /ی/ ,/singular forms with the suffix /i آن:
The above output marks singularity and plurality masculine and feminine, infinitives, transitive for words having tonal change but have no de- and intransitive forms of verb are selected and in the continuation class the suffixes are added to scription about the tone’s pitch. To make plural mark inflection associated with each of them is added to the /ے/ ,/oblique of nouns a suffix /e which are defined with suitable tags as shown ,/شان/ and /خارے/ ,/خار/ ;stem as in the words below. ./شانے/ Reduplication is another strategy to communi- Multichar_Symbols cate plurality and intensity but not in the same ! Part of speech categories v%> ! Verb>% ,/میل گیل/ ,/گال مال/ ,way as tone does. For instance ./چ ٔی م ٔی / ,/پہٹ پہٹ/ ,/ ُچن ُچن/ Torwali noun forms can be derived from ad- ! Number morphology jectives which can be implemented by adding the %
LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 8 The analyzer analyses these verb forms in the being tagged. We found HFST a good choice for following way when compiled with hfst-lexc implementing Torwali Morphology for now as and tested with hfst-fst2strings: this is the first ever attempt to implement Torwali morphology using FST. However; to develop a $ hfst-fst2strings trw.lexc.hfst full fledge Morphological analyzer more work has to be done. The major problem which we ڙینگ:
6 Result and Conclusions Grierson, George A. 1929. Torwali: An Account of a Dardic Language of the Swat Kohistan. Royal Asiatic This work presents a straight forward Society. London. implementation of Towali morphology analyzer using HFST, which implemented the basic K. Linden, M. Silfverberg, , T. Pirinen, “HFST Tools inflections of nouns, verbs and other POS like for Morphology—An Efficient Open Source Package adjectives, adverbs, pronouns and postpositions
LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 9 for Construction of Morphological Analyzers”, In: Mahlow, Piotrowski (eds.) , pp. 28–47,2009.
Lunsford, Wayne. 2001. An Overview of Linguistic Structures in Torwali, A Language of Northern Pakistan. M.A. Thesis, University of Texas at Arlington.
Ullah, Inam. 2004. ‘Lexical Database of the Torwali Dictionary.’ In The Asia lexicography conference. Chiangmai: Payap University.
LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 10