A step towards Torwali machine translation: an analysis of morphosyntactic challenges in a low-resource language

Naeem Uddin, Torwali Research Forum
Jalal Uddin, Torwali Research Forum

Abstract

Torwali is an endangered language spoken in the north of Pakistan. It is a computationally challenging language because of its right-to-left (RTL) Perso-Arabic script, its non-concatenative nature and its distinct word alternations. This paper discusses issues and challenges regarding grammatical structure, lexical divergence and morphological make-up for the machine translation of a less-studied language. It includes the creation of NLP tools such as a part-of-speech (POS) tagger and a morphological analyser with HFST, which is based on the idea of building a lexicon and morphological rules using finite-state devices. The work on which this paper is based will be a source of Torwali finite-state morphology and will support its future computational growth, as electronic dictionaries are usually equipped with a morphological analyser; it will also be helpful for developing language pairs.

Key words: Machine Translation, Low-resource language, Morphological analysis, language pairs

1 Introduction

Torwali belongs to the Kohistani sub-group of the Indo-Aryan languages and is spoken in the upper reaches of district Swat in northern Pakistan. It has two dialects (the Bahrain and Chail dialects), with a total of approximately 90,000 to 100,000 speakers.

Torwali is written from right to left in a cursive, context-sensitive Perso-Arabic script and has a unique grammar (morphology and syntax). Because it is a marginalized, low-resource language, there are no robust morphological resources, which hinders progress on NLP (Natural Language Processing) tools for Torwali, although a digital Torwali dictionary is available along with some structured data. This paper discusses an attempt to create a morphological analyzer from scratch using HFST.

In NLP, morphological analysis identifies the morphemes and affixes of the words of a language: individual words are analyzed into their components. Apart from computational linguistics, other applications also require morphological analysis, e.g. text processing, information retrieval and user interfaces. Morphologies nowadays are commonly written in special-purpose languages based on finite-state technology; one of them is HFST, which is based on regular expressions.

2 Goals

The goal of this study is to create a baseline system that paves the way for machine translation of Torwali. It will cover:

• POS tagging
• Creating a lexc
• Basic inflection rules using twol (two-level rules)

3 Unicode and input method

UTF-8 is used as the encoding scheme, as XFST/HFST files are always treated as UTF-8. As an input tool, the TRF phonetic keyboard (TRF 2L V1.0) is used; it was developed so that users can easily input text without resorting to an on-screen keyboard.

4 Morphological analysis using HFST

HFST (Helsinki Finite-State Technology) is a framework for compiling and applying linguistic descriptions with finite-state methods. Finite-state transducer methods are useful for solving problems involving language identification via morphological processing and POS tagging. There are two principal files in a morphological transducer in HFST. The lexc file is concerned with morphotactics, i.e. the way morphemes are joined together in a word.

LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 6

The twol file describes phonological and orthographical alternation rules, i.e. what happens when the morphemes are joined together.

HFST finite-state transducers were chosen for the morphological analysis of Torwali for two reasons: Torwali, as a low-resource language, is not yet ready for statistical machine translation, and this implementation is done using Apertium, an open-source machine translation platform in which HFST can be used.

4.1 LEXC

The Lexicon Compiler, or LEXC, is a finite-state compiler that reads sets of morphemes and their morphotactic combinations in order to create a finite-state transducer of a lexicon, also called a lexical transducer. A LEXC file contains morphemes grouped into sub-lexicons; each entry consists of a pair of finite strings separated by ':' followed by a continuation class (a lexicon name).

4.2 TWOLC

TWOLC, the Two-Level Compiler, compiles two-level grammars into sets of finite-state transducers. Two-level rules are constraints on how lexical word forms correspond to surface forms. They describe morphological alternations such as ڙینگ:ڙینگو (weep: weeping) and بن:بنو (say: saying). TWOLC takes the surface forms produced by LEXC and applies rules to them. The rules vary depending on whether they describe a morphological alteration of the stem, a morphologically or phonologically conditioned deletion of a suffix, a morphologically or phonologically conditioned insertion, or a morphologically or phonologically conditioned symbol change.

5 Torwali Morphology

Torwali has a unique morphology because it is basically a fusional language that uses several strategies, such as stem modification and reduplication, and in which words exist in inflected, derived, compound and root forms. The morphological analyzer separates root and suffix morphemes in all lexical entries, e.g. in أمیژیل and لناچا, /أمیژ/ and /لن/ are roots and /یل/ and /چا/ are suffixes. The purpose of this section is to discuss Torwali morphology and its implementation in HFST for the main grammatical categories of Torwali, i.e. nouns, verbs, adjectives and pronouns.

5.1 Nouns

In Torwali, nouns are inflected for number and case, and the stem can be followed by an optional plural suffix and an optional oblique case marker. Torwali uses several strategies to mark plurality, but the primary morphological method is tone, along with verb agreement: most singular nouns have a tone with rising pitch (low to high), and their plural counterparts have a tone with low pitch. Because tone is difficult to represent in writing, words that use tone to mark plurality are handled with the following approach, in which singular and plural for masculine and feminine nouns are covered by a single paradigm.

Table 1: tags

tag     description
N-M     Noun masculine
N-F     Noun feminine
N-MF    Noun masculine/feminine

Multichar_Symbols
! Part of speech categories
% ! Noun
! Number morphology
% ! Plural
% ! Singular
! Gender
% ! Masculine
% ! Feminine

LEXICON Root
NounRoot ;

LEXICON N-M
%%%: # ;
%%%: # ;

LEXICON N-F
%%%: # ;
%%%: # ;

Words are added to the lexicon in the following way:

LEXICON NounRoot
آن:آن N-M ;
أر:أر N-F ;
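The way continuation classes pair each noun root with its analyses can be sketched in Python. This is only an illustrative model, not the HFST implementation: the tag spellings (<n>, <m>, <f>, <sg>, <pl>) and the helper name expand are assumptions made for the example, while the two roots and their classes are taken from LEXICON NounRoot above.

```python
# Toy expansion of a lexc-style lexicon into analysis:surface pairs.
# The tag spellings below are assumed for illustration only.

# Each continuation class lists the tag strings it appends on the analysis side.
CONTINUATIONS = {
    "N-M": ["<n><m><sg>", "<n><m><pl>"],
    "N-F": ["<n><f><sg>", "<n><f><pl>"],
}

# Root lexicon: (lexical form, surface form, continuation class).
NOUN_ROOT = [
    ("آن", "آن", "N-M"),
    ("أر", "أر", "N-F"),
]


def expand(lexicon, continuations):
    """Return every (analysis, surface) pair licensed by the lexicon."""
    pairs = []
    for lexical, surface, cont in lexicon:
        for tags in continuations[cont]:
            # Tone is not written, so singular and plural share one surface form.
            pairs.append((lexical + tags, surface))
    return pairs


if __name__ == "__main__":
    for analysis, surface in expand(NOUN_ROOT, CONTINUATIONS):
        print(f"{analysis}:{surface}")
```

Each root yields one singular and one plural analysis over the same written form, which is why tone-marked plurals cannot be told apart from the surface string alone.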

If we compile the lexicon with hfst-lexc and test it with hfst-fst2strings, it produces the following result:

آن:آن
آن:آن
أر:أر
أر:أر

The above output marks singular and plural for words with tonal change but gives no description of the tone's pitch. To form the oblique plural of nouns, the suffix /e/ (ے) is added to the stem, as in /خار/, /خارے/ and /شان/, /شانے/.

Reduplication is another strategy to communicate plurality and intensity, but it does not work in the same way as tone. For instance: /گال مال/, /میل گیل/, /ُچن ُچن/, /پہٹ پہٹ/, /چ ٔی م ٔی/.

Torwali noun forms can be derived from adjectives, which can be implemented by adding the suffix /اچا/ and making the noun root follow a continuation class:

%%%<n%>:%>اچا # ;
%%%<n%>:%>اچا # ;

Where % is the tag for derived noun.

For noun inflection that involves stem modification, it is difficult to formulate standard rules; the general pattern is as follows. For the majority of masculine nouns the vowel changes from /a/ to /ə/, and for feminine nouns from /a/ to /æ/, but some masculine and feminine nouns behave differently. For morphological alteration of the stem, the following rules must be implemented using twolc, taking the surface forms produced by lexc:

• Replace (a, ا) by (ə) for making plurals of masculine nouns when the (a, ا) follows a consonant, as in /ناد/, /دن/ and /ڙان/, /ڙن/.
• Replace (a, ا) by (æ, أ) for making plurals of feminine nouns when the (a, ا) follows a consonant, as in /بات/, /بأت/ and /ڙات/, /ڙأت/.
• For noun inflections relating to size, replace (a, ا) by (æ, أ) to mark the small size of large-size nouns, e.g. /لہاڑ/, /لہأڑ/.

5.2 Verbs

Torwali verbs inflect for tense, aspect, mood and gender. Most verb forms make only gender and number distinctions, with no distinction for person. Torwali has three tenses: present, past and future. The suffix /i/ (ی) marks feminine singular forms and present tense on feminine singular forms; /u/ (و) is the masculine singular suffix and marks present tense on masculine singular forms; the suffix /i/ (ی) is also used for present tense on plural forms. For infinitive verbs the suffix /u/ is added to the stem.

As a test, only present tense on masculine and feminine forms, infinitives, and transitive and intransitive verb forms are selected. In the continuation classes, suffixes are added to mark the inflection associated with each of them, and these are defined with suitable tags as shown below.

Multichar_Symbols
! Part of speech categories
%<v%> ! Verb
! Number morphology
% ! Plural
% ! Singular
! Gender
%<m%> ! Masculine
% ! Feminine
! Verb forms
% ! Pres
% ! Infinitives
% ! Verb Transitive
% ! Verb Intransitive
! Other symbols
%> ! Morpheme boundary

LEXICON Root
Verbs ;

LEXICON V-INF
%%%%%%%:%>و # ;

LEXICON VT
%<v%>%%:%>ٔو # ;

LEXICON VI
%<v%>%%:%>ہؤو # ;

Now if we add some verbs,

LEXICON Verbs
ڙینگ:ڙینگ V ;
بن:بن V-INF ;
ُسلُو:ُسلُو VT ;
ُسو َرئ:ُسو َرئ VI ;
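The division of labour between the two files can be illustrated with a small Python sketch: a lexc-like step attaches a suffix behind a morpheme boundary, and a twol-like step then produces the written form. This is a simplified model under stated assumptions, not the authors' code. The function names are hypothetical, only the infinitive suffix و (/u/) is modelled, and the only rule applied is deletion of the boundary symbol.

```python
# Toy model of verb suffixation: stem + boundary + suffix, then boundary removal.
BOUNDARY = ">"  # morpheme boundary symbol, as declared under "Other symbols"

# Suffixes per continuation class; only the infinitive suffix from the text is modelled.
SUFFIXES = {"V-INF": ["و"]}

# A subset of LEXICON Verbs: (stem, continuation class).
VERBS = [("بن", "V-INF")]


def intermediate_forms(verbs, suffixes):
    """lexc-like step: concatenate stem, boundary and each suffix of the class."""
    return [
        stem + BOUNDARY + suffix
        for stem, cont in verbs
        for suffix in suffixes.get(cont, [])
    ]


def surface(form):
    """twol-like step: delete the morpheme boundary to get the written form."""
    return form.replace(BOUNDARY, "")


if __name__ == "__main__":
    for form in intermediate_forms(VERBS, SUFFIXES):
        print(form, "->", surface(form))
```

Keeping the boundary in the intermediate form is what lets later rules refer to the stem edge, for example rules that should apply only to stems ending in a consonant.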

The analyzer analyses these verb forms in the following way when compiled with hfst-lexc and tested with hfst-fst2strings:

$ hfst-fst2strings trw.lexc.hfst
ڙینگ:ڙینگ
بن:بن>و
بن:بن>دو
بن:بن>جی
ُسلُو:ُسلُو>ٔو
ُسو َرئ:ُسو َرئ>ہؤو

From the above output it is concluded that the following implementations can be done. These rules can be applied to verbs whose stem ends with a consonant.

• Adding the suffix /دُد/ to represent past tense on the infinitive verb.
• Adding the suffix /نین/ to mark future tense on the finite verb.
• Adding the suffix /سأت/ to mark the inceptive of the infinitive verb.
• For present perfective, adding the suffix /و/ on masculine singular and /ی/ for both plurals and feminine singular.
• Suffix /ودو/ for masculine singular, /یجی/ for feminine singular and /یدی/ for plurals to mark present perfective on finite verbs.
• Suffix /وشو/ for masculine singular and /یشی/ for both feminine singular and plurals to mark past perfective on the finite verb.

Verbs ending with a vowel inflect differently. They sometimes behave like verbs with consonant-final stems with a minor modification: some plural forms tend to have /أ/ attached to the stem before the plural suffix is applied. However, most verbs with vowel-final stems follow different configurations.

5.3 Adjectives, Pronouns, Adverbs and closed classes

In a similar way, adjectives, pronouns, adverbs, postpositions, conjunctions and interjections have been implemented with the same level of detail.

6 Result and Conclusions

This work presents a straightforward implementation of a Torwali morphological analyzer using HFST, which implements the basic inflections of nouns and verbs, with other parts of speech such as adjectives, adverbs, pronouns and postpositions being tagged. We found HFST a good choice for implementing Torwali morphology for now, as this is the first attempt to implement Torwali morphology using FSTs. However, more work is needed to develop a full-fledged morphological analyzer. The major problems we face are the unpredictable stem changes in nouns, variation in nouns marked by change of tone, and the distinct behavior of vowel-final verbs and nouns. More needs to be learned about Torwali morphology regarding affixes and varying stems, and more rules need to be defined.

7 Future work

This work could be extended in the following directions, depending on feasibility:

• Addition of missing diacritic marks to words
• A technique to interpret the tone of nouns in order to identify singular/plural nouns by tone
• Algorithms to differentiate phonetically similar words
• A comprehensive implementation of Torwali syntax

8 References

Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Publications, Palo Alto (2003).

K. Lindén, A. Erik, H. Sam, A. P. Tommi, and S. Miikka, "HFST-Framework for Compiling and Applying Morphologies." International Workshop on Systems and Frameworks for Computational Morphology, Springer Berlin Heidelberg, 2011.

K. Linden, E. Axelson, S. Drobac, S. Hardwick, J. Kuokkala, J. Niemi, T. Pirinen and Silfverberg, "HFST - A System for Creating NLP Tools." International Workshop on Systems and Frameworks for Computational Morphology, Springer Berlin Heidelberg, 2013.

Ullah, Inam (2019) "Digital Dictionary Development for Torwali, A Less-studied Language: Process and Challenges," Proceedings of the Workshop on Computational Methods for Endangered Languages: Vol. 2, Article 3.

Grierson, George A. 1929. Torwali: An Account of a Dardic Language of the Swat Kohistan. Royal Asiatic Society. London.

K. Linden, M. Silfverberg, T. Pirinen, "HFST Tools for Morphology - An Efficient Open Source Package for Construction of Morphological Analyzers", In: Mahlow, Piotrowski (eds.), pp. 28–47, 2009.

Lunsford, Wayne. 2001. An Overview of Linguistic Structures in Torwali, A Language of Northern Pakistan. M.A. Thesis, University of Texas at Arlington.

Ullah, Inam. 2004. ‘Lexical Database of the Torwali Dictionary.’ In The Asia lexicography conference. Chiangmai: Payap University.
