A Step Towards Torwali Machine Translation: an Analysis of Morphosyntactic Challenges in a Low-Resource Language
Naeem Uddin, Torwali Research Forum, [email protected]
Jalal Uddin, Torwali Research Forum, [email protected]

LoResMT 2019 Dublin, Aug. 19-23, 2019 | p. 6

Abstract

Torwali is an endangered language spoken in the north of Pakistan. It is a computationally challenging language because of its RTL Perso-Arabic script, non-concatenative nature and distinct word alternations. This paper discusses issues and challenges regarding grammatical structure, divergence in terms of lexicon, and morphological make-up for the machine translation of a less studied language. It includes the creation of NLP tools such as a part-of-speech (POS) tagger and a morphological analyser with HFST, which is based on the idea of building a lexicon and morphological rules using finite-state devices. The work on which this paper is based will be a source of Torwali finite-state morphology and will support its future computational growth, as electronic dictionaries are usually equipped with a morphological analyser; it will also be helpful for developing language pairs.

Key words: Machine Translation, Low-resource language, Morphological analysis, language pairs

1 Introduction

Torwali belongs to the Kohistani sub-group of the Indo-Aryan Dardic languages, spoken in the upper reaches of district Swat in northern Pakistan. It has two dialects (the Bahrain and Chail dialects), with a total of approximately 90,000 to 100,000 speakers.

Torwali is written from right to left in a cursive, context-sensitive Perso-Arabic script and has a unique grammar (morphology and syntax). Because it is a marginalized and low-resource language, there are no robust morphological resources, which hinders progress on NLP (Natural Language Processing) tools for Torwali, though a digital Torwali dictionary is available along with some structured data. This paper discusses an attempt to create a morphological analyzer from scratch using HFST.

In NLP, morphological analysis is used to identify the morphemes and affixes of words in a language; individual words are analyzed into their components. Apart from computational linguistics, other applications require morphological analysis, e.g. text processing, information retrieval and user interfaces.

Morphologies nowadays are commonly written in special-purpose languages based on finite-state technology; one of them is HFST, which is based on regular expressions.

2 Goals

The goal of this study is to create a baseline system that paves the way for machine translation of Torwali, which will cover:

- POS tagging
- Creating a lexc
- Basic inflection rules using twol (two-level rules)

3 Unicode and input method

Unicode UTF-8 is used as the encoding scheme, as XFST/HFST files are always treated as UTF-8. As an input tool, the TRF phonetic keyboard (TRF 2L V1.0) is used, which was developed so that users can easily input text without resorting to an on-screen keyboard.

4 Morphological analysis using HFST

HFST (Helsinki Finite-State Technology) is a framework for compiling and applying linguistic descriptions with finite-state methods. Finite-state transducer methods are useful for solving problems involving language identification via morphological processing and POS tagging. There are two principal files in a morphological transducer in HFST: a lexc file, which is concerned with morphotactics, i.e. the way morphemes are joined together in a word, and a twol file, which describes phonological and orthographical alternation rules, i.e. what happens when the morphemes are joined together.

For the morphological analysis of Torwali, HFST/finite-state transducers were chosen because Torwali is still too immature, in terms of resources, for statistical machine translation, and because this implementation is done using Apertium, an open-source machine translation platform in which HFST can be used.

4.1 LEXC

Lexicon Compiler, or LEXC, is a finite-state compiler that reads sets of morphemes and their morphotactic combinations in order to create a finite-state transducer of a lexicon, also called a lexical transducer. A LEXC file contains morphemes grouped in sub-lexicon sets, whose entries in turn contain finite strings separated by ':' and a continuation class (a lexicon name).

4.2 TWOLC

TWOLC, the Two-Level Compiler, is a rule compiler that compiles two-level grammars into sets of finite-state transducers. Two-level rules are constraints on lexical word forms corresponding to surface forms. They describe morphological alternations such as ڙینگ:ڙینگو (weep: weeping) and بن:بنو (say: saying). TWOLC takes the surface forms produced by LEXC and applies rules to them; the rules vary depending on:

- morphological alteration of the stem,
- morphologically or phonologically conditioned deletion of a suffix,
- morphologically or phonologically conditioned insertion,
- morphologically or phonologically conditioned symbol change.

5 Torwali Morphology

Torwali has a unique morphology because it is basically a fusional language which uses several strategies such as stem modification and reduplication, and in which words exist in inflected, derived, compound and root forms. The morphological analyzer separates root and suffix morphemes in all lexical entries, i.e. in لناچا and أمیژیل, /لن/ and /أمیژ/ are roots and /چا/ and /یل/ are suffixes. The purpose of this section is to discuss Torwali morphology and its implementation in HFST for the main grammatical categories of Torwali, i.e. nouns, verbs, adjectives and pronouns.

5.1 Nouns

In Torwali, nouns are inflected for number and case, and the stem can be followed by an optional plural suffix and an optional oblique case marker. Torwali uses several strategies to mark plurality, but the primary morphological method is tone along with verb agreement: most singular nouns have a tone with rising pitch (low to high), and their plural counterparts have a tone with low pitch. Because tone is difficult to represent, the following approach is used for Torwali words that mark plurality by tone: singular and plural for masculine and feminine are handled in a single paradigm.

Table 1: tags

    Tag     Description
    N-M     Noun masculine
    N-F     Noun feminine
    N-MF    Noun masculine/feminine

    Multichar_Symbols
    ! Part of speech categories
    %<n%>    ! Noun

    ! Number morphology
    %<pl%>   ! Plural
    %<sg%>   ! Singular

    ! Gender
    %<m%>    ! Masculine
    %<f%>    ! Feminine

    LEXICON Root
    NounRoot ;

    LEXICON N-M
    %<n%>%<m%>%<sg%>: # ;
    %<n%>%<m%>%<pl%>: # ;

    LEXICON N-F
    %<n%>%<f%>%<sg%>: # ;
    %<n%>%<f%>%<pl%>: # ;

Words are added to the lexicon in the following way:

    LEXICON NounRoot
    آن:آن N-M ;
    أر:أر N-F ;

If we compile the lexicon with hfst-lexc and test it with hfst-fst2strings, it produces the following result:

    آن<n><m><sg>:آن
    آن<n><m><pl>:آن
    أر<n><f><sg>:أر
    أر<n><f><pl>:أر

The above output marks singular and plural for words with tonal change but carries no description of the tone's pitch. To form the oblique of nouns, a suffix /e/, /ے/ is added to the stem, as in the words /خار/, /خارے/ and /شان/, /شانے/.

Reduplication is another strategy to communicate plurality and intensity, but not in the same way as tone does. For instance, /گال مال/, /میل گیل/, /چُن چُن/, /پہٹ پہٹ/, /چٔی مٔی/.

Torwali noun forms can be derived from adjectives, which can be implemented by adding the suffix /اچا/ and making the noun root follow a continuation class:

    %<n%>%<nder%>%<m%>:اچا # ;
    %<n%>%<nder%>%<f%>:اچا # ;

where %<nder%> is the tag for a derived noun.

For noun inflection that involves stem modification, formulating standard rules is complex; here is a general conclusion. For the majority of masculine nouns the vowel changes from \a\ to \ə\, and for feminine nouns from \a\ to \æ\, but some masculine and feminine nouns behave differently. For morphological alteration of the stem, the following rules must be implemented using twolc, taking the surface forms produced by lexc:

- Delete (a, ا) to form plurals of masculine singular nouns when (a, ا) follows a consonant, as in /دان/, /دن/ and /ڙان/, /ڙن/.
- Replace (a, ا) by (æ, أ) to form plurals of feminine nouns when (a, ا) follows a consonant, as in /بات/, /بأت/ and /ڙات/, /ڙأت/.
- For noun inflections relating to size, replace (a, ا) by (æ, أ) to mark the small size of large-size nouns, e.g. /لہاڑ/, /لہأڑ/.

5.2 Verbs

Torwali verbs inflect for tense, aspect, mood and gender, and most verb forms make gender and number distinctions only, with no distinction for person. Torwali has three tenses: present, past and future. The suffix /i/, /ی/ can be used to mark feminine singular forms and present tense on feminine singular forms, and /u/, /و/ serves as the masculine singular suffix and marks present tense on masculine singular forms, with the suffix /i/, /ی/ also being used for present tense on plural forms. For infinitive verbs the suffix /u/ is added to the stem.

To make a test, only present tense on masculine and feminine, infinitives, and transitive and intransitive forms of the verb are selected, and in the continuation class the suffixes are added to mark the inflection associated with each of them, which are defined with suitable tags as shown below.

    Multichar_Symbols
    ! Part of speech categories
    %<v%>    ! Verb

    ! Number morphology
    %<pl%>   ! Plural
    %<sg%>   ! Singular

    ! Gender
    %<m%>    ! Masculine
    %<f%>    ! Feminine

    ! Verb forms
    %<pres%> ! Present
    %<inf%>  ! Infinitives
    %<vt%>   ! Verb Transitive
    %<vi%>   ! Verb Intransitive

    ! Other symbols
    %>       ! Morpheme boundary

    LEXICON Root
    Verbs ;

    LEXICON V-INF
    %<v%>%<inf%>:و # ;
    %<v%>%<m%>%<pres%>:دو # ;
    %<v%>%<f%>%<pres%>:جی # ;

    LEXICON VT
    %<v%>%<vt%>:ٔو # ;

    LEXICON VI
    %<v%>%<vi%>:ہؤو # ;

Now if we add some verbs,

    LEXICON Verbs
    ڙینگ:ڙینگ V ;
    بن:بن V-INF ;
    سُلُو:سُلُو VT ;
    سُورَئ:سُورَئ VI ;

The analyzer analyses these verb forms in the being tagged.
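
The noun fragments of Section 5.1 can be gathered into a single lexc source file. The following is a minimal sketch under stated assumptions, not the authors' actual file: the file name torwali-nouns.lexc is hypothetical, the two stems are the ones quoted in the paper, and the feminine plural entry is assumed to carry the %<f%> tag so that it matches the reported output. Because plurality is marked by tone, no overt suffix is added, and the singular and plural analyses share one surface form.

```lexc
! torwali-nouns.lexc (hypothetical file name)
! Sketch of the tone-plural noun paradigm from Section 5.1.
! No plural suffix is written: sg and pl differ only in their tags.

Multichar_Symbols
%<n%>    ! Noun
%<sg%>   ! Singular
%<pl%>   ! Plural
%<m%>    ! Masculine
%<f%>    ! Feminine

LEXICON Root
NounRoot ;

! Stems quoted in the paper: one masculine, one feminine.
LEXICON NounRoot
آن:آن N-M ;
أر:أر N-F ;

! Masculine paradigm: tags on the analysis side, empty surface side.
LEXICON N-M
%<n%>%<m%>%<sg%>: # ;
%<n%>%<m%>%<pl%>: # ;

! Feminine paradigm. Assumption: the plural is tagged %<f%>, not %<m%>.
LEXICON N-F
%<n%>%<f%>%<sg%>: # ;
%<n%>%<f%>%<pl%>: # ;
```

With the HFST command-line tools installed, such a file could be compiled with `hfst-lexc torwali-nouns.lexc -o torwali-nouns.hfst` and its analysis:surface pairs listed with `hfst-fst2strings torwali-nouns.hfst`. Keeping the tags on the analysis side only is what lets a later twolc rule set, or a tone annotation, be layered on without changing the stem entries.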