<<

1 Survey on Publicly Available Natural Language Processing Tools and Research

Nisansa de Silva

Abstract—

Sinhala is the native language of the who make up the largest ethnic group of Sri . The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field.

Index Terms—Sinhala, Natural Language Processing, Resource Poor Language !

1 INTRODUCTION one hand, according to Liddy [24], these systems can be categorized into the following layers; phonological, morpho- Sinhala1 language, being the native language of the Sin- logical, lexical, syntactic, semantic, discourse, and pragmatic. halese people [2–4], who make up the largest ethnic group The phonological layer deals with the interpretation of lan- of the island country of Sri Lanka, enjoys being reported as guage sounds. As such, it consists of mainly speech-to-text the mother tongue of approximately 16 million people [5,6]. and text-to-speech systems. In cases where one is working To give a brief linguistic background for the purpose of with written text of the language rather than speech, it is aligning the with the baseline of English, possible to replace this layer with tools which handle Op- primarily it should be noted that Sinhala language belongs tical Character Recognition (OCR) and language rendering to the same the Indo-European language tree [7,8]. How- standards (such as [26]). The morphological layer ever, unlike English, which is part of the Germanic branch, analyses at their smallest units of meaning. As such, Sinhala belongs to the Indo-Aryan branch. Further, Sinhala, analysis on lemmas and prefix-suffix-based inflection unlike English, which borrowed the Latin alphabet, has its are handled in this layer. Lexical layer handles individual own , which is a descendant of the Indian words. Therefore tasks such as Part of Speech (PoS) tagging [9–15]. By extension, this makes happens here. The next layer, syntactic, takes place at the a member of the Aramaic family of scripts [16, 17]. Thus phrase and sentence level where grammatical structures by inheritance, Sinhala writing system is (alphasyl- are utilized to obtain meaning. Semantic layer attempts to labary), which to say that consonant-vowel sequences are derive the meanings from the word level to the sentence written as single units [18]. It should be noted that the mod- level. Starting with Named Entity Recognition (NER) at ern Sinhala language have loanwords from languages such arXiv:1906.02358v10 [cs.CL] 3 Sep 2021 the word level and working its way up by identifying the as Tamil, English, Portuguese, and Dutch due to various contexts they are set in until arriving at overall meaning. The historical reasons [19]. Regardless of the rich historical array discourse layer handles meaning in textual units larger than a of literature spanning several millennia (starting between rd nd sentence. In this, the function of a particular sentence maybe 3 to 2 century BCE [20, 21]), modern natural language contextualized within the document it is set in. Finally, the processing tools for the Sinhala language are scarce [22]. pragmatic layer handles contexts read into contents without Natural Language Processing (NLP) is a broad area cov- having to be explicitly mentioned [23, 24]. Some forms ering all computational processing and analysis of human of anaphora (coreference) resolution [27–31] fall into this languages. To achieve this end, NLP systems operate at application. different levels [23–25]. A graphical representation of NLP On the other hand, Wimalasuriya and Dou [25] catego- layers and application domains are shown in Figure1. On rize NLP tools and research by utility. They introduce three categories with increasing complexity; Information Retrieval • Nisansa de Silva is with the Department of Computer Science & Engi- (IR), (IE), and Natural Language Un- neering, University of Moratuwa. E-mail: [email protected] derstanding (NLU). Information Retrieval covers applications, Manuscript revised: September 7, 2021. which search and retrieve information which are relevant 1. Englebretson and Genetti [1] observe that in some contexts the to a given query. For pure IR, tools and methods up-to Sinhala language is also referred as Sinhalese, Singhala, and Singhalese and including the syntactic layer in the above analysis are 2

Natural Language Understanding

Domain knowledge Pragmatic Layer

Application reasoning and execution

Contextual Utterance Reasoning Discourse context planning Discourse Layer Information Extraction

NER and Semantic Semantic realisation tagging Semantic model Semantic Layer

Information Retrieval Syntactic realisation Grammar Syntactic Layer

PoS tagging Lexical realisation Lexicon Lexical Layer

Morphological Analysis Morphological realisation Morphological Rules Morphological Layer

Speech Analysis Phonological Layer Pronunciation model

Input text Output text

Fig. 1: NLP layers and tasks [23]

used. Information Extraction, on the other hand, extracts this survey, we shall also consider the Sri Lankan NLP tools structured information. The difference between IR and IE repository, lknlp3. This manuscript is at version 4.0.0. The is the fact that IR does not change the structure of the latest version of the manuscript can be obtained from arXiv4 documents in question. Be them structured, semi-structured, or ResearchGate5. or unstructured, all IR does is fetching them as they are. In Figure3 in AppendixA shows the most prolific re- comparison, IE, takes semi-structured or unstructured text searchers in the domain of Sinhala NLP. The nodes contain and puts them in a machine readable structure. For this, IE the name of the researcher along with the total number of utilizes all the layers used by IR and the semantic layer. Nat- Sinhala NLP papers that researcher has authored. The edges ural Language Understanding is purely the idea of cognition. between two researchers are labeled with the number of Most NLU tasks fall under AI-hard category and remain Sinhala NLP papers the relevant pair of researchers have co- unsolved [23]. However, with varying accuracy, some NLU authored. Given that the objective of the visualization is to tasks such as machine translation2 are being attempted. portray the co-operation between researchers, the threshold The pragmatic layer of the above analysis belongs to the used here is to have at least 3 papers on Sinhala NLP NLU tasks while the discourse layer straddles information with on another researcher. We have also added labels to extraction and natural language understanding [23]. clusters in cases where all or the majority of researchers The objective of this paper is to serve as a comprehensive in those clusters have the same affiliation. It is observable survey on the state of natural language processing resources that the cluster from the Department of Computer Science & for the Sinhala language. The initial structure and content Engineering, University of Moratuwa is the most prolific in of this survey are heavily influenced by the preliminary Sinhala NLP research. surveys carried out by de Silva [22] and Wijeratne et al. The remainder of this survey is organized as follows; [23]. However, our hope is to host this survey at arXiv Section2 introduces some important properties and con- as a perpetually evolving work which continuously gets ventions of the Sinhala language which are important for updated as new research and tools for Sinhala language the development and understanding of Sinhala NLP. Sec- are created and made publicly available. Hence, it is our tion3 discusses the various tools and research available for hope that this work will help future researchers who are Sinhala NLP. In this section we would discuss both pure engaged in Sinhala NLP research to conduct their literature Sinhala NLP tools and research as well as hybrid Sinhala- surveys efficiently and comprehensively. For the success of 3. https://github.com/lknlp/lknlp.github.io 2. This is, however, not without the criticism of being nothing more 4. https://arxiv.org/abs/1906.02358 than a Chinese room [32] rather than true NLU. 5. http://bit.ly/31AhvvR 3

English work. We will also discuss research and tools which Herath et al. [49] categorize Sinhala suffixes along the contributes to Sinhala NLP either along with or by the help attributes of: gender, number, definiteness, case, and con- of Tamil, the other official language of Sri Lanka. Section4 junctive. They further claim that there are 3 types of suffixes: gives a brief introduction to the primary language sources Suf1 adds gender, number, and definiteness; Suf2 adds case; used by the studies discussed in this work. Finally, Section5, and Suf3 adds conjunctive. Conjunctive is claimed to be concludes the survey. equivalent to too and and in English. We show an extension of the suffix structure proposed by Herath et al. [49] in 2 PROPERTIESOFTHE SINHALA LANGUAGE Table4. Before moving on to discussing Sinhala NLP resources, we shall give a brief introduction to some of the important 3 SINHALA NLP RESOURCES properties of Sinhala language, which impact the develop- ment of Sinhala NLP resources. Sinhala grammar has two In this section we generally follow the structure shown in forms: written (literary) and spoken. These forms differ from Figure1 for sectioning. However, in addition to that, we each-other in their core grammatical structures [1, 33, 34]. also discuss topics such as available corpora, other data The written form strictly adheres to the SOV (, sets, dictionaries, and . We focus on NLP tools Object, and ) configuration [35, 36]. Further, in the and research rather than the mechanics of language script written form, subject-verb agreement is enforced [37] such handling [50–55]. One of the earliest attempts on Sinhala that, in order to be grammatically correct, the subject and the NLP was done by Herath et al. [56]. However, progress verb must agree in terms of: gender (male/female), num- on that project has been minimal due to the limitations of ber (singular/plural) and person (1st/2nd/3rd). However, their time. The later work by Nandasara [57] has not caught in spoken Sinhala, the SOV order can be neglected [38] much of the advances done up to the time of its publication. and male singular 3rd person verb can be used for all Given that it was a decade old by the time the first edition nouns [37]. Sinhala is also a head-final language, where of this survey was compiled, we observe the existence of the complements and modifiers would appear before their many new discoveries in Sinhala NLP which have not been heads [39] this is similar to that of English and dissimilar taken into account by it. A review on some challenges and to that of French. In total, according to Abhayasinghe [40], opportunities of using Sinhala in computer science was there are 25 types of simple sentence structures in Sinhala. done by Nandasara and Mikami [58]. At this point, it is Similar to many Indo Aryan languages, plays a worthy to note that the largest number of studies in Sinhala major role in Sinhala grammar in syntactic and semantic NLP has been on optical character recognition (OCR) rather roles [41–43]. Comparative studies done by Noguchi [44] than on higher levels of the hierarchy shown in Figure1. On and by Miyagishi [34, 45] have found that animacy extends the other hand, the most prolific single project of Sinhala its influence from phrase level to sentence level in Sinhala NLP we have observed so far is an attempt to create an (e.g., Usage of post-positions [35, 46]). On this matter, Ta- end-to-end Sinhala-to-English translator [18, 59–76]. Tamil, ble1 explains grammatical cases and inflections of animate the other official language of Sri Lanka is also a resource common nouns while Table2 explains grammatical cases poor language. However, due to the existence of larger and inflections of inanimate common nouns. We provide populations of Tamil speakers worldwide, including but not a comparative analysis of parsing the very simple English limited to economic powerhouses such as India, there are sentence “I eat a red apple” and its Sinhala, , and French more research and tools available for Tamil NLP tasks [23]. translations in Fig2. English and French parsing was done Therefore, it is rational to notice that Sinhala and Tamil NLP using the Stanford Parser6. Hindi parsing was done using the endeavours can help each other. Especially, given the above IIIT-Hyderabad Parser7 and the study by et al. [47]. fact, that these are official languages of Sri Lanka, results Herath et al. [48, 49] argue that pure Sinhala words did in the generation of parallel data sets in the form of official not have suffixes and that adding suffixes was incorporated government documents and local news items. A number to Sinhala after 12th century BC with the influx of of researchers make use of this opportunity. We shall be words. With this, they declare Sinhala to have to following discussing those applications in this paper as well. Further, types of words: there have been some fringe implementations, which bridge Sinhala with other languages such as Japanese [8, 21, 77–79]. 1) Suffixes 2) Nouns 3) Cases 3.1 Corpora 4) For any language, the key for NLP applications and im- 5) Conjunctions and articles plementations is the existence of adequate corpora. On this 6) Adjectives matter, a relatively substantial Sinhala text corpus8 was 7) , Interrogatives, and negatives created by Upeksha et al. [80, 81] by web crawling. Later a 8) Particles and prefixes smaller Sinhala newes corpus9 was created by de Silva [22]. They further divide nouns into five groups: material, agen- Both of the above corpora are publicly available. However, tive, common, abstract, and proper. In addition to these, none of these come close to the massive capacity and range they also introduce nouns.We show the noun of the existing English corpora. A word corpus of ap- categorization proposed by Herath et al. [48] in Table3. proximately 35,000 entries was developed by Weerasinghe

6. http://nlp.stanford.edu:8080/parser/ 8. https://osf.io/a5quv/ 7. http://ltrc.iiit.ac.in/analyzer/ 9. https://osf.io/tdb84/ 4 TABLE 1: Examples for grammatical cases and inflection of animate common nouns Singular Plural Form Case Masculine Feminine Masculine Feminine Definite Indefinite Definite Indefinite ගැහැය 1 Nominative සා ෙස ගැහැය ස් ගැහැ ගැහැෙය 2 Accusative සා ෙස ගැහැය ගැහැයක ගැහැ 3 Auxiliary ගැහැයට 4 Dative සාට ෙසට ගැහැයට ට ගැහැට ගැහැයකට 5 Genitive සාෙ ෙසෙ ගැහැයෙ ගැහැයකෙ ෙ ගැහැෙ 6 Locative 7 Instrumental සාෙග ෙසෙග ගැහැයෙග ගැහැයකෙග ෙග ගැහැෙග 8 Ablative 9 Vocative ස ස ගැහැය ගැහැය ගැහැ

TABLE 2: Examples for grammatical cases and inflection released16 a massive corpus of text and stop words taken of inanimate common nouns Note that the grammatical cases of Auxiliary and Vocative do not exist from a decade of Sinhala Facebook posts. A parallel corpus for inanimate nouns in Sinhala. of Sinhala and English was collected by Banon´ et al. [86] containing 217,407 sentences and available to download Singular 17 Plural from their website . However the later audit by Caswell Form Case Definite Indefinite et al. [87] raised issues on the quality of that data set. 1 Nominative A parallel corpus18 of aligned Sinhala-English documents ෙපත ෙපත ෙප 2 Accusative and sentences obtained from crawling the web was released 4 Dative ෙපතට ෙපතකට ෙපවලට by Sachintha et al. [88]. 5 Genitive ෙපෙ As for Sinhala-Tamil corpora, Hameed et al. [89] claim ෙපතක ෙපවල 6 Locative ෙපෙත to have built a sentence aligned Sinhala-Tamil parallel cor- 7 Instrumental ෙපෙත pus and Mohamed et al. [90] claim to have built a word ෙපත ෙපව 8 Ablative ෙප aligned Sinhala-Tamil parallel corpus. However, at the time of writing this paper, neither of them was publicly available. TABLE 3: Noun categorization by Herath et al. [48] A very small Sinhala-Tamil aligned parallel corpus created by Farhath et al. [91] using order papers of government Type Examples of Sri Lanka is available to download19. A Sinhala and Material , ෙගය Tamil annotated corpus for emotion analysis was created by Jenarthanan et al. [92]. Agentive වනා, ම Common ෙගයා, සා 3.2 Data Sets Abstract , උස Specific data sets for Sinhala, as expected, is scarce. How- Proper ෙකළඹ, අමර ever, a Sinhala PoS tagged data set [93–95] is available to Compound බ (+බ), ම (+ම) download from github20. Further, a Sinhala NER data set created by Manamini et al. [96] is also available to download from github21. et al. [82]. But it does not seem to be online anymore. A Facebook has released FastText [97–99] models for the number of Sinhala-English parallel corpora were introduced Sinhala language trained using the Wikipedia corpus. They 22 23 by Guzman´ et al. [83]. This includes a 600k+ Sinhala- are available as both text models and binary files . Using English subtitle pairs10 initially collected by [84], 45k+ the above models by Facebook, Lakmal et al. [100] have Sinhala-English sentence pairs from GNOME11, KDE12, and created an extended FastText model trained on Wikipedia, 24 Ubuntu13. Guzman´ et al. [83] further provided two mono- News, and official government documents. The binary file lingual corpora for Sinhala. Those were a 155k+ sentences of of the trained model is available to be downloaded. Herath filtered Sinhala Wikipedia14 and 5178k+ sentences of Sinhala 15 16. https://bit.ly/2GEI4d6 common crawl . Wijeratne and de Silva [85] have publicly 17. https://www.paracrawl.eu/ 18. https://github.com/kdissa/comparable-corpus 10. http://bit.ly/2KsFQxm 19. http://bit.ly/2HTMEme 11. http://bit.ly/2Z8q0fo 20. http://bit.ly/2Krhrbv 12. http://bit.ly/2WLY6bI 21. http://bit.ly/2XrwCoK 13. http://bit.ly/2wLVZGt 22. http://bit.ly/2JXAyL8 14. http://bit.ly/2EQZ7oM 23. http://bit.ly/2JY5J9c 15. http://bit.ly/2ZaQFZo 24. http://bit.ly/2WowH0h 5

S

S NP VP

NP VP

NP V

VBP NP

JJ NP

PRP DT JJ NN

PRP JJ NN I eat a red apple

මම ර ඇප ෙගය ක

(a) English (b) Sinhala

S S

NP VP VN NP

NP VGF CLS V DET MWN

PRP JJ NN VM VP N ADJ

म लाल सेब खाता ँ Je mange une pomme rouge

(c) Hindi (d) French

Fig. 2: Parse trees for the sentence “I eat a red apple” in four languages.

et al. [101] has compiled a report on the Sinhala lexicon for earliest and most extensive Sinhala-English dictionary avail- the purpose of establishing a basis for NLP applications. A able for consumption was by Malalasekera [103]. However, comparative analysis of Sinhala has been this dictionary is locked behind copyright laws and is not conducted by Lakmal et al. [100]. A dataset25 consisting available for public research and development. This copy- of 3576 Sinhala documents drawn from Sri Lankan news right issue is shared with other printed dictionaries [104– websites and tagged (CREDIBLE, FALSE, PARTIAL or UN- 109] as well. The dictionary by Kulatunga [110] is publicly CERTAIN) was published by Jayawickrama et al. [102] available for usage through an online web interface but does not provide API access or means to directly access the data set. The largest publicly available English-Sinhala dictionary 3.3 Dictionaries data set is from a discontinued plug-in EnSiTip [111] A necessary component for the purpose of bridging Sinhala which bears a more than passing resemblance to the above and English resources are English-Sinhala dictionaries. The dictionary by Kulatunga [110]. Hettige and Karunananda [62] claim to to have created a lexicon to help in their 25. https://github.com/LIRNEasia/MisinformationCorpusSinhala attempt to create a system capable of English-to-Sinhala 6 TABLE 4: Extension of the suffix structure proposed by Herath et al. [49] Number = Singular (S) / Plural (P) Definite = Definite (D) / Indefinite (I) / Undecided (U) Case = Nominative (N) / Accusative (A) / Dative (Da) / Genitive (G) / Instrumentive (In) / Auxiliary (Au) / Locative (L) / Ablative (Ab) Conjun = with suffix th (Y) / without suffix th (N) Attributes Suf1 Suf2 Suf3 Noun -tree- Number Definite Case Conjun - - - ගස් (-s) P U N, A, V N අ - - ගස (the-) S D N, A, V N අ - - ගස (a-) S I N, A N අක - - ගසක (of a-) S I G N අ ට - ගසට (to the-) S D Da N - ඒ - ගෙස (in the-) S D Au N - වල - ගස්වල (on -s) P U L N - වලට - ගස්වලට (to -s) P U Da N - එ - ගෙස (of the-) S D G N අ ඉ - ගස (from a-) S I Au N - - උ ග (-s and) P U N, A Y අ - ගස (-and) S D N, A Y අ - උ ගස (a- too) S I N, A Y අක - ගසක (of a -, too) S U N, A Y අ ට ගසට (to the- too) S D Da Y - ඒ ගෙස (in -s too) S D Au Y - වල ගස්වල (on -s too) P U L Y - වලට ගස්වලට (to -s too) P U Da Y - එ ගෙස (of the- too) S D G Y අ ඉ උ ගස (from a- too) S I Au Y . A review on the requirements for what Arukgoda et al. [121] have cloned to use in their appli- English-Sinhala smart bilingual dictionary was conducted cation uploaded on github26. However, even at its peak, due by Samarawickrama and Hettige [112]. to the lack of volunteers for the crowd-soured methodology There exists the government sponsored trilingual dic- of populating the WordNet, it was at best an incomplete tionary [113], which matches Sinhala, English, and Tamil. product. Another effort to build a Sinhala Wordnet was However, other than a crude web interface on the ministry initiated by Welgama et al. [122] independently from above; website, there is no efficient API or any other way for a but it too have stopped progression even before achieving researcher to access the data on this dictionary. Weerasinghe the completion level of Wijesiri et al. [119]. and Dias [114] have created a multilingual place name database for Sri Lanka which may function both as a dic- 3.5 Morphological Analyzers tionary and a resource for certain NER tasks. As shown in Fig1, morphological analysis is a ground level necessary component for natural language processing. 3.4 WordNets Given that Sinhala is a highly inflected language [22, 37, 38], WordNets [115] are extremely powerful and act as a versatile a proper morphological analysis process is vital. The ear- component of many NLP applications. They encompass a liest attempt on Sinhala morphological analysis we have number of linguistic properties which exist between the observed are the studies by Herath et al. [48, 49]. They are words in the lexicon of the language including but not more of an analysis of Sinhala morphology rather than a limited to: hyponymy, hypernymy, synonymy, and meronymy. working tool. As such we discussed the observations and Their uses range from simple gazetteer listing applica- conclusions of these works at Section2. It is also worth to tions [25] to information extraction based on semantic simi- note that these works predates the introduction of Sinhala larity [116, 117] or semantic oppositeness [118]. An attempt unicode and thus use a transliteration of Sinhala in the Latin has been made to build a Sinhala Wordnet by Wijesiri et al. alphabet. [119]. For a time it was hosted on [120] but it too is now defunct and all the data and applications are lost other than 26. https://github.com/jseanm1/aruthSWSD 7

The next attempt by Herath et al. [123] creates a modular and Silva [140] proposed a stochastic POS tagger based on a unit structure for morphological analysis of Sinhala. Much small 10,000 word corpus drawn from Facebook and Twitter. later, as a step on their efforts to create a system with the ability to do English-to-Sinhala machine translation, Hettige 3.7 Parsers and Karunananda [59] claim to have created a morphologi- The PoS tagged data then needs to be handed over to cal analyzer (void of any public data or code), which links a parser. This is an area which is not completely solved to their studies of a Sinhala parser [60] and computational even in English due to various inherent ambiguities in grammar [18]. Hettige et al. [71] further propose a multi- natural languages. However, in the case of English, there agent System for morphological analysis. Welgama et al. are systems which provide adequate results [141] even if not [124] attempted to evaluate machine learning approaches perfect yet. The Sinhala state of affairs, is that, the first parser for Sinhala morphological analysis. Yet another indepen- for the Sinhala language was proposed by Hettige and dent attempt to create a morphological parser for Sinhala Karunananda [60] with a model for grammar [18]. The study verbs was carried out by Fernando and Weerasinghe [125]. by Liyanage et al. [38] is concentrated on the same given that Later, another study, which was restricted to morphological they have worked on formalizing a computational grammar analysis of Sinhala verbs was conducted by Dilshani and for Sinhala. While they do report reasonable results, yet Dias [126]. There was no indication on whether this work again, do not provide any means for the public to access the was continued to cover other types of words. Further, other data or the tools that they have developed. Kanduboda [37] than this singular publication, no data or tools were made have worked on Sinhala differential object markers relevant publicly accessible. Nandathilaka et al. [127] proposed a rule for parsing. based approach for Sinhala lemmatizing. The work by Wel- The first attempt at a Sinhala parser, as mentioned above, gama et al. [128] claim to have set a set of gold standard was by Hettige and Karunananda [60] where they created definitions for the morphology of Sinhala Words; but given prototype Sinhala morphological analyzer and a parser as that their results are not publicly available, further usage or part of their larger project to build an end-to-end transla- confirmation of these claims cannot not be done. The table5 tor system. The function of the parser is based on three provides a comparative summery of the discussion above. dictionaries: Base Dictionary, Rule Dictionary, and Concept Dictionary. They are built as follows: 3.6 Part of Speech Taggers The next step after morphological analysis is Part of Speech • The Base Dictionary: prakurthi (base words), nipatha (PoS) tagging. The PoS tags differ in number and function- (prepositions), upasarga (prefixes), and vibakthi (Irreg- ality from language to language. Therefore, the first step in ular Verbs). creating an effective PoS tagger is to identifying the PoS • The Rule Dictionary: inflection rules used to gener- tag set for the language. This work has been accomplished ate various forms of verbs and nouns from the base by Fernando et al. [93] and Dilshani et al. [94]. Expanding words. on that, Fernando et al. [93] has introduced an SVM Based • The Concept Dictionary: synonyms and antonyms PoS Tagger for Sinhala and then Fernando and Ranathunga for the words found in the base dictionary. [95] give an evaluation of different classifiers for the task Parsers are, in essence, a computational representation of Sinhala PoS tagging. While here it is obvious that there of the grammar of a natural language. As such, in building has been some follow up work after the initial foundation, Sinhala parsers, it is crucial to create a computational model it seems, all of that has been internal to one research group for Sinhala grammar. The first such attempt was taken at one institution as neither the data nor the tools of any by Hettige and Karunananda [18] with special consideration of these findings have been made available for the use of given to Morphology and the Syntax of the Sinhala language external researchers. Several attempts to create a stochastic as an extension to their earlier work [60]. Here, it is worthy PoS tagger for Sinhala has been done with the studies to note that, unlike in their earlier attempt [60], where they by Herath and Weerasinghe [129], Jayaweera and Dias [130], explicitly mentioned that they are building a parser, in this and Jayasuriya and Weerasinghe [131] being the most no- study [18], they use the much conservative claim of building table. Within a single group which did one of the above a computational grammar. Under Morphology, they again stochastic studies [130], yet another set of studies was car- handled Sinhala inflection. Their system is based on a Finite ried out to create a Sinhala PoS tagger starting with the foun- State Transducer (FST) and Context-Free Grammar (CFG) dation of Jayaweera and Dias [132] which then extended where they they modeled 85 rules for nouns and 18 rules to a Hidden Markov Model (HMM) based approach [133] for verbs. The specific implementation is more partial to a and an analysis of unknown words [134, 135]. Further, this rule-based composer rather than parser. It is also worthy to group presented a comparison of few Sinhala PoS taggers note that this system could only handle simple sentences that are available to them [136]. A RESTFul PoS tagging which only contained the following 8 constituents: Attribu- web service created by Jayaweera and Dias [137] using the tive Adjunct of Subject, Subject, Attributive Adjunct of Object, 27 above research can still be accessed via POST and GET. Object, Attributive Adjunct of Predicate, Attributive Adjunct A hybrid PoS tagger for Sinhala language was proposed of the Complement of Predicate, Complement of Predicate, and by Gunasekara et al. [138]. The study by Kothalawala et al. Predicate. With these, they propose the following grammar [139] discussed the data availability problem in NLP with a rules for Sinhala: Sinhala POS tagging experiment among others. Withanage S = Subject Akkyanaya 27. http://bit.ly/2F0jKid Subject = SimpleSubject | ComplexSubject 8 TABLE 5: Morphological Analyzers comparison Base: Rule-based (RB) / Machine Learning (ML) Able to Handle Part of Speech (Handles): Yes (Y) / No (N) Outputs: Yes (Y) / No (N) / No Information (O) Abbreviations: Nouns (Nu), Verbs (Ve), Adjectives (Aj), Adverbs (Av), Function Words (Fn), Root (R), Person (P), Number (Nb), Gender (G), Article (A), Case (C)

Handles Output Base Modus Operandi Nu Ve Aj Av Fn R P Nb G A C Hettige and Karunananda [59] RB Finite State Automata Y Y N N N Y Y Y Y Y Y Hettige et al. [71] RB Agent-based Y Y Y Y Y N N N N N N Nandathilaka et al. [127] RB N/A Y N N N N Y N N N N N Welgama et al. [124] ML Morfessor algorithm Y Y Y Y Y Y N N N N N Fernando and Weerasinghe [125] RB Finite State Transducer N Y N N N Y Y Y Y N Y Dilshani and Dias [126] RB N/A N Y N N N O O O O O O

ComplexSubject = SimpleSubject ConSub Similar to Hettige and Karunananda [18], the work SimpleSubject = Noun | Adjective Noun by Liyanage et al. [38] also builds a CFG for Sinhala covering ConSub = SimpleSubject 10 out of the 25 types of simple sentence structures in Sin- Akkyanaya = VerbP | Object VerbP hala reported by Abhayasinghe [40]. This parser is unable Object = SimpleObject | ComplexObject to parse sentences where inanimate subjects do not consider ComplexObject = Conjunction SimpleObject the number. Further, sentences which contain, compound SimpleObject = Noun | Adjective Noun verbs, auxiliary verbs, present , or past participles VerbP = Verb | Adverb Verb cannot be handled by this parser. If the verbs have imperative mood or negation those too cannot be handled by this. Non- The later work by Liyanage et al. [38] also involves verbal sentences which end with adjectives, oblique nominals, formalizing a computational grammar for Sinhala. They locative predicates, adverbials, or any other language entity claim that Sinhala can have any order of words in practice. which is not a verb cannot be handled by this parser. However, they do not note that this is happening because The study by Kanduboda [37] covers not the whole of practices of the spoken language, which does not share Sinhala parsing but analyzes a very specific property of the strong SOV conventions of the written language, are Sinhala observed by Aissen [143] which states that it is pos- slowly seeping into written text. However, they do make sible to notice Differential Object Marking (DOM) in Sinhala note of how Sinhala grammar is modeled as a head-final active sentences. Kanduboda [37] define this as the choice language [39]. They propose the Sinhala Noun Phrase (NP ) of /wa/ and /ta/ object markers. They further observe three to be defined as shown in equation1 where NN is a noun unique aspects of DOM in Sinhala: (a) it is only observed which can be of types: common noun (N), pronoun (P rN) in active sentences which contain transitive verbs, (b) it can or proper noun (P ropN). The adjectival phrase (ADJP ) occur with accusative marked nouns but not with any other is then defined as as shown in equation2 where: Det is cases, (c) it exists only if the sentence has placed an animate a Determiner, Adj is the adjective, and Deg is an optional noun in the accusative position. They do a statistical analysis operator Degrees which can be used to intensify the meaning and provide a number of short gazetteer lists as appendixes. of the adjective in cases where the adjective is qualitative. However, they observe that further work has to be done While they note that according to Gunasekara [142], there for this particular language rule in Sinhala given that they has to three classes of adjectives (qualitative, quantitative, found some examples which proved to be exceptions to the and ), they do not implement this distinction general model which they proposed. in their system. Similarly, they propose Sinhala Verb Phrase (VP ) to be defined as shown in equation3 where V is a single verb. They here note that they are ignoring compound 3.8 Named Entity Recognition Tools verbs and auxiliary verbs in their grammar. The adverbial phrases (ADV P ) are then recursively defined as as shown As shown in Fig1, once the text is properly parsed, it in equation4. has to be processed using a Named Entity Recognition (NER) system. The first attempt of Sinahla NER was done NP = [ADJP ][NN] (1) by Dahanayaka and Weerasinghe [144]. Given that they were conducting the first study for Sinhala NER, they based their approach on NER research done for other languages. In   h i this, they gave prominent notice to that of Indic languages. ADJP = [Det] [Deg][Adj] (2) On that matter, they were the first to make the interesting observation that NER for Indic languages (including, but VP = [ADV P ][V ] (3) not limited to Sinhala) is more difficult than that of English by the virtue of the absence of a capitalization mechanic. Following prior work done on other languages, they used Conditional Random Fields (CRF) as their main model and "  # h i compared it against a baseline of a Maximum Entropy (ME) ADV P = [NP ] [ADV P ] [Deg][ADV ] (4) model. However, they only use the candidate word, Context 9 TABLE 6: NER system comparison * Denotes a baseline. F1 to F11 denotes Context Words, Word Prefixes and Suffixes, Length of the Word, Frequency of the Word, First Word/ Last Word of a Sentence, (POS) Tags, Gazetteer Lists, Clue Words, Outcome Prior, Previous Map, and Cutoff Value

Features CRF ME F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 Dahanayaka and Weerasinghe [144] Yes Yes* Yes Yes No No No No No No No No No Senevirathne et al. [145] Yes No Yes Yes Yes No Yes No No Yes No Yes No Manamini et al. [96] Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes

Words around the candidate word, and a simple analysis of work on NER [96] and PoS tagging [95]. Sinhala suffixes as their features. The follow up work by Senevirathne et al. [145] kept the 3.9 Semantic Tools CRF model with all the previous features but did not report Applications of the semantic layer are more advanced than comparative analysis with an ME model. The innovation in- the ones below it in Figure1. But even with the obvious troduced by this work is a richer set of features. In addition lack of resources and tools, a number of attempts have to the features used by Dahanayaka and Weerasinghe [144], been made on semantic level applications for the Sinhala they introduced, Length of the Word as a threshold feature. Language. The earliest attempt on semantic analysis was They also introduced First Word feature after observing done by Herath et al. [147] using their earlier work which certain rigid grammatical rules of Sinhala. A feature of clue dealt with Sinhala morphological analysis [48]. A Sinhala Words in the form of a subset of Context Words feature measure has been developed for short was first proposed by this work. Finally, they introduced sentences by Kadupitiya et al. [148]. This work has been a feature for Previous Map which is essentially the NE value then extended by Kadupitiya et al. [149] for the application of the preceding word. Some of these feature extractions use case of short answer grading. Data and tools for these are done with the help of a rule-based post-processor which projects are not publicly available. A deterministic process utilizes context-based word lists. flow for automatic Sinhala text summarizing was proposed The third attempt at Sinhala NER was by Manamini et al. by Welgama [150]. The study by Wimalasuriya [151], which [96] who dubbed their system Ananya. They inherit the CRF has the same name as the above work by Welgama [150], model and ME baseline from the work of Dahanayaka and uses graph based TextRank algorithm for automatic Sinhala Weerasinghe [144]. In addition to that, they take the en- text summarizing. hanced feature list of Senevirathne et al. [145] and enrich it There have been multiple attempts to do word sense further more. They introduce a Frequency of the Word feature disambiguation (WSD) [152–156] for Sinhala. For this, Aruk- based on the assumption that most commonly occurring goda et al. [121] have proposed a system named Aruth words are not NEs. Thus, they model this as a Boolean based on the Lesk Algorithm[157]. An online tool29, an API30 value with a threshold applied on the word frequency. They of the algorithm, and code along with data on github31 extend the First Word feature proposed by Senevirathne are available. For the same task, Marasinghe et al. [158] et al. [145] to a First Word/ Last Word of a Sentence feature have proposed a system based on probabilistic modeling. noting that Sinhala grammar is of SOV configuration. They A dialogue act recognition system which utilizes simple introduce a (PoS) Tag feature and a gazetteer lists based classification algorithms has been proposed by Palihakkara feature keeping in line with research done on NER in et al. [159]. other languages. They formally introduce clue Words, which Text classification is a popular application on the seman- was initially proposed as a sub-feature by Dahanayaka tic layer of the NLP stack. A very basic Sinhala text classi- and Weerasinghe [144], as an independent feature. Utilizing fication using Na¨ıve Bayes Classifier, Zipf’s Law Behavior, the fact that they have the ME model unlike Dahanayaka and SVMs was attempted by Gallege [160]. A smaller imple- and Weerasinghe [144], they introduce a complementary mentation of Sinhala news classification has been attempted feature to Previous Map named Outcome Prior, which uses the by de Silva [22]. As mentioned in Section 3.2, their news underlying distribution of the outcomes of the ME model. corpus is publicly available32. Another attempt on Sinhala Finally, they introduce a Cutoff Value feature to handle the text classification using six popular rule based algorithms over-fitting problem. was done by Lakmali and Haddela [161]. Even-though they The table6 provides comparative summery of the dis- talk about building a corpus named SinNG5, they do not cussion above. It should be noted that all three of these mod- indicate of means for others to obtain the said corpus. els only tag NEs of types: person names, location names and Another study by Kumari and Haddela [162] utilizes the organization names. The Ananya system by Manamini et al. SinNG5 corpus as the data set for their attempt to use [96] is available to download at GitHub 28. The data and LIME [163] for human interpretability of Sinhala document code for the approaches by Dahanayaka and Weerasinghe classification. However, they too do not provide access cor- [144] and by Senevirathne et al. [145] are not accessible to pus. Nanayakkara and Ranathunga [164] have implemented the public. Azeez and Ranathunga [146] proposed a fine- grained NER model for Sinhala building on their earlier 29. http://aruth.herokuapp.com/ 30. https://bit.ly/3sJEYbS 31. https://github.com/jseanm1/aruthSWSD 28. http://bit.ly/2XrwCoK 32. https://osf.io/tdb84/ 10

a system which uses corpus-based similarity measures for created Sinhala document readers for visually impaired per- Sinhala text classification. Gunasekara and Haddela [165] sons to be used on Android devices. An OCR and Text-to- claim to have created a context aware extraction Speech system for Sinhala named Bhashitha was proposed method for Sinhala text classification based on simple TF- by De Zoysa et al. [194]. Anuradha and Thelijjagoda [195] IDF. An LSTM based system for Sinhala proposed a machine translation system to convert Sinhala was proposed by Jayasinghe and Sirts [166]. and English Braille documents into voice. A separate group A simple MLP based method to classify sentiments has done work on Sinhala text-to-speech systems indepen- in Sinhala text was initially proposed by Medagoda [167] dent to above [196]. based on their prior work [168]. A based tool33 On the converse, Nadungodage et al. [197] have done a for of Sinhala news comments is avail- series of work on Sinhala with special able. A methodology for constructing a sentiment lexicon notice given to Sinhala being a resource poor language. This for Sinhala Language in a semi-automated manner based project divides its focus on: continuity [198], active learn- on a given corpus was proposed by Chathuranga et al. ing [199], and speaker adaptation [200]. A Sinhala speech [169]. Demotte et al. [170] proposed a sentiment analysis recognition for voice dialing which is speaker independent system based on sentence-state LSTM Networks for Sinhala was proposed by Amarasingha and Gamini [201] and on news comments. In the subsequent work [171], they used the other end, a Sinhala speech recognition methodology word similarity to generate a Sinhala semantic lexicon. They for interactive voice response systems, which are accessed followed this up with a further study [172] which discussed through mobile phones was proposed by Manamperi et al. a number of other deep learning models such as RNN and [202]. Priyadarshani [203] proposes a method for speaker Bi-LSTM in the domain of Sinhala sentiment analysis. Jaya- dependant speech recognition based on their previous work suriya et al. [173] proposed a method to classify Sinhala on: dynamic time warping for recognizing isolated Sin- posts in the domain of sports into positive and negative hala words [204], genetic algorithms [205], and syllable class sentiments. Ranathunga and Liyanage [174] claimed segmentation method utilizing acoustic envelopes [206]. that using word embedding models as semantic features can The method proposed by Gunasekara and Meegama [207] compensate for the lack of well developed language-specific utilizes an HMM model for Sinhala speech-to-text. A Sin- linguistic or language resources in the case of analysing hala speech recognizer supporting bi-directional conver- sentiment of Sinhala news comments. sion between Unicode Sinhala and phonetic English was A model for detecting grammatical mistakes in Sinhala proposed by Punchimudiyanse and Meegama [208]. The was developed by Pabasara and Jayalal [175]. They followed work by Karunanayake et al. [209] transfer learns CNNs for this up with an grammatical error detection and correc- transcribing free-form Sinhala and Tamil speech data sets tion model [176]. Gunasekara et al. [177] used annotation for the purpose of classification. Dilshan [210] conducted projection for for Sinhala. A cross- a study for the specific use case of transcribing number lingual document similarity measurement using the use- sequences in continuous Sinhala speech. The work by Di- case of Sinhala and English was developed by Isuranga et al. nushika et al. [211] uses automatic speech recognition of [178]. Sinhala for speech command classification. Gamage et al. [212] explored the use of combinational acoustic models 3.10 Phonological Tools such as Deep Neural Network - Hidden Markov Model On the case of phonological layer, a report on Sinhala pho- (DNN-HMM) [213] and Subspace Gaussian Mixture Model netics and phonology was published by Wasala and Gamage (SGMM) [213] in Sinhala speech recognition. Karunathilaka [179]. Wickramasinghe et al. [180] discussed the practical et al. [214] explore Sinhala speech recognition using deep issues in developing Sinhala Text-to-Speech and Speech learning models such as: pre-trained DNN, DNN, TDNN, Recognition systems. Based on the earlier work by Weeras- TDNN+LSTM. inghe et al. [181], Wasala et al. [182] have developed meth- The Sinhala speech classification system proposed ods for Sinhala grapheme-to-phoneme conversion along by Buddhika et al. [215] does so without converting the with a set of rules for schwa epenthesis. This work was then speech-to-text. However, they report that this approach extended by Nadungodage et al. [183]. Weerasinghe et al. only works for specific domains with well-defined limited [184] developed a Sinhala text-to-speech system. However, vocabularies. Kavmini et al. [216] presented a Sinhala speech it is not publicly accessible. They internally extended it to command classification system which can be used for down- create a system capable of helping a mute person achieve stream applications. synthesized real-time interactive voice communication in Sinhala [185]. A rule based approach for automatic segmen- 3.11 Optical Character Recognition Applications tation of a small set of Sinhala text into syllables was pro- posed by Kumara et al. [186]. An ew prosodic phrasing method While it is not necessarily a component of the NLP stack to help with Sinhala Text-to-Speech process was proposed shown in Fig1, which follows the definition by Liddy [24], by Bandara et al. [187, 188, 189]. Sodimana et al. [190] it is possible to swap out the bottom most phonological layer proposed a text normalization methodology for Sinhala text- of the stack in favour of an Optical Character Recognition to-speech systems. Further, Sodimana et al. [191] formalized (OCR) and text rendering layer. a step-by-step process for building text-to-speech voices for An attempt for Sinhala OCR system has been taken Sinhala. Both Jayamanna [192] and Mishangi [193] have by Rajapakse et al. [217] before any other work has been done on the topic. Much later, a linear symmetry based 33. http://bit.ly/2QKI9Np approach was proposed by Premaratne and Bigun [218, 219]. 11

They then used hidden Markov model-based optimization has been proposed by Dharmapala et al. [248]. The study on the recognized Sinhala script [220]. Similarly, Hewavitha- by Walawage and Ranathunga [249] specifically focuses on rana et al. [221] used hidden Markov models for off-line segmentation of overlapping and touching Sinhala hand- Sinhala character recognition. Statistical approaches with written characters. Silva and Jayasundere [250] focused on histogram projections for Sinhala character recognition is recognizing character modifiers in Sinhala handwriting. proposed by Hewavitharana and Kodikara [222], by Ajward Summarizing on optically recognized old Sinhala text et al. [223], and by Madushanka et al. [224]. Karunanayaka for the purpose of archival search and preservation was et al. [225] also did off-line Sinhala character recognition explored by Rathnasena et al. [251]. The work of Peiris with a use case for postal city name recognition. A separate [252] also focused on OCR for ancient Sinhala inscriptions. group had attempted Sinhala OCR [226] mainly involving A neural network based method for recognizing ancient the nearest-neighbor method [227]. A study by Ediriweera Sinhala inscriptions was proposed by Karunarathne et al. [228] uses dictionaries to correct errors in Sinhala OCR. [253]. Chanda et al. [254] proposed a Gaussian kernel SVM An early attempt for Sinhala OCR by Dias et al. [229] has based method for word-wise Sinhala, Tamil, and English been extended to be online and made available to use via script identification. desktops [230] and hand-held devices [231] with the ability to recognize handwriting. A simple neural network based 3.12 Translators approach for Sinhala OCR was utilized by Rimas et al. [232]. A series of work has been done by a group towards English A fuzzy-based model for identifying printed Sinhala charac- to Sinhala translation as mentioned in some of the above ters was proposed by Gunarathna et al. [233]. Premachandra subsections. This work includes; building a morphological et al. [234] proposes a simple back-propagation artificial analyzer [59], lexicon databases [62], a transliteration sys- neural network with hand crafted features for Sinhala char- tem [63], an evaluation model [68], a computational model acter recognition. Another neural network with specialized of grammar [18], and a multi-agent solution [74]. After feature extraction for Sinhala character recognition was working on human-assisted machine translation [64], Het- proposed by Jayamaha and Naleer [235]. On the matter of tige and Karunananda [67, 69] have attempted to establish a neural networks and feature extraction, a feature selection theoretical basics for English to Sinhala machine translation. process for Sinhala OCR was proposed by Kumara and A very simplistic web based translator was proposed [65, Ragel [236]. Jayawickrama et al. [237] worked on Sinhala 66]. The same group have worked on a Sinhala ontology printed characters with special focus on handling generator for the purpose of machine translation [73] and vowels. However, they opted to refer to diacritic vowels a phrase level translator [75] based on the previous work as modifiers in their work. Gunawardhana and Ranathunga on a multi-agent system for translation [72]. Further, an [238] proposed a limited approach to recognize Sinhala application of the English to Sinhala translator on the use letters on Facebook images. A CNN-based methodology case of selected text for reading was implemented [70]. They to improve printed Sinhala character OCR was proposed later continued their work on multi agent English to Sinhala by Liyanage [239]. A meta-study on the effects of text translation with the AGR organizational model [76]. genre, image resolution, and algorithmic complexity needed Another group independently attempted English-to- for Sinhala OCR from books and newspapers was con- Sinhala machine translation [255] with a statistical ap- ducted by Anuradha et al. [240]. An OCR and Text-to- proach [256]. Wijerathna et al. [257] and De Silva et al. Speech system for Sinhala named Bhashitha was proposed [258] have proposed simple rule based translators. An by De Zoysa et al. [194]. Anuradha and Thelijjagoda [195] example-based method applied on the English-Sinhala sen- uses OCR on Sinhala and English Braille documents on tence aligned government domain corpus was proposed which they then run a machine translation system in order by Silva and Weerasinghe [259]. A translator based on a to convert them into voice. look-up system was proposed by Vidanaralage et al. [260]. Fernando et al. [241] claim to have created a database In a preprint, et al. [261] proposes an evolutionary for handwriting recognition research in Sinhala language algorithm for Sinhala to English translation with a basis of and further claims that the data set is available at National Point-wise Mutual Information (PMI) and claims that the Science Foundation (NSF) of Sri Lanka. However, the paper code will be shared once the paper is accepted. However, provides no URLs and we were not able to find the data they do not report any quantitative results to be compared set on the NSF website either. The work by Karunanayaka and the reported qualitative results are also superficial. Fer- et al. [242] is focused on noise reduction and skew correction nando et al. [262] tries to solve the Out of vocabulary (OOV) of Sinhala handwritten words. A genetic algorithm-based problem for Sinhala in the context of Sinhala-English-Tamil approach for non-cursive Sinhala handwritten script recog- statistical machine translation. Fonseka et al. [263] used Byte nition was proposed by Jayasekara and Udawatta [243]. Ni- Pair Encoding (BPE) for English to Sinhala neural machine laweera et al. [244] compare projection and wavelet-based translation. As an another solution to the OOV problem, techniques for recognizing handwritten Sinhala script. Silva an analysis of subword techniques to improve English to and Kariyawasam [245] worked on segmenting Sinhala Sinhala Neural Machine Translation (NMT) was conducted handwritten characters with special focus on handling dia- by Naranpanawa et al. [264]. A data augmentation method critic vowels. A comparative study of few available Sinhala to expand bilingual lexicon terms based on case-markers handwriting recognition methods was done by Silva et al. for the purpose of solving the OOV problem in the domain [246]. Silva et al. [247] uses contour tracing for isolated char- of NMT was proposed by Fernando et al. [265]. A acters in handwritten Sinhala text. A Sinhala handwriting to Sinhala converter which uses an LSTM was proposed OCR system which utilizes zone-based feature extraction by De Silva [266]. 12

Most of the cross Sinhala and Tamil work has been done ten Sinhala [294] and animation of finger-spelled words and in the domain of machine translation. A neural machine number signs [295]. A simple Sinhala chat bot which utilizes translation for Sinhala and Tamil languages was initiated a small knowledge base has been proposed by Hettige and by Tennage et al. [267, 268]. Then they further enhanced Karunananda [61]. A study on the effect of word embed- it with transliteration and byte pair encoding [269] and dings on a Sinhala chat bot was conducted by Gamage used synthetic training data to handle the rare word prob- et al. [296] where they used, the model trained by lem [270]. This project produced Si-Ta [271] a machine Facebook [97–99], on a RASA36 chat bot. Dissanayake and translation system of Sinhala and Tamil official documents. Hettige [297] implemented a question and answer generator In the statistical machine translation front, Farhath et al. for Sinhala with the limited PoS: pronouns, adjectives, verbs, [272] worked on integrating bilingual lists. The attempts and adverbs. by Weerasinghe [273] and Sripirakas et al. [274] were also fo- Fernando [298] proposed a method for inexact matching cused on statistical machine translation while Jeyakaran and of Sinhala proper names. A study on determining canonical Weerasinghe [275] attempted a kernel regression method. of colloquial Sinhala sentences using priority A yet another attempt was made by Pushpananda et al. information was conducted by Kanduboda and Tamaoka [276] which they later extended with some quality im- [299] which they later extended [300–302]. Jayakody et al. provements [277]. An attempt on real-time direct translation [303] uses simple KNN and SVM methods on a PoS tagged between Sinhala and Tamil was done by Rajpirathap et al. Sinhala corpus to create a question-answering system which [278]. Dilshani et al. [279] have done a study on the linguistic they name Mahoshadha. Sandaruwan et al. [304] have at- divergence of Sinhala and Tamil languages in respect to tempted to identify abusive Sinhala comments in social machine translation. Mokanarangan [280] claims to have media using and machine learning techniques. built a named entity translator between Sinhala and Tamil An extremely simple plagiarism detection tool which only for official government documents. But this work is locked uses n-grams of simply tokenized text was proposed by Bas- behind an institutional repository wall and thus is not ac- nayake et al. [305]. Another simple plagiarism detection tool cessible by other researchers. Arukgoda et al. [281] studied that uses synonymy and Hyponymy-Hypernymy (which the possibility of using deep learning techniques to improve they call Generalization in the paper) was attempted by Ra- Sinhala-Tamil translation. Pramodya et al. [282] compared jamanthri and Thelijjagoda [306]. Transformers, Recurrent Neural Networks, and Statistical A dataset consisting of Sinhala documents drawn from Machine Translation (SMT) in the context of Tamil to Sinhala Sri Lankan news websites was published by Jayawickrama machine translation. The work by Nissanka et al. [283] used et al. [102] along with the benchmark misinformation clas- monolingual word embedding to improve NMT between sification models. A trending topic detection model for Sinhala and Tamil. While not related to Tamil, there have Sinhala tweets using simple clustering and ranking algo- been attempts to link Sinhala NLP with Japanese by Herath rithms was proposed by Jayasekara and Ahangama [307]. A et al. [21, 77, 78], Thelijjagoda [79], and Kanduboda [8]. machine learning approach to detect hate speech in Sinhala There has been an attempt to use dictionary-based machine was proposed by De Silva [308]. The work by Liyanage translation [284] between Sinhala and the liturgical language and Ranathunga [309, 310] attempt to use LSTMs for math- of Buddhism, [285–287]. The study by Anuradha and ematical word problem generation in Sinhala and other Thelijjagoda [195] uses machine translation on the unique languages. The problem of recognizing Sinhala and English application of converting Sinhala and English Braille docu- code-mixed data where the Sinhala text is written in Singlish ments, which they have run OCR on, into voice. was explored by Smith and Thayasivam [311] using an XGB classifier and a CRF model building on their previous 3.13 Miscellaneous Applications work [312], which analysed such data. Sandathara et al. Arunalu In this section, we discuss NLP tools and research which [313] proposed a system which they named that are either hard to categorize under above sections or are they claimed to use Voice recognition, Natural Language reasonably involving multiples of them. The first miscel- Processing, Machine Learning, and Deep Learning concepts laneous application of Sinhala NLP is checking. The to help individuals with dyslexia overcome problems of open-source data driven approach proposed by Wasala et al. reading Sinhala. Rajitha et al. [314] used the data set that [288, 289] claims to be able to check and correct spelling they introduced in their previous work [88] to prove that errors in Sinhala. The approach by Jayalatharachchi et al. task specific supervised distance learning metrics outper- [290] attempts to obtain synergy between two algorithms form their unsupervised counterparts, for document align- for the same purpose. These efforts [288, 290] were then ment. extended by Subhagya et al. [291]. A rule-based Sinhala spell 34 checker named SinSpell based on was introduced 4 PRIMARY SOURCES by Liyanapathirana et al. [292]. They have also made the tool available35 online for use. The study by Sithamparanathan Even though the main objective of this survey is to cover and Uthayasanker [293] extended the Generic Environment NLP tools and research, we noticed that much of these NLP for context-aware spell correction to handle Sinhala and Tamil. tools and research depend on primary sources of Sinhala On the matter of Sinhala sign language, strides have language such as printed books in the role of knowledge been made in the domains of computer interpreting for writ- sources and ground truth. Therefore, for the benefit of other researchers who venture into Sinhala NLP, we decided to 34. http://hunspell.github.io/ 35. http://nlp-tools.uom.lk/sinspell/ 36. https://rasa.com/ 13 add a short introduction to the available primary sources of poor language? The important point which is missing in Sinhala language used by their peers. We note that the body that assumption is that in the cases of almost all of the of work by a single scholar, Disanayake [2, 35, 315–325], is above listed implementations and findings, the only thing quite prominent in the case of being used for NLP appli- that is publicly available for a researcher is a set of research cations. For formal introduction of the language, the books papers. The corpora, tools, algorithm, and anything else that by Disanayake [2] and Perera [3] are commonly used. In were discovered through these research are either locked cases which deal with the Sinhala alphabet, the introduction away as properties of individual research groups or worse by Indrasena [326] and by Disanayake [315, 316] have been lost to the time with crashed ancient servers, lost hard used. An analysis of modern Sinhala linguistics has been drives, and expired web hosts. This reason and probably done by Jayathilake [327] and by Pallatthara and Weihene academic/research rivalry have caused these separate re- [36]. The early study by Henadeerage [328] covers a number search groups not to cite or build upon the works of each- of topics on the Sinhala language such as grammatical other. In many cases where similar work is done, it is a relations, argument structure, phrase structure and focus re-hashing on the same ideas adopted from resource rich constructions. languages because of, the unavailability of (or the reluctance As we discussed in Section 3.3, a number of printed to), referring and building on work done by another group. Sinhala dictionaries exist, Malalasekera [103] being the most This has resulted in multiple groups building multiple prominent English-Sinahala dictionary among them. In ad- foundations behind closed doors but no one ending up dition to that seminal work, previous researchers of Sinhala with a completed end-to-end NLP work-flow. In conclusion, NLP have utilized a number of other dictionaries of vari- what can be said is that, even though there are islands of ous configurations such as: English-Sinhala [105, 108, 109], implementations done for Sinhala NLP, they are of very Sinhala-Sinhala [104, 107], and English-Sinhala-Tamil [106]. small scale and/or are usually not readily accessible for A number of NLP applications have utilized first sources further use and research by other researchers. Thus, so far, intended to teach children [329–333] or foreigners [142, 334, sadly, Sinhala stays a resource poor language. 335]37. This makes sense given that an introduction written for children would start with basic principles and thus be ACKNOWLEDGMENTS ideal for crafting rule based NLP systems and an introduc- The authors would like to thank Romain Egele for checking tion written for foreigners would have Sinhala language the examples we have provided in French for their accuracy. described in terms of English, making easy the process of Similarly, the authors would also like to thank Shravan Kale rule based translation of English NLP tools to Sinhala. for checking the examples we have provided in Hindi for For applications where a rule based approach for Sinhala their accuracy. spelling correction is utilized, the books by Disanayake [322, 323], by Koparahewa [336], and by Gair and Karunatil- lake [337] are used to provide a basis. A number of NLP REFERENCES applications which handle spoken Sinhala in the capacity of [1] R. Englebretson and C. Genetti, “Santa barbara papers in linguis- tics: Proceeding from the workshop on sinhala linguistics,” Santa phonological layer (Section 3.10) applications or otherwise, Barbara, CA: Department of Linguistics at the University of California, make note of the fact that spoken Sinhala is considerably Santa Barbara, 2005. distinguishable from written Sinhala, as such, they refer [2] J. B. Disanayake, National Languages of Sri-Lanka: Sinhala. De- partment of cultural affairs, 1976. primary sources which explicitly deal with spoken Sin- [3] T. G. Perera, Sinhala Bhashawa, 1985. hala [35, 321, 325, 331, 334, 338, 339]. [4] L. Bauer, Linguistics Student’s Handbook. Edinburgh University Primary sources used in NLP application for Sinhala Press, 2007. grammar are varied. A number of them provide overviews [5] Department of Census and Statistics Sri Lanka. Percentage of population aged 10 years and over in major ethnic groups by of the entirety of Sinhala grammar [19, 36, 39, 324, 340– district and ability to speak sinhala, tamil and english languages. 349]. There are specific primary sources focusing on [Online]. Available: https://goo.gl/nnVZSd verbs [320, 332, 350], nouns [319, 333], prepositions [331], [6] J. W. Gair and W. Karunatilaka, Literary Sinhala. ERIC, Cornell University. New York, 1974. compounds [318], derivation [317], case system [351], and [7] H. Young. A tree - in pictures sentence structure [40] of the Sinhala language. The book — education — the guardian. [Online]. Avail- by Rajapaksha [352] is commonly used in NLP applica- able: https://www.theguardian.com/education/gallery/2015/ tions as a guide for word tagging and punctuation mark jan/23/a-language-family-tree-in-pictures [8] A. B. Kanduboda, “The role of animacy in determining noun handling. NLP studies that tackle the hard problem of phrase cases in the sinhalese and japanese languages,” Science of handling questions expressed in Sinhala often refer to the words, vol. 24, pp. 5–20, 2011. book by Kariyakarawana [353]. Kekulawala [354] has aptly [9] P. Fernando, “Palaeographical development of the brahmi script in ceylon from 3rd century bc to 7th century ad,” 1949. discussed the much controversial topic of the situation of [10] D. Bandara, N. Warnajith, A. Minato, and S. Ozawa, “Creation of future tense of Sinhala. precise alphabet fonts of early brahmi script from photographic data of ancient sri lankan inscriptions,” Canadian Journal on Arti- ficial Intelligence, Machine Learning and Pattern Recognition, vol. 3, 5 CONCLUSION no. 3, pp. 33–39, 2012. [11] P. T. Daniels and W. Bright, The world’s writing systems. Oxford At this point, a reader might think, there seems to be a University Press on Demand, 1996. significant number of implementations of NLP for Sinhala. [12] M. H. Sirisoma, “Brahmi inscriptions of sri lanka from 3rd century bc to 65 ad,” pp. 3–54, 1990. Therefore, how can one justify listing Sinhala as a resource [13] M. Dias, “Lakdiwa sellipiwalin heliwana sinhala bhashawe prathyartha namayange vikashanaya,” Department of Archaeology, 37. Note that [335] is an extension of [142]. Sri Lanka, p. 1, 1996. 14

[14] A. S. Hettiarachchi, “Investigation of 2nd, 3rd and 4th century [44] T. Noguchi, “Shinharago nyuumon [introductory to the sinhalese inscriptions,” Inscriptions: Volume Two, Archaeological Department language],” Tokyo: Daigaku Shorin, 1984. Centenary (1890–1990), Commemorative Series. Colombo: Department [45] T. Miyagishi, “A comparison of word order between japanese of Archaeology, pp. 57–104, 1990. and sinhalese,” Bulletin of and Literature, pp. [15] S. Paranavitana and S. L. P. Departam¯ entuva,¯ Inscriptions of 101–107, 2003. Ceylon. Dept. of Archaeology, 1970. [46] D. Chandralal, Sinhala. John Benjamins Publishing, 2010, vol. 15. [16] R. Salomon, Indian epigraphy: a guide to the study of inscriptions [47] S. P. Singh, A. Kumar, P. Sahu, and P. Verma, “Syntax based in Sanskrit, Prakrit, and the other Indo-Aryan languages. Oxford machine translation using blended methodology,” in 2016 2nd University Press, 1998. International Conference on Next Generation Computing Technologies [17] H. Falk, Schrift im alten Indien: ein Forschungsbericht mit Anmerkun- (NGCT). IEEE, 2016, pp. 242–247. gen. Gunter Narr Verlag, 1993, vol. 56. [48] S. Herath, T. Ikeda, S. Yokoyama, H. Isahara, and S. Ishizaki, [18] B. Hettige and A. S. Karunananda, “Computational model of “Sinhalese morphological analysis: a step towards machine pro- grammar for english to sinhala machine translation,” in Advances cessing of sinhalese,” in [Proceedings 1989] IEEE International in ICT for Emerging Regions (ICTer), 2011 International Conference Workshop on Tools for Artificial Intelligence. IEEE, 1989, pp. 100– on. IEEE, 2011, pp. 26–31. 107. [19] A. M. Gunasekara, A Comprehensive Grammar of the Sinhalese [49] S. Herath, T. Ikeda, S. Ishizaki, and Y. Anzai, “Formalization of Language. Asian Educational Services, New Delhi, Madras, sinhalese morphology,” in Proc. of the 40th National Congress of India, 1986. IPSJ. Jyouhou Syori Gakkai, vol. 1, 1990, pp. 327–328. [20] H. P. Ray, The archaeology of seafaring in ancient . Cam- [50] V. K. Samaranayake, J. B. Disanayaka, and S. T. Nandasara, “A bridge University Press, 2003. standard code for sinhala characters,” Proceedings, 9th Annual [21] A. Herath, Y. Hyodo, Y. Kawada, T. Ikeda, and S. Herath, “A Sessions of the Computer Society of Sri Lanka, Colombo, 1989. practical machine translation system from japanese to modern [51] V. K. Samaranayake, S. T. Nandasara, J. B. Disanayaka, A. R. sinhalese,” Gifu University, pp. 153–162, 1994. Weerasinghe, and H. Wijayawardhana, “An introduction to uni- [22] N. de Silva, “Sinhala Text Classification: Observations from the code for sinhala characters,” University Of Colombo School of Perspective of a Resource Poor Language,” 2015. Computing, 2003. [23] Y. Wijeratne, N. de Silva, and Y. Shanmugarajah, “Natural Lan- [52] G. Dias and A. Goonetilleke, “Development of standards for guage Processing for Government: Problems and Potential,” Sinhala computing,” in 1st Regional Conference on ICT and E- LIRNEasia, 2019. Paradigms, 2004. [24] E. D. Liddy, “Natural language processing,” 2001. [53] G. V. Dias, “Challenges of enabling it in the sinhala language,” in [25] D. C. Wimalasuriya and D. Dou, “Ontology-based information 27th Internationalization and Unicode Conference, 2005. extraction: An introduction and a survey of current approaches,” [54] A. R. Weerasinghe, D. L. Herath, and K. Gamage, “The sinhala Journal of Information Science, vol. 36, no. 3, pp. 306–323, 2010. collation sequence and its representation in unicode,” Localization [26] U. Consortium et al., “The unicode standard: A tech- Focus, 2006. nical introduction,” online document, http://www. unicode. [55] G. Sandeva, “Design and evaluation of user-friendly yet efficient org/unicode/standards/principles. html, 1996. sinhala input methods,” 2009. [27] R. A. Van der Sandt, “Presupposition projection as anaphora [56] S. Herath, S. Ishizaki, T. Ikeda, Y. Anzai, and H. Aiso, “Machine resolution,” Journal of semantics, vol. 9, no. 4, pp. 333–377, 1992. processing of sinhala natural language: a step toward intelligent [28] S. Lappin and H. J. Leass, “An algorithm for pronominal systems,” Cybernetics and systems, vol. 22, no. 3, pp. 331–348, 1991. anaphora resolution,” Computational linguistics, vol. 20, no. 4, pp. [57] S. T. Nandasara, “From the past to the present: Evolution of 535–561, 1994. computing in the sinhala language,” IEEE Annals of the History [29] W. M. Soon, H. T. Ng, and D. C. Y. Lim, “A machine learning ap- of Computing, vol. 31, no. 1, pp. 32–45, 2009. proach to coreference resolution of noun phrases,” Computational [58] S. T. Nandasara and Y. Mikami, “Bridging the digital divide in linguistics, vol. 27, no. 4, pp. 521–544, 2001. sri lanka: some challenges and opportunities in using sinhala in [30] V. Ng and C. Cardie, “Improving machine learning approaches ict,” International Journal on Advances in ICT for Emerging Regions to coreference resolution,” in Proceedings of the 40th annual meeting (ICTer), vol. 8, no. 1, 2016. on association for computational linguistics. Association for Com- [59] B. Hettige and A. S. Karunananda, “A morphological analyzer to putational Linguistics, 2002, pp. 104–111. enable english to sinhala machine translation,” in Information and [31] R. Mitkov, Anaphora resolution. Routledge, 2014. Automation, 2006. ICIA 2006. International Conference on. IEEE, [32] J. Preston and M. J. M. Bishop, Views into the Chinese room: New 2006, pp. 21–26. essays on Searle and artificial intelligence. OUP, 2002. [60] ——, “A parser for sinhala language-first step towards english [33] G. H. Fairbanks, J. W. Gair, and M. W. S. D. Silva, Colloquial to sinhala machine translation,” in Industrial and Information Sinhalese. ERIC, Cornell University. New York, 1968. Systems, First International Conference on. IEEE, 2006, pp. 583– [34] T. Miyagishi, “Accusative subject of subordinate clause in literary 587. sinhala,” Journal of Yasuda Women’s University, vol. 33, 2005. [61] ——, “First sinhala in action,” Proceedings of the 3rd [35] J. B. Disanayake, Say it in Sinhala. Lake House Investments Annual Sessions of Sri Lanka Association for Artificial Intelligence Limited, 1985. (SLAAI), University of Moratuwa, 2006. [36] S. Pallatthara and P. Weihene, Sinhala Grammar in Linguistic [62] ——, “Developing lexicon databases for english to sinhala ma- Perspective, 1966. chine translation,” in Industrial and Information Systems, 2007. [37] A. P. B. Kanduboda, “On the usage of sinhalese differential object ICIIS 2007. International Conference on. IEEE, 2007, pp. 215–220. markers object marker /wa/ vs. object marker /ta/,” Theory and [63] ——, “Transliteration system for english to sinhala machine Practice in Language Studies, vol. 3, no. 7, p. 1081, 2013. translation,” in Industrial and Information Systems, 2007. ICIIS [38] C. Liyanage, R. Pushpananda, D. L. Herath, and R. Weerasinghe, 2007. International Conference on. IEEE, 2007, pp. 209–214. “A computational grammar of Sinhala,” in International Confer- [64] ——, “Using human-assisted machine translation to overcome ence on Intelligent and Computational Linguistics. language barrier in sri lanka,” Proceedings of 4th Annual session of Springer, 2012, pp. 188–200. Sri Lanka Association for Artificial Intelligence, p. 10, 2007. [39] W. S. Karunatilaka, Sinhala Bhasa Vyakaranaya. M D Gunasena [65] ——, “Web-based english-sinhala translator in action,” in 2008 Publishers, Colombo, 1997. 4th International Conference on Information and Automation for Sus- [40] A. A. Abhayasinghe, Sinhala bhashave sarala vakya vibagaya, 1998. tainability. IEEE, 2008, pp. 80–85. [41] C. Jany, “The relationship between case marking and s, a, and o [66] ——, “Web-based english to sinhala selected texts translation in spoken sinhala,” Santa Barbara Papers in Linguistics, no. 17, pp. system,” Sri Lanka Association for Artificial Intelligence, p. 26, 2008. 68–84, 2006. [67] ——, “Theoretical based approach to english to sinhala machine [42] J. Garland, “ and the complexity of nom- translation,” in 2009 International Conference on Industrial and inal morphology in sinhala,” Santa Barbara Papers in Linguistics, Information Systems (ICIIS). IEEE, 2009, pp. 380–385. no. 17, pp. 1–19, 2005. [68] ——, “An evaluation methodology for english to sinhala machine [43] M. Henderson, “Between lexical and lexico-grammatical classifi- translation,” in Information and Automation for Sustainability (ICI- cation: nominal classification in sinhala,” Santa Barbara Papers in AFs), 2010 5th International Conference on. IEEE, 2010, pp. 31–36. Linguistics, p. 29, 2005. [69] ——, “Varanageema: A theoretical basics for english to sinhala 15

machine translation,” in Sri Lanka Association for Artificial Intelli- Mohamed, S. Ranathunga, S. Jayasena, G. Dias, and S. Fernando, gence (SLAAI), 2010. “Automatic creation of a sentence aligned sinhala-tamil parallel [70] B. Hettige, G. Rzevski, and A. S. Karunananda, “Selected ,” in Proceedings of the 6th Workshop on South and Southeast machine translator for english to sinhala,” 2013. Asian Natural Language Processing (WSSANLP2016), 2016, pp. 124– [71] B. Hettige, A. S. Karunananda, and G. Rzevski, “Multi-agent 132. system technology for morphological analysis,” Proceedings of the [90] M. Z. Mohamed, A. Ihalapathirana, R. A. Hameed, N. Pathiren- 9th Annual Sessions of Sri Lanka Association for Artificial Intelligence nehelage, S. Ranathunga, S. Jayasena, and G. Dias, “Automatic (SLAAI), Colombo, 2012. creation of a word aligned sinhala-tamil parallel corpus,” in [72] ——, “Masmt: A multi-agent system development framework for Engineering Research Conference (MERCon), 2017 Moratuwa. IEEE, english-sinhala machine translation,” International Journal of Com- 2017, pp. 425–430. putational Linguistics and Natural Language Processing (IJCLNLP), [91] F. Farhath, P. Theivendiram, S. Ranathunga, S. Jayasena, and vol. 2, no. 7, pp. 411–416, 2013. G. Dias, “Improving domain-specific smt for low-resourced lan- [73] ——, “Sinhala ontology generator for english to sinhala machine guages using data from different domains,” in Proceedings of translation,” in Proc. of KDU International Research Conference, the Eleventh International Conference on Language Resources and 2014. Evaluation (LREC-2018), 2018. [74] ——, “A multi-agent solution for managing complexity in english [92] R. Jenarthanan, Y. Senarath, and U. Thayasivam, “Actsea: an- to sinhala machine translation,” Complex Systems: Fundamentals & notated corpus for tamil & sinhala emotion analysis,” in 2019 Applications, vol. 90, p. 251, 2016. Moratuwa Engineering Research Conference (MERCon). IEEE, 2019, [75] ——, “Phrase-level english to sinhala machine translation with pp. 49–53. multi-agent approach,” in 2017 IEEE International Conference on [93] S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, “Com- Industrial and Information Systems (ICIIS). IEEE, 2017, pp. 1–6. prehensive part-of-speech tag set and svm based pos tagger for [76] ——, “Masmt4: The agr organizational model-based multi-agent sinhala,” in Proceedings of the 6th Workshop on South and Southeast system development framework for machine translation,” in Asian Natural Language Processing (WSSANLP2016), 2016, pp. 173– Inventive Computation and Information Technologies. Springer, 2021, 182. pp. 691–702. [94] N. Dilshani, S. Fernando, S. Ranathunga, S. Jayasena, and G. Dias, [77] A. Herath, Y. Hyodo, T. Ikeda, and S. Herath, “Generation of “A comprehensive part of speech (pos) tag set for sinhala lan- sinhalese units from japanese bunsetsu structure,” 1993. guage.” The Third International Conference on Linguistics in [78] A. Herath, Y. Hyodo, Y. Kunieda, T. Ikeda, and S. Herath, Sri Lanka, ICLSL 2017 . . . , 2017. “Bunsetsu-based japanese-sinhalese translation system,” Informa- [95] S. Fernando and S. Ranathunga, “Evaluation of different clas- tion sciences, vol. 90, no. 1-4, pp. 303–319, 1996. sifiers for sinhala pos tagging,” in 2018 Moratuwa Engineering [79] S. Thelijjagoda, “Japanese-sinhalese mt system (jaw/sinhalese),” Research Conference (MERCon). IEEE, 2018, pp. 96–101. in Proceedings of Asian Symposium on Natural Language Processing [96] S. A. P. M. Manamini, A. F. Ahamed, R. A. E. C. Rajapakshe, to Overcome Language Barriers, IJCNLP-04 Satellite Symposium, G. H. A. Reemal, S. Jayasena, G. V. Dias, and S. Ranathunga, 2004. “Ananya-a named-entity-recognition (ner) system for sinhala [80] D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, language,” in Moratuwa Engineering Research Conference (MER- C. Wimalasuriya, N. H. N. D. de Silva, and G. Dias, “Implement- Con), 2016. IEEE, 2016, pp. 30–35. ing a Corpus for Sinhala Language,” in Symposium on Language [97] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jegou,´ and Technology for South Asia 2015, 2015. T. Mikolov, “Fasttext. zip: Compressing text classification mod- [81] ——, “Comparison between performance of various database els,” arXiv preprint arXiv:1612.03651, 2016. systems for implementing a language corpus,” in Interna- [98] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching tional Conference: Beyond Databases, Architectures and Structures. word vectors with subword information,” Transactions of the Springer, May 2015, pp. 82–91. Association for Computational Linguistics, vol. 5, pp. 135–146, 2017. [82] R. Weerasinghe, D. Herath, and V. Welgama, “Corpus-based sin- [99] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks hala lexicon,” in Proceedings of the 7th Workshop on Asian Language for efficient text classification,” in Proceedings of the 15th Confer- Resources. Association for Computational Linguistics, 2009, pp. ence of the European Chapter of the Association for Computational 17–23. Linguistics: Volume 2, Short Papers, 2017, pp. 427–431. [83] F. Guzman,´ P.-J. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, [100] D. Lakmal, S. Ranathunga, S. Peramuna, and I. Herath, “Word V. Chaudhary, and M. Ranzato, “Two new evaluation datasets for embedding evaluation for sinhala,” in Proceedings of The 12th low-resource machine translation: Nepali-english and sinhala- Language Resources and Evaluation Conference, 2020, pp. 1874–1881. english,” arXiv preprint arXiv:1902.01382, 2019. [101] D. Herath, K. Gamage, and A. Malalasekara, “Research report on [84] P. Lison and J. Tiedemann, “Opensubtitles2016: Extracting large sinhala lexicon,” Langugae Technology Research Laboratory, UCSC. parallel corpora from movie and tv subtitles,” 2016. [102] V. Jayawickrama, A. Ranasinghe, D. C. Attanayake, and Y. Wi- [85] Y. Wijeratne and N. de Silva, “Sinhala language corpora and jeratne, “A corpus and machine learning models for fake news stopwords from a decade of sri lankan facebook,” arXiv preprint classification in sinhala,” 2021. arXiv:2007.07884, 2020. [103] G. P. Malalasekera, “English-sinhalese dictionary.” 1967. [86] M. Banon,´ P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Espla- [104] D. B. Jayathilake, Sinhala Shabdakoshaya (Sinhala Dictionary), Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn et al., Prathama Bhagaya (Vol 1). Sri Lankan branch of Royal Asian “Paracrawl: Web-scale acquisition of parallel corpora,” in Proceed- Society, 1937. ings of the 58th Annual Meeting of the Association for Computational [105] S. Maitipe, Gunasena English-Sinhalese Concise Dictionary. MD Linguistics, 2020, pp. 4555–4567. Gunasena & Co., 1988. [87] I. Caswell, J. Kreutzer, L. Wang, A. Wahab, D. van Esch, [106] A. Weerasinghe and C. P. Weerasinghe, “Godage english-sinhala- N. Ulzii-Orshikh, A. Tapo, N. Subramani, A. Sokolov, C. Sikasote, tamil dictionary,” Sri Lanka: S. Godage and brothers, Godage book M. Setyawan, S. Sarin, S. Samb, B. Sagot, C. Rivera, A. Rios, shop, vol. 661, 1999. I. Papadimitriou, S. Osei, P. J. O. Suarez,´ I. Orife, K. Ogueji, R. A. [107] H. Wijayathunga, Maha Sinhala Sabdakoshaya. M. D. Gunasena & Niyongabo, T. Q. Nguyen, M. Muller,¨ A. Muller,¨ S. H. Muham- Co. Ltd., Colombo, 2003. mad, N. Muhammad, A. Mnyakeni, J. Mirzakhalov, T. Matan- [108] S. Ranaweera, Wasana English-Sinhala Dictionary. Wasana gira, C. Leong, N. Lawson, S. Kudugunta, Y. Jernite, M. Jenny, Prakashakayo, Dankotuwa, Sri Lanka, 2004. O. Firat, B. F. P. Dossou, S. Dlamini, N. de Silva, S. C¸abuk Ballı, [109] K. Gunaratne, Ratna English-Sinhalese Dictionary. Ratna Poth S. Biderman, A. Battisti, A. Baruwa, A. Bapna, P. Baljekar, I. A. Prakasakayo, 513, Maradana road, Colombo 10, Sri Lanka, 2006. Azime, A. Awokoya, D. Ataman, O. Ahia, O. Ahia, S. Agrawal, [110] M. Kulatunga. Madura english-sinhala dictionary - online and M. Adeyemi, “Quality at a glance: An audit of web-crawled language translator. [Online]. Available: https://maduraonline. multilingual datasets,” arXiv preprint arXiv:2103.12028, 2021. com/ [88] D. Sachintha, L. Piyarathna, C. Rajitha, and S. Ranathunga, [111] A. Wasala and R. Weerasinghe, “Ensitip: a tool to unlock the “Exploiting parallel corpora to improve multilingual embed- english web,” in 11th international conference on humans and com- ding based document and sentence alignment,” arXiv preprint puters, Nagaoka University of Technology, Japan, 2008, pp. 20–23. arXiv:2106.06766, 2021. [112] L. Samarawickrama and B. Hettige, “Requirements for an [89] R. A. Hameed, N. Pathirennehelage, A. Ihalapathirana, M. Z. english-sinhala smart bilingual dictionary: A review.” 16

[113] Department of Official Languages, Sri Lanka. Tri-lingual dictio- part of speech tagger for sinhala language,” in Advances in ICT nary. [Online]. Available: https://www.trilingualdictionary.lk/ for Emerging Regions (ICTer), 2016 Sixteenth International Conference [114] A. Weerasinghe and G. Dias, “Construction of a multilingual on. IEEE, 2016, pp. 41–48. place name database for sri lanka,” 2013. [139] B. Kothalawala, R. Weerasinghe, and P. Kumarasinghe, “Online [115] G. A. Miller, “Wordnet: a lexical database for english,” Communi- learning for solving data availability problem in natural language cations of the ACM, vol. 38, no. 11, pp. 39–41, 1995. processing.” in NL4AI@ AI* IA, 2019. [116] Z. Wu and M. Palmer, “Verbs semantics and lexical selection,” [140] S. G. Withanage and T. Silva, “A stochastic part of speech tagger in Proceedings of the 32nd annual meeting on Association for Compu- for the sinhala language based on social media data mining,” in tational Linguistics. Association for Computational Linguistics, 2020 20th International Conference on Advances in ICT for Emerging 1994, pp. 133–138. Regions (ICTer). IEEE, 2020, pp. 137–142. [117] J. J. Jiang and D. W. Conrath, “Semantic similarity based on cor- [141] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, pus statistics and lexical taxonomy,” in Proc of 10th International and D. McClosky, “The Stanford CoreNLP natural language Conference on Research in Computational Linguistics, ROCLING’97. processing toolkit,” in Association for Computational Linguistics Citeseer, 1997. (ACL) System Demonstrations, 2014, pp. 55–60. [Online]. Available: [118] N. de Silva, D. Dou, and J. Huang, “Discovering inconsisten- http://www.aclweb.org/anthology/P/P14/P14-5010 cies in pubmed abstracts through ontology-based information [142] A. M. Gunasekara, A comprehensive grammar of the Sinhalese lan- extraction,” in Proceedings of the 8th ACM International Conference guage: adapted for the use of English readers and prescribed for the Civil on Bioinformatics, Computational Biology, and Health Informatics. Service examinations. GJA Skeen, 1891. ACM, 2017, pp. 362–371. [143] J. Aissen, “Differential object marking: Iconicity vs. economy,” [119] I. Wijesiri, M. Gallage, B. Gunathilaka, M. Lakjeewa, D. Wimala- Natural Language & Linguistic Theory, vol. 21, no. 3, pp. 435–483, suriya, G. Dias, R. Paranavithana, and N. de Silva, “Building a 2003. for Sinhala,” in Proceedings of the Seventh Global WordNet [144] J. K. Dahanayaka and A. R. Weerasinghe, “Named entity recogni- Conference, 2014, pp. 100–108. tion for sinhala language,” in Advances in ICT for Emerging Regions [120] Sinhala wordnet. [Online]. Available: http://www.wordnet.lk/ (ICTer), 2014 International Conference on. IEEE, 2014, pp. 215–220. [121] J. Arukgoda, V. Bandara, S. Bashani, V. Gamage, and D. Wimala- [145] K. U. Senevirathne, N. S. Attanayake, A. W. M. H. Dhananjanie, suriya, “A word sense disambiguation technique for sinhala,” W. A. S. U. Weragoda, A. Nugaliyadde, and S. Thelijjagoda, in 2014 4th International Conference on Artificial Intelligence with “Conditional random fields based named entity recognition for Applications in Engineering and Technology. IEEE, 2014, pp. 207– sinhala,” in 2015 IEEE 10th International Conference on Industrial 211. and Information Systems (ICIIS). IEEE, 2015, pp. 302–307. [122] V. Welgama, D. L. Herath, C. Liyanage, N. Udalamatta, [146] R. Azeez and S. Ranathunga, “Fine-grained named entity recog- R. Weerasinghe, and T. Jayawardana, “Towards a sinhala word- nition for sinhala,” in 2020 Moratuwa Engineering Research Confer- net,” in Proceedings of the Conference on Human Language Technology ence (MERCon). IEEE, 2020, pp. 295–300. for Development, 2011. [147] S. Herath, S. Ishizaki, T. Ikeda, Y. Anzai, and H. Aiso, “Syntactic [123] S. Herath, T. Ikeda, S. Ishizaki, Y. Anzai, and H. Aiso, “Analysis and semantic analysis of sinhala: a step towards intelligence com- system for sinhalese unit structure,” Journal of Experimental & puting systems,” in Proceedings. 5th IEEE International Symposium Theoretical Artificial Intelligence, vol. 4, no. 1, pp. 29–48, 1992. on Intelligent Control 1990. IEEE, 1990, pp. 316–324. [124] V. Welgama, R. Weerasinghe, and M. Niranjan, “Evaluating a ma- [148] J. C. S. Kadupitiya, S. Ranathunga, and G. Dias, “Sinhala chine learning approach to sinhala morphological analysis,” in short sentence similarity calculation using corpus-based and Proceedings of the 10th International Conference on Natural Language knowledge-based similarity measures,” in Proceedings of the 6th Processing, Noida, India, 2013. Workshop on South and Southeast Asian Natural Language Processing [125] N. Fernando and R. Weerasinghe, “A morphological parser for (WSSANLP2016), 2016, pp. 44–53. sinhala verbs,” in Proceedings of the International Conference on [149] ——, “Sinhala short sentence similarity measures using corpus- Advances in ICT for Emerging Regions, 2013. based simi-larity for short answer grading,” in 6th Workshop on [126] W. S. N. Dilshani and G. Dias, “A corpus-based morphological South and Southeast Asian Natural Language Processing, 2017, pp. analysis of sinhala verbs.” The Third International Conference 44–53. on Linguistics in Sri Lanka, ICLSL 2017 . . . , 2017. [150] W. V. Welgama, “Automatic text summarization for sinhala,” [127] M. Nandathilaka, S. Ahangama, and G. T. Weerasuriya, “A rule- 2012. based lemmatizing approach for sinhala language,” in 2018 3rd [151] O. S. Wimalasuriya, “Automatic text summarization for sinhala,” International Conference on Information Technology Research (ICITR). Ph.D. dissertation, 2019. IEEE, 2018, pp. 1–5. [152] D. Yarowsky, “Word-sense disambiguation using statistical mod- [128] V. Welgama, R. Weerasinghe, and N. Mahesan, “Defining the gold els of roget’s categories trained on large corpora,” in Proceedings standard definitions for the morphology of sinhala words.” of the 14th conference on Computational linguistics-Volume 2. Asso- [129] D. L. Herath and A. R. Weerasinghe, “A stochastic part of ciation for Computational Linguistics, 1992, pp. 454–460. speech tagger for sinhala,” in Proceedings of the 06th International [153] N. Ide and J. Veronis,´ “Introduction to the special issue on Information Technology Conference, 2004, pp. 27–28. word sense disambiguation: the state of the art,” Computational [130] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Evaluation of stochastic linguistics, vol. 24, no. 1, pp. 2–40, 1998. based tagging approach for sinhala language,” 2012. [154] D. Yarowsky, “Unsupervised word sense disambiguation rivaling [131] M. Jayasuriya and A. R. Weerasinghe, “Learning a stochastic part supervised methods,” in 33rd annual meeting of the association for of speech tagger for sinhala,” in Advances in ICT for Emerging computational linguistics, 1995, pp. 189–196. Regions (ICTer), 2013 International Conference on. IEEE, 2013, pp. [155] S. Banerjee and T. Pedersen, “An adapted lesk algorithm for word 137–143. sense disambiguation using wordnet,” in International conference [132] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Part of speech (pos) on intelligent text processing and computational linguistics. Springer, tagger for sinhala language,” 2011. 2002, pp. 136–145. [133] ——, “Hidden markov model based part of speech tagger for [156] R. Navigli, “Word sense disambiguation: A survey,” ACM com- sinhala language,” arXiv preprint arXiv:1407.2989, 2014. puting surveys (CSUR), vol. 41, no. 2, p. 10, 2009. [134] ——, “Unknown words analysis in pos tagging of sinhala lan- [157] M. Lesk, “Automatic sense disambiguation using machine read- guage,” in Advances in ICT for Emerging Regions (ICTer), 2014 able dictionaries: how to tell a pine cone from an ice cream cone,” International Conference on. IEEE, 2014, pp. 270–270. in Proceedings of the 5th annual international conference on Systems [135] ——, “Handling issues with unknown words in pos tagging.” documentation. Citeseer, 1986, pp. 24–26. Book of Abstracts, Annual Research Symposium 2014, 2014. [158] C. Marasinghe, S. Herath, and A. Herath, “Word sense disam- [136] M. Jayaweera and N. G. J. Dias, “Comparison of part of speech biguation of sinhala language with unsupervised learning,” in taggers for sinhala language,” 2016. Proc. International Conference on Information Technology and Appli- [137] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Restful pos tagging cations, 2002, pp. 25–29. web service for sinhala language,” in 2015 Fifteenth International [159] S. Palihakkara, D. Sahabandu, A. Shamsudeen, C. Bandara, and Conference on Advances in ICT for Emerging Regions (ICTer). IEEE, S. Ranathunga, “Dialogue act recognition for text-based sinhala,” 2015, pp. 50–57. in Proceedings of the 12th International Conference on Natural Lan- [138] D. Gunasekara, W. V. Welgama, and A. R. Weerasinghe, “Hybrid guage Processing, 2015, pp. 367–375. 17

[160] S. Gallege, “Analysis of sinhala using natural language process- Association for Computational Linguistics, 2006, pp. 890–897. ing techniques,” 2010. [183] T. Nadungodage, C. Liyanage, A. Prerera, R. Pushpananda, and [161] K. B. N. Lakmali and P. S. Haddela, “Effectiveness of rule- R. Weerasinghe, “Sinhala g2p conversion for speech processing,” based classifiers in sinhala text categorization,” in 2017 National in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Information Technology Conference (NITC). IEEE, 2017, pp. 153– Under-Resourced Languages, pp. 112–116. 158. [184] R. Weerasinghe, A. Wasala, V. Welgama, and K. Gamage, [162] P. K. S. Kumari and P. S. Haddela, “Use of lime for human “Festival-si: A sinhala text-to-speech system,” in International interpretability in sinhala document classification,” in 2019 In- Conference on Text, Speech and Dialogue. Springer, 2007, pp. 472– ternational Research Conference on Smart Computing and Systems 479. Engineering (SCSE). IEEE, 2019, pp. 97–102. [185] M. S. Amarasekara, K. M. N. S. Bandara, B. V. A. I. Vithana, [163] M. T. Ribeiro, S. Singh, and C. Guestrin, “Why should i trust you?: D. H. De Silva, and A. Jayakody, “Real-time interactive voice Explaining the predictions of any classifier,” in Proceedings of the communication-for a mute person in sinhala (rtivc),” in 2013 8th 22nd ACM SIGKDD international conference on knowledge discovery International Conference on Computer Science & Education. IEEE, and data mining. ACM, 2016, pp. 1135–1144. 2013, pp. 671–675. [164] P. Nanayakkara and S. Ranathunga, “Clustering sinhala news ar- [186] K. H. Kumara, N. G. J. Dias, and H. Sirisena, “Automatic seg- ticles using corpus-based similarity measures,” in 2018 Moratuwa mentation of given set of sinhala text into syllables for speech Engineering Research Conference (MERCon). IEEE, 2018, pp. 437– synthesis,” pp. 53–62, 2007. 442. [187] W. M. C. Bandara, W. M. S. Lakmal, T. D. Liyanagama, S. V. Bu- [165] S. V. S. Gunasekara and P. S. Haddela, “Context aware stopwords lathsinghala, G. Dias, and S. Jayasena, “A ew prosodic phrasing for sinhala text classification,” in 2018 National Information Tech- method for sinhala language,” 2017. nology Conference (NITC). IEEE, 2018, pp. 1–6. [188] W. M. C. Bandara, S. V. Bulathsinghala, W. M. S.Lakmal, T. D. [166] S. H. Jayasinghe and K. Sirts, “Deep learning textual entailment Liyanagama, G. Dias, and S. Jayasena, “Sinhala text to speech system for sinhala language,” 2019. system,” 2009. [167] N. Medagoda, “Sentiment analysis on morphologically rich lan- [189] W. M. C. Bandara, V. M. S. Lakmal, T. D. Liyanagama, S. V. Bu- guages: An artificial neural network (ann) approach,” in Artificial lathsinghala, G. Dias, and S. Jayasena, “A new prosodic phrasing Neural Network Modelling. Springer, 2016, pp. 377–393. model for sinhala language,” 2013. [168] N. Medagoda, S. Shanmuganathan, and J. Whalley, “Sentiment [190] K. Sodimana, P. De Silva, R. Sproat, A. Theeraphol, C. F. Li, lexicon construction using sentiwordnet 3.0,” in 2015 11th Inter- A. Gutkin, S. Sarin, and K. Pipatsrisawat, “Text normalization national Conference on Natural Computation (ICNC). IEEE, 2015, for bangla, khmer, nepali, javanese, sinhala, and sundanese text- pp. 802–807. to-speech systems,” 2018. [169] P. D. T. Chathuranga, S. A. S. Lorensuhewa, and M. A. L. Kalyani, [191] K. Sodimana, K. Pipatsrisawat, L. Ha, M. Jansche, O. Kjartansson, “Sinhala sentiment analysis using corpus based sentiment lexi- P. De Silva, and S. Sarin, “A step-by-step process for building con,” in International Conference on Advances in ICT for Emerging tts voices using open source data and framework for bangla, Regions (ICTer), vol. 1, 2019, p. 7. javanese, khmer, nepali, sinhala, and sundanese,” 2018. [170] P. Demotte, L. Senevirathne, B. Karunanayake, U. Munasinghe, [192] D. S. Jayamanna, “Android based sinhala document reader for and S. Ranathunga, “Sentiment Analysis of Sinhala News Com- visually impaired persons,” 2014. ments using Sentence-State LSTM Networks,” in 2020 Moratuwa [193] A. K. P. D. Mishangi, “Android based sinhala document reader Engineering Research Conference (MERCon). IEEE, 2020, pp. 283– for visually impaired people,” 2018. 288. [194] D. S. S. De Zoysa, J. M. Sampath, E. M. P. De Seram, D. M. [171] B. Karunanayake, U. Munasinghe, P. Demotte, L. Senevirathne, I. D. Dissanayake, L. Wijerathna, and S. Thelijjagoda, “Project and S. Ranathunga, “Sinhala Sentiment Lexicon Generation using bhashitha-mobile based optical character recognition and text-to- Word Similarity,” in 2020 20th International Conference on Advances speech system,” in 2018 13th International Conference on Computer in ICT for Emerging Regions (ICTer). IEEE, 2020, pp. 77–82. Science & Education (ICCSE). IEEE, 2018, pp. 1–5. [172] L. Senevirathne, P. Demotte, B. Karunanayake, U. Munas- [195] K. S. Anuradha and S. Thelijjagoda, “Machine translation system inghe, and S. Ranathunga, “Sentiment Analysis for Sinhala to convert sinhala and english braille documents into voice,” Language using Deep Learning Techniques,” arXiv preprint in 2020 International Research Conference on Smart Computing and arXiv:2011.07280, 2020. Systems Engineering (SCSE). IEEE, 2020, pp. 7–16. [173] P. Jayasuriya, R. Munasinghe, and S. Thelijjagoda, “Sentiment [196] L. Nanayakkara, C. Liyanage, P.-T. Viswakula, T. Nagungodage, classification of sinhala content in social media: A comparison R. Pushpananda, and R. Weerasinghe, “A human quality text between word n-grams and character n-grams.” to speech system for sinhala,” in Proc. The 6th Intl. Workshop on [174] S. Ranathunga and I. U. Liyanage, “Sentiment analysis of sinhala Spoken Language Technologies for Under-Resourced Languages, pp. news comments,” Transactions on Asian and Low-Resource Language 157–161. Information Processing, vol. 20, no. 4, pp. 1–23, 2021. [197] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Speech [175] H. Pabasara and S. Jayalal, “Computational model for detecting recognition for low resourced languages: Efficient use of training grammatical mistakes in sinhala text,” in 9TH YSF SYMPOSIUM, data for sinhala speech recognition by active learning.” 2020, p. 255. [198] T. Nadungodage and R. Weerasinghe, “Continuous sinhala [176] ——, “Grammatical error detection and correction model for sin- speech recognizer,” in Conference on Human Language Technology hala language sentences,” in 2020 International Research Conference for Development, Alexandria, Egypt, 2011, pp. 2–5. on Smart Computing and Systems Engineering (SCSE). IEEE, 2020, [199] T. Nadungodage, R. Weerasinghe, and M. Niranjan, “Efficient pp. 17–24. use of training data for Sinhala speech recognition using active [177] S. Gunasekara, D. Chathura, C. Jeewantha, and G. Dias, “Using learning,” in Advances in ICT for Emerging Regions (ICTer), 2013 annotation projection for semantic role labeling of low-resourced International Conference on. IEEE, 2013, pp. 149–153. language: Sinhala,” 2020. [200] ——, “Speaker Adaptation Applied to Sinhala Speech Recogni- [178] U. Isuranga, J. Sandaruwan, U. Athukorala, and G. Dias, “Im- tion.” Int. J. Comput. Linguistics Appl., vol. 6, no. 1, pp. 117–129, proved cross-lingual document similarity measurement,” 2020. 2015. [179] A. Wasala and K. Gamage, “Research report on phonetics and [201] W. G. T. N. Amarasingha and D. D. A. Gamini, “Speaker phonology of sinhala,” Language Technology Research Laboratory, independent sinhala speech recognition for voice dialling,” in University of Colombo School of Computing, vol. 35, 2005. International Conference on Advances in ICT for Emerging Regions [180] R. I. P. Wickramasinghe, K. H. Kumara, and N. G. J. Dias, (ICTer2012). IEEE, 2012, pp. 3–6. “Practical issues in the development of tts and sr for the sinhala [202] W. Manamperi, D. Karunathilake, T. Madhushani, N. Galagedara, language,” 2007. and D. Dias, “Sinhala speech recognition for interactive voice [181] R. Weerasinghe, A. Wasala, and K. Gamage, “A rule based response systems accessed through mobile phones,” in 2018 syllabification algorithm for sinhala,” in International Conference Moratuwa Engineering Research Conference (MERCon). IEEE, 2018, on Natural Language Processing. Springer, 2005, pp. 438–449. pp. 241–246. [182] A. Wasala, R. Weerasinghe, and K. Gamage, “Sinhala grapheme- [203] P. G. N. Priyadarshani, “Speaker dependent speech recognition to-phoneme conversion and rules for schwa epenthesis,” in Pro- on a selected set of sinhala words,” 2012. ceedings of the COLING/ACL on Main conference poster sessions. [204] P. G. N. Priyadarshani, N. G. J. Dias, and A. Punchihewa, “Dy- 18

namic time warping based speech recognition for isolated sinhala national Conference on Signal and Image Processing (ICSIP). IEEE, words,” in 2012 IEEE 55th International Midwest Symposium on 2017, pp. 46–50. Circuits and Systems (MWSCAS). IEEE, 2012, pp. 892–895. [225] M. L. M. Karunanayaka, N. D. Kodikara, and G. D. S. P. [205] ——, “Genetic algorithm approach for sinhala speech recog- Wimalaratne, “Off line sinhala handwriting recognition with an nition,” in 2012 IEEE 55th International Midwest Symposium on application for postal city name recognition,” Il’I’C 2004, 2004. Circuits and Systems (MWSCAS). IEEE, 2012, pp. 896–899. [226] R. Weerasinghe, A. Wasala, D. Herath, and V. Welgama, “Nlp [206] P. G. N. Priyadarshani and N. G. J. Dias, “Automatic segmen- applications of sinhala: Tts & ocr,” in Proceedings of the Third Inter- tation of separately pronounced sinhala words into syllables,” national Joint Conference on Natural Language Processing: Volume-II, 2011. 2008. [207] M. K. H. Gunasekara and R. G. N. Meegama, “Real-time transla- [227] A. R. Weerasinghe, D. L. Herath, and N. P. K. Medagoda, “A tion of discrete sinhala speech to unicode text,” in 2015 Fifteenth nearest-neighbor based algorithm for printed sinhala character International Conference on Advances in ICT for Emerging Regions recognition,” Innovations for a Knowledge Economy, p. 11, 2006. (ICTer). IEEE, 2015, pp. 140–145. [228] D. N. Ediriweera, “Improviing the accuracy of the output of [208] M. Punchimudiyanse and R. G. N. Meegama, “Unicode sinhala sinhala ocr by using a dictionary,” Ph.D. dissertation, University and phonetic english bi-directional conversion for sinhala speech of Moratuwa Sri Lanka, 2012. recognizer,” in 2015 IEEE 10th International Conference on Industrial [229] G. Dias, T. N. P. Patikirikorala, C. I. Arambewela, R. P. M. and Information Systems (ICIIS). IEEE, 2015, pp. 296–301. Darshana, and N. D. Alahendra, “Sinhala optical character recog- [209] Y. Karunanayake, U. Thayasivam, and S. Ranathunga, “Transfer nition for desktops,” 2013. learning based free-form speech command classification for low- [230] G. Dias, T. N. P. Patikirikorala, C. I. Arambewela, R. P. M. resource languages,” in Proceedings of the 57th Conference of the As- Darshani, and N. D. Alahendra, “Online sinhala handwritten sociation for Computational Linguistics: Student Research Workshop, character recognition for desktops,” 2013. 2019, pp. 288–294. [231] M. H. P. Ranmuthugala, G. D. N. C. Pathiragoda, S. H. C. [210] K. A. D. C. Dilshan, “Transcribing number sequences in continu- Jayasundara, G. Dias, and A. S. Karunananda, “Online sinhala ous sinhala speech,” 2018. handwritten character recognition on handheld devices,” Innova- [211] T. Dinushika, L. Kavmini, P. Abeyawardhana, U. Thayasivam, tions for a Knowledge Economy, p. 1, 2006. and S. Jayasena, “Speech command classification system for [232] M. Rimas, R. P. Thilakumara, and P. Koswatta, “Optical character sinhala language based on automatic speech recognition,” in recognition for sinhala language,” in 2013 IEEE Global Humanitar- 2019 International Conference on Asian Language Processing (IALP). ian Technology Conference: South Asia Satellite (GHTC-SAS). IEEE, IEEE, 2019, pp. 205–210. 2013, pp. 149–153. [212] B. Gamage, R. Pushpananda, R. Weerasinghe, and T. Nadun- [233] G. I. Gunarathna, M. A. P. Chamikara, and R. G. Ragel, “A fuzzy godage, “Usage of combinational acoustic models (dnn-hmm based model to identify printed sinhala characters,” in 7th Inter- and sgmm) and identifying the impact of language models in national Conference on Information and Automation for Sustainability. sinhala speech recognition,” in 2020 20th International Conference IEEE, 2014, pp. 1–6. on Advances in ICT for Emerging Regions (ICTer). IEEE, 2020, pp. [234] H. W. H. Premachandra, C. Premachandra, T. Kimura, and 17–22. H. Kawanaka, “Artificial neural network based sinhala character [213] D. Giuliani and B. BabaAli, “Large vocabulary children’s speech recognition,” in International Conference on Computer Vision and recognition with dnn-hmm and sgmm acoustic modeling,” in Six- Graphics. Springer, 2016, pp. 594–603. teenth Annual Conference of the International Speech Communication [235] J. M. H. M. Jayamaha and H. M. M. Naleer, “Feature extraction Association, 2015. technique based character recognition using artificial neural net- [214] H. Karunathilaka, V. Welgama, T. Nadungodage, and R. Weeras- work for sinhala characters,” 2016. inghe, “Low-resource sinhala speech recognition using deep [236] T. N. Kumara and R. Ragel, “A systematic feature selection learning,” in 2020 20th International Conference on Advances in ICT process for a sinhala character recognition system,” in 2016 for Emerging Regions (ICTer). IEEE, 2020, pp. 196–201. IEEE International Conference on Information and Automation for [215] D. Buddhika, R. Liyadipita, S. Nadeeshan, H. Witharana, Sustainability (ICIAfS). IEEE, 2016, pp. 1–6. S. Javasena, and U. Thayasivam, “Domain specific intent clas- [237] B. R. Jayawickrama, L. Ranathunga, K. L. Mahaliyanaarachchi, sification of sinhala speech data,” in 2018 International Conference L. G. B. Subhagya, and W. H. A. Nimasha, “Letter segmentation on Asian Language Processing (IALP). IEEE, 2018, pp. 197–202. and modifier detection in printed sinhala signage,” in 2018 18th [216] L. Kavmini, T. Dinushika, U. Thayasivam, and S. Jayasena, International Conference on Advances in ICT for Emerging Regions “Improved speech command classification system for sinhala (ICTer). IEEE, 2018, pp. 203–208. language based on automatic speech recognition,” International [238] S. Gunawardhana and L. Ranathunga, “Segmentation and iden- Journal of Asian Language Processing, p. 2050009, 2020. tification of presence of sinhala characters in facebook images,” [217] R. K. Rajapakse, A. R. Weerasinghe, and E. K. Seneviratne, “A in 2018 IEEE 13th International Conference on Industrial and Infor- neural network based character recognition system for sinhala mation Systems (ICIIS). IEEE, 2018, pp. 77–82. script,” Department of Statistics and Computer Science, University of [239] K. L. N. D. Liyanage, “Improving sinhala ocr using deep learn- Colombo, 1995. ing,” 2018. [218] H. L. Premaratne and J. Bigun, “Recognition of printed sinhala [240] I. Anuradha, C. Liyanage, and R. Weerasinghe, “Estimating the characters using linear symmetry,” in The 5th Asian Conference on effects of text genre, image resolution and algorithmic complexity Computer Vision, 2002, pp. 23–25. needed for sinhala optical character recognition,” International [219] ——, “A segmentation-free approach to recognise printed sinhala Journal on Advances in ICT for Emerging Regions (ICTer), vol. 14, script using linear symmetry,” Pattern recognition, vol. 37, no. 10, no. 3, 2021. pp. 2081–2089, 2004. [241] H. C. Fernando, N. D. Kodikara, and S. Hewavitharana, “A [220] H. L. Premaratne, E. Jarpe,¨ and J. Bigun, “Lexicon and hid- database for handwriting recognition research in sinhala lan- den markov model-based optimisation of the recognised sinhala guage.” in ICDAR, 2003, pp. 1262–1264. script,” Pattern recognition letters, vol. 27, no. 6, pp. 696–705, 2006. [242] M. L. M. Karunanayaka, C. A. Marasinghe, and N. D. Kodikara, [221] S. Hewavitharana, H. C. Fernando, and N. D. Kodikara, “Off-line “Thresholding, noise reduction and skew correction of sinhala sinhala handwriting recognition using hidden markov models.” handwritten words.” in MVA, 2005, pp. 355–358. in ICVGIP, 2002. [243] B. Jayasekara and L. Udawatta, “Non-cursive sinhala hand- [222] S. Hewavitharana and N. D. Kodikara, “A statistical approach written script recognition: A genetic algorithm based alphabet to sinhala handwriting recognition,” in Proc. of the International training approach,” in Proceedings of the International Conference Information Technology Conference (IITC), Colombo, Sri Lanka, 2002. on Information and Automation, 2005. [223] S. Ajward, N. Jayasundara, S. Madushika, and R. Ragel, “Con- [244] N. P. T. I. Nilaweera, H. L. Premeratne, and D. U. J. Sonnadara, verting printed sinhala documents to formatted editable text,” in “Comparison of projection and wavelet based techniques in 2010 Fifth International Conference on Information and Automation recognition of sinhala handwritten scripts,” in Proceedings of the for Sustainability. IEEE, 2010, pp. 138–143. 25th National IT Conference, 2007. [224] P. T. C. Madushanka, R. Bandara, and L. Ranathunga, “Sinhala [245] C. Silva and C. Kariyawasam, “Segmenting sinhala handwritten handwritten character recognition by using enhanced thinning characters,” International Journal of Conceptions on Computing and and curvature histogram based method,” in 2017 IEEE 2nd Inter- Information Technology, vol. 2, no. 4, pp. 22–26, 2014. 19

[246] C. M. Silva, N. D. Jayasundere, and C. Kariyawasam, “State of [268] P. N. Tennage, M. W. D. P. Sandaruwan, J. K. M. M. Thilakarathne, handwriting recognition of modern sinhala script,” 2014. A. N. Herath, S. Ranathunga, S. Jayasena, and G. Dias, “Neural [247] ——, “Contour tracing for isolated sinhala handwritten character machine translation for sinhala-tamil,” 2017. recognition,” in 2015 Fifteenth International Conference on Advances [269] P. Tennage, A. Herath, M. Thilakarathne, P. Sandaruwan, and in ICT for Emerging Regions (ICTer). IEEE, 2015, pp. 25–31. S. Ranathunga, “Transliteration and byte pair encoding to im- [248] K. A. K. N. D. Dharmapala, W. P. M. V. Wijesooriya, C. P. prove tamil to sinhala neural machine translation,” in 2018 Chandrasekara, U. K. A. U. Rathnapriya, and L. Ranathunga, Moratuwa Engineering Research Conference (MERCon). IEEE, 2018, “Sinhala handwriting recognition mechanism using zone based pp. 390–395. feature extraction,” 2017. [270] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, and [249] K. S. A. Walawage and L. Ranathunga, “Segmentation of overlap- S. Ranathunga, “Handling rare word problem using synthetic ping and touching sinhala handwritten characters,” in 2018 3rd training data for sinhala and tamil neural machine translation,” International Conference on Information Technology Research (ICITR). in Proceedings of the Eleventh International Conference on Language IEEE, 2018, pp. 1–6. Resources and Evaluation (LREC-2018), 2018. [250] C. M. Silva and N. Jayasundere, “Character modifier combina- [271] S. Ranathunga, F. Farhath, U. Thayasivam, S. Jayasena, and tions recognition in sinhala handwriting.” G. Dias, “Si-ta: Machine translation of sinhala and tamil official [251] K. A. M. P. Rathnasena, K. M. S. J. Kumarasinghe, D. T. P. documents,” in 2018 National Information Technology Conference Paranavitharana, D. V. A. U. Dayarathne, and L. Ranathunga, (NITC). IEEE, 2018, pp. 1–6. “Summarization based approach for old sinhala text archival [272] F. Farhath, S. Ranathunga, S. Jayasena, and G. Dias, “Integration search and preservation,” in 2018 18th International Conference on of bilingual lists for domain-specific statistical machine trans- Advances in ICT for Emerging Regions (ICTer). IEEE, 2018, pp. lation for sinhala-tamil,” in 2018 Moratuwa Engineering Research 182–188. Conference (MERCon). IEEE, 2018, pp. 538–543. [252] T. M. T. H. Peiris, “Recognition of inscriptions in ancient sri [273] R. Weerasinghe, “A statistical machine translation approach to lanka,” 2012. sinhala- translation,” Towards an ICT enabled Soci- [253] K. G. N. D. Karunarathne, K. V. Liyanage, D. A. S. Ruwanmini, ety, p. 136, 2003. K. Dias, and S. Nandasara, “Recognizing ancient sinhala inscrip- [274] S. Sripirakas, A. R. Weerasinghe, and D. L. Herath, “Statistical tion characters using neural network technologies,” Internationa machine translation of systems for sinhala-tamil,” in Advances in Journal of Scientific Emgineering and Applied Sciences, vol. 3, no. 1, ICT for Emerging Regions (ICTer), 2010 International Conference on. 2017. IEEE, 2010, pp. 62–68. [254] S. Chanda, S. Pal, and U. Pal, “Word-wise sinhala tamil and [275] M. Jeyakaran and R. Weerasinghe, “A novel kernel regression english script identification using gaussian kernel svm,” in 2008 based machine translation system for sinhala-tamil translation,” 19th International Conference on Pattern Recognition. IEEE, 2008, in Proceedings of 4th Annual UCSC Research Symposium, 2013. pp. 1–4. [276] R. Pushpananda, R. Weerasinghe, and M. Niranjan, “Towards [255] J. Liyanapathirana and R. Weerasinghe, “English to sinhala sinhala tamil machine translation,” in Advances in ICT for Emerg- machine translation: Towards better information access for sri ing Regions (ICTer), 2013 International Conference on. IEEE, 2013, lankans,” in Conference on Human Language Technology for Devel- pp. 288–288. opment, 2011, pp. 182–186. [277] ——, “Sinhala-tamil machine translation: Towards better transla- [256] J. U. Liyanapathirana, “A statistical approach to english and tion quality,” in Proceedings of the Australasian Language Technology sinhala translation,” 2013. Association Workshop 2014, 2014, pp. 129–133. [257] L. Wijerathna, W. L. S. L. Somaweera, S. L. Kaduruwana, Y. V. [278] S. Rajpirathap, S. Sheeyam, K. Umasuthan, and A. Chelvara- Wijesinghe, D. I. De Silva, K. Pulasinghe, and S. Thellijjagoda, “A jah, “Real-time direct translation system for sinhala and tamil translator from sinhala to english and english to sinhala (sees),” languages,” in 2015 Federated Conference on Computer Science and in International Conference on Advances in ICT for Emerging Regions Information Systems (FedCSIS). IEEE, 2015, pp. 1437–1443. (ICTer2012). IEEE, 2012, pp. 14–18. [279] W. S. N. Dilshani, S. Yashothara, R. T. Uthayasanker, and [258] D. De Silva, A. Alahakoon, I. Udayangani, V. Kumara, D. Kolon- S. Jayasena, “Linguistic divergence of sinhala and tamil lan- nage, H. Perera, and S. Thelijjagoda, “Sinhala to english language guages in machine translation,” in 2018 International Conference translator,” in 2008 4th International Conference on Information and on Asian Language Processing (IALP). IEEE, 2018, pp. 13–18. Automation for Sustainability. IEEE, 2008, pp. 419–424. [280] T. Mokanarangan, “Translation of named entities between sin- [259] A. M. Silva and R. Weerasinghe, “Example based machine trans- hala and tamil for official government documents,” 2019. lation for english-sinhala translations,” in Proceedings of the 09th [281] A. Arukgoda, A. R. Weerasinghe, and R. Pushpananda, “Improv- International IT Conference, 2008, pp. 27–28. ing sinhala-tamil translation through deep learning techniques,” [260] A. J. Vidanaralage, A. U. Illangakoon, S. Y. Sumanaweera, 2019. C. Pavithra, and S. Thelijjagoda, “Sinhala language decoder,” in [282] A. Pramodya, R. Pushpananda, and R. Weerasinghe, “a compar- 2018 National Information Technology Conference (NITC). IEEE, ison of transformer, recurrent neural networks and smt in tamil 2018, pp. 1–5. to sinhala mt,” in 2020 20th International Conference on Advances in [261] J. Joseph, W. Chathurika, A. Nugaliyadde, and ICT for Emerging Regions (ICTer). IEEE, 2020, pp. 155–160. Y. Mallawarachchi, “Evolutionary algorithm for sinhala to [283] L. Nissanka, B. Pushpananda, and A. Weerasinghe, “Exploring english translation,” arXiv preprint arXiv:1907.03202, 2019. neural machine translation for sinhala-tamil languages pair,” in [262] A. Fernando, S. Ranathunga, and G. Dias, “Data augmenta- 2020 20th International Conference on Advances in ICT for Emerging tion and terminology integration for domain-specific sinhala- Regions (ICTer). IEEE, 2020, pp. 202–207. english-tamil statistical machine translation,” arXiv preprint [284] R. M. M. Shalini and B. Hettige, “Dictionary based machine arXiv:2011.02821, 2020. translation system for pali to sinhala,” in SLAAI-International [263] T. Fonseka, R. Naranpanawa, R. Perera, and U. Thayasivam, Conference on Artificial Intelligence, 2017, p. 23. “English to sinhala neural machine translation,” in IALP, 2020. [285] R. C. Childers, A dictionary of the Pali language. Trubner,¨ 1875. [264] R. Naranpanawa, R. Perera, T. Fonseka, and U. Thayasivam, [286] S. Salaville, An introduction to the study of eastern liturgies. Sands “Analyzing subword techniques to improve english to sinhala & Company, 1938. neural machine translation,” International Journal of Asian Lan- [287] A. J. Liddicoat, “Choosing a liturgical language,” Australian Re- guage Processing, vol. 30, no. 04, p. 2050017, 2020. view of Applied Linguistics, vol. 16, no. 2, pp. 123–141, 1993. [265] A. Fernando, G. Dias, and S. Ranathunga, “Data augmenta- [288] A. Wasala, R. Weerasinghe, R. Pushpananda, C. Liyanage, and tion and list integration for improving domain-specific sinhala- E. Jayalatharachchi, “A data-driven approach to checking and english-tamil statistical machine translation,” 2021. correcting spelling errors in sinhala,” Int. J. Adv. ICT Emerg. Reg, [266] A. D. De Silva, “Singlish to sinhala converter using machine vol. 3, no. 01, 2010. learning,” 2020. [289] R. A. Wasala, R. Weerasinghe, R. Pushpananda, C. Liyanage, and [267] P. Tennage, P. Sandaruwan, M. Thilakarathne, A. Herath, E. Jayalatharachchi, “An open-source data driven S. Ranathunga, S. Jayasena, and G. Dias, “Neural machine for sinhala,” ICTer, vol. 3, no. 1, 2011. translation for sinhala and tamil languages,” in Asian Language [290] E. Jayalatharachchi, A. Wasala, and R. Weerasinghe, “Data-driven Processing (IALP), 2017 International Conference on. IEEE, 2017, spell checking: the synergy of two algorithms for spelling error pp. 189–192. detection and correction,” in International Conference on Advances 20

in ICT for Emerging Regions (ICTer2012). IEEE, 2012, pp. 7–13. S. Thelijjagoda, “Arunalu: Learning ecosystem to overcome sin- [291] L. G. B. Subhagya, L. Ranathunga, W. H. A. Nimasha, B. R. hala reading weakness due to dyslexia,” in 2020 2nd International Jayawickrama, and K. L. Mahaliyanaarchchi, “Data driven ap- Conference on Advancements in Computing (ICAC), vol. 1. IEEE, proach to sinhala spellchecker and correction,” in 2018 18th 2020, pp. 416–421. International Conference on Advances in ICT for Emerging Regions [314] C. Rajitha, L. Piyarathne, D. Sachintha, and S. Ranathunga, “Met- (ICTer). IEEE, 2018, pp. 01–06. ric learning in multilingual sentence similarity measurement for [292] U. Liyanapathirana, K. Gunasinghe, and G. Dias, “Sinspell: document alignment,” arXiv preprint arXiv:2108.09495, 2021. A comprehensive spelling checker for sinhala,” arXiv preprint [315] J. B. Disanayake, Basaka Mahima: 2 - Akuru ha Pili. Godage & arXiv:2107.02983, 2021. brothers, 2000. [293] L. Sithamparanathan and T. Uthayasanker, “A sinhala and tamil [316] ——, Basaka Mahima: 6 - Prakurthi. Godage & brothers, 2004. extension to generic environment for context-aware correction,” [317] ——, Sinhala Reethiya: 7 - Pada Nirmanaya. Sumitha Books, 2014. in 2019 National Information Technology Conference (NITC). IEEE, [318] ——, Basaka Mahima: 8 - Tadditha. Godage & brothers, 2000. 2019, pp. 102–106. [319] ——, Basaka Mahima: 10 - Nama Padaya. Godage & brothers, 2008. [294] M. Punchimudiyanse and R. G. N. Meegama, “Computer in- [320] ——, Basaka Mahima: 11 - Kriya Padaya. Godage & brothers, 2001. terpreter for translating written sinhala to sinhala sign,” OUSL [321] ——, The structure of spoken Sinhala: Sounds and their patterns. Journal, vol. 12, no. 1, pp. 70–90, 2017. National Institute of Education, 1991. [295] ——, “Animation of fingerspelled words and number signs of [322] ——, Sinhala Akshara Vicharaya (Sinhala Graphology). Sumitha the sinhala sign language,” ACM Transactions on Asian and Low- Publishers, 2006. Resource Language Information Processing (TALLIP), vol. 16, no. 4, [323] ——, The Usage of Dental and Cerebral Nasals. Sumitha Publishers, p. 24, 2017. 2007. [296] B. Gamage, R. Pushpananda, and R. Weerasinghe, “The impact [324] ——, Bashavaka rata samudaya. Lake house investment Co. Ltd. of using pre-trained word embeddings in sinhala ,” in Colombo 2, 1969. 2020 20th International Conference on Advances in ICT for Emerging [325] ——, Grammar of Contemporary Literary Sinhala-Introduction to Regions (ICTer). IEEE, 2020, pp. 161–165. Grammar, Structure of Spoken Sinhala. Godage & Bros, 1995, vol. [297] T. Dissanayake and B. Hettige, “Thematic relations based qa gen- 661. erator for sinhala,” 13th International Research Conference General [326] D. A. Indrasena, Sinhala Akshara Malava, 2001. Sir John Kotelawala Defence University, 2020. [327] K. Jayathilake, Modern Sinhalese lingustics. Pradeepa Publica- [298] S. C. Fernando, “Inexact matching of proper names in sinhala,” tions, 1991. 2011. [328] D. K. Henadeerage, “Topics in sinhala syntax,” Ph.D. disserta- [299] A. B. P. Kanduboda and K. Tamaoka, “Priority information in tion, The Australian National University, 2002. determining canonical word order of colloquial sinhalese sen- [329] A. E. S. Dasanayaka, kumara rachanaya; Grade 4. M D Gunasena tences,” in Proceedings of the 139th Conference of the Linguistic Publishers, Colombo, 1990. Society of Japan, vol. 1, 2009, pp. 32–37. [330] ——, kumara rachanaya; Grade 5. M D Gunasena Publishers, [300] ——, “Priority information for canonical word order of written Colombo, 2005. sinhala sentences,” in Proceedings of the 140th Conference of the [331] S. O. Fernando, Wara nonamena Nipatha. Sammana, January 1994. Linguistic Society of Japan, 2010, pp. 358–363. [332] ——, Kriya pada igena ganimu. Sammana, January 1994. [301] K. Tamaoka, P. B. A. Kanduboda, and H. Sakai, “Effects of word [333] ——, Sinhala Nouns year 11. Sammana, March 1994. order alternation on the sentence processing of sinhalese written [334] E. Ranawake, Spoken Sinhalese for Foreigners. MD Gunasena & and spoken forms,” Open Journal of Modern Linguistics, vol. 1, Co. Ltd., 1986. no. 02, pp. 24–32, 2011. [335] A. M. Gunasekara, A comprehensive grammar of the Sinhalese lan- [302] A. B. P. Kanduboda and K. Tamaoka, “Priority information deter- guage. Asian Educational Services, 1999. mining the canonical word order of written sinhalese sentences,” [336] S. Koparahewa, Dictionary of Sinhala Spelling. S. Godage and Open Journal of Modern Linguistics, vol. 2, no. 01, p. 26, 2012. Brothers, Colombo, 2006. [303] J. Jayakody, T. Gamlath, W. Lasantha, K. Premachandra, A. Nu- [337] J. W. Gair and W. S. Karunatillake, The Sinhala Writing System, A galiyadde, and Y. Mallawarachchi, ““mahoshadha”, the sinhala Guide to Transliteration. Sinhamedia, PO Box 1027, Trumansburg, tagged corpus based system,” in Proceedings NY 14886, 2006. of First International Conference on Information and Communication [338] W. S. Karunatillake, An Introduction to Spoken Sinhala.MD Technology for Intelligent Systems: Volume 1. Springer, 2016, pp. Gunasena & Company, 1990. 313–322. [339] M. Inman, Duration and in Sinhala. Stanford University, [304] S. T. Sandaruwan, S. A. S. Lorensuhewa, and K. Munasinghe, 1986. “Identification of abusive sinhala comments in social media using [340] K. Munidasa, Vyakarana Vivaranaya. MD Gunasena Publishers, text mining and machine learning techniques,” ICTer, vol. 13, Colombo, 1938. no. 1, 2020. [341] National Institute of Education, Sinhala Laekhana Reethiya, 1989. [305] S. Basnayake, H. Wijekoon, and T. K. Wijayasiriwardhane, “Pla- [342] K. Jayathilake, Nuthana Sinhala Vyakaranaye Mul Potha. Pradeepa giarism detection in sinhala language: A software approach.” Publications, 1991. [306] L. Rajamanthri and S. Thelijjagoda, “Sinhala language plagiarism [343] U. S. Sannasgala and A. Perera, Viyakarana Vimansawa, 1995. tool with internet resources using natural language processing.” [344] V. G. Balagalle, Basha Adauanayasaha Sinhala Vivaharaya, 1995. [307] L. Jayasekara and S. Ahangama, “Trend detection in sinhala [345] W. S. Karunatilaka, Sinhala Basha Viharanaya. M D Gunasena tweets using clustering and ranking algorithms,” in 2020 From Publishers, Colombo, 2004. Innovation to Impact (FITI), vol. 1. IEEE, 2020, pp. 1–6. [346] S. Karunarathna, Sinahala Viharanaya. Washana prakasakayo, [308] D. H. A. De Silva, “An approach to hate speech detection,” 2019. Dankotuwa, Sri Lanka, 2004. [309] V. Liyanage and S. Ranathunga, “A multi-language platform for [347] P. Alwis, Niwaeradi Wahara. Suriya Prakashakayo, 2006, vol. 109. generating algebraic mathematical word problems,” in 2019 14th [348] ——, Niwaeradi Wahara -2. Suriya Prakashakayo, 2007. Conference on Industrial and Information Systems (ICIIS). IEEE, [349] K. C. Perera, Prayogika Sinhla Viyakaranaya. 2019, pp. 332–337. [350] K. Munidasa, Kriya Vivaranaya. MD Gunasena Publishers, [310] ——, “Multi-lingual mathematical word problem generation us- Colombo, 1993. ing long short term memory networks with enhanced input fea- [351] T. Jayawardana, The surface case system in Sinhala. University of tures,” in Proceedings of The 12th Language Resources and Evaluation Kelaniya, 1989. Conference, 2020, pp. 4709–4716. [352] D. Rajapaksha, Sinhala bhashave pada bedima saha lakshana [311] I. Smith and U. Thayasivam, “Language detection in sinhala- bhavithaya, 2008. english code-mixed data,” in 2019 International Conference on [353] S. M. Kariyakarawana, The syntax of focus and wh-questions in Asian Language Processing (IALP). IEEE, 2019, pp. 228–233. Sinhala. Karunaratne & Sons Limited, 1998. [312] ——, “Sinhala-english code-mixed data analysis: A review on [354] S. L. Kekulawala, The future tense in Sinhalese – an ‘unorthodox’ data collection process,” in 2019 19th International Conference on point of view. Vidyalankara University of Ceylon, 1972. Advances in ICT for Emerging Regions (ICTer), vol. 250. IEEE, 2019, pp. 1–6. [313] L. Sandathara, S. Tissera, R. Sathsarani, H. Hapuarachchi, and 21 3 Udyogi Munasinghe 3 3 30 10 Surangika Ranathunga Uthayasanker Thayasivam 3 3 3 3 3 10 7 7 3 3 3 3 7 3 3 3 15 13 Viraj Welgama Sanath Jayasena Nisansa de Silva Piyumal Demotte T. Liyanagama D. Sandareka Fernando Prabath Sandaruwan Thilini Nadungodage 7 6 3 3 3 3 3 3 3 11 3 3 29 3 3 3 3 28 3 Gihan Dias Pasindu Tennage Ruvan Weerasinghe Lahiru Senevirathne Department of CSE - University Department of Moratuwa of CSE 4 6 9 3 5 3 3 3 3 Y Anzai G Rzevski G 4 6 UCSC - University of Colombo UCSC 8 6 6 3 3 3 4 22 10 6 7 3 12 N G J Dias G N Takashi Ikeda Fathima Farhath Budditha Hettige Kumudu Gamage Mahesan Niranjan Randil Pushpananda Malith Thilakarathne Binod Karunanayake 3 8 3 6 9 3 4 5 4 3 6 18 Susantha Herath University of Kelaniya FacultyIT of - UniversityMoratuwa of 3 3 4 3 3 4 7 8 3 6 19 H Aiso S Ishizaki Achini Herath Asanka Wasala Ajantha Herath Chamila Liyanage Asoka S. Karunananda S. Asoka A. J. P. J. A. M. P. Jayaweera RAPH G A AUTHOR - PPENDIX O A C Fig. 3: Co-author graph of the most prolific researchers in the the Sinhala NLP domain (Selected at the threshold of at least 3 coauthored publications)