Morphological Analysis for the Maltese Language: the Challenges of a Hybrid System
Claudia Borg
Dept. A.I., Faculty of ICT, University of Malta
[email protected]

Albert Gatt
Institute of Linguistics, University of Malta
[email protected]

Proceedings of The Third Arabic Natural Language Processing Workshop (WANLP), pages 25–34, Valencia, Spain, April 3, 2017. © 2017 Association for Computational Linguistics

Abstract

Maltese is a morphologically rich language with a hybrid morphological system which features both concatenative and non-concatenative processes. This paper analyses the impact of this hybridity on the performance of machine learning techniques for morphological labelling and clustering. In particular, we analyse a dataset of morphologically related word clusters to evaluate the difference in results for concatenative and non-concatenative clusters. We also describe research carried out in morphological labelling, with a particular focus on the verb category. Two evaluations were carried out, one using an unseen dataset, and another one using a gold standard dataset which was manually labelled. The gold standard dataset was split into concatenative and non-concatenative to analyse the difference in results between the two morphological systems.

1 Introduction

Maltese, the national language of the Maltese Islands and, since 2004, also an official European language, has a hybrid morphological system that evolved from an Arabic stratum, a Romance (Sicilian/Italian) superstratum and an English adstratum (Brincat, 2011). The Semitic influence is evident in the basic syntactic structure, with a highly productive non-Semitic component manifest in its lexis and morphology (Fabri, 2010; Borg and Azzopardi-Alexander, 1997; Fabri et al., 2014). Semitic morphological processes still account for a sizeable proportion of the lexicon and follow a non-concatenative, root-and-pattern strategy (or templatic morphology) similar to Arabic and Hebrew, with consonantal roots combined with a vowel melody and patterns to derive forms. By contrast, the Romance/English morphological component is concatenative (i.e. exclusively stem-and-affix based). Table 1 provides an example of these two systems, showing inflection and derivation for the words eżamina ‘to examine’, which takes a stem-based form, and gideb ‘to lie’ from the root √GDB, which is based on a templatic system. Table 2 gives an example of verbal inflection, which is affix-based and applies to lexemes arising from both concatenative and non-concatenative systems, the main difference being that the latter evinces frequent stem variation.

Table 1: Examples of inflection and derivation in the concatenative and non-concatenative systems

            Base form             Derivation               Inflection
Concat.     eżamina ‘examine’     eżaminatur ‘examiner’    eżaminatr-iċi, sg.f.
                                                           eżaminatur-i, pl.
Non-Con.    gideb ‘lie’ (√GDB)    giddieb ‘liar’           giddieb-a, sg.f.
                                                           giddib-in, pl.

Table 2: Verbal inflections for the concatenative and non-concatenative systems.

         eżamina ‘examine’    gideb ‘lie’ (√GDB)
1SG      n-eżamina            n-igdeb
2SG      t-eżamina            t-igdeb
3SGM     j-eżamina            j-igdeb
3SGF     t-eżamina            t-igdeb
1PL      n-eżamina-w          n-igdb-u
2PL      t-eżamina-w          t-igdb-u
3PL      j-eżamina-w          j-igdb-u

To date, there is still no complete morphological analyser for Maltese. In a first attempt at a computational treatment of Maltese morphology, Farrugia (2008) used a neural network and focused solely on broken plurals for nouns (Schembri, 2006). The only work treating computational morphology for Maltese in general was by Borg and Gatt (2014), who used unsupervised techniques to group together morphologically related words. A theoretical analysis of the templatic verbs (Spagnol, 2011) was used by Camilleri (2013), who created a computational grammar for Maltese for the Resource Grammar Library (Ranta, 2011), with a particular focus on inflectional verbal morphology. The grammar produced the full paradigm of a verb on the basis of its root, which can consist of over 1,400 inflected forms per derived verbal form, of which traditional grammars usually list 10. This resource is known as Ġabra and is available online (footnote 1). Ġabra is, to date, the best computational resource available in terms of morphological information. It is limited in its focus to templatic morphology and restricted to the wordforms available in the database. A further resource is the lexicon and analyser provided as part of the Apertium open-source machine translation toolkit (Forcada et al., 2011). A subset of this lexicon has since been incorporated in the Ġabra database.

Footnote 1: http://mlrs.research.um.edu.mt/resources/gabra/

This paper presents work carried out for Maltese morphology, with a particular emphasis on the problem of hybridity in the morphological system. Morphological analysis is challenging for a language like Maltese due to the mixed morphological processes existing side by side. Although there are similarities between the two systems, as seen in verbal inflections, various differences among the subsystems exist which make a unified treatment challenging, including: (a) stem allomorphy, which occurs far more frequently with Semitic stems; (b) paradigmatic gaps, especially in the derivational system based on Semitic roots (Spagnol, 2011); (c) the fact that morphological analysis for a hybrid system needs to pay attention to both stem-internal (templatic) processes and phenomena occurring at the stem’s edge (by affixation).

First, we will analyse the results of the unsupervised clustering technique by Borg and Gatt (2014) applied to Maltese, with a particular focus on distinguishing the performance of the technique on the two different morphological systems. Second, we are interested in labelling words with their morphological properties. We view this as a classification problem, and treat complex morphological properties as separate features which can be classified in an optimal sequence to provide a final complex label. Once again, the focus of the analysis is on the hybridity of the language and whether a single technique is appropriate for a mixed morphology such as that found in Maltese.

2 Related Work

Computational morphology can be viewed as having three separate subtasks: segmentation, clustering related words, and labelling (see Hammarström and Borin (2011)). Various approaches are used for each of the tasks, ranging from rule-based techniques, such as finite state transducers for Arabic morphological analysis (Beesley, 1996; Habash et al., 2005), to various unsupervised, semi- or fully-supervised techniques which would generally deal with one or two of the subtasks. For most of the techniques described, it is difficult to directly compare results due to differences in the data used and the evaluation setting itself. For instance, the results achieved by segmentation techniques are then evaluated in an information retrieval task.

The majority of works dealing with unsupervised morphology focus on English and assume that the morphological processes are concatenative (Hammarström and Borin, 2011). Goldsmith (2001) uses the minimum description length algorithm, which aims to represent a language in the most compact way possible by grouping together words that take on the same set of suffixes. In a similar vein, Creutz and Lagus (2005; 2007) use Maximum a Posteriori approaches to segment words from unannotated texts, and have become part of the baseline and standard evaluation in the Morpho Challenge series of competitions (Kurimo et al., 2010). Kohonen et al. (2010) extend this work by introducing semi- and supervised approaches to the model learning for segmentation. This is done by introducing a discriminative weighting scheme that gives preference to the segmentations within the labelled data.

Transitional probabilities are used to determine potential word boundaries (Keshava and Pitler, 2006; Dasgupta and Ng, 2007; Demberg, 2007). The technique is very intuitive, and posits that the most likely place for a segmentation to take place is at nodes in the trie with a large branching factor. The result is a ranked list of affixes which can then be used to segment words.

Van den Bosch and Daelemans (1999) and Clark (2002; 2007) apply Memory-based Learning to classify morphological labels. The latter work was tested on Arabic singular and broken plural pairs, with the algorithm learning how to associate an inflected form with its base form. Durrett and DeNero (2013) derive rules on the basis of the orthographic changes that take place in an inflection table (containing a paradigm). A log-linear model is then used to place a conditional distribution over all valid rules.

Poon et al. (2009) use a log-linear model for unsupervised morphological segmentation, which leverages overlapping features such as morphemes and their context. It incorporates exponential priors as a way of describing a language in an efficient and compact manner.

words. However, this technique relies on part-of-speech, affix and stem information. Can and Manandhar (2012) create a hierarchical clustering of morphologically related words using both affixes and stems to combine words in the same clusters. Ahlberg et al. (2014) produce inflection tables by obtaining generalisations over a small number of samples through a semi-supervised approach. The system takes a group of words and assumes that the similar elements that are shared by the different forms can be generalised over and are irrelevant for the inflection process.

For Semitic languages, a central issue in computational morphology is disambiguation between multiple possible analyses. Habash and Rambow (2005) learn classifiers to identify different morphological features, used specifically to improve part-of-speech tagging. Snyder and Barzilay (2008) tackle morphological segmentation for multiple languages in the Semitic family and English by creating a model that maps frequently oc-