A Computational Lexicon for Maltese
Maltilex: A Computational Lexicon for Maltese

M. Rosner, J. Caruana and R. Fabri
University of Malta, Msida MSD06, Malta
mros@cs.um.edu.mt, jcar1@um.edu.mt, rfab1@um.edu.mt

Abstract

The project described in this paper, which is still in the preliminary phase, concerns the design and implementation of a computational lexicon for Maltese, a language very much in current use but so far lacking most of the infrastructure required for NLP. One of the main characteristics of Maltese, a source of many difficulties, is that it is an amalgam of different language types (chiefly Semitic and Romance), as illustrated in the first part of the paper. The latter part of the paper describes our general approach to the problem of constructing the lexicon.

1 Introduction

With few exceptions (e.g. Galea (1996)) Maltese is pretty much virgin territory as far as language processing is concerned, and therefore one question worth asking is: where to begin? There are basically two extreme positions that one can adopt in answering this question. One is to attack a variety of applications first, e.g. translation, speech, dialogue etc., and hope that in so doing, enough general expertise can be acquired to build the basis of an NLP culture that is taken for granted with more computationally established languages. The other extreme is to attack the linguistic issues first, since, for whatever reason, there is currently rather little in the way of an accepted linguistic framework from which to design computational materials.

We have decided to adopt the middle ground by embarking upon the construction of a substantial machine-tractable lexicon of the language, since whether we think in terms of applications or linguistic theory, the lexicon is clearly a resource of fundamental importance.

The construction of the lexicon involves two rather separate subtasks which may in practice become interleaved. The first is the identification of a set of lexical entries, i.e. entries that will serve as the carriers of information. The second is the population of the entries with information of various kinds, e.g. syntactic, semantic, phonological etc.

Our initial task, trivial as it may sound, is to concentrate on the first of these subtasks, creating what amounts to a word list, in a machine-readable and consistent format, for all the basic lexical entries of the language. The idea is that this will subsequently be used not only as a basis for applications (initially we will concentrate on spell-checking), but also as a tool for linguistic research on the language itself.

2 The Maltese Language

Maltese is the national language of Malta and, together with English, one of the two official languages of the Republic of Malta. Its use beyond the shores of the Maltese islands is limited to small emigrant communities in Canada and Australia, but within the geographical confines of Malta, the language is used for the widest possible range of types of interaction and communication, including education, journalism, broadcasting, administration, business and literary discourse.

Unsurprisingly in view of the disparate political and cultural influences the islands have been exposed to over the centuries, Maltese is a so-called 'mixed' language, with a substrate of Arabic, a considerable superstrate of Romance origin (especially Sicilian) and, to a much more limited extent, English. The Semitic (Western/Maghrebi Arabic) element is evident enough to justify considering the language a peripheral dialect of Arabic. Its script, codified as recently as the 1920s, utilises a modified Latin alphabet. This is just one of the peculiarities of Maltese as compared to other dialectal varieties of Arabic, more important ones being its status as a 'high' variety and its use in literary, formal and official discourse, its lack of reference to any Qur'anic Arabic ideal, as well as its handling of extensive borrowings from non-Semitic sources. These features make Maltese a very interesting area for those working in the fields of language contact and Arabic dialectology.

2.1 The Maltese Alphabet

As noted above, Maltese is the only dialect of Arabic with a Latin script. Maltese orthography was standardised in the 1920s, utilising an alphabet largely identical with the Latin one, with the following additions/modifications:

Maltese   Pronunciation
ċ         chip (Eng)
ġ         jam (Eng)
h         silent
għ        mostly silent
ħ         hat (Eng)
ż         zip (Eng)
ie        ear (Eng) (approx)

2.2 Morphological Aspects of Maltese

The morphology is still based on a root-and-pattern system typical of Semitic languages. For example, from the triliteral root consonants ħ - d - m one can obtain forms like:

ħadem     work (verb);
ħaddiem   worker;
ħidma     work (noun);
nħadem    be worked (verb passive);
ħaddem    caused to work.

Most of these forms are based on productive templates (binyanim/forom/conjugations), of which Maltese has a subset of those in Classical Arabic. One other typical feature shared with Semitic languages is broken plural formation as opposed to so-called sound plural. A few examples are:

qamar        moon       qmura   moons;
tifel/tifla  boy/girl   tfal    children.

Plural formation in such instances involves a change in CV pattern. Sound plural formation involves affixation of suffixes such as -i, very common with words of Romance origin, -iet or -a as in:

karozza   car      karozz-i    cars;
ikla      meal     ikl-iet     meals;
ħaddiem   worker   ħaddiem-a   workers.

Maltese has taken on a very large number of Romance lexical items and incorporated them within the Semitic pattern. For example, pizza, a word of Romance origin, has the broken plural form pizez (compare Italian pizza/pizze), and ċippa, a very recent borrowing from English (computer chip), has a broken plural form ċipep. In certain cases, one gets free variation between the broken plural form and a sound plural based on (Romance) affixation, e.g.:

kaxxa   box      kaxex/kaxxi     boxes
tapit   carpet   twapet/tapiti   carpets.

The stem, as opposed to the consonantal root, also plays an important role in word formation, in particular in nominal inflection. Typical stem-based plural forms in which the stem remains intact are:

aħbar   news item   aħbar-ijiet   news
omm     mother      omm-ijiet     mothers

Verbs are also often borrowed and fully integrated into the Semitic verbal system and can take all of the inflectional forms for person, number, gender, tense etc. that any other Maltese verbs of Semitic origin can take. For example:

spjega       explain (It. spiegare)
jispjega     he explains
nispjegaw    we explain
spjegat      she explained
spjegajt     I explained, etc.

ixxuttja     kick a football (Eng. shoot)
jixxuttja    he kicks
nixxuttjaw   we kick
ixxuttjat    she kicked
ixxuttjajt   I kicked, etc.

The vigour and productivity of these processes is attested to by the fact that one keeps coming across new loan verbs all the time (increasingly more from English), both in spoken and in written Maltese, without the language having any difficulty in integrating them seamlessly into its morphological setup.

Within the verbal system complex inflectional forms can also be built through multiple affixation. For example, the word

bgħat - t - hie - lu - x

'I didn't send her to him', contains the suffixes -t for 1st person singular subject (perfective), -hie for 3rd person singular feminine direct object, -lu for 3rd person singular masculine indirect object, and -x for verb negation. This ready potential for inflectional complexity is another Semitic feature of Maltese which applies across the board, whatever the origin of the verb. It also raises interesting questions concerning the nature of lexical entries, the relationship between lexical entries and surface strings, and the kind of morphological processing that is necessary to connect the two together.

Many of the linguistic issues that could help to resolve these questions are themselves unresolved for lack of suitably organised language resources (like the lexicon itself!). For this reason, we see the design/implementation of the lexicon, the development of language resources, and the evolution of linguistic theory for Maltese as three goals which must be pursued in parallel.

At this very early stage of the project, we have sidestepped many of the finer issues by opting to codify the most uncontentious parts of the lexicon first, as described below. At the same time, we are in the process of developing an extensible text archive which will serve as the basis for empirical work concerning both the lexicon and the underlying linguistics.

... arise, however, such as:

• What to do if a machine readable version of the printed dictionary is not available, as is in fact the case with Maltese.
• How to deal with the idiosyncratic formats adopted by different lexicographers, and how to handle the omissions and inconsistencies that are characteristic of all human oriented dictionaries.
• Once the information is available, how to represent it.
• How to deal with evolution of the language under investigation. Dictionaries always reflect the language as it was, not as it is. In the case of Maltese this problem is particularly acute, given that the most obviously useful dictionary contains a large number of entries that are regarded by many as archaic.

Many of these problems, except the last, are alleviated by adopting an essentially manual approach in the early stages. We have adopted the most complete and detailed dictionary currently available by J. Aquilina (Aquilina, 1987) and are in the process of transcribing the so-called major entries into our own format by means of a form interface as illustrated in figure 3.1. Major entries of this dictionary comprise a specific, orthographically distinguished (capitalised) subset containing the basic lexical forms
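As an illustration of the root-and-pattern derivation described in Section 2.2, the following minimal Python sketch interdigitates the triliteral root ħ-d-m with a handful of templates and reproduces the five forms listed there. The template notation (the digits 1 to 3 standing for the first, second and third root consonant), the function name interdigitate and the table TEMPLATES are assumptions introduced purely for illustration; they are not part of the Maltilex design, and the sketch deliberately ignores any phonological adjustment that other roots or templates might require.

```python
# Root consonants of the example in Section 2.2.
ROOT = ("ħ", "d", "m")

# Hypothetical template notation (an assumption of this sketch):
# the digits 1, 2 and 3 are placeholders for the root consonants,
# every other character is copied into the output unchanged.
TEMPLATES = {
    "1a2e3": "work (verb)",
    "1a22ie3": "worker",
    "1i23a": "work (noun)",
    "n1a2e3": "be worked (verb passive)",
    "1a22e3": "caused to work",
}


def interdigitate(root, template):
    """Fill the numbered slots of a template with the root consonants."""
    return "".join(
        root[int(ch) - 1] if ch in "123" else ch
        for ch in template
    )


# Map each derived surface form to the gloss given in the paper.
forms = {interdigitate(ROOT, t): gloss for t, gloss in TEMPLATES.items()}

for form, gloss in forms.items():
    print(f"{form}\t{gloss}")
```

The same function could be applied to the broken plurals quoted in Section 2.2, for example the root q-m-r with the template 12u3a for qmura, although whether a lexicon should store such templates or list the plural forms explicitly is exactly the kind of question about lexical entries and surface strings that the paper leaves open.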