Maltilex: A Computational Lexicon for Maltese

M. Rosner, J. Caruana and R. Fabri
University of Malta, Msida MSD06, Malta
mros@cs.um.edu.mt, jcar1@um.edu.mt, rfab1@um.edu.mt

Abstract

The project described in this paper, which is still in the preliminary phase, concerns the design and implementation of a computational lexicon for Maltese, a language very much in current use but so far lacking most of the infrastructure required for NLP. One of the main characteristics of Maltese, a source of many difficulties, is that it is an amalgam of different language types (chiefly Semitic and Romance), as illustrated in the first part of the paper. The latter part of the paper describes our general approach to the problem of constructing the lexicon.

1 Introduction

With few exceptions (e.g. Galea (1996)) Maltese is pretty much virgin territory as far as language processing is concerned, and therefore one question worth asking is: where to begin? There are basically two extreme positions that one can adopt in answering this question. One is to attack a variety of applications first, e.g. translation, speech, dialogue etc., and hope that in so doing, enough general expertise can be acquired to build the basis of an NLP culture that is taken for granted with more computationally established languages. The other extreme is to attack the linguistic issues first, since, for whatever reason, there is currently rather little in the way of an accepted linguistic framework from which to design computational materials.

We have decided to adopt the middle ground by embarking upon the construction of a substantial machine-tractable lexicon of the language, since whether we think in terms of applications or linguistic theory, the lexicon is clearly a resource of fundamental importance.

The construction of the lexicon involves two rather separate subtasks which may in practice become interleaved. The first is the identification of a set of lexical entries, i.e. entries that will serve as the carriers of information. The second is the population of the entries with information of various kinds, e.g. syntactic, semantic, phonological etc.

Our initial task, trivial as it may sound, is to concentrate on the first of these subtasks, creating what amounts to a word list, in a machine-readable and consistent format, for all the basic lexical entries of the language. The idea is that this will subsequently be used not only as a basis for applications (initially we will concentrate on spell-checking), but also as a tool for linguistic research on the language itself.

2 The Maltese Language

Maltese is the national language of Malta and, together with English, one of the two official languages of the Republic of Malta. Its use beyond the shores of the Maltese islands is limited to small emigrant communities in Canada and Australia, but within the geographical confines of Malta, the language is used for the widest possible range of types of interaction and communication, including education, journalism, broadcasting, administration, business and literary discourse.

Unsurprisingly in view of the disparate political and cultural influences the islands have been exposed to over the centuries, Maltese is a so-called 'mixed' language, with a substrate of Arabic, a considerable superstrate of Romance origin (especially Sicilian) and, to a much more limited extent, English. The Semitic (Western/Maghrebi Arabic) element is evident enough to justify considering the language a peripheral dialect of Arabic. Its script, codified as recently as the 1920s, utilises a modified Latin alphabet. This is just one of the peculiarities of Maltese as compared to other dialectal varieties of Arabic, more important ones being its status as a 'high' variety and its use in literary, formal and official discourse, its lack of reference to any Qur'anic Arabic ideal, as well as its handling of extensive borrowings from non-Semitic sources. These features make Maltese a very interesting area for those working in the fields of language contact and Arabic dialectology.

2.1 The Maltese Alphabet

As noted above, Maltese is the only dialect of Arabic with a Latin script. Maltese orthography was standardised in the 1920s, utilising an alphabet largely identical with the Latin one, with the following additions/modifications:

    Maltese   Pronunciation
    ċ         chip (Eng)
    ġ         jam (Eng)
    h         silent
    għ        mostly silent
    ħ         hat (Eng)
    ż         zip (Eng)
    ie        ear (Eng) (approx)

2.2 Morphological Aspects of Maltese

The morphology is still based on a root-and-pattern system typical of Semitic languages. For example, from the triliteral root consonants ħ-d-m one can obtain forms like:

    ħadem     work (verb);
    ħaddiem   worker;
    ħidma     work (noun);
    nħadem    be worked (verb passive);
    ħaddem    caused to work.

Most of these forms are based on productive templates (binyanim/forom/conjugations), of which Maltese has a subset of those in Classical Arabic. One other typical feature shared with Semitic languages is broken plural formation as opposed to so-called sound plural. A few examples are:

    qamar         moon       qmura   moons;
    tifel/tifla   boy/girl   tfal    children.

Plural formation in such instances involves a change in CV pattern. Sound plural formation involves affixation of suffixes such as -i, very common with words of Romance origin, -iet or -a as in:

    karozza   car      karozz-i    cars;
    ikla      meal     ikl-iet     meals;
    ħaddiem   worker   ħaddiem-a   workers.

Maltese has taken on a very large number of Romance lexical items and incorporated them within the Semitic pattern. For example, pizza, a word of Romance origin, has the broken plural form pizez (compare Italian pizza/pizze), and ċippa, a very recent borrowing from English (computer chip), has a broken plural form ċipep. In certain cases, one gets free variation between the broken plural form and a sound plural based on (Romance) affixation, e.g.:

    kaxxa   box      kaxex/kaxxi     boxes;
    tapit   carpet   twapet/tapiti   carpets.

The stem, as opposed to the consonantal root, also plays an important role in word formation, in particular in nominal inflection. Typical stem-based plural forms in which the stem remains intact are:

    aħbar   news item   aħbar-ijiet   news;
    omm     mother      omm-ijiet     mothers.

Verbs are also often borrowed and fully integrated into the Semitic verbal system and can take all of the inflective forms for person, number, gender, tense etc. that any other Maltese verbs of Semitic origin can take. For example:

    spjega       explain (It. spiegare)
    jispjega     he explains
    nispjegaw    we explain
    spjegat      she explained
    spjegajt     I explained, etc.

    ixxuttja     kick a football (Eng. shoot)
    jixxuttja    he kicks
    nixxuttjaw   we kick
    ixxuttjat    she kicked
    ixxuttjajt   I kicked, etc.

The vigour and productivity of these processes is attested by the fact that one keeps coming across new loan verbs all the time (increasingly more from English), both in spoken and in written Maltese, without the language having any difficulty in integrating them seamlessly into its morphological setup.

Within the verbal system complex inflectional forms can also be built through multiple affixation. For example, the word

    bgħat-t-hie-lu-x

'I didn't send her to him', contains the suffixes -t for 1st person singular subject (perfective), -hie for 3rd person singular feminine direct object, -lu for 3rd person singular masculine indirect object, and -x for verb negation. This ready potential for inflectional complexity is another Semitic feature of Maltese which applies across the board, whatever the origin of the verb. It also raises interesting questions concerning the nature of lexical entries, the relationship between lexical entries and surface strings, and the kind of morphological processing that is necessary to connect the two together.

Many of the linguistic issues that could help to resolve these questions are themselves unresolved for lack of suitably organised language resources (like the lexicon itself!). For this reason, we see the design/implementation of the lexicon, the development of language resources, and the evolution of linguistic theory for Maltese as three goals which must be pursued in parallel.

At this very early stage of the project, we have sidestepped many of the finer issues by opting to codify the most uncontentious parts of the lexicon first, as described below. At the same time, we are in the process of developing an extensible text archive which will serve as the basis for empirical work concerning both the lexicon and the underlying linguistics.

3 Constructing the Lexicon

The two main resources available to construct the lexicon are dictionaries and text corpora. Both, in some sense, are representative of the lexical behaviour of words, and both have their advantages and disadvantages.

3.1 The Dictionary Approach

The basic idea underlying the dictionary approach is this: if some lexicographer has already gone to a great deal of trouble to compile a dictionary, why not make use of that work rather than repeat it? The appeal is obvious, and the approach can be made to work, as is evidenced by, for example, the work of Boguraev and Briscoe (1987), who attempted to extract entries from the machine-readable version of Longman's dictionary. Problems of a practical nature soon arise, however, such as:

• What to do if a machine-readable version of the printed dictionary is not available, as is in fact the case with Maltese.
• How to deal with the idiosyncratic formats adopted by different lexicographers, and how to handle the omissions and inconsistencies that are characteristic of all human-oriented dictionaries.
• Once the information is available, how to represent it.
• How to deal with evolution of the language under investigation. Dictionaries always reflect the language as it was, not as it is. In the case of Maltese this problem is particularly acute, given that the most obviously useful dictionary contains a large number of entries that are regarded by many as archaic.

Many of these problems, except the last, are alleviated by adopting an essentially manual approach in the early stages. We have adopted the most complete and detailed dictionary currently available, by J. Aquilina (Aquilina, 1987), and are in the process of transcribing the so-called major entries into our own format by means of a form interface as illustrated in Figure 1. Major entries of this dictionary comprise a specific, orthographically distinguished (capitalised) subset containing the basic lexical forms of the language. They thus form a reasonable starting point for our purposes. The other (non-capitalised) entries are derived lexical forms of various kinds.

[Figure 1: Internet Form for Dictionary Entries. The form, titled "Maltilex Lexical Entry", has fields for headword, variants, verb (transitive/intransitive), noun (verbal noun, noun agent, diminutive), gender (masculine/feminine) and plurals (variants, collective), together with word-list search and definitions links.]

For the present, we are simply ignoring inflectional forms, since ultimately it is more efficient to assume that they can be systematically related to the basic entries by a morphological transformation of the sort implemented by Galea (1996).

The most important information is headword, a sequence of characters used to identify a particular lexical primitive or lexeme. Most of the time, the headword and the lexeme are in one-to-one correspondence, but there are exceptions. Distinct lexemes (and therefore entries) with the same headword are homonyms (e.g. tikk, a clock tick and tikk, a facial spasm). Single lexemes can also manifest polysemy, different meanings under the same headword (e.g. tikka, a point-like mark and tikka, a very small amount).

These variations are accommodated using the headword (string), homonym (integer) and polyseme (integer) fields in the form, the integers deriving from the ordering implicit in the printed dictionary.

The second line of the form contains root (typically 3 consonants) and stem information for words of Semitic and non-Semitic origin respectively, whilst the third contains variants (e.g. farfett/ferfett, butterfly).

The remainder of the form contains mostly grammatical information, including that on (various forms of) plural. There is also space for comments from the individual lexicographer.

The end product of the work described in this section is essentially a list of lexical entries for what we are calling the uncontentious parts of the language. The content of entries is essentially by reference (to the entries of Aquilina's dictionary) rather than literal.

3.2 The Corpus Approach

Comparatively recent technological changes have made it possible, in principle, to create and maintain corpora that are sufficiently large and accessible to be suitable for the purposes of lexical acquisition. One of the greatest advantages of the corpus approach to lexical acquisition, compared to the dictionary approach just described, is that in principle such corpora come as close as it is possible to get to a truly current snapshot of the language, particularly if they are continuously updated. Other arguments in favour of using texts as the basis for lexical acquisition are advanced in the editors' introduction to Boguraev and Pustejovsky (1995).

To adopt the corpus approach it is of course necessary to have a corpus, so that a priority task is the construction of a machine-readable, evolving record of the current written language. All the main Maltese language newspapers have been approached, and some journalistic texts (various fields) have already been obtained. We have recently managed to obtain speech corpora with parallel text of national radio news broadcasts. Furthermore, practical arrangements are currently being made for the provision of such materials on a regular and frequent basis. Book publishers have agreed to make titles from their respective ranges available for inclusion in the corpus. As it stands, the raw collection includes a number of book excerpts from various titles.

One feature of this approach is the constantly evolving relationship between corpus and lexicon: the corpus enriches the lexicon, but as the latter evolves, it can be used to add further information to the corpus in the form of annotations or tags, thus expanding its scope. A corpus annotated with part-of-speech tags, for example, can be used to infer a statistical model that can be harnessed to efficiently and automatically assign tags to previously unseen texts.

3.3 Character Representation

In the course of collecting corpus texts, it soon became apparent that, as a result of lack of standardisation early on in the introduction and spread of IT in Malta, a certain amount of anarchy reigns, with various computer/printer suppliers having developed and disseminated 'Maltese' adaptations of existing fontsets. The fact that they proceeded independently of each other and with no external regulation meant that the same Maltese-specific characters were assigned different ANSI codes in Windows (TTF) fonts supplied by competing sellers, making it difficult to read documents not only across platforms but also within the same platform.

A persistent challenge to the computational treatment of Maltese is therefore the question of text representation, i.e. the numerical coding for the characters that make up words. The requirements are:

• That the coding should follow an internationally recognised standard.
• That there exist appropriate fonts for use on the screen and on the printer across a variety of hardware platforms (PC/Mac/Unix).
• That there exists an accepted keyboard configuration to generate the codes.

Although no code satisfying all of these requirements exists, the most acceptable workaround available at present is to adopt fonts conforming to ISO 8859-3, known as Latin Alphabet No. 3. Two PC-compatible fonts conforming to this standard are known as "Tornado" and "FTIMAL" and we are currently investigating the copyright status of each of these.

Given that these fonts are closely tied to PC (rather than Unix or Macintosh platforms), and given the rather casual attitude taken to the adoption of text representation standards locally, we have defined a project-internal Standard Maltese Text Representation (SMTR) for storing text archives in a way that is (a) human-readable (and human-editable), (b) compatible with Unix systems and (c) easily translatable to and from any other coding format by means of simple finite-state methods (we are using Xerox's xfst for this purpose).

    Maltese   Ascii
    ċ         _c
    għ        _y
    ħ         _h
    ż         _z
    ie        _i

4 Conclusion

This paper has attempted to convey our approach to the problem of rendering Maltese amenable to current language engineering techniques via the construction of a computational lexicon. One difficulty that we are currently facing is a shortage of appropriately qualified personnel to work on the project, though hopefully this problem will be alleviated by the appearance of our first CS/Computational Linguistics graduates during the coming year. Three subprojects are currently in the pipeline with the following themes:

• Finite State Methods. Development of finite state transducers for extracting lexical information from text corpora.
• Computational Grammar. Development of a grammar and parsing system for Maltese sentences. This will probably be based on HPSG.
• Computational Morphology of Plural Forms.

5 Acknowledgements

The authors gratefully acknowledge the contribution made by the Mid-Med Computer and Commerces Foundation to the funding of this project.

References

J. Aquilina. 1987. Maltese-English Dictionary. Midsea Books.
B. Boguraev and T. Briscoe. 1987. Large lexicons for natural language processing: exploring the grammar coding system of LDOCE. Computational Linguistics, 13:203-218.
B. Boguraev and J. Pustejovsky. 1995. Corpus Processing for Lexical Acquisition. MIT Press, Cambridge, Ma.
D. Galea. 1996. Morphological analysis of Maltese verbs. Technical Report B.Sc Dissertation, Department of Computer Science, University of Malta.
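
The requirement in Section 3.3 that SMTR be translatable to and from other codings by simple finite-state methods can be illustrated with a small sketch. The Python below is not the project's xfst implementation. It assumes a hypothetical escape scheme in which each Maltese-specific letter is written as an underscore followed by its base letter. The codes `_c`, `_h` and `_z` follow the SMTR table as far as it can be read; `_g`, the upper-case codes and the `__` escape for a literal underscore are assumptions made for illustration, and the digraph għ is simply encoded letter by letter, which may differ from the project's own table.

```python
import re
import unicodedata

# Hypothetical escape table: "_" plus a base letter stands for one
# Maltese-specific character. "_g", the capitals and "__" are assumed,
# not taken from the paper.
TO_ASCII = {
    "ċ": "_c", "Ċ": "_C",
    "ġ": "_g", "Ġ": "_G",
    "ħ": "_h", "Ħ": "_H",
    "ż": "_z", "Ż": "_Z",
    "_": "__",  # assumed escape so a literal underscore survives a round trip
}
FROM_ASCII = {code: char for char, code in TO_ASCII.items()}

# Every escape is exactly two characters long and starts with "_".
_ESCAPE = re.compile(r"_(.)", re.DOTALL)


def encode(text: str) -> str:
    """Rewrite Maltese-specific letters as two-character ASCII escapes."""
    # Compose "c" + combining dot into the single character "ċ" first,
    # so that both spellings of the same letter are handled alike.
    text = unicodedata.normalize("NFC", text)
    return "".join(TO_ASCII.get(ch, ch) for ch in text)


def decode(text: str) -> str:
    """Invert encode(); escapes that are not in the table are left as-is."""
    return _ESCAPE.sub(lambda m: FROM_ASCII.get(m.group(0), m.group(0)), text)
```

Under these assumed codes, `encode("ħaddiem")` gives `_haddiem` and `decode` restores the original. Because each rule rewrites a fixed string to a fixed string with no dependence on context, the same mapping could equally be stated as a pair of finite-state replace rules, which is the kind of formulation the project's use of xfst suggests.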