A Comprehensive NLP System for Modern Standard Arabic and Modern Hebrew
Morphological analysis, lemmatization, vocalization, disambiguation and text-to-speech

Dror Kamir, Naama Soreq, Yoni Neeman
Melingo Ltd., 16 Totseret Haaretz st., Tel-Aviv, Israel

Abstract

This paper presents a comprehensive NLP system by Melingo that has recently been developed for Arabic, based on Morfix™, an operational and highly successful comprehensive Hebrew NLP system developed earlier.

The system discussed includes modules for morphological analysis, context-sensitive lemmatization, vocalization, text-to-phoneme conversion, and a syntactic-analysis-based prosody (intonation) model. It is employed in applications such as full text search, information retrieval, text categorization, textual data mining, online contextual dictionaries, filtering, and text-to-speech applications in the fields of telephony and accessibility, and could serve as a handy accessory for non-fluent Arabic or Hebrew speakers.

Modern Hebrew and Modern Standard Arabic share some unique Semitic linguistic characteristics. Yet up to now, the two languages have been handled separately in Natural Language Processing circles, both on the academic and on the applicative levels. This paper reviews the major similarities and the minor dissimilarities between Modern Hebrew and Modern Standard Arabic from the NLP standpoint, and emphasizes the benefit of developing and maintaining a unified system for both languages.

1 Introduction

1.1 The common Semitic basis from an NLP standpoint

Modern Standard Arabic (MSA) and Modern Hebrew (MH) share the basic Semitic traits: rich morphology, based on consonantal roots (Jiðr / Šoreš)¹, which depends on vowel changes and in some cases consonantal insertions and deletions to create inflections and derivations.²

For example, in MSA: the consonantal root /ktb/ combined with the vocalic pattern CaCaCa derives the verb kataba ‘to write’. This derivation is further inflected into forms that indicate semantic features, such as number, gender, tense etc.: katab-tu ‘I wrote’, katab-ta ‘you (sing. masc.) wrote’, katab-ti ‘you (sing. fem.) wrote’, ?a-ktubu ‘I write/will write’, etc.

Similarly in MH: the consonantal root /ktv/ combined with the vocalic pattern CaCaC derives the verb katav ‘to write’, and its inflections are: katav-ti ‘I wrote’, katav-ta ‘you (sing. masc.) wrote’, katav-t ‘you (sing. fem.) wrote’, e-xtov ‘I will write’ etc.

In fact, morphological similarity extends much further than this general observation, and includes very specific similarities in terms of the NLP systems, such as usage of nominal forms to mark tenses and moods of verbs; usage of pronominal enclitics to convey direct objects, and usage of proclitics to convey some prepositions. Moreover, the inflectional patterns and clitics are quite similar in form in most cases. Both languages exhibit construct formation (Iđa:fa / Smixut), which is similar in its structure and in its role. The suffix marking feminine gender is also similar, and similarity goes as far as peculiarities in the numbering system, where the feminine gender suffix marks the masculine. Some of these phenomena will be demonstrated below.

¹ A remark about the notation: Phonetic transcriptions always appear in Italics, and follow the IPA convention, except the following: ? – glottal stop, ¿ – voiced pharyngeal fricative (‘Ayn), đ – velarized d, ś – velarized s. Orthographic transliterations appear in curly brackets. Bound morphemes (affixes, clitics, consonantal roots) are written between two slashes. Arabic and Hebrew linguistic terms are written in phonetic spelling beginning with a capital letter. The Arabic term comes first.
² For a review on the different approaches to Semitic inflections see Beesley (2001), p. 2.

1.2 Lemmatization of Semitic Languages

A consistent definition of lemma is crucial for a data retrieval system. A lemma can be said to be equivalent to a lexical entry: the basic grammatical unit of natural language that is semantically closed. In applications such as search engines, usually it is the lemma that is sought, while additional information including tense, number, and person is dispensable.

In MSA and MH a lemma is actually the common denominator of a set of forms (hundreds or thousands of forms in each set) that share the same meaning and some morphological and syntactic features. Thus, in MSA, the forms: ?awla:d, walada:ni, despite their remarkable difference in appearance, share the same lemma WALAD ‘a boy’. This is even more noticeable in verbs, where forms like kataba, yaktubu, kutiba, yuktabu, kita:ba and many more are all part of the same lemma: KATABA ‘to write’.

The rather large number of inflections and complex forms (forms that include clitics, see below 1.5) possible for each lemma results in a high total number of forms, which, in fact, is estimated to be the same for both languages: around 70 million³. The mapping of these forms into lemmas is inconclusive (see Dichy (2001), p. 24). Hence the question arises: what should be defined as lemma in MSA and MH?

³ For Arabic - see Beesley (2001), p. 7. For Hebrew - our own sources.

The fact that MSA and MH morphology is root-based might promote the notion of identifying the lemma with the root. But this solution is not satisfactory: in most cases there is indeed a diachronic relation in meaning among words and forms of the same consonantal root. However, semantic shifts which occur over the years rule out this method in synchronic analysis. Moreover, some diachronic processes result in totally coincidental “sharing” of a root by two or more completely different semantic domains. For example, in MSA, the words fajr ‘dawn’ and infija:r ‘explosion’ share the same root /fjr/ (the latter might have originally been a metaphor). Similarly, in MH the verbs pasal ‘to ban, disqualify’ and pisel ‘to sculpt’ share the same root /psl/ (the former is an old loan from Aramaic).

In Morfix, as described below (2.1), a lemma is defined not as the root, but as the manifestation of this root, most commonly as the less marked form of a noun, adjective or verb. There is no escape from some arbitrariness in the implementation of this definition, due to the fine line between inflectional morphology and derivational morphology. However, Morfix generally follows the tradition set by dictionaries, especially bilingual dictionaries. Thus, for example, difference in part of speech entails different lemmas, even if the morphological process is partially predictable. Similarly each verb pattern (Wazn / Binyan) is treated as a different lemma.

Even so, the roots should not be overlooked, as they are a good basis for forming groups of lemmas; in other words, the root can often serve as a “super-lemma”, joining together several lemmas, provided they all share a semantic field.

1.3 The Issue of Nominal Inflections of Verbs

The inconclusive selection of lemmas in MSA and MH can be demonstrated by looking into an interesting phenomenon: the nominal inflections of verbs (roughly parallel to the Latin participle, see below). Since this issue is a good example both for a characteristic of Semitic NLP and for the similarities between MSA and MH, it is worthwhile to further elaborate on it.

Both MSA and MH use the nominal inflections of verbs to convey tenses, moods and aspects. These inflections are derived directly from the verb according to strict rules, and their forms are predictable in most cases. Nonetheless, grammatically, these forms behave as nouns or adjectives. This means that they bear case marking in MSA, nominal marking for number and gender (in both languages) and they can be definite or indefinite (in both languages). Moreover, these inflections often serve as nouns or adjectives in their own right. This, in fact, causes the crucial problem for data retrieval, since the system has to determine whether the user refers to the noun/adjective or rather to the verb for which it serves as inflection.

Nominal inflections of verbs exist in non-Semitic languages as well; in most European languages participles and infinitives have nominal features. However, two Semitic traits make this phenomenon more challenging in our case: the rich morphology which creates a large set of inflections for each base form (i.e. the verb is inflected to create nominal forms and then each form is inflected again for case, gender and number). Furthermore, Semitic languages allow nominal clauses, namely verbless sentences, which increase ambiguity. For example, in English it is easy to recognize the form ‘drunk’ in ‘he has drunk’ as

It is easy to see the additional difficulty that this writing convention presents for NLP. The string {yktb} in MSA can be interpreted as yaktubu (future tense), yaktuba (subjunctive), yaktub (jussive), yuktabu (future tense passive) and even yuktibu ‘he dictates/will dictate’, a form that is considered by Morfix to be a different lemma altogether (see above 1.2). Furthermore, ambiguity can occur between totally unrelated words, as will be shown in section 1.7. A trained MSA reader can distinguish between these forms by using contextual cues (both syntactic and semantic). A similar contextual sensitivity must be programmed into the NLP system in order to meet this challenge.

Each language also has some orthographic peculiarities of its own. The most striking in MH is the multiple spelling conventions that are used simultaneously. The classical convention has been replaced in most texts with some kind of spelling system that partially indicates vowels, and thus reduces ambiguities. An NLP system has to take into account the various spelling systems and the fact that the classic convention is still occasionally used.
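
The {yktb} ambiguity described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions and does not reflect how Morfix is implemented: the prefixed pattern strings (yaCCuCu and so on) extend the paper's CaCaCa notation for illustration, unvocalized spelling is approximated by simply dropping the short vowels a, i, u from the phonetic transcription, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumption: unvocalized spelling is modelled by dropping short vowels
# from the phonetic transcription. Long vowels and other orthographic
# details are ignored in this toy model.
SHORT_VOWELS = set("aiu")


def interdigitate(root: str, pattern: str) -> str:
    """Fill each 'C' slot of a vocalic pattern with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one 'C' slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


def unvocalized(form: str) -> str:
    """Return the consonantal skeleton that a reader sees in plain text."""
    return "".join(ch for ch in form if ch not in SHORT_VOWELS)


# The five readings of {yktb} listed in the text. The pattern strings are
# an illustrative extension of the CaCaCa notation, not taken from the paper.
YKTB_PATTERNS = [
    ("yaCCuCu", "future tense"),
    ("yaCCuCa", "subjunctive"),
    ("yaCCuC", "jussive"),
    ("yuCCaCu", "future tense, passive"),
    ("yuCCiCu", "'he dictates/will dictate' (a different lemma)"),
]


def build_index(root, patterns):
    """Map each unvocalized string to every vocalized form that yields it."""
    index = defaultdict(list)
    for pattern, reading in patterns:
        form = interdigitate(root, pattern)
        index[unvocalized(form)].append((form, reading))
    return dict(index)


index = build_index("ktb", YKTB_PATTERNS)
for form, reading in index["yktb"]:
    print(f"{form:8} {reading}")
```

All five vocalized forms collapse onto the single key `yktb`, which is why a lookup on the written string alone cannot select a reading and contextual cues are needed.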