Word Sense Disambiguation Approach for Arabic Text

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016 Word Sense Disambiguation Approach for Arabic Text

Nadia Bouhriz Faouzia Benabbou El Habib Ben Lahmar Dept. of Mathematics and computer Dept. of Mathematics and computer Dept. of Mathematics and computer science science science Faculty of Sciences Ben M'sik Faculty of Sciences Ben M'sik Faculty of Sciences Ben M'sik Hassan II University Hassan II University Hassan II University Casablanca, Morocco Casablanca, Morocco Casablanca, Morocco

Abstract—Word Sense Disambiguation (WSD) consists of the obtained results. The last section gives conclusion and identifying the correct sense of an ambiguous word occurring in some perspectives. a given context. Most of Arabic WSD systems are based generally on the information extracted from the local context of the word II. WSD SYSTEMS ARCHITECTURE to be disambiguated. This information is not usually sufficient for In 1949, Weaver [4] discussed the necessity of WSD for a best disambiguation. To overcome this limit, we propose an MT, and he explained that to realize this process, the approach that takes into consideration, in addition to the local ambiguous word must be taken from the context where it context, the global context too extracted from the full text. More particularly, the sense attributed to an ambiguous word is the occurred. In 1950, Kaplan [3] made experiences to determine one of which semantic proximity is more close both to its local in which size the context should be, in order to disambiguate a and global context. The experiments show that the proposed word. It proved that two words at the right and at the left (size system achieved an accuracy of 74%. =2) of the ambiguous word are sufficient for its disambiguation; Masterman [5] confirmed this result for the Keywords—Word Sense Disambiguation; Arabic Text; local Russian language, while Choueka and Lusignan [6] confirmed context; global context; Arabic WordNet; Semantic Similarity it for the French.

I. INTRODUCTION Over the years, WSD systems were developed according to different approaches. Actually, these systems have generally an WSD is a natural language processing (NLP) field. It aims architecture structured around three main steps: at determining the appropriate sense of an ambiguous word occurring in a given context [1] [2]. It is a task which allows a  Sense inventory: consists on selecting the senses of the better understanding, and consequently a better exploitation of words. the processed linguistic material. It is therefore very essential  Context representation: represents senses and contexts task for NLP applications, such as Machine Translation (MT), in a formal manner. Information Retrieval (IR), Text classification…etc.  Disambiguation Process: attributes for every ambiguous The oldest WSD approach proved that two words before word its correct sense according to its context. and after the ambiguous word are sufficient for its disambiguation [3]. For the Arabic language, the information The sense inventory step is the one that makes the extracted from this local context is not always sufficient. difference from one system to another depending on the adopted approach. Generally, two approaches exist: To solve this problem, an Arabic WSD system was proposed in this paper that is not only based on the local The first one, called Knowledge-based approach, is based context, but also on the global context extracted from the full on the use of external lexical resources. These resources are text. The objective is to combine the local contextual containing all the words of a language with their senses. These information with the global one for a better disambiguation. resources can be dictionaries [7], thesaurus [8], or ontologies [9] [10]. More particularly, the proposed system uses the resource Arabic WordNet (AWN) to select word senses. The sense Unlike the first approach, the second one doesn’t use attributed then to an ambiguous word is the one that possesses external lexical resources, but it acquires the necessary the closest semantic proximity to the local context, as well as to information to define words’ senses from a corpus; it’s called a the global one. This proximity is measured based on the Corpus-based approach. This information is obtained by the semantic hierarchy offered by WordNet. application of statistical language models on this corpus. Three approaches are distinguished in this category, supervised The rest of the paper is organized as follows: Section II approaches that require annotated corpus [11] [12] [13], presents the architecture of WSD systems. Section III exposes unsupervised approaches [14] [10] that require unannotated some Arabic WSD systems. Section IV displays the description corpus, and a semi-supervised approaches that require both of of the proposed system. Section V contains experiments and the annotated and the unannotated corpus [15].

381 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016

III. ARABIC WSD SYSTEMS IV. THE PROPOSED SYSTEM A. Challenges Before describing the system process, the structure of Arabic presents several challenges for WSD, due Arabic WordNet is firstly given. essentially to the particularities of this language and also to the A. Arabic WordNet lack of resources necessary to the disambiguation process. The Arabic WordNet (AWN) [19] [20] [21] is a lexical Diacritics’ missing in Arabic texts is the most challenging resource for modern standard Arabic. It was constructed characteristic for WSD; because it increases the number of a according to the Princeton WordNet content. It’s organized word’s possible senses and consequently makes the around elements called Synsets, which are a set of synonyms disambiguation task more difficult. For example, the word and pointers connecting it with other synsets. So, the AWN is a have 11 senses according to the lexical network in which synsets represent its nodes and the (صوت without diacritics (Swt AWN, while the use of diacritics for the same word (Saw~ata connections between synsets represent its edges. .cuts down the number of senses to two ,( َص َّو َتَ This resource counts at present 23,481 words organized .On the other hand, the Arabic languageَ is very rich into 11,269 synsets. A word can belong to one or more synsets morphologically. This causes an ambiguity during the lexical segmentation, and influences consequently the detection of the In this work, the senses of a word are defined by the words’ correct sense during disambiguation process. For Synsets to which it belongs in the AWN. Below, some words :Wjd) have two possible segmentations; synsets (i.e. senses) extracted from AWN are presented وجد) example, the word جد) W) is a prefix of و) the first one considers that the letter TABLE I. EXAMPLE OF AWN SENSES Jad), while the second considers it as a letter in the word, which gives two totally different words. words Senses (synsets) [بحر] = Sense 1 بحر [محيط,َبحر] = B. State of the Art Sense 2 [شعر,َقصيدة] = The first WSD systems were mostly concerned by Latin Sense 1 [شعر]= Sense 2 [شعر] = languages like English and French since several decades ago. Sense 3 شعر [أحس,َشعر] = The Arabic language, as for it, didn't get the attention until the Sense 4 [حس,َأحس,َشعر] = last decade. Sense 5 [أحس,َشعر] = Sense 5 [فلوس,َثروة,َدراهم,َمال] = The first Arabic WSD system was proposed by Mona Diab Sense 1 [نقود,َمال] = in 2002 [10]. The author introduced in this work an Sense 2 [مال] = unsupervised method to annotate Arabic words by their sense Sense 3 [تمايل,َترنح,َمال] = Sense 3 مال .using English WordNet and an English-Arabic parallel corpus [انحدر,َمال] = Sense 4 [نزعَإلى,َمال] = Another contribution was proposed by Elmougy [13] where Sense 5 [أقنع,َأمال,َمال] = a Naïve Bayes Classifier was used to disambiguate Arabic Sense 6 [انحرف,َانحنى,َمال] = words without diacritics. Sense 7 Merhbene [16] was based on the semantic trees and a B. Description of the proposed system measure of collocation to choose the most appropriate sense to 1) Sense inventory an ambiguous word. In this step, a preprocessing phase is applied; it contains a text segmentation process, a stop words removal process, and Zouaghi [17] have proposed a system of WSD by finally a stemming process to remove words’ affixes (prefixes combining the information retrieval measures with the Lesk and suffixes). algorithm to estimate the most appropriate sense of the ambiguous word. Afterwards, the obtained words are classified, according to the AWN, into two categories: The most recent work was proposed by Menai [18], in which the author was based on the genetic algorithms. His  Non ambiguous words: belonging to one Synset, i.e. objective is to exploit the power of these algorithms in the possessing one sense. Arabic WSD.  Ambiguous words: belonging to several Synsets, i.e. All of the previously mentioned works used only one possessing several senses. contextual information to disambiguate. The proposed system, as for it, leans on two contextual informations. The first one is extracted from the local context of the ambiguous word and the second from its global context.

382 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016

Arabic Text Non ambiguous word senses = ( )

Segmentation, Removing stop words Build the space vector spanned by the standard Stemming Sense’s selection basis

Set of Words Arabic Calculate of distance WordNet A word sense in the space to one synset vector = ∑ AWN semantic

to several synsets hierarchy

Non ambiguous words Ambiguous Words Set of all non ambiguous Set of non ambiguous word word sense vectors sense vectors present locally Fig. 1. Sense inventory process = =

Sense inventory Algorithm Fig. 2. Context representation Input: Arabic Text T Output: List of Ambiguous Words and Non Ambiguous Context representation Algorithm Words . Input: list of and

Output: , 1: Segment the Text 2: Remove stop words 1: For all words in do: 3: Apply Stemming process for all obtained words 2: Extract associated senses ( ) 4: For each word do: 3: End 5: If: is belonging to one synset in AWN 4: For each word do: 6: Then: Add to NAW list. 5: For each senses do: 7: Else if: is belonging to a several synsets in AWN 6: For each do: 8: Then: Add to AW list. 7: Calculate the wu-p semantic distance between word 9: End sense and the sense . 2) Context representation 8: End

This step consists of representing words’ senses as vectors. 9: Calculate word sense vector ∑ For this purpose, the set of all non ambiguous word senses( 10: End ) is firstly considered, afterwards, the vector 11: End space, spanned by the standard basis , 12: Construct where ( ) ( ) 13: Construct

( ) are respectively the unit vector of the sense , is built. 3) Disambiguation process: This last step consists of attributing for each ambiguous Using this space, words’ senses will be represented by the word its appropriate sense. This is done by choosing the sense th vector ∑ where is the i coordinate representing with the closest semantic proximity to its local and global the semantic distance between the word sense and the sense context. in AWN. To calculate this distance, the Wu and Palmer (wu-p) measure is used [2]. Sense semantic proximity with a context is defined by the percentage of vectors in this context that are similar to the The global context will be afterwards defined by the sense vector of this sense. vectors set of non ambiguous words present in the full Similarity measurement between two vectors text: , while the local context will V=( ) and ( ) can be calculated be defined by the sense vectors set of non ambiguous words present only locally: by three distances which are; dot product, cosines, and Jaccard defined respectively as follows:

( ) ∑ Finally, an ambiguous word that has m senses will be represented by the set of its sense vectors: ∑ ( ) ∑ ∑ . ∑ ( ) ∑ ∑ ∑

383 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016

According to the previous definitions, the local and global correctly disambiguated divided by the number of all semantic proximity are measured for each ambiguous word ambiguous words. Experiment results have shown a precision sense; as a result, a pair of percentages representing of 74%. respectively each of the semantic proximity is obtained. The sense with the better average of its two percentages will be Another experiment have shown that the use of a stemming assigned finally to the ambiguous word. process during the sense inventory phase increases the system’s efficiency. More particularly, results (Table II) show Ambiguous word firstly that the use of this process increases efficiency by 30%, moreover, they have shown that the use of AlKhalil Analyzer Distance vectors [23] is better than Buckwalter [24] by 4%: measurement For each sense

TABLE II. IMPORTANCE OF THE STEMMING PROCESS

Vector similarity between and Morphological Without Buckwalter AlKhalil each vector in and analyzer Stemming System precision 40% 70% 74% The table below (Table III) show some disambiguated Percentage of Percentage of words from this piece of text: similar vector to similar vector to Average of وفي سياق متصل قال المتحدث الرسمي في in local in global the two الرئاسة العامة لألرصاد وحماية البيئة حسين = context = context القحطاني انه يوجد بالرئاسة مركز للبالغات Local semantic percentages Global semantic of ورقم مجاني )899( الستقبال بالغات الكوارث proximity of proximity of الطبيعية والبحرية، كما أن هناك خطة وطنية لالستجابة ومكافحة التلوث تضم في عضويتها الجهات ذات العالقة بالتلوث البحري. Assign the sense that has the highly average to the ambiguous word

Fig. 3. Disambiguation Process TABLE III. EXAMPLE OF WORDS DISAMBIGUATED

Disambiguation process Algorithm Input: ambiguous word Output: sense of the ambiguous word

1: For each sense of the ambiguous word do: 2: Calculate local semantic proximity of 3: Calculate global semantic proximity of 4: Calculate average semantic Proximity of 5: End 6: 7: For each sense of the ambiguous word do: 8: If: ( ( ) ( )) 9: Then: 10: End 11: Assign the sense to the ambiguous word.

V. EXPERIMENTATIONS AND RESULTS To evaluate the proposed system, a test corpus is constructed by collecting texts from various fields (news, sport, medicine, religions, etc.); afterwards, each word is annotated manually by its correct sense according to the AWN. The last experiment results show that the proposed The Java language was used to implement the system, and approach is better by 0.34% than the classical method (based to access the XML AWN database the ‘Java API for Arabic 1 on local context). This is due to some challenges described as WordNet” was used. Finally, the application of the stemming follows: process is based on SAFAR platform2.  The non-recognition of named entities (persons’ names, For measuring the system’s efficiency, the precision locations, organizations…etc.). These last should not be measurement was used; it consists of the number of words separated during segmentation process. Experiments have not been عبد َهللا, َأبو َظبي :show that words like 1https://sourceforge.net/projects/javasourcecodeapiarabicwordnet/ recognized as a named entity. 2 http://sibawayh.emi.ac.ma/safar/publications.php

384 | P a g e www.ijacsa.thesai.org (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 4, 2016

 Another similar challenge that decreases the system [8] D. Yarowsky. ‘’Word-Sense Disambiguation Using Statistical Models efficiency is the incapability of multiword expression of Roget's Categories Trained on Large Corpora’’. Proceedings of the 14th International Conference on Computational Linguistics (COLING- .etc. 92), 454 –460, 1992…قاعدةَبيانت,َاألممَالمتحدة recognition such as  The absence of a component that allows disambiguating [9] P. Resnik. ‘’Disambiguating noun groupings with respect to WordNet senses’’, in S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. senses with the same average semantic proximity. Tzoukermann & D. Yarowsky, eds, ‘Natural Language Processing Using Very Large Corpora’, Kluwer Academic Publishers, Boston,  The absence of a part-of-speech tagging that allows M.A, pp. 77–98, 1999. categorizing words in verbs and names allowing [10] M. Diab, P. Resnik ‘’An unsupervised method for word sense tagging consequently studying names and verbs in a separate using parallel corpora’’. in Proceedings of the 40th Anniversary Meeting way. of the Association for Computational Linguistics (ACL-02), Philadelphia, PA, pp. 255–262, 2002.  The last challenge is relying on the lexical resource [11] E.F. Kelly, P.J. Stone. ‘’Computer recognition of english word senses’’. used. The AWN doesn’t cover all Arabic words, which North-Holland Publishing. North-Holland, Amsterdam. 1975. has consequently an impact on the system efficiency. [12] T. Pedersen. ‘’Learning Probabilistic Models of Word Sense doesn’t belong to the AWN Disambiguation’’. PhD thesis, Southern Methodist University, Dallas منطق For example the word structure. 1998. [13] S. Elmougy, T. Hamza, H.M. Noaman. ‘’Naive Bayes classifier for VI. CONCLUSION Arabic word sense disambiguation’’. In: Proceedings of INFOS 2008, Cairo, pp 27–29, 2008. In this paper, a WSD system for Arabic texts was [14] H. Schutze. ‘’Automatic word sense discrimination. Computational presented. The proposed system, unlike other systems, takes Linguistics’’. Special Issue on Word Sense Disambiguation, 24 (1), 97– into consideration two types of context during disambiguation 123, 1998. process. The first one is the local context defined by the words [15] D. Yarowsky. ‘’Unsupervised word sense disambiguation rivaling in the neighborhood of the ambiguous word, and the second is supervised methods’’. In 33th Annual Meeting of the Association for the global context defined by the full text. Computational Linguistics, pp 189-196, 1995. [16] L. Merhbene, A. Zouaghi, M. Zrigui. ‘’Approche basée sur les arbres Experiments have shown an accuracy of 74% for the sémantiques pour la désambiguïsation lexicale de la langue arabe en proposed system. utilisant une procédure de vote’’. proceeding de 21ème conférence sur le Traitement Automatique des Langues Naturelles, Marseille 2014. The incorporation of a named entities and a multiword [17] A. Zouaghi, L. Merhbene, M. Zrigui. ‘’Combination of information expression component in the process will be necessary done in retrieval methods with LESK algorithm for Arabic word sense the future for a better results, as well as a raise of all the disambiguation’’. Artif Intell Rev 38:257–269, 2012. challenges previously mentioned. [18] M.E. Menai. ‘’Word sense disambiguation using evolutionary algorithms – Application to Arabic language’’. Computers in Human As for the future, this method will be integrated in a Behavior 41 : 92–103, 2014. semantic indexing process to help enhancing Arabic [19] W. Black, S. El-Kateb. ‘’A Prototype English-Arabic Dictionary Based information retrieval system. on WordNet’’. http://www.fi.muni.cz/gwc2004/proc/95.pdf . 2004. [20] C. Fellbaum, W. Black, S. Elkateb, A. Marti, A. Pease, H. Rodriguez, P. REFERENCES Vossen. ‘’Constructing Arabic WordNet in Parallel with an Ontology’’. [1] E. Agirre, P. Edmonds, ‘’Word sense disambiguation: algorithms and http://www.globalwordnet.org/AWN/meetings/meet20050901/Fellbaum applications’’. Springer, 2006. .ppt , 2005. [2] R. Navigli, ‘’Word sense disambiguation: a survey’’. ACM Comput [21] S. Elkateb, W. Black, P. Vossen, D. Farwell, A. Pease, C. Fellbaum. Surv 41(2):1–69, 2009. Arabic WordNet and the Challenges of Arabic. http://www.mt- [3] A. Kaplan, ‘’An experimental study of ambiguity and context’’. archive.info/BCS-2006-Elkateb.pdf, 2006. Mechanical Translation 2(2), 39–46, 1955. [22] Z. Wu, M. Palmer. ‘’Verb semantics and lexical selection’’. [4] W. Weaver, ‘’Translation’’. in Machine Translation of Languages, MIT Proceedings of 32nd annual Meeting of the Association for Press, Cambridge, MA, 1949. Computational Linguistics, Las Cruces, New Mexico, pp 27-30, 1994. [5] M. Masterman, ‘’Semantic message detection for machine translation, [23] A. Boudlal, A. Lakhouaja, A. Mazroui, A. Meziane, O.B.M Ould using an interlingua”. International Conference on Machine Translation Abdallahi, and M. Shoul. ‘’Alkhalil Morpho SYS1: A Morphosyntactic of Languages and Applied Language Analysis, Her Majesty’s Stationery Analysis System for Arabic Texts’’. In the proceedings of the 11th Office, London, 437-475, 1961. International Arab Conference on Information Technology. Benghazi, [6] Y. Choueka, S. Lusignan, ‘’Disambiguation by short contexts’’. Libya 2010. Computers and the Humanities, 19, 147-158, 1985. [24] T. Buckwalter. ‘’Buckwalter {Arabic} Morphological Analyzer Version [7] M.E. Lesk, ‘’Automatic sense disambiguation using machine readable 1.0’’ , 2002. dictionaries : How to tell a pine cone from a nice cream cone’’. In Proceedings of the SIGDOC Conference, Toronto. 1986.

385 | P a g e www.ijacsa.thesai.org