A Comparative Study for Arabic Word Sense Disambiguation Using Document Preprocessing and Machine Learning Techniques

ALTIC -2011, Alexandria, Egypt A COMPARATIVE STUDY FOR ARABIC WORD SENSE DISAMBIGUATION USING DOCUMENT PREPROCESSING AND MACHINE LEARNING TECHNIQUES Mohamed M. El-Gamml , M. Waleed Fakhr Arab Academy for Science and Technology , Heliopolis, Cairo, Egypt [email protected], [email protected] Mohsen A. Rashwan, Almoataz B. Al-Said Faculty of Engineering, Dar Al-Ulum College, Cairo University, Giza, Egypt [email protected], [email protected] Keywords: Word Sense Disambiguation, Support Vector Machine (SVM), Naïve Bayesian Classifier (NBC), k-means clustering, Levenshtein Distance, Latent Semantic Analysis (LSA). Abstract: Word sense disambiguation is a core problem in many tasks related to language processing and was recognized at the beginning of the scientific interest in machine translation and artificial intelligence. In this paper, we introduce the possibilities of using the Support Vector Machine (SVM) classifier to solve the Word Sense Disambiguation problem in a supervised manner after using the Levenshtein Distance algorithm to measure the matching distance between words through the usage of the lexical samples of five Arabic words. The performance of the proposed technique is compared to supervised and unsupervised machine learning algorithms, namely the Naïve Bayes Classifier (NBC) and Latent Semantic Analysis (LSA) with K-means clustering, representing the baseline and state-of-the-art algorithms for WSD. 1 INTRODUCTION context provides the evidence, and each occurrence of a word is assigned to one or more of its possible classes based on the evidence [1]. Anyone who gets the joke when they hear a pun will Arabic, similar to most Natural Languages, is realize that lexical ambiguity is a fundamental characteristic of language: Words can have more ambiguous since many words might have multiple senses. The correct meaning of an ambiguous word than one distinct meaning. So why is it that text depends on the context in which it occurs. The doesn’t seem like one long string of puns? After all, lexical ambiguity is pervasive [1]. Lexical speaker of a language can usually resolve this ambiguity without difficulty. However, disambiguation in its broadest definition is nothing identification of the specific meaning of a word less than determining the meaning of every word in context, which appears to be a largely unconscious computationally, through Word Sense Disambiguation (WSD), is not an easy task. process in people. As a computational problem it is Although WSD may not be considered a standalone often described as “AI-complete”, that is, a problem whose solution presupposes a solution to complete approach by itself, it is an integral part of many applications such as Machine Translation [3, 4], natural-language understanding or common- sense Information Retrieval [5, 6], text mining, reasoning [2]. In the field of computational linguistics, the Lexicography [7], and Information Extraction [8]. Approaches to WSD are often classified problem is generally called word sense according to the main source of knowledge used in disambiguation (WSD), and is defined as the problem of computationally determining which sense differentiation. Methods that rely primarily on dictionaries, thesauri, and lexical knowledge bases, “sense” of a word is activated by the use of the word without using any corpus evidence, are termed in a particular context. WSD is essentially a task of classification: word senses are the classes, the dictionary-based or knowledge-based. Methods that ALTIC -2011, Alexandria, Egypt eschew (almost) completely external information The 1980s were a turning point for WSD. Large- and work directly from raw un-annotated corpora are scale lexical resources and corpora became available termed unsupervised methods (adopting terminology so handcrafting could be replaced with knowledge from machine learning). Included in this category extracted automatically from the resources (Wilks et are methods that use word-aligned corpora to gather al. 1990). Lesk’s (1986) short but extremely seminal cross-linguistic evidence for sense discrimination. paper used the overlap of word sense definitions in Finally, supervised and semi-supervised WSD make the Oxford Advanced Learner’s Dictionary of use of annotated corpora to train from, or as seed Current English (OALD) to resolve word senses. data in a bootstrapping process. Dictionary-based WSD had begun and the In this paper, we explore the idea of using relationship of WSD to lexicography became Levenshtein Distance algorithm to calculate the explicit. For example, Guthrie et al. (1991) used the distance between each two words which help to say subject codes (e.g., Economics,Engineering, etc.) in the Longman Dictionary of Contemporary English ( ,سشطبُ , ٍسشطِ , سشطبّٚ ,اىسشطبُ) that for example Is a derivation from the same word then using a (LDOCE) (Procter 1978) on top of Lesk’s method. supervised or unsupervised technique for In 1996, Mooney compared Naïve Bayes with a classification or clustering the senses . Neural Network, Decision Tree/List Learners, This paper is organized as follows. In Section 2, Disjunctive and Conjunctive Normal Form learners, we briefly describe some related works in the area of and a perceptron when disambiguating six senses of Word Sense Disambiguation. The preprocessing line. Pedersen in 1998 compared Naïve Bayes with preformed on the documents that will be used from Decision Tree, Rule Based Learner, Probabilistic the training of the proposed system and also will be Model, etc. when disambiguating line and 12 other used in the testing phase is described in Section 3. words. All of these researchers found that Naïve Supervised Corpus-Based Methods for WSD and the Bayesian Classifier performed as well as any of the learning algorithm used in this paper for word sense other methods. disambiguation are discussed in Section 4. Let us take some examples of the previous works Unsupervised Corpus-Based Methods for WSD and in the fields of Arabic word sense disambiguation. In the related algorithm used in this paper for word 2003, Mona T. Diab in her Ph.D. thesis "Word sense disambiguation are discussed in Section 5. The Sense Disambiguation within a Multilingual experiments and the experimental results are Framework" used unsupervised machine learning presented in Section 6. In Section 7 we summarize approach called bootstrap. Achraf Chalabi, a Sakhr the conclusion of our work and suggest some future researcher, in 1998 had introduced a new word sense work ideas. disambiguation algorithm that have been used in Sakhr Arabic-English computeraided translation system based on Thematic Words of a given context to choose the appropriate sense of a ambiguities 2 RELATED WORK word [1]. Also Soha M. Eid [21] in her Ph.D. thesis “A WSD was first formulated as a distinct Comparative Study of Rocchio Classifier Applied to computational task during the early days of machine supervised WSD Using Arabic Lexical Samples” translation in the late 1940s, making it one of the Says that the Rocchio classifier outperforms the oldest problems in computational linguistics. other classification approaches with an overall Weaver (1949) introduced the problem in his now accuracy of 88%. famous memorandum on machine translation. Later in 1950, Kaplan observed that sense resolution given two words on either side of the word was not significantly better or worse than when given the 3 WORD MATCHING AND entire sentence. Several researchers since Kaplan's DOCUMENT PREPROCESSING work e.g. Koutsoudas and Korfhage in 1956 on Russian language, Masterman in 1961, Gougenheim The Levenshtein distance is a metric for measuring and Michéa in 1961 on French, and Choueka and the amount of difference between two sequences Lusignan reported the same phenomenon in 1993. (i.e. an edit distance). The term edit distance is often WSD was resurrected in the 1970s within used to refer specifically to Levenshtein distance [9- artificial intelligence (AI) research on full natural 11]. language understanding. In this spirit, Wilks (1975) Levenshtein distance (LD) is a measure of the developed “preference semantics”, one of the first similarity between two strings, which we will refer systems to explicitly account for WSD. to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, ALTIC -2011, Alexandria, Egypt or substitutions required to transform s into t. For Different supervised Machine Learning example: approaches have been employed in supervised WSD by training a classifier using a set of labelled . If s is "test" and t is "test", then LD(s,t) = 0, instances of the ambiguous word and create a because no transformations are needed. The statistical model the generated model is then applied strings are already identical. to unlabeled instances of the ambiguous word to decide their correct sense. In this work, two different . If s is "test" and t is "tent", then LD(s,t) = 1, supervised learning techniques will be used for because one substitution (change "s" to "n") solving WSD problem. These are Naïve Bayes is sufficient to transform s into t. Classifier (NBC) and Support Vector Machine In this work we introduce a simple algorithm to (SVM). detect similarity between two words as follow: 4.1 Naïve Bayes Classifier (NBC) If S is the source term and D is the destination term so, Naïve Bayes is the simplest representative of if (Length(S) < 2 || Length(D) < 2) probabilistic learning methods [13]. In this model, return false; an example is assumed to be

A Comparative Study for Arabic Word Sense Disambiguation Using Document Preprocessing and Machine Learning Techniques

Intelligent Chat Bot

An Ensemble Regression Approach for Ocr Error Correction

Practice with Python

NLP - Assignment 2

3 Dictionaries and Tolerant Retrieval

Use of Word Embedding to Generate Similar Words and Misspellings for Training Purpose in Chatbot Development

Feature Combination for Measuring Sentence Similarity

The Research of Weighted Community Partition Based on Simhash

Hybrid Algorithm for Approximate String Matching to Be Used for Information Retrieval Surbhi Arora, Ira Pandey

Levenshtein Distance Based Information Retrieval Veena G, Jalaja G BNM Institute of Technology, Visvesvaraya Technological University

Thesis Submitted in Partial Fulﬁlment for the Degree of Master of Computing (Advanced) at Research School of Computer Science the Australian National University

Effective Search Space Reduction for Spell Correction Using Character Neural Embeddings