
ALTIC -2011, Alexandria, Egypt

A COMPARATIVE STUDY FOR ARABIC WORD SENSE DISAMBIGUATION USING DOCUMENT PREPROCESSING AND MACHINE LEARNING TECHNIQUES

Mohamed M. El-Gamml, M. Waleed Fakhr
Arab Academy for Science and Technology, Heliopolis, Cairo, Egypt
[email protected], [email protected]

Mohsen A. Rashwan, Almoataz B. Al-Said
Faculty of Engineering, Dar Al-Ulum College, Cairo University, Giza, Egypt
[email protected], [email protected]

Keywords: Word Sense Disambiguation, Support Vector Machine (SVM), Naïve Bayesian Classifier (NBC), k-means clustering, Levenshtein distance, Latent Semantic Analysis (LSA).

Abstract: Word sense disambiguation is a core problem in many tasks related to language processing and was recognized as such at the beginning of the scientific interest in machine translation and artificial intelligence. In this paper, we introduce the possibility of using the Support Vector Machine (SVM) classifier to solve the Word Sense Disambiguation problem in a supervised manner, after using the Levenshtein distance algorithm to measure the matching distance between words, using the lexical samples of five Arabic words. The performance of the proposed technique is compared to supervised and unsupervised machine learning algorithms, namely the Naïve Bayes Classifier (NBC) and Latent Semantic Analysis (LSA) with K-means clustering, representing the baseline and state-of-the-art algorithms for WSD.

1 INTRODUCTION

Anyone who gets the joke when they hear a pun will realize that lexical ambiguity is a fundamental characteristic of language: words can have more than one distinct meaning. So why is it that text doesn't seem like one long string of puns? After all, lexical ambiguity is pervasive [1]. Lexical disambiguation in its broadest definition is nothing less than determining the meaning of every word in context, which appears to be a largely unconscious process in people. As a computational problem it is often described as "AI-complete", that is, a problem whose solution presupposes a solution to complete natural-language understanding or common-sense reasoning [2]. In the field of computational linguistics, the problem is generally called word sense disambiguation (WSD), and is defined as the problem of computationally determining which "sense" of a word is activated by the use of the word in a particular context. WSD is essentially a task of classification: word senses are the classes, the context provides the evidence, and each occurrence of a word is assigned to one or more of its possible classes based on the evidence [1].

Arabic, similar to most natural languages, is ambiguous since many words might have multiple senses. The correct meaning of an ambiguous word depends on the context in which it occurs. The speaker of a language can usually resolve this ambiguity without difficulty. However, identification of the specific meaning of a word computationally, through Word Sense Disambiguation (WSD), is not an easy task. Although WSD may not be considered a standalone complete approach by itself, it is an integral part of many applications such as Machine Translation [3, 4], Information Retrieval [5, 6], Lexicography [7], and Information Extraction [8].

Approaches to WSD are often classified according to the main source of knowledge used in sense differentiation. Methods that rely primarily on dictionaries, thesauri, and lexical knowledge bases, without using any corpus evidence, are termed dictionary-based or knowledge-based.

Methods that eschew (almost) completely external information and work directly from raw un-annotated corpora are termed unsupervised methods (adopting terminology from machine learning). Included in this category are methods that use word-aligned corpora to gather cross-linguistic evidence for sense discrimination. Finally, supervised and semi-supervised WSD make use of annotated corpora to train from, or as seed data in a bootstrapping process.

In this paper, we explore the idea of using the Levenshtein distance algorithm to calculate the distance between each two words, which helps to say that, for example, (السرطان، سرطاني، مسرطن، سرطان) are derivations of the same word, and then using a supervised or unsupervised technique for classification or clustering of the senses.

This paper is organized as follows. In Section 2, we briefly describe some related works in the area of Word Sense Disambiguation. The preprocessing performed on the documents that will be used for the training of the proposed system, and that will also be used in the testing phase, is described in Section 3. Supervised corpus-based methods for WSD and the learning algorithm used in this paper for word sense disambiguation are discussed in Section 4. Unsupervised corpus-based methods for WSD and the related algorithm used in this paper for word sense disambiguation are discussed in Section 5. The experiments and the experimental results are presented in Section 6. In Section 7 we summarize the conclusion of our work and suggest some future work ideas.
2 RELATED WORK

WSD was first formulated as a distinct computational task during the early days of machine translation in the late 1940s, making it one of the oldest problems in computational linguistics. Weaver (1949) introduced the problem in his now famous memorandum on machine translation. Later, in 1950, Kaplan observed that sense resolution given two words on either side of the word was not significantly better or worse than when given the entire sentence. Several researchers since Kaplan's work, e.g. Koutsoudas and Korfhage in 1956 on Russian, Masterman in 1961, Gougenheim and Michéa in 1961 on French, and Choueka and Lusignan in 1993, reported the same phenomenon.

WSD was resurrected in the 1970s within artificial intelligence (AI) research on full natural language understanding. In this spirit, Wilks (1975) developed "preference semantics", one of the first systems to explicitly account for WSD.

The 1980s were a turning point for WSD. Large-scale lexical resources and corpora became available, so handcrafting could be replaced with knowledge extracted automatically from the resources (Wilks et al. 1990). Lesk's (1986) short but extremely seminal paper used the overlap of word sense definitions in the Oxford Advanced Learner's Dictionary of Current English (OALD) to resolve word senses. Dictionary-based WSD had begun, and the relationship of WSD to lexicography became explicit. For example, Guthrie et al. (1991) used the subject codes (e.g., Economics, Engineering, etc.) in the Longman Dictionary of Contemporary English (LDOCE) (Procter 1978) on top of Lesk's method.

In 1996, Mooney compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line. Pedersen in 1998 compared Naïve Bayes with a Decision Tree, a Rule-Based Learner, a Probabilistic Model, etc. when disambiguating line and 12 other words. All of these researchers found that the Naïve Bayesian Classifier performed as well as any of the other methods.

Let us take some examples of previous works in the field of Arabic word sense disambiguation. In 2003, Mona T. Diab in her Ph.D. thesis "Word Sense Disambiguation within a Multilingual Framework" used an unsupervised machine learning approach called bootstrapping. Achraf Chalabi, a Sakhr researcher, in 1998 introduced a new word sense disambiguation algorithm, which has been used in the Sakhr Arabic-English computer-aided translation system, based on the Thematic Words of a given context to choose the appropriate sense of an ambiguous word [1]. Also, Soha M. Eid [21] in her Ph.D. thesis "A Comparative Study of Rocchio Classifier Applied to supervised WSD Using Arabic Lexical Samples" says that the Rocchio classifier outperforms the other classification approaches with an overall accuracy of 88%.

3 WORD MATCHING AND DOCUMENT PREPROCESSING

The Levenshtein distance is a string metric for measuring the amount of difference between two sequences (i.e. an edit distance). The term edit distance is often used to refer specifically to Levenshtein distance [9-11].

Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example:

- If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed; the strings are already identical.
- If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

In this work we introduce a simple algorithm to detect similarity between two words, as follows. If S is the source term and D is the destination term:

    if (Length(S) < 2 || Length(D) < 2)
        return false;
    remove the prefix "ال" from S and D if the length > 3;
    min_length = Min(Length(S), Length(D));
    if (min_length < 2)
        return false;
    else if (min_length == 3)
        distance = 0.7 - LvsDis(S, D) / min_length;
    else
        distance = 1.0 - LvsDis(S, D) / min_length;
    return distance > 0.6;

where LvsDis(S, D) denotes the Levenshtein distance between S and D.
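A minimal Python sketch of this word-matching heuristic is given below (our own rendering, not the authors' code); levenshtein implements the standard dynamic-programming edit distance and same_root follows the steps above, including stripping the definite article "ال".

    def levenshtein(s: str, t: str) -> int:
        # Classic dynamic-programming edit distance (deletions, insertions, substitutions).
        if not s:
            return len(t)
        if not t:
            return len(s)
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, start=1):
            curr = [i]
            for j, ct in enumerate(t, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (cs != ct)))   # substitution
            prev = curr
        return prev[-1]

    def same_root(s: str, d: str) -> bool:
        # Word-matching heuristic of Section 3: True if the two terms look like
        # derivations of the same word.
        if len(s) < 2 or len(d) < 2:
            return False
        if len(s) > 3 and s.startswith("ال"):    # strip the definite article
            s = s[2:]
        if len(d) > 3 and d.startswith("ال"):
            d = d[2:]
        min_len = min(len(s), len(d))
        if min_len < 2:
            return False
        factor = 0.7 if min_len == 3 else 1.0
        return factor - levenshtein(s, d) / min_len > 0.6

    # Example: both forms reduce to the same stem, so they are grouped together.
    print(levenshtein("test", "tent"))        # 1
    print(same_root("السرطان", "سرطاني"))     # True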

4 SUPERVISED CORPUS-BASED METHODS FOR WSD

The goal in supervised learning for classification consists of inducing, from a training set S, an approximation of an unknown function f that maps from an input space X to a discrete unordered output space Y = {1, ..., K} [12]. The training set contains m training examples, S = {(x1, y1), ..., (xm, ym)}, which are pairs (x, y) where x belongs to X and y = f(x). The x component of each example is typically a vector x = (x1, ..., xn), whose components, called features (or attributes), are discrete- or real-valued and describe the relevant information/properties about the example. The values of the output space Y associated with each training example are called classes (or categories). Therefore, each training example is completely described by a set of attribute-value pairs and a class label.

Supervised WSD may be seen as a categorisation task, once word occurrence contexts are viewed as documents and word senses as categories. Supervised learning methods represent linguistic information in the form of features. Each feature informs of the occurrence of a certain attribute in a context.

Different supervised machine learning approaches have been employed in supervised WSD by training a classifier using a set of labelled instances of the ambiguous word and creating a statistical model; the generated model is then applied to unlabelled instances of the ambiguous word to decide their correct sense. In this work, two different supervised learning techniques are used for solving the WSD problem: the Naïve Bayes Classifier (NBC) and the Support Vector Machine (SVM).

4.1 Naïve Bayes Classifier (NBC)

Naïve Bayes is the simplest representative of probabilistic learning methods [13]. In this model, an example is assumed to be "generated" first by stochastically selecting the sense S of the example and then each of the features independently according to their individual distributions P(x_i | S).

Using the Naïve Bayes classifier for supervised WSD relies on estimating the conditional probability of each sense S_i of a word w given a feature f_j in the context. The sense with the maximum conditional probability P(S_i | f_1, ..., f_m) is chosen as the most appropriate sense in context, where P(S_i | f_1, ..., f_m) is given by

    P(S_i \mid f_1, \ldots, f_m) = \frac{P(f_1, \ldots, f_m \mid S_i)\, P(S_i)}{P(f_1, \ldots, f_m)}    (1)

Based on the naïve assumption that the features are conditionally independent given the sense, we need only to maximize P(S_i | f_1, ..., f_m) based on the following equation

    P(S_i \mid f_1, \ldots, f_m) \propto P(S_i) \prod_{j=1}^{m} P(f_j \mid S_i)    (2)

The probabilities P(S_i) and P(f_j | S_i) are estimated as the relative occurrence frequencies in the training set of sense S_i and of the feature f_j in the presence of S_i, respectively. To handle the cases of zero counts, where a feature f_j does not exist in a sense, the Laplace smoothing technique is used and is defined as

    P(f_j \mid S_i) = \frac{N(f_j, S_i) + 1}{\sum_{f} N(f, S_i) + |V|}    (3)

where N(f_j, S_i) is the occurrence frequency of feature f_j in all training examples of class S_i and |V| is the number of all features.
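The decision rule of Eqs. (1)-(3) can be sketched in a few lines of Python (an illustration under our own assumptions about the data layout, not the authors' implementation): each training example is a list of context-word features paired with a sense label, and Laplace smoothing follows Eq. (3).

    from collections import Counter, defaultdict
    from math import log

    def train_nb(examples):
        # examples: list of (features, sense) pairs.
        sense_counts = Counter(sense for _, sense in examples)   # counts for P(S_i)
        feat_counts = defaultdict(Counter)                       # N(f_j, S_i)
        vocab = set()
        for feats, sense in examples:
            feat_counts[sense].update(feats)
            vocab.update(feats)
        return sense_counts, feat_counts, vocab, len(examples)

    def classify_nb(feats, sense_counts, feat_counts, vocab, n_examples):
        # argmax_i  log P(S_i) + sum_j log P(f_j | S_i), Laplace-smoothed as in Eq. (3).
        best_sense, best_score = None, float("-inf")
        for sense, n_s in sense_counts.items():
            total = sum(feat_counts[sense].values())             # sum_f N(f, S_i)
            score = log(n_s / n_examples)
            for f in feats:
                score += log((feat_counts[sense][f] + 1) / (total + len(vocab)))
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense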

4.2 Support Vector Machine

SVMs are based on the principle of Structural Risk Minimization from Statistical Learning Theory [14] and, in their basic form, they learn a linear discriminant that separates a set of positive examples from a set of negative examples with maximum margin (the margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples). This learning bias has proved to have good properties in terms of generalization bounds for the induced classifiers.

Figure 1: Geometrical intuition about the maximal margin hyperplane.

The left plot in Fig. (1) shows the geometrical intuition about the maximal margin hyperplane in a two-dimensional space. The linear classifier is defined by two elements: a weight vector w (with one component for each feature), and a bias b which stands for the distance of the hyperplane to the origin. Sometimes, training examples are not linearly separable or, simply, it is not desirable to obtain a perfect hyperplane. In these cases it is preferable to allow some errors in the training set so as to maintain a better solution hyperplane (see the right plot of Fig. (1)). This is achieved by a variant of the optimization problem, referred to as soft margin, in which the contributions to the objective function of the margin maximization and the training errors can be balanced through the use of a parameter called C [12].

In this work we use SVM as a classifier with a linear kernel, C=2 and Gamma=0.8.
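A minimal sketch of this set-up using scikit-learn (the toolkit is our assumption; the paper does not name one), with hypothetical bag-of-words context vectors:

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical feature vectors for labelled contexts of one ambiguous noun.
    X_train = np.array([[1, 0, 2, 0], [0, 1, 0, 1], [2, 0, 1, 0], [0, 2, 0, 1]])
    y_train = np.array([1, 2, 1, 2])                 # sense labels

    # Linear-kernel SVM with the parameters reported above (C=2, Gamma=0.8).
    # Note: gamma only affects non-linear kernels, so with kernel="linear" only C matters.
    clf = SVC(kernel="linear", C=2.0, gamma=0.8)
    clf.fit(X_train, y_train)
    print(clf.predict(np.array([[1, 0, 1, 0]])))     # predicted sense for a new context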
5 UNSUPERVISED CORPUS-BASED METHODS FOR WSD

Unsupervised corpus-based methods of word sense discrimination are knowledge-lean and do not rely on external knowledge sources such as machine-readable dictionaries, concept hierarchies, or sense-tagged text. They do not assign sense tags to words; rather, they discriminate among word meanings based on information found in unannotated corpora. These methods include Latent Semantic Analysis (LSA) [15], the Hyperspace Analogue to Language (HAL) [16, 17], and Clustering by Committee (CBC) [18].

In this work we modify Latent Semantic Analysis (LSA) by using the K-means clustering technique for clustering sentences.

5.1 Latent Semantic Analysis (LSA)

LSA traces its origins to a technique in information retrieval known as Latent Semantic Indexing (LSI) [15]. The objective of LSI is to improve the retrieval of documents by reducing a large term-by-document matrix into a much smaller space using Singular Value Decomposition (SVD). LSA uses much the same methodology, except that it employs a word-by-context representation. LSA represents a corpus of text as an M × N co-occurrence matrix, where the M rows correspond to word types, and the N columns provide a unit of context such as a phrase, sentence, or paragraph. Each cell in this matrix contains a count of the number of times that the word given in the row occurs in the context provided by the column.

LSI and LSA differ primarily in regard to their definition of context. For LSI it is a document, while for LSA it is more flexible, although it is often a paragraph of text. If the unit of context in LSA is a document, then LSA and LSI become essentially the same technique. After the co-occurrence cell counts are collected and possibly smoothed or transformed in some way, the M × N matrix is decomposed via Singular Value Decomposition (SVD) [19], which is a form of factor or principal components analysis. SVD reduces the dimensionality of the original matrix so that similar contexts are collapsed into each other. SVD is based on the fact that any rectangular matrix can be decomposed into the product of three other matrices. This decomposition can be achieved without any loss of information if no more factors than the minimum of N and M are used; in such cases the original matrix may be perfectly reconstructed. Otherwise, the reconstructed matrix is a least-squares best fit. Finally, the K-means clustering algorithm is applied to cluster the columns into the number of senses.

5.2 K-means Clustering Algorithm

Clustering problems arise in many different applications, such as data mining and knowledge discovery, data compression and vector quantization, and pattern recognition and pattern classification [20]. In K-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer K, and the problem is to determine a set of K points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center.
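The LSA + K-means pipeline of Sections 5.1 and 5.2 can be sketched as follows (NumPy/scikit-learn and the toy counts are our own assumptions):

    import numpy as np
    from numpy.linalg import svd
    from sklearn.cluster import KMeans

    # M x N word-by-sentence count matrix (toy values): rows are word types,
    # columns are sentences.
    counts = np.array([[2, 0, 1, 0, 0, 1],
                       [0, 3, 0, 2, 0, 0],
                       [1, 0, 2, 0, 1, 0],
                       [0, 1, 0, 1, 2, 1]], dtype=float)

    # Truncated SVD keeps only the k strongest factors, collapsing similar contexts.
    U, s, Vt = svd(counts, full_matrices=False)
    k = 2
    sentence_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one reduced vector per sentence

    # Cluster the sentence (column) vectors into the assumed number of senses.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sentence_vectors)
    print(clusters)                                   # cluster id = induced sense per sentence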

6 EXPERIMENTAL SETUP

6.1 Data Set

The data set used is a lexical samples corpus of five Arabic nouns. Each noun has two or three senses. The lexical samples corpus is a corpus of sentences and phrases. Each of these samples represents an example of the occurrence of an ambiguous noun. The examples were collected from different Arabic resources. Most of these resources are literature publications of several authors, including Al-Khansaa poetry from Al-Jazeera and Elhawy fi Alteb from Persia, which is a publication in medicine. They represent different domains, times and geographical distributions.

Table 1: Statistics of the corpus/dataset.

The data was selected and annotated by a professional linguist, with a different average number of examples per sense (Table 1). It is worth noting that not all the tagged senses are derived from an Arabic dictionary (for instance, the noun الحاجب is annotated with three senses, which are a brow, a doorman, or a proper noun). Obviously, the first two senses are consistent with dictionary definitions while the third depends on corpus usage. Moreover, not all nouns have a balanced distribution of examples. The noun المرسوم has two senses, which are something drawn and a decree, with more or less balanced data. However, the unbalanced data may be justified, since it represents the normal usage of the noun in daily life. For instance, the word السرطان has unbalanced data for its three senses (an animal, a zodiacal constellation, and a disease), where the fewest examples are provided for the first sense. The remaining two words are المشروع (something legal and a project) and الجدول (a water stream and a table). For both words, the second sense has a larger presence in daily life and a larger number of examples.

6.2 Performance Evaluation

The results are reported in terms of the standard measure of accuracy. In the supervised techniques, the available data set is partitioned into training and test sets. The training set is used for learning while the test set is used for evaluation. To ensure that the results are unbiased with respect to a particular train/test split, N-fold cross validation is used. In N-fold cross-validation, the corpora are split into N parts of similar size. A single part is used as the testing gold standard, and the remaining (N-1) parts are used for training the system. The final result is the average of the N executions. The corpora are partitioned in a way that keeps the same proportion of word senses in each of the folds. Since the available number of examples is small, the results reported are averaged over five different test/training splits that are partitioned in a way that keeps the same proportion of word senses in each split. In each split, 33% of the data is used for testing.
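A sketch of this protocol, assuming scikit-learn and a hypothetical train_and_score helper that fits one of the classifiers on a split and returns its accuracy:

    import numpy as np
    from sklearn.model_selection import StratifiedShuffleSplit

    def evaluate(X, y, train_and_score):
        # Five stratified splits, each holding out 33% of the examples while
        # preserving the sense distribution; the reported score is the average.
        splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.33, random_state=0)
        scores = [train_and_score(X[tr], y[tr], X[te], y[te])
                  for tr, te in splitter.split(X, y)]
        return np.mean(scores)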

6.3 Results and Discussion

Each classifier is tested over the five nouns two times: the first time without any pre-processing of the raw data except removing punctuation and special characters, and the second time using the Levenshtein distance algorithm for similarity detection. Obviously, each classifier exhibits a different performance for each noun.

6.3.1 First Experiment

This experiment comprises three experiments (LSA+K-means, NBC and SVM).

LSA+K-means steps:
a- First remove special characters such as (, ; " .).
b- Remove punctuation from each word.
c- Calculate an M × N co-occurrence matrix, where the M rows correspond to the word types that occur in at least two separate sentences, and the N columns provide a unit of context such as a phrase, sentence, or paragraph; each cell in this matrix contains a count of the number of times that the word given in the row occurs in the context provided by the column (this step is sketched below).
d- Apply singular value decomposition.
e- Use K-means for clustering.

NBC and SVM steps:
a- First remove special characters such as (, ; " .).
b- Remove punctuation from each word.
c- Calculate an M × N co-occurrence matrix as in step (c) above.
d- Use NBC or SVM for classification.
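A sketch of the matrix-construction step (c), assuming the sentences have already been stripped of punctuation and special characters and tokenised:

    from collections import Counter

    def cooccurrence_matrix(sentences):
        # sentences: list of token lists. Rows are kept only for words that occur
        # in at least two separate sentences, as in step (c); each cell counts how
        # often the row word appears in the column sentence.
        doc_freq = Counter()
        for tokens in sentences:
            doc_freq.update(set(tokens))
        vocab = sorted(w for w, df in doc_freq.items() if df >= 2)
        matrix = [[tokens.count(w) for tokens in sentences] for w in vocab]
        return vocab, matrix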

Table 2: Comparison of different classifiers' results (first experiment, without Levenshtein pre-processing)

    Noun        SVM       NBC       LSA+K-means
    السرطان      85%       83%       48%
    الجدول       86%       81%       63%
    المشروع      87%       85%       60%
    المرسوم      91%       89%       57%
    الحاجب       82%       80%       51%
    Average     86.20%    83.60%    55.80%

6.3.2 Second Experiment

This experiment also comprises three experiments (LSA+K-means, NBC and SVM), but uses the Levenshtein distance algorithm for similarity detection.

LSA+K-means steps:
a- First remove special characters such as (, ; " .).
b- Remove punctuation from each word.
c- Apply the Levenshtein distance algorithm to detect the matched words (see the sketch below).
d- Calculate an M × N co-occurrence matrix, where the M rows correspond to the word types that occur in at least two separate sentences, and the N columns provide a unit of context such as a phrase, sentence, or paragraph; each cell contains a count of the number of times that the word given in the row occurs in the context provided by the column.
e- Apply singular value decomposition.
f- Use K-means for clustering.

NBC and SVM steps:
a- First remove special characters such as (, ; " .).
b- Remove punctuation from each word.
c- Apply the Levenshtein distance algorithm to detect the matched words.
d- Calculate an M × N co-occurrence matrix as in step (d) above.
e- Use NBC or SVM for classification.
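One way to realise the word-matching step before building the matrix is sketched below (our own illustration; it reuses the same_root helper sketched in Section 3): every token is replaced by the first previously seen form it matches, so all derivations of a word share a single row in the co-occurrence matrix.

    def canonicalise(tokens_per_sentence):
        # tokens_per_sentence: list of token lists, already cleaned as in steps (a)-(b).
        seen = []
        merged = []
        for tokens in tokens_per_sentence:
            row = []
            for tok in tokens:
                match = next((s for s in seen if same_root(s, tok)), None)
                if match is None:
                    seen.append(tok)
                    match = tok
                row.append(match)
            merged.append(row)
        return merged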

Table 3: Comparison of different classifiers' results (second experiment, with Levenshtein pre-processing)

    Noun        SVM       NBC       LSA+K-means
    السرطان      95.65%    86.90%    55.00%
    الجدول       97.17%    93.20%    71%
    المشروع      96.80%    84%       64%
    المرسوم      96.20%    91%       64%
    الحاجب       95.18%    90%       58%
    Average     96.2%     89.02%    62.40%

The following example illustrates our work in the second experiment when using SVM as a classifier. Table (4) is an example of the data set with its senses.

Table 4: Example of data (ten sentences, numbered 1-10, each labelled with its sense).

Table (5) below shows the co-occurrence matrix after removing special characters and punctuation and then using the Levenshtein distance for string similarity, where each column is considered as a feature vector for one sentence and each row represents a word of the whole context. Cell entries are the number of times that a word (row) appeared in a sentence (column), for words that appeared in at least two sentences.

Table 5: Co-occurrence matrix (word types as rows, the ten sentences S1-S10 as columns; each cell gives the number of times the row word occurs in that sentence).

If we take sentences (1, 2, 4, 5, 7, 8) for training and sentences (3, 6, 9, 10) for testing, the result obtained from SVM is shown in Table (6).

Table 6: Result of the simple example

    Sentence    Classified Sense    Target Sense
    S3          1                   1
    S6          1                   2
    S9          2                   2
    S10         2                   2

Figure 2: Results graph for each word.

Since the acquisition of manually tagged data is usually expensive, Fig. (2) plots, for each sense, the number of samples taken as training data (x-axis) against the percentage of correct classifications for that sense. The graph shows that increasing the training data may improve the performance of the classifier, but the type of the training data also affects the performance: as seen in Fig. (2-b), Fig. (2-d) and Fig. (2-e), sense 2 shows no improvement even when more samples are used for training, which means that the type of these samples does not add any information to the classifier (more data does not automatically mean more performance). There is also a threshold point at which the system saturates, meaning that more training data is needed, as seen for sense 1 in Fig. (2-b) and Fig. (2-d), and for senses 1 and 3 in Fig. (2-e).

Using the Levenshtein distance algorithm for similarity detection improves the performance of all supervised and unsupervised methods; also, using the K-means clustering algorithm for clustering after (LSA) improves the performance of LSA.

7 CONCLUSION

In this paper we have investigated the use of the Levenshtein distance algorithm for similarity detection as a pre-processing stage before using any classification method in the word sense disambiguation system. We have also studied the possibility of using the K-means algorithm after applying latent semantic analysis (LSA). We have also investigated using the support vector machine as a supervised classifier for the WSD problem.

Levenshtein distance is not the only popular notion of edit distance. Variations can be obtained by changing the set of allowable edit operations; for instance:

- The longest common subsequence metric is obtained by allowing only addition and deletion, not substitution.
- The Hamming distance only allows substitution (and hence, only applies to strings of the same length).

Levenshtein distance, by contrast, allows addition, deletion and substitution, and can be used with strings of different lengths.

Using Levenshtein as a string similarity detection step improves the performance of using NBC or SVM as a classifier, and also of LSA.

Our plan is also to build on this work to select a small subset of the data that needs to be manually labelled while still improving the performance of the classifier.
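For illustration (our own sketch, not part of the original paper), the two variants above can be computed as follows; the helper names are hypothetical:

    def hamming_distance(s: str, t: str) -> int:
        # Substitutions only; defined only for strings of equal length.
        if len(s) != len(t):
            raise ValueError("Hamming distance needs equal-length strings")
        return sum(a != b for a, b in zip(s, t))

    def lcs_distance(s: str, t: str) -> int:
        # Insertions and deletions only: len(s) + len(t) - 2 * |LCS(s, t)|.
        m, n = len(s), len(t)
        lcs = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                lcs[i][j] = (lcs[i - 1][j - 1] + 1 if s[i - 1] == t[j - 1]
                             else max(lcs[i - 1][j], lcs[i][j - 1]))
        return m + n - 2 * lcs[m][n]

    # "test" vs "tent": one substitution suffices (Hamming = Levenshtein = 1),
    # but with insertions and deletions only the change costs 2.
    print(hamming_distance("test", "tent"))   # 1
    print(lcs_distance("test", "tent"))       # 2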
REFERENCES

[1] E. Agirre and P. Edmonds (eds.), 2007, Word Sense Disambiguation: Algorithms and Applications, 1-28. Springer.
[2] Nancy Ide, Jean Véronis, 1998. Word sense disambiguation: The state of the art. Computational Linguistics.
[3] Carpuat, M., and Wu, D., 2005, "Word Sense Disambiguation vs. Statistical Machine Translation", in Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 387-394.
[4] Chan, Y., Ng, H., and Chiang, D., 2007, "Word Sense Disambiguation Improves Statistical Machine Translation", in Proc. of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 33-40.
[5] Schütze, H., and Pedersen, J., 1995, "Information Retrieval Based on Word Senses", in Proc. of the Symposium on Document Analysis and Information Retrieval (SDAIR'95), pp. 161-175.
[6] Stokoe, C., Oakes, M., and Tait, J., 2003, "Word Sense Disambiguation in Information Retrieval Revisited", in Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 159-166.
[7] Atkins, Sue. 1991. Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41: 5-72.
[8] Jacquemin, B., Brun, C., and Boux, C., 2002, "Enriching a Text by Semantic Disambiguation for Information Extraction", in Proc. of the Workshop on Using Semantics for Information Retrieval and Filtering at the 3rd International Conference on Language Resources and Evaluation (LREC).
[9] Lloyd Allison, Dynamic Programming Algorithm (DPA) for Edit-Distance.
[10] Alex Bogomolny, Distance Between Strings.
[11] Thierry Lecroq, Levenshtein Distance.
[12] L. Màrquez, G. Escudero, D. Martínez and G. Rigau, 2006, Machine Learning Methods for WSD. Chapter of the book Word Sense Disambiguation: Algorithms, Applications, and Trends. E. Agirre and P. Edmonds, editors. Kluwer.
[13] Duda, R. and Hart, P., 1973, "Pattern Classification and Scene Analysis", Wiley.
[14] Boser, Bernhard E., Isabelle M. Guyon & Vladimir N. Vapnik. 1992. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory (CoLT), Pittsburgh, U.S.A., 144-152.
[15] Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer & Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.
[16] Burgess, Curt & Kevin Lund. 1997. Modelling parsing constraints with high-dimensional context space. Language and Cognitive Processes, 12(2-3): 177-210.
[17] Burgess, Curt & Kevin Lund. 2000. The dynamics of meaning in memory. Cognitive Dynamics: Conceptual Representational Change in Humans and Machines, ed. by Eric Dietrich and Arthur Markman, 117-156. Mahwah, U.S.A.: Lawrence Erlbaum Associates.
[18] Lin, Dekang & Patrick Pantel. 2001. Induction of semantic classes from natural language text. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 317-322.
[19] Furnas, George W., Scott Deerwester, Susan T. Dumais, Thomas K. Landauer, Richard Harshman, L. A. Streeter & K. E. Lochbaum. 1988. Information retrieval using a Singular Value Decomposition model of latent semantic structure. Proceedings of the 11th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), Grenoble, France, 465-480.
[20] Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2): 129-137.
[21] Soha M. Eid, Almoataz B. Al-Said, Nayer M. Wanas, Mohsen A. Rashwan, Nadia H. Hegazy, "A Comparative Study of Rocchio Classifier Applied to Supervised WSD Using Arabic Lexical Samples", ESOLEC'2010.