Robust Algorithm for Stemming Text Document

International Journal of Computer Information Systems and Industrial Management Applications. ISSN 2150-7988 Volume 8 (2016) pp. 235-246 © MIR Labs,www.mirlabs.net/ijcisim/index.html RAID: Robust Algorithm for stemmIng text Document Kabil BOUKHARI1 and Mohamed Nazih OMRI2 1MARS Unit of Research, Department of computer sciences Faculty of sciences of Monastir, University of Monastir, 5000, Tunisia [email protected] 2MARS Unit of Research, Department of computer sciences Faculty of sciences of Monastir, University of Monastir, 5000, Tunisia [email protected] Abstract: In this work, we propose a robust algorithm for The indexing phase is subsequently classified using automatic indexing unstructured Document. It can detect the indexes, the document from a set of documents in a given most relevant words in an unstructured document. This collection and retrieving the context of this index within algorithm is based on two main modules: the first module the document itself . This type of indexing is to optimize ensures the processing of compound words and the second access to data in large databases[1]. allows the detection of the endings of the words that have not In this context, the research is centered on the extraction of been taken into consideration by the approaches presented in key terms used in the documents to facilitate access and literature. The proposed algorithm allows the detection and navigation in web pages and to find the electronic removal of suffixes and enriches the basis of suffixes by eliminating the suffixes of compound words. We have information. These keywords are used in the process of experienced our algorithm on two bases of words: a standard information search to get relevant answers to questions[2]. collection of terms and a medical corpus. The results show the The questions that arise are of the form: Can we find this remarkable effectiveness of our algorithm compared to others document? How well the documents are relevant? Do they presented in related works. meet user needs? To answer these questions, the system must take the user Keywords: Robust algorithm, Stemming, Documents input in the form of key terms and linking them to indexing, Information retrieval. information contained in the documents. The recovery technique paves the way for a possible I. Introduction inquiry about the fact if any given document and a given query share a particular keyword or not. The obvious In recent years, electronic documents have increased both answer is simply tested for absolute equality between the in the Internet and in corporate intranets. Finding the keyword and all terms in the document. It is only after the information needed here in thus proves a task increasingly confirmation of an existing similarity is found that difficult. Often, users are discouraged by the slowness and automatic indexing retrieves it. inefficiency of traditional search engines available. However, the terms key can have many morphological Automatic indexing of documents makes it easy and solves variants that share the same semantics and can be much of the problem. beneficial for the information retrieval system to consider Automatic indexing is defined as a document these equivalent terms. To recognize these variations, as in representation of the analysis results of natural language or [3][33]the system needed terms in a natural form in order standardized language of a document [19][20][21]. to treat them. A form of natural language processing, Another more classic definition and consonant definition which may be opted for to carry out this task, can be an suggests that automatic indexing is the identification and algorithm that transforms an end to its morphological root location of relevant sequences or major themes in a via the removal of prefixes and suffixes[4]. Here we can document by analyzing its content [22][23][24].However, talk about stemming. other works [25] [26] have shown the existence of Then the purpose of an information retrieval system (IRS) irrelevant concepts for texts. [30][31][32] is to find the most relevant documents that correspond to the user’s queries given the large number of Dynamic Publishers, Inc., USA 236 BOUKHARI, OMRI documents on the Internet. So, to improve the performance These algorithms are designed to identify the roots of the of an IRS, it is essential to develop a new stemmer for a words through a set of rules and conditions. more relevant and accurate indexing system. Stemming operation consists of removing inflectional and The techniques used[5] are generally based on a list of derivational suffixes to reduce the various forms of a word affixes (suffixes, prefixes, postfix, antefixes) of the at their root. This root must be included in a morphological language in question and on a set of stemming/de- sense: two words might share or have the same suffixation rules constructed already, that allow the finding morphological root and completely different meanings. of the stem of a word. The techniques used are generally based on a list of affixes Several algorithms have been proposed for research lexical of the language at hand as well as a set of de-suffixation root for the English language. The main algorithms rules priory built that allow the finding of a stem of a given developed in this senseare the LOVINS, Paice, Husk, word. PORTER, EDA and Krovetz algorithm's. Search engines use stemming algorithms to improve Part of the focus of this work is the study of two standard information retrieval[8]. The keywords of a query or algorithms, namely LOVINS algorithm and that of document which are represented by their roots rather than PORTER. We present the definition and the principle of by the original words. As in [9] several variations of a term each of the five algorithms. can thus be grouped into a single representative form, As for stemming algorithms, there is no perfect algorithm which reduces the size of the dictionary, and then the that meets user needs for different corpus. Meanwhile, this number of distinct words needed to represent a set of algorithm allows the indexing process of non–structured documents. A dictionary of reduced size saves both space documents. and execution time. The focal blemish of de-suffixation algorithms that are There are two main families of stemmers: the algorithmic developed so far is their lack of producing one hundred stemmers and dictionary-based stemmers: algorithmic percent reliable results (lack of precision): same context stemmer is often faster and can extract the roots of words do not have the same stems. We noticed that, on the unfamiliar words (in a sense, all found words are one hand, in the stemming process, there is no phase of unfamiliar to it). However, it will have a higher error rate, treatment for compound shapes, and on the other hand, grouping sets of words that should not be together (over- several suffixes are ignored and are not, therefore, treated, Stemming). A dictionary based stemmer where the number which is the case of the LOVINS algorithm. of error on known words is almost zero. It is also slower This paper is divided into fourparts. After the introduction, and requires the removal of suffixes before looking for the the second presents the stemming phase of the unstructered corresponding root in the dictionary. documents. The third part devoted to the presentation of The de-suffixation algorithm functions on different steps the best-known stemming algorithms in literature, this through which the words to process successively pass, section will be concluded by a comparative study of according to the rules, when the parser recognizes a suffix different algorithms and their limits. Part four presents the from the list, it removes or transforms it. Here the longest proposed algorithm, inspired from the SAID approach[27], suffix which determines the rule to be applied. Each for stemming words of a text. The fifth paragraph presents algorithm has its own steps and its different rules. the experimental data, the results obtained and provides a detailed analysis of these results. We finish this work by a III. Related works conclusion and we give the perspectives of the future work. Different Stemming algorithms [10] have been proposed II. Stemming in literature. The first algorithm that we treat is Husk and Paice's The label stemming or de-suffixation is given to the algorithm [11][12] and belongs to the family of process that aims to transform the inflections in their algorithmic stemmers. It is a simple iterative Stemmer, it radical or stem. It seeks to bring together the different removes the suffixs from a word in an indefinite number variants, inflectional and derivationnel, of a word around a of steps. stem. This is algorithm based on a set of rules to extract roots The root of a word corresponds to its remaining part, with more stores outside the rules of the code. On each namely its stem, after the removal of its affixes (prefix iteration, it tries to find an applicable rule by the last and/or suffix). It is also sometimes known as the stem of a character of the word. word. Unlike the lemma that corresponds to a real word in This approach consists of a set of functions that will use the language, the root or stem is generally not a real word. the root extraction rules applicable to the input word and For example, the word "relation" has "rel" as a stem which check the acceptability of the proposed root, and the set of does not correspond to a real word but in the example of rewrite rules. The main function takes as parameters the "greatest" the radical or stem is "great". word that we want to extract, the root and the code of the The stemming is an important stage for the indexing language.

Load more