Vol. 2(5), Apr. 2016, pp. 326-334

PSWG: An Automatic Stop-word List Generator for Persian Information Retrieval Systems Based on Similarity Function & POS Information

Mohammad-Ali Yaghoub-Zadeh-Fard1, Behrouz Minaei-Bidgoli1, Saeed Rahmani and Saeed Shahrivari
1Iran University of Science and Technology, Tehran, Iran
2Freelancer, Tehran, Iran
*Corresponding Author's E-mail: [email protected]

Article History: JKBEI DOI: 649123/11045. Received Date: 21 Sep. 2015. Accepted Date: 11 Feb. 2016. Available Online: 09 Apr. 2016.

Abstract

With the advent of new information resources, search engines face a new challenge: they must store a large amount of text. The problem is even more severe for small companies, which suffer from limited hardware resources and budgets. Under these circumstances, reducing index size is of paramount importance, as is maintaining the accuracy of retrieval. One of the primary ways to reduce index size in text processing systems is to remove stop-words, frequently occurring terms that do not contribute to the information content of documents. Even though manually built stop-word lists exist for almost all languages, stop-word lists are domain-specific; in other words, a term that is a stop-word in one domain may play an indispensable role in another. This paper proposes an aggregated method for automatically building stop-word lists for Persian information retrieval systems. Using part-of-speech tagging and analyzing statistical features of terms, the proposed method tries to enhance the accuracy of retrieval and minimize the potential side effects of removing informative terms. The experimental results show that the proposed approach enhances average precision, decreases index storage size, and improves overall response time.

Keywords: Information Retrieval, Natural Language Processing, Part of Speech Tagging, Index Size Reduction, Stop-words

1. Introduction

Information retrieval (IR) researchers concentrate on increasing the accuracy and precision of IR systems by using new information sources and by incorporating more of the available sources, such as web pages from the Internet. However, the expansion of the Internet and the tremendous growth of online web pages have enlarged text indices. As a result, index size reduction has become more important than before, especially for small companies with limited hardware resources. Various techniques have been used to reduce index size; compression, compaction, and summarization are among the well-known methods. Another effective technique in text processing systems is the elimination of non-informative words, known as stop-words. Articles, prepositions, conjunctions, and pronouns are typical candidates for elimination as stop-words [1]. These words occur frequently in documents and have very low discrimination values. By removing stop-words, text processing systems pursue two purposes: (1) elimination of noise and (2) dimension reduction. While the former tries to enhance retrieval accuracy in IR systems, the latter aims to reduce index size and to minimize response time. Stop-word elimination has been reported to reduce index storage size by over 40% [2].

Much research has been conducted to build stop-word lists manually for almost all currently spoken languages. However, manually building stop lists has several drawbacks. First, a term can be informative in one domain but not meaningful in another.
For example, the word “for”, which is a potential candidate for removal, is the most important term in a web page that teaches loops in the C programming language.

Second, human languages are dynamic, and many new words are created every day. Consequently, manually built stop-word lists must be updated regularly, and this burden is even heavier in multi-language systems.

Third, words are weighted differently in different applications, which means that different stop-word lists are needed depending on the application. For instance, although it had previously been reported that using stop-words is not beneficial in plagiarism detection [3], Stamatatos showed that a small set of stop-words is useful for identifying similarity at the document level as well as the exact passage boundaries in the plagiarized and source documents [4].

Fourth, in some languages, such as Persian, word boundaries are not clear enough, and the way a tokenizer detects these boundaries has a big impact on the effectiveness of stop-word lists. For instance, a Persian compound word meaning “none” may be treated as two different words by a tokenizer. In such a system, one of the resulting terms is indexed because it is a number in the Persian language. A system that considers the compound word as a whole, however, is able to recognize it as a stop-word.

Fifth, having an ordered list of stop-words can come in handy. For example, an IR system can remove more terms from the list when it needs to increase its performance.

Sixth, removing stop-words can lead to erroneous semantic interferences [5]. For instance, consider the query “How does Brazil win the games?” Removing the word “how” along with the other stop-words in this sentence yields a completely different question: “Does Brazil win games?”

Seventh, it must be noted that blind removal of stop-words can damage phrase queries.
A phrase query is one that matches documents containing a particular sequence of terms. For such a query, removing any term can reduce accuracy. Although keyphrase extraction techniques make it possible to remove only those stop-words that do not belong to any phrase, keeping the terms that belong to key phrases guarantees nothing; after all, users sometimes look for exact matches to their queries.

Taking all the above-mentioned reasons into consideration, we conclude that it is of great importance to generate stop-word lists automatically and to minimize the potential repercussions of removing informative terms as much as possible. Indeed, it has been claimed that the future of stop-word lists lies in their automated construction [6, 7]. In this paper, we tackle some of these problems. Our algorithm benefits from part-of-speech information as well as the distribution of terms in a given collection. It aims to minimize the potential side effects of removing informative terms while striving to discriminate between non-informative and informative terms. Experiments on a standard corpus show that our algorithm outperforms other well-known stop-word lists, improves retrieval accuracy, and accelerates the retrieval process.

The remainder of this paper is organized as follows: Section II is an overview of related work; Section III gives background information necessary for our approach; Section IV describes our method for automatic stop-word list building; Section V presents our experiments; Section VI compares the proposed approach with other methods. Finally, the last section provides our conclusions.

I. Related Works

Various works have been done to generate stop-word lists automatically for different languages. As one of the first attempts to construct stop-word lists automatically, the term-based random sampling method was introduced by Rachel et al. [8].
They wanted to know how informative a particular term is: the less important a term is, the more likely it is to be a stop-word. They calculated the importance of a given term using the Kullback-Leibler divergence measure [9].

Journal of Knowledge-Based Engineering and Innovation (JKBEI), Universal Scientific Organization, http://www.aeuso.org/jkbei, ISSN: 2413-6794 (Online) and ISSN: 2518-0479 (Print)

Feng Zou et al. [10] proposed an automatic aggregated method based on statistical and information models for the Chinese language. In search of words with a stable distribution, they introduced the Statistical Value (SAT) and merged it with the result obtained by calculating entropy. They describe stop-words as terms that have a high mean of probability as well as a stable distribution (low variance of probability). Later, Alajmi et al. used the same methodology to build a stop-word list for Arabic [11]. For Mongolian, Gong and Guan used entropy and kept verbs and nouns, as the most important words in this language, to build a stop-word list [12]. Sadeghi et al. conducted another study to automatically recognize stop-words in Persian [13]. They extracted words containing fewer than 5 letters and sorted them by collection term frequency, normalized inverse document frequency of the word in the collection, and word entropy. Aggregating the lists obtained from these three features, they constructed a stop-word list for a Persian textual IR system. However, this was not the only work on stop-word lists for Persian: Taghva [14], Hamshahri [15], and Davarpanah [16] are among the available stop-word lists prepared manually by experts.

II. Background

As mentioned before, Zou proposed an aggregated methodology using both entropy and SAT.
Since we used this methodology both for comparison and for building our stop-word list, we briefly explain here how the SAT value is computed for a given term.

A. Statistical Value

The mean of probability and the variance of probability of a word Wj can be calculated using equation 2 and equation 3. The mean probability of Wj is the average of the probabilities with which Wj appears across all documents. The probability of Wj in document di is obtained by dividing the term frequency of Wj in di by the total number of terms in the document.
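The two quantities described above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names and the toy corpus are hypothetical, documents are assumed to be already tokenized, and the variance is assumed to be the population variance (dividing by the number of documents), which should be checked against equations 2 and 3.

```python
from collections import Counter
from statistics import fmean, pvariance


def term_probabilities(term, documents):
    """Return P(term | d_i) for every tokenized document d_i.

    The probability is the term frequency in the document divided by
    the total number of terms in that document. An empty document
    contributes 0.0 (an assumption made here to avoid division by zero).
    """
    probabilities = []
    for tokens in documents:
        counts = Counter(tokens)
        probabilities.append(counts[term] / len(tokens) if tokens else 0.0)
    return probabilities


def mean_and_variance_of_probability(term, documents):
    """Return (mean of probability, variance of probability) for a term.

    Assumption: the variance is the population variance over all
    documents, i.e. the squared deviations are divided by N.
    """
    probabilities = term_probabilities(term, documents)
    return fmean(probabilities), pvariance(probabilities)


if __name__ == "__main__":
    # Hypothetical toy corpus of three tokenized documents.
    corpus = [
        ["a", "b", "a", "c"],
        ["a", "d"],
        ["b", "c", "d", "d"],
    ]
    for word in ("a", "d"):
        mean_p, var_p = mean_and_variance_of_probability(word, corpus)
        print(word, mean_p, var_p)
```

Under the description of Zou et al. given earlier, a term with a high mean and a low variance of probability is a stronger stop-word candidate than one that is frequent only in a few documents.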