<<

Vol. 2(5), Apr. 2016, pp. 326-334

PSWG: An Automatic Stop-word List Generator for Persian Systems Based on Similarity Function & POS Information

Mohammad-Ali Yaghoub-Zadeh-Fard1, Behrouz Minaei-Bidgoli1, Saeed Rahmani, and Saeed Shahrivari
1Iran University of Science and Technology, Tehran, Iran
2Freelancer, Tehran, Iran
*Corresponding Author's E-mail: [email protected]

Abstract
By the advent of new information resources, search engines have encountered a new challenge, since they are obliged to store a large amount of text material. This is even more drastic for small-sized companies, which suffer from a lack of hardware resources and limited budgets. In such circumstances, reducing index size is of paramount importance, as is maintaining the accuracy of retrieval. One of the primary ways to reduce the index size in such systems is to remove stop-words: frequently occurring terms which do not contribute to the information content of documents. Even though there are manually built stop-word lists for almost all languages in the world, stop-word lists are domain-specific; in other words, a term which is a stop-word in one domain may play an indispensable role in another. This paper proposes an aggregated method for automatically building stop-word lists for Persian information retrieval systems. Using part-of-speech tagging and analyzing statistical features of terms, the proposed method tries to enhance the accuracy of retrieval and minimize the potential side effects of removing informative terms. The experimental results show that the proposed approach enhances the average precision, decreases the index storage size, and improves the overall response time.

Keywords: Information Retrieval, Natural Language Processing, Part of Speech Tagging, Index Size Reduction, Stop-words

I. Introduction
Information retrieval (IR) researchers concentrate on increasing the accuracy and precision of IR systems by exploiting new information sources and expanding the incorporation of available ones, such as web pages from the Internet. However, the expansion of the Internet and the tremendous growth of online web pages have enlarged text indices. As a result, index size reduction has become more important than before, especially for small companies with limited hardware resources. Various techniques have been used to reduce index size; compression, compaction, and summarization are among the well-known methods. Nevertheless, an effective technique for index size reduction in text processing systems is the elimination of non-informative words known as stop-words. Articles, prepositions, conjunctions, and pronouns are typical candidates for elimination as stop-words [1]. These words, which occur frequently in documents, have very low discrimination value. By removing stop-words, text processing systems pursue two purposes: (1) elimination of noise and (2) dimension reduction. While the former tries to enhance retrieval accuracy in IR systems, the latter aims to reduce the index size as well as the response time. Stop-word elimination has been reported to cause over 40% reduction in index storage size [2]. Much research has been conducted to build stop-word lists manually for almost all languages currently spoken on the Earth. However, manually building stop lists has several

Article History: JKBEI DOI: 649123/11045; Received: 21 Sep. 2015; Accepted: 11 Feb. 2016; Available Online: 09 Apr. 2016

drawbacks. First, a term can be informative in one domain but not meaningful in another. For example, the word "for", which is a typical candidate for removal, is the most important term in a web page teaching about loops in the C programming language. Second, human languages are dynamic, and lots of new words are created every day. Consequently, manually built stop-word lists must be updated regularly, and this is even more pressing in multi-language systems. Third, words are weighted differently in different applications, which means various stop-word lists are needed depending on the application. For instance, even though it was previously reported that using stop-words is not beneficial in plagiarism detection [3], Stamatatos showed that a small set of stop-words is useful for identifying similarity at the document level as well as the exact passage boundaries in the plagiarized and the source documents [4]. Fourth, in some languages, notably Persian, word boundaries are not clear enough, and the way a tokenizer detects these boundaries has a big impact on the efficiency of using stop-word lists. For instance, the Persian word meaning "none" may be treated as two different words by a tokenizer; in such a system, one of its parts is indexed because it is a number in the Persian language, whereas a system that considers the compound word as a whole can recognize it as a stop-word. Fifth, having an ordered list of stop-words can come in handy; for example, an IR system can remove additional terms to increase its performance when needed. Sixth, removing stop-words can lead to erroneous semantic inferences [5].
For instance, consider the query "How does Brazil win the games." Removing the word "how" along with the other stop-words in this sentence leaves a completely different question: "Does Brazil win games." Seventh, it must be noted that blind removal of stop-words can damage phrase queries. A phrase query is one that matches documents containing a particular sequence of terms; for such a query, removing any kind of term can result in losing accuracy. Even though it is plausible, by applying keyphrase extraction techniques, to remove only those stop-words which do not belong to any phrase, keeping terms that belong to key phrases cannot ensure anything; after all, users are sometimes in quest of exact matches to their queries. Taking all the above-mentioned reasons into consideration, we conclude that it is of great importance to generate stop-word lists automatically and to minimize the potential repercussions of removing informative terms as much as possible. Indeed, it has been claimed that the future of stop-word lists is their automated construction [6, 7]. In this paper, we tackle some of these problems. Our algorithm benefits from part-of-speech information as well as the distribution of terms in a given collection, and aims to minimize the potential side effects of removing informative terms while discriminating between non-informative and informative terms. Experiments on a standard corpus show that our algorithm surpasses other well-known stop-word lists, improves retrieval accuracy, and accelerates the retrieval process. The remainder of this paper is organized as follows: Section II is an overview of related work; Section III gives the background necessary for our approach; Section IV describes our method for automatically building stop-word lists; Section V presents our experiments; and Section VI compares the proposed approach with other methods.
Finally, the last section provides our conclusions.

II. Related Works
Various works have been done to generate stop-word lists automatically for different languages. In one of the first attempts to construct stop-word lists automatically, a term-based random sampling method was introduced by Lo et al. [8]. They wanted to know how informative a particular term is: the less important a term is, the more likely it is a stop-word. They calculated the importance of a given term using the Kullback-Leibler divergence measure [9].

Journal of Knowledge-Based Engineering and Innovation (JKBEI), Universal Scientific Organization, http://www.aeuso.org/jkbei, ISSN: 2413-6794 (Online) and ISSN: 2518-0479 (Print)

Feng Zou et al. [10] proposed an automatic aggregated method based on statistical and information models for the Chinese language. In quest of words with stable distributions, they introduced the statistical value (SAT) and merged it with the result obtained by calculating entropy. They describe stop-words as terms which have a high mean probability as well as a stable distribution (low variance of probability). Later, the same methodology was used by Alajmi et al. to build a stop-word list for Arabic [11]. For Mongolian, Gong and Guan used entropy and kept verbs and nouns, the most important words in that language, to build a stop-word list [12]. Another study was conducted by Sadeghi et al. to automatically recognize stop-words in Persian [13]. They extracted words containing fewer than five letters and sorted them based on collection term frequency, normalized inverse document frequency in the collection, and a word entropy measure. Aggregating the lists obtained by these three features, they constructed their stop-word list for a Persian textual IR system. However, this was not the only work done on building stop-word lists for Persian: Taghva [14], Hamshahri [15], and Davarpanah [16] are among the available stop-word lists prepared manually by experts for Persian.

III. Background
As mentioned before, Zou et al. proposed an aggregated methodology using both entropy and SAT. Since we use this methodology both for comparison and for building our stop-word list, we briefly explain here how the SAT value is computed for a given term.

A. Statistical Value
The mean and variance of the probability of a word w_j can be calculated using equations 2 and 3. The mean probability of w_j is the average of the probabilities that w_j appears in each document. The probability of w_j in document d_i is obtained by dividing the term frequency of w_j by the total number of terms in the document. Having calculated the mean probability of w_j, we can calculate its variance. The stability of a probability distribution is determined by its variance; owing to this fact, the approach selects only those words which stably have high probabilities. Finally, combining these two quantities yields equation 1.

SAT(w_j) = MP(w_j) / VP(w_j)    (1)

MP(w_j) = (1/N) Σ_{1≤i≤N} P_ij,  where P_ij = tf_ij / |d_i|    (2)

VP(w_j) = (1/N) Σ_{1≤i≤N} (P_ij − MP(w_j))²    (3)

B. Entropy
Entropy is the amount of information in a source and is commonly used as a measure of disorder [17]. Given a set of documents and a word w_j, its entropy is defined as equation 4, where P_ij is the probability of seeing word w_j in document d_i.

H(w_j) = − Σ_{i=1}^{m} P_ij × log(P_ij)    (4)
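Under our own naming conventions (documents as tokenized word lists, natural logarithm), the SAT and entropy computations of equations 1-4 can be sketched as follows; this is an illustrative sketch, not the authors' code:

```python
import math

def term_probabilities(docs, term):
    # P_ij = tf_ij / |d_i|: frequency of the term in d_i over document length (eq. 2).
    return [d.count(term) / len(d) for d in docs]

def sat(docs, term):
    # SAT = mean probability divided by variance of probability (eqs. 1-3).
    p = term_probabilities(docs, term)
    n = len(p)
    mp = sum(p) / n
    vp = sum((x - mp) ** 2 for x in p) / n
    return mp / vp if vp > 0 else float("inf")

def entropy(docs, term):
    # H(w_j) = -sum_i P_ij * log(P_ij); documents without the term contribute zero (eq. 4).
    return -sum(x * math.log(x) for x in term_probabilities(docs, term) if x > 0)
```

A term that appears with uniformly high probability across documents receives a high SAT value, which is exactly what marks it as a stop-word candidate.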

IV. The Proposed Method (PSWG)
Prior to generating a stop-word list, we conducted an experiment to determine the most efficient set of part-of-speech (POS) tags. POS tagging is the task of labelling each word in a sentence with its appropriate part of speech, such as noun, verb, or adverb [18]. We wanted to know which types of words carry most of the information content of Persian text (for more information, see Section V.B). This way, it is possible to discriminate between different occurrences of a word on the


basis of their POS information. For example, the word "ﮐﻨﺪ" as a verb means "does" and might be considered a stop-word; however, as an adjective, it means "slow" and has significant value. The result of this experiment was a set of informative POS tags, called the efficient set in this paper. After creating the efficient set of POS tags, our algorithm (PSWG) follows the three main steps depicted in Fig. 1. First, some words are selected based on their SAT value; then, in the second step, non-informative terms, which are not tagged with any member of the efficient set, are extracted. Finally, to select the final stop-words, we score the nominated terms using BM25, the same function used to rank and retrieve documents. Using term scoring has a significant benefit: it ensures that the negative side effects of treating meaningful terms as stop-words are minimal, because only terms with low scores are removed. After all, removing terms with a score near zero has virtually no impact on ranking documents. Furthermore, we divide the mean score of each term by its standard deviation, for our aim is to select stable terms that usually have low scores. Equation 5 is used to score the nominated words. In these equations, D is the corpus, A is a subset of D defined by equation 8, t is a stop-word candidate, and SimFun is the similarity function used in the IR system to calculate term scores.

TermScore(t) = μ(t) / σ(t)    (5)

μ(t) = (1/|A|) Σ_{D∈A} SimFun(D, t)    (6)

σ(t) = (1/|A|) Σ_{D∈A} |SimFun(D, t) − μ(t)|    (7)

A = { D_i | t ∈ D_i }    (8)

SelectNominatedStopWords(C, N, I, SF)

Input:
C, the corpus
N, the number of desired stop-words
I, the set of informative POS tags
SF, the similarity function used to rank documents

Output: SWL, a set of stop-words

1. SWL ← top N terms with highest statistical value
2. for each term t ∈ C do
3.   if part_of_speech(t) ∉ I and t ∉ SWL then
4.     add t to SWL
5.   end if
6. end for
7. let F be an array of term scores using SF
8. for each term t ∈ SWL do
9.   F[t] ← mean score of all occurrences of t divided by their variance
10. end for
11. sort terms in SWL by their F scores
12. return SWL[1..N]

Fig. 1. Selecting nominated stop-words in PSWG algorithm
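The steps of Fig. 1 can be sketched roughly in Python as follows. The names `sat_values`, `pos_tag`, and `sim_fun` are illustrative stand-ins we supply (the paper uses a Persian POS tagger and BM25 in their places), so this is a sketch of the selection logic, not the authors' implementation:

```python
def select_stop_words(corpus, n, informative_tags, sim_fun, sat_values, pos_tag):
    # Step 1: take the top-N terms by statistical (SAT) value.
    swl = set(sorted(sat_values, key=sat_values.get, reverse=True)[:n])
    # Step 2: add every term whose POS tag is outside the efficient set.
    for doc in corpus:
        for term in doc:
            if pos_tag(term) not in informative_tags:
                swl.add(term)
    # Step 3: score each candidate with the IR system's similarity function:
    # mean score over the documents containing it (the set A, eq. 8),
    # divided by the spread of those scores (eqs. 5-7).
    scores = {}
    for term in swl:
        containing = [d for d in corpus if term in d]
        vals = [sim_fun(d, term) for d in containing]
        mu = sum(vals) / len(vals)
        sigma = sum(abs(v - mu) for v in vals) / len(vals)
        scores[term] = mu / sigma if sigma > 0 else float("inf")
    # Keep the N candidates with the lowest, most stable scores.
    return sorted(swl, key=lambda t: scores[t])[:n]
```

Candidates with low and stable scores are returned first, so truncating the sorted list at any threshold removes the least harmful terms.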


TABLE I. Persian Major Parts of Speech

POS | Description | Example
Nouns | Persian words that indicate people, beings, things, places, phenomena, qualities or ideas | ﮐﺴﻮف
Verbs | Persian words that indicate actions, occurrences or states | رﻓﺘﻦ، ﺑﺮف ﺑﺎرﯾﺪن
Adverbs | Persian words that modify clauses, sentences and phrases directly | ھﯿﭻ وﻗﺖ، ھﻨﻮز
Adjectives | Persian words that give attributes to nouns, extending their definitions | ﺧﻮش، ﺧﻮب
Conjunctions | Persian words that connect words, phrases or clauses together | اﻣﺎ، اﮔﺮﭼﮫ
Determiners | Persian words that reference nouns, expressing their contexts directly | آن، اﯾﻦ
Interjections | Persian words that express emotions as exclamations | ای، ﺑﮫ ﺑﮫ
Numerals | Persian words that indicate quantities of nouns | ھﺸﺖ، ﯾﮏ
Particles | Persian words that lack grammatical functions of their own, usually forming other parts of speech | را، و
Prepositions | Persian words that connect the nouns or pronouns that follow them to form adjectival or adverbial phrases | ﺑﺮ، ﺑﺎ
Pronouns | Persian words that refer to and substitute nouns | آن، ﮐﺪام ﯾﮏ

In the final stage, when words are being removed from the index, it is of paramount importance to take the efficient set into account and eliminate only those terms which do not belong to the efficient set of POS tags. In other words, the output of PSWG is not only a set of stop-words but also an important set of POS tags which discriminates between different parts of speech of stop-words. An even more precise approach would be to generate stop-pairs, i.e., terms together with their parts of speech, and to score these pairs with the same algorithm described in Fig. 1.

V. Experiments
A. Test Environment
An IR model governs how documents and queries are represented and how the relevance of a document to a user's query is defined. To rank documents, we employed Okapi BM25 [19], a probabilistic retrieval framework that estimates the probability that a document is relevant to a given query. Using the Hamshahri corpus [15] and Apache Lucene, we conducted numerous experiments to evaluate our approach alongside different methods. The Hamshahri corpus is a standard, reliable Persian text collection that was used at the Cross-Language Evaluation Forum (CLEF) during 2008 and 2009 for evaluating Persian IR systems.
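For reference, a minimal sketch of the Okapi BM25 term score follows. The k1 and b values are the commonly used defaults, not necessarily the exact parameters of the Lucene setup in the paper, and the tokenized-list representation is our own simplification:

```python
import math

def bm25_score(doc, term, docs, k1=1.2, b=0.75):
    # BM25 contribution of a single term to one document's score.
    n = len(docs)
    df = sum(1 for d in docs if term in d)           # document frequency
    idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed idf
    tf = doc.count(term)
    avgdl = sum(len(d) for d in docs) / n            # average document length
    denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
    return idf * tf * (k1 + 1) / denom
```

Rare terms receive a high idf and thus a high score, while ubiquitous terms score near zero, which is why BM25 works well as the SimFun of equation 5: genuine stop-words naturally fall to the bottom of the ranking.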

B. Determination of the Efficient Set of Persian POS Tags
To determine the efficient set of POS tags for Persian IR, we considered different configurations. In each configuration, a particular POS tag, some of which are described in Table I, is used to construct the index. The configurations and the effectiveness of each in Persian IR are shown in Table II, where the "Indexing Tag" column gives the POS tag of the words that construct the index. For example, the first configuration uses only nouns to represent documents. It should be noted that, to evaluate the exact role of each POS in IR, no text pre-processing methods such as stop-word removal or lemmatization were employed.


TABLE II. System performance for different POS configurations for one of the folds in 5-fold cross validation

# | Indexing Tag | Mean Average Precision | Storage Size Reduction (%)
1 | Noun | 0.2248 | 55
2 | Adjective | 0.0874 | 83
3 | Unknown | 0.0350 | 88
4 | Number | 0.0030 | 98
5 | Verb | 0.0007 | 93
6 | Preposition | 0.0006 | 95
7 | Pronoun | 0.0006 | 97
8 | Other tags | 0.0002 | 91

To determine and evaluate the proposed approach, we partitioned the Hamshahri corpus into five equal parts to perform 5-fold cross-validation; we then repeated the algorithm for each partition to determine an efficient set of POS tags for Persian IR systems. Table II reveals the system performance for one of the folds. In this table, each configuration shows how well the words labelled by a particular tag discriminate between relevant and irrelevant documents; it shows that nouns, adjectives, and unknown terms (proper names, misspelled words, or words from other languages) have significant discrimination power. A noteworthy result is the small effect of adverbs and especially verbs in Persian IR. This must be due to the different nature of verbs in Persian: verbs usually break up into one or more non-verb parts, usually noun(s), followed by the verb part, and the information carried by the non-verb parts of compound verbs is more effective than that of the verb part. However, only the verb parts of compound verbs are usually considered verbs in POS-tagged corpora, at least in our experiments; therefore, depending on the tokenizer a system uses, this result may differ slightly. Finally, to determine an efficient set of POS tags for Persian IR, nouns are selected first; then words with the next most effective POS tag, according to its mean average precision in Table II, are added to the experiment. This process continues until the retrieval precision stops increasing. The results for different POS tag sets are shown in Table III. The best performance belongs to the set {Noun, Adjective, Unknown, Number}, which even beats using all of the terms. Our experiments indicate that this is the case for all of the folds; this set is the efficient set of POS tags when no pre-processing technique is applied, and we use it as the important set of POS tags in the rest of this paper.
As mentioned before, in our experiments we ignored all pre-processing techniques such as lemmatization. However, it is wiser to take a pre-processing method into account whenever it is used in the IR system. For example, lemmatizing all terms, we repeated the process described above, and the result was slightly different: with lemmatization applied, verbs join the efficient set, yielding {Noun, Adjective, Unknown, Verb, Number}.
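The forward-selection loop described above can be sketched as follows, where `evaluate_map` is our stand-in for running the retrieval experiment on a fold and measuring mean average precision:

```python
def efficient_pos_set(tags_by_map, evaluate_map):
    # tags_by_map: POS tags pre-sorted by their individual MAP (the order of Table II).
    # Greedily add tags while retrieval precision keeps increasing.
    selected = []
    best = 0.0
    for tag in tags_by_map:
        trial = selected + [tag]
        score = evaluate_map(trial)
        if score <= best:
            break  # precision stopped increasing; discard this tag and stop
        selected, best = trial, score
    return selected
```

Replaying the Table III numbers through this loop reproduces the reported efficient set {Noun, Adjective, Unknown, Number}: adding Pronoun drops MAP from 0.256 to 0.255, so the loop stops there.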

C. Determination of Stop-Words Using PSWG
Using the top 1000 terms with high SAT values along with the terms not labelled by any member of the efficient set of POS tags, we ordered the nominated words by their term scores according to BM25, and those with low scores were selected as stop-words. Furthermore, in the process of eliminating these stop-words from the index, we kept all nouns, adjectives, unknown words (proper names, misspelled words, or words from other languages), and numbers in a document, even if they are in the stop-word list.
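A sketch of this index-time elimination rule, under our assumed representation of a document as (term, POS tag) pairs: an occurrence is dropped only when the term is on the stop-word list and its tag in that occurrence falls outside the efficient set.

```python
def filter_tokens(tagged_tokens, stop_words, efficient_tags):
    # tagged_tokens: list of (term, pos_tag) pairs for one document.
    # Keep every occurrence tagged with a member of the efficient set
    # (noun, adjective, unknown, number), even if the term is a stop-word.
    return [term for term, tag in tagged_tokens
            if tag in efficient_tags or term not in stop_words]
```

This per-occurrence check is what lets a word like "ﮐﻨﺪ" survive as an adjective ("slow") while being eliminated as a verb ("does").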


VI. Evaluation and Comparison
To evaluate and compare our method with manual stop-word lists, we employed the Hamshahri, Taghva, and Davarpanah lists mentioned in Section II. Table IV shows the performance of PSWG on the second version of the Hamshahri collection at various thresholds. Since determining a threshold for an efficient number of stop-words is challenging, and this threshold plays an indispensable role in the performance of an IR system, the order of nominated stop-words is of paramount importance. In other words, an algorithm must ensure that the word with the lowest discrimination value is the first candidate to be removed. But how can we decide whether an approach is ideal? Common sense suggests that the elimination of stop-words increases the performance of an IR system dramatically at the beginning and then loses its impact little by little; after eliminating a couple of hundred nominated stop-words, further removal starts to hurt retrieval accuracy. Importantly, an ideal method must not show fluctuations: removing the ordered stop-words one by one is supposed to follow a smooth pattern. Fig. 2 demonstrates the performance of different automated stop-word detection methods at different thresholds. According to this figure, not only does PSWG surpass the other methods at most thresholds, but it is also the most stable method. Among these methods, using entropy yields the worst performance.

Fig. 2. Precision at different thresholds

The performance of using manually built stop-word lists is also shown in Table V. The results indicate that using automated methods enhances the accuracy of retrieval. For example, removing only 100 words selected by the proposed approach improves the mean average precision by 7% while it is about 6% for Taghva’s list, which has the best performance among all manual stop-word lists.

TABLE III. System performance for different POS sets for one of the folds

POS Set | Mean Average Precision | Speed Up (%) | Storage Size Reduction (%)
{Noun} | 0.225 | 180 | 55.4
{Noun, Adjective} | 0.237 | 152 | 44.2
{Noun, Adjective, Unknown} | 0.250 | 116 | 33.1
{Noun, Adjective, Unknown, Number} | 0.256 | 107 | 31.0
{Noun, Adjective, Unknown, Number, Pronoun} | 0.255 | 82.0 | 21.5
All terms | 0.215 | 0.00 | 0.00


TABLE IV. Performance of the IR system using PSWG to select stop-words

# | Number of stop-words | Mean Average Precision | R-precision | Recall | Index Storage Reduction (%) | Retrieval Speedup (%)
1 | 0 | 0.330 | 0.245 | 0.475 | 0 | 0
2 | 100 | 0.352 | 0.262 | 0.504 | 6.6 | 51
3 | 200 | 0.357 | 0.268 | 0.517 | 9.4 | 52
4 | 300 | 0.363 | 0.280 | 0.531 | 11.2 | 55
5 | 400 | 0.363 | 0.285 | 0.538 | 12.8 | 56
6 | 500 | 0.366 | 0.290 | 0.540 | 14.2 | 58
7 | 600 | 0.364 | 0.290 | 0.542 | 15.4 | 59
8 | 700 | 0.370 | 0.293 | 0.542 | 16.4 | 62
9 | 800 | 0.365 | 0.291 | 0.533 | 17.3 | 64
10 | 900 | 0.366 | 0.296 | 0.525 | 18.2 | 65
11 | 1000 | 0.361 | 0.293 | 0.521 | 19.0 | 65

TABLE V. Performance of the IR system using manually built stop-word lists

Methods | Mean Average Precision | R-Precision | Recall | Storage Reduction (%) | Speed-up (%)
Hamshahri Stop-words | 0.340 | 0.254 | 0.493 | 8.4 | 65
Davarpanah Stop-words | 0.338 | 0.258 | 0.495 | 6.3 | 43
Taghva Stop-words | 0.339 | 0.259 | 0.498 | 9.2 | 64

Conclusion
The experiments demonstrate that automatically building a stop-word list based on score functions and POS tagging improves the performance of IR systems. Even though tagging words is time-consuming, its cost can be ignored since it is incurred only once, when the index is built. The proposed method (PSWG) has clear advantages over existing methods: not only does it soften the impact of removing stop-words, but it also surpasses manual stop-word lists.

References
[1] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications), Springer-Verlag New York, Inc., 2006.
[2] R. A. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley Longman Publishing Co., Inc., 1999.
[3] Z. Ceska and C. Fox, "The Influence of Text Pre-processing on Plagiarism Detection," pp. 55-59.
[4] E. Stamatatos, "Plagiarism detection using stopword n-grams," J. Am. Soc. Inf. Sci. Technol., vol. 62, no. 12, pp. 2512-2527, 2011.
[5] S. Popova, T. Krivosheeva, and M. Korenevsky, "Automatic Stop List Generation for Clustering Recognition Results of Call Center Recordings," in Speech and Computer, Lecture Notes in Computer Science, A. Ronzhin, R. Potapova, and V. Delic, eds., pp. 137-144, Springer International Publishing, 2014.
[6] A. Blanchard, "Understanding and customizing stopword lists for enhanced patent mapping," World Patent Information, vol. 29, no. 4, pp. 308-316, 2007.
[7] Y. Zhou and Z.-w. Cao, "Research on the Construction and Filter Method of Stop-word List in Text Preprocessing," pp. 217-221.
[8] R. T. Lo, B. He, and I. Ounis, "Automatically building a stopword list for an information retrieval system," in Special Issue on the 5th Dutch-Belgian Information Retrieval Workshop (DIR'05), 2005, pp. 3-8.


[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 1991.
[10] F. Zou et al., "Automatic construction of Chinese stop word list," in Proceedings of the 5th WSEAS International Conference on Applied Computer Science, Hangzhou, China, 2006, pp. 1009-1014.
[11] A. Alajmi, E. M. Saad, and R. R. Darwish, "Toward an ARABIC Stop-Words List Generation," International Journal of Computer Applications, vol. 46, no. 8, pp. 8-13, 2012.
[12] Z. Gong and G. Guan, "The selection of Mongolian stop words," pp. 71-74.
[13] M. Sadeghi et al., "Automatic identification of light stop words for Persian information retrieval systems," J. Inf. Sci., vol. 40, no. 4, pp. 476-487, 2014.
[14] K. Taghva, R. Beckley, and M. Sadeh, "A list of farsi stopwords."
[15] A. AleAhmad et al., "Hamshahri: A standard Persian text collection," Know.-Based Syst., vol. 22, no. 5, pp. 382-387, 2009.
[16] M. R. Davarpanah, M. Sanji, and M. Aramideh, "Farsi and stop word list," Library Hi Tech, vol. 27, no. 3, pp. 435-449, 2009.
[17] C. E. Shannon and W. Weaver, A Mathematical Theory of Communication, University of Illinois Press, 1963.
[18] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[19] S. E. Robertson et al., "Okapi at TREC-3," pp. 109-126.
