<<

Available online at www.sciencedirect.com ScienceDirect

Procedia Computer Science 89 ( 2016 ) 313 – 319

Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016) On Continent and -Wise Divisions-Based Statistical Measures for Stop-Words Lists of International Jatinderkumar R. Sainia,c,∗ and Rajnish M. Rakholiab,c

aNarmada College of Computer Application, Bharuch 392 011, Gujarat, India bS. . Agrawal Institute of Computer Science, Navsari 396 445, Gujarat, India cR K University, Rajkot, Gujarat, India

Abstract The data for the current research work was collected for 42 different International languages encompassing 3 continents viz. Asia, Europe and South America. The data comprised of unigram model representation of lexicons in the stop-words lists. 13 scripting systems comprising Arabic, Armenian, Bengali, Chinese, Cyrillic, , Greek, , & , , , Marathi, Roman (Latin) and Thai were considered. Based on a comprehensive analysis of statistical measures for Stop-words lists, it has been concluded that Asian languages are mostly self-scripted and that the average number of stop-words in Asian languages is more than those in European languages. In addition to various important and other first research results, a very important inference from the current research work is that the average number of stop-words for any given could be predicted to be 200. © 2016 The Authors. Published by Elsevier B.V. © 2016 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (Peer-reviewhttp://creativecommons.org/licenses/by-nc-nd/4.0/ under responsibility of organizing). committee of the Twelfth International Multi-Conference on Information PeerProcessing-2016-review under (IMCIP-2016).responsibility of organizing committee of the Organizing Committee of IMCIP-2016

Keywords: Function-Words; Language; Natural Language Processing (NLP); Script; Stop-Words.

1. Introduction

Owing to the increased availability of computing power and awareness as well as the need of processing naturally spoken languages by people, the domain of Natural Language Processing (NLP) has come up with a wide scope of research. Often, the research areas of NLP overlap with the other areas of computing including Text Mining (TM) and Artificial Intelligence (). These areas of NLP, TM and AI, often even with other areas like Computational Linguistics (CL) and Data Mining (DM), collectively encompass various tasks and sub-tasks including Concept Mining, Information Extraction, Information Retrieval, Stemming, Lemmatization, Search Engine Indexing, Text Summarization, Part-of-speech (POS) Tagging, Wordnet development, Text Analysis and Text Classification, to name afew. Almost all of the tasks and sub-tasks of NLP and its intra and inter-related areas require that during pre-processing phase or as required, stop-words be removed from the text corpus before further processing. In computing, according to Wikipedia21, they are the words which usually refer to the most common words in a language. Neither is there a universal list of stop-words, common for all languages, nor is there any standardized list of stop-words for each

∗ Corresponding author. Tel.: +91-9687689708. -mail address: saini−[email protected]

1877-0509 © 2016 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the Organizing Committee of IMCIP-2016 doi: 10.1016/j.procs.2016.06.076 314 Jatinderkumar R. Saini and Rajnish M. Rakholia / Procedia Computer Science 89 ( 2016 ) 313 – 319

language. Neither do all NLP tools use such lists nor do all tools prefer not to use them. There have been even instances when even more than one type of list of stop-words has been used by researchers15,17. It is notable that Function-words are not synonymous with Stop-words though Function-words may include articles (‘the’, ‘a’, etc.), pronouns (‘’, ‘him’, ‘she’, ‘her’, etc.), particles (‘however’, ‘then’, ‘if’, ‘thus’, etc.), etc. Hence, Function-words act as a subset of Stop-words. In the context of current research work it is also important to note that the paper uses the words ‘language’ and ‘script’ to mean ‘the human spoken natural language’ and its ‘ consisting of ’, respectively. The set for a script may in turn consist of all or some or combinations of , , etc. The author believes that the language does not become rich if it has a long list of stop-words. Rather, such a language will turn out to be a non-good choice from the perspective of NLP algorithms. It is so because a sentence, a query, etc. written in such a language will have so many words removed from it under the title of stop-words, that probably after the removal of such words the actual intended meaning of the sentence is not conveyed properly. Such a situation becomes a big challenge for NLP, AI and TM tasks including but not limited to query-processing, semantic web-development, text-analysis, text-retrieval, text-summarization, document classification and machine translation. It is however important to be kept in mind that the language processing is an area in which development of text-processing algorithms or being able to comment on the text is relative and subject to the language and type of text including the domain of consideration.

2. Related Works

There are good numbers of research instances wherein the authors have worked on different international languages and language families5,6 as well as Indian languages7 in context of NLP. But there are fewer instances of research works where the authors have provided statistical analysis of more than two stop-words lists. Sadegahi and Vegas13 have shown coincidence between Persian and English stop-words. They have defined light stop-word as a unigram stop-word containing few letters. Hence they have provided a comparison of at-least two languages. Alajmi et al.1 present a statistical approach for extraction of Arabic stop-words list. They have used an approach consisting of calculation of mean probability, entropy and variance probability as well. They have used statistical analysis approach but for extraction of Arabic stop-word list and not comparative analysis of Arabic stop-word list with more than two other language stop-words lists. Similar to the research work for Arabic language, Zou et al.22 proposed an automatic aggregated methodology based on statistical and information models for extraction of stop-words for . Their results showed that their list is quite general and approach is promising enough for implementation purpose. NLP works in languages other than Arabic and Chinese have also been done. For instance, Rakholia and Saini10,11 have worked on Gujarati language while Saini and Desai20 have worked on Hindi language. There are also instances of research works involving multi-lingual documents12. There are also research instances wherein the authors have used stop-words lists for application purpose. Such an application to a specific domain like Twitter data is presented by Choy3. He has used Term-Frequency (TF), Inverse Document Frequency (IDF) and Term Adjacency (TA) for developing a stop words list for the Twitter data source. He has proposed a new technique using combinatorial values as an alternative measure to effectively list out stop words for Twitter data. Another work on Twitter data is presented by Saif et al.14. They have investigated the effect of removal of stop-words on the effectiveness of Twitter sentiment classification methods. They have found that stop-words do have a great effect on the studied classification methods. Kaur and Saini9 have presented a Part-of-speech (POS) word class based categorization of Gurmukhi script stop-words. Arun et al.2 have explored the use of Latent Dirichlet Allocation (LDA) on stop-words and showed that it usage is employable for suitably handling the authorship attributions. They have also proved that this approach using stop-words is also effective in correct identification of author genders using stylometric studies for textual contents. The related literature contains many instances of NLP algorithm applications wherein the application of stop-words has not been done or has been done partially too. Identification and analysis of most frequently occurring significant proper nouns in 419 Nigerian scams16, textual analysis of digits used for designing Yahoo-group identifiers18 and structural analysis of username in email addresses of MCA institutes of Gujarat state are examples of this19. Jatinderkumar R. Saini and Rajnish M. Rakholia / Procedia Computer Science 89 ( 2016 ) 313 – 319 315

Fig. 1. Snap-shot of First 25 Stop-words of First 19 Languages.

The current work is based on available data but this data could not be considered scanty as owing to general intelligence concept, the language used most widely only will have the maximum probability of availability of data. This means to say that the language for which much data is not available is the one whose usage is not much prominent in the world.

3. Methodology

In order to present a comparative analysis of stop-words lists of different international languages based on statistical interpretations, the first step was to collect data of stop-words. The data for this purpose was sourced from Doyle4 as well as research work of Kaur and Saini8. All the stop-words considered for current research work are based on the unigram model of representation of lexicons. A total of 42 different international languages were used for the research purpose. Out of these 42 languages, 21 languages were from Europe, 20 languages were from Asia and 1 language was from South America. The 21 languages from Europe were Basque, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Slovak, Spanish and Swedish. Similarly, the 20 languages from Asia were Arabic, Armenian, Bengali, Chinese, Hindi, Indonesian, Irish, Japanese, Korean, Kurdish, Latvian, Lithuanian, Marathi, Persian, Punjabi, Russian, Thai, Turkish, Ukranian and . The only South American language under consideration was Brazilian. It is noteworthy to mention that these languages were not chosen randomly. The availability of a list of stop-words for a language was the most important criterion for selection of a language for the current research work. A sample snap-shot of different languages chosen for current research work is presented in Fig. 1. Due to lack of , only a sample set consisting of ‘first’ 25 stop-words of ‘first’ 19 languages only is presented here. The significance of word ‘first’ does not need to be emphasized as it is used here to mean that both the list of stop-words as well as the list of languages is sorted and only respective first ‘n’ values are presented here. It is also noteworthy to mention that this paper has used the term Indonesian language to refer to the Indonesian language as well as its other commonly used term called Malay language. Similarly, this paper deploys use of Kurdish language for Sorani language, for Farsi language and Irish language for Gaeilge language, respectively, as well. In addition to dispersion of these languages among 3 continents, the languages could also be classified based on the script used to code that language in written format. Accordingly, there were 25 languages for which English script is required whereas there were 17 languages which used non-English scripting system. It is worth mentioning here that 316 Jatinderkumar R. Saini and Rajnish M. Rakholia / Procedia Computer Science 89 ( 2016 ) 313 – 319

from linguistics perspective there is nothing like English or non-English script. English language uses Roman script and it is often called the . This paper, however, to more clearly highlight the usage of English language, uses the term Roman Script, English Script and Latin Script interchangeably. The only South American language under study used English scripting system. Also, most of European languages use English for scripting hence most of the non-English scripted languages turned out to be Asian languages. The is the basis of various languages including Arabic, Kurdish, Persian and Urdu. It is noteworthy that Urdu is scripted using Persian alphabets but they in turn are derived from Arabic itself. Arabic hence is a root to scripts of many languages. Cyrillic is the script for Bulgarian, Russian and Ukranian languages. Armenian, Bengali, Chinese, Greek and Thai similarly provide scripting alphabets to languages named Armenian, Bengali, Chinese, Greek and Thai respectively. Remarkably, these languages and their scripting systems have same names. It is also noteworthy to found that Bulgarian though being a European language has scripting system common with Russian and Ukranian languages, which are Asian languages. Hence, language does not recognize political boundary of land-pieces of earth. Also, such common roots to basic scripts also occur owing to migrations, common civilizations and neighboring countries, though in different continents. Moving further, Hindi and Marathi have common script called Devanagari. uses Kanji and Kana scripting system. Kana script further is composed of and styles. It is noteworthy that writing experts consider the mentioned two kana styles as two different scripts, and not merely as two different styles. Korean, which is believed to be an isolate language, uses Hanja and Hangul script. Korean, in fact, is rarely found to use the Hanja script today. At most, it is seen only in a few scattered around Korean texts to highlight a particular Chinese-derived . Punjabi language uses two scripting styles, viz. Shahmukhi and Gurmukhi. The Shahmukhi style is influenced by Arabic script. This paper used Gurmukhi style of scripting Punjabi language for stop-words list.

4. Results and findings

Table 1 presents a listing of 42 different international languages along with their numbers of Stop-words. Figure 2 presents the details of Table 1 with more clarity in graphical format. Based on the analysis of these 42 languages, a total of 8387 stop-words were found. The maximum number of stop-words was found to be 747, present in Finnish language. Finnish is a European and English scripted language. The minimum number of stop-words was found to be 28 in Ukranian language. Ukranian is an Asian and Cyrillic scripted language. The average number of stop words among 42 languages was found to be 199.69 (∼200). These are the overall maximum, minimum and average values under study. These values were also found based on continent-wise and script-wise divisions, generating different sets of combinations. These statistical values corresponding to these combinations are presented in Table 2. Brazilian, which is a South American and English scripted language, being the only language outside Asia and Europe was kept out for statistical analysis. Brazilian language was treated as an outlier for the current research work to prevent over-analysis and mis-interpretations based on single language represented by its domain through only one language from one continent. Based on the statistical data presented in Table 2, it has been found that for Asia, the maximum and minimum numbers of stop-words for English scripted languages have been found to be 507 and 109, respectively for Lithuanian and Irish languages. The maximum and minimum numbers of stop-words for non-English scripted languages for Asia has been found to be 677 and 28 respectively for Korean and Ukranian languages. Similarly, for Europe, the maximum and minimum numbers of stop-words for English scripted languages have been found to be 747 and 35, respectively for Finnish and Hungarian languages. The maximum and minimum numbers of stop-words for non-English scripted languages for Europe has been found to be 259 and 79 respectively for Bulgarian and Greek languages. Further, from Table 2, it can also be seen that the average number of stop-words in English scripted languages is more than the average number of stop-words in non-English scripted languages. This statement is true irrespective of continental divisions, .e., 250 > 216 and 180 > 169: both hold true. Also, it is also observed that the average number of stop-words in Asian languages is more than the average number of stop-words in European languages. This statement also holds true for English as well as non-English scripted languages of both continents, i.e. (250 and 216) > (180 and 169). It is notable that this statement is true even though the overall highest number of stop-words were found in Finnish, a European language. The comparison presented in current paper should not be purely seen as Jatinderkumar R. Saini and Rajnish M. Rakholia / Procedia Computer Science 89 ( 2016 ) 313 – 319 317

Table 1. International Languages and their Numbers of Stop-words.

Sr. No. Language Stop-words Sr. No. Language Stop-words 1 Arabic 102 22 Italian 134 2 Armenian 43 23 Japanese 44 3 Basque 98 24 Korean 677 4 Bengali 363 25 Kurdish 63 5 Brazilian 128 26 Latvian 163 6 Bulgarian 259 27 Lithuanian 507 7 Catalan 126 28 Marathi 99 8 Chinese 125 29 Norwegian 119 9 Czech 138 30 Persian 332 10 Danish 101 31 Polish 138 11 Dutch 48 32 Portuguese 147 12 English 174 33 Punjabi 184 13 Finnish 747 34 Romanian 281 14 French 116 35 Russian 421 15 Galician 160 36 Slovak 173 16 German 129 37 Spanish 178 17 Greek 79 38 Swedish 386 18 Hindi 97 39 Thai 115 19 Hungarian 35 40 Turkish 114 20 Indonesian 357 41 Ukranian 28 21 Irish 109 42 Urdu 550

Fig. 2. Graphical Representation of Languages and their Stop words. highlighting on the English scripted and non-English scripted languages. The comparison is also noteworthy in terms of the dimension called continent or the geographical area wherein the language is used. It is notable that the two English-scripted languages with the largest number of stop words are Finnish and Lithuanian respectively for Europe and Asia. It is also worth mentioning here that these two languages have respectively rich agglutinative and inflectional . Rarely, in the related literature is the relationship between the type of script a language uses is related to the number of stop words a language has. The current paper was an attempt to present the same, that is the interplay of the three viz., stop-words, language and script. Obviously, if “can” and “could” are both in one’s stop words list for English, 318 Jatinderkumar R. Saini and Rajnish M. Rakholia / Procedia Computer Science 89 ( 2016 ) 313 – 319

Table 2. Continent and Script-wise Division-based Statistical Measures for Stop-words.

Languages Statistical Measure Asian European for Stop-words English Scripted Non-English Scripted English Scripted Non-English Scripted Total 1250 3243 3428 338 Maximum 507 677 747 259 Minimum 109 28 35 79 Average 250.00 (=250) 216.20 (∼216) 180.42 (∼180) 169.00 (=169)

then one has two words that are from one modal verb. If there is a richly inflected language under consideration, one might have 20 forms or more for the same modal verb. This leads us to the discussion of inflectional complexity of a language in perspective of its scripted stop words, which will be in the future scope of research work.

5. Conclusions

Based on the analysis of statistical measures for stop-words of lists of 42 different international languages, it has been concluded that there is a vast difference between the numbers of stop-words in a list of such words for a given international language. It is concluded that the languages may belong to different continents but still may have the common scripting system. The most important conclusion of this research work is that the average number of stop-words in any given language will be 200. The English scripted languages with maximum and minimum numbers of stop-words for Asia were Lithuanian and Irish languages respectively while their counterparts for Europe were Finnish and Hungarian languages respectively. Similarly, the non-English scripted languages with maximum and minimum numbers of stop-words for Asia were Korean and Ukranian languages respectively while their counterparts for Europe were Bulgarian and Greek languages respectively. It is also concluded that the average number of stop-words in English scripted languages is more than the average number of stop-words in non-English scripted languages, irrespective of continental divisions. It is further concluded that the average number of stop-words in Asian languages is more than the average number of stop-words in European languages. Also, another significant conclusion of the current research work is that most of the European languages are English script based while most of the Asian languages are self-scripted. The current research work, directly or in-directly, does not claim that a particular language is better or not better than the other. It just presents an un-biased and critical comparison based on the statistical measures derived from the analysis of stop-words lists of different international languages. This paper does not intend to advocate the preeminence or inadequacy of a language, script, country, continent, civilization or any community. The current paper presents first results just for academic research purpose and are best reported on the available data used for research purpose. The results reported here could be used for predicting about stop-words list of a language for which such a list even does not exist yet. The results could also be used for development of various language dependent tools as well as various NLP algorithms.

References

[1] A. Alajmi, E. M. Saad and R. R. Darwish, Toward an ARABIC Stop-Words List Generation, International Journal of Computer Applications, ISSN: 0975–8887, vol. 46(8), pp. 8–13, (2012). [2] R. Arun, R. Saradha, V. Suresh, M. N. Murty, Stopwords, Stylometry: A Latent Dirichlet Allocation Approach, Proceedings of Advances in Neural Information Processing System Workshop on Applications of Topic Models, beyond (NIPS), pp. 1–4, (2009). Available: http://clweb.csa.iisc.ernet.in/arun− r/papers/nips09-stopwords-and-stylometry.pdf [3] M. Choy, Effective Listings of Function Stop Words for Twitter, International Journal of Advanced Computer Science, Applications, ISSN: 2158-107X, vol. 3(6), pp. 8–11, (2012). [4] D. Doyle, Stopwords in Other Languages, Ranks NL, Available: http://www.ranks.nl/stopwords. [5] J. Kaur and J. R. Saini, A Study, Analysis of Opinion Mining Research in Indo-Aryan, Dravidian, Tibeto-Burman Language Families, International Journal of Data Mining, Emerging Technologies, ISSN: 2249-3212 (eISSN: 2249-3220), vol. 4(2), pp. 53–60, (2014). [6] J. Kaur and J. R. Saini, An Analysis of Opinion Mining Research Works Based on Language, Writing Style, Feature Selection Parameters, International Journal of Advanced Networking, Applications, ISSN: 0975-0290 (eISSN: 0975-0282), special issue, pp. 7–13, (2014). Jatinderkumar R. Saini and Rajnish M. Rakholia / Procedia Computer Science 89 ( 2016 ) 313 – 319 319

[7] J. Kaur and J. R. Saini, A Study of Text Classification Natural Language Processing Algorithms for Indian Languages, The VNSGU Journal of Science, Technology; ISSN: 0975-5446, vol. 4(1), pp. 162–167, (2015). [8] J. Kaur and J. R. Saini, Identification of Stop Words for Written Punjabi Natural Language Processing, accepted, to be published in International Journal of Data Mining, Emerging Technologies, ISSN: 2249-3212 (eISSN: 2249-3220), vol. 5(2), (2015). [9] J. Kaur and J. R. Saini, POS Word Class based Categorization of Gurmukhi Language Stemmed Stop Words, Accepted, to be Published in the Proceedings of International Conference on ICT for Intelligent Systems (ICTIS-2015), ISSN: 2190-3018, 2190-3026, Springer-Verlag Publications, Germany, Smart Innovation, Systems, Technologies (SIST) Series, (2015). [10] R. M. Rakholia and J. R. Saini, A Study, Comparative Analysis of Different Stemmer, Character Recognition Algorithms for Indian , International Journal of Computer Application, ISSN: 0975-8887, vol. 106(2), pp. 45–50, (2014). [11] R. M. Rakholia and J. R. Saini, The Design, Implementation of Extraction Technique for Gujarati Written Script using Transformation Format, proceedings of IEEE International Conference on Electrical, Computer, Communication Technologies (ICECCT-2015), ISBN Numbers: Xplore Complaint: 978-1-4799-6085-9, CD-ROM : 978-1-4799-6083-5, Print : 978-1-4799-6084-2; vol. 2, pp. 654–659, IEEE publications. [12] R. M. Rakholia and J. R. Saini, Automatic Language Identification, Content Separation from Indian Multilingual Documents using Unicode Transformation Format, Accepted, to be Published in the Proceedings of 1st International Conference on Data Engineering, Communication Technology (ICDECT-2016), ISSN: 1615-3871, 2194-5357, 1860-0794; Springer-Verlag Publications, Germany, Advances in Intelligent, Soft Computing (AISC) Series. [13] M. Sadegahi and J. Vegas, Automatic Identification Of Light Stop Words For Persian Information Retrieval Systems, Journal of Information Science, ISSN: 0165-5515, vol. 40(4), pp. 476–487, (2014). [14] H. Saif, F. Miriam, H. Yulan and A. Harith, On Stopwords, Filtering, Data Sparsity for Sentiment Analysis of Twitter, Proceedings of Ninth International Conference on Language Resources, Evaluation, Iceland, pp. 810–817, (2014). [15] J. R. Saini, Self Learning Taxonomical Classification System Using Vector Space Document Analysis Model For Web Text Mining In UBE, PhD Thesis, Under Supervision of Desai AA, Department of Computer Science, VNSGU, Surat, (2009). [16] J. R. Saini, Identification, Analysis of Most Frequently Occurring Significant Proper Nouns in 419 Nigerian Scams, National Journal of Computer Science, Technology, ISSN: 0975-2463, vol. 4(1), pp. 01–04, (2012). [17] J. R. Saini, Polarity Determination using Opinion Mining in Stocks, Shares-Advertising Unsolicited Bulk E-Mails, International Journal of Engineering Innovations & Research, ISSN: 2277–5668, vol. 1(2), pp. 86–92, (2012). [18] J. R. Saini and A. A. Desai, A Textual Analysis of Digits Used for Designing Yahoo-group Identifiers, The IUP Journal of Information Technology, ISSN: 0973-2896, vol. 6(2), pp. 34–42, (2010). [19] J. R. Saini and A. A. Desai, Structural Analysis of Username Segment in Email Addresses of MCA Institutes of Gujarat State, The IUP Journal of Information Technology, ISSN: 0973-2896, vol. 6(3), pp. 43–50, (2010). [20] J. R. Saini and A. A. Desai, Identification of Hindi Words Used in Pornographic Unsolicited Bulk Emails, The IUP Journal of Systems Management, ISSN: 0972-6896, vol. 9(2), pp. 53–60, (2011). [21] Wikipedia, the Free Encyclopedia, Stop Words, Wikimedia Foundation Inc. (2015). Available: http://en.wikipedia.org/wiki/Stop− words [22] F. Zou, F. . , X. Deng, S. and L. S. Wang, Automatic Construction of Chinese Stop Word List, Proceedings of the 5th WSEAS International Conference on Applied Computer Science, , pp. 1010–1015, (2006). Available: https://kirillpavlov.files.wordpress.com/2013/11/chinese-stopwords.pdf