2012 IEEE Conference on Control, Systems and Industrial Informatics (ICCSII)
Bandung, Indonesia, September 23-26, 2012

Knowledge Discovery in Scientific Databases Using Text Mining and Social Network Analysis

Ammar Jalalimanesh
Information Engineering Department
Iranian Research Institute for Information Science and Technology
Tehran, Iran
[email protected]

Abstract-This paper introduces a novel methodology to extract core concepts from a text corpus. The methodology is based on text mining and social network analysis. In the text mining phase, keywords are extracted by tokenizing, removing stop words and generating N-grams. The network analysis phase includes co-word occurrence extraction, network representation of linked terms and calculation of centrality measures. We applied our methodology to a text corpus of 650 thesis titles in the domain of industrial engineering. Interpreting the enriched networks gave us valuable knowledge about the corpus content.

Keywords-Text mining; Social network analysis; Industrial engineering; Concept mapping; Knowledge discovery

I. INTRODUCTION

Extracting core concepts from huge amounts of scientific data is an active research field. Many projects have pursued this goal, and researchers have approached it from different viewpoints. Some took linguistic approaches, many others used artificial intelligence techniques such as machine learning, and some attempted to combine both sorts of techniques. Text mining is an artificial intelligence method for discovering knowledge from unstructured textual data. Linguistic techniques are also used in text mining.

The network is a new language for analyzing a wide variety of subjects. The network paradigm has been used in many different fields to conceptualize interactions among actors. Most of the concepts of network analysis, such as centrality measures, are highly portable across fields[1]. A network, also called a graph, is a mathematical representation of elements that interact with or relate to each other. The elements are called nodes or vertices, and the links connecting them are called edges. In some networks the edges are weighted, denoting that some edges represent stronger relations. Social Network Analysis (SNA) is a set of techniques mostly used to investigate social networks such as email networks and forums.

The Iranian Research Institute for Information Science and Technology (IRANDOC) is an institute affiliated with the Ministry of Science, Research, and Technology (MSRT), established to work in the field of information science and technology and librarianship. Archiving Iranian theses is among IRANDOC's duties, and therefore analyzing the contents of theses is an interesting research domain for it.

In this research we tried to discover core concepts from a theses repository by applying text mining methods and SNA techniques. Accordingly, 650 M.Sc. and Ph.D. theses were selected as the test corpus. The text mining routines were applied to the thesis titles. First, key phrases were extracted from the titles. Next, the co-word occurrence network was drawn. Finally, centrality measures were calculated for central words. To conclude, we tried to demonstrate these measures visually and to interpret them by exploring the enriched network.

The organization of this paper is as follows. In the second section, we review related works and experiences in the context of our study. In section three we describe our proposed methodology for knowledge discovery, including the text mining and SNA phases. In section four, we explain the industrial engineering theses case study and present its results. In section five, we give the discussion and conclusion and finally present a number of interesting topics for future research.

II. RELATED WORKS

Due to widespread interest in text mining and SNA separately, there are many works related to our study, but we could find few studies that tried to combine these techniques.

Gregorowicz introduced an algorithm to mine a network of concepts and terms from Wikipedia[2]. The aim was to overcome the problem of variable terminology with the aid of concept-based information retrieval. Shen and Fox worked on automatic generation of concept maps through text mining techniques[3]. They used the GetSmart software package to draw concept maps, and established a publicly accessible repository of concept maps to enable sharing of the knowledge.

Pedersen used a decision tree to find bigrams that occur near an ambiguous word[4]. He evaluated his approach using the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation exercise. He showed that bigrams are powerful features for performing word sense disambiguation. He also showed that a simple decision tree, in which each node checks whether or not a particular bigram occurs near the ambiguous word, achieves accuracy comparable with state-of-the-art methods.

978-1-4673-1023-9/12/$31.00 ©2012 IEEE

Feldman et al. designed the term extraction module of the Document Explorer system[5]. They provide an experimental evaluation performed on a set of 52,000 documents published by Reuters. The results showed that working at the term level facilitates the creation (with the help of semi-automatic tools) of a hierarchical taxonomy, which is extremely important to a text mining system.

Many studies have used Social Network Analysis in contexts different from common social networks. Borgatti presented SNA in a supply chain context[1]. He investigated basic concepts in SNA and discussed the meaning of different types of network in the supply chain context. A supply chain consists of companies, vendors and manufacturers that have relationships with each other to supply manufacturing demands. With this definition we can consider an interlibrary network as a supply chain whose elements interact to supply information demands.

Fritsch investigated the impact of network structure on knowledge transfer in the context of innovation networks using SNA[6]. According to his results, strong ties are more valuable for the exchange of knowledge and information than weak ties. Cantner used social network analysis techniques to illustrate the evolution of the innovator network of Jena, Germany in the period from 1995 to 2001[7]. Garg used SNA for a completely different reason[8]. She applied network analysis techniques to recognize the roles of participants in meetings. Then she validated her results by comparing the guessed roles with the actual ones.

III. METHODOLOGY

Figure (1) shows our methodology step by step. We divide this process into two main steps: text mining and social network analysis.

In the text mining phase we try to extract keywords from text records. Keywords, which we characterize as a chain of one or more words, provide a compact representation of a document's content[9]. First the text corpus is pre-processed to make it robust for the next steps. The pre-processing phase includes tokenizing and stop-list checking. The continuous text is tokenized into words, phrases, symbols and other elements called tokens. Subsequently the extracted words are checked against the stop list to exclude unnecessary words. Stop-lists (or 'stop-words') are lists of non information-bearing words[10]. Afterward, N-grams are identified by mining the tokens. An N-gram is a contiguous sequence of N items from a given sequence of text. Recognized N-grams are checked, and meaningful ones are replaced with single tokens in the text. At the end of this phase we have a set of extracted N-grams including unigrams, bigrams and so on.

The second phase of our methodology is network analysis. In this step the text is represented as a network. We extract co-word occurrences to calculate the adjacency matrix of the network. Co-occurrence networks are the collective interconnection of terms based on their paired presence within a piece of text. That is, when two words or phrases appear in one thesis title, they are connected. In this study we produced a weighted network where each vertex is a word and edges join words to each other; the strength of an edge reflects how often the two words occur together across all text records. Next the network is visualized with specific SNA software packages. Many packages with different features can be used for this purpose, such as the Sci2 Tool, Network Workbench and NodeXL. In the next step the network centrality measures are calculated using the same software.

Figure 1: Knowledge discovery process (flowchart; recoverable step labels: Data collection, Stop list checking, Extracting co-word occurrence network, Calculating measures, Knowledge discovery; the last steps are grouped as Social Network Analysis (SNA))

Finally the visualized network is enriched using the calculated measures. The new representations are interpreted to discover the core concepts of the corpus content.

IV. INDUSTRIAL ENGINEERING THESES CASE STUDY

To test our methodology we built a text corpus from the titles of 650 M.Sc. and Ph.D. theses in the field of industrial engineering. All the theses were selected from the IRANDOC repository. Due to the complexity of the resulting network, only the titles of the theses are considered as our text corpus.

A. Text mining phase

For the text mining phase RapidMiner 5.2 (an open source package) was employed. The software package has a user-friendly interface and flexible features to cope with many data sources. RapidMiner read the text corpus, which was stored in MS Access, using an ODBC connection. Figure (2) shows a snapshot of the RapidMiner process-based GUI. After tokenizing theses titles and filtering stop-words, N-grams of size 2 (bigrams) and size

3 (trigrams) were extracted from the titles. We also updated the stop-word list manually after each run in order to exclude unimportant terms. This was necessary because we could not find good stop-word lists for the Persian language. After several runs and improvements to the stop-word list, the extracted N-grams were replaced with their equivalent terms. Finally we had a set of keywords instead of a title for each thesis.

Figure 2: Rapidminer snapshot of text mining process

B. Network analysis phase

According to our methodology we generated a co-word occurrence matrix with the aid of Network Workbench (NWB) version 1.0.0. This open source package is mostly used for scientometrics projects. We converted the final records resulting from the text mining phase to Scopus CSV format for import into NWB. The final network had about 1500 nodes (terms) and more than 300000 edges (co-occurrences). The network information was exported in GraphML (XML) format for visualization.

To draw the networks, the NodeXL¹ package version 1.0.1.196 was used. NodeXL is an Excel add-in that displays and analyzes network graphs. It is mostly used for social network analysis. It also has features to calculate network measures and to cluster the network.

¹ http://nodexl.codeplex.com/

Due to the huge number of vertices and edges and the complexity of the network, we had to filter it down to the most important terms. Figure (3) illustrates the network filtered by term frequency and clustered into four groups using the Clauset-Newman-Moore algorithm. The terms are originally in Persian and we translated them to English. Some phrases, like simulation and optimization, are bigrams in Persian. The size of each disk represents the number of occurrences of the term, and the edge thickness depicts the number of co-occurrences for each pair of terms. According to this figure, terms such as system, manufacturing and model are the most frequently used words. The structure of the network gives us valuable knowledge about the corpus content. For instance, there is a triangle between the terms maintenance, repair and system that describes the core concept of maintenance systems. This network also works as a knowledge map of the research done in the field of industrial engineering according to the theses corpus.

Figure 3: Co-word occurrence network based on terms frequency

In the next step we calculated the centrality measures for the network of terms. Centrality measures are some of the most fundamental and frequently used measures of network structure[11]. The simplest centrality measure is the degree of a vertex, which is the number of edges attached to it. Figure (4) shows the terms network where the size of the disks is set by their degree. This figure shows that terms like system, model and manufacturing are connected with many different concepts in the industrial engineering domain, such as evaluation, decision making and optimization.

Figure 4: Co-word occurrence network based on terms Degree

Another useful centrality measure is betweenness centrality. The betweenness centrality of vertex i is the fraction of geodesic paths between other vertices that i falls on[11]. Figure (5) exhibits the terms network where the size of the disks is adjusted according to their betweenness centrality.
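The network analysis phase can be illustrated with a minimal sketch, assuming each thesis title has already been reduced to a list of keywords. It is written in plain Python rather than with the NWB and NodeXL tools used in the study, and the function names and the toy keyword lists are illustrative assumptions, not data from the corpus. Betweenness is computed with Brandes' algorithm on the unweighted graph and reported as a raw count of shortest paths.

```python
from collections import Counter, deque
from itertools import combinations


def cooccurrence_network(keyword_sets):
    """Map each term pair to the number of titles containing both terms."""
    weights = Counter()
    for keywords in keyword_sets:
        for a, b in combinations(sorted(set(keywords)), 2):
            weights[frozenset((a, b))] += 1
    return weights


def adjacency(weights):
    """Turn the weighted edge map into an unweighted neighbour map."""
    neighbours = {}
    for edge in weights:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours


def degree(neighbours):
    """Degree centrality: number of edges attached to each vertex."""
    return {v: len(adj) for v, adj in neighbours.items()}


def betweenness(neighbours):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = dict.fromkeys(neighbours, 0.0)
    for source in neighbours:
        # Breadth-first search that counts shortest paths from source.
        order = []
        parents = {v: [] for v in neighbours}
        paths = dict.fromkeys(neighbours, 0)
        dist = dict.fromkeys(neighbours, -1)
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbours[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    parents[w].append(v)
        # Accumulate dependencies from the farthest vertices back to source.
        delta = dict.fromkeys(neighbours, 0.0)
        for w in reversed(order):
            for v in parents[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical keyword lists, one per thesis title.
titles = [
    ["maintenance", "repair", "system"],
    ["maintenance", "system"],
    ["system", "model"],
    ["model", "optimization"],
    ["system", "simulation"],
]
weights = cooccurrence_network(titles)
neighbours = adjacency(weights)
degrees = degree(neighbours)
between = betweenness(neighbours)
```

Dividing each raw betweenness value by (n-1)(n-2)/2, where n is the number of vertices, gives the fraction of vertex pairs whose shortest paths pass through the term. Edge weights are kept in `weights` but ignored by both centrality functions, which matches the simple (unweighted) measures described here.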

According to Figure (5), some terms have higher betweenness than others. For example, although the term system is less frequent than the term model, it has higher betweenness. This shows that the concept behind the term system acts as a binder connecting other concepts together.

Figure 5: Co-word occurrence network based on terms Betweenness

V. CONCLUSION AND FUTURE WORKS

This study shows that network analysis is an effective way to extract core concepts from a text corpus. Network representation of terms gives us a comprehensive overview of text repositories. Our experiment also gives us an idea about building an ontology by completing this methodology.

The lack of a good stemming engine for Persian reduced the quality of the tokens. The extracted N-grams could also be checked against thesauri to turn them into standard phrases.

For future research we can use weighted measures for analyzing the network instead of simple ones, because we have a weighted network. Opsahl introduces a set of new measures for calculating centrality in weighted networks[12].

To discover knowledge about the evolution of contents over time, we can apply temporal measures to the terms network. Santoro et al. introduce new indicators and metrics for time-varying graphs in the social network analysis context[13].

ACKNOWLEDGMENT

The author gratefully acknowledges the support of the Iranian Research Institute for Information Science and Technology, especially Dr. Sirous Alidousti and Dr. Hamid R. Jamali.

REFERENCES

[1] S. P. Borgatti and X. Li, "On Social Network Analysis in a Supply Chain Context," Journal of Supply Chain Management, vol. 45, pp. 5-22, 2009.
[2] A. Gregorowicz and M. A. Kramer, "Mining a large-scale term-concept network from Wikipedia," MITRE Corporation, 2006.
[3] E. Rao Shen and R. Richardson, "Concept maps as visual interfaces to digital libraries: summarization, collaboration, and automatic generation, workshop paper," Joint Conference on Digital Library. Rice University, Houston, Texas: IEEE Computer Society, 2003.
[4] T. Pedersen, "Lexical semantic ambiguity resolution with bigram-based decision trees," Computational Linguistics and Intelligent Text Processing, pp. 157-168, 2001.
[5] R. Feldman, et al., "Text mining at the term level," Principles of Data Mining and Knowledge Discovery, pp. 65-73, 1998.
[6] M. Fritsch and M. Kauffeld-Monz, "The impact of network structure on knowledge transfer: an application of social network analysis in the context of regional innovation networks," The Annals of Regional Science, vol. 44, pp. 21-38, 2010.
[7] U. Cantner and H. Graf, "The network of innovators in Jena: An application of social network analysis," Research Policy, vol. 35, pp. 463-480, 2006.
[8] N. P. Garg, et al., "Role recognition for meeting participants: an approach based on lexical information and social network analysis," 2008, pp. 693-696.
[9] M. W. Berry and J. Kogan, Text Mining: Applications and Theory: Wiley, 2010.
[10] A. Blanchard, "Understanding and customizing stop word lists for enhanced patent mapping," World Patent Information, vol. 29, pp. 308-316, 2007.
[11] M. E. J. Newman, "The mathematics of networks," The New Palgrave Encyclopedia of Economics, vol. 2, 2008.
[12] T. Opsahl, et al., "Node centrality in weighted networks: Generalizing degree and shortest paths," Social Networks, vol. 32, pp. 245-251, 2010.
[13] N. Santoro, et al., "Time-Varying Graphs and Social Network Analysis: Temporal Indicators and Metrics," arXiv preprint arXiv:1102.0629, 2011.
