Knowledge Discovery in Scientific Databases Using Text Mining and Social Network Analysis
2012 IEEE Conference on Control, Systems and Industrial Informatics (ICCSII), Bandung, Indonesia, September 23-26, 2012
978-1-4673-1023-9/12/$31.00 ©2012 IEEE

Ammar Jalalimanesh
Information Engineering Department
Iranian Research Institute for Information Science and Technology
Tehran, Iran
[email protected]

Abstract-This paper introduces a novel methodology for extracting core concepts from a text corpus. The methodology is based on text mining and social network analysis. In the text mining phase, keywords are extracted by tokenizing, removing stop-list words and generating N-grams. The network analysis phase includes co-word occurrence extraction, network representation of linked terms and calculation of centrality measures. We applied our methodology to a text corpus of 650 thesis titles in the domain of industrial engineering. Interpreting the enriched networks gave us valuable knowledge about the corpus content.

Keywords-Text mining; Social network analysis; Industrial engineering; Concept mapping; Knowledge discovery

I. INTRODUCTION

Extracting core concepts from huge amounts of scientific data is among the interesting research fields these days. Many projects have pursued this goal, and researchers have approached it from different viewpoints. Some took linguistic approaches and many others used artificial intelligence techniques such as machine learning. There were also some attempts to combine both sorts of techniques. Text mining is among the artificial intelligence methods for discovering knowledge from unstructured textual data. Linguistic techniques are also used in text mining.

The network is a new language for analyzing a wide variety of subjects. The network paradigm has been used in many different fields to conceptualize interactions among actors. Most of the concepts of network analysis, such as centrality measures, are highly portable across fields[1]. A network, also called a graph, is a mathematical representation of elements that interact with or relate to each other. The elements are called nodes or vertices, and the links connecting them are called edges. In some networks the edges are weighted, denoting that some edges represent stronger relations. Social Network Analysis (SNA) is a set of techniques mostly used to investigate virtual communities such as email networks, social media and forums.

The Iranian Research Institute for Information Science and Technology (IRANDOC) is an institute affiliated with the Ministry of Science, Research, and Technology (MSRT), which was established to work in the field of information science and technology and librarianship. Archiving Iranian theses is among IRANDOC's duties, and therefore analyzing thesis contents is an interesting research domain for it.

In this research we tried to discover core concepts from the theses repository by applying text mining methods and SNA techniques. Accordingly, 650 M.Sc. and Ph.D. theses were selected as the test corpus. The text mining routines were applied to the thesis titles. First, key phrases were extracted from the thesis titles. Next, the network of co-word occurrence was drawn. Finally, centrality measures were calculated for central words. To conclude, we tried to demonstrate these measures visually and to interpret them by exploring the enriched network.

The organization of this paper is as follows. In the second section, we review related works and experiences in the context of our study. In section three we describe our proposed methodology for knowledge discovery, including the text mining and SNA phases. In section four, we explain the industrial engineering theses case study and clarify the results. In section five, we present discussion and conclusions, and finally we present a number of interesting topics for future research.

II. RELATED WORKS

Due to widespread interest in text mining and SNA separately, there are many works related to our study, but we could find few studies that tried to combine these techniques.

Gregorowicz introduced an algorithm to mine a network of concepts and terms from Wikipedia[2]. The aim was to overcome the problem of variable terminology with the aid of concept-based information retrieval. Shen and Fox worked on automatic generation of concept maps through text mining techniques[3]. They used the GetSmart software package to draw concept maps, and established a publicly accessible repository of concept maps to enable sharing of the knowledge.

Pedersen used a decision tree to find bigrams that occur nearby[4]. He evaluated his approach using the sense-tagged corpora from the 1998 SENSEVAL word sense disambiguation exercise. He showed that bigrams are powerful features for performing word sense disambiguation. He also showed that a simple decision tree, in which each node checks whether or not a particular bigram occurs near the ambiguous word, achieves accuracy comparable with state-of-the-art methods.

Feldman et al. designed the term extraction module of the Document Explorer system[5]. They provided an experimental evaluation performed on a set of 52,000 documents published by Reuters. The results showed that working at the term level facilitates the creation (with the help of semi-automatic tools) of a hierarchical taxonomy, which is extremely important to a text mining system.

Many studies have used Social Network Analysis in contexts different from common social networks. Borgatti presented SNA in the supply chain context[1]. He investigated basic concepts in SNA and discussed the meaning of different types of network style in the supply chain context. A supply chain consists of companies, vendors and manufacturers that have relationships with each other to supply manufacturing demands. With this definition we can consider an interlibrary network as a supply chain whose elements interact to supply information demands.

Fritsch investigated the impact of network structure on knowledge transfer in the context of innovation networks using SNA[6]. According to his research results, strong ties are more valuable for the exchange of knowledge and information than weak ties. Cantner used social network analysis techniques to illustrate the evolution of the innovator network of Jena, Germany in the period from 1995 to 2001[7]. Garg used SNA for a completely different reason[8]. She applied network analysis techniques to recognize the roles of participants in meetings. Then she validated her results by comparing the guessed roles with the actual ones.

III. METHODOLOGY

Figure (1) shows our methodology step by step. We divide this process into two main steps: text mining and social network analysis.

In the text mining phase we try to extract keywords from text records. Keywords, which we characterize as a chain of one or more words, provide a compact representation of a document's content[9]. First, the text corpus is pre-processed to make it robust for the next steps. The pre-processing phase includes tokenizing and stop-list checking. The continuous text should be tokenized into words, phrases, symbols and other elements called tokens. Subsequently the extracted words are checked against the stop list to exclude unnecessary words. Stop-lists (or 'stop-words') are lists of non-information-bearing words[10]. Afterward, N-grams are identified by mining the tokens. An N-gram is a contiguous sequence of N items from a given sequence of text. Recognized N-grams are checked and meaningful ones are replaced with single tokens in the text. At the end of this phase we have a set of extracted N-grams including unigrams, bigrams and so on.

The second phase of our methodology is network analysis. In this step the text is represented as a network. We extract co-word occurrences to calculate the adjacency matrix of the network. Co-occurrence networks are the collective interconnection of terms based on their paired incidence within a piece of text. This means that when two words or phrases appear in one thesis title, they are connected. In this study we produced a weighted network where each vertex is a word and edges join words to each other, where the strength of an edge reflects how often the two words occur together across all text records. Next the network is visualized with specific SNA software packages. There are many software packages with different features that can be used for this aim, such as the Sci2 Tool, Network Workbench, NodeXL and so on. At the next step the network centrality measures are calculated using the mentioned software.

Figure 1: Knowledge discovery process (recoverable box labels: data collection, stop-list checking, extracting co-word occurrence network, calculating measures, knowledge discovery; group label: Social Network Analysis (SNA))

Finally the visualized network is enriched using the calculated measures. The new representations are interpreted to discover core concepts of the corpus content.

IV. INDUSTRIAL ENGINEERING THESES CASE STUDY

To test our methodology we built a text corpus from 650 M.Sc. and Ph.D. thesis titles in the field of industrial engineering. All the theses were selected from the IRANDOC repository. Due to the complexity of the mentioned network, only the titles of the theses are considered as our text corpus.

A. Text mining phase

For the text mining phase RapidMiner 5.2 (an open source package) was employed. The software package has a user-friendly interface and flexible features to cope with many data sources. RapidMiner read the text corpus, which was stored in MS Access, using an ODBC connection. Figure (2) shows a snapshot of
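
As an illustration of the two phases described in Section III, the following is a minimal sketch in plain Python. It is not the authors' implementation: the paper used RapidMiner and dedicated SNA packages. The stop list, the list of meaningful bigrams, the sample titles and all function names are hypothetical, and normalized degree centrality is assumed here as one possible centrality measure.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop list and phrase list; the paper does not publish its own.
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "and", "on", "using", "with", "to"}
KNOWN_BIGRAMS = {("supply", "chain"), ("neural", "network")}


def tokenize(title):
    """Lowercase a title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def merge_bigrams(tokens):
    """Replace each adjacent pair listed in KNOWN_BIGRAMS with a single token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in KNOWN_BIGRAMS:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def extract_terms(title):
    """Text mining phase: tokenize, check the stop list, then merge bigrams."""
    return merge_bigrams(remove_stop_words(tokenize(title)))


def cooccurrence_edges(titles):
    """Count, for every unordered term pair, the number of titles containing both."""
    edges = Counter()
    for title in titles:
        terms = sorted(set(extract_terms(title)))
        edges.update(combinations(terms, 2))
    return edges


def degree_centrality(edges):
    """Normalized degree: number of neighbours divided by (n - 1)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    return {t: (len(nb) / (n - 1) if n > 1 else 0.0) for t, nb in neighbours.items()}


# Made-up sample titles, used only to exercise the sketch.
TITLES = [
    "Scheduling in the supply chain using a genetic algorithm",
    "Supply chain risk assessment",
    "A genetic algorithm for scheduling",
]
EDGES = cooccurrence_edges(TITLES)
CENTRALITY = degree_centrality(EDGES)
```

In this sketch the edge weight is the number of titles in which two terms appear together, which matches the weighted network described above. The term with the highest centrality would be read as a candidate core concept of the corpus.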