A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques

Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Krys Kochut
Computer Science Department, University of Georgia, Athens, GA
[email protected]

Elizabeth D. Trippe
Institute of Bioinformatics, University of Georgia, Athens, GA
[email protected]

Juan B. Gutierrez
Department of Mathematics, Institute of Bioinformatics, University of Georgia, Athens, GA
[email protected]

arXiv:1707.02919v2 [cs.CL] 28 Jul 2017

ABSTRACT
The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and algorithms are required to discover useful patterns. Text mining is the task of extracting meaningful information from text, which has gained significant attention in recent years. In this paper, we describe several of the most fundamental text mining tasks and techniques, including text pre-processing, classification and clustering. Additionally, we briefly explain text mining in the biomedical and health care domains.

CCS CONCEPTS
• Information systems → Document topic models; Information extraction; Clustering and classification;

KEYWORDS
Text mining, classification, clustering, topic modeling, information extraction

ACM Reference format:
Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. In Proceedings of KDD Bigdas, Halifax, Canada, August 2017, 13 pages.

1 INTRODUCTION
The Text Mining (TM) field has gained a great deal of attention in recent years due to the tremendous amount of text data, which is created in a variety of forms such as social networks, patient records, health care insurance data, and news outlets. IDC, in a report sponsored by EMC, predicts that the data volume will grow to 40 zettabytes by 2020, a 50-fold growth from the beginning of 2010 [52].

Text data is a good example of unstructured information, which is one of the simplest forms of data that can be generated in most scenarios. Unstructured text is easily processed and perceived by humans, but is significantly harder for machines to understand. Needless to say, this volume of text is an invaluable source of information and knowledge. As a result, there is a desperate need to design methods and algorithms to effectively process this avalanche of text in a wide variety of applications.

Text mining approaches are related to traditional data mining and knowledge discovery methods, with some specificities, as described below.

1.1 Knowledge Discovery vs. Data Mining
There are various definitions for knowledge discovery, or knowledge discovery in databases (KDD), and data mining in the literature.
We define it as follows: Knowledge Discovery in Databases is extracting implicit valid, Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed new and potentially useful information from data, which is non- for profit or commercial advantage and that copies bear this notice and the full citation trivial [45, 48]. Data Mining is a the application of particular algo- on the first page. Copyrights for third-party components of this work must be honored. rithms for extracting patterns from data. KDD aims at discovering For all other uses, contact the owner/author(s). hidden patterns and connections in the data. Based on the above KDD Bigdas, August 2017, Halifax, Canada © 2017 Copyright held by the owner/author(s). 11 ZB = 1021 bytes = 1 billion terabytes. KDD Bigdas, August 2017, Halifax, Canada Allahyari, M. et al definitions KDD refers to the overall process of discovering useful 1.2 Text Mining Approaches knowledge from data while data mining refers to a specific step Text Mining or knowledge discovery from text (KDT) − first intro- in this process. Data can be structured like databases, but also un- duced by Fledman et al. [46] − refers to the process of extracting structured like data in a simple text file. high quality of information from text (i.e. structured such as RDBMS data [28, 43], semi-structured such as XML and JSON [39, 111, 112], Knowledge discovery in databases is a process that involves and unstructured text resources such as word documents, videos, several steps to be applied to the data set of interest in order to and images). It widely covers a large set of related topics and algo- excerpt useful patterns. These steps are iterative and interactive rithms for analyzing text, spanning various communities, including they may need decisions being made by the user. CRoss Industry information retrieval, natural language processing, data mining, 2 Standard Process for Data Mining (Crisp DM ) model defines these machine learning many application domains web and biomedical primary steps as follows: 1) understanding of the application and sciences. data and identifying the goal of the KDD process, 2) data preparation and preprocessing, 3) modeling, 4) evaluation, 5) deployment. Data Information Retrieval (IR): Information Retrieval is the activ- cleaning and preprocessing is one of the most tedious steps, because ity of finding information resources (usually documents) froma it needs special methods to convert textual data to an appropriate collection of unstructured data sets that satisfies the information format for data mining algorithms to use. need [44, 93]. Therefore information retrieval mostly focused on Data mining and knowledge discovery terms are often used facilitating information access rather than analyzing information interchangeably. Some would consider data mining as synonym and finding hidden patterns, which is the main purpose oftext for knowledge discovery, i.e. data mining consists of all aspects mining. Information retrieval has less priority on processing or of KDD process. The second definition considers data mining as transformation of text whereas text mining can be considered as part of the KDD process (see [45]) and explicate the modeling step, going beyond information access to further aid users to analyze i.e. selecting methods and algorithms to be used for searching for and understand information and ease the decision making. 
patterns in the data. We consider data mining as a modeling phase of KDD process. Natural Language Processing (NLP): Natural Language Pro- Research in knowledge discovery and data mining has seen rapid cessing is sub-field of computer science, artificial intelligence and advances in recent years, because of the vast progresses in hardware linguistics which aims at understanding of natural language using and software technology. Data mining continues to evolve from the computers [90, 94]. Many of the text mining algorithms extensively intersection of diverse fields such as machine learning, databases, make use of NLP techniques, such as part of speech tagging (POG), statistics and artificial intelligence, to name a few, which shows the syntactic and other types of linguistic analysis (see [80, 116] underlying interdisciplinary nature of this field. We briefly describe for more information). the relations to the three of aforementioned research areas. from text (IE): Information Extrac- Databases are essential to efficiently analyze large amounts of tion is the task of automatically extracting information or facts data. Data mining algorithms on the other hand can significantly from unstructured or semi-structured documents [35, 122]. It usu- boost the ability to analyze the data. Therefore for the data integrity ally serves as a starting point for other text mining algorithms. For and management considerations, data analysis requires to be inte- example extraction entities, Name Entity Recognition (NER), and grated with databases [105]. An overview for the data mining from their relations from text can give us useful semantic information. the database perspective can be found in [28]. Text Summarization: Many text mining applications need to Machine Learning (ML) is a branch of artificial intelligence that summarize the text documents in order to get a concise overview of tries to define set of approaches to find patterns in data tobeable a large document or a collection of documents on a topic [67, 115]. to predict the patterns of future data. Machine learning involves There are two categories of summarization techniques in general: study of methods and algorithms that can extract information au- extractive summarization where a summary comprises information tomatically. There are a great deal of machine learning algorithms units extracted from the original text, and in contrary abstractive used in data mining. For more information please refer to [101, 126]. summarization where a summary may contain “synthesized” infor- mation that may not occur in the original document (see [6, 38] for Statistics is a mathematical science that deals with collection, an overview). analysis, interpretation or explanation, and presentation of data3. Today lots of data mining algorithms are based on statistics and Unsupervised Learning Methods: Unsupervised learning meth- probability methods. There has been a tremendous quantities of ods are techniques trying to find hidden structure out of unlabeled research for data mining and statistical learning [1, 62, 78, 137]. data. They do not need any training phase, therefore can be applied to any text data without manual effort. Clustering and - ing are the two commonly used unsupervised learning algorithms used in the context of text data. 
Clustering is the task of segmenting a collection of documents into partitions where documents in the 2 http://www.crisp-dm.org/ same group (cluster) are more similar to each other than those in 3http://en.wikipedia.org/wiki/Statistics A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques KDD Bigdas, August 2017, Halifax, Canada other clusters. In topic modeling a probabilistic model is used to de- way to represent the documents is as abag of words (BOW), which termine a soft clustering, in which every document has a probability considers the number of occurrences of each term (word/phrase) distribution over all the clusters as opposed to hard clustering of but ignores the order. This representation leads to a vector represen- documents. In topic models each topic can be represented as a prob- tation that can be analyzed with dimension reduction algorithms ability distributions over words and each documents is expressed from machine learning and statistics. Three of the main dimension as probability distribution over topics. Thus, a topic is akin to a reduction techniques used in text mining are Latent Semantic In- cluster and the membership of a document to a topic is probabilistic dexing (LSI) [42], Probabilistic Latent Semantic Indexing (PLSA) [66] [1, 133]. and topic models [16].
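To make the bag-of-words representation and the dimension reduction step above concrete, the following minimal sketch builds a term-document count matrix and projects it onto a small latent space with truncated SVD, the core operation behind LSI. This is an illustration added for the reader rather than code from the original survey; it assumes scikit-learn is available, and the toy corpus and the choice of two latent components are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus; in practice this would be a large document collection.
docs = [
    "text mining extracts useful patterns from text",
    "clustering groups similar documents together",
    "classification assigns predefined labels to documents",
    "topic models describe documents as mixtures of topics",
]

# Bag of words: each document becomes a sparse vector of term counts,
# ignoring word order.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # shape: (n_docs, vocabulary_size)

# LSI-style dimension reduction: truncated SVD maps the sparse,
# high-dimensional count matrix onto a few latent dimensions.
lsi = TruncatedSVD(n_components=2, random_state=0)
X_reduced = lsi.fit_transform(X)        # shape: (n_docs, 2)

print(X.shape, X_reduced.shape)
```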

Supervised Learning Methods: methods In many text mining applications, particularly information re- are machine learning techniques pertaining to infer a function or trieval (IR), documents needs to be ranked for more effective re- learn a classifier from the training data in order to perform predic- trieval over large collections [131]. In order to be able to define the tions on unseen data. There is a broad range of supervised methods importance of a word in a document, documents are represented as such as nearest neighbor classifiers, decision trees, rule-based clas- vectors and a numerical importance is assigned to each word. The sifiers and probabilistic classifiers [101, 126]. three most used model based on this idea are vector space model (VSM) (see section 2.2), probabilistic models [93] and inference net- Probabilistic Methods for Text Mining: There are various work model [93, 138]. probabilistic techniques including unsupervised topic models such as probabilistic (pLSA) [66] and Latent Dirichlet Allocation (LDA) [16], and supervised learning methods such as conditional random fields [85] that can be used regularly 2.1 Text Preprocessing in the context of text mining. Preprocessing is one of the key components in many text mining algorithms. For example a traditional text categorization framework Text Streams and Social Media Mining: There are many dif- comprises preprocessing, feature extraction, feature selection and ferent applications on the web which generate tremendous amount classification steps. Although it is confirmed that feature extraction of streams of text data. news stream applications and aggregators [57], feature selection [47] and classification algorithm [135] have such as Reuters and Google news generate huge amount of text significant impact on the classification process, the preprocessing streams which provides an invaluable source of information to stage may have noticeable influence on this success. Uysal etal. mine. Social networks, particularly Facebook and Twitter create [140] have investigated the impact of preprocessing tasks particu- large volumes of text data continuously. They provide a platform larly in the area of text classification. The preprocessing step usually that allows users to freely express themselves in a wide range of consists of the tasks such as tokenization, filtering, lemmatization topics. The dynamic nature of social networks makes the process and . In the following we briefly describe them. of text mining difficult which needs special ability to handle poor and non-standard language [58, 146]. Tokenization: Tokenization is the task of breaking a character sequence up into pieces (words/phrases) called tokens, and perhaps Opinion Mining and : With the advent at the same time throw away certain characters such as punctuation of e-commerce and online shopping, a huge amount of text is cre- marks. The list of tokens then is used to further processing [143]. ated and continues to grow about different product reviews or users opinions. By mining such data we find important information and Filtering: Filtering is usually done on documents to remove opinion about a topic which is significantly fundamental in adver- some of the words. A common filtering is stop-words removal. Stop tising and online marketing (see [109] for an overview). words are the words frequently appear in the text without having much content information (e.g. prepositions, conjunctions, etc). 
: Biomedical text mining refers to the Similarly words occurring quite often in the text said to have lit- task of text mining on text of biomedical sciences domains. The tle information to distinguish different documents and also words role of text mining in biomedical domain is two fold, it enables occurring very rarely are also possibly of no significant relevance the biomedical researchers to efficiently and effectively access and and can be removed from the documents [119, 130]. extract the knowledge out of the massive volumes of data and also facilitates and boosts up biomedical discovery by augmenting the Lemmatization: Lemmatization is the task that considers the mining of other biomedical data such as genome sequences and morphological analysis of the words, i.e. grouping together the var- protein structures [60]. ious inflected forms of a word so they can be analyzed as asingle item. In other words lemmatization methods try to map verb forms to infinite tense and nouns to a single form. In order to lemmatize 2 TEXT REPRESENTATION AND ENCODING the documents we first must specify the POS of each word of the Text mining on a large collection of documents is usually a complex documents and because POS is tedious and error prone, in practice process, thus it is critical to have a data structure for the text which stemming methods are preferred. facilitates further analysis of the documents [67]. The most common KDD Bigdas, August 2017, Halifax, Canada Allahyari, M. et al

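As a concrete illustration of the preprocessing steps just described (tokenization, stop-word filtering and stemming), here is a self-contained sketch in plain Python. The tiny stop-word list and the crude suffix-stripping stemmer are simplifications invented for the example; production systems rely on full stemmers such as Porter's [110] or on lemmatizers.

```python
import re

# A tiny stop-word list for illustration; real lists contain hundreds of words.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "are"}

def tokenize(text):
    """Break a character sequence into lowercase word tokens,
    discarding punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Filter out frequent function words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Very rough stemming: strip a few common English suffixes.
    A real stemmer (e.g. Porter's) applies a much larger rule set."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "Stemming methods are mapping the inflected forms of words to a root."
tokens = remove_stop_words(tokenize(text))
print([crude_stem(t) for t in tokens])
```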
Stemming: Stemming methods aim at obtaining the stem (root) of derived words. Stemming algorithms are indeed language dependent. The first stemming algorithm was introduced in [92], but the stemmer published in [110] is the most widely used stemming method for English [68].

2.2 Vector Space Model
In order to allow for more formal descriptions of the algorithms, we first define some terms and variables that will be frequently used in the following: Given a collection of documents D = {d_1, d_2, ..., d_D}, let V = {w_1, w_2, ..., w_v} be the set of distinct words/terms in the collection. Then V is called the vocabulary. The frequency of the term w ∈ V in document d ∈ D is denoted by f_d(w), and the number of documents containing the word w is denoted by f_D(w). The term vector for document d is denoted by \vec{t}_d = (f_d(w_1), f_d(w_2), ..., f_d(w_v)).

The most common way to represent documents is to convert them into numeric vectors. This representation is called the "Vector Space Model" (VSM). Even though its structure is simple and it was originally introduced for indexing and information retrieval [121], VSM is broadly used in various text mining algorithms and IR systems and enables efficient analysis of large collections of documents [67].

In VSM each word is represented by a variable having a numeric value indicating the weight (importance) of the word in the document. There are two main term weight models: 1) Boolean model: In this model a weight ω_{ij} > 0 is assigned to each term w_i ∈ d_j. For any term that does not appear in d_j, ω_{ij} = 0. 2) Term frequency-inverse document frequency (TF-IDF): The most popular term weighting scheme is TF-IDF. Let q be this term weighting scheme; then the weight of each word w ∈ d is computed as follows:

q(w) = f_d(w) \cdot \log \frac{|D|}{f_D(w)}    (1)

where |D| is the number of documents in the collection D.

In TF-IDF the term frequency is normalized by the inverse document frequency, IDF. This normalization decreases the weight of the terms occurring more frequently in the document collection, making sure that the matching of documents is more affected by distinctive words which have relatively low frequencies in the collection.

Based on the term weighting scheme, each document is represented by a vector of term weights ω(d) = (ω(d, w_1), ω(d, w_2), ..., ω(d, w_v)). We can then compute the similarity between two documents d_1 and d_2. One of the most widely used similarity measures is cosine similarity, computed as follows:

S(d_1, d_2) = \cos(\theta) = \frac{d_1 \cdot d_2}{\sqrt{\sum_{i=1}^{v} w_{1i}^2} \cdot \sqrt{\sum_{i=1}^{v} w_{2i}^2}}    (2)

[120, 121] discuss term weighting schemes and vector space models in more detail.

3 CLASSIFICATION
Text classification has been broadly studied in different communities such as data mining, databases, machine learning and information retrieval, and is used in a vast number of applications in various domains such as image processing, medical diagnosis, document organization, etc. Text classification aims to assign predefined classes to text documents [101]. The problem of classification is defined as follows. We have a training set D = {d_1, d_2, ..., d_n} of documents, such that each document d_i is labeled with a label ℓ_i from the set L = {ℓ_1, ℓ_2, ..., ℓ_k}. The task is to find a classification model (classifier) f where

f : D \longrightarrow L, \quad f(d) = \ell    (3)

which can assign the correct class label to a new document d (test instance). The classification is called hard if a label is explicitly assigned to the test instance, and soft if a probability value is assigned to the test instance. There are other types of classification which allow assignment of multiple labels [54] to a test instance. For an extensive overview of a number of classification methods see [41, 71]. Yang et al. evaluate various kinds of text classification algorithms [147]. Many of the classification algorithms have been implemented in different software systems and are publicly available, such as the BOW toolkit [98], Mallet [99] and WEKA (http://www.cs.waikato.ac.nz/ml/weka/).

To evaluate the performance of the classification model, we set aside a random fraction of the labeled documents (test set). After training the classifier with the training set, we classify the test set, compare the estimated labels with the true labels and measure the performance. The portion of correctly classified documents to the total number of documents is called accuracy [67]. The common evaluation metrics for text classification are precision, recall and F-1 scores. Charu et al. [1] define the metrics as follows: precision is the fraction of correct instances among the identified positive instances, recall is the percentage of correct instances among all the positive instances, and the F-1 score is the harmonic mean of precision and recall:

F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}    (4)

3.1 Naive Bayes Classifier
Probabilistic classifiers have gained a lot of popularity recently and have been shown to perform remarkably well [24, 73, 84, 86, 118]. These probabilistic approaches make assumptions about how the data (words in documents) are generated and propose a probabilistic model based on these assumptions. They then use a set of training examples to estimate the parameters of the model. Bayes' rule is used to classify new examples and select the class that has most likely generated the example [96].

The Naive Bayes classifier is perhaps the simplest and the most widely used classifier. It models the distribution of documents in each class using a probabilistic model which assumes that the distributions of different terms are independent from each other. Even though this so-called "naive Bayes" assumption is clearly false in many real world applications, naive Bayes performs surprisingly well.

There are two main models commonly used for naive Bayes classification [96]. Both models aim at finding the posterior probability of a class, based on the distribution of the words in the document. The difference between these two models is that one model takes into account the frequency of the words whereas the other one does not.

(1) Multi-variate Bernoulli Model: In this model a document is represented by a vector of binary features denoting the presence or absence of the words in the document. Thus, the frequency of words is ignored. The original work can be found in [88].

3.2 Nearest Neighbor Classifier
The nearest neighbor classifier is a proximity-based classifier which uses distance-based measures to perform the classification. The main idea is that documents which belong to the same class are more likely "similar" or close to each other based on similarity measures such as the cosine similarity defined in Section 2.2. The classification of the test document is inferred from the class labels of the similar documents in the training set. If we consider the k nearest neighbors in the
training data set, the approach is called k-nearest neighbor classifica- (2) Multinomial Model: We capture the frequencies of words tion and the most common class from these k neighbors is reported (terms) in a document by representing the document as bag as the class label, see [61, 93, 117, 126] for more information and of words. Many different variations of multinomial model examples. have been introduced in [76, 97, 101, 106]. McCallum et al. [96] have done an extensive comparison between Bernoulli and multinomial models and concluded that 3.3 Decision Tree classifiers • If the size of the vocabulary is small, the Bernoulli model Decision tree is basically a hierarchical tree of the training instances, may outperform multinomial model. in which a condition on the attribute value is used to divide the • The multinomial model always outperforms Bernoulli data hierarchically. In other words decision tree [50] recursively model for large vocabulary sizes, and almost always per- partitions the training data set into smaller subdivisions based on a forms better than Bernoulli if the size of the vocabulary set of tests defined at each node or branch. Each node of thetree chosen optimally for both models. is a test of some attribute of the traning instance, and each branch descending from the node corresponds to one the value of this Both of these models assume that the documents are generated attribute. An instance is classified by beginning at the root node, by a mixture model parameterized by θ. We use the framework testing the attribute by this node and moving down the tree branch McCallum el at. [96] defined as follows: corresponding to the value of the attribute in the given instance. And this process is recursively repeated [101]. The mixture model comprises mixture components c ∈ C = j In case of text data, the conditions on the decision tree nodes {c ,c ,...,c }. Each document d = {w ,w ,...,w } is gener- 1 2 k i 1 2 ni are commonly defined in terms of terms in the text documents. For ated by first selecting a component according to priors, P(c |θ) j instance a node may be subdivided to its children relying on the and then use the component to create the document according to presence or absence of a particular term in the document. For a its own parameters, P(d |c ;θ). Hence, we can compute the likeli- i j detailed discussion of decision trees see [19, 41, 71, 113]. hood of a document using the sum of probabilities over all mixture Decision trees have been used in combination with boosting components: techniques. [49, 125] discuss boosting techniques to improve the k accuracy of the decision tree classification. Õ P(di |θ) = P(cj |θ)P(di |cj ;θ) (5) j=1 3.4 Support Vector Machines L = We assume a one to one correspondence between classes Support Vector Machines (SVM) are supervised learning classifica- {ℓ , ℓ , . . . , ℓ } 1 2 k and mixture components, and therefore cj indicates tion algorithms where have been extensively used in text classifica- both the jth mixture component and the jth class. Consequently, tion problems. SVM are a form of Linear Classifiers. Linear classifiers D = { , ,..., } Given a set of labeled training examples, d1 d2 d |D | , in the context of text documents are models that making a classi- we first learn (estimate) the parameters of the probabilistic classifi- fication decision is based on the value of the linear combinations ˆ cation model, θ, and then using the estimates of these parameters, of the documents features. 
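As a hedged, minimal sketch of the supervised text classification setup discussed in this section, the code below turns documents into TF-IDF vectors and fits a multinomial naive Bayes model, with a linear SVM usable as a drop-in replacement. It assumes scikit-learn; the toy corpus, the two class labels and all parameter defaults are invented for the example and do not come from the survey.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Invented labeled training documents.
train_docs = [
    "the patient was diagnosed with diabetes",
    "new treatment improves clinical outcomes",
    "the team won the championship game",
    "the striker scored two goals last night",
]
train_labels = ["medicine", "medicine", "sports", "sports"]

test_docs = ["clinical trial of a new drug", "the goalkeeper saved a penalty"]
test_labels = ["medicine", "sports"]

# TF-IDF weighting followed by a multinomial naive Bayes classifier.
# Replacing MultinomialNB() with sklearn.svm.LinearSVC() gives a linear
# SVM classifier instead.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

predicted = model.predict(test_docs)
print(predicted)
print("macro F1:", f1_score(test_labels, predicted, average="macro"))
```

Precision and recall, as in equation (4), can be computed on the same predictions with the corresponding functions in sklearn.metrics.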
Thus, the output of a linear predictor we perform the classification of test documents by calculating the is defined to be y = aì ·x ì + b, where xì = (x1, x2,..., xn) is the posterior probabilities of each class cj , given the test document, normalized document word frequency vector, aì = (a1, a2,..., an) and select the most likely class (class with the highest probability). is vector of coefficients and b is a scalar. We can interpret the pre- ( | ˆ) ( | ˆ ) dictor y = aì ·x ì + b in the categorical class labels as a separating ˆ P cj θ P di cj ;θj P(cj |di ;θ) = hyperplane between different classes. ( | ˆ) P di θ The SVM initially introduced in [34, 141]. Support Vector Ma- (6) P(cj |θˆ)P(w1,w2,...,wn |cj ;θˆj ) chines try to find a “good” linear separators between various classes = i Í ˆ ˆ [34, 142]. A single SVM can only separate two classes, a positive P(w1,w2,...,wni |c;θc )P(c|θ) c ∈C class and a negative class [67]. SVM algorithm attempts to find a where based on naive Bayes assumption, words in a document hyperplane with the maximum distance ξ (also called margin) from are independent of each other, thus: the positive and negative examples. The documents with distance ξ from the hyperplane are called support vectors and specify the n Öi actual location of the hyperplane. If the document vectors of the P(w ,w ,...,w |c ;θˆ ) = P(w |c ;θˆ ) (7) 1 2 ni j j i j j two classes are not linearly separable, a hyperplane is determined i=1 KDD Bigdas, August 2017, Halifax, Canada Allahyari, M. et al such that the least number of document vectors are located in the The aforementioned text characteristics necessitates the design wrong side. of specialized algorithms for representing text and broadly investi- One advantage of the SVM method is that, it is quite robust to gated in IR community and many algorithms have been proposed high dimensionality, i.e. learning is almost independent of the di- to optimize text representation [120]. mensionality of the feature space. It rarely needs feature selection since it selects data points (support vectors) required for the classi- Text clustering algorithms are split into many different types fication67 [ ]. Joachims et al. [74] has described that text data is an such as agglomerative clustering algorithms, partitioning algo- ideal choice for SVM classification due to sparse high dimensional rithms and probabilistic clustering algorithms. Clustering algo- nature of the text with few irrelevant features. SVM methods have rithms have varied trade offs in terms of effectiveness and efficiency. been widely used in many application domains such as pattern For an experimental comparison of different clustering algorithms recognition, face detection and spam filtering [21, 40, 108]. For a see [132, 151], and for a survey of clustering algorithms see [145]. deeper theoretical study of SVM method see [75]. In the following we describe some of most common text clustering algorithms. 4 CLUSTERING Clustering is one of the most popular data mining algorithms and 4.1 Hierarchical Clustering algorithms have extensively studied in the context of text. It has a wide range Hierarchical clustering algorithms received their name because of applications such as in classification [11, 12], visualization [22] they build a group of clusters that can be depicted as a hierarchy and document organization [37]. The clustering is the task of find- of clusters. The hierarchy can be constructed in top-down (called ing groups of similar documents in a collection of documents. 
The divisive) or bottom-up (called agglomerative) fashion. Hierarchical similarity is computed by using a similarity function. Text cluster- clustering algorithms are one of the Distanced-based clustering al- ing can be in different levels of granularities where clusters can gorithms, i.e. using a similarity function to measure the closeness be documents, paragraphs, sentences or terms. Clustering is one between text documents. An extensive overview of the hierarchical of the main techniques used for organizing documents to enhance clustering algorithms for text data is found in [103, 104, 144]. retrieval and support browsing, for example Cutting et al. [36] have In the top-down approach we begin with one cluster which used clustering to produce a table of contents of a large collection includes all the documents. we recursively split this cluster into of documents. [9] exploits clustering to construct a context-based sub-clusters. In the agglomerative approach, each document is ini- retrieval systems. For a broad overview of clustering see [70, 82]. tially considered as an individual cluster. Then successively the There are various software tools such as Lemur5 and BOW [98] most similar clusters are merged together until all documents are which have implementations of common clustering algorithms. embraced in one cluster. There are three different merging meth- ods for agglomerative algorithms: 1) Single Linkage Clustering: In There are many clustering algorithms that can be used in the this technique, the similarity between two groups of documents is context of text data. Text document can be represented as a binary the highest similarity between any pair of documents from these vector, i.e. considering the presence or absence of word in the doc- groups. 2) Group-Average Linkage Clustering: In group-average clus- ument. Or we can use more refined representations which involves tering, the similarity between two cluster is the average similarity weighting methods such as TF-IDF (see section 2.2). between pairs of documents in these groups. 3) Complete Linkage Nevertheless, such naive methods do not usually work well for Clustering: In this method, the similarity between two clusters is text clustering, since text data has a number of distinct characteris- the worst case similarity between any pair of documents in these tics which demands the design of text-specific algorithms for the groups. For more information about these merging techniques see task. We describe some of these unique properties of text represen- [1]. tation: i. Text representation has a very large dimensionality, but the underlying data is sparse. In other words, the size of the 4.2 k-means Clustering vocabulary from which the documents are drawn is massive k-means clustering is one the partitioning algorithms which is (e.g. order of 105), but a given document may have only a few widely used in the data mining. The k-means clustering, partitions hundred words. This problem becomes even more severe n documents in the context of text data into k clusters. representa- when we deal with short data such as tweets. tive around which the clusters are built. The basic form of k-means ii. Words of the vocabulary of a given collection of documents algorithm is: are commonly correlated with each other. i.e. the number of concepts in the data are much smaller that the feature space. 
Thus, we need to design algorithms which take the word Finding an optimal solution for k-means clustering is computation- correlation into consideration in the clustering task. ally difficult (NP-hard), however, there are efficient heuristics such iii. Since documents differ from one another in terms ofthe as [18] that are employed in order to converge rapidly to a local number of words they contain, normalizing document rep- optimum. The main disadvantage of k-means clustering is that it is resentations during the clustering process is important. indeed very sensitive to the initial choice of the number of k. Thus, there are some techniques used to determine the initial k, e.g. using 5http://www.lemurproject.org/ another lightweight clustering algorithm such as agglomerative A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques KDD Bigdas, August 2017, Halifax, Canada

ALGORITHM 1: k-means clustering algorithm

Input :Document set D, similarity measure S, number k of ! !! !d! ! cluster θd! Zd,n! Wd,n! N ! ! D K Output:Set of k clusters ! ! initialization Select randomly k data points as starting centroids. Figure 1: LDA Graphical Model while not converged do Assign documents to the centroids based on the closest similarity. (b) For each word wn, where n ∈ {1, 2,..., N }, in document Calculate the the cluster centroids for all the clusters. d, i. Sample a topic z ∼ Mult(θ ) end i d ii. Sample a word w ∼ Mult(φ ) return k clusters n zi The joint distribution of the model (hidden and observed vari- ables) is: clustering algorithm. More efficient k-means clustering algorithms K D can be found in [7, 79]. Ö Ö P(ϕ1:K ,θ1:D, z1:D,w1:D) = P(ϕj |β) P(θd |α) 4.3 Probabilistic Clustering and Topic Models j=1 d=1 N ! Topic modeling is one of the most popular the probabilistic clustering Ö algorithms which has gained increasing attention recently. The × P(zd,n |θd )P(wd,n |ϕ1:K , zd,n) (9) main idea of topic modeling [16, 55, 66] is to create a probabilistic n=1 generative model for the corpus of text documents. In topic models, 4.3.1 Inference and Parameter Estimation for LDA. We now need documents are mixture of topics, where a topic is a probability to compute the posterior distribution of the hidden variables, given distribution over words. the observed documents. Thus, the posterior is: The two main topic models are Probabilistic Latent Semantic Analysis (pLSA) [66] and Latent Dirichlet Allocation (LDA) [16]. P(φ1:K ,θ1:D , z1:D ,w1:D ) P(φ1:K ,θ1:D , z1:D |w1:D ) = (10) Hofmann (1999) introduced pLSA for document modeling. pLSA P(w1:D ) model does not provide any probabilistic model at the document This distribution is intractable to compute [16] due to the de- level which makes it difficult to generalize it to model new unseen nominator (probability of seeing the observed corpus under any documents. Blei et al. [16] extended this model by introducing a topic model). Dirichlet prior on mixture weights of topics per documents, and While the posterior distribution (exact inference) is not tractable, called the model Latent Dirichlet Allocation (LDA). In this section a wide variety of approximate inference techniques can be used, we describe the LDA method. including variational inference [16] and Gibbs sampling [56]. Gibbs The latent Dirichlet allocation model is the state of the art unsu- sampling is a Markov Chain Monte Carlo [53] algorithm, trying pervised technique for extracting thematic information (topics) of a to collect sample from the posterior to approximate it with an collection of documents. [16, 56]. The basic idea is that documents empirical distribution. are represented as a random mixture of latent topics, where each Gibbs sampling computes the posterior over topic assignments topic is a probability distribution over words. The LDA graphical for every word as follows: representation is shown is Fig. 1. Let D = {d1,d2,...,d |D | } is the corpus and V = {w1,w2,...,w |V | } is the vocabulary of the corpus. A topic zj , 1 ≤ j ≤ K is rep- P(zi = k|wi = w, z−i , w−i , α, β) = resented as a multinomial probability distribution over the |V| (d) ( ) n + α k + Í|V | k,−i nw,−i β words, p(wi |zj ), p(wi |zj ) = 1. LDA generates the words in a × (11) i ( ) ( ) ÍK d + ÍW k + two-stage process: words are generated from topics and topics are k′=1 nk′,−i Kα w′=1 nw′,−i W β generated by documents. 
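Before the formal treatment of topic models continues, here is a minimal sketch corresponding to the k-means procedure of ALGORITHM 1 (Section 4.2): documents are represented as TF-IDF vectors and partitioned into k clusters. It assumes scikit-learn; the toy corpus and the choice of k = 2 are arbitrary, and the assign/recompute loop of the algorithm is carried out internally by fit_predict.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy documents; a real collection would be far larger.
docs = [
    "the striker scored a late goal",
    "the team won the football match",
    "the patient received a new treatment",
    "the clinical trial showed promising results",
]

# Represent documents as TF-IDF weighted term vectors (see Section 2.2).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# k-means: choose k starting centroids, assign each document to the closest
# centroid, recompute centroids, and repeat until convergence.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)
```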
More formally, the distribution of words where zi = k is the topic assignment of word i to topic k, z−i given the document is calculated as follows: (k) refers to the topic assignments of all other words. nw,−i is the num- K ber of times word w assigned to topic k excluding the current as- Õ ( ) p(wi |d) = p(wi |zj )p(zj |d) (8) d signment. Similarly, nk,−i is the number of times topic k is assigned j=1 to any words in document d excluding the current assignment. For The LDA assumes the following generative process for the corpus a theoretical overview on Gibbs sampling see [23, 64]. D: LDA can be easily used as a module in more complicated models (1) For each topic k ∈ {1, 2,..., K}, sample a word distribution for more complex goals. Furthermore, LDA has been extensively φk ∼ Dir(β) used in a wide variety of domains. Chemudugunta et al. [27] com- (2) For each document d ∈ {1, 2,..., D}, bined LDA with concept hierarchy to model documents. [2, 5] de- (a) Sample a topic distribution θd ∼ Dir(α) veloped ontology-based topic models based on LDA for automatic KDD Bigdas, August 2017, Halifax, Canada Allahyari, M. et al topic labeling and semantic tagging, respectively. [4] proposed a in which generation of a label or an observation depends on one knowledge-based topic model for context-aware recommendations. or a few previous labels or observations. Therefore, given a se- [81, 127] defined more complex topic models based on LDA foren- quence of labels Y = (y1,y2,...,yn) for an observation sequence tity disambiguation, [3] and [63] has proposed a entity-topic models X = (x1, x2,..., xn), we have: for discovering coherence topics and entity linking, respectively. yi ∼ p(yi |yi−1) xi ∼ p(xi |xi−1) (12) Additionally, many variations of LDA have been created such as supervised LDA (sLDA) [15], hierarchical LDA (hLDA) [14] and Hidden Markov models have been successfully used in the named Hierarchical model (HPAM) [100]. entity recognition task and systems. For an overview on hidden Markov models see [114]. 5 INFORMATION EXTRACTION 5.3 Conditional Random Fields Information extraction (IE) is the task of automatically extract- ing structured information from unstructured or semi-structured Conditional random fields (CRFs) are probabilistic models for se- text. In other words information extraction can be considered as quence labeling. CRFs first introduced by Lafferty et85 al.[ ]. We a limited form of full natural language understanding, where the refer to the same definition of conditional random fields85 in[ ] information we are looking for are known beforehand [67]. For on observations (data sequences to be labeled) and Y (sequence of example, consider the following sentence: “Microsoft was founded labels) as follows: by Bill Gates and Paul Allen on April 4, 1975.” We can identify following information: Definition. Let G = (V , E) be a graph such that Y = (Yv )v ∈V , ( , ) FounderOf(Bill Gates, Microsoft) so that Y is indexed by vertices of G. Then X Y is a conditional FounderOf(Paul Allen, Microsoft) random field, when the random variables Yv , conditioned on X, obey FoundedIn(Microsoft, April - 4 1975) Markov property with respect to graph, and:

IE is one of the critical task in text mining and widely studied p(Yv |X, Yw ,w , v) = p(Yv |X, Yw ,w ∼ v) (13) in different research communities such as information retrieval, where w ∼ v means w and v are neighbors in G. natural language processing and Web mining. Similarly, It has vast Conditional random fields are widely used in information extrac- application in domains such as biomedical text mining and business tion and part of speech tagging [85] intelligence. See [1] for some of the applications of information extraction. 5.4 Relation Extraction Information extraction includes two fundamental tasks, namely, Relation extraction is another fundamental information extraction name entity recognition and relation extraction. The state of the art task and is the task of seeking and locating the semantic relations in both tasks are statistical learning methods. In the following we between entities in text documents. There are many different tech- briefly explain two information extraction tasks. niques proposed for relation extraction. The most common method is consider the task as a classification problem: Given a couple of 5.1 Named Entity Recognition (NER) entities co-occurring in a sentence, how to categorize the relation A named entity is a sequence of words that identifies some real between two entities into one of the fixed relation types. There world entity, e.g. “Google Inc”, “United States”, “Barack Obama”. is a possibility that relation span across multiple sentences, but The task of named entity recognition is to locate and classify named such cases are rare, thus, most of existing work have focused on entities in free text into predefined categories such as person, or- the relation extraction within the sentence. Many studies using the ganization, location, etc. NER can not be completely done simply classification approach for relation extraction have been done such by doing string matching against a dictionary, because a) dictionar- as [25, 26, 59, 72, 77]. ies are usually incomplete and do not contain all forms of named entities of a given entity type. b) Named entities are frequently 6 BIOMEDICAL ONTOLOGIES AND TEXT dependent on context, for example “big apple” can be the fruit, or MINING FOR BIOMEDICINE AND the nickname of New York. HEALTHCARE Named entity recognition is a preprocessing step in the relation One of the domains where text mining is tremendously used is extraction task and also has other applications such as in question biomedical sciences. Biomedical literature is growing exponentially, answering [1, 89]. Most of the named entity recognition techniques Cohen and Hunter [31] show that the growth in PubMed/MEDLINE are statistical learning methods such as hidden Markov models [13], publications is phenomenal, which makes it quite difficult for biomed- maximum entropy models [29], support vector machines [69] and ical scientists to assimilate new publications and keep up with conditional random fields [128]. relevant publications in their own research area. In order to overcome this text information overload and trans- 5.2 Hidden Markov Models form the text into machine-understandable knowledge, automated Standard probabilistic classification techniques usually do not con- methods are required. Thus, text mining techniques sider the predicted labels of the neighboring words. The proba- along with statistical machine learning algorithms are widely used bilistic models which take this into account are hidden Markov in biomedical domain. 
Text mining methods have been utilized in a model (HMM). Hidden Markov model assumes a Markov process variety of biomedical domains such as protein structure prediction, A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques KDD Bigdas, August 2017, Halifax, Canada gene clustering, biomedical hypothesis and clinical diagnosis, to Base 11, consists of clinical information including dosing guide- name a few. In this section, we briefly describe some of the relevant lines and drug labels, potentially clinically actionable gene-drug research in biomedical domain, including biomedical ontologies associations and genotype-phenotype relationships. and then proceed to explain some of the text mining techniques The ontologies and knowledge bases described earlier are exten- in biomedical discipline applied for basic tasks of named entity sively used by different text mining techniques such as information recognition and relation extraction. extraction and clustering in the biomedical domain.

6.1 Biomedical Ontologies 6.2 Information Extraction We first define the concept of ontology. We use the definition pre- As mentioned before (section 5), information extraction is the task sented in W3C’s OWL Use Case and Requirements Documents6 as of extracting structured information from unstructured text in a follows: automatic fashion. In biomedical domain, unstructured text com- prises mostly scientific articles in biomedical literature and clinical An ontology formally defines a common set of terms that are used information found in clinical information systems. Information to describe and represent a domain. An ontology defines the terms extraction is typically considered as a preprocessing step in other used to describe and represent an area of knowledge. biomedical text mining applications such as [10], knowledge extraction [124, 136], hypothesis generation[30, 91] According to the definition above, we should mention afew and summarization [65]. points about ontology: 1) Ontology is domain specific, i.e., it is 6.2.1 Named Entity Recognition (NER). Named Entity Recogni- used to describe and represent an area of knowledge such as area in tion is the task of information extraction which is used to locate and education, medicine, etc [149]. 2) Ontology consists of terms and classify biomedical entities into categories such as protein names, relationships among these terms. Terms are often called classes or gene names or diseases [87]. Ontologies can be utilized to give concepts and relationships are called properties. semantic, unambiguous representation to extracted entities. NER There are a great deal of biomedical ontologies. For a comprehen- are quite challenging in biomedical domain, because: sive list of biomedical ontologies see Open Biomedical Ontologies (1) There is a huge amount of semantically related entities in (OBO)7 and the National Center for Biomedical Ontology (NCBO)8. biomedical domain and is increasing quickly with the new The NCBO ontologies are accessed and shared through BioPortal9. discoveries done in this field. This non-stop growing of the In the following, we briefly describe one the most extensively-used volume of entities is problematic for the NER systems, since ontologies in biomedical domain: they depends on dictionaries of terms which can never be complete due to continues progress in scientific literature. Unified Medical Language System (UMLS): UMLS10 [95] is (2) In biomedical domain, the same concept may have many the most comprehensive knowledge resource, unifying over 100 different names (synonyms). For example, “heart attack” and dictionaries, terminologies and ontologies in its Metathesaurus “myocardial infarction” point to the same concept and NER (large vocabulary whose data are collected from various biomedical systems should be able to recognize the same concept re- thesauri) which is designed and maintained by National Library of gardless of being expressed differently. Medicine (NLM). It provides a mechanism for integrating [17] all (3) Using acronyms and abbreviations is very common in biomed- main biomedical vocabularies such as MeSH, Systematized Nomen- ical literature which makes it complicated to identify the clature of Medicine Clinical Terms (SNOMED CT), Gene Ontology concepts these terms express. (GO), etc. It is critical for NER systems to have high quality and perform It also provides a that explains the relations well when analyzing vast amounts of text. 
Precision, recall and between Metathesaurus entries, i.e., a dictionary that includes lexi- F-score are typical evaluation methods used in NER systems. Even cographic information about biomedical terms and common English though there are some challenges to obtain and compare evaluation words and a set of lexical tools. Semantic network contains seman- methods reliably, e.g. how to define the boundaries of correctly tic types and semantic relations. Semantic types are categories of identified entities, NER systems have demonstrated to achieve good Metathesaurus entries (concepts) and semantic relations are re- results in general. lationships between semantic types. For more information about NER methods are usually grouped into several different ap- UMLS, see [17, 148]. proaches: Apart from the aforementioned ontologies and knowledge sources, there are various ontologies more specifically focused on biomedi- • Dictionary-based approach, is one of the main biomed- cal sub-domains. For example, the Pharmacogenomics Knowledge ical NER methods which uses a exhaustive dictionary of biomedical terms to locate entity mentions in text. It decides whether a word or phrase from the text matches with some biomedical entity in the dictionary. Dictinary-based methods 6http://www.w3.org/TR/webont-req/ 7http://www.obofoundry.org/ are mostly used with more advanced NER systems. 8http://www.bioontology.org/ 9http://bioportal.bioontology.org/ 10https://uts.nlm.nih.gov/home.html 11http://www.pharmgkb.org/ KDD Bigdas, August 2017, Halifax, Canada Allahyari, M. et al

• Rule-based approach, defines rules that specifies patterns 6.3 Summarization of biomedical entities. Gaizauskas et at. [51] have used con- One of the common biomedical text mining task which largely uti- text free grammars to recognize protein structure. lizes information extraction tasks is summarization. Summarization • Statistical approaches, basically utilize some machine learn- is the task of identifying the significant aspects of one or more ing methods typically supervised or semi-supervised algo- documents and represent them in a coherent fashion automatically . rithms [139] to identify biomedical entities. Statistical meth- it has recently gained a great attention because of the huge growth ods are often categorized into two different groups: of unstructured information in biomedical domain such as scientific (1) Classification-based approaches, convert the NER task articles and clinical information. into a classification problem, which is applicable to either Biomedical summarization is often application oriented and may words or phrases. Naive Bayes [107] and Support Vector be applied for different purposes. Based on their purpose, a variety Machines [83, 102, 134] are among the common classifiers of document summaries can be created such as single-document used for biomedical NER task. summaries which targets the content of individual documents and (2) Sequence-based methods, use complete sequence of multi-document summaries where information contents of multiple words instead of only single words or phrases. They try to documents are considered. predict the most likely tag for a sequence of words after The evaluation of summarization methods is really challenging in being trained on a training set. Hidden Markov Model biomedical domain. Because deciding whether or not a summary is (HMM) [32, 129, 150], Maximum Entropy Markov Model “good” is often subjective and also manual evaluations of summaries [33] and Conditional Random Fields (CRF) [128] are the are laborious to carry out. There is a popular automatic evaluation most common sequence-based approaches and CRFs have technique for summaries that is called ROUGE (Recall-Oriented frequently demonstrated to be better statistical biomedical Understudy for Gisting Evaluation). ROUGE measures the qual- NER systems. ity of an automatically produced summary by comparing it with (3) Hybrid methods, which rely on multiple approaches, ideal summaries created by humans. The measure is calculated by such as combining dictionary- or rule-based techniques counting the overlapping words between the computer-generated with statistical methods. [123] introduced a hybrid method summary and the ideal human-produced summaries. For a compre- in which they have use a dictionary-based method to lo- hensive overview of various biomedical summarization techniques, cate known protein names along with a part-of-speech see [1]. tagging (CRF-based method).
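To make the dictionary-based NER approach from the list above concrete, the following self-contained sketch matches a small synonym dictionary against text using longest-first phrase matching. The dictionary entries, entity types and matching strategy are invented for illustration; real systems combine much larger terminologies (e.g., drawn from UMLS) with the statistical methods described above.

```python
# Toy biomedical dictionary mapping surface forms to canonical concepts.
# Synonyms ("heart attack", "myocardial infarction") map to the same concept,
# one of the challenges discussed for biomedical NER.
DICTIONARY = {
    "heart attack": ("myocardial infarction", "Disease"),
    "myocardial infarction": ("myocardial infarction", "Disease"),
    "aspirin": ("acetylsalicylic acid", "Drug"),
}

def dictionary_ner(text):
    """Longest-first matching of dictionary phrases against the text."""
    tokens = text.lower().split()
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first (here: up to 3 tokens).
        for length in range(3, 0, -1):
            phrase = " ".join(tokens[i:i + length])
            if phrase in DICTIONARY:
                concept, entity_type = DICTIONARY[phrase]
                match = (phrase, concept, entity_type)
                i += length
                break
        if match:
            mentions.append(match)
        else:
            i += 1
    return mentions

text = "The patient suffered a heart attack and was treated with aspirin"
print(dictionary_ner(text))
```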

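The ROUGE measure described above amounts to counting overlapping units between a system summary and human reference summaries. The sketch below computes a simplified ROUGE-1 recall (unigram overlap) in plain Python; it ignores stemming, multiple references and the other refinements of the official ROUGE toolkit, so it should be read as an illustration of the idea rather than a faithful reimplementation.

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Simplified ROUGE-1 recall: the fraction of reference unigrams
    that also appear in the candidate summary (with clipped counts)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(count, cand_counts[word])
                  for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "text mining extracts useful knowledge from large document collections"
candidate = "text mining finds useful patterns in large collections of documents"
print(round(rouge_1_recall(candidate, reference), 3))
```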
6.2.2 Relation Extraction. Relation extraction in Biomedical do- 6.4 Question Answering main involves determining the relationships among biomedical entities. Given two entities, we aim at locating the occurrence of a Question answering is another biomedical text mining task where specific relationship type between them. The associations between significantly exploits information extraction methods. Question entities are usually binary, however, it can include more than two answering is defined as the process of producing accurate answers entities. In the genomic area, for example, the focus is mostly on to questions posed by humans in a natural language. Question extracting interactions between genes and proteins, such as protein- answering is very critical in biomedical domain due to data overload protein or gene-diseases relationships. Relation extraction comes and constant growth of information in this field. across similar challenges as NER, such as creation of high qual- In order to generate precise responses, question answering sys- ity annotated data for training and assessing the performance of tems make extensive use of natural language processing techniques. relation extraction systems. For more information see [8]. The primary steps of question answering system is as follows: a) The There are many different approaches for biomedical relation system receives a natural language text as input. b) Using linguistic extraction. The most straightforward technique is based on where analysis and question categorization algorithms, the system de- the entities co-occur. If they mentioned together frequently, there is termines the type of the posed question and the answer it should high chance that they might be related in some way. But we can not produce. c) Then it generates a query and passes it to the document recognize the type and direction of the relations by using only the processing phase. d) In the document processing phase, system statistics. Co-occurrence approaches are usually give high recall feeds the query, system sends the query to a , gets and low precision. back the retrieved documents and extracts the relevant snippets Rule-based approaches are another set of methods used for of text as candidate answers, and send them to answering process- biomedical relation extraction. Rules can be defined either manu- ing stage. e) Answering processing stage, analyzes the candidate ally by domain experts or automatically obtained by using machine answers and ranks them according to the degree they match the learning techniques from an annotated corpus. Classification-based expected answer type that was established in the question process- approaches are also very common methods for relation extractions ing step. f) The top-ranked answer is selected as the output of the in biomedical domain. There is a great work done using supervised question answering system. machine learning algorithms that detects and discovers various Question answering systems in biomedical discipline have re- types of relations, such as [20] where they identify and classify cently begun to utilize and incorporate semantic knowledge through- relations between diseases and treatments extracted from PubMed out their processing steps to create more accurate responses. These abstracts and between genes and diseases in human GeneRIF data- biomedical semantic knowledge-based systems use various semantic base. 
7 DISCUSSION
In this article we attempted to give a brief introduction to the field of text mining. We provided an overview of some of the most fundamental algorithms and techniques that are extensively used in the text domain. This paper also gave an overview of some important text mining approaches in the biomedical domain. Even though it is impossible to describe all the different methods and algorithms thoroughly within the limits of this article, it should give a rough overview of current progress in the field of text mining.
Text mining is essential to scientific research given the very high volume of scientific literature being produced every year [60]. These large archives of online scientific articles are growing significantly as a great number of new articles are added on a daily basis. While this growth has enabled researchers to easily access more scientific information, it has also made it quite difficult for them to identify articles pertinent to their interests. Thus, processing and mining this massive amount of text is of great interest to researchers.

ACKNOWLEDGMENTS
This project was funded in part by Federal funds from the US National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract #HHSN272201200031C, which supports the Malaria Host-Pathogen Interaction Center (MaHPIC).

8 CONFLICT OF INTEREST
The author(s) declare(s) that there is no conflict of interest regarding the publication of this article.

REFERENCES
[1] Charu C Aggarwal and ChengXiang Zhai. 2012. Mining text data. Springer.
[2] Mehdi Allahyari and Krys Kochut. 2015. Automatic topic labeling using ontology-based topic models. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on. IEEE, 259–264.
[3] Mehdi Allahyari and Krys Kochut. 2016. Discovering Coherent Topics with Entity Topic Models. In Web Intelligence (WI), 2016 IEEE/WIC/ACM International Conference on. IEEE, 26–33.
[4] Mehdi Allahyari and Krys Kochut. 2016. Semantic Context-Aware Recommendation via Topic Models Leveraging Linked Open Data. In International Conference on Web Information Systems Engineering. Springer, 263–277.
[5] Mehdi Allahyari and Krys Kochut. 2016. Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network. In Semantic Computing (ICSC), 2016 IEEE Tenth International Conference on. IEEE, 63–70.
[6] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut. 2017. Text Summarization Techniques: A Brief Survey. ArXiv e-prints (2017). arXiv:1707.02268
[7] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. 1997. An efficient k-means clustering algorithm. (1997).
[8] Sophia Ananiadou, Sampo Pyysalo, Jun'ichi Tsujii, and Douglas B Kell. 2010. Event extraction for systems biology by text mining the literature. Trends in Biotechnology 28, 7 (2010), 381–390.
[9] Peter G Anick and Shivakumar Vaithyanathan. 1997. Exploiting clustering and phrases for context-based information retrieval. In ACM SIGIR Forum, Vol. 31. ACM, 314–323.
[10] Sofia J Athenikos and Hyoil Han. 2010. Biomedical question answering: A survey. Computer Methods and Programs in Biomedicine 99, 1 (2010), 1–24.
[11] L Douglas Baker and Andrew Kachites McCallum. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 96–103.
[12] Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Winter. 2001. On feature distributional clustering for text categorization. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 146–153.
[13] Daniel M Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the fifth conference on Applied natural language processing. Association for Computational Linguistics, 194–201.
[14] David M Blei, Thomas L Griffiths, Michael I Jordan, and Joshua B Tenenbaum. 2003. Hierarchical Topic Models and the Nested Chinese Restaurant Process. In NIPS, Vol. 16.
[15] David M Blei and Jon D McAuliffe. 2007. Supervised Topic Models. In NIPS, Vol. 7. 121–128.
[16] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3 (2003), 993–1022.
[17] Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32, suppl 1 (2004), D267–D270.
[18] Paul S Bradley and Usama M Fayyad. 1998. Refining Initial Points for K-Means Clustering. In ICML, Vol. 98. Citeseer, 91–99.
[19] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and regression trees. CRC Press.
[20] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, and Hans-Peter Kriegel. 2008. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 9, 1 (2008), 207.
[21] Christopher JC Burges. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2, 2 (1998), 121–167.
[22] Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, and Steven White. 2003. Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery 7, 4 (2003), 399–424.
[23] Bob Carpenter. 2010. Integrating out multinomial parameters in latent Dirichlet allocation and naive bayes for collapsed Gibbs sampling. Technical Report. LingPipe.
[24] Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. 1997. Using taxonomy, discriminants, and signatures for navigating in text databases. In VLDB, Vol. 97. 446–455.
[25] Yee Seng Chan and Dan Roth. 2010. Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 152–160.
[26] Yee Seng Chan and Dan Roth. 2011. Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 551–560.
[27] Chaitanya Chemudugunta, America Holloway, Padhraic Smyth, and Mark Steyvers. 2008. Modeling documents by combining semantic concepts with unsupervised statistical learning. In The Semantic Web-ISWC 2008. Springer, 229–244.
[28] Ming-Syan Chen, Jiawei Han, and Philip S. Yu. 1996. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 6 (1996), 866–883.
[29] Hai Leong Chieu and Hwee Tou Ng. 2003. Named Entity Recognition with a Maximum Entropy Approach. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4 (CONLL '03). Association for Computational Linguistics, Stroudsburg, PA, USA, 160–163. https://doi.org/10.3115/1119176.1119199
[30] Aaron M Cohen and William R Hersh. 2005. A survey of current work in biomedical text mining. Briefings in Bioinformatics 6, 1 (2005), 57–71.
[31] K Bretonnel Cohen and Lawrence Hunter. 2008. Getting started in text mining. PLoS Computational Biology 4, 1 (2008), e20.
[32] Nigel Collier, Chikashi Nobata, and Jun-ichi Tsujii. 2000. Extracting the names of genes and gene products with a hidden Markov model. In Proceedings of the 18th conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 201–207.
[33] Peter Corbett and Ann Copestake. 2008. Cascaded classifiers for confidence-based chemical named entity recognition. BMC Bioinformatics 9, Suppl 11 (2008), S4.
[34] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[35] Jim Cowie and Wendy Lehnert. 1996. Information extraction. Commun. ACM 39, 1 (1996), 80–91.
[36] Douglass R Cutting, David R Karger, and Jan O Pedersen. 1993. Constant interaction-time scatter/gather browsing of very large document collections. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 126–134.
[37] Douglass R Cutting, David R Karger, Jan O Pedersen, and John W Tukey. 1992. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 318–329.

[38] Dipanjan Das and André FT Martins. 2007. A survey on automatic text summarization. Literature Survey for the Language and Statistics II course at CMU 4 (2007), 192–195.
[39] Mahmood Doroodchi, Azadeh Iranmehr, and Seyed Amin Pouriyeh. 2009. An investigation on integrating XML-based security into Web services. In GCC Conference & Exhibition, 2009 5th IEEE. IEEE, 1–5.
[40] Harris Drucker, S Wu, and Vladimir N Vapnik. 1999. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on 10, 5 (1999), 1048–1054.
[41] Richard O Duda, Peter E Hart, and David G Stork. 2012. Pattern classification. John Wiley & Sons.
[42] S Dumais, G Furnas, T Landauer, S Deerwester, S Deerwester, et al. 1995. Latent semantic indexing. In Proceedings of the Text Retrieval Conference.
[43] Sašo Džeroski. 2009. Relational data mining. In Data Mining and Knowledge Discovery Handbook. Springer, 887–911.
[44] Christos Faloutsos and Douglas W Oard. 1998. A survey of information retrieval and filtering methods. Technical Report.
[45] Usama M Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, et al. 1996. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In KDD, Vol. 96. 82–88.
[46] Ronen Feldman and Ido Dagan. 1995. Knowledge Discovery in Textual Databases (KDT). In KDD, Vol. 95. 112–117.
[47] Guozhong Feng, Jianhua Guo, Bing-Yi Jing, and Lizhu Hao. 2012. A Bayesian feature selection paradigm for text classification. Information Processing & Management 48, 2 (2012), 283–302.
[48] William J Frawley, Gregory Piatetsky-Shapiro, and Christopher J Matheus. 1992. Knowledge discovery in databases: An overview. AI Magazine 13, 3 (1992), 57.
[49] Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1 (1997), 119–139.
[50] Mark A Friedl and Carla E Brodley. 1997. Decision tree classification of land cover from remotely sensed data. Remote Sensing of Environment 61, 3 (1997), 399–409.
[51] Robert Gaizauskas, George Demetriou, Peter J. Artymiuk, and Peter Willett. 2003. Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics 19, 1 (2003), 135–143.
[52] John Gantz and David Reinsel. 2012. THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Technical Report 1. IDC, 5 Speen Street, Framingham, MA 01701 USA. Accessed online on May, 2017. https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
[53] Walter R Gilks, Sylvia Richardson, and David J Spiegelhalter. 1996. Introducing markov chain monte carlo. In Markov chain Monte Carlo in practice. Springer, 1–19.
[54] Siddharth Gopal and Yiming Yang. 2010. Multilabel classification with meta-level features. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM, 315–322.
[55] Thomas L Griffiths and Mark Steyvers. 2002. A probabilistic approach to semantic representation. In Proceedings of the 24th annual conference of the Cognitive Science Society. Citeseer, 381–386.
[56] Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101, Suppl 1 (2004), 5228–5235.
[57] Serkan Günal, Semih Ergin, M Bilginer Gülmezoğlu, and Ö Nezih Gerek. 2006. On feature extraction for spam e-mail detection. In Multimedia Content Representation, Classification and Security. Springer, 635–642.
[58] Pritam Gundecha and Huan Liu. 2012. Mining social media: a brief introduction. In New Directions in Informatics, Optimization, Logistics, and Production. Informs, 1–17.
[59] Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 427–434.
[60] Juan B Gutierrez, Mary R Galinski, Stephen Cantrell, and Eberhard O Voit. 2015. From within host dynamics to the epidemiology of infectious disease: scientific overview and challenges. (2015).
[61] Eui-Hong Sam Han, George Karypis, and Vipin Kumar. 2001. Text categorization using weight adjusted k-nearest neighbor classification. Springer.
[62] Jiawei Han, Micheline Kamber, and Jian Pei. 2006. Data mining: concepts and techniques. Morgan Kaufmann.
[63] Xianpei Han and Le Sun. 2012. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 105–115.
[64] Gregor Heinrich. 2005. Parameter estimation for text analysis. Technical Report.
[65] William Hersh. 2008. Information retrieval: a health and biomedical perspective. Springer Science & Business Media.
[66] Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 50–57.
[67] Andreas Hotho, Andreas Nürnberger, and Gerhard Paaß. 2005. A Brief Survey of Text Mining. In Ldv Forum, Vol. 20. 19–62.
[68] David A Hull et al. 1996. Stemming algorithms: A case study for detailed evaluation. JASIS 47, 1 (1996), 70–84.
[69] Hideki Isozaki and Hideto Kazawa. 2002. Efficient support vector classifiers for named entity recognition. In Proceedings of the 19th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 1–7.
[70] Anil K Jain and Richard C Dubes. 1988. Algorithms for clustering data. Prentice-Hall, Inc.
[71] Mike James. 1985. Classification algorithms. Wiley-Interscience.
[72] Jing Jiang and ChengXiang Zhai. 2007. A Systematic Exploration of the Feature Space for Relation Extraction. In HLT-NAACL. 113–120.
[73] Thorsten Joachims. 1996. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Technical Report. DTIC Document.
[74] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Springer.
[75] Thorsten Joachims. 2001. A statistical learning model of text classification for support vector machines. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 128–136.
[76] Tom Kalt and WB Croft. 1996. A new probabilistic model of text classification and retrieval. Technical Report. Citeseer.
[77] Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. Association for Computational Linguistics, 22.
[78] Mehmed Kantardzic. 2011. Data mining: concepts, models, methods, and algorithms. John Wiley & Sons.
[79] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. 2002. An efficient k-means clustering algorithm: Analysis and implementation. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, 7 (2002), 881–892.
[80] Anne Kao and Stephen R Poteet. 2007. Natural language processing and text mining. Springer.
[81] Saurabh S Kataria, Krishnan S Kumar, Rajeev R Rastogi, Prithviraj Sen, and Srinivasan H Sengamedu. 2011. Entity disambiguation with hierarchical topic models. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1037–1045.
[82] Leonard Kaufman and Peter J Rousseeuw. 2009. Finding groups in data: an introduction to cluster analysis. Vol. 344. John Wiley & Sons.
[83] Jun'ichi Kazama, Takaki Makino, Yoshihiro Ohta, and Jun'ichi Tsujii. 2002. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain-Volume 3. Association for Computational Linguistics, 1–8.
[84] Daphne Koller and Mehran Sahami. 1997. Hierarchically classifying documents using very few words. (1997).
[85] John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).
[86] Leah S Larkey and W Bruce Croft. 1996. Combining classifiers in text categorization. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 289–297.
[87] Ulf Leser and Jörg Hakenberg. 2005. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics 6, 4 (2005).
[88] David D Lewis. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine learning: ECML-98. Springer, 4–15.
[89] Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 1–7.
[90] Elizabeth D Liddy. 2001. Natural language processing. (2001).
[91] Anthony ML Liekens, Jeroen De Knijf, Walter Daelemans, Bart Goethals, Peter De Rijk, and Jurgen Del-Favero. 2011. BioGraph: unsupervised biomedical knowledge discovery via automated hypothesis generation. Genome Biology 12, 6 (2011), R57.
[92] Julie B Lovins. 1968. Development of a stemming algorithm. MIT Information Processing Group, Electronic Systems Laboratory.
[93] Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Vol. 1. Cambridge University Press, Cambridge.
[94] Christopher D Manning, Hinrich Schütze, et al. 1999. Foundations of statistical natural language processing. Vol. 999. MIT Press.

[95] AT McCray. 1993. The Unified Medical Language System. Meth Inf Med 34 (1993), 281–291.
[96] Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, Vol. 752. Citeseer, 41–48.
[97] Andrew McCallum, Ronald Rosenfeld, Tom M Mitchell, and Andrew Y Ng. 1998. Improving Text Classification by Shrinkage in a Hierarchy of Classes. In ICML, Vol. 98. 359–367.
[98] Andrew Kachites McCallum. 1996. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. (1996).
[99] Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. (2002).
[100] David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th international conference on Machine learning. ACM, 633–640.
[101] Tom M Mitchell. 1997. Machine learning. Burr Ridge, IL: McGraw Hill 45 (1997).
[102] Tomohiro Mitsumori, Sevrani Fation, Masaki Murata, Kouichi Doi, and Hirohumi Doi. 2005. Gene/protein name recognition based on support vector machine using dictionary as features. BMC Bioinformatics 6, Suppl 1 (2005), S8.
[103] Fionn Murtagh. 1983. A survey of recent advances in hierarchical clustering algorithms. Comput. J. 26, 4 (1983), 354–359.
[104] Fionn Murtagh. 1984. Complexities of hierarchic clustering algorithms: State of the art. Computational Statistics Quarterly 1, 2 (1984), 101–113.
[105] Amir Netz, Surajit Chaudhuri, Jeff Bernhardt, and Usama Fayyad. 2000. Integration of data mining and relational databases. In Proceedings of the 26th International Conference on Very Large Databases, Cairo, Egypt. 285–296.
[106] Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom Mitchell. 1998. Learning to classify text from labeled and unlabeled documents. AAAI/IAAI 792 (1998).
[107] Chikashi Nobata, Nigel Collier, and Jun-ichi Tsujii. 1999. Automatic term identification and classification in biology texts. In Proc. of the 5th NLPRS. Citeseer, 369–374.
[108] Edgar Osuna, Robert Freund, and Federico Girosi. 1997. Training support vector machines: an application to face detection. In Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on. IEEE, 130–136.
[109] Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2, 1-2 (2008), 1–135.
[110] Martin F Porter. 1980. An algorithm for suffix stripping. Program: electronic library and information systems 14, 3 (1980), 130–137.
[111] Seyed Amin Pouriyeh and Mahmood Doroodchi. 2009. Secure SMS Banking Based On Web Services. In SWWS. 79–83.
[112] Seyed Amin Pouriyeh, Mahmood Doroodchi, and MR Rezaeinejad. 2010. Secure Mobile Approaches Using Web Services. In SWWS. 75–78.
[113] J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81–106.
[114] Lawrence Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 2 (1989), 257–286.
[115] Dragomir R Radev, Eduard Hovy, and Kathleen McKeown. 2002. Introduction to the special issue on summarization. Computational Linguistics 28, 4 (2002), 399–408.
[116] Martin Rajman and Romaric Besançon. 1998. Text mining: natural language techniques and text mining applications. In Data Mining and Reverse Engineering. Springer, 50–64.
[117] Payam Porkar Rezaeiye, Mojtaba Sedigh Fazli, et al. 2014. Use HMM and KNN for classifying corneal data. arXiv preprint arXiv:1401.7486 (2014).
[118] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. 1998. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, Vol. 62. 98–105.
[119] Hassan Saif, Miriam Fernández, Yulan He, and Harith Alani. 2014. On stopwords, filtering and data sparsity for sentiment analysis of twitter. (2014).
[120] Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.
[121] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
[122] Sunita Sarawagi et al. 2008. Information extraction. Foundations and Trends in Databases 1, 3 (2008), 261–377.
[123] Yutaka Sasaki, Yoshimasa Tsuruoka, John McNaught, and Sophia Ananiadou. 2008. How to make the most of NE dictionaries in statistical NER. BMC Bioinformatics 9, Suppl 11 (2008), S5.
[124] Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. 2010. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17, 5 (2010), 507–513.
[125] Robert E Schapire and Yoram Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning 39, 2-3 (2000), 135–168.
[126] Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34, 1 (2002), 1–47.
[127] Prithviraj Sen. 2012. Collective context-aware topic models for entity disambiguation. In Proceedings of the 21st international conference on World Wide Web. ACM, 729–738.
[128] Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics, 104–107.
[129] Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2003. Effective adaptation of a hidden markov model-based named entity recognizer for biomedical domain. In Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine-Volume 13. Association for Computational Linguistics, 49–56.
[130] Catarina Silva and Bernardete Ribeiro. 2003. The importance of stop word removal on recall values in text categorization. In Neural Networks, 2003. Proceedings of the International Joint Conference on, Vol. 3. IEEE, 1661–1666.
[131] Amit Singhal. 2001. Modern information retrieval: A brief overview. IEEE Data Eng. Bull. 24, 4 (2001), 35–43.
[132] Michael Steinbach, George Karypis, Vipin Kumar, et al. 2000. A comparison of document clustering techniques. In KDD workshop on text mining, Vol. 400. Boston, 525–526.
[133] Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis 427, 7 (2007), 424–440.
[134] Koichi Takeuchi and Nigel Collier. 2005. Bio-medical entity extraction using support vector machines. Artificial Intelligence in Medicine 33, 2 (2005), 125–137.
[135] Songbo Tan, Yuefen Wang, and Gaowei Wu. 2011. Adapting centroid classifier for document categorization. Expert Systems with Applications 38, 8 (2011), 10264–10273.
[136] E. D. Trippe, J. B. Aguilar, Y. H. Yan, M. V. Nural, J. A. Brady, M. Assefi, S. Safaei, M. Allahyari, S. Pouriyeh, M. R. Galinski, J. C. Kissinger, and J. B. Gutierrez. 2017. A Vision for Health Informatics: Introducing the SKED Framework, An Extensible Architecture for Scientific Knowledge Extraction from Data. ArXiv e-prints (2017). arXiv:1706.07992
[137] Stéphane Tufféry. 2011. Data mining and statistics for decision making. John Wiley & Sons.
[138] Howard Turtle and W Bruce Croft. 1989. Inference networks for document retrieval. In Proceedings of the 13th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1–24.
[139] Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun'ichi Tsujii. 2011. Automatic acquisition of huge training data for bio-medical named entity recognition. In Proceedings of BioNLP 2011 Workshop. Association for Computational Linguistics, 65–73.
[140] Alper Kursat Uysal and Serkan Gunal. 2014. The impact of preprocessing on text classification. Information Processing & Management 50, 1 (2014), 104–112.
[141] Vladimir Vapnik. 1982. Estimation of Dependences Based on Empirical Data (Springer Series in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[142] Vladimir Vapnik. 2000. The nature of statistical learning theory. Springer.
[143] Jonathan J Webster and Chunyu Kit. 1992. Tokenization as the initial phase in NLP. In Proceedings of the 14th conference on Computational linguistics-Volume 4. Association for Computational Linguistics, 1106–1110.
[144] Peter Willett. 1988. Recent trends in hierarchic document clustering: a critical review. Information Processing & Management 24, 5 (1988), 577–597.
[145] Rui Xu, Donald Wunsch, et al. 2005. Survey of clustering algorithms. Neural Networks, IEEE Transactions on 16, 3 (2005), 645–678.
[146] Christopher C Yang, Haodong Yang, Ling Jiang, and Mi Zhang. 2012. Social media mining for drug safety signal detection. In Proceedings of the 2012 international workshop on Smart health and wellbeing. ACM, 33–40.
[147] Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 42–49.
[148] Illhoi Yoo and Min Song. 2008. Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey. JCSE 2, 2 (2008), 109–136.
[149] Liyang Yu. 2011. A developer's guide to the semantic Web. Springer.
[150] Shaojun Zhao. 2004. Named entity recognition in biomedical texts using an HMM model. In Proceedings of the international joint workshop on natural language processing in biomedicine and its applications. Association for Computational Linguistics, 84–87.
[151] Ying Zhao and George Karypis. 2004. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55, 3 (2004), 311–331.