Analysis of Methods and Means of Text Mining
ECONTECHMOD. AN INTERNATIONAL QUARTERLY JOURNAL – 2017. Vol. 6. No. 2. 73–78

Z. Rybchak 1, O. Basystiuk 2
1 Lviv Polytechnic National University
2 Lviv Polytechnic National University

Received February 15, 2017; accepted May 28, 2017

Abstract. In the Big Data era, when data volume doubles every year, analyzing all of this data has become a really complicated task. Text mining systems, techniques and tools have therefore become the main instrument for analyzing tons of information, selecting the information that best suits your needs, and saving your time for more interesting things. The main aims of this article are to explain the basic principles of this field and to give an overview of some interesting technologies that are widely used in text mining today.

Key words: text mining, text analytics, data analysing, high-quality information, text categorization, text clustering, document summarization, sentiment analysis.

INTRODUCTION

Text mining, also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), is a set of linguistic, statistical, and machine learning techniques for extracting interesting and non-trivial information and knowledge from unstructured text. These techniques help to model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.

Text mining is a relatively young field. Manual text mining research started in the mid-1980s, notably for scientific and government needs, and since then growing interest and technological advances have driven the development of the field. Today it unites several fields of science: computational linguistics, machine learning and some specific areas of statistics.

The field of text mining usually deals with text whose main function is communication, helping people to express their thoughts and opinions. The motivation for trying to extract information from such text automatically is compelling, even if success is only partial.

The term text mining also describes the application of text mining to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that more than 70 % of business-relevant information is stored in unstructured form, such as text. These techniques and processes help to discover and present knowledge (facts, business rules, and relationships) that is otherwise locked in textual form, impenetrable to automated processing.

As noted above, more than 70 % of business-relevant information is stored as text, and the same holds for all spheres of human life: most information is currently stored as text. That is why text mining is believed to have high commercial potential. There is also increasing interest in multilingual data mining: the ability to gain information across languages and to cluster similar items from different linguistic sources according to their meaning.

A simple application scans a text written in a natural language, identifies the key phrases of the document, displays them, and uses them to predict what type of text it is.

Typical text mining tasks include text categorization and text clustering (the main goal of the simple application described above), as well as concept extraction, entity extraction, sentiment analysis (popular when you want to know what people think about your business), document summarization, etc.

This technology is now widely implemented and used for a variety of government, research, and business needs.

TEXT MINING AND MINING TECHNIQUES FOR OBTAINING HIGH-QUALITY INFORMATION

Text mining is the process of mining text data, or in other words obtaining high-quality information from text data. High-quality text information usually refers to some combination of relevance, novelty, and interestingness. Several methods are used to get high-quality information from text, such as information retrieval, lexical analysis to study word frequency distributions, pattern recognition, pattern learning, discovery of regularities in data, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. In addition, text mining involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output.

The main goal is to turn text into data for analysis by applying natural language processing and analytical methods. Once you have this data, you can process it in whatever way you need, either with your own method or with libraries already created for this purpose.

All of these methods are a branch of machine learning (or nearly synonymous with machine learning), especially the recognition of patterns and regularities in data. Pattern recognition systems are in many cases trained from labeled “training” data, but when no labeled data are available, other algorithms can be used to discover previously unknown patterns. The same is true for regularities, although here knowledge of computational linguistics is needed: you provide labeled data on regularities and train your algorithm to recognize these regularities and handle them in the right order.

TECHNIQUES AND TOOLS ANALYSIS

Text mining systems use a broad spectrum of approaches, partly because of the great scope of systems and tools that perform text mining, and partly because the field is too young to have dominant methodologies. Nevertheless, these approaches can be divided into:
· High-level: your text mining application relies on an automatic training system to do things like recognizing patterns. This works in a simple way: you create a training system, feed it prepared “training” data, and from this data the system learns how to recognize different patterns.
· Low-level: you deal with natural language directly and build your own custom decision-making system, which strongly influences the success of the mining process.

The low-level way of solving problems is more complex and time-consuming and requires a large amount of knowledge, so it is less popular nowadays. You need to implement yourself logic that looks trivial at first glance but is really important, handling all the small decisions: how to deal with apostrophes and hyphens, capitalization, punctuation, numbers, alphanumeric strings, whether the amount of white space is significant, whether to impose a maximum length on tokens, what to do with non-printing characters, and so on. Using this method therefore requires a good background in text analysis principles and the implementation of all the logic behind these small but really important decisions.

By contrast, the high-level way makes it possible to create a small but quite useful app in a few minutes. There are a lot of text mining computer programs available from many commercial and

Open source:
· Natural Language Toolkit;
· OpenNLP;
· Orange.

One particular framework and development environment for text mining, called General Architecture for Text Engineering or GATE, aims to help users develop, evaluate and deploy systems for what the authors term “language engineering.” It provides support not just for standard text mining applications such as information extraction, but also for tasks such as building and annotating corpora, and evaluating the applications.

At the lowest level, GATE supports a variety of formats including XML, RTF, HTML, SGML, email and plain text, converting them into a single unified model that also supports annotation. There are three storage mechanisms:
· relational database;
· serialized Java object;
· XML-based internal format.
Documents can be re-exported into their original format with or without annotations.

Text encoding is based on Unicode to provide support for multilingual data processing, so that systems developed with GATE can be ported to new languages with no additional overhead apart from the development of the resources needed for the specific language. GATE includes a tokenizer and a sentence splitter. It incorporates a part-of-speech tagger and a gazetteer that includes lists of cities, organizations, days of the week, etc. It has a semantic tagger that applies hand-crafted rules written in a language in which patterns can be described and annotations created as a result. Patterns can be specified by giving a particular text string, or annotations that have previously been created by modules such as the tokenizer, gazetteer, or document format analysis. It also includes semantic modules that recognize relations between entities and detect co-reference. It contains tools for creating new language resources, and for evaluating the performance of text mining systems developed with GATE.

One application of GATE is a system for entity extraction of names that is capable of processing texts from widely different domains and genres. This has been used to perform recognition and tracking tasks of named, nominal and pronominal entities in several types of text.

Last but not least, there are a few popular techniques that will help you start mining text data effectively.
· Sentiment analysis: analyzing people's opinions and the tone of feedback, posts, articles about