UNSTRUCTURED DATA MINING AND ITS APPLICATIONS
¹Jagruti Jangal Wagh, ²Jidnyasa Dharmik Gondane, ³Ashvini Tulshiram Dukare

¹,²Student, Department of Computer Engineering, Cummins College of Engineering for Women Pune, Savitribai Phule Pune University
Email: [email protected], [email protected], [email protected]

INTERNATIONAL JOURNAL OF CURRENT ENGINEERING AND SCIENTIFIC RESEARCH (IJCESR), ISSN (PRINT): 2393-8374, (ONLINE): 2394-0697, VOLUME-3, ISSUE-3, 2016

Abstract
Information is generated in various forms. About 90% of the data in the world has been generated over the last two years. If this data is left unmanaged, it becomes overwhelming, making it difficult to retrieve information when it is needed. Structured data is information, usually text files, displayed in titled columns and rows that can be easily processed and ordered by data mining tools. Information systems may differ in application and form, but they serve a common purpose: to convert data into a useful form from which knowledge can be obtained. Data comprises basic, unfiltered and generally unrefined information. Information is more refined data that is useful for some form of analysis. Knowledge resides in the user and emerges only when human insight and experience are applied to information and data. Most of the data generated today, however, is unstructured. Unstructured data is data that has no identifiable internal structure. A leading industry analyst on the confluence of unstructured and structured data sources published an article stating that "80% of business-related information originates in unstructured form, basically text". There is therefore a great need to convert this unstructured data into a usable form. Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process, the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, statistics, machine learning and databases. The basic goal of a data mining process is to extract information from a large dataset and transform it into an understandable form for future use. This paper highlights unstructured data mining and its applications.

Index Terms: Data Mining, Information, Structured data, Unstructured data.

I. INTRODUCTION
Every day, data is generated and collected in huge amounts, but it often remains unutilized, with no useful information or meaningful insights drawn from it. These insights are vital in strategic and operational decision-making processes such as marketing, customer engagement and branding. Data generated by channels such as marketing, distribution, customer engagement, social channels and web content comes in different forms, structured as well as unstructured, and is spread across multiple systems. Unstructured data needs to be converted into a more useful form. This requires approaches that convert free-form, unstructured data into structured or semi-structured data and analyze it to obtain meaningful insights that address business problems and support business decision making. Depending on the type of input data, reports can be generated in the form of charts, bar graphs, etc. Business Intelligence dashboards can also be tried as an effective visualization mechanism.

Structured data is data in an organized form, that is, in rows and columns, which can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Unstructured data is data that does not conform to a data model or is not in a form that a computer can use easily. Examples include memos, chat rooms, videos, images, research documents and the bodies of emails. Gartner estimates that almost 80% of the data generated in any enterprise today is unstructured. Roughly 10% of data falls in the structured and semi-structured category [1].

II. DATA MINING
Data mining can be viewed as a result of the natural evolution of information technology. The database and data management industry evolved through the development of several critical functionalities, such as data collection and database creation, and data management, including the storage and retrieval of data. Data mining is the analysis step of the KDD ("Knowledge Discovery in Databases") process. An interdisciplinary subfield of computer science, it is the process of discovering patterns in large datasets.

The steps involved in the knowledge discovery process are as follows:
  • Data Cleaning: inconsistent data and noise are removed.
  • Data Integration: multiple data sources are combined.
  • Data Selection: the data relevant to the analysis task is retrieved from the database.
  • Data Transformation: the data is transformed or consolidated into forms appropriate for mining by performing aggregation operations.
  • Data Mining: intelligent methods are applied to extract data patterns.
  • Pattern Evaluation: the extracted data patterns are evaluated.
  • Knowledge Presentation: the mined knowledge is presented to the user [1].

Fig. 1: Data mining as a step in the process of knowledge discovery [Han, Kamber & Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann]

The goal of the data mining process is to extract information from a data set and transform it into an understandable form for future use.

III. STRUCTURED VS UNSTRUCTURED DATA MINING
Structured data is data that can be organized easily. Despite its simplicity, many experts in the data industry today estimate that structured data accounts for only 20% of the available data. It is usually stored in databases and is clean and readily analyzable. Some examples of structured data are:

Machine Generated:
  • Sensor Data: manufacturing sensors, medical devices and GPS data
  • Point-of-Sale Data: credit card information, product information, etc.
  • Call Detail Records: caller and recipient information, time of call
  • Web Server Logs: page requests, server activities

Human Generated:
  • Input Data: data given as input to a computer, such as age, gender, code, etc.

Structured data has always played, and will continue to play, a crucial role in data analytics. It functions as the backbone of critical business insights. Without structured data, it would be difficult to know where to find the insights hidden in unstructured data sets.

Unstructured data is data that is not organized in a predefined manner and has no predefined data model. Social media plays a very important role in unstructured data: according to one study, 73% of online adults use social networking sites. One of the many ways in which businesses use this data is to gather and analyse brand sentiment. In addition to social media, many other forms of unstructured data are common:
  • Word Docs, Text Files and PDFs: books, letters, audio and video files, other written documents
  • Audio Files: customer service recordings, voicemails, phone calls
  • Presentations: SlideShares, PowerPoints
  • Videos: YouTube uploads, personal videos
  • Images: pictures, memos
  • Messaging: text or instant messages

In all these examples, the data can provide compelling and useful insights. With the right tools, unstructured data can add a depth to data analysis that could not otherwise be achieved easily. Unstructured data enhances the ability of any business to derive greater insights from its datasets. It is the most important piece of the data pie for any business, and widely accessible tools can help businesses use this data to its full potential.

Text mining is the process of deriving high-quality information from text. High-quality information is typically derived by identifying trends and patterns through means such as statistical pattern learning. In effect, text mining is the use of data mining techniques to discover useful patterns in text. The primary difference is that, unlike in data mining, the data used in text mining is unstructured.

IV. CHALLENGES IN UNSTRUCTURED DATA MINING
The challenges of unstructured data run the gamut from gathering it, to storing it, to using it to make decisions:
  • Usability: For unstructured data to be usable, businesses have to come up with a way to locate, extract, organize and store the data.
  • Volume: Data size is growing exponentially, which creates new challenges of scale. The volume of unstructured data is growing at a rate of approximately 60% per year. For many businesses, that is more than they can keep up with, which presents challenges for both using and securing the data.
  • Relevance: One way in which relevance comes into play is the lack of insight into the history of certain pieces of data.
  • Heterogeneity: The difficulties of big data analysis stem from its large scale and from the presence of mixed data based on different rules or patterns in the collected and stored data. Complicated heterogeneous mixture data has several patterns and rules, and the properties of these patterns may vary greatly.
  • Incompleteness: Incomplete data, in which field values are missing for some samples, creates uncertainty and must be managed during analysis.
  • Quality: By nature, a large volume of unstructured data is unverified. This presents serious challenges for consumers and enterprises. On a consumer level, people could be negatively impacted by companies that make decisions based on unstructured data. On an enterprise level, making business decisions based on inaccurate data could be extremely costly.

Requirements for real-time data analysis have become predominant, for example in weather prediction, stock trading and time series analysis.
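To make the knowledge discovery steps of Section II concrete, the following is a minimal Python sketch that applies them to a handful of free-text records, which also illustrates text mining in the sense of Section III. It is an illustration under stated assumptions, not the method of the paper: the sample records, the stopword list, the function names and the choice of "terms that recur across records" as the mined pattern are all hypothetical, and a real system would typically rely on dedicated text-mining tools.

```python
import re
from collections import Counter

# Hypothetical raw inputs from two sources (illustrative data, not from the paper).
SUPPORT_EMAILS = ["Delivery was LATE again!!", "  ", "Great product, fast delivery"]
CHAT_LOGS = ["delivery late, very unhappy", "Great product, fast delivery", "refund please"]

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"was", "very", "again", "please"}


def clean(records):
    """Data cleaning: normalise whitespace and case, drop empty records."""
    cleaned = (re.sub(r"\s+", " ", r).strip().lower() for r in records)
    return [r for r in cleaned if r]


def integrate(*sources):
    """Data integration: merge sources, dropping exact duplicates, keeping order."""
    return list(dict.fromkeys(r for source in sources for r in source))


def select(records, keyword):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if keyword in r]


def transform(records):
    """Data transformation: turn free text into token lists (a structured form)."""
    return [[t for t in re.findall(r"[a-z]+", r) if t not in STOPWORDS]
            for r in records]


def mine(token_lists):
    """Data mining: count in how many records each term appears."""
    return Counter(t for tokens in token_lists for t in set(tokens))


def evaluate(term_counts, min_support):
    """Pattern evaluation: keep terms found in at least min_support records."""
    return {t: c for t, c in term_counts.items() if c >= min_support}


def present(patterns):
    """Knowledge presentation: format patterns as report lines, most frequent first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{term}: {count}" for term, count in ordered]


def run_pipeline(keyword="delivery", min_support=2):
    """Chain the seven steps in the order listed in Section II."""
    records = integrate(clean(SUPPORT_EMAILS), clean(CHAT_LOGS))
    tokens = transform(select(records, keyword))
    return present(evaluate(mine(tokens), min_support))


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

With the sample records above, the pipeline reports that "delivery" occurs in three of the selected records and "late" in two. Each step is kept as a separate function so that it maps one-to-one onto the steps listed in Section II; in practice the steps are often iterated or merged.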
