Data Mining and Warehousing
Total Page:16
File Type:pdf, Size:1020Kb
Data Mining and Warehousing MCA Third Year Paper No. XI School of Distance Education Bharathiar University, Coimbatore - 641 046 Author: Jahir Uddin Copyright © 2009, Bharathiar University All Rights Reserved Produced and Printed by EXCEL BOOKS PRIVATE LIMITED A-45, Naraina, Phase-I, New Delhi-110 028 for SCHOOL OF DISTANCE EDUCATION Bharathiar University Coimbatore-641 046 CONTENTS Page No. UNIT I Lesson 1 Data Mining Concept 7 Lesson 2 Data Mining with Database 27 Lesson 3 Data Mining Techniques 40 UNIT II Lesson 4 Data Mining Classification 53 UNIT III Lesson 5 Cluster Analysis 83 Lesson 6 Association Rules 108 UNIT IV Lesson 7 Data Warehouse 119 Lesson 8 Data Warehouse and OLAP Technology 140 UNIT V Lesson 9 Developing Data Warehouse 167 Lesson 10 Application of Data Warehousing and Mining in Government 175 Model Question Paper 181 DATA MINING AND WAREHOUSING SYLLABUS UNIT I Basic data mining tasks: Data mining versus knowledge discovery in databases - Data mining issues - data mining metrices - social implications of data mining - data mining from a database perspective. Data mining techniques: Introduction - a statistical perspective on data mining - similarity measures - decision trees - neural networks - genetic algorithms. UNIT II Classification: Introduction - statistical - based algorithms - distance - based algorithms - decision tree- based algorithms- neural network - based algorithms - rule-based algorithms - combining techniques. UNIT III Clustering: Introduction - Similarity and distance Measures - Outliers - Hierarchical Algorithms - Partitional Algorithms. Association rules: Introduction-large item sets - basic algorithms - parallel & distributed algorithms - comparing approaches - incremental rules - advanced association rules techniques - measuring the quality of rules. UNIT IV Data warehousing: An introduction - characteristic of a data warehouse - data mats - other aspects of data mart. Online analytical processing: introduction - OLTP & OLAP systems - data modeling - star schema for multidimensional view - data modeling - multifact star schema or snow flake schema - OLAP TOOLS - state of the market - OLAP TOOLS and the internet. UNIT V Developing a data WAREHOUSE: Why and how to build a data warehouse architectural strategies and organization issues-design consideration- data content- metadata distribution of data - tools for data warehousing - performance consideration-crucial decision in designing a data warehouse. Applications of data warehousing and data mining in government: Introduction - National data warehouses- other areas for data warehousing and data mining. 5 Data Mining Concept UNIT I 6 Data Mining and Warehousing LESSON 7 Data Mining Concept 1 DATA MINING CONCEPT CONTENTS 1.0 Aims and Objectives 1.1 Introduction 1.2 Motivation for Data Mining: Why is it Important? 1.3 What is Data Mining? 1.3.1 Definition of Data Mining 1.4 Architecture of Data Mining 1.5 How Data Mining Works? 1.6 Data Mining – On What Kind of Data? 1.7 Data Mining Functionalities — What Kinds of Patterns can be Mined? 1.8 Classification of Data Mining Systems 1.9 Advantages of Data Mining 1.10 Disadvantages of Data Mining 1.11 Ethical Issues of Data Mining 1.11.1 Consumers’ Point of View 1.11.2 Organizations’ Point of View 1.11.3 Government’s Point of View 1.11.4 Society’s Point of View 1.12 Analysis of Ethical Issues 1.13 Global Issues of Data Mining 1.14 Let us Sum up 1.15 Lesson End Activity 1.16 Keywords 1.17 Questions for Discussion 1.18 Suggested Readings 1.0 AIMS AND OBJECTIVES After studying this lesson, you should be able to understand: The concept of data mining Basic knowledge how data mining works Concept of data mining architecture Ethical issues of data mining Concept of global issue in data mining 8 Data Mining and Warehousing 1.1 INTRODUCTION This lesson provides an introduction to the multidisciplinary field of data mining. It discusses the evolutionary path of database technology, which led up to the need for data mining, and the importance of its application potential. The basic architecture of data mining systems is described, and a brief introduction to the concepts of database systems and data warehouses is given. A detailed classification of data mining tasks is presented, based on the different kinds of knowledge to be mined. A classification of data mining systems is presented, and major challenges in the field are discussed. With the increased and widespread use of technologies, interest in data mining has increased rapidly. Companies are now utilized data mining techniques to exam their database looking for trends, relationships, and outcomes to enhance their overall operations and discover new patterns that may allow them to better serve their customers. Data mining provides numerous benefits to businesses, government, society as well as individual persons. However, like many technologies, there are negative things that caused by data mining such as invasion of privacy right. In addition, the ethical and global issues regarding the use of data mining will also be discussed. 1.2 MOTIVATION FOR DATA MINING: WHY IS IT IMPORTANT? In recent years data mining has attracted a great deal of attention in information industry due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis, to engineering design and science exploration. Data mining can be viewed as a result of the natural evolution of information technology. An evolutionary path has been witnessed in the database industry in the development of the following functionalities: Data collection and database creation, Data management (including data storage and retrieval, and database transaction processing), and Data analysis and understanding (involving data warehousing and data mining). For instance, the early development of data collection and database creation mechanisms served as a prerequisite for later development of effective mechanisms for data storage and retrieval, and query and transaction processing. With numerous database systems offering query and transaction processing as common practice, data analysis and understanding has naturally become the next target. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision-making, process control, information management, and query processing. Therefore, data mining is considered one of the most important frontiers in database and information systems and one of the most promising interdisciplinary developments in the information technology. 1.3 WHAT IS DATA MINING? In simple words, data mining refers to extracting or "mining" knowledge from large amounts of data. Some other terms like knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging are also used for data mining. Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Some people view data mining as simply an essential step in the process of knowledge 9 discovery. Knowledge discovery as a process and consists of an iterative sequence of Data Mining Concept the following steps: 1. Data cleaning (to remove noise and inconsistent data) 2. Data integration (where multiple data sources may be combined) 3. Data selection (where data relevant to the analysis task are retrieved from the database) 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance) 5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns) 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures) 7. Knowledge presentation (where visualisation and knowledge representation techniques are used to present the mined knowledge to the user). The first four steps are different forms of data preprocessing, which are used for data preparation for mining. After this the data-mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. 1.3.1 Definition of Data Mining Today, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term of knowledge discovery from data. Therefore in a broader view of data mining functionality data mining can be defined as “the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.” For many years, statistics have been used to analyze data in an effort to find correlations, patterns, and dependencies. However, with an increased in technology more and more data are available, which greatly exceed the human capacity to manually analyze them. Before the 1990’s, data collected by bankers, credit card companies, department stores and so on have little used. But in recent years, as computational power increases, the idea of data mining has emerged. Data mining is a term used to describe the “process of discovering patterns and trends in large data sets in order to find useful decision-making